1) Sentence Parsing
The first step in any MT application should be to break the text down into sentences and then into unit lexemes, that is, individual words. ESI's sentence parsing is largely conventional, except for its ability to recognize true carriage returns and linefeeds even when they are mixed in with the artificially inserted ones typical of emails and HTML text formats. Moreover, it will not interpret as 'end of sentence' the periods that appear in numbers, brand names, or Internet addresses.
Another unique feature is ESI's ability to parse secondary sentences within the main sentence, when they are enclosed in parentheses or within hyphens. A sentence in parentheses is considered to be a comment and translated separately, without affecting the main structure in any way.
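To make the idea concrete, here is a minimal sketch in Python. It is not ESI's actual algorithm. It assumes that a single newline is an artificial line break, that a blank line is a true one, and that a sentence ends only where a period, exclamation mark or question mark is followed by whitespace and a capital letter. The names split_sentences, SENTENCE_END and PARENTHETICAL are hypothetical.

```python
import re

# Hypothetical sketch, not ESI's code. A period inside a number (4.20) or an
# Internet address (www.example.com) is not followed by whitespace, so the
# pattern below never treats it as the end of a sentence.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")
PARENTHETICAL = re.compile(r"\s*\(([^()]*)\)")


def split_sentences(text):
    """Return (main_sentences, parenthetical_comments) for a block of text."""
    # Assumption: a lone newline is an artificial break; join it with a space.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Parenthetical sentences are set aside to be handled separately.
    comments = PARENTHETICAL.findall(text)
    main = PARENTHETICAL.sub("", text)
    sentences = [s.strip() for s in SENTENCE_END.split(main) if s.strip()]
    return sentences, comments
```

A heuristic this simple would still fail on abbreviations and on brand names that end in a period, so it only illustrates the principle.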
2) Style and Format Parsing
Although this problem has not been tackled by any other MT system that we know of, it is extremely important in obtaining a correct translation.
The grammatical rules that apply to a normal predicative sentence are not the same ones that apply, for instance, to a title, a heading, a list of terms, a glossary, a table, a display, an ad, an order or a message. So, it is important to tell the grammatical parser (stage 6) which kind of text is being fed in, before the parser starts to apply incorrect rules. Otherwise, results are guaranteed to be disastrous.
This is also the case with clauses (a, b, c, d, etc.) typical in legal documents. Each clause in itself may or may not be a complete sentence. It could also be part of the same predicate of the first clause. It is a difficult task to determine which rules to apply in each clause.
ESI is just starting to implement rules to deal with these complications, and we expect to have them in operation in version 5.0, by Fall 2003. In the meantime, the User can work around the problem by using the Interactive Mode under Options/Style.
Also, as a first approach to this powerful mechanism, ESI is implementing a full-blown word processor compatible with MS Word XP, which will automatically determine and reproduce format and style. This new word processor will come with version 4.20, slated to be formally released in March or early April 2003.
Besides sentence format, ESI also takes into account the format of each word. Capitalized letters are a constant source of problems, sometimes coming with first-letter uppercase and sometimes all uppercase.
In the first case, the word can be confused with a proper name. Take Ford, Smith, Carpenter or Book, for instance. There are thousands of those possibilities. Or take IBM, model AK-908, 128 M RAM, Dell, Logos, Super Translator, IN, Ca, CA, FLA, Fla. These are not included in dictionaries. Are they to be taken as first names, last names, product names, brands, company names, country names, zip codes, email addresses, abbreviations, or simply as words which the writer chose to capitalize in order to highlight what he wanted to say?
IN, for instance, could be a capitalized preposition or the abbreviation of Indiana (included in our dictionaries). AND could be the abbreviation of a substance, a Boolean (a noun) or a capitalized conjunction.
ESI has intelligent mechanisms for deciphering such ambiguities, even though much is still pending in this area.
Another type of formatting found in most texts is the quotation mark. It is almost impossible for MT to determine what exactly goes inside two sets of quotation marks. It could be an actual direct quote, a noun, a whole phrase or simply a device used to highlight a particular word or expression, or part of the main sentence.
ESI is capable of distinguishing and correctly translating:
a. The word case,
b. The word "case"
c. The "word case"
3) Spell-checker and ambiguity-checker revision module
Spell-checking is necessarily a vital step in any translation process. ESI has two very powerful dictionaries for this purpose containing several million words and conjugated words in each language.
The Spanish spell checker includes all the possible combinations of enclitic verb forms. Version 5.0 will also include all diminutives, augmentatives and superlatives in plural, singular, and their feminine and masculine forms, whenever applicable.
The ambiguity checker is unique to ESI. It highlights all grammatically ambiguous words in the text for the User; that is, those which could be written either with or without an accent, meaning totally different things in each case.
In Version 5.0 this unique feature will be developed to a much more intelligent and user-friendly level, highlighting those words only where incorrect spelling is suspected.
Moreover, it will point out in English many common errors, such as using it's when its should have been used, or the incorrect placing of a hyphen where it could be misinterpreted as a phrase within hyphens instead of a hyphenated cluster of two or more words. The translation of those two cases would produce radically different results. ESI is capable of treating both cases correctly, provided the hyphen is properly placed.
No other MT system that we know of is currently capable of correctly translating hyphenated clusters.
4) Idiomatic-Construction Parsing
We make a clear distinction between idiomatic constructions and idiomatic expressions.
An idiomatic expression is a word cluster that can be entered as an entry in a dictionary. The number of idiomatic expressions is considered limited, albeit very high and close to half a million in English alone.
Idiomatic constructions, however, are in theory unlimited, since they are not set phrases but non-orthodox syntactical patterns used to construct phrases, admitting permutations of words. An idiomatic expression ceases to be one if we change one of its constituent words. For instance "Let's call it a day" ceases to be an idiomatic expression if we say instead "Let's call it a microsecond".
In other words, in the idiomatic expression, it is the specific words that form the set phrase. In idiomatic constructions, it is the particular pattern of clustering words that becomes idiomatic. Since the words themselves can change, the permutations can explode in number, and the only way to handle such exceptions is through an 'idiomatic parser'.
One way of creating idiomatic constructions is through the use of the hyphen. In the hyphenated cluster, one word becomes the operator of the other word, and the whole cluster takes on a part of speech and a meaning different from the orthodox free association.
Take for instance: weak-sighted. We can change this cluster to strong-sighted, or far-sighted, near-sighted, feeble-minded, strong-chested, weak-chested, broad-shouldered, etc. We see that there is only one pattern from which many combinations can be formed.
Since no dictionary could possibly include all possible combinations, a system has to be devised to interpret these kinds of patterns.
We do not know yet how many patterns there are, even in the case of hyphens. And hyphenation is not the only instance in idiomatic construction. There are hundreds of additional patterns.
Here are a few examples of hyphenation:
strong-chested: adj + noun + ed => that has a strong chest, adj
feeble-minded: adj + noun + ed => that has a feeble mind, adj
feeble-mindedness: adj + noun + ed + ness => weakness of mind, noun
US-born: noun + participle => born in the US, adj
easy-to-use: adj + infinitive => that is easy to use, adj
chocolate-eating: noun + present participle => that eats chocolate, adj
science-fiction: noun + noun => noun
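A pattern of this kind can be sketched as follows. This is only an illustration of the "adj + noun + ed (+ ness)" pattern, not ESI's implementation. The mini-lexicon and the function name analyze_cluster are hypothetical, and the gloss produced for the "-ness" form is a generic paraphrase, not a polished definition.

```python
# Hypothetical mini-lexicon; a real system would consult its full dictionary.
LEXICON = {
    "strong": "adj", "feeble": "adj", "weak": "adj", "broad": "adj",
    "chest": "noun", "mind": "noun", "shoulder": "noun", "sight": "noun",
}


def analyze_cluster(cluster):
    """Return (part_of_speech, gloss) for 'adj-noun+ed(+ness)', else None."""
    parts = cluster.lower().split("-")
    if len(parts) != 2:
        return None
    first, second = parts
    # Try the longer suffix first so "mindedness" is not read as "...ed".
    for suffix, result_pos, template in (
        ("edness", "noun", "the quality of having a {a} {n}"),
        ("ed", "adj", "that has a {a} {n}"),
    ):
        if second.endswith(suffix):
            stem = second[: -len(suffix)]
            if LEXICON.get(first) == "adj" and LEXICON.get(stem) == "noun":
                return result_pos, template.format(a=first, n=stem)
    return None
```

One rule covers every adjective and noun in the lexicon, which is the point of treating the cluster as a pattern and not as a dictionary entry.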
ESI has implemented a few cases of hyphenation, but not all, of course. This is another pending task of investigation. All other MT systems that we have tested failed in this task.
Another way of using hyphens is in the association of adjectives. Consider, for instance: "A red-plastic toy", and "A red plastic toy." ESI will correctly interpret and translate the first phrase as "A toy made of red plastic", whereas the second one will be "A plastic toy having red color."
Yet another use of hyphens is in verbs. Machine-dry, steam-clean, spray-paint, jump-start, etc. are all examples of hyphenated verbs. ESI deals with these cases through inclusion in the dictionary rather than treating them as special patterns.
Here is a sample sentence translated by ESI:
He dog-eared the pages using a cutting-edge technology.
ESI: Él dobló la esquina de las páginas usando una tecnología de punta.
Others: Él perro-espigado las paginaciones usando una tecnología del corte-borde.
Idiomatic constructions are also formed in English without the use of hyphens. Take, for instance, the normal, 'orthodox' sequence of adjectives in a noun phrase: aAA,A,…,AS, or aAAS, AAAS (where a = article, A = adjective, S = noun, D = adverb, V = verb).
In other words, in English there are normally one or two series of A before any S, separated or not by commas.
However, a particular idiomatic construction allows: aAASAS, or aASAS, or aAASS. Example:
A five year old boy. A twenty five foot long boat.
Translation by Word Magic: Un niño de cinco años de edad. Un bote de veinticinco pies de largo.
Translation by a typical MT system: Un viejo muchacho de cinco años. Los veinte cinco pies desean barco.
Notice that the above examples require the use of the hyphen. However, people leave hyphens and commas out in most cases nowadays, even though it is grammatically incorrect. A practical MT system has to be able to recognize such deviations from orthodox grammar, interpret them, and possibly point them out to the writer for correction.
Now let's look at another example, this time in an exclamational phrase:
What a nice car! (DaAS),
which seems "normal", but all existing programs on the market translate this as:
Qué un carro bonito! (DaSA)
whereas it should be translated as: ¡Qué carro tan bonito! (DSDA)
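The reordering from DaAS to DSDA can be expressed as a rewrite rule over the tag pattern. The sketch below is a hypothetical illustration, not ESI's code. It uses the tags defined above (D, a, A, S), a four-word glossary invented for the example, and ignores gender and number agreement.

```python
# Hypothetical glossary for this one example only.
GLOSSARY = {"what": "qué", "a": "un", "nice": "bonito", "car": "carro"}


def translate_exclamation(tagged):
    """tagged: list of (english_word, tag) pairs, e.g. [("What", "D"), ...]."""
    pattern = "".join(tag for _, tag in tagged)
    words = [GLOSSARY[word.lower()] for word, _ in tagged]
    if pattern == "DaAS" and tagged[0][0].lower() == "what":
        what, _article, adjective, noun = words
        # DaAS -> DSDA: drop the article, put the noun first and
        # intensify the adjective with the adverb "tan".
        ordered = [what, noun, "tan", adjective]
    else:
        ordered = words  # no idiomatic construction recognized: word for word
    sentence = " ".join(ordered)
    return "¡" + sentence[0].upper() + sentence[1:] + "!"
```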
There are hundreds of idiomatic constructions like these in Spanish, as well. ESI does not pretend to have identified them all, but it has actually identified a great number of them, and part of our goal is to discover, identify and solve more and more cases with every new ESI version.
Yet another type of Idiomatic Construction is one that comes from leaving out essential parts of the sentence. The example we presented before is a typical one: Leaving out the pronoun I from the sentence: Hope she is well. Hope you come. Wish to see you soon.
Here is another case where not one but several words are left out:
Mary will be five tomorrow.
Which should be interpreted as: Mary will be five years old tomorrow.
As expected, this, as well as hundreds of other idiomatic constructions, is not handled by any other MT system. ESI, in most cases (but not all), interprets them correctly.
5) Idiomatic Expressions
There are thousands of idiomatic phrases in English, and the only way to account for them is by their inclusion into the database, since they do not follow any generalized pattern or rule. The same is true of Idiomatic Constructions.
An Idiomatic Expression is like a chemical compound: The final aggregate does not share the properties of any of its constituents but rather has its own unique properties.
After checking for Idiomatic Constructions and hyphenated clusters, ESI goes on to the next stage and identifies all possible Idiomatic Expressions contained in the sentence. This is done in close consultation with the Dictionary, which contains approximately 125,000 expressions as entries, with their corresponding equivalencies in the other language, for a total of 250,000 entries. This is by far the largest collection of idiomatic expressions and clusters in computer-dictionary form existing to date.
However, we must note that idiomatic expressions or set-phrases cannot and should not be taken at face value whenever they appear in the text and the dictionary. A Parsing System is necessary. Word clusters have a grammar of their own, and special cluster-grammar rules have to be implemented in order to make a sensible interpretation of either English or Spanish texts.
Consider the set-phrase:
Enough of it! = ¡Basta ya! (Enough is enough!)
If we say: "We had enough of it", the idiomatic sense outlined above is totally out of context and the dictionary entry should be rejected, making instead a literal translation. Otherwise it would be translated into Spanish as: We had enough is enough! ESI is capable of making such distinctions and the translation "basta ya" is not picked up by the Idiomatic Parser.
Or, consider these two other examples involving nouns:
a) The system controls = Los mandos del sistema (the control panel)
b) The system controls our economy= El sistema controla nuestra economía
In this case, "system controls" exists as an entry in ESI's dictionary, but it had to be rejected from the selection because it is out of context.
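The principle of accepting a dictionary entry only when its context allows it can be sketched as follows. This is an illustration under strong simplifying assumptions, not ESI's cluster grammar: each hypothetical entry carries one hand-written context test, and the tests look only at token positions.

```python
# Hypothetical context tests; real cluster-grammar rules would be far richer.
def is_whole_utterance(tokens, start, end):
    return start == 0 and end == len(tokens)


def ends_the_sentence(tokens, start, end):
    # If more words follow, "controls" is probably acting as a verb.
    return end == len(tokens)


IDIOMS = {
    ("enough", "of", "it"): ("¡Basta ya!", is_whole_utterance),
    ("system", "controls"): ("mandos del sistema", ends_the_sentence),
}


def find_idioms(sentence):
    """Return the dictionary idioms that match AND pass their context test."""
    tokens = sentence.lower().strip(".!?").split()
    accepted = []
    for phrase, (translation, context_ok) in IDIOMS.items():
        n = len(phrase)
        for start in range(len(tokens) - n + 1):
            matches = tuple(tokens[start:start + n]) == phrase
            if matches and context_ok(tokens, start, start + n):
                accepted.append((" ".join(phrase), translation))
    return accepted
```

With these two entries, "Enough of it!" and "The system controls" are accepted, while "We had enough of it" and "The system controls our economy" fall through to literal translation.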
Perhaps these are rare instances of word clusters being rejected. However, with adverbs, verbs and adjectives, out-of-context idiomatic expressions are fairly common. These exceptions will multiply as ESI's database adds more and more idiomatic entries.
With verbs, the selection of idiomatic expressions taken from existing entries in the dictionary is a very large and complicated process. In fact, it is the single largest section in the whole program, and it includes hundreds of different rules which take into account transitivity, reflexiveness, semantic attributes and the verb's virtual environment.
For instance, if you enter "I will ask Jane out", ESI will translate it as "Invitaré a salir a Jane" (I will invite Jane to go out).
However, if you enter "I will ask the car out", ESI will not accept the dictionary entry "ask out". It will render a literal translation which is meaningless both in Spanish and in English.
We have tested these sentences with other MT systems, and they were not able to recognize the idiomatic expression in either case, rendering: "Pediré Jane hacia fuera" (I will ask Jane towards the outside).
Here is a trickier verb, "turn on":
She turned the switch on = Ella encendió el interruptor.
She turned him on = Ella le excitó.
She turned on him = Ella se volvió contra él
She turned on the corner = Ella cambió de dirección {dobló} en la esquina.
See below a comparative translation made by the popular Systran program:
Ella giró el interruptor
Ella lo giró
Ella lo giró
Ella giró la esquina.
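The four ESI renditions above depend on where the particle sits and on what kind of object accompanies it. A toy decision procedure for just these four sentences might look like the sketch below. The word lists, the function name and the default sense are all assumptions made for illustration; ESI's real rules, which also weigh transitivity and other attributes, are not reproduced here.

```python
# Hypothetical word classes standing in for semantic attributes.
ANIMATE = {"him", "her", "me", "us", "them", "you"}
PLACES = {"corner", "street", "road"}


def sense_of_turn_on(tokens):
    """tokens: lowercase words after the subject, e.g. ['turned', 'him', 'on']."""
    if not tokens or tokens[0] != "turned" or "on" not in tokens:
        return None
    on_index = tokens.index("on")
    if on_index == 1:
        # Particle directly after the verb: "turned on X".
        complement = tokens[2:]
        if complement and complement[-1] in ANIMATE:
            return "volverse contra"
        if complement and complement[-1] in PLACES:
            return "doblar en"
        return "encender"  # assumed default, e.g. "turned on the light"
    # Particle after the object: "turned X on".
    obj = tokens[1:on_index]
    if obj and obj[-1] in ANIMATE:
        return "excitar"
    return "encender"
```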
Or, consider the following idioms:
I will take my son to his new room = Llevaré a mi hijo a su cuarto nuevo.
I will get my son to do his homework = Persuadiré a mi hijo a hacer su tarea.
She came to my house = Ella vino a mi casa.
She came to last night = Ella recobró el conocimiento anoche.
She came to work = Ella llegó a trabajar.
The smith worked the metal piece in very skillfully = El herrero insertó el pedazo de metal muy diestramente.
The smith worked in our car = El herrero trabajó en nuestro coche.
As we see through these examples, the selection of phrasal verbs is not as simple as reading them in the text and comparing them to their match in the database. There are numerous rules involved, all of which had to be developed by Word Magic. We cannot discuss all the possibilities that arise when trying to fit a phrasal verb into a given text, but one last example should suffice to point up the complexity involved.
Let's consider the verb have to, which is a periphrastic case that gives the phrasal compound the connotation of an obligation to do something, as in "I have to go to school."
In the sentence "I have to go to the party", the cluster is first recognized as existing as an entry in ESI's dictionary and then, after going through the Idiomatic Parser, it is finally accepted with its corresponding translation "tener que", and the final translation would be:
Tengo que ir a la fiesta.
However, if ESI encounters exactly the same cluster have to in this other sentence:
This is the only dress I have to go to the party.
it will not interpret "have to" as a word cluster, even though it exists as an entry in the dictionary, but instead it will split the sentence right in the middle of have to:
[This is the only dress that I have] + [to go to the party]
And its translation:
Este es el único vestido que yo tengo | para ir a la fiesta
Notice that the relative pronoun "que", which was omitted in English, has been added to the Spanish rendition.
6) Grammatical Parsing
In this area, ESI is similar to other MT systems in that it must have a way of parsing the syntax of the input text based on fixed, orthodox grammatical rules.
Therefore, we will not delve into the typical grammatical constructions, but rather highlight those which we know are not handled at all by any other translator available on the market.
In particular, ESI is the only system capable of detecting the elision of the relative pronoun that, which usually precedes a subordinate or relative phrase, and it accomplishes this on the basis of grammatical criteria alone.
The following example illustrates what we mean:
The Internet is a technology people use around the world.
There is a silent relative pronoun somewhere, hidden in the syntactic construction of that sentence.
If we try Systran or any other program available on the Internet, the parsing will be: SVaSSSraS, that is:
Internet is a usage of technology-people around the world,
Translated into Spanish as
El Internet es un uso de la gente de la tecnología alrededor del mundo.
However, ESI correctly interprets this construction as: SVaS + that + SVraS, or
The Internet is a technology that people are using around the world,
translated as:
Internet es una tecnología que las personas usan alrededor del mundo.
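A purely grammatical test for the elided pronoun can be sketched as a rule over the tag sequence. The sketch below is hypothetical and far simpler than a real parser. It assumes that the part-of-speech tags have already been resolved (so "use" is known to be a verb), and that the tag r, which the notation above uses without defining it, stands for a preposition.

```python
def restore_relative_pronoun(tagged):
    """tagged: list of (word, tag) with a=article, S=noun, V=verb, r=preposition.

    Once a main verb has been seen, a noun followed directly by another
    noun and a verb (S S V) signals an elided "that" between the two nouns.
    """
    seen_main_verb = False
    out = []
    for i, (word, tag) in enumerate(tagged):
        out.append(word)
        if tag == "V":
            seen_main_verb = True
        next_tags = [t for _, t in tagged[i + 1:i + 3]]
        if seen_main_verb and tag == "S" and next_tags == ["S", "V"]:
            out.append("that")
    return " ".join(out)


example = [
    ("The", "a"), ("Internet", "S"), ("is", "V"), ("a", "a"),
    ("technology", "S"), ("people", "S"), ("use", "V"),
    ("around", "r"), ("the", "a"), ("world", "S"),
]
```

Calling restore_relative_pronoun(example) yields "The Internet is a technology that people use around the world". The hard part in practice is the tagging itself, since "use" can be read as a noun, which is exactly the faulty parse shown above.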
Other constructions which are generally handled incorrectly by most other applications are long strings of nouns, proper names, numbers, and certain special constructions. We invite the reader to test and compare these cases through the online applications at our web site, http://www.wordmagicsoft.com.
There is one final point that we must stress with respect to ESI's interpretive power. The above example should serve to illustrate that adding just one more degree of freedom to a syntactical interpretation, in order to be able to handle elided relative pronouns linking relative clauses, could double, or perhaps more than double, the possibilities of a faulty interpretation. Permutations increase exponentially with the number of degrees of freedom in any closed system. And ESI, as we have seen, has not one more but many more degrees of freedom in its code, to be able to analyze, interpret and accept not only diverse grammatical structures, but also idiomatic expressions, hyphenation, parentheses and idiomatic constructions.
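The arithmetic behind this claim is simple to illustrate. If we assume, purely for the sake of the example, that a sentence presents a handful of independent decision points, the number of candidate interpretations is the product of the options at each point, so one extra yes/no decision doubles the total. The figures below are invented.

```python
from math import prod


def count_readings(options_per_decision):
    """Candidate interpretations when the decisions are independent."""
    return prod(options_per_decision)


# Hypothetical sentence with three decision points (3, 2 and 4 options).
base = count_readings([3, 2, 4])
# Same sentence once "is there an elided relative pronoun?" is also asked.
with_elision = count_readings([3, 2, 4, 2])
```

Here base is 24 and with_elision is 48; every further binary decision doubles the count again.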
Additionally, consider one other factor: ESI's dictionaries are much larger than any other digitized translation dictionary available. Therefore, the possible number of permutations is tremendously larger too, as we saw at the beginning of this article. Dealing with 70,000 words is not nearly as complex as dealing with 250,000 words and 850,000 translation references in all, as ESI does. The possibilities are endless, but then, the possibilities of problems and errors are endless too.
Perhaps a comparison with real-life systems will shed light on this situation. ESI is like a jetliner flying at high speed with hundreds of people on board, whereas other primitive MT systems are more like hand-driven, four-wheeled wagons riding on rails.
The wagon has practically no degrees of freedom: It can only move straight ahead. It is restricted vertically by the ground, and horizontally by fixed rails.
The airplane can move up and down, laterally, nose-up, nose-down, sway, tilt over to the right or the left, glide, or even plummet down out of control. It can encounter strong winds in its flight, snow, sleet, rain and storms. It also has to be able to glissade through gentle breezes and then land softly on the ground.
No doubt, to construct such a machine and to succeed in its stabilization and maneuverability is much tougher than doing the same with the railroad wagon. Also, it would take much more time to achieve the desired stability, after traversing a longer road of experimentation through trial and error.
7) Semantical Disambiguation
The theory and practice we have implemented in ESI, particularly the practice, deviate from what Noam Chomsky constantly teaches: that the syntax of a sentence should be determined independently from its semantics. "Syntax was regarded as the heart of linguistics and (Chomsky's) project was supposed to transform linguistics into a rigorous science." (John R. Searle, "End of the Revolution")
We have found through thousands of instances taken from real-life texts that syntax is totally dependent on semantics, and ESI's grammatical parser is equipped with a corresponding number of rules and exceptions to deal with this reality. This dependency is even more dramatic in Spanish. Unfortunately, space does not permit us to explore this fascinating new field in depth in this article. We believe it is being researched and developed for the first time ever by Word Magic Software, with plenty of success.
We will ponder just a few examples. Look at the following two sentences:
a. My daughter is to marry tomorrow = Mi hija ha de casarse mañana.
b. My idea is to marry tomorrow = Mi idea es casarme mañana
ESI's rendition in each case is radically different from the other, as you can see at the right. The typical translations found in other MT applications, on the other hand, make little or no sense because, as far as we know, none of them takes semantics into consideration.
The trick here is to recognize that an idea cannot get married (case b), and therefore the action is passed on to whoever is writing the sentence, whereas in case a. the daughter is the one that explicitly is to get married.
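Reduced to its bare bones, the decision turns on a single semantic attribute of the subject noun. The sketch below hard-codes the two example sentences and a hypothetical list of nouns that can marry. It is only meant to show where the semantic test enters, not how ESI stores or applies such attributes.

```python
# Hypothetical semantic attribute: nouns whose referents can get married.
CAN_MARRY = {"daughter", "son", "sister", "brother"}
SPANISH_SUBJECT = {"daughter": "Mi hija", "idea": "Mi idea"}


def translate_is_to_marry(noun):
    """Translate 'My NOUN is to marry tomorrow' for the two examples above."""
    subject = SPANISH_SUBJECT[noun]
    if noun in CAN_MARRY:
        # The subject itself marries: obligation reading, reflexive verb.
        return f"{subject} ha de casarse mañana."
    # The subject cannot marry: the action passes to the writer ("casarme").
    return f"{subject} es casarme mañana."
```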
Here is another example where semantics plays a key role. Consider the next two sentences:
a. Wine from Europe and cheese in general will be affected by the tax.
b. Wine from Europe and Asia in general will be affected by the tax.
Here are ESI's translations:
a. El vino de Europa y el queso en general serán afectados por el impuesto.
b. El vino de Europa y Asia en general será afectado por el impuesto.
Notice that ESI interprets the subject of sentence a. as a noun phrase composed of two elements (wine and cheese), and thus uses the plural form of the verb and its predicate, whereas in sentence b. ESI correctly recognizes one element only and thus uses the singular verb.
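One way to picture this decision is to compare the semantic class of the word after "and" with the classes of the two nouns it could attach to. The classes and the function name below are hypothetical, and a real parser would weigh many more factors.

```python
# Hypothetical semantic classes for the four nouns in the examples.
SEMANTIC_CLASS = {
    "wine": "food", "cheese": "food",
    "europe": "place", "asia": "place",
}


def subject_number(head, modifier, conjunct):
    """For 'HEAD from MODIFIER and CONJUNCT', decide the number of the subject."""
    cls = SEMANTIC_CLASS
    if cls[conjunct] == cls[modifier]:
        return "singular"  # conjunct joins the modifier: still one head noun
    if cls[conjunct] == cls[head]:
        return "plural"    # conjunct joins the head: two coordinated subjects
    return "ambiguous"
```

For "wine from Europe and cheese" this returns "plural" (serán afectados), and for "wine from Europe and Asia" it returns "singular" (será afectado).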
Disambiguation properly occurs when ESI chooses among several connotations of a noun, an adjective or a verb based entirely on their mutual semantic correspondence. In Chapter I we presented an example of this process:
I am teaching this bat to fly,
where ESI correctly selected the meaning 'murciélago' (animal) for the word bat.
The process can operate the other way around too: Consider the following sentence:
This bat is used to play baseball = Este bate se usa para jugar béisbol.
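The two bat examples can be mimicked with a very small voting scheme in which context words vote for a semantic feature and the sense carrying the winning feature is chosen. The sense list, the cue words and the function name are assumptions made for this sketch, and ESI's actual semantic correspondence rules are certainly more elaborate.

```python
# Hypothetical sense inventory and context cues.
SENSES = {"bat": [("murciélago", "animal"), ("bate", "sports equipment")]}
CUES = {
    "fly": "animal", "teaching": "animal",
    "baseball": "sports equipment", "play": "sports equipment",
}


def disambiguate(word, sentence):
    """Pick the translation whose semantic feature gets the most context votes."""
    votes = {}
    for token in sentence.lower().replace(".", "").replace(",", "").split():
        feature = CUES.get(token)
        if feature:
            votes[feature] = votes.get(feature, 0) + 1
    if not votes:
        return SENSES[word][0][0]  # no evidence: fall back to the first sense
    best = max(votes, key=votes.get)
    for translation, feature in SENSES[word]:
        if feature == best:
            return translation
    return SENSES[word][0][0]
```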