11/28/16 Machine Translation • Machine Translation (MT) systems automatically translate one natural language (the source language) to another (the target language). • High-quality MT could greatly enhance global communication. Early Difficulties for MT Ich muss nach Hause gehen I must go home hydraulic ram water sheep • But … MT is challenging because it requires full understanding as well as generation. out of sight, out of mind blind, crazy • Early MT systems ran into such difficulties that MT research stopped for many years! But MT has seen a resurgence. The spirit is willing but the flesh is weak The vodka is good but the meat is rotten Famous Human Translation Fails Fun with NLP web page Pepsi slogan “Come alive with the Pepsi Generation” à “Pepsi will bring your ancestors back from the dead” TA à your Kentucky Fried Chicken slogan “finger-lickin’ good” à “eat your fingers off” Lecture Slides à Slides of Conference Chevy Nova à Nova means “won’t go” in Spanish! Computation and Language E-Print Archive à Files of calculation and E-Print language Scandinavian vacuum manufacturer Electrolux used this in an American ad: “Nothing sucks like an Electrolux” Siemens washing machine: “made in Turkey” à “made by a turkey” Syllabus à Program Microsoft Research Natural Language Processing Group à Group Processing of Natural Language of Search of Microsoft If you have suggestions, let us know à If you have suggestions, leave knows to us 1 11/28/16 Fun with Babelfish Epic Machine Translation Fail There’s a serious point here … it’s dangerous to put total trust in an MT system when you have no idea what its output means! English to English Example The grieving people of Littleton were burying furnace more victims of the Columbine High School massacres one Monday have investigators said that the two teen gunmen had hoped to kill hundreds of classmates and teachers. Authorities also said year unidentified 18-year-old female being questioned butt whether she purchased any firearms used in the rampage is not considered has suspect “at this time”. English to English Example The grieving people of Littleton were burying furnace more victims of the Columbine High School massacres one Monday have investigators said that the two teen gunmen had hoped to kill hundreds of classmates and teachers. Authorities also said year unidentified 18-year-old female being questioned butt whether she purchased any firearms used in the rampage is not considered has suspect “at this time”. The grieving people of Littleton were burying four more victims of the Columbine High School massacre on Monday as investigators said that the two teen gunmen had hoped to kill hundreds of classmates and teachers. Authorities also said an unidentified 18-year-old female being questioned about whether she purchased any firearms used in the rampage is not considered a suspect “at this time”. 2 11/28/16 Another Example Philadelphia guard Allen Iverson got the 76ers off to a quick start this season. Sour Now he wants to make they finish have strong. Iverson scored 14 fourth-quarter points one his way to 38 to give Philadelphia has 103-86 victory over the Orlando Magic on Sunday, securing the team’s fourth straight win. Another Example Philadelphia guard Allen Iverson got the 76ers off to a quick start this season. Sour Now he wants to make they finish have strong. Iverson scored 14 fourth-quarter points one his way to 38 to give Philadelphia has 103-86 victory over the Orlando Magic on Sunday, securing the team’s fourth straight win. Philadelphia guard Allen Iverson got the 76ers off to a quick start this season. Now he wants to make sure they finish as strong. Iverson scored 14 fourth-quarter points on his way to 38 to give Philadelphia a 103-86 victory over the Orlando Magic on Sunday, securing the team’s fourth straight win. Different Types of MT Systems Direct Machine Translation [Knight, AI Magazine 1997] • Direct machine translation translates one language directly into another without an intermediate representation. • Direct MT systems typically use word for word translation based on a bilingual dictionary. • Local word reordering can be used to smooth the target language output. • Direct MT methods are inherently limited and usually don’t perform very well. 3 11/28/16 Word for Word Translation Problems Many commercial systems do word for word translation, despite its inherent problems, which include: Examples (English French) He walked across the road. Il traversa la rue a pied. • Lexical Ambiguity: shot, bat • Synonyms & Subtleties: run, jog, sprint, trot, flee • Phrases: hot dog, real estate, dry run • Transposition: Word order varies a lot, especially between different types of languages. She drove into town. Elle entra dans la ville en voiture. They flew from Gatwick. Ils partirent par avion de Gatwick. • Idioms: raining cats and dogs, buried the hatchet, kicked the bucket Indirect Machine Translation • Indirect machine translation uses an intermediate representation to capture meaning. • An interlingua is a language-independent meaning representation. The source language is analyzed and its meaning is represented by the interlingua, which is then used to generate output in the target language. • The interlingua must be rich enough to capture full understanding! • Interlingua-based MT allows for multilingual systems that can translate to and from any pair of languages. Knowledge-based MT (KBMT) KBMT is knowledge intensive and only feasible for limited domains. But it can perform very well! KBMT requires: • Bilingual dictionaries with syntactic and semantic information. • Syntactic grammars. • A semantic ontology. • Semantic analyzers. • Usually an interlingua representation. • Language generation models. • Another approach is transfer systems that use language-dependent intermediate representations. 4 11/28/16 Statistical MT • The best MT systems today use statistical methods. • Statistical MT systems are typically trained on parallel text corpora, which consist of source language texts coupled with their manual translations into a target language. Alignment • In parallel corpora, sentences and paragraphs do not always correspond. For example, one French sentence may be translated as two English sentences. • Some information (even entire sentences or paragraphs!) may be present in one but missing in the other. • The Hansard parallel corpus is widely used and contains the proceedings of the Canadian Parliament in both French and English. The European Union also has parallel corpora for many European languages. • Sentence alignment algorithms attempt to align the corresponding sentences across parallel texts. • The Web contains many sites that have multiple translations. However, the translations often are not parallel. • Word alignment algorithms attempt to align the corresponding words across parallel sentences. Parallel Text Example The higher turnover was largely due to an increase in then sales volume. Employment and investment levels also climbed. Word Alignment Examples The letter was sent Tuesday. Following a two-year transitional period, the new Foodstuffs Ordinance for Mineral Water came into effect on April 1, 1988. Specifically, it contains more stringent requirements regarding quality consistency and purity guarantees. La progression des chiffres d'affaires resulte en grande partie de l'accroisssement du volume des ventes. Le lettre a ete envoyee le mardi. Mary did not slap the green witch. L'emploi et le investissements ont egalement augmente. La nouvelle ordinnance federale sur les denrees alimentaires concernant entre autres les eaux minerales, entree en vigueur le ler avril 1988 apres une periode transitoire de deux ans, exige surtout une plus grande constance dans la qualite et une garantie de la purete. Maria no dio una bofetada a la bruja verde. 5 11/28/16 1a. 1b. ok-voon ororok sprok . at-voon bichat dat . 2a. 2b. [Knight, AI Magazine 1997] [Knight, AI Magazine 1997] 1a. 1b. Garcia and associates. Garcia y asociados. ok-drubel ok-voon anok plok sprok . at-drubel at-voon pippat rrat dat . 2a. 2b. Carlos Garcia has three associates. Carlos Garcia tiene tres asociados. 3a. 3b. erok sprok izok hihok ghirok . totat dat arrat vat hilat . 3a. 3b. his associates are not strong. sus asociados no son fuertes. 4a. 4b. ok-voon anok drok brok jok . at-voon krat pippat sat lat . 4a. 4b. Garcia has a company also. Garcia tambien tiene una empresa. 5a. 5b. wiwok farok izok stok . totat jjat quat cat . 5a. 5b. its clients are angry. sus clientes están enfadados. 6a. 6b. lalok sprok izok jok stok . wat dat krat quat cat . 6a. 6b. the associates are also angry. los asociados tambien están enfadados. 7a. 7b. lalok farok ororok lalok sprok izok enemok . wat jjat bichat wat dat vat eneat . 7a. 7b. the clients and the associates are enemies. los clientes y los asociados son enemigos 8a. 8b. lalok brok anok plok nok . iat lat pippat rrat nnat . 8a. 8b. the company has three groups. la empresa tiene tres grupos. 9a. 9b. wiwok nok izok kantok ok-yurp . totat nnat quat oloat at-yurp . 9a. 9b. its groups are in Europe. sus grupos están en Europa. 10a. 10b. the modern groups sell strong pharmaceuticals. los grupos modernos venden medicinas fuertes. 10a. lalok mok nok yorok ghirok clok . 10b. wat nnat gat mat bat hilat . Statistical Models for MT The basic statistical model for MT uses a translation model P(S | T) and a language model P(T). TargetOutput = argmax P(S | T) * P(T) Language Generation Challenges for language generation include: • lexical choice • syntactic form (voice, tense, etc.) • word orderings • information content The translation model computes the probability that the source text would be generated by the target text. Article Generation: I go to school. I go to university. (*) I saw her on TV. I heard him on radio. (*) The language model computes the probability that the target text would occur in the target language. Idioms: John kicked the bucket. The bucket was kicked by John. (*) Semantic Subtleties: President Bush goes fishing and boating in Maine. President Bush goes fishing and shipping in Maine. (*) 6 11/28/16 Using Language Models Parsers can judge the grammaticality of a sentence, but this is not enough. For example: John saw Mary on TV. John viewed Mary in the television. (*) Statistical language models can be useful for judging both word order and word selection (lexical choice). For example: “en” in Spanish can be translated as both “in” and “on”. I swam in the pool. I swam on the pool. (*) Or to judge grammaticality: Language Generation The number of possible output sentences can be enormous! Consider this meaning representation for a Japanese sentence. (Japanese does not mark singular/plural, definiteness, or time.) ACCUSATION Agent = she Theme =THEFT (agent = he theme = car) I ate bread. I ate a bread. (*) NITROGEN Example NITROGEN is a language generator that used a dictionary of nouns, verbs, adjectives, and adverbs, plus a hand-built grammar. NITROGEN found 381,440 English renderings! Examples: Using Language Models to Rank Statistical methods can be used to rank the choices. A statistical language model ranked these sentences as the top 5 choices: She charged that he stole the car. She charged that he stole the cars. There is the accusation of theft of the car by him by her. She charged that he stole cars. She charged that he stole car. She impeaches that he thieve that there was the auto. She charges that he stole the car. Her incriminates for him to thieve an automobiles. 7 11/28/16 The State of the Art in MT • Current MT systems are far from perfect, but they are improving and are faster and cheaper than manual translation. • MT generally gives a good idea of what a document is about. But the language generation can be awkward. • Gisting can be valuable: generate a rough translation to decide whether a manual translation is worthwhile. • Automatic MT systems can also be used as a starting point for manual translation. 8
© Copyright 2026 Paperzz