MWE Alignment in Phrase-Based Statistical Machine Translation

Article by: Santanu Pal, Sudip Kumar Naskar and Sivaji Bandyopadhyay
Presentation: Alon Har-Carmel

Introduction
•  The main problem in machine translation is ambiguity.
•  Ambiguity arises when words have more than one meaning.
•  When a word is used in conjunction with other words, they become ambiguous.

Main idea
•  Present the role of MWEs in improving the performance of a phrase-based statistical machine translation (PB-SMT) system.
•  By preprocessing the parallel corpus on both sides, we can achieve a significant improvement over the baseline PB-SMT system.

Multiword Expression
•  A multiword expression (MWE) is a lexeme made up of a sequence of two or more lexemes that has properties that are not predictable from the properties of the individual lexemes or their normal mode of combination.
(From Wikipedia)
•  A few types:
   –  conjunctions: as well as
   –  idioms: kick the bucket
   –  phrasal verbs: find out
   –  compound nouns: bus stop

System Description
•  MWE Identification
•  Preprocessing of the parallel corpus
•  MWE Extraction from Comparable Corpora
•  Automatic Alignment of MWEs
   –  MWE Alignment Validation
•  Incorporating Alignment directly into the word alignment Model

MWE Identification
Source: English
•  Considered: phrasal prepositions, verb-object combinations and noun-noun MWEs.
•  Noun-noun MWE identification uses a combination of several techniques:
   –  Point-wise Mutual Information (PMI)
   –  Log-likelihood Ratio (LLR)
   –  Phi-coefficient
•  A similar method has been followed to identify the other MWE types.
Target: Bengali
•  Considered: noun-noun MWEs, reduplicated phrases and complex predicates.
•  Reduplication is identified using a simple rule-based approach.

Preprocessing of the parallel corpus
•  MWEs are identified.
•  They are converted into single tokens.
•  1-to-1 alignments between the source and target can then be established.

MWE Extraction from Comparable Corpora
•  The parallel training set was relatively small, and very few types of MWEs were identified.
•  Comparable corpora were collected from Wikipedia.
•  The named entities (NEs) in the training data were listed and aligned.
•  Using this parallel NE list, information about the individual NEs was searched in Wikipedia for both languages.
•  MWEs were identified following the method described.
•  As the MWE identification method is statistical, by using comparable corpora we have extracted many more MWEs from the training data.

Automatic Alignment of MWEs
•  The parallel corpus is cleaned and filtered.
•  English−Bengali and Bengali−English PB-SMT systems have been developed to translate MWEs.
•  MWE identification.
•  The English MWEs are translated and validated against the Bengali side of the parallel text from which the MWEs were extracted.
•  The same is done for the Bengali MWEs.

MWE Alignment Validation
•  The validation process uses a fuzzy matching technique.
•  The score is set by: [formula not recovered]
•  A closely matching string is identified from the corresponding parallel text of the extracted MWEs.
•  After retrieving the closest strings for all MWEs, we prepare an MWE-level parallel corpus.
•  These parallel MWEs are added to the parallel training corpus as additional training data.

Incorporating Alignment directly into the word alignment Model
•  Aligned bilingual MWEs are incorporated directly into the word alignment model by updating the word alignment table.
•  The word alignment table is updated by looking up the bilingual MWE dictionary that was extracted from the training corpus.
•  The probability is normalized in both the source−target and target−source lexical files accordingly.
•  The lexical file is generated during the training phase.

Experiment Settings
•  500 sentences were randomly selected from the initial parallel corpus as the test set.
•  The rest are used as the training corpus.
•  The training corpus was filtered by:
   –  a maximum sentence length of 100 words
   –  a sentence length ratio of 1:2 (either way).
•  The training corpus contained 22,492 sentences.
•  A 4-gram language model and a maximum phrase length of 7 were chosen for PB-SMT.

MWE statistics of the parallel training corpus

Results

Conclusion
•  Pre-processing of MWEs in the parallel corpus can improve system performance.
•  For language pairs with small resources, this approach can help improve machine translation quality.
•  Knowledge can be acquired from external resources such as comparable corpora.

Reduplication types
•  Onomatopoeic expressions: the sound sequence of the word denotes the particular meaning of the form, e.g. khat khat (knock knock).
•  Complete reduplication: the individual words carry a certain meaning, and they are repeated, e.g. bara-bara (big big).
•  Partial reduplication: only one of the words is meaningful, while the other is constructed by partially reduplicating the first word, e.g. thakur-thukur (God).
•  Semantic reduplication: the paired members are semantically related, for example by synonymy (matha-mundu, head) or antonymy (dinrat, day and night).
•  Correlative reduplication: the corresponding correlative word is used just preceding the main verb, e.g. maramari (fighting).
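The slides say only that Bengali reduplication is identified with a simple rule-based approach; the rules themselves are not listed. As an illustration, here is a minimal Python sketch of what surface rules for two of the types above might look like. Everything in it is an assumption rather than the authors' method: the function names, the use of romanized tokens, and the suffix-overlap threshold for partial reduplication are hypothetical. Semantic and correlative reduplication need lexical knowledge and are not covered, and an onomatopoeic pair such as khat khat is simply labelled "complete" because the two forms are identical on the surface.

```python
from typing import List, Optional, Tuple


def common_suffix_len(a: str, b: str) -> int:
    """Length of the longest common suffix of a and b."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


def classify_pair(first: str, second: str) -> Optional[str]:
    """Label a word pair as 'complete' or 'partial' reduplication, else None.

    Hypothetical rules (not taken from the slides):
      - complete: both words are identical
      - partial:  the words differ but share a suffix covering at least
                  half of the longer word
    """
    first, second = first.lower(), second.lower()
    if first == second:
        return "complete"
    longer = max(len(first), len(second))
    if longer >= 3 and 2 * common_suffix_len(first, second) >= longer:
        return "partial"
    return None


def find_reduplications(tokens: List[str]) -> List[Tuple[str, str]]:
    """Scan tokens for hyphenated pairs (bara-bara) and adjacent pairs (khat khat)."""
    found = []
    for i, tok in enumerate(tokens):
        parts = tok.split("-")
        if len(parts) == 2 and all(parts):
            # Hyphenated candidate: compare the two halves.
            kind = classify_pair(parts[0], parts[1])
            if kind:
                found.append((tok, kind))
        elif "-" not in tok and i + 1 < len(tokens) and "-" not in tokens[i + 1]:
            # Adjacent candidate: compare this token with the next one.
            kind = classify_pair(tok, tokens[i + 1])
            if kind:
                found.append((tok + " " + tokens[i + 1], kind))
    return found
```

Called on the token list `["khat", "khat", "bara-bara", "thakur-thukur", "ghar"]`, this sketch labels khat khat and bara-bara as complete and thakur-thukur as partial. A suffix rule this loose would also fire on unrelated rhyming words, so a real system would need further constraints.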
Tools and Resources
•  A sentence-aligned English-Bengali parallel corpus containing 23,492 parallel sentences from the travel and tourism domain. The corpus has been collected from the consortium-mode project of English to Indian Languages Machine Translation (EILMT) System.
•  The Stanford Parser, Stanford NER, CRF chunker and WordNet 3.0 have been used for identifying complex predicates on the English side of the parallel corpus.
•  The sentences on the target side (Bengali) are tagged using the tools obtained from the consortium-mode project of Indian Language to Indian Language Machine Translation (IL-ILMT) System.
•  NEs in Bengali are identified using the NER system of Ekbal and Bandyopadhyay.
•  A standard log-linear PB-SMT model is used as our baseline system.
•  GIZA++ implementation of IBM word alignment model 4.
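The corpus described above is filtered before training, using the two criteria given in the experiment settings: a maximum sentence length of 100 words and a sentence length ratio of 1:2 in either direction. The following is a minimal Python sketch of such a filter, not the tool actually used for the experiments. The function names are hypothetical, and it assumes whitespace tokenization, that the length limit applies to both sides, that "1:2 either way" means the longer side may be at most twice the shorter, and that pairs with an empty side are dropped.

```python
from typing import Iterable, List, Tuple


def keep_pair(src: str, tgt: str, max_len: int = 100, max_ratio: float = 2.0) -> bool:
    """Return True if a sentence pair passes the length and ratio filters."""
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if n_src == 0 or n_tgt == 0:
        return False  # assumption: pairs with an empty side are discarded
    if n_src > max_len or n_tgt > max_len:
        return False  # maximum sentence length of 100 words
    # Length ratio of 1:2 either way: longer side at most twice the shorter.
    return max(n_src, n_tgt) <= max_ratio * min(n_src, n_tgt)


def filter_corpus(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only the sentence pairs that pass keep_pair."""
    return [(s, t) for s, t in pairs if keep_pair(s, t)]
```

Under these assumptions, a 2-word source paired with a 4-word target is kept, while a 1-word source paired with a 3-word target is removed.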