Exploiting Parallel Treebanks to Improve Phrase-Based Statistical Machine Translation
John Tinsley, Mary Hearne and Andy Way
National Centre for Language Technology, Dublin City University, Ireland
J. Tinsley, M. Hearne and A. Way – DCU @ TLT

Motivation and Background
  Phrase-Based Statistical MT
Parallel Treebanks
  Construction
  Alignment
Experimental Setup
  Data
  MT System
  Phrase Extraction
  Translation Models
  Evaluation
Results
Discussion
  Word Alignments
  Phrase Alignments
  Ongoing and Future Work
  Conclusions

Motivation and Background

Most state-of-the-art Machine Translation (MT) systems
◮ are not syntax-aware
◮ use models which are based on n-grams
◮ have no linguistic components

Parallel treebanks are not widely used in MT, if at all.
However, we believe that the data encoded within parallel treebanks could be useful in MT.

In order to find out
◮ we automatically build large parallel treebanks
  ◮ using off-the-shelf parsers and our sub-tree aligner
◮ use these parallel treebanks to train MT systems

Akin to the work of [Groves, 2007]

The MT system “flavour” we are using here is a phrase-based statistical MT system (PBSMT). This is a data-driven MT system which uses well-defined statistical models to estimate parameters from parallel corpora.
PBSMT systems can be roughly broken up into a number of components
◮ statistical word alignment
◮ phrase alignment heuristics
◮ translation model combining word and phrase pairs and their probabilities
◮ target language model (tri-gram model)
◮ decoder (beam search for optimal translation)

[Figure: PBSMT system, with components Parallel Corpus, Word Aligner, 'Word-alignment based' word + phrase extraction, Translation Model, Language model and decoder, taking input sentences and producing output translations.]
Parallel Treebanks

The treebank creation process was fully automatic.

Advantages
◮ fast
◮ large-scale capabilities
◮ highly accurate tools

The parallel corpora from which we built the parallel treebanks were samples from the EuroParl release [Koehn, 2005].

Two main aspects
◮ monolingual parsing
◮ sub-sentential alignment

Sub-sentential alignment of the parallel treebank was carried out using the tool described fully in [Tinsley et al., 2007].

Nodes are aligned between tree pairs using a simple algorithm
◮ assume hypothetical links between all nodes in a given tree pair
◮ estimate scores for these hypothetical links (done using word alignment probabilities)
◮ use a greedy search to select the optimal set of alignments
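The three steps above can be illustrated with a minimal Python sketch. It is not the aligner of [Tinsley et al., 2007]: the scoring function `link_score`, the single symmetric probability table `p`, the toy English-Spanish tree pair, and the reduction of well-formedness to "each node is linked at most once" are all assumptions made for illustration.

```python
from itertools import product


def link_score(src_words, tgt_words, p):
    """Toy score for a hypothetical link between two nodes (an assumption):
    every word on each side must have some likely counterpart on the other."""
    score = 1.0
    for s in src_words:
        score *= max(p.get((s, t), 0.0) for t in tgt_words)
    for t in tgt_words:
        score *= max(p.get((s, t), 0.0) for s in src_words)
    return score


def greedy_align(src_nodes, tgt_nodes, p):
    """src_nodes and tgt_nodes map a node label to the words it dominates."""
    # Steps 1 and 2: hypothesise a link for every node pair and score it.
    scores = {
        (s, t): link_score(src_nodes[s], tgt_nodes[t], p)
        for s, t in product(src_nodes, tgt_nodes)
    }
    # Step 3: greedily keep the best remaining link; each node links once.
    links, used_src, used_tgt = [], set(), set()
    for (s, t), score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        if score <= 0.0:
            break  # only zero-scored hypotheses remain
        if s not in used_src and t not in used_tgt:
            links.append((s, t))
            used_src.add(s)
            used_tgt.add(t)
    return links


# Hypothetical word translation probabilities and a toy tree pair.
p = {("the", "la"): 0.9, ("green", "verde"): 0.8, ("house", "casa"): 0.9}
en = {"NP": ("the", "green", "house"), "D": ("the",), "A": ("green",), "N": ("house",)}
es = {"NP": ("la", "casa", "verde"), "D": ("la",), "N": ("casa",), "A": ("verde",)}
links = greedy_align(en, es, p)
```

On this toy input the word-level nodes are linked first and the two NP nodes last, because the product-based score penalises longer spans. A real scoring function would need to compensate for span length.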
[Hearne et al., 2007] describes the aligner’s linguistic capabilities in specific detail.

[Figure: step-by-step alignment of an English-French tree pair, “from a Windows Application :” (nodes HEADER-1 to COLON-9) and “à partir de une application Windows” (nodes PP-1 to N-11), shown with a matrix that has one row per English node and one column per French node and is filled in over successive steps.]

[Figure: Parallel Treebank Construction, with components Parallel Corpus, Source parser, Target parser, Word Aligner, WA Probabilities and Tree Aligner, producing the Parallel Treebank.]
Experimental Setup

We used two data sets for two different language pairs.

English – German EuroParl
◮ 9000:1000 sentence split for training:testing
◮ Monolingual parsers
  ◮ English - Bikel [Bikel, 2002]
  ◮ German - BitPar [Schmid, 2004] (TIGER)

English – Spanish EuroParl
◮ 4411:500 sentence split for training:testing
◮ Monolingual parsers
  ◮ English - Bikel
  ◮ Spanish - Bikel [Chrupala and van Genabith, 2006] (Cast3LB)

Tools used to build the MT system
◮ language model - SRILM toolkit [Stolcke, 2002]
◮ word alignment - GIZA++ [Och and Ney, 2003]
◮ decoder - Moses [Koehn et al., 2007]
[Figure: PBSMT System exploiting TB. The PBSMT pipeline (Parallel Corpus, Word Aligner, 'Word-alignment based' word + phrase extraction, Translation Model, Language model, decoder, input sentences, output translations) extended with Source parser, Target parser, Tree Aligner, Parallel Treebank and 'Tree-based' word + phrase extraction.]

What about the phrase extraction?
Parallel Corpora
◮ GIZA++ gives S → T and T → S word alignment probabilities
◮ phrases extracted according to heuristics based on the word alignments
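To make the corpus-based side concrete, here is a minimal Python sketch of one commonly used heuristic: keep every phrase pair that is consistent with a given set of word alignment points. The function name `extract_phrases`, the `max_len` limit, and the toy sentence pair and alignment are assumptions for illustration. The sketch starts from an already symmetrised alignment and omits the usual extension over unaligned words, so it is not necessarily what the system described here does.

```python
def extract_phrases(src, tgt, alignment, max_len=7):
    """Return phrase pairs consistent with the alignment points (i, j),
    where i indexes a source word and j a target word."""
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to the source span [i1, i2].
            linked = [j for (i, j) in alignment if i1 <= i <= i2]
            if not linked:
                continue
            j1, j2 = min(linked), max(linked)
            if j2 - j1 + 1 > max_len:
                continue
            # Consistency: no target word in [j1, j2] may be aligned
            # to a source word outside [i1, i2].
            if any(j1 <= j <= j2 and not (i1 <= i <= i2) for (i, j) in alignment):
                continue
            pairs.add((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs


# Toy example with a hypothetical alignment.
src = "of the european union".split()
tgt = "de la unión europea".split()
alignment = {(0, 0), (1, 1), (2, 3), (3, 2)}
pairs = extract_phrases(src, tgt, alignment)
```

Note that a pair such as ("of the", "de la") is extracted even though it is not a syntactic constituent, since the heuristic looks only at alignment points.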
Parallel Treebanks
◮ word and phrase pairs extracted based on surface strings of linked node pairs
◮ for example...

English: (S (NP1 John) (VP (V likes) (NP2 Mary)))
French: (S (NP1 Marie) (VP (V plaît) (PP (P à) (NP2 Jean))))

◮ John ⇔ Jean
◮ Mary ⇔ Marie
◮ John likes Mary ⇔ Marie plaît à Jean

We combine the word and phrase pairs extracted from the parallel corpora and parallel treebanks into five separate translation models (one of the following for each language pair)
◮ ⟨W_C, P_C⟩ (baseline): translation model consisting only of word and phrase pairs extracted from the parallel corpus using PBSMT techniques.
◮ ⟨W_T, P_T⟩: translation model consisting only of word and phrase pairs extracted from the parallel treebank based on the induced links.
◮ ⟨W_C, P_T⟩: translation model consisting of word pairs extracted from the parallel corpus and phrase pairs extracted from the parallel treebank.
◮ ⟨W_T, P_C⟩: translation model consisting of word pairs extracted from the parallel treebank and phrase pairs extracted from the parallel corpus.
◮ ⟨W_C + W_T, P_C + P_T⟩: translation model consisting of word and phrase pairs extracted from both the parallel corpus and the parallel treebank.
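The tree-based extraction and the five configurations can be sketched in Python as follows. The nested-tuple tree encoding, the helper names `yield_of`, `find` and `extract_pairs`, the rule that a "word pair" has exactly one word on each side, and the corpus-side sets `W_C` and `P_C` are all assumptions for illustration. Set union also ignores counts and probabilities, which a real translation model would need.

```python
def yield_of(tree):
    """Surface string (list of words) dominated by a node."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:
        words.extend(yield_of(child))
    return words


def find(tree, label):
    """First node with the given label, searching depth-first."""
    if isinstance(tree, str):
        return None
    if tree[0] == label:
        return tree
    for child in tree[1:]:
        hit = find(child, label)
        if hit is not None:
            return hit
    return None


def extract_pairs(src_tree, tgt_tree, links):
    """Read word and phrase pairs off the yields of linked node pairs."""
    words, phrases = set(), set()
    for s_label, t_label in links:
        s = yield_of(find(src_tree, s_label))
        t = yield_of(find(tgt_tree, t_label))
        pair = (" ".join(s), " ".join(t))
        # Assumed convention: one word on each side counts as a word pair.
        (words if len(s) == 1 and len(t) == 1 else phrases).add(pair)
    return words, phrases


en = ("S", ("NP1", "John"), ("VP", ("V", "likes"), ("NP2", "Mary")))
fr = ("S", ("NP1", "Marie"), ("VP", ("V", "plaît"), ("PP", ("P", "à"), ("NP2", "Jean"))))
links = [("NP1", "NP2"), ("NP2", "NP1"), ("S", "S")]
W_T, P_T = extract_pairs(en, fr, links)

# Hypothetical corpus-based pairs, standing in for word-alignment-based extraction.
W_C = {("John", "Jean"), ("likes", "plaît")}
P_C = {("John likes", "plaît à Jean")}

configs = {
    "WC,PC": (W_C, P_C),  # baseline
    "WT,PT": (W_T, P_T),
    "WC,PT": (W_C, P_T),
    "WT,PC": (W_T, P_C),
    "WC+WT,PC+PT": (W_C | W_T, P_C | P_T),
}
```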
[Figure: PBSMT System exploiting TB, with the Translation Model slot filled by one of the five translation models (1|2|3|4|5).]

◮ Given 2 language pairs, 2 translation directions and 5 translation models, we have 20 translation runs in total
  ◮ e.g. DE → EN using translation model ⟨W_C, P_T⟩
◮ Translation evaluated using three common automatic measures for MT
  ◮ BLEU [Papineni et al., 2002], NIST [Doddington, 2002] and METEOR [Banerjee and Lavie, 2005]
  ◮ Statistical significance calculated using bootstrap resampling [Zhang et al., 2004]
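As a rough illustration of bootstrap resampling for comparing two systems, the Python sketch below uses a paired variant: test sentences are resampled with replacement and both systems are rescored on each resample. This is one common way to apply the idea and may differ from the procedure in the cited work. The names `paired_bootstrap` and `unigram_precision` are hypothetical, and the toy metric stands in for BLEU, NIST or METEOR.

```python
import random


def paired_bootstrap(score_fn, hyps_a, hyps_b, refs, n_samples=1000, seed=0):
    """Fraction of resampled test sets on which system B outscores system A."""
    rng = random.Random(seed)
    n = len(refs)
    wins_b = 0
    for _ in range(n_samples):
        # Draw n sentence indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        sample_refs = [refs[i] for i in idx]
        a = score_fn([hyps_a[i] for i in idx], sample_refs)
        b = score_fn([hyps_b[i] for i in idx], sample_refs)
        if b > a:
            wins_b += 1
    return wins_b / n_samples


def unigram_precision(hyps, refs):
    """Toy corpus-level metric: share of hypothesis words found in the reference."""
    matched = total = 0
    for hyp, ref in zip(hyps, refs):
        ref_words = ref.split()
        for word in hyp.split():
            total += 1
            if word in ref_words:
                matched += 1
    return matched / total if total else 0.0
```

A value close to 1.0 suggests that system B is reliably better than system A on this test set; a value near 0.5 suggests the difference could be due to the particular sentences sampled.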
Results

EuroParl: English → German
Configuration            BLEU     NIST     METEOR
⟨W_C, P_C⟩               0.1186   4.1168   0.3840
⟨W_C + W_T, P_C + P_T⟩   0.1259   4.3044   0.3938

Main points
◮ combining parallel treebank data with parallel corpus data improves significantly over the parallel-corpus-only baseline
◮ improvements are evident in both data sets and in both translation directions over all metrics
◮ Contradiction: “syntax-based mappings do not lead to better translations” [Koehn et al., 2003]
◮ comparing different translation models tells us more...

Discussion

EuroParl: German → English
Configuration   BLEU     NIST     METEOR
⟨W_C, P_C⟩      0.1622   4.9949   0.4344
⟨W_T, P_C⟩      0.1676   5.2324   0.4473
Using parallel treebank word pairs instead of parallel corpus word pairs leads to significantly higher scores. Why?
◮ more word pair tokens extracted from the parallel treebank
◮ parallel treebank yields twice as many unique word pair types
◮ parallel corpus word pairs not as useful for translation

EuroParl: Spanish → English
Configuration   BLEU     NIST     METEOR
⟨W_C, P_C⟩      0.1754   4.7582   0.4802
⟨W_C, P_T⟩      0.1626   4.6606   0.4498
Using phrase pairs from the parallel treebank in place of those from the parallel corpus does not improve translation. Why?
◮ There are nearly four times as many phrase pairs extracted from the parallel corpora, on average
◮ This increases the system’s coverage and gives rise to the higher scores
◮ However, the higher scores achieved by the corpus phrase pairs are not proportionate to the larger number of phrase pairs evident (e.g. compared to word pairs)
◮ Very similar findings to those of [Groves, 2007]

We obtained some insight into these results by manually inspecting the extracted data
◮ Phrase pairs extracted from the parallel treebanks were of higher quality
◮ Phrase pairs extracted from the parallel corpora consisted mainly of function words, and had a very short average length (1.85 words per phrase)

We obtained some insight into these results by manually inspecting the extracted data

◮ Phrase pairs extracted from the parallel treebanks were of higher quality
◮ Phrase pairs extracted from the parallel corpora consisted mainly of function words and had a very short average length (1.85 words per phrase)

Example of the most frequent phrase pair unique to each resource

◮ Parallel Corpus
    ◮ of the ⇔ de la
◮ Parallel Treebank
    ◮ the european union ⇔ la unión europea

Herein lies a big stumbling block

◮ We cannot (as yet) extract non-constituent phrase pairs, such as of the ⇔ de la, from the parallel treebank
◮ Constituent phrase pairs alone are not enough for translation [Koehn et al., 2003], which was backed up by our results
◮ But all is not lost...
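
The contrast between the two resources can be made concrete with a sketch of the alignment-consistency criterion commonly used for corpus-based phrase extraction in the style of [Koehn et al., 2003]. This is a simplified illustration, not the extraction code used in these experiments: the sentence pair, its word alignment and the function name `extract_phrase_pairs` are hypothetical, and the usual extension of phrases over unaligned words is omitted.

```python
from typing import List, Set, Tuple

Alignment = Set[Tuple[int, int]]  # (source index, target index) links


def extract_phrase_pairs(
    src: List[str], tgt: List[str], alignment: Alignment, max_len: int = 4
) -> Set[Tuple[str, str]]:
    """Collect phrase pairs that are consistent with the word alignment.

    A span pair is kept when the source span has at least one link and no
    link connects a word inside the target span to a source word outside
    the source span.
    """
    pairs: Set[Tuple[str, str]] = set()
    for s_start in range(len(src)):
        for s_end in range(s_start, min(s_start + max_len, len(src))):
            # Target positions linked to any word in the source span.
            linked = [t for s, t in alignment if s_start <= s <= s_end]
            if not linked:
                continue
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start + 1 > max_len:
                continue
            # Reject spans whose target side is also linked to outside words.
            crosses = any(
                t_start <= t <= t_end and not (s_start <= s <= s_end)
                for s, t in alignment
            )
            if not crosses:
                pairs.add(
                    (" ".join(src[s_start:s_end + 1]),
                     " ".join(tgt[t_start:t_end + 1]))
                )
    return pairs


# Hypothetical sentence pair with a hand-written word alignment.
src = "the president of the european union".split()
tgt = "el presidente de la unión europea".split()
alignment = {(0, 0), (1, 1), (2, 2), (3, 3), (4, 5), (5, 4)}

for pair in sorted(extract_phrase_pairs(src, tgt, alignment)):
    print(" ⇔ ".join(pair))
```

On this toy input the result contains both of the ⇔ de la, which is not a constituent on either side, and the european union ⇔ la unión europea, which is. An extractor restricted to aligned tree nodes would yield only pairs of the latter kind.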

Ongoing work:

◮ Scaling up to 730,000 sentence pairs
◮ Weighting the parallel treebank word and phrase pairs in the translation model(s)
◮ Extracting non-constituent phrase pairs from the parallel treebank based on the existing alignments

Conclusions

◮ Parallel treebank word and phrase pairs improve translation quality when combined with traditional corpus-based extraction
◮ Parallel treebank word pairs are better for translation than those given by traditional word alignment
◮ Parallel treebank phrase pairs are too few to be used alone for translation
◮ Further improvement will come when we can get more from the treebank and use it better

Banerjee, S. and Lavie, A. (2005). METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization at the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-05), Ann Arbor, MI.
Bikel, D. (2002). Design of a Multi-lingual, Parallel-processing Statistical Parsing Engine. In Human Language Technology Conference (HLT), San Diego, CA, USA.
Chrupala, G. and van Genabith, J. (2006). Using Machine-Learning to Assign Function Labels to Parser Output for Spanish. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 136–143, Sydney, Australia. Association for Computational Linguistics.
Doddington, G. (2002). Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics. In Human Language Technology: Notebook Proceedings, pages 128–132, San Diego, CA.
Groves, D. (2007). Hybrid Data-Driven Models of Machine Translation. PhD thesis, Dublin City University, Dublin, Ireland.
Hearne, M., Tinsley, J., Zhechev, V., and Way, A. (2007). Capturing Translational Divergences with a Statistical Tree-to-Tree Aligner. In Proceedings of the 11th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-07), pages 83–94, Skövde, Sweden.
Koehn, P. (2005). Europarl: A Parallel Corpus for Statistical Machine Translation. In Machine Translation Summit X, pages 79–86, Phuket, Thailand.
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., and Herbst, E. (2007). Moses: Open Source Toolkit for Statistical Machine Translation. In Annual Meeting of the Association for Computational Linguistics (ACL), demonstration session, pages 177–180, Prague, Czech Republic.
Koehn, P., Och, F. J., and Marcu, D. (2003). Statistical Phrase-Based Translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL '03), pages 48–54, Edmonton, Canada.
Och, F. J. and Ney, H. (2003). A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19–51.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL-02), pages 311–318, Philadelphia, PA.
Schmid, H. (2004). Efficient Parsing of Highly Ambiguous Context-Free Grammars with Bit Vectors. In Proceedings of the 20th International Conference on Computational Linguistics (COLING 04), Geneva, Switzerland.
Stolcke, A. (2002). SRILM - An Extensible Language Modeling Toolkit. In Proceedings of the International Conference on Spoken Language Processing, Denver, CO.
Tinsley, J., Zhechev, V., Hearne, M., and Way, A. (2007). Robust Language-Pair Independent Sub-Tree Alignment. In Machine Translation Summit XI, pages 467–474, Copenhagen, Denmark.
Zhang, Y., Vogel, S., and Waibel, A. (2004). Interpreting Bleu–NIST scores: How much improvement do we need to have a better system? In Proceedings of the 4th International Conference on Language Resources and Evaluation, Lisbon, Portugal.

EuroParl: English → German

Configuration          BLEU      NIST      METEOR
⟨WC, PC⟩               0.1186    4.1168    0.3840
⟨WT, PT⟩               0.1055    4.1153    0.3796
⟨WC, PT⟩               0.1019    3.9778    0.3691
⟨WT, PC⟩               0.1242    4.2605    0.3931
⟨WC + WT, PC + PT⟩     0.1259    4.3044    0.3938

EuroParl: German → English

Configuration          BLEU      NIST      METEOR
⟨WC, PC⟩               0.1622    4.9949    0.4344
⟨WT, PT⟩               0.1498    5.1720    0.4327
⟨WC, PT⟩               0.1443    4.9342    0.4176
⟨WT, PC⟩               0.1676    5.2324    0.4473
⟨WC + WT, PC + PT⟩     0.1687    5.2474    0.4492

EuroParl: English → Spanish

Configuration          BLEU      NIST      METEOR
⟨WC, PC⟩               0.1765    4.8857    0.4515
⟨WT, PT⟩               0.1689    4.8662    0.4560
⟨WC, PT⟩               0.1634    4.6964    0.4440
⟨WT, PC⟩               0.1807    5.0389    0.4619
⟨WC + WT, PC + PT⟩     0.1867    5.0898    0.4701

EuroParl: Spanish → English

Configuration          BLEU      NIST      METEOR
⟨WC, PC⟩               0.1754    4.7582    0.4802
⟨WT, PT⟩               0.1708    4.8664    0.4659
⟨WC, PT⟩               0.1626    4.6606    0.4498
⟨WT, PC⟩               0.1840    4.9557    0.4910
⟨WC + WT, PC + PT⟩     0.1880    4.9923    0.4935

Word pair statistics

            EN-DE                   EN-ES
            Corpus     Treebank     Corpus     Treebank
#Tokens     69,200     79,675       37,339     43,312
#Types       7,672     18,286        5,056     11,274
COMP         1,929     12,545          904      7,131

Phrase pair statistics

            EN-DE                   EN-ES
            Corpus     Treebank     Corpus     Treebank
#Tokens     120,410    33,789       86,640     18,301
#Types       97,167    30,251       72,583     15,199
COMP         92,084    25,041       67,378      9,949

Unique corpus phrase pairs

of the             ⇔   der                    398
that               ⇔   , daß                  383
president ,        ⇔   präsident ,            163
mr president ,     ⇔   herr präsident ,       163
european union     ⇔   europäischen union     135

Unique treebank phrase pairs

the next item             ⇔   nach der tagesordnung      22
very much                 ⇔   dank                       17
the union 's              ⇔   der union                  17
the european union 's     ⇔   der europäischen union     15
the sitting               ⇔   die sitzung                13
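
The #Tokens, #Types and COMP rows above can be computed from two multisets of extracted pairs along the following lines. This is a sketch only: it assumes that COMP counts the types found in one resource but not in the other, which the slides do not state explicitly, and the toy counts and the function name `pair_statistics` are invented for illustration.

```python
from collections import Counter
from typing import Dict, Tuple

Pair = Tuple[str, str]


def pair_statistics(own: Counter, other: Counter, top_n: int = 2) -> Dict[str, object]:
    """Summarise the pairs in `own`, relative to the pairs in `other`."""
    # Types that occur in `own` but never in `other` (assumed meaning of COMP).
    unique = Counter({pair: n for pair, n in own.items() if pair not in other})
    return {
        "tokens": sum(own.values()),   # total extracted instances
        "types": len(own),             # distinct pairs
        "comp": len(unique),           # distinct pairs unique to `own`
        "top_unique": unique.most_common(top_n),
    }


# Toy data, not taken from the experiments.
corpus_pairs: Counter = Counter({
    ("of the", "de la"): 5,
    ("that", "que"): 3,
    ("union", "unión"): 2,
})
treebank_pairs: Counter = Counter({
    ("the european union", "la unión europea"): 4,
    ("union", "unión"): 2,
})

corpus_stats = pair_statistics(corpus_pairs, treebank_pairs)
treebank_stats = pair_statistics(treebank_pairs, corpus_pairs)
print(corpus_stats)
print(treebank_stats)
```

Sorting the unique pairs by frequency, as `most_common` does here, is one plausible way to arrive at lists like the two tables of unique phrase pairs.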