Exploiting Parallel Treebanks to Improve
Phrase-Based Statistical Machine Translation
John Tinsley, Mary Hearne and Andy Way
National Centre for Language Technology
Dublin City University
Ireland
J. Tinsley, M. Hearne and A. Way – DCU @ TLT
Exploiting Parallel Treebanks to Improve Phrase-Based SMT
Motivation and Background
  Phrase-Based Statistical MT
Parallel Treebanks
  Construction
  Alignment
Experimental Setup
  Data
  MT System
  Phrase Extraction
  Translation Models
  Evaluation
Results
Discussion
  Word Alignments
  Phrase Alignments
  Ongoing and Future Work
  Conclusions
Motivation and Background
Phrase-Based Statistical MT
Most state-of-the-art Machine Translation (MT) systems
◮ are not syntax-aware
◮ use models which are based on n-grams
◮ have no linguistic components
Parallel treebanks are not widely used in MT, if at all. However, we believe that the data encoded within parallel treebanks could be useful in MT.
In order to find out
◮ we automatically build large parallel treebanks
  ◮ using off-the-shelf parsers and our sub-tree aligner
◮ use these parallel treebanks to train MT systems
Akin to the work of [Groves, 2007]
The MT system “flavour” we are using here is a phrase-based statistical MT (PBSMT) system. This is a data-driven MT system which uses well-defined statistical models whose parameters are estimated from parallel corpora.
PBSMT systems can be roughly broken up into a number of components
◮ statistical word alignment
◮ phrase alignment heuristics
◮ translation model combining word and phrase pairs and their probabilities
◮ target language model (tri-gram model)
◮ decoder (beam search for optimal translation)
[Figure: PBSMT system. Components shown: Parallel Corpus, Word Aligner, 'Word-alignment based' word + phrase extraction, Translation Model, Language model, decoder. Input sentences enter the decoder and output translations leave it.]
Parallel Treebanks
Construction
The treebank creation process was fully automatic
Advantages
◮ fast
◮ large-scale capabilities
◮ highly accurate tools
The parallel corpora from which we built the parallel treebanks were samples from the EuroParl release [Koehn, 2005]
Two main aspects
◮ monolingual parsing
◮ sub-sentential alignment
Alignment
Sub-sentential alignment of the parallel treebank was carried out using the tool described fully in [Tinsley et al., 2007]
Nodes are aligned between tree pairs using a simple algorithm
◮ assume hypothetical links between all nodes in a given tree pair
◮ estimate scores for these hypothetical links (done using word alignment probabilities)
◮ use a greedy search to select the optimal set of alignments
[Hearne et al., 2007] describes the aligner’s linguistic capabilities in specific detail
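The three steps above can be sketched roughly as follows. This is only an illustration, not the aligner of [Tinsley et al., 2007]: the function name `greedy_align`, the score dictionary, and the restriction to one link per node are assumptions made for the example, and the scores are invented rather than computed from word alignment probabilities. Any further well-formedness constraints the real tool applies are not modelled.

```python
def greedy_align(scores):
    """Greedily select links from hypothesised (source, target) node pairs.

    `scores` maps (source_node, target_node) to a link score.
    Assumption: each node may take part in at most one link.
    """
    # Keep only hypotheses with a positive score, best first
    # (ties broken by node names so the result is deterministic).
    candidates = sorted(
        ((score, pair) for pair, score in scores.items() if score > 0),
        key=lambda item: (-item[0], item[1]),
    )
    used_source, used_target, links = set(), set(), []
    for _score, (src, tgt) in candidates:
        # Skip hypotheses that compete with an already selected link.
        if src in used_source or tgt in used_target:
            continue
        links.append((src, tgt))
        used_source.add(src)
        used_target.add(tgt)
    return links


# Invented scores for a handful of hypothetical links.
example_scores = {
    ("NP-4", "NP-7"): 0.6,
    ("NP-4", "NP-9"): 0.3,
    ("NP-6", "NP-9"): 0.5,
    ("P-3", "P-2"): 0.0,
}
example_links = greedy_align(example_scores)
```

Here the hypothesis ("NP-4", "NP-9") is discarded because both of its nodes are already linked by higher-scoring hypotheses, and the zero-scored pair is never considered.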
[Figure: step-by-step illustration of the alignment algorithm on an English–French tree pair. English tree nodes HEADER-1 to COLON-9 over the words “from”, “a”, “Windows”, “Application” and “:”; French tree nodes PP-1 to N-11 over the words “à”, “partir”, “de”, “une”, “application”, “Windows”. A matrix indexed by source and target node numbers is filled in over successive steps.]
[Figure: Parallel Treebank Construction. Components shown: Parallel Corpus, Source parser, Target parser, Word Aligner, WA Probabilities, Tree Aligner, Parallel Treebank.]
Experimental Setup
Data
We used two data sets for two different language pairs
◮ English – German EuroParl
  ◮ 9000:1000 sentence split for training:testing
  ◮ Monolingual parsers
    ◮ English - Bikel [Bikel, 2002]
    ◮ German - BitPar [Schmid, 2004] (TIGER)
◮ English – Spanish EuroParl
  ◮ 4411:500 sentence split for training:testing
  ◮ Monolingual parsers
    ◮ English - Bikel
    ◮ Spanish - Bikel [Chrupala and van Genabith, 2006] (Cast3LB)
MT System
Tools used to build the MT system
◮ language model - SRILM toolkit [Stolcke, 2002]
◮ word alignment - GIZA++ [Och and Ney, 2003]
◮ decoder - Moses [Koehn et al., 2007]
[Figure: PBSMT system exploiting the parallel treebank (TB). The PBSMT components (Parallel Corpus, Word Aligner, 'Word-alignment based' word + phrase extraction, Translation Model, Language model, decoder, input sentences, output translations) are extended with Source parser, Target parser, Tree Aligner, Parallel Treebank and 'Tree-based' word + phrase extraction.]
What about the phrase extraction?
Phrase Extraction
Parallel Corpora
◮ GIZA++ gives S → T and T → S word alignment probabilities
◮ phrases extracted according to heuristics based on the word alignments
Parallel Treebanks
◮ word and phrase pairs extracted based on surface strings of linked node pairs
◮ for example...
[Figure: aligned tree pair. English: S over NP1 (John) and VP, with VP over V (likes) and NP2 (Mary). French: S over NP1 (Marie) and VP, with VP over V (plaît) and PP, and PP over P (à) and NP2 (Jean).]
Extracted pairs
John ⇔ Jean
Mary ⇔ Marie
John likes Mary ⇔ Marie plaît à Jean
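A minimal sketch of this extraction step is given below. The `Node` class, the `extract_pairs` function and the way links are stored are hypothetical names chosen for the illustration; only the trees and the three linked node pairs are taken from the example, and the sketch is not the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    word: str = ""                                  # set on pre-terminal nodes only
    children: list = field(default_factory=list)

    def surface(self):
        """Return the string of words dominated by this node."""
        if not self.children:
            return self.word
        return " ".join(child.surface() for child in self.children)


def extract_pairs(links):
    """Turn each linked (source, target) node pair into a surface-string pair."""
    return [(src.surface(), tgt.surface()) for src, tgt in links]


# English tree: John likes Mary
en_np1, en_np2 = Node("NP1", "John"), Node("NP2", "Mary")
en_s = Node("S", children=[en_np1, Node("VP", children=[Node("V", "likes"), en_np2])])

# French tree: Marie plaît à Jean
fr_np1, fr_np2 = Node("NP1", "Marie"), Node("NP2", "Jean")
fr_pp = Node("PP", children=[Node("P", "à"), fr_np2])
fr_s = Node("S", children=[fr_np1, Node("VP", children=[Node("V", "plaît"), fr_pp])])

# Links assumed from the three extracted pairs in the example.
links = [(en_np1, fr_np2), (en_np2, fr_np1), (en_s, fr_s)]
pairs = extract_pairs(links)
```

Because only linked nodes contribute, “likes” and “plaît à” yield no pair of their own here; they appear only inside the sentence-level pair.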
Translation Models
We combine the word and phrase pairs extracted from the parallel corpora and parallel treebanks into five separate translation models (one of the following for each language pair)
◮ ⟨WC, PC⟩ (baseline): Translation model consisting only of word and phrase pairs extracted from the parallel corpus using PBSMT techniques.
◮ ⟨WT, PT⟩: Translation model consisting only of word and phrase pairs extracted from the parallel treebank based on the induced links.
◮ ⟨WC, PT⟩: Translation model consisting of word pairs extracted from the parallel corpus and phrase pairs extracted from the parallel treebank.
◮ ⟨WT, PC⟩: Translation model consisting of word pairs extracted from the parallel treebank and phrase pairs extracted from the parallel corpus.
◮ ⟨WC + WT, PC + PT⟩: Translation model consisting of word and phrase pairs extracted from both the parallel corpus and the parallel treebank.
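One way to picture the five configurations is as different combinations of four pair collections. The sketch below makes two assumptions that the slides do not state: that combining resources means summing pair counts, and that probabilities are relative frequencies, as is common in PBSMT. The function names and the toy pairs are invented for the illustration.

```python
from collections import Counter


def build_models(wc, pc, wt, pt):
    """Return the five configurations as Counters over (source, target) pairs.

    wc/pc: word and phrase pairs from the parallel corpus.
    wt/pt: word and phrase pairs from the parallel treebank.
    Assumption: combining resources sums the pair counts.
    """
    return {
        "<WC, PC>": Counter(wc) + Counter(pc),
        "<WT, PT>": Counter(wt) + Counter(pt),
        "<WC, PT>": Counter(wc) + Counter(pt),
        "<WT, PC>": Counter(wt) + Counter(pc),
        "<WC+WT, PC+PT>": Counter(wc) + Counter(wt) + Counter(pc) + Counter(pt),
    }


def relative_frequency(model):
    """Estimate p(target | source) = count(source, target) / count(source)."""
    source_totals = Counter()
    for (src, _tgt), count in model.items():
        source_totals[src] += count
    return {pair: count / source_totals[pair[0]] for pair, count in model.items()}


# Toy pairs, invented for the illustration.
wc = [("of", "de"), ("the", "la")]
pc = [("of the", "de la")]
wt = [("of", "de"), ("the", "el")]
pt = [("the house", "la casa")]

models = build_models(wc, pc, wt, pt)
combined_probs = relative_frequency(models["<WC+WT, PC+PT>"])
```

In the combined model the pair ("of", "de") is counted twice because both resources contain it, while "the" splits its probability mass between "la" and "el".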
[Figure: PBSMT system exploiting the parallel treebank (TB). Components shown: Source parser, Target parser, Tree Aligner, Parallel Corpus, Parallel Treebank, Word Aligner, 'Word-alignment based' word + phrase extraction, 'Tree-based' word + phrase extraction, Translation Model (1|2|3|4|5), Language model, decoder, input sentences, output translations.]
Evaluation
◮ Given 2 language pairs, 2 translation directions and 5 translation models, we have 20 translation runs in total
  ◮ e.g. DE → EN using translation model ⟨WC, PT⟩
◮ Translation evaluated using three common automatic measures for MT
  ◮ BLEU [Papineni et al., 2002], NIST [Doddington, 2002] and METEOR [Banerjee and Lavie, 2005]
  ◮ Statistical significance calculated using bootstrap resampling [Zhang et al., 2004]
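The idea behind bootstrap resampling can be sketched as below. This is a generic paired bootstrap and not necessarily the exact procedure of [Zhang et al., 2004]; the function names, the per-sentence scores and the use of a simple mean as the metric are assumptions made for the example (BLEU and NIST are corpus-level metrics, so in practice `metric` would recompute the score from the resampled sentences).

```python
import random


def paired_bootstrap(scores_a, scores_b, metric, samples=1000, seed=0):
    """Fraction of resampled test sets on which system B outscores system A.

    scores_a / scores_b hold per-sentence statistics for the same test
    sentences; `metric` maps a list of them to a single score.
    """
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(samples):
        # Draw a test set of the same size, sampling sentences with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        if metric([scores_b[i] for i in idx]) > metric([scores_a[i] for i in idx]):
            wins += 1
    return wins / samples


def mean(values):
    return sum(values) / len(values)


# Invented per-sentence scores: system B is better on every sentence.
system_a = [0.10, 0.20, 0.30, 0.40]
system_b = [0.15, 0.25, 0.35, 0.45]
win_rate = paired_bootstrap(system_a, system_b, mean)
```

A win rate close to 1 across the resampled test sets suggests that the difference between the two systems is unlikely to be an artefact of the particular test sentences.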
Results
EuroParl: English → German
Configuration           BLEU     NIST     METEOR
⟨WC, PC⟩                0.1186   4.1168   0.3840
⟨WC + WT, PC + PT⟩      0.1259   4.3044   0.3938
Main points
◮ combining parallel treebank data with parallel corpus data improves significantly over the parallel-corpus-only baseline
◮ improvements are evident in both data sets and in both translation directions over all metrics
◮ Contradiction: “syntax-based mappings do not lead to better translations” [Koehn et al., 2003]
◮ comparing different translation models tells us more...
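To put the English → German scores in perspective, the relative gain of the combined model over the baseline can be computed directly from the table. The snippet is only a convenience calculation; the function name `relative_gain` is an assumption, and the scores are copied from the table above.

```python
# Scores copied from the English → German results table.
baseline = {"BLEU": 0.1186, "NIST": 4.1168, "METEOR": 0.3840}
combined = {"BLEU": 0.1259, "NIST": 4.3044, "METEOR": 0.3938}


def relative_gain(base, new):
    """Percentage improvement of `new` over `base`."""
    return 100.0 * (new - base) / base


gains = {m: round(relative_gain(baseline[m], combined[m]), 2) for m in baseline}
```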
Discussion
Word Alignments
EuroParl: German → English
Configuration   BLEU     NIST     METEOR
⟨WC, PC⟩        0.1622   4.9949   0.4344
⟨WT, PC⟩        0.1676   5.2324   0.4473
Using parallel treebank word pairs instead of parallel corpus word pairs leads to significantly higher scores. Why?
◮ more word pair tokens extracted from the parallel treebank
◮ parallel treebank yields twice as many unique word pair types
◮ parallel corpus word pairs not as useful for translation
Phrase Alignments
EuroParl: Spanish → English

Configuration   BLEU     NIST     METEOR
⟨WC, PC⟩        0.1754   4.7582   0.4802
⟨WC, PT⟩        0.1626   4.6606   0.4498

Using phrase pairs from the parallel treebank in place of
those from the parallel corpus does not improve
translation. Why?
◮ There are nearly four times as many phrase pairs extracted from
  the parallel corpora, on average
◮ This increases the system’s coverage and gives rise to the higher
  scores
◮ However, the higher scores achieved by the corpus phrase pairs
  are not proportionate to the larger number of phrase pairs
  (e.g. compared to word pairs)
◮ Very similar findings to those of [Groves, 2007]
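The "not proportionate" observation can be made concrete with a little arithmetic. The sketch below is illustrative only: it sets the corpus-to-treebank ratio of phrase pair types (taken from the phrase pair statistics in the backup slides) beside the relative BLEU difference between ⟨WC, PC⟩ and ⟨WC, PT⟩ for Spanish → English. Variable names are our own, and using the type counts rather than the token counts for the "four times" figure is an assumption.

```python
# Phrase pair type counts copied from the deck's "Phrase pair statistics" table.
phrase_types = {
    "EN-DE": {"corpus": 97_167, "treebank": 30_251},
    "EN-ES": {"corpus": 72_583, "treebank": 15_199},
}

# How many times more phrase pair types the corpus yields than the treebank.
size_ratio = {pair: c["corpus"] / c["treebank"] for pair, c in phrase_types.items()}
average_ratio = sum(size_ratio.values()) / len(size_ratio)

# BLEU scores for Spanish -> English from the table above.
bleu_corpus_phrases = 0.1754    # <WC, PC>
bleu_treebank_phrases = 0.1626  # <WC, PT>
relative_bleu_gain = (bleu_corpus_phrases - bleu_treebank_phrases) / bleu_treebank_phrases

print(f"average size ratio: {average_ratio:.2f}x")
print(f"relative BLEU gain: {100 * relative_bleu_gain:.1f}%")
```

Under these assumptions the corpus supplies roughly four times as many phrase pair types, while the relative BLEU difference stays below 10%.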
We obtained some insight into these results by manually
inspecting the extracted data
◮ Phrase pairs extracted from the parallel treebanks were of higher
  quality
◮ Phrase pairs extracted from the parallel corpora consisted mainly
  of function words, and had a very short average length (1.85
  words per phrase)
Example of the most frequent phrase pair unique to each
resource
◮ Parallel Corpus
  ◮ of the ⇔ de la
◮ Parallel Treebank
  ◮ the european union ⇔ la unión europea
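An average phrase length of this kind is simple to compute from a list of extracted phrases. The sketch below is a toy illustration, not the authors' procedure: it takes the five most frequent corpus-only English-German phrase pairs listed in the backup slides, assumes the number beside each pair is its frequency, counts punctuation as a token, and weights each phrase by that frequency. How the 1.85 figure was actually computed is not stated in the deck.

```python
# English side and assumed frequency of the five most frequent phrase pairs
# unique to the corpus (EN-DE), copied from the backup slide. A toy sample,
# not the full phrase table.
corpus_only = [
    ("of the", 398),
    ("that", 383),
    ("president ,", 163),
    ("mr president ,", 163),
    ("european union", 135),
]

def weighted_average_length(phrases):
    """Average number of whitespace-separated tokens, weighted by frequency."""
    total_tokens = sum(len(text.split()) * count for text, count in phrases)
    total_count = sum(count for _, count in phrases)
    return total_tokens / total_count

avg_len = weighted_average_length(corpus_only)
print(f"{avg_len:.2f} tokens per phrase")
```

Even this small sample is dominated by one- and two-token phrases, most of them built from function words or punctuation.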
Herein lies a big stumbling block
◮ We cannot (as yet) extract non-constituent phrase pairs, such as
  of the ⇔ de la, from the parallel treebank
◮ Constituent phrase pairs alone are not enough for translation
  [Koehn et al., 2003], which our results confirm
◮ But all is not lost...
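To see why a pair such as of the ⇔ de la is out of reach for tree-based extraction, it helps to check which word spans are covered by a single tree node. The sketch below is a minimal illustration under our own assumptions: the nested-tuple tree encoding, the hand-built parse and the function names are hypothetical and are not taken from the authors' aligner.

```python
def constituent_spans(tree, start=0):
    """Return (end, spans): spans holds the (start, end) token range of every node."""
    if isinstance(tree, str):          # a leaf word covers one token
        return start + 1, set()
    pos, spans = start, set()
    for child in tree[1:]:             # tree[0] is the node label
        pos, child_spans = constituent_spans(child, pos)
        spans |= child_spans
    spans.add((start, pos))
    return pos, spans

def leaves(tree):
    """Return the words of the tree in surface order."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree[1:] for word in leaves(child)]

def is_constituent(tree, phrase):
    """True if some occurrence of phrase is exactly the yield of one tree node."""
    words = leaves(tree)
    _, spans = constituent_spans(tree)
    n = len(phrase)
    return any(words[i:i + n] == phrase and (i, i + n) in spans
               for i in range(len(words) - n + 1))

# Hand-built parse of "the president of the european union" (hypothetical).
tree = ("NP",
        ("NP", ("DT", "the"), ("NN", "president")),
        ("PP", ("IN", "of"),
               ("NP", ("DT", "the"), ("JJ", "european"), ("NN", "union"))))

print(is_constituent(tree, ["the", "european", "union"]))
print(is_constituent(tree, ["of", "the"]))
```

In this parse "the european union" is the yield of an NP node, whereas "of the" straddles the PP and its embedded NP, so no single node (and hence no node alignment) corresponds to it.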
Ongoing work:
◮ Scaling up to 730,000 sentence pairs
◮ Weighting the parallel treebank word and phrase pairs in the
  translation model(s)
◮ Extracting non-constituent phrase pairs from the parallel
  treebank based on the existing alignments
Conclusions
◮ Parallel treebank word and phrase pairs improve translation
  quality when combined with traditional corpus-based extraction
◮ Parallel treebank word pairs are better for translation than those
  given by traditional word alignment
◮ Parallel treebank phrase pairs are too few to be used alone for
  translation
◮ Further improvement will come when we can get more from the
  treebank and use it better
Banerjee, S. and Lavie, A. (2005).
METEOR: An Automatic Metric for MT Evaluation with Improved Correlation
with Human Judgments.
In Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for
MT and/or Summarization at the 43rd Annual Meeting of the Association for
Computational Linguistics (ACL-05), Ann Arbor, MI.
Bikel, D. (2002).
Design of a Multi-lingual, parallel-processing statistical parsing engine.
In Human Language Technology Conference (HLT), San Diego, CA, USA.
Chrupala, G. and van Genabith, J. (2006).
Using Machine-Learning to Assign Function Labels to Parser Output for Spanish.
In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions,
pages 136–143, Sydney, Australia. Association for Computational Linguistics.
Doddington, G. (2002).
Automatic Evaluation of Machine Translation Quality Using N-gram
Co-Occurrence Statistics.
In Human Language Technology: Notebook Proceedings, pages 128–132, San
Diego, CA.
Groves, D. (2007).
Hybrid Data-Driven Models of Machine Translation.
PhD thesis, Dublin City University, Dublin, Ireland.
Hearne, M., Tinsley, J., Zhechev, V., and Way, A. (2007).
Capturing Translational Divergences with a Statistical Tree-to-Tree Aligner.
In Proceedings of the 11th International Conference on Theoretical and
Methodological Issues in Machine Translation (TMI-07), pages 83–94, Skövde,
Sweden.
Koehn, P. (2005).
Europarl: A Parallel Corpus for Statistical Machine Translation.
In Machine Translation Summit X, pages 79–86, Phuket, Thailand.
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N.,
Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A.,
and Herbst, E. (2007).
Moses: Open Source Toolkit for Statistical Machine Translation.
In Annual Meeting of the Association for Computational Linguistics (ACL),
demonstration session, pages 177–180, Prague, Czech Republic.
Koehn, P., Och, F. J., and Marcu, D. (2003).
Statistical Phrase-Based Translation.
In Proceedings of the 2003 Conference of the North American Chapter of the
Association for Computational Linguistics on Human Language Technology
(NAACL ’03), pages 48–54, Edmonton, Canada.
Och, F. J. and Ney, H. (2003).
A Systematic Comparison of Various Statistical Alignment Models.
Computational Linguistics, 29(1):19–51.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002).
BLEU: a Method for Automatic Evaluation of Machine Translation.
In Proceedings of the 40th Annual Meeting of the Association for
Computational Linguistics (ACL-02), pages 311–318, Philadelphia, PA.
Schmid, H. (2004).
Efficient Parsing of Highly Ambiguous Context-Free Grammars with Bit Vectors.
In Proceedings of the 20th International Conference on Computational
Linguistics (COLING 04), Geneva, Switzerland.
Stolcke, A. (2002).
SRILM - An Extensible Language Modeling Toolkit.
In Proceedings of the International Conference Spoken Language Processing,
Denver, CO.
Tinsley, J., Zhechev, V., Hearne, M., and Way, A. (2007).
Robust Language-Pair Independent Sub-Tree Alignment.
In Machine Translation Summit XI, pages 467–474, Copenhagen, Denmark.
Zhang, Y., Vogel, S., and Waibel, A. (2004).
Interpreting Bleu–NIST scores: How much improvement do we need to have a
better system?
In Proceedings of the 4th International Conference on Language Resources and
Evaluation, Lisbon, Portugal.
EuroParl: English → German

Configuration         BLEU     NIST     METEOR
⟨WC, PC⟩              0.1186   4.1168   0.3840
⟨WT, PT⟩              0.1055   4.1153   0.3796
⟨WC, PT⟩              0.1019   3.9778   0.3691
⟨WT, PC⟩              0.1242   4.2605   0.3931
⟨WC + WT, PC + PT⟩    0.1259   4.3044   0.3938
EuroParl: German → English

Configuration         BLEU     NIST     METEOR
⟨WC, PC⟩              0.1622   4.9949   0.4344
⟨WT, PT⟩              0.1498   5.1720   0.4327
⟨WC, PT⟩              0.1443   4.9342   0.4176
⟨WT, PC⟩              0.1676   5.2324   0.4473
⟨WC + WT, PC + PT⟩    0.1687   5.2474   0.4492
EuroParl: English → Spanish

Configuration         BLEU     NIST     METEOR
⟨WC, PC⟩              0.1765   4.8857   0.4515
⟨WT, PT⟩              0.1689   4.8662   0.4560
⟨WC, PT⟩              0.1634   4.6964   0.4440
⟨WT, PC⟩              0.1807   5.0389   0.4619
⟨WC + WT, PC + PT⟩    0.1867   5.0898   0.4701
EuroParl: Spanish → English

Configuration         BLEU     NIST     METEOR
⟨WC, PC⟩              0.1754   4.7582   0.4802
⟨WT, PT⟩              0.1708   4.8664   0.4659
⟨WC, PT⟩              0.1626   4.6606   0.4498
⟨WT, PC⟩              0.1840   4.9557   0.4910
⟨WC + WT, PC + PT⟩    0.1880   4.9923   0.4935
Word pair statistics

           EN-DE                  EN-ES
           Corpus     Treebank    Corpus     Treebank
#Tokens    69,200     79,675      37,339     43,312
#Types     7,672      18,286      5,056      11,274
COMP       1,929      12,545      904        7,131

Phrase pair statistics

           EN-DE                  EN-ES
           Corpus     Treebank    Corpus     Treebank
#Tokens    120,410    33,789      86,640     18,301
#Types     97,167     30,251      72,583     15,199
COMP       92,084     25,041      67,378     9,949
Unique corpus phrase pairs
of the                   ⇔   der                       398
that                     ⇔   , daß                     383
president ,              ⇔   präsident ,               163
mr president ,           ⇔   herr präsident ,          163
european union           ⇔   europäischen union        135

Unique treebank phrase pairs
the next item            ⇔   nach der tagesordnung     22
very much                ⇔   dank                      17
the union ’s             ⇔   der union                 17
the european union ’s    ⇔   der europäischen union    15
the sitting              ⇔   die sitzung               13