Derivative Sentence Breaking for Moore Alignment - Thai

Derivative Sentence Breaking for Moore Alignment: Introduction
Thai has no sentence‐breaking (SB) punctuation. But we need SB to use MT systems.
TH chunk: ก่อนอื่น ผมต้องยอมรับว่าทำไปโดยเจตนา อาจแฝงความอิจฉาริษยาอยู่บ้างก็ได้
EN chunk: To begin with let me admit that I did it on purpose. Perhaps it was partly from jealousy.

Splitting up the above chunks, we get:

EN sentences:
• To begin with let me admit that I did it on purpose.
• Perhaps it was partly from jealousy.

TH fragments:
• ก่อนอื่น
• ผมต้องยอมรับว่าทำไปโดยเจตนา
• อาจแฝงความอิจฉาริษยาอยู่บ้างก็ได้
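The splitting step can be sketched as follows. This is a deliberately naive sketch, not the actual preprocessing: it assumes EN sentences end in terminal punctuation and that TH fragments are whitespace-delimited (Thai marks phrase boundaries with spaces rather than punctuation).

```python
import re

def split_en_sentences(chunk):
    # Naive EN sentence breaking: split after terminal punctuation
    # followed by whitespace.
    return [s for s in re.split(r'(?<=[.!?])\s+', chunk.strip()) if s]

def split_th_fragments(chunk):
    # Thai has no SB punctuation; spaces delimit fragments,
    # so a chunk simply splits on runs of whitespace.
    return chunk.split()
```

A real pipeline would use a proper tokenizer, but this is enough to reproduce the chunk-to-fragment step of the running example.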
We want to align the EN sentences with the TH fragments so that we can train an MT system.
EN sentences:
• To begin with let me admit that I did it on purpose.
• Perhaps it was partly from jealousy.

TH fragments:
• ก่อนอื่น
• ผมต้องยอมรับว่าทำไปโดยเจตนา
• อาจแฝงความอิจฉาริษยาอยู่บ้างก็ได้
Sentence pairs to train an MT system:

EN sentence: To begin with let me admit that I did it on purpose.
TH “sentence”: ก่อนอื่น ผมต้องยอมรับว่าทำไปโดยเจตนา

EN sentence: Perhaps it was partly from jealousy.
TH “sentence”: อาจแฝงความอิจฉาริษยาอยู่บ้างก็ได้
Derivative Sentence Breaking for Moore Alignment: Hypothesis
But building an accurate sentence aligner is hard, so why not make use of a proven aligner, e.g. Moore’s sentence aligner?
Moore’s sentence aligner:
• Length + lexical aligner
• Requires no external knowledge of languages in corpus
• Domain‐independent
• Highly accurate, very fast
• Requires sentence‐broken input (a strict requirement, as we will see)
We decided to build a simple length‐based aligner, LA‐1, to provide Moore’s aligner with good enough sentence‐broken input. Our thinking was that Moore’s aligner would do the rest by filtering and jostling LA‐1’s alignments.
LA‐1 sentence aligner:
• Calculates scaling factor between length (in chars) of EN and TH text
• Assigns TH fragments to EN sentences according to scaling factor, monotonically
• If assigning next TH fragment to current EN sentence exceeds scaling factor, assigns to next EN sentence
• Unused portion of scaling factor from previous alignment brought to next alignment to keep source and target texts synchronized
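The bullet points above can be sketched as a short assignment loop. This is a minimal reconstruction from the description, not the actual LA‐1 implementation; joining assigned fragments with spaces is an assumption.

```python
def la1_align(en_sentences, th_fragments):
    """Sketch of LA-1: monotonic, length-based assignment of TH fragments
    to EN sentences, using a character-length scaling factor."""
    # Scaling factor between the total character lengths of the two texts.
    scale = sum(len(f) for f in th_fragments) / sum(len(s) for s in en_sentences)
    pairs = []
    i = 0        # index of the next unassigned TH fragment
    carry = 0.0  # unused budget carried over from the previous alignment
    for k, en in enumerate(en_sentences):
        # Character budget for this EN sentence, plus carried-over slack.
        budget = len(en) * scale + carry
        used, chunk = 0, []
        last = (k == len(en_sentences) - 1)
        while i < len(th_fragments) and (
            last                                      # last sentence takes the rest
            or not chunk                              # assign at least one fragment
            or used + len(th_fragments[i]) <= budget  # next fragment still fits
        ):
            used += len(th_fragments[i])
            chunk.append(th_fragments[i])
            i += 1
        carry = budget - used  # keep source and target synchronized
        pairs.append((en, " ".join(chunk)))
    return pairs
```

The carried-over slack is what keeps the cumulative TH length tracking the cumulative EN length instead of drifting over a long chunk.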
Our hypothesis:
“Using LA‐1 to provide good enough sentence‐broken input for Moore’s aligner will result in an MT training set that is accurate and encompasses a large fraction of the initial corpus, because Moore’s aligner will adjust the SBs generated by LA‐1 to produce accurate alignments.”
Derivative Sentence Breaking for Moore Alignment: Experiments
We ran experiments on four different pipelines:

Pipeline 1: EN sentences + TH fragments → MaxEnt SB (monolingual) → Moore’s Aligner → Giza++/Moses
Pipeline 2: EN sentences + TH fragments → LA‐1 aligner → Moore’s Aligner → Giza++/Moses
Pipeline 3 (Control Experiment 1): EN sentences + TH fragments → Moore’s Aligner → Giza++/Moses
Pipeline 4 (Control Experiment 2): EN sentences + TH fragments → LA‐1 aligner → Giza++/Moses

(MaxEnt SB is a monolingual Thai sentence breaker applied to the TH side; LA‐1 and Moore’s Aligner consume both sides.)
Derivative Sentence Breaking for Moore Alignment: Results
                      Pipeline 1      Pipeline 2      Pipeline 3      Pipeline 4
                                                      (control 1)     (control 2)
English sentences     48,736          48,736          48,736          48,736
Thai “sentences”      80,250          48,736          110,707         48,736
Moore running time    1:26:15         0:27:18         2:02:45         ‐
Moore threshold       0.85    0.90    0.85    0.90    0.85    0.90    ‐
Moore sentence pairs  9,960   8,714   21,991  21,309  10,240  7,900   ‐
Training sentences    9,489   8,314   17,031  16,414  9,592   7,426   37,442
BLEU score            4.13    4.63    4.77    5.01    1.65    2.19    1.29
OOV%                  14.15   15.43   5.12    6.31    14.08   17.05   4.71

(For Pipelines 1–3, the two sub‐columns give results at Moore thresholds 0.85 and 0.90; Pipeline 4 bypasses Moore’s aligner.)
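For reference, the OOV% row is presumably the percentage of test-set tokens that never occur in the training data; a minimal sketch (the whitespace tokenizer is an assumption, and real Thai text would need word segmentation first):

```python
def oov_rate(train_sentences, test_sentences):
    # Vocabulary of all tokens seen in training.
    vocab = {tok for sent in train_sentences for tok in sent.split()}
    # Count test tokens missing from that vocabulary.
    test_tokens = [tok for sent in test_sentences for tok in sent.split()]
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return 100.0 * unseen / len(test_tokens)
```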
Derivative Sentence Breaking for Moore Alignment: Conclusions
Conclusions:
• Hypothesis confirmed: Pipeline 2 outperforms Pipeline 1 in BLEU score, OOV error rate, and alignment running time.
• Control Experiment 1 shows that some form of SB is required for Thai.
• Control Experiment 2 shows that LA‐1 benefits from Moore’s subsequent alignments.
• LA‐1 extends Moore’s alignment technique by not requiring both languages to have deterministic SB, and it preserves Moore’s language and domain independence.

Caveats:
• Pipeline 2’s BLEU score and OOV error rate are computed on a larger training set than Pipeline 1 was able to generate.
• Moore’s aligner strongly prefers LA‐1 input over MaxEnt input. We speculate that LA‐1’s pre‐matched sentence lengths are alluring to Moore’s length‐based factor.
• Because Pipeline 1 achieves impressive performance despite this handicap, it would likely achieve significantly better results on a larger corpus; we leave this point for later examination.