Derivative Sentence Breaking for Moore Alignment: Introduction

Thai has no sentence-breaking (SB) punctuation, but we need SB to use MT systems.

TH chunk:
ก่อนอื่นผมต้องยอมรับว่าทำไปโดยเจตนา อาจแฝงความอิจฉาริษยาอยู่บ้างก็ได้
EN chunk:
To begin with let me admit that I did it on purpose. Perhaps it was partly from jealousy.

Splitting up the above chunks, we get:

EN sentences:
• To begin with let me admit that I did it on purpose.
• Perhaps it was partly from jealousy.
TH fragments:
• ก่อนอื่น
• ผมต้องยอมรับว่าทำไปโดยเจตนา
• อาจแฝงความอิจฉาริษยาอยู่บ้างก็ได้

We want to align the EN sentences with the TH fragments so that we can train an MT system:

EN sentences:                                            TH fragments:
To begin with let me admit that I did it on purpose.     ก่อนอื่น
                                                         ผมต้องยอมรับว่าทำไปโดยเจตนา
Perhaps it was partly from jealousy.                     อาจแฝงความอิจฉาริษยาอยู่บ้างก็ได้

Sentence pairs to train an MT system:

EN sentences:                                            TH "sentences":
To begin with let me admit that I did it on purpose.     ก่อนอื่น ผมต้องยอมรับว่าทำไปโดยเจตนา
Perhaps it was partly from jealousy.                     อาจแฝงความอิจฉาริษยาอยู่บ้างก็ได้

Derivative Sentence Breaking for Moore Alignment: Hypothesis

But building an accurate sentence aligner is hard, so why not make use of a proven aligner, e.g. Moore's sentence aligner?

Moore's sentence aligner:
• Length + lexical aligner
• Requires no external knowledge of the languages in the corpus
• Domain-independent
• Highly accurate, very fast
• Requires sentence-broken input (a strict requirement, as we will see)

We decided to build a simple length-based aligner, LA-1, to provide Moore's aligner with good-enough sentence-broken input. Our thinking was that Moore's aligner would do the rest by filtering and jostling LA-1's alignments.
LA-1 sentence aligner:
• Calculates a scaling factor between the lengths (in characters) of the EN and TH texts
• Assigns TH fragments to EN sentences monotonically, according to the scaling factor
• If assigning the next TH fragment to the current EN sentence would exceed the scaling factor, assigns it to the next EN sentence instead
• Carries the unused portion of the scaling factor from the previous alignment over to the next alignment, to keep the source and target texts synchronized

Our hypothesis: "Using LA-1 to provide good-enough sentence-broken input for Moore's aligner will result in an MT training set which is accurate and encompasses a large fraction of the initial corpus, as Moore's aligner will adjust the SBs generated by LA-1 to generate accurate alignments."

Derivative Sentence Breaking for Moore Alignment: Experiments

We ran experiments on 4 different pipelines:

Pipeline 1:                        EN sentences + TH fragments → MaxEnt SB (monolingual) → Moore's aligner → Giza++/Moses
Pipeline 2:                        EN sentences + TH fragments → LA-1 aligner → Moore's aligner → Giza++/Moses
Pipeline 3 (Control Experiment 1): EN sentences + TH fragments → Moore's aligner → Giza++/Moses
Pipeline 4 (Control Experiment 2): EN sentences + TH fragments → LA-1 aligner → Giza++/Moses

Derivative Sentence Breaking for Moore Alignment: Results

                        Pipeline 1       Pipeline 2       Pipeline 3         Pipeline 4
                                                          (control exp. 1)   (control exp. 2)
English sentences       48,736           48,736           48,736             48,736
Thai "sentences"        80,250           48,736           110,707            48,736
Moore running time      1:26:15          0:27:18          2:02:45            -
Moore threshold         0.85 / 0.90      0.85 / 0.90      0.85 / 0.90        -
Moore sentence pairs    9,960 / 8,714    21,991 / 21,309  10,240 / 7,900     -
Training sentences      9,489 / 8,314    17,031 / 16,414  9,592 / 7,426      37,442
BLEU score              4.13 / 4.63      4.77 / 5.01      1.65 / 2.19        1.29
OOV%                    14.15 / 15.43    5.12 / 6.31      14.08 / 17.05      4.71

Derivative Sentence Breaking for Moore Alignment: Conclusions

Conclusions:
• Hypothesis adopted: Pipeline 2 outperforms Pipeline 1 in BLEU score, OOV error rate, and alignment running time.
• Control experiment 1 shows that some form of SB is required for Thai.
• Control experiment 2 shows that LA-1 benefits from Moore's subsequent alignments.
• LA-1 extends Moore's alignment technique by not requiring both languages to have deterministic SB, and it preserves the technique's language and domain independence.

Caveats:
• Pipeline 2's BLEU score and OOV error rate are generated on a larger training set than Pipeline 1 was able to generate.
• Moore's aligner strongly prefers LA-1 input over MaxEnt input. We speculate that LA-1's pre-matched sentence lengths are alluring to Moore's length-based factor.
• Because Pipeline 1 achieves impressive performance despite this handicap, it would likely achieve significantly better results on a larger corpus; we leave this point for later examination.
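As a rough illustration, the length-based assignment scheme that the LA-1 bullets describe might be sketched as follows. This is our reconstruction from the description above, not the actual LA-1 implementation; all function and variable names are ours.

```python
def la1_align(en_sentences, th_fragments):
    """Assign TH fragments to EN sentences monotonically by character length.

    A scaling factor (TH chars per EN char) gives each EN sentence a "budget"
    of TH characters; fragments are appended to the current EN sentence until
    the budget would be exceeded, and any unused budget carries over to the
    next sentence so the two sides stay synchronized.
    """
    en_total = sum(len(s) for s in en_sentences)
    th_total = sum(len(f) for f in th_fragments)
    scale = th_total / en_total  # expected TH characters per EN character

    pairs = []           # (EN sentence, concatenated TH fragments)
    frag_iter = iter(th_fragments)
    frag = next(frag_iter, None)
    carry = 0.0          # unused budget carried from the previous alignment

    for en in en_sentences:
        budget = carry + len(en) * scale  # TH chars allotted to this sentence
        used = 0
        chunk = []
        # Take fragments while they fit in the budget; always take at least
        # one fragment (if any remain) so the assignment stays monotonic.
        while frag is not None and (not chunk or used + len(frag) <= budget):
            chunk.append(frag)
            used += len(frag)
            frag = next(frag_iter, None)
        carry = budget - used
        pairs.append((en, " ".join(chunk)))

    # Any leftover fragments attach to the final sentence so no text is lost.
    rest = []
    while frag is not None:
        rest.append(frag)
        frag = next(frag_iter, None)
    if rest and pairs:
        last_en, last_th = pairs[-1]
        pairs[-1] = (last_en, (last_th + " " + " ".join(rest)).strip())
    return pairs
```

On the running example, this assigns the first two TH fragments to the first EN sentence (their combined length fits its scaled budget) and the third fragment to the second, reproducing the sentence pairs shown in the Introduction.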