Projet ANR-12-CORD-0015
TRANSREAD
Lecture et interaction bilingues
enrichies par les données d'alignement
LIVRABLE 1.3
TOWARDS IMPROVING SENTENCE
ALIGNMENT FOR LITERARY WORKS
Novembre 2015
Yong Xu - François Yvon
Abstract
In the TransRead (Lecture et interaction bilingues enrichies par les données d’alignement, ANR-12-CORD-0015) project, automatic sentence alignment is an important preprocessing step. However, it
remains difficult to align certain types of parallel corpora, in particular literary bi-texts, which constitute
the application context of TransRead. State-of-the-art tools deliver unsatisfactory results for this kind
of bi-text, since most of them were designed to align large amounts of institutional parallel corpora. By
assuming a strong translational regularity between the two sides, they can rely on limited resources
and scarce features to achieve acceptable alignment quality, while maintaining high processing speed.
For literary bi-texts, however, the assumption of strong translational regularity often does not hold. It is
hence necessary to design more sophisticated parallelism indicators, potentially incorporating knowledge
extracted from external resources that are nowadays widely available.
In this report, we investigate novel sentence alignment methods for bilingual literary works. We
propose to take advantage of publicly available parallel corpora to train better alignment link classifiers,
in which more complex features can be employed. We further explore the impact of dependencies between
consecutive alignment links, a question that has been considered only recently, using a 2D Conditional
Random Field (CRF) model. These techniques have led to statistically significant improvements over
state-of-the-art methods on our evaluation corpora.
Résumé
In the TransRead project (Lecture et interaction bilingues enrichies par les données d’alignement, ANR-12-CORD-0015), automatic sentence alignment is an important step of the preprocessing
chain. However, it remains difficult to align certain types of parallel corpora, in particular literary
bi-texts, which constitute the application context of TransRead. For these bi-texts, existing state-of-the-art
tools deliver only unsatisfactory results, as most of them were initially designed to align large volumes
of institutional parallel corpora. Relying on the strong translational regularity of such corpora, they
often manage to compute good-quality alignments efficiently, using only simple features and limited
resources. In a literary bi-text, however, the translational regularity is often not strong enough. It is
therefore necessary to design more powerful parallelism indicators, potentially integrating knowledge
extracted from the external resources that are widely available today.
In this technical report, we describe new sentence alignment methods for literary bi-texts. We propose
to train better alignment link classifiers using publicly available parallel corpora, which allows us to use
relatively complex features. We then explore the impact of dependencies between consecutive links, a
topic that has only recently been studied, relying on a two-dimensional Conditional Random Field
(Conditional Random Fields, CRF) model. These techniques have led to statistically significant
improvements over state-of-the-art methods on reference evaluation corpora.
Towards Improving Sentence Alignment for Literary Works
Yong Xu
François Yvon
LIMSI, CNRS, Université Paris-Saclay, F-91405 Orsay
{yong, yvon}@limsi.fr
1 Introduction
In the digital era, more and more books are becoming available in electronic form. Widely used devices,
such as the Amazon Kindle® and the Kobo Touch®, have made e-books an accepted reading option for the
general public. Works of fiction account for a major part of the e-book market. 1 Global economic and
cultural exchanges also facilitate the dissemination of literary works: many works of fiction nowadays
target an international audience, and successful books are pre-sold and translated very rapidly to reach the
largest possible readership. 2 Multiple versions of e-books constitute a highly valuable resource for a number
of uses, such as language learning [Kraif and Tutin, 2011] or translation studies.
While reading a novel in the original language often helps to better appreciate its content and spirit,
a non-native reader may come across many obstacles. Some expressions can be difficult to render or even
translate in a foreign language (consider idioms, or jargons); some sentences use very complex structures;
some paragraphs convey highly subjective and delicate arguments; even worse, some authors tend to use a
very rich vocabulary, which is difficult to match for foreign readers. Under those circumstances, an alignment
between two versions of the same book could prove very helpful [Pillias and Cubaud, to appear, 2015]: A
reader, upon encountering a difficult fragment in the original language, would then be able to refer to its
translation in a more familiar language. 3 For the TransRead project, whose goal is to develop approaches
to facilitate multilingual reading, better alignment technologies are thus of great importance.
In this study, we focus on the task of sentence alignment for literary works. Sentence alignment aims at
identifying correspondences between small groups of sentences in bi-texts made of a source text and their
corresponding translation: such links often match one sentence with one sentence (henceforth 1:1 links),
but more complex alignments are also relatively common. This task, generally considered as relatively easy,
has received much attention in the early days of word-based Statistical Machine Translation (SMT), driven
by the need to obtain large amounts of parallel sentences to train translation models. Most approaches to
sentence alignment are unsupervised and share two important assumptions which help make the problem
computationally tractable: (a) a restricted number of link types suffice to capture most alignment patterns,
the most common types being 1 : 1 links, then 1 : 2 or 2 : 1, etc.; (b) the relative order of sentences is the
same on the two sides of the bi-text. From a bird’s eye view, alignment techniques can be grouped into
two main families: on the one hand, length-based approaches [Gale and Church, 1991, Brown et al., 1991]
exploit the fact that the translation of a short (resp. long) sentence is short (resp. long). On the other hand,
lexical matching approaches [Kay and Röscheisen, 1993, Chen, 1993, Simard et al., 1993, Melamed, 1999, Ma,
2006] identify sure anchor points for the alignment using bilingual dictionaries or crude surface similarities
between word forms. Length-based approaches are fast but error-prone, while lexical matching approaches
seem to deliver more reliable results but at higher computational costs. The majority of the state-of-the-art
approaches to the problem [Langlais, 1998, Simard and Plamondon, 1998, Moore, 2002, Varga et al., 2005,
Braune and Fraser, 2010, Lamraoui and Langlais, 2013] combine both types of information.
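The length-based idea can be made concrete with a small sketch. The cost below follows the character-length model of Gale and Church [1991] (a normal deviate of the length difference, using their reported variance s² = 6.8); the exact normalization is illustrative and not a reimplementation of any tool in this panel.

```python
import math

# Constants in the spirit of Gale & Church (1991): translations are roughly
# length-preserving (c ~ 1), with per-character variance s2 ~ 6.8.
C, S2 = 1.0, 6.8

def length_cost(len_e, len_f):
    """Negative log-probability that two segments (character lengths
    len_e, len_f) are mutual translations, based on lengths only."""
    if len_e == 0 and len_f == 0:
        return 0.0
    mean = (len_e + len_f * C) / 2.0
    delta = (len_f - len_e * C) / math.sqrt(mean * S2)
    # Two-tailed normal probability of a deviation at least |delta|
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(delta) / math.sqrt(2.0))))
    return -math.log(max(p, 1e-12))

# A pair of similar lengths is much cheaper than a badly mismatched one.
assert length_cost(40, 42) < length_cost(40, 120)
```

Dynamic programming over such per-pair costs is what makes length-based aligners fast; the cost says nothing about content, which is why they are error-prone on free translations.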
The goal of these methods, however, is primarily to deliver high-precision parallel sentence pairs to fuel
SMT systems or feed translation memories in specialized domains. They can then safely prune blocks of
1. About 70 % of the top 50,000 bestselling e-books on Amazon are in the ’fiction’ category (source: http://authorearnings.com/report/the-50k-report/).
2. For instance, J. K. Rowling’s Harry Potter has already been translated into over 70 languages.
3. An example implementation is at http://www.doppeltext.com/.
sentence pairs whenever their alignment is uncertain; some methods even target only high-confidence 1:1
alignment links. A significant part of recent developments in sentence alignment has tried to make the
large-scale harvesting of parallel sentence pairs work also for noisy parallel data [Mujdricza-Maydt et al.,
2013], as well as for comparable bilingual corpora [Munteanu and Marcu, 2005, Smith et al., 2010], using
supervised learning techniques.
While these restrictions are reasonable for the purpose of training SMT systems, 4 for other applications,
such as bi-text visualization, translator training, or automatic translation checking, the alignment of the entire
bi-text should be computed. This is especially the case for multilingual works of fiction: the parts that are
more difficult for automatic alignment algorithms (usually involving highly non-literal translations or large
blocks of insertions/deletions) often correspond to parts where a reader might also look for help. For such
tasks, it seems that both precision and recall are to be maximized.
In this report, we reconsider the task of full-text sentence alignment with the goals to: (a) re-evaluate the
actual performance of state-of-the-art methods for literary texts, both in terms of their precision and recall,
using large collections of publicly available novels; (b) develop, analyse and improve an algorithm initially
introduced in [Yu et al., 2012a,b], which was shown to outperform a significant sample of sentence alignment
tools on a set of manually aligned corpora; (c) develop and analyse a 2D CRF model for sentence alignment,
which takes the dependencies between consecutive alignment links into account.
The rest of this report is organized as follows: we briefly review, in Section 2, several state-of-the-art
sentence alignment tools and evaluate their performance. While part of this work has been reported in the
technical report 1.2 [Xu and Yvon, 2014], in this report we describe the tools in more detail and evaluate them on
larger corpora. Section 3 presents our two-pass alignment algorithm; one of its distinguishing features is its use
of external resources to improve its decisions. Section 4 presents our proposal to use a 2D CRF model for
sentence alignment, which is able to learn patterns between consecutive links. Experiments on two language
pairs (English-French and English-Spanish) are presented and discussed in Section 5, before we recap our
main findings and discuss further prospects in Section 6.
2 Evaluating state-of-the-art sentence alignment tools
In this section, we describe several state-of-the-art tools and evaluate their performance on literary bi-texts. We also discuss the evaluation metrics and test sets. This section is an extension of [Xu and Yvon,
2014].
2.1 Baseline tools
Baseline alignments are computed using several open-source sentence alignment packages, described
respectively in [Melamed, 1999] 5 (GMA), in [Moore, 2002] 6 (BMA), in [Varga et al., 2005] 7 (Hunalign),
in [Braune and Fraser, 2010] 8 (Gargantua, or Garg for short), and in [Lamraoui and Langlais, 2013] 9
(Yasa). These tools constitute, we believe, a representative panel of the current state-of-the-art in sentence
alignment. Note that we leave aside here approaches inspired by Information Retrieval techniques such as
[Bisson and Fluhr, 2000], which models sentence alignment as a cross-language information retrieval task,
as well as the MT-based approach of Sennrich and Volk [2010], also based on some automatic translation of
the “source” text, followed by a monolingual matching step.
GMA, introduced in [Melamed, 1999], is the oldest approach included in this panel, and yet one of
the most effective: assuming "sure" lexical anchor points in the bi-text map, obtained e.g. using bilingual
dictionaries or cognate-based heuristics, GMA greedily builds a so-called "sentence map" of the bi-text, trying
to include as many anchors as possible, while also remaining close to the bi-text "diagonal". A post-processing
step will take sentence boundaries into account to deliver the final sentence alignment. Note that GMA uses
4. The work by Uszkoreit et al. [2010], however, shows that this procedure actually discards a lot of useful data.
5. http://nlp.cs.nyu.edu/GMA/
6. http://research.microsoft.com/en-us/downloads/aafd5dcf-4dcc-49b2-8a22-f7055113e656/
7. http://mokk.bme.hu/en/resources/hunalign/
8. http://sourceforge.net/projects/gargantua/
9. http://rali.iro.umontreal.ca/rali/?q=en/yasa
no length cues, and also that it has been shown to perform particularly well at spotting large omissions in
a bi-text [Melamed, 1996].
The approach of Moore [2002] implements a two-pass, coarse-to-fine, strategy: a first pass, based on
sentence length cues, computes a first alignment according to the principles of length-based approaches
[Brown et al., 1991, Gale and Church, 1991]. This initial alignment is used to train a simplified version of
IBM model 1 [Brown et al., 1993], which provides the alignment system with lexical association scores. These
scores are then used to refine the measure of association between sentences. This approach is primarily aimed
at delivering high-confidence, 1:1 sentence alignments to be used as training material for data-intensive MT.
Sentences that cannot be reliably aligned are discarded from the resulting alignment.
Hunalign is described in [Varga et al., 2005]. It also implements a two-pass strategy which resembles
Moore’s approach. Their main difference is that Hunalign also produces many-to-one and one-to-many
alignment links, which are needed to ensure that all the input sentences are actually aligned.
Gargantua [Braune and Fraser, 2010] is, similarly to the approach of Deng et al. [2007] and our own
approach, an attempt to improve the final steps of Moore’s algorithm. The authors propose a two-pass
unsupervised approach that works along the following lines: (a) search for an optimal alignment considering
only links made of at most one sentence in each language (including null links); (b) heuristically improve
this initial solution by merging adjacent links. A key observation in this work is that step (a) can be very
fast. Like most works in this vein, this approach requires null links to be modeled explicitly, and is prone to missing
large untranslated portions on one side of the bi-text.
The work of Lamraoui and Langlais [2013] also performs multiple passes over the data, but proceeds
in the reverse order: predefined lexical associations (from a bilingual dictionary or from so-called cognates)
are first used to prune the alignment search space to a restricted number of near-diagonal alignments (the
diagonal is defined with respect to these sure anchor alignment points). The second pass then performs
dynamic programming search with the additional help of a length-based model. In spite of its simplicity,
this computationally lightweight approach is reported to perform remarkably well by Lamraoui and Langlais
[2013].
2.2 Evaluation metrics
Sentence alignment tools are usually evaluated using standard recall (R) and precision (P) measures,
combined in the F-measure (F), with respect to some manually defined gold alignment [Véronis and Langlais,
2000]. These measures can be computed at various levels of granularity: at the level of alignment links, of
sentences, of words, and of characters. As gold references only specify alignment links, the other references
are automatically derived in the most inclusive way. As a side effect, all metrics but the link-level ones ignore
null alignments. 10 The results in this report are therefore based solely on the link-level F-measure, so as to
reflect the importance of correctly predicting unaligned sentences in our targeted applicative scenario. 11
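For concreteness, link-level precision, recall and F-measure against a gold alignment can be computed as below; the link encoding (a pair of sentence-index tuples, one per side, with an empty tuple for null links) is a representation chosen for this sketch, not the format of any particular tool.

```python
def link_f_measure(hyp_links, ref_links):
    """Link-level precision, recall and F-measure.

    A link is a pair of sentence-index tuples, e.g. ((0,), (0, 1)) for a
    1:2 link and ((2,), ()) for a null link. Only exact matches count,
    so null links are scored like any other link.
    """
    hyp, ref = set(hyp_links), set(ref_links)
    correct = len(hyp & ref)
    p = correct / len(hyp) if hyp else 0.0
    r = correct / len(ref) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

ref = [((0,), (0,)), ((1,), (1, 2)), ((2,), ())]   # includes a null link
hyp = [((0,), (0,)), ((1,), (1,)), ((2,), (2,))]
p, r, f = link_f_measure(hyp, ref)
assert (p, r) == (1/3, 1/3)
```

Note how the partially correct guesses for the last two sentences earn no credit at the link level, which is exactly the strictness we want for unaligned sentences.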
2.3 Evaluation corpora
The performance of sentence alignment algorithms is typically evaluated on reference corpora for which a
gold alignment is provided. Manual alignments constitute the most reliable references, but are quite rare. In
this work, we have used two sets of manual alignments of literary works: one is an extract of the BAF corpus
[Simard, 1998], consisting of one book by Jules Verne (De la terre à la lune); the other has been developed
for a preliminary study described in [Yu et al., 2012a], and is made up of 4 novels translated from French
into English and 3 from English into French. 12 These two sets are both relatively small, and only contain
bi-texts in English and French. As these corpora were manually aligned, they may contain links of arbitrary
types. These gold references constitute our main source of evidence for comparing the various algorithms
used in this study.
10. Assume, for instance, that the reference alignment links the pair of Foreign sentences (f1 , f2 ) to the single English sentence
e: reference sentence-level alignments will contain both (f1 , e) and (f2 , e); likewise, reference word-level alignments will contain
all the possible word alignments between tokens in the source and the target side, etc. For such metrics, missing the alignment
of a large "block" of sentences is more harmful than missing a small one; likewise, misaligning short sentences is less penalized
than misaligning longer ones.
11. This differs from [Xu and Yvon, 2014], where we also reported performance using the sentence-level F-measure.
12. This dataset is now released on the Web site of TransRead: https://transread.limsi.fr/resources.html.
For the purpose of a larger scale evaluation, we have also made use of two much larger, multi-parallel,
corpora of publicly available books that are available on the Internet. 13 The corpus auto en-fr contains
novels in English and French, and the corpus auto en-es contains novels in English and Spanish. These
corpora are only imperfectly aligned, and are used to provide approximations of the actual alignment quality.
Table 1 provides basic statistics for all these evaluation sets.
corpus          # books   lang.    # links   # sent. en   # sent. fr or es
BAF                   1   en-fr      2,520        2,554              3,319
manual en-fr          7   en-fr      1,790        1,970              2,100
auto en-fr           24   en-fr     75,731      129,022            126,561
auto en-es           17   en-es     61,181      102,545            104,216

Table 1 – Corpus statistics
Note that there is an obvious discrepancy in the BAF corpus between the number of sentences on the
two sides of the bi-text, which makes the automatic alignment of this book especially challenging. Also note
that in the larger corpora, auto en-fr and auto en-es, manual alignment links are defined at the level of
paragraphs, rather than at the level of sentences.
2.4 Baseline evaluation
We evaluate the performance of the baseline sentence alignment tools on these four corpora, using the
standard link-level F-measure. As explained above, the two larger corpora are aligned at the paragraph level,
meaning that such resources cannot be readily used to compute alignment quality scores. Our solution has
been to refine this coarse alignment by running the Gale and Church [1991] alignment program to compute
within-paragraph sentence alignments, keeping the paragraph alignments unchanged from the reference.
This approach is similar to the procedure used to align the Europarl corpus at the sentence level [Koehn,
2005], where reliable paragraph boundaries are readily derived from speaker turns or session changes. As a
result, these semi-automatic references only contain a restricted number of link types, as computed by the
Gale and Church [1991] program: 1:0, 0:1, 1:1, 1:2, and 2:1. We then take these partially correct alignments
as pseudo-references for the purpose of evaluating alignment tools - keeping in mind that the corresponding
results will only be approximate. Our main evaluation results are in Table 2. For more details, see Table 11,
and (in the appendix) Tables 16 and 17.
          BAF     manual en-fr            auto en-fr              auto en-es
                 min    max   mean      min    max   mean      min    max   mean
GMA      61.4   53.5   92.8   79.6     62.1   99.5   88.7     60.3   96.5   82.8
BMA      73.6   57.4   91.5   74.9     47.1   98.4   84.0     48.8   98.0   78.4
Hun      71.2   54.3   92.6   74.5     56.6   99.5   87.9     43.7   96.4   81.0
Garg     65.6   51.7   97.1   80.2     56.4   98.1   88.7     60.9   98.8   80.5
Yasa     75.7   59.9   95.6   79.1     62.3   98.8   89.6     58.3   98.4   82.7

Table 2 – Baseline evaluation results
Regarding the gold corpus (manual en-fr), the corresponding numbers in Table 2 show that the alignment
problem is far from solved, with an average F-score around 80 for the three best systems (Garg, GMA and
Yasa).
On the two larger corpora, a legitimate question concerns the reliability of numbers computed using
semi-automatic references. To this end, we manually aligned excerpts of three more books: Lewis Carroll’s
13. See http://www.farkastranslations.com/bilingual_books.php
Alice’s Adventures in Wonderland, Arthur Conan Doyle’s The Hound of the Baskervilles, and Edgar Allan
Poe’s The Fall of the House of Usher, for a total of 1,965 sentences on the English side. We have then
computed the difference in performance observed when replacing the gold alignments with semi-automatic
ones. For these three books, the average difference between the evaluation on pseudo-references and on actual
references is less than 2 points in F-measure; furthermore, these differences are consistent across algorithms.
A second comforting observation is that the same ranking of baseline tools is observed across the board:
Gargantua, GMA and Yasa tend to produce comparable alignments, outperforming Hunalign and BMA by
approximately 2 to 3 F-measure points. It is also worth pointing out that on the two large datasets, less
than 3% of the sentence links computed by BMA actually cross the reference paragraph boundaries, which
warrants our assumption that BMA actually computes sure 1 : 1 links. Note that even for "easy" books,
the performance falls short of what is typically observed for technical documents, with an F-measure hardly
reaching 0.95; for difficult ones (such as Jane Austen’s Pride and Prejudice), the best F-measure can be as
low as 0.62. From this large-scale experiment, we can conclude that sentence alignment for literary texts
remains challenging, even for relatively easy language pairs such as English-French or English-Spanish. It is
expected that sentence alignment can only be more difficult when involving languages that are historically
or typologically unrelated, or that use different scripts.
3 A MaxEnt-Based Algorithm
We present in this section an approach to obtain high-quality alignments, based on the maximum entropy
(MaxEnt) model. We borrow from [Yu et al., 2012a] the idea of a two-pass alignment process: the first pass
computes high-confidence 1:1 links and outputs a partially aligned bitext containing sure links and residual
gaps, i.e. parallel blocks of non-aligned sentences. These small blocks are then searched using a more
computationally costly, 14 but also more precise, model, so as to recover the missing links: we propose,
following again our previous work, to evaluate possible intra-block alignments using a MaxEnt model trained here on a large external corpus using a methodology and features similar to [Munteanu and Marcu,
2005, Smith et al., 2010, Mujdricza-Maydt et al., 2013]. These steps are detailed below.
3.1 Computing sure 1-1 links
As in many existing tools implementing a multi-step strategy, the main purpose of the first step is to
provide a coarse alignment, in order to restrict the search space of the subsequent steps. In our approach,
the links computed in the first step are mainly used as anchor points, which makes the more costly search
procedure used in the second step feasible. Since we do not reevaluate these anchor links, they should be as
reliable as possible.
Our current implementation uses the algorithm of Moore [2002] to identify these sure anchors: as explained in Section 2.1, this algorithm tends to obtain very good precision, at the expense of a less satisfactory
recall. Furthermore, Moore’s algorithm also computes posterior probabilities for every possible link, which
are then used as confidence scores. The good precision of this system is due to its discarding all links with a
posterior probability smaller than 0.5. As explained below, such confidence measures can be used to control
the quality of the anchor points that are used downstream. The result of this first step is illustrated in
Table 3.
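A minimal sketch of this anchor-selection step, assuming a hypothetical (en_index, fr_index, posterior) triple format for the first-pass links (Moore's tool does not expose this exact interface):

```python
def select_anchors(links, threshold=0.5):
    """Keep as anchors only the 1:1 links whose posterior probability
    exceeds the threshold; everything between two consecutive anchors
    becomes a gap for the second pass."""
    return [(e, f) for e, f, post in links if post > threshold]

links = [(0, 0, 0.99), (1, 1, 0.42), (2, 3, 0.87)]
anchors = select_anchors(links)
assert anchors == [(0, 0), (2, 3)]
```

Raising the threshold trades recall of the first pass for anchor reliability, which is the control knob mentioned above.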
3.2 A MaxEnt Model for Parallel Sentences
Any sentence alignment method needs, at some point, to assess the level of parallelism of a sentence
pair, based on a surface description of these sentences. As discussed above, two kinds of clues are widely
employed in existing systems to perform such assessment: sentence lengths and lexical information. Most
dynamic programming-based approaches further impose a prior probability distribution on link types [Gale
and Church, 1993].
We propose a system that combines all the available clues in a principled, rather than heuristic, way, using
a MaxEnt model. 15 For any pair of sentences l = (e, f ), the model computes a link posterior probability
14. Too costly, in fact, to be used in the first pass over the full bitext.
15. We used the implementation from http://homepages.inf.ed.ac.uk/lzhang10/maxenttoolkit.html.
en1  Poor Alice!
en2  It was as much as she could do, lying down on one side, to look through into the garden with one eye; but to get through was more hopeless than ever: she sat down and began to cry again.
en3  “You ought to be ashamed of yourself,” said Alice, “a great girl like you,” (she might well say this), “to go on crying in this way!
en4  Stop this moment, I tell you!”
en5  But she went on all the same, shedding gallons of tears, until there was a large pool all round her, about four inches deep and reaching half down the hall.

fr1  Pauvre Alice !
fr2  C’est tout ce qu’elle put faire, après s’être étendue de tout son long sur le côté, que de regarder du coin de l’oeil dans le jardin.
fr3  Quant à traverser le passage, il n’y fallait plus songer.
fr4  Elle s’assit donc, et se remit à pleurer.
fr5  «Quelle honte !» dit Alice.
fr6  «Une grande fille comme vous» («grande» était bien le mot) «pleurer de la sorte !
fr7  Allons, finissez, vous dis-je ! »
fr8  Mais elle continua de pleurer, versant des torrents de larmes, si bien qu’elle se vit à la fin entourée d’une grande mare, profonde d’environ quatre pouces et s’étendant jusqu’au milieu de la salle.

Table 3 – An example alignment computed by Moore’s algorithm for Alice’s Adventures in Wonderland.
The first and third anchor links delineate a 2 × 5 gap containing 2 English and 5 French sentences.
p(Y = y|e, f ), where Y is a binary variable for the existence of an alignment link. The rationale for using
MaxEnt is (a) that it is possible to efficiently integrate as many features as desired into the model, and (b)
that we expect that the resulting posterior probabilities will be less peaked towards extreme values than
what we have observed with generative alignment models such as Moore’s model. We detail the features
used in this model in Section 5.2.1.
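As a toy illustration of such a classifier, the sketch below fits a binary MaxEnt model (equivalently, a logistic regression trained by gradient ascent) on two hand-crafted parallelism indicators; the feature set and the tiny lexicon are hypothetical stand-ins for the features of Section 5.2.1.

```python
import math

def features(e, f, lexicon):
    """Two crude parallelism indicators for a sentence pair (hypothetical
    feature set; substring matching stands in for real lexical lookup)."""
    le, lf = len(e.split()), len(f.split())
    covered = sum(1 for w in e.split()
                  if any(t in f for t in lexicon.get(w, ())))
    return [1.0,                          # bias
            min(le, lf) / max(le, lf),    # length ratio
            covered / max(le, 1)]         # lexical coverage of the English side

def train_maxent(data, epochs=1000, lr=0.5):
    """Binary MaxEnt fitted by plain stochastic gradient ascent
    on the conditional log-likelihood."""
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            for i, xi in enumerate(x):
                w[i] += lr * (y - p) * xi
    return w

def posterior(w, x):
    """p(Y = 1 | e, f) under the trained model."""
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))

lexicon = {"cat": ("chat",), "sleeps": ("dort",)}
pos = features("the cat sleeps", "le chat dort", lexicon)
neg = features("the cat sleeps", "il pleuvait des cordes hier soir", lexicon)
w = train_maxent([(pos, 1), (neg, 0)])
assert posterior(w, pos) > 0.8 > 0.2 > posterior(w, neg)
```

The point of the sketch is structural: arbitrarily many real-valued features can be added to the feature vector without changing the training or inference code, which is property (a) above.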
A second major difference with existing approaches is our use of a very large set of high-confidence
alignment links to train the MaxEnt model. Indeed, most sentence alignment systems are endogenous,
meaning that they only rely on information extracted from the bitext under consideration. While this design
choice was probably legitimate in the early 1990’s, it is much more difficult to justify it now, given the
wide availability of sentence-aligned parallel data, such as the Europarl corpus [Koehn, 2005], which can
help improve alignment systems. This is a departure from our own past work, where the training data for
MaxEnt was identified during the first pass, resulting in small and noisy datasets: Yu et al. [2012a] observe
that, in an extreme case of poor first-pass alignment, MaxEnt was trained using only 4 positive examples.
To better match our main focus, which is the processing of literary texts, we collected positive instances
from the same publicly available source, extracting all 1-sentence paragraphs as reliable alignment pairs.
This resulted in a gold set of approximately 18,000 sentence pairs for French/English (out of a grand total
of 125,000 sentence pairs). Negative examples are more difficult to obtain and are generated artificially as
explained in Section 5.2.2.
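The extraction of positive instances can be sketched as follows, under the assumption (made for this illustration) that each paragraph-aligned book is represented as a list of (English paragraph, Foreign paragraph) pairs, a paragraph being a list of sentences:

```python
def positive_examples(aligned_books):
    """Collect reliable 1:1 training pairs from paragraph-aligned bi-texts:
    when an aligned paragraph pair contains exactly one sentence on each
    side, that sentence pair is necessarily a 1:1 link."""
    pairs = []
    for book in aligned_books:
        for en_par, fr_par in book:
            if len(en_par) == 1 and len(fr_par) == 1:
                pairs.append((en_par[0], fr_par[0]))
    return pairs

book = [(["Poor Alice!"], ["Pauvre Alice !"]),
        (["s1", "s2"], ["t1"])]
assert positive_examples([book]) == [("Poor Alice!", "Pauvre Alice !")]
```

Multi-sentence paragraphs are simply skipped, which keeps the positive set clean at the price of discarding some data.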
Finally, note that the MaxEnt model, even though it is trained on 1:1 sentence pairs, can in fact evaluate
any pairs of segments. We make use of this property in our implementation. 16
3.3 Closing alignment gaps
The job of the second step is to complete the alignment by filling in first-pass gaps. Assume that a gap
begins at index i and ends at index j (i ≤ j) on the English side, and begins at index k and ends at index l
(k ≤ l) on the Foreign side. This step aims at refining the alignment of sentences ei,j and fk,l , assuming
that these blocks are already (correctly) aligned as a whole.
If one side of the gap is empty, 17 then nothing is to be done, and the block is left as is. In all other cases,
the block alignment will be improved by finding a set of n links {l1, . . . , ln} maximizing the following score:

    ∏_{i=1}^{n} p(1 | li = (ei, fi)) / (α × size(li))        (1)
16. Training the model on 1:1 links, and using it to assess multi-sentence links creates a small methodological bias. Our
attempts to include other types of links during training showed insignificant variance in the performance.
17. This happens when the algorithm detects two consecutive anchors (i, k) and (j, k + 1) where j > i + 1 (and similarly in
the other direction).
where size(l) is the size of link l, defined as the product of the number of sentences on the source and target
sides, and α is a hyper-parameter of the model. Note that this score is slightly different from the proposal of
Yu et al. [2012a], where we used an additive penalty instead of a multiplicative one in (1).
This score computes the probability of an alignment as the product of the probabilities of individual
links. The size-based penalty is intended to prevent the model from preferring large blocks over small ones:
this is because the scores of alignments made of large blocks contain fewer factors in Equation (1); dividing
by α × size(li ) mitigates this effect and makes scores more comparable.
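A sketch of this scoring function; computing it in log space is an implementation choice for numerical stability, not something mandated by Equation (1):

```python
import math

def link_size(n_src, n_tgt):
    # size(l): product of the sentence counts on the two sides of the link
    return n_src * n_tgt

def block_score(link_posteriors, sizes, alpha=1.0):
    """Log of the score in Equation (1): the product over links of
    p(1 | l) / (alpha * size(l)), given the MaxEnt posteriors and the
    (n_src, n_tgt) dimensions of each link."""
    return sum(math.log(p) - math.log(alpha * link_size(*s))
               for p, s in zip(link_posteriors, sizes))

# One confident 1:1 link scores better than one sprawling 2:2 link with
# the same posterior, because of the size-based penalty.
assert block_score([0.9], [(1, 1)]) > block_score([0.9], [(2, 2)])
```

Since the log is monotone, maximizing this sum is equivalent to maximizing the product in Equation (1).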
In general, the number of sentences in a gap is small enough to make it possible to consider all the sub-blocks
within the gap. In some rare cases, however, the number of sub-blocks was too large to enumerate, and we had to
impose an additional limitation on the number of sentences on both sides of a link. Our inspection of several
manually annotated books revealed that links are seldom composed of more than 4 sentences on either side
and we have used this limit in our experiments (this is parameter δ in the algorithms below). Note that this
is consistent with previous works such as [Gale and Church, 1993, Moore, 2002] where only small links (the
largest alignment links are 2:1) are considered. Our two-pass approach allows us to explore more alignment
types, the largest link type being 4:4. We compare below two algorithms for finding the optimal set of links:
a greedy search presented in Yu et al. [2012b], and a novel (exact) algorithm based on dynamic programming.
3.3.1 Greedy search
Greedy search is described in Algorithm 1: it simply processes the possible links in decreasing probability
order and inserts them into the final alignment unless they overlap with an existing alignment link. For a
gap with M English sentences and N Foreign sentences and fixed δ, the worst-case complexity of the search
is O(M × N ), which, given the typical small size of the gaps (see Figure 4), can be computed very quickly.
Algorithm 1 Greedy search
Input: block = (ei,j , fk,l ), priority list L, max link size δ
Output: result list R
Generate the set of all possible links S between sub-blocks in (ei,j , fk,l )
for all l in S do
insert l into L, with score(l) defined by (1)
end for
while L not empty do
pop top link l∗ from L
insert l∗ into R
remove any link that intersects or crosses l∗ from L
end while
complete R with null links
The result of Algorithm 1 is a collection of links which do not overlap with each other and respect the monotonicity constraint. Note that even though the original list L does not contain any null link, the resulting alignment R may contain links having an empty source or target side. For instance, if the links (e_{b,b+1}, f_{d−1,d}) and (e_{b+2,b+3}, f_{d+2,d+3}) are selected, then the null link (∅, f_{d+1,d+1}) will also be added to R.
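Algorithm 1 can be sketched in a few lines of Python (a hypothetical sketch: each candidate is a (score, source span, target span) triple with inclusive sentence indices, and the overlap and crossing tests are our reading of the monotonicity constraint):

```python
def greedy_align(candidates):
    """Sketch of Algorithm 1. Links are taken in decreasing score order
    and kept only if they neither overlap nor cross an already selected
    link; the final completion of R with null links is omitted here."""
    result = []
    for cand in sorted(candidates, reverse=True):
        score, e_span, f_span = cand
        ok = True
        for _, e2, f2 in result:
            overlap_e = not (e_span[1] < e2[0] or e2[1] < e_span[0])
            overlap_f = not (f_span[1] < f2[0] or f2[1] < f_span[0])
            # disjoint spans whose relative orders differ cross each other
            crossing = (e_span[0] > e2[1]) != (f_span[0] > f2[1])
            if overlap_e or overlap_f or crossing:
                ok = False
                break
        if ok:
            result.append(cand)
    return sorted(result, key=lambda x: x[1])  # return links in source order
```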
3.3.2 Dynamic programming search
The other search algorithm considered in this study is based on dynamic programming (DP). Given a
series of English sentences ei,j and the corresponding Foreign sentences fk,l , DP tries to find the set of links
yielding maximal global score. Our DP search procedure is described in Algorithm 2: as typical in DP
approaches, the algorithm merely amounts to filling a table D containing the score of the best alignments
of sub-blocks of increasing size. The search complexity for a gap containing M English and N Foreign
sentences is O(M × N). The constant factor in the complexity analysis depends on the types of links DP has to consider: as explained above, we only consider links with at most 4 sentences on each side.
An important issue for DP search is that the probability of null links must be estimated, which is
difficult for MaxEnt, as no such information can be found in the training corpus. In greedy search, which
Algorithm 2 The dynamic programming search
Input: gap = (e_{i,j}, f_{k,l}), empty tables D and B, max gap size δ
Output: link list R
  for a ← i to j do
      for b ← k to l do
          max ← 0
          for m ← 0 to min(a, δ) do
              for n ← 0 to min(b, δ) do
                  cur ← D(a−m, b−n) + score(e_{a−m,a}, f_{b−n,b})
                  if cur ≥ max then
                      max ← cur
                      D(a, b) ← max
                      B(a, b) ← (a−m, b−n)
                  end if
              end for
          end for
      end for
  end for
  Back trace on B to find R
only considers non-null links, this problem does not exist. In DP, however, null links appear in all backtraces.
We have adopted here a simple method, which is to estimate the score of a null link l = (e, ∅) or l = (∅, e) as:

score(e, ∅) = score(∅, e) = exp(−β |e|)    (2)
where |u| returns the number of tokens in u and β > 0 is a hyper-parameter of the method. The intuition
is that long null links should be less probable than shorter ones.
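The DP recursion and the null-link scoring of Equation (2) can be sketched together as follows (a simplified sketch: sentences are token lists, `score` is an externally supplied log-score function, and the tie-breaking and table boundaries are illustrative):

```python
import math

def dp_align(E, F, score, delta=4, beta=0.1):
    """Sketch of Algorithm 2 with the null-link scores of Equation (2).

    E and F are lists of sentences (each a list of tokens); score(e_span,
    f_span) returns the log-score of linking E[e_span[0]:e_span[1]] with
    F[f_span[0]:f_span[1]]. Null links are scored as -beta * |tokens| in
    the log domain; delta bounds the number of sentences per link side."""
    I, J = len(E), len(F)
    D = [[-math.inf] * (J + 1) for _ in range(I + 1)]  # best-score table
    B = {}                                             # backpointers
    D[0][0] = 0.0
    for a in range(I + 1):
        for b in range(J + 1):
            if a == 0 and b == 0:
                continue
            for m in range(0, min(a, delta) + 1):
                for n in range(0, min(b, delta) + 1):
                    if m == 0 and n == 0:
                        continue
                    if m == 0:    # null link: n target sentences unaligned
                        s = -beta * sum(len(f) for f in F[b - n:b])
                    elif n == 0:  # null link: m source sentences unaligned
                        s = -beta * sum(len(e) for e in E[a - m:a])
                    else:         # an m:n link scored by the classifier
                        s = score((a - m, a), (b - n, b))
                    if D[a - m][b - n] + s > D[a][b]:
                        D[a][b] = D[a - m][b - n] + s
                        B[(a, b)] = (a - m, b - n)
    links, cur = [], (I, J)
    while cur != (0, 0):          # backtrace on B to recover the links
        prev = B[cur]
        links.append((prev, cur))
        cur = prev
    return list(reversed(links))
```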
4 A 2D CRF Model for Sentence Alignment
State-of-the-art methods, including the MaxEnt model in Section 3, share the characteristic that the
parallelism estimators are locally learnt and applied. When estimating the parallelism level of a pair of
source/target sentence groups e and f , state-of-the-art methods usually ignore contextual information, such
as the sentences before or after the pair and the link types nearby. However, all methods are tested on blocks of sentences containing more than one link, which might put such local models at a disadvantage. Recently,
Mujdricza-Maydt et al. [2013] have shown that using dependencies between consecutive alignment links
can significantly help the performance of sentence alignment systems. Local models also tend to give poor
performance on null links, since one often has to use different scoring schemes for null and non-null links. Xu
et al. [2015] show that this weakness is a very important source of errors for several state-of-the-art methods,
reconfirming the discoveries of [Gale and Church, 1991].
DP-based search is another popular choice of state-of-the-art methods.^18 For speed concerns, DP is seldom carried out exactly. Nearly all systems sacrifice accuracy for efficiency by limiting the link types explored in the DP search. For instance, Gale and Church [1991] and Moore [2002] used link types up to 2:2, while Braune and Fraser [2010] and Xu et al. [2015] went up to 4:4. Links larger than 4:4 are indeed rare, but they still occur in some bi-texts, especially in literary ones. Their presence leads to intrinsic errors of the DP search, which hurt its overall accuracy, since errors can propagate during the search.
We have investigated novel ways to overcome these problems of state-of-the-art methods. We propose a
2D CRF model for sentence alignment, which can encode the dependencies between neighboring alignment
links, and ensure theoretically exact search for the best output.
18. The GSA algorithm of Melamed [1999] was a notable exception, in that it did not employ DP at all. Rather, its search
was based on a set of heuristics.
4.1 The model
Conditional Random Fields (CRFs) [Lafferty et al., 2001] are a widely used probabilistic model in the NLP community for solving structured prediction problems. A CRF can take the dependencies among variables into account
during training, and use complex features involving evidence from the whole observation instance. Given
observed random variables x and unobserved variables y, a CRF models the conditional distribution of y.
In the most general form, it can be written as:
p(y|x) = (1/Z(x)) ∏_{c∈C} Φ_c(x, y_c)

where C is the set of cliques, y_c the unobserved variables in c, and the factors Φ_c(x, y_c) are non-negative real functions of x and y_c. Z(x) = Σ_y ∏_c Φ_c(x, y_c) is the global normalization function.
We follow Mujdricza-Maydt et al. [2013] in using a CRF to model the dependencies between neighboring links. Given a sequence of source language sentences E_1^I = E_1, ..., E_I and a sequence of target sentences F_1^J = F_1, ..., F_J,^19 we develop a 2D CRF model to predict the presence of an alignment link between any pair of sentences [E_i; F_j], where 1 ≤ i ≤ I and 1 ≤ j ≤ J. Note that similar models have also been developed for the alignment of words [Niehues and Vogel, 2008, Cromierès and Kurohashi, 2009, Burkett and Klein, 2012]. In this model, each pair [E_i; F_j] gives rise to a binary variable y_{i,j}, whose value is 1 (positive) if there is a link between E_i and F_j, and 0 (negative) otherwise. For the pair of sequences E_1^I and F_1^J, there are I × J such
binary prediction variables, each representing an alignment link. Dependencies between alignment links are modeled as follows. For the pair [E_i; F_j], we assume that its associated variable y_{i,j} depends on y_{i−1,j}, y_{i+1,j}, y_{i,j−1}, y_{i,j+1}, y_{i−1,j−1} and y_{i+1,j+1}. In other words, it depends on the presence of the links [E_{i−1}; F_j], [E_{i+1}; F_j], [E_i; F_{j−1}], [E_i; F_{j+1}], [E_{i−1}; F_{j−1}] and [E_{i+1}; F_{j+1}]. Figure 1 displays a graphical representation of the model. We conventionally impose that the source sentence E_i corresponds to the i-th row, and the target sentence F_j to the j-th column.
Figure 1 – The 2D CRF model for sentence alignment
The topology of our model is different from the one in [Mujdricza-Maydt et al., 2013], where each diagonal
of the alignment matrix was modeled as a linear chain CRF. Figure 2 (reproduced from [Mujdricza-Maydt
et al., 2013]) displays their model for a passage of 8 source sentences and 10 target sentences. In decoding,
they performed inference on each chain independently, and combined the results of all chains into the final
alignment. This topology captured the important diagonal direction dependency, but did not encode the
horizontal or vertical dependencies. Another important difference lies in the generation of the final outputs. Mujdricza-Maydt et al. [2013] used the labels of the variables to indicate the corresponding link type (e.g. 1:1, 2:1). For example, if the predicted value of the variable (i, j) is 2:1, then the result contains an alignment link [E_{i−1}, E_i; F_j]. Theoretically, there are infinitely many link types, and it is impossible to include them all. Besides, the inference complexity of a linear chain CRF is quadratic in the number of states. As a result, they restricted themselves to six link types (1:1, 1:2, 2:1, 1:3, 3:1, no-link), ignoring all other types. In our model, all prediction variables are binary. We generate final links using the transitive
19. Note that we refer to the languages as source and target only for convenience: source does not necessarily indicate the original language of the bi-text, nor does target indicate the language of the translation.
Figure 2 – The topology of the model of [Mujdricza-Maydt et al., 2013]. Each circle is a variable. The
value of a variable indicates the alignment link type.
closure operation according to sentence alignment conventions, which can theoretically lead to any possible
link type.
We set two kinds of clique potential functions in the model: node potentials and edge potentials. Both
node and edge potentials take the form of a log-linear combination of a set of feature functions. We further
impose that all single node cliques use the same clique template. That is, all node potentials share the
same set of feature functions and corresponding weights. However, for edge potentials, we make one clique
template for the horizontal edges, one for the vertical edges, and another for the diagonal edges. In total,
there are four clique templates in the model. For now, the main limitation of the model is that it does not
include long distance dependencies, which makes it difficult to encode certain types of constraints (e.g. that
the alignment links should not cross). The model for a pair of sentence sequences [E; F] (as a shorthand for [E_1^I; F_1^J]) can be written as:

p(y|E, F) = (1/Z) ∏_{i,j} exp{ Σ_k θ_k u_k(E, F, i, j, y_{i,j}) + Σ_m w_m g_m(E, F, i, j, i+1, y_{i,j}, y_{i+1,j})
                              + Σ_n b_n h_n(E, F, i, j, j+1, y_{i,j}, y_{i,j+1})
                              + Σ_t d_t l_t(E, F, i, j, i+1, j+1, y_{i,j}, y_{i+1,j+1}) }    (3)
where the u_k are the K features of the single node clique template, the g_m the M features of the vertical edge clique template, the h_n the N features of the horizontal edge clique template, and the l_t the T features of the diagonal edge clique template (in the following, we will use U_{i,j}, G_{i,j}, H_{i,j}, L_{i,j} to denote the vectors of feature functions at position (i, j)). θ, w, b, d are the corresponding weight vectors. We also use L2 regularization with scaling parameter α > 0.^20
As stated above, the proposed model can represent any link type. For instance, an all-zero j-th column indicates an unaligned target sentence F_j; if y_{p,q} is positive and all other variables in the p-th row and in the q-th column are negative, then there is a 1:1 link [E_p; F_q]. Besides, the model can express finer correspondences than the usual alignment-link-level representation. For example, if both E_{u−1} and E_u are aligned to F_{v−1}, and E_u is further aligned to F_v, the proposed representation displays exactly this relation, while the alignment link representation would show a coarser 2:2 link [E_{u−1}, E_u; F_{v−1}, F_v].
20. In the experiments, α is tuned on a development set, and takes the value 0.1.
4.2 Learning the 2D CRF model
The conventional learning criterion for CRFs is Maximum Likelihood Estimation (MLE). For a set of fully observed training instances A = {(E^(s), F^(s), y^(s))}, MLE consists of maximizing the (regularized) log-likelihood of the training set with respect to the model parameters Θ = {θ, w, b, d}. The log-likelihood is:
L = Σ_s log p(y^(s) | E^(s), F^(s)) − α(||θ||² + ||w||² + ||b||² + ||d||²)
  = Σ_s [ Σ_{i,j} ( θ^⊤ U^{y^(s)}_{i,j} + w^⊤ G^{y^(s)}_{i,j} + b^⊤ H^{y^(s)}_{i,j} + d^⊤ L^{y^(s)}_{i,j} ) − log Z^(s) ] − α(||θ||² + ||w||² + ||b||² + ||d||²)

In the following, A_{y^(s)} denotes the inner sum over positions (i, j), B the complete sum over training instances s, and C the regularization term −α(||θ||² + ||w||² + ||b||² + ||d||²).
The log-likelihood is concave with respect to each weight vector: A_{y^(s)} is affine, hence concave; log Z^(s) is the composition of the log-sum-exp operator with an affine function and is thus convex [Boyd and Vandenberghe, 2004], so −log Z^(s) is concave; B is a sum of concave functions and is thus itself concave; finally, C is a negative quadratic function, which is also concave. The log-likelihood, as a sum of concave functions, is itself concave. We can therefore use convex optimization techniques to obtain parameter estimates. To do so, we need the gradient of L with respect to each weight vector. For θ we have:
∇_θ A_{y^(s)} = Σ_{i,j} U^{y^(s)}_{i,j}

∇_θ (−log Z^(s)) = − Σ_{y′ ∈ {0,1}^{I×J}} p(y′ | E^(s), F^(s)) Σ_{i,j} U^{y′}_{i,j}
                 = − Σ_{i,j} Σ_{y′_{i,j}} p(y′_{i,j} | E^(s), F^(s)) U^{y′}_{i,j}

∇_θ C = −αθ
For w we get:

∇_w A_{y^(s)} = Σ_{i,j} G^{y^(s)}_{i,j}

∇_w (−log Z^(s)) = − Σ_{y′ ∈ {0,1}^{I×J}} p(y′ | E^(s), F^(s)) Σ_{i,j} G^{y′}_{i,j}
                 = − Σ_{i,j} Σ_{y′_{i,j}, y′_{i+1,j}} p(y′_{i,j}, y′_{i+1,j} | E^(s), F^(s)) G^{y′}_{i,j}

∇_w C = −αw
and we have similar forms for b and d .
The computation of the gradients requires two kinds of marginal probabilities: single node marginals p(y′_{i,j} | E^(s), F^(s)) and edge marginals p(y′_{i,j}, y′_{i+1,j} | E^(s), F^(s)) (for w), p(y′_{i,j}, y′_{i,j+1} | E^(s), F^(s)) (for b), and p(y′_{i,j}, y′_{i+1,j+1} | E^(s), F^(s)) (for d). We need to perform inference to obtain these marginals. Since the topology of our model contains many loops, we have applied the variational Loopy Belief Propagation (LBP)
inference algorithm. LBP is a fixed-point iterative procedure: it sends abstract messages along the edges of the graphical model, each message being a temporary local estimate of a distribution. Basically, in each iteration, LBP tries to improve these local estimates in order to reach a stable global distribution. Although LBP is an approximate inference algorithm with no convergence guarantee, Murphy et al. [1999] observe that it often gives reasonable distribution estimates when it converges.
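For concreteness, the message-passing procedure can be sketched for binary variables on an arbitrary graph (a generic illustrative sketch, not the project's implementation; the synchronous update schedule and uniform initialization are assumptions):

```python
def _edge(edge_pot, u, v, xu, xv):
    # Pairwise potential lookup; edge_pot stores each edge in one orientation.
    if (u, v) in edge_pot:
        return edge_pot[(u, v)][xu][xv]
    return edge_pot[(v, u)][xv][xu]

def lbp_marginals(nodes, edges, node_pot, edge_pot, iters=50):
    """Minimal loopy belief propagation for binary variables.

    node_pot[v] is a 2-vector and edge_pot[(u, v)] a 2x2 matrix of
    potentials. Messages start uniform and are updated synchronously;
    node marginals are the product of the node potential with all
    incoming messages (equations (5) and (6) below)."""
    nbrs = {v: [] for v in nodes}
    msgs = {}
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
        msgs[(u, v)] = [1.0, 1.0]
        msgs[(v, u)] = [1.0, 1.0]
    for _ in range(iters):
        new = {}
        for (u, v) in msgs:
            out = [0.0, 0.0]
            for xv in (0, 1):
                for xu in (0, 1):
                    p = node_pot[u][xu] * _edge(edge_pot, u, v, xu, xv)
                    for w in nbrs[u]:
                        if w != v:
                            p *= msgs[(w, u)][xu]
                    out[xv] += p
            z = out[0] + out[1]
            new[(u, v)] = [out[0] / z, out[1] / z]
        msgs = new
    marg = {}
    for v in nodes:
        b = [node_pot[v][0], node_pot[v][1]]
        for x in (0, 1):
            for w in nbrs[v]:
                b[x] *= msgs[(w, v)][x]
        z = b[0] + b[1]
        marg[v] = [b[0] / z, b[1] / z]
    return marg
```

On a tree this computes exact marginals; on the loopy alignment grid it only approximates them, as discussed above.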
For an undirected graphical model in the form of a tree:

p(x) = (1/Z) ∏_{μ∈υ} Φ(x_μ) ∏_{(μ,ν)∈ξ} Φ(x_μ, x_ν)    (4)
where υ is the set of nodes and ξ the set of edges, the message from a node μ to a neighboring node ν takes the form [Wainwright and Jordan, 2008]:

m_{μ→ν}(x_ν) ∝ Σ_{x_μ} Φ(x_μ) Φ(x_ν, x_μ) ∏_{γ∈N(μ)\ν} m_{γ→μ}(x_μ)    (5)
where N(μ) denotes the set of neighbors of μ. After the message passing has converged (which is guaranteed for trees), the single node and edge marginals are:

p(x_ν) ∝ Φ(x_ν) ∏_{γ∈N(ν)} m_{γ→ν}(x_ν)

p(x_μ, x_ν) ∝ Φ(x_μ) Φ(x_ν) Φ(x_ν, x_μ) ∏_{δ∈N(μ)\ν} m_{δ→μ}(x_μ) ∏_{γ∈N(ν)\μ} m_{γ→ν}(x_ν)    (6)
LBP performs the same message passing procedure on a cyclic graph. For our model, since we use only single node cliques and edge cliques, we can establish a mapping between the concrete form (3) and the general tree form (4):

p(y|E, F) = (1/Z) ∏_{i,j} exp{θ^⊤ U_{i,j}} · ∏_{i,j} exp{w^⊤ G_{i,j}} · ∏_{i,j} exp{b^⊤ H_{i,j}} · ∏_{i,j} exp{d^⊤ L_{i,j}}

where the four factors play the roles of Φ(y_{i,j}), Φ(y_{i,j}, y_{i+1,j}), Φ(y_{i,j}, y_{i,j+1}) and Φ(y_{i,j}, y_{i+1,j+1}), respectively,
which enables us to derive easily the formulas to compute the messages and the necessary marginals according
to (5) and (6). For instance, the message along a vertical edge from the node (i, j) to the node (i + 1, j) is
m_{(i,j)→(i+1,j)}(y_{i+1,j}) ∝ Σ_{y_{i,j}} exp(θ^⊤ U_{i,j}) exp(w^⊤ G_{i,j}) ∏_{δ∈N(i,j)\(i+1,j)} m_{δ→(i,j)}(y_{i,j})

the message from (i, j) to (i − 1, j) is

m_{(i,j)→(i−1,j)}(y_{i−1,j}) ∝ Σ_{y_{i,j}} exp(θ^⊤ U_{i,j}) exp(w^⊤ G_{i−1,j}) ∏_{δ∈N(i,j)\(i−1,j)} m_{δ→(i,j)}(y_{i,j})

and the single node marginal at position (i, j) is

p(y_{i,j} | E, F) ∝ exp(θ^⊤ U_{i,j}) ∏_{δ∈N(i,j)} m_{δ→(i,j)}(y_{i,j})
Convex optimization routines also require computing the log partition function log Z. LBP approximates this quantity with the Bethe Free Energy. For our model, we have:

−log Z ≈ F_Bethe = Σ_{i,j} [ Σ_{y_{i,j}, y_{i+1,j}} p(y_{i,j}, y_{i+1,j}) log( p(y_{i,j}, y_{i+1,j}) / Φ(y_{i,j}, y_{i+1,j}) )
                  + Σ_{y_{i,j}, y_{i,j+1}} p(y_{i,j}, y_{i,j+1}) log( p(y_{i,j}, y_{i,j+1}) / Φ(y_{i,j}, y_{i,j+1}) )
                  + Σ_{y_{i,j}, y_{i+1,j+1}} p(y_{i,j}, y_{i+1,j+1}) log( p(y_{i,j}, y_{i+1,j+1}) / Φ(y_{i,j}, y_{i+1,j+1}) )
                  + (1 − d(i, j)) Σ_{y_{i,j}} p(y_{i,j}) log p(y_{i,j}) ]
where d(i, j) is the number of neighbors of the node (i, j).
For learning, we first train the CRF without any edge potential (making the model similar to the simpler MaxEnt model), and initialize the θ parameter. We then uniformly initialize w, b and d,^21 and use the L-BFGS algorithm [Liu and Nocedal, 1989] implemented in the Scipy package to perform CRF parameter learning (with all potentials). As for decoding, we again run LBP with the optimized parameters to obtain single node marginals, and then pick the locally best label for each node.
4.3 Search in the 2D CRF model
For the 2D CRF model, we again perform the search in multiple steps. High-confidence 1:1 links in Moore's result segment the entire search space into sub-blocks. For each sub-block, we run LBP decoding independently, which returns a set of sentence-level links. Since the blocks are often small (generally smaller than 10 × 10), search is very fast.
Figure 3 displays an alignment prediction matrix. There are four colors in the graph, corresponding to
four types of predictions: true positive (red), true negative (blue), false positive (yellow), false negative
(cyan). The score in each cell is the marginal probability of the pair being positive, as computed by the 2D
CRF model. Thus, a red or yellow cell indicates a sentence-level link. To turn the sentence-level links into alignment-level links, we apply the following rules:
1. consecutive sentence-level links in the horizontal or vertical direction are combined into one large alignment-level link;
2. a sentence-level link without horizontal or vertical neighbors becomes a 1:1 alignment-level link.
These rules follow from the interpretation of our model, in which an n:m alignment-level link decomposes into n × m sentence-level links.
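These two rules amount to grouping positive cells into connected components; a sketch (assuming the set of positive cells is given, with the 4-neighborhood capturing the horizontal and vertical adjacency used by the rules):

```python
def links_from_matrix(pos):
    """Group positive sentence-level cells (i, j) into alignment-level
    links by merging cells connected through horizontal or vertical
    neighbors; each component yields one link, returned as
    (source rows, target columns)."""
    seen, links = set(), []
    for cell in sorted(pos):
        if cell in seen:
            continue
        stack, comp = [cell], set()
        while stack:  # flood fill over the 4-neighborhood
            i, j = stack.pop()
            if (i, j) in comp:
                continue
            comp.add((i, j))
            for nxt in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                if nxt in pos and nxt not in comp:
                    stack.append(nxt)
        seen |= comp
        links.append((sorted({i for i, _ in comp}),
                      sorted({j for _, j in comp})))
    return links
```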
Figure 3 – An alignment prediction matrix.
Two types of errors exist in the alignment prediction matrix: false negatives (cyan cells) and false positives (yellow cells). We do not deal with false negatives. False positives introduce noise, for example the pair
21. See [Sutton, 2008] pp. 88-89 for a discussion on parameter initialization of general CRFs trained by LBP.
(50, 44) in Figure 3. Applying the rules stated above directly would lead to two separate 1:1 links involving the same source sentence, which violates the general convention of sentence alignment. The pair (50, 44) is clearly wrong: it links the first source sentence with the last target one, thus overlapping with all other positive sentence-level links.^22 In fact, in all blocks, the true positives always lie around the main diagonal of the block. We have therefore used the following heuristic to discard false positives:
1. first perform a linear regression on all predicted positive sentence-level links, then take a band of fixed width around the regression line, and drop positive links outside the band;^23
2. if, after this step, there are still separate links involving the same sentence, take the positive sentence-level links in a surrounding window of width 5, and discard the one that is inconsistent with the surrounding links;
3. if it is still undecidable, perform again a linear regression on the positive sentence-level links in the surrounding window, and discard the one farthest from the line.
In practice, we rarely need to perform step 3.
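Step 1 of the heuristic can be sketched as a least-squares fit followed by a distance test (an illustrative sketch; the band around the regression line is approximated here by the vertical distance, and the band width of footnote 23 is passed in as a parameter):

```python
def band_filter(cells, half_width):
    """Fit a least-squares line j = a*i + c through the positive cells
    (i, j) and drop cells whose vertical distance to the line exceeds
    half_width."""
    n = len(cells)
    xs = [i for i, _ in cells]
    ys = [j for _, j in cells]
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var if var else 0.0
    c = my - a * mx
    return [(i, j) for i, j in cells if abs(j - (a * i + c)) <= half_width]
```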
5 Experiments
In this section, we first discuss the impact of various hyper-parameters of the system on its overall
performance. We then describe the ways in which the training corpus is designed for the MaxEnt model
and the 2D CRF model, respectively. Finally, we report results of our alignment algorithms on test corpora,
compare them to several baseline methods, and give some quantitative and qualitative analyses.
5.1 A study of Moore's alignments (BMA)
Our method relies heavily on the information computed by Moore's algorithm, since we use the sure 1:1 links identified in the first step as anchor points to prune the search space of the second step. The number of anchor
points has a direct effect on the computational burden of the search algorithm. Their quality is even more
important as incorrect anchors hurt the performance in two ways: they count as errors in the final result,
and they propagate erroneous block alignments for the second step, thereby generating additional alignment
errors. It is then natural to investigate the quality of BMA’s results from several perspectives.
We have run Moore’s algorithm on the BAF corpus, which contains a complete reference alignment, and
used the default posterior probability threshold 0.5 to filter out alignment links. Among the 1,944 resulting
1:1 links, only 1,577 are correct (P=0.81). These 1,944 links result in 445 gaps to be aligned by the second
alignment pass. The quality of these gaps is also relevant. We define a gap as correct if it can be fully
decomposed into links that appear in the reference set. Among the 445 gaps to be aligned, more than one third (180) are incorrect. Finally, note that the noise in each incorrect gap also negatively impacts the search.
Moore’s algorithm associates a confidence score with each link. As shown in Figure 4, using a tighter
threshold to select the anchor links significantly improves the precision, and also reduces the number of
wrong gaps, at the expense of creating larger blocks.
In Figure 4, we plot precision as a function of θ, where the actual threshold is equal to 1 − 10^θ: for instance, θ = −0.3 corresponds to a threshold of 0.5, and θ = −5 corresponds to a threshold of 0.99999.^24 On the
BAF, the use of a very high confidence threshold of 0.99999 improves the precision of anchor points from
0.81 to 0.89, and the ratio of correct gaps rises from 0.6 to 0.82. On the manual en-fr corpus of [Yu et al.,
2012a], a threshold 0.99999 yields an anchor point precision of 0.96 and a correct gap ratio of 0.94. In both
cases, a high confidence threshold significantly reduces the number of wrong gaps. In our implementation,
we set the threshold to 0.9999, which constitutes an acceptable trade-off between correct gap ratio and gap
sizes.
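The mapping between the x-axis of Figure 4 and the actual confidence threshold is simply:

```python
def theta_to_threshold(theta):
    # threshold = 1 - 10**theta, so theta = -0.3 gives roughly 0.5
    # and theta = -5 gives 0.99999.
    return 1 - 10 ** theta
```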
22. Note that this particular matrix was computed by an early version of the 2D CRF model. In its current version, the model is augmented with a feature capturing relative position information, which effectively prevents this kind of error.
23. The band width is taken to be half the number of sentences on the shorter of the two sides.
24. Posterior link probabilities computed by generative models such as those used by Moore’s algorithm tend to be very
peaked, which explains the large number of very confident links.
[Figure: left panel "Link and Gap precision as threshold changes", plotting link precision and gap precision as functions of θ; right panel "Average gap size as threshold changes", plotting the average English and French gap sizes as functions of θ.]
Figure 4 – Varying confidence threshold causes link precision, gap precision (left) and gap size (right) to
change. Numbers are computed over all the manually aligned data (BAF and manual en-fr).
5.2 The MaxEnt method

5.2.1 Feature engineering
The job of the MaxEnt classifier is, given a pair (e, f ) of source and target sentences, to evaluate the level
of parallelism between them. In [Yu et al., 2012a], we used a restricted (endogenous) feature set derived
solely from the bitext under study. By allowing ourselves to also consider external resources, we can design
more complex and effective feature families. The features considered in this work are the following:
1. The length (in character) of e and f , and the length ratio, discretized into 10 intervals, so this family
contains a total of 12 features.
2. The number of identical tokens in e and f . We define 5 features for the values (0, 1, 2, 3, 4+).
3. The number of cognates 25 in e and f , also defining 5 features for the values (0, 1, 2, 3, 4+).
4. The word pair lexical features, one for each pair of words co-occurring at least once in a pair of parallel sentences. For example, if the first token in e is "Monday" and the first token in f is "Lundi", then the pair "Monday-Lundi" defines a word pair lexical feature.
5. Sentence translation score features. For a pair of sentences e = e_1^I and f = f_1^J, we use the IBM Model 1 scores [Brown et al., 1993]:

T_1(e, f) = (1/J) Σ_{j=1}^{J} log( (1/I) Σ_{i=1}^{I} p(f_j | e_i) )

T_2(e, f) = (1/I) Σ_{i=1}^{I} log( (1/J) Σ_{j=1}^{J} p(e_i | f_j) )

After discretizing these values, we obtain 10 features for each direction.
6. Longest continuous covered span features. A word e_i is said to be covered if there exists a word f_j such that the translation probability t(e_i | f_j) in the IBM Model 1 table is larger than a threshold (10^−6 in our experiments). A long span of covered words is an indicator of parallelism. We compute the length of the longest covered span on each side, and normalize it by the respective sentence length. This family contains 20 features.
7. Uncovered words: The notion of coverage is defined as above. We count the number of uncovered
words on both sides and normalize by the sentence length. This family contains 20 features, 10 on
each side.
25. We call a pair of words “cognates” if they share a prefix of at least 4 characters.
8. Unlinked words in the IBM 1 alignment. A word e_i is said to be linked if, in an alignment a, there exists some index j such that a_j = i.^26 Large portions of consecutive unlinked words are a sign of non-parallelism. These counts are normalized by the sentence length, and yield 20 additional features.
9. Fertility features. The fertility of a word e_i is the number of indices j that satisfy a_j = i. Large fertility values indicate non-parallelism. We take, on each side, the three largest fertility values, and normalize them with respect to the sentence lengths. This yields 60 supplementary features.
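For illustration, feature families 1 to 3 can be sketched as follows (the bucket boundaries and the exact length normalization are assumptions; only the cognate definition follows footnote 25):

```python
def basic_features(e_tokens, f_tokens):
    """Sketch of feature families 1-3. The 10 length-ratio buckets and
    the capping at 4 follow the text; bucket boundaries are illustrative."""
    e_len = sum(len(t) for t in e_tokens)  # length in characters (family 1)
    f_len = sum(len(t) for t in f_tokens)
    ratio = e_len / max(f_len, 1)
    feats = {
        "len_ratio_bucket": min(int(ratio * 5), 9),               # family 1
        "identical": min(len(set(e_tokens) & set(f_tokens)), 4),  # family 2
    }
    # Family 3: "cognates" share a prefix of at least 4 characters.
    cognates = sum(1 for e in set(e_tokens) for f in set(f_tokens)
                   if e != f and len(e) >= 4 and len(f) >= 4 and e[:4] == f[:4])
    feats["cognates"] = min(cognates, 4)
    return feats
```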
The feature families (1–4) are borrowed from [Yu et al., 2012a], and are used in several other studies
on supervised sentence alignment, e.g. [Munteanu and Marcu, 2005, Smith et al., 2010]. All other features
rely on IBM Model 1 scores and can only be computed reliably on large (external) sources of data. To
evaluate the usefulness of these various features, we performed an incremental feature selection procedure.
We first train the model with only one feature family, then add the other families one by one, monitoring
the performance of the MaxEnt model as more features are included. For this study, model performance
is measured by the prediction accuracy, that is the ratio of examples (positive and negative) for which the
model makes the right classification. Because the new features are all based on IBM Model 1, the size of the
training corpus also has an important impact. We have thus set up three datasets of increasing sizes: the
first one contains around 110,000 tokens, which is the typical amount of data that Moore’s algorithm would
return upon aligning one single book; the second one contains 1,000,000 tokens; the third one includes all
the parallel literary data collected for this study and totals more than 5,000,000 tokens. Each data set is
split into a training set, a development set and a test set, using 80% for training, 10% for tuning, and 10%
for testing. For these experiments, the model is trained with 30 iterations of the L-BFGS algorithm with a
Gaussian prior, which is tuned on the development set.
Table 4 gives the performance of the MaxEnt model on the test set as more features are included. Note
that families 2 and 3 are added together.
                               Model accuracy
                  ~110K tokens    ~1M tokens    ~5M tokens
Family 1              0.778          0.873         0.859
+Family 2 and 3       0.888          0.869         0.879
+Family 4             0.957          0.976         0.977
+Family 5             0.943          0.985         0.987
+Family 6             0.912          0.979         0.986
+Family 7             0.913          0.975         0.986
+Family 8             0.913          0.979         0.988
+Family 9             0.913          0.981         0.988
Table 4 – Evaluation of MaxEnt with varying feature families and training data
As expected, the new feature families (5-9) hardly help when the model is trained on a small data set; as more training data is included, the accuracy increases, allowing the system to divide the error rate by more than 2 in comparison to the best small-data condition.
In our application scenario, not only do we want the model to make the right alignment decisions, but we also expect it to do so with high confidence. To check that this is actually the case, we plot the ROC (Receiver Operating Characteristic) curves in Figure 5. We only display the ROC curves for the medium and large data sets.
From the two ROC curves, we can observe that on the medium size data set, the model achieves very good performance (large AUC) in all settings. In both figures, we can see that feature families 6 to 9 hardly improve the confidence of the model over the results obtained with the first five. In the experiments reported below, we therefore only use feature families 1 to 5.
26. This notion is different from coverage and assumes that an optimal 1:1 word alignment has been computed based on
IBM 1 model scores. Words can be covered, yet unlinked, when all their possible matches are linked to other words.
[Figure: two ROC curves (true positive rate vs. false positive rate), one per data set, with one curve per cumulative feature set (1, 1-3, 1-4, ..., 1-9) and a random baseline.]
Figure 5 – ROC curves on the medium size (left) and large (right) data sets. The embedded box is a zoom
over the top-left corner.
5.2.2 Learning corpus
For each test book, we construct the training corpus as follows (the procedure is identical for both language pairs). Each test document is processed independently from the others, in a leave-one-out fashion. We use 80% of the sure parallel sentence pairs from the other 23 books of the auto en-fr corpus (resp. 16 books for auto en-es) as positive examples; we also include the 1:1 links computed by BMA on the left-out corpus. For every pair of parallel sentences f_i and e_j, we randomly sample 3 target (resp. source) sentences other than e_j (resp. f_i) to construct negative pairs with f_i (resp. e_j). Each positive example thus gives rise to 6 negative ones. We train a distinct MaxEnt model for each test book, using 80% of these book-specific data, and keeping the remaining 20% of examples as a held-out data set. Only feature families 1 to 5 are included in the model, yielding an average test accuracy of 0.988. The ME+DP and ME+gr procedures are then used to complete BMA's links according to Algorithms 2 and 1, respectively.
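The sampling scheme can be sketched as follows (a simplified sketch: sentences are opaque strings, negatives are drawn from other pairs of the same collection, and the 80%/20% split is omitted):

```python
import random

def make_training_pairs(parallel, rng=None):
    """Each parallel pair (e, f) yields one positive example and 6
    negatives, obtained by pairing e (resp. f) with 3 sentences sampled
    from the other side of other pairs."""
    rng = rng or random.Random(0)
    examples = []
    for idx, (e, f) in enumerate(parallel):
        examples.append((e, f, 1))
        others = [k for k in range(len(parallel)) if k != idx]
        for k in rng.sample(others, min(3, len(others))):
            examples.append((e, parallel[k][1], 0))  # wrong target side
        for k in rng.sample(others, min(3, len(others))):
            examples.append((parallel[k][0], f, 0))  # wrong source side
    return examples
```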
5.2.3 The performance of the MaxEnt method
Results of the MaxEnt method are summarized in Table 5 (see also Tables 16 and 17 in the appendix
for a full listing). For corpora containing multiple books, the average, minimum and maximum scores are
reported.
                   GMA     Hun    Garg    Yasa   ME+gr   ME+DP
BAF                61.4    71.2   65.6    75.7   76.3    66.5
manual en-fr
    min            53.5    54.3   51.7    59.9   61.4    51.0
    max            92.8    92.6   97.1    95.6   95.3    98.0
    mean           79.6    74.5   80.2    79.1   78.3    81.5
auto en-fr
    min            62.1    56.6   56.4    62.3   56.7    57.7
    max            99.5    99.5   98.1    98.8   97.5    97.9
    mean           88.7    87.9   88.7    89.6   85.7    88.9
auto en-es
    min            60.3    43.7   58.4    64.3   60.4    65.2
    max            96.5    96.4   96.8    100    97.7    98.0
    mean           82.8    81.0   82.6    84.6   80.5    82.7
Table 5 – Performance of the MaxEnt approach with greedy search (ME+gr) and dynamic programming
search (ME+DP) and of baseline alignment tools
The MaxEnt method obtains the best overall results for the manually aligned manual en-fr corpus. All
the average differences between ME+DP and the other algorithms are significant at the 0.01 level, except
for Gargantua and GMA, where the difference is only significant at the 0.05 level. On the large approximate
reference sets, ME+DP achieves results comparable to those of Gargantua and GMA, and slightly worse than Yasa's. Comparing greedy with DP search, the mean performance on manual en-fr increases by 3 points of F-measure, and the differences on the two large corpora, even though these references are only approximations, are also substantial. Surprisingly, this does not carry over to BAF, where the heuristic search is better than the DP algorithm; again, this might be due to a peculiar trait of BAF, which contains larger gaps on average than the other books (and, crucially, has an average gap size greater than 4 in one dimension).
5.2.4 Analysis of the MaxEnt method
We performed an error analysis on the manual alignment data set (manual en-fr). Table 6 lists the link
types in the reference, along with the numbers of reference links that greedy search or DP fails to find.
Link type    in Ref.   ME+gr   ME+DP
0:1               20      13      18
1:0               21      12      18
1:1            1,364      68     105
1:2              179      60      36
1:3               32      17       9
2:1               96      54      32
2:2               24      22      19
others            27      22      15
total          1,763     268     252
Table 6 – Analyses of the errors of greedy search (ME+gr) and DP search (ME+DP) by link type, relative
to the number of reference links (in Ref.), for the manual en-fr corpus. Only the link types occurring more
than 5 times are reported. This filters out 27 links out of 1,790.
The numbers in Table 6 suggest that null links remain difficult, especially for the DP algorithm, reflecting
the fact that estimating reliable scores for these links is a tricky issue. This problem arises for all systems
whose search is based on DP: for instance, Yasa makes a comparable number of errors for null links (16 errors
for type 0:1, 17 for type 1:0), Hunalign’s results are worse (20 errors for type 0:1, 19 for type 1:0), while
Gargantua does not return any null link at all. DP tends to be more precise for larger blocks such as 1:2 or 2:1.
Table 7 illustrates this property of DP search: this excerpt from Jean-Jacques Rousseau’s Les Confessions
is difficult because of the presence of consecutive 1-to-many links. ME+DP is the only algorithm which
correctly aligns the full passage.
en1  My mother had a defence more powerful even than her virtue; she tenderly loved my father, and conjured him to return; his inclination seconding his request, he gave up every prospect of emolument, and hastened to Geneva.
fr1  Ma mère avait plus que la vertu pour s'en défendre ; elle aimait tendrement son mari.
fr2  Elle le pressa de revenir : il quitta tout, et revint.
en2  I was the unfortunate fruit of this return, being born ten months after, in a very weakly and infirm state; my birth cost my mother her life, and was the first of my misfortunes.
fr3  Je fus le triste fruit de ce retour.
fr4  Dix mois après, je naquis infirme et malade.
fr5  Je coûtai la vie à ma mère, et ma naissance fut le premier de mes malheurs.
en3  I am ignorant how my father supported her loss at that time, but I know he was ever after inconsolable.
fr6  Je n'ai pas su comment mon père supporta cette perte, mais je sais qu'il ne s'en consola jamais.

Table 7 – A passage of a reference alignment of Jean-Jacques Rousseau's Les Confessions. MaxEnt with DP correctly identifies the three links.
The gap size also has an impact on the performance of DP. Since the DP search constrains alignment links to contain at most 4 sentences on each side, if at least one side of an actual alignment link exceeds this limit, our algorithm will fail to find the correct solution. Table 8 displays the average gap size (an N × M gap contains N sentences on the source side and M sentences on the target side) inside each book of the manual en-fr corpus, along with the F-scores of greedy search and DP search.
                              Ave. gap size   ME+gr   ME+DP
Du Côté de chez Swann         2.62 × 2.54      89.4    93.3
Emma                          10.25 × 6.75     61.4    51.0
Jane Eyre                     4 × 4.8          67.4    78.9
La Faute de l'Abbé Mouret     1.85 × 2.79      95.3    98.0
Les Confessions               2.89 × 4.7       67.8    74.0
Les Travailleurs de la Mer    3.37 × 3.74      80.8    85.3
The Last of the Mohicans      2.14 × 3.07      85.8    90.1
Table 8 – Gap size and performance of MaxEnt on manual en-fr.
We can see that DP works better when gap sizes are smaller than 4 on each side. When this is not the case, results drop significantly, as for instance for Jane Austen's Emma. Greedy search, while generally outperformed by DP, is significantly better for this book. This highlights the need to also improve the anchor detection algorithm in our future work, so as to make gaps both as correct and as small as possible.
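The DP search described above can be sketched as follows. This is a minimal illustration assuming a generic `score(i, j, a, b)` link scorer (in our setting such a score would be derived from the MaxEnt link probabilities); the actual implementation may differ in its handling of null links and pruning.

```python
def dp_align(score, I, J, max_size=4):
    """Monotone DP sentence alignment: best[i][j] is the best cumulative
    score over the first i source and first j target sentences. Each link
    consumes a:b sentences with 0 <= a, b <= max_size, mirroring the
    4-sentence limit discussed above. score(i, j, a, b) scores the link
    pairing source sentences i-a..i-1 with target sentences j-b..j-1."""
    NEG = float("-inf")
    best = [[NEG] * (J + 1) for _ in range(I + 1)]
    back = [[None] * (J + 1) for _ in range(I + 1)]
    best[0][0] = 0.0
    for i in range(I + 1):
        for j in range(J + 1):
            if i == j == 0:
                continue
            for a in range(0, min(i, max_size) + 1):
                for b in range(0, min(j, max_size) + 1):
                    # null links (0:1 and 1:0) consume one sentence at a time
                    if a == b == 0 or (a == 0 and b > 1) or (b == 0 and a > 1):
                        continue
                    prev = best[i - a][j - b]
                    if prev == NEG:
                        continue
                    s = prev + score(i, j, a, b)
                    if s > best[i][j]:
                        best[i][j], back[i][j] = s, (a, b)
    # recover the link sequence by walking the back pointers
    links, i, j = [], I, J
    while (i, j) != (0, 0):
        a, b = back[i][j]
        links.append(((i - a, i), (j - b, j)))
        i, j = i - a, j - b
    return list(reversed(links))
```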
5.3 The 2D CRF model

5.3.1 Features
In the 2D CRF model, feature functions take the form f(E_1^I, F_1^J, i_1, j_1, i_2, j_2, y_{i_1,j_1}, y_{i_2,j_2}), where E_1^I is the source sequence of I sentences, F_1^J the target sequence of J sentences, (i_1, j_1) and (i_2, j_2) neighboring source-target indices, and y_{i_1,j_1} and y_{i_2,j_2} the corresponding labels (0 or 1). For each pair (E_i, F_j), we compute the following set of features:
1. The length difference ratios. We first compute

$$ r_1 = \frac{|len(E_i) - len(F_j)|}{len(E_i)}, \qquad r_2 = \frac{|len(E_i) - len(F_j)|}{len(F_j)} $$

where the len() function returns the number of characters in a string. Both r_1 and r_2 are clipped to the interval [0, 1], then each is discretized into 10 indicator features. This family thus contains 20 features.
2. The ratio of identical tokens. Let the function token() return the number of tokens in a string. We count the number of shared tokens in E_i and F_j, denote the count by s, compute the two ratios s/token(E_i) and s/token(F_j), then discretize each into 10 features. In total, this family contains 20 features.
3. The relative index difference. We discretize the quantity |i/I − j/J| into 10 features.
4. Sentence translation score features. Let token(E_i) = m and token(F_j) = n; we compute the IBM Model 1 scores:

$$ T_1(E_i, F_j) = \frac{1}{n}\sum_{s=1}^{n} \log\Big(\frac{1}{m}\sum_{k=1}^{m} p(F_j^s \mid E_i^k)\Big) $$

$$ T_2(E_i, F_j) = \frac{1}{m}\sum_{k=1}^{m} \log\Big(\frac{1}{n}\sum_{s=1}^{n} p(E_i^k \mid F_j^s)\Big) $$

where F_j^s is the s-th token of F_j and E_i^k the k-th token of E_i. The lexical translation probabilities p are computed using an IBM 1 model trained on the Europarl corpus. After discretizing T_1 and T_2, we obtain 10 features for each alignment direction.
5. The span coverage features. We split a string into several spans by segmenting at punctuation marks (except quotation marks). For each source span span_e, we compute the translation score T_2(span_e, F_j). If the score is larger than a threshold (set to log(10^−3) in our experiments), we consider span_e as covered. We then compute the ratio of covered source spans and the ratio of covered target spans, and discretize each ratio into 10 features.
6. The label transition features. These features capture the regularity of label transitions from one node (E_i, F_j) to one of its neighbors (e.g. (E_{i+1}, F_j)). For each of the three types of neighbors (vertical, horizontal, diagonal), we define four label transition features (because our prediction variables are binary). For example, on vertical edges, we define

$$ g_{00}(i,j) = \delta\{y_{i,j}=0 \wedge y_{i+1,j}=0\}, \qquad g_{01}(i,j) = \delta\{y_{i,j}=0 \wedge y_{i+1,j}=1\} $$
$$ g_{10}(i,j) = \delta\{y_{i,j}=1 \wedge y_{i+1,j}=0\}, \qquad g_{11}(i,j) = \delta\{y_{i,j}=1 \wedge y_{i+1,j}=1\} $$

where δ is the Kronecker delta function. We have similar features for horizontal and diagonal transitions. In total, this family contains 12 features.
7. The augmented length difference ratio. This family only applies to vertical or horizontal edge potentials, under the condition that the two neighboring pairs are both positive. In the vertical (resp. horizontal) case, we combine the two consecutive source (resp. target) sentences E_i, E_{i+1} (resp. F_j, F_{j+1}) into one new sentence E′ (resp. F′), then we apply the computations of feature family 1 to the pair (E′, F_j) (resp. (E_i, F′)) instead.
8. The augmented translation score features. This family also only applies to vertical or horizontal edge potentials when the two neighboring pairs are both positive. We construct E′ (resp. F′) as in the previous feature family, then compute the augmented translation score T_1(E′, F_j) − T_1(E_i, F_j) (resp. T_2(E_i, F′) − T_2(E_i, F_j)). The intuition is that a longer partial translation is better than a shorter one. We discretize each score into 10 features.
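The T_1 score of feature family 4 and the discretization step used throughout these families can be sketched as follows. This is a simplified illustration: `p` is assumed to be a plain dictionary of lexical probabilities with a small floor for unseen pairs, whereas the report uses probabilities estimated on Europarl.

```python
import math

def ibm1_score(src_tokens, tgt_tokens, p):
    """Average log-probability of the target tokens given the source,
    in the spirit of the T1 quantity:
    T1 = (1/n) * sum_s log( (1/m) * sum_k p(f_s | e_k) ).
    p maps (tgt_word, src_word) pairs to lexical probabilities."""
    m, n = len(src_tokens), len(tgt_tokens)
    total = 0.0
    for f in tgt_tokens:
        inner = sum(p.get((f, e), 1e-9) for e in src_tokens) / m
        total += math.log(inner)
    return total / n

def discretize(value, lo, hi, bins=10):
    """Map a real value to `bins` indicator features (one-hot), as done
    for every real-valued feature family above."""
    x = min(max(value, lo), hi)
    idx = min(int((x - lo) / (hi - lo) * bins), bins - 1)
    return [1 if k == idx else 0 for k in range(bins)]
```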
5.3.2 Learning corpus
The training of the 2D CRF model requires reference alignments. We have used the reference sentence
alignments collected for TransRead (these resources are described in a subsequent report and will be available
online). The training corpus contains alignment links for four books: “Alice’s Adventures in Wonderland”
(L. Carroll), “Candide” (Voltaire), “Vingt Mille Lieues sous les Mers”, and “Voyage au Centre de la Terre”
(both by J. Verne). Table 9 displays the statistics of the training corpus.
Book                                 # Links   # Sent_EN   # Sent_FR
Alice's Adventures in Wonderland         746         836         941
Candide                                1,230       1,524       1,346
Vingt Mille Lieues sous les Mers         778         820         781
Voyage au Centre de la Terre             714         821         754
total                                  3,468       4,001       3,822
Table 9 – Statistics of the training corpus of the 2D CRF model.
We then convert the training corpus into training sets. In theory, we could view the alignment of an entire book as a single training instance for the 2D CRF model. This would, however, lead to large models, which would make the LBP inference very slow. We have thus segmented the alignment of each book into sub-blocks, where a sub-block is a sequence of bi-sentences, and we consider each sub-block as a training instance. In this way, we largely reduce the total number of prediction variables, hence making the training less memory-consuming; the training also parallelizes better. Using smaller training instances has a further potential advantage. Since our model contains one prediction variable for each pair of source-target sentences, there are roughly quadratically many negative examples, but only linearly many positive ones. This data imbalance problem becomes more severe as the size of the model grows, and smaller training instances help alleviate it.
We decided to define a sub-block as a sequence of 20 consecutive alignment-level links (note that 20 alignment-level links do not correspond to exactly 20 source or 20 target sentences). This yields 174 training instances in total. We use 80% of the training instances as the training set, and 20% as the development set.
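The segmentation into sub-blocks can be sketched as below. This is a minimal illustration; the real pipeline operates per book and on full link structures rather than bare indices.

```python
def make_subblocks(links, size=20):
    """Cut a book's reference alignment (a list of links in reading
    order) into sub-blocks of `size` consecutive links; each sub-block
    becomes one training instance for the 2D CRF (the last sub-block of
    a book may be shorter)."""
    return [links[k:k + size] for k in range(0, len(links), size)]
```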
As for the test, we again use the novels in the BAF corpus and the manual en-fr corpus. Table 10 gives
the statistics of the test corpus.
Book                          # Links   # Sent_EN   # Sent_FR
BAF                             2,520       2,554       3,319
Du Côté de chez Swann             463         495         492
Emma                              164         216         160
Jane Eyre                         174         205         229
La Faute de l'Abbé Mouret         222         226         258
Les Confessions                   213         236         326
Les Travailleurs de la Mer        359         389         405
The Last of the Mohicans          197         205         232
Total                           4,312       4,526       5,421
Table 10 – Statistics of the test corpus of the 2D CRF model.
5.3.3 Performance of the 2D CRF model
We summarize the test results of the 2D CRF method (as well as all other methods) in Table 11.
                               Link level F-score
                            GMA    BMA    Hun   Garg   Yasa    Gr     DP    CRF
BAF                        61.4   73.6   71.2   65.6   75.7   76.3   66.5   73.3
Du Côté de chez Swann      92.8   91.5   90.9   92.2   92.2   89.4   93.3   90.9
Emma                       53.5   57.4   57.7   51.7   59.9   61.4   51.0   55.4
Jane Eyre                  77.1   61.1   59.3   71.8   66.9   67.4   78.9   63.2
La Faute de l'Abbé Mouret  91.5   88.4   92.6   97.1   95.6   95.3   98.0   82.8
Les Confessions            71.9   59.6   54.3   68.6   66.7   67.8   74.0   58.1
Les Travailleurs de la Mer 80.8   83.4   79.9   87.3   83.8   80.8   85.3   83.0
The Last of the Mohicans   89.9   82.7   87.1   92.7   88.6   85.8   90.1   84.3
Average on manual en-fr    79.6   74.9   74.5   80.2   79.1   78.3   81.5   74.0

Table 11 – Link level F-scores of all methods on gold references. We denote by “Gr” the MaxEnt-based alignment algorithm with greedy search, and by “DP” the MaxEnt-based alignment with dynamic programming.
On the manual en-fr corpus, the 2D CRF model achieves results comparable to Hunalign and BMA. On the BAF corpus, it outperforms GMA, Hunalign, Gargantua, and MaxEnt with DP. Note that the 2D CRF model uses far fewer features than the MaxEnt model: feature family 4 of the MaxEnt model, which helped a lot according to Table 4 but is quite memory-consuming, is not used in the current implementation of the 2D CRF model. Since this method is still in development, there is much room for improvement, as we detail in the following.
5.3.4 Analyses of the 2D CRF model
The recall problem The 2D CRF model predicts alignment links in two steps: it first predicts all sentence-level links; it then converts sentence-level links into alignment links. One way of analyzing the model's performance is to look at its sentence-level precision and recall. Table 12 displays this information for the test corpus. For comparison, we also report the performance of the MaxEnt model with DP using sentence-level measures.
While the 2D CRF model achieves a better sentence-level precision than the MaxEnt model, its sentence-level recall is much worse, leading to a drop in the overall F-score. In the next paragraph, we investigate in more detail which kinds of links are likely to be missed by the CRF model.
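Sentence-level precision, recall and F-score as used in Table 12 can be computed as follows, viewing a hypothesis and a reference alignment each as a set of aligned sentence-index pairs (a small sketch, not the report's evaluation code):

```python
def sentence_level_prf(hyp_pairs, ref_pairs):
    """Precision/recall/F over aligned sentence pairs: the usual
    set-overlap measures between hypothesis and reference pair sets."""
    hyp, ref = set(hyp_pairs), set(ref_pairs)
    tp = len(hyp & ref)                      # correctly predicted pairs
    p = tp / len(hyp) if hyp else 0.0
    r = tp / len(ref) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```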
                                    2D CRF                   MaxEnt + DP
                           Precision  Recall  F-score   Precision  Recall  F-score
BAF                             95.5    74.9     84.0        72.0    81.8     76.6
Du Côté de chez Swann           96.3    92.5     94.3        97.1    94.9     96.0
Emma                            76.1    63.7     69.4        62.8    82.3     71.2
Jane Eyre                       86.6    69.6     77.2        86.7    89.3     88.0
La Faute de l'Abbé Mouret       98.2    84.5     90.8        98.9    98.9     98.9
Les Confessions                 92.6    65.4     76.6        89.3    83.2     86.1
Les Travailleurs de la Mer      97.1    82.2     89.1        90.8    93.0     91.9
The Last of the Mohicans        97.1    85.7     91.1        94.2    95.8     95.0
Average on manual en-fr         92.0    77.7     84.1        88.5    91.1     89.6
Table 12 – Comparison between the 2D CRF model and the MaxEnt model, using sentence-level measures.
Error distribution by link type The corresponding statistics are in Table 13.
Link type    in Ref.   ME+gr   ME+DP   2D CRF
0:1               20      13      18       15
1:0               21      12      18       15
1:1            1,364      68     105       64
1:2              179      60      36       98
1:3               32      17       9       29
2:1               96      54      32       54
2:2               24      22      19       20
others            27      22      15       26
total          1,763     268     252      321
Table 13 – Analyses of the errors of greedy search (ME+gr), DP search (ME+DP) and the 2D CRF model by link type, relative to the number of reference links (in Ref.), for the manual en-fr corpus. Only the link types occurring more than 5 times are reported. This filters out 27 links out of 1,790.
Compared to the MaxEnt methods, the main weakness of the current 2D CRF model lies in the prediction of 1:n and n:1 links. After a closer study of the erroneous instances, we found a common error pattern: when predicting an m:n link with m × n > 1 (that is, a 1-to-many or many-to-many link), the model often correctly labels some sentence pairs as positive, while leaving other pairs negative. Figure 6 displays an alignment prediction matrix for a passage of the book Les Confessions. The corresponding text and the correct alignment are displayed in Table 18 in the appendix. The model failed to predict the 1:2 link (13; 22, 23), only labelling (13; 22) as positive; nor did it find the 2:2 link (14, 15; 24, 25), only labelling the pair (15; 25).
The weakness of the 2D CRF model on 1-to-many and many-to-many links makes it necessary to study the edge potentials. After all, one of the reasons for using a CRF model was to take advantage of its capacity to encode dependencies between neighboring links, so as to better predict non-1:1 links. However, as shown in Table 13, this is not working as expected.
Figure 6 – An alignment prediction matrix for a passage of the book Les Confessions.

Edge potentials If we use only the state transition features in the edge potentials (that is, we omit feature families 7 and 8), we encounter the problem that certain types of state-transition features have dominating weights. Consider the correct alignment in Table 14, extracted from Jane Austen's Emma. The corresponding alignment prediction matrix for this passage (computed by an early version of the 2D CRF model, without feature families 7 or 8) is in Figure 7, where the source (EN) indices lie on the vertical axis and the target (FR) indices on the horizontal axis. The model was wrong for two pairs, (8, 8) and (9, 8), where the marginal probability of (8, 8) was much larger than that of (9, 8). However, examining the node potential of each pair, we found that the log node potential of the pair (8, 8) was −4.490, while that of (9, 8) was 0.313, much larger than the former. The difference in node potentials made sense, since the pair (9, 8) is clearly a much better match than (8, 8). Yet the 2D CRF model made mistakes on both pairs. This was because the weight of the diagonal positive-to-positive state transition feature was much larger than the weights of the other state transition features: given that (7, 7) and (9, 9) are both positive, the model incorrectly predicted (8, 8) as positive.

en7  Sixteen years had Miss Taylor been in Mr. Woodhouse's family, less as a governess than a friend, very fond of both daughters, but particularly of Emma.
fr7  Mlle Taylor était restée seize ans dans la maison de M. Woodhouse, moins en qualité d'institutrice que d'amie ; très attachée aux deux jeunes filles, elle chérissait particulièrement Emma.
en8  Between them it was more the intimacy of sisters.
en9  Even before Miss Taylor had ceased to hold the nominal office of governess, the mildness of her temper had hardly allowed her to impose any restraint; and the shadow of authority being now long passed away, they had been living together as friend and friend very mutually attached, and Emma doing just what she liked; highly esteeming Miss Taylor's judgment, but directed chiefly by her own.
fr8  Avant même que Mlle Taylor eût cessé de tenir officiellement le rôle de gouvernante, la douceur de son caractère lui permettait difficilement d'inspirer quelque contrainte ; cette ombre d'autorité s'était vite évanouie et les deux femmes vivaient depuis longtemps sur un pied d'égalité.
fr9  Tout en ayant une grande considération pour le jugement de Mlle Taylor, Emma se reposait exclusivement sur le sien !

Table 14 – A passage of a reference alignment of Jane Austen's Emma.
Figure 8 displays the alignment prediction matrix for the same passage, computed by the current version of the 2D CRF model. The problem is still not resolved, although the marginal probability of (9, 8) has increased considerably.
We examined the weights of the edge potential features, and found that the weight of the diagonal positive-to-positive feature is now smoothed. Another finding is more involved: for reasons that are still unclear, the weights of the edge potential features are all negative, except for the vertical and diagonal positive-to-positive state transition features. This situation is counter-intuitive, since we would expect at least some of the augmented translation score features to have positive weights. Moreover, the fact that the weights of edge features are almost all negative discourages horizontal or vertical positive-to-positive edges. The exact reason why the edge weights are all negative is currently under investigation.
Figure 7 – The alignment prediction matrix corresponding to the passage in Table 14, computed by an
early version of the CRF model.
Figure 8 – The alignment prediction matrix corresponding to the passage in Table 14, produced by a more
recent version of the CRF model.
Null sentences Another reason for using a 2D CRF model was to have a mechanism that expresses null links explicitly. We summarize the performance of the model on this aspect in Table 15.
Except for The Last of the Mohicans, the 2D CRF model was able to predict the majority of null sentences, although it also incorrectly labeled many sentences as null (as pointed out above, it misses many sentence pairs inside 1-to-many and many-to-many links). MaxEnt with DP gave bad results on this aspect; in particular, its statistics on BAF might partially explain its relatively poor overall performance on this corpus.
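The null-sentence counts of Table 15 can be computed along these lines. This is a sketch: links are represented here as pairs of index tuples, and only the source side is shown for brevity.

```python
def null_stats(hyp_links, ref_links, n_sent):
    """Count null (unaligned) sentences on one side: a sentence index is
    null if it appears in no link. Returns (#correct, #null in Hyp.,
    #null in Ref.), as in Table 15. Links are (src_indices, tgt_indices)
    pairs; only the source side is inspected here."""
    covered = lambda links: {i for src, _ in links for i in src}
    hyp_null = set(range(n_sent)) - covered(hyp_links)
    ref_null = set(range(n_sent)) - covered(ref_links)
    return len(hyp_null & ref_null), len(hyp_null), len(ref_null)
```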
We therefore conclude that the current 2D CRF model tends to achieve good sentence-level precision but poor sentence-level recall, leading to unsatisfactory overall performance. An important source of the low recall is the partial prediction of non-1:1 links. That is to say, we have not yet made the edge potentials achieve the expected effects, and the reason for this is not well understood. Three possible directions for improvement are currently under investigation:
— reinforce the node potentials with additional features. If a node potential is too low, it is unlikely that the edge potentials can compensate; this is a common pattern of wrong predictions in 1-to-many links. We are trying to add features that use simple word alignment information to the node potential, e.g. features that encode consecutive strongly parallel word pairs.
— change the way training instances are constructed. The training criterion is to maximize the overall likelihood of the training instances. With the current construction, in each training instance easy negative nodes and negative-to-negative edges concentrate the probability mass, and this bias might impact the learning of the model. We will try to construct training instances that reproduce more closely the computations involved at test time, that is, first run BMA on the training bi-text, pick the gaps that BMA leaves unaligned, and use them as training instances.
— design new edge features. Clearly, the effects of feature families 7 and 8 are limited.
                                       2D CRF                        MaxEnt + DP
                           #Correct  #Null      #Null     #Correct  #Null      #Null
                                     (in Hyp.)  (in Ref.)           (in Hyp.)  (in Ref.)
BAF                             672      1,311       714         91       150       714
Du Côté de chez Swann             8         27         9          3         5         9
Emma                             28         85        41          2         2        41
Jane Eyre                         7         77        10          0         0        10
La Faute de l'Abbé Mouret         2         52         2          1         1         2
Les Confessions                  11         96        11          2         4        11
Les Travailleurs de la Mer        3         78         5          0         2         5
The Last of the Mohicans          3         37        12          2         2        12
Table 15 – Performance of the 2D CRF model and the MaxEnt model on predicting null sentences. “#Null
in Hyp.” is the number of unaligned sentences in the hypothesis alignment computed by the model; “#Null in
Ref.” is the number of unaligned sentences in the reference alignment; “#Correct” is the number of correctly
predicted null sentences.
6 Conclusion
In this report, we have presented a large-scale study of sentence alignment using a small corpus of reference alignments, and two large corpora containing dozens of coarsely aligned copyright-free novels for English-Spanish and English-French. We have shown that these coarse alignments, once refined, were good enough to compute approximate performance measures for the task at hand, and confirmed the general intuition that automatic sentence alignment for novels is still far from perfect; some translations appear to be particularly difficult to align with the original text for all existing methods.
Borrowing ideas from previous studies on unsupervised and supervised sentence alignment, we have proposed and evaluated a new MaxEnt alignment algorithm, and shown that it performs better than several strong baselines, even if there remains a lot of room for improvement. Based on analyses of the problems of the MaxEnt method, we have further proposed a more complex 2D CRF model, which can encode the dependencies between links, and is expected to deal better with null links, a difficulty for all dynamic-programming-based methods. For the moment, the 2D CRF model does not perform as well as other state-of-the-art methods; it is still under development, and the room for improvement is large. We also plan to study additional, arguably more complex, language pairs to get a more complete picture of the actual complexity of sentence alignment.
Acknowledgments
This work is partially supported by the TransRead project (ANR CONTINT 2011) funded by the French National Research Agency. We have made good use of the alignment data from FarkasTranslations.com (© 2014) and would thus like to thank András Farkas for making his multi-parallel corpus of manually aligned books publicly available.
References
Frédérique Bisson and Christian Fluhr. Sentence alignment in bilingual corpora based on crosslingual querying. In Proceedings of RIAO’2000, pages 529–542, Paris, France, 2000.
Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, New York,
NY, USA, 2004. ISBN 0521833787.
Fabienne Braune and Alexander Fraser. Improved unsupervised sentence alignment for symmetrical and
asymmetrical parallel corpora. In Coling 2010: Posters, pages 81–89, Beijing, China, 2010. Coling 2010
Organizing Committee.
Peter F. Brown, Jennifer C. Lai, and Robert L. Mercer. Aligning sentences in parallel corpora. In Proceedings
of the 29th annual meeting on Association for Computational Linguistics, 1991, Berkeley, California, pages
169–176, 1991.
Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. The mathematics
of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311, 1993.
David Burkett and Dan Klein. Fast inference in phrase extraction models with belief propagation. In
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, pages 29–38, Montréal, Canada, 2012. URL http://www.
aclweb.org/anthology/N12-1004.
Stanley F. Chen. Aligning sentences in bilingual corpora using lexical information. In Proceedings of the
31st annual meeting on Association for Computational Linguistics, ACL ’93, pages 9–16, 1993.
Fabien Cromierès and Sadao Kurohashi. An alignment algorithm using belief propagation and a structure-based distortion model. In Proceedings of the 12th Conference of the European Chapter of the ACL
(EACL 2009), pages 166–174, Athens, Greece, March 2009. Association for Computational Linguistics.
URL http://www.aclweb.org/anthology/E09-1020.
Yonggang Deng, Shankar Kumar, and William Byrne. Segmentation and alignment of parallel text for
statistical machine translation. Natural Language Engineering, 13(03):235–260, 2007.
William A. Gale and Kenneth W. Church. A program for aligning sentences in bilingual corpora. In
Proceedings of the 29th annual meeting of the Association for Computational Linguistics, pages 177–184,
Berkeley, California, 1991.
William A. Gale and Kenneth W. Church. A program for aligning sentences in bilingual corpora. Computational Linguistics, 19(1):75–102, 1993.
Martin Kay and Martin Röscheisen. Text-translation alignment. Computational Linguistics, 19(1):121–142,
1993.
Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. MT summit, 5:79–86, 2005.
Olivier Kraif and Agnès Tutin. Using a bilingual annotated corpus as a writing aid: An application for
academic writing for EFL users. In In Natalie Kübler (Ed.), editor, Corpora, Language, Teaching, and
Resources: From Theory to Practice. Selected papers from TaLC7, the 7th Conference of Teaching and
Language Corpora. Peter Lang, Bruxelles, 2011.
John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International
Conference on Machine Learning, ICML ’01, pages 282–289, San Francisco, CA, USA, 2001.
Fethi Lamraoui and Philippe Langlais. Yet another fast, robust and open source sentence aligner. Time to reconsider sentence alignment? In Proceedings of the XIV Machine Translation Summit, pages 77–84,
Nice, France, Sept. 2013.
Philippe Langlais. A System to Align Complex Bilingual Corpora. Technical report, CTT, KTH, Stockholm,
Sweden, Sept. 1998.
Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45:503–528, 1989.
Xiaoyi Ma. Champollion: A robust parallel text sentence aligner. In Proceedings of the Language Resources
and Evaluation Conference (LREC 2006), Genoa, Italy, 2006.
I. Dan Melamed. Automatic detection of omissions in translations. In Proceedings of the 16th Conference
on Computational Linguistics - Volume 2, COLING ’96, pages 764–769, 1996.
I. Dan Melamed. Bitext maps and alignment via pattern recognition. Computational Linguistics, 25:107–130,
1999.
Robert C. Moore. Fast and accurate sentence alignment of bilingual corpora. In Stephen D. Richardson,
editor, Proceedings of the annual meeting of the Association for Machine Translation in the Americas
(AMTA 2002), Lecture Notes in Computer Science 2499, pages 135–144, Tiburon, CA, USA, 2002. Springer
Verlag.
Eva Mujdricza-Maydt, Huiqin Koerkel-Qu, Stefan Riezler, and Sebastian Pado. High-precision sentence
alignment by bootstrapping from wood standard annotations. The Prague Bulletin of Mathematical Linguistics, (99):5–16, April 2013.
Dragos Stefan Munteanu and Daniel Marcu. Improving machine translation performance by exploiting
non-parallel corpora. Computational Linguistics, 31(4):477–504, 2005.
Kevin P. Murphy, Yair Weiss, and Michael I. Jordan. Loopy belief propagation for approximate inference:
An empirical study. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence,
UAI’99, pages 467–475, Stockholm, Sweden, 1999. ISBN 1-55860-614-9. URL http://dl.acm.org/
citation.cfm?id=2073796.2073849.
Jan Niehues and Stephan Vogel. Discriminative word alignment via alignment matrix modeling. In Proceedings of the Third Workshop on Statistical Machine Translation, pages 18–25, Columbus, Ohio, 2008. URL
http://www.aclweb.org/anthology/W/W08/W08-0303.
Clément Pillias and Pierre Cubaud. Bilingual reading experiences: What they could be and how to design
for them. In Proceedings of IFIP Interact 2015, Bamberg, Germany, to appear, 2015.
Rico Sennrich and Martin Volk. MT-based sentence alignment for OCR-generated parallel texts. In The
Ninth Conference of the Association for Machine Translation in the Americas (AMTA 2010), Denver, CO,
November 2010.
Michel Simard. The BAF: a corpus of English-French bitext. In First International Conference on Language
Resources and Evaluation, volume 1, pages 489–494, Granada, Spain, 1998.
Michel Simard and Pierre Plamondon. Bilingual sentence alignment: Balancing robustness and accuracy.
Machine Translation, 13(1):59–80, 1998.
Michel Simard, George F. Foster, and Pierre Isabelle. Using cognates to align sentences in bilingual corpora.
In Ann Gawman, Evelyn Kidd, and Per-Åke Larson, editors, Proceedings of the 1993 Conference of the
Centre for Advanced Studies on Collaborative Research, October 24-28, 1993, Toronto, Ontario, Canada,
2 Volume, pages 1071–1082, 1993.
Jason R. Smith, Chris Quirk, and Kristina Toutanova. Extracting parallel sentences from comparable corpora
using document level alignment. In Human Language Technologies: The 2010 Annual Conference of the
North American Chapter of the Association for Computational Linguistics, HLT ’10, pages 403–411, 2010.
Charles Sutton. Efficient Training Methods for Conditional Random Fields. PhD thesis, University of
Massachusetts, February 2008.
Jakob Uszkoreit, Jay M. Ponte, Ashok C. Popat, and Moshe Dubiner. Large scale parallel document mining
for machine translation. In Proceedings of the 23rd International Conference on Computational Linguistics,
COLING ’10, pages 1101–1109, Beijing, China, 2010.
Dániel Varga, László Németh, Péter Halácsy, András Kornai, Viktor Trón, and Viktor Nagy. Parallel corpora
for medium density languages. In Proceedings of RANLP 2005, pages 590–596, Borovets, Bulgaria, 2005.
Jean Véronis and Philippe Langlais. Evaluation of Parallel Text Alignment Systems. In Jean Véronis, editor,
Parallel Text Processing, Text Speech and Language Technology Series, chapter X, pages 369–388. Kluwer
Academic Publishers, 2000.
Martin J. Wainwright and Michael I. Jordan. Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn., 1(1-2):1–305, January 2008.
Yong Xu and François Yvon. Corpus de référence pour l’alignement de phrases. Technical Report 1.2, LIMSI,
CNRS, Université Paris-Saclay, April 2014.
Yong Xu, Aurélien Max, and François Yvon. Sentence alignment for literary texts. Linguistic Issues in
Language Technology, 12(6):25 pages, Oct. 2015.
Qian Yu, Aurélien Max, and François Yvon. Aligning bilingual literary works: a pilot study. In Proceedings
of the NAACL-HLT 2012 Workshop on Computational Linguistics for Literature, pages 36–44, Montréal,
Canada, June 2012a. URL http://www.aclweb.org/anthology/W12-2505.
Qian Yu, Aurélien Max, and François Yvon. Revisiting sentence alignment algorithms for alignment visualization and evaluation. In Proceedings of the 5th Workshop on Building and Using Comparable Corpora
(BUCC), Istanbul, Turkey, 2012b.
Appendices
In Table 16, we report the performance of the baseline tools on the large corpus auto en-fr; the results
on the corpus auto en-es are provided in Table 17. In these tables, “Gr” denotes the MaxEnt-based
alignment algorithm with greedy search, and “DP” the MaxEnt-based alignment algorithm with dynamic
programming.
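The scores in both tables are link-level F-scores: the set of predicted alignment links is compared against the set of reference links, and a link counts as correct only if it matches a reference link exactly. As a minimal sketch of this metric (not the project's actual evaluation code; the toy links below are hypothetical), a link can be represented as a pair of sentence-index tuples:

```python
def link_f_score(predicted, reference):
    """Link-level F-score between two sets of alignment links.

    Each link is a hashable pair ((source indices), (target indices)),
    e.g. ((10, 11), (19, 20)) for a 2-2 link; a link counts as correct
    only if it matches a reference link exactly.
    """
    predicted, reference = set(predicted), set(reference)
    if not predicted or not reference:
        return 0.0
    correct = len(predicted & reference)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(reference)
    return 2 * precision * recall / (precision + recall)

# Hypothetical toy example: the hypothesis recovers two of the three
# reference links but wrongly splits the third 1-2 link.
ref = [((10, 11), (19, 20)), ((12,), (21,)), ((13,), (22, 23))]
hyp = [((10, 11), (19, 20)), ((12,), (21,)), ((13,), (22,)), ((13,), (23,))]
print(round(link_f_score(hyp, ref), 3))  # P = 2/4, R = 2/3, F ≈ 0.571
```

Note that under this exact-match criterion, splitting or merging a single reference link penalises both precision and recall, which is why compounding segmentation differences hurt the reported scores on literary texts.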
Link level F-score
                                    GMA    BMA    Hun   Garg   Yasa     Gr     DP
20000 Lieues sous les Mers         97.6   96.4   97.2   98.1   98.8   95.9   97.7
Alice’s Adventures in Wonderland   81.2   74.3   80.0   83.2   82.7   76.2   81.5
A Study in Scarlet                 89.0   78.2   83.8   85.0   86.4   85.2   89.0
Candide                            85.7   78.8   82.5   79.9   86.4   82.6   87.9
Germinal                           97.2   94.7   97.5   95.1   97.9   97.6   97.3
La Chartreuse de Parme             97.1   94.1   96.1   94.2   97.0   96.8   97.4
La Dame aux Camelias               94.0   91.1   89.6   90.7   94.6   93.8   94.9
Le Grand Meaulnes                  93.3   91.0   93.4   92.6   94.3   94.7   94.1
Le Rouge et le Noir                96.9   94.7   96.4   94.3   97.3   97.2   97.9
Les Trois Mousquetaires            88.0   83.3   89.2   83.6   87.9   88.0   89.9
Le Tour du Monde En 80 Jours       76.4   63.9   74.9   68.9   75.8   75.8   78.5
L’île Mystérieuse                  93.4   93.3   96.0   93.5   94.6   94.5   94.8
Madame Bovary                      93.9   90.7   93.9   91.8   95.0   94.5   94.1
Moll Flanders                      80.5   76.9   83.1   78.0   82.7   81.8   83.3
Notre Dame de Paris                93.5   91.1   92.8   90.6   94.2   94.2   94.1
Pierre et Jean                     91.4   89.3   91.5   88.6   90.7   91.9   91.5
Pride and Prejudice                62.1   47.1   56.6   56.7   57.7   56.4   62.3
Rodney Stone                       88.6   83.7   90.2   85.5   89.3   90.7   90.1
The Fall of The House of Usher     99.5   98.4   99.5   97.5   95.7   97.4   98.4
The Great Shadow                   81.7   74.9   83.4   79.4   86.0   84.0   82.8
The Hound of The Baskervilles      92.8   90.5   92.5   91.1   93.0   93.7   93.5
Therese Raquin                     85.6   80.3   84.7   82.0   86.7   85.4   85.1
Three Men in a Boat                85.3   76.5   81.7   81.8   86.2   85.6   87.1
Voyage au Centre de la Terre       84.9   81.6   83.7   83.5   84.0   86.8   85.9
Mean                               88.7   84.0   87.9   85.7   88.9   88.7   89.6

Table 16 – F-scores on the large approximate reference set auto en-fr
Link level F-score
                                    GMA    BMA    Hun   Garg   Yasa     Gr     DP
20000 Lieues sous les Mers         88.9   85.2   89.6   89.5   90.2   84.9   89.2
Alice’s Adventures in Wonderland   74.2   65.3   66.7   72.2   76.2   71.4   74.4
Anna Karenina Volume I             72.9   69.3   75.4   77.8   73.6   70.9   74.2
Anna Karenina Volume II            69.5   68.2   73.3   75.6   73.2   69.3   72.4
A Study in Scarlet                 94.3   91.3   94.1   96.3   96.9   93.1   92.8
Candide                            76.8   64.2   71.7   63.2   80.4   71.7   72.8
Die Verwandlung                    88.6   80.3   84.9   85.6   87.2   83.1   88.8
Don Quijote de La Mancha           87.9   82.1   85.6   88.8   89.6   81.9   86.6
Jane Eyre                          60.3   48.8   43.7   58.4   64.3   60.9   58.3
Les Trois Mousquetaires            82.3   79.2   82.9   83.8   83.3   77.6   83.0
Le Tour du Monde en 80 Jours       78.8   71.1   77.8   77.4   81.4   74.4   78.2
L’île Mystérieuse                  77.9   82.5   81.2   80.7   80.1   81.1   79.4
Sense and Sensibility              92.4   88.6   89.9   92.0   93.4   87.8   91.6
The Adventures of Sherlock Holmes  93.7   91.8   93.0   91.3   93.8   91.9   93.5
The Fall of the House of Usher     94.6   98.0   96.4   94.9  100.0   98.8   98.4
The Hound of the Baskervilles      96.5   95.3   95.4   96.8   97.6   95.4   96.2
Voyage au Centre de la Terre       77.2   72.2   75.6   79.1   77.5   74.7   76.4
Mean                               82.8   78.4   81.0   82.6   84.6   80.5   82.7

Table 17 – F-scores on the large approximate reference set auto en-es
en10–en11 ↔ fr19–fr20
EN: My mother’s circumstances were more affluent; she was daughter of a Mons. Bernard, minister, and possessed a considerable share of modesty and beauty; indeed, my father found some difficulty in obtaining her hand.
FR: Ma mère, fille du ministre Bernard, était plus riche: elle avait de la sagesse et de la beauté. Ce n’était pas sans peine que mon père l’avait obtenue.

en12 ↔ fr21
EN: The affection they entertained for each other was almost as early as their existence; at eight or nine years old they walked together every evening on the banks of the Treille, and before they were ten, could not support the idea of separation.
FR: Leurs amours avaient commencé presque avec leur vie; dès l’âge de huit à neuf ans ils se promenaient ensemble tous les soirs sur la Treille; à dix ans ils ne pouvaient plus se quitter.

en13 ↔ fr22–fr23
EN: A natural sympathy of soul confirmed those sentiments of predilection which habit at first produced; born with minds susceptible of the most exquisite sensibility and tenderness, it was only necessary to encounter similar dispositions; that moment fortunately presented itself, and each surrendered a willing heart.
FR: La sympathie, l’accord des âmes, affermit en eux le sentiment qu’avait produit l’habitude. Tous deux, nés tendres et sensibles, n’attendaient que le moment de trouver dans un autre la même disposition, ou plutôt ce moment les attendait eux-mêmes, et chacun d’eux jeta son coeur dans le premier qui s’ouvrit pour le recevoir.

en14–en15 ↔ fr24–fr25
EN: The obstacles that opposed served only to give a degree of vivacity to their affection, and the young lover, not being able to obtain his mistress, was overwhelmed with sorrow and despair. She advised him to travel – to forget her.
FR: Le sort, qui semblait contrarier leur passion, ne fit que l’animer. Le jeune amant ne pouvant obtenir sa maîtresse se consumait de douleur: elle lui conseilla de voyager pour l’oublier.

en16 ↔ fr26–fr27
EN: He consented – he travelled, but returned more passionate than ever, and had the happiness to find her equally constant, equally tender.
FR: Il voyagea sans fruit, et revint plus amoureux que jamais. Il retrouva celle qu’il aimait tendre et fidèle.
Table 18 – The correct alignment of a passage of Jean-Jacques Rousseau’s Les Confessions, corresponding
to the alignment prediction matrix in Figure 6.