Comparing Optimization Criteria and Methods
for Word Alignment
Tagyoung Chung
Shaojun Zhao
Daniel Gildea
The University of Rochester
Computer Science Department
Rochester, NY 14627
Technical Report 958
July 2010
Abstract
We consider word alignment within the “bag-of-words” framework of IBM Model 1, explore
alternative optimization criteria and methods, and show that neither EM nor the probabilistic
constraints are necessary for learning good parameters.
The University of Rochester Computer Science Department supported this work.
1 Introduction
IBM word alignment models (Brown et al., 1993; Och and Ney, 2003) are the most commonly used
models for solving word alignment problems, and Model 1 is the simplest of them. It treats
sentences as “bags of words” and uses the EM algorithm (Dempster et al., 1977) to learn word
translation probabilities. In addition to machine translation, Model 1 is widely used for tasks such
as sentence alignment of bilingual corpora (Moore, 2002), aligning syntactic tree fragments (Ding
et al., 2003), and matching words to pictures (Berg et al., 2004).
In this paper, we explore various optimization criteria and optimization algorithms for word
alignment within the bag-of-words framework of IBM Model 1. We find that, with some modification, we can solve a problem equivalent to IBM Model 1 with a quasi-Newton method such as
L-BFGS (Nocedal, 1980) instead of EM. We show that the modification neither degrades performance
nor adds computational complexity.
We also find that a squared-error formulation of the objective function obtains performance comparable to the maximum log-likelihood objective used in IBM Model 1 for predicting hand-annotated
alignments, and has the benefit of learning word fertilities at no additional computational cost.
In solving the least squared-error problem, we show that a multiplicative update method
for the non-negative matrix factorization problem is a fast and efficient solution compared to
conventional approaches such as Newton's methods.
The main goal of this paper is to show that various formulations of the problem and different approaches to its solution are possible. The more general objective functions and optimization algorithms presented in this paper could facilitate easier inclusion of more comprehensive features,
such as phonological or morphological similarities for language pairs with high lexical similarity,
or other arbitrary features.
2 Measuring alignment quality
Throughout this paper, we use the widely used metrics for word alignment quality put forth by Och and Ney (2003).
The alignment error rate (AER) metric requires a human-aligned test corpus
with sets of “sure” (S) and “possible” (P) alignments, where S ⊂ P.
It is well known that AER does not accurately predict the performance of the final machine translation system (Fraser and Marcu, 2007). However, we consider it a good enough metric for
comparing how different methods perform in predicting hand-annotated alignments.
Given a set of alignments A, Och and Ney (2003) define precision, recall, and alignment error rate
(AER) as follows:
$$\mathrm{precision}(A, P) = \frac{|P \cap A|}{|A|}$$
$$\mathrm{recall}(A, S) = \frac{|S \cap A|}{|S|}$$
$$\mathrm{AER}(A, P, S) = 1 - \frac{|P \cap A| + |S \cap A|}{|S| + |A|}$$
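As a concrete illustration (ours, not part of the original report), these metrics can be computed directly from sets of aligned index pairs; the function and variable names below are our own.

```python
def alignment_metrics(A, S, P):
    """Precision, recall, and AER from sets of (source, target) index pairs.

    A: predicted alignments; S: "sure" gold links; P: "possible" gold links,
    with S a subset of P, following Och and Ney (2003).
    """
    precision = len(P & A) / len(A)
    recall = len(S & A) / len(S)
    aer = 1.0 - (len(P & A) + len(S & A)) / (len(S) + len(A))
    return precision, recall, aer

# Toy example over word positions:
A = {(0, 0), (1, 2), (2, 1)}
S = {(0, 0), (2, 1)}
P = S | {(1, 2), (1, 1)}
print(alignment_metrics(A, S, P))  # (1.0, 1.0, 0.0)
```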
3 IBM Model 1
IBM model 1 is a strictly concave optimization problem that has the following objective function
and constraints:
$$\max \; \sum_{s=1}^{S} \log \sum_{a_1^J} P(a_1^J, f_1^J \mid e_0^I)$$
$$\text{s.t.} \quad \sum_{f} p(f \mid e) = 1 \quad (\text{for all } e) \tag{1}$$
$$\qquad\;\; p(f \mid e) \ge 0 \quad (\text{for all } e \text{ and } f)$$
where $f_1^J$ is a source sentence of length J, $e_0^I$ is a target sentence of length I with an extra null word
$e_0$, $a_1^J$ is the alignment between $f_1^J$ and $e_0^I$, and S is the number of parallel sentences (for simplicity,
we drop the subscript s on a, f, and e). Alignments $a : j \to i$ are hidden, and each word $f_j$ in a source
sentence is aligned to exactly one word $e_i$ in the target sentence. The objective function is the
log-likelihood of the parallel data. The EM algorithm is conventionally used for solving this optimization problem.
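For reference, here is a minimal sketch of the standard EM updates for the Model 1 translation probabilities t(f|e); it is our own illustration rather than the report's implementation, and the data format and names are assumptions.

```python
from collections import defaultdict

def model1_em(bitext, iterations=5):
    """Minimal EM sketch for Model 1 translation probabilities t(f|e).

    bitext: list of (french_words, english_words) pairs; a NULL token is
    prepended to every English sentence to play the role of e_0.
    """
    t = defaultdict(float)
    for fr, en in bitext:                     # initialize co-occurring pairs uniformly
        for f in fr:
            for e in ["NULL"] + en:
                t[(f, e)] = 1.0
    for _ in range(iterations):
        count = defaultdict(float)            # expected counts c(f, e)
        total = defaultdict(float)            # normalizers c(e)
        for fr, en in bitext:
            en0 = ["NULL"] + en
            for f in fr:
                z = sum(t[(f, e)] for e in en0)   # E-step: posterior over alignments of f
                for e in en0:
                    p = t[(f, e)] / z
                    count[(f, e)] += p
                    total[e] += p
        for f, e in count:                    # M-step: renormalize so that sum_f t(f|e) = 1
            t[(f, e)] = count[(f, e)] / total[e]
    return t

# Toy usage:
bitext = [(["le", "chien"], ["the", "dog"]), (["le", "chat"], ["the", "cat"])]
t = model1_em(bitext)
print(t[("chien", "dog")])
```

Each E-step normalizes over the possible English positions for a French word (including the null word), and each M-step renormalizes the expected counts so that the constraint in Equation (1) holds.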
4 Using L-BFGS
Compared to EM, quasi-Newton methods such as L-BFGS generalize more easily to
more complex, non-generative models. However, they are designed for unconstrained optimization problems, and we have found the normalization constraints of Model 1 (Equation (1))
to be helpful. To solve the problem in Section 3 using L-BFGS, we need to convert the original
optimization problem with affine constraints into an equivalent optimization problem with only
bound constraints.
Consider a general convex optimization problem with affine constraints:
$$\min \; f(x) \quad \text{s.t.} \quad Dx = b, \;\; x \ge 0$$
where $f(x)$ is a convex function. Friedlander et al. (1994) showed that it is equivalent to the following optimization problem:
$$\min_{x,y,z} \; F(x, y, z) \equiv \frac{1}{2}\left( \|\nabla f(x) + D^T y - z\|^2 + \|Dx - b\|^2 + (x^T z)^2 \right) \quad \text{s.t.} \quad x \ge 0, \;\; z \ge 0$$
The newly formulated optimization problem can be solved using L-BFGS-B, a variant of L-BFGS that can handle bound constraints on variables (Zhu et al., 1994; Benson et al., 2007). In
order to apply L-BFGS-B, we need to provide the values of the objective function and its derivatives.
For notational convenience, we define u, v, and w to be:
$$u \equiv \nabla f(x) + D^T y - z, \qquad v \equiv Dx - b, \qquad w \equiv x^T z$$
Now, the new objective function and its derivatives can be computed as follows:
$$F = \frac{1}{2}\left( \|u\|^2 + \|v\|^2 + w^2 \right)$$
$$\nabla_x F = \nabla^2 f(x)\, u + D^T v + w z \tag{2}$$
$$\nabla_y F = D u$$
$$\nabla_z F = -u + w x$$
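A direct transcription of these quantities into code can look like the sketch below (our illustration; `grad_f` and the Hessian-vector product `hvp` are assumed to be supplied by the caller, and D, b are the constraint data).

```python
import numpy as np

def friedlander_objective(x, y, z, grad_f, hvp, D, b):
    """F(x, y, z) and its gradients for the bound-constrained reformulation;
    hvp(x, u) should return (nabla^2 f(x)) @ u."""
    u = grad_f(x) + D.T @ y - z
    v = D @ x - b
    w = x @ z
    F = 0.5 * (u @ u + v @ v + w ** 2)
    grad_x = hvp(x, u) + D.T @ v + w * z   # Equation (2)
    grad_y = D @ u
    grad_z = -u + w * x
    return F, grad_x, grad_y, grad_z
```

These are exactly the quantities that L-BFGS-B needs at every evaluation of the reformulated problem.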
Notice that we need to compute the Hessian of the original convex function in Equation (2).
Fortunately, this matrix is sparse, and we do not need to store it explicitly, since all we need is the
vector $\nabla^2 f(x)\, u$, which can be computed exactly. To further speed up the computation, we use the
following approximation:
$$\nabla^2 f(x)\, u \approx \frac{\nabla f(x + \eta u) - \nabla f(x)}{\eta}$$
for some small $\eta$.
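A minimal sketch of this finite-difference Hessian-vector product, assuming a gradient function `grad_f` is available; the quadratic test function is our own example.

```python
import numpy as np

def hessian_vector_product(grad_f, x, u, eta=1e-6):
    """Approximate (nabla^2 f(x)) @ u by a forward difference of the gradient,
    so the sparse Hessian never has to be formed explicitly."""
    return (grad_f(x + eta * u) - grad_f(x)) / eta

# Check on f(x) = 0.5 * x^T A x, whose Hessian is A:
A = np.array([[2.0, 0.5], [0.5, 1.0]])
grad_f = lambda x: A @ x
x = np.array([1.0, -1.0])
u = np.array([0.3, 0.7])
print(hessian_vector_product(grad_f, x, u), A @ u)   # the two should nearly match
```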
5 Least-norm problem
Staying within the framework of IBM Model 1 (bag of words and uniform distortion), we can take
an alternative view of the objective function that abandons the probabilistic constraints, making
even bound constraints unnecessary.
A sentence is a “bag of words” represented as a column vector in which each row is associated with a word in the vocabulary. The number in each row is an integer indicating how
many times that word occurs in the sentence. Assume a French-English sentence pair where
the English sentence is e, the French sentence is f, the English vocabulary size is m, and the French
vocabulary size is n. Then the sentence pair can be represented as:
$$e = \begin{pmatrix} e_1 & e_2 & \cdots & e_m \end{pmatrix}^T, \qquad f = \begin{pmatrix} f_1 & f_2 & \cdots & f_n \end{pmatrix}^T$$
We can imagine a lexical translation matrix A, which satisfies
$$A e = f$$
Given a parallel bilingual corpus, which consists of S sentence pairs, we can formulate the parameter learning problem as minimizing the error between the French sentences and the predicted
translations of the corresponding English sentences:
$$\min_A \; \sum_{i=1}^{S} \| A e_i - f_i \|_p^p$$
where various vector norms can be used as $\|\cdot\|_p$.
Regardless of the norm we choose, this is a convex optimization problem and has a unique
minimum. (While Model 1 provides a similar convex problem, its maximum-likelihood objective
function does not meet the criteria of a vector norm.) If we choose the $\ell_2$ norm, $\|x\|_2 = \sqrt{\sum_i x_i^2}$,
the problem is a least squared-error problem, and we are trying to find an optimal matrix
A that minimizes the sum of squared errors.
Alternatively, we can use the $\ell_1$ norm, $\|x\|_1 = \sum_i |x_i|$, and try to minimize the sum of absolute
errors. In certain problems, $\ell_1$ has been found to give better results, as it is more robust to outliers;
$\ell_2$ has the advantage that the minimization problem has a closed-form solution.
As a notational convenience, the problem with the $\ell_2$ norm can be rewritten in matrix notation as
minimizing the Frobenius norm of the error matrix:
$$\min_A \; \| A E - F \|_F$$
where E and F are matrices in which column i corresponds to the ith sentence pair $e_i$ and $f_i$. Since
$\|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^2}$, solving the problem using the Frobenius norm is equivalent to solving
the problem with the $\ell_2$ norm.
To make the problem comparable to Model 1, we can add an additional row to each sentence
vector and fix it to 1, providing the model with a bias term. We included the bias term in all our
experiments.
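As a small, self-contained sketch (ours, not from the report), the count matrices and the Frobenius-norm objective can be built as follows; placing the bias row on the English side only is our assumption, and all names are illustrative.

```python
import numpy as np

def count_matrix(sentences, vocab, add_bias=False):
    """Column i is the bag-of-words count vector of sentence i; an optional
    extra row fixed to 1 provides the bias term."""
    rows = len(vocab) + (1 if add_bias else 0)
    M = np.zeros((rows, len(sentences)))
    index = {w: r for r, w in enumerate(vocab)}
    for i, sent in enumerate(sentences):
        for w in sent:
            M[index[w], i] += 1
        if add_bias:
            M[-1, i] = 1.0
    return M

def frobenius_error(A, E, F):
    """The objective ||AE - F||_F minimized over A."""
    return np.linalg.norm(A @ E - F, ord="fro")

# Toy corpus of two sentence pairs:
E = count_matrix([["the", "dog"], ["the", "cat"]], ["the", "dog", "cat"], add_bias=True)
F = count_matrix([["le", "chien"], ["le", "chat"]], ["le", "chien", "chat"])
A = np.zeros((F.shape[0], E.shape[0]))
print(frobenius_error(A, E, F))
```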
6 Solving the least-norm problem
We devised the formulation of the optimization problem and an algorithm to solve it in three steps.
First, we chose whether to use the $\ell_1$ or $\ell_2$ norm. Second, we chose an algorithm to solve the optimization
problem. Third, we chose the detailed specifics of the algorithm, such as the initialization of the translation
matrix A and a stopping criterion.
6.1 Choice of norm
Since the $\ell_1$ and $\ell_2$ norms are both commonly used in optimization problems, we ran an experiment to
see which one is better. We created a very small training set derived from the Hansard corpus.
Our test set consisted of 447 sentence pairs with gold-standard word alignments. After
training, given a sentence pair in the test set, we created Viterbi alignments for each direction
(from French to English and from English to French): for each word on one side, we selected the
word on the other side with the best score according to the learned translation matrix A. The two
sets of alignments were then merged by taking the intersection.
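A sketch of this Viterbi-style extraction and intersection merge, under our own conventions: `score` is a per-sentence-pair matrix of scores looked up from the learned translation matrix for that direction, and all names and toy values are illustrative.

```python
import numpy as np

def viterbi_links(score, J, I):
    """For each source position j (0..J-1), link it to the target position i
    with the highest score; score[j, i] holds the learned score for that pair."""
    return {(j, int(np.argmax(score[j, :I]))) for j in range(J)}

# Toy per-sentence score matrices looked up from the learned A in each direction:
score_fe = np.array([[0.9, 0.1], [0.2, 0.8], [0.3, 0.4]])   # French -> English, shape (J, I)
score_ef = np.array([[0.7, 0.1, 0.2], [0.1, 0.6, 0.5]])     # English -> French, shape (I, J)

fe = viterbi_links(score_fe, J=3, I=2)
ef = {(j, i) for (i, j) in viterbi_links(score_ef, J=2, I=3)}  # flip to (French, English) order
links = fe & ef    # keep only links on which both directions agree (intersection)
```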
The matrix A resulting from the $\ell_1$ norm yielded a precision of 0.67, recall of
0.19, and AER of 0.68, while the matrix A resulting from the $\ell_2$ norm yielded a precision of
0.69, recall of 0.34, and AER of 0.53. Since optimizing the $\ell_2$ norm yielded a much better result,
we chose the $\ell_2$ norm for the rest of the experiments.
6.2 Choosing algorithms
We used the L-BFGS algorithm to solve the problems in Section 6.1, but we can now use other alternatives, since the objective function does not have any constraints. We tried two alternatives
that converged faster than L-BFGS. The first uses the matrix pseudoinverse (Moore, 1920; Penrose,
1955), and the second uses non-negative matrix factorization (NMF) (Lee and Seung, 2001).
The pseudoinverse provides a closed-form solution to the minimization problem:
$$A = F E^{+}$$
where $E^{+}$ is the pseudoinverse of E and can be calculated with a singular value decomposition
(Golub and Van Loan, 1996).
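In NumPy terms, this closed form is essentially a one-liner; the toy matrices below are our own illustration.

```python
import numpy as np

# Toy count matrices E (English side) and F (French side) as in Section 5:
E = np.array([[1., 1., 0.],
              [0., 1., 1.]])
F = np.array([[1., 1., 0.],
              [0., 0., 1.],
              [0., 1., 1.]])

A = F @ np.linalg.pinv(E)   # pseudoinverse (SVD-based) solution to min_A ||AE - F||_F
```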
The NMF problem formulation is, given a non-negative matrix V, to find non-negative matrices
W and H such that
$$V \approx W H$$
In the context of word translation, we can view $F \approx A E$. Lee and Seung (2001) present multiplicative update rules for W and H given a non-negative matrix V, which are guaranteed not to increase
$\|V - WH\|_F$. We can fix V to be F and H to be E, and use the following update rule to find A:
$$\tilde{A}_{ij} = A_{ij}\, \frac{(F E^T)_{ij}}{(A E E^T)_{ij}} \tag{3}$$
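A minimal NumPy sketch of this update, precomputing $FE^T$ and $EE^T$ as discussed in Section 7; the small epsilon guarding against division by zero and all names are our own additions.

```python
import numpy as np

def nmf_translation_matrix(E, F, A0, iterations=10, eps=1e-12):
    """Multiplicative update of Lee and Seung (2001) with H fixed to E:
    each step rescales A elementwise so that ||F - AE||_F does not increase."""
    A = A0.astype(float).copy()
    FEt = F @ E.T        # numerator of rule (3); computed once, outside the loop
    EEt = E @ E.T        # shared factor of the denominator; also computed once
    for _ in range(iterations):
        A *= FEt / (A @ EEt + eps)   # elementwise update; zero entries stay zero
    return A

# Toy usage with random non-negative count-like matrices:
rng = np.random.default_rng(0)
E = rng.integers(0, 3, size=(4, 6)).astype(float)   # English vocab x sentences
F = rng.integers(0, 3, size=(5, 6)).astype(float)   # French vocab x sentences
A = nmf_translation_matrix(E, F, A0=F @ E.T)
```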
We applied the two algorithms to a dataset of 133K sentence pairs. The French vocabulary
was limited to 4.08K words and the English vocabulary to 3.71K words; for
fast experiments, the vocabularies included only the most frequent words from the Hansard corpus. After
running both algorithms, we created Viterbi alignments for both directions and merged them using the
intersection, union, and “refined” methods (Och and Ney, 2003). The intersection method yielded the
best AER of the three; hence, it was used as the basis for comparison in the rest of the experiments.
The two algorithms produced alignments with almost the same AER. Alignments resulting
from the pseudoinverse of E had a precision of 0.84, recall of 0.58, and AER of 0.30. The
alignments from the NMF algorithm had a precision of 0.84, recall of 0.59, and AER of 0.30.
Although the performance of the two algorithms was comparable, NMF had two advantages
over the pseudoinverse. First, it was faster. Second, and more importantly, it produced
a sparser solution, which is desirable since we want a word in one language to be aligned to only a few
words in the other language. The algorithm using the pseudoinverse of E generates many negative
elements, which are not useful for generating word alignments. In NMF, by definition, negative
elements are not allowed in the solution, and this has the effect of forcing some elements
to zero. Furthermore, as we discuss in the next section, NMF allows us to fix some elements to
zero during initialization, which further speeds up computation.
6.3 Initialization, pruning and stopping
The NMF algorithm needs an initial matrix $A^0$ as input. Since the algorithm has a multiplicative
update rule, if an element $A^0_{ij}$ is initially zero, it stays zero in the successive updates. We tried
different initializations for $A^0$, including random and uniform initializations. We eventually
chose $A^0$ to be a word co-occurrence matrix: if $f_i$ and $e_j$ occur together in a sentence
pair in the corpus, $A^0_{ij}$ is set to the number of times $f_i$ and $e_j$ co-occur in the entire
corpus; otherwise, it is set to zero. We found that this initialization made the algorithm
converge faster than a random or uniform $A^0$, although it made almost no difference in the
final alignments.
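One way to realize this co-occurrence initialization in the matrix notation of Section 5 is sketched below (our construction; we count sentence pairs in which both words appear, which is one possible reading of the co-occurrence count).

```python
import numpy as np

# Toy count matrices as in Section 5 (columns are sentence pairs):
F = np.array([[1., 0., 2.],                  # French vocab x sentences
              [0., 1., 0.]])
E = np.array([[1., 1., 0.],                  # English vocab x sentences
              [0., 0., 1.],
              [1., 0., 1.]])

# Entry (i, j) counts the sentence pairs in which French word i and English
# word j both occur; pairs that never co-occur start at zero and, under the
# multiplicative rule (3), remain zero.
A0 = (F > 0).astype(float) @ (E > 0).astype(float).T
```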
Another addition we made to the algorithm is pruning. After each iteration, we prune elements
of A falling below a certain threshold by setting them to 0. This made the algorithm run faster and use
less memory, since there are fewer elements to update; the same kind of pruning is present in GIZA++.
We chose the floor value to be $10^{-9}$. As long as the floor value was sufficiently low,
it made no difference in the resulting alignments.
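The pruning step itself is a one-line thresholding operation; this fragment is our illustration with an arbitrary stand-in matrix.

```python
import numpy as np

A = np.abs(np.random.randn(4, 3)) * 1e-8   # stand-in for the translation matrix mid-training
A[A < 1e-9] = 0.0                          # floor entries below the threshold; the multiplicative
                                           # update then keeps them at zero in later iterations
```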
Finally, we needed a stopping criterion for the algorithm. We first tried calculating $\|F - AE\|_F$
on held-out data and stopping when it stopped decreasing. However, much like EM, we
found that the residual kept decreasing for a while after precision, recall, and AER peaked (and the
three peaked at different times). Thus, we split the hand-aligned test data into two halves and used one
half to calculate AER (the development set). The algorithm was stopped when the AER on the
development set stopped decreasing.
7 Complexity analysis
L-BFGS and EM have the same per-iteration computational complexity. Computing $\nabla f(x)$
and $\nabla^2 f(x)u$ has the same complexity as the E-step, and the computation involving D or w has the
same complexity as the M-step. (The E-step has complexity O(SIJ), where S is the size of the corpus
and I and J are the lengths of the sentences; the M-step has complexity O(mn), where m and n are the
vocabulary sizes of the two languages; computing u has complexity O(SIJ + mn).) However, because
the quadratic approximation of the objective function in L-BFGS is not as tight as EM's auxiliary
function, the number of iterations needed for convergence tends to be larger for L-BFGS than for EM.
From the update rule (3), it is clear that the numerator $FE^T$ and part of the denominator,
$EE^T$, can be computed before starting the iterative updates. Hence, only one matrix multiplication needs
to be computed in each iteration. With the naïve matrix multiplication algorithm, the per-iteration
complexity is $O(nm^2)$, where n is the size of the French vocabulary and m is the size of the English
vocabulary.
However, since the matrices involved are very sparse, using fast sparse matrix
multiplication (Yuster and Zwick, 2005) the complexity becomes $O(S^{0.7} I^{1.4} n^{1.2} + n^{2+o(1)})$, given the
upper bound $O(SI^2)$ on the number of non-zero elements in $EE^T$ and assuming $\max\{m, n\} = n$. Note that this
is sublinear in the corpus size; in our experiments, the time needed for one iteration of the NMF algorithm
was comparable to that of EM.
8 Experiments
We experimented with the three different objective functions and optimization methods on
three language pairs. Due to time constraints, we ran two sets of experiments, one for L-BFGS and one for NMF,
with smaller data sets used for L-BFGS.
For NMF, our French-English data is a subset of the Hansard corpus, with vocabulary sizes of 24.2K and 18.7K
respectively; it has 825K sentence pairs, or 13.4M words on the English side. The German-English data is a
subset of the Europarl corpus (Koehn, 2005), with vocabulary sizes of 28.3K and 17.0K respectively; it has
998K sentence pairs, or 21.4M words on the English side. The Chinese-English data is a subset of the FBIS
corpus, with vocabulary sizes of 25.4K and 19.7K respectively; it has 833K sentence pairs, or 27.3M words
on the English side.
We used hand-aligned data for each language pair as reference. The French-English data has 447
sentence pairs, the German-English data has 220 sentence pairs, and the Chinese-English data has
491 sentence pairs. Each reference set was split into two: odd-numbered sentences were used as the
development set and the rest as the test set.
For L-BFGS, smaller subsets of the data were used. Each data set consisted of
about 10K sentence pairs, and the same test sets were used.
For both sets of experiments, we ran Model 1 of GIZA++ (Och and Ney, 2003) in both directions
(source to target and target to source) as a baseline and used the resulting translation probabilities as the basis
for two sets of Viterbi alignments. We used the same stopping criterion for the baseline as for NMF:
we ran GIZA++ until the AER on the development set no longer improved on the alignments merged from the
two directions.
We repeated the same experiments with the NMF algorithm. As with Model 1, we ran the NMF
algorithm twice: in one instance the algorithm minimized $\|F - A_{ef} E\|_F$ and in the other it minimized
$\|E - A_{fe} F\|_F$. The matrix $A_{ef}$ can be interpreted as translating the
target (E) into the source (F), and the matrix $A_{fe}$ as translating the source into the
target. Again, we ran the algorithm until the AER on the development set no longer decreased. We used
the resulting matrices to create two sets of Viterbi alignments and merged the alignments.
For the L-BFGS experiments, everything was the same except that we ran L-BFGS until convergence. Tuning the number of iterations on the development sets was unnecessary because L-BFGS did not overfit as quickly as the other algorithms.
9 Results and analysis
The results of the experiments using L-BFGS are summarized in Table 1, and the results of the experiments using the NMF algorithm are summarized in Table 2. The results reported in the tables used the
intersection method to merge the two Viterbi alignments. The other merging methods also gave similar
results for Model 1 in comparison to NMF or L-BFGS, but had poorer overall AER regardless of the
optimization method. For all three language pairs, the alignments produced by all three methods (GIZA++,
L-BFGS, and the NMF algorithm) have almost the same AER. Precision tends to be slightly better with the
NMF algorithm, while GIZA++ had slightly better recall; L-BFGS had slightly better precision and recall
than GIZA++. Table 3 illustrates our finding that GIZA++ and our algorithms produce largely the same
lexical translation table.
          Method       Prec.   Rec.   AER
Fra-Eng   GIZA++(3)    0.65    0.71   0.33
          L-BFGS       0.65    0.72   0.33
Deu-Eng   GIZA++(6)    0.55    0.34   0.58
          L-BFGS       0.56    0.34   0.58
Chi-Eng   GIZA++(6)    0.64    0.40   0.51
          L-BFGS       0.65    0.41   0.49

Table 1: Comparison of L-BFGS and GIZA++. Prec. is precision, Rec. is recall, and AER is
alignment error rate. The numbers in parentheses indicate the number of iterations.
          Method       Prec.   Rec.   AER
Fra-Eng   GIZA++(3)    0.88    0.75   0.19
          NMF(3)       0.89    0.72   0.19
Deu-Eng   GIZA++(7)    0.83    0.37   0.48
          NMF(4)       0.83    0.36   0.50
Chi-Eng   GIZA++(7)    0.86    0.46   0.40
          NMF(6)       0.88    0.44   0.41

Table 2: Comparison of NMF and GIZA++. Prec. is precision, Rec. is recall, and AER is
alignment error rate. The numbers in parentheses indicate the number of iterations.
For L-BFGS, ideally, the objective function F(x, y, z) would be 0 at convergence. We
noticed that this value can be a large positive number. Nevertheless, the translation probabilities we
learned from this objective function are very similar to, or better than, those of the EM algorithm.
The translation matrix we learn from NMF is not composed of probabilities but rather of “scores”
assigned to pairs of words. We can view the sum of a row of the matrix $A_{ef}$ as the fertility of the word
associated with that row. For example, the English word government translates into French gouvernement but sometimes into le gouvernement, because the use of articles is more prevalent in French,
and the English word not translates into French ne pas or n' pas. The scores we learned reflect this:
in $A_{ef}$, the row representing government sums to 1.35, while the row representing not sums
to 2.18. Table 4 shows examples of German compounds translated into multi-word English
expressions, along with the fertilities found by our algorithm.
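As a small illustration of the fertility reading, the row sums can be computed as below; the matrix entries and index mapping are toy stand-ins chosen only to reproduce the sums quoted above.

```python
import numpy as np

# Toy stand-ins: the real A_ef comes out of the NMF training described above,
# and row_of would map an English word to its row index.
row_of = {"government": 0, "not": 1}
A_ef = np.array([[1.00, 0.35, 0.00],
                 [0.00, 1.10, 1.08]])

fertility = {w: A_ef[r, :].sum() for w, r in row_of.items()}
print(fertility)   # the report observes roughly 1.35 for "government" and 2.18 for "not"
```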
10 Conclusion
Our experiments indicate that neither the EM algorithm nor the generative probabilistic model behind IBM
Model 1 is essential for finding good word alignments from a bag-of-words representation. The EM
algorithm can be replaced by L-BFGS. Other formulations of the error criterion, in particular the $\ell_2$
norm, give similar alignment error rates and are better able to model word fertility. The NMF algorithm
provides an optimization method with complexity similar to that of Model 1, and it is faster in practice
than solving an unconstrained least squares problem.
GIZA++            L-BFGS            NMF
0.38  talk        0.26  talk        0.15  talk
0.15  speak       0.25  speak       0.05  speak
0.07  discuss     0.21  remiss      0.03  about
0.07  speaking    0.17  talking     0.03  talking
0.06  talking     0.09  speaking    0.03  address

Table 3: The top five translations of the French word parler (to speak) given by GIZA++, L-BFGS, and
NMF. For GIZA++ and L-BFGS, the numbers are conditional probabilities; for NMF, the numbers
are the corresponding entries in the matrix A. Note that the L-BFGS result is from a different, smaller
data set.
German                          English translation              Fertility
Kernkraftwerke                  nuclear power plants             3.385
Umweltverträglichkeitsprüfung   environmental impact assessment  3.596
Interessengruppen               interest groups                  2.238
herausheben                     emphasize/single out             1.554
Mitgliedsländer                 member countries                 2.113
Wettbewerbsfähigkeit            competitiveness                  1.223
Arbeitskräfte                   workers/labor force              1.698
Buchpreisbindung                Fixed Book Price Agreement       3.696
Entschließungsantrag            draft resolution                 2.159
Beschäftigungsmöglichkeiten     employment opportunities         2.296
Menschenrechtsverletzungen      human rights violations          3.260
Welthandelsorganisation         World Trade Organization         3.119

Table 4: Some German compounds, their typical English translations, and the fertilities found by our
algorithm.
Lexical translation factors learned with NMF could provide a useful additional feature for machine translation systems, in the same way that it is currently common to include lexical translation
probabilities trained with the IBM models in both directions. We also believe that using more general objective functions and solutions could make it easier to include a more comprehensive list of
features.
References
Benson, Steve, Lois Curfman McInnes, Jorge Moré, Todd Munson, and Jason Sarich. 2007. TAO
user manual (revision 1.9). Technical Report ANL/MCS-TM-242, Mathematics and Computer
Science Division, Argonne National Laboratory. http://www.mcs.anl.gov/tao.
Berg, Tamara L., Alexander C. Berg, Jaety Edwards, and David A. Forsyth. 2004. Who’s in the
picture? In Neural Information Processing Systems (NIPS).
Brown, Peter F., Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The
mathematics of statistical machine translation: Parameter estimation. Computational Linguistics,
19(2):263–311.
Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data
via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1–21.
Ding, Yuan, Daniel Gildea, and Martha Palmer. 2003. An algorithm for word-level alignment of parallel dependency trees. In The 9th Machine Translation Summit of the International Association
for Machine Translation, pages 95–101. New Orleans.
Fraser, Alexander and Daniel Marcu. 2007. Measuring word alignment quality for statistical machine translation. Computational Linguistics, 33(3):293–303.
Friedlander, A., J.M. Martinez, and S.A. Santos. 1994. On the resolution of linearly constrained
convex minimization problems. SIAM Journal on Optimization, 4:331–339.
Golub, Gene H. and Charles F. Van Loan. 1996. Matrix Computations (3rd edition). Johns Hopkins
University Press, Baltimore, MD, USA.
Koehn, Philipp. 2005. Europarl: a parallel corpus for statistical machine translation. In MT summit
X, the tenth machine translation summit, pages 79–86.
Lee, Daniel D. and H. Sebastian Seung. 2001. Algorithms for non-negative matrix factorization.
In Advances in Neural Information Processing Systems (NIPS), volume 13, pages 556–562. MIT
Press.
Moore, Eliakim Hastings. 1920. On the reciprocal of the general algebraic matrix. In Bulletin of
the American Mathematical Society, volume 26, pages 394–395.
Moore, Robert C. 2002. Fast and accurate sentence alignment of bilingual corpora. In AMTA ’02:
Proceedings of the 5th Conference of the Association for Machine Translation in the Americas on
Machine Translation: From Research to Real Users, pages 135–144. Springer-Verlag, London,
UK.
Nocedal, Jorge. 1980. Updating quasi-Newton matrices with limited storage. Mathematics of Computation, 35(151):773–782.
Och, Franz Josef and Hermann Ney. 2003. A systematic comparison of various statistical alignment
models. Computational Linguistics, 29(1):19–51.
Penrose, Roger. 1955. A generalized inverse for matrices. In Proceedings of the Cambridge Philosophical Society, volume 51, pages 406–413.
Yuster, Raphael and Uri Zwick. 2005. Fast sparse matrix multiplication. ACM Transactions on
Algorithms, 1(1):2–13.
Zhu, Ciyou, Richard H. Byrd, Peihuang Lu, and Jorge Nocedal. 1994. L-BFGS-B: Fortran subroutines for large-scale bound constrained optimization. Technical report, ACM Trans. Math.
Software.