6th International Symposium on Telecommunications (IST'2012)

Document and Sentence Alignment in Comparable Corpora Using Bipartite Graph Matching

Zeinab Rahimi, Kaveh Taghipour, Shahram Khadivi, Nasim Afhami
Department of Computer Engineering
Amirkabir University of Technology (Tehran Polytechnic)
Tehran, Iran
{z.rahimi, k.taghipour, khadivi, nasim_afh}@aut.ac.ir

Abstract— Parallel corpora are an indispensable resource for statistical machine translation systems and can be obtained from parallel, comparable or non-parallel documents. Parallel documents are the most suitable resource, but due to their scarcity, comparable and non-parallel documents are also used. In this paper, we address both document alignment and sentence alignment in comparable documents as an assignment problem of bipartite graph matching, in which we seek the matching of maximum weight. One of the best-known methods for this combinatorial optimization problem, which has well-studied mathematical solutions, is the Hungarian algorithm. The advantages of the proposed method are language independence and the O(n³) time complexity of the Hungarian algorithm. We have applied this method to a bilingual Farsi-English corpus and obtained high precision and recall.

Keywords: machine translation; comparable corpora; document alignment; sentence alignment; Hungarian algorithm

I. INTRODUCTION

A parallel corpus, which has several NLP applications, plays an essential role in statistical machine translation systems. Since the amount of parallel sentences used for system training strongly affects system performance, there is considerable interest in creating parallel sentences automatically. Parallel resources are available for some language pairs such as English-Arabic, but for most language pairs, including English-Farsi, such resources are not available, so we extract them from comparable corpora. This process has two major steps: document alignment and sentence alignment.

Alignment is a process that takes two data resources as input and establishes relationships between them at several levels. Even though it has been given various names, alignment has been the subject of many research works.

In the document alignment domain there are few independent studies to refer to. Most of the work is embedded in other alignment tasks, and some papers rely on manually aligned documents. Ghorbel et al. [Ghorbel et al. 2002] studied the alignment of medieval manuscripts; their approach enhances classic multilingual alignment methods with linguistic properties, such as similarity at the lexical, syntactic, semantic and morphological levels, as well as structural properties of texts, such as rhetorical structure. Document matching has also been the subject of many works in the information retrieval field, including cross-document coreference [Bagga and Biermann, 2000] [Chang and Lui 2001] [Kuper et al. 2003] [Petinot et al. 2004], document classification [Kloptchenko et al. 2003] [Jianying et al. 1999] [Le and Thoma, 2000] [Mihalcea and Hassan, 2005], and citation analysis [Zhang et al. 2004].

In recent years, the extraction of parallel sentences and sub-sentences has been studied in various works. For example, [Resnik & Smith, 2003] used structural information to extract parallel documents from comparable documents, and [Zhao & Vogel, 2002] developed sentence alignment algorithms. Extraction of parallel sentence pairs from comparable corpora with cross-lingual IR methods was carried out in [Munteanu and Marcu, 2005], and [Abdul-Rauf & Schwenk, 2009] also extracted parallel sentences with information retrieval methods. More recently, an unsupervised EM model was used to extract bilingual information [Lee, Aw, Zhang & Li, 2010], and [Munteanu & Marcu 2006] extracted parallel sub-sentential fragments from non-parallel corpora.

In this article, we first use IBM Model-1 to compute word similarity between the two inputs, and then we use a length feature. In the next step, we apply Hungarian matching in bipartite graphs for both document alignment and sentence alignment.
II. PROPOSED METHOD

When aligning the sentences of two comparable documents, each sentence in the source document has to be assigned to at least one target sentence, or alternatively to an empty sentence, and likewise each sentence in the target document has to be assigned to at least one source sentence or to an empty sentence. In document alignment, by contrast, there is no obligation that every source document be assigned to a target document.

The sentences of the source document and those of the target document are considered the two parts of a weighted complete bipartite graph (and likewise for documents), and the parallel sentence pairs can be obtained from the maximum matching of this graph. A matching in a graph G(V, E), in which each node in V represents a sentence or a document, is a subset M of the edges E such that no two edges in M share a common vertex [9]. The proposed method, based on bipartite graph matching, is illustrated in Figure 1.

Figure 1. Example of maximum weight bipartite matching (thick edges denote the maximum weight matching).

We seek the maximum weight perfect matching (the assignment problem); equivalently, we intend to find a perfect matching M with minimum cost. For a given cost matrix c, the element cnm (for all edges (n, m)) is the cost of aligning the source sentence sm to the target sentence tn, and the cost of a matching M is c(M) = Σ(n,m)∈M cnm. The cost of an arbitrary alignment A between a source document S and a target document T is the sum over all of its alignment points.

Two features enter the edge weights: word similarity computed with IBM Model-1, and a conditional model of sentence length obtained from a normal distribution; both are described in Section IV. In the document aligner we additionally use a date prefix, based on the fact that documents whose publishing dates are close are more likely to correspond.

To find the optimal matching, we need to choose dual values u1, ..., un and v1, ..., vn such that ui + vj ≥ wij for every edge, and then find a subgraph covered by (u, v) whose edges xiyj satisfy ui + vj = wij. In other words, the algorithm works on the matrix of edge weights wij, repeatedly adjusting the (u, v) cover until a perfect matching is found in the subgraph Gu,v. This maximum weighted bipartite matching problem can be solved in polynomial time, O(n³), where n is the number of vertices of the graph; the algorithm is illustrated with an example in Figure 3. The Hungarian algorithm and the features used in this article are described in Sections III and IV, respectively; results are presented in Section V.
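As a minimal sketch of the formulation above, the following Python fragment builds a weight matrix for the complete bipartite graph between source and target sentences. The scoring functions here are toy placeholders standing in for the IBM Model-1 and length features (a real system would use the probabilistic models of Section IV); the names `model1_score`, `length_score` and `build_weight_matrix` are illustrative, not from the paper.

```python
def model1_score(src, tgt):
    # Placeholder for the IBM Model-1 word-similarity feature:
    # fraction of source tokens that also appear in the target.
    # (A real implementation uses translation probabilities.)
    overlap = len(set(src.split()) & set(tgt.split()))
    return overlap / max(len(src.split()), 1)

def length_score(src, tgt):
    # Placeholder for the length feature: penalize large
    # character-length mismatches between the two sentences.
    return min(len(src), len(tgt)) / max(len(src), len(tgt), 1)

def build_weight_matrix(src_sents, tgt_sents):
    """Edge weight w[i][j] for aligning source sentence i to target j
    in the complete bipartite graph; the maximum weight matching of
    this graph yields the sentence alignment."""
    return [[model1_score(s, t) * length_score(s, t) for t in tgt_sents]
            for s in src_sents]

src = ["a b c", "d e"]
tgt = ["a b x", "d e"]
w = build_weight_matrix(src, tgt)
```

The matrix w then plays the role of the weight matrix wij on which the Hungarian algorithm operates.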
III. HUNGARIAN ALGORITHM

The Hungarian algorithm is based on three observations [Knuth, 1994]:
• Adding a constant to any row of the matrix does not change the solution.
• Adding a constant to any column of the matrix does not change the solution.
• If cnm ≥ 0 for all m and n, and A is an alignment with the property that Σ(n,m)∈A cnm = 0, then A solves the alignment (assignment) problem.

The Hungarian algorithm, also known as Munkres' assignment algorithm [Kuhn, 1955; Munkres, 1957], is illustrated in Figure 2. If X and Y are the two sets of nodes of the two parts of a graph, the problem is to find an assignment of set X to set Y with minimum cost; the minimum cost indicates the preferred alignment. Each edge has a non-negative weight, and the weight matrix consists of the weights of aligning source sentences (rows) to target sentences (columns). The Hungarian algorithm can equally be used to find a maximum weight matching in a bipartite graph: we intend to find the matching M that maximizes the total weight w(M) = Σ(i,j)∈M wij.

We define dual values u and v on the elements of X and Y, respectively. In the first step, ui is initialized to the maximum weight of the edges incident to node xi, and vj is initialized to zero. Then we iterate:

Iteration:
  Form Gu,v and find a maximum matching M in it.
  IF M is a perfect matching, THEN stop and report M as a maximum weight matching and (u, v) as a minimum cost cover.
  ELSE
    Let Q be a vertex cover of size |M| in Gu,v.
    R := X ∩ Q
    T := Y ∩ Q
    ε := min {ui + vj − wij : xi ∈ X \ R, yj ∈ Y \ T}
    Update u and v:
      ui := ui − ε  if xi ∈ X \ R
      vj := vj + ε  if yj ∈ T

Figure 2. Hungarian algorithm pseudo code [D. B. West 2001].
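The Hungarian algorithm computes the optimal assignment in O(n³). As a self-contained reference for what "optimal" means here, the sketch below finds the same maximum weight perfect matching by brute force over all permutations; this is only feasible for small n, but it makes the objective of the pseudo code above concrete. The function name and toy matrix are illustrative assumptions, not from the paper.

```python
from itertools import permutations

def max_weight_assignment(w):
    """Brute-force maximum weight perfect matching on a complete
    bipartite graph with an n x n weight matrix w. The Hungarian
    algorithm returns the same optimum in O(n^3); this O(n!) search
    is a correctness reference for small n only."""
    n = len(w)
    best_perm, best_weight = None, float("-inf")
    for perm in permutations(range(n)):  # perm[i] = column matched to row i
        weight = sum(w[i][perm[i]] for i in range(n))
        if weight > best_weight:
            best_weight, best_perm = weight, perm
    return best_perm, best_weight

w = [[3, 1, 0],
     [2, 4, 2],
     [0, 2, 5]]
matching, total = max_weight_assignment(w)  # matching (0, 1, 2), weight 12
```

In practice one would use an O(n³) implementation (for instance a library assignment solver) rather than this exhaustive search.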
Figure 3. Hungarian algorithm performance computation [D. B. West 2001].

IV. FEATURES

We use three features: word similarity with IBM Model-1, a length model, and, for document alignment only, a date feature.

IBM Model 1

IBM Model-1 is a probabilistic model over two sentences S and T, where S and T denote the source sentence and the target-language sentence, respectively. In this model, the probability of translating S into T is defined by the following generative process [Brown et al., 1993]:
- The source sentence S, with length L, is translated into the target sentence T, with length m.
- si is a word of S, and S includes a null word s0.
- tj is the target word in position j, and t(tj|si) is the probability of generating target word tj from source word si.

The translation probability of S to T under IBM Model-1 is then:

P(T|S) = Є / (L+1)^m · Π(j=1..m) Σ(i=0..L) t(tj|si)    (1)

where Є is a uniform probability for the length of the target sentence:

P(m|S) = Є    (2)

Length Model

For this feature, we calculate the probability of the target sentence length given the length of the source sentence, using a conditional model. The method is based on the fact that each character in the source language L1 is translated into a random number of characters in the target language L2. These are independent random variables, so their sum can be modeled by a normal distribution with mean and variance parameters. First, we calculate the length ratio μ: the mean number of characters in the target language per character of the source language. The variance σ² is estimated from the ratios of source sentence lengths to target sentence lengths. We then use the normal distribution, with mean and variance scaled according to the source sentence length, to compute the probability of the target sentence length [Brown 1991]:

P(lT | lS) = α · exp( −(lT − μ·lS)² / (2σ²) )    (3)

where α is the normalization factor. For document alignment, the length model also restricts the search: for each document, only documents whose length lies within a given ratio of its own length appear in its candidate list.

Date

As mentioned earlier, for document alignment we use a date prefix in the document names, based on the fact that documents whose dates are close are more likely to correspond. This constraint makes the candidate list of each document smaller, so the algorithm performs both more precisely and faster.
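The two sentence-level features described above can be sketched as follows. This is a minimal illustration under stated assumptions: the tiny translation table and the parameters mu and sigma are invented for the example, and the normalization of the length probability is the plain normal density rather than whatever constant the paper's α denotes.

```python
import math

def ibm_model1_prob(src_words, tgt_words, t_table, epsilon=1.0):
    """IBM Model-1 translation probability:
    P(T|S) = epsilon / (L+1)^m * prod_j sum_i t(t_j | s_i),
    with a NULL word s_0 prepended to the source sentence."""
    src = ["NULL"] + src_words
    L, m = len(src_words), len(tgt_words)
    prob = epsilon / (L + 1) ** m
    for tj in tgt_words:
        prob *= sum(t_table.get((tj, si), 0.0) for si in src)
    return prob

def length_prob(len_tgt, len_src, mu=1.1, sigma=0.3):
    """Normal model of the target length centered at mu * source length
    (mu and sigma here are illustrative, not estimated from data)."""
    z = (len_tgt - mu * len_src) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Toy lexicon: t((target_word, source_word)) -> probability.
t_table = {("khaneh", "house"): 0.8, ("bozorg", "big"): 0.7}
p_trans = ibm_model1_prob(["house", "big"], ["khaneh", "bozorg"], t_table)
```

In the aligner, scores of this kind would be combined into the edge weights of the bipartite graph before running the matching algorithm.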
V. RESULTS

We used comparable texts from the Tehran Avenue agency and khameneii.ir for our experiments. The results of the method are shown below, where the alignment is the maximum weight matching among all possible alignments of the documents. To evaluate the algorithm we use precision, recall and F-measure, defined as:

Precision = (number of correct alignments / number of proposed alignments) × 100
Recall = (number of correct alignments / number of reference alignments) × 100
F-measure = 2·P·R / (P + R)
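The metrics above reduce to a small helper; the counts in the usage line are illustrative, not the paper's data.

```python
def precision_recall_f(num_correct, num_proposed, num_reference):
    """Precision, recall (as percentages) and their harmonic mean."""
    precision = 100.0 * num_correct / num_proposed
    recall = 100.0 * num_correct / num_reference
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical counts: 45 correct out of 60 proposed, 50 in the reference.
p, r, f = precision_recall_f(num_correct=45, num_proposed=60, num_reference=50)
```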
A. Document alignment

To evaluate the document aligner, we used 672 English documents and 5500 Farsi documents from the khameneii.ir comparable set, and aligned the documents using the length model alone, the translation model alone, both features together, and all features including the date prefix. We also filter out alignments whose cost exceeds a threshold. TABLE1 shows the number of aligned document pairs produced in each experiment.

TABLE1. Number of output aligned documents

  Features used                # of aligned documents
  Length                       672
  Translation                  612
  Length + translation         582
  Length + translation + date  476

As the results indicate, using only the length model or only the translation model is not sufficient for aligning documents. The length model alone assigns the first document with a suitable length according to the length ratio, and the translation model alone may assign documents with a large length difference to each other merely because they share a few common phrases; in both cases the number of aligned documents is high but their precision is not. Using both features together leads to reasonable results, but the running time becomes extremely long, because the candidate set is very large and computing a score for every pair is computationally expensive. The date feature, however, prunes the candidate set considerably and consequently yields precise results in a short time, as TABLE2 shows. Note that the date feature is an additional constraint: using it alone, without the translation model, would not lead to proper results.

To compute precision, recall and F-measure, we randomly selected 100 output document pairs and checked whether they were correctly aligned. For recall, it turned out that 91 of the 100 document pairs were truly parallel.

TABLE2. Document aligner evaluation

  Features used                Precision  Recall  F-measure
  Length                       42         46      44
  Translation                  64         70      66
  Length + translation         76         83      79
  Length + translation + date  84         93      88

B. Sentence alignment

To evaluate the sentence alignment tool, we randomly selected 220 sentences from the corpus and computed precision, recall and F-measure for them. We compared the minimum-cost Hungarian algorithm with the minimum edge-cover algorithm and with Microsoft's bilingual sentence aligner toolkit. The results are shown in TABLE3: the Hungarian algorithm outperforms both the minimum edge-cover algorithm and the bilingual sentence aligner.

TABLE3. Hungarian algorithm evaluation

  Algorithm                     Precision  Recall  F-measure
  Hungarian algorithm           72         58      62
  Minimum edge-cover algorithm  67         54      58
  Bilingual sentence aligner    72         38      48

VI. CONCLUSION

In this paper, we have proposed a new algorithm for aligning both documents and sentences in comparable corpora. Our method is based on bipartite graph matching. We implemented the sentence alignment framework in C++ using LEDA¹ and the document alignment tool in Java. The features used in the proposed methods are IBM Model-1, length, and date (the latter for document alignment only). On top of these features we apply the Hungarian algorithm to find the maximum weight matching with minimum cost. As the results show, our method obtains precise results for both document and sentence alignment.

¹ http://www.leda-tutorial.org/

ACKNOWLEDGMENTS

This work has been partially supported by the Iranian Research Institute for ICT (ex ITRC) under grant 500/1141.

REFERENCES

[1] P. Resnik, N. A. Smith, "The web as a parallel corpus", Computational Linguistics, 29(3): 349-380, 2003.
[2] B. Zhao, S. Vogel, "Adaptive parallel sentences mining from web bilingual news collection", Proceedings of the IEEE International Conference on Data Mining, p. 745, 2002.
[3] D. Munteanu, D. Marcu, "Improving Machine Translation Performance by Exploiting Comparable Corpora", Computational Linguistics, 31(4), pp. 477-504, 2005.
[4] D. Munteanu, D. Marcu, "Extracting parallel sub-sentential fragments from non-parallel corpora", Association for Computational Linguistics, pp. 81-88, 2006.
[5] S. Abdul-Rauf, H. Schwenk, "On the use of comparable corpora to improve SMT performance", Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pp. 16-23, 2009.
[6] L. Lee, A. Aw, M. Zhang, H. Li, "EM-based Hybrid Model for Bilingual Terminology Extraction from Comparable Corpora", Coling 2010 Organizing Committee, pp. 639-646, 2010.
[7] K. Mehlhorn and S. Näher, LEDA: A Platform for Combinatorial and Geometric Computing, Cambridge University Press, Cambridge, U.K.; New York, 1999.
[8] P. Brown, V. Della Pietra, S. Della Pietra, R. Mercer, "The Mathematics of Statistical Machine Translation: Parameter Estimation", Computational Linguistics, pp. 263-311, 1993.
[9] Y. Wang, F. Makedon, J. Ford, "A bipartite graph matching for finding correspondences between structural elements in two proteins", The 26th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 2004.
[10] S. Khadivi, "Statistical Computer-Assisted Translation", PhD thesis, pp. 37-48, 2008.
[11] D. B. West, Introduction to Graph Theory, 2nd edition, ISBN 0-13-014400-2, Prentice Hall, 2001.
[12] W. A. Gale, K. W. Church, "A program for aligning sentences in bilingual corpora", Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, pp. 177-184, 1991.
[13] D. Munteanu, D. Marcu, "Improving Machine Translation Performance by Exploiting Non-Parallel Corpora", MIT Press, Cambridge, MA, USA, 2005.
[14] D. E. Knuth, The Stanford GraphBase: A Platform for Combinatorial Computing, Addison-Wesley, New York, 1994.
[15] J. Tiedemann, G. Hirst, Bitext Alignment, Morgan & Claypool Publishers, 2011.