2010 5th International Symposium on Telecommunications (IST'2010)
A Discriminative Approach to Filter out Noisy
Sentence Pairs from Bilingual Corpora
Kaveh Taghipour, Nasim Afhami, Shahram Khadivi and Saeed Shiry
Department of Computer Engineering
Amirkabir University of Technology
Tehran, Iran
{k.taghipour, nasim_afh, khadivi, [email protected]}@aut.ac.ir
Abstract— Parallel corpora are essential for training statistical machine translation models. Since sentence-aligned parallel corpora are usually noisy when generated from parallel or comparable documents by inexact automatic methods, they need to be cleaned. In this paper, new features are introduced to assess the correctness of a sentence pair, and their impact in combination with state-of-the-art features from the literature is evaluated systematically. Statistical methods are used for feature extraction, so the approach is language independent. In order to better understand the characteristics of the problem, four supervised classification algorithms are used to classify sentence pairs as noise or parallel. Evaluating the models in terms of accuracy and F-measure on a noisy Farsi-English parallel corpus shows that the maximum entropy model performs best among the filtering techniques examined in this paper and yields a significant improvement over two of the other systems.
Keywords- Corpus Filtering; Maximum Entropy; Statistical Machine Translation; Cross-Language Information Retrieval
I. INTRODUCTION
Training statistical machine translation models requires large parallel corpora from which the model parameters are estimated. The quality of automatic translation strongly depends on the quality of the sentence pairs and the condition of the corpus. Automatic methods for gathering sentence pairs are imprecise, and a corpus containing many non-parallel sentence pairs, if used to train a translation model, leads to translations of low quality. On the other hand, manual sentence alignment is very costly and time consuming, and therefore not practical.
Different kinds of noise can be found in textual corpora. Noise in parallel corpora can stem from causes such as inaccurate translations or an erroneous sentence alignment process. Using a noisy parallel corpus to train a translation model results in a model with poorly estimated parameters and output sentences of low quality. In this paper we take all of these noisy sentence pairs into account, attempt to recognize them, and thereby reduce the noise of a corpus. In order to have a practical classification system, evaluation is done on a parallel corpus with a high level of noise. The details of the experiments and the data statistics are given in Section IV.

Several researchers have proposed methods for the automatic generation of parallel corpora. In [1] it is suggested to collect two sets of sentences in the source and target languages and then find an alignment between them to form a corpus. A filtering technique based on simple structural rules is suggested in [2]. Some researchers use the length of the sentences, measured in characters [3] or in words [4], as the criterion for sentence alignment. An iterative algorithm based on word alignments and the similarity of lexical information is introduced in [5]. Word-based translation models and lexical information are also useful for finding a proper alignment [6-8]. In [9] a different approach based on a geometric algorithm is tested for the sentence alignment task. Other methods extract sub-sentential fragments instead of full translations [10].

Filtering noisy bilingual corpora can be seen as a way of extracting parallel sentence pairs, and several methods have been proposed for it. It has been shown that a translation model trained on a noisy corpus can be used to refine that corpus [11], [12]. In [13] information retrieval methods are used to find parallel pairs. Other techniques, similar to the approach of this paper, use extracted features to train a maximum entropy model that identifies noisy sentences [14]. Corpora with different levels of noise affect translation differently; the impact of the noise level on the training of translation models and on translation quality has been studied in [15], where it is shown that higher levels of corpus noise degrade the performance of automatic translation systems. A filtering system that accurately classifies sentence pairs can therefore be used to extract parallel corpora from a very noisy collection. In [16] candidate sentence pairs are classified by a maximum entropy model, and a beam-search algorithm is used to prune the search space. In contrast to the supervised filtering techniques mentioned above, some unsupervised methods have been proposed to extract bilingual information; an EM-based unsupervised bilingual information extraction method is proposed in [17].
In this paper, we propose a discriminative approach to cleaning bilingual corpora. We propose several new features based on language models and word alignments to discriminate between parallel and noisy sentence pairs. Additionally, we investigate the performance of the novel features in combination with several important features introduced in the literature. It has to be mentioned that, among the proposed features for corpus cleaning, the alignment entropy has a considerable impact on the classification accuracy.
The features used for building the models are introduced in Section II. In Section III the maximum entropy classifier is described, and the results are provided in Section IV.
II. FEATURES
Many features have been tested for building the models. Here, the features that are used in the best performing model are explained, and the results of a study on their impact are provided. The features can be divided into the six groups below.
A. Translation Table Features
The translation table is a matrix of translation probabilities between the words of two languages. It can be obtained by running the EM-based training algorithm of the IBM word-based translation models [18]. Two features of the model are based on the translation table and IBM model 1, and are calculated using formulas (1) and (2):
P(e_1^J \mid f_1^I) = \frac{1}{(I+1)^J} \prod_{j=1}^{J} \sum_{i=0}^{I} t(e_j \mid f_i)    (1)

P(f_1^I \mid e_1^J) = \frac{1}{(J+1)^I} \prod_{i=1}^{I} \sum_{j=0}^{J} t(f_i \mid e_j)    (2)
In these formulas, f_1^I and e_1^J are the foreign and English sentences of a pair, t is the translation table, and t(e \mid f) is the probability of the English word e given the foreign word f (f_0 and e_0 denote the empty word).
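As an illustration of how formulas (1) and (2) can be computed in practice, the following Python sketch scores a sentence pair with a lexical translation table. The dictionary format of the table, the NULL-word convention and the use of log probabilities with a small floor for unseen entries are assumptions of the sketch, not details given in the paper.

```python
import math

def ibm1_score(src_words, tgt_words, t_table, floor=1e-10):
    """IBM model 1 probability of tgt_words given src_words, as in (1)/(2).

    t_table is assumed to map (src_word, tgt_word) to t(tgt|src), e.g. as
    estimated with GIZA++; unseen pairs fall back to a small floor value.
    The NULL source word is represented by the empty string "".
    """
    src = [""] + list(src_words)                    # prepend the empty word f_0
    log_p = -len(tgt_words) * math.log(len(src))    # the 1/(I+1)^J term
    for t_word in tgt_words:
        s = sum(t_table.get((s_word, t_word), floor) for s_word in src)
        log_p += math.log(max(s, floor))
    return log_p                                    # log of formula (1) or (2)

def t_table_features(f_words, e_words, t_f2e, t_e2f):
    """Both translation directions give the two translation-table features."""
    return (ibm1_score(f_words, e_words, t_f2e),
            ibm1_score(e_words, f_words, t_e2f))
```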
B. Alignment-based Features
Word alignments can be used to obtain useful features for filtering the noise of parallel corpora. These features are similar to the ones used in [14]. Word alignments can easily be obtained by training the IBM models on the noisy parallel corpus. In this paper, we use symmetric word alignments obtained by taking the union and the intersection of the word alignments of the two translation directions [19]. The three features that use this kind of alignment are explained below.
\frac{\mathrm{links}(f_1^I, e_1^J)}{(I + J)/2}    (3)

\left| \frac{\mathrm{null}(f_1^I)}{I} - \frac{\mathrm{null}(e_1^J)}{J} \right|    (4)

\frac{\mathrm{null}(f_1^I)/I}{\mathrm{null}(e_1^J)/J}    (5)
In these formulas, links(f_1^I, e_1^J) is the number of links in the symmetric alignment of the pair and null(s) is the number of words of sentence s that are left without any link. Feature (3) is the number of alignment links normalized by the average sentence length. Feature (4) considers the difference of the null alignments between the source and target sentences, and the similar feature (5) is calculated by considering the ratio of the words with null alignments in the two sentences.

We now introduce a novel feature that is slightly different, which we call alignment entropy (AEnt). This feature uses the symmetric word alignment distribution to discriminate between noisy and parallel sentences. Experiments show that the alignment entropy is one of the best features used in this paper. To calculate this feature the following formula is used:

\mathrm{AEnt}(f_1^I, e_1^J) = \mathrm{Norm}(f_1^I, e_1^J) \cdot \frac{H(f_1^I) + H(e_1^J)}{2}    (6)

In this formula, the function H is the entropy of the link distribution of a sentence and is calculated as follows:

H(s) = -\sum_{w \in s} p(w) \log p(w)    (7)

p(w) = \frac{l(w)}{L(s)}    (8)

Here L(s) is the total number of links of sentence s, used to normalize the link counts into a distribution:

L(s) = \sum_{w \in s} l(w)    (9)

where s is an arbitrary sentence and l(w) is a function that outputs the number of links that end on the word w. Finally, the role of the Norm function is to normalize the entropies, and it is calculated as follows:

\mathrm{Norm}(f_1^I, e_1^J) = \frac{2}{\log I + \log J}    (10)

This feature estimates the entropy of the alignments. If the alignment distributions of both the source and the target sentence are close to uniform, the alignment entropy approaches one and the sentences are probably parallel; otherwise the entropy approaches zero and the corresponding sentence pair is more likely to be noise.
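The following Python sketch, based on the formulas as reconstructed above, computes the alignment-based features and the alignment entropy from a symmetrized alignment given as a set of (i, j) links. The link-set representation and the guards for empty alignments or zero null counts are assumptions made for the sketch.

```python
import math
from collections import Counter

def alignment_features(alignment, src_len, tgt_len):
    """Alignment-based features for one pair (assumes non-empty sentences).

    `alignment` is a set of (i, j) links of the symmetrized word alignment,
    with 0 <= i < src_len and 0 <= j < tgt_len.
    """
    src_links = Counter(i for i, _ in alignment)   # l(w) for source positions
    tgt_links = Counter(j for _, j in alignment)   # l(w) for target positions

    null_src = src_len - len(src_links)            # words left without any link
    null_tgt = tgt_len - len(tgt_links)

    norm_links = len(alignment) / ((src_len + tgt_len) / 2.0)              # (3)
    null_diff = abs(null_src / src_len - null_tgt / tgt_len)               # (4)
    null_ratio = ((null_src / src_len) / (null_tgt / tgt_len)
                  if null_tgt else 0.0)                                    # (5)

    def entropy(link_counts):
        total = sum(link_counts.values())
        if total == 0:
            return 0.0
        return -sum((c / total) * math.log(c / total)
                    for c in link_counts.values())

    # (6)-(10): average entropy of the two link distributions, normalized so
    # that near-uniform distributions give a value close to one.
    denom = math.log(max(src_len, 2)) + math.log(max(tgt_len, 2))
    a_ent = (entropy(src_links) + entropy(tgt_links)) / denom

    return norm_links, null_diff, null_ratio, a_ent
```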
C. Fertility-based Features
The fertility of a foreign word is the number of English words that the foreign word translates into. To approximate the fertility model, one can use IBM models 3, 4 or 5 to estimate the fertilities of the words. In this paper, similar to [14], the top three fertilities are considered as features of the classifier: the first, the second and the third largest fertility of a sentence pair are used as three separate features.

D. Binary Features
In addition to the features mentioned above, some binary features are used to exploit special information in the sentences, such as numbers, names and e-mail addresses. To make use of this kind of information, a binary feature is defined for each type of token; it takes the value 1 if the corresponding information is consistent between the two sentences and 0 otherwise.
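A minimal sketch of how such binary indicators might be realized is given below; the specific token types, the regular expressions and the notion of consistency (equal occurrence counts in both sentences) are illustrative assumptions, not the paper's exact definitions.

```python
import re

# Hypothetical patterns for the kinds of tokens mentioned above.
PATTERNS = {
    "number": re.compile(r"\d+"),
    "email": re.compile(r"\b[\w.]+@[\w.]+\b"),
}

def binary_features(src_sentence, tgt_sentence):
    """One 0/1 feature per token type: 1 if both sentences contain the token
    type equally often, 0 otherwise."""
    feats = {}
    for name, pattern in PATTERNS.items():
        src_count = len(pattern.findall(src_sentence))
        tgt_count = len(pattern.findall(tgt_sentence))
        feats[name] = 1 if src_count == tgt_count else 0
    return feats
```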
E. Length-based Features
The lengths of the two sentences can be used to discriminate between noisy and parallel sentence pairs [4]. In this paper we use four features based on the lengths of the sentences of a pair: two of them consider the words of a sentence as the unit of length, and the other two take characters into account. These features are calculated using the following formulas:
\frac{\mathrm{len}(f_1^I)}{\mathrm{len}(e_1^J)}    (11)

\left| \mathrm{len}(f_1^I) - \mathrm{len}(e_1^J) \right|    (12)

where len(·) is measured once in words and once in characters, yielding the four length features.
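A short sketch of the four length features, assuming whitespace tokenization for word counts:

```python
def length_features(src_sentence, tgt_sentence):
    """Length ratio and absolute difference, in words and in characters,
    as in (11) and (12)."""
    feats = {}
    for unit, measure in (("word", lambda s: len(s.split())),
                          ("char", lambda s: len(s))):
        src_len, tgt_len = measure(src_sentence), measure(tgt_sentence)
        feats[f"{unit}_ratio"] = src_len / max(tgt_len, 1)
        feats[f"{unit}_diff"] = abs(src_len - tgt_len)
    return feats
```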
F. Features Based on Language Models
A group of new features that use n-gram language models to obtain a measure of discrimination is introduced here. An n-gram language model estimates the probability of a sentence from the conditional probabilities of each word given its history:

P(w_1, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})    (13)

The idea behind these features is that the source and target sentences of a parallel pair should have similar n-gram language model probabilities. To use the models as features, the difference and the ratio of the two probability values of the source and the target sentence are used and evaluated.
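The following sketch computes the two language-model features for a pair; it assumes that per-sentence log-probabilities are available from the source- and target-side n-gram models (e.g., scored offline with SRILM), and the per-word normalization is an additional assumption of the sketch.

```python
def lm_features(src_sentence, tgt_sentence, src_logprob, tgt_logprob):
    """Language-model features for a sentence pair.

    `src_logprob` and `tgt_logprob` are assumed to be functions that return
    the log-probability of a sentence under the source- and target-side
    n-gram models.  Log-probabilities are normalized per word so that
    sentences of different lengths remain comparable.
    """
    src_words = src_sentence.split()
    tgt_words = tgt_sentence.split()
    lp_src = src_logprob(src_sentence) / max(len(src_words), 1)
    lp_tgt = tgt_logprob(tgt_sentence) / max(len(tgt_words), 1)
    return {
        "lm_diff": abs(lp_src - lp_tgt),                      # difference feature
        "lm_ratio": lp_src / lp_tgt if lp_tgt != 0 else 0.0,  # ratio feature
    }
```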
III. MAXIMUM ENTROPY

Various learning algorithms have been used to combine the features described above. In this section, the maximum entropy model is briefly explained, and its decision rule is slightly modified in order to make full use of the model. The three other learning algorithms are decision trees, k-nearest neighbors and the multilayer perceptron.

Maximum entropy, or log-linear, modeling was first used in natural language processing in [20] and [21] and is widely used for classification [22]. The conditional maximum entropy model assumes a distribution of the following form:

p_\lambda(c \mid x) = \frac{\exp(\lambda \cdot f(x, c))}{\sum_{c'} \exp(\lambda \cdot f(x, c'))}    (14)

Here f(x, c) is the feature vector for the input x and class label c, and \lambda is the vector of model parameters, which can be seen as feature weights. By rewriting equation (14), we obtain formula (15), which takes the sentence pair (f_1^I, e_1^J) as its input:

p_\lambda(c \mid f_1^I, e_1^J) = \frac{\exp(\lambda \cdot h(f_1^I, e_1^J, c))}{\sum_{c'} \exp(\lambda \cdot h(f_1^I, e_1^J, c'))}    (15)

The model parameters control the effect of each feature function. Different training algorithms have been proposed to estimate the parameters. A widely used algorithm is Generalized Iterative Scaling (GIS), which estimates the parameters of the model iteratively. Although better training algorithms have been reported [22], we use iterative scaling because its performance is adequate and high accuracy is achieved. It has to be mentioned that the objective function of the maximum entropy model is concave, so the training algorithms iteratively approach the global optimum. After training the model, (16) is used to classify instances:

\hat{c} = \arg\max_{c} \; p_\lambda(c \mid f_1^I, e_1^J)    (16)

The corpus filtering problem is a two-class discrimination problem. Hence, in order to have a more general decision rule, (17) is used instead of (16); setting the threshold \theta to 0.5 makes (17) identical to (16).

\text{decide parallel} \iff p_\lambda(\text{parallel} \mid f_1^I, e_1^J) \ge \theta    (17)

The optimal value of the threshold is found by considering the classification accuracy on a development set, and the threshold that corresponds to the most accurate maximum entropy classifier is selected as the optimum value.
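A minimal sketch of the thresholded decision rule of equations (15)-(17) for the two-class case is shown below; the way features and weights are stored is an assumption of the sketch.

```python
import math

def me_probability(features, weights):
    """p(parallel | pair) under a two-class log-linear model, as in (15).

    `features` maps feature names to values; `weights` is assumed to hold one
    weight dictionary per class, e.g. weights["parallel"] and weights["noise"].
    """
    scores = {}
    for label in ("parallel", "noise"):
        scores[label] = math.exp(sum(weights[label].get(name, 0.0) * value
                                     for name, value in features.items()))
    z = scores["parallel"] + scores["noise"]
    return scores["parallel"] / z

def classify(features, weights, threshold=0.5):
    """Thresholded decision rule (17); threshold=0.5 recovers rule (16)."""
    return ("parallel"
            if me_probability(features, weights) >= threshold
            else "noise")
```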
IV. RESULTS

In this section, the results of applying the learning models described above to the explained features are provided. All experiments are performed on a Farsi-English parallel corpus with a noise level of about 70%. The dataset is built from a clean parallel corpus as follows: first, 70% of the sentence pairs are selected randomly, and then the target sentence of each selected pair is replaced with a randomly chosen non-parallel sentence. The corpus statistics are provided in Table I.

TABLE I. CORPUS STATISTICS

              Parallel    Noise    Total
Number        15169       33045    48214
Percentage    31%         69%      100%

We have used GIZA++ [23] to compute the translation table probabilities, word alignments and fertilities. SRILM [24] is used to estimate the language models and to calculate the probabilities of each test sentence. The MLP, decision tree and k-NN classifiers are trained with WEKA [25], and the OpenNLP implementation of maximum entropy is used for the maximum entropy experiments. In order to have results comparable with similar filtering systems, the F-measure is computed in addition to the classification accuracy. Precision, recall and F-measure are computed as follows:

\text{precision} = \frac{\text{correctly classified parallel pairs}}{\text{pairs classified as parallel}} \times 100

\text{recall} = \frac{\text{correctly classified parallel pairs}}{\text{parallel pairs in the test set}} \times 100

\text{F-measure} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
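A straightforward sketch of these evaluation measures, assuming the positive class is "parallel":

```python
def evaluation_metrics(gold_labels, predicted_labels, positive="parallel"):
    """Accuracy, precision, recall and F-measure (in percent) for the
    parallel/noise classification task."""
    tp = sum(1 for g, p in zip(gold_labels, predicted_labels)
             if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold_labels, predicted_labels)
             if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold_labels, predicted_labels)
             if g == positive and p != positive)
    correct = sum(1 for g, p in zip(gold_labels, predicted_labels) if g == p)

    accuracy = 100.0 * correct / len(gold_labels)
    precision = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    recall = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return accuracy, precision, recall, f_measure
```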
The entries of Table II are the results of a 10-fold cross-validation evaluation, used because of the relatively small number of observations: the training set is divided into ten parts, nine of which are used as training data while the system is tested on the tenth.
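A plain k-fold cross-validation loop of the kind described above might look as follows; the train/predict callables are placeholders for any of the classifiers used in the paper.

```python
import random

def cross_validate(pairs, labels, train_fn, predict_fn, k=10, seed=0):
    """Plain k-fold cross validation over (pair, label) examples.

    `train_fn(train_pairs, train_labels)` is assumed to return a model and
    `predict_fn(model, pair)` a predicted label.  Returns overall accuracy.
    """
    indices = list(range(len(pairs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]

    correct = 0
    for held_out in folds:
        held_out_set = set(held_out)
        train_idx = [i for i in indices if i not in held_out_set]
        model = train_fn([pairs[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct += sum(1 for i in held_out
                       if predict_fn(model, pairs[i]) == labels[i])
    return 100.0 * correct / len(pairs)
```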
Table II shows the accuracy, precision, recall and F-measure of the best learning models used in this paper. Comparing the obtained results gives some insight into the feature space and the data.

TABLE II. CROSS VALIDATION RESULTS OF APPLYING MODELS

         Accuracy    Precision    Recall    F-measure
DT       96.5        91.6         97.8      94.6
K-NN     97.3        97.5         97.3      97.4
MLP      92.1        90.7         83.5      86.9
ME       98.3        98.9         96.9      97.9
In order to gain more insight into the feature space, the results of training the models with different parameters are also provided. Figure 1 shows that the accuracy decreases monotonically as the value of the parameter K in K-NN is increased.

Figure 1. Effect of parameter K on K-NN accuracy.

The accuracies of two decision tree models are compared in Table III; the first row corresponds to a pruned tree and the second to an unpruned one.

TABLE III. DECISION TREE PERFORMANCE

                Accuracy    F-measure
DT (pruned)     96.2        94.2
DT (complete)   96.5        94.6

As mentioned, the threshold at which the maximum entropy model is tested is learned by optimizing this parameter on a development set. Figure 2 shows the precision, recall and F-measure achieved by modifying the threshold parameter.

Figure 2. Learning the threshold of the maximum entropy model from development data.

In order to compare the different features, we have also tested the model with each feature excluded in turn. The obtained results are provided in Table IV; each row of the table represents a group of omitted features, and the numbers are the corresponding average accuracy and error over all features of the group.

TABLE IV. FEATURE EVALUATION

                      Accuracy (Avg)    Error (Avg)
All features          98.3              1.6
- Alignment Entropy   97.2              2.7
- T-table             97.3              2.6
- Language Model      97.6              2.3
- Word Alignment      97.6              2.3
- Length              98.2              1.7

The results show that the alignment entropy (AEnt) is the best single feature: without it, approximately one percent of classification accuracy is lost.
V. CONCLUSIONS
In this paper, we have examined various learning models for classifying sentence pairs as noise or parallel. In order to have a realistic dataset, random noise at a level of 70% was added to a clean parallel corpus. Considering the monotonically decreasing performance of K-NN as the value of K increases, and the lower accuracy of the pruned decision tree in comparison with the unpruned one, we can deduce that the raw features, without any kind of feature weighting, are not sufficient to fully discriminate the space, and that the samples of the two classes overlap. This hypothesis is supported by the ME results: the features are weighted in ME models, which makes the feature space more discriminable.
Maximum entropy performs quite well in discriminating the sentence pairs of the corpus. ME models have fewer parameters than complex models such as the MLP and are less sensitive to parameter tuning. These characteristics, together with the high accuracy, make ME models well suited for filtering a noisy corpus and discriminating the corresponding feature space. The original decision rule for maximum entropy models selects the most probable class; it has been shown that using a threshold other than 0.5 for two-class classification problems can improve the performance of a maximum entropy model. The threshold can easily be estimated by optimizing the model on a development set or by considering cross-validation results.
Because the data sets differ, it is not fair to compare the obtained results directly with the results of other papers, but it is worthwhile to get a rough idea of the performance of the proposed method: comparing our results with those reported in [14] and [15] indicates that a considerable improvement has been achieved. The high classification accuracy also makes the system suitable for extracting parallel sentences from comparable corpora. Given a large collection of mostly noisy sentence pairs from two comparable documents, the model with the proposed features can be used to extract the parallel sentence pairs: English sentences are paired with candidate foreign sentences, the model classifies the resulting sentence pairs, and the pairs classified as parallel are extracted to build the corpus.
REFERENCES
[1] P. Resnik, “Mining the web for bilingual text,” Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, 1999, pp. 527–534.
[2] J. Chen and J.Y. Nie, “Automatic construction of parallel English-Chinese corpus for cross-language information retrieval,” Proceedings of the Sixth Conference on Applied Natural Language Processing, 2000, pp. 21–28.
[3] W.A. Gale and K.W. Church, “A program for aligning sentences in bilingual corpora,” Computational Linguistics, vol. 19, 1993, pp. 75–102.
[4] P.F. Brown, J.C. Lai, and R.L. Mercer, “Aligning sentences in parallel corpora,” Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, 1991, pp. 169–176.
[5] M. Kay and M. Röscheisen, “Text-translation alignment,” Computational Linguistics, vol. 19, 1993, pp. 121–142.
[6] R. Moore, “Fast and accurate sentence alignment of bilingual corpora,” Machine Translation: From Research to Real Users, 2002, pp. 135–144.
[7] S.F. Chen, “Aligning sentences in bilingual corpora using lexical information,” Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, 1993, pp. 9–16.
[8] D. Wu, “Aligning a parallel English-Chinese corpus statistically with lexical criteria,” Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, 1994, pp. 80–87.
[9] I.D. Melamed, “Bitext maps and alignment via pattern recognition,” Computational Linguistics, vol. 25, 1999, p. 130.
[10] D.S. Munteanu and D. Marcu, “Extracting parallel sub-sentential fragments from non-parallel corpora,” Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, 2006, p. 88.
[11] R. Sarikaya, S. Maskey, R. Zhang, E. Jan, D. Wang, B. Ramabhadran, and S. Roukos, “Iterative sentence-pair extraction from quasi-parallel corpora for machine translation,” 2009.
[12] M. Turchi, T. De Bie, and N. Cristianini, “An intelligent agent that autonomously learns how to translate,” Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Volume 02, 2009, pp. 12–19.
[13] S. Abdul-Rauf and H. Schwenk, “On the use of comparable corpora to improve SMT performance,” Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, 2009, pp. 16–23.
[14] D.S. Munteanu and D. Marcu, “Improving machine translation performance by exploiting non-parallel corpora,” Computational Linguistics, vol. 31, 2005, pp. 477–504.
[15] S. Khadivi and H. Ney, “Automatic filtering of bilingual corpora for statistical machine translation,” Natural Language Processing and Information Systems, 2005, pp. 263–274.
[16] C. Tillmann, “A beam-search extraction algorithm for comparable data,” Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, 2009, pp. 225–228.
[17] L. Lee, A. Aw, M. Zhang, and H. Li, “EM-based hybrid model for bilingual terminology extraction from comparable corpora,” Coling 2010: Posters, 2010, pp. 639–646.
[18] P.F. Brown, V.J. Della Pietra, S.A. Della Pietra, and R.L. Mercer, “The mathematics of statistical machine translation: parameter estimation,” Computational Linguistics, vol. 19, 1993, pp. 263–311.
[19] F.J. Och and H. Ney, “A systematic comparison of various statistical alignment models,” Computational Linguistics, vol. 29, 2003, pp. 19–51.
[20] A.L. Berger, V.J. Della Pietra, and S.A. Della Pietra, “A maximum entropy approach to natural language processing,” Computational Linguistics, vol. 22, 1996, pp. 39–71.
[21] S. Della Pietra, V. Della Pietra, and J. Lafferty, “Inducing features of random fields,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, 1997, pp. 380–393.
[22] R. Malouf, “A comparison of algorithms for maximum entropy parameter estimation,” Proceedings of the Sixth Conference on Natural Language Learning (CoNLL-2002), 2002, pp. 49–55.
[23] F.J. Och and H. Ney, “Improved statistical alignment models,” Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, 2000, pp. 440–447.
[24] A. Stolcke, “SRILM - an extensible language modeling toolkit,” Seventh International Conference on Spoken Language Processing, 2002, pp. 901–904.
[25] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I.H. Witten, “The WEKA data mining software: an update,” ACM SIGKDD Explorations Newsletter, vol. 11, 2009, pp. 10–18.