Bilingual Lexicography

In Journal of Information Science and Technology, Romanian Academy, Vol. 4, No. 3, 2001
Computational bilingual lexicography: automatic extraction of
translation dictionaries
Dan Tufiş and Ana-Maria Barbu
RACAI-Romanian Academy Center for Artificial Intelligence
13, "13 Septembrie",RO-74311, Bucharest, 5, Romania
{tufis, abarbu}@racai.ro
Abstract
The paper describes a simple but very effective approach to extracting translation
equivalents from parallel corpora. We briefly present the multilingual parallel corpus
used in our experiments and then describe the pre-processing steps, a baseline iterative
method, and the actual algorithm. The evaluation of the two algorithms is presented in
some detail in terms of precision, recall and processing time. The baseline algorithm was
used to extract 6 bilingual lexicons and it was evaluated on four of them. The second
algorithm was evaluated only on the Romanian-English noun lexicon. An analysis of the
missed or wrong translation equivalents identified various factors, both intrinsic, due to
the method, and extrinsic, due to the working data (accuracy of the pre-processing, quality
of translation, bitext language relatedness). We conclude by discussing the merits and
drawbacks of our method in comparison with other work and comment on further
developments.
Keywords: alignment, bitext, bilingual dictionaries, evaluation, hapax-legomena,
lemmatization, parallel corpora, tagging
1 Introduction
Automatic extraction of bilingual lexicons from parallel texts might seem a futile task,
given that more and more bilingual lexicons are printed nowadays and they can easily be
turned into machine-readable lexicons. However, if one considers only the possibility of
automatically enriching the presently available electronic lexicons, with very limited manpower
and lexicographic expertise, the problem reveals a lot of potential. The scientific and
technological advancement in many domains is a constant source of new term coinage, and
therefore keeping up with multilingual lexicography in such areas is very difficult unless
computational means are used. On the other hand, bilingual lexicons needed for (automatic)
translation appear to be quite different from the corresponding printed lexicons, meant for
human users. The marked difference between printed bilingual lexicons and bilingual lexicons
needed for automatic translation is not really surprising. Traditional lexicography deals with
translation equivalence (the underlying concept of bilingual lexicography) in an inherently
discrete way. What is to be found in a printed dictionary or lexicon (bi- or multilingual) is just
a set of general basic translations. In the case of specialised registers, general lexicons are usually
not very useful.
A pair of texts that represent the translation of each other is called a parallel text or a bitext.
Extracting bilingual dictionaries from a bitext is a process based on the notion of translation
equivalence. In a given parallel text, the assumption is that the same meaning is linguistically
expressed in two or more languages. Meaning identity between two or more representations
of presumably the same thing is a notorious philosophical problem and even in more precise
contexts than language (for instance in software engineering) it remains a fuzzy concept.
Consequently the notion of translation equivalence relation, built on the meaning identity
assumption, is inherently vague. In the area of machine translation, terminology, multilingual
information retrieval and other related domains, one needs operational notions, defined in
precise, quantifiable terms. One of the widely accepted interpretations [Melamed, 2000] of the
translation equivalence defines it as a (symmetric) relation that holds between two different
language texts such that expressions appearing in corresponding parts of the two texts are
reciprocal translations. These expressions are called translation equivalents. A bitext with its
translation equivalents linked is called an aligned bitext. The granularity at which translation
equivalents are defined (paragraph, sentence, lexical) defines the granularity of a bitext
alignment (paragraph, sentence, lexical).
For bilingual dictionary extraction from a bitext, the task of interest is the identification of
translation equivalents at the lexical level (words or expressions).
In spite of the bi-directionality of the translation equivalence relation, the text in one
language is usually called the source of the bitext and the text in the other language is called
the target of the bitext.
One basic resource in translating a text (thus creating a bitext) is a bilingual dictionary (a
set of lexical translation equivalents). Automatic extraction of lexical translation equivalents
is the reverse process aiming at discovering the bilingual dictionary used in a bitext.
When no recourse is made to external sources of linguistic knowledge (such as a bilingual
lexicon) such an enterprise is not conceptually different from the one of Champollion’s
deciphering of the hieroglyphic writing. In 1799, Pierre Bouchard, a French officer of the
engineering corps, discovered the Rosetta Stone on the west bank of the Nile during
Napoleon's Egyptian campaign. The proclamation carved on it, praising Ptolemy V in 196
B.C., appears in three texts: Hieroglyphics, Egyptian Demotic Script and Greek. Jean-François
Champollion, a brilliant linguist, assumed that any pair of the three texts made a
bitext. Based on this hypothesis, in 1822, after 14 years of research without ever seeing the
stone itself, he managed to find the first translation equivalent, namely the one for "Ptolemy",
occurring 5 times in the hieroglyphic variant of the proclamation. Twenty-three years passed
before the Rosetta Stone was completely deciphered [http://www.freemaninstitute.com/
Gallery/rosetta.htm].
Most modern approaches to automatic extraction of translation equivalents (backed up by
the power of nowadays computers) rely on statistical techniques and roughly fall into two
categories. The hypotheses-testing methods such as (Gale and Church, 1991), (Smadja et al.,
1996) etc. rely on a generative device that produces a list of translation equivalence
candidates (TECs), each of them being subject to an independence statistical test. The TECs
that show an association measure higher than expected under the independence assumption
are assumed to be translation-equivalence pairs (TEPs). The TEPs are extracted independently
one of another and therefore the process might be characterised as a local maximisation
(greedy) one. The estimating approach (Brown et al., 1993), (Kupiec, 1993), (Hiemstra, 1997)
etc. is based on building from data a statistical bitext model the parameters of which are to be
estimated according to a given set of assumptions. The bitext model allows for global
maximisation of the translation equivalence relation, considering not individual translation
equivalents but sets of translation equivalents (sometimes called assignments).
There are pros and cons for each type of approach, some of them discussed in (Hiemstra,
1997). Essentially, the hypotheses testing is computationally cheaper since it works with a
reasonable search space, proportional to N², where N is the maximum of the numbers of
lexical items in the two parts of the bitext, but it has difficulties with finding rare translation
equivalents of the bitext. The estimating approach is theoretically extremely expensive from
the computational point of view, the search space being proportional to N! (N is the same as
above), but in principle it is expected to produce accurate bilingual dictionaries with broader
coverage (better recall). Very efficient implementations, supported by reasonable
assumptions, allow for fast convergence towards the interesting part of the huge search space
[Brown et al., 1993].
Our method is a greedy one and makes decisions based on local contexts. It generates first
a list of translation equivalent candidates and then successively extracts the most likely
translation-equivalence pairs.
A common intuition underlying the majority of methods used in automatic extraction of
translation equivalents from bitexts is that words that are translations of each other are more
likely to appear in corresponding bitext regions than other pairs (Melamed, 2000).
The translation equivalents extraction process, as considered here, does not rely on a pre-existing
bilingual lexicon for the considered languages. Yet, if such a lexicon exists it can be
used to eliminate spurious candidate translation equivalence pairs and thus to speed up the
process and increase its accuracy.
1.1 Words and multiword lexical tokens
In the previous section we defined a translation equivalent as a special pair of two lexical
items, one in the source language of the bitext and the second in the other language of the
bitext. In general, a lexical item is considered to be a space-delimited string of characters or
what is usually called a word1. However, it is not necessary that a space in text be interpreted
always as a lexical item delimiter. For various reasons, in many languages and even in
monolingual studies, some sequences of traditional words are considered as making up a
single lexical unit. For instance in English "in spite of", "machine gun", "chestnut tree", "take
off" etc., or in Romanian "de la" (from), "gaura cheii" (keyhole), "sta în picioare" (to stand),
"(a)-şi aminti" (to remember), etc., could arguably be considered as single meaningful lexical
units even if one is not concerned with translation. For translation purposes considering
multiword expressions as single lexical units is a must because of the differences that might
appear in the linguistic realisation of commonly referred-to concepts. One language might use
concatenation (with or without a hyphen at the joint point), agglutination, derivational
constructions or a simple word where another language might use a multiword expression (with
compositional or non-compositional meaning).
In the following we will refer to words and multiword expressions as lexical tokens, or
simply, tokens.
The recognition of multiword expressions as single lexical tokens, but also the splitting of
single words into multiple lexical tokens (where applicable), is generically called text
segmentation, and the program that performs this task is called a segmenter or tokenizer. The
simplest method for text segmentation is based on (monolingual) lists of the most frequent
compound expressions (collocations, compound nouns, phrasal verbs, idioms, etc.) and some
regular expression patterns for dealing with the numerous instantiations of similar constructions
(numbers, dates, abbreviations, etc.). This linguistic knowledge is referred to as the tokenizer's
resources. In this approach the tokenizer would check if the input text contains string
sequences that match any of the stored patterns and in such a case the matching input
sequences are replaced as prescribed by the tokenizer's resources. Although this text
segmentation method is very simple, the main criticism against it is that the tokenizer's
resources are never exhaustive. To overcome this drawback one can use special programs for
1 Obviously this comment applies for those languages which use the space delimiter.
automatic updating of the tokenizer's resources. Such programs are the so-called collocation
extractors. A statistical collocation extraction program is based on the idea that words that
appear together more often than would be expected under an independence assumption and
conform to some prescribed syntactic patterns are likely to be collocations. For checking the
independence assumption, one can use various statistical tests. The most used statistical tests
for collocation analysis (mutual information, DICE, log-likelihood, chi-square or left-Fisher
exact test) are available in a nice package called BSP (Bigram Statistical Package, see:
http://www.d.umn.edu/~tpederse/code.html) due to Ted Pedersen and Satanjeev Banerjee. As
these tests consider only pairs of tokens, in order to identify collocations longer than
two words the bigram analysis must be applied recursively until no new collocations are
discovered. The final list of extracted collocations must be filtered, as it might include
many noisy associations. The filtering is usually achieved by means of stop-word lists
and/or grammatical patterns. The disadvantage of this approach is that it requires monolingual
resources (stop-word list, grammar patterns) and that it ignores the translation equivalence
issue, which is our main concern here.
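To make the independence test concrete, the following sketch (ours, written in Python; it is not the BSP code and the counts in the example call are invented) computes the log-likelihood association score of a candidate bigram from its 2×2 contingency table of corpus counts:

import math

# Log-likelihood ratio for a candidate bigram (w1, w2) from a 2x2 contingency table:
# n11 = freq(w1 w2), n12 = freq(w1 followed by a word other than w2),
# n21 = freq(a word other than w1 followed by w2), n22 = all remaining bigrams.
def bigram_ll(n11, n12, n21, n22):
    n = n11 + n12 + n21 + n22
    rows = (n11 + n12, n21 + n22)
    cols = (n11 + n21, n12 + n22)
    cells = ((n11, 0, 0), (n12, 0, 1), (n21, 1, 0), (n22, 1, 1))
    # 2 * sum of observed * log(observed / expected), skipping empty cells
    return 2 * sum(o * math.log(o * n / (rows[i] * cols[j])) for o, i, j in cells if o)

# Invented counts for a candidate such as "machine gun"; a high score suggests a collocation.
print(bigram_ll(30, 20, 10, 9940))

Bigrams scoring above a chosen threshold (and matching the prescribed syntactic patterns) are kept as collocation candidates; longer collocations are obtained by re-running the test over already merged bigrams, as described above.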
For our experiments we used Philippe di Cristo’s multilingual segmenter MtSeg
(http://www.lpl.univ-aix.fr/projects/multext/MtSeg/) developed for the MULTEXT project.
The segmenter comes with tokenization resources for many Western European languages,
further enhanced in the MULTEXT-EAST project [Erjavec& Ide, 1998], [Dimitrova et al.,
1998], [Tufiş et al, 1998] with corresponding resources for Bulgarian, Czech, Estonian,
Hungarian, Romanian and Slovene. The segmenter is able to recognise dates, numbers and
various fixed phrases, and to split clitics or contractions, etc.
To cope with the inherent incompleteness of the segmenter resources, besides using a
collocation extractor (unaware of translation equivalence) we experimented with a
complementary method that takes advantage of the word alignment process by trying to
identify partially correct translation equivalents. This method, described at length in [Tufiş,
2001b], is briefly reviewed in the section on partial translations.
1.2 Sentence alignment
A sentence-aligned bitext is usually a prerequisite for extracting lexical translation
equivalents. Theoretically the sentence alignment is not necessary for extracting lexical
translation equivalents. It is just a matter of computational efficiency, which we will discuss below.
One of the most successful sentence alignment methods [Gale&Church, 1993] is based on a
simple idea: assuming that the ratio of the number of characters in one part of the bitext to
the number of characters in the other part is characteristic of that bitext, [Gale&Church,
1993] assumed this value should be approximately the same for each of the alignment pairs.
They consider a two-dimensional representation of a bitext so that on the X-axis are
represented the indexes of all the characters in one part of the bitext and on the Y-axis the
indexes of the characters in the other part of the bitext. If the position of the last character in
the text represented on the X-axis is M and the position of the last character in the text
represented on the Y-axis is N, then the segment that starts in origin (0,0) and ends in the
point of co-ordinates (M,N) represents the alignment line of the bitext (see Figure 1). The
positions of the last letter of each sentence in both parts of the bitext are called alignment
positions. Gale and Church transformed the alignment problem into a dynamic programming
one, namely to find the maximal number of alignment position pairs (represented in Figure 1
by small circles), so that they should have a minimal dispersion with respect to the alignment
line. The diagram in Figure 1, where the first alignment point is (X2, Y1) suggests that the
first two sentences in one part of the bitext are aligned with the first sentence in the other part
of the bitext.
Figure 1: Sentence alignment and the alignment position pairs (diagram: the X-axis holds the
character positions X1, X2, ..., Xi, ..., Xj, ..., M of one part of the bitext and the Y-axis the
character positions Y1, ..., Yp, ..., Yq, ..., N of the other part; the alignment points lie close to
the alignment line running from (0,0) to (M,N))
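To make the idea concrete, the sketch below (our illustration in Python, not the CharAlign implementation, which uses a probabilistic cost model) aligns two lists of sentence lengths, measured in characters, by dynamic programming; a bead (1-1, 1-0, 0-1, 2-1 or 1-2) is penalised by how far its character lengths depart from the global length ratio, plus a small fixed cost for non-1-1 beads.

# Simplified length-based sentence alignment (illustrative only).
def align_by_length(src_len, tgt_len):
    ratio = sum(tgt_len) / sum(src_len)              # slope of the alignment line
    BEADS = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
    m, n = len(src_len), len(tgt_len)
    INF = float("inf")
    cost = [[INF] * (n + 1) for _ in range(m + 1)]
    back = [[None] * (n + 1) for _ in range(m + 1)]
    cost[0][0] = 0.0
    for i in range(m + 1):
        for j in range(n + 1):
            if cost[i][j] == INF:
                continue
            for ds, dt in BEADS:
                if i + ds > m or j + dt > n:
                    continue
                s, t = sum(src_len[i:i + ds]), sum(tgt_len[j:j + dt])
                penalty = 0.0 if (ds, dt) == (1, 1) else 30.0
                c = cost[i][j] + abs(t - ratio * s) + penalty
                if c < cost[i + ds][j + dt]:
                    cost[i + ds][j + dt] = c
                    back[i + ds][j + dt] = (i, j, ds, dt)
    beads, i, j = [], m, n                           # recover the bead sequence
    while (i, j) != (0, 0):
        pi, pj, ds, dt = back[i][j]
        beads.append((list(range(pi, pi + ds)), list(range(pj, pj + dt))))
        i, j = pi, pj
    return list(reversed(beads))

# Two source sentences translated as one target sentence, then a 1-1 pair (invented lengths):
print(align_by_length([70, 180, 60], [245, 58]))     # -> [([0, 1], [0]), ([2], [1])]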
1.3 The 1:1 mapping hypothesis
The basic hypothesis used in sentence alignment, namely that the co-ordinates of the
alignment points have monotonically increasing values, does not hold in lexical alignment. In Figure
2a this hypothesis is rendered by parallel alignment links. In Figure 2b one can see that this is
no longer true, and therefore lexical alignment is much harder than sentence
alignment. But this is not the only cause of difficulties in lexical alignment.
Figure 2a: Sentence alignment; the alignment links between source segments (segS1, segS2, ...,
segSk) and target segments (segT1, segT2, ..., segTk) never cross
Figure 2b: Word alignment; the alignment links between source tokens (tokS11, tokS12, ..., tokS1m)
and target tokens (tokT11, tokT12, ..., tokT1n) frequently cross
In general one word in one part of a bitext is translated by one word in the other part
of the bitext. If this statement, called the "word to word mapping hypothesis", were
always true, the lexical alignment problem would become significantly easier to solve. But we
all know that the "word to word mapping hypothesis" is not true. By introducing the notion of
lexical token, we tried to alleviate this difficulty, assuming that proper segmentations of the
two parts of a bitext would make the "token to token mapping hypothesis" a valid working
assumption. We will generically refer to this mapping hypothesis as the "1:1 mapping
hypothesis" in order to cover both word-based and token-based mappings.
With the "1:1 mapping hypothesis" considered, the translation equivalence pairs are
certainly included in the Cartesian product computed over the sets of words or tokens in the
two parts of the bitext: TEP = {<segSi segTj>} ⊂ TS⊗TT. The search space contains K² possible
translation equivalence pairs (for a hypotheses-testing approach) or K! possible assignments
(for an estimating approach). If the 1:1 mapping hypothesis is not the underlying one, then the
search space is much larger, namely ℘(TS)⊗℘(TT), where ℘(X) is the power-set of X with
card(℘(X)) = 2^card(X) (for a hypotheses-testing approach), or ℘(X)! assignments (for an
estimating approach).
Using token-based segmentation as described in the previous section, most of the
limitations of the "1:1 mapping" hypothesis are eliminated and the problem to be solved
(dictionary extraction) becomes computationally much cheaper.
In the following we will get into the details of our method for bilingual dictionary
extraction and its implementation.
2 Pre-processing steps
2.1 Sentence alignment
After the two parts of the bitext were tokenised by MtSeg and sentence IDed (given unique
identifiers)2 by a simple script, they are given as input to the sentence aligner. We used a
slightly modified version of Gale and Church's CharAlign sentence aligner
[Gale&Church, 1993]. In the following we will refer to the alignment units as translation
units (TU). Figure 3 shows the first tokens in the English-Romanian bitext from
the "1984" (Orwell) multilingual parallel corpus, which was the corpus we used in all the
experiments presented here. An overall presentation of this corpus will be given in the section
on evaluation.
The result produced by the sentence aligner is a cesAlign DTD [Ide&Veronis, 1995]
compliant document which specifies, by means of the sentence IDs, which sentence(s) in one
language are translated by which sentence(s) in the other language, thus creating the
aligned bitext.
Source file                          Target file
<S FROM=Oen.1.1.1.1>                 <S FROM="Oro.1.2.2.1">
TOK     It                           LSPLIT  Într-
TOK     was                          TOK     o
TOK     a                            TOK     zi
TOK     bright                       TOK     senină
…                                    …
Figure 3: Input to the sentence aligner; segmented parallel texts
Figure 4 shows the first two TUs of the bitext. The first TU contains one Romanian
sentence (Oro.1.2.2.1) which translates two English sentences (Oen.1.1.1.1 Oen.1.1.1.2). The
second TU contains one Romanian sentence (Oro.1.2.3.1) which translates one English
sentence (Oen.1.1.2.1).
<!doctype cesAlign PUBLIC"-//CES//DTD cesAlign//EN"[]>
<linkList id="oroen">
<linkGrp id="oroen.1" type="body" targtype="s" domains="oro oen">
<link xtargets=" Oro.1.2.2.1 ; Oen.1.1.1.1 Oen.1.1.1.2 ">
<link xtargets=" Oro.1.2.3.1 ; Oen.1.1.2.1 ">
…
Figure 4: The aligned Romanian-English bitext
With a proper stylesheet, the alignment file, partly shown in Figure 4, is visualized as in
the snapshot below (Figure 5).
2 The sentence IDer as used in the MULTEXT-EAST project was written by Greg Priest-Dorman of Vassar.
<Oro.1.2.2.1>Într-o zi senină şi friguroasă de aprilie, pe când ceasurile băteau ora
treisprezece, Winston Smith, cu bărbia înfundată în piept pentru a scăpa de vântul care-l lua
pe sus, se strecură iute prin uşile de sticlă ale Blocului Victoria, deşi nu destul de repede
pentru a împiedica un vârtej de praf şi nisip să pătrundă o dată cu el.
<Oen.1.1.1.1>It was a bright cold day in April, and the clocks were striking thirteen.
<Oen.1.1.1.2> Winston Smith, his chin nuzzled into his breast in an effort to escape the vile
wind, slipped quickly through the glass doors of Victory Mansions, though not quickly
enough to prevent a swirl of gritty dust from entering along with him.
<Oro.1.2.3.1>Holul blocului mirosea a varză călită şi a preşuri vechi.
<Oen.1.1.2.1>The hallway smelt of boiled cabbage and old rag mats.
. . .
Figure 5: The rendering of the aligned bitext
Estonian-English             Hungarian-English            Romanian-English
Align   Nr.    Proc          Align   Nr.    Proc          Align   Nr.    Proc
3-1       2    0.030321%     7-0       1    0.014997%     3-1       3    0.046656%
2-2       3    0.045482%     4-1       1    0.014997%     2-4       1    0.015552%
2-1      60    0.909642%     3-1       7    0.104979%     2-3       3    0.046656%
1-3       1    0.015161%     3-0       1    0.014997%     2-2       2    0.031104%
1-2     100    1.516070%     2-1     108    1.619676%     2-1      85    1.321928%
1-1    6426    97.42268%     1-6       1    0.014997%     2-0       1    0.015552%
1-0       1    0.015161%     1-5       1    0.014997%     1-5       1    0.015552%
0-2       1    0.015161%     1-2      46    0.689862%     1-3      14    0.217729%
0-1       2    0.030321%     1-1    6479    97.16557%     1-2     259    4.027994%
                             0-4       1    0.014997%     1-1    6047    94.04355%
                             0-2       3    0.044991%     0-3       2    0.031104%
                             0-1      19    0.284943%     0-2       2    0.031104%
                                                          0-1      10    0.155521%

Bulgarian-English            Czech-English                Slovene-English
Align   Nr.    Proc          Align   Nr.    Proc          Align   Nr.    Proc
2-2       2    0.030017%     4-1       1    0.015029%     3-3       1    0.014970%
2-1      23    0.345190%     3-1       2    0.030057%     2-1      48    0.718563%
1-2      72    1.080594%     2-1     109    1.638112%     1-5       1    0.014970%
1-1    6558    98.42413%     1-3       2    0.030057%     1-2      53    0.793413%
0-1       8    0.120066%     1-2      81    1.217313%     1-1    6572    98.38323%
                             1-1    6438    96.75383%     1-0       2    0.029940%
                             0-1      21    0.315600%     0-1       3    0.044910%

Figure 6: Distribution of sentence alignment types in the "1984" parallel corpus
In general, the sentence alignments of all the bitexts of our multilingual corpus are of the 1:1 type,
that is, in most cases one sentence is translated as one sentence. Figure 6 shows the
alignment types for the 6 bitexts, English being the hub language for all of them. Native
speakers of the languages paired with English validated the alignments, so that most of the
alignment errors were corrected.
Alignment errors are quite harmful for the accuracy of the bilingual dictionary extraction
process because translation equivalents will be looked for in the wrong areas; in the best
case such errors create noise that prevents the extraction of several real translation
equivalence pairs. In the worst case such errors will allow for the extraction of several wrong
translation equivalence pairs (the upper limit being Σ_{i=1..k} Ni·Mi, with k the number of wrong
sentence alignments and Ni and Mi the number of tokens in each part of the i-th alignment
unit).
Melamed [Melamed, 1996] observed that most translation pairs conserve their
part of speech, that is, most of the time a verb translates as a verb, a noun as a noun, and so on.
He called such translation pairs V-type, to distinguish them from those translation pairs where the
part of speech of one token is not the same as that of the other token. This type was called
P-type translation pairs. The third category of translation pairs is represented by the
incomplete translations (I-type), the incompleteness resulting from his underlying 1:1 mapping
approach.
Melamed's findings concerning the distribution of translation types are quite similar to ours,
although our text was a literary one and his was an extract from the Canadian Parliament
debates (a text with presumably more literal translations). What is worth mentioning is that
the P-type pairs do not contain arbitrarily paired parts of speech, and one might consider regular
patterns in part-of-speech alternations (participle-adjective, gerund-noun, gerund-adjective) in
order to assimilate most of the P-type pairs with the V-type ones.
2.2 Tagging and lemmatization
Tagging is the process of labeling each token with a morpho-lexical code, out of a list of
known or guessed possibilities that represent the morpho-lexical ambiguity class of the token
in question. In our experiments we used a tiered-tagging approach with combined language
models [Tufiş, 1999] based on TnT [Brants, 2000], a trigram HMM tagger. This approach has
been shown to provide for Romanian an average accuracy of more than 98.5%. The tiered-tagging
model is based on using two different tagsets. The first one, which is best suited for
the statistical processing, is used internally, while the other one (used in a morpho-syntactic
lexicon and in most cases more linguistically motivated) is used in the tagger's output. The
mapping between the two tagsets is in most cases deterministic (via a dictionary lookup) and,
in the rare cases where it is not, a few regular expressions may solve the non-determinism.
The idea of tiered tagging works not only for very fine-grained tagsets, but also for
very low-information tagsets, such as those containing only the part of speech. In such cases the
mapping from the hidden tagset to the coarse-grained tagset is strictly deterministic.
In [Tufiş, 2000] we showed that using the coarse-grained tagset directly (14 non-punctuation
tags) gave at best an accuracy of 93%, while using the tiered tagging and combined
language model approach (92 non-punctuation tags in the hidden tagset) the accuracy was
never below 99.5%.
For the purpose of bilingual lexicon extraction we used only the part-of-speech information
and therefore the tagging process did not represent a significant source of errors for the token
alignment procedure. Unlike sentence alignment errors, a tagging error affects only tokens within a
sentence and, moreover, tagging errors do not systematically affect the same tokens.
Because of this, it is very likely that the main result of tagging errors is some statistical noise,
which might prevent extraction of some correct translation pairs (most likely, rare ones).
The lemmatization procedure is in our case a straightforward process, since the monolingual
lexicons developed within the MULTEXT-EAST project contain for each word its lemma and the
morpho-syntactic codes that apply to the word in question. Knowing the wordform and its
associated tag, the lemma extraction is just a matter of lexicon lookup (for those words that
are in the lexicon; for unknown words, the lemma is automatically set to the wordform itself).
Figure 7 shows the result of the tagging and lemmatization process3 for the same exemplified
samples.
3 We covered the processing of the Romanian part. The English part of the bitext was the responsibility of another partner in the MULTEXT-EAST project.
The lexical items in each part of the aligned bitext are annotated with their lemma, lexicon
code and part of speech, which are used in the subsequent phases.
Source file                               Target file
<S FROM=Oen.1.1.1.1>                      <S FROM="Oro.1.2.2.1">
TOK  It      it\Pp3ns\P                   LSPLIT  Într-   Întru\Spsay\S
TOK  was     be\Vmis3s\AUX                TOK     o       un\Tifsr\T
TOK  a       a\Di\D                       TOK     zi      zi\Ncfsrn\N
TOK  bright  bright\Af\A                  TOK     senină  senin\Afpfsrn\A
…                                         …
Figure 7: Tagged and lemmatized bitext; the first column gives the type of the lexical unit (TOK,
SPLIT, DATE, etc.), the second gives the word form, while the third gives lemma\lexicon_code\pos
Erjavec and Ide (1998) provide a description of the MULTEXT-EAST lexicon encoding
principles. A detailed presentation of their application to Romanian is given in [Tufiş et
al., 1997].
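The lemma lookup described above amounts to a simple dictionary access; the toy sketch below (ours) uses two sample entries taken from Figure 7, while the real MULTEXT-EAST lexicons are, of course, much larger and stored in their own format.

# Toy lemma lookup: the monolingual lexicon maps (wordform, morpho-syntactic code) to a
# lemma; unknown wordforms fall back to the wordform itself, as described in section 2.2.
LEXICON = {
    ("zi", "Ncfsrn"): "zi",              # sample entries taken from Figure 7
    ("senină", "Afpfsrn"): "senin",
}

def lemma_of(wordform, msd):
    return LEXICON.get((wordform, msd), wordform)

print(lemma_of("senină", "Afpfsrn"))     # -> senin
print(lemma_of("crimestop", "Ncfsrn"))   # unknown wordform -> returned unchanged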
3 Bilingual dictionary extraction algorithm; the baseline (BASE)
There are several underlying assumptions one can make to keep the computational
complexity of a word alignment algorithm as low as possible. None of them is true in general,
but the situations where they are not true are rare enough that ignoring the exceptions
would not produce a significant number of errors and would not lose too many useful
translations. Moreover, the assumptions we used do not prevent additional processing steps
from recovering some of the correct translations missed because they did not observe the
assumptions.
The assumptions we used in our basic algorithm are the following:
• a lexical token in one half of the TU corresponds to at most one non-empty lexical unit in
the other half of the TU; this is the 1:1 mapping assumption which underlies the work of
many other researchers [Kay & Röscheisen, 1993], [Melamed, 1996], [Brew & McKelvie,
1996], [Hiemstra, 1997], [Tiedemann, 1998], [Ahrenberg et al., 2000] etc. However,
remember that a lexical token could be a multiword expression previously found and
segmented as such by an adequate tokenizer;
• a polysemous lexical token, if used several times in the same TU, is used with the same
meaning; this assumption is explicitly used also by [Melamed, 1996] and implicitly by all
the previously mentioned authors.
• a lexical token in one part of a TU can be aligned to a lexical token in the other part of the
TU only if the two tokens have compatible types (part-of-speech); in most cases,
compatibility reduces to the same POS, but it is also possible to define compatibility
mappings (e.g. participles or gerunds in English are quite often translated as adjectives or
nouns in Romanian and vice versa). This is essentially one very efficient way to cut off
the combinatorial complexity and postpone dealing with irregular ways of POS
alternations.
• although the word order is not an invariant of translation, it is not random either; when
two or more candidate translation pairs are equally scored, the one containing tokens
which are closer in relative position is preferred. This preference is also used in
[Ahrenberg et al., 2000].
Based on the sentence alignment, tagging and lemmatisation, the first step is to compute a
list of translation equivalence candidates (TECL). This list contains several sub-lists, one for
each POS considered in the extraction procedure.
Each POS-specific sub-list contains several pairs of tokens <tokenS tokenT> of the
corresponding POS that appeared in the same TUs. Let TUj be the j-th translation unit. By
collecting all the tokens of the same POSk (in the order they appear in the text and removing
duplicates) in each part of TUj one builds the ordered sets LSj^POSk and LTj^POSk. For each POSi,
let TUj^POSi be defined as LSj^POSi ⊗ LTj^POSi. Then CTUj (the correspondences in the j-th
translation unit) is defined as follows:

CTUj = ∪_{i=1..no.of.pos} TUj^POSi

With these notations, and considering that there are n alignment units in the whole bitext,
TECL is defined as:

TECL = ∪_{j=1..n} CTUj
TECL contains a lot of noise and many TECs are very improbable. In order to eliminate
much of this noise, the very unlikely candidate pairs are filtered out of TECL. The filtering is
based on scoring the degree of association between the tokens in a TEC.
Let us consider the following notations:
• TEC = < TS TT > ∈ TECL, the current translation equivalent candidate containing token
TS and its candidate translation TT;
• n11 = the number of TUs which contains the current TEC < TS TT >;
• n12 = the number of TUs in which the TS token was paired with any other token but TT;
• n21 = the number of TUs in which the TT token was paired with any other token but TS;
• n22 = the number of TUs in which neither TS nor TT appeared in any TEC;
• n1* = the number of TUs in which TS appeared (irrespective of its associations);
• n*1 = the number of TUs in which TT appeared (irrespective of its associations);
• n2* = the number of TUs in which TS did not appear;
• n*2 = the number of TUs in which TT did not appear;
• n** = the total number of TUs.
The 2×2 contingency table in Figure 8 illustrates these notations:
            TT       ¬TT
 TS         n11      n12      n1*
 ¬TS        n21      n22      n2*
            n*1      n*2      n**

 n1* = n11 + n12,   n2* = n21 + n22
 n*1 = n11 + n21,   n*2 = n12 + n22
 n** = Σ_{j=1..2} Σ_{i=1..2} nij
Figure 8: Contingency table for a translation equivalent candidate <TS TT>
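These counts can be read directly off the list of translation units. A minimal sketch (ours; the variable names are not from the paper) derives n11, n12, n21 and n22 for one candidate pair, given each TU as a pair of token sets and treating the presence of both tokens in a TU as their being paired there:

# tus: list of (source_token_set, target_token_set), one entry per translation unit.
def contingency(ts, tt, tus):
    n11 = n12 = n21 = n22 = 0
    for src, tgt in tus:
        s, t = ts in src, tt in tgt
        if s and t:
            n11 += 1          # the TU contains the candidate pair <TS TT>
        elif s:
            n12 += 1          # TS paired only with tokens other than TT
        elif t:
            n21 += 1          # TT paired only with tokens other than TS
        else:
            n22 += 1          # neither TS nor TT occurs in the TU
    return n11, n12, n21, n22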
For the ranking of the TECs and their filtering we experimented with 4 scoring functions: MI
(pointwise mutual information), DICE, LL (log-likelihood) and χ2 (chi-square). In terms of
the above notations, these measures are defined as follows:
MI(TT, TS) = log2 [ (n** · n11) / (n1* · n*1) ]

DICE(TT, TS) = 2·n11 / (n1* + n*1)

LL(TT, TS) = 2 · Σ_{j=1..2} Σ_{i=1..2} nij · log[ (nij · n**) / (ni* · n*j) ]

χ2(TT, TS) = n** · Σ_{j=1..2} Σ_{i=1..2} (nij − ni*·n*j/n**)² / (ni* · n*j)

Figure 9: Statistical measures for scoring a translation equivalent candidate <TS TT>
Chi-square coefficients may alternatively be computed by the simpler formula
χ2(TT, TS) = n** · (n11·n22 − n12·n21)² / [ (n11 + n12) · (n11 + n21) · (n12 + n22) · (n21 + n22) ].
Any filtering would eliminate many wrong TECs but also some good ones. The ratio
between the number of good TECs rejected and the number of wrong TECs rejected is just
one criterion we used in deciding which test to use and what should be the threshold score
below which any TEC is removed from TECL. After various empirical tests we decided
to use the log-likelihood test with the threshold value set to 9.
A baseline algorithm is not very different from the filtering discussed above. However,
for improving the precision, the threshold of whatever statistical test is used is set higher. Some
additional restrictions, such as a minimal number of occurrences for <TS TT> (usually this is 3),
are also used. This baseline algorithm may be enhanced in many ways (using a dictionary of
already extracted TEPs for eliminating generation of spurious TECs, stop-word lists,
considering token string similarity etc.). An algorithm with such extensions (plus a few more)
is described in (Gale and Church, 1991). Although extremely simple, this algorithm, applied
on a sample of 800 sentences from Canadian Hansard, was reported to provide impressive
precision (about 98%). However, the algorithm managed to find only the most frequent words
(4.5%) which cover more than half (61%) of the word occurrences in the corpus. Its recall is
modest if judged in terms of word types [cf. Melamed, 2000].
Our baseline algorithm is an improvement over the one described before. It is a very
simple iterative algorithm, significantly faster than the previous one, with much better recall
even when the precision is required to be as high as 98%. It can be enhanced in many ways
(including those discussed above). It has some similarities to the iterative algorithm presented
in (Ahrenberg et al., 1998) but, unlike it, our algorithm avoids computing various probabilities
(or better said, probability estimates) and scores (t-score). At each iteration step, the pairs that
pass the selection (see below) are removed from TECL, so that this list is shortened after
each step and eventually may be emptied. Based on TECL, for each POS an Sm × Tn
contingency table (TBLk) is constructed, with Sm the number of token types in the first part of
the bitext and Tn the number of token types in the other part of the bitext, as shown in Figure 10.
Source token types index the rows of the table and the target token types (of the same
POS) index the columns. Each cell (i,j) contains the number of occurrences in TECL of the
<TSi, TTj> TEC: nij = occ(TSi, TTj); ni* = Σ_{j=1..n} nij; n*j = Σ_{i=1..m} nij; and n** = Σ_{j=1..n} Σ_{i=1..m} nij.
          TT1    …    TTn
TS1       n11    …    n1n    n1*
…         …      …    …      …
TSm       nm1    …    nmn    nm*
          n*1    …    n*n    n**
Figure 10: Contingency table with counts for TECs at step K
The selection condition is expressed by the equation:

(EQ1)  TP^k = { <TSi TTj> | ∀p, q: (nij ≥ niq) ∧ (nij ≥ npj) }
This is the key idea of the iterative extraction algorithm: it expresses the requirement
that, in order to select a TEC <TSi, TTj> as a translation equivalence pair, the number of
associations of TSi with TTj must be higher than (or at least equal to) the number of its
associations with any other TTp (p≠j), and the same must hold the other way around. All the
pairs selected in TP^k are removed (the respective counts are zeroed). If TSi is translated in
more than one way (either because it has multiple meanings that are lexicalised in the second
language by different words, or because various synonyms for TTj are used in the target
language), the rest of the translations will be found in subsequent steps (if frequent enough).
The most used translation of a token TSi will be found
first. The iterative baseline algorithm is sketched below (many bookkeeping details are
omitted).
procedure BASE(bitext,step; dictionary) is:
k=1;
TP(0)={};
TECL(k)=build-cand(bitext);
for each POS in TECL do
loop
TECL(k)=update(TP(k-1),TECL(k))
TBL(k)=build_TEC_table(TECL(k));
TP(k)= select(TBL(k)); ## (EQ1) ##
add(dictionary, TP(k));
k=k+1;
until {(TECL(k-1) is empty)or(TP(k-1) is empty)or(k > step)}
endfor
return dictionary
end
The TECL is implemented as a hash table. Therefore, the two procedures “update”,
which zeroes the co-occurrence counts for the selected pairs, and build_TEC_table, which
builds the contingency table for the next iteration step, are very simple and efficiently
implemented.
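For one part-of-speech table, the selection condition (EQ1) and the iteration loop reduce to the following sketch (ours; the per-POS split, the occurrence threshold and other bookkeeping are omitted):

# counts: dict mapping (source_token, target_token) -> co-occurrence count in TECL.
def select_pairs(counts):
    row_max, col_max = {}, {}
    for (ts, tt), n in counts.items():
        row_max[ts] = max(row_max.get(ts, 0), n)
        col_max[tt] = max(col_max.get(tt, 0), n)
    # (EQ1): keep the pairs whose count is maximal both on their row and on their column
    return [p for p, n in counts.items()
            if n > 0 and n == row_max[p[0]] and n == col_max[p[1]]]

def base(counts, max_steps=4):
    dictionary = []
    for _ in range(max_steps):
        selected = select_pairs(counts)
        if not selected:
            break
        dictionary.extend(selected)
        for pair in selected:          # "update": zero the counts of the selected pairs
            counts[pair] = 0
    return dictionary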
4 A better extraction algorithm (BETA)
One of the main deficiencies of the BASE algorithm is that it is quite sensitive to what
[Melamed, 2000] calls indirect associations. If <TSi, TTj> has a high association score and TTj
collocates with TTk, it might very well happen that <TSi, TTk> also gets a high association
score. Although, as observed by Melamed, the indirect associations in general have lower
scores than the direct (correct) associations, they can receive higher scores than many
correct pairs, and this will not only generate wrong translation equivalents but will also eliminate
from further consideration several correct pairs, deteriorating the procedure's recall. To
weaken this sensitivity, the BASE algorithm had to impose that the number of occurrences of
a TEC be at least 3, thus filtering out more than 50% of all the possible TECs. Still, because
of the indirect association effect, in spite of a very good precision (more than 98%) over the
considered pairs, approximately another 50% of the correct pairs were missed. The BASE algorithm
has this deficiency because it looks at the association scores globally, and does not check
within the TUs whether the tokens making up the indirect association are still there.
To diminish the influence of the indirect associations and, consequently, to remove the
occurrence threshold, we modified the BASE algorithm so that the maximum score is not
considered globally but within each of the TUs. This brings BETA closer to the competitive
linking algorithm described in [Melamed, 1996] and [Melamed, 2000]. The competing pairs
are only the TECs generated from the current TU and the one with the best score is the first
selected. Based on the 1:1 mapping hypothesis, any TEC containing the tokens in the winning
pair is discarded. Then, the next best scored TEC in the current TU is selected and again the
remaining pairs that include one of the two tokens in the selected pair are discarded. The
multiple-step control in BASE, where each TU was scanned several times (equal to the
number of iteration steps), is not necessary anymore. The BETA algorithm will see each TU
only once, but the TU is processed until no further TEPs can be reliably extracted or the TU is
emptied. This modification improves both the precision and the recall in comparison with the
BASE algorithm. In accordance with the 1:1 mapping hypothesis, when two or more TEC
pairs of the same TU share the same token and they are equally scored, the algorithm has to
make a decision and choose only one of them. We used two heuristics: string similarity
scoring and relative distance.
The similarity measure we used, COGN(TS, TT), is very similar to the XXDICE score
described in [Brew&McKelvie, 1996]. If TS is a string of k characters α1α2 . . . αk and TT is a
string of m characters β1β2 . . . βm, then we construct two new strings T'S and T'T by inserting,
where necessary, special displacement characters into TS and TT. The displacement characters
will cause both T'S and T'T to have the same length p (max(k, m) ≤ p < k+m) and the maximum
number of positional matches. Let δ(αi) be the number of displacement characters that
immediately precede the character αi which matches the character βi, and δ(βi) be the number
of displacement characters that immediately precede the character βi which matches the
character αi. Let q be the number of matching characters. With these notations, the
COGN(TS, TT) similarity measure is defined as follows:

COGN(TS, TT) = [ Σ_{i=1..q} 2 / (1 + |δ(αi) − δ(βi)|) ] / (k + m)    if q > 2
COGN(TS, TT) = 0                                                     if q ≤ 2
The threshold for the COGN(TS, TT) was empirically set to 0.42. This value depends on the
pair of languages in the considered bitext. The actual implementation of the COGN test
considers a language dependent normalisation step, which strips some suffixes, discards the
diacritics and reduces some consonant doubling etc. This normalisation step was hand written,
but, based on available lists of cognates, it could be automatically induced.
The second filtering condition, DIST(TS, TT), is defined as follows:
if (<TS, TT> ∈ LSj^POSk ⊗ LTj^POSk) & (TS is the n-th element in LSj^POSk) & (TT is the m-th
element in LTj^POSk) then DIST(TS, TT) = |n − m|
The COGN(TS, TT) filter is stronger than DIST(TS, TT), so the TEC with the highest
similarity score is the preferred one. If the similarity score is irrelevant, the weaker filter
DIST(TS, TT) gives priority to the pairs with the smallest relative distance between the
constituent tokens.
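The string-similarity test can be approximated as below. This is our rough re-implementation: it uses Python's difflib to match characters instead of the displacement-character construction, and approximates |δ(αi) − δ(βi)| by the offset between the matched positions; the normalisation step (suffix stripping, diacritics removal, etc.) is left out. The 0.42 threshold is the one given above.

from difflib import SequenceMatcher

def cogn(ts, tt):
    # q matched characters, each contributing 2/(1 + |offset|), normalised by k+m.
    k, m = len(ts), len(tt)
    q, score = 0, 0.0
    for a, b, size in SequenceMatcher(None, ts, tt).get_matching_blocks():
        q += size
        score += size * 2.0 / (1 + abs(a - b))
    return score / (k + m) if q > 2 else 0.0

print(cogn("minister", "ministru"))   # ~0.81, well above the 0.42 threshold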
The BETA algorithm is sketched below (many bookkeeping details are omitted).
procedure BETA(bitext;dictionary) is:
dictionary={};
TECL(k)=build-cand(bitext);
for each POS in TECL do
for each TUiPOS in TECL do
finish=false;
loop
best_cand = get_the_highest_scored_pairs(TUiPOS);
conflicting_cand=select_conflicts(best_cand);
non_conflicting_cand = best_cand \ conflicting_cand;
best_cand=conflicting_cand;
if cardinal(best_cand)=0 then finish=true;
else
if cardinal(best_cand)>1 then best_cand=filtered(best_cand);
endif;
best_pairs = non_conflicting_cand + best_cand
add(dictionary,best_pairs);
TUiPOS = remove_pairs_containing_tokens_in_best_pairs(TUiPOS);
endif;
until {(TUiPOS={})or(finish=true)}
endfor
endfor
return dictionary
end
procedure filtered(best_cand) is:
result = get_best_COGN_score(best_cand);
if cardinal(result)=0 then
result = get_best_DIST_score(best_cand);
else if cardinal(result)>1
result = get_best_DIST_score(best_cand);
endif
endif
return result;
end
5 Experiments and results
We conducted experiments on one of the few publicly available multilingual aligned
corpora, namely the "1984" multilingual corpus (Dimitrova et al., 1998), containing 6
translations of the English original. This corpus was developed within the Multext-East
project, published on a CD-ROM (Erjavec et al., 1998) and recently improved within the
CONCEDE project. The newer version is distributed by TRACTOR, the TELRI Research Archive
of Computational Tools and Resources (www.tractor.de).
Each monolingual part of the corpus (Bulgarian, Czech, Estonian, Hungarian, Romanian
and Slovene) was tokenised, lemmatised, tagged and sentence aligned to the English hub.
Figure 11 shows the number of lemmas in each monolingual part of the
multilingual corpus, as well as the number of lemmas that occurred more than twice.
Language                 Bulgarian   Czech   English   Estonian   Hungarian   Romanian   Slovene
No. of wordforms*            15093   17659      9192      16811       19250      14023     16402
No. of lemmas*                8225    8677      6871       8403        9729       6987      7157
No. of >2-occ lemmas*         3350    3329      2916       2729        3294       2999      3189

Figure 11: The lemmatised monolingual "1984" overview
(* the number of lemmas does not include interjections, particles, residuals)
In the context of this paper, we distinguish between hapax token and hapax translation pairs.
The notion of hapax token (a token that appeared only once) is defined monolingually while
the notion of hapax translation pair (a translation pair that appeared only once) is defined in a
bilingual context. If one or both of the constituents of a translation pair is a hapax token, then
the translation pair is also a hapax one. But the converse is not necessarily true. A
recurrent token might be used with different senses and these senses might be lexicalized by
different tokens in the other part of the bitext. Also, a recurrent token, although used with the
same meaning, might be translated by different synonyms.
We relax the definition of "translation hapax" to mean a pair of translation equivalents
which appears in a single TU. Therefore, even if a translation pair occurs twice or more in the
same TU and in no other TU, it will still be considered a translation hapax.
The evaluation protocol specified that all the translation pairs are to be judged in context,
so that if one pair is found to be correct in at least one context, then it should be judged as
correct. The evaluation was done for both the BASE and the BETA algorithms, but on different
scales. The BASE algorithm was run on all 6 bitexts with the English hub, and native
speakers of the second language in the bitexts (with a good command of English) validated 4 of
the 6 bilingual lexicons. The lexicons contained all parts of speech defined in the MULTEXT-EAST
lexicon specifications [Erjavec & Monachini, 1997] except for interjections, particles
and residuals.
The BETA algorithm was run on the Romanian-English bitext, but at the time of this
writing the evaluation was finalised only for the nominal translation pairs.
5.1 The evaluation of the BASE algorithm
For validation purposes we limited the number of iteration steps to 4. The extracted
dictionaries contain adjectives (A), conjunctions (C), determiners (D), numerals (M), nouns
(N), pronouns (P), adverbs (R), prepositions (S) and verbs (V). Figure 12 shows the
evaluation results for those languages for which we found volunteer native-speaker evaluators.
The precision (Prec) was computed as the number of correct TEPs divided by the total
number of extracted TEPs. The recall (considered for the non-English language in the bitext)
was computed in two ways: the first one, Rec*, takes into account only the tokens
processed by the algorithm (those that appeared at least three times); the second one, Rec,
takes into account all the tokens irrespective of their frequency counts.
number of source lemma types in the correct TEPs divided by the number of lemma types in
the source language with at least 3 occurrences. Rec is defined as the number of source
lemma types in the correct TEPs divided by the number of lemma types in the source
language.
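Stated as code (a sketch with invented variable names), the three measures are:

# correct and extracted are lists of (source_lemma, target_lemma) pairs; all_src_types and
# freq3_src_types are the sets of source lemma types (all of them, and only those occurring
# at least three times, respectively).
def evaluate(correct, extracted, all_src_types, freq3_src_types):
    prec = len(correct) / len(extracted)
    correct_src = {src for src, _ in correct}
    rec_star = len(correct_src) / len(freq3_src_types)
    rec = len(correct_src) / len(all_src_types)
    return prec, rec_star, rec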
Bitext                        Bg-En    Cz-En    Et-En         Hu-En         Ro-En         Sl-En
4 processing steps (entries)  1986     2188     1911          1935          2227          1646
Prec/Rec*                     NA/NA    NA/NA    96.18/57.86   96.89/56.92   98.38/58.75   98.66/57.92
Rec                           NA       NA       18.79         19.27         25.21         22.69

Figure 12: Partial evaluation of the BASE algorithm after 4 iteration steps
The accuracy of the extraction process varies with respect to different parts of speech. The
table in Figure 13 displays extraction precision differentiated per part of speech.
POS      Extracted entries   Wrong entries   POS precision
A                      299               5            98.3
C                       29               0             100
D                       30               0             100
M                       25               0             100
N                     1243              21            98.3
P                       39               0             100
R                      170               8            95.3
S                       21               0             100
V                      371               2            99.7
Total                 2227              36           98.38

Figure 13: Romanian-English dictionary; precision of the BASE algorithm after 4 iteration steps
with respect to different parts of speech (2227)
The rationale for showing Rec* is to estimate the proportion of missed tokens among those
considered. This might be of interest when precision is of utmost importance. When the threshold
of a minimum of 3 occurrences is used, the algorithm provides a high precision and a good
recall (Rec*). The evaluation was fully done for Estonian, Hungarian and Romanian and
partially for Slovene (the first step was fully evaluated while the rest were evaluated from
randomly selected pairs).
As one can see from the table in Figure 12, after four iterations, the precision is higher than
98% for Romanian and Slovene, almost 97% for Hungarian and more than 96% for Estonian.
Rec* ranges from 50.92% (Slovene) to 63.90% (Estonian). The standard recall Rec varies
between 19.27% and 32.46% (quite modest, since on average the BASE algorithm did not
consider 60% of the lemmas, which is not surprising given Zipf's rank-frequency law
[Zipf, 1936]).
To facilitate the comparison with the evaluation of the BETA algorithm we ran the BASE
algorithm for extracting the noun translation pairs from the Romanian-English bitext. The
noun extraction had the second worst accuracy (the worst was the adverb), and therefore we
considered that an in-depth evaluation of this case would be more informative than a global
evaluation. We set no limit for the number of steps and lowered the occurrence threshold to 2.
The program stopped after 10 steps, with 1900 extracted translation pairs, out of
which 126 were wrong (see Figure 14). Compared with the 4-step run, the precision
decreased to 93.36%, but both Rec* (70.12%) and Rec (39.76%) improved.
Noun types in text   No. entries   Correct entries   Types in correct entries   Prec/Rec*/Rec
3435 (1948 occ>1)           1900              1774                       1366   93.36/70.12/39.76

Figure 14: BASE evaluation on the noun dictionary extracted from the Romanian-English bitext
(non-hapax)
Another way of evaluating the recall is by showing what percentage of the tokens in the
text is covered by the types in the dictionary. This measure is better called coverage and we denote it
by Coverage. However, the coverage is not very informative, since a few very frequent tokens
usually ensure a large coverage. So, if an extraction algorithm found translations only for
these token types, its Coverage score would still be quite good. As we mentioned previously, the
4.5% of token types (this is the recall in our evaluation) for which the algorithm described in
[Gale&Church, 91] found a translation covered more than 61% of the text.
In the 10-step run of the BASE algorithm, the extracted noun pairs covered 85.83% of the
nouns in the Romanian part of the bitext.
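For completeness, Coverage can be computed as in the sketch below (ours), where noun_tokens is the sequence of noun occurrences in the Romanian part of the bitext and dict_src_types is the set of Romanian types present in the extracted dictionary.

def coverage(noun_tokens, dict_src_types):
    # proportion of token occurrences whose type has a translation in the dictionary
    covered = sum(1 for tok in noun_tokens if tok in dict_src_types)
    return covered / len(noun_tokens)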
We should mention that, in spite of the general practice in computing recall for the bilingual
dictionary extraction task (be it Rec*, Rec or Coverage), this is only an approximation of the
real recall. The reason for this approximation is that in order to compute the real recall one
should have a gold standard with all the words aligned by human evaluators. In general such a
gold standard bitext is not available and the recall is either approximated as above, or is
evaluated on a small sample and the result is taken to be more or less true for the whole bitext.
In the initial version of the BASE algorithm we used a chi-square test to check the selected
TEPs. However, as the selection condition (EQ1) is highly restrictive, the vast majority of the
selected TEPs passed the chi-square test while many pairs that used to pass the chi-square
threshold did not pass the condition (EQ1). Therefore we eliminated this unnecessary
statistical test, which resulted in a very small decrease in recall, compensated by better
precision and a significant improvement in response time.
From the 6 bilingual lexicons we also derived a 7-language lexicon (2862 entries), with
English as a search hub (see Figure 15). As more than half of the English words had
equivalents in only 2 or 3 languages, we considered only those entries for which our
algorithm found translations in all but at most one of the other 6 languages.
En   cold
Bg   студен
Cs   studený/chladný
Et   külm
Hu   hideg
Ro   rece/friguros
Sl   mrzel/hladen

Figure 15: An entry from the extracted multilingual lexicon
This filtered multilingual lexicon contains 1237 entries and can be found at the same site as
the bilingual lexicons. A typical entry of this multilingual lexicon is shown in Figure 15 (the
multilingual dictionary entry is shown using each language's character set; in the actual
file SGML entities are used).
One interesting aspect of the BASE algorithm is that the words that are found as
translations of one word in the same iteration step are very likely to be a multiword
translation of the respective single word (such as the Estonian "armastusministeerium" =
ministry & love). The additional condition for identifying such a particular case of multiword
translation (all having the same part of speech) is that the candidate words must co-occur in
the same TUs. If this condition does not hold, then it simply happened that two different
translations of the same word were equally used.
5.2 The evaluation of the BETA algorithm
The BETA algorithm preserves the simplicity of the BASE algorithm but it significantly
improves its recall (Rec) at the expense of some loss in precision (Prec).
As said before, at the time of this writing, the evaluation for the BETA algorithm was done
only for the Romanian-English bitext and only with respect to the dictionary of nouns. The
filtering condition in case of ties was the following: (max(COGN(TjS, TjT)≥0.42)) ∨
(min(DIST(TjS, TjT)≤2)).
The figures in the tables below summarise the results for this case.
Noun types in text   No. entries   Correct entries   Types in correct entries   Prec/Rec
3435                        4023              3149                       2496   78.27/72.66

Figure 16: BETA evaluation, TECs filtered with the condition (max(COGN(TjS, TjT)≥0.42)) ∨
(min(DIST(TjS, TjT)≤2))
The results show that Rec (72.66%) almost doubled compared with the best Rec
obtained by the BASE algorithm for nouns (39.76%, see Figure 14). The Coverage also
improved, reaching 93.06%. However, the price for these significant
improvements was a serious deterioration of Prec (78.27% versus 93.36%).
The analysis of the wrong translation pairs revealed that most of them were hapax pairs
and that they were selected because the DIST measure enabled them, so we considered that this
filter is not discriminative enough for hapaxes. On the other hand for the non-hapax pairs the
DIST condition was successful in more than 85% of the cases. Therefore, we decided that the
additional DIST filtering condition be preserved for non-hapax competitors only. The
“filtered” procedure was modified to reflect this finding.
procedure filtered(best_cand) is:
result = get_best_COGN_score(best_cand);
if (cardinal(result)=0)&(non-hapax(best_cand))then
result = get_best_DIST_score(best_cand);
else if cardinal(result)>1
result = get_best_DIST_score(best_cand);
endif
endif
return result;
end
Noun types in text   No. entries   Correct entries   Types in correct entries   Prec/Rec
3435                        3713              3007                       2371   80.98/69.02

Figure 17: BETA evaluation, hapax TECs filtered with the condition max(COGN(TjS, TjT)≥0.42)
The results in Figure 17 show that 166 erroneous TEPs were removed, but 144 good
TEPs were also lost. Prec improved (80.93% versus 78.28%) but Rec decreased (69.02% versus
72.65%). The Coverage score for this modified version of BETA slightly decreased to
92.36%.
The BASE algorithm allows for trading off between Prec and Rec* by means of the
number of iteration steps. The BETA algorithm allows for a similar trade-off between Prec
and Rec by means of the COGN and DIST thresholds and, obviously, by means of an
occurrence threshold. For instance, when BETA was set to ignore the hapax pairs, its Prec was
96.11% (better than the BASE precision of 93.36%), Rec* was 96.41% (BASE with 10 iterations
had a Rec* of 70.12%) and Rec was 59.66% (BASE with 10 iterations had a Rec of 39.76%).
Using the COGN test as a filtering device is a heuristic based on the cognate conjecture,
which says that when the two tokens of a translation pair are orthographically similar, they are
very likely to be cognates. If two words are orthographically similar but have no common
meaning, they are called false friends. As a criterion for finding false friends one can use the
alignment and cognate tests as in the diagram in Figure 18 (taken from [Brew and McKelvie,
1996]). Since all pairs in the extracted bilingual dictionary are marked ALIGNED+, all the
pairs that pass the COGN test are supposed to be cognates. This assumption was 95%
correct when the threshold was 0.42 and 100% correct when the threshold was 0.6.
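As an illustration of the kind of test involved (the paper's own COGN definition is given earlier; the ratio below, based on Python's difflib, is only a stand-in with a similar behaviour):

    from difflib import SequenceMatcher

    def cogn_like(src_word, tgt_word):
        """Illustrative stand-in for the COGN string-similarity score:
        2 * (number of matching characters) / (sum of the two word lengths)."""
        return SequenceMatcher(None, src_word.lower(), tgt_word.lower()).ratio()

    def passes_cogn_test(src_word, tgt_word, threshold=0.42):
        # 0.42 is the filtering threshold used above; at 0.6 the cognate
        # assumption held for all the pairs checked by the evaluators
        return cogn_like(src_word, tgt_word) >= threshold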
                ALIGNED+        ALIGNED-
COGN+           cognates        false friends
COGN-           translations    unrelated

Figure 18: String similarity and alignment criteria for classification of the tokens of a bitext
By computing the Cartesian product TYPE_S ⊗ TYPE_T, where TYPE_S and TYPE_T are the sets
of token types in the two languages of a dictionary, and deleting all the pairs present in the
extracted dictionary, one gets a set of pairs marked as ALIGNED-. Out of these pairs, the vast
majority of those passing the COGN test are false friends. In our Romanian-English noun
dictionary, with 2496 token types in the Romanian part and 2433 token types in the English part,
we found 8215 false friends (COGN threshold set to 0.6). We extracted 72 perfect false
friends (those with a COGN score equal to 1) and 29 almost perfect false friends (those with
a COGN score higher than 0.9). Fourteen of them are real cognates, but they were never
used as translations in the text. For instance, "limită" in Romanian is a real cognate of the
English noun "limit". The two occurrences of the Romanian word "limită" were found as
translations of "brink" and "edge". Of the two occurrences of the English noun "limit",
one was translated as a paraphrase ("într-o mică măsură" = "within narrow limits") and the
other was not translated.
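The false-friend search described above can be sketched as follows (illustrative Python; cogn stands for whatever implementation of the COGN score is available, e.g. the stand-in given earlier):

    def false_friend_candidates(types_src, types_tgt, dictionary_pairs, cogn, threshold=0.6):
        """Pairs that look like cognates (COGN >= threshold) but never occur as an
        entry of the extracted dictionary (i.e. ALIGNED-): the false-friend
        candidates, modulo real cognates that simply were not used as translations."""
        aligned = set(dictionary_pairs)                 # the extracted dictionary
        return [(s, t)
                for s in types_src
                for t in types_tgt
                if (s, t) not in aligned and cogn(s, t) >= threshold]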
6 Partial translations
As the alignment model used for translation equivalence extraction is based on the 1:1
mapping hypothesis, it will inherently find partial translations in those cases where one or
more words in one language must be translated by two or more words in the other language.
Although we used a tokenizer aware of compounds in the two languages, its resources were
necessarily incomplete. In the extracted noun lexicon, the evaluators found 116 partial translations
(3.86%). In this section we discuss one way to recover the correct translations for the
partial ones discovered by our 1:1 mapping-based extraction program.
First, from each part of the bitext a set of possible collocations was extracted by a simple
method called "repeated segments" analysis. Any sequence of two or more tokens that
appears more than once is retained. Additionally, the tags attached to the words occurring in a
repeated segment must observe the syntactic patterns characterizing most real
collocations. For the noun dictionary we considered only forms of <head-noun
(functional_word) modifier> as Romanian patterns and <modifier (functional_word) head-noun>
as English patterns. If all the content words contained in a repeated segment have translation
equivalents, then the repeated segment is discarded as not being relevant for a partial
translation. Otherwise, the repeated segment is stored in the lexicon as a translation for the
translation of its head-noun. For instance, “machine gun” was found as a repeated segment
with the translation for “gun” as “mitralieră”. Since “machine” was not translated in the
corresponding TUs, the new entry (mitralieră machine_gun) was added to the dictionary.
Similarly, “muşuroi de cârtiţă” has been found as a repeated segment in the Romanian part of
the bitext. Since “muşuroi” was translated in the corresponding TU as “molehill” and “cârtiţă”
had no translation in the dictionary, the new entry (muşuroi_de_cârtiţă molehill) was added to
the dictionary. This simple procedure managed to recover 62 partial translations and improve
another 12 (still partial, but better). An example of an improved partial translation is "poziţie de
drepţi" = "attention", which started as "drept" = "attention" and should ideally have become
"poziţie de drepţi" = "stand to attention" | "spring to attention" | "call to attention".
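The first step of this recovery procedure, collecting the repeated segments, can be sketched as follows (illustrative Python; the upper bound on segment length, the pattern check on the tags and the lookup of head-noun translations are assumptions left to subsequent steps):

    from collections import Counter

    def repeated_segments(tagged_sentences, min_len=2, max_len=4):
        """Return the token sequences of length min_len..max_len occurring more
        than once in one half of the bitext ("repeated segments" analysis).
        Each sentence is a list of (word, tag) pairs, so that collocation
        patterns such as <head-noun (functional_word) modifier> can later be
        checked on the tags of the retained segments."""
        counts = Counter()
        for sentence in tagged_sentences:
            for n in range(min_len, max_len + 1):
                for i in range(len(sentence) - n + 1):
                    counts[tuple(sentence[i:i + n])] += 1
        return [list(segment) for segment, freq in counts.items() if freq > 1]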
7 Failure analysis
Any statistical word alignment method is confronted with two fundamental problems: some
tokens are erroneously associated and some valid associations are missed. There are various
reasons for both, and in this section we discuss our findings with respect to our specific
bitext.
The BETA extraction algorithm did not find translations for 892 Romanian noun lemmas.
Out of these, 47 occurred 3 or more times, 102 exactly 2 times and 743 occurred only once.
As shown in the presentation of our algorithm, the initial phase builds a search space for the
translation equivalence pairs. As this space is in general very large, it has to be filtered one
way or another. We used the log-likelihood score [Dunning, 1993], [Melamed, 1997],
removing all the pairs scoring below 9. However, besides throwing away a large number
of noisy candidates, some correct pairs were lost as well. This filtering was responsible for
about 60% of the missed correct translation pairs (the vast majority of them being hapax
pairs, translating secondary meanings of quite frequent words). Working with a much larger
corpus might decrease the influence of this factor.
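A minimal sketch of this filtering step is given below (illustrative Python, assuming the score is computed from a 2×2 contingency table of TU-level co-occurrence counts, which may differ in detail from the counts actually used):

    import math

    def loglikelihood(n_st, n_s, n_t, n):
        """Log-likelihood association score [Dunning, 1993] for a candidate pair:
        n_st = TUs containing both the source and the target type,
        n_s, n_t = TUs containing each type, n = total number of TUs."""
        table = [[n_st, n_s - n_st],
                 [n_t - n_st, n - n_s - n_t + n_st]]
        col_sums = [table[0][0] + table[1][0], table[0][1] + table[1][1]]
        score = 0.0
        for i in range(2):
            row_sum = table[i][0] + table[i][1]
            for j in range(2):
                observed = table[i][j]
                expected = row_sum * col_sums[j] / float(n)
                if observed > 0 and expected > 0:
                    score += observed * math.log(observed / expected)
        return 2.0 * score

    def keep_pair(n_st, n_s, n_t, n, threshold=9.0):
        # pairs scoring below the threshold are removed from the search space
        return loglikelihood(n_st, n_s, n_t, n) >= threshold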
We found 20 English sentences (192 lemmas) not translated in Romanian, out of which 85
lemmas appeared in no other part of the novel. We also noticed that many erroneous translation
pairs were extracted from very long TUs. The explanation is that long TUs produce a high
level of noise for the way we computed the list of candidates. Because of the alignment
problems (errors, missing translations and long TUs) the recall was affected by about 6% (1%
direct influence and about 5% indirect influence due to the noise).
Tagging errors (about 0.5% in the Romanian part and about 2.8% in the English part) were
responsible for about 22% of the missed correct translations.
Many missed translations are explained by the idiosyncrasies of human translation as well
as by the different nature of the language pairs considered. Being a literary translation,
several words in the original were paraphrased and some words were translated differently
(by synonyms). In many cases words in one language were translated in the other by words
of a different part of speech (from the algorithm's point of view this is identical to a tagging
error). A few words were wrongly translated and some others were simply ignored. For
instance, out of the 47 Romanian lemmas occurring more than twice in the text and missed
by the extraction algorithm, 43 are due to one of these causes. Altogether, the translator was
deemed responsible for 12% of the missed translations.
8 Implementation
The extraction programs, both BASE and BETA, are written in Perl and run on
practically any platform (Perl implementations exist not only for UNIX/Linux but also for
Windows and MacOS). Figure 19 shows the BASE running time for each bitext in the
"1984" parallel corpus (all POS considered). Figure 20 shows the running time for the
extraction of the Romanian-English noun dictionary (Linux on a Pentium III/600 MHz with
96 MB RAM) for BASE and BETA.
Bitext                   Bg-En   Cz-En   Et-En   Hu-En   Ro-En (4 steps)   Ro-En (28 steps)   Si-En
Extraction time (sec)    181     148     139     220     183               415                157

Figure 19: BASE extraction time for each of the bilingual lexicons (all POS)
Algorithm          Extraction time (s)
BASE (10 steps)    105
BETA               232

Figure 20: BASE and BETA extraction time for the Romanian-English noun dictionary
A quite similar approach to our BASE algorithm (also implemented in Perl) is presented in
(Ahrenberg et al., 2000): for a novel of about half the length of Orwell's "1984", their
algorithm needed 55 minutes on an UltraSparc1 workstation with 320 MB RAM. They used a
frequency threshold of 3 and the best results reported are 92.5% precision and 54.6% recall
(our Rec*). For a computer manual containing about 45% more tokens than our corpus, their
algorithm needed 4.5 hours, with the best results being 74.94% precision and 67.3% recall
(Rec*).
The BETA algorithm is closer to Melamed's extractor, although our program is greedier
and never returns to a visited translation unit. (Melamed, 2000) does not provide information
on extraction times, which we suspect to be higher than in our case.
9 Conclusions and further work
We presented two simple but very effective algorithms for extracting bilingual lexicons,
based on a 1:1 mapping hypothesis. We showed that when a language-specific tokenizer is
responsible for pre-processing the input to the extractor, the 1:1 mapping approach is no
longer an important limitation. Incompleteness of the segmenter's resources may be
compensated for by a post-processing phase that recovers partial translations. We showed that
such a recovery phase can successfully take advantage of the already extracted entries.
Future plans include experiments and evaluations on a trilingual parallel corpus (English-French-Romanian)
of legal texts and two bilingual corpora (English-Romanian and French-Romanian) made of software manuals.
We have strong reasons to believe that with more literal translations and for more closely
related languages, the quality and coverage of the extracted bilingual dictionaries will be
superior to those reported here.
Acknowledgements
The research reported here started as an AUPELF/UREF co-operation project with
LIMSI/CNRS (CADBFR) and used the multilingual corpus and multilingual lexical resources
developed within the MULTEXT-EAST, TELRI and CONCEDE European projects.
Special thanks are due to Heiki Kaalep, Csaba Oravecz, and Tomaž Erjavec for the validation
of the Et-En, Hu-En and Si-En extracted dictionaries.
References
Ahrenberg, L., M. Andersson, M. Merkel. (2000). "A knowledge-lite approach to word
alignment", in (Véronis, 2000: 97-116)
Ahrenberg, L., Andersson, M. & Merkel, M. (1998) "A Simple Hybrid Aligner for
Generating Lexical Correspondences in Parallel Texts". In Proceedings of COLING'98,
Montreal, Canada
Brants, T. (2000). "TnT – A Statistical Part-of-Speech Tagger", in Proceedings of the Sixth
Applied Natural Language Processing Conference, ANLP-2000, April 29 – May 3, 2000,
Seattle, WA
Brew, C., McKelvie, D., (1996). "Word-pair extraction for lexicography",
http://www.ltg.ed.ac.uk/~chrisbr/papers/nemplap96
Brown, P. F., S. A. Della Pietra, V. J. Della Pietra, and R. L. Mercer (1993). "The
mathematics of statistical machine translation: parameter estimation", in Computational
Linguistics, 19(2): 263-311.
Dunning, T. (1993). "Accurate Methods for the Statistics of Surprise and Coincidence", in
Computational Linguistics, 19(1): 61-74
Dimitrova, L., T. Erjavec, N. Ide, H. Kaalep, V. Petkevic and D. Tufiş, (1998). "Multext-East:
Parallel and Comparable Corpora and Lexicons for Six Central and East European
Languages", in Proceedings of the 36th Annual Meeting of the ACL and 17th COLING
International Conference, Montreal, Canada, 315-319.
Gale, W.A. and K.W. Church, (1991). "Identifying word correspondences in parallel
texts". In Fourth DARPA Workshop on Speech and Natural Language, pp. 152-157
Gale, W.A. and K.W. Church, (1993). "A Program for Aligning Sentences in Bilingual
Corpora". In Computational Linguistics, 19(1), pp. 75-102
Ide, N., Véronis, J. (1995). "Encoding dictionaries". In Ide, N., Véronis, J.
(Eds.) The Text Encoding Initiative: Background and Context. Dordrecht:
Kluwer Academic Publishers, 167-180.
Erjavec, T., Monachini, M., (eds.), (1997). "Specifications and Notation for Lexicon
Encoding". MULTEXT-East Final Report, D1.1, December 1997.
Erjavec, T., Lawson, A., Romary, L. (1998). East Meets West: A Compendium of
Multilingual Resources. TELRI-MULTEXT EAST CD-ROM, 1998, ISBN: 3-922641-46-6.
Erjavec, T., Ide, N., (1998). "The Multext-East corpus". In Proceedings of the First
International Conference on Language Resources and Evaluation, Granada, Spain, 1998,
pp. 971-974
Hiemstra, D., (1997). "Deriving a bilingual lexicon for cross language information
retrieval". In Proceedings of Gronics, 21-26
Kay, M., Röscheisen, M., (1993). "Text-Translation Alignment". In Computational
Linguistics, 19(1), 121-142
Kupiec, J., (1993). "An algorithm for finding noun phrase correspondences in bilingual
corpora". In Proceedings of the 31st Annual Meeting of the Association for Computational
Linguistics, 17-22
Melamed, D., (2000). "Word-to-Word Models of Translational Equivalence". In
Computational Linguistics, 26, 34 pages
Melamed, D., (1998). “Empirical Methods for MT Lexicon Development”. In Proceedings
of AMTA
Melamed, D., (1997). "A Word-to-Word Model of Translational Equivalence". In
Proceedings of the 35th Conference of the Association for Computational Linguistics, Madrid,
Spain
Melamed, D., (1996). "Automatic Construction of Clean Broad-Coverage Translation
Lexicons". In Proceedings of AMTA
Smadja, F., (1993). "Retrieving Collocations from Text: Xtract". In Computational
Linguistics, 19(1), 142-177
Smadja, F., K.R. McKeown, and V. Hatzivassiloglou, (1996). "Translating collocations for
bilingual lexicons: A statistical approach". Computational Linguistics, 22(1), 1-38
Tiedemann, J., (1998). “Extraction of Translation Equivalents from Parallel Corpora”, In
Proceedings of the 11th Nordic Conference on Computational Linguistics, Center for
Sprogteknologi, Copenhagen, 1998, http://stp.ling.uu.se/~joerg/
Tufiş, D., Barbu, A.M., Pătraşcu, V., Rotariu, G., Popescu, C., (1997). "Corpora and
Corpus-Based Morpho-Lexical Processing", in D. Tufiş, P. Andersen (eds.) "Recent
Advances in Romanian Language Technology", Editura Academiei, 1997, pp. 35-56
Tufiş, D., Mason, O., (1998). “Tagging Romanian Texts: a Case Study for QTAG, a
Language Independent Probabilistic Tagger”. In Proceedings of the First International
Conference on Language Resources & Evaluation (LREC), Granada, Spain, 28-30 May 1998,
pp.589-596.
Tufiş, D., (1999). “Tiered Tagging and Combined Classifiers” In F. Jelinek, E. Nöth (eds)
Text, Speech and Dialogue, Lecture Notes in Artificial Intelligence 1692, Springer, 1999, pp.
29-33
Tufiş, D., Ide, N. Erjavec, T., (1998). “Standardized Specifications, Development and
Assessment of Large Morpho-Lexical Resources for Six Central and Eastern European
Languages”. In Proceedings of First International Conference on Language Resources and
Evaluation (LREC), Granada, Spain, 1998, pp. 233-240
Tufiş, D., (2000). "Using a Large Set of Eagles-compliant Morpho-Syntactic Descriptors as
a Tagset for Probabilistic Tagging". In Proceedings of the Second International Conference on
Language Resources and Evaluation (LREC), Athens, Greece, 2000, pp. 1105-1112
Tufiş, D., Dienes, P., Oravecz, C., Váradi, T., (2000). "Principled Hidden Tagset Design for
Tiered Tagging of Hungarian". In Proceedings of the Second International Conference on
Language Resources and Evaluation (LREC), Athens, May 2000
Tufiş, D., Barbu, A.M., (2001). “Extracting multilingual lexicons from parallel corpora” in
Proceedings of the ACH/ALLC 2001, New York University 13-16 June, 2001
Tufiş, D., (2001a). "Building an ontology from a large Romanian dictionary of synonyms
by importing Wordnet relations”, RACAI Research report, June, 2001.
Tufiş, D., (2001b). “Partial translations recovery in a 1:1 word-alignment approach”,
RACAI Research report, June, 2001.
Véronis, J. (ed), (2000). Parallel Text Processing. Text, Speech and Language Technology
Series, Kluwer Academic Publishers Vol. 13, 2000
Zipf, G.K., (1936). “The Psycho-biology of Language: an Introduction to Dynamic
Philology”. Routledge, London, UK