A Multi-view Approach for
Term Translation Spotting
Raphaël Rubino and Georges Linarès
Laboratoire Informatique d’Avignon
339, chemin des Meinajaries, BP 91228
84911 Avignon Cedex 9, France
{raphael.rubino,georges.linares}@univ-avignon.fr
Abstract. This paper presents a multi-view approach for term translation spotting, based on a bilingual lexicon and comparable corpora. We
propose to study different levels of representation for a term: the context, the theme and the orthography. These three approaches are studied
individually and combined in order to rank translation candidates. We
focus our task on French-English medical terms. Experiments show a
significant improvement over the classical context-based approach, with an
F-score of 40.3 % for the first-ranked translation candidates.
Keywords: Multilingualism, Comparable Corpora, Topic Model.
1 Introduction
Bilingual term spotting is a popular task which can be used for bilingual lexicon construction. This kind of resource is particularly useful in many Natural Language Processing (NLP) tasks, for example in cross-lingual information retrieval or Statistical Machine Translation (SMT). Some works in the literature are based on the use of bilingual parallel texts, which are often used in SMT for building translation tables [1,2]. However, the lack of parallel texts is still an issue, and the NLP community increasingly turns to another bilingual resource in order to build bilingual lexicons: bilingual comparable corpora.
One of the main approaches using non-parallel corpora is based on the assumption that a term and its translation share context similarities. It can be seen as a co-occurrence or context-vector model, which depends on the lexical environment of terms [3,4]. This approach relies on a bilingual lexicon, also known as bilingual seed-words. These words are used as anchor points in the source and the target language. This representation of the environment of a term has to be invariant from one language to another in order to spot correct translations. The efficiency of this approach depends on the accuracy of the context-vectors. Authors have studied different association measures between terms, variations of the context size, and similarity metrics between context-vectors [5,6,7].
In addition to context information, heuristics are often used to improve the
general accuracy of the context-vector approach, like orthographic similarities
between the source and the target terms [8]. Cognate-based techniques are popular in bilingual term spotting, in particular for specific domains. This can be explained by the large amount of transliteration, even between unrelated languages. Moreover, related languages such as Romance languages can share similarities between a term and its translation, like identical lemmas. We refer to this particularity as cross-language cognates.
However, a standard context-vector approach combined with graphic features cannot handle polysemy and synonymy [9]. This limit can be overcome by the introduction of semantic information. This is precisely the investigation presented in this paper: the combination of context, topic and graphic features for bilingual term spotting with comparable corpora. Each feature is a different view of a term. In a first step, we study each feature individually. Then, we combine these features in order to increase the confidence in the spotted candidates. We focus on spotting English translations of French terms from the medical domain. We use the word term in its terminological sense: a single or multi-word expression with a unique meaning in a given domain.
For the context-based approach, we want to tackle the context limitation issue,
capturing information in a local and in a global context. We assume that some
terms can be spotted using a close context, while other terms are characterized
by a distant one.
For the topic feature, we want to handle polysemous terms and synonyms. We assume that a term and its translation share similarities among topics. The comparable corpora are modeled in a topic space, in order to represent context-vectors in different latent semantic themes. One of the most popular methods for the semantic representation of a corpus is the so-called topic model. Topic models are widely used for statistical analysis of discrete data in large document collections. Data decomposition into latent components has emerged as a useful technique in several tasks, such as information retrieval or text classification. It can be applied to many kinds of documents: scientific abstracts, news articles, social network data, etc. Latent Dirichlet Allocation (LDA) [10] fits our needs: a semantics-based bag-of-words representation and unrelated dimensions (one dimension per topic).
Finally for the cognate approach, we investigate the efficiency of the classic
Levenshtein distance [11] between source language and target language terms.
In order to increase the general precision of our system, a vote is used to
combine the results of the three features. These three views form our multi-view approach for term translation spotting. To the best of our knowledge, there
is no study combining these particular features for this task.
The remainder of this paper is organized as follows: we describe the context-based approach and its variants in the next section. The topic model approach is introduced in Section 3. The use of cognate terms, or orthographic features, is detailed in Section 4. We then present the experimental framework in Section 5, followed by the results obtained with several configurations. Finally, the results are discussed in the last section.
2 Context-Based Approach
Bilingual comparable corpora are sets of non-parallel texts that have common topics and are written independently in each language. In comparable corpora, a term and its translation share similarities through the vocabulary surrounding them. Based on this assumption, different techniques have been presented in the literature. One of the first studies addressed bilingual co-occurrence pattern matching [3]. The context-vector is introduced in [12], relying on a bilingual lexicon. With the same philosophy, other works rely on a thesaurus [13]. This approach is the basis of the work presented in this paper. It is also possible to induce a seed-words lexicon from monolingual sources, as described in [14].
Other studies focused on the association measures between a term and its context in order to build the most accurate context-vector. Some of the popular association measures used in the literature are mutual information, log-likelihood and the odds ratio [15,16,17]. Once the context-vectors are built, a similarity metric is used to rank the target terms according to the source term. For the bilingual term spotting task, similarity metrics between vectors have been studied, like the city-block metric or the cosine distance [12,18]. Among the large number of studies on association measures and similarity metrics, the chosen technique depends on the task: the domain-specific terms to translate, the sizes of the corpora in each language, the number of words and their translations in the seed-words lexicon, etc. Several combinations were studied in [7], and the most efficient configuration on their task was the odds ratio with the cosine distance. In our study, we implement a system based on this latter work, which stands as our baseline. The
odds ratio, $odds = \frac{\tau_{11}\,\tau_{22}}{\tau_{12}\,\tau_{21}}$, is a coefficient of association strength which can be computed on a 2×2 contingency table. In our case, we want to compute the association strength between a candidate (in the source or in the target language)
and words of the seed-words lexicon. The four elements of the contingency table are the observation (or absence) of the two terms within a given window. The most common practice is to use a sliding window around the term to translate; the size of the window limits the context. It can be fixed, 10 words for instance, or dynamic, like a sentence or a paragraph. One of the parameters we emphasize in
this paper is precisely the size of the context. Our implementation is built to modify the size of the sliding window used to count word co-occurrences. The general architecture of the context-vector approach is shown in Fig. 1. We use different window sizes because we assume that a term can be characterized by a close context or by a distant one.

Fig. 1. Architecture of the context-based approach for term translation spotting
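As an illustration, the following minimal Python sketch implements this counting scheme under simplified assumptions: the corpus is a single in-memory token list, the contingency cells τ21 and τ22 are approximated from global counts, and a small constant smooths empty cells (the actual experiments count co-occurrences through a Lemur index instead, cf. Sect. 5).

from collections import Counter
from math import sqrt

def cooccurrence_counts(tokens, term, window):
    """For each word, count the windows around `term` that contain it."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == term:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(set(tokens[lo:i] + tokens[i + 1:hi]))
    return counts

def odds_ratio(t11, t12, t21, t22, eps=0.5):
    """Odds ratio over a 2x2 contingency table; eps smooths empty cells."""
    return ((t11 + eps) * (t22 + eps)) / ((t12 + eps) * (t21 + eps))

def context_vector(tokens, term, seed_words, window):
    """One odds-ratio coordinate per seed word (t21 and t22 approximated)."""
    n_win = tokens.count(term)              # number of windows around term
    cooc = cooccurrence_counts(tokens, term, window)
    totals = Counter(tokens)
    vec = {}
    for w in seed_words:
        t11 = cooc[w]                       # term and w observed together
        t12 = n_win - t11                   # term observed without w
        t21 = totals[w] - t11               # w observed without term
        t22 = max(len(tokens) - t11 - t12 - t21, 0)
        vec[w] = odds_ratio(t11, t12, t21, t22)
    return vec

def cosine(u, v):
    """Similarity used to rank target context-vectors against the source one."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0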
3 Topic Model
The main idea of topic modeling is to produce a set of artificial dimensions from a collection of documents. It is based on the assumption that documents are mixtures of topics, and that a topic is a probability distribution over words. One of the popular approaches used in automatic indexing and information retrieval is Latent Semantic Analysis (LSA, or LSI for Latent Semantic Indexing) [19]. This method is based on a term-document matrix which describes the occurrences of terms in documents. This high-dimensional sparse matrix is reduced by Singular Value Decomposition (SVD). In [20], Probabilistic LSA (PLSA) is introduced to give a robust statistical foundation to LSA. Based on the likelihood principle, a mixture decomposition is derived from a latent class model. With this technique, the order of the documents is taken into account. This model is introduced for bilingual term spotting in [9] to handle polysemy and synonymy, as a Bilingual PLSA approach. For better exchangeability, we decide to use Latent Dirichlet Allocation (LDA), first introduced in [10]. The general principle of LDA stands on the computation of the multinomial probability $p(w_n \mid z_n, \beta)$, conditioned on the topic $z_n$, for the $N$ words $w_n$ of a document in a collection of $M$ documents.
A multilingual LDA is introduced in [21] to mine topics from Wikipedia. The authors build a comparable corpus from aligned documents (articles) and use a modified LDA to model this corpus. However, in this latter work, multilingual topic alignment relies on links among documents. This kind of resource is not taken into account in our study.
We want to obtain the distribution over topics of the bilingual lexicon used for
the context-based approach. First, we filter the source language corpus with this
bilingual lexicon. The resulting corpus contains only the vocabulary from the
bilingual lexicon. Second, this reduced corpus is modeled in a latent topic space.
The output model is then translated in the target language with the bilingual
lexicon. Our aim is to select the most pertinent topics for candidates in the
source and in the target language. The distance computation between a topic
and a term is explained in (1). Basically, for each topic, a distance between the term and each word of the topic is computed. We keep the odds ratio for this step. The distance is then weighted by the probability of observing the word in this topic. The sum of all the weighted distances within a topic is the term-topic association score (see Fig. 2).
$d(term, z) = \sum_{n} p(w_n \mid z_n, \beta) \cdot odds(term, w_n)$ .   (1)
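A direct transcription of (1) in Python, assuming a topic is given as a hypothetical mapping from its words to their in-topic probabilities $p(w_n \mid z_n, \beta)$, and odds is the association function of Sect. 2:

def term_topic_distance(term, topic_word_probs, odds):
    """Sum of in-topic word probabilities weighted by the odds ratio, as in (1)."""
    return sum(p * odds(term, w) for w, p in topic_word_probs.items())

def top_topics(term, topics, odds, k=3):
    """Rank the model's topics by their association score with `term`."""
    scores = [(z, term_topic_distance(term, words, odds))
              for z, words in topics.items()]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:k]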
We assume that this projection method leads to an important issue: the representation $p(w_n \mid z_n, \beta)$ of a word $w_n$ in the target language is not re-estimated.

Fig. 2. Representation of the context of a term in different topics

This naïve projection of the topic space does not reflect the reality in the target
language. Indeed, the topic alignment of two separately built latent spaces would be the ideal solution. However, the use of comparable corpora can limit the distortion effects on the word-over-topic distributions. Furthermore, we combine the weight of each word in a topic with a lexical distance measure to clear this hurdle, keeping in mind that this imprecise technique needs to be improved.
An example of bilingual topic alignment is proposed in [22]. The authors introduce a multilingual topic model for unaligned text, designed for non-parallel corpora processing. This model can be used to find and match topics in, at least, two languages. They focus on building consistent topic spaces in English and German, based on the matched vocabulary in both languages.
4 Cognate Approach
As described in [14], related languages like German and English share orthographic similarities which can be used for bilingual term spotting. Often called cognates, these particular terms can be compared with several techniques, the most popular one being the edit distance (or Levenshtein distance).
In order to handle suffix or prefix transformations between languages, some researchers use transformation rules or limit the comparison to word roots. This approach works when the languages concerned by term spotting are related; experiments with Romance languages often yield good results.
A model introduced in [8] uses a Romanization system for Chinese (as the target language) and maps the source (English) and the target language letters of two terms. This technique allows comparing the spelling of unrelated languages.
In our experiments on French and English terms, we use the classic Levenshtein distance between two bilingual terms. We compute the distance between the first four letters of the two terms and between the whole terms.
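For completeness, a standard dynamic-programming implementation of this distance, applied as in the text to the first four letters and to the whole terms (the helper names are ours):

def levenshtein(a, b):
    """Minimal edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cognate_scores(src, tgt):
    """Distances on the 4-letter prefixes and on the full terms."""
    return levenshtein(src[:4], tgt[:4]), levenshtein(src, tgt)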
5 Experiments
In order to run our experiments, we need three bilingual resources: comparable corpora, a seed-words lexicon and a list of candidates to translate with their translation references. We use an indexing toolkit to facilitate the statistical processing of the large text dumps.
Our experimental framework is based on the English and French Wikipedia
dumps as comparable corpora. The Wikipedia dumps are free and accessible
on the dedicated download page1. We refer to a Wikipedia dump as all the articles in one language: an XML file containing every article in the selected language. A dump contains text, but also special data and syntax (images, internal links, etc.) which are not relevant to our experiments. We use the trectext format for the article collection, removing all Wikipedia tags. A
stop-word list is used to filter the textual content of the articles. Table 1 contains
the details about the comparable corpora. The context-vector approach leads to
better results with a large amount of comparable texts. In this paper, we use
the Lemur Toolkit 2 to index the Wikipedia dumps. After this step, counting
co-occurrences of two terms in a fixed window can be done in one query. The
candidates to translate are extracted from the Medical Subject Heading (MeSH)
thesaurus3 along with their translation references (one translation per source
term) [23]. We use the same bilingual lexicon and the same candidates as in [7] to
be able to compare our results to this baseline. The bilingual seed-words lexicon
is taken from the Heymans Institute of Pharmacology's Multilingual glossary of technical and popular medical terms4.
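As an illustration of this preparation step, a minimal sketch that wraps cleaned articles in the TRECTEXT format consumed by the indexer; tag stripping and iteration over the dump are assumed to happen upstream:

def to_trectext(doc_id, text):
    """Wrap one cleaned article in a TRECTEXT document element."""
    return ("<DOC>\n<DOCNO>{}</DOCNO>\n<TEXT>\n{}\n</TEXT>\n</DOC>\n"
            .format(doc_id, text))

def write_collection(articles, path, stop_words):
    """`articles` yields (id, cleaned_text) pairs extracted from a dump."""
    with open(path, "w", encoding="utf-8") as out:
        for doc_id, text in articles:
            kept = " ".join(w for w in text.split() if w not in stop_words)
            out.write(to_trectext(doc_id, kept))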
Table 1. Details about the bilingual resources used in our experiments

corpus          documents    tokens         unique tokens
Wikipedia FR      872,111    118,019,979        3,994,040
Wikipedia EN    3,223,790    409,792,870       14,059,292

candidates     3,000
seed-words     9,000
For the semantic part of our framework, we use an implementation of LDA
based on the Gibbs Sampling technique5 . We build different models with variations on the number of topics (from 20 to 200 topics). Each model is estimated
with 2,000 iterations.
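For reproduction without the Gibbs sampling tool, a minimal sketch using gensim's LdaModel as a stand-in (gensim relies on variational inference rather than Gibbs sampling, so the estimated models will differ slightly); filtered_docs is the seed-word-filtered corpus of Sect. 3, given as lists of tokens:

from gensim import corpora, models

def train_topic_model(filtered_docs, num_topics):
    """Estimate an LDA model on the lexicon-filtered source corpus."""
    dictionary = corpora.Dictionary(filtered_docs)
    bow = [dictionary.doc2bow(doc) for doc in filtered_docs]
    return models.LdaModel(bow, id2word=dictionary, num_topics=num_topics)

# One model per dimensionality tested in the experiments:
# lda_models = {k: train_topic_model(docs, k) for k in (20, 50, 100, 200)}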
1 http://download.wikimedia.org/backup-index.html
2 http://www.lemurproject.org
3 http://www.nlm.nih.gov/mesh/
4 http://users.ugent.be/~rvdstich/eugloss/welcome.html
5 http://gibbslda.sourceforge.net

We want to study the different approaches separately. For each set of experiments, 3,000 translations have to be spotted. The result of each approach is a ranked list of the translation candidates; the first-ranked candidate is the best
translation according to the system. We observe the position of the translation
reference and report the accuracy of each approach. Then we combine the results
of the different views by a vote. We want to see if a correct translation is ranked
first by the majority of the judges. The main advantage of this method is the ability to reach a very high precision when many judges are combined, but the recall may be low. We assume that the complementarity of the three different views can increase the recall, while the number of judges maintains a high precision. A minimal sketch of this vote is given below.
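The sketch assumes each judge contributes its first-ranked candidate for a given source term:

from collections import Counter

def vote(first_ranked, unanimity=False):
    """Return the spotted translation, or None if the vote is inconclusive."""
    candidate, count = Counter(first_ranked).most_common(1)[0]
    needed = len(first_ranked) if unanimity else len(first_ranked) // 2 + 1
    return candidate if count >= needed else None

# vote(["cell", "cell", "unit"]) -> "cell"; vote(["cell", "unit"]) -> None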
5.1 Context-Based Approach
First, we build one context-vector for each term. We vary the window size in order to capture different context information. Then, each target context-vector is ranked by cosine similarity to the source one. We compare source and target vectors built with the same parameters. We compute the recall scores on the first-ranked target candidates for each window size. We also present the recall scores for terms spotted with one window size and not spotted by the
others. Table 2 contains the results for each window size individually.

Table 2. Recall for the context-based approach from rank 1 to 100 with different sliding window sizes. The unique1 row shows the number of correct translations spotted at rank 1 by the current window size and not by the other window sizes.

rank        win 10   win 20   win 30   win 40   win doc
1            31.1     32.9     33.7     32.4     15.6
10           57.6     59.6     60.6     58.6     37.7
50           69.3     71.8     72.6     71.8     54.6
100          73.4     76.0     76.9     76.6     61.5
unique1       4.5      1.7      1.5      1.7      3.1

The recall
scores are relatively close for windows of between 20 and 40 words. The best configuration in our context-based approach is a 30-word window, reaching a recall score of 33.7 % for the first-ranked target language candidate. These results show that a term can be characterized by a close context (a small window) or a global one. Some of the source terms to translate are locally linked to a vocabulary, since their translations are spotted using a 10-word window, and this characteristic is invariant in the target language. For these terms, a larger window introduces noise and fails to spot the correct translation. This is the main motivation for the vote combination of the context-based approach with different window sizes, presented in Table 3.
In this experiment, the recall score for the combination of all window sizes is low, because only a small number of candidates are ranked first by all 5 judges. However, the precision score is relatively high. The best F-score is reached by the combination of the 20- and 30-word windows. As expected, the highest precision is reached with the combination of all window sizes.
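As a reminder, the F-score reported throughout is the harmonic mean of precision $P$ and recall $R$; for instance, for the win20+win30 combination of Table 3:

$F = \frac{2PR}{P+R} = \frac{2 \times 50.5 \times 27.8}{50.5 + 27.8} \approx 35.8$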
Table 3. Scores for the vote combination on the context-based approach, with 2 and 3 sizes combined, and all the window sizes, at rank 1. Scores for a 30-word window are also detailed.

            win30   win20+win30   win10+win30+win40   all sizes
recall       33.7       27.8             19.2             6.2
precision    33.7       50.5             75.9            83.7
F-score      33.7       35.8             30.6            11.5

5.2 Topic-Based Approach
With the topic model approach, we want to see if a semantic space can be used to
filter the terms among the 3,000 English translations. For each English or French
term of the candidate list, we measure the distance with each topic (semantic
class) of the model. Basically, we want to see if a source term and its translation
reference are at the same place in the topic space. This feature can be useful
to filter target terms which are not on the same topics as a source term. For a
source term, we check which target terms share the same topics, according to an
ordered topic distance table. If a term to translate and its translation reference
share the same first topics, it increases a recall score (the resulting recall is then
divided by the number of candidates). A recall score of 100 means that each source term and its translation reference share the same top-ranked topic. We measure this for the three top-ranked topics of each target term and present the
results in Table 4. We also measure the precision of this approach. The precision
decreases if there are target terms which share similar topics with the source
term but are not the translation reference. A precision score of 0.0 indicates
that all translation candidates are returned by the system for one source term.
Our semantic-based approach to bilingual term spotting can be seen as a validation step in our experiments. The translation candidates are ranked according to their similarity with a source term in the topic space. In a 50-dimension topic space, a third of the candidates to translate have their reference at the first rank. This score decreases when we increase the number of dimensions. This variation impacts the precision, which indicates that incorrect target terms are returned by the system. The models used in these experiments are not adapted to specific domains: all medical terms are close to a few topics, which is why a large number of source and target language terms occupy the same place in the topic space.

Table 4. Scores for the topic-based approach. The first three topics of a target term are presented. Several models are tested: with 20, 50, 100 and 200 dimensions.

topic  metric        20      50     100     200
1      recall       23.6    33.4    33.0    10.4
       precision    0.12    0.06    0.07    0.18
       F-score      0.23    0.12    0.14    0.35
2      recall       35.9    42.2    38.3    18.8
       precision    0.1     0.06    0.07    0.15
       F-score      0.19    0.12    0.14    0.31
3      recall       44.1    45.9    41.2    24.1
       precision    0.08    0.06    0.07    0.14
       F-score      0.16    0.12    0.13    0.28
5.3 Cognate Approach
The Levenshtein distance is used to rank the translation candidates according to a source language term. We measure the recall of this approach at several ranks, computing the distance either between the first four letters of the two terms or between the full terms. Results are presented in Table 5. For each rank, if the translation reference is found, the recall score increases. We compute the precision score according to the number of target terms at the current rank which are not the translation reference.
Table 5. Recall and precision scores for the cognate approach at different ranks, from 1 to 10. Two edit distances are tested: between the first 4 letters and between the full terms.

rank                  1      2      3      4      5     10
letters  recall      34.0   39.0   45.9   65.4   100    100
         precision   15.9    3.9    0.5    0.1   0.03   0.03
         F-score     21.7    7.2    1.0    0.2   0.07   0.07
term     recall      50.7   54.8   59.6   67.4   77.3   99.3
         precision   83.5   29.6    5.6    1.4    0.5    0.2
         F-score     63.1   38.5   10.3    2.8    1.0    0.3
We can see that a classic Levenshtein distance finds 50.7 % of the correct translations at the first rank with a precision score of 83.5 %. This result is not surprising in our case, because English and French domain-specific terms are likely to have common roots. Taking the ten first-ranked candidates, the cognate approach yields a 99 % recall, but the large number of wrong target terms spotted decreases the precision. However, this feature can be added to the context-based method in order to re-rank the translation candidates.
5.4 Combination
We present the combination of the three views in Table 6. Three combinations are studied: two pairwise combinations (context and topic, or context and cognate), and the three approaches together. The context-based approach combined with the cognate approach is the baseline. Each feature, or view, is a judge in a vote combination. We propose two different vote configurations: the majority or the unanimity of the judges determines which translation candidate is ranked first. If neither vote configuration is fulfilled, the system is not able to spot a target language term among the candidates. We include the context-based
approach either as five different judges with all window sizes or as one judge with one fixed window size, and show the results for both. A 30-word window is chosen for the fixed window size.

Table 6. Scores at rank 1 for the combination of our approaches. Unanimity is indicated in brackets when there are more than 2 judges. The context feature is noted cont, the cognate feature is noted cogn.

       metric       cont+topic    cont+cogn     cont+topic+cogn
win30  recall       13.9          19.0          24.2 (7.6)
       precision    100           99.1          99.3 (100)
       F-score      24.4          31.9          39.0 (14.1)
all    recall       20.8 (2.6)    21.1 (4.0)    26.9 (1.7)
       precision    76.2 (100)    76.6 (97.6)   80.4 (100)
       F-score      32.7 (5.0)    33.1 (7.7)    40.3 (3.4)
The context-vector approach combined with the Levenshtein distance yields a recall of 19 % with a precision score of 99.1 %. Using all window sizes leads to a slight improvement of the recall score (21.1 %), but the precision decreases under the majority vote. Requiring unanimity in this configuration completely degrades the recall (4 %), showing that the overlap between the judges' individual first-ranked candidates is very low. However, our system then spots only correct translations: there are no incorrect target language terms spotted when the precision reaches 100 %.
Two main aspects of the multi-view approach stand out from these results. The first is the best F-score, obtained with the combination of all views: this is the best configuration for an acceptable recall and a relatively high precision. The second is the high precision reached by the combination of the three approaches with a fixed window size for the context feature; its recall is 2.7 % lower than with the combination of all judges.
As reported in Sect. 5.2 for the best configuration, the precision of the topic-based approach is very low, because the topic model is not adapted to the medical domain. We do not include this precision in the combination results involving the topic feature, because we want to investigate the general coverage of our approach in this configuration. The semantic information is used to validate a translation candidate ranked first by the majority of the other approaches.
6 Discussion
We presented in this paper a multi-view approach for medical term translation spotting based on comparable corpora and a bilingual seed-words lexicon. The context-vector approach is combined with a topic-based model in order to handle polysemy and synonymy. We also include an orthographic feature to boost our results. The combination of the 3 approaches yields a recall score of 26.9 % with a precision of 80.4 %. These results show the complementarity of the three features. Compared to a state-of-the-art approach combining the context and the cognate approaches, our multi-view approach leads to a 12.6 % absolute improvement of the F-score.
In future work, we plan to improve the topic model for the target language. Instead of translating the source language topic model into the target language, the comparable corpora can be used to compute the word-over-topic distributions. Another possible improvement on the topic representation is to select a subset of the comparable corpora in order to model the domain-specific part only. This technique may lead to a finer-grained topic model.
In the experiments presented in this paper, each word of the bilingual lexicon is used to build the context-vector of a source term. The context-based approach could be improved with a more precise study of ambiguous seed-words. This technique would reduce the context-vector to the most discriminant dimensions for a term to translate.
Acknowledgements
This research has been partly funded by the French National Research Agency (ANR), AVISON project (ANR-007-014).
References
1. Brown, P., Della Pietra, S., Della Pietra, V., Jelinek, F., Lafferty, J., Mercer, R.,
Roossin, P.: A Statistical Approach to Machine Translation. Computational Linguistics 16, 79–85 (1990)
2. Koehn, P.: Europarl: A Parallel Corpus for Statistical Machine Translation. In:
MT Summit, vol. 5, Citeseer (2005)
3. Fung, P.: Compiling Bilingual Lexicon Entries from a Non-parallel English-Chinese
Corpus. In: Proceedings of the 3rd Workshop on Very Large Corpora, pp. 173–183
(1995)
4. Rapp, R.: Identifying Word Translations in Non-parallel Texts. In: Proceedings of
the 33rd ACL Conference, pp. 320–322. ACL (1995)
5. Chiao, Y., Zweigenbaum, P.: Looking for Candidate Translational Equivalents in
Specialized, Comparable Corpora. In: Proceedings of the 19th Coling Conference,
vol. 2, pp. 1–5. ACL (2002)
6. Rubino, R.: Exploring Context Variation and Lexicon Coverage in Projection-based
Approach for Term Translation. In: Proceedings of the RANLP Student Research
Workshop, Borovets, Bulgaria, pp. 66–70. ACL (2009)
7. Laroche, A., Langlais, P.: Revisiting Context-based Projection Methods for Term-translation Spotting in Comparable Corpora. In: Proceedings of the 23rd Coling
Conference, Beijing, China, pp. 617–625 (2010)
8. Shao, L., Ng, H.: Mining New Word Translations from Comparable Corpora. In:
Proceedings of the 20th ACL Conference, p. 618. ACL (2004)
9. Gaussier, E., Renders, J., Matveeva, I., Goutte, C., Dejean, H.: A Geometric View
on Bilingual Lexicon Extraction from Comparable Corpora. In: Proceedings of the
42nd ACL Conference, p. 526. ACL (2004)
10. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet Allocation. The Journal of
Machine Learning Research 3, 993–1022 (2003)
11. Levenshtein, V.: Binary Codes Capable of Correcting Deletions, Insertions, and
Reversals. Soviet Physics Doklady 10, 707–710 (1966)
12. Rapp, R.: Automatic Identification of Word Translations from Unrelated English
and German Corpora. In: Proceedings of the 37th ACL Conference, pp. 519–526.
ACL (1999)
13. Déjean, H., Gaussier, E., Renders, J., Sadat, F.: Automatic Processing of Multilingual Medical Terminology: Applications to Thesaurus Enrichment and Cross-language Information Retrieval. Artificial Intelligence in Medicine 33, 111–124
(2005)
14. Koehn, P., Knight, K.: Learning a Translation Lexicon from Monolingual Corpora.
In: Proceedings of the ACL Workshop on Unsupervised Lexical Acquisition, vol. 9,
pp. 9–16. ACL (2002)
15. Church, K.W., Hanks, P.: Word Association Norms, Mutual Information, and Lexicography. Computational Linguistics 16(1), 22–29 (1990)
16. Dunning, T.: Accurate Methods for the Statistics of Surprise and Coincidence.
Computational Linguistics 19, 61–74 (1993)
17. Evert, S.: The Statistics of Word Cooccurrences: Word Pairs and Collocations.
Ph.D. Thesis, Institut für maschinelle Sprachverarbeitung, Universität Stuttgart
(2004)
18. Fung, P., McKeown, K.: Finding Terminology Translations from Non-parallel Corpora. In: Proceedings of the 5th Workshop on Very Large Corpora, pp. 192–202
(1997)
19. Deerwester, S., Dumais, S., Furnas, G., Landauer, T., Harshman, R.: Indexing
by Latent Semantic Analysis. Journal of the American Society for Information
Science 41, 391–407 (1990)
20. Hofmann, T.: Probabilistic Latent Semantic Indexing. In: Proceedings of the 22nd
ACM SIGIR Conference, pp. 50–57. ACM, New York (1999)
21. Ni, X., Sun, J., Hu, J., Chen, Z.: Mining Multilingual Topics from Wikipedia. In:
Proceedings of the 18th International Conference on WWW, pp. 1155–1156. ACM,
New York (2009)
22. Boyd-Graber, J., Blei, D.M.: Multilingual topic models for unaligned text. In: Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, pp. 75–82
(2009)
23. Langlais, P., Yvon, F., Zweigenbaum, P.: Translating medical words by analogy.
In: Intelligent Data Analysis in Biomedicine and Pharmacology, Washington, DC,
USA, pp. 51–56 (2008)