A Novel Approach to Improve Word Translations Extraction from Non-Parallel, Comparable Corpora

Yun-Chuang Chiao (STIM/DSI & ERM202 INSERM, Paris), Jean-David Sta (EDF R&D, 92410 Clamart), Pierre Zweigenbaum (STIM/DSI & ERM202 INSERM, Paris)
[email protected] [email protected] [email protected]

Abstract

After an attempt at extracting translational equivalents from domain-specific comparable corpora of limited size, we propose in this paper an approach to prune translation alternatives. In particular, we rescore translation candidates in the target language by applying the same translation algorithm in the reverse direction and reranking the candidates according to their harmonic mean score. After a preliminary investigation limited to identifying translations of the most common words, we validate in this paper our proposed model on a larger scale, including less frequent terms in both corpora. The results show an improvement in the precision of the top candidate translation. The translational equivalents obtained may then be used, e.g., for extending an existing medical lexicon, as a human translation aid, or for query expansion and translation in cross-language information retrieval.

1 Introduction

Word alignment is a well-studied problem in Natural Language Processing and has been used in many applications such as translation lexicon acquisition, statistical machine translation, and cross-language information retrieval. Manual alignment of bilingual data is a labor-intensive process, and for applications such as bilingual lexicon construction, human-compiled dictionaries are often out of date as soon as they become available. Recent advances in automatic lexicon extraction and statistical alignment algorithms allow us to build models which can identify translation equivalents at the word or phrase level. Such techniques are especially useful for technical domains, which are in constant evolution and keep producing new terms.

Much related work on using statistical models for mapping bilingual terms (Hiemstra et al., 1997; Littman et al., 1998) is based on parallel texts or ‘bitexts’: pairs of texts that are translations of each other. Most of these methods rely on the following assumption: words that are translations of each other are more likely to appear in corresponding parallel text regions than other pairs of words. By using various correlation metrics, these approaches derive co-occurrence patterns of words across languages. Their limitation is that large-scale parallel corpora are not always available, although the experiments of Chen and Nie (2000) reveal a potential solution by automatically collecting parallel Web pages. It therefore seems natural to enlarge the scope of corpus resources by looking for non-parallel, comparable corpora.

Comparable corpora are a ripe area of investigation in the development of bilingual lexicons (Fung and Yee, 1998; Rapp, 1999; Picchi and Peters, 1998). However, these experiments dealt with very large, ‘general language’ corpora and assume the availability of NLP tools such as POS taggers and morphological analyzers. This paper addresses this issue in the medical domain and describes a model where these resources are assumed unavailable. After a preliminary investigation limited to identifying translations of the most common words from comparable medical corpora, we validate in this paper our proposed model on a larger scale, including less frequent terms in both corpora.
We also varied the context window size, with consideration of sentence boundaries, to look at different types of semantic relationships between words. Once a ranked list of translation candidates is generated for a given source word, we apply the same translation algorithm to each of the top-ranked candidates. We then re-rank the translation candidates according to the harmonic mean of their original and ‘reverse’ ranks with respect to the given source word. The translational equivalents obtained may then be used, e.g., for extending an existing medical lexicon, as a human translation aid, or for query expansion and translation in cross-language information retrieval.

In the following section, we first recall related work on this topic. After describing the corpora we used and their characteristics, we give an overview of our algorithm, which is validated on all lexicon words. We then provide and discuss the results. Finally, we describe future directions.

2 Background

Compared with other approaches that use comparable corpora for word-to-word translation, our work is most closely related to research on alignment of non-parallel texts at the word level and on domain-specific bilingual lexicon acquisition. Previous work in this area is based on the assumption that words which have the same meaning in different languages should have similar context distributions.

Rapp (1999) proposes an approach very similar to the method presented here. He supposes that in any language there is a correlation between the co-occurrences of words which are translations of each other. The main difference between our approach and his model is that he used lemmatized corpora and limited the number of translation candidates considered. Fung and Yee (1998) proposed a method based on the vector space model for translating new words in non-parallel, Chinese-English comparable corpora. They claim that the association between words and their ‘context seed words’ is preserved in comparable texts. Picchi and Peters (1998) designed procedures to retrieve cross-lingual lexical equivalents and proposed that their system could have applications such as retrieving documents containing terms or contexts which are semantically equivalent in more than one language.

3 Material

The material prepared for the present experiments consists of non-parallel medical corpora in French and English and a bilingual, French-English combined lexicon including both general and medical words. The two monolingual corpora have been compiled from Internet catalogs of medical web sites: CISMeF (Darmoni et al., 2000) (www.chu-rouen.fr/cismef) for French and CliniWeb (Hersh et al., 1999) (www.ohsu.edu/cliniweb) for English. We chose a common domain, corresponding to the subtree under the MeSH concept ‘Pathological Conditions, Signs and Symptoms’ (‘C23’), which is the best represented in CISMeF, automatically downloaded the pages indexed by these catalogs and converted them into plain text. The ‘French’ corpus contains many foreign words (mainly English and Spanish) since we did not apply language filtering to it. Besides, the two corpora should be considered as unrelated rather than comparable from the point of view of their size: the French corpus contains 7,604,381 word tokens and the English one 639,662 tokens. However, as explained in Diab and Finch (2000, p. 1501), one does not need corpora of the same size for this kind of approach to work.
A combined French-English lexicon of simple words was compiled from several sources: for the medical domain, an online French medical dictionary (Dictionnaire Médical Masson, www.atmedica.com) and the English-French biomedical terminologies in the UMLS metathesaurus (NLM, 2001): MeSH, WHOART and ICPC; for general words, we used the French-English dictionary distributed in the Linux dictd dictionaries package. The resulting lexicon contains 22,036 ‘one-word term’ entries, mainly specialized medical words (see the excerpt in Table 1).

Table 1: Lexicon excerpt
French | English
anévrisme | aneurysm
assiette | plate
caoutchouc | rubber
champignon | fungus, mycete
dent | tooth
derme | corium, dermis
falaise | cliff

4 Methods

Figure 1 shows a schema of the translation model, which we summarize in the rest of this section.

[Figure 1: Schema for the translation model]

The basic intuition is that there is a correlation between the context distributions of words which are translations of each other. The approach is an attempt at finding the target words whose distributions are the most similar to that of a given source word. The method depends primarily on co-occurrence information of collocate terms. No morphological analysis is applied to either corpus during the experiment. We achieve this by approximating the distributional behavior through context vectors and finding a mapping of source and target words which preserves the context mapping as much as possible. Therefore, our goals are: (a) to build a context vector for each word within each of the corpora; (b) to produce translation candidates using a similarity score; and (c) to provide an algorithm for re-ranking the possible matched word vectors between corpora.

4.1 Computing context vectors

For each word i in the source and target language corpora separately, we first create a context vector which consists of its co-occurrence patterns. Stop words are eliminated from both corpora for co-occurrence counting. With a simple heuristic to identify sentence boundaries (punctuation marks such as !, . or ?), we defined two alternate sliding context windows of 5 and 7 words (Table 2) to calculate the co-occurrences of i. A lemmatization is applied to each co-occurrence pattern. Since this lemmatizer handles neither gender nor verb inflection, the lemmatization is far from perfect.

Table 2: Example of context windows for the word sialophorin
Original text: quantitative or qualitative deficiency of sialophorin in some way due to abnormal Wiskott-Aldrich Syndrome protein
7-word window: quantitative — qualitative deficiency — sialophorin — abnormal wiskott-aldrich syndrome
5-word window: qualitative deficiency — sialophorin — abnormal wiskott-aldrich

The window size parameter allows us to look at different scales (Church and Hanks, 1990). A smaller window size will identify fixed expressions and other relations such as syntactic dependencies; a larger window size will highlight semantic concepts and other relationships that hold over a larger range. In order to take into account word-frequency effects and to emphasize significant word associations, we chose the following tf.idf weighting measure (Sparck Jones, 1979):

tf.idf(i, j) = tf(i, j) · idf(i)
tf(i, j) = cooc(i, j) / max_{k,l} cooc(k, l)
idf(i) = 1 + log( max_{k,l} cooc(k, l) / |{k : cooc(i, k) ≠ 0}| )

4.2 Transferring context vectors through pivot words

When a translation is sought for a source word, its context vector is translated into the target language, using the bilingual lexicon. Only the words in the bilingual lexicon (the ‘pivot’ words) can be used in the transfer. Several translations may exist; pending a better solution, we decided to insert only the first one into the context vector. The result is a target-language context vector, comparable to ‘native’ context vectors directly obtained from the target corpus (Table 3).

Table 3: Example translated context vector for the French word adénose (11 occurrences, 40 co-occurrences). The target context vector consists of translated source pivot words: loose, image, proliferation, hyperplasia, dendritic, fever, lesion, fibrosis, papilloma, calcium, epithelium, inside, adenoma, cell.

Since we want to compare transferred context vectors with native context vectors, these two sorts of vectors should belong to the same space, i.e., range over the same set of context words. Using the bilingual lexicon, we reduced the context vector space to the set of ‘cross-language seed words’. A word belongs to this set if it occurs in the target corpus, is listed in the bilingual lexicon, and its source counterpart(s) occur in the source corpus. In our experimental setting, 6,210 pivot words are used.
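A minimal sketch may make sections 4.1 and 4.2 concrete. The following Python fragment is illustrative only, not the implementation used in these experiments: the function names (cooccurrences, tfidf_vectors, transfer), the tokenized input and the stop list are stand-ins, lemmatization is omitted, and lexicon is assumed to map each source pivot word to its list of translations, of which only the first is kept, as described above.

```python
from collections import Counter, defaultdict
from math import log

SENTENCE_ENDS = {"!", ".", "?"}  # the simple sentence-boundary heuristic of 4.1

def cooccurrences(tokens, window=5, stop_words=frozenset()):
    """Sliding-window co-occurrence counts over non-stop words; a word
    co-occurs with neighbors up to window // 2 positions away, and windows
    never cross a sentence boundary (cf. the sialophorin example, Table 2)."""
    half = window // 2
    cooc = defaultdict(Counter)
    context = []                      # non-stop words seen so far in the sentence
    for tok in tokens:
        if tok in SENTENCE_ENDS:
            context = []              # reset the window at a sentence boundary
            continue
        if tok in stop_words:
            continue
        for left in context[-half:]:  # left neighbors within the window
            cooc[tok][left] += 1
            cooc[left][tok] += 1
        context.append(tok)
    return cooc

def tfidf_vectors(cooc):
    """Weight raw counts with the tf.idf measure given above (sparse dicts)."""
    max_cooc = max(c for ctx in cooc.values() for c in ctx.values())
    vectors = {}
    for word, ctx in cooc.items():
        idf = 1 + log(max_cooc / len(ctx))  # len(ctx) = |{k : cooc(word, k) != 0}|
        vectors[word] = {k: (c / max_cooc) * idf for k, c in ctx.items()}
    return vectors

def transfer(vector, lexicon):
    """Translate a source context vector into the target space through the
    pivot words; non-pivot words are dropped, and if two pivots share a first
    translation the last weight simply wins (a simplification)."""
    return {lexicon[w][0]: weight for w, weight in vector.items() if w in lexicon}
```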
4.3 Computing vector similarity

Given a transferred context vector, a similarity score is computed for each native target vector, and the target vectors are ranked. The best-ranked target words are considered as translation candidates. The Jaccard similarity metric (Romesburg, 1990) is used for comparing two vectors V and W (of length n; k, l, m range from 1 to n):

Jaccard(V, W) = ( Σ_k v_k w_k ) / ( Σ_k v_k² + Σ_l w_l² − Σ_m v_m w_m )

(In preliminary experiments, we also computed the vector similarity with other measures such as cosine, Dice and city-block, and found that the Jaccard metric yielded the best results.)

4.4 Re-ranking translation candidates

After the translation algorithm generates a ranked list of translation candidates for a given source word, the same model is applied in the reverse direction to find the source counterparts of the translation candidates. For each of the top-ranked candidates, a return ranked list of its source counterparts is computed. We then re-score the translation candidates according to the following harmonic mean measure (mathworld.wolfram.com/HarmonicMean.html):

HM(r1, r2) = 2 r1 r2 / (r1 + r2)

where r1 is the original rank of a target word i given a source word j, and r2 is the rank of word j for i obtained by the ‘reverse’ translation module.

For example, given the French word nez, the top 3 translation candidates are respiration (rank = 1) with similarity score 0.20155, ear (rank = 2) with score 0.19018 and nose (rank = 3) with score 0.18652. The rank of the French word nez computed by the ‘reverse’ translation is ∞ for respiration (the list of translation candidates used in the present experiments is of size 30; a target word which is not ranked among the top 30 translation candidates gets ∞ as its rank, in which case HM(r1, ∞) = 2 r1), 4 for ear and 1 for nose. This gives us revised scores of 2, 2.667 and 1.5 respectively, which leads to the correct translation nose being ranked first. This example shows how the harmonic mean behaves: a candidate with one very poor rank is capped at twice its better rank, so a word which ranks reasonably well in both directions, such as nose, overtakes a candidate whose good forward rank is not confirmed by the reverse translation.
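Under the same caveats (illustrative names, sparse vectors as Python dicts), the similarity computation of section 4.3 and the harmonic-mean re-ranking of section 4.4 can be sketched as follows; reverse_rank stands for a run of the translation algorithm in the reverse direction, returning None when the source word falls outside a candidate's top 30 reverse translations (the ∞ case above).

```python
def jaccard(v, w):
    """Extended Jaccard similarity between two sparse vectors (section 4.3)."""
    dot = sum(v[k] * w[k] for k in v.keys() & w.keys())
    denom = (sum(x * x for x in v.values())
             + sum(x * x for x in w.values()) - dot)
    return dot / denom if denom else 0.0

def rank_candidates(transferred, native_vectors, top_n=30):
    """Rank native target words by decreasing similarity to a transferred vector."""
    ranked = sorted(native_vectors,
                    key=lambda t: jaccard(transferred, native_vectors[t]),
                    reverse=True)
    return ranked[:top_n]

def rerank(source_word, candidates, reverse_rank):
    """Re-order forward candidates by HM(r1, r2) = 2*r1*r2 / (r1 + r2);
    a missing reverse rank (None) degenerates to HM(r1, inf) = 2*r1."""
    rescored = []
    for r1, cand in enumerate(candidates, start=1):
        r2 = reverse_rank(cand, source_word)
        hm = 2.0 * r1 if r2 is None else 2.0 * r1 * r2 / (r1 + r2)
        rescored.append((hm, cand))
    return [cand for _, cand in sorted(rescored)]
```

On the nez example, this reproduces the scores given above: 2 for respiration (r2 = ∞), 2.667 for ear and 1.5 for nose, so nose is promoted to the first rank.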
4.5 Experiments

Word alignment generally depends on word frequency distributions in corpora. Words that are translations of each other tend to have similar frequencies in parallel, but also in comparable, corpora. Additionally, frequent words have more contexts in a corpus, and therefore provide more alignment information to the algorithms than rare words. One may expect these words to obtain better translations. The present work tests whether less frequent words can still obtain some translational equivalents.

It proceeds by leave-one-out experiments: given a word whose translation is known in advance (this provides a gold standard) but which is assumed to be ‘unknown’, and given the set of context vectors for the other words in both languages, it examines the rank of the expected translation among the ordered proposals of the algorithm (target context vectors with their target word); a sketch of this protocol is given at the end of this section. Test words range over the set of known words (or ‘pivot’ words) P. We also use an ‘unknown’ set U which contains the candidate target context vectors of all the words in the corpora which are not present in our lexicon. Assuming that we know all the words in our lexicon but one (the test word), we only need to look for a translation among the set U of ‘unknown’ target words, augmented with the expected translation.

It might be objected that known words (those found in our lexicon) should be expected to be more frequent than ‘new’ or ‘unknown’ words, and that this difference in frequency might favorably bias the discrimination of a test word from actually unknown words. To check this, we compared the distributions of frequency ranks of known and unknown words in the source corpus. These ranks are obtained by ordering all the words in the corpus (recall that stop words have been removed) by descending frequency: from rank 1 (corresponding to the most frequent words, frequency range = 35,000 ± 10%) to rank 89 (words which appear only once in the corpus, or ‘hapaxes’). Figure 2 plots the relative distribution of known words and that of unknown words over these ranks. It shows that although this tendency exists, many known (‘pivot’) words are present at large ranks (i.e., are rare in the corpus: 12% are hapaxes) and a few words absent from the lexicon are present in the top ranks (i.e., are frequent in the corpus). Let us recall too that grammatical words, which are the most frequent in a corpus, have been removed as part of the stop words.

[Figure 2: Word frequency distribution in the corpus: relative proportions of pivot (known) words and unknown words at each frequency rank]

Furthermore, we chose two different sets of French test words for which we try to identify the possible translations in our English corpus. The specialized medical set M consists of 886 medical one-word terms extracted from the SNOMED Microglossary for Pathology (Côté, 1996). The global, mixed set P of all lexicon words consists of 6,210 medical and general one-word terms.
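A sketch of this leave-one-out protocol, again with illustrative names: translate stands for the full pipeline of section 4 (transfer, similarity ranking and re-ranking), gold maps each test word to its expected translation, gold_vectors holds the native context vectors of those translations, and the returned ranks are the values used for evaluation in the next section.

```python
def leave_one_out(test_words, gold, unknown_vectors, gold_vectors, translate):
    """Rank each test word's known translation among the 'unknown' set U.
    translate(word, candidates) must return a ranked list of target words."""
    ranks = {}
    for word in test_words:
        candidates = dict(unknown_vectors)                  # the set U ...
        candidates[gold[word]] = gold_vectors[gold[word]]   # ... plus the gold
        ranked = translate(word, candidates)
        ranks[word] = (ranked.index(gold[word]) + 1
                       if gold[word] in ranked else None)   # None: beyond top 30
    return ranks
```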
5 Results

For each test word in P, we produced a list of its translational equivalents ranked in decreasing order of similarity score. The rank R of its expected translation provides the basis for evaluation. Sample results are provided in Table 4, showing the top-ranked candidate translations for the French words nécrose, gène, sclérose and abcès.

Table 4: Example results; Occ. = occurrences in corpus; R = original rank of the expected target English word before ‘reverse’ translation
Fr word | Occ. | En word | Occ. | R | Top 5 ranked candidate translations, followed by similarity score
nécrose | 667 | necrosis | 274 | 1 | necrosis .181, chronic .148, renal .142, inflammation .135, infarction .123
gène | 1,813 | gene | 1,019 | 1 | gene .247, mutation .243, recessive .197, protein .194, chromosome .145
sclérose | 347 | sclerosis | 99 | 3 | sep .263, lateral .187, sclerosis .178, passe .177, poliovirus .158
abcès | 564 | abscess | 120 | 9 | perforation .227, rupture .192, visible .156, invasive .155, impose .149

5.1 Effects of reverse translation

As described earlier in this paper, our algorithm first generates a ranked list of translation candidates, then re-scores them by applying the same algorithm to compute the list of source counterparts of each of the top 30 candidates. Table 5 shows the re-ranking results for the words of Table 4. As we can see in this example, when applying ‘reverse’ translation, the expected translations for sclérose (sclerosis) and abcès (abscess) are ranked first in the re-scored list, against 3 and 9 in the original ranked list.

Table 5: Example results; R = new rank of the expected target English word after ‘reverse’ translation
Fr word | En word | R | Top ranked candidate translations, followed by harmonic mean score
nécrose | necrosis | 1 | necrosis 1, chronic 3.74, renal 4.2, inflammation 5.53, infarction 10
gène | gene | 1 | gene 1, mutation 1.33, protein 2.66, chromosome 2.85, recessive 3
sclérose | sclerosis | 1 | sclerosis 1.5, sep 2, lateral 2.66, passe 8, poliovirus 10
abcès | abscess | 1 | abscess 1.8, perforation 2, rupture 4, visible 6, invasive 8, impose 10

Figures 3 and 4 show a detailed comparison of the translation accuracy, within the top 10 ranked translation candidates, for the test sets M and P with different context window sizes. The results are similar for both the pivot word set P and the medical word set M: with both 5-word and 7-word windows, reverse translation improves precision. With the 5-word window, about 43% of the test words of set M are correctly translated when applying reverse translation, against 39% before reverse translation; for set P, 32% after reverse translation against 30% before (Figure 3). With the 7-word window, 34% of the test words of set M have their expected translation as first-ranked candidate with reverse translation, against 28% without; for set P, 26% against 23% (Figure 4). Besides, the smaller, 5-word window yields better results than the larger, 7-word window: percentiles 1 and 2 are increased in all cases (by 8.9% for percentile 1, and by 12.0% for percentiles 1+2, for set M with reverse translation).

[Figure 3: Comparison of translation accuracy, before and after reverse translation, between medical word set M and pivot word set P, with 5-word context window; y = percentage of words whose expected translation is ranked at rank x]

[Figure 4: Comparison of translation accuracy, before and after reverse translation, between M and P, with 7-word context window]

Figure 5 presents a comparison of the results on the global word set P and on the medical word set M, illustrated by the percentage of French words which obtain their correct translation among the top 10 candidates, depending on their word frequency distributions in the corpora. In the higher frequency region (top 30 ranks, i.e., words which appear more than 1,500 times in the corpus), all medical words find their correct translation, while about half of the pivot word set P is correctly translated. When we look at less frequent words, set M still yields better results than set P.

[Figure 5: Comparison between pivot word set P and medical word set M; y = percentage of expected translations ranked among the top 10 candidates, as a function of decreasing frequency rank x]

6 Discussion

The reverse translation method we proposed significantly improves the translation performance.
The main reason seems to be that it eliminates from the translation candidate list general words, which are relatively more ambiguous. For the French word abcès, whose expected translation in English is abscess, a general English word such as perforation might obtain a higher similarity score than abscess: general words are more frequent, so the chances of their co-occurring with a given word are higher than those of specific words. With the reverse translation method, the similarity score for abscess and abcès is highest, while it is lower for perforation and abcès because perforation can be similar to other words. We therefore expect this method to highlight the ‘good’ translation candidates by filtering out those which are ambiguous.

The better results with the smaller, 5-word window may be interpreted as a premium on syntactic dependencies, which may be expected to be more prevalent at a short distance than at a longer one. This prompts us to investigate an even shorter, 3-word window in our next experiments, as well as the application of a parser to identify actual functional dependencies.

It is not surprising that our method fares better on more frequent words than on less frequent ones, since the alignment algorithm is based on co-occurrence frequency. A more common word is assumed to have a sufficient number of contexts, which is favorable to this kind of measure. However, the performance on the medical word set M looks promising, especially when compared with the pivot word set P, as we can observe a clear improvement of results whether or not the word to be aligned is frequent in the French corpus. This might be due to the fact that domain-specific terms are generally less ambiguous in corpora of the same domain, since their meanings are restricted to that domain. Another reason that might help to explain the weaker results on the pivot word set P is the lack of general word pairs in our corpora, especially in the English corpus, which contains fewer word occurrences than the French one. The better results at higher percentiles might be linked to a better contrast between a specific test word and general unknown words in the test conditions, where the candidate set contained unknown and relatively rare words.

On a ‘general-language’ corpus, Rapp (1999) reports an accuracy of 65% at the first percentile using log-likelihood weighting and the city-block metric, whereas neither of these improved our results. The larger size of his corpora (135 and 163 Mwords) and the consideration of word order within contexts may help to explain this difference in accuracy.

7 Conclusion and Perspectives

In summary, these experiments confirm a positive effect of reverse translation on the suggestion of appropriate translational equivalents for medical words.
They show that medical words in this corpus are better handled than less specialized words, and show a clear influence of the context window size. Our proposed approach relies on an initial bilingual lexicon to build context vectors. Chiao and Zweigenbaum (2003) showed that the performance of such a statistical translation model can be improved by adding general words to the domain-specific lexicon. We would like to test our method further by counting only co-occurrences with general words in the context. The main limitation of the present work lies in the moderate corpus size, which limits the frequency and diversity of its words, so that an insufficient number of word pairs may be aligned by the proposed algorithm. We should investigate the effect of very large corpora, for which the Web has a vast potential of resources. Further investigation must now obtain better performance for all types of words, including the less frequent terms. Several directions are still open for investigation, among which selecting words with the same part of speech as the source word, boosting morphologically similar candidates (‘cognates’), or enlarging the size of the corpora.

Acknowledgements

We thank Julien Quint and Salah Ait Mokhtar for their help during this work.

References

Jiang Chen and J.-Y. Nie. 2000. Parallel web text mining for cross-language IR. In Proceedings of RIAO 2000: Content-Based Multimedia Information Access, volume 1, pages 62–78, Paris, France, April. C.I.D.

Yun-Chuang Chiao and Pierre Zweigenbaum. 2003. The effect of a general lexicon in corpus-based identification of French-English medical word translations. In Robert Baud, Marius Fieschi, Pierre Le Beux, and Patrick Ruch, editors, Proceedings Medical Informatics Europe, pages 397–402, Amsterdam. IOS Press.

K.W. Church and P. Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, March.

Roger A. Côté. 1996. Répertoire d'anatomopathologie de la SNOMED internationale, v3.4. Université de Sherbrooke, Sherbrooke, Québec.

Stéfan J. Darmoni, J.-P. Leroy, Benoît Thirion, F. Baudic, Magali Douyere, and J. Piot. 2000. CISMeF: a structured health resource guide. Methods Inf Med, 39(1):30–35.

Mona Diab and Steve Finch. 2000. A statistical word-level translation model for comparable corpora. In Proceedings of RIAO 2000: Content-Based Multimedia Information Access, pages 1500–1508, Paris, France, April. C.I.D.

Pascale Fung and L. Y. Yee. 1998. An IR approach for translating new words from non-parallel, comparable texts. In Proceedings of the 36th ACL, pages 414–420, Montréal, August.

William R. Hersh, A. Ball, B. Day, M. Masterson, L. Zhang, and L. Sacherek. 1999. Maintaining a catalog of manually-indexed, clinically-oriented World Wide Web content. J Am Med Inform Assoc, 6(suppl):790–794.

D. Hiemstra, F. de Jong, and W. Kraaij. 1997. A domain specific lexicon acquisition tool for cross-language information retrieval. In Proceedings of RIAO 97, pages 217–232, Montreal, Canada.

M.L. Littman, S.T. Dumais, and T.K. Landauer. 1998. Automatic cross-language information retrieval using latent semantic indexing. In Gregory Grefenstette, editor, Cross-Language Information Retrieval, chapter 5, pages 51–62. Kluwer Academic Publishers, London.

National Library of Medicine, Bethesda, Maryland. 2001. UMLS Knowledge Sources Manual. www.nlm.nih.gov/research/umls/.

E. Picchi and C. Peters. 1998. Cross-language information retrieval: A system for comparable corpus querying. In Gregory Grefenstette, editor, Cross-Language Information Retrieval, chapter 7, pages 81–90. Kluwer Academic Publishers, London.
Reinhard Rapp. 1999. Automatic identification of word translations from unrelated English and German corpora. In Proceedings of the 37th ACL, College Park, Maryland, June.

H. Charles Romesburg. 1990. Cluster Analysis for Researchers. Krieger, Malabar, FL.

Karen Sparck Jones. 1979. Experiments in relevance weighting of search terms. Inform Proc Management, 15:133–144.