Using Cognates in a French - Romanian Lexical Alignment System: A Comparative Study Mirabela Navlea Linguistique, Langues, Parole (LiLPa) Université de Strasbourg 22, rue René Descartes BP, 80010, 67084 Strasbourg cedex [email protected] Abstract This paper describes a hybrid French - Romanian cognate identification module. This module is used by a lexical alignment system. Our cognate identification method uses lemmatized, tagged and sentence-aligned parallel corpora. This method combines statistical techniques, linguistic information (lemmas, POS tags) and orthographic adjustments. We evaluate our cognate identification module and we compare it to other methods using pure statistical techniques. Thus, we study the impact of the used linguistic information and the orthographic adjustments on the results of the cognate identification module and on cognate alignment. Our method obtains the best results in comparison with the other implemented statistical methods. 1 Introduction We present a new French - Romanian cognate identification module, integrated into a lexical alignment system using French - Romanian parallel law corpora. We define cognates as translation equivalents having an identical form or sharing orthographic or phonetic similarities (common etymology, borrowings). Cognates are very frequent between close languages such as French and Romanian, two Latin languages with a rich morphology. So, they represent important lexical cues in a French - Romanian lexical alignment system. Few linguistic resources and tools for Romanian (dictionaries, parallel corpora, MT systems) are currently available. Some lexically aligned corpora or lexical alignment tools (Tufiş et al., 2005) are available for Romanian - English or Amalia Todiraşcu Linguistique, Langues, Parole (LiLPa) Université de Strasbourg 22, rue René Descartes BP, 80010, 67084 Strasbourg cedex [email protected] Romanian - German (Vertan and Gavrilă, 2010). Most of the cognate identification modules used by these systems are purely statistical. As far as we know, no cognate identification method is available for French and Romanian. Cognate identification is a difficult task due to the high orthographic similarities between bilingual pairs of words having different meanings. Inkpen et al. (2005) develop classifiers for French and English cognates based on several dictionaries and manually built lists of cognates. Inkpen et al. (2005) distinguish between: - cognates (liste (FR) - list (EN)); - false friends (blesser (‘to injure’) (FR) - bless (EN)); - partial cognates (facteur (FR) - factor or mailman (EN)); - genetic cognates (chef (FR) - head (EN)); - unrelated pairs of words (glace (FR) - ice (EN) and glace (FR) - chair (EN)). Our cognate detection method identifies cognates, partial and genetic cognates. This method is used especially to improve a French - Romanian lexical alignment system. So, we aim to obtain a high precision of our cognate identification method. Thus, we eliminate false friends and unrelated pairs of words combining statistical techniques and linguistic information (lemmas, POS tags). We use a lemmatized, tagged and sentence-aligned parallel corpus. Unlike Inkpen et al. (2005), we do not use other external resources (dictionaries, lists of cognates). To detect cognates from parallel corpora, several approaches exploit the orthographic similarity between two words of a bilingual pair. An efficient method is the 4-gram method (Simard et al., 1992). This method considers two words as cognates if their length is greater than or equal to 4 and at least their first 4 characters are common. Other methods exploit Dice’s coefficient (Adam247 Proceedings of Recent Advances in Natural Language Processing, pages 247–253, Hissar, Bulgaria, 12-14 September 2011. son and Boreham, 1974) or a variant of this coefficient (Brew and McKelvie, 1996). This measure computes the ratio between the number of common character bigrams of the two words and the total number of two word bigrams. Also, some methods use the Longest Common Subsequence Ratio (LCSR) (Melamed, 1999; Kraif, 1999). LCSR is computed as the ratio between the length of the longest common substring of ordered (and not necessarily contiguous) characters and the length of the longest word. Thus, two words are considered as cognates if LCSR value is greater than or equal to a given threshold. Similarly, other methods compute the distance between two words, which represents the minimum number of substitutions, insertions and deletions used to transform one word into another (Wagner and Fischer, 1974). These methods use exclusevly statistical techniques and they are language independent. On the other hand, other methods use the phonetic distance between two words belonging to a bilingual pair (Oakes, 2000). Kondrak (2009) identifies three characteristics of cognates: recurrent sound correspondences, phonetic similarity and semantic affinity. Thus, our method exploits orthographic and phonetic similarities between French - Romanian cognates. We combine n-grams methods with linguistic information (lemmas, POS tags) and several input data disambiguation strategies (computing cognates’ frequencies, iterative extraction of the most reliable cognates and their deletion from the input data). Our method needs no external resources (bilingual dictionaries), so it could easily be extended to other Romance languages. We aim to obtain a high accuracy of our method to be integrated in a lexical alignment system. We evaluate our method and we compare it with pure statistical methods to study the influence of used linguistic information on the final results and on cognate alignment. In the next section, we present the parallel corpora used for our experiments. In section 3, we present the lexical alignment method. We also describe our cognate identification module in section 4. We present the evaluation of our method and a comparison with other methods in section 5. Our conclusions and further work figure in section 6. 2 The Parallel Corpus In our experiments, we use a legal parallel corpus (DGT-TM1) based on the Acquis Communautaire corpus. This multilingual corpus is available in 22 official languages of EU member states. It is composed of laws adopted by EU member states since 1950. DGT-TM contains 9,953,360 tokens in French and 9,142,291 tokens in Romanian. We use a test corpus of 1,000 1:1 aligned complete sentences (starting with a capital letter and finishing with a punctuation sign). The length of each sentence has at most 80 words. This test corpus contains 33,036 tokens in French and 28,645 in Romanian. We use the TTL2 tagger available for Romanian (Ion, 2007) and for French (Todiraşcu et al., 2011) (as Web service3). Thus, the parallel corpus is tokenized, lemmatized, tagged and annotated at chunk level. The tagger uses the set of morpho-syntactic descriptors (MSD) proposed by the Multext Project4 for French (Ide and Véronis, 1994) and for Romanian (Tufiş and Barbu, 1997). In the Figure 1, we present an example of TTL’s output: lemma attribute represents the lemmas of lexical units, ana attribute provides morphosyntactic information and chunk attribute marks nominal and prepositional phrases. <seg lang="FR"><s id="ttlfr.3"> <w lemma="voir" ana="Vmps-s">vu</w> <w lemma="le" ana="Da-fs" chunk="Np#1">la</w> <w lemma="proposition" ana="Ncfs" chunk="Np#1">proposition</w> <w lemma="de" ana="Spd" chunk="Pp#1">de</w> <w lemma="le" ana="Da-fs" chunk="Pp#1,Np#2">la</w> <w lemma="commission" ana="Ncfs" chunk="Pp#1,Np#2">Commission </w> <c>;</c> </s></seg> Figure 1 TTL’s output for French (in XCES format) 1 2 http://langtech.jrc.it/DGT-TM.html Tokenizing, Tagging and Lemmatizing free running texts 3 https://weblicht.sfs.uni-tuebingen.de/ 4 http://aune.lpl.univ-aix.FR/projects/multext/ 248 3 Lexical Alignment Method The cognate identification module is integrated in a French - Romanian lexical alignment system (see Figure 2). In our lexical alignment method, we first use GIZA++ (Och and Ney, 2003) implementing IBM models (Brown et al., 1993). These models build word-based alignments from aligned sentences. Indeed, each source word has zero, one or more translation equivalents in the target language. As these models do not provide many-tomany alignments, we also use some heuristics (Koehn et al., 2003; Tufiş et al., 2005) to detect phrase-based alignments such as chunks: nominal, adjectival, verbal, adverbial or prepositional phrases. In our experiments, we use the lemmatized, tagged and annotated parallel corpus described in section 2. Thus, we use lemmas and morphosyntactic properties to improve the lexical alignment. Lemmas are followed by the two first characters of morpho-syntactic tag. This operation morphologically disambiguates the lemmas (Tufiş et al., 2005). For example, the same French lemma change (=exchange, modify) can be a common noun or a verb: change_Nc vs. change_Vm. This disambiguation procedure improves the GIZA++ system’s performance. We realize bidirectional alignments (FR - RO and RO - FR) with GIZA++, and we intersect them (Koehn et al., 2003) to select common alignments. To improve the word alignment results, we add an external list of cognates to the list of the translation equivalents extracted by GIZA++. This list of cognates is built from parallel corpora by our own method (described in the next section). Also, to complete word alignments, we use a French - Romanian dictionary of verbo-nominal collocations (Todiraşcu et al., 2008). They represent multiword expressions, composed of words related by lexico-syntactic relations (Todiraşcu et al., 2008). The dictionary contains the most frequent verbo-nominal collocations extracted from legal corpora. To augment the recall of the lexical alignment method, we apply a set of linguisticallymotivated heuristic rules (Tufiş et al., 2005): a) we define some POS affinity classes (a noun might be translated by a noun, a verb or an adjective); b) we align content-words such as nouns, adjectives, verbs, and adverbs, according to the POS affinity classes; c) we align chunks containing translation equivalents aligned in a previous step; d) we align elements belonging to chunks by linguistic heuristics. We develop a language dependent module applying 27 morpho-syntactic contextual heuristic rules (Navlea and Todiraşcu, 2010). These rules are defined according to morpho-syntactic differences between French and Romanian. The architecture of the lexical alignment system is presented in the Figure 2. Figure 2 Lexical alignment system architecture 4 Cognate Identification Module In our hybrid cognate identification method, we use the legal parallel corpus described in section 2. This corpus is tokenized, lemmatized, tagged, and sentence-aligned. Thus, we consider as cognates bilingual word pairs respecting the linguistic conditions below: 1) their lemmas are translation equivalents in two parallel sentences; 2) they have identical lemmas or have orthographic or phonetic similarities between lemmas; 3) they are content-words (nouns, verbs, adverbs, etc.) having the same POS tag or belonging to the same POS affinity class. We filter out short words such as prepositions and conjunctions to limit noisy output. We also detect short cognates such as il ’he’ vs. el (personal pronoun), cas 'case' vs. caz (nouns). We 249 avoid ambiguous pairs such as lui ‘him’ (personal pronoun) (FR) vs. lui ‘s' (possessive determiner) (RO), ce 'this' (demonstrative determiner) (FR) vs. ce 'that' (relative pronoun) (RO). To detect orthographic and phonetic similarities between cognates, we look at the beginning of the words and we ignore their endings. We classify the French - Romanian cognates detected in the studied parallel corpus (at the orthographic or phonetic level), in several categories: 1) cross-lingual invariants (numbers, certain acronyms and abbreviations, punctuation signs); 2) identical cognates (document ‘document’ vs. document); 3) similar cognates: a) 4-grams (Simard et al., 1992); The first 4 characters of lemmas are identical. The length of these lemmas is greater than or equal to 4 (autorité vs. autoritate 'authority'). b) 3-grams; The first 3 characters of lemmas are identical and the length of the lemmas is greater than or equal to 3 (acte vs. act 'paper'). c) 8-bigrams; Lemmas have a common sequence of characters Levels of orthographic adjustments diacritics double contiguous letters consonant groups q intervocalic s w y among the first 8 bigrams. At least one character of each bigram is common to both words. This condition allows the jump of a non identical character (souscrire vs. subscrie 'submit'). This method applies only to long lemmas (length greater than 7). d) 4-bigrams; Lemmas have a common sequence of characters among the 4 first bigrams. This method applies for long lemmas (length greater than 7) (homologué vs. omologat 'homologated') but also for short lemmas (length less than or equal to 7) (groupe vs. grup 'group'). We iteratively extract cognates by identified categories. In addition, we use a set of orthographic adjustments and some input data disambiguation strategies. We compute frequency for ambiguous candidates (the same source lemma occurs with several target candidates) and we keep the most frequent candidate. At each iteration, we delete reliable considered cognates from the input data. We start by applying a set of empirically established orthographic adjustments between French - Romanian lemmas, such as: diacritic removal, phonetic mappings detection, etc. (see Table 1). French Romanian x x x x ph th dh cch ck cq ch ch q (final) qu(+i) (medial) qu(+e) (medial) qu(+a) que (final) v+s+v w y f [f] t [t] d [d] c [k] c [k] c [k] ş [∫] c [k] c [k] c [k] c [k] c(+a) [k] c [k] v+z+v v i Examples FR - RO dépôt - depozit rapport - raport phase - fază méthode - metodă adhérent - aderent bacchante - bacantă stockage - stocare grecque - grec fiche - fişă chapitre - capitol cinq - cinci équilibre - echilibru marquer - marca qualité - calitate pratique - practică présent - prezent wagon - vagon yaourt - iaurt Table 1 French - Romanian cognate orthographic adjustments While French uses an etymological writing and Romanian generally has a phonetic writing, we identify phonetic correspondences between lemmas. Then, we make some orthographic adjustments from French to Romanian. For example, 250 Romanian cognates have one form in French, but two various forms in Romanian (spécification 'specification' vs. specificare or specificaţie). We recover these pairs by using regular expressions based on specific lemma endings (ion (fr) vs. re|ţie (ro)). Then, we delete the reliable cognate pairs (high precision) from the input data at the end of the extraction step. This step helps us to disambiguate the input data. For example, the identical cognates transport vs. transport 'transportation', obtained in a previous extraction step and deleted from the input data, eliminate the occurrence of candidate transport vs. tranzit as 4-grams cognate, in a next extraction step. We apply the same method for cognates having POS affinity (N-V; N-ADJ). We keep only 4grams cognates, due to the significant decrease of the precision for the other categories 3 (b, c, d). Deletion Content-words / Frequency Precision from the Same POS (%) input data x 100 x x 100 cognates stockage 'stock' (FR) vs. stocare (RO) become stocage (FR) vs. stocare (RO). In this example, the French consonant group ck [k] become c [k] (as in Romanian). We also make adjustments in the ambiguous cases, by replacing with both variants (ch ([∫] or [k])): fiche vs. fişă ‘sheet’; chapitre vs. capitol ‘chapter’. We aim to improve the precision of our method. Thus, we iteratively extract cognates by identified categories from the surest ones to less sure candidates (see Table 2). To decrease the noise of the cognate identification method, we apply two supplementary strategies. We filter out ambiguous cognate candidates (autorité - autoritate|autorizare), by computing their frequencies in the corpus. In this case, we keep the most frequent candidate pair. This strategy is very effective to augment the precision of the results, but it might decrease the recall in certain cases. Indeed, there are cases where French Extraction steps by category of cognates 1 : cross lingual invariants 2 : identical cognates 3 : 4-grams (lemmas’ length >= 4) ; 4 : 3-grams (lemmas’ length >=3) ; 5 : 8-bigrams (long lemmas, lemmas’ length >7) x x x 99.05 x x x 93.13 x 95.24 x 6 : 4-bigrams (long lemmas, lemmas’ length > 7) x 7 : 4-bigrams (short lemmas, lemmas’ length =< 7) x 75 x 65.63 Table 2 Precision of cognate extraction steps empirically establish the threshold of 0.68. 5 Evaluation and Methods’ Comparison LCSR ( w1, w 2 ) = We evaluated our cognate identification module against a list of cognates initially built from the test corpus, containing 2,034 pairs of cognates. In addition, we also compared the results of our method with the results provided by pure statistical methods (see Table 3). These methods are the following: a) thresholding the Longest Common Subsequence Ratio (LCSR) for two words of a bilingual pair; This measure computes the ratio between the longest common subsequence of characters of two words and the length of the longest word. We length ( common _ substring ( w1, w 2 )) max( length ( w1), length ( w 2 )) b) thresholding DICE’s coefficient ; We empirically establish the threshold of 0.62. DICE ( w1, w 2 ) = 2 * number _ common _ bigrams total _ number _ bigrams ( w1, w 2 ) c) 4-grams ; Two words are considered as cognates if they have at least 4 characters and their first 4 characters are identical. 251 We implemented these methods using orthographically adjusted parallel corpus (see Table 1). Moreover, we evaluate 4-grams method on the initial parallel corpus and on the orthographically adjusted parallel corpus to study the impact of orthographic adjustments step on the quality of the results. These methods generally apply for words having at least 4 letters in order to decrease the noise of the results. Cognates are searched in aligned parallel sentences. Word characters are almost parallel (rembourser vs. rambursare 'refund'). Methods P (%) R (%) F (%) LCSR 44.13 58.95 50.47 DICE 56.47 60.91 58.61 4-grams 91.55 72.42 80.87 Our method 94.78 89.18 91.89 Table 3 Evaluation and methods’ comparison; P=Precision; R=Recall; F=F-measure Our method extracted 1,814 correct cognates from 1,914 provided candidates. The method obtains the best scores (precision=94.78% ; recall=89.18% ; f-measure=91.89%), in comparison with the other implemented methods. The 4grams method obtains a high precision (90.85%), but a low recall (47.84%). Orthographic adjustments step improves significantly the recall of 4grams method with 24.58% (see Table 4). This result is due to the specific properties of the law parallel corpus. Indeed, many Romanian terms were borrowed from French and these terms present high orthographic similarities. Methods P (%) R (%) F (%) 4-grams Adjustments 90.85 47.84 62.68 4-grams + Adjustments 91.55 72.42 80.87 Table 4 Evaluation of the 4-grams method before and after orthographic adjustments step However, our method extracts some ambiguous candidates such as numéro ‘number’ - nume ‘name’, compléter ‘complete’ - compune ‘compose’. Some of these errors were avoided by keeping the most frequent candidate in the studied corpus. So, the remaining errors mainly concern hapax candidates. Also, some cognates were not extracted: heure oră ‘hour’, semaine - săptămână ‘week’, lieu loc ‘place’. These errors concern cognates sharing very few orthographic similarities. The lowest scores are obtained by the LCSR method (f-measure=50.47%), followed by the DICE’s coefficient (f-measure=58.61%). These general methods provide a high noise due to the important orthographic similarities between the words having different meanings. Their results might be improved by combining statistical techniques with linguistic information such as POS affinity or by combining several association scores. As we mentioned, the output of the cognate identification module is exploited by a French - Romanian lexical alignment system (based on GIZA++) described in section 3. We compared the set of cognates provided by GIZA++ with our results to study their impact on cognate alignment. GIZA++ extracted 1,532 cognates representing a recall of 75.32% (see Table 5). Our cognate identification module significantly improved the recall with 13.86%. Number of extracted cognates 1,532 GIZA++ Our 1,814 method Systems Number of total cognates Recall (%) 75.32 2,034 89.18 Table 5 Improvement of our method’s recall 6 Conclusions and Further Work We present a French - Romanian cognate identification module required by a lexical alignment system. Our method combines statistical techniques and linguistic filters to extract cognates from lemmatized, tagged and sentence-aligned parallel corpus. The use of the linguistic information and the orthographic adjustments significantly improves the results compared with pure statistical methods. However, these results are dependent of the studied languages, of the corpus domain and of the data volume. We need more experiments using other corpora from other domains to be able to generalize. Our system should be improved to detect false friends by using external resources. 252 tical Machine Translation Systems, in Proceedings of the Workshop on Exploitation of Multilingual Resources and Tools for Central and (South) Eastern European Languages, 7th International Conference on Language Resources and Evaluation, Malta, Valletta, May 2010, pp. 41-48. Cognate identification module will be integrated in a French - Romanian lexical alignment system. This system is part of a larger project aiming to develop a factored phrase-based statistical machine translation system for French and Romanian. References George W. Adamson and Jillian Boreham. 1974. The use of an association measure based on character structure to identify semantically related pairs of words and document titles, Information Storage and Retrieval, 10(7-8):253-260. Chris Brew and David McKelvie. 1996. Word-pair ex-traction for lexicography, in Proceedings of International Conference on New Methods in Natural Language Processing, Bilkent, Turkey, 45-55. Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation, Computational Linguistics, 19(2):263-312. Nancy Ide and Jean Véronis. 1994. Multext (multilingual tools and corpora), in Proceedings of the 15th International Conference on Computational Linguistics, CoLing 1994, Kyoto, pp. 90-96. Diana Inkpen, Oana Frunză, and Grzegorz Kondrak. 2005. Automatic Identification of Cognates and False Friends in French and English, RANLP2005, Bulgaria, Sept. 2005, p. 251-257. Radu Ion. 2007. Metode de dezambiguizare semantică automată. Aplicaţii pentru limbile engleză şi română, Ph.D. Thesis, Romanian Academy, Bucharest, May 2007, 148 pp. Philipp Koehn, Franz J. Och, and Daniel Marcu. 2003. Statistical Phrase-Based Translation, in Proceedings of Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, HLT-NAACL 2003, Edmonton, May-June 2003, pp. 48-54. Grzegorz Kondrak. 2009. Identification of Cognates and Recurrent Sound Correspondences in Word Lists, in Traitement Automatique des Langues (TAL), 50(2) :201-235. Olivier Kraif. 1999. Identification des cognats et alignement bi-textuel : une étude empirique, dans Actes de la 6ème conférence annuelle sur le Traitement Automatique des Langues Naturelles, TALN 99, Cargèse, 12-17 juillet 1999, 205-214. Dan I. Melamed. 1999. Bitext Maps and Alignment via Pattern Recognition, in Computational Linguistics, 25(1):107-130. Michael P. Oakes. 2000. Computer Estimation of Vocabulary in Protolanguage from Word Lists in Four Daughter Languages, in Journal of Quantitative Linguistics, 7(3):233-243. Franz J. Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models, in Computational Linguistics, 29(1):1951. Michel Simard, George Foster, and Pierre Isabelle. 1992. Using cognates to align sentences, in Proceedings of the Fourth International Conference on Theoretical and Methodological Issues in Machine Translation, Montréal, pp. 67-81. Amalia Todiraşcu, Ulrich Heid, Dan Ştefănescu, Dan Tufiş, Christopher Gledhill, Marion Weller, and François Rousselot. 2008. Vers un dictionnaire de collocations multilingue, in Cahiers de Linguistique, 33(1) :161-186, Louvain, août 2008. Amalia Todiraşcu, Radu Ion, Mirabela Navlea, and Laurence Longo. 2011. French text preprocessing with TTL, in Proceedings of the Romanian Academy, Series A, Volume 12, Number 2/2011, pp. 151-158, Bucharest, Romania, June 2011, Romanian Academy Publishing House. ISSN 1454-9069. Dan Tufiş and Ana Maria Barbu. 1997. A Reversible and Reusable Morpho-Lexical Description of Romanian, in Dan Tufiş and Poul Andersen (eds.), Recent Advances in Romanian Language Technology, pp. 83-93, Editura Academiei Române, Bucureşti, 1997. ISBN 973-27-0626-0. Dan Tufiş, Radu Ion, Alexandru Ceauşu, and Dan Ştefănescu. 2005. Combined Aligners, in Proceedings of the Workshop on Building and Using Parallel Texts: Data-Driven Machine Translation and Beyond, pp. 107-110, Ann Arbor, USA, Association for Computational Linguistics. ISBN 978-973703-208-9. Cristina Vertan and Monica Gavrilă. 2010. Multilingual applications for rich morphology language pairs, a case study on German Romanian, in Dan Tufiş and Corina Forăscu (eds.): Multilinguality and Interoperability in Language Processing with Emphasis on Romanian, Romanian Academy Publishing House, Bucharest, pp. 448-460, ISBN 978973-27-1972-5. Robert A. Wagner and Michael J. Fischer. 1974. The String-to-String Correction Problem, Journal of the ACM, 21(1):168-173. Mirabela Navlea and Amalia Todiraşcu. 2010. Linguistic Resources for Factored Phrase-Based Statis- 253
© Copyright 2026 Paperzz