TTC project
Terminology Extraction, Translation Tools and Comparable Corpora

Project duration: 1st of January 2010 to 31st of December 2012 (36 months)

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement n°248005.

Deliverable ID: D-4.1
Document title: UIMA components to extract neoclassical terms and to align them with their translations
Version: 4
Version date: 07/10/11
Status: Final version
Dissemination status / Deliverable responsible / Author: UN, UN

FP7– Information Society and Media – ICT-2009.2.2: Language-based interaction
TTC project – GA n°248005

Document Revision Information

Date        Version   Changes
20/06/11    1         Initial version
01/09/11    2         First revision
13/09/11    3         Second revision

Summary

1 Introduction
  1.1 Context and objectives
  1.2 Neoclassical compounds
2 Neoclassical compound detection and alignment
  2.1 Global architecture
  2.2 Handled neoclassical compounds forms
3 Resources
  3.1 Comparable corpora
  3.2 Monolingual neoclassical elements
  3.3 Aligned neoclassical elements
  3.4 Bilingual dictionary
  3.5 Monolingual dictionary
4 Algorithm
  4.1 Extraction of neoclassical compounds candidates
  4.2 Alignment of neoclassical compounds
    4.2.1 Generation of translation candidates
    4.2.2 Selection of correct translations
5 Implementation
  5.1 Components in UIMA
    5.1.1 Extractor
    5.1.2 Aligner
  5.2 Command line
6 Experiments and Evaluation
  6.1 Resources used for experiments
    6.1.1 Comparable corpora
    6.1.2 Monolingual neoclassical elements
    6.1.3 Aligned neoclassical elements
    6.1.4 Bilingual dictionary
    6.1.5 Monolingual dictionaries
  6.2 Results
7 Conclusion
8 References

1 Introduction

1.1 Context and objectives

The TTC project aims at automatically generating bilingual terminologies from comparable corpora in five European languages (English, French, German, Latvian and Spanish) as well as Russian and Chinese.
These bilingual terminologies are intended to support Machine Translation tools (MT tools) and Computer-Aided Translation tools (CAT tools). Two important steps of the project are the automatic extraction of monolingual terminologies in the different languages (WP3) and the bilingual alignment of the extracted terminologies (WP4) from multilingual corpora.

WP4 is dedicated in general to improving term alignment methods from comparable corpora. Task 4.1 of WP4 focuses on increasing the coverage of the bilingual dictionary by developing a program for neoclassical compound detection in EN, FR and DE. For this purpose, we aim at developing a method that automatically extracts neoclassical compounds in two languages (source and target) from comparable corpora and aligns the extracted compounds. We decided to make this method language independent, i.e. the same procedures are used for all language pairs (FR-EN, EN-DE and FR-DE).

The method is based on the following assumptions:
(1) Neoclassical compounds are translated compositionally: each component is translated individually and the final translation is the combination of the translated parts, since the meaning of a neoclassical compound is often a combination of the meanings of its constituent parts [9].
(2) The order of the constituents (i.e. components) of a source neoclassical compound is preserved in the equivalent target neoclassical compound.
(3) Each neoclassical constituent element is translated with a neoclassical element of the same type (for instance, an ICF by an ICF and an FCF by an FCF).

The second assumption is based on the fact that neoclassical word-formation in different languages follows the model of Greek and Latin in forming terms [2].

We develop our method using the UIMA (Unstructured Information Management Architecture) framework. This framework is chosen because it facilitates the processing of large volumes of text.
Moreover, the UIMA framework enables applications to be decomposed into components, each of which can be dedicated to a particular task. The data flow between these components is automatically managed by UIMA [1].

1.2 Neoclassical compounds

Describing new concepts usually requires creating new terms. Neoclassical word-formation is a process used by many European languages, such as English, French and German [6]. It combines elements borrowed from Greek or Latin to create neoclassical compounds. For example, the neoclassical element bio combines with the neoclassical element graphy to form the neoclassical compound biography. Another example is the French neoclassical compound androgyne, which consists of two neoclassical elements: andro (man) and gyne (woman). The German neoclassical compound radiologie is composed of the neoclassical elements radio and logie.

Neoclassical elements (or roots), which are sometimes called combining forms, cannot play the role of independent words in a sentence, i.e. they are almost always seen combined with other elements. Each language may assimilate its borrowed neoclassical elements phonologically. For example, the Greek word pathos is transliterated in English as pathy, as in cardiopathy, while in French it is transliterated as pathie, as in FR cardiopathie. In addition, an element can have different forms (allomorphs). For example, the English neoclassical element neuro can have two forms in French: neuro as in FR neurologie and névro as in FR névrodermite.

Neoclassical elements can appear at different positions in neoclassical compounds: (1) in initial position, like homo- in homomorphic, or (2) in final position, like -cide in genocide. We follow L. Bauer [8, p. 214] in distinguishing between Initial Combining Forms (ICFs) and Final Combining Forms (FCFs). ICFs include forms of neoclassical elements that appear in initial positions (e.g. bio, cardio, patho…), while FCFs include forms of neoclassical elements that appear in final positions (e.g. logy, cide, path…). Several ICFs may appear sequentially in a neoclassical compound (e.g. histo and patho in histopathology).

Neoclassical word-formation is productive [7]; some scientific fields like medicine make intense use of neoclassical compounds [6]. Furthermore, a language can always borrow neoclassical elements in order to form new terms that describe new concepts. The productivity of neoclassical compounds makes their translation difficult, since many of them are unlikely to be listed in bilingual dictionaries.

2 Neoclassical compound detection and alignment

In the following sections, we introduce the architecture of the program that we developed to detect and align neoclassical compounds (section 2.1). The neoclassical compound forms that the program is able to align are presented in section 2.2. We present the resources that must be provided to the program in section 3. The algorithm that we propose is explained in section 4 and its implementation in section 5. Finally, we describe the overall evaluation of the method in section 6.

2.1 Global architecture

The system consists of two components: the Extractor and the Aligner. The extractor extracts neoclassical compounds in two languages (source and target) from bilingual comparable corpora by using two lists of monolingual neoclassical elements. The aligner aligns the extracted neoclassical compounds when provided with a list of aligned neoclassical elements and a bilingual dictionary. The monolingual dictionary of the source language helps in detecting the form of neoclassical compounds. The alignment process produces a list of aligned neoclassical compounds.
[Figure 1: Global architecture of the program. The Extractor takes the bilingual comparable corpora with linguistic annotation and the monolingual neoclassical elements, and produces the Neoclassical Annotation. The Aligner takes the Neoclassical Annotation, the aligned neoclassical elements, the source monolingual dictionary and the bilingual dictionary, and produces the aligned neoclassical compounds.]

2.2 Handled neoclassical compounds forms

The program is able to translate neoclassical compounds that are adjectives or nouns and that belong to one of the following forms:

Root1+ Root2

The first form includes neoclassical compounds that consist only of neoclassical elements. Root1 is a neoclassical element of type ICF, while Root2 is a neoclassical element of type FCF. One or more neoclassical elements of type Root1 can appear sequentially; this is expressed by (Root1+). Examples are given in (1), where the decomposition of each compound is shown in parentheses: the last component is the Root2 element and the preceding ones are Root1 elements.

(1) FR histopathologie (histo/patho/logie), FR monomorphe (mono/morphe), EN histogram (histo/gram), EN radiology (radio/logy), EN biotechnology (bio/techno/logy), DE radiometrie (radio/metrie), DE biotechnologie (bio/techno/logie)

Root1+ Word

This form includes one or more ICFs, represented by Root1+, combined with a native word. Examples are given in (2), where the decomposition of each compound is shown in parentheses: the last component is the native word and the preceding ones are ICFs.

(2) FR cardiovasculaire (cardio/vasculaire), FR photosensibilisateur (photo/sensibilisateur), EN biomedical (bio/medical), EN photobioreactor (photo/bio/reactor), EN microhydroelectric (micro/hydro/electric), DE ferroelektrisch (ferro/elektrisch), DE kardiovaskulär (kardio/vaskulär)

Our neoclassical detector program handles the most productive forms of neoclassical compounds. Neoclassical compounds can also occur in other forms that are not covered here, e.g. FR antibiogramme, since anti is not considered to be a neoclassical element (it is not an ICF but a prefix).

3 Resources

In this section, we present the resources needed by the program to align neoclassical compounds.

3.1 Comparable corpora

The program needs a comparable corpus for each of the two languages (source and target). Each corpus is stored in a text file. An entry in the text file should have the following format:

Word PartOfSpeech Lemma

where:
Word: a token in the corpus
PartOfSpeech: the part of speech of the word, in Multext format (e.g. A: Adjective, N: Noun)
Lemma: the lemma of the word

The corpora are used to extract two lists of neoclassical compounds: the first (NCls) belongs to the source language, and the second (NClt) belongs to the target language.

3.2 Monolingual neoclassical elements

Monolingual lists NEls and NElt of predefined neoclassical elements for the source and target languages are used. All possible forms (ICFs and FCFs) that are borrowed from the same Greek or Latin word should be listed. Note that we follow [5, p. 153] in considering that the element o or i in neoclassical roots, such as in cardio, neuro or centi, belongs to the root. For example, a French neoclassical elements list would contain the forms techno-, -technie and -technique as neoclassical elements borrowed from the Greek word technos.
Each list (NEls and NElt) is stored in a text file where entries have the following format (parameters are separated by tabs):

[NC_Origin:neoclassical_Element_Label] ICF1- ICF2- … ICFN- -FCF1 -FCF2 … -FCFN

where:
NC_Origin: the origin of the neoclassical element; it can take two values, greek or latin. This information is purely informative and is not used in the following procedures.
neoclassical_Element_Label: a label given to the ICFs and FCFs that are borrowed from the same Greek or Latin word
ICF: an initial combining form
FCF: a final combining form

3.3 Aligned neoclassical elements

An aligned list NEA between the neoclassical elements of the two languages (source and target) is required. This list aligns all possible allomorphs of a neoclassical element in the source language with all possible allomorphs of the equivalent neoclassical element in the target language by aligning only their labels (see neoclassical_Element_Label in section 3.2). The list is stored in a text file where entries have the following format (parameters are separated by tabs):

neoclassical_Element_Label_ls neoclassical_Element_Label_lt

3.4 Bilingual dictionary

The program needs a general bilingual dictionary of the source and target languages, stored in a text file. An entry has the following format (parameters are separated by tabs):

Ws PartOfSpeechws Wt; PartOfSpeechwt

where:
Ws: a word in the source language
PartOfSpeechws: part of speech of the source word, in Multext format
Wt: a translation of Ws in the target language
PartOfSpeechwt: part of speech of the target word, in Multext format

3.5 Monolingual dictionary

The program uses a general monolingual dictionary (text file format) of the source language.
It can be used to help detect the form of a source language neoclassical compound. Entries in this dictionary should have the following format:

W PartOfSpeechw

where:
W: a word in the source language
PartOfSpeechw: part of speech of the word, in Multext format

If no monolingual dictionary is provided to the program, the monolingual part of the bilingual dictionary is used.

4 Algorithm

The algorithm aims at aligning neoclassical compounds from the corpus of the source language (ls) with their equivalents in the corpus of the target language (lt). First, it extracts neoclassical compounds for each language from the corpus using NEls and NElt; this results in lists of source and target neoclassical compound candidates, NCls and NClt. Then, it aligns each neoclassical compound in NCls with its equivalent(s) in NClt. It follows the two main steps of the compositional methods for aligning complex terms [4]: the extraction of neoclassical compound candidates, and the alignment of neoclassical compounds by the generation of translation candidates and the selection of correct translations.

4.1 Extraction of neoclassical compounds candidates

The source and target lists of neoclassical compound candidates (NCls and NClt) are obtained by projecting NEls on the corpus of language ls, and NElt on the corpus of language lt. The adjectives or nouns that contain at least one neoclassical element (ICF or FCF) are considered as neoclassical compound candidates. An ICF can appear at the beginning or anywhere in the middle of a neoclassical compound, e.g. the ICFs bio, geo and morpho appear in biogeomorphological. FCFs are found at the end of neoclassical compounds, such as pathy in neuropathy and logie in biotechnologie.
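The projection described above can be illustrated with a minimal Python sketch. It is not the implementation of the deliverable: the function names, the use of plain sets instead of tree structures, the greedy left-to-right longest-match scan and the tag set {"A", "N"} are all assumptions made for illustration.

```python
def detect_icfs(word, icfs):
    """Scan the word left to right. At each position keep the longest ICF
    that matches, then jump past it so that detected ICFs never overlap.
    (Hypothetical helper; a plain set stands in for the element list.)"""
    found = []
    i = 0
    while i < len(word):
        match = max((e for e in icfs if word.startswith(e, i)),
                    key=len, default=None)
        if match:
            found.append((i, match))
            i += len(match)
        else:
            i += 1
    return found


def detect_fcf(word, fcfs):
    """Return the longest FCF the word ends with, or None."""
    return max((e for e in fcfs if word.endswith(e)), key=len, default=None)


def is_candidate(word, pos, icfs, fcfs, tags=("A", "N")):
    """A noun or adjective with at least one ICF or FCF is a candidate.
    The tag values are an assumption based on the Multext examples."""
    if pos not in tags:
        return False
    return bool(detect_icfs(word, icfs)) or detect_fcf(word, fcfs) is not None


icfs = {"bio", "geo", "morpho", "deci"}
fcfs = {"pathy", "path", "logie"}
print(detect_icfs("biogeomorphological", icfs))
print(detect_fcf("neuropathy", fcfs))
print(is_candidate("decision", "N", icfs, fcfs))
```

With these toy lists, decision is accepted as a candidate because it starts with deci, which shows why the extracted lists need the later form check.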
The extracted lists will contain true neoclassical compounds such as radiograph (radio/ICF, graph/FCF), as well as false candidates like decision, because deci will be taken for a neoclassical element (ICF).

4.2 Alignment of neoclassical compounds

4.2.1 Generation of translation candidates

The projection made in the extraction phase decomposes each extracted candidate into two or more parts (components), at least one of which is a neoclassical element. The form of the candidate is then checked and, if it is identified as one of the two forms presented in section 2.2, the method tries to generate its translation candidates. The translation candidates are generated from the translations of the individual components of the neoclassical compound candidate (NCs). The generation succeeds if all components of NCs are identified.

1- All components of NCs are identified

R1+ R2: if all the components have been identified as neoclassical elements (one or more ICFs represented by R1+ and one FCF represented by R2), we generate the translation candidates by using the aligned neoclassical elements list NEA. This means that we search for the equivalents of each identified neoclassical element in the target language. If at least one equivalent is found for each component, all possible combinations of the found equivalents are generated while preserving the order of the constituents (R1 < R2) of the source neoclassical compound candidate NCs. We require that the equivalents of each R1 be ICFs and that the equivalent of R2 be an FCF. For example, suppose that we identify the two components (neuro- and -logy) as neoclassical elements in the neoclassical compound neurology.
To generate its French translation candidates, we search for the equivalents of neuro-, which are neuro- and névro-, as well as the equivalent of -logy, which is -logie. Accordingly, two translation candidates are generated: neurologie and névrologie.

R1+ W: if all the components except the last one have been identified as ICFs, represented by R1+, we check whether the last part can be identified as a known word (W) in the monolingual dictionary. If this is the case, we generate the translation candidates by using the aligned neoclassical elements list NEA to look for the equivalents of the neoclassical elements, and the bilingual dictionary to find all the possible translations of the word. If at least one equivalent is found for each component, all the possible combinations of the equivalents of the components are generated to form possible translations of NCs. The order (R1+ W) is preserved and each equivalent of R1 must be of type ICF. For example, in a neoclassical candidate like FR bioscience, we can identify FR bio as a neoclassical element and FR science as a word in the dictionary. The ICF equivalent of FR bio in English is bio, while the translations of FR science would be art, science, information, knowledge and learning. Consequently, five translation candidates are generated: bioart, bioscience, bioinformation, bioknowledge and biolearning.

2- One component (at least) of NCs is not identified

If a neoclassical compound candidate has been extracted (since it contains a neoclassical element) but cannot be identified as one of the two forms above, the generation fails. This can be due to several reasons:

o False neoclassical element: a candidate like EN decision will be decomposed into two components, the first of which is deci, which will be considered as a neoclassical element (false neoclassical element).
The second is sion, which is neither a neoclassical element nor a known word in the monolingual dictionary. To generate translation candidates for as many extracted candidates as possible, we try to identify the neoclassical compound under a different form from the one it was first identified under; we do this only when the generation fails. For example, a neoclassical compound like radioprotective could be extracted with two neoclassical elements identified: radio (a true element) and prot (a false element). This means that radioprotective will be identified as the form (ICF ICF X), where X is neither a neoclassical element nor a known word. To address this problem, we omit prot from the identified neoclassical elements. Consequently, we are able to identify radioprotective by the form (radio: ICF, protective: word).

o Missing neoclassical element: a candidate like FR métronome could be extracted, where métro would be identified as a neoclassical element, while nome (a neoclassical element) would not be identified if it is missing from the monolingual neoclassical elements list.

o Untreated neoclassical form: a true neoclassical candidate is extracted, but its form cannot be identified. There exist other forms of neoclassical compounds that we do not treat in our method. For example, EN antibiogram (anti: native prefix, bio: ICF, gram: FCF) belongs to a form that our method does not cover.

4.2.2 Selection of correct translations

Each translation candidate (obtained in the generation phase) is searched for in the target neoclassical compounds list NClt. If the candidate is found, it is considered as a valid translation of its source neoclassical compound NCs.
For example, if two French translation candidates were generated for EN neurology, neurologie and névrologie, they would be searched for in the target neoclassical list NClt. The candidate névrologie would not be found as it is not the correct translation, but there is a strong probability that neurologie would be found, and therefore considered to be a valid translation.

The algorithm of the two steps is illustrated in figure 2.

Algorithm: Neoclassical alignment program
Input: source and target corpus (Cs and Ct)
    NCls ← Extractor(Cs)
    NClt ← Extractor(Ct)
    Aligner(NCls, NClt)

Algorithm: Extractor
Input:
    - monolingual list (source - target) of neoclassical elements
    - corpus C
Output: neoclassical compounds candidates
    For each W = (Adjective or Noun) in C
        detectPrefixes(W)
        detectSuffix(W)

Algorithm: Aligner
Input:
    - list of aligned neoclassical elements NEA
    - bilingual dictionary
    - monolingual dictionaries
    - lists of neoclassical compounds NCls and NClt
Output: aligned neoclassical compounds
    For each neoclassical compound NCs in NCls
        components of NCs ← detectNeoclassicalForm(NCs)
        If (all components of NCs are identified)
            translationCandidates ← generateTranslations(components of NCs)
            selectCorrectTranslations(translationCandidates, NClt)

Figure 2: Algorithm for aligning neoclassical compounds

5 Implementation

Two components are implemented: the extractor and the aligner. The extractor first extracts neoclassical compound candidates, and then the aligner aligns the extracted neoclassical compounds. Two program interfaces have been implemented: the first uses the UIMA framework, and the second is written in pure Java so that the program can easily be run from the command line.
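The generation and selection steps of section 4.2 (the body of the Aligner in figure 2) can be sketched in a few lines of Python. This is an illustrative sketch under simplifying assumptions, not the code of the deliverable: the function names and parameters are hypothetical, and the dictionaries map source forms directly to target forms, whereas the NEA list described in section 3.3 aligns labels.

```python
from itertools import product


def generate_translations(icfs, last, last_is_fcf, nea_icf, nea_fcf, bilingual):
    """Build all translation candidates for one decomposed compound.

    icfs        -- source ICFs in their original order (R1+)
    last        -- final component: an FCF (R1+ R2) or a native word (R1+ W)
    last_is_fcf -- True for the form R1+ R2, False for R1+ W
    nea_icf, nea_fcf, bilingual -- hypothetical dicts: source form -> target forms
    """
    # Each ICF must be translated by ICF equivalents only.
    options = [nea_icf.get(r, []) for r in icfs]
    # The last component is translated by FCFs or by dictionary entries.
    lookup = nea_fcf if last_is_fcf else bilingual
    options.append(lookup.get(last, []))
    if any(not o for o in options):
        return []  # generation fails: some component has no equivalent
    # product() keeps the order of the constituents.
    return ["".join(parts) for parts in product(*options)]


def select_correct_translations(candidates, target_compounds):
    """Keep only the candidates attested in the target list NClt."""
    return [c for c in candidates if c in target_compounds]


nea_icf = {"neuro": ["neuro", "névro"]}
nea_fcf = {"logy": ["logie"]}
candidates = generate_translations(["neuro"], "logy", True, nea_icf, nea_fcf, {})
print(candidates)
print(select_correct_translations(candidates, {"neurologie", "biologie"}))
```

The number of candidates grows as the product of the number of equivalents of each component, which is why the final lookup in the target list is what filters out forms such as névrologie.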

5.1 Components in UIMA

5.1.1 Extractor

The extractor component detects neoclassical compounds for the source and target languages from comparable corpora. It relies on tree data structures built from the neoclassical element lists presented in section 3.2. The input, output and resources for this UIMA component are defined as follows:
- Input: annotated corpus with NounAnnotation and AdjectiveAnnotation.
- Output: Neoclassical Annotation.
- Resources: source neoclassical elements file, target neoclassical elements file.

For the extraction of neoclassical compounds, two tree structures are used to store the neoclassical elements. The ICF tree stores the neoclassical elements of type ICF (see Figure 3), while the FCF tree stores the neoclassical elements of type FCF (see Figure 4). The leaves of these trees consist of labels along with an origin (e.g. greek, latin); the labels are meant to group the ICFs/FCFs. For example, the label bio groups bio- and -bie together.

[Figure 3: ICF tree of neoclassical elements of type ICF (auto-, agri-, bio-) labeled by (Auto, Agri and Bio)]

[Figure 4: FCF tree of neoclassical elements of type FCF (-bie, -crate, -carde) labeled by (Bio, Cardio and Cratie)]

The extractor checks for an ICF or an FCF in a noun or an adjective. Adjectives and nouns are identified by AdjectiveAnnotation and NounAnnotation respectively. The two main methods of the extractor are:

detectICF(): detecting ICFs. This method takes an adjective or a noun and checks whether it contains an ICF by searching in the neoclassical ICF tree.
For each adjective or noun of length (n), it produces (n-3) chains from its letters. We illustrate this by taking the candidate biotechnology as an example; the method produces the chains shown in figure 5.

1. biotechnology
2. iotechnology
3. otechnology
4. technology
5. echnology
6. chnology
7. hnology
8. nology
9. ology
10. logy

Figure 5: chains produced by detectICF() for the string biotechnology

For each chain, the method searches for the longest ICF that exists in the neoclassical ICF tree. In this example, the first chain biotechnology starts with bio, which is found in the ICF tree. The fourth chain starts with techno, which is also found in the ICF tree, and so on. The method guarantees that the identified ICFs do not overlap. For example, the chains iotechnology and otechnology are not checked for ICFs since bio was detected in the first chain: the fourth chain is checked directly after the first one.

detectFCF(): detecting an FCF. This method takes an adjective or a noun and tries to detect the longest FCF that the word ends with and that exists in the neoclassical FCF tree (see figure 4).

5.1.2 Aligner

The aligner aligns each NCs in the source list NCls with its equivalent(s) in the target list NClt.
- Input: Neoclassical Annotation
- Output: list of aligned neoclassical compounds
- Resources: EuRADic French-English bilingual dictionary, list of aligned neoclassical elements NEA, and monolingual lists of neoclassical elements (source and target).

The alignment process uses a list of aligned neoclassical elements that are linked to the monolingual lists, as in the structure shown in figure 6. First, the component builds the neoclassical compounds lists (NCls and NClt) using the NeoclassicalAnnotation produced by the extractor.
A neoclassical compound is an object that consists of a list of neoclassical elements (ICFs, FCFs, or both).

[Figure 6: sample of the aligned French-English neoclassical elements (NEA)]

The aligning process consists of three main methods:

detectNeoclassicalForm(): this method tries to identify the form of a neoclassical compound by examining the list of neoclassical elements it consists of (see section 2.2 for the covered forms). Moreover, if the candidate has both one or more ICFs and an FCF, the method verifies that the last ICF and the FCF do not overlap; if they do, it discards the last ICF.

generateTranslations(): if the candidate is of the form (R1+ R2), equivalents of each root are searched for in NEA. Otherwise, if the candidate is of the form (R1+ W), equivalents of the ICFs are searched for in NEA, while equivalents of W are taken from the EuRADic bilingual dictionary. If equivalents have been found for each component, all possible combinations of these equivalents are generated following the original form (see section 4.2 for examples).

selectCorrectTranslations(): if a generated translation candidate exists in the target list NClt, the translation candidate is identified as a correct (valid) translation.

5.2 Command line

The program has also been implemented as a pure Java program to facilitate its execution. Like the UIMA program, it consists of two main components (classes), Extractor and Aligner, with the same methods. The only difference lies in the output of the extractor component, which in this case is a list of neoclassical compounds.
6 Experiments and Evaluation

6.1 Resources used for experiments

6.1.1 Comparable corpora

We ran the experiments on two French-English comparable corpora covering two different topics. The first relates to the renewable energy domain; it consists of 6101 documents containing about 213 800 nouns and adjectives. The second relates to the breast cancer topic; it includes 354 documents.

6.1.2 Monolingual neoclassical elements

The 113 French neoclassical elements are taken from [5, p. 153]. The 83 English neoclassical elements were chosen from www.canoo.net.

An example of the file that stores French neoclassical elements:

[greek:patho] patho- -pathe -pathie
[latin:cide] -cide
..Etc

An example of the file that stores English neoclassical elements:

[greek:patho] patho- -path -pathy
[latin:cide] -cide
..Etc

6.1.3 Aligned neoclassical elements

We manually aligned the list of 83 English neoclassical elements with their equivalents in the French neoclassical elements list. An example of the file that stores aligned neoclassical elements:

patho patho
cide cide
..Etc

The first column corresponds to the English neoclassical elements, and the second to the French neoclassical elements.

6.1.4 Bilingual dictionary

We use the bilingual French-English dictionary [3] that was built and improved within the French national project EuRADic (European and Arabic Dictionaries and Corpora), as part of the Technolangue programme funded by the French Ministry of Industry. It contains 243 580 entries with their part of speech. Example of an entry:

valide J valid J

where J indicates that the word is an adjective.

6.1.5 Monolingual dictionaries

We use the monolingual part of the EuRADic bilingual dictionary.

6.2 Results

We carried out experiments using the resources presented in section 6.1.
The results obtained for the French-English language pair are shown in tables 1 and 2 (experiments on other languages will be completed later). We detail here the FR-EN results of table 1 for the renewable energy corpus.

Using the 113 French neoclassical elements, we extracted 2 052 nouns and adjectives that contain at least one neoclassical root. These are considered to be the neoclassical compound candidates, although many of them are false candidates, such as decision, histoire, réunion, solide and protecteur. Note that a neoclassical compound is extracted only if it contains at least one element that exists in our lists of neoclassical elements.

Using the 83 aligned French-English neoclassical elements and the EuRADic bilingual dictionary, we generated translation candidates for 287 neoclassical compound candidates. For 137 of these 287 candidates, the correct translation among the generated candidates was found in the target neoclassical list. A generated translation candidate that is not found is not necessarily a wrong translation; it may be a correct translation that is missing from the target corpus.

To evaluate our results, we looked up the found translations in the bilingual dictionary: 104 of them already exist there, which means that 33 new neoclassical compounds (not existing in the bilingual dictionary) were aligned with their equivalents. We verified the found translations (aligned neoclassical compounds) manually; the precision obtained by the alignment is given in tables 3 and 4. An example of a false positive is the alignment of FR télécommande with EN telecontrol, whereas the correct translation is EN remote control.
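The arithmetic behind these figures can be checked with a short Python sketch. The function name alignment_funnel and the labels generation_rate and found_rate are hypothetical conveniences introduced here; they are not metrics defined in this document and should not be confused with the precision and recall values it reports.

```python
def alignment_funnel(candidates, generated, found, in_dictionary):
    """Derive simple ratios and the number of new alignments from one result row."""
    return {
        # share of extracted candidates for which translations could be generated
        "generation_rate": generated / candidates,
        # share of those whose translation was found in the target list
        "found_rate": found / generated,
        # found translations that were not already in the bilingual dictionary
        "new_alignments": found - in_dictionary,
    }


# FR-EN figures for the renewable energy corpus, as quoted in the text above.
renewable_fr_en = alignment_funnel(
    candidates=2052, generated=287, found=137, in_dictionary=104
)
print(renewable_fr_en["new_alignments"])  # 33
print(f"{renewable_fr_en['generation_rate']:.1%}")
print(f"{renewable_fr_en['found_rate']:.1%}")
```

The subtraction 137 - 104 reproduces the 33 new alignments stated in the text; the two ratios only summarise how many candidates survive each stage.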
We calculated the recall on the renewable energy corpus by examining a sample of 200 neoclassical candidates with a frequency of at least 5. We obtained a recall of 19% for the FR-EN alignment and a recall of 34% for the EN-FR alignment.

Corpus             Neoclassical aligned elements   Candidates   Generation succeeded   Found translations   In dictionary
Renewable energy   83                              2052         287                    137                  104
Breast cancer      83                              1513         249                    91                   46

Table 1: Alignment of French neoclassical compounds with their English equivalents

Corpus             Neoclassical elements   Candidates   Generation succeeded   Found translations   In dictionary
Renewable energy   83                      6115         719                    162                  108
Breast cancer      83                      1218         279                    95                   48

Table 2: Alignment of English neoclassical compounds with their French equivalents

                   Form1 (R1+ R2)                                        Form2 (R1+ W)
Corpus             Generation succeeded   Found translations   Precision   Generation succeeded   Found translations   Precision
Renewable energy   60                     40                   100%        227                    97                   95%
Breast cancer      48                     34                   100%        201                    57                   98%

Table 3: Precisions of the alignment of French neoclassical compounds with their English equivalents

                   Form1 (R1+ R2)                                        Form2 (R1+ W)
Corpus             Generation succeeded   Found translations   Precision   Generation succeeded   Found translations   Precision
Renewable energy   68                     46                   100%        651                    116                  96%
Breast cancer      46                     30                   100%        233                    65                   98%

Table 4: Precisions of the alignment of English neoclassical compounds with their French equivalents

7 Conclusion

In this document, we presented the program that constitutes deliverable D-4.1. The program aims to align neoclassical compounds between two languages (source and target); it identifies two types of neoclassical compounds. The required resources are specified in section 3: corpora in the source and target languages, monolingual neoclassical elements, aligned neoclassical elements, monolingual dictionaries and a bilingual dictionary.
Two UIMA components were implemented: the first detects neoclassical compounds, and the second aligns the extracted neoclassical compounds between the source and target languages. The results show very high precision for aligning neoclassical compounds of the two handled structures. We aim to extend the method so that it covers other possible forms of neoclassical compounds. We also aim to investigate whether a learning method can find equivalents for neoclassical elements that do not exist in our list of aligned neoclassical elements.

8 References

[1] http://uima.apache.org/
[2] D. Amiot, G. Dal. La composition néoclassique en français et l’ordre des constituants. In: La composition dans les langues, Artois Presses Université. 2008. pp. 89-113.
[3] SCI-FRAN-EURADIC Dictionnaire bilingue français-anglais. http://catalog.elra.info/product_info.php?products_id=666
[4] X. Robitaille, Y. Sasaki, M. Tonoike, S. Sato, T. Utsuro. Compiling French-Japanese terminologies from the web. In: Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, EACL’06. 2006. pp. 225-232.
[5] H. D. Phonétique et morphologie du français moderne et contemporain. 1989.
[6] R. Estopa, J. Vivaldi, M. T. Cabré. Use of Greek and Latin forms for term detection. In: Proceedings of the Second International Conference on Language Resources and Evaluation. 2000. pp. 885-859.
[7] A. E. van Niekerk. The lexicographical treatment of neo-classical compounds. Bureau of the Dictionary of the Afrikaans Language.
[8] L. Bauer. English word-formation. Cambridge University Press. 1983.
[9] F. Namer, R. H. Baud. Defining and relating biomedical terms: Towards a cross-language morphosemantics-based system. I. J. Medical Informatics. 2007. pp. 226-233.