here - LINA Nantes

TTC project
Terminology Extraction, Translation Tools and Comparable Corpora
Project duration: 1st of January 2010 to 31st of December 2012 (36 months)
The research leading to these results has received funding from the European
Community's Seventh Framework Programme (FP7/2007-2013) under grant
agreement n°248005.
Deliverable ID: D-4.1
Document title: UIMA components to extract neoclassical terms and to align them with their translations
Version: 4
Version date: 07/10/11
Status: Final version
Dissemination status:
Deliverable responsible: UN
Author: UN
FP7– Information Society and Media – ICT-2009.2.2: Language-based interaction
TTC project – GA n°248005
Document Revision Information
Date        Version   Changes
20/06/11    1         Initial version
01/09/11    2         First revision
13/09/11    3         Second revision
Table of contents
1 Introduction
1.1 Context and objectives
1.2 Neoclassical compounds
2 Neoclassical compound detection and alignment
2.1 Global architecture
2.2 Handled neoclassical compounds forms
3 Resources
3.1 Comparable corpora
3.2 Monolingual neoclassical elements
3.3 Aligned neoclassical elements
3.4 Bilingual dictionary
3.5 Monolingual dictionary
4 Algorithm
4.1 Extraction of neoclassical compounds candidates
4.2 Alignment of neoclassical compounds
4.2.1 Generation of translation candidates
4.2.2 Selection of correct translations
5 Implementation
5.1 Components in UIMA
5.1.1 Extractor
5.1.2 Aligner
5.2 Command line
6 Experiments and Evaluation
6.1 Resources used for experiments
6.1.1 Comparable corpora
6.1.2 Monolingual neoclassical elements
6.1.3 Aligned neoclassical elements
6.1.4 Bilingual dictionary
6.1.5 Monolingual dictionaries
6.2 Results
7 Conclusion
8 References
1 Introduction
1.1 Context and objectives
The TTC project aims at automatically generating bilingual terminologies from comparable
corpora in five European languages (English, French, German, Latvian and Spanish) as well as Russian
and Chinese. These bilingual terminologies are intended to feed Machine Translation (MT) tools and
Computer-Aided Translation (CAT) tools. Two important steps of the project are the
automatic extraction of monolingual terminologies in the different languages (WP3) and the bilingual
alignment of the extracted terminologies (WP4) from multilingual corpora.
WP4 is dedicated to improving methods for aligning terms from comparable
corpora. Task 4.1 of WP4 focuses on increasing the coverage of the bilingual dictionary by
developing a program for neoclassical compound detection in EN, FR and DE. For this purpose, we
aim at developing a method that automatically extracts neoclassical compounds in two languages
(source and target) from comparable corpora and aligns the extracted compounds. We
decided to make this method language independent, i.e. the same procedures are used for all three
language pairs (FR-EN, EN-DE and FR-DE). The method is based on the following assumptions:
(1) Neoclassical compounds are translated compositionally: each component
is translated individually and the final translation is the combination of the translated parts,
since the meaning of a neoclassical compound is often a combination of the meanings of its
constituent parts [9].
(2) The order of the constituents (i.e. components) of a source neoclassical compound is
preserved in the equivalent target neoclassical compound.
(3) Each neoclassical constituent element is translated with a neoclassical element of the same
type (for instance an ICF by an ICF, an FCF by an FCF…).
The second assumption is based on the fact that neoclassical word-formation in different languages
follows the model of Greek and Latin in forming terms [2].
We develop our method using the UIMA (Unstructured Information Management Architecture)
framework. This framework is chosen because it facilitates the processing of large volumes of texts.
Moreover, the UIMA framework enables applications to be decomposed into components where
each component can be dedicated to a particular task. The data flow between these components is
automatically managed by UIMA [1].
1.2 Neoclassical compounds
Describing new concepts usually requires creating new terms. Neoclassical word-formation is
a process used by many European languages, such as English, French and German [6]. It combines
elements borrowed from Greek or Latin to create neoclassical compounds. For example, the
neoclassical element bio combines with the neoclassical element graphy to form the neoclassical
compound biography. Another example is the French neoclassical compound androgyne, which
consists of two neoclassical elements: andro (man) and gyne (woman). The German neoclassical
compound radiologie is composed of the neoclassical element radio and the neoclassical element
logie.
Neoclassical elements (or roots), sometimes called combining forms, cannot play the role
of independent words in a sentence, i.e. they almost always occur in combination with
other elements. Each language may assimilate its borrowed neoclassical elements phonologically. For
example, the Greek word pathos is transliterated in English to the form pathy, as in cardiopathy,
while in French it is transliterated to pathie, as in FR cardiopathie. In addition, an
element can have different forms (allomorphs). For example, the English neoclassical element neuro
has two forms in French: neuro as in FR neurologie and névro as in FR névrodermite.
Neoclassical elements can appear at different positions in neoclassical compounds:
(1) initial position in a neoclassical compound, like homo- in homomorphic,
(2) final position such as -cide in genocide.
We follow L. Bauer [8, p. 214] in distinguishing between Initial Combining Forms (ICFs) and
Final Combining Forms (FCFs). ICFs include forms of neoclassical elements that appear at initial
positions (e.g. bio, cardio, patho…), while FCFs include forms of neoclassical elements that appear at
final positions (e.g. logy, cide, path…). ICFs may appear sequentially in a neoclassical compound (e.g.
histo and patho in histopathology).
Neoclassical word-formation is productive [7]; some scientific fields like medicine make intensive
use of neoclassical compounds [6]. Furthermore, a language can always borrow neoclassical
elements in order to form new terms that describe new concepts. The productivity of neoclassical
compounds makes their translation difficult since many of them are not likely to be listed in bilingual
dictionaries.
2 Neoclassical compound detection and alignment
In the following sections, we first introduce the architecture of the program that we developed to
detect and align neoclassical compounds (section 2.1), then the neoclassical compound forms that
the program is able to align (section 2.2).
We present the resources that must be provided to the program in section 3. The algorithm that we
propose is explained in section 4 and its implementation in section 5. Finally, we describe the overall
evaluation of the method in section 6.
2.1 Global architecture
The system consists of two components: Extractor and Aligner. The extractor extracts
neoclassical compounds in two languages (source - target) from bilingual comparable corpora by
using two lists of monolingual neoclassical elements. The aligner aligns the extracted neoclassical
compounds when provided with a list of aligned neoclassical elements and a bilingual dictionary. The
monolingual dictionary of the source language helps in detecting the form of neoclassical
compounds. The alignment process results in generating a list of aligned neoclassical compounds.
[Figure 1: Global architecture of the program. The Extractor takes the bilingual comparable corpora with linguistic annotation and the monolingual neoclassical elements, and produces the neoclassical annotation. The Aligner takes this annotation together with the aligned neoclassical elements, the source monolingual dictionary and the bilingual dictionary, and produces the aligned neoclassical compounds.]
2.2 Handled neoclassical compounds forms
The program is able to translate neoclassical compounds that are adjectives or nouns and that
belong to one of the following forms:
- Root1+ Root2
The first form includes neoclassical compounds that consist only of neoclassical elements.
Root1 is a neoclassical element of type ICF, while Root2 is a neoclassical element of type
FCF. One or more neoclassical elements of type Root1 can appear sequentially; this is
expressed by (Root1+). Examples are given in (1), with the segmentation into elements shown
in parentheses: the last element is the Root2 and all preceding elements are of type Root1.
(1) FR histopathologie (histo/patho/logie), FR monomorphe (mono/morphe), EN
histogram (histo/gram), EN radiology (radio/logy), EN biotechnology (bio/techno/logy),
DE radiometrie (radio/metrie), DE biotechnologie (bio/techno/logie)

- Root1+ Word
This form includes one or more ICFs, represented by Root1+, combined with a native word.
Examples are given in (2), with the segmentation shown in parentheses: the last component is
the word and all preceding components are roots.
(2) FR cardiovasculaire (cardio/vasculaire), FR photosensibilisateur
(photo/sensibilisateur), EN biomedical (bio/medical), EN photobioreactor
(photo/bio/reactor), EN microhydroelectric (micro/hydro/electric), DE ferroelektrisch
(ferro/elektrisch), DE kardiovaskulär (kardio/vaskulär)
Our program handles the most productive forms of neoclassical
compounds. Neoclassical compounds also occur in other forms that are not covered here, e.g. FR
antibiogramme, in which anti is not considered a neoclassical element (it is a prefix, not an ICF).
3 Resources
In this section, we present the resources needed for the program to align neoclassical compounds.
3.1 Comparable corpora
The program needs a bilingual comparable corpus, i.e. one corpus for each of the two languages
(source and target). Each corpus is stored in
a text file. An entry in the text file should have the following format:
Word PartOfSpeech Lemma
where:
- Word: a token in the corpus
- PartOfSpeech: the part of speech of the word, in Multext format (e.g. A:
Adjective, N: Noun)
- Lemma: the lemma of the word
The corpora are used to extract two lists of neoclassical compounds: the first (NCls) belongs to the
source language, and the second (NClt) belongs to the target language.
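As an illustration only (this code is not part of the TTC tools), the entry format above could be read as sketched below. The sketch assumes that the three fields are separated by whitespace and that a tag starting with N or A marks a noun or an adjective; the names Token, read_corpus and nouns_and_adjectives, as well as the sample lines, are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Token(NamedTuple):
    word: str
    pos: str
    lemma: str


def read_corpus(lines: Iterable[str]) -> List[Token]:
    """Parse 'Word PartOfSpeech Lemma' entries, skipping malformed lines."""
    tokens = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:  # assumption: exactly three whitespace-separated fields
            continue
        tokens.append(Token(*fields))
    return tokens


def nouns_and_adjectives(tokens: Iterable[Token]) -> List[str]:
    """Return the lemmas of tokens whose tag starts with N or A (assumed tag convention)."""
    return [t.lemma for t in tokens if t.pos[:1] in ("N", "A")]


sample = ["biofuels N biofuel", "are V be", "renewable A renewable"]
print(nouns_and_adjectives(read_corpus(sample)))  # ['biofuel', 'renewable']
```

Working on lemmas rather than surface forms is a choice of this sketch; the deliverable does not state which field is projected against the neoclassical elements.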
3.2 Monolingual neoclassical elements
Monolingual lists NEls and NElt of predefined neoclassical elements for the source and target
languages are used. All possible forms (ICFs and FCFs) that are borrowed from the same Greek or
Latin word should be listed. Note that we follow [5, p. 153] in considering that the element o or i in
neoclassical roots, such as in cardio, neuro or centi, belongs to the root. For example, a
French neoclassical elements list would contain the forms techno-, -technie and -technique as
neoclassical elements borrowed from the Greek word technos.
Each list (NEls and NElt) is stored in a text file where entries have the following format (parameters
are separated by tabs):
[NC_Origin:neoclassical_Element_Label]  ICF1-  ICF2-  …  ICFN-  -FCF1  -FCF2  …  -FCFN
where:
- NC_Origin: the origin of the neoclassical element; it can take two values: greek or latin. This
information is purely informative and is not used in the following procedures.
- neoclassical_Element_Label: a label given to the ICFs and FCFs that are borrowed from the same
Greek or Latin word
- ICF: an initial combining form
- FCF: a final combining form
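A minimal sketch of how such a file could be loaded is given below. It is not the project's code: it assumes, following the sample entries in section 6.1.2 (e.g. "[greek:patho] patho- -pathe -pathie"), that an ICF is written with a trailing hyphen and an FCF with a leading hyphen. The names Element and parse_elements are hypothetical.

```python
import re
from typing import Dict, Iterable, NamedTuple, Tuple


class Element(NamedTuple):
    origin: str            # "greek" or "latin" (informative only)
    icfs: Tuple[str, ...]  # initial combining forms, hyphen stripped
    fcfs: Tuple[str, ...]  # final combining forms, hyphen stripped


HEADER = re.compile(r"^\[(\w+):([^\]]+)\]$")


def parse_elements(lines: Iterable[str]) -> Dict[str, Element]:
    """Map each label to its origin, ICFs and FCFs."""
    elements = {}
    for line in lines:
        fields = line.split()  # tabs (or any whitespace) separate the parameters
        if not fields:
            continue
        match = HEADER.match(fields[0])
        if match is None:      # not an entry line
            continue
        origin, label = match.groups()
        # Assumption: 'patho-' denotes an ICF, '-pathie' denotes an FCF.
        icfs = tuple(f[:-1] for f in fields[1:] if f.endswith("-"))
        fcfs = tuple(f[1:] for f in fields[1:] if f.startswith("-"))
        elements[label] = Element(origin, icfs, fcfs)
    return elements


french = parse_elements(["[greek:patho]\tpatho-\t-pathe\t-pathie", "[latin:cide]\t-cide"])
print(french["patho"].icfs, french["patho"].fcfs)  # ('patho',) ('pathe', 'pathie')
```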
3.3 Aligned neoclassical elements
An aligned list NEA of the neoclassical elements of the two languages (source and target) is required. This list
aligns all possible allomorphs of a neoclassical element in the source language with all possible
allomorphs of the equivalent neoclassical element in the target language by aligning only their labels
(see neoclassical_Element_Label in 3.2).
The list is stored in a text file where entries have the following format (parameters are separated by
tabs):
neoclassical_Element_Label_ls  neoclassical_Element_Label_lt
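To make the role of the labels concrete, the following sketch resolves the target-language forms of a source form through the aligned labels. It is only an illustration under assumptions: the in-memory tables NE_SOURCE and NE_TARGET (label mapped to its ICFs and FCFs) and the functions parse_alignment and equivalents are hypothetical, and the sample data are limited to forms quoted in this document (EN as source, FR as target).

```python
from typing import Dict, Iterable, List, Tuple

Forms = Tuple[List[str], List[str]]  # (ICFs, FCFs) of one label

# Hypothetical extracts of the monolingual lists NEls (EN) and NElt (FR)
NE_SOURCE: Dict[str, Forms] = {"neuro": (["neuro"], []), "patho": (["patho"], ["path", "pathy"])}
NE_TARGET: Dict[str, Forms] = {"neuro": (["neuro", "névro"], []), "patho": (["patho"], ["pathe", "pathie"])}


def parse_alignment(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Read 'source_label<TAB>target_label' entries into source label -> target labels."""
    aligned: Dict[str, List[str]] = {}
    for line in lines:
        fields = line.split("\t")
        if len(fields) == 2:
            aligned.setdefault(fields[0].strip(), []).append(fields[1].strip())
    return aligned


def equivalents(form: str, kind: str, aligned: Dict[str, List[str]]) -> List[str]:
    """Target forms of the same kind ('ICF' or 'FCF') as the given source form."""
    index = 0 if kind == "ICF" else 1
    result: List[str] = []
    for label, forms in NE_SOURCE.items():
        if form in forms[index]:
            for target_label in aligned.get(label, []):
                result.extend(NE_TARGET.get(target_label, ([], []))[index])
    return result


nea = parse_alignment(["neuro\tneuro", "patho\tpatho"])
print(equivalents("neuro", "ICF", nea))  # ['neuro', 'névro']
print(equivalents("pathy", "FCF", nea))  # ['pathe', 'pathie']
```

Because only labels are aligned, adding a new allomorph to a monolingual list makes it available for alignment without touching NEA.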
3.4 Bilingual dictionary
The program needs a general bilingual dictionary of the source and target languages, stored in a text file.
An entry has the following format
(parameters are separated by tabs):
Ws  PartOfSpeechws  Wt;  PartOfSpeechwt
where:
- Ws: a word in the source language
- PartOfSpeechws: the part of speech of the source word, in Multext format
- Wt: a translation of Ws in the target language
- PartOfSpeechwt: the part of speech of the target word, in Multext format
3.5 Monolingual dictionary
The program uses a general monolingual dictionary (text file format) of the source language. It can be
used to help in detecting the form of a source language neoclassical compound. Entries in this
dictionary should be of the following format:
W  PartOfSpeechw
where:
- W: a word in the source language
- PartOfSpeechw: the part of speech of the word, in Multext format
If no monolingual dictionary is provided to the program, the monolingual part of the bilingual
dictionary will be used.
4 Algorithm
The algorithm aims at aligning neoclassical compounds from the corpus of the source language (ls) with
their equivalents in the corpus of the target language (lt). First, it extracts neoclassical compounds
for each language from the corpus using NEls and NElt; this results in lists of source and target
neoclassical compound candidates, NCls and NClt. Then, it aligns each neoclassical compound in NCls
with its equivalent(s) in NClt. It follows the two main steps of the compositional methods for aligning
complex terms [4]: the extraction of neoclassical compound candidates, and the alignment of
neoclassical compounds by the generation of translation candidates and the selection of correct
translations.
4.1 Extraction of neoclassical compounds candidates
The source and target lists of neoclassical compound candidates (NCls and NClt) are obtained by
projecting NEls on the corpus of language ls, and NElt on the corpus of language lt. Adjectives and
nouns that contain at least one neoclassical element (ICF or FCF) are considered neoclassical
compound candidates. An ICF can appear at the beginning or anywhere in the middle of a
neoclassical compound, e.g. the ICFs bio, geo and morpho appear in biogeomorphological. FCFs are found
at the end of neoclassical compounds, such as pathy in neuropathy and logie in biotechnologie. The
extracted lists contain true neoclassical compounds such as radiograph (radio/ICF, graph/FCF), as
well as false candidates such as decision, in which deci is wrongly taken to be a neoclassical
element (ICF).
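The projection step can be sketched as follows. This is a simplified illustration that uses plain substring tests instead of the tree structures of section 5.1.1; the sets ICFS and FCFS are a hypothetical sample of a monolingual list, and is_candidate and extract_candidates are names introduced here.

```python
from typing import Iterable, List

ICFS = {"radio", "bio", "geo", "morpho", "deci"}  # hypothetical sample of ICFs
FCFS = {"graph", "pathy", "logie"}                # hypothetical sample of FCFs


def is_candidate(word: str) -> bool:
    """True if the word contains an ICF that is followed by more letters, or ends with an FCF."""
    has_icf = any(0 <= word.find(icf) < len(word) - len(icf) for icf in ICFS)
    has_fcf = any(word.endswith(fcf) for fcf in FCFS)
    return has_icf or has_fcf


def extract_candidates(words: Iterable[str]) -> List[str]:
    """Keep the nouns/adjectives that contain at least one neoclassical element."""
    return [w for w in words if is_candidate(w)]


print(extract_candidates(["radiograph", "decision", "wind", "neuropathy"]))
# ['radiograph', 'decision', 'neuropathy']
```

As in the text, decision is kept at this stage because it starts with deci; such false candidates are only filtered out later, when their form cannot be identified.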
4.2 Alignment of neoclassical compounds
4.2.1 Generation of translation candidates
The projection made in the extraction phase decomposes each extracted neoclassical
candidate into two or more parts (components), at least one of which is a
neoclassical element. The form of a neoclassical candidate is checked and, if it is identified as
one of the two forms presented in section 2.2, the method tries to generate its translation
candidates. The translation candidates are generated from the translations of each
component of the neoclassical compound candidate (NCs). The generation can succeed only if all
components of NCs are identified.
1- All components of NCs are identified
- R1+ R2: if all the components have been identified as neoclassical elements (one or more ICFs,
represented by R1+, and one FCF, represented by R2), we generate the translation candidates
by using the aligned neoclassical elements list NEA, i.e. we search for the
equivalents of each identified neoclassical element in the target language. If at least one equivalent
is found for each component, all possible combinations of the found equivalents
are generated, preserving the order of the constituents (R1 before R2) of the source
neoclassical compound candidate NCs. We require that the equivalents of each R1 be
ICFs and the equivalents of R2 be FCFs. For example, suppose that we identify the
two components neuro- and -logy as neoclassical elements in the neoclassical compound
neurology. To generate its French translation candidates, we search for the equivalents of
neuro-, which are neuro- and névro-, and for the equivalent of -logy, which is -logie.
Accordingly, two translation candidates are generated: neurologie and névrologie.
- R1+ W: if all the components except the last one have been identified as ICFs (represented by R1+),
we check whether the last part is a known word (W) in the monolingual dictionary. If
this is the case, we generate the translation candidates by using the aligned neoclassical list
NEA to look for the equivalents of the neoclassical elements, and the bilingual dictionary to
find all the possible translations of the word. If at least one equivalent is found for each
component, all the possible combinations of the equivalents of the components are
generated, each forming a possible translation of NCs. The order (R1+ W) is preserved and
each equivalent of R1 must be of type ICF. For example, in a neoclassical candidate
like FR bioscience, we can identify FR bio as a neoclassical element and FR science as a word in
the dictionary. The ICF equivalent of FR bio in English is bio, while the translations of FR
science would be art, science, information, knowledge and learning. Consequently, five
translation candidates are generated: bioart, bioscience, bioinformation, bioknowledge
and biolearning.
2- One component (at least) of NCs is not identified
- If a neoclassical compound candidate has been extracted (since it contains a neoclassical
element) but cannot be identified as one of the two forms above, the generation
fails. This can be due to several reasons:
o False neoclassical element: a candidate like EN decision is decomposed into two
components. The first, deci, is wrongly considered a neoclassical
element (false neoclassical element). The second, sion, is neither a
neoclassical element nor a known word in the monolingual dictionary.
To generate translation candidates for as many extracted candidates as possible,
when the generation fails we try to identify the compound under a different form
from the one it was first identified under. For example, a neoclassical compound like
radioprotective could be extracted with two neoclassical elements identified: radio (a
true element) and prot (a false element). This means that radioprotective is
identified as the form (ICF ICF X), where X is neither a neoclassical element nor a
known word. To address this problem, we omit prot from the identified neoclassical
elements. Consequently, radioprotective is identified as the form
(radio: ICF, protective: word).
o Missing neoclassical element: a candidate like FR métronome could be extracted,
where métro is identified as a neoclassical element, while nome (a
neoclassical element) is not identified because it is missing from the monolingual
neoclassical elements list.
o Untreated neoclassical form: a true neoclassical candidate is extracted, but its form
cannot be identified. There exist other forms of neoclassical compounds that we do
not treat in our method. For example, EN antibiogram (anti: native prefix, bio: ICF,
gram: FCF) belongs to a form that our method does not cover.
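The generation step for the two covered forms can be sketched as a Cartesian product over the equivalents of each component. The code below is an illustration, not the deliverable's implementation: the lookup tables are hypothetical stand-ins for NEA and the bilingual dictionary, and for brevity they mix the EN-to-FR example (neurology) and the FR-to-EN example (bioscience) from the text.

```python
from itertools import product
from typing import List, Sequence

# Hypothetical lookup tables (a real run would use one translation direction only)
ICF_EQUIV = {"neuro": ["neuro", "névro"], "bio": ["bio"]}
FCF_EQUIV = {"logy": ["logie"]}
WORD_TRANSLATIONS = {"science": ["art", "science", "information", "knowledge", "learning"]}


def generate_translations(icfs: Sequence[str], last: str, last_is_fcf: bool) -> List[str]:
    """Combine the equivalents of every component while keeping the source order."""
    options = [ICF_EQUIV.get(icf, []) for icf in icfs]   # each R1 maps to ICFs only
    table = FCF_EQUIV if last_is_fcf else WORD_TRANSLATIONS
    options.append(table.get(last, []))                  # R2 maps to FCFs, W to words
    if any(not alternatives for alternatives in options):
        return []  # generation fails when a component has no equivalent
    return ["".join(parts) for parts in product(*options)]


print(generate_translations(["neuro"], "logy", last_is_fcf=True))
# ['neurologie', 'névrologie']
print(generate_translations(["bio"], "science", last_is_fcf=False))
# ['bioart', 'bioscience', 'bioinformation', 'bioknowledge', 'biolearning']
```

The number of candidates is the product of the numbers of equivalents per component, which is why the selection step against the target list is needed afterwards.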
4.2.2 Selection of correct translations
Each translation candidate obtained in the generation phase is looked up in the target
neoclassical compounds list NClt. If the candidate is found, it is considered a valid translation
of its source neoclassical compound NCs. For example, if two French translation
candidates were generated for EN neurology, neurologie and névrologie, they would be looked up in
the target neoclassical list NClt. The candidate névrologie would not be found as it is not the correct
translation, but there is a strong probability that neurologie would be found, and therefore
considered to be a valid translation.
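The selection step amounts to a membership test against the target list. A minimal sketch, assuming that NClt is held in memory as a set of strings (the function name and sample data are hypothetical):

```python
from typing import Iterable, List, Set


def select_correct_translations(candidates: Iterable[str], target_compounds: Set[str]) -> List[str]:
    """Keep the generated candidates that are attested in the target list NClt."""
    return [c for c in candidates if c in target_compounds]


nc_target = {"neurologie", "cardiopathie"}  # hypothetical extract of NClt
print(select_correct_translations(["neurologie", "névrologie"], nc_target))  # ['neurologie']
```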
The algorithm of the two steps is illustrated in figure 2.

Algorithm: Neoclassical alignment program
Input: source and target corpus (Cs and Ct)
    NCls ← Extractor(Cs)
    NClt ← Extractor(Ct)
    Aligner(NCls, NClt)

Algorithm: Extractor
Input: - monolingual list (source - target) of neoclassical elements
       - corpus C
Output: neoclassical compound candidates
    For each W = (Adjective or Noun) in C
        detectPrefixes(W)
        detectSuffix(W)

Algorithm: Aligner
Input: - list of aligned neoclassical elements NEA
       - bilingual dictionary
       - monolingual dictionaries
       - lists of neoclassical compounds NCls and NClt
Output: aligned neoclassical compounds
    For each neoclassical compound NCs in NCls
        components of NCs ← detectNeoclassicalForm(NCs)
        If (all components of NCs are identified)
            translationCandidates ← generateTranslations(components of NCs)
            selectCorrectTranslations(translationCandidates, NClt)

Figure 2: Algorithm for aligning neoclassical compounds
5 Implementation
Two components are implemented: extractor and aligner. The extractor first extracts
neoclassical compound candidates, and then the aligner aligns the extracted neoclassical
compounds.
Two program interfaces have been implemented: the first uses the UIMA framework, and the
second is written in pure Java so that the program can be run from the command line.
5.1 Components in UIMA
5.1.1 Extractor
The extractor component detects neoclassical compounds for the source and target languages in
comparable corpora. It relies on tree data structures built from the neoclassical elements lists presented in section 3.2.
The input, output and resources of this UIMA component are defined as follows:
- Input: annotated corpus with NounAnnotation and AdjectiveAnnotation.
- Output: Neoclassical Annotation.
- Resources: source neoclassical elements file, target neoclassical elements file.
To extract neoclassical compounds, two tree structures are used to
store the neoclassical elements. The ICF tree stores the neoclassical elements of type ICF
(see Figure 3), while the FCF tree stores the neoclassical elements of type FCF (see Figure 4).
The leaves of these trees hold a label along with an origin (e.g. greek, latin); the label
groups the ICFs and FCFs borrowed from the same word. For example, the label bio groups bio- and -bie.
[Figure 3: ICF tree of neoclassical elements of type ICF (auto-, agri-, bio-) labeled by (Auto, Agri, and Bio); leaf origins: Auto: Greek, Agri: Latin, Bio: Greek]
[Figure 4: FCF tree of neoclassical elements of type FCF (-bie, -crate, -carde) labeled by (Bio, Cardio and Cratie); leaf origins: Bio: Greek, Cardio: Greek, Cratie: Greek]
The extractor checks whether a noun or an adjective contains an ICF or an FCF. Adjectives and nouns are identified by
AdjectiveAnnotation and NounAnnotation respectively.
The two main methods of the extractor are:
- detectICF(): detecting ICFs
This method takes an adjective or a noun and checks whether it contains an ICF by searching in the
neoclassical ICF tree. For each adjective or noun of length n, it produces (n-3) chains from its
letters, i.e. all its suffixes of at least four letters. For example, for the candidate biotechnology
(n = 13), the method produces the ten chains shown in figure 5.
1. biotechnology
2. iotechnology
3. otechnology
4. technology
5. echnology
6. chnology
7. hnology
8. nology
9. ology
10. logy
Figure 5: chains produced by detectICF() for the string biotechnology
For each chain, the method searches the neoclassical ICF tree for the longest ICF
that the chain starts with. In this example, the first chain biotechnology starts with bio, which is found in the
neoclassical ICF tree. The fourth chain starts with techno, which is also found in the
tree, and so on.
The method guarantees that the identified ICFs do not overlap. For example,
the chains iotechnology and otechnology are not checked for ICFs since bio was detected in
the first chain. The fourth chain is checked directly after the first one.

- detectFCF(): detecting an FCF neoclassical element
This method takes an adjective or a noun and tries to detect the longest FCF that the
word ends with and that exists in the neoclassical FCF tree (see figure 4).
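The two methods can be sketched with character tries, as below. This is an illustration under assumptions rather than the actual component: the nested-dictionary trie, the storage of FCFs in reversed order (so that suffix search becomes prefix search), the label Techno and all function names are choices made for this sketch.

```python
from typing import Dict, List, Optional, Tuple

LABEL = "$label"  # key under which a node stores the label of a complete form


def build_trie(forms: Dict[str, str], reverse: bool = False) -> dict:
    """Build a character trie from form -> label (forms are reversed for FCFs)."""
    root: dict = {}
    for form, label in forms.items():
        node = root
        for char in (reversed(form) if reverse else form):
            node = node.setdefault(char, {})
        node[LABEL] = label
    return root


def longest_match(trie: dict, chars: str) -> Optional[Tuple[int, str]]:
    """Length and label of the longest stored form that is a prefix of chars."""
    node, best = trie, None
    for depth, char in enumerate(chars, start=1):
        node = node.get(char)
        if node is None:
            break
        if LABEL in node:
            best = (depth, node[LABEL])
    return best


def detect_icf(word: str, icf_trie: dict) -> List[Tuple[int, str]]:
    """Non-overlapping ICFs as (start, form), scanning chains of at least 4 letters."""
    found, start = [], 0
    while start <= len(word) - 4:          # (n-3) chains at most
        match = longest_match(icf_trie, word[start:])
        if match:
            found.append((start, word[start:start + match[0]]))
            start += match[0]              # skip the chains covered by this ICF
        else:
            start += 1
    return found


def detect_fcf(word: str, fcf_trie: dict) -> Optional[str]:
    """Longest FCF that the word ends with, using a trie of reversed forms."""
    match = longest_match(fcf_trie, word[::-1])
    return word[-match[0]:] if match else None


ICF_TRIE = build_trie({"auto": "Auto", "agri": "Agri", "bio": "Bio", "techno": "Techno"})
FCF_TRIE = build_trie({"logy": "Logy", "pathy": "Patho", "cide": "Cide"}, reverse=True)
print(detect_icf("biotechnology", ICF_TRIE))  # [(0, 'bio'), (3, 'techno')]
print(detect_fcf("biotechnology", FCF_TRIE))  # logy
```

After bio is found at position 0, the scan resumes at position 3, so the chains iotechnology and otechnology are never examined, which mirrors the non-overlap guarantee described above.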
5.1.2 Aligner
The aligner aligns each NCs in the source list NCls with its equivalent(s) in the target list NClt.
- Input: Neoclassical Annotation
- Output: list of aligned neoclassical compounds
- Resources: EuRADic French-English bilingual dictionary, list of aligned neoclassical elements
NEA, and monolingual lists of neoclassical elements (source and target).
The alignment process uses a list of aligned neoclassical elements that are linked to
monolingual lists, as in the structure shown in figure 6.
Firstly, the component builds the neoclassical compounds lists (NCls and NClt) using the
NeoclassicalAnnotation produced by the extractor. A neoclassical compound is an object that
consists of a list of neoclassical elements (ICFs, FCFs, or both).
[Figure 6: sample of the aligned French-English neoclassical elements (NEA). The figure links aligned labels (e.g. auto, neuro, patho) to their forms in the monolingual lists NEls (e.g. neuro-, névro-, patho-, -pathe, -pathie) and NElt (e.g. neuro-, patho-, -path, -pathy).]
The aligning process consists of three main methods:
- detectNeoclassicalForm(): this method tries to identify the form of a neoclassical compound
by examining the list of neoclassical elements it consists of (see 2.2 for the covered forms).
Moreover, if the candidate has both one or more ICFs and an FCF, the method verifies that
the final ICF does not overlap with the FCF; if it does, the method deletes the final ICF.
- generateTranslations(): if the candidate is of the form (R1+ R2), the equivalents of each root are
searched in NEA. Otherwise, if the candidate is of the form (R1+ W), the equivalents of the ICFs are
searched in NEA, while the equivalents of W are taken from the EuRADic bilingual dictionary.
If equivalents have been found for each component, all possible combinations of these
equivalents are generated following the original form (see 4.2 for examples).
- selectCorrectTranslations(): if a generated translation candidate exists in the target list NClt,
the translation candidate is identified as a correct (valid) translation.
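The form identification can be sketched as follows. This is a simplification introduced here, not the project's code: it assumes the ICFs were detected contiguously from the start of the word, and it merges the overlap check and the fallback of section 4.2.1 (omitting a false element such as prot) into a single rule, "drop the last ICF and retry". The function name detect_form and its signature are hypothetical.

```python
from typing import List, Optional, Set, Tuple


def detect_form(word: str, icfs: List[str], fcf: Optional[str],
                dictionary: Set[str]) -> Optional[Tuple[str, List[str]]]:
    """Return ('R1+ R2' or 'R1+ W', components), or None if the form is not covered."""
    remaining = list(icfs)
    while remaining:
        prefix = "".join(remaining)
        if word.startswith(prefix):
            rest = word[len(prefix):]
            if fcf is not None and rest == fcf:
                return "R1+ R2", remaining + [rest]
            if rest in dictionary:          # known word in the monolingual dictionary
                return "R1+ W", remaining + [rest]
        remaining.pop()  # assumption: the last ICF may be false or overlap the FCF
    return None


print(detect_form("neurology", ["neuro"], "logy", set()))
# ('R1+ R2', ['neuro', 'logy'])
print(detect_form("radioprotective", ["radio", "prot"], None, {"protective"}))
# ('R1+ W', ['radio', 'protective'])
print(detect_form("decision", ["deci"], None, {"decision"}))
# None
```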
5.2 Command line
The program has also been implemented as a pure Java program to facilitate its execution. It
consists of the same two main components (classes) as the UIMA program, Extractor and Aligner, with
the same methods. The only difference lies in the output of the extractor component, which
is a list of neoclassical compounds in this case.
6 Experiments and Evaluation
6.1 Resources used for experiments
6.1.1 Comparable corpora
We ran the experiments on two French-English comparable corpora covering two different
topics. The first is related to the renewable energy domain; it consists of 6101 documents containing
about 213 800 nouns and adjectives. The second is related to the breast cancer topic; it includes 354
documents.
6.1.2 Monolingual neoclassical elements
The 113 French neoclassical elements are taken from [5, p. 153]. We have chosen 83 English
neoclassical elements from www.canoo.net.
An example of the file that stores French neoclassical elements:
[greek:patho] patho- -pathe -pathie
[latin:cide] -cide
etc.
An example of the file that stores English neoclassical elements:
[greek:patho] patho- -path -pathy
[latin:cide] -cide
etc.
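A file in this layout could be read with a few lines of Java. The sketch below is only an illustration: the class and method names are hypothetical, and it assumes one entry per line, a bracketed origin:root key, and whitespace-separated forms. Reading a trailing hyphen as an initial combining form and a leading hyphen as a final combining form is likewise an assumption drawn from the examples shown.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the file layout details are assumed from the examples above.
class ElementFileSketch {

    // Parses lines such as "[greek:patho] patho- -pathe -pathie" into root -> forms.
    static Map<String, List<String>> parse(List<String> lines) {
        Map<String, List<String>> formsByRoot = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            int close = trimmed.indexOf(']');
            if (!trimmed.startsWith("[") || close < 0) {
                continue; // skip lines that do not match the assumed layout
            }
            String key = trimmed.substring(1, close);          // e.g. "greek:patho"
            String root = key.substring(key.indexOf(':') + 1); // e.g. "patho"
            List<String> forms = new ArrayList<>();
            for (String token : trimmed.substring(close + 1).trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    forms.add(token);
                }
            }
            formsByRoot.put(root, forms);
        }
        return formsByRoot;
    }

    // Assumption: "patho-" (trailing hyphen) marks an initial combining form (ICF).
    static boolean isInitialForm(String form) {
        return form.endsWith("-");
    }

    // Assumption: "-pathie" (leading hyphen) marks a final combining form (FCF).
    static boolean isFinalForm(String form) {
        return form.startsWith("-");
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "[greek:patho] patho- -pathe -pathie",
                "[latin:cide] -cide");
        System.out.println(parse(lines));
    }
}
```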
6.1.3 Aligned neoclassical elements
We have manually aligned the list of 83 English neoclassical elements with their equivalents in the
French neoclassical elements list.
An example of the file that stores aligned neoclassical elements:
patho patho
cide cide
etc.
The first column corresponds to the English neoclassical elements, while the second corresponds to
the French neoclassical elements.
6.1.4 Bilingual dictionary
We use the bilingual French-English dictionary [3] that was built and improved within the French
national project EuRADic (European and Arabic Dictionaries and Corpora), as part of the
Technolangue programme funded by the French Ministry of Industry. It is easy to use and it contains
243 580 entries with their part of speech. Example of an entry:
valide J
valid J
Here, J indicates that the word is an adjective.
6.1.5 Monolingual dictionaries
We use the monolingual part of the bilingual dictionary EuRADic.
6.2 Results
We ran experiments using the resources presented in section 6.1. The results obtained
for the French-English language pair are shown in tables 1 and 2 (experiments on other
languages will be completed later). We detail here the FR-EN results of table 1 for the renewable
energy corpus. Using the 113 French neoclassical elements, we extracted 2 052 nouns and
adjectives that contain at least one neoclassical root. These are considered the neoclassical
compound candidates, although many of them are false candidates (e.g. decision, histoire,
réunion, solide, protecteur). Note that neoclassical compounds containing no element from our
neoclassical element lists are not extracted.
Using the 83 aligned French-English neoclassical elements and the EuRADic bilingual dictionary, we
generated translation candidates for 287 neoclassical compound candidates. For 137 of these
287 candidates, the correct translation among the generated candidates was found in the target
neoclassical list. A generated translation candidate that is not found is not necessarily
a wrong translation; it may be a correct translation that is missing from the target
corpus. To evaluate our results, we looked up these found translations in the bilingual
dictionary. Indeed, 104 of them already exist in the bilingual dictionary, which means that 33 new
neoclassical compounds (not existing in the bilingual dictionary) were aligned with their equivalents.
We verified the found translations (aligned neoclassical compounds) manually; the precision
obtained by the alignment is given in tables 3 and 4. An example of a false positive is the alignment
of FR télécommande with EN telecontrol, whereas the correct translation is EN remote control.
We calculated the recall on the renewable energy corpus by examining a sample of 200
neoclassical candidates with a frequency greater than or equal to 5. We obtained a recall of 19%
for the FR-EN alignment and a recall of 34% for the EN-FR alignment.
Corpus | Neoclassical aligned elements | Candidates | Generation succeeded | Found translations | In dictionary
Renewable Energy | 83 | 2052 | 287 | 137 | 104
Breast Cancer | 83 | 1513 | 249 | 91 | 46
Table 1: Alignment of French neoclassical compounds with their English
equivalents
Corpus | Neoclassical elements | Candidates | Generation succeeded | Found translations | In dictionary
Renewable energy | 83 | 6115 | 719 | 162 | 108
Breast cancer | 83 | 1218 | 279 | 95 | 48
Table 2: Alignment of English neoclassical compounds with their French
equivalents
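The count of 33 new alignments quoted in the text is the difference between the "Found translations" and "In dictionary" columns. As a small worked illustration, the same subtraction can be applied to every row of tables 1 and 2. The class and method names below are hypothetical; only the input figures come from the tables.

```java
// Hypothetical sketch: applies the "found minus already in dictionary" subtraction
// from the text to the rows of tables 1 and 2.
class ResultsSketch {

    // Number of aligned compounds that were not already in the bilingual dictionary.
    static int newAlignments(int foundTranslations, int inDictionary) {
        return foundTranslations - inDictionary;
    }

    public static void main(String[] args) {
        String[] labels = {
            "FR-EN, renewable energy",
            "FR-EN, breast cancer",
            "EN-FR, renewable energy",
            "EN-FR, breast cancer"
        };
        // {found translations, in dictionary}, copied from tables 1 and 2.
        int[][] rows = {{137, 104}, {91, 46}, {162, 108}, {95, 48}};
        for (int i = 0; i < rows.length; i++) {
            System.out.println(labels[i] + ": "
                    + newAlignments(rows[i][0], rows[i][1]) + " new alignments");
        }
    }
}
```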
Corpus | Form1 (R1+ R2) generation succeeded | Form1 (R1+ R2) found translations | Form1 (R1+ R2) precision | Form2 (R1+ W) generation succeeded | Form2 (R1+ W) found translations | Form2 (R1+ W) precision
Renewable energy | 60 | 40 | 100% | 227 | 97 | 95%
Breast Cancer | 48 | 34 | 100% | 201 | 57 | 98%
Table 3: Precisions of the alignment of French neoclassical compounds
with their English equivalents
Corpus | Form1 (R1+ R2) generation succeeded | Form1 (R1+ R2) found translations | Form1 (R1+ R2) precision | Form2 (R1+ W) generation succeeded | Form2 (R1+ W) found translations | Form2 (R1+ W) precision
Renewable energy | 68 | 46 | 100% | 651 | 116 | 96%
Breast Cancer | 46 | 30 | 100% | 233 | 65 | 98%
Table 4: Precisions of the alignment of English neoclassical compounds
with their French equivalents
7 Conclusion
In this document, we presented the program delivered as deliverable D-4.1. The program aims to align
neoclassical compounds in two languages (source-target); it identifies two types of neoclassical
compounds. The required resources are specified in section 3: corpora in the source and target
languages, monolingual neoclassical elements, aligned neoclassical elements, monolingual
dictionaries and a bilingual dictionary. Two UIMA components were implemented: the first
detects neoclassical compounds, and the second aligns the extracted
neoclassical compounds between the source and target languages. The results show very high
precision for aligning neoclassical compounds of the two handled forms. We aim to extend
the method so that it covers other possible forms of neoclassical compounds. We also aim to
investigate whether a learning method can find equivalents for neoclassical elements that do not
exist in our list of aligned neoclassical elements.
8 References
[1] http://uima.apache.org/
[2] D. Amiot, G. Dal « La composition néoclassique en français et l’ordre des constituants », in : La
composition dans les langues, Artois Presses Université. 2008. pp. 89-113.
[3] SCI-FRAN-EURADIC Dictionnaire bilingue français-anglais.
http://catalog.elra.info/product_info.php?products_id=666
[4] X. Robitaille, Y. Sasaki, M. Tonoike, S. Sato & T. Utsuro. Compiling french-japanese terminologies
from the web. In: Proceedings of the 11th conference of the European chapter of the association for
computational linguistics, EACL’ 06. 2006. pp. 225-232.
[5] H. D. Phonétique et morphologie du français moderne et contemporain. 1989.
[6] R. Estopa, J. Vivaldi, M. T. Cabré. Use of Greek and Latin forms for term detection. In: Proceedings
of the second international conference on language resources and evaluation. 2000. pp. 885-859.
[7] A. E. van Niekerk. The lexicographical treatment of neo-classical compounds. Bureau of the
dictionary of the Afrikaans language.
[8] L. Bauer. English word-formation. Cambridge university press. 1983.
[9] F. Namer, R. H. Baud. Defining and relating biomedical terms: Towards a cross-language
morphosemantics-based system. I. J. Medical Informatics. 2007. pp. 226-233.