Leveraging parallel corpora and existing wordnets for automatic construction of
the Slovene wordnet
Darja Fišer
Abstract
The paper reports on a series of experiments conducted in order to test the feasibility of automatically generating synsets for Slovene
wordnet. The resources used were the multilingual parallel corpus of George Orwell’s Nineteen Eighty-Four and wordnets for several
languages. First, the corpus was word-aligned to obtain multilingual lexicons and then these lexicons were compared to the wordnets
in various languages in order to disambiguate the entries and attach appropriate synset ids to Slovene entries in the lexicon. Slovene
lexicon entries sharing the same attached synset id were then organized into a synset. The results obtained by the different settings in
the experiment are evaluated against a manually created goldstandard and also checked by hand.
1 Introduction
Research teams have approached the construction of their
wordnets in different ways depending on the lexical
resources they had at their disposal. There is no doubt that
manual construction of an independent wordnet by experts
is the most reliable technique as it yields the best results in
terms of both linguistic soundness and accuracy of the
created database. But such an endeavor is an extremely
labor-intensive and time-consuming process, which is why
alternative, fully automated or semi-automatic approaches
have been proposed that have tried to leverage the existing
resources in order to facilitate faster and easier development
of wordnets.
The lexical resources that have proved useful for such a task
fall into several categories. (1) Princeton WordNet (PWN) is
an indispensable resource and serves as the backbone of
new wordnets in approaches following the expand model
(Vossen 1998). This model takes a fixed set of synsets from
PWN which are then translated into the target language,
preserving the structure of the original wordnet. The cost of
the expand model is that the resulting wordnets are heavily
biased by the PWN, which can be problematic when the
source and target linguistic systems are significantly
different. Nevertheless, due to its greater simplicity, the
expand model has been adopted in a number of projects,
such as the BalkaNet (Tufis 2000) and MultiWordNet
(Pianta, Bentivogli and Girardi 2002).
(2) A very popular approach is to link English entries from
machine-readable bilingual dictionaries to PWN synsets
under the assumption that their counterparts in the target
language correspond to the same synset (see Knight and Luk
1994). A well-known problem with this approach is that bilingual dictionaries are generally not concept-based but follow traditional lexicographic principles; the biggest obstacle to overcome if this technique is to be useful is therefore the disambiguation of dictionary entries.
(3) Machine-readable monolingual dictionaries have been
used as a source for extracting taxonomy trees by parsing
definitions to obtain genus terms and then disambiguating
the genus word, resulting in a hyponymy tree (see Farreres
et al. 1998). Problems that are inherent to the monolingual
dictionary (circularity, inconsistencies, genus omission) as
well as limitations with genus term disambiguation must be
borne in mind when this approach is considered.
(4) Once taxonomies in the target language are
available, they can be mapped to the target wordnet,
enriching it with valuable information and semantic
links at relatively low cost, or they can be mapped to
ontologies in other languages in order to create
multilingual ontologies (Farreres et al. 2004).
In the construction of the Slovene wordnet, our aim is to benefit from the resources we have available, which are mainly corpora. Based on the assumption that translations are a plausible source of semantics, we will
use a multilingual parallel corpus to extract
semantically relevant information. The idea that
semantic insights can be derived from the translational
relation has already been explored by e.g. Resnik and
Yarowsky (1997), Diab (2002) and Ide et al. (2002).
It is our hypothesis that senses of ambiguous words in
one language are often translated into distinct words in
another language. We further believe that if two or
more words are translated into the same word in
another language, then they often share some element
of meaning. This is why we assume that the
multilingual-alignment based approach will convey
sense distinctions of a polysemous source word or yield
synonym sets.
The paper is organized as follows: the next section
gives a brief overview of the related work after which
the methodology for our experiment is explained in
detail. Sections 2.1 and 2.2 present and evaluate the
results obtained in the experiment and the last section
gives conclusions and work to be done in the future.
1.1 Related work
The following approaches are similar to ours in that
they rely on word-aligned parallel corpora. Dyvik
(2002) identified the different senses of a word based
on corpus evidence and then grouped senses in
semantic fields based on overlapping translations which
indicate semantic relatedness of expressions. The fields
and semantic features of their members were used to
construct semilattices which were then linked to PWN.
Diab (2004) took a word-aligned English-Arabic
parallel corpus as input and clustered source words that
were translated with the same target word. Then the
appropriate sense for the words in clusters was
identified on the basis of word sense proximity in
PWN. Finally, the selected sense tags were propagated
to the respective contexts in the parallel texts.
Sense discrimination with parallel corpora has also been
investigated by Ide et al. (2002) who used the same corpus
as input data as we did for this experiment but then used the
extracted lexicon to cluster words into senses. Finding
synonyms with word-aligned corpora was also at the core of
work by van der Plas and Tiedemann (2006) whose
approach differs from ours in the definition of what
synonyms are, which in their case is a lot more permissive
than in ours.
2 Methodology
In the experiment we used the parallel corpus of George
Orwell’s Nineteen Eighty-Four (Erjavec and Ide 1998) in
five languages: English, Czech, Romanian, Bulgarian and
Slovene. Although the corpus is a literary text, the style of
writing is ordinary, modern and not overly domain-specific.
Furthermore, because the 100,000 word corpus is already
sentence-aligned and tagged, the preprocessing phase was
not demanding. First, all but the content-words (nouns, main
verbs, adjectives and adverbs) were discarded to facilitate
the word-alignment process. A few formatting and encoding
modifications were performed in order to conform to the
required input format by the alignment tool.
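As an illustration, the content-word filter could look like the following Python sketch; the (lemma, tag) tuple format and the MULTEXT-East style tags, whose first letter encodes the part of speech, are assumptions made for the example, not the preprocessing code actually used.

CONTENT_POS = {"N", "V", "A", "R"}  # noun, verb, adjective, adverb

def keep_content_words(tokens):
    """Keep nouns, verbs, adjectives and adverbs; a finer check on the
    full tag would be needed to keep only main verbs."""
    return [(lemma, msd) for lemma, msd in tokens if msd[:1] in CONTENT_POS]

# Example: only "room", "be" and "cold" survive the filter.
sentence = [("the", "D"), ("room", "Nc"), ("be", "Vm"), ("cold", "Af")]
print(keep_content_words(sentence))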
The corpus was word-aligned with Uplug, a modular tool
for automated corpus alignment (Tiedemann, 2003). Uplug
converts text files into XML, aligns the files at sentence-level,
then aligns them at word-level and generates a bilingual
lexicon. Because our files had already been formatted and
sentence-aligned, the first two stages were skipped and we
proceeded directly to word-alignment. The advanced setting was used, which first creates basic clues for word
alignments, then runs GIZA++ (Och and Ney 2003) with
standard settings and aligns words with the existing clues.
The alignments with the highest reliability scores are
learned and the last two steps are repeated three times. This
is Uplug’s slowest standard setting but, considering the
relatively small size of the corpus, it was used because it
yielded the best results. The output of the alignment process
is a file with information on word link certainty between the
aligned pair of words and their unique ids (see Figure 1).
<link certainty="-220" xtargets="Oen.1.1.5.6 Oen.1.1.5.7;Osl.1.2.6.6"
id="SL0.35">
<wordLink certainty="0.0166666666666667" lexPair="watched;opazujejo"
xtargets="Oen.1.1.5.6.4;Osl.1.2.6.6.3" />
<wordLink certainty="0.0166666666666667" lexPair="wire;oddajnik"
xtargets="Oen.1.1.5.7.3;Osl.1.2.6.6.9" />
<wordLink certainty="0.0198634796329659" lexPair="was time;čas"
xtargets="Oen.1.1.5.6.1+Oen.1.1.5.6.5;Osl.1.2.6.6.2" />
<wordLink certainty="0.0125" lexPair="conceivable;Mogoče"
xtargets="Oen.1.1.5.6.3;Osl.1.2.6.6.1" />
<wordLink certainty="0.0214285714285715" lexPair="rate;drugače"
xtargets="Oen.1.1.5.7.1;Osl.1.2.6.6.6" /></link>
Figure 1: An example of word links
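For concreteness, links of the kind shown in Figure 1 can be read with a short Python sketch; it assumes the <link> elements are wrapped in a single root element so the file is well-formed XML, and the file name is invented:

import xml.etree.ElementTree as ET

def read_word_links(path):
    """Yield (source, target, certainty, token ids) for every <wordLink>."""
    root = ET.parse(path).getroot()
    for link in root.iter("wordLink"):
        source, target = link.get("lexPair").split(";")
        yield source, target, float(link.get("certainty")), link.get("xtargets")

for source, target, certainty, ids in read_word_links("orwell.en-sl.links.xml"):
    print(f"{source} -> {target} ({certainty:.4f})")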
Word ids were used to extract the lemmas from the corpus
and to create bilingual lexicons. In order to reduce the noise
in the lexicon as much as possible, only 1:1 links between
words of the same part of speech were taken into account
and all alignments occurring only once were discarded. The
generated lexicons contain all the translations of an English
word in the corpus with alignment frequency, part of speech
and the corresponding word ids (see Figure 2). The size of
each lexicon is about 1,500 entries.
3 0.075 age,n,doba,n oen.2.9.14.16.1.2;oen.2.9.14.16.3.2;oen.2.9.26.3.5.2;
2 0.154 age,n,obdobje,n oen.2.9.26.3.10.3;oen.2.9.26.4.6.5;
4 0.075 age,n,starost,n oen.1.4.24.1.1;oen.1.8.39.4.2;oen.2.4.55.4.3;
5 0.118 age,n,vek,n oen.1.2.38.2.1.8;oen.1.2.38.2.1.3;oen.1.2.38.2.1.1;
4 0.078 aged,a,star,a oen.1.6.4.3.12;oen.1.8.94.4.5;oen.2.1.53.12.4;
2 0.068 agree,v,strinjati,v oen.1.8.50.4.1;oen.3.4.7.1.2;
9 0.104 aim,n,cilj,n oen.1.4.24.10.6;oen.1.5.27.1.3;oen.2.8.55.6.2;
2 0.051 air,n,izraz,n oen.2.8.19.1.9;oen.2.8.52.2.3;
6 0.080 air,n,videz,n oen.2.5.2.7.12;oen.2.5.6.15.16;oen.3.1.19.6.7;
14 0.065 air,n,zrak,n oen.1.1.6.7.7;oen.1.1.7.2.12;oen.1.2.29.2.7;oen.1.3.3.7.3;
Figure 2: An example of bilingual lexicon entries
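The filtering step can be sketched as follows, under the constraints stated above (only 1:1 links between words of the same part of speech, alignments occurring only once discarded); the input tuple format is an assumption of the sketch:

from collections import Counter

def build_lexicon(aligned_pairs):
    """aligned_pairs: (en_lemma, en_pos, tgt_lemma, tgt_pos) tuples,
    one per word link extracted from the corpus."""
    counts = Counter(
        (en, pos_en, tgt, pos_tgt)
        for en, pos_en, tgt, pos_tgt in aligned_pairs
        if " " not in en and " " not in tgt  # 1:1 links only
        and pos_en == pos_tgt                # same part of speech
    )
    # discard alignments occurring only once
    return {entry: freq for entry, freq in counts.items() if freq > 1}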
The bilingual lexicons were used to create multilingual
lexicons. English lemmas and their word ids were used
as a cross-over and all the translations of an English
word occurring more than twice were included. If an
English word was translated by a single word in one
language and by several words in another language, all
the variants were included in the lexicon because it is assumed that the difference in translation either signifies a different sense of the English word or indicates a synonym of another translation variant (see the translations for the English word “army” in Figure 3, which is translated by the same words in all the languages except Slovene).
Whether the variants are synonymous or belong to
different senses of a polysemous expression is to be
determined in the next stage. In this way, three multilingual lexicons were created: En-Cs-Si (1,703 entries), En-Cs-Ro-Si (1,226 entries) and En-Cs-Ro-Bg-Si (803 entries).
2 (v)  answer    odgovoriti  odpovědět  răspunde  odgovoriti
2 (n)  argument  argument    argument   argument  argument
3 (n)  arm       roka        paže       braţ      ruka
3 (n)  arm       roka        ruka       braţ      ruka
2 (n)  army      armada      armáda     armată    armija
2 (n)  army      vojska      armáda     armată    armija
Figure 3: An example of multilingual lexicon entries
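The pivoting on English lemmas can be sketched like this, with each bilingual lexicon modelled as a mapping from an English lemma to its set of translations; the dictionary layout is illustrative and the frequency pre-filtering is assumed to have been done already:

def merge_on_english(*bilingual_lexicons):
    """Each lexicon maps an English lemma to its set of translations;
    only lemmas present in every lexicon are kept."""
    shared = set.intersection(*(set(lex) for lex in bilingual_lexicons))
    return {en: [lex[en] for lex in bilingual_lexicons] for en in shared}

# The "army" entry from Figure 3 (Slovene, Czech and Romanian columns):
en_sl = {"army": {"armada", "vojska"}}
en_cs = {"army": {"armáda"}}
en_ro = {"army": {"armată"}}
print(merge_on_english(en_sl, en_cs, en_ro))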
The obtained multilingual lexicons were then compared
against the already existing wordnets in the
corresponding languages. For English, the Princeton
WordNet (Fellbaum 1998) was used while for Czech,
Romanian and Bulgarian wordnets from the BalkaNet
project (Tufis 2000) were used. The motivation for using the BalkaNet wordnets was twofold: first, the languages
included in the project correspond to the multilingual
corpus we had available, and second, the wordnets were
developed in parallel, they cover a common sense
inventory and are also aligned to one another as well as
to PWN, making the intersection easier.
If a match was found between a lexicon entry and a
literal of the same part of speech in the corresponding
wordnet, the synset id was remembered for that
language. If after examining all the existing wordnets
there was an overlap of synset ids across all the
languages (except Slovene, of course) for the same
lexicon entry, it was assumed that the words in question
all describe the concept marked with this id. Finally, the
concept was extended to the Slovene part of the
multilingual lexicon entry and the synset id common to
all the languages was assigned to it. All the Slovene
words sharing the same synset id were treated as
synonyms and were grouped into synsets (see Figure 4).
ENG20-03500773-n luč svetilka
ENG20-04210535-n miza mizica
ENG20-05291564-n kraj mesto prostor
ENG20-05597519-n doktrina nauk
ENG20-05903215-n beseda vrsta
ENG20-06069783-n dokument papir košček
ENG20-06672930-n glas zvok
ENG20-07484626-n družba svet
ENG20-07686671-n armada vojska
ENG20-07859631-n generacija rod
ENG20-07995813-n kraj mesto prostor
ENG20-08095650-n kraj mesto prostor
ENG20-08692715-n zemlja tla svet
ENG20-09620847-n deček fant
ENG20-09793944-n jetnik ujetnik
ENG20-10733600-n luč svetloba
Figure 4: An example of the created synsets for Slovene
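One way to make the intersection step concrete is the following sketch, in which each wordnet is modelled as a mapping from a (literal, part of speech) pair to the set of synset ids it occurs in; the data structures are illustrative, not the tools actually used:

from collections import defaultdict

def induce_synsets(entries, wordnets):
    """entries: (pos, words_per_language, slovene_words) triples;
    wordnets: one {(literal, pos): set of synset ids} mapping per
    non-Slovene language, in the same order as words_per_language."""
    slovene_synsets = defaultdict(set)
    for pos, words_per_language, slovene_words in entries:
        common = None
        for words, wordnet in zip(words_per_language, wordnets):
            # synset ids this language's translations can express
            ids = set().union(*(wordnet.get((w, pos), set()) for w in words))
            common = ids if common is None else common & ids
        # extend every id shared by all languages to the Slovene words
        for synset_id in common or ():
            slovene_synsets[synset_id].update(slovene_words)
    return slovene_synsets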
Other language-independent information (e.g. part of
speech, domain, semantic relations) was inherited from the
Princeton WordNet and an xml file was created. The
automatically generated Slovene wordnet was loaded into
VisDic, a graphical application for viewing and editing
dictionary databases stored in XML format (Horak and
Smrž 2000). In the experiment, four different settings were tested: SLOWN1 was created from an English-Slovene lexicon compared against the English wordnet; for SLOWN2 Czech was added; SLOWN3 was created using these as well as Romanian, and SLOWN4 was obtained by including Bulgarian as well.
2.1 Results
In our previous work, a version of the Slovene wordnet (SLOWN0) was created from the Serbian wordnet (Krstev et al. 2004), which was translated into Slovene with a Serbian-Slovene dictionary (see Erjavec and Fišer 2006). The main disadvantage of that approach was the inadequate disambiguation of polysemous words, which required extensive manual editing of the results. SLOWN0 contains 4,688 synsets, all from Base Concept Sets 1 and 2. Nouns prevail (3,210), followed by verbs (1,442) and adjectives (36). There are no adverbs in BCS1 and 2, which
is why none were translated into Slovene. Average synset
length (number of literals per synset) is 5.9 and the synsets
cover 119 domains (see Table 2).
In the latest approach, aimed at further extending the existing core wordnet for Slovene, we attempted to take advantage of several available resources (parallel corpora and wordnets for other languages). At the same time we were interested in an approach that would yield more reliable synset candidates. This is why we experimented with four different settings to induce Slovene synsets, each using more resources and including more languages, in order to establish which one is the most efficient from a cost-benefit point of view.
SLOWN1: This approach, using resources for two languages, is the most similar to our previous experiment, except that this time an automatically generated lexicon was used instead of a bilingual glossary and the much larger English wordnet was used instead of the Serbian one (see Table 1). SLOWN1 contains 6,746 synsets belonging to all
three Base Concept Sets and beyond. A lot more verb
(2,310) and adjective synsets (1,132) were obtained,
complemented by adverbs (340) as well.
Average synset length is much lower (2.0), which could
be a sign of a much higher precision than in the
previous wordnet, while the domain space has grown
(126).
                PWN      CSWN    ROWN    BGWN    GOLDST
synsets         115,424  28,405  18,560  21,105  1,179
avg. l/s        1.7      1.5     1.6     2.1     2.1
bcs1            1,218    1,218   1,189   1,218   320
bcs2            3,471    3,471   3,362   3,471   804
bcs3            3,827    3,823   3,593   3,827   55
other           106,908  19,893  10,416  12,589  0
domains         164      156     150     155     85
nouns           79,689   20,773  12,382  13,924  828
  max l/s       28       12      17      10      8
  avg. l/s      1.8      1.4     1.7     1.8     1.9
verbs           13,508   5,126   4,603   4,163   343
  max l/s       24       10      12      18      10
  avg. l/s      1.8      1.8     2.1     3.6     2.4
adjectives      18,563   2,128   833     3,009   8
  max l/s       25       8       11      9       4
  avg. l/s      1.7      1.4     1.5     1.7     1.6
adverbs         3,664    164     742     9       0
  max l/s       11       4       10      2       0
  avg. l/s      1.6      1.5     1.6     1.2     0
Table 1: Statistics for the existing wordnets
SLOWN2: Starting from the assumption that polysemous words tend to be realized differently in different languages and that translations themselves could be used to discriminate between the different senses of words as well as to determine relations among lexemes, Czech was added to the lexicon as well as to the lexicon-wordnet comparison stage. As a result,
the number of obtained synsets is much lower (1,501)
as is the number of domains represented by the wordnet
(87). Average synset length is 1.8 and nouns still
represent the largest portion of the wordnet (870).
                SLOWN0  SLOWN1  SLOWN2  SLOWN3  SLOWN4
synsets         4,688   6,746   1,501   1,372   549
avg. l/s        5.9     2.0     1.8     2.4     1.8
bcs1            1,219   588     324     293     166
bcs2            3,469   1,063   393     359     172
bcs3            0       663     230     22      99
other           0       4,432   554     496     112
domains         119     126     87      83      60
nouns           3,210   2,964   870     671     291
  max l/s       40      10      7       6       4
  avg. l/s      4.8     1.4     1.4     1.4     1.7
verbs           1,442   2,310   483     639     249
  max l/s       96      76      15      30      26
  avg. l/s      8.4     3.3     2.7     3.7     2.6
adjectives      36      1,132   118     32      9
  max l/s       30      4       4       2       2
  avg. l/s      8.2     1.2     1.1     1.0     1.1
adverbs         0       340     30      30      0
  max l/s       0       20      5       3       0
  avg. l/s      0       2.1     1.6     1.4     0
Table 2: Results obtained in the experiment
SLOWN3: It was assumed that by adding another
language recall would fall but precision would increase
and we wished to test this hypothesis in a step-by-step
way. In this setting, the multilingual lexicon was
therefore extended with Romanian translations and the
Romanian wordnet was used to obtain an intersection
between lexicon entries and wordnet synsets. The
number of synsets falls slightly in SLOWN3 but the
average number of literals per synset increases.
The domain space is virtually the same, as is the proportion of nouns, verbs and adverbs, while the number of adjectives falls the most drastically.
SLOWN4: Finally, Bulgarian was added both in the lexicon
and wordnet lookup stage. The last wordnet created within
this experiment only contains 549 synsets with 1.8 as
average synset length. The number of noun and verb synsets
is almost the same and in this case no adverbial synsets were
obtained. The number of domains represented in this
wordnet is 60 (see Table 2).
2.2 Evaluation and discussion of the results
In order to evaluate the results obtained in the experiment and to decide which setting performs best, the generated synsets were evaluated against a goldstandard created by hand. The goldstandard contains 1,179 synsets
from all three Base Concept Sets. This is why evaluation
only takes into account the automatically generated synsets
that belong to these three categories. Average synset length
in the goldstandard is 2.1 which is comparable to Bulgarian
wordnet and to the Slovene wordnets created within this
experiment. The synsets belong to 80 different domains; the major part of them are nouns (828), but there are no adverbs in the goldstandard, which is why adverbs will not be evaluated. Also,
since only three adjective synsets overlapped in the
goldstandard and the generated wordnets, they were
excluded from the evaluation as well.
Table 3 shows the results of the automatic evaluation of the wordnets obtained in the experiment. The goldstandard used for evaluation also contains multi-word literals, whereas the automatic method presented in this paper is limited to one-word translation candidates, which is why multi-word literals were disregarded in the evaluation. The most
straightforward approach for evaluation of the quality of the
obtained wordnets would be to compare the generated
synsets with the corresponding synsets from the
goldstandard. But in this way we would be penalizing the
automatically induced wordnets for missing literals which
are not part of the vocabulary of the corpus that was used to
generate the lexicons in the first place. Instead we opted for
a somewhat different approach by comparing literals in the
goldstandard and in the automatically induced wordnets
with regard to which synsets they appear in. This
information was used to calculate precision, recall and f-measure. This seems a fairer approach because of the restricted input vocabulary.
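Under this reading, the scores can be computed per literal as in the sketch below; the comparison of synset-id sets per shared literal is our interpretation of the procedure just described:

def evaluate(induced, gold):
    """induced, gold: {literal: set of synset ids}; only literals present
    in both resources contribute, mirroring the restricted vocabulary."""
    tp = fp = fn = 0
    for literal in set(induced) & set(gold):
        tp += len(induced[literal] & gold[literal])  # correct placements
        fp += len(induced[literal] - gold[literal])  # spurious placements
        fn += len(gold[literal] - induced[literal])  # missed placements
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure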
              SLOWN0  SLOWN1  SLOWN2  SLOWN3  SLOWN4
nouns         261     322     223     179     103
  precision   50.6%   70.2%   78.4%   73.0%   84.1%
  recall      89.3%   87.3%   81.7%   77.4%   78.2%
  f-measure   64.6%   77.8%   80.0%   75.1%   81.1%
verbs         174     127     79      69      53
  precision   43.8%   35.8%   54.2%   37.5%   46.0%
  recall      74.7%   70.3%   66.2%   72.5%   59.1%
  f-measure   55.3%   47.4%   59.6%   49.4%   51.7%
total         445     449     302     248     156
  precision   48.0%   60.6%   72.3%   63.2%   71.3%
  recall      83.5%   82.6%   77.6%   76.2%   71.5%
  f-measure   61.0%   69.9%   74.9%   69.1%   71.4%
Table 3: Automatic evaluation of the results
As can be seen from Table 3, the approach taken in this
experiment outperforms the previous attempt in which a
bilingual dictionary was used as far as precision is
concerned. Also, the method works much better for
nouns than for verbs, regardless of the setting used. In
general, there is a steady growth in precision and a corresponding fall in recall, with a gradual growth in f-measure (see Table 3). The best results are obtained by merging resources for English, Czech and Slovene (74.89%); when Romanian is added the results fall (69.10%) and then rise again when Bulgarian is added, reaching almost the same level as setting 2 (71.34%). It is interesting to see the drop in quality when Romanian is added; this might occur either because the word-alignment is worse for English-Romanian (because of a freer translation and consequently poorer quality sentence-alignment of the corpus), or it might be due to the properties of the Romanian wordnet version we used for the experiment, which is smaller than the other wordnets.
In order to gain insight into what actually goes on with
the synsets in the different settings, we generated an
intersection of all the induced wordnets and checked
them manually. Automatic evaluation shows that the
method works best for nouns, which is why we focus
on them in the rest of this section. The sample we used
for manual evaluation contains 165 synsets which are
the same in all the generated wordnets and can
therefore be directly compared.
In the manual evaluation we checked whether the generated synset contains a correct literal at all. We classified the errors into several categories: the wrong literal is a hypernym of the concept in question (more general), the wrong literal is a hyponym of the concept (more specific), the wrong literal is semantically related to the concept (meronym, holonym, antonym), or the literal is simply wrong. The results of the manual evaluation of the
small sample confirm the results obtained by automatic
evaluation. However, it was interesting to see that even though SLOWN2 contains slightly more errors than SLOWN4, there is more variation (more synset members) in SLOWN2, which makes it a more useful resource and therefore the preferred setting (see Table 4). Unsurprisingly, the least problematic synsets
are those lexicalizing specific concepts (such as “rat”,
“army”, “kitchen”) and the most difficult ones were
those containing highly polysemous words and
describing vague concepts (e.g. “face” which as a noun
has 13 different senses in PWN or “place” which as a
noun has 16 senses).
                      SLOWN1      SLOWN2       SLOWN3       SLOWN4
total no. of syns     165 (100%)  165 (100%)   165 (100%)   165 (100%)
fully correct syns    96 (58.2%)  103 (62.4%)  119 (72.1%)  134 (81.2%)
no correct lit.       6 (3.6%)    5 (3.0%)     10 (6.0%)    9 (5.4%)
at least 1 corr. lit. 43 (26.0%)  37 (22.4%)   20 (12.1%)   14 (8.4%)
hypernym              3 (1.8%)    3 (1.8%)     3 (1.8%)     0 (0.0%)
hyponym               6 (3.6%)    6 (3.6%)     10 (6.0%)    6 (3.6%)
sem. related lit.     2 (1.2%)    4 (2.4%)     2 (1.2%)     1 (0.6%)
more err. types       8 (4.8%)    5 (3.0%)     0 (0.0%)     0 (0.0%)
Table 4: Manual evaluation of the results
3 Conclusions and future work
Lexical semantics is far from trivial for computers as well as for humans. This can be seen from the relatively low inter-annotator agreement scores when humans are asked to assign a sense to a word from a parallel corpus such as the one used in this study (approx. 75% for WordNet senses, as reported by Ide et al. 2002). The results obtained in the presented set of experiments lie within these figures, showing that the method is promising for nouns (much less so for other parts of speech) and should be investigated further on other, larger and more varied corpora. The next step will therefore be the application of the second setting, which yielded the best results in this experiment, to the multilingual and much larger ACQUIS corpus (Steinberger et al. 2006). Attempts have already been made to word-align the ACQUIS corpus (e.g. Giguet and Luquet 2006), but their alignments are not useful for our method as they vary greatly in length and also include function words. This is why the word-alignment phase will have to be done from scratch, a non-trivial task because a lot of preprocessing (tagging, lemmatization and sentence-alignment) is required.
The results could further be improved by using the latest
versions of the wordnets. The ones used in this experiment are from 2004, when the BalkaNet project ended, but the teams have continued developing their wordnets, and these are now much larger and better resources.
4 References
Diab, Mona (2004): The Feasibility of Bootstrapping an Arabic WordNet Leveraging Parallel Corpora and an English WordNet. In: Proceedings of the Arabic Language Technologies and Resources, NEMLAR, Cairo, 2004.
Dyvik, Helge (2002): Translations as Semantic Mirrors: From Parallel Corpus to Wordnet. Revised version of a paper presented at the ICAME 2002 Conference in Gothenburg.
Erjavec, Tomaž, Darja Fišer (2006): Building Slovene WordNet. In: Proceedings of the 5th International Conference on Language Resources and Evaluation LREC'06, 24-26 May 2006, Genoa, Italy.
Farreres, Xavier, G. Rigau, H. Rodríguez (1998): Using WordNet for Building WordNets. In: Proceedings of the COLING-ACL Workshop on Usage of WordNet in Natural Language Processing Systems, Montreal, Canada.
Farreres, Xavier, Karina Gibert, Horacio Rodríguez (2004): Towards Binding Spanish Senses to Wordnet Senses through Taxonomy Alignment. In: Proceedings of the Second Global WordNet Conference, pp. 259-264, Brno, Czech Republic, January 20-23, 2004.
Fellbaum, Christiane (1998): WordNet: An Electronic Lexical Database. MIT Press.
Giguet, Emmanuel, Pierre-Sylvain Luquet (2006): Multilingual Lexical Database Generation from Parallel Texts in 20 European Languages with Endogenous Resources. In: Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions.
Horak, Ales, Pavel Smrz (2000): New Features of Wordnet Editor VisDic. In: Romanian Journal of Information Science and Technology, Special Issue (Volume 7, No. 1-2).
Ide, Nancy, Tomaž Erjavec, Dan Tufis (2002): Sense Discrimination with Parallel Corpora. In: Proceedings of the ACL'02 Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, Philadelphia, pp. 54-60.
Knight, K., S. Luk (1994): Building a Large-Scale Knowledge Base for Machine Translation. In: Proceedings of the American Association of Artificial Intelligence AAAI-94, Seattle, WA.
Krstev, Cvetana, G. Pavlović-Lažetić, D. Vitas, I. Obradović (2004): Using Textual Resources in Developing Serbian Wordnet. In: Romanian Journal of Information Science and Technology (Volume 7, No. 1-2), pp. 147-161.
Och, Franz Josef, Hermann Ney (2003): A Systematic Comparison of Various Statistical Alignment Models. In: Computational Linguistics (Volume 29, No. 1).
Pianta, Emanuele, L. Bentivogli, C. Girardi (2002): MultiWordNet: Developing an Aligned Multilingual Database. In: Proceedings of the First International Conference on Global WordNet, Mysore, India, January 21-25, 2002.
Resnik, Philip, David Yarowsky (1997): A Perspective on Word Sense Disambiguation Methods and Their Evaluation. In: ACL-SIGLEX Workshop Tagging Text with Lexical Semantics: Why, What, and How?, April 4-5, 1997, Washington, D.C., pp. 79-86.
Steinberger, Ralf, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, Dan Tufiş, Dániel Varga (2006): The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages. In: Proceedings of the 5th International Conference on Language Resources and Evaluation, Genoa, Italy, 24-26 May 2006.
Tiedemann, Jörg (2003): Recycling Translations - Extraction of Lexical Data from Parallel Corpora and their Application in Natural Language Processing. Doctoral Thesis. Studia Linguistica Upsaliensia 1.
Tufis, Dan (2000): BalkaNet - Design and Development of a Multilingual Balkan WordNet. In: Romanian Journal of Information Science and Technology, Special Issue (Volume 7, No. 1-2).
van der Plas, Lonneke, Jörg Tiedemann (2006): Finding Synonyms Using Automatic Word Alignment and Measures of Distributional Similarity. In: Proceedings of ACL/COLING 2006.
Vossen, Piek (ed.) (1998): EuroWordNet: A Multilingual Database with Lexical Semantic Networks for European Languages. Kluwer, Dordrecht.