USING EUROWORDNET IN A CONCEPT-BASED APPROACH TO CROSS-LANGUAGE TEXT RETRIEVAL
JULIO GONZALO, FELISA VERDEJO, and IRINA CHUGUR
UNED, Ciudad Universitaria, Madrid, Spain
We present an approach to cross-language text retrieval based on the EuroWordNet
(EWN) multilingual semantic database. EuroWordNet is a multilingual, WordNet-like
database with basic semantic relations between words for several European languages
(English, Dutch, Spanish, Italian, German, French, Czech, and Estonian). In addition to the
relations in WordNet 1.5, EWN includes domain labels, cross-language, and cross-part-of-speech
relations, which are directly useful for multilingual information retrieval.
In our approach, documents in any language covered by EuroWordNet are indexed in
a space of language-independent concepts (the EuroWordNet Inter Lingual Index), thus
turning term weighting and query/document matching into language-independent tasks.
We report on the results of a number of experiments that measure the potential
benefits of the approach and its tolerance to word sense disambiguation errors. In our
monolingual experiments, the classical vector space model for text retrieval is shown to give
better results (up to 29% better in our experiments) if WordNet synsets are chosen as the
indexing space, instead of word forms. This result is obtained for a manually disambiguated
test collection derived from the SEMCOR annotated corpus. The sensitivity of retrieval
performance to (automatic) disambiguation errors is also measured. Our preliminary
bilingual experiments, also reported here, show that our approach can noticeably outperform a
naive, dictionary-based translation of the query terms into the target language.
Text retrieval deals with the problem of finding all the relevant documents
in a text collection for a given user's query, stated in a natural language. For
a human, this involves reading and understanding documents and query,
and judging the relevance of each document for the query. Such tasks
seem to fall naturally within the scope of knowledge representation and
natural language processing (NLP) techniques. However, both scientific
communities (information retrieval and NLP people) have largely evolved in
isolation from each other. There are two powerful reasons: one is that
Final version received November 1998.
This research is being supported by the European Community, project LE #4003, and also partially
by the Spanish government, project TIC-96-1243-CO3-O1. We are indebted to J. Ignacio Mayorga,
Anselmo Peñas, Fernando Ostenero, and David Fernández for their help building up the test collection.
Thanks also to Carol Peters for many fruitful discussions.
Address correspondence to Julio Gonzalo, Departamento de Ingeniería Eléctrica, Electrónica y de
Control, UNED, Ciudad Universitaria, s.n., 28040 Madrid, Spain. E-mail: julio@ieec.uned.es
Applied Artificial Intelligence, 13:647-678, 1999
Copyright © 1999 Taylor & Francis
0883-9514/99 $12.00 + .00
J . Gonzalo et al.
statistical approaches that neglect the linguistic properties of the texts they
manipulate have been quite successful for information retrieval. The other is
that the attempts to introduce natural language techniques (part-of-speech
tagging, morphological analysis and stemming, word-sense disambiguation,
etc.) have largely failed to improve statistical approaches significantly.
However, the increasing relevance of cross-language and multilingual
text retrieval seems to be changing this landscape. The explosive growth of
universally accessible information over the international networks
(information that is unstructured, heterogeneous, and multilingual by
nature) has made cross-language text retrieval (CLTR) one of the most
compelling current challenges for the software industry. In principle, a user of
a WWW search engine wants to find the information relevant to his query,
regardless of the languages used to write documents and query. Thus,
the search engine has to be able to find documents that are expressed in different
languages.
But the cross-language text retrieval task has proved to be much harder
than its monolingual counterpart. In Grefenstette (1998), the three problems
that CLTR must solve are identified as:
a. knowing how a term expressed in one language might be written in
another;
b. deciding which of the possible translations are appropriate in a given
context; and
c. deciding how to weight different translation alternatives when more than
one is retained.
Problem a is related to the use of bilingual dictionaries and other
language resources, and b and c to word-sense disambiguation and machine
translation issues. On the other hand, it has been generally observed
that traditional IR models and techniques suffer a loss of performance of
around 50% when adapted naively to cross-language retrieval. It seems,
therefore, that a point of convergence between IR, AI, and NLP-based
techniques must be found to deal satisfactorily with the problem of
cross-language text retrieval.
The main approaches to CLTR being experimented with today use
either knowledge-based or corpus-based techniques (Oard, 1997).
Knowledge-based approaches. These apply bilingual or multilingual
dictionaries, thesauri, or general-purpose ontologies to get appropriate equivalents
in the target language for the original terms of the query.
USING THESAURI: So far, the best known and tested approaches to
CLTR are thesaurus-based, although these are generally used in
controlled-text retrieval, where each document is indexed (mainly by hand) with
keywords from the thesaurus. A thesaurus is an ontology specializing in
organizing terminology; a multilingual thesaurus organizes terminology for
more than one language. ISO 5964 gives specifications for the incorporation
of domain knowledge in multilingual thesauri and identifies alternative
techniques. There are now a number of multilingual thesaurus-based systems
available commercially. However, controlled-text retrieval demands
resource-consuming thesaurus construction and maintenance, and user
training for optimum usage. In addition, domain-specific thesauri are not
very useful outside of the particular domain for which they have been
designed. The remainder of the article will implicitly refer to free-text
retrieval, where queries are compared against full documents, rather than
prebuilt keyword descriptions of the documents.
USING DICTIONARIES: Some of the first methods attempting to match the
query to the document for free-text (as opposed to controlled-text) retrieval
have used bilingual dictionaries. It has been shown that dictionary-based
query translation, where each term or phrase in the query is replaced by a
list of all its possible translations, represents an acceptable first pass at
cross-language information retrieval, although such relatively simple methods
clearly show performance below that of monolingual retrieval. Automatic
machine-readable dictionary (MRD) query translation, on its own, has been
found to lead to a drop in effectiveness of 40-60% relative to monolingual retrieval
(Hull & Grefenstette, 1996; Ballesteros & Croft, 1996). There are three main
reasons for this: general-purpose dictionaries do not normally contain
specialized vocabulary; multiword terms fail to be translated; and spurious
translations are introduced.
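The naive term-by-term translation described above can be pictured with a minimal sketch. The dictionary entries and the function name `translate_query` are hypothetical and chosen only for illustration; they are not the resources used in the cited experiments.

```python
from typing import Dict, List

# Hypothetical toy Spanish-to-English dictionary; entries are illustrative only.
TOY_DICTIONARY: Dict[str, List[str]] = {
    "primavera": ["spring", "springtime", "primrose"],
    "banco": ["bank", "bench", "shoal"],
}


def translate_query(terms: List[str], dictionary: Dict[str, List[str]]) -> List[str]:
    """Replace each query term by all of its listed translations.

    Terms missing from the dictionary are kept unchanged, which mirrors the
    coverage problem with specialized vocabulary mentioned in the text.
    """
    translated: List[str] = []
    for term in terms:
        translated.extend(dictionary.get(term, [term]))
    return translated
```

Even in this toy setting, a query containing "banco" picks up "bench" and "shoal" whatever the intended sense, which is the spurious-translation effect noted above.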
Corpus-based approaches. The above considerations have encouraged an
interest in corpus-based techniques, in which information about the
relationship between terms over languages is obtained from observed statistics of
term usage. Corpus-based approaches analyze large collections of texts in
multiple languages and automatically extract the information needed to
construct application-specific translation techniques. The collections analyzed
may consist of parallel (translation-equivalent) or comparable
(domain-specific) sets of documents. The main approaches that have been tried
using corpora are vector space and probabilistic techniques. A recent
comparative evaluation of some representative approaches to corpus-based
cross-language free-text retrieval (Carbonell et al., 1997) showed that such
approaches, and in particular some applications of example-based machine
translation, significantly outperformed the simple dictionary-based term
translation used in the evaluation.
The first tests with parallel corpora were on statistical methods for the
extraction of multilingual term equivalence data, which could be used as
input for the lexical component of MT systems. Some of the most interesting
recent experiments, however, are those using a matrix reduction technique
known as latent semantic indexing (LSI) to extract language-independent
terms and document representations from parallel corpora (Dumais et al.,
1996). Latent semantic indexing applies a singular value decomposition to a
large, sparse term-document co-occurrence matrix (including terms from all
parallel versions of the documents) and extracts a subset of the singular
vectors to form a new vector space. Thus queries in one language can
retrieve documents in the other (as well as in the original language).
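In symbols, the reduction can be sketched as follows. This is the usual textbook formulation of LSI, assumed here rather than taken from Dumais et al. (1996); the symbols $X$, $U_k$, $\Sigma_k$, $V_k$ and the fold-in rule are not notation used in this article.

\[
X \approx X_k = U_k \Sigma_k V_k^{\top}, \qquad
\hat{q} = \Sigma_k^{-1} U_k^{\top} q, \qquad
\mathrm{sim}(q, d_j) = \cos\bigl(\hat{q}, \hat{d}_j\bigr)
\]

Here $X$ is the term-document matrix whose rows cover the terms of both languages, $k$ is the number of singular vectors retained, $q$ is a query vector over the same terms, and $\hat{d}_j$ is the $j$-th row of $V_k$. Because a query in either language is projected into the same $k$-dimensional space, it can be compared with documents in both.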
The problem with using parallel texts as training corpora is that test
corpora are costly to acquire: it is difficult to find already existing
translations of the right kind of documents, and translated versions are expensive
to create. For this reason, there has been a lot of interest recently in the
potential of comparable corpora. A comparable document collection is one
in which documents are aligned on the basis of the similarity between the
topics they address rather than because they are translation equivalent.
Methods have been studied to extract information from such corpora on
cross-language equivalences in order to translate and expand a query
formulated in one language with useful terms in another (Sheridan & Ballerini,
1996; Picchi & Peters, 1996). Again, as with the parallel corpus method
reported above, it appears that such strategies are very application
dependent. A new reference corpus would have to be built to perform retrieval on
a new topic.
From this discussion, we can conclude that any single method currently
being tried presents limitations. Existing resources, such as electronic
bilingual dictionaries, are normally inadequate or insufficient for the purpose;
the building of resources like domain-specific thesauri and training corpora
is expensive and such resources are generally not fully reusable; a new
multilingual application will require the construction of new resources or
considerable work for the adaptation of previously built ones.
It should also be noted that most of the systems and methods in use so
far concentrate on pairs rather than multiples of languages. This is hardly
surprising. The situation is far more complex when an attempt is made to
achieve effective retrieval over a number of languages than over a single
pair; it is necessary to study some kind of interlingual mechanism, at a
more or less conceptual level, in order to permit multiple cross-language
transfer.
The EWN project (Vossen, 1998) aims at building a multilingual,
WordNet-like database with basic semantic relations between words for
several European languages (English, Dutch, Spanish, Italian, German, and
French), and it is scheduled to produce the final database in 1999. Such a
large-scale, multilingual semantic database offers an interesting
knowledge-based alternative to query expansion techniques: performing conceptual,
language-neutral retrieval without requiring either training or parallel
corpora. We present such an approach here, together with a number of
experiments to determine whether it can enhance retrieval and whether it is a
feasible technique.
First of all, we review previous approaches to text retrieval using
WordNet, finding that retrieval strategies and word-sense disambiguation
problems have not been properly isolated from each other. Then we describe
a set of monolingual retrieval experiments with a hand-disambiguated test
collection derived from Semcor (Miller et al., 1994), a subset of the Brown
Corpus annotated with WordNet senses. These experiments indicate that
retrieval with WordNet can be more effective than it has been so far,
provided that word-sense disambiguation can be performed to a certain degree
of accuracy. Then we state our proposal for language-independent text
retrieval with the EWN database, and perform some bilingual experiments
with a preliminary version of the database to test the feasibility of the
approach and identify how the database should be improved during the last
building stages to permit better cross-language retrieval.
WORDNET AND TEXT RETRIEVAL: PREVIOUS APPROACHES
WordNet (Miller, 1990) is a freely available lexical database for English.
It consists of semantic relations between English words, which can be
accessed as a kind of thesaurus, in which words with similar meanings are
grouped together into so-called synsets (synonym sets). Besides synonymy
(implicit in the definition of synset), other relations are established between
synsets (or, exceptionally, between word forms): hyponymy/hyperonymy
(IS-A relation), which gives the network a hierarchical structure;
meronymy/holonymy (HAS-A relation) in its part, member, and substance variants; and
antonymy (between opposite word forms). With these relations, the
WordNet lexical database is configured as a web of 168,000 synsets
(concepts) that contain 126,000 different word forms.
A large-scale semantic database such as WordNet seems to have a great
potential for text retrieval. There are at least two obvious reasons:
• It offers the possibility to discriminate word senses in documents and
queries. This would prevent matching spring in its "metal device" sense
with documents mentioning spring in the sense of springtime, and
retrieval accuracy could thus be improved.
• WordNet provides the chance of matching semantically related words.
For instance, spring, fountain, outflow, outpouring, in the appropriate
senses, can be identified as occurrences of the same concept, "natural flow
of ground water." And beyond synonymy, WordNet can be used to
measure semantic distance between occurring terms to get more
sophisticated ways of comparing documents and queries.
However, the general feeling within the information retrieval community
is that dealing explicitly with semantic information does not significantly
improve the performance of text retrieval systems. This impression is founded
on the results of some experiments measuring the role of word sense
disambiguation (WSD) for text retrieval, on one hand, and some attempts to
exploit the features of WordNet and other lexical databases, on the other
hand.
In Sanderson (1994), word sense ambiguity is shown to produce only
minor effects on retrieval accuracy, apparently confirming that
query/document matching strategies already perform an implicit disambiguation.
Sanderson also estimates that if explicit WSD is performed with less than
90% accuracy, the results are worse than not disambiguating at all. In his
experimental setup, ambiguity is introduced artificially in the documents,
substituting randomly chosen pairs of words (for instance, banana and
kalashnikov) with artificially ambiguous terms (banana/kalashnikov). While
his results are very interesting, it remains unclear, in our opinion, whether
they would be corroborated with real occurrences of ambiguous words.
There is also another minor weakness in Sanderson's experiments. When he
"disambiguates" a term such as spring/bank to get, for instance, bank, he
has done only a partial disambiguation, as bank can be used in more than
one sense in the text collection.
Besides disambiguation, many attempts have been made to exploit
WordNet for text retrieval purposes. Mainly, two aspects have been
addressed: the enrichment of queries with semantically related terms, on
one hand, and the comparison of queries and documents via conceptual
distance measures, on the other.
Query expansion with WordNet has been shown to be potentially relevant to
enhance recall, as it permits matching relevant documents that may not
contain any of the query terms (Smeaton et al., 1995). However, it has
produced few successful experiments. For instance, Voorhees (1994) manually
expanded 50 queries over a TREC-1 collection (Harman, 1993) using
synonymy and other semantic relations from WordNet 1.3. Voorhees found
that the expansion was useful with short, incomplete queries, and rather
useless for complete topic statements, where other expansion techniques
worked better. For short queries, there remained the problem of selecting the
expansions automatically; doing it badly could degrade retrieval
performance rather than enhancing it. In Richardson & Smeaton (1995), a
combination of rather sophisticated techniques based on WordNet, including
automatic disambiguation and measures of semantic relatedness between
query/document concepts, resulted in a drop of effectiveness. Unfortunately,
the effects of WSD errors could not be discerned from the accuracy of the
retrieval strategy. However, in Smeaton and Quigley (1996), retrieval on a
small collection of image captions (that is, on very short documents) is
reasonably improved using measures of conceptual distance between words
based on WordNet 1.4. Previously, captions and queries had been manually
disambiguated against WordNet. The reason for such success is that with
very short documents (e.g., boys playing in the sand) the chance of finding
the original terms of the query (e.g., of children running on a beach) is
much lower than for average-size documents (which typically include many
phrasings for the same concepts). These results are in agreement with
Voorhees (1994), but the question remains whether the conceptual
distance matching would scale up to longer documents and queries. In
addition, the experiments in Smeaton and Quigley (1996) only consider nouns,
while WordNet offers the chance to use all open-class words (nouns, verbs,
adjectives, and adverbs).
MONOLINGUAL EXPERIMENTS WITH WORDNET
Our essential retrieval strategy in the experiments reported here is to
adapt a classical vector model-based system, using WordNet synsets as
indexing space instead of word forms. This approach combines two benefits
for retrieval, regardless of multilinguality:
i. terms are fully disambiguated as synsets representing word senses (this
should improve precision);
ii. equivalent terms can be identified, as terms with the same sense map to
the same synset (this should improve recall).
Note that query expansion does not satisfy the first condition, as the terms
used to expand a query are, themselves, words and, therefore, can be in their
turn ambiguous. On the other hand, plain word sense disambiguation does
not satisfy the second condition, as equivalent senses of two different words
are not recognized. Thus, indexing by synsets enables a maximum of word
sense matching while reducing spurious matching and seems to be a good
starting point to study text retrieval using either WordNet or EuroWordNet.
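The difference between the three indexing spaces can be illustrated with a minimal sketch. The sense labels and synset identifiers below are invented for illustration and are not real WordNet identifiers; the functions `index_terms` and `overlap` are likewise hypothetical helpers, not part of any retrieval system used here.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping from "word#sense" labels to synset identifiers.
# Both the sense labels and the identifiers are invented for illustration.
SENSE_TO_SYNSET: Dict[str, str] = {
    "spring#water": "syn_natural_flow",
    "fountain#water": "syn_natural_flow",
    "spring#season": "syn_springtime",
    "springtime#season": "syn_springtime",
}


def index_terms(tokens: List[Tuple[str, str]], space: str) -> List[str]:
    """Return the indexing units of a hand-disambiguated text in one of three spaces."""
    if space == "word":
        return [word for word, _ in tokens]
    if space == "sense":
        return [f"{word}#{sense}" for word, sense in tokens]
    if space == "synset":
        return [SENSE_TO_SYNSET[f"{word}#{sense}"] for word, sense in tokens]
    raise ValueError(f"unknown indexing space: {space}")


def overlap(query: List[Tuple[str, str]], doc: List[Tuple[str, str]], space: str) -> int:
    """Count the distinct indexing units shared by a query and a document."""
    return len(set(index_terms(query, space)) & set(index_terms(doc, space)))
```

With a query about spring in the "water" sense, word indexing matches a document about spring as a season (a spurious match) and misses a document about a fountain; sense indexing removes the spurious match but still misses the fountain; only synset indexing satisfies both conditions.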
Given this approach, our goal is to test two main issues that, to our
knowledge, are not clearly answered by the experiments mentioned above:
• Abstracting from the problem of sense disambiguation, what potential
does WordNet offer for text retrieval? In particular, we would like to
extend experiments with manually disambiguated queries and documents
to average-size texts.
• Once the potential of WordNet is known for a manually disambiguated
collection, we want to test the sensitivity of retrieval performance to
disambiguation errors introduced by automatic WSD.
The Test Collection
The best-known publicly available corpus hand-tagged with WordNet
senses is SEMCOR (Miller et al., 1993), a subset of the Brown Corpus of about
100 documents that occupies about 11 Mb (including tags). The collection is
rather heterogeneous, covering politics, sports, music, cinema, philosophy,
excerpts from fiction novels, scientific texts, etc. A new, bigger version has
been made available recently (Landes et al., 1998), but we have not yet
adapted it for our collection.
We have adapted SEMCOR in order to build a test collection, which we
call IR-SEMCOR, in four manual steps:
• We have split the documents to get coherent chunks of text for retrieval.
We have obtained 171 fragments that constitute our text collection, with
an average length of 1,331 words per fragment.
• We have extended the original TOPIC tags of the Brown Corpus with a
hierarchy of subtags, assigning a set of tags to each text in our collection.
This is not used in the experiments reported here.
• We have written a summary for each of the fragments, with lengths
varying between 4 and 50 words and an average of 22 words per
summary. Each summary is a human explanation of the text contents, not
a mere bag of related keywords. These summaries serve as queries on the
text collection, so there is exactly one relevant document per query.
• Finally, we have hand-tagged each of the summaries with WordNet 1.5
senses. When a word or term was not present in the database, it was left
unchanged. In general, such terms correspond to groups (e.g.,
Fulton_County_Grand_Jury), persons (Cervantes), or locations (Fulton).
We also generated a list of "stop-senses" and a list of "stop-synsets,"
automatically translating a standard list of stop words for English.
Such a test collection offers the chance to measure the adequacy of
WordNet-based approaches to IR independently from the disambiguator
being used, but also offers the chance to measure the role of automatic
disambiguation by introducing different rates of "disambiguation errors" in the
collection. The only disadvantage is the small size of the collection, which
does not allow fine-grained distinctions in the results. However, it has
proved large enough to give meaningful statistics for the experiments
reported here.
Although designed for our concrete text retrieval testing purposes, the
resulting database could also be useful for many other tasks. For instance, it
could be used to evaluate automatic summarization systems (measuring the
semantic relation between the manually written and hand-tagged summaries
of IR-SEMCOR and the output of text summarization systems) and other
related tasks.
For the bilingual experiments reported in this article, we also extended
the database to include manually translated and indexed versions in Spanish
of the summaries.
The Monolingual Experiments
We have performed a number of experiments using a standard
vector-model-based text retrieval system, SMART (Salton, 1971), and three different
indexing spaces: the original terms in the documents (for standard SMART
runs), the word-senses corresponding to the document terms (in other words,
a manually disambiguated version of the documents), and the WordNet
synsets corresponding to the document terms (roughly equivalent to
concepts occurring in the documents).
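As a reminder of what vector-model matching computes, here is a minimal sketch using raw term frequencies and the cosine measure. It is an illustrative simplification under our own assumptions: it does not reproduce SMART's actual term-weighting schemes, and the function name `cosine_similarity` is hypothetical. The indexing units can be word forms, word senses, or synset identifiers alike.

```python
import math
from collections import Counter
from typing import List


def cosine_similarity(query_units: List[str], doc_units: List[str]) -> float:
    """Cosine of the angle between raw term-frequency vectors of two texts."""
    q, d = Counter(query_units), Counter(doc_units)
    # Only units present in both texts contribute to the dot product.
    dot = sum(q[unit] * d[unit] for unit in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0
```

Because the function only sees opaque unit strings, changing the indexing space changes which units coincide, not the matching procedure itself.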
These are all the experiments considered here:
1. The original texts as documents and the summaries as queries. This is a
classic SMART run, with the peculiarity that there is only one relevant
document per query.
2. Both documents (texts) and queries (summaries) are indexed in terms of
word-senses. That means that we disambiguate manually all terms. For
instance, "debate" might be substituted with "debate%1:10:01::". The
three numbers denote the part of speech, the WordNet lexicographer's file,
and the sense number within the file. In this case, it is a noun belonging
to the noun.communication file.
With this collection we can see if plain disambiguation is helpful for
retrieval, because word senses are distinguished but synonymous word
senses are not identified.
3. In the previous collection, we substitute each word sense for a unique
identifier of its associated synset. For instance, "debate%1:10:01::" is
substituted with "n04616654," which is an identifier for
{argument, debate1} (a discussion in which reasons are advanced
for and against some proposition or proposal; "the argument over
foreign aid goes on and on")
This collection represents conceptual indexing, as equivalent word senses
are represented with a unique identifier.
4. We produced different versions of the synset-indexed collection,
introducing fixed percentages of erroneous synsets. Thus we simulated a
word-sense disambiguation process with 5%, 10%, 20%, 30%, and 60% error
rates. The errors were introduced randomly in the ambiguous words of
each document. With this set of experiments we can measure the
sensitivity of the retrieval process to disambiguation errors.
5. To complement the previous experiment, we also prepared collections
indexed with all possible meanings (in their word sense and synset
versions) for each term. This represents a lower bound for automatic
disambiguation: we should not disambiguate if performance is worse than
considering all possible senses for every word form.
6. We also produced a nondisambiguated version of the queries (again,
both in its word sense and synset variants). This set of queries was run
against the manually disambiguated collection.
FIGURE 1. Different indexing approaches.
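The error simulation of experiment 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used to build the collections: we assume the error rate is applied to the ambiguous tokens of one document, that each corrupted token receives a wrong synset drawn uniformly from the other candidates of its word form, and that the correct synset appears among those candidates. The name `inject_errors` and its data structures are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def inject_errors(
    tokens: List[Tuple[str, str]],
    candidates: Dict[str, List[str]],
    error_rate: float,
    seed: int = 0,
) -> List[str]:
    """Return synset ids for `tokens`, corrupting a fixed share of the ambiguous ones.

    tokens:     (word form, correct synset id) pairs for one document.
    candidates: all synset ids each word form can take (hypothetical lexicon).
    error_rate: fraction of the ambiguous tokens that receive a wrong synset.
    """
    rng = random.Random(seed)
    # Only word forms with more than one candidate synset can be mis-disambiguated.
    ambiguous = [i for i, (word, _) in enumerate(tokens) if len(candidates[word]) > 1]
    n_errors = round(error_rate * len(ambiguous))
    corrupted = set(rng.sample(ambiguous, n_errors))

    output: List[str] = []
    for i, (word, correct) in enumerate(tokens):
        if i in corrupted:
            wrong = [s for s in candidates[word] if s != correct]
            output.append(rng.choice(wrong))
        else:
            output.append(correct)
    return output
```

Fixing the seed makes each degraded collection reproducible, so runs at different error rates can be compared on the same footing.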
Discussion of Results
Indexing Approach
In Figure 1 we compare different indexing approaches: indexing by
synsets, indexing by words (basic SMART), and indexing by word senses
(experiments 1, 2, and 3). The leftmost point in each curve represents the
percentage of documents that were successfully ranked as the most relevant
for their summary/query. The next point represents the documents retrieved as
the first or the second most relevant to their summary/query, and so on. Note
that, as there is only one relevant document per query, the leftmost point is
the most representative of each curve. Therefore, we have included these
results separately in Table 1.
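Each point on such a curve can be computed as sketched below. The function name `success_at_k` and the dictionary-based data layout are our own illustrative assumptions; the only property taken from the text is that every query has exactly one relevant document.

```python
from typing import Dict, List


def success_at_k(rankings: Dict[str, List[str]], relevant: Dict[str, str], k: int) -> float:
    """Percentage of queries whose single relevant document is ranked in the top k.

    rankings: for each query id, document ids ordered from most to least similar.
    relevant: for each query id, the id of its one relevant document.
    """
    hits = sum(1 for query, ranked in rankings.items() if relevant[query] in ranked[:k])
    return 100.0 * hits / len(rankings)
```

With k = 1 this gives the leftmost point of a curve, that is, the figure reported per experiment in Table 1.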
The results are encouraging :
• Indexing by WordNet synsets produces a remarkable improvement on
our test collection. 62% of the documents are retrieved in first place by their
summaries, against 48% for the basic SMART run. This represents 14 percentage
points more documents, a 29% improvement with respect to SMART. This is an excellent
result, although we should keep in mind that it is obtained with manually
disambiguated queries and documents. Nevertheless, it shows that WordNet
can greatly enhance text retrieval: the problem resides in achieving accurate
automatic word sense disambiguation.
• Indexing by word senses improves performance when considering up to
four documents retrieved for each query/summary, although it is worse than
indexing by synsets. This confirms our intuition that synset indexing has
advantages over plain word sense disambiguation, because it permits
matching semantically similar terms.
TABLE 1. Monolingual Experiments

Experiment                                                % correct document retrieved in first place
Indexing by synsets                                       62.0
Indexing by word senses                                   53.2
Indexing by words (basic SMART)                           48.0

Indexing by synsets with a 5% errors ratio                62.0
Id. with 10% errors ratio                                 60.8
Id. with 20% errors ratio                                 56.1
Id. with 30% errors ratio                                 54.4
Indexing with all possible synsets (no disambiguation)    52.6
Id. with 60% errors ratio                                 49.1

Synset indexing with nondisambiguated queries             48.5
Word-sense indexing with nondisambiguated queries         40.9

Taking only the first document retrieved for each summary, the
disambiguated collection gives a 53.2% success rate against 48% for the plain
SMART query, which represents an 11% improvement. For recall levels higher
than 0.85, however, the disambiguated collection performs slightly worse.
This may seem surprising, as word sense disambiguation should only
increase our knowledge about queries and documents. But we should bear
in mind that WordNet 1.5 is not the perfect database for text retrieval, and
indexing by word senses prevents some matchings that can be useful for
retrieval. In particular, we have confirmed the negative effects of:
• The lack of cross-part-of-speech relations. This means, for instance, that
design as a verb is not related at all with design as a noun in the
WordNet 1.5 database. Thus, one of our documents, summarized using
shoes design, cannot be recovered using the three appearances of design
as a verb in the document. The same occurs in other documents with
tempt/temptation, American/America, indifferent/indifference,
disarm/disarming, etc. Remarkably, many of these relations can be captured by
a naive stemmer that does not distinguish parts of speech. In Krovetz
(1997), it is shown that the Porter stemmer, which does not use a lexicon,
is surprisingly good at separating unrelated morphological variants and
conflating related ones. Cross-part-of-speech relations are even more
important in multilingual settings, as many words shift category when
translated in context; this is discussed in the next section.
• Lack of topic or domain information. For instance, a document in our
database is summarized including the word soldier, as it is a story about
soldiers. But the word soldier itself does not appear in the whole
document (it is evident from the context), and thus the word or concept
soldier is not used for retrieval. If WordNet synsets were tagged with
domain information, soldier and words in the document such as battle,
enemy, etc., would relate summary and document successfully.
• Too fine-grained sense distinctions. For instance, in these two
paragraphs of a Semcor document:
1. It got the kind of scrambled, coarsened performance that can happen to
the best of orchestras when the man with the baton lacks technique and
style.
2. Not the noblest performance we have heard him play, or the most
spacious, or even the most eloquent.
The word performance is tagged with two distinct meanings. The first one
corresponds to
{performance, public presentation}: a dramatic or musical
entertainment; "the play ran for 100 performances" or "the frequent
performances of the symphony testify to its popularity"
and the second to
{performance}: the act of presenting a play or a piece of music or
other entertainment; "we congratulated him on his performance at the
recital."
However, they are unrelated in the WordNet noun hierarchy (one belongs to
the "act" subhierarchy and the other one to "communication"). At least from
the point of view of information retrieval, this is an annoying distinction.
Sensitivity to Disambiguation Errors
Figure 2 shows the sensitivity of the synset indexing system to
degradation of disambiguation accuracy (corresponding to experiments 4 and 5
described above). From the plot, it can be seen that:
• Less than 10% disambiguating errors do not substantially affect
performance. This is roughly in agreement with Sanderson (1994).
• For error ratios over 10%, the performance degrades quickly. This is also
in agreement with Sanderson (1994).
• However, indexing by synsets remains better than the basic SMART run up
to 30% disambiguation errors. From 30% to 60%, the data do not show
significant differences with standard SMART word indexing. This
differs from Sanderson's (1994) result (namely, that it is better not to
disambiguate below 90% accuracy). The main difference is that we are
using concepts rather than word senses. But, in addition, it must be noted
that Sanderson's setup used artificially created ambiguous pseudo-words
(such as "bank/spring"), which are not guaranteed to behave as real
ambiguous words. Moreover, what he understands as disambiguating is
selecting, in the example, bank or spring, which remain ambiguous
words themselves.
• If we do not disambiguate, the performance is slightly worse than
disambiguating with 30% errors, but remains better than term indexing,
although the results are not definitive. An interesting conclusion is that, if
we can reliably disambiguate the queries, WordNet synset indexing could
improve performance even without disambiguating the documents. This
could be confirmed on much larger collections, as it does not involve
manual disambiguation.
FIGURE 2. Sensitivity to disambiguation errors.
FIGURE 3. Performance with nondisambiguated queries.
It is too soon to say if state-of-the-art WSD techniques can perform with
less than 30% errors, because each technique is currently evaluated in fairly
different settings. Some of the best results on a comparable setting (namely,
disambiguating against WordNet, evaluating on a subset of the Brown
Corpus, and treating the 191 most frequently occurring and ambiguous
words of English) are reported in Ng (1997). They reach 58.7% accuracy
on a Brown Corpus subset and 75.2% on a subset of the Wall Street
Journal Corpus. A more careful evaluation of the role of WSD is needed to
know if this is good enough for our purposes.
In any case, we have only emulated a WSD algorithm that just picks
one sense and discards the rest. A more reasonable approach here could be
to assign a probability to each sense of a word, and to use these
probabilities to weight synsets in the vector representation of documents
and queries.
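The idea of weighting synsets by sense probabilities can be illustrated with a minimal sketch. It is not part of the experiments reported here: the function name, the synset identifiers, and the probabilities below are hypothetical, and we simply assume that some WSD component supplies a probability distribution over the senses of each word.

```python
from collections import defaultdict

def soft_synset_vector(tokens, sense_probs):
    """Build a synset vector from per-word sense distributions.

    tokens      -- word forms of a document or query
    sense_probs -- hypothetical mapping: word -> {synset_id: probability}
    """
    vector = defaultdict(float)
    for word in tokens:
        for synset_id, prob in sense_probs.get(word, {}).items():
            # Each occurrence contributes fractionally to every candidate sense
            # instead of committing to a single (possibly wrong) one.
            vector[synset_id] += prob
    return dict(vector)

# Made-up probabilities, for illustration only.
sense_probs = {
    "spring": {"spring#season": 0.7, "spring#fountain": 0.3},
    "water": {"water#liquid": 1.0},
}
vec = soft_synset_vector(["spring", "water", "spring"], sense_probs)
```

With such a representation, a disambiguation error only shifts part of the weight to a wrong synset, rather than discarding the correct one.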
Performance for Nondisambiguated Queries
In Figure 3 we have plotted the results of runs with a nondisambiguated
version of the queries, both for word sense indexing and synset indexing,
against the manually disambiguated collection (experiment 6). The synset
run performs approximately as well as the basic SMART run. It seems,
therefore, useless to apply conceptual indexing if no disambiguation of the
query is feasible. This is not a major problem in an interactive system that
may help the user to disambiguate the query, but it must be taken into
account if the process is not interactive and the query is too short to do
reliable disambiguation.
EUROWORDNET FEATURES FOR TEXT RETRIEVAL
The aim of the EWN project is to develop (semiautomatically) a
multilingual database resembling WordNet that stores semantic relations
between words in several languages of the European community: Dutch, Italian,
Spanish, English, French, and German. The project began in March 1996
and had a duration of 36 months; the first public release of EuroWordNet
was scheduled for Spring 1999.
The major feature of the EWN database, compared to WordNet 1.5, is
obviously its multilingual nature. We summarize here the multilingual
architecture of the database.
Monolingual wordnets. Each language has its individual wordnet with
internal relations that reflect specific properties of that language. However,
each monolingual wordnet is being built from a common set of 1,024 base
concepts (concepts that are relatively high in the semantic hierarchies and
that have many relations with other concepts). These have been verified
manually to fit all monolingual wordnets. This is one of the measures that
guarantees overlap and compatibility between wordnets, reducing spurious
mismatches in the hierarchy.
Interlingual index (ILI). A superset of all concepts occurring in the
monolingual wordnets. The ILI began as a collection of records that
matched WordNet 1.5 synsets, and is growing as new concepts are added.
Using WordNet 1.5 as a starting point for the ILI is just a pragmatic
decision, as it was already available and has a wide coverage. However, it will
also be modified with respect to WordNet 1.5, as too fine-grained sense
distinctions will be collapsed. Peters et al. (1998b) describe this process in
detail. All interlingual relations and language-independent information are
linked to the ILI, as explained below.
Cross-language relations. Each wordnet is linked to the ILI via
cross-language equivalence relations, namely:
cross-language synonymy:
    It: anitra EQ-NEAR-SYNONYM duck
cross-language hypernymy:
    Dutch: hoofd (human head) EQ-HAS-HYPERNYM head
cross-language hyponymy:
    Sp: dedo (finger or toe) EQ-HAS-HYPONYM finger
    Sp: dedo EQ-HAS-HYPONYM toe
Cross-language complex relations (hypernyms and hyponyms) indicate
potentially new ILI records. After each building stage, all complex relations
are collected and compared across languages and new ILI records will be
added if appropriate. These relations facilitate cross-language retrieval.
Top-concept ontology. A hierarchy of 63 language-independent concepts
reflecting explicit opposition relations (e.g., object versus substance). This
ontology is linked to the base concepts through the ILI (see Rodriguez et al.
1998).
Hierarchy of domain labels. Also linked to the ILI and thus inherited by
every monolingual wordnet.
But besides the multilingual nature of EWN, there are a number of
additional features (compared to WordNet 1.5) that are relevant from the
point of view of text retrieval:
• EuroWordNet will contain about 50,000 word meanings correlating the
20,000 most frequent words (only nouns and verbs in the first stage) in
each language. This size should be sufficient to experiment with generic,
domain-independent text retrieval in a multilingual setting without the
need for training with bilingual parallel corpora. The individual
monolingual databases will be considerably smaller than WordNet 1.5, but the
difference in coverage is only for specific subdomains; the coverage of
most frequent words and more generic terms will be similar in both
databases. The EWN database will be expanded to a higher level of detail for
one specific domain, in order to test its adequacy to incorporate
domain-specific thesauri.
• Synsets have domain labels that relate concepts on the basis of topics or
scripts rather than classification. This means that tennis shoes and tennis
racquets will be related through a common domain labeled tennis. Such
relations are very important for text retrieval and many other tasks,
including word-sense disambiguation.
• Nouns and verbs do not form separate networks. EuroWordNet includes
cross-part-of-speech relations:
  - noun-to-verb-hypernym: angling → catch (from angling: sport of catching
    fish with a hook and line)
  - verb-to-noun-hyponym: catch → angling
  - noun-to-verb-synonym: adornment → adorn (from adornment: the act of
    adorning)
  - verb-to-noun-synonym: adorn → adornment
Again, these relations establish links that are significant from the point of
view of text retrieval. In particular, adorn and adornment are nearly
equivalent for retrieval purposes, regardless of their different parts of speech.
Now we turn to our proposal to exploit the EWN database for
language-independent text retrieval.
LANGUAGE-INDEPENDENT TEXT RETRIEVAL WITH EWN
Our proposal, first introduced in Gilarranz et al. (1997), is to index
documents in terms of the ILI records (which, in practice, serve as a
language-independent ontology). The only difference from the monolingual approach
described above is that the indexing space is not exactly WordNet 1.5, but a
refined version of WordNet that serves as a link between all the individual
wordnets in the EWN database. Thus, the indexing space and its associated
tasks, such as weighting terms, become language-independent.
Two major processes have to be considered: document indexing and
query/document matching.
Document Indexing
Document indexing is performed in two stages: a language-dependent
one that maps terms to ILI records, and a language-independent one that
assigns weights to the representation.
Language-dependent Stage
1. Part-of-speech tagging. This is a first step toward disambiguation and
should not cause problems. Part-of-speech tagging can be performed with
more than 96% precision for many languages; see, for example, Brill
(1992) and Márquez and Padro (1997). Note that we do not assume
words with different categories have different meanings. Although it is a
necessary step for disambiguation, EuroWordNet cross-part-of-speech
relations may link close meanings that belong to different lexical
categories.
2. Term identification. This step includes stemming and reconstruction, and
the identification of multiwords. The detection of multiwords is known to
be beneficial to text retrieval tasks; WordNet is rich in multiword
information, thus offering a potential for retrieval refinement that should be
exploited. However, an appropriate treatment of multiwords from a
multilingual perspective is not at all simple.
As has been stated, the detection of lexicalized multiwords in a
monolingual setting can enhance precision. For instance, hot spring can be
identified in a document as a lexicalized multiword simply by inspecting
WordNet 1.5 entries. We can thus assign a single meaning to hot spring,
avoiding a separate inclusion of meanings for hot and spring, which
would not reflect the content of the document. Even when WordNet 1.5
includes nonlexicalized phrases such as a great distance or fasten with a
screw, it would seem helpful to use these in order to refine term
identification and matching for monolingual text retrieval. In fact, such
nonlexicalized phrases are very common in WordNet 1.5, which oscillates
between lexical and conceptual criteria when constructing the synsets.
However, with many such phrases the best solution is probably to
search for the lexically significant words in close co-occurrence, e.g., for
fasten near screw.
The handling of nonlexicalized phrases is not a simple task in the
cross-language setting, partly because the situation is not symmetric over
languages and this asymmetry frequently reflects important differences in
conceptualization between languages that must not be lost. Consider,
for instance, lexical items in one language that do not have equivalents in
another. In order to provide an exact translation equivalent, recourse is
normally made to a phrase. An example is toe, which does not have a
direct equivalent in Spanish. The closest lexical item is dedo, which means
finger or toe. Thus, going from one language to the other we appear to
lose information on specificity; a solution could be to introduce a
Spanish synset containing a phrase, even if it is not lexicalized, to describe
the concept in Spanish. The appropriate phrase in Spanish would be dedo
del pie (del pie = of the foot). However, we have to consider whether this
is the most correct way to deal with this kind of situation. When a
Spanish document is talking about toes, it will probably just use the term
dedo. A retrieval system looking for dedo del pie as a single bound item
could miss relevant information. The best solution is probably that
already suggested above for monolingual retrieval: to search for both
dedo and pie in close proximity and also just dedo; pie can be used as a
weight for document ranking.
The question of the treatment of multiwords and lexicalized/
nonlexicalized translation equivalents is one that affects other possible
applications of the EuroWordNet database. The decision taken by the
project has been to include only lexicalized concepts in each monolingual
wordnet. For CLTR, this means that we should look for cross-language
hyponyms or hypernyms when a lexical item does not have a lexicalized
equivalent in some target language.
3. Word-sense disambiguation. It is usually assumed that information
retrieval systems perform an implicit disambiguation when comparing
queries and documents, because the adequate senses for a term are
reinforced by the terms in the context (Krovetz & Croft, 1992; Sanderson,
1994). So how should we index in terms of ILI records? Is it better to
disambiguate with a certain error ratio, or can we assign all possible ILI
records for each word form? Would conceptual indexing improve
retrieval in a monolingual setting, or would it have only a subtle effect, as
previous experiments suggest? These and other issues have been
addressed in the experiments reported in the next section.
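The proximity strategy suggested under term identification (step 2) for nonlexicalized equivalents such as dedo del pie can be sketched as follows. This is only an illustration under our own assumptions: the function name, the window size, and the bonus value are hypothetical and were not tuned or used in any experiment.

```python
def proximity_score(tokens, head, modifier, window=3, bonus=0.5):
    """Score a tokenized document for a head term, boosting nearby modifiers.

    Every occurrence of `head` counts, so documents that only say "dedo"
    are still retrieved; `modifier` within `window` tokens adds a bonus
    that can be used for ranking.
    """
    score = 0.0
    for i, tok in enumerate(tokens):
        if tok != head:
            continue
        score += 1.0  # a plain occurrence of the head term still counts
        lo, hi = max(0, i - window), i + window + 1
        if modifier in tokens[lo:hi]:
            score += bonus  # modifier in close proximity refines the match
    return score

doc_a = "se rompió un dedo del pie jugando".split()
doc_b = "se rompió un dedo jugando".split()
score_a = proximity_score(doc_a, "dedo", "pie")
score_b = proximity_score(doc_b, "dedo", "pie")
```

Both documents match, but the one mentioning pie near dedo is ranked higher.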
Mapping into Inter Lingual Index. Once the terms in the documents
have been disambiguated in terms of the relevant monolingual wordnet, they
can be mapped to the Inter Lingual Index via cross-language equivalence
relations; in EuroWordNet, there will be at least one equivalence relation per
synset, ensuring a complete mapping.
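A minimal sketch of this mapping step is given below. The table of equivalence links is a toy stand-in for the EWN database, the identifiers of the form `ili:...` are hypothetical, and the relation names are those quoted earlier in this paper; the actual database format may differ.

```python
# Hypothetical equivalence links:
# (language, local synset) -> list of (relation, ILI record)
EQ_LINKS = {
    ("it", "anitra"): [("EQ-NEAR-SYNONYM", "ili:duck")],
    ("nl", "hoofd"): [("EQ-HAS-HYPERNYM", "ili:head")],
    ("es", "dedo"): [("EQ-HAS-HYPONYM", "ili:finger"),
                     ("EQ-HAS-HYPONYM", "ili:toe")],
}

def to_ili(language, synsets):
    """Map disambiguated local synsets to ILI records, keeping first-seen order."""
    records = []
    for synset in synsets:
        for _relation, ili_record in EQ_LINKS.get((language, synset), []):
            if ili_record not in records:
                records.append(ili_record)
    return records

italian_index = to_ili("it", ["anitra"])
spanish_index = to_ili("es", ["dedo"])
```

In this sketch all relation types are treated alike; a real system might weight records reached through complex relations (hypernymy, hyponymy) lower than those reached through synonymy.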
Language-independent Stage
Weighting. Using a classical vector-space model, synset weighting can
be done employing language-independent criteria. Standard weighting
schemes combine within-document term frequency (TF; a term is more
relevant in a document if it appears repeatedly) and inverse document
frequency (IDF; a term is more relevant if its frequency in the document is
significantly higher than its frequency in the collection). Such weighting
schemes (nnn, atc, etc.) can be rendered language-independent when
WordNet synsets are used as indexing terms for the documents in each
language.
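To illustrate why the weighting becomes language-independent, the following sketch computes a plain tf·idf weight over ILI records for a toy collection mixing documents of different languages. It is not the SMART atc scheme used in our runs; the function name and the record identifiers are assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of lists of ILI record ids (one list per document,
    whatever its source language). Returns one {record: weight} dict
    per document, using weight = tf * log(N / df)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each record once per document
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({rec: count * math.log(n_docs / doc_freq[rec])
                        for rec, count in tf.items()})
    return vectors

english_doc = ["ili:duck", "ili:duck", "ili:head"]  # indexed from English text
spanish_doc = ["ili:finger", "ili:head"]            # indexed from Spanish text
vectors = tfidf_vectors([english_doc, spanish_doc])
```

Since both documents are expressed in the same index space, document frequencies are computed over the whole multilingual collection, and a record occurring in every document (here `ili:head`) receives zero weight regardless of language.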
Besides standard weighting, depth in the conceptual hierarchy can also
be used to weight synsets, as synsets deeper in the hierarchy are more
specific and therefore more informative. It follows that the uppermost synsets
are the least informative and can probably be removed, thus providing a list
of stop synsets. This is an interesting possibility provided by the WordNet
hierarchy, but its effectiveness has to be carefully evaluated, as this may well
depend on the homogeneity of the database. It is known that the WordNet
hierarchy is not well balanced and thus a simple measure of hierarchical
depth might not be reliable for weighting. The building strategy used for the
EuroWordNet database is expected to provide a more evenly balanced
hierarchy (Rodriguez et al., 1998), but only an evaluation of the final database
will be able to guarantee this.
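A depth-based weighting with stop synsets could look like the sketch below. It is a simplification under stated assumptions: each synset is given a single hypernym (the real hierarchy allows several), the toy hierarchy and the threshold `min_depth` are hypothetical, and no such weighting was evaluated in our experiments.

```python
def depth(synset, hypernym_of):
    """Number of hypernym links between a synset and the top of its hierarchy."""
    d = 0
    while synset in hypernym_of:
        synset = hypernym_of[synset]
        d += 1
    return d

def depth_weights(synsets, hypernym_of, min_depth=1):
    """Weight synsets by depth; synsets shallower than min_depth act as
    stop synsets and are dropped."""
    weights = {}
    for s in synsets:
        d = depth(s, hypernym_of)
        if d >= min_depth:
            weights[s] = d
    return weights

# Toy hierarchy: child -> hypernym.
hypernym_of = {"animal": "entity", "bird": "animal", "duck": "bird"}
weights = depth_weights(["entity", "animal", "duck"], hypernym_of, min_depth=2)
```

Here entity and animal are discarded as stop synsets and duck keeps a weight equal to its depth; in an unbalanced hierarchy such raw depths would not be comparable across branches, which is the caveat raised above.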
The same process will be applied to queries, although performing
disambiguation is more difficult because queries are very short compared with
documents and thus offer little contextual information.
Query/Document Matching
We will experiment with three approaches to query/document
comparison. Each approach adds some information to the previous one:
a. Cosine comparison. As formally we have a classical vector model, we use
classical cosine comparison as a baseline. Thus we can evaluate
separately the impact of the indexing process and the methods for comparison,
as has been discussed.
b. Weighted expansion. The vector can be expanded, still in a
language-independent manner, by including related ILI records. The first
candidates are cross-POS synonyms, which usually have strongly related
meanings (see previous sections). Meronyms also seem to be good
candidates, as they are likely to appear in context. However, we are aware that
expansions beyond synonymy are not guaranteed to improve
performance, and so careful evaluation of all kinds of expansion is required.
c. Measure of semantic relatedness. Instead of simply matching identical
concepts, it is possible to measure the semantic relatedness of query and
document indexing concepts. A similar approach gave good results for
monolingual retrieval in Smeaton and Quigley (1996). In addition, the
domain labels could be used to score occurrences of words related to the
same topics.
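Approaches (a) and (b) can be sketched together as follows. This is an illustrative sketch rather than the implementation used in our runs: the expansion factor of 0.5, the `related` table, and the record identifiers are hypothetical choices.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def expand(vector, related, factor=0.5):
    """Add related ILI records (e.g. cross-POS synonyms) at reduced weight."""
    expanded = dict(vector)
    for rec, weight in vector.items():
        for other in related.get(rec, []):
            expanded[other] = max(expanded.get(other, 0.0), factor * weight)
    return expanded

# Hypothetical cross-POS synonymy link between a verb and a noun record.
related = {"ili:adorn": ["ili:adornment"]}
query = {"ili:adorn": 1.0}
document = {"ili:adornment": 1.0}

plain = cosine(query, document)
with_expansion = cosine(expand(query, related), document)
```

Without expansion the query and the document share no record and the similarity is zero; after expansion the cross-POS synonym produces a nonzero, but discounted, match.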
Summing Up
This proposal for cross-language text retrieval has attractive advantages
over other techniques:
• It performs language-independent indexing, providing
  - a semantic structure to perform explicit WSD for indexing;
  - language-independent weighting criteria.
• It permits language-independent retrieval, by
  - concept comparison rather than term comparison;
  - topic comparison.
• It does not require training or the availability of parallel corpora (a great
advantage when thinking of more than two languages or when performing
retrieval on unrestricted texts, such as WWW searches).
• The EuroWordNet architecture seems better suited, a priori, for text
retrieval than WordNet 1.5:
  - words can be conceptually related even if they have different POS;
  - besides classification relations, synsets also have topic information
    (domain labels), which is especially useful for text retrieval.
BILINGUAL EXPERIMENTS WITH EUROWORDNET
In any cross-language setting for text retrieval, some kind of query
translation from the query language to the document language is required. When
going from the query to the target language, query expansion techniques
with bilingual dictionaries introduce a genuine cross-language mechanism
that degrades retrieval effectiveness. However, in our approach, indexing
with many languages does not involve any additional operation beyond the ones
performed in our monolingual experiments above. It is reasonable to expect
that the degradation in our framework when going to a cross-language
scenario should be less pronounced than for query expansion techniques.
At the time of this research, the EuroWordNet database was not
yet available. The Spanish wordnet covers only nouns and verbs (with a
limited amount of noun-to-verb relations), has not reached its full coverage,
and needs further refining and filtering. However, it is interesting to perform
cross-language retrieval experiments in order to understand how the
retrieval process works and which features of the database must be
improved or newly considered. It also permits one to establish a qualitative
reference framework to evaluate its potential for cross-language retrieval,
as compared to other approaches.
English-Spanish Test Collection
We have experimented with Spanish queries to retrieve English
documents. To do so, we have prepared manual translations (into Spanish) of the
171 IR-Semcor summaries described previously. Then we have manually
indexed the occurrences of nouns and verbs in terms of the Spanish
WordNet. When the appropriate sense of a term was missing in the Spanish
WordNet, we added it manually and linked it to the EWN interlingual
index.
Spanish-English Experiments
The main experiment has been using the Spanish queries to retrieve
English documents, indexing queries and documents in terms of the
EuroWordNet Inter Lingual Index. As we currently have only nouns and verbs
in the Spanish database, we have used only these to index the Spanish queries.
In order to evaluate the results, we have performed some complementary
experiments.
Dictionary-based term translation. We performed a naive dictionary-based
translation of the terms in the Spanish query, using the VOX-Harraps
Spanish-English Dictionary, picking up every possible English translation
for each term. The VOX-Harraps contains around 28,000 entries in its
Spanish-English version; if a word was not included in the dictionary, it was
not considered, except for proper nouns, which were manually translated
into their English equivalents. The queries were then used in a normal
SMART run with the text database. This run is used as a baseline against
EWN-based retrieval, being the simplest term translation of the query.
Monolingual retrieval with nouns and verbs. In order to discriminate
possible sources of cross-language degradation, we have used only nouns and
verbs in the English queries for retrieval, to compare this result directly with
the Spanish-English experiment.
Retrieval with different POS. Finally, as the relative coverage of each
open-class word will not be the same for English and the rest of languages,
we have complemented the previous experiment using different combinations
of word classes in the queries, in order to know which classes are more
relevant for retrieval.
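The naive dictionary-based baseline described above (every possible translation is kept, out-of-dictionary words are dropped, proper nouns are translated by hand) can be sketched as follows. The toy dictionary entries and the function name are hypothetical; they do not reproduce the VOX-Harraps dictionary.

```python
def translate_query(terms, dictionary, proper_nouns=None):
    """Replace each source term by all of its dictionary translations.

    dictionary   -- hypothetical mapping: source term -> list of translations
    proper_nouns -- manually supplied translations for proper nouns
    """
    proper_nouns = proper_nouns or {}
    translated = []
    for term in terms:
        if term in proper_nouns:
            translated.append(proper_nouns[term])
        else:
            # Words missing from the dictionary are silently dropped.
            translated.extend(dictionary.get(term, []))
    return translated

toy_dictionary = {"dedo": ["finger", "toe"], "banco": ["bank", "bench"]}
english_terms = translate_query(
    ["dedo", "banco", "Londres", "xyz"],
    toy_dictionary,
    proper_nouns={"Londres": "London"},
)
```

The example makes the source of degradation visible: every ambiguous term injects all of its translations into the query, relevant or not.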
Results
Again, the most relevant figures are the number of documents correctly
retrieved as the most relevant document for their summary, which are
displayed separately in Table 2. These are the main results.
Cross-Language Degradation of Synset Indexing
In Figure 4, monolingual and cross-language synset indexing are
compared to measure the degradation of our EWN-based approach to text
retrieval in a cross-language setting, comparing queries where only nouns
and verbs are processed. The monolingual experiment gives 60.2% of the
appropriate documents retrieved in first place, while the cross-language one
gives 48% (the 62.0% monolingual figure in Table 2 corresponds to indexing
all word classes). This represents a 20% degradation, which is a promising
result: note that comparing (Table 2) the monolingual SMART run with the
dictionary-based term translation (48% against 24%) gives a 50%
degradation, which is a standard behavior for naive cross-language retrieval.
However, this 20% cannot be explained in terms of translation
ambiguity, as both English and Spanish queries are manually disambiguated. We
analyze this result in the section Analysis of Translated Queries.
TABLE 2 Bilingual Experiments
Experiment                                    % correct document retrieved in first place
Bilingual indexing by synsets                 48.0
Monolingual indexing by synsets               62.0
Dictionary-based term translation             24.0
Monolingual indexing by words (basic SMART)   48.0
FIGURE 4. Cross-language degradation of synset indexing.
Comparison to Dictionary-Based Term Translation
The performance of synset indexing and dictionary-based term
translation is compared in Figure 5. Indexing in terms of the Inter Lingual Index
improves cross-language retrieval effectiveness from 24% to 48%, which
represents a 100% improvement over dictionary-based term translation, even
matching the monolingual results of the standard SMART run. These results
strongly suggest that language-neutral indexing in terms of the EWN Inter
Lingual Index may improve cross-language effectiveness provided that this
indexing can be performed automatically with enough precision. We are
currently testing WSD algorithms in this CLTR environment to find out what
‘‘enough precision’’ means for our retrieval task, beyond the experiments
described in this paper.
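The percentages quoted in this section are relative changes with respect to a reference run. The short computation below, given only as a worked check of the arithmetic, reproduces them from the figures in Tables 2 and 3; the function name is ours.

```python
def relative_change(reference, value):
    """Relative change of `value` with respect to `reference`, in percent."""
    return 100.0 * (value - reference) / reference

# Monolingual vs. cross-language synset indexing (nouns and verbs only).
synset_degradation = relative_change(60.2, 48.0)
# Monolingual SMART vs. dictionary-based term translation.
dictionary_degradation = relative_change(48.0, 24.0)
# Dictionary-based term translation vs. cross-language synset indexing.
synset_gain = relative_change(24.0, 48.0)
```

The three values come out at roughly -20%, -50%, and +100%, which are the 20% degradation, the 50% degradation, and the 100% improvement discussed in the text.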
It must be noted that, as with the monolingual experiments described earlier,
these results are obtained for the simplest approach to synset indexing. We
have considered only nouns and verbs, and we have not used any semantic
relation other than synset membership (ignoring hyponymy/hypernymy and
cross-part-of-speech relations) of the query. There are many ways in which
these results can be further refined, in order to improve retrieval and, at the
same time, tune the design of our multilingual database for retrieval
purposes.
FIGURE 5. Comparison between cross-language retrieval strategies.
Relevance of Different Word Classes
The precision/recall figures for the monolingual experiments with
different word classes are represented in Figure 6 and the most significant data in
Table 3. Retrieving only with the nouns in the queries gives 59.6% of the
correct documents in first place, while retrieving with all synsets (nouns,
verbs, adjectives, and adverbs) gives 62%. Apparently, treating only nouns
gives a good first approximation for retrieval. It must be noted, however, that the
rest of open-class words (verbs, adjectives, and adverbs) are also meaningful
for retrieval, but they are simply less frequent. In Table 4, the retrieval
performance of each word class is compared to its frequency in the queries.
The ratio of documents correctly retrieved to the mean number of
occurrences per query is very similar for nouns and verbs, and it is
higher for adjectives.
FIGURE 6. Performance of synset indexing with different word classes.
TABLE 3 Experiments with Different Word Classes
Experiment                               % correct document retrieved in first place
Monolingual, all classes                 62.0
Monolingual, only nouns                  59.6
Monolingual, only adjectives             37.4
Monolingual, only verbs                  21.6
Monolingual, only adverbs                0.1
Monolingual, all classes except nouns    40.4
Monolingual, nouns and verbs             60.2
Cross-Language, nouns and verbs          48.0
Cross-Language, only nouns               46.2
Cross-Language, only verbs               16.4
TABLE 4 Retrieval Effectiveness with Different Word Classes in English
Word class    Words per query    % docs. correctly retrieved
nouns         6.1                59.6
adjectives    2.5                37.4
verbs         2.2                21.6
adverbs       0.37               0.1
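The comparison between retrieval performance and frequency in Table 4 can be made explicit with a small computation, shown here only as a worked check on the table values (the variable names are ours):

```python
# Table 4: word class -> (mean words per query, % docs correctly retrieved)
table4 = {
    "nouns": (6.1, 59.6),
    "adjectives": (2.5, 37.4),
    "verbs": (2.2, 21.6),
    "adverbs": (0.37, 0.1),
}

# Percentage of correctly retrieved documents per occurrence in the query.
ratios = {cls: pct / words for cls, (words, pct) in table4.items()}
```

Nouns and verbs both yield a ratio slightly below 10, while adjectives reach about 15, which is the sense in which adjectives contribute more per occurrence.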
The results for the bilingual experiments with nouns and verbs present a
similar pattern: nouns give 46.2%, verbs 16.4%, and nouns with verbs 48%.
However, adjectives possibly play a more important role in Spanish than in
English, as what are noun compounds in English are often expressed as
nouns with adjectives in Spanish. Thus, relations between adjectives and
nouns may be necessary to identify phrases, which is crucial for
cross-language retrieval.
Analysis of Translated Queries
The 20% degradation for our cross-language experiment with synsets
cannot be explained in terms of cross-language ambiguity, as both sets of
queries are manually disambiguated. Thus, it is interesting to examine the
correlation between English queries and their translations to Spanish, to
find out the relevant sources of mismatches.
Table 5 shows the percentage of overlapping Inter Lingual Index records
between English and Spanish summaries. For nouns, 63% of the synsets
present in the English summaries appeared in their Spanish counterparts.
For verbs, this percentage is even lower. A small part of the mismatches may
be due to annotation errors, but after a manual inspection of selected
queries we have found, as the most important sources of mismatches, the
excessive fine-grainedness of the interlingual index and terms shifting
category from one language to another.
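The overlap measure reported in Table 5 can be sketched as the share of the ILI records of an English summary that also occur in its Spanish translation. The record identifiers below are hypothetical and only mimic the increase/incremento mismatch discussed in the next subsection.

```python
def overlap_percentage(english_ili, spanish_ili):
    """Percentage of the English summary's ILI records also present
    in the Spanish summary."""
    english, spanish = set(english_ili), set(spanish_ili)
    if not english:
        return 0.0
    return 100.0 * len(english & spanish) / len(english)

# Hypothetical annotations of one summary and its translation.
english_summary = ["ili:debate", "ili:increase#act", "ili:education"]
spanish_summary = ["ili:debate", "ili:increase#event", "ili:education"]
overlap = overlap_percentage(english_summary, spanish_summary)
```

A single sense mismatch between annotators is enough to lower the overlap of this short summary to about two thirds.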
Too much Fine-Grainedness of the Inter Lingual Index
Actually, the EWN Inter Lingual Index is still very close to WordNet
1.5, and thus the sense distinctions are too fine-grained, even for a human
annotator. This is not a genuine cross-language problem, but pervades every
annotation with WordNet 1.5. For instance, in ‘‘Debate on the increase of
federal aids for education in Georgia,’’ the term increase was annotated in its
‘‘act’’ sense during the manual annotation of English queries. But during the
annotation of the Spanish translations of the queries, the equivalent term,
incremento, was indexed in its event sense, thus pointing to a different
interlingual index record. Unfortunately, this is a very common situation,
especially for verbs.
TABLE 5 Index Overlapping of English Summaries with Spanish Summaries
                                 Nouns    Nouns + verbs    All
% synsets in Spanish summary     63       60               45
% synsets only in English        37       40               55
This problem interferes with a different one, namely, the different
coverage and granularity of the Spanish WordNet, compared to the English one.
The Spanish WordNet has fewer senses. When the appropriate sense was not
found in the database, it was manually introduced during the annotation
process. But many times, the senses already in the database were good
enough. As there were more refined senses for English, the chance of getting
the same annotation was lower.
These kinds of mismatches should be less relevant when the ongoing
sense-clustering task for WordNet 1.5 is accomplished (Peters et al., 1998a).
Up to now, the following clusterings have been considered:
• sisters, word senses that share the same hypernym, such as table in the
sense of ‘‘piece of furniture’’ and table in the sense of ‘‘piece of furniture for
a meal laid out on it’’.
• autohyponyms, words whose senses are each other's direct hypernyms or
hyponyms, such as variety in the sense of ‘‘specific kind of something’’ and
in the sense of ‘‘category of things distinguished by some common
quality.’’
• twins, synsets that have at least three members in common. For instance,
{violate, fail to agree with, go against, break, be in violation of} and
{violate, go against, breach, break, be in violation of}.
• cousins, pairs of top nodes whose hyponyms exhibit a specific relation to each
other. For instance, hyponyms of the container node and containerful,
such as bag, cup, spoon, etc.
• systematic polysemic patterns as they have been identified in the CoreLex
database (Buitelaar, 1998).
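Of the clustering criteria listed above, the twins criterion is simple enough to state as a short check. The sketch below applies it to the example synsets given in the list; the function name is ours, and it is not taken from Peters et al. (1998a).

```python
def are_twins(synset_a, synset_b, min_shared=3):
    """Two synsets are twins if they share at least `min_shared` members."""
    return len(set(synset_a) & set(synset_b)) >= min_shared

a = {"violate", "fail to agree with", "go against", "break", "be in violation of"}
b = {"violate", "go against", "breach", "break", "be in violation of"}
shared = a & b
twins = are_twins(a, b)
```

The two example synsets share four members (violate, go against, break, be in violation of) and therefore qualify as twins.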
However, cases such as the increase example above are not covered by any
of these phenomena. The text retrieval testbed is an excellent way, then, of
testing whether the clustering is effective or not. With increase and many
other cases, the problem is that meanings that are identical for text retrieval
are found in totally different parts of the hierarchy. It seems necessary to
detect polysemy regularities related to ontological distinctions to handle
these cases.
Terms Shifting Category
Apart from disambiguation problems, the most pronounced effect is the
translation of English noun compounds into Spanish adjectives. For
instance, ‘‘bone growth centers’’ is translated into ‘‘centros de crecimiento
óseo,’’ where óseo is the Spanish adjective for ‘‘pertaining to the bone.’’ In
this way, we are currently losing the most significant word in the expression
‘‘bone growth centers,’’ because we still haven't considered adjectives in the
Spanish database and, of course, we do not have cross-POS relations
between adjectives and nouns yet. This can be a very significant problem, as
it is related to the proper matching of phrases across languages, which has
proven to be essential for cross-language text retrieval (Ballesteros & Croft,
1998). This fact has led us to give more attention to adjectives in EWN
than was previously foreseen.
In general, there are also shifts from nouns to verbs and vice versa, due
to rephrasing in the translation. Such effects should not be a problem if the
noun, verb, and adjective hierarchies are highly interconnected, and thus it
should improve with the forthcoming version of the EWN database.
CONCLUSIONS
We have presented a novel approach to cross-language text retrieval.
The rationale behind this is to perform conceptual indexing and retrieval of
documents in a space of language-independent concepts exploiting the
EuroWordNet large-scale multilingual semantic database.
Previous work on monolingual settings had reported empirical evidence
of the limited gain of retrieval performance when adding semantic
information in standard IR processes. However, we found that such results fail to
distinguish two related but different issues: indexing strategies on one hand
and problems of word sense disambiguation on the other.
To investigate the viability of our approach, we have first built a
collection of fully disambiguated documents and queries (adapted from Semcor) to
evaluate empirically the effect of concept indexing in contrast with standard
text retrieval techniques. Then we have designed and performed a number of
experiments, in order to test and compare a variety of strategies. The
experiments not only show a clear improvement in the performance when indexing
by synsets, but also establish a potential range of disambiguation errors
within which this approach can still enhance standard retrieval results.
Finally, we have performed a preliminary experiment on cross-language
text retrieval using Spanish queries against English documents. Although the
EWN database is far from its final form, indexing by Inter Lingual Index
records gives much better results than a naive dictionary-based term
translation run used as a baseline. The degradation of the synset indexing
approach when going from English-English to Spanish-English retrieval is
20%, which is very promising considering all the problems that still have to
be solved concerning the coverage and quality of the database.
Although there is still much work to be done, we believe that these
results are strong evidence in favor of using multilingual ontologies for
cross-language text retrieval. In particular, they seem very appropriate for
multilingual searches over heterogeneous domains (such as WWW
searches), where the lack of adequate multilingual parallel corpora makes
corpus-based approaches hard to apply.
APPENDIX
We transcribe here an example from our IR-Semcor collection. The text
is the sixth fragment of Semcor document br-c01, which we have divided into
six fragments corresponding to six different reviews of movies, concerts, etc.
It is truly odd and ironic that the most handsome and impressive
film yet made from Miguel_de_Cervantes' ``Don_Quixote'' is the
brilliant Russian spectacle, done in wide_screen and color, which
opened yesterday at the Fifty-fifth_Street and
Sixty-eighth_Street_Playhouses.
More_than a beautiful visualization of the illustrious adventures
and escapades of the tragi-comic knight-errant and his squire,
Sancho_Panza, in seventeenth century Spain, this inevitably
abbreviated rendering of the classic satire on chivalry is an
affectingly warm and human exposition of character.
Nikolai_Cherkasov, the Russian actor who has played such heroic
roles as Alexander_Nevsky and Ivan_the_Terrible, performs the lanky
Don_Quixote, and does so with a simple dignity that bridges the inner
nobility and the surface absurdity of this poignant man.
His addle-brained knight-errant, self appointed to the ridiculous
position in an age when armor had already been relegated to museums
and the chivalrous code of knight-errantry had become a joke, is, as
Cervantes no_doubt intended, a gaunt but gracious symbol of good,
moving soberly and sincerely in a world of cynics, hypocrites and
rogues.
Cherkasov does not caricature him, as some actors have
been_inclined to do. He treats this deep-eyed, bearded, bony crackpot
with tangible affection and respect. Directed by Grigory_Kozintsev in
a tempo that is studiously slow, he develops a sense of a high
tradition shining brightly and passing gravely through an impious
world.
The complexities of communication have been considerably abetted
in this case by appropriately stilted English_language that has been
excellently dubbed in_place_of the Russian dialogue.
The voices of all the characters, including that of Cherkasov,
have richness, roughness or color to conform with the personalities.
And the subtleties of the dialogue are most helpfully conveyed. Since
Russian was being spoken instead of Spanish, there is no violation of
artistry or logic here.
Splendid, too, is the performance of Yuri_Tolubeyev, one of
Russia's leading comedians, as Sancho_Panza, the fat, grotesque
``squire''. Though his character is broader and more comically rounded
than the don, he gives it a firmness and toughness - a sort_of peasant
dignity - too. It is really as though the Russians have seen in this
character the oftentimes underlying vitality and courage of supposed
buffoons.
The episode in_which Sancho_Panza concludes the joke that is played
on him when he is facetiously put in command of an ``island'' is one of
the best in the film.
True, the pattern and flow of the drama have strong literary
qualities that are a_bit wearisome in the first_half, before
Don_Quixote goes to the duke's court. But strength and poignancy
develop thenceforth, and the windmill and deathbed episodes gather the
threads of realization of the wonderfulness of the old_boy.
There are other good representations of peasants and people of the
court by actors who are finely costumed and magnificently photographed
in this last of the Russian films to reach this country in the program
of joint cultural exchange.
Also on the bill at the Fifty-fifth_Street is a nice ten minute
color film called ``Sunday_in_Greenwich_Village'', a tour of the haunts
and joints.
The summary used as query is:
A new Russian film based on the novel Don_Quixote turns_out to be the
most impressive rendering of this classic.
As an example of indexing by word senses, the summary is indexed as:
new%3:00:00:: russian%3:01:00:: film%1:10:01:: base%2:31:00::
novel%1:10:00:: don_quixote%1:18:01:: turn_out%2:42:01::
be%2:42:03:: most%4:02:01:: impressive%3:00:00::
rendering%1:04:01:: classic%1:06:00::
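The sense identifiers above follow the WordNet sense key layout, lemma%ss_type:lex_filenum:lex_id:head_word:head_id, where ss_type values 1 to 5 stand for noun, verb, adjective, adverb, and adjective satellite. The following is a minimal sketch of how such keys can be split into fields; SenseKey and parse_sense_key are illustrative names chosen here, not part of any WordNet distribution.

```python
from typing import NamedTuple

# Mapping of the ss_type digit in a WordNet sense key to a part-of-speech label.
SS_TYPES = {1: "noun", 2: "verb", 3: "adjective", 4: "adverb", 5: "adjective satellite"}


class SenseKey(NamedTuple):
    lemma: str
    pos: str
    lex_filenum: int
    lex_id: int
    head_word: str
    head_id: str


def parse_sense_key(key: str) -> SenseKey:
    """Split a sense key such as 'film%1:10:01::' into its fields.

    Stray spaces (as introduced by text extraction) are removed first.
    """
    key = key.replace(" ", "")
    lemma, rest = key.split("%", 1)
    ss_type, lex_filenum, lex_id, head_word, head_id = rest.split(":")
    return SenseKey(lemma, SS_TYPES[int(ss_type)], int(lex_filenum), int(lex_id),
                    head_word, head_id)


for key in ["film%1:10:01::", "turn_out%2:42:01::", "most%4:02:01::"]:
    print(parse_sense_key(key))
```

The head_word and head_id fields are empty for every key in this example, since they are only filled for adjective satellites.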
Indexing the summary by WordNet synset identifiers gives:
a01256444 a02130100 n04323474 v00358438 n04198190 n05837444 v01490020
v01472320 r00055410 a00973705 n00056790 n01993371
where, for instance, n04323474 is a unique identifier for the WordNet
synset:
<noun.communication> {movie, film1, picture2, moving picture, motion
picture, picture show, flick} -- (a form of entertainment provided by
a sequence of images giving the illusion of continuous movement; ``they
went to a movie every Saturday night'')
The Spanish translation is:
Una nueva película rusa basada en la novela Don_Quijote resulta ser
la versión más impresionante de este clásico.
and is indexed by synsets:
n04323474 v00358438 n02818521 n05837444 v01489871 v01506899 n04220780
n01993371
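Because both versions of the query are indexed in the same space, they can be compared directly without translating any term. The sketch below simply intersects the two identifier lists transcribed above. It illustrates language-independent matching under a strong simplification (set overlap, no term weighting) and is not the retrieval model used in the experiments; the helper names split_id and shared_units are hypothetical.

```python
# Identifier lists copied from the example above.
english_query = (
    "a01256444 a02130100 n04323474 v00358438 n04198190 n05837444 v01490020 "
    "v01472320 r00055410 a00973705 n00056790 n01993371"
).split()
spanish_query = (
    "n04323474 v00358438 n02818521 n05837444 v01489871 v01506899 n04220780 "
    "n01993371"
).split()


def split_id(synset_id):
    """Split an identifier such as 'n04323474' into (part-of-speech letter, numeric offset)."""
    return synset_id[0], int(synset_id[1:])


def shared_units(a, b):
    """Units of a that also occur in b, in the order they appear in a."""
    b_set = set(b)
    return [u for u in a if u in b_set]


common = shared_units(spanish_query, english_query)
print(common)  # ['n04323474', 'v00358438', 'n05837444', 'n01993371']
print(len(common), "of", len(spanish_query), "Spanish query identifiers also index the English query")
print(split_id("n04323474"))  # ('n', 4323474)
```

Four of the eight identifiers assigned to the Spanish query coincide with identifiers of the English query, including the one for the {movie, film, ...} synset.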
The dictionary-based term translation gives:
film base be_based novel don_quijote result turn_out_to_be come_out
be belong come_from be being core version classic classical classic
REFERENCES
Ballesteros, L., and B. Croft. 1996. Dictionary-based methods for cross-lingual information retrieval. In
Proc. of the 7th International DEXA Conference on Database and Expert Systems Applications, pp.
791-801.
Ballesteros, L., and B. Croft. 1998. Resolving ambiguity for cross-language retrieval. In Proceedings of
SIGIR’98.
Brill, E. 1992. A simple rule-based part of speech tagger. In Proceedings of the Third Conference on
Applied Natural Language Processing.
Buitelaar, P. 1998. CoreLex: Systematic polysemy and underspecification. Ph.D. thesis, Department of
Computer Science, Brandeis University, Boston, MA.
Carbonell, J., Y. Yang, R. Frederking, R. Brown, Y. Geng, and D. Lee. 1997. Translingual information
retrieval. In Proceedings of IJCAI’97.
Dumais, S., T. Landauer, and M. Littman. 1996. Automatic cross-linguistic information retrieval using
latent semantic indexing. In Working Notes of the Workshop on Cross-Linguistic Information
Retrieval, ACM SIGIR’96, pp. 16-23.
Gilarranz, J., J. Gonzalo, and M. Verdejo. 1997. An approach to cross-language text retrieval with the
EuroWordnet semantic database. In AAAI Spring Symposium on Cross-Language Text and Speech
Retrieval, pp. 49-55. AAAI Press SS-97-05.
Grefenstette, G. 1998. The problem of cross-language information retrieval. In Cross-Language Information
Retrieval. Kluwer AP.
Harman, D. K. 1993. The first text retrieval conference (TREC-1). Inform. Process. Management
29(4):411-414.
Hull, D., and G. Grefenstette. 1996. Querying across languages: A dictionary-based approach to multilingual
information retrieval. In Proc. of the 19th ACM SIGIR Conference, pp. 49-57.
Krovetz, R. 1997. Homonymy and polysemy in information retrieval. In Proceedings of ACL/EACL’97.
Krovetz, R., and W. Croft. 1992. Lexical ambiguity and information retrieval. ACM Trans. Inform.
System 10(2):115-141.
Landes, S., C. Leacock, and R. Tengi. 1998. Building semantic concordances. In WordNet: An Electronic
Lexical Database. MIT Press.
Márquez, L., and L. Padró. 1997. A flexible POS tagger using an automatically acquired language model.
In Proceedings of ACL/EACL’97.
Miller, G. 1990. Special issue. Wordnet: An on-line lexical database. International J. Lexicography, 3(4).
Miller, G., M. Chodorow, S. Landes, C. Leacock, and R. Thomas. 1994. Using a semantic concordance
for sense identification. In Proceedings of the ARPA Human Language Technology Workshop.
Miller, G. A., C. Leacock, R. Tengi, and R. Bunker. 1993. A semantic concordance. In Proceedings of the
ARPA Workshop on Human Language Technology. Morgan Kaufmann.
Ng, H. T. 1997. Exemplar-based word sense disambiguation: Some recent improvements. In Proceedings
of the Second Conference on Empirical Methods in NLP.
Oard, D. 1997. Alternative approaches for cross-language text retrieval. In AAAI Spring Symposium on
Cross-Language Text and Speech Retrieval. AAAI Press SS-97-05.
Peters, W., I. Peters, and P. Vossen. 1998a. Automatic sense clustering in EuroWordNet. In Proceedings
of the First International Conference on Language Resources and Evaluation.
Peters, W., P. Vossen, P. Díez-Orzas, and G. Adriaens. 1998b. The multilingual design of the EuroWordnet
database. In Computers and the humanities, Special Issue on EuroWordNet.
Picchi, E., and C. Peters. 1996. Cross language information retrieval: A system for comparable corpus
querying. In ed. G. Grefenstette, Working Notes of the Workshop on Cross-Linguistic Information
Retrieval, ACM SIGIR’96, pp. 24-33.
Richardson, R., and A. Smeaton. 1995. Using Wordnet in a knowledge-based approach to information
retrieval. In Proceedings of the BCS-IRSG Colloquium, Crewe.
Rodriguez, H., S. Climent, P. Vossen, L. Bloksma, A. Roventini, F. Bertagna, A. Alonge, and W. Peters.
1998. The top-down strategy for building EuroWordnet: Vocabulary coverage, base concepts and
top ontology. In Computers and the humanities, Special Issue on EuroWordNet.
Salton, G., ed. 1971. The SMART retrieval system: Experiments in automatic document processing.
Prentice-Hall.
Sanderson, M. 1994. Word sense disambiguation and information retrieval. In Proceedings of 17th International
Conference on Research and Development in Information Retrieval.
Sheridan, P., and J. Ballerini. 1996. Experiments in multilingual information retrieval using the spider
system. In Proc. of the 19th ACM SIGIR Conference, pp. 58-65.
Smeaton, A., F. Kelledy, and R. O’Donnell. 1995. TREC-4 experiments at Dublin City University:
Thresholding posting lists, query expansion with Wordnet and POS tagging of Spanish. In Proceedings
of TREC-4.
Smeaton, A., and A. Quigley. 1996. Experiments on using semantic distances between words in image
caption retrieval. In Proceedings of the 19th International Conference on Research and Development
in IR.
Voorhees, E. M. 1994. Query expansion using lexical-semantic relations. In Proceedings of the 17th
Annual International ACM-SIGIR Conference on Research and Development in Information
Retrieval.
Vossen, P. 1998. Introduction to EuroWordnet. In Computers and the humanities, Special Issue on
EuroWordNet.