
Automatic Detection of Thesaurus Relations for
Information Retrieval Applications
Gerda Ruge
Computer Science, Technical University of Munich
Abstract. Is it possible to discover semantic term relations useful for
thesauri without any semantic information? Yes, it is. A recent approach
to automatic thesaurus construction is based on explicit linguistic knowledge, i.e. a domain independent parser without any semantic component,
and on implicit linguistic knowledge contained in large amounts of real world
texts. Such texts implicitly contain the linguistic, especially semantic,
knowledge that their authors needed in order to formulate them. This article explains how this implicit semantic knowledge can be made
explicit. Evaluations of the quality and performance of the approach are
very encouraging.
1 Introduction
In most cases, when the expression information retrieval is used, text retrieval
is meant. Information retrieval systems manage large amounts of documents.
A database containing 1,000,000 documents is a normal case in practice. Some
database providers, e.g. NEXUS, have to deal with millions of documents a week.
The special purpose that retrieval systems are designed for is the search for
relevant items with respect to an information need of a user. This would ideally
be realized by a system that understands the question of the user as well as the
content of the documents in the database - but this is far from the state
of the art.
The requirements of information retrieval systems - domain independence, efficiency, robustness - force them to work very superficially in the case of large
databases. The search is usually based on so-called terms. The terms are the
searchable items of the system. The process of mapping documents to term representations is called indexing. In most retrieval systems, the index terms are
all words in the documents with the exception of stopwords like determiners,
prepositions or conjunctions. A query, i.e. the search request, then consists of
terms; and the documents in the result set of the retrieval process are those
that contain these query terms. Most of the work in retrieval research in the last
decades has concentrated on refining this term based search method. One
of these refinements is to use a thesaurus.
A thesaurus in the field of Information and Documentation is an ordered compilation of concepts which serves for indexing and retrieval in one documentation domain. A central point is not only to define terms but also relations
between terms [?]. Such relations are synonymy ("container" - "receptacle"),
broader terms or hyperonyms ("container" - "tank"), narrower terms or hyponyms ("tank" - "container"), the part-of relation ("car" - "tank") and antonymy
("acceleration" - "deceleration"). The concept semantically similar subsumes
all these thesaurus relations. (In this article, synonymy is used in its strong sense; for the weak sense of "synonymy", semantically similar is used.)
It is difficult for retrieval system users to bring to mind the large number of terms
that are semantically similar to their initial query terms. This holds especially for untrained
users, who formulate short queries. For the original query WORD, CLUSTERING,
TERM, ASSOCIATION, the following terms could be found in the titles of relevant
documents such as "Experiments on the Determination of the Relationships Between
Terms" or "A Taxonomy of English Nouns and Verbs":
Example 1. {TERM, KEYWORD, DESCRIPTOR, WORD, NOUN, ADJECTIVE, VERB, STEM, CONCEPT,
MEANING, SENSE, VARIANT, SEMANTIC, STATISTIC, RELATED, NARROWER, BROADER,
DEPENDENT, CO-OCCURRENCE, INTERCHANGEABLE, FIRST ORDER, SECOND ORDER,
CLUSTER, GROUP, EXPANSION, CLASS, SIMILARITY, SYNONYMY, ANTONYMY, HYPONYMY, HYPERONYMY, ASSOCIATION, RELATION, RELATIONSHIP, TAXONOMY, HIERARCHY, NETWORK, LEXICON, DICTIONARY, THESAURUS, GENERATION, CONSTRUCTION, CLASSIFICATION, DETERMINATION}
To support users, one important direction of research in automatic thesaurus construction is the detection of semantically similar terms as candidates
for thesaurus relations.
2 Various Approaches for the Automatic Detection of
Semantic Similarity
There is a large variety of approaches for the automatic detection of thesaurus relations, mainly suggested by retrieval researchers and also by linguists. In the
following, some approaches are listed with brief characterizations. (There is a variety of derivations and combinations of methods not cited here.)
In statistical term association, co-occurrence data of terms are analysed. The
main idea of this approach relies on the assumption that terms occurring in similar contexts are synonyms. The contexts of an initial term are represented by
terms frequently occurring in the same document or paragraph with the initial
term. In theory, terms with a high degree of context term overlap should be
synonyms, but in practice no synonyms could be found by this method [?].
Co-occurrence statistics can be refined by using singular value decomposition.
In this case the relations are generated on the basis of the comparison of the
main factors extracted from co-occurrence data [?]. Even though this approach
does not find semantically similar terms, the use of such term associations can
improve retrieval results. Unfortunately, singular value decomposition is very
costly, so that it can only be applied to a small selection of the database terms.
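To illustrate this refinement, the following minimal Python sketch compares terms via the leading factors of a singular value decomposition of a small term-document count matrix; the matrix, the term list and the choice of k = 2 factors are invented for demonstration and are not taken from the cited work.

```python
import numpy as np

# Hypothetical term-document count matrix (rows: terms, columns: documents).
terms = ["tank", "container", "car", "engine"]
X = np.array([
    [3, 0, 2, 0],
    [2, 0, 3, 0],
    [0, 4, 0, 1],
    [0, 3, 0, 2],
], dtype=float)

# Keep only the k strongest factors of the decomposition X = U S V^T.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_vecs = U[:, :k] * S[:k]  # term representations in factor space

def cos(u, v):
    """Cosine similarity of two dense vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Terms close in factor space are associated, e.g. "tank" and "container".
print(terms[0], "~", terms[1], ":", round(cos(term_vecs[0], term_vecs[1]), 3))
```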
Salton [?] gives a summary of work on pseudo-classification. For this approach,
relevance judgements are required: Each document must be judged with respect
to relevance to each of a set of queries. Then an optimization algorithm can be
run: It assigns relation weights to all term pairs, such that expanding the query
by terms related with high weights leads to retrieval results that are as correct as possible. This is the training phase of the approach. After the training, term pairs
with high weights represent thesaurus relations. A disadvantage of this approach
lies in the high effort for the manual determination of the relevance judgements
as well as for the automatic optimization. The manual effort is even comparable
to that of manual thesaurus construction.
Hyponyms were extracted from large text corpora by Hearst [?]. She searched
for relations directly mentioned in the texts and discovered text patterns that
relate hyponyms, e.g. "such as". Frequently this method leads to hyponyms
that are not directly related in the hierarchy, like "species" and "steatornis oilbird", or to questionable hyponyms like "target" and "airplane".
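In the spirit of this pattern-based extraction, a minimal sketch using a single regular expression for the "such as" pattern might look as follows; a real system uses a much richer pattern set, and the sample text here is invented.

```python
import re

# One Hearst-style pattern: "<hyperonym> such as <hyponym>(, <hyponym>)*"
PATTERN = re.compile(r"(\w+)\s+such as\s+(\w+(?:\s*,\s*\w+)*)")

text = ("The museum keeps birds such as oilbirds, owls and parrots. "
        "Targets such as airplanes were also mentioned.")

for match in PATTERN.finditer(text):
    hyperonym = match.group(1)
    hyponyms = [h.strip() for h in match.group(2).split(",")]
    for hyponym in hyponyms:
        print(f"{hyponym} IS-A {hyperonym}")
```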
Hyperonyms can also be detected by analysing definitions in monolingual lexica.
A hyperonym of the defined term is the so-called genus term, the one that gives
the main characterization of the defined term [?]. Genus terms can be recognized by means of syntactic analysis. A further approach is based on the idea
that semantically similar terms have similar definitions in a lexicon. Terms that
have many defining terms in common are supposed to be semantically similar
or synonyms [?]. These lexicon based approaches seem very plausible; however,
they have not been evaluated. One problem of these approaches is their coverage. Only terms that are filed in a lexicon can be dealt with, thus many relations
between technical terms will stay undiscovered.
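The definition-overlap idea can be sketched as follows, assuming a toy dictionary that maps each term to the content words of its definition; the entries are invented for illustration.

```python
# Hypothetical dictionary: term -> set of content words in its definition.
definitions = {
    "container": {"object", "hold", "store", "goods"},
    "receptacle": {"object", "hold", "store", "things"},
    "tank": {"container", "hold", "liquid", "gas"},
}

def definition_overlap(a, b):
    """Jaccard overlap of the defining terms of a and b."""
    da, db = definitions[a], definitions[b]
    return len(da & db) / len(da | db)

# A high overlap suggests semantic similarity or synonymy.
print(round(definition_overlap("container", "receptacle"), 2))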
Guentzer et al. [?] suggested an expert system that draws hypotheses about
term relations on the basis of user observations. For example, if a retrieval system user combines two terms by OR in his/her query (and further requirements
hold), these terms are probably synonyms. The system worked well, but unfortunately the users' capability of bringing to mind synonyms is very poor. Therefore
the majority of the term pairs found were either morphologically similar, like
"net" and "network", or translations of each other, like "user interface" and "Benutzeroberflaeche". Other synonyms were hardly found.
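As a sketch of that OR-rule, the following filter over a query log counts how often two terms are OR-combined; the log format and the repetition threshold are invented for illustration and do not reproduce the cited expert system.

```python
from collections import Counter

# Hypothetical query log: each query is a list of OR-combined term groups.
query_log = [
    [["net", "network"], ["protocol"]],
    [["net", "network"]],
    [["user interface", "Benutzeroberflaeche"]],
]

or_pairs = Counter()
for query in query_log:
    for group in query:
        for i, a in enumerate(group):
            for b in group[i + 1:]:
                or_pairs[tuple(sorted((a, b)))] += 1

# Pairs OR-ed together repeatedly become synonym hypotheses.
for pair, n in or_pairs.items():
    if n >= 2:
        print(pair, n)
```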
These examples of approaches for the detection of semantically similar terms
show clearly that the main ideas can be very plausible but nevertheless do not
work in practice. In the following, a recent approach is explained that has been
confirmed by different research groups.
3 Detection of Term Similarities on the Basis of
Dependency Analysis
In this section we describe a method that extracts term relations fully automatically from large corpora. Thus, domain dependent thesaurus relations can
be produced for each text database. The approach is based on linguistic as well
as statistical analysis. The linguistic background of this approach is dependency
theory.
3.1 Dependency Theory
The theory of dependency analysis has a long tradition in linguistics. Its central
concept, the dependency relation, is the relation between heads and modifiers.
Modifiers are terms that specify another term, the head, in a sentence or phrase.
Examples 2, 3 and 4 below include the head modifier relation between "thesaurus" and "construction".
Example 2. thesaurus construction
Example 3. construction of a complete domain dependent monolingual thesaurus
Example 4. automatic thesaurus generation or construction
In example 3, the head "thesaurus" has three modifiers: "complete", "dependent", and "monolingual". Example 4 exemplifies that a modifier might have
more than one head in the case of conjunctions. Here the modifiers "automatic" and
"thesaurus" specify both heads, "generation" and "construction".
In dependency trees, the heads are always drawn above their modifiers. A line
stands for the head modifier relation. For the purpose of information retrieval,
stopwords like determiners, prepositions or conjunctions are usually neglected,
such that the tree only contains relations between index terms. Such dependency trees differ from syntax trees of Chomsky grammars in that their
structure is already abstracted from the particular structure of a language; e.g.
example 5 has the same dependency tree as example 6.
Example 5. Peter drinks a sweet hot coffee.
Example 6. Peter drinks a coffee which is sweet and hot.
The common dependency tree of both sentences (heads above their modifiers):

            drink
           /     \
      Peter       coffee
                 /      \
            sweet        hot
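As an illustrative representation (not the paper's data structure), such a tree can be stored as a head-to-modifiers mapping, from which the head modifier pairs used later for comparison are read off:

```python
# Head -> modifiers mapping for "Peter drinks a sweet hot coffee."
tree = {
    "drink": ["Peter", "coffee"],
    "coffee": ["sweet", "hot"],
}

# The head-modifier pairs used for indexing and comparison:
pairs = [(head, mod) for head, mods in tree.items() for mod in mods]
print(pairs)
# [('drink', 'Peter'), ('drink', 'coffee'), ('coffee', 'sweet'), ('coffee', 'hot')]
```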
3.2 Head and Modifier Extraction from Corpora
Some implementations of dependency analysis are practically applicable to large
amounts of text because they realize a very quick and robust syntactic analysis.
Hindle [?] reports on a free text analysis with subsequent head modifier extraction
which deals with one million words overnight. Strzalkowski [?] reports a rate
of one million words in 8 hours. The latest version of our own parser, described
in [?], is much faster: 3 MB per minute real time on a SUN SPARCstation;
that is approximately 15 minutes for one million words. Such a parser works
robustly and domain independently but results only in partial analysis trees.
Our dependency analysis of noun phrases has been evaluated with respect to
error rates: 85% of the head modifier tokens were determined correctly and 14%
were introduced wrongly. These error rates are acceptable because the further
processing is very robust, as shown by the results below. Table 1 shows the most
frequent heads and modifiers of the term "pump" in three annual editions of
abstracts of the American Patent and Trademark Office.
Modifiers of "pump"  Frequency    Heads of "pump"  Frequency
heat                 444          chamber          294
injection            441          housing          276
hydraulic            306          assembly         177
vacuum               238          system           160
driven               207          connected        141
displacement         183          pressure         124
fuel                 181          piston           120
pressure             142          unit             119
oil                  140          body             115
Table 1. Most frequent heads and modifiers of "pump" in 120 MB of patent abstracts
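To make the extraction step concrete, here is a minimal sketch that counts (head, modifier) pairs from already-parsed noun phrases; the right-headed phrase rule and the sample phrases are crude assumptions for illustration, whereas the cited systems use robust partial parsers.

```python
from collections import Counter

def pairs_from_noun_phrase(tokens):
    """Toy dependency rule for flat English noun phrases: the last token is
    the head of the phrase and every earlier token modifies it."""
    head = tokens[-1]
    return [(head, mod) for mod in tokens[:-1]]

# Hypothetical corpus fragments (already stopword-filtered and tokenized):
noun_phrases = [
    ["heat", "pump"],
    ["hydraulic", "heat", "pump"],
    ["pump", "housing"],
    ["vacuum", "pump"],
]

head_mod_counts = Counter()
for np in noun_phrases:
    head_mod_counts.update(pairs_from_noun_phrase(np))

for (head, mod), freq in head_mod_counts.most_common():
    print(f"{mod:12s} -> {head:12s} {freq}")
```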
3.3 Semantics of the Head Modifier Relation
Head modifier relations bridge the gap between syntax and semantics: On the one
hand, they can be extracted on the basis of purely syntactic analysis. On the other
hand, the modifier specifies the head, and this is a semantic relation. How can
the semantic information contained in head modifier relations give hints about
semantic similarity? Different semantic theories suggest that terms having many
heads and modifiers in common are semantically similar. These connections are
now sketched very briefly.
Katz and Fodor [?] introduced a feature based semantic theory. They claimed
that the meaning of a word can be represented by a set of semantic features;
for example, {+human, +male, -married} is the famous feature representation of
"bachelor". These features also explain which words are compatible, e.g. "bachelor" is compatible with "young", but not with "fly". The selection of all possible
heads and modifiers of a term therefore means the selection of all compatible
words. If the corpus has a large coverage and all heads and modifiers of two
terms are the same, they should also have the same feature representation, i.e.
the same meaning.
Wittgenstein's view of the relation of meaning and use was mirrored in the co-occurrence approach in Sect. 2. Terms appearing in similar contexts were supposed to be synonyms. Probably, this idea is correct if the contexts are very
small. Heads and modifiers are contexts, and these contexts are as small as possible: only one term long. Thus, head and modifier comparison implements
smallest contexts comparison.
In model theoretic semantics, the so-called extensional meaning of many words
is denoted by a set of objects; e.g. the set of all dogs in the world represents the
meaning of the word "dog". If two terms occur as head and modifier in a real
world corpus, in most cases the intersection of their representations
is not empty. Thus the head modifier relation between two terms means that
each term is a possible property of the objects of the other term. Head modifier
comparison therefore is the comparison of possible properties.
Head modifier relations can also be found in human memory structure. A variety
of experiments on human associations suggest that in many cases heads or
modifiers are responses to stimulus words, e.g. stimulus "rose" and response
"red". As stimulus words, terms with common heads and modifiers therefore
can evoke the same associations.
3.4 Experiments on Head-Modifier-Based Term Similarity
According to the theories in Sect. 3.3, terms are semantically similar if they
correspond in their heads and modifiers. Section 3.2 shows that masses of head
modifier relations can be extracted automatically from large amounts of text.
Thus semantically similar terms can be generated fully automatically by means
of head modifier comparison. Implementations of this method are described by
Ruge [?], Grefenstette [?] and Strzalkowski [?]. Table 2 gives some examples of
initial terms together with their most similar terms (with respect to head and
modifier overlap). The examples contain different types of thesaurus relations:
synonyms ("quantity" - "amount"), hyperonyms ("president" - "head"), hyponyms ("government" - "regime") and part-of relations ("government" - "minister").
quantity        government    president
amount          leader        director
volume          party         chairman
rate            regime        office
concentration   year          manage
ratio           week          executive
value           man           official
content         minister      head
level           president     lead
Table 2. Most similar terms of "quantity" [?], "government" [?] and "president" [?]
The comparison of heads and modifiers is expressed by a value between 0 (no
overlap) and 1 (identical heads and modifiers). This value is determined by a
similarity measure. I found that the cosine measure with a logarithmic weighting
function (1) works best ([?], [?]).
$$\mathrm{sim}(t_i, t_j) \;=\; \frac{1}{2}\,\frac{\vec{h}^{\,\ln}_i \cdot \vec{h}^{\,\ln}_j}{|\vec{h}^{\,\ln}_i|\;|\vec{h}^{\,\ln}_j|} \;+\; \frac{1}{2}\,\frac{\vec{m}^{\,\ln}_i \cdot \vec{m}^{\,\ln}_j}{|\vec{m}^{\,\ln}_i|\;|\vec{m}^{\,\ln}_j|} \qquad (1)$$

In Eq. (1), $\vec{h}^{\,\ln}_i = (\ln'(h_{i1}), \ldots, \ln'(h_{in}))$, where $h_{ik}$ stands for the number of occurrences of term $t_k$ as head of term $t_i$ in the corpus, and $\ln'(r) = \ln(r)$ if $r > 1$ and $0$ otherwise. $\vec{m}^{\,\ln}_i$ is defined analogously for modifiers. The cosine measure in principle gives the cosine of the angle between the two terms represented in a space spanned by the heads or spanned by the modifiers. In sim, the weights of the heads and modifiers are smoothed logarithmically; sim gives the mean of the head space cosine and the modifier space cosine of the two terms.
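As a concrete reading of Eq. (1), here is a minimal Python sketch; the sparse-vector representation and the toy head/modifier counts are assumptions for illustration, not the paper's implementation.

```python
import math

def ln_prime(r):
    """ln'(r) = ln(r) for r > 1, else 0 (the logarithmic smoothing of Eq. (1))."""
    return math.log(r) if r > 1 else 0.0

def cosine(u, v):
    """Cosine of two sparse vectors given as {context_term: weight} dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def sim(heads_i, heads_j, mods_i, mods_j):
    """Eq. (1): mean of the head-space and modifier-space cosines,
    computed on logarithmically smoothed counts."""
    h_i = {k: ln_prime(c) for k, c in heads_i.items()}
    h_j = {k: ln_prime(c) for k, c in heads_j.items()}
    m_i = {k: ln_prime(c) for k, c in mods_i.items()}
    m_j = {k: ln_prime(c) for k, c in mods_j.items()}
    return 0.5 * cosine(h_i, h_j) + 0.5 * cosine(m_i, m_j)

# Hypothetical head and modifier counts for two terms:
heads_quantity = {"increase": 12, "measure": 7}
heads_amount   = {"increase": 9,  "measure": 4}
mods_quantity  = {"large": 20, "small": 11}
mods_amount    = {"large": 15, "small": 8}
print(round(sim(heads_quantity, heads_amount, mods_quantity, mods_amount), 3))
```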
The head modifier approach has been examined on different levels of evaluation. First, the rate of semantically similar terms among the 10 terms with the
highest sim-values for 159 different initial terms was evaluated ([?], [?]).
About 70% of the terms found were semantically similar in the sense that they
could be used as additional terms for information retrieval. Grefenstette [?] and
Strzalkowski [?] clustered those terms that had a high similarity value. Then
they expanded queries by replacing all query terms by their clusters. The expanded queries performed better than the original queries. Strzalkowski found
an improvement in the retrieval results of 20%. This is a very good value in
the retrieval context. Unfortunately, the effect of the query expansion alone is not
known, because Strzalkowski used further linguistic techniques in his retrieval
system.
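To illustrate the expansion step, the following sketch replaces each query term by its cluster of similar terms, assuming a precomputed similarity table (e.g. from Eq. (1)) and an arbitrary cut-off; both the table entries and the threshold are invented.

```python
# Hypothetical precomputed similarities, e.g. computed with Eq. (1):
sim_table = {
    "quantity": [("amount", 0.62), ("volume", 0.55), ("rate", 0.41)],
    "government": [("leader", 0.48), ("party", 0.44)],
}

def expand_query(terms, threshold=0.4):
    """Replace each query term by its cluster of sufficiently similar terms."""
    expanded = []
    for t in terms:
        cluster = [t] + [s for s, v in sim_table.get(t, []) if v >= threshold]
        expanded.append(cluster)
    return expanded

print(expand_query(["quantity", "government"]))
```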
4 Conclusions
A disadvantage of most information retrieval systems is that the search is based
on document terms. These term based systems can be improved by incorporating
thesaurus relations. The expensive task of manual thesaurus construction should
be supported or replaced by automatic tools. After a long period of disappointing results in this field, linguistically based systems have shown some encouraging
results. These systems are based on the extraction and comparison of head modifier relations in large corpora. They are applicable in practice because they use
robust and fast parsers.
References
1. Das-Gupta, P.: Boolean Interpretation of Conjunctions for Document Retrieval.
JASIS 38(1987) 245-254
2. DIN 1463: Erstellung und Weiterentwicklung von Thesauri. Deutsches Institut fuer
Normung (1987), related standard published in English: ISO 2788:1986
3. Grefenstette, G.: Explorations in Automatic Thesaurus Discovery. Kluwer Academic Publishers, Boston, Dordrecht, London (1994)
4. Guentzer, U., Juettner, G., Seegmueller, G., Sarre, F.: Automatic Thesaurus Construction by Machine Learning from Retrieval Sessions. Inf. Proc. & Management
25(1989) 265-273
5. Hearst, M.: Automatic Acquisition of Hyponyms from Large Text Corpora. Proceedings of COLING 92, Nantes, Vol. 2 (1992) 539-545
6. Hindle, D.: A Parser for Text Corpora. Technical Memorandum, AT&T Bell Laboratories (1990) also published in Atkins, A., Zampolli, A.: Computational Approaches to the Lexicon. Oxford University Press (1993)
7. Katz, J., Fodor, J.: The Structure of a Semantic Theory. In Fodor, J., Katz, J.:
The Structure of Language: Readings in the Philosophy of Language. Englewood
Cliffs, NJ, Prentice Hall (1964) 479-518
8. Lesk, M.: Word-Word Associations in Document Retrieval Systems. American Documentation (1969) 27-38
9. Ruge, G.: Experiments on Linguistically Based Term Associations. Inf. Proc. &
Management 28(1992) 317-332
10. Ruge, G.: Wortbedeutung und Termassoziation. Reihe Sprache und Computer
14(1995) Georg Olms Verlag, Hildesheim, Zuerich, New York
11. Ruge, G., Schwarz, C., Warner, A.: Effectiveness and Efficiency in Natural Language Processing for Large Amounts of Text. JASIS 42(1991) 450-456
12. Schuetze, H., Pedersen, J.: A Cooccurrence-Based Thesaurus and Two Applications
to Information Retrieval. Proceedings of RIAO 94, New York (1994) 266-274
13. Shaikevich, A.: Automatic Construction of a Thesaurus from Explanatory Dictionaries. Automatic Documentation and Mathematical Linguistics 19(1985) 76-89
14. Salton, G.: Automatic Term Class Construction Using Relevance - A Summary of
Work in Automatic Pseudoclassication. Inf. Proc. & Management 16(1980) 1-15
15. Strzalkowski, T.: TTP: A Fast and Robust Parser for Natural Language. Proceedings of COLING 92, Vol. 1 (1992) 198-204
16. Strzalkowski, T.: Natural Language Information Retrieval. Inf. Proc. & Management 31(1995) 397-417