EXPERIMENTS IN AUTOMATIC STATISTICAL THESAURUS CONSTRUCTION

Carolyn J. Crouch and Bokyung Yang*
Department of Computer Science
University of Minnesota, Duluth
Duluth, MN 55812
ABSTRACT

A well constructed thesaurus has long been recognized as a valuable tool in the effective operation of an information retrieval system. This paper reports the results of experiments designed to determine the validity of an approach to the automatic construction of global thesauri (described originally by Crouch in [1] and [2]) based on a clustering of the document collection. The authors concentrate on the global thesaurus, wherein thesaurus classes (classes of similar or "synonymous" terms) are first generated and then used to index both documents and queries. The authors validate the approach by showing that the use of thesauri generated by this method results in substantial improvements in retrieval effectiveness in four test collections. The term discrimination value theory, used to determine a term's membership in a particular thesaurus class, is found not to be useful in distinguishing a "good" thesaurus class from an "indifferent" or "poor" class. The authors then suggest an alternate approach to automatic thesaurus construction which greatly simplifies the generation of the thesaurus classes. Experimental results show that the alternate approach described herein in some cases produces thesauri which are comparable to those produced by the first method at much lower cost.

INTRODUCTION

The value of a well constructed thesaurus to the effective operation of an information retrieval system is widely recognized, and many experiments on the subject may be found in the literature. Early approaches relied on manual construction of the thesaurus (e.g., Cleverdon's work in the Aslib Cranfield project [3]). Early experiments in automatic and semiautomatic thesaurus construction (e.g., the experiments performed with the Smart system [4, 5, 6] and Sparck Jones' work in automatic keyword classification [7]) made valuable contributions to the state of knowledge on the subject.

A more recent approach to thesaurus construction has centered on the relational thesaurus. This work is based on identifying the sets of lexical or lexical-semantic relations (e.g., synonymy, taxonomy, collocation, etc.) that exist between word pairs. Fox produced valuable early work in this area [8] and continues the work on the CODER project, an artificial intelligence-based attempt to construct a relational lexicon [9]. Other researchers use expert system techniques [10, 11] or artificial intelligence-based approaches [12, 13] to generate dictionaries and thesauri. (These approaches, while of considerable interest, remain largely to be evaluated.)

*Current address: West Publishing Company, Eagan, Minnesota, 55123.
This paper reports on the results achieved by using
an algorithm for automatic thesaurus generation described in detail in [1] and [2]. This algorithm is based on an appropriate clustering of the document space and the use of the term discrimination value theory of Salton, Yang, and Yu [14] to select terms for entry into a particular thesaurus class. It produces not a universal relational lexicon, such as that envisioned by Fox, but rather a thesaurus applicable to the collection from which it was generated. Although the resultant thesaurus is domain dependent, the algorithm may be applied to any document collection, regardless of subject (as long as the system representation is based on the vector model).
The term discrimination value model itself is based on the well-known vector space model [15], wherein both documents and queries are regarded as (weighted) vectors and the correlation between vectors is computed by means of a similarity measure, such as cosine. Although the vector model offers advantages (such as the ranking of retrieved documents and term weighting), it also has its disadvantages (such as its inability to express term relationships). A practical solution to this problem was suggested by Salton, Yang, and Yu [14], who showed that the retrieval capabilities of the model can be improved by incorporating term phrases and term classes as index terms. Salton et al. suggest that good discriminators be used directly as index terms, that the retrieval properties of poor discriminators (terms with negative discrimination values) can be improved by combining these terms to form phrases, and that the retrieval properties of indifferent discriminators (terms with near-zero discrimination values) can be improved by combining them into appropriate thesaurus classes.

Salton, Yang, and Yu also showed that the best document space for retrieval purposes is one which maximizes the average separation between documents in the document space. In such a space, it is easier to distinguish between documents and thus to retrieve the documents which are most similar to a given query. This observation is the basis of the term discrimination value model, which assigns specific roles to single terms, term phrases, and term classes. The discrimination value of a term is defined as a measure of the change in space separation which occurs when a given term is assigned to the document collection. A good discriminator is a term which, when assigned to the documents of a collection, decreases the space density (rendering the documents less similar to each other). A poor discriminator, then, increases the space density.
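To make the density-based definition concrete, the following is a minimal Python sketch of the computation just described. The toy term-document matrix, the function names, and the use of document-to-centroid cosine similarity as the density measure are our illustrative assumptions, not details taken from the paper.

    import numpy as np

    def space_density(docs):
        """Average cosine similarity of each document to the collection
        centroid; a denser space makes documents harder to distinguish."""
        centroid = docs.mean(axis=0)
        denom = np.linalg.norm(docs, axis=1) * np.linalg.norm(centroid) + 1e-12
        return float(np.mean(docs @ centroid / denom))

    def discrimination_value(docs, k):
        """Density of the space with term k (column k) removed, minus the
        density with k present: positive = good discriminator (assigning k
        spreads the space), negative = poor discriminator (k compacts it)."""
        return space_density(np.delete(docs, k, axis=1)) - space_density(docs)

    # Toy weighted term-document matrix: 4 documents x 5 terms (assumed data).
    docs = np.array([[2., 0., 1., 1., 0.],
                     [2., 1., 0., 1., 0.],
                     [2., 0., 0., 0., 1.],
                     [2., 1., 1., 0., 1.]])
    for k in range(docs.shape[1]):
        print(k, round(discrimination_value(docs, k), 4))

In this toy collection, term 0 appears in every document and so compacts the space (negative value), while the rarer terms spread it; this matches the frequency-based behavior described next.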
According to their discrimination values, terms may thus be classified into three categories: good discriminators (terms with positive discrimination values), indifferent discriminators (terms with near-zero discrimination values), and poor discriminators (terms with negative discrimination values).

The calculation of discrimination values is normally expensive. For this reason, Salton, Yang, and Yu suggested that the document frequency of a term be used as an approximation to its discrimination value. The document frequency, f_k, of a term k is defined as the number of documents in which k appears. Empirical results have shown that document frequency and discrimination value are well correlated. According to [14], if n represents the number of documents in a collection, terms may be classified by document frequency as follows: if f_k > n/10, k is considered a high frequency term (a term which normally has a negative discrimination value and is a poor discriminator); if f_k < n/100, k is considered a low frequency term (a term with a near-zero discrimination value, an indifferent discriminator); and if n/100 < f_k < n/10, k is a medium frequency term (a term with a positive discrimination value, a good discriminator).

A recent modification of the centroid approach to the calculation of discrimination values, by El-Hamdouchi and Willett [16], has greatly reduced the time required to calculate discrimination values, thus making it possible to use discrimination value instead of the [easily calculated] document frequency. By computing the density of the document space before and after the assignment of each term, the discrimination value of each term can be determined. The terms are then ranked in decreasing order of their discrimination values. Thus the discrimination value of a term may be used as an indicator of how that term relates to other terms in the collection (i.e., whether it should be used alone as an index term, used as a component of a phrase, or combined with other terms to become a member of a thesaurus class).
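The document frequency cutoffs above translate directly into a classification rule. Here is a minimal sketch (the function name and return labels are ours):

    def frequency_class(df_k, n):
        """Classify term k by its document frequency df_k in an n-document
        collection, following the thresholds of Salton, Yang, and Yu [14]."""
        if df_k > n / 10:
            return "high frequency (poor discriminator)"
        if df_k < n / 100:
            return "low frequency (indifferent discriminator)"
        return "medium frequency (good discriminator)"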
AN ALGORITHM FOR CONSTRUCTING GLOBAL THESAURI

A thesaurus is composed of a set of thesaurus classes. A thesaurus class is made up of a set of "closely related" terms--terms which are similar enough in the context of the collection to justify their being used together or combined to represent a single, new word type within this collection. The term discrimination value theory specifies that the terms which make up a thesaurus class must be indifferent discriminators (or low frequency terms). One cannot expect to classify terms in the low frequency domain, which have small co-occurrence frequencies and correspondingly low computed similarities, into useful groupings simply by clustering the terms themselves, due to the lack of information about them. An alternative method is to cluster the documents of the collection and subsequently to select from closely related documents those terms which meet the criterion (i.e., terms with near-zero discrimination values or low frequency terms).
A clustering algorithm which produces small, tight clusters is the complete link algorithm [17, 18, 19], one of the class of agglomerative, hierarchical clustering methods. This algorithm may be described succinctly as follows: at each stage, the similarity between all clusters is computed, the two most similar clusters are merged, and the process continues until only one cluster remains. Since the similarity between clusters is defined as the minimum of the similarities between all pairs of documents (where one document of the pair is in one cluster and the other document is in the other cluster), the resultant clusters are difficult to join and hence tend to be small and tight.
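A minimal sketch of complete link clustering using SciPy follows. The toy document vectors are our assumption, and note that SciPy works with distances rather than similarities, so a similarity threshold t corresponds to cutting the tree at distance 1 - t:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist

    # Toy weighted document vectors (rows are documents A-E; assumed data).
    docs = np.array([[2., 1., 0., 0.],
                     [2., 1., 1., 0.],
                     [0., 1., 2., 1.],
                     [0., 0., 2., 1.],
                     [0., 0., 2., 2.]])

    # Complete link merges the two clusters whose *least similar* document
    # pair is most similar, producing the small, tight clusters just described.
    tree = linkage(pdist(docs, metric="cosine"), method="complete")

    # Recognize clusters at a similarity threshold of 0.075 (distance 0.925).
    labels = fcluster(tree, t=1 - 0.075, criterion="distance")
    print(labels)  # cluster id assigned to each document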
Consider Fig. 1. The squares represent documents, the circles represent groupings, and the numbers represent the levels at which the documents cluster; the documents are the leaves of the tree. Documents A and B cluster at a threshold value of 0.089, D and E cluster at a level of 0.149, and document C clusters with the D-E subtree at a level of 0.077. The A-B subtree and the C-D-E subtree cluster at a threshold value of 0.029. The tightest clusters (i.e., those which cluster at the highest threshold values) lie at the bottom of the cluster tree.

Fig. 1 A sample hierarchy generated by the complete link algorithm

An algorithm to construct global thesauri has been previously suggested by Crouch [1, 2]. A brief description of this algorithm follows:

1. The document collection is clustered via the complete link clustering algorithm.
2. The resultant hierarchy is traversed and thesaurus classes are generated, based on three user-supplied parameters.
3. The documents and queries of the collection are indexed (augmented) by the thesaurus classes.

The characteristics of the thesaurus classes generated in step 2 are determined by these parameters:
(i) THRESHOLD VALUE

Threshold value largely determines the documents from which terms are selected for inclusion in a thesaurus class. In Fig. 1, a threshold value of 0.090 would return only the D-E document cluster, since the level at which the documents cluster must be greater than or equal to the specified threshold in order for the cluster to be recognized. A threshold value of 0.085, on the other hand, would return two clusters (A-B and D-E). If a threshold value of 0.075 were specified, two clusters would be returned: A-B and either D-E or C-D-E.
(ii] NUMBER OF DOCUMENTS PER CLUSTER
ml
Since both D-E and C-D-E meet the threshold value
tc.wt
of 0.075 in Fig. 1, this parameter
=
is needed to determine
avg tc_con_wt
kc_conl
* 0.5
the number of documents to be included in the final cluster.
In the example, a value of 2 for this parameter would insure
Thus a thesaurus class weight is computed
that only
average weight
the D-E cluster
be returned.
A value of 3 or
greater would allow the entire C-D-E cluster to be returned.
(iii) MINIMUM DOCUMENT FREQUENCY

Once the document clusters have been selected, thesaurus classes can be formed from the low frequency terms contained in those clusters. A minimum document frequency is specified to determine which terms are considered to be "low frequency"; this upper bound on document frequency forms the criterion by which terms qualify for membership in a thesaurus class. A thesaurus class is then defined as the intersection of all the low frequency terms in a cluster, i.e., those terms whose document frequencies fall beneath the bound. (Using as a guideline the value n/100 suggested in [14] may be helpful in setting this parameter.) [Alternatively, one could consider using discrimination value in lieu of document frequency to determine thesaurus class membership, i.e., selecting those terms whose discrimination values are near zero. This approach is investigated in a later section (see Experiment 3).]

Setting the Input Parameters

An important factor in retrieval effectiveness is the setting of these three parameters, which are collection-dependent and can be very difficult to set properly. Threshold value largely determines the size of the individual thesaurus classes; appropriate values may be determined by examining the cluster tree. If the threshold is too low, very few classes will be generated; if the setting is too high, thesaurus classes may be generated with too few terms. The number of documents per cluster has a lesser effect on thesaurus class size and plays a secondary role. Minimum document frequency may be set by using the guidelines of [14], although the most useful value is likewise collection-dependent.
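Putting parameters (i)-(iii) together, the class-formation step can be sketched as follows. This is our minimal reading of the procedure, not the authors' code: the set representation of documents and the names cluster_docs and doc_freq are assumptions made for illustration.

    def thesaurus_class(cluster_docs, doc_freq, min_doc_freq):
        """Form one thesaurus class from a recognized document cluster by
        intersecting the low frequency terms (document frequency below the
        bound) of its documents. cluster_docs holds each document's set of
        term ids; doc_freq maps a term id to its document count."""
        low_sets = [{t for t in terms if doc_freq[t] < min_doc_freq}
                    for terms in cluster_docs]
        return set.intersection(*low_sets) if low_sets else set()

    # Hypothetical cluster of two documents sharing the rare terms 7 and 9.
    docs = [{1, 7, 9, 12}, {2, 7, 9, 30}]
    df = {1: 50, 2: 40, 7: 3, 9: 5, 12: 60, 30: 2}
    print(sorted(thesaurus_class(docs, df, min_doc_freq=10)))  # [7, 9]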
THE EXPERIMENTS

Weighting and Similarity Measure

When the global thesaurus has been constructed and both documents and queries have been augmented by the thesaurus classes, the vectors are then weighted using the facilities of the Smart system [20]. The weighting method employed is a variation of the inverse document frequency weighting scheme (see [20] for a description of Smart weighting routines), namely, "atc" weighting. The similarity measure used is cosine. Previous experiments [1] have shown that the following formula is useful in computing thesaurus class weights. Let tc_wt denote the thesaurus class weight, tc_con_wt_i denote the weight of term i in the thesaurus class, |tc_con| represent the number of terms in the thesaurus class, and avg_tc_con_wt represent the average weight of a term in the thesaurus class. Then

    avg_tc_con_wt = ( sum_{i=1}^{|tc_con|} tc_con_wt_i ) / |tc_con|

    tc_wt = avg_tc_con_wt * 0.5

Thus a thesaurus class weight is computed by dividing the sum of the weights of the terms in the class by the number of terms in the class (the average weight of a term in the class) and downweighting that value by 0.5.
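A direct transcription of the class-weight formula into Python (a sketch; the function name is ours):

    def thesaurus_class_weight(term_weights):
        """tc_wt: average the weights of the class's component terms, then
        downweight by 0.5, exactly as in the formula above."""
        avg_tc_con_wt = sum(term_weights) / len(term_weights)
        return avg_tc_con_wt * 0.5

    print(thesaurus_class_weight([1.0, 0.5]))  # 0.375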
Experiment 1

The thesaurus construction procedure described in this paper was run on four standard test collections, namely, ADI, CACM, CISI, and Medlars. The characteristics of these collections are found in Table 1. (For more complete descriptions of these collections, see [18, 21, 22].) The algorithm described previously was applied to each of the four specified document collections. The parameters yielding the best improvement in each case (the best case) are specified in Table 2 for each collection; the number of documents per cluster ranges from 2 to 5. As Table 2 shows, the use of the thesaurus generates substantial improvement in each case. The column labeled "improvement" refers to the improvement in three point average (average precision at three points of recall [0.25, 0.50, 0.75]) of the collection using this thesaurus over the same collection without the thesaurus (the base case).

The number of runs made to produce each thesaurus was largely dependent on our ability to find the threshold value whose use produced the "best" thesaurus classes. Nor was the optimal value of minimum document frequency easily found; as Table 2 indicates, the guidelines of [14] were useful in only one case.
Table 1 Collection Statistics

Collection | No of Documents | No of Queries | No of Terms | Mean No of Terms per Document | Mean No of Terms per Query
ADI        | 82              | 35            | 822         | 25.5                          | 7.1
Medlars    | 1033            | 30            | 6927        | 51.6                          | 10.1
CISI       | 1460            | 76            | 5019        | 45.2                          | 22.6
CACM       | 3204            | 64            | 8503        | 22.5                          | 8.7
Table 2 Improvement Using Thesaurus Generated by the Specified Parameters (Best Case)

Collection | Threshold     | No of Documents | Minimum Document Frequency | Percentage Improvement
ADI        | 0.075         | 5               | 25                         | 10.6
Medlars    | 0.120         | 3               | 45                         | 17.1
CISI       | 0.058         | 4               | 69                         | 7.7
CACM       | 0.137 (0.138) | 2               | 30                         | 9.6
Experiment 2

The term discrimination value model tells us that each unique term in a collection can be classified as a good, indifferent, or poor discriminator by its discrimination value. Each thesaurus class, which represents a new word type in the collection, may likewise be classified. The question that now arises is: is it possible to use discrimination value to differentiate between thesaurus classes--i.e., to distinguish good (or useful) thesaurus classes from poor or indifferent classes? Our objective was to produce a more useful thesaurus by removing those thesaurus classes which, although generated by our algorithm, nevertheless were not useful in improving retrieval effectiveness.

For each collection, a thesaurus was generated (using the parameter settings specified in Table 2). Each thesaurus class within the thesaurus was then classified according to the term discrimination value model; the discrimination value of each term (including the thesaurus classes) was calculated via [16]. Table 3 shows the number of thesaurus classes generated for each collection and the number of positive and negative discriminators within each set. Those classes which may be considered poor discriminators (i.e., have negative discrimination values) were then removed from each thesaurus. The (reduced) thesaurus was then used to index the documents and queries; the results are shown in Table 4.

It would appear from Table 4 that the classes which were removed from the thesauri because they had negative discrimination values nevertheless added useful information in terms of retrieval. (Since the indexing is done by augmentation, the removal of such classes should at least reduce the negative effects of vector expansion to some degree.) In ADI, 3 of 28 classes were removed from the thesaurus, resulting in a negligible improvement in retrieval effectiveness over the base case and a corresponding degradation over the best case (experiment 1). For Medlars, 140 negative discriminators were removed from the thesaurus, leaving 178 positive discriminators; the resultant thesaurus offers an improvement of 14 percent over the base case but is still less effective than the best case thesaurus. In CISI, 29 thesaurus classes with negative discrimination values were removed, leaving 269 positive discriminators; the result is a 7.7 percent improvement over the base case and no improvement over the best case. (All of the thesaurus classes generated for the CACM collection in experiment 1 were positive discriminators.)
Table 3 Number and Type of Thesaurus Classes Generated

Collection | Number | Number with Positive DVs | Number with Negative DVs
ADI        | 28     | 25                       | 3
Medlars    | 318    | 178                      | 140
CISI       | 298    | 269                      | 29
CACM       | 438    | 438                      | 0
Table 4 Improvement Using a Thesaurus Containing Only Thesaurus Classes with Positive Discrimination Values

Collection | Percentage over Base Case | Percentage over Best Case
ADI        | 1.3                       | -8.4
Medlars    | 14.0                      | -2.7
CISI       | 7.7                       | 0.0
CACM       | 9.6                       | 0.0
We also experimented with modifying our thesauri by taking the set of thesaurus classes ordered by discrimination value and successively removing the bottom-ranked 5 (10, 15, ...) percent of the classes until all classes with negative discrimination values had been removed. In no case were we able to improve the retrieval results over our best case figures.

Experiment 3
In this experiment, the thesaurus construction algorithm is modified by changing the criterion for entry of a term into a thesaurus class. In this case, minimum document frequency is replaced by an upper bound on discrimination value; i.e., thesaurus classes are formed by the intersection of those terms from the document cluster whose discrimination values are greater than 0 and less than the upper bound. (The objective of this experiment is to select the terms of the cluster having near-zero discrimination values for membership in the thesaurus class.) Using the document hierarchies and the same parameter settings (for threshold value and number of documents per cluster) found in Table 2, we produced thesaurus classes whose membership was determined only by discrimination value, without a limit on frequency. The bound on discrimination value was varied in increments of 5 percent (using 5, 10, 15, ..., percent of the terms, ordered by discrimination value, and extending to 100 percent [which of course includes all terms with positive discrimination values, i.e., terms with fairly strong positive values as well as those with near-zero values]).
Table 5 Improvement Using a Thesaurus Whose Classes Are Based on Discrimination Value

Collection | Percentage over Base Case | Percentage over Best Case
ADI        | 8.9                       | -1.5
Medlars    | 13.9                      | -2.8
CISI       | -16.9                     | -24.1
CACM       | -6.7                      | -13.3
The results are found in Table 5. As the upper bound on discrimination value is moved up, terms with larger frequencies become eligible to enter one or more thesaurus classes; greater numbers of classes are produced, and the classes themselves become large. (Consider what occurs when this approach is applied to CACM: using 50 percent of the terms results in 847, rather than 438, thesaurus classes being generated.) Because the correlation between discrimination value and document frequency is not exact, terms having near-zero discrimination values may in fact be high frequency terms, and we find no effective way to handle this problem without introducing again a limit on frequency. Note that the figures in Table 5 represent substantially less improvement than the best case results of experiment 1: the best results generated, for ADI and Medlars, fall below the original best case, while CISI and CACM degrade below even the base case. Thus our attempts to produce a thesaurus whose classes are based on the discrimination values of their component terms have not proved successful.
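For concreteness, the modified entry criterion of this experiment can be sketched as follows. This is a hypothetical rendering of ours; dv is an assumed mapping from term to its precomputed discrimination value:

    def dv_entry_criterion(cluster_terms, dv, upper_bound):
        """Experiment 3's criterion: admit a term to the class when its
        discrimination value lies in (0, upper_bound], in lieu of the
        minimum document frequency cutoff of the original algorithm."""
        return {t for t in cluster_terms if 0 < dv[t] <= upper_bound}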
Experiment 4

When the thesauri produced in experiment 1 were evaluated, it was apparent that some queries are not affected by the indexing performed during thesaurus construction (step 3 of the algorithm); the results (in terms of the three point average) are nevertheless averaged over all the queries in the collection. To gain a more accurate and realistic view of the effect of the thesauri produced by the algorithm, in experiment 4 we extract from each collection only those queries which are affected by indexing. This query set, the reduced set, is then used to perform a normal search: the thesaurus construction algorithm is applied to the collection, the documents and reduced query sets are indexed by the thesaurus classes, and a search is performed. The results (using the parameter settings of experiment 1) are described in Table 6. The last column in Table 6 gives an accurate view of the changes in retrieval effectiveness produced when a well constructed thesaurus is employed; as Table 6 indicates, the algorithm described in this paper can be used to produce useful global thesauri.
Table 6 Improvement Using Thesaurus (Best Case) With Reduced Query Set

Collection | No of Queries, Base Case | No of Queries, Reduced Set | Results Using All Queries | Results Using Reduced Set
ADI        | 35                       | 33                         | 10.6                      | 11.0
Medlars*   | 30                       | 30                         | 17.1                      | 17.1
CISI       | 76                       | 63                         | 7.7                       | 8.6
CACM       | 52                       | 32                         | 9.6                       | 14.3

*All queries in the Medlars collection are affected by thesaurus class indexing, hence no reduced collection is generated in this case.

Experiment 5

But this algorithm is dependent on the setting of three parameters, and the threshold value in particular is strongly collection-dependent and can be very difficult to set properly. Another approach to identifying the clusters of closely related documents from which thesaurus classes are formed may thus prove useful. Define a low level cluster as a cluster all of whose children are documents. Our first algorithm is then revised as follows:

1. The document collection is clustered via the complete link clustering algorithm.
2. The resultant hierarchy is traversed and thesaurus classes are generated by
   - identifying each low level cluster, and
   - for each such cluster, constructing a thesaurus class by forming the intersection of all low frequency terms in the cluster.
3. The documents and queries are indexed (augmented) by the thesaurus classes.
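The revised traversal is easy to express over a SciPy linkage matrix. The following minimal sketch is our construction, with assumed inputs docs_terms (each document's set of term ids), doc_vectors (weighted rows), and doc_freq (term id to document count); it identifies low level clusters and forms one class per cluster:

    import numpy as np
    from scipy.cluster.hierarchy import linkage
    from scipy.spatial.distance import pdist

    def revised_thesaurus(docs_terms, doc_vectors, doc_freq, min_df):
        """One thesaurus class per low level cluster. In a SciPy linkage
        matrix, ids below n denote original documents (leaves), so a merge
        of two such ids is a cluster all of whose children are documents."""
        n = len(doc_vectors)
        Z = linkage(pdist(doc_vectors, metric="cosine"), method="complete")
        classes = []
        for a, b, _, _ in Z:
            if a < n and b < n:  # low level cluster: both children are leaves
                low = lambda d: {t for t in docs_terms[int(d)]
                                 if doc_freq[t] < min_df}
                cls = low(a) & low(b)
                if cls:  # keep only nonempty intersections
                    classes.append(cls)
        return classes

Note that only the minimum document frequency parameter survives in this version; the threshold value and the number of documents per cluster are no longer needed.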
This algorithm depends only on the setting of one parameter (minimum document frequency). It was applied to the four test collections, using the values for minimum document frequency shown in Table 7, values which produce useful results in these cases. As Table 7 shows, this approach can be used to produce useful thesauri, yielding improvements in retrieval effectiveness. The results, while in two of the four collections substantially less than those produced by the original algorithm, are comparable in the remaining cases to those of the original algorithm, at very little cost in terms of time and effort.
Table 7 Improvement Using Thesaurus Generated by the Revised Algorithm

Collection | Minimum Document Frequency | Percentage over Base Case
ADI        |                            | 5.9
Medlars    | 45                         | 13.6
CISI       | 30                         | 8.2
CACM       | 90                         | 3.6
SUMMARY

An algorithm for automatically constructing a global thesaurus, based on the term discrimination value model and a complete link clustering of the document collection, was validated by applying it to four standard test collections. The results of the experiments indicate that the algorithm can be used to produce useful global thesauri which yield significant improvements in retrieval effectiveness. An attempt to use discrimination value to differentiate between the useful and nonuseful thesaurus classes produced (i.e., between classes which were and were not useful to the retrieval process) was not successful. An attempt to determine thesaurus class membership by basing the entry criterion on the discrimination values of component terms rather than minimum document frequency was also unsuccessful. Finally, a revision of the algorithm, in which low level clusters (all of whose children are documents) are used along with minimum document frequency to form thesaurus classes, is proposed. Results indicate that this approach may be useful in some cases; for two of the four collections, this algorithm produces results comparable to those of the original algorithm, and the thesaurus classes are generated at little cost.

ACKNOWLEDGEMENTS

This work was supported in part by the University of Minnesota Graduate School Grant-in-Aid for Research.
REFERENCES

1. Crouch, C. A cluster-based approach to thesaurus construction. Proceedings of the Eleventh International Conference on Research and Development in Information Retrieval; 1988; Grenoble, France.
2. Crouch, C. An approach to the automatic construction of global thesauri. Information Processing and Management, 26(5): 629-640; 1990.
3. Cleverdon, C. W.; Mills, J.; Keen, M. Factors determining the performance of indexing systems. Aslib Cranfield Project, Vol. 1; 1966.
4. Salton, G. Scientific reports on information storage and retrieval (ISR 13). Department of Computer Science, Cornell University; 1968.
5. Salton, G. Scientific reports on information storage and retrieval (ISR 11). Department of Computer Science, Cornell University; 1966.
6. Salton, G., ed. The SMART retrieval system--experiments in automatic document processing. Englewood Cliffs, N.J.: Prentice-Hall; 1971.
7. Sparck Jones, K. Automatic keyword classification for information retrieval. London: Butterworths; 1971.
8. Fox, E. Lexical relations: enhancing effectiveness of information retrieval systems. SIGIR Forum, 15(3): 5-36; 1980.
9. Fox, E. Building the CODER lexicon: the Collins English dictionary and its adverb definitions. Tech. Report 86-23, Department of Computer Science, VPI&SU, Blacksburg, VA.
10. Wang, Y.; Vandendorpe, J.; Evens, M. Relational thesauri in information retrieval. Journal of the American Society for Information Science, 36(1): 15-27; 1985.
11. Fox, E.; Nutter, J. T.; Ahlswede, T.; Markowitz, J.; Evens, M. Building a large thesaurus for information retrieval. Proceedings of the Second Conference on Applied Natural Language Processing; 1988; Austin, TX.
12. Chen, H.; Dhar, V. Cognitive process as a basis for intelligent retrieval systems design. Information Processing and Management, 27(5): 405-432; 1991.
13. Chen, H.; Lynch, K. J. Automatic construction of networks of concepts characterizing document databases. IEEE Transactions on Systems, Man and Cybernetics; 1992.
14. Salton, G.; Yang, C. S.; Yu, C. T. A theory of term importance in automatic text analysis. Journal of the American Society for Information Science, 26(1): 33-44; 1975.
15. Salton, G.; McGill, M. Introduction to modern information retrieval. New York: McGraw-Hill; 1983.
16. El-Hamdouchi, A.; Willett, P. An improved algorithm for the calculation of exact term discrimination values. Information Processing and Management, 24(1): 17-22; 1988.
17. Voorhees, E. The effectiveness and efficiency of agglomerative hierarchic clustering in document retrieval. Ph.D. Thesis, Department of Computer Science, Cornell University; 1985.
18. Voorhees, E. Implementing agglomerative hierarchic clustering algorithms for use in document retrieval. Tech. Report 86-765, Department of Computer Science, Cornell University.
19. van Rijsbergen, C. J. Information retrieval, 2nd edition. London: Butterworths; 1979.
20. Buckley, C. Implementation of the SMART information retrieval system. Tech. Report 85-686, Department of Computer Science, Cornell University; 1985.
21. Fox, E. Characteristics of two new experimental collections in computer and information science containing textual and bibliographic concepts. Tech. Report 83-561, Department of Computer Science, Cornell University.
22. Fox, E. Extending the boolean and vector space models of information retrieval with p-norm queries and multiple concept types. Ph.D. Thesis, Department of Computer Science, Cornell University; 1983.