Knowledge Acquisition:
Classification of Terms in a Thesaurus from a Corpus

STA Jean-David
EDF, Direction des Etudes et des Recherches
1, av. du Général de Gaulle
92140 Clamart - France
(33) 1 47.65.49.58
E-mail: [email protected]
Abstract

Faced with the growing volume and accessibility of electronic textual information, information retrieval and, in general, automatic documentation require updated terminological resources that are ever more voluminous. A current problem is the automated construction of these resources (e.g., terminologies, thesauri, glossaries, etc.) from a corpus. Various linguistic and statistical methods to handle this problem are coming to light. One problem that has been less studied is that of updating these resources, in particular, of classifying a term extracted from a corpus in a subject field, discipline or branch of an existing thesaurus. This is the first step in positioning a term extracted from a corpus in the structure of a thesaurus (generic relations, synonymy relations, etc.). This is an important problem in certain disciplines in which knowledge, and, in particular, vocabulary, is not very stable over time, especially because of neologisms.

This experiment compares different models for representing a term from a corpus for its automatic classification in the subject fields of a thesaurus. The classification method used is linear discriminatory analysis, based on a learning sample. The models evaluated here are: a term/document model, where each term is a vector in document vector space, and two term/term models, where each term is a vector in term space and the coordinates are either the co-occurrence or the mutual information between terms. The most effective model is the one based on mutual information between terms, which typifies the fact that two terms often appear together in the corpus, but rarely apart.
I. Introduction
In documentation, terminologies, thesauri and other terminological lists are reference systems which can be used for manual or automatic indexing. Indexing consists of recognising the terms in a text that belong to a reference system; this is called controlled indexing. The quality of the result of the indexing process depends in large part on the quality of the terminology (completeness, consistency, etc.). Thus, applications downstream from the indexing depend on these terminological resources.
The most thoroughly studied application is information retrieval (IR). Here, the term provides a means for accessing information through its standardising effect on the query and on the text to be found. The term can also be a variable that is used in statistical classification or clustering of documents ([BLO 92] and [STA 95a]), or in selective dissemination of information, in which it is used to bring together a document to be disseminated and its target [STA 93].
Textual information is becoming more and more accessible in electronic form. This accessibility is certainly one of the prerequisites for the massive use of natural language processing (NLP) techniques. These techniques, applied to particular domains, often use terminological resources that supplement the lexical resources. The lexical resources (general language dictionaries) are fairly stable, whereas terminologies evolve dynamically with the fields they describe. In particular, the disciplines of information processing (computers, etc.) and biology or genetics are characterised today by an extraordinary terminological activity.
Unfortunately, the abundance of electronic corpora and the relative maturity of natural language processing techniques have revealed a shortage of updated terminological data. The various efforts in automatic acquisition of terminologies from a corpus stem from this observation, and try to answer the following question: "How can candidate terms be extracted from a corpus?"
Another important question is how to position a term in an existing thesaurus. That question can itself be subdivided into several questions that concern the role of the standard relationships in a thesaurus: synonymy, hyperonymy, etc. The question studied in this experiment concerns the positioning, or classification, of a term in a subject field or semantic field of a thesaurus. This is the first step in a precise positioning using the standard relationships of a thesaurus. This problem is very difficult for a human being to resolve when he is not an expert in the field to which the term belongs, and one can hope that an automated classification process would be of great help.
Classifying a term in a subject field can be considered similar to word sense disambiguation (WSD), which consists in classifying a word in a conceptual class (one of its senses). The difference is that, in a corpus, a term is generally monosemous whereas a word is polysemous. Word sense disambiguation uses a single context (generally a window of a few words around the word to be disambiguated) as input to predict its sense among a few possible senses (generally fewer than ten). Term subject field discrimination uses a representation of the term calculated on the whole corpus in order to classify it into about 330 subject fields in this experiment.
The experiment described here was used to evaluate different methods for classifying terms from a corpus in the subject fields of a thesaurus. After a brief description of the corpus and the thesaurus, automatic indexing and terminology extraction are described. Linguistic and statistical techniques are used to extract a candidate term from a corpus or to recognise a term in a document. This preparatory processing allows the document to be represented as a set of terms (candidate terms and keywords). A classification method is then implemented to classify a subset of 1,000 terms in the 49 themes and 330 semantic fields that make up the thesaurus. The 1,000 terms thus classified comprise the test sample that is used to evaluate three models for representing terms.
II. Data Preparation

II.1. Description of the Corpus
The corpus studied is a set of 10,000 scientific and technical documents in French (4,150,000 words). Each document consists of one or two pages of text. This corpus describes research carried out by the research division of EDF, the French electricity company. Many diverse subjects are dealt with: nuclear energy, thermal energy, home automation, sociology, artificial intelligence, etc. Each document describes the objectives and stages of a research project on a particular subject. These documents are used to plan EDF research activity.

Thus, the vocabulary used is either very technical, with subject field terms and candidate terms, or very general, with stylistic expressions, etc.
[Figure: a sample document ("Construction de thesaurus : Etat d'avancement") annotated to show terms (e.g., "phase d'industrialisation", "construction de thesaurus", "indexation automatique") and general expressions.]

A document with terms and general expressions
II.2. Description of the Thesaurus
The EDF thesaurus consists of 20,000 terms (including 6,000 synonyms) that cover a wide variety of fields (statistics, nuclear power plants, information retrieval, etc.). This reference system was created manually from corporate documents, and was validated with the help of many experts. Currently, updates are handled by a group of documentalists who regularly examine and insert new terms. One of the sources of new terms is corpora. A linguistic and statistical extractor proposes candidate terms for validation by the documentalists. After validation, the documentalists must position the selected terms in the thesaurus. It is a difficult exercise because of the wide variety of fields.
The thesaurus is composed of 330 semantic (or subject) fields included in 49 themes such as mathematics, sociology, etc.

[Figure: extract from the "statistics" semantic field of the EDF thesaurus, showing terms such as "théorie des erreurs", "statistique non paramétrique", "analyse discriminante", "statistique", "analyse statistique", "analyse de la variance", "modèle statistique linéaire" and "estimation", linked by generic relationships (arrows) and "see also" relationships (lines).]
This example gives an overview of the various relations between terms. Each term belongs to a single semantic field. Each term is linked to other terms through a generic relation (arrow) or a neighbourhood relation (line). Other relations (e.g., synonym, translated by, etc.) exist, but are not shown in this example.
II.3. Document Indexing
As a first step, the set of documents in the corpus is indexed. This consists of producing two types of indexes: candidate terms, and descriptors. The candidate terms are expressions that may become terms, and are submitted to an expert for validation. Descriptors are terms from the EDF thesaurus that are automatically recognised in the documents.
II.3.1. Terminological Filtering
In this experiment, terminological filtering is used for each document to produce terms that do not belong to the thesaurus, but which nonetheless might be useful to describe the documents. Moreover, these expressions are candidate terms that are submitted to experts or documentalists for validation.

Linguistic and statistical terminological filtering are used. The method chosen for this experiment combines an initial linguistic extraction with statistical filtering [STA 95b].
Linguistic Extraction
Generally, the syntactic structure of a term in French is the noun phrase. For example, in the EDF thesaurus, the syntactic structures of terms are distributed as follows:

syntactic structure                example                        %
Noun Adjective                     érosion fluviale               25.1
Noun Preposition Noun              analyse de contenu             24.4
Noun                               décentralisation               18.1
Proper noun                        Chinon                          6.8
Noun Preposition Article Noun      assurance de la qualité         3.2
Noun Preposition Noun Adjective    unité de bande magnétique       2.8
Noun Participle                    puissance absorbée              2.2
Noun Noun                          accès mémoire                   2.1

Distribution of the syntactic structures of terms
Thus, term extraction is initially syntactical. It consists of applying seven recursive syntactic patterns to the corpus [OGO 94]:

NP <- ADJECTIVE NP
NP <- NP ADJECTIVE
NP <- NP à NP
NP <- NP de NP
NP <- NP en NP
NP <- NP pour NP
NP <- NP NP

The seven syntactic patterns for terminology extraction
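The recursive application of such patterns can be sketched as follows. This is a minimal illustration, not the [OGO 94] implementation: POS tags and the bottom-up merging strategy are assumptions, and a real system would work over a full tagger's output.

```python
# Minimal sketch (not the [OGO 94] parser): adjacent constituents of a
# POS-tagged sentence are merged bottom-up via the seven patterns until
# no pattern applies, yielding candidate noun phrases.
PATTERNS = [
    ("ADJ", "NP"),          # NP <- ADJECTIVE NP
    ("NP", "ADJ"),          # NP <- NP ADJECTIVE
    ("NP", "à", "NP"),      # NP <- NP à NP
    ("NP", "de", "NP"),     # NP <- NP de NP
    ("NP", "en", "NP"),     # NP <- NP en NP
    ("NP", "pour", "NP"),   # NP <- NP pour NP
    ("NP", "NP"),           # NP <- NP NP
]

def extract_nps(tagged):
    """tagged: list of (word, tag) pairs; nouns ("N") start as atomic NPs,
    prepositions carry their own word form as tag."""
    items = [(w, "NP" if t == "N" else t) for w, t in tagged]
    changed = True
    while changed:
        changed = False
        for i in range(len(items)):
            for pat in PATTERNS:
                n = len(pat)
                window = items[i:i + n]
                if len(window) == n and all(
                    t == p for (w, t), p in zip(window, pat)
                ):
                    merged = " ".join(w for w, _ in window)
                    items[i:i + n] = [(merged, "NP")]  # recursive step
                    changed = True
                    break
            if changed:
                break
    return [w for w, t in items if t == "NP"]

print(extract_nps([("analyse", "N"), ("de", "de"), ("contenu", "N")]))
# -> ['analyse de contenu']
```

Because merged noun phrases are themselves tagged NP, the patterns apply recursively, so nested structures like "unité de bande magnétique" fall out of the same seven rules.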
Statistical Filtering

Linguistic extraction, however, is not enough. In fact, many expressions with a noun phrase structure are not terms. This includes general expressions, stylistic effects, etc. Statistical methods can thus be used, in a second step, to discriminate terms from non-terminological expressions. Three indicators are used here:
- Frequency: This is based on the fact that the more often an expression is found in the corpus, the more likely it is to be a term. This statement must be kept in proportion, however. Indeed, it seems that a small number of words (usually, very general uniterms) are very frequent, but are not terms.

- Variance: This is based on the idea that the more the occurrences of an expression are scattered across documents, the more likely it is to be a term. This is the most effective indicator. Its drawback is that it also highlights large noun phrases in which the terms are included.

- Local density [STA 95b]: This is based on the idea that the closer together the documents are that contain the expression, the more likely it is to be a term. The local density of an expression is the mean of the cosines between documents which contain the given expression. A document is a vector in the document vector space, where a dimension is a term. This indicator highlights a certain number of terms that are not transverse to the corpus, but rather concentrated in documents that are close to each other. Nonetheless, this is not a very effective indicator for terms that are transverse to the corpus. For example, terms from computer science, which are found in a lot of documents, are not highlighted by this indicator.
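The local density indicator can be sketched as follows. This is a hedged reading of the definition above (documents as binary vectors, cosine over all pairs of containing documents); the exact weighting used in [STA 95b] may differ.

```python
# Sketch of the local density indicator: documents are binary vectors in
# term space (modelled here as sets of terms), and the local density of an
# expression is the mean cosine similarity over all pairs of documents
# that contain it.
from itertools import combinations
from math import sqrt

def cosine(a, b):
    """Cosine between two binary document vectors given as term sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))

def local_density(expression, documents):
    """documents: list of sets of terms."""
    containing = [d for d in documents if expression in d]
    pairs = list(combinations(containing, 2))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

docs = [
    {"thesaurus", "indexation", "corpus"},
    {"thesaurus", "indexation", "terme"},
    {"sociologie", "thesaurus"},
]
# "indexation" occurs in two similar documents -> high local density
print(local_density("indexation", docs))
```

An expression concentrated in a cluster of similar documents scores high; a transverse expression, spread over dissimilar documents, scores low, which matches the limitation noted above for fields like computer science.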
Results of the Terminological Extraction

The initial linguistic extraction produced about 50,000 expressions. During this experiment, the terminological extraction ultimately produced 3,000 new terms that did not belong to the thesaurus. These new terms are used in the various representation models described below.
II.3.2. Controlled Indexing

A supplementary way of characterising a document's contents is by recognising controlled terms in the document that belong to a thesaurus. To do this, an NLP technique is used [BLO 92]. Each sentence is processed on three levels: morphologically, syntactically, and semantically. These steps use a grammar and a general language dictionary.
The method consists of breaking down the text fragment being processed through a series of successive transformations that may be syntactical (nominalisation, de-coordination, etc.), semantic (e.g., nuclear and atomic), or pragmatic (the thesaurus' synonym relationships are scanned to replace a synonym by its main form). At the end of these transformations, the decomposed text is compared to the list of documented terms of the thesaurus in order to supply the descriptors.
Results of the Controlled Indexing

Controlled indexing of the corpus supplied 4,000 terms (of 20,000 in the thesaurus). Each document was indexed by 20 to 30 terms. These documented terms, like the candidate terms, are used in the representation models described below. The quality of the indexing process is estimated at 70 percent (number of right terms divided by number of terms). The wrong terms are essentially due to problems of polysemy. Indeed, some terms (generally uniterms) have multiple senses (for example "BASE") and produce a great part of the noise.
III. Term Subject Field Discrimination

III.1. Word Sense Disambiguation and Term Subject Field Discrimination
The discrimination of word senses is a well-known problem in computational linguistics [YNG 55]. The problem of WSD is to build indicators describing the different senses of a word. Given a context of a word, these indicators are used to predict its sense. Faced with the difficulty of manually building these indicators, researchers have turned to resources such as machine-readable dictionaries [VER 90] and corpora [YAR 92].
WSD and term subject field discrimination from corpora can be considered similar in the sense that they are both a problem of classification into a class (a sense for a word and a subject field for a term). Nevertheless, the problem is statistically different. In one case, a word is represented by a few variables (its context) and is classified into one class chosen among a few classes. In the other case, a term is represented by hundreds of variables (one of the models described in chapter IV) and is classified into a class chosen among hundreds of classes.
III.2. Linear Discriminatory Analysis
The problem of discrimination can be described as follows: a random variable X is distributed in a p-dimensional space; x represents the observed values of variable X. The problem is to determine the distribution of X among q distributions (the classes), based on the observed values x. The method implemented here is linear discriminatory analysis.

Using a sample that has already been classified, discriminatory analysis can construct classification functions which take into account the variables that describe the elements to be classified. Each element x to be classified is described by a binary vector x = (x1, x2, ..., xi, ..., xp) where xi = 1 or xi = 0:

xi = 1 means the variable xi describes the term x.
xi = 0 means the variable xi does not describe the term x.
The probability that an element x is in a class c is written P(C=c | X=x), where C is a random variable and X is a random vector. Using Bayes' formula, it may be deduced that:

P(C=c | X=x) = ( P(C=c) P(X=x | C=c) ) / P(X=x)

There are three probabilities to estimate:

- Estimate P(C=c)

P(C=c) is estimated by nc / n, where:
nc is the number of elements of the class c
n is the number of elements in the sample

- Estimate P(X=x)

This estimate is simplified by normalising the probabilities to 1.

- Estimate P(X=x | C=c)

For this estimate, we assume that the random variables X1, X2, ..., Xp are independent for a given class c. This leads to:

P(X=x | C=c) = Π P(Xi=xi | C=c)

P(Xi=1 | C=c) is estimated by nc,i / nc
P(Xi=0 | C=c) is estimated by 1 - nc,i / nc

where nc,i is the number of elements in the sample which are in class c and for which xi = 1, and nc is the number of elements of the sample in class c.
Once all the probabilities are estimated, the classification function for an element x consists of choosing the class that has the highest probability. This function minimises the risk of classification error [RAO 65].
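The estimates above can be put together in a small sketch. This is an illustration of the derived classification function, with the paper's raw estimators nc/n and nc,i/nc (no smoothing, so a zero estimate zeroes the whole product); the toy data and class labels are invented for the example.

```python
# Sketch of the classification function derived above: class priors nc/n
# and conditional estimates nc,i/nc are computed from a labelled learning
# sample of binary vectors, and a new element x is assigned the class with
# the highest (unnormalised) probability P(C=c) * prod P(Xi=xi | C=c).
from collections import defaultdict

def train(sample):
    """sample: list of (binary_vector, class_label) pairs."""
    n = len(sample)
    class_counts = defaultdict(int)                          # nc
    feature_counts = defaultdict(lambda: defaultdict(int))   # nc,i
    for x, c in sample:
        class_counts[c] += 1
        for i, xi in enumerate(x):
            if xi:
                feature_counts[c][i] += 1
    return n, class_counts, feature_counts

def classify(x, model):
    n, class_counts, feature_counts = model
    best_class, best_score = None, -1.0
    for c, nc in class_counts.items():
        score = nc / n                          # P(C=c)
        for i, xi in enumerate(x):
            p1 = feature_counts[c][i] / nc      # P(Xi=1 | C=c)
            score *= p1 if xi else (1 - p1)     # P(Xi=0 | C=c) = 1 - p1
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical learning sample: terms as binary vectors over 3 variables.
sample = [
    ((1, 1, 0), "statistics"),
    ((1, 0, 0), "statistics"),
    ((0, 0, 1), "sociology"),
    ((0, 1, 1), "sociology"),
]
model = train(sample)
print(classify((1, 0, 0), model))   # -> statistics
```

Under the independence assumption this is the naive Bayes decision rule over binary features; the normalising factor P(X=x) is dropped since it is constant across classes.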
IV. Description of the Experiment
The purpose of this experiment is to determine the best way to classify candidate terms from a corpus in semantic fields. The general principle is, firstly, to represent the candidate terms to be classified, then to classify them, and finally to evaluate the quality of the classification. The classification method is based on a learning process, which requires a set of previously classified terms (the manually classified learning sample). The evaluation also requires a test sample, a set of previously classified terms which have to be automatically classified. The evaluation then consists of comparing the results of the classification process to the previous manual classification of the test sample.
IV.1. Learning and Test Sample
The thesaurus terms found in the corpus were separated into two sets: a subset of about 3,000 terms which composed the learning sample, and a subset of 1,000 terms which composed the test sample. All these terms had already been manually classified by theme and semantic field in the thesaurus.
Rate of Well Classified Terms

The evaluation criterion is the rate of well classified terms, calculated among the 1,000 terms of the test sample:

Rate of well classified terms = number of well classified terms divided by the number of classified terms.

IV.2. Term Representation Models
The representation of the terms to be classified is the main parameter that determines the quality of the classification. Indeed, this experiment showed that, for a single representation model, there is no significant difference between the results of the various classification methods. For example, the k-nearest neighbours method (KNN) [DAS 90] was tested without any significant difference. The only parameter that truly influences the result is the way of representing the terms to be classified. Three models were evaluated. The first is based on a term/document approach, and the two others on a term/term approach.
IV.2.1. The Term/Document Model
The term/document model uses the transposition of the standard document/term matrix. Each line represents a term, and each column a document. At the intersection of a term and a document, there is 0 if the term is not in the document in question, and 1 if it is present. The standard document/term matrix showed its worth in the Salton vector model [SAL 88]. It can therefore be hoped that the documents that contain a term provide a good representation of this term for its classification in a field.
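The transposed binary matrix can be built as follows; a small sketch with invented document identifiers, assuming documents are given as sets of terms.

```python
# Sketch of the term/document model: the standard document/term matrix is
# transposed so each term becomes a binary vector over documents
# (1 if the term occurs in the document, 0 otherwise).
def term_document_matrix(documents):
    """documents: dict doc_id -> set of terms.
    Returns dict term -> binary vector over sorted doc ids."""
    doc_ids = sorted(documents)
    terms = sorted(set().union(*documents.values()))
    return {
        t: [1 if t in documents[d] else 0 for d in doc_ids]
        for t in terms
    }

docs = {"d1": {"thesaurus", "corpus"}, "d2": {"corpus", "indexation"}}
print(term_document_matrix(docs)["corpus"])   # -> [1, 1]
```

Each term's row is then the binary vector x fed to the classification function of section III.2.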
IV.2.2. The Term/Term Models
The term/term model uses a matrix where each line represents a term to be classified, and each column represents a thesaurus term recognised in the corpus, or a candidate term extracted from the corpus. At the intersection of a line and a column, two indicators have been studied.

Co-occurrence matrix: The indicator is the co-occurrence between two terms. Co-occurrence reflects the fact that two terms are found together in documents.
Mutual information matrix: The indicator is the mutual information between two terms. Mutual information ([CHU 89] and [FEA 61]) reflects the fact that two terms are often found together in documents, but rarely alone. MI(x,y) is the mutual information between terms x and y, and is written:

MI(x,y) = log2( P(x,y) / ( P(x) P(y) ) )

where P(x,y) is the probability of observing x and y together, P(x) the probability of observing x, and P(y) the probability of observing y.
In both cases, the matrix has to be transformed into a binary matrix. The solution is to choose a threshold under which the value is set to 0 and above which the value is set to 1. Many values were tested. The best classification for the co-occurrence matrix is obtained for a threshold of three. The best classification for the mutual information matrix is obtained for a threshold of 0.05.
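The mutual information indicator and its binarisation can be sketched as follows. The probabilities are estimated here from document frequencies, which is an assumption about the original system; the 0.05 threshold is the one reported above.

```python
# Sketch of the mutual information indicator and its binarisation:
# MI(x,y) = log2( P(x,y) / (P(x) P(y)) ), with probabilities estimated
# as document frequencies, then thresholded to obtain a binary matrix.
from math import log2

def mutual_information(x, y, documents):
    """documents: list of sets of terms."""
    n = len(documents)
    px = sum(x in d for d in documents) / n
    py = sum(y in d for d in documents) / n
    pxy = sum(x in d and y in d for d in documents) / n
    if pxy == 0 or px == 0 or py == 0:
        return float("-inf")    # never co-occur: no association
    return log2(pxy / (px * py))

def binarise(value, threshold=0.05):
    """Threshold reported in the experiment for the MI matrix."""
    return 1 if value > threshold else 0

docs = [
    {"analyse", "discriminante"},
    {"analyse", "discriminante", "statistique"},
    {"corpus", "terme"},
    {"corpus", "indexation"},
]
mi = mutual_information("analyse", "discriminante", docs)
print(mi, binarise(mi))   # -> 1.0 1  (the two terms always co-occur)
```

A pair that always appears together and rarely apart gets a high MI and a 1 in the binary matrix, which is exactly the behaviour the model is meant to capture.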
Results

The main results concern three term representation models and two classifications: the first in 49 themes, and the second in 330 semantic fields. The criterion chosen for the evaluation is the well classified rate.

Rate of Well Classified Terms:

Method                                    Themes          Semantic fields
                                          classification  classification
Term/document model                       42.9            27.3
Term/term model with co-occurrence        31.5            19.8
Term/term model with mutual information   89.8            65.2
There is a significant difference between the term/term model with mutual information and the other two models. The good rates (89.8 and 65.2) can be improved if the system proposes more than one class. In the case of 3 proposed classes (sorted by descending probabilities), the probability that the right class is in these classes is estimated respectively at 97.1 and 91.2 for the themes and the semantic fields.
Discussion

Without a doubt, the term/term model with mutual information has the best performance. Nonetheless, these good results must be qualified.

A detailed examination of the results shows that there is a wide dispersion of the rate of well classified terms depending on the field (the 49 themes or the 330 semantic fields). The explanation is that the documents in the corpus are essentially thematic. Thus, the vocabulary for certain fields in the thesaurus is essentially concentrated in a few documents. Classification based on mutual information is then efficient. On the other hand, certain fields are transverse (e.g., computer science, etc.), and are found in many documents that have few points in common (and little common technical vocabulary). Terms in these fields are difficult to classify.
Another problem with the method is connected to the representativeness of the learning sample. Commonly, for a given field, a certain number of terms are available (for example, 20,000 terms in the EDF thesaurus). It is rarer for all these terms to be found in the corpus under study (4,000 terms found in this experiment). Thus, if a class (a theme or semantic field) is not well represented in the corpus, the method is unable to classify candidate terms in this class because the learning sample for this class is not large enough.
Through
this experiment,
an automatic
classification
of 300
candidate terms in 330 semantic fields was p r o p o s e d to the group that
validates n e w thesaurus terms. This classification was used b y the
d o c u m e n t a l i s t s to update the EDF thesaurus. Each term was p r o p o s e d in
three semantic fields (among 330) sorted from the highest p r o b a b i l i t y
(to be the right semantic field of the term) to the lowest.
References

[BLO 92] Blosseville M.J., Hébrail G., Monteil M.G., Penot N., "Automatic Document Classification: Natural Language Processing, Statistical Data Analysis, and Expert System Techniques used together", ACM-SIGIR'92 proceedings, 51-58, 1992.

[CHU 89] Church K., "Word Association Norms, Mutual Information, and Lexicography", ACL 27 proceedings, Vancouver, 76-83, 1989.

[DAS 90] Dasarathy B.V., "Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques", IEEE Computer Society Press, 1990.

[FEA 61] Fano R., "Transmission of Information", MIT Press, Cambridge, Massachusetts, 1961.

[OGO 94] Ogonowski A., Herviou M.L., Monteil M.G., "Tools for extracting and structuring knowledge from text", Coling'94 proceedings, 1049-1053, 1994.

[RAO 65] Rao C.R., "Linear Statistical Inference and its Applications", 2nd edition, Wiley, 1965.

[SAL 88] Salton G., "Automatic Text Processing: the Transformation, Analysis, and Retrieval of Information by Computer", Addison-Wesley, 1988.

[STA 93] Sta J.D., "Information filtering: a tool for communication between researchers", INTERCHI'93 proceedings, Amsterdam, 177-178, 1993.

[STA 95a] Sta J.D., "Document expansion applied to classification: weighting of additional terms", ACM-SIGIR'95 proceedings, Seattle, 177-178, 1995.

[STA 95b] Sta J.D., "Comportement statistique des termes et acquisition terminologique à partir de corpus", T.A.L., Vol. 36, Num. 1-2, 119-132, 1995.

[VER 90] Veronis J., Ide N., "Word Sense Disambiguation with Very Large Neural Networks Extracted from Machine Readable Dictionaries", COLING'90 proceedings, 389-394, 1990.

[YAR 92] Yarowsky D., "Word-Sense Disambiguation Using Statistical Models of Roget's Categories Trained on Large Corpora", COLING'92 proceedings, 454-460, 1992.

[YNG 55] Yngve V., "Syntax and the Problem of Multiple Meaning" in Machine Translation of Languages, William Locke and Donald Booth eds., Wiley, New York, 1955.