Knowledge Acquisition: Classification of Terms in a Thesaurus from a Corpus

Jean-David STA
EDF, Direction des Etudes et des Recherches
1, av. du Général de Gaulle
92140 Clamart - France
(33) 1 47.65.49.58
E-mail: [email protected]

Abstract

Faced with the growing volume and accessibility of electronic textual information, information retrieval and, in general, automatic documentation require updated terminological resources that are ever more voluminous. A current problem is the automated construction of these resources (e.g., terminologies, thesauri, glossaries, etc.) from a corpus. Various linguistic and statistical methods to handle this problem are coming to light. One problem that has been less studied is that of updating these resources, in particular, of classifying a term extracted from a corpus in a subject field, discipline or branch of an existing thesaurus. This is the first step in positioning a term extracted from a corpus in the structure of a thesaurus (generic relations, synonymy relations, etc.). This is an important problem in certain disciplines in which knowledge, and, in particular, vocabulary, is not very stable over time, especially because of neologisms. This experiment compares different models for representing a term from a corpus for its automatic classification in the subject fields of a thesaurus. The classification method used is linear discriminatory analysis, based on a learning sample. The models evaluated here are: a term/document model, where each term is a vector in document vector space, and two term/term models, where each term is a vector in term space and the coordinates are either the co-occurrence or the mutual information between terms.
The most effective model is the one based on mutual information between terms, which typifies the fact that two terms often appear together in the corpus, but rarely apart.

I. Introduction

In documentation, terminologies, thesauri and other terminological lists are reference systems which can be used for manual or automatic indexing. Indexing consists of recognising the terms in a text that belong to a reference system; this is called controlled indexing. The quality of the result of the indexing process depends in large part on the quality of the terminology (completeness, consistency, etc.). Thus, applications downstream from the indexing depend on these terminological resources.

The most thoroughly studied application is information retrieval (IR). Here, the term provides a means for accessing information through its standardising effect on the query and on the text to be found. The term can also be a variable that is used in statistical classification or clustering of documents ([BLO 92] and [STA 95a]), or in selective dissemination of information, in which it is used to bring together a document to be disseminated and its target [STA 93].

Textual information is becoming more and more accessible in electronic form. This accessibility is certainly one of the prerequisites for the massive use of natural language processing (NLP) techniques. These techniques, applied to particular domains, often use terminological resources that supplement the lexical resources. The lexical resources (general language dictionaries) are fairly stable, whereas terminologies evolve dynamically with the fields they describe. In particular, the disciplines of information processing (computers, etc.)
and biology or genetics are characterised today by an extraordinary terminological activity.

Unfortunately, the abundance of electronic corpora and the relative maturity of natural language processing techniques have highlighted a shortage of updated terminological data. The various efforts in automatic acquisition of terminologies from a corpus stem from this observation, and try to answer the following question: "How can candidate terms be extracted from a corpus?" Another important question is how to position a term in an existing thesaurus. That question can itself be subdivided into several questions that concern the role of the standard relationships in a thesaurus: synonymy, hyperonymy, etc.

The question studied in this experiment concerns the positioning or classification of a term in a subject field or semantic field of a thesaurus. This is the first step in a precise positioning using the standard relationships of a thesaurus. This problem is very difficult for a human being to resolve when he is not an expert in the field to which the term belongs, and one can hope that an automated classification process would be of great help.

Classifying a term in a subject field can be considered similar to word sense disambiguation (WSD), which consists in classifying a word in a conceptual class (one of its senses). The difference is that, in a corpus, a term is generally monosemous whereas a word is polysemous. Word sense disambiguation uses a single context (generally a window of a few words around the word to be disambiguated) as input to predict its sense among a few possible senses (generally fewer than ten).
Term subject field discrimination uses a representation of the term calculated on the whole corpus in order to classify it into about 330 subject fields in this experiment.

The experiment described here was used to evaluate different methods for classifying terms from a corpus in the subject fields of a thesaurus. After a brief description of the corpus and the thesaurus, automatic indexing and terminology extraction are described. Linguistic and statistical techniques are used to extract a candidate term from a corpus or to recognise a term in a document. This preparatory processing allows the document to be represented as a set of terms (candidate terms and keywords). A classification method is then implemented to classify a subset of 1,000 terms in the 49 themes and 330 semantic fields that make up the thesaurus. The 1,000 terms thus classified comprise the test sample that is used to evaluate three models for representing terms.

II. Data Preparation

II.1. Description of the Corpus

The corpus studied is a set of 10,000 scientific and technical documents in French (4,150,000 words). Each document consists of one or two pages of text. This corpus describes research carried out by the research division of EDF, the French electricity company. Many diverse subjects are dealt with: nuclear energy, thermal energy, home automation, sociology, artificial intelligence, etc. Each document describes the objectives and stages of a research project on a particular subject. These documents are used to plan EDF research activity.
Thus, the vocabulary used is either very technical, with subject field terms and candidate terms, or very general, with stylistic expressions, etc.

[Figure: a sample document annotated with terms ("construction de thesaurus", "indexation automatique") and general expressions ("Objectif", "Etat d'avancement", "phase d'industrialisation"). Caption: A document with terms and general expressions]

II.2. Description of the Thesaurus

The EDF thesaurus consists of 20,000 terms (including 6,000 synonyms) that cover a wide variety of fields (statistics, nuclear power plants, information retrieval, etc.). This reference system was created manually from corporate documents, and was validated with the help of many experts. Currently, updates are handled by a group of documentalists who regularly examine and insert new terms. One of the sources of new terms is the corpora. A linguistic and statistical extractor proposes candidate terms for validation by the documentalists. After validation, the documentalists must position the selected terms in the thesaurus. This is a difficult exercise because of the wide variety of fields.

The thesaurus is composed of 330 semantic (or subject) fields included in 49 themes such as mathematics, sociology, etc.

[Figure: extract from the "statistics" semantic field of the EDF thesaurus, with terms such as "théorie des erreurs", "statistique non paramétrique", "analyse discriminante", "statistique", "analyse statistique", "analyse de la variance", "modèle statistique linéaire", linked by generic relationships (arrows) and see-also relationships (lines)]
This example gives an overview of the various relations between terms. Each term belongs to a single semantic field. Each term is linked to other terms through a generic relation (arrow) or a neighbourhood relation (line). Other relations (e.g., synonym, translated by, etc.) exist, but are not shown in this example.

II.3. Document Indexing

As a first step, the set of documents in the corpus is indexed. This consists of producing two types of indexes: candidate terms and descriptors. The candidate terms are expressions that may become terms, and are submitted to an expert for validation. Descriptors are terms from the EDF thesaurus that are automatically recognised in the documents.

II.3.1. Terminological Filtering

In this experiment, terminological filtering is used for each document to produce terms that do not belong to the thesaurus, but which nonetheless might be useful to describe the documents. Moreover, these expressions are candidate terms that are submitted to experts or documentalists for validation. Linguistic and statistical terminological filtering are used. The method chosen for this experiment combines an initial linguistic extraction with statistical filtering [STA 95b].

Linguistic Extraction

Generally, it appears that the syntactic structure of a term in the French language is the noun phrase.
For example, in the EDF thesaurus, the syntactic structures of terms are distributed as follows:

    Syntactic structure                Example                      %
    Noun Adjective                     érosion fluviale             25.1
    Noun Preposition Noun              analyse de contenu           24.4
    Noun                               décentralisation             18.1
    Proper noun                        Chinon                        6.8
    Noun Preposition Article Noun      assurance de la qualité       3.2
    Noun Preposition Noun Adjective    unité de bande magnétique     2.8
    Noun Participle                    puissance absorbée            2.2
    Noun Noun                          accès mémoire                 2.1

    Distribution of the syntactic structures of terms

Thus, term extraction is initially syntactic. It consists of applying seven recursive syntactic patterns to the corpus [OGO 94]:

    NP → ADJECTIVE NP
    NP → NP ADJECTIVE
    NP → N
    NP → NP de NP
    NP → NP en NP
    NP → NP pour NP
    NP → NP NP

    The seven syntactic patterns for terminology extraction

Statistical Filtering

Linguistic extraction, however, is not enough. In fact, many expressions with a noun phrase structure are not terms: general expressions, stylistic effects, etc. Statistical methods can thus be used, in a second step, to discriminate terms from non-terminological expressions. Three indicators are used here:

- Frequency: This is based on the fact that the more often an expression is found in the corpus, the more likely it is to be a term. This statement must be kept in proportion, however. Indeed, it seems that a small number of words (usually very general uniterms) are very frequent, but are not terms.
- Variance: This is based on the idea that the more scattered the occurrences of an expression are within a document, the more likely it is to be a term. This is the most effective indicator. Its drawback is that it also highlights large noun phrases in which the terms are included.

- Local density [STA 95b]: This is based on the idea that the closer together the documents are that contain the expression, the more likely it is to be a term. The local density of an expression is the mean of the cosines between the documents which contain the given expression. A document is a vector in the document vector space, where a dimension is a term. This indicator highlights a certain number of terms that are not transverse to the corpus, but rather concentrated in documents that are close to each other. Nonetheless, it is not a very effective indicator for terms that are transverse to the corpus. For example, terms from computer science, which are found in a lot of documents, are not highlighted by this indicator.

Results of the Terminological Extraction

During this experiment, the terminological extraction ultimately produced 3,000 new terms that did not belong to the thesaurus. These new terms are used in the various representation models described below. The initial linguistic extraction produced about 50,000 expressions.

II.3.2. Controlled Indexing

A supplementary way of characterising a document's contents is by recognising controlled terms in the document, that is, terms that belong to a thesaurus. To do this, an NLP technique is used [BLO 92]. Each sentence is processed on three levels: morphologically, syntactically, and semantically. These steps use a grammar and a general language dictionary.

The method consists of breaking down the text fragment being processed by a series of successive transformations that may be syntactic (nominalisation, de-coordination, etc.), semantic (e.g., nuclear and atomic), or pragmatic (the thesaurus' synonym relationships are scanned to replace a synonym by its main form). At the end of these transformations, the decomposed text is compared to the list of documented terms of the thesaurus in order to supply the descriptors.

Results of the Controlled Indexing

Controlled indexing of the corpus supplied 4,000 terms (of the 20,000 in the thesaurus). Each document was indexed by 20 to 30 terms. These documented terms, like the candidate terms, are used in the representation models described below. The quality of the indexing process is estimated at 70 percent (number of right terms divided by number of terms). The wrong terms are essentially due to problems of polysemy. Indeed, some terms (generally uniterms) have multiple senses (for example "BASE") and produce a great part of the noise.

III. Term Subject Field Discrimination

III.1. Word Sense Disambiguation and Term Subject Field Discrimination

The discrimination of word senses is a well-known problem in computational linguistics [YNG 55]. The problem of WSD is to build indicators describing the different senses of a word. Given a context of a word, these indicators are used to predict its sense.
Faced with the difficulty of manually building these indicators, researchers have turned to resources such as machine-readable dictionaries [VER 90] and corpora [YAR 92].

WSD and term subject field discrimination from corpora can be considered similar in that they are both a problem of classification into a class (a sense for a word and a subject field for a term). Nevertheless, the problem is statistically different. In one case, a word is represented by a few variables (its context) and is classified into one class chosen among a few classes. In the other case, a term is represented by hundreds of variables (one of the models described in chapter IV) and is classified into a class chosen among hundreds of classes.

III.2. Linear Discriminatory Analysis

The problem of discrimination can be described as follows: a random variable X is distributed in a p-dimensional space; x represents the observed values of variable X. The problem is to determine the distribution of X among q distributions (the classes), based on the observed values x.

The method implemented here is linear discriminatory analysis. Using a sample that has already been classified, discriminatory analysis constructs classification functions which take into account the variables that describe the elements to be classified. Each element x to be classified is described by a binary vector x = (x1, x2, ..., xi, ..., xp) where xi=1 or xi=0.
Here, xi=1 means the variable xi describes the term x, and xi=0 means the variable xi does not describe the term x.

The probability that an element x is in a class c is written P( C=c | X=x ), where C is a random variable and X is a random vector. Using Bayes' formula, it may be deduced that:

    P( C=c | X=x ) = P( C=c ) P( X=x | C=c ) / P( X=x )

There are three probabilities to estimate:

- Estimate P( C=c ): P( C=c ) is estimated by nc / n, where nc is the number of elements of the class c and n is the number of elements in the sample.

- Estimate P( X=x ): this estimate is simplified by normalising the probabilities P( C=c | X=x ) to 1.

- Estimate P( X=x | C=c ): for this estimate, we assume that the random variables X1, X2, ..., Xp are independent for a given class c. This leads to:

      P( X=x | C=c ) = ∏ P( Xi=xi | C=c )

  P( Xi=1 | C=c ) is estimated by nc,i / nc and P( Xi=0 | C=c ) is estimated by 1 - nc,i / nc, where nc,i is the number of elements in the sample which are in class c and for which xi=1, and nc is the number of elements of the sample in class c.

Once all the probabilities are estimated, the classification function for an element x consists of choosing the class that has the highest probability. This function minimises the risk of classification error [RAO 65].

IV. Description of the Experiment

The purpose of this experiment is to determine the best way to classify candidate terms from a corpus in semantic fields.
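As an illustration, the estimation and classification scheme of linear discriminatory analysis described above can be sketched as follows. The learning sample here is hypothetical, and, as in the paper, the probability estimates are raw counts (in practice they are usually smoothed to avoid zero factors):

```python
from collections import defaultdict

def train(samples):
    """Estimate the probabilities of section III.2 from a learning sample.

    samples: list of (x, c) pairs where x is a tuple of 0/1 values
    (the binary representation of a term) and c is its class label."""
    n = len(samples)
    n_c = defaultdict(int)                         # nc: elements per class
    n_ci = defaultdict(lambda: defaultdict(int))   # nc,i: elements of class c with xi = 1
    for x, c in samples:
        n_c[c] += 1
        for i, xi in enumerate(x):
            if xi:
                n_ci[c][i] += 1
    return n, n_c, n_ci

def classify(x, model):
    """Choose the class maximising P(C=c) * prod_i P(Xi=xi | C=c)."""
    n, n_c, n_ci = model
    best_class, best_p = None, -1.0
    for c, nc in n_c.items():
        p = nc / n                     # estimate of P(C=c)
        for i, xi in enumerate(x):
            p_i1 = n_ci[c][i] / nc     # estimate of P(Xi=1 | C=c)
            p *= p_i1 if xi else (1.0 - p_i1)
        if p > best_p:
            best_class, best_p = c, p
    return best_class

# Hypothetical learning sample: 4 binary variables, 2 semantic fields.
learning = [
    ((1, 1, 0, 0), "statistics"),
    ((1, 0, 1, 0), "statistics"),
    ((0, 0, 1, 1), "nuclear"),
    ((0, 1, 1, 1), "nuclear"),
]
model = train(learning)
print(classify((1, 1, 1, 0), model))  # → statistics
```

Normalising over the classes is unnecessary for classification itself, since the denominator P( X=x ) is the same for every class.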
The general principle is, firstly, to represent the candidate terms to be classified, then to classify them, and finally to evaluate the quality of the classification. The classification method is based on a learning process, which requires a set of previously-classified terms (the learning sample, manually classified). The evaluation also requires a test sample: a set of previously-classified terms which have to be automatically classified. The evaluation then consists of comparing the results of the classification process to the previous manual classification of the test sample.

IV.1. Learning and Test Sample

The thesaurus terms found in the corpus were separated into two sets: a subset of about 3,000 terms which composed the learning sample, and a subset of 1,000 terms which composed the test sample. All these terms had already been manually classified by theme and semantic field in the thesaurus.

Rate of Well Classified Terms

The evaluation criterion is the rate of well classified terms calculated among the 1,000 terms of the test sample:

    Rate of well classified terms = number of well classified terms divided by the number of classified terms.

IV.2. Term Representation Models

The representation of the terms to be classified is the main parameter that determines the quality of the classification. Indeed, this experiment showed that, for a single representation model, there is no significant difference between the results of the various classification methods.
For example, the nearest neighbours method (KNN) [DAS 90] was tested without any significant difference. The only parameter that truly influences the result is the way of representing the terms to be classified. Three models were evaluated: the first is based on a term/document approach, and the two others on a term/term approach.

IV.2.1. The Term/Document Model

The term/document model uses the transposition of the standard document/term matrix. Each line represents a term, and each column a document. At the intersection of a term and a document, there is 0 if the term is not in the document in question, and 1 if it is present. The standard document/term matrix showed its worth in the Salton vector model [SAL 88]. It can therefore be hoped that the documents that contain a term provide a good representation of this term for its classification in a field.

IV.2.2. The Term/Term Models

The term/term model uses a matrix where each line represents a term to be classified, and each column represents a thesaurus term recognised in the corpus, or a candidate term extracted from the corpus. At the intersection of a line and a column, two indicators have been studied.

Co-occurrence matrix: The indicator is the co-occurrence between two terms. Co-occurrence reflects the fact that two terms are found together in documents.

Mutual information matrix: The indicator is the mutual information between two terms.
Mutual information ([CHU 89] and [FEA 61]) reflects the fact that two terms are often found together in documents, but rarely alone. MI(x,y), the mutual information between terms x and y, is written:

    MI(x,y) = log2( P(x,y) / ( P(x) P(y) ) )

where P(x,y) is the probability of observing x and y together, P(x) the probability of observing x, and P(y) the probability of observing y.

In both cases, the matrix has to be transformed into a binary matrix. The solution is to choose a threshold under which the value is set to 0 and above which the value is set to 1. Many values were tested. The best classification for the co-occurrence matrix is obtained for a threshold of three. The best classification for the mutual information matrix is obtained for a threshold of 0.05.

Results

The main results concern three term representation models and two classifications: the first in 49 themes, and the second in 330 semantic fields. The criterion chosen for the evaluation is the rate of well classified terms.

    Method                                     Themes           Semantic fields
                                               classification   classification
    Term/document model                        42.9             27.3
    Term/term model with co-occurrence         31.5             19.8
    Term/term model with mutual information    89.8             65.2

    Rate of Well Classified Terms

There is a significant difference between the term/term model with mutual information and the other two models. The good rates (89.8 and 65.2) can be improved if the system proposes more than one class. In the case of 3 proposed classes (sorted by descending probabilities), the probability that the right class is among these classes is estimated respectively at 97.1 and 91.2 for the themes and the semantic fields.
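The term/term indicators and their binarisation can be sketched from a small document/term incidence structure. The documents and terms below are hypothetical, and the probabilities in MI(x,y) are estimated by document frequencies:

```python
import math

# Hypothetical corpus: each document is the set of terms recognised in it.
docs = [
    {"analyse discriminante", "statistique"},
    {"analyse discriminante", "statistique"},
    {"statistique"},
    {"centrale nucleaire"},
]
terms = sorted({t for d in docs for t in d})
n = len(docs)

def cooccurrence(x, y):
    """Number of documents containing both terms."""
    return sum(1 for d in docs if x in d and y in d)

def mutual_information(x, y):
    """MI(x,y) = log2( P(x,y) / (P(x) P(y)) ), with probabilities
    estimated from document frequencies."""
    pxy = cooccurrence(x, y) / n
    px = sum(1 for d in docs if x in d) / n
    py = sum(1 for d in docs if y in d) / n
    if pxy == 0:
        return float("-inf")   # the terms never co-occur
    return math.log2(pxy / (px * py))

def binary_matrix(indicator, threshold):
    """Term/term matrix binarised with the given threshold."""
    return {x: {y: 1 if indicator(x, y) > threshold else 0 for y in terms}
            for x in terms}

mi = binary_matrix(mutual_information, 0.05)
print(mi["analyse discriminante"]["statistique"])  # → 1
```

On a real corpus the same construction would be applied to the 4,000 recognised thesaurus terms and 3,000 candidate terms, with the thresholds reported above (3 for co-occurrence, 0.05 for mutual information).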
Discussion

Without a doubt, the term/term model with mutual information has the best performance. Nonetheless, these good results must be qualified.

A detailed examination of the results shows that there is a wide dispersion of the rate of well classified terms depending on the field (the 49 themes or the 330 semantic fields). The explanation is that the documents in the corpus are essentially thematic. Thus, the vocabulary for certain fields in the thesaurus is essentially concentrated in a few documents, and classification based on mutual information is then efficient. On the other hand, certain fields are transverse (e.g., computer science, etc.), and are found in many documents that have few points in common (and little common technical vocabulary). Terms in these fields are difficult to classify.

Another problem with the method is connected to the representativeness of the learning sample. Commonly, for a given field, a certain number of terms are available (for example, 20,000 terms in the EDF thesaurus). It is more rare for all these terms to be found in the corpus under study (4,000 terms found in this experiment). Thus, if a class (a theme or semantic field) is not well represented in the corpus, the method is unable to classify candidate terms in this class, because the learning sample for this class is not large enough.

Through this experiment, an automatic classification of 300 candidate terms in 330 semantic fields was proposed to the group that validates new thesaurus terms. This classification was used by the documentalists to update the EDF thesaurus. Each term was proposed in three semantic fields (among 330), sorted from the highest probability (of being the right semantic field of the term) to the lowest.
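Proposing three semantic fields per term, as described above, amounts to sorting the estimated class probabilities and keeping the best three. A minimal sketch, with hypothetical probability scores for one candidate term:

```python
def propose_fields(probabilities, k=3):
    """Return the k most probable semantic fields, best first.

    probabilities: mapping from semantic field to the estimated
    probability that it is the right field for the term."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return [field for field, _ in ranked[:k]]

# Hypothetical probabilities for one candidate term.
scores = {"statistique": 0.61, "informatique": 0.22,
          "sociologie": 0.09, "energie nucleaire": 0.08}
print(propose_fields(scores))  # → ['statistique', 'informatique', 'sociologie']
```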
References

[BLO 92] Blosseville M.J., Hebrail G., Monteil M.G., Penot N., "Automatic Document Classification: Natural Language Processing, Statistical Data Analysis, and Expert System Techniques used together", ACM-SIGIR'92 proceedings, 51-58, 1992.

[CHU 89] Church K., "Word Association Norms, Mutual Information, and Lexicography", ACL 27 proceedings, Vancouver, 76-83, 1989.

[DAS 90] Dasarathy B.V., "Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques", IEEE Computer Society Press, 1990.

[FEA 61] Fano R., "Transmission of Information", MIT Press, Cambridge, Massachusetts, 1961.

[OGO 94] Ogonowski A., Herviou M.L., Monteil M.G., "Tools for extracting and structuring knowledge from text", Coling'94 proceedings, 1049-1053, 1994.

[RAO 65] Rao C.R., "Linear Statistical Inference and its Applications", 2nd edition, Wiley, 1965.

[SAL 88] Salton G., "Automatic Text Processing: the Transformation, Analysis, and Retrieval of Information by Computer", Addison-Wesley, 1988.

[STA 93] Sta J.D., "Information filtering: a tool for communication between researchers", INTERCHI'93 proceedings, Amsterdam, 177-178, 1993.

[STA 95a] Sta J.D., "Document expansion applied to classification: weighting of additional terms", ACM-SIGIR'95 proceedings, Seattle, 177-178, 1995.

[STA 95b] Sta J.D., "Comportement statistique des termes et acquisition terminologique à partir de corpus", T.A.L., Vol. 36, Num. 1-2, 119-132, 1995.

[VER 90] Veronis J., Ide N., "Word Sense Disambiguation with Very Large Neural Networks Extracted from Machine Readable Dictionaries", COLING'90 proceedings, 389-394, 1990.

[YAR 92] Yarowsky D., "Word-Sense Disambiguation Using Statistical Models of Roget's Categories Trained on Large Corpora", COLING'92 proceedings, 454-460, 1992.

[YNG 55] Yngve V., "Syntax and the Problem of Multiple Meaning", in Machine Translation of Languages, William Locke and Donald Booth eds., Wiley, New York, 1955.