EXPERIMENTS IN AUTOMATIC STATISTICAL THESAURUS CONSTRUCTION

Carolyn J. Crouch and Bokyung Yang*
Department of Computer Science
University of Minnesota, Duluth
Duluth, MN 55812

*Current address: West Publishing Company, Eagan, Minnesota, 55123.

ABSTRACT

A well constructed thesaurus has long been recognized as a valuable tool in the effective operation of an information retrieval system. This paper reports the results of experiments designed to determine the validity of an approach to the automatic construction of global thesauri (described originally by Crouch in [1] and [2]), based on a clustering of the document collection. The authors validate the approach by showing that the use of thesauri generated by this method produces substantial improvements in retrieval effectiveness in four test collections. Term discrimination value theory, used by the generation algorithm to determine a term's membership in a particular thesaurus class, is found not to be useful in distinguishing a "good" thesaurus class from an alternate class (i.e., an "indifferent" or "poor" class). In conclusion, an approach which greatly simplifies the generation of thesaurus classes is suggested and is shown to produce comparable retrieval results in some cases.

INTRODUCTION

The value of a well constructed thesaurus in the effective operation of an information retrieval system is widely recognized. In this paper, the authors concentrate on the global thesaurus, wherein thesaurus classes (classes of similar or "synonymous" terms) are first generated and then used to index both documents and queries. Many experiments regarding such thesauri may be found in the literature. Early contributions to the state of knowledge on the subject came from work in both manual and automatic thesaurus construction: the Cranfield experiments of the Aslib project, the experiments performed with the Smart system, and Sparck Jones' work in automatic keyword classification (e.g., [3, 4, 5, 6, 7]) suggested that automatic approaches to thesaurus construction are viable and can produce thesauri which are comparable to those produced by manual methods at much lower cost.

A more recent approach to automatic thesaurus construction has centered on identifying the semantic relations (synonymy, taxonomy, collocation, etc.) that exist between word pairs, producing thesauri in the form of sets of lexical or lexical-semantic relations (e.g., Fox [8, 9]); this work continues on the CODER system and its relational lexicon [10, 11]. Other researchers use expert system and artificial intelligence-based techniques to generate relational dictionaries [12, 13]. (These approaches, while of considerable interest, remain largely to be evaluated.)

This paper reports on the results achieved by using an algorithm for automatic thesaurus generation described in detail in [1] and [2].
This algorithm is based on an appropriate clustering of the document space and the use of the term discrimination value theory of Salton, Yang, and Yu [14] to select terms for entry into a particular thesaurus class. It produces not a universal lexicon, such as that envisioned by Fox, but rather a thesaurus applicable to the collection from which it was generated. Although the resultant thesaurus is domain dependent, the algorithm may be applied to any document collection whose representation is compatible with the vector model.

The term discrimination value model itself is based on the well-known vector space model [15], wherein both documents and queries are regarded as (weighted) vectors and the correlation between vectors is computed by means of a similarity measure, such as cosine. Although the vector model offers advantages (such as ranking, clustering, and term weighting), it also has its disadvantages (such as its inability to express term relationships). One solution to this problem was suggested by Salton, Yang, and Yu [14], who showed that the best document space for retrieval purposes is one which maximizes the average separation between documents in the document space. In such a space it is easier to distinguish between documents, and thus to retrieve the documents which are most similar to a given query. This observation is the basis of the term discrimination value model, which assigns specific roles to single terms, term phrases, and term classes.

The discrimination value of a term is defined as a measure of the change in space separation which occurs when a given term is assigned to the document collection. A good discriminator is a term which, when assigned to a document, decreases the space density (rendering the documents less similar to each other). A poor discriminator, then, increases space density. Terms may thus be classified into three categories: good discriminators (terms with positive discrimination values), indifferent discriminators (terms with near-zero discrimination values), and poor discriminators (terms with negative discrimination values). Only good discriminators should be used directly as index terms. The retrieval properties of poor discriminators can be improved by combining these terms to form phrases; the retrieval properties of indifferent discriminators can be improved by combining them into appropriate thesaurus classes.

The calculation of discrimination value is normally expensive. For this reason, Salton, Yang, and Yu suggested that the easily calculated document frequency of a term be used as an approximation to its discrimination value. The document frequency, fk, of a term k is defined as the number of documents in which k appears. According to [14], if n represents the number of documents in the collection, terms may be ranked by document frequency as follows: if fk > n/10, k is considered a high frequency term (a term which has a negative discrimination value and is a poor discriminator); if fk < n/100, k is considered a low frequency term (a term with a near-zero discrimination value, an indifferent discriminator); and if n/100 <= fk <= n/10, k is a medium frequency term (a term with a positive discrimination value, a good discriminator). Empirical results have shown that document frequency and discrimination value are well correlated.
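To make these ranges concrete, the following sketch (ours, not the authors' code; it assumes a collection represented simply as a list of per-document term lists) classifies the terms of a collection by document frequency according to the n/10 and n/100 guidelines of [14]:

    from collections import Counter

    def classify_terms(documents):
        """documents: list of per-document term lists.
        Returns {term: role}, following the guidelines of [14]."""
        n = len(documents)
        # f_k = number of documents in which term k appears
        df = Counter(t for doc in documents for t in set(doc))
        roles = {}
        for term, f_k in df.items():
            if f_k > n / 10:        # high frequency: poor discriminator
                roles[term] = "phrase component"
            elif f_k < n / 100:     # low frequency: indifferent discriminator
                roles[term] = "thesaurus class candidate"
            else:                   # medium frequency: good discriminator
                roles[term] = "index term"
        return roles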
A recent modification of the calculation of discrimination value, based on the centroid of the document space, by El-Hamdouchi and Willett [16], has greatly reduced the time required to calculate discrimination values. By computing the density of the document space before and after the assignment of each term, the discrimination value of the term can be determined. This modification makes it possible to use discrimination value, instead of the easily calculated document frequency, as an indicator of how a term relates to the other terms in the collection (i.e., whether it should be used alone as an index term, used as a component of a phrase, or combined with other terms to become a member of a thesaurus class).

AN ALGORITHM FOR CONSTRUCTING GLOBAL THESAURI

A thesaurus is composed of a set of thesaurus classes. A thesaurus class is made up of a set of "closely related" terms: terms which are similar enough in the context of the collection to justify their being used together or combined to represent a single, new word type within this collection. The term discrimination value theory specifies that the terms which make up a thesaurus class must be indifferent discriminators (or, equivalently, low frequency terms).

One cannot expect to classify terms in the low frequency domain into useful groupings on the basis of term-term similarity, since such terms have small co-occurrence frequencies and low computed similarities; there is simply too little information about the terms themselves. An alternative is to cluster the documents of the collection and to select, from closely related documents, those terms which meet the membership criterion (i.e., terms with near-zero discrimination values or low frequencies).

A clustering algorithm which produces small, tight clusters is the complete link algorithm [17, 18, 19], one of the class of agglomerative, hierarchical clustering methods. The algorithm may be described succinctly as follows: at each stage, the similarity between all clusters is computed, the two most similar clusters are merged, and the process continues until only one cluster remains. The result is a hierarchy of clusters in which the tightest cluster forms at the highest threshold value; the documents themselves are the leaves of the cluster tree. Since the similarity between two clusters is defined as the minimum of the similarities between all pairs of documents (where one document of the pair is in one cluster and the other document is in the other cluster), the resultant clusters are difficult to join and hence tend to be small and tight.

Consider Fig. 1. The squares represent documents, and the numbers in the circles represent the levels at which the documents cluster. Documents A and B cluster at a threshold value of 0.089, D and E cluster at a level of 0.149, and document C subsequently clusters with the D-E subtree at a level of 0.077. The A-B subtree and the C-D-E subtree cluster at a threshold value of 0.029.
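The succinct description above translates directly into code. The sketch below is an illustration, not the implementation used in these experiments ([17, 18, 19] describe far more efficient methods): it builds the complete link hierarchy from a precomputed document-document similarity matrix and records the level at which each cluster forms, so that the hierarchy can later be cut at a threshold as in Fig. 1.

    def complete_link(sim):
        """sim: symmetric matrix of document-document similarities.
        Returns the merge history as (member set, level) pairs."""
        clusters = [frozenset([i]) for i in range(len(sim))]
        merges = []
        while len(clusters) > 1:
            best = None
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    # complete link: cluster-cluster similarity is the
                    # MINIMUM over all cross-cluster document pairs
                    s = min(sim[i][j]
                            for i in clusters[a] for j in clusters[b])
                    if best is None or s > best[0]:
                        best = (s, a, b)
            s, a, b = best
            merged = clusters[a] | clusters[b]
            clusters = [c for k, c in enumerate(clusters)
                        if k not in (a, b)]
            clusters.append(merged)
            merges.append((merged, s))  # s: level at which they cluster
        return merges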
[Fig. 1: A sample hierarchy generated by the complete link algorithm.]

An algorithm which uses complete link clustering to construct global thesauri has previously been suggested by Crouch [1, 2]. A brief description of this algorithm follows:

1. The document collection is clustered via the complete link clustering algorithm.
2. The resultant hierarchy is traversed and thesaurus classes are generated, based on three user-specified parameters.
3. The documents and queries are indexed (augmented) by the thesaurus classes.

The characteristics of the thesaurus classes generated in step 2 are determined by the following parameters.

(i) THRESHOLD VALUE

Threshold value largely determines the document clusters from which terms are selected for inclusion in a thesaurus class. In Fig. 1, a threshold value of 0.090 would return only the D-E document cluster, since the level at which the documents cluster must be greater than or equal to the specified threshold in order for the cluster to be recognized. A threshold value of 0.085, on the other hand, would return two clusters (A-B and D-E). If a threshold value of 0.075 were specified, two clusters would again be returned: A-B and either D-E or C-D-E.

(ii) NUMBER OF DOCUMENTS PER CLUSTER

Since both D-E and C-D-E meet the threshold value of 0.075 in Fig. 1, this parameter is needed to determine the number of documents to be included in the final cluster. In the example, a value of 2 for this parameter would insure that only the D-E cluster be returned; a value of 3 or greater would allow the entire C-D-E cluster to be returned.

(iii) MINIMUM DOCUMENT FREQUENCY

Once the appropriate document clusters have been selected, a thesaurus class is formed by taking the intersection of all the low frequency terms contained in a cluster. Minimum document frequency determines which terms are considered to be "low frequency": a term is eligible for membership in a thesaurus class only if its document frequency does not exceed this parameter. (Minimum document frequency may be set by using the guidelines suggested in [14]; using fk < n/100 as a guideline may be helpful. Alternatively, discrimination value might be used in lieu of minimum document frequency to determine thesaurus class membership, admitting only those terms whose discrimination values fall beneath a specified bound; in this case, an upper bound on discrimination value must be specified. This method is investigated in a later section (see Experiment 3).)

Previous experiments [1] have shown that the following formula is useful in computing thesaurus class weights. Let tc_wt denote the thesaurus class weight, tc_con_wt_i denote the weight of term i in this thesaurus class, |tc_con| represent the number of terms in the thesaurus class, and avg_tc_con_wt represent the average weight of a term in the thesaurus class. Then

    avg_tc_con_wt = ( sum from i = 1 to |tc_con| of tc_con_wt_i ) / |tc_con|

    tc_wt = avg_tc_con_wt / ( |tc_con| * 0.5 )

Thus a thesaurus class weight is computed by dividing the average weight of a term in the class by the number of terms in the class, where that number is down-weighted by 0.5.

Setting the Input Parameters

Some general comments on setting the input parameters may be useful. Threshold value is collection-dependent; appropriate values may be determined by examining the cluster tree. If the threshold value is set too high, very few clusters (and thus very few thesaurus classes) will be generated; if it is set too low, large clusters qualify and the resultant classes may contain too few terms, since the intersection is taken over many documents. The number of documents per cluster has a lesser effect on thesaurus class size; based on previous experiments [1], this parameter is allowed to range from 2 to 5. (Note that this parameter always acts as the secondary criterion; the threshold value is applied first.) Minimum document frequency is likewise collection-dependent; the guidelines suggested in [14] may be used as a starting point. A sketch combining these parameters with the weighting formula follows.
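The sketch below (ours, not the Smart implementation) combines the entry tests (i)-(iii) with the class weight formula. Here `merges` is the output of the complete_link sketch above; a fuller implementation would also retain only the largest qualifying cluster on each branch of the tree, as parameter (ii) suggests.

    def thesaurus_classes(merges, doc_terms, df, threshold,
                          max_docs_per_cluster, min_doc_freq):
        """doc_terms: {doc id: term list}; df: {term: document frequency}."""
        classes = []
        for members, level in merges:
            # (i) the cluster must form at or above the threshold;
            # (ii) it must not contain too many documents
            if level >= threshold and len(members) <= max_docs_per_cluster:
                # (iii) intersect the low frequency terms of the members
                low = [frozenset(t for t in doc_terms[d]
                                 if df[t] <= min_doc_freq)
                       for d in members]
                cls = frozenset.intersection(*low)
                if cls:
                    classes.append(cls)
        return classes

    def class_weight(tc_con_wts):
        """tc_wt = avg_tc_con_wt / (|tc_con| * 0.5), as defined above."""
        avg_tc_con_wt = sum(tc_con_wts) / len(tc_con_wts)
        return avg_tc_con_wt / (len(tc_con_wts) * 0.5)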
THE EXPERIMENTS

Collection Weighting and Similarity Measure

When the global thesaurus has been constructed and both documents and queries have been augmented by the thesaurus classes, the vectors are weighted using the facilities of the Smart system [20]. The weighting scheme is a variation of inverse document frequency weighting ("atc" weighting; see [20] for a description of the Smart weighting routines). The similarity measure used is cosine. The evaluation measure used is the three point average (average precision at the three recall points 0.25, 0.50, and 0.75).
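As an illustration of the retrieval side, the sketch below substitutes a generic tf-idf weight for the exact Smart weighting variant used in the experiments [20] and compares augmented document and query vectors by cosine; the function names and the simplified weighting are our own.

    import math
    from collections import Counter

    def weight_vector(terms, df, n):
        """tf * idf weighting (a stand-in for the Smart variant).
        terms: a document's (or query's) term list, already augmented
        with its thesaurus class identifiers."""
        tf = Counter(terms)
        return {t: tf[t] * math.log(n / df[t]) for t in tf if t in df}

    def cosine(u, v):
        """Cosine correlation between two sparse weighted vectors."""
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm = math.sqrt(sum(w * w for w in u.values())) * \
               math.sqrt(sum(w * w for w in v.values()))
        return dot / norm if norm else 0.0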
Collection Characteristics

The thesaurus construction procedure described in this paper was run on four standard test collections, namely, ADI, Medlars, CISI, and CACM. The characteristics of these collections are presented in Table 1. (For more complete descriptions of these collections, see [18, 21, 22].)

Table 1  Collection Statistics

    Collection   No of       No of     No of   Mean No of Terms   Mean No of Terms
                 Documents   Queries   Terms   per Document       per Query
    ADI              82         35       822        25.5                7.1
    Medlars        1033         30      6927        51.6               10.1
    CISI           1460         76      5019        45.2               22.6
    CACM           3204         64      8503        22.5                8.7

Experiment 1

The algorithm described previously was applied to each of the four test collections. The parameters specified for each collection (the best case) are shown in Table 2. In all cases, the membership of a term in a thesaurus class is based on document frequency rather than discrimination value, and the number of documents per cluster ranges from 2 to 5. The number of runs made to produce each thesaurus was largely dependent on our ability to find the threshold value whose use produced the "best" thesaurus classes. Nor was the optimal value of minimum document frequency easily found; as Table 2 indicates, the guidelines of [14] were useful in only one case.

The column labeled "improvement" refers to the improvement in three point average obtained by using the thesaurus (the best case) over the same collection without the thesaurus (the base case). As Table 2 shows, the use of the thesaurus generates substantial improvement in each case.

Table 2  Improvement Using Thesaurus Generated by the Specified Parameters (Best Case)

    Collection   Threshold       No of       Minimum Document   Improvement (%)
                                 Documents   Frequency
    ADI          0.075               5            25                 10.6
    Medlars      0.120               3            45                 17.1
    CISI         0.058               4            69                  7.7
    CACM         0.137 (0.138)       2            30                  9.6

Experiment 2

The term discrimination value model tells us that each unique term in a collection can be classified by its discrimination value as a good, indifferent, or poor discriminator. Since the indexing is done by augmentation, a thesaurus class represents a new word type in the collection and may itself be so classified. The question that now arises is: is it possible to use discrimination value to distinguish between thesaurus classes, i.e., to distinguish good (or useful) thesaurus classes from poor or indifferent classes? Our objective was to produce a more effective thesaurus by removing thesaurus classes which, although generated by our algorithm, nevertheless were not useful in improving retrieval effectiveness.

For each collection, using the parameters specified above, a thesaurus was generated. The discrimination value of each term in the collection (including, of course, the new thesaurus classes) was calculated via the method of [16]. Table 3 shows the number of thesaurus classes generated for each collection and the number of positive and negative discriminators within each set. The classes which are poor discriminators (i.e., which have negative discrimination values) were then removed, and the resultant (reduced) thesaurus was used to index the documents and queries. The results are shown in Table 4.

Table 3  Number and Type of Thesaurus Classes Generated

    Collection   Number   Number with    Number with
                          Positive DVs   Negative DVs
    ADI             28         25              3
    Medlars        318        178            140
    CISI           298        269             29
    CACM           438        438              0

For Medlars, 140 negative discriminators were removed from the thesaurus, leaving 178 positive discriminators; the resultant thesaurus offers an improvement of 14 percent over the base case but is still less effective than the best case thesaurus. In ADI, 3 of 28 classes were removed from the thesaurus, resulting in a negligible improvement over the base case and a corresponding degradation from the best case (experiment 1). For CISI, 29 thesaurus classes were removed, leaving 269 positive discriminators; the result is a 7.7 percent improvement over the base case and no improvement over the best case. (All of the thesaurus classes generated for CACM in experiment 1 were positive discriminators.)

Table 4  Improvement Using a Thesaurus Containing Only Thesaurus Classes with Positive Discrimination Values

    Collection   Percentage over   Percentage over
                 Base Case         Best Case
    ADI               1.3               -8.4
    Medlars          14.0               -2.7
    CISI              7.7                0.0
    CACM              9.6                0.0

It would appear from Table 4 that the classes which were removed from the thesauri because they had negative discrimination values nevertheless added information which was useful in terms of retrieval. (Since the indexing is done by augmentation, the removal of such classes might have been expected at least to reduce the negative effects of vector expansion to some degree.) We also experimented with modifying the thesaurus by taking the set of thesaurus classes ordered by discrimination value and successively removing the bottom-ranked 5 (10, 15, ...) percent of the classes until all classes with negative discrimination values had been removed. In no case were we able to improve the retrieval effectiveness of a thesaurus over our best case figures.

Experiment 3

In this experiment, we consider the construction of a thesaurus in which thesaurus class membership is based on discrimination value rather than document frequency. Using the same document hierarchies and the same parameter settings (for threshold value and number of documents per cluster) found in Table 2 to identify the clusters, the algorithm is modified by changing the criterion for entry of a term into a thesaurus class: minimum document frequency is replaced by an upper bound on discrimination value. That is, thesaurus classes are formed by the intersection of those terms from each document cluster whose discrimination values are less than the upper bound. (The objective is to select for class membership those terms occurring in each cluster which have near-zero discrimination values.)
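The modification amounts to swapping the eligibility test used in the class formation sketch given earlier. In the sketch below (ours), `dv` is an assumed map from term to discrimination value (computable, e.g., by the method of [16]) and `dv_bound` is the upper bound; everything else is unchanged.

    def thesaurus_classes_by_dv(merges, doc_terms, dv, threshold,
                                max_docs_per_cluster, dv_bound):
        """Experiment-3 variant: a term is eligible for a class if its
        discrimination value lies below the upper bound, rather than if
        its document frequency is at most the minimum document frequency.
        (The paper sweeps the bound over percentiles of the terms with
        positive discrimination values.)"""
        classes = []
        for members, level in merges:
            if level >= threshold and len(members) <= max_docs_per_cluster:
                eligible = [frozenset(t for t in doc_terms[d]
                                      if dv[t] < dv_bound)
                            for d in members]
                cls = frozenset.intersection(*eligible)
                if cls:
                    classes.append(cls)
        return classes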
The bound on discrimination value is applied in increments of 5 percent (5, 10, 15, ..., extending to 100 percent of the terms with discrimination values greater than 0, which of course includes terms with strong positive rather than near-zero values). Initially, we had hoped to construct thesaurus classes whose membership was determined only by discrimination value (i.e., restricted to terms with fairly low values), without again introducing a limit on frequency. It immediately became apparent that this is not workable. When the entry criterion is restricted to terms with very low discrimination values, few classes are formed; as the upper bound is moved up, more terms become eligible to enter one or more thesaurus classes, greater numbers of classes are produced, and the classes themselves become large. (Consider what occurs when this approach is applied to CACM: an upper bound of 50 percent results in 847, rather than the original 438, thesaurus classes being generated.) Moreover, because the correlation between discrimination value and document frequency is not exact, terms with near-zero discrimination values may in fact be high frequency terms, and we find no effective way to handle this problem without again introducing a limit on frequency.

The best results produced by this approach are found in Table 5. Note that the best results generated for ADI and Medlars represent substantially less improvement than the corresponding best case figures of experiment 1. Thus our attempts to produce a thesaurus whose classes are based on the discrimination values of their component terms have not proved successful.

Table 5  Improvement Using a Thesaurus Whose Classes Are Based on Discrimination Value

    Collection   Percentage over   Percentage over
                 Base Case         Best Case
    ADI               8.9               -1.5
    Medlars          13.9               -2.8
    CISI            -16.9              -24.1
    CACM             -6.7              -13.3

Experiment 4

When the thesauri produced in experiment 1 were evaluated, the results (in terms of the three point average) were averaged over all the queries in the collection. But some queries are not affected by the indexing (i.e., are not modified by the addition of thesaurus classes). To gain a more accurate and realistic view of the effect of the thesauri produced by the thesaurus construction algorithm, in experiment 4 we extract from each collection only those queries which are affected by indexing. This query set, the reduced set, is first used to perform a normal search (the base case). The thesaurus construction algorithm (using the parameter settings of experiment 1) is then applied to the collection (documents and reduced query set) and a search is performed. The results are described in Table 6. The last column of Table 6 gives an accurate view of the changes in retrieval effectiveness when a well constructed global thesaurus is applied to a collection.

Table 6  Improvement Using Thesaurus (Best Case) with Reduced Query Set

    Collection   No of Queries   No of Queries   Results Using   Results Using
                 (Base Case)     (Reduced Set)   All Queries     Reduced Set
    ADI               35              33             10.6            11.0
    Medlars*          30              30             17.1            17.1
    CISI              76              63              7.7             8.6
    CACM              52              32              9.6            14.3

*All queries in the Medlars collection are affected by thesaurus class indexing; hence no reduced query set is generated in this case.

Experiment 5

The results above show that the algorithm described in this paper can be used to produce global thesauri which are useful in improving retrieval effectiveness. But this algorithm is strongly dependent on the setting of three parameters, whose appropriate values are collection-dependent and can be very difficult to set properly. A revision of the algorithm which depends on only one parameter may therefore also prove useful. Define a low level cluster as a cluster all of whose children are documents. The algorithm is now revised as follows (a sketch of the revised traversal appears after the list):

1. The document collection is clustered via the complete link clustering algorithm.
2. The resultant hierarchy is traversed and thesaurus classes are generated by identifying each low level cluster and, for each such cluster, constructing a thesaurus class by forming the intersection of all the low frequency terms in the cluster.
3. The documents and queries are indexed (augmented) by the thesaurus classes.
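The sketch below is ours; it assumes the hierarchy is given as nested pairs (a document id at each leaf, a tuple of two children at each internal node), as produced by a binary agglomerative method such as the complete_link sketch above, and it forms one thesaurus class per low level cluster.

    def revised_classes(node, doc_terms, df, min_doc_freq, out=None):
        """Collect one class per low level cluster (an internal node
        both of whose children are documents)."""
        if out is None:
            out = []
        if isinstance(node, tuple):
            left, right = node
            if not isinstance(left, tuple) and not isinstance(right, tuple):
                # low level cluster: intersect the low frequency terms
                # of its two member documents
                low = [frozenset(t for t in doc_terms[d]
                                 if df[t] <= min_doc_freq)
                       for d in (left, right)]
                cls = low[0] & low[1]
                if cls:
                    out.append(cls)
            else:
                revised_classes(left, doc_terms, df, min_doc_freq, out)
                revised_classes(right, doc_terms, df, min_doc_freq, out)
        return out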
The revised algorithm depends only on the setting of one parameter, minimum document frequency. It was applied to the four test collections, using the values of minimum document frequency shown in Table 7 (the values which produced useful results in these cases). As Table 7 shows, this approach produces useful improvements in retrieval effectiveness. The results are substantially less than those produced by the original algorithm in two of the four collections but are comparable in the other two, and they are obtained at very little cost in terms of time and effort, since neither a threshold value nor a number of documents per cluster need be determined.

Table 7  Improvement Using Thesaurus Generated by the Revised Algorithm

    Collection   Minimum Document   Percentage over
                 Frequency          Base Case
    ADI                                  5.9
    Medlars           45                13.6
    CISI              30                 8.2
    CACM              90                 3.6

SUMMARY

An algorithm for automatically constructing a global thesaurus, based on the term discrimination value model and a complete link clustering of the document file, was examined by applying it to four standard test collections. The results indicate that the algorithm can be used to produce useful thesauri, yielding significant improvements in retrieval effectiveness in most cases. An attempt to differentiate between useful and nonuseful thesaurus classes (i.e., between classes which were and were not useful to the retrieval process) on the basis of discrimination value was not successful. An attempt to revise the construction algorithm by basing thesaurus class membership on the discrimination values of the component terms, rather than on minimum document frequency, was likewise unsuccessful. Finally, a revision of the algorithm in which the low level clusters of the document hierarchy (clusters all of whose children are documents) are used along with minimum document frequency to form the thesaurus classes is proposed. Results indicate that this approach may be useful in some cases; for two of the four test collections, the revised algorithm yields results comparable to those of the original algorithm at very little cost in terms of time and effort.

ACKNOWLEDGEMENTS

This work was supported in part by the University of Minnesota Graduate School Grant-in-Aid for Research.
REFERENCES

1. Crouch, C. A cluster-based approach to thesaurus construction. Proceedings of the Eleventh International Conference on Research and Development in Information Retrieval; 1988; Grenoble, France.
2. Crouch, C. An approach to the automatic construction of global thesauri. Information Processing and Management 26(5):629-640; 1990.
3. Salton, G., ed. The SMART retrieval system: experiments in automatic document processing. Englewood Cliffs, NJ: Prentice-Hall; 1971.
4. Cleverdon, C. W.; Mills, J.; Keen, E. M. Factors determining the performance of indexing systems. Vol. 1. Cranfield: Aslib Cranfield Research Project; 1966.
5. Salton, G. Scientific reports on information storage and retrieval (ISR 11). Department of Computer Science, Cornell University; 1966.
6. Salton, G. Scientific reports on information storage and retrieval (ISR 13). Department of Computer Science, Cornell University; 1966.
7. Sparck Jones, K. Automatic keyword classification for information retrieval. London: Butterworths; 1971.
8. Fox, E. Lexical relations: enhancing effectiveness of information retrieval systems. SIGIR Forum 15(3):5-36; Winter 1980.
9. Wang, Y.; Vandendorpe, J.; Evens, M. Relational thesauri in information retrieval. Journal of the American Society for Information Science 36(1):15-27; 1985.
10. Fox, E. Building the CODER lexicon: the Collins English dictionary and its adverb definitions. Tech. Report 86-23, Department of Computer Science, VPI&SU, Blacksburg, VA; 1986.
11. Fox, E.; Nutter, J. T.; Ahlswede, T.; Evens, M.; Markowitz, J. Building a large thesaurus for information retrieval. Proceedings of the Second Conference on Applied Natural Language Processing; 1988; Austin, TX.
12. Chen, H.; Dhar, V. Cognitive process as a basis for intelligent retrieval systems design. Information Processing and Management 27(5):405-432; 1991.
13. Chen, H.; Lynch, K. Automatic construction of networks of concepts characterizing document databases. IEEE Transactions on Systems, Man and Cybernetics; 1992.
14. Salton, G.; Yang, C. S.; Yu, C. T. A theory of term importance in automatic text analysis. Journal of the American Society for Information Science 26(1):33-44; 1975.
15. Salton, G.; McGill, M. Introduction to modern information retrieval. New York: McGraw-Hill; 1983.
16. El-Hamdouchi, A.; Willett, P. An improved algorithm for the calculation of exact term discrimination values. Information Processing and Management 24(1):17-22; 1988.
17. Voorhees, E. The effectiveness and efficiency of agglomerative hierarchic clustering in document retrieval. Ph.D. Thesis, Department of Computer Science, Cornell University; 1985.
18. Voorhees, E. Implementing agglomerative hierarchic clustering algorithms for use in document retrieval. Tech. Report 86-765, Department of Computer Science, Cornell University; 1986.
19. van Rijsbergen, C. J. Information retrieval, 2nd edition. London: Butterworths; 1979.
20. Buckley, C. Implementation of the SMART information retrieval system. Tech. Report 85-686, Department of Computer Science, Cornell University; 1985.
21. Fox, E. Characteristics of two new experimental collections in computer and information science containing textual and bibliographic concepts. Tech. Report 83-561, Department of Computer Science, Cornell University; 1983.
22. Fox, E. Extending the boolean and vector space models of information retrieval with p-norm queries and multiple concept types. Ph.D. Thesis, Department of Computer Science, Cornell University; 1983.