THESAURI IN A FULL-TEXT WORLD

Jessica L. Milstead

ABSTRACT

Despite early claims to the contrary, thesauri continue to find use as access tools for information in the full-text environment. Their mode of use is changing, but this change actually represents an expansion rather than a contraction of their utility. Thesauri and similar vocabulary tools can complement full-text access by aiding users in focusing their searches, by supplementing the linguistic analysis of the text search engine, and even by serving as one of the tools used by the linguistic engine for its analysis. While human indexing continues to be used for many databases, the trend is to increase the use of machine aids for this purpose. All machine-aided indexing (MAI) systems rely on thesauri as the basis for term selection. In the twenty-first century, the balance of effort between human and machine will change at both input and output, but thesauri will continue to play an important role for the foreseeable future.

INTRODUCTION

With the dramatic increase in availability of searchable full text, and the increasing availability of powerful engines for searching the text, it is reasonable to ask if there is any place left for thesauri in this new information retrieval scenario. It is my thesis that there is a place for thesauri, or something like them, but that they must change in order to continue to be of value, and it is hard to predict just what the changes will be.

First, it is important to define what is meant by the word "thesaurus" in this paper. Simple equivalence lists, the kind of "thesaurus" most often supported by text retrieval packages, are much too limited to be considered. Certainly equivalence lists are vital to effective information retrieval, but these are not enough. They can only suggest other ways of expressing an idea which is already in the user's mind; they do not remind the user of related ideas that might be valuable in searching.
A true thesaurus has equivalence relationships, but it also supports other kinds of relationships, such as genus-species, and provides navigation assistance by means of scope notes and other aids. In other words, a thesaurus is a tool designed to aid users in finding their way around a vocabulary database. In addition to its primary use as an authority for the terms used in indexing the database, it offers reminders of terms the user might not even have considered.

The ANSI/NISO standard for thesauri (NISO, 1994) provides the best available information on what thesauri should do and how they should be built, but it predates the explosion of full text and powerful search engines that we have recently seen, and it is not an adequate guide to future needs and potential.

In order to set present-day thesauri in context, it is useful to look briefly at their history. The first thesauri were actually produced before electronic searching was widely available, but their full development coincided with the growth of online bibliographic databases. The unitary terms of a thesaurus provide much greater flexibility in searching than a subject heading list, with its complexities of subdivision and inversion. Consider a subject heading:

    Automobile engines--Manufacturing

Now consider one of the ways in which this complex concept might be indexed with a thesaurus:

    Automobiles
    Engines
    Manufacturing

The specifics depend on the design of the particular thesaurus and, in particular, the extent to which it precoordinates the elements of a complex concept. Regardless, the thesaurus indexing offers far greater searching flexibility, though with a possible penalty in false retrievals.

Whether based on an ANSI/NISO standard thesaurus or not, most databases today are indexed with thesaurus-type terms. The exceptions are some of the databases designed primarily for schools and public libraries, which use more complex terms, typically based on Library of Congress subject headings.
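The trade-off described above between a precoordinated heading and unitary terms can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three documents, their identifiers, and the `search` helper are hypothetical and do not come from any real database. The sketch shows why unitary terms give more searching flexibility, and how the same flexibility can produce a false retrieval.

```python
# Hypothetical collection: each document carries unitary thesaurus terms.
# doc3 is imagined to be about a plant that manufactures automobile bodies
# and also services aircraft engines, so it is NOT about making car engines.
INDEX = {
    "doc1": {"Automobiles", "Engines", "Manufacturing"},
    "doc2": {"Automobiles", "Engines", "Maintenance"},
    "doc3": {"Automobiles", "Manufacturing", "Aircraft", "Engines"},
}

# The same collection under a precoordinated subject heading (also hypothetical).
HEADINGS = {
    "doc1": {"Automobile engines--Manufacturing"},
    "doc2": {"Automobile engines--Maintenance and repair"},
    "doc3": {"Automobile industry", "Aircraft engines--Maintenance and repair"},
}


def search(index, *terms):
    """Boolean AND: return documents indexed with every requested term."""
    wanted = set(terms)
    return sorted(doc for doc, assigned in index.items() if wanted <= assigned)


# Unitary terms can be combined freely at search time.
print(search(INDEX, "Engines"))                                  # any engine document
print(search(INDEX, "Engines", "Maintenance"))                   # a different facet
print(search(INDEX, "Automobiles", "Engines", "Manufacturing"))  # includes doc3, a false retrieval

# The precoordinated heading matches only as a whole string.
print(search(HEADINGS, "Automobile engines--Manufacturing"))
```

The last unitary-term search retrieves doc3 even though it is not about manufacturing automobile engines, which is the "possible penalty in false retrievals" mentioned above; the heading search avoids it but cannot be recombined into other facets.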
The earliest electronic files consisted only of titles, bibliographic descriptions, and indexing; if you were lucky there were abstracts, but this was by no means to be taken for granted in the days when storage space was a very precious commodity and acquiring anything in electronic form generally meant rekeying it. In this environment, indexing had to be of high quality if information was to be retrieved at all, hence the obvious need for thesauri.

Today abstracts are practically universal, and it is beginning to seem as if all information is available in full text. However, this is not true, nor will it be true in the immediate future. (Retrieval of graphic images is not considered here, because image searching still relies so heavily on text captions or descriptions.) Vast numbers of legacy documents remain, and converting these to searchable text is an expensive long-term proposition. Furthermore, many documents are still being produced in printed form only. Therefore, thesauri and indexing will continue to have a place, at least for a while, in facilitating access to documents for which electronic text is not available. Their long-run value, however, depends on integration with full-text search.

THESAURI AND SEARCH ENGINES

Thesauri actually have a place at both ends of the information access process, i.e., at storage and at eventual retrieval. The universe of electronically accessible full text is so immense, and is growing so fast, that users need all the help they can get in accessing it.

The explosive growth of Web search engines, with their rather primitive algorithms, has had some rather unfortunate effects, to my mind. Some of these engines appear to have been developed by people who saw a need, but who had not the vaguest idea that there was already a history of development of tools to fulfill similar needs. There is little evidence that these developers had ever used either Dialog or a library catalog.
Not long ago, in a meeting of a national information society, a speaker gave an example of natural language retrieval of 92 citations from his database on the effect of alcohol on heart disease. A representative of a Web search engine countered with a report of carrying out a search using his engine on the Web and retrieving over 600,000 items. This speaker actually saw this 600,000 as better than 92. True, the 600,000 items were ranked (but so were the 92), but the speaker did not go on to show the relevance of the top-ranked items to the query, or how many good items might actually have ranked so low that the user would never have looked at them. In fact, the audience was told nothing at all about how these 600,000 citations were presented to the user. It was almost as if the number itself were intoxicating to the speaker.

A distinction should be made among kinds of tools for facilitating access to full text on the basis of the attention they give to semantics. Older, exact-match (Boolean) systems give no attention to semantics. Furthermore, they retrieve purely on the basis of the occurrence of the search word or phrase in the document. This means that search terms must appear in the text for the document to be retrieved; if a term appears in the text at all, the document will be retrieved regardless of whether the term is important to the meaning of the document or not.

Another approach relies on statistical information: co-occurrence of words in the document, frequency, etc. Natural language parsing may be included as well, but there is no concern with the meaning of the words. The fact that two words co-occur in a document means only that; it does not imply that there is any relationship between their meanings. Boolean and statistically based systems have been found to have comparable retrieval performance, but to produce very different retrieval sets.
That is, searches of the same database using a Boolean engine and a statistically based one often produce about the same number of relevant hits, but there may be little overlap between the two sets of hits.

Intelligent retrieval systems integrate statistical and semantic information, as well as a full battery of linguistic techniques, to retrieve more useful results. Such a system may contain an extensive lexicon, not just of word meanings and equivalents but of word types and relationships. Text is parsed, to a greater or lesser extent depending on the system, and there are often tools for disambiguation of terms. Phrases rather than just single words can also be handled. The most powerful systems actually can determine syntactic or structural meaning, permitting them to retrieve a concept expressed in different words that are not actually in the lexicon. One of these systems is DR-LINK, discussed elsewhere in this volume.

Any of these types of system can produce better results if controlled-vocabulary indexing is present. The index terms can be weighted more heavily than the running text in either statistical or intelligent systems, so that documents which human (or automatic) analysis has already judged relevant to the query rank more highly. In a Boolean system, the chances of retrieving relevant documents that do not happen to contain the words of the search query are improved, though precision is not helped unless the search is specifically limited to controlled vocabulary terms.

Searchers consistently state that they need indexed, searchable full text (Pritchard-Schoch, 1993). For some kinds of queries, statistical techniques applied to the full text have been satisfactory, while others just cannot be answered satisfactorily without indexing. In general, searchers have not had much access to intelligent systems; when they do, it seems most likely that the presence of indexing will continue to improve the retrieval.

USE OF THESAURI IN SEARCHING

In the traditional scenario, an indexer uses the thesaurus to select index terms for inclusion in the document record. Then the searcher, ideally referring to the same thesaurus, selects terms which seem likely to produce relevant results and searches the indexing, retrieving on the basis of exact match. Even if the searcher has not referred to the thesaurus, she or he is aided by the indexing because, if the query words appear in the indexing, then all documents indexed with those words will be retrieved, whether the words happen to appear in the text or not.

The basic design of thesauri to date, then, has been as indexing aids, with the expectation that searchers would be able to use these aids as a guide to searching. The notation used in term relationships is abstruse; the fact that "BT" and "NT" mean that two terms are related hierarchically is obvious only to specialists. Furthermore, database producers frequently do not mount their thesauri on search systems. And if the thesaurus is mounted, the search system may not support the full range of navigational information. In other words, the thesaurus is an indexing aid which we hope can also be used for searching, but we frequently haven't put much effort into making this use possible, let alone easy.

It is easy to find evidence in the literature that thesauri are underused by searchers; this is probably due at least partly to the fact that the thesaurus for a database is unlikely to be readily available to searchers. Even without significant changes in the nature of the thesaurus itself, provision of a tool such as the IODyne thesaurus navigator, described in another paper in this volume, should increase searcher use substantially. Permitting the searcher to switch seamlessly between navigating the thesaurus and searching the database can only improve access.
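One small step toward a searcher-friendly display is simply to spell out the abstruse relationship codes. The Python sketch below is a minimal illustration, not a description of the IODyne navigator or any other real tool: the `Entry` class, the `render` function, the label wording, and the sample terms other than truck/motor vehicle are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Plain-language labels for the conventional codes (wording is an assumption).
LABELS = {
    "BT": "Broader term",
    "NT": "Narrower term",
    "RT": "Related term",
    "UF": "Used for",
    "USE": "Use instead",
}


@dataclass
class Entry:
    """One thesaurus entry: a term, an optional scope note, and its relations."""
    term: str
    scope_note: str = ""
    relations: Dict[str, List[str]] = field(default_factory=dict)


def render(entry: Entry) -> str:
    """Format an entry for searchers, replacing codes with readable labels."""
    lines = [entry.term]
    if entry.scope_note:
        lines.append(f"  Note: {entry.scope_note}")
    for code, label in LABELS.items():
        for target in entry.relations.get(code, []):
            lines.append(f"  {label}: {target}")
    return "\n".join(lines)


# Hypothetical entry for illustration.
trucks = Entry(
    term="Trucks",
    relations={"BT": ["Motor vehicles"], "NT": ["Pickup trucks"], "UF": ["Lorries"]},
)
print(render(trucks))
```

The underlying data keep the standard codes, so the same thesaurus file can still serve indexers; only the display layer changes for end users.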
An obvious way in which a thesaurus can be applied directly in retrieval is to use the relationships as a means of expanding the search. Research, however, has shown that these relationships must be used with caution. In general, expanding a search to include the narrower terms tends to improve recall without great sacrifice in precision. Expanding to include broader or related terms, while it does improve recall, typically has a significant negative impact on precision.

In pragmatic terms, the purpose of distinguishing hierarchical relationships is to indicate to users everything that is a "kind" or "part" of the broader term, in order to facilitate making searches more inclusive. However, it is not known for certain that this is what users need or want, or to what extent they need it. We do not know how far up or down hierarchies it would be useful to go in expansion of searches; if a hierarchy is nine levels deep, would a user starting at the top really want to broaden the search all the way down, or would stopping at an intermediate level be preferable? On the other hand, limitation to a single level of expansion is probably not adequate.

MAKING THESAURI MORE ACCESSIBLE TO SEARCHERS

Over the years there have been proposals for end-user thesauri designed specifically to facilitate searching. Bates (1986) and Anderson and Rowley (1992) have both made interesting proposals for development of such thesauri. The end-user thesaurus differs from a conventional thesaurus in two primary ways: its term inclusion and organization, and its displays. It is designed to reflect and organize the total specialized vocabulary of users in a field rather than to provide a limited list of authorized terms. It provides more information about the scope of terms, and its displays are designed around the way in which users approach information. For instance, one design of Bates's used term clusters as a device to aid users in enriching their searches.
These clusters were like the sublanguages of different specialist groups. Anderson proposed collecting words and phrases from full text and organizing them to build the end-user thesaurus.

The idea of end-user thesauri has not been widely accepted, probably for a number of reasons. Conventional thesauri are costly to develop and maintain; the additional access in an end-user thesaurus would be even more costly. Simultaneously, until recently there seems not to have been a real understanding on the part of system designers that simply making full text available, even with a powerful search engine, is not adequate. The more full text there is, the more help users need in navigating it. At the same time, users have certainly not been demanding richer thesauri, though I am aware of more than one instance where a major database producer was motivated by user demand to develop a conventional thesaurus. If end users, particularly the more sophisticated ones, were aware of the aid that better semantic tools could provide, they would demand them. Unfortunately, people generally do not miss what they have never had.

CHANGES IN INDEXING

Meanwhile, indexing is changing in a way that makes even greater demands on the thesaurus. As stated above, the traditional scenario is one in which the indexer consults a thesaurus as a source for terms to use in indexing. This work is repetitive, labor-intensive, and inherently inconsistent. It places heavy demands on the indexer, who must remember all indexing rules and policies; when indexers must work outside their specialized area, they are handicapped by their suboptimal knowledge of the thesaurus in the new area. Few organizations to date have found a way to provide more than clerical aids to indexers.

For many years there have been a few systems using machine-aided indexing (MAI).
In these older MAI systems, the text of titles and abstracts is run against a rule base; when a rule is matched, the applicable thesaurus term is assigned to the document. The indexer reviews these candidate index terms, adding and deleting as appropriate. While their users have found that the systems increase indexer productivity significantly, there has been no great move to MAI by other database producers in the twenty or more years that these systems have been in use.

This lack of growth in use is probably due to the immense up-front cost of developing a rule-based MAI system. First, the system depends on availability of a well-developed thesaurus. Then it is necessary to develop rules for matching sequences of characters in text to produce indexing with a high degree of reliability. This rule base must continually be refined and updated if it is to remain useful. While no published data exist, the rule base probably costs at least as much to develop and maintain as the thesaurus itself. Within the past few years, one MAI shell system has become commercially available (Hlava & Hainebach, 1996), but it is still necessary to develop the actual rule base. As an aside, it is worth noting that this system is actually multilingual, an aspect of indexing which may be expected to increase in importance in the future.

The availability of powerful text analysis software is changing this scenario dramatically. The same analysis used to provide good relevance-ranked search results can be used to suggest candidate terms for indexing without manual development of a rule base. Instead, a substantial number of already-indexed documents is used to train the text analysis software, which then assigns candidate index terms to the documents for indexer review. Without human review, of course, the same scenario produces automatic indexing.

MAI assumes a developed thesaurus, and ongoing maintenance and refinement of the term assignment criteria.
It shifts much of the analysis effort away from review of individual documents to maintenance of the vocabulary and retraining of the system. Indexer productivity can be increased significantly; it is known to have increased in the older rule-based machine-aided indexing systems. The shift from rule base development to more automatic training of the system will also make the process of MAI system development and maintenance less labor-intensive.

In fact, none of the changes described have reduced the need for a thesaurus; if anything, they have increased the demands made on these tools and, as a result, are bringing more of their limitations to light.

PROBLEMS OF THESAURUS DESIGN

There are fundamental problems in the basic design of thesauri that make them less than optimally useful for more powerful retrieval scenarios. There is no reason to expect that a tool designed for Boolean search on index terms will be optimized when full text is searched by a powerful engine. Unfortunately, the ways in which thesauri could be redesigned to be more useful are not immediately obvious.

The number of kinds of relationships in the present design is limited, and yet even this specification of types is probably only of marginal direct value to users. As indicated earlier, users do not necessarily recognize that "BT" and "NT" mean a relationship is hierarchical, and "Use" and "UF" mean the terms are equivalent, while "RT" means the relationship is something else, that something being unspecified.

For a thesaurus developer, even deciding when a relationship is hierarchical or part/whole can be difficult. The determination is fairly easy when concrete objects (e.g., truck/motor vehicle) are the issue.
However, in a world where the same thing may be a "particle" (i.e., concrete) or a "wave" (not concrete), depending on how the observer happens to be looking at the thing at the moment, deciding whether something is a "thing" or a "process" may not only be difficult, it is likely to be futile. As an example, recently I encountered in building a thesaurus the problem of how to relate "Codons," the basic units of genetic information, and "Codon usage." This certainly sounded like a clear case of thing/process, i.e., RT, but it turned out that "Codon usage" was used in the field not for a process, but for studies of the types of codons being used. Thesaurus practice does not offer a good way to distinguish dictionary meanings from actual use of terms in the literature.

If the distinction between hierarchical and other relationships is that porous in fact, of how much value is it to users? The distinction is probably of significant value when it is clearly "things" or fairly concrete entities that are being related, but of much less value when the entities being related are less concrete. Yet, even for very abstract entities (processes and the like) we find ourselves wanting to say "these terms are very closely related, while these others, though less related, might still be useful for you." Obviously, weighting is involved here, but there is no way to build weighting into a standard thesaurus.

At the same time, text analysis software theoretically can make use of much richer semantic analysis, not only of the relationships between terms, but of the kind of term, e.g., a process, thing, or property. Historically, this kind of analysis has been even more labor-intensive than that required to develop the relationships in a standard thesaurus. For instance, efforts such as the Cyc project have involved manual development of a knowledge base that would permit automatic analysis.
On the near horizon, though, are systems which will automate development of abstractions such as relationships among concepts.

Equivalence relationships grow out of the print paradigm, where everything had to be entered in a single place, i.e., it was not feasible to place a copy of the record under all equivalents of each term. If terms are truly equivalent, perhaps we should treat them as an "equivalence cluster," so that including one of the terms in a query retrieves them all, either automatically or at the user's option.

Displaying the relationships of a thesaurus in print has always involved compromises. For instance, the typical alphabetical display can only show a single level of upward and downward hierarchical relationships. Thesauri which include the full hierarchy of terms in the alphabetical display become much more voluminous. If the full hierarchical display is relegated to a separate listing, it can be difficult in the alphabetical display to show where to enter the hierarchical listing to see the full hierarchy of the term. While electronic display of a thesaurus can ameliorate some of the limitations of the print display, making it possible, for instance, to switch back and forth between alphabetical and hierarchical display, the limitations of the screen are substituted for the limitations of the printed page.

The screen display does offer possibilities of flexibility and customization that simply are not possible in print, and it is to be hoped that the IODyne browser will turn out to be only the first of a new generation of tools which supports end-user thesaurus access in a friendly and powerful way. More and richer connections between thesaurus and text may be expected as the thesaurus becomes a resource for detecting relationships and refining searches.
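The "equivalence cluster" idea raised in this section lends itself to a short illustration. The Python sketch below assumes a hand-made list of clusters and a hypothetical `expand_equivalents` function; it is one possible reading of the proposal, with the `enabled` flag standing in for "automatically or at the user's option."

```python
# Hypothetical equivalence clusters; every member is treated as an equal
# entry point rather than as a "Use" reference to one preferred term.
CLUSTERS = [
    {"automobiles", "cars", "motor cars"},
    {"trucks", "lorries"},
]


def expand_equivalents(query_terms, clusters=CLUSTERS, enabled=True):
    """Return the query terms plus, if enabled, all members of their clusters."""
    expanded = set()
    for term in query_terms:
        key = term.lower()
        expanded.add(key)
        if enabled:
            for cluster in clusters:
                if key in cluster:
                    expanded |= cluster
    return expanded


print(sorted(expand_equivalents(["Cars"])))                 # whole cluster retrieved
print(sorted(expand_equivalents(["Cars"], enabled=False)))  # user opted out
print(sorted(expand_equivalents(["engines"])))              # no cluster, term kept as-is
```

Because no member is singled out as preferred, the searcher never sees a "Use" reference; any of the equivalents works as a starting point.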
FUTURE OF THESAURI

These tools, originally designed to facilitate consistent analysis of documents at input to an information retrieval system, are already well on their way to becoming vital retrieval tools as well. In fact, I anticipate that, in the near future, thesauri will be used more at retrieval than at input. They may work behind the scenes much of the time. While users should certainly have access to any available vocabulary aids if they want them, we need to design our interfaces so that users need not interact directly with the thesaurus to any greater extent than they wish or need to.

Given all the problems and limitations indicated, how is it possible to remain positive about the need for continued use of thesauri? There are two fundamental reasons, one philosophical and one pragmatic:

• Philosophically, just as thesauri built on subject heading lists, providing more structured relationships and terms better fitted to the current searching environment, thesauri can be built on to develop vocabulary tools that meet the needs of users in the search environment of the near future.

• Pragmatically, there is increasing evidence of a realization on the part of text analysis system developers of the need to include a semantic component in their software. Whether this semantic component is a formal ANSI/NISO standard thesaurus is not as important as the fact that a rich semantic tool, not just an equivalence list, is embedded in the system.

A thesaurus can become the basis of a more extensive semantic network, providing information, not just on what terms are used in indexing, but on how they are used within the system. Most often a semantic network includes richer relationships than a thesaurus, but there is no reason not to build the less sophisticated system, using it as a resource when it becomes feasible to develop the more powerful system.
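As a rough sketch of how a conventional thesaurus might seed such a network, the Python example below converts BT/NT/RT statements into typed, weighted links and then follows narrower-term links to a chosen depth, the kind of expansion that the discussion of search expansion above reports as least harmful to precision. Everything here is an assumption made for illustration: the sample terms other than truck/motor vehicle, the starting weights, and the function names are hypothetical, and a real semantic network would carry far richer relationship types.

```python
from collections import defaultdict

# (source, relation, target) statements as a conventional thesaurus holds them.
THESAURUS = [
    ("Motor vehicles", "NT", "Trucks"),
    ("Motor vehicles", "NT", "Automobiles"),
    ("Trucks", "NT", "Pickup trucks"),
    ("Trucks", "RT", "Freight transport"),
]

# Starting weight per relation type. These numbers are placeholders that a
# later, richer system could refine; a standard thesaurus has no weights.
DEFAULT_WEIGHT = {"NT": 0.8, "BT": 0.5, "RT": 0.3}
INVERSE = {"NT": "BT", "BT": "NT", "RT": "RT"}


def build_network(triples):
    """Turn thesaurus statements into typed, weighted links in both directions."""
    network = defaultdict(list)
    for source, relation, target in triples:
        network[source].append((relation, target, DEFAULT_WEIGHT[relation]))
        inverse = INVERSE[relation]
        network[target].append((inverse, source, DEFAULT_WEIGHT[inverse]))
    return network


def narrower(network, term, max_depth):
    """Collect narrower terms, walking at most max_depth levels down."""
    found, frontier = [], [term]
    for _ in range(max_depth):
        next_frontier = []
        for current in frontier:
            for relation, target, _weight in network.get(current, []):
                if relation == "NT" and target not in found:
                    found.append(target)
                    next_frontier.append(target)
        frontier = next_frontier
    return found


net = build_network(THESAURUS)
print(narrower(net, "Motor vehicles", max_depth=1))
print(narrower(net, "Motor vehicles", max_depth=2))
```

The `max_depth` parameter makes the open question of how far down a hierarchy to expand an explicit, adjustable choice rather than a fixed property of the tool.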
On the retrieval side, the advent of intelligent information retrieval systems, like those discussed in these proceedings, changes the picture of indexing and therefore of thesauri. The question then arises: Which kinds of retrieval can best be left to the intelligent system, and which will be facilitated by indexing, and therefore by use of a thesaurus? This is a very important question, but I know of no attempts to answer it. The concentration has been on refining intelligent retrieval systems and demonstrating their value. Yet, if we knew which kinds of information or queries could be well served by a text retrieval system without human input or refinement, we would be free to improve productivity by concentrating human effort on the types of retrieval needs where it could really add value.

Thesauri and intelligent retrieval systems can be complementary in another way: The thesaurus shows a variety of relationships among terms; these relationships can be used by the system to supplement its statistical and linguistic analyses. Conversely, by flagging phrases which do not match any of its existing criteria, the intelligent retrieval system can assist in thesaurus updating.

SUMMARY

Thesauri were developed to meet the needs of a different kind of retrieval system than the full-text systems which are available today. However, the basic concept of the thesaurus remains useful; the problems encountered have more to do with the implementation than with the concept itself. More work is needed to assure that thesauri built in the future are optimally suited to the needs of full-text systems.

ACKNOWLEDGMENT

Susan Feldman reviewed a draft of this paper and provided extremely useful advice. Any errors of fact or interpretation, however, remain the responsibility of the author.

REFERENCES

Anderson, J. D., & Rowley, F. A. (1992). Building end-user thesauri from full-text. In B. H. Kwasnik & R. Fidel (Eds.), Advances in classification research (Proceedings of the second ASIS SIG/CR Classification Research Workshop, October 27, 1991) (vol. 2, pp. 1-13). Medford, NJ: Learned Information.

Bates, M. J. (1986). Subject access in online catalogs: A design model. Journal of the American Society for Information Science, 37(6), 357-376.

Hlava, M. M. K., & Hainebach, R. (1996). Machine aided indexing: European Parliament study and results. In 17th National Online Meeting. Proceedings (pp. 137-158). Medford, NJ: Information Today.

National Information Standards Organization. (1994). Guidelines for the construction, format, and management of monolingual thesauri. Bethesda, MD: NISO Press (ANSI/NISO Z39.19-1993).

Pritchard-Schoch, T. (1993). Natural language comes of age. Online, 17(3), 33-43.