Comparing Words, Stems, and Roots as Index Terms in an Arabic Information Retrieval System

Ibrahim A. Al-Kharashi
King Abdulaziz City for Science and Technology, General Directorate for Information Services, P.O. Box 6086, Riyadh 11442, Saudi Arabia

Martha W. Evens
Department of Computer Science, Illinois Institute of Technology, 10 West 31st Street, Chicago, IL 60616

The Micro-AIRS System, a microcomputer system for Arabic Information Retrieval, was designed as an experimental system to investigate indexing and retrieval processes for Arabic bibliographic data. A series of experiments was performed using 29 queries against a base of 355 Arabic bibliographic records, covering computer and information science, from the bibliographic databank at King Abdulaziz City for Science and Technology. These experiments revealed that using roots and using stems as index terms gives better retrieval results than using words. The root performs as well as or better than the stem at low recall levels and definitely better at high recall levels. Several different binary similarity coefficients were tried: the cosine, Dice, and Jaccard coefficients. All three led to exactly the same document rankings for every query. The experiments were run on an IBM/AT-compatible microcomputer. Micro-AIRS is written in Turbo C, Version 2.0.

Received May 26, 1992; revised February 9, 1994; accepted February 9, 1994. © 1994 John Wiley & Sons, Inc.

Introduction

The Problem

Techniques for storing, maintaining, and retrieving from English bibliographic databases have been studied, implemented, and tested for the last three decades, but we do not know how well these techniques will work on Arabic data. Experimentation with retrieval systems in Arabic language environments has been very limited. Arabization of available information retrieval systems has dealt mostly with internal representation of the Arabic data and translation of menus and system messages
to Arabic. The problems of working with the Arabic language have not been confronted directly.

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE. 45(8):548-560, 1994. CCC 0002-8231/94/080548-13

In principle, there are two approaches to developing an Arabized computer application. The first approach is to develop the application from scratch, bearing in mind the characteristics of the Arabic language. The second approach is to build an I/O interface to existing application software built for non-Arabic languages. The first approach is costly and time consuming; the second approach is easy to implement, at the price of abandoning some Arabic language characteristics. The second approach has been adopted to Arabize two well-known retrieval software packages, STAIRS (Salton & McGill, 1983) and ISIS (UNESCO, 1989). The Arabization effort, however, is limited to the internal representation of the text and the translation of the menus and messages to Arabic (Al-Gasimi, 1987).

The aim of our work is to study the problems and difficulties of applying indexing and retrieval algorithms to Arabic data. In particular, we explore the problems of storing and displaying bilingual bibliographic data, selection of index terms, ranking of Arabic records, and stemming algorithms for Arabic index terms. Special effort will be devoted to the study of the effect of stemming algorithms on the performance of the information retrieval system.

Stemming in information retrieval systems designed for use with English text is usually confined to suffix removal. The motive for the use of stemming is obvious: term stemming can increase the number of retrieved documents, since the stem of a term represents a broader notion than the original term itself. Several stemming algorithms have been used in experimental environments (Lovins, 1968; Porter, 1980; Salton, 1971). Experiments using word stems as indexing terms show differing results.
While the implementation of suffix removal algorithms in the SMART system (Salton, 1971) shows improvement in retrieval effectiveness, Harman's (1987) experiments show less improvement and sometimes even degradation. Further research (Harman, 1991) suggests that in an online system, stemming should be applied differentially (to some queries but not others) under user control, depending on results obtained for particular queries.

The basic goal of our research is to try to find the best way to solve this problem for documents in Arabic. The morphological structure of the Arabic language makes the stemming problem much more complex. We compare three alternative choices for index terms: the word itself, the stem, and the root, with the goal of finding out which of these three alternatives gives the best results. We have also examined alternative choices of a similarity coefficient, comparing the effects of using the familiar cosine measure and the Dice and Jaccard coefficients.

Background

King Abdulaziz City for Science and Technology, KACST, was established in Saudi Arabia in 1977 as a research and development institution. KACST is responsible for the formulation of national science and technology policies and for the coordination and promotion of applied scientific research. It sponsors and supports research activities across a broad spectrum of scientific and technological fields. KACST also provides a wide variety of information support services through the General Directorate of Information Systems, GDIS. Such services include access to national and international databases, maintenance of a specialized library and a national database, and operation of a computer network connecting the computers of major research institutions in the Gulf States. The national database holds over 70,000 bibliographic records covering a wide range of science and technology.
The collection includes master's and doctoral theses, technical reports, books, articles, measurements and standards, statistics, and proceedings of conferences and scientific seminars. The collection has an online catalogue. This catalogue is divided into two databases: an Arabic database, which contains about 23,800 records, and a non-Arabic database. Sample Arabic and English database records are shown in Figures 1 and 2, respectively. Each record in the database is composed of 36 fields. The Arabic records are classified; that is, a subject area for the document is given in the record. However, due to the short supply of Arabic indexers and abstracters, only a few document records contain abstracts or index terms.

Plan of Research

To achieve our goals, we built a microcomputer-based Arabic Information Retrieval System, Micro-AIRS, targeted for the IBM/PC and compatible microcomputers. The system was implemented using the Turbo C compiler, Version 2.0. A few routines, however, were coded in assembly language.

Processing the Arabic Language

Special characteristics of the Arabic language make it difficult to deal with, especially when using a system designed for Roman characters (Tayli & Al-Salamah, 1990). Among these characteristics are the right-to-left orientation, the fact that vowels may be included or dropped, and the morphological structure.

The Arabic language belongs to the Semitic language group. These languages have a common grammatical system based on a root-and-pattern structure. Most Arabic words are morphologically derived from a short list of productive roots. The root is the bare verb form; it can be triliteral, quadriliteral, or pentaliteral. According to Hegazi and Elsharkawi (1985) there are about 1200 roots. A stem is a combination of a root and derivational morphemes to which one or more affixes can be added. A triliteral bare verb generates 14 verb forms, whereas a quadriliteral bare verb generates three verb forms.
Arabic words are classified into three main categories: nouns, verbs, and particles. All verbs and many nouns are derived from root verbs. Some of the root letters may be deleted or modified during morphological derivation. Also, a word may change its inflectional form when preceded by certain prefixes or prepositions or followed by certain suffixes. Some nouns, known as "solid nouns," have no verb origins. Particles can be found in the form of prefixes and/or suffixes attached to verbs or nouns. Some particles can be found in isolated form. Particles include preposition particles, negative particles, answer particles, interrogative particles, conjunction particles, and so forth.

Affixes can be added to the beginning, the end, and the middle of a word. Affixes fall into four categories: particles, pronouns, inflectional morphemes, and derivational morphemes. It is very common to find a verb, subject, and object contained in a single word. Yahya (1989) counted 120 different forms of nouns resulting from adding affixes to the basic naked noun, and 1440 different forms of verbs resulting from adding affixes to the basic naked verb.

For the purposes of our experiment we used a word-stem-root dictionary developed by hand for each index term. Now that the system is being enhanced for actual use at KACST, we plan to add automatic morphological analysis.

Several morphological analysis algorithms have been suggested and/or implemented. Hegazi and Elsharkawi (1985) describe a computer-aided morphological hierarchy system for a vowelized Arabic text. They based their work on both the morphological rules and the phonetic rules of the language. The main disadvantage of this method is that the phonetic analysis requires a fully vowelized text, which rarely appears in today's applications.
FIG. 2. Sample English database record.

INTRNL CNTL NO    8903003 114
CATEGORY          COMPUTING AND CONTROL ENGINEERING
DOCUMENT TYPE     CONFERENCE PROCEEDING
TITLE             Computer virus prevention and containment on mainframes
AUTHORS           Dowry, Ghannam M. Al-
AFFILIATION       Industrial Security Planning and Support Services Department, ARAMCO, Dhahran, SA
SOURCE TITLE      Proceedings of the 11th National Computer Conference, Dhahran, March 4-7, 1989: Computers and Productivity
VOLUME            I
PAGINATION        48-60
NO. OF REFER      73
PUBLICATN DATE    1989/01/01
PUBLISHER INF.    King Fahd University of Petroleum and Minerals, Dhahran, SA
TEXT LANGUAGE     ENGLISH
ABSTRACT          The nature and anatomy of the computer virus is outlined. Basic prevention, detection and correction techniques for reducing the damages caused by viruses are presented. Vaccines or filters, encryption, access control software, test to production control procedures, personnel selection and review control and physical access control methods are detailed with examples. The paper presents measures to be adopted by the industry to make the computer systems less inviting to attacks from viruses.
DESCRIPTORS       Computer software; Computer viruses; Computer security; Mainframe computers; Data processing
STORAGE MEDIA     PAPER COPY
AVAILABILITY      KACST. Source

Gheith and Aboul-Ela (1989) present a computer-based syntax analyzer which is based on a morphological analyzer that separates the linguistic model from the processing algorithm. In another study, Gheith and El-Sadany (1987) describe a morphological analyzer that can detect the root and the morphological structure of a given vowelized Arabic word with a triliteral root. Al-Fedaghi and Anzi (1989) present a simple but slow mathematical method to generate the root and the pattern of a given Arabic word. Hilal (1985) gives a more comprehensive theoretical approach, while Thalouth and Al-Dannan (1987) give a more practical approach to the analysis of an unvowelized Arabic text.
The principal phase in all these algorithms is the isolation of any suffixes and/or prefixes from the word before proceeding to deeper analyses.

Representation of the Arabic Language

The representation of the Arabic language has been a major concern for the designers of Arabic systems. The representation involves the internal representation of the stored data as well as the external representation, which is used in displaying text on the screen or the printer.

The General Assembly of the Arabic Standardization and Metrology Organization has approved many standards for Arabic text representation. The seven-bit coded Arabic character set for information interchange, ASMO-449, was adopted in October 1982 to represent Arabic characters along with some graphical and control characters. Although this code was intended for pure Arabic language applications only, some applications use it to handle bilingual text by using some special characters to indicate that the text is changing from one language to another. For bilingual applications the organization adopted an eight-bit coded Arabic-Roman character set for information interchange known as ASMO-708.

Both ASMO-449 and ASMO-708 include 32 Arabic alphabetic characters. The set of displayable Arabic shapes, however, is much larger than the set of coded Arabic characters. This is because an Arabic character changes its shape depending on whether it is at the beginning, the middle, or the end of the word, or isolated. The majority of the Arabic characters have two distinguishable shapes, and a few characters have one, three, or four distinguishable shapes. To determine the correct shape of a given Arabic character, a contextual analysis algorithm is needed. Previous work has provided a fast and efficient algorithm (Al-Kharashi, 1989; 1990b), which was implemented in Micro-AIRS as a basic function used by the I/O interface.
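As a rough illustration of the kind of contextual analysis described above, the following Python sketch picks a positional shape for each letter of a word from its neighbours. It is not the Al-Kharashi (1989; 1990b) algorithm, whose details are not given here. The joining classes, the function names, the four-way shape labels, and the use of Unicode code points rather than ASMO codes are all assumptions made for the example.

```python
# Simplified contextual shape selection for Arabic letters (illustrative only).
# Assumption: letters are Unicode code points, not ASMO-449/708 codes.

# Hypothetical, partial table of letters that connect only to the preceding
# letter: alef forms, waw forms, teh marbuta, dal, thal, reh, zain.
RIGHT_JOINING = set(
    "\u0622\u0623\u0624\u0625\u0627\u0629\u062F\u0630\u0631\u0632\u0648"
)
NON_JOINING = {"\u0621"}  # hamza never connects


def joining_type(ch):
    """Return 'D' (joins on both sides), 'R' (joins the preceding letter only),
    'U' (never joins), or None for anything that is not an Arabic letter."""
    if ch in NON_JOINING:
        return "U"
    if ch in RIGHT_JOINING:
        return "R"
    if "\u0621" <= ch <= "\u064A" and ch != "\u0640":
        return "D"
    return None


def contextual_shapes(word):
    """Label each character as isolated, initial, medial, or final."""
    shapes = []
    for i, ch in enumerate(word):
        cur = joining_type(ch)
        prev = joining_type(word[i - 1]) if i > 0 else None
        nxt = joining_type(word[i + 1]) if i + 1 < len(word) else None
        # A connection needs a preceding letter that joins forward
        # and a following letter that joins backward.
        joins_prev = prev == "D" and cur in ("D", "R")
        joins_next = cur == "D" and nxt in ("D", "R")
        if joins_prev and joins_next:
            shapes.append("medial")
        elif joins_prev:
            shapes.append("final")
        elif joins_next:
            shapes.append("initial")
        else:
            shapes.append("isolated")
    return shapes


kitab = "\u0643\u062A\u0627\u0628"  # kaf, teh, alef, beh
print(contextual_shapes(kitab))
```

For the four-letter word kaf, teh, alef, beh the sketch yields initial, medial, final, isolated: the last letter stays isolated because the alef before it does not connect forward.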
Displaying the Arabic Text

There are two available approaches to displaying Arabic shapes on the PC. The first approach uses the graphic video mode, while the other uses the alphanumeric video mode. Using the graphic screen allows the display of an unlimited number of fonts with flexible sizes and the vowels at the correct positions. Unfortunately, using a graphic screen will slow down the system as its complexity increases. Also, as the font size increases, the amount of displayable information decreases.

One of the easiest ways to speed up the I/O routines and make the screen hold more information is by using the alphanumeric video mode. On the original MDA and CGA display adapters, the only fonts that could be displayed in the alphanumeric video mode were those defined in a table located in ROM on the adapter. To display different fonts, the ROM must be replaced with one that holds the new font definitions. Recent adapters, such as EGAs and VGAs, all have alphanumeric character generators that use character definition tables located in predesignated areas of RAM. This table can be accessed and modified by means of software.

For Micro-AIRS, a small number of I/O routines have been designed and implemented to allow it to accept and display Arabic/English text. A previous system (Al-Kharashi, 1989), which uses a graphic screen to display vowelized Arabic text, has been modified to display both Arabic and English text in the same screen line. The new system uses the text screen instead of the graphic screen to display text. To achieve this, a whole new font table was created by the first author. The font shapes that represent the English ASCII characters are kept without any change. The last 128 font shapes, which represent some graphical and foreign shapes, have been replaced by the shapes of Arabic characters and vowels. The Micro-AIRS fonts are shown in Figure 3.
A brief glimpse of this interface and the basic input/output routines will be provided during the discussion of the Micro-AIRS system structure below.

FIG. 3. The Micro-AIRS fonts.

The Structure of the Micro-AIRS System

Basically, Micro-AIRS consists of three main conceptual components: a User Interface, a Command Processor, and a Database Handler. The description of each component of the system follows.

User Interface

The real effectiveness of a computer system is measured by its usability by people other than computer professionals. This leads to the need for an effective human-computer interface. Menu-driven systems are one of the most successful and widely used system design techniques. The advantages of using menu-driven systems have been discussed by Shneiderman (1987) and by Galambos et al. (1985). They include reducing the training and memorizing effort, simplifying entry of choices, structuring the user's task, and allowing the user to become acquainted with the range of possibilities that the system offers.

Micro-AIRS adopted the menu system that is used by Borland's interactive compiler products such as Turbo C, Version 2.0, and Turbo Pascal, Version 5.0 (Borland, 1988b; 1988c). Thus, our interface is composed of two components, a permanent menu and pull-down menus. The permanent menu displays the options of the main menu. Left and right arrow keys are used to move through the items, causing them to be highlighted one at a time. The highlighted item can be selected by pressing the (ENTER) key. A pull-down menu, on the other hand, is displayed when an item from the permanent menu, or an item from the current pull-down menu, is selected. Pull-down menu items are listed vertically, and the user can move through the items by using the up and down arrow keys. For large lists of items in one menu, more elaborate scrolling capabilities are provided.

The screen in the Micro-AIRS user interface is divided into three areas, as shown in Figure 4 and described as follows:

• System response area: This area is used by the system to display its response to a user command such as DISPLAY, SEARCH, or SORT.
• System status area: The bottom line of the screen shows the system status (running, waiting, or idle), error messages, names of active databases, and so forth.
• System menu area: The top line of the screen lists all available submenus/commands in the system. An individual entry can be activated by highlighting it using the left/right arrow keys and then pressing the (ENTER) key. There are eight basic items in the main menu. When activated, each item in the main menu will pull up another menu. An item in the second level menu, in turn, could trigger another menu or select a basic item.

FIG. 4. (a) Arabic Micro-AIRS user interface. (b) Equivalent English interface.

Command Processor

This module accepts a user request, validates it, and then processes it. Since all user commands are entered through a menu-driven system, a great deal of this module is devoted to interpreting the SEARCH and DISPLAY commands.

All the system commands are available through a menu-driven system. Commands are categorized into eight groups, namely FILE, EDIT, SEARCH, DISPLAY, SORT, PRINT, UTILITIES, and HELP. Each group is represented by an entry in the system menu area. When a group is selected, it will display a list of related commands, as shown in Figure 4.

The DISPLAY command allows the user to access the text of a database record directly, or to choose to display a document from a previously retrieved set. The user can then display the next or the previous document, the last or the first document, or jump backward or forward a given distance from the current document.

Micro-AIRS allows the user to search the database using three retrieval methods, using words, stems, or roots, one at a time.
To switch from one retrieval method to another, the user should use the SEARCH/RETRIEVAL-METHOD command to select the desired method. The system will then close the current keyword and posting files associated with the current method and open the files for the selected method.

The system makes available a full set of Boolean and distance operators. The ADJ operator specifies that two words must appear next to each other in the document and in the proper word order. The FLD operator specifies that two words must appear in the same field. Another option allows the user to specify that two index terms must be separated by exactly n words. The truncation symbol ":" can be suffixed to a query term to widen the search results. The truncation symbol indicates whether the term is to be searched as a complete term or as a fragment of a larger term. The truncation option is limited to the word-retrieval method, since the stem- and root-retrieval methods are more effective than truncation. Parentheses can be used to construct long and complex queries. A query can be submitted interactively using the appropriate chain of menus, or by giving the name of a pre-edited query file, which can contain one or more queries.

File Handler

This module is responsible for accessing and updating the data file. The three most basic operations are creating a new database, building a searchable database out of pure text files, and searching through a database. Each Micro-AIRS database consists of five basic files: the database definition file, the data and the index files, and the keyword and posting files (Al-Kharashi, 1990a). Descriptions of each file are provided in the next section along with a discussion of the indexing process.
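The ADJ operator, the exact-distance option, and the truncation symbol described for the SEARCH command can be illustrated with a minimal Python sketch over the tokens of one field. The function names, the either-order reading of the distance option, and the prefix interpretation of truncation are assumptions made for the example; the Micro-AIRS description does not confirm them.

```python
def adj(tokens, first, second):
    """ADJ: `first` is immediately followed by `second`, in that order."""
    return any(a == first and b == second for a, b in zip(tokens, tokens[1:]))


def separated_by(tokens, term_a, term_b, n):
    """True if the two terms occur with exactly n words between them.
    Assumption: either order of the two terms is accepted."""
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    return any(abs(i - j) == n + 1 for i in pos_a for j in pos_b)


def term_matches(query_term, word):
    """A trailing ':' marks a truncated term, matched here as a prefix;
    without it the term must match the whole word."""
    if query_term.endswith(":"):
        return word.startswith(query_term[:-1])
    return word == query_term


# English tokens are used only to keep the example readable.
title = "arabic information retrieval system".split()
print(adj(title, "information", "retrieval"))
print(adj(title, "retrieval", "information"))
print(separated_by(title, "arabic", "system", 2))
print(any(term_matches("retriev:", w) for w in title))
```

The FLD operator would follow the same pattern, with both terms looked up in the token list of a single field.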
The Micro-AIRS Indexing Process

Indexing Strategies

Indexing and data organization are the two major factors that influence the effectiveness and the efficiency of an information retrieval system. The indexing process deals with the selection of appropriate terms capable of representing the content of a given bibliographic record. Experimental information retrieval systems use different indexing methods. Frequency-based indexing methods measure the importance of a given term by its frequency in individual documents as well as its frequency in the whole collection. Frequency factors can also be used during the retrieval process as term weights to enhance the precision of the system, that is, to present the user with a set of records that closely match the query, in decreasing order of similarity. Binary weighting schemes, however, can be used instead. In this case all indexable terms are assigned the same weight. Although Micro-AIRS stores the frequency of occurrence of every valid indexable term in the collection, these values were not used as a measure for selecting significant terms.

There are two reasons for not using word frequency as a measurement tool during the indexing and retrieval of the data. The first reason is related to Luhn's (1958) observation. Since Luhn's law has not been verified with Arabic text, it is not realistic to use it as a solid base for indexing Arabic data. The second reason involves the type of collection that was used to test the system. Salton (1971) concluded that the effectiveness of content analysis depends on the length of the textual data available: content analysis works better with larger textual data. In our collection, every record contains a short title with no abstract, except for very few records. A word seldom occurs more than once or twice in the same document. Hence, a frequency-based measurement has no significance for our data.
Data Description

As described earlier, the Arabic collection contains about 23,800 records covering a wide range of science and technology fields. This data was originally contained in a single sequential file that occupies about seven million bytes of disk space. Each record in the data file is represented as a sequential list of bibliographic fields (e.g., title, author, journal title, abstract, and so forth). The text of each record is terminated with an end-of-record mark. Every field starts with a three-character field identifier followed by a space and then the text for that field. The field text is terminated with an end-of-line mark (i.e., carriage return and line feed characters).

The evaluation of the system requires some initial manual tasks, mainly relevance judgments. This task needs experts in the subject area that the system covers. Because the collection has wide coverage, we needed to choose a subset that covers a specific area where we would find help in performing the manual tasks. A single record from the original data is contained in one or more sets. The computer and information science set, with 355 records, was found to be the most suitable set for testing and evaluating the system. With this set it is easy to find people who are able to create queries and perform the relevance judgments.

The text of the selected set contained a few typing and spelling mistakes. To reduce the effect of these mistakes on the evaluation process, they were corrected before the final indexing process was carried out. The majority of these mistakes were easily detected after all keywords from the record texts were extracted and sorted. The VATE editor (Al-Kharashi, 1989) was then used for the simple editing and correction processes.

Database Definition Table

The database definition table controls the behavior of the system during editing, indexing, and retrieval.
Every field in the database has an entry in this file that holds the following information: the full name of the field, the abbreviated name of the field, and the field attributes.

The process of selecting index terms goes through many phases, starting with plain text and ending with a list of useful, accessible index terms. The indexing process accepts plain text defined by the index and record files, and extracts all words from every indexable field in all database records. In the database definition table we define the category, the title, and the abstract fields as indexable and searchable fields.

The extraction of a word from a bilingual text is certainly more difficult than the extraction of a word from a unilingual text. With the use of the character attribute file, the word extraction process was able to distinguish between numeric, control, English, and Arabic data. The length of any extracted term is limited to 25 bytes. If the original term exceeds the 25-byte limit, the term is truncated at the 25th byte and the remainder of the term is skipped.

After the extraction of all index terms from the record file, the indexed keyword list is sorted. A general purpose sorting program such as the DOS sort utility or other commercial sorting programs will not work with our data. This is because these utilities are usually intended for textual data, and the indexed keyword list contains binary coded values in the document number entries. IBM PCs and compatibles are built on the Intel family of microprocessors. The Intel CPUs store an integer value, which occupies two bytes, with the least significant byte at the lower location and the most significant byte at the higher location. This structure causes a general purpose sorting program intended for textual data to mis-sort any data with numerical data coded in binary format. To overcome this problem, a general purpose sorting program, MERGESORT.C, was developed. MERGESORT uses the Turbo C built-in quick sort function, QSORT (Borland, 1988a), for internal sorting.

TABLE 1. The list of queries.

Number   English meaning                                                Arabic query
1        Computer systems
2        Computer and languages
3        Arabic programming languages
4        Computer and architecture design
5        Natural language processing
6        Computer and drawing
7        Computer and language learning
8        Computer and industries and industrial information
9        Computers in military field
10       Computer Arabization
11       Parallel programming
12       Morphological analysis
13       Computers and (indexing or classification or documentation)
14       Computers in Saudi Arabia
15       Computers and (Quran or Hadeeth)
16       Computers and children
17       Computers and phonetics
18       Computers and agriculture
19       Computer networks and communication
20       Computer and design
21       Computers in education
22       Arabic terminology
23       Thesauri and information retrieval
24       Computers and information security
25       Computers in libraries
26       Machine translation
27       Computers and managements
28       Personal computers
29       Terminology databanks

Creation of the Word-Stem-Root Dictionary

A comprehensive Arabic dictionary is not available in machine-readable form, so we created a small word-stem-root dictionary using the words in the collection. The dictionary is used during the indexing and the retrieval process to identify the stem or the root of a given word and also to identify the stop words. For every keyword a corresponding stem and root structure was created. From 355 bibliographic records we obtained 1,126 words, 725 stems, and 526 roots.

Evaluation of the Micro-AIRS System

The major purpose of this work is to study the effect of using words, stems, or roots as index terms on the performance of an Arabic information-retrieval system.

Relevance Judgments

Information-retrieval systems are usually evaluated in terms of two measures, recall and precision. Recall is defined as the proportion of the documents in the collection relevant to the query that are actually retrieved. Precision is defined as the proportion of the documents retrieved that are actually relevant. Perfect recall (a value of 1.0) occurs when the system finds all the items in the collection that are relevant to the query. Perfect precision (also a value of 1.0) occurs when all the documents retrieved are relevant. Both measures depend on knowing what documents are relevant to each query, so the first step in evaluation is making relevance judgments. For large collections, sampling techniques are used (Salton, 1975), but our collection was small enough that we could carry out this task manually.

We asked graduate students in computer science, who were also native speakers of Arabic, to make up 60 queries that they might themselves use in their own research. Ten queries were removed because they were essentially duplicates of other queries, asking for the same information. The 355 database records were divided into three sets. Each set was handed to one of the students along with a computer-based relevance judgment support system designed and implemented by the first author. This system allowed the judge to browse through the records of the set and the list of queries at the same time in two different windows on the same screen. If the judge decided that the displayed record was relevant to the displayed query, he simply marked a box on the screen. After the completion of the relevance judgment task, the judgments for the three sets were grouped together in a relevance judgment matrix. Out of the 50 queries, only the 29 queries shown in Table 1 were found to have one or more relevant documents in the collection. Thus, only these 29 queries were used for the system evaluation.

Similarity Measurements

Similarity coefficients have several important applications in an information-retrieval system (Salton, 1989). Their most important function is in ranking retrieved documents in order to present to the user first the documents most relevant to the query. There are three common normalized similarity coefficients (each with two versions, one for binary and one for weighted terms): the cosine, Dice, and Jaccard coefficients (van Rijsbergen, 1979; Salton, 1989). We decided to try all three binary coefficient measurement methods and select the one that performs the document ranking best. Table 2 shows the formulas for the cosine, Dice, and Jaccard binary coefficients.

TABLE 2. The binary similarity coefficients.

Cosine:   $\dfrac{|Q \cap D|}{\sqrt{|Q|} \cdot \sqrt{|D|}}$
Dice:     $\dfrac{2\,|Q \cap D|}{|Q| + |D|}$
Jaccard:  $\dfrac{|Q \cap D|}{|Q| + |D| - |Q \cap D|}$

$|D|$, number of terms in the document text; $|Q|$, number of terms in the query text; $|Q \cap D|$, number of terms in both document and query.

The calculation of the similarity between a query and a document requires the number of terms in the document text, the number of terms in the query text, and the number of terms that appear jointly in the query and the document text. When the root or the stem is used instead of the word, it becomes more likely that several words collapse into one common stem or root. Hence, the number of unique terms is reduced and the similarity coefficient is increased. In determining the order in which documents should be presented to the user, however, the actual value of the coefficient does not matter; it is only the relative values that make a difference.
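The three binary coefficients of Table 2 and the effect of collapsing words into roots can be sketched in a few lines of Python. This is a minimal illustration, not the Micro-AIRS implementation (which is written in Turbo C). The function names and the transliterated word-to-root entries are hypothetical toy data introduced only for the example.

```python
import math


def binary_similarities(query_terms, doc_terms):
    """Binary cosine, Dice, and Jaccard coefficients between two term lists."""
    q, d = set(query_terms), set(doc_terms)
    common = len(q & d)
    if common == 0:
        # Also covers an empty query or document, avoiding division by zero.
        return {"cosine": 0.0, "dice": 0.0, "jaccard": 0.0}
    return {
        "cosine": common / (math.sqrt(len(q)) * math.sqrt(len(d))),
        "dice": 2 * common / (len(q) + len(d)),
        "jaccard": common / (len(q) + len(d) - common),
    }


# Hypothetical word -> root entries (transliterated toy data, not taken
# from the dictionary built for the experiments).
ROOT_OF = {"hasib": "hsb", "hasibat": "hsb", "talim": "elm", "arabi": "erb"}


def as_roots(words):
    """Map each word to its root; unknown words are left unchanged."""
    return [ROOT_OF.get(w, w) for w in words]


query = ["hasibat"]
title = ["hasib", "talim", "arabi"]

print(binary_similarities(query, title))                      # word terms
print(binary_similarities(as_roots(query), as_roots(title)))  # root terms
```

With word terms the query and the title share nothing, so all three coefficients are zero. With root terms they share one term, with |Q| = 1 and |D| = 3. Dice and Jaccard always order documents identically, since Jaccard = Dice / (2 - Dice); the cosine could in principle order them differently.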
The ranking process showed that all three binary similarity coefficients produced exactly the same rankings for all queries. The actual values are shown for all three similarity methods combined with all three retrieval methods for three example queries in Tables 3, 4, and 5. The results obtained for all the other queries were the same. Clearly, it does not matter which coefficient is used, and we are free to use whichever one is the cheapest to compute on the target hardware.

TABLE 3. Ranking of the result of query number 20 using roots.

                        Similarity coefficient values
Document number   Rank   Cosine   Dice     Jaccard
19                1      0.5774   0.5000   0.3334
18                2      0.4715   0.3637   0.2223
216               3      0.4715   0.3637   0.2223
281               4      0.4715   0.3637   0.2223
325               5      0.4083   0.2858   0.1667
212               6      0.2133   0.0870   0.0455

Three of the six documents carry the relevance indicator (*).

TABLE 4. Ranking of the result of query number 22 using words.

                                              Similarity coefficient values
Document number   Relevance indicator   Rank   Cosine   Dice     Jaccard
279               *                     1      0.5346   0.4445   0.2858
263               *                     2      0.4473   0.3334   0.2000
254               *                     3      0.4265   0.3077   0.1819
253               *                     4      0.3780   0.2500   0.1429
273               *                     5      0.3652   0.2353   0.1334

TABLE 5. Ranking of the result of query number 21 using stems.

                        Similarity coefficient values
Document number   Rank   Cosine   Dice     Jaccard
328               1      0.5774   0.5774   0.4000
73                2      0.5774   0.5000   0.3334
230               3      0.4365   0.4000   0.2500
52                4      0.4365   0.4000   0.2500
13                5      0.4365   0.4000   0.2500
12                6      0.3850   0.3334   0.2000
276               7      0.3652   0.3077   0.1819
255               8      0.3652   0.3077   0.1819
11                9      0.3652   0.3077   0.1819
120               10     0.3334   0.2667   0.1539
224               11     0.2218   0.0938   0.0492

Six of the eleven documents carry the relevance indicator (*).

Results of Processing the Queries

The processing of our 29 queries on the Micro-AIRS system using words, stems, and roots produces the results shown in Table 6. The system performance using each of the three indexing methods can be categorized into six groups as follows:

(1) The system failed to retrieve any document with the use of any retrieval method in response to query 4.
(2) The three methods perform equally in response to queries 9, 11, and 16.
(3) The word-retrieval method performs as well as or better than the other methods at most recall levels in queries 13, 20, 25, 26, and 29. It is not able, however, to retrieve any of the relevant documents for the following queries: 2, 4, 5, 7, 12, 15, 17, 18, and 23.
(4) The stem-retrieval method outperforms the other methods in response to some queries, including 1 and 7.
(5) The root-retrieval method outperforms the other methods in most of the queries, most strikingly in 3, 8, 12, 15, 17, 18, 19, 22, 23, 27, and 28.
(6) The stem- and the root-retrieval methods perform equally in response to queries 2, 5, and 6.

TABLE 6. The results of the processing of the 29 queries.

[For each of the 29 queries, the table lists Rc. followed by Ret., Rel., and Fall for the word, stem, and root methods.]

Rc., number of relevant records in the collection; Ret., number of retrieved records using the method; Rel., number of relevant records actually retrieved; Fall, number of irrelevant records actually retrieved.

Table 7 and Figure 5 show the differences in average retrieval values.
It is clear from the table and the figure that the root-retrieval method was able to retrieve more documents per query than the other two retrieval methods. The problem with the root-retrieval method is the number of irrelevant documents retrieved along with the relevant documents. The stem-retrieval method retrieved fewer irrelevant documents yet a reasonable number of relevant documents. Finally, the word-retrieval method retrieved the fewest relevant and irrelevant documents.

TABLE 7. Average retrieval of the 29 queries.

Method   Retrieved   Relevant   Irrelevant
Word     2.24        2.03       0.21
Stem     7.79        3.69       4.10
Root     12.55       4.72       7.83

FIG. 5. Average retrieval of the 29 queries (word, stem, and root methods).

From the previous detailed and averaged retrieval data, we cannot draw a precise conclusion about the effectiveness of the system. In the following sections we present the standard evaluation of the information-retrieval system based on the recall and precision measures.

Recall-Precision Measurements

The recall and precision values produced for a given query reveal the behavior of the system only under that query and for those calculated recall and precision values. Notice that the rough recall-precision values could contain many precision values for a single recall value. Furthermore, some precision values may not be defined at certain recall levels. Different smoothing algorithms are in common use for precision averaging (Keen, 1972). In Micro-AIRS we used the smoothing algorithm summarized below.

(1) Divide the recall values into 10 levels (0.0 <= r_{0.1} < 0.1, 0.1 <= r_{0.2} < 0.2, ..., 0.9 <= r_{1.0} <= 1.0).
(2) Assign the largest precision value of a level to that level.
(3) Assign the largest precision value found in the table to the first level.
(4) Starting from the tenth level, remove all sawtooth lines by assigning the current level's precision value to the next level if that level's precision value is lower than the current one.
(5) To ensure that the precision drops gradually from a certain precision value to zero, assign to any level with a zero precision half of the precision value of the previous level.

Table 8 and Figure 6 show the averaged recall-precision values with the zero-smoothing process. The summaries provided by the average recall-precision values suggest that the root-retrieval method outperforms both the word- and the stem-retrieval methods. They also suggest that the stem-retrieval method outperforms the word-retrieval method.

FIG. 6. Average recall-precision graph after zero-smoothing process (precision plotted against recall for the word, stem, and root methods).

TABLE 8. Average recall-precision table with zero-smoothing process.

                  Precision
Recall   Word     Stem     Root
0.10     0.7143   0.8739   0.9308
0.20     0.6357   0.8356   0.8998
0.30     0.5643   0.7772   0.8541
0.40     0.4946   0.7510   0.8442
0.50     0.4241   0.6952   0.8085
0.60     0.3353   0.5721   0.6946
0.70     0.2569   0.4457   0.5860
0.80     0.1999   0.3545   0.5047
0.90     0.1714   0.2935   0.4290
1.00     0.1571   0.2467   0.3912

Statistical Analysis

To draw accurate conclusions about the effectiveness of the system using the word-, stem-, and root-retrieval methods, and to determine the significance of the results shown in Table 8 and Figure 6, we use two nonparametric statistical tests, the sign test and the Wilcoxon signed-rank test. In this analysis we compared each pair of retrieval methods separately. Thus essentially we looked at the results of three experiments: the word-stem experiment, the word-root experiment, and the stem-root experiment.

The null hypothesis and the alternative hypothesis used for the word-stem experiment are:

H0: The word-retrieval and the stem-retrieval methods give the same results.
H1: The stem-retrieval method is better than the word-retrieval method.

The null hypothesis and the alternative hypothesis used for the word-root experiment are:

H2: The word-retrieval and the root-retrieval methods give the same results.
H3: The root-retrieval method is better than the word-retrieval method.

The null hypothesis and the alternative hypothesis used for the stem-root experiment are:

H4: The stem-retrieval and the root-retrieval methods give the same results.
H5: The root-retrieval method is better than the stem-retrieval method.

The test results are shown in Tables 9-14. The statistical results support H1 and H3; that is, they confirm the superiority of the root- and stem-retrieval methods over the word-retrieval method with alpha = .03 using the Wilcoxon signed-rank test. When we compare the stem- and the root-retrieval methods, the results are not so clear. The one-sided probability values at the lower recall levels (up to .5) of Tables 13 and 14, comparing the stem- and root-retrieval methods, show that the root-retrieval method does not perform significantly better than the stem method. At higher recall levels, however, the root-retrieval method performs better than the stem-retrieval method.

TABLE 9. Sign test for word vs. stem.

Recall     Favoring word   Favoring stem   Tied   Norm dev. Z   One-sided probability
.10        2               6               21     1.4142        .0793
.20        3               9               17     1.7321        .0418
.30        5               11              13     1.5000        .0668
.40        5               12              12     1.6977        .0446
.50        5               13              11     1.8856        .0294
.60        4               14              11     2.3570        .0091
.70        4               15              10     2.5236        .0059
.80        3               16              10     2.9824        .0014
.90        3               16              10     2.9824        .0014
1.00       3               16              10     2.9824        .0014
Combined   37              128             125    7.0843        .0000

TABLE 10. Wilcoxon signed-rank test for word vs. stem.

Recall   Favoring word   Favoring stem   NDF     Norm dev. Z   One-sided probability
.10      4.00            32.00           8.00    1.9604        .0250
.20      8.00            70.00           12.00   2.4318        .0075
.30      21.00           115.00          16.00   2.4303        .0075
.40      21.00           132.00          17.00   2.6273        .0043
.50      23.00           148.00          18.00   2.7219        .0033
.60      14.00           157.00          18.00   3.1139        .0009
.70      15.00           175.00          19.00   3.2194        .0006
.80      8.00            182.00          19.00   3.5011        .0002
.90      8.00            182.00          19.00   3.5011        .0002
1.00     8.00            182.00          19.00   3.5011        .0002

TABLE 11. Sign test for word vs. root.

Recall     Favoring word   Favoring root   Tied   Norm dev. Z   One-sided probability
.10        2               8               19     1.8974        .0287
.20        4               12              13     2.0000        .0228
.30        5               15              9      2.2361        .0073
.40        5               16              8      2.4004        .0082
.50        3               18              8      3.2733        .0005
.60        2               19              8      3.7097        .0000
.70        2               20              7      3.8376        .0000
.80        1               21              7      4.2640        .0000
.90        1               21              7      4.2640        .0000
1.00       1               21              7      4.2640        .0000
Combined   26              171             93     10.3308       .0000

TABLE 12. Wilcoxon signed-rank test for word vs. root.

Recall   Favoring word   Favoring root   NDF     Norm dev. Z   One-sided probability
.10      5.00            50.00           10.00   2.2934        .0110
.20      15.00           121.00          16.00   2.7406        .0031
.30      29.50           180.50          20.00   2.8186        .0020
.40      26.00           205.00          21.00   3.1108        .0009
.50      16.50           214.50          21.00   3.4410        .0003
.60      9.00            222.00          21.00   3.7017        .0000
.70      9.00            244.00          22.00   3.8147        .0000
.80      2.00            251.00          22.00   4.0420        .0000
.90      2.00            251.00          22.00   4.0420        .0000
1.00     2.00            251.00          22.00   4.0420        .0000

TABLE 13. Sign test for stem vs. root.

Recall     Favoring stem   Favoring root   Tied   Norm dev. Z   One-sided probability
.10        3               2               24     -.4472        .3300
.20        5               4               20     -.3333        .3707
.30        5               5               19     .0000         .5000
.40        5               6               18     .3015         .3821
.50        4               7               18     .9045         .1841
.60        4               11              14     1.8074        .0351
.70        5               12              12     1.6977        .0446
.80        5               12              12     1.6977        .0446
.90        5               12              12     1.6977        .0446
1.00       4               13              12     2.1828        .0146
Combined   45              84              161    3.4338        .0003

TABLE 14. Wilcoxon signed-rank test for stem vs. root.

Recall   Favoring stem   Favoring root   NDF     Norm dev. Z   One-sided probability
.10      6.00            9.00            5.00    .4045         .3409
.20      18.00           27.00           9.00    .5331         .2982
.30      18.00           37.00           10.00   .9683         .1660
.40      20.00           46.00           11.00   1.1558        .1230
.50      18.00           48.00           11.00   1.3337        .0918
.60      22.00           98.00           15.00   2.1583        .0154
.70      33.00           120.00          17.00   2.0592        .0197
.80      30.00           123.00          17.00   2.2012        .0131
.90      34.00           119.00          17.00   2.0119        .0217
1.00     24.00           129.00          17.00   2.4853        .0064

Conclusions

Summary

Micro-AIRS was designed as an experimental system to investigate indexing and retrieval processes for Arabic bibliographic data. During the design and implementation of the system, we dealt with the following problems:

(1) Accessing, processing, and displaying Arabic/English text.
(2) Indexing and sorting Arabic terms.
(3) Indexing and retrieval of Arabic data using different types of index terms: words, stems, and roots.
(4) Ranking documents using different binary similarity coefficients.

This research reveals the superiority of root- and stem-retrieval methods over word-retrieval methods for Arabic data. The root performs as well as or better than the stem at the low recall levels, and definitely better at high recall levels. We also found that the document ranking process produced exactly the same results when using different binary similarity coefficients, so a single simple coefficient can be used.

These results were obtained in a system where each document was accurately classified as to subject area. Also, the part of the collection involved in the experiments, the set containing all 355 computer science documents in the database, was carefully proofread to eliminate spelling errors. Of most concern, most documents in the collection were represented by titles only, not by abstracts. Clearly, further experiments are needed.

Future Research

In an operational system, the word-stem-root dictionary should be replaced by a morphology algorithm that finds stems and roots, as mentioned previously. By using stems and roots for indexing and retrieval we were able to retrieve most of the relevant documents in the collection.
The retrieval failure of some or all relevant documents (see Table 6) was due to the use of related words (e.g., synonyms). We believe that the use of an interactive thesaurus will be helpful in retrieving more relevant documents. For a discussion of the use of such a thesaurus in English see Fox (1980) and Wang, Vandendorpe, and Evens (1985). Research on this problem is being carried forward at Illinois Institute of Technology using a database of Arabic documents with abstracts.

The current system allows the user to use only one type of index term at any given time. To reduce the number of irrelevant documents, the user should have the ability to impose the retrieval method over individual words of a query. For example, the search argument “A and (B or C)” could be expressed as “root:A and (stem:B or word:C).”

A binary ranking process fails in some cases to put the most relevant documents at the top of the retrieved list. A weighted ranking process should be investigated for Arabic documents using a database where all documents have abstracts, or better still, where all documents are available online in full-text form. The first author is planning a large-scale test of the effectiveness of the system at KACST using a large collection of documents with abstracts and a large number of test queries collected from actual users.

REFERENCES

Borland International. (1988a). Turbo C, version 2.0: Reference guide. Scotts Valley, CA.
Borland International. (1988b). Turbo C, version 2.0: User’s guide. Scotts Valley, CA.
Borland International. (1988c). Turbo Pascal, version 5.0: User’s guide. Scotts Valley, CA.
Al-Fedaghi, S. S., & Al-Anzi, F. S. (1989, March). A new algorithm to generate Arabic root-pattern forms. In Proceedings of the 11th National Computer Conference and Exhibition (pp. 391-400). Dhahran, Saudi Arabia: King Fahd University of Petroleum and Minerals.
Fox, E. (1980). Lexical relations: Enhancing effectiveness of information retrieval. SIGIR Forum, 15, 6-35.
Galambos, J. A., Sebrechts, M., Wikler, E., & Black, J. (1985). A diagrammatic language for instruction of a menu-based word processing system. In S. Williams (Ed.), Humans and machines (pp. 11-44). Norwood, NJ: Ablex.
Al-Gasimi, M. (1987, April). Arabization of the MINISIS system. In Proceedings of the First King Saud University Symposium on Computer Arabization (pp. 13-26). Riyadh, Saudi Arabia: King Saud University.
Gheith, M., & El-Sadany, T. (1987, April). Arabic morphological analyzer on a personal computer. In Proceedings of the First King Saud University Symposium on Computer Arabization (pp. 55-65). Riyadh, Saudi Arabia: King Saud University.
Gheith, M., & Abdul-Ela, M. (1989, March). A computer based Arabic syntax analyzer. In Proceedings of the 11th National Computer Conference and Exhibition (pp. 352-360). Dhahran, Saudi Arabia: King Fahd University of Petroleum and Minerals.
Harman, D. (1987, June). A failure analysis on the limitation of suffixing in online environments. In Proceedings of the 10th Annual International ACM SIGIR Conference. New York: Association for Computing Machinery.
Harman, D. (1991). How effective is suffixing? Journal of the American Society for Information Science, 42, 7-15.
Hegazi, M., & Elsharkawi, A. A. (1985, April). An approach to a computerized lexical analyzer of natural Arabic. Computer Processing of the Arabic Language. Workshop papers (Vol. 1). Kuwait: Kuwait Institute for Scientific Research (KISR).
Hilal, Y. (1985, April). Morphological analysis of Arabic speech. Computer Processing of the Arabic Language. Workshop papers (Vol. 1). Kuwait.
Keen, E. M. (1972). Prospects for classification suggested by evaluation tests carried out 1957-1970. In A. Maltby (Ed.), Classification in the 1970’s (pp. 193-210). Hamden, CT: Linnet Books.
Al-Kharashi, I. A. (1989). VATE: A vowelized Arabic text editor. Ph.D. qualifying project, Illinois Institute of Technology, Chicago, IL.
Al-Kharashi, I. A. (1990a, October). Micro-AIRS: A microcomputer based Arabic information retrieval system, design, implementation and evaluation. In The 12th National Computer Conference (Vol. 2) (pp. 515-529). Riyadh, Saudi Arabia: King Saud University.
Al-Kharashi, I. A. (1990b, October). An efficient contextual analysis algorithm for Arabic text handling. In The 12th National Computer Conference (Vol. 2) (pp. 465-473). Riyadh, Saudi Arabia: King Saud University.
Lovins, J. B. (1968). Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11, 22-31.
Luhn, H. P. (1958). The automatic creation of literature abstracts. IBM Journal of Research and Development, 2, 159-165.
Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14, 130-137.
Salton, G. (Ed.) (1971). The SMART retrieval system: Experiments in automatic document processing. Englewood Cliffs, NJ: Prentice Hall.
Salton, G. (1975). A theory of indexing. Regional Conference Series in Applied Mathematics, No. 18. Philadelphia: Society for Industrial and Applied Mathematics.
Salton, G. (1989). Automatic text processing: The transformation, analysis, and retrieval of information by computer. Reading, MA: Addison-Wesley.
Salton, G., & McGill, M. J. (1983). Introduction to modern information retrieval. New York: McGraw-Hill.
Shneiderman, B. (1987). Designing the user interface: Strategies for effective human-computer interaction. Reading, MA: Addison-Wesley.
Tayli, M., & Al-Salamah, A. I. (1990). Building a bilingual microcomputer system. Communications of the ACM, 33, 495-504.
Thalouth, B., & Al-Dannan, A. (1987). A comprehensive Arabic morphological analyzer/generator. IBM Kuwait Scientific Center.
UNESCO (1989). Mini-micro CDS/ISIS. Paris.
van Rijsbergen, C. J. (1979). Information retrieval (2nd ed.). London: Butterworths.
Wang, Y. C., Vandendorpe, J., & Evens, M. (1985). A microcomputer based information retrieval system supporting stroke diagnosis. Journal of the American Society for Information Science, 36, 15-27.
Yahya, A. H. (1989, October). On the complexity of the initial stage of Arabic text processing. Paper presented at the First Great Lakes Computer Conference, Kalamazoo, MI.