Comparing Words, Stems, and Roots as Index Terms in an
Arabic Information Retrieval System
Ibrahim A. Al-Kharashi
King Abdulaziz City for Science and Technology, General Directorate for Information Services,
P.O. Box 6086, Riyadh 11442, Saudi Arabia
Martha W. Evens
Department of Computer Science, Illinois Institute of Technology,
10 West 31st Street, Chicago, IL 60616
The Micro-AIRS System, a microcomputer
system for Arabic Information Retrieval, was designed as an experimental system to investigate indexing and retrieval processes
for Arabic bibliographic
data. A series of experiments were
performed using 29 queries against a base of 355 Arabic
bibliographic
records, covering computer and information
science from the bibliographic
databank at King Abdulaziz
City for Science and Technology. These experiments
revealed that using roots and using stems as index terms
gives better retrieval results than using words. The root
performs as well as or better than the stem at low recall
levels and definitely better at high recall levels. Several
different binary similarity coefficients were tried: the cosine, Dice, and Jaccard coefficients. All three led to exactly
the same document rankings for every query. The experiments were run on an IBM/AT-compatible
microcomputer.
Micro-AIRS is written in Turbo C, Version 2.0.
Introduction
The Problem
Techniques for storing, maintaining, and retrieving
from English bibliographic databases have been studied,
implemented, and tested for the last three decades, but
we do not know how well these techniques will work on
Arabic data. Experimentation with retrieval systems in
Arabic language environments has been very limited.
Arabization of available information retrieval systems
has dealt mostly with internal representation of the Arabic data and translation of menus and system messages
to Arabic. The problems of working with the Arabic language have not been confronted directly.
Received May 26, 1992; revised February 9, 1994; accepted February 9, 1994.
© 1994 John Wiley & Sons, Inc.
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE. 45(8):548-560, 1994
CCC 0002-8231/94/080548-13
In principle, there are two approaches to developing
an Arabized computer application. The first approach is
to develop the application from scratch, bearing in mind
the characteristics of the Arabic language. The second
approach is based on building an I/O interface
to existing application software built for non-Arabic languages. The first approach is costly and time consuming;
the second approach is easy to implement at the price of
abandoning some Arabic language characteristics. The
second approach has been adopted to Arabize two well-known
retrieval system software packages, STAIRS (Salton & McGill, 1983) and ISIS (UNESCO, 1989). The Arabization effort, however, is limited to the internal representation of the text and the translation of the menus
and messages to Arabic (Al-Gasimi, 1987).
The aim of our work is to study the problems and
difficulties of applying indexing and retrieval algorithms
to Arabic data. In particular we explore the problems of
storing and displaying bilingual bibliographic data, selection of index terms, ranking of Arabic records, and stemming algorithms for Arabic index terms. Special effort
will be devoted to the study of the effect of stemming
algorithms on the performance of the information retrieval system.
Stemming in information retrieval systems designed
for use with English text is usually confined to suffix removal. The motive for the use of stemming is obvious:
term stemming can increase the number of retrieved
documents since the stem of a term represents a broader
notion than the original term itself. Several stemming algorithms have been used in experimental environments
(Lovins, 1968; Porter, 1980; Salton, 1971). Experiments
using word stems as indexing terms show different results. While the implementation of suffix removal algorithms in the SMART system (Salton, 1971) shows improvement in retrieval effectiveness, Harman’s (1987)
experiments show less improvement and sometimes even degradation. Further research (Harman, 1991) suggests
that in an online system, stemming should be applied
differentially (to some queries but not others) under user
control, depending on results obtained for particular
queries.
The basic goal of our research is to try to find the best
way to solve this problem for documents in Arabic. The
morphological structure of the Arabic language makes
the stemming problem much more complex. We compare three alternative choices for index terms: the word
itself, the stem, and the root, with the goal of finding out
which of these three alternatives gives the best results.
We have also examined alternative choices of a similarity
coefficient, comparing the effects of using the familiar cosine measure and the Dice and Jaccard coefficients.
Background
King Abdulaziz City for Science and Technology,
KACST, was established in Saudi Arabia in 1977 as a
research and development institution. KACST is responsible for the formulation of national science and technology policies and for the coordination and promotion of
applied scientific research. It sponsors and supports research activities across a broad spectrum of scientific and
technological fields.
KACST also provides a wide variety of information
support services through the General Directorate of Information Systems, GDIS. Such services include access
to national and international databases, maintenance of
a specialized library and a national database, and operation of a computer network connecting the computers of
major research institutions in the Gulf States.
The national database holds over 70,000 bibliographic records covering a wide range of science and
technology. The collection includes: master’s and doctoral theses, technical reports, books, articles, measurements and standards, statistics, and proceedings of conferences and scientific seminars. The collection has an
online catalogue. This catalogue is divided into two databases: an Arabic database, which contains about 23,800
records, and a non-Arabic database. Sample Arabic and
English database records are shown in Figures 1 and 2,
respectively. Each record in the database is composed of
36 fields. The Arabic records are classified-that is, a
subject area for the document is given in the record.
However, due to the short supply of Arabic indexers and
abstracters, only a few document records contain abstracts or index terms.
Plan of Research
To achieve our goals, we built a microcomputer-based
Arabic Information Retrieval System, Micro-AIRS,
targeted for the IBM/PC and compatible microcomputers. The system was implemented using the Turbo C
compiler, Version 2.0. A few routines, however, were
coded in assembly language.
Processing the Arabic Language
Special characteristics of the Arabic language make it
difficult to deal with, especially when using a system designed for Roman characters (Tayli & Al-Salamah,
1990). Among these characteristics are the right to left
orientation, the fact that vowels may be included or
dropped, and the morphological structure.
The Arabic language belongs to the Semitic language
group. These languages have a common grammatical
system based on a root-and-pattern structure. Most Arabic words are morphologically derived from a short list
of productive roots. The root is the bare verb form; it can
be triliteral, quadriliteral, or pentaliteral. According to
Hegazi and Elsharkawi (1985) there are about 1200
roots.
A stem is a combination of a root and derivational
morphemes to which one or more affixes can be added.
A triliteral bare verb generates 14 verb forms, whereas a
quadriliteral bare verb generates three verb forms.
Arabic words are classified into three main categories:
nouns, verbs, and particles. All verbs and many nouns
are derived from root verbs. Some of the root letters may
be deleted or modified during morphological derivation.
Also a word may change its inflectional form when preceded by certain prefixes or prepositions or followed by
certain suffixes. Some nouns, known as “solid nouns,”
have no verb origins. Particles can be found in the form
of prefixes and/or suffixes attached to verbs or nouns.
Some particles can be found in isolated form. Particles
include preposition particles, negative particles, answer
particles, interrogative particles, conjunction particles,
and so forth. Affixes can be added to the beginning, the
end, and the middle of a word.
Affixes fall into four categories: particles, pronouns,
inflectional morphemes, and derivational morphemes. It
is very common to find a verb, subject, and object contained in a single word. Yahya (1989) counted 120
different forms of nouns resulting from adding affixes to
the basic naked noun, and 1440 different forms of verbs
resulting from adding affixes to the basic naked verb.
INTRNL CNTL NO   8903003 114
CATEGORY         COMPUTING AND CONTROL ENGINEERING
DOCUMENT TYPE    CONFERENCE PROCEEDING
TITLE            Computer virus prevention and containment on mainframes
AUTHORS          Dowry, Ghannam M. Al-
AFFILIATION      Industrial Security Planning and Support Services Department,
                 ARAMCO, Dhahran, SA
SOURCE TITLE     Proceedings of the 11th National Computer Conference, Dhahran,
                 March 4-7, 1989: Computers and Productivity
VOLUME           I
PAGINATION       48-60
NO. OF REFER     73
PUBLICATN DATE   1989/01/01
PUBLISHER INF.   King Fahd University of Petroleum and Minerals, Dhahran, SA
TEXT LANGUAGE    ENGLISH
ABSTRACT         The nature and anatomy of the computer virus is outlined. Basic
                 prevention, detection and correction techniques for reducing
                 the damages caused by viruses are presented. Vaccines or filters,
                 encryption, access control software, test to production control
                 procedures, personnel selection and review control and physical
                 access control methods are detailed with examples. The paper
                 presents measures to be adopted by the industry to make the
                 computer systems less inviting to attacks from viruses.
DESCRIPTORS      Computer software; Computer viruses; Computer security;
                 Mainframe computers; Data processing
STORAGE MEDIA    PAPER COPY
AVAILABILITY     KACST. Source
FIG. 2. Sample English database record.
For the purposes of our experiment we used a word-stem-root dictionary developed by hand for each index
term. Now that the system is being enhanced for actual
use at KACST we plan to add automatic morphological
analysis. Several morphological analysis algorithms have
been suggested and/or implemented. Hegazi and Elsharkawi (1985) describe a computer-aided morphological
hierarchy system for a vowelized Arabic text. They based
their work on both the morphological rules and the phonetic rules of the language. The main disadvantage of this
method is that the phonetic analysis requires a fully vowelized text, which rarely appears in today’s applications.
Gheith and Aboul-Ela (1989) present a computer-based
syntax analyzer which is based on a morphological analyzer that separates the linguistic model from the processing algorithm. In another study, Gheith and El-Sadany (1987) describe a morphological analyzer that can
detect the root and the morphological structure of a
given vowelized Arabic word with a triliteral root. Al-Fedaghi and Anzi (1989) present a simple but slow mathematical method to generate the root and the pattern of
a given Arabic word. Hilal (1985) gives a more comprehensive theoretical approach, while Thalouth and Al-Dannan (1987) give a more practical approach to the
analysis of an unvowelized Arabic text. The principal
phase in all these algorithms is the isolation of any
suffixes and/or prefixes from the word before proceeding
to deeper analyses.
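The affix-isolation phase mentioned above can be illustrated with a minimal sketch. It is not any of the cited algorithms: the prefix and suffix lists, the function name `strip_affixes`, and the three-letter minimum are all assumptions chosen for illustration, and a real analyzer would need much richer rules to avoid stripping letters that belong to the root.

```python
# Hypothetical affix lists, for illustration only (longest prefixes first).
PREFIXES = ["وال", "بال", "ال", "و"]
SUFFIXES = ["ات", "ون", "ين", "ة"]


def strip_affixes(word, min_len=3):
    """Remove at most one listed prefix and one listed suffix.

    An affix is removed only if at least `min_len` letters remain,
    so that a triliteral root can survive the stripping.
    """
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word
```

With these assumed lists, the definite plural المكتبات loses both the article and the plural ending, while a short word such as ولد is left untouched because stripping its first letter would leave fewer than three letters.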
Representation of the Arabic Language
The representation of the Arabic language has been a
major concern for the designers of Arabic systems. The
representation involves the internal representation of the
stored data as well as the external representation, which
is used in displaying text on the screen or the printer.
The General Assembly of the Arabic Standardization
and Metrology Organization has approved many standards for Arabic text representation. The seven-bit coded
Arabic character set for information interchange,
ASMO-449, was adopted in October 1982, to represent
Arabic characters along with some graphical and control
characters. Although this code was intended for pure Arabic language applications only, some applications use it
to handle bilingual text by using some special characters
to indicate that the text is changing from one language
to another. For bilingual applications the organization
adopted an eight-bit coded Arabic-Roman character set
for information interchange known as ASMO-708.
Both ASMO-449 and ASMO-708 include 32 Arabic
alphabetic characters. The set of displayable Arabic
shapes, however, is much larger than the set of coded
Arabic characters. This is because an Arabic character
changes its shape depending on whether it is at the beginning, the middle, the end of the word, or isolated. The
majority of the Arabic characters have two distinguishable
shapes, and a few characters have one, three, or four distinguishable shapes. To determine the correct shape of a
given Arabic character a contextual analysis algorithm is
needed. Previous work has provided a fast and efficient
algorithm (Al-Kharashi, 1989; 1990b) which was implemented in Micro-AIRS as a basic function used by the
I/O interface.
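The idea of contextual analysis can be sketched as follows. This is not the Al-Kharashi (1989; 1990b) algorithm, whose details are not given here; it is a simplified illustration that works on Unicode code points rather than ASMO codes, and the set of letters that do not connect to the following letter is an assumed, simplified list.

```python
HAMZA = "\u0621"
# Letters assumed not to connect to the FOLLOWING letter (simplified list):
# hamza, alef variants, waw with hamza, teh marbuta, dal, thal, reh, zain,
# waw, alef maksura.
NON_CONNECTING = set(
    "\u0621\u0622\u0623\u0624\u0625\u0627\u0629"
    "\u062F\u0630\u0631\u0632\u0648\u0649"
)


def is_arabic_letter(ch):
    """True for a single basic Arabic letter (tatweel excluded)."""
    return len(ch) == 1 and "\u0621" <= ch <= "\u064A" and ch != "\u0640"


def contextual_forms(word):
    """Return the positional form of each letter in an unvowelized word."""
    forms = []
    for i, ch in enumerate(word):
        prev = word[i - 1] if i > 0 else ""
        nxt = word[i + 1] if i + 1 < len(word) else ""
        # A letter joins its predecessor only if that predecessor connects forward.
        joins_prev = (is_arabic_letter(prev) and prev not in NON_CONNECTING
                      and ch != HAMZA)
        # A letter joins its successor only if it connects forward itself.
        joins_next = (is_arabic_letter(nxt) and nxt != HAMZA
                      and ch not in NON_CONNECTING)
        if joins_prev and joins_next:
            forms.append("medial")
        elif joins_prev:
            forms.append("final")
        elif joins_next:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms
```

An I/O layer could then use the returned form, together with the character code, as an index into a font table to pick the glyph to display.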
Displaying the Arabic Text
There are two available approaches to displaying Arabic shapes on the PC. The first approach uses the graphic
video mode, while the other uses the alphanumeric video
mode. Using the graphic screen allows the display of an
unlimited number of fonts with flexible sizes and the
vowels at the correct positions. Unfortunately, using a
graphic screen will slow down the system as its complexity increases. Also as the font size increases, the amount
of displayable information decreases.
One of the easiest ways to speed up the I/O routines
and make the screen hold more information is by using
the alphanumeric video mode. On the original MDA
and CGA display adapters, the only fonts that could be
displayed in the alphanumeric video mode were those
defined in a table located in ROM on the adapter. To
display different fonts, the ROM must be replaced with
one that holds the new font definitions. Recent adapters,
such as EGAs and VGAs, all have alphanumeric character generators that use character definition tables located
in predesignated areas of RAM. This table can be accessed and modified by means of software.
For Micro-AIRS, a small number of I/O routines have
been designed and implemented to allow it to accept and
display an Arabic/English text. A previous system (Al-Kharashi, 1989), which uses a graphic screen to display
vowelized Arabic text, has been modified to display both
Arabic and English text in the same screen line. The new
system uses the text screen instead of the graphic screen
to display text. To achieve this, a whole new font table
was created by the first author. The font shapes that represent the English ASCII characters are kept without any
change. The last 128 font shapes, which represent some
graphical and foreign shapes, have been replaced by the
shapes of Arabic characters and vowels. The Micro-AIRS fonts are shown in Figure 3. A brief glimpse of this
interface and the basic input/output routines will be provided during the discussion of the Micro-AIRS system
structure below.
FIG. 3. The Micro-AIRS fonts.
The Structure of the Micro-AIRS System
Basically, Micro-AIRS consists of three main conceptual components: namely, a User Interface, a Command
Processor, and a Database Handler. The description of
each component of the system follows.
User Interface
The real effectiveness of a computer system is measured by its usability by people other than computer professionals. This leads to the need for an effective human-computer interface. Menu-driven systems are one of the
most successful and widely used system design techniques. The advantages of using menu-driven systems
have been discussed by Shneiderman (1987) and by Galambos et al. (1985). They include: reducing the training
and memorizing effort, simplifying entry of choices,
structuring the user’s task, and allowing the user to become acquainted with the range of possibilities that the
system offers. Micro-AIRS adapted the menu system
that is used by Borland’s interactive compiler products
such as Turbo C, Version 2.0 and Turbo Pascal, Version
5.0 (Borland, 1988b; 1988c). Thus, our interface is composed of two components, a permanent menu and pull-down menus. The permanent menu displays the options
of the main menu. Left and right arrow keys are used to
move through the items, causing them to be highlighted
one at a time. The highlighted item can be selected by
pressing the (ENTER) key. A pull-down menu, on the
other hand, is displayed when an item from the permanent menu, or an item from the current pull-down menu,
is selected. Pull-down menu items are listed vertically,
and the user can move through the items by using the up
and down arrow keys. For large lists of items in one
menu, more elaborate scrolling capabilities are provided.
FIG. 4. (a) Arabic Micro-AIRS user interface. (b) Equivalent English
interface.
The screen in the Micro-AIRS user interface is divided
into three areas as shown in Figure 4 and described as
follows:
• System response area: This area is used by the system
to display its response to a user command such as DISPLAY, SEARCH, or SORT.
• System status area: The bottom line of the screen
shows the system status (running, waiting, or idle), error messages,
names of active databases, and so forth.
• System menu area: The top line of the screen lists all
available submenus/commands in the system. An individual entry can be activated by highlighting it using
left/right arrow keys and then pressing the (ENTER)
key. There are eight basic items in the main menu.
When activated, each item in the main menu will pull
up another menu. An item in the second level menu,
in turn, could trigger another menu or select a basic
item.
Command Processor
This module accepts a user request, validates it, and
then processes it. Since all user commands are entered
through a menu-driven system, a great deal of this module is devoted to interpreting the SEARCH and DISPLAY commands. Commands are categorized into eight groups, namely FILE, EDIT,
SEARCH, DISPLAY, SORT, PRINT, UTILITIES, and
HELP. Each group is represented by an entry in the system menu area. When a group is selected, it displays
a list of related commands, as shown in Figure 4.
The DISPLAY command allows the user to access the
text of a database record directly, or to choose to display
a document from a previously retrieved set. The user
then can display the next or the previous document, the
last or the first document, or jump backward or forward
a given distance from the current document.
Micro-AIRS allows the user to search the database using three retrieval methods, using words, stems, or roots,
one at a time. To switch from one retrieval method to
another the user should use the SEARCH/RETRIEVAL-METHOD command to select the desired method.
The system then will close the current keyword and posting files associated with the current method and open the
files for the selected method. The system makes available
a full set of Boolean and distance operators. The ADJ
operator specifies that two words must appear next to
each other in the document and in the proper word order. The FLD operator specifies that two words must appear in the same field. Another option allows the user to
specify that two index terms must be separated by exactly
n words. The truncation symbol “:” can be
suffixed to a query term to widen the search results. The
truncation symbol indicates that the term is to be
searched as a fragment of a larger term rather than as a
complete term. The truncation option is limited to the word-retrieval method only, since the stem and root retrieval
methods broaden the search more effectively than truncation. Parentheses can be used to construct long and complex
queries.
A query can be submitted interactively using the appropriate chain of menus, or by giving the name of a pre-edited query file, which can contain one or more queries.
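The adjacency, distance, and truncation options described above can be sketched as simple predicates over a tokenized field. The function names are hypothetical and this is not the Micro-AIRS code. The sketch also assumes that the distance option respects word order, which the description leaves open.

```python
def adj(doc_tokens, first, second):
    """ADJ: `second` must immediately follow `first`."""
    return any(a == first and b == second
               for a, b in zip(doc_tokens, doc_tokens[1:]))


def separated_by(doc_tokens, first, second, n):
    """Distance option: exactly `n` words lie between `first` and `second`
    (assumed here to be order-sensitive)."""
    gap = n + 1
    return any(a == first and b == second
               for a, b in zip(doc_tokens, doc_tokens[gap:]))


def term_matches(query_term, token):
    """A trailing ':' marks truncation: the query term matches any token
    that begins with the fragment before the colon."""
    if query_term.endswith(":"):
        return token.startswith(query_term[:-1])
    return token == query_term
```

For a title tokenized as `["arabic", "information", "retrieval", "system"]`, `adj(..., "information", "retrieval")` holds, while the reversed order does not.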
File Handler
This module is responsible for accessing and updating
the data file. The three most basic operations are creating
a new database, building a searchable database out of
pure text files, and searching through a database.
Each Micro-AIRS database consists of five basic files:
the database definition file, the data and the index files,
and the keyword and posting files (Al-Kharashi, 1990a).
Descriptions of each file are provided in the next section
along with a discussion of the indexing process.
The Micro-AIRS Indexing Process
Indexing Strategies
The indexing and the data organization are the two
major factors that influence the effectiveness and the
efficiency of an information retrieval system. The indexing process deals with the selection of appropriate terms
capable of representing the content of a given bibliographic record. Experimental information retrieval systems use different indexing methods. Frequency-based
indexing methods measure the importance of a given
term by its frequency in individual documents as well as
its frequency in the whole collection. Frequency factors
can also be used during the retrieval process as term
weights to enhance the precision of the system, that is, to
present the user with the records that match the query
in decreasing order of closeness. Binary weighting schemes,
however, can be used instead. In this case all indexable
terms are assigned the same weight. Although Micro-AIRS stores the frequency of occurrence of every valid
indexable term in the collection, these values were not
used as a measure for selecting significant terms.
There are two reasons for not using the frequency of
the word as a valid measurement tool during the indexing and the retrieval of the data. The first reason is related
to Luhn’s (1958) observation. Since Luhn’s law has not
been verified with an Arabic text, it is not realistic to use
it as a solid base for indexing Arabic data. The second
reason involves the type of collection that was used to
test the system. Salton (1971) concluded that the
effectiveness of the content analysis depends on the
length of the textual data available. Content analysis
works better with larger textual data. In our collection,
every record contains a short title with no abstract except
for very few records. It is rare to find a word occurring
more than once or twice in the same document. Hence,
a frequency-based measurement has no significance for
our data.
Data Description
As was described earlier, the Arabic collection contains about 23,800 records covering a wide range of science and technology fields. This data was originally contained in a single sequential file that occupies about
seven million bytes of disk space. Each record in the data
file is represented as a sequential list of bibliographic
fields (e.g., title, author, journal title, abstract, and so
forth). The text of each record is terminated with an end-of-record mark. Every field starts with a three-character
field identifier followed by a space and then the text for
that field. The field text is terminated with an end-of-line
mark (i.e., carriage return and line feed characters).
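A minimal sketch of reading this record layout follows. The end-of-record mark is not specified in the text, so the `$$$` line used here is purely hypothetical, as are the field identifiers `TIT` and `AUT` in the usage note.

```python
END_OF_RECORD = "$$$"  # hypothetical marker; the actual mark is not given


def parse_records(data):
    """Split a sequential data file into records of {field_id: text}.

    Each field line is a three-character identifier, one space, and the
    field text, terminated by carriage return + line feed.
    """
    records, current = [], {}
    for line in data.split("\r\n"):
        if line == END_OF_RECORD:
            if current:
                records.append(current)
                current = {}
            continue
        if len(line) < 4 or line[3] != " ":
            continue  # skip blank or malformed lines
        # A repeated identifier overwrites the earlier value in this sketch.
        current[line[:3]] = line[4:]
    return records
```

For example, `parse_records("TIT Computer systems\r\nAUT Someone\r\n$$$\r\n")` yields one record with two fields.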
The evaluation of the system requires some initial
manual tasks, mainly relevance judgments. This task
needs experts in the subject area that the system covers. Because the collection has wide coverage, we needed
to choose a subset that covers a specific area where we
would find help in performing the manual tasks. A single
record from the original data is contained in one or more
sets. The computer and information science set, with 355
records, was found to be the most suitable set for testing
and evaluating the system. With this set it is easy to find
people who are able to create queries and perform the
relevance judgment.
The text of the selected set contained a few typing and
spelling mistakes. To reduce the effect of these mistakes
on the evaluation process, they were corrected before the
final indexing process was carried out. The majority of
these mistakes were easily detected after all keywords
from the record texts were extracted and sorted. The
VATE editor (Al-Kharashi, 1989) then was used for the
simple editing and correction processes.
Database Definition Table
The database definition table controls the behavior of
the system during editing, indexing, and retrieval. Every
field in the database has an entry in this file and holds
the following information: the full name of the field, the
abbreviated name of the field, and the field attributes.
The process of selecting index terms goes through
many phases, starting with plain text and ending with a
list of useful, accessible index terms. The indexing process accepts plain text defined by the index and record
file, and extracts all words from every indexable field in
all database records. In the database definition table we
define the category, the title, and the abstract fields as
indexable and searchable fields. The extraction of a word
from a bilingual text is certainly more difficult than the
extraction of a word from a unilingual text. With the use
of the character attribute file, the word extraction process
was able to distinguish between numeric, control, English, and Arabic data. The length of any extracted term
is limited to 25 bytes. If the original term exceeds the 25
byte limit, the term is truncated at the 25th
byte and the remainder of the term is skipped.
After the extraction of all index terms from the record
file, the indexed keyword list is then sorted. A general purpose sorting program such as the DOS sort utility or other
commercial sorting programs will not work with our data.
This is because these utilities are usually intended for textual data, and the indexed keyword list contains binary
coded values in the document number entries.
TABLE 1. The list of queries.
Number  English meaning
1       Computer systems
2       Computer and languages
3       Arabic programming languages
4       Computer and architecture design
5       Natural language processing
6       Computer and drawing
7       Computer and language learning
8       Computer and industries and industrial information
9       Computers in military field
10      Computer Arabization
11      Parallel programming
12      Morphological analysis
13      Computers and (indexing or classification or documentation)
14      Computers in Saudi Arabia
15      Computers and (Quran or Hadeeth)
16      Computers and children
17      Computers and phonetics
18      Computers and agriculture
19      Computer networks and communication
20      Computer and design
21      Computers in education
22      Arabic terminology
23      Thesauri and information retrieval
24      Computers and information security
25      Computers in libraries
26      Machine translation
27      Computers and managements
28      Personal computers
29      Terminology databanks
[The Arabic query column is not reproduced here.]
IBM PCs and compatibles are built on the Intel family
of microprocessors. The Intel CPUs store an integer value,
which occupies two bytes, with the least significant byte
at the lower location and the most significant byte at the
higher location. This structure causes a general purpose
sorting program intended for textual data to mis-sort any
data with numerical data coded in binary format. To
overcome this problem, a general purpose sorting program, MERGESORT.C,
was developed. MERGESORT
uses the Turbo C built-in quick sort function,
QSORT (Borland, 1988a), for internal sorting.
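The mis-sorting of binary document numbers can be reproduced with a short sketch. It uses Python's standard `struct` module to lay out two-byte integers the way an Intel CPU stores them (least significant byte first); the sample document numbers are arbitrary. It only demonstrates the problem and is not a reconstruction of MERGESORT.C.

```python
import struct

doc_numbers = [1, 2, 255, 256, 300]

# Two-byte unsigned integers, Intel (little-endian) and big-endian layouts.
little = [struct.pack("<H", n) for n in doc_numbers]
big = [struct.pack(">H", n) for n in doc_numbers]


def decode(fmt, items):
    """Unpack a list of two-byte strings back into integers."""
    return [struct.unpack(fmt, b)[0] for b in items]


# A text-oriented sort compares byte by byte from the lowest address.
sorted_as_text_little = decode("<H", sorted(little))
sorted_as_text_big = decode(">H", sorted(big))

# Sorting on the decoded value gives the intended order on either layout.
sorted_by_value = sorted(doc_numbers)
```

Here `sorted_as_text_little` comes out as `[256, 1, 2, 300, 255]`, because the low-order byte is compared first, whereas the big-endian layout and the value-based sort both give `[1, 2, 255, 256, 300]`.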
TABLE 2. The binary similarity coefficients.
Cosine:   $\dfrac{|Q \cap D|}{\sqrt{|Q|} \cdot \sqrt{|D|}}$
Dice:     $\dfrac{2\,|Q \cap D|}{|Q| + |D|}$
Jaccard:  $\dfrac{|Q \cap D|}{|Q| + |D| - |Q \cap D|}$
$|D|$, number of terms in the document text; $|Q|$, number of terms
in the query text; $|Q \cap D|$, number of terms in both document and
query.
Creation of the Word-Stem-Root Dictionary
A comprehensive Arabic dictionary is not available
in machine-readable form. So we created a small word-stem-root dictionary, using the words in the collection.
The dictionary is used during the indexing and the retrieval process to identify the stem or the root of a given
word and also to identify the stop words. For every keyword
a corresponding stem and root structure was created.
From 355 bibliographic records we obtained 1,126
words, 725 stems, and 526 roots.
Evaluation of the Micro-AIRS System
The major purpose of this work is to study the effect
of using words, stems, or roots as index terms on the performance of an Arabic information-retrieval system.
Relevance Judgments
Information-retrieval systems are usually evaluated in
terms of two measures, recall and precision. Recall is defined as the proportion of the documents in the collection relevant to the query that are actually retrieved.
Precision is defined as the proportion of the documents
retrieved that are actually relevant. Perfect recall (a value
of 1.0) occurs when the system finds all the items in the
collection that are relevant to the query. Perfect
precision (also a value of 1.0) occurs when all the documents retrieved are relevant.
Both measures depend on knowing what documents
are relevant to each query, so the first step in evaluation
is making relevance judgments. For large collections,
sampling techniques are used (Salton, 1975), but our collection was small enough that we could carry out this
task manually.
We asked graduate students in computer science, who
were also native speakers of Arabic, to make up 60 queries that they might themselves use in their own research.
Ten queries were removed because they were essentially
duplicates of other queries, asking for the same information. The 355 database records were divided into three
sets. Each set was handed to one of the students along
with a computer-based relevance judgment support system designed and implemented by the first author. This
system allowed the judge to browse through the records
of the set and the list of queries at the same time in two
different windows on the same screen. If the user judged
that the displayed record was relevant to the displayed
query, he simply marked a box on the screen. After the
completion of the relevance judgment task, the judgments for the three sets were grouped together in a relevance judgment matrix.
Out of the 50 queries, only the 29 queries shown in
Table 1 were found to have one or more relevant documents in the collection. Thus, only these 29 queries were
used for the system evaluation.
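The recall and precision definitions given under Relevance Judgments can be sketched as a small function over a retrieved set and a set of documents judged relevant. The function name and the document numbers in the usage note are hypothetical; they are not taken from the Micro-AIRS relevance judgment matrix.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one query.

    recall    = relevant documents retrieved / all relevant documents
    precision = relevant documents retrieved / all retrieved documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision
```

For instance, if six documents are retrieved and three of the four relevant documents are among them, `recall_precision([1, 2, 3, 4, 5, 6], {2, 4, 6, 9})` gives a recall of 0.75 and a precision of 0.5.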
Similarity Measurements
Similarity coefficients have several important applications in an information-retrieval system (Salton, 1989).
Their most important function is in ranking retrieved documents in order to present to the user first the documents
most relevant to the query. There are three common normalized similarity coefficients (each with two versions, one
for binary and one for weighted terms): the cosine, Dice,
and Jaccard coefficients (van Rijsbergen, 1979; Salton,
1989). We decided to try all three binary coefficient measurement methods and select the one that performs the
document ranking best. Table 2 shows the formulas for the
cosine, Dice, and Jaccard binary coefficients.
The calculations of the similarity measurements between a query and a document require information
about the number of terms in the document text and in
the query text and the number of terms that appear
jointly in the query and the document text. When the root
or the stem is used instead of the word, it is more likely
that several words collapse into one common stem or
root. Hence, the number of unique terms is reduced and
the similarity coefficient is increased.
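The calculation described above can be sketched in Python as follows. The formulas are the usual binary (set-based) forms of the cosine, Dice, and Jaccard coefficients and are assumed here to agree with Table 2; the function names and the sample terms are hypothetical.

```python
import math


def binary_similarities(query_terms, doc_terms):
    """Return (cosine, dice, jaccard) for two collections of index terms."""
    q, d = set(query_terms), set(doc_terms)
    common = len(q & d)                      # terms shared by query and document
    if common == 0:
        return 0.0, 0.0, 0.0
    cosine = common / math.sqrt(len(q) * len(d))
    dice = 2 * common / (len(q) + len(d))
    jaccard = common / len(q | d)
    return cosine, dice, jaccard


def rank(query_terms, docs, which):
    """Order document names by one coefficient (0=cosine, 1=Dice, 2=Jaccard)."""
    return sorted(docs,
                  key=lambda name: binary_similarities(query_terms, docs[name])[which],
                  reverse=True)


# Hypothetical query and documents, each reduced to a set of index terms.
query = {"a", "b", "c"}
docs = {"d1": {"a", "b"}, "d2": {"a", "x", "y"}, "d3": {"a", "b", "c", "x"}}
rankings = [rank(query, docs, i) for i in range(3)]
```

For this small example all three coefficients order the documents identically (d3, d1, d2). That illustrates, but does not prove, the kind of agreement reported in the experiments.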
In determining the order in which documents should
be presented to the user, however, the actual value of the
coefficient does not matter; only the relative values
make a difference. The ranking process showed that
all three binary similarity coefficients produced exactly
the same rankings for all queries.
The actual values are shown for all three similarity
methods combined with all three retrieval methods for
three example queries in Tables 3, 4, and 5. The results
obtained for all the other queries were the same. Clearly,
it does not matter which coefficient is used, and we are free
to use whichever one is the cheapest to compute on the
target hardware.

TABLE 3. Ranking of the result of query number 20 using roots.
                            Similarity coefficient values
Document number   Rank      Cosine    Dice      Jaccard
19                1         0.5774    0.5000    0.3334
18                2         0.4715    0.3637    0.2223
216               3         0.4715    0.3637    0.2223
281               4         0.4715    0.3637    0.2223
325               5         0.4083    0.2858    0.1667
212               6         0.2133    0.0870    0.0455
Relevance indicator: three of the six documents are marked relevant (*); the rows to which the asterisks belong cannot be recovered from this copy.

TABLE 4. Ranking of the result of query number 22 using words.
                                                  Similarity coefficient values
Document number   Relevance indicator   Rank      Cosine    Dice      Jaccard
279               *                     1         0.5346    0.4445    0.2858
263               *                     2         0.4473    0.3334    0.2000
254               *                     3         0.4265    0.3077    0.1819
253               *                     4         0.3780    0.2500    0.1429
273               *                     5         0.3652    0.2353    0.1334

TABLE 6. The results of the processing of the 29 queries.
Word
Results of Processing the Queries
The processing of our 29 queries on the Micro-AIRS
system using words, stems, and roots produces the results
shown in Table 6. The system performance using each of
the three indexing methods can be categorized into six
groups as follows:
(1) The system failed to retrieve any document with the
use of any retrieval method in response to query 4.
(2) The three methods perform equally in response to
queries 9, 11, and 16.
(3) The word-retrieval method performs as well as or
better than the other methods at most recall levels in
queries 13, 20, 25, 26, and 29. It is not able, however,
to retrieve any of the relevant documents for the following queries: 2, 4, 5, 7, 12, 15, 17, 18, and 23.
(4) The stem-retrieval method outperforms the other
methods in response to some queries, including 1
and 7.
(5) The root-retrieval method outperforms the other
Query
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
Stem
Root
Rc. Ret. Rel. Fall Ret. Rel. Fall
15
5
10
5
0
4
0
1
0
1
0
5
3
3
0
0
2
1
1
0
0
11
8
8
5
13
3
3
3
3
1
I
1
0
1
0
9
15
2
5
5
10
9
0
0
I
1
1
4
4
5
4
9
26
2
7
7
0
0
2
2
3
5
0
4
3
0
0
2
2
3
5
0
4
3
1
14
II
1
4
1
I
2
1
1
I
1
1
0
0
0
0
0
0
0
0
0
0
0
0
I
0
0
0
0
0
0
0
0
0
0
0
0
15
10
85
5
0
4
3
7
4
5
0
2
3
3
11
3
9
I
0
6
17
1
1
1
1
2
0
3
6
2
0
3
3
I1
6
5
5
5
81
0
0
2
0
4
22
85
7
0
4
3
25
10
12
4
7
0
2
3
3
81
0
0
2
0
22
10
1
13
11
2
3
5
1
0
5
13
0
4
3
47
3
0
0
0
1
1
1
6
11
1
1
4
0
0
0
0
0
3
5
0
0
0
I
1
6
0
17
2
1
3
3
6
6
52
14
2
7
7
3
12
9
12
36
0
0
1
5
1
3
1
2
1
1
4
4
2
4
3
3
0
10
8
1
2
Ret. Rel. Fall
1
4
8
1
5
2
0
1
0
3
2
5
3
7
14
2
4
4
0
3
45
0
0
3
3
1
2
6
9
6
0
1
1
Rc., number of relevant records in the collection; Ret., number of
retrieved records using the method; Rel., number of relevant records
actually retrieved; Fall, number of irrelevant records actually retrieved.
methods in most of the queries, most strikingly in 3,
8, 12, 15, 17, 18, 19, 22, 23, 27, and 28.
(6) The stem- and the root-retrieval methods perform
equally in response to queries 2, 5, and 6.
TABLE 5. Ranking of the result of query number 21 using stems.
          Similarity coefficient values
Rank      Cosine    Dice      Jaccard
1         0.5774    0.5774    0.4000
2         0.5774    0.5000    0.3334
3         0.4365    0.4000    0.2500
4         0.4365    0.4000    0.2500
5         0.4365    0.4000    0.2500
6         0.3850    0.3334    0.2000
7         0.3652    0.3077    0.1819
8         0.3652    0.3077    0.1819
9         0.3652    0.3077    0.1819
10        0.3334    0.2667    0.1539
11        0.2218    0.0938    0.0492
Document numbers in the table: 328, 73, 230, 52, 13, 12, 276, 255, 11, 120, and 224. Six of the eleven documents are marked relevant (*). The alignment of document numbers and relevance indicators with ranks cannot be recovered from this copy.

Table 7 and Figure 5 show the differences in average
retrieval values. It is clear from the table and the figure
that the root-retrieval method was able to retrieve more
documents per query than the other two retrieval methods. The drawback of the root-retrieval method is
the number of irrelevant documents retrieved
along with the relevant documents. The stem-retrieval
method retrieved fewer irrelevant documents yet a reasonable number of relevant documents. Finally, the
word-retrieval method retrieved the fewest documents, both relevant and irrelevant.

TABLE 7. Average retrieval of the 29 queries.
Method    Retrieved   Relevant   Irrelevant
Word      2.24        2.03       0.21
Stem      7.79        3.69       4.10
Root      12.55       4.72       7.83

FIG. 5. Average retrieval of the 29 queries. [Chart comparing the word, stem, and root methods.]

From the previous detailed and averaged retrieval
data, we cannot draw a precise conclusion about the
effectiveness of the system. In the following sections we
present and discuss the standard evaluation of the
information-retrieval system based on the recall and
precision measures.
Recall-Precision Measurements
The recall and precision values produced for a given
query reveal the behavior of the system only under that
query and for those calculated recall and precision values. Notice that the rough recall-precision values could
contain many precision values for a single recall value.
Furthermore, some precision values may not be defined
at certain recall levels. Different smoothing algorithms
are in common use for precision averaging (Keen, 1972).
In Micro-AIRS we used the smoothing algorithm summarized below.
(1) Divide the recall values into 10 levels:
0.0 <= r0.1 < 0.1,
0.1 <= r0.2 < 0.2,
...,
0.9 <= r1.0 <= 1.0.
(2) Assign the largest precision value of a level to that
level.
(3) Assign the largest precision value found in the table
to the first level.
(4) Starting from the tenth region we start removing all
sawtooth lines by assigning the current level
precision value to the next level if its precision value
is lower than the current one.
(5) To assure that the precision will drop gradually from
a certain precision value to a zero value, we assign
any level with a zero precision to half of the precision
value of the previous level.
Table 8 and Figure 6 show the averaged recall-precision values with the zero-smoothing process. The
summaries provided by the average recall-precision values suggest that the root-retrieval method outperforms
both the word- and the stem-retrieval methods. They
also suggest that the stem-retrieval method outperforms
the word-retrieval method.

FIG. 6. Average recall-precision graph after zero-smoothing process. [Curves for the word, stem, and root methods; horizontal axis: recall, 0.10 to 1.00.]
TABLE 8. Average recall-precision table with zero-smoothing process.
          Precision
Recall    Word      Stem      Root
0.10      0.7143    0.8739    0.9308
0.20      0.6357    0.8356    0.8998
0.30      0.5643    0.7772    0.8541
0.40      0.4946    0.7510    0.8442
0.50      0.4241    0.6952    0.8085
0.60      0.3353    0.5721    0.6946
0.70      0.2569    0.4457    0.5860
0.80      0.1999    0.3545    0.5047
0.90      0.1714    0.2935    0.4290
1.00      0.1571    0.2467    0.3912
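The five smoothing steps listed in this section can be sketched in Python as below. This is one reading of the steps, not the Micro-AIRS code (which is written in Turbo C): step (4) is interpreted as a pass from the tenth level down to the first that raises a lower level whenever its precision is below that of the level after it, and step (5) is interpreted as halving the preceding level's value wherever a level is still zero. The function name and the sample points are assumptions.

```python
def smooth_precision(points):
    """Return 10 smoothed precision values, one per recall level.

    points: iterable of (recall, precision) pairs observed for one query.
    """
    levels = [0.0] * 10
    # Steps (1)-(2): bucket by recall level and keep the largest precision.
    for recall, precision in points:
        k = min(int(recall * 10), 9)         # recall 1.0 falls in the tenth level
        levels[k] = max(levels[k], precision)
    # Step (3): the first level receives the largest precision found.
    levels[0] = max(levels)
    # Step (4): walk from the tenth level down, removing sawtooth dips.
    for k in range(9, 0, -1):
        if levels[k - 1] < levels[k]:
            levels[k - 1] = levels[k]
    # Step (5): a level still at zero gets half of the previous level's value.
    for k in range(1, 10):
        if levels[k] == 0.0:
            levels[k] = levels[k - 1] / 2
    return levels


# Hypothetical raw recall-precision points for a single query.
smoothed = smooth_precision([(0.25, 0.5), (0.5, 1.0), (0.75, 0.6)])
```

With these sample points the first six levels become 1.0, the next two 0.6, and the last two decay to 0.3 and 0.15. Averaging such per-query vectors over all queries would give a table of the same shape as Table 8.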
Statistical Analysis
To draw accurate conclusions about the effectiveness
of the system using word-, stem-, and root-retrieval
methods, and to determine the significance of the results
shown in Table 8 and Figure 6, we use two nonparametric statistical tests: the sign test and the Wilcoxon signed-rank test. In this analysis we compared each pair of retrieval methods separately. Thus, essentially, we looked at
the results of three experiments: the word-stem experiment, the word-root experiment, and the stem-root experiment.
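As a sketch of how the normal deviates (Z) for these two tests can be computed, the following Python functions use the standard large-sample approximations without continuity correction. That the authors used exactly these formulas is an assumption, although the formulas do reproduce the Z values printed in the first rows of Tables 9 and 10. The function names are hypothetical.

```python
import math


def sign_test_z(favoring_a, favoring_b):
    """Normal deviate for the sign test; ties are ignored.

    Positive values favor method B.
    """
    n = favoring_a + favoring_b
    return (favoring_b - favoring_a) / math.sqrt(n)


def wilcoxon_z(rank_sum_b, ndf):
    """Normal deviate for the Wilcoxon signed-rank test.

    rank_sum_b: sum of ranks of the differences favoring method B
    ndf:        number of nonzero differences
    """
    mean = ndf * (ndf + 1) / 4
    sd = math.sqrt(ndf * (ndf + 1) * (2 * ndf + 1) / 24)
    return (rank_sum_b - mean) / sd


def one_sided_p(z):
    """Upper-tail probability of the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# Recall level .10, word vs. stem: 2 queries favor word, 6 favor stem.
z_sign = sign_test_z(2, 6)       # about 1.4142
# Recall level .10, word vs. stem: rank sum 32 favoring stem, 8 nonzero differences.
z_wilcoxon = wilcoxon_z(32, 8)   # about 1.9604
```

Probabilities obtained from `one_sided_p` may differ slightly in the last digits from the tabulated ones, which appear to have been read from a printed normal table.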
The null hypothesis and the alternative hypothesis
used for the word-stem experiment are:
H0: The word-retrieval and the stem-retrieval methods
give the same results.
H1: The stem-retrieval method is better than the word-retrieval method.
The null hypothesis and the alternative hypothesis
used for the word-root experiment are:
H2: The word-retrieval and the root-retrieval methods
give the same results.
H3: The root-retrieval method is better than the word-retrieval method.
The null hypothesis and the alternative hypothesis
used for the stem-root experiment are:
H4: The stem-retrieval and the root-retrieval methods
give the same results.
H5: The root-retrieval method is better than the stem-retrieval method.
The test results are shown in Tables 9-14. The statistical results support H1 and H3; that is, they confirm the
superiority of root- and stem-retrieval methods over the
word-retrieval method with alpha = .03 using the Wilcoxon signed-rank test.
When we compare the stem- and the root-retrieval
methods, the results are not so clear. The one-sided
probability values at the lower recall levels (up to .5) of
Tables 13 and 14 comparing stem- and root-retrieval
methods show that the root-retrieval method does not
perform significantly better than the stem method. At
higher recall levels, however, the root-retrieval method
performs better than the stem-retrieval method.

TABLE 9. Sign test for word vs. stem.
Recall     Favoring word   Favoring stem   Tied   Norm dev. Z   One-sided probability
.10        2               6               21     1.4142        .0793
.20        3               9               17     1.7321        .0418
.30        5               11              13     1.5000        .0668
.40        5               12              12     1.6977        .0446
.50        5               13              11     1.8856        .0294
.60        4               14              11     2.3570        .0091
.70        4               15              10     2.5236        .0059
.80        3               16              10     2.9824        .0014
.90        3               16              10     2.9824        .0014
1.00       3               16              10     2.9824        .0014
Combined   37              128             125    7.0843        .0000

TABLE 10. Wilcoxon signed-rank test for word vs. stem.
Recall     Favoring word   Favoring stem   NDF     Norm dev. Z   One-sided probability
.10        4.00            32.00           8.00    1.9604        .0250
.20        8.00            70.00           12.00   2.4318        .0075
.30        21.00           115.00          16.00   2.4303        .0075
.40        21.00           132.00          17.00   2.6273        .0043
.50        23.00           148.00          18.00   2.7219        .0033
.60        14.00           157.00          18.00   3.1139        .0009
.70        15.00           175.00          19.00   3.2194        .0006
.80        8.00            182.00          19.00   3.5011        .0002
.90        8.00            182.00          19.00   3.5011        .0002
1.00       8.00            182.00          19.00   3.5011        .0002

TABLE 11. Sign test for word vs. root.
Recall     Favoring word   Favoring root   Tied   Norm dev. Z   One-sided probability
.10        2               8               19     1.8974        .0287
.20        4               12              13     2.0000        .0228
.30        5               15              9      2.2361        .0073
.40        5               16              8      2.4004        .0082
.50        3               18              8      3.2733        .0005
.60        2               19              8      3.7097        .0000
.70        2               20              7      3.8376        .0000
.80        1               21              7      4.2640        .0000
.90        1               21              7      4.2640        .0000
1.00       1               21              7      4.2640        .0000
Combined   26              171             93     10.3308       .0000

TABLE 12. Wilcoxon signed-rank test for word vs. root.
Recall     Favoring word   Favoring root   NDF     Norm dev. Z   One-sided probability
.10        5.00            50.00           10.00   2.2934        .0110
.20        15.00           121.00          16.00   2.7406        .0031
.30        29.50           180.50          20.00   2.8186        .0020
.40        26.00           205.00          21.00   3.1108        .0009
.50        16.50           214.50          21.00   3.4410        .0003
.60        9.00            222.00          21.00   3.7017        .0000
.70        9.00            244.00          22.00   3.8147        .0000
.80        2.00            251.00          22.00   4.0420        .0000
.90        2.00            251.00          22.00   4.0420        .0000
1.00       2.00            251.00          22.00   4.0420        .0000

TABLE 13. Sign test for stem vs. root.
Recall     Favoring stem   Favoring root   Tied   Norm dev. Z   One-sided probability
.10        3               2               24     -.4472        .3300
.20        5               4               20     -.3333        .3707
.30        5               5               19     .0000         .5000
.40        5               6               18     .3015         .3821
.50        4               7               18     .9045         .1841
.60        4               11              14     1.8074        .0351
.70        5               12              12     1.6977        .0446
.80        5               12              12     1.6977        .0446
.90        5               12              12     1.6977        .0446
1.00       4               13              12     2.1828        .0146
Combined   45              84              161    3.4338        .0003

TABLE 14. Wilcoxon signed-rank test for stem vs. root.
Recall     Favoring stem   Favoring root   NDF     Norm dev. Z   One-sided probability
.10        6.00            9.00            5.00    .4045         .3409
.20        18.00           27.00           9.00    .5331         .2982
.30        18.00           37.00           10.00   .9683         .1660
.40        20.00           46.00           11.00   1.1558        .1230
.50        18.00           48.00           11.00   1.3337        .0918
.60        22.00           98.00           15.00   2.1583        .0154
.70        33.00           120.00          17.00   2.0592        .0197
.80        30.00           123.00          17.00   2.2012        .0131
.90        34.00           119.00          17.00   2.0119        .0217
1.00       24.00           129.00          17.00   2.4853        .0064

Conclusions
Summary
Micro-AIRS was designed as an experimental system
to investigate indexing and retrieval processes for Arabic
bibliographic data. During the design and implementation of the system, we dealt with the following problems:
(1) Accessing, processing, and displaying Arabic/English text.
(2) Indexing and sorting Arabic terms.
(3) Indexing and retrieval of Arabic data using different
types of index terms: words, stems, and roots.
(4) Ranking documents using different binary similarity
coefficients.
This research reveals the superiority of root- and
stem-retrieval methods over word-retrieval methods for
Arabic data. The root performs as well as or better than
the stem at the low recall levels, and definitely better at
high recall levels. We also found that the document ranking
process produced exactly the same results when using
different binary similarity coefficients, so a single simple
coefficient can be used.
These results were obtained in a system where each
document was accurately classified as to subject area.
Also, the part of the collection involved in the experiments, the set containing all 355 computer science documents in the database, was carefully proofread to eliminate spelling errors. Of most concern, most documents
in the collection were represented by titles only, not by
abstracts. Clearly, further experiments are needed.
Future Research
In an operational system, the word-stem-root dictionary should be replaced by a morphology algorithm that
finds stems and roots, as mentioned previously.
By using stems and roots for indexing and retrieval we
were able to retrieve most of the relevant documents in
the collection. The failure to retrieve some or all relevant documents for certain queries (see Table 6) was due to the use of related words (e.g., synonyms). We believe that the use of
an interactive thesaurus will be helpful in retrieving
more relevant documents. For a discussion of the use of
such a thesaurus in English see Fox (1980) and Wang,
Vandendorpe, and Evens (1985). Research on this problem is being carried forward at Illinois Institute of Technology using a database of Arabic documents with abstracts.
The current system allows the user to use only one
type of index term at any given time. To reduce the number of irrelevant documents, the user should have the
ability to impose the retrieval method over individual
words of a query. For example, the search argument “A
and (B or C)” could be expressed as “root:A and (stem:B
or word:C).”
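A minimal Python sketch of how such a mixed query might be evaluated is given below. It assumes three separate inverted indexes (word, stem, and root), each mapping a term to a set of record numbers. The index layout, the `lookup` helper, and the record numbers are all hypothetical, and A, B, and C stand in for Arabic terms exactly as in the example above.

```python
def lookup(indexes, tagged_term):
    """Resolve a term tagged as 'kind:term' against the matching index."""
    kind, term = tagged_term.split(":", 1)
    return indexes[kind].get(term, set())


# Hypothetical inverted indexes: term -> set of record numbers.
indexes = {
    "word": {"C": {4}},
    "stem": {"B": {2, 3}},
    "root": {"A": {1, 2, 4, 5}},
}

# root:A and (stem:B or word:C)
hits = lookup(indexes, "root:A") & (lookup(indexes, "stem:B") | lookup(indexes, "word:C"))
```

In this toy data the broad root match on A is narrowed to records 2 and 4, which illustrates how per-term index types could reduce the number of irrelevant documents.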
Using a binary ranking process fails in some cases to
put the most relevant documents at the top of the retrieved list. A weighted ranking process should be investigated for Arabic documents using a database where all
documents have abstracts, or better still, where all documents
are available online in full-text form. The first author is planning a large-scale test of the effectiveness of
the system at KACST using a large collection of documents with abstracts and a large number of test queries
collected from actual users.
REFERENCES
Borland International. (1988a). Turbo C, version 2.0: Reference guide. Scotts Valley, CA.
Borland International. (1988b). Turbo C, version 2.0: User's guide. Scotts Valley, CA.
Borland International. (1988c). Turbo Pascal, version 5.0: User's guide. Scotts Valley, CA.
Al-Fedaghi, S. S., & Al-Anzi, F. S. (1989, March). A new algorithm to generate Arabic root-pattern forms. In Proceedings of the 11th National Computer Conference and Exhibition (pp. 391-400). Dhahran, Saudi Arabia: King Fahd University of Petroleum and Minerals.
Fox, E. (1980). Lexical relations: Enhancing effectiveness of information retrieval. SIGIR Forum, 15, 6-35.
Galambos, J. A., Sebrechts, M., Wikler, E., & Black, J. (1985). A diagrammatic language for instruction of a menu-based word processing system. In S. Williams (Ed.), Humans and machines (pp. 11-44). Norwood, NJ: Ablex.
Al-Gasimi, M. (1987, April). Arabization of the MINISIS system. In Proceedings of the First King Saud University Symposium on Computer Arabization (pp. 13-26). Riyadh, Saudi Arabia: King Saud University.
Gheith, M., & El-Sadany, T. (1987, April). Arabic morphological analyzer on a personal computer. In Proceedings of the First King Saud University Symposium on Computer Arabization (pp. 55-65). Riyadh, Saudi Arabia: King Saud University.
Gheith, M., & Abdul-Ela, M. (1989, March). A computer based Arabic syntax analyzer. In Proceedings of the 11th National Computer Conference and Exhibition (pp. 352-360). Dhahran, Saudi Arabia: King Fahd University of Petroleum and Minerals.
Harman, D. (1987, June). A failure analysis on the limitation of suffixing in online environments. In Proceedings of the 10th Annual International ACM SIGIR Conference. New York: Association for Computing Machinery.
Harman, D. (1991). How effective is suffixing? Journal of the American Society for Information Science, 42, 7-15.
Hegazi, M., & Elsharkawi, A. A. (1985, April). An approach to a computerized lexical analyzer of natural Arabic. Computer Processing of the Arabic Language. Workshop papers (Vol. 1). Kuwait: Kuwait Institute for Scientific Research (KISR).
Hilal, Y. (1985, April). Morphological analysis of Arabic speech. Computer Processing of the Arabic Language. Workshop papers (Vol. 1). Kuwait.
Keen, E. M. (1972). Prospects for classification suggested by evaluation tests carried out 1957-1970. In A. Maltby (Ed.), Classification in the 1970's (pp. 193-210). Hamden, CT: Linnet Books.
Al-Kharashi, I. A. (1989). VATE: A vowelized Arabic text editor. Ph.D. qualifying project, Illinois Institute of Technology, Chicago, IL.
Al-Kharashi, I. A. (1990a, October). Micro-AIRS: A microcomputer based Arabic information retrieval system, design, implementation and evaluation. In The 12th National Computer Conference (Vol. 2) (pp. 515-529). Riyadh, Saudi Arabia: King Saud University.
Al-Kharashi, I. A. (1990b, October). An efficient contextual analysis algorithm for Arabic text handling. The 12th National Computer Conference (Vol. 2) (pp. 465-473). Riyadh, Saudi Arabia: King Saud University.
Lovins, J. B. (1968). Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11, 22-31.
Luhn, H. P. (1958). The automatic creation of literature abstracts. IBM Journal of Research and Development, 2, 159-165.
Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14, 130-137.
Salton, G. (Ed.) (1971). The SMART retrieval system: Experiments in automatic document processing. Englewood Cliffs, NJ: Prentice Hall.
Salton, G. (1975). A theory of indexing. Regional Conference Series in Applied Mathematics, No. 18. Philadelphia: Society for Industrial and Applied Mathematics.
Salton, G. (1989). Automatic text processing: The transformation, analysis, and retrieval of information by computer. Reading, MA: Addison-Wesley.
Salton, G., & McGill, M. J. (1983). Introduction to modern information retrieval. New York: McGraw-Hill.
Shneiderman, B. (1987). Designing the user interface: Strategies for effective human-computer interaction. Reading, MA: Addison-Wesley.
Tayli, M., & Al-Salamah, A. I. (1990). Building a bilingual microcomputer system. Communications of the ACM, 33, 495-504.
Thalouth, B., & Al-Dannan, A. (1987). A comprehensive Arabic morphological analyzer/generator. IBM Kuwait Scientific Center.
UNESCO (1989). Mini-micro CDS/ISIS. Paris.
van Rijsbergen, C. J. (1979). Information retrieval (2nd ed.). London: Butterworths.
Wang, Y. C., Vandendorpe, J., & Evens, M. (1985). A microcomputer based information retrieval system supporting stroke diagnosis. Journal of the American Society for Information Science, 36, 15-27.
Yahya, A. H. (1989, October). On the complexity of the initial stage of Arabic text processing. Paper presented at the First Great Lakes Computer Conference, Kalamazoo, MI.
