Stemming Hausa text: using affix-stripping rules and reference look-up

Lang Resources & Evaluation (2016) 50:687–703
DOI 10.1007/s10579-015-9311-x
PROJECT NOTES
Stemming Hausa text: using affix-stripping rules
and reference look-up
Andrew Bimba1 · Norisma Idris1
Nurul Fazmidar Mohd Noor1
· Norazlina Khamis1 ·
Published online: 26 July 2015
© Springer Science+Business Media Dordrecht 2015
Abstract Stemming is a process of reducing a derivational or inflectional word to
its root or stem by stripping all its affixes. It is been used in applications such as
information retrieval, machine translation, and text summarization, as their preprocessing step to increase efficiency. Currently, there are a few stemming algorithms which have been developed for languages such as English, Arabic, Turkish,
Malay and Amharic. Unfortunately, no algorithm has been used to stem text in
Hausa, a Chadic language spoken in West Africa. To address this need, we propose
stemming Hausa text using affix-stripping rules and reference lookup. We stemmed
Hausa text, using 78 affix stripping rules applied in 4 steps and a reference look-up
consisting of 1500 Hausa root words. The over-stemming index, under-stemming
index, stemmer weight, word stemmed factor, correctly stemmed words factor and
average words conflation factor were calculated to determine the effect of reference
look-up on the strength and accuracy of the stemmer. It was observed that reference
look-up aided in reducing both over-stemming and under-stemming errors,
increased accuracy and has a tendency to reduce the strength of an affix stripping
stemmer. The rationality behind the approach used is discussed and directions for
future research are identified.
& Andrew Bimba
[email protected]; [email protected]
Norisma Idris
[email protected]
Norazlina Khamis
[email protected]
Nurul Fazmidar Mohd Noor
[email protected]
1
Faculty of Computer Science and Information Technology, University of Malaya, 50603
Kuala Lumpur, Malaysia
123
688
A. Bimba et al.
Keywords Stemming algorithm · Hausa language · Root word ·
Over-stemming · Under-stemming · Stemmer accuracy
1 Introduction
Since 1968, when Lovins (1968) first officially developed a stemming algorithm for
English, various modifications have been performed on the algorithm. This
improvement has produced other stemming methods such as Porter’s Stemmer
(1980), Dawson’s Stemmer (1974), Paice/Husk’s Stemmer (1990), etc. Stemming
can be defined as a computational or manual procedure which reduces all words
with the same root or stem to a common form, usually by stripping each word of its
derivational and inflectional affixes’ (Lovins 1968). For example, words like
Connected, Connecting, Connection, and Connections will be truncated to Connect.
Stemming is applied to improve the efficiency of information retrieval by providing
searches with means of discovering other morphological variants of their search
items (Frakes and Baeza-Yates 1992). Normally a single stem relates to several full
terms, therefore storing the stem instead of the term will go a long way in reducing
the size of the indexed files (Frakes and Baeza-Yates 1992).
Currently, a few stemming algorithms have been developed for various
languages such as Arabic (Alhanini et al. 2011), Turkish (Sever and Bitirim
2003), Malay (Darwis et al. 2012), and Amharic (Alemayehu and Willett 2002).
Unfortunately, no literatures have been found at the time of writing this paper on
any stemming algorithm for the Hausa language. Hausa is a Chadic language
spoken in West Africa, mostly northern Nigeria and Niger Republic, having
speakers in northern Ghana, Togo, Cameroun and Benin Republic. According to
‘Ethnologue’, Hausa language has approximately 25 million native speakers and
about 18 million second language speakers all located in 13 different countries in
Africa (Lewis 2009). This makes Hausa the 38th most spoken language in the world
and consisting of the most native speakers south of the Sahara Desert (Schuh 2012;
Lewis 2009; Newman 2007). To improve the efficiency of information retrieval and
indexing of Hausa text, there is a need to develop a method to stem Hausa text.
This paper describes our effort to stem Hausa text and investigate the suitability
of our approach. Section 2 reviews the background of the Hausa language. Section 3
summarizes the stemming algorithms for a few languages. Section 4 describes the
proposed method for stemming Hausa text. The evaluation and result obtained from
experiments are discussed in Sect. 5. Finally, Sect. 6 concludes the paper with a
suggestion for future work.
2 A brief on Hausa language grammar and morphology
The first writing system of the Hausa language, called ajami, was based on the
Arabic script. Presently, the Romanized orthography, called boko, is used, since its
introduction by the British at the beginning of the twentieth century (Newman 2000;
Jaggar 2001). The standard Hausa alphabet consists of 28 characters with 23 from
123
Stemming Hausa text: using affix-stripping rules and…
689
the English alphabet excluding q, v and x with the addition of ɓ, ɗ, ƙ, ‘and ƴ’
(Newman 2000). The Hausa language consists of a large amount of inflectional and
derivational morphology (Schuh 2012; Newman 2000; Smirnova 1982). Like most
African languages, Hausa has a variety of reduplicative processes in the nominal,
adjectival, and verbal systems, but most of the morphologies, both derivational and
inflectional, are related to nominal gender, number and to the verbal system to mark
lexical verb classes and verbal derivational functions called “extensions” (Schuh
2012; Newman 2000; Smirnova 1982).
Hausa nouns are either simple or derived. Simple nouns have their origin traced
from the present form such as gona (farm), karfi (strength), while derived nouns are
produced by adding prefixes, suffixes or confixes and in rare cases infixes. Example
of nouns with confix (both prefix and suffix) is marowaci (greedy person); having
the prefix ma and suffix ci the root word is rowa (greediness, stinginess). Example
of noun with suffix is makafta (blindness) has the suffix ta and the root word makafo
(blind man). Noun plurals in the Hausa language are derived by either reduplication,
addition of suffix, a change in the last vowel or a complete change of the root word
(Smirnova 1982; Newman 2000). Reduplication of the last syllable takes the form
(b)obi, (d)odi, (f)ofi, (g)ogi, (ƙ)oki (s)oshi, (t)oti or (t)oci, (w)oyi, (y)oyi, etc. for
Table 1 Some affixes in the
Hausa language
Affix class
Affixes
Prefix
ma
ya-l
ba
yaya-n
ta
‘ya-r
abi-n
ya-t
wuri-n
ya-n
da-n
na
da-m
Suffix
taka
oki
ci
oshi
nci
ni
wa
ki
ya
Aye
ye
aje
odi
ai
ofi
una
oyi
je
ke
u
woyi
ka
di
ntaka
shi
da
ta
nye
Infix
tta
a
Confix
ma – ci
ma-ta
123
690
A. Bimba et al.
example ƙofofi (doors) and ƙofa (door). Example of addition of suffix is rana (day)
and ranaki (days). Example of change in last vowel is fitila (lamp) and fitilu (lamps).
Nouns that have internal plurals are jirgi (plane) and jirage (planes), sariki (king)
and sarakuna (kings), doki (horse) and dawakai (horses) etc. The nouns with
internal plural have both have a suffix in addition to the infix. Verbs in Hausa
language can either be primitive or derived. Examples of primitive verbs are ci (eat),
sha (drink), ji (hear) etc. The formation of derived verbs is through the addition of a
suffix to the stem, with or without modification of the stem. Examples of
derivational verbs are tsorata (frighten) and tsoro (fear), ci (eat), canye or cinye
(devour) etc. Infixes occur in few cases such as the infix tta which is used for
littattafai (books) a plural of littafi (book) and infix ‘a’ as occurs in gurgu (cripple)
and guragu (cripples) (Newman 2000). Table 1 shows some important affixes in the
Hausa language.
3 Related works
Kraaij and Pohlmann (1994) suggested that stemming algorithm for information
retrieval is effective in languages with a complex morphology (Kraaij and
Pohlmann 1994). Over the years researchers have worked in developing stemming
algorithms for the English language and other languages such as Arabic, Malay, and
Turkish etc. These researches are focused mostly on modifying the stemming
algorithm developed by Porter (1980), because of its popularity and effectiveness in
information retrieval systems.
3.1 English stemmers
Lovins stemmer (1968), the first English stemmer, was developed based on
keywords in material science and engineering documents. The algorithm consisted
of 294 endings, 29 conditions and 35 transformation rules, where every ending is
linked to any of the conditions. Stripping of a word to its stem is done in 2 stages,
the stemming stage and the recoding stage (Lovins 1968). The stemming stage
involves stripping the longest ending found that fulfills its corresponding condition.
The ending is then modified by applying the transformation rules in the recoding
stage (Lovins 1968; Smirnov 2008; Jivani 2011).
Dawson (1974), extended the Lovins stemming algorithm using a broad list of
1200 suffixes, which are stored in the reversed order indexed by their length and last
letter. The rule for stripping a suffix using this algorithm is when the word is not
shorter than a specific number and its suffix is preceded by a specific order of
characters. This makes it more complex than the Lovins algorithm although it is fast
and covers more suffixes (Smirnov 2008; Jivani 2011).
Porter’s stemmer was introduced in 1980 by Martin Porter (Porter 1980). This
stemming algorithm is derived on the basis that there are approximately 1200
suffixes in English language, which comprises of a mix of smaller and simpler
suffixes (Porter 1980). It comprises 5 steps, with a series of 60 arranged rules which
uses a partial matching concept to reduce the complexity of the stemming algorithm,
123
Stemming Hausa text: using affix-stripping rules and…
691
although the stemmer may not necessarily produce a root word. This algorithm has
resulted 6370 different entries in the vocabulary of stems and less errors than the
Lovins stemmer. However, it was slower than Lovins due to its 5 steps whereas
Lovins has only 2 steps (Jivani 2011).
Paice/Husk stemmer (Paice 1990; Smirnov 2008; Jivani 2011) developed a
stemming algorithm, which iterates through a table that consists of 120 rules that are
indexed by the last letter of the suffix (Paice 1990; Jivani 2011). It has a higher
tendency for over stemming than the Porter’s algorithm (Smirnov 2008; Jivani
2011).
3.2 Arabic stemmers
For stemming Arabic text, 3 methods are employed; dictionary-based stemming,
light-based stemming and unsupervised approach will be discussed. In the
dictionary-based stemming, the root word is derived based on linguistic lexicons.
It involves 3 steps; pre-processing, entity recognition and stem identification. To
remove redundant spaces, sentence boundaries are identified in the pre-processing
stage and the text is tokenized and passed to the morphological analyzer. In the
entity recognition stage, Arabic personal names are identified. Finally, the stem is
extracted based on suffixes, prefixes, clitics and stem. In the light-based
methodology, 3 steps are involved; word identification, word segmentation and
pattern matching. In the first step common words and non-derivational nouns are
identified. In the segmentation stage the words are divided into its components;
affix, stem and clitics based on some rules. Finally, in the pattern matching stage the
root word is extracted by matching it with some patterns (Alhanini et al. 2011).
Alhanini and Aziz offered an enhanced stemming algorithm for Arabic text
(Alhanini et al. 2011). It was developed to overcome the disadvantages posed by the
dictionary-based and light-based stemming. Based on their experiment with an inhouse newspaper corpus, they achieved an accuracy of 96 % as compared to the
light and dictionary base which have 85 and 88 % accuracy, respectively (Alhanini
et al. 2011).
The unsupervised approach used by Khorsi (2012) learns from a corpus of about
14,000 traditional distinct Arabic words mapped to their stems. This approach
extracts morphological segments from classical Arabic words using a set of all the
distinct n-grams acquired from a plain text input mapped to the frequency. The basic
concept is that a letter is irremovable from the start or end of a word if the remaining
substring depends relatively on this letter. This unsupervised approach deals with
the eventual imperfection in the input text such as misspellings, and achieved about
90 % true positives in stemming of classical Arabic text (Khorsi 2012).
3.3 Malay stemmers
The Malay stemming algorithm has gone through various improvements since it
was first introduced by Othman (Darwis et al. 2012). It is a rule-based algorithm
using 121 affix stripping rules. Ahmad et al. (1996) decided to improve Othman’s
algorithm which had some over-stemming and under-stemming errors as a result of
123
692
A. Bimba et al.
no dictionary lookup and dependency on the order of the affixes. Sembok’s
algorithm included a stem dictionary lookup and some additional list of affixes. Idris
and Mustapha (2001) then developed a stemmer which is similar to Sembok’s that
greatly improved the recall of an automated essay grading system for Malay History
text. This improvement was made by altering the order of processing the affixes and
including this process in the steps of the algorithm rather than looping through a file
as seen in Sembok’s algorithm (Idris and Mustapha 2001). Regardless of the
techniques applied, stemming errors still existed in the earlier Malay Language
Stemmers. Hence, a recent approach proposed the use of a Malay Word Register to
solve the ambiguity problem and an exhaustive affix stripping approach to tackle
under-stemming and over-stemming errors (Darwis et al. 2012). This approach
performed well with 99.80 % accuracy.
3.4 Turkish stemmers
There are a few stemming algorithms developed for Turkish language, such as
Longest-Match (L-M) algorithm, an algorithm that is centered on a word search logic
done on a lexicon or dictionary that contains Turkish stems and their possible
alteration, and Aysin–Fazli (A–F) algorithm (Solak and Can 1994), which uses a
dictionary of Turkish stems annotated with sixty-four tags indicating how to produce
root words (Sever and Bitirim 2003). Currently, a more accurate stemming algorithm
for the Turkish language called FindStem was developed. The algorithm consists of
three components; ‘Find the Root’, ‘Morphological Analysis’, and ‘Choose the
Stem’ and has a pre-processing stage that handles clitics and converts the words in a
text to a smaller case (Sever and Bitirim 2003). The FindStem algorithm improved
document retrieval effectiveness by about 25 % in terms of precision.
3.5 Amharic stemmers
The Amharic stemmer involves the truncation of words to their root form by
deletion of both prefixes and suffixes (Alemayehu and Willett 2002). This is done by
some iterative procedures, which use a minimum stem length, recoding and contextsensitive rule similar to the Lovins algorithm (1968). During the affix removal
process, the prefix is deleted before the suffix, ensuring the minimum stem length
condition of 2 consonants. The algorithm was tested against a sample file of 1211
words and showed an accuracy of 95.90 %, 2.70 % over stemming and 1.40 % under
stemming errors (Alemayehu and Willett 2002).
3.6 Discussion
Based on the few stemmers reviewed, it can be seen that several approaches have
been used in their implementations. These differences are seen in the complexity of
the languages. For instance, English stemmers consist of some affix stripping rules,
which includes transformation rules for recoding the stem, verifying the vowel and
consonant combination of the word ensuring it is not shorter than a specified value.
More complex languages such as Arabic used dictionary-based, light stemming and
123
Stemming Hausa text: using affix-stripping rules and…
693
unsupervised approach, Malay used affix rules, dictionary reference and recently the
Malay Word Register to solve the problem with word ambiguity and the Turkish,
which implemented the Longest-Match (L-M) algorithm, A–F algorithm and
FindStem. Other languages such as Amharic used some iterative procedures that use
a minimum stem length, recoding and context-sensitive rule just like the English
stemmers.
The proposed approach to developing the Hausa stemming algorithm is based on
the affix stripping rules used in English, Malay, Amharic and Turkish stemmers.
The dictionary approach is used to reduce under-stemming errors.
4 Approach to stemming Hausa text
4.1 Overview
The proposed approach for stemming Hausa text is based on 78 affix stripping rules
and a reference lookup. It comprises a series of steps to remove certain suffix,
prefix, infix or confix based on some substitution rules and a condition which
specifies the vowel-consonant sequences in the resulting stem. Similar to other affix
stripping based stemmers, some of its substitution rules are determined according to
the combination of letters preceding the affix. For example, words with prefix ma
and suffix ci having a combination *VwVci, where V stands for vowel, then the
combination will be replaced by *VwV such as in marowaci (miser, greedy
man) \ rowa (greediness). Also words having a combination *Cvuci where C
stands for consonant will be replaced by *CVuta, such as in mafauci
(butcher) \ fauta (slaughter), marubuci (writer) \ rubuta (write). Similarly, such
computations are made for other words with suffix oshi, awa, nci.
4.2 Hausa affix-rules with reference lookup
The Hausa Affix-rules and reference lookup was developed based on the
information obtained from the books ‘The Hausa language: a descriptive grammar’
(Smirnova 1982) and ‘The Hausa Language: An Encyclopedic Reference Grammar’
(Newman 2000). References were also made to the dictionary; ‘The Hausa
Lexicographic Tradition’ (Newman and Newman 2001).
The reference lookup consist of 1500 root words, which were compiled based on
common words used in the Hausa language, words found in an old Hausa folktale
and news articles which range from politics, religion, sports, crime and conflict. The
affix-rules ensure that all common affixes, inflectional morphology such as noun
plurals, common noun and verb derivational affixes are covered.
4.3 The basic algorithm
The Hausa stemmer is developed based on the rule-based approach. It comprises 4
steps which are:
123
694
A. Bimba et al.
Step 1a: Check word against reference lookup; if word exist go to next word else
stem word reduplication and concrete nouns formed from prefixed particles and
other nouns that use hyphens. For example:
guje-guje (race) \ guje (ran)
Step 1b: Stem plural of compound nouns formed with prefix ma and suffix ai, ta,
and awa. For example:
madafai \ madafa (kitchen)
marubuta \ marubuci (writer)
Step 1c: Stem plural of compound nouns formed with prefix ma and have the
suffix ta. For example:
mafauta \ mafauci (butcher)
mahaukata \ mahaukaci (madman)
Step 1d: Stem abstract nouns formed from a verb or adjective with suffix ci and
prefix ma. For example:
marowaci (greediness) \ rowa (to be greedy)
Stem concrete nouns formed from verb with ma prefixed and ya suffixed. For
example:
mashaya (dinking place) \ sha
Step 2: Check word against reference lookup; if word exist go to next word else
stem verbal nouns formed with the suffix wa and abstract nouns formed from
simple nouns with suffixes ci (masculine) and ta (feminine) -n-ci (masculine of
Kano origin) and -n-taka (feminine of Sokoto origin), Plural of compound nouns
formed with particles and other plural formations. For example:
rubuce-rubuce (writing) \ rubutu (write)
tayasuwa (helping) \ taya (to help)
rantsuwa (swearing, oath) \ rantse (to swear)
tadowa (raising) \ tada (to raise)
gadonci \ gadontaka (inheritance) \ gado (inheritance)
fauta (slaughter) \ fawa (slaughtering)
Step 3: Check word against reference lookup; if word exist go to next word else
check for reduplication of the last syllable taking the forms (b)obi, oci, (d)odi, (f)
ofi, (g)ogi, (l)oli, (k)oki, (o)omi, (r)ori, (s)osi, (t)oti or (t)oshi, (w)owi, (y)oyi to
signify plurals and stem. For example:
goribobi \ goriba (a palm fruit)
batoci \ bata (small box made of skin)
kofofi \ kofa (door)
bindigogi \ bindiga (gun)
fitiloli \ fitila (lamp)
123
Stemming Hausa text: using affix-stripping rules and…
695
Step 4: Check word against reference lookup; if word exist go to next word else stem
some common plurals by stripping common suffix, prefix and infix. For example:
Suffix:
abokai \ aboki (friend)
shaiduna \ shaida (witness)
kafirai \ kafiri (heathen)
Prefix:
tarwatsa (scattering) \ watsa (to scatter)
tsattsaga (separate) \ tsaga (cut, divide)
Infix:
littattafi \ littafi
The flow chart of the algorithm is shown in Fig. 1. During each step, a reference is
made to the dictionary if the word is found it exits and goes to the next word else it
checks for a matching rule. If a rule matches a word, then whatever consequences
tied to that rule is tested on the stem to be produced, to see if the affix is to be
removed or replaced as indicated by the rule. The affix is taken out once a rule’s
conditions are satisfied and the command is passed to the following step. When the
rule is not satisfied, the following rule in that step is then applied, until a rule in that
step is met and command is passed to the next step or after all the rules in that step
has been exhausted, hence moving control to the following step. This procedure will
continue to stem all the words until, the words in the given text is exhausted.
The stemmer was implemented in Python Programing Language which is heavily
used in industry, scientific research, and education around the world (Kuhlman
2012; Bird et al. 2009). To store 1500 Hausa root words for the reference lookup,
MySQL was chosen as the database management system.
5 Performance evaluation
5.1 Stemmer evaluation
Several techniques have been suggested for evaluating the strength, similarities and
accuracy of stemming algorithms (Paice 1994; Frakes and Fox 2003; Sirsat et al.
2013). The methods we considered in this research do not evaluate stemmers based
on their effect on retrieval performance, as this does not give understanding of the
specific causes of errors. These methods are:
5.1.1 Paice’s evaluation method
In evaluating English based stemmers Paice did not use the customary precision/
recall factors, but as an alternative presented over-stemming index (OI), under-
123
696
A. Bimba et al.
Start
Select Next Word
Yes
Is Word in
Dictionary?
No
Apply Rule in Step 1 to word
Yes
Rule Satisfied?
Yes
No
Is Word in
Dictionary?
No
Apply Rule in Step 2 to word
Yes
Is Word in
Dictionary?
Yes
Rule Satisfied?
No
No
Apply Rule in Step 3 to word
Rule Satisfied?
Yes
No
Is Word in Yes
Dictionary?
No
Apply Rule in Step 4 to word
Yes
Are words
Exhausted?
No
Fig. 1 Flow chart of Hausa stemmer
stemming index (UI) and their relationship, the stemming weight (SW). The Paice
approach provides insights for stemmer optimization, unlike the traditional method
which evaluates stemmers based on retrieval performance (Paice 1994). The Paice
evaluation method requires a predefined list of groups of words which are
morphologically and semantically related. An ideal stemmer should be able to
merge all the semantically equivalent words in a group to a common stem which is
not a stem in another group. A stemmer has made under-stemming errors if a
stemmed group has more than one unique stem. Subsequently, a stemmer has overstemmed if a stem from a certain group also occurs in another stemmed group
(Kraaij and Pohlmann 1995).
The over-stemming index signifies the proportion of words, in different groups,
which have been reduced to the same stem. In order to calculate the OI two
parameters are required.
123
Stemming Hausa text: using affix-stripping rules and…
●
697
Desired Non-merge Total (DNT): Since an ideal stemmer is not expected to
merge any word in a group with a word in another group, the desired non-merge
total for that group DNTg is calculated by the formula:
DNTg ¼ 0:5ng ðW ng Þ
where ng, the number of words in the group; W, the total number of unique
words in the sample.
For example, considering one group (g1) in our sample of 1723 unique words:
By summing the total value of DNTg over all the groups in the sample, we
obtain the global desired non-merge total GDNT.
g1 = {doka (law), dokar, dokoki, dokokin, dokokinsa}
W = 1723
ng = 5
DNT1 ¼ 0:5 5ð1723 5Þ
●
Wrongly-merged total (WMT): In situations where a similar stem occurs in
more than one group, the number of wrongly merged words is calculated by the
formula:
X
WMTg ¼ 0:5
v i ð ns v i Þ
i¼1...t
where t, number of groups that share the same stem; ns, the number of instances
of that stem; vi, the number of stems for each group t.
For example, considering two groups g1 and g2 with the stemmer output sg1 and
sg2 respectively:
g1 = {baya (back), bayan, bayar, bayansu, bayarwa}
g2 = {bayanai (explanations), bayanaka, bayanan, bayananka, bayanansu,
bayani, bayanin}
sg1 = {baya, baya, baya, baya, bayar}
sg2 = {baya, bayanaka, baya, baya, baya, bayani, bayani}
v1 = 4 the number of the stem baya in sg1
v2 = 4 the number of the stem baya in sg2
ns = 8 the total number of the stem baya in sg1 and sg2
WMT1 ¼ 0:5ð4 ð8 4Þ þ 4 ð8 4ÞÞ
After summing the WMTg for each group we obtain the Global Wrongly-Merged
Total (GWMT).
The Over-stemming Index is represented as:
123
698
A. Bimba et al.
OI ¼
GWMT
GDNT
The under-stemming index signifies portion of the words in a group (g), which
failed to be merged by the stemmer. Similar to OI, the UI is calculated based on two
parameters.
●
Desired Merge Total (DMT): Since a perfect stemmer is required to merge all
the words in a group to one another, the desired merge total is calculated as:
DMTg ¼ 0:5ng ng 1
For example, looking at a group g1 in our sample having 4 words:
g1 = {aboki (friend), abokansa, abokan, abokin}
DMT1 ¼ 0:5 4ð4 1Þ ¼ 6
The 6 pairs can be seen below as:
Pair
Pair
Pair
Pair
Pair
Pair
1
2
3
4
5
6
=
=
=
=
=
=
{abokan, abokansa}
{abokan, aboki}
{abokan, abokin}
{abokansa, aboki}
{abokansa, abokin}
{aboki, abokin}
A summation of the total value of DMTg across all the groups in the sample
gives the global desired merge total GDMT.
Unachieved Merge Total (UMT): After stemming some groups may contain
more than one distinct stem. The failure of the stemmer to combine all the words
in a group to a single stem is referred to as the unachieved merge total (UMT),
represented by the formula:
UMTg ¼ 0:5
X
ui ng ui
i¼1...s
where s, the number of distinct stems; ui, the number of instances of each stem.
For example, a stemmed group sg1:
sg1 = {ciki (inside), cik, ciki, ciki, ciki, ciki}
ns = 6 the total number of the stems ciki and cik in sg1
u1 = 5 number of stem ciki in sg1
u2 = 1 number of stem cik in sg1
UMT1 ¼ 0:5ð5 ð6 5Þ þ 1 ð6 1ÞÞ
123
Stemming Hausa text: using affix-stripping rules and…
699
Summing the UMTg for each group, we get the global unachieved merge total
(GUMT).
The Under-stemming Index is represented as:
UI ¼
GUMT
GDMT
The strength of the stemmer known as the stemming weight is measured as the
ratio of OI and UI.
SW ¼
OI
UI
A high or low value of SW indicates a strong (heavy) stemmer or a weak (light)
stemmer respectively (Kraaij and Pohlmann 1995).
5.1.2 Sirsat’s evaluation method
Sirsat et al. (2013), proposed a useful method to measure the strength and accuracy
of a stemming algorithm. They introduced the Word Stemmed Factor (WSF) to
compute the strength of a stemmer. Also, they used the Correctly Stemmed Words
Factor (CSWF) and Average Words Conflation Factor (AWCF) to measure the
accuracy of a stemmer.
●
Word Stemmed Factor (WSF): This is the ratio of the words stemmed by the
stemmer and the total number of words in the sample. It has a minimum
threshold of 50 %. The strength of the stemmer is greater if the WSF is larger. It
is calculated as:
WSF ¼
●
where WS, number of words stemmed; TW, number of words in a sample.
Correctly Stemmed Words Factor (CSWF): It is a ratio of the words stemmed
correctly by the stemmer and the number of words stemmed. The minimum
threshold is 50 %. A stemmer has a higher accuracy if the value of CSWF is
higher. CSWF is calculated as:
CSWF ¼
●
WS
100
TW
CSW
100
WS
where CSW, number of correctly stemmed words; WS, total number of words
stemmed.
Average Words Conflation Factor (AWCF): This represents the average number
of different words of diverse conflation group that are stemmed correctly. In
computing AWCF, the number of unique words after conflation is first
calculated as:
123
700
A. Bimba et al.
NWC ¼ S CW
where S, number of distinct stem after stemming; CW, number of correct words
not stemmed.
Thus, AWCF is calculated as:
AWCF ¼
CSW NWC
100
CSW
A stemmer has a higher accuracy if the value of AWCF is higher.
5.2 Result analysis
To measure the performance of stemming Hausa text, Paice’s (1994) and Sirsat’s
(2013) evaluation methods were applied to 436 online news articles in Hausa
language. The Hausa text consisted of 32,236 words among which 3460 are unique.
A total of 490 groups of morphologically and semantically related words were
formed from the unique words. This amounted to 1723 words. A sample, of the
groups created is shown in Table 2.
Two versions of the algorithm for stemming Hausa text were evaluated. The first
version (HStemV1) consisted of only affix stripping rules. While the second version
(HStemV2) consisted of affix stripping rules and a reference lookup. We applied the
two versions to our sample group to obtain the values of OI, UI and SW. The results
are shown in Table 3.
HStemV2 had a lower OI than HStemV1. This is because the reference lookup
reduced the portion of words merged, from different groups of morphologically and
semantically related words. For example, HStemV1 merged words from 4 different
groups:
ragu (reduce), rana (day), ransa (his life),
runduna (company).
Table 2 Sample group of related words
S/N
Related words
1
alama, alamar, alamomin, alamu, alamun
2
bangare, bangaren, bangarori, bangarorin
3
gida, gidaje, gidajen, gidajensu, gidan, gidanka, gidansa gidansu
4
haifa, haifar, haife, haife shi’, haifi, haihu, haihuwa, haihuwar
5
jarirai, jariran, jariransu, jariranta, jaririya, jaririyar
6
makamai, makaman, makamanta, makamashi, makamashin, makami
7
rubuce, rubuta, rubutawa, rubutu
8
shawara, shawarar, shawararin, shawarwari
9
tsaka, tsakanin, tsakaninsa, tsakaninsu, tsaki, tsakiya, tsakiyar
10
zumunci, zumuncin, zumunta, zumuntan
123
Stemming Hausa text: using affix-stripping rules and…
701
Table 3 Result using Paice’s Evaluation Method
Stemmer
OI
HStemV1
0.39 × 10−4
HStemV2
−4
0.24 × 10
UI
SW
0.15
2.65 × 10−4
0.14
1.68 × 10−4
But, HStemV2 merged words from only 2 different groups:
ragu (reduce) and ransa (his life).
Similarly, the UI of HStemV2 was lower than HStemV1. This shows that a
reference lookup can reduce the number of semantically equivalent words, which
are unsuccessfully merged while stemming. For example, HStemV2 reduced the
unsuccessfully merged words by 1, while stemming a group words as compared to
HStem V1. A sample result is shown in Table 4.
It was also observed from the stemming weight that HStemV1 is heavier than
HStemV2.
Furthermore, the strength and accuracy of the two stemming approaches were
measured using Sirat’s method. The results obtained are shown in Table 5.
Table 4 Sample result of
under-stemming error
Group
HStemV1-result
HStemV2-result
wasa (play)
Wasa
wasa
wasan
Washi
wasa
wasanni
Washi
wasa
wasannin
Washi
wasa
wasanninta
Was an
wasan
wasansa
Washi
wasa
wasansu
Washi
wasa
Table 5 Result using Sirsat’s method
Analysis of stemmers
HStemV1
HStemV2
Total words (TW)
1723
1723
Number of words stemmed (WS)
1197
1132
Words stemmed factor (WSF)
69.47 %
65.7 %
Number of distinct words after Stemming (S)
984
992
Index compression factor (ICF)
75.1 %
73.7 %
Correctly stemmed words (CSW)
971
989
Incorrectly stemmed words (ISW)
226
143
Correctly stemmed words factor (CSWF)
81.12 %
87.37 %
Correct words not stemmed (CW)
526
591
Number of distinct words after conflation (NWC)
458
401
Average words conflation factor (AWCF)
52.83 %
59.45 %
123
702
A. Bimba et al.
The word stemmed factors (WSF) from the two approaches were well above the
50 % threshold. This indicates the stemmers are strong and aggressive (Sirsat et al.
2013). However, the HStemV1 is more aggressive than the HStemV2. This is in
agreement with the Paice’s evaluation method, which shows HStemV1 is heavier
than HStemV2. Furthermore, the following were observed:
●
●
The word stemmed factor (WSF) value for HStemV1 (69.47 %) is higher than
HStemV2 (65.7 %), but the CSWF and AWCF are lower. This is because
HStemV1 lacks a lookup of root words, thereby making it transform root words
to incorrect stem.
HStemV1 has a higher tendency to generate under-stemming errors than
HStemV2. This is seen by a higher value of NWC. This result corresponds to the
outcome obtained from Paice’s evaluation method.
6 Conclusion
In this study, we used two methods to stem Hausa text. The first method HStemV1
consisted of only affix stripping rules, while the second method HStemV2 had both
affix stripping rules and a reference lookup. Using the Paice’s evaluation method,
we observed that reference lookup aided in reducing both over-stemming and understemming errors. Comparison of HStemV1 and HStemV2 based on stemming
weight indicates, a reference lookup has a tendency to reduce the strength of an affix
stripping stemmer.
To further measure the effect of reference lookup on stemming accuracy, we used
Sirsat’s method to calculate the CSWF and AWCF. While using reference lookup, a
6.62 % improvement in correctly stemmed words was achieved. These improvements are attributed to the presence of root words in the reference lookup. Even
though the reference lookup improved the accuracy of the stemmer, some understemming and over-stemming errors were consistent amongst both methods. This
could be attributed to the amount of root words in the reference lookup. Further
work can be carried out to determine the effect of the size of the reference lookup on
stemming accuracy.
Acknowledgments We gratefully acknowledge the support of Paul Newman, Indiana University USA,
for the substantive comments and constructive criticisms given.
References
Ahmad, F., Yusoff, M., & Sembok, T. M. T. (1996). Experiments with a stemming algorithm for Malay
words. Journal of the American Society for Information Science, 47(12), 909–918.
Alemayehu, N., & Willett, P. (2002). Stemming of Amharic Words for Information Retrieval. Literary
and Linguistic Computing, 17(1)
Alhanini, Y., Juzaiddin, M., & Aziz, A. (2011). The enhancement of arabic stemming by using light
stemming and dictionary-based stemming. Journal of Software Engineering and Applications, 4,
522–526.
123
Stemming Hausa text: using affix-stripping rules and…
703
Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with Python. Newton: O’Reilly
Media Inc.
Darwis, S. A., Rukaini, A., & Idris, N. (2012). Exhaustive affix stripping and a Malay word register to
solve stemming errors and ambiguity problem in Malay stemmers. Malaysian Journal of Computer
Science, 25(4), 196–209.
Dawson, J. (1974). Suffix removal and word conflation. Bulletin of the Association for Literary and
Linguistic Computing, 2(3), 33–46.
Frakes, W. B., & Baeza-Yates, R. (1992). Information retrieval: Data structures and algorithms (pp. 161–
218). Englewood Cliffs, NJ: Prentice Hall.
Frakes, W. B., & Fox, C. J. (2003). Strength and similarity of affix removal stemming algorithms. ACM
SIGIR Forum, 37(1), 26–30.
Idris, N., & Mustapha S. M. F. D. (2001). Stemming for term conflation in Malay texts. In International
conference on artificial intelligence (IC-AI. Las Vegas) (pp. 1512–1517).
Jaggar, P. J. (2001). Hausa. Reading: John Benjamins Publishing.
Jivani, A. G. (2011). A comparative study of stemming Algorithms. International Journal of Computer
Technology and Applications, 2(6), 1930–1938.
Khorsi, A. (2012). Effective unsupervised Arabic word stemming: Towards an unsupervised radicals
extraction. The International Arab Journal of Information Technology, 9(6), 571–577.
Kraaij, W., & Pohlmann, R. (1994). Porter’s stemming algorithm for Dutch. In L.G.M. Noordman & W.
A.M. de Vroomen (Eds.), Informatiewetenschap 1994: Wetenschappelijke bijdragen aande derde
STINFON conferentie (pp. 167–180). Leiden, Netherlands: Stichting Informatiewetenschap
Nederland.
Kraaij, W., & Pohlmann, R. (1995). Evaluation of a Dutch stemming algorithm. The New Review of
Document and Text Management, 1, 25–43.
Kuhlman, D. (2012). A python book: Beginning python, advanced python and python exercises.
Rexx.com.
Lewis, M. P. (2009). Ethnologue: Languages of the World, Sixteenth edition. [online]. http://www.
ethnologue.com/. Accessed 4 Dec 2012
Lovins, J. B. (1968). Development of a stemming algorithm. Mechanical Translation and Computational
Linguistics, 11(1&2), 22–31.
Newman, P. (2000). The Hausa language: An encyclopedic reference grammar. New Haven: Yale
University Press.
Newman, P. (2007). A Hausa-English dictionary. New Haven: Yale University Press.
Newman, R., & Newman, P. (2001). The Hausa lexicographic tradition. Lexikos, 11, 263–286.
Paice, C. D. (1990). Another stemmer. ACM, SIGIR Forum, 24(3), 56–61.
Paice, C. D. (1994). An evaluation method for stemming algorithms. In Proceedings of the 17th annual
international ACM SIGIR conference on research and development in information retrieval (pp. 42–
50). New York, NY: Springer-Verlag.
Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14, 130–137.
Schuh, R. G. (2012). A Hausa story and Hausa verb morphology UCLA [online]. http://www.linguistics.
ucla.edu/people/schuh/lx105/. Accessed 4 Dec 2012
Sever, H., & Bitirim, Y. (2003). FindStem: Analysis and evaluation of a Turkish stemming algorithm.
LNCS, 2857, 238–251.
Sirsat, S. R., Chavan, V., & Mahalle, H. S. (2013). Strength and accuracy analysis of affix removal
stemming algorithms. International Journal of Computer Science and Information Technologies, 4
(2), 265–269.
Smirnov, I. (2008). Overview of stemming algorithms. Mechanical Translation, 52.
Smirnova, M. (1982). The Hausa language a descriptive grammar. London: Routledge & Keagan Paul.
Solak, A., & Can, F. (1994). Effects of stemming on Turkish text retrieval. In Proceedings of the ninth
international. Symposium on Computer and Information Sciences (ISCIS), pp. 49–56.
123

Download Report

Stemming Hausa text: using affix-stripping rules and reference look-up

Paperzz.com

Your Paperzz