Root Word Stemming by Multiple Evidence from Corpus

Utpal Sharma
Department of Computer Science, Tezpur University, Tezpur - 784028, Assam, India
utpal@tezu.ernet.in
Jugal Kalita
Department of Computer Science, University of Colorado, Colorado Springs, CO 80933
kalita@pikespeak.uccs.edu
Rajib Das
Department of Computer Science, Tezpur University, Tezpur - 784028, Assam, India
rkd@tezu.ernet.in
Abstract
We discuss problems that arise in morphological analysis of highly inflectional natural languages. We focus on word stemming, particularly the problem of identifying root words automatically when access to
a substantive computational lexicon is unavailable.
1 Introduction
The word stemming problem commonly arises in
morphological analysis in the following problem
context: Given a corpus of a language and a list of
suffixes in the language, decompose the words in the
corpus into roots and suffixes wherever applicable.
The first step towards this task may be simply to
check the applicability of each suffix in each word,
as is done by the well-known Porter’s method ([1]).
This results in decomposition of not only all the
words that should be decomposed, but also many
others. For instance, sender=send+er is a correct
decomposition, whereas gender=gend+er is not.
Given no other information about the language
apart from the list of suffixes, it may not be possible
to detect all invalid decompositions. However, criteria can be developed by which some decompositions
are retained and the rest discarded. These criteria
are not likely to be exact, i.e., not all the decompositions discarded due to a particular criterion may be
incorrect.
We have been working with Assamese, a language
of the Indic branch of the Indo-European family of
languages. There is little existing computational linguistic
work on Assamese, which is spoken by around
15 million people. Some informal corpora of Assamese text have become available during the last
few years, but there is no available computational
lexicon of the language. Assamese is more inflectional than English. A preliminary study reveals
that about 48% of the words in an Assamese text of around
1600 words are inflectional or derivational, whereas
only about 19% of the words in an English text of about
1400 words are so. Most Assamese derivations are by
simple concatenation of suffixes to base words. One
of our intentions is to extract the morphological features of Assamese from a corpus ([3, 4]), and to develop a computational lexicon of the language. Our
present experiments are based on a corpus of newspaper items. For convenience, the encoding was converted to Roman before further processing. There are about 49000 words in the corpus,
of which about 11500 are distinct. The suffix list
used in the present experiment is obtained from the
corpus itself by an unsupervised learning method ([3]).
2 Correct decompositions
Correctness of a decomposition means that (1) the
root identified by stripping a suffix from a given word
is a valid word, and (2) the given word is actually derived from the identified root by applying that suffix.
The terms precision and recall quantify the performance of a method for the task of stemming. Precision is the proportion of correct decompositions out
of all the decompositions obtained, and recall is the
proportion of correct decompositions obtained and
accepted out of all decompositions which are actually
correct. When the list of suffixes is assumed to
be exhaustive, the recall is 100%, but many of the
decompositions may be incorrect, i.e., the precision
may be low.
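As an illustration of these two measures, the following minimal sketch computes precision and recall for a set of accepted decompositions against a hand-checked set of correct ones. The function name precision_recall and the representation of a decomposition as a (word, root, suffix) triple are assumptions made for this example only.

```python
def precision_recall(accepted, correct):
    """accepted: set of (word, root, suffix) triples kept by the method.
    correct: set of triples judged correct by hand (the reference set)."""
    hits = len(accepted & correct)
    precision = hits / len(accepted) if accepted else 0.0
    recall = hits / len(correct) if correct else 0.0
    return precision, recall

# Toy data based on the sender/gender example from the Introduction.
accepted = {("sender", "send", "er"), ("gender", "gend", "er")}
correct = {("sender", "send", "er")}
print(precision_recall(accepted, correct))  # (0.5, 1.0)
```

Here every correct decomposition is found (recall 1.0), but one of the two accepted decompositions is wrong (precision 0.5).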
The second condition for correctness is usually more
difficult to verify. That is, even if we ensure that an
identified root is a valid word, it is difficult to ensure
that the input word is actually formed by the application of the suffix to that root. To verify whether
an identified root is a valid word a lexicon may be
needed. But if a computational lexicon is not available (as is the case for many languages, including our
test language Assamese), other means are required.
In Porter’s and other similar methods ([1, 2]), the
criteria for applicability of a suffix themselves provide
some guarantee regarding the validity of the root
identified. For instance, the rule SSES → SS
means strip suffix ES if the root ends in SS. Similarly, the rule (*v*) ING → (NULL) means strip
suffix ING if the root contains a vowel. Such rules
prohibit indiscriminate stripping of suffixes from
word endings. In Assamese, for instance, a noticeable feature is the extensive inflection of nouns, including proper nouns. In such words there are hardly
any patterns that would facilitate definition of criteria like those cited above. Moreover, it requires
careful study of the morphology of the language to
define such criteria. We would like the system to
carry out stemming with minimum linguistic input.
In our experiment we have applied the suffixes by exact matching of the word endings against each suffix
and stripping the suffix if a match occurs. For a
language where many inflections are not simple concatenations of the roots and suffixes, spelling modification mechanisms like those in Porter’s method may be
used.
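The exact-matching step can be sketched as follows. This is a minimal illustration, not the implementation used in the experiment; the function name candidate_decompositions and the requirement that the remaining root be non-empty are assumptions.

```python
def candidate_decompositions(word, suffixes):
    """Return all (root, suffix) pairs obtained by exact matching of the word ending."""
    candidates = []
    for suffix in suffixes:
        # Assumption: an empty suffix or an empty remaining root is not a decomposition.
        if suffix and len(word) > len(suffix) and word.endswith(suffix):
            candidates.append((word[:-len(suffix)], suffix))
    return candidates

print(candidate_decompositions("sender", ["er", "ing"]))  # [('send', 'er')]
print(candidate_decompositions("gender", ["er", "ing"]))  # [('gend', 'er')]
```

Both words are decomposed, which is why further criteria are needed to separate the correct decomposition from the incorrect one.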
3 Validity of unknown roots
In the absence of a good lexicon, validity of a root
can be tested by checking if the root occurs as a
word somewhere in the input corpus. Some care has
to be taken in this verification because the corpus
may contain items which are not true words, such
as contractions (e.g., govt, info), individual letters in
abbreviations (e.g., U in U.N.O), and foreign words
(see [3]). However, there may be identified roots which
do not occur in the available corpus; let us call these
fresh roots. We require some criteria to determine if
a fresh root is a valid word. In this paper we describe
techniques to deal with fresh roots.
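A minimal sketch of how fresh roots might be separated from roots attested in the corpus is given below. The function name split_by_root_status and the triple representation (word, root, suffix) are assumptions; the filtering of non-words such as contractions and abbreviations mentioned above is not attempted here.

```python
def split_by_root_status(words, suffixes):
    """Split decompositions into those whose root occurs in the corpus and fresh ones."""
    vocabulary = set(words)
    attested, fresh = [], []
    for word in sorted(vocabulary):
        for suffix in suffixes:
            if suffix and len(word) > len(suffix) and word.endswith(suffix):
                root = word[:-len(suffix)]
                target = attested if root in vocabulary else fresh
                target.append((word, root, suffix))
    return attested, fresh

attested, fresh = split_by_root_status(["send", "sender", "gender"], ["er"])
print(attested)  # [('sender', 'send', 'er')]
print(fresh)     # [('gender', 'gend', 'er')]
```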
To start with, we obtain about 6400 decompositions
with fresh roots, of which 32.41% are valid. An immediate
observation is that very short fresh roots are
mostly invalid. We find that only 13.52% (96 out of
710) of the decompositions with fresh roots with 0
or 1 consonants are valid. On discarding all such decompositions the precision becomes 32.89% and the
recall becomes 98.11%. That is, without sacrificing many correct decompositions, we dispense with
many invalid decompositions.
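The short-root filter can be sketched as below. The vowel set is an assumption: the Roman encoding used for the corpus is not specified here, so the set VOWELS would have to be adapted to it. The names consonant_count and drop_short_roots are likewise illustrative.

```python
# Assumption: vowels of the Roman encoding; adjust to the actual transliteration scheme.
VOWELS = set("aeiouAEIOU")

def consonant_count(root):
    """Count alphabetic characters of the root that are not in the assumed vowel set."""
    return sum(1 for ch in root if ch.isalpha() and ch not in VOWELS)

def drop_short_roots(decompositions, min_consonants=2):
    """Discard decompositions whose root has fewer than min_consonants consonants."""
    return [d for d in decompositions if consonant_count(d[1]) >= min_consonants]

# Hypothetical (word, root, suffix) triples.
decompositions = [("kumAr", "kumA", "r"), ("ar", "a", "r"), ("ber", "be", "r")]
print(drop_short_roots(decompositions))  # [('kumAr', 'kumA', 'r')]
```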
4 Support for fresh roots
We observe that in a highly inflectional language
most base words undergo inflections, and such words
have more than one inflection. In Assamese, very
few base words (mainly the few conjunctions) do not
undergo inflection. A valid root is likely to occur at
different places in the corpus with different suffixes.
For example, if a is a valid root of the word av in
the corpus, then other inflections of a, such as aw,
are also likely to occur. So if v and w are suffixes
in the language then a is more likely to be a valid
root when the input corpus contains both the words
av and aw than when only av or aw occurs. We say
that the root a is supported by the two decompositions a + v and a + w, or that the support for a
is 2. Let the application of a particular suffix to a
root be called a case of the root. We take the probability that a fresh root is valid to be proportional to
its support, i.e., the number of distinct cases in which it occurs. In practice, if the support of a fresh
root is more than a threshold number, say 1, then
we assume that the fresh root is valid.
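The notion of support can be made concrete with a short sketch. The function name root_support is an assumption; support is counted as the number of distinct suffixes (cases) with which a candidate root is seen in the corpus.

```python
def root_support(words, suffixes):
    """Map each candidate root to the number of distinct suffixes it is seen with."""
    cases = {}
    for word in set(words):
        for suffix in suffixes:
            if suffix and len(word) > len(suffix) and word.endswith(suffix):
                cases.setdefault(word[:-len(suffix)], set()).add(suffix)
    return {root: len(found) for root, found in cases.items()}

# The corpus contains av and aw (both cases of a) but only bv for the root b.
support = root_support(["av", "aw", "bv"], ["v", "w"])
print(support["a"], support["b"])  # 2 1

# With a threshold of 1, only a would be assumed valid.
threshold = 1
print(sorted(r for r, s in support.items() if s > threshold))  # ['a']
```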
It is possible to increase the support for some fresh
roots by extending the list of suffixes. In many
highly inflectional languages, occurrences of multiple suffixes are common. Such suffix sequences may
be tried along with the other suffixes during stemming. Thus, if v, w and x are suffixes, wx is a
valid suffix sequence, and the words av and awx
occur in the corpus, then the support for the root a
will be 2.
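A minimal sketch of this extension treats each suffix sequence as one additional ending. The function name support_with_sequences and the representation of a sequence as a tuple of suffixes are assumptions for illustration.

```python
def support_with_sequences(words, suffixes, sequences):
    """Compute root support when suffix sequences are tried alongside single suffixes."""
    # Each sequence, e.g. ("w", "x"), is concatenated into one ending "wx".
    endings = set(suffixes) | {"".join(seq) for seq in sequences}
    cases = {}
    for word in set(words):
        for ending in endings:
            if ending and len(word) > len(ending) and word.endswith(ending):
                cases.setdefault(word[:-len(ending)], set()).add(ending)
    return {root: len(found) for root, found in cases.items()}

support = support_with_sequences(["av", "awx"], ["v", "w", "x"], [("w", "x")])
print(support["a"])   # 2  (cases a + v and a + wx)
print(support["aw"])  # 1  (case aw + x)
```

Without the sequence wx, the root a would be supported only by a + v.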
We partition the list of decompositions into two, A
and B. List A contains decompositions where the
support for the roots is higher than 1, and list B
contains decompositions where the support for the
roots is 1. In list A the precision is 62.39%, which is
a clear improvement. However, the recall drops to
31.75%, implying that there are correct decompositions which we discard because the support was low.
We deal with this situation a little later.
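The partition into lists A and B can be sketched as follows, assuming decompositions are given as (word, root, suffix) triples. The function name partition_by_support is illustrative.

```python
def partition_by_support(decompositions, threshold=1):
    """Return (list_a, list_b): roots with support above the threshold, and the rest."""
    cases = {}
    for _word, root, suffix in decompositions:
        cases.setdefault(root, set()).add(suffix)
    list_a = [d for d in decompositions if len(cases[d[1]]) > threshold]
    list_b = [d for d in decompositions if len(cases[d[1]]) <= threshold]
    return list_a, list_b

decompositions = [("av", "a", "v"), ("aw", "a", "w"), ("bv", "b", "v")]
list_a, list_b = partition_by_support(decompositions)
print(list_a)  # [('av', 'a', 'v'), ('aw', 'a', 'w')]
print(list_b)  # [('bv', 'b', 'v')]
```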
At this point it may be noted that there are several instances of a single word being decomposed
with different suffixes and suffix sequences. While
in some such cases more than one decomposition of
a word is valid (due to suffix sequences), in many
cases only one is. For instance, if v, w, x and y are suffixes, vw and vx are suffix sequences, and avw, avx
and avy are words in the corpus, then we obtain the
decompositions av + w, a + vw, av + x, a + vx and
av + y. The support for av is 3 and that for a is 2.
So, for avw the decomposition av + w is more likely
to be correct than the decomposition a + vw. When, for
each word, we select the decomposition whose root has
the highest support and the greatest length,
we get a precision of 69.82% and a recall of
28.85%. However, this criterion is not suitable when
a word actually has multiple correct decompositions.
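The selection step can be sketched on the example just given. The ordering used here, support first and root length as the tie-breaker, is one reading of the criterion and is an assumption, as is the function name best_decompositions.

```python
def best_decompositions(decompositions):
    """For each word keep the decomposition whose root has the highest support,
    breaking ties in favour of the longer root (assumed ordering)."""
    cases = {}
    for _word, root, suffix in decompositions:
        cases.setdefault(root, set()).add(suffix)
    best = {}
    for word, root, suffix in decompositions:
        key = (len(cases[root]), len(root))
        if word not in best or key > best[word][0]:
            best[word] = (key, (root, suffix))
    return {word: chosen for word, (_key, chosen) in best.items()}

# The five decompositions from the example in the text.
decompositions = [
    ("avw", "av", "w"), ("avw", "a", "vw"),
    ("avx", "av", "x"), ("avx", "a", "vx"),
    ("avy", "av", "y"),
]
print(best_decompositions(decompositions))
# {'avw': ('av', 'w'), 'avx': ('av', 'x'), 'avy': ('av', 'y')}
```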
5 Shortcomings of support
The contention between precision and recall in the
above method is due to two reasons:
1. A valid root word may not take more than one
suffix, and even if it does, an adequate number
of such cases may not occur in a given corpus.
In such situations this test declares a fresh root
as invalid. This reduces the recall of the exercise. In our experiment, list B contains a large
number of correct decompositions. The recall
in this list is 63.60% (some correct decompositions were discarded due to very short roots),
and the precision is 28.48%.
2. There may be multiple incorrect decompositions using the same fresh root. This test will
pass such a fresh root as valid. This reduces
precision.
6 Root occurrence
The first problem calls for additional criteria by
which we do not discard all decompositions in list
B. We note that multiple distinct cases of a root
are likely to occur in a corpus if the corpus contains
an adequate number of occurrences of that root. Conversely, the more times a root occurs in
the corpus in any of its cases, the more of its distinct
cases are likely to figure in those occurrences. To
determine the frequency of a root in the corpus we
simply add the number of occurrences of the words
with that root. Since the frequency of a root would
depend on the size of the corpus, it may be useful
to express it as a percentage of the corpus size. If
despite a high frequency of a root in a corpus, its
support remains small, say 1, then it is likely that
the decomposition(s) involving the root is (are) invalid and the decomposed words are probably base
words themselves. On the other hand, if the frequency of the root is small and so is its support, then
the evidence is inadequate. The root may be valid
despite the low support. We may defer the decision
regarding such decompositions till more input provides more occurrences of the root. This leaves us
with the results obtained with list A. Alternatively,
we may consider these decompositions along with
list A. Doing so can increase the recall, of course, at
the cost of precision.
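A minimal sketch of the frequency computation and of the resulting decision for roots with support 1 follows. The names root_frequencies and judge_low_support_root are illustrative, and the cut-off high_frequency=10 is a hypothetical value chosen only for this example; no particular cut-off is prescribed above. The input is a toy token list.

```python
from collections import Counter

def root_frequencies(tokens, decompositions):
    """Frequency of a root = total occurrences of the words decomposed with that root."""
    word_counts = Counter(tokens)
    frequencies = Counter()
    for word, root, _suffix in set(decompositions):
        frequencies[root] += word_counts[word]
    return frequencies

def judge_low_support_root(frequency, high_frequency=10):
    """Decision for a root whose support is 1 (high_frequency is a hypothetical cut-off)."""
    return "probably invalid" if frequency >= high_frequency else "undecided"

tokens = ["kumAr"] * 74 + ["AgDokhrte"]
decompositions = [("kumAr", "kumA", "r"), ("AgDokhrte", "AgDokhr", "te")]
freq = root_frequencies(tokens, decompositions)
print(freq["kumA"], judge_low_support_root(freq["kumA"]))        # 74 probably invalid
print(freq["AgDokhr"], judge_low_support_root(freq["AgDokhr"]))  # 1 undecided
```

Dividing each frequency by len(tokens) would give the frequency relative to the corpus size.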
In our experiment, in the decomposition kumAr=kumA+r (kumAr = kumA+r) the support for the
root kumA (kumA) happens to be only 1, but its frequency is 74 (since kumAr occurs 74 times). So
we can assume that kumA is not a valid root.
Similarly, the root kth (kz) in the decomposition
kthA=kth+A (kzA=kz+A) is not a valid root since
the support is 1 and the frequency is 160. But the
root AgDokhr (aAgeDAKr) in the decomposition AgDokhrte=AgDokhr+te (aAgeDAKret=aAgeDAKr+et)
is actually a valid root despite the support being
1. Its frequency is 1. Overall, when we partition list
B by the frequencies of the roots, we observe that
among the decompositions with smaller root frequencies there are more valid decompositions compared to those with larger root frequencies. Also,
among decompositions in list B, fewer decompositions have very high frequencies of root and more
have small frequencies of root. These figures are
shown in Table 1.
Root Freq   No. of decompositions   No. of valid decompositions   Precision
1           3081                    1068                          34.66%
2           599                     142                           23.70%
3           292                     46                            15.75%
4           158                     16                            10.12%
5           95                      9                             9.47%
6           66                      4                             6.06%
7           56                      9                             16.07%
8           30                      2                             6.66%
9           26                      1                             3.84%
10          25                      4                             16.00%
11          24                      1                             4.16%
12-191      168                     14                            8.33%
Table 1: Precision of decompositions with low (=1) root support
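As a consistency check on Table 1, the short sketch below recomputes precision from the counts. Summing all rows gives 1316 valid decompositions out of 4620, i.e., 28.48%, which agrees with the precision reported for list B in Section 5. The variable names are illustrative.

```python
# (root frequency, number of decompositions, number of valid decompositions) from Table 1.
rows = [
    ("1", 3081, 1068), ("2", 599, 142), ("3", 292, 46), ("4", 158, 16),
    ("5", 95, 9), ("6", 66, 4), ("7", 56, 9), ("8", 30, 2),
    ("9", 26, 1), ("10", 25, 4), ("11", 24, 1), ("12-191", 168, 14),
]

total = sum(n for _freq, n, _valid in rows)
valid = sum(v for _freq, _n, v in rows)
overall_precision = 100 * valid / total

print(total, valid)                  # 4620 1316
print(round(overall_precision, 2))   # 28.48

# Precision of the first row (root frequency 1).
print(round(100 * rows[0][2] / rows[0][1], 2))  # 34.66
```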
When we consider the decompositions from list B
with root frequency 1 together with list A, we obtain a precision of 41.72% and a recall of 83.37%.
Further improvement of precision may not be obtained simply by considering root support and root
occurrences. Considerations such as lengths of suffixes and lengths of roots may help. It is seen that
among the decompositions in list B, the ones with
longer roots are more likely to be valid. When we
successively remove decompositions with root lengths
below 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 and 13, the precision improves to 28.48%, 29.07%, 30.38%, 33.12%,
36.44%, 41.61%, 46.10%, 49.19%, 51.04%, 51.36% and
53.33%, respectively. Of course, the number of decompositions
with longer roots is smaller than the number with shorter
roots.
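The root-length experiment can be sketched as follows. The function name precision_by_min_root_length and the toy data are assumptions; in particular the triple ("ar", "a", "r") is hypothetical, and validity judgements must come from a hand-checked reference set.

```python
def precision_by_min_root_length(decompositions, valid, thresholds):
    """For each threshold t, keep decompositions with root length >= t and report precision."""
    results = {}
    for t in thresholds:
        kept = [d for d in decompositions if len(d[1]) >= t]
        if kept:  # precision is undefined when nothing is kept
            results[t] = sum(1 for d in kept if d in valid) / len(kept)
    return results

decompositions = [
    ("ar", "a", "r"),                  # hypothetical invalid decomposition
    ("kumAr", "kumA", "r"),            # invalid
    ("AgDokhrte", "AgDokhr", "te"),    # valid
]
valid = {("AgDokhrte", "AgDokhr", "te")}
results = precision_by_min_root_length(decompositions, valid, [4, 5])
print(results)  # {4: 0.5, 5: 1.0}
```

As in the experiment, precision rises with the threshold while the number of retained decompositions falls.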
We also feel that some sets of suffixes that are valid
with a root can be defined by methods such as the
one suggested in [4]. For a root with good support, it
can be verified whether all its suffixes are included in some
such set. If some are not, then the corresponding decompositions
can be discarded. Such methods will help in tackling
the second problem mentioned earlier, i.e., despite a
root being valid, some of the decompositions involving it may be invalid.
7 Conclusion
In this work we have described an approach to improve precision of stemming by considering occurrences
of different inflected forms of root words in
a corpus. This is an important aspect of a natural
corpus which can be useful in unsupervised or semi-supervised acquisition of morphology of a language.
References
[1] Porter, M. “An Algorithm for Suffix Stripping”. Program: Automated Library and Information Systems, vol. 14, no.
3, pp. 130-137, 1980.
[2] Saravanan, M, P C Reghu Raj, Vadali Srinivasa
Murty and S Raman. “Improved Porter’s Algorithm
for Root Word Stemming”. International Conference
on Natural Language Processing, Mumbai, 2002, 1821, pp. 21-30.
[3] Sharma, Utpal, Jugal Kalita and Rajib Das. “Unsupervised Learning of Morphology for Building a Lexicon for Highly Inflectional Language”. Workshop on
Morphological and Phonological Learning, ACL-2002,
Philadelphia, pp. 1-10.
[4] Sharma, Utpal, Jugal Kalita and Rajib Das. “Classification of Words Based on Affix Evidence”. International Conference on Natural Language Processing,
Mumbai, 2002, pp. 31-39.