Root Word Stemming by Multiple Evidence from Corpus

Utpal Sharma
Department of Computer Science, Tezpur University, Tezpur - 784028, Assam, India
utpal@tezu.ernet.in

Jugal Kalita
Department of Computer Science, University of Colorado, Colorado Springs, CO 80933
kalita@pikespeak.uccs.edu

Rajib Das
Department of Computer Science, Tezpur University, Tezpur - 784028, Assam, India
rkd@tezu.ernet.in

Abstract

We discuss problems that arise in morphological analysis of highly inflectional natural languages. We focus on word stemming, particularly the problem of identifying root words automatically when a substantive computational lexicon is unavailable.

1 Introduction

The word stemming problem commonly arises in morphological analysis in the following context: given a corpus of a language and a list of suffixes in the language, decompose the words in the corpus into roots and suffixes wherever applicable. The first step towards this task may be simply to check the applicability of each suffix to each word, as is done by the well-known Porter's method ([1]). This results in the decomposition of not only all the words that should be decomposed, but also many others. For instance, sender = send + er is a correct decomposition, whereas gender = gend + er is not. Given no other information about the language apart from the list of suffixes, it may not be possible to detect all invalid decompositions. However, criteria can be developed by which some decompositions are retained and the rest discarded. These criteria are not likely to be exact, i.e., not every decomposition discarded by a particular criterion is necessarily incorrect.

We have been working with Assamese, a language of the Indic branch of the Indo-European family of languages. There is little existing computational linguistic work on Assamese, which is spoken by around 15 million people.
Some informal corpora of Assamese text have become available during the last few years, but there is no available computational lexicon of the language. Assamese is more inflectional than English. A preliminary study reveals that about 48% of the words in an Assamese text of around 1600 words are inflectional or derivational, whereas only about 19% of the words in an English text of about 1400 words are so. Most Assamese derivations are by simple concatenation of suffixes to base words. One of our intentions is to extract the morphological features of Assamese from a corpus ([3, 4]), and to develop a computational lexicon of the language.

Our present experiments are based on a corpus of newspaper items. For convenience, the encoding was converted to Roman before processing. There are about 49000 words in the corpus, of which about 11500 are distinct. The suffix list used in the present experiment is obtained from the corpus itself by an unsupervised learning method ([3]).

2 Correct decompositions

Correctness of a decomposition means that (1) the root identified by stripping a suffix from a given word is a valid word, and (2) the given word is actually derived from the identified root by applying that suffix.

The terms precision and recall quantify the performance of a method for the task of stemming. Precision is the proportion of correct decompositions out of all the decompositions obtained, and recall is the proportion of correct decompositions obtained and accepted out of all decompositions which are actually correct. When the list of suffixes is assumed to be exhaustive, the recall is 100%, but many of the decompositions may be incorrect, i.e., the precision may be low.

The second condition for correctness is usually more difficult to verify. That is, even if we ensure that an identified root is a valid word, it is difficult to ensure that the input word is actually formed by the application of the suffix to that root.

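
The definitions of precision and recall given in this section can be illustrated with a small sketch. It assumes that decompositions are represented as (word, root, suffix) tuples and that a hand-checked set of correct decompositions is available; the function name and the toy English data are hypothetical and are not taken from the experiment.

```python
def precision_recall(accepted, correct):
    """Return (precision, recall) for a set of accepted decompositions.

    accepted: decompositions produced and kept by the stemmer.
    correct:  all decompositions that are actually correct (assumed known).
    """
    accepted = set(accepted)
    correct = set(correct)
    hits = accepted & correct  # accepted decompositions that are correct
    precision = len(hits) / len(accepted) if accepted else 0.0
    recall = len(hits) / len(correct) if correct else 0.0
    return precision, recall


# Toy data: gender = gend + er is accepted but wrong,
# walking = walk + ing is correct but was not obtained.
accepted = {("sender", "send", "er"), ("gender", "gend", "er"), ("walked", "walk", "ed")}
correct = {("sender", "send", "er"), ("walked", "walk", "ed"), ("walking", "walk", "ing")}

p, r = precision_recall(accepted, correct)
print(f"precision={p:.2%} recall={r:.2%}")
```

In this toy case two of the three accepted decompositions are correct, and two of the three correct decompositions are found, so both measures equal 2/3.
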
To verify whether an identified root is a valid word, a lexicon may be needed. But if a computational lexicon is not available (as is the case for many languages, including our test language Assamese), other means are required. In Porter's and other similar methods ([1, 2]), the criteria for applicability of a suffix themselves provide some guarantee regarding the validity of the identified root. For instance, the rule SSES → SS means strip the suffix ES if the root ends in SS. Similarly, the rule (*v*) ING → (NULL) means strip the suffix ING if the root contains a vowel. Such rules prohibit indiscriminate stripping of suffixes from word endings.

In Assamese, by contrast, a noticeable feature is the extensive inflection of nouns, including proper nouns. In such words there are hardly any patterns that would facilitate the definition of criteria like those cited above. Moreover, defining such criteria requires careful study of the morphology of the language. We would like the system to carry out stemming with minimum linguistic input.

In our experiment we have applied the suffixes by exact matching of the word endings against each suffix, stripping the suffix if a match occurs. For a language where many inflections are not simple concatenations of roots and suffixes, spelling modification mechanisms like those in Porter's method may be used.

3 Validity of unknown roots

In the absence of a good lexicon, the validity of a root can be tested by checking whether the root occurs as a word somewhere in the input corpus. Some care has to be taken in this verification because the corpus may contain items which are not true words, such as contractions (e.g., govt, info), individual letters in abbreviations (e.g., U in U.N.O), and foreign words (see [3]). However, there may be identified roots which do not occur in the available corpus; let us call these fresh roots. We require some criteria to determine whether a fresh root is a valid word.

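
The exact-match suffix stripping and the notion of a fresh root described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiment: the function names are hypothetical, the toy words are English, and the requirement that a root be non-empty is an assumption.

```python
def decompose(word, suffixes):
    """Return every (root, suffix) split where the word ends in a listed suffix.

    Assumption: a suffix is stripped only if a non-empty root remains.
    """
    return [
        (word[: -len(suffix)], suffix)
        for suffix in suffixes
        if len(word) > len(suffix) and word.endswith(suffix)
    ]


def fresh_root_decompositions(corpus_words, suffixes):
    """List (word, root, suffix) decompositions whose root never occurs
    as a word in the corpus, i.e. decompositions with fresh roots."""
    vocabulary = set(corpus_words)
    fresh = []
    for word in sorted(vocabulary):
        for root, suffix in decompose(word, suffixes):
            if root not in vocabulary:
                fresh.append((word, root, suffix))
    return fresh


corpus = ["send", "sender", "gender", "walked"]
suffixes = ["er", "ed"]
fresh = fresh_root_decompositions(corpus, suffixes)
print(fresh)  # [('gender', 'gend', 'er'), ('walked', 'walk', 'ed')]
```

Here send is not reported because it occurs in the corpus as a word, while gend and walk are fresh roots: one invalid and one valid, which is exactly the ambiguity that further criteria have to resolve.
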
In this paper we describe techniques to deal with fresh roots. To start with, we obtain about 6400 decompositions with fresh roots, of which 32.41% are valid. An immediate observation is that very short fresh roots are mostly invalid. We find that only 13.52% (96 out of 710) of the decompositions with fresh roots containing 0 or 1 consonants are valid. On discarding all such decompositions the precision becomes 32.89% and the recall becomes 98.11%. That is, without sacrificing many correct decompositions, we dispense with many invalid ones.

4 Support for fresh roots

We observe that in a highly inflectional language most base words undergo inflection, and such words have more than one inflected form. In Assamese, very few base words (mainly the few conjunctions) do not undergo inflection. A valid root is likely to occur at different places in the corpus with different suffixes. For example, if a is a valid root of the word av in the corpus, then other inflections of a, such as aw, are also likely to occur. So if v and w are suffixes in the language, then a is more likely to be a valid root when the input corpus contains both the words av and aw than when only av or only aw occurs. We say that the root a is supported by the two decompositions a + v and a + w, or that the support for a is 2.

Let the application of a particular suffix to a root be called a case of the root. We take the probability that a fresh root is valid to be proportional to its support, i.e., the number of distinct cases in which it occurs. In practice, if the support of a fresh root exceeds a threshold, say 1, then we assume that the fresh root is valid.

It is possible to increase the support for some fresh roots by extending the list of suffixes. In many highly inflectional languages, occurrences of multiple suffixes are common. Such suffix sequences may be tried along with the other suffixes during stemming.
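
The support count can be sketched as below, assuming decompositions are given as (word, root, suffix) tuples and that a suffix sequence is treated as just another suffix string. The function names and the abstract toy words are illustrative; the default threshold of 1 follows the value suggested in the text.

```python
from collections import defaultdict


def root_support(decompositions):
    """Map each root to its support: the number of distinct cases
    (distinct suffixes or suffix sequences) with which it occurs."""
    cases = defaultdict(set)
    for _word, root, suffix in decompositions:
        cases[root].add(suffix)
    return {root: len(suffixes) for root, suffixes in cases.items()}


def accept_by_support(decompositions, threshold=1):
    """Keep only decompositions whose root has support above the threshold."""
    support = root_support(decompositions)
    return [d for d in decompositions if support[d[1]] > threshold]


# The words av and aw both support the root a; b is seen in one case only.
decompositions = [("av", "a", "v"), ("aw", "a", "w"), ("bv", "b", "v")]
support = root_support(decompositions)
accepted = accept_by_support(decompositions)
print(support)   # {'a': 2, 'b': 1}
print(accepted)  # [('av', 'a', 'v'), ('aw', 'a', 'w')]
```
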
For example, if v, w and x are suffixes and wx is a valid suffix sequence, and the words av and awx occur in the corpus, then the support for the root a will be 2.

We partition the list of decompositions into two lists, A and B. List A contains decompositions where the support for the root is higher than 1, and list B contains decompositions where the support for the root is 1. In list A the precision is 62.39%, which is a clear improvement. However, the recall drops to 31.75%, implying that there are correct decompositions which we discard because the support was low. We deal with this situation a little later.

At this point it may be noted that there are several instances of a single word being decomposed with different suffixes and suffix sequences. While in some such cases more than one decomposition of a word is valid (due to suffix sequences), in many cases only one is. For instance, if v, w, x and y are suffixes, vw and vx are suffix sequences, and avw, avx and avy are words in the corpus, then we obtain the decompositions av + w, a + vw, av + x, a + vx and av + y. The support for av is 3 and that for a is 2. So, for avw the decomposition av + w is more likely to be correct than the decomposition a + vw. When, for each word, we select the decomposition whose root has the highest support and the greatest length, we get a precision of 69.82% and a recall of 28.85%. However, this criterion is not suitable when a word actually has multiple correct decompositions.

5 Shortcomings of support

The tension between precision and recall in the above method arises for two reasons:

1. A valid root word may not take more than one suffix, and even if it does, an adequate number of such cases may not occur in a given corpus. In such situations this test declares a fresh root invalid. This reduces the recall of the exercise. In our experiment, list B contains a large number of correct decompositions.
The recall in this list is 63.60% (some correct decompositions were discarded due to very short roots), and the precision is 28.48%.

2. There may be multiple incorrect decompositions using the same fresh root. This test will pass such a fresh root as valid. This reduces precision.

6 Root occurrence

The first problem calls for additional criteria by which we do not discard all decompositions in list B. We note that multiple distinct cases of a root are likely to occur in a corpus if the corpus contains an adequate number of occurrences of that root. Conversely, the more often a root occurs in the corpus in any of its cases, the more of its distinct cases are likely to figure among those occurrences. To determine the frequency of a root in the corpus we simply add up the number of occurrences of the words with that root. Since the frequency of a root depends on the size of the corpus, it may be useful to express it as a percentage of the corpus size.

If the support of a root remains small, say 1, despite a high frequency of the root in the corpus, then it is likely that the decomposition(s) involving the root are invalid and the decomposed words are probably base words themselves. On the other hand, if the frequency of the root is small and so is its support, then the evidence is inadequate. The root may be valid despite the low support. We may defer the decision regarding such decompositions until more input provides more occurrences of the root. This leaves us with the results obtained with list A. Alternatively, we may consider these decompositions along with list A. Doing so can increase the recall, of course at the cost of precision.

In our experiment, in the decomposition kumAr = kumA + r (kumAr = kumA+r) the support for the root kumA (kumA) happens to be only 1, but its frequency is 74 (since kumAr occurs 74 times). So we can assume that kumA is not a valid root.
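
The interplay of root frequency and support described above can be sketched as a three-way decision. This is only an illustration under stated assumptions: the function names, the labels "accept", "reject" and "defer", and the frequency cutoff of 1 are hypothetical, and the word xyzte is an invented rare word. Only the kumAr count of 74 comes from the experiment.

```python
from collections import Counter


def root_frequency(tokens, decompositions):
    """Sum the corpus occurrences of every word decomposed to each root."""
    word_counts = Counter(tokens)
    frequency = Counter()
    # Count each (word, root) pair once, even if it is listed repeatedly.
    for word, root in {(w, r) for w, r, _suffix in decompositions}:
        frequency[root] += word_counts[word]
    return frequency


def triage(support, frequency, max_frequency=1):
    """Decide the fate of a root (the cutoffs are assumptions of this sketch)."""
    if support > 1:
        return "accept"   # enough distinct cases
    if frequency > max_frequency:
        return "reject"   # frequent, yet only ever seen in one case
    return "defer"        # too little evidence either way


tokens = ["kumAr"] * 74 + ["xyzte"]
decompositions = [("kumAr", "kumA", "r"), ("xyzte", "xyz", "te")]
frequency = root_frequency(tokens, decompositions)

# Both roots have support 1 in this toy corpus.
verdicts = {root: triage(1, frequency[root]) for root in ("kumA", "xyz")}
print(verdicts)  # {'kumA': 'reject', 'xyz': 'defer'}
```

Deferred decompositions can either be left out, which keeps only the results of list A, or accepted together with list A, which trades precision for recall.
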
Similarly, the root kth (kz) in the decomposition kthA = kth + A (kzA=kz+A) is not a valid root, since the support is 1 and the frequency is 160. But the root AgDokhr (aAgeDAKr) in the decomposition AgDokhrte = AgDokhr + te (aAgeDAKret=aAgeDAKr+et) is actually a valid root despite the support being 1. Its frequency is 1.

Overall, when we partition list B by the frequencies of the roots, we observe that the decompositions with smaller root frequencies include a higher proportion of valid decompositions than those with larger root frequencies. Also, among the decompositions in list B, few have very high root frequencies and most have small root frequencies. These figures are shown in Table 1.

Root frequency   No. of decompositions   No. of valid decompositions   Precision
1                3081                    1068                          34.66%
2                599                     142                           23.70%
3                292                     46                            15.75%
4                158                     16                            10.12%
5                95                      9                             9.47%
6                66                      4                             6.06%
7                56                      9                             16.07%
8                30                      2                             6.66%
9                26                      1                             3.84%
10               25                      4                             16.00%
11               24                      1                             4.16%
12-191           168                     14                            8.33%

Table 1: Precision of decompositions with low (=1) root support

When we consider the decompositions in list B with root occurrence 1 together with list A, we obtain a precision of 41.72% and a recall of 83.37%.

Further improvement of precision may not be obtained simply by considering root support and root occurrences. Considerations such as the lengths of suffixes and the lengths of roots may help. It is seen that among the decompositions in list B, the ones with longer roots are more likely to be valid. When we successively remove decompositions with root lengths below 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 and 13, the precision improves to 28.48%, 29.07%, 30.38%, 33.12%, 36.44%, 41.61%, 46.10%, 49.19%, 51.04%, 51.36% and 53.33% respectively. Of course, there are fewer decompositions with longer roots than with shorter roots.

We also feel that some sets of suffixes that are valid with a root can be defined by methods such as the one suggested in [4].
For a root with good support, it can be verified whether all its suffixes are included in any such set. If some are not, then the corresponding decompositions can be discarded. Such methods will help in tackling the second problem mentioned earlier, i.e., despite a root being valid, some of the decompositions involving it may be invalid.

7 Conclusion

In this work we have described an approach to improve the precision of stemming by considering occurrences of different inflected forms of root words in a corpus. This is an important aspect of a natural corpus which can be useful in unsupervised or semi-supervised acquisition of the morphology of a language.

References

[1] Porter, M. "An Algorithm for Suffix Stripping". Automated Library and Information Systems, vol. 14, no. 3, pp 130-137, 1980.

[2] Saravanan, M, P C Reghu Raj, Vadali Srinivasa Murty and S Raman. "Improved Porter's Algorithm for Root Word Stemming". International Conference on Natural Language Processing, Mumbai, 2002, 1821, pp 21-30.

[3] Sharma, Utpal, Jugal Kalita and Rajib Das. "Unsupervised Learning of Morphology for Building a Lexicon for Highly Inflectional Language". Workshop on Morphological and Phonological Learning, ACL-2002, Philadelphia, pp 1-10.

[4] Sharma, Utpal, Jugal Kalita and Rajib Das. "Classification of Words Based on Affix Evidence". International Conference on Natural Language Processing, Mumbai, 2002, pp 31-39.