Suffix Stripping Based NER in Assamese for Location Names

Padmaja Sharma, Department of CSE, Tezpur University, Assam, India 784028
Utpal Sharma, Department of CSE, Tezpur University, Assam, India 784028
Jugal Kalita, Department of CS, University of Colorado at Colorado Springs, Colorado, USA 80918

Abstract: Named Entity Recognition (NER) is the process of identifying and classifying proper nouns in text documents into pre-defined classes such as person, location and organization. It plays an important role in Natural Language Processing applications. NER in Indian languages is a difficult and challenging task that suffers from a scarcity of resources, but such work has started to appear recently. In highly inflectional languages such as Assamese, NER requires identification of the root forms of the words that occur in texts. Our work reports a suffix stripping approach to identify those roots of words which are location named entities.

I. INTRODUCTION

In natural language processing (NLP), identification of proper nouns in texts is an important task. This is referred to as Named Entity Recognition (NER). NER is the process of identifying and classifying proper nouns into predefined categories such as person, location and organization. It generally requires morphological analysis of the input words, i.e., analysing how the input words are created from basic units called morphemes. For NER, identification of the root form of a word, i.e., stemming, is required. Stemming is the process of reducing inflected words to their stem, base or root form. For example, the stem of the word “governing” is “govern”. In Indian languages such as Assamese or Bengali, we often find words which are not themselves named entities, but whose root words are named entities. For example, none of , , or is a named entity, but the root , or is.
A considerable amount of work on stemming can be found for English, but not much work has been done for Indian languages. This paper presents our work in Assamese. We use a suffix stripping approach to derive a root word that is a named entity. We find that this approach is effective for location named entities, so our work concentrates on deriving a location named entity from a given input word.

The paper is organized as follows. Section II describes the morphological characteristics of the Assamese language. Section III describes previous work on stemming. Section IV describes the different techniques used for stemming, Section V describes our approach, and Section VI concludes the paper.

II. MORPHOLOGICAL CHARACTERISTICS OF ASSAMESE LANGUAGE

Morphological analysis is an essential task in any Natural Language Processing application. It deals with the identification of the internal structure of the words of a language. The objective of the NER task is to detect and categorize all named entities in a document into pre-defined classes such as person, location and organization. Assamese, an Indo-European language spoken by over 30 million people in North East India and a national language of India, is a morphologically rich language. To the best of our knowledge, no work on Assamese NER has been reported so far; ours is the first such work.

Assamese is written using the Assamese script, which comprises 11 vowels, 34 consonants and 10 digits. There are no uppercase or lowercase letters in the Assamese script. Assamese is a relatively free word order language. For example, the sentence I will go home today. can be written in any of the following forms given below, although the first form is more common.

In Assamese a single root word may have different morphological variants. For example, , , and are morphological variants of the word . Suffixes found in Assamese for location named entities are described below.
• Plain suffix: A plain suffix like , or combines with a root word to produce its morphological variant. E.g.: = + .
• Complex suffix: A complex suffix is one which is formed by combining one or more consonants with a plain suffix. E.g.: = + .

Like other Indian languages, Assamese follows a specific pattern of suffixation, which is: <token> = <root/stem word> + <inflection>. E.g.: = + . = + .

III. PREVIOUS WORK

A considerable amount of work exists on stemming for English. Most of the work was done using rule-based approaches [1][2]. There are a number of stemming algorithms for English, such as Paice/Husk [3], the Lovins stemmer [2], Dawson [4], Krovetz [5] and the Porter stemmer [1]. Among these, the Porter stemmer is the most prevalent one since it can be applied to other languages as well. It is a rule-based stemmer with five steps, each of which applies a set of rules.

Indian language work on stemming includes [6], [7], [8], [9] and [10]. Larkey et al. [6] developed a lightweight Hindi stemmer using a manually constructed list of 27 common suffixes that covered gender, number, tense and nominalization. A similar approach was used by Ramanathan and Rao [7], who used 65 of the most common inflectional suffixes. Chen and Grey [8] described a statistical Hindi stemmer for evaluating the performance of a Hindi information retrieval system. Similar work was described by Dasgupta and Ng [9] for a Bengali morphological analyser. Sharma et al. [10] focussed on word stemming in Assamese, particularly on identifying root words automatically when a lexicon is unavailable. Work on Assamese can also be found in Sharma et al. [11], where the authors developed a lexicon and a morphology rule base. Sharma et al. [12] discussed issues in Assamese morphology, where the presence of a large number of suffixal determiners, sandhi and suffix sequences makes almost half of the words inflected. Sarkar and Bandyopadhyay [13] presented the design of a rule-based stemmer for Bengali.
Paik and Parui [14] used an n-gram approach for three languages, namely Hindi, Bengali and Marathi. Majgaonker and Siddique [15] described rule-based and unsupervised Marathi stemmers. Similarly, a suffix stripping based morphological analyser for Malayalam was described in R et al. [16]. Kumar and Rana [17] described a stemmer for Punjabi using a brute-force technique. Work on a morphological analyser and a stemmer for Nepali was described in Bal and Shrestha [18].

IV. APPROACHES TO STEMMING

There are several types of stemming algorithms that differ with respect to performance and accuracy. The following are some types of algorithms used in stemming:

• Brute-force algorithm: A brute-force stemmer employs a lookup table which contains relations between root words and inflected words. To stem a word, the table is queried to find a matching inflection. If a matching inflection is found, the root form is returned. For example, if we enter the word “cats”, it searches the table for “cats” and, when the match is found, returns the root “cat”.
• Production technique: A production algorithm attempts to infer the probable inflections for a given word. Rather than trying to remove suffixes, the goal of a production algorithm is to generate suffixes. For example, if the word is “run”, the algorithm might automatically generate the form “running”.
• Suffix stripping algorithm: It does not rely on a lookup table of inflected forms and root form relations. Instead, a list of rules is created and stored. Given an input word form, these rules provide a path for the algorithm to find its root form. For example, if the word ends in “es”, it removes “es”.
• Suffix substitution: This algorithm is an improvement over suffix stripping. Instead of removing the suffix, it replaces it with some other suffix. For example, “ies” is replaced with “y”, so that “ponies” becomes “pony”.
• Matching algorithm: Such an algorithm uses a stem database. In order to stem a word, the algorithm tries to match it with stems from the database, applying various constraints such as one based on the relative lengths of the candidate stems within the word. For example, without such a constraint the short prefix “be”, which is the stem of such words as “be”, “been” and “being”, would be considered the stem of the word “beside” as well. This is a deficiency of this method.
• Affix stemmer: The term affix refers to either a prefix or a suffix. This approach performs stemming by removing word prefixes and suffixes. It removes the longest possible string of characters from a word according to a set of rules. For example, given the word “indefinitely”, it identifies that the leading “in” is a prefix that can be removed.

V. OUR APPROACH

We have used an Asomiya Pratidin Corpus of nearly 300,000 word forms. The main aim of this paper is to generate, from a given input word, the root word that is a location named entity. Some examples are given in Table I.

TABLE I
EXAMPLES
Root    Surface form    Suffix

To derive the root words for our purpose, we strip suffixes. It is a fast process, as the search is done only on the suffix. We have listed some suffixes that identify a location, such as [ , , , ] etc. It can also be seen that some suffixes apply to person names as well as locations: [ , , ]. For example: [ , ],[ , ].

Assamese uses derivation to add additional features to the root word to change its grammatical category. Examples of such derivation are given in Table I. These words are formed by adding suffixes such as [ , , , , ] to location names, after which they are no longer named entities. But when these words are stemmed using the suffix stripping approach, they produce named entities which fall under the location category.

Below are the steps used in our approach. The input is an Assamese word which occurs in a given text. The key file contains all possible suffixes, such as [ , , ] etc., that combine with location named entities.
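The key-file-driven suffix stripping described above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function names `load_suffixes` and `strip_location_suffix` and the `min_root_len` guard are hypothetical, the suffix is assumed to be matched at the end of the word with longer suffixes tried first, and the romanized suffixes `t`, `r` and `loi` are illustrative stand-ins rather than the contents of the actual key file.

```python
from typing import Iterable, List, Optional


def load_suffixes(lines: Iterable[str]) -> List[str]:
    """Read one suffix per line, drop blanks and duplicates, sort longest first."""
    suffixes = {line.strip() for line in lines if line.strip()}
    # Longest-first ordering (an assumption) so a longer suffix wins over a shorter one.
    return sorted(suffixes, key=lambda s: (-len(s), s))


def strip_location_suffix(
    word: str, suffixes: List[str], min_root_len: int = 2
) -> Optional[str]:
    """Return the candidate root if a known suffix ends the word, else None."""
    for suffix in suffixes:
        root_len = len(word) - len(suffix)
        # min_root_len is a hypothetical guard against stripping the whole word.
        if word.endswith(suffix) and root_len >= min_root_len:
            return word[:root_len]
    return None


if __name__ == "__main__":
    # Hypothetical key file contents (romanized placeholders, not the paper's list).
    key_file_lines = ["t", "r", "loi"]
    suffixes = load_suffixes(key_file_lines)

    for token in ["guwahatit", "guwahatiloi", "tezpur", "dilli"]:
        print(token, "->", strip_location_suffix(token, suffixes))
    # guwahatit -> guwahati
    # guwahatiloi -> guwahati
    # tezpur -> tezpu
    # dilli -> None
```

In this toy run, `tezpur` is reduced to `tezpu` because the root itself happens to end in a character from the suffix list, and `dilli` is left unresolved because it carries no suffix. Both cases show why a pure suffix-stripping method needs a lexicon or contextual checks to confirm that the candidate root really is a location.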
The output is the root word, that is, the location named entity obtained after stemming the input word.

Step 1: Input the word.
Step 2: Search the input word for the suffixes listed in the key file.
Step 3: If a suffix exists, then
Step 4: Remove the suffix of the word along with the trailing characters.
Step 5: Exit.

The above process can be implemented using finite state automata as well to improve efficiency, provided the list of suffixes is known during program coding.

The goal of our experiment is to calculate the level of accuracy the proposed approach can achieve. The statistics of the training data are given in Table II. To calculate the accuracy we have used the following metrics: Precision, Recall and F-measure. Recall is the number of relevant results returned divided by the number of all relevant results, whereas Precision is the number of relevant results returned divided by the total number of results returned. F-measure is the harmonic mean of precision and recall. Thus, mathematically,

Precision (P) = S/K          (1)
Recall (R) = S/T             (2)
F = 2*(P*R)/(P+R)            (3)

where S = number of relevant NEs returned, K = total number of NEs retrieved, and T = total number of NEs present. Our experiments with the proposed method produced an F-measure of 88% for location named entities.

TABLE II
DATA USED FOR TESTING
Feature of Dataset                 Training Data
Total words                        20,000
Total NE present (T)               475
Total NE retrieved (K)             565
Total relevant NE retrieved (S)    465

The performance of this stemmer is poor for words which do not have suffixes attached to the root word, or in which the trailing characters of the root word match an entry in the suffix list. For example [ , ]. Moreover, NER in Assamese also faces challenges such as the ambiguity that exists between location names and person names. For example, . The following are some examples of correct and incorrect results obtained in our experiment.

 =  (correct)
 =  (correct)
 =  (incorrect)
 =  (incorrect)

VI. CONCLUSION

We have tested a very simple and fast method for location NER.
Though the approach is simple, interestingly, it performs reasonably well and gives an F-measure of nearly 90%. Our experiments also produce some errors, as described in Section V. Further investigation has to be done in order to remove these errors. One may design a rule-based approach to overcome such difficulties, or develop a lexicon containing a list of locations and then apply a stemming approach using the lexicon. Moreover, since Assamese is a highly inflectional language, special attention is required to handle morphological and engineering issues. Contextual information may be incorporated to reduce errors due to ambiguity.

REFERENCES

[1] M. Porter, “An algorithm for suffix stripping,” Program, vol. 14, 1980, pp. 130–137.
[2] J. B. Lovins, “Development of a stemming algorithm,” Mechanical Translation and Computational Linguistics, vol. 11, 1968, pp. 22–31.
[3] C. D. Paice, “An Evaluation method for Stemming algorithm,” in Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1990, pp. 42–50.
[4] J. Dawson, “Suffix Removal and word Conflation,” ALLC Bulletin, vol. 2(3), pp. 33–46, 1974.
[5] R. Krovetz, “Viewing morphology as an inference process,” in Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1993, pp. 191–202.
[6] L. S. Larkey, M. E. Connel, and N. A. Jaleel, “Hindi CLIR in Thirty days,” ACM Transactions on Asian Language Information Processing, vol. 2(2), pp. 130–142, 2003.
[7] A. Ramanathan and D. Rao, “A light weight stemmer for Hindi,” in Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics, Workshop on South Asian Languages, 2003, pp. 42–48.
[8] A. Chen and F. C. Grey, “Generating statistical Hindi stemmer from Parallel texts,” ACM Transactions on Asian Language Information Processing, vol. 2(3), 2003.
[9] S. Dasgupta and V. Ng, “Unsupervised morphological parsing of Bengali,” Language Resources and Evaluation, pp. 311–330, 2006.
[10] U. Sharma, J. Kalita, and R. Das, “Root word stemming by multiple evidence from corpus,” in Proceedings of the 6th International Conference on Computational Intelligence and Natural Computing (CINC-2003), North Carolina, September 2003, pp. 26–30.
[11] U. Sharma, J. Kalita, and R. Das, “Unsupervised Learning of Morphology for Building Lexicon for a Highly Inflectional Language,” in Proceedings of the 6th Workshop of the ACL Special Interest Group in Computational Phonology (SIGPHON), Philadelphia, July 2002, pp. 1–10.
[12] U. Sharma, J. Kalita, and R. Das, “Acquisition of Morphology of an Indic Language from Text Corpus,” ACM Transactions on Asian Language Information Processing (TALIP), vol. 7, August 2008.
[13] S. Sarkar and S. Bandyopadhyay, “Design of a Rule-based Stemmer for Natural Language Text in Bengali,” in Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages, Hyderabad, India, January 2008, pp. 65–72.
[14] J. H. Paik and S. K. Parui, “A Simple Stemmer for inflectional languages,” 2008.
[15] M. M. Majgaonker and T. Siddique, “Discovering Suffixes for Marathi Language,” International Journal on Computer Science and Engineering, vol. 2(8), pp. 2716–2720, 2010.
[16] R. R, R. N, and E. Sherley, “A Suffix Stripping Based Morph Analyzer for Malayalam language,” in Proceedings of 20th Kerala Science Congress, 2008, pp. 482–484.
[17] D. Kumar and P. Rana, “Design and Development of a stemmer for Punjabi,” International Journal of Computer Applications (0975-8887), vol. 11(12), December 2010.
[18] B. K. Bal and P. Shrestha, “A Morphological Analyzer and a Stemmer for Nepali,” PAN Localization Working Papers 2004–2007, pp. 324–31.