Suffix Stripping Based NER in Assamese for Location Names

Padmaja Sharma
Department of CSE
Tezpur University
Assam, India 784028
[email protected]

Utpal Sharma
Department of CSE
Tezpur University
Assam, India 784028
[email protected]

Jugal Kalita
Department of CS
University of Colorado at Colorado Springs
Colorado, USA 80918
[email protected]

Abstract—Named Entity Recognition (NER) is the process of
identifying and classifying proper nouns in text documents into
pre-defined classes such as person, location and organization.
It plays an important role in Natural Language Processing
applications. Although NER in Indian languages is a difficult and
challenging task and suffers from scarcity of resources, such work
has started to appear recently. In highly inflectional languages
such as Assamese, NER requires identification of the root forms
of words that occur in texts. Our work reports a suffix stripping
approach to identify those roots of words which are location
named entities.
I. INTRODUCTION
In natural language processing (NLP), identification of
proper nouns in texts is an important task. This is referred
to as Named Entity Recognition (NER). NER is the process
of identifying and classifying proper nouns into predefined
categories such as person, location and organization. It
generally requires morphological analysis of the input words,
i.e., analysing how the input words are created from basic units
called morphemes. For NER, identification of the root form
of a word, i.e., stemming, is required. Stemming is the process
of reducing inflected words to their stem or base or root form.
For example, the stem of the word “governing” is “govern”.
In Indian languages such as Assamese or Bengali, we often
find words which themselves are not named entities, but their
root words are named entities. For example none of,
, or
is a named entity, but the root
, or
is. A considerable amount of work on stemming can be
found in English, but not much work has been done in Indian
languages. This paper presents our work in Assamese. We use
a suffix stripping approach to derive a root word that results
in a named entity. We find that this approach is effective for
location named entities. So our work concentrates on deriving
a location named entity from a given input word.
The paper is organized as follows. Section II describes the
morphological characteristics of the Assamese language.
Section III describes previous work on stemming. Section IV
describes the different techniques used for stemming, Section V
describes our approach, and Section VI concludes the paper.
II. MORPHOLOGICAL CHARACTERISTIC OF ASSAMESE LANGUAGE
Morphological analysis is an essential task in any
Natural Language Processing application. It deals with the
identification of the internal structure of the words of a
language. The objective of the NER task is to detect and
categorize all named entities in a document into pre-defined
classes such as person, location and organization. Assamese,
an Indo-European language spoken by over 30 million people
in North East India and a national language of India, is a
morphologically rich language. Till now, to the best of our
knowledge, no work on Assamese NER has been reported.
Ours is the first such work. Assamese is written using the
Assamese script. It comprises 11 vowels, 34 consonants and 10
digits. There are no uppercase or lowercase letters in Assamese
script. Assamese is a relatively free word order language. For
example, the sentence
I will go home today.
can be written in any of the forms given below,
although the first form is more common.
In Assamese a single root word may have different
morphological variants. For example,
,
, and
are morphological variants of the word
.
Suffixes found in Assamese for location named entities are
described below.
• Plain suffix: A plain suffix like , or combines with a
root word to produce its morphological variant.
Eg: = + .
• Complex suffix: A complex suffix is one which is
formed by combining one or more consonants with a
plain suffix.
Eg: = + .
Like other Indian languages, Assamese also follows a specific
pattern of suffixation which is:
<token> = <root/stem word> + <inflection>.
Eg:
=
+ .
=
+ .
III. PREVIOUS WORK
A considerable amount of work exists in English for
stemming. Most of the work was done using rule-based
approaches [1][2]. There are a number of stemming algorithms
for English such as Paice/Husk [3], Lovins Stemmer [2],
Dawson [4], Krovetz [5] and Porter Stemmer [1]. Among
these, Porter Stemmer is the most prevalent one since it can be
applied to other languages as well. It is a rule-based stemmer
with five steps, each of which applies a set of rules. Indian
language work on stemming includes [6], [7], [8], [9] and
[10]. Larkey et al. [6] developed a Hindi light weight stemmer
by using a manually constructed list of 27 common suffixes
that included gender, number, tense and nominalization. A
similar approach was used by Ramanathan and Rao [7], who
used a list of 65 inflectional suffixes. Chen and Gey
[8] described a statistical Hindi stemmer for evaluating the
performance of a Hindi information retrieval system. Similar
work was described in Dasgupta and Ng [9] for a Bengali
morphological analyser. Sharma et al. [10] focussed on word
stemming in Assamese, particularly in identifying the root
words automatically when lexicon is unavailable. Work on
Assamese can also be found in Sharma et al. [11] where
the authors developed a lexicon and a morphology rule base.
Sharma et al. [12] discussed issues in Assamese morphology,
where the presence of a large number of suffixal determiners,
sandhi, and the use of suffix sequences leave almost half of the
words inflected. Sarkar and Bandyopadhyay [13] presented the
design of a rule-based stemmer for Bengali. Paik and Parui
[14] used an n-gram approach for three languages, namely
Hindi, Bengali and Marathi. Majgaonker and Siddique [15]
described a rule-based and unsupervised Marathi stemmer.
Similarly a suffix stripping based morphological analyser for
Malayalam was described in R et al. [16]. Kumar and Rana
[17] described a stemmer for Punjabi using a brute force
technique. Work on a morphological analyser and a stemmer
for Nepali was described in Bal and Shrestha [18].
IV. APPROACHES TO STEMMING
There are several types of stemming algorithms that differ
with respect to performance and accuracy. The following are
some types of algorithms used in stemming:
• Brute-force algorithm: A brute-force stemmer employs a
lookup table which contains relations between root words
and inflected words. To stem a word, the table is queried
to find a matching inflection. If a matching inflection is
found, the root form is returned. For example, if we enter
the word “cats”, it looks up “cats” in the table and, when
the match is found, returns the root “cat”.
• Production technique: A production algorithm attempts
to infer the probable inflections for a given word. Rather
than try to remove suffixes, the goal of a production
algorithm is to generate suffixes. For example, if the word
is “run”, the algorithm might automatically generate the
form “running”.
• Suffix stripping algorithm: It does not rely on a lookup
table that consists of inflected forms and root form
relations. A list of rules is created and stored. These
rules provide a path for the algorithm, given an input
word form, to find its root form. For example, if the word
ends in “es”, it removes “es”.
• Suffix substitution: This algorithm is an improvement
over suffix stripping. Here instead of removing the suffix,
it replaces it with some other suffix. For example, “ies”
is replaced with “y”.
• Matching algorithm: Such an algorithm uses a stem
database. In order to stem a word, the algorithm tries to
match it with stems from the database, applying various
constraints such as the one based on relative lengths of
the candidate stems within the word. For example, the
short prefix “be”, which is the stem of such words as
“be”, “been”, “being” would be considered as the stem
of the word “beside” as well. This is a deficiency of this
method.
• Affix stemmer: The term affix refers to either a prefix or
a suffix. This approach performs stemming by removing
word prefixes and suffixes. It removes the longest possible
string of characters from a word according to a set
of rules. For example, given the word “indefinitely”, it
identifies that the leading “in” is a prefix that can be
removed.
V. OUR APPROACH
We have used an Asomiya Pratidin Corpus of nearly
300,000 word forms. The main aim of this paper is to generate
the root word from a given input word resulting in a location
named entity. Some examples are given in Table I.
TABLE I
EXAMPLES
Root    Surface form    Suffix
To derive the root words for our purpose, we strip suffixes.
It is a fast process as the search is done only on the suffix. We
have listed some suffixes that identify a location, such as
[ , , , ] etc. It can also be seen that some suffixes apply
to both person names and locations: [ , , ]. For
example: [ , ], [ , ].
Assamese uses derivation to add additional features to the root
word to change its grammatical category. Examples of such
derivation are given in Table I. These words are formed by adding
suffixes such as [ , , , , ] to location names, after
which they are no longer named entities. But when these words
are stemmed using the suffix stripping approach, they produce
named entities which fall under the location category.
Below are the steps used in our approach. The
input is an Assamese word which occurs in a given text. The
key file contains all possible suffixes, such as [ , , ] etc.,
that combine with the location named entities. The output is
the root word, that is, the location named entity obtained after
stemming the input word.
Step 1: Input the word.
Step 2: Search the input word for the suffixes listed in the
key file.
Step 3: If a suffix is found, go to Step 4; otherwise go to
Step 5.
Step 4: Remove the suffix of the word along with the trailing
characters.
Step 5: Exit.
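The steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the minimum root length guard, the longest-suffix-first ordering and the requirement that the suffix sit at the end of the word are all assumptions. The three suffixes are common Assamese case markers (locative, genitive, dative) chosen for illustration; they are not the contents of the key file used in the experiments.

```python
from typing import Iterable, List, Optional

# Hypothetical key-file contents: three Assamese case markers used purely
# for illustration. They are NOT the suffix list used in the paper.
ILLUSTRATIVE_SUFFIXES = ["ত", "ৰ", "লৈ"]


def load_suffixes(lines: Iterable[str]) -> List[str]:
    """Read one suffix per line, drop blanks, and order longest first."""
    suffixes = {line.strip() for line in lines if line.strip()}
    # Longest first so that a longer suffix wins over a shorter one;
    # ties are broken alphabetically to keep the order deterministic.
    return sorted(suffixes, key=lambda s: (-len(s), s))


def strip_suffix(word: str, suffixes: List[str],
                 min_root_len: int = 2) -> Optional[str]:
    """Steps 2-4: return the candidate root, or None if no suffix matches.

    min_root_len is an assumed guard against stripping a word down to
    almost nothing; it is not part of the original procedure.
    """
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_root_len:
            return word[: -len(suffix)]
    return None
```

With this sketch, `strip_suffix("গুৱাহাটীত", load_suffixes(ILLUSTRATIVE_SUFFIXES))` yields the place name গুৱাহাটী (Guwahati). The place name ভাৰত, which happens to end in the same character as the locative marker, is wrongly cut to ভাৰ, since nothing in a pure suffix list can tell a real suffix from a root ending.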
The above process can be implemented using finite state
automata as well to improve efficiency, provided the list of
suffixes is known during program coding.
The goal of our experiment is to calculate the level of accuracy
the proposed approach can achieve. The statistics of the
training data are given in Table II. To calculate the accuracy
we have used the following metrics: Precision, Recall and
F-measure.
Recall is the number of relevant results returned divided by all
relevant results, whereas Precision is the number of relevant
results returned divided by the total number of results returned.
F-measure is the harmonic mean of precision and recall. Thus
mathematically,
Precision (P) = S/K    (1)
Recall (R) = S/T    (2)
F = 2*(P*R)/(P+R)    (3)
where S = number of relevant NEs returned,
K = total NEs retrieved,
T = total NEs present.
Our experiments with the proposed method produced an
F-measure of 88% for location named entities.
TABLE II
DATA USED FOR TESTING
Feature of Dataset                 Training Data
Total words                        20,000
Total NE present (T)               475
Total NE retrieved (K)             565
Total relevant NE retrieved (S)    465
The performance of this stemmer is poor for those words
which do not have suffixes attached to the root word, or whose
trailing characters match an entry in the suffix list. For
example [ , ]. Moreover, NER in Assamese also
faces some challenges such as the ambiguity that exists between
location names and person names. For example, . The
following are some examples of correct and incorrect results
obtained by our experiment.
= (correct)
= (correct)
= (incorrect)
= (incorrect)
VI. CONCLUSION
We have tested a very simple and fast method for location
NER. Though the approach is simple, interestingly, it performs
reasonably well and gives an F-measure of nearly 90%. Our
experimental results also produce some errors described in
Section V. Further investigation has to be done in order to
remove these errors. One may design a rule-based approach
to overcome such difficulties or may also develop a lexicon
containing a list of locations and then apply a stemming
approach using the lexicon. Moreover, Assamese being a
highly inflectional language, special attention is required
to handle morphological and engineering issues. Use of
contextual information may be incorporated to reduce errors
due to ambiguity.
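The F-measure quoted in this section can be recomputed from the counts in Table II using the definitions of precision and recall given in Section V. The sketch below does only this arithmetic. The function and parameter names are ours, chosen for illustration. The three counts are the S, K and T values from Table II.

```python
from typing import Tuple


def precision_recall_f(relevant_returned: int, retrieved: int,
                       present: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) from raw counts.

    relevant_returned is S, retrieved is K and present is T in the
    notation of Section V.
    """
    precision = relevant_returned / retrieved
    recall = relevant_returned / present
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Counts taken from Table II: S = 465, K = 565, T = 475.
p, r, f = precision_recall_f(465, 565, 475)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")  # prints "P=0.823 R=0.979 F=0.894"
```

Since F simplifies to 2S/(K+T), the same figure can be checked by hand as 930/1040, or about 0.89.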
REFERENCES
[1] M. F. Porter, “An algorithm for suffix stripping,” Program, vol. 14,
pp. 130–137, 1980.
[2] J. B. Lovins, “Development of a stemming algorithm,” Mechanical
Translation and Computational Linguistics, vol. 11, pp. 22–31, 1968.
[3] C. D. Paice, “An evaluation method for stemming algorithms,” in
Proceedings of the 17th Annual International ACM SIGIR Conference
on Research and Development in Information Retrieval, 1994, pp. 42–
50.
[4] J. Dawson, “Suffix removal and word conflation,” ALLC Bulletin, vol.
2(3), pp. 33–46, 1974.
[5] R. Krovetz, “Viewing morphology as an inference process,” in
Proceedings of the 16th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, 1993, pp. 191–
202.
[6] L. S. Larkey, M. E. Connell, and N. Abduljaleel, “Hindi CLIR in thirty
days,” ACM Transactions on Asian Language Information Processing,
vol. 2(2), pp. 130–142, 2003.
[7] A. Ramanathan and D. Rao, “A lightweight stemmer for Hindi,” in
Proceedings of the 10th Conference of the European Chapter of the
Association for Computational Linguistics, South Asian Languages
Workshop, 2003, pp. 42–48.
[8] A. Chen and F. C. Gey, “Generating statistical Hindi stemmers from
parallel texts,” ACM Transactions on Asian Language Information
Processing, vol. 2(3), 2003.
[9] S. Dasgupta and V. Ng, “Unsupervised morphological parsing of
Bengali,” Language Resources and Evaluation, pp. 311–330, 2006.
[10] U. Sharma, J. Kalita, and R. Das, “Root word stemming by multiple
evidence from corpus,” in Proceedings of the 6th International
Conference on Computational Intelligence and Natural Computing
(CINC-2003), North Carolina, September 2003, pp. 26–30.
[11] ——, “Unsupervised Learning of Morphology for Building Lexicon for
a Highly Inflectional Language,” in Proceedings of the 6th Workshop
of the ACL Special Interest Group in Computational Phonology
(SIGPHON), Philadelphia, July 2002, pp. 1–10.
[12] ——, “Acquisition of Morphology of an Indic Language from Text
Corpus,” ACM Transactions on Asian Language Information Processing
(TALIP), vol. 7, August 2008.
[13] S. Sarkar and S. Bandyopadhyay, “Design of a Rule-based Stemmer for
Natural Language Text in Bengali,” in Proceedings of the IJCNLP-08
Workshop on NLP for Less Privileged Languages, Hyderabad, India,
January 2008, pp. 65–72.
[14] J. H. Paik and S. K. Parui, “A Simple Stemmer for Inflectional
Languages,” 2008.
[15] M. M. Majgaonker and T. Siddique, “Discovering Suffixes for
Marathi Language,” International Journal on Computer Science and
Engineering, vol. 2(8), pp. 2716–2720, 2010.
[16] R. R, R. N, and E. Sherley, “A Suffix Stripping Based Morph
Analyzer for Malayalam Language,” in Proceedings of 20th Kerala
Science Congress, 2008, pp. 482–484.
[17] D. Kumar and P. Rana, “Design and Development of a Stemmer for
Punjabi,” International Journal of Computer Applications (0975-8887),
vol. 11(12), December 2010.
[18] B. K. Bal and P. Shrestha, “A Morphological Analyzer and a Stemmer
for Nepali,” PAN Localization, Working Papers 2004-2007,
pp. 324–31.