MIT International Journal of Computer Science & Information Technology Vol. 1 No. 1 Jan. 2011 pp. 28-31
ISSN 2230-7621 (Print Version) 2230-763X (Online Version) ©MIT Publications
Morphological and Lexical Analysis of the
Sanskrit Sentences
Namrata Tapaswi
Dr. Suresh Jain
Chamelidevi Group of Institutions
Devi Ahilya University, Indore, MP (India)
Email: [email protected]
Department of Computer Engineering,
Institute of Engineering & Technology,
Devi Ahilya University, Indore, MP (India)
Email: [email protected]
Abstract- Sanskrit (संस्कृत), which is part of the Indo-European
language family, is regarded as the basic language of India and
is considered the base language for all Indian languages. It is
a free word-order language (or syntax-free language): changing
the order of the words introduces no ambiguity. This paper is
concerned with the morphological and lexical analysis of
Sanskrit sentences [1]. The lexical analyzer may be applied in
the machine translation domain of natural language processing
(by mapping the parse tree of the Sanskrit language to that of
another language). It may also be used for information
retrieval.
Keywords: Morphology, Tagging, POS, Verb, Noun, Determiner.
1. INTRODUCTION
Understanding the actual sense of a word is very tricky. This
paper explains an approach concerned with the morphological
and lexical analysis of Sanskrit words. Here we attempt
"sandhi-wichched" (सन्धि विग्रह). The analysis of a sentence is
categorized into four parts (Fig. 1): morphological and lexical
analysis, syntactic analysis, semantic analysis and context
analysis.
Fig. 1: Types of Sentence
In the morphological analyzer we find the root word. Let us
consider a word which may be composed of two words. One may
separate these words, e.g.
पठन्ति = पठ् + अन्ति
Take another example, where a marker may be attached to the
word:
रामः (Ramaha)
रामौ (Ramau)
रामाः (Ramaaha)
In the above example we can separate the markers as
Ram----aha
Ram----au
Ram----aaha
where the part after the dashes is the marker. By removing the
markers (or prefixes) we get the root word.
2. SANDHI (सन्धि)
When two words in Sanskrit are combined to form one word, the
rules specify the transformations that must be applied
depending on the vowel in the last letter of the first word and
the vowel in the first letter of the second word. Altogether
there are 5 + 14 + 7 = 26 crisp rules on which sandhi vigrah is
carried out:
i. Swar sandhi (स्वर सन्धि): 5 rules
ii. Wyanjan sandhi (व्यञ्जन सन्धि): 14 rules
iii. Wisarg sandhi (विसर्ग सन्धि): 7 rules
In the formation of sandhi words, three cases arise:
(i) Effect on the first word only (i.e. the last letter of the
first word changes):
देवः + वदति = देवो वदति
(ii) Effect on the second word only (the first letter of the
second word changes):
रवि + इन्द्र = रविन्द्र
(iii) Effect on both words (the last letter of the first word
and the first letter of the second word change):
हिम + आलय = हिमालय
3. DESIGN RULE FORMAT
For identification of the root word we have designed a rule
format that is stored in the machine.
The rule format is:
Effect On -- result -- left -- right
i.e. इ + इ = ई
i + i = ii
e.g. रवि + इन्द्र = रविन्द्र
effectOn takes the values f, s, b:
f = first
s = second
b = both
Here, both the words (left and right) are changing. So, the
rule format for the above rule will be:
Effect On -- result -- left -- right
b -- ii -- i -- i
example: him--ii--alaya
If we get "ii", then add "i" on the left, i.e. him + "i", and
add "i" on the right, i.e. "i" + alaya, so that we get
him + alaya.
3.1 RULE FORMAT IN TEXT RULE FILE
Consider: अकः सवर्णे दीर्घः
अक्: अ, आ, इ, उ, ऋ, लृ
सवर्ण स्वर: अ, आ, इ, ई, उ, ऊ, ऋ, ॠ, लृ
Rule format in the text rule file:
Effect On--result--left--right
Effect On: character
f : if the result can be obtained by making changes in the first word.
s : if the result can be obtained by making changes in the second word.
b : if the result can be obtained by making changes in both.
result: string
left: or-separated strings (e.g. अक्: अ|आ|इ|उ|ऋ|लृ)
string|string|......|string|
right: or-separated strings (e.g. सवर्ण स्वर: अ|आ|इ|ई|उ|ऊ|ऋ|ॠ|लृ)
string|string|......|string|
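As an illustration of the text rule format above, the following is a minimal Python sketch of how one rule line could be read into its four fields. The names SandhiRule and parse_rule are hypothetical and do not come from the original tool; the sketch assumes the fields are separated by "--" and that each or-separated list ends with a trailing "|", as in the examples given in this paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SandhiRule:
    """One line of a rule file (hypothetical container, not the authors' code)."""
    effect_on: str   # 'f' (first), 's' (second) or 'b' (both)
    result: str      # string produced by the sandhi
    left: tuple      # alternatives for the end of the first word
    right: tuple     # alternatives for the start of the second word


def parse_rule(line: str) -> SandhiRule:
    """Parse 'effectOn--result--left--right' into a SandhiRule."""
    # Unpacking raises ValueError if the line does not have exactly four fields.
    effect_on, result, left, right = [p.strip() for p in line.strip().split("--")]
    if effect_on not in ("f", "s", "b"):
        raise ValueError(f"unknown effectOn value: {effect_on!r}")

    def alternatives(field: str) -> tuple:
        # 'a|aa|A|' -> ('a', 'aa', 'A'); the trailing '|' yields an empty
        # string, which is dropped.
        return tuple(s for s in field.split("|") if s)

    return SandhiRule(effect_on, result, alternatives(left), alternatives(right))


rule = parse_rule("b--e--a|aa|A|--i|ii|I|")
print(rule.effect_on, rule.result, rule.left, rule.right)
```

Keeping left and right as tuples makes them directly usable with str.startswith and str.endswith, which accept a tuple of alternatives.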
4. RULE STORAGE
To store the rules, we use files indexed on their starting
letter; i.e., for the above case, b--ii--i--i, the rule is
stored in the file "i.txt".
Similarly, words/root words are stored in files indexed on
their starting letter; i.e., himalaya is stored in "h.txt".
Only one word per line is stored in the corresponding file.
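The storage scheme can be sketched as follows. This is only an illustration: the function names are hypothetical, and the sketch assumes that a rule is filed under the first letter of its result field, which is what the example "i.txt" and the lookup in the algorithm of the next section suggest. Files are represented by an in-memory dictionary so that the sketch runs without touching the disk.

```python
from collections import defaultdict


def rule_file_name(rule_line: str) -> str:
    """File holding a rule; assumed to be keyed on the first letter of 'result'."""
    result = rule_line.split("--")[1].strip()
    return result[0] + ".txt"


def word_file_name(word: str) -> str:
    """File holding a word or root word, keyed on its first letter."""
    return word[0] + ".txt"


def group_by_file(words):
    """Map each file name to the words it would contain, one word per line."""
    files = defaultdict(list)
    for word in words:
        files[word_file_name(word)].append(word)
    return dict(files)


print(rule_file_name("b--ii--i|--i|"))   # i.txt
print(word_file_name("himalaya"))        # h.txt
print(group_by_file(["himalaya", "hima", "alaya"]))
```

Since the transliteration distinguishes upper and lower case (e.g. D and d), a real implementation may need to take care on file systems that ignore case in file names.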
5. ALGORITHM
For Sandhi-Wichched (SW) we have designed an algorithm.
Algorithm 1: SW
Step 1: Begin
Step 2: Receive a word for doing sandhi wichched.
Step 3: Try iteratively to break the word into two parts
(left and right),
e.g. if the word is kastu then:
kast u
kas tu
ka stu
k astu
and try sandhi wichched on each pair. (Note: sandhi wigrah is
word = leftWord + rightWord, i.e. the ending of leftWord and
the start of rightWord get connected to form the sandhi. In
order to detect the effect of the addition (making up the
sandhi), we assume that the affected part is encapsulated in
the right word, e.g. kastu = kaH + tu. So when we receive
"kastu" for sandhi wichched, we take left as "ka" and right as
"stu", and then pass "ka" (left) and "stu" (right) for trying
sandhi wichched.)
Step 4: Now we have two words: left, right (e.g. ka, stu).
(i) Set a variable flag to false.
(ii) Read the rules from the rule file named after the
starting letter of right (e.g. here it will open s.txt,
since "stu" starts with the letter s).
(iii) Try applying each of the sandhi rules (present in the
rule file; here, the sandhi rules present in s.txt) on
left and right (i.e. on ka and stu).
(iv) If one or more rules listed in the corresponding rule
file are applicable, then flag is set to true.
Step 5: End {of SW}
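The outer loop of Algorithm 1 can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the rule files are replaced by a dictionary keyed on the starting letter, and a rule is treated as applicable as soon as the right part starts with the rule's result (the fuller certainty checks are described in Section 6.1).

```python
def candidate_splits(word):
    """Yield (left, right) pairs with both parts non-empty, longest left first."""
    for cut in range(len(word) - 1, 0, -1):
        yield word[:cut], word[cut:]


def sandhi_wichched(word, rules_by_letter):
    """Return (flag, matches); flag is True if at least one rule is applicable."""
    matches = []
    for left, right in candidate_splits(word):
        # Step 4(ii): look up the rules filed under the first letter of 'right'.
        for rule_line in rules_by_letter.get(right[0], []):
            result = rule_line.split("--")[1]
            # Step 4(iii): the rule can only apply if 'right' starts with 'result'.
            if right.startswith(result):
                matches.append((left, right, rule_line))
    return bool(matches), matches


# Stand-in for the contents of s.txt (rule taken from Section 6).
rules = {"s": ["f--s--H|--t|th|"]}

print(list(candidate_splits("kastu")))
print(sandhi_wichched("kastu", rules))
```

For "kastu" the splits are tried in the order kast/u, kas/tu, ka/stu, k/astu, and only ka/stu finds a rule in the stand-in for s.txt.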
6. SOFTWARE TOOL FOR SANDHI-WICHCHED
The system reads a word from the storage pool and tries to
identify the meaning of the word. For identification of the
word the following steps are used:
(i) Extract effectOn, result, left and right from the rule.
(ii) If the received secondString starts with result, then
try to apply the rule
result = left + right
Here result is encapsulated in secondString, so extract
the right part from secondString by removing result
from it,
e.g. here secondString (stu) starts with result (s), so
the rule f--s--H|--t|th| has a chance of performing
sandhi wichched.
(iii) Separate left-result-right,
e.g. ka-s-tu for (kastu):
leftGuess : ka
result : s
rightGuess : tu
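The separation step can be sketched in a few lines of Python. The names Separation and separate are hypothetical; the sketch only assumes the rule format of Section 3.1 and follows steps (i) to (iii) above.

```python
from typing import NamedTuple, Optional


class Separation(NamedTuple):
    effect_on: str
    left_guess: str
    result: str
    right_guess: str


def separate(first: str, second: str, rule_line: str) -> Optional[Separation]:
    """Split 'second' into result + rightGuess if the rule can apply, else None."""
    # Step (i): extract the fields of the rule.
    effect_on, result, _left, _right = rule_line.split("--")
    # Step (ii): the rule is a candidate only if 'second' starts with 'result'.
    if not second.startswith(result):
        return None
    # Step (iii): remove 'result' from the front of 'second'.
    return Separation(effect_on, first, result, second[len(result):])


print(separate("ka", "stu", "f--s--H|--t|th|"))
```

For the running example this yields leftGuess "ka", result "s" and rightGuess "tu".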
6.1 CHECK CERTAINTY OF SANDHI
During the formation of sandhi, the effect may be on the first
word, on the second word, or on both words.
(A) EFFECT ON FIRST WORD:
rule example: f--y--i|ii|I|--a|aa|A|u|uu|U|e|ai|o|au|aM|aH|RRi|R^i|R^I|L^i|L^I|
(i) If, on forming the sandhi, only the first word is affected
and the second word remains as it is, then append something
(depending on the rule) to the leftGuess; make no changes to
the rightGuess.
(ii) Check for left + right = result: if the rightGuess word
exists in the database, also check whether the right part of
the rule f--result--left--right could be applied (since both
sandhi wigrah and sandhi formation on the given rule should be
checked). Then, depending on the presence of the left-hand
side, one may declare the result.
(iii) left + right = sandhi: even if the left word is not
present, since the rightGuess is present in the database there
is a chance of sandhi vigrah, so record it in tryResult.txt.
Since we are in effectOn == 'f', something needs to be added to
the leftGuess, and its presence in the database is then
checked.
(iv) If the rightGuess is already present in the database and
leftGuess+append is also present in the database, then one can
be sure of the sandhi vigrah.
Even if leftGuess+append is NOT present in the database, if
sandhi wichched of leftGuess+append is itself possible, then
one can also be sure of the sandhi vigrah.
Example:
नृपेष्वमात्येषु is not present in the database
but
नृपेषु + अमात्येषु is possible
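The certainty check can be sketched as follows. The sketch is written for the effect-on-first case and extended in the same way to the 's' and 'b' values of effectOn; it is an illustration under assumptions, not the original tool. The function name, the verdict labels "sure" and "try" (the latter standing for an entry recorded in tryResult.txt) and the sample words are hypothetical, the database is modelled as a Python set, and the recursive case in which leftGuess+append is itself split again is left out.

```python
def split_alts(field):
    """'a|aa|A|' -> ['a', 'aa', 'A'] (the trailing '|' is ignored)."""
    return [s for s in field.split("|") if s]


def check_certainty(left_guess, right_guess, rule_line, database):
    """Return (left_word, right_word, verdict) triples for one rule."""
    effect_on, _result, left_field, right_field = rule_line.split("--")
    left_alts, right_alts = split_alts(left_field), split_alts(right_field)

    if effect_on == "f":
        # Only the first word changed: the right word is kept and must start
        # with one of the rule's right strings; the left ending is restored.
        if not right_guess.startswith(tuple(right_alts)):
            return []
        pairs = [(left_guess + a, right_guess) for a in left_alts]
    elif effect_on == "s":
        # Only the second word changed: the left word is kept and must end
        # with one of the rule's left strings; the right start is restored.
        if not left_guess.endswith(tuple(left_alts)):
            return []
        pairs = [(left_guess, p + right_guess) for p in right_alts]
    elif effect_on == "b":
        # Both words changed: try every append/prefix combination.
        pairs = [(left_guess + a, p + right_guess)
                 for a in left_alts for p in right_alts]
    else:
        raise ValueError(f"unknown effectOn value: {effect_on!r}")

    verdicts = []
    for left_word, right_word in pairs:
        if right_word in database:
            # Both words known: sure. Only the right word known: record to try.
            verdict = "sure" if left_word in database else "try"
            verdicts.append((left_word, right_word, verdict))
    return verdicts


print(check_certainty("ka", "tu", "f--s--H|--t|th|", {"kaH", "tu"}))
print(check_certainty("ka", "tu", "f--s--H|--t|th|", {"tu"}))
```

With both kaH and tu in the database the split kastu = kaH + tu is reported as sure; with only tu present it is reported as a candidate to try.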
(B) EFFECT ON SECOND WORD:
rule example: s--Dh--Sh|shh|T|Th|D|Dh|N|--dh|
(i) If, on forming the sandhi, only the second word is affected
and the first word remains as it is, then prefix something
(depending on the rule) to the rightGuess; make no changes to
the leftGuess,
e.g. the leftGuess sheshh ends with shh (which is present in
Sh|shh|T|Th|D|Dh|N|).
(ii) Check for left + right = result: if the right
(prefix+rightGuess) word exists in the database, check whether
the right part of the rule s--result--left--right could be
applied. Then, depending on the presence of the left-hand side,
one may declare the result.
Since we are in effectOn == 's', something needs to be prefixed
to the rightGuess, and its presence in the database is then
checked.
(iii) Even if the leftGuess word is not present, since the
rightGuess is present in the database there is a chance of
sandhi vigrah, so record it in tryResult.txt.
(iv) Check whether the leftGuess ends with one of the strings
present in the left part of the rule, since sandhi vigrah and
sandhi formation should both be checked.
rule: effectOn--result--left--right
right: orSeparatedString (string|string|......|string|)
example:
for rule s--Dh--Sh|shh|T|Th|D|Dh|N
left: Sh|shh|T|Th|D|Dh|N|
(C) EFFECT ON BOTH THE WORDS
rule example: b--e--a|aa|A|--i|ii|I|
(i) If, on forming the sandhi, both (first and second) words
are affected, then (depending on the rule) append something to
the leftGuess and prefix something to the rightGuess.
(ii) For each potential prefix for the rightGuess, append a
postfix (extracted from the rule) to the left part and check
the presence of (leftGuess+append) and (add+rightGuess) in the
database.
(iii) Extract one of the prefixes for the rightGuess from the
rule's right:
rule : b--e--a|aa|A|--i|ii|I|
right: i|ii|I|
potential prefix: i, ii, I
(iv) If (add+rightGuess) AND (leftGuess+append) are both
present, then declare the result.
(v) left + right = sandhi: if the left word is not present, but
add+rightGuess is present in the database, there is a chance of
sandhi vigrah, so record it in tryResult.txt.
(vi) Now, for each prefix for the rightGuess, extract one of
the postfixes for the leftGuess from the rule's left:
rule : b--e--a|aa|A|--i|ii|I|
left : a|aa|A|
potential postfix/append: a, aa, A
(vii) If (add+rightGuess) is already present in the database
and (leftGuess+append) is also present in the database, then
one can be sure of the sandhi vigrah.
7. SAMPLE RESULTS
A set of 100 words was taken and evaluated manually. A few of
the results are illustrated below in Fig. 2 and Fig. 3.
8. CONCLUSION
The concept analyzed in this paper has evolved to handle
morphologically rich languages like Sanskrit. The concept is
language independent. With some procedures removed, this model
can be used for identification of words as well as for spell
checking. Our approach is based on morphological and lexical
analysis of Sanskrit sentences. The lexical analyzer identifies
the root word and its meaning. It can also be used for
information retrieval; in this case the corpora should be free
from spelling and grammatical errors.
Fig. 2: Screen shot of Sample Sentences
Fig. 3: Screen shot of Sanskrit Sentences
REFERENCES
[1] Jurafsky, D. and Martin, J.H., "Speech and Language
Processing", Pearson Education, first Indian reprint edition.
[2] Dermatas, E. and Kokkinakis, G., "Automatic stochastic
tagging of natural language texts", MIT Press, Cambridge, MA,
USA.
[3] Cutting, D., Kupiec, J., Pedersen, J. and Sibun, P., "A
practical part-of-speech tagger", Proceedings of the Third
Conference on Applied Natural Language Processing, March
31-April 03, 1992, Trento, Italy.
[4] Meteer, M., Schwartz, R. and Weischedel, R., "Studies in
part of speech labelling", Proceedings of the Workshop on
Speech and Natural Language, pp. 331-336, February 19-22, 1991,
Pacific Grove, California.
[5] Cuyckens, H. and Zawada, B. (eds.), Polysemy in Cognitive
Linguistics. Amsterdam: John Benjamins, 2001.
[6] Dash, N.S. and Chaudhuri, B.B., "The process of designing a
multidisciplinary monolingual sample corpus", International
Journal of Corpus Linguistics, 5(2): 179.