MIT International Journal of Computer Science & Information Technology, Vol. 1, No. 1, Jan. 2011, pp. 28-31. ISSN 2230-7621 (Print Version), 2230-763X (Online Version). ©MIT Publications

Morphological and Lexical Analysis of the Sanskrit Sentences

Namrata Tapaswi, Dr. Suresh Jain
Chamelidevi Group of Institutions, Devi Ahilya University, Indore, MP (India). Email: [email protected]
Department of Computer Engineering, Institute of Engineering & Technology, Devi Ahilya University, Indore, MP (India). Email: [email protected]

Abstract- Sanskrit (संस्कृत), which is part of the Indo-European language family, is the basic language of India. It is considered the base language for all Indian languages. It is a free-word-order language (or syntax-free language), and changing the order of the words does not introduce ambiguity. This paper is concerned with the morphological and lexical analysis of Sanskrit sentences [1]. The lexical analyzer may be applied in the Machine Translation domain of Natural Language Processing (by mapping the parse tree of Sanskrit to that of another language). It may also be used for information retrieval.

Keywords: Morphology, Tagging, POS, Verb, Noun, Determiner.

1. INTRODUCTION
Understanding the actual sense of a word is very tricky. This paper explains an approach concerned with the morphological and lexical analysis of Sanskrit words. Here we have tried "sandhi-wichched" (सन्धि विग्रह). Sentence analysis is categorized into four parts (Fig. 1), i.e., morphological and lexical analysis, syntactic analysis, semantic analysis and context analysis.

A marker may be attached to a word, as in:
रामः (Ramaha)
रामौ (Ramau)
रामाः (Ramaha)
In the above example we can separate the markers as
Ram----aha
Ram----au
Ram----aha
By removing the marker or prefixes we get the root word.

2. SANDHI (सन्धि)
When two words in Sanskrit are combined to form one word, the rules specify the transformations that must be
applied depending on the vowel in the last letter of the first word and the vowel in the first letter of the second word. Altogether there are 5 + 14 + 7 = 26 crisp rules on which sandhi vigrah is carried out:
i. Swar sandhi (स्वर सन्धि): 5 rules
ii. Wyanjan sandhi (व्यञ्जन सन्धि): 14 rules
iii. Wisarg sandhi (विसर्ग सन्धि): 7 rules

In the formation of sandhi words, three cases arise:
(i) Effect on the first word only (i.e. the last letter of the first word changes):
देवः + वदति = देवो वदति
(ii) Effect on the second word only (the first letter of the second word changes):
रवि + इन्द्र = रवीन्द्र
(iii) Effect on both words (the last letter of the first word and the first letter of the second word change):
हिम + आलय = हिमालय

Fig. 1: Types of Sentence

In the morphological analyzer we find the root word. Consider a word that may be composed of two words; one may separate these words, e.g.
पठन्ति = पठ् + अन्ति
In another kind of example, a marker may be attached to the word.

3. DESIGN RULE FORMAT
For identification of the root word we have designed a rule format stored in the machine:
Effect On--result--left--right
For example, consider
इ + इ = ई (i + i = ii)
रवि + इन्द्र = रवीन्द्र
effectOn takes the values f, s, b:
f = first
s = second
b = both
Here, both the words (left and right) are changing. So, the rule format for the above rule will be:
Effect On--result--left--right
b--ii--i--i
Example: him--ii--alaya. If we get "ii", then add "i" on the left, i.e. him + "i", and add "i" on the right, i.e. "i" + alaya, so that we get him + alaya.

3.1 RULE FORMAT IN TEXT RULE FILE
Consider: अकः सवर्णे दीर्घः
अक्: अ, आ, इ, उ, ऋ, ऌ
सवर्ण स्वर: अ, आ, इ, ई, उ, ऊ, ऋ, ॠ, ऌ
Rule format in the text rule file:
Effect On--result--left--right
Effect On: character
f : if the result can be obtained by making changes in the first word.
s : if the result can be obtained by making changes in the second word.
b : if the result can be obtained by making changes in both words.
result: string
left: "or"-separated strings (e.g. अक्: अ|आ|इ|उ|ऋ|ऌ), written as string|string|......|string|
right: "or"-separated strings (e.g. सवर्ण स्वर: अ|आ|इ|ई|उ|ऊ|ऋ|ॠ|ऌ), written as string|string|......|string|

4. RULE STORAGE
For storing the rules, we have used files indexed on their starting letter. For the above case, b--ii--i--i, the rule will be in the file "i.txt". Similarly, words/root words are stored in files indexed on their starting letter, e.g. himalaya is stored in "h.txt". Only one word per line is stored in the corresponding file.

5. ALGORITHM
For Sandhi-Wichched (SW) we have designed an algorithm.
Algorithm 1: SW
Step 1: Begin
Step 2: Receive a word for doing sandhi wichched.
Step 3: Try iteratively to break the word into two parts (left and right), e.g. if the word is kastu, then try sandhi wichched for each of:
kast u
kas tu
ka stu
k astu
(Note: sandhi wigrah is word = leftWord + rightWord, i.e. the ending of leftWord and the start of rightWord get connected to form the sandhi. In order to detect the effect of the addition (making up the sandhi), we assume that the affected part is encapsulated in the right word, e.g. kastu = kaH + tu. So when we receive "kastu" for sandhi wichched, we take left as "ka" and right as "stu", and then pass "ka" (left) and "stu" (right) for trying sandhi wichched.)
Step 4: Now we have two words: left, right (e.g. ka, stu).
(i) Set a variable flag to false.
(ii) Read the rules from the rule file named after the starting letter of right (e.g. here it will open s.txt, since "stu" starts with the letter s).
(iii) Try applying each of the sandhi rules (present in the rule file, here s.txt) on left and right (i.e. on ka and stu).
(iv) If one or more rules listed in the corresponding rule file are applicable, then flag is set to true.
Step 5: End {of SW}
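As a rough illustration of Steps 3 and 4 and of the rule format Effect On--result--left--right, the following Python sketch parses a rule line, enumerates the candidate splits of a word, and tries rules whose effectOn is 'f'. The names Rule, parse_rule, candidate_splits, rule_file_name, undo_first_word_rule and sandhi_wichched are hypothetical, the rule files are replaced by an in-memory dictionary, and no lookup in the word database is performed. It is a sketch under those assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Rule:
    effect_on: str           # 'f' (first), 's' (second) or 'b' (both)
    result: str              # string produced by the sandhi
    left: Tuple[str, ...]    # alternatives for the end of the left word
    right: Tuple[str, ...]   # alternatives for the start of the right word


def _alternatives(field: str) -> Tuple[str, ...]:
    # "H|" -> ("H",); "t|th|" -> ("t", "th"); the trailing "|" yields no entry.
    return tuple(part for part in field.split("|") if part)


def parse_rule(line: str) -> Rule:
    # A rule line has exactly four "--"-separated fields.
    effect_on, result, left, right = line.strip().split("--")
    if effect_on not in ("f", "s", "b"):
        raise ValueError(f"unknown effectOn value: {effect_on!r}")
    return Rule(effect_on, result, _alternatives(left), _alternatives(right))


def candidate_splits(word: str) -> List[Tuple[str, str]]:
    # Step 3: kastu -> (kast, u), (kas, tu), (ka, stu), (k, astu)
    return [(word[:i], word[i:]) for i in range(len(word) - 1, 0, -1)]


def rule_file_name(right: str) -> str:
    # Step 4 (ii): rules are looked up by the first letter of the right part.
    return right[0] + ".txt"


def undo_first_word_rule(left: str, right: str, rule: Rule) -> List[Tuple[str, str]]:
    # Only handles effectOn == 'f': the result sits at the start of `right`.
    if rule.effect_on != "f" or not right.startswith(rule.result):
        return []
    right_guess = right[len(rule.result):]
    if not any(right_guess.startswith(start) for start in rule.right):
        return []
    return [(left + ending, right_guess) for ending in rule.left]


def sandhi_wichched(word: str, rule_files: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    # `rule_files` maps a file name such as "s.txt" to its rule lines.
    found: List[Tuple[str, str]] = []
    for left, right in candidate_splits(word):
        for line in rule_files.get(rule_file_name(right), []):
            found.extend(undo_first_word_rule(left, right, parse_rule(line)))
    return found


if __name__ == "__main__":
    rules = {"s.txt": ["f--s--H|--t|th|"]}
    print(sandhi_wichched("kastu", rules))  # [('kaH', 'tu')]
```

With the single rule f--s--H|--t|th| stored under "s.txt", the split (ka, stu) is the only one that matches, which reproduces the kastu = kaH + tu example. A fuller version would also confirm each candidate against the word files.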
6. SOFTWARE TOOL FOR SANDHI-WICHCHED
The system reads a word from the storage pool and tries to identify the meaning of the word. For identification of the word the following steps are used:
(i) Extract effectOn, result, left and right from the rule.
(ii) If the received secondString starts with result, then try to apply the rule result = left + right. Here result is encapsulated in secondString, so extract the right part from secondString by removing result from it, e.g. here secondString (stu) starts with result (s), so the rule f--s--H|--t|th| has a chance of performing sandhi wichched.
(iii) Separate left-result-right, e.g. ka-s-tu for kastu:
leftGuess : ka
result : s
rightGuess : tu

6.1 CHECK CERTAINTY OF SANDHI
During the formation of sandhi, the effect may be on the first word, on the second word, or on both words.

(A) EFFECT ON FIRST WORD
Rule example: f--y--i|ii|I|--a|aa|A|u|uu|U|e|ai|o|au|aM|aH|RRi|R^i|R^I|L^i|L^I|
(i) If on forming sandhi only the first word is affected and the second word remains as it is, then append something (depending on the rule) to leftGuess, with no changes to rightGuess.
(ii) Check for left + right = result: if the rightGuess word exists in the database, also check that the right part of the rule f--result--left--right could be applied (since both sandhi wigrah and sandhi formation on the given rule should be checked). Then, depending upon the presence of the left-hand side, one may declare the result.
(iii) left + right = sandhi: even if the left word is not present, since rightGuess is present in the database there are chances of sandhi vigrah, so record it in tryResult.txt. Since we are in effectOn == 'f', we need something to be added to leftGuess and then check its presence in the database.
(iv) If rightGuess is already present in the database, then, if leftGuess + append is also present in the database, one can be sure of sandhi vigrah. Even if leftGuess + append is not present in the database, but sandhi wichched of it (leftGuess + append) is possible, then also one can be sure of sandhi vigrah.
Example: नृपेष्वमात्येषु is not present in the database, but नृपेषु + अमात्येषु is possible.

(B) EFFECT ON SECOND WORD
Rule example: s--Dh--Sh|shh|T|Th|D|Dh|N|--dh|
(i) If on forming sandhi only the second word is affected and the first word remains as it is, then prefix something (depending on the rule) to rightGuess, with no changes to leftGuess.
(ii) Check for left + right = result: if the right (prefix + rightGuess) word exists in the database, check that the right part of the rule s--result--left--right could be applied. Then, depending upon the presence of the left-hand side, one may declare the result. Since we are in effectOn == 's', we need something to be prefixed to rightGuess and then check its presence in the database.
(iii) Even if the leftGuess word is not present, since rightGuess is present in the database there are chances of sandhi vigrah, so record it in tryResult.txt.
(iv) Check whether leftGuess ends with one of the strings present in the left part of the rule, since sandhi vigrah and sandhi formation should both be checked.
rule: effectOn--result--left--right
left: orSeparatedString (string|string|......|string|)
Example: for the rule s--Dh--Sh|shh|T|Th|D|Dh|N|--dh|
left: Sh|shh|T|Th|D|Dh|N|
e.g. leftGuess sheshh ends with shh (which is present in Sh|shh|T|Th|D|Dh|N|).

(C) EFFECT ON BOTH WORDS
Rule example: b--e--a|aa|A|--i|ii|I|
(i) If on forming sandhi both (first and second) words are affected, then (depending on the rule) append something to leftGuess and prefix something to rightGuess.
(ii) For each of the potential prefixes for rightGuess, append a postfix (extracted from the rule) to the left part and check the presence of (leftGuess + append) and (add + rightGuess) in the database.
(iii) Extract one of the prefixes for rightGuess from the rule's right part:
rule : b--e--a|aa|A|--i|ii|I|
right : i|ii|I|
potential prefixes: i, ii, I
(iv) If (add + rightGuess) and (leftGuess + append) are both present, then declare the result.
(v) left + right = sandhi: if the left word is not present but add + rightGuess is present in the database, there are chances of sandhi vigrah, so record it in tryResult.txt.
(vi) Now, for each prefix for rightGuess, extract one of the postfixes for leftGuess from the rule's left part:
rule : b--e--a|aa|A|--i|ii|I|
left : a|aa|A|
potential postfixes/appends: a, aa, A
(vii) If (add + rightGuess) is already present in the database, then, if (leftGuess + append) is also present in the database, one can be sure of sandhi vigrah.

7. SAMPLE RESULTS
One set of 100 words has been taken and manually evaluated, which gives the following results. A few of them are illustrated in Fig. 2 and Fig. 3.

Fig. 2: Screen shot of Sample Sentences
Fig. 3: Screen shot of Sanskrit Sentences

8. CONCLUSION
The concept analyzed in this paper has basically evolved to handle morphologically rich languages like Sanskrit. The concept is language independent. With some procedures removed, this model can be used for identification of the word as well as for a spellchecker. Our approach is based on morphological and lexical analysis of Sanskrit sentences. The lexical analyzer identifies the root word and its meaning. It can also be used for information retrieval; in this case the corpora should be free from spelling and grammatical errors.

REFERENCES
[1] Jurafsky D. & Martin J.H., "Speech and Language Processing", Pearson Education, first Indian reprint edition.
[2] Evangelos Dermatas, George Kokkinakis, "Automatic stochastic tagging of natural language texts", MIT Press, Cambridge, MA, USA.
[3] Doug Cutting, Julian Kupiec, Jan Pedersen, Penelope Sibun, "A practical part-of-speech tagger", in Proceedings of the Third Conference on Applied Natural Language Processing, March 31-April 03, 1992, Trento, Italy.
[4] Marie Meteer, Richard Schwartz, Ralph Weischedel, "Studies in part of speech labelling", Proceedings of the Workshop on Speech and Natural Language, pp. 331-336, February 19-22, 1991, Pacific Grove, California.
[5] Cuyckens, H. and Zawada, B. (eds.), Polysemy in Cognitive Linguistics. Amsterdam: John Benjamins. 2001.
[6] Dash, N.S. and Chaudhuri, B.B., "The process of designing a multidisciplinary monolingual sample corpus", International Journal of Corpus Linguistics. 5(2): 179.
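The certainty check described in Section 6.1 (C) for rules with effectOn = 'b' can be sketched in Python as follows. The function name check_both_words, the in-memory set standing in for the word database, and the example word gaNesha with its two dictionary entries are all hypothetical and chosen only for illustration; the rule b--e--a|aa|A|--i|ii|I| is the one quoted in the paper. This is a sketch under those assumptions, not the authors' tool.

```python
from typing import List, Set, Tuple

Pair = Tuple[str, str]


def check_both_words(
    left_guess: str,
    right_guess: str,
    appends: List[str],
    prefixes: List[str],
    dictionary: Set[str],
) -> Tuple[List[Pair], List[Pair]]:
    """Return (confirmed, to_try) splits for an effectOn == 'b' rule.

    confirmed: both (leftGuess + append) and (add + rightGuess) are known.
    to_try:    only (add + rightGuess) is known; the paper records such
               cases in tryResult.txt for a further attempt.
    """
    confirmed: List[Pair] = []
    to_try: List[Pair] = []
    for add in prefixes:                  # candidates from the rule's right part
        right_word = add + right_guess
        if right_word not in dictionary:
            continue                      # right word unknown: nothing to record
        for append in appends:            # candidates from the rule's left part
            left_word = left_guess + append
            if left_word in dictionary:
                confirmed.append((left_word, right_word))
            else:
                to_try.append((left_word, right_word))
    return confirmed, to_try


if __name__ == "__main__":
    # Hypothetical word: gaNesha separated as gaN-e-sha.
    words = {"gaNa", "iisha"}             # hypothetical dictionary entries
    confirmed, to_try = check_both_words(
        "gaN", "sha", ["a", "aa", "A"], ["i", "ii", "I"], words
    )
    print(confirmed)  # [('gaNa', 'iisha')]
    print(to_try)     # [('gaNaa', 'iisha'), ('gaNA', 'iisha')]
```

The to_try list is kept separate because the paper treats a known right word with an unknown left word as a possible, not a certain, sandhi vigrah.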