Tagalog Support for LanguageTool

Nathaniel Oco
De La Salle University - Manila
2401 Taft Avenue, Malate, Manila City
1004 Metro Manila, Philippines
+639178477549
[email protected]

Allan Borra
De La Salle University - Manila
2401 Taft Avenue, Malate, Manila City
1004 Metro Manila, Philippines
+639174591073
[email protected]

ABSTRACT

This paper outlines the processes and issues involved in adding Tagalog support to LanguageTool. LanguageTool is an open-source rule-based style and grammar checker that implements a manual-based rule-creation approach. Details of the different LanguageTool resources are discussed in this paper. The linguistic considerations, technical considerations, and language properties of Tagalog that were captured and handled are also discussed and outlined. The system was tested using 50 correct and 50 incorrect sentences collected from different sources. LanguageTool processed the correct sentences in 53 milliseconds and the incorrect sentences in 80 milliseconds. The Tagalog support scored 95.83% for precision, 46% for recall, and 72% for accuracy.

Categories and Subject Descriptors
F.4.2 [Mathematical Logic and Formal Languages]: Grammars and Other Rewriting Systems - grammar types. I.5.0 [Pattern Recognition]: General. I.7.1 [Document and Text Processing]: Document and Text Editing - languages, spelling.

General Terms
Algorithms, Languages.

Keywords
Tagalog, Grammar Checking, LanguageTool, Rules, NLP Tools.

1. INTRODUCTION

LanguageTool, developed by [15], is an open-source rule-based style and grammar checker that implements a manual-based rule-creation approach. It is publicly available through LanguageTool's website [13]. LanguageTool has a growing list of supported languages. The authors of this paper developed and submitted a new language support, Tagalog support, for LanguageTool to provide a readily available Tagalog grammar checker [16]. LanguageTool version 1.5, released on September 25, 2011, already includes Tagalog support.

Aside from Tagalog, LanguageTool also supports Asturian, Belarusian, Breton, Catalan, Chinese, Czech, Danish, Dutch, English, Esperanto, French, Galician, Icelandic, Italian, Khmer, Lithuanian, Malayalam, Polish, Russian, Slovak, Slovenian, Spanish, Swedish, Ukrainian, and Romanian. This paper aims to explain the processes and issues involved in adding a new language support, Tagalog support, to LanguageTool.

2. GRAMMAR CHECKING

Grammar checking is the process of detecting whether there is an error in an input. Mark Johnson [Johnson, personal communication] added that grammar checking entails locating the error and notifying the user about it. [5] agrees, adding that grammar checking also entails providing feedback, which can include possible corrections with linguistic explanations. Grammar checkers can then be defined as programs that detect whether there is an error in an input, locate the error, notify the user about it, and provide relevant feedback.

[15] identified three approaches to grammar checking: the syntax-based approach, the statistics-based approach, and the rule-based approach.

The syntax-based approach relies on parsing and grammar formalisms (e.g. CFG, LFG). An error is detected if parsing fails, and the error is located using tree structures, graphs, and other methods. Examples of Filipino grammar checkers that use this approach are PanPam [10] and [6].

The statistics-based approach relies on a properly annotated corpus (e.g. Penn Treebank, Brown Corpus) to train the language model. An error is detected and located using probability. [15] explained that sequences describing correct sentences will occur often in the corpus, while sequences describing incorrect sentences will occur less often or not at all.

The rule-based approach relies on rules, which are matched against the input to check and locate errors. LanguageTool is an example of this approach.
[11] classifies grammar checkers under this approach into two types: manual-based and automatic-based. Manual-based grammar checkers use manual means to develop rules, while automatic-based grammar checkers use automatic means to develop rules. In LanguageTool, rules are manually created and are added or modified incrementally.

Proceedings of the 8th National Natural Language Processing Research Symposium, pages 64-71, De La Salle University, Manila, 24-25 November 2011

3. POS TAGGING

Part-of-speech (POS) is a lexical category that defines the function of a word. POS tagging (POST) is the process of labeling words with their POS. The list of POS used to label words is called a tagset. POS tagging is heavily used in LanguageTool.

4. LANGUAGETOOL

LanguageTool can run as a stand-alone program or as an extension to word processors such as OpenOffice¹ or LibreOffice². It uses two major linguistic resources to perform grammar checking: the tagger dictionary and the rule file.

As a stand-alone program, LanguageTool splits an input into sentences. Each sentence is split into words, and each word is assigned a tag based on the declarations in the tagger dictionary. The words and their tags are then checked against the rules in the rule file. If there is a match, feedback is shown to the user. Figure 1 shows a screenshot of LanguageTool as a stand-alone program and Figure 2 shows a screenshot of LanguageTool as an OpenOffice extension. Both figures demonstrate the feedback mechanism of LanguageTool.

Figure 2. LanguageTool as an OpenOffice Extension

5. LANGUAGETOOL RESOURCES

This section discusses and outlines the LanguageTool resources (the tagger dictionary and the rule file) and the tagset used for the tagger dictionary.

5.1 Tagger Dictionary

The tagger dictionary is a text file used to tag words in an input. Each declaration in the text file follows a tab-separated three-column format.
The first column is the token, the second column is the base form of the token, and the third column is the tag of the token. The interpretation of the base form is left to the discretion of the language maintainer; it could be the root word or any other interpretation. In total, the Tagalog tagger dictionary has approximately 8,000 word declarations. Because Tagalog is a newly supported language, this number is small compared with the 350,000-word English tagger dictionary and the 530,000-word French tagger dictionary. Table 1 shows sample entries from the Tagalog tagger dictionary.

Table 1. Tagalog Tagger Dictionary Declarations

Token          Base Form      Tag
mapanirang     mapanira       ADMO S
anim           anim           ADNU
sa             sa             DECN DAT
ng             ng             DECN GEN
ho             ho             MAHM
nila           nila           PNGP RD P
nilang         nila           PNGP RD P
pumupusta      pumupusta      VACF IN B
yakapin        yakapin        VOBF NE B
kakukumpisal   kakukumpisal   VOTF RC B
kalilibing     kalilibing     VOTF RC B

Figure 1. LanguageTool as a stand-alone program

¹ http://www.openoffice.org/
² http://www.libreoffice.org/

5.2 Tagset for the Tagger Dictionary

[17] developed a tagset for Tagalog using the Penn Treebank³ tagset as a guide. [14] modified this tagset with the aid of linguists and experts in the field. The original tagset proposed by [17] and the modifications by [14] were used as the basis for developing the tagset for the tagger dictionary.

The tagset for the tagger dictionary defines the different tags used in word declarations. The tagset is composed of POS and lexical categories, which can be followed by one or more attributes separated by white space. The third column in Table 1 shows sample tag declarations. The table in the appendix shows the tagset developed for the tagger dictionary.

³ http://www.cis.upenn.edu/~treebank/

5.3 Rule File

The rules are stored in an XML file. The input is matched against patterns in the rule file, which describe incorrect sentences. A rule file is composed of elements and attributes. The three main elements of a rule file are pattern, message, and example. Pattern refers to the token or sequence of tokens to be matched. Message refers to the feedback, which can include possible suggestions. Example refers to incorrect and correct sentences demonstrating the rule's usage. Figure 3 shows a rule from the Tagalog rule file.

    <rule id="NOUN_SI_PUNCT" name="noun si punct (si noun punct)">
      <pattern case_sensitive="no" mark_from="0" mark_to="-1">
        <token postag="(NPRO|NCOM).*" postag_regexp="yes"/>
        <token regexp="yes">si|sina</token>
        <token postag="(PSNP|PSNQ|PSNE|PSNC)" postag_regexp="yes"/>
      </pattern>
      <message>Do you mean <suggestion><match no="2" case_conversion="startlower"/> \1</suggestion>? Irregular POS sequence due to transposition of words.</message>
      <short>Exchange Word Positions</short>
      <example correction="si Maria" type="incorrect">Maganda <marker>Maria si</marker>.</example>
      <example type="correct">Maganda <marker>si Maria</marker>.</example>
    </rule>

Figure 3. Rule File

The Tagalog rule file has approximately 500 lines covering incorrect patterns caused by wrong words, missing words, and transposition of words. The English rule file has approximately 10,000 lines and the French rule file has approximately 25,000 lines. With these numbers, the Tagalog support can still be considered to be in its initial stage. In most rule files, the message element is written in the language being supported. The Tagalog rule file uses English in the message element so that other users of the system can also understand the feedback.

6. LINGUISTIC CONSIDERATIONS

This section discusses and outlines the different linguistic considerations in developing Tagalog support for LanguageTool.

6.1 POS and Lexical Categories

[19] and [18] identified several Tagalog POS. These are article or pantukoy, noun or pangngalan, pronoun or panghalip, verb or pandiwa, adjective or pang-uri, adverb or pang-abay, preposition or pang-ukol, conjunction or pangatnig, interjection or pandamdam, and ligature or pang-angkop. These POS, except pantukoy, together with other lexical categories (e.g. markers and particles), compose the tagset. The table in the appendix shows the POS and lexical categories used.

6.2 Ligatures

According to [18], ligatures, or pang-angkop in Tagalog, are words or morphemes that link a modifier to the word being modified. Figure 4 shows two examples.

    Maganda     Babae   =   Magandang babae
    Beautiful   Woman   =   Beautiful woman
    Adjective   Noun

    Mabilis     Bata    =   Mabilis na bata
    Fast        Child   =   Fast child
    Adjective   Noun

Figure 4. Ligature Usage

To handle words with ligatures, a separate set of tags would need to be allotted. However, this would result in a large tagset. To solve this issue, the second column in the Tagalog tagger dictionary is the same as the first column, except that ligatures are omitted. The words, with ligatures omitted, serve as the base form of the token (second column).

6.3 Tagset Attributes

Additional attributes were considered. [9] and [8] both proposed tagsets with additional attributes in order to appropriately model their language and to address language-specific ambiguities and issues.

The tagset for the Tagalog tagger dictionary contains attributes. General categorization and semantic classes were considered as noun attributes. Grammatical person and plurality were considered as pronoun attributes. Verb focus, verb aspect, and plurality were considered as verb attributes. Plurality was considered as an adjective attribute. The POS of the word being modified was considered as an adverb attribute. The morphological case was considered as a determiner attribute. These attributes aid in better classifying Tagalog language properties, linguistic phenomena, and Tagalog words.

7. TECHNICAL CONSIDERATIONS

This section discusses and outlines the different technical considerations in developing Tagalog support for LanguageTool.

7.1 Tagger Dictionary File Size

In most NLP tools, large dictionary and grammar file sizes affect performance and file storage. One issue with the tagger dictionary is the file size. For instance, the English tagger dictionary in .txt file format has a file size of 8 million bytes, while the French tagger dictionary has a file size of 17 million bytes. To address this issue, Morfologik⁴ was used to encode the tagger dictionary into a finite-state-automaton-encoded (FSA-encoded) .dict file. Using Morfologik, the English tagger dictionary was reduced to 1 million bytes, while the French tagger dictionary was reduced to 500 thousand bytes. Since Tagalog support is still in its initial stage, there is little difference in file size between the .txt format and the FSA-encoded format.

7.2 Regular Expressions

Another way of reducing the file size is to use regular expressions to express patterns and other elements in the rule file. LanguageTool uses the standard regular expression engine of Java⁵. Figure 5 and Figure 6 illustrate the advantage of using regular expressions in a rule. The eleven patterns in Figure 5 can be reduced to the single pattern in Figure 6, reducing the number of lines and the space occupied.

    <pattern><token>din</token></pattern>
    <pattern><token>ding</token></pattern>
    <pattern><token>dito</token></pattern>
    <pattern><token>ditong</token></pattern>
    <pattern><token>daw</token></pattern>
    <pattern><token>diyan</token></pattern>
    <pattern><token>diyang</token></pattern>
    <pattern><token>dyan</token></pattern>
    <pattern><token>dyang</token></pattern>
    <pattern><token>doon</token></pattern>
    <pattern><token>doong</token></pattern>

Figure 5. Pattern Matching

    <pattern>
      <token regexp="yes">ding?|dito|ditong|daw|diyang?|dyang?|doong?</token>
    </pattern>

Figure 6. Using Regular Expressions

7.3 Populating the Tagger Dictionary

The words from the literature domain in [14] were used for the tagger dictionary. [2] proposed a stemming algorithm for Tagalog words. The algorithm they proposed was used, with the aid of a spreadsheet application, to tag words.

7.4 File Compilation

The LanguageTool community uses Apache Ant⁶ to compile the extension file, which is needed in OpenOffice and LibreOffice. However, once the files have been compiled, changing the tagger dictionary or the rule file requires the files to be recompiled.

LanguageTool uses Subversion⁷ for version control and uploads daily snapshots⁸. This allows language maintainers to provide regular updates to tagger dictionaries and rule files. It also allows users to download the latest version of the program.

7.5 Rule-Creation Standards

Apache Ant can also be used to check the rules for errors. For example, the number of tokens to be highlighted must match the number of tokens in the suggestion and in the examples. These standards ensure that the rule files are free of syntax errors.

8. TESTING AND RESULTS

A total of 100 sentences (50 correct and 50 incorrect) from [3], [7], [12], FiSSAn [1], LEFT [4], PanPam [10], and previous test data were used to test the Tagalog support. LanguageTool processed the correct sentences in 53 milliseconds and the incorrect sentences in 80 milliseconds.

LanguageTool properly marked 49 out of 50 correct sentences as correct and 23 out of 50 incorrect sentences as incorrect. The Tagalog support scored 95.83% for precision (23 over 24), 46% for recall (23 over 50), and 72% for accuracy (72 over 100).

Figure 7 shows the 27 incorrect sentences marked as correct. Sentences 1 to 9 contain free word order errors. Sentences 10 and 11 contain predicates in the plural form. Sentences 12 to 27 contain either missing words or transposed words. The low recall rate can be attributed to three things: (1) lack of rules or grammar checking coverage; (2) incorrect or erroneous tagger dictionary declarations; and (3) insufficient word entries.
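The reported scores can be re-derived from the four outcome counts given in this section. The following is a minimal sketch in Python, assuming that a "positive" is a sentence flagged as incorrect; the function name confusion_metrics and the variable names are illustrative and are not part of LanguageTool.

```python
def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Return precision, recall, and accuracy as fractions between 0 and 1."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Counts taken from the reported results; "positive" means flagged as incorrect.
TP = 23        # incorrect sentences flagged as incorrect
FN = 50 - TP   # incorrect sentences marked as correct (the 27 in Figure 7)
TN = 49        # correct sentences marked as correct
FP = 50 - TN   # correct sentences flagged as incorrect (1)

metrics = confusion_metrics(TP, FP, FN, TN)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Under these assumptions the sketch gives 23/24 for precision, 23/50 for recall, and 72/100 for accuracy, which matches the percentages reported above.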
⁴ Morfologik is available at: http://sourceforge.net/projects/morfologik/files/morfologik-stemming/
⁵ Standard regular expression engine of Java: http://download.oracle.com/javase/1,5.0/docs/api/java/util/regex/Pattern.html
⁶ http://ant.apache.org/
⁷ LanguageTool's Subversion repository: https://languagetool.svn.sourceforge.net/svnroot/languagetool/trunk/JLanguageTool
⁸ http://www.languagetool.org/download/snapshots/

1. Ang bumili lalaki ng isda sa tindahan.
2. Nagbigay sa ng libro babae ang lalaki.
3. Binigyan ng babae ng libro dalaga ang.
4. Binigyan mabait ng regalo ang batang.
5. Maganda sumasayaw si Rosa na.
6. Pinalo tatay ng makulit batang ang.
7. Bumili ng Manuel si medias para sa dalaga.
8. Bumili si Maria ng libro tungkol sa pag-ibig.
9. Ang pusang mataba pinakain isda ng.
10. Nagsisikain na si Maria ng hapunan.
11. Nagsipangisda na si Ben.
12. Ay kumain.
13. Si Janyll ay.
14. Ay si Janyll maganda.
15. Si Janyll kumain ay.
16. Kumain ay si Janyll.
17. Maganda ay si Janyll.
18. Mabait si Martee sa.
19. Kumain si Martee si Justin.
20. Aalis ako ikaw.
21. Umalis nagnakaw.
22. Sumama maganda.
23. Umalis ang nagnakaw ang ninakawan.
24. Sumama ang malakas ang maganda.
25. Kumain uminom si Martee.
26. Nawawala mabilis ang tumakbo.
27. Sumama totoo ang maganda.

Figure 7. Incorrect Sentences Detected as Correct

9. FINAL NOTES

Although the Tagalog support scored 95.83% for precision, the recall rate is low at 46%. This clearly highlights the fact that Tagalog support is still in its early stages. However, it is important to note that the linguistic resources of LanguageTool are sufficient to handle the different Tagalog language properties and linguistic phenomena.

Further development can focus on improving the grammar checking coverage by adding more rules to the rule file and by improving the tagger dictionary and the tagset. Future work can also focus on other LanguageTool functionalities, such as automatic detection of code switching.

Overall, LanguageTool is a novel NLP tool that provides a readily available grammar checker.

10. ACKNOWLEDGMENTS

The authors acknowledge LanguageTool's developers and language maintainers for their assistance.

11. REFERENCES

[1] Ang, M., Cagalingan, S., Tan, P., and Tan, R. 2002. FiSSAn: Filipino Sentence Syntax and Semantic Analyzer. Undergraduate Thesis. De La Salle University, Manila, Philippines.

[2] Bonus, D. and Roxas, R. 2004. A Stemming Algorithm for Tagalog Words. In Proceedings of the 4th Philippine Computing Science Congress (Laguna, Philippines, February 14-15, 2004). PCSC '04. Computing Science of the Philippines. ISSN=1908-1146.

[3] Cena, R. and Ramos, T. 1990. Modern Tagalog: Grammatical Explanations and Exercises for Non-native Speakers. University of Hawaii Press, Honolulu, HI.

[4] Chan, E., Lim, I., Tan, R., and Tong, M. 2006. LEFT: Lexical Functional Grammar Based English-Filipino Translator. Undergraduate Thesis. De La Salle University, Manila, Philippines.

[5] Clément, L., Gerdes, K., and Marlet, R. 2009. A Grammar Correction Algorithm: Deep Parsing and Minimal Corrections for a Grammar Checker. In Proceedings of the 14th Conference on Formal Grammar (Bordeaux, France, July 25-26, 2009). FG '09. Springer-Verlag, Berlin, Germany, 47-63. DOI= http://doi.acm.org/10.1007/978-3-642-20169-1_4.

[6] Dimalen, D. and Dimalen, E. 2007. An OpenOffice Spelling and Grammar Checker Add-in Using an Open Source External Engine as Resource Manager and Parser. In Proceedings of the 4th National Natural Language Processing Research Symposium (Manila, Philippines, June 14-16, 2007). NNLPRS '07. 69-73.

[7] Dimalen, E. 2003. A Parsing Algorithm for Constituent Structures of Tagalog. Graduate Thesis. De La Salle University, Manila, Philippines.

[8] Divjak, D., Erjavec, T., Feldman, A., Kopotev, M., and Sharoff, S. 2008. Designing and Evaluating a Russian Tagset. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (Marrakech, Morocco, May 28-30, 2008). LREC '08. European Language Resources Association, Paris, France, 279-285. ISBN=2-9517408-4-0.

[9] Feldman, A. and Hana, J. 2010. A Positional Tagset for Russian. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (Valletta, Malta, May 19-21, 2010). LREC '10. European Language Resources Association, Paris, France, 1277-1284. ISBN=2-9517408-6-7.

[10] Jasa, M., Palisoc, J., and Villa, M. 2007. Panuring Pampanitikan (PanPam): A Sentence Syntax and Semantic Based Grammar Checker for Filipino. Undergraduate Thesis. De La Salle University, Manila, Philippines.

[11] Konchady, M. 2009. Detecting Grammatical Errors in Text using a Ngram-based Ruleset. Retrieved October 6, 2011 from Emustru: http://emustru.sourceforge.net/detecting_grammatical_errors.pdf.

[12] Kroeger, P. 1993. Phrase Structure and Grammatical Relations in Tagalog. CSLI Publications, Stanford, CA.

[13] LanguageTool. http://www.languagetool.org/.

[14] Miguel, D. 2007. Comparative Analysis of Tagalog Part-of-Speech (POS) Taggers. Graduate Thesis. De La Salle University, Manila, Philippines.

[15] Naber, D. 2003. A Rule-Based Style and Grammar Checker. Diploma Thesis. Bielefeld University, Bielefeld.

[16] Oco, N. and Borra, A. 2011. A Grammar Checker for Tagalog using LanguageTool. In Proceedings of the 9th Workshop on Asian Language Resources collocated with IJCNLP 2011 (Chiang Mai, Thailand, November 12-13, 2011). ALR '11. Asian Federation of Natural Language Processing. 2-9. ISBN 978-974-466-565-2.

[17] Rabo, V. 2004. TPOST: A Template-based N-gram Part-of-Speech Tagger for Tagalog. Graduate Thesis. De La Salle University, Manila, Philippines.

[18] Ramos, T. 1971. Makabagong Balarila ng Pilipino. Rex Bookstore, Manila, Philippines.

[19] Santos, L. 1939. Balarila ng Wikang Pambansa. Institute of Philippine Language, Manila, Philippines.

APPENDIX

Table 2.
Tagset for the Tagger Dictionary

Noun: [tag] [general categorization] [semantic class]
    NPRO   Proper Noun
    NCOM   Common Noun
    NABB   Abbreviation

Pronoun: [tag] [grammatical person] [plurality]
    PANP   "ang" Pronouns
    PNGP   "ng" Pronouns
    PSAP   "sa" Pronouns
    PAND   "ang" Demonstratives
    PNGD   "ng" Demonstratives
    PSAD   "sa" Demonstratives
    PFOP   Found Pronouns
    PINP   Interrogative Pronouns
    PCOP   Comparison Pronouns
    PIDP   Indefinite Pronouns
    POTH   Other

Verb: [focus] [aspect] [plurality]
    VACF   Actor Focus
    VOBF   Object / Goal Focus
    VBEF   Benefactive Focus
    VLOF   Locative Focus
    VINF   Instrument Focus
    VOTF   Other

Adjective: [tag] [plurality]
    ADMO   Modifier
    ADCO   Comparative
    ADSU   Superlative
    ADNU   Numeral
    ADUN   Unaffixed
    ADOT   Other

Adverb: [tag] [modifies]
    AVMA   Manner
    AVNU   Numeral
    AVDE   Definite
    AVEO   Comparison, group I
    AVET   Comparison, group II
    AVCO   Comparative, group I
    AVCT   Comparative, group II
    AVSO   Superlative, group I
    AVST   Superlative, group II
    AVSC   Slight comparison
    AVAY   Agree (Panang-ayon)
    AVGI   Disagree (Pananggi)
    AVAG   Possibility (Pang-agam)
    AVPA   Frequency (Pamanahon)
    AVOT   Other

Conjunction
    CONM   Panimbang
    COMU   Pamukod
    CONU   Panubali
    CONI   Paninsay
    CONA   Pananhi
    CONP   Panapos
    CONG   Panghugnay
    COOT   Other

Preposition
    PRPL   Place
    PRLO   Location
    PRSO   Source
    PRTA   Target
    PRRE   Referential
    PRAG   Agree
    PRDI   Disagree
    PRME   Means
    PROT   Other

Determiner: [tag] [morphological case]
    DECN   Common Noun
    DEPS   Personal Name Singular
    DEPP   Singular Person Marker
    DEPL   Plural Marker

Interjection
    INTR   Interjection
    IRIA   Positive Informal Response
    IRID   Negative Informal Response
    IRFA   Positive Formal Response
    IRUN   Uncertain Response

Ligature
    LINA   Ligature "na"
    LIPA   Ligature "pa"

Independent Particles
    MALM   Lexical Marker
    MANE   Negation Marker
    MAVN   Verb Negation Marker
    MAEM   Existential Marker
    MANM   Non-existential Marker
    MAHM   Honorific Marker

Auxiliary Word
    AUXP   Auxiliary Positive
    AUXN   Auxiliary Negative
    AUPO   Auxiliary Possibility

Enclitic
    ENCL   Enclitic

Punctuation, Symbol, Number
    PSNP   Period
    PSNE   Exclamation Point
    PSNQ   Question Mark
    PSNC   Comma
    PSNS   Symbols
    PSNN   Numerals
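To illustrate how the tags in this tagset appear in tagger dictionary declarations, the following minimal sketch in Python reads the plain-text, tab-separated three-column form and splits the third column into a tag and its white-space-separated attributes. The names Entry and parse_declaration are hypothetical and are not part of LanguageTool, which loads the FSA-encoded .dict file rather than parsing text this way. The sample lines are entries from Table 1; the attribute codes (e.g. S, RD, P) are kept as opaque strings because their meanings are not defined here.

```python
from typing import NamedTuple, Tuple


class Entry(NamedTuple):
    """One tagger dictionary declaration (hypothetical helper type)."""
    token: str
    base_form: str
    tag: str
    attributes: Tuple[str, ...]


def parse_declaration(line: str) -> Entry:
    """Split a 'token<TAB>base form<TAB>tag [attributes...]' declaration."""
    token, base_form, tag_field = line.rstrip("\n").split("\t")
    # The tag comes first; any attributes follow, separated by white space.
    tag, *attributes = tag_field.split()
    return Entry(token, base_form, tag, tuple(attributes))


# Sample declarations copied from Table 1.
SAMPLE = (
    "mapanirang\tmapanira\tADMO S\n"
    "nilang\tnila\tPNGP RD P\n"
    "ho\tho\tMAHM\n"
)

entries = [parse_declaration(line) for line in SAMPLE.splitlines()]
for entry in entries:
    print(entry.token, entry.base_form, entry.tag, entry.attributes)
```

In this sketch, a token with a ligature (nilang) and its form without the ligature (nila) share the same base form and tag, which is how the dictionary avoids allotting separate tags for ligature forms.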