Tagalog Support for LanguageTool

Nathaniel Oco
De La Salle University - Manila
2401 Taft Avenue, Malate, Manila City
1004 Metro Manila, Philippines
+639178477549
[email protected]

Allan Borra
De La Salle University - Manila
2401 Taft Avenue, Malate, Manila City
1004 Metro Manila, Philippines
+639174591073
[email protected]
ABSTRACT
This paper outlines the processes and issues involved in adding
Tagalog support to LanguageTool. LanguageTool is an open-source
rule-based style and grammar checker that implements a
manual-based rule-creation approach. The LanguageTool resources
are described, along with the linguistic considerations, technical
considerations, and language properties of Tagalog that were
captured and handled. The system was tested using 50 correct and
50 incorrect sentences collected from different sources.
LanguageTool processed the correct sentences in 53 milliseconds
and the incorrect sentences in 80 milliseconds. The Tagalog
support scored 95.83% for precision, 46% for recall, and 72% for
accuracy.

Categories and Subject Descriptors
F.4.2 [Mathematical Logic and Formal Languages]: Grammars
and Other Rewriting Systems - grammar types.
I.5.0 [Pattern Recognition]: General.
I.7.1 [Document and Text Processing]: Document and Text
Editing - languages, spelling.

General Terms
Algorithms, Languages.

Keywords
Tagalog, Grammar Checking, LanguageTool, Rules, NLP Tools.

1. INTRODUCTION
LanguageTool, developed by [15], is an open-source rule-based
style and grammar checker that implements a manual-based
rule-creation approach. It is publicly available through
LanguageTool's website [13]. LanguageTool has a growing list of
supported languages. The authors of this paper developed and
submitted a new language support, Tagalog, for LanguageTool to
provide a readily available Tagalog grammar checker [16].
LanguageTool version 1.5, released on September 25, 2011,
already includes Tagalog support.

Aside from Tagalog, LanguageTool also supports Asturian,
Belarusian, Breton, Catalan, Chinese, Czech, Danish, Dutch,
English, Esperanto, French, Galician, Icelandic, Italian, Khmer,
Lithuanian, Malayalam, Polish, Russian, Slovak, Slovenian,
Spanish, Swedish, Ukrainian, and Romanian.

This paper explains the processes and issues involved in adding
a new language support, Tagalog, to LanguageTool.

2. GRAMMAR CHECKING
Grammar checking is the process of detecting whether there is an
error in an input. Mark Johnson [Johnson, personal communication]
added that grammar checking entails locating the error and
notifying the user about it. [5] agrees, adding that grammar
checking also entails providing feedback, which can include
possible corrections with linguistic explanations. Grammar
checkers can then be defined as programs that detect whether
there is an error in an input, locate the error, notify the user
about it, and provide relevant feedback.

[15] identified three approaches to grammar checking: the
syntax-based approach, the statistics-based approach, and the
rule-based approach.

The syntax-based approach relies on parsing and grammar
formalisms (e.g. CFG, LFG). An error is detected if parsing fails,
and it is located using tree structures, graphs, and other
methods. Examples of Filipino grammar checkers that use this
approach are PanPam [10] and [6].

The statistics-based approach relies on a properly annotated
corpus (e.g. Penn Treebank, Brown Corpus) to train the language
model. An error is detected and located using probability. [15]
explained that sequences describing correct sentences occur often
in the corpus, while sequences describing incorrect sentences
occur less often or not at all.

The rule-based approach relies on rules, which are matched against
the input to detect and locate errors. LanguageTool is an example
of this approach. [11] classifies grammar checkers under this
approach into two types: manual-based and automatic-based.
Manual-based grammar checkers use manual means to develop rules,
while automatic-based grammar checkers use automatic means. In
LanguageTool, rules are manually created and are added or
modified incrementally.
Proceedings of the 8th National Natural Language Processing Research Symposium, pages 64-71
De La Salle University, Manila, 24-25 November 2011
3. POS TAGGING
Part-of-speech or POS is a lexical category that defines the
function of words. POS Tagging or POST is the process of
labeling words with POS. The list of POS used to label words is
called a tagset. POS Tagging is heavily utilized in LanguageTool.
4. LANGUAGETOOL
LanguageTool can run as a stand-alone program or as an
extension to word processors like OpenOffice¹ or LibreOffice². It
uses two major linguistic resources to perform grammar checking:
the tagger dictionary and the rule file.
As a stand-alone program, LanguageTool splits an input into
sentences. Each sentence is split into words, and each word is
assigned a tag based on the declarations in the tagger dictionary.
The words and their tags are checked against the rules in the rule
file. If there is a match, feedback is shown to the user. Figure 1
shows a screenshot of LanguageTool as a stand-alone program
and Figure 2 shows a screenshot of LanguageTool as an
OpenOffice extension. Both figures demonstrate the feedback
mechanism of LanguageTool.
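The processing steps described above (split into sentences, split into words, tag, match against rules) can be sketched roughly as follows. This is a minimal Python illustration and not LanguageTool's actual Java implementation: the mini dictionary, the tag given to "si", the naive sentence splitter, and all function names are assumptions made for the example. The single rule mirrors the shape of the NOUN_SI_PUNCT rule shown in Figure 3.

```python
import re

# Hypothetical mini tagger dictionary: token -> (base form, tag).
# Entries are illustrative, not copied from the real Tagalog dictionary.
TAGGER = {
    "maganda": ("maganda", "ADMO"),
    "maria": ("maria", "NPRO"),
    "si": ("si", "DEPS"),
    ".": (".", "PSNP"),
}

# One toy rule: a noun, then "si" or "sina", then sentence punctuation.
RULE = {
    "id": "NOUN_SI_PUNCT",
    "pattern": [
        {"postag": r"(NPRO|NCOM).*"},
        {"word": r"si|sina"},
        {"postag": r"PSNP|PSNQ|PSNE|PSNC"},
    ],
    "message": "Irregular POS sequence due to transposition of words.",
}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '?' or '!'."""
    return [s.strip() for s in re.findall(r"[^.?!]+[.?!]?", text) if s.strip()]


def tokenize(sentence):
    """Split a sentence into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def tag(tokens):
    """Pair each token with its tag; unknown words get None."""
    return [(t, TAGGER.get(t.lower(), (t, None))[1]) for t in tokens]


def token_matches(spec, word, postag):
    """Check one pattern element against one tagged token."""
    if "word" in spec:
        return re.fullmatch(spec["word"], word, re.IGNORECASE) is not None
    return postag is not None and re.fullmatch(spec["postag"], postag) is not None


def check(text, rule=RULE):
    """Return (sentence, matched words, message) for every rule match."""
    hits = []
    n = len(rule["pattern"])
    for sentence in split_sentences(text):
        tagged = tag(tokenize(sentence))
        for i in range(len(tagged) - n + 1):
            window = tagged[i:i + n]
            if all(token_matches(s, w, p)
                   for s, (w, p) in zip(rule["pattern"], window)):
                hits.append((sentence, [w for w, _ in window], rule["message"]))
    return hits
```

Under these assumptions, check("Maganda Maria si.") reports one match on the tokens Maria, si, and the period, while check("Maganda si Maria.") reports none.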
Figure 2. LanguageTool as an OpenOffice Extension
5. LANGUAGETOOL RESOURCES
This section discusses and outlines LanguageTool resources – the
tagger dictionary and the rule file – and the tagset used for the
tagger dictionary.
5.1 Tagger Dictionary
The tagger dictionary is a text file used to tag words in an input.
Each declaration in the text file follows a tab-separated
three-column format. The first column is the token, the second
column is the base form of the token, and the third column is the
tag of the token. The interpretation of the base form is left to
the discretion of the language maintainer; it could be the root
word or any other interpretation. In total, the Tagalog tagger
dictionary has approximately 8,000 word declarations. Because
Tagalog is a newly supported language, this number is small
compared with the 350,000-word English tagger dictionary and the
530,000-word French tagger dictionary. Table 1 shows sample
declarations from the Tagalog tagger dictionary.
Figure 1. LanguageTool as a stand-alone program

¹ http://www.openoffice.org/
² http://www.libreoffice.org/

Table 1. Tagalog Tagger Dictionary Declarations

Token           Base Form       Tag
mapanirang      mapanira        ADMO S
anim            anim            ADNU
sa              sa              DECN DAT
ng              ng              DECN GEN
ho              ho              MAHM
nila            nila            PNGP RD P
nilang          nila            PNGP RD P
pumupusta       pumupusta       VACF IN B
yakapin         yakapin         VOBF NE B
kakukumpisal    kakukumpisal    VOTF RC B
kalilibing      kalilibing      VOTF RC B
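To make the three-column format concrete, the following Python sketch reads declarations like those in Table 1 into a lookup table. It is only an illustration of the text format described in Section 5.1; the function name, the error handling, and the choice to keep a list of readings per token are assumptions, and LanguageTool itself reads the compiled dictionary through its own Java code.

```python
# Sample declarations in the tab-separated format: token, base form, tag.
# The rows come from Table 1; all names below are illustrative.
SAMPLE = "\n".join([
    "mapanirang\tmapanira\tADMO S",
    "sa\tsa\tDECN DAT",
    "nila\tnila\tPNGP RD P",
    "nilang\tnila\tPNGP RD P",
    "yakapin\tyakapin\tVOBF NE B",
])


def load_tagger_dictionary(text):
    """Map each token to a list of (base form, tag) readings."""
    entries = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        columns = line.split("\t")
        if len(columns) != 3:
            raise ValueError(
                f"line {line_no}: expected 3 columns, got {len(columns)}")
        token, base, tag = (c.strip() for c in columns)
        entries.setdefault(token, []).append((base, tag))
    return entries


dictionary = load_tagger_dictionary(SAMPLE)
```

Looking up dictionary["nilang"] returns the single reading ("nila", "PNGP RD P"), which shows how a token and its base form can differ while sharing one tag.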
5.2 Tagset for the Tagger Dictionary
[17] developed a tagset for Tagalog using the Penn Treebank³
Tagset as a guide. [14] modified this tagset with the aid of
linguists and experts in the field. The original tagset proposed
by [17] and the modifications by [14] were used as the basis for
developing the tagset for the tagger dictionary.

The tagset for the tagger dictionary defines the different tags
used in word declarations. The tagset is composed of POS and
lexical categories, each of which can be followed by one or more
attributes separated by white space. The third column in Table 1
shows sample tag declarations. The table in the appendix shows
the tagset developed for the tagger dictionary.

³ http://www.cis.upenn.edu/~treebank/

5.3 Rule File
The rules are stored in an XML file. The input is matched against
patterns in the rule file, which describe incorrect sentences. A
rule file is composed of elements and attributes. The three main
elements of a rule are pattern, message, and example. Pattern
refers to the token or sequence of tokens to be matched. Message
refers to the feedback, which can include possible suggestions.
Example refers to incorrect and correct sentences demonstrating
the rule's usage. Figure 3 shows a rule from the Tagalog rule file.

<rule id="NOUN_SI_PUNCT" name="noun si punct (si noun punct)">
  <pattern case_sensitive="no" mark_from="0" mark_to="-1">
    <token postag="(NPRO|NCOM).*" postag_regexp="yes"/>
    <token regexp="yes">si|sina</token>
    <token postag="(PSNP|PSNQ|PSNE|PSNC)" postag_regexp="yes"/>
  </pattern>
  <message>Do you mean <suggestion><match no="2" case_conversion="startlower"/> \1</suggestion>? Irregular POS sequence due to transposition of words.</message>
  <short>Exchange Word Positions</short>
  <example correction="si Maria" type="incorrect">Maganda <marker>Maria si</marker>.</example>
  <example type="correct">Maganda <marker>si Maria</marker>.</example>
</rule>
Figure 3. Rule File

The Tagalog rule file has approximately 500 lines covering
incorrect patterns caused by wrong words, missing words, and
transposition of words. The English rule file has approximately
10,000 lines and the French rule file has approximately 25,000
lines. By comparison, the Tagalog support can still be considered
to be in its initial stage.

In most rule files, the message element is written in the language
being supported. The Tagalog rule file uses English in the message
element so that other users of the system can also understand the
feedback.

6. LINGUISTIC CONSIDERATIONS
This section discusses the linguistic considerations in developing
Tagalog support for LanguageTool.

6.1 POS and Lexical Categories
[19] and [18] identified several Tagalog POS: article or
pantukoy, noun or pangngalan, pronoun or panghalip, verb or
pandiwa, adjective or pang-uri, adverb or pang-abay, preposition
or pang-ukol, conjunction or pangatnig, interjection or
pandamdam, and ligature or pang-angkop. These POS, except
pantukoy, together with other lexical categories (e.g. markers and
particles), compose the tagset. The table in the appendix shows
the POS and lexical categories used.

6.2 Ligatures
According to [18], ligatures or pang-angkop in Tagalog are words
or morphemes that link a modifier to the word being modified.
Figure 4 shows two examples.

Maganda      Babae    =   Magandang babae
Beautiful    Woman    =   Beautiful woman
Adjective    Noun

Mabilis      Bata     =   Mabilis na bata
Fast         Child    =   Fast child
Adjective    Noun
Figure 4. Ligature Usage

To handle words with ligatures, a separate set of tags would need
to be allotted, which would result in a large tagset. To avoid
this, the second column in the Tagalog tagger dictionary repeats
the first column with the ligature omitted. The word without its
ligature serves as the base form of the token; for example, Table 1
declares nilang with the base form nila.

6.3 Tagset Attributes
Additional attributes were considered. [9] and [8] both proposed
tagsets with additional attributes in order to model their
language appropriately and to address language-specific
ambiguities and issues.

The tagset for the Tagalog tagger dictionary also contains
attributes. General categorization and semantic classes were
considered as noun attributes. Grammatical person and plurality
were considered as pronoun attributes. Verb focus, verb aspect,
and plurality were considered as verb attributes. Plurality was
considered as an adjective attribute. The POS of the word being
modified was considered as an adverb attribute. The morphological
case was considered as a determiner attribute. These attributes
help classify Tagalog language properties, linguistic phenomena,
and Tagalog words.
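Since a tag is a category that may be followed by white-space-separated attributes, a tag string such as those in the third column of Table 1 can be split mechanically. The Python sketch below is an illustration only: the Tag type and the parse_tag name are assumptions, and the attribute codes are left uninterpreted because their meanings are not spelled out here.

```python
from typing import NamedTuple, Tuple


class Tag(NamedTuple):
    """A tag split into its category and its attribute codes."""
    category: str
    attributes: Tuple[str, ...]


def parse_tag(tag: str) -> Tag:
    """Split a tag string on white space: category first, attributes after."""
    parts = tag.split()
    if not parts:
        raise ValueError("empty tag")
    return Tag(parts[0], tuple(parts[1:]))


# Tag strings taken from Table 1.
pronoun = parse_tag("PNGP RD P")
numeral = parse_tag("ADNU")
```

Here pronoun.category is "PNGP" with attributes ("RD", "P"), while numeral has the category "ADNU" and no attributes. Keeping the category first is also what lets a rule pattern such as (NPRO|NCOM).* match a noun tag regardless of the attributes that follow it.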
7. TECHNICAL CONSIDERATIONS
This section discusses the technical considerations in developing
Tagalog support for LanguageTool.

7.1 Tagger Dictionary File Size
In most NLP tools, large dictionary and grammar files affect
performance and storage. One issue with the tagger dictionary is
its file size. For instance, the English tagger dictionary in .txt
format is 8 million bytes, while the French tagger dictionary is
17 million bytes. To address this issue, Morfologik⁴ was used to
encode the tagger dictionary into a Finite State Automaton encoded
(FSA-encoded) .dict file. Using Morfologik, the English tagger
dictionary was reduced to 1 million bytes and the French tagger
dictionary to 500 thousand bytes. Since the Tagalog support is
still in its initial stage, there is little difference in file size
between the .txt format and the FSA-encoded format.

7.2 Regular Expressions
Another way of reducing file size is to use regular expressions to
express patterns and other elements in the rule file.
LanguageTool uses the standard regular expression engine of
Java⁵. Figure 5 and Figure 6 illustrate the advantage of using
regular expressions in a rule. The patterns in Figure 5 can be
reduced to the single pattern in Figure 6, reducing the number of
lines and the space occupied.

<pattern>
<token regexp="yes">
ding?|dito|ditong|daw|diyang?|dyang?|doong?
</token>
</pattern>
Figure 6. Using Regular Expressions

7.3 Populating the Tagger Dictionary
The words from the literature domain in [14] were used for the
tagger dictionary. [2] proposed a stemming algorithm for Tagalog
words. Their algorithm, together with a spreadsheet application,
was used to tag words.

7.4 File Compilation
The LanguageTool community uses Apache Ant⁶ to compile the
extension file, which is needed in OpenOffice and LibreOffice.
However, once the files have been compiled, changing the tagger
dictionary or the rule file requires recompiling them.

LanguageTool uses Subversion for version control⁷ and uploads
daily snapshots⁸. This allows language maintainers to provide
regular updates to tagger dictionaries and rule files. It also
allows users to download the latest program.
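The claim in Section 7.2 that the single regular expression of Figure 6 covers the eleven literal patterns of Figure 5 can be checked mechanically. The sketch below uses Python's re module rather than the Java engine that LanguageTool uses, and it assumes that a token pattern must match the whole token, hence fullmatch. The helper name and the list of non-matching strings are made up for the illustration.

```python
import re

# Pattern text from Figure 6 and the eleven literal tokens from Figure 5.
FIGURE_6 = r"ding?|dito|ditong|daw|diyang?|dyang?|doong?"
FIGURE_5 = ["din", "ding", "dito", "ditong", "daw", "diyan", "diyang",
            "dyan", "dyang", "doon", "doong"]

compiled = re.compile(FIGURE_6)


def matches_token(word):
    """True if the whole word is matched by the Figure 6 expression."""
    return compiled.fullmatch(word) is not None


# Every literal token from Figure 5 should be covered.
all_covered = all(matches_token(w) for w in FIGURE_5)

# A few strings that are not in Figure 5 and should be rejected.
rejected = [w for w in ["di", "dingg", "diya", "doo", "raw"]
            if not matches_token(w)]
```

For the simple constructs used here (alternation and the optional g), the two engines behave the same way, so all_covered is True and every string in the second list is rejected.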
<pattern><token>din</token></pattern>
<pattern><token>ding</token></pattern>
<pattern><token>dito</token></pattern>
<pattern><token>ditong</token></pattern>
<pattern><token>daw</token></pattern>
<pattern><token>diyan</token></pattern>
<pattern><token>diyang</token></pattern>
<pattern><token>dyan</token></pattern>
<pattern><token>dyang</token></pattern>
<pattern><token>doon</token></pattern>
<pattern><token>doong</token></pattern>
Figure 5. Pattern Matching

7.5 Rule-Creation Standards
Apache Ant can also be used to check the rules for errors. For
example, the number of tokens to be highlighted must match the
number of tokens in the suggestion and in the examples. These
standards ensure that the rule files are free of syntax errors.

8. TESTING AND RESULTS
A total of 100 sentences (50 correct and 50 incorrect) from [3],
[7], [12], FiSSAn [1], LEFT [4], PanPam [10], and previous test
data were used to test the Tagalog support. LanguageTool
processed the correct sentences in 53 milliseconds and the
incorrect sentences in 80 milliseconds.

LanguageTool properly marked 49 out of 50 correct sentences as
correct and 23 out of 50 incorrect sentences as incorrect. The
Tagalog support scored 95.83% for precision (23 over 24), 46% for
recall (23 over 50), and 72% for accuracy (72 over 100).

Figure 7 shows the 27 incorrect sentences marked as correct.
Sentences 1 to 9 contain free word order errors. Sentences 10 and
11 contain predicates in the plural form. Sentences 12 to 27
contain either missing words or transposed words. The low recall
rate can be attributed to three things: (1) lack of rules, or
limited grammar checking coverage; (2) incorrect or erroneous
tagger dictionary declarations; and (3) insufficient word entries.
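The reported scores follow directly from the four counts given in this section. The short Python sketch below recomputes them; the variable names and the formatting helper are illustrative, and an error flagged by LanguageTool is treated as the positive class.

```python
# Counts reported in Section 8 (a flagged sentence is a "positive").
true_positives = 23    # incorrect sentences flagged as incorrect
false_negatives = 27   # incorrect sentences passed as correct
true_negatives = 49    # correct sentences passed as correct
false_positives = 1    # correct sentences flagged as incorrect

total = true_positives + false_negatives + true_negatives + false_positives

precision = true_positives / (true_positives + false_positives)   # 23 / 24
recall = true_positives / (true_positives + false_negatives)      # 23 / 50
accuracy = (true_positives + true_negatives) / total              # 72 / 100


def as_percent(value):
    """Format a ratio as a percentage with two decimal places."""
    return f"{value * 100:.2f}%"
```

The high precision and low recall reflect a checker that rarely raises a false alarm but still misses more than half of the erroneous sentences, which is consistent with the limited rule coverage described above.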
⁴ Morfologik is available at:
  http://sourceforge.net/projects/morfologik/files/morfologikstemming/
⁵ Standard regular expression engine of Java:
  http://download.oracle.com/javase/1,5.0/docs/api/java/util/regex/Pattern.html
⁶ http://ant.apache.org/
⁷ LanguageTool's subversion repository:
  https://languagetool.svn.sourceforge.net/svnroot/languagetool/trunk/JLanguageTool
⁸ http://www.languagetool.org/download/snapshots/
1. Ang bumili lalaki ng isda sa tindahan.
2. Nagbigay sa ng libro babae ang lalaki.
3. Binigyan ng babae ng libro dalaga ang.
4. Binigyan mabait ng regalo ang batang.
5. Maganda sumasayaw si Rosa na.
6. Pinalo tatay ng makulit batang ang.
7. Bumili ng Manuel si medias para sa dalaga.
8. Bumili si Maria ng libro tungkol sa pag-ibig.
9. Ang pusang mataba pinakain isda ng.
10. Nagsisikain na si Maria ng hapunan.
11. Nagsipangisda na si Ben.
12. Ay kumain.
13. Si Janyll ay.
14. Ay si Janyll maganda.
15. Si Janyll kumain ay.
16. Kumain ay si Janyll.
17. Maganda ay si Janyll.
18. Mabait si Martee sa.
19. Kumain si Martee si Justin.
20. Aalis ako ikaw.
21. Umalis nagnakaw.
22. Sumama maganda.
23. Umalis ang nagnakaw ang ninakawan.
24. Sumama ang malakas ang maganda.
25. Kumain uminom si Martee.
26. Nawawala mabilis ang tumakbo.
27. Sumama totoo ang maganda.
Figure 7. Incorrect Sentences Detected as Correct

9. FINAL NOTES
Although the Tagalog support scored 95.83% for precision, its
recall rate is low at 46%. This shows that the Tagalog support is
still in its early stages. However, it is important to note that
the linguistic resources of LanguageTool are sufficient to handle
the different Tagalog language properties and linguistic
phenomena.

Further development can focus on improving grammar checking
coverage by adding more rules to the rule file and by improving
the tagger dictionary and the tagset. Future work can also focus
on other LanguageTool functionalities such as automatic detection
of code switching.

Overall, LanguageTool is a novel NLP tool that provides a
readily available grammar checker.

10. ACKNOWLEDGMENTS
The authors acknowledge LanguageTool's developers and
language maintainers for their assistance.

11. REFERENCES
[1] Ang, M., Cagalingan, S., Tan, P., and Tan, R. 2002. FiSSAn:
Filipino Sentence Syntax and Semantic Analyzer.
Undergraduate Thesis. De La Salle University, Manila,
Philippines.
[2] Bonus, D. and Roxas, R. 2004. A Stemming Algorithm for
Tagalog Words. In Proceedings of the 4th Philippine
Computing Science Congress (Laguna, Philippines, February
14-15, 2004). PCSC '04. Computing Science of the
Philippines. ISSN=1908-1146.
[3] Cena, R. and Ramos, T. 1990. Modern Tagalog:
Grammatical Explanations and Exercises for Non-native
Speakers. University of Hawaii Press, Honolulu, HI.
[4] Chan, E., Lim, I., Tan, R., and Tong, M. 2006. LEFT:
Lexical Functional Grammar Based English-Filipino
Translator. Undergraduate Thesis. De La Salle University,
Manila, Philippines.
[5] Clément, L., Gerdes, K., and Marlet, R. 2009. A Grammar
Correction Algorithm: Deep Parsing and Minimal
Corrections for a Grammar Checker. In Proceedings of the
14th Conference on Formal Grammar (Bordeaux, France,
July 25-26, 2009). FG '09. Springer-Verlag, Berlin,
Germany. 47-63. DOI= http://doi.acm.org/10.1007/978-3-642-20169-1_4.
[6] Dimalen, D. and Dimalen, E. 2007. An OpenOffice Spelling
and Grammar Checker Add-in Using an Open Source
External Engine as Resource Manager and Parser. In
Proceedings of the 4th National Natural Language
Processing Research Symposium (Manila, Philippines, June
14-16, 2007). NNLPRS '07. 69-73.
[7] Dimalen, E. 2003. A Parsing Algorithm for Constituent
Structures of Tagalog. Graduate Thesis. De La Salle
University, Manila, Philippines.
[8] Divjak, D., Erjavec, T., Feldman, A., Kopotev, M., and
Sharoff, S. 2008. Designing and Evaluating a Russian Tagset.
In Proceedings of the Sixth International Conference on
Language Resources and Evaluation (Marrakech, Morocco,
May 28-30, 2008). LREC '08. European Language
Resources Association, Paris, France, 279-285.
ISBN=2-9517408-4-0.
[9] Feldman, A. and Hana, J. 2010. A Positional Tagset for
Russian. In Proceedings of the Seventh International
Conference on Language Resources and Evaluation
(Valletta, Malta, May 19-21, 2010). LREC '10. European
Language Resources Association, Paris, France, 1277-1284.
ISBN=2-9517408-6-7.
[10] Jasa, M., Palisoc, J., and Villa, M. 2007. Panuring
Pampanitikan (PanPam): A Sentence Syntax and Semantic
Based Grammar Checker for Filipino. Undergraduate
Thesis. De La Salle University, Manila, Philippines.
[11] Konchady, M. 2009. Detecting Grammatical Errors in Text
using a Ngram-based Ruleset. Retrieved October 6, 2011
from Emustru:
http://emustru.sourceforge.net/detecting_grammatical_errors.pdf.
[12] Kroeger, P. 1993. Phrase Structure and Grammatical
Relations in Tagalog. CSLI Publications, Stanford, CA.
[13] LanguageTool. http://www.languagetool.org/.
[14] Miguel, D. 2007. Comparative Analysis of Tagalog
Part-of-Speech (POS) Taggers. Graduate Thesis. De La Salle
University, Manila, Philippines.
[15] Naber, D. 2003. A Rule-Based Style and Grammar Checker.
Diploma Thesis. Bielefeld University, Bielefeld.
[16] Oco, N. and Borra, A. 2011. A Grammar Checker for Tagalog
using LanguageTool. In Proceedings of the 9th Workshop on
Asian Language Resources collocated with IJCNLP 2011
(Chiang Mai, Thailand, November 12-13, 2011). ALR '11.
Asian Federation of Natural Language Processing. 2-9.
ISBN 978-974-466-565-2.
[17] Rabo, V. 2004. TPOST: A Template-based N-gram
Part-of-Speech Tagger for Tagalog. Graduate Thesis. De La
Salle University, Manila, Philippines.
[18] Ramos, T. 1971. Makabagong Balarila ng Pilipino. Rex
Bookstore, Manila, Philippines.
[19] Santos, L. 1939. Balarila ng Wikang Pambansa. Institute of
Philippine Language, Manila, Philippines.
APPENDIX
Table 2. Tagset for the Tagger Dictionary

Tag     Description

Noun: [tag] [general categorization] [semantic class]
NPRO    Proper Noun
NCOM    Common Noun
NABB    Abbreviation

Pronoun: [tag] [grammatical person] [plurality]
PANP    “ang” Pronouns
PNGP    “ng” Pronouns
PSAP    “sa” Pronouns
PAND    “ang” Demonstratives
PNGD    “ng” Demonstratives
PSAD    “sa” Demonstratives
PFOP    Found Pronouns
PINP    Interrogative Pronouns
PCOP    Comparison Pronouns
PIDP    Indefinite Pronouns
POTH    Other

Verb: [focus] [aspect] [plurality]
VACF    Actor Focus
VOBF    Object / Goal Focus
VBEF    Benefactive Focus
VLOF    Locative Focus
VINF    Instrument Focus
VOTF    Other

Adjective: [tag] [plurality]
ADMO    Modifier
ADCO    Comparative
ADSU    Superlative
ADNU    Numeral
ADUN    Unaffixed
ADOT    Other

Adverb: [tag] [modifies]
AVMA    Manner
AVNU    Numeral
AVDE    Definite
AVEO    Comparison, group I
AVET    Comparison, group II
AVCO    Comparative, group I
AVCT    Comparative, group II
AVSO    Superlative, group I
AVST    Superlative, group II
AVSC    Slight comparison
AVAY    Agree (Panang-ayon)
AVGI    Disagree (Pananggi)
AVAG    Possibility (Pang-agam)
AVPA    Frequency (Pamanahon)
AVOT    Other

Conjunction
CONM    Panimbang
COMU    Pamukod
CONU    Panubali
CONI    Paninsay
CONA    Pananhi
CONP    Panapos
CONG    Panghugnay
COOT    Other

Preposition
PRPL    Place
PRLO    Location
PRSO    Source
PRTA    Target
PRRE    Referential
PRAG    Agree
PRDI    Disagree
PRME    Means
PROT    Other

Determiner: [tag] [morphological case]
DECN    Common Noun
DEPS    Personal Name Singular
DEPP    Singular Person Marker
DEPL    Plural Marker

Interjection
INTR    Interjection
IRIA    Positive Informal Response
IRID    Negative Informal Response
IRFA    Positive Formal Response
IRUN    Uncertain Response

Ligature
LINA    Ligature “na”
LIPA    Ligature “pa”

Independent Particles
MALM    Lexical Marker
MANE    Negation Marker
MAVN    Verb Negation Marker
MAEM    Existential Marker
MANM    Non-existential Marker
MAHM    Honorific Marker

Auxiliary Word
AUXP    Auxiliary Positive
AUXN    Auxiliary Negative
AUPO    Auxiliary Possibility

Enclitic
ENCL    Enclitic

Punctuation, Symbol, Number
PSNP    Period
PSNE    Exclamation Point
PSNQ    Question Mark
PSNC    Comma
PSNS    Symbols
PSNN    Numerals