Sentence-level Morphological and Phonological Analyzer for Filipino
Angelo Nico Alina, Carlo Cambaliza, Xedric Sta. Ana, Judd Sosa, Shirley Chu
De La Salle University, 2401 Taft Avenue, Manila, Philippines
ABSTRACT
Part-of-speech tagging is an integral part of sentence analysis that is concerned with annotating the part-of-speech of each word in a sentence. There are existing tools for part-of-speech tagging for Tagalog such as HATPOST [2] and TPOST [8]. HATPOST is chosen because it combines rule-based and statistical approaches to increase tagging accuracy.
This paper discusses filSPAM (Sentence-level Phonological and Morphological Analyzer for Filipino). Given an input sentence in Tagalog, the system outputs the corresponding parts-of-speech, root words, affixes and phonemic notation of each word in the sentence. The system makes use of the existing systems MAGTag and HATPOST in handling morphological analysis and part-of-speech tagging, respectively. The system has four modules: the POS tagger, which has 54% accuracy; the morphological analyzer, which has 73.02% accuracy; the phonological analyzer, which is corpus-based; and the unknown handler, which has two methods, the automaton and the generalized tree, with 67% and 64% accuracy respectively.
These components, namely MAGTag and HATPOST, function
independently from one another. However, they have their own
individual limitations that need to be addressed. The research
will seek to construct a sentence-level morphological and
phonological analyzer for the Filipino language that will
possibly integrate the aforementioned components in order to
identify the part-of-speech of a Filipino word in the sentence
and generate the root word and phonology of the identified
words.
Categories and Subject Descriptors
D.2.10 [Software Engineering]: Design – methodologies, representation.
General Terms
Algorithms, Documentation, Performance, Experimentation, Languages, Theory
Keywords
Natural language processing, phonological analysis, morphological analysis, part-of-speech tagging
1. INTRODUCTION
Phonology is the study of meaningful sounds in speech and how they are used in natural language. According to the work of Schacter & Otanes [4], some general rules may be made regarding Tagalog phonology. In Tagalog, vowel length is significant, since some word pairs which have the same spelling yet differ in meaning are differentiated on the basis of their vowel length. Tagalog vowel length is marked by a raised dot, as in /i·/, in phonemic notation. There are also instances wherein different phonemes may be interchanged without any effect on the meaning of the word.
In general, Tagalog words are spelled in the manner that they are pronounced, even in the case of some loan words. This means that consonant phonemes are usually represented according to the letter used in the word. Vowel phonemes, however, have two allophones each, although these allophones may be used interchangeably without any effect on the meaning of the word. In a syllable, the vowel is understood to be the syllable nucleus, which is the most prominent sound in the syllable. The patterns of Tagalog syllables are most often either consonant-vowel or consonant-vowel-consonant. A syllable may also include a consonant cluster, which is considered as one consonant in the pattern. However, there are certain restrictions on which pairs of consonants are accepted as consonant clusters.
Morphological analysis is an important process in natural
language processing. It deals with the identification of a root
word and its affixes (morphemes) from a morphed word.
Phonology is another facet of morphology that has to do with
how a word is voiced or sounded out. There are various
approaches and systems that exist and are used in
morphological analysis for generating rules for different
languages such as MAGTag [1] for Tagalog and KIMMO [9]
for Japanese. These differ in each of their methods in
identification and classification of morphemes as well as
handling ambiguity. Although there are systems which handle
morphology for Filipino, most of these are limited in that they
are only word-level and they do not cover rules for phonology.
MAGTag in particular was used for this research as it is the
only morphological analysis system available for the language
and accessible to the proponents. MAGTag utilizes a set of
rules in determining the root word and the affixes of an input
word.
Proceedings of the 8th National Natural Language Processing Research Symposium, pages 72-80
De La Salle University, Manila, 24-25 November 2011
In the case of words that begin with a vowel, there is always a glottal stop /’/ at the initial position of the word, since all words must start with a consonant phoneme. In this case, the glottal stop /’/ is considered to be a consonant although it is not represented in conventional spelling. In disyllabic words, if the first syllable does not end in a consonant, the general rule states that the vowel of this syllable is long. Furthermore, in words that do not end with a consonant, a glottal stop /’/ or glottal fricative /h/ may also occur in the word-final position, as they are consonant phonemes that are likewise not represented in conventional orthography.
Finite-state transducers can be used to represent phonological rules. Figure 1 represents the English flapping rule as a subsequential finite-state transducer: an underlying t is realized as a flap after a stressed vowel and any number of r’s and before an unstressed vowel.
Figure 1. Sub-sequential transducer for English flapping rule.
The phonological-rule induction is based on the Onward Subsequential Transducer Inference Algorithm (OSTIA). OSTIA
takes as input a training set of input-output pairs. The root of
the tree is the initial transducer state, and each leaf of the tree
corresponds to the input sample. The output symbols are placed
as near as possible to the root of the tree.
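The tree-building step described above can be illustrated with a short sketch. This is only a minimal illustration of an onward prefix tree, not the full OSTIA state-merging procedure and not the authors' implementation; the function names and the sample pairs (taken from transcriptions quoted in this paper, with an ASCII apostrophe standing in for the glottal stop) are assumptions.

```python
from os.path import commonprefix  # character-wise longest common prefix


def build_onward_tree(pairs):
    """Build an onward prefix tree from a non-empty list of (input, output) pairs.

    States are input prefixes. Returns (initial_output, edges, finals), where
    edges maps (state, symbol) -> (emitted_output, next_state) and finals maps
    each complete input to the output still owed at the end of the word.
    """
    pairs = list(pairs)
    prefixes = {inp[:i] for inp, _ in pairs for i in range(len(inp) + 1)}
    # Longest common prefix of the outputs of all samples passing through a state.
    lcp = {p: commonprefix([out for inp, out in pairs if inp.startswith(p)])
           for p in prefixes}
    edges = {}
    for p in prefixes:
        if p:
            parent = p[:-1]
            # Emit only what the ancestors have not already produced, so that
            # output symbols sit as close to the root as possible.
            edges[(parent, p[-1])] = (lcp[p][len(lcp[parent]):], p)
    finals = {inp: out[len(lcp[inp]):] for inp, out in pairs}
    return lcp[""], edges, finals


def transduce(tree, word):
    """Run a word seen in training through the tree and collect its output."""
    initial_output, edges, finals = tree
    out, state = initial_output, ""
    for symbol in word:
        emitted, state = edges[(state, symbol)]
        out += emitted
    return out + finals[state]


# Hypothetical training pairs: orthography -> phonemic notation.
tree = build_onward_tree([("pera", "pe:rah"), ("paos", "pa'os")])
print(tree[0], transduce(tree, "pera"))
```

Because both sample outputs start with the same symbol, that symbol is emitted at the root rather than along either branch, which is the "onward" property the paragraph describes.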
Figure 2. Example tree constructed
2. Architectural Design
Figure 3. System Flow
Figure 4. Process Flow of POS Module
The Sentence-level Morphological and Phonological Analyzer for Filipino is a Java-based system that analyzes and generates the morphology and phonology of Filipino words in a sentence. The system identifies the part-of-speech, root word, affixes and phonology of each word of the input Tagalog sentence. The features of the system may be summarized into the following functions: Morphological Analysis, Part-of-Speech Tagging and Phonological Analysis. These modules operate independently, meaning that they are able to function apart from each other and the results of one do not affect another. An exception to this is discussed in Section 2.4.
2.1 Morphological Analysis
Morphological analysis is handled by MAGTag [1] (shown in Figure 5). It is used to determine the root word and affixes of each word in the sentence by applying numerous rules. It can also generate the POS tags of those words using the same rule set.
2.2 Part-of-Speech Tagging
HATPOST [2] was utilized to handle most of the part-of-speech tagging for the wrapper class. HATPOST (shown in Figure 4) has a set of generated POS tags which consists of the Rabo-Buban tagset along with its own set of defined tags (refer to Table 1). The tagset was not modified further by the proponents.
Figure 5. Process Flow of Morphological Analyzer Module
Table 1. Rabo-Buban Tagset
Part-of-Speech   POS Engine Output   Description
Common Noun      NNC                 Common
Proper Noun      NNP                 Proper
                 NNPP                Proper Abbreviation
Adjective        JJD                 Describing
                 JJC                 Same-level Comparison
                 JJCC                Comparison Comparative
                 JJCS                Comparison Superlative
                 JJCN                Comparison Negation
                 JJN                 Describing Number
Verb             VBW                 Neutral Infinitive
                 VBS                 Pseudoverb
                 VBH                 Existential
                 VBL                 Linking Verb
                 VBN                 Non-existential
                 VBTS                Time Past
                 VBTR                Time Present
                 VBTF                Time Future
                 VBTP                Recent Past
                 VBAF                Actor Focus
                 VBOF                Object Focus
                 VBOB                Benefactive Focus
                 VBOL                Locative Focus
                 VBOI                Instrumental Focus
                 VBRF                Referential Focus
Adverb           RBD                 Describing “How”
                 RBN                 Number
                 RBC                 Comparison
                 RBK                 Conditional
                 RBP                 Causative
                 RBB                 Benefactive
                 RBR                 Referential
                 RBQ                 Question
                 RBT                 Agree
                 RBF                 Disagree
                 RBW                 Frequency
                 RBM                 Possibility
                 RBI                 Enclitics
                 RBL                 Place
                 RBJ                 Interjections
                 RBS                 Social Formula
Category         POS Engine Output   Description
pronoun          PRS                 Personal Singular
                 PRSP                Possessive Subject
                 PROP                Possessive Object
                 PRQP                Interrogative Plural
                 PRL                 Location
                 PRC                 Comparison
                 PRF                 Found
                 PRI                 Indefinite Number
determiner       DTC                 Common Noun
                 DTCC                Plural Common Noun
                 DTP                 Proper Noun
                 DTPP                Plural Proper Noun
others
conjunction      CCA                 Proposition
                 CCP                 Ligatures
                 CCT
                 CCR                 Undefined
                 CCB
cardinality      CDB                 Digit, Rank, Count
punctuation      PMP                 Period
                 PME                 Exclamation Point
                 PMQ                 Question Mark
                 PMC                 Comma
                 PMS                 Symbol
                 etc                 etc
unknown          n/a
Figure 6. Process Flow of Phonological Analyzer Module
2.3 Phonological Analysis
The phonological analysis module was created by the proponents. This module identifies the phonology of a word by retrieving its corresponding phonology from the database. The database consists of two tables: known and unknown. The known table consists of words and their corresponding phonology from [4] and words verified by the linguist. The known corpus was also populated with Tagalog function words (particles, conjunctions, prepositions) along with pronouns, as opposed to content words (nouns, verbs), since function words are constant and stable for any language. The listings for the function words were retrieved from [5]. The unknown table consists of words and their corresponding phonology as output by the predefined automata used in the unknown word handler.
Figure 8. Initial finite-state automaton
The initial automaton was tested on 51 words from [4]. Its accuracy was only 27%. The problems encountered involved wrong symbols and word-final consonant phonemes. The /./ represents stress and not vowel length; major stress is denoted by the symbol /:/. Other errors involved incorrect lengthening of vowel sounds, such as /maga:ling/. There is also an ambiguous case wherein the word labi may be transcribed as either /la:bi’/ or /labi’/, which have different meanings. The word-final consonant phoneme is not always the glottal stop /’/ but can also be the glottal fricative /h/, as in /maramih/, which was processed as /marami’/. Some phonological rules were also missed: two consecutive vowel phonemes from two separate syllables require a glottal stop /’/ between the syllables, as in /maba’it/, which was transcribed by the system as /maba:it/. Nevertheless, words that start with a vowel were correctly represented with a glottal stop /’/ at the initial position, as in /’a:teh/.
Figure 7. Process Flow of Unknown Handler Module
2.4 Unknown Word Handler
The unknown word handler module identifies the phonology of words that are not found in the corpus (shown in Figure 7). This module applies general phonological rules to determine the phonology of the word. The words and their corresponding phonology are then added to the unknown table. The implementation of the rules was manually created by the proponents. The words in the unknown table will ultimately be added to the known corpus during the latter phases of the module, after the collaborating linguists have verified the words.
The unknown handler has two methods of determining phonology, namely the Automata and the Generalized Phonology Tree.
Initial Automata The initial automaton was based on one- to two-syllable Tagalog words. This representation attempts to determine the stress/vowel length in a disyllabic word and also generates a word-initial glottal stop for words beginning with a vowel. Vowel length is assigned by checking whether the penultimate syllable does not end in a consonant, as in lunes, or whether the following syllable begins with any of the accepted consonant clusters, as in senyas. A special consonant cluster to be considered is /ng/, which is one phoneme represented as /ŋ/. The word-final consonant phoneme for words ending with a vowel is set to the glottal stop /’/ by default. For any word with more than two syllables, only the last two syllables are processed and the preceding syllables are retained.
Revised Automata to handle Multi-syllabic words The second automaton was modified to handle words with more than two syllables by applying rules to the syllables before the penultimate. The stress symbol /./ used in the previous automaton was also replaced by the major stress symbol /:/. The final consonant phoneme was identified by a frequency count of which of the two final consonant phonemes (glottal stop /’/ or glottal fricative /h/) occurs more often in the known table. The initial automaton was modified as shown in Figure 9.
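The frequency count used by the second automaton to choose the word-final consonant phoneme can be sketched as follows. This is a minimal sketch under stated assumptions: the known table is represented as a plain list of transcriptions, the glottal stop is written as an ASCII apostrophe, and the function name and the fallback value are hypothetical.

```python
from collections import Counter


def default_final_phoneme(known_transcriptions, fallback="'"):
    """Return the more frequent word-final phoneme among "'" and "h".

    Transcriptions ending in any other phoneme are ignored. The fallback for
    an empty count is an assumption of this sketch.
    """
    counts = Counter(
        t[-1] for t in known_transcriptions if t and t[-1] in ("'", "h")
    )
    return counts.most_common(1)[0][0] if counts else fallback


# Transcriptions quoted in this paper, used here only as sample data.
print(default_final_phoneme(["pe:rah", "'a:teh", "la:bi'", "pa'os"]))
```

A single corpus-wide default of this kind cannot distinguish words such as /hila:ga’/ from words ending in /h/, which is consistent with the word-final errors reported for the automata.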
Figure 9. Second finite-state automaton
The second automaton was tested on the same 51 words from [4]. Its accuracy was 87%. The increase was attributed to the correct generation of the word-final consonant phoneme, as /h/ happened to occur more often. There were still words with incorrect vowel lengthening, such as /da:mit/, for which it proved difficult to determine when the rule would or would not be applicable. The second automaton was then tested on 217 arbitrary Tagalog words from [5]. This test resulted in 55% accuracy.
Third Automata indicating improvement suggested by Linguists The third automaton was tested on the same 217 words from [5]. Its accuracy was 67%. The errors encountered by the third automaton involved words that do not follow the general pattern in which a CV second-to-last syllable has vowel length /:/, as well as words that have the pattern CV:C before the final syllable. For instance, the word batas is pronounced as /batas/ but was incorrectly annotated as /ba:tas/. There were minor errors which still involved the final consonant phoneme. The previous error regarding word-initial consonant-glide clusters was also handled, with kuwago, for example, being correctly transcribed as /kwa:goh/. However, the automaton incorrectly annotated impluwensya as /implwensyah/, to which the aforementioned rule is not applicable.
After consultation with the collaborating linguist, misconceptions regarding the usage of some phonological symbols were addressed. The /:/ denotes vowel length and not major stress. A consonant cluster does not always imply vowel length for the vowel preceding it, as in gagamba, which was erroneously transcribed as /gaga:mbah/. This is because the cluster /mb/ is not an accepted consonant cluster when considering preceding vowel length. A correct output would be /sigari:lyoh/. The pattern of a consonant-vowel syllable with vowel length (CV:) followed by a consonant in the succeeding syllable, as observed in the 51 words from [4], does not hold for most Tagalog words. Generally, a consonant-vowel pattern located at the second-to-last syllable of the word has vowel length /:/, as in /pe:rah/. The issue with two consecutive vowel phonemes separated by a glottal stop /’/ has also been addressed, with paos, for example, being correctly transcribed as /pa’os/.
Generally, the rise in accuracy in the third automaton was attributed to the issues from the second automaton that were addressed. However, some words that were correct in the second automaton became incorrect and vice versa. For instance, the word hadlang was incorrectly transcribed as /ha:dlaŋ/ by the second automaton but became correct in the third representation as /hadlaŋ/; on the other hand, the correct /rebe:ldeh/ from the second automaton became the incorrect /rebeldeh/ in the third. This is because the second automaton was based on observations made on the 51 words from [4], which were more applicable to words borrowed from Spanish or English such as /se:rmon/ and /ma:rsoh/. The third automaton, on the other hand, is based on the output of the second representation tested on Tagalog words, with feedback from the linguist regarding the output of these words. Compound words have also been tackled. This was done by splitting the compound word, running each constituent through the automaton and concatenating the results, since compound words generally retain the phonology of each constituent word.
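The compound-word handling described above (split, transcribe each constituent, concatenate) can be sketched as follows. The sketch assumes hyphenated compounds only and takes the word-level transcriber as a parameter; the function name and the stand-in lookup table are hypothetical, and the glottal stop is written as an ASCII apostrophe.

```python
def transcribe_compound(word, transcribe, joiner="-"):
    """Transcribe a hyphenated compound constituent by constituent.

    `transcribe` is any word-level function (for example, the automaton);
    `joiner` is the symbol placed between the constituent transcriptions.
    """
    parts = word.split("-")
    return joiner.join(transcribe(part) for part in parts)


# A stand-in for the automaton: a tiny lookup table used only for illustration.
sample = {"bahag": "bahag", "hari": "hari'"}
print(transcribe_compound("bahag-hari", sample.__getitem__))
```

Keeping the joiner as a parameter leaves room for a different boundary symbol without changing the splitting logic.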
The concept of word-initial consonant-glide clusters was also not covered. For example, a word with word-initial /uw/ such as buwan is pronounced as /bwan/, and a word with /iy/ such as niyog is pronounced as /nyog/; the vowel before the glide consonant /w/ or /y/ may be omitted from the phonology. This is not applicable to words such as /pruweba/, since the /uw/ is in separate syllables.
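The word-initial consonant-glide rule can be approximated with a single regular expression. This is a hedged sketch rather than the rule as implemented in the system: it assumes lowercase orthographic input, treats only a single word-initial consonant followed by uw or iy as a candidate, and uses a hypothetical function name and consonant inventory.

```python
import re

# One word-initial consonant, then "u" before "w" or "i" before "y".
_INITIAL_GLIDE = re.compile(r"^([bdghklmnprstwy])(?:u(?=w)|i(?=y))")


def reduce_initial_glide(word):
    """Drop the vowel before a glide in a word-initial consonant-glide cluster."""
    return _INITIAL_GLIDE.sub(r"\1", word)


for w in ("buwan", "niyog", "kuwago", "pruweba", "impluwensya"):
    print(w, "->", reduce_initial_glide(w))
```

Requiring exactly one consonant before the vowel leaves pruweba untouched, and anchoring at the start of the word leaves impluwensya untouched, matching the two exceptions mentioned in the text.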
Minor errors were caused by the word-final consonant
phoneme. Also, there are certain words that are ambiguous in
that they may have two phonological representations based on
vowel length but this is not to say that they are totally incorrect.
Lastly, Tagalog compound words have also not been taken into
consideration, as the automaton processes them as one word.
The second automaton was modified based on the feedback of the linguist, as shown in Figure 10.
For compound words, the notation used affects the meaning of the word. According to the linguist, words which have been compounded do not necessarily have a hyphen. For instance, in bahag-hari, /bahag-hari’/, both words are taken with their corresponding literal meaning, whereas bahaghari, /bahag+hari’/, has an idiomatic meaning. Since ambiguity, which requires context, is one of the limitations of the phonological analyzer module, words that have two pronunciations will only be given one pronunciation. Since the phonology of compound words is similar to that of non-compound words, the symbol /+/ is used to denote that the words are combined to form a new word without using the symbol /-/. The third representation of the automata was then modified based on this feedback, as shown in Figure 11.
Figure 10. Third finite-state automaton
2.5 Sentence-level Phonological Analysis
This module identifies the phonology of the sentence as a
whole by combining the word-level phonology and applying
sentence-level phonological rules.
3. RESULTS AND DISCUSSIONS
This chapter gives, for each module, an overview and some of the design and implementation issues, followed by the corresponding test phases.
For the testing, a general test data set was used for the system.
These were articles in the Tagalog language that were retrieved
from Tagalog Wikipedia and an internet blog. In the following
sections, the two articles used for testing will be referred to as
Blog and Wiki (refer to Table 3 for details of each).
The results were manually evaluated by the linguist. The
accuracy is computed by getting the percentage of the number
of correct words over the total number of words.
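The accuracy measure stated above is simple arithmetic and can be checked against figures reported in this paper. The sketch below uses the Blog article (241 words) with the incorrect-tag counts reported for the POS tagger before and after the MAGTag countercheck; the function name is hypothetical.

```python
def accuracy(total_words, incorrect_words):
    """Percentage of correct words over the total number of words."""
    correct = total_words - incorrect_words
    return round(100 * correct / total_words, 2)


print(accuracy(241, 110))  # Blog, HATPOST only
print(accuracy(241, 60))   # Blog, with the MAGTag countercheck
```

These two calls reproduce the 54.36% and 75.10% figures reported for Blog.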
Figure 11. Final automata
Final automata with POS In an attempt to deal with ambiguity, the part-of-speech is now also included in the database for known words. Ambiguity may involve word pairs that have the same spelling yet possess different phonology. These word pairs may also differ in their part-of-speech. An example is the word sama, which may be either /sa:mah/, a verb, or /sama’/, an adjective. As such, it is useful to also recognize the part-of-speech of the word in identifying its phonology, given that the word pairs have varying parts-of-speech. However, this will not be able to handle word pairs that have different phonology yet the same part-of-speech.
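A part-of-speech-aware lookup of this kind can be sketched as a dictionary keyed by word and tag, with a fallback to the unknown handler. This is an illustrative sketch only: the dictionary layout, the function name, the general tags V and J, and the stand-in unknown handler are assumptions, and the glottal stop is written as an ASCII apostrophe.

```python
def lookup_phonology(word, pos, known, unknown, unknown_handler):
    """Return the phonology stored for (word, pos), or fall back to the handler.

    `known` and `unknown` map (word, pos) -> phonology. Results produced by
    the fallback are recorded in `unknown` together with the generated tag.
    """
    entry = known.get((word, pos))
    if entry is not None:
        return entry
    phonology = unknown_handler(word)
    unknown[(word, pos)] = phonology
    return phonology


known = {("sama", "V"): "sa:mah", ("sama", "J"): "sama'"}
unknown = {}
print(lookup_phonology("sama", "J", known, unknown, lambda w: w + "h"))
```

Two entries that share both spelling and tag would collide on the same key, which mirrors the limitation noted in the paragraph above.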
The input runs through the part-of-speech tagger, which produces the corresponding tag for each word. The automaton then checks whether the tag produced by the POS tagger for each word matches the part-of-speech contained in the database, in order to determine the appropriate phonology to use in the output. If the word does not exist with the generated tag, it goes through the normal process using the unknown handler. The generated tag is saved in the table for unknown words along with the phonology.
Table 2. Final Automaton Tags used in Corpus
Part-of-Speech              Symbol
Nouns (common, proper)      N
Adjectives                  J
Verbs                       V
Adverbs                     R
Pronouns, Function Words    P
Unknown/Untagged            YYY
Table 3. Details for Test Data
       Source                                                Word Count   Refer as
1st    http://tl.wikipedia.org/wiki/Pilipinas                189          Wiki
2nd    http://www.perfspot.com/blogs/blog.asp?BlogId=49231   241          Blog
3.1 Part-of-Speech Tagger
HATPOST is an undergraduate thesis developed by Ciego, et al. (2007). It is integrated into the system to handle part-of-speech tagging. Integration of HATPOST is straightforward: essential files of HATPOST were simply transferred to the root directory of the system so that the functions of HATPOST necessary for determining the parts-of-speech may be utilized.
Specific tags generated by HATPOST and MAGTag were mapped to a general tag, as shown in Table 2, since the two use different tagsets.
Evaluation of the articles with HATPOST proved to be highly dependent on its training sets. The results varied greatly with the quality of said training sets. The POS tagger was fed two articles of varying content. For Blog, the results consisted of 110 incorrectly tagged words, which implies that 54.36% of the total words were correctly tagged. The POS tagging module of MAGTag was introduced in order to countercheck whether words that had been tagged as unknown by HATPOST are indeed unknown. The results improved, as the number of incorrectly tagged words was reduced to 60, which implies that 75.10% were correctly tagged.
As for Wiki, the results consisted of 76 incorrectly tagged words, which implies that 59.79% of the total words were correctly tagged. After the introduction of the POS tagging module of MAGTag, only 42 words were incorrectly tagged, which results in 75.10% correctly tagged words.
HATPOST was not altered in the implementation, nor was it retrained in any manner.
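The mapping of specific tags to the general tags of Table 2 can be sketched as a lookup on the tag prefix. The paper does not give the mapping rules, so everything below is an assumption made for illustration: the prefix-based approach, the treatment of determiner (DT) and conjunction (CC) tags as function words, and the function name.

```python
# Hypothetical prefix-to-general-tag mapping, loosely following Tables 1 and 2.
GENERAL_TAG_BY_PREFIX = {
    "NN": "N",  # nouns (common, proper)
    "JJ": "J",  # adjectives
    "VB": "V",  # verbs
    "RB": "R",  # adverbs
    "PR": "P",  # pronouns
    "DT": "P",  # determiners, assumed to count as function words
    "CC": "P",  # conjunctions, assumed to count as function words
}


def general_tag(specific_tag):
    """Map a specific POS tag to a general tag, or YYY when unknown."""
    return GENERAL_TAG_BY_PREFIX.get(specific_tag[:2].upper(), "YYY")


print([general_tag(t) for t in ("NNC", "VBTS", "JJD", "DTC", "XYZ")])
```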
3.2 Morphological Analyzer
The first one is a set of 51 words from [1] and the second set is
217 from [5].
MAGTag was used to analyze the words given to it and output their base forms (root words); it also doubled as a POS tagger when HATPOST failed to recognize a word’s appropriate POS tag. Initially, MAGTag was used as is; however, some errors relating to rule implementation and multiple executions were encountered and subsequently had to be resolved.
Sentence-level Phonology The sentence-level analyzer was tested on a Tagalog article composed of 11 sentences. According to [5], words in medial position in a sentence have their initial and final consonant phonemes omitted. This is because in sentence-level speech such phonemes are negligible in pronunciation and their omission does not affect the meaning of the sentence. The initial and final words, on the other hand, are retained as normal. Punctuation marks are ignored and a space is denoted by / . /. A correct output for the sentence-level analyzer would be /’i:saŋ . kapulu’an . aŋ . bansaŋ . pilipi:nas/. The only errors encountered were caused by erroneous word-level transcriptions of individual words.
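The sentence-level rule can be sketched as a pass over the word-level transcriptions. The sketch rests on one interpretive assumption, made so that it reproduces the sample output quoted above: only the unwritten phonemes (an initial glottal stop, and a final glottal stop or /h/) are dropped from medial words. The function name is hypothetical and the glottal stop is written as an ASCII apostrophe.

```python
def sentence_phonology(word_transcriptions):
    """Join word-level transcriptions into a sentence-level transcription.

    The first and last words are kept as they are. Medial words lose an
    initial glottal stop and a final glottal stop or /h/ (an assumption of
    this sketch). Words are separated by " . ".
    """
    last = len(word_transcriptions) - 1
    out = []
    for i, phon in enumerate(word_transcriptions):
        if 0 < i < last:
            if phon.startswith("'"):
                phon = phon[1:]
            if phon.endswith(("'", "h")):
                phon = phon[:-1]
        out.append(phon)
    return "/" + " . ".join(out) + "/"


words = ["'i:saŋ", "kapulu'an", "'aŋ", "bansaŋ", "pilipi:nas"]
print(sentence_phonology(words))
```

Since the function only post-processes word-level output, any word-level transcription error carries over unchanged, which matches the observation that the only sentence-level errors came from the word level.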
Unlike HATPOST, MAGTag is purely rule-based in its word analysis. Given this approach, its results are consistent across subsequent executions.
The articles that were fed to the POS tagger were also fed to MAGTag for morphological analysis. For Blog, there were 192 correctly analyzed words, which is 77.78% of the total number of words. For Wiki, MAGTag produced 138 correctly analyzed words, which is 73.02% of the total number of words. The errors generated by MAGTag were sorted into different categories.
Generalized Phonology Tree An experiment was conducted in
order to obtain a comparative analysis for the accuracy and
efficiency of the initial automaton method. This experiment
involved pattern-matching with multiple trees which will be
generated based on training data. This was partially derived
from the OSTIA algorithm [3].
Errors in Affixations There have been instances wherein the
system produced errors after the extraction of the affixes of the
words. These errors consisted of words with wrong affix
definitions during the analysis. These occurred in the
processing of the rules for the pattern of an affix of a word. An
example of which is the word bansang. MAGTag did not
consider the ng in bansang as an affix, specifically a suffix. If
the affixes were wrongly or incompletely extracted, the analysis
of its root word would also be incorrect. Results show that for
the Blog, 31.48% (17/54) of the errors fall under this
classification. As for Wiki, 47.06% (24/51) of the errors were
under this category as well. Most of the errors consisted of the
[g] affix that was not extracted for words ending with [an] (ex:
“bansang”).
The algorithm for the Generalized Phonology Tree has two main phases, namely training and output generation. To populate the trees, training data must be provided. The trees consist of input-output syllable pairs represented as CV (consonant-vowel) patterns. Each input-output syllable pair corresponds to a node, wherein the root nodes are the first syllables and subsequent syllables make up the child nodes. Each node has a weight value which denotes how many times the pattern has occurred in the training data. The resulting generalized tree (an example is shown in Figure 12) can be used to generate phonology from given input words.
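The two phases can be sketched with a weighted tree of syllable-pattern pairs. This is a minimal sketch, not the authors' implementation: a single dummy root stands in for the set of trees, generation greedily follows the heaviest matching child, and the class name, function names and pattern strings are assumptions.

```python
class Node:
    """One input-output syllable pair; weight counts its training occurrences."""

    def __init__(self):
        self.weight = 0
        self.children = {}  # (input_pattern, output_pattern) -> Node


def train(samples):
    """Build the tree from (input_patterns, output_patterns) samples."""
    root = Node()  # dummy root; its children are the first-syllable nodes
    for in_syllables, out_syllables in samples:
        node = root
        for pair in zip(in_syllables, out_syllables):
            node = node.children.setdefault(pair, Node())
            node.weight += 1
    return root


def generate(root, in_syllables):
    """Follow the heaviest matching child per syllable; stop when none matches."""
    node, output = root, []
    for syllable in in_syllables:
        matches = [(pair, child) for pair, child in node.children.items()
                   if pair[0] == syllable]
        if not matches:
            break  # no backtracking: the output is left incomplete
        pair, node = max(matches, key=lambda m: m[1].weight)
        output.append(pair[1])
    return output


samples = [
    (["CV", "CV"], ["CV:", "CVh"]),
    (["CV", "CV"], ["CV:", "CVh"]),
    (["CV", "CV"], ["CV", "CV'"]),
]
root = train(samples)
print(generate(root, ["CV", "CV"]))
```

Because generation stops as soon as no child matches, an input longer than any trained path yields a truncated output, which is one plausible way for incomplete transcriptions to arise in a tree of this kind.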
Errors due to Overanalysis Overanalysis occurs when a word or a group of words is unnecessarily analyzed. Most of the occurrences happened upon encountering determiners and adverbs. These were words that are root words in essence but are not technically defined as actual root words. Examples of such words are nang and habang. MAGTag analyzed these words and returned ng and haba respectively, but it was expected that these words would retain their current forms. 40.74% (22/54) of the errors were classified into this category for Blog and 25.49% (13/51) for Wiki.
Figure 12. Sample Generated Tree
Errors due to Underanalysis There were also instances
wherein a certain word was not thoroughly analysed. Words
which exhibited affix reduplication were classified into this
category. These errors consisted of words that were lacking in
analysis reiterations. This means that the words are too
complex or the patterns do not match any of the rules for it to
be analyzed even further. An example of such an occurrence is
the word katimugan which results to katimog but should have
been timog. Wiki and Blog had 15.67% (8/51) and 27.68%
(15/54) respectively.
3.3 Phonological Analyzer
The automata used by this module were built through empirical analysis of input-output pairs and general observations on Tagalog phonology based on [4]. The automata were presented to the linguist and modified accordingly based on the linguist’s feedback. There were two major sets of Tagalog words used in testing the automata during their different phases.
Results The final stand-alone automaton resulted in 67% accuracy using the 217 words, 66.23% using Wiki and 60.23% using Blog (64% on average). The errors were commonly attributed to the word-final consonant phoneme, such as /hila:ga’/ incorrectly transcribed as /hila:gah/. The second type of error consists of words that did not fall under the pattern that was addressed with the second representation of the automaton, which was mostly incorrect vowel lengthening. These words mainly consisted of borrowed words such as /bentilador/ and /te:rnoh/, erroneously transcribed as /bentila:dor/ and /ternoh/ respectively. A less frequent error consists of both, an example of which is /’aruga’/ incorrectly transcribed as /’aru:gah/. Table 4 shows more details on each type of error for each test, as percentages of the total words.
Table 4. Breakdown of errors from PA tests
Type of Error                   217 words   Wiki     Blog
Incorrect vowel lengthening     23.5%       31.17%   31.82%
Word-final consonant phoneme    8.76%       2.60%    6.81%
Both                            0.46%       n/a      1.13%
Filipino sentence input and generate the corresponding part-of-speech, root word, and phonology of the input sentence. To complete the task of generating the POS, root word and phonology, the following modules were created:
• The Part-of-Speech tagging module, which successfully tags input sentences using both HATPOST and the POS tagging module of MAGTag.
• Nothing was modified in the implementation of HATPOST.
• The Morphological Analysis module, which was successful in determining root words with their respective affixes using the Morphological Analysis module of MAGTag.
• The rules were modified to accommodate the overlooked exceptions produced by MAGTag.
In testing the automaton with POS querying, Wiki was used.
According to the results, 53% of the words were known which
were mostly composed of function words such as ang, mga and
sa. The other 47% contained the unknown words and words
that were incorrectly tagged by the POS tagger. For these
unknown words, the unknown handler was used to generate the
phonology. Table 5 shows a more detailed breakdown of these
errors.
Although the exceptions in MAGTag were resolved, the
alteration that was made to the rules of MAGTag did not affect
the accuracy of its analysis.
Table 5. Breakdown of errors from PA w/ POS tests
Type of Error                        Percentage from total unknown words
Found in corpus but incorrect POS    6%
Unknown in both tagger and corpus    35%
Unknown in corpus but tagged         59%
For the Phonology module, the Automata and the Generalized Phonology Tree were created to represent and implement phonological rules and to generate phonology, with accuracies of 64% and 52% respectively. As for compound words, the system can only identify them if they include a hyphen.
The Generalized phonology tree still has some issues to be
handled regarding its implementation, which lead to a lesser
accuracy than expected. The output generation phase may have
to be improved. This can be done by implementing a backtrack
method such that if the subtree encountered has reached its
final node when there are still more input syllables to process,
it will search for another appropriate subtree. Also, it is limited
based on the training data provided. As shown in the testing,
the training data used did not include patterns for function
words.
In testing the implementation of the generalized phonology tree, the training data consisted of the 217 words from [5], which were also used for the automaton testing. All of the words were used for both training and testing. In this test, the accuracy amounted to 64%, as opposed to the 67% accuracy of the automaton using the same data set. The tree was able to correctly generate the output for the majority of the words, given that their patterns occurred most often in the training data. However, the slight drop in accuracy was attributed to certain words having incomplete output due to the issues mentioned in the previous section. This solution was not effective with certain input words. Examples of such words were babae and halaan, which had the incomplete transcriptions /baba:/ and /hala:/ respectively. The implementation of the generation phase in processing similar words has to be improved. With regard to the aforementioned glide-consonant cluster syllable issue, the training phase encountered 5 such words among the 217 words. Therefore, only a total of 212 words were used for the training.
As with the rule-based automaton, several of the incorrectly
transcribed words were attributed to the word-final consonant
phoneme usually being generated as the glottal fricative /h/
instead of the glottal stop /’/, since both approaches used a
similar implementation (frequency count) in determining the
phoneme. Therefore, the training-based experiment did not have
a significant effect on this issue, as it produced the same
result as the rule-based automaton.
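The effect of a frequency count on the word-final phoneme can
be illustrated with a short sketch. The counts below are
invented for illustration and the function name is
hypothetical; the sketch only assumes that the most frequent
phoneme observed in training is always the one selected.

```python
from collections import Counter


def most_frequent_final(observations):
    """Pick the word-final phoneme seen most often in training."""
    counts = Counter(observations)
    return counts.most_common(1)[0][0]


# Hypothetical training observations: "h" is the glottal fricative,
# "'" stands in for the glottal stop.
training_finals = ["h"] * 7 + ["'"] * 3

choice = most_frequent_final(training_finals)

# The same choice is applied to every word, so each word that actually
# ends in the minority phoneme is transcribed incorrectly.
predictions = [choice for _ in training_finals]
correct = sum(p == t for p, t in zip(predictions, training_finals))
print(choice, correct, len(training_finals))
```

Because the selection ignores the word itself, retraining on
the same kind of data cannot change which phoneme wins, which
is consistent with the tree and the automaton showing the same
error.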
As a stand-alone module, the phonological analyzer achieved an
accuracy of 67%. However, the integration with HATPOST and
MAGTag affected the accuracy of the phonological analyzer,
which dropped to 54%. The handling of ambiguous words depends
on the output of the POS tagger.
A Wiki article was used as testing data against the tree
generated from the previously used 212 words. The accuracy was
37%, with most of the errors coming from the function words.
Most of the function words used in the article were not found
in the tree and thus had to be run through the automaton. The
reason for this is that the training data did not contain the
patterns that constitute most function words, which were mostly
monosyllabic. This shows that the results rely heavily on the
patterns derived from the training data. Another test was
conducted using Wiki as both the training and testing data. The
accuracy improved to 55% for this test.
The phonology module also provides a save function for future
reference. The function outputs a text file in the format shown
in Figure 13.
Figure 13. Format of Output text file
<I> represents the input sentence, <SLP> represents the
sentence-level phonology, <W> represents the words in the
input sentence, and <WP> represents the word-level
phonology. The output file can also be used as training data in
building the generalized phonology tree.
4. CONCLUSION AND FUTURE WORK
4.1 Conclusion
filSPAM (Sentence-level Phonological and Morphological
Analyzer for Filipino) was developed to analyze a given
Tagalog sentence.
This research is a stepping stone towards a back-end for
human-computer interaction, specifically using the Tagalog
language. This purpose is achieved through understanding and
analyzing basic sentence components such as parts-of-speech,
root words and word pronunciations.
4.2 Recommendation
During the course of research and development of filSPAM,
several issues were encountered and some are still left
unaddressed. Aside from these, improvements can also be made to
increase the performance of the system. The following
suggestions were noted for future research endeavors:
Part-of-Speech Tagger. The training data for HATPOST is
currently not sufficient. Having appropriate training data for
HATPOST will improve both the quality and the accuracy of the
tagging. The training data needs to be of the same context as
the expected input to produce the best results. The degree of
formality of the words may prove of some use in this scenario.
Another consideration is the genre from which the input was
retrieved (e.g. if the input comes from short stories, the POS
tagger will perform better if the training sets came from short
stories as well).
Morphological Analyzer. The architecture of MAGTag can be
improved. The implementation of the word analysis of MAGTag is
tedious to modify. Therefore, a complete and careful redesign
of the code of the Analysis Module would help improve its
performance.
Phonological Analyzer. The phonology module cannot handle
words that are spelled the same but have two different
pronunciations. Ambiguity can be solved through context.
Context can be identified through the use of part-of-speech,
for example /ora:san/, meaning to record time, and /orasan/,
which refers to a watch. However, part-of-speech is not always
enough. For example, /ba:ta’/, meaning a child, and /ba:tah/,
meaning a cloth, are both nouns and can be differentiated by
identifying their thematic roles: /ba:ta’/ is an agent/actor
while /ba:tah/ is an instrument/object.
The Generalized Phonology Tree was constructed using training
data consisting of words from various sources. With this
training data, different approaches for training can be used,
such as HMM (Hidden Markov Model) [7]. Instead of using the
frequency of the node, that is, the number of instances a node
occurred in the training data, as the basis for which node to
visit, a probability measure can also be used.
To improve the sentence-level phonology, the concepts of
intonation and assimilation can be used. In assimilation,
wherein a sound accommodates itself to a neighboring sound, the
word-final consonant phoneme of one word can blend into the
sound at the beginning of the following word, such as ang pader
becoming /am pader/. This phenomenon mostly involves Tagalog
nasal consonant phonemes such as /n/, /m/ and /ng/ followed by
words beginning in a labial/labiodental consonant such as /p/
or /b/. This commonly occurs during normal rapid conversation
and does not affect the meaning of the words. Intonation, on
the other hand, deals with pitch phenomena in varying positions
in a sentence. An utterance may have a different connotation or
may give emphasis if used with a different pitch.
To improve the overall accuracy of the system, a
recommendation would be to integrate external dictionaries
such as FilWordNet and external systems such as fiLex.
FilWordNet may be used to augment the training data used by
HATPOST once its tagset has been converted to the Rabo-Buban
Tagset. Since part-of-speech is not enough to solve ambiguity
in determining phonology for certain words, the thematic roles
from fiLex as well as the POS from HATPOST and MAGTag may be
used as additional parameters in building the Generalized
Phonology Tree.
5. REFERENCES
[1] Aquino, M., Fernandez, E., & Villanueva, K. (2007).
Morphological Analyzer and Generator for Tagalog. De La
Salle University, Manila.
[2] Ciego, R., Huang, Z., Navarro, G., Roxas, R., & Torres,
M. (2007). Hybrid Approach to Tagalog Part-of-Speech
Tagging. De La Salle University, Manila.
[3] Gildea, D., & Jurafsky, D. (1995). Automatic Induction of
Finite State Transducer for Simple Phonological Rules.
[4] Otanes, F., & Schachter, N. (1972). Tagalog reference
grammar. University of California Press, Berkeley, Los
Angeles.
[5] Tagalog Dictionary. (n.d.). Retrieved February 23, 2011
from
http://www.seasite.niu.edu/Tagalog/Dictionary/diction.htm
[6] Llamzon, T. (1976). Modern Tagalog: A Functional
Structural Description. The Hague: Mouton.
[7] Rabiner, L. R. (1989). A tutorial on Hidden Markov
Models and selected applications in speech recognition.
Proceedings of the IEEE, 77(2): 257-286.
[8] Rabo, V. (2004). TPOST: A Template-based, n-gram
part-of-speech tagger for Tagalog. De La Salle University,
Manila.
[9] Yukiko, S. (1983). A two-level morphological analysis for
Japanese. Retrieved November 22, 2010 from
http://www2.parc.com/istl/members/karttune/publications/archive/kimmo/kimmo-japanese.pdf