CHAPTER 3
METHODOLOGY
3.1 Needs and Problem Analysis
3.1.1 Needs Analysis
Many tasks related to literature and linguistics can be simplified using technology, such as language translation, grammar analysis, and information retrieval. Specific algorithms and rules are needed to make a technology capable of performing these tasks, because every language has its own nature and characteristics.
The Indonesian language has both agglutinative and inflectional characteristics. Its agglutinative character gives Indonesian a very large number of words, since root words combine freely with affixes. Its inflectional character produces many rules and possible word forms, because some combinations of root words and affixes change the form of the root word. Hence, many word-transformation rules, and sometimes transformation exceptions, must be defined.
Checking every possible word combination in Indonesian is therefore impractical. Instead, we look for algorithms and rules capable of covering all those combinations. This is why lemmatization algorithms are needed: according to Ingason (2008, p. 1), “Lemmatization is a process to find the base (entry) of a given word form”. With lemmatization, there is no need to check every combination of Indonesian words; the process finds the base word form directly.
Lemmatization is the basic building block required by the more complex applications mentioned in the first paragraph.
3.1.2 Problem Analysis
In Indonesia, lemmatization has not been a popular research subject, although some researchers have tried to develop related algorithms. Before lemmatization, a method called stemming was used. Stemming is a process that aims to reduce the number of variations in the representation of a concept to a standard morphological/canonical representation (Kowalski, 2008, p. 76). The method was popularized by Martin Porter. Stemming and lemmatization have different purposes: stemming cuts words down to their simplest form regardless of the meaning or shape of the result, whereas lemmatization reduces any derived word to its base form, as found in a dictionary entry.
However, no lemmatization method has been developed for the Indonesian language; the only methods developed so far are stemming methods. Most of these stemming methods share a common problem: instead of returning words in their root or stemmed form, they return dictionary entry words. Given Kowalski’s definition of stemming, such a method can no longer be called stemming; it is closer to lemmatization, because it checks words against a dictionary, which stemming does not require. Yet these methods still cannot be categorized as lemmatization, because they treat a dictionary entry word that does not match the root word as a false result, which contradicts Ingason’s definition of lemmatization.
So, most of the methods developed for Indonesian are neither stemming nor lemmatization, but a combination of both. For this reason as well, this research aims to produce an algorithm that is a proper lemmatization method according to the official definition of lemmatization.
3.2 Alternative Solution
To resolve the ambiguity between lemmatization and stemming in Indonesia, a clear and standard lemmatization method has to be created. For clarity, the official definition of lemmatization from a valid textbook is used as the theoretical foundation: “Lemmatization is a process to find the base (entry) of a given word form” (Ingason, 2008, p. 1). The algorithm developed here must not violate this definition, and its goal and standard are also determined by it.
3.3 Algorithm Design
The lemmatization algorithm is based on Arifin, Mahendra, and Ciptaningtyas’ improvement of an Indonesian stemming algorithm, the Enhanced Confix-Stripping Stemmer (2009). The Enhanced Confix-Stripping Stemmer (hereafter ECS) is chosen because it is the most relevant and up-to-date work on Indonesian stemming. ECS improves the Confix-Stripping Stemmer proposed by Asian, Nazief, Adriani, and Tahaghoghi (2007); the improvements consist of additional and modified rules, plus a suffix backtracking step added to increase accuracy.
The lemmatization algorithm does not aim to improve ECS, because the two differ in goal/purpose; it aims to modify ECS to fit the lemmatization concept. There are, however, similarities in some of the processes, for example affix removal to reach a lemma form, and these processes can be re-implemented with minimal changes. Some cases that ECS fails to stem correctly will hopefully be solved by the lemmatization algorithm:
1. Ineffective rules, especially those that handle ‘meny-‘ and ‘peny-‘; for example, ‘penyanyi’ and ‘menyatakan’ are not stemmed.
2. Compound words, such as ‘diberitahukan’, which is not stemmed.
3. Overstemming, such as ‘penyidikan’ producing ‘sidi’.
4. Understemming, such as ‘mengalami’ producing ‘alami’.
The lemmatization algorithm itself involves rule precedence check, inflectional suffix
removal, derivational suffix removal, derivational prefix removal, recoding, suffix
backtracking, hyphenation checking, and dictionary lookup.
The algorithm is depicted in Figure 3.1.
[Flowchart omitted. Its process boxes are: Read Input Word, Dictionary Lookup, Check Rule Precedence, Remove Inflectional Suffix, Remove Derivational Suffix, Remove Derivational Prefix, Perform Recoding, and Suffix Backtracking; each removal or recoding step is followed by a dictionary lookup that either returns the lemma (Success) or continues to the next step (Failed), and if every step fails, the input word is returned.]
Figure 3.1 Lemmatization Algorithm Flowchart
3.3.1 Dictionary Lookup
This process checks whether the word is listed as a lemma in the dictionary. When the lookup succeeds, the algorithm stops and the lemma is returned as the result. The lookup is executed at the end of every other process, so that every applied transformation is checked and the lemma is returned immediately once found.
3.3.2 Rule Precedence Check
This process determines the execution order of the other processes. Some prefix-suffix combinations produce faster and more accurate results if prefix removal is executed before suffix removal. These combinations are:
1. ‘be-‘ and ‘-lah’,
2. ‘be-‘ and ‘-an’,
3. ‘me-‘ and ‘-i’,
4. ‘di-‘ and ‘-i’,
5. ‘pe-‘ and ‘-i’,
6. ‘te-‘ and ‘-i’.
When an input word has a prefix-suffix pair matching one of the combinations listed, the execution order is derivational prefix removal, recoding, inflectional suffix removal, and derivational suffix removal. Otherwise, inflectional suffix removal and derivational suffix removal are executed first.
3.3.3 Inflectional Suffix Removal
Inflectional suffixes come in two types: particles {‘-lah’, ‘-kah’, ‘-tah’, ‘-pun’} and possessive pronouns {‘-ku’, ‘-mu’, ‘-nya’}. Indonesian language structure dictates that a particle suffix is always the last suffix added to a word, so this process tries to remove the particle before the possessive pronoun. For example, the word ‘bajukupun’ contains the particle ‘-pun’ and the possessive pronoun ‘-ku’. The particle is removed first, producing ‘bajuku’, and a dictionary lookup is performed. Since that word is not listed in the dictionary, the possessive pronoun is also removed, producing ‘baju’ as the result.
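The particle-then-pronoun order can be sketched as follows (a minimal sketch; the function name is hypothetical and the dictionary is a stand-in for the real lemma list):

```python
PARTICLES = ("lah", "kah", "tah", "pun")
PRONOUNS = ("ku", "mu", "nya")

def remove_inflectional(word, dictionary):
    # Try particles first, then possessive pronouns, with a dictionary
    # lookup after each group, as described in Section 3.3.3.
    for group in (PARTICLES, PRONOUNS):
        for suffix in group:
            if word.endswith(suffix):
                word = word[: -len(suffix)]
                break
        if word in dictionary:
            return word
    return word

print(remove_inflectional("bajukupun", {"baju"}))  # → baju
```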
3.3.4 Derivational Suffix Removal
This process tries to remove a derivational suffix {‘-i’, ‘-kan’, ‘-an’} from the given word. A derivational suffix is always added to a word before any inflectional suffix, so this process always runs after inflectional suffix removal (except when the word has no inflectional suffix). For example, the word ‘nyalakan’ contains the derivational suffix ‘-kan’, which is removed to produce ‘nyala’.
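A minimal sketch of this step (hypothetical function name; suffixes are tried longest first so that ‘-kan’ is not mistaken for ‘-an’):

```python
# Derivational suffixes, ordered longest first ('-kan' before '-an').
DERIVATIONAL = ("kan", "an", "i")

def remove_derivational_suffix(word):
    for suffix in DERIVATIONAL:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

print(remove_derivational_suffix("nyalakan"))  # → nyala
```

The ordering matters: ‘nyalakan’ also ends in ‘-an’, and trying ‘-an’ first would wrongly produce ‘nyalak’.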
3.3.5 Derivational Prefix Removal
Derivational prefixes fall into two groups: plain {‘di-‘, ‘ke-‘, ‘se-‘} and complex {‘me-‘, ‘be-‘, ‘pe-‘, ‘te-‘}. Plain prefixes, as the name suggests, require no rules and do not transform the word when attached, so removal is done by directly stripping the detected prefix (e.g. ‘dibawa’, ‘sejalan’, ‘ketutup’). Complex prefixes, on the other hand, transform the word when attached, according to the rules below:
Table 3.1 Rule List for ‘be-‘ Prefix
Rule 1: berV…      → ber-V… | be-rV…
Rule 2: berCAP…    → ber-CAP…, where C != r and P != er
Rule 3: berCAerV…  → ber-CAerV…, where C != r
Rule 4: belajar    → bel-ajar
Rule 5: beC1erC2…  → be-C1erC2…, where C1 != {r | l}
Table 3.2 Rule List for ‘te-‘ Prefix
Rule 6:  terV…       → ter-V… | te-rV…
Rule 7:  terCerV…    → ter-CerV…, where C != r
Rule 8:  terCP…      → ter-CP…, where C != r and P != er
Rule 9:  teC1erC2…   → te-C1erC2…, where C1 != r
Rule 10: terC1erC2…  → ter-C1erC2…, where C1 != r
Table 3.3 Rule List for ‘me-‘ Prefix
Rule 11: me{l|r|w|y}V…   → me-{l|r|w|y}V…
Rule 12: mem{b|f|v}…     → mem-{b|f|v}…
Rule 13: mempe…          → mem-pe…
Rule 14: mem{rV|V}…      → me-m{rV|V}… | me-p{rV|V}…
Rule 15: men{c|d|j|s|z}… → men-{c|d|j|s|z}…
Rule 16: menV…           → me-nV… | me-tV…
Rule 17: meng{g|h|q|k}…  → meng-{g|h|q|k}…
Rule 18: mengV…          → meng-V… | meng-kV… | (mengV… if V = 'e')
Rule 19: menyV…          → meny-sV…
Rule 20: mempA…          → mem-pA…, where A != e
Table 3.4 Rule List for ‘pe-‘ Prefix
Rule 21: pe{w|y}V…      → pe-{w|y}V…
Rule 22: perV…          → per-V… | pe-rV…
Rule 23: perCAP…        → per-CAP…, where C != r and P != er
Rule 24: perCAerV…      → per-CAerV…, where C != r
Rule 25: pem{b|f|v}…    → pem-{b|f|v}…
Rule 26: pem{rV|V}…     → pe-m{rV|V}… | pe-p{rV|V}…
Rule 27: pen{c|d|j|z}…  → pen-{c|d|j|z}…
Rule 28: penV…          → pe-nV… | pe-tV…
Rule 29: pengC…         → peng-C…
Rule 30: pengV…         → peng-V… | peng-kV… | (pengV-… if V = e)
Rule 31: penyV…         → peny-sV…
Rule 32: pelV…          → pe-lV…, except 'pelajar' returns 'ajar'
Rule 33: peCerV…        → per-erV…, where C != {r|w|y|l|m|n}
Rule 34: peCP…          → pe-CP…, where C != {r|w|y|l|m|n} and P != er
Rule 35: peC1erC2…      → pe-C1erC2…, where C1 != {r|w|y|l|m|n}
V stands for a vowel (a, i, u, e, o), C for a consonant, A for any alphabetic character (a-z), and P for a short word fragment such as ‘er’.
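To make the rule notation concrete, here is a minimal sketch of two of the ‘me-‘ rules expressed as pattern matches (a hypothetical representation covering only rules 15 and 19; a second candidate, when present, would become the recoding path):

```python
import re

def apply_me_rules(word):
    # Return candidate stems for a handful of 'me-' rules (Table 3.3).
    candidates = []
    m = re.match(r"men([cdjsz].*)", word)   # rule 15: men-{c|d|j|s|z}…
    if m:
        candidates.append(m.group(1))
    m = re.match(r"meny([aiueo].*)", word)  # rule 19: menyV… → meny-sV…
    if m:
        candidates.append("s" + m.group(1))
    return candidates

print(apply_me_rules("mencari"))   # ['cari']
print(apply_me_rules("menyapu"))   # ['sapu']  (rule 19 restores the 's')
```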
Indonesian permits combinations of derivational prefixes on a word (e.g. ‘berkelanjutan’, which originates from ‘lanjut’); however, constraints limit the possible combinations. The possible combinations are:
1. ‘di-‘, followed by ‘pe-‘ or ‘be-‘ prefix type (e.g. ‘diperlakukan’ and
‘diberlakukan’)
2. ‘ke-‘, followed by ‘be-‘ or ‘te-‘ prefix type (e.g. ‘kebersamaan’ and
‘keterlambatan’)
3. ‘be-‘, followed by ‘pe-‘ prefix type (e.g. ‘berpengalaman’)
4. ‘me-‘, followed by ‘pe-‘, ‘te-‘, or ‘be-‘ prefix type (e.g. ‘mempersulit’, ‘menertawakan’, and ‘membelajarkan’)
5. ‘pe-‘, followed by ‘be-‘ prefix type (e.g. ‘pemberhentian’), with special case for
‘tertawa’ (‘penertawaan’).
The lemmatization algorithm removes up to three prefixes and three suffixes, where the three suffixes consist of the derivational suffix, possessive pronoun, and particle types, and the prefixes follow the combination rules above. This process is therefore repetitive, running up to three iterations. At the end of every iteration, the current state of the word is checked against the dictionary to prevent overstemming. Iteration also terminates when the currently identified prefix was already removed in a previous iteration, or when the word contains a disallowed prefix-suffix pair, listed below:
Table 3.5 Disallowed Affix Pairs
Prefix  Disallowed Suffixes
be-     -i
di-     -an
ke-     -i, -kan
me-     -an
se-     -i, -kan
te-     -an
According to Arifin and Setiono (2002), a valid word can contain up to two prefixes and three suffixes. However, this does not hold across the whole Indonesian scheme; take for example ‘sepengetahuan’, which contains the prefixes ‘se-‘, ‘pe-‘, and ‘ke-‘. In this step, the ‘se-‘ prefix is removed first, producing ‘pengetahuan’. On the second iteration, ‘pe-‘ is removed, producing ‘ketahuan’. The last iteration removes ‘ke-‘, producing ‘tahuan’. This is why the lemmatization algorithm iterates up to three times.
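The ‘sepengetahuan’ walk above can be sketched as three iterations (a hypothetical helper that includes only the rules this example needs, with rule 30’s ‘peng-kV…’ recoding path applied directly):

```python
def strip_one_prefix(word):
    # Plain prefixes 'se-' and 'di-' are stripped directly.
    if word[:2] in ("se", "di"):
        return word[2:]
    # Rule 30 recoding path: pengV… → peng-kV… ('pengetahuan' → 'ketahuan').
    if word.startswith("peng") and word[4:5] in ("a", "i", "u", "e", "o"):
        return "k" + word[4:]
    # Plain prefix 'ke-' (checked after 'peng' so it is not misapplied).
    if word[:2] == "ke":
        return word[2:]
    return word

steps, word = [], "sepengetahuan"
for _ in range(3):            # at most three prefixes per word
    word = strip_one_prefix(word)
    steps.append(word)
print(" -> ".join(steps))     # pengetahuan -> ketahuan -> tahuan
```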
3.3.6 Recoding
When the affix removal process still fails the dictionary lookup, there is a possibility that the removal did not transform the word correctly. For example, the word ‘menanya’ is transformed into ‘me-nanya’, which fails the lookup; this happens because the original word, ‘tanya’, is transformed into ‘nanya’ when combined with the prefix ‘me-‘. However, there are also cases where the lemma’s first letter really is ‘n’, for example ‘nama’ in the word ‘menamai’. The purpose of recoding is to explore all transformation possibilities, which is achieved by recording alternative transformation paths. Take rule 1 for example: there are two possible outputs. During affix removal, the chosen output is always the left one; when recoding is executed, the algorithm checks whether any alternative path was recorded during affix removal, and replaces the current transformation with the alternative. For example, the word ‘berima’ (in rhythm) contains the prefix ‘be-‘, and affix removal rule 1 applies (Table 3.1) because the word follows the pattern ‘berV…’. The default output of this rule removes ‘ber-‘, producing ‘ima’, which fails the dictionary lookup. Recoding then checks the recorded path, i.e. ‘berV… → be-rV…’, reattaches the removed prefix (‘ima’ back to ‘berima’), and applies the recoding rule (‘berima’ to ‘rima’), producing ‘rima’ as the result.
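The ‘berima’ example can be sketched by treating rule 1 as a list of candidate outputs, the default first and the recoding path second (hypothetical helper names, not the thesis’s implementation):

```python
def rule1_paths(word):
    # Rule 1: berV… → ber-V… (default) | be-rV… (recoding path).
    if word.startswith("ber") and word[3:4] in ("a", "i", "u", "e", "o"):
        return [word[3:], word[2:]]
    return [word]

def lemmatize_rule1(word, dictionary):
    # Try the default path first; fall back to the recoding path.
    for candidate in rule1_paths(word):
        if candidate in dictionary:
            return candidate
    return word

print(lemmatize_rule1("berima", {"rima"}))  # 'ima' fails, 'rima' is found
```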
3.3.7 Suffix Backtracking
This process is attempted after affix removal and recoding fail the dictionary lookup. At each step, prefix removal and recoding are performed again. First, the removed prefixes are reattached to the word, then prefix removal and recoding are performed. If the result fails the dictionary lookup, the prefixes are reattached along with the removed derivational suffix. If the result still fails, the prefixes, derivational suffix, and possessive pronoun are reattached. If it still fails, the last step reattaches the particle. There is one special case: when the removed derivational suffix is ‘-kan’, the ‘-k’ is reattached first, and only if the result fails is ‘-an’ reattached. Considering the word ‘pemberhentiannyapun’, and assuming that the dictionary lookup always fails, the reattachment proceeds as follows:
1. Reattach prefixes: pemberhenti, and perform derivational prefix removal:
a. ‘pe-‘ prefix removed, resulting in ‘berhenti’
b. ‘be-‘ prefix removed, resulting in ‘henti’
2. Reattach derivational suffix: pemberhentian, and perform derivational prefix
removal:
a. ‘pe-‘ prefix removed, resulting in ‘berhentian’
b. ‘be-‘ prefix removed, resulting in ‘hentian’
3. Reattach possessive pronoun: pemberhentiannya, and perform derivational prefix
removal:
a. ‘pe-‘ prefix removed, resulting in ‘berhentiannya’
b. ‘be-‘ prefix removed, resulting in ‘hentiannya’
4. Reattach particle: pemberhentiannyapun and perform derivational prefix
removal:
a. ‘pe-‘ prefix removed, resulting in ‘berhentiannyapun’
b. ‘be-‘ prefix removed, resulting in ‘hentiannyapun’
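The reattachment sequence above can be sketched as successive restorations of the removed suffix groups (a hypothetical helper; dictionary lookups are omitted so every state is shown):

```python
def backtrack_states(stem, derivational, pronoun, particle):
    # Reattach suffix groups one at a time, innermost first,
    # yielding each word state that prefix removal would retry.
    states, word = [], stem
    for part in (derivational, pronoun, particle):
        states.append(word)
        word += part
    states.append(word)
    return states

for state in backtrack_states("pemberhenti", "an", "nya", "pun"):
    print(state)
# pemberhenti, pemberhentian, pemberhentiannya, pemberhentiannyapun
```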
3.3.8 Hyphenation Checking
Indonesian permits repetition of the same word (reduplication), joined by a hyphen (‘-‘), which can represent either a pluralized form (e.g. ‘bola-bola’) or a single meaning (e.g. ‘bolak-balik’). The word ‘kuda-kuda’ represents the pluralized form of ‘kuda’, which this process transforms into its singular form. The approach is to compare the two halves of the word: pluralized forms have identical halves (after affix removal). Repetitive words that carry a single meaning are not processed and are left as they are, because they are already in lemma form. For example, the word ‘perundang-undangan’ has its affixes removed, resulting in ‘undang-undang’. When a hyphen character (‘-‘) is detected, the word is split into two parts, which are checked for identity; in this case, ‘undang’ is returned as the result. This process is performed inside the dictionary lookup function.
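A minimal sketch of the check (hypothetical function name; affix removal is assumed to have already happened):

```python
def check_hyphenation(word):
    # Split on the first '-' and return one half only when both
    # halves are identical, i.e. the word is a pluralized form.
    if "-" in word:
        left, _, right = word.partition("-")
        if left == right:
            return left       # plural: 'undang-undang' → 'undang'
    return word               # single meaning: left as is

print(check_hyphenation("undang-undang"))  # undang
print(check_hyphenation("bolak-balik"))    # bolak-balik
```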
3.3.9 Pseudocode
begin
    read input word
    perform dictionary lookup
    if lookup success
        return lemma and end
    endif
    precedence = check rule precedence
    if precedence is prefix first
        remove derivational prefix
        perform dictionary lookup
        if lookup success
            return lemma and end
        endif
        perform recoding
        perform dictionary lookup
        if lookup success
            return lemma and end
        endif
        remove inflectional suffix
        perform dictionary lookup
        if lookup success
            return lemma and end
        endif
        remove derivational suffix
        perform dictionary lookup
        if lookup success
            return lemma and end
        endif
    else if precedence is suffix first
        remove inflectional suffix
        perform dictionary lookup
        if lookup success
            return lemma and end
        endif
        remove derivational suffix
        perform dictionary lookup
        if lookup success
            return lemma and end
        endif
        remove derivational prefix
        perform dictionary lookup
        if lookup success
            return lemma and end
        endif
        perform recoding
        perform dictionary lookup
        if lookup success
            return lemma and end
        endif
    endif
    perform suffix backtracking
    perform dictionary lookup
    if lookup success
        return lemma and end
    else
        return input word
    endif
end
3.3.10 Complexity
Referring to the lemmatization process proposed above, the complexity of the algorithm can be analyzed theoretically. Each process is summarized in terms of what it aims to complete. Let m be the number of prefixes and n the number of suffixes:
1. Dictionary Lookup (DL)
The dictionary lookup checks the word against the dictionary; it is considered a constant-time, single lookup.
2. Inflectional Suffix Removal (IS)
The goal of this process is to remove up to two suffixes (a particle and a possessive pronoun), which bounds it at two iteration steps. In the worst case, IS = 2.
3. Derivational Suffix Removal (DS)
The goal of this process is to remove one suffix; which means DS = 1.
4. Derivational Prefix Removal (DP)
The goal of this process is to remove up to three prefixes, an iterative process bounded at three iterations in the worst case; hence DP = 3.
5. Recoding (RC)
This process examines possible alternative paths for removed prefixes iteratively; since at most three prefixes are allowed, it is bounded at three iteration steps. In the worst case (where every removed prefix has an alternative recoding path), RC = 3.
6. Suffix Backtracking (SB)
This process redoes the derivational prefix removal iteratively, based on the suffixes that have been removed. The maximum number of suffixes is three, which bounds this process at three reattachment steps; since each step repeats up to m prefix removals, SB = mn.
Based on the summary above, the combined processes represent the lemmatization algorithm (regardless of execution order), and the time taken to lemmatize a single word (whether successful or not) can be defined as:

T(m, n) = IS + DS + DP + RC + SB
T(m, n) = 2 + 1 + 3 + 3 + mn
T(m, n) = 9 + mn
However, since at most three prefixes and three suffixes can be attached to a given word, the function above is bounded: if a word appears to carry more than three prefixes and/or suffixes, the suffix backtracking term is no longer mn but the limit itself, 3 × 3 = 9. Therefore, the complexity of the lemmatization algorithm can be concluded as:

T = 9 + (3 × 3) = 18

that is, a constant number of steps per word.