Using Technology Transfer to Advance Automatic

Using Technology Transfer to
Advance Automatic Lemmatisation
for Setswana
Hendrik J Groenewald
Centre for Text Technology (CTexT™)
Research Unit: Languages and Literature in the South African Context
North-West University, Potchefstroom Campus (PUK)
South Africa
E-mail: [email protected]
31 March 2009; Athens
Introduction
Lemmatisation
Methodology
Conclusion
Overview
• Introduction
• Lemmatisation
– Lemmatisation in Setswana
– Lemmatisation in Afrikaans
• Methodology
–
–
–
–
Memory-based Learning
Architecture
Data
Implementation
• Conclusion
31 March 2009; Athens
Hendrik J Groenewald
Introduction
Lemmatisation
Methodology
Conclusion
Introduction I
• South Africa has 11 official languages
– English has the most HLT resources
• Situation is changing
• SA Government is supporting initiatives to
develop core linguistic resources and
technologies
31 March 2009; Athens
Hendrik J Groenewald
Introduction
Lemmatisation
Methodology
Conclusion
Introduction II
• Focus: Using technology transfer for
– Improving existing linguistic resources
– Fast-tracking development
• Improving an existing Setswana lemmatiser
by applying a method developed for
Afrikaans
31 March 2009; Athens
Hendrik J Groenewald
Introduction
Lemmatisation
Methodology
Conclusion
Overview
Setswana
Afrikaans
Lemmatisation: Overview
• Process whereby the inflected forms of a
word are converted/normalised under the
lemma or base form
– swim, swimming, swam -> swim
• Lemmatisation is an important process for
many NLP tasks
– Information Retrieval
– Morphological Analysis
31 March 2009; Athens
Hendrik J Groenewald
Introduction
Lemmatisation
Methodology
Conclusion
Overview
Setswana
Afrikaans
Lemmatisation: Overview
• Not to be confused with Stemming
– The process whereby a word is reduced to its
stem by removing both inflectional and
derivational morphemes
• Two popular approaches to lemmatisation
– Rule-based approach
– Statistically/data-driven approach
31 March 2009; Athens
Hendrik J Groenewald
Introduction
Lemmatisation
Methodology
Conclusion
Overview
Setswana
Afrikaans
Lemmatisation: Setswana
• First Automatic Lemmatiser for Setswana
developed by Brits (2006)
– Found that only stems (and not roots) can act
independently as words
– Stems should be accepted as lemmas
• Brits formalised rules for determining lemmas
– Implemented as Finite-state transducers
• Accuracy: 62.17% when evaluated on a dataset
containing 295 randomly selected words
31 March 2009; Athens
Hendrik J Groenewald
Introduction
Lemmatisation
Methodology
Conclusion
Overview
Setswana
Afrikaans
Lemmatisation: Afrikaans
• 2003: Ragel – Accuracy of 67% when evaluated
on a 1,000 word data set
• Disappointing accuracy motivated development of
another lemmatiser using a different approach
• New Lemmatiser called Lia
– Based on data-driven machine learning method
– 73,000 lemma-annotated words
– Accuracy 92,8% on new data
• Motivated the application of machine learning
methods for lemmatisation in Setswana
31 March 2009; Athens
Hendrik J Groenewald
Introduction
Lemmatisation
Methodology
Conclusion
Memory-based Learning
Architecture
Data
Implementation
Methodology: Memory-based Learning
• Based on k-NN algorithm
– All instances of a certain problem correspond to points
in a n-dimensional space
– Nearest neighbours computed by some form of
distance metric
Xq
1-Nearest Neighbour
5-Nearest Neighbours
31 March 2009; Athens
Hendrik J Groenewald
Introduction
Lemmatisation
Methodology
Conclusion
Memory-based Learning
Architecture
Data
Implementation
Methodology: Architecture
Key
Start
Process
on k-NN algorithm
Decision
Based
Choose
Data
Algorithm
– All instances of a certain problem correspond to points
in a n-dimensional space
– Nearest neighbours computed by some form of
distance metric
Compute
Statistics on
Training Data
Training
Data
Store Data in
Memory
Classify
Evaluation
Data
Evaluation
Data
Generate
Lemma
Lemma
31 March 2009; Athens
Hendrik J Groenewald
Introduction
Lemmatisation
Methodology
Conclusion
Memory-based Learning
Architecture
Data
Implementation
Methodology: Data
• MBL requires large amounts of data
• Only 2,947 lemma-annotated Setswana words
available (Brits’s evaluation set)
• 2,947 words are a very small data set in memorybased learning terms
31 March 2009; Athens
Hendrik J Groenewald
Introduction
Lemmatisation
Methodology
Conclusion
Memory-based Learning
Architecture
Data
Implementation
Methodology: Data
• MBL requires that lemmatisation be performed as
a classification task
• Data should consist of feature vectors with
assigned class labels
– Feature vectors: letters of the word
– Class label: Transformation from word to lemma
31 March 2009; Athens
Hendrik J Groenewald
Introduction
Lemmatisation
Methodology
Conclusion
Memory-based Learning
Architecture
Data
Implementation
Methodology: Data
• Deriving class labels
– Longest common substring
– Indicates the string that needs to be removed, as well
as possible replacement strings during the
transformation from word form to lemma
– Positions of the character strings that need to be
removed are indicated as L (left) or R (right)
– If the word form and lemma are identical, the awarded
class is “0”
31 March 2009; Athens
Hendrik J Groenewald
Introduction
Lemmatisation
Methodology
Conclusion
Memory-based Learning
Architecture
Data
Implementation
Methodology: Data
• Deriving classes
31 March 2009; Athens
Hendrik J Groenewald
Introduction
Lemmatisation
Methodology
Conclusion
Memory-based Learning
Architecture
Data
Implementation
Methodology: Implementation
• Data
– 90% for training
– 10% for evaluation
• First version (default algorithmic parameters)
– 46.25% Accuracy
• Parameter optimisation
– 58.98%
• Accuracy is below that of the rule-based version
of Brits
31 March 2009; Athens
Hendrik J Groenewald
Introduction
Lemmatisation
Methodology
Conclusion
Memory-based Learning
Architecture
Data
Implementation
Methodology: Implementation
• Error analysis indicated obvious mistakes
31 March 2009; Athens
Hendrik J Groenewald
Introduction
Lemmatisation
Methodology
Conclusion
Memory-based Learning
Architecture
Data
Implementation
Methodology: Implementation
• Solution: Add class distributions to the output and
implement a “back-off” mechanism
• Resulted in a further increase in accuracy:
64.06%
31 March 2009; Athens
Hendrik J Groenewald
Introduction
Lemmatisation
Methodology
Conclusion
Conclusion
• The machine learning-based lemmatiser is only
1.9% more accurate than the rule-based version
• Small in comparison to the 25% increase
obtained for Afrikaans
• Size of the training data
– 2,652 words compared to 73,000 for Afrikaans
• Increasing the amount of training data will
increase the accuracy
• Most important result: Technology Transfer
31 March 2009; Athens
Hendrik J Groenewald
Introduction
Lemmatisation
Methodology
Conclusion
Acknowledgements
• The work of Jeanetta H. Brits, performed under
the supervision of Rigardt Pretorius and Gerhard
B. van Huyssteen
31 March 2009; Athens
Hendrik J Groenewald