A General-Purpose Machine Learning Method for

A General-Purpose Machine Learning Method
for Tokenization and Sentence Boundary
Detection
A General-Purpose Machine Learning Method for
Tokenization and Sentence Boundary Detection
Valerio Basile
Johan Bos
Kilian Evang
University of Groningen
{v.basile,johan.bos,k.evang}@rug.nl
Computational Linguistics in the Netherlands 2013
http://gmb.let.rug.nl
1/21
A General-Purpose Machine Learning Method
for Tokenization and Sentence Boundary
Detection
Tokenization: a solved problem?
I Problem: tokenizers are often rule-based: hard to maintain,
hard to adapt to new domains, new languages
I Problem: word segmentation and sentence segmentation often
seen as separate tasks, but they inform each other
I Problem: most tokenization methods provide no alignment
between raw and tokenized text (Dridan and Oepen, 2012)
http://gmb.let.rug.nl
2/21
A General-Purpose Machine Learning Method
for Tokenization and Sentence Boundary
Detection
Research Questions
I Can we use machine learning to avoid hand-crafting rules?
I Can we use the same method across domains and languages?
I Can we combine word and sentence boundary detection into
one task?
http://gmb.let.rug.nl
3/21
A General-Purpose Machine Learning Method
for Tokenization and Sentence Boundary
Detection
Method: IOB Tagging
I widely used in sequence labeling tasks such as shallow parsing,
named-entity recognition
I we propose to use it for word and sentence boundary detection
I label each character in a text with one of four tags:
. I: inside a token
. O: outside a token
. B: two types
I T: beginning of a token
I S: beginning of the first token of a sentence
http://gmb.let.rug.nl
4/21
A General-Purpose Machine Learning Method
for Tokenization and Sentence Boundary
Detection
IOB Tagging: Example
It didn’t matter if the faces were male,
SIOTIITIIOTIIIIIOTIOTIIOTIIIIOTIIIOTIIITO
female or those of children. EightyTIIIIIOTIOTIIIIOTIOTIIIIIIITOSIIIIIIO
three percent of people in the 30-to-34
IIIIIOTIIIIIIOTIOTIIIIIOTIOTIIOTIIIIIIIO
year old age range gave correct responses.
TIIIOTIIOTIIOTIIIIOTIIIOTIIIIIIOTIIIIIIIIT
I Note: discontinuous tokens are possible (Eighty-three)
http://gmb.let.rug.nl
5/21
A General-Purpose Machine Learning Method
for Tokenization and Sentence Boundary
Detection
Acquiring Labeled Data: correcting a Rule-Based Tokenizer
http://gmb.let.rug.nl
6/21
A General-Purpose Machine Learning Method
for Tokenization and Sentence Boundary
Detection
Method: Training a Classifier
I We use Conditional Random Fields (CRF)
I State of the art in sequence labeling tasks
I Implementation: Wapiti (http://wapiti.limsi.fr)
http://gmb.let.rug.nl
7/21
A General-Purpose Machine Learning Method
for Tokenization and Sentence Boundary
Detection
Features Used for Learning
I current Unicode character
I label on previous character
I different kinds of contexts:
. either Unicode characters in the context
. or Unicode categories of these characters
I Unicode categories less in number (31), but also less
informative than characters
I context windows sizes: 0, 1, 2, 3, 4 to the right and left of
current character
http://gmb.let.rug.nl
8/21
A General-Purpose Machine Learning Method
for Tokenization and Sentence Boundary
Detection
Experiments
I Three datasets (different languages, different domains):
. Newswire English
. Newswire Dutch
. Biomedical English
http://gmb.let.rug.nl
9/21
A General-Purpose Machine Learning Method
for Tokenization and Sentence Boundary
Detection
Creating the Datasets
I (Newswire) English: Groningen Meaning Bank (manually
checked part)
. 458 documents, 2,886 sentences, 64,443 tokens
. already exists in IOB format
I Newswire Dutch: Twente News Corpus (subcorpus: two
days from January 2000)
. 13,389 documents, 49,537 sentences, 860,637 tokens
. inferred alignment between raw and tokenized text
I Biomedical English: Biocreative1
. 7,500 sentences, 195,998 tokens (sentences are isolated,
only word boundaries)
. inferred alignment between raw and tokenized text
http://gmb.let.rug.nl
10/21
A General-Purpose Machine Learning Method
for Tokenization and Sentence Boundary
Detection
Baseline Experiment
I Newswire English without context features
I Confusion matrix:
predicted label
I
T
gold label
I
T
O
S
21,163
26
0
4
45
5,316
0
141
O
S
0
0
5,226
0
0
53
0
123
I Main difficulty: distinguishing between T and S
http://gmb.let.rug.nl
11/21
A General-Purpose Machine Learning Method
for Tokenization and Sentence Boundary
Detection
How Much Context Is Needed?
100
●
●
●
100
●
●
context characters
context categories
98
97
96
●
80
F1 score for label S
99
Accuracy
●
●
●
90
●
70
60
●
●
50
context characters
context categories
40
30
20
10
95
0
0
1
2
Left&right context
3
4
0
1
2
3
4
Left&right context
I results shown for GMB (trained on 80%, tested on 10%
development set)
I performance almost constant after left&right window size 2
http://gmb.let.rug.nl
12/21
A General-Purpose Machine Learning Method
for Tokenization and Sentence Boundary
Detection
Characters or Categories?
100
●
●
●
100
●
●
context characters
context categories
98
97
96
●
80
F1 score for label S
99
Accuracy
●
●
●
90
●
70
60
●
●
50
context characters
context categories
40
30
20
10
95
0
0
1
2
Left&right context
3
4
0
1
2
3
4
Left&right context
I character features perform well, categories overfit
http://gmb.let.rug.nl
13/21
A General-Purpose Machine Learning Method
for Tokenization and Sentence Boundary
Detection
Applying the Method to Dutch
100
●
●
●
100
●
context characters
context categories
98
97
●
96
80
●
F1 score for label S
99
Accuracy
●
●
●
90
●
70
60
●
50
context characters
context categories
40
30
20
10
95
0
0
1
2
Left&right context
3
4
●
0
1
2
3
4
Left&right context
I results shown for TwNC (trained on 80%, tested on 10%
development set)
http://gmb.let.rug.nl
14/21
A General-Purpose Machine Learning Method
for Tokenization and Sentence Boundary
Detection
Applying the Method to Biomedical English
100
●
●
●
●
100
●
●
●
●
90
●
context characters
context categories
98
97
●
96
80
F1 score for label S
Accuracy
99
70
60
●
50
context characters
context categories
40
30
20
10
95
0
0
1
2
Left&right context
3
4
●
0
1
2
3
4
Left&right context
I results shown for Biocreative1 (trained on 80%, tested on
10% development set)
I in this corpus: sentences isolated, sentence boundary
detection trivial
http://gmb.let.rug.nl
15/21
A General-Purpose Machine Learning Method
for Tokenization and Sentence Boundary
Detection
What Kinds of Erros Does More Context Fix?
I examples from English newswire, 2-window vs. 4-window
character models
context:
gold:
2-window:
4-window:
er Iran to the U.N. Security Council, whi
IIOTIIIOTIOTIIOTIIIOTIIIIIIIOTIIIIIITOTII
IIOTIIIOTIOTIIOTIIIOSIIIIIIIOTIIIIIITOTII
IIOTIIIOTIOTIIOTIIIOTIIIIIIIOTIIIIIITOTII
context:
gold:
2-window:
4-window:
by Sunni voters. Shi'ite leaders have not
TIOTIIIIOTIIIIITOSIIIIIIOTIIIIIIOTIIIOTII
TIOTIIIIOTIIIIITOSIITIIIOTIIIIIIOTIIIOTII
TIOTIIIIOTIIIIITOSIIIIIIOTIIIIIIOTIIIOTII
http://gmb.let.rug.nl
16/21
A General-Purpose Machine Learning Method
for Tokenization and Sentence Boundary
Detection
Examples of Errors Still Made by the Best Model
I examples from English newswire, 4-window character model
I probable causes: too simple features, not enough training data
context: ive arms race it cannot win. Taiwan split
gold:
IIIOTIIIOTIIIOTIOTIITIIOTIITOSIIIIIOTIIII
4-window: IIIOTIIIOTIIIOTIOTIIIIIOTIITOSIIIIIOTIIII
context: ally paved with gold? Moses Bittok probab
gold:
IIIIOTIIIIOTIIIOTIIITOSIIIIOTIIIIIOTIIIII
4-window: IIIIOTIIIIOTIIIOTIIIIOTIIIIOTIIIIIOTIIIII
http://gmb.let.rug.nl
17/21
A General-Purpose Machine Learning Method
for Tokenization and Sentence Boundary
Detection
Is It Fast Enough?
I Tested on 4-core, 2.67 GHz desktop machine
I Training: around 1’30” for best model on 40,000 Dutch
sentences
I Labeling: around 3,000 sentences/second
http://gmb.let.rug.nl
18/21
A General-Purpose Machine Learning Method
for Tokenization and Sentence Boundary
Detection
Future Work
I Compare with existing rule-based tokenizers
I Compare with existing sentence-boundary detectors
I Can we build universal models (trained on mixed-language,
mixed-domain corpora)?
I Experiment with more complex features
I Software release
http://gmb.let.rug.nl
19/21
A General-Purpose Machine Learning Method
for Tokenization and Sentence Boundary
Detection
Conclusions
I Word and sentence segmentation can be recast as a combined
tagging task
I Supervised learning: shift of labor from writing rules to
correcting labels
I Learning this task with CRF achieves high speed and accuracy
I Our tagging method does not lose the connection between
original text and tokens
I Possible drawback of tagging method: no changes to original
text possible, e.g. normalization of punctuation etc.
http://gmb.let.rug.nl
20/21
A General-Purpose Machine Learning Method
for Tokenization and Sentence Boundary
Detection
References I
Dridan, R. and Oepen, S. (2012). Tokenization: Returning to a
long solved problem — a survey, contrastive experiment,
recommendations, and toolkit —. In Proceedings of the 50th
Annual Meeting of the Association for Computational Linguistics
(Volume 2: Short Papers), pages 378–382, Jeju Island, Korea.
Association for Computational Linguistics.
http://gmb.let.rug.nl
21/21