A General-Purpose Machine Learning Method for Tokenization and Sentence Boundary Detection A General-Purpose Machine Learning Method for Tokenization and Sentence Boundary Detection Valerio Basile Johan Bos Kilian Evang University of Groningen {v.basile,johan.bos,k.evang}@rug.nl Computational Linguistics in the Netherlands 2013 http://gmb.let.rug.nl 1/21 A General-Purpose Machine Learning Method for Tokenization and Sentence Boundary Detection Tokenization: a solved problem? I Problem: tokenizers are often rule-based: hard to maintain, hard to adapt to new domains, new languages I Problem: word segmentation and sentence segmentation often seen as separate tasks, but they inform each other I Problem: most tokenization methods provide no alignment between raw and tokenized text (Dridan and Oepen, 2012) http://gmb.let.rug.nl 2/21 A General-Purpose Machine Learning Method for Tokenization and Sentence Boundary Detection Research Questions I Can we use machine learning to avoid hand-crafting rules? I Can we use the same method across domains and languages? I Can we combine word and sentence boundary detection into one task? http://gmb.let.rug.nl 3/21 A General-Purpose Machine Learning Method for Tokenization and Sentence Boundary Detection Method: IOB Tagging I widely used in sequence labeling tasks such as shallow parsing, named-entity recognition I we propose to use it for word and sentence boundary detection I label each character in a text with one of four tags: . I: inside a token . O: outside a token . B: two types I T: beginning of a token I S: beginning of the first token of a sentence http://gmb.let.rug.nl 4/21 A General-Purpose Machine Learning Method for Tokenization and Sentence Boundary Detection IOB Tagging: Example It didn’t matter if the faces were male, SIOTIITIIOTIIIIIOTIOTIIOTIIIIOTIIIOTIIITO female or those of children. EightyTIIIIIOTIOTIIIIOTIOTIIIIIIITOSIIIIIIO three percent of people in the 30-to-34 IIIIIOTIIIIIIOTIOTIIIIIOTIOTIIOTIIIIIIIO year old age range gave correct responses. TIIIOTIIOTIIOTIIIIOTIIIOTIIIIIIOTIIIIIIIIT I Note: discontinuous tokens are possible (Eighty-three) http://gmb.let.rug.nl 5/21 A General-Purpose Machine Learning Method for Tokenization and Sentence Boundary Detection Acquiring Labeled Data: correcting a Rule-Based Tokenizer http://gmb.let.rug.nl 6/21 A General-Purpose Machine Learning Method for Tokenization and Sentence Boundary Detection Method: Training a Classifier I We use Conditional Random Fields (CRF) I State of the art in sequence labeling tasks I Implementation: Wapiti (http://wapiti.limsi.fr) http://gmb.let.rug.nl 7/21 A General-Purpose Machine Learning Method for Tokenization and Sentence Boundary Detection Features Used for Learning I current Unicode character I label on previous character I different kinds of contexts: . either Unicode characters in the context . or Unicode categories of these characters I Unicode categories less in number (31), but also less informative than characters I context windows sizes: 0, 1, 2, 3, 4 to the right and left of current character http://gmb.let.rug.nl 8/21 A General-Purpose Machine Learning Method for Tokenization and Sentence Boundary Detection Experiments I Three datasets (different languages, different domains): . Newswire English . Newswire Dutch . Biomedical English http://gmb.let.rug.nl 9/21 A General-Purpose Machine Learning Method for Tokenization and Sentence Boundary Detection Creating the Datasets I (Newswire) English: Groningen Meaning Bank (manually checked part) . 458 documents, 2,886 sentences, 64,443 tokens . already exists in IOB format I Newswire Dutch: Twente News Corpus (subcorpus: two days from January 2000) . 13,389 documents, 49,537 sentences, 860,637 tokens . inferred alignment between raw and tokenized text I Biomedical English: Biocreative1 . 7,500 sentences, 195,998 tokens (sentences are isolated, only word boundaries) . inferred alignment between raw and tokenized text http://gmb.let.rug.nl 10/21 A General-Purpose Machine Learning Method for Tokenization and Sentence Boundary Detection Baseline Experiment I Newswire English without context features I Confusion matrix: predicted label I T gold label I T O S 21,163 26 0 4 45 5,316 0 141 O S 0 0 5,226 0 0 53 0 123 I Main difficulty: distinguishing between T and S http://gmb.let.rug.nl 11/21 A General-Purpose Machine Learning Method for Tokenization and Sentence Boundary Detection How Much Context Is Needed? 100 ● ● ● 100 ● ● context characters context categories 98 97 96 ● 80 F1 score for label S 99 Accuracy ● ● ● 90 ● 70 60 ● ● 50 context characters context categories 40 30 20 10 95 0 0 1 2 Left&right context 3 4 0 1 2 3 4 Left&right context I results shown for GMB (trained on 80%, tested on 10% development set) I performance almost constant after left&right window size 2 http://gmb.let.rug.nl 12/21 A General-Purpose Machine Learning Method for Tokenization and Sentence Boundary Detection Characters or Categories? 100 ● ● ● 100 ● ● context characters context categories 98 97 96 ● 80 F1 score for label S 99 Accuracy ● ● ● 90 ● 70 60 ● ● 50 context characters context categories 40 30 20 10 95 0 0 1 2 Left&right context 3 4 0 1 2 3 4 Left&right context I character features perform well, categories overfit http://gmb.let.rug.nl 13/21 A General-Purpose Machine Learning Method for Tokenization and Sentence Boundary Detection Applying the Method to Dutch 100 ● ● ● 100 ● context characters context categories 98 97 ● 96 80 ● F1 score for label S 99 Accuracy ● ● ● 90 ● 70 60 ● 50 context characters context categories 40 30 20 10 95 0 0 1 2 Left&right context 3 4 ● 0 1 2 3 4 Left&right context I results shown for TwNC (trained on 80%, tested on 10% development set) http://gmb.let.rug.nl 14/21 A General-Purpose Machine Learning Method for Tokenization and Sentence Boundary Detection Applying the Method to Biomedical English 100 ● ● ● ● 100 ● ● ● ● 90 ● context characters context categories 98 97 ● 96 80 F1 score for label S Accuracy 99 70 60 ● 50 context characters context categories 40 30 20 10 95 0 0 1 2 Left&right context 3 4 ● 0 1 2 3 4 Left&right context I results shown for Biocreative1 (trained on 80%, tested on 10% development set) I in this corpus: sentences isolated, sentence boundary detection trivial http://gmb.let.rug.nl 15/21 A General-Purpose Machine Learning Method for Tokenization and Sentence Boundary Detection What Kinds of Erros Does More Context Fix? I examples from English newswire, 2-window vs. 4-window character models context: gold: 2-window: 4-window: er Iran to the U.N. Security Council, whi IIOTIIIOTIOTIIOTIIIOTIIIIIIIOTIIIIIITOTII IIOTIIIOTIOTIIOTIIIOSIIIIIIIOTIIIIIITOTII IIOTIIIOTIOTIIOTIIIOTIIIIIIIOTIIIIIITOTII context: gold: 2-window: 4-window: by Sunni voters. Shi'ite leaders have not TIOTIIIIOTIIIIITOSIIIIIIOTIIIIIIOTIIIOTII TIOTIIIIOTIIIIITOSIITIIIOTIIIIIIOTIIIOTII TIOTIIIIOTIIIIITOSIIIIIIOTIIIIIIOTIIIOTII http://gmb.let.rug.nl 16/21 A General-Purpose Machine Learning Method for Tokenization and Sentence Boundary Detection Examples of Errors Still Made by the Best Model I examples from English newswire, 4-window character model I probable causes: too simple features, not enough training data context: ive arms race it cannot win. Taiwan split gold: IIIOTIIIOTIIIOTIOTIITIIOTIITOSIIIIIOTIIII 4-window: IIIOTIIIOTIIIOTIOTIIIIIOTIITOSIIIIIOTIIII context: ally paved with gold? Moses Bittok probab gold: IIIIOTIIIIOTIIIOTIIITOSIIIIOTIIIIIOTIIIII 4-window: IIIIOTIIIIOTIIIOTIIIIOTIIIIOTIIIIIOTIIIII http://gmb.let.rug.nl 17/21 A General-Purpose Machine Learning Method for Tokenization and Sentence Boundary Detection Is It Fast Enough? I Tested on 4-core, 2.67 GHz desktop machine I Training: around 1’30” for best model on 40,000 Dutch sentences I Labeling: around 3,000 sentences/second http://gmb.let.rug.nl 18/21 A General-Purpose Machine Learning Method for Tokenization and Sentence Boundary Detection Future Work I Compare with existing rule-based tokenizers I Compare with existing sentence-boundary detectors I Can we build universal models (trained on mixed-language, mixed-domain corpora)? I Experiment with more complex features I Software release http://gmb.let.rug.nl 19/21 A General-Purpose Machine Learning Method for Tokenization and Sentence Boundary Detection Conclusions I Word and sentence segmentation can be recast as a combined tagging task I Supervised learning: shift of labor from writing rules to correcting labels I Learning this task with CRF achieves high speed and accuracy I Our tagging method does not lose the connection between original text and tokens I Possible drawback of tagging method: no changes to original text possible, e.g. normalization of punctuation etc. http://gmb.let.rug.nl 20/21 A General-Purpose Machine Learning Method for Tokenization and Sentence Boundary Detection References I Dridan, R. and Oepen, S. (2012). Tokenization: Returning to a long solved problem — a survey, contrastive experiment, recommendations, and toolkit —. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 378–382, Jeju Island, Korea. Association for Computational Linguistics. http://gmb.let.rug.nl 21/21
© Copyright 2025 Paperzz