Charles University in Prague
Institute of Formal and Applied Linguistics (ÚFAL)

Creating a Bilingual Dictionary using Wikipedia

Angelina Ivanova
Advisor: RNDr. Daniel Zeman, Ph.D.
September 6, 2011

Overview
➢ Introduction
➢ Motivation
➢ Experiments
  ➢ Dictionary Development
  ➢ Named Entity Recognition and Classification
  ➢ Statistics in a Traditional Dictionary and in a Corpus
  ➢ Machine Translation Experiments
➢ Conclusions and Future Work

Introduction
Bilingual dictionaries are specialized dictionaries used to translate words or phrases from one language to another.

Introduction
Types of bilingual dictionaries
➔ unidirectional vs. bidirectional
➔ human-oriented vs. machine-oriented
➔ traditional vs. NLP
Usage of bilingual dictionaries
➔ in cross-language information retrieval
➔ in machine translation

Introduction
Methods of bilingual dictionary development
➔ manual
➔ automatic, from parallel corpora
➔ automatic, from comparable corpora
➔ automatic, from Wikipedia
➔ automatic, from existing dictionaries

Goals
➢ Develop dictionaries from the Wikipedia link structure
➢ Evaluate their quality and examine their content
➢ Apply the dictionaries to a statistical machine translation task

Motivation
➢ Bilingual dictionaries are important resources for NLP
➢ Manual development is expensive
➢ Internet resources provide terminology that is not present in traditional dictionaries

Features of Wikipedia
Wikipedia tools and resources
➔ Software: JWPL, MediaWiki
➔ Dumps: HTML, XML, SQL

Features of Wikipedia
Wikipedia structure
➔ Pages: entity pages, redirect pages, disambiguation pages
➔ Links: interlanguage links, category links, article links

Features of Wikipedia
Interlanguage links: links from an article to a presumably equivalent article in another language.
[[cs:Tygr džunglový]]
[[en:Tiger]]

Features of Wikipedia
Redirect pages: pages without an article body that provide
➢ equivalent names for an entity: Federation of Czechoslovakia -> Czechoslovakia
➢ spelling resolution: Microsoft’s -> Microsoft
➢ spelling correction: Wikipeda -> Wikipedia

Development of the Wiki-dictionary
[Diagram: redirect pages such as CSFR and Federation of Czechoslovakia point to the entity page Czechoslovakia; an interlanguage link connects it to the equivalent entity page in the other language, which has redirect pages of its own.]

Named Entity Recognition
Heuristics (Bunescu and Pasca, 2006): the title is considered a named entity if
● all content words of a multi-word expression are capitalized
● a one-word title contains at least two capital letters
● a one-word title is capitalized in at least 75% of its occurrences in positions other than the beginning of a sentence

Named Entity Recognition
Additional heuristics for named entity recognition (a sketch combining all of the heuristics follows this section):
● a one-word title contains at least one capital letter and at least one digit: 1P/Halley, 9P/Tempel
● a multi-word expression in which each word starts with a capital letter or a digit: 16 BC

Named Entity Recognition
Results of named entity recognition
● 88% named entities, 12% common words
● 74% accuracy on a sample of size 100

Named Entity Recognition
Common misclassification errors
False positives:
■ all content words of a multi-word special term are capitalized: Museum of Fine Arts, Cloning Extinct Species
■ phrases containing named entities: Pakistan at the 1948 Summer Olympics, Saturn Award for Best Make-up
■ common words containing two or more capital letters: WebConferencing

Named Entity Recognition
Common misclassification errors
False negatives:
■ short articles for one-word named entities: Muaná
■ pairs that occur only in the Russian Wikipedia
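A minimal sketch of how the recognition heuristics above could be combined, assuming pre-tokenized titles, a caller-supplied stop-word list, and a pre-computed ratio of capitalized non-sentence-initial occurrences for one-word titles; the function and parameter names are illustrative, not the implementation used in the thesis.

```python
def is_named_entity(title, non_initial_cap_ratio=None, stop_words=frozenset()):
    """Heuristic named-entity test for a Wikipedia page title.

    non_initial_cap_ratio: for one-word titles, the fraction of corpus
    occurrences that are capitalized outside sentence-initial position
    (None if the word was not observed in the corpus).
    stop_words: function words ignored by the content-word capitalization test.
    """
    words = title.split()

    if len(words) == 1:
        word = words[0]
        # One-word title with at least two capital letters, e.g. "WebConferencing".
        if sum(ch.isupper() for ch in word) >= 2:
            return True
        # One-word title mixing capitals and digits, e.g. "1P/Halley", "9P/Tempel".
        if any(ch.isupper() for ch in word) and any(ch.isdigit() for ch in word):
            return True
        # Capitalized in at least 75% of non-sentence-initial occurrences.
        return non_initial_cap_ratio is not None and non_initial_cap_ratio >= 0.75

    # All content words of a multi-word expression are capitalized,
    # e.g. "Museum of Fine Arts".
    content = [w for w in words if w.lower() not in stop_words]
    if content and all(w[:1].isupper() for w in content):
        return True
    # Every word starts with a capital letter or a digit, e.g. "16 BC".
    return all(w[:1].isupper() or w[:1].isdigit() for w in words)
```

For example, is_named_entity("16 BC") and is_named_entity("1P/Halley") return True while is_named_entity("tiger") returns False; the false positives above (e.g. "Cloning Extinct Species") show where such purely orthographic tests break down.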
Named Entity Classification
➔ Applied only to the titles of entity pages
➔ Types: PER, LOC, ORG, MISC
➔ Bootstrapping algorithm (Knopp, 2010)
➔ Final labels obtained by:
  ➔ intersection of the labels for the En and Ru titles
  ➔ translation of the labels for the En titles
  ➔ translation of the labels for the Ru titles
  ➔ corrections using heuristics based on the comments in brackets

Named Entity Classification
Evaluation of named entity classification (sample of 300 entries)

Method                            | # of correctly classified NE
Intersection of En and Ru NE      | 170
NE labels translated from En      | 180
NE labels translated from Ru      | 194

Sample data: English-Russian
[Sample dictionary entries shown in the presentation]

Sample data: English-German
[Sample dictionary entries shown in the presentation]

Pre-Processing
Dictionary pre-processing
■ Filtering: HTML, punctuation, stop words
■ Tokenization: tokenizer from the Europarl v6 Preprocessing Toolkit
■ Normalization: lemmatizer

Statistics in Mueller's Dictionary
Mueller's dictionary files
■ Abbreviations (2,204 entries)
■ Geographical names (1,282 entries)
■ Names (630 entries)
■ Base (50,695 entries)

Statistics in Mueller's Dictionary
A sample entry:
jug I [dʒʌg]
  1. _n. 1) кувшин ('pitcher') 2) _жарг. тюрьма (slang 'jail')
  2. _v. 1) _кул. тушить (зайца, кролика) (cooking 'to stew (hare, rabbit)') 2) _жарг. посадить в тюрьму (slang 'to put in jail')
jug II [dʒʌg]
  _n. щёлканье (соловья и т.п.) ('trilling (of a nightingale, etc.)')

Statistics in Mueller's Dictionary
Recall of the Wiki-dictionary on Mueller's dictionary:
● Geographical names: 82.18%
● Names: 75.88%
● Abbreviations: 22.64%
● Base: 7.42%

Statistics in Mueller's Dictionary
Reasons for the low recall of the Wiki-dictionary on Mueller's dictionary
➢ Additional details: Amazon River -> Амазонка vs. Amazon -> р. Амазонка
➢ Spelling variants: Accra vs. Akkra, Elbrus vs. Elbrus
➢ Predominance of noun phrases in the Wiki-dictionary

Corpus Statistics
Statistics of the Wiki-dictionary on the UMC corpus
➢ Approx. 28% of the training set contains 0 translation pairs from the Wiki-dictionaries
➢ Approx. 24.7% of the training set contains 1 translation pair from the Wiki-dictionaries

SMT Experiments
Machine Translation: Background
[Two background slides with figures shown in the presentation]

SMT Experiments
Evaluation of machine translation
■ Manual: fluency and adequacy; ranking
■ Automatic: BLEU

SMT Experiments
Limitations of BLEU
■ BLEU does not take into account that some words are more important than others
■ BLEU works only at the n-gram level and therefore cannot check overall grammatical coherence
■ BLEU compares surface word forms, which is a problem for inflectional languages
■ An absolute BLEU score is meaningless in itself; only comparisons between systems are informative

SMT Experiments
BLEU scores for the trained models (UMC test set)

Model                             | BLEU score
3-gram                            | 21.19
4-gram                            | 21.42
5-gram                            | 20.99
4-gram + additional data for LM   | 24.60
5-gram + additional data for LM   | 24.76
3-gram + Wiki-dict.               | 20.05
4-gram + Wiki-dict.               | 20.42
5-gram + Wiki-dict.               | 20.38

SMT Experiments
Paired bootstrap re-sampling (UMC set): 100 trial sets of 300 sentences each, drawn from the original test set of 1,000 sentences (a code sketch of the procedure follows the test set statistics below)

Model 1 | Model 2             | Statistical significance that Model 1 outperforms Model 2
3-gram  | 3-gram + Wiki-dict. | 98.5%
4-gram  | 4-gram + Wiki-dict. | 96%
5-gram  | 5-gram + Wiki-dict. | 87.1%

SMT Experiments
Data collection for the Wiki-set
■ conversion to plain text: PlainTextConverter from the Java Wikipedia API
■ sentence splitting: split-sentences.perl from the Europarl v6 Preprocessing Toolkit
■ tokenization: tokenizer.perl from the Europarl v6 Preprocessing Toolkit
■ manual post-processing

SMT Experiments
Test sets statistics
[Table shown in the presentation]
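As a companion to the paired bootstrap re-sampling results above, here is a minimal sketch of the procedure (Koehn, 2004) under the same regime of 100 trials of 300 sentences; the metric argument is assumed to be any corpus-level BLEU implementation, and all names are illustrative rather than taken from the experiments.

```python
import random

def paired_bootstrap(refs, hyps_a, hyps_b, metric,
                     num_trials=100, sample_size=300, seed=0):
    """Paired bootstrap re-sampling (Koehn, 2004).

    refs, hyps_a, hyps_b: parallel lists holding the reference translations
    and the outputs of the two systems on the same test set.
    metric: a corpus-level score, e.g. BLEU, called as metric(refs, hyps).
    Returns the fraction of trials in which system A beats system B,
    read as the confidence that A outperforms B.
    """
    n = len(refs)
    rng = random.Random(seed)
    wins = 0
    for _ in range(num_trials):
        # Draw a trial set of sentence indices, sampling with replacement.
        idx = [rng.randrange(n) for _ in range(sample_size)]
        sample_refs = [refs[i] for i in idx]
        score_a = metric(sample_refs, [hyps_a[i] for i in idx])
        score_b = metric(sample_refs, [hyps_b[i] for i in idx])
        if score_a > score_b:
            wins += 1
    return wins / num_trials
```

Under this reading, a returned value of 0.985 corresponds to the 98.5% confidence reported for the 3-gram model against the 3-gram model with the Wiki-dictionary.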
SMT Experiments
Out-of-vocabulary words statistics

Test set      | # of OOV without Wiki-dict. | # of OOV with Wiki-dict.
UMC test set  | 934                         | 906
Wiki test set | 2,260                       | 1,878

SMT Experiments
Manual ranking (Wiki-set)
Due to the high number of out-of-vocabulary words, we chose the sentence pairs by the following criteria:
■ both sentences have at most 2 out-of-vocabulary words
■ at least one of the sentences is longer than 124 characters
■ the sentences are not equal

SMT Experiments
Manual ranking

Sample                        | Model without Wiki-dict. ranked first | Model with Wiki-dict. ranked first | Translations equally bad/good
100 sentences from UMC test set | 55                                  | 37                                 | 8
100 sentences from Wiki-set     | 44                                  | 50                                 | 6

SMT Experiments
Translation with the model trained without the Wiki-dictionary:
"after the death of фредди меркьюри remaining members groups , using records his votes , could provoke in 1995 , the latest krylenko queen - made in heaven ."
Translation with the model trained with the Wiki-dictionary:
"after the death of freddie mercury remaining members of groups , through records his votes , managed to issue in 1995 , the last queen made in heaven ."
(фредди меркьюри is the Russian spelling of "Freddie Mercury", left untranslated by the model without the Wiki-dictionary.)

SMT Experiments
Experiment without comments in brackets (UMC set)
Results:
■ BLEU: 20.89 vs. 20.42
■ OOV: 889 (929) vs. 906 (945)

Conclusions
✔ Wiki-dictionaries and traditional dictionaries differ dramatically (7.42% recall on the base vocabulary; named entities; noun phrases)
✔ Wiki-dictionaries can cause a drop in accuracy in MT experiments (due to domain shift)
✔ The methods can be applied to any language pair present in Wikipedia (e.g., English-German)

Future Work
➢ Evaluation on a parallel corpus from another domain
➢ Connection of the dictionary to a morphological analyzer
➢ Factored machine translation
➢ Improvement of named entity recognition and classification
➢ The impact of the comments in brackets and additional information

Thank you for your attention!
Questions?