Creating a Bilingual Dictionary using Wikipedia
Angelina Ivanova
Charles University in Prague
Institute of Formal and Applied Linguistics (ÚFAL)
Advisor: RNDr. Daniel Zeman, Ph.D.
September 6, 2011
Overview
➢ Introduction
➢ Motivation
➢ Experiments
  ➢ Dictionary Development
  ➢ Named Entity Recognition and Classification
  ➢ Statistics in Traditional Dictionary and in Corpus
  ➢ Machine Translation Experiments
➢ Conclusions and Future Work
Introduction
Bilingual dictionaries are specialized dictionaries used to translate words or phrases from one language to another.
Introduction
Types of bilingual dictionaries
➔ unidirectional vs bidirectional
➔ human-oriented vs machine-oriented
➔ traditional vs NLP
Usage of bilingual dictionaries
➔ in cross-language information retrieval
➔ in machine translation
Introduction
Methods of bilingual dictionary development
➔ manual
➔ automatic from parallel corpora
➔ automatic from comparable corpora
➔ automatic from Wikipedia
➔ automatic from existing dictionaries
Goals
➢ Develop dictionaries from the Wikipedia link structure
➢ Evaluate the quality and examine the content
➢ Apply the dictionary to a statistical machine translation task
Motivation
➢ Bilingual dictionaries are important resources for NLP
➢ Manual development is expensive
➢ Internet resources provide terminology that is not present in traditional dictionaries
Features of Wikipedia
Wikipedia Tools and Resources
➢ Software: JWPL, MediaWiki
➢ Dumps: HTML, XML, SQL
Features of Wikipedia
Wikipedia Structure
➢ Pages: entity pages, redirect pages, disambiguation pages
➢ Links: interlanguage links, category links, article links
Features of Wikipedia
Interlanguage links - links from an article to a presumably equivalent article in another language.
[[cs:Tygr džunglový]]
[[en:Tiger]]
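To make the extraction step concrete, here is a minimal sketch (not the thesis code) of pulling interlanguage links out of raw wikitext with a regular expression; the 2-3 letter language-code assumption and the function name are mine.

```python
import re

# Matches interlanguage links such as [[cs:Tygr džunglový]] or [[en:Tiger]].
# Assumes 2-3 letter language codes; a real extractor must also filter out
# namespace links (e.g. [[File:...]]) or read the dump's link tables instead.
INTERLANG_RE = re.compile(r"\[\[([a-z]{2,3}):([^\]|]+)\]\]")

def interlanguage_links(wikitext):
    """Return a {language code: title} map for one page's wikitext."""
    return {lang: title.strip() for lang, title in INTERLANG_RE.findall(wikitext)}

print(interlanguage_links("[[cs:Tygr džunglový]] [[en:Tiger]]"))
# {'cs': 'Tygr džunglový', 'en': 'Tiger'}
```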
Features of Wikipedia
Redirect pages - pages without an article body that provide:
➢ equivalent names for an entity
  Federation of Czechoslovakia -> Czechoslovakia
➢ spelling resolution
  Microsoft’s -> Microsoft
➢ spelling correction
  Wikipeda -> Wikipedia
Development of Wiki-dictionary
[Diagram: an English entity page ("Czechoslovakia") with its redirect pages ("CSFR", "Federation of Czechoslovakia") is connected by an interlanguage link to the equivalent entity page in the other language, which has redirect pages of its own.]
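The diagram suggests how translation pairs can be assembled: every name of the source entity page (its title plus redirect titles) can be paired with every name of the interlanguage-linked target page. A minimal sketch with hypothetical toy data (the Russian redirect "ЧССР" is an assumed example):

```python
# Hypothetical toy data; the real dictionary is built from full dump tables.
en_redirects = {"Czechoslovakia": ["CSFR", "Federation of Czechoslovakia"]}
interlang = {"Czechoslovakia": "Чехословакия"}   # en title -> ru title
ru_redirects = {"Чехословакия": ["ЧССР"]}        # assumed example redirect

def translation_pairs(en_title):
    """Pair every English name of an entity with every Russian name."""
    en_names = [en_title] + en_redirects.get(en_title, [])
    ru_title = interlang[en_title]
    ru_names = [ru_title] + ru_redirects.get(ru_title, [])
    return [(e, r) for e in en_names for r in ru_names]

print(translation_pairs("Czechoslovakia"))  # 3 x 2 = 6 candidate pairs
```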
Named Entity Recognition
Heuristics (Bunescu and Pasca, 2006)
The title is considered a named entity if:
● all content words of a multi-word title are capitalized
● a one-word title contains at least two capital letters
● a one-word title appears capitalized in at least 75% of its occurrences at positions other than the beginning of a sentence
Named Entity Recognition
Additional heuristics for named entity recognition (see the sketch below):
● a one-word title contains at least one capital letter and at least one digit
  1P/Halley, 9P/Tempel
● a multi-word title in which each word starts with a capital letter or a digit
  16 BC
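A sketch of these title-shape heuristics, assuming a small illustrative stop-word list; the 75% corpus-frequency heuristic is omitted because it needs article text rather than titles alone:

```python
STOPWORDS = {"of", "the", "at", "for", "and", "in"}  # assumed sample list

def looks_like_named_entity(title):
    """Apply the capitalization/digit heuristics from the two slides above."""
    words = title.split()
    if len(words) == 1:
        w = words[0]
        if sum(c.isupper() for c in w) >= 2:                   # e.g. CSFR
            return True
        return any(c.isupper() for c in w) and any(c.isdigit() for c in w)  # e.g. 1P/Halley
    content = [w for w in words if w.lower() not in STOPWORDS]
    if content and all(w[:1].isupper() for w in content):      # e.g. Museum of Fine Arts
        return True
    return all(w[:1].isupper() or w[:1].isdigit() for w in words)  # e.g. 16 BC

for t in ["1P/Halley", "16 BC", "Museum of Fine Arts", "tiger"]:
    print(t, looks_like_named_entity(t))
```

As the false-positive slide below shows, exactly these shape rules misfire on capitalized common terms such as "Museum of Fine Arts".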
Named Entity Recognition
Results of the named entity recognition:
● 88% named entities
● 12% common words
● 74% accuracy on a sample of size 100
Named Entity Recognition
Common misclassification errors
False positives:
■ all content words of a multi-word special term are capitalized
  Museum of Fine Arts, Cloning Extinct Species
■ phrases that contain named entities
  Pakistan at the 1948 Summer Olympics, Saturn Award for Best Make-up
■ common words containing two or more capital letters
  WebConferencing
Named Entity Recognition
Common misclassification errors
False negatives:
■ short articles for one-word named entities
  Muaná
■ pairs that occur only in the Russian Wikipedia
Named Entity Classification
Named entity classification
➔ applied only to the titles of entity pages
➔ types: PER, LOC, ORG, MISC
➔ bootstrapping algorithm (Knopp, 2010)
➔ final labels obtained by:
  ➔ intersection of the labels for the En and Ru titles (sketched below)
  ➔ translations of labels for En titles
  ➔ translations of labels for Ru titles
  ➔ corrections using heuristics based on the comments in brackets
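The simplest of these combination strategies, label intersection, can be sketched as follows (hypothetical function name and example values; the PER/LOC/ORG/MISC tag set is from the slide):

```python
def intersect_labels(en_label, ru_label):
    """Keep a label only when the En and Ru classifiers agree; otherwise the
    entry stays undecided and the translation/heuristic strategies apply."""
    return en_label if en_label == ru_label else None

print(intersect_labels("LOC", "LOC"))  # 'LOC'
print(intersect_labels("PER", "ORG"))  # None
```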
Named Entity Classification
Evaluation of named entity classification (sample of 300 entries)

  Strategy                               # of correctly classified NE
  Intersection of En and Ru NE labels    170
  NE labels translated from En           180
  NE labels translated from Ru           194
Sample data: English-Russian
[sample dictionary entries not reproduced]
Sample data: English-German
[sample dictionary entries not reproduced]
Pre-Processing
Dictionary pre-processing
■ Filtering: HTML, punctuation, stop words (sketched below)
■ Tokenization: tokenizer from the Europarl v6 Preprocessing Toolkit
■ Normalization: lemmatizer
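A sketch of the filtering step under stated assumptions (illustrative stop-word list, regex-based HTML stripping); the actual experiments used the Europarl v6 tokenizer rather than the whitespace split shown here:

```python
import html
import re

STOPWORDS = {"the", "a", "an", "of"}  # assumed sample; the real list is larger

def preprocess(entry):
    """Strip HTML markup and punctuation, lowercase, tokenize, drop stop words."""
    text = html.unescape(re.sub(r"<[^>]+>", " ", entry))  # remove HTML tags
    text = re.sub(r"[^\w\s]", " ", text)                  # remove punctuation
    return [tok for tok in text.lower().split() if tok not in STOPWORDS]

print(preprocess("The <b>Amazon River</b>!"))  # ['amazon', 'river']
```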
Statistics in Mueller's dictionary
Mueller's Dictionary Files
■ Abbreviations (2,204 entries)
■ Geographical names (1,282 entries)
■ Names (630 entries)
■ Base (50,695 entries)
Statistics in Mueller's dictionary
jug
I [dʒʌg]
1. n.
  1) кувшин (a pitcher)
  2) (slang) тюрьма (jail)
2. v.
  1) (culinary) тушить (зайца, кролика) (to stew a hare or rabbit)
  2) (slang) посадить в тюрьму (to put in jail)
II [dʒʌg] n. щёлканье (соловья и т.п.) (the warbling of a nightingale etc.)
Statistics in Mueller's dictionary
Recall of the Wiki-dictionary on Mueller's dictionary:
● Geographical names: 82.18%
● Names: 75.88%
● Abbreviations: 22.64%
● Base: 7.42%
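For clarity, recall here is presumably computed per file as the fraction of Mueller's entries that are covered by the Wiki-dictionary:

\[ \text{recall} = \frac{|\text{Mueller entries also found in the Wiki-dictionary}|}{|\text{Mueller entries}|} \]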
Statistics in Mueller's dictionary
Reasons for the low recall of the Wiki-dictionary on Mueller's dictionary
➢ additional details
  Wiki-dictionary: Amazon River - Амазонка; Mueller: Amazon - р. Амазонка
➢ spelling differences
  Accra vs Akkra; Elbrus
➢ predominance of noun phrases in the Wiki-dictionary
Corpus Statistics
Statistics of the Wiki-dictionary on the UMC corpus
● approx. 28% of the training set contains no translation pair from the Wiki-dictionaries
● approx. 24.7% of the training set contains exactly 1 translation pair from the Wiki-dictionaries
SMT Experiments
➢ Machine Translation: Background
[two background slides; figures and equations not reproduced]
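As a hedged reminder of what such a background section standardly covers (not taken from the slides): phrase-based SMT selects the translation of a foreign sentence \(f\) via the noisy-channel model, combining a language model \(p(e)\) with a translation model \(p(f \mid e)\):

\[ \hat{e} = \arg\max_{e} \; p(e)\, p(f \mid e) \]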
SMT Experiments
➢ Evaluation of Machine Translation
■ Manual
  ● fluency and adequacy
  ● ranking
■ Automatic
  ● BLEU (standard definition recalled below)
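For reference, the standard BLEU definition (Papineni et al., 2002; not spelled out on the slide) combines modified n-gram precisions \(p_n\) with a brevity penalty:

\[ \mathrm{BLEU} = \mathrm{BP} \cdot \exp\Big( \sum_{n=1}^{N} w_n \log p_n \Big), \qquad \mathrm{BP} = \min\big(1,\ e^{\,1 - r/c}\big), \]

where typically \(N = 4\) and \(w_n = 1/N\), \(r\) is the reference length, and \(c\) is the candidate length.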
SMT Experiments
➢ Limitations of BLEU
■ BLEU does not take into account that some words are more important than others
■ BLEU works only on the n-gram level, so it cannot check overall grammatical coherence
■ BLEU compares surface word forms => a problem for inflective languages
■ an absolute BLEU score is meaningless on its own; only comparisons between systems are informative
SMT Experiments
BLEU Score for the Trained Models (UMC Test Set)

  Model                              BLEU score
  3-gram                             21.19
  4-gram                             21.42
  5-gram                             20.99
  4-gram + additional data for LM    24.60
  5-gram + additional data for LM    24.76
  3-gram + Wiki-dict.                20.05
  4-gram + Wiki-dict.                20.42
  5-gram + Wiki-dict.                20.38
SMT Experiments
➢ Paired Bootstrap Re-sampling (UMC set)
100 trial sets of 300 sentences each, drawn from the original test set of 1,000 sentences

  Model 1    Model 2                Statistical significance that Model 1 outperforms Model 2
  3-gram     3-gram + Wiki-dict.    98.5%
  4-gram     4-gram + Wiki-dict.    96%
  5-gram     5-gram + Wiki-dict.    87.1%
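A sketch of paired bootstrap re-sampling in the spirit of Koehn (2004), under a simplifying assumption: it compares sums of per-sentence scores, whereas the real experiment would re-compute corpus BLEU on each re-sampled set. All names are mine:

```python
import random

def paired_bootstrap(scores1, scores2, trials=100, sample_size=300, seed=0):
    """Fraction of re-sampled test sets on which system 1 beats system 2.
    scores1/scores2 are per-sentence quality scores aligned by sentence."""
    rng = random.Random(seed)
    n = len(scores1)
    wins = 0
    for _ in range(trials):
        idx = [rng.randrange(n) for _ in range(sample_size)]
        if sum(scores1[i] for i in idx) > sum(scores2[i] for i in idx):
            wins += 1
    return wins / trials
```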
SMT Experiments
➢ Data Collection for the Wiki-set
■ conversion to plain text
  ● PlainTextConverter from the Java Wikipedia API
■ sentence splitting
  ● split-sentences.perl from the Europarl v6 Preprocessing Toolkit
■ tokenization
  ● tokenizer.perl from the Europarl v6 Preprocessing Toolkit
■ manual post-processing
SMT Experiments
Test Sets Statistics
[table not reproduced]
SMT Experiments
Out-of-Vocabulary Words Statistics

                           w/o Wiki-dict.   with Wiki-dict.
  # of OOV, UMC test set   934              906
  # of OOV, Wiki test set  2,260            1,878
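The statistic itself is straightforward to reproduce; a hypothetical helper (names are mine) counting test tokens that never occur in the training data:

```python
def oov_count(test_tokens, train_vocab):
    """Number of test-set tokens absent from the training vocabulary."""
    return sum(1 for tok in test_tokens if tok not in train_vocab)

print(oov_count(["queen", "меркьюри"], {"queen", "made", "in", "heaven"}))  # 1
```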
SMT Experiments
➢ Manual Ranking (Wiki-set)
Due to the high number of out-of-vocabulary words, we chose the sentence pairs by the following criteria:
■ both sentences have at most 2 out-of-vocabulary words
■ at least one of the sentences is longer than 124 characters
■ the sentences are not equal
SMT Experiments
Manual Ranking

  Sample (100 sentences)   Model w/o Wiki-dict.   Model with Wiki-dict.   Translations are
                           ranked first           ranked first            equally bad/good
  from UMC test set        55                     37                      8
  from Wiki-set            44                     50                      6
SMT Experiments
➢ Translation with the model trained without the Wiki-dictionary:
  after the death of фредди меркьюри remaining members groups , using records his votes , could provoke in 1995 , the latest krylenko queen - made in heaven .
➢ Translation with the model trained with the Wiki-dictionary:
  after the death of freddie mercury remaining members of groups , through records his votes , managed to issue in 1995 , the last queen made in heaven .
("фредди меркьюри" is "Freddie Mercury", left untranslated by the baseline model.)
SMT Experiments
➢ Experiment without comments in brackets (UMC set)
Results:
■ BLEU: 20.89 vs 20.42
■ OOV: 889 (929) vs 906 (945)
Conclusions
✔ Wiki-dictionaries and traditional dictionaries differ dramatically (7.42% recall on the base vocabulary; dominance of named entities and noun phrases)
✔ Wiki-dictionaries can cause a drop of accuracy in the MT experiment (due to domain shift)
✔ The methods can be applied to any language pair present in Wikipedia (e.g. English-German)
Future Work
➢ evaluation on a parallel corpus from another domain
➢ connecting the dictionary to a morphological analyzer
➢ factored machine translation
➢ improvement of named entity recognition and classification
➢ the impact of the comments in brackets and additional information
Thank you for your attention!
Questions?