Using Finite State Transducers for Making - Giellatekno

Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Using Finite State Transducers for Making
Efficient Reading Comprehension Dictionaries
Ryan Johnson, Trond Trosterud and Lene Antonsen,
Centre for Saami Language Technology
http://giellatekno.uit.no/
Presentation at NODALIDA 2013, May 22-24, Oslo, Norway
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Contents
Introduction
Morphologically Sensitive Dictionaries
Finite State Transducers
Evaluation
Conclusion
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Introduction
Introduction
I
I
Starting point: Enriching our ICALL learning environment with
a dictionary
Design requirement:
I
I
Handle morphology
No need for downloading and installation
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Introduction
Introduction
I
I
Starting point: Enriching our ICALL learning environment with
a dictionary
Design requirement:
I
I
I
Handle morphology
No need for downloading and installation
Result: dictionaries for
I
I
North and South Saami
and several other Uralic languages
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Introduction
Introduction
I
I
Starting point: Enriching our ICALL learning environment with
a dictionary
Design requirement:
I
I
I
Result: dictionaries for
I
I
I
Handle morphology
No need for downloading and installation
North and South Saami
and several other Uralic languages
... in a variety of ways
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Introduction
Introduction
I
I
Starting point: Enriching our ICALL learning environment with
a dictionary
Design requirement:
I
I
I
Result: dictionaries for
I
I
I
Handle morphology
No need for downloading and installation
North and South Saami
and several other Uralic languages
... in a variety of ways
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Introduction
Lemmas are rare in running text
Number of words
in the text
Number of lemmas
in the dictionary
Coverage
(Antonsen et al 2009)
North Saami
Finnish
Norwegian
252 461
45 144
64 994
99 071
7,9 %
94 111
10,0 %
38 983
30,5 %
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Finite State Transducers
The morphology is rich, and not always concatenative
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Finite State Transducers
The morphology is rich, and not always concatenative
I
Inflection:
I
I
Case: niidii → nieida girl, gåatan → gåetie house
Person-tense: eahtsa → iehtsedh love, bođii → boahtit come
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Finite State Transducers
The morphology is rich, and not always concatenative
I
Inflection:
I
I
I
Case: niidii → nieida girl, gåatan → gåetie house
Person-tense: eahtsa → iehtsedh love, bođii → boahtit come
Derivation:
I
I
Passivisation: juhkkojuvvui → juhkat drink, juohkit share
Superlative: nööremes → noere young
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Finite State Transducers
The morphology is rich, and not always concatenative
I
Inflection:
I
I
I
Derivation:
I
I
I
Case: niidii → nieida girl, gåatan → gåetie house
Person-tense: eahtsa → iehtsedh love, bođii → boahtit come
Passivisation: juhkkojuvvui → juhkat drink, juohkit share
Superlative: nööremes → noere young
Compounding:
I
I
bargojoavku → bargu + joavku work + group
bigkemegierkine → bigkedh + gierkie building + stone
I
In a 1.1 mill word North Saami corpus, 26.7 % of the nouns
are compounds, or 7.0 % of all words
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Finite State Transducers
The morphology is rich, and not always concatenative
I
Inflection:
I
I
I
Derivation:
I
I
I
Case: niidii → nieida girl, gåatan → gåetie house
Person-tense: eahtsa → iehtsedh love, bođii → boahtit come
Passivisation: juhkkojuvvui → juhkat drink, juohkit share
Superlative: nööremes → noere young
Compounding:
I
I
bargojoavku → bargu + joavku work + group
bigkemegierkine → bigkedh + gierkie building + stone
I
I
In a 1.1 mill word North Saami corpus, 26.7 % of the nouns
are compounds, or 7.0 % of all words
Cliticisation:
I
bođiige → boahtit + ge came + too
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Finite State Transducers
Finite state transducers (FST), compiled with lexc
girji+N:gir’ji GOAHTI ;
lávka+N:láv’ka GOAHTI ;
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Finite State Transducers
Morphophonological transducer –– twolc
girji+N:gir’ji GOAHTI ;
lávka+N:láv’ka GOAHTI ;
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Finite State Transducers
Two different approaches to using FSTs with dictionaries
I
generating a list of wordforms pointing to their respective
lemma articles
I
analysis via FST at runtime and using outcome for lookup in
dictionary
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Finite State Transducers
Wordform dictionaries (static)
I
Vuosttaš digisánit: 9999 lemma → 110 000 wordforms
I
Voestes digibaakoeh: 10 657 lemma → 85 000 wordforms
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Finite State Transducers
Dynamic FST dictionaries
FST-analysis and translation via a web service
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Finite State Transducers
The dictionary understands inflected forms
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Finite State Transducers
... and may generate word paradigms via FST
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Finite State Transducers
Necessary adjustments
I
Add lexicalised forms to the transducer:
gaskabeaivi ‘noon’ – gaskabeaivvit ‘dinner’
gávdni ‘guts’ – gávnnit ‘bedclothes’
→ gaskabeaivvit+N+Pl, gávnnit+N+Pl
I
Resolve homonymy in the analyser:
vuovdi / vuovddi vs. vuovdi / vuovdi ‘forest’ vs. ‘salesman’
beassi / beassi vs. beassi / beasi ‘birchbark’ vs. ‘nest’
→ vuovdi+N+NomAg, beassi+N+G3
I
Clean up the generated paradigms when there are variants:
fievrridin – fievrredin ‘transport’
→ fievrridit+v1+V, fievrridit+v2+V
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Finite State Transducers
I
http: // sanit. oahpa. no
I
http: // baakoeh. oahpa. no
I
I
I
I
I
I
I
Eastern Mari ↔ Finnish
Western Mari ↔ Finnish
http: // kyv. oahpa. no
I
I
Erzya ↔ English, Finnish, Russian
Mokša ↔ Finnish
http: // muter. oahpa. no
I
I
Livonian ↔ Finnish, Estonian, Latvian
Kven ↔ Norwegian
Ingrian ↔ Finnish
http: // valks. oahpa. no
I
I
South Saami ↔ Norwegian
http: // sanat. oahpa. no
I
I
North Saami ↔ Norwegian, Finnish
Komi ↔ English, Finnish
http: // vada. oahpa. no
I
Nenets ↔ English, Finnish
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Finite State Transducers
Neahttadigisánit – Implementation
I
Python with Flask framework (http://flask.pocoo.org)
I
Twitter Bootstrap
(http://twitter.github.io/bootstrap/)
I
JavaScript with jQuery (http://jquery.com/)
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Finite State Transducers
User interface (also works on a mobile phone)
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Finite State Transducers
Reading dictionary: read and alt-click
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Finite State Transducers
Eastern Mari
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Finite State Transducers
Komi
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Finite State Transducers
Kven
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Evaluation
Evaluation
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Evaluation
Comparing the dictionaries: North Saami
FST dictionary and wordform (WF) dictionary: 9999 lemmas.
(A lemma dictionary with 99 000 lemma covered 7,9 %)
Translation
WF dictionary:
all words
FST dictionary:
all words
WF dictionary:
different
words
FST dictionary:
different
words
36 231
81.9 %
39 920
90.2 %
7874
57.8 %
10 862
80.0 %
Partial
translation
549
448
1.2 %
3.3 %
No
translation
100 %
8025
18.1 %
44 256
3787
8.6 %
44 256
5758
42.2 %
13 632
2322
17.0 %
13 632
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Evaluation
Comparing the dictionaries: South Saami
FST dictionary and wordform dictionary: 10 657 lemmas
Translation
WF dictionary:
all words
FST dictionary:
all words
WF dictionary:
different
words
FST dictionary:
different
words
44 989
72.7 %
53 295
88.8 %
5015
41.4 %
8039
67.0 %
Partial
translation
475
308
0.8 %
2.6 %
No
translation
100 %
15 268
25.3 %
60 037
6266
10.4 %
60 037
7100
58.6 %
12 092
3660
30.5 %
12 007
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Evaluation
#SoMe dictionary (Social media)
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Evaluation
#SoMe dictionary (Social media)
Jua, livzze buorre jus nu ain birgese dan aigge, ahte eamidat maid
alo siiddas go manat leat unnit, nu ahte manat harjanit ju unnin
leahket doppe ja besset de oahppat. Muhto oluhat darbbasit ...
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Evaluation
#SoMe dictionary (Social media)
Jua, livzze buorre jus nu ain birgese dan aigge, ahte eamidat maid
alo siiddas go manat leat unnit, nu ahte manat harjanit ju unnin
leahket doppe ja besset de oahppat. Muhto oluhat darbbasit ...
Jua, livžže buorre jus nu ain birgeše dan áigge, ahte eamidat maid
alo siiddas go mánát leat unnit, nu ahte mánát hárjánit ju unnin
leahket doppe ja besset de oahppat. Muhto oluhat dárbbašit ...
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Evaluation
#SoMe dictionary (Social media)
Jua, livzze buorre jus nu ain birgese dan aigge, ahte eamidat maid
alo siiddas go manat leat unnit, nu ahte manat harjanit ju unnin
leahket doppe ja besset de oahppat. Muhto oluhat darbbasit ...
Jua, livžže buorre jus nu ain birgeše dan áigge, ahte eamidat maid
alo siiddas go mánát leat unnit, nu ahte mánát hárjánit ju unnin
leahket doppe ja besset de oahppat. Muhto oluhat dárbbašit ...
I
From the Facebook group Ártegis ságat: 20 % of which is written
without a North Saami keyboard, on either a desktop or mobile
device.
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Evaluation
#SoMe dictionary (Social media)
Jua, livzze buorre jus nu ain birgese dan aigge, ahte eamidat maid
alo siiddas go manat leat unnit, nu ahte manat harjanit ju unnin
leahket doppe ja besset de oahppat. Muhto oluhat darbbasit ...
Jua, livžže buorre jus nu ain birgeše dan áigge, ahte eamidat maid
alo siiddas go mánát leat unnit, nu ahte mánát hárjánit ju unnin
leahket doppe ja besset de oahppat. Muhto oluhat dárbbašit ...
I
From the Facebook group Ártegis ságat: 20 % of which is written
without a North Saami keyboard, on either a desktop or mobile
device.
Our answer:
A North Saami FST dictionary where the text is analysed with an
ordinary FST, and in addition with an FST adjusted to the #SoMe
writing convention.
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Evaluation
#SoMe-FST
If the word in the text contains c, then the intended word may also
contain č ...
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Evaluation
Cell phone
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Evaluation
#SoMe sátnegirji
Results with an ordinary FST and with the #SoMe FST
Translation
Ord. FST
all words:
#SoMe-FST
all words:
Ord. FST
different
words:
#SoMe FST
different
words:
Partial
translation
No
translation
100 %
50 263
76.3 %
250
0.4 %
15 364
23.3 %
65 877
54 197
80.6 %
315
0.5 %
12 753
19.0 %
67 265
8813
50.8 %
224
1.3 %
8326
48.0 %
17 363
10 596
59.8 %
286
1.6 %
6825
38.5 %
17 707
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Evaluation
Including Nynorsk in a Norwegian → Saami dictionary
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Evaluation
Why Nynorsk in a Norwegian → Saami dictionary?
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Evaluation
Why Nynorsk in a Norwegian → Saami dictionary?
Lov om stadnamn:
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Evaluation
Why Nynorsk in a Norwegian → Saami dictionary?
Lov om stadnamn:
Jon Todal: Samisk språk i Svahken sijte:
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Evaluation
Extended Norwegian dictionary
Bokmål - Saami FST dictionary tested on Nynorsk text (Wikipedia)
FST coverage
Bokmål
text
Nynorsk
text
All Bokmål
varieties
Conservative
Bokmål
All Bokmål
varieties
Bokmål with
Nynorsk
Dictionary coverage
703 950
93.36 %
1 644 286
84.50 %
2 191 428
79.34 %
3 504 733
66.96 %
1 849 654
82.56 %
3 206 796
69.77 %
1 101 116
89.62 %
2 530 995
76.14 %
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Evaluation
How did we extend the dictionary?
I
The starting point was a Bokmål - North Saami dictionary
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Evaluation
How did we extend the dictionary?
I
I
The starting point was a Bokmål - North Saami dictionary
An FST translates frequent Nynorsk words into Bokmål:
I
kva, korleis, kjem, sjølv, ikkje, truleg, betre, ...
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Evaluation
How did we extend the dictionary?
I
I
The starting point was a Bokmål - North Saami dictionary
An FST translates frequent Nynorsk words into Bokmål:
I
I
kva, korleis, kjem, sjølv, ikkje, truleg, betre, ...
The Bokmål FST was then enriched with Nynorsk morphology
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Evaluation
How did we extend the dictionary?
I
I
The starting point was a Bokmål - North Saami dictionary
An FST translates frequent Nynorsk words into Bokmål:
I
I
kva, korleis, kjem, sjølv, ikkje, truleg, betre, ...
The Bokmål FST was then enriched with Nynorsk morphology
LEXICON m1 ! dag, hane
+N+Msc+Sg: 0-en ;
+N+Msc+Pl: er-ene ;
+N+Msc+Pl+Nynorsk: ar-ane ;
+N: R ;
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Evaluation
One dictionary thus gives rise to many dictionaries
1. Ordinary dictionaries
I
If the word is lexicalised, then we show only the lexicalised
translation
2. Student dictionary
I
The dictionary also translates each part of compound words
3. #SoMe dictionary
I
#SoMe dictionary accepts also acdsz in place of Saami letters
áčđšž
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Evaluation
FSTs for other Saami languages
1. Pite Saami: rgg/rrg ↔ rg
2. Skolt and Inari Saami: Extensive letter variation
3. Kildin Saami: Non-standard Cyrillic letters + two competing
orthographies
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Evaluation
DinOrdbok (John Atle Sandbakken)
Engelsk*
Svensk*
Spansk*
Polsk
Fransk*
Nederlandsk*
Italiensk*
Finsk*
Nynorsk*
Russisk*
Portugisisk*
Dansk*
50562
48824
48698
20904
18846
18804
18587
18492
17402
17293
17210
15793
Japansk
Tysk*
Slovakisk
Tyrkisk*
Kinesisk
Bulgarsk*
Gresk
Katalansk*
Rumensk*
Samisk*
Islandsk*
Latinsk
14553
14171
13264
12915
12694
11974
11789
11111
9618
9481
7674
6864
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Conclusion
Conclusion 1
The online FST dictionary has many great benefits compared to the
offline wordform dictionary:
I
Uses existing FSTs (plus modifications) and XML lexica
I
FSTs handle both linguistic and orthographic variation
... and more morphology:
I
I
I
North Saami 57,8 % → 83.3 % , South S. 41.4 % → 69.6 %
The server-side setup is easy, can plug in new FSTs and lexica.
Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries
Conclusion
Conclusion 2
Many benefits to users, and easy to use.
I
No installation, just bookmark
I
New updates to lexicon and morphology arrive as they are
made
I
Works in all of the most popular desktop and mobile browsers
I
Allows easier access to written material in numerous languages
(also majority languages)