Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Ryan Johnson, Trond Trosterud and Lene Antonsen, Centre for Saami Language Technology http://giellatekno.uit.no/ Presentation at NODALIDA 2013, May 22-24, Oslo, Norway Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Contents Introduction Morphologically Sensitive Dictionaries Finite State Transducers Evaluation Conclusion Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Introduction Introduction I I Starting point: Enriching our ICALL learning environment with a dictionary Design requirement: I I Handle morphology No need for downloading and installation Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Introduction Introduction I I Starting point: Enriching our ICALL learning environment with a dictionary Design requirement: I I I Handle morphology No need for downloading and installation Result: dictionaries for I I North and South Saami and several other Uralic languages Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Introduction Introduction I I Starting point: Enriching our ICALL learning environment with a dictionary Design requirement: I I I Result: dictionaries for I I I Handle morphology No need for downloading and installation North and South Saami and several other Uralic languages ... in a variety of ways Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Introduction Introduction I I Starting point: Enriching our ICALL learning environment with a dictionary Design requirement: I I I Result: dictionaries for I I I Handle morphology No need for downloading and installation North and South Saami and several other Uralic languages ... in a variety of ways Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Introduction Lemmas are rare in running text Number of words in the text Number of lemmas in the dictionary Coverage (Antonsen et al 2009) North Saami Finnish Norwegian 252 461 45 144 64 994 99 071 7,9 % 94 111 10,0 % 38 983 30,5 % Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Finite State Transducers The morphology is rich, and not always concatenative Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Finite State Transducers The morphology is rich, and not always concatenative I Inflection: I I Case: niidii → nieida girl, gåatan → gåetie house Person-tense: eahtsa → iehtsedh love, bođii → boahtit come Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Finite State Transducers The morphology is rich, and not always concatenative I Inflection: I I I Case: niidii → nieida girl, gåatan → gåetie house Person-tense: eahtsa → iehtsedh love, bođii → boahtit come Derivation: I I Passivisation: juhkkojuvvui → juhkat drink, juohkit share Superlative: nööremes → noere young Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Finite State Transducers The morphology is rich, and not always concatenative I Inflection: I I I Derivation: I I I Case: niidii → nieida girl, gåatan → gåetie house Person-tense: eahtsa → iehtsedh love, bođii → boahtit come Passivisation: juhkkojuvvui → juhkat drink, juohkit share Superlative: nööremes → noere young Compounding: I I bargojoavku → bargu + joavku work + group bigkemegierkine → bigkedh + gierkie building + stone I In a 1.1 mill word North Saami corpus, 26.7 % of the nouns are compounds, or 7.0 % of all words Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Finite State Transducers The morphology is rich, and not always concatenative I Inflection: I I I Derivation: I I I Case: niidii → nieida girl, gåatan → gåetie house Person-tense: eahtsa → iehtsedh love, bođii → boahtit come Passivisation: juhkkojuvvui → juhkat drink, juohkit share Superlative: nööremes → noere young Compounding: I I bargojoavku → bargu + joavku work + group bigkemegierkine → bigkedh + gierkie building + stone I I In a 1.1 mill word North Saami corpus, 26.7 % of the nouns are compounds, or 7.0 % of all words Cliticisation: I bođiige → boahtit + ge came + too Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Finite State Transducers Finite state transducers (FST), compiled with lexc girji+N:gir’ji GOAHTI ; lávka+N:láv’ka GOAHTI ; Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Finite State Transducers Morphophonological transducer –– twolc girji+N:gir’ji GOAHTI ; lávka+N:láv’ka GOAHTI ; Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Finite State Transducers Two different approaches to using FSTs with dictionaries I generating a list of wordforms pointing to their respective lemma articles I analysis via FST at runtime and using outcome for lookup in dictionary Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Finite State Transducers Wordform dictionaries (static) I Vuosttaš digisánit: 9999 lemma → 110 000 wordforms I Voestes digibaakoeh: 10 657 lemma → 85 000 wordforms Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Finite State Transducers Dynamic FST dictionaries FST-analysis and translation via a web service Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Finite State Transducers The dictionary understands inflected forms Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Finite State Transducers ... and may generate word paradigms via FST Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Finite State Transducers Necessary adjustments I Add lexicalised forms to the transducer: gaskabeaivi ‘noon’ – gaskabeaivvit ‘dinner’ gávdni ‘guts’ – gávnnit ‘bedclothes’ → gaskabeaivvit+N+Pl, gávnnit+N+Pl I Resolve homonymy in the analyser: vuovdi / vuovddi vs. vuovdi / vuovdi ‘forest’ vs. ‘salesman’ beassi / beassi vs. beassi / beasi ‘birchbark’ vs. ‘nest’ → vuovdi+N+NomAg, beassi+N+G3 I Clean up the generated paradigms when there are variants: fievrridin – fievrredin ‘transport’ → fievrridit+v1+V, fievrridit+v2+V Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Finite State Transducers I http: // sanit. oahpa. no I http: // baakoeh. oahpa. no I I I I I I I Eastern Mari ↔ Finnish Western Mari ↔ Finnish http: // kyv. oahpa. no I I Erzya ↔ English, Finnish, Russian Mokša ↔ Finnish http: // muter. oahpa. no I I Livonian ↔ Finnish, Estonian, Latvian Kven ↔ Norwegian Ingrian ↔ Finnish http: // valks. oahpa. no I I South Saami ↔ Norwegian http: // sanat. oahpa. no I I North Saami ↔ Norwegian, Finnish Komi ↔ English, Finnish http: // vada. oahpa. no I Nenets ↔ English, Finnish Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Finite State Transducers Neahttadigisánit – Implementation I Python with Flask framework (http://flask.pocoo.org) I Twitter Bootstrap (http://twitter.github.io/bootstrap/) I JavaScript with jQuery (http://jquery.com/) Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Finite State Transducers User interface (also works on a mobile phone) Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Finite State Transducers Reading dictionary: read and alt-click Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Finite State Transducers Eastern Mari Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Finite State Transducers Komi Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Finite State Transducers Kven Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Evaluation Evaluation Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Evaluation Comparing the dictionaries: North Saami FST dictionary and wordform (WF) dictionary: 9999 lemmas. (A lemma dictionary with 99 000 lemma covered 7,9 %) Translation WF dictionary: all words FST dictionary: all words WF dictionary: different words FST dictionary: different words 36 231 81.9 % 39 920 90.2 % 7874 57.8 % 10 862 80.0 % Partial translation 549 448 1.2 % 3.3 % No translation 100 % 8025 18.1 % 44 256 3787 8.6 % 44 256 5758 42.2 % 13 632 2322 17.0 % 13 632 Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Evaluation Comparing the dictionaries: South Saami FST dictionary and wordform dictionary: 10 657 lemmas Translation WF dictionary: all words FST dictionary: all words WF dictionary: different words FST dictionary: different words 44 989 72.7 % 53 295 88.8 % 5015 41.4 % 8039 67.0 % Partial translation 475 308 0.8 % 2.6 % No translation 100 % 15 268 25.3 % 60 037 6266 10.4 % 60 037 7100 58.6 % 12 092 3660 30.5 % 12 007 Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Evaluation #SoMe dictionary (Social media) Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Evaluation #SoMe dictionary (Social media) Jua, livzze buorre jus nu ain birgese dan aigge, ahte eamidat maid alo siiddas go manat leat unnit, nu ahte manat harjanit ju unnin leahket doppe ja besset de oahppat. Muhto oluhat darbbasit ... Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Evaluation #SoMe dictionary (Social media) Jua, livzze buorre jus nu ain birgese dan aigge, ahte eamidat maid alo siiddas go manat leat unnit, nu ahte manat harjanit ju unnin leahket doppe ja besset de oahppat. Muhto oluhat darbbasit ... Jua, livžže buorre jus nu ain birgeše dan áigge, ahte eamidat maid alo siiddas go mánát leat unnit, nu ahte mánát hárjánit ju unnin leahket doppe ja besset de oahppat. Muhto oluhat dárbbašit ... Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Evaluation #SoMe dictionary (Social media) Jua, livzze buorre jus nu ain birgese dan aigge, ahte eamidat maid alo siiddas go manat leat unnit, nu ahte manat harjanit ju unnin leahket doppe ja besset de oahppat. Muhto oluhat darbbasit ... Jua, livžže buorre jus nu ain birgeše dan áigge, ahte eamidat maid alo siiddas go mánát leat unnit, nu ahte mánát hárjánit ju unnin leahket doppe ja besset de oahppat. Muhto oluhat dárbbašit ... I From the Facebook group Ártegis ságat: 20 % of which is written without a North Saami keyboard, on either a desktop or mobile device. Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Evaluation #SoMe dictionary (Social media) Jua, livzze buorre jus nu ain birgese dan aigge, ahte eamidat maid alo siiddas go manat leat unnit, nu ahte manat harjanit ju unnin leahket doppe ja besset de oahppat. Muhto oluhat darbbasit ... Jua, livžže buorre jus nu ain birgeše dan áigge, ahte eamidat maid alo siiddas go mánát leat unnit, nu ahte mánát hárjánit ju unnin leahket doppe ja besset de oahppat. Muhto oluhat dárbbašit ... I From the Facebook group Ártegis ságat: 20 % of which is written without a North Saami keyboard, on either a desktop or mobile device. Our answer: A North Saami FST dictionary where the text is analysed with an ordinary FST, and in addition with an FST adjusted to the #SoMe writing convention. Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Evaluation #SoMe-FST If the word in the text contains c, then the intended word may also contain č ... Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Evaluation Cell phone Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Evaluation #SoMe sátnegirji Results with an ordinary FST and with the #SoMe FST Translation Ord. FST all words: #SoMe-FST all words: Ord. FST different words: #SoMe FST different words: Partial translation No translation 100 % 50 263 76.3 % 250 0.4 % 15 364 23.3 % 65 877 54 197 80.6 % 315 0.5 % 12 753 19.0 % 67 265 8813 50.8 % 224 1.3 % 8326 48.0 % 17 363 10 596 59.8 % 286 1.6 % 6825 38.5 % 17 707 Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Evaluation Including Nynorsk in a Norwegian → Saami dictionary Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Evaluation Why Nynorsk in a Norwegian → Saami dictionary? Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Evaluation Why Nynorsk in a Norwegian → Saami dictionary? Lov om stadnamn: Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Evaluation Why Nynorsk in a Norwegian → Saami dictionary? Lov om stadnamn: Jon Todal: Samisk språk i Svahken sijte: Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Evaluation Extended Norwegian dictionary Bokmål - Saami FST dictionary tested on Nynorsk text (Wikipedia) FST coverage Bokmål text Nynorsk text All Bokmål varieties Conservative Bokmål All Bokmål varieties Bokmål with Nynorsk Dictionary coverage 703 950 93.36 % 1 644 286 84.50 % 2 191 428 79.34 % 3 504 733 66.96 % 1 849 654 82.56 % 3 206 796 69.77 % 1 101 116 89.62 % 2 530 995 76.14 % Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Evaluation How did we extend the dictionary? I The starting point was a Bokmål - North Saami dictionary Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Evaluation How did we extend the dictionary? I I The starting point was a Bokmål - North Saami dictionary An FST translates frequent Nynorsk words into Bokmål: I kva, korleis, kjem, sjølv, ikkje, truleg, betre, ... Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Evaluation How did we extend the dictionary? I I The starting point was a Bokmål - North Saami dictionary An FST translates frequent Nynorsk words into Bokmål: I I kva, korleis, kjem, sjølv, ikkje, truleg, betre, ... The Bokmål FST was then enriched with Nynorsk morphology Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Evaluation How did we extend the dictionary? I I The starting point was a Bokmål - North Saami dictionary An FST translates frequent Nynorsk words into Bokmål: I I kva, korleis, kjem, sjølv, ikkje, truleg, betre, ... The Bokmål FST was then enriched with Nynorsk morphology LEXICON m1 ! dag, hane +N+Msc+Sg: 0-en ; +N+Msc+Pl: er-ene ; +N+Msc+Pl+Nynorsk: ar-ane ; +N: R ; Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Evaluation One dictionary thus gives rise to many dictionaries 1. Ordinary dictionaries I If the word is lexicalised, then we show only the lexicalised translation 2. Student dictionary I The dictionary also translates each part of compound words 3. #SoMe dictionary I #SoMe dictionary accepts also acdsz in place of Saami letters áčđšž Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Evaluation FSTs for other Saami languages 1. Pite Saami: rgg/rrg ↔ rg 2. Skolt and Inari Saami: Extensive letter variation 3. Kildin Saami: Non-standard Cyrillic letters + two competing orthographies Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Evaluation DinOrdbok (John Atle Sandbakken) Engelsk* Svensk* Spansk* Polsk Fransk* Nederlandsk* Italiensk* Finsk* Nynorsk* Russisk* Portugisisk* Dansk* 50562 48824 48698 20904 18846 18804 18587 18492 17402 17293 17210 15793 Japansk Tysk* Slovakisk Tyrkisk* Kinesisk Bulgarsk* Gresk Katalansk* Rumensk* Samisk* Islandsk* Latinsk 14553 14171 13264 12915 12694 11974 11789 11111 9618 9481 7674 6864 Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Conclusion Conclusion 1 The online FST dictionary has many great benefits compared to the offline wordform dictionary: I Uses existing FSTs (plus modifications) and XML lexica I FSTs handle both linguistic and orthographic variation ... and more morphology: I I I North Saami 57,8 % → 83.3 % , South S. 41.4 % → 69.6 % The server-side setup is easy, can plug in new FSTs and lexica. Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries Conclusion Conclusion 2 Many benefits to users, and easy to use. I No installation, just bookmark I New updates to lexicon and morphology arrive as they are made I Works in all of the most popular desktop and mobile browsers I Allows easier access to written material in numerous languages (also majority languages)
© Copyright 2025 Paperzz