Linguistics in FAST ESP 5.0

Two-Level Morphology
STEFAN LANGER, CIS – UNIVERSITÄT MÜNCHEN
SUMMER SEMESTER 2016
VERSION 1.0
Overview
Inflectional morphology: approaches in language processing
Two-level morphology
◦ Lexicon and morphosyntax
◦ Two-level rules
Morphological approaches - example
bakeries -> bakery
Full-form lexicon (1 component)
bakery.[bakery]
bakeries.[bakery]
Grammar + stem lexicon with variants (2 components)
bakery.[Stem1,bakery]
bakerie.[Stem2,bakery]
Word -> Stem2 + s
Grammar + stem lexicon + phonological rules
bakery.[Stem]
Word -> Stem + s
y becomes ie before ‘s(PLU)’
Approaches - consequences
hyperbakeries -> hyperbakery
Full-form approach:
does not work
Stem-variant lexicon + rules
hyper.[Prefix,hyper]
bakery.[Stem1,bakery]
bakerie.[Stem2,bakery]
Word -> Prefix + Stem2 + 's'
Two-level morphology
bakery.[Stem]
hyper.[Prefix]
Word -> Prefix + Stem + 's'
y becomes ie before 's'
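The three approaches can be compared in a small Python sketch. The dictionaries mirror the slide entries; the function names and the exact lookup logic are illustrative assumptions, not part of any real system.

```python
# Toy data mirroring the slide entries (illustrative only).
FULL_FORMS = {"bakery": "bakery", "bakeries": "bakery"}
STEM_VARIANTS = {"bakery": ("Stem1", "bakery"), "bakerie": ("Stem2", "bakery")}
STEMS = {"bakery"}
PREFIXES = {"hyper"}


def split_prefix(word):
    """Return (prefix, rest); prefix is '' if no known prefix matches."""
    for prefix in PREFIXES:
        if word.startswith(prefix):
            return prefix, word[len(prefix):]
    return "", word


def full_form_lookup(word):
    """Approach 1: every word form must be listed explicitly."""
    return FULL_FORMS.get(word)


def stem_variant_lookup(word):
    """Approach 2: Word -> Prefix? + Stem1, or Prefix? + Stem2 + 's'."""
    prefix, rest = split_prefix(word)
    entry = STEM_VARIANTS.get(rest)
    if entry and entry[0] == "Stem1":
        return prefix + entry[1]
    if rest.endswith("s"):
        entry = STEM_VARIANTS.get(rest[:-1])
        if entry and entry[0] == "Stem2":
            return prefix + entry[1]
    return None


def rule_based_lookup(word):
    """Approach 3: one stem per word plus the rule 'y becomes ie before s'."""
    prefix, rest = split_prefix(word)
    if rest in STEMS:
        return prefix + rest
    if rest.endswith("ies") and rest[:-3] + "y" in STEMS:
        return prefix + rest[:-3] + "y"  # undo y -> ie
    if rest.endswith("s") and rest[:-1] in STEMS and not rest[:-1].endswith("y"):
        return prefix + rest[:-1]
    return None


print(full_form_lookup("hyperbakeries"))     # not listed, so lookup fails
print(stem_variant_lookup("hyperbakeries"))
print(rule_based_lookup("hyperbakeries"))
```

Only the full-form lexicon needs a new entry for every prefixed form; the other two approaches cover hyperbakeries once the prefix is in the lexicon.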
Motivation I
Example Finnish – number of word forms. Similar situation in Turkish, Hungarian and other languages
lapsi|N
[lasten,lapseni,lapsemmekin,lapsellasi,lapsestasi,lapsenkin,lapsista,lastani,
lapsetkaan,lapsellamme,lapsiinne,lapsiltani,lapsenaan,lapsistamme,lapsellemme,
lapsianikaan,lapsiakin,lastanne,lapsillesi,lapsillahan,lapsinaan,lapsennekin,
lapsillenikin,lapsella,lapselle,lastenne,lapsetko,lapseenkin,lapsillehan,lapsillenne,
lapsillaan,lastesi,lapsistaan,lapsineen,lapsenne,lapsilla,lapselta,lapsille,
lapsellensa,lapsellekaan,lapsihan,lapsiani,lapsilleen,lapsilta,lapsen,
lastaan,lapsenakaan,lapsillakaan,lapset,lapsellani,lastakin,lapsiltaan,lapsestani,
lapsien,lapsillakin,lapsiini,lapsethan,lapsillekaan,lapsiamme,lapsineni,
lapsi,lapsillekin,lapsellanikin,lapsensakin,lapsiemme,lapsissaan,lapsilleni,
lapsestamme,lapsiaankin,lapsiakaan,lapsiesi,lapsikin,lapsiltakaan,lapsina,lapsillesikin,
lapsiltakin,lapsiimme,lapsellesi,lapsellanne,lapsissakin,lapseensa,
lastakaan,lasteni,lapsiansa,lapsilleenkin,lapsiaan,lastensakin,lapsessani,lapsistahan,
lapsillasi,lapsistasi,lapsetkin,lapsistanne,lapsellenne,lastamme,lapsellaan,
lapsiensa,lastenhan,lapsestaan,lapsillamme,lastenkaan,lapsesi,lapsenakin,
lastemme,lastasi,lasta,lapsiinsa,lapsillemme,lapselleen,lapsemme,lapsiltasi,
lapsillenikaan,lapseenkaan,lapseltaan,lapseksi,lapsellakin,lapsiaankaan,lapseeni,
lapsinensa,lastansa,lapsia,lapsekseen,lapsienkin,lapsiltamme,lapsenikin,
lapsessa,lapseen,lapsissamme,lapsistakin,lapsiksi,lapsellekin,lapsieni,lapsistaankin,
lastenko,lastensa,lapsikaan,lastenkin,lapsillensa,lapsessaan,lapsiinkin,
lapsensa,lapselleni,lapsissa,lapsiin,lapsiahan,lapsianne,lapsillani,lapsistani,
lapsesta,lapsiasi,lapsellako,lapsena,lapsestahan,lapsienne]
Motivation II
Finnish – proper names. Proper names are also inflected in many other
languages (Polish, Russian ...)
porsche|N
[porschelta,porschella,porschessa,porschen,porscheksi,
porschelle,porscheen,porschesta,porschea,porsche]
Motivation III
Arabic, Hebrew
- Arabic and Hebrew attach conjunctions, pronouns and articles to words. This leads to a very high number of different tokens which contain the same content word with any combination of affixes.
  والقمر
  wa-al-qamar
  and-the-moon
- Such tokens cannot be caught by simple dictionary lookup (dictionaries only have limited coverage).
- Verbal clitics in Romance languages (e.g. Spanish) are a similar phenomenon.
Two level morphology: history
Two-level morphology is a model from the 1980s
◦ First presented by Kimmo Koskenniemi in 1983
Freely available implementation from the early 1990s (PC-KIMMO)
Used in some commercial systems (Lingsoft)
Two-level morphology: components
1. Dictionary with affixes and stems
2. Regular grammar integrated into the lexicon
3. Two-level component
hyperbakeries
↕
hyper+`bakery+s
Morphotactics: Classes
english.lex
ALTERNATION Particle     AUX AUX-V PP CJ PP-CJ DT PR DT-PR IJ
ALTERNATION Prefix       PREFIX
ALTERNATION Root         N AJ V AV N-V N-AJ AJ-V AJ-AV CD OD
ALTERNATION Suffix       SUFFIX
ALTERNATION Infl         INFL          ;inflection
ALTERNATION PN_Suffix    PN_SUFF       ;proper nouns
ALTERNATION Y_Suffix     Y_SUFF        ;-y suffix
ALTERNATION IC_Suffix    IC_SUFF       ;-ic suffix
ALTERNATION PT_Suffix    PT_SUFF       ;participles
ALTERNATION Clitic       GEN CNTR End
ALTERNATION Contraction  CNTR End      ;contractions
ALTERNATION CD           CD OD ORDR    ;cardinals and ordinals
ALTERNATION Compound     INITIAL       ;compounds
ALTERNATION End          End
Extract from lexicon (affix.lex)
;LEXICON INITIAL
\lf 0
\lx INITIAL
\alt Particle
\gl1
\gl2
\lf 0
\lx INITIAL
\alt Prefix
\gl1
\gl2
Extract from lexicon (affix.lex)
;LEXICON PREFIX
\lf 0
\lx PREFIX
\alt Root
\gl1
\gl2
\lf hyper+
\lx PREFIX
\alt Prefix
\gl1 DEG3+
\gl2 DEG3+
Extract from lexicon (noun.lex)
\lf `bakery
\lx N
\alt Suffix
\fea deverb
\gl1 `bake
\gl2
\lf `balcony
\lx N
\alt Suffix
\gl1
\gl2
Field annotations:
\lf: stem with stress
\lx: stem category
\alt: continuation
\fea: additional morphological information
Extract from lexicon (affix.lex)
\lf 0
\lx SUFFIX
\alt Infl
\gl1
\gl2
;noun plural
\lf +s
\lx INFL
\alt Clitic
\fea n/n pl reg
\gl1 +PL
\gl2 +PL
Extract from lexicon (affix.lex)
;LEXICON End
;to disable compound parsing, comment out the next entry
\lf
\lx End
\alt Compound
\fea compound
\gl1
\gl2
\lf 0
\lx End
\alt #
\gl1
\gl2
Morphotactic rule component & lexicon
Summary:
- The dictionary contains affixes, stems and possibly some additional information (boundaries, stress)
- A simple regular grammar is integrated into the dictionary
- It operates on sequences containing morphotactic information
- It verifies that sequences such as the following are well formed:
hyper+`bakery+s
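How the continuation classes license such a sequence can be sketched in Python. The tables below copy only the entries shown in the extracts above; the compound entry of the End lexicon is left out (compound parsing disabled), and the recursive `accepts` function is a hypothetical illustration, not the PC-KIMMO algorithm itself.

```python
# Alternation name -> sublexicons that may follow (subset of english.lex).
ALTERNATIONS = {
    "Particle": ["AUX", "AUX-V", "PP", "CJ", "PP-CJ", "DT", "PR", "DT-PR", "IJ"],
    "Prefix": ["PREFIX"],
    "Root": ["N", "AJ", "V", "AV", "N-V", "N-AJ", "AJ-V", "AJ-AV", "CD", "OD"],
    "Suffix": ["SUFFIX"],
    "Infl": ["INFL"],
    "Clitic": ["GEN", "CNTR", "End"],
}

# Sublexicon -> list of (lexical form, alternation); "0" is the null form.
# Only the entries from the slides are listed, so most sublexicons are empty.
LEXICON = {
    "INITIAL": [("0", "Particle"), ("0", "Prefix")],
    "PREFIX": [("0", "Root"), ("hyper+", "Prefix")],
    "N": [("`bakery", "Suffix"), ("`balcony", "Suffix")],
    "SUFFIX": [("0", "Infl")],
    "INFL": [("+s", "Clitic")],
    "End": [("0", "#")],  # "#" marks the end of the word
}


def accepts(form, sublexicon="INITIAL"):
    """Check whether `form` can be consumed by following continuation classes."""
    for lexical_form, alternation in LEXICON.get(sublexicon, []):
        piece = "" if lexical_form == "0" else lexical_form
        if not form.startswith(piece):
            continue
        rest = form[len(piece):]
        if alternation == "#":
            if rest == "":
                return True
            continue
        if any(accepts(rest, nxt) for nxt in ALTERNATIONS[alternation]):
            return True
    return False


print(accepts("hyper+`bakery+s"))
print(accepts("+s`bakery"))
```

The walk for hyper+`bakery+s is INITIAL, PREFIX (hyper+), PREFIX (null), N (`bakery), SUFFIX (null), INFL (+s), End.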
Next step:
How do we get from surface text to the sequences analysed/generated by the word grammar and the lexicon?
hyperbakeries
↕ ?
hyper+`bakery+s
Two level rules - Alphabet
Alphabet and character classes
ALPHABET
;lexical (upper) and surface (lower) characters:
b c d f g h j k l m n p q r s t v w x y z a e i o u ' - .
sh ch ;digraphs
B C D F G H J K L M N P Q R S T V W X Y Z A E I O U
;lexical (upper) only characters:
` +
NULL 0
ANY @
BOUNDARY #
SUBSET CN     b c d f g h j k l m n p q r s t v w x y z sh ch
SUBSET CNsib  s x z sh ch             ;sibilant consonants
SUBSET CNpal  c g                     ;palatal consonants
SUBSET CNgem  b d f g l m n p r s t   ;geminated consonants
SUBSET VO     a e i o u
SUBSET VObk   a o u                   ;back vowels
Two level rules – default mappings
default mappings
RULE "Defaults 1" 1 33
   b c d f g h j k l m n p q r s t v w x y z sh ch a e i o u ' - ` + @
   b c d f g h j k l m n p q r s t v w x y z sh ch a e i o u ' - 0 0 @
1: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1  1  1 1 1 1 1 1 1 1 1 1
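The default rule pairs every character with itself and maps the lexical-only characters ` and + to the null symbol 0. A minimal Python sketch of what the defaults alone produce (the function names are illustrative assumptions):

```python
# Lexical-only characters from the ALPHABET section: stress mark and
# morpheme boundary. Under the default rule both surface as null ("0").
LEXICAL_ONLY = {"`", "+"}


def default_pairs(lexical):
    """Pair each lexical character with its default surface character."""
    return [(ch, "0" if ch in LEXICAL_ONLY else ch) for ch in lexical]


def default_surface(lexical):
    """Spell out the surface string, dropping null symbols."""
    return "".join(surf for _, surf in default_pairs(lexical) if surf != "0")


print(default_surface("hyper+`bakery+s"))  # prints "hyperbakerys"
```

The defaults alone yield hyperbakerys rather than hyperbakeries, which is why further two-level rules are needed for spelling changes.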
Two level rules – types of rules
=>    the correspondence only occurs in the environment
<=    the correspondence always occurs in the environment
<=>   the correspondence always and only occurs in the environment
/<=   the correspondence never occurs in the environment
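The four operators can be read as logical conditions on every occurrence of the lexical character. In the sketch below each occurrence is reduced to two booleans, `realized` (the surface character is the one named in the rule) and `in_env` (the environment matches). This encoding and the function name are assumptions made for illustration.

```python
def satisfies(operator, occurrences):
    """Check a list of (realized, in_env) observations against a rule type."""
    checks = {
        # only in the environment: realized implies in_env
        "=>": lambda realized, in_env: in_env or not realized,
        # always in the environment: in_env implies realized
        "<=": lambda realized, in_env: realized or not in_env,
        # always and only: both directions
        "<=>": lambda realized, in_env: realized == in_env,
        # never in the environment
        "/<=": lambda realized, in_env: not (realized and in_env),
    }
    check = checks[operator]
    return all(check(realized, in_env) for realized, in_env in occurrences)


# y stays y although the environment matches (as in "spys"):
print(satisfies("=>", [(False, True)]))   # allowed by =>
print(satisfies("<=", [(False, True)]))   # violates <=
```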
Two level rules - examples
Two level rule (always and only)
;==========
;Epenthesis
;==========
; LR: fox+s kiss+s church+s spy+s
; SR: foxes kisses churches spies
RULE "+:e <=> [CNsib | y:i | o] ___ s [+:@ | #]" 7 8
    +  CNsib  +  s  #  y  o  @
    e  CNsib  @  s  #  i  o  @
1:  0    2    1  2  1  2  7  1
2:  3    2    5  2  1  2  7  1
3.  0    0    0  4  0  0  0  0
4.  0    0    1  0  1  0  0  0
5:  0    1    1  6  1  1  1  1
6:  0    1    0  1  0  1  1  1
7:  3    2    1  2  1  2  7  1
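The state table can be run by hand or with a few lines of Python. The sketch below copies the table above and checks aligned lexical/surface strings of equal length, with 0 as the surface null. The column-selection heuristic (literal characters beat subsets, subsets beat @) is a simplified stand-in for PC-KIMMO's specificity ordering, and the digraphs sh and ch are left out.

```python
CNSIB = {"s", "x", "z"}  # sibilants; digraphs sh/ch omitted in this sketch

# Column headers as (lexical, surface) pairs, in the order of the table.
HEADERS = [("+", "e"), ("CNsib", "CNsib"), ("+", "@"), ("s", "s"),
           ("#", "#"), ("y", "i"), ("o", "o"), ("@", "@")]

# State -> next state per column; 0 means the automaton fails.
TABLE = {
    1: [0, 2, 1, 2, 1, 2, 7, 1],
    2: [3, 2, 5, 2, 1, 2, 7, 1],
    3: [0, 0, 0, 4, 0, 0, 0, 0],
    4: [0, 0, 1, 0, 1, 0, 0, 0],
    5: [0, 1, 1, 6, 1, 1, 1, 1],
    6: [0, 1, 0, 1, 0, 1, 1, 1],
    7: [3, 2, 1, 2, 1, 2, 7, 1],
}
FINAL = {1, 2, 5, 6, 7}  # rows marked with ":" in the table


def side_matches(symbol, header):
    if header == "@":
        return True
    if header == "CNsib":
        return symbol in CNSIB
    return symbol == header


def specificity(pair):
    # Assumed ranking: literal = 2, subset = 1, wildcard = 0 per side.
    return sum(0 if h == "@" else 1 if h == "CNsib" else 2 for h in pair)


def column_for(lex, surf):
    candidates = [i for i, (hl, hs) in enumerate(HEADERS)
                  if side_matches(lex, hl) and side_matches(surf, hs)]
    return max(candidates, key=lambda i: specificity(HEADERS[i]))


def accepts(lexical, surface):
    """Run the epenthesis automaton over an aligned pair of strings."""
    if len(lexical) != len(surface):
        raise ValueError("lexical and surface strings must be aligned")
    state = 1
    for lex, surf in zip(lexical, surface):
        state = TABLE[state][column_for(lex, surf)]
        if state == 0:
            return False
    return state in FINAL


print(accepts("fox+s#", "foxes#"))  # e inserted after a sibilant
print(accepts("fox+s#", "fox0s#"))  # missing e
print(accepts("cat+s#", "cat0s#"))  # no epenthesis needed
```

For fox+s:foxes the automaton passes through states 1, 7, 2, 3, 4 and returns to the final state 1 at the word boundary.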
Two level rules - examples
Two level rule (always)
RULE "y:i <= @:CN ___ +:@ ~[i | ']" 4 7
    @   y  y  +  i  '  @
    CN  i  @  @  i  '  @
1:  2   1  1  1  1  1  1
2:  2   1  3  2  1  1  1
3:  2   1  1  4  1  1  1
4:  0   0  0  0  1  1  0
Two level rules - examples
Two level rule (only)
RULE "y:i => @:CN ___ +:@ ~[i | ']" 4 6
    @   y  +  i  '  @
    CN  i  @  i  '  @
1:  2   0  1  1  1  1
2:  2   3  2  1  1  1
3.  0   0  4  0  0  0
4.  2   1  1  0  0  1
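The difference between the <= and the => version of the y:i rule shows up when both tables are run on the same input. The sketch below copies the two tables from the slides; the column-selection heuristic (literal beats subset beats @) is a simplification assumed for this example, word boundaries are not modelled, and the digraphs are left out of CN.

```python
CN = set("bcdfghjklmnpqrstvwxyz")  # consonants; digraphs sh/ch omitted

ALWAYS = {  # "y:i <= @:CN ___ +:@ ~[i | ']"
    "headers": [("@", "CN"), ("y", "i"), ("y", "@"), ("+", "@"),
                ("i", "i"), ("'", "'"), ("@", "@")],
    "rows": {1: [2, 1, 1, 1, 1, 1, 1], 2: [2, 1, 3, 2, 1, 1, 1],
             3: [2, 1, 1, 4, 1, 1, 1], 4: [0, 0, 0, 0, 1, 1, 0]},
    "final": {1, 2, 3, 4},
}
ONLY = {  # "y:i => @:CN ___ +:@ ~[i | ']"
    "headers": [("@", "CN"), ("y", "i"), ("+", "@"),
                ("i", "i"), ("'", "'"), ("@", "@")],
    "rows": {1: [2, 0, 1, 1, 1, 1], 2: [2, 3, 2, 1, 1, 1],
             3: [0, 0, 4, 0, 0, 0], 4: [2, 1, 1, 0, 0, 1]},
    "final": {1, 2},
}


def side_matches(symbol, header):
    if header == "@":
        return True
    if header == "CN":
        return symbol in CN
    return symbol == header


def specificity(pair):
    # Assumed ranking: literal = 2, subset = 1, wildcard = 0 per side.
    return sum(0 if h == "@" else 1 if h == "CN" else 2 for h in pair)


def accepts(rule, lexical, surface):
    """Run one rule automaton over aligned strings ('0' = surface null)."""
    state = 1
    for lex, surf in zip(lexical, surface):
        headers = rule["headers"]
        candidates = [i for i, (hl, hs) in enumerate(headers)
                      if side_matches(lex, hl) and side_matches(surf, hs)]
        column = max(candidates, key=lambda i: specificity(headers[i]))
        state = rule["rows"][state][column]
        if state == 0:
            return False
    return state in rule["final"]


for lexical, surface in [("spy+s", "spies"), ("spy+s", "spy0s"),
                         ("spy+ing", "spi0ing")]:
    print(lexical, surface,
          accepts(ALWAYS, lexical, surface), accepts(ONLY, lexical, surface))
```

The <= table rejects spy+s:spys (the change is obligatory in the environment) but lets spy+ing:spiing through, while the => table does the opposite. Only spy+s:spies passes both, which is the behaviour a <=> rule would require.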
Summary – two-level rules
- Rules describe systematic correspondences between surface forms and the forms generated by the lexicon and the grammar
- They are compiled into transducers