Natural Language Processing
>> Morphology <<
winter / fall 2011/2012
41.4268
Prof. Dr. Bettina Harriehausen-Mühlbauer
Univ. of Applied Science, Darmstadt, Germany
https://www.fbi.h-da.de/organisation/personen/harriehausen-muehlbauer-bettina.html
content
1 morphemes
2 compounds / concatenation
3 idiomatic phrases
4 multiple word entries (MWE)
5 spell aid
6 regular expressions
7 Finite State Automata (FSA)
WS 2011/2012
- Natural Language Systems Harriehausen
definition
Morphemes
morpheme = smallest possible item in a language that
carries meaning
• lexeme (man, house, dog,...)
• inflectional affixes (dog-s, want-ed,...)
• other affixes (prefixes, infixes, suffixes): un-wanted, a-typical,
anti-pathetic,...
esp. in technical language (-itis = "inflammation", gastro- =
"stomach" ... gastroenteritis)
morphemes
free morphemes : can stand alone, carry lexical and morphological
meaning (e.g. house = singular, neuter, nominative ;
case/number/gender)
bound morphemes : form a legal wordform only in combination with
another morpheme; they cannot stand alone, but still carry lexical or
morphological meaning.
Various combinations exist:
bound + free: e.g. un-happy,
all bound: e.g. gastro-enter-itis
morphemes
inflectional morphemes : create wordforms of the same lexeme and carry
morphological (grammatical) meaning (e.g. dogs, laughed, going)
derivational morphemes : create new words (lexemes), often changing the
word class (happily, intellectually, instruction,
instructor, insulator, the pounding, limpness, blindness...)
Question: which string (~morpheme) do we include in
our dictionary ?
• full form dictionary vs.
• base form dictionary (lemmas)
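The two dictionary designs can be contrasted in a minimal Python sketch. Everything in it is an illustrative assumption rather than part of the lecture: the names FULL_FORMS, LEMMAS, SUFFIX_RULES and both lookup functions are hypothetical, and the suffix rules cover only regular forms such as dog-s, laugh-ed, go-ing.

```python
# Full-form dictionary: every inflected word form is its own entry.
FULL_FORMS = {
    "dog": ("dog", "noun", "singular"),
    "dogs": ("dog", "noun", "plural"),
    "laugh": ("laugh", "verb", "base form"),
    "laughed": ("laugh", "verb", "past"),
}

# Base-form dictionary: lemmas only, plus a few regular suffix rules.
LEMMAS = {"dog": "noun", "laugh": "verb", "go": "verb"}
SUFFIX_RULES = [
    ("ing", "verb", "progressive"),
    ("ed", "verb", "past"),
    ("s", "noun", "plural"),
    ("s", "verb", "3rd person singular"),
]


def lookup_full_form(word):
    """Direct lookup: no analysis needed, but one entry per word form."""
    return FULL_FORMS.get(word)


def lookup_base_form(word):
    """Strip a known suffix and check whether the remaining stem is a lemma."""
    analyses = []
    if word in LEMMAS:
        analyses.append((word, LEMMAS[word], "base form"))
    for suffix, pos, feature in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # The rule only applies if the stem has the right word class.
            if LEMMAS.get(stem) == pos:
                analyses.append((stem, pos, feature))
    return analyses


print(lookup_full_form("dogs"))
print(lookup_base_form("going"))
```

The full-form dictionary answers by direct lookup but grows with every word form; the base-form dictionary stays small but needs rules, and irregular forms (e.g. went) would need extra handling that this sketch leaves out.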
compounds / concatenation
Definition: a compound is a lexeme that consists of more than one stem.
Compounding or composition is the word formation that creates
compound lexemes (= compounds).
There is no clear upper limit on the number of roots allowed in English
compounds. It usually doesn't exceed 3 morphemes, but it is clearly a
stylistic issue.
Some compounds are written as one word: blackbird.
Some are written with hyphens: mother-in-law.
Most are written as separate words: smoke screen.
Typically not spelling, but stress and word-internal sound rules distinguish
compounds from non-compounds: compare white house with White House.
Question: What do we put into our dictionary ?
compounds / concatenation
Compounding follows rules,
e.g. in the naming of chemical compounds (http://www.chem.qmul.ac.uk/iupac/).
Substitutive nomenclature
This naming method generally follows established IUPAC organic
nomenclature. E.g.:
Hydrides of the main group elements (groups 13–17) are given -ane base
names, e.g. borane (BH3), oxidane (H2O), phosphane (PH3) .
The compound PCl3 would be named substitutively as
trichlorophosphane.
Additive nomenclature
This naming method has been developed principally for coordination
compounds. An example of its application is:
[CoCl(NH3)5]Cl2 pentaamminechloridocobalt(III) chloride
Example of a chemical compound
Components of Phane Parent Names
bicyclo[8.6.0]hexadecaphane
• The prefix "bicyclo" indicates that there are two rings
(bi-cyclo).
• The bridge descriptor describes the ring structure in
terms of a sixteen-membered main ring [8 + 6 + 2
(the bridgehead nodes)] with a bridge consisting of a
bond, i.e., zero nodes, which divides the main ring
into an eight-membered and a ten-membered ring.
• The numerical term "hexadeca" denotes the
presence of sixteen skeletal nodes.
• The term "phane" indicates that at least one node
represents a multiatomic (cyclic) structural unit.
[http://www.chem.qmul.ac.uk/iupac/phane/PhI2.html]
Example of a medical compound
Medical compounds are usually composed of a prefix + root +
suffix, where none of the components can be used stand-alone.
nephritis:        inflammation of the kidney
supra-renal:      situated above the kidneys
nephrologist:     a kidney doctor
gastroenteritis:  inflammation of stomach and intestines
Morphemes:
nephr- / ren-  = kidney (2 roots: Greek νεφρός nephr(os), Latin ren(es))
gastr-         = stomach, belly (ancient Greek γαστήρ (gastēr), γαστρ-)
-o-            = linking 2 body parts (linguistically)
enter-         = intestine (ancient Greek ἔντερον (énteron))
-itis          = inflammation
supra-         = above
-ologist       = person studying a certain body part
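The decomposition of such medical terms can be sketched as a lookup over a morpheme inventory. This is a minimal illustration under assumptions: the inventory MORPHEMES holds only the entries from the slide above, and the function names segment and gloss are hypothetical.

```python
# Morpheme inventory taken from the slide; keys are written without hyphens.
MORPHEMES = {
    "nephr": "kidney",
    "ren": "kidney",
    "gastr": "stomach, belly",
    "o": "linking element",
    "enter": "intestine",
    "itis": "inflammation",
    "supra": "above",
    "ologist": "person studying a certain body part",
}


def segment(word, inventory):
    """Split word into known morphemes, trying longer prefixes first.

    Returns a list of morphemes, or None if no complete split exists.
    """
    if not word:
        return []
    for end in range(len(word), 0, -1):
        head = word[:end]
        if head in inventory:
            rest = segment(word[end:], inventory)
            if rest is not None:
                return [head] + rest
    return None


def gloss(word, inventory=MORPHEMES):
    """Pair each morpheme of the word with its meaning."""
    parts = segment(word, inventory)
    if parts is None:
        return None
    return [(part, inventory[part]) for part in parts]


for term in ("nephritis", "nephrologist", "gastroenteritis"):
    print(term, gloss(term))
```

Because none of the components is a stand-alone word, the inventory has to list bound morphemes rather than dictionary words.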
compounds / concatenation
formation of compounds: synthesis and agglutination
Compound formation rules vary widely across language types.
Examples of formation processes (usually linked to the language type):
• synthesis (typically with synthetic languages, i.e. languages with a
high morpheme-per-word ratio): e.g.
German:
Kapitänspatent = Kapitän (sea captain) + Patent (license) joined by an
-s- (originally a genitive case suffix);
"patent of a sea captain"
Latin:
paterfamilias = pater (father) + familias (genitive of the lexeme
familia (family)); "father of a family"
compounds / concatenation
formation of compounds:
It can get more difficult: (German -> English)
Aufsichtsratsmitgliederversammlung =>
Auf = on
sicht + s = view + linking -s ("Fugen-s")
rat + s = council + "genitive -s"
mit = with
glied + er = link + "plural"
ver = "completion"
samml (stem of sammeln) = collect
ung = "noun" (nominalizing suffix)
On-view-council-with-link-collect ???
= "meeting of members of the supervisory board"
Notice:
"with" and "link" form a derivation that is the German word for "member" (Mitglied);
"completion", "collect" and "noun" form a derivation that means "meeting" (Versammlung)
compounds / concatenation
formation of compounds: synthesis and agglutination
• agglutination (usually with agglutinative languages, which tend to
create very long words with derivational morphemes), e.g.
German:
Farbfernsehgerät = color television set
Funkfernbedienung = radio remote control
Donaudampfschifffahrtsgesellschaftskapitänsmütze = Danube steamboat
shipping company captain's hat
Finnish:
hätä-uloskäytävä = emergency exit
Lentokone-suihku-turbiini-moottori-apu-mekaanikko-aliupseeri-oppilas
= airplane jet turbine engine auxiliary
mechanic non-commissioned officer student
Swedish:
rörelseuppskattningssökintervallsinställningar = motion estimation search
range settings
Samples for long compounds in German
• die Armbrust
• die Mehrzweckhalle
• das Mehrzweckkirschentkerngerät
• die Gemeindegrundsteuerveranlagung
• die Nummernschildbedruckungsmaschine
• der Mehrkornroggenvollkornbrotmehlzulieferer
• der Schifffahrtskapitänsmützenmaterialhersteller
• die Verkehrsinfrastrukturfinanzierungsgesellschaft
• die Feuerwehrrettungshubschraubernotlandeplatzaufseherin
• der Oberpostdirektionsbriefmarkenstempelautomatenmechaniker
• das Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz
• die Donaudampfschifffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft
Wolkenkratzer 'skyscraper': wolken 'clouds', + kratzer 'scraper'
Eisenbahn 'railway': Eisen 'iron', + bahn 'track'
Kraftfahrzeug 'automobile': Kraft 'power', + fahren/fahr 'drive', + zeug 'machinery'
Stacheldraht 'barbed wire': stachel 'barb/barbed', + draht 'wire'
Rinderkennzeichnungs- und Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz:
literally, Cattle marking and beef labeling supervision duties delegation law
Samples for long compounds in different languages
(see: http://en.wikipedia.org/wiki/Compound_%28linguistics%29)
Chinese (Cantonese Jyutping):
學生 'student': 學 learn + 生 grow
太空 'universe': 太 great + 空 emptiness
摩天樓 'skyscraper': 摩 touch + 天 sky + 樓 building (with more than 1 storey)
打印機 'printer': 打 strike + 印 stamp/print + 機 machine
百科全書 'encyclopaedia': 百 100 + 科 (branch of) study + 全 entire/complete +
書 book
Dutch:
Arbeidsongeschiktheidsverzekering 'disability insurance': arbeid 'labour', +
ongeschiktheid 'inaptitude', + verzekering 'insurance'.
Rioolwaterzuiveringsinstallatie 'wastewater treatment plant': riool 'sewer', +
water 'water', + zuivering 'cleaning', + installatie 'installation'.
Verjaardagskalender 'birthday calendar': verjaardag 'birthday', + kalender
'calendar'.
Klantenservicemedewerker 'customer service representative': klanten 'customers',
+ service 'service', + medewerker 'worker'.
Universiteitsbibliotheek 'university library': universiteit 'university', + bibliotheek
'library'.
Doorgroeimogelijkheden 'possibilities for advancement': door 'through', + groei
'grow', + mogelijkheden 'possibilities'.
Samples for long compounds in different languages
Finnish:
sanakirja 'dictionary': sana 'word', + kirja 'book'
tietokone 'computer': tieto 'knowledge, data', + kone 'machine'
keskiviikko 'Wednesday': keski 'middle', + viikko 'week'
maailma 'world': maa 'land', + ilma 'air'
rautatieasema 'railway station': rauta 'iron' + tie 'road' + asema 'station'
suihkuturbiiniapumekaanikkoaliupseerioppilas: 'Jet engine assistant mechanic NCO
student'
atomiydinenergiareaktorigeneraattorilauhduttajaturbiiniratasvaihde: some part of a
nuclear plant
Korean:
안팎 anpak 'inside and outside': 안 an 'inside' + 밖 bak 'outside'
Spanish:
Ciempiés 'centipede': cien 'hundred', + pies 'feet'
Ferrocarril 'railway': ferro 'iron', + carril 'lane'
Paraguas 'umbrella': para 'to stop, stops' + aguas '(the) water'
Samples for long compounds in different languages
Icelandic:
járnbraut 'railway': járn 'iron', + braut 'path' or 'way'
farartæki 'vehicle': farar 'journey', + tæki 'apparatus'
alfræðiorðabók 'encyclopædia': al 'everything', + fræði 'study' or 'knowledge', +
orða 'words', + bók 'book'
símtal 'telephone conversation': sím 'telephone', + tal 'dialogue'
Italian:
Millepiedi 'centipede': mille 'thousand', + piedi 'feet'
Ferrovia 'railway': ferro 'iron', + via 'way'
Tergicristallo 'windscreen wiper': tergere 'to wash', + cristallo 'crystal, glass'
Japanese:
目覚まし(時計) mezamashi(dokei) 'alarm clock': 目 me 'eye' + 覚まし samashi
(-zamashi) 'awakening (someone)' (+ 時計 tokei (-dokei) clock)
お好み焼き okonomiyaki: お好み okonomi 'preference' + 焼き yaki 'cooking'
日帰り higaeri 'day trip': 日 hi 'day' + 帰り kaeri (-gaeri) 'returning (home)'
国会議事堂 kokkaigijidō 'national diet building': 国会 kokkai 'national diet' + 議事
giji 'proceedings' + 堂 dō 'hall'
compounds / concatenation
formation of compounds and their structure:
Most compounds are 2-root-compounds, but they come with a number of
different structures: Nouns – Adjectives - Verbs
A. Nouns
Noun-Noun        Adjective-Noun   Preposition-Noun   Verb-Noun
apron string     high school      overdose           swearword
hubcap           smallpox         underdog           whetstone
bedroom          poorhouse        uptown             scrubwoman
schoolteacher    bluebird         afterthought       rattlesnake
(see: http://public.wsu.edu/~gordonl/S05/256/compounds.htm)
In each of these cases, the syntactic class of the compound is the same
as the syntactic class of the final element of the compound.
* syntactic class = part-of-speech, such as noun, verb, adjective,…
compounds / concatenation
formation of compounds and their structure:
Noun-Noun        Adjective-Noun   Preposition-Noun   Verb-Noun
schoolteacher    bluebird         afterthought       rattlesnake
In each of these cases, the syntactic class of the compound is the same
as the syntactic class of the final element of the compound.
Rule:
• Germanic languages (e.g. English, German) are left-branching (the
modifiers come before the head): schoolteacher = teacher at a school,
bluebird = bird of blue color
• Romance languages (e.g. French, Spanish) are usually right-branching,
i.e. compounds are often formed with left-hand heads and a
prepositional component inserted before the modifier:
chemin-de-fer = railway (lit. 'road of iron')
moulin à vent = windmill (lit. 'mill (that works)-by-means-of wind')
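The head rule can be illustrated with a small Python sketch. It is a toy under stated assumptions: the lexicon POS contains only a handful of hand-picked entries, and the function names compound_pos and compound_pattern are hypothetical.

```python
# Toy part-of-speech lexicon (hypothetical entries matching the slide examples).
POS = {
    "school": "noun", "teacher": "noun",
    "blue": "adjective", "bird": "noun",
    "after": "preposition", "thought": "noun",
    "rattle": "verb", "snake": "noun",
    "head": "noun", "strong": "adjective",
    "chemin": "noun", "de": "preposition", "fer": "noun",
}


def compound_pos(parts, head="right", pos_lexicon=POS):
    """Return the word class of a compound, inherited from its head.

    head="right": the final element is the head (Germanic pattern).
    head="left":  the first element is the head (typical Romance pattern).
    """
    head_word = parts[-1] if head == "right" else parts[0]
    return pos_lexicon[head_word]


def compound_pattern(parts, pos_lexicon=POS):
    """Describe the structure of a compound, e.g. 'Adjective-Noun'."""
    return "-".join(pos_lexicon[part].capitalize() for part in parts)


print(compound_pattern(["blue", "bird"]), "->", compound_pos(["blue", "bird"]))
print(compound_pos(["chemin", "de", "fer"], head="left"))
```

The position of the head is passed in as a parameter because it depends on the language, not on the individual compound.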
compounds / concatenation
formation of compounds and their structure:
B. Adjectives
Noun-Adjective   Adjective-Adjective   Preposition-Adjective
headstrong       white-hot             overwide
skin-deep        widespread            ingrown
nationwide       bittersweet           underripe
earthbound       hardworking           above-mentioned
(see: http://public.wsu.edu/~gordonl/S05/256/compounds.htm)
In each of these cases, the syntactic class of the compound is the
same as the syntactic class of the final element of the compound.
compounds / concatenation
formation of compounds and their structure:
B. Adjectives : hardworking
The internal structure may be complex:
hard + work + ing -> hardwork + ing OR hard + working
-ing is typically the aspect suffix that gets added to the verb (root):
e.g. play-ing, laugh-ing, ask-ing,…
As a rule, we can form other wordforms (inflections, due to different
tenses) from those roots, following the same inflectional pattern, i.e.
verbal root + tense-marking suffix, or insertion of a modal verb:
Simple Present:  He play-s. He laugh-s. He ask-s.
Simple Past:     They play-ed. They laugh-ed. They ask-ed.
Simple Future:   I will play. I will laugh. I will ask.
compounds / concatenation
formation of compounds and their structure:
B. Adjectives : hardworking
The internal structure may be complex:
hard + work + ing -> hardwork + ing OR hard + working ?
* He hardworks.
* They hardworked.
* I will hardwork.
-> not hardwork + ing, i.e. hardwork is not a verb by itself;
the structure is hard + working:
[Adj [Adv hard] [Adj [verb work] [suffix ing]]]
(see: http://public.wsu.edu/~gordonl/S05/256/compounds.htm)
compounds / concatenation
formation of compounds and their structure:
C. Verbs
Noun-Verb       Adjective-Verb   Preposition-Verb   Verb-Verb
spoonfeed       dry-clean        outlive            sleepwalk
air-condition   whitewash        overdo
window-shop     broadcast        uproot
(see: http://public.wsu.edu/~gordonl/S05/256/compounds.htm)
In each of these cases, the syntactic class of the compound is the same
as the syntactic class of the final element of the compound.
semantics of compounds
Semantic classification : it is common to classify compounds into 4 types:
• endocentric (description: A+B denotes a special kind of B)
• exocentric
• copulative
• appositional
Endocentric compounds consist of a head and modifiers that restrict the
meaning of the head. Endocentric compounds tend to be of the same part of
speech (word class) as their head.
Examples:
- doghouse, where house is the head and dog is the modifier; i.e. a house
intended for a dog
- darkroom, where dark modifies room; i.e. a type of room (usually used in
photography)
semantics of compounds
Semantic classification : it is common to classify compounds into 4 types:
• endocentric
• exocentric (description: (one) whose B is A)
• copulative
• appositional
Exocentric compounds have an unexpressed semantic head (e.g. a person,
a plant, an animal...), and their meaning is often not transparent from their
constituent parts.
Examples:
● white-collar is neither a kind of collar nor a white thing,
but the collar's colour is a metaphor for socioeconomic status
● red-neck only indirectly refers to a neck; it refers to a working
person (e.g. a farmer)
● skinhead may refer to a bald head, but also refers to a certain
group of people
● paleface: Native Americans call the white man a paleface
semantics of compounds
Semantic classification : it is common to classify compounds into 4 types:
• endocentric
• exocentric
• copulative (description: A+B denotes 'the sum' of what A and B denote)
• appositional
Copulative compounds are compounds which have two semantic heads.
Examples:
- bittersweet; having both tastes
- sleepwalk; sleeping while walking OR walking in your sleep
semantics of compounds
Semantic classification : it is common to classify compounds into 4 types:
• endocentric
• exocentric
• copulative
• appositional (description: A and B provide different descriptions for the
same referent; the meaning can be characterized as 'A as well as B')
Appositional compounds refer to lexemes that have two (contrary)
attributes which classify the compound.
Examples:
- actor-director; an actor who is also the director
- maidservant; a maid who is also a servant OR a servant who is also a maid
- player-coach; someone who is a player as well as a coach
semantics of compounds (ambiguities)
When - in Germanic languages (e.g. German, English) - compound words
are formed by placing a descriptive word in front of the main word, the
semantic relationship between the components may be ambiguous. This is
a problem for decompounding or translation.
-> the orange bowl problem
semantics of compounds (ambiguities)
Can you please bring me the orange bowl ?
• bowl filled with oranges ?
• bowl having the shape of an orange ?
• bowl with an orange pattern ?
• bowl of orange colour ?
• bowl that was formerly / usually filled with oranges ?
compounding - decompounding
decompounding -> follows rules
principles / rules:
FANO rule: "the analysis is unambiguous when a morpheme is not the
beginning of another morpheme"
(= principle of longest match)
e.g. but / butter
(Orthographic) ambiguities in segmentation:
horseshoe: horses-hoe (?) vs. horse-shoe
(the FANO rule / longest match would lead to the incorrect/unlikely segmentation)
Segmentation has to be done recursively in order to find all possibilities.
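Both strategies can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own implementation: the toy LEXICON and the function names longest_match and all_segmentations are assumptions made for the example.

```python
# Toy lexicon (hypothetical) containing the competing morphemes from the slides.
LEXICON = {"horse", "horses", "shoe", "hoe", "pet", "pets",
           "shopping", "hopping", "but", "butter"}


def longest_match(word, lexicon):
    """Greedy left-to-right split: always take the longest known prefix."""
    parts, pos = [], 0
    while pos < len(word):
        for end in range(len(word), pos, -1):
            if word[pos:end] in lexicon:
                parts.append(word[pos:end])
                pos = end
                break
        else:
            return None  # no known morpheme starts at this position
    return parts


def all_segmentations(word, lexicon):
    """Recursively collect every split of word into lexicon entries."""
    if not word:
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        prefix = word[:end]
        if prefix in lexicon:
            for rest in all_segmentations(word[end:], lexicon):
                results.append([prefix] + rest)
    return results


print(longest_match("horseshoe", LEXICON))
print(all_segmentations("horseshoe", LEXICON))
print(all_segmentations("petshopping", LEXICON))
```

The greedy variant commits to horses + hoe, whereas the recursive variant returns both readings and leaves the choice to a later step (context, frequency, or stress in spoken language).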
compounding - decompounding
English:
petshopping: pet-shopping vs. pets-hopping
egg roll: Chinese food vs. rolling egg
a green ´house vs. a ´greenhouse
The white ´house vs. The ´White House
compounding - decompounding
German:
Staubecken: Stau-becken = a reservoir
            Staub-ecken = dusty corners
Wachstube: Wach-stube = the room of a guard
           Wachs-tube = a tube filled with wax
Gelbrand: Gelb-rand = a yellow border
          Gel-brand = burning of a gel
Tonerkennung: Toner-kennung = the identifier of a toner
              Ton-erkennung = the identification of tones
Lachen: Lache-n = multiple puddles of water
        Lachen = laughter
Druckerzeugnis: Druck-erzeugnis = printed matter
                Drucker-zeugnis = certificate for a printer
beinhalten : bein-halten (holding a leg; imagine: Beinhalten...) vs. be-inhalten (to contain)
Abteilungen : Abtei-lungen (abbey lungs) vs. Abteil-ungen
compounding - decompounding
context or stress (in spoken language) is needed for disambiguation
(problems with) concatenation
Summary
Structural as well as semantic challenges with compounds:
• ambiguities in meaning (orange bowl)
• ambiguities in hyphenation points (Staubecken)
• not all morphemes can form a compound (sheepchops)
compounds -> MWE -> idiomatic phrases
= increasing the idiomatic rigidity, increasing the formal complexity
In addition to the compounds that fit one of the four descriptions
(endocentric, exocentric, copulative, appositional), i.e. that keep the original
lexical meaning of at least one of their components, we need to consider "multiple
morpheme strings / multi word expressions (MWE)" (fixed phrases) that have
"lost" the original lexical meaning of their components. Those MWE are called
idiomatic phrases or idioms.
• compounding: combination of lexical meanings:
carseat, houseboat, cellar door,...
• compounding: not a combination of the lexical meanings:
starfish, paperback, ladybug,...
• depending on the context:
bite the dust, lose face, kick the bucket,...
idiomatic phrases
(http://www.geo.de/GEOlino/mensch/redewendungen/englisch)
• Out of the blue
• To be on Cloud Nine
• A leopard cannot change its spots
• Head over heels
• Fair Play
• As cool as a cucumber
• The early bird catches the worm
• As fit as a fiddle
• Beat about the bush
• The Big Apple
• The apple of my eye
• Wet behind the ears
• A bird in the hand is worth two in the bush
• It's raining cats and dogs
idiomatic phrases
(http://www.geo.de/GEOlino/mensch/redewendungen/deutsch)
• Wie bei Hempels unterm Sofa
• Schmetterlinge im Bauch
• Jemanden übers Ohr hauen
• Ein Bäuerchen machen
• Mit jemandem durch dick und dünn gehen
• Seine Pappenheimer kennen
• Jemandem die Würmer aus der Nase ziehen
• Die Arschkarte ziehen
• Mit jemandem Pferde stehlen können
• Sich aus dem Staub machen
• Hummeln im Hintern haben
• Im siebten Himmel sein
• Viele Wege führen nach Rom
• Mit einem lachenden und einem weinenden Auge
• Nah am Wasser gebaut haben
• Da ist der Bär los
• Nachtigall, ick hör dir trapsen
• Mein lieber Scholli!
idiomatic phrases
(http://www.geo.de/GEOlino/mensch/redewendungen/deutsch)
• Jemandem einen Denkzettel verpassen
• Sich auf den Schlips getreten fühlen
• Alles für die Katz
• Wo drückt denn der Schuh?
• Gegen den Strich gehen
• Den Faden verlieren
• Etwas ausbaden müssen
• Einen Stein im Brett haben
• Bahnhof verstehen
• Der springende Punkt
• Der Sündenbock sein
• Einen Ohrwurm haben
• Das ist doch zum Mäusemelken!
• Schmiere stehen
• Den Teufel an die Wand malen
• Auf dem Holzweg sein
• Eselsbrücke
• In der Kreide stehen
idiomatic phrases
(http://www.geo.de/GEOlino/mensch/redewendungen/deutsch)
• Die Ohren steif halten
• Auf Vordermann bringen
• Um die Ecke bringen
• Hals- und Beinbruch
• Auf dem Kerbholz haben
• Eine Schlappe einstecken
• Frosch im Hals
• Es zieht wie Hechtsuppe
• Jemandem einen Bärendienst erweisen
• Damoklesschwert
• Tomaten auf den Augen haben
• Jemandem raucht der Kopf
• Für 'n Appel und 'n Ei
• Etwas an die große Glocke hängen
• Das ist Jacke wie Hose
• Etwas aus dem Ärmel schütteln
• Ein X für ein U vormachen
• Jemandem nicht das Wasser reichen können
idiomatic phrases
(http://www.geo.de/GEOlino/mensch/redewendungen/deutsch)
• Alles im grünen Bereich
• Die Hand ins Feuer legen
• Das kann kein Schwein lesen!
• Auf Draht sein
• Sein blaues Wunder erleben
• Der hat es faustdick hinter den Ohren
• Mein Name ist Hase, ich weiß von nichts
• Aus dem Stegreif
• Der Groschen ist gefallen
• Einen Vogel haben
• Den Kürzeren ziehen
• Bis in die Puppen
• Etwas hinter die Ohren schreiben
• Ins Fettnäpfchen treten
• Beleidigte Leberwurst
• Jemanden auf dem Kieker haben
• Ich verstehe immer nur Bahnhof!
• Die Katze im Sack kaufen
idiomatic phrases
(http://www.geo.de/GEOlino/mensch/redewendungen/deutsch)
• Bekannt wie ein bunter Hund
• Den Kopf in den Sand stecken
• Mit dem ist nicht gut Kirschen essen
• Aller guten Dinge sind drei
• Lampenfieber
• Das kommt mir spanisch vor
• Schwein haben
• Das hast du dir selbst eingebrockt
• Seinen Senf dazugeben
• Jemandem ist eine Laus über die Leber gelaufen
• Kalte Füße bekommen
• Im Stich lassen
• Schwedische Gardinen
• Alles in Butter
• Geld auf den Kopf hauen
• Das Handtuch werfen
• Sich mit fremden Federn schmücken
idiomatic phrases – and their morpho-syntax
Idiomatic expressions are extremely rigid, in that morpho-syntactic
modifications are not allowed (without a change in meaning) :
GERMAN
Singular - Plural
• Bekannt wie ein bunter Hund (lit. 'known like a colourful dog', i.e. known to everyone)
• ??? Bekannt wie bunte Hunde.
• * Bekannt wie 2 bunte Hunde.
adjectival modification
• Den Kopf in den Sand stecken. (to bury one's head in the sand)
• Den Kopf in den weichen Sand stecken. (to bury one's head in the soft sand)
idiomatic phrases – and their morpho-syntax
Idiomatic expressions are extremely rigid, in that morpho-syntactic
modifications are not allowed (without a change in meaning) :
ENGLISH
Adjectival modification:
• to be on cloud nine –> * to be on cloud eight
Singular – Plural:
• The early bird gets the worm. -> ? The early birds get the worm.
• It's raining cats and dogs. -> * It's raining 2 cats and 3 dogs.
Neither adjectival modification nor change of subject:
• He kicked the bucket.
• * He kicked the green bucket.
• * It kicked the bucket.
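The rigidity of an idiom can be made concrete with a regular expression that tolerates verb inflection but no inserted modifiers. The sketch below is an illustration under assumptions: the helper idiom_pattern, the constant KICK_THE_BUCKET and the list of verb forms are hypothetical, and the restriction on the subject (* It kicked the bucket) is not captured by such a surface pattern.

```python
import re


def idiom_pattern(verb_forms, fixed_tail):
    """Build a pattern: any listed verb form, then the fixed words, nothing in between."""
    verbs = "|".join(re.escape(form) for form in verb_forms)
    tail = r"\s+".join(re.escape(word) for word in fixed_tail.split())
    return re.compile(rf"\b(?:{verbs})\s+{tail}\b", re.IGNORECASE)


# Tense inflection of the verb is allowed; the noun phrase is frozen.
KICK_THE_BUCKET = idiom_pattern(["kick", "kicks", "kicked", "kicking"], "the bucket")


def contains_idiom(sentence, pattern=KICK_THE_BUCKET):
    """True if the sentence contains the idiom in its fixed form."""
    return pattern.search(sentence) is not None


print(contains_idiom("He kicked the bucket."))
print(contains_idiom("He kicked the green bucket."))
```

A sentence with an inserted adjective no longer matches, so it would be analysed compositionally (a literal kick against a literal bucket) rather than as the idiom.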
multiple word entries (MWE)
We have already looked at the semantics / meaning of
compounds and idioms.
But what about the relationship within the MWE ?
multiple word entries (MWE)
Problems: the relationships among the components change
the „Schnitzel“ problem
• Schweineschnitzel / -steak: made of pork meat
• Pfefferschnitzel / -steak: garnished / spiced with pepper
• Wienerschnitzel: a certain recipe
• Soyaschnitzel: made of soy
• Rückensteak, Lendensteak, Ribeyesteak: body part
• Minutenschnitzel / -steak: time / length of cooking
• Jäger Schnitzel: a certain recipe
• Zigeuner Schnitzel: a certain recipe
• Tiefkühl-Schnitzel: status (frozen)
multiple word entries (MWE)
Problems: the relationships among the components change
the „Schnitzel“ problem
Even though the single lexical meanings remain untouched in the compound,
the relationships between the components vary tremendously !
multiple word entries (MWE)
the 3 main relationships (default ?) between parts of a
compound word: (the role of global knowledge in decompounding)
    relationship                    compound      meaning
1   is-a / is-part-of / genitive    doorknob      knob of the door
                                    carseat       seat of the car
2   made from / material            glass door    door made of glass
                                    nutbread      (not: bread of the nut)
3   used for                        water glass   glass filled with water
                                    oiltruck      truck that carries oil
                                                  (not: truck made of oil)
spell aid
in NLP, decompounding algorithms are essential for
spell-checking / spell aid :
How do we define a lexical error in NLP terms ?
An error is a string that cannot be found in / matched with a
dictionary entry.
It is not necessarily an incorrect word (esp. neologisms).
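This definition, together with the role of decompounding, can be sketched as follows. The sketch is an assumption-laden illustration: DICTIONARY is a tiny hypothetical word list, and the function names can_segment and lexical_errors are invented for the example.

```python
# Tiny hypothetical dictionary of base words.
DICTIONARY = {"the", "black", "bird", "smoke", "screen", "sat", "on"}


def can_segment(word, dictionary):
    """True if word can be split completely into dictionary entries."""
    if not word:
        return True
    return any(
        word[:end] in dictionary and can_segment(word[end:], dictionary)
        for end in range(1, len(word) + 1)
    )


def lexical_errors(tokens, dictionary):
    """Return the tokens that match no entry, directly or as a compound."""
    errors = []
    for token in tokens:
        word = token.lower()
        if word in dictionary or can_segment(word, dictionary):
            continue
        errors.append(token)
    return errors


print(lexical_errors(["the", "blackbird", "googled", "smokescreen"], DICTIONARY))
```

Without the decompounding step, blackbird and smokescreen would be reported as errors; googled is reported although it is a perfectly usable neologism, which is exactly the gap between "not in the dictionary" and "incorrect".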
spell aid
Neologism (Definition):
A neologism is a new term, word or phrase, that may or may not be in
the process of entering common use, but has not yet been accepted into
mainstream language, i.e. it has NOT entered written dictionaries (yet).
For a long time neologisms were mainly seen as pathological or
deviating - Webster’s Third New International Dictionary (1966)
describes neologism as „a meaningless word coined by a psychotic“.
http://www.neologisms.us/
a-er
aagram
aagram string
aangram
Aazymurgy
abasure
abberateur
abbrantcooty
abbrhyme
abched
abilliant
abomasum
abrabro
abrickity
abthurt
spell aid - neologisms
http://www.wortwarte.de/
New words of 25.9.2011 ("Today we serve you 23 new words"):
Alles-Apparat, der
Ampelorgie, die
ärzteloyal, adjective
Distanzmanöver, das
Drivingcenter, das
E-Ball-Match, das
Ego-Archäologe, der
Full-Flat, die
Gefällt-mir-Klick, der
Geschmacksfarbe, die
HD-Livestream, der
Inlineskater-Marathon, der
Leerheitsanalyse, die
mitnahmefähig, adjective
nachkochsicher, adjective
Nerdpartei, die
Neutrino-Witz, der
Panda-Umarmer, der
Radfahrlinksabbiegerspur, die
Schwungrad-Technologie, die
Sugar-Stick, der
Zahnspangen-Dichte, die
Zeiterfassungschip, der
spell aid - neologisms
AIDS
to xerox
googling / to google
photoshopping
Kleenex
to pamper
texting / to text
….
…
l.o.l. OR lol OR LOL = laughing out loud (German: laut herauslachen)
LG (German: Liebe Grüße = kind regards)
HDGDL (German: Hab dich ganz doll lieb = love you lots)
spell aid – chat language (acronyms)
AFAIK -- As Far As I Know
AFK -- Away From Keyboard
ASAP -- As Soon As Possible
BAS -- Big A** Smile
BBL -- Be Back Later
BBN -- Bye Bye Now
BBS -- Be Back Soon
BEG -- Big Evil Grin
BF -- Boyfriend
BIBO -- Beer In, Beer Out
BRB -- Be Right Back
BTW -- By The Way
BWL -- Bursting With Laughter
C&G -- Chuckle and Grin
CICO -- Coffee In, Coffee Out
CID -- Crying In Disgrace
CP -- Chat Post(a chat message)
CRBT -- Crying Real Big Tears
CSG -- Chuckle Snicker Grin
CYA -- See You (Seeya)
CYAL8R -- See You Later (Seeyalata)
DLTBBB -- Don't Let The Bed Bugs Bite
EG -- Evil Grin
EMSG -- Email Message
FC -- Fingers Crossed
FTBOMH -- From The Bottom Of My Heart
FYI -- For Your Information
See: http://www.chatdefinitions.com/
spell aid – chat language (symbols)
:-9 -- Delicious, Yummy
:-> -- Devilish
;-> -- Devilish Wink
:P -- Disgusted (sticking out tongue)
:*) -- Drunk
:-6 -- Exhausted, Wiped Out
:( -- Frown
\~/ -- Full Glass
\_/ -- Glass (drink)
^5 -- High Five
:-| -- Ambivalent
o:-) -- Angelic
>:-( -- Angry
|-I -- Asleep
(::()::) -- Bandaid
:-{} -- Blowing a Kiss
\-o -- Bored
:-c -- Bummed Out
|C| -- Can of Coke
|P| -- Can of Pepsi
:( ) -- Can't Stop Talking
:*) -- Clowning
:' -- Crying
:'-) -- Crying with Joy
:'-( -- Crying Sadly
See: http://www.chatdefinitions.com/
spell aid
Spell checking algorithms are based on the following types of
mistakes (statistics!):
• phonetic similarities (ph – f: telephone – telefone)
• deletion of doubled letters (mouuse – mouse)
• wrong order (from – form; mouse – muose)
• substitution of neighbouring letters on the keyboard (miuse – mouse)
• insertion of missing letters (vowels between consonants...) (telephne – telephone)
• typos occur towards the end of a word (assumption: the first letter is correct)
• segmentation / decomposition into substrings (horses'hoe – horse'shoe)
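Several of the error types above correspond to single-character edit operations. The following is a minimal Python sketch, not the method of any particular spell checker: the toy dictionary, the keyboard-neighbour table, and the function names are hypothetical and only illustrate how candidate corrections could be generated and filtered.

```python
import string

# Hypothetical toy dictionary; a real system would load a full word list.
DICTIONARY = {"mouse", "from", "form", "telephone", "horseshoe"}

# Hypothetical (incomplete) map of neighbouring keys on a QWERTY keyboard.
NEIGHBOURS = {"i": "uo", "o": "ip", "u": "yi"}

def candidates(word):
    """Return all strings one edit away from `word`, grouped by error type."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    return {
        # one letter too many was typed: drop a letter (mouuse -> mouse)
        "deletion": {a + b[1:] for a, b in splits if b},
        # two letters are in the wrong order: swap them (muose -> mouse)
        "transposition": {a + b[1] + b[0] + b[2:]
                          for a, b in splits if len(b) > 1},
        # a neighbouring key was hit: try the neighbours (miuse -> mouse)
        "substitution": {a + c + b[1:] for a, b in splits if b
                         for c in NEIGHBOURS.get(b[0], "")},
        # a letter was left out: insert one (telephne -> telephone)
        "insertion": {a + c + b for a, b in splits
                      for c in string.ascii_lowercase},
    }

def suggest(word):
    """Return dictionary words one edit away, assuming the first letter is correct."""
    found = set()
    for variants in candidates(word).values():
        found |= {v for v in variants if v in DICTIONARY and v[0] == word[0]}
    return found
```

For example, `suggest("miuse")` and `suggest("muose")` both return `{"mouse"}` with this toy dictionary. Phonetic similarity and segmentation errors are not covered by this sketch.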
spell aid
• include missing letters
www.dositey.com/language/spelling/Mislet3.htm
spell aid
How does spell checking work (w.r.t. grammar checking)?
Various degrees of "intelligence":
System A: no match found in the dictionary -> mark entry as incorrect
System B: no match found in the dictionary. Initiate a rudimentary parse
(left-to-right search). Try to identify the word class, i.e. limit the possibilities and
continue a sentential analysis, e.g. the ... man (statistics: DET + ADJ +
NOUN); n-gram
System C: no match found in the dictionary. Initiate a segmentation of the
word to identify the word class, e.g. look for typical endings (-ly = adverb /
capital letters = proper noun, ...). This way newly coined words can be
identified (e.g. any word ending in -ness = noun); n-gram
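The System C idea can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the rule list contains just the two endings named above, and the function name and the `sentence_initial` parameter are hypothetical.

```python
# Hypothetical suffix-to-word-class rules, taken from the examples above.
SUFFIX_RULES = [
    ("ness", "NOUN"),
    ("ly", "ADVERB"),
]

def guess_word_class(word, sentence_initial=False):
    """Guess the word class of an out-of-dictionary word from its form."""
    # A capital letter in non-initial position suggests a proper noun.
    if word[:1].isupper() and not sentence_initial:
        return "PROPER_NOUN"
    lowered = word.lower()
    for suffix, word_class in SUFFIX_RULES:
        # Require a stem before the suffix so that "ly" alone is not tagged.
        if lowered.endswith(suffix) and len(lowered) > len(suffix):
            return word_class
    return "UNKNOWN"
```

With these rules, a new coinage such as "googleness" would be guessed to be a noun even though it is in no dictionary.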
n-grams / language models (statistical language processing)
An n-gram is a substring of n items from a given string.
In NLP, the items in question can be phonemes, syllables, letters, words or any
substring. This depends on the application.
An n-gram of
size 1 is a "unigram";
size 2 is a "bigram";
size 3 is a "trigram"; etc. …
size n is an "n-gram".
n-grams / language models (statistical language processing)
Example: "he reads a book"
For a sequence of words, the trigrams would be: "# he reads", "he reads a",
"reads a book", and "a book #" (# marks the sentence boundary).
For sequences of characters, the trigrams that can be generated from "hello world"
are "hel", "ell", "llo", "lo ", "o w", " wo", "wor" etc.
In practice, we often
• collapse whitespace to a single space
• remove punctuation
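The two examples and the preprocessing steps can be reproduced with a short Python sketch. The function names and the choice of "#" as a single boundary marker on each side are assumptions made for illustration.

```python
import re

def word_ngrams(sentence, n, boundary="#"):
    """Return word n-grams, padded with one boundary marker on each side."""
    tokens = [boundary] + sentence.split() + [boundary]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(text, n):
    """Return character n-grams after the two preprocessing steps."""
    text = re.sub(r"[^\w\s]", "", text)       # remove punctuation
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace to one space
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(word_ngrams("he reads a book", 3))
print(char_ngrams("hello,   world!", 3))
```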
n-grams / language models (statistical language processing)
Example of an n-gram count from the GOOGLE n-gram corpus:
(http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html#!/2006/08/all-our-ngram-are-belong-to-you.html)
File sizes: approx. 24 GB compressed (gzip'ed) text files
Number of sentences: 95,119,665,584
Number of unigrams: 13,588,391
Number of bigrams: 314,843,401
Number of trigrams: 977,069,902
Number of fourgrams: 1,313,818,354
Number of fivegrams: 1,176,470,663
n-grams / language models (statistical language processing)
Example of an n-gram count from the GOOGLE n-gram corpus:
(http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html#!/2006/08/all-our-ngram-are-belong-to-you.html)
trigrams:
ceramics collectables collectibles   55
ceramics collectables fine           130
ceramics collected by                52
ceramics collectible pottery         50
ceramics collectibles cooking        45
ceramics collection ,                144
ceramics collection .                247
n-grams / language models (statistical language processing)
Example of an n-gram count from the GOOGLE n-gram corpus:
(http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html#!/2006/08/all-our-ngram-are-belong-to-you.html)
fourgrams:
serve as the incoming        92
serve as the incubator       99
serve as the independent     794
serve as the index           223
serve as the indication      72
serve as the indicator       120
serve as the indicators      45
serve as the indispensable   111
serve as the indispensible   40
n-grams / language models (statistical language processing)
A statistical language model assigns a probability $P(w_1, \ldots, w_m)$ to a
sequence of m words by means of a probability distribution.
More concisely, an n-gram model predicts $x_i$ based on $x_{i-(n-1)}, \ldots, x_{i-1}$.
In probability terms, this is $P(x_i \mid x_{i-(n-1)}, \ldots, x_{i-1})$.
This is also called an (n-1)-order Markov model.
In speech recognition, sequences of phonemes are often modeled using an n-gram
distribution.
n-grams / language models (statistical language processing)
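In an n-gram model, the probability $P(w_1, \ldots, w_m)$ of observing the
sentence $w_1, \ldots, w_m$ can be approximated:
$$P(w_1, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})$$
It is assumed that the probability of observing the i-th word $w_i$ in the context history
of the preceding i-1 words can be approximated by the probability of observing it in
the shortened context history of the preceding n-1 words.
In a bigram (n=2) language model, the probability of the sentence "I saw the red
house" is approximated as (with $\langle s \rangle$ marking the start of the sentence):
$$P(\text{I}, \text{saw}, \text{the}, \text{red}, \text{house}) \approx P(\text{I} \mid \langle s \rangle)\, P(\text{saw} \mid \text{I})\, P(\text{the} \mid \text{saw})\, P(\text{red} \mid \text{the})\, P(\text{house} \mid \text{red})$$
Whereas in a trigram (n=3) language model, the approximation is:
$$P(\text{I}, \text{saw}, \text{the}, \text{red}, \text{house}) \approx P(\text{I} \mid \langle s \rangle, \langle s \rangle)\, P(\text{saw} \mid \langle s \rangle, \text{I})\, P(\text{the} \mid \text{I}, \text{saw})\, P(\text{red} \mid \text{saw}, \text{the})\, P(\text{house} \mid \text{the}, \text{red})$$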
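To make the bigram approximation concrete, here is a minimal Python sketch that estimates the conditional probabilities by relative frequency. The three-sentence corpus, the start marker `<s>`, and the function names are assumptions for illustration; no smoothing is applied, so any unseen bigram yields probability 0.

```python
from collections import Counter

# Hypothetical toy corpus; each sentence is already tokenized.
CORPUS = [
    ["I", "saw", "the", "red", "house"],
    ["I", "saw", "the", "dog"],
    ["the", "red", "house", "is", "big"],
]

def train_bigram_model(sentences):
    """Count contexts and bigrams, using <s> as a start-of-sentence marker."""
    context_counts, bigram_counts = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return context_counts, bigram_counts

def sentence_probability(sentence, context_counts, bigram_counts):
    """Approximate P(w1..wm) as the product of P(wi | wi-1), unsmoothed."""
    probability = 1.0
    tokens = ["<s>"] + sentence
    for prev, word in zip(tokens, tokens[1:]):
        if context_counts[prev] == 0:
            return 0.0  # unseen context: the unsmoothed estimate is zero
        probability *= bigram_counts[(prev, word)] / context_counts[prev]
    return probability

contexts, bigrams = train_bigram_model(CORPUS)
p = sentence_probability(["I", "saw", "the", "red", "house"], contexts, bigrams)
```

On this toy corpus the factors are 2/3, 1, 1, 2/3 and 1, so `p` is 4/9.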
regular expressions (Jurafsky, section 2.1)
• In order to figure out whether something is an incorrect word, the
machine has to match the string (= any sequence of alphanumeric
characters: letters, numbers, spaces, tabs, punctuation) to an entry
in the dictionary
• other matches: e.g. information retrieval in web search engines
(Google, AltaVista, …)
• the standard notation for characterizing text sequences =
regular expressions
• regular expressions are supported by many languages and tools, e.g.
Perl, grep (Global Regular Expression Print)
• formally, regular expressions are algebraic notations for characterizing a
set of strings
• regular expression search requires a pattern that we want to search for
(and a corpus of text to search through) (text mining!)
regular expressions (Jurafsky, section 2.1)
Example: Search for the pattern “linguistics”.
• You also want to find documents with “Linguistics” and “LINGUISTICS”.
(remember: the computer does EXACTLY what you tell it to…)
• The regular expression /linguistics/ matches any string in any
document containing exactly the substring “linguistics”
• Regular expressions are case sensitive
• samples (Jurafsky, p. 23)
regular expression   example pattern matched
/woodchucks/         “interesting links to woodchucks and lemurs”
/a/                  “Mary Ann stopped by Mona’s”
/Claire says,/       ““Dagmar, my gift please,” Claire says,”
/song/               “all our pretty songs”
/!/                  “You’ve left the burglar behind again!” said Nori
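Case sensitivity can be tried out with Python's `re` module, which is used here in place of the Perl notation on the slides. The three sample documents are made up for illustration.

```python
import re

# Hypothetical mini corpus of three "documents".
documents = ["linguistics is fun", "Linguistics 5981", "LINGUISTICS"]

# A plain pattern matches only the exact, case-sensitive substring.
exact = [d for d in documents if re.search(r"linguistics", d)]

# The re.IGNORECASE flag is one way to also find the other spellings.
any_case = [d for d in documents if re.search(r"linguistics", d, re.IGNORECASE)]

print(exact)      # only the all-lowercase document
print(any_case)   # all three documents
```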
regular expressions (Jurafsky, section 2.1)
linguistics - Linguistics - LINGUISTICS
to search for alternative characters “l” and/or “L” we use square
brackets: [lL]
Regular expression    match                        sample pattern
/[lL]inguistics/      Linguistics or linguistics   “computational linguistics is fun”
/[1234567890]/        any digit                    this is Linguistics 5981
regular expressions (Jurafsky, section 2.1)
to search for a character in a range we use the dash inside the brackets: [-]
Regular expression    match                  sample pattern
/[A-Z]/               any uppercase letter   this is Linguistics 5981
/[0-9]/               any single digit       this is Linguistics 5981
/[1234567890]/        any single digit       this is Linguistics 5981
regular expressions (Jurafsky, section 2.1)
to search for negation, i.e. a character that we do NOT want to find, we
use the caret as the first symbol inside the brackets: [^]
Regular expression    match                     sample pattern
/[^A-Z]/              not an uppercase letter   this is Linguistics 5981
/[^Ll]/               neither L nor l           this is Linguistics 5981
/[^\.]/               not a period              this is Linguistics 5981
Special characters:
\*    an asterisk        “L*I*N*G*U*I*S*T*I*C*S”
\.    a period           “Dr.Doolittle”
\?    a question mark    “Is this Linguistics 5981 ?”
\n    a newline
\t    a tab
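Brackets, ranges, negation, and escaped special characters can be checked with a few calls to Python's `re` module (used here instead of Perl). The sample strings come from the slides, except "Lily", which is a made-up test word.

```python
import re

text = "this is Linguistics 5981"

uppercase = re.findall(r"[A-Z]", text)                  # every uppercase letter
digits = re.findall(r"[0-9]", text)                     # every single digit
first_non_upper = re.search(r"[^A-Z]", text).group()    # first non-uppercase char
not_l = re.findall(r"[^Ll]", "Lily")                    # everything except L and l

# A backslash turns a special character into a literal one.
periods = re.findall(r"\.", "Dr.Doolittle")
asterisks = re.findall(r"\*", "L*I*N*G*U*I*S*T*I*C*S")
```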
regular expressions (Jurafsky, section 2.1)
to search for optional characters we use the question mark: “?”
Regular expression    match              sample pattern
/colou?r/             colour or color    beautiful colour
to search for any number of a certain character we use the Kleene star: “*”
Regular expression    match
/a*/                  any string of zero or more “a”s
/aa*/                 at least one “a”, followed by any number of “a”s
regular expressions (Jurafsky, section 2.1)
To look for at least one character of a type we use the Kleene “+”:
Regular expression    match
/[0-9]+/              a sequence of digits
Any combination is possible
Regular expression    match
/[ab]*/               zero or more “a”s or “b”s
/[0-9][0-9]*/         any integer (= a string of digits)
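The difference between “?”, “*” and “+” is easiest to see on small inputs. This sketch uses Python's `re` module in place of Perl; the sample strings are made up for illustration.

```python
import re

# "?" makes the preceding character optional.
colours = re.findall(r"colou?r", "color and colour")

# "*" allows zero repetitions, so /a*/ also matches the empty string,
# whereas /aa*/ needs at least one "a".
star_matches_empty = re.fullmatch(r"a*", "") is not None
aa_star_matches_empty = re.fullmatch(r"aa*", "") is not None

# "+" means one or more; /[0-9]+/ and /[0-9][0-9]*/ find the same integers.
integers = re.findall(r"[0-9]+", "room 42, floor 7")
integers_alt = re.findall(r"[0-9][0-9]*", "room 42, floor 7")

# A bracket expression can be repeated as well.
only_ab = re.fullmatch(r"[ab]*", "abba") is not None
```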
regular expressions (Jurafsky, section 2.1)
The “.” is a very special character -> so-called wildcard
Regular expression    match                                     sample pattern
/b.ll/                any single character between b and ll     ball, bell, bull, bill
Will the search find “Bill” ?
regular expressions (Jurafsky, section 2.1)
Anchors (start of line: “^”, end of line: “$”)
Regular expression    match                                       sample pattern
/^Linguistics/        “Linguistics” at the beginning of a line    Linguistics is fun.
/linguistics\.$/      “linguistics.” at the end of a line         We like linguistics.
Anchors (word boundary: “\b”, non-boundary: “\B”)
Regular expression    match             sample pattern
/\bthe\b/             “the” alone       This is the place.
/\Bthe\B/             “the” included    This is my mother.
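The anchors behave the same way in Python's `re` module, which is used here as a stand-in for Perl. The last test sentence is made up so that “the” occurs both as a word and inside other words.

```python
import re

# "^" and "$" tie the pattern to the start and end of the line.
at_start = re.search(r"^Linguistics", "Linguistics is fun.") is not None
not_at_start = re.search(r"^Linguistics", "We like Linguistics.") is not None
at_end = re.search(r"linguistics\.$", "We like linguistics.") is not None

# "\b" requires a word boundary, "\B" forbids one.
whole_words = re.findall(r"\bthe\b", "This is the place for the other mother.")
inside_word = re.search(r"\Bthe\B", "This is my mother.") is not None
not_inside = re.search(r"\Bthe\B", "This is the place.") is not None
```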
regular expressions (Jurafsky, section 2.1)
More on alternative characters: the pipe symbol: “|” (disjunction)
Regular expression    match                   sample pattern
/colou?r/             colour or color         beautiful colour
/progra(m|mme)/       program or programme    linguistics program
regular expressions (Jurafsky, section 2.1)
What does the following expression match ?
/student [0-9]+ */
Will it match “student 1 student 2 student 3” ?
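One way to explore this question is to run the pattern in Python's `re` module (a stand-in for Perl). The sketch suggests that the star applies only to the preceding space, so the pattern covers a single "student N" unit; the parenthesized variant is an assumed fix, not taken from the slides.

```python
import re

text = "student 1 student 2 student 3"

# The star binds only to the space directly before it.
single = re.match(r"student [0-9]+ *", text).group()

# Parentheses make the star apply to the whole "student N" unit.
repeated = re.match(r"(student [0-9]+ *)*", text).group()

print(repr(single))    # one unit plus the trailing space
print(repr(repeated))  # the complete input string
```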
regular expressions (Jurafsky, section 2.1)
Perl expressions are also used for string substitution (used in ELIZA):
s/man/men/                         man -> men
Perl expressions can also reuse a matched string via memory
(the number operator \1):
s/(linguistics)/wonderful \1/      linguistics -> wonderful linguistics
ELIZA:
s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1?/
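The same substitutions can be tried in Python, where `re.sub` plays the role of Perl's `s/.../.../` and `\1` again refers to the first parenthesized group. The function name `eliza_reply` and the test utterance are hypothetical; the pattern is the one shown above.

```python
import re

plural = re.sub(r"man", "men", "man")
remembered = re.sub(r"(linguistics)", r"wonderful \1", "I like linguistics")

def eliza_reply(utterance):
    """Apply one ELIZA-style rule; return None if the rule does not fire."""
    pattern = r".* YOU ARE (depressed|sad) .*"
    if re.fullmatch(pattern, utterance) is None:
        return None
    return re.sub(pattern, r"I AM SORRY TO HEAR YOU ARE \1", utterance)

print(eliza_reply("I THINK YOU ARE sad TODAY"))
```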
Finite State Automata (FSA)
The regular expression is more than just a convenient metalanguage
for text searching.
• First, a regular expression is one way of describing a finite-state
automaton (FSA).
Finite-state automata are the theoretical foundation of a good deal of
the computational work we will describe and look at in this lecture. Any
regular expression can be implemented as a finite-state automaton*.
Symmetrically, any finite-state automaton can be described with a
regular expression.
• Second, a regular expression is one way of characterizing a particular
kind of formal language called a regular language.
Both regular expressions and finite-state automata can be used to
describe regular languages. The relation among these three theoretical
constructions is sketched out in the following figure:
* Except regular expressions that use the memory feature – more on that later
Finite State Automata (FSA)
[Figure: regular expressions, finite automata, and regular languages shown as three interconnected descriptions]
The relationship between finite state automata, regular
expressions, and regular languages*
* as suggested by Martin Kay in:
Kay, M. (1987). Nonconcatenative finite-state morphology. In Proceedings of the Third Conference
of the European Chapter of the ACL (EACL-87), Copenhagen, Denmark, pp. 2-10. ACL.
Finite State Automata (FSA)
Examples:
• Introduction to finite-state automata for regular expressions
• Mapping from regular expressions to automata: examples
Finite State Automata (FSA)
Using an FSA to recognize sheeptalk
After a while, with the parrot's help, the Doctor got to learn
the language of the animals so well that he could talk to
them himself and understand everything they said.
Hugh Lofting, The Story of Doctor Dolittle
Finite State Automata (FSA)
Using an FSA to recognize sheeptalk
Sheep language can be defined as any string from the
following (infinite) set:
baa!
baaa!
baaaa!
baaaaa!
baaaaaa!
....
The regular expression for this kind of sheeptalk is
/baa+!/
All regular expressions can be represented as finite-state
automata (FSA):
Finite State Automata (FSA)
[Figure: q0 --b--> q1 --a--> q2 --a--> q3 --!--> q4, with a loop labeled a on q3; q0 is the start state, q4 is the final/accepting state]
A finite-state automaton (FSA) for the regular expression /baa+!/
Finite State Automata (FSA)
[Figure: a tape with cells containing ... a b a ! b ...; the machine is in state q0 at the first cell]
A tape with cells
Example of stopping in a non-final state = rejection of the input
Finite State Automata (FSA)
         Input
State    b    a    !
0        1    ∅    ∅
1        ∅    2    ∅
2        ∅    3    ∅
3        ∅    3    4
4:       ∅    ∅    ∅
(∅ = null, i.e. no transition; the colon marks the final state)
The state-transition table for the previous FSA
Finite State Automata (FSA)
An algorithm for deterministic recognition of FSAs
function D-RECOGNIZE(tape, machine) returns accept or reject
  index <- Beginning of tape
  current-state <- Initial state of machine
  loop
    if End of input has been reached then
      if current-state is an accept state then
        return accept
      else
        return reject
    elseif transition-table[current-state, tape[index]] is empty then
      return reject
    else
      current-state <- transition-table[current-state, tape[index]]
      index <- index + 1
  end
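The pseudocode translates almost line by line into Python. The following is a sketch under simple assumptions: the tape is an ordinary string, the transition table for /baa+!/ is a nested dictionary in which a missing entry stands for an empty cell, and the names `TRANSITIONS`, `ACCEPT_STATES` and `d_recognize` are chosen here for illustration.

```python
# Transition table for /baa+!/: state -> {input symbol -> next state}.
# A missing entry corresponds to an empty table cell (no transition).
TRANSITIONS = {
    0: {"b": 1},
    1: {"a": 2},
    2: {"a": 3},
    3: {"a": 3, "!": 4},
    4: {},
}
ACCEPT_STATES = {4}

def d_recognize(tape, transitions=TRANSITIONS,
                accept_states=ACCEPT_STATES, start=0):
    """Return True (accept) or False (reject) for the input string `tape`."""
    current_state = start
    for symbol in tape:
        next_state = transitions[current_state].get(symbol)
        if next_state is None:
            return False  # empty table cell: reject immediately
        current_state = next_state
    # End of input reached: accept only if we are in an accept state.
    return current_state in accept_states

print(d_recognize("baaa!"))  # sheeptalk
print(d_recognize("aba!b"))  # not sheeptalk
```

Because the machine is deterministic, the loop never has to back up: each symbol is read exactly once.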
Finite State Automata (FSA)
[Figure: tape cells ... b a a a ! ... with the state labels q0 to q5 marked along the tape]
Tracing the execution of FSA on some sheeptalk
Finite State Automata (FSA)
Regular expressions can be represented as FSAs:
[Figure: FSA with the states q0 to q4 plus a fail state qf; arcs labeled a, b, and ! lead from the regular states to qf]
Finite State Automata (FSA)
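[Figure: q0 --b--> q1 --a--> q2 --a--> q3 --!--> q4, with a loop labeled a that makes the choice on input a non-deterministic (in Jurafsky's version of this figure the loop is on q2)]
A non-deterministic finite-state automaton for talking
sheep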
Finite State Automata (FSA)
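[Figure: q0 --b--> q1 --a--> q2 --a--> q3 --!--> q4, plus an ε-transition (in Jurafsky's version of this figure it leads from q3 back to q2)]
A non-deterministic finite-state automaton (NFSA) for the sheep language
– having an ε-transition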
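A non-deterministic automaton with an ε-transition can be simulated by tracking the set of all states it could be in. The sketch below assumes the ε-arc runs from q3 back to q2, as in Jurafsky's textbook figure. It uses a set-of-states simulation rather than a backtracking search, and all names in it are chosen for illustration.

```python
EPSILON = ""  # label used here for transitions that consume no input

# Assumed NFSA: q0 -b-> q1 -a-> q2 -a-> q3 -!-> q4, plus an epsilon arc q3 -> q2.
NFSA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {3},
    (3, "!"): {4},
    (3, EPSILON): {2},
}
ACCEPT_STATES = {4}

def epsilon_closure(states, machine):
    """Return all states reachable from `states` via epsilon arcs only."""
    closure, agenda = set(states), list(states)
    while agenda:
        state = agenda.pop()
        for target in machine.get((state, EPSILON), set()):
            if target not in closure:
                closure.add(target)
                agenda.append(target)
    return closure

def nd_recognize(tape, machine=NFSA, accept_states=ACCEPT_STATES, start=0):
    """Track the set of states the NFSA could be in after each symbol."""
    current = epsilon_closure({start}, machine)
    for symbol in tape:
        moved = set()
        for state in current:
            moved |= machine.get((state, symbol), set())
        current = epsilon_closure(moved, machine)
    return bool(current & accept_states)
```

Under this assumption the automaton accepts the same strings as /baa+!/, for example `nd_recognize("baaa!")` is true and `nd_recognize("ba!")` is false.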