Finite State Machinery

Finite State Machinery - I
• Fundamentals
• Recognisers and Transducers
Reference Outline
• Websites
– Xerox: www.xrce.xerox.com/research/mltt/fst/
– Groningen: grid.let.rug.nl/~vannoord/FSA/fsa.html
– AT&T: www.research.att.com/sw/tools/fsm
• Books/Collections
– Karttunen & Oflazer (2000)
– Jurafsky & Martin (2000)
– Hopcroft and Ullman (1979)
– Roche and Schabes (1997)
• Classic Articles
– Kaplan and Kay (1994)
– Koskenniemi (1983)
– Johnson (1972)
• Tools
– Van Noord et al.
– Mohri et al.
– Daciuk
– Karttunen & Beesley
Acknowledgements to
• Lauri Karttunen, Ken Beesley and
colleagues at Xerox.
• Most materials in this tutorial are from their
website.
• Forthcoming book: Finite State Morphology
– Xerox Tools and Techniques.
FS Motivation
• Chomsky hierarchy of language classes
based on classes of descriptive notation, and
also on associated classes of machine.
• Chomsky (1957) dismissed FS grammars,
and associated machinery, as fundamentally
inadequate for the description of NL.
Embedding
• Basic problem is not that sentences can
grow to arbitrary length, it is that the
description of a syntactic constituent may
embed any other constituents including the
sentence itself.
The dog bit the cat.
The dog that the man saw bit the cat.
The dog that the man that the horse kicked saw bit the cat.
etc
On the other hand ...
• Plenty of language just ain't like that.
• Words
– Orthographic spelling.
– Phonological spelling.
– Morphology.
• Fixed expression types (e.g. dates).
• Gross constituent structures (e.g. the big,
bad, blue wolf).
Recent Application Areas for FS Technology Include
• POS Tagging
• Spell Checking
• Information Extraction
• Speech Recognition
• Text to Speech
• Spoken Dialogue
• Parsing
Recognition of Italian Words
• The coke machine recognises words in the
coke machine language.
• The following machine recognises two
words in Italian.
• Recognition mechanism is language
independent.
[Figure: network recognising the Italian words CASA and CINQUE]
The Process of Analysis
• Start in the initial state and at the first
symbol of the word.
• If there is an arc labelled with that symbol,
the machine transitions to the next state,
and the symbol is consumed.
• The process continues with successive
symbols until ...
The Process of Analysis
One or more of these conditions holds:
• A. A final state is reached
• B. All symbols are consumed
• C. There are no transitions out of a state for
the current symbol.
– If both A and B, analysis succeeds and the
word is recognised.
– Otherwise recognition fails.
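The procedure above can be sketched in a few lines of Python. This is only an illustration: the state numbers, the TRANSITIONS table and the function name recognise are assumptions made for the example, and the network is a guess at one that accepts CASA and CINQUE.

```python
# Hypothetical network for the two Italian words CASA and CINQUE.
# Each state maps an input symbol to the state reached over that arc.
TRANSITIONS = {
    0: {"C": 1},
    1: {"A": 2, "I": 5},
    2: {"S": 3},
    3: {"A": 4},
    5: {"N": 6},
    6: {"Q": 7},
    7: {"U": 8},
    8: {"E": 4},
}
START = 0
FINAL = {4}


def recognise(word):
    """Return True if the network accepts the word, False otherwise."""
    state = START
    for symbol in word:
        arcs = TRANSITIONS.get(state, {})
        if symbol not in arcs:
            # Condition C: no transition for the current symbol.
            return False
        state = arcs[symbol]  # the symbol is consumed
    # Condition B holds here; succeed only if condition A holds too.
    return state in FINAL
```

Nothing in recognise refers to Italian: swapping in a different TRANSITIONS table gives a recogniser for a different word list.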
Success and Failure
[Figure: network recognising CASA, CINQUE and LENTE]
Test inputs: LE; CASA; CINQUANTA; LENTEMENTE
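To see why each of the four test inputs succeeds or fails, the recogniser can be made to report which condition ended the analysis. The sketch below assumes the network accepts exactly CASA, CINQUE and LENTE; the state numbering and the name diagnose are invented for the example.

```python
# Assumed network: accepts CASA, CINQUE and LENTE (state numbers are arbitrary).
TRANSITIONS = {
    0: {"C": 1, "L": 9},
    1: {"A": 2, "I": 5},
    2: {"S": 3},
    3: {"A": 4},
    5: {"N": 6},
    6: {"Q": 7},
    7: {"U": 8},
    8: {"E": 4},
    9: {"E": 10},
    10: {"N": 11},
    11: {"T": 12},
    12: {"E": 4},
}
FINAL = {4}


def diagnose(word):
    """Say how the analysis of the word ends."""
    state = 0
    for symbol in word:
        arcs = TRANSITIONS.get(state, {})
        if symbol not in arcs:
            if state in FINAL:
                # Final state reached, but not all symbols consumed.
                return "input left over"
            # Stuck in a non-final state with no arc for the symbol.
            return "no arc"
        state = arcs[symbol]
    # All symbols consumed: success only in a final state.
    return "success" if state in FINAL else "not final"


for word in ["LE", "CASA", "CINQUANTA", "LENTEMENTE"]:
    print(word, "->", diagnose(word))
```

Under these assumptions only CASA succeeds: LE stops in a non-final state, CINQUANTA has no arc for A after CINQU, and LENTEMENTE reaches the final state with MENTE still unread.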
Transducers
• Recognisers either accept or reject a word.
• Although this is useful, networks can
actually return more substantial
information.
• This is achieved by providing networks
with the ability to write as well as to read.
Basic Transducer
• Each transition of a transducer is labelled with a
pair of symbols rather than with a single symbol.
• Analysis proceeds as before, except that input
symbols are matched against the lower-side
symbols on transitions.
• If analysis succeeds, return the string of upper-side symbols on the path to the final state.
Confusing Terminology
• Lower side = surface side.
• Upper side = "deep" side.
• Analysis proceeds from lower to upper.
• Synthesis (generation) proceeds from upper
to lower.
Lexical Transducers
• In common parlance, a transducer is a
device which converts one form of energy
into another, e.g. a microphone converts
from sound to electrical signals.
• Next we look at lexical transducers which
convert one string of symbols into another.
Lexical Transducer Example
Lexical string: C A S A
Surface string: C A S E
• Input: CASE
• Output: CASA
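A minimal sketch of this example in Python, assuming one arc per symbol pair with the upper (lexical) symbol listed first. The ARCS table and the name analyse are made up for illustration. For simplicity the sketch follows the first arc whose lower symbol matches, which is enough for a network with a single path.

```python
# Each arc is (upper symbol, lower symbol, next state).
# Assumed single path pairing lexical CASA with surface CASE.
ARCS = {
    0: [("C", "C", 1)],
    1: [("A", "A", 2)],
    2: [("S", "S", 3)],
    3: [("A", "E", 4)],
}
FINAL = {4}


def analyse(surface):
    """Match the input against lower-side symbols; return the upper-side string or None."""
    state = 0
    upper = []
    for symbol in surface:
        for up, low, nxt in ARCS.get(state, []):
            if low == symbol:
                upper.append(up)  # write the upper-side symbol
                state = nxt       # the input symbol is consumed
                break
        else:
            return None  # no arc whose lower side matches the symbol
    return "".join(upper) if state in FINAL else None


print(analyse("CASE"))
```

Here analyse("CASE") returns "CASA", while analyse("CASA") returns None because the last arc has E, not A, on its lower side.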
Morphological Analysis
[Figure: two-level network. Upper side: C O N T A R E +V +1P +SG. Lower side: C O N T O, with e (epsilon) on the remaining arcs.]
• Input: CONTO
• Output: CONTARE +V +1P +SG
Remarks
• e stands for "epsilon". During analysis,
epsilon transitions are taken freely without
consuming any input.
• Note also single symbols with multi-character print names (e.g. +SG).
• The order of these symbols, and the choice
of infinitive as baseform, is determined by
linguists.
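The following Python sketch shows one way epsilon arcs and multi-character symbols might be handled during analysis. The exact pairing of upper and lower symbols on each arc is an assumption (A is paired with O, and R, E, +V, +1P, +SG with epsilon), as are the names ARCS, EPS and analyse. The search is recursive so that epsilon arcs can be followed without advancing in the input; it assumes the network has no epsilon cycles.

```python
EPS = ""  # stands for epsilon

# Each arc is (upper symbol, lower symbol, next state).
# The alignment below is assumed for illustration.
ARCS = {
    0: [("C", "C", 1)],
    1: [("O", "O", 2)],
    2: [("N", "N", 3)],
    3: [("T", "T", 4)],
    4: [("A", "O", 5)],
    5: [("R", EPS, 6)],
    6: [("E", EPS, 7)],
    7: [("+V", EPS, 8)],    # "+V" is a single symbol with a long print name
    8: [("+1P", EPS, 9)],
    9: [("+SG", EPS, 10)],
}
FINAL = {10}


def analyse(surface, state=0, pos=0, output=()):
    """Return every upper-side symbol sequence whose lower side spells the input."""
    results = []
    if pos == len(surface) and state in FINAL:
        results.append(list(output))
    for up, low, nxt in ARCS.get(state, []):
        new_output = output + ((up,) if up != EPS else ())
        if low == EPS:
            # Epsilon on the lower side: take the arc without consuming input.
            results += analyse(surface, nxt, pos, new_output)
        elif pos < len(surface) and surface[pos] == low:
            results += analyse(surface, nxt, pos + 1, new_output)
    return results


print(analyse("CONTO"))
```

The result is kept as a list of symbols rather than a plain string so that +V, +1P and +SG remain single symbols.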
Exercise
• The word "conto" in Italian is also a
masculine noun meaning (a) story and (b)
bank account.
• Draw the corresponding 2-level networks.
• How can the different meanings be
incorporated into the same network?
Conto +N +SG
[Figure: two-level network. Upper side: C O N T O +N +SG. Lower side: C O N T O, with e (epsilon) paired with +N and +SG.]
• Input: CONTO
• Output: CONTO +N +SG
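One possible way to hold the noun and verb readings in a single network is to let the two paths share the common prefix C O N T and branch afterwards, so that analysis returns every path that matches. The Python sketch below is an illustration only: the arc alignment, the state numbers and the name analyse are assumptions, not the layout of the original networks.

```python
EPS = ""  # stands for epsilon

# Each arc is (upper symbol, lower symbol, next state).
# State 4 branches: the verb path (A:O ...) and the noun path (O:O ...).
ARCS = {
    0: [("C", "C", 1)],
    1: [("O", "O", 2)],
    2: [("N", "N", 3)],
    3: [("T", "T", 4)],
    4: [("A", "O", 5), ("O", "O", 20)],
    5: [("R", EPS, 6)],
    6: [("E", EPS, 7)],
    7: [("+V", EPS, 8)],
    8: [("+1P", EPS, 9)],
    9: [("+SG", EPS, 10)],
    20: [("+N", EPS, 21)],
    21: [("+SG", EPS, 10)],
}
FINAL = {10}


def analyse(surface, state=0, pos=0, output=()):
    """Return all upper-side symbol sequences for the surface string."""
    results = []
    if pos == len(surface) and state in FINAL:
        results.append(list(output))
    for up, low, nxt in ARCS.get(state, []):
        new_output = output + ((up,) if up != EPS else ())
        if low == EPS:
            results += analyse(surface, nxt, pos, new_output)
        elif pos < len(surface) and surface[pos] == low:
            results += analyse(surface, nxt, pos + 1, new_output)
    return results


for reading in analyse("CONTO"):
    print(" ".join(reading))
```

With this network the single input CONTO yields two analyses, one ending in +V +1P +SG and one ending in +N +SG.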
Synthesis
• Transducers are reversible. This means that
the same network can be used to perform the
inverse transduction.
• The process of synthesis is the inverse of
analysis.
The Process of Synthesis
• Start at the start state and at the beginning
of the input string.
• Match the input symbols against the upper-side symbols of the arcs, consuming
symbols until a final state is reached.
• If successful, return the string of lower-side
symbols (else nothing).
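The synthesis steps can be sketched in Python as follows. The input is given as a list of symbols because +V, +1P and +SG each count as one symbol. The network is built from an assumed list of upper:lower pairs for CONTARE +V +1P +SG and CONTO; PAIRS, ARCS and synthesise are names invented for the example.

```python
EPS = ""  # stands for epsilon

# Assumed upper:lower pairs along the single path of the network.
PAIRS = [
    ("C", "C"), ("O", "O"), ("N", "N"), ("T", "T"), ("A", "O"),
    ("R", EPS), ("E", EPS), ("+V", EPS), ("+1P", EPS), ("+SG", EPS),
]
# State i has one arc (upper, lower, next state) leading to state i + 1.
ARCS = {i: [(up, low, i + 1)] for i, (up, low) in enumerate(PAIRS)}
FINAL = {len(PAIRS)}


def synthesise(lexical, state=0, pos=0, output=""):
    """Match the input against upper-side symbols; return all lower-side strings."""
    results = []
    if pos == len(lexical) and state in FINAL:
        results.append(output)
    for up, low, nxt in ARCS.get(state, []):
        # Appending EPS (the empty string) leaves the output unchanged,
        # so epsilon symbols never appear in the result.
        if up == EPS:
            results += synthesise(lexical, nxt, pos, output + low)
        elif pos < len(lexical) and lexical[pos] == up:
            results += synthesise(lexical, nxt, pos + 1, output + low)
    return results


print(synthesise(["C", "O", "N", "T", "A", "R", "E", "+V", "+1P", "+SG"]))
```

An empty result list corresponds to "else nothing" in the last step.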
Morphological Synthesis
[Figure: two-level network. Upper side: C O N T A R E +V +1P +SG. Lower side: C O N T O, with e (epsilon) on the remaining arcs.]
• Input: CONTARE +V +1P +SG
• Output: CONTO
• N.B. e symbols are ignored on output.
Analysis and Synthesis
• Upper Side Language (Lexical Strings).
• Lower Side Language (Surface Strings).
• Transducer maps between the two.
• However large the lexical transducer may
become, analysis and synthesis are
performed by the same language-independent matching techniques.