Finite-State Methods in Natural-Language Processing

Finite-State Methods in
Natural-Language Processing:
1—Motivation
Ronald M. Kaplan
and
Martin Kay
Motivation —
1
Finite-State Methods in Language
Processing
The Application of a branch of mathematics
—The regular branch of automata theory
to a branch of computational linguistics in which what
is crucial is (or can be reduced to)
—Properties of string sets and string relations with
—A notion of bounded dependency
Motivation —
2
Applications
•
Finite Languges
— Dictionaries
— Compression
•
•
Phenomena involving
bounded dependency
— Morpholgy
• Spelling
• Hyphenation
• Tokenization
• Morphological Analysis
— Phonology
Approximations to
phenomena involving
mostly bounded dependency
— Syntax
•
Phenomena that can be
translated into the realm of
strings with bounded
dependency
— Syntax
Motivation —
3
Correspondences
Computational Device
Finite-State Automaton
Descriptive Notation:
Regular Expression
Set of Objects:
Regular Language
Motivation —
4
The Basic Idea
•
•
•
•
At any given moment,an automaton is in one of a
finite number of states
A transition from one state to another is possible
when the automaton contains a corresponding
transition.
The process can stop only when the automaton is in
one of a subset of the states, called final.
Transitions are labeled with symbols so that a
sequence of transitions corresponds to a sequence
of symbols.
Motivation —
5
Bounded Dependency
The choice between γ1 and γ2 depends on a bounded
number of preceding symbols.
γ1
?
γ2
Motivation —
6
Bounded Dependency
The choice between γ1 and γ2 depends on a bounded
number of preceding symbols.
γ1
si
si
?
γ2
irrelevant
Motivation —
7
Closure Properties and Operations
•
By definition
—Union
—Concatenation
—Iteration
•
By deduction
—Intersection
—Complementation
—Substitution
—Reversal
— ...
Motivation —
8
Operations on Languages and
Automata
For the set-theoretic operations on languages there
are corresponding operations on automata.
M(L1 ⊗ L2 ) = M( L1 ) ⊕ M(L2 )
M(L) is a machine that characterizes the language L.
We
Wewill
willuse
usethe
thesame
samesymbols
symbolsfor
forcorresponding
correspondingoperations
operations
Motivation —
9
Automata-based Calculus
•
Closure gives:
—Complementation → Universal quantification
—Intersection → Combinations of constraints
•
Machines give:
—Finite representations for (potentially) infinite sets
—Practical implementation
•
Combination gives:
—Coherence
—Robustness
—Reasonable machine transformations
Motivation —
10
Quantification
Σ * xyΣ * There is an x followed by a y in the string
Σ * xyΣ * There is no xy sequence in the string
Σ * xyΣ * There is a y preceded by something that is not an x
Σ * xyΣ * Every y is preceded by an x.
∃y.∃x. precedes(x, y)
Motivation —
11
Universal Quantification—
i before e except after c
e
Σ * ceiΣ *
ot
he
r
e
i, other
e,
i,
ot
he
r
c
c
c
Motivation —
12
Universal Quantification—
i before e except after c
e
e
After e: no i
ot
he
r
No
an t aft
yth er
ing c o
bu r e:
te
i
i, other
e,
i,
ot
he
r
c
c
After c: anything
c
Motivation —
13
Only e i after c
*
*
*
Σ ceiΣ ∩ Σ cieΣ
e
*
oth
er
e
i,other
c
i,other
e,
ot
he
r
c
c
4
i
c
Motivation —
14
Only e i after c
No
an t aft
yth er
ing c o
bu r e:
te
i
e
e
oth
er
After e: no i
i,other
i,other
After ci: no e
c
e,
ot
he
r
c
After c: not ie
c
4
i
c
Motivation —
15
Alternative Notations
Closure ⇒ Recursive Formalisms ⇒
Higher-level Constructs
L1 ← L2
Σ * ceiΣ *
≡
≡
L1 L2
Σ * c ← eiΣ *
Choose notation for theoretical significance and
practical convenience.
Motivation —
16
What is a Finite-State Automaton?
•
•
•
•
•
An alphabet of symbols,
A finite set of states,
A transition function from states and symbols to
states,
A distinguished member of the set of states called
the start state, and
A distinguished subset of the set of states called
final states.
Pace
Paceterminology,
terminology,same
same definition
definitionas
asfor
for
directed
directedgraphs
graphswith
withlabeled
labelededges,
edges,plus
plus
initial
initialand
andfinal
finalstates.
states.
Motivation —
17
Unless otherwise
marked, the start state is
usually the leftmost in
the diagram
0
v
i to x
x
1
i
i
5
We draw final states
with a double circle
2
i
3
i
4
i
v, x
Motivation —
18
Regular Languages
•
•
•
•
•
Languages — sets of strings
Regular languages — a subset of languages
Closed under concatenation, union, and iteration
Every regular language is chracterized by (at least)
one finite-state automaton
Languages may contain infinitely many strings but
automata are finite
Motivation —
19
Regular Expressions
•
Formulae with operators that denote
—union
—concatenation
—iteration
a*
a*[b
[b| |c]
c]
Any number of a’s followed by either b or c.
Motivation —
20
Some Motivations
•
•
•
Word Recognition
Dictionary Lookup
Spelling Conventions
Motivation —
21
Word Recognition
A word recognizer takes a string of characters as
input and returns “yes” or “no” according as the
word is or is not in a given set.
Solves the membership problem.
e.g. Spell Checking, Scrabble
Motivation —
22
Approximate methods
•
Has right set of letters (any order).
•
Has right sounds (Soundex).
•
Random (suprimposed) coding (Unix Spell)
Word
hash1
hash2
Bit
Table
hashk
Motivation —
23
Exact Methods
•
•
•
Hashing
Search (linear, binary ...)
Digital search (“Tries”)
v
a
k
h
c
a
s
a
b
a
d
r
a
b
k
r
s
n
i
c
u
e
d
g
k
z
z
d
e
s
i
n
Folds together common prefixes
Motivation —
24
g
Exact Methods (continued)
•
Finite-state automata
Folds together common prefixes and suffixes
Motivation —
25
Enumeration vs. Description
•
Enumeration
—Representation includes an item for each object.
Size = f(Items)
•
Description
—Representation provides a characterization of the
set of all items.
Size = g(Common properties, Exceptions)
—Adding item can decrease size.
Motivation —
26
Classification
Exact
Approximate
Enumeration
Hash table
Binary search
Soundex
Description
Trie
FSM
Unix Spell
Right letter
Motivation —
27
FSM Extends to Infinite Sets
Productive compounding
Kindergartensgeselschaft
Motivation —
28
Statistics
English
Portuguese
81,142
858
313
253
206,786
2,389
683
602
FSM
States
29,317
Transitions 67,709
KBytes
203
17,267
45,838
124
Vocabulary
Words
KBytes
PKPAK
PKZIP
From Lucchese and Kowaltowski (1993)
Motivation —
29
Dictionary Lookup
Dictionary lookup takes a string of characters as
input and returns “yes” or “no” according as the
word is or is not in a given set and returns
information about the word.
Motivation —
30
Lookup Methods
Approximate — guess the information
If it ends in “ed”, it’s a past-tense verb.
Exact — store the information for finitely many words
Table Lookup
• Hash
• Search
• Trie —store at word-endings.
FSM
• Store at final states?
No suffix collapse — reverts to Trie.
Motivation —
31
Word Identifiers
Associate a unique, useful, identifier with each of n
words, e.g. an integer from 1 to n. This can be used
to index a vector of dictionary information.
word → i
i
Information
n
Motivation —
32
Pre-order Walk
A pre-order walk of an n-word FSM, counting final
states, assigns such integers, even if suffixes are
collapsed ⇒ Linear Search.
drip →
drips →
drop →
drops →
1
2
3
4
Motivation —
33
Suffix Counts
•
•
Store with each state the size of its suffix set
Skip irrelevant transitions, incrementing count by
destination suffix sizes.
4
4
4
2
drip →
drips →
drop →
drops →
1
0
1
2
3
4
Motivation —
34
•
Minimal Perfect Hash (Lucchesi and Kowaltowski)
•
Word-number mapping (Kaplan and Kay, 1985)
Motivation —
35
Spelling Conventions
iN+tractable → intractable
iN+practical → impractical
iN is the common negative prefix
— im before labial
— in otherwise
c.f. input → input
Motivation —
36
An in/im Transducer
No exit from this
state except over
a labial.
No labials from this state
Motivation —
37
Generation — “intractable”
N
1
i
0
m
0
i
N
t
2
n
r
0
t
a
0
r
a
Motivation —
38
Generation — “impractical”
N
p
1
i
0
m
r
0
p
a
0
r
a
0
i
N
2
n
Motivation —
39
Recognition — “intractable”
n
t
0
i
0
n
r
0
t
a
0
r
a
0
i
N
t
2
n
r
0
t
a
0
r
a
Motivation —
40
Generation — “input”
N
p
1
i
0
m
u
0
p
t
0
u
0
t
0
i
N
2
n
Motivation —
41
A Word Transducer
Finite-State
Transducers
Base Forms
Morphology
Finite-state
Machine
Spelling Rules
Finite-state machine
Text Forms
Motivation —
42
Bibliography
Kaplan, Ronald M. and Martin Kay. “Regular
models of phololgical rule systems.” Computational
Linguistics, 23:3, September 1994.
Hopcroft, John E. and Jeffrey D. Ullman.
Introduction to Automata Theory, Languages and
Computation. Addison Wesley, 1979. Chapters 2
and 3.
Partee, Barbara H., Alice ter Meulen and Robert E.
Wall. Mathematical Methods in Linguistics. Kluwer,
1990. Chapters 16 and 17.
Lucchesi, Cláudio L. and Tomasz Kowaltowski.
“Applications of Finite Automata Representing
Large Vocabularies”. Software Practice and
Esperience. 23:1, January 1993.
Motivation —
43