Intro

Finite State Transducers
Mark Stamp
Finite State Automata
• FSA → states and transitions
o Represented as labeled directed graphs
o An FSA has one label per edge
• States are drawn as circles
o Double circles mark end (accepting) states
• Beginning (start) state
o Denoted by an arrowhead
o Or, sometimes, a bold circle is used
FSA Example
• Nodes are states
• Transitions are (labeled) arrows
• For example…
[Figure: an FSA with states 1, 2, and 3 and transitions labeled a, c, y, and z]
Finite State Transducer
• FST → input and output labels on each edge
o That is, 2 labels per edge
o Can be more labels (e.g., edge weights)
o Recall, an FSA has one label per edge
• An FST is represented as a directed graph
o The same symbols are used as for an FSA
o FSTs may be useful in malware analysis…
Finite State Transducer
• An FST has input and output “tapes”
o Transducer, i.e., it can map input to output
o Often viewed as a “translating” machine
o But somewhat more general
• An FST is a finite automaton with output
o The usual finite automaton only has input
o Used in natural language processing (NLP)
o Also used in many other applications
FST Graphically
• Edges/transitions are (labeled) arrows
o Of the form i : o, that is, input:output (an edge-list encoding sketch follows below)
• Nodes are labeled numerically
• For example…
[Figure: an FST with states 1, 2, and 3 and transitions labeled a:b, c:d, y:q, and z:x]
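One convenient way to hold such a labeled graph in code is as a plain edge list. The Python sketch below is a hypothetical encoding: the state numbers, start/final sets, and edge layout are illustrative and are not meant to reproduce the figure exactly.

# Hypothetical encoding of a small FST as a list of labeled edges.
# Each edge is (source_state, input_symbol, output_symbol, target_state).
fst_edges = [
    (1, "a", "b", 2),   # from state 1, read a, write b, move to state 2
    (2, "c", "d", 3),   # from state 2, read c, write d, move to state 3
    (2, "y", "q", 3),
    (1, "z", "x", 3),
]
start_states = {1}
final_states = {3}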
FST Modes
• As previously mentioned, an FST is usually viewed as a translating machine
• But an FST can operate in several modes
o Generation
o Recognition
o Translation (left-to-right or right-to-left)
• Examples of the modes are considered next…
FST Modes
• Consider this simple example:
[Figure: a single state, labeled 1, with a self-loop labeled a:b]
• Generation mode
o Write an equal number of a and b to the first and second tape, respectively
• Recognition mode
o “Accept” when the 1st tape has the same number of a as the 2nd tape has b
• Translation mode → next slide (a recognition-mode sketch follows below)
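A minimal sketch of recognition mode for this one-state a:b transducer, assuming the tapes are encoded as plain Python strings; the function name is illustrative.

# Recognition mode for a single state with one a:b self-loop: accept exactly
# when tape 1 is all a's, tape 2 is all b's, and the tapes have equal length
# (i.e., the 1st tape has as many a's as the 2nd tape has b's).
def recognize(tape1, tape2):
    only_a = all(symbol == "a" for symbol in tape1)
    only_b = all(symbol == "b" for symbol in tape2)
    return only_a and only_b and len(tape1) == len(tape2)

print(recognize("aaa", "bbb"))   # True
print(recognize("aaa", "bb"))    # False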
FST Modes
• Consider this simple example:
[Figure: a single state, labeled 1, with a self-loop labeled a:b]
• Translation mode
o Left-to-right → for every a read from the 1st tape, write b to the 2nd tape
o Right-to-left → for every b read from the 2nd tape, write a to the 1st tape
• Translation is the mode we usually want to consider (a left-to-right sketch follows below)
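A matching sketch of left-to-right translation mode for the same one-state transducer, again with illustrative names.

# Left-to-right translation with the a:b self-loop: every a read from the
# 1st tape writes a b to the 2nd tape; any other symbol has no transition.
def translate_left_to_right(tape1):
    output = []
    for symbol in tape1:
        if symbol != "a":
            raise ValueError("no matching transition; input rejected")
        output.append("b")
    return "".join(output)

print(translate_left_to_right("aaaa"))   # bbbb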
WFST
• WFST == Weighted FST
o Include a “weight” on each edge
o That is, edges are of the form i : o / w
• Often, probabilities serve as weights…
[Figure: a WFST with states 1, 2, and 3 and transitions labeled a:b/1, c:d/0.6, y:q/1, and z:x/0.4]
FST Example
• Homework…
Operations on FSTs
• Many well-defined operations on FSTs
o Union, intersection, composition, etc.
o These also apply to WFSTs
• Composition is especially interesting
• In a malware context, we might want to…
o Compose detectors for the same family
o Compose detectors for different families
• Why might this be useful?
FST Composition
• Compose 2 FSTs (or WFSTs)
o Suppose the 1st WFST has nodes 1,2,…,n
o Suppose the 2nd WFST has nodes 1,2,…,m
o Possible nodes in the composition are labeled (i,j), for i = 1,2,…,n and j = 1,2,…,m
o Generally, not all of these will appear
• There is an edge from (i1,j1) to (i2,j2) only when the composed labels “match” (next slide…)
FST Composition
• Suppose we have the following labels
o In the 1st WFST, the edge from i1 to i2 is x:y/p
o In the 2nd WFST, the edge from j1 to j2 is w:z/q
• Consider nodes (i1,j1) and (i2,j2) in the composed WFST
o There is an edge between the nodes provided y == w
o I.e., the output from the 1st matches the input to the 2nd
o And the resulting edge label is x:z/pq (a sketch follows below)
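This matching rule is easy to express in code. The sketch below assumes each WFST is given as a list of (src, input, output, weight, dst) edges; epsilon transitions and removal of useless states are ignored.

# Compose two WFSTs given as edge lists. An edge x:y/p from i1 to i2 and an
# edge w:z/q from j1 to j2 combine into an edge x:z/(p*q) from (i1,j1) to
# (i2,j2) exactly when y == w (output of the 1st matches input of the 2nd).
def compose(wfst1, wfst2):
    composed = []
    for (i1, x, y, p, i2) in wfst1:
        for (j1, w, z, q, j2) in wfst2:
            if y == w:
                composed.append(((i1, j1), x, z, p * q, (i2, j2)))
    return composed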
WFST Composition
• Consider the composition of the following two WFSTs…
[Figure: a WFST with states 1, 2, 3, and 4 and transitions labeled a:b/0.1, b:b/0.3, a:b/0.5, a:a/0.6, b:b/0.4, and a:b/0.2]
• And…
[Figure: a WFST with states 1, 2, 3, and 4 and transitions labeled a:b/0.3, b:b/0.1, a:b/0.4, b:a/0.5, and b:a/0.2]
WFST Composition Example
[Figure: the two WFSTs above and their composition, with composed states (1,1), (1,2), (2,2), (3,2), (4,2), (4,3), and (4,4) and edges labeled a:b/.01, a:a/.04, a:b/.24, a:a/.02, b:a/.08, b:a/.06, a:b/.18, and a:a/.1]
WFST Composition
• In the previous example, the composition is…
[Figure: the composed WFST, with states (1,1), (1,2), (2,2), (3,2), (4,2), (4,3), and (4,4) and edges labeled a:b/.01, a:a/.04, a:b/.24, a:a/.02, b:a/.08, b:a/.06, a:b/.18, and a:a/.1]
• But the (4,3) node is useless
o A path must always end in a final state (a pruning sketch follows below)
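A small sketch of pruning such useless nodes: keep only states from which some final state can be reached. The edge encoding matches the composition sketch above, and the set of final states is an assumption supplied by the caller.

# Drop states (like (4,3) above) from which no final state is reachable.
# Edges are (src, inp, out, weight, dst); final_states is a set of accepting states.
def trim_useless(edges, final_states):
    useful = set(final_states)
    changed = True
    while changed:                      # propagate "reaches a final state" backwards
        changed = False
        for (src, _, _, _, dst) in edges:
            if dst in useful and src not in useful:
                useful.add(src)
                changed = True
    return [e for e in edges if e[0] in useful and e[4] in useful]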
FST Approximation of HMM
• Why would we want to approximate an HMM by an FST?
o Faster scoring using the FST
o Easier to correct misclassifications in an FST
o Possible to compose FSTs
o Most important, it’s really cool and fun…
• Downside?
o The FST may be less accurate than the HMM
FST Approximation of HMM
• How to approximate an HMM by an FST?
• We consider 2 methods, known as
o n-type approximation
o s-type approximation
• These usually focus on “problem 2”
o That is, uncovering the hidden states
o This is the usual concern in NLP, such as “part of speech” tagging
n-type Approximation
• Let V be the distinct observations in the HMM
o Let λ = (A,B,π) be a trained HMM
o Recall, A is N x N, B is N x M, π is 1 x N
• Let (input : output / weight) = (Vi : Sj / p)
o Where i ∈ {1,2,…,M} and j ∈ {1,2,…,N}
o And Sj are hidden states (rows of B)
o And the weight is the maximum probability (from λ)
• Examples later…
More n-type Approximations
• Range of n-type approximations
o n0-type → only use the B matrix
o n1-type → see the previous slide
o n2-type → for a 2nd order HMM
o n3-type → for a 3rd order HMM, and so on
• What is a 2nd order HMM?
o Transitions depend on the 2 previous states
o In a 1st order HMM, transitions only depend on the previous state
s-type Approximation
• “Sentence type” approximation
• Use sequences and/or natural breaks
o In n-type, the maximum probability is taken over one transition, using the A and B matrices
o In s-type, over all sequences up to some length
• Ideally, break at boundaries of some sort
o In NLP, a sentence is such a boundary
o For malware, it is not so clear where to break
o So in malware, maybe just use a fixed length
HMM to FST
• An exact representation is also possible
o That is, an FST that is the “same” as the HMM
• Given a model λ = (A,B,π)
• One node for each (input : output) = (Vi : Sj)
o Edge from each node to all other nodes…
o …including a loop to the same node
o Edges are labeled with the target node
o Weights are computed from the probabilities in λ
HMM to FST
• Note that some probabilities may be 0
o Remove edges with 0 probabilities
• A lot of probabilities may be small
o So, maybe approximate by removing edges with “small” probabilities? (a sketch follows below)
o Could be an interesting experiment…
o A reasonable way to approximate an HMM that does not seem to have been studied
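A sketch of that thresholding experiment, again over the assumed (src, inp, out, weight, dst) edge encoding; the cutoff value here is arbitrary. The result could then be trimmed with the useless-state pruning shown earlier, since dropping edges may strand some states.

# Approximate an HMM-derived WFST by discarding edges with "small" weights.
def prune_light_edges(edges, cutoff=0.01):
    return [edge for edge in edges if edge[3] >= cutoff]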
HMM Example
• Suppose we have 2 coins
o 1 coin is fair and 1 is unfair
o Roll a die to decide which coin to flip
o We see the resulting sequence of H and T
o We do not know which coin was flipped…
o …and we do not see the roll of the die
• Observations?
• Hidden states?
HMM Example
• Suppose the probabilities are as given
o Then what is λ = (A,B,π)?
[Figure: hidden states fair and unfair, with transition probabilities 0.9 (fair→fair), 0.1 (fair→unfair), 0.8 (unfair→fair), and 0.2 (unfair→unfair); observation probabilities 0.5/0.5 for H/T from the fair coin and 0.7/0.3 for H/T from the unfair coin]
HMM Example
• The HMM is given by λ = (A,B,π), where
A = [ 0.9  0.1 ]     B = [ 0.5  0.5 ]     π = [ 1.0  0.0 ]
    [ 0.8  0.2 ]         [ 0.7  0.3 ]
• This π implies we start in the F (fair) state
o Also, state 1 is F and state 2 is U (unfair), and the columns of B correspond to H and T
• Suppose we observe HHTHT
o Then the probability of, say, FUFFU is
π_F b_F(H) a_FU b_U(H) a_UF b_F(T) a_FF b_F(H) a_FU b_U(T)
= 1.0(0.5)(0.1)(0.7)(0.8)(0.5)(0.9)(0.5)(0.1)(0.3) = 0.000189
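A quick check of this arithmetic in Python; the dictionaries simply transcribe the A, B, and π given above.

# Probability of the state sequence FUFFU for the observation sequence HHTHT.
A  = {("F","F"): 0.9, ("F","U"): 0.1, ("U","F"): 0.8, ("U","U"): 0.2}
B  = {("F","H"): 0.5, ("F","T"): 0.5, ("U","H"): 0.7, ("U","T"): 0.3}
pi = {"F": 1.0, "U": 0.0}

def sequence_probability(states, observations):
    prob = pi[states[0]] * B[(states[0], observations[0])]
    for t in range(1, len(states)):
        prob *= A[(states[t - 1], states[t])] * B[(states[t], observations[t])]
    return prob

print(sequence_probability("FUFFU", "HHTHT"))   # 0.000189 (up to float rounding)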
HMM Example
• We have A, B, and π as on the previous slide, and we observe HHTHT
o Probabilities are in the table

state    score     probability
FFFFF   .020503   .664086
FFFFU   .001367   .044272
FFFUF   .002835   .091824
FFFUU   .000425   .013774
FFUFF   .001215   .039353
FFUFU   .000081   .002624
FFUUF   .000387   .012243
FFUUU   .000057   .001836
FUFFF   .002835   .091824
FUFFU   .000189   .006122
FUFUF   .000392   .012697
FUFUU   .000059   .001905
FUUFF   .000378   .012243
FUUFU   .000025   .000816
FUUUF   .000118   .003809
FUUUU   .000018   .000571
HMM Example
• From the table on the previous slide, the most likely state sequence is
o FFFFF
o This solves problem 2
• Problem 1, scoring?
o Next slide
• Problem 3?
o Not relevant here
HMM Example
• How to score the sequence HHTHT?
• Sum over all state sequences
o Sum the “score” column in the table above:
P(HHTHT) = .030874
o The forward algorithm is far more efficient (a sketch follows below)
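A sketch of the forward algorithm for this model; it reproduces the .030874 obtained by summing the score column. A, B, and π are repeated from the earlier sketch so the snippet runs on its own.

# Forward algorithm: score an observation sequence in O(N^2 * T) operations
# instead of summing over all N^T hidden state sequences.
A  = {("F","F"): 0.9, ("F","U"): 0.1, ("U","F"): 0.8, ("U","U"): 0.2}
B  = {("F","H"): 0.5, ("F","T"): 0.5, ("U","H"): 0.7, ("U","T"): 0.3}
pi = {"F": 1.0, "U": 0.0}
states = ("F", "U")

def forward(observations):
    alpha = {s: pi[s] * B[(s, observations[0])] for s in states}
    for obs in observations[1:]:
        alpha = {s: B[(s, obs)] * sum(alpha[r] * A[(r, s)] for r in states)
                 for s in states}
    return sum(alpha.values())

print(round(forward("HHTHT"), 6))   # 0.030874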
n-type Approximation
• Consider the 2-coin HMM with A, B, and π as above
• For each observation, only include the most probable hidden state
o So, the only possible FST labels in this case are…
H:F/w1, H:U/w2, T:F/w3, T:U/w4
o Where the weights wi are probabilities
n-type Approximation
• Consider this example
• For each observation, take the most probable state
o The weight is the corresponding probability (a sketch of this construction follows below)
[Figure: the n-type FST for the 2-coin HMM, with states 1, 2, and 3 and edges labeled H:F/0.5 and T:F/0.5 (from the start state) and H:F/0.45 and T:F/0.45 (thereafter)]
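One reading of the n1-type label construction in Python, reusing the coin-HMM values; treating the node we are leaving as the “previous” hidden state is an assumption of this sketch. It reproduces the H:F/0.5 and H:F/0.45 labels in the figure.

# n1-type label for observation obs when the previous hidden state is prev;
# prev is None at the start, where pi is used in place of a row of A.
A  = {("F","F"): 0.9, ("F","U"): 0.1, ("U","F"): 0.8, ("U","U"): 0.2}
B  = {("F","H"): 0.5, ("F","T"): 0.5, ("U","H"): 0.7, ("U","T"): 0.3}
pi = {"F": 1.0, "U": 0.0}

def n_type_label(prev, obs):
    scores = {s: (pi[s] if prev is None else A[(prev, s)]) * B[(s, obs)]
              for s in ("F", "U")}
    best = max(scores, key=scores.get)        # most probable hidden state
    return f"{obs}:{best}/{scores[best]}"

print(n_type_label(None, "H"))   # H:F/0.5   (from the start state)
print(n_type_label("F", "H"))    # H:F/0.45  (after a fair-coin state)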
n-type Approximation
• Suppose instead we have different A, B, and π…
• Most probable state for each observation?
o The weight is the corresponding probability
[Figure: the resulting n-type FST, with states 1, 2, 3, and 4 and edges labeled H:U/0.35, H:U/0.42, T:F/0.20, T:F/0.25, T:F/0.30, H:F/0.30, and T:F/0.30]
HMM as FST
• Consider a 2-coin HMM with some A, B, and π
• Then the FST nodes correspond to…
o The initial state
o Heads from the fair coin (H:F)
o Tails from the fair coin (T:F)
o Heads from the unfair coin (H:U)
o Tails from the unfair coin (T:U)
HMM as FST
• Suppose the HMM is specified by some A, B, and π
• Then the FST is…
[Figure: the exact FST, with an initial state and one state for each of H:F, T:F, H:U, and T:U; every non-initial state has an edge to every other (including a self-loop), each labeled with its target’s input:output pair]
HMM as FST
• This FST is boring and not very useful
o Weights make it a little more interesting
• Computing the edge weights is homework…
[Figure: the same FST as on the previous slide]
Why Consider FSTs?
• An FST can be used as a “translating machine”
• Well-defined operations on FSTs
o Composition is an interesting example
• Can convert an HMM to an FST
o Either exactly or as an approximation
o Approximations may be much simpler, but might not be as accurate
• Advantages of an FST over an HMM?
Why Consider FSTs?
• Scoring/translating is faster with an FST
• Able to compose multiple FSTs
o Where the FSTs may be derived from HMMs
• One idea…
o Train multiple HMMs on malware (same family and/or different families)
o Convert each HMM to an FST
o Compose the resulting FSTs
Bottom Line
• Can we get the best of both worlds?
o Fast scoring and composition with FSTs
o Simplify/approximate HMMs via FSTs
o Tweak the FST to improve scoring
o Efficient training using HMMs
• Other possibilities?
o Directly compute an FST without an HMM
o Or use an FST as a first pass (e.g., disassembly?)
References
• A. Kempe, Finite state transducers approximating hidden Markov models
• J. R. Novak, Weighted finite state transducers: Important algorithms
• K. Striegnitz, Finite state transducers