Finite State Transducers
Mark Stamp
Finite State Automata
An FSA consists of states and transitions
o Represented as a labeled directed graph
o An FSA has one label per edge
States are drawn as circles
o Double circles are used for end states
The beginning state
o Denoted by an arrowhead
o Or, sometimes, a bold circle is used
FSA Example
Nodes are states
Transitions are (labeled) arrows
For example…
[Figure: an example FSA with states 1, 2, 3 and edges labeled a, c, z, y]
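As a concrete illustration, here is a minimal recognition sketch in Python. The dictionary encoding and the small example graph are my own assumptions, since the figure's exact topology is not recoverable from these notes.

# A minimal FSA-recognition sketch; the transition table below is a
# hypothetical example, not the graph from the figure
transitions = {(1, "a"): 2, (2, "c"): 3, (2, "z"): 2, (3, "y"): 1}
start, accepting = 1, {3}

def accepts(string):
    """Follow labeled edges; accept iff we end in an accepting state."""
    state = start
    for symbol in string:
        if (state, symbol) not in transitions:
            return False  # no edge with this label from current state
        state = transitions[(state, symbol)]
    return state in accepting

print(accepts("ac"))    # True:  1 -a-> 2 -c-> 3
print(accepts("azzc"))  # True:  the z edge loops on state 2
print(accepts("ay"))    # False: no y edge out of state 2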
Finite State Transducer
An FST has input and output labels on each edge
o That is, 2 labels per edge
o Can be more labels (e.g., edge weights)
o Recall, an FSA has one label per edge
An FST is represented as a directed graph
o And the same symbols are used as for an FSA
o FSTs may be useful in malware analysis…
Finite State Transducer
An FST has input and output “tapes”
o Transducer, i.e., can map input to output
o Often viewed as a “translating” machine
o But somewhat more general
An FST is a finite automaton with output
o The usual finite automaton only has input
o Used in natural language processing (NLP)
o Also used in many other applications
FST Graphically
Edges/transitions are (labeled) arrows
o Of the form i : o, that is, input:output
Nodes are labeled numerically
For example…
[Figure: an example FST with states 1, 2, 3 and edges labeled a:b, c:d, z:x, y:q]
FST Modes
As previously mentioned, an FST is usually viewed as a translating machine
But FST can operate in several modes
o Generation
o Recognition
o Translation (left-to-right or right-to-left)
Examples of modes considered next…
FST Modes
Consider this simple example:
[Figure: a one-state FST, with a self-loop on state 1 labeled a:b]
Generation mode
o Write an equal number of a and b to the first and second tapes, respectively
Recognition mode
o “Accept” when the 1st tape has the same number of a as the 2nd tape has b
Translation mode on the next slide
FST Modes
Consider this simple example:
[Figure: the same one-state FST, with a self-loop on state 1 labeled a:b]
Translation mode
o Left-to-right: for every a read from the 1st tape, write b to the 2nd tape
o Right-to-left: for every b read from the 2nd tape, write a to the 1st tape
Translation is the mode we usually want to consider
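A minimal sketch of left-to-right translation for this one-state example, with the single a:b self-loop hard-coded; the function name and representation are mine, not from the slides.

def translate_left_to_right(input_tape):
    """Map each input symbol through the a:b self-loop on state 1."""
    output_tape = []
    for symbol in input_tape:
        if symbol != "a":
            raise ValueError("no transition for symbol " + symbol)
        output_tape.append("b")  # take the a:b edge, stay in state 1
    return "".join(output_tape)

print(translate_left_to_right("aaa"))  # bbb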
WFST
WFST == Weighted FST
o Include a “weight” on each edge
o That is, edges are of the form i : o / w
Often, probabilities serve as weights…
[Figure: an example WFST with states 1, 2, 3 and edges labeled a:b/1, c:d/0.6, z:x/0.4, y:q/1]
FST Example
Homework…
Operations on FSTs
Many well-defined operations on FSTs
o Union, intersection, composition, etc.
o These also apply to WFSTs
Composition is especially interesting
In a malware context, we might want to…
o Compose detectors for the same family
o Compose detectors for different families
Why might this be useful?
FST Composition
Compose 2 FSTs (or WFSTs)
o Suppose the 1st WFST has nodes 1,2,…,n
o Suppose the 2nd WFST has nodes 1,2,…,m
o Possible nodes in the composition are labeled (i,j), for i = 1,2,…,n and j = 1,2,…,m
o Generally, not all of these will appear
Edge from (i1,j1) to (i2,j2) only when the composed labels “match” (next slide…)
FST Composition
Suppose we have the following labels
o In the 1st WFST, the edge from i1 to i2 is x:y/p
o In the 2nd WFST, the edge from j1 to j2 is w:z/q
Consider nodes (i1,j1) and (i2,j2) in the composed WFST
o There is an edge between these nodes provided y == w
o I.e., the output from the 1st matches the input to the 2nd
o And the resulting edge label is x:z/pq
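The matching rule above is easy to state in code. Below is a minimal sketch, assuming each WFST is encoded simply as a list of (src, dst, input, output, weight) edges; that encoding, and the tiny usage example, are my own conventions.

def compose(wfst1, wfst2):
    """Compose two WFSTs given as edge lists.

    Edge (i1, i2, x, y, p) in the 1st WFST matches edge
    (j1, j2, w, z, q) in the 2nd WFST when y == w, giving the
    composed edge ((i1, j1), (i2, j2), x, z, p * q).
    """
    composed = []
    for (i1, i2, x, y, p) in wfst1:
        for (j1, j2, w, z, q) in wfst2:
            if y == w:  # output of the 1st matches input to the 2nd
                composed.append(((i1, j1), (i2, j2), x, z, p * q))
    return composed

# Two single-edge WFSTs; these edges appear in the example that follows
t1 = [(1, 2, "a", "b", 0.1)]
t2 = [(1, 2, "b", "b", 0.1)]
print(compose(t1, t2))  # edge (1,1) -> (2,2), label a:b, weight 0.01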
WFST Composition
Consider composition of the WFSTs
[Figure: a WFST with states 1, 2, 3, 4 and edges labeled a:b/0.1, b:b/0.3, a:b/0.5, a:a/0.6, b:b/0.4, a:b/0.2]
And…
[Figure: a WFST with states 1, 2, 3, 4 and edges labeled b:b/0.1, a:b/0.3, a:b/0.4, b:a/0.5, b:a/0.2]
WFST Composition Example
[Figure: the two WFSTs above and their composition, which has nodes (1,1), (1,2), (2,2), (3,2), (4,2), (4,3), (4,4) and edges labeled a:b/.01, a:a/.04, a:a/.02, a:b/.24, b:a/.08, b:a/.06, a:b/.18, a:a/.1]
WFST Composition
In the previous example, the composition is…
[Figure: the composed WFST, with nodes (1,1), (1,2), (2,2), (3,2), (4,2), (4,3), (4,4) and edges labeled a:b/.01, a:a/.04, a:a/.02, a:b/.24, b:a/.08, b:a/.06, a:b/.18, a:a/.1]
But the (4,3) node is useless
o Must always end in a final state
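Trimming such useless nodes is a simple reachability check. A sketch, assuming the same edge-list encoding as before and an explicit set of final states (the slides' figures do not mark finals, so that set is an assumption). If, say, (4,4) is the only final state and (4,3) has no outgoing edges, trim() drops (4,3) and the edges into it.

def trim(edges, finals):
    """Drop states from which no final state is reachable."""
    # Build reverse adjacency: dst -> set of predecessor states
    rev = {}
    for (src, dst, *_rest) in edges:
        rev.setdefault(dst, set()).add(src)
    # Walk backwards from the final states
    useful, stack = set(finals), list(finals)
    while stack:
        node = stack.pop()
        for pred in rev.get(node, ()):
            if pred not in useful:
                useful.add(pred)
                stack.append(pred)
    # Keep only edges whose endpoints can both reach a final state
    return [e for e in edges if e[0] in useful and e[1] in useful]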
FST Approximation of HMM
Why would we want to approximate an HMM by an FST?
o Faster scoring using the FST
o Easier to correct misclassifications in an FST
o Possible to compose FSTs
o Most important, it’s really cool and fun…
Downside?
o The FST may be less accurate than the HMM
FST Approximation of HMM
How to approximate an HMM by an FST?
We consider 2 methods, known as
o n-type approximation
o s-type approximation
These are usually focused on “problem 2”
o That is, uncovering the hidden states
o This is the usual concern in NLP, such as “part of speech” tagging
n-type Approximation
Let V be the distinct observations in the HMM
o Let λ = (A,B,π) be a trained HMM
o Recall, A is N x N, B is N x M, π is 1 x N
Let (input : output / weight) = (Vi : Sj / p)
o Where i ∈ {1,2,…,M} and j ∈ {1,2,…,N}
o And the Sj are hidden states (rows of B)
o And the weight is a max probability (from λ)
Examples later…
More n-type Approximations
Range of n-type approximations
o n0-type: only use the B matrix
o n1-type: see the previous slide
o n2-type: for a 2nd order HMM
o n3-type: for a 3rd order HMM, and so on
What is a 2nd order HMM?
o Transitions depend on 2 consecutive states
o In 1st order, transitions only depend on the previous state
s-type Approximation
“Sentence type” approximation
Use sequences and/or natural breaks
o In n-type, max probability over one transition, using the A and B matrices
o In s-type, over all sequences up to some length
Ideally, break at boundaries of some sort
o In NLP, a sentence is such a boundary
o For malware, not so clear where to break
o So in malware, maybe just use a fixed length
HMM to FST
Exact representation is also possible
o That is, an FST that is the “same” as the HMM
Given a model λ = (A,B,π)
Nodes for each (input : output) = (Vi : Sj)
o Edge from each node to all other nodes…
o …including a loop to the same node
o Edges labeled with the target node
o Weights computed from probabilities in λ
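A sketch of this construction, assuming λ is given as row-stochastic lists A and B plus π. The weight on an edge into node (Vk : Sl) is taken to be a(Sj,Sl)·b(Sl,Vk), with π used for edges out of a distinguished start node; that weight formula is my reading of "weights computed from probabilities in λ", so treat it as an assumption.

def hmm_to_fst(A, B, pi, obs_names, state_names):
    """Exact FST for an HMM: one node per (observation : state) pair."""
    nodes = [(v, s) for v in range(len(obs_names))
                    for s in range(len(state_names))]
    edges = []
    # Edges out of a distinguished start node use pi (assumed here)
    for (vk, sl) in nodes:
        w = pi[sl] * B[sl][vk]
        if w > 0:  # remove edges with 0 probability
            edges.append(("start", (vk, sl),
                          f"{obs_names[vk]}:{state_names[sl]}", w))
    # Edges between nodes, labeled with the target node's pair
    for (vi, sj) in nodes:
        for (vk, sl) in nodes:
            w = A[sj][sl] * B[sl][vk]
            if w > 0:
                edges.append(((vi, sj), (vk, sl),
                              f"{obs_names[vk]}:{state_names[sl]}", w))
    return edges

Dropping edges whose weight falls below a threshold, rather than only exactly-0 edges, gives the pruning approximation suggested on the next slide.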
HMM to FST
Note that some probabilities may be 0
o Remove edges with 0 probabilities
A lot of probabilities may be small
o So, maybe approximate by removing edges with “small” probabilities?
o Could be an interesting experiment…
o A reasonable way to approximate an HMM that does not seem to have been studied
HMM Example
Suppose we have 2 coins
o 1 coin is fair and 1 is unfair
o Roll a die to decide which coin to flip
o We see the resulting sequence of H and T
o We do not know which coin was flipped…
o …and we do not see the roll of the die
Observations? Hidden states?
HMM Example
Suppose the probabilities are as given
o Then what is λ = (A,B,π) ?
[Figure: hidden states fair and unfair, with transitions fair→fair 0.9, fair→unfair 0.1, unfair→fair 0.8, unfair→unfair 0.2; observations H and T, emitted by the fair coin with probabilities 0.5 and 0.5, and by the unfair coin with probabilities 0.7 and 0.3]
HMM Example
The HMM is given by λ = (A,B,π), where
A = [0.9 0.1; 0.8 0.2]   B = [0.5 0.5; 0.7 0.3]   π = [1.0 0.0]
This π implies we start in the F (fair) state
o Also, state 1 is F and state 2 is U (unfair), and the columns of B correspond to H and T
Suppose we observe HHTHT
o Then the probability of, say, FUFFU is
π_F b_F(H) a_FU b_U(H) a_UF b_F(T) a_FF b_F(H) a_FU b_U(T)
= 1.0(0.5)(0.1)(0.7)(0.8)(0.5)(0.9)(0.5)(0.1)(0.3) = 0.000189
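This product is easy to check in code. A quick sketch, with the matrices above encoded as dictionaries (the encoding is my own):

# The 2-coin HMM from the slides, as dictionaries
A  = {("F","F"): 0.9, ("F","U"): 0.1, ("U","F"): 0.8, ("U","U"): 0.2}
B  = {("F","H"): 0.5, ("F","T"): 0.5, ("U","H"): 0.7, ("U","T"): 0.3}
pi = {"F": 1.0, "U": 0.0}

def path_probability(states, observations):
    """Probability of a hidden-state path and an observation sequence."""
    p = pi[states[0]] * B[(states[0], observations[0])]
    for t in range(1, len(states)):
        p *= A[(states[t-1], states[t])] * B[(states[t], observations[t])]
    return p

print(path_probability("FUFFU", "HHTHT"))  # 0.000189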
HMM Example
We have
A = [0.9 0.1; 0.8 0.2]   B = [0.5 0.5; 0.7 0.3]   π = [1.0 0.0]
And observe HHTHT
o Probabilities in table

state   score    probability
FFFFF   .020503  .664086
FFFFU   .001367  .044272
FFFUF   .002835  .091824
FFFUU   .000425  .013774
FFUFF   .001215  .039353
FFUFU   .000081  .002624
FFUUF   .000387  .012243
FFUUU   .000057  .001836
FUFFF   .002835  .091824
FUFFU   .000189  .006122
FUFUF   .000392  .012697
FUFUU   .000059  .001905
FUUFF   .000378  .012243
FUUFU   .000025  .000816
FUUUF   .000118  .003809
FUUUU   .000018  .000571
HMM Example
So, the most likely state sequence is…
o FFFFF (from the table on the previous slide)
o This solves problem 2
Problem 1, scoring?
o Next slide
Problem 3?
o Not relevant here
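The whole table can be reproduced by brute force, reusing path_probability() from the earlier sketch; note that only paths starting in F score above 0, since π forces the fair state first.

# Score every length-5 state path against HHTHT
from itertools import product

obs = "HHTHT"
scores = {"".join(path): path_probability(path, obs)
          for path in product("FU", repeat=len(obs))}
total = sum(scores.values())        # .030874, as on the next slide
best = max(scores, key=scores.get)  # 'FFFFF', solving problem 2
print(best, scores[best], scores[best] / total)  # FFFFF .020503 .664086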
HMM Example
How to score the sequence HHTHT ?
Sum over all state sequences
o Sum the “score” column in the table:
P(HHTHT) = .030874
o The forward algorithm is way more efficient (a sketch follows)
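A sketch of the forward algorithm for this example, again using the A, B, pi dictionaries defined earlier; it reproduces P(HHTHT) = .030874 with work linear in the sequence length rather than exponential in it.

def forward_score(observations, states=("F", "U")):
    """P(observations) via the forward algorithm."""
    # Initialization: alpha_1(s) = pi(s) * b_s(o_1)
    alpha = {s: pi[s] * B[(s, observations[0])] for s in states}
    # Induction: alpha_t(s) = (sum_r alpha_{t-1}(r) * a_rs) * b_s(o_t)
    for o in observations[1:]:
        alpha = {s: sum(alpha[r] * A[(r, s)] for r in states) * B[(s, o)]
                 for s in states}
    # Termination: sum over the final alphas
    return sum(alpha.values())

print(forward_score("HHTHT"))  # 0.030874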
n-type Approximation
Consider the 2-coin HMM with
A = [0.9 0.1; 0.8 0.2]   B = [0.5 0.5; 0.7 0.3]   π = [1.0 0.0]
For each observation, only include the most probable hidden state
o So, the only possible FST labels in this case are…
H:F/w1, H:U/w2, T:F/w3, T:U/w4
o Where the weights wi are probabilities
n-type Approximation
Consider the example
A = [0.9 0.1; 0.8 0.2]   B = [0.5 0.5; 0.7 0.3]   π = [1.0 0.0]
For each observation, include only the most probable state
o Weight is the corresponding probability
[Figure: the resulting n1-type FST, with start node 1 and nodes 2 and 3; from node 1, edges labeled H:F/0.5 and T:F/0.5; among nodes 2 and 3, edges labeled H:F/0.45 and T:F/0.45]
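A sketch of the n1-type construction as I read it from these slides: from the start node the weight is the max of π(s)·b_s(o), and from a node tagged with state r it is the max over s of a(r,s)·b_s(o). The node-naming convention is mine. For this HMM every argmax lands on F, so the U-tagged edges the code also emits are unreachable, which is why the figure shows only F nodes.

def n1_type_edges(states=("F", "U"), observations=("H", "T")):
    """One FST edge per (source tag, observation): the best next state."""
    edges = []
    for o in observations:
        # Best state when o is the first observation (uses pi)
        s = max(states, key=lambda s: pi[s] * B[(s, o)])
        edges.append(("start", s, f"{o}:{s}", pi[s] * B[(s, o)]))
        # Best next state from each possible current state r (uses A)
        for r in states:
            s = max(states, key=lambda s: A[(r, s)] * B[(s, o)])
            edges.append((r, s, f"{o}:{s}", A[(r, s)] * B[(s, o)]))
    return edges

for edge in n1_type_edges():
    print(edge)  # includes ('start','F','H:F',0.5) and ('F','F','H:F',0.45)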
n-type Approximation
Suppose instead we have a different λ = (A,B,π)
Most probable state for each observation?
o Weight is the corresponding probability
[Figure: the resulting n1-type FST, with nodes 1, 2, 3, 4 and edges labeled H:U/0.42, H:U/0.35, T:F/0.20, T:F/0.25, T:F/0.30, T:F/0.30, H:F/0.30, H:F/0.30]
HMM as FST
Consider the 2-coin HMM where
A = [0.9 0.1; 0.8 0.2]   B = [0.5 0.5; 0.7 0.3]   π = [1.0 0.0]
Then the FST nodes correspond to…
o The initial state
o Heads from the fair coin (H:F)
o Tails from the fair coin (T:F)
o Heads from the unfair coin (H:U)
o Tails from the unfair coin (T:U)
HMM as FST
Suppose the HMM is specified by
A = [0.9 0.1; 0.8 0.2]   B = [0.5 0.5; 0.7 0.3]   π = [1.0 0.0]
Then the FST is…
[Figure: an FST with initial node 1 and one node for each of H:F, T:F, H:U, T:U; every node has an edge to each of the four (input : output) nodes, and each edge is labeled with its target node’s pair]
HMM as FST
This FST is boring and not very useful
o Weights make it a little more interesting
Computing the edge weights is homework…
[Figure: the same FST as on the previous slide]
Why Consider FSTs?
An FST is used as a “translating machine”
Well-defined operations on FSTs
o Composition is an interesting example
Can convert an HMM to an FST
o Either exact or an approximation
o Approximations may be much simplified, but might not be as accurate
Advantages of FST over HMM?
Why Consider FSTs?
Scoring/translating is faster with an FST
Able to compose multiple FSTs
o Where the FSTs may be derived from HMMs
One idea…
o Multiple HMMs trained on malware (same family and/or different families)
o Convert each HMM to an FST
o Compose the resulting FSTs
Bottom Line
Can we get the best of both worlds?
o Fast scoring, composition with FSTs
o Simplify/approximate HMMs via FSTs
o Tweak the FST to improve scoring
o Efficient training using HMMs
Other possibilities?
o Directly compute an FST without an HMM
o Or an FST as a first pass (e.g., disassembly?)
References
A. Kempe, Finite state transducers approximating hidden Markov models
J. R. Novak, Weighted finite state transducers: Important algorithms
K. Striegnitz, Finite state transducers