
Artificial Intelligence
Markov Chains
Stephan Dreiseitl
FH Hagenberg
Software Engineering & Interactive Media
Overview
Uncertain reasoning in time
Using Markov chains for simulations
Hidden Markov models
State estimation
Most probable path estimation
Application: speech recognition
Uncertain reasoning in time
Want to model systems that change through time, in
some non-deterministic manner
Use stochastic processes: Collections of random variables
X0 , X1 , . . . that take on values in some state space S
One collection of random variables for each quantity of interest
Current random variable value (state) depends on
previous states
Easier to model if states are observable, otherwise use
hidden Markov models
Example: Wumpus world
Wumpus world is static, only agent moves
Assume that agent does not reason logically, but moves
randomly
Current agent position Xt depends on previous positions
Xt−1 , . . . , X1
I.e., current position is given by P(Xt | Xt−1 , . . . , X1 )
How to calculate this?
Simplifying assumptions
Markov assumption: Current state depends on finite
number of previous states
First-order Markov process (chain):
P(Xt | Xt−1 , . . . , X0 ) = P(Xt | Xt−1 )
Second-order Markov process (chain):
P(Xt | Xt−1 , . . . , X0 ) = P(Xt | Xt−1 , Xt−2 )
Stationarity assumption: Transition probabilities
P(Xt | parents(Xt )) do not depend on t
Markov chains as Bayesian networks
First-order Markov chain:

[Diagram: chain Xt−2 → Xt−1 → Xt → Xt+1, with arcs labeled
. . . , P(Xt−1 | Xt−2), P(Xt | Xt−1), P(Xt+1 | Xt), . . .]

Second-order Markov chain:

[Diagram: same nodes Xt−2, . . . , Xt+1, but each state has two incoming arcs,
e.g. Xt+1 with parents Xt and Xt−1, labeled . . . , P(Xt+1 | Xt, Xt−1), . . .]
Using Markov chains for simulations
Express transition probabilities P(Xt = si | Xt−1 = sj) in a matrix A with entries Aij (columns always sum to 1)
Equilibrium distribution of Markov chain: Distribution
the chain converges to, i.e. limt→∞ P(Xt )
Problem: Sometimes hard to generate random values for
complicated distributions (e.g., Bayesian networks)
Solution: Construct Markov chain with desired
distribution as equilibrium distribution, random values
are samples of Markov chain
Simple simulation example
States S = {s1, s2}, state transition matrix

A = ( 1/2  1/4 )
    ( 1/2  3/4 )

[Diagram: two-state transition graph, unrolled over time steps Xt−2, Xt−1, Xt, Xt+1:
from s1, stay in s1 or move to s2 with probability 1/2 each;
from s2, move to s1 with probability 1/4, stay in s2 with probability 3/4]

With matrix algebra and conditional probabilities, can show that P(Xt+1) = A^t P(X1) and

lim t→∞ A^t = ( 1/3  1/3 )
              ( 2/3  2/3 )
Simple simulation example (cont.)
Obtain equilibrium distribution P(Xt) = (1/3, 2/3) for arbitrary initial distribution P(X1)

Verify numerically: relative frequencies of states s1 and s2 with start state s1 (left) and s2 (right)

[Two plots: relative state frequencies over 1000 steps, converging to 1/3 (s1) and 2/3 (s2) from either start state]
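This check is easy to reproduce by sampling the chain directly. A minimal sketch in Python (assuming NumPy; state 0 stands for s1, state 1 for s2, and the seed is arbitrary):

```python
import numpy as np

# Column-stochastic transition matrix: A[i, j] = P(next = s_i | current = s_j)
A = np.array([[0.5, 0.25],
              [0.5, 0.75]])

def simulate(A, start, steps, seed=0):
    """Sample a trajectory and return the relative frequency of each state."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(A.shape[0])
    state = start
    for _ in range(steps):
        counts[state] += 1
        state = rng.choice(A.shape[0], p=A[:, state])  # column = current state
    return counts / steps

print(simulate(A, start=0, steps=1000))  # roughly [0.33, 0.67]
print(simulate(A, start=1, steps=1000))  # same limit; start state does not matter
```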
Calculating state probabilities
Consider stochastic wumpus world example: States are
positions on 4 × 4 board
Distinguish between r.v. Xt, state constants sj, and variables St denoting the state at time t
For brevity, also write St to denote Xt = St
Marginalize to obtain probability P(Xt = g ) of reaching
gold at time step t as
P(Xt = g) = Σ P(Xt = g ∧ (S1, . . . , St−1))

with the sum over all state sequences (S1, . . . , St−1) that lead to g
Calculating state probabilities (cont.)
State transitions of agent are first-order Markov process:
With known first state (i.e., P(X1 = si ) = 1) obtain
P(Xt = g ∧ (S1 , . . . , St−1 ))
= P(Xt = g | St−1 )P(St−1 | St−2 ) · · · P(S2 | S1 )
Therefore, the calculation

P(Xt = g) = Σ P(Xt = g | St−1) · · · P(S2 | S1)

(again summing over all state sequences (S1, . . . , St−1) that lead to g) grows exponentially with t (bad)
Improving calculations
To simplify notation, write pt (i) = P(Xt = si )
Idea to reduce complexity to polynomial (good):

p1(i) = 1 if si is start state, 0 otherwise

pt+1(i) = P(Xt+1 = si) = Σj P(Xt+1 = si ∧ Xt = sj)
        = Σj P(Xt+1 = si | Xt = sj) P(Xt = sj)
        = Σj Aij pt(j)    (sums over j = 1, . . . , n)
This “trick” (dynamic programming) is used often with hidden Markov models
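In matrix form this recursion is simply pt+1 = A·pt. A minimal sketch (assuming NumPy; illustrated with the two-state example matrix from above, since the wumpus transition matrix is not spelled out here):

```python
import numpy as np

A = np.array([[0.5, 0.25],   # A[i, j] = P(X_{t+1} = s_i | X_t = s_j)
              [0.5, 0.75]])

def state_probs(A, start, t):
    """Exact state distribution p_t, computed by the recursion p_{k+1} = A p_k."""
    p = np.zeros(A.shape[0])
    p[start] = 1.0               # p_1(i) = 1 for the start state, 0 otherwise
    for _ in range(t - 1):
        p = A @ p                # p_{k+1}(i) = sum_j A_ij p_k(j)
    return p

print(state_probs(A, start=0, t=50))  # -> [0.333..., 0.666...], the equilibrium
```

Each step costs one matrix-vector product, so the total work is polynomial in t rather than exponential.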
Hidden Markov models (HMMs)
Often, states of world are not observable (hidden).
However, some evidence Et that depends on state
(stochastically) is available, i.e., know P(Et | Xt )
[Diagram: hidden chain Xt−2 → Xt−1 → Xt → Xt+1, with an observed evidence node Ek attached to each state Xk]
Assume P(Xt | Xt−1 ) and P(Et | Xt ) do not depend on t:
P(X1, . . . , Xt, E1, . . . , Et) = P(X1) ∏k=2..t P(Xk | Xk−1) ∏k=1..t P(Ek | Xk)
HMM formal specification
To completely specify an HMM, we need
number N of possible hidden states for each Xt
number M of possible observations for each Et
initial state probabilities π1, . . . , πN : πi = P(X1 = si)
state transition prob. Aij = P(Xt = si | Xt−1 = sj )
observation prob. Bj (oi ) = P(Et = oi | Xt = sj )
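These five items translate directly into a small parameter container. A minimal sketch (assuming NumPy; the class name and array layouts are my own conventions, chosen to match Aij and Bj(oi) above):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class HMM:
    pi: np.ndarray  # shape (N,):   pi[i]  = P(X_1 = s_i)
    A:  np.ndarray  # shape (N, N): A[i,j] = P(X_t = s_i | X_{t-1} = s_j); columns sum to 1
    B:  np.ndarray  # shape (N, M): B[j,o] = P(E_t = o | X_t = s_j); rows sum to 1

    @property
    def N(self) -> int:  # number of possible hidden states
        return self.A.shape[0]

    @property
    def M(self) -> int:  # number of possible observations
        return self.B.shape[1]
```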
HMM notational conventions
Distinguish between possible states s1, . . . , sN for each r.v. Xt and the concrete state St that the system is in at time t (one of the s1, . . . , sN)
For brevity, write St instead of Xt = St
Same for evidence Et : At time t, one of M possible
outputs o1 , . . . , oM can be observed. Use Ot to denote
concrete observation at time t (one of o1 , . . . , oM )
For brevity, write Ot instead of Et = Ot
Interesting HMM problems
State estimation: What is probability of state si , given list
of observations? I.e., what is P(Xt = si | (O1 , . . . , Ot ))?
Most probable path: Given observations O1 , . . . , Ot ,
what is the most probable sequence of states S1 , . . . , St ?
Learning HMMs: Given observations O1 , . . . , Ot , what is
most likely HMM to produce these observations?
HMM applications: Speech recognition, bioinformatics,
consumer decision modelling, economics and finance
Simple HMM example
Employee wants to infer food quality in cafeteria from
co-worker’s expression after lunch
Three food qualities (hidden states): good (g), mediocre
(m), bad (b)
Three co-worker’s expressions (observations): happy (h),
indifferent (i), angry (a)
One day’s food quality influences next (leftovers)
[Diagram: HMM chain Xt−2 → Xt−1 → Xt → Xt+1 with evidence nodes Et−2, Et−1, Et, Et+1]
Simple HMM example (cont.)
Start: P(X1 = g) = 0.3, P(X1 = m) = 0.5, P(X1 = b) = 0.2
State transitions:
P(g | g) = 0.1 P(g | m) = 0.3 P(g | b) = 0
P(m | g) = 0.7 P(m | m) = 0.6 P(m | b) = 0.8
P(b | g) = 0.2 P(b | m) = 0.1 P(b | b) = 0.2
Observation probabilities:
P(h | g) = 0.8  P(h | m) = 0.3  P(h | b) = 0.1
P(i | g) = 0.2  P(i | m) = 0.5  P(i | b) = 0.2
P(a | g) = 0    P(a | m) = 0.2  P(a | b) = 0.7
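For the sketches that follow, these numbers can be written as arrays (assuming NumPy; the orderings (g, m, b) for states and (h, i, a) for observations are my own indexing conventions):

```python
import numpy as np

STATES, OUTPUTS = ["g", "m", "b"], ["h", "i", "a"]

pi = np.array([0.3, 0.5, 0.2])    # P(X1 = g), P(X1 = m), P(X1 = b)

A = np.array([[0.1, 0.3, 0.0],    # P(g | g), P(g | m), P(g | b)
              [0.7, 0.6, 0.8],    # P(m | g), P(m | m), P(m | b)
              [0.2, 0.1, 0.2]])   # P(b | g), P(b | m), P(b | b)

B = np.array([[0.8, 0.2, 0.0],    # P(h | g), P(i | g), P(a | g)
              [0.3, 0.5, 0.2],    # P(h | m), P(i | m), P(a | m)
              [0.1, 0.2, 0.7]])   # P(h | b), P(i | b), P(a | b)
```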
Simple HMM example (cont.)
Assume first three days are like this: hidden states (m, g, m), emitting observations (i, h, h)

[Diagram: state sequence m → g → m, with observed expressions i, h, h below]
Employee sees only co-worker’s expression sequence (i,h,h)
What can be inferred about the food quality?
Tackle some easier questions first
Probability of observation sequence
What is probability of the observation sequence (i,h,h)?
Not a very good way (but easier to see):
P(i,h,h) = Σ P((i,h,h) ∧ (S1, S2, S3))
         = Σ P((i,h,h) | (S1, S2, S3)) P(S1, S2, S3)

with both sums over all 3-element state sequences (S1, S2, S3)
How to compute P(S1 , S2 , S3 )?
How to compute P((i,h,h) | (S1 , S2 , S3 ))?
Probability of observation sequence (cont.)
How to compute P(S1 , S2 , S3 )? With cond. independence
P(S1 , S2 , S3 ) = P(S1 )P(S2 | S1 )P(S3 | S2 )
E.g., with (S1 , S2 , S3 ) = (m,b,b), we get
P = 0.5 × 0.1 × 0.2 = 0.01
How to compute P((i,h,h) | (S1 , S2 , S3 ))?
P((i,h,h) | (S1 , S2 , S3 )) = P(i | S1 )P(h | S2 )P(h | S3 )
E.g., with (S1 , S2 , S3 ) = (m,b,b), we get
P = 0.5 × 0.1 × 0.1 = 0.005
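Summing such products over all 27 sequences yields P(i,h,h). A brute-force sketch (assuming NumPy; the pi, A, B arrays from the earlier encoding are repeated so the snippet runs on its own):

```python
from itertools import product
import numpy as np

pi = np.array([0.3, 0.5, 0.2])                             # state order (g, m, b)
A = np.array([[0.1, 0.3, 0.0], [0.7, 0.6, 0.8], [0.2, 0.1, 0.2]])
B = np.array([[0.8, 0.2, 0.0], [0.3, 0.5, 0.2], [0.1, 0.2, 0.7]])

obs = [1, 0, 0]                                            # (i, h, h), order (h, i, a)

total = 0.0
for s1, s2, s3 in product(range(3), repeat=3):             # all 27 state sequences
    p_states = pi[s1] * A[s2, s1] * A[s3, s2]              # P(S1) P(S2|S1) P(S3|S2)
    p_obs = B[s1, obs[0]] * B[s2, obs[1]] * B[s3, obs[2]]  # P(i|S1) P(h|S2) P(h|S3)
    total += p_obs * p_states

print(total)  # roughly 0.050
```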
Probability of observation sequence (cont.)
Problem: 27 possibilities for (S1, S2, S3), so calculating

P(i,h,h) = Σ P((i,h,h) | (S1, S2, S3)) P(S1, S2, S3)

(summing over all 3-element state sequences (S1, S2, S3)) requires 27 + 27 = 54 calculations (exponential growth in length of sequence)
Better: Use same trick as before (dynamic programming)
Dynamic programming for state estimation
For observation sequence (O1 , . . . , On ) and t ≤ n, define
αt (i) = P(Xt = si ∧ (O1 , . . . , Ot ))
as probability of seeing (O1 , . . . , Ot ) and ending in state si
Recursive definition gives polynomial time calculation:

αt(i) = Bi(O1) πi                 if t = 1
αt(i) = Bi(Ot) Σk Aik αt−1(k)     if t > 1   (sum over k = 1, . . . , N)
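A minimal sketch of this forward recursion (assuming NumPy and the array conventions of the earlier cafeteria encoding, with A indexed [next, previous] and B indexed [state, observation]):

```python
import numpy as np

def forward(pi, A, B, obs):
    """Return alpha with one row per time step: alpha[t, i] = P(X_t = s_i and (O_1, ..., O_t))."""
    alpha = np.zeros((len(obs), len(pi)))
    alpha[0] = B[:, obs[0]] * pi                      # alpha_1(i) = B_i(O_1) pi_i
    for t in range(1, len(obs)):
        alpha[t] = B[:, obs[t]] * (A @ alpha[t - 1])  # B_i(O_t) sum_k A_ik alpha_{t-1}(k)
    return alpha

# Cafeteria example, observations (i,h,h) as indices into the order (h, i, a):
pi = np.array([0.3, 0.5, 0.2])
A = np.array([[0.1, 0.3, 0.0], [0.7, 0.6, 0.8], [0.2, 0.1, 0.2]])
B = np.array([[0.8, 0.2, 0.0], [0.3, 0.5, 0.2], [0.1, 0.2, 0.7]])
alpha = forward(pi, A, B, [1, 0, 0])
print(alpha[-1])                    # roughly (0.0213, 0.0268, 0.0021): the alpha_3 values below
print(alpha[-1] / alpha[-1].sum())  # roughly (0.425, 0.534, 0.041) = P(X_3 | (i,h,h))
```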
Dynamic programming for state estimation
Now easy to calculate probabilities of interest: Because of
αt (i) = P(Xt = si ∧ (O1 , . . . , Ot ))
marginalizing yields
P(O1, . . . , Ot) = Σi αt(i)    (sum over i = 1, . . . , N)

and definition of conditional probability gives

P(Xt = si | (O1, . . . , Ot)) = αt(i) / Σk αt(k)
State estimation in cafeteria example
Calculate state probabilities for observations (i,h,h):
α1 (g) = 0.2×0.3 = 0.06
α1 (m) = 0.5×0.5 = 0.25
α1 (b) = 0.2×0.2 = 0.04
α2 (g) = 0.8×0.1×0.06 + 0.8×0.3×0.25 + 0.8×0×0.04 = 0.0648
α2 (m) = 0.3×0.7×0.06 + 0.3×0.6×0.25 + 0.3×0.8×0.04 = 0.0672
α2 (b) = 0.1×0.2×0.06 + 0.1×0.1×0.25 + 0.1×0.2×0.04 = 0.0045
α3 (g) = 0.8×0.1×0.0648 + 0.8×0.3×0.0672 + 0.8×0×0.0045 = 0.0213
α3 (m) = 0.3×0.7×0.0648 + 0.3×0.6×0.0672 + 0.3×0.8×0.0045 = 0.0268
α3 (b) = 0.1×0.2×0.0648 + 0.1×0.1×0.0672 + 0.1×0.2×0.0045 = 0.0021
State estimation in cafeteria example
Most likely state at time 1 is m with
P(X1 = m | i) = 0.25 / (0.06 + 0.25 + 0.04) = 0.714

Most likely state at time 2 is m with
P(X2 = m | (i,h)) = 0.0672 / (0.0648 + 0.0672 + 0.0045) = 0.492

Most likely state at time 3 is m with
P(X3 = m | (i,h,h)) = 0.0268 / (0.0213 + 0.0268 + 0.0021) = 0.534
Inferring most probable path
For given observation sequence (O1 , . . . , Ot ), find state
sequence (S1 , . . . , St ) with
P((S1 , . . . , St ) | (O1 , . . . , Ot )) → max
Call this best sequence (S1∗ , . . . , St∗ ). Slow idea to
calculate:
P((S1, . . . , St) | (O1, . . . , Ot)) = P((O1, . . . , Ot) | (S1, . . . , St)) P(S1, . . . , St) / P(O1, . . . , Ot)
(S1∗ , . . . , St∗ ) is not sequence of most likely states!
Dynamic programming for most prob. path
Use dynamic programming: For each state si and time t,
calculate most probable path that ends in si at t: mppt (i)
Can do this recursively (as before)
Key insight: mppt (i) can be calculated from
all mppt−1 (j) that are one state shorter
transition probabilities P(Xt = si | Xt−1 = sj )
probability Bi (Ot ) of observing Ot in state si
Viterbi algorithm
Fleshing out these ideas is Viterbi algorithm
δt(i) = max over (S1, . . . , St−1) of P((S1, . . . , St−1) ∧ Xt = si ∧ (O1, . . . , Ot))
mppt (i) is the path that achieves probability δt (i)
Recursive formula:

δt(i) = Bi(O1) πi                     if t = 1
δt(i) = maxj {Bi(Ot) Aij δt−1(j)}     if t > 1

Then (S1∗, . . . , St∗) is mppt(i∗), where the final state St∗ = si∗ and i∗ = argmaxi δt(i)
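Fleshing this out further, a minimal Viterbi sketch (assuming NumPy and the same array conventions; psi records, for each time step and state, the predecessor that achieved δt(i)):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Return the most probable state path (as indices) for an observation index sequence."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))           # delta[t, i] as defined above
    psi = np.zeros((T, N), dtype=int)  # psi[t, i]: best predecessor of s_i at step t
    delta[0] = B[:, obs[0]] * pi
    for t in range(1, T):
        cand = A * delta[t - 1]        # cand[i, j] = A_ij delta_{t-1}(j)
        psi[t] = cand.argmax(axis=1)
        delta[t] = B[:, obs[t]] * cand.max(axis=1)
    path = [int(delta[-1].argmax())]   # final state maximizes delta_T
    for t in range(T - 1, 0, -1):      # follow backpointers to reconstruct the path
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

pi = np.array([0.3, 0.5, 0.2])
A = np.array([[0.1, 0.3, 0.0], [0.7, 0.6, 0.8], [0.2, 0.1, 0.2]])
B = np.array([[0.8, 0.2, 0.0], [0.3, 0.5, 0.2], [0.1, 0.2, 0.7]])
print(viterbi(pi, A, B, [1, 0, 0]))  # -> [1, 0, 1], i.e. (m, g, m)
```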
Viterbi algorithm on cafeteria example
Calculate most probable path for observations (i,h,h):
δ1 (g) = 0.2×0.3 = 0.06
δ1 (m) = 0.5×0.5 = 0.25
δ1 (b) = 0.2×0.2 = 0.04
δ2 (g) = max{0.8×0.1×0.06, 0.8×0.3×0.25, 0.8×0×0.04} = 0.06
δ2 (m) = max{0.3×0.7×0.06, 0.3×0.6×0.25, 0.3×0.8×0.04} = 0.045
δ2 (b) = max{0.1×0.2×0.06, 0.1×0.1×0.25, 0.1×0.2×0.04} = 0.0025
δ3 (g) = max{0.8×0.1×0.06, 0.8×0.3×0.045, 0.8×0×0.0025} = 0.0108
δ3 (m) = max{0.3×0.7×0.06, 0.3×0.6×0.045, 0.3×0.8×0.0025} = 0.0126
δ3 (b) = max{0.1×0.2×0.06, 0.1×0.1×0.045, 0.1×0.2×0.0025} = 0.0012
Viterbi algorithm on cafeteria example
Highest δ3 value is δ3 (m) = 0.0126, so S3∗ = m
Work backwards t = 3 → t = 2: S3∗ achieved by a
transition g → m, so S2∗ = g
One more step t = 2 → t = 1: S2∗ achieved by a
transition m → g, so S1∗ = m
Most probable path is therefore (m,g,m)
Not the same as sequence of most probable states
(m,m,m)!
Sample application: speech recognition
Have signals, want to find words that generate signals,
i.e. maximize P(words | signals)
With Bayes' rule, get

P(words | signals) = α P(signals | words) P(words)

where P(signals | words) is the “acoustic model” and P(words) the “language model”
Acoustic model comprises pronunciation model and phone model
A phone is an atomic speech sound, a phoneme is a set
of phones that is indistinguishable in a language
Phone models
Sound is discretized, split into frames (typically 30ms
long) and represented by features
[Figure: analog acoustic signal, sampled and quantized into a digital signal, then grouped into frames, each frame represented by a feature vector]
Phone model is P(feature | phone)
Pronunciation models
Each word is represented as a distribution over phone
sequences, implemented as a transition model
[Diagram: transition model for “tomato”: [t], then [ow] (prob. 0.2) or [ah] (prob. 0.8), then [m], then [ey] (prob. 0.5) or [aa] (prob. 0.5), then [t] and [ow]]

P([towmeytow] | “tomato”) = P([towmaatow] | “tomato”) = 0.1
P([tahmeytow] | “tomato”) = P([tahmaatow] | “tomato”) = 0.4
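Each of these probabilities is just the product of the branch probabilities along one path. A toy sketch (the dictionary encoding of the two branch points is my own; unbranched transitions have probability 1.0):

```python
# Hypothetical encoding of the two branch points in the "tomato" model above
first_vowel = {"ow": 0.2, "ah": 0.8}    # after the initial [t]
second_vowel = {"ey": 0.5, "aa": 0.5}   # after [m]

for v1, p1 in first_vowel.items():
    for v2, p2 in second_vowel.items():
        seq = f"[t{v1}m{v2}tow]"
        print(f"P({seq} | 'tomato') = {p1 * p2:.1f}")  # 0.1, 0.1, 0.4, 0.4
```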
Language models
Prior probability P(w1 , . . . , wn ) of word sequences
modeled with Markov assumption (bigram model)
P(w1, . . . , wn) = P(w1) ∏i=2..n P(wi | wi−1)
Obtain conditional probabilities by analyzing large texts
Can be improved by model of language grammar
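The "analyzing large texts" step is mostly counting; a toy maximum-likelihood sketch (the corpus string is my own example, and a real system would add smoothing for unseen bigrams):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()  # toy corpus

contexts = Counter(corpus[:-1])             # counts of w_{i-1} as a context
bigrams = Counter(zip(corpus, corpus[1:]))  # counts of adjacent pairs (w_{i-1}, w_i)

def p(word, prev):
    """MLE estimate of P(w_i = word | w_{i-1} = prev)."""
    return bigrams[(prev, word)] / contexts[prev]

print(p("cat", "the"))  # 2/3: "the" is followed by "cat" twice and "mat" once
```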
Summary
Temporal/sequential reasoning achieved by random
processes
Markov chains: current state depends only on previous
state
Markov chains widely used in simulations
Hidden Markov models when states are not observable
State estimation and most probable path by dynamic programming: time and space linear in sequence length
Sample HMM application: speech recognition