
Lecture 25:
CS573
Advanced Artificial Intelligence
Milind Tambe
Computer Science Dept and Information Sciences Institute
University of Southern California
[email protected]
Surprise Quiz II: Part I
[Figure: Bayesian network with A as the parent of both B and C]
P(A) = 0.05

A | P(B|A)      A | P(C|A)
T | 0.9         T | 0.7
F | 0.05        F | 0.01

Questions:
Dynamic Belief Nets
In each time slice:
• Xt = unobservable state variables
• Et = observable evidence variables
[Figure: DBN unrolled over time as a Markov chain Xt → Xt+1 → Xt+2, each state with an evidence child Et, Et+1, Et+2]
Types of Inference

• Filtering or monitoring: P(Xt | e1, e2…et)
– Keep track of the probability distribution over the current state
– Like a POMDP belief state
– E.g., P(@ISI | c1,c2…ct) and P(N@ISI | c1,c2…ct)

• Prediction: P(Xt+k | e1,e2…et) for some k > 0
– E.g., P(@ISI 3 hours from now | c1,c2…ct)

• Smoothing or hindsight: P(Xk | e1, e2…et) for 0 <= k < t
– What was the state of the user at 11 AM, given observations at 9 AM, 10 AM, 11 AM, 1 PM, and 2 PM?

• Most likely explanation: given a sequence of observations, find the sequence of states that most likely generated those observations (e.g., speech recognition)
– argmaxx1:t P(x1:t | e1:t)
Filtering: P(Xt+1 | e1,e2…et+1)
P(Xt+1 | e1:t+1) = f1:t+1
RECURSION:
f1:t+1 = Norm * P(et+1 | Xt+1) * Σxt P(Xt+1 | xt) * P(xt | e1:t)
• e1:t+1 = e1, e2…et+1
• P(xt | e1:t) = f1:t
• f1:t+1 = Norm-const * FORWARD(f1:t, et+1)
Computing Forward f1:t+1
• For our example of tracking user location:
• f1:t+1 = Norm-const * FORWARD (f1:t, ct+1)
• Actually it is a vector, not a single quantity
• f1:2 = P(L2 | c1, c2), i.e., compute both components
  < P(L2 = @ISI | c1, c2), P(L2 = N@ISI | c1, c2) >
  and then normalize (a numeric sketch follows below)
Hope you tried out all the computations from the last lecture at home!
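A minimal NumPy sketch of this FORWARD step for the two-state location example (the function and variable names are mine; the probabilities and f1:1 = <0.818, 0.182> are the ones from the lecture):

import numpy as np

# Transition and sensor model for the two-state example (index 0 = @ISI, 1 = N@ISI)
T = np.array([[0.7, 0.3],      # P(L_t+1 | L_t = @ISI)
              [0.3, 0.7]])     # P(L_t+1 | L_t = N@ISI)
P_c_given_L = np.array([0.9, 0.2])   # P(c_t = true | @ISI), P(c_t = true | N@ISI)

def forward(f, evidence_true=True):
    # One filtering step: f1:t+1 = Norm * P(e_t+1 | X_t+1) * sum_xt P(X_t+1 | xt) * f1:t(xt)
    O = P_c_given_L if evidence_true else 1.0 - P_c_given_L
    f_new = O * (T.T @ f)          # predict one step, then weight by the evidence likelihood
    return f_new / f_new.sum()     # normalize

f_1_1 = np.array([0.818, 0.182])             # f1:1 = P(L1 | c1) from the last lecture
f_1_2 = forward(f_1_1, evidence_true=True)   # incorporate c2 = true
print(f_1_2)   # ~ [0.883, 0.117] = <P(L2=@ISI | c1,c2), P(L2=N@ISI | c1,c2)>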
Robotic Perception
[Figure: DBN for robotic perception with actions At-1, At, At+1, states Xt, Xt+1, Xt+2, and observations Et, Et+1, Et+2]
• At = action at time t (observed evidence)
• Xt = State of the environment at time t
• Et = Observation at time t (observed evidence)
Robotic Perception
• Similar to the filtering task seen earlier
• Differences:
– Must take the action evidence into account:
  Norm * P(et+1 | Xt+1) * Σxt P(Xt+1 | xt, at) * P(xt | e1:t)
  (like the POMDP belief update?)
– Must note that the variables are continuous:
  P(Xt+1 | e1:t+1, a1:t)
  = Norm * P(et+1 | Xt+1) * ∫ P(Xt+1 | xt, at) * P(xt | e1:t, a1:t-1) dxt
  (a small sketch follows below)
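As a rough illustration of the action-conditioned update (a sketch only: the two actions and their transition matrices below are invented for this example, and the state space is kept discrete rather than continuous):

import numpy as np

# Hypothetical action-dependent transition models (invented for illustration)
T_by_action = {
    "stay": np.array([[0.9, 0.1],
                      [0.1, 0.9]]),
    "move": np.array([[0.2, 0.8],
                      [0.8, 0.2]]),
}
P_e_given_X = np.array([0.9, 0.2])   # sensor model, reused from the location example

def belief_update(b, action, evidence_true=True):
    # b'(X_t+1) = Norm * P(e_t+1 | X_t+1) * sum_xt P(X_t+1 | xt, at) * b(xt)
    O = P_e_given_X if evidence_true else 1.0 - P_e_given_X
    b_new = O * (T_by_action[action].T @ b)
    return b_new / b_new.sum()

b = np.array([0.5, 0.5])                          # uniform initial belief
b = belief_update(b, "move", evidence_true=True)
print(b)   # belief after taking action "move" and then observing e = true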
Prediction

• Filtering without incorporating new evidence: P(Xt+k | e1,e2…et) for some k > 0
– E.g., P(L3 | c1) = ΣL2 P(L3 | L2) * P(L2 | c1)
  = P(L3=@ISI | L2=@ISI) * P(L2=@ISI | c1) + P(L3=@ISI | L2=N@ISI) * P(L2=N@ISI | c1)
  (P(L2 | c1) = <0.6272, 0.3728> was computed in the last lecture)
  = 0.7 * 0.6272 + 0.3 * 0.3728
  = 0.43904 + 0.11184 ≈ 0.55
– P(L4 | c1) = ΣL3 P(L4 | L3) * P(L3 | c1)
  = 0.7 * 0.55 + 0.3 * 0.45 = 0.52
Prediction
– P(L5 | c1) = 0.7 * 0.52 + 0.3 * 0.48 = 0.508
– P(L6 | c1) = 0.7 * 0.508 + 0.3 * 0.492 = 0.503… (converging to 0.5)

• The predicted distribution of user location converges to a fixed point (sketched numerically below)
– The stationary distribution of the Markov process
– Mixing time: the time taken to reach the fixed point

• Prediction is useful only if k << mixing time
– The more uncertainty there is in the transition model, the shorter the mixing time,
– and the shorter the mixing time, the more difficult it is to make predictions far ahead
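A small numeric sketch of this prediction process (names are mine; the transition model and P(L2 | c1) = <0.6272, 0.3728> are the lecture's), showing the convergence toward the stationary distribution <0.5, 0.5>:

import numpy as np

T = np.array([[0.7, 0.3],
              [0.3, 0.7]])

p = np.array([0.6272, 0.3728])    # P(L2 | c1), computed in the last lecture
for k in range(3, 9):
    p = T.T @ p                   # P(L_k | c1) = sum_{L_k-1} P(L_k | L_k-1) * P(L_k-1 | c1)
    print(f"P(L{k} = @ISI | c1) = {p[0]:.4f}")
# Prints 0.5509, 0.5204, 0.5081, 0.5033, 0.5013, 0.5005: converging to the
# stationary distribution <0.5, 0.5>; the mixing time here is only a few steps.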
Smoothing

• P(Xk | e1, e2…et) for 0 <= k < t

• P(Lk | c1,c2…ct) = Norm * P(Lk | c1,c2…ck) * P(ck+1…ct | Lk)
                   = Norm * f1:k * bk+1:t

• bk+1:t is a backward message, like our earlier forward message
• Hence the algorithm is called the forward-backward algorithm
bk+1:t backward message
bk+1:t = P(ek+1:t | Xk)
       = P(ek+1, ek+2, …, et | Xk)
       = Σxk+1 P(ek+1, ek+2, …, et | Xk, Xk+1) * P(xk+1 | Xk)
[Figure: the relevant slice of the DBN: Xk → Xk+1 → Xk+2 with evidence Ek, Ek+1, Ek+2]
bk+1:t backward message
bk+1:t = P(ek+1:t | Xk)
       = P(ek+1, ek+2, …, et | Xk)
       = Σxk+1 P(ek+1, ek+2, …, et | Xk, Xk+1) * P(xk+1 | Xk)
       = Σxk+1 P(ek+1, ek+2, …, et | Xk+1) * P(xk+1 | Xk)    (future evidence is independent of Xk given Xk+1)
       = Σxk+1 P(ek+1 | Xk+1) * P(ek+2:t | Xk+1) * P(xk+1 | Xk)
bk+1:t backward message
bk+1:t = P(ek+1:t | Xk)
       = Σxk+1 P(ek+1 | Xk+1) * P(ek+2:t | Xk+1) * P(xk+1 | Xk)
       = BACKWARD(bk+2:t, ek+1:t)
(P(ek+2:t | Xk+1) is just bk+2:t, so the backward message is computed recursively; a small sketch follows below)
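A minimal sketch of one BACKWARD step on the same two-state example (names are mine; T and the sensor probabilities are the lecture's):

import numpy as np

T = np.array([[0.7, 0.3],
              [0.3, 0.7]])
P_c_given_L = np.array([0.9, 0.2])   # P(c = true | @ISI), P(c = true | N@ISI)

def backward(b_next, evidence_true=True):
    # bk+1:t = sum_xk+1 P(e_k+1 | xk+1) * b_k+2:t(xk+1) * P(xk+1 | Xk)
    O = P_c_given_L if evidence_true else 1.0 - P_c_given_L
    return T @ (O * b_next)

b_3_2 = np.array([1.0, 1.0])                  # base case: no evidence beyond t
b_2_2 = backward(b_3_2, evidence_true=True)   # incorporate c2 = true
print(b_2_2)   # [0.69, 0.41] = <P(c2 | L1=@ISI), P(c2 | L1=N@ISI)>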
Example of Smoothing

• P(L1 = @ISI | c1, c2)
  = Norm * P(Lk | c1,c2…ck) * P(ck+1…ct | Lk)
  = Norm * P(L1 | c1) * P(c2 | L1) = Norm * 0.818 * P(c2 | L1)

• P(c2 | L1 = @ISI) = P(ek+1:t | Xk)
  = Σxk+1 P(ek+1 | Xk+1) * P(ek+2:t | Xk+1) * P(xk+1 | Xk)
  = ΣL2 P(c2 | L2) * P(c3:2 | L2) * P(L2 | L1)
  = (0.9 * 1 * 0.7) + (0.2 * 1 * 0.3) = 0.69    (P(c3:2 | L2) = 1 is the empty-evidence base case)
Example of Smoothing
• P(c2 | L1 = @ISI) = ΣL2 P(c2 | L2) * P(L2 | L1)
  = (0.9 * 0.7) + (0.2 * 0.3) = 0.69

• P(L1 = @ISI | c1, c2) = Norm * 0.818 * 0.69 = Norm * 0.56442
• P(L1 = N@ISI | c1, c2) = Norm * 0.182 * 0.41 = Norm * 0.0746
• After normalization: P(L1 = @ISI | c1, c2) = 0.883 (checked numerically below)

• Smoothed estimate 0.883 > filtered estimate P(L1 = @ISI | c1) = 0.818. WHY?
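A quick numeric check of this smoothing step, combining the forward and backward messages computed above (variable names are mine):

import numpy as np

f_1_1 = np.array([0.818, 0.182])   # forward message P(L1 | c1)
b_2_2 = np.array([0.69, 0.41])     # backward message P(c2 | L1)
smoothed = f_1_1 * b_2_2           # <0.56442, 0.0746> before normalization
smoothed /= smoothed.sum()
print(smoothed)                    # ~ [0.883, 0.117] = P(L1 | c1, c2)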
HMM

• Hidden Markov Models
• Speech recognition is perhaps the most popular application
– Any speech recognition researchers in class?
– Waibel and Lee
– Dominance of HMMs in speech recognition since the 1980s
– Under ideal, isolated conditions they report 99% accuracy
– Accuracy drops with noise and multiple speakers

• HMMs find applications everywhere: just try putting "HMM" into Google

• First we gave the Bellman update to AI (and other sciences);
  now we make our second huge contribution to AI: the Viterbi algorithm!

HMM

• The simple structure of an HMM allows simple and elegant algorithms

• Transition model P(Xt+1 | Xt) for all values of Xt
– Represented as an |S| x |S| matrix
– For our example: matrix "T"
– Tij = P(Xt = j | Xt-1 = i)

      | 0.7  0.3 |
  T = | 0.3  0.7 |

• Sensor model also represented as a diagonal matrix
– Diagonal entries give P(et | Xt = i)
– et is the evidence, e.g., ct = true
– Matrix Ot

       | 0.9  0   |
  Ot = | 0    0.2 |
HMM
• f1:t+1 = Norm-const * FORWARD(f1:t, ct+1)
         = Norm-const * P(ct+1 | Lt+1) * ΣLt P(Lt+1 | Lt) * P(Lt | c1,c2…ct)
         = Norm-const * Ot+1 * TT * f1:t      (TT = transpose of T)

• f1:2 = P(L2 | c1, c2) = Norm-const * O2 * TT * f1:1

                       | 0.9  0   |   | 0.7  0.3 |T   | 0.818 |
       = Norm-const *  | 0    0.2 | * | 0.3  0.7 |  * | 0.182 |
HMM
• f1:2 = P(L2 | c1, c2) = Norm-const * O2 * TT * f1:1

                       | 0.9  0   |   | 0.7  0.3 |   | 0.818 |
       = Norm-const *  | 0    0.2 | * | 0.3  0.7 | * | 0.182 |

                       | 0.63  0.27 |   | 0.818 |
       = Norm-const *  | 0.06  0.14 | * | 0.182 |

       = Norm * < (0.63 * 0.818 + 0.27 * 0.182), (0.06 * 0.818 + 0.14 * 0.182) >
       = Norm * <0.564, 0.075>
       = <0.883, 0.117> after normalization
Backward in HMM
bk+1:t = P(ek+1:t | Xk)
       = Σxk+1 P(ek+1 | Xk+1) * P(ek+2:t | Xk+1) * P(xk+1 | Xk)
       = T * Ok+1 * bk+2:t

• b2:2 = P(c2 | L1) = T * O2 * b3:2

         | 0.7  0.3 |   | 0.9  0   |
       = | 0.3  0.7 | * | 0    0.2 | * b3:2
Backward
• bk+1:t = T * Ok+1 * bk+2:t
• b2:2 = T * O2 * b3:2, with b3:2 = <1, 1> (no evidence beyond t = 2)

         | 0.7  0.3 |   | 0.9  0   |   | 1 |
       = | 0.3  0.7 | * | 0    0.2 | * | 1 |

       = <0.69, 0.41>
Key Results for HMMs
• f1:t+1 = Norm-const * Ot+1 * TT * f1:t
• bk+1:t = T*Ok+1 * bk+2:t
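Both key results can be checked in a few lines of NumPy on the lecture's two-state example (a sketch; variable names are mine):

import numpy as np

T = np.array([[0.7, 0.3],
              [0.3, 0.7]])
O_true = np.diag([0.9, 0.2])       # diagonal sensor matrix for c = true

# Forward:  f1:t+1 = Norm-const * O_t+1 * T^T * f1:t
f_1_1 = np.array([0.818, 0.182])   # P(L1 | c1)
f_1_2 = O_true @ T.T @ f_1_1
f_1_2 /= f_1_2.sum()
print(f_1_2)                       # ~ [0.883, 0.117]

# Backward: bk+1:t = T * O_k+1 * bk+2:t
b_3_2 = np.array([1.0, 1.0])       # base case
b_2_2 = T @ O_true @ b_3_2
print(b_2_2)                       # [0.69, 0.41]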
Inference in DBN

• How to do inference in a DBN in general?
• Could unroll the loop forever…
[Figure: the DBN unrolled: Xt → Xt+1 → Xt+2 → Xt+3 → …, each state with evidence Et, Et+1, Et+2, Et+3, …]
• Slices added beyond the last observation have no effect on inference. WHY?
• So only keep slices within the observation period
Inference in DBN
[Figure: the burglary-network analogy: Alarm with children JohnCalls and MaryCalls, next to the unrolled slices Xt, Xt+1, … with evidence Et, Et+1, …]
• Slices added beyond the last observation have no effect on inference. WHY?
• P(Alarm | JohnCalls) is independent of the unobserved MaryCalls node
Complexity of inference in DBN

• Keep at most two slices in memory
– Start with slice 0
– Add slice 1
– "Sum out" slice 0 (get a probability distribution over the slice-1 state; we never need to go back to slice 0, just as in POMDPs)
– Add slice 2, sum out slice 1, …

• Constant time and space per update

• Unfortunately, the update is exponential in the number of state variables
• Hence we need approximate inference algorithms
Solving DBNs in General

• Exact methods:
– Compute-intensive
– Variable elimination from Chapter 14

• Approximate methods:
– Particle filtering has gained popularity (a small sketch follows below)
– Run N samples together through the slices of the DBN
– All N samples constitute the forward message
– Highly efficient
– Hard to provide theoretical guarantees
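A minimal particle-filtering sketch on the two-state location example (states 0 = @ISI, 1 = N@ISI); the propagate/weight/resample structure is the standard one, while the function and variable names are mine:

import numpy as np

rng = np.random.default_rng(0)
T = np.array([[0.7, 0.3],
              [0.3, 0.7]])
P_c_given_L = np.array([0.9, 0.2])   # P(c = true | state)

def particle_filter_step(particles, evidence_true=True):
    # 1. Propagate each sample through the transition model
    particles = np.array([rng.choice(2, p=T[s]) for s in particles])
    # 2. Weight each sample by the likelihood of the new evidence
    w = P_c_given_L[particles] if evidence_true else 1.0 - P_c_given_L[particles]
    # 3. Resample N particles in proportion to the weights
    return rng.choice(particles, size=len(particles), p=w / w.sum())

N = 10000
particles = rng.choice(2, size=N, p=[0.818, 0.182])    # approximates P(L1 | c1)
particles = particle_filter_step(particles, evidence_true=True)
print((particles == 0).mean())   # ~ 0.88, approximating P(L2 = @ISI | c1, c2) = 0.883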
Next Lecture
• Continue with Chapter 15
• Student Evaluations
Surprise Quiz II: Part II
[Figure: DBN with Xt → Xt+1; Xt has evidence child Et, and Xt+1 has evidence children Et+1 and E't+1]

Xt | P(Xt+1)      Xt+1 | P(E')      Xt+1 | P(E)
T  | 0.5          T    | 0.7        T    | 0.8
F  | 0.5          F    | 0.01       F    | 0.01

Question:
Most Likely Path

• Given a sequence of observations, find the sequence of states that most likely generated those observations

• E.g., in the E-elves example, suppose the observation sequence is
  [activity, activity, no-activity, activity, activity]

• What is the most likely explanation of the user's presence at ISI over the course of the day?
– Did the user step out at time = 3?
– Or was the user present all the time, but in a meeting at time 3?

• argmaxx1:t P(x1:t | e1:t)
Not so simple…

• Could use smoothing to find the posterior distribution at each time step
– E.g., compute P(L1=@ISI | c1:5) vs P(L1=N@ISI | c1:5), and take the max
– Do the same for P(L2=@ISI | c1:5) vs P(L2=N@ISI | c1:5), and take the max
– Find the maximum at each step this way
• Why might this be different from computing what we want (the most likely sequence)?

• maxx1:t+1 P(x1:t+1 | e1:t+1) via the Viterbi algorithm (a small sketch follows below):
  = Norm * P(et+1 | Xt+1) * maxxt ( P(Xt+1 | xt) * maxx1…xt-1 P(x1,…,xt-1, xt | e1…et) )
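A minimal Viterbi sketch for the two-state example. It reuses the lecture's transition and sensor model and treats the E-elves sequence [activity, activity, no-activity, activity, activity] as c = [T, T, F, T, T] with a uniform prior over L1; that mapping and prior are my assumptions, not the lecture's.

import numpy as np

T = np.array([[0.7, 0.3],
              [0.3, 0.7]])
P_c_given_L = np.array([0.9, 0.2])   # P(c = true | @ISI), P(c = true | N@ISI)
prior = np.array([0.5, 0.5])         # assumed uniform prior over L1

def viterbi(evidence):
    O = [P_c_given_L if e else 1.0 - P_c_given_L for e in evidence]
    m = prior * O[0]                  # max prob. of any path ending in each state at t = 1
    back = []                         # backpointers to the best predecessor
    for Ot in O[1:]:
        scores = T * m[:, None]       # scores[i, j] = P(x_t+1 = j | x_t = i) * m[i]
        back.append(scores.argmax(axis=0))
        m = Ot * scores.max(axis=0)   # recursion from the slide: evidence times best predecessor
    path = [int(m.argmax())]          # most likely final state
    for bp in reversed(back):         # follow backpointers to recover the full sequence
        path.append(int(bp[path[-1]]))
    return list(reversed(path))       # 0 = @ISI, 1 = N@ISI

print(viterbi([True, True, False, True, True]))
# [0, 0, 1, 0, 0] under these assumptions: the user most likely stepped out at time 3.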