
HMMs
Tamara Berg
CS 590-133 Artificial Intelligence
Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart Russell,
Andrew Moore, Percy Liang, Luke Zettlemoyer, Rob Pless, Kilian Weinberger, Deva Ramanan
2048 Game
Game AI (Minimax with alpha/beta pruning)
Announcements
• HW2 graded
– mean = 14.40, median = 16
• Grades to date after 2 HWs, 1 midterm (60% / 40%)
– mean = 77.42, median = 81.36
Announcements
• HW4 will be online tonight (due April 3)
• Midterm 2 next Thursday
– Review will be Monday, 5pm in FB009
• Information Session: BS/MS Program
– Thursday, March 20, 5:30-6:30, SN011
Recap from before spring break
Bayes Nets
Directed graph: $G = (X, E)$
Nodes: $X = \{X_1, X_2, X_3, \ldots, X_n\}$
Edges: $E = \{(X_i, X_j) : i \neq j\}$
Each node is associated with a random variable
Example
[Figure: example BN over X1, ..., X6 with edges X1→X2, X1→X3, X2→X4, X3→X5, and X2, X5→X6]
Joint Probability:
$p(x_1, x_2, x_3, x_4, x_5, x_6) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1)\, p(x_4 \mid x_2)\, p(x_5 \mid x_3)\, p(x_6 \mid x_2, x_5)$
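To make the factorization concrete, here is a minimal Python sketch (not from the slides): the variables are boolean and every CPT number is invented purely for illustration.

# Minimal sketch: evaluate the joint of the example BN above by
# multiplying one local conditional per node. All numbers are invented.
p1 = lambda x1: 0.6 if x1 else 0.4                         # p(x1)
p2 = lambda x2, x1: 0.7 if x2 == x1 else 0.3               # p(x2 | x1)
p3 = lambda x3, x1: 0.9 if x3 == x1 else 0.1               # p(x3 | x1)
p4 = lambda x4, x2: 0.5                                    # p(x4 | x2), uniform
p5 = lambda x5, x3: 0.8 if x5 == x3 else 0.2               # p(x5 | x3)
p6 = lambda x6, x2, x5: 0.6 if x6 == (x2 and x5) else 0.4  # p(x6 | x2, x5)

def joint(x1, x2, x3, x4, x5, x6):
    return (p1(x1) * p2(x2, x1) * p3(x3, x1) *
            p4(x4, x2) * p5(x5, x3) * p6(x6, x2, x5))

print(joint(True, True, False, True, False, True))  # one entry of the joint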
Independence in a BN
• Important question about a BN:
– Are two nodes independent given certain evidence?
– If yes, can prove using algebra (tedious in general)
– If no, can prove with a counterexample
Example: X → Y → Z
– Question: are X and Z necessarily independent?
• Answer: no. Example: low pressure causes rain, which
causes traffic.
• X can influence Z, Z can influence X (via Y)
• Addendum: they could be independent: how?
Bayes Ball
• Shade all observed nodes. Place balls at the starting node, let them bounce around according to some rules, and ask whether any ball reaches the goal node.
• We need to know what happens when a
ball arrives at a node on its way to the goal
node.
Inference
• Inference: calculating some
useful quantity from a joint
probability distribution
• Examples (for the burglary network with nodes B, E, A, J, M):
– Posterior probability: $P(Q \mid e_1, \ldots, e_k)$
– Most likely explanation: $\arg\max_q P(Q = q \mid e_1, \ldots, e_k)$
Inference Algorithms
• Exact algorithms
– Inference by Enumeration
– Variable Elimination algorithm
• Approximate (sampling) algorithms
– Forward sampling
– Rejection sampling
– Likelihood weighting sampling
Example BN – Naïve Bayes
[Figure: naïve Bayes net with class node C and feature children F1, F2, ..., Fn]
C – class
F – features
$P(C, F_1, F_2, \ldots, F_n) = P(C) \prod_i P(F_i \mid C)$
We only specify (parameters):
$P(C)$ – prior over class labels
$P(F_i \mid C)$ – how each feature depends on the class
HW4!
Percentage of documents in the training set labeled as spam/ham. (Slide from Dan Klein)
In the documents labeled as spam, the occurrence percentage of each word (e.g. # times “the” occurred / # total words). (Slide from Dan Klein)
In the documents labeled as ham, the occurrence percentage of each word (e.g. # times “the” occurred / # total words). (Slide from Dan Klein)
Parameter estimation
• Parameter estimate:
$P(\text{word} \mid \text{spam}) = \dfrac{\#\text{ of word occurrences in spam messages}}{\text{total }\#\text{ of words in spam messages}}$
• Parameter smoothing: dealing with words that were never
seen or seen too few times
– Laplacian smoothing: pretend you have seen every vocabulary word
one more time (or m more times) than you actually did
$P(\text{word} \mid \text{spam}) = \dfrac{\#\text{ of word occurrences in spam messages} + 1}{\text{total }\#\text{ of words in spam messages} + V}$
(V: total number of unique words)
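A minimal Python sketch (not from the slides) of the smoothed estimate; spam_words and vocab are hypothetical inputs, and m generalizes the add-one count:

from collections import Counter

def estimate_word_probs(spam_words, vocab, m=1):
    # P(word | spam) = (count(word) + m) / (total words + m * V)
    counts = Counter(spam_words)
    total, V = len(spam_words), len(vocab)
    return {w: (counts[w] + m) / (total + m * V) for w in vocab}

# A word never seen in spam still gets nonzero probability:
probs = estimate_word_probs(["the", "free", "money", "the"],
                            {"the", "free", "money", "prize"})
print(probs["prize"])  # 1/8 with m = 1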
Classification
The class that maximizes $P(C, W_1, \ldots, W_n) = P(C) \prod_i P(W_i \mid C)$:
$c^* = \arg\max_{c \in C} P(c) \prod_i P(W_i \mid c)$
Summary of model and parameters
• Naïve Bayes model:
$P(\text{spam} \mid \text{message}) \propto P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam})$
$P(\neg\text{spam} \mid \text{message}) \propto P(\neg\text{spam}) \prod_{i=1}^{n} P(w_i \mid \neg\text{spam})$
• Model parameters:
prior: $P(\text{spam})$, $P(\neg\text{spam})$
likelihood of spam: $P(w_1 \mid \text{spam})$, $P(w_2 \mid \text{spam})$, …, $P(w_n \mid \text{spam})$
likelihood of ¬spam: $P(w_1 \mid \neg\text{spam})$, $P(w_2 \mid \neg\text{spam})$, …, $P(w_n \mid \neg\text{spam})$
Classification
• In practice
– Multiplying lots of small probabilities can
result in floating point underflow
– Since log(xy) = log(x) + log(y), we can sum
log probabilities instead of multiplying
probabilities.
– Since log is a monotonic function, the class
with the highest score does not change.
– So, what we usually compute in practice is:
$c_{\text{MAP}} = \arg\max_{c \in C} \left[ \log P(c) + \sum_i \log P(W_i \mid c) \right]$
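A minimal Python sketch (not from the slides) of this computation; log_prior and log_likelihood are assumed to hold logs of smoothed estimates like the ones above:

import math

def classify(words, log_prior, log_likelihood):
    # c_MAP = argmax_c [ log P(c) + sum_i log P(w_i | c) ]
    def score(c):
        # words outside a class's vocabulary are skipped here for simplicity
        return log_prior[c] + sum(log_likelihood[c][w]
                                  for w in words if w in log_likelihood[c])
    return max(log_prior, key=score)

log_prior = {"spam": math.log(0.4), "ham": math.log(0.6)}
log_likelihood = {"spam": {"free": math.log(0.05), "the": math.log(0.03)},
                  "ham":  {"free": math.log(0.001), "the": math.log(0.04)}}
print(classify(["free", "the"], log_prior, log_likelihood))  # -> 'spam'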
Today: Adding time!
Reasoning over Time
• Often, we want to reason about a sequence of
observations
– Speech recognition
– Robot localization
– Medical monitoring
– DNA sequence recognition
• Need to introduce time into our models
• Basic approach: hidden Markov models (HMMs)
• More general: dynamic Bayes’ nets
Reasoning over time
• Basic idea:
– Agent maintains a belief state representing
probabilities over possible states of the world
– From the belief state and a transition model,
the agent can predict how the world might
evolve in the next time step
– From observations and an emission model,
the agent can update the belief state
Markov Model
Markov Models
• A Markov model is a chain-structured BN
– Value of X at a given time is called the state
– As a BN:
X1 → X2 → X3 → X4 → …, with $P(X_t \mid X_{t-1})$ on each transition
– Parameters: initial probabilities and transition
probabilities or dynamics (specify how the state
evolves over time)
Conditional Independence
X1 → X2 → X3 → X4
• Basic conditional independence:
– Past and future independent given the present
– Each time step only depends on the previous
– This is called the (first order) Markov property
• Note that the chain is just a (growing) BN
– We can always use generic BN reasoning on it if we
truncate the chain at a fixed length
Example: Markov Chain
X1 → X2 → X3 → X4
• Weather:
– States: X = {rain, sun}
– Transitions: $P(\text{sun} \mid \text{sun}) = 0.9$, $P(\text{rain} \mid \text{sun}) = 0.1$, $P(\text{rain} \mid \text{rain}) = 0.9$, $P(\text{sun} \mid \text{rain}) = 0.1$
– Initial distribution: 1.0 sun
– What’s the probability distribution after one step?
Mini-Forward Algorithm
• Question: What’s P(X) on some day t?
$P(x_t) = \sum_{x_{t-1}} P(x_t \mid x_{t-1})\, P(x_{t-1})$
[Figure: sun/rain trellis unrolled over time]
Forward simulation
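A minimal Python sketch (not from the slides) of forward simulation on the weather chain above:

# Transition model from the weather example: stay with prob 0.9.
T = {"sun": {"sun": 0.9, "rain": 0.1},
     "rain": {"sun": 0.1, "rain": 0.9}}

def step(p):
    # P(x_t) = sum over x_{t-1} of P(x_t | x_{t-1}) * P(x_{t-1})
    return {x: sum(T[prev][x] * p[prev] for prev in p) for x in p}

p = {"sun": 1.0, "rain": 0.0}  # initial distribution: 1.0 sun
for t in range(2, 5):
    p = step(p)
    print(f"P(X{t}) =", p)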
Example
• From initial observation of sun
$P(X_1),\ P(X_2),\ P(X_3),\ \ldots,\ P(X_\infty)$
• From initial observation of rain
$P(X_1),\ P(X_2),\ P(X_3),\ \ldots,\ P(X_\infty)$
Stationary Distributions
• If we simulate the chain long enough:
– What happens?
– Uncertainty accumulates
– Eventually, we have no idea what the state is!
• Stationary distributions:
– For most chains, the distribution we end up in is
independent of the initial distribution
– Called the stationary distribution of the chain
– Usually, can only predict a short time out
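A minimal Python sketch (not from the slides): simulate the weather chain long enough and the starting point stops mattering.

T = {"sun": {"sun": 0.9, "rain": 0.1},
     "rain": {"sun": 0.1, "rain": 0.9}}

p = {"sun": 1.0, "rain": 0.0}  # try {"sun": 0.0, "rain": 1.0}: same limit
for _ in range(1000):
    p = {x: sum(T[prev][x] * p[prev] for prev in p) for x in p}
print(p)  # approx {'sun': 0.5, 'rain': 0.5}: the stationary distribution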
Hidden Markov Model
Hidden Markov Models
• Markov chains not so useful for most agents
– Eventually you don’t know anything anymore
– Need observations to update your beliefs
• Hidden Markov models (HMMs)
– Underlying Markov chain over states S
– You observe outputs (effects) at each time step
– As a Bayes’ net:
[Figure: HMM as a Bayes net: hidden chain X1 → X2 → X3 → X4 → X5, with an observed emission Ei below each Xi]
Example
• An HMM is defined by:
– Initial distribution: $P(X_1)$
– Transitions: $P(X_t \mid X_{t-1})$
– Emissions: $P(E_t \mid X_t)$
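As a concrete example (not from the slides), the classic rain/umbrella model from Russell & Norvig, written as the three tables of the definition above:

initial    = {"rain": 0.5, "sun": 0.5}                        # P(X1)
transition = {"rain": {"rain": 0.7, "sun": 0.3},              # P(X_t | X_{t-1})
              "sun":  {"rain": 0.3, "sun": 0.7}}
emission   = {"rain": {"umbrella": 0.9, "no_umbrella": 0.1},  # P(E_t | X_t)
              "sun":  {"umbrella": 0.2, "no_umbrella": 0.8}}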
Conditional Independence
• HMMs have two important independence properties:
– Markov hidden process, future depends on past via the present
– Current observation independent of all else given current state
[Figure: the same HMM chain, X1 … X5 with emissions E1 … E5]
• Quiz: does this mean that observations are independent
given no evidence?
– [No, correlated by the hidden state]
Real HMM Examples
• Speech recognition HMMs:
– Observations are acoustic signals
(continuous valued)
– States are specific positions in specific
words (so, tens of thousands)
• Machine translation HMMs:
– Observations are words (tens of thousands)
– States are translation options
• Robot tracking:
– Observations are range readings
(continuous)
– States are positions on a map (continuous)
Filtering / Monitoring
• Filtering, or monitoring, is the task of tracking the
distribution B(X) (the belief state) over time
• We start with B(X) in an initial setting, usually uniform
• As time passes, or we get observations, we update B(X)
• The Kalman filter was invented in the 1960s and first implemented as a method of trajectory estimation for the Apollo program
Example: Robot Localization
Example from Michael Pfeiffer
[Figures: belief state over the map at t = 0 through t = 5; the color bar shows probability from 0 to 1]
Sensor model: never more than 1 mistake
Motion model: may not execute action with small prob.
Filtering
• Filtering – update belief state given all evidence
(observations) to date
• Recursive estimation: given the result of filtering up to time t, the agent computes the result for t+1 from the new evidence $e_{t+1}$
• Filtering consists of two parts:
– Projecting the current belief state distribution forward from t to t+1
– Updating the belief using the new evidence $e_{t+1}$
Forward Algorithm
$P(X_{t+1} \mid e_{1:t+1}) = \alpha\, P(e_{t+1} \mid X_{t+1}) \sum_{x_t} P(X_{t+1} \mid x_t)\, P(x_t \mid e_{1:t})$
New belief state (left-hand side); emission model: $P(e_{t+1} \mid X_{t+1})$; transition model: $P(X_{t+1} \mid x_t)$; current belief state: $P(x_t \mid e_{1:t})$. $\alpha$ is a normalizing constant used to make the probabilities sum to 1.
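A minimal Python sketch (not from the slides) of this update on the rain/umbrella model; forward_step is a hypothetical helper that implements the equation literally:

transition = {"rain": {"rain": 0.7, "sun": 0.3}, "sun": {"rain": 0.3, "sun": 0.7}}
emission = {"rain": {"umbrella": 0.9, "no_umbrella": 0.1},
            "sun":  {"umbrella": 0.2, "no_umbrella": 0.8}}

def forward_step(belief, evidence, transition, emission):
    # Predict: sum over x_t of P(X_{t+1} | x_t) * P(x_t | e_{1:t})
    predicted = {x: sum(transition[prev][x] * belief[prev] for prev in belief)
                 for x in belief}
    # Weight by the emission model, then normalize with alpha
    unnorm = {x: emission[x][evidence] * predicted[x] for x in belief}
    alpha = 1.0 / sum(unnorm.values())
    return {x: alpha * p for x, p in unnorm.items()}

belief = {"rain": 0.5, "sun": 0.5}  # initial belief state
for e in ["umbrella", "umbrella", "no_umbrella"]:
    belief = forward_step(belief, e, transition, emission)
    print(e, belief)  # after the first 'umbrella': rain ≈ 0.818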
Decoding: Finding the most
likely sequence
[video]
[Figure: HMM chain X1 … X5 with emissions E1 … E5]
• Query: most likely sequence: $\arg\max_{x_{1:t}} P(x_{1:t} \mid e_{1:t})$
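One standard way to answer this query is the Viterbi algorithm; below is a minimal Python sketch (not from the slides), again on the rain/umbrella model:

initial = {"rain": 0.5, "sun": 0.5}
transition = {"rain": {"rain": 0.7, "sun": 0.3}, "sun": {"rain": 0.3, "sun": 0.7}}
emission = {"rain": {"umbrella": 0.9, "no_umbrella": 0.1},
            "sun":  {"umbrella": 0.2, "no_umbrella": 0.8}}

def viterbi(evidence, initial, transition, emission):
    # best[x]: probability of the most likely state sequence ending in x
    states = list(initial)
    best = {x: initial[x] * emission[x][evidence[0]] for x in states}
    backptr = []
    for e in evidence[1:]:
        prev = {x: max(states, key=lambda p: best[p] * transition[p][x])
                for x in states}
        backptr.append(prev)
        best = {x: best[prev[x]] * transition[prev[x]][x] * emission[x][e]
                for x in states}
    # Recover the sequence by following back-pointers from the best end state
    path = [max(states, key=lambda x: best[x])]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    return path[::-1]

print(viterbi(["umbrella", "umbrella", "no_umbrella"],
              initial, transition, emission))  # -> ['rain', 'rain', 'sun']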
Speech Recognition with HMMs
Particle Filtering
Representation: Particles
• Our representation of P(X) is now
a list of N particles (samples)
• P(x) approximated by number of
particles with value x
• For now, all particles have a
weight of 1
Particles: (3,3), (2,3), (3,3), (3,2), (3,3), (3,2), (2,1), (3,3), (3,3), (2,1)
Particle Filtering: Elapse Time
• Each particle is moved by sampling its
next position based on the transition
model
• This captures the passage of time
Particle Filtering: Observe
• Based on observations:
– Similar to likelihood weighting, we down-weight our samples based on the evidence
– Note that the weights won't sum to one, since most samples have been down-weighted
Particle Filtering: Resample
• Rather than tracking weighted samples, we resample
• N times, we choose from our weighted sample distribution (i.e., draw with replacement)
• This is equivalent to renormalizing the distribution
• Now the update is complete for this time step; continue with the next one (see the sketch after the example below)
Old particles: (3,3) w=0.1, (2,1) w=0.9, (2,1) w=0.9, (3,1) w=0.4, (3,2) w=0.3, (2,2) w=0.4, (1,1) w=0.4, (3,1) w=0.4, (2,1) w=0.9, (3,2) w=0.3
New particles: (2,1) w=1, (2,1) w=1, (2,1) w=1, (3,2) w=1, (2,2) w=1, (2,1) w=1, (1,1) w=1, (3,1) w=1, (2,1) w=1, (1,1) w=1
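Putting the three steps together, a minimal Python sketch (not from the slides) of one particle-filter update; sample_transition and obs_prob stand in for the problem-specific motion and sensor models:

import random

def particle_filter_step(particles, evidence, sample_transition, obs_prob):
    # Elapse time: move each particle by sampling from the transition model
    particles = [sample_transition(p) for p in particles]
    # Observe: weight each particle by the likelihood of the evidence
    weights = [obs_prob(evidence, p) for p in particles]
    # Resample: draw N particles with replacement, proportional to weight,
    # which resets every weight back to 1
    return random.choices(particles, weights=weights, k=len(particles))

For grid localization as in the example above, a particle would be a (row, col) position, sample_transition would apply the robot's noisy motion model, and obs_prob would score a range reading against the map.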
Robot Localization
• In robot localization:
– We know the map, but not the robot’s position
– Observations may be vectors of range finder readings
– State space and readings are typically continuous (works
basically like a very fine grid) and so we cannot store B(X)
– Particle filtering is a main technique
• [Demos, Fox]
SLAM
• SLAM = Simultaneous Localization And Mapping
– We do not know the map or our location
– Our belief state is over maps and positions!
– Main techniques: Kalman filtering (Gaussian HMMs) and particle
methods
DP-SLAM, Ron Parr
Particle Filters for Localization and
Video Tracking
End of Part II!
• Now we’re done with our unit on
probabilistic reasoning
• Last part of class: machine learning &
applications