Probabilistic Reasoning over Time

Slides adapted from work of: Dan Klein
Temporal Probabilistic Agent

[Diagram: an agent connected to its environment through sensors and actuators, acting over time steps t1, t2, t3, …]
Time and Uncertainty
• The world changes; we need to track and predict it
• Examples: diabetes management, traffic monitoring
• Basic idea: copy state and evidence variables for each time step
• X_t – set of unobservable state variables at time t
  o e.g., BloodSugar_t, StomachContents_t
• E_t – set of evidence variables at time t
  o e.g., MeasuredBloodSugar_t, PulseRate_t, FoodEaten_t
• Assumes discrete time steps
States and Observations
• The process of change is viewed as a series of snapshots, each describing the state of the world at a particular time
• Each time slice involves a set of random variables indexed by t:
  1. the set of unobservable state variables X_t
  2. the set of observable evidence variables E_t
• The observation at time t is E_t = e_t for some set of values e_t
• The notation X_a:b denotes the set of variables from X_a to X_b
Stationary Process / Markov Assumption
• Markov assumption: X_t depends on only a fixed number of previous states X_i
• First-order Markov process:
  P(X_t | X_0:t-1) = P(X_t | X_t-1)
• kth-order: depends on the previous k time steps
• Sensor Markov assumption:
  P(E_t | X_0:t, E_0:t-1) = P(E_t | X_t)
• Assume a stationary process: the transition model P(X_t | X_t-1) and the sensor model P(E_t | X_t) are the same for all t
• In a stationary process, the changes in the world state are governed by laws that do not themselves change over time
Complete Joint Distribution
• Given:
  o Transition model: P(X_t | X_t-1)
  o Sensor model: P(E_t | X_t)
  o Prior probability: P(X_0)
• Then we can specify the complete joint distribution:

  P(X_0, X_1, ..., X_t, E_1, ..., E_t) = P(X_0) ∏_{i=1..t} P(X_i | X_i-1) P(E_i | X_i)
Example

[Network: Rain_t-1 → Rain_t → Rain_t+1, with Umbrella_t observed below each Rain_t]

Transition model:
  R_t-1 | P(R_t = t | R_t-1)
  T     | 0.7
  F     | 0.3

Sensor model:
  R_t   | P(U_t = t | R_t)
  T     | 0.9
  F     | 0.2
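To make the tables concrete, here is a minimal Python sketch (variable and function names are our own, not from the slides) of the umbrella model and the complete joint distribution from the previous slide; the prior P(R_0) = <0.5, 0.5> is taken from the filtering example later in the deck.

```python
# Minimal sketch of the umbrella model. States: True = rain, False = no rain.
prior = {True: 0.5, False: 0.5}        # P(R0), as in the filtering slides
transition = {True: 0.7, False: 0.3}   # P(R_t = true | R_{t-1})
sensor = {True: 0.9, False: 0.2}       # P(U_t = true | R_t)

def joint(states, observations):
    """P(r0..rt, u1..ut) = P(r0) * prod_i P(ri | ri-1) * P(ui | ri).
    `states` has t+1 entries (r0..rt); `observations` has t entries (u1..ut)."""
    p = prior[states[0]]
    for i in range(1, len(states)):
        p_trans = transition[states[i - 1]]
        p_obs = sensor[states[i]]
        p *= (p_trans if states[i] else 1 - p_trans)
        p *= (p_obs if observations[i - 1] else 1 - p_obs)
    return p

print(joint([True, True], [True]))  # P(R0=t, R1=t, U1=t) = 0.5 * 0.7 * 0.9 = 0.315
```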
Inference Tasks
• Filtering or monitoring: P(X_t | e_1, …, e_t)
  computing the current belief state, given all evidence to date
• Example: What is the probability that it is raining today, given all the umbrella observations up through today?
Inference Tasks
• Prediction: P(X_t+k | e_1, …, e_t)
  computing the probability of some future state
• Example: What is the probability that it will rain the day after tomorrow, given all the umbrella observations up through today?
Inference Tasks
• Smoothing: P(X_k | e_1, …, e_t) for k < t
  computing the probability of a past state (hindsight)
• Example: What is the probability that it rained yesterday, given all the umbrella observations through today?
Inference Tasks
• Most likely explanation:
  arg max_{x_1..x_t} P(x_1, …, x_t | e_1, …, e_t)
  given a sequence of observations, find the sequence of states that is most likely to have generated those observations
• Example: if the umbrella appeared on the first three days but not on the fourth, what is the most likely weather sequence to produce these umbrella sightings?
Forward-Backward Algorithm
• Given: transition model P(X_k | X_k-1), sensor model P(E_k | X_k), and prior P(X_0)
• The F-B algorithm computes the smoothed estimate P(X_k | e_1:t)
• Forward pass: compute P(X_k, e_1:k)
• Backward pass: compute P(e_k+1:t | X_k)
• P(X_k | e_1:t) = α P(e_k+1:t | X_k) P(X_k, e_1:k)
Filtering
• We use recursive estimation to compute P(X_t+1 | e_1:t+1) as a function of e_t+1 and P(X_t | e_1:t)
• We can write this as follows:
  P(X_t+1 | e_1:t+1) = P(X_t+1 | e_1:t, e_t+1)
  = α P(e_t+1 | X_t+1, e_1:t) P(X_t+1 | e_1:t)
  = α P(e_t+1 | X_t+1) P(X_t+1 | e_1:t)      (sensor Markov assumption)
  = α P(e_t+1 | X_t+1) Σ_{x_t} P(X_t+1 | x_t) P(x_t | e_1:t)
• This leads to a recursive definition:
  o f_1:t+1 = α FORWARD(f_1:t, e_t+1)
Filtering (the forward algorithm)
P(Rain_1 = t)
  = Σ_{Rain_0} P(Rain_1 = t | Rain_0) P(Rain_0)
  = 0.70 * 0.50 + 0.30 * 0.50 = 0.50
P(Rain_1 = t | Umbrella_1 = t)
  = α P(Umbrella_1 = t | Rain_1 = t) P(Rain_1 = t)
  = α * 0.90 * 0.50 = α * 0.45 ≈ 0.818 (after normalization)
P(Rain_2 = t | Umbrella_1 = t)
  = Σ_{Rain_1} P(Rain_2 = t | Rain_1) P(Rain_1 | Umbrella_1 = t)
  = 0.70 * 0.818 + 0.30 * 0.182 ≈ 0.627
P(Rain_2 = t | Umbrella_1 = t, Umbrella_2 = t)
  = α P(Umbrella_2 = t | Rain_2 = t) P(Rain_2 = t | Umbrella_1 = t)
  = α * 0.90 * 0.627 ≈ α * 0.564 ≈ 0.883 (after normalization)
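The computation above can be checked with a short, hand-rolled sketch of the FORWARD update (function and variable names are our own):

```python
# Forward (filtering) update for the umbrella model; f = (P(rain), P(not rain)).
def normalize(v):
    s = sum(v)
    return [x / s for x in v]

def forward(f, umbrella):
    # Predict: sum over previous states; P(rain|rain)=0.7, P(rain|not rain)=0.3.
    pred_rain = 0.7 * f[0] + 0.3 * f[1]
    pred = [pred_rain, 1 - pred_rain]
    # Update with evidence: P(u=t|rain)=0.9, P(u=t|not rain)=0.2.
    like = [0.9, 0.2] if umbrella else [0.1, 0.8]
    return normalize([like[i] * pred[i] for i in range(2)])

f = [0.5, 0.5]            # P(R0)
f = forward(f, True)      # day 1, umbrella observed
print(f[0])               # ~0.818
f = forward(f, True)      # day 2, umbrella observed
print(f[0])               # ~0.883
```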
Smoothing
• Compute P(X_k | e_1:t) for 0 <= k < t
• Using a backward message b_k+1:t = P(e_k+1:t | X_k), we obtain
  o P(X_k | e_1:t) = α f_1:k × b_k+1:t
• The backward message can be computed as
  b_k+1:t = Σ_{x_k+1} P(e_k+1 | x_k+1) P(e_k+2:t | x_k+1) P(x_k+1 | X_k)
• This leads to a recursive definition:
  o b_k+1:t = BACKWARD(b_k+2:t, e_k+1:t)
Smoothing (forward-backward algorithm)
P(Umbrella_2 = t | Rain_1 = t)
  = Σ_{Rain_2} P(Umbrella_2 = t | Rain_2) P(e_3:2 | Rain_2) P(Rain_2 | Rain_1 = t)
  = 0.9 * 1.0 * 0.7 + 0.2 * 1.0 * 0.3 = 0.69
  (the empty-evidence backward message P(e_3:2 | Rain_2) is 1.0)
P(Rain_1 = t | Umbrella_1 = t, Umbrella_2 = t)
  = α * 0.818 * 0.69 ≈ α * 0.56 ≈ 0.883
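A matching sketch of the backward message and the smoothed estimate (again hand-rolled, with our own names), reproducing 0.69 and ≈0.883:

```python
# Combine forward message f(1:1) = <0.818, 0.182> with backward message b(2:2).
def normalize(v):
    s = sum(v)
    return [x / s for x in v]

def backward(b, umbrella):
    # b'(x_k) = sum_{x_{k+1}} P(e_{k+1}|x_{k+1}) * b(x_{k+1}) * P(x_{k+1}|x_k)
    like = [0.9, 0.2] if umbrella else [0.1, 0.8]
    b_rain = like[0] * b[0] * 0.7 + like[1] * b[1] * 0.3
    b_sun  = like[0] * b[0] * 0.3 + like[1] * b[1] * 0.7
    return [b_rain, b_sun]

f = [0.818, 0.182]              # P(R1 | u1), from the filtering slide
b = backward([1.0, 1.0], True)  # b(2:2), starting from the all-ones message
print(b[0])                     # 0.69
print(normalize([f[i] * b[i] for i in range(2)])[0])  # ~0.883
```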
Recap: Markov Models
• A Markov model is a chain-structured BN!
  o Each node is identically distributed (stationarity).
  o The value of X at a given time is called the state.
  o As a BN: X_1 → X_2 → X_3 → X_4
  o Parameters: called transition probabilities or dynamics; they specify how the state evolves over time (also, initial probabilities).
Conditional Independence
X_1 → X_2 → X_3 → X_4
• Basic conditional independence:
  o Past and future are independent given the present.
  o Each time step only depends on the previous step.
  o This is called the (first-order) Markov property.
• Note that the chain is just a (growing) BN:
  o Can always use generic BN reasoning if we truncate the chain at a fixed length.
Example: Markov Chain
• Weather:
  o States: X = {rain, sun}
  o Transitions: [Diagram: two-state chain; each state keeps its weather with probability 0.9 and switches with probability 0.1]
  o Initial distribution: 0.0 rain, 1.0 sun
  o What's the probability distribution after one step? (See the sketch below.)
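A one-step update for this chain, assuming the symmetric 0.9 stay / 0.1 switch transitions read off the diagram:

```python
p_rain, p_sun = 0.0, 1.0                      # initial distribution
p_rain, p_sun = (0.9 * p_rain + 0.1 * p_sun,  # P(rain) after one step
                 0.1 * p_rain + 0.9 * p_sun)  # P(sun) after one step
print(p_rain, p_sun)                          # 0.1 0.9
```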
Stationary Distributions
• If we simulate the chain long enough:
  o Uncertainty accumulates.
  o Eventually, we have no idea what the state is!
• Stationary distributions:
  o For most chains, the distribution we end up in is independent of the initial distribution.
  o Called the stationary distribution of the chain.
  o Can be used to predict some future events. (See the sketch below.)
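A small sketch of this convergence, reusing the 0.9/0.1 chain assumed on the previous slide: starting from opposite initial beliefs, repeated updates reach the same stationary distribution.

```python
def step(p):
    # One transition of the 0.9-stay / 0.1-switch chain
    return (0.9 * p[0] + 0.1 * p[1], 0.1 * p[0] + 0.9 * p[1])

a, b = (1.0, 0.0), (0.0, 1.0)   # two different starting distributions
for _ in range(100):
    a, b = step(a), step(b)
print(a, b)                     # both converge to ~(0.5, 0.5)
```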
Hidden Markov Models
• Markov chains are not so useful for most agents:
  o Eventually you do not know anything anymore.
  o Need observations to update your beliefs.
• Hidden Markov models (HMMs):
  o Underlying Markov chain over states S.
  o You observe outputs (effects) at each time step.
  o As a Bayes' net: X_1 → X_2 → X_3 → X_4, with each X_i emitting an observation E_i
Example

[Diagram: an HMM with hidden states {rain, sun} and observations {Happy, Grumpy}; transition and emission probabilities (0.4, 0.8, 0.6, 0.2, 0.4, 0.6, 0.9, 0.1) are attached to the edges]
Example
• An HMM is defined by:
  o Initial distribution: P(X_1)
  o Transition probabilities: P(X_t | X_t-1)
  o Emission probabilities: P(E_t | X_t)
Conditional Independence
• HMMs have two important independence properties:
  o Markov hidden process: the future depends on the past via the present.
  o The current observation is independent of all else given the current state.
[Diagram: chain X_1 → … → X_5 with emissions E_1 … E_5]
• Does this mean that observations are independent?
  o [No, they are correlated by the hidden state]
Real HMM Examples
• Speech recognition HMMs:
  o Observations are acoustic signals (continuous valued) or phonemes (discrete valued).
  o States are specific positions in specific words (so, tens of thousands).
• Machine translation HMMs:
  o Observations are words (tens of thousands).
  o States are translation options.
• Robot tracking:
  o Observations are range readings (continuous).
  o States are positions on a map (continuous/discrete).
Example: Robot Localization
(Example from Michael Pfeiffer)
Sensor model: never more than 1 mistake.
Motion model: may not execute action, with small probability.
[Figures: belief over positions (probability scale 0 to 1) at t = 0, t = 1, t = 4, and t = 5]
Kalman Filter
• An optimal (linear) estimator: an optimal recursive data processing algorithm.
• Estimates the parameters of interest from indirect, inaccurate, and uncertain observations.
Bayesian Filter Recap
• Recursive Bayesian filter:
  P(Hypothesis | Data) = P(Hypothesis) × P(Data | Hypothesis) / P(Data)
• At each step, the belief is updated recursively: P(H_t | H_t-1, D_t)
Concept
[Figures: three probability curves over position 0–100: "What robot thinks" (the prediction), "What GPS says" (the measurement), and the "New Estimate" that combines both]
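The figure's idea can be sketched as the fusion of two Gaussians; the precision-weighted update below is the standard form, but the numbers are made up for illustration.

```python
# Fuse the robot's prediction with the GPS measurement (1-D Gaussians).
def fuse(mu1, var1, mu2, var2):
    k = var1 / (var1 + var2)              # weight given to the measurement
    return mu1 + k * (mu2 - mu1), (1 - k) * var1

mu, var = fuse(30.0, 9.0, 45.0, 16.0)     # prediction vs. GPS (made-up values)
print(mu, var)                            # mean in between; variance below both
```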
State and Covariance
Correction
• The Kalman filter is a two-step process: prediction
and correction.
• The correction step makes corrections to an
estimate, based on new information obtained from
sensor measurements.
• The Kalman gain matrix K used for correction of an
estimate x.
36
Contd.
• State prediction: x̄_t = A x_t-1 + B u_t + ε
• Sensor prediction: z̄_t = C x̄_t + ε_z
• Correction: x_est = x̄_t + K (z_t − z̄_t), where K is the Kalman gain (sketched below)
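A minimal 1-D sketch of these predict/correct equations (scalar a, b, c; the noise variances q and r and all numbers are illustrative assumptions):

```python
def kalman_step(x, p, u, z, a=1.0, b=1.0, c=1.0, q=0.25, r=1.0):
    # State prediction: x_bar = a*x + b*u, with predicted variance p_bar
    x_bar = a * x + b * u
    p_bar = a * a * p + q
    # Sensor prediction: z_bar = c * x_bar
    z_bar = c * x_bar
    # Kalman gain, then correction: x_est = x_bar + k * (z - z_bar)
    k = p_bar * c / (c * c * p_bar + r)
    return x_bar + k * (z - z_bar), (1 - k * c) * p_bar

x, p = 0.0, 1.0                         # initial estimate and variance
x, p = kalman_step(x, p, u=1.0, z=1.2)
print(x, p)                             # estimate moves toward the measurement
```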
Linearity and Updating
General temporal model:
• P(x_t+1 | x_t) = N(F x_t, Σ_x)(x_t+1)
• P(z_t | x_t) = N(H x_t, Σ_z)(z_t)
Update of mean and covariance:
• µ_t+1 = F µ_t + K_t+1 (z_t+1 − H F µ_t)
• Σ_t+1 = (I − K_t+1 H)(F Σ_t F^T + Σ_x)
Extended Kalman Filter
• A system is nonlinear if its transition model cannot be described as a matrix multiplication of the state vector
• The EKF models the system as locally linear in x_t in the region near x_t = µ_t
• Not always adequate (e.g., the bird-tracking example, where local linearization fails)
Viterbi Algorithm

[Trellis: states {sun, rain} over four time steps; Viterbi selects the most likely state sequence]
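A sketch of Viterbi on the umbrella model, answering the earlier most-likely-explanation question (umbrella on the first three days, not on the fourth); the structure and names are our own:

```python
T = {True: {True: 0.7, False: 0.3}, False: {True: 0.3, False: 0.7}}  # P(R_t | R_{t-1})
O = {True: 0.9, False: 0.2}                                          # P(U_t = t | R_t)

def viterbi(obs):
    # m[s]: probability of the best state sequence ending in state s
    m = {s: 0.5 for s in (True, False)}        # prior P(R0) = <0.5, 0.5>
    paths = {s: [] for s in (True, False)}
    for u in obs:
        new_m, new_paths = {}, {}
        for s in (True, False):
            e = O[s] if u else 1 - O[s]        # evidence likelihood
            prev = max((True, False), key=lambda q: m[q] * T[q][s])
            new_m[s] = m[prev] * T[prev][s] * e
            new_paths[s] = paths[prev] + [s]
        m, paths = new_m, new_paths
    return paths[max((True, False), key=lambda s: m[s])]

print(viterbi([True, True, True, False]))  # [True, True, True, False]:
                                           # rain the first three days, then no rain
```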
The following slides are a compilation of all questions and answers on Piazza related to this presentation
• Q: What is a stationary process, and can you provide some examples if possible?
  A: In a stationary process, the changes in the world state are governed by laws that do not themselves change over time. Here is an example if you want to read through it: http://en.wikipedia.org/wiki/Bernoulli_scheme
• Q: What does it mean when a transition model is linear or nonlinear? Examples if possible?
  A: If the transition model can be formulated the same way as equation 15.21 in the book, then it is linear; otherwise it is nonlinear.
• Q: When can a Dynamic Bayesian Network be represented by a Kalman filter model, and when can it not be? Also, can you provide examples of both cases?
  A: You can express a Kalman filter as a DBN by using continuous variables and linear-Gaussian conditional distributions, but you can't express every DBN as a Kalman filter, since a DBN allows arbitrary distributions.
• Q: Why would we want to represent a Kalman filter model as a Dynamic Bayesian Network?
  A: "KFMs have been used for problems ranging from tracking planes and missiles to predicting the economy. However, HMMs and KFMs are limited in their “expressive power”. Dynamic Bayesian Networks (DBNs) generalize HMMs by allowing the state space to be represented in factored form, instead of as a single discrete random variable. DBNs generalize KFMs by allowing arbitrary probability distributions, not just (unimodal) linear-Gaussian."
  You can read more here: http://citeseerx.ist.psu.edu/viewdoc/download?rep=rep1&type=pdf&doi=10.1.1.129.7714
• Q: When we limit the number of earlier variables or evidence we rely on (Markov assumption), are those variables involved in calculating the probabilities?
  A: No; remember the conditional independence.
• Q: In the section about filtering, why do we need this condition: "The time and space requirements for updating must be constant if an agent with limited memory is to keep track of the current state distribution over an unbounded sequence"? I also don't understand why, if the state variables are discrete, the time for each update is constant, i.e., independent of t. Isn't the whole problem due to the fact that states are associated with time?
  A: The Kalman filter is a recursive estimator. To compute the estimate for the current state, only two things are needed: (i) the estimated state from the previous time step and (ii) the current measurement. That's why it is independent of t. Now, if the time and space requirements were not constant, then the memory cost of each successive update would grow higher and higher, and we would quickly run out of memory.
• Q: What is the practical use of smoothing? In other words, why would an agent wish to refine a past state estimate when that state is already over?
  A: You must have noticed that when you take a photograph, the picture sometimes turns out noisy or grainy: you see tiny colored spots all over it, appearing as contrasting pixels among the pixels nearby. Say the picture is of a clear sky: a lighter shade of blue towards the horizon that gradually deepens towards the zenith. Because of noise, red, white, and other colored pixels appear haphazardly in the blue part. So the noise filter in Photoshop etc. takes a portion of the picture and looks for anomalies in that region (the actual procedure is much more complex than this). It checks all pixels in that portion: if they are blue-ish, it keeps them; if they are shades of other colors, it turns them blue-ish, depending on the shades of adjacent pixels. So smoothing finds anomalous data in your data collection and then adjusts it. In short, you want to go over all previous states to check for anomalies and correct them. For another example, think of signal transmission and processing, where you need to remove noise from a received signal (electrical pulses) to get a sharper/smoother signal.
• Q: In a Hidden Markov Model (HMM), how do we handle discrete-valued variables with infinitely many possible values?
  A: That's why we have the Markov assumption.
• Q: Can a second-order Markov process be rewritten as first-order?
  A: You can reduce any higher-order Markov process to one with lower order. Here is an example of rewriting a 2nd-order process as a 1st-order one: http://stats.stackexchange.com/questions/2457/markov-process-about-only-depending-on-previous-state