Travel Time Estimation using Approximate Belief States on a Hidden Markov Model
Walid Krichene
Overview
- Context
- Inference on a HMM
- Modeling framework and exact inference
- Approximate Inference: the Boyen-Koller algorithm
- Graph Partitioning
Context
- Mobile Millennium project
- Travel time estimation on an arterial network
- Input data: probe vehicles that send their GPS locations periodically
  - processed using path inference
  - observation = (path, travel time along the path)
Objective
Improve the inference algorithm
- Time complexity is exponential in the size of the network (number of links)
- Possible solution: assume the links are independent
- But this loses the structure of the network
- Need approximate inference that keeps the structure
Graphical Model
- Nodes: random variables
- Conditional independence: x and y are independent conditionally on (n_1, n_2) but not on n_1 alone
Hidden Markov Model
- Hidden variables s_t ∈ {s^1, ..., s^N}
- Observed variables y_t
- (s_0, ..., s_t) is a Markov process
- Hidden variables are introduced to simplify the model
- Interesting because it provides efficient algorithms for inference and parameter estimation
Parametrization of a HMM
- Initial probability distribution: π^i = P(s_0^i)
- Transition matrix: T_{i,j} = P(s_{t+1}^j | s_t^i)
- Observation model: P(y_t | s_t)
- Completely characterizes the HMM: we can compute the probability of any event.
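To make this concrete, here is a minimal Python sketch of such a parametrization. The state count, the matrix values, and the Gaussian observation model are illustrative assumptions, not the project's model.

    import numpy as np

    # Illustrative HMM with N hidden states (assumed values, not project data).
    N = 3
    pi = np.full(N, 1.0 / N)              # initial distribution pi^i = P(s_0 = s^i)
    T = np.array([[0.8, 0.1, 0.1],        # T[i, j] = P(s_{t+1} = s^j | s_t = s^i)
                  [0.2, 0.6, 0.2],
                  [0.1, 0.3, 0.6]])

    def observation_likelihood(y, means=np.array([1.0, 2.0, 4.0]), sigma=0.5):
        """P(y_t | s_t = s^i) for each state i, here an assumed Gaussian model."""
        return np.exp(-0.5 * ((y - means) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))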
Inference
General inference problem: compute P(s_t | y_{0:T})
- Filtering if t = T
- Prediction if t > T
- Smoothing if t < T

Let y = y_{0:T}. Then

P(s_t | y) = P(s_t, y) / P(y) = α(s_t) β(s_t) / Σ_{s_t} α(s_t) β(s_t)

where

α(s_t) := P(y_0, ..., y_t, s_t)
β(s_t) := P(y_{t+1}, ..., y_T | s_t)
Message passing algorithms
Recursive algorithm to compute α(s_t) and β(s_t):
- α(s_{t+1}) = Σ_{s_t} α(s_t) T_{s_t, s_{t+1}} P(y_{t+1} | s_{t+1})
- β(s_t) = Σ_{s_{t+1}} β(s_{t+1}) P(y_{t+1} | s_{t+1}) T_{s_t, s_{t+1}}
- Complexity: O(N^2 T) operations
- α recursion: for every t, there are N possible values of s_{t+1}, and each α(s_{t+1}) requires N multiplications
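A minimal numpy sketch of these recursions, assuming a dense transition matrix and precomputed per-state observation likelihoods; per-step normalization, often used for numerical stability, is omitted to stay close to the formulas.

    import numpy as np

    def forward_backward(pi, T, obs_lik):
        """alpha/beta recursions.

        pi:      (N,) initial distribution
        T:       (N, N) transition matrix, T[i, j] = P(s_{t+1}=j | s_t=i)
        obs_lik: (T_len, N) matrix with obs_lik[t, i] = P(y_t | s_t = i)
        """
        T_len, N = obs_lik.shape
        alpha = np.zeros((T_len, N))
        beta = np.zeros((T_len, N))

        alpha[0] = pi * obs_lik[0]                      # alpha(s_0) = pi(s_0) P(y_0 | s_0)
        for t in range(T_len - 1):                      # alpha(s_{t+1}) = sum_s alpha(s_t) T P(y_{t+1}|s_{t+1})
            alpha[t + 1] = (alpha[t] @ T) * obs_lik[t + 1]

        beta[-1] = 1.0                                  # beta(s_T) = 1 by convention
        for t in range(T_len - 2, -1, -1):              # beta(s_t) = sum_{s_{t+1}} T P(y_{t+1}|s_{t+1}) beta(s_{t+1})
            beta[t] = T @ (beta[t + 1] * obs_lik[t + 1])

        posterior = alpha * beta
        posterior /= posterior.sum(axis=1, keepdims=True)   # P(s_t | y_{0:T})
        return alpha, beta, posterior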
Parameter estimation
Parameters of the HMM: θ = (π, T, η)
- T: transition matrix
- π: initial state probability distribution
- η: parameters of the observation model: P(y_t | s_t, η)

Parameter estimation: maximize the log-likelihood w.r.t. θ:

ln Σ_{s_0} Σ_{s_1} ··· Σ_{s_T} π_{s_0} ∏_{t=0}^{T−1} T_{s_t, s_{t+1}} ∏_{t=0}^{T} P(y_t | s_t, η)
Expectation Maximization algorithm
- E step: estimate the hidden (unobserved) variables given the observed variables and the current estimate of θ
- M step: maximize the likelihood function under the assumption that the latent variables are known (they are "filled in" with their expected values)
Expectation Maximization algorithm
In the case of HMMs:
- T̂_{ij} = Σ_{t=0}^{T−1} ξ(s_t^i, s_{t+1}^j) / Σ_{t=0}^{T−1} γ(s_t^i)
- η̂_{ij} = Σ_{t=0}^{T} γ(s_t^i) y_t^j / Σ_{t=0}^{T} γ(s_t^i)
- π̂_i = α(s_0^i) β(s_0^i) / Σ_{s_0} α(s_0) β(s_0)

where ξ and γ are simple functions of α and β.
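A sketch of the corresponding M-step, computed from the α and β messages of the previous sketch; the one-hot encoding of observations in the η update is an illustrative assumption matching the discrete form of the formula.

    import numpy as np

    def m_step(alpha, beta, T, obs_lik, y_onehot):
        """One EM M-step from the alpha/beta messages.

        y_onehot: (T_len, D) one-hot encoded observations, assuming a discrete
        observation model as in the eta update above.
        """
        T_len, N = alpha.shape
        gamma = alpha * beta
        gamma /= gamma.sum(axis=1, keepdims=True)        # gamma(s_t^i) = P(s_t = i | y_{0:T})

        # xi(s_t^i, s_{t+1}^j) = P(s_t = i, s_{t+1} = j | y_{0:T})
        xi = np.zeros((T_len - 1, N, N))
        for t in range(T_len - 1):
            xi[t] = alpha[t][:, None] * T * (obs_lik[t + 1] * beta[t + 1])[None, :]
            xi[t] /= xi[t].sum()

        pi_hat = gamma[0]                                                  # pi_hat_i
        T_hat = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]           # T_hat_ij
        eta_hat = gamma.T @ y_onehot / gamma.sum(axis=0)[:, None]          # eta_hat_ij
        return pi_hat, T_hat, eta_hat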
Time complexity
O(N^2 T) operations
Modeling framework
- System modeled using a Hidden Markov Model
- L links

Hidden variables
Link l has a discrete state S_t^l ∈ {1, ..., K}
- state of the entire system: S_t = (S_t^1, ..., S_t^L)
- N = K^L possible states
- Markov process: P(S_{t+1} | S_0, ..., S_t) = P(S_{t+1} | S_t)

Observed variables
We observe travel times: random variables whose distributions depend on the state of the links.
Parametrization of the HMM
Transition model

T_t(s^i → s^j) := P(s_{t+1}^j | s_t^i)

Transition matrix of size K^L
Parametrization of the HMM
Observation model
Probability of observing a response y = (l, x_i, x_f, δ) given state s at time t:

O_t(s → y) := P(y_t | s_t) = g_t^{l,s}(δ) × ∫_{x_i}^{x_f} ρ_t^l(x) dx

- g_t^{l,s}: distribution of the total travel time on link l in state s
- ρ_t^l: probability distribution of vehicle locations (results from traffic assumptions)
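As an illustration only, the likelihood could be evaluated as below; the exponential travel-time density g and the uniform location density ρ are placeholder assumptions, not the distributions used in the project.

    import numpy as np

    def observation_likelihood(delta, x_i, x_f, link_length, mean_tt):
        """O_t(s -> y) = g(delta) * integral of rho over [x_i, x_f].

        Placeholder choices: g is an exponential travel-time density with a
        state-dependent mean `mean_tt`, and rho is uniform over the link.
        """
        g = np.exp(-delta / mean_tt) / mean_tt            # g_t^{l,s}(delta), assumed exponential
        rho_integral = (x_f - x_i) / link_length          # integral of a uniform rho_t^l over [x_i, x_f]
        return g * rho_integral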
Assumptions
Processes are assumed time invariant during 1-hour time slices.
Travel time estimation
- Estimate the state of the system
- Estimate the parameters of the models (observation)
- Update the estimation when new responses are observed

Belief State

p_t(s) := P(s_t | y_{0:t})

Probability distribution over the possible states
Travel time estimation
Bayesian tracking of the belief state: forward-backward propagation (O(N^2 T) time).
Can be done in O(N^2):

p_t  --T[·]-->  q_{t+1}  --O_y[·]-->  p_{t+1}
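A sketch of one tracking step in a dense toy representation (in the real model the state space has N = K^L states, which is exactly what makes this step expensive):

    import numpy as np

    def track_step(p_t, T, obs_lik_next):
        """One belief-state update: p_t -> q_{t+1} (transition) -> p_{t+1} (observation)."""
        q_next = p_t @ T                       # T[.] : propagate through the transition model
        p_next = q_next * obs_lik_next         # O_y[.] : weight by P(y_{t+1} | s_{t+1})
        return p_next / p_next.sum()           # renormalize to a probability distribution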
Parameter estimation of the model
- Update the parameters of the probability distribution of vehicle locations: solve

  max Σ_{x ∈ X_t^l} ln ρ_t^l(x)

  where X_t^l are the observed vehicle locations
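For illustration, if ρ_t^l were chosen to be a Beta density over the normalized position on the link (an assumption made only for this sketch), the maximization reduces to a standard MLE fit:

    import numpy as np
    from scipy import stats

    def fit_location_density(locations, link_length):
        """MLE fit of an assumed Beta density for vehicle locations on one link."""
        x = np.asarray(locations) / link_length            # normalize positions to [0, 1]
        x = np.clip(x, 1e-3, 1 - 1e-3)                     # keep samples strictly inside (0, 1)
        a, b, _, _ = stats.beta.fit(x, floc=0, fscale=1)   # maximizes sum of ln rho(x)
        return a, b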
Parameter estimation of the model
- Update the transition matrix: EM algorithm in O(N^2 T) operations
- Exact inference and parameter estimation are done in O(N^2 T) = O(K^{2L} T) time
Computational intractability
Exact inference and the EM algorithm are not tractable: the size of the belief state and of the transition matrix is exponential in the size of the network, and the EM algorithm takes time exponential in L.
- Assume independence of links?
- Use approximate tracking instead, and limit the size of the network?
Approximate Belief State
Choose a family of belief states that have a compact representation.

Factorized belief state
Decompose the process into subprocesses. Approximate the probability of state s by the product of the marginal probabilities of the substates s^c:

p_t(s) = P(s_t | y_{0:t}) ≈ ∏_{c=1}^{C} P(s_t^c | y_{0:t}) = ∏_{c=1}^{C} p̃_t^c(s^c) := p̃_t(s)
Approximate Belief State
Example
A network with 3 links, S = (S^1, S^2, S^3); links 1 and 2 are in cluster 1 and link 3 is in cluster 2.

p_t((0, 1, 1)) = P(S_t = (0, 1, 1) | y_{0:t})
              ≈ P((S_t^1, S_t^2) = (0, 1) | y_{0:t}) P(S_t^3 = 1 | y_{0:t})
              := p̃_t^1((0, 1)) p̃_t^2(1) = p̃_t((0, 1, 1))
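A small numpy sketch of this factorization for the 3-link example, assuming binary link states: marginalize a full belief onto the two clusters and multiply the marginals back together.

    import numpy as np

    # Full belief p_t over S = (S^1, S^2, S^3), binary states, shape (2, 2, 2).
    p_full = np.random.dirichlet(np.ones(8)).reshape(2, 2, 2)

    # Marginals of the two clusters: {links 1, 2} and {link 3}.
    p_cluster1 = p_full.sum(axis=2)            # p~_t^1 over (S^1, S^2)
    p_cluster2 = p_full.sum(axis=(0, 1))       # p~_t^2 over S^3

    # Factorized approximation p~_t(s) = p~_t^1(s^1, s^2) * p~_t^2(s^3).
    p_tilde = p_cluster1[:, :, None] * p_cluster2[None, None, :]

    # Example entry from the slide: state (0, 1, 1).
    print(p_tilde[0, 1, 1], "approximates", p_full[0, 1, 1])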
Approximate Belief State
Perform Bayesian tracking and parameter estimation (the EM algorithm) on each p̃ separately.

Transition model
Assume the state of cluster c at time t+1 only depends on the state of its neighborhood N(c) at time t.
Transition matrix of size K^{|N(c)|} × K^{|S^c|}
Inference
Inference is done on the subprocesses separately:

p̃_t  --T[·]-->  q̂_{t+1}  --O_y[·]-->  p̂_{t+1}  --Π[·]-->  p̃_{t+1}

The new approximate belief state p̃_{t+1} is computed as the product of marginal distributions over the subprocesses.
The observation model and the EM algorithm are the same.
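A toy sketch of one such step for the 3-link example above: propagate, condition on the observation, then project back onto the product of cluster marginals. The full joint transition matrix is used here only to keep the code short; the actual algorithm uses the smaller cluster-level transition matrices described earlier.

    import numpy as np

    def bk_step(p_c1, p_c2, T_full, obs_lik):
        """One Boyen-Koller update for two clusters: (S^1, S^2) and (S^3,).

        p_c1:    (2, 2) marginal over cluster 1,  p_c2: (2,) marginal over cluster 2
        T_full:  (8, 8) transition matrix over the joint state (toy assumption)
        obs_lik: (2, 2, 2) observation likelihood P(y | S)
        """
        # Form the factored joint belief and flatten it over the 8 joint states.
        p_tilde = (p_c1[:, :, None] * p_c2[None, None, :]).reshape(-1)
        # Propagate (T[.]) and condition on the observation (O_y[.]).
        p_hat = (p_tilde @ T_full) * obs_lik.reshape(-1)
        p_hat = (p_hat / p_hat.sum()).reshape(2, 2, 2)
        # Project (Pi[.]): keep only the cluster marginals.
        return p_hat.sum(axis=2), p_hat.sum(axis=(0, 1))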
Time Complexity
Let L′ be the maximum size of the subprocesses, C the number of clusters, and M = K^{L′}.
- Time complexity of inference + parameter estimation (EM) is O(C M^2 T).
- If L′ is fixed:
  - C increases linearly with L
  - the time complexity becomes O(C T) = O(L T): linear in L, as opposed to the original algorithm (O(K^{2L} T))
Approximation error
Use the Kullback-Leibler divergence (relative entropy):

D[p_t, p̃_t] = Σ_i p_t(s^i) ln( p_t(s^i) / p̃_t(s^i) )

The error is shown to be bounded:

D[p_t, p̃_t] ≤ ε (γ/r)^q

for some ε, where each subprocess T_c depends on at most r others, and affects at most q others.
- Smaller error for more independent subprocesses.
- Trade-off between speed and approximation error.
- Define a partitioning of the network into subgraphs such that the clusters depend weakly on each other.
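In the toy setting above, the divergence between the exact and factored beliefs can be computed directly; a minimal sketch:

    import numpy as np

    def kl_divergence(p, p_tilde, eps=1e-12):
        """D[p, p~] = sum_i p(s^i) ln(p(s^i) / p~(s^i))."""
        p = p.reshape(-1)
        p_tilde = p_tilde.reshape(-1)
        return float(np.sum(p * np.log((p + eps) / (p_tilde + eps))))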
Graph Partitioning
We need to cluster the network into sub-graphs that interact weakly (to have a small approximation error).
Use historical observations to define a weighted graph that describes the interaction.

Weighted graph
- Set P of observed paths
- Each path p ∈ P is a sequence of connected links p = (l_{i_1}, ..., l_{i_k})
- Define the weight of edge (i, j):

  w_{i,j} = #{p ∈ P | l_i → l_j in p} / #{p ∈ P | l_i ∈ p}

- Weights are normalized: ∀i, Σ_j w_{i,j} = 1
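A sketch of how these weights could be computed from a list of observed paths, each given as a sequence of link ids; the final row normalization matches the normalization stated above. The function name and data layout are illustrative.

    from collections import defaultdict

    def edge_weights(paths):
        """w[i][j] ~ #(paths where link i is followed by link j) / #(paths containing link i)."""
        contains = defaultdict(int)       # number of paths containing link i
        follows = defaultdict(int)        # number of paths where j directly follows i
        for path in paths:
            for link in set(path):
                contains[link] += 1
            for i, j in zip(path, path[1:]):
                follows[(i, j)] += 1

        w = defaultdict(dict)
        for (i, j), count in follows.items():
            w[i][j] = count / contains[i]
        # Normalize each row so that sum_j w[i][j] = 1, as on the slide.
        for i in w:
            total = sum(w[i].values())
            w[i] = {j: v / total for j, v in w[i].items()}
        return w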
Partitioning the weighted graph
Loss function
Minimize a loss function

L((G_c)_{1≤c≤C}) = Σ_{c,c′} cut(G_c, G_{c′})

where

cut(G_c, G_{c′}) = Σ_{l_i ∈ G_c, l_j ∈ G_{c′}} w_{i,j}

- Does it yield good results?
Partitioning the weighted graph
Minimizing the cut function yields unbalanced clusters
Partitioning the weighted graph
Appropriate loss function
Minimize

L((G_c)_{1≤c≤C}) = Σ_{c,c′} Ncut(G_c, G_{c′})

where

Ncut(G_c, G_{c′}) = cut(G_c, G_{c′}) / ( Σ_{l_i ∈ G_c} w_{i,j} + Σ_{l_i ∈ G_{c′}} w_{i,j} )
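A sketch of evaluating this objective on the weight structure from the earlier sketch, assuming the denominators are the total outgoing weight of each cluster:

    def cut(w, cluster_a, cluster_b):
        """cut(G_a, G_b) = sum of weights of edges going from G_a to G_b."""
        return sum(w[i].get(j, 0.0) for i in cluster_a if i in w for j in cluster_b)

    def ncut(w, cluster_a, cluster_b):
        """Normalized cut, assuming the denominator is the total outgoing weight of both clusters."""
        vol_a = sum(sum(w[i].values()) for i in cluster_a if i in w)
        vol_b = sum(sum(w[i].values()) for i in cluster_b if i in w)
        return cut(w, cluster_a, cluster_b) / (vol_a + vol_b)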
Partitioning the weighted graph
- Normalizing the cut function favors balanced clusters
- Finding the exact solution is NP-hard
- Use the METIS algorithm for an approximate solution
- Post-process the output to obtain connected clusters
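A sketch of this pipeline, assuming the pymetis wrapper for METIS and networkx are available (both are assumptions; any METIS binding would do, and edge weights are omitted for brevity). It also assumes every link appearing in w is listed in `links`.

    import networkx as nx
    import pymetis  # assumed available METIS binding

    def partition_network(links, w, n_clusters):
        """Partition the link graph with METIS, then split each cluster into connected pieces."""
        index = {l: i for i, l in enumerate(links)}
        # Symmetrize the graph: METIS expects an undirected adjacency structure.
        neighbors = {l: set() for l in links}
        for i in w:
            for j in w[i]:
                neighbors[i].add(j)
                neighbors[j].add(i)
        adjacency = [[index[j] for j in neighbors[l]] for l in links]
        _, membership = pymetis.part_graph(n_clusters, adjacency=adjacency)

        # Post-processing: keep only connected clusters.
        G = nx.Graph()
        G.add_nodes_from(links)
        G.add_edges_from((i, j) for i in w for j in w[i])
        clusters = []
        for c in range(n_clusters):
            nodes = [l for l in links if membership[index[l]] == c]
            clusters.extend(list(comp) for comp in nx.connected_components(G.subgraph(nodes)))
        return clusters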
Results of Graph Partitioning
Tested on historical data aggregated over 1-hour time periods, for each day of the week, over 3 months.
I
Geographically connected clusters
I
Connected arteries appear in the same cluster
I
Sections of highway 80 (Bay Bridge) and neighboring links all
appear in the same cluster
Summary
Adapted the BK algorithm to our model and provided a study and description of its steps.
BK is promising because:
- Trade-off between fast computation and approximation error: we can adjust the size of the network to choose speed over accuracy.
- If we limit the size of the subgraphs: polynomial time in the size of the network.
- The error remains bounded in time.
- Possibility of concurrent processing.
- Possibility of short-term prediction (see the sketch below): use the transition matrix learned up to time t_0 and propagate the belief state p_{t_0} up to time t_0 + T:

  p_{t_0}  --T-->  p_{t_0+1}  --T-->  ···  --T-->  p_{t_0+T}
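A sketch of such a prediction in the toy dense representation, repeatedly applying the learned transition matrix to the last filtered belief:

    import numpy as np

    def predict(p_t0, T, horizon):
        """Propagate the belief state through the transition model, with no new observations."""
        beliefs = [p_t0]
        for _ in range(horizon):
            beliefs.append(beliefs[-1] @ T)    # p_{t+1} = p_t T
        return np.array(beliefs)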
Addressed the graph partitioning problem.
The BK algorithm should be tested to evaluate its performance. I started its implementation, and it is being carried on by the arterial team.
Thank you.