State Space Models – a Personal View
Peter Tiňo and Yuan Shen
School of Computer Science
University of Birmingham, UK
State Space Models – a Personal View – p.1/187
Special thanks to ...
Barbara Hammer
Alessio Micheli
Michal Čerňanský
Luba Beňúšková
Jort van Mourik
Igor Farkaš
Ali Rodan
Jochen Steil
Ata Kabán
Nick Gianniotis
Dan Cornford
Manfred Opper
...
State Space Models – a Personal View – p.2/187
Dynamical processes
Imagine a "box" inside which a dynamic process (i.e. things change in
time) takes place. At time t, the "state of the system" is x(t) ∈ X .
The dynamics can be autonomous (e.g. "happens by itself"), or
non-autonomous (e.g. driven by external input).
x(t) = f (x(t − 1)), x(t) = f (u(t), x(t − 1))
The dynamics can be continuous time (e.g. time is taken as a continuous
entity - dynamics can be described e.g. by differential equations), or
discrete time (e.g. changes happen in discrete time steps - can be described
e.g. by iterative maps).
dx(t)/dt = f(x(t)),    dx(t)/dt = f(u(t), x(t))
State Space Models – a Personal View – p.3/187
Dynamical processes - cont’d
The dynamics can be stationary (e.g. the law governing the dynamics is
fixed and not allowed to change), or non-stationary (e.g. the manner in
which things change is itself changing in time).
x(t) = ft (x(t − 1)), x(t) = ft (u(t), x(t − 1))
dx(t)/dt = f_t(x(t)),    dx(t)/dt = f_t(u(t), x(t))
The dynamics can be deterministic (as above), or stochastic (e.g. corrupted
by some form of dynamic noise). One now must work with full
distributions over possible "states" of the system!
p(x(t) | x(t − 1)), p(x(t) | u(t), x(t − 1))
State Space Models – a Personal View – p.4/187
Observing dynamical processes
We can have a "telescope" with which to observe the "box" inside which a
dynamic process is happening, e.g. we can have access to some
coordinates of the dynamics, or a "reasonable" function on them. We will
call such observed entities observations.
y(t) = h(x(t)), y(t) = h(u(t), x(t))
Reading out of the observations can be corrupted by an observational
noise. In this case we are uncertain about the value of observations and can
only have distributions over possible observations.
p(y(t) | x(t)), p(y(t) | u(t), x(t))
State Space Models – a Personal View – p.5/187
The notion of the state space model - part 1
We impose that the dynamic process we are observing is governed by a
dynamic law prescribing how the state of the system evolves in time. The
state captures all that we need to know about the past in order to describe
future evolution of the system.
x(t) = f (x(t − 1)), x(t) = f (u(t), x(t − 1))
dx(t)/dt = f(x(t)),    dx(t)/dt = f(u(t), x(t))
p(x(t) | x(t − 1)), p(x(t) | u(t), x(t − 1))
State Space Models – a Personal View – p.6/187
The notion of the state space model - part 2
We can have access to the dynamics through some function of the states
producing observations.
y(t) = h(x(t)), y(t) = h(u(t), x(t))
p(y(t) | x(t)), p(y(t) | u(t), x(t))
It can be possible to "recover" the states (in some sense - e.g. up to
topological conjugacy) from the observations ("observable states"), or not ("unobservable states").
State Space Models – a Personal View – p.7/187
I only have observations...
In many cases we only have access to observations from the underlying
dynamics and perhaps also to the driving input stream.
In order to "generalize beyond the observations" we must somehow capture
the law describing how the underlying dynamics evolves in time.
In some cases we have a deep theory underpinning such generalization
processes. For example, for deterministic autonomous dynamics observed
through a noiseless coordinate-projection readout, we have Takens'
Theorem guaranteeing (under some rather mild conditions) recovery of the
original dynamic law and its attractor (up to topological conjugacy).
Many "input time-lag window" (finite input memory) approaches take their
inspiration from Takens' Theorem. The situation can get much more
complicated once dynamic (and/or observational) noise is considered.
State Space Models – a Personal View – p.8/187
Step back and start simple...
a finite alphabet of abstract symbols A = {1, 2, ..., A}
possibly infinite strings over A
left-to-right processing of strings
(as opposed to parsing the strings)
State Space Models – a Personal View – p.9/187
Making sense of time series
"the environment"
hmmm, how can I make some sense out of this...
...abbacbaaaccbdaabdcccca...
What is your task?
modelling
prediction
classification
...
State Space Models – a Personal View – p.10/187
Left-to-right processing of sequences
One can attempt to reduce complexity of the processing task
group together histories of symbols that have “the same functionality”
w.r.t. the given task (e.g. next-symbol prediction)
Information processing states (IPS) are equivalence classes over sequences
IPS code what is important in everything we have seen in the past
State Space Models – a Personal View – p.11/187
IPS - discrete state space, inputs, observations
A simple sequence classification example - FSA
IPS - 1, 2
inputs - a, b
outputs - 2 "Grammatical": odd number of b’s, 1 "Non-Grammatical":
even number of b’s (including none)
[FSA diagram: states 1 and 2; symbol a loops on each state, symbol b switches between states 1 and 2]
State Space Models – a Personal View – p.12/187
IPS - a simple example - transducer
translate input streams over {a, b} to sequences over {x, y}
IPS - 1, 2
inputs - a, b
outputs - x, y
[Transducer diagram: states 1 and 2 with transitions labelled a|x, b|y, a|x, b|x]
State Space Models – a Personal View – p.13/187
IPS - the simplest case - finite time lag
The simplest construction of IPS is based on concentrating on the very
recent (finite) past
e.g. definite memory machines, finite memory machines, Markov models
Example:
Only the last 5 input symbols matter when reasoning about what symbol
comes next
... 1 2 1 1 2 3 2 1 2 1 2 4 3 2 1 1
... 2 1 1 1 1 3 4 4 4 4 1 4 3 2 1 1
... 3 3 2 2 1 2 1 2 2 1 3 4 3 2 1 1
All three sequences belong to the same IPS “43211”
State Space Models – a Personal View – p.14/187
Probabilistic framework - Markov model (MM)
What comes next?
... 1 2 1 1 2 3 2 1 2 1 2 4 3 2 1 1 | ?
... 2 1 1 1 1 3 4 4 4 4 1 4 3 2 1 1 | ?
... 3 3 2 2 1 2 1 2 2 1 3 4 3 2 1 1 | ?
finite context-conditional next-symbol distributions
P (s | 11111)
P (s | 11112)
P (s | 11113)
...
P (s | 11121)
...
P (s | 43211)
...
P (s | 44444)
s ∈ {1, 2, 3, 4}
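A minimal sketch (my own illustration, not from the slides) of how such finite-context conditional next-symbol distributions can be estimated by counting; the context length 5 and the alphabet {1, 2, 3, 4} follow the example above, and the uniform fallback for unseen contexts is an assumption.

from collections import defaultdict, Counter

def fit_markov(seq, order=5):
    # count next-symbol occurrences for each length-'order' context
    counts = defaultdict(Counter)
    for t in range(order, len(seq)):
        counts[tuple(seq[t - order:t])][seq[t]] += 1
    return counts

def predict(counts, context, alphabet=(1, 2, 3, 4)):
    c = counts[tuple(context)]
    total = sum(c.values())
    if total == 0:                      # unseen context: fall back to uniform
        return {s: 1.0 / len(alphabet) for s in alphabet}
    return {s: c[s] / total for s in alphabet}

seq = [1, 2, 1, 1, 2, 3, 2, 1, 2, 1, 2, 4, 3, 2, 1, 1]
counts = fit_markov(seq, order=5)
print(predict(counts, [4, 3, 2, 1, 1]))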
State Space Models – a Personal View – p.15/187
Difficult times...
but, who is going to pay for all this?
difficult times, pay for necessary things only!
State Space Models – a Personal View – p.16/187
Markov model
MM are intuitive and simple, but ...
Number of IPS (prediction contexts) grows exponentially fast with
memory length
Large alphabets and long memories are infeasible:
computationally demanding and
very long sequences are needed to obtain sound statistical estimates
of free parameters
State Space Models – a Personal View – p.17/187
Variable memory length MM (VLMM)
Sophisticated implementation of potentially high-order MM
Takes advantage of subsequence structure in data
Saves resources by making memory depth context dependent
Use deep memory only when it is needed
Natural representation of IPS in form of Prediction Suffix Trees
Closely linked to the idea of universal simulation of information sources.
State Space Models – a Personal View – p.18/187
Prediction suffix tree (PST)
IPS are organized on nodes of the PST
Traverse the tree from root towards leaves in reversed order
Complex and potentially time-consuming training
[PST diagram over alphabet {1, 2}: each node i has an associated next-symbol probability distribution P(· | i), e.g. P(· | 12), P(· | 122), P(· | 222); contexts are read backwards in time from the leaves]
State Space Models – a Personal View – p.19/187
Continuous state space, discrete inputs/outputs
[Diagram: a parametrized dynamical system (weights W) driven by the symbol stream ... 1 2 2 3 2 1 1 1 2, with a unit delay feeding the state back and an output readout]
States of the DS form IPS: x(t) = f (u(t), x(t − 1); w)
These states code the entire history of symbols we have seen so far
State Space Models – a Personal View – p.20/187
The power of contractive DS
[Figure: iterated contractions on the region with corners 1-4. Starting from state *, reading symbol 1 contracts the state into region *1, then symbol 4 into *14, then symbol 2 into *142 - each symbol shrinks the image towards the corresponding corner]
State Space Models – a Personal View – p.21/187
Affine input-driven dynamics
x(t) = f(u(t), x(t − 1)) = ρ · x(t − 1) + (1 − ρ) · u(t),
or, more generally,
x(t) = (ρ/L) · x(t − 1) + u(t),    u(t) ∈ {0, 1/L, 2/L, ..., (L−1)/L}^d,
with a scale parameter ρ ∈ (0, 1].
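A hedged sketch of this affine input-driven coding, assuming the general form reconstructs as x(t) = (ρ/L)·x(t−1) + u(t) with u(t) ranging over the grid {0, 1/L, ..., (L−1)/L}^d; the 4-symbol alphabet (L = 2, d = 2), the starting point and the parameter names are my choices.

import numpy as np

def affine_states(symbols, rho=0.5, L=2, d=2):
    """Iterate x(t) = (rho/L) * x(t-1) + u(t); one grid corner per symbol."""
    corners = np.array([[i / L, j / L] for i in range(L) for j in range(L)])
    x = np.full(d, 0.5)                 # start in the centre of the unit cube
    states = []
    for s in symbols:                   # s in {0, ..., L**d - 1}
        x = (rho / L) * x + corners[s]
        states.append(x.copy())
    return np.array(states)

# sequences sharing a suffix end up close to each other
print(affine_states([0, 3, 1, 1, 2]))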
State Space Models – a Personal View – p.22/187
Affine input-driven dynamics - cont’d
[Figure: the coding grid induced by the affine dynamics. For context length n = 1 the unit square is partitioned into cells 1-9; for n = 2 each cell is subdivided into cells 11-99 indexed by the last two symbols]
State Space Models – a Personal View – p.23/187
Mapping sequences with contractions
Mapping symbolic sequences into Euclidean space
Coding strategy is Markovian
Sequences with shared histories of recently observed symbols have close
images
The longer the common suffix of two sequences, the closer they are
mapped to each other
State Space Models – a Personal View – p.24/187
Cluster structure of IPS
The output readout map y(t) = h(x(t)) is often smooth and/or highly
constrained. This imposes the requirement that IPS leading to the
same/similar output should lie "close" to each other.
Hence, it is natural to look at the cluster structure of IPS
Each cluster represents an abstract IPS
If we wish, we can estimate the topology/statistics of state
transitions/symbol emissions for such new IPS from the available data and
test them on a hold-out set
State Space Models – a Personal View – p.25/187
Prediction Machines (PM)
Reduce complexity of IPS by identifying "higher-level" IPS with clusters
(indexed by i) of IPS x(t)
[Diagram: the state space of the DS driven by ... 1 2 2 3 2 1 1 1 2 (with unit delay) is partitioned into clusters i = 1, 2, ..., M; each cluster i has an associated next-symbol probability distribution P(· | i)]
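A sketch (my own, not the authors' code) of the prediction machine construction just described: quantize the state trajectory into M clusters and estimate a next-symbol distribution per cluster; the use of k-means and add-one smoothing are assumptions.

import numpy as np
from sklearn.cluster import KMeans

def fit_prediction_machine(states, symbols, n_clusters=8, n_symbols=4):
    """states[t] is the IPS reached after reading symbols[0..t]."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(states)
    labels = km.predict(states)
    counts = np.ones((n_clusters, n_symbols))      # add-one smoothing
    for c, s_next in zip(labels[:-1], symbols[1:]):
        counts[c, s_next] += 1                     # next-symbol counts per cluster
    probs = counts / counts.sum(axis=1, keepdims=True)
    return km, probs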
State Space Models – a Personal View – p.26/187
Fractal Prediction Machine
[Diagram: Fractal Prediction Machine. The contractive affine dynamics (as on the previous slides: reading symbols 1, 4, 2 contracts the state through regions *1, *14, *142) is driven by the symbol stream (with unit delay); the resulting states are grouped into clusters i = 1, 2, ..., M, each with an associated next-symbol probability distribution P(· | i)]
State Space Models – a Personal View – p.27/187
a link to VLMM
Contractive affine input-driven dynamics is naturally related to VLMM
Memory resources are used efficiently:
many quantization centers cover dense clusters –
sequences with long suffixes
very few centers in sparse areas –
don’t waste memory on subsequences barely present in the data
State Space Models – a Personal View – p.28/187
Chaotic laser
[Figure: (left) laser activity over 1000 time steps; (right) histogram of activity differences with the quantization into symbols 1-4 indicated]
Training sequence: 8000 symbols
Test sequence: 2000 symbols
State Space Models – a Personal View – p.29/187
Fractal representation through affine dynamics
[Figure: state-space representations of the Laser data in the unit square - CBR of Laser data, FPM contexts, PST contexts, MM contexts]
State Space Models – a Personal View – p.30/187
Laser - results (NNL on test set)
[Figure: (top) NNL of FPMs and MMs on Laser data (MM, FPM-L1, FPM-L2); (bottom) NNL of PSTs on Laser data (PST, PST-fg, PST(10), PST(50), PST(100)); NNL plotted against the number of contexts, log scale]
State Space Models – a Personal View – p.31/187
Bible
[Figure: NNL of MMs, FPMs and PST on a test text from Genesis (MM, FPM-L1, FPM-L2, PST), plotted against the number of contexts, log scale]
Training sequence: Old Testament - Genesis
Test sequence: Genesis
State Space Models – a Personal View – p.32/187
IPS - complications (unbounded input memory)
Things can be much more complicated
May need to remember events that happened long time ago
Time lag into the past can be unbounded
[Automaton diagram with states St, A, B, C, D, E, F, G and transitions labelled by symbols 1, 2, 3 - deciding the output may require remembering an event arbitrarily far in the past]
State Space Models – a Personal View – p.33/187
IPS - even more complications ...
In such a state-transition formulation,
a finite number of IPS may not be enough
E.g. when moving from regular languages to context-free languages, if we
want to keep finite state formulation, we need an additional memory with
LIFO organization
Example:
Match the counts
1^n 2^n, n ≥ 1
More complicated: context sensitive, Turing computable
State Space Models – a Personal View – p.34/187
Finite state skeleton ++
There are discrete finite state models for dealing with sequential structures
of increasing (Chomsky) complexity
Apart from the basic finite state transition skeleton they contain additional
memory mechanism of increasing complexity
There are algorithms for learning such representations (deterministic/probabilistic)
from training data, although in general, as with all flexible models fitted on
limited data, this is quite problematic (e.g. Gold)
State Space Models – a Personal View – p.35/187
‘Pure’ state transition formulation
Another possibility is to keep the state transition formulation without an
additional memory mechanism, but allow an infinite number of IPS
(countable, or even uncountable)
E.g. Dynamical recognizers studied by Pollack, Moore, ...
[Diagram: a dynamical recognizer. State transitions: on symbol 1, x → F1(x); on symbol 2, x → F2(x). Starting from the Start state, the string 12122 ends in the Accept region (grammatical), while 221 ends in the Reject region (non-grammatical)]
State Space Models – a Personal View – p.36/187
Continuous state space
State transitions over a continuous state space X can be considered
iterations of a non-autonomous dynamical system on X .
Each map Fi : X → X corresponds to distinct input symbol i ∈ A.
Having a state space with uncountable number of states can lead to Turing
(super-Turing) capabilities, but we need infinite precision of computations
(Siegelmann, Moore,...).
State Space Models – a Personal View – p.37/187
Recurrent Neural Network (Elman)
[Diagram: Elman RNN. Input U feeds the recurrent layer X through weights WRI; Context C (a unit-delayed copy of X) feeds back through WRC; Output Y is read out from X through WOR. The network is driven by the symbol stream ... 1 2 2 3 2 1 1 1 2]
Neurons organized in 4 groups: Input (U), Context (C), Recurrent (X) and
Output (Y)
X+C can be considered an “extended” input to the network
C is an exact copy of X from the previous time step
State Space Models – a Personal View – p.38/187
RNN - parametrization
x(t) = f (u(t), c(t)) = f (u(t), x(t − 1))
y(t) = h(x(t))
xi(t) = σ( Σj W^RI_ij · uj(t) + Σj W^RC_ij · xj(t − 1) + T^R_i )
yi(t) = σ( Σj W^OR_ij · xj(t) + T^O_i )
WRI , WRC , WOR and TR , TO are real-valued connection weight matrices
and threshold vectors, respectively
σ is a sigmoid activation function, e.g. σ(u) = (1 + exp(−u))^{−1}
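A minimal numpy sketch of the state update and readout equations above; the weight shapes and function names are mine, not the authors' code.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def rnn_step(u_t, x_prev, W_RI, W_RC, T_R):
    # x_i(t) = sigma( sum_j W^RI_ij u_j(t) + sum_j W^RC_ij x_j(t-1) + T^R_i )
    return sigmoid(W_RI @ u_t + W_RC @ x_prev + T_R)

def readout(x_t, W_OR, T_O):
    # y_i(t) = sigma( sum_j W^OR_ij x_j(t) + T^O_i )
    return sigmoid(W_OR @ x_t + T_O)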
State Space Models – a Personal View – p.39/187
Attractive sets
To latch a piece of information for a potentially unbounded number of time
steps we need attractive sets
[Diagram: an attractive set implementing the parity of 2's. Two state regions A and B (leading to Accept/Reject); symbol 1 (map F1) keeps the state in its region, symbol 2 (map F2) switches between the regions]
Grammatical: all strings containing an odd number of 2's
State Space Models – a Personal View – p.40/187
Information latching problem
1. To latch an important piece of information for future reference we
need to create an attractive set
2. But derivatives of the state-transition map are small in the
neighborhood of such an attractive set
3. We cannot propagate error information when training RNN via a
gradient-based procedure – derivatives decrease exponentially fast
through time
State Space Models – a Personal View – p.41/187
Curse of long time spans
As soon as we try to create a mechanism for keeping important bits of
information from the past, we lose the ability to set the parameters to the
appropriate values since the error information gets lost very fast
It is actually quite non-trivial (although not impossible) to train a
parametrized state space model (e.g. RNN) beyond finite memory on
non-toy problems
State Space Models – a Personal View – p.42/187
Recall - clusters of IPS are meaningful
[Diagram: non-autonomous DS driven by ... 1 2 2 3 2 1 1 1 2 (with unit delay); the IPS are mapped to the output through a "smooth" map]
The State → Output map is often quite smooth and sometimes monotonic,
e.g. squashed version of a linear transformation
IPS leading to the same/similar output are forced to lie close to each other
in the state space
State Space Models – a Personal View – p.43/187
Clustering IPS
[Diagram: clustering the IPS recovers a finite-state machine - states St, A, B with transitions on symbols 1 and 2 leading to Accept/Reject]
Extracted FSM: all strings containing an odd number of 2's
State Space Models – a Personal View – p.44/187
Neural Prediction Machines (NPM)
[Diagram: Neural Prediction Machine. The recurrent layer X (with input U, context C and unit delay), driven by ... 1 2 2 3 2 1 1 1 2, is clustered; each cluster i = 1, 2, ..., M has an associated next-symbol probability distribution P(· | i)]
State Space Models – a Personal View – p.45/187
Bit of a shock!
There are good reasons for initializing RNN with small weights
(indeed it is a common practice)
State-transition maps of RNN
(with commonly used sigmoid activation functions)
prior to training (small weights) are contractions
Clustering IPS naturally forms Markov states:
we group together neighboring points that are images of sequences with
shared suffixes
RNN without any training is already great!
Markovian architectural bias of RNN
State Space Models – a Personal View – p.46/187
It’s all about contractions...
Given an input symbol s ∈ A at time t, and activations of recurrent units
from the previous step, x(t − 1),
x(t) = fs (x(t − 1)).
This notation can be extended to sequences over A: the recurrent
activations after t time steps are
x(t) = fs1 ...st (x(0)).
If maps fs are contractions with Lip. constant 0 < C < 1,
‖f_vq(x) − f_wq(x)‖ ≤ C^|q| · ‖f_v(x) − f_w(x)‖ ≤ C^|q| · diam(X).
State Space Models – a Personal View – p.47/187
Chaotic laser revisited
[Figure: (a) Markov models on laser - NNL of MM and VLMM vs. number of contexts (5-300); (b) RNN output on laser - NNL of RTRL- and EKF-trained networks vs. training epoch; (c) RTRL-trained NPM on laser, 16 recurrent neurons - NNL vs. epoch and number of contexts; (d) EKF-trained NPM on laser, 16 recurrent neurons - NNL vs. epoch and number of contexts]
State Space Models – a Personal View – p.48/187
Deep Recursion
Production rules:
R → 1R3, R → 2R4, R → e.
Length                Percent in the sequence    Examples
2                     34.5 %                     13, 24
4                     20.6 %                     1243, 1133
6                     13.6 %                     122443, 211344
8                     10.3 %                     22213444
10, 12, 14, ..., 20   each length 3.5 %          121212111333434343
Table 1: Distribution of string lengths. Training sequence
– 6156 symbols, test sequence – 6192 symbols
State Space Models – a Personal View – p.49/187
Deep recursion - results
[Figure: (a) Markov models on context-free language - NNL of MM and VLMM vs. number of contexts (5-300); (b) RNN output on context-free language - NNL of RTRL- and EKF-trained networks vs. training epoch; (c) RTRL-trained NPM on CFL, 16 recurrent neurons; (d) EKF-trained NPM on CFL, 16 recurrent neurons - NNL vs. epoch and number of contexts]
State Space Models – a Personal View – p.50/187
Theoretical grounding
Theorem (Hammer & Tiňo):
Recurrent networks with contractive transition function can be
approximated arbitrarily well on input sequences of unbounded
length by a definite memory machine. Conversely,
every definite memory machine can be simulated by a recurrent
network with contractive transition function.
Hence initialization with small weights induces an
architectural bias into learning with recurrent neural networks.
State Space Models – a Personal View – p.51/187
Learnability (PAC)
Valid also for continuous input/output spaces.
The architectural bias emphasizes one possible region of the weight space
where generalization ability can be formally proved.
Standard RNN are not distribution independent learnable in the PAC sense
if arbitrary precision and inputs are considered.
Theorem (Hammer & Tiňo):
Recurrent networks with contractive transition function with a fixed
contraction parameter fulfill the distribution independent UCED
property and so
unlike general recurrent networks, are distribution independent
PAC-learnable.
State Space Models – a Personal View – p.52/187
Fractal analysis
Complexity of the driving input stream (topological entropy) is directly
reflected in the complexity of the state space activations (fractal
dimension).
Theorem (Tiňo & Hammer):
Recurrent activations inside contractive RNN form fractal clusters the
dimension of which can be bounded by the scaled entropy of the
underlying driving source. The scaling factors are fixed and are given by
the RNN parameters.
State Space Models – a Personal View – p.53/187
Why then bother to train the dynamical part?
[Diagram: non-autonomous DS driven by ... 1 2 2 3 2 1 1 1 2 (with unit delay); the IPS are mapped to the output through a "smooth" map]
Fix the DS part to a contraction.
Only train the simple linear (or piece-wise constant) readout output map.
Can be done efficiently.
Reservoir computation models.
State Space Models – a Personal View – p.54/187
Reservoir Computation
Fix the DS part to:
standard RNN dynamics - Echo State Network (ESN)
- randomized interconnection structure - random interconnection
weights
recurrent spiking network - Liquid State Machine (LSM)
- readout operates on time integrated activations
affine contractions - Fractal Prediction Machine (FPM)
- readout is a collection of spatially distributed multinomials
real bucket of water, bacteria metabolic pathway ...
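A hedged sketch of the reservoir-computation recipe above: a randomly constructed, fixed reservoir (ESN-style) and a linear readout trained by ridge regression. The spectral-radius rescaling, input scaling and ridge regularizer are common choices I am assuming, not prescriptions from the slides.

import numpy as np

rng = np.random.default_rng(0)

def make_reservoir(n_res, n_in, spectral_radius=0.9, input_scale=0.5):
    W = rng.standard_normal((n_res, n_res))
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))  # rescale coupling
    V = input_scale * rng.uniform(-1, 1, (n_res, n_in))
    return W, V

def run_reservoir(W, V, inputs):
    x = np.zeros(W.shape[0])
    states = []
    for u in inputs:                          # inputs: (T, n_in) array
        x = np.tanh(W @ x + V @ u)
        states.append(x.copy())
    return np.array(states)

def train_readout(states, targets, ridge=1e-6):
    # ridge regression: minimize ||states U - targets||^2 + ridge ||U||^2
    X = states
    return np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ targets)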
State Space Models – a Personal View – p.55/187
ESN/LSM - This is great and it ‘works’, BUT ...
Having to rely on random reservoir construction makes me nervous.
- Exactly why is it that we need random elements in reservoir
construction?
- If we need randomness, how can its use be minimized so that “good
reservoirs” are still obtained?
- Is it really just about contractions?
Deterministically constructed simple reservoirs are amenable to
theoretical analysis.
- How is the input stream translated into reservoir activations?
- When are the reservoir representations of historical contexts in the
input stream useful and when not?
State Space Models – a Personal View – p.56/187
Still playing the devil’s advocate...
Where does the “universality” of reservoirs come from?
- High reservoir dimensionality + mild nonlinearity.
- To what extent is "diversity of computational elements in reservoir"
necessary?
- What features of reservoirs are important for their “universality”?
How can we usefully characterize the reservoirs?
- Memory quantification: Short term memory capacity, Fisher
memory.
- Properties of the reservoir coupling (connection matrix): Spectral
radius, sparsity, eigenspectrum.
- Input stream dependent measures: “edge-of-chaos” computations,
kernel separation.
State Space Models – a Personal View – p.57/187
Simple non-randomized reservoirs
[Diagram: three simple reservoir architectures. In each, a dynamical reservoir of N internal units x(t) is driven by the input unit s(t) through weights V, coupled internally by W, and read out to the output unit y(t) through weights U]
State Space Models – a Personal View – p.58/187
Simple reservoir models - some results
[Figure: test NMSE vs. reservoir size (50-200) for ESN, SCR, DLR and DLRB on three tasks - Laser, Henon map, Non-linear Communication Channel]
State Space Models – a Personal View – p.59/187
Simple reservoir models - some results
[Figure: test NMSE vs. reservoir size (50-200) for ESN, SCR, DLR and DLRB on 10th order NARMA, 10th order random NARMA and 20th order NARMA]
State Space Models – a Personal View – p.60/187
Cyclic reservoirs - aperiodic input weight signs
- Need to break the reservoir symmetry.
- Sign patterns can be obtained deterministically.
[Figure: test NMSE vs. reservoir size (50-200) for input-weight sign patterns generated by the schemes PI, Rand, Exp and Logistic, on 20th order NARMA, Laser, Henon map and Non-linear Communication Channel]
State Space Models – a Personal View – p.61/187
Memory capacity
[Jaeger] A measure correlating past events in an i.i.d. input stream with the
network output.
Assume the network is driven by a univariate stationary input signal u(t).
For a given delay k > 0, construct the network with optimal parameters for
the task of outputting u(t − k) after seeing the input stream ...u(t − 1)u(t)
up to time t.
Goodness of fit - squared correlation coefficient between the desired output
u(t − k) and the observed network output y(t)
State Space Models – a Personal View – p.62/187
Memory capacity
MC_k = Cov²(u(t − k), y(t)) / ( Var(u(t)) · Var(y(t)) ),
where Cov denotes the covariance and Var the variance operators. The
short term memory (STM) capacity is then given by:
MC = Σ_{k=1}^∞ MC_k.
[Jaeger] For any recurrent neural network with N recurrent neurons, under
the assumption of i.i.d. input stream, the STM capacity cannot exceed N .
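A sketch estimating MC_k empirically from the definition above: drive the reservoir with an i.i.d. input stream, fit a delay-k readout by least squares and accumulate the squared correlation. The linear reservoir, the uniform input and the washout length are my assumptions.

import numpy as np

rng = np.random.default_rng(1)

def memory_capacity(W, v, T=5000, k_max=40, washout=100):
    u = rng.uniform(-1, 1, T + washout)              # i.i.d. input stream
    x = np.zeros(W.shape[0])
    states = []
    for t in range(T + washout):
        x = W @ x + v * u[t]                         # linear reservoir update
        states.append(x.copy())
    X = np.array(states[washout:])
    mc = 0.0
    for k in range(1, k_max + 1):
        y_target = u[washout - k : washout - k + T]  # delayed input u(t - k)
        w = np.linalg.lstsq(X, y_target, rcond=None)[0]
        y = X @ w
        c = np.corrcoef(y, y_target)[0, 1]
        mc += c ** 2                                 # MC_k = squared correlation
    return mc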
State Space Models – a Personal View – p.63/187
Memory capacity of SCR
[Tiňo & Rodan] Under the assumption of zero-mean i.i.d. input stream, the
memory capacity of linear SCR architecture with N reservoir units can be
made arbitrarily close to N .
Theorem (Tiňo & Rodan): Consider a linear SCR network with reservoir
weight 0 < r < 1 and an input weight vector such that the reservoir
activations are full rank. Then, the SCR network memory capacity is equal
to
MC = N − (1 − r^{2N}).
State Space Models – a Personal View – p.64/187
Cycle reservoir with regular jumps
still a simple deterministic construction...
[Diagram: cycle reservoir with jumps (CRJ), N = 18 units. Units are connected in a unidirectional cycle with weight rc; additional bidirectional jump connections with weight rj link regularly spaced units - two jump-length settings shown in panels (A) and (B)]
State Space Models – a Personal View – p.65/187
CRJ - NARMA(10), laser, speech recognition
NARMA(10):
reservoir model    N = 100             N = 200             N = 300
ESN                0.0788 (0.00937)    0.0531 (0.00198)    0.0246 (0.00142)
SCR                0.0868              0.0621              0.0383
CRJ                0.0619              0.0196              0.0130

Laser:
reservoir model    N = 100             N = 200             N = 300
ESN                0.0128 (0.00371)    0.0108 (0.00149)    0.0089 (0.0017)
SCR                0.0139              0.0112              0.0106
CRJ                0.00921             0.00673             0.00662

Speech recognition:
reservoir model    N = 100             N = 200             N = 300
ESN                0.0296 (0.0063)     0.0138 (0.0042)     0.0092 (0.0037)
SCR                0.0329 (0.0031)     0.0156 (0.0035)     0.0081 (0.0022)
CRJ                0.0281 (0.0032)     0.0117 (0.0029)     0.0046 (0.0021)
State Space Models – a Personal View – p.66/187
Reservoir characterizations
Eigen-decomposition of reservoir coupling
[Figure: eigenvalue spectra of the reservoir coupling matrices in the complex plane (unit disc) for ESN, SCR, CRJ and CRHJ]
State Space Models – a Personal View – p.67/187
Reservoir characterizations
Memory capacity
[Figure: memory capacity profiles - MC_k vs. delay k (log-log scale) for SCR, CRJ and ESN, panels (A)-(D)]
State Space Models – a Personal View – p.68/187
Reservoir characterizations
Pseudo-Lyapunov exponents
[Figure: pseudo-Lyapunov exponent vs. scaling parameter (0-2) for ESN, SCR and CRJ on the NARMA task and the Laser task]
State Space Models – a Personal View – p.69/187
Reservoir kernels for time series
Kernel trick - quantify similarity between two input items (vectors,
sequences,...) in the "feature space" via dot product of their feature space
representations.
In the feature space - it is sufficient to be linear.
State Space Models – a Personal View – p.70/187
Time Series Kernels
Quantify similarities between two time series
Time series - different length, phase shifts ...
Model based kernel - declare two data items (time series) "similar" if
they are likely to have been generated by the same or "similar"
model.
What model to employ?
- known parametric model class or
- "general purpose" temporal filter.
State Space Models – a Personal View – p.71/187
Learning in the Model Space Framework
Each data item is represented by a model that "explains" it;
Learning is formulated in the space of models (function space);
Parametrization-free notions of distances: distance between the
models not their (arbitrary) parametrizations!;
Model class
- flexible enough to represent variety of data items
- sufficiently constrained to avoid overfitting
State Space Models – a Personal View – p.72/187
Time Series Model
N -dimensional continuous state space [−1, 1]N
x(t) = f (x(t − 1), u(t)) = σ(Rx(t − 1) + V u(t) + b),
y(t) = h(x(t)) = W x(t) + a,
x(t) ∈ [−1, 1]N - state vector at time t
u(t) - input time series element at time t
f (·) - state-transition function
y(t) - output of the linear readout from the state space.
The trick: Fix the state transition map, train only the linear readout!
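A sketch of this "trick" in the model-space setting: a shared, fixed reservoir is run over one time series and a separate linear readout (W, a) is fitted for one-step-ahead prediction, so the series is then represented by its readout. Ridge regression and the function names are my assumptions.

import numpy as np

def series_to_readout(series, W_res, V_in, ridge=1e-4):
    """Represent one time series by the readout (W, a) of a shared fixed reservoir."""
    x = np.zeros(W_res.shape[0])
    states, targets = [], []
    for t in range(len(series) - 1):
        x = np.tanh(W_res @ x + V_in @ np.atleast_1d(series[t]))
        states.append(x.copy())
        targets.append(series[t + 1])       # one-step-ahead prediction task
    X = np.column_stack([np.array(states), np.ones(len(states))])  # append bias
    Y = np.array(targets)
    P = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ Y)
    return P[:-1], P[-1]                    # W, a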
State Space Models – a Personal View – p.73/187
Time Series Kernel via Reservoir Models
[Diagram: time series kernel via reservoir models. A deterministic, fixed cycle-reservoir-with-jumps state-transition part (weights rc, rj) is shared by all sequences; only the readout mapping is trained.]
1. For all sequences - the same state transition (reservoir) part;
2. For each time series, train the linear readout on the prediction task;
3. Gaussian kernel in the readout (function) space.
State Space Models – a Personal View – p.74/187
Distance in the Readout Function Space
Two readouts h1 and h2 obtained on two sequences:
L2(h1, h2) = ( ∫_C ‖h1(x) − h2(x)‖² dµ(x) )^{1/2}
where µ(x) is the probability density measure on the input
domain, and C is the integration range.
C = [−1, +1]^N, as we use a recurrent neural network reservoir
formulation with tanh transfer function.
State Space Models – a Personal View – p.75/187
Model Distance with Uniform Measure
Consider two readouts from the same reservoir
h1 (x) = W1 x + a1 ,
h2 (x) = W2 x + a2 .
Since the sigmoid activation function is employed in the
domain of the readout, C = [−1, 1]^N.
L2(h1, h2) = ( ∫_C ‖h1(x) − h2(x)‖² dµ(x) )^{1/2}
           = ( (2^N / 3) Σ_{j=1}^N Σ_{i=1}^O w_{i,j}² + 2^N ‖a‖² )^{1/2}
where w_{i,j} is the (i, j)-th element of W1 − W2, and a = a1 − a2.
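A small sketch of the closed-form readout distance above under the uniform measure on [−1, 1]^N; the 2^N/3 and 2^N factors follow my reconstruction of the formula (unnormalized measure), so treat the overall scaling as an assumption (it is absorbed by the kernel width γ anyway).

import numpy as np

def readout_distance(W1, a1, W2, a2):
    """L2 distance between affine readouts h(x) = W x + a under uniform mu on [-1, 1]^N."""
    N = W1.shape[1]
    dW = W1 - W2
    da = np.atleast_1d(a1 - a2)
    return np.sqrt((2 ** N / 3.0) * np.sum(dW ** 2) + (2 ** N) * np.sum(da ** 2))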
State Space Models – a Personal View – p.76/187
Model Distance vs. Parameter Distance
Squared model distance (up to the scaling factor 2^N):
(1/3) Σ_{j=1}^N Σ_{i=1}^O w_{i,j}² + ‖a‖²,
versus the squared Euclidean distance of the readout parameters:
Σ_{j=1}^N Σ_{i=1}^O w_{i,j}² + ‖a‖².
In the model distance, more importance is given to the
'offset' than to the 'orientation' of the
readout mapping.
State Space Models – a Personal View – p.77/187
Non-uniform State Distribution
For K-component Gaussian mixture model
µ(x) = Σ_{i=1}^K α_i µ_i(x | η_i, Σ_i)
The model distance can be obtained as follows:
L2²(h1, h2) = Σ_{k=1}^K α_k [ trace(W^T W Σ_k) + a^T a + η_k^T W^T W η_k + 2 a^T W η_k ]
Sampling approximation:
L2²(h1, h2) ≈ (1/m) Σ_{i=1}^m ‖h1(x(i)) − h2(x(i))‖².
State Space Models – a Personal View – p.78/187
Three Time Series Kernels
Uniform state distribution
– Reservoir Kernel: RV
Non-uniform state distribution
– Reservoir kernel by sampling: SamplingRV
– Reservoir kernel by Gaussian mixture models:
GMMRV
K(h_i, h_j) = exp( −γ · L2²(h_i, h_j) )
State Space Models – a Personal View – p.79/187
Fisher Kernel
Single model for all data items.
Based on Information Geometry
Fisher vector (score functions): U (s) = ∇λ log P (s|λ)
Metric tensor - Fisher Information Matrix F
K(si, sj) = (U(si))^T F^{−1} U(sj).
State Space Models – a Personal View – p.80/187
Fisher Kernel Based on Reservoir Model
Endowing the readout with a noise model yields a
generative time series model of the form:
x(t) = f (u(t), x(t − 1)),
u(t + 1) = W x(t) + a + ε(t).
Assume the noise model follows a Gaussian
distribution
ε(t) ∼ N(0, σ² I).
State Space Models – a Personal View – p.81/187
Fisher Kernel by Reservoir Models
Partial derivatives of the log likelihood log p(u(1..ℓ)):
U = ∂ log p(u(1..ℓ)) / ∂W
  = ∂ log Π_{t=1}^ℓ P(u(t) | u(1 · · · t − 1)) / ∂W
  = Σ_{t=1}^ℓ [ (u(t) − a) x(t − 1)^T − W x(t − 1) x(t − 1)^T ] / σ².
Fisher kernel (simplified in the usual way - Id matrix instead of
F) for two time series si and sj with scores Ui and Uj:
K(si, sj) = Σ_{o=1}^O Σ_{n=1}^N (Ui ◦ Uj)_{o,n}
State Space Models – a Personal View – p.82/187
Experimental Comparisons
Compared Kernels:
– Autoregressive Kernel: AR
– Fisher Kernel with Hidden Markov Model: Fisher
– Dynamic time warping: DTW
Our Kernels
– Reservoir Kernel: RV
– Reservoir kernel by sampling: SamplingRV
– Reservoir kernel by Gaussian mixture models: GMMRV
– Fisher kernel with reservoir models: FisherRV
State Space Models – a Personal View – p.83/187
Synthetic Data: NARMA
[Figure: sample NARMA time series of orders 10, 20 and 30 over 3000 time steps]
Order 10:  s(t + 1) = 0.3 s(t) + 0.05 s(t) Σ_{i=0}^{9} s(t − i) + 1.5 u(t − 9) u(t) + 0.1
Order 20:  s(t + 1) = tanh( 0.3 s(t) + 0.05 s(t) Σ_{i=0}^{19} s(t − i) + 1.5 u(t − 19) u(t) + 0.01 ) + 0.2
Order 30:  s(t + 1) = 0.2 s(t) + 0.004 s(t) Σ_{i=0}^{29} s(t − i) + 1.5 u(t − 29) u(t) + 0.201
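A sketch generating the order-10 series from the first equation above; the i.i.d. uniform input on [0, 0.5] is a common convention and my assumption here.

import numpy as np

def narma10(T, seed=0):
    rng = np.random.default_rng(seed)
    u = rng.uniform(0.0, 0.5, T)            # driving input (assumed range)
    s = np.zeros(T)
    for t in range(9, T - 1):
        s[t + 1] = (0.3 * s[t]
                    + 0.05 * s[t] * np.sum(s[t - 9:t + 1])   # sum_{i=0}^{9} s(t - i)
                    + 1.5 * u[t - 9] * u[t]
                    + 0.1)
    return u, s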
State Space Models – a Personal View – p.84/187
Multi-Dimensional Scaling Representation
[Figure: 2D multi-dimensional scaling plot with points labelled order 10, order 20 and order 30]
Illustration of MDS on the model distance among reservoir
weights
State Space Models – a Personal View – p.85/187
Robustness to Noise
[Figure: generalization accuracy (30-100%) vs. Gaussian noise level (0-0.5) for RV, AR, Fisher, DTW and NoKernel]
Illustration of the performance of compared kernels with
different noise levels
State Space Models – a Personal View – p.86/187
Benchmark Data
Dataset      Length    Classes    Train    Test
Symbols      398       6          25       995
OSULeaf      427       6          200      242
Oliveoil     570       4          30       30
Lighting2    637       2          60       61
Beef         470       6          30       30
Car          576       4          60       60
Fish         463       8          175      175
Coffee       286       2          28       28
Adiac        176       37         390      391
State Space Models – a Personal View – p.87/187
Results: Classification Accuracy
Dataset      DTW      AR       Fisher    RV        FisherRV    GMMRV    SamplingRV
Symbols      94.77    91.15    94.42     98.08     95.96       97.31    95.77
OSULeaf      74.79    56.61    54.96     69.83     64.59       56.55    63.33
Oliveoil     83.33    73.33    56.67     86.67     83.33       84.00    90.00
Lighting2    64.10    77.05    64.10     77.05     75.41       78.69    80.33
Beef         66.67    78.69    58.00     80.00     68.00       79.67    86.67
Car          58.85    60.00    65.00     76.67     72.33       78.33    86.67
Fish         69.86    60.61    57.14     79.00     74.29       78.00    85.71
Coffee       85.71    100.00   81.43     100.00    92.86       96.43    100.00
Adiac        65.47    64.45    68.03     72.63     71.61       74.94    76.73
Comparison of DTW, AR, Fisher (with hidden Markov models), RV,
FisherRV, GMMRV, and SamplingRV kernels on nine benchmark data sets
by accuracy. The best performance for each data set has been boldfaced.
State Space Models – a Personal View – p.88/187
Results : CPU Time
Dataset      DTW      AR       Fisher    RV     FisherRV    GMMRV    SamplingRV
Symbols      1,318    2,868    2,331     202    236         374      808
OSULeaf      6,030    1,375    3,264     98     111         186      447
Oliveoil     295      113      832       11     19          27       43
Lighting2    918      151      1,143     33     46          61       95
Beef         107      54       87        10     17          23       40
Car          679      442      902       27     42          50       84
Fish         3,353    495      1,998     81     96          159      286
Coffee       21       25       145       3      3           7        19
Adiac        550      8,131    1,122     201    213         394      699
CPU Time (in seconds) of DTW, AR, Fisher (with hidden Markov
models), RV, FisherRV, GMMRV, and SamplingRV kernels on nine
benchmark data sets.
State Space Models – a Personal View – p.89/187
Performance on increasing length of time series
[Figure: (left) generalization accuracy and (right) CPU time (log scale) vs. length of time series (200-1800) for RV, AR, DTW and Fisher kernels]
State Space Models – a Personal View – p.90/187
Multivariate Time Series
Our kernels are applicable to
– multivariate time series
– of variable length
Dataset        dim    length    classes    train    test
Libras         2      45        15         360      585
handwritten    3      60-182    20         600      2258
AUSLAN         22     45-136    95         600      1865
State Space Models – a Personal View – p.91/187
Multivariate Time Series: Results
[Figure: (left) generalization accuracy and (right) CPU time on Libras, handwritten and AUSLAN for AR, Fisher, DTW, RV and SamplingRV kernels]
State Space Models – a Personal View – p.92/187
Online Kernel Construction
Reservoir readouts can be trained in an on-line
fashion, e.g. using Recursive Least Squares.This
enables us to construct and refine reservoir kernels.
w_opt = w0 + k (b − a^T w0)
k = (A0^T A0)^{−1} a / ( λ + a^T (A0^T A0)^{−1} a )
(A0^T A0)^{−1} = [ λ^{−1} I − k a^T ] (A0^T A0)^{−1}
λ: forgetting factor
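A sketch of one recursive-least-squares step in the spirit of the update above, written in the standard textbook form (P plays the role of (A0^T A0)^{−1}); the initialization P = δ^{−1} I and the exact placement of the forgetting factor are my assumptions.

import numpy as np

def rls_update(w, P, a, b, lam=0.99):
    """One RLS step: incorporate regressor a (reservoir state) and target b."""
    Pa = P @ a
    k = Pa / (lam + a @ Pa)               # gain vector
    w = w + k * (b - a @ w)               # w_opt = w0 + k (b - a^T w0)
    P = (P - np.outer(k, a) @ P) / lam    # update of the inverse with forgetting
    return w, P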
State Space Models – a Personal View – p.93/187
Long Time series
PEMS-SF is 15 months (Jan. 1st 2008 to Mar. 30th
2009) of daily data from the California Department
of Transportation PEMS.
The data describes the occupancy rate of different
car lanes of San Francisco bay area freeways.
PEMS-SF with 440 time series of length 138,672.
State Space Models – a Personal View – p.94/187
Results
The computational complexity of our kernels is O(l), where l
is the length of the time series: they scale linearly with the length of
the time series
[Figure: accuracy vs. length of time series used, up to the full length]
1. The RV kernel achieves 86.13% accuracy.
2. The best accuracies reported are 82%-83% for AR, global alignment, spline smoothing and bag-of-vectors kernels.
State Space Models – a Personal View – p.95/187
So do we need adaptive hidden state structure?
Hidden states add flexibility, but make training more difficult.
Can represent and track underlying changes in the generative process.
If the number of states is finite
- hidden Markov models
- switching models
- ...
State Space Models – a Personal View – p.96/187
Finite state set - Hidden Markov Model
Stationary emissions conditional on hidden (unobservable) states.
Hidden states represent basic operating "regimes" of the process.
[Figure: three bags of coloured balls - Bag 1, Bag 2, Bag 3]
State Space Models – a Personal View – p.97/187
Temporal structure - Hidden Markov Model
We have M bags of balls of different colors (Red - R, Green - G).
We are standing behind a curtain and at each point in time we select a bag
j , draw (with replacement) a ball from it and show the ball to an observer.
Color of the ball shown at time t is C(t) ∈ {R, G}. We do this for T time
steps.
The observer can only see the balls, it has no access to the information
about how we select the bags.
Assume: we select bag at time t based only on our selection at the previous
time step t − 1 (1st-order Markov assumption).
State Space Models – a Personal View – p.98/187
HMM - probabilistic model
Probabilistic, finite state and observation spaces version of
x(t) = f (x(t − 1)), y(t) = h(x(t)), hence,
p(x(t) | x(t − 1)), p(y(t) | x(t))
alphabet of A symbols, A = {1, 2, ..., A}.
Consider a set of symbolic sequences, s^(n) = (s^(n)_t)_{t=1:Tn}, n = 1 : N
Assuming independently generated sequences, the likelihood is
L = Π_{n=1}^N p(s^(n)),
p(s^(n)) = Σ_{x ∈ K^{Tn}} p(x_1) Π_{t=2}^{Tn} p(x_t | x_{t−1}) Π_{t=1}^{Tn} p(s^(n)_t | x_t)
State Space Models – a Personal View – p.99/187
Model estimation - if only we knew ...
If we knew which bags were used at which time steps, things would be
very easy! ... just counting
Hidden variables ztj :
ztj = 1, iff bag j was used at time t;
ztj = 0, otherwise.
P(bag_j → bag_k) = Σ_{t=1}^{T−1} z^j_t · z^k_{t+1} / Σ_{q=1}^M Σ_{t=1}^{T−1} z^j_t · z^q_{t+1}    [state transitions]
P(color = c | bag_j) = Σ_{t=1}^T z^j_t · δ(c = C(t)) / Σ_{g∈{R,G}} Σ_{t=1}^T z^j_t · δ(g = C(t))    [emissions]
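A sketch of the counting estimates above for the case where the hidden indicators are known; the toy encoding of bags and colours is mine.

import numpy as np

def estimate_from_indicators(z, colors, n_bags=3, palette=("R", "G")):
    """z[t] = index of the bag used at time t; colors[t] = observed colour."""
    T = len(z)
    trans = np.zeros((n_bags, n_bags))
    emit = np.zeros((n_bags, len(palette)))
    for t in range(T - 1):
        trans[z[t], z[t + 1]] += 1            # count bag_j -> bag_k transitions
    for t in range(T):
        emit[z[t], palette.index(colors[t])] += 1
    trans /= trans.sum(axis=1, keepdims=True)
    emit /= emit.sum(axis=1, keepdims=True)
    return trans, emit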
State Space Models – a Personal View – p.100/187
But we don’t ...
We need to come up with reasonable estimates of probabilities for hidden
events such as:
z^j_t · z^k_{t+1} = 1
at time t we selected bag j and at the next time step we selected bag k
z^j_t · δ(c = C(t)) = 1
at time t we selected bag j and drew a ball of color c
Again, the probability estimates need to be based on observed data D and
our current model of state transition and emission probabilities.
State Space Models – a Personal View – p.101/187
Re-estimate the model
P(z^j_t · z^k_{t+1} = 1 | D, current model) = R^{j→k}_t
P(z^j_t · δ(c = C(t)) = 1 | D, current model) = R^{j,c}_t
Then we can re-estimate the model parameters as follows:
P(bag_j → bag_k) = Σ_{t=1}^{T−1} R^{j→k}_t / Σ_{q=1}^M Σ_{t=1}^{T−1} R^{j→q}_t    [state transitions]
P(color = c | bag_j) = Σ_{t=1}^T R^{j,c}_t / Σ_{g∈{R,G}} Σ_{t=1}^T R^{j,g}_t    [emissions]
State Space Models – a Personal View – p.102/187
Estimating values of the hidden variables
In this case we haven't dealt with the crucial question of how to compute
the estimates for the possible values of the hidden variables, given the observed
data and current model parameters.
This can be done efficiently - the Forward-Backward algorithm.
There is a consistent framework for training any form of latent-variable
model via maximum likelihood: the Expectation-Maximization (E-M)
algorithm.
State Space Models – a Personal View – p.103/187
Learning - can we do better than hand-waving?
Observed data: D
Parameterized model of data items t: p(t|w)
Log-likelihood of w: log p(D|w)
Train via Maximum Likelihood:
w_ML = argmax_w log p(D|w).
State Space Models – a Personal View – p.104/187
Complete data log-likelihood
Observed data: D
Unobserved data: Z (realization of compound hidden variable Z )
Log-likelihood of w: log p(D|w)
Complete data log-likelihood: log p(D, Z|w)
By marginalization:
p(D|w) = Σ_Z p(D, Z|w)
State Space Models – a Personal View – p.105/187
Concave function
[Figure: a concave function F(u); the chord aF(u') + (1 − a)F(u'') lies below F(au' + (1 − a)u''), 0 ≤ a ≤ 1]
For any concave function F : R → R and any u′ , u′′ ∈ R, a ∈ [0, 1]:
F (au′ + (1 − a)u′′ ) ≥ aF (u′ ) + (1 − a)F (u′′ ).
F( Σ_i a_i u_i ) ≥ Σ_i a_i F(u_i),    a_i ≥ 0,  Σ_i a_i = 1
State Space Models – a Personal View – p.106/187
A lower bound on log-likelihood
Pick ‘any’ distribution Q for hidden variable Z .
As usual, Q(Z = Z) will be denoted simply by Q(Z).
log(·) is a concave function and Σ_Z Q(Z) = 1, Q(Z) > 0.
Log-likelihood of w: log p(D|w)
log p(D|w) = log Σ_Z p(D, Z|w)
           = log Σ_Z Q(Z) · p(D, Z|w) / Q(Z)
           ≥ Σ_Z Q(Z) · log( p(D, Z|w) / Q(Z) )
           = F(Q, w)
State Space Models – a Personal View – p.107/187
Maximize the lower bound on log-likelihood
Do ‘coordinate-wise’ ascent on F(Q, w).
[Figure: the surface F(Q, w) over the (Q, w) plane, ascended coordinate-wise]
State Space Models – a Personal View – p.108/187
Maximize F(Q, w) w.r.t. Q (w fixed)
F(Q, w) = − Σ_Z Q(Z) log( Q(Z) / p(Z|D, w) ) + Σ_Z Q(Z) log p(D|w)
        = −DKL[Q(Z)||P(Z|D, w)] + log p(D|w).
Since log p(D|w) is constant w.r.t. Q, we only need to maximize
−DKL [Q(Z)||P (Z|D, w)], which is equivalent to minimizing
DKL [Q(Z)||P (Z|D, w)] ≥ 0.
This is achieved when DKL [Q(Z)||P (Z|D, w)] = 0, i.e. when
Q∗ (Z) = P (Z|D, w) and F(Q, w) = log p(D|w).
The bound is now exact - no longer a loose lower bound!
State Space Models – a Personal View – p.109/187
E-Step
We have:
Q∗ (Z) = P (Z|D, w).
The optimal way for guessing the values of hidden variables Z is to set the
distribution of Z to the posterior over Z , given the observed data D and
current parameter settings w.
Estimation of the posterior P (Z|D, w) is done in the E-Step of the E-M
algorithm.
State Space Models – a Personal View – p.110/187
Maximize F(Q, w) w.r.t. w (Q is fixed to Q∗ )
F(Q∗, w) = Σ_Z Q∗(Z) log p(D, Z|w) − Σ_Z Q∗(Z) log Q∗(Z)
         = EQ∗(Z)[log p(D, Z|w)] + H(Q∗).
Since the entropy of Q∗ , H(Q∗ ), is constant (Q∗ is fixed), we only need to
maximize EQ∗ (Z) [log p(D, Z|w)]:
w∗ = argmax_w EQ∗(Z)[log p(D, Z|w)].
State Space Models – a Personal View – p.111/187
M-Step
We have:
w∗ = argmax_w EQ∗(Z)[log p(D, Z|w)].
The optimal way for estimating the parameters w is to select the parameter
values w∗ that maximize the expected value of the complete data
log-likelihood p(D, Z|w), where the expectation is taken w.r.t. the
posterior distribution over the hidden data Z , P (Z|D, w).
Find a single parameter vector w∗ for all hidden variable settings Z (since
we don’t know the true values of Z ), but while doing this, weight the
importance of each particular setting Z by the posterior probability
P (Z|D, w).
State Space Models – a Personal View – p.112/187
E-M Algorithm
Given the current parameter setting wold do:
E-Step:
Estimate P (Z|D, wold ), the posterior distribution over Z , given the
observed data D and current parameter settings wold .
M-Step:
Obtain new parameter values wnew by maximizing
EP (Z|D,wold ) [log p(D, Z|w)].
Set wold := wnew and go to E-Step.
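A minimal sketch of this E-M loop, instantiated for a mixture of spherical unit-variance Gaussians (a standard choice for illustration; not a model from the slides).

import numpy as np

def em_gmm(T, M=3, n_iter=50, seed=0):
    """T: (N, D) data matrix; spherical Gaussians with unit variance for brevity."""
    rng = np.random.default_rng(seed)
    N, D = T.shape
    mu = T[rng.choice(N, M, replace=False)]       # initial means
    pi = np.full(M, 1.0 / M)
    for _ in range(n_iter):
        # E-step: responsibilities P(j | t_i), the posterior over hidden assignments
        d2 = ((T[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        log_r = np.log(pi) - 0.5 * d2
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: maximize the expected complete-data log-likelihood
        Nk = r.sum(axis=0)
        mu = (r.T @ T) / Nk[:, None]
        pi = Nk / N
    return pi, mu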
State Space Models – a Personal View – p.113/187
The beauty of probabilistic modeling
Think: how could the data have been generated?
– principled formulation
– principled model selection
– coping with missing data
– consistent building of hierarchies
– understanding hierarchies through model responsibilities
– ...
State Space Models – a Personal View – p.114/187
Vector Quantization
Take advantage of the cluster structure in the data ti , i = 1, 2, ..., N . To
minimize the representation error, place the representatives b1 , b2 , ..., bM ,
known as codebook vectors, in the center of each cluster.
[Figure: data points t1, ..., t5 grouped into three clusters, each represented by a codebook vector b1, b2, b3 placed at the cluster centre]
State Space Models – a Personal View – p.115/187
Mixture models
Mixture of Gaussians
Given a set of data points D = {t1 , t2 , ..., tN }, coming from what looks
like a mixture of M Gaussians G1 , G2 , ..., GM , fit an M -component
Gaussian mixture to D.
Mixture model:
p(t) = Σ_{j=1}^M P(j) · p(t|j),
P (j) – mixing coefficient (prior probability) of Gaussian Gj
p(t|j) – ‘probability mass’ given to the data item t by Gaussian Gj .
Generative process: Repeat 2 steps
1. generate j from P (j)
2. generate t from p(t|j)
State Space Models – a Personal View – p.116/187
Constrained Mixture models
A chain of noise models (Gaussians) along a 1-dim "mean data manifold"
Constrained mixture of noise models - spherical Gaussians.
Still p(t) = Σ_{j=1}^M P(j) · p(t|j), but now the Gaussians are forced to have
State Space Models – a Personal View – p.117/187
Smooth embedding of continuous latent space
[Figure: a low-dim continuous latent space ([−1, +1]) smoothly and non-linearly embedded in the high-dim model space; the constrained mixture components sit along the embedded manifold]
State Space Models – a Personal View – p.118/187
Generative Topographic Mapping
[Figure: Generative Topographic Mapping - latent space with centres x_i mapped through an RBF net onto a projection manifold in the data space of points t_n]
State Space Models – a Personal View – p.119/187
GTM - model formulation
Models probability distribution in the (observable) high-dim data
space ℜD by means of low-dim latent variables. Data is visualized in
the latent space H ⊂ ℜL (e.g [−1, 1]2 ).
Latent space H is covered with an array of C latent space centers
xc ∈ H, c = 1, 2, ..., C .
Non-linear GTM map f : H → D is defined using a kernel
regression – M fixed basis functions φj : H → ℜ, collected in φ,
(Gaussians of the same width σ ), D × M matrix of weights W:
f (x) = W φ(x)
State Space Models – a Personal View – p.120/187
GTM - model formulation - Cont’d
GTM creates a generative probabilistic model in the data space by
placing radially-symmetric Gaussians P (t| xc , W, β) (zero mean,
inverse variance β ) around f (xc ), c = 1, 2, ..., C .
Defining a uniform prior over xc , the GTM density model is
P(t | W, β) = (1/C) Σ_{c=1}^C P(t | x_c, W, β)
The data is modelled as a constrained mixture of Gaussians. GTM
can be trained using an EM algorithm.
State Space Models – a Personal View – p.121/187
GTM - Data Visualization
Posterior probability that the c-th Gaussian generated tn ,
r_{c,n} = P(t_n | x_c, W, β) / Σ_{j=1}^K P(t_n | x_j, W, β).
The latent space representation of the point t_n, i.e. the projection of
t_n, is taken to be the mean
Σ_{c=1}^K r_{c,n} x_c
of the posterior distribution on H.
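A sketch computing the responsibilities r_{c,n} and the posterior-mean latent projections from the formulas above, assuming isotropic Gaussians with inverse variance β around the mapped centres f(x_c); the function and argument names are mine.

import numpy as np

def gtm_project(T, centres_latent, centres_data, beta):
    """T: (N, D) data; centres_data = f(x_c): (C, D); centres_latent: (C, L)."""
    d2 = ((T[:, None, :] - centres_data[None, :, :]) ** 2).sum(-1)   # (N, C)
    log_r = -0.5 * beta * d2
    log_r -= log_r.max(axis=1, keepdims=True)
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)            # responsibilities r_{c,n}
    return r @ centres_latent                    # posterior-mean latent projections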
State Space Models – a Personal View – p.122/187
Model fitting - unconstrained mixture
If only we knew ... which data point came from which Gaussian, things
would be very easy.
Collect all data points from D that are known to be generated from G1 in a
set D1 .
Compute ML estimates of the free parameters (mean, covariance matrix)
of G1 :
µ̂1 = (1/|D1|) Σ_{t∈D1} t,
Σ̂1 = (1/|D1|) Σ_{t∈D1} (t − µ̂1)(t − µ̂1)^T.
Do the same for Gaussians G2 , ..., GM .
State Space Models – a Personal View – p.123/187
Indicator variables
If we knew which data point came from which Gaussian, we could
represent this knowledge through indicator variables zji , i = 1, 2, ..., N
(data points), j = 1, 2, ..., M (mixture components - Gaussians):
zji = 1, iff xi was generated by Gj ;
zji = 0, otherwise.
Then,
µ̂j = ( Σ_{i=1}^N z^i_j )^{−1} Σ_{i=1}^N z^i_j · t_i,
Σ̂j = ( Σ_{i=1}^N z^i_j )^{−1} Σ_{i=1}^N z^i_j · (t_i − µ̂j)(t_i − µ̂j)^T.
State Space Models – a Personal View – p.124/187
Compound indicator variable Z
Collect all the indicator variables in a compound (vector) variable
Z = {zji } ∈ {0, 1}N ·M .
Each setting of Z represents one particular situation of assigning points xi
to Gaussians.
This can be anything from trivial settings, such as all points come from
G1 , to very mixed situations.
State Space Models – a Personal View – p.125/187
Other data types =⇒ other noise models
We have already seen that probabilistic framework is convenient - we
can deal with arbitrary data types as long as an appropriate noise
model can be formulated.
The principle will be demonstrated on sequential data, but extensions
to other data types (graphs) are possible.
For sequential data
need noise models that take into account temporal correlations
within sequences, e.g. Markov chains, HMMs, etc.
State Space Models – a Personal View – p.126/187
Latent Trait HMM (LTHMM)
GTM with HMM as the noise model!
For each HMM (latent center) we need to parameterize several
multinomials
initial state probabilities
transition probabilities
emission probabilities (discrete observations)
Multinomials are parameterized through natural parameters.
State Space Models – a Personal View – p.127/187
LTHMM
alphabet of S symbols, S = {1, 2, ..., S}.
Consider a set of symbolic sequences, s^(n) = (s^(n)_t)_{t=1:Tn}, n = 1 : N
With each latent point x ∈ H, we associate a generative distribution
(HMM) over sequences p(s|x).
Assuming independently generated sequences, the likelihood is
L = Π_{n=1}^N p(s^(n)),    p(s^(n)) = (1/M) Σ_{c=1}^M p(s^(n) | x_c).
p(s|x_c) take the form of an HMM with K hidden states:
p(s^(n) | x_c) = Σ_{h ∈ K^{Tn}} p(h_1 | x_c) Π_{t=2}^{Tn} p(h_t | h_{t−1}, x_c) Π_{t=1}^{Tn} p(s^(n)_t | h_t, x_c)
State Space Models – a Personal View – p.128/187
LTHMM - constrained mixture of HMMs
LTHMM parameters are obtained through a parameterized smooth
non-linear mapping from the latent space into the HMM parameter space.
g(.) is the softmax function (the canonical inverse link function of
multinomial distributions)
gk (.) is the k -th component returned by the softmax,
gk(a1, a2, ..., aℓ) = e^{a_k} / Σ_{i=1}^ℓ e^{a_i},    k = 1, 2, ..., ℓ,
Free parameters: A^(π) ∈ R^{K×B}, A^(T_l) ∈ R^{K×B} and A^(B_k) ∈ R^{S×B}
State Space Models – a Personal View – p.129/187
From latent points to (local) noise models
π_c = {p(h_1 = k | x_c)}_{k=1:K} = {g_k(A^(π) φ(x_c))}_{k=1:K}
T_c = {p(h_t = k | h_{t−1} = l, x_c)}_{k,l=1:K} = {g_k(A^(T_l) φ(x_c))}_{k,l=1:K}
B_c = {p(s^(n)_t = s | h_t = k, x_c)}_{s=1:S,k=1:K} = {g_s(A^(B_k) φ(x_c))}_{s=1:S,k=1:K}
State Space Models – a Personal View – p.130/187
LTHMM - parameterization
[Figure: latent points x and x + dx in [−1, +1]² mapped onto local noise models p(·|x) and p(·|x + dx) on the manifold M ⊂ H]
2-dim manifold M of local noise models (HMMs) p(·|x) parameterized
by the latent space through a smooth non-linear mapping.
M is embedded in manifold H of all noise models of the same form.
State Space Models – a Personal View – p.131/187
LTHMM - metric properties
Latent coordinates x are displaced to x + dx.
How different are the corresponding noise models (HMMs)?
Need to answer this in a parameterization-free manner...
Local Kullback-Leibler divergence can be estimated by
D[p(s|x) || p(s|x + dx)] ≈ dx^T J(x) dx,
where J(x) is the Fisher Information Matrix
J_{i,j}(x) = −E_{p(s|x)}[ ∂² log p(s|x) / (∂x_i ∂x_j) ],
which acts like a metric tensor on the Riemannian manifold M
State Space Models – a Personal View – p.132/187
LTHMM - Fisher Information Matrix
HMM is itself a latent variable model.
J(x) cannot be analytically determined.
There are several approximation schemes and an efficient algorithm for
calculating the observed Fisher Information Matrix.
State Space Models – a Personal View – p.133/187
Induced metric in data space
Structured data types - careful with the notion of a metric in the data space.
LTHMM naturally induces a metric in the structured data space.
Two data items (sequences) are considered to be close (or similar) if both
of them are well-explained by the same underlying noise model (e.g.
HMM) from the 2-dimensional manifold of noise models.
Distance between structured data items is implicitly defined by the local
noise models that drive topographic map formation.
If the noise model changes, the perception of what kind of data items are
considered similar changes as well.
State Space Models – a Personal View – p.134/187
LTHMM - training
Constrained mixture of HMMs is fitted by Maximum likelihood using an
E-M algorithm
Two types of hidden variables:
which HMM generated which sequence
(responsibility calculations as in mixture models)
within a HMM, what is the state sequence responsible for generating
the observed sequence
(forward-backward-like calculations)
State Space Models – a Personal View – p.135/187
Two illustrative examples
Toy Data
400 binary sequences of length 40 generated from 4 HMMs (2
hidden states) with identical emission structure (the HMMs differed
only in transition probabilities). Each of the 4 HMMs generated 100
sequences.
Melodic Lines of Chorals by J.S. Bach
100 chorales. Pitches are represented in the space of one octave, i.e.
the observation symbol space consists of 12 different pitch values.
State Space Models – a Personal View – p.136/187
Toy data
[Figure: Toy data results. Latent space visualizations colour-coded by state-transition probabilities ("State Transitions"), by emission probabilities ("Emissions") and by the observed Fisher information ("Toy data: info matrix capped at 250"); points are labelled Class 1-4 according to the generating HMM]
State Space Models – a Personal View – p.137/187
Bach chorals
[Figure: latent space visualization of the Bach chorales. Annotated regions include: flats; sharps; infrequent flats/sharps; g-#f-g patterns; tritons; c-#a-#d-d-c; repeated #f notes in g-#f-g patterns; g-#f-#f-#f-g]
State Space Models – a Personal View – p.138/187
Topographic formulation regularizes the model
[Figure: negative log-likelihood per symbol vs. training epoch (0-35), for the training set (o) and the test set (*)]
Evolution of negative log-likelihood per symbol measured on the training
(o) and test (*) sets (Bach chorals).
State Space Models – a Personal View – p.139/187
Topographic organization of eclipsing binaries
Line of sight of the observer is aligned with orbit plane of a two star
system to such a degree that the component stars undergo mutual eclipses.
Even though the light of the component stars does not vary, eclipsing
binaries are variable stars - this is because of the eclipses.
The light curve is characterized by periods of constant light with periodic
drops in intensity.
If one of the stars is larger than the other (primary star), one will be
obscured by a total eclipse while the other will be obscured by an annular
eclipse.
State Space Models – a Personal View – p.140/187
Eclipsing Binary Star - normalized flux
[Figure: an eclipsing binary light curve shown three ways - original lightcurve (flux vs. time, primary eclipse marked), shifted lightcurve, and phase-normalised lightcurve (flux vs. phase)]
State Space Models – a Personal View – p.141/187
Eclipsing Binary Star - the model
[Diagram: geometry of the binary system - primary mass m, companion mass q·m, inclination i, argument of periastron ap]
Parameters:
Primary mass: m (1-100 solar masses)
mass ratio: q (0-1)
eccentricity: e (0-1)
inclination: i (0° - 90°)
argument of periastron: ap (0° - 180°)
log period: π (2-300 days)
State Space Models – a Personal View – p.142/187
Empirical priors on parameters
p(m, q, e, i, ap, π) = p(m)p(q)p(π)p(e|π)p(i)p(ap)
Primary mass density:
p(m) = a × m^b
a = 0.6865, if 0.5 × Msun ≤ m ≤ 1.0 × Msun
    0.6865, if 1.0 × Msun < m ≤ 10.0 × Msun
    3.9,    if 10.0 × Msun < m ≤ 100.0 × Msun
b = −1.4, if 0.5 × Msun ≤ m ≤ 1.0 × Msun
    −2.5, if 1.0 × Msun < m ≤ 10.0 × Msun
    −3.3, if 10.0 × Msun < m ≤ 100.0 × Msun
State Space Models – a Personal View – p.143/187
Empirical priors on parameters
Mass ratio density
p(q) = p1 (q) + p2 (q) + p3 (q)
where
(q−qi )2
pi (q) = Ai × exp(−0.5 s2 )
i
with
A1 = 1.30, A2 = 1.40, A3 = 2.35
q1 = 0.30, q2 = 0.65, q3 = 1.00
s1 = 0.18, s2 = 0.05, s3 = 0.10
log-period density
p(π) = 1.93337 π³ + 5.7420 π² − 1.33152 π + 2.5205,  if π ≤ log10 18
       19.0372 π − 5.6276,                            if log10 18 < π ≤ log10 300
etc.
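The remaining two priors can be transcribed in the same spirit; the constants are copied from above, while the helper names and the decision not to normalize are our own assumptions:

import numpy as np

def mass_ratio_prior(q):
    """Sum of three (unnormalized) Gaussian bumps, constants from the slide."""
    A = np.array([1.30, 1.40, 2.35])
    qc = np.array([0.30, 0.65, 1.00])
    s = np.array([0.18, 0.05, 0.10])
    q = np.atleast_1d(np.asarray(q, dtype=float))
    return np.sum(A * np.exp(-0.5 * ((q[:, None] - qc) / s) ** 2), axis=1)

def log_period_prior(pi):
    """Piecewise polynomial density in the log10 period, constants from the slide."""
    pi = np.asarray(pi, dtype=float)
    cubic = 1.93337 * pi**3 + 5.7420 * pi**2 - 1.33152 * pi + 2.5205
    linear = 19.0372 * pi - 5.6276
    return np.where(pi <= np.log10(18.0), cubic, linear)

print(mass_ratio_prior([0.3, 0.9]), log_period_prior([1.0, 2.0]))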
State Space Models – a Personal View – p.144/187
Flux GTM
Smooth parameterized mapping F from 2-dim latent space into the space
where 6 parameters of the eclipsing binary star model live.
Model light curves are contaminated by additive (Gaussian) observational
noise. This gives a local noise model in the (time, flux)-space.
Each point on the computer screen corresponds to a local noise model and
"represents" observed eclipsing binary star light curves that are well
explained by the local model.
MAP estimation of the mapping F via E-M.
State Space Models – a Personal View – p.145/187
Outline of the model (1)
[Diagram: a coordinate vector [x1, x2] in the latent space V (the computer screen) is mapped by the smooth mapping Γ to a parameter vector θ = Γ(x) on a manifold M in the parameter space ΩM; the physical model is then applied to θ.]
State Space Models – a Personal View – p.146/187
Outline of the model (2)
[Diagram: applying the physical model to θ gives a model light curve fΓ(x) (flux vs. phase) on a manifold J in the regression model space ΩJ; applying Gaussian observational noise turns it into a local noise model p(O|x) on a manifold H in the distribution space ΩH.]
State Space Models – a Personal View – p.147/187
Artificial fluxes - projections
State Space Models – a Personal View – p.148/187
Artificial fluxes - model
State Space Models – a Personal View – p.149/187
Real fluxes - clustering mode
[Figure: six panels (cluster 1 - cluster 6) showing the normalized flux curves assigned by the model to each cluster.]
State Space Models – a Personal View – p.150/187
Off-the-shelf methods may produce nonsense!
[Figure: two example light curves from the dataset (flux plotted against ρ) and a codebook vector produced by an off-the-shelf method; the codebook vector does not resemble a physically plausible light curve.]
State Space Models – a Personal View – p.151/187
Real fluxes - projections + model
[Figure: projections of the real fluxes into the latent space, together with the model parameters visualized across the latent space - primary mass, mass ratio, eccentricity, inclination, argument of periastron, and period.]
State Space Models – a Personal View – p.152/187
General Probabilistic State Space Modeling
Introduction
Autoregressive models
Linear state space models
Dynamical systems
Non-linear state-space models
State Space Models – a Personal View – p.153/187
Definition
A state space model consists of an unobserved state sequence
Xt ∈ X , with time index t = 0, 1, 2, ..., and an observation sequence
Yt ∈ Y , with t = 1, 2, ...;
The spaces X and Y are either open or compact subsets of a
Euclidean space;
The state sequence X0 , X1 , X2 , ... is a Markov chain;
Conditionally on {X0 , X1 , ...}, the Yt ’s are independent and
Yt depends on Xt only;
Graphical representation:
(hidden chain) xt−1 → xt → xt+1 , with each observation yt emitted from the corresponding state xt .
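To make the definition concrete, a minimal sampling sketch follows; the particular linear-Gaussian choices standing in for a0 , at and bt are placeholder assumptions, not part of the definition:

import numpy as np

rng = np.random.default_rng(0)

# Placeholder samplers standing in for the abstract a_0, a_t, b_t.
sample_a0 = lambda: rng.normal(0.0, 1.0)                        # X_0 ~ a_0
sample_at = lambda x_prev: 0.9 * x_prev + rng.normal(0.0, 0.5)  # X_t | X_{t-1}
sample_bt = lambda x: x + rng.normal(0.0, 1.0)                  # Y_t | X_t

def sample_ssm(T):
    """Draw (x_0, ..., x_T) and (y_1, ..., y_T) respecting the Markov /
    conditional-independence structure of the graphical model above."""
    x = [sample_a0()]
    y = []
    for _ in range(T):
        x.append(sample_at(x[-1]))   # state transition
        y.append(sample_bt(x[-1]))   # emission depends on the current state only
    return np.array(x), np.array(y)

states, obs = sample_ssm(10)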
State Space Models – a Personal View – p.154/187
Definition (cont.)
Conditional independence. For example, given Xt , the variables Xt+1 and Yt are
independent of each other;
The observations could include both input variables, Ut ∈ U , and
output variables, Yt ∈ Y ;
As input variables, Ut are usually observed without noise corruption;
Graphical representation:
as above, with an additional (observed) input node ut feeding into both xt and yt at each time step.
State Space Models – a Personal View – p.155/187
Probabilistic Modelling
Initial state density X0 ∼ a0 (x);
State transition densities
Xt |(xt−1 , ut ) ∼ at (x, xt−1 , ut );
Observation densities
Yt |(xt , ut ) ∼ bt (y, xt , ut ).
The joint density of (X0:t , U1:t , Y1:t ) is given by
p(x0:t , u1:t , y1:t ) = a0 (x0 ) ∏_{s=1}^{t} as (xs , us , xs−1 ) bs (ys , xs , us ).
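This factorization translates directly into a sum of log-density terms. In the sketch below the Gaussian stubs for a0 , as and bs are assumptions made only to keep the example runnable:

import numpy as np

def gauss_logpdf(x, mu, sd):
    return -0.5 * np.log(2 * np.pi * sd**2) - 0.5 * ((x - mu) / sd) ** 2

# Stub log-densities standing in for log a_0, log a_s and log b_s.
log_a0 = lambda x0: gauss_logpdf(x0, 0.0, 1.0)
log_a = lambda x, u, x_prev: gauss_logpdf(x, 0.9 * x_prev + 0.1 * u, 0.5)
log_b = lambda y, x, u: gauss_logpdf(y, x + 0.2 * u, 1.0)

def joint_log_density(x, u, y):
    """log p(x_{0:t}, u_{1:t}, y_{1:t}) = log a_0(x_0)
       + sum_s [ log a_s(x_s, u_s, x_{s-1}) + log b_s(y_s, x_s, u_s) ]."""
    lp = log_a0(x[0])
    for s in range(1, len(x)):
        lp += log_a(x[s], u[s - 1], x[s - 1]) + log_b(y[s - 1], x[s], u[s - 1])
    return lp

print(joint_log_density(x=[0.0, 0.5, 0.4], u=[1.0, -1.0], y=[0.6, 0.3]))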
State Space Models – a Personal View – p.156/187
Probabilistic Modelling (cont.)
The joint density factorizes into a product of terms containing only
pairs of variables −→
The joint process (Xt , Yt ) is also Markovian;
The observations Yt alone are not Markovian −→
The dependence structure of Yt could be more complicated than for
a Markov process;
The conditional density p(x0:t |y1:t ) is proportional to the joint
density p(x0:t , y1:t ) −→
Conditionally on Y1:t = y1:t , the state variables are still Markovian
(so-called conditional Markov process);
Densities p(xt |y1:t ) store all information about the conditional
Markov process up to time t.
The IPS are densities over the state space X !!!
State Space Models – a Personal View – p.157/187
Statistical Inference
The main tasks we need to solve are
1. Inference about states xs based on a stretch of observed values yq:t
for a given model (i.e. at and bt known);
2. Inference about unknown parameters in at and bt .
Inference about Xs given y1:t is called
1. prediction if s > t,
2. filtering if s = t,
3. smoothing if s < t.
State Space Models – a Personal View – p.158/187
Autoregressive Models
Autoregressive models for univariate time series y1 , y2 , ...:
Autoregressive (AR) models;
Autoregressive-moving average (ARMA) models
Autoregressive-moving average with exogenous terms (ARMAX)
models
State Space Models – a Personal View – p.159/187
AR Models
The model: yt = φ1 yt−1 + φ2 yt−2 + ... + φp yt−p + wt ,
where φ1 , φ2 , ..., φp are constants and wt is a Gaussian noise series
with mean zero and variance σw² ;
Define xt = (xt , xt−1 , ..., xt−p+1 )⊺ ;
Define wt = (wt , 0, ..., 0)⊺ and H = (1, 0, ..., 0);
The state equation: xt+1 = F xt + wt with
F = [ φ1   φ2   · · ·   φp−1   φp ]
    [ 1    0    · · ·   0      0  ]
    [ 0    1    · · ·   0      0  ]    ∈ R^{p×p}
    [ ...                 ...      ]
    [ 0    0    · · ·   1      0  ]
The observation equation: yt = Hxt
The state evolution is ‘contaminated’ by noise!
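A minimal sketch of this construction (function names and example coefficients are ours): build the companion matrix F and iterate the state and observation equations.

import numpy as np

def ar_companion(phi):
    """Build F (p x p companion matrix) and H = (1, 0, ..., 0) for an AR(p) model."""
    p = len(phi)
    F = np.zeros((p, p))
    F[0, :] = phi                  # first row carries phi_1, ..., phi_p
    F[1:, :-1] = np.eye(p - 1)     # shifted identity passes the lagged values down
    H = np.zeros(p); H[0] = 1.0
    return F, H

def simulate_ar(phi, sigma_w, T, rng=np.random.default_rng(0)):
    F, H = ar_companion(phi)
    x = np.zeros(len(phi))
    ys = []
    for _ in range(T):
        w = np.zeros(len(phi)); w[0] = rng.normal(0.0, sigma_w)  # w_t = (w_t, 0, ..., 0)
        x = F @ x + w          # state equation x_{t+1} = F x_t + w_t
        ys.append(H @ x)       # observation equation y_t = H x_t
    return np.array(ys)

y = simulate_ar(phi=[0.5, -0.3], sigma_w=1.0, T=200)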
State Space Models – a Personal View – p.160/187
ARMA Models
Consider the univariate ARMA(2, 1) model:
yt = φ1 yt−1 + φ2 yt−2 + θ wt−1 + wt ,
where φ1 , φ2 , and θ are constants, and the wt are Gaussian noise with
zero mean and variance σw² ;
Define the state xt = (xt , x̃t )⊺ ;
State equation:
xt = [ φ1  1 ] xt−1 + [ 1 ] wt ;
     [ φ2  0 ]        [ θ ]
Define H = (1, 0) → Observation equation: yt = H xt ;
Verification:
xt = φ1 xt−1 + x̃t−1 + wt
x̃t = φ2 xt−1 + θ wt   ⟹   x̃t−1 = φ2 xt−2 + θ wt−1
⟹   xt = φ1 xt−1 + φ2 xt−2 + θ wt−1 + wt
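The verification can also be checked numerically: driving the state-space form and the direct ARMA(2, 1) recursion with the same noise sequence should give identical outputs. A sketch under that assumption (coefficients chosen arbitrarily):

import numpy as np

phi1, phi2, theta, T = 0.5, -0.3, 0.4, 50
rng = np.random.default_rng(1)
w = rng.normal(0.0, 1.0, size=T + 1)
w[0] = 0.0                      # w_0 := 0 so both recursions share initial conditions

# State-space form: x_t = [[phi1, 1], [phi2, 0]] x_{t-1} + (1, theta)^T w_t,  y_t = x_t[0]
F = np.array([[phi1, 1.0], [phi2, 0.0]])
G = np.array([1.0, theta])
x, y_ss = np.zeros(2), np.zeros(T + 1)
for t in range(1, T + 1):
    x = F @ x + G * w[t]
    y_ss[t] = x[0]

# Direct ARMA(2,1) recursion: y_t = phi1 y_{t-1} + phi2 y_{t-2} + theta w_{t-1} + w_t
y_ar = np.zeros(T + 1)
for t in range(1, T + 1):
    y_prev2 = y_ar[t - 2] if t >= 2 else 0.0
    y_ar[t] = phi1 * y_ar[t - 1] + phi2 * y_prev2 + theta * w[t - 1] + w[t]

print(np.allclose(y_ss, y_ar))  # True: the two formulations generate the same series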
State Space Models – a Personal View – p.161/187
ARMAX model
A univariate ARMAX(p, q) model (p > q):
yt = Γ ut + Σ_{j=1}^{p} φj yt−j + Σ_{j=1}^{q} θj wt−j + wt
State equation:
xt+1 = F xt + (1, θ1 , ..., θq , 0, ..., 0)⊺ wt + (Γ, 0, ..., 0)⊺ ut ,
where F ∈ R^{p×p} has (φ1 , φ2 , ..., φp )⊺ as its first column, ones on the
superdiagonal, and zeros elsewhere.
Observation equation: yt = (1, 0, 0, ..., 0) xt
State Space Models – a Personal View – p.162/187
Linear State Space Model
Definition
Kalman Filtering
Kalman Smoothing
Parameter Estimation
Examples
State Space Models – a Personal View – p.163/187
Definition
Linear-Gaussian assumption:
a0 (x) = N (x; µ0 , Σ0 )
at (x, ut , xt−1 ) = N (x; Fxt−1 + Υut , Σ)
bt (y, ut , xt ) = N (y; Hxt + Γut , R)
where xt and µ0 are p × 1 vectors, yt is a q × 1 vector, ut is an r × 1 vector,
F, Σ0 and Σ are p × p matrices, R is a q × q matrix, H is a q × p matrix, Υ is
a p × r matrix, and Γ is a q × r matrix.
x0 = µ0 + w0 ; w0 ∼ N (w0 ; 0, Σ0 )
xt = Fxt−1 + Υut + wt ; wt ∼ N (wt ; 0, Σ)
yt = Hxt + Γut + vt ; vt ∼ N (vt ; 0, R)
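A minimal simulator for this linear-Gaussian model, assuming nothing beyond the three equations above (the function name and the toy matrices are ours):

import numpy as np

def simulate_lgssm(F, Ups, H, Gam, Sigma0, Sigma, R, mu0, u, rng=np.random.default_rng(0)):
    """Draw x_{0:T}, y_{1:T} from the linear-Gaussian model above; u has shape (T, r)."""
    T = u.shape[0]
    p, q = F.shape[0], H.shape[0]
    x = mu0 + rng.multivariate_normal(np.zeros(p), Sigma0)       # x_0 = mu_0 + w_0
    xs, ys = [x], []
    for t in range(T):
        x = F @ x + Ups @ u[t] + rng.multivariate_normal(np.zeros(p), Sigma)
        y = H @ x + Gam @ u[t] + rng.multivariate_normal(np.zeros(q), R)
        xs.append(x); ys.append(y)
    return np.array(xs), np.array(ys)

# toy sizes: p = 2, q = 1, r = 1
xs, ys = simulate_lgssm(
    F=np.array([[0.9, 0.1], [0.0, 0.8]]), Ups=np.array([[0.5], [0.0]]),
    H=np.array([[1.0, 0.0]]), Gam=np.array([[0.2]]),
    Sigma0=np.eye(2), Sigma=0.1 * np.eye(2), R=np.array([[0.5]]),
    mu0=np.zeros(2), u=np.ones((50, 1)))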
State Space Models – a Personal View – p.164/187
Kalman Filtering
Recall that the conditional densities p(xt |y1:t ) do store all
information about the state space model up to time t −→
p(xt |y1:t ) = N (xt ; mt|t , Pt|t );
At t = 0, mt|t = µ0 and Pt|t = Σ0 ;
At time t, predictive mean and covariance matrix:
mt|t−1 = E[xt |y1:t−1 ]
       = E[Fxt−1 + Υut + wt |y1:t−1 ]
       = Fmt−1|t−1 + Υut
Pt|t−1 = E[(xt − mt|t−1 )(xt − mt|t−1 )⊺ ]
       = E{ [F(xt−1 − mt−1|t−1 ) + wt ][F(xt−1 − mt−1|t−1 ) + wt ]⊺ }
       = FPt−1|t−1 F⊺ + Σ
State Space Models – a Personal View – p.165/187
Conditional Gaussians
If µ and Σ are partitioned as follows:
µ = ( µ1 )      Σ = ( Σ11  Σ12 )
    ( µ2 )          ( Σ21  Σ22 )
then the distribution of x1 conditional on x2 = a has
mean vector
µ̄ = µ1 + Σ12 (Σ22 )−1 (a − µ2 )
and covariance matrix
Σ̄ = Σ11 − Σ12 (Σ22 )−1 Σ21 .
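This formula is easy to wrap as a small utility; the sketch below (ours) conditions an arbitrary block of a joint Gaussian on the remaining block:

import numpy as np

def condition_gaussian(mu, Sigma, idx1, idx2, a):
    """Mean and covariance of x[idx1] given x[idx2] = a, for x ~ N(mu, Sigma)."""
    mu1, mu2 = mu[idx1], mu[idx2]
    S11 = Sigma[np.ix_(idx1, idx1)]; S12 = Sigma[np.ix_(idx1, idx2)]
    S21 = Sigma[np.ix_(idx2, idx1)]; S22 = Sigma[np.ix_(idx2, idx2)]
    K = S12 @ np.linalg.inv(S22)               # plays the role of a "gain" matrix
    return mu1 + K @ (a - mu2), S11 - K @ S21  # conditional mean and covariance

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
m, P = condition_gaussian(mu, Sigma, idx1=[0], idx2=[1], a=np.array([1.5]))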
State Space Models – a Personal View – p.166/187
Kalman Filtering (cont.)
The innovations:
ǫt = yt − E [yt |y1:t−1 ] = yt − Γut − Hmt|t−1
Conditional on ǫt , the conditional Gaussian N (xt ; mt|t , Pt|t ) can be
derived from the joint Gaussian N (xt , ǫt ; mxt ,ǫt , Pxt ,ǫt );
E [ǫt ] = 0;
var(ǫt ) = HPt|t−1 H⊺ + R;
cov(xt , ǫt |y1:t−1 ) = Pt|t−1 H⊺ ;
mt|t = mt|t−1 + Kt ǫt
Pt|t = Pt|t−1 − Kt HPt|t−1
The Kalman gain: Kt = (cov(xt , ǫt |y1:t−1 )) · (var(ǫt ))−1 .
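Putting the prediction and update together, one filter step might look as follows; this is a sketch of the standard recursion, not code taken from the slides, and the toy matrices at the end are arbitrary:

import numpy as np

def kalman_step(m, P, y, F, H, Sigma, R, Ups=None, Gam=None, u=None):
    """One predict/update cycle: (m_{t-1|t-1}, P_{t-1|t-1}) -> (m_{t|t}, P_{t|t})."""
    # predict
    m_pred = F @ m + (Ups @ u if Ups is not None else 0.0)
    P_pred = F @ P @ F.T + Sigma
    # innovation and its covariance
    eps = y - H @ m_pred - (Gam @ u if Gam is not None else 0.0)
    S = H @ P_pred @ H.T + R
    # Kalman gain and update
    K = P_pred @ H.T @ np.linalg.inv(S)
    m_filt = m_pred + K @ eps
    P_filt = P_pred - K @ H @ P_pred
    return m_filt, P_filt, m_pred, P_pred

# toy 2-state / 1-observation example
F = np.array([[1.0, 1.0], [0.0, 1.0]]); H = np.array([[1.0, 0.0]])
m, P, _, _ = kalman_step(np.zeros(2), np.eye(2), np.array([0.7]),
                         F, H, Sigma=0.1 * np.eye(2), R=np.array([[1.0]]))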
State Space Models – a Personal View – p.167/187
Kalman Filtering (cont.)
Interpretation of Kt :
[Figure: geometric illustration of one filter step in the (x, y) plane - the filtered estimate mt|t (with covariance Pt|t ) is propagated to the prediction mt+1|t (with covariance Pt+1|t ); the innovation εt+1 , whose spread involves the noise covariances Σ and R, is scaled by the gain to give the correction Kεt+1 and the updated estimate mt+1|t+1 (with covariance Pt+1|t+1 ).]
State Space Models – a Personal View – p.168/187
Kalman Smoothing
Given the observation sequence {y1 , ..., yT }:
p(xt |y1:T ) = N (xt ; mt|T , Pt|T );
The conditional means mt−1|T and mt|T are those values that
maximize the joint Gaussian density p(xt−1 , xt |y1:T );
The conditional joint density p(xt−1 , xt |y1:T ) has the forms
p(xt−1 , xt |y1:T ) ∝ p(xt−1 |y1:t−1 ) · p(xt |xt−1 );
Suppose we have mt|T available; then mt−1|T is found by maximizing
N (xt−1 ; mt−1|t−1 , Pt−1|t−1 ) · N (mt|T ; Fxt−1 , Σ);
State Space Models – a Personal View – p.169/187
Kalman Smoothing(cont.)
mt−1|T = mt−1|t−1 + Jt−1 (mt|T − mt|t−1 );
Pt−1|T = Pt−1|t−1 + Jt−1 (Pt|T − Pt|t−1 )J⊺t−1 ;
Jt−1 = Pt−1|t−1 F⊺ (Pt|t−1 )−1 ;
Interpretation of Jt−1 :
Recall that Pt|t−1 = FPt−1|t−1 F⊺ + Σ;
Note that Jt−1 is the identity matrix if F is the identity matrix and Σ = 0.
There are many equivalent forms of the smoother in the literature. They
differ numerically with respect to speed and accuracy.
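A sketch of one backward (smoothing) step implementing these three equations; the function name and the toy numbers are ours:

import numpy as np

def rts_smoother_step(m_filt, P_filt, m_pred_next, P_pred_next,
                      m_smooth_next, P_smooth_next, F):
    """One backward pass: given filtered quantities at t-1 and smoothed quantities
    at t, return m_{t-1|T}, P_{t-1|T} via the smoother gain J_{t-1}."""
    J = P_filt @ F.T @ np.linalg.inv(P_pred_next)          # J_{t-1}
    m_smooth = m_filt + J @ (m_smooth_next - m_pred_next)
    P_smooth = P_filt + J @ (P_smooth_next - P_pred_next) @ J.T
    return m_smooth, P_smooth

# toy scalar example (arbitrary numbers)
F = np.array([[0.9]])
m, P = rts_smoother_step(np.array([0.2]), np.array([[1.0]]),
                         np.array([0.18]), np.array([[1.1]]),
                         np.array([0.25]), np.array([[0.9]]), F)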
State Space Models – a Personal View – p.170/187
Parameter Estimation
The parameter set to estimate is Θ = {F, Σ, R};
The incomplete data likelihood
−2 log LY (Θ) = Σ_{t=1}^{T} log |var(ǫt (Θ))| + Σ_{t=1}^{T} ǫt (Θ)⊺ var(ǫt (Θ))−1 ǫt (Θ);
The complete data likelihood
−2 log LX,Y (Θ) = log |Σ0 | + (x0 − µ0 )⊺ (Σ0 )−1 (x0 − µ0 )
                + log |Σ| + Σ_{t=1}^{T} (xt − Fxt−1 )⊺ Σ−1 (xt − Fxt−1 )
                + log |R| + Σ_{t=1}^{T} (yt − Hxt )⊺ R−1 (yt − Hxt );
State Space Models – a Personal View – p.171/187
Parameter Estimation (cont.)
The Q function at iteration j:
Q(Θ|Θ(j−1) ) = E[ −2 log LX,Y (Θ) | y1:T , Θ(j−1) ],
which provides an upper bound on −2 log LY (Θ);
The algorithm
1. Initialize Θ(0) and set j = 1;
2. Compute log LY (Θ(j−1) );
3. (E-Step) Perform Kalman filtering and smoothing using Θ(j−1) ;
4. (M-step) Use the smoothed estimates (mt|T , Pt|T , ...) to
   compute Θ(j) that maximizes Q(Θ|Θ(j−1) );
5. Repeat Steps 2 - 4 until log LY (Θ(j−1) ) converges.
State Space Models – a Personal View – p.172/187
Example
xt = φ xt−1 + wt
yt = xt + vt ,
with φ = 0.8, σw² = 1.0, and σv² = 1.0.
[Figure: a simulated state sequence and the corresponding noisy data plotted over 100 time steps.]
State Space Models – a Personal View – p.173/187
Example (cont.)
[Figure: predictive, filtered, and smoothed estimates of the state over the first 10 time steps.]
State Space Models – a Personal View – p.174/187
Example (cont.)
[Figure: left - log-likelihood as a function of iteration; right - evolution of the parameter estimates φ, σw² , and σv² over the iterations.]
State Space Models – a Personal View – p.175/187
Non-linear state space models
Extended Kalman filtering
Non-linear filtering
Non-linear smoothing
Parameter estimation
Example
State Space Models – a Personal View – p.176/187
Extended Kalman filtering
Nonlinear state equation:
Xt = F(Xt−1 ) + Wt
and Yt = HXt + Vt
Linearization:
Xt = F(Xt−1 ) ≈ F(mt−1|t−1 ) + dF/dX |mt−1|t−1 · (Xt−1 − mt−1|t−1 ).
Prediction:
mt|t−1 = F(mt−1|t−1 )
Pt|t−1 = [ dF/dX |mt−1|t−1 ] Pt−1|t−1 [ dF/dX |mt−1|t−1 ]⊺ + Σ
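A minimal sketch of the EKF prediction step implied by these formulas; the nonlinear map and its Jacobian in the example are arbitrary choices of ours:

import numpy as np

def ekf_predict(m, P, F, F_jac, Sigma):
    """EKF prediction: linearize F around the current filtered mean m."""
    A = F_jac(m)                       # Jacobian dF/dX evaluated at m_{t-1|t-1}
    return F(m), A @ P @ A.T + Sigma   # m_{t|t-1}, P_{t|t-1}

# toy 1-d example: F(x) = x + sin(x), with analytic Jacobian 1 + cos(x)
F = lambda x: x + np.sin(x)
F_jac = lambda x: np.atleast_2d(1.0 + np.cos(x))
m_pred, P_pred = ekf_predict(np.array([0.5]), np.array([[1.0]]),
                             F, F_jac, np.array([[0.1]]))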
State Space Models – a Personal View – p.177/187
Extended Kalman filtering (cont.)
Problems:
Conditional densities are assumed to be Gaussian;
No information about approximation error.
[Figure: two examples of marginal posteriors over x(t = 3.5), illustrating these problems.]
State Space Models – a Personal View – p.178/187
Extended Kalman filtering (cont.)
Particle approximation of p(x)
Draw N samples from p(x): {x1 , x2 , ..., xN };
Approximate p(x) by p̃(x) ≈ (1/N ) Σ_{k=1}^{N} δ(x − xk );
Capability of approximating multimodal distributions;
When N → ∞, p̃(x) → p(x).
If a direct draw is impossible, use Markov Chain Monte Carlo or
Sequential Monte Carlo.
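A tiny illustration of the particle idea on a bimodal density of our choosing: the empirical measure of the samples replaces p(x), so expectations become sample averages:

import numpy as np

rng = np.random.default_rng(0)
N = 10_000

# Draw N samples from a bimodal p(x) (mixture of two Gaussians) ...
comp = rng.integers(0, 2, size=N)
x = np.where(comp == 0, rng.normal(-2.0, 0.5, N), rng.normal(3.0, 1.0, N))

# ... and use the empirical measure (1/N) sum_k delta(x - x_k) in place of p(x):
# any expectation E[g(X)] is approximated by the sample average of g(x_k).
print(np.mean(x), np.mean(x ** 2))     # approximate E[X] and E[X^2]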
State Space Models – a Personal View – p.179/187
Non-linear filtering
filtering densities: ft|t (xt ) = p(xt |y1:t );
prediction densities for state: ft+1|t (xt+1 ) = p(xt+1 |y1:t );
filtering: ft|t −→ ft+1|t −→ ft+1|t+1 ;
Markov transition: ft+1|t (xt+1 ) = ∫ at (xt+1 , x) ft|t (x) dx
Bayes’ formula: p(x|y) = p(y|x) · p(x) / ∫ p(y|x) · p(x) dx −→
ft+1|t+1 (xt+1 ) = bt (yt+1 , xt+1 ) ft+1|t (xt+1 ) / ∫ bt (yt+1 , x) ft+1|t (x) dx .
prediction densities for observation:
pt+1|t (yt+1 |y1:t ) = ∫ bt (yt+1 , x) ft+1|t (x) dx ;
State Space Models – a Personal View – p.180/187
Non-linear filtering (cont.)
The standard particle filter:
1. Generate (x0^1 , ..., x0^N ) from a0 (x) and set t = 0;
2. We have (xt^1 , ..., xt^N ) as a particle approximation of ft|t (xt );
3. Generate x̃t+1^j ∼ at+1 (xt^j , x), j = 1, ..., N , to obtain a particle
   approximation of ft+1|t (xt+1 );
4. Compute weights λt+1^j = bt+1 (x̃t+1^j , yt+1 ) / Σk bt+1 (x̃t+1^k , yt+1 );
5. Generate Ij with p[Ij = k] = λt+1^k and set xt+1^j to x̃t+1^{Ij} ;
6. Increase t by 1 and go back to Step 2.
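A sketch of this bootstrap filter; the model-specific pieces (sample_a0, sample_at, b_t) and the toy nonlinear model at the end are our own assumptions, while the loop follows steps 1-6 above:

import numpy as np

def bootstrap_particle_filter(y, N, sample_a0, sample_at, b_t, rng=np.random.default_rng(0)):
    """Standard (bootstrap) particle filter. sample_a0() draws from a_0;
    sample_at(x) draws from a_{t+1}(x, .); b_t(x, y) evaluates the observation density."""
    x = np.array([sample_a0(rng) for _ in range(N)])           # step 1
    particles = []
    for yt in y:                                               # step 2: current cloud ~ f_{t|t}
        x_tilde = np.array([sample_at(xj, rng) for xj in x])   # step 3: propagate
        lam = b_t(x_tilde, yt)                                 # step 4: weights
        lam = lam / lam.sum()
        idx = rng.choice(N, size=N, p=lam)                     # step 5: resample indices I_j
        x = x_tilde[idx]
        particles.append(x.copy())                             # approximation of f_{t|t}
    return particles

# toy nonlinear model: x_t = 0.5 x_{t-1} + 8 cos(x_{t-1}) + noise, y_t = x_t + noise
sample_a0 = lambda rng: rng.normal(0.0, 1.0)
sample_at = lambda x, rng: 0.5 * x + 8.0 * np.cos(x) + rng.normal(0.0, 1.0)
b_t = lambda x, y: np.exp(-0.5 * (y - x) ** 2)                 # Gaussian likelihood, sigma = 1

y_obs = [1.0, 2.5, 0.3, -1.0]
parts = bootstrap_particle_filter(y_obs, N=500, sample_a0=sample_a0,
                                  sample_at=sample_at, b_t=b_t)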
State Space Models – a Personal View – p.181/187
Nonlinear Filtering (cont.)
[Figure: one step of particle filtering on a toy example - the predictive density with samples drawn from it, the likelihood of the new observation (true state marked), and the particle weights before and after filtering.]
State Space Models – a Personal View – p.182/187
Nonlinear smoothing
(Graphical model as before: hidden chain xt−1 → xt → xt+1 with emissions yt−1 , yt , yt+1 .)
Bayes’ formula:
p(xt |xt+1 , y1:T ) = p(xt |xt+1 , y1:t ) = at+1 (xt , xt+1 ) ft|t (xt |y1:t ) / ft+1|t (xt+1 |y1:t )
Markov transition:
p(xt |y1:T ) = ∫ p(xt |xt+1 , y1:T ) · p(xt+1 |y1:T ) dxt+1 .
State Space Models – a Personal View – p.183/187
Nonlinear smoothing (cont.)
Smoothing densities:
ft|T (xt ) = ft|t (xt ) ∫ [ at+1 (xt , x) / ft+1|t (x) ] · ft+1|T (x) dx .
[Diagram: the forward-backward flow of densities - the forward pass alternates Markov (M) and Bayes (B) steps, a0 → f1|0 → f1|1 → ... → ft|t−1 → ft|t → ft+1|t → ft+1|t+1 → ... → fT|T−1 → fT|T ; the backward pass then produces fT|T → ... → ft+1|T → ft|T → ... → f1|T using the kernels p(xt |xt+1 , y1:T ).]
State Space Models – a Personal View – p.184/187
Non-linear smoothing (cont.)
We have (xt|t^1 , ..., xt|t^N ) for t = 1, ..., T representing the filtering densities
ft|t (xt ). To generate xt|T^j given xt+1|T^j , we sample from
gt (x) ∝ at+1 (x, xt+1|T^j ) ft|t (x|y1:t )
      ∝ at+1 (x, xt+1|T^j ) · { bt (x, yt ) Σ_{i=1}^{N} at (xt−1|t−1^i , x) } ,
using the accept-reject method, with the index as an auxiliary variable:
1. Generate I uniformly on {1, ..., N } and x from at (xt−1|t−1^I , x);
2. Generate U uniformly on [0, ct^j ]. If U < bt (x, yt ) at+1 (x, xt+1|T^j ),
   set xt|T^j = x.
State Space Models – a Personal View – p.185/187
Parameter estimation
Assume now that both at and bt depend on a finite-dimensional parameter
Θ. Then the likelihood of Θ given the observed series y1:T is
p(y1:T |Θ) = ∏_{t=1}^{T} p(yt |y1:t−1 ; Θ) = ∏_{t=1}^{T} ∫ bt (x, yt ; Θ) ft|t−1 (x|y1:t−1 ; Θ) dx .
Each factor on the right is obtained as a normalization during the filter
recursion.
∫ bt (x, yt ; Θ) ft|t−1 (x|y1:t−1 ; Θ) dx ≈ (1/N ) Σ_{j=1}^{N} bt (xt|t−1^j , yt )
Maximization can be done by a general purpose optimization algorithm.
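A hedged sketch of how this estimate drops out of the bootstrap filter above: before normalizing the weights at time t, their average estimates p(yt |y1:t−1 ; Θ). The toy model and helper names are the same assumptions as in the particle-filter sketch:

import numpy as np

def pf_log_likelihood(y, N, sample_a0, sample_at, b_t, rng=np.random.default_rng(0)):
    """Estimate log p(y_{1:T} | Theta) with a bootstrap particle filter:
    each factor p(y_t | y_{1:t-1}) is approximated by the mean unnormalized weight."""
    x = np.array([sample_a0(rng) for _ in range(N)])
    loglik = 0.0
    for yt in y:
        x = np.array([sample_at(xj, rng) for xj in x])      # propagate: f_{t|t-1}
        w = b_t(x, yt)                                       # unnormalized weights
        loglik += np.log(w.mean())                           # estimate of int b_t f_{t|t-1} dx
        x = x[rng.choice(N, size=N, p=w / w.sum())]          # resample: f_{t|t}
    return loglik

sample_a0 = lambda rng: rng.normal(0.0, 1.0)
sample_at = lambda x, rng: 0.5 * x + 8.0 * np.cos(x) + rng.normal(0.0, 1.0)
b_t = lambda x, y: np.exp(-0.5 * (y - x) ** 2) / np.sqrt(2 * np.pi)
print(pf_log_likelihood([1.0, 2.5, 0.3, -1.0], N=500,
                        sample_a0=sample_a0, sample_at=sample_at, b_t=b_t))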
State Space Models – a Personal View – p.186/187
Parameter estimation(cont.)
Since the joint likelihood of the observations and the hidden states can be
written explicitly, using the EM algorithm is natural. In the E-step, we have
to compute
Q(Θ, Θ′ ) = E[ log p(x0:T , y1:T ; Θ) | y1:T , Θ′ ],
where Θ′ is the current approximation to the MLE. In the M-step, we
maximize Q(Θ, Θ′ ) with respect to Θ. The function Q(Θ, Θ′ ) contains the
terms
E[ log at (xt−1 , xt ; Θ) | y1:T , Θ′ ]   and   E[ log bt (xt , yt ; Θ) | y1:T , Θ′ ],
which can be computed using the smoothed particles xt|T^j , j = 1, ..., N .
State Space Models – a Personal View – p.187/187