LEARNING AND INFERENCE ALGORITHMS
FOR
DYNAMICAL SYSTEM MODELS OF DEXTROUS MOTION
by
Balakrishnan Varadarajan
A dissertation submitted to The Johns Hopkins University in conformity with the requirements for
the degree of Doctor of Philosophy.
Baltimore, Maryland
October, 2011
© Balakrishnan Varadarajan 2011
All rights reserved
Abstract
High dimensional time series data such as video sequences, spectral trajectories of a speech
signal or the kinematic measurements of skilled human activity are encountered in several
engineering applications, and computational models of such data hold considerable interest,
particularly models that capture the inherent stochastic variability in the signal. Of particular interest in this dissertation are kinematic measurements of manipulator and tool motion
in robot-assisted minimally invasive surgery (RMIS). A set of gesture-labeled RMIS data is
initially assumed to be given. The primary goal is to develop statistical models for gesture
recognition for new RMIS trials from kinematic data, for eventually supporting automatic
skill evaluation and surgeon training. The goal of automatically discovering the structure
of dextrous motion in an unsupervised manner is also addressed, when an inventory of
gestures is not known, or gesture-labeled data are not provided.
A number of statistical models to address these problems have been investigated, including hidden Markov models (HMM) with linear discriminant analysis, factor-analyzed hidden Markov models and linear dynamical systems with time-varying parameters. Gesture
recognition accuracies for three RMIS training tasks — suturing, knot-tying and needle-passing — are shown to improve significantly with increasing model complexity, justifying
the concomitant increase in the computation required to estimate model parameters from
gesture-labeled data or to perform recognition.
Algorithms for unsupervised structure induction have been investigated for discovering
gestures used in skilled dexterous motion directly from kinematic data when gesture-labeled
data are not available. An improved algorithm based on successive state splitting is presented for discovering the state-topology of a hidden Markov model. The algorithm efficiently explores an enormous space of possible topologies and yields models with a high
goodness-of-fit to the RMIS kinematic data.
Technical contributions of this dissertation include novel, efficient algorithms for probabilistic principal component analysis, for switching linear dynamical system parameter
estimation, and for hidden Markov model topology induction. Other techniques for improving gesture recognition accuracy beyond those mentioned above are also investigated
by incorporating ideas such as user-adaptation of the models.
Readers: Sanjeev Khudanpur, Gregory Hager
Examiners: Rene Vidal, Danielle Tarraf, Pablo Iglesias
Acknowledgments
"Pauca Sed Matura"
Few, but ripe.
- Carl Friedrich Gauss
I consider myself very fortunate to have had Sanjeev Khudanpur as my dissertation advisor.
His clarity of thought and his ability to appreciate new ideas quickly were inspiring. His continuous support and guidance have helped me become a much better researcher. He always encouraged me whenever I showed an inclination to attend courses that interested me or to get involved in other recreational activities (like coding competitions), even if they bore no direct relation to my research. This freedom enabled me to get the best out of graduate school and also helped me improve the quality of my research significantly.
I have also had several helpful research discussions pertaining to my dissertation with Rene
Vidal, Gregory Hager and Damianos. I am thankful to Henry Lin, Carol Reiley and Rajesh
Kumar for useful discussions pertaining to the datasets used for the experiments.
I was also very lucky to have the opportunity to learn from an excellent and knowledgeable set of teachers in the CS and Math departments at JHU. I must make special mention of Professors Rao Kosaraju, Jason Eisner, Rene Vidal, Laurent Younes, Ed Scheinerman and Sanjeev Khudanpur for their excellent and thought-provoking teaching.
During graduate school, I also had the opportunity to compete in various online programming and math contests such as Project Euler, CodeChef, Codeforces and Topcoder, where I met and learned from extremely smart people from all over the world.
I am also thankful to my undergraduate institution, IIT Madras, for providing me with solid foundations in all engineering subjects, especially probability and signal processing. Preparing for the IIT entrance exam was the prime source of my deep interest in mathematics.
Finally, I am extremely grateful to my parents for their continuous support and motivation throughout my education.
Contents

Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii

Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv

List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi

List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii

1  Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
   1.1  Problems of Interest . . . . . . . . . . . . . . . . . . . . . . . . 1
        1.1.1  Robotic Minimally Invasive Surgery (RMIS) . . . . . . . . . . 1
        1.1.2  Temporal Textures in Video . . . . . . . . . . . . . . . . . 3
   1.2  Problems We are Trying to Solve . . . . . . . . . . . . . . . . . . 4
   1.3  Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
   1.4  Contributions of this Dissertation . . . . . . . . . . . . . . . . . 12

2  Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . . . . . . 14
   2.1  Probabilistic Markov Models . . . . . . . . . . . . . . . . . . . . 14
   2.2  Emitting Probabilistic Markov Models . . . . . . . . . . . . . . . . 15
   2.3  Hidden Markov Model . . . . . . . . . . . . . . . . . . . . . . . . 16
   2.4  Inference in a HMM . . . . . . . . . . . . . . . . . . . . . . . . . 17
        2.4.1  Inferring the Distribution of s1:N . . . . . . . . . . . . . 17
               Forward Pass . . . . . . . . . . . . . . . . . . . . . . . . 19
               Backward Pass . . . . . . . . . . . . . . . . . . . . . . . 19
        2.4.2  Inferring the Most Probable State Sequence s1:N . . . . . . . 20
   2.5  Parameter Estimation for HMM; the Baum-Welch Algorithm . . . . . . . 22
   2.6  Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 27

3  Factor Analyzed Hidden Markov Models . . . . . . . . . . . . . . . . . . 28
   3.1  Probabilistic Principal Component Analysis . . . . . . . . . . . . . 29
        3.1.1  Inference in PPCA . . . . . . . . . . . . . . . . . . . . . . 30
        3.1.2  Efficient Inference in PPCA . . . . . . . . . . . . . . . . . 31
               Maximum Likelihood Estimation of Θ via EM . . . . . . . . . 34
   3.2  Factor Analyzed HMMs . . . . . . . . . . . . . . . . . . . . . . . . 39
        3.2.1  Inferring the Joint Distribution of s1:N and x1:N . . . . . . 39
        3.2.2  Learning the Parameters of a FA-HMM via EM . . . . . . . . . 40
        3.2.3  Tied Estimation of the Loading Matrix: Hs = H . . . . . . . . 45
        3.2.4  Tied Estimation of the Loading Parameters: Hs = H . . . . . . 46
   3.3  Connection to LDA, HLDA and Semi-Tied Covariance Modeling . . . . . 47
        3.3.1  EM Estimation of the Parameters of HLDA . . . . . . . . . . . 48
   3.4  Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 50

4  Linear Dynamical System . . . . . . . . . . . . . . . . . . . . . . . . . 52
   4.1  Linear Dynamical System . . . . . . . . . . . . . . . . . . . . . . 53
   4.2  Inference in Linear Dynamical System . . . . . . . . . . . . . . . . 54
        4.2.1  Forward Pass . . . . . . . . . . . . . . . . . . . . . . . . 55
        4.2.2  Backward Pass . . . . . . . . . . . . . . . . . . . . . . . . 58
        4.2.3  E-Step . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
        4.2.4  Practical Need for Regularizing ξt|t and ξt|N . . . . . . . . 62
   4.3  Learning the Parameters of an LDS via EM . . . . . . . . . . . . . . 62
        4.3.1  Efficient Inference and Learning for Diagonal Σz . . . . . . 66
   4.4  Application of LDS to Dynamic Textures . . . . . . . . . . . . . . . 66
        4.4.1  Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . 68
        4.4.2  Prediction Error . . . . . . . . . . . . . . . . . . . . . . 69
   4.5  Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 70

5  Switching Dynamical Models . . . . . . . . . . . . . . . . . . . . . . . 72
   5.1  Switching Linear Dynamical Systems . . . . . . . . . . . . . . . . . 74
   5.2  Inference in Switching LDS . . . . . . . . . . . . . . . . . . . . . 75
        5.2.1  Forward Pass . . . . . . . . . . . . . . . . . . . . . . . . 77
               Backward Pass . . . . . . . . . . . . . . . . . . . . . . . 81
               E-Step . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
   5.3  Learning the Parameters of an S-LDS . . . . . . . . . . . . . . . . 87
   5.4  Introducing Null States in S-LDS . . . . . . . . . . . . . . . . . . 94
        5.4.1  Forward Pass including Null-States . . . . . . . . . . . . . 95
        5.4.2  Backward Pass including Null-States . . . . . . . . . . . . . 99
   5.5  Approximate Viterbi Inference for S-LDS . . . . . . . . . . . . . . 100
   5.6  Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 103

6  Experimental Results for Automatic Gesture Recognition . . . . . . . . . 106
   6.1  Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
        6.1.1  Kinematic Data Recordings of Benchtop Tasks . . . . . . . . . 107
        6.1.2  Manual Segmentation . . . . . . . . . . . . . . . . . . . . . 107
        6.1.3  Automatic Gesture Recognition and Evaluation . . . . . . . . 108
   6.2  Setups for Comparing FA-HMM and S-LDS . . . . . . . . . . . . . . . 110
   6.3  Initialization of S-LDS and FA-HMM Parameters via System Identification . . . 112
        6.3.1  PCA based Initialization for Setup 4 . . . . . . . . . . . . 113
        6.3.2  PCA based Initialization for Setups 1 and 3 . . . . . . . . . 114
        6.3.3  LDA based Initialization for Setups 2 and 5 . . . . . . . . . 114
   6.4  Surgical Gesture Recognition Setups . . . . . . . . . . . . . . . . 115
   6.5  Empirical Comparisons on Setup I . . . . . . . . . . . . . . . . . . 116
        6.5.1  Using 1-State Models for each Gesture . . . . . . . . . . . . 116
        6.5.2  Using Multi-State Models for each Gesture . . . . . . . . . . 120
               Evaluating Statistical Significance . . . . . . . . . . . . 123
        6.5.3  Language Modeling . . . . . . . . . . . . . . . . . . . . . . 126
   6.6  Empirical Comparisons on Setup II . . . . . . . . . . . . . . . . . 129
   6.7  Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 132

7  Learning S-LDS on Unlabeled Data . . . . . . . . . . . . . . . . . . . . 134
   7.1  Semi-Supervised Learning . . . . . . . . . . . . . . . . . . . . . . 135
   7.2  Using Grammar for Unsupervised Learning . . . . . . . . . . . . . . 139
   7.3  Discovering the Inventory of Gestures without Supervision . . . . . 141
        7.3.1  Maximum Likelihood SSS Algorithm . . . . . . . . . . . . . . 142
        7.3.2  An Improved/Fast ML-SSS Algorithm . . . . . . . . . . . . . . 143
        7.3.3  Evaluation of the ML-SSS Algorithm on RMIS Data . . . . . . . 145
   7.4  Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 147

8  User Adaptive Transforms via EM . . . . . . . . . . . . . . . . . . . . . 149
   8.1  EM Estimation of {Fs, Hs,l} while Incorporating Pose Information . . 150
   8.2  Gesture Recognition with an Unknown Pose Class . . . . . . . . . . . 152
   8.3  Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 153

9  Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 155
   9.1  Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . 158
        9.1.1  Structured Prediction of LDS Models . . . . . . . . . . . . . 158
               Sparse LDS Models . . . . . . . . . . . . . . . . . . . . . 158
               Higher Order LDS Models . . . . . . . . . . . . . . . . . . 159

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160

Vita . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
List of Tables

1.1   Gesture recognition accuracy of HMM on various RMIS tasks. . . . . . . 7
1.2   Gesture recognition accuracy of FA-HMM on various RMIS tasks . . . . . 8
1.3   Gesture recognition accuracy of S-LDS on various RMIS tasks . . . . . 10
1.4   Demonstration of the modeling power of the S-LDS on suturing with
      decreasing supervision. . . . . . . . . . . . . . . . . . . . . . . . 12

4.1   Average squared error in comparison with Doretto's technique . . . . . 70

6.1   Accuracies on Suturing using one state models with ot ∈ R36 (position)
      on Setup 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.2   Accuracies on Needle-Passing using one state models with ot ∈ R36
      (position) on Setup 1 . . . . . . . . . . . . . . . . . . . . . . . . 118
6.3   Accuracies on Knot-Tying using one state models with ot ∈ R36 (position)
      on Setup 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.4   Accuracies on Suturing using three state models with ot ∈ R36 (position)
      on Setup 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.5   Accuracies on Needle-Passing using three state models with ot ∈ R36
      (position) on Setup 1 . . . . . . . . . . . . . . . . . . . . . . . . 124
6.6   Accuracies on Knot-Tying using three state models with ot ∈ R36
      (position) on Setup 1 . . . . . . . . . . . . . . . . . . . . . . . . 124
6.7   The best hidden dimension d for each of the five models across the three
      tasks. In some cases, the difference between d = 12 and d = 16 is not
      statistically significant. FA-HMM(1) represents the first variant of
      FA-HMM with Hs = H, and so on. . . . . . . . . . . . . . . . . . . . . 126
6.8   Comparison of various models by statistical significance. The models are
      ordered according to increasing mean performance on the suturing task.
      The diagonal elements represent the mean accuracy available in Table 6.4. . . 126
6.9   Comparison of various models by statistical significance. The models are
      ordered according to increasing mean performance on the needle passing
      task. The diagonal elements represent the mean accuracy available in
      Table 6.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.10  Comparison of various models by statistical significance. The models are
      ordered according to increasing mean performance on the knot-tying task.
      The diagonal elements represent the mean accuracy available in Table 6.6. . . 127
6.11  Accuracies on Suturing using one state models on Setup 1 after enforcing
      that the decoded path follow the state transitions given by Figure 6.5. . . . 128
6.12  Accuracies on Suturing using three state models on Setup 1 after enforcing
      that the decoded path follow the state transitions given by Figure 6.5. . . . 128
6.13  Accuracies on Suturing using three state models with ot ∈ R36 (position)
      on Setup 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.14  Accuracies on Needle Passing using three state models with ot ∈ R36
      (position) on Setup 2 . . . . . . . . . . . . . . . . . . . . . . . . 131
6.15  Accuracies on Knot Tying using three state models with ot ∈ R36
      (position) on Setup 2 . . . . . . . . . . . . . . . . . . . . . . . . 131

7.1   Accuracies on Suturing using one state models with ot ∈ R36 (position)
      on Setup 1 following semi-supervised training. . . . . . . . . . . . . 137
7.2   Accuracies on Needle-Passing using one state models with ot ∈ R36
      (position) on Setup 1 after mapping the state segmentation leniently on
      the training partition. The models were initialized using labels of a
      randomly chosen 20% subset of the entire training data. . . . . . . . . 137
7.3   Accuracies on Suturing using three state models with ot ∈ R36 (position)
      on Setup 1 after mapping the state segmentation leniently on the training
      partition. The models were initialized using labels of a randomly chosen
      20% subset of the entire training data. . . . . . . . . . . . . . . . . 138
7.4   Accuracies on Needle-Passing using three state models with ot ∈ R36
      (position) on Setup 1 after mapping the state segmentation leniently on
      the training partition. The models were initialized using labels of a
      randomly chosen 20% subset of the entire training data. . . . . . . . . 138
7.5   Accuracies on Suturing using one state models with ot ∈ R36 (position)
      on Setup 1 following unsupervised estimation of model parameters. . . . 140
7.6   Accuracies on Needle-Passing using one state models with ot ∈ R36
      (position) on Setup 1 following unsupervised estimation of model
      parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.7   Accuracies on Suturing using three state models with ot ∈ R36 (position)
      on Setup 1 following unsupervised estimation of model parameters. . . . 140
7.8   Accuracies on Needle-Passing using three state models with ot ∈ R36
      (position) on Setup 1 following unsupervised estimation of model
      parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
7.9   Result of ML-SSS on Suturing using various graphical models. . . . . . 146
7.10  Result of ML-SSS on needle-passing using various graphical models. . . 146

8.1   Effect of user adaptive learning of H in mixed (seen) user setting on
      Suturing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
8.2   Effect of user adaptive learning of H in mixed (seen) user setting on
      Needle Passing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
8.3   Effect of user adaptive learning of H in unseen user setting on Suturing . . . 153
8.4   Effect of user adaptive learning of H in unseen user setting on Needle
      Passing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
List of Figures

1.1   Some examples of training using a task board. The kinematics of these
      arms are the observed motion data that we wish to model. . . . . . . . 2
1.2   Temporal textures . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3   Graphical model representation of various models considered in this
      dissertation for modeling the observation ot. The time is indexed by
      t = 1, . . . , N. The discrete state is represented by st and xt
      represents the low-dimensional continuous state. . . . . . . . . . . . 6

3.1   The graphical models that we will consider in this chapter. In PPCA,
      o1:N is the observations and x1:N is a hidden continuous state. Adding
      a discrete hidden state s1:N to the PPCA gives rise to a FA-HMM. . . . 29

4.1   Probabilistic PCA . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2   Linear dynamical system . . . . . . . . . . . . . . . . . . . . . . . 54
4.3   Illustration of generated moving plastic using our model learned via EM.
      These are the frames generated at t = 1000, 1020, 1040, 1060. . . . . . 69

5.1   Graphical model representation of a switching linear dynamical system . . . 73
5.2   Fully connected 3-state S-LDS (no null states) . . . . . . . . . . . . 94
5.3   Fully connected 3-state S-LDS (with null states) . . . . . . . . . . . 94

6.1   Single-state, self-loop . . . . . . . . . . . . . . . . . . . . . . . 109
6.2   3-state LR, self-loop . . . . . . . . . . . . . . . . . . . . . . . . 109
6.3   The top figure is an example of a manual segmentation of a needle
      passing trial into gestures. Gestures are numbered 1, 2, 3, etc. Here
      gesture 1 is inserting the needle through the hoop. The bottom figure
      is the outcome of an automatic gesture recognition using a standard HMM. . . 110
6.4   Comparison of FA-HMM and S-LDS on a trial of needle-passing. The top
      figure shows the ground truth gesture segmentation of the particular
      trial. The second figure is the automatic segmentation via FA-HMM
      (Accuracy = 59%). The third figure is the automatic segmentation via
      S-LDS (Accuracy = 90%). . . . . . . . . . . . . . . . . . . . . . . . 121
6.5   State machine constraining the possible actions in Suturing. . . . . . 125

7.1   This figure shows the two splits that are explored in the first
      iteration. In general, the contextual split simply splits a single
      state into two states in parallel, while the temporal split splits a
      single state into two states in series. . . . . . . . . . . . . . . . 142
7.2   Four-way split of the state s in the first iteration. This could also
      be thought of as the cross product of the splits described in Figure
      7.1, as it explores the contextual and the temporal splits
      simultaneously. . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
Chapter 1
Introduction
1.1
Problems of Interest
Several engineering problems entail modeling continuous-valued signals that exhibit time-varying
dynamical behavior. These signals can be, for instance, successive frames in a video sequence,
kinematic observations of a robot-assisted arm, human activities like walking, running, etc., or even
stock prices. We describe some of the problems that we are interested in addressing.
1.1.1
Robotic Minimally Invasive Surgery (RMIS)
In recent years, robot-assisted minimally invasive surgery (RMIS) has been rapidly replacing manual surgery [1].
The da Vinci surgical system (Intuitive Surgical Inc., Sunnyvale, CA) is a tele-operated system for RMIS in which a surgeon remotely controls an endoscopic device that operates inside the patient's body. The application programming interface (API) of the da Vinci allows the kinematics of the surgeon-side manipulators as well as the patient-side tools to be recorded.
(a) Running Suturing.
(b) Knot Tying.
(c) Needle Passing.
(d) Interrupted Suturing.
Figure 1.1: Some examples of training using a task board. The kinematics of these arms are the
observed motion data that we wish to model.
The dataset we use consists of several recordings of benchtop training tasks such as suturing, knot-tying and needle-passing from the surgeon- and patient-side units. These tasks are performed on a task board (see Figure 1.1 for examples).1 Each trial is manually segmented into gestures. We have recordings of these from a mix of expert, intermediate and novice surgeons.

1 These figures are obtained from two suturing datasets whose descriptions are available at [2, 3].
In this dissertation, we develop new, efficient learning and inference algorithms that provide accurate gesture recognition (segmenting a new trial into the individual gestures, or dexemes, that compose it) on these kinds of tasks. The ability to perform automatic gesture recognition has several applications, for instance assessing the surgical dexterity of the user or providing live feedback during a real tele-operated surgery. Models developed for
surgical dexterity could also be used to segment other day-to-day human activities like walking and
running where one possible set of distinct actions could be right-leg up, right-leg forward, left-leg
up, left-leg forward, etc. For analysis of full body human motion, we use the CMU motion capture
dataset for our experiments [4].
1.1.2
Temporal Textures in Video
The term temporal textures has been used to refer to image sequences (video) that exhibit a certain kind of dynamics. Some examples include wavy water in a pond, flames in a fire, moving grass, etc. Image snippets of typical temporal textures are shown in Figure 1.2. Temporal texture modeling is a well-studied problem [5], [6]. The problem can be cast as a dynamical system where each observation ot is one frame of the image sequence; the dynamics capture the evolution of the image frames in time. Linear dynamic textures are linear dynamical systems applied to temporal textures. In this dissertation, we present new algorithms that enable efficient learning and inference in dynamic textures. We show their application to generating texture scenes from limited training data, de-noising, recognition of dynamic textures, etc., and demonstrate improvements over some previously known methods.
Eventually, we would also like to apply the learning and inference algorithms for dynamic textures
to segment raw video sequences of an RMIS task.
(a) Image snippet from a waterfall
(b) Image snippet from a wavy pond
(c) Image snippet from moving grass
(d) Moving flames of a fire
Figure 1.2: Temporal textures
1.2
Problems We are Trying to Solve
As noted earlier, we are interested in building models for performing effective gesture recognition
in the RMIS surgery data. These models should be able to capture the dynamics of various gestures
in both supervised and unsupervised settings. In the supervised setting, the goal is to learn models for the various manually annotated gestures; in the unsupervised setting, the goal is to discover the different gestures automatically from completely unlabeled data. We briefly describe
the setups we use for each of them here.
In the supervised gesture recognition setting, we are provided a set of training trials where each trial
is a sequence of observations o1:N in which each frame is associated with a gesture label. The RMIS tasks we
consider for our experiments are Suturing, Needle Passing and Knot-Tying. Some typical gestures
in these tasks include (i) inserting the needle, (ii) transferring the needle, (iii) pulling, etc. We will
learn models for these gestures from the training data and use the learnt models to perform gesture
recognition on a set of test trials. Performance is measured based on how well the decoded gestures
match the manually marked labels. Specifically, the accuracy of gesture recognition is measured on
a frame-by-frame basis.
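To make the evaluation metric concrete, frame-level accuracy can be computed as in the following sketch (the function name and the toy label sequences are illustrative, not taken from this dissertation's experiments):

```python
import numpy as np

def frame_accuracy(ref_labels, hyp_labels):
    """Fraction of frames whose decoded gesture label matches the manual label."""
    ref = np.asarray(ref_labels)
    hyp = np.asarray(hyp_labels)
    assert ref.shape == hyp.shape, "trials must be compared frame by frame"
    return float(np.mean(ref == hyp))

# Toy 8-frame trial with gestures 1-3; the decoder mislabels two frames.
reference = [1, 1, 1, 2, 2, 3, 3, 3]
decoded   = [1, 1, 2, 2, 2, 3, 3, 1]
print(frame_accuracy(reference, decoded))  # 0.75
```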
We are then interested in building probabilistic models when the gesture labels are unknown. The
goal here is to be able to segment unlabeled videos or kinematic trials into gesture labels that closely
resemble the manual annotations. In this dissertation, we develop a Successive State Splitting algorithm that can learn HMM topologies from unlabeled data. We then use this algorithm to bootstrap unsupervised S-LDS models as well.
1.3
Organization
A set of models of increasing complexity is investigated in this dissertation.
The basic hidden Markov model (HMM) is described in Chapter 2. Here the observation sequence o1:N is modeled using a sequence of hidden discrete states s1:N . HMMs have been
(a) Hidden Markov Model (HMM)
(b) Factor Analyzed HMM
(c) Linear Dynamical System (LDS)
(d) Switching LDS
Figure 1.3: Graphical Model representation of various models considered in this dissertation for
modeling the observation ot . The time is indexed by t = 1, . . . , N . The discrete state is represented
by st and xt represents the low-dimensional continuous state.
extensively used for segmenting various kinds of continuous-valued signals, e.g. for speech recognition, video segmentation, sign language identification and gesture recognition [7–10]. In an HMM, each observation ot ∈ RD, where D is the inherent dimension of the signal. For instance, ot may be a video frame from Figure 1.2 with D ≈ 200000, or the kinematic observations in Figure 1.1 with D = 36. The discrete state st belongs to the set of gesture labels described in Section 1.2.
The transition between st−1 and st is governed by a Markov chain. Each discrete state is associated
with a probability distribution that generates ot , denoted p(ot |st ). When the observation
probability distribution p(ot |st ) has certain properties (e.g. it is a mixture of distributions from
an exponential family), the estimate of its parameters may be obtained via well known Expectation Maximization (EM) procedures [7]. When the HMM models are employed to perform gesture
recognition on RMIS tasks, we get the performance shown in Table 1.1.
          Suturing    Needle-Passing    Knot-Tying
HMM       68.53%      54.53%            69.80%

Table 1.1: Gesture recognition accuracy of HMM on various RMIS tasks.
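To make the generative process just described concrete, the following sketch samples a state sequence and observations from a small HMM; the three states, transition matrix and Gaussian emission parameters are made up for illustration and are not the models estimated from the RMIS data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state HMM with 1-D Gaussian emissions p(o_t | s_t).
A = np.array([[0.9, 0.1, 0.0],   # row i: P(s_t = j | s_{t-1} = i)
              [0.0, 0.9, 0.1],
              [0.1, 0.0, 0.9]])
pi = np.array([1.0, 0.0, 0.0])   # initial state distribution
means = np.array([0.0, 5.0, -5.0])
stds = np.array([1.0, 1.0, 1.0])

def sample_hmm(N):
    """Draw (s_1:N, o_1:N) from the HMM's generative process."""
    s = np.empty(N, dtype=int)
    o = np.empty(N)
    s[0] = rng.choice(3, p=pi)
    o[0] = rng.normal(means[s[0]], stds[s[0]])
    for t in range(1, N):
        s[t] = rng.choice(3, p=A[s[t - 1]])         # Markov transition
        o[t] = rng.normal(means[s[t]], stds[s[t]])  # state-conditional emission
    return s, o

states, observations = sample_hmm(100)
```

Gesture recognition with an HMM then amounts to inverting this process: given o1:N, infer the most probable s1:N (Viterbi decoding, covered in Chapter 2).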
In several applications (including the RMIS), the observation ot lies in a high dimensional space
although the number of degrees of freedom might be lower. So directly modeling the observations
effectively requires a large number of parameters. To mitigate this problem, factor analysis is commonly used to reduce the observation dimensions to a smaller dimensional space that can model
the desired variability in the data. Some kinds of factor analysis like linear discriminant analysis
(LDA), independent factor analysis [11] and probabilistic LDA [12] have been commonly used in
several applications like speech, speaker and face recognition [13]. In Chapter 3, we start by introducing basic probabilistic principal component analysis, where the observations ot are modeled
compactly using a set of d factors xt ∈ Rd with d < D. We will focus on linear factor analysis
where the estimated observation ôt = Hxt and the goal is to minimize the variation between ot and
ôt . Basic linear factor analysis settings like principal component analysis (PCA) and probabilistic PCA (PPCA) [14] are illustrated in that chapter. While PCA simply minimizes the squared error between the estimated observation ôt and ot , PPCA generalizes PCA to a probabilistic setting. Some of the techniques we develop in Chapter 3 enable performing PPCA on
very high dimensional observations in a computationally elegant manner.
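For orientation, the standard closed-form maximum-likelihood solution for PPCA (the eigendecomposition-based textbook solution of Tipping and Bishop, not the efficient algorithm developed in Chapter 3) can be sketched as:

```python
import numpy as np

def ppca_ml(O, d):
    """Closed-form ML fit of PPCA: o_t ~ H x_t + mu + noise, with x_t in R^d.
    Returns the loading matrix H, the mean mu and the isotropic noise variance."""
    mu = O.mean(axis=0)
    C = np.cov(O - mu, rowvar=False)          # D x D sample covariance
    evals, evecs = np.linalg.eigh(C)
    order = np.argsort(evals)[::-1]           # sort eigenvalues descending
    evals, evecs = evals[order], evecs[:, order]
    sigma2 = evals[d:].mean()                 # ML noise variance: mean of discarded eigenvalues
    H = evecs[:, :d] * np.sqrt(np.maximum(evals[:d] - sigma2, 0.0))
    return H, mu, sigma2

# Synthetic 10-D data generated from 3 underlying factors plus noise.
rng = np.random.default_rng(1)
O = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(500, 10))
H, mu, sigma2 = ppca_ml(O, d=3)
```

The cost here is dominated by the D × D eigendecomposition, which is exactly what becomes prohibitive for very high-dimensional ot (e.g. raw video frames); avoiding it is the point of the efficient algorithms in Chapter 3.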
Factor analyzed hidden Markov models (FA-HMM) [15] are then introduced by adding a hidden
continuous state xt ∈ Rd (refer to Figure 1.3). FA-HMM is much like PPCA, except that the
distribution of xt will now be governed by the underlying discrete state st . When d < D, the
FA-HMM acts as a probabilistic linear dimensionality reduction method and can be thought of as
a variant of the classical LDA. While the identification of the parameters in the LDA can be done
using standard eigenvector computations, the parameters Θ of the FA-HMM have to be estimated via EM. Furthermore, the efficient inference and learning techniques we developed for PPCA extend straightforwardly to the FA-HMM setting as well. We show that FA-HMM achieves better gesture recognition than standard HMMs and even LDA+HMMs (refer to Table 1.2).
            Suturing    Needle-Passing    Knot-Tying
HMM+HLDA    74.13%      65.01%            79.91%
FA-HMM      78.27%      71.01%            82.88%

Table 1.2: Gesture recognition accuracy of FA-HMM on various RMIS tasks.
The graphical models of the HMM and the FA-HMM assume that, given the discrete state st at time t, the observation ot is conditionally independent of o1:t−1 . Therefore, if ot is a continuous-valued random signal, HMM models fail to capture the dynamical behavior of the observation sequence (since the dynamics of the continuous signal need to be captured). This is of particular concern in the case of the video examples. In order to capture the dynamics, the notion of linear
dynamical systems is introduced in Chapter 4. In a linear dynamical system, the temporal evolution of xt is governed by the following state equations:

    xt+1 = F(xt ) + ut+1 ,
    ot   = Hxt + zt .
Here ut and zt are external inputs to the system which, in our case, are normally distributed random vectors. Such dynamical systems are commonly used in time-series analysis [16], for instance of stock prices. F(·) is a state transition function. A special class of such systems, called linear dynamical systems (LDS), assumes that the state transition function is linear. Although the linearity assumption in the LDS seems very strong, these models can be shown to
approximate a large class of real processes [17]. When the noise processes ut and zt are independent random variables with Gaussian distributions, inference and learning in LDS are well studied
problems [18–21]. However, these learning and inference algorithms for the general LDS case are
computationally expensive when D, the dimension of ot , is very high. In Chapter 4, we will extend
the efficient inference and learning for PPCA into the LDS setting. We will use our algorithm to
model the raw pixels of an image sequence. Applications such as predicting the resulting pixel
values, generating video sequences and de-noising are illustrated in this chapter.
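The generative process above can be sketched in a few lines. The following is a minimal simulation, assuming a linear F (the LDS special case), hypothetical dimensions d = 2 and D = 3, and hypothetical Gaussian noise covariances Q and R; it is an illustration, not the dissertation's implementation:

```python
import numpy as np

def simulate_lds(F, H, Q, R, x0, N, rng):
    """Generate x_{t+1} = F x_t + u_{t+1}, o_t = H x_t + z_t with u ~ N(0, Q), z ~ N(0, R)."""
    d, D = F.shape[0], H.shape[0]
    xs = np.empty((N, d))
    os_ = np.empty((N, D))
    x = x0
    for t in range(N):
        xs[t] = x
        os_[t] = H @ x + rng.multivariate_normal(np.zeros(D), R)   # observation equation
        x = F @ x + rng.multivariate_normal(np.zeros(d), Q)        # state equation
    return xs, os_

# Hypothetical example: a slowly rotating 2-d state observed through a random 3 x 2 map.
rng = np.random.default_rng(1)
theta = 0.1
F = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
H = rng.standard_normal((3, 2))
xs, obs = simulate_lds(F, H, 0.01 * np.eye(2), 0.1 * np.eye(3), np.array([1.0, 0.0]), 50, rng)
```

Recognition and learning run this process in reverse: the hidden trajectory x1:N must be inferred from the noisy observations o1:N, which is the Kalman filtering/smoothing problem discussed next.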
Time-invariant linear dynamical systems are, however, not good models when the dynamics themselves change over time. For instance, in the RMIS tasks, we have about 10 distinct gestures, with each
gesture having its own characteristic dynamics. In order to model sequences of gestures, the concept
of a switching linear dynamical system is introduced in Chapter 5. Switching (or hybrid) dynamical
systems have roots in control theory where the system output process is modeled as piecewise linear [22], [23]. Thus using a suitable number of switching discrete states, a non-linear system may be
approximated using several linear systems. We will focus on a class of hybrid dynamical systems
called the switching linear dynamical systems. Switching linear dynamical systems (sometimes called switching Kalman filters [24] or switching state-space models [25]) are a generalization
of the linear dynamical system where the dynamics Ft of the hidden state xt and the observation
matrix Ht are allowed to switch in time. It can be shown that the problem of inferring the hidden
variables in an S-LDS from the observations ot is computationally intractable. In fact the posterior
density of the hidden state xt becomes a mixture of an exponentially large number of Gaussian
densities even if ut and zt are modeled using single Gaussian densities. Several approximate algorithms have been proposed to address this issue. Our contribution in this chapter is to extend the
expectation correction inference technique of Barber [26] to the learning of S-LDS parameters.
The complexity of the approximate inference techniques in S-LDS is proportional to the number
of edges in the underlying state-transition graph which can be quite high if each discrete state may
be followed by several (or all) other states. The inclusion of null (or non-emitting) states reduces this number significantly, at a small loss in modeling power, and hence significantly simplifies the inference procedures. We modify several inference algorithms, such as Expectation Correction and the approximate Viterbi, to include such null states. When S-LDS models are used to describe the RMIS tasks, we obtain the performance shown in Table 1.3.
           Suturing   Needle-Passing   Knot-Tying
  S-LDS    80.79%     77.62%           82.09%

Table 1.3: Gesture recognition accuracy of S-LDS on various RMIS tasks
In Chapter 6, HMM, FA-HMM and S-LDS are carefully contrasted by performing a number of
gesture recognition experiments on Suturing, Knot-Tying and Needle-Passing. All models have the
same set of training and test data and the experiments are performed in the supervised setting. The
final parameters of the S-LDS and the FA-HMM are identified by performing several iterations of
EM after the initialization. The learnt models are evaluated by performing gesture recognition on a
set of test trials (disjoint from the training trials).
It is shown that FA-HMM models provide better gesture segmentations than standard HMMs with dimensionality reduction. Furthermore, S-LDS models significantly outperform the FA-HMM. The contrast between S-LDS and FA-HMM is much greater when the variability in the data is higher (or when the training and test partitions are user-disjoint).
In Chapter 7, we explore the possibility of unsupervised learning, wherein the FA-HMM and S-LDS parameters are learned from unlabeled data. Three settings are considered:
1. Semi-Supervised Learning: Given a set of training trials and a grammar specifying the allowable state transitions, we initialize the model parameters using a small random subset of
manually annotated training trials. After the initialization, we use the grammar to perform
learning of the S-LDS models from all the training trials.
2. Almost Unsupervised Learning: A grammar specifying the allowable state transitions is provided but no training trial has manual gesture labels.
3. Completely Unsupervised Learning: The training data must be segmented and clustered into gestures in a fully automatic fashion, with no supervision. To obtain the structure of the graph, we use a Successive State Splitting algorithm.
The unsupervised segments are evaluated by performing a lenient mapping of the discrete states to
the manual labels. We get the performances shown in Table 1.4.
                         HMM      LDA+HMM   FA-HMM   S-LDS
  semi-supervised        61.12%   63.90%    69.96%   76.87%
  almost unsupervised    47.13%   47.23%    52.92%   64.23%
  unsupervised           48.50%   49.70%    54.80%   63.32%

Table 1.4: Demonstration of the modeling power of the S-LDS on suturing with decreasing supervision.
Finally, user adaptive transforms are explored in Chapter 8. Similar to speaker adaptive transforms
in speech recognition, by constraining certain parameters of the S-LDS to be user specific or by
unsupervised adaptation of some parameters on a test trial from a new user, we show improvements
in recognition accuracies in mixed or unseen user settings.
1.4
Contributions of this Dissertation
The following are the main contributions of this dissertation:
• First, we have shown that modeling surgical data is possible and that models of increasing complexity are well justified.
• We have developed new techniques that are efficient for performing inference and learning
in FA-HMM and S-LDS models. Towards this goal, we have developed algorithms that can
handle high dimensional observations and also graphs that have high edge connectivity.
• We developed a novel approach for unsupervised segmentation using Successive State Splitting, exploring a large search space of possible state splits efficiently using dynamic programming.
• We have extended the idea of speaker adaptation of HMM in speech recognition to user
adaptation of the FA-HMM and S-LDS models.
• Finally, we have developed scalable C++ and Matlab code that implements all the algorithms we have developed. Our techniques can be used for a wide spectrum of problems that involve modeling high dimensional continuous observation sequences.
13
Chapter 2
Hidden Markov Models
A Hidden Markov Model is a simple probabilistic graphical model in which the observations ot are modeled as outputs of a discrete state switching process. Each discrete state has a characteristic pattern of emitting observations, normally studied in a probabilistic setting. Inference
and learning in HMMs have been studied for several decades and are heavily used in applications
such as speech recognition [27–29]. This is an introductory chapter on HMMs with no technical
contributions. Here, we will derive various basic algorithms commonly used in HMMs. We will use
standard HMMs to segment the RMIS data and use it as a baseline for comparing the experimental
results of the various models we will construct in the future chapters. Furthermore, many of the
mathematical symbols we will use in this chapter will be reused later.
2.1
Probabilistic Markov Models
A probabilistic Markov model is a state machine that generates a discrete state sequence s0:N +1 with
s0 being a unique START state and sN +1 a unique END state, and st ∈ {1, . . . , C} for 1 ≤ t ≤ N .
A probability distribution is associated with each possible sequence. Specifically, the probability of observing a discrete sequence s1:N has the Markov property and is given as

    p(s_1:N) = ∏_{t=0}^{N} p_t(s_{t+1} | s_t).
Here we have assumed that the state s0 is a unique start state denoted by START and sN +1 is
a unique end state denoted by END. Furthermore, a Markov chain is said to be homogeneous if
regardless of the index t,
    p_t(s_{t+1} | s_t) = q_{s_t, s_{t+1}},

where q_{s′,s} is a transition matrix that has all the canonical properties of a probability distribution. We will consider only time-homogeneous Markov chains in this dissertation.
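As a concrete sketch, a homogeneous chain with C = 2 emitting states plus START and END can be sampled and scored directly from the matrix q_{s′,s}; the state layout and transition values below are hypothetical illustrations, not taken from the dissertation's models:

```python
import numpy as np

rng = np.random.default_rng(0)

# States 0..C-1 are ordinary states; index C is START and C+1 is END (a hypothetical layout).
C = 2
START, END = C, C + 1
Q = np.zeros((C + 2, C + 2))
Q[START, 0] = 1.0                      # the chain always enters through state 0
Q[0] = [0.6, 0.3, 0.0, 0.1]            # q_{0,s'}: self-loop, move, (never START), end
Q[1] = [0.2, 0.6, 0.0, 0.2]

def sample_chain(Q, max_len=100):
    """Draw s_{1:N} by following q_{s,s'} from START until END is reached."""
    s, seq = START, []
    for _ in range(max_len):
        s = rng.choice(len(Q), p=Q[s])
        if s == END:
            break
        seq.append(s)
    return seq

def chain_log_prob(Q, seq):
    """log p(s_{1:N}) = sum_t log q_{s_t,s_{t+1}}, including START -> s_1 and s_N -> END."""
    path = [START] + list(seq) + [END]
    return sum(np.log(Q[a, b]) for a, b in zip(path[:-1], path[1:]))
```

For the sequence (0, 1), the probability is q_{START,0} q_{0,1} q_{1,END} = 1.0 × 0.3 × 0.2 = 0.06, exactly the product form above.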
2.2
Emitting Probabilistic Markov models
The next generalization over probabilistic Markov models is to allow emitting states. An emitting
state is a discrete state s ∈ {1, . . . , C} that emits an output observation o ∈ O where O is some
support set. This support set may be discrete, in which case O = {1, . . . , K}, or continuous, e.g., O = RD. An emitting state is associated with a probability distribution on the support set O,
denoted by p(o|s). Formally, an emitting probabilistic Markov model is a weighted graph G(V, E)
such that each vertex v ∈ V − {START, END} is associated with a probability distribution on the
output alphabet O. Thus an emitting probabilistic Markov model generates a discrete state sequence
s1:N synchronously with an observation sequence o1:N , with joint probability of observing s1:N and
o1:N given by

    p(s_1:N, o_1:N) = p(o_1:N | s_1:N) p(s_0:N+1)
                    = ( ∏_{t=0}^{N−1} q_{s_t,s_{t+1}} p(o_{t+1} | s_{t+1}) ) q_{s_N,END}.
2.3
Hidden Markov Model
An emitting probabilistic Markov model becomes a hidden Markov model if the discrete state sequence s1:N is unobserved or hidden. The probability of observing o1:N is then computed by simply marginalizing over all possible state sequences s1:N that could give rise to o1:N as

    p(o_1:N) = Σ_{s_1:N} p(s_1:N, o_1:N)
             = Σ_{s_1:N} q_{s_N,END} ∏_{t=0}^{N−1} q_{s_t,s_{t+1}} p(o_{t+1} | s_{t+1}).    (2.1)
A hidden Markov model therefore has the following basic components.
1. V : A vertex set consisting of {1, . . . , C, START, END}.
2. q_{s′,s}: A transition probability matrix of size (C + 2) × (C + 2).
3. O : An output alphabet that defines the space of observations.
4. p(o|s): A probability distribution over O for each state s ∈ V \ {START, END}.
The parameters Θ of an HMM are the set of variables that govern {q_{s′,s}, p(o|s)} and are required to define the probability of an observation sequence o1:N according to (2.1). For example, if p(o|s) is a state-specific Gaussian density with mean µ^s and covariance Σ^s, Θ consists of the set of means and covariances associated with each state, along with the transition matrix:

    Θ = { [µ^s, Σ^s]_s , q_{s′,s} }.
We denote this dependence of the probabilities on the model parameters Θ by subscripts, e.g., p(o_t = o | s_t = s) = p_Θ(o|s).
2.4
Inference in a HMM
The inference problem in a probabilistic graphical model is to estimate the values of one set of
hidden variables given the values of the remaining variables. The random variables in an HMM are
s1:N and o1:N . Thus, if o1:N is observed, we are left with inferring s1:N . The two most common
forms of inference in any probabilistic graphical model are
1. Computing the conditional distribution of the hidden random variables, i.e., p(s_1:N | o_1:N), or

2. the most likely values of the hidden random variables, i.e., ŝ_1:N = argmax_{s_1:N} p(s_1:N | o_1:N).
The first form of inference is often performed using techniques such as Expectation Maximization
(EM). The second form of inference is used to find a reasonable set of values for the hidden random
variables under a given circumstance; a typical example is speech recognition where the most likely
set of words or phones given the acoustics is desired. Both kinds of inference are well studied and
are considered solved problems for HMMs [7]. We describe their solutions briefly for completeness.
2.4.1
Inferring the Distribution of s1:N
Formally, we need to compute the following conditional joint probability distribution

    p(s_1:N | o_1:N, G) = ∏_{t=0}^{N} p(s_{t+1} | s_t, o_1:N, G).    (2.2)
Here the grammar G consists of a graph G on which the inference is made, and optionally a boolean function b_t^s that specifies at each time t whether a particular emitting state s is permissible. The
function b_t^s is particularly useful when performing HMM inference with known time boundaries. In speech, for example, suppose a sentence contains the word cat := / k / ae / t / and we know a priori that it was spoken between times 5.1 sec and 5.4 sec; then only the states corresponding to the phonemes k, ae and t can be active between 5.1 and 5.4 sec. As a result, b_t^s = 0 for all other states s that do not belong to these three phonemes. Of course, if b_t^s is not given, it is equivalent to assuming that b_t^s = 1 ∀(t, s).
Note from (2.2) that the joint distribution of s_t, s_{t+1} and o_1:N for each t suffices to complete the inference step. We denote this joint distribution as

    γ_t^{s,s′} = p(s_t = s, s_{t+1} = s′, o_1:N | G).    (2.3)

Using the chain rule and the Markov property, we may write this as

    γ_t^{s,s′} = p(s_t = s, s_{t+1} = s′, o_1:N)
               = p(s_t = s, o_1:t) p(s_{t+1} = s′ | s_t = s, o_1:t) p(o_{t+1:N} | s_{t+1} = s′, s_t = s, o_1:t)
               = p(s_t = s, o_1:t) q_{s,s′} p(o_{t+1:N} | s_{t+1} = s′)
               = α_t^s q_{s,s′} β_{t+1}^{s′},
where the dependence on the graph G has been suppressed for brevity. Thus, to compute the quantity above, one requires a forward probability, denoted α_t^s = p(s_t = s, o_1:t), and a backward probability, denoted β_{t+1}^{s′}. These quantities can be computed recursively using dynamic programming, as done in the well known forward-backward algorithm [30]. As the name suggests, the forward step computes α_t^s for all (t, s) starting with t = 0, and the backward pass computes β_t^s. The two stages of the forward-backward algorithm are described below.
Forward Pass

The goal of the forward pass is to compute the probability α_t^s = p(s_t = s, o_1:t). This is done easily using the law of total probability (marginalizing over all s_{t−1} = s′) as

    α_t^s = p(s_t = s, o_1:t)
          = Σ_{s′} p(s_t = s, s_{t−1} = s′, o_1:t)
          = Σ_{s′} p(s_{t−1} = s′, o_1:t−1) p(s_t = s | s_{t−1} = s′, o_1:t−1) p(o_t | s_t = s, s_{t−1} = s′, o_1:t−1)
          = Σ_{s′} p(s_{t−1} = s′, o_1:t−1) p(s_t = s | s_{t−1} = s′) p(o_t | s_t = s)
          = Σ_{s′:(s′,s)∈E(G)} α_{t−1}^{s′} q_{s′,s} p_Θ(o_t | s).    (2.4)

The base case for t = 0 is α_0^START = 1 and α_0^s = 0 for all other s. Starting from this, one may use (2.4) to compute α_t^s for each state s, for t going forward from 1 to N.
Backward Pass

In order to compute β_t^s = p(o_t:N | s_t = s), we may again use the law of total probability (marginalizing over all future states s_{t+1} = s′) as

    β_t^s = p(o_t:N | s_t = s)
          = p(o_t | s_t = s) p(o_{t+1:N} | o_t, s_t = s)
          = p(o_t | s_t = s) Σ_{s′:(s,s′)∈E(G)} p(o_{t+1:N}, s_{t+1} = s′ | o_t, s_t = s)
          = p(o_t | s_t = s) Σ_{s′:(s,s′)∈E(G)} p(s_{t+1} = s′ | o_t, s_t = s) p(o_{t+1:N} | s_{t+1} = s′, o_t, s_t = s)
          = p(o_t | s_t = s) Σ_{s′:(s,s′)∈E(G)} p(s_{t+1} = s′ | s_t = s) p(o_{t+1:N} | s_{t+1} = s′)
          = p_Θ(o_t | s) Σ_{s′:(s,s′)∈E(G)} q_{s,s′} β_{t+1}^{s′}.
The base case for this setting is β_{N+1}^{END} = 1 and β_{N+1}^{s} = 0 for s ≠ END. β_t^s may then be recursively computed for all times, going backward from N to 1.
The general forward-backward procedure for an HMM is summarized in Algorithm 2.¹ The algorithm takes a grammar G that contains a graph structure G, the parameter set Θ and the observation sequence o1:N. It outputs the pairwise joint probabilities of states γ_t^{s,s′} for t ≥ 1. Additionally, it also computes the marginal probability γ_t^s = p(s_t = s, o_1:N). The posterior probabilities γ̂_t^s = p(s_t = s | o_1:N) may be computed by simply normalizing γ_t^s over s at each time t. Similarly, normalizing γ_t^{s,s′} gives γ̂_t^{s,s′} = p(s_t = s, s_{t+1} = s′ | o_1:N), from which one computes (2.2).
Algorithm: logadd(a, b)
    return max(a, b) + log(1 + e^{−|a−b|});

Algorithm 1: The logadd function to avoid underflows
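In code, the same function is one line; the sketch below also adds guards for log(0) operands (stored as −∞), a detail the pseudocode leaves implicit:

```python
import math

def logadd(a, b):
    """Numerically stable log(e^a + e^b), as in Algorithm 1; a and b are log probabilities."""
    if a == -math.inf:          # log(0) + anything: return the other operand
        return b
    if b == -math.inf:
        return a
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))
```

Subtracting the larger operand keeps the exponential in [0, 1], so the computation never overflows even when a and b are far below the range of double-precision probabilities.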
2.4.2
Inferring the most Probable State Sequence s1:N
This inference problem entails computing the most likely state sequence ŝ1:N given o1:N:

    ŝ_1:N = argmax_{s_1:N} p(s_1:N | o_1:N)
          = argmax_{s_1:N} p(s_1:N, o_1:N).
The problem above can be solved using standard dynamic programming and is often referred to as the Viterbi algorithm [31]. The Viterbi algorithm proceeds by computing α̂_t^s, where

    α̂_t^s = max_{s_1:t−1} p(s_1:t−1, s_t = s, o_1:t).
¹In the algorithm, we have stored log probabilities instead of probabilities, since in practice the probabilities can become very small and may result in underflow errors. To add two probabilities that are stored as logs, the logadd function in Algorithm 1 is a suitable choice.
Algorithm: HMMForwardBackward(G, b_{1:N}^{1:C}, Θ, o1:N)
    α_0^s ← −∞ ∀s;
    α_0^START ← 0;
    for t ← 1 to N do
        α_t^s ← −∞ ∀s;
        for (s′, s) ∈ E(G) do
            if b_t^s == true then
                α_t^s ← logadd(α_t^s, α_{t−1}^{s′} + log q_{s′,s} + log p_Θ(o_t|s));
            end
        end
    end
    β_{N+1}^s ← −∞ ∀s;
    β_{N+1}^{END} ← 0;
    for t ← N to 1 do
        β_t^s ← −∞ ∀s;
        γ_t^s ← −∞ ∀s;
        for (s, s′) ∈ E(G) do
            β_t^s ← logadd(β_t^s, log p_Θ(o_t|s) + log q_{s,s′} + β_{t+1}^{s′});
            γ_t^{s,s′} ← α_t^s + log q_{s,s′} + β_{t+1}^{s′};
            γ_t^s ← logadd(γ_t^s, γ_t^{s,s′});
        end
    end
end

Algorithm 2: The HMM forward-backward procedure
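The following is a vectorized sketch of the same recursions, assuming a dense transition matrix rather than the sparse grammar graph of Algorithm 2, and folding the START/END transitions into separate vectors; the toy HMM at the bottom (states, emission table, observation sequence) is entirely hypothetical:

```python
import numpy as np

def forward_backward(log_pi, log_A, log_end, log_emit):
    """
    Log-space forward-backward. log_pi[s] = log q_{START,s}, log_A[s,s'] = log q_{s,s'},
    log_end[s] = log q_{s,END}, log_emit[t,s] = log p_Theta(o_t | s).
    Returns the log posteriors log p(s_t = s | o_{1:N}) and the log-likelihood of o_{1:N}.
    """
    N, C = log_emit.shape
    la = np.empty((N, C))                   # la[t,s] = log p(s_t = s, o_{1:t})
    lb = np.empty((N, C))                   # lb[t,s] = log p(o_{t+1:N}, END | s_t = s)
    la[0] = log_pi + log_emit[0]
    for t in range(1, N):
        la[t] = np.logaddexp.reduce(la[t - 1][:, None] + log_A, axis=0) + log_emit[t]
    lb[N - 1] = log_end
    for t in range(N - 2, -1, -1):
        lb[t] = np.logaddexp.reduce(log_A + (log_emit[t + 1] + lb[t + 1])[None, :], axis=1)
    log_like = np.logaddexp.reduce(la[-1] + lb[-1])
    return la + lb - log_like, log_like

# Toy 2-state HMM with discrete emissions (hypothetical numbers).
pi = np.array([0.5, 0.5])
A = np.array([[0.6, 0.3], [0.3, 0.5]])      # each row plus the END entry below sums to 1
end = np.array([0.1, 0.2])
B = np.array([[0.9, 0.1], [0.2, 0.8]])      # B[s,k] = p(o = k | s)
obs = [0, 1, 0]
log_gamma, ll = forward_backward(np.log(pi), np.log(A), np.log(end), np.log(B[:, obs].T))
```

For a sparse grammar graph with null states, one would iterate over the edge set E(G) exactly as Algorithm 2 does instead of forming the full C × C score matrix.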
This can be done recursively much like the forward pass in Algorithm 2 using chain rules and
Markov properties as
    α̂_t^s = max_{s_1:t−1} p(s_1:t−1, s_t = s, o_1:t)
           = max_{s′} max_{s_1:t−2} p(s_1:t−2, s_{t−1} = s′, s_t = s, o_1:t)
           = max_{s′} max_{s_1:t−2} p(s_1:t−2, s_{t−1} = s′, o_1:t−1) q_{s′,s} p_Θ(o_t | s)
           = p_Θ(o_t | s) max_{s′} q_{s′,s} max_{s_1:t−2} p(s_1:t−2, s_{t−1} = s′, o_1:t−1)
           = p_Θ(o_t | s) max_{s′} q_{s′,s} α̂_{t−1}^{s′}.    (2.5)
Moreover, for each s, we may keep track of the s′ that attains the maximum in (2.5) as

    p̂_t^s = argmax_{s′} q_{s′,s} α̂_{t−1}^{s′}.    (2.6)
Finally, it can be seen that

    α̂_{N+1}^{END} = max_{s_1:N} p(s_1:N, o_1:N).
Thus, the most likely ŝ1:N can be computed by backtracking the best previous state in (2.6) starting
from sN +1 = END. The resulting procedure is shown in Algorithm 3. The algorithm takes in the
inference graph G, the HMM parameters Θ and the observation sequence o1:N and returns the most
likely ŝ1:N given o1:N .
2.5
Parameter Estimation for HMM; the Baum-Welch Algorithm
Let us say we are given M sequences of observations [ o^j_{1:N_j} ]_{j=1}^{M}. Each observed sequence is presented in conjunction with a grammar G_j that contains an unweighted graph² G_j of states that

²The graph only acts as a skeleton; the weights of the edges going from state s to s′ in G_j would be q_{s,s′} of the transition probability matrix.
Algorithm: Viterbi(G, Θ, o1:N)
    α̂_0^s ← −∞ ∀s;
    α̂_0^START ← 0;
    for t ← 1 to N do
        α̂_t^s ← −∞ ∀s;
        for (s′, s) ∈ E(G) do
            L^{s′} ← α̂_{t−1}^{s′} + log q_{s′,s} + log p_Θ(o_t|s);
            if L^{s′} > α̂_t^s then
                α̂_t^s ← L^{s′};
                p̂_t^s ← s′;
            end
        end
    end
    p̂_{N+1}^{END} ← argmax_{s′:(s′,END)∈E(G)} α̂_N^{s′} + log q_{s′,END};
    ŝ_{N+1} ← END;
    for t ← N to 1 do
        ŝ_t ← p̂_{t+1}^{ŝ_{t+1}};
    end
    return ŝ1:N;
end

Algorithm 3: Viterbi Algorithm for HMM
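A dense-matrix sketch of the same procedure, reusing the hypothetical toy HMM from the forward-backward example (a sparse grammar graph would replace the full argmax over all predecessors):

```python
import numpy as np

def viterbi(log_pi, log_A, log_end, log_emit):
    """Most likely state sequence, in the spirit of Algorithm 3, for dense transitions."""
    N, C = log_emit.shape
    dp = np.empty((N, C))                     # dp[t,s] = log max p(s_1:t-1, s_t = s, o_1:t)
    bp = np.zeros((N, C), dtype=int)          # bp[t,s] = best predecessor of s at time t
    dp[0] = log_pi + log_emit[0]
    for t in range(1, N):
        scores = dp[t - 1][:, None] + log_A   # scores[s', s]
        bp[t] = np.argmax(scores, axis=0)
        dp[t] = scores[bp[t], np.arange(C)] + log_emit[t]
    path = [int(np.argmax(dp[-1] + log_end))] # close the path with the transition into END
    for t in range(N - 1, 0, -1):             # backtrack the best-predecessor pointers
        path.append(int(bp[t, path[-1]]))
    return path[::-1]

# Same hypothetical toy HMM as in the forward-backward sketch.
pi = np.array([0.5, 0.5])
A = np.array([[0.6, 0.3], [0.3, 0.5]])
end = np.array([0.1, 0.2])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
obs = [0, 1, 0]
best_path = viterbi(np.log(pi), np.log(A), np.log(end), np.log(B[:, obs].T))
```

On a three-observation problem the result can be checked by enumerating all 2³ state sequences, which is exactly what the dynamic program avoids for long sequences.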
generated o^j_{1:N_j}. The graph G_j defines a set of vertices V_j and a set of edges E_j. Thus, every permissible state sequence s_{1:N_j} must belong to the grammar G_j. The parameter estimation process attempts to find a Θ̂_ML under which the total observed data likelihood is maximized. For an observation sequence o^j_{1:N_j}, the logarithm of the observed data likelihood is
    L_Θ(o^j_{1:N_j}) = log Σ_{s_{1:N_j} ∈ G_j} q_{s_{N_j},END} ∏_{t=1}^{N_j} q_{s_{t−1},s_t} p_Θ(o^j_t | s_t).    (2.7)
The required optimizer Θ̂_ML is therefore

    Θ̂_ML = argmax_Θ Σ_{j=1}^{M} L_Θ(o^j_{1:N_j}).    (2.8)
Directly optimizing the objective above is a hard problem mostly because (2.7) is non-convex in Θ.
So one has to employ gradient ascent techniques which again have standard issues with convergence
and may require various heuristics such as simulated annealing or other randomized algorithms.
An alternative approach is to write the complete data log-likelihood of {s_{1:N_j}, o^j_{1:N_j}}, assuming that the state sequence s_{1:N_j} for the j-th observation sequence was somehow given. The complete data log-likelihood for {s_{1:N_j}, o^j_{1:N_j}} is

    L_Θ(s_{1:N_j}, o^j_{1:N_j}) = log ( q_{s_{N_j},END} ∏_{t=1}^{N_j} q_{s_{t−1},s_t} p_Θ(o^j_t | s_t) ).
This complete log-likelihood is a random variable, since s_{1:N_j} is unknown. To get rid of this randomness, an expectation of L_Θ(s_{1:N_j}, o^j_{1:N_j}) is computed under the posterior distribution p_Θ̃(s_{1:N_j} | o^j_{1:N_j}, G_j), where Θ̃ is the current set of HMM parameters. This is the E-step of the EM procedure. We may write this expectation as
    E_Θ̃[ L_Θ(s_{1:N_j}, o^j_{1:N_j}) ] = E_Θ̃[ log q_{s_{N_j},END} ∏_{t=1}^{N_j} q_{s_{t−1},s_t} p_Θ(o^j_t | s_t) ]
                                        = Σ_{t=0}^{N_j} E_Θ̃[ log q_{s_t,s_{t+1}} ] + Σ_{t=1}^{N_j} E_Θ̃[ log p_Θ(o^j_t | s_t) ].
The term inside the first summation may be simplified as follows:

    E_Θ̃[ log q_{s_t,s_{t+1}} ] = Σ_{s,s′} p_Θ̃(s_t = s, s_{t+1} = s′ | o^j_{1:N_j}) log q_{s,s′}
                                = Σ_{s,s′} γ̂_{t,j}^{s,s′} log q_{s,s′}.    (2.9)

Similarly, the term inside the second summation can be written as

    E_Θ̃[ log p_Θ(o^j_t | s_t) ] = Σ_s γ̂_{t,j}^s log p_Θ(o^j_t | s).    (2.10)

These quantities γ̂_{t,j}^{s,s′} and γ̂_{t,j}^s, for each j, are computable using Algorithm 2 with parameters Θ̃.
The M-step computes Θ̂ such that

    Θ̂ = argmax_Θ Σ_{j=1}^{M} E_Θ̃[ L_Θ(s_{1:N_j}, o^j_{1:N_j}) ].    (2.11)
The marginalization over the hidden state sequences in (2.8) has now been replaced by an expectation in (2.11). The appealing thing about this is that the marginalization in (2.8) has to be done inside the logarithm, while the expectation in (2.11) is done outside the logarithm. We will soon note that this is a major computational simplification. Furthermore, Θ̂ satisfies
    Σ_{j=1}^{M} L_Θ̂(o^j_{1:N_j}) ≥ Σ_{j=1}^{M} L_Θ̃(o^j_{1:N_j}).    (2.12)
In other words, the new parameters Θ̂ always improve the log-likelihood over the current set of
parameters Θ̃.
The estimation process above is popularly referred to as the Baum-Welch algorithm [30].
Plugging (2.9) and (2.10) into (2.11), we get

    Θ̂ = argmax_Θ ( Σ_{s,s′} Σ_{t,j} γ̂_{t,j}^{s,s′} log q_{s,s′} + Σ_s Σ_{t,j} γ̂_{t,j}^s log p_Θ(o^j_t | s) ).    (2.13)
The maximizing q̂_{s,s′} is determined by optimizing the first summation subject to the constraint that Σ_{s′} q_{s,s′} = 1. One may use Lagrange multipliers to enforce this constraint. The resulting q̂_{s,s′} is

    q̂_{s,s′} = ( Σ_{t,j} γ̂_{t,j}^{s,s′} ) / ( Σ_{s′′} Σ_{t,j} γ̂_{t,j}^{s,s′′} ).    (2.14)
Next, if one assumes p_Θ(o|s) = p_{Θ^s}(o), where Θ^s is a state-specific parameter that governs the distribution of o, then one may obtain the maximizing Θ̂^s separately for each s from the second term in (2.13) as

    Θ̂^s = argmax_{Θ^s} Σ_{t,j} γ̂_{t,j}^s log p_{Θ^s}(o^j_t).    (2.15)
The optimization above is amenable to a closed form solution if p_{Θ^s}(·) possesses certain properties (for example, membership in the exponential family). One of the most commonly used distributions is the Gaussian density. Gaussian densities are easy to work with, and mixtures of Gaussians can approximate a large class of distributions. If one assumes a single Gaussian density for the state-conditional distribution of the observations, i.e. p_{Θ^s}(o) = N(µ^s, Σ^s), we may rewrite (2.15) as
    Θ̂^s = argmax_{Θ^s} Σ_{t,j} γ̂_{t,j}^s ( −(1/2) log |Σ^s| − (1/2) Trace( (Σ^s)^{−1} (o^j_t − µ^s)(o^j_t − µ^s)^T ) ).    (2.16)

The expression above can be maximized using standard matrix calculus [32]. The resulting estimate for Θ̂^s = {µ̂^s, Σ̂^s} is obtained by setting

    µ̂^s = (1/N_s) Σ_{t,j} γ̂_{t,j}^s o^j_t,
    Σ̂^s = (1/N_s) Σ_{t,j} γ̂_{t,j}^s o^j_t (o^j_t)^T − µ̂^s (µ̂^s)^T,

where N_s = Σ_{t,j} γ̂_{t,j}^s.
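Given the occupancies γ̂_{t,j}^s from the E-step (pooled over t and j into a single array here), these updates are just weighted averages. The following is a minimal sketch with hypothetical one-dimensional data; the analogous update (2.14) for q̂_{s,s′} is the same normalization applied to pairwise occupancies:

```python
import numpy as np

def gaussian_mstep(gammas, obs):
    """
    M-step updates for state-conditional Gaussians, per (2.16)'s maximizer.
    gammas: (T, C) posterior occupancies gamma-hat_t^s; obs: (T, D) observations.
    Returns per-state means (C, D) and covariances (C, D, D).
    """
    T, C = gammas.shape
    D = obs.shape[1]
    Ns = gammas.sum(axis=0)                        # N_s = sum_t gamma_t^s
    mu = (gammas.T @ obs) / Ns[:, None]            # mu-hat^s: weighted mean
    sigma = np.empty((C, D, D))
    for s in range(C):
        diff = obs - mu[s]                         # center around the new mean
        sigma[s] = (gammas[:, s, None] * diff).T @ diff / Ns[s]   # Sigma-hat^s
    return mu, sigma

# Hypothetical hard assignments: states 0 and 1 each own two scalar observations.
obs = np.array([[0.0], [2.0], [10.0], [14.0]])
gammas = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
mu, sigma = gaussian_mstep(gammas, obs)
```

With hard (0/1) occupancies the updates reduce to per-class sample means and covariances, which is a convenient sanity check; soft occupancies from Algorithm 2 interpolate between the classes.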
Since EM only guarantees improvement in the likelihood objective, and does not necessarily reach an optimum in one step, we repeatedly perform the updates mentioned above to iteratively increase the log-likelihood until its value no longer changes.
2.6
Chapter Summary
In this chapter, we developed the concept of Hidden Markov Models from first principles. We reviewed some of the well known inference and learning algorithms for completeness. Maximum likelihood estimates of the HMM parameters can be obtained using the Baum-Welch algorithm, which is implemented elegantly using dynamic programming. Similarly, finding the best set of values for the discrete states given the observations is solved elegantly using the Viterbi algorithm. As a baseline, we will use HMMs to model the gestures in the RMIS surgery data. We will perform a complete set of experiments in Chapter 6, comparing HMMs with the more complex models we will develop in the next three chapters. Although a simple model, when 3-state HMMs are used to model each gesture in the surgery trials, we get about 65% accuracy in gesture recognition.
Chapter 3
Factor Analyzed Hidden Markov Models
Factor analysis is primarily used to model observed data variability using a compact number of unobserved variables called factors. The most commonly used form is linear factor analysis, where the observations are described as a linear combination of the factors through a matrix H. Formally, given a sequence of observations o1:N, we model each observation ot as ot ≈ Hxt, where ot ∈ RD and xt ∈ Rd is a compact set of factors describing ot. Several methods have been proposed to determine H for a given set of observations.
We start by describing the probabilistic principal component factor analysis (PPCA [14]). The main
contribution of this chapter is to derive an efficient algorithm for PPCA inference when the inherent
dimension D is very high. PPCA is a dimensionality reducing technique when no discrete class information is given for the observations. When class information is provided for the observations, the
dimensionality reduction can be done in a probabilistic setting to discriminate between the classes
similar to linear discriminant analysis (LDA). This can be achieved using factor analyzed hidden
Markov models (FA-HMMs) [33], [15]. FA-HMMs are not only compact in representation, but also computationally simpler and more robust than other known discriminative factor analysis techniques, such as Heteroscedastic Linear Discriminant Analysis [34]. Moreover, FA-HMMs bridge the gap between the standard HMM and the switching LDS, which we will describe in Chapter 5.
Figure 3.1: The graphical models that we will consider in this chapter: (a) Probabilistic PCA (PPCA) and (b) factor analyzed HMM (FA-HMM). In PPCA, o1:N are the observations and x1:N is a hidden continuous state. Adding a discrete hidden state s1:N to the PPCA gives rise to a FA-HMM.
3.1
Probabilistic Principal Component Analysis
Given a sequence of observations o1:N, classical principal component analysis seeks to find a matrix H ∈ R^{D×d}, a mean vector µ_z ∈ R^D and a set of factors x1:N such that

    (Ĥ, µ̂_z, x̂_1:N) = argmin_{H, µ_z, x_1:N} Σ_{t=1}^{N} || o_t − µ_z − H x_t ||².    (3.1)
Assuming U_d to be the d largest eigenvectors of the empirical covariance matrix of the observations o1:N, the solution to the optimization above may be derived as (c.f. [35] for details)

    µ̂_z = (1/N) Σ_{t=1}^{N} o_t,
    Ĥ   = U_d,
    x̂_t = Ĥ^T (o_t − µ̂_z).
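The following is a NumPy sketch of this closed-form solution; the data, which lies near a hypothetical 2-d subspace of R⁵, is generated only for illustration:

```python
import numpy as np

def pca_fit(O, d):
    """Classical PCA per (3.1): sample mean plus the d largest eigenvectors of the covariance."""
    mu = O.mean(axis=0)
    Oc = O - mu
    w, V = np.linalg.eigh(Oc.T @ Oc / len(O))   # eigenvalues returned in ascending order
    H = V[:, ::-1][:, :d]                       # U_d: top-d eigenvectors as columns
    return mu, H, Oc @ H                        # factors x_t = H^T (o_t - mu)

# Hypothetical data lying near a 2-d subspace of R^5.
rng = np.random.default_rng(5)
Z = rng.standard_normal((200, 2))
W = rng.standard_normal((2, 5))
O = Z @ W + 3.0 + 0.001 * rng.standard_normal((200, 5))
mu, H, X = pca_fit(O, 2)
```

Reconstructing via mu + X @ H.T recovers the data up to the small noise in the three discarded directions, which is exactly the squared-error criterion (3.1) being minimized.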
Probabilistic principal component analysis is the probabilistic version of the classical PCA and was
first described in [14]. In PPCA, one assumes the following generative model for the observations
    o_t = H x_t + z_t.    (3.2)
Here z_t ∼ N(µ_z, Σ_z) is viewed as D-dimensional Gaussian noise that corrupts the observations. Furthermore, x_t ∼ N(µ_u, Σ_u) is an i.i.d. unobserved latent random signal, independent of x_1:t−1, that constitutes the factors of o_t. A graphical representation of PPCA is shown in Figure 3.1(a). The probabilistic view has several advantages that PCA does not offer. For example, this framework also enables an extension of hidden Markov models in the factor analyzed setting, as we will see in Section 3.2. See [14] for the additional benefits that PPCA offers over classical PCA.
3.1.1
Inference in PPCA
Unlike PCA, xt cannot be determined exactly in the PPCA setting. Hence one needs to infer a
distribution over the xt given ot . Since each observation is generated independently of the others,
it suffices to infer the distribution, p(xt |ot ) for each t. The fact that ot is linearly related to xt as
ot = Hxt + zt implies that xt and ot are jointly Gaussian. The covariance of ot is HΣu HT + Σz
and the cross-covariance of xt and ot is Σu HT . Using this, one may write the joint distribution of
xt and ot as

    [ x_t ]       ( [ µ_u         ]   [ Σ_u       Σ_u H^T           ] )
    [ o_t ]  ∼  N ( [ H µ_u + µ_z ] , [ H Σ_u     H Σ_u H^T + Σ_z   ] ).    (3.3)
We now make use of the following property of jointly Gaussian random variables.

Property 1. If x and y are jointly Gaussian with distribution p(x, y) given as

    p(x, y) = N( [ µ_x ; µ_y ] , [ Σ_x , Σ_xy ; Σ_xy^T , Σ_y ] ),

then the conditional distribution of x after y is observed is given as

    p(x|y) = N( µ_x + Σ_xy Σ_y^{−1} (y − µ_y) , Σ_x − Σ_xy Σ_y^{−1} Σ_xy^T ).
Thus, taking x = x_t and y = o_t in the property above, the mean (µ_{t|t}) and covariance (Σ_{t|t}) of the conditional distribution p_Θ(x_t | o_t) may be computed in closed form as (cf. [36])

    p(x_t | o_t) = N( µ_u + Σ_u H^T P (o_t − H µ_u − µ_z) , Σ_u − Σ_u H^T P H Σ_u ),

where P = ( H Σ_u H^T + Σ_z )^{−1}. The resulting inference procedure for PPCA is described in Algorithm 4. This takes in the observation o_t and the parameters Θ, and returns the parameters of the conditional Gaussian distribution p_Θ(x_t | o_t), along with the log-likelihood of the observation o_t.
Computation of the log-likelihood is important because it is an indication of how good a fit the
parameter Θ is to the observation ot .
Efficient Inference in PPCA
The two main computationally expensive steps of the standard PPCA inference procedure described
in Algorithm 4 are the computations of the inverse and the determinant of HΣu HT + Σz . Storing
Algorithm: PPCAinfer(o_t, Θ)
    P ← ( H Σ_u H^T + Σ_z )^{−1};
    µ_{t|t} ← µ_u + Σ_u H^T P (o_t − H µ_u − µ_z);
    Σ_{t|t} ← Σ_u − Σ_u H^T P H Σ_u;
    L ← (1/2) log |P| − (1/2) (o_t − H µ_u − µ_z)^T P (o_t − H µ_u − µ_z);
    return {µ_{t|t}, Σ_{t|t}, L};

Algorithm 4: The inference procedure in PPCA - computes p(x_t | o_t).
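A direct transcription of Algorithm 4 (with the constant −(D/2) log 2π, dropped in the listing, restored so that L is a full Gaussian log-density); the dimensions and parameter values below are hypothetical, and the O(D³) inverse is exactly the cost the next section removes:

```python
import numpy as np

def ppca_infer(o, H, mu_u, Sigma_u, mu_z, Sigma_z):
    """Naive PPCA inference: posterior mean/covariance of x_t given o_t, plus log-likelihood."""
    D = len(o)
    P = np.linalg.inv(H @ Sigma_u @ H.T + Sigma_z)        # the O(D^3) step
    r = o - H @ mu_u - mu_z
    mu_post = mu_u + Sigma_u @ H.T @ P @ r
    Sigma_post = Sigma_u - Sigma_u @ H.T @ P @ H @ Sigma_u
    L = 0.5 * np.linalg.slogdet(P)[1] - 0.5 * r @ P @ r - 0.5 * D * np.log(2 * np.pi)
    return mu_post, Sigma_post, L

# Hypothetical parameters with d = 2, D = 4.
rng = np.random.default_rng(3)
d, D = 2, 4
H = rng.standard_normal((D, d))
mu_u, mu_z = rng.standard_normal(d), rng.standard_normal(D)
A = rng.standard_normal((d, d))
Sigma_u = A @ A.T + np.eye(d)
Sigma_z = np.diag(rng.uniform(0.5, 1.5, D))
o = rng.standard_normal(D)
mu_post, Sigma_post, L = ppca_infer(o, H, mu_u, Sigma_u, mu_z, Sigma_z)
```

A useful cross-check is the information form of the same posterior: the precision Σ_u^{−1} + H^T Σ_z^{−1} H must equal Σ_{t|t}^{−1}, and the mean must solve the corresponding normal equations.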
the matrix itself takes O(D²) space, and computing its inverse and determinant take O(D³) time each. Thus, if one were to apply Algorithm 4 naively for high dimensional observations, the space and time complexity of PPCA would be prohibitive when D is large. In standard factor analysis, the dimension of x_t satisfies d ≪ D, and Σ_z is diagonal. This implies that H Σ_u H^T + Σ_z is essentially a sum of a low-rank matrix H Σ_u H^T and a diagonal matrix Σ_z. In this section, we describe how to exploit these facts and obtain an O(Dd²) time and O(Dd) space algorithm to compute the following three quantities:
• PH, which is used to compute Σ_{t|t} in Algorithm 4, where P = ( H Σ_u H^T + Σ_z )^{−1},
• P(ot − Hµu − µz ), which is used to compute µt|t ,
• log |P|, which is used in the likelihood calculation.
Computation of PH:

First, we compute U = H Σ_u, which can be done in O(Dd²) time. Denoting by h_i and u_i the i-th columns of the matrices H and U respectively, we note that

    P = ( Σ_z + Σ_{i=1}^{d} h_i u_i^T )^{−1}.
CHAPTER 3. FACTOR ANALYZED HIDDEN MARKOV MODELS
Define the sequence of matrices
Mj = Σz +
j
X
hi uTi ,
(3.4)
i=1
for j = 0, 1, . . . , d. We iteratively compute the set of vectors m̃j,k = M−1
j hk for j = 1, . . . , d.
The culminating result of this computation would be the vectors m̃d,k for k = 1, . . . , d which
would constitute the columns of the matrix PH. The base case of j = 0 is trivial since M0 = Σz
is diagonal with σk2 being the k th diagonal entry. Thus m̃0,k is obtained by simply scaling each
element of hk by
1
.
σk2
For j > 0, we may then write m̃_{j,k} as

  m̃_{j,k} = M_j^{-1} h_k = ( M_{j−1} + h_j u_j^T )^{-1} h_k.

Next, using the Sherman-Morrison formula, we may expand ( M_{j−1} + h_j u_j^T )^{-1} in the expression above to obtain

  m̃_{j,k} = ( M_{j−1}^{-1} − ( M_{j−1}^{-1} h_j u_j^T M_{j−1}^{-1} ) / ( 1 + u_j^T M_{j−1}^{-1} h_j ) ) h_k.

By definition, M_{j−1}^{-1} h_k = m̃_{j−1,k} and M_{j−1}^{-1} h_j = m̃_{j−1,j}. Thus, we may write the expression above as

  m̃_{j,k} = m̃_{j−1,k} − ( u_j^T m̃_{j−1,k} ) / ( 1 + u_j^T m̃_{j−1,j} ) · m̃_{j−1,j}.
Computing m̃j,k using the expression above involves two inner-products, followed by a vector
addition, each of which is O(D). Since there are d2 such vectors, the net complexity of computing
all the m̃j,k is O(Dd2 ).
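As a sanity check, the rank-one recursion above can be sketched in a few lines of numpy; the dimensions D, d and the random parameters below are illustrative assumptions, not values from the text:

```python
import numpy as np

# Sherman-Morrison recursion for PH, where P = (Sigma_z + sum_i h_i u_i^T)^{-1}.
rng = np.random.default_rng(0)
D, d = 50, 4                             # assumed toy dimensions
H = rng.standard_normal((D, d))
Sigma_u = np.eye(d)                      # latent covariance (assumed identity)
sigma2 = rng.uniform(0.5, 2.0, size=D)   # diagonal of Sigma_z
U = H @ Sigma_u                          # the u_i are the columns of U = H Sigma_u

# Base case: column k of M holds m_{0,k} = Sigma_z^{-1} h_k.
M = H / sigma2[:, None]

# After step j, column k of M holds m_{j,k} = M_j^{-1} h_k.
for j in range(d):
    u = U[:, j]
    denom = 1.0 + u @ M[:, j]
    M = M - np.outer(M[:, j], u @ M) / denom

PH_fast = M                              # equals P H after d rank-one steps

# Direct O(D^3) reference computation for comparison.
P = np.linalg.inv(np.diag(sigma2) + H @ Sigma_u @ H.T)
err = np.abs(PH_fast - P @ H).max()
```

The vectorized update uses only the previous iterate, so each of the d steps costs O(Dd), matching the O(Dd^2) total claimed above.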
Computation of P(ot − Hµu − µz ):
Setting v = o_t − Hµ_u − µ_z, define the sequence of vectors ṽ_j = M_j^{-1} v, where M_j is as defined above in (3.4). ṽ_0 is trivial to compute: the kth element of v is simply scaled by 1/σ_k^2. The required quantity is ṽ_d. Assuming we have computed ṽ_{j−1} = M_{j−1}^{-1} v, one may again invoke the Sherman-Morrison formula to get

  ṽ_j = ṽ_{j−1} − ( u_j^T ṽ_{j−1} ) / ( 1 + u_j^T m̃_{j−1,j} ) · m̃_{j−1,j}.
Each of the computations above takes O(D) time, resulting in O(Dd) time in total.
Computation of log |P|:
Finally, to compute log |P|, we define the sequence of real numbers L_j = log |M_j|. Then, one may use the matrix determinant lemma to obtain

  L_j = Σ_{k=1}^{D} log σ_k^2,                         if j = 0,
  L_j = L_{j−1} + log |1 + u_j^T m̃_{j−1,j}|,           if j > 0.

−L_d is the resulting final log-determinant.¹
An efficient PPCA inference algorithm based on the above mentioned organization of computation
is described in Algorithm 5.
3.1.2 Maximum Likelihood Estimation of Θ via EM
Given a sequence of observations o1:N , the maximum likelihood estimation is a process that computes a parameter set Θ such that the total likelihood of the observed sequence is maximized. The
log-likelihood of the observation ot may be given in terms of the parameters in (3.3) as
  L_Θ(o_t) = log N( o_t; Hµ_u + µ_z, HΣ_u H^T + Σ_z ).   (3.5)
¹ Although P is positive definite, it is not necessary that M_j is positive definite for all j. Hence, in general, |M_j| can be negative and L_j can be complex. Since we know that |M_d| > 0, we always compute abs(|M_j|) instead of the direct |M_j| at each step in the iteration.
Algorithm: PPCAinfer (o_t, Θ)
  µ̂_o ← Hµ_u + µ_z;
  U ← HΣ_u;
  m̃_{0,k} ← Σ_z^{-1} h_k for k = 1, . . . , d (the ith element of h_k is scaled by 1/σ_i^2);
  ṽ_0 ← Σ_z^{-1} (o − µ̂_o), i.e. ṽ_0(i) = (o(i) − µ̂_o(i)) / σ_i^2;
  L̃_0 ← Σ_{i=1}^{D} log σ_i^2;
  for j ← 1 to d do
    for k ← 1 to d do
      m̃_{j,k} = m̃_{j−1,k} − ( u_j^T m̃_{j−1,k} ) / ( 1 + u_j^T m̃_{j−1,j} ) · m̃_{j−1,j};
    end
    ṽ_j = ṽ_{j−1} − ( u_j^T ṽ_{j−1} ) / ( 1 + u_j^T m̃_{j−1,j} ) · m̃_{j−1,j};
    L̃_j ← L̃_{j−1} + log |1 + u_j^T m̃_{j−1,j}|;
  end
  PH ← [ m̃_{d,1} . . . m̃_{d,d} ];
  L ← −(1/2) L̃_d − (1/2) (o − µ̂_o)^T ṽ_d;
  µ_{t|t} ← µ_u + Σ_u H^T ṽ_d;
  Σ_{t|t} ← Σ_u − Σ_u H^T (PH) Σ_u;
  return {µ_{t|t}, Σ_{t|t}, L};
Algorithm 5: An efficient PPCA inference algorithm. The procedure uses the Sherman-Morrison formula together with dynamic programming to compute the distribution of the hidden variables given the observations in O(Dd^2) time and O(Dd) space. This is a significant improvement over Algorithm 4, which takes O(D^3) time and O(D^2) space.
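A toy numpy rendering of Algorithms 4 and 5 illustrates that the Sherman-Morrison organization reproduces the naive computation; all dimensions and parameter values below are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
D, d = 200, 5                            # assumed toy dimensions
H = rng.standard_normal((D, d))
mu_u, Sigma_u = rng.standard_normal(d), np.eye(d)
mu_z, sigma2 = rng.standard_normal(D), rng.uniform(0.5, 2.0, size=D)
o = rng.standard_normal(D)

def ppca_infer_naive(o):
    """Algorithm 4: direct O(D^3) inference of p(x_t | o_t)."""
    P = np.linalg.inv(H @ Sigma_u @ H.T + np.diag(sigma2))
    r = o - H @ mu_u - mu_z
    mu = mu_u + Sigma_u @ H.T @ P @ r
    Sig = Sigma_u - Sigma_u @ H.T @ P @ H @ Sigma_u
    # log-likelihood up to the constant -(D/2) log 2*pi, as in the text
    L = 0.5 * np.linalg.slogdet(P)[1] - 0.5 * r @ P @ r
    return mu, Sig, L

def ppca_infer_fast(o):
    """Algorithm 5: O(D d^2) inference via Sherman-Morrison updates."""
    U = H @ Sigma_u
    r = o - H @ mu_u - mu_z
    M = H / sigma2[:, None]              # columns m_{0,k} = Sigma_z^{-1} h_k
    v = r / sigma2                       # v_0 = Sigma_z^{-1} r
    logdet = np.sum(np.log(sigma2))      # L_0 = log |Sigma_z|
    for j in range(d):
        u, mj = U[:, j], M[:, j].copy()
        denom = 1.0 + u @ mj
        M = M - np.outer(mj, u @ M) / denom
        v = v - mj * (u @ v) / denom
        logdet += np.log(abs(denom))     # L_j = L_{j-1} + log|1 + u_j^T m_{j-1,j}|
    mu = mu_u + Sigma_u @ H.T @ v
    Sig = Sigma_u - Sigma_u @ H.T @ M @ Sigma_u
    L = -0.5 * logdet - 0.5 * r @ v      # since log|P| = -log|M_d|
    return mu, Sig, L

mu1, S1, L1 = ppca_infer_naive(o)
mu2, S2, L2 = ppca_infer_fast(o)
```

The fast version never materializes a D x D matrix, which is the point of the algorithm for high dimensional observations.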
It is easy to see that there are multiple sets of parameters Θ that give the same likelihood. For
instance if
Θ = {H, µu , Σu , µz , Σz },
then there is an equivalent Θ̄ which may be defined as
Θ̄ = {H, 0, Σu , µz + Hµu , Σz },
that gives the same likelihood for each ot . Thus the PPCA model is, in general, not identifiable.
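The non-identifiability can be checked numerically; the parameter values below are arbitrary assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
D, d = 6, 2                                  # assumed toy dimensions
H = rng.standard_normal((D, d))
mu_u, Sigma_u = rng.standard_normal(d), np.eye(d)
mu_z, Sigma_z = rng.standard_normal(D), np.diag(rng.uniform(0.5, 2.0, D))
o = rng.standard_normal(D)

def marginal_loglik(mu_u, mu_z):
    """log N(o; H mu_u + mu_z, H Sigma_u H^T + Sigma_z), as in (3.5)."""
    mean = H @ mu_u + mu_z
    cov = H @ Sigma_u @ H.T + Sigma_z
    r = o - mean
    sign, logdet = np.linalg.slogdet(cov)
    return -0.5 * (D * np.log(2 * np.pi) + logdet + r @ np.linalg.solve(cov, r))

L_theta = marginal_loglik(mu_u, mu_z)                      # Theta
L_theta_bar = marginal_loglik(np.zeros(d), mu_z + H @ mu_u)  # Theta-bar
```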
Θ̂_ML is an element from the set of all possible settings of Θ that maximizes the likelihood of the observations. In other words, Θ̂_ML can be given by solving the optimization

  Θ̂_ML = argmax_Θ Σ_{t=1}^{N} L_Θ(o_t).   (3.6)
Directly maximizing the objective above does not lend itself to a closed form solution in general,
unless certain special assumptions are made about the parameters Θ [14]. The EM approach, on
the other hand, computes Θ̂ that maximizes the expected value of the complete data log-likelihood,
where the expectation is computed under a current set of parameters Θ̃. The complete data log-likelihood in the PPCA setting is the joint likelihood of x_{1:N} and o_{1:N}, assuming that x_{1:N} is known. Since p(x_{1:N}, o_{1:N}) = p(x_{1:N}) p(o_{1:N}|x_{1:N}), this may be written as

  L_Θ(x_{1:N}, o_{1:N}) = Σ_{t=1}^{N} [ log N(x_t; µ_u, Σ_u) + log N(o_t; Hx_t + µ_z, Σ_z) ].
The resulting update of the M-step in the EM procedure is then
  Θ̂ = argmax_Θ E_Θ̃[ L_Θ(x_{1:N}, o_{1:N}) ].   (3.7)
Due to linearity of expectations, we may rewrite (3.7) as
  Θ̂ = argmax_Θ Σ_{t=1}^{N} E_Θ̃[ log N(x_t; µ_u, Σ_u) ] + Σ_{t=1}^{N} E_Θ̃[ log N(o_t; Hx_t + µ_z, Σ_z) ].   (3.8)
The expectation in (3.8) is with respect to the posterior distribution pΘ̃ (xt |ot ) where Θ̃ denotes the
current set of parameters. One may compute the optimizing µ̂u and Σ̂u by maximizing the first
summation since the second summation does not depend on these quantities. The expectation with
respect to xt may be simplified further as
  E_Θ̃[log N(x_t; µ_u, Σ_u)] = −(1/2) log |Σ_u| − (1/2) Trace( E_Θ̃[ (x_t − µ_u)(x_t − µ_u)^T ] Σ_u^{-1} )
  = −(1/2) [ log |Σ_u| + Trace( Σ_u^{-1} ( E_Θ̃(x_t x_t^T) + µ_u µ_u^T − µ_u E_Θ̃(x_t^T) − E_Θ̃(x_t) µ_u^T ) ) ].   (3.9)
The quantities EΘ̃ (xt ) and EΘ̃ (xt xTt ) are available from the inference procedure in Algorithm 4 as
EΘ̃ (xt ) = µt|t ,
EΘ̃ (xt xTt ) = Σt|t + µt|t µTt|t .
Thus, using standard matrix calculus and assuming no structure is imposed on µu and Σu , we may
compute the optimizing µ̂u and Σ̂u as
  µ̂_u = (1/N) Σ_{t=1}^{N} µ_{t|t},
  Σ̂_u = (1/N) Σ_{t=1}^{N} ( Σ_{t|t} + µ_{t|t} µ_{t|t}^T ) − µ̂_u µ̂_u^T.
In order to optimize for H, µ_z, Σ_z, we consider the second term of the summation in (3.8). Here, Σ_z = diag(σ_1^2, . . . , σ_D^2) is assumed to be diagonal.² Denoting S_{t|t} = Σ_{t|t} + µ_{t|t} µ_{t|t}^T and ō_t = o_t − µ_z, we may rewrite the second summation in (3.8) as
  −(N/2) log |Σ_z| − (1/2) Σ_{t=1}^{N} Trace( Σ_z^{-1} [ ō_t ō_t^T + H S_{t|t} H^T − H µ_{t|t} ō_t^T − ō_t µ_{t|t}^T H^T ] ).   (3.10)
² The diagonal assumption is the essence of factor analysis, as it enforces that the correlation among the components of the observation o_t should be captured by the latent variable x_t. This reduces the number of model parameters drastically from O(D^2) to O(Dd), resulting in a much more compact model for the representation of o_{1:N}.
Maximizing the expression above with respect to µ_z and H results in a least squares problem with a closed form solution. The diagonal values σ_j^2 of Σ_z may be computed by considering each row separately and finding its variance. The resulting M-Step that computes the optimizing Θ̂ is tabulated in Algorithm 6. Here, we have used ⊙ to denote element-wise multiplication of two equal sized vectors, ĥ_p to denote the pth row of the matrix Ĥ, V_ox(p) to denote the pth row of V_ox, and v_o(p) to denote the pth element of the vector v_o.
Algorithm: PPCA-MStep (o_{1:N}, {µ_{t|t}, Σ_{t|t}}_{t=1}^{N})
  f_0 ← Σ_{t=1}^{N} µ_{t|t};
  S_0 ← Σ_{t=1}^{N} ( Σ_{t|t} + µ_{t|t} µ_{t|t}^T );
  µ̂_u ← f_0 / N;
  Σ̂_u ← S_0 / N − µ̂_u µ̂_u^T;
  w_o ← Σ_{t=1}^{N} o_t;
  v_o ← Σ_{t=1}^{N} o_t ⊙ o_t;
  V_ox ← Σ_{t=1}^{N} o_t µ_{t|t}^T;
  Ĥ ← ( V_ox − (1/N) w_o f_0^T )( S_0 − (1/N) f_0 f_0^T )^{-1};
  µ̂_z ← (1/N)( w_o − Ĥ f_0 );
  for p ← 1 to D do
    σ̂_p^2 = (1/N)( v_o(p) + ĥ_p S_0 ĥ_p^T − 2 ĥ_p V_ox(p)^T ) − µ̂_z(p)^2;
  end
  return Θ̂;
Algorithm 6: The M-Step for PPCA
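One EM iteration, combining the E-step of Algorithm 4 with the M-step of Algorithm 6, can be sketched compactly in numpy; the synthetic data, sizes and initializations are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
D, d, N = 8, 2, 500                          # assumed toy sizes
H_true = rng.standard_normal((D, d))
O = (H_true @ rng.standard_normal((d, N))).T + 0.3 * rng.standard_normal((N, D))

# Arbitrary initial parameters.
H = rng.standard_normal((D, d))
mu_u, Sigma_u = np.zeros(d), np.eye(d)
mu_z, sigma2 = O.mean(axis=0), np.ones(D)

def total_loglik():
    """Sum of (3.5) over all observations under the current parameters."""
    cov = H @ Sigma_u @ H.T + np.diag(sigma2)
    sign, logdet = np.linalg.slogdet(cov)
    R = O - (H @ mu_u + mu_z)
    sol = np.linalg.solve(cov, R.T).T
    return -0.5 * (N * (D * np.log(2 * np.pi) + logdet) + np.sum(R * sol))

def em_step(H, mu_u, Sigma_u, mu_z, sigma2):
    # E-step (Algorithm 4): posterior moments of x_t for every observation.
    P = np.linalg.inv(H @ Sigma_u @ H.T + np.diag(sigma2))
    G = Sigma_u @ H.T @ P                        # gain
    Mu = (O - H @ mu_u - mu_z) @ G.T + mu_u      # mu_{t|t}, one row per t
    Sig = Sigma_u - G @ H @ Sigma_u              # Sigma_{t|t}, shared by all t
    # M-step (Algorithm 6): accumulate the sufficient statistics.
    f0 = Mu.sum(axis=0)
    S0 = N * Sig + Mu.T @ Mu
    mu_u_new = f0 / N
    Sigma_u_new = S0 / N - np.outer(mu_u_new, mu_u_new)
    wo, Vox = O.sum(axis=0), O.T @ Mu
    H_new = (Vox - np.outer(wo, f0) / N) @ np.linalg.inv(S0 - np.outer(f0, f0) / N)
    mu_z_new = (wo - H_new @ f0) / N
    vo = np.sum(O * O, axis=0)
    sigma2_new = (vo + np.einsum('pk,kl,pl->p', H_new, S0, H_new)
                  - 2 * np.sum(H_new * Vox, axis=1)) / N - mu_z_new ** 2
    return H_new, mu_u_new, Sigma_u_new, mu_z_new, sigma2_new

L_before = total_loglik()
H, mu_u, Sigma_u, mu_z, sigma2 = em_step(H, mu_u, Sigma_u, mu_z, sigma2)
L_after = total_loglik()
```

As the EM guarantee predicts, the observed-data log-likelihood does not decrease from one iteration to the next.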
The EM procedure guarantees that the resulting Θ̂ from the optimization in (3.7) satisfies

  L_Θ̂(o_{1:N}) ≥ L_Θ̃(o_{1:N}).
Thus, instead of directly optimizing Θ̂ML in (3.6), one may use Algorithms 4 and 6 to compute the
optimizing Θ̂ in (3.7) and improve the log-likelihood iteratively until convergence. It can be shown
under certain favorable conditions that the limiting Θ would actually be the maximum likelihood
estimate Θ̂ML .
3.2 Factor Analyzed HMMs

Extending the notion of standard HMMs, a factor analyzed HMM is a weighted graph G(V, E) with the vertex set V composed of a unique start state START, an end state END and a set of emitting states. The edge set E is weighted as usual via a transition probability matrix q_{s,s'}. When the FA-HMM is at an emitting state s_t = s at time t, it emits an observation o_t according to the following generative model.
  x_t ∼ N(µ_u^s, Σ_u^s),
  z_t ∼ N(µ_z^s, Σ_z^s),
  o_t = H^s x_t + z_t.   (3.11)
As in standard factor analysis, we assume that x_t ∈ R^d is a latent variable that forms the factors of o_t, and z_t is noise with diagonal covariance Σ_z^s that corrupts the observation. Among all these random variables, o_t is the only one that is observed.
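The generative model (3.11) can be rendered as a simple sampler; the two-state inventory, transition matrix and parameter values below are toy assumptions, and the START/END bookkeeping is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(4)
D, d, T = 5, 2, 100
n_states = 2
q = np.array([[0.9, 0.1], [0.2, 0.8]])                 # transition matrix q_{s,s'}
H = rng.standard_normal((n_states, D, d))              # per-state loading H^s
mu_u = rng.standard_normal((n_states, d))              # per-state latent means
mu_z = rng.standard_normal((n_states, D))              # per-state noise means
sigma_z = rng.uniform(0.1, 0.5, size=(n_states, D))    # diagonal noise std devs

states, obs = [], []
s = 0                                                  # assumed initial state
for t in range(T):
    x = mu_u[s] + rng.standard_normal(d)               # x_t ~ N(mu_u^s, I)
    z = mu_z[s] + sigma_z[s] * rng.standard_normal(D)  # z_t ~ N(mu_z^s, Sigma_z^s)
    obs.append(H[s] @ x + z)                           # o_t = H^s x_t + z_t
    states.append(s)
    s = rng.choice(n_states, p=q[s])                   # draw the next state
states, obs = np.array(states), np.array(obs)
```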
3.2.1 Inferring the joint distribution of s_{1:N} and x_{1:N}
Given a sequence of observations o_{1:N}, the inference problem is to compute a joint distribution on the random variables {s_{1:N}, x_{1:N}} for every possible valid state sequence s_{1:N}.³ The conditional distribution p(s_{1:N}, x_{1:N}|o_{1:N}) can be written using the chain rule as

  p(s_{1:N}, x_{1:N}|o_{1:N}) = p(s_{1:N}|o_{1:N}) p(x_{1:N}|s_{1:N}, o_{1:N}).

Next, using the graphical model structure of the FA-HMM in Figure 3, we may invoke the Markov properties, implying (assuming s_0 = START and s_{N+1} = END)

  p(s_{1:N}, x_{1:N}|o_{1:N}) = ∏_{t=0}^{N} p(s_{t+1}|s_t, o_{1:N}) ∏_{t=1}^{N} p(x_t|s_t, o_t).

³ The random variable z_t is not included in the inference because given o_t and x_t, z_t is deterministic.
Thus, it suffices to compute a joint distribution γ_t^{s,s'} = p(s_t = s, s_{t+1} = s', o_{1:N}) of the states s_t, s_{t+1} with the observations o_{1:N}, and a posterior distribution of the latent vector p_Θ(x_t|s_t, o_t), for complete inference. γ_t^{s,s'} can be computed as we did for the standard HMM using the forward and backward probabilities, α_t^s and β_{t+1}^{s'} (refer to Equations (2.3, 2.4, 2.5)). Both make use of the state conditional observation densities p_{Θ^s}(o_t), which may be given for a FA-HMM (after marginalizing x_t) as

  p_Θ(o_t|s) = N( o_t; H^s µ_u^s + µ_z^s, H^s Σ_u^s H^{sT} + Σ_z^s ).   (3.12)

The mean and the covariance of the posterior distribution of the latent vector, p_Θ(x_t|s_t, o_t), can be computed using PPCAinfer(o_t, Θ^s).⁴
Finally, to compute the most likely s1:N given the observations o1:N , we may use the Viterbi procedure given in Algorithm 3 with pΘ (ot |s) given in equation (3.12).
3.2.2 Learning the Parameters of a FA-HMM via EM

Let us assume we have M sequences of observations [o_{1:N_j}^j]_{j=1}^{M}, each associated with a grammar G_j that contains an unweighted graph G_j and the time information of the labels (e.g., phonemes in a speech signal), encoded using a boolean matrix b_t^s denoting whether a state s can be active at time t. Thus, every permissible state sequence satisfies s_{1:N_j} ∈ G_j for the jth observation sequence. The maximum likelihood estimation process attempts to find Θ̂_ML under which the total log-likelihood of the observations is maximized. The log-likelihood of an observation sequence o_{1:N_j}^j under Θ for the FA-HMM can be written after integrating out the hidden state sequences s_{1:N_j} and x_{1:N_j}:

  L_Θ(o_{1:N_j}^j) = log Σ_{s_{1:N_j} ∈ G_j} ∫_{x_{1:N_j}} q_{s_{N_j},END} ∏_{t=1}^{N_j} q_{s_{t−1},s_t} p(x_t|s_t) p_Θ(o_t^j|s_t, x_t) dx_{1:N_j}.   (3.13)

⁴ If dim(o_t) is high, one may use Algorithm 5.
The maximum likelihood estimate for Θ given the sequences of observations would be

  Θ̂_ML = argmax_Θ Σ_{j=1}^{M} L_Θ(o_{1:N_j}^j).   (3.14)
Due to the presence of several hidden variables, directly optimizing (3.14) is non-trivial. Hence we resort to EM, where we first write the joint distribution of all the random variables (both seen and unseen). This is also referred to as the complete log-likelihood. In the FA-HMM setting, this is the joint likelihood of s_{1:N_j}, x_{1:N_j} and o_{1:N_j}^j. The complete log-likelihood, L̃_Θ(s_{1:N_j}, x_{1:N_j}, o_{1:N_j}^j), is another random variable. The E-step in EM removes this randomness by computing an expectation of this complete log-likelihood under a current set of parameters Θ̃. The Θ̂ from the M-step is then computed as
  Θ̂ = argmax_Θ Σ_{j=1}^{M} E_Θ̃[ L̃_Θ(s_{1:N_j}, x_{1:N_j}, o_{1:N_j}^j | G_j) ].   (3.15)
To compute the expectation above, we first decompose the complete log-likelihood using the chain rule (again suppressing the conditioning on G_j for convenience) as

  L̃_Θ(s_{1:N_j}, x_{1:N_j}, o_{1:N_j}^j) = L̃_Θ(s_{1:N_j}) + L̃_Θ(x_{1:N_j}|s_{1:N_j}) + L̃_Θ(o_{1:N_j}|x_{1:N_j}, s_{1:N_j}),

where the three terms are denoted Q̃_1^j, Q̃_2^j and Q̃_3^j respectively.
Due to linearity of expectations, we may compute the expectations of each of these quantities separately. The first term is independent of x_{1:N_j}. Hence we may simplify its expectation as
  Σ_{s_{1:N_j} ∈ G_j} ∫_{x_{1:N_j}} p_Θ̃(s_{1:N_j}, x_{1:N_j}|o_{1:N_j}^j) Q̃_1^j dx_{1:N_j}
  = Σ_{s_{1:N_j} ∈ G_j} ∫_{x_{1:N_j}} p_Θ̃(s_{1:N_j}, x_{1:N_j}|o_{1:N_j}^j) Σ_{t=0}^{N_j} log q_{s_t,s_{t+1}} dx_{1:N_j}
  = Σ_{t=0}^{N_j} Σ_{(s,s') ∈ E, b_t^s = 1} p_Θ̃(s_t = s, s_{t+1} = s'|o_{1:N_j}^j) log q_{s,s'},

where the posterior transition probability p_Θ̃(s_t = s, s_{t+1} = s'|o_{1:N_j}^j) is denoted γ̂_{t,j}^{s,s'}.
Thus,

  Q_1^j = E_Θ̃[Q̃_1^j] = Σ_{t=0}^{N_j} Σ_{(s,s') ∈ E, b_t^s = 1} γ̂_{t,j}^{s,s'} log q_{s,s'}.   (3.16)
To simplify Q̃_2^j, we first note that Q̃_2^j = Σ_{t=1}^{N_j} log p_Θ(x_t|s_t). The expectation of Q̃_2^j can be simplified as
  Q_2^j = E_Θ̃[Q̃_2^j] = Σ_{t=1}^{N_j} Σ_s ∫_{x_t} p_Θ̃(s_t = s, x_t|o_{1:N_j}^j) log p_{Θ^s}(x_t) dx_t
  = Σ_{t=1}^{N_j} Σ_s p_Θ̃(s_t = s|o_{1:N_j}^j) ∫_{x_t} p_Θ̃(x_t|o_{1:N_j}^j, s_t = s) log p_{Θ^s}(x_t) dx_t
  = Σ_{t=1}^{N_j} Σ_s γ̂_{t,j}^s ∫_{x_t} p_Θ̃(x_t|o_t^j, s_t = s) log p_{Θ^s}(x_t) dx_t.   (3.17)
The mean and the covariance of the posterior distribution p_Θ̃(x_t|o_t^j, s_t = s) are available from Algorithm 4 as µ_{t,j}^s and Σ_{t,j}^s respectively. The expected second order statistics can be given in terms of these quantities as E_Θ̃(x_t x_t^T|s_t = s, o_t^j) = S_{t,j}^s = Σ_{t,j}^s + µ_{t,j}^s µ_{t,j}^{sT}. The integral in the last equation of (3.17) is essentially an expectation of log p_{Θ^s}(x_t) under p_Θ̃(x_t|o_t^j, s_t = s). Hence, we may simplify this integral as
  ∫_{x_t} p_Θ̃(x_t|o_t^j, s_t = s) log p_{Θ^s}(x_t) dx_t = E_Θ̃[log p_{Θ^s}(x_t)]
  = −(1/2) E_Θ̃[ log |Σ_u^s| + Trace( Σ_u^{s−1} (x_t − µ_u^s)(x_t − µ_u^s)^T ) ]
  = −(1/2) log |Σ_u^s| − (1/2) Trace( Σ_u^{s−1} ( S_{t,j}^s − µ_u^s µ_{t,j}^{sT} − µ_{t,j}^s µ_u^{sT} + µ_u^s µ_u^{sT} ) ).
Plugging this back into (3.17), we get

  Q_2^j = Σ_{t,s} γ̂_{t,j}^s [ −(1/2) log |Σ_u^s| − (1/2) Trace( Σ_u^{s−1} ( S_{t,j}^s − µ_u^s µ_{t,j}^{sT} − µ_{t,j}^s µ_u^{sT} + µ_u^s µ_u^{sT} ) ) ].   (3.18)
The expectation of Q̃_3^j can be similarly simplified as

  Q_3^j = E_Θ̃[Q̃_3^j] = Σ_{t=1}^{N_j} Σ_s ∫_{x_t} p_Θ̃(s_t = s, x_t|o_{1:N_j}^j) log p_{Θ^s}(o_t|x_t) dx_t
  = Σ_{t=1}^{N_j} Σ_s p_Θ̃(s_t = s|o_{1:N_j}^j) ∫_{x_t} p_Θ̃(x_t|o_{1:N_j}^j, s_t = s) log p_{Θ^s}(o_t|x_t) dx_t
  = Σ_{t=1}^{N_j} Σ_s γ̂_{t,j}^s ∫_{x_t} p_Θ̃(x_t|o_t^j, s_t = s) log p_{Θ^s}(o_t|x_t) dx_t
  = Σ_{t=1}^{N_j} Σ_s γ̂_{t,j}^s E_Θ̃[log p_{Θ^s}(o_t|x_t)]
  = Σ_{t=1}^{N_j} Σ_s γ̂_{t,j}^s E_Θ̃[log N(o_t; H^s x_t + µ_z^s, Σ_z^s)].   (3.19)
Furthermore, as in standard factor analysis, we assume that Σ_z^s = diag(σ_1^{s2}, . . . , σ_D^{s2}) is diagonal. Denoting v(p) to be the pth element of a D-dimensional vector v and h_p^s to be the pth row of H^s, we may rewrite (3.19) as
  −Σ_{t,s,p} γ̂_{t,j}^s ( log σ_p^{s2} ) / 2
  −Σ_{t,s,p} γ̂_{t,j}^s [ o_t^j(p)^2 + h_p^s S_{t,j}^s h_p^{sT} + µ_z^s(p)^2 − 2 o_t^j(p) µ_z^s(p) − 2 h_p^s µ_{t,j}^s o_t^j(p) + 2 h_p^s µ_{t,j}^s µ_z^s(p) ] / ( 2σ_p^{s2} ).   (3.20)
From (3.15), we have that

  Θ̂ = argmax_Θ Σ_{j=1}^{M} Q_1^j + Σ_{j=1}^{M} Q_2^j + Σ_{j=1}^{M} Q_3^j,

where the three sums are denoted Q_1, Q_2 and Q_3 respectively. Using Σ_{t,j} γ̂_{t,j}^{s,s'} = N_{s,s'},

  Q_1 = Σ_{s,s'} N_{s,s'} log q_{s,s'}.   (3.21)
Using Σ_{t,j} γ̂_{t,j}^s = N_s, f^s = Σ_{t,j} γ̂_{t,j}^s µ_{t,j}^s and S^s = Σ_{t,j} γ̂_{t,j}^s S_{t,j}^s, we can simplify Q_2 as

  Q_2 = Σ_s [ −(N_s/2) log |Σ_u^s| − (1/2) Trace( Σ_u^{s−1} ( S^s − µ_u^s f^{sT} − f^s µ_u^{sT} + N_s µ_u^s µ_u^{sT} ) ) ].   (3.22)

Using v_o^s(p) = Σ_{t,j} γ̂_{t,j}^s o_t^j(p)^2, w_o^s(p) = Σ_{t,j} γ̂_{t,j}^s o_t^j(p) and V_{xo}^s(p) = Σ_{t,j} γ̂_{t,j}^s µ_{t,j}^s o_t^j(p), we may compute Q_3 from (3.20) as
  Q_3 = −Σ_{s,p} (N_s/2) log σ_p^{s2}
  −Σ_{s,p} [ v_o^s(p) + h_p^s S^s h_p^{sT} + N_s µ_z^s(p)^2 − 2 w_o^s(p) µ_z^s(p) − 2 h_p^s V_{xo}^s(p) + 2 h_p^s f^s µ_z^s(p) ] / ( 2σ_p^{s2} ).   (3.23)
In order to optimize for q_{s,s'}, we only need to consider Q_1, since Q_2 and Q_3 do not depend on the transition matrix parameters. One may maximize (3.21) subject to Σ_{s'} q_{s,s'} = 1 to obtain

  q̂_{s,s'} = N_{s,s'} / Σ_{s''} N_{s,s''}.   (3.24)
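The transition update (3.24) is just row-normalization of the expected transition counts; the count matrix below is an assumed example:

```python
import numpy as np

# N_{s,s'}: expected transition counts accumulated from the E-step
# (the values here are arbitrary for illustration).
N_ss = np.array([[30.0, 10.0],
                 [ 5.0, 55.0]])

# q_hat[s, s'] = N_{s,s'} / sum_{s''} N_{s,s''}
q_hat = N_ss / N_ss.sum(axis=1, keepdims=True)
```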
Similarly, to optimize for µ_u^s and Σ_u^s, one only needs to consider (3.22). This can be done using standard matrix calculus (unfamiliar readers may refer to [32]):

  µ̂_u^s = (1/N_s) f^s,
  Σ̂_u^s = (1/N_s) S^s − µ̂_u^s µ̂_u^{sT}.
Finally, to optimize for H^s and the parameters of the noise z_t, we need to consider (3.23). If one does not impose any further structure on H^s, we may invoke standard matrix calculus to obtain the updates as

  ĥ_p^s = ( V_{xo}^s(p)^T − (1/N_s) w_o^s(p) f^{sT} ) ( S^s − (1/N_s) f^s f^{sT} )^{-1},
  µ̂_z^s(p) = (1/N_s) ( w_o^s(p) − ĥ_p^s f^s ),
  σ̂_p^{s2} = ( v_o^s(p) + ĥ_p^s S^s ĥ_p^{sT} − 2 ĥ_p^s V_{xo}^s(p) ) / N_s − µ̂_z^s(p)^2.   (3.25)

3.2.3 Tied Estimation of the Loading Matrix: H^s = H
A common assumption in factor analysis when dealing with multiple classes is to tie the loading matrices, H^s = H ∀s. The tying enables more robust parameter estimation. This can be achieved by employing standard matrix calculus on (3.23) after enforcing h_p^s = h_p. However, in this setting, we may still allow the parameters of the observation noise to be class-dependent. In (3.25), we saw that the estimate ĥ_p^s was independent of µ̂_{z,p}^s and σ̂_p^s, implying we could estimate it directly from the sufficient statistics. The estimate for ĥ_p is slightly modified when the loading matrix is tied across classes. Setting the derivatives of (3.23) with respect to h_p, µ_{z,p}^s and σ_p^s to zero, we get the following set of equations:
  ĥ_p = [ Σ_s (1/σ̂_p^{s2}) ( V_{xo}^s(p) − µ̂_z^s(p) f^{sT} ) ] [ Σ_s (1/σ̂_p^{s2}) S^s ]^{-1},
  µ̂_z^s(p) = (1/N_s) ( w_o^s(p) − ĥ_p f^s ),
  σ̂_p^{s2} = ( v_o^s(p) + ĥ_p S^s ĥ_p^T − 2 ĥ_p V_{xo}^s(p) ) / N_s − µ̂_z^s(p)^2.   (3.26)
Due to the interdependence of hp and σ̂ps , we cannot get a closed form for the set of equations above.
However, since (3.23) is convex, we can iterate over the three equations (3.26) one after the other,
guaranteeing improvement in the objective function.
3.2.4 Tied Estimation of the Loading Parameters: H^s = H, Σ_z^s = Σ_z

We get the linear discriminative setting analogous to probabilistic LDA [37] or HLDA [34] when we further impose that the observation noise parameters are also shared across classes, implying there is a common loading matrix as well as common observation noise parameters for all the classes. The motivation for doing this is to require the model to learn the structural variability between the classes entirely within the hidden continuous state x_t, modeling all the remaining randomness in o_t as a state-independent phenomenon. The tying also enables robust parameter estimation. Moreover, this model directly allows us to compare our techniques with the well known LDA and HLDA, where the observations are reduced to a lower dimensional vector. Our setting, we claim, is a more principled version of LDA since the parameters are learned according to a maximum likelihood objective. Similar to (3.25), we can write the estimates for H, µ_z and σ_{1:D}^2 as
  ĥ_p = ( V_{xo}(p)^T − (1/N) w_o(p) f^T ) ( S − (1/N) f f^T )^{-1},
  µ̂_z(p) = (1/N) ( w_o(p) − ĥ_p f ),
  σ̂_p^2 = ( v_o(p) + ĥ_p S ĥ_p^T − 2 ĥ_p V_{xo}(p) ) / N − µ̂_z(p)^2.   (3.27)
As opposed to (3.25), the statistics in (3.27) are the accumulated statistics (obtained by adding the
statistics from all the states) and hence do not have any dependence on s.
3.3 Connection to LDA, HLDA and Semi-Tied Covariance Modeling
The factor analyzed model described in the last few sections is an extension of classical PPCA to the HMM framework. The underlying x_t can be thought of as a reduced dimensional representation of the observation o_t. Due to the presence of the observation noise z_t, x_t cannot be deterministically inferred. There are other known techniques, such as heteroscedastic linear discriminant analysis (HLDA) [34], that identify x_t deterministically. HLDA assumes the following state-space model for the underlying state.
  y_t = [ x_t ; w_t ] ∼ N( [ µ_u^{s_t} ; µ_w ], [ Σ_u^{s_t}, 0 ; 0, Σ_w ] ).   (3.28)
Finally, the observations o_t are modeled in HLDA as

  o_t = H y_t.

Here H ∈ R^{D×D} is assumed to be invertible, and thus y_t is not actually a hidden state since y_t = H^{-1} o_t. The only structure imposed on y_t is that it is assumed to be a concatenation of a state-dependent random variable x_t ∈ R^d with Gaussian parameters {µ_u^{s_t}, Σ_u^{s_t}} and a state-independent random variable w_t ∈ R^{D−d} with Gaussian parameters {µ_w, Σ_w}. Furthermore, it is assumed that
xt and wt are independent random variables. An equivalent representation of (3.28) is
  x_t ∼ N( µ_u^{s_t}, Σ_u^{s_t} ),
  z_t ∼ N( H_{d+1:D} µ_w, H_{d+1:D} Σ_w H_{d+1:D}^T ),
  o_t = H_{1:d} x_t + z_t.   (3.29)
Here H1:d represents the first d columns of the loading matrix H and Hd+1:D is the last D − d
columns of it. (3.29) looks almost identical to the state-space representation of the FA-HMM in
(3.11). The major difference is in the distribution of zt which is a low-rank state-independent noise
in the case of HLDA, while it is diagonal and potentially state-dependent in the FA-HMM case.
3.3.1 EM Estimation of the Parameters of HLDA

Given M observation sequences [o_{1:N_j}^j], each associated with an unweighted grammar G_j, we first compute the marginal probabilities γ̂_{t,j}^s as we did for the standard HMM and FA-HMM setups. Then a global observation covariance Σ̂_o and class dependent observation covariances Σ̂_o^s are computed as
  N_s = Σ_{j,t} γ̂_{t,j}^s,
  N = Σ_{j=1}^{M} N_j,
  µ_o = (1/N) Σ_{j,t} o_t^j,
  Σ_o = (1/N) Σ_{j,t} o_t^j o_t^{jT} − µ_o µ_o^T,
  µ_o^s = (1/N_s) Σ_{j,t} γ̂_{t,j}^s o_t^j,
  Σ_o^s = (1/N_s) Σ_{j,t} γ̂_{t,j}^s o_t^j o_t^{jT} − µ_o^s µ_o^{sT}.
Then, it is shown in [34] that the EM estimate of the matrix H^{-1} = A = [ A_d ; A_{D−d} ] is computed as

  Â = argmax_A  log |A| − (1/2) log |A_{D−d} Σ_o A_{D−d}^T| − Σ_{s=1}^{C} (N_s / 2N) log |A_d Σ_o^s A_d^T|.   (3.30)
Once this is done, the EM estimated parameters Θ of the HLDA can be directly given as

  H = A^{-1},
  µ_u^s = A_d µ_o^s,
  Σ_u^s = A_d Σ_o^s A_d^T,
  µ_w = A_{D−d} µ_o,
  Σ_w = A_{D−d} Σ_o A_{D−d}^T.
The hard part is computing A, since the objective function in (3.30) is non-convex in general and requires gradient ascent techniques. Another major drawback of this kind of factor analysis, however, is the need to estimate the full matrix H instead of only its first d columns. As a result, (H)LDA is not an ideal model for very high-dimensional observations. However, if one assumes that all the classes share a common covariance matrix, Σ_u^s = Σ_u, in the state-space equation (3.28), then the estimate for A becomes the standard linear discriminant analysis (LDA), whose rows may be given by computing the top d principal eigenvectors of the following matrix:
  Σ_o^{-1} ( Σ_o − (1/N) Σ_s N_s Σ_o^s ) = Σ_o^{-1} ( (1/N) Σ_s N_s µ_o^s µ_o^{sT} − µ_o µ_o^T ).
Closed form EM updates for A = H^{-1} also exist when the noise covariances Σ_u^s and Σ_w are assumed to be diagonal [38]. In the special case of d = D under this diagonal assumption, the matrix A is referred to as the semi-tied covariance transform.
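The LDA special case above can be sketched directly: with a shared within-class covariance, the projection rows come from the top d eigenvectors of Σ_o^{-1}((1/N) Σ_s N_s µ_o^s µ_o^{sT} − µ_o µ_o^T). The two-class synthetic data and dimensions below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
D, d = 4, 1
# Two classes separated along the first coordinate (assumed toy data).
X0 = rng.standard_normal((200, D)) + np.array([2.0, 0, 0, 0])
X1 = rng.standard_normal((200, D)) - np.array([2.0, 0, 0, 0])
X = np.vstack([X0, X1])

mu_o = X.mean(axis=0)
Sigma_o = np.cov(X.T, bias=True)          # global (total) observation covariance
B = np.zeros((D, D))                      # (1/N) sum_s N_s mu_o^s mu_o^sT - mu_o mu_o^T
for Xs in (X0, X1):
    mu_s = Xs.mean(axis=0)
    B += (len(Xs) / len(X)) * np.outer(mu_s, mu_s)
B -= np.outer(mu_o, mu_o)

# Top-d eigenvectors of Sigma_o^{-1} B (not symmetric in general, so use eig).
vals, vecs = np.linalg.eig(np.linalg.solve(Sigma_o, B))
order = np.argsort(-vals.real)
A_d = vecs[:, order[:d]].real.T           # the d x D LDA projection rows
w = A_d[0] / np.linalg.norm(A_d[0])
```

With the classes separated along the first axis, the leading discriminant direction should be dominated by that coordinate.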
3.4 Chapter Summary
In the previous chapter, we introduced the standard HMM, which models the observations as emissions from a discrete state. When the observation components are highly correlated, a full covariance matrix is needed to model the correlations. However, when the dimension of o_t is high, full covariance models require far too many parameters. One way to tackle high dimensionality is to project o_t to x_t with dim(x_t) << dim(o_t) and model x_t via an HMM. This is often referred to as factor analysis, where x_t forms the factors of o_t.
In this chapter, we discussed some standard probabilistic factor analysis techniques, such as PPCA, FA-HMM, LDA and HLDA, to perform factor analysis and to better model the variability in the data. HLDA is a standard model which has been extensively used in speech recognition. Although it performs dimensionality reduction, the number of parameters in these models is still O(D^2) and, furthermore, estimation requires O(D^3) time. The FA-HMM is a more compact model which performs the same kind of dimensionality reduction with only O(Dd) parameters.
Furthermore, in this chapter, we have developed techniques that can estimate the parameters in O(Dd^2) time instead of the conventional O(D^3). In Chapter 6, we will perform a comprehensive set of experiments contrasting FA-HMMs with HMMs. FA-HMMs will be shown to improve over HMMs in gesture recognition, with accuracies over 70% for standard RMIS tasks.
Chapter 4
Linear Dynamical System
In the last two chapters, we developed the concept of hidden Markov models and extended them to
incorporate factor analysis. One fundamental drawback with the standard hidden Markov model,
however, is its reliance on only a discrete variable to fully capture temporal correlations. Adjacent
observations in smoothly varying time-series signals (such as speech or video) are often correlated
in more complex ways. In other words, the correlation between successive frames o_{t−1} and o_t cannot be fully explained away by the discrete states s_{t−1} and s_t. Linear dynamical systems (LDS) comprise a class of models that capture this kind of correlation by introducing a continuous valued hidden state.
Note in this context that the machinery developed for PPCA in Chapter 3 also makes it possible
to analyze LDS; as suggested by the similarity of Figures 4.1 and 4.2. In this chapter, we start
by setting up the standard LDS problem and derive an EM algorithm for maximum likelihood
estimation of model parameters. The standard LDS simply replaces the discrete state in the HMM,
st , with a continuous state, xt . The technical contribution of this chapter is to extend the efficient
inference scheme we described for PPCA in Algorithm 5 for EM learning of the LDS parameters.
This enables us to perform principled ML estimation of LDS models for high dimensional image
sequences which, to the best of our knowledge, has not been addressed in the literature. Using EM estimation to identify the parameters of an LDS is superior to other system identification techniques, such as the numerical algorithms for subspace identification [39], in terms of computational complexity. For experiments, we use the MIT temporal texture database [40] and the dynamic texture videos used by Kwatra et al. [41], and compare with Doretto's non-probabilistic system identification technique [42].
4.1 Linear Dynamical System
Let us assume that we have a sequence of observations o1:N with ot ∈ RD . A linear dynamical
system model of data assumes that there is an underlying continuous state sequence x1:N with
xt ∈ Rd for some d << D which evolves according to
xt+1 = Fxt + ut+1 ,
(4.1)
and that the ot ’s are noisy observation of xt as
ot+1 = Hxt+1 + zt+1 ,
(4.2)
where the exogenous signal {u_t} and noise {z_t} are assumed to be stochastically distributed according to N(µ_u, Σ_u) and N(µ_z, Σ_z) respectively. The stochasticity of the exogenous signals plays a vital role in capturing the dynamics of noisy observation sequences. The estimation of the underlying hidden state is done using Kalman filtering [43–47].
For ease of reference, we will divide the parameters of the LDS into two sets. The first set, F = {F, µ_u, Σ_u}, governs the dynamics of x_t. The second, H = {H, µ_z, Σ_z}, governs the distribution of the observation given the underlying continuous state x_t. The key difference between the linear dynamical system and a hidden Markov model is the generalization of the discrete state s_t in the HMM to a continuous state x_t in the LDS. It can also be thought of as a generalization of the PPCA we described in Section 3.1, except that there is now temporal dependence in x_t, as depicted in Figures 4.1 and 4.2.
Figure 4.1: Probabilistic PCA.   Figure 4.2: Linear dynamical system.

4.2 Inference in Linear Dynamical System
The inference problem in a linear dynamical system is to determine the sequence x_t for each t. Since x_t is a random variable and the observation is partial, inferring the exact value of x_t is not possible. Hence inference has to be done in a probabilistic sense. Formally, we compute a conditional distribution of x_{1:N} given the observation sequence o_{1:N}. Using Markov properties, we may write this as

  p(x_{1:N}|o_{1:N}) = ∏_{t=0}^{N−1} p(x_{t+1}|x_t, o_{1:N}).

We will assume, for ease of computation, that x_0 is a unique continuous start state (typically, x_0 = 0). Moreover, in order to compute p(x_{t+1}|x_t, o_{1:N}), it suffices to compute the posterior joint distribution,
p(xt+1 , xt |o1:N ) for each 0 ≤ t ≤ N − 1. Thus, the inference would be complete when we have
these pairwise distributions. As in standard HMM, we can divide the inference into two stages, a
forward and a backward pass through the observations o1:N .
4.2.1 Forward Pass

The forward pass computes the joint distribution of x_{t+1} and x_t given observations o_{1:t+1}; in other words, it computes p(x_{t+1}, x_t|o_{1:t+1}). Since o_{1:N} and x_{1:N} are jointly Gaussian, by properties of joint Gaussian random variables, p(x_{t+1}, x_t|o_{1:t+1}) is also a Gaussian density. Assuming that p(x_t, x_{t−1}|o_{1:t}) was computed in the previous step of the forward pass, the distribution p(x_t|o_{1:t}) can be obtained by marginalizing out x_{t−1}.² We will denote the parameters of this distribution as follows:

  p(x_t|o_{1:t}) = N( µ_{t|t}, Σ_{t|t} ).   (4.3)
We can now use the state equation given in (4.1) to compute the joint distribution of the states x_t and x_{t+1} as

  p(x_t, x_{t+1}|o_{1:t}) = N( µ_{x_t,x_{t+1}|t}, Σ_{x_t,x_{t+1}|t} ),   (4.4)

where µ_{x_t,x_{t+1}|t} and Σ_{x_t,x_{t+1}|t} are given as

  µ_{x_t,x_{t+1}|t} = [ µ_{t|t} ; Fµ_{t|t} + µ_u ],
  Σ_{x_t,x_{t+1}|t} = [ Σ_{t|t}, Σ_{t|t}F^T ; FΣ_{t|t}, FΣ_{t|t}F^T + Σ_u ].

² In the case of a Gaussian distribution, this simply amounts to taking the mean and the covariance of the part belonging to x_t in the joint distribution.
The above is also called the prediction step and is summarized in Algorithm 7. Finally, using the observation equation given in (4.2), we may obtain the joint distribution of x_t, x_{t+1} and o_{t+1} given o_{1:t} as

  p(x_t, x_{t+1}, o_{t+1}|o_{1:t}) = N( µ_{x_t,x_{t+1},o_t|t}, Σ_{x_t,x_{t+1},o_t|t} ),   (4.5)

where the mean µ_{x_t,x_{t+1},o_t|t} and the covariance³ Σ_{x_t,x_{t+1},o_t|t} are given by

  µ_{x_t,x_{t+1},o_t|t} = [ µ_{t|t} ; µ_{t+1|t} ; Hµ_{t+1|t} + µ_z ],
  Σ_{x_t,x_{t+1},o_t|t} = [ Σ_{t|t}, Σ_{t|t}F^T, Σ_{t|t}F^T H^T ; FΣ_{t|t}, Σ_{t+1|t}, Σ_{t+1|t}H^T ; HFΣ_{t|t}, HΣ_{t+1|t}, HΣ_{t+1|t}H^T + Σ_z ].
Algorithm: Predict ({µ_{t|t}, Σ_{t|t}}, F)
  µ_{t+1|t} ← Fµ_{t|t} + µ_u;
  Σ_{t+1|t} ← FΣ_{t|t}F^T + Σ_u;
  Σ_{t+1,t|t} ← FΣ_{t|t};
  return {µ_{t+1|t}, Σ_{t+1|t}, Σ_{t+1,t|t}};
Algorithm 7: Predict x_{t+1|t} from x_{t|t}
Now we may use Property 1 described in Chapter 3 to compute the distribution of x = [ x_t ; x_{t+1} ] given o_{t+1}, from the joint distribution of x_t, x_{t+1}, o_{t+1} given in (4.5). We may then obtain the conditional distribution p(x_t, x_{t+1}|o_{1:t+1}) as

³ The elements of the covariance Σ_{x_t,x_{t+1},o_t|t} consist of the terms E[x_t x_t^T], E[x_t x_{t+1}^T], E[x_t o_{t+1}^T], E[x_{t+1} x_{t+1}^T], E[x_{t+1} o_{t+1}^T] and E[o_{t+1} o_{t+1}^T], which can be computed from equations (4.1) and (4.2).
  p(x_t, x_{t+1}|o_{1:t+1}) = N( [ µ_{t|t+1} ; µ_{t+1|t+1} ], [ Σ_{t|t+1}, Σ_{t+1,t|t+1}^T ; Σ_{t+1,t|t+1}, Σ_{t+1|t+1} ] ),

where

  µ_{t|t+1} = µ_{t|t} + Σ_{t|t}F^T H^T P(o_{t+1} − Hµ_{t+1|t} − µ_z),
  Σ_{t|t+1} = Σ_{t|t} − Σ_{t|t}F^T H^T P H F Σ_{t|t},
  µ_{t+1|t+1} = µ_{t+1|t} + Σ_{t+1|t}H^T P(o_{t+1} − Hµ_{t+1|t} − µ_z),
  Σ_{t+1|t+1} = Σ_{t+1|t} − Σ_{t+1|t}H^T P H Σ_{t+1|t},
  Σ_{t+1,t|t+1} = FΣ_{t|t} − Σ_{t+1|t}H^T P H F Σ_{t|t}.
The above computations are referred to as the update step of the Kalman filter and are given in Algorithm 8.⁴

Algorithm: Update (ξ_{t+1|t}, H, o_{t+1})
  P ← (HΣ_{t+1|t}H^T + Σ_z)^{-1};
  µ_{t|t+1} ← µ_{t|t} + Σ_{t|t}F^T H^T P(o_{t+1} − Hµ_{t+1|t} − µ_z);
  Σ_{t|t+1} ← Σ_{t|t} − Σ_{t|t}F^T H^T P H F Σ_{t|t};
  µ_{t+1|t+1} ← µ_{t+1|t} + Σ_{t+1|t}H^T P(o_{t+1} − Hµ_{t+1|t} − µ_z);
  Σ_{t+1|t+1} ← Σ_{t+1|t} − Σ_{t+1|t}H^T P H Σ_{t+1|t};
  Σ_{t+1,t|t+1} ← FΣ_{t|t} − Σ_{t+1|t}H^T P H F Σ_{t|t};
  L ← (1/2) log |P| − (1/2) (o_{t+1} − Hµ_{t+1|t} − µ_z)^T P (o_{t+1} − Hµ_{t+1|t} − µ_z);
  return {ξ_{t+1|t+1}, L};
Algorithm 8: Update step of the Kalman filter.
The net resulting forward pass using the Kalman filter is summarized in Algorithm 9. We will use ξ_{t|t} = {µ_{t|t}, Σ_{t|t}, Σ_{t−1|t}, Σ_{t,t−1|t}} to simplify notation.

⁴ The algorithm also returns the likelihood L of the observation o_{t+1} given o_{1:t}.

Algorithm: LDSForward (ξ_{0|0}, F, H, o_{1:N})
  for t ← 0 to N − 1 do
    ξ_{t+1|t} ← Predict (ξ_{t|t}, F);
    ξ_{t+1|t+1} ← Update (ξ_{t+1|t}, H, o_{t+1});
  end
  return {ξ_{t|t}}_{t=1}^{T}
Algorithm 9: Forward pass of the Kalman filter
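The Predict/Update recursions of Algorithms 7-9 can be sketched in numpy; only the filtered moments and the per-step likelihood are computed, and the toy LDS parameters below are assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)
D, d, N = 4, 2, 50                        # assumed toy dimensions
F = 0.9 * np.eye(d)
H = rng.standard_normal((D, d))
mu_u, Sigma_u = np.zeros(d), 0.1 * np.eye(d)
mu_z, Sigma_z = np.zeros(D), 0.2 * np.eye(D)

# Simulate observations from the model (4.1)-(4.2).
x, obs = np.zeros(d), []
for _ in range(N):
    x = F @ x + rng.multivariate_normal(mu_u, Sigma_u)
    obs.append(H @ x + rng.multivariate_normal(mu_z, Sigma_z))

def predict(mu, Sig):
    """Algorithm 7: p(x_{t+1} | o_{1:t}) from p(x_t | o_{1:t})."""
    return F @ mu + mu_u, F @ Sig @ F.T + Sigma_u

def update(mu_pred, Sig_pred, o):
    """Algorithm 8 (filtered moments only) plus the step likelihood L."""
    P = np.linalg.inv(H @ Sig_pred @ H.T + Sigma_z)
    r = o - H @ mu_pred - mu_z
    mu_f = mu_pred + Sig_pred @ H.T @ P @ r
    Sig_f = Sig_pred - Sig_pred @ H.T @ P @ H @ Sig_pred
    L = 0.5 * np.linalg.slogdet(P)[1] - 0.5 * r @ P @ r
    return mu_f, Sig_f, L

mu, Sig = np.zeros(d), np.zeros((d, d))   # x_0 is the fixed start state
total_L = 0.0
for o in obs:
    mu_pred, Sig_pred = predict(mu, Sig)
    mu, Sig, L = update(mu_pred, Sig_pred, o)
    total_L += L
```

Summing the per-step L values gives the (unnormalized) log-likelihood of o_{1:N}, which is how the forward pass is used for parameter estimation later in the chapter.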
4.2.2 Backward Pass

The final step in the forward pass computes the Gaussian parameters of the joint distribution p(x_{N−1}, x_N | o_{1:N}). The inference is complete only once we can compute the joint distribution p(x_t, x_{t+1} | o_{1:N}) for all 1 ≤ t ≤ N − 1. The goal of the backward pass is to fill in this distribution recursively, starting from t = N − 1, using the complete posterior distribution of the future samples. This can again be done using the prediction-correction approach. To accomplish this, we first write the joint distribution of {x_{t−1}, x_t, x_{t+1}} given o_{1:t}. Using the state equation (4.1) to predict x_{t+1} from x_t, we may write this as
    p(x_{t−1}, x_t, x_{t+1} | o_{1:t}) = N( µ_{x_{t−1},x_t,x_{t+1}|t}, Σ_{x_{t−1},x_t,x_{t+1}|t} ),

where

    µ_{x_{t−1},x_t,x_{t+1}|t} = [ µ_{t−1|t} ; µ_{t|t} ; F µ_{t|t} + µ_u ]

and

    Σ_{x_{t−1},x_t,x_{t+1}|t} =
        [ Σ_{t−1|t}        Σ_{t,t−1|t}^T     Σ_{t,t−1|t}^T F^T     ;
          Σ_{t,t−1|t}      Σ_{t|t}           Σ_{t|t} F^T           ;
          F Σ_{t,t−1|t}    F Σ_{t|t}         F Σ_{t|t} F^T + Σ_u   ].
Let the joint distribution of x_{t−1} and x_t conditioned on o_{1:t} and x_{t+1} be given as

    p(x_{t−1}, x_t | o_{1:t}, x_{t+1}) = N( µ_{x_{t−1},x_t|x_{t+1},t}, Σ_{x_{t−1},x_t|x_{t+1},t} ).

Then one may again use property 1 to derive the expressions for µ_{x_{t−1},x_t|x_{t+1},t} and Σ_{x_{t−1},x_t|x_{t+1},t}, as we did for the forward pass. Let

    µ_{x_{t−1},x_t|x_{t+1},t} = [ µ_{x_{t−1}|x_{t+1},t} ; µ_{x_t|x_{t+1},t} ]

and

    Σ_{x_{t−1},x_t|x_{t+1},t} = [ Σ_{x_{t−1}|x_{t+1},t}, Corr_{x_{t−1},x_t|x_{t+1},t} ; Corr_{x_{t−1},x_t|x_{t+1},t}^T, Σ_{x_t|x_{t+1},t} ].

After defining P = (F Σ_{t|t} F^T + Σ_u)^{−1} and δ_{x_{t+1}} = x_{t+1} − F µ_{t|t} − µ_u, we may obtain the expressions for µ_{x_{t−1}|x_{t+1},t}, Σ_{x_{t−1}|x_{t+1},t}, µ_{x_t|x_{t+1},t}, Σ_{x_t|x_{t+1},t} and Corr_{x_{t−1},x_t|x_{t+1},t}:

    µ_{x_{t−1}|x_{t+1},t} = µ_{t−1|t} + Σ_{t,t−1|t}^T F^T P δ_{x_{t+1}},
    Σ_{x_{t−1}|x_{t+1},t} = Σ_{t−1|t} − Σ_{t,t−1|t}^T F^T P F Σ_{t,t−1|t},
    µ_{x_t|x_{t+1},t} = µ_{t|t} + Σ_{t|t} F^T P δ_{x_{t+1}},
    Σ_{x_t|x_{t+1},t} = Σ_{t|t} − Σ_{t|t} F^T P F Σ_{t|t},
    Corr_{x_{t−1},x_t|x_{t+1},t} = Σ_{t,t−1|t} − Σ_{t|t} F^T P F Σ_{t,t−1|t}.
Now what is left is computing the joint distribution of x_{t−1} and x_t given o_{1:N}. Let this joint distribution be

    p(x_{t−1}, x_t | o_{1:N}) = N( µ_{x_{t−1},x_t|N}, Σ_{x_{t−1},x_t|N} ),

where

    µ_{x_{t−1},x_t|N} = [ µ_{t−1|N} ; µ_{t|N} ]

and

    Σ_{x_{t−1},x_t|N} = [ Σ_{t−1|N}, Σ_{t−1,t|N} ; Σ_{t−1,t|N}^T, Σ_{t|N} ].

We may compute the quantity above by marginalizing p(x_{t−1}, x_t | o_{1:t}, x_{t+1}) with respect to x_{t+1} as follows:

    p(x_{t−1}, x_t | o_{1:N}) = E_{x_{t+1}} [ p(x_{t−1}, x_t | o_{1:t}, x_{t+1}) ].

The expectation above is under the distribution p(x_{t+1} | o_{1:N}). It is easy to see that

    µ_{x_{t−1},x_t|N} = E_{x_{t+1}} [ µ_{x_{t−1},x_t|x_{t+1},t} ],
    Σ_{x_{t−1},x_t|N} = Σ_{x_{t−1},x_t|x_{t+1},t} + Var( µ_{x_{t−1},x_t|x_{t+1},t} ).        (4.6)

Here Var(µ_{x_{t−1},x_t|x_{t+1},t}) is the variance of the random variable µ_{x_{t−1},x_t|x_{t+1},t} under the distribution p(x_{t+1} | o_{1:N}), which is available from the previous step of the backward recursion as

    p(x_{t+1} | o_{1:N}) = N( µ_{t+1|N}, Σ_{t+1|N} ).

Thus, using (4.6) together with the conditional expressions above, we may obtain the parameters ξ_{t|N} = {µ_{t−1|N}, µ_{t|N}, Σ_{t−1|N}, Σ_{t,t−1|N}, Σ_{t|N}}.
The procedure shown in Algorithm 10 takes the parameters of the distributions p(x_t | o_{1:t}) and p(x_{t+1} | o_{1:N}), along with the dynamical parameters F, and computes the smoothed distribution p(x_t, x_{t−1} | o_{1:N}) given by the set of parameters ξ_{t|N}. The backward pass in Algorithm 11 simply iterates t from N − 1 down to 1, calling Algorithm 10 at each step.
Algorithm: RTS(ξ_{t|t}, ξ_{t+1|N}, F)
    P ← (F Σ_{t|t} F^T + Σ_u)^{−1};
    µ_{t−1|N} ← µ_{t−1|t} + Σ_{t,t−1|t}^T F^T P (µ_{t+1|N} − F µ_{t|t} − µ_u);
    µ_{t|N} ← µ_{t|t} + Σ_{t|t} F^T P (µ_{t+1|N} − F µ_{t|t} − µ_u);
    Σ_{t|N} ← Σ_{t|t} − Σ_{t|t} F^T P (I − Σ_{t+1|N} P) F Σ_{t|t};
    Σ_{t,t−1|N} ← Σ_{t,t−1|t} − Σ_{t|t} F^T P (I − Σ_{t+1|N} P) F Σ_{t,t−1|t};
    Σ_{t−1|N} ← Σ_{t−1|t} − Σ_{t,t−1|t}^T F^T P (I − Σ_{t+1|N} P) F Σ_{t,t−1|t};
    return ξ_{t|N};
Algorithm 10: The Rauch-Tung-Striebel smoother
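A minimal NumPy sketch of one RTS backward step, restricted to the smoothed mean and covariance of x_t (the µ_{t−1|N} and lag-one terms of ξ_{t|N} follow the same pattern); the function and variable names here are illustrative, not the dissertation's code.

```python
import numpy as np

def rts_step(mu_f, Sigma_f, mu_s_next, Sigma_s_next, F, mu_u, Sigma_u):
    """One backward step of the RTS smoother (Algorithm 10): combine the
    filtered p(x_t | o_{1:t}) with the already-smoothed p(x_{t+1} | o_{1:N})."""
    P = np.linalg.inv(F @ Sigma_f @ F.T + Sigma_u)   # predictive precision
    G = Sigma_f @ F.T @ P                            # smoother gain
    mu_s = mu_f + G @ (mu_s_next - F @ mu_f - mu_u)
    Sigma_s = Sigma_f - G @ (np.eye(len(mu_f)) - Sigma_s_next @ P) @ F @ Sigma_f
    return mu_s, Sigma_s
```

Algorithm 11 simply calls this step for t = N − 1 down to 1.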
Algorithm: LDSBackward({ξ_{t|t}}_{t=1}^{N}, F)
    for t ← N − 1 to 1 do
        ξ_{t|N} ← RTS(ξ_{t|t}, ξ_{t+1|N}, F);
    end
    return {ξ_{t|N}}_{t=1}^{N};
Algorithm 11: Backward pass routine
Algorithm: LDS-EStep(ξ_{0|0}, F, H, o_{1:N})
    {ξ_{t|t}}_{t=1}^{N} ← LDSForward(ξ_{0|0}, F, H, o_{1:N});
    {ξ_{t|N}}_{t=1}^{N} ← LDSBackward({ξ_{t|t}}_{t=1}^{N}, F);
    return {ξ_{t|N}}_{t=1}^{N};
Algorithm 12: The E-Step for LDS; it can also be referred to as the inference step
4.2.3 E-Step

The inference process obtained by combining the forward and the backward passes computes the statistics {ξ_{t|N}}_{t=1,...,N} of the hidden variables x_{1:N} given o_{1:N}. We call it the E-Step since it computes the statistics that are sufficient for writing down the joint distribution of x_{1:N}. The resulting E-Step computation is shown in Algorithm 12.
4.2.4 Practical Need for Regularizing ξ_{t|t} and ξ_{t|N}

The covariances of the joint distribution of x_t and x_{t+1} in the ξ output by the forward and backward passes are prone to singularity issues. We apply a simple regularization each time we compute these quantities by thresholding the singular values of the covariance. Assuming we want the singular values to be above ε, we do the following thresholding operation:

    [ Σ_{t+1|.}, Σ_{t+1,t|.} ; Σ_{t+1,t|.}^T, Σ_{t|.} ]  =  U Σ_s U^T   (SVD),
    diag(Σ_s) ← max( diag(Σ_s), ε ),
    [ Σ_{t+1|.}, Σ_{t+1,t|.} ; Σ_{t+1,t|.}^T, Σ_{t|.} ]  ←  U Σ_s U^T.
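The thresholding operation above can be sketched as follows. For a symmetric covariance, the SVD in the text coincides with a symmetric eigendecomposition, which is what this illustrative helper uses; the default threshold value is an assumption.

```python
import numpy as np

def regularize_cov(S, eps=1e-6):
    """Floor the spectrum of a symmetric covariance matrix at eps, as in the
    thresholding step of Section 4.2.4 (a sketch; eps is illustrative)."""
    w, U = np.linalg.eigh(S)          # S = U diag(w) U^T for symmetric S
    w = np.maximum(w, eps)            # raise small or negative eigenvalues to eps
    return U @ np.diag(w) @ U.T
```

A covariance whose spectrum is already above the threshold passes through unchanged.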
4.3 Learning the Parameters of an LDS via EM
Given M sequences of observations {o^j_{1:N_j}}_{j=1}^{M}, the maximum likelihood estimation problem attempts to find

    Θ̂_ML = argmax_Θ Σ_{j=1}^{M} log ∫_{x_{1:N_j}} p_Θ(o^j_{1:N_j}, x_{1:N_j}) dx_{1:N_j}.        (4.7)
The problem above is intractable in general, although system identification techniques like N4SID [39], while computationally very expensive, are proven to be asymptotically optimal for large N. The problem can, however, be converted into an EM optimization where, instead of computing the marginal likelihood of the observations o^j_{1:N_j}, we compute an expectation of the complete log-likelihood of both x_{1:N_j} and o^j_{1:N_j} under a current set of parameters Θ̃. The EM update for Θ is then given as
    Θ̂ = argmax_Θ Σ_{j=1}^{M} E_Θ̃ [ L̃_Θ(x_{1:N_j}, o^j_{1:N_j}) ],   where   L̃_Θ(x_{1:N_j}, o^j_{1:N_j}) = log p_Θ(x_{1:N_j}, o^j_{1:N_j}).        (4.8)
Since L̃_Θ(x_{1:N_j}, o^j_{1:N_j}) is a random variable, we compute its conditional expectation under a current set of parameters Θ̃, and this becomes the E-Step in EM. This expectation is conditioned on the observation sequence o^j_{1:N_j}. The resulting expected log-likelihood may be written as

    E_Θ̃ [ L̃_Θ(x_{1:N_j}, o^j_{1:N_j}) ] = ∫_{x_{1:N_j}} L̃_Θ(x_{1:N_j}, o^j_{1:N_j}) p_Θ̃(x_{1:N_j} | o^j_{1:N_j}) dx_{1:N_j}.        (4.9)

To simplify this further, we first make use of the following chain rule:

    L̃_Θ(x_{1:N_j}, o^j_{1:N_j}) = L̃_Θ(x_{1:N_j}) + L̃_Θ(o^j_{1:N_j} | x_{1:N_j}).        (4.10)
L̃_Θ(x_{1:N_j}) and L̃_Θ(o^j_{1:N_j} | x_{1:N_j}) can in turn be written as

    L̃_Θ(x_{1:N_j})                = Σ_{t=1}^{N_j} log N(x_t − F x_{t−1}; µ_u, Σ_u),
    L̃_Θ(o^j_{1:N_j} | x_{1:N_j}) = Σ_{t=1}^{N_j} log N(o^j_t − H x_t; µ_z, Σ_z).        (4.11)

While computing the expectation of the two quantities above, we get terms of the form E_Θ̃[x_t | o^j_{1:N_j}], E_Θ̃[x_t x_t^T | o^j_{1:N_j}] and E_Θ̃[x_t x_{t−1}^T | o^j_{1:N_j}]. Fortunately, these are available from the parameters of
the posterior distribution that we compute during the inference step in Algorithm 12 as⁵

    E_Θ̃ [ x_t | o^j_{1:N_j} ]           = µ^j_{t|N_j},
    E_Θ̃ [ x_t x_t^T | o^j_{1:N_j} ]     = Σ^j_{t|N_j} + µ^j_{t|N_j} µ^j_{t|N_j}{}^T  =:  S^j_{t|N_j},
    E_Θ̃ [ x_t x_{t−1}^T | o^j_{1:N_j} ] = Σ^j_{t,t−1|N_j} + µ^j_{t|N_j} µ^j_{t−1|N_j}{}^T  =:  S^j_{t,t−1|N_j}.        (4.12)
Using the chain rule (4.10), the individual likelihood expressions in (4.11) and the expressions for the expectations in (4.12), we can simplify the expected log-likelihood in (4.9) as⁶

    E_Θ̃ [ L̃_Θ(x_{1:N_j}, o^j_{1:N_j}) ] = E_Θ̃ [ L̃_Θ(x_{1:N_j}) ] + E_Θ̃ [ L̃_Θ(o^j_{1:N_j} | x_{1:N_j}) ]
        = − (N_j / 2) log |Σ_u|
          − (1/2) Σ_{t=1}^{N_j} Trace[ Σ_u^{−1} ( S^j_{t|N_j} − F S^j_{t,t−1|N_j}{}^T − S^j_{t,t−1|N_j} F^T + F S^j_{t−1|N_j} F^T ) ]
          − (N_j / 2) log |Σ_z|
          − (1/2) Σ_{t=1}^{N_j} Trace[ Σ_z^{−1} ( ō^j_t ō^j_t{}^T + H S^j_{t|N_j} H^T − ō^j_t µ^j_{t|N_j}{}^T H^T − H µ^j_{t|N_j} ō^j_t{}^T ) ].
In order to optimize for F and H using EM, we plug the expression above into the EM objective in (4.8) and use standard matrix calculus [32]. The estimate Θ̂ can be obtained in closed form and is presented in Algorithm 13.

⁵Here ξ^j_{t|N_j} denotes the posterior parameters output by Algorithm 12 for the j-th observation sequence.
⁶We have used µ_u = 0, as in any classical LDS setting, although it is possible to generalize by allowing µ_u ∈ R^d. For notational convenience, we use ō_t = o_t − µ_z.
Algorithm: LDS-MStep([o^j_{1:N_j}]_j, [ξ^j_{t|N_j}]_{t,j})
    N ← Σ_{j=1}^{M} N_j;
    S_0 ← Σ_{j=1}^{M} Σ_{t=1}^{N_j} ( Σ^j_{t|N_j} + µ^j_{t|N_j} µ^j_{t|N_j}{}^T );
    S_{−1} ← Σ_{j=1}^{M} Σ_{t=1}^{N_j} ( Σ^j_{t−1|N_j} + µ^j_{t−1|N_j} µ^j_{t−1|N_j}{}^T );
    S_{0,−1} ← Σ_{j=1}^{M} Σ_{t=1}^{N_j} ( Σ^j_{t,t−1|N_j} + µ^j_{t|N_j} µ^j_{t−1|N_j}{}^T );
    f_0 ← Σ_{j=1}^{M} Σ_{t=1}^{N_j} µ^j_{t|N_j};
    w_o ← Σ_{j=1}^{M} Σ_{t=1}^{N_j} o^j_t;
    V_o ← Σ_{j=1}^{M} Σ_{t=1}^{N_j} o^j_t o^j_t{}^T;
    V_ox ← Σ_{j=1}^{M} Σ_{t=1}^{N_j} o^j_t µ^j_{t|N_j}{}^T;
    F̂ ← S_{0,−1} (S_{−1})^{−1};
    Σ̂_u ← (1/N) ( S_0 + F̂ S_{−1} F̂^T − F̂ S_{0,−1}^T − S_{0,−1} F̂^T );
    Ĥ ← ( V_ox − (1/N) w_o f_0^T ) ( S_0 − (1/N) f_0 f_0^T )^{−1};
    µ̂_z ← (1/N) ( w_o − Ĥ f_0 );
    Σ̂_z ← (1/N) ( V_o + Ĥ S_0 Ĥ^T − Ĥ V_ox^T − V_ox Ĥ^T ) − µ̂_z µ̂_z^T;
    F̂ ← {F̂, 0, Σ̂_u};
    Ĥ ← {Ĥ, µ̂_z, Σ̂_z};
    Θ̂ ← {F̂, Ĥ};
    return Θ̂;
Algorithm 13: The M-Step
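The accumulator-and-update structure of Algorithm 13 can be sketched for a single sequence as below. This is an illustrative simplification, not the dissertation's code: it sums transition statistics over t = 1, ..., N−1, uses the algebraically simplified Σ̂_u = (S_0 − F̂ S_{0,−1}^T)/(N−1) that follows from the F̂ update, and omits the Σ̂_z update.

```python
import numpy as np

def lds_mstep(mu, Sigma, Sigma_cross, obs):
    """Closed-form M-step sketch for one sequence: mu[t], Sigma[t] are smoothed
    posterior means/covariances and Sigma_cross[t] = Cov(x_t, x_{t-1} | o_{1:N})."""
    N = len(obs)
    # second-moment accumulators over transitions (current, previous, cross)
    S0   = sum(Sigma[t]       + np.outer(mu[t], mu[t])     for t in range(1, N))
    Sm1  = sum(Sigma[t - 1]   + np.outer(mu[t - 1], mu[t - 1]) for t in range(1, N))
    S0m1 = sum(Sigma_cross[t] + np.outer(mu[t], mu[t - 1]) for t in range(1, N))
    F = S0m1 @ np.linalg.inv(Sm1)
    Sigma_u = (S0 - F @ S0m1.T) / (N - 1)    # simplified using the F update
    # observation parameters from cross statistics between o_t and x_t
    f0  = sum(mu)
    wo  = sum(obs)
    SA  = sum(Sigma[t] + np.outer(mu[t], mu[t]) for t in range(N))
    Vox = sum(np.outer(obs[t], mu[t]) for t in range(N))
    H = (Vox - np.outer(wo, f0) / N) @ np.linalg.inv(SA - np.outer(f0, f0) / N)
    mu_z = (wo - H @ f0) / N
    return F, Sigma_u, H, mu_z
```

With noiseless posteriors and o_t = 2 x_t, for example, the update recovers H = 2 and µ_z = 0.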
4.3.1 Efficient Inference and Learning for Diagonal Σ_z

In Section 3.1.1, we pointed out that the general case of PPCA is computationally expensive. For diagonal Σ_z, we were able to provide an O(Dd²) inference algorithm. The forward update in Algorithm 8 involves the same set of computations and, if implemented naively, would be very expensive for high dimensional signals. Using the techniques shown in Section 3.1.1, we develop an efficient forward algorithm for an LDS with diagonal observation noise.

The substitute for Algorithm 8 is shown in Algorithm 14. Due to the diagonal assumption on Σ_z = diag(σ²_{1:D}), its update in the M-Step of Algorithm 13 is slightly modified. Firstly, instead of computing V_o as a full D × D matrix, we only need its diagonal elements v_o, collected as a sum of the elementwise squares of o^j_t:

    v_o = Σ_{j=1}^{M} Σ_{t=1}^{N_j} o^j_t ⊙ o^j_t.
All other updates remain the same, except for Σ_z, whose diagonal values (indexed by p) can be computed using the rows ĥ_p of Ĥ and V_ox(p,:) of V_ox as

    σ̂²_p = (1/N) ( v_o(p) + ĥ_p S_0 ĥ_p^T − 2 ĥ_p V_ox(p,:)^T ) − µ̂_z(p)².

4.4 Application of LDS to Dynamic Textures
We applied our maximum likelihood estimation of LDS parameters to learning dynamic textures. Dynamic textures can be thought of as video sequences that evolve smoothly in time. Examples include videos of trees in a breeze, fire and water. There is an extensive collection of dynamic textures available in the MIT temporal texture database [40]. Learnt dynamic textures can be used for correcting noisy videos or even for synthesizing a large amount of data from limited real
Algorithm: Update(ξ_{t+1|t}, H, o_{t+1})
    µ̂_{o|t} ← H µ_{t+1|t} + µ_z;
    U ← H Σ_{t+1|t};
    m̃_{0,:} ← [ h_1/σ²_1 ; ... ; h_D/σ²_D ]   (the D × d matrix Σ_z^{−1} H, with columns m̃_{0,k});
    ṽ_0 ← [ (o(1) − µ̂_{o|t}(1))/σ²_1, ..., (o(D) − µ̂_{o|t}(D))/σ²_D ]^T;
    L̃_0 ← Σ_{p=1}^{D} log σ²_p;
    for j ← 1 to d do
        for k ← 1 to d do
            m̃_{j,k} ← m̃_{j−1,k} − m̃_{j−1,j} (u_j^T m̃_{j−1,k}) / (1 + u_j^T m̃_{j−1,j});
        end
        ṽ_j ← ṽ_{j−1} − m̃_{j−1,j} (u_j^T ṽ_{j−1}) / (1 + u_j^T m̃_{j−1,j});
        L̃_j ← L̃_{j−1} + log |1 + u_j^T m̃_{j−1,j}|;
    end
    Σ_{OOH̄} ← [ m̃_{d,1}, ..., m̃_{d,d} ];
    L ← −(1/2) L̃_d − (1/2) (o − µ̂_{o|t})^T ṽ_d;
    µ_{t+1|t+1} ← µ_{t+1|t} + Σ_{t+1|t} H^T ṽ_d;
    µ_{t|t+1} ← µ_{t|t} + Σ_{t|t} F^T H^T ṽ_d;
    Σ_{t+1|t+1} ← Σ_{t+1|t} − Σ_{t+1|t} H^T Σ_{OOH̄} Σ_{t+1|t};
    Σ_{t+1,t|t+1} ← F Σ_{t|t} − Σ_{t+1|t} H^T Σ_{OOH̄} F Σ_{t|t} + µ_{t+1|t+1} µ_{t|t+1}^T;
    Σ_{t|t+1} ← Σ_{t|t} − Σ_{t|t} F^T H^T Σ_{OOH̄} F Σ_{t|t} + µ_{t|t+1} µ_{t|t+1}^T;
    return {ξ_{t+1|t+1}, L};
Algorithm 14: An efficient Kalman update step specialized for diagonal Σ_z. This uses the Sherman-Morrison formula to compute the parameters of the conditional distribution using dynamic programming, without recourse to explicitly computing (H Σ_{t+1|t} H^T + Σ_z)^{−1}.
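The Sherman-Morrison recursion at the heart of Algorithm 14 can be sketched as an O(Dd²) routine that applies (Σ_z + H Σ_{t+1|t} H^T)^{−1} to a vector by folding in d rank-one terms one at a time; the function name and interface are illustrative, not the dissertation's code.

```python
import numpy as np

def apply_inverse_diag_plus_lowrank(sigma2, H, Sigma, v):
    """Apply (diag(sigma2) + H @ Sigma @ H.T)^{-1} to v without forming the D x D
    inverse. Since H Sigma H^T = sum_j h_j u_j^T with h_j = H[:, j] and
    u_j = (H Sigma)[:, j], the d rank-one terms can be folded in one at a time
    with the Sherman-Morrison formula, as in Algorithm 14."""
    D, d = H.shape
    U = H @ Sigma                     # D x d; its columns are the u_j
    M = H / sigma2[:, None]           # column k holds (current inverse) @ H[:, k]
    x = v / sigma2                    # (current inverse) @ v
    for j in range(d):
        u = U[:, j]
        denom = 1.0 + u @ M[:, j]
        x = x - M[:, j] * (u @ x) / denom           # update inverse applied to v
        M = M - np.outer(M[:, j], u @ M) / denom    # update inverse applied to H
    return x
```

Each of the d outer steps costs O(Dd), giving the stated O(Dd²) total instead of the O(D³) of a direct inversion.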
examples. Formally speaking, a dynamic texture is essentially a linear dynamical system whose output observation is the video sequence of interest. Maximum likelihood learning of dynamic textures directly from video has not been thoroughly addressed due to the computational bottleneck involved. Recent work by Chan and Vasconcelos, such as [48–51], uses heavily downsampled versions of the video sequences to keep the EM algorithm tractable. Doretto [42] provides efficient algorithms for learning dynamic textures using techniques from model identification. We implemented the algorithm described in the last section, specialized for diagonal Σ_z. Using the computationally efficient techniques described in Algorithm 14, the computational complexity of our algorithm is O(Dd²N) per iteration, which is very tractable when d ≪ D. Thus, unlike Chan's technique, we can work with video sequences without downsampling them. However, the EM iterations must be repeated several times until convergence is achieved. For initialization, we use Doretto's technique of computing the SVD of the observations o_{1:N} and projecting each o_t to a d-dimensional space to get an initial estimate of x_t.
4.4.1 Synthesis

We use various videos available at [40] to learn different dynamical models. Formally, a sequence of images is a tensor of size H × W × N, where H and W are the dimensions of each image. For the observation o_t, we vectorize the images into an HW-dimensional vector and apply our LDS learning schemes. Once we learn the parameters F, H, µ_u, Σ_u, µ_z, Σ_z, we may employ a sampling scheme to generate an image sequence (of length N) as follows:

• Choose x_0 = 0.
• For 1 ≤ t ≤ N, x_t = F x_{t−1} + u_t, where u_t ∼ N(µ_u, Σ_u).
• The observation o_t is then sampled as o_t = H x_t + z_t, where z_t ∼ N(µ_z, Σ_z).
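The three steps above amount to ancestral sampling and can be sketched as follows; the function name and argument layout are illustrative.

```python
import numpy as np

def sample_lds(F, H, mu_u, Sigma_u, mu_z, Sigma_z, N, rng=None):
    """Generate a length-N observation sequence from a learned LDS by ancestral
    sampling, starting from x_0 = 0 as in the steps listed above."""
    rng = np.random.default_rng(rng)
    x = np.zeros(F.shape[0])
    obs = []
    for _ in range(N):
        x = F @ x + rng.multivariate_normal(mu_u, Sigma_u)            # state eq.
        obs.append(H @ x + rng.multivariate_normal(mu_z, Sigma_z))    # observation eq.
    return np.array(obs)
```

For video synthesis each returned row is an HW-dimensional vector that is reshaped back into an H × W frame.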
A sample generated image sequence is shown in Figure 4.3.

Figure 4.3: Illustration of a moving-plastic sequence generated from our EM-learned model. These are the frames generated at t = 1000, 1020, 1040, 1060.
4.4.2 Prediction Error

An LDS essentially predicts the next sample o_{t+1} given o_{1:t}. As a metric of goodness, we compare the average per-pixel squared prediction error of our models with Doretto's system identification approach on various image sequences. Formally, we tabulate the RMS error

    sqrt( (1/N) Σ_{t=1}^{N} || o_t − µ̂_{o|t} ||²_2 )

for various videos. Both Doretto's and our models are estimated from the same video. For learning the parameters of the LDS, we use Doretto's approach to initialize H and F, and then run 5 EM iterations on top of it. Table 4.1 shows our algorithm's prediction error on various image sequences.
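The tabulated metric can be sketched as follows, assuming the one-step predictions µ̂_{o|t} have already been collected; any further per-pixel normalization is a presentation choice.

```python
import numpy as np

def rms_prediction_error(obs, pred):
    """RMS one-step prediction error sqrt((1/N) sum_t ||o_t - o_hat_t||^2),
    with obs and pred given as N x D arrays of frames and their predictions."""
    obs, pred = np.asarray(obs), np.asarray(pred)
    return np.sqrt(np.mean(np.sum((obs - pred) ** 2, axis=1)))
```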
Video          | Doretto's approach | EM (1 iter) | EM (2 iters) | EM (3 iters) | EM (4 iters) | EM (5 iters)
water spiral   | 1.056E-5           | 1.023E-5    | 9.6E-6       | 9.5E-6       | 9.2E-6       | 9.0E-6
Boiling water  | 4.8E-5             | 4.7E-5      | 4.5E-5       | 4.4E-5       | 4.4E-5       | 4.4E-5
Plastic sheet  | 4.0E-5             | 2.4E-5      | 2.3E-5       | 2.2E-5       | 2.2E-5       | 2.1E-5

Table 4.1: Average squared prediction error in comparison with Doretto's technique
4.5 Chapter Summary

Linear dynamical systems can model observation sequences in which adjacent samples are highly correlated. In this chapter, we reviewed the well-studied Kalman filter and RTS smoothing algorithms for inference and learning. The LDS can be thought of either as a generalization of an HMM in which the discrete state is replaced by a continuous one, or as a generalization of PPCA in which the underlying hidden states x_t have dynamics associated with them. Inference in an LDS proceeds as in an HMM, with the forward pass replaced by Kalman filtering and the backward pass by RTS smoothing. Learning is then done via EM using the inferred distributions of the random variables.

These standard techniques for inference and learning are, however, still very computationally expensive when the dimension of the observation o_t is high. If the observation noise model is assumed to be diagonal, inference can be done efficiently by combining dynamic programming and the Sherman-Morrison lemma, as we did for the PPCA case. We applied our algorithm to learn dynamic texture videos such as fire, waterfalls and fountains, where dim(o_t) ≈ 200000. We use our learnt models for generating video sequences and predicting the next video frame, and show that they are better than models obtained from Doretto's conventional PCA-based system identification technique for estimating the parameters.
However, a single linear dynamical system is only good for modeling signals with time-invariant dynamics. Our ultimate goal is to model the gestures in the RMIS surgery data. Each gesture has a certain kind of dynamics associated with it, and the gestures switch from time to time. Hence we need a dynamical system for each gesture, and this can be achieved only with a switching linear dynamical system. We will extend the LDS inference and learning algorithms developed in this chapter to the switching LDS case in the next chapter.
Chapter 5
Switching Dynamical Models
In the last chapter, we introduced the notion of linear dynamical systems and described efficient inference and learning algorithms for them using EM. A linear dynamical system generalizes hidden Markov models via a continuous state x_t instead of the discrete state s_t used in an HMM. A single dynamical model may not, however, generalize to sequences whose dynamics themselves change with time. Real-life examples of such data include speech, where the dynamics of the acoustic waveform change depending on the sound being produced, or videos where the dynamics of the scene undergo a phase transition. Other examples include human motions such as walking, running and dancing, each of which has a different kind of dynamics associated with it.

A switching (or hybrid) linear dynamical system is a model capable of describing a physical process governed by state equations that switch from time to time [23, 52]. In other words, a switching LDS has both a discrete and a continuous hidden state, as shown in Figure 5.1.

Learning and inference in switching linear dynamical systems have been studied for several decades. These remain open problems, since exact inference is provably intractable. One of the older
Figure 5.1: Graphical model representation of a switching Linear Dynamical System
popular methods is Monte-Carlo sampling based on particle filtering [53], where the discrete and continuous states are represented by sets of particles that are iteratively adjusted based on the current configuration of the other states. The collection of particles is used to mimic the actual probability distribution. This is iterated until convergence, and the resulting distribution of the particles yields the inferred random variables. When a larger number of particles is used, the obtained distribution is provably closer to the true posterior. Other earlier techniques include the truncated maximum likelihood approach [54, 55], which approximates the exact exponential-time inference with a tractable polynomial-time algorithm.
More recently, methods have been proposed to perform an alternating maximization that iteratively estimates the discrete and the continuous hidden states [56, 57]. Here, the most likely discrete state sequence ŝ_{1:N} is first inferred using an approximate Viterbi algorithm, and the distribution of the continuous state is then computed using a Kalman filter and RTS smoothing given ŝ_{1:N}. These methods have also been used to learn S-LDS models for speech recognition [58].

Recently, Barber [26, 59] introduced a slightly improved way of inferring the random variables s_{1:N} and x_{1:N} using expectation propagation (which is based on decomposing the joint distribution
of a random variable approximately as the product of its marginals) and expectation correction, which uses a Taylor series expansion to approximate the expectation of a function of a random variable.
In this chapter, we will extend Barber's approximate inference technique to learning the S-LDS parameters. We derive the parameters of the joint distributions under these approximations via the standard forward-backward procedure. These algorithms are extensions of those for the non-switching LDS. Finally, we will extend all of these algorithms to S-LDS models that have null states.

One main contribution of this chapter is to extend the expectation correction algorithm for approximate inference by Barber [26] to learning the parameters of an S-LDS. We derive the EM learning updates for the switching LDS model as a direct generalization of the FA-HMM updates we derived in Section 3.2.2.
In the final part of the chapter, we also extend the notion of null states from HMMs to the S-LDS. We observe that introducing null states significantly reduces the computational complexity of learning and inference, making the complexity of S-LDS learning comparable to that of standard HMMs. This also enables us to apply an S-LDS model in any setting in which using an HMM is computationally feasible.
5.1 Switching Linear Dynamical Systems

The switching linear dynamical system (S-LDS) model is a generalization of the standard LDS wherein the system is allowed to switch between discrete states, each associated with its own set of dynamical parameters. At each time t, one discrete state is active and is called the state of the system, denoted by s_t. In the standard switching LDS, we have C + 2 distinct states: a unique start state, a unique end state, and C states each associated with a set of LDS parameters. The transition from one state to another is governed by a Markov chain; such models are also called jump-Markov linear systems. The prior over state sequences is

    p(s_{1:N}) = Π_{t=0}^{N−1} p(s_{t+1} | s_t) = Π_{t=0}^{N−1} q_{s_t, s_{t+1}}.

Here q_{s',s} > 0 for all (s', s) ∈ E, where E is the set of edges in the graph G of the Markov chain. The state space equations of the S-LDS are as follows:

    x_{t+1} = F_{s_{t+1}} x_t + u_{t+1},        (5.1)
    o_t = H_{s_t} x_t + z_t.        (5.2)

Additionally, we have u_t ∼ N(µ^{s_t}_u, Σ^{s_t}_u) and z_t ∼ N(µ^{s_t}_z, Σ^{s_t}_z). Thus the switching dynamical system has two sets of parameters, F^s = {F_s, µ^s_u, Σ^s_u} and H^s = {H_s, µ^s_z, Σ^s_z}, for each state s ∈ {1, ..., C}.
5.2 Inference in Switching LDS

As usual, the general inference problem is to estimate the distribution of the hidden variables given the observation sequence o_{1:N}. The hidden variables in this setting are the discrete states s_{1:N} and the continuous states x_{1:N}. Since s_t and x_t follow a first-order Markov chain, the joint distribution of all the hidden variables given the observations can be computed using the chain rule as

    p(s_{1:N}, x_{1:N} | o_{1:N}) = Π_{t=0}^{N−1} p(s_{t+1} | o_{1:N}, s_t, x_t) · Π_{t=0}^{N−1} p(x_{t+1} | o_{1:N}, s_t, x_t).        (5.3)
As we did for the HMM, we will consider the more general inference problem where partial knowledge of s_{1:N} may be provided a priori.¹ One extreme is when the discrete state sequence s_{1:N} is known exactly and only the continuous state sequence x_{1:N} needs to be inferred. The other extreme is that no information about s_{1:N} is provided. We will provide algorithms that work uniformly across these settings. The set of possible state sequences is restricted using a grammar G that can accommodate manual labels. The grammar G consists of a graph G that allows only certain state transitions. G also contains a boolean function b^s_t, depending on time t and state s, that can accommodate any time-marked manual labels. For instance, when complete state supervision is provided (i.e., s_{1:N} is known),

    b^s_t = 1 if s = s_t, and 0 otherwise.

The other alternative is when incomplete state supervision is provided. For instance, it may be known that a particular gesture g_t took place at time t, and that gesture g_t is made up of multiple hidden discrete states given by the set S(g_t). In other words, if gesture g_t was performed at time t, the discrete state of the system at time t must be an element of a smaller subset of the entire set of discrete states; that is, s_t ∈ S(g_t). In such settings,

    b^s_t = 1 if s ∈ S(g_t), and 0 otherwise.

The final extreme is not to have any manual supervision for the discrete states. In this case, b^s_t = 1 for all (t, s).
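The three supervision regimes for b^s_t can be sketched as a simple boolean mask; the argument names and the dictionary mapping gestures to their state sets S(g) are illustrative, not the dissertation's notation.

```python
import numpy as np

def build_state_mask(N, C, gesture_labels=None, state_sets=None):
    """Build b[t, s] for the grammar G: all-ones when no supervision is given,
    otherwise restricted to the states S(g_t) of the labeled gesture at time t."""
    b = np.ones((N, C), dtype=bool)
    if gesture_labels is not None:
        b[:] = False
        for t, g in enumerate(gesture_labels):
            b[t, list(state_sets[g])] = True   # s_t must lie in S(g_t)
    return b
```

Complete state supervision is the special case where every S(g_t) is a singleton.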
In all settings, we will assume that o_{1:N} always comes with a grammar G that regulates the amount of supervision available for inference.²

¹Recall that inference is a key part of learning, so allowing partial knowledge of s_{1:N} enables one to learn the parameters when a coarse manual label (e.g., the gesture sequence in a robot-assisted surgery) may be provided.
From the graphical model structure in Figure 5.1, there is no direct link between x_{t−1} and s_t. Since we are eventually interested in learning the parameters, we will only consider the pairs of random variables connected by each edge of the S-LDS. Formally, we only need the following joint distributions:

1. p(s_{t+1}, s_t | o_{1:N}),
2. p(x_{t+1}, x_t, s_{t+1} | o_{1:N}) = p(s_{t+1} | o_{1:N}) p(x_{t+1}, x_t | s_{t+1}, o_{1:N}).

Computing these quantities exactly is known to be intractable in a switching linear dynamical system. To see why the second quantity cannot be computed exactly in polynomial time, we first note that

    p(x_{t+1}, x_t | s_{t+1}, o_{1:N}) = Σ_{s_{1:t}, s_{t+2:N}} p(x_{t+1}, x_t | s_{1:N}, o_{1:N}) p(s_{1:t}, s_{t+2:N} | s_{t+1}, o_{1:N}).        (5.4)

In the summation above, the term p(x_{t+1}, x_t | s_{1:N}, o_{1:N}) can be computed exactly, as we did for the standard non-switching LDS (since s_{1:N} is no longer random), using the forward Kalman and backward RTS recursions. Thus p(x_{t+1}, x_t | s_{1:N}, o_{1:N}) is a single jointly Gaussian conditional density for any fixed configuration of s_{1:t} and s_{t+2:N}. However, in (5.4) we marginalize over all configurations of s_{1:t}, s_{t+2:N}, implying that p(x_{t+1}, x_t | s_{t+1}, o_{1:N}) is a mixture of C^{N−1} Gaussians. This leads to an exponential blowup for exact inference. Several algorithms have been proposed to perform approximate inference.
²It is an important observation that even if G provides complete discrete state supervision, the continuous state still needs to be inferred.

5.2.1 Forward pass

The forward pass in the S-LDS computes the following forward joint distributions:

1. p(x_{t+1}, x_t | s_{t+1}, o_{1:t+1}),
2. p(s_{t+1}, s_t | o_{1:t+1}). As a byproduct, we also compute

    α̂^s_{t+1} = Σ_{s'} p(s_{t+1} = s, s_t = s' | o_{1:t+1}).
Using the law of total probability, one may write

    p(x_{t+1}, x_t | o_{1:t+1}, s_{t+1}) = Σ_{s_t : (s_t, s_{t+1}) ∈ E} p(x_{t+1}, x_t | o_{1:t+1}, s_{t+1}, s_t) p(s_t | s_{t+1}, o_{1:t+1}).        (5.5)

In order to compute p(x_{t+1}, x_t | o_{1:t+1}, s_{t+1}, s_t), we may similarly invoke Bayes' rule:

    p(x_{t+1}, x_t | o_{1:t+1}, s_{t+1}, s_t) = p(x_{t+1}, x_t, o_{t+1} | o_{1:t}, s_{t+1}, s_t) / p(o_{t+1} | o_{1:t}, s_{t+1}, s_t)
                                              ∝ p(x_t | o_{1:t}, s_t) p(x_{t+1} | x_t, s_{t+1}) p(o_{t+1} | x_{t+1}, s_{t+1}).
The distribution p(x_{t+1} | x_t, s_{t+1}) is available from the state equation (5.1). Similarly, the distribution p(o_{t+1} | x_{t+1}, s_{t+1}) can be written from the observation equation (5.2). Thus, if p(x_t | o_{1:t}, s_t) were a single Gaussian with mean µ^{s_t}_{t|t} and covariance Σ^{s_t}_{t|t}, then p(x_{t+1} | o_{1:t}, s_t, s_{t+1}) would also be a Gaussian, with mean µ^{s_t,s_{t+1}}_{t+1|t} = F_{s_{t+1}} µ^{s_t}_{t|t} + µ^{s_{t+1}}_u and covariance Σ^{s_t,s_{t+1}}_{t+1|t} = F_{s_{t+1}} Σ^{s_t}_{t|t} F_{s_{t+1}}^T + Σ^{s_{t+1}}_u. Likewise, p(o_{t+1} | x_{t+1}, s_{t+1}) would also be Gaussian. In fact, writing F = F_{s_{t+1}} and H = H_{s_{t+1}} for brevity, one can write the joint distribution of x_t, x_{t+1} and o_{t+1}, as we did in (4.5), as

    N( [ µ^{s_t}_{t|t} ; µ^{s_t,s_{t+1}}_{t+1|t} ; H µ^{s_t,s_{t+1}}_{t+1|t} + µ^{s_{t+1}}_z ],
       [ Σ^{s_t}_{t|t}         Σ^{s_t}_{t|t} F^T             Σ^{s_t}_{t|t} F^T H^T                        ;
         F Σ^{s_t}_{t|t}       Σ^{s_t,s_{t+1}}_{t+1|t}       Σ^{s_t,s_{t+1}}_{t+1|t} H^T                  ;
         H F Σ^{s_t}_{t|t}     H Σ^{s_t,s_{t+1}}_{t+1|t}     H Σ^{s_t,s_{t+1}}_{t+1|t} H^T + Σ^{s_{t+1}}_z ] ).        (5.6)
Invoking the properties of jointly Gaussian random variables and conditioning on the observed o_{t+1}, we can get the Gaussian parameters of the joint distribution of x_{t+1} and x_t given the current and the previous states (s_{t+1} and s_t).

Alternatively, one can think of obtaining this joint distribution using a prediction and an update step, as we did for the Kalman filter. From the distribution p(x_t | s_t, o_{1:t}), a prediction is made using the dynamical parameters F^{s_{t+1}} to obtain the distribution p(x_t, x_{t+1} | s_t, s_{t+1}, o_{1:t}). Then, using the observation equation

    o_{t+1} = H_{s_{t+1}} x_{t+1} + z_{t+1},

we may compute the joint distribution p(x_t, x_{t+1}, o_{t+1} | s_t, s_{t+1}, o_{1:t}). Once this is computed, we may condition on the observed o_{t+1} using property 1 described in Chapter 3 to obtain the distribution p(x_t, x_{t+1} | s_t, s_{t+1}, o_{1:t+1}).

In order to compute p(s_t | s_{t+1}, o_{1:t+1}) in equation (5.5), we first compute the joint distribution of the previous and current discrete states, p(s_t, s_{t+1} | o_{1:t+1}), as below:

    p(s_t, s_{t+1} | o_{1:t+1}) = p(s_t, s_{t+1}, o_{t+1} | o_{1:t}) / p(o_{t+1} | o_{1:t})
                                = p(s_t | o_{1:t}) p(s_{t+1} | s_t) p(o_{t+1} | o_{1:t}, s_t, s_{t+1}) / p(o_{t+1} | o_{1:t})
                                ∝ α̂^{s_t}_t · q_{s_t, s_{t+1}} · p(o_{t+1} | o_{1:t}, s_t, s_{t+1}),

where α̂^{s_t}_t = p(s_t | o_{1:t}) and q_{s_t,s_{t+1}} = p(s_{t+1} | s_t). The second equality follows because s_{t+1} is conditionally independent of o_{1:t} given s_t. Thus we only need to compute the product α̂^{s_t}_t q_{s_t,s_{t+1}} p(o_{t+1} | o_{1:t}, s_t, s_{t+1}) and normalize over all possible pairs (s_t, s_{t+1}) ∈ E(G) to obtain p(s_t, s_{t+1} | o_{1:t+1}). α̂^{s_t}_t is already available from the previous time-step t, and p(o_{t+1} | o_{1:t}, s_t, s_{t+1}) is a Gaussian whose parameters are available via (5.6). Similarly,
p(s_t | s_{t+1}, o_{1:t+1}) can be obtained by normalizing p(s_t, s_{t+1} | o_{1:t+1}) over all possible s_t for a given s_{t+1}.
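The normalize-over-pairs step described here is typically carried out in log space for numerical stability; a sketch, with illustrative array shapes:

```python
import numpy as np

def discrete_pair_posterior(log_alpha, log_q, log_lik):
    """Combine log p(s_t | o_{1:t}) (shape (C,)), log q (shape (C, C)) and the
    Gaussian predictive log-likelihoods log p(o_{t+1} | o_{1:t}, s_t, s_{t+1})
    (shape (C, C)), then normalize over all pairs to get p(s_t, s_{t+1} | o_{1:t+1})."""
    log_joint = log_alpha[:, None] + log_q + log_lik
    log_joint = log_joint - np.logaddexp.reduce(log_joint.ravel())  # normalize over pairs
    return np.exp(log_joint)
```

Summing the result over rows or columns then gives the conditionals and marginals used in the recursion.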
Now the quantity α̂^s_{t+1} = p(s_{t+1} = s | o_{1:t+1}) can be computed by marginalizing p(s_t = s', s_{t+1} | o_{1:t+1}) as follows:

    α̂^s_{t+1} = Σ_{s'} p(s_{t+1} = s, s_t = s' | o_{1:t+1}).

Then, plugging the obtained p(s_t | s_{t+1}, o_{1:t+1}) into (5.5), we find that p(x_{t+1}, x_t | o_{1:t+1}, s_{t+1}) is a mixture of C Gaussians. So the assumption we made about p(x_t | o_{1:t}, s_t) being a single Gaussian is not quite right; it is in fact a mixture of C^{t−1} Gaussians. The correct way to do this recursion is to track the statistics of every mixture component in each state, which is unfortunately computationally very complex.
To avoid exponential blowups, there are techniques like the Gaussian Sum Approximation (GSA) [60], which maintain a constant number of mixture components by combining several components into fewer ones. The simplest form of the GSA algorithm only collects the moments at each stage (i.e., it collapses all the Gaussians into a single one). Although this looks like a very strong relaxation, it has been shown to approximate the posterior probability with high precision [61]. If one were to use this form of GSA, we simply need to collect the Gaussian statistics (the mean, covariance and log-likelihood available from Algorithm 14) and merge them into a single Gaussian.

The GaussMerge function in Algorithm 15 shows the implementation of the Gaussian merging. This function takes in the log mixture weights of the individual distributions and outputs a merged distribution along with the total log-weight.

The resulting forward pass using the GSA for a switching LDS is shown in Algorithm 16. In the algorithm, α^s_t denotes the joint distribution p(s_t = s, o_{1:t}) instead of the conditional p(s_t = s | o_{1:t}).
This does not make any difference, since the GaussMerge function can work with unnormalized log-weights. Furthermore, the α^s_t here is consistent with the forward probability defined in the HMM framework.
Algorithm: GaussMerge([ξ^j_{t|.}, L^j]_{j=1}^{M})
    L̃ ← −∞;
    for j ← 1 to M do
        L̃ ← logadd(L̃, L^j);
    end
    Initialize ξ̃_{t|.} ← 0;
    for j ← 1 to M do
        w_j ← exp(L^j − L̃);
        µ̃_{t|.} ← µ̃_{t|.} + w_j µ^j_{t|.};
        µ̃_{t−1|.} ← µ̃_{t−1|.} + w_j µ^j_{t−1|.};
        S̃_{t|.} ← S̃_{t|.} + w_j ( Σ^j_{t|.} + µ^j_{t|.} µ^j_{t|.}{}^T );
        S̃_{t−1|.} ← S̃_{t−1|.} + w_j ( Σ^j_{t−1|.} + µ^j_{t−1|.} µ^j_{t−1|.}{}^T );
        S̃_{t,t−1|.} ← S̃_{t,t−1|.} + w_j ( Σ^j_{t,t−1|.} + µ^j_{t|.} µ^j_{t−1|.}{}^T );
    end
    Σ̃_{t|.} ← S̃_{t|.} − µ̃_{t|.} µ̃_{t|.}^T;
    Σ̃_{t−1|.} ← S̃_{t−1|.} − µ̃_{t−1|.} µ̃_{t−1|.}^T;
    Σ̃_{t,t−1|.} ← S̃_{t,t−1|.} − µ̃_{t|.} µ̃_{t−1|.}^T;
    return {ξ̃_{t|.}, L̃};
Algorithm 15: The Gaussian sum approximation (GSA) for merging all the statistics into a single Gaussian
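The moment-matching collapse in Algorithm 15 can be sketched as below; it normalizes the log-weights with logaddexp and matches the first two moments of the mixture, returning the total log-weight as in the text. (Only the time-t statistics are merged here; the t−1 and cross terms follow identically.)

```python
import numpy as np

def gauss_merge(means, covs, log_weights):
    """Collapse a Gaussian mixture into a single moment-matched Gaussian,
    working with (possibly unnormalized) log mixture weights."""
    log_weights = np.asarray(log_weights, dtype=float)
    L = np.logaddexp.reduce(log_weights)          # total log-weight
    w = np.exp(log_weights - L)                   # normalized mixture weights
    mu = sum(wj * m for wj, m in zip(w, means))
    # merged second moment, then subtract the outer product of the merged mean
    S = sum(wj * (C + np.outer(m, m)) for wj, m, C in zip(w, means, covs))
    return mu, S - np.outer(mu, mu), L
```

Note that the merged covariance includes the between-component spread of the means, not just the average of the component covariances.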
Backward Pass

The backward pass completes the inference stage by computing the following distributions:

1. p(x_t, x_{t−1} | s_t, o_{1:N}): the joint distribution of the current and previous continuous hidden states given the current discrete state s_t and the observation sequence o_{1:N}.
Algorithm: SLDSForward(ξ^{START}_{0|0}, [F^s, H^s], G, [q_e]_{e∈E(G)}, o_{1:N})
  α^s_t ← −∞, ∀(s, t);
  α^{START}_0 ← 0;
  for t ← 0 to N do
    for (s′, s) ∈ E(G) and b^s_t == true do
      if s′ ≠ START and s ≠ END and t < N then
        ξ^{s′,s}_{t+1|t} ← Predict(ξ^{s′}_{t|t}, F^s);
        {ξ^{s′,s}_{t+1|t+1}, L} ← Update(ξ^{s′,s}_{t+1|t}, H^s, o_{t+1});
        L ← L + log q_{s′,s} + α^{s′}_t;
        {ξ^s_{t+1|t+1}, α^s_{t+1}} ← GaussMerge({ξ^s_{t+1|t+1}, α^s_{t+1}}, {ξ^{s′,s}_{t+1|t+1}, L});
      else
        L ← log q_{s′,s} + α^{s′}_t;
        α^s_{t+1} ← logadd(α^s_{t+1}, L);
      end
    end
  end
  return [ξ^s_{t|t}, α^s_t];
Algorithm 16: The forward pass algorithm for the switching LDS
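The Predict and Update subroutines invoked above are the standard Kalman filter steps (Algorithm 8 in this dissertation). A hedged sketch, assuming zero-mean process and observation noise with covariances Q and R (the non-zero noise means μ_u, μ_z used elsewhere in the chapter would simply shift the predicted means):

```python
import numpy as np

def predict(mu, P, F, Q):
    """Time update for x_{t+1} = F x_t + u, u ~ N(0, Q)."""
    return F @ mu, F @ P @ F.T + Q

def update(mu, P, H, R, o):
    """Measurement update for o = H x + z, z ~ N(0, R).
    Also returns the log-likelihood of o, the quantity L in Algorithm 16."""
    S = H @ P @ H.T + R                      # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)           # Kalman gain
    innov = o - H @ mu
    mu_new = mu + K @ innov
    P_new = (np.eye(len(mu)) - K @ H) @ P
    d = len(o)
    loglik = -0.5 * (d * np.log(2 * np.pi) + np.log(np.linalg.det(S))
                     + innov @ np.linalg.inv(S) @ innov)
    return mu_new, P_new, loglik
```

With a scalar state, prior N(0, 1), unit observation noise and observation o = 2, the posterior is N(1, 0.5), as the equal prior and observation precisions split the evidence evenly.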
2. p(s_t, s_{t−1} | o_{1:N}): the joint distribution of the current and previous discrete hidden states given the entire observation sequence o_{1:N}.
We first compute the joint distribution of xt , xt−1 and st . This can be written using the law of total
probability by marginalizing over all possible configurations of the future discrete state st+1 and the
future continuous state as follows:
p(x_t, x_{t−1}, s_t | o_{1:N}) = Σ_{s_{t+1}} p(s_{t+1} | o_{1:N}) ∫_{x_{t+1}} p(x_t, x_{t−1}, s_t | o_{1:t}, s_{t+1}, x_{t+1}) p(x_{t+1} | o_{1:N}, s_{t+1}) dx_{t+1}.

The integral to be evaluated is

I = ∫_{x_{t+1}} p(s_t | o_{1:t}, s_{t+1}, x_{t+1}) p(x_t, x_{t−1} | o_{1:t}, s_{t+1}, x_{t+1}, s_t) p(x_{t+1} | o_{1:N}, s_{t+1}) dx_{t+1}.

This integral is essentially the average of a product under the posterior distribution of x^{s_{t+1}}_{t+1|N}. That is,

I = E_{x^{s_{t+1}}_{t+1|N}} [ p(s_t | o_{1:t}, s_{t+1}, x_{t+1}) p(x_t, x_{t−1} | o_{1:t}, s_{t+1}, x_{t+1}, s_t) ].   (5.7)
Next note that

p(s_t | o_{1:t}, s_{t+1}, x_{t+1}) = p(s_t, s_{t+1}, x_{t+1} | o_{1:t}) / Σ_{s′_t} p(s′_t, s_{t+1}, x_{t+1} | o_{1:t})
 = p(s_t | o_{1:t}) q_{s_t,s_{t+1}} p(x_{t+1} | s_{t+1}, s_t, o_{1:t}) / Σ_{s′_t} p(s′_t | o_{1:t}) q_{s′_t,s_{t+1}} p(x_{t+1} | s_{t+1}, s′_t, o_{1:t}).
The denominator in the above expression is a mixture of Gaussians whose parameters depend on x_{t+1}. Since there is no way to evaluate the integral while keeping s_t dependent on x_{t+1}, we need to look for ways to remove the dependency. Expectation propagation [61–63] suggests that we can approximate the expectation of a product by the product of the expectations. In other words,
I ≈ E_{x^{s_{t+1}}_{t+1|N}} [ p(s_t | o_{1:t}, s_{t+1}, x_{t+1}) ] · E_{x^{s_{t+1}}_{t+1|N}} [ p(x_t, x_{t−1} | o_{1:t}, s_{t+1}, x_{t+1}) ].   (5.8)
The second expectation can be easily evaluated, assuming p(x_{t+1} | o_{1:N}, s_{t+1} = s) is a single Gaussian, using the RTS smoothing routine (refer to Algorithm 10). The nasty dependence on x_{t+1} is, however, still present in the first expectation. One way out is to assume that p(s_t | o_{1:t}, s_{t+1}, x_{t+1}) ≈ p(s_t | o_{1:t}, s_{t+1}), which can be interpreted as a filtered posterior (since it depends only on the history). A more powerful approximation was used by Barber and Mesot [26], [59] and termed expectation correction. The main idea behind the expectation correction algorithm is to use the Taylor series expansion about the mean of a random variable for the expected value of a function of that random variable, and to ignore the higher order terms:
E_x[f(x)] = E_x[f(μ_x + (x − μ_x))]
          = f(μ_x) + f′(μ_x) E_x[x − μ_x] + ...
          ≈ f(μ_x),

since E_x[x − μ_x] = 0.
The function f(x) we consider in our case is the joint distribution of s_t, s_{t+1} given x_{t+1} and o_{1:t}. This may be written as

p(s_t, s_{t+1} | o_{1:t}, x_{t+1}) = p(s_t, s_{t+1}, x_{t+1} | o_{1:t}) / p(x_{t+1} | o_{1:t})
 = p(s_t | o_{1:t}) q_{s_t,s_{t+1}} p(x_{t+1} | o_{1:t}, s_t, s_{t+1}) / Z(x_{t+1}).
Thus, if we were to evaluate the expected value of the joint probability above at a given value of x_{t+1}, we may write

E_{x^{s_{t+1}}_{t+1|N}} [ p(s_t, s_{t+1} | o_{1:t}, x_{t+1}) ] ≈ (1/Z) p(s_t | o_{1:t}) q_{s_t,s_{t+1}} N(x_{t+1}; μ^{s_{t+1}}_{t+1|t}, Σ^{s_{t+1}}_{t+1|t}) |_{x_{t+1} = μ^{s_{t+1}}_{t+1|N}}.
The denominator Z is a normalizer that is independent of st and st+1 . Thus, one could simply evaluate the numerator for all pairs st and st+1 and later normalize to make sure the probabilities sum to
one. The resulting joint distribution is the expectation-corrected distribution over s_t and s_{t+1}. Using this, one may compute the required expected conditional probability E_{x^{s_{t+1}}_{t+1|N}} [ p(s_t | s_{t+1}, o_{1:t}, x_{t+1}) ] by conditioning on s_{t+1} via Bayes' rule.
An alternate way to derive this is to directly consider the conditional distribution over st given st+1
and o1:N and approximate it as follows
p(st |st+1 , o1:N ) ≈ p(st |E (xt+1 |st+1 , o1:N ) , st+1 , o1:t ).
Once the conditional distribution is computed, the final corrected joint distribution p(st , st+1 |o1:N )
may be given as
p(st , st+1 |o1:N ) = p(st |st+1 , o1:N )p(st+1 |o1:N ).
Here p(st+1 |o1:N ) is available from the previous step of the backward recursion.
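Numerically, the expectation-corrected pair posterior reduces to evaluating the log-numerators for every pair (s_t, s_{t+1}), normalizing, and then conditioning on s_{t+1}. A small sketch, where log_num is a hypothetical precomputed table log_num[i, j] = log p(s_t = i | o_{1:t}) + log q_{i,j} + log N(μ^j_{t+1|N}; μ^{i,j}_{t+1|t}, Σ^{i,j}_{t+1|t}):

```python
import numpy as np

def correct_pair_posterior(log_num):
    """Normalize the expectation-corrected numerators over all (s_t, s_{t+1})
    pairs, then condition on s_{t+1} via Bayes' rule.
    Returns (joint, cond) where joint[i, j] ~ p(s_t=i, s_{t+1}=j) under the
    EC approximation and cond[i, j] = p(s_t=i | s_{t+1}=j)."""
    m = log_num.max()                         # subtract the max for stability
    joint = np.exp(log_num - m)
    joint /= joint.sum()                      # probabilities sum to one
    cond = joint / joint.sum(axis=0, keepdims=True)
    return joint, cond
```

The conditional cond is the piece multiplied by p(s_{t+1} | o_{1:N}) from the backward recursion to obtain the corrected joint p(s_t, s_{t+1} | o_{1:N}).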
The backward pass that implements the above expectation correction for a switching LDS is shown in Algorithm 17. This algorithm returns the parameter

ξ^s_{t|N} = ( [ μ^s_{t|N} ; μ^{s+}_{t−1|N} ], [ Σ^s_{t|N}, Σ^s_{t,t−1|N} ; Σ^{sT}_{t,t−1|N}, Σ^{s+}_{t−1|N} ] ).

The mean μ^{s+}_{t−1|N} denotes E_Θ̃[x_{t−1} | s_t = s, o_{1:N}, G] and should not be confused with E_Θ̃[x_{t−1} | s_{t−1} = s, o_{1:N}, G]. Similarly, Σ^{s+}_{t−1|N} denotes the covariance of x_{t−1} given s_t = s. In other words,

Σ^{s+}_{t−1|N} = Covar_Θ̃[x_{t−1} | s_t = s, o_{1:N}, G].
E-Step
To complete the inference procedure on an observation sequence o_{1:N}, we call the forward and backward subroutines. The backward subroutine returns the statistics sufficient to write down
Algorithm: SLDSBackward([ξ^s_{t|t}, α^s_t], [F^s], G, [q_e]_{e∈E(G)})
  γ^{END}_{N+1} ← α^{END}_{N+1};
  for t ← N to 0 do
    δ^{s′} ← −∞, ∀s′ ∈ V(G);
    for (s, s′) ∈ E(G) do
      if s = START or s′ = END then
        γ^{s,s′}_t ← α^s_t + log q_{s,s′};
      else
        {μ_{t+1|t}, Σ_{t+1|t}, ·} ← Predict({μ^s_{t|t}, Σ^s_{t|t}}, F^{s′});
        γ^{s,s′}_t ← α^s_t + log q_{s,s′} + log N(μ^{s′}_{t+1|N}; μ_{t+1|t}, Σ_{t+1|t});
      end
      δ^{s′} ← logadd(δ^{s′}, γ^{s,s′}_t);
    end
    ∀(s, s′) ∈ E(G), set γ^{s,s′}_t ← γ^{s,s′}_t − δ^{s′} + γ^{s′}_{t+1};
    ∀s ∈ V(G), set γ^s_t ← −∞;
    for (s, s′) ∈ E(G) do
      if s = START then
        γ^s_t ← logadd(γ^s_t, γ^{s,s′}_t);
      else
        ξ^{s,s′}_{t|N} ← ξ^s_{t|t}                            if s′ = END
        ξ^{s,s′}_{t|N} ← RTS(ξ^s_{t|t}, ξ^{s′}_{t+1|N}, F^{s′})   if s′ ≠ END
        {ξ^s_{t|N}, γ^s_t} ← GaussMerge({ξ^s_{t|N}, γ^s_t}, {ξ^{s,s′}_{t|N}, γ^{s,s′}_t});
      end
    end
  end
  return [ξ^s_{t|N}, γ^s_t], [γ^{s,s′}_t];
Algorithm 17: Backward pass routine for SLDS
the joint distribution p(x_{1:N}, s_{1:N} | o_{1:N}). The inference step for a realization o_{1:N} from a switching LDS with initial state distribution ξ^s_{0|0}³ (analogous to Algorithm 12) is shown in Algorithm 18.
Algorithm: SLDS-EStep(ξ^s_{0|0}, F^s, H^s, G, [q_e], o_{1:N})
  {[ξ^s_{t|t}, α^s_t], L_T} ← SLDSForward(ξ^{START}_{0|0}, [F^s, H^s], G, [q_e]_{e∈E(G)}, o_{1:N});
  {[ξ^s_t, γ^s_t], [γ^{s,s′}_t]} ← SLDSBackward([ξ^s_{t|t}, α^s_t], [F^s], G, [q_e]_{e∈E(G)});
  for t ← 0 to N do
    L_V ← −∞;
    L_E ← −∞;
    for s ∈ V(G) do
      L_V ← logadd(L_V, γ^s_t);
    end
    for (s, s′) ∈ E(G) do
      L_E ← logadd(L_E, γ^{s,s′}_t);
    end
    for s ∈ V(G) do
      γ̂^s_t ← exp(γ^s_t − L_V);
    end
    for (s, s′) ∈ E(G) do
      γ̂^{s,s′}_t ← exp(γ^{s,s′}_t − L_E);
    end
  end
  return [ξ^s_t, γ̂^s_t], [γ̂^{s,s′}_t];
Algorithm 18: The E-Step for S-LDS
5.3 Learning the Parameters of an S-LDS
As in any generative probabilistic model, we attempt to perform maximum likelihood learning for the S-LDS parameters. Given M sequences of observations [o^j_{1:N_j}], each associated with a grammar G_j that generated them⁴, the maximum likelihood solution Θ̂_ML is given by

Θ̂_ML = argmax_Θ Σ_{j=1}^{M} log Σ_{s_{1:N_j} ∈ G_j} ∫_{x_{1:N_j}} p_Θ(s_{1:N_j}, x_{1:N_j}, o^j_{1:N_j}) dx_{1:N_j}.   (5.9)

³ We set the continuous state (x_0) at t = 0 deterministically to zero, and the discrete state (s_0) is a unique START state.
As for the HMM, FA-HMM and LDS, instead of directly optimizing (5.9), we iteratively estimate Θ via expectation maximization. The objective function for the EM update is the sum of the expected values of the complete joint log-likelihood of {s_{1:N_j}, x_{1:N_j}, o_{1:N_j}} for each 1 ≤ j ≤ M. Formally, the EM solution Θ̂ for one iteration is given as

Θ̂ = argmax_Θ Σ_{j=1}^{M} E^j_Θ̃ [ log p_Θ(s_{1:N_j}, x_{1:N_j}, o_{1:N_j}) ],   (5.10)

where the bracketed term is denoted L̃_Θ(s_{1:N_j}, x_{1:N_j}, o^j_{1:N_j}).
The expectation of L̃_Θ(s_{1:N_j}, x_{1:N_j}, o^j_{1:N_j}) under the current set of parameters Θ̃, conditioned on the observation sequence o^j_{1:N_j} and the given grammar G_j, becomes the E-Step in EM. This expected log-likelihood may be written as

E_Θ̃ [ L̃_Θ(s_{1:N_j}, x_{1:N_j}, o^j_{1:N_j}) | o^j_{1:N_j}, G_j ]
 = Σ_{s_{1:N_j} ∈ G_j} ∫_{x_{1:N_j}} L̃_Θ(s_{1:N_j}, x_{1:N_j}, o^j_{1:N_j}) p_Θ̃(s_{1:N_j}, x_{1:N_j} | o^j_{1:N_j}, G_j) dx_{1:N_j}.   (5.11)
As we did for the LDS, we invoke the chain rule to rewrite the likelihood as

L̃_Θ(s_{1:N_j}, x_{1:N_j}, o^j_{1:N_j}) = L̃_Θ(s_{1:N_j}) + L̃_Θ(x_{1:N_j} | s_{1:N_j}) + L̃_Θ(o^j_{1:N_j} | x_{1:N_j}, s_{1:N_j}).   (5.12)
Furthermore, the individual terms in (5.12) may be further simplified as

L̃_Θ(s_{1:N_j}) = Σ_{t=0}^{N_j} log q_{s_t,s_{t+1}},
L̃_Θ(x_{1:N_j} | s_{1:N_j}) = Σ_{t=1}^{N_j} log [ N(x_t − F^{s_t} x_{t−1}; μ^{s_t}_u, Σ^{s_t}_u) ],
L̃_Θ(o^j_{1:N_j} | s_{1:N_j}, x_{1:N_j}) = Σ_{t=1}^{N_j} log [ N(o^j_t − H^{s_t} x_t; μ^{s_t}_z, Σ^{s_t}_z) ].   (5.13)

⁴ Recall that the grammar G_j contains the graph of possible state transitions and a boolean function b^s_t which is true only if s can be active at time t, and false otherwise.
For the first expectation, we simply need to sum over all possible state pairs (s_t, s_{t+1}), weighted by their a posteriori probability given the observations:

E_Θ̃ [ log q_{s_t,s_{t+1}} | o_{1:N_j}, G_j ] = Σ_{s,s′} p_Θ̃(s_t = s, s_{t+1} = s′ | o^j_{1:N_j}, G_j) log q_{s,s′},   (5.14)

where the pairwise posterior is denoted γ̂^{s,s′}_{t,j}.
For the second expectation in (5.13), we may use the property of iterated expectation⁵:

E_Θ̃ [ log N(x_t − F^{s_t} x_{t−1}; μ^{s_t}_u, Σ^{s_t}_u) | o_{1:N_j}, G_j ]
 = Σ_s γ̂^s_{t,j} E_Θ̃ [ log N(x_t − F^s x_{t−1}; μ^s_u, Σ^s_u) | s_t = s, o_{1:N_j}, G_j ],   (5.15)

where γ̂^s_{t,j} is the posterior probability that s_t = s given o_{1:N_j} under the current set of S-LDS parameters Θ̃. Similarly,

E_Θ̃ [ log N(o_t − H^{s_t} x_t; μ^{s_t}_z, Σ^{s_t}_z) | o_{1:N_j}, G_j ]
 = Σ_s γ̂^s_{t,j} E_Θ̃ [ log N(o_t − H^s x_t; μ^s_z, Σ^s_z) | s_t = s, o_{1:N_j}, G_j ].   (5.16)

The expressions above may be simplified as we did for the LDS. We make use of the following quantities, computable via RTS smoothing in Algorithm 18 for each j:

⁵ For any pair of random variables X and Z, E[Z] = E[E[Z|X]].
E_Θ̃[x_t | s_t = s, o_{1:N_j}, G_j] = μ^{s,j}_{t|N_j},
E_Θ̃[x_{t−1} | s_t = s, o_{1:N_j}, G_j] = μ^{s+,j}_{t−1|N_j},
E_Θ̃[x_t x_t^T | s_t = s, o_{1:N_j}, G_j] = Σ^{s,j}_{t|N_j} + μ^{s,j}_{t|N_j} μ^{s,jT}_{t|N_j} ≜ S^{s,j}_{t|N_j},
E_Θ̃[x_{t−1} x_{t−1}^T | s_t = s, o_{1:N_j}, G_j] = Σ^{s,j+}_{t−1|N_j} + μ^{s,j+}_{t−1|N_j} μ^{s,j+T}_{t−1|N_j} ≜ S^{s,j+}_{t−1|N_j},
E_Θ̃[x_t x_{t−1}^T | s_t = s, o_{1:N_j}, G_j] = Σ^{s,j}_{t,t−1|N_j} + μ^{s,j}_{t|N_j} μ^{s,j+T}_{t−1|N_j} ≜ S^{s,j}_{t,t−1|N_j}.   (5.17)
For notational simplicity, we will use

ō^{s,j}_t = o^j_t − μ^s_z,
S̄^{s,j}_{t|N_j} = E_Θ̃ [ (x_t − μ^{s_t}_u)(x_t − μ^{s_t}_u)^T | s_t = s, o_{1:N_j}, G_j ],
S̄^{s,j}_{t,t−1|N_j} = E_Θ̃ [ (x_t − μ^{s_t}_u) x_{t−1}^T | s_t = s, o_{1:N_j}, G_j ].
Combining the expectations of the individual expressions in (5.14)–(5.17) with (5.12), we can simplify (5.11) as

E_Θ̃ [ L̃_Θ(s_{1:N_j}, x_{1:N_j}, o^j_{1:N_j}) | o_{1:N_j}, G_j ]
 = E_Θ̃ [ L̃_Θ(s_{1:N_j}) | o_{1:N_j}, G_j ] + E_Θ̃ [ L̃_Θ(x_{1:N_j} | s_{1:N_j}) | o_{1:N_j}, G_j ] + E_Θ̃ [ L̃_Θ(o^j_{1:N_j} | x_{1:N_j}, s_{1:N_j}) | o_{1:N_j}, G_j ]
 = Σ_{t,s,s′} γ̂^{s,s′}_{t,j} log q_{s,s′}
   − (1/2) Σ_{t,s} γ̂^s_{t,j} log |Σ^s_u|
   − (1/2) Σ_{t,s} γ̂^s_{t,j} Trace [ Σ^{s −1}_u ( S̄^{s,j}_{t|N_j} + F^s S̄^{s,j+}_{t−1|N_j} F^{sT} − S̄^{s,j}_{t,t−1|N_j} F^{sT} − F^s S̄^{s,jT}_{t,t−1|N_j} ) ]
   − (1/2) Σ_{t,s} γ̂^s_{t,j} log |Σ^s_z|
   − (1/2) Σ_{t,s} γ̂^s_{t,j} Trace [ Σ^{s −1}_z ( ō^{s,j}_t ō^{s,jT}_t − ō^{s,j}_t μ^{s,jT}_{t|N_j} H^{sT} − H^s μ^{s,j}_{t|N_j} ō^{s,jT}_t + H^s S^{s,j}_{t|N_j} H^{sT} ) ].   (5.18)
The objective function in (5.10) is obtained by adding (5.18) for all 1 ≤ j ≤ M. This results in an accumulation of statistics, as in Algorithm 19. Thus, (5.10) can be solved using standard matrix calculus. When μ_u = 0 is enforced, as in the classical LDS, the estimates for F^s become

F̂^s = S^s_{0,−1} (S^s_{−1})^{−1},
μ̂^s_u = 0,
Σ̂^s_u = (1/N_s) ( S^s_0 + F̂^s S^s_{−1} F̂^{sT} − F̂^s S^{sT}_{0,−1} − S^s_{0,−1} F̂^{sT} ).   (5.19)

When we allow μ_u ∈ R^d, the estimates become

F̂^s = ( S^s_{0,−1} − (1/N_s) f^s_0 f^{sT}_{−1} ) ( S^s_{−1} − (1/N_s) f^s_{−1} f^{sT}_{−1} )^{−1},
μ̂^s_u = (1/N_s) ( f^s_0 − F̂^s f^s_{−1} ),
Σ̂^s_u = (1/N_s) ( S^s_0 + F̂^s S^s_{−1} F̂^{sT} − F̂^s S^{sT}_{0,−1} − S^s_{0,−1} F̂^{sT} − N_s μ̂^s_u μ̂^{sT}_u ).   (5.20)
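To make (5.20) concrete, here is a sketch of the closed-form estimates for a single state with unit posteriors γ̂ = 1 (i.e., the plain LDS case). The weighted S-LDS version in Algorithm 20 only changes how the statistics f and S are accumulated; the helper name estimate_dynamics is illustrative, not from the dissertation:

```python
import numpy as np

def estimate_dynamics(x):
    """Closed-form M-step estimates (5.20) for one state, unit posteriors.
    x: (N+1, d) array holding the continuous states x_0 ... x_N.
    Returns (F_hat, mu_hat, Sigma_hat)."""
    x0, x1 = x[:-1], x[1:]          # (x_{t-1}, x_t) pairs
    Ns = len(x1)
    f0 = x1.sum(axis=0)             # f_0^s
    fm1 = x0.sum(axis=0)            # f_{-1}^s
    S0 = x1.T @ x1                  # S_0^s
    Sm1 = x0.T @ x0                 # S_{-1}^s
    S0m1 = x1.T @ x0                # S_{0,-1}^s
    A = S0m1 - np.outer(f0, fm1) / Ns
    B = Sm1 - np.outer(fm1, fm1) / Ns
    F = A @ np.linalg.inv(B)
    mu = (f0 - F @ fm1) / Ns
    Sigma = (S0 + F @ Sm1 @ F.T - F @ S0m1.T - S0m1 @ F.T) / Ns \
            - np.outer(mu, mu)
    return F, mu, Sigma
```

On noise-free data generated from a known (F, μ_u), these estimates recover the true parameters exactly and the residual covariance vanishes, which is a convenient sanity check.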
The parameters of the observation equation can be estimated by enforcing H^s = H, as suggested in (3.26), or as in (3.27). The M-step presented in Algorithm 20 assumes H^s = H and a non-zero mean μ^s_u.
Algorithm: AccumulateStats(Θ, [G_j, o^j_{1:N_j}]_{j=1}^{M})
  Initialize N_{s,s′}, N_s, f^s_0, f^s_{−1}, S^s_0, S^s_{−1}, S^s_{0,−1}, v^s_o, w^s_o, V^s_{ox} to zero;
  for j ← 1 to M do
    {[ξ^s_t, γ̂^s_t], [γ̂^{s,s′}_t]} ← SLDS-EStep(Θ, G_j, [q_e]_{e∈E(G)}, o_{1:N_j});
    for s ∈ {1, ..., C} do
      N_s ← N_s + Σ_{t=1}^{N} γ̂^s_t;
      for s′ : (s, s′) ∈ E do
        N_{s,s′} ← N_{s,s′} + Σ_{t=0}^{N} γ̂^{s,s′}_t;
      end
      f^s_0 ← f^s_0 + Σ_{t=1}^{N} γ̂^s_t μ^s_t;
      f^s_{−1} ← f^s_{−1} + Σ_{t=1}^{N} γ̂^s_t μ^{s+}_{t−1};
      S^s_0 ← S^s_0 + Σ_{t=1}^{N} γ̂^s_t (Σ^s_t + μ^s_t μ^{sT}_t);
      S^s_{−1} ← S^s_{−1} + Σ_{t=1}^{N} γ̂^s_t (Σ^{s+}_{t−1} + μ^{s+}_{t−1} μ^{s+T}_{t−1});
      S^s_{0,−1} ← S^s_{0,−1} + Σ_{t=1}^{N} γ̂^s_t (Σ^s_{t,t−1} + μ^s_t μ^{s+T}_{t−1});
      v^s_o ← v^s_o + Σ_{t=1}^{N} γ̂^s_t (o_t ⊙ o_t);
      w^s_o ← w^s_o + Σ_{t=1}^{N} γ̂^s_t o_t;
      V^s_{ox} ← V^s_{ox} + Σ_{t=1}^{N} γ̂^s_t o_t μ^T_t;
    end
  end
  return [N_{s,s′}, N_s, f^s_0, f^s_{−1}, S^s_0, S^s_{−1}, S^s_{0,−1}, v^s_o, w^s_o, V^s_{ox}];
Algorithm 19: Accumulates the statistics from a set of training observations
This completes the exposition of the iterative ML estimation procedure for S-LDS parameters from
a set of observations.
Algorithm: SLDS-EM(Θ, [G_j, o^j_{1:N_j}]_{j=1}^{M})
  [N_{s,s′}, N_s, f^s_0, S^s_0, S^s_{−1}, S^s_{0,−1}, v^s_o, w^s_o, V^s_{ox}] ← AccumulateStats(Θ, [G_j, o^j_{1:N_j}]_{j=1}^{M});
  for (s, s′) ∈ E do
    q_{s,s′} ← N_{s,s′} / Σ_{s″} N_{s,s″};
  end
  for s ∈ {1, ..., C} do
    F̂^s ← ( S^s_{0,−1} − (1/N_s) f^s_0 f^{sT}_{−1} ) ( S^s_{−1} − (1/N_s) f^s_{−1} f^{sT}_{−1} )^{−1};
    μ̂^s_u ← (1/N_s) ( f^s_0 − F̂^s f^s_{−1} );
    Σ̂^s_u ← (1/N_s) ( S^s_0 + F̂^s S^s_{−1} F̂^{sT} − F̂^s S^{sT}_{0,−1} − S^s_{0,−1} F̂^{sT} − N_s μ̂^s_u μ̂^{sT}_u );
  end
  N ← Σ_s N_s;
  f ← Σ_s f^s_0;
  S ← Σ_s S^s_0;
  v_o ← Σ_s v^s_o;
  w_o ← Σ_s w^s_o;
  V_{ox} ← Σ_s V^s_{ox};
  for p ← 1 to D do
    ĥ_p ← ( V_{ox}(p) − (1/N) w_o(p) f^T ) ( S − (1/N) f f^T )^{−1};
    μ̂_z(p) ← (1/N) ( w_o(p) − ĥ_p f );
    σ̂²_p ← ( v_o(p) + ĥ_p S ĥ^T_p − 2 ĥ_p V_{ox}(p)^T ) / N − μ̂_z(p)²;
  end
  return Θ̂;
Algorithm 20: EM for factor analyzed S-LDS
5.4 Introducing Null States in S-LDS
A null state, or non-emitting state, is often used in hidden Markov models to reduce the number of state transitions [7]. A null state is defined as a state that does not have any observation associated with it; it does not consume any time either. It is merely a computational construct. In this section, we extend the concept of null states to switching linear dynamical systems. The advantages of doing so parallel those in the HMM, as it reduces the number of edges in the state transition graph. As an example, Figure 5.2 shows an S-LDS with 3 emitting states that does not have null states. Each state can be followed by any other state. In this case, the number of edges is O(V²), where V is the number of emitting states. Figure 5.3 shows an equivalent S-LDS after introducing null states. Clearly, the number of edges in this case is O(V).
[Figure 5.2: Fully connected 3-state S-LDS (no null states)]
[Figure 5.3: Fully connected 3-state S-LDS (with null states)]
Since the forward and backward recursions in the S-LDS scale with the number of edges (refer to Algorithms 16 and 17), this is an essential simplification when dealing with a significant number of emitting states. There is, of course, some loss in generality when replacing Figure 5.2 with Figure 5.3. The former permitted 9 transition probability parameters, allowing, for instance, q_{13} ≠ q_{23}. The latter limits the transition probabilities to be of the form q_{s′,s} = q_{s″,s} = q_s.
Formally, we define a switching LDS including null states as an edge-weighted graph G with vertices V(G) = N ∪ ν. Here ν denotes the emitting states, while N denotes the null states. For s ∈ ν, we have a unique set of parameters {F^s, H^s} associated with it. For s ∈ N, we do not have any parameters associated with it. By this definition, we also include the unique START and END states in the set of null states N.
Given an observation sequence o_{1:N}, the inference problem in this S-LDS is to compute the joint distribution of any valid state sequence {s_{1:N′}, x_{1:N′}} with N′ ≥ N such that exactly N′ − N of the discrete states s_{1:N′} are null states. Since every non-null state has to be associated with an observation, for each t there is a unique g(t) such that s′_t = s_{g(t)} ∈ ν. The remaining states are null states. When the model transits into a null state, the continuous state vector x_t and the observation o_t remain the same.
Thus, the state equations of the modified S-LDS are now as follows:

x_j = F^{s_j} x_{j−1} + u_{g^{−1}(j)}   if s_j ∈ ν,
x_j = x_{j−1}                           if s_j ∈ N.   (5.21)
Since, for each time t, there is a unique non-null state s′_t associated with it, the observation equation still remains

o_t = H^{s′_t} x_{s′_t} + z_{s′_t},   (5.22)

where s′_t = s_{g(t)} as explained above, and z_{s′_t} is the usual observation noise.
5.4.1 Forward Pass including Null States
As usual, one needs to compute the conditional joint probability p(x_{t+1}, x_t | o_{1:t+1}, s_{t+1}) for every state s_{t+1}. This probability is proportional to p(x_{t+1}, x_t, o_{t+1}, s_{t+1} | o_{1:t}) up to a normalization constant. In this setting, it is convenient to derive the expressions for the latter. For future reference, we will denote this probability by p̃^{s_{t+1}}_{t+1}(x_{t+1}, x_t) for simplicity. When the null states are included, the estimation of this joint probability falls into two cases, i.e., s_{t+1} = s ∈ ν and s_{t+1} = s ∈ N.
When s ∈ ν, this can be done as before:

p̃^s_{t+1}(x_{t+1}, x_t) = Σ_{s_t} p(s_t, o_{1:t}) p(x_t | s_t, o_{1:t}) q_{s_t,s_{t+1}} p(x_{t+1} | x_t, s_{t+1} = s) p(o_{t+1} | x_{t+1}, s_{t+1} = s).   (5.23)

Here p(s_t, o_{1:t}) is available as α^{s_t}_t from the forward pass up to time t,⁶ and p(x_t | s_t, o_{1:t}) is similarly available. Since o_{t+1} is observed, one may normalize p̃^s_{t+1}(x_{t+1}, x_t) to obtain p(x_{t+1}, x_t | o_{1:t+1}, s_{t+1}). The forward probability α^{s_{t+1}}_{t+1} = p(s_{t+1}, o_{1:t+1}) may be computed as
p(s_{t+1}, o_{1:t+1}) = Σ_{s_t} ∫_{x_t,x_{t+1}} p(x_t, s_t, o_{1:t}) q_{s_t,s_{t+1}} p(x_{t+1} | x_t, s_{t+1}) p(o_{t+1} | x_{t+1}, s_{t+1}) dx_t dx_{t+1}
 = Σ_{s_t} q_{s_t,s_{t+1}} ∫_{x_t,x_{t+1}} p(x_t, s_t, o_{1:t}) p(x_{t+1} | x_t, s_{t+1}) p(o_{t+1} | x_{t+1}, s_{t+1}) dx_t dx_{t+1}.

The value of this integral is essentially the likelihood computed in the update step of the Kalman filter in Algorithm 8 (assuming the distributions are single Gaussians).
When s ∈ N, again using the law of total probability and the copying property (5.21) of the null states,

p̃^s_{t+1}(x_{t+1}, x_t) = Σ_{s′} α^{s′}_{t+1} q_{s′,s} p̃^{s′}_{t+1}(x_{t+1}, x_t).

Furthermore, α^s_{t+1} is computed as

α^s_{t+1} = Σ_{s′} α^{s′}_{t+1} q_{s′,s}.

As can be seen, there is no Kalman filtering here. Rather, the joint probabilities p̃^{s′}_{t+1} of all the states s′ that s is connected to at the current time are mixed together to obtain p̃^s_{t+1}. If one were to implement the Gaussian sum approximation, we simply invoke the GaussMerge function on all states s′ such that (s′, s) ∈ E(G), weighted by α^{s′}_{t+1} q_{s′,s}.

⁶ Although, in practice, we store the log of α^{s_t}_t to avoid underflows.
Due to the dependence of p̃^s_{t+1} for a null state on the p̃^{s′}_{t+1} of its antecedent states, the computation of the forward parameters at each time t must be done in a topologically sorted order on the vertices of the graph G. In other words, we should not process an edge e₁ = (u, v) before another edge e₂ = (w, u) when u ∈ N. In order to topologically sort an edge set E, we first partition E into E₁ and E₂ such that

E₁ = {(u, v) ∈ E : v ∈ ν},
E₂ = {(u, v) ∈ E : v ∈ N}.

The edge set E₂ is topologically sorted such that

∄ (e₁, e₂ ∈ E₂ : e₁ = (u, v), e₂ = (w, u), u ∈ N, e₁ < e₂)

in the sorted list Ẽ₂ of E₂⁷. This can be done easily using a breadth-first search (BFS) by first creating a list L = {v : (u, v) ∈ E₂}. At the k-th step of the BFS, we select all the vertices V_k = {v : (u, v) ∈ E₂, u ∉ L} and pop them from L. Then the set of edges E_k ← {(u, v) ∈ E₂ : v ∈ V_k} is appended to Ẽ₂. This is done as long as L is not empty. The TOPSORT procedure⁸ is shown in Algorithm 21. The resulting forward algorithm is shown in Algorithm 22.
⁷ Here e₁ < e₂ means that the edge e₁ occurs before the edge e₂ in the sorted list of E₂.
⁸ This requires another condition on the graph: there should be no cycles of null states. This is not a stringent requirement, since even a graph with null cycles can be transformed into a cycle-free graph [7].
Algorithm: TOPSORT(E)
  E₁ ← {(u, v) ∈ E : v ∈ ν};
  E₂ ← {(u, v) ∈ E : v ∈ N};
  L ← {v : (u, v) ∈ E₂};
  Ẽ₂ ← ∅;
  while L ≠ ∅ do
    V_k ← {v : (u, v) ∈ E₂, u ∉ L};
    E_k ← {(u, v) ∈ E₂ : v ∈ V_k};
    Ẽ₂ ← {Ẽ₂, E_k};
    L ← L − V_k;
  end
  return {E₁, Ẽ₂};
Algorithm 21: Topological sorting of E
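A sketch of TOPSORT in Python, using sets for the worklist L. The variable names mirror Algorithm 21; the argument null_states plays the role of N, and the assertion enforces the no-null-cycle condition of footnote 8:

```python
def topsort_edges(edges, null_states):
    """Order edges so that every edge leaving a null state u is processed
    after all edges (w, u) feeding u (Algorithm 21, TOPSORT)."""
    E1 = [(u, v) for (u, v) in edges if v not in null_states]
    E2 = [(u, v) for (u, v) in edges if v in null_states]
    L = {v for (_, v) in E2}        # null targets still waiting to be placed
    sorted_E2 = []
    while L:
        # vertices whose incoming edges no longer depend on unplaced nulls
        Vk = {v for (u, v) in E2 if v in L and u not in L}
        assert Vk, "null-state cycle detected"
        sorted_E2 += [(u, v) for (u, v) in E2 if v in Vk]
        L -= Vk
    # edges into emitting states depend only on the previous time step,
    # so they can all go first
    return E1 + sorted_E2
```

For a chain of null states a → b fed by emitting states 1 and 2, the edges into a are guaranteed to precede the edge (a, b).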
Algorithm: SLDSForward(ξ^{START}_{0|0}, [F^s, H^s], G, [q_e]_{e∈E(G)}, o_{1:N})
  α^s_t ← −∞, ∀(s, t);
  α^{START}_0 ← 0;
  for t ← 0 to N do
    for (s′, s) ∈ TOPSORT(E(G)) and b^s_t == true do
      if s ∈ ν and t > 0 then
        ξ^{s′,s}_{t|t−1} ← Predict(ξ^{s′}_{t−1|t−1}, F^s);
        {ξ^{s′,s}_{t|t}, L} ← Update(ξ^{s′,s}_{t|t−1}, H^s, o_t);
        L ← L + log q_{s′,s} + α^{s′}_{t−1};
        {ξ^s_{t|t}, α^s_t} ← GaussMerge({ξ^s_{t|t}, α^s_t}, {ξ^{s′,s}_{t|t}, L});
      end
      if s ∈ N then
        L ← log q_{s′,s} + α^{s′}_t;
        {ξ^s_{t|t}, α^s_t} ← GaussMerge({ξ^s_{t|t}, α^s_t}, {ξ^{s′}_{t|t}, L});
      end
    end
  end
  return [ξ^s_{t|t}, α^s_t];
Algorithm 22: The forward pass algorithm for the S-LDS including null states
5.4.2 Backward Pass including Null States
In the backward pass, for each s_t, the posterior distribution⁹ q̃^{s_t}_t(x_t) needs to be computed. This can be written using the law of total probability as

q̃^{s_t}_t(x_t) = Σ_{s′_t ∈ N} p(s′_t, o_{1:N}) E_{x′_t} [ p(s_t | s′_t, o_{1:N}, x′_t) p(x_t | s_t, s′_t, o_{1:N}, x′_t) ]
            + Σ_{s_{t+1} ∈ ν} p(s_{t+1}, o_{1:N}) E_{x_{t+1}} [ p(s_t | s_{t+1}, o_{1:N}, x_{t+1}) p(x_t | s_t, s_{t+1}, o_{1:N}, x_{t+1}) ].   (5.24)
If one employs expectation propagation for the first summation, we get

Σ_{s′_t ∈ N} p(s′_t, o_{1:N}) E_{x′_t} [ p(s_t | s′_t, o_{1:N}, x′_t) p(x_t | s_t, s′_t, o_{1:N}, x′_t) ]
 ≈ Σ_{s′_t ∈ N} p(s′_t, o_{1:N}) p(s_t | s′_t, o_{1:N}, μ^{s′_t}_{t|N}) E_{x′_t} [ p(x_t | s_t, s′_t, o_{1:N}, x′_t) ].

It can be seen that p(x_t | s_t, s′_t, o_{1:N}, x′_t) = δ(x_t − x′_t), since s′_t is a null state and simply copies x_t to x′_t. Thus,

Σ_{s′_t ∈ N} p(s′_t, o_{1:N}) E_{x′_t} [ p(s_t | s′_t, o_{1:N}, x′_t) p(x_t | s_t, s′_t, o_{1:N}, x′_t) ]
 = Σ_{s′_t ∈ N} p(s′_t, o_{1:N}) p(s_t | s′_t, o_{1:N}, μ^{s′_t}_{t|N}) p(x_t | s′_t, o_{1:N}).
Thus, (5.24) is approximated as

q̃^{s_t}_t(x_t) ≈ Σ_{s′_t ∈ N} p(s′_t, o_{1:N}) p(s_t | s′_t, o_{1:N}, μ^{s′_t}_{t|N}) p(x_t | s′_t, o_{1:N})
            + Σ_{s_{t+1} ∈ ν} p(s_{t+1}, o_{1:N}) p(s_t | s_{t+1}, o_{1:t}, μ^{s_{t+1}}_{t+1|N}) E_{x_{t+1}} [ p(x_t | s_t, s_{t+1}, o_{1:N}, x_{t+1}) ].
Contrary to the forward pass, the edges now have to be processed in reverse topological order. The reverse topological sort of an edge set E is shown in Algorithm 23. Algorithm 24 shows the resulting backward pass.
⁹ This is the joint distribution of x_t with state s at time t. We have dropped x_{t−1} to simplify notation just for this derivation. The joint distribution with x_{t−1} can be obtained similarly.
Algorithm: REVERSE-TOPSORT(E)
  E₁ ← {(u, v) ∈ E : v ∈ ν};
  E₂ ← {(u, v) ∈ E : v ∈ N};
  L ← {v : (u, v) ∈ E₂};
  Ẽ₂ ← ∅;
  while L ≠ ∅ do
    V_k ← {v : (u, v) ∈ E₂, u ∉ L};
    E_k ← {(u, v) ∈ E₂ : v ∈ V_k};
    Ẽ₂ ← {E_k, Ẽ₂};
    L ← L − V_k;
  end
  return {E₁, Ẽ₂};
Algorithm 23: Reverse topological sorting of E
5.5 Approximate Viterbi Inference for S-LDS
The Viterbi algorithm finds the most likely discrete state sequence, ŝ_{1:N}, that gave rise to a sequence of observations o_{1:N} in an HMM. Formally, the Viterbi algorithm solves

ŝ_{1:N} = argmax_{s_{1:N}} p(s_{1:N} | o_{1:N}).

We have already described Viterbi inference for the HMM in Algorithm 3. The corresponding inference problem in the switching LDS is hard because p(s_{1:N} | o_{1:N}) has to be computed by integrating over the hidden continuous states x_{1:N}. In other words,

p(s_{1:N} | o_{1:N}) = ∫_{x_{1:N}} p(s_{1:N}, x_{1:N} | o_{1:N}) dx_{1:N}.
Algorithm: SLDSBackward([ξ^s_{t|t}, α^s_t], [F^s], G, [q_e]_{e∈E(G)})
  ∀(s, t), γ^s_t ← −∞;
  γ^{END}_N ← α^{END}_N;
  for t ← N to 0 do
    ∀s′ ∈ V(G), δ^{s′} ← −∞;
    ∀(s, s′) ∈ E(G), γ^{s,s′}_t ← −∞;
    for (s, s′) ∈ REVERSE-TOPSORT(E(G)) do
      if s′ ∈ ν and t < N then
        {μ_{t+1|t}, Σ_{t+1|t}, ·} ← Predict({μ^s_{t|t}, Σ^s_{t|t}}, F^{s′});
        γ^{s,s′}_t ← α^s_t + log q_{s,s′} + log N(μ^{s′}_{t+1|N}; μ_{t+1|t}, Σ_{t+1|t});
      end
      if s′ ∈ N then
        γ^{s,s′}_t ← α^s_t + log q_{s,s′} + log N(μ^{s′}_{t|N}; μ^s_{t|t}, Σ^s_{t|t});
      end
      δ^{s′} ← logadd(δ^{s′}, γ^{s,s′}_t);
    end
    for (s, s′) ∈ REVERSE-TOPSORT(E(G)) do
      if s′ ∈ ν and t < N then
        γ^{s,s′}_t ← γ^{s,s′}_t − δ^{s′} + γ^{s′}_{t+1};
        ξ^{s,s′}_{t|N} ← RTS(ξ^s_{t|t}, ξ^{s′}_{t+1|N}, F^{s′});
        {ξ^s_{t|N}, γ^s_t} ← GaussMerge({ξ^s_{t|N}, γ^s_t}, {ξ^{s,s′}_{t|N}, γ^{s,s′}_t});
      end
      if s′ ∈ N then
        γ^{s,s′}_t ← γ^{s,s′}_t − δ^{s′} + γ^{s′}_t;
        ξ^{s,s′}_{t|N} ← ξ^{s′}_{t|N};
        {ξ^s_{t|N}, γ^s_t} ← GaussMerge({ξ^s_{t|N}, γ^s_t}, {ξ^{s,s′}_{t|N}, γ^{s,s′}_t});
      end
    end
  end
  return [ξ^s_{t|N}, γ^s_t], [γ^{s,s′}_t];
Algorithm 24: Backward pass routine for S-LDS including null states
If one were to replicate the dynamic programming technique used in the HMM, we see that

α̂^s_t = max_{s_{1:t−1}} p(s_{1:t−1}, s_t = s, o_{1:t})
     = max_{s′} max_{s_{1:t−2}} p(s_{1:t−2}, s_{t−1} = s′, s_t = s, o_{1:t})
     = max_{s′} q_{s′,s} max_{s_{1:t−2}} [ p(s_{1:t−2}, s_{t−1} = s′, o_{1:t−1}) p(o_t | s_t = s, s_{1:t−2}, s_{t−1} = s′, o_{1:t−1}) ].
This poses a problem, for p(o_t | s_t = s, s_{1:t−2}, s_{t−1} = s′, o_{1:t−1}) still depends on all of the previous states s_{1:t−2}. Unlike the HMM, this dependence cannot be removed, because the dependence on s_{1:t−2} comes through the continuous state sequence x_{1:t}. Pavlovic [56] gets away with this dependency by making the following approximation:

α̂^s_t = max_{s′} [ q_{s′,s} p(Ŝ^{s′}_{t−2}, s_{t−1} = s′, o_{1:t−1}) p(o_t | s_t = s, Ŝ^{s′}_{t−2}, s_{t−1} = s′, o_{1:t−1}) ].

Here Ŝ^{s′}_{t−2} denotes the best s_{1:t−2} according to the approximate Viterbi algorithm, conditioned on s_{t−1} = s′. Using this, the required α̂^s_t may be approximated accordingly as

α̂^s_t ≈ max_{s′} q_{s′,s} α̂^{s′}_{t−1} p(o_t | s_t = s, Ŝ^{s′}_{t−2}, s_{t−1} = s′, o_{1:t−1}).   (5.25)

In turn, p(o_t | s_t = s, Ŝ^{s′}_{t−2}, s_{t−1} = s′, o_{1:t−1}) can be computed by predicting o_t using the parameters ξ^{s′}_{t−1|t−1}. Once the optimal ŝ′ that solves (5.25) is computed, ξ^s_{t|t} is obtained using Kalman filtering, i.e., by first predicting ξ^s_{t|t−1} from ξ^{ŝ′}_{t−1|t−1} and updating the resulting distribution using o_t.
We can modify the above procedure for an S-LDS that includes null states. Algorithm 25 shows the approximate Viterbi algorithm for a switching LDS with null states. It is quite similar to the forward pass, except that the Gauss-merge is replaced by a selection of the most likely Gaussian density. The algorithm returns the state sequence s_{1:N}, which represents the most likely sequence of emitting states corresponding to the observations o_{1:N} (it can be suitably modified to also include the null states), using a back-tracking search similar to Algorithm 3. In fact, this can be further optimized by replacing the prediction and update steps in the loop by only a likelihood computation.¹⁰ The prediction and update may then be done using only the optimal ŝ′_t that solved (5.25). Thus the approximate Viterbi is somewhat faster than the forward algorithm. In fact, one could use the decoded s_{1:N} of the approximate Viterbi to update the parameters of the state x_{1:N} by performing RTS smoothing only on the decoded discrete states. Pavlovic [56] states that inference via approximate Viterbi and inference using GBP (described before) are not significantly different. However, we will use GBP for our implementations, as it more closely resembles exact EM. Moreover, GBP is more powerful when there are multiple state sequences s_{1:N}, all of which have almost identical likelihoods.
5.6 Chapter Summary
In the last chapter, we presented a generalization of the standard HMM by replacing the discrete state with a continuous one, and showed that doing so can model observation sequences in which adjacent samples are highly correlated. However, a single LDS can only model sequences that follow one particular kind of linear dynamics. When the dynamics of the observations themselves switch in time, an appropriate model is a switching LDS.

The switching LDS is obtained by adding a discrete hidden state to the standard LDS. These discrete states then govern the dynamics of the continuous state. While this appears to be a simple generalization of the LDS model, inference and learning in the S-LDS setting are intractable.
In this chapter, we set up the S-LDS learning and inference problem. One important contribution is

¹⁰ Recall that the likelihood computation requires computing the log-determinant.
Algorithm: SLDSApproximateViterbi(ξ^{START}_{0|0}, [F^s, H^s], G, [q_e]_{e∈E(G)}, o_{1:N})
  ∀(s, t), α̂^s_t ← −∞;
  α̂^{START}_0 ← 0;
  for t ← 0 to N do
    for (s′, s) ∈ TOPSORT(E(G)) and b^s_t == true do
      if s ∈ ν and t > 0 then
        ξ^{s′,s}_{t|t−1} ← Predict(ξ^{s′}_{t−1|t−1}, F^s);
        {ξ^{s′,s}_{t|t}, L} ← Update(ξ^{s′,s}_{t|t−1}, H^s, o_t);
        L ← L + log q_{s′,s} + α̂^{s′}_{t−1};
        if L > α̂^s_t then
          {ξ̂^s_{t|t}, α̂^s_t, p̂^s_t} ← {ξ^{s′,s}_{t|t}, L, s′};
        end
      end
      if s ∈ N then
        L ← log q_{s′,s} + α̂^{s′}_t;
        if L > α̂^s_t then
          {ξ̂^s_{t|t}, α̂^s_t, p̂^s_t} ← {ξ^{s′}_{t|t}, L, s′};
        end
      end
    end
  end
  s ← END, t ← N;
  while t > 0 do
    s′ ← p̂^s_t;
    if s ∈ ν then
      s_t ← s, t ← t − 1;
    end
    s ← s′;
  end
  return s_{1:N};
Algorithm 25: Approximate Viterbi for switching LDS including null states
to extend David Barber's S-LDS inference technique via expectation correction to learning. If the approximations made in the EC algorithm are justified, it yields an almost exact EM and is more principled than other conventional techniques like the approximate Viterbi, which only considers the approximately best discrete hidden state sequence to learn the parameters of the continuous state. Another important contribution of this chapter is to provide the S-LDS learning algorithms when the inference graph contains null, or non-emitting, states. Null states can simplify highly connected S-LDS graphs at a small cost in modeling power. This, combined with the Sherman-Morrison lemma for efficient Kalman updates, is implemented in the code we develop for S-LDS inference and learning. We will show in Chapters 6 and 7 that S-LDS models give the best gesture recognition results on the RMIS surgery data. On some complicated procedures like needle-passing, where the factor analyzed HMM provides 70% gesture recognition accuracy, performance with S-LDS goes as high as 77%, a 10% relative improvement.
Chapter 6

Experimental Results for Automatic Gesture Recognition
We have, in the preceding chapters, generalized the framework of hidden Markov models and factor analyzed HMMs to switching linear dynamical systems by the addition of a continuously evolving hidden state. We integrated various approximate inference procedures existing in the literature to create a principled parameter estimation framework via expectation maximization. In this chapter, we apply standard HMMs, LDA, factor analyzed HMMs and S-LDS models to perform gesture recognition experiments on various benchtop RMIS tasks. We show that FA-HMM and S-LDS yield robust parameter estimates, resulting in better modeling of such data and better performance.
6.1 Dataset
The dataset for our experiments is a set of benchtop surgical trials captured from various surgeons. We perform empirical comparisons on three different surgical tasks, namely suturing, knot-tying and needle-passing. Each of these is performed by a mix of trainee and novice surgeons using a teleoperated da Vinci surgical system.
6.1.1 Kinematic Data Recordings of Benchtop Tasks
The duration of a trial varies from 2 to 5 minutes, and the video and kinematic data are recorded at 30 frames per second. The kinematic data contain 36 position variables measured across the various robot arms, of which 22 are on the master side (surgeon control side) and 14 are on the slave side (patient side) unit; the velocity of each of these 36 coordinates is also available from the device. For our experimental analysis, we use the kinematic information as our observations o_t. For our experiments, we down-sample the video and kinematic recordings to 3 frames per second.¹
6.1.2 Manual Segmentation
Each trial was manually segmented into semantically “atomic” gestures. For suturing and needle-passing, there is a set of eleven symbols proposed by Reiley [2]. Following their terminology, we
will call each gesture a surgeme. Typical surgemes include, for instance, (i) positioning the needle
for insertion with the right hand, (ii) inserting the needle through the tissue till it comes out where
desired, (iii) reaching for the needle-tip with the left hand, (iv) pulling the suture with the left hand,
etc. Formally, a manual label of an observation sequence [o1:N ] is a set of frame-level gesture labels
[g1:N ] where each gt ∈ C is a gesture. In case of suturing and needle-passing, there are 11 gestures
C = {1, . . . , 11}. In case of knot-tying, there are 6 different gestures. For instance, in the suturing
¹ We do this mainly due to the computational complexity of the LDS models when dealing with video. Moreover, human hand gestures occur at a rate of 3-4 Hz, so we lose very little by such down-sampling.
task, the following sequence of actions (among others) is iteratively performed four times:
1. positioning the needle for insertion with the right hand,
2. inserting the needle through the tissue till it comes out where desired,
3. reaching for the needle-tip with the left hand,
4. pulling the suture with the left hand,
5. transferring the needle to the right hand.
Similarly, needle (or hoop) passing consists of gestures such as inserting the needle through the
hoop, pulling the needle, etc.
6.1.3 Automatic Gesture Recognition and Evaluation

Given a partition of the trials into training and test trials, the gesture recognition task is to first learn a parameter set Θ̂ from the set of manually labelled training examples [o^j_{1:N_j}, g^j_{1:N_j}] and then automatically assign a gesture label g_{1:N} to each frame of a test trial o_{1:N}. These gesture labels are the same set of labels described in Section 6.1.2 above. Trials in the training partition are used to train the HMM or the S-LDS using the EM algorithms described in the preceding chapters. Using the trained models, we decode a test trial as a sequence of gestures. To do so, a decoding graph is constructed. One way to construct a decoding graph is to allow any gesture to follow any other gesture. For instance, if one discrete state is used for each gesture, the decoding graph shown in Figure 6.1 is constructed. If, instead, a three-state left-to-right model is used for each gesture, the decoding graph shown in Figure 6.2 is constructed.
Figure 6.1: Single-state, self-loop
Figure 6.2: 3-state LR, self-loop
Later, we also explore constraining the set of possible gesture transitions using domain-specific knowledge for the various RMIS tasks. Finally, accuracy is measured by how well the models replicate the manual labeling of the test trial. An example of a typical gesture recognition output is shown in Figure 6.3.
To evaluate the automatically recognized gesture sequence, we compare it with the manual label frame by frame. Thus, if ĝ_{1:N} is the recognized gesture sequence and g_{1:N} is the manual label, the percentage accuracy for the given trial is computed as

    100 − (100/N) Σ_{t=1}^{N} I(g_t ≠ ĝ_t).
Our hypothesis is that increasingly complex models yield improved performance in automatic gesture recognition. Improved gesture recognition implies more powerful modeling capabilities, and hence would yield better systems for surgical skill evaluation and live feedback in laparoscopic surgery.
Figure 6.3: The top figure is an example of a manual segmentation of a needle passing trial into
gestures. Gestures are numbered 1,2,3,etc. Here gesture 1 is inserting the needle through the hoop.
The bottom figure is the outcome of an automatic gesture recognition using a standard HMM.
6.2 Setups for Comparing FA-HMM and S-LDS

We speculate that S-LDS models will model the RMIS surgery data better, as they effectively capture the correlations between adjacent observation samples via the continuous hidden state. In this chapter, we perform several gesture recognition experiments to test this hypothesis. To compare the FA-HMM and the S-LDS, we first note that the FA-HMM is the special case of the S-LDS in which the state transition matrix F_s = 0. We will explore a few variants of the FA-HMM and the S-LDS models by appropriately constraining the parameter set Θ. Specifically, we modify the parameters Θ of the S-LDS by enforcing various constraints. The five setups we will
use for our experiments are:
1. FA-HMM (1): F_s = 0, H_s = H. When F_s = 0, the S-LDS becomes the FA-HMM with no direct dependence between x_t and x_{t−1}, and hence inference is exact. This is the setting described in Section 3.2.3, where the loading matrix H is tied across classes while we still allow the mean µ_z^s and the covariance [σ_{1:D}^s]² of the observation noise z_t to be class dependent. In this setting, the estimate of H is given in (3.26). When d = 0, this becomes the standard HMM with no factor analysis, where the observations are modeled by a single class-dependent diagonal-covariance Gaussian.
2. FA-HMM (2): F_s = 0, H_s = H. This is the FA-HMM setting of Section 3.2.4, where the loading matrix H, the observation noise mean µ_z and the variance [σ_{1:D}]² are common to all classes. The estimates for H are given in (3.27). In this case, the only variability between the classes lies in the state noise parameters µ_u^s and Σ_u^s. The vector x_t can be thought of as a reduced-dimensional representation of the observation o_t, and we model only the between-class variability in the distribution of this hidden (reduced-dimensional) vector. Hence this setting closely resembles classical linear discriminant dimensionality reduction techniques such as LDA and HLDA. When d = 0, this setup is not interesting, as there are no parameters to discriminate between the classes.
3. S-LDS (1): µ_u = 0, H_s = H. This is the switching LDS setting in which the underlying state sequence has zero mean and the observation matrices are tied. We still allow the means and the (diagonal) variances of the observation noise to be class dependent. The estimate for F_s is given in (5.19); the estimate for H is (as in the FA-HMM) given in (3.26).
4. S-LDS (2): µ_u = 0, H_s = H. This is the zero-mean S-LDS with complete tying of the observation model parameters. As before, the estimate for F_s is according to (5.19), while H is estimated according to (3.27).
5. S-LDS (3): H_s = H. Here µ_u ∈ R^d is unconstrained, so the estimate for F_s is according to (5.20), and H is estimated using (3.27).
In the settings above, we always assume that the loading matrix H_s is tied across classes. It is, however, possible to relax this by allowing a smaller (yet manageable) number of loading matrices, where each matrix acts on a set of states. The sets of states that share an observation matrix may be estimated using any likelihood-based clustering algorithm.
6.3 Initialization of S-LDS and FA-HMM Parameters via System Identification

Since the EM algorithms described in Chapters 4 and 5 converge only to a local optimum, a good initialization of the parameters is desirable. Formally, given a set of observations [o^j_{1:N_j}]^M_{j=1}, we use system identification techniques to deterministically infer the latent continuous signal [x^j_{1:N_j}]^M_{j=1} based on the setup being used. For instance, if H_s = H for all s, an initial value of H is computed via PCA, and the projections of o_t using the resulting H are taken as the initial estimates of x_t. The statistics f_0^s, S_0^s, S_{-1}^s, S_{0,-1}^s, v_o^s, w_o^s, V_{ox}^s can then be computed in each setup given the initial
estimates of x_t as

    f_0^s = Σ x_t^j,                  f_{-1}^s = Σ x_{t-1}^j,
    S_0^s = Σ x_t^j (x_t^j)^T,        S_{0,-1}^s = Σ x_t^j (x_{t-1}^j)^T,        (6.1)
    v_o^s = Σ o_t^j,                  w_o^s = Σ o_t^j ⊙ o_t^j,                   V_{ox}^s = Σ o_t^j (x_t^j)^T,

where each sum Σ runs over trials j = 1, . . . , M and over frames t with g_t^j = s.
The initialization of {F_s, H_s} is done using these statistics for each of the setups.² Then, 10 EM iterations are performed to obtain a better estimate of the parameters in the maximum likelihood sense.
6.3.1 PCA-based Initialization for Setup 4

In Setup 4 (S-LDS) described in Section 6.2, µ_u^s = 0 and the observation parameters {µ_z, σ_{1:D}², H} are common across states. We can estimate µ_z and σ_{1:D}² as the state-independent means and variances directly from the set of observations [o^j_{1:N_j}]. Then, the normalized zero-mean observations are computed as

    õ_t^j(p) = (o_t^j(p) − µ_z(p)) / σ_p

² More precisely, this replaces the EM-based AccumulateStats function in (for example) Algorithm 20.
for p = 1, . . . , D, where Σ_z = diag(σ_1², . . . , σ_D²). The initial estimate of H is computed as the first d principal components of the covariance matrix Σ̃_o of [õ^j_{1:N_j}]^M_{j=1}. The underlying x_t^j is then given as

    x_t^j = H^T õ_t^j.

Using the deterministically inferred x_t^j, we may use (6.1) to compute the sufficient statistics needed for making an initial assignment³ of the parameters Θ.
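A sketch of this PCA-based initialization under the stated assumptions (tied mean and variance; function and variable names are ours):

```python
import numpy as np

def pca_init(O, d):
    """Standardize observations with the global mean/std, then set
    x_t = H^T o~_t using the top-d principal components as H (D x d)."""
    mu_z = O.mean(axis=0)                   # state-independent observation mean
    sigma = O.std(axis=0)                   # state-independent per-coordinate std
    O_tilde = (O - mu_z) / sigma            # normalized zero-mean observations
    cov = np.cov(O_tilde, rowvar=False)     # covariance of the normalized data
    w, V = np.linalg.eigh(cov)              # eigenvalues in ascending order
    H = V[:, np.argsort(w)[::-1][:d]]       # top-d principal directions
    return H, O_tilde @ H                   # projections x_t for every frame
```

Since õ_t has zero mean by construction, the resulting x_t also has zero mean, consistent with the µ_u^s = 0 assumption of this setup.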
6.3.2 PCA-based Initialization for Setups 1 and 3

In Setup 1 (FA-HMM) and Setup 3 (S-LDS), the observation mean and covariance are not tied. Hence, unlike the previous initialization, we compute the class-dependent observation means µ_z^s and the diagonal covariances [σ_{1:D}^s]² directly from the gesture-labelled data [o^j_{1:N_j}, g^j_{1:N_j}]. Then, the normalized zero-mean observations are computed as

    õ_t^j(p) = (o_t^j(p) − µ_z^{g_t^j}(p)) / σ_p^{g_t^j}.

The initial H is computed as the first d principal components of the covariance Σ̃_o of [õ^j_{1:N_j}]^M_{j=1}, and the underlying x_t^j is then deterministically computed as

    x_t^j = H^T õ_t^j.
6.3.3 LDA-based Initialization for Setups 2 and 5

In Setup 2 (FA-HMM) and Setup 5 (S-LDS), the observation parameters {µ_z, σ_{1:D}², H} are tied across all classes. Thus, F_s is the only parameter that is allowed to distinguish between the
³ Since õ_t^j has zero mean, the x_t^j computed above also has zero mean, implying that the estimated µ_u^s = 0 will already be a good fit to the model.
various states. In the FA-HMM setting, this F_s comprises only the mean µ_u^s and the covariance Σ_u^s, since the evolution matrix F_s = 0. Thus all the class-discriminating variability in the signal must be captured by the underlying x_t alone. This is very similar to classical discriminant dimensionality reduction techniques like LDA, to which we can draw parallels by taking the reduced-dimensional signal to be the underlying x_t. We therefore initialize x_t as we do for LDA.

Given M observation sequences [o^j_{1:N_j}]^M_{j=1} along with class (gesture) labels g^j_{1:N_j}, we first compute the global observation mean and covariance {µ_o, Σ_o} of all the N = Σ_j N_j observation points, and the class-dependent observation means µ_o^s as the means of the observations with label g_t^j = s. Then the d × D LDA matrix A is computed by taking the top d eigenvectors of the matrix

    Σ_o^{-1} ( (1/N) Σ_s N_s µ_o^s (µ_o^s)^T − µ_o µ_o^T ).

For each observation point o_t^j, we compute the underlying d-dimensional x_t^j deterministically as x_t^j = A o_t^j. Then, using (6.1), we collect the statistics required to initialize the parameters Θ in Algorithm 20.
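The LDA computation above can be sketched as follows (names are ours; the eigenvectors of the generally non-symmetric matrix Σ_o^{-1} B are taken with np.linalg.eig and their real parts kept):

```python
import numpy as np

def lda_init(O, G, d):
    """x_t = A o_t, with A the top-d eigenvectors of
    Sigma_o^{-1} ((1/N) sum_s N_s mu_s mu_s^T - mu mu^T)."""
    N = len(O)
    mu = O.mean(axis=0)
    Sigma_o = np.cov(O, rowvar=False, bias=True)
    B = -np.outer(mu, mu)                           # between-class scatter
    for s in np.unique(G):
        mask = (G == s)
        mu_s = O[mask].mean(axis=0)
        B += (mask.sum() / N) * np.outer(mu_s, mu_s)
    w, V = np.linalg.eig(np.linalg.solve(Sigma_o, B))
    A = V[:, np.argsort(w.real)[::-1][:d]].real.T   # d x D LDA matrix
    return A, O @ A.T                               # x_t for every frame
```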
6.4 Surgical Gesture Recognition Setups

We perform experiments on the following two setups:

Setup I: We have 40 trials each of suturing, needle-passing and knot-tying, comprising 5 trials from each of 8 different surgeons. Given a set of trials, we make five partitions into training and test sets. The trials from each surgeon are numbered 1 through 5; for the i-th train-test partition, we leave out all trials numbered i from the training set and put them in the test set. Thus, a trial in the test data may come from the same surgeon as a trial in the training data. The training data is used to learn the parameters Θ, and the gesture sequence of a test trial is inferred using Θ. The accuracy is reported as the average frame accuracy over all five partitions, i.e., accuracy is reported
on an effective test set of 40 trials. In this setup, the test data contains users that are already seen in the training data. We will compare the standard HMM, FA-HMM and S-LDS in this chapter.
Setup II: Unlike Setup I, user-disjoint partitions of the trials are created in Setup II. In the case of suturing and knot-tying, there are 8 different users (and seven in the case of needle-passing). An eight-fold cross validation akin to Setup I is carried out, except that in each fold, all the trials of one surgeon are in the test partition and all trials of the remaining surgeons are in training. Test results of all the trials are again aggregated, resulting in an effective test set of 40 trials. We will describe the analysis and results of Setup II in the next chapter, alongside user-adaptive transforms.
6.5 Empirical Comparisons on Setup I

6.5.1 Using 1-State Models for each Gesture
Each of the eleven gestures in the case of suturing and needle-passing (or six in the case of knot-tying) is modeled using one discrete state. In fact, during training there is no hidden discrete state, as the entire state sequence s_{1:N} corresponding to the observation sequence o_{1:N} is known. However, for d > 0, the underlying x_{1:N} is still a latent variable and hence needs to be inferred. For each d, we explore the several variants of the S-LDS described in Section 6.2. We use 10 EM iterations to train these models and estimate Θ.
Using the estimated Θ̂, we infer the marginal distribution of the states, γ̂_t^s, in a test trial o_{1:N} using Algorithm 18. The decoded sequence ŝ_{1:N} is the sequence of emitting states with the highest marginal probability. In other words,

    ŝ_t = argmax_{s ∈ N̄} γ̂_t^s        (6.2)
Then, using the actual gesture sequence g1:N for the test-trial, we compute a frame-level accuracy
as the fraction of the N frames whose decoded state ŝt matches gt . Finally, the average accuracy
over all the trials is reported.
                FA-HMM (1)      FA-HMM (2)      S-LDS (1)       S-LDS (2)       S-LDS (3)
                0 EM    10 EM   0 EM    10 EM   0 EM    10 EM   0 EM    10 EM   0 EM    10 EM
d=0             54.0%   54.0%   0.6%    0.6%    54.0%   54.0%   0.6%    0.6%    0.6%    0.6%
d=2             61.9%   63.4%   49.1%   19.3%   62.3%   63.5%   35.7%   16.6%   39.5%   23.0%
d=4             63.1%   64.7%   51.2%   37.8%   64.7%   65.7%   39.2%   40.8%   40.9%   45.6%
d=8             61.8%   64.8%   56.9%   53.8%   61.9%   69.1%   45.0%   48.4%   42.2%   53.3%
d=12            61.5%   66.3%   61.0%   64.7%   62.8%   70.1%   49.6%   51.2%   42.4%   56.4%
d=16            63.1%   68.5%   62.8%   70.2%   64.9%   74.8%   53.1%   54.1%   43.5%   58.6%

Table 6.1: Accuracies on Suturing using one-state models with o_t ∈ R^36 (position) on Setup I. Columns correspond to the five setups of Section 6.2.
Table 6.1 shows the comparison of the five variants of the S-LDS described in Section 6.2 on the suturing task (11 gestures), using one-state-per-gesture models with the observation o_t ∈ R^36 containing only the position information of the individual robot arms. Tables 6.2 and 6.3 show the corresponding results on the needle-passing (11 gestures) and knot-tying (6 gestures) tasks respectively. For each setup and each d, we report two accuracies.
The first (denoted 0 EM) is based on the parameters Θ estimated from the deterministic inference of the underlying hidden variable x_t^j using the setup-based system identification techniques described in Section 6.3. The second (denoted 10 EM) is obtained after performing 10 iterations of
                FA-HMM (1)      FA-HMM (2)      S-LDS (1)       S-LDS (2)       S-LDS (3)
                0 EM    10 EM   0 EM    10 EM   0 EM    10 EM   0 EM    10 EM   0 EM    10 EM
d=0             48.8%   48.8%   1.4%    1.46%   48.8%   48.8%   1.46%   1.4%    1.4%    1.4%
d=2             54.9%   53.7%   30.4%   42.14%  56.1%   54.5%   5.3%    7.0%    41.8%   25.1%
d=4             57.4%   59.7%   35.0%   45.36%  57.4%   62.4%   36.1%   30.9%   35.5%   27.7%
d=8             58.7%   63.7%   49.8%   45.6%   57.43%  63.2%   36.9%   38.9%   44.9%   40.8%
d=12            57.9%   64.3%   50.3%   59.3%   56.49%  67.4%   41.0%   40.5%   46.5%   45.3%
d=16            60.3%   65.4%   50.5%   62.7%   64.82%  72.2%   45.4%   45.4%   46.5%   48.2%

Table 6.2: Accuracies on Needle-Passing using one-state models with o_t ∈ R^36 (position) on Setup I. Columns correspond to the five setups of Section 6.2.
                FA-HMM (1)      FA-HMM (2)      S-LDS (1)       S-LDS (2)       S-LDS (3)
                0 EM    10 EM   0 EM    10 EM   0 EM    10 EM   0 EM    10 EM   0 EM    10 EM
d=0             63.6%   63.6%   2.4%    2.4%    63.6%   63.6%   2.4%    2.4%    2.4%    2.4%
d=2             67.1%   67.0%   36.0%   34.1%   66.3%   67.5%   15.9%   30.5%   35.8%   27.3%
d=4             70.7%   71.2%   57.1%   44.6%   70.6%   72.4%   30.8%   34.5%   48.5%   36.4%
d=8             70.4%   74.6%   61.0%   65.4%   68.5%   77.4%   49.2%   51.1%   49.0%   55.5%
d=12            72.2%   76.1%   60.9%   75.8%   68.9%   77.7%   57.0%   60.4%   47.9%   61.8%
d=16            72.8%   77.1%   61.6%   77.8%   68.6%   78.5%   59.6%   62.3%   49.3%   67.9%

Table 6.3: Accuracies on Knot-Tying using one-state models with o_t ∈ R^36 (position) on Setup I. Columns correspond to the five setups of Section 6.2.
EM after the system identification based initialization of the parameters Θ.
Based on the gesture recognition results, we make the following observations:
1. It can be seen that system identification based initialization is a good starting point. In fact, for the second FA-HMM setting, the LDA-based initialization performs better than the EM-based estimates for smaller d. However, the EM-based estimates rapidly improve performance as d is increased.
2. Next, comparing the 10 EM columns of the variants with class-dependent observation noise against the corresponding fully tied variants, it can be seen that the FA-HMM and the S-LDS perform best when we allow µ_z^s and σ_{1:D}^s to be class dependent (setting 1). This is because allowing different observation means for each class gives the EM estimates more freedom to evolve, while still retaining a compact set of parameters (we still have a diagonal covariance, Σ_z^s, for the observation noise z_t).
3. Furthermore, factor analysis is quite powerful even when d = 2. We see about a 6-7% gain in accuracy (the first variants of the FA-HMM and the S-LDS) when going from d = 0 to d = 2. This implies that a significant fraction of the correlation between the observation components can be captured even by a small d.
4. Furthermore, the S-LDS is significantly better than the FA-HMM (the best S-LDS is about 6% better than the best FA-HMM in the case of suturing). Of the 40 suturing trials, the S-LDS model predicts the gestures with significantly higher accuracy than the FA-HMM in over 30 trials for each d, showing that the S-LDS is indeed a better model than the FA-HMM for this kind of data.
5. The margin between the S-LDS and the FA-HMM is even higher on comparatively harder tasks like needle-passing.
Figure 6.4 shows an illustration of how the S-LDS performs much better than the FA-HMM on a needle-passing trial.
6.5.2 Using Multi-State Models for each Gesture

A one-discrete-state model, as in Figure 6.1, assumes that each gesture is a homogeneous action from start to end. An alternative is to use more than one hidden state per gesture; the 3-state left-to-right model shown in Figure 6.2 is one such example. A multi-state model can capture fine-grained movements that are not manually labelled. For instance, the action of inserting the needle could be made up of several atomic movements like positioning the needle, pushing it through the tissue, etc. These atomic gestures that are not manually marked can be captured by introducing hidden discrete states for each gesture and allowing EM to discover their dynamics. As an example of a multi-state model, we use a sequence of three discrete states to model a single manually labeled gesture. In the case of suturing, there are 11 gestures and hence 33 discrete states in total.
As before, the training data comprises a set of observation sequences, each observation sequence [o_{1:N}] coming with a gesture label sequence g_{1:N}. Using these gesture labels, we create a boolean matrix b_t^s which is one only if the state s belongs to the gesture g_t, and zero otherwise. Specifically, if the gesture g_t = 1, then only states 1, 2 and 3 are active (b_t^1 = b_t^2 = b_t^3 = 1) at time t. The resulting b_t^s then becomes the boolean mask of the grammar G for this particular trial.⁴
Then one may use Algorithm 19 to accumulate the statistics required for estimating the parameters
⁴ An alternative way to train without this boolean mask is to first chop each trial into its individual gestures and train each 3-state LR model (with b_t^s = 1 for all (s, t)) from the chopped segments corresponding to that gesture. This is somewhat inefficient due to the data processing involved in chopping the trials into individual gestures. It also does not capture the transition between x_t, the continuous state in the last frame of one gesture, and x_{t+1}, the continuous state in the first frame of the next gesture.
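A sketch of building this boolean mask for 3-state left-to-right gesture models (0-based state indices internally; names are ours):

```python
import numpy as np

def gesture_mask(g, n_gestures, states_per_gesture=3):
    """b[s, t] is True iff discrete state s belongs to the labelled gesture g_t."""
    N, S = len(g), n_gestures * states_per_gesture
    b = np.zeros((S, N), dtype=bool)
    for t, gt in enumerate(g):               # gestures labelled 1..n_gestures
        lo = (gt - 1) * states_per_gesture
        b[lo:lo + states_per_gesture, t] = True
    return b
```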
Figure 6.4: Comparison of FA-HMM and S-LDS on a trial of needle-passing. The top figure shows
the ground truth gesture segmentation of the particular trial. The second figure is the automatic
segmentation via FA-HMM (Accuracy=59%). The third figure is the automatic segmentation via
S-LDS (Accuracy= 90%).
Θ for a set of observations [oj1:Nj ].
For decoding a test trial o1:N , we first infer the marginal distribution γ̂ts using Algorithm 18. Then,
the decoded state s_t at time t is given by (6.2). Once the state-level decoding (over the 33 emitting states) is done on the test trial using (6.2), the states are mapped to the individual gestures (states 1, 2, 3 correspond to gesture 1, states 4, 5, 6 correspond to gesture 2, etc.) and the frame-level gesture accuracy is reported. The three states for each gesture are homogeneously initialized to the same set of parameters obtained using system identification, as in the one-state case.
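The state-to-gesture mapping after decoding is a simple integer division (with 1-based labels; function name is ours):

```python
def state_to_gesture(s, states_per_gesture=3):
    """Map decoded state s (1-based) to its gesture: states 1-3 -> gesture 1, 4-6 -> 2, ..."""
    return (s - 1) // states_per_gesture + 1
```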
Tables 6.4, 6.5 and 6.6 show the recognition accuracies of the various models on Suturing, Needle-Passing and Knot-Tying respectively, using 3 left-to-right discrete hidden states for each gesture with o_t ∈ R^36 containing the RMIS tool position information.
Based on the recognition results using three state models, we make the following observations:
1. As opposed to the one-state models, there is substantial improvement of the EM-based Θ over the system identification based initializations. This is because the initializations used the same set of parameters for all three hidden discrete states of each class. The EM algorithm is able to discover and model the internal dynamics of each gesture using these three hidden states, yielding better models which in turn result in better performance.
2. We see substantial gains using the S-LDS in the case of Needle-Passing, where the FA-HMM achieves a frame accuracy of only up to 71% while the S-LDS reaches 77%.
3. Increasing the hidden dimension helps. It is seen from Table 6.7 that the best accuracy is achieved when d = 12 or d = 16; sometimes the difference between d = 12 and d = 16 is not statistically significant.
This leads us to conclude that while the human labeling is based on the intent (or semantics) of a
gesture, the dynamics of a gesture contain further analyzable parts, and that these sub-gestures can
be discovered automatically via EM.
                FA-HMM (1)      FA-HMM (2)      S-LDS (1)       S-LDS (2)       S-LDS (3)
                0 EM    10 EM   0 EM    10 EM   0 EM    10 EM   0 EM    10 EM   0 EM    10 EM
d=0             61.8%   68.5%   0.5%    0.5%    63.1%   68.5%   0.5%    0.5%    0.5%    0.5%
d=2             63.0%   75.0%   50.8%   31.1%   64.4%   75.0%   41.0%   22.9%   47.0%   57.0%
d=4             65.8%   75.3%   52.0%   54.4%   66.6%   77.3%   47.4%   59.9%   49.3%   67.9%
d=8             65.5%   78.4%   56.9%   75.5%   66.9%   78.9%   54.9%   67.6%   53.5%   77.6%
d=12            66.1%   77.4%   61.2%   78.1%   68.1%   78.9%   58.49%  77.9%   55.0%   80.7%
d=16            66.3%   76.9%   62.8%   78.2%   69.3%   79.4%   60.00%  78.0%   55.0%   79.8%

Table 6.4: Accuracies on Suturing using three-state models with o_t ∈ R^36 (position) on Setup I. Columns correspond to the five setups of Section 6.2.
Evaluating statistical significance

To compare the statistical significance of the results obtained, we performed a Student t-test for pairs of models (X and Y) with the null hypothesis that model X is as good as model Y. Specifically, if model X has accuracies A_1^X, . . . , A_40^X on the 40 test trials, model Y has accuracies A_1^Y, . . . , A_40^Y, and the experiment suggests that the mean accuracy of model X is higher than that of model Y, we compute the p-value of the paired Student t-test for the null hypothesis that the mean of the per-trial accuracy difference between the two models is zero. Clearly, if the null hypothesis cannot be rejected, there is no statistically significant difference between the two models.
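The paired test statistic can be sketched as below (names are ours); the p-value then follows from the Student t distribution with 39 degrees of freedom (e.g. via scipy.stats.ttest_rel):

```python
import numpy as np

def paired_t_statistic(acc_x, acc_y):
    """t statistic for H0: the mean per-trial accuracy difference X - Y is zero."""
    diff = np.asarray(acc_x, dtype=float) - np.asarray(acc_y, dtype=float)
    return diff.mean() / (diff.std(ddof=1) / np.sqrt(len(diff)))
```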
                FA-HMM (1)      FA-HMM (2)      S-LDS (1)       S-LDS (2)       S-LDS (3)
                0 EM    10 EM   0 EM    10 EM   0 EM    10 EM   0 EM    10 EM   0 EM    10 EM
d=0             49.2%   54.5%   1.4%    1.4%    49.28%  54.5%   1.4%    1.4%    1.4%    1.46%
d=2             55.6%   62.4%   32.0%   48.2%   56.7%   63.9%   13.6%   34.8%   46.6%   53.1%
d=4             58.2%   67.9%   36.2%   49.4%   58.0%   72.5%   40.5%   61.0%   39.0%   58.5%
d=8             59.1%   69.0%   50.5%   60.2%   57.8%   73.6%   46.9%   68.3%   54.1%   72.6%
d=12            58.3%   67.9%   50.5%   69.1%   56.5%   75.8%   48.6%   68.8%   54.3%   75.4%
d=16            61.6%   70.8%   52.0%   71.0%   65.7%   77.6%   54.1%   69.1%   56.3%   74.3%

Table 6.5: Accuracies on Needle-Passing using three-state models with o_t ∈ R^36 (position) on Setup I. Columns correspond to the five setups of Section 6.2.
                FA-HMM (1)      FA-HMM (2)      S-LDS (1)       S-LDS (2)       S-LDS (3)
                0 EM    10 EM   0 EM    10 EM   0 EM    10 EM   0 EM    10 EM   0 EM    10 EM
d=0             64.3%   69.8%   2.4%    2.4%    64.3%   69.8%   2.4%    2.4%    2.4%    2.4%
d=2             67.4%   74.5%   35.8%   36.2%   67.1%   75.4%   18.1%   36.0%   41.3%   42.0%
d=4             71.9%   80.0%   58.0%   52.7%   71.8%   79.6%   39.0%   58.9%   53.0%   52.3%
d=8             71.8%   81.5%   62.5%   74.1%   69.8%   82.7%   56.7%   74.6%   54.1%   75.9%
d=12            73.8%   81.4%   63.4%   81.0%   70.7%   82.0%   64.3%   77.6%   55.1%   77.5%
d=16            73.5%   82.8%   64.2%   82.8%   68.8%   78.4%   64.6%   75.0%   56.3%   77.2%

Table 6.6: Accuracies on Knot-Tying using three-state models with o_t ∈ R^36 (position) on Setup I. Columns correspond to the five setups of Section 6.2.
Figure 6.5: State machine constraining the possible actions in Suturing.
To compare FA-HMM and S-LDS as modeling approaches, we first determine the best latent dimension d for each model and each task. These are shown in Table 6.7. We then conduct pairwise tests on the 5 different models.
Tables 6.8, 6.9 and 6.10 show the pairwise p-values of the five different models⁵ obtained from the Student-t distribution. It is seen that the first S-LDS model is the overall winner, and it statistically outperforms the FA-HMM models on the suturing and needle-passing tasks. For the knot-tying task, there is not much difference between the FA-HMM and the S-LDS, as shown in Table 6.10.
Comparing the best model (S-LDS with H_s = H and d = 16) with all the other models, we observe that the rejection of the null hypothesis for this S-LDS model is statistically significant, with p-value less than 0.05 (i.e., rejection probability greater than 0.95), against all the FA-HMM models

⁵ We fix the hidden-layer dimension d that maximizes the accuracy for each of the models, as described in Table 6.7.
                FA-HMM (1)   FA-HMM (2)   S-LDS (1)   S-LDS (2)   S-LDS (3)
Suturing        8            12-16        16          12-16       12
Needle-Passing  16           16           16          16          12
Knot-Tying      16           16           12          12          12-16

Table 6.7: The best hidden dimension d for each of the five models across the three tasks. In some cases, the difference between d = 12 and d = 16 is not statistically significant. FA-HMM (1) denotes the first variant of the FA-HMM with H_s = H, and so on.
for the suturing task. Furthermore, the null hypothesis is rejected with p-value less than 0.01 for the same S-LDS model against all the FA-HMM models and some of the other S-LDS models on the needle-passing task, indicating that the gain due to the S-LDS model is significant in the case of needle-passing.
             FA-HMM (1)  S-LDS (2)  FA-HMM (2)  S-LDS (1)  S-LDS (3)
FA-HMM (1)   77.4%       0.161      0.3         0.014      <0.001
S-LDS (2)                78.0%      0.6         0.46       <0.001
FA-HMM (2)                          78.2%       0.18       0.011
S-LDS (1)                                       79.4%      0.18
S-LDS (3)                                                  80.7%

Table 6.8: Comparison of various models by statistical significance on the suturing task. The models are ordered by increasing mean performance; diagonal entries give the mean accuracy from Table 6.4, and off-diagonal entries give the pairwise p-values.
6.5.3 Language Modeling

So far we have used the simple self-loop graphs shown in Figures 6.1 and 6.2 for decoding the gesture sequences of the test trials. It is, however, possible to use in-domain knowledge regarding the language of the task being performed. For instance, it is known from expert surgeons that the task of
             S-LDS (2)  FA-HMM (1)  FA-HMM (2)  S-LDS (3)  S-LDS (1)
S-LDS (2)    69.2%      0.316       0.537       <0.001     <0.001
FA-HMM (1)              70.8%       0.95        0.01       <0.001
FA-HMM (2)                          71.0%       0.044      0.003
S-LDS (3)                                       75.5%      0.03
S-LDS (1)                                                  77.6%

Table 6.9: Comparison of various models by statistical significance on the needle-passing task. The models are ordered by increasing mean performance; diagonal entries give the mean accuracy from Table 6.5, and off-diagonal entries give the pairwise p-values.
             S-LDS (3)  S-LDS (2)  S-LDS (1)  FA-HMM (2)  FA-HMM (1)
S-LDS (3)    77.5%      0.98       0.006      0.002       0.005
S-LDS (2)               77.6%      <0.001     <0.001      <0.001
S-LDS (1)                          82.7%      0.5         0.3
FA-HMM (2)                                    82.8%       0.86
FA-HMM (1)                                                82.8%

Table 6.10: Comparison of various models by statistical significance on the knot-tying task. The models are ordered by increasing mean performance; diagonal entries give the mean accuracy from Table 6.6, and off-diagonal entries give the pairwise p-values.
                FA-HMM (1)      FA-HMM (2)      S-LDS (1)       S-LDS (2)       S-LDS (3)
                0 EM    10 EM   0 EM    10 EM   0 EM    10 EM   0 EM    10 EM   0 EM    10 EM
d=0             54.8%   68.5%   0.5%    0.5%    54.8%   68.5%   0.5%    0.5%    0.5%    0.5%
d=2             72.4%   73.3%   61.6%   39.1%   73.7%   74.4%   38.7%   39.7%   51.2%   57.7%
d=4             73.4%   74.6%   64.3%   57.9%   73.8%   75.7%   44.7%   47.3%   56.7%   63.0%
d=8             74.3%   75.1%   65.9%   66.1%   73.4%   76.9%   55.9%   56.3%   62.1%   66.6%
d=12            74.2%   75.1%   67.6%   74.2%   72.7%   76.9%   60.0%   61.0%   65.9%   71.2%
d=16            73.1%   75.8%   69.5%   77.1%   72.3%   78.8%   63.1%   64.0%   66.2%   73.4%

Table 6.11: Accuracies on Suturing using one-state models on Setup I, after enforcing that the decoded path must follow the state transitions given by Figure 6.5. Columns correspond to the five setups of Section 6.2.
                FA-HMM (1)      FA-HMM (2)      S-LDS (1)       S-LDS (2)       S-LDS (3)
                0 EM    10 EM   0 EM    10 EM   0 EM    10 EM   0 EM    10 EM   0 EM    10 EM
d=0             54.8%   68.5%   0.5%    0.5%    54.8%   68.5%   0.5%    0.5%    0.5%    0.5%
d=2             74.1%   77.3%   65.7%   59.5%   74.63%  77.2%   38.6%   34.1%   57.4%   66.6%
d=4             74.0%   77.0%   66.4%   74.5%   74.1%   78.1%   54.0%   67.3%   64.2%   68.4%
d=8             76.0%   78.6%   68.2%   77.1%   74.8%   79.6%   65.3%   66.7%   69.0%   76.7%
d=12            66.1%   77.4%   61.2%   78.1%   68.1%   78.9%   58.4%   77.9%   55.0%   80.7%
d=16            51.3%   77.5%   62.8%   78.2%   39.2%   80.5%   52.2%   78.5%   55.0%   79.8%

Table 6.12: Accuracies on Suturing using three-state models on Setup I, after enforcing that the decoded path must follow the state transitions given by Figure 6.5. Columns correspond to the five setups of Section 6.2.
suturing follows the state-machine given in Figure 6.5. This state diagram shows that performing suturing is much more constrained than the assumed self-loop grammars in Figures 6.1 and 6.2. In the state machine, the initial idle and the final Drop Suture are interpreted as the START and the END states respectively.6 For decoding a test trial, we use the inference graph G obtained by composing this state-machine with the self-loop graphs in Figures 6.1 and 6.2. When 3-state models are used for each gesture, each of the original gesture states in Figure 6.5 is replaced, in the resulting composed state machine, by three left-to-right sub-gesture states.
Table 6.11 shows the frame accuracies (comparable with Table 6.1) when one-state models for each gesture are used in conjunction with this state machine as the inference graph while decoding.
Table 6.12 shows the corresponding frame accuracies when decoding is done with three-state models for each gesture.
We see that there is a substantial gain in performance from simply incorporating this in-domain knowledge (or constraint) while decoding.
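The composition step above can be sketched as follows. This is an illustrative reconstruction, not the thesis implementation: the dictionary-based graph representation, the function name `expand_grammar`, and the toy gesture numbering are assumptions for the example. Each gesture node of the task grammar becomes a left-to-right chain of sub-gesture states with self-loops, and the grammar arcs leave from the last sub-state.

```python
# Hypothetical sketch: expand a task grammar over gestures into an inference
# graph whose nodes are (gesture, substate) pairs. Each gesture becomes a
# 3-state left-to-right chain; every substate keeps a self-loop, and the
# grammar arcs are attached to the last substate of each chain.

def expand_grammar(grammar, n_sub=3):
    """grammar: dict mapping gesture -> set of successor gestures.
    Returns a dict mapping (gesture, substate) -> set of successor nodes."""
    graph = {}
    for g, successors in grammar.items():
        for k in range(n_sub):
            node = (g, k)
            arcs = {node}                                # self-loop
            if k + 1 < n_sub:
                arcs.add((g, k + 1))                     # left-to-right arc
            else:
                arcs |= {(g2, 0) for g2 in successors}   # grammar arc out
            graph[node] = arcs
    return graph

# Toy version of the cyclic 5-6-7-8 portion of the grammar mentioned below.
grammar = {5: {6}, 6: {7}, 7: {8}, 8: {5}}
G = expand_grammar(grammar)  # 4 gestures x 3 substates = 12 nodes
```

Decoding then runs Viterbi over this composed node set instead of the unconstrained self-loop graph.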
6.6 Empirical Comparisons on Setup II

We perform a similar set of experiments as in the last section, but this time we use Setup 2 to make our training and test partitions. In this setup, the test data contains trials from a user who is not in the training partition. The average accuracy over all users is finally reported. Since we have already discovered that three-state models perform better than one-state models, we perform comparative experiments only with three-state models. Moreover, we report results only for d = 4, 8, 12, 16, since we have already analyzed the trend over varying d in the last setup. We report
6 Training is done as before using the provided gesture labels {g1:N} and the boolean state mask [bst]1:N.
the results after performing 10 iterations of EM in each of the 5 settings.
Tables 6.13, 6.14 and 6.15 show the corresponding results on the suturing, needle-passing and knot-tying tasks using three-state left-to-right models for each gesture. We make the following observations from the gesture recognition performances:
1. First, we note that there is a significant drop in recognition performance for all the models compared to Setup I (refer to Tables 6.4, 6.5 and 6.6). This is expected, since user variabilities are common and are more prominent when such dextrous actions are involved.
2. As before, we see that the S-LDS model outperforms the FA-HMM, but this time by a higher margin. In the case of Suturing, we see about an 8% gain in accuracy in the best performance of S-LDS over FA-HMM. In the needle-passing task, the margin is even higher, with the S-LDS going as high as 60%, while the FA-HMM could only achieve up to 40% accuracy.
3. Another interesting observation is that both the S-LDS and the FA-HMM now perform best when all the variability is in the underlying xt alone, in other words when Hs = H. We expect this because the mean of the observations captures the pose information, which does not generalize when we go to a new trial. The dynamics, by contrast, are largely invariant to the camera pose and offset.
4. Finally, the cross-user dynamics seem to be better captured by providing additional parameters to F, while H seems to capture pose and other effects that may be trial dependent. This leads to an interesting question: Should H be trial-dependent? This is addressed in chapter 8.
                FA-HMM                                  S-LDS
       Fs = 0, Hs = H   Fs = 0, Hs = H   µsu = 0, Hs = H   µsu = 0, Hs = H    Hs = H
d=4        55.1%            42.8%             54.5%             52.7%         58.9%
d=8        51.5%            53.5%             51.6%             59.9%         66.0%
d=12       48.3%            50.7%             54.2%             65.4%         67.0%
d=16       49.8%            57.2%             58.9%             67.1%         66.5%

Table 6.13: Accuracies on Suturing using three-state models with ot ∈ R36 (position) on Setup 2.
                FA-HMM                                  S-LDS
       Fs = 0, Hs = H   Fs = 0, Hs = H   µsu = 0, Hs = H   µsu = 0, Hs = H    Hs = H
d=4        42.2%            28.9%             40.1%             44.6%         49.6%
d=8        39.5%            35.7%             44.4%             59.4%         54.6%
d=12       42.6%            39.6%             43.7%             56.4%         60.0%
d=16       39.5%            42.7%             52.2%             57.2%         54.2%

Table 6.14: Accuracies on Needle-Passing using three-state models with ot ∈ R36 (position) on Setup 2.
                FA-HMM                                  S-LDS
       Fs = 0, Hs = H   Fs = 0, Hs = H   µsu = 0, Hs = H   µsu = 0, Hs = H    Hs = H
d=4        66.5%            36.7%             67.7%             53.6%         45.1%
d=8        64.0%            55.8%             66.3%             65.6%         60.7%
d=12       67.0%            65.7%             65.9%             61.8%         63.4%
d=16       62.5%            65.2%             58.8%             61.5%         66.0%

Table 6.15: Accuracies on Knot-Tying using three-state models with ot ∈ R36 (position) on Setup 2.
6.7 Chapter Summary

The main focus of this chapter was to compare the different models introduced in the previous chapters. It is shown via gesture recognition performance on RMIS data that models of increasing complexity are well justified. More specifically, by compactly modeling the correlations between the components of the observations and by modeling the dynamics of the kinematics, we are able to achieve improved gesture recognition on the RMIS tasks.
To formally compare the several variants of HMMs and S-LDSs, we proposed five different modifications of the conventional S-LDS model. Each model was obtained by enforcing certain constraints on the parameters (e.g., tying parameters across classes or forcing some of the parameters to be zero). The connections of each of these models to conventional techniques like LDA, HLDA and PCA were established. Using these connections, we were able to provide efficient system identification techniques for initializing the parameters of the models. For instance, when the variability across the classes is captured by the dynamical parameters F s, we provide an initialization based on LDA. These initializations play a vital role in providing a good starting point for the model to evolve from. It is further observed that the final likelihood of the training data is significantly higher using the proposed initializations than with random starts.
By performing empirical experiments on RMIS data, we observe significant gains in accuracy by incorporating factor analysis into standard hidden Markov models. Thus, effectively capturing the correlations among the observation components yields models with a better fit to the data. Furthermore, the model and computational complexity of the FA-HMM is lower than that of other conventional dimensionality reduction techniques like LDA and H-LDA. Moreover, using the efficient inference techniques developed in chapters 3 and 4, learning the FA-HMM parameters can be done much faster than for H-LDA.
Switching linear dynamical systems further generalize the FA-HMM, as they enforce continuity in the hidden state vector. This allows us to capture the correlations between adjacent observations. It is observed that this results in improved gesture recognition performance compared to HMMs or FA-HMMs. S-LDS models capture more of the dynamics than the position information, and hence provide superior performance when the test trial is from a new user or when there is significant pose variability (e.g., in needle-passing).
Chapter 7
Learning S-LDS on Unlabeled Data
The heart of the learning process described in the last chapter relies upon a set of manually transcribed gesture label sequences. Models are learned in the supervised setting and evaluated by comparing the inferred gesture sequence ĝ1:N on an observation o1:N with its manual segmentation g1:N (assumed to be the ground truth). Time-marked manual segmentation requires substantial time and effort. In this chapter, we explore how EM-based learning techniques may be effectively used to minimize the amount of supervision. We explore the case when only the task grammar (cf. Figure 6.5) is provided but no manual gesture labels for the training data {oj1:Nj}, and the extreme case when even the task grammar is not provided and completely unsupervised learning must be performed. An important novel contribution of this chapter is the introduction of the successive state splitting algorithm for learning the topology and parameters of an HMM from unlabeled data. This work was previously published in [64, 65].
7.1 Semi-Supervised Learning

In section 6.5.3, we showed that introducing in-domain knowledge about the sequence of gestures gives substantial gains in performance while decoding. For learning, however, we had assumed that the time-marked ground-truth segmentation of the gestures was given. Since we know that any gesture sequence of suturing or needle-passing must follow the state diagram in Figure 6.5, can we use this information to train probabilistic models without much supervision?
To answer this question, we took 8 randomly selected labelled trials1 from each of Suturing and Needle-Passing. The remaining trials were assumed to be unlabeled. Using the 8 trials, we initialize the model parameters for each of the 5 setups described in chapter 6. After the initialization, we perform 10 iterations of EM training on the data. The inference graph G for each training trial (including the trials for which gesture labels are available) is given by Figure 6.5. In other words, we use the manual labels for initialization only. Once the models are learned using the grammar, we infer the discrete state sequence of each of the training trials using the learnt models under the graph G. Thus, for each observation sequence [oj1:Nj] of the training set, we obtain a state sequence [sj1:Nj].
One would expect each block in Figure 6.5 to be trained to match the observations of its manually marked gesture label. However, this need not be the case, as no gesture supervision was provided during learning (except the initialization). Note in Figure 6.5 that the gesture sequence 5-6-7-8 could repeat cyclically, and it is entirely possible during unsupervised training that the discrete state sequence is learned with an offset of one label, as 6-7-8-5. In fact, it is quite possible that the gesture named Transfer needle gets mapped to Position needle after a few EM iterations. To quantify how the states learned in the unsupervised setting correlate with the actual gestures, we first create a co-occurrence matrix C(s, g) between the learnt discrete states and the manual gesture labels, using the labelled gestures [gj1:Nj], as

    C(s, g) = Σ_j Σ_{t : sjt = s, gjt = g} 1.

For each learnt state s, we then compute the most lenient manual gesture assignment g(s) as the gesture that most frequently co-occurs with s. In other words,

    g(s) = argmax_g C(s, g).    (7.1)

1 In a real-life setting, a grammar such as the one in Figure 6.5 could be handwritten by a domain expert. But it is fair to assume that (s)he would label a few trials to inform or validate the handcrafted grammar.
To recap, the semi-supervised training procedure begins with the unlabeled gesture grammar of Figure 6.5 and an initial value of Θ based on a small number (8) of labeled trials. Θ is then re-estimated via EM on all the training trials without recourse to any manual labels. Finally, the best state sequence for each manually labeled trial is determined, and a gesture name is computed for each learnt discrete state. For any test sequence o1:N, we first infer the most likely state sequence ŝ1:N using (6.2). Then the decoded gesture sequence ĝ1:N is given by ĝt = g(ŝt). Thus, for every state, we simply assign the gesture that most frequently co-occurs with it in the training data. Accuracy on both the training and the test partitions is reported.
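The lenient mapping g(s) = argmax_g C(s, g) of equation (7.1) amounts to a co-occurrence count followed by an argmax. A minimal sketch (the function name and data layout are illustrative assumptions, not the thesis code):

```python
# Sketch of the lenient state-to-gesture mapping: count co-occurrences of
# decoded states and manual gesture labels across trials, then map each state
# to its most frequently co-occurring gesture.
from collections import Counter, defaultdict

def lenient_map(decoded, manual):
    """decoded, manual: parallel lists of per-trial label sequences."""
    C = defaultdict(Counter)                 # C[s][g] = co-occurrence count
    for s_seq, g_seq in zip(decoded, manual):
        for s, g in zip(s_seq, g_seq):
            C[s][g] += 1
    # argmax over g for each learnt state s
    return {s: counts.most_common(1)[0][0] for s, counts in C.items()}

decoded = [[0, 0, 1, 1, 2], [0, 1, 1, 2, 2]]   # learnt discrete states
manual  = [list("AABBC"), list("ABBCC")]       # manual gesture labels
g_of_s = lenient_map(decoded, manual)          # {0: 'A', 1: 'B', 2: 'C'}
```

At test time, the decoded state sequence is then translated frame by frame through this dictionary.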
Tables 7.1 and 7.2 show the recognition accuracies on suturing and needle passing (the decoding is done using the grammar) when each gesture in Figure 6.5 is replaced with a single state. When 3-state left-to-right models are used to replace the gestures, we get the results in Tables 7.3 and 7.4.
It is observed that:
1. S-LDS models outperform FA-HMM models in the semi-supervised setting, indicating that these models can capture the dynamics of the different gestures even with no supervision.
2. When 3-state models are used for each block in the grammar in Figure 6.5, we get significantly better accuracies for both the S-LDS and the FA-HMM models. However, with 3-state models, the S-LDS outperforms the FA-HMM by an even higher margin.
We note that there is substantial improvement when 3-state gestures are used.
                  FA-HMM                                       S-LDS
       Fs = 0, Hs = H   Fs = 0, Hs = H   µsu = 0, Hs = H   µsu = 0, Hs = H      Hs = H
       Train    Test    Train    Test     Train    Test     Train    Test    Train   Test
d=8    64.0%   62.3%   58.4%   57.1%     65.9%   64.0%     44.3%   43.1%    52.3%  50.1%
d=12   61.9%   57.5%   61.6%   60.7%     70.2%   66.2%     46.8%   42.7%    59.5%  56.6%
d=16   65.4%   62.5%   62.2%   61.4%     70.1%   66.7%     48.9%   45.1%    65.3%  62.5%

Table 7.1: Accuracies on Suturing using one-state models with ot ∈ R36 (position) on Setup 1 following semi-supervised training.
                  FA-HMM                                       S-LDS
       Fs = 0, Hs = H   Fs = 0, Hs = H   µsu = 0, Hs = H   µsu = 0, Hs = H      Hs = H
       Train    Test    Train    Test     Train    Test     Train    Test    Train   Test
d=8    60.7%   58.0%   57.8%   56.3%     56.2%   52.2%     44.6%   43.6%    48.1%  44.0%
d=12   61.1%   59.0%   60.1%   58.9%     62.3%   58.8%     45.5%   41.4%    48.3%  43.6%
d=16   67.0%   60.3%   60.8%   58.4%     70.5%   62.4%     49.2%   43.3%    50.8%  47.8%

Table 7.2: Accuracies on Needle-Passing using one-state models with ot ∈ R36 (position) on Setup 1 after mapping the state segmentation leniently on the training partition. The models were initialized using labels of a randomly chosen 20% subset of the entire training data.
                  FA-HMM                                       S-LDS
       Fs = 0, Hs = H   Fs = 0, Hs = H   µsu = 0, Hs = H   µsu = 0, Hs = H      Hs = H
       Train    Test    Train    Test     Train    Test     Train    Test    Train   Test
d=8    74.3%   64.4%   71.5%   69.5%     74.2%   71.1%     70.1%   68.4%    69.9%  68.0%
d=12   73.7%   69.1%   72.1%   70.8%     77.0%   73.3%     74.8%   70.5%    73.3%  71.0%
d=16   74.8%   69.4%   71.7%   69.9%     76.0%   73.4%     76.8%   72.0%    73.2%  71.9%

Table 7.3: Accuracies on Suturing using three-state models with ot ∈ R36 (position) on Setup 1 after mapping the state segmentation leniently on the training partition. The models were initialized using labels of a randomly chosen 20% subset of the entire training data.
                  FA-HMM                                       S-LDS
       Fs = 0, Hs = H   Fs = 0, Hs = H   µsu = 0, Hs = H   µsu = 0, Hs = H      Hs = H
       Train    Test    Train    Test     Train    Test     Train    Test    Train   Test
d=8    72.8%   67.9%   67.9%   66.0%     72.5%   69.3%     67.9%   67.8%    73.4%  71.1%
d=12   72.4%   68.4%   67.3%   63.7%     75.2%   68.1%     73.0%   65.4%    71.1%  73.1%
d=16   76.9%   72.1%   72.6%   67.2%     79.1%   73.3%     76.9%   70.0%    75.4%  75.1%

Table 7.4: Accuracies on Needle-Passing using three-state models with ot ∈ R36 (position) on Setup 1 after mapping the state segmentation leniently on the training partition. The models were initialized using labels of a randomly chosen 20% subset of the entire training data.
7.2 Using Grammar for Unsupervised Learning

We perform a similar set of experiments as in the last section, but this time with no labeled data, even for the initialization. Before performing EM training, we initialize the parameters of all the states identically.2 We identically initialize each of the 5 models described before and perform 10 iterations of EM. At the end of training, the 11 (or 33) discrete states of the resulting model each correspond to some (sub-)gesture, but it is not clear which one is which. Therefore, if we labeled a test trial using this model, it would not be possible to compare the automatic labels ŝ1:N with the manual gesture labels g1:N to quantify recognition accuracy.
Solely for evaluation, we decode and align the training trials, ŝ1:N and g1:N, and again compute the mapping g(s) for each discrete state as in (7.1).
Table 7.5 shows the result of replacing each gesture in Figure 6.5 with a single state and performing a lenient map on the training data for various d. The columns under Train and Test show the accuracies on the training and the test data respectively. Table 7.6 is the corresponding result on needle passing. If one uses 3-state left-to-right models instead, we get the results in Tables 7.7 and 7.8 respectively.
Note from Tables 7.7 and 7.8 that, while the recognition accuracy is lower than in the supervised case, the S-LDS outperforms the FA-HMM by an even higher margin than in the supervised setting. Moreover, the best performance of the S-LDS is only 8% worse than in the supervised setting, affirming that S-LDS models are able to learn the distinct dynamics in a task even with no supervision. The FA-HMM, on the other hand, is not able to generalize as much in terms of learning the gestures (since a significant part of the variability lies in the dynamics).

2 More precisely, we use the initialization used for Setup 1, but assuming all the observation samples [oj1:Nj] are labelled as a single class.
                  FA-HMM                                       S-LDS
       Fs = 0, Hs = H   Fs = 0, Hs = H   µsu = 0, Hs = H   µsu = 0, Hs = H      Hs = H
       Train    Test    Train    Test     Train    Test     Train    Test    Train   Test
d=8    48.1%   47.6%   50.0%   47.8%     47.1%   47.9%     43.7%   43.9%    44.9%  44.1%
d=12   48.1%   44.9%   45.5%   44.8%     45.9%   43.2%     41.2%   38.6%    41.0%  38.4%
d=16   46.6%   43.8%   47.6%   46.6%     46.9%   44.5%     44.3%   43.1%    44.3%  43.3%

Table 7.5: Accuracies on Suturing using one-state models with ot ∈ R36 (position) on Setup 1 following unsupervised estimation of model parameters.
                  FA-HMM                                       S-LDS
       Fs = 0, Hs = H   Fs = 0, Hs = H   µsu = 0, Hs = H   µsu = 0, Hs = H      Hs = H
       Train    Test    Train    Test     Train    Test     Train    Test    Train   Test
d=8    49.0%   46.9%   50.1%   49.2%     48.1%   47.1%     43.5%   42.3%    43.9%  42.3%
d=12   50.9%   50.5%   47.7%   45.8%     48.6%   48.7%     44.9%   42.3%    45.3%  43.0%
d=16   47.9%   47.7%   46.4%   44.2%     49.7%   47.6%     44.7%   41.9%    44.6%  41.2%

Table 7.6: Accuracies on Needle-Passing using one-state models with ot ∈ R36 (position) on Setup 1 following unsupervised estimation of model parameters.
                  FA-HMM                                       S-LDS
       Fs = 0, Hs = H   Fs = 0, Hs = H   µsu = 0, Hs = H   µsu = 0, Hs = H      Hs = H
       Train    Test    Train    Test     Train    Test     Train    Test    Train   Test
d=8    53.3%   53.2%   57.1%   55.7%     57.1%   55.8%     54.2%   54.5%    54.3%  54.6%
d=12   53.4%   49.0%   53.5%   52.9%     56.7%   55.1%     61.3%   60.6%    60.6%  60.5%
d=16   53.7%   51.9%   53.1%   52.2%     64.2%   62.7%     64.7%   64.1%    63.5%  64.2%

Table 7.7: Accuracies on Suturing using three-state models with ot ∈ R36 (position) on Setup 1 following unsupervised estimation of model parameters.
                  FA-HMM                                       S-LDS
       Fs = 0, Hs = H   Fs = 0, Hs = H   µsu = 0, Hs = H   µsu = 0, Hs = H      Hs = H
       Train    Test    Train    Test     Train    Test     Train    Test    Train   Test
d=8    53.7%   51.7%   54.4%   50.6%     55.6%   52.2%     55.8%   55.8%    55.8%  54.8%
d=12   57.8%   52.7%   51.6%   48.8%     57.0%   53.2%     59.1%   60.0%    59.7%  61.3%
d=16   54.9%   51.2%   53.7%   47.3%     57.8%   56.0%     56.5%   55.9%    55.4%  54.5%

Table 7.8: Accuracies on Needle-Passing using three-state models with ot ∈ R36 (position) on Setup 1 following unsupervised estimation of model parameters.
7.3 Discovering the Inventory of Gestures without Supervision

In the last few sections we described how, using a grammar of possible state transitions in an HMM or an S-LDS, we may estimate the parameters of the model using EM. However, we are eventually interested in building HMM, FA-HMM and S-LDS models on surgery data having no manual labels (or even a grammar). In other words, we are interested in discovering the graph G constraining the set of possible state transitions, along with the parameter set Θ describing the dynamical and emission distributions of the emitting states in G.
In this section, we improve upon a well-known allophone learning technique for HMMs called successive state splitting. Successive state splitting (abbreviated SSS) was first proposed by Takami and Sagayama in [66] and enhanced by Singer and Ostendorf in [67].
We investigate directly learning the gesture inventory of the RMIS surgery task without recourse to its actual gesture set, or even the grammar constraining the state transitions. Since the technique is easier to develop for HMMs, we first apply our algorithm to learn the HMM structure.
The original successive state splitting algorithm starts with a one-state HMM for all the data and, at each iteration, explores a contextual and a temporal split for each current state. For example, in the first iteration, the two splits explored are shown in Figure 7.1. Thus, if there are N states in the HMM, 2N splits are explored. For each split, the state parameters of the child states, e.g. (µsu, Σsu), are perturbed and a few EM iterations are performed to learn the new model. Finally, the split (among the 2N possible choices of state splitting) that gives the highest likelihood increase is chosen, resulting in an (N + 1)-state HMM for the next iteration.

Figure 7.1: (a) Contextual split; (b) Temporal split. This figure shows the two splits that are explored in the first iteration. In general, the contextual split simply splits a single state into two states in parallel, while the temporal split places the two states in series.
7.3.1 Maximum Likelihood SSS Algorithm

The improvement of the original SSS algorithm of [66] by [67], renamed ML-SSS, proceeds roughly as follows:
1. Model all the data using a 1-state HMM with a diagonal-covariance Gaussian (N = 1).
2. For each HMM state si, i = 1, . . . , N, compute the improvement in log-likelihood (LL) of the data from either a contextual or a temporal split. Select the state si and split that yield the maximum gain in LL.
3. If the gain is above a threshold and N is less than desired, retain the split, set N = N + 1, re-estimate all parameters of the new HMM, and go to Step 2. Otherwise, exit.
Note that the key computational steps are the for-loop of Step 2 and the re-estimation of Step 3.

Figure 7.2: Four-way split of the state s in the first iteration. This could also be thought of as the cross product of the splits described in Figure 7.1, as it explores the contextual and the temporal splits simultaneously.
7.3.2 An Improved/Fast ML-SSS Algorithm

We made the following modifications to the ML-SSS algorithm; this work is published in [64] and [68]. The proposed algorithm is superior to the original ML-SSS in terms of speed and also explores a larger search space, which yields models with potentially higher likelihood for the observed data.
1. Model all the data using a 1-state HMM with a full-covariance Gaussian density. Set N = 1.
2. Simultaneously replace each state s of the HMM with the 4-state topology shown in Figure 7.2, yielding a 4N-state HMM. If the state s had parameters (µs, Σs), then the means of its 4-state replacement are µs1 = µs − δ = µs4 and µs2 = µs + δ = µs3, with δ = ελ∗v∗, where λ∗ and v∗ are the principal eigenvalue and eigenvector of Σs and 0 < ε < 1 is typically 0.2.
3. Re-estimate all parameters of this (overgrown) HMM. Gather the Gaussian sufficient statistics for each of the 4N states from the last pass of re-estimation: the state occupancy γ̂si, the sample mean µsi, and the sample covariance Σsi.
4. Each quartet of states (see Figure 7.2) that results from the same original state s can be merged in different ways to produce 3, 2 or only 1 state. There are a total of 15 such down-merges, of which 6 result in 3 states and 7 result in 2 states. We select the best 3-state merge among the 6 ways and the best 2-state merge among the 7 ways. We can also merge all 4 states into one, or retain all of them.
5. Reduce the number of states from 4N to N + ∆ by down-merging quartets that cause the least loss in total log-likelihood. This entails solving a constrained knapsack problem.
6. Set N = N + ∆. If N is less than the desired HMM size, go to Step 2. Otherwise, exit.
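The mean perturbation of Step 2 can be sketched numerically. The helper below is an illustrative reconstruction (not the thesis code) of δ = ελ∗v∗ using the principal eigenpair of the state covariance:

```python
# Illustrative sketch of the Step-2 perturbation: delta = eps * lambda* * v*,
# where (lambda*, v*) is the principal eigenpair of the state covariance Sigma_s.
import numpy as np

def four_way_means(mu, sigma, eps=0.2):
    """Means (mu1, mu2, mu3, mu4) of the 4-state replacement of state (mu, sigma)."""
    lam, vecs = np.linalg.eigh(sigma)        # eigenvalues in ascending order
    delta = eps * lam[-1] * vecs[:, -1]      # principal eigenvalue and eigenvector
    # mu1 = mu4 = mu - delta and mu2 = mu3 = mu + delta, as in Step 2
    return mu - delta, mu + delta, mu + delta, mu - delta
```

Perturbing along the direction of largest variance gives EM the best chance of separating two genuinely different sub-populations of frames assigned to the state.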
Observe that the 4-state split of Figure 7.2 permits a slight look-ahead in our scheme, in the sense that the goodness of contextual or temporal splits of two different states, which ML-SSS would compare only in consecutive iterations, can here be compared in the same iteration, as can two consecutive splits of a single state.
Observe also that the split/merge statistics for a state are gathered in our modified SSS assuming that the other states have already been split, which facilitates consideration of concurrent state splitting. If s1, . . . , sm are merged into s̃, the loss of log-likelihood3 in Step 4 is

    (d/2) Σ_{i=1}^{m} γ̂si log|Σs̃| − (d/2) Σ_{i=1}^{m} γ̂si log|Σsi|,    (7.2)

where

    Σs̃ = [ Σ_{i=1}^{m} γ̂si (Σsi + µsi µsi′) ] / [ Σ_{i=1}^{m} γ̂si ] − µs̃ µs̃′.

It can be easily verified (by concavity of the log-determinant over positive definite matrices) that (7.2) is always non-negative.
3 Existence of a closed-form expression is another advantage of using single Gaussian densities.
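The merge loss (7.2) is easy to check numerically. The sketch below (an assumed helper, not thesis code) pools the Gaussian sufficient statistics of the states being merged and evaluates the loss, which is zero when the merged states are identical and positive otherwise:

```python
# Sketch: evaluate the log-likelihood loss (7.2) of merging Gaussian states
# s_1..s_m with occupancies gamma_i, means mu_i and covariances Sigma_i.
import numpy as np

def merge_loss(gammas, mus, sigmas):
    g = np.asarray(gammas, dtype=float)
    d = len(mus[0])
    mu_t = sum(gi * mi for gi, mi in zip(g, mus)) / g.sum()   # merged mean
    sig_t = sum(gi * (Si + np.outer(mi, mi))                  # merged covariance
                for gi, mi, Si in zip(g, mus, sigmas)) / g.sum()
    sig_t = sig_t - np.outer(mu_t, mu_t)
    _, logdet_t = np.linalg.slogdet(sig_t)
    loss = 0.0
    for gi, Si in zip(g, sigmas):
        _, logdet_i = np.linalg.slogdet(Si)
        loss += 0.5 * d * gi * (logdet_t - logdet_i)          # the (d/2) terms of (7.2)
    return loss
```

Merging two identical states costs nothing, while merging well-separated states inflates the pooled covariance and incurs a positive loss, matching the non-negativity argument above.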
Finally, in selecting the best ∆ states to add to the HMM, we consider all possible ways of splitting the existing N states; e.g., if N = 6 and ∆ = 3, we consider the best merge-down from 4N = 24 to N + ∆ = 9 states. We want to pick the combination that has the least loss in likelihood. It could be a 4-way split of a single state, a 3-way split of one state and a 2-way split of another, or a 2-way split of three distinct states. However, each original state si is present in the solution, at least with no split, and is not merged with another original state sj. This restriction leads to an O(N³) dynamic programming algorithm.
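This selection is a small constrained-knapsack dynamic program: each original state keeps k ∈ {1, 2, 3, 4} of its children at a known loss (zero for keeping all four), and we minimize the total loss subject to keeping exactly N + ∆ states. The helper below is an assumed illustration with hypothetical loss values, not the thesis implementation:

```python
# Sketch of the down-merge selection: minimize the total log-likelihood loss
# subject to keeping exactly `target` states, choosing one option k in
# {1, 2, 3, 4} per original state.

def select_merges(loss, target):
    """loss: per-state dicts {k: loss of keeping k children}; target: states kept."""
    best = {0: (0.0, [])}                    # states kept so far -> (loss, choices)
    for opts in loss:                        # DP over original states
        nxt = {}
        for kept, (cost, picks) in best.items():
            for k, c in opts.items():
                cand = (cost + c, picks + [k])
                if kept + k not in nxt or cand[0] < nxt[kept + k][0]:
                    nxt[kept + k] = cand
        best = nxt
    return best[target]

# Hypothetical losses for N = 2 original states, target N + Delta = 3.
loss = [{1: 5.0, 2: 2.0, 3: 0.5, 4: 0.0},
        {1: 1.0, 2: 0.2, 3: 0.1, 4: 0.0}]
cost, picks = select_merges(loss, 3)   # best: keep 2 from state 1, 1 from state 2
```

Because every original state contributes exactly one option, no cross-state merges arise, mirroring the restriction stated above.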
In summary, the modified SSS algorithm can leap-frog by ∆ states at a time, e.g. ∆ = αN, compared to the standard ML-SSS algorithm, and has the benefit of some look-ahead to avoid greediness.
7.3.3 Evaluation of the ML-SSS Algorithm on RMIS Data

The modified ML-SSS is developed for discovering the structure of the hidden Markov model. Although it is theoretically possible to perform the split-merge procedures for FA-HMM and S-LDS models, for our experiments we first induce the HMM topology G and infer the discrete state sequence of the training trials with the standard Viterbi algorithm, using the learnt HMM parameters. We then initialize the parameters of the FA-HMM or S-LDS states in G using the obtained best discrete state sequence and perform 10 iterations of EM over the initialized parameters on the graph G, this time with no labels. The results of performing ML-SSS to learn the parameters of the FA-HMM and the S-LDS and performing gesture recognition on the suturing and the needle-passing data are shown in Tables 7.9 and 7.10.
It is observed that using successive state splitting we get almost identical performance to using the state grammar for learning the models. This implies that the structure induction technique based on successive state splitting is able to provide a good estimate of the graphical structure by iteratively
                  FA-HMM                                       S-LDS
       Fs = 0, Hs = H   Fs = 0, Hs = H   µsu = 0, Hs = H   µsu = 0, Hs = H      Hs = H
       Train    Test    Train    Test     Train    Test     Train    Test    Train   Test
d=8    56.1%   52.0%   58.0%   54.3%     58.3%   56.9%     56.8%   56.1%    57.0%  56.9%
d=12   57.0%   52.8%   58.2%   54.2%     56.8%   54.8%     61.3%   60.1%    61.5%  61.4%
d=16   57.3%   54.1%   58.1%   54.8%     63.1%   61.8%     63.9%   62.3%    63.8%  63.3%

Table 7.9: Result of ML-SSS on Suturing using various graphical models.
                  FA-HMM                                       S-LDS
       Fs = 0, Hs = H   Fs = 0, Hs = H   µsu = 0, Hs = H   µsu = 0, Hs = H      Hs = H
       Train    Test    Train    Test     Train    Test     Train    Test    Train   Test
d=8    52.1%   51.0%   52.1%   49.4%     54.9%   53.9%     54.0%   53.5%    54.0%  53.1%
d=12   55.9%   52.1%   49.1%   47.4%     56.4%   54.1%     57.0%   56.9%    57.4%  56.8%
d=16   53.0%   50.6%   51.0%   46.0%     56.2%   55.9%     55.1%   54.7%    54.1%  53.4%

Table 7.10: Result of ML-SSS on Needle-Passing using various graphical models.
learning the structure and the parameters. Furthermore, since the SSS algorithm was originally implemented for the HMM, we see a smaller degradation in performance for the HMM (compare Tables 7.7 and 7.9) than for the FA-HMM or the S-LDS. We believe, however, that the SSS algorithm can be extended to S-LDS models as well, where the likelihood loss in equation (7.2) may be replaced by the distances obtained from the Cauchy-Binet kernels [69].
7.4 Chapter Summary

In this chapter, we showed that learning probabilistic models from unlabeled data is possible. We started with semi-supervised learning, where a small amount of labelled training data was used to initialize the parameters of the models. It was observed that the result of semi-supervised learning is not significantly worse than that of the completely supervised experiments investigated in chapter 6. Another observation is that S-LDS models outperform FA-HMM models significantly, and by a higher margin in the semi-supervised case than in the supervised setting.
Next we explored learning models using only a grammar specifying the set of allowable state transitions. When each gesture is modeled as a single state, we get significantly worse results than in the supervised setting. However, we improve significantly when 3-state models are used for each gesture. S-LDS models continue to outperform FA-HMM models in this almost unsupervised setting as well.
Finally, we explored the possibility of learning models when no supervision of any kind is provided. Towards this goal, we modified the successive state splitting algorithm for allophone learning in speech and provided efficient algorithms for discovering the HMM structure and parameters. The FA-HMM and S-LDS models are initialized using the learnt HMM models, and the parameters are iteratively refined using expectation maximization. Thus, we were effectively able to use a simple HMM topology learning algorithm to learn the structure and parameters of an S-LDS model.
Chapter 8
User Adaptive Transforms via EM
User-dependent variability is common in most of the dextrous actions we analyzed in the preceding chapters. These variations can come either from the user's side, as a style/pose variation, or from changing camera positions. The models we described in the last few chapters, as such, do not account for such pose variations. With a slight modification of the parameter set Θ, it is possible to allow the models to automatically learn these pose variabilities. This modification is the focus of this chapter. The concepts used in this chapter are motivated by speaker adaptation techniques commonly used in speech recognition, such as maximum likelihood linear regression (cf. [70–72]) or maximum a-posteriori adaptation [73].
We evaluate our proposed framework in mixed-user settings and show improvements over non-adapted models. More formally, we propose an ML adaptation of the observation parameters Hs,l = {Hs,l, µz s,l, Σz s,l} in the S-LDS for each pose class l, where l may refer to an individual user (and will thus be the same for all trials of that user) or even an individual trial (in which case each trial will have a unique Hs). The experiments we performed in the last chapter assumed only one pose class.
8.1 EM Estimation of {F s, Hs,l} while Incorporating Pose Information
As usual, we are given a sequence of observations $\left[o^j_{1:N_j}\right]$, each associated with a grammar $G_j$ which generated them. In addition, in this setup, we also have the pose class $l_j$ associated with the $j$-th observation sequence. One such example of a pose class is the identity of the user themselves, if we assume each user has a different way of positioning the instrument during their RMIS task. Another example could be the identity of the instrument on which the trial was recorded. We will assume in our models that the underlying dynamical parameters $F^s$ are independent of the pose $l \in L$, and that the only dependence on the pose comes through the observation parameters $\mathcal{H}^{s,l} = \{H^{s,l}, \mu_z^{s,l}, \sigma_{1:D}^{s,l}\}$. The maximum likelihood estimation problem is to estimate the parameters $\mathcal{H}^{s,l}$ and $F^s$, where $l \in \{1, \ldots, L\}$ indexes the set of all poses and $s \in \{1, \ldots, C\}$ the set of gesture labels.
The estimation of the pose-independent parameters $F^s$ can be done using the statistics for class $s$. For the purposes of our experiments, we will assume that $H^{s,l} = H^l$, i.e., a gesture-independent observation model. The estimation of $H^l$ can be done by collecting the global statistics corresponding to the trials belonging to pose $l$, i.e., by aggregating the statistics from all discrete states for trials labeled with a particular pose. Using a superscript $l$ to denote the pose-dependent statistics, we may obtain these estimates as
$$\hat{h}^l_p = \left( V^l_{ox}(p) - \frac{1}{N^l}\, w^l_o(p)\, {f^l}^{T} \right) \left( S^l - \frac{1}{N^l}\, f^l {f^l}^{T} \right)^{-1},$$
$$\hat{\mu}^l_z(p) = \frac{1}{N^l}\left( w^l_o(p) - \hat{h}^l_p f^l \right),$$
$$\hat{\sigma}^{l\,2}_p = \frac{v^l_o(p) + \hat{h}^l_p S^l \hat{h}^{l\,T}_p - 2\, \hat{h}^l_p V^l_{ox}(p)^T}{N^l} - \hat{\mu}^l_z(p)^2.$$
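As a concrete sketch, the closed-form updates above can be computed directly from aggregated statistics. The code below is an illustrative implementation, not the dissertation's code; the argument names (`V_ox`, `w_o`, `f`, `S`, `v_o`) mirror the symbols in the text under the assumption that each is a plain sum over the frames of the given pose class.

```python
import numpy as np

def adapt_loading(V_ox, w_o, f, S, v_o, N):
    """ML re-estimation of pose-dependent observation parameters from
    aggregated sufficient statistics (notation follows the text):
      V_ox : (D, d)  sum_t o_t x_t^T over frames of this pose
      w_o  : (D,)    sum_t o_t
      f    : (d,)    sum_t x_t (posterior state means)
      S    : (d, d)  sum_t x_t x_t^T
      v_o  : (D,)    sum_t o_t^2, per observation component
      N    : number of frames for this pose
    Returns the adapted loading matrix H, mean mu_z and variances sigma2."""
    A = V_ox - np.outer(w_o, f) / N            # centered cross-statistic
    B = S - np.outer(f, f) / N                 # centered state scatter
    H = A @ np.linalg.inv(B)                   # row p of H is h_p-hat
    mu_z = (w_o - H @ f) / N                   # adapted observation mean
    # per-component residual variance: (v_o + h_p S h_p^T - 2 h_p V_ox(p)^T)/N - mu^2
    sigma2 = (v_o + np.einsum('pd,de,pe->p', H, S, H)
              - 2 * np.einsum('pd,pd->p', H, V_ox)) / N - mu_z**2
    return H, mu_z, sigma2
```

On noiseless synthetic data generated as $o_t = H x_t + \mu$, these estimates recover $H$ and $\mu$ exactly and drive the residual variances to zero, which is a useful sanity check when wiring up the statistics accumulation.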
For experiments, we use Setup I, described in Chapter 6, with three discrete states to model each gesture. The observations $o_t \in \mathbb{R}^{36}$ contain the 36 position coordinates from the RMIS system. We take the user identities to be the pose classes. In Setup I (described in the last chapter), we leave one trial of each user out, and hence all users in the test data are seen in the training data. We learn the user-adaptive loading parameters $H^l$ from the statistics collected for each user. The baseline for our setup is obtained by tying all the loading matrices together as $H^{s,l} = H$. Tables 8.1 and 8.2 show the result of training user-specific observation models $H^l$. From Table 8.1, we see that user adaptation does not yield much improvement for suturing. This possibly means that the suturing trials do not have much pose variability, compared to the needle-passing trials, where we observe about a 2-3% improvement in the best performance of the S-LDS model.
                 FA-HMM                        S-LDS
          H^{s,l}=H   H^{s,l}=H^l      H^{s,l}=H   H^{s,l}=H^l
d = 8      75.51%       76.01%           73.17%       74.10%
d = 12     78.11%       78.30%           77.33%       77.10%
d = 16     78.27%       78.50%           78.00%       78.01%

Table 8.1: Effect of user-adaptive learning of H in the mixed (seen) user setting on Suturing
                 FA-HMM                        S-LDS
          H^{s,l}=H   H^{s,l}=H^l      H^{s,l}=H   H^{s,l}=H^l
d = 8       60.2%        69.3%            71.0%        74.1%
d = 12      69.1%        69.2%            72.7%        72.3%
d = 16      67.9%        68.8%            72.3%        70.1%

Table 8.2: Effect of user-adaptive learning of H in the mixed (seen) user setting on Needle Passing
8.2 Gesture Recognition with an unknown Pose Class

Often, it becomes necessary to infer the state sequence of a trial with an unknown pose class. In RMIS, for instance, this could be a new surgeon for whom no labeled data is available. Formally, given a set of test sequences $o^j_{1:N_j}$, all having an unknown pose label $l \notin \{1, \ldots, L\}$, we need to infer the distribution of $\{s_{1:N}, x_{1:N}\}$. Since $l \notin \{1, \ldots, L\}$, the observation parameters $H$ are unknown. The classical approach, as usual, is to perform a maximum likelihood estimation of $H$ such that
$$\hat{H}_{\mathrm{ML}} = \operatorname*{argmax}_{H} \sum_{j=1}^{M} \log \sum_{s_{1:N_j} \in G_j} \int_{x_{1:N_j}} p_{\Theta}\!\left(s_{1:N_j}, x^j_{1:N_j}, o^j_{1:N_j}\right) dx_{1:N_j}. \qquad (8.1)$$
Directly optimizing (8.1) is non-trivial, hence we resort to EM. The EM estimates may be obtained by first inferring the distribution of the hidden states from the observation sequences using the current parameters, and then updating $H$ using the collected statistics. This, however, requires a good initialization of $H$. For our experiments, we estimate a global $H$ from the accumulated statistics of all training trials (irrespective of pose) and use this as the starting point for the EM iterations.
For our experiments, we use Setup II, in which trials from 7 (out of the 8) users are used to estimate the parameters $\Theta = \{F^s, H^l\}$. Along with the user-dependent loading transforms $H^l$, we also estimate a global $H$. Each test trial is from the eighth user, not seen in the training data. We use 5 iterations of EM as an attempt to solve (8.1), and use the estimated $\hat{H}$ to infer the marginal distribution of the states $\hat{\gamma}^s_t$ on the test trial. Then the decoded state at time $t$ is, as usual, inferred to be
$$\hat{s}_t = \operatorname*{argmax}_s \hat{\gamma}^s_t.$$
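The adapt-then-decode loop described above can be sketched as follows. This is an illustrative skeleton only: `e_step` and `m_step_H` are hypothetical callbacks standing in for the S-LDS posterior inference and the ML re-estimation of $H$, not the dissertation's actual routines.

```python
import numpy as np

def adapt_and_decode(obs, H_init, e_step, m_step_H, n_iter=5):
    """Unsupervised EM adaptation of the observation loading for a trial
    with an unknown pose class, followed by decoding.
      obs      : the test observation sequence
      H_init   : global H estimated from all training trials (EM start)
      e_step   : (obs, H) -> (gamma, stats), the hidden-state posteriors
                 gamma[t, s] and accumulated sufficient statistics
      m_step_H : stats -> updated H (M-step over H only)"""
    H = H_init
    for _ in range(n_iter):
        gamma, stats = e_step(obs, H)   # E-step with current parameters
        H = m_step_H(stats)             # re-estimate only the loading H
    gamma, _ = e_step(obs, H)
    # decoded state: s_t-hat = argmax_s gamma_t^s
    return gamma.argmax(axis=1), H
```

Note that only $H$ is updated; the dynamical parameters $F^s$ stay fixed at their trained values, mirroring the assumption that pose affects the observation model alone.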
An eight-fold cross-validation over all the users is performed and the average is reported. Tables 8.3 and 8.4 show the result of learning user-adaptive transforms on unseen users.
                 FA-HMM                        S-LDS
          H^{s,l}=H   H^{s,l}=H^l      H^{s,l}=H   H^{s,l}=H^l
d = 8       53.5%        53.5%            66.0%        66.2%
d = 12      50.7%        57.3%            67.0%        68.1%
d = 16      57.2%        57.6%            66.5%        67.8%

Table 8.3: Effect of user-adaptive learning of H in the unseen user setting on Suturing
                 FA-HMM                        S-LDS
          H^{s,l}=H   H^{s,l}=H^l      H^{s,l}=H   H^{s,l}=H^l
d = 8       39.5%        50.1%            44.4%        50.3%
d = 12      42.6%        47.8%            43.7%        52.0%
d = 16      39.5%        42.2%            52.2%        59.0%

Table 8.4: Effect of user-adaptive learning of H in the unseen user setting on Needle Passing
As before, we do not see much improvement in gesture recognition for suturing. However, we observed substantial gains in performance on the needle-passing trials. When $d = 16$, the S-LDS models provide a gesture recognition accuracy of 52% with no adaptation; when the observation parameters are adapted using EM, the recognition performance goes up to 59%.
8.3 Chapter Summary

In this chapter, we made an attempt to normalize user and pose variabilities by adapting certain parameters of the S-LDS. First, we explored learning user-specific parameters when the user labels are known in the training data. We observed small improvements in recognition performance by doing so. Next, we performed experiments on Setup II, where the training and test partitions are user-disjoint. We saw in the last chapter that there is a significant drop in performance when learnt models are used to infer the gestures of a new user. We observed that a significant fraction of that loss is recovered by unsupervised adaptation of certain model parameters.
Chapter 9
Summary and Discussion
The goal of this dissertation was to build models that can efficiently segment and analyze continuous-valued signals encountered in robot-assisted minimally invasive surgery and in other applications such as video sequences. The sought models are expected to capture both the variability within each gesture (intra-gesture) and the variation between different people performing the same task.
We started off by introducing the standard hidden Markov model, where the observations $o_t \in \mathbb{R}^D$ are modeled as emissions from a discrete state. We reviewed the well-known Baum-Welch and Viterbi algorithms and noted that several inference and learning problems in the HMM can be solved exactly. In Chapter 3, we reviewed the factor analyzed HMM, obtained by introducing a continuous hidden state $x_t \in \mathbb{R}^d$. The introduction of a continuous hidden state can be thought of as modeling the observations in a smaller-dimensional space, where $d < D$. One important contribution of that chapter was to exploit the structural properties of the matrices involved to provide efficient inference and learning procedures when $d \ll D$. The original techniques require $O(D^3)$ time and $O(D^2)$ space to perform exact inference in the FA-HMM, while our technique performs the same inference in $O(Dd^2)$ time and $O(Dd)$ space.
Next, we reviewed the linear dynamical system, where the inference and learning problems were set up. The Kalman prediction and RTS smoothing procedures necessary to perform inference in an LDS were derived for completeness. The efficient inference technique developed for the FA-HMM was incorporated in the LDS setting as well. Directly learning the parameters of the LDS via EM when the observation dimension is high (e.g., images) was the main contribution of that chapter. We performed various experiments, such as generation, prediction and de-noising of video sequences, using our inference and learning techniques. Compared to a previously known technique [42] for system identification, the parameters obtained from the EM estimates have better predictive power for the image pixel values.
The switching linear dynamical system was introduced in Chapter 5. It is well known that exact inference in the S-LDS is intractable. One important contribution of that chapter is to extend the expectation correction approach [26] to obtain the ML estimates of the S-LDS parameters. Upon justifying the approximations made in this technique, the EM learning procedure we develop can be shown to be more accurate and principled than previously known methods such as the approximate Viterbi algorithm of Pavlovic et al. [56]. Another important contribution of that chapter was to derive the inference and learning procedures for the S-LDS when null states are included. Incorporating null states is a very useful device in practice.
We then studied the performance of the various models we developed on real data. Since the EM algorithms can get stuck in local maxima, we provide efficient techniques to initialize the model parameters in a reasonable manner, based on system identification methods. The system-identification-based initializations provide a good starting point for the EM algorithm to evolve from. The models are evaluated by performing gesture recognition experiments on bench-top surgery tasks. Our first observation was that introducing factor analysis into the standard HMM setup improves performance. We even get significant improvements in performance when the dimension of the hidden continuous state $x_t$ is 2. This is a clear indication that factor analysis is able to effectively capture the correlations between the observation components that are not captured via the dependence on the discrete states, thereby producing better and more robust models for the data with fewer parameters.
Next, we observed that the S-LDS models perform consistently better than the FA-HMM in all the setups. The best performance of the S-LDS is about 9% better than the best of the FA-HMM for the Needle-Passing task. Moreover, training and decoding with the S-LDS was sped up significantly using the algorithms we developed for the FA-HMM and the introduction of null states. Although still slower than standard HMM algorithms, our S-LDS learning algorithms run in real time, whereas previously published S-LDS inference and learning algorithms run much slower than real time. The efficient inference techniques developed even enable one to learn S-LDS models from real video sequences, where each image can have several thousand pixel values.
Next, we investigated whether S-LDS models can be learnt from unlabelled data. We measure the goodness of the learned models by appropriately mapping the learnt S-LDS discrete states to the actual manual labels. It is observed that S-LDS models provide better unsupervised segmentations than the FA-HMM. Using a skeletal graph of 33 states, we perform several iterations of EM after initializing the parameters of each of the 33 states identically. While the unsupervised segmentation using the FA-HMM on the Suturing task correlates with the corresponding manual labels at 54%, the S-LDS models go as high as 64%. Thus, the S-LDS models capture the inter-gesture variabilities effectively even when no labels are provided to bootstrap the learning process.
Another important contribution of this dissertation is to provide an efficient algorithm for learning
the structure of the HMM using a modification of the maximum likelihood successive state splitting
algorithm.
Finally, we explored user adaptation by modeling certain parameters of the S-LDS as user-specific. By doing so, and assuming that the identity of the user is known while decoding, we demonstrated some improvements in recognition accuracy. Next, we explored unsupervised adaptation while decoding a trial from a new user. It is shown that the gesture recognition accuracies improve substantially by doing so.
9.1 Future Directions

9.1.1 Structured Prediction of LDS Models

Sparse LDS Models
Recall that the (S-)LDS parameters are split into the dynamical part, denoted by $F^s$, and the observation part $H^s$. In Chapters 4 and 5, we described algorithms to iteratively refine these parameters via EM. If the underlying hidden state $x_t$ is assumed to be $d$-dimensional, the number of parameters used for each state $s$ in a hybrid system is $O(Dd + d^2)$. When less data is available, but the true order of the system is $d$, one may reduce the number of parameters by constraining the number of non-zero entries in $F^s$ or $H^s$ to be less than $d^2$ or $Dd$, respectively. A straightforward way to do this is to impose an $L_0$ regularization on $F^s$ in addition to the EM objective function. Since solving a least squares problem with an $L_0$ constraint is known to be NP-hard, one may resort to greedy techniques such as subspace pursuit, greedy pursuit or iterative greedy pursuit [74–76]. An alternative approach is to impose an $L_1$ regularization, which in turn can be solved exactly using convex optimization.
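As an illustration of the $L_1$ route, the following sketch estimates a sparse dynamics matrix from a smoothed state trajectory using ISTA, a standard proximal-gradient solver for the lasso. This is a generic example of the idea, not an algorithm proposed in the dissertation; the function and variable names are hypothetical.

```python
import numpy as np

def sparse_dynamics(X, lam, n_iter=500):
    """Estimate a sparse dynamics matrix F from a state sequence X (T x d)
    by solving the L1-regularized least squares problem
        min_F  0.5 * sum_t ||x_t - F x_{t-1}||^2  +  lam * ||F||_1
    via ISTA (iterative soft-thresholding)."""
    A, B = X[:-1], X[1:]                       # predictors x_{t-1}, targets x_t
    F = np.zeros((X.shape[1], X.shape[1]))
    step = 1.0 / np.linalg.norm(A.T @ A, 2)    # 1 / Lipschitz constant of gradient
    for _ in range(n_iter):
        G = (F @ A.T - B.T) @ A                # gradient of the quadratic term
        F = F - step * G                       # gradient step
        F = np.sign(F) * np.maximum(np.abs(F) - step * lam, 0.0)  # soft-threshold
    return F
```

On data simulated from a diagonal (hence sparse) dynamics matrix, the estimate recovers the diagonal entries and suppresses the off-diagonal ones, which is the qualitative behavior the $L_1$ penalty is meant to buy when data are scarce.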
Higher Order LDS Models

An order-$p$ LDS model assumes the following generative form for the observations $o_t$:
$$x_t = \sum_{j=1}^{p} F_j x_{t-j} + u_t, \qquad o_t = H x_t + z_t.$$
Although it looks like a generalization of the standard LDS model, the above is actually a special case of an LDS model in which the hidden state vector has dimension $dp$. We can see this by defining a new state vector $y_t$ as
$$y_t = \begin{bmatrix} x_t \\ \vdots \\ x_{t-p+1} \end{bmatrix}$$
and redefining the state equations as
$$y_t = \tilde{F} y_{t-1} + \tilde{u}_t, \qquad o_t = H x_t + z_t,$$
where
$$\tilde{F} = \begin{bmatrix} F_1 & \cdots & F_p \\ I_{d(p-1) \times d(p-1)} & & 0_{d(p-1) \times d} \end{bmatrix} \qquad (9.1)$$
and
$$\tilde{u}_t = \begin{bmatrix} u_t \\ 0 \end{bmatrix}. \qquad (9.2)$$
Thus, inference and learning can be done by applying Kalman smoothing techniques to this extended state vector and estimating $\tilde{F}$ after constraining it to be of the form given in (9.1).
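The reduction above amounts to building the block companion matrix $\tilde{F}$ of (9.1). A minimal sketch (the helper name `companion` is illustrative):

```python
import numpy as np

def companion(F_list):
    """Stack the order-p dynamics matrices {F_j} into the dp x dp
    companion matrix F-tilde of equation (9.1): the top block row is
    [F_1 ... F_p]; below it, a shifted identity copies each x block
    down one slot, and the last block column of those rows is zero."""
    d, p = F_list[0].shape[0], len(F_list)
    F_tilde = np.zeros((d * p, d * p))
    F_tilde[:d, :] = np.hstack(F_list)               # [F_1 ... F_p]
    F_tilde[d:, :d * (p - 1)] = np.eye(d * (p - 1))  # shift identity
    return F_tilde
```

Multiplying $\tilde{F}$ by the stacked state $y_{t-1} = [x_{t-1}; \ldots; x_{t-p}]$ reproduces the order-$p$ recursion in its first $d$ entries and shifts the remaining blocks down, as required.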
Bibliography
[1] M. Hashizume, “Robot-assisted surgery.” Nippon Geka Gakkai Zasshi, vol. 106, no. 11, p.
689, 2005.
[2] C. Reiley, H. Lin, B. Varadarajan, S. Khudanpur, D. D. Yuh, and G. D. Hager, “Automatic
recognition of surgical motions using statistical modeling for capturing variability,” MMVR,
2008.
[3] R. Kumar, A. Jog, A. Malpani, B. Vagvolgyi, D. Yuh, H. Nguyen, G. Hager, and C. Chen,
“Assessing system operation skills in robotic surgery trainees,” The International Journal of
Medical Robotics and Computer Assisted Surgery, p. accepted, 2011.
[4] M. Muller, T. Roder, M. Clausen, B. Eberhardt, B. Kruger, and A. Weber, “Documentation
mocap database HDM05,” Computer Graphics Technical Report CG-2007-2, 2007.
[5] M. Szummer and R. W. Picard, "Temporal texture modeling," in Image Processing, 1996. Proceedings., International Conference on, vol. 3. IEEE, 1996, pp. 823–826.
[6] G. Doretto, A. Chiuso, Y. N. Wu, and S. Soatto, “Dynamic textures,” International Journal of
Computer Vision, vol. 51, no. 2, pp. 91–109, 2003.
[7] F. Jelinek, Statistical methods for speech recognition.
Cambridge, MA, USA: MIT Press,
1997.
[8] J. Boreczky and L. Wilcox, “A hidden Markov model framework for video segmentation using
audio and image features,” in Acoustics, Speech and Signal Processing, 1998. Proceedings of
the 1998 IEEE International Conference on, vol. 6.
IEEE, 1998, pp. 3741–3744.
[9] T. Starner, “Visual recognition of american sign language using hidden markov models,” Ph.D.
dissertation, Citeseer, 1995.
[10] A. Wilson and A. Bobick, “Parametric hidden Markov models for gesture recognition,” Pattern
Analysis and Machine Intelligence, IEEE Transactions on, vol. 21, no. 9, pp. 884–900, 1999.
[11] H. Attias, “Independent factor analysis,” Neural computation, vol. 11, no. 4, pp. 803–851,
1999.
[12] S. Ioffe, “Probabilistic linear discriminant analysis,” Computer Vision–ECCV 2006, pp. 531–
542, 2006.
[13] P. Kenny, "Joint factor analysis of speaker and session variability: Theory and algorithms," Online: http://www.crim.ca/perso/patrick.kenny, 2006.
[14] M. Tipping and C. Bishop, “Probabilistic principal component analysis,” Journal of the Royal
Statistical Society. Series B (Statistical Methodology), vol. 61, no. 3, pp. 611–622, 1999.
[15] K. Yao, K. Paliwal, and T. Lee, “Generative factor analyzed hmm for automatic speech recognition,” Speech communication, vol. 45, no. 4, pp. 435–454, 2005.
[16] J. Hamilton, Time series analysis. Princeton University Press, 1994.
[17] T. Henzinger, “The theory of hybrid automata,” in Logic in Computer Science, 1996. LICS’96.
Proceedings., Eleventh Annual IEEE Symposium on.
IEEE, 1996, pp. 278–292.
[18] H. Rauch, F. Tung, and C. Striebel, “Maximum likelihood estimates of linear dynamic systems,” AIAA journal, vol. 3, no. 8, pp. 1445–1450, 1965.
[19] J. Casti, Linear dynamical systems.
Academic Press, 1987, vol. 135.
[20] Z. Ghahramani and G. Hinton, “Parameter estimation for linear dynamical systems,” University of Toronto technical report CRG-TR-96-2, vol. 6, 1996.
[21] R. Kalman, “Mathematical description of linear dynamical systems,” Siam, 1963.
[22] A. Gollu and P. Varaiya, “Hybrid dynamical systems,” in Decision and Control, 1989., Proceedings of the 28th IEEE Conference on.
IEEE, 1989, pp. 2708–2712.
[23] R. Goebel, R. Sanfelice, and A. Teel, “Hybrid dynamical systems,” Control Systems Magazine,
IEEE, vol. 29, no. 2, pp. 28–93, 2009.
[24] K. Murphy, “Switching Kalman filters,” Dept. of Computer Science, University of California,
Berkeley, Tech. Rep, 1998.
[25] Z. Ghahramani and G. Hinton, "Switching state-space models," University of Toronto, Tech. Rep., 1996.
[26] D. Barber, “Expectation correction for smoothed inference in switching linear dynamical systems,” The Journal of Machine Learning Research, vol. 7, pp. 2515–2540, 2006.
[27] B. Juang, “Hidden Markov Models,” Encyclopedia of Telecommunications, 1985.
[28] S. Eddy, “Hidden markov models,” Current opinion in structural biology, vol. 6, no. 3, pp.
361–365, 1996.
[29] L. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.
[30] L. Baum, T. Petrie, G. Soules, and N. Weiss, “A maximization technique occurring in the
statistical analysis of probabilistic functions of markov chains,” The annals of mathematical
statistics, vol. 41, no. 1, pp. 164–171, 1970.
[31] G. Forney Jr, “The viterbi algorithm,” Proceedings of the IEEE, vol. 61, no. 3, pp. 268–278,
1973.
[32] K. Petersen and M. Pedersen, “The matrix cookbook,” Technical University of Denmark, pp.
7–15, 2008.
[33] A. Rosti and M. Gales, “Factor analyzed hidden Markov models for speech recognition,” Computer Speech & Language, vol. 18, no. 2, pp. 181–200, 2004.
[34] N. Kumar and A. G. Andreou, “Heteroscedastic discriminant analysis and reduced rank hmms
for improved speech recognition,” Speech Commun., vol. 26, no. 4, pp. 283–297, 1998.
[35] I. Jolliffe, Principal component analysis.
Wiley Online Library, 2002.
[36] H. Poor, An introduction to signal detection and estimation.
Springer, 1994.
[37] S. Ioffe, “Probabilistic linear discriminant analysis,” Computer Vision–ECCV 2006, pp. 531–
542, 2006.
[38] M. Gales, “Semi-tied covariance matrices for hidden markov models,” Speech and Audio Processing, IEEE Transactions on, vol. 7, no. 3, pp. 272–281, 1999.
[39] P. Van Overschee and B. De Moor, “N4SID: Subspace algorithms for the identification of
combined deterministic-stochastic systems,” Automatica, vol. 30, no. 1, pp. 75–93, 1994.
[40] M. Szummer and R. W. Picard, "Temporal texture modeling," in IEEE Intl. Conf. Image Processing, vol. 3, Sep. 1996, pp. 823–826. [Online]. Available: http://research.microsoft.com/~szummer/papers/icip-96/SzummerPicard-icip96.pdf
[41] V. Kwatra, A. Schodl, I. Essa, G. Turk, and A. Bobick, “Graphcut textures: Image and video
synthesis using graph cuts,” ACM Transactions on Graphics, SIGGRAPH 2003, vol. 22, no. 3,
pp. 277–286, July 2003.
[42] G. Doretto, A. Chiuso, Y. Wu, and S. Soatto, “Dynamic textures,” International Journal of
Computer Vision, vol. 51, no. 2, pp. 91–109, 2003.
[43] R. Kalman et al., “A new approach to linear filtering and prediction problems,” Journal of
basic Engineering, vol. 82, no. 1, pp. 35–45, 1960.
[44] R. Kalman and R. Bucy, “New results in linear filtering and prediction theory,” Transactions
of the ASME. Series D, Journal of Basic Engineering, vol. 83, pp. 95–107, 1961.
[45] G. Welch and G. Bishop, “An introduction to the Kalman filter,” University of North Carolina
at Chapel Hill, Chapel Hill, NC, vol. 7, no. 1, 1995.
[46] R. Brown and P. Hwang, Introduction to random signals and applied Kalman filtering. John
Wiley & Sons, 1997, vol. 2, no. 4.
[47] H. Sorenson, Kalman filtering: theory and application.
IEEE, 1985.
[48] A. Chan and N. Vasconcelos, “The em algorithm for layered dynamic textures,” Citeseer, Tech.
Rep., 2005.
[49] ——, “Mixtures of dynamic textures,” in Computer Vision, 2005. ICCV 2005. Tenth IEEE
International Conference on, vol. 1.
IEEE, 2005, pp. 641–647.
[50] A. Chan, N. Vasconcelos et al., “Layered dynamic textures,” Advances in Neural Information
Processing Systems, vol. 18, p. 203, 2006.
[51] A. Chan and N. Vasconcelos, “Modeling, clustering, and segmenting video with mixtures of
dynamic textures,” IEEE transactions on pattern analysis and machine intelligence, pp. 909–
926, 2007.
[52] H. Witsenhausen, “A class of hybrid-state continuous-time dynamic systems,” Automatic Control, IEEE Transactions on, vol. 11, no. 2, pp. 161–167, 1966.
[53] B. Ristic, S. Arulampalam, and N. Gordon, Beyond the Kalman filter: Particle filters for
tracking applications.
Artech House Publishers, 2004.
[54] J. Tugnait, “Adaptive estimation and identification for discrete systems with markov jump
parameters,” Automatic Control, IEEE Transactions on, vol. 27, no. 5, pp. 1054–1065, 1982.
[55] ——, “Detection and estimation for abruptly changing systems,” Automatica, vol. 18, no. 5,
pp. 607–615, 1982.
[56] V. Pavlovic, J. M. Rehg, and J. MacCormick, “Learning switching linear models of human
motion,” Advances in Neural Information Processing Systems, pp. 981–987, 2001.
[57] A. Logothetis and V. Krishnamurthy, “Expectation maximization algorithms for map estimation of jump markov linear systems,” Signal Processing, IEEE Transactions on, vol. 47, no. 8,
pp. 2139–2156, 1999.
[58] A. Rosti and M. Gales, Switching linear dynamical systems for speech recognition. University of Cambridge, Department of Engineering, 2003.
[59] B. Mesot and D. Barber, “Switching linear dynamical systems for noise robust speech recognition,” Audio, Speech, and Language Processing, IEEE Transactions on, vol. 15, no. 6, pp.
1850–1858, 2007.
[60] D. Alspach and H. Sorenson, “Nonlinear Bayesian estimation using gaussian sum approximations,” Automatic Control, IEEE Transactions on, vol. 17, no. 4, pp. 439–448, 1972.
[61] T. Minka, “Expectation propagation for approximate Bayesian inference,” in Uncertainty in
Artificial Intelligence, vol. 17.
Citeseer, 2001, pp. 362–369.
[62] T. Heskes, O. Zoeter, A. Darwiche, and N. Friedman, “Expectation propagation for approximate inference,” in Proceedings UAI-2002.
Citeseer, 2002, pp. 216–233.
[63] D. Koller and N. Friedman, Probabilistic graphical models: principles and techniques.
The
MIT Press, 2009.
[64] B. Varadarajan, S. Khudanpur, and E. Dupoux, "Unsupervised learning of acoustic sub-word units," in Proceedings of ACL-08: HLT, Short Papers. Columbus, Ohio: Association for Computational Linguistics, June 2008, pp. 165–168. [Online]. Available: http://www.aclweb.org/anthology/P/P08/P08-2042
[65] B. Varadarajan and S. Khudanpur, “Automatically learning speaker-independent acoustic subword units,” in Proc. Interspeech, Brisbane, Australia, Sep. 2008, pp. 1333–1336.
[66] J. Takami and S. Sagayama, "A successive state splitting algorithm for efficient allophone modeling," in IEEE International Conference on Acoustics, Speech and Signal Processing, vol. I, 1992, pp. 573–576.
[67] H. Singer and M. Ostendorf, "Maximum likelihood successive state splitting," in IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 2, May 1996, pp. 601–604.
[68] B. Varadarajan, C. Reiley, H. Lin, S. Khudanpur, and G. Hager, “Data-derived models for
segmentation with application to surgical assessment and training,” Medical Image Computing
and Computer-Assisted Intervention–MICCAI 2009, pp. 426–434, 2009.
[69] S. Vishwanathan, A. Smola, and R. Vidal, “Binet-cauchy kernels on dynamical systems and
its application to the analysis of dynamic scenes,” International Journal of Computer Vision,
vol. 73, no. 1, pp. 95–119, 2007.
[70] C. Leggetter and P. Woodland, “Maximum likelihood linear regression for speaker adaptation
of continuous density hidden Markov models,” Computer speech and language, vol. 9, no. 2,
p. 171, 1995.
[71] M. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Computer Speech and Language, vol. 12, pp. 75–98, 1998.
[72] M. Tamura, T. Masuko, K. Tokuda, and T. Kobayashi, "Speaker adaptation for HMM-based speech synthesis system using MLLR," in The Third ESCA/COCOSDA Workshop on Speech Synthesis. Citeseer, 1998, pp. 273–276.
[73] C. Lee and J. Gauvain, "Speaker adaptation based on MAP estimation of HMM parameters," in Proc. ICASSP. IEEE, 1993, pp. 558–561.
[74] W. Dai and O. Milenkovic, “Subspace pursuit for compressive sensing signal reconstruction,”
Information Theory, IEEE Transactions on, vol. 55, no. 5, pp. 2230–2249, 2009.
[75] W. Dai and O. Milenkovic, Subspace pursuit for compressive sensing: Closing the gap between performance and complexity. University of Illinois at Urbana-Champaign, Tech. Rep., 2008.
[76] B. Varadarajan, S. Khudanpur, and T. Tran, “Stepwise Optimal Subspace Pursuit for Improving
Sparse Recovery,” Signal Processing Letters, IEEE, vol. 18, no. 1, pp. 27–30, 2011.
Vita
Balakrishnan Varadarajan was a PhD student at Johns Hopkins University in the Department of Electrical and Computer Engineering. Prior to this, he completed his undergraduate studies in the Department of Electrical Engineering at the Indian Institute of Technology, Madras. His PhD dissertation focuses on efficient machine learning techniques for modeling and recognizing gestures in human activities using hybrid dynamical systems. He has published about six conference papers and two journal papers pertaining to his research.
Starting in October 2011, Balakrishnan will work on the YouTube research team at Google, Mountain View, California.