Series in Signal and Information Processing
Volume 27
Diss. ETH No. 22575
Input Estimation and Dynamical System Identification:
New Algorithms and Results
A dissertation submitted to
ETH Zurich
for the degree of
Doctor of Sciences
presented by
Lukas Bruderer
Dipl. El.-Ing. ETH
born on January 24, 1985
citizen of Speicher, AR
accepted on the recommendation of
Prof. Dr. Hans-Andrea Loeliger, examiner
Prof. Dr. Bernard Fleury, co-examiner
2015
Series in Signal and Information Processing
Vol. 27
Editor: Hans-Andrea Loeliger
Bibliographic information published by Die Deutsche Nationalbibliothek:
Die Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data is available on the internet at http://dnb.d-nb.de.
Copyright © 2015 by Lukas Bruderer
First Edition 2015
HARTUNG-GORRE VERLAG KONSTANZ
ISSN 1616-671X
ISBN-10: 3-86628-533-7
ISBN-13: 978-3-86628-533-0
Acknowledgments
I remember well the first discussion with my supervisor Hans-Andrea Loeliger before starting my PhD studies. It was hard not to be caught by his enthusiasm while he sketched a number of possible directions of research. When new paths emerged on the road towards this thesis, some of them off the beaten track, Andi was always supportive and swiftly brought up those "right" questions that would lead from one thing to another. Later I was consistently amazed by his ability to express our work and conceptual ideas in a clear and well-understandable way.
Next, I would like to thank Bernard Fleury for agreeing to co-examine my thesis. I recall a discussion with him in 2013 at a workshop in Aalborg, Denmark, that sparked my curiosity in a Bayesian approach to sparse recovery. In hindsight, it marked the beginning of fascinating work that is now a substantial part of this thesis.
A portion of this thesis was developed and inspired by the CTI¹ project "Adaptive Filterung von Zerspankraftsignalen", a cooperation of ETH Zurich and Kistler AG. I acknowledge the experimental data that I received from Daniel Spescha, Josef Stirnimann, and Friedrich Kuster from Inspire AG and the Institute of Machine Tools and Manufacturing. Working on this interdisciplinary project was a unique opportunity to put myself in the end user's position and learn a lot about sensor technology. I benefited greatly from thought-provoking meetings with Manuel Blatter, Thomas Wuhrmann, and Bernhard Bill from Kistler and the aforementioned persons from Inspire.

¹ Commission for Technology and Innovation
My colleagues at the lab were of major significance to the great experience that my PhD studies were. I am grateful for the many valuable discussions, on research topics or off-topic, some in front of whiteboards and some just next to the ISI espresso machine. At ISI, it was never hard to gain support for a project or join forces. In particular, it was inspiring to collaborate with Christoph Reller, Jonas Biveroni, Georg Wilckens, Sarah Neff, Nour Zalmaï, and Federico Wadehn. It is a pleasure to acknowledge Federico, who provided valuable comments on this document. Last but not least, I am highly indebted to Hampus Malmberg. The results in his master's thesis and the variety of open questions he elaborated and discussed contributed significantly to this thesis.
Apart from my fellow PhD students, I would also like to thank Rita Hildebrand for saving me worries about all sorts of administrative matters and Patrik Strebel for usually having a solution ready before the issue had even been raised.
Special thanks go to my semester-thesis and master's-thesis students, in particular Filipe Barata, Ismail Celebi, and Christian Käslin for their contributions to the CTI project, and to all the other students for the diverse and very positive experiences and learnings that I could gather as a supervisor.
My deepest thanks go to my girlfriend Lisa for her relentless support, her considerate nature, and her patience during all stages of my PhD. You always managed to pronounce precisely what I needed to hear to chart my own path.
Special thanks also go to my parents, who supported me throughout my studies. Even though I was often away and airheaded, they were always there for me.
Abstract
Recovery of signals from distorted or noisy observations is a long-standing research problem with a wide variety of practical applications. We advocate approaching these types of problems by interpreting them as input estimation in finite-order linear state-space models. Among other applications, the input signal may represent a physical quantity and the state-space model a sensor yielding corrupted readings.
In this thesis, we provide new estimation algorithms and theoretical results for different classes of input signals: continuous-time input signals and weakly sparse input signals. The method for the latter is obtained by specializing a more general framework for inference with sparse priors and sparse signal recovery, which, in contrast to standard methods, amounts to iterations of Gaussian message passing. The applicability of input estimation is extended to complex models, which generally are computationally more demanding and may be prone to numerical instability, by introducing new numerically robust computation methods expressed as Gaussian message passing in factor graphs.

In practical applications, a signal model may not be available a priori. As a consequence, in addition to input estimation, estimation of the state-space model itself must also be addressed. To this end, we introduce a variational statistical framework to retrieve convenient state-space models for input estimation and present a joint input and model estimation algorithm for weakly sparse input signals.
The proposed methods are substantiated with two real-world application examples. First, we consider impaired mechanical sensor measurements in machining processes and show that input estimation with suitable model identification can result in more accurate measurements when strong resonances distort the sensor readings. Second, we show that our simultaneous weakly-sparse input estimation and model estimation method is capable of identifying individual heart beats from ballistocardiographic measurements, which are used to measure cardiac output non-invasively.
Keywords: Gaussian message passing; state-space models; factor graphs; sparse estimation; system identification.
Kurzfassung

The estimation of signals from distorted or noisy observations is a long-standing research problem with a multitude of practically relevant applications. To approach this type of problem, we interpret it as input estimation in linear state-space models of finite order. A concrete application example for our approach is given by physical measurands as the input signals and the corresponding sensors represented by state-space models.

In this dissertation, we introduce new estimation methods, along with both theoretical and experimental results, for different classes of input signals: continuous-time input signals and weakly sparse input signals. The estimation method for the latter is a special case of an inference method for the recovery of sparse signals that, in contrast to other standard methods, reduces to iterative Gaussian message passing. Thanks to the introduction of numerically robust Gaussian message-passing schemes, the applicability of the input estimation methods is ensured even for more complex models, which have a higher model order and are therefore generally computationally more demanding and susceptible to numerical errors.

In practice, a signal model is usually not available a priori, which is why, in addition to input estimation, the identification of effective state-space models is also necessary. To this end, we present a variational statistical model that yields state-space models suitable for input estimation, and we present a model estimation algorithm for weakly sparse input signals.

The applicability of the new methods is substantiated with two real-world application examples. First, we consider the equalization of dynamic force measurements during milling processes and show that input estimation combined with suitable model identification leads to higher measurement accuracy. Second, we apply the simultaneous estimation of weakly sparse input signals and state-space models to ballistocardiographic measurements, a non-invasive method for measuring cardiac output, and show that the introduced method enables the extraction of heartbeats and thus the computation of relevant physiological parameters.

Keywords: Gaussian message passing; state-space models; estimation of sparse signals; factor graphs; model estimation.
Contents

Abstract  v
Kurzfassung  vii

1 Introduction  1
  1.1 Contributions and Overview  2
  1.2 Related Work  4
    1.2.1 Input Estimation  4
    1.2.2 System Identification  5

2 Preliminaries  7
  2.1 Notation and Definitions  7
  2.2 Expectation-Maximization Method  10
    2.2.1 Convergence Properties  10
    2.2.2 Gradient-Based Likelihood Optimization  12
    2.2.3 Expectation-Maximization Acceleration Methods  12
    2.2.4 Expectation-Maximization-based Estimation of State Space Models  13
  2.3 Wiener filter  13

3 Numerically Stable Gaussian Message-Passing  17
  3.1 Methodology  17
  3.2 Decimation of Low-Pass-Type Systems  18
  3.3 State-Space Reparametrization  20
    3.3.1 Numerical Robustness  22
    3.3.2 State-Space-Model Identification  25
  3.4 Square-Root Message Passing  25
    3.4.1 Gaussian Square-Root Message Passing  26
    3.4.2 Computational Optimization  32
    3.4.3 Expectation-Maximization Updates  32
    3.4.4 Experimental Results  35
  3.5 Conclusion  38

4 Gaussian Message Passing with Dual-Precisions  39
  4.1 Dual-Precision  39
  4.2 Message Passing Tables  41
  4.3 Algorithms Based on W̃ and W̃µ  41
    4.3.1 Smoothing in SSMs  41
    4.3.2 Steady-State Smoothing  45
    4.3.3 E-Step in SSMs  47
    4.3.4 Continuous-Time Smoothing  48
  4.4 Conclusion  49

5 Input Signal Estimation  51
  5.1 Preliminaries  51
  5.2 Computation and Continuity of û(t)  53
  5.3 Wiener Filter Perspective  54
  5.4 Postfiltering  55
  5.5 Experimental Results  57
  5.6 Conclusion  58

6 Input Estimation for Force Sensors  61
  6.1 Problem Setup and Modeling  62
  6.2 Model-Based Input Estimation  64
    6.2.1 Model-Based Input Estimation  64
    6.2.2 Input estimator implementation  66
  6.3 Sensor Model Identification  67
    6.3.1 Tuning of Identification Parameters  71
    6.3.2 Implementation  72
    6.3.3 Improvements of Model-Identification Algorithm  72
  6.4 Frequency-Based MMSE Filters  76
    6.4.1 Frequency-Based Filtering  76
    6.4.2 Frequency Response Function Estimation  78
    6.4.3 Results  80
  6.5 Results  82
    6.5.1 System Identification Results  82
    6.5.2 Force Estimation Performance  85
    6.5.3 Convergence Properties  86
  6.6 Conclusion  89

7 Sparse Bayesian Learning in Factor Graphs  91
  7.1 Variational Prior Representation  91
    7.1.1 Multi-Dimensional Features  93
  7.2 Sparse Bayesian Learning in Factor Graphs  93
  7.3 Multiplier Optimization  96
  7.4 Fast Sparse Bayesian Learning  99
    7.4.1 Multi-Dimensional Fast Sparse Bayesian Learning  102
  7.5 Conclusion  108

8 Sparse Input Estimation in State Space Models  111
  8.1 Sparse Input Estimation  111
    8.1.1 Algorithm  113
    8.1.2 Simulation Results  113
  8.2 Blind Deconvolution  116
    8.2.1 Type-I Estimators vs. Type-II Estimators  116
    8.2.2 Algorithm  118
    8.2.3 Simulation Results  120
    8.2.4 Heart Beat Detection for Ballistocardiography  122
  8.3 Conclusions and Outlook  125

A Algorithm Statements  127
  A.1 Square-Root Smoothing  127
  A.2 Continuous-Time Posterior Computation  127
  A.3 System Identification in Chapter 6  128
  A.4 Sparse Input Estimation  130

B Proofs  133
  B.1 Proofs for Chapter 3  135
  B.2 Proofs for Chapter 4  136
  B.3 Proofs for Chapter 5  137
  B.4 Proofs for Chapter 6  140
  B.5 Proofs for Chapter 7  142

C Spline Prior  147
  C.1 Relation to Splines  148

D Additional Material on Dynamometer Filtering  151
  D.1 Experimental Setup  151
  D.2 Measured Frequency Responses  152

E Learnings from Implementing Gaussian Message Passing Algorithms  155

F Multi-Mass Resonator Model  159

Bibliography  163

About the Author  171
Chapter 1
Introduction
A fundamental and thoroughly investigated problem in signal processing is the estimation or equalization of signals given disturbed or distorted measurements. We focus on a related class of problems where the measurement processes or sensors can be modeled with finite-order linear state-space models (SSMs), and we promote an SSM-based approach to estimation.

SSMs are powerful tools that have been an essential part of many practical and theoretical developments over the last decades. In an SSM, all information is contained in a finite-dimensional state variable. The state is usually not directly observable, but it evolves in time according to a transition law. In practical applications, SSMs provide flexibility with respect to measurement or sampling schemes and often permit relatively painless handling of complex, even mildly non-linear, models.
The estimation problems we consider can be expressed as state or input estimation in SSMs, where the SSM might not be restricted to merely being an accurate model of the measurement device. With our approach, equalization of an observed signal is treated as input estimation in an SSM. However, the input estimation method gives access to a wider range of interesting problems, some of which are shown in this thesis.
When confronted with actual tasks and real-world applications, some questions arise. The most apparent one is how to obtain an SSM when none is given a priori. This task is known as system identification. As a wide choice of system identification methods is available, the main challenge is commonly to select the method that best suits the task for which the model is needed.
Once a model is set, we take a Bayesian stance and attempt to estimate the inputs to the model that best explain the present observations. One challenge faced in doing so is the high computational requirements of common methods, often accompanied by numerical stability problems.
In many practical applications, no prior knowledge about the actual input signal is available, or biasing the estimates is not desired. Thus, input estimates based on the weakest possible assumptions are sought. For discrete-time input signals there exist well-known estimators. Continuous-time input estimation, on the other hand, is not well understood. How can we obtain estimators, and do these estimators behave according to our expectations?
From an algorithmic point of view, linear SSMs with Gaussian disturbances share an intimate relationship with Gaussian inference methods, notably Gaussian message passing. These methods distinguish themselves by the possibility of decomposing the computations into simple local computations and, under suitable circumstances, by yielding an exact solution in a fixed number of steps. When dealing with non-Gaussian statistical models, which allow us to introduce useful types of prior knowledge about the inputs, we seek Gaussian message-passing algorithms that perform the computations (approximately).
In this thesis, we will touch on many of these points and aspects, which
lead to new algorithms and results. Our contributions are summarized
next.
1.1 Contributions and Overview
In Chapter 2 we introduce the notation and necessary background on
expectation maximization (EM).
Input Estimation
• We present a model-based filter for estimation of undistorted measurements in an industrial application (Section 6.2.1) and a three-dimensional Wiener filter (Section 6.4). The estimation performance is evaluated with extensive measurement data, and trade-offs in terms of estimation performance and implementation are discussed for our model-based filter approach.
• We demonstrate the use of sparsifying ("compressible") priors with variational representations for input signals of state-space models, such that the resulting estimation algorithm amounts to Gaussian message passing.
• We devise a Gaussian message-passing scheme for inference in SSMs and EM that offers attractive numerical stability properties and uses square-root Kalman filtering methods (Section 3.4).
• A state-space basis construction method is presented that typically improves the numerical robustness of Gaussian message passing and is readily applicable (Section 3.3).
• Extending previous message-passing schemes and ideas, we propose a new, pertinent, and very efficient Gaussian message-passing scheme for SSMs (Chapter 4) that does not require any matrix inversion if the SSM's output dimension is 1.
• Several new theoretical and experimental results on a continuous-time white Gaussian noise estimator, proposed in previous work, are presented (Chapter 5).
System Identification
• We show a new, numerically more stable Gaussian message-passing scheme for SSMs that uses matrix square roots of the covariances. The improved numerical properties are substantiated with experimental results (Section 3.4.3).
• Based on the proposed inversion-free message-passing method, EM computations are derived that are free of matrix inversions. The latter property is essential to improve numerical stability and ensures reduced computational overhead (Section 4.3.3).
• A result (Theorem 6.1) indicates that maximum-likelihood estimation is well suited to identify systems that are subsequently employed for input estimation. Using a non-probabilistic view, insight on parameter selection for the proposed identification method is given (Section 6.3.1).
• We derive an algorithm for blind system identification (Section 8.2), where the actual computations amount to an iterative application of Gaussian message passing. We then demonstrate and compare its performance on a synthetic problem and show its effectiveness in finding heartbeat times in ballistocardiographic measurements, a challenging real-world physiological signal-processing problem.
1.2 Related Work

1.2.1 Input Estimation
Input estimation problems are usually addressed by making additional assumptions on the unknown, potentially random, input signal. A classical approach is to assume that the desired signal is low-pass-filtered Gaussian noise and to estimate it either by a Wiener filter or by a Kalman smoother [38]. These methods extend to much more general Gaussian priors, e.g., splines [73], but consequently also lead to more complex estimators. Joint state and input estimation for discrete-time linear SSMs has been analyzed in [26], which essentially corresponds to discrete-time input estimation using a non-informative prior. Estimators for continuous-time input estimation were presented in [9] with a smooth prior on the input process and in [10] with a non-informative white Gaussian noise prior.
The methods we propose for estimating weakly sparse or compressible signals fit into the broader class of sparse Bayesian learning (SBL) methods, which were initiated with the seminal paper [67] and have since found applications in signal processing [23] and communications [65]. However, it seems that this powerful class of models has (to the best of our knowledge) never been used in the context of linear dynamic models.
Approaches related to (weakly) sparse input estimation assume the input to be sparse with respect to some basis, as, e.g., in [71], and then either solve the combinatorial problem with exact sparsity measures or a more tractable problem using sparsity-promoting regularizers. For the input estimation problem in SSMs, the application of non-Gaussian priors, in particular sparsity-promoting priors, has not received a lot of attention in the past. Related approaches essentially use heavy-tailed priors on the inputs of discrete-time SSMs to devise (statistically) robust Kalman filters [19], but do not try to recover the input signals.
A different Bayesian approach to the estimation of sparse processes observed through a finite-order model is given in [3] for continuous-time sparse signals. The presented algorithms, based on spline theory, are however only applicable to very simple SSMs (i.e., SSMs of order at most 1).
Blind deconvolution problems have been solved with a variety of methods. Many related methods express blind deconvolution as an optimization problem with sparsity-promoting regularizers, e.g., blind source separation [22, 85] and dictionary learning [1]. The Bayesian approach to such problems was first introduced in image processing [21] and formulated with mean-field variational SBL in [4], where a scale-mixture representation of the LASSO is used to blindly estimate image blurs modeled as two-dimensional finite impulse response (FIR) filters. The Bayesian approach provides not only an estimate of the sparse variables but also reliability information about the posterior (e.g., the posterior variance), which can be essential for blind deconvolution [41].
1.2.2 System Identification
A wide variety of methods for the identification of SSMs have been developed. The most widely used ones are:
• Subspace system identification methods [68] seek an SSM that approximately spans the subspace generated by the given input-output measurements. The principal quantity is the Hankel matrix of the identification data. These methods can easily handle additional complications such as non-zero initialization (transients), but unfortunately lack a simple error criterion or probabilistic interpretation.
• Prediction-error methods [43] are often the preferred choice for system identification of SSMs. They minimize the prediction error of an SSM in an online fashion. Asymptotically, these methods minimize the absolute model error weighted by the relative spectral magnitude of the input signal. Unfortunately, minimization of the error criterion for SSMs requires working with the innovations form of an SSM, which is not the same SSM as the ones considered here.
While under certain circumstances prediction-error methods asymptotically approximate maximum-likelihood (ML) estimators, direct ML-based estimation does not seem very popular in system identification. In the case of the general stochastic SSMs that we consider, however, prediction-error methods cannot be used, and the complex ML problem can often only be solved approximately (not asymptotically, though) with EM-based algorithms. We present an overview of prior work on EM in SSM identification in Section 2.2.
Chapter 2
Preliminaries
2.1 Notation and Definitions
We write matrices in boldface (e.g., $\mathbf{A}$) and vectors in italic boldface (e.g., $\boldsymbol{a}$). We usually use uppercase for random variables and matrices. For a matrix $\mathbf{A}$, we denote by $\mathbf{A}^{\mathsf{T}}$, $\mathbf{A}^{-1}$, $\det(\mathbf{A})$, and $\operatorname{Tr}(\mathbf{A})$ its transpose, its inverse, its determinant, and its trace (the sum of its diagonal elements). The Kronecker product between two matrices $\mathbf{A}$ and $\mathbf{B}$ is denoted by $\mathbf{A} \otimes \mathbf{B}$. Superscripts in square brackets, e.g., $x^{[t]}$, denote the value of a variable $x$ in iteration $t$, typically of an iterative algorithm. A normal distribution with mean $m$ and variance $\sigma^2$ is denoted by $\mathcal{N}(m, \sigma^2)$. The probability density function (pdf) of a normal distribution is denoted by $\mathcal{N}(x \mid m, \sigma^2)$.
Definition 2.1: Linear discrete-time SSM
A linear discrete-time SSM of order $d$ with states $x_k \in \mathbb{R}^d$, inputs $u_k \in \mathbb{R}^m$, and outputs $y_k \in \mathbb{R}^n$ satisfies, for all $k \in \mathbb{Z}$,
$$x_{k+1} = A_k x_k + B_k u_k, \qquad (2.1)$$
$$y_k = C_k x_k. \qquad (2.2)$$
An SSM is specified by its parameters: the state-transition matrix $A_k \in \mathbb{R}^{d \times d}$ and the matrices $B_k \in \mathbb{R}^{d \times m}$ and $C_k \in \mathbb{R}^{n \times d}$. If $A_k = A$, $B_k = B$, and $C_k = C$ for all $k \in \mathbb{Z}$, the SSM is time-invariant.
Often, Gauss-Markov statistical models will be used.
Definition 2.2: Linear stochastic SSM
We define a linear stochastic SSM of order $d$ as the stochastic process generated by
$$X_{k+1} = A_k X_k + B_k U_k, \qquad (2.3)$$
$$Y_k = C_k X_k + N_k, \qquad (2.4)$$
i.e., a $d$th-order linear discrete-time SSM with parameters $A_k$, $B_k$, and $C_k$, with independent Gaussian random variables $U_k \sim \mathcal{N}(u_k, V_{U_k})$ and $N_k \sim \mathcal{N}(0, V_{N_k})$ for all $k \in \mathbb{Z}$. The model is time-invariant if the SSM is time-invariant and $V_{U_k} = V_U$ and $V_{N_k} = V_N$. Notice that this definition does not require the mean of $U_k$ to be time-invariant.

The stochastic SSM is linear and Gaussian. Stochastic processes with the same distribution as $X_k$ or $Y_k$ in (2.3) belong to the class of Gauss-Markov random processes (i.e., Gaussian random processes with Markov properties). We refer the reader to [58] for an interesting treatment of a more general case.
The models defined in Definitions 2.1 and 2.2 are multiple-input multiple-output models. In many instances, we consider single-input single-output SSMs: vectors $b_k$ and $c_k$ replace $B_k$ and $C_k$ at the SSM's inputs and outputs, and the dimensions of the variances $\sigma_{U_k}^2$ and $\sigma_{N_k}^2$ are changed accordingly.
A factor graph representation of the stochastic SSM from Definition 2.2 is shown in Figure 2.1. Factor graphs are graphical models that represent function factorizations. In Forney factor graphs, nodes of the graph denote factors, whereas edges correspond to the variables. We refer the reader to [40, 44] for a general introduction to factor graphs and to [59, Section 1.5.2] for a description of the notation used in this thesis. When there are multiple edges that represent one state $X_k$ in the factor graph of an SSM, we distinguish between the edge variables to the left and to the right of an "="-factor by denoting them $X_k$ and $X_k'$, respectively.

Stochastic SSMs will be used as probabilistic models throughout this thesis, and, in this case, the factor graph representation provides a concise and versatile tool to facilitate and describe algorithms that compute various statistical quantities. In particular, we will be interested in marginal probability densities. To this end, we will make heavy use of standard rules and relations from [47].
[Figure 2.1: Factor graph representation of a stochastic SSM.]

Experimental results are generally compared with respect to their normalized mean-squared error (NMSE). The NMSE of an estimated (vector) $\hat{w}$ with respect to the true (vector) value $w$ is
$$\mathrm{NMSE} \triangleq \frac{\|\hat{w} - w\|^2}{\|w\|^2}. \qquad (2.5)$$
When comparing different estimation methods, another metric is the MSE improvement factor with respect to a baseline estimate, e.g., the unprocessed observations $y$. Using the definitions from (2.5), the factor is given by
$$\Delta\mathrm{MSE} \triangleq \frac{\|\hat{w} - w\|^2}{\|y - w\|^2}. \qquad (2.6)$$
Note that NMSE and $\Delta$MSE do not depend on the absolute magnitude of the signals.
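As a small illustration (my own sketch, not part of the thesis; the signals below are synthetic placeholders), the two metrics can be computed directly from their definitions:

import numpy as np

def nmse(w_hat, w):
    # Normalized mean-squared error, cf. (2.5).
    return np.sum((w_hat - w) ** 2) / np.sum(w ** 2)

def delta_mse(w_hat, w, y):
    # MSE improvement factor over the unprocessed observations y, cf. (2.6).
    return np.sum((w_hat - w) ** 2) / np.sum((y - w) ** 2)

rng = np.random.default_rng(0)
w = rng.standard_normal(100)            # true signal (synthetic placeholder)
y = w + 0.1 * rng.standard_normal(100)  # noisy baseline observations
# Scale invariance: both metrics are unchanged when all signals are scaled.
assert np.isclose(nmse(y, w), nmse(3 * y, 3 * w))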
2.2 Expectation-Maximization Method
Expectation maximization (EM) [6, 17] is a widely used iterative technique for parameter estimation. It may be seen as an instance of majorization-minimization (MM) techniques [35, 82], a class of optimization methods. MM methods minimize (maximize) a function by alternately constructing an upper-bounding (lower-bounding) surrogate function to the objective and then minimizing (maximizing) the surrogate function.
More specifically, suppose we wish to compute
$$\hat{\theta} \triangleq \operatorname*{argmax}_{\theta}\, f(\theta), \qquad (2.7)$$
with a positive, not necessarily convex or unimodal, function $f(\theta)$. A function $g(\theta \mid \bar{\theta})$ is said to minorize $f(\theta)$ if i) $f(\bar{\theta}) = g(\bar{\theta} \mid \bar{\theta})$ and ii) $f(\theta) \geq g(\theta \mid \bar{\theta})$.

In particular, the EM technique applies when $f(\theta) = \log l(\theta)$ is a log-likelihood and there exist latent or missing variables $x$ and a probability measure $p(x \mid \theta)$ such that
$$l(\theta) = \int p(x \mid \theta)\, \mathrm{d}x.$$
EM at iteration $k$ with current estimate $\theta^{[k]}$ then proceeds by constructing a lower bound $Q(\theta \mid \theta^{[k]})$ to $f(\theta)$ (expectation step or E-step)
$$Q(\theta \mid \theta^{[k]}) = \int p(x \mid \theta^{[k]}) \log p(x \mid \theta)\, \mathrm{d}x = \mathbb{E}_{p(x \mid \theta^{[k]})}\big[\log p(x \mid \theta)\big],$$
followed by maximizing the surrogate function (maximization step or M-step) to obtain the new estimate
$$\theta^{[k+1]} = \operatorname*{argmax}_{\theta}\, Q(\theta \mid \theta^{[k]}).$$
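To make the E- and M-steps concrete, here is a minimal self-contained sketch (my own toy example, not from the thesis) of EM for a simple latent-variable model: a two-component Gaussian mixture with known unit variances, where the latent variable $x$ is the component label.

import numpy as np

rng = np.random.default_rng(0)
# Synthetic data from a two-component Gaussian mixture with unit variances.
data = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])

mu = np.array([-1.0, 1.0])  # initial component means (the parameters theta)
pi = np.array([0.5, 0.5])   # initial mixture weights

for _ in range(50):
    # E-step: responsibilities p(label | data, theta^[k]) define Q(theta | theta^[k]).
    lik = pi * np.exp(-0.5 * (data[:, None] - mu) ** 2) / np.sqrt(2 * np.pi)
    resp = lik / lik.sum(axis=1, keepdims=True)
    # M-step: closed-form maximization of the surrogate Q.
    pi = resp.mean(axis=0)
    mu = (resp * data[:, None]).sum(axis=0) / resp.sum(axis=0)

print(pi, mu)  # converges towards roughly (0.3, 0.7) and (-2, 3)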
2.2.1 Convergence Properties

The main property of MM-type algorithms, and hence also of the EM algorithm, is that the sequential estimates force $f(\theta^{[k]})$ to increase monotonically, i.e.,
$$f(\theta^{[k]}) \leq f(\theta^{[\ell]}), \qquad (2.8)$$
for all $k \leq \ell$. In other words, an MM step (EM step) is guaranteed to at least achieve the same objective value (likelihood) as the current point. While this usually implies convergence to a stationary point, stronger global convergence properties, such as convergence to a local maximum, can in general not be established [81].¹

Local convergence of EM algorithms, on the other hand, is characterized in [17] for the general case and in [51, 83] for specific problems. Let $M(\theta)$ represent the implicit mapping such that $\theta^{[k+1]} = M(\theta^{[k]})$ for any $k \in \mathbb{Z}$. If the sequence of parameters $\theta^{[k]}$ converges to² $\hat{\theta}$, then $M(\hat{\theta}) = \hat{\theta}$, and by a Taylor expansion of $M(\theta)$ in the neighborhood of $\hat{\theta}$ it follows that
$$\theta^{[k+1]} - \hat{\theta} = \nabla M(\hat{\theta})\, \big(\theta^{[k]} - \hat{\theta}\big). \qquad (2.9)$$
Thus, EM algorithms [17] and also MM algorithms [35] converge at a linear rate (as, e.g., gradient methods do). The rate is given by the spectral radius of $\nabla M(\hat{\theta})$ (the largest eigenvalue of the matrix $\nabla M(\hat{\theta})$). For EM, the rate matrix $\nabla M(\hat{\theta})$ can be expressed as [17]
$$\nabla M(\hat{\theta}) = I - \left( \frac{\partial^2 Q(\theta \mid \hat{\theta})}{\partial \theta^2} \bigg|_{\theta = \hat{\theta}} \right)^{-1} \frac{\mathrm{d}^2 \ell(\hat{\theta})}{\mathrm{d}\theta^2}, \qquad (2.10)$$
with $\ell(\theta) \triangleq \log l(\theta)$; an analogous expression holds for MM algorithms [82]. When (2.7) is an ML estimation problem, the quotient in (2.10) can be given a statistical interpretation: $\frac{\partial^2 Q(\theta \mid \hat{\theta})}{\partial \theta^2}\big|_{\theta = \hat{\theta}}$ corresponds to the Fisher information³ on $\theta$ given the missing variables $x$ (complete Fisher information) and, similarly, $\frac{\mathrm{d}^2 \ell(\hat{\theta})}{\mathrm{d}\theta^2}$ is seen as the Fisher information of the ML problem (2.7) (incomplete Fisher information). Therefore, the convergence rate of EM depends on the quotient of the complete and the incomplete Fisher information.

¹ A typical counterexample to guaranteed convergence to local maxima are saddle points, where EM iterations can get trapped.
² Given that the mapping $M(\theta)$ is continuous.
³ The Fisher information, given as
$$I(\theta) = \mathbb{E}_{X \mid \theta}\!\left[ \left( \frac{\partial}{\partial \theta} \log f(x \mid \theta) \right)^{\!2} \right] = -\mathbb{E}_{X \mid \theta}\!\left[ \frac{\partial^2}{\partial \theta^2} \log f(x \mid \theta) \right],$$
is a measure of the information that a random variable $x$ conveys about a parameter $\theta$ [39]. The second equality holds under some mild regularity conditions on the likelihood [39].
2.2.2 Gradient-Based Likelihood Optimization
A well-known problem with EM-based approximation of an ML estimate is that the estimates can converge very slowly (e.g., [42, 61, 83]). To overcome this limitation, various acceleration strategies have been proposed (see [51] for an overview). We point out two simple acceleration methods, which are easily applicable in a wide variety of situations. These observations and many acceleration strategies carry over to general MM-based methods [82].
2.2.3 Expectation-Maximization Acceleration Methods
The simplest acceleration method is step doubling [82]. This method replaces the next EM estimate $\theta^{[k+1]}$ by the estimate
$$(1 - \eta)\,\theta^{[k]} + \eta\,\theta^{[k+1]}, \qquad \eta > 1.$$
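As a toy illustration (my own sketch, not from the thesis), step doubling applied to a linear EM map with convergence rate 0.9 improves the rate to $|1 - 0.1\eta|$:

import numpy as np

def step_doubled(em_step, theta, eta=2.0, n_iter=25):
    """Run EM with step doubling: extrapolate each EM update by eta > 1.
    For eta = 1 this is plain EM. Monotonicity (2.8) is no longer
    guaranteed, so practical code often falls back to the plain step
    whenever the likelihood decreases."""
    for _ in range(n_iter):
        theta = (1.0 - eta) * theta + eta * em_step(theta)
    return theta

# Toy EM map with linear rate 0.9 and fixed point 1.0; step doubling with
# eta = 2 yields the faster rate |1 - 2 * 0.1| = 0.8.
em_map = lambda t: 0.9 * t + 0.1
print(step_doubled(em_map, 0.0, eta=1.0), step_doubled(em_map, 0.0, eta=2.0))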
Another strategy is to use the gradient of the surrogate function $Q(\theta \mid \theta^{[k]})$ and combine it with powerful first-order optimization techniques such as conjugate-gradient or quasi-Newton methods [11]. When $Q(\theta \mid \theta^{[k]})$ is differentiable in $\theta$,⁴ it turns out that its gradient corresponds to the gradient of the log-likelihood [17]:
$$\nabla_{\theta}\, Q(\theta \mid \theta^{[k]}) \Big|_{\theta = \theta^{[k]}} = \nabla \ell(\theta^{[k]}). \qquad (2.11)$$
This implies that if a computationally feasible EM algorithm can be devised for a specific application, gradient-based optimization of the likelihood function itself, i.e., of the problem in (2.7), is readily available. Note that these schemes might exhibit faster convergence than EM, but generally they are also first-order methods. More importantly, gradient-based schemes usually do not provide monotonicity guarantees like (2.8) of the EM algorithm.

⁴ This, for instance, always applies when the likelihood pertains to the exponential family [81].
An additional idea is to devise methods that alternately use EM steps and gradient steps, switching between the schemes either according to a fixed schedule (e.g., as EM tends to have slow convergence in later iterations, switching to a gradient-based method might yield faster convergence) or according to an adaptive switching criterion based on the current estimates and surrogate functions. In fact, good results have been obtained with such a scheme in learning Gaussian mixtures [61].
2.2.4 Expectation-Maximization-based Estimation of State-Space Models
ML estimation of linear SSM parameters is a well-known application of the EM algorithm: it was first presented in [63] in the context of time-series models, corresponding to SSMs with known output matrix $C$, while a comprehensive treatment of the general discrete-time SSM (i.e., multi-dimensional inputs and multi-dimensional outputs) is presented in [24, 25]. A message-passing view on EM, with applications in the estimation of SSMs, is given in [16]. In this thesis, we rely heavily on concepts and ideas from the latter work.

Given observations $y = y_1, y_2, \ldots$ and an order $d$, we are interested in iteratively approximating the ML estimate of the parameters of a single-input single-output stochastic SSM. To this end, we focus on the autoregressive-form parameterization of the SSM with $2d$ free parameters (see [16]): there are $d$ parameters that define $a$, the first row of $A$, and $d$ parameters for the vector $c$ (the vector $b$ is fixed).
The ML parameter estimation problem then translates into
$$(\hat{a}, \hat{c}) \triangleq \operatorname*{argmax}_{a,\,c}\; p(y \mid a, c;\, \sigma_N^2, \sigma_U^2). \qquad (2.12)$$
We refer the reader to Algorithm A.3 in Appendix A.3 (cf. Steps 3 and 4) for a standard implementation of the EM method applied to (2.12). We will typically take the noise variances as known or fixed a priori (see, e.g., Section 6.3) and not consider additional estimation schemes. In many cases, ML estimation of the noise variances can be added to EM-based methods as in, e.g., [16].
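For concreteness, a possible construction of the autoregressive-form parameter matrices is sketched below; this is my own illustrative guess at one common companion-matrix convention, and the exact convention used in the thesis follows [16] and may differ in details such as the orientation of the companion matrix.

import numpy as np

def ar_form(a, c):
    """Build a single-input single-output SSM in (one) autoregressive form:
    the d entries of `a` fill the first row of the companion matrix A,
    b is fixed to the first unit vector, and the d entries of `c` are the
    free output coefficients, giving 2d free parameters in total."""
    d = len(a)
    A = np.zeros((d, d))
    A[0, :] = a
    A[1:, :-1] = np.eye(d - 1)  # delay-line structure below the first row
    b = np.zeros(d)
    b[0] = 1.0                  # fixed input vector
    return A, b, np.asarray(c, dtype=float)

A, b, c = ar_form(a=[1.5, -0.7], c=[1.0, 0.0])  # a stable 2nd-order example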
2.3 Wiener filter
A Wiener filter is the linear minimum mean-squared error (LMMSE) estimator of a wide-sense stationary random process; see, e.g., [38]. In the following, we present the more general multi-dimensional Wiener filter: we consider $d$-dimensional observations $Y_k \in \mathbb{R}^d$ and $d$-dimensional target signals and LMMSE estimates $U_k, \hat{U}_k \in \mathbb{R}^d$ for all $k$. The Wiener filter is chosen such that for all $k$ the estimates
$$\hat{u}_k = \sum_{l} G_l\, y_{k-l},$$
where the coefficient matrices $G_l$ are the filter coefficients of the Wiener filter and the sum ranges over the available observation window, minimize the conditional squared error
$$\mathbb{E}\Big[ \big\| \hat{U}_k - U_k \big\|^2 \,\Big|\, Y_{k+t}, \ldots, Y_{k+s} \Big],$$
for $-\infty < t < s < \infty$.
One approach to obtaining the coefficients $G_l$ is by means of the orthogonality principle [38], which in the multi-dimensional case is expressed as
$$\mathbb{E}\Big[ \big( U_k - \hat{U}_k \big)\, Y_l^{\mathsf{T}} \Big] = 0 \qquad (2.13)$$
and must hold for all $l \in [k+t,\, k+s]$. To proceed, we specialize to non-causal IIR-type filters; thus (2.13) must be satisfied for all $l \in \mathbb{Z}$.
Remark 2.1
When the filter is of FIR type (i.e., $-\infty < t < s < \infty$), the computations needed to obtain the (matrix) filter coefficients differ slightly from the IIR case above: the orthogonality principle leads to a (block) Toeplitz linear system of equations, which has to be solved for the filter coefficients. An efficient algorithm to compute the filter coefficients is the (block) Levinson-Durbin algorithm (see, e.g., [34]).
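For the scalar, causal-FIR special case of Remark 2.1, the Toeplitz normal equations can be solved with SciPy's Levinson-Durbin-based solver; a minimal sketch with made-up correlation values:

import numpy as np
from scipy.linalg import solve_toeplitz  # uses the Levinson-Durbin recursion

def fir_wiener(r_yy, r_uy):
    """Scalar causal FIR Wiener filter: solve the symmetric Toeplitz normal
    equations R_yy g = r_uy for the L filter taps, where r_yy[0..L-1] is the
    autocorrelation of the observations and r_uy[0..L-1] is the
    cross-correlation between target and observations."""
    return solve_toeplitz((r_yy, r_yy), r_uy)

# Example: y = u + n with white noise of variance 0.1 (hypothetical numbers),
# so r_yy = r_uu + 0.1 * delta and r_uy = r_uu.
r_uu = np.array([1.0, 0.5, 0.25, 0.125])
r_yy = r_uu.copy()
r_yy[0] += 0.1  # additive white observation noise
g = fir_wiener(r_yy, r_uu)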
Let $S^{(UU)}(e^{j\theta})$ be the power spectral density of $U_k$, given by the (matrix) discrete-time Fourier transform (DTFT) of the autocorrelation function
$$R_l^{(UU)} = \mathbb{E}\big[ U_k U_{k-l}^{\mathsf{T}} \big].$$
We now assume that
$$Y_k = \sum_{l=0}^{\infty} H_l\, U_{k-l} + N_k,$$
with $N_k \in \mathbb{R}^d$ (possibly correlated) random vectors with mean $0$ and finite autocorrelation $R_l^{(NN)}$, i.e., the observations $y_k$ are linearly distorted and noisy versions of the original signal. Let $S^{(NN)}(e^{j\theta})$ be the power spectral density of $N_k$. Basic manipulations yield the expression
$$S^{(YY)}(e^{j\theta}) = H(e^{j\theta})\, S^{(UU)}(e^{j\theta})\, H^{\mathsf{H}}(e^{j\theta}) + S^{(NN)}(e^{j\theta})$$
for the power spectral density of $Y_k$ and likewise
$$S^{(UY)}(e^{j\theta}) = S^{(UU)}(e^{j\theta})\, H^{\mathsf{H}}(e^{j\theta})$$
for the cross-spectral density of $U_k$ and $Y_k$, where $H(e^{j\theta})$ denotes the DTFT of the impulse response $H_l$. From (2.13), a computational method to obtain the Wiener filter can now be found by solving the linear equations
$$G(e^{j\theta})\, S^{(YY)}(e^{j\theta}) - S^{(UY)}(e^{j\theta}) = 0,$$
which must hold at all frequencies $\theta \in [0, 2\pi)$, where $G(e^{j\theta})$ is the DTFT of the Wiener filter coefficients $G_l$. Solving these equations yields the DTFT of the non-causal Wiener filter as
$$G(e^{j\theta}) = S^{(UY)}(e^{j\theta}) \big( S^{(YY)}(e^{j\theta}) \big)^{-1} = S^{(UU)}(e^{j\theta})\, H^{\mathsf{H}}(e^{j\theta}) \Big( H(e^{j\theta})\, S^{(UU)}(e^{j\theta})\, H^{\mathsf{H}}(e^{j\theta}) + S^{(NN)}(e^{j\theta}) \Big)^{-1}. \qquad (2.14)$$
If for all $k$ the random vectors $U_k$ and $N_k$ are i.i.d. and spatially white with variances $\sigma_u^2$ and $\sigma_n^2$, respectively, (2.14) simplifies to
$$G(e^{j\theta}) = \sigma_u^2\, H^{\mathsf{H}}(e^{j\theta}) \big( \sigma_u^2\, H(e^{j\theta})\, H^{\mathsf{H}}(e^{j\theta}) + \sigma_n^2 I \big)^{-1}.$$
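A minimal frequency-domain sketch of (2.14) for the scalar case (my own illustration; the impulse response, the PSD values, and the FFT grid size are assumptions):

import numpy as np

def wiener_freq(h, s_uu, s_nn, n_fft=1024):
    """Scalar non-causal Wiener filter (2.14) evaluated on an FFT grid:
    G = S_uu * conj(H) / (|H|^2 * S_uu + S_nn)."""
    H = np.fft.fft(h, n_fft)
    return s_uu * np.conj(H) / (np.abs(H) ** 2 * s_uu + s_nn)

# White signal and noise: S_uu = sigma_u^2 and S_nn = sigma_n^2 at all
# frequencies; the filter is then applied bin-wise in the frequency domain.
G = wiener_freq(h=np.array([1.0, 0.8, 0.3]), s_uu=1.0, s_nn=0.01)
y = np.random.default_rng(0).standard_normal(1024)
u_hat = np.fft.ifft(G * np.fft.fft(y)).real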
Chapter 3
Numerically Stable Gaussian Message-Passing
In practice, the application of Gaussian message-passing algorithms may be limited by numerical stability problems. While these issues are relatively certain to appear for large, complex problems, in some cases moderately sized problems are affected as well.
In this chapter, we consider methods applicable when standard Gaussian message-passing implementations experience numerical instabilities. After outlining our methodology for assessing numerical stability (Section 3.1), we establish that fast-sampled SSMs may cause poor numerical properties of message-passing methods (Section 3.2). Next, we explore SSM reparametrizations and show how this idea can be exploited to improve the robustness of model-based methods (Section 3.3). Finally, a novel message-passing scheme, called square-root message passing, is introduced (Section 3.4). Throughout, we focus on time-invariant single-input single-output models, but several underlying principles carry over to more general SSM classes.
3.1 Methodology

Assessing the numerical properties of inference or model estimation in linear SSMs (i.e., estimation of the state-transition matrix $A$, input matrix $B$, and output matrix $C$ representing a fixed-order SSM) is a very complicated task due to the recursive type of processing and the non-linear operations used. As a result, prior work on the numerical stability of these algorithms is limited: there exists some analysis of the numerical properties of (forward-only) Kalman filtering variants [53, 69], but no relevant results seem to be known for Kalman smoothing, let alone EM-based parameter estimation. In addition, known error bounds tend to be too loose.

Our goal is not to provide a formal analysis of the numerical stability properties of the message-passing algorithms relevant to this work, but to devise practical solutions extending their applicability.
In Kalman filtering, and more generally in message passing, inaccuracies caused by finite-precision implementation of the computations either i) come directly from an operation or ii) propagate and get amplified by the recursive algorithms. For Kalman filtering, it was shown in [69] that error propagation is mitigated as long as the covariances are kept symmetric. In contrast, we are interested in the numerical errors caused by each message update.¹ In Gaussian message-passing algorithms, updates boil down to basic matrix computations. While vector and matrix multiplications and additions are numerically stable, i.e., do not cause losses in significant digits, matrix inversions can be problematic.

¹ Another source of error encountered in message-passing computations on SSMs is in solutions to discrete algebraic Riccati equations (DAREs). We elaborate on numerical issues with these methods in Appendix E.
Tight bounds on (relative) numerical errors for matrix inversions, as well as for other complex matrix operations, can be given in terms of the condition number of the matrix, which is the ratio of its largest and smallest singular values,
$$\kappa(A) \triangleq \|A\|_2\, \|A^{-1}\|_2.$$
In the following, we concentrate on the condition numbers of the forward covariance matrix and the backward covariance matrix, as these are central in marginalization and estimation (see, e.g., the EM update rules in Algorithm A.3).
3.2 Decimation of Low-Pass-Type Systems
A common observation made with many sensors is that their frequency transfer characteristic exhibits a low-pass behavior. This behavior can be very pronounced when the sampling rate is high compared to the characteristic time constants of the physical sensor, and as a result useful information is only present in a narrow low-frequency band. An application where this issue occurs is discussed in Chapter 6.

[Figure 3.1: Condition numbers of $\vec{V}_{X_\infty}$ for random 8th-order systems, plotted against the mean value of their poles' frequencies. (a) General random systems; (b) resonant multi-mass systems (cf. Appendix F).]
When considering SSM-based algorithms, slowly varying band-limited systems may pose various challenges. In particular, for estimation and system identification, numerical issues may arise due to poorly conditioned covariances and precisions in Gaussian message passing. Note that poorly conditioned covariances, on the one hand, reduce the numerical precision of message updates that involve matrix inversions (e.g., computation of marginals in message passing for SSMs) and, on the other hand, suffer larger backward errors in the matrices themselves, as mentioned in [32]. We next illustrate with a numerical example how this issue occurs for random SSMs and is particularly strong for the multi-mass systems defined in Appendix F.
Example 3.1
We generate 500 realizations of two types of 8th-order discrete-time SISO SSMs and compute the condition number of the steady-state state covariance matrix $\vec{V}_{X_\infty}$ by solving the corresponding DARE [59]. For clarity of exposition, we measure the low-pass characteristic of an SSM by the mean value of its poles' frequencies. Results are shown in Figure 3.1. On the left, condition numbers of random systems generated by the Matlab command drss² are shown. The right figure presents random resonant multi-mass systems (cf. Appendix F), where random realizations are created by choosing random masses and randomly perturbing the damping coefficients of the multi-mass system models.

² The command drss is a standard routine that is widely accepted for evaluating system identification methods. It generates random SISO systems of a given order.
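A sketch of how one realization of this experiment might be computed (the system generator and noise levels here are stand-ins, and the steady-state covariance is obtained from a filtering-type DARE via the usual duality of SciPy's control-form solver):

import numpy as np
from scipy.linalg import solve_discrete_are

rng = np.random.default_rng(1)
d = 8
A = rng.standard_normal((d, d))
A *= 0.95 / np.max(np.abs(np.linalg.eigvals(A)))  # rescale to a stable system
b = rng.standard_normal((d, 1))
c = rng.standard_normal((1, d))
sigma_u2, sigma_n2 = 1.0, 0.01

# Predicted-state covariance DARE: V = A V A^T + sigma_u2 * b b^T
#   - A V c^T (c V c^T + sigma_n2)^(-1) c V A^T,
# solved with the substitution A -> A^T, B -> c^T in the control-form solver.
V = solve_discrete_are(A.T, c.T, sigma_u2 * (b @ b.T), np.array([[sigma_n2]]))
print(np.linalg.cond(V))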
In Figure 3.1, it can be seen that low-frequency poles lead to larger $\kappa(\vec{V}_{X_\infty})$. While for general systems the correlation is visible, the phenomenon is much more pronounced for the special class of resonant multi-mass systems, which feature a highly attenuated spectrum for frequencies above the resonant frequency and also close pole-zero pairs. In line with observations made with actual multi-mass system measurements (see Chapter 6), it can be concluded from the example that model-based message-passing algorithms for highly oversampled systems are often numerically unstable.
A simple but effective remedy for numerically unstable message-passing methods is to reduce the sampling frequency. In fact, for the SSM realizations in the previous example that exhibit a low-pass characteristic and thus poorly conditioned covariances, the pole frequencies would increase and the condition numbers would consequently decrease. In practice, this approach can be implemented by decimation of the signals before processing with message-passing-based methods, as sketched below.

When system identification is performed, care must be taken not to affect the model estimates through the low-pass filtering prior to downsampling. However, since such signals have, by definition, a low-pass characteristic, this is usually not an issue. Model identification on decimated signals followed by normal-rate filtering with a fixed model was adopted in the methods described in Chapter 6 and showed very promising results.
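In code, decimation before message-passing-based processing can be as simple as the following sketch (the rates, decimation factor, and test signal are placeholders):

import numpy as np
from scipy.signal import decimate

fs = 100_000.0                   # hypothetical, heavily oversampled rate
t = np.arange(200_000) / fs
y = np.sin(2 * np.pi * 50.0 * t) \
    + 0.1 * np.random.default_rng(0).standard_normal(t.size)

# Built-in anti-alias low-pass filtering followed by downsampling by 8;
# identification then runs at fs / 8, while filtering with the identified
# model can still run at the full rate.
y_dec = decimate(y, 8, ftype='iir', zero_phase=True)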
3.3 State-Space Reparametrization
Let $T \in \mathbb{R}^{n \times n}$ be an invertible matrix and assume an SSM of order $n$. A reparameterization of the state, or equivalently of the SSM, with $T$ maps the state vectors $X_k$ onto new state vectors
$$\bar{X}_k = T X_k. \qquad (3.1)$$
[Figure 3.2: Factor graph of an SSM where the states are transformed as $\bar{X}_k = T X_k$; for the sake of graphical presentation, $S \triangleq T^{-1}$.]
The SSM parameters are then adapted such that the input-output behaviour of the SSM is preserved:
$$\bar{A} = T A T^{-1}, \qquad \bar{B} = T B, \qquad \bar{C} = C T^{-1}. \qquad (3.2)$$
Reparameterizations can be shown graphically as well: they correspond to introducing the deterministic factors $T$ and $T^{-1}$ at each time step and then moving these factors in the graph. The result is demonstrated in Figure 3.2.

From the graphical representation in Figure 3.2, it is evident how Gaussian messages are transformed when reparametrizing an SSM. In particular, we see that the covariance and precision matrices of different parametrizations (transformed by (3.1)) are related by
$$V_{\bar{X}_k} = T V_{X_k} T^{\mathsf{T}}, \qquad W_{\bar{X}_k} = T^{-\mathsf{T}} W_{X_k} T^{-1}. \qquad (3.3)$$
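A quick numerical check of (3.2), a sketch with random matrices: the Markov parameters $C A^k B$, and hence the input-output behaviour, are unchanged by the transformation.

import numpy as np

def reparametrize(A, B, C, T):
    # State transformation (3.1)-(3.2).
    Tinv = np.linalg.inv(T)
    return T @ A @ Tinv, T @ B, C @ Tinv

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 1))
C = rng.standard_normal((1, 3))
T = rng.standard_normal((3, 3)) + 3.0 * np.eye(3)  # some invertible transform

Ab, Bb, Cb = reparametrize(A, B, C, T)
for k in range(5):  # compare the Markov parameters C A^k B
    assert np.allclose(C @ np.linalg.matrix_power(A, k) @ B,
                       Cb @ np.linalg.matrix_power(Ab, k) @ Bb)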
3.3.1 Numerical Robustness
One cause of numerically poor estimation in SSMs is states that have widely different dynamic ranges and amplitudes, which can lead to poor condition numbers of the covariance matrix or the precision matrix. This effect is illustrated with a very simple example below.
Example 3.2
Consider a simple SSM of order 2 with parameters
$$A = \begin{pmatrix} \rho & 0 \\ 0 & -\rho \end{pmatrix}, \qquad b = \begin{pmatrix} 1 \\ \epsilon \end{pmatrix}, \qquad c = \begin{pmatrix} 1 & 1 \end{pmatrix}.$$
We are interested in the steady-state covariance matrix and its condition number, given that observations of the output are corrupted by Gaussian noise with variance $\sigma_N^2$ and the SSM is driven by Gaussian noise with variance $\sigma_U^2$. The steady-state covariance matrix for $\rho = 0.99$, $\epsilon = 0.01$, $\sigma_N^2 = 0.01$, and $\sigma_U^2 = 1$ is
$$\vec{V}_X = \begin{pmatrix} 1.01 & 0.01 \\ 0.01 & 0.0002 \end{pmatrix},$$
and its condition number is already $1.56 \cdot 10^4$. At the root of this relatively high condition number lie the different scales of the two states. To approach this issue, we reparameterize the states by $\bar{x} = Tx$ with
$$T = \operatorname{diag}(1, 100)$$
and the SSM according to (3.2); the condition number of $\vec{V}_{\bar{X}}$ is subsequently reduced to 8.5.
Another issue that arises when choosing a specific parameterization is the sensitivity of the SSM's properties to small changes in the coefficient values. The AR form, where $A^{\mathsf{T}}$ is also known as a companion matrix [28], is particularly susceptible to this issue (see [70, Example 7.4]). Nevertheless, the AR form is often the only possible parametrization choice³ for EM-based parameter identification. In Appendix E, we therefore provide a measure to assess this sensitivity in particular situations. Experimental observations regarding the AR form are given in Section 6.5. The sensitivity of the parameterization can also be a relevant factor for the condition number of the DARE [32], resulting in potentially less accurate solutions.

³ EM-based system identification for other parametrizations (e.g., full system matrix [16] or block-diagonal form [59]) requires state noise with full-rank covariance matrices. For single-input systems of order larger than 1 that are only subject to input noise, the state noise covariance matrices are not full rank.
When the SSM parametrization can be chosen freely, a natural question is which representation yields the most stable numerical message-passing algorithms. In the following, we focus on the condition numbers of covariance and precision matrices as a metric for the numerical robustness of message-passing computations. Inspired by Example 3.2 and by balanced reduction of high-order state-space models to low-order models (see, e.g., [60]), we propose the following method to improve the conditioning of SSM representations and hence also make message-passing implementations more robust.
Algorithm 3.1
1) Compute the steady-state covariance matrix $\vec{V}_{X_\infty}$ and precision matrix $\overleftarrow{W}_{X_\infty}$ for the current state-space parameterization.
2) Obtain a transformation $T$ that simultaneously diagonalizes $\vec{V}_{X_\infty}$ and $\overleftarrow{W}_{X_\infty}$ and also attempts to "balance" the eigenvalues of the matrices.
3) Transform the SSM using $T$ and $T^{-1}$ and perform the computations in the transformed SSM.
In the second step, a transformation matrix $T$ is sought that simultaneously diagonalizes $\vec{V}_{X_\infty}$ and $\overleftarrow{W}_{X_\infty}$ while also improving their respective condition numbers by making large diagonal values smaller and small ones larger. As shown in [28, p. 500], we can always find a transformation such that two symmetric positive semi-definite matrices are both diagonalized. Once both matrices are diagonalized, we are still free to choose $d$ parameters to make the $2d$ diagonal values of both matrices more even. In our case, we obtain such a transformation from the following steps.
Algorithm 3.2: Simultaneous Diagonalization of Steady-State Matrices
1) Compute the eigenvalue decomposition of $\vec{V}_{X_\infty}$, i.e.,
$$Q \Lambda Q^{\mathsf{T}} \leftarrow \vec{V}_{X_\infty}.$$
2) Perform the eigenvalue decomposition
$$D \Xi D^{\mathsf{T}} \leftarrow \Lambda^{1/2} Q^{\mathsf{T}} \overleftarrow{W}_{X_\infty} Q \Lambda^{1/2},$$
where $\Lambda = \Lambda^{1/2} \Lambda^{1/2}$.
3) Obtain $T$ from
$$T \leftarrow \Xi^{1/4} D^{\mathsf{T}} \Lambda^{-1/2} Q^{\mathsf{T}}$$
and $T^{-1}$ from
$$T^{-1} \leftarrow Q \Lambda^{1/2} D\, \Xi^{-1/4}.$$
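A direct NumPy transcription of Algorithm 3.2 (a sketch; eigh is used since both matrices are symmetric), together with a check on random stand-in matrices that both transformed steady-state matrices indeed coincide:

import numpy as np

def balancing_transform(V, W):
    """Algorithm 3.2: simultaneous diagonalization of the steady-state
    covariance V and precision W, returning T and its inverse with
    T V T^T = T^{-T} W T^{-1} = Xi^{1/2}."""
    lam, Q = np.linalg.eigh(V)                     # step 1: V = Q Lam Q^T
    Lh = np.diag(np.sqrt(lam))                     # Lam^{1/2}
    Lhi = np.diag(1.0 / np.sqrt(lam))              # Lam^{-1/2}
    xi, D = np.linalg.eigh(Lh @ Q.T @ W @ Q @ Lh)  # step 2: = D Xi D^T
    T = np.diag(xi ** 0.25) @ D.T @ Lhi @ Q.T      # step 3
    Tinv = Q @ Lh @ D @ np.diag(xi ** -0.25)
    return T, Tinv

rng = np.random.default_rng(0)
Mv, Mw = rng.standard_normal((2, 4, 4))
V = Mv @ Mv.T + 1e-3 * np.eye(4)  # random SPD stand-ins for the
W = Mw @ Mw.T + 1e-3 * np.eye(4)  # steady-state covariance and precision
T, Tinv = balancing_transform(V, W)
assert np.allclose(T @ V @ T.T, Tinv.T @ W @ Tinv)  # both equal Xi^{1/2}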
By inspection, it is seen that the computed $T$ transforms the SSM such that the steady-state matrices (cf. (3.3)) are diagonal:
$$T \vec{V}_{X_\infty} T^{\mathsf{T}} = \Xi^{1/2}, \qquad T^{-\mathsf{T}} \overleftarrow{W}_{X_\infty} T^{-1} = \Xi^{1/2}.$$
Now, observe that
$$T^{-\mathsf{T}} \overleftarrow{W}_{X_\infty} T^{-1}\; T \vec{V}_{X_\infty} T^{\mathsf{T}} = T^{-\mathsf{T}} \overleftarrow{W}_{X_\infty} \vec{V}_{X_\infty} T^{\mathsf{T}} = \Xi. \qquad (3.4)$$
Typically, the eigenvalues of $\overleftarrow{W}_{X_\infty} \vec{V}_{X_\infty}$ are much closer to 1 than those of the steady-state matrices $\vec{V}_{X_\infty}$ and $\overleftarrow{W}_{X_\infty}$. The similarity transformation in (3.4) then implies that the diagonal entries of $\Xi^{1/2}$ are more even, and hence the transformed steady-state matrices will have a better condition number.⁴

⁴ The product in (3.4) also shows another point: no matter what $T$ we select, the eigenvalues of the product are always the same, and, in some sense, distributing the product evenly among the two matrices is a reasonable choice.
3.3.2 State-Space-Model Identification
When considering the estimation of SSM parameters under reparametrizations, a natural question is whether the convergence properties of the EM algorithm can be improved by a suitable state transformation. This question can be answered negatively:
Proposition 3.1
Let $\ell(\theta) = \log p(y \mid \theta)$ be the likelihood of a linear time-invariant SSM, where $\theta$ is defined as
$$\theta = \begin{pmatrix} \operatorname{vec}(A) \\ \operatorname{vec}(B) \\ \operatorname{vec}(C) \end{pmatrix}.$$
Let $\hat{\theta}$ be a stationary point of the likelihood, i.e., $\nabla \ell(\hat{\theta}) = 0$, and assume that the EM estimates $\theta^{[k]}$ converge to $\hat{\theta}$. Then the local convergence rate $\rho_{\hat{\theta}}$ (cf. Section 2.2) is invariant to the parametrization of the SSM.
The proof is provided in Appendix B on page 135.
Remark 3.1
From the proof of Proposition 3.1, it is easily seen that for non-linearly parametrized SSMs⁵ the local convergence rate is also invariant to changes of the SSM form. In the proof, the transformation $T$ is replaced by the respective Jacobian matrix.

⁵ A simple yet relevant example is a second-order SSM with the transition matrix parametrized by $\rho$ and $\omega$:
$$A = \rho \begin{pmatrix} \cos\omega & -\sin\omega \\ \sin\omega & \cos\omega \end{pmatrix}.$$
Obviously, $\omega$ is non-linearly related to the entries of $A$. For more examples of non-linear parametrizations, refer to, e.g., [59, Section 3.2.2].
3.4 Square-Root Message Passing
The computation of numerical values for the mean, covariance, and precision matrices is key in Gaussian message-passing implementations. The numerical stability of these computations is determined by the largest condition numbers of the matrices involved in the message-passing calculations and by asymmetric errors⁶ in the numerical values of symmetric matrices [69, 74]. It is possible to ensure that second-order variables are strictly symmetric by utilizing "square roots" instead of full matrices. Assume $S$ is the square root of a symmetric positive definite matrix $M$. Apart from preserving symmetry, the square root has a reduced condition number,
$$\kappa(S) = \sqrt{\|S^{\mathsf{T}} S\|_2\, \|S^{-\mathsf{T}} S^{-1}\|_2} = \sqrt{\|M\|_2\, \|M^{-1}\|_2} = \sqrt{\kappa(M)},$$
and the usage of numerically stable orthogonal transformations in each recursion step improves reliability for poorly conditioned systems.

⁶ The numerical value of a symmetric matrix $M$ is affected by asymmetric errors if $M - M^{\mathsf{T}} \neq 0$.
We present here a square-root technique for Gaussian message passing based on the square-root Kalman filtering techniques from [38]. The workhorse will be the numerically robust (Householder) QR decomposition⁷ [28]. Eventually, we extend the square-root approach to the computation of EM messages for learning linear SSMs.

⁷ The QR decomposition factors any matrix $M \in \mathbb{R}^{n \times m}$ into $M = QR$, the product of an orthogonal matrix $Q \in \mathbb{R}^{n \times n}$ and an upper triangular matrix $R \in \mathbb{R}^{n \times m}$.
3.4.1 Gaussian Square-Root Message Passing

Let the factors of covariance matrices $V \in \mathbb{R}^{d \times d}$ and precision matrices $W \in \mathbb{R}^{d \times d}$ be defined as
$$V \triangleq C^{\mathsf{T}} C, \qquad W \triangleq S^{\mathsf{T}} S. \qquad (3.5)$$
A special case is where $C$ and $S$ are upper triangular matrices and correspond to Cholesky factors. Since covariance matrices are by definition symmetric and positive definite, there always exists a unique decomposition into Cholesky factors. When the matrices are merely positive semi-definite, a Cholesky decomposition still exists, but it is not unique. Note that in general the factor messages need not be upper triangular, but they are always related to the Cholesky factors by an orthogonal transformation.
Akin to standard Gaussian message passing, scaled normal densities will be parametrized either by the matrix-vector pair $(C, Cm)$ or, equivalently, by the pair $(S, Sm)$. If the factors are full rank, recovering the mean vector $m$ from either $Cm$ or $Sm$ can be done in a computationally efficient and numerically stable fashion (e.g., by back substitution) [28].

To take full advantage of the superior numerical properties of square-root matrices, it is necessary to avoid squaring of factors and to use rotation-based algorithms such as the QR decomposition [28]. Let us show this by means of the following example.
Example 3.3
→
−
Consider the update of the precision factor SZ and corresponding vector
→ −
−
→ for the case of an “=”-factor and look at the QR decomposition
SZ m
Z
of the matrix
→
−
→ → !
−
→ −
−
→ → !
SX SX −
mX
SZ SZ −
mZ
=Q
,
(3.6)
→
−
→ −
−
→
0
×
S
S m
Y
Y
Y
where × stands for an arbitrary vector. Multiplying both sides of (3.6)
with its respective transpose provides
→T −
−
→
→ −
−
→
→T −
−
→ −
→T −
→ → !
→T −
−
→ −
→ −
→ → !
→ +−
SX SX + SYT SY SX
SX m
SY SY −
mY
SZ SZ SZT SZ −
mZ
X
=
.
×
×
×
×
(3.7)
We recognize with (3.5) that the first entry on the lefthand side corresponds to the well known precision matrix update. This implies that
→T −
−
→
−→
→
−
SZ SZ = WZ and SZ is the Cholesky factor of the updated covariance
→ →
−
matrix. An analogous conclusion can be drawn for SZ −
mZ .
In the following Tables 3.1 to 3.3, the orthogonal matrix Q in the update
equations is decomposed into two 2d × d blocks Q1 and Q2 and applied
from the left to the equations. Two obvious conclusions can be drawn.
→
−
→→
−
First, the updates for the factor S and the vector S−
m may obviously
be split into two distinct computations. The update rule corresponding
to (3.6) is then
→ !
−
→ !
−
T
SX
SZ
Q1 Q2
,
=
−
→
{z
}
|
0
SY
orthonormal matrix
28
Numerically Stable Gaussian Message-Passing
− −
→
→ = QT
SZ m
Z
1
!
−
→ −
→
SX m
X
.
→ −
−
→
S m
Y
Y
Second, computational overhead can be reduced, as the orthogonal matrix Q1 remains constant and thus does not have to be recalculated for
every input vector in steady-state scenarios. Given that QR decomposition are complex matrix operations, savings can be substantial.
The mean can be propagated in standard form or in a novel adapted
form. By coincidence, all updates performed on the standard mean vectors in square-root updates below use “square-root” quantities and hence,
numerical accuracy is not an issue when using the standard mean vector
alongside factor messages for the covariance and precision matrices.
In our proposed method, we use the mean m together with the covariance factor C and the vector Sm in conjunction with propagation of the
precision factors S, which has the appealing property that it does not
change for factor updates.
In Tables 3.1 to 3.3 we provide the square-root message updates commonly used in Gaussian message passing for SSM. The updates are expressed using block matrices. Matrices used to transform the messages
and obtain the output message are denoted by Q. As noted above, these
matrices and/or the respective output message shall be computed with
numerically stable algorithms (e.g., QR decomposition or Householder
transformation) to benefit from the improved numerical properties of
square-root algorithms. If not stated otherwise, the factor messages are
d × d matrices and vectors have dimensions d. All expressions can be
proven analogously to (3.6) and (3.7) i.e., by “squaring” both sides of
the relations and recalling the corresponding Gaussian message passing
rule.
Note that in (I.3)–(I.5) the matrix A does not need to be triangular.
→
−
Consequently, the resulting factor CY will in general neither be upper
triangular.
A practical smoothing algorithm with square-root message passing that
combines updates from Table 3.1 to Table 3.3 is shown in Appendix A.1
on page 127.
3.4 Square-Root Message Passing
Node
Update rule
Assume proper (i.e., invertible V) density:
X
−T −
→
→
CX CX = V,
→T −
−
→
SX SX = V−1 ,
N (m, V)
X
29
A
Y
−
→ = m,
m
X
→ −
−
→ = V−1/2 m.
S m
X
For arbitrary A ∈ Rn×m .
Forward:
→
−
→
−
CY = CX AT ,
−
→ = A−
→ .
m
m
Backward:
=
Z
(I.3)
(I.4)
X
←
−
←
−
SX = SY A,
←
− ←
− ←
− =←
− .
SX m
SY m
X
Y
− !
→
SZ
0
Y
(I.2)
X
Y
X
(I.1)
= Q1
− −
→
→ = QT
SZ m
Z
1
Q2
T
− −
→
→
SX m
X
→ −
−
→
S m
Y
(I.5)
(I.6)
− !
→
SX
,
→
−
SY
!
(I.7)
,
(I.8)
Y
with Q1 ∈ R2d×d .
X
+
Y
Z
− !
→
CZ
0
= Q1
Q2
T
− !
→
CX
,
→
−
CY
−
→ =−
→ +−
→ .
m
m
m
Z
X
Y
(I.9)
(I.10)
←
−
→
−
←
−
For CX , replace CX by CZ in (I.9) and reverse
→ by ←
− in (I.10).
the sign and replace −
m
m
X
Z
Table 3.1: Standard message updates in square-root form.
30
Numerically Stable Gaussian Message-Passing
Node
Z
X
=
Y
Update rule
→ ←
−
−
If I + CX SYT is nonsingular:
!
R G
I
T
= Q1 Q2
→
−
→ ←
−
−
0 CZ
CX SYT
0
→
−
CX
!
,
(II.1)
− ←
←
− −
−
→
−
→
−
→
T −1 ←
mZ = mX + G R ( SY mY − SY mX ) ,
(II.2)
where R is upper triangular.
X
+
Y
Z
×
0
×
−
→
SZ
!
= Q1
−
→ −
→ = QT
SZ m
Z
1
Q2
T
− −
→
→
SX m
X
→ −
−
→
S m
Y
−
→
SX
→
−
SY
!
,
0
→
−
SY
!
, (II.3)
(II.4)
Y
←
−
−
→
←
−
with Q1 ∈ R2d×d . For SX , replace SX by SZ in
−
→ →
(II.3) and add a minus sign in front of SY −
mY and
−
→ −
←
−
→
←
−
replace SX mX by SZ mZ in (II.4).
Table 3.2: Additional square-root message passing updates.
3.4 Square-Root Message Passing
Node
X
31
Update rule
=
Z
R
0
A
G
→
−
CZ
!
T
=Q
−
→
CY
→ T
−
CX A
0
→
−
CX
!
,
(III.1)
→
−
→
−
If CY + CX AT is nonsingular:
−
→ =−
→ + GT R−T (−
→ − AT −
→ ).
m
m
m
m
Z
X
Y
X
(III.2)
←
−
→
−
←
−
→ by
For CX , replace CX by CZ in (III.1) and −
m
X
←
−
m in (III.2).
Y
Z
X
+
A
Y
Z
Suppose A ∈ Rn×m
!
R G
T
= Q1 Q2
→
−
0 SZ
!
−
→
SY
0
,
→
−
−
→
SX A SX
(III.3)
with Q2 ∈ Rn+m×n then
!
→ −
−
→
SX m
→ −
−
X
→
T
SZ mZ = Q2 −
,
→ −
→
SY m
Y
(III.4)
−
→
−
→
or if SY + SX A is nonsingular:
→ −
−
→ −
→ =−
→ + GT R−T (−
→ − AT −
→ ).
SZ m
SX m
m
m
Z
X
Y
X
(III.5)
←
−
→
−
←
−
For SX , replace SX by SZ in (III.3) and, in ad→ −
→ , −
→ by
dition to reversing the sign of −
m
SX m
Y
X
←
− ←
−
−
→
←
−
SZ mZ in (III.4) or mX by mZ in (III.5).
Table 3.3: Square-root message passing updates for composite blocks.
32
3.4.2
Numerically Stable Gaussian Message-Passing
Computational Optimization
In the next part we present ways to lower the computational complexity of square-root message passing. In all updates with d-dimensional
messages, the matrix Q can be computed in 4d3 + O(d2 ) flops8 from the
QR-decomposition of the block matrices by employing a Householder QR
decomposition9 . When using the alternative update (3.6) (e.g., when
the update with the respective factors is just performed once), the orthonormal matrix Q is not necessary and computational complexity may
be saved by employing a Householder triangularization with complexity
4 3
2
3 d + O(d ) flops [28].
Another improvement over straightforward application of the squareroot message update rules is to lower the number of required QR decompositions by propagating non-upper triangular matrices. It is easily
seen that the ideas behind the square-root technique also apply when
the factors are not necessarily upper triangular (see (3.1) to (3.3)).
3.4.3
Expectation-Maximization Updates
We first note that many EM-algorithms to estimate e.g., SSM parameters A, b, and c, may be expressed as Gaussian messages [16]. Hence,
the idea of square-root factors can be extended to EM-algorithms and is
pursued further in the following section.
Two steps, viz. the E-step and M-step, are performed in EM-algorithms.
In the M-step, factors of the EM messages are propagated through “=”
nodes and (I.7) and (I.8). In the E-step, marginal Gaussian probability
densities in an SSM must be computed first. To this end, a square-root
message-passing based smoothing algorithm as e.g., Algorithm A.1, can
be employed to yield the marginals. Secondly, squaring of the marginal
factors must be avoided also in the calculation of (2.2) to carry the
favorable numerical properties over to the EM algorithm. We illustrate
this concept with the computation of the square-root EM message for A
in autoregressive form.
8 One flop corresponds to either one of the following floationg-point operations:
multiplication, division, addition or subtraction.
9 The modified Gram-Schmidt-based QR decomposition can also be used to obtain
the upper-triangular factor in the QR decomposition. Even though it is more efficient
in terms of flops, its numerical stability guarantees are weaker than the ones of the
Householder QR [28]. It is therefore not well suited for the problem at hand.
3.4 Square-Root Message Passing
33
←
− ←− ←
− ↑ a
Sa , Wa m
a
N (0, σ 2 )
e1
A
X
×
+
Y
Figure 3.3: Factor graph illustration of square-root message passing
and EM algorithms in Example 3.4. The matrix A is in
AR form and the input noise is one-dimensional with e1 =
[1, 0, . . . , 0]T .
Example 3.4
Consider computation of the EM messages for a in Figure 3.3, which
is essential for EM-based estimation of the AR-form matrix A, with
→
−
←
−
square-root messages. We assume that CX and SX are given. The goal
←− ←
− : the factor and mean of the EM
is then to compute Sa and Wa m
a
message for a.
Given the square-root factor of the marginal probability over X and Y ,
observe that the precision of the EM message with standard Gaussian
marginals [16, Equations (III.7)]) can be expressed as
←−
Wa = σ −2 VX + mX mTX
!
!T
CX
CX
−2
,
=σ
mTX
mTX
where VX and mX are the covariance matrix and mean of variable X
←
− −
in Figure 3.3. The square-root EM factor Sak ←
mak is thus in uppertriangular form:
!
←
− ←
−
Sak m
ak
0
= Q1
Q2
T
CXk
mTXk
!
.
Computations can be saved by propagating non-upper-triangular fac-
34
Numerically Stable Gaussian Message-Passing
←
− −
tors Sak ←
mak and hence
←
− ←
− =
Sak m
ak
CXk
!
.
mTXk
The square-root factor of the marginal probability over
!
X
Z,
Y
can be computed by (II.1) using the forward message
!
−
→
→ !
−
T CX I AT
CZ
= Q1 Q2
0
0 σ 2 eT1
with e1 = [1, 0, . . . , 0]T and the backward message
←
−
←
−
SZ = 0 SY .
The resulting square-root factor has form
CZ =
CX
ΓXY
0
R
!
,
(3.8)
with R upper-triangular. By squaring CZ it becomes evident that the
top right corner of the matrix in (3.8) is the desired factor of the marginal
probability over X and that
VXY = XTX ΓXY .
(3.9)
The marginal mean follows from (II.2) analogously to the computation
of factor CZ , so
!
!
−
mX
I −
− −
→ + GT R−1 ←
− −←
→
=
m
SY ←
m
SY Am
(3.10)
X
Y
X
mY
A
with the matrices G and R given as byproduct of the computation
in (II.1) of CZ .
←− −
From [16, Equations (III.8)]) the EM-message vector Wak ←
mak is
←− ←
− = σ −2 CT Γ
T
Wak m
ak
X XY + mX mY e1 ,
3.4 Square-Root Message Passing
X
···
=
35
X0
=
X 00
=
···
N (m, V)
ck−1
ck+1
ck
Yk−1
Yk
Yk+1
N (yk−1 , σ 2 ) N (yk , σ 2 ) N (yk+1 , σ 2 )
Figure 3.4: Prior and two steps of an RLS problem.
where mX is given in (3.9) and mY in (3.10).
In practical implementations, computations are performed from right to
−
→−1 −
→−T
→−1
−
left in the and V X
= CX
CX 0
is solved by two applications of
0
0
k−1
k−1
k−1
back substitution [28].
3.4.4
Experimental Results
We apply the proposed square-root message passing rules to a poorly
conditioned recursive least-squares (RLS) problem and show the benefits in terms of numerical accuracy by comparison to the standard RLS
(message passing) and the two versions of the full least-squares problems.
Consider an unknown random variable X ∈ Rd with prior probability
N (m, V) and with noisy projections yk = ck X for k = 1, . . . , N as
depicted in the factor graph in Figure 3.4.
Using standard Gaussian message passing the least-squares estimate of
X can be computed for all k = 1, . . . , N by means of the forward recursion
−→
−→
1
WX ← WX + 2 cTk ck ,
σ
−→ −
→ →
yk
→ ←−
WX m
WX −
mX + 2 cTk ,
X
σ
(3.11)
(3.12)
−→
−→ →
where WX and WX −
mX are initialized to V−1 and V−1 m respectively.
36
Numerically Stable Gaussian Message-Passing
−
→
→→
−
S and S−
m
−→
−→−
→
W and Wm
direct QR solver
direct matrix inversion
NMSE
10−6
10−7
200
400
600
800
1,000
N
Figure 3.5: Relative estimation error for various RLS implementations
depending on signal length N . Results were generated by
Monte-Carlo simulation with RLS of order n = 20 and
σ 2 = 10−6 . The observation vectors c were randomly generated and in a second step, the resulting random RLS
problem’s condition number was set to 105 .
The final Gaussian estimate is computed as the solution to:
−→
−→ →
WX x̂GMP = WX −
mX .
(3.13)
Square-root Gaussian message passing can be implemented with an analogous forward recursion: According to (I.2) and (I.7) the message updates at the kth equality factor can be expressed as
−
→
SX
0
!
− −
→
→
SX m
X
×
−
→
SX
T
←Q
1
σ ck
|
!
−
→ −
→
SX m
X
,
y
k
σ
{z
S
}
(3.14)
3.4 Square-Root Message Passing
37
with the initial messages
−
→
SX ← chol(V−1 )
− −
→
−
→ ←→
SX m
SX m.
X
(3.15)
(3.16)
The final estimate x̂SQ is given analogously to (3.13):
−
→
→ →
−
SX x̂SQ = SX −
mX .
(3.17)
To demonstrate the effectiveness of square-root message passing to address numerical issues, we implemented computation of x̂GMP and x̂SQ
with double-precision floating-point arithmetic. Both recursions exhibit
an O(d2 ) computational complexity per iteration. In (3.14) this complexity is achieved by utilizing Householder triangularizations and taking
advantage of the block structure of S. To compute the solution of the
linear system of equations in (3.13) a Cholesky-based solver is employed,
while (3.17) is solved by back substitution.
In addition we compare both methods with the numerical values obtained
by solution of the full least-squares system with
C , [cT1 , . . . , cTN ]T
(3.18)
and all y1 , . . . , yN stacked in a vector y. Two version of the least-squares
method were used to match the respective message-passing algorithm:
A squared approach with
CT C + σ 2 I x̂INV = CT y
involving the solution of a d × d linear system of equations and with the
QR decomposition of C directly applied to C.
We perform a Monte-Carlo simulation with 500 simulation runs, random
X ∈ R20 , and a poorly conditioned matrix C. To obtain a random
matrix C, first iid random Gaussian vectors are drawn and then stacked
as in (3.18). In a second step, all singular values of this matrix are scaled
such that the ratio between the maximum value and the minimum value
i.e., the condition number, is 105 . The NMSE of estimates with different
algorithms is compared to the true value and plotted for various RLS
lengths N in Figure 3.5.
In Figure 3.5, note that both approaches that are based on the “squared”
values of the matrices, i.e., estimates x̂GMP and x̂INV floor and are affected by a numerical error on the order of 10−1 . On the other hand, the
38
Numerically Stable Gaussian Message-Passing
proposed square-root message passing algorithm and its counterpart the
QR-based least-squares solver have virtually the same numerical accuracy (the two curves lie on top of each other in Figure 3.5) and no error
floor in the evaluated range.
3.5
Conclusion
We have proposed methods that improve numerical stability of Gaussian
message passing in model-based signal processing. The methods differ
primarily in their scope of application, with the most specific one being decimation of SSMs, applicable to strongly bandlimited SSMs when
results at lower time resolutions are acceptable (cf. model-based filter
in Chapter 6). When the parametrization of SSMs is unrestricted e.g.,
in input estimation and smoothing, balanced state-space realization may
be applied. However, for many system-identification tasks, numerical
stability is an issue but reparametrization is not a viable option, since
state-space realization fixed (cf. results in Section 6.5). One approach
to improve numerical stability in these cases and in a wide range of
applications, is to tackle the problem at its origin: the representation
of Gaussian pdfs themselves. In square-root message passing, numerical
accuracy is instantly improved by propagating the square-root decomposition of the matrices. We have tabulated important computation rules
for the square-root message representation and have outlined novel algorithms for EM methods that are representable as Gaussian message
passing algorithms. Another pertinent Gaussian message representation
that improves numerical robustness of inference methods for a large class
of SSMs is shown in Chapter 4.
Chapter 4
Gaussian Message
Passing with
Dual-Precisions
Matrix inversions, especially for large matrices, are undesirable for reasons of computational complexity and their potentially poor numerical
properties (cf. Section 3.1). There exist Kalman filtering techniques i.e.,
the Modified Bryson–Frazier (MBF) smoother [7] and Gaussian message
passing schemes [47, Algorithm E] that avoid matrix inversion to compute marginal probabilities of the SSM’s state variables. In this chapter,
we complement and formalize the latter approaches and present a pertinent Gaussian message passing variant that in many cases yields matrix
inversion free smoothing-type algorithms. Eventually, we derive a novel
efficient and matrix inversion free algorithm to compute marginal probabilities for inputs in SSMs with uncorrelated observation noise.
In the beginning, we derive message passing rules for commonly used in
linear Gaussian factor graphs (Section 4.1 and Section 4.2). Using these
expressions we derive efficient versions of relevant smoothing algorithms
that utilize Gaussian message passing (Section 4.3).
4.1
Dual-Precision
In [47] the authors present a forward-backward message passing scheme
for SSMs that does not require matrix inversions (Algorithm E). It performs a forward (in time) recursion using Gaussian message represented
39
40
Gaussian Message Passing with Dual-Precisions
−
→
−
←− −
→ followed by a backward recursion based on ←
as V and −
m,
W and W←
m. In
a third step, marginal posterior probability densities are computed with
variables V, W̃, and m by alternating the message updates:
←−
←−
←−
W̃Xk = WXk − WXk VXk WXk
0
W̃Xk−1
= AT W̃Xk A
−
→
−
→
−
→
VXk = VXk0 − VXk0 W̃Xk0 VXk0
− ←
→
→ 0 −−
→ 0 +V 0←
−
mXk = −
m
VXk0 W̃Xk0 −
m
Xk
Xk
Xk WXk0 mXk0 ,
(4.1)
(4.2)
(4.3)
where Xk and Xk0 are the kth state before an “=”-factor and after it.
Additionally to the auxiliary variable
−
→
←
− −1
W̃X , V X + V X
,
called dual precision in [47], we also define the new auxiliary variable
→ −←
− ).
W̃X µX , W̃X (−
m
m
X
X
(4.4)
As is evident from Table 4.1, W̃µ mirrors distinctive properties of W̃
(e.g., invariant at “+”-factors).
We can now express (4.3) as follows
−→ →
←− −
mX = VX (WX −
mX + WX ←
mX )
−
→
−
→
−
→ −→ −
→ −→
←− −
→ +−
= (VX − VX W̃X VX )WX m
VX WX VX WX ←
mX
X
−
→
→ − V W̃ (−
→
←
−
=−
m
X
X
X mX − mX ),
|
{z
}
(4.5)
W̃X µX
−→
←−
where W̃X = WX VX WX was used in the last step.
The variables W̃ and W̃µ exhibit a few remarkable properties. The
marginal covariance and the marginal mean can be retrieved by means
of (4.2) and (4.5) without matrix inversions or having to solve a linear
system of equations. Furthermore, both quantities can be propagated
←−
←− −
backwards through factor nodes in the same way as W and W←
m, and
they are invariant at “+”-factors [47].
4.2 Message Passing Tables
4.2
41
Message Passing Tables
We recognize that in Algorithm E [47] the backwards recursion is only
used in (4.1) and (4.3). All other computations depend on quantities
←−
←− −
from the forward recursion and W and W←
m. Hence, an alternative
relation to (4.1) and (4.3) is sought for, making the backward recursion
←−
←− −
with W and W←
m obsolete and yielding significant computational savings.
This relation can be found, by observing that marginals are invariant at
equality factors (cf. Eq. (II.5) and (II.6) in [47]). Combining (4.2) and
(4.1) then results in the relation (IV.7) for W̃. For W̃µ an analogous
computation results in the expression (IV.8). Note that rule (IV.7) requires no matrix inversions when the variable Y has dimension one such
as e.g., one-dimensional observations in SSMs and only depends on the
−
→
messages V and the related variable G computed during the forward
recursion. Additional relations are listed in Table 4.1.
In Table 4.1, we synthesize relevant and useful expressions for W̃ and W̃µ
from [47] and provide the new expression for “=”-factors.
4.3
Algorithms Based on W̃ and W̃µ
We will apply the message passing rules in Table 4.1 to two common
tasks in SSM-based processing: smoothing in SSM and computation of
the expectation step in EM-based estimation of the SSM parameters. In
parallel, benefits of these algorithms will be highlighted.
4.3.1
Smoothing in SSMs
We are now ready to devise a smoothing scheme based on the auxiliary
quantities W̃ and W̃µ. Consider a length K vector of scalar observations
y1 , . . . , yK and a time-invariant1 nth order SSM. A factor graph repre−
→
senting time step k is shown in Figure 4.1. First, a forward pass with V
−
→
and m is performed, i.e., the covariance Kalman filter. For each message
update at a composite equality blocks the auxiliary matrices F and G
defined in (IV.9) and (IV.10) are utilized to compute the next message
1 All algorithms extend to time-variant scenarios. This property is only used to
keep notation simple.
42
Gaussian Message Passing with Dual-Precisions
Node
X
Update rule
Y
A
A∈R
Backward:
W̃X = AT W̃Y A
n×m
T
W̃X µX = A W̃Y µY
(IV.1)
(IV.2)
Forward:
W̃Y = (AT )† W̃X A†
T †
W̃Y µY = (A ) W̃X µX
(IV.3)
(IV.4)
†
where A denotes the pseudo-inverse of A as A
need not be full rank for the relations to hold.
X
+
Z
W̃X = W̃Y = W̃Z
Y
X
=
A
Y
W̃X µX = W̃Y µY = W̃Z µZ
(IV.5)
(IV.6)
Z
W̃X = FW̃Z FT + AT GA
(IV.7)
−
→
←
−
W̃X µX = FW̃Z µZ − AT G(Am X − m Y ) (IV.8)
−
→
F , I − AT GAVX
(IV.9)
−
→ T ←
− −1
G , (AVX A + VY )
(IV.10)
Analogous relations to compute W̃Z and W̃Z µZ
−
→
←
−
→ by
are obtained by replacing VX by VZ and −
m
X
←
−
mZ .
Table 4.1: Message passing rules for Gaussian message representation
with W̃ and W̃µ extending results in [47].
4.3 Algorithms Based on W̃ and W̃µ
43
N (0, VU )
Uk−1
B
···
0
Xk−1
A
+
Xk
Xk0
=
···
c
2
)
N (0, σN
+
yk
Figure 4.1: One time step in a time-invariant SSM with, possibly,
multi-dimensional input vectors and one-dimensional observations. Parameters of the SSM are the state-transition
matrix A, input matrix B, and output vector c and important variables are also denoted.
and stored as intermediate values along with the incoming message (i.e.,
−
→
→ . Next a backward recursion using the relations from TaV and −
m)
ble 4.1 and (auxiliary) variables stored in the forward pass is performed.
Assuming an uninformative final state, W̃ and W̃µ are initialized at
observation yN as:
1 T
c c
σ2
1
= 2 cT .
σ
W̃XN =
W̃XN µXN
Marginal probability densities follow directly from (4.2) and (4.5). If Xk
denotes the state variable left of the kth observation, then
−
→
−
→
−
→
VXk = VXk − VXk W̃Xk VXk
→
→ −−
mXk = −
m
VXk W̃Xk µXk .
Xk
44
Gaussian Message Passing with Dual-Precisions
Regarding the marginal of the input Uk , by the invariance of W̃ and W̃µ
and using (IV.1) and (IV.2) we readily obtain
VUk = VU − VU BT W̃Xk BVU
mUk = mU − VU BT W̃Xk µXk .
Marginals for Yk can be computed from VXk and mXk by multiplication
with c.
Algorithm 4.1: Dual-Precision Smoothing
Assume an SSM with inputs u1 , . . . , uK and observations y1 , . . . , yK . In
→
−
addition, denote by −
µinit and ←
µinit the initial forward and backward messages.
−
→
→ ←−
→
1) Initialize the forward message VX1 ← Vinit , −
m
m
X1
Xinit .
2) Pass recursively passing through k ∈ [1, . . . , K −1] and compute the
following message updates (intermediate results are only required
for smoothing):
i) Perform update through A factor and “+”-factor with:
−
→
−
→
2
0
VXk+1
← AVXk AT + σN
, BBT
−
→ 0 ← A−
→ .
m
m
Xk
Xk+1
ii) Update messages after observation yk+1 :
1
Gk+1 ← −
,
→
2
0
cVXk+1
cT + σN
−
→
0
Fk+1 ← I − VXk+1
cT Gk+1 c ,
−
→
−
→
0
VXk+1 ← Fk+1 VXk+1
,
−
→
−
→
−
→ 0 +G
0
m
cT yk+1 .
Xk+1 ← Fk+1 mXk+1
k+1 VXk+1
3) If the SSM ends with an open edge at k = K, initialize the backward
messages
1 T
c c,
σ2
1
← 2 cT ,
σ
W̃XK ←
W̃XK µXK
4.3 Algorithms Based on W̃ and W̃µ
45
with neutral values. Otherwise compute W̃XK and W̃XK µXK using (4.1) and (4.4).
4) Perform a backward (in time) sweep through k ∈ [K − 1, . . . , 1] and
compute the following message updates:
i) Pass message through the A factor:
W̃Xk0 ← AT W̃Xk+1 A,
W̃Xk0 µXk0 ← AT W̃Xk+1 µXk+1 .
i) Use (IV.7) and (IV.9) for update at “=”-factor:
W̃Xk ← FkT W̃Xk0 Fk + cT Gk c,
(4.6)
−
→
T
W̃Xk µXk ← Fk W̃Xk0 µXk0 − cT Gk cmXk0 − yk
5) Computation of posterior probability for Xk for any k ∈ [1, K] with
−
→
−
→
−
→
VXk = VXk − VXk W̃Xk VXk
→
→ −−
mXk = −
m
VXk W̃Xk µXk .
Xk
4.3.2
Steady-State Smoothing
Let the SSM be time-invariant and observable. Consider the processing of long signals or SSMs. Then, since the SSM is observable, the
−
→
covariance matrix VXk converges to the solution of the DARE
−
−1 −
−
→
−
→
−
→
→
→
VX∞ = AVX∞ AT − AVX∞ cT cT VX∞ c + σ 2
cVX∞ AT + BVU BT
(4.7)
for k large [38]. This is stated in the following proposition, which is
shown in Appendix B.2 on page 136.
Proposition 4.1: Steady-State Solution for W̃X∞
Let the SSM be time-invariant and observable. If the forward message−
→
passing is initialized with the steady-state matrix VX∞ , then the auxiliary
matrix W̃X converges to W̃X∞ and can be obtained from the solution of
a discrete Lyapunov equation
Alp W̃X∞ ATlp − W̃X∞ + Qlp = 0,
(4.8)
46
Gaussian Message Passing with Dual-Precisions
Per sample for smoothing
Matrix mult.
Matrix inv.
RTS [38]
4
1
Two-filter
6
1
Algorithm E [47]
10
0
W̃ and W̃µ
4
0
smoothing [38]
Storage
−
→ −
→
→ 0
VXk ,VXk0 , −
m
Xk
−
→ −
→
V ,m
Xk
Xk
−
→
→ ,
VXk , −
m
Xk
←− ←− ←
−
WXk0 ,WXk m
Xk
−
→
0
F ,G , m
k
k
Xk
Table 4.2: Comparison of computational and memory requirements of
various Kalman smoothing algorithms with our proposed
method. All figures are stated per time step. Operation
counts are restricted to the smoothing step, as all algorithms
use the same filtering recursions. Storage refers to all variables that need to be saved between forward and backward
sweeps.
with variables
−
→
Alp , AT (I − cT G∞ cVX∞ ),
Qlp , AT cT G∞ cA,
−
→
←
−
G∞ , (AVX∞ AT + VY )−1
Proposition 4.1 provides us with a linear equation2 to compute the
steady-state W̃X∞ , and hence all matrix quantities in our smoothing
scheme (Algorithm 4.1) offline and substantially reduce computational
complexity during online processing. In addition to reduced computational overhead for SSM smoothing, utilizing steady-state quantities
also possesses appealing numerical properties; computation of a solution
for (4.7) and (4.8) can be performed with numerically stable solvers,
yielding more accurate results than message passing iterations.
Departing from our initial assumptions and considering general multiinput multi-output SSMs, we see that the steady-state specialization of
2 Since the Lyapunov equation is linear, solution methods have better numerical
properties than the ones for the DARE.
4.3 Algorithms Based on W̃ and W̃µ
47
our smoothing algorithm also delivers marginal probabilities without any
matrix inversions needed. This follows as the matrix G can be obtained
as a byproduct when solving the DARE (4.7).
4.3.3
E-Step in SSMs
Consider estimation of the SSM (i.e., A, b, c) and the noise variances
with the EM algorithm and the variables given in the SSM shown in Figure 4.1. From inspection of the E-steps (cf. [16]) we recognize that the
marginal quantities VXk and mXk for all k are sufficient to compute the
update of c and σn2 and similarly, given all VUk and mUk the E step for
b and σu2 is readily computed. All these quantities can be obtained in a
computationally efficient manner from the algorithm in Section 4.3.1.
Estimation of A also relies on VXk and mXk , but additionally requires
0
and Xk , i.e.,
computation of the cross-covariance of Xk−1
VX 0
T
Xk
k−1
h
i
0
0
, E (Xk−1
− mXk−1
)(Xk − mXk )T ,
while, of course, we have
0
mXk−1
= mXk−1 .
Again the desired relation should only depend on variables that are com−
→
→ sweep through an SSM and the dual-precision
puted during a V and −
m
and avoid matrix inversions. It turns out that VX 0 XT may be comk−1
k
puted:
−
→
−
→ 0
VX 0 XT = VXk−1
AT I − W̃Xk VXk .
(4.9)
k−1
k
The proof of (4.9) is given in Appendix B.2 on page 136.
Remark 4.1
Observe that no assumptions on any matrix were done in the proof
←−
of (4.9). We conclude that (4.9) also holds when e.g., WXk or A do
not have full rank. Derivation of (4.9) directly from the relation given
in [16, Equation (IV.6)] is cumbersome, without assumptions on the rank
of involved matrices.
48
4.3.4
Gaussian Message Passing with Dual-Precisions
Continuous-Time Smoothing
For continuous-time SSM
dX(t) = AX(t) dt + bU (t) dt
Y (t) = cX(t) + N (t)
with observations yk = Y (tk ) at discrete moments tk for k ∈ [1, K].
Computation of posterior probabilities for X(t) at any tk ≤ t ≤ tk+1
−
→ →
may be computed by relying just on V, −
m, W̃, and W̃µ. Apart from advantages put forward for smoothing in discrete-time SSMs, this message
passing scheme is particularly attractive for continuous-time SSMs as
closed-form continuous-time input-estimation formulas can be obtained
(see the interpolation formula in Section 5.2).
To compute the smoothing most general case, i.e, smoothing, forward
−
→
→
message passing with VX(t) and −
m
veX(t) is done as in [10, (II.1) and
(II.2)].
For the backward sweep, use (IV.1) and (IV.2) with eAT instead of A,
while the “=”-factor updates are (IV.7) and (IV.8) and do not change.
Posterior probabilities of X(t) for any tk as in step 5) in Algorithm 4.1
−
→
and at any tk ≤ t ≤ tk+1 by computing first VX(t) with [10, (I.2)] and
then (4.2), whereas for mX(t) the formula
mX(t) = e
A(t−tk )
mX(tk ) −
2
σU
W̃X(tk ) µX(tk )
Z
t−tk
e
Aτ
T Aτ
bb e
dτ
0
(4.10)
may be used. If A is diagonalizable, the integral term can be expressed
in closed-form as done in [10, Section IV.B]. Note that (4.10) yields a
closed-form expression for mX(t) which might be used as an interpolation
formula (cf. Theorem 5.1).
A complete algorithm statement is given in the Appendix in Section A.2.
Note also that, in a stationary situation with constant intervals tk+1 −
→
−
←
−
tk , the covariance matrices V X(tk ) and V X(tk ) (and thus W̃ (tk )) do not
depend on k and may be computed offline, as is usual in Kalman filtering.
4.4 Conclusion
4.4
49
Conclusion
We have presented robust and computationally efficient smoothing-type
algorithms in SSMs based on parametrization of Gaussian densities with
dual precisions. While the standard Kalman smoother with dual precisions formalizes the Modified Bryson–Frazier Kalman smoother, fixedinterval input estimation algorithms that derive from dual-precision message passing are novel. We expand further on dual-precision based input
estimation in Chapter 5 and Chapter 8.
We expect that these message passing algorithms might motivate the development of further dual-precision Gaussian message passing algorithms
beyond Kalman smoothing and input estimation.
Chapter 5
Input Signal Estimation
In this chapter, we report several new theoretical and experimental results on the continuous-time input estimator proposed in [9,10] and that
we present in Section 5.1. In the first part, we show, in particular, that
the continuous-time estimate is smooth (i.e., continuous and infinitely
often differentiable) between sampling times. In fact, the smoothness is
obvious from a new expression for the estimate between sampling times
that is attractive also for practical computations. We also state and
prove a condition on the SSM that is necessary and sufficient to yield
input estimates that are continuous everywhere. In addition, we give a
Wiener-filter version of the the present estimator that further illustrates
the nature of this estimate. In the experimental part (Section 5.5),
we report not only simulation results, but also continuous-time estimation results using dynamometer measurements presented in more detail
in Chapter 6.
5.1
Preliminaries
Let U (t) and Y (t) be the real-valued continuous-time input signal and
output signal, respectively, of some sensor. We are given noisy samples
Yk , Y (tk ) + Zk
(5.1)
of Y (t) at discrete moments tk , k ∈ Z, (with tk < tk+1 ), where Zk
(the measurement noise) are iid Gaussian random variables that are
independent of U (t) and Y (t). From these samples, we wish to estimate
U (t).
51
52
Input Signal Estimation
We will assume that the sensor is given by a finite-dimensional stable
linear (continuous-time) SSM with state X ∈ Rd evolving according to
dX(t) = AX(t) dt + bU (t) dt
(5.2)
with A ∈ Rd×d and b ∈ Rd×1 , and with
Y (t) = cX(t)
(5.3)
with c ∈ R1×d . A number of generalizations of this setting will be
indicated at the end of Section 5.3.
Problems of this sort are usually addressed by beginning with additional assumptions on U (t). However, in many practical applications,
the actual sensor input signal is, strictly speaking, neither bandlimited
nor sparse: better sensors might reveal ever more details in the signal.
Nonetheless, we need to cope with the given sensor, as well as we can.
A new approach to such estimation problems was proposed (among other
things) by Bolliger et al. in [10]. In this approach, U (t) is modeled as
white Gaussian noise (WGN)—not because the unknown true input signal is expected to resemble WGN, but to avoid unwarranted assumptions
on its spectrum. It is shown in [10] that modeling U (t) as WGN leads
to a practical estimator that is easily computed on the basis of forwardbackward Kalman filtering/smoothing.
The definition of the estimate û(t) from [10] can be paraphrased as follows. For ∆ > 0, let
Z
1 t
Ũ (t, ∆) ,
U (τ ) dτ.
(5.4)
∆ t−∆
If U (t) is a continuous signal, then lim∆→0 Ũ (t, ∆) = U (t). Assume now
that U (t) is white Gaussian noise. Then, for fixed t, Ũ (t, ∆) is a well2
defined zero-mean Gaussian random variable with variance σU
/∆ for
2
some constant σU > 0. The MAP/MMSE/LMMSE estimate of Ũ (t, ∆)
from observations Yk = yk for all k. is
ˆ(t, ∆) = E Ũ (t, ∆)|y1 , y2 , . . . ,
ũ
(5.5)
and û(t) is defined as
ˆ(t, ∆).
û(t) , lim ũ
∆→0
(5.6)
The limit can be shown to exist and practical computation of û(t) will
be presented next.
5.2 Computation and Continuity of û(t)
53
N (0, V∆ )
−
→
µX(t)
X(t)
e
←
−
µX(t0 )
+
A∆
X(t0 )
Figure 5.1: Factor graph for interpolation with t0 = t + ∆ > t.
5.2
Computation and Continuity of û(t)
The estimate (5.6) first given in [10] is paraphrased here using messages
from Chapter 4
2 T
û(t) = σU
b W̃X(t) µX(t) .
(5.7)
Computation of posterior probabilities p x(t)|y0 , y1 , . . . may be done
with Gaussian message passing as in Section 4.3.4 (cf. Algorithm A.2).
For numerical computations of the input estimate û(t), it may be preferable to use the formula (5.7) only for t ∈ {tk }, and to use the new
interpolation formula (5.8) for any intermediate moments t.
Theorem 5.1: Interpolation
Assume that both t and t0 = t + ∆ lie between adjacent sampling times
tk and tk+1 , i.e., tk < t ≤ tk+1 and tk < t0 ≤ tk+1 . Then
T
2 T −A ∆
û(t0 ) = σU
b e
W̃X(t) µX(t) .
(5.8)
Note that ∆ may be negative. It is obvious that (5.8) is a smooth
function of ∆, which proves that û(t) is smooth between sampling times.
The proofs of the two theorems in this section use factor graphs and are
presented in Appendix B.3 on page 137.
Theorem 5.2: Continuity at sampling times
If cT b = 0, then û(t) as in (5.6) is continuous also for t ∈ {tk }.
Conversely, if cT b 6= 0, then û(t) is generically not continuous for t ∈
{tk }, as is evident from many examples.
54
5.3
Input Signal Estimation
Wiener Filter Perspective
In a stationary situation with tk = kT for fixed sampling time T > 0, the
MAP/MMSE/LMMSE estimate (5.6) can also be obtained via a version
of a Wiener filter [38] as follows. Let G(ω) be the frequency response
of the sensor, i.e., the Fourier transform of the impulse response of the
system given by (5.2) and (5.3). Let z denote the complex conjugate of
z ∈ C.
We assume a slightly more general setting in the next theorem and introduce a input-noise shaping filter N (ω).
Theorem 5.3
Under the stated assumptions,
û(t) = T
∞
X
Yk h(t − kT )
(5.9)
k=−∞
where h(t) is given by its Fourier transform
H(ω) = P
k∈Z
G(ω)
G(ω + k 2π )2 +
T
(5.10)
2
σN
2
σU
Let n̂(t) be the input estimate, as described above, but let the inputprocessed be colored, i.e., filtered with N (ω). The filter h(t) in
n̂(t) = T
∞
X
Yk h(t − kT )
(5.11)
k=−∞
is then given by
2
|N (ω)| G(ω)
N (ω + k 2π )G(ω + k 2π )2 +
k∈Z
T
T
H(ω) = P
2
σN
2
σU
(5.12)
Mixed discrete/continuous-time Wiener filters as in (5.9) are not usually
covered in textbooks, and we have not yet been able to find filters similar
to (5.10) in the literature. In any case, a proof deriving from the orthogonality principle of LMMSE estimation [38] is given in Appendix B.3 on
page 137.
5.4 Postfiltering
55
Theorem 5.3 can be used as an alternative to message-passing in the
SSM followed by (5.8) for numerically computing the continuous-time
input estimate.
A main point of Theorem 5.3 is that it further illuminates the nature of
the estimate (5.6). For example, consider the two amplitude responses
|G(ω)| in Figure 5.2. The dashed lines in this figure show the aliasing
term |G(ω + 2π
T )|. As long as such aliased parts of G(ω) remain sub2
2
stantially below the noise-to-signal ratio σN
/σU
, the aliasing does not
materially affect the estimate û(t).
It should be noted, however, that the Kalman-filter approach (5.7) is
more general than the Wiener filter of Theorem 5.3. In particular, the
Kalman-filter approach works also for nonuniform sampling, and it generalizes easily to time-varying systems (e.g., unstable systems under digital control as in [46]) and to mildly nonlinear systems (via local linearisation as in extended Kalman filtering [38]).
5.4
Postfiltering
While the estimate (5.6) is piecewise continuous (and smooth between
sampling times), the user of the sensor may sometimes prefer a smootherlooking estimate of U (t). In this case, the estimate (5.6) may simply be
passed through a suitable low-pass filter (preferably with linear phase
response).
Such postfiltering is similar, in effect, to estimating U (t) under the traditional assumption that it is bandlimited. However, the results of the
two approaches are not identical, as is easily seen from (5.10). Moreover,
in a Kalman filter setting (as in Section 5.2), the traditional assumption
requires an SSM of the noise-shaping filter to be included in the Kalman
filter, which increases its complexity; by contrast, postfiltering the estimate (5.7) does not affect the Kalman filtering at all.
It should be noted that postfiltering (5.6) is, in principle, a continuoustime operation. In the following we propose an approach to obtain analytic expressions of postfiltering and numerically compute them.
56
Input Signal Estimation
Theorem 5.4: Analytical Computation of Postfiltering
Consider the augmented continuous-time SSM with state [X T , W T ]T
with additional state W ∈ Rp and state-space parameters
Ā ,
!
A
0
0
F
0 ,
c̄ , c
,
b̄ ,
b
!
e
,
(5.13)
(5.14)
with F ∈ Rp×p and e ∈ Rp×1 .
Then the signal
v̂(t) = 0 d X̂(t),
with d ∈ R1×d , corresponds to continuous-time estimator (5.6) with postfiltering by a filter P (ω), given by
P (ω) = d (jωI − F)
−1
e.
(5.15)
Theorem 5.4 can be seen from the factor graph representation of continuous-time SSMs as follows: The factor graph representation of the
augmented SSM (5.13) can be split into two graphs that are only connected at the input edge U (t) and the graph representing the post filter
is purely deterministic. Trivially, the deterministic factors in the postfiltering graph are fulfilled for any U (t). Hence, the augmented model
yields the same estimate û(t) and hence, (5.15) follows from the Fourier
transform of an SSM.
The benefit of (5.13) is that analytical expressions for postfiltering with
any continuous-time filter1 of finite degree and with relative degree at
least 1 follow easily from (5.7) or (5.8). Furthermore, observe that eĀt
is a block-diagonal matrix for all t ∈ R, thus reducing the necessary
computational effort.
The second approach computes the input estimate in the reduced system
and uses (5.8) to obtain û(t). Then we use the post filter expressed
as a continuous-time SSM in autoregressive form. Using vectorization
operations we can express the postfiltering result between two sampling.
5.5 Experimental Results
57
10
4th order
2nd order
magnitude [dB]
0
10
20
30
40
50
0
0.1
0.2
0.3
0.4
0.5
frequency [f /fs ]
Figure 5.2: The amplitude response of two sensors of order 2 and 4, reFig. spectively.
1. Frequency The
response
of the evaluated
mod- frequency
horizontal
axis is state-space
the normalized
els. f /fs with fs , 1/T . The dashed lines show the aliasing.
5.5
Experimental Results
Figure 5.2 shows the amplitude response of two different sensor models,
one of order n = 2 and the other of order n = 4, which both originate
from fitting low-order models to the measured impulse response of realworld sensors in an industrial setting. The 4th-order model turns out
to satisfy cT b = 0 while cT b 6= 0 for the 2nd-order model. Figs. 5.3–5.5
show simulation results with these models for high signal-to-noise ratio
2
2
σU
/σN
. Note that the input signal in these simulations has effective
discontinuities, which is not uncommon for real signals (e.g., forces when
moving objects collide). Note also that the input signal is nonnegative,
of which the estimator is ignorant.
Figure 5.5 shows the input estimate for the 2nd-order model in microscopic detail. It is apparent that the estimated signal is not continuous at
the sampling moments, which is due to the fact that cT b 6= 0 in this case
(cf. Theorem 5.2). By contrast, the estimate in Figure 5.3 is continuous
everywhere in consequence of Theorems 5.1 and 5.2.
Figure 5.6 illustrates the use of the 4th-order model with measured real1 These type of filters can be represented by a continuous-time SSM by using the
transfer function to obtain the autoregressive form representation of the SSM.
58
Input Signal Estimation
world data. In this case, the true input signal is not known, but a better
(more expensive) reference sensor provides a good guess of it. Moreover,
the actual sensor dynamics has slightly changed during operation while
the estimation uses the unchanged 4th-order model. The estimate is
smoothed by postfiltering as in Section 5.4. Due to the uncertainties
of the situation, it is difficult to assess the quality of the estimate in
absolute terms, but it can certainly be concluded that the estimator
works very well in practice.
(The estimated signal in Figure 5.6 is sometimes clearly negative. This
seems to be due to a bias in the measurement set-up, not due to a
problem with the estimator.)
5.6
Conclusion
We have presented a number of new theoretical and experimental results on the (continuous-time) input-signal estimator that was proposed
in [10]. In particular, we have established continuity (or piecewise continuity if cT b 6= 0) of the estimate. A key step leading to this new
insight, has been an interpolation formula for the continuous-time input signal estimate. To obtain these results, the dual-precision message
representation has been leveraged, from which the results follow. We
have complemented the Kalman-filter perspective with a Wiener-filter
perspective, which may prove useful in analysis of the proposed input
estimator. Practicality of the proposed estimator was confirmed with experimental results and we have elaborated on postfiltering options and
their effects on the estimate.
5.6 Conclusion
59
[arb. units]
[arb. units]
estimated input signal
actual input signal
observed output signal
0
0
0
50
100
150
0
200
[samples]
Figure 5.3: Input signal estimation (simulation) for 4th-order model of
2
2
Fig. 4. Input signa
= −41dB.
/σN
with
σU
Fig.Figure
3. Input5.2
signal
estimation
method
applied to simulated
estimated input signal
actual input signal
observed output signal
[arb. units]
[1] L. Bolliger, H.-A. Loeliger, and C. Vogel, “Simulation, MMSE estimation, and interpolation of sampled
continuous-time signals using factor graphs,” 2010 Information Theory & Applications Workshop, UCSD, La
Jolla, CA, USA, Jan. 31 – Feb. 5, 2010.
measurements
durin
10
0
magnitude [dB]
data.
• . . . with or wi
10
• estimation w
20
• comparison w
30
• spectral
char
[2] L. Bolliger, H.-A. Loeliger, and C. Vogel, “LMMSE
estimation and interpolation of continuous-time sig• continuity
of
40
0 from discrete-time samples using factor graphs,”
nals
• correct proof
arXiv:1301.4793v1.
50
0
50
100
150
200
0 aliasin0
•
avoid
[3] L. Bolliger, Digital Estimation
[samples]of Continuous-Time Sigtional measu
nals Using Factor Graphs. PhD Thesis No. 20123 at
Figure 5.4: Input
signal2012.
estimation (simulation) for 2nd-order model
ETH Zurich,
2 method
2
...
Fig.of
5. Figure
Input signal
estimation
applied to simulated
5.2 with
σU
/σN
= −31dB.
Fig. 6. Frequency r
data.
[4] H.-A. Loeliger, J. Dauwels, Junli Hu, S. Korl, Li Ping,
els.
Advantages of prop
and F. R. Kschischang, “The factor graph approach
to model-based signal processing,” Proceedings of the
IEEE, vol. 95, no. 6, pp. 1295–1322, June 2007.
[5] J. Dauwels, A. Eckford, S. Korl, and H.-A. Loeliger,
“Expectation maximization as message passing – Part I:
principles and Gaussian messages,” arXiv:0910.2832.
[6] . . .
• No assumptio
• Can combine
surements
• Arbitrary tem
• Straightforwa
60
Input Signal Estimation
[arb. units]
estimated input signal
actual input signal
observed output signal
tk )
-
0
100
102
104
106
108
110
[samples]
Figure 5.5: Close-up of Figure 5.4 around a jump of the input signal.
Fig. 7. Closeup of5, where a jump occurs in the input signal.
measured reference signal
estimated input signal
Parseval’s relation. Under the given
assumptions,
the second
measured
ouptut signal
d input signal
put2signal
,output
)
U / signal
[arb. units]
term can also be transformed into frequency domain
Z
i⌦
Ỹ (e )
⇡
X
k2Z
2
2⇡
2⇡
H(! + k )Û (! + k ) d!. (18)
T
T
The cost funciton has to be minimized for each ! separately.
0
Therefore, assume first that Û (! + k 2⇡
T ) = 0 for kkk > K
and for any ! 2 [0, 2⇡]
(tk )
-
150
⇡
200
Figure
pplied to simulated
2
N (0, U
/ )
C. Vogel, “Simulaolation of sampled
graphs,” 2010 Inorkshop, UCSD, La
10.
X(tk + )
Vogel,-“LMMSE
ntinuous-time sig-
K
X
2
Ỹ (e )
2⇡
2⇡
H(! 50
+ k )Û (! + 100
k ) +
T
T
K
X
2
2⇡
150
Û (! + k ) .
T
k= K
k= K
[samples]
(19) data us5.6: Input signal estimation from real-world measured
This corresponds to a regularized least-squares problem, with
Fig.ing
4. Input
signal estimation
applied
the 4th-order
modelmethod
of Figure
5.2force
andsensor
postfiltering as
minimizer
measurements
in Sectionduring
5.4. machining.
H(!) Y (ei!T )
Û (!) = PK
(20)
2 .
2⇡ 2
Z
2
k= K H(! + k T ) + U
• . . . with or without additional spectral shaping
One obtains ?? by taking K ! 1.
• estimation without measurement noise makes sense
i⌦0
4. NEW with
EXPERIMENTAL
• comparison
inverse filter RESULTS
. . . in•addition
those of [3]. estimated
input
signal
spectraltocharacterization
in a special
case
...
• continuity of û(t)
Chapter 6
Input Estimation for
Force Sensors
Unwanted vibrations during machining are a well-known fact and cause a
number of problems ranging from increased tool wear and tool breakage
to poor surface finish and inferior product quality. As a consequence,
for a wide-range of applications, it is essential to monitor forces accurately [55]. However, structural sensor modes can distort these measurements. Oftentimes even the whole sensor-machine mechanical system
(see Figure 6.1a), causes mechanically interference with the sensor and
causes new modes to appear. Especially, in high-speed machining processes, such distorted sensor readings are limiting applications.
A popular method to remove these disturbances is by low-pass filtering.
On the other hand, this simple method also removes relevant dynamical
information when excitation frequencies are high.
In [54, 55] a continuous-time Kalman filter is applied to compensate for
unwanted resonating modes in dynamometer measurements. The filter
is static (i.e., non adaptive) and derived through mode matching in the
spectrum. A different approach is shown in [8, 27, 50], where first a
rational finite-order transfer function is fitted to a sensor’s frequency
response. Filtering of the measurements is utilizes the inverted transfer
function and is realized directly in the frequency domain or with IIR
filters.
The aforementioned methods derive relatively high-order filters from a
frequency-domain view of the filtering problem. We propose a different
approach by using a model-based filter with low-order order to compensate for unwanted dynamics. An automatic model-identification scheme
61
62
Input Estimation for Force Sensors
mass
m2
reference
dynamometer
ξ2
k2
workpiece
mass
m1
dynamometer
sliding table
k1
ξ1
machine
(a) Schematic of the mechanical
setup (experiments)
(b) Multi-mass model
ensures that the filter adapts to new sensor settings. Model-identification
and filtering are derived in time domain.
6.1
Problem Setup and Modeling
The machining setup under consideration is schematically shown in Figure 6.1a. A few physical phenomenas lead to corrupted dynamometer
sensor readings. Firstly, the physical sensor itself does, independently of
the overall machining setup, suffer from resonant modes at certain characteristic frequencies. Additionally in the present machining setup, sensor, workpiece, and components of the machine form a complex coupled
mechanical system: the machine-workpiece loop. Mechanical coupling
in the machine-workpiece loop causes distorted sensor readings through
inertial forces. Both effects severely limit the use of estimation methods
solely based on a-priori known sensor transfer characteristics as in [54,55]
for practical dynamometer filtering tasks and necessitate identification
methods and compensation methods that account for vibrations that are
not necessarily due to the sensor itself.
We approach this problem by taking a model-based stand; it relies on
the hypothesis that a low-order sensor model captures the salient characteristics of the real sensor and thereby, allows us to obtain accurate
6.1 Problem Setup and Modeling
63
Figure 6.1: Schematic of the presented dynamometer signal processing
setup.
force estimates through model-based signal processing methods.
Our model-based view on the problem is rooted in multi-mass swinger
models (see also Appendix F) as shown in Figure 6.1b. In Figure 6.1b
the lighter mass m2 with spring k2 and damping element ξ2 represents the sensor fixed to a heavy machine with mass m1 . This simple physical model contains many phenomena observed in experimental measurements of dynamometer sensors (see e.g., Figure 6.7). First,
mechanical coupling will introduce resonating poles, far below the sensor’s lowest resonance frequency. Secondly, these additional modes form
resonance/anti-resonance pairs: a complex-valued pole pair and a close,
but at a higher frequency, complex-valued zero pair. Thirdly, since these
pole-zero pairs are caused by the machine itself, they might be very
susceptible to small changes in the setup, e.g., starting the machining
process.
We conclude that the main factor that impairs the force estimates is not
noise, but model uncertainties (cf. Section 6.5). These distortions will
64
Input Estimation for Force Sensors
appear in the low-frequency range and in combination with the limited
bandwidth of actual force signals, will thus be highly relevant towards
good performance.
Our model-based approach tries to capture the most prominent components of this mechanical systems, which is assumed to be linear, by identifying a low-order discrete-time SSM based on identification measurements. Estimation of this SSM is done by means of the EM algorithm.
In this case these measurements are impulse response measurements, but
the presented solution does not rely on this property. Once an SSM is
identified, input estimation from sampled sensor measurements yields an
estimate of the actual force.
In order to validate the proposed methods, an additional more accurate
sensor reference is employed during experiments. A schematic of the
whole setting is shown in Figure 6.1
6.2
Model-Based Input Estimation
The aim of the presented input estimation method, is estimation of
the machining forces acting on the workpiece by means of dynamometer measurements. We assume here that the workpiece-machine loop
is represented well with a (low-order) SSM. Since sampling rates, are
much higher than the frequency range of interest (i.e., dynamometer
resonance frequencies and spectrum of the force signal), identification
and consequently input estimation are performed by means of discretetime SSMs. We also focus on one-dimensional force signals (i.e., 1-axis
measurements). Extensions for processing multi-dimensional force measurements are discussed and presented in Section 6.4.
6.2.1
Model-Based Input Estimation
For any K ∈ N let u1 , . . . , uK ∈ R denote the unknown samples of the
applied force and y1 , . . . , yK ∈ R the scalar sensor measurements from
the force sensor sampled with period Ts . The goal is to find an estimate
Ûk of Uk using SSMs.
An MMSE/LMMSE/MAP input estimation method is used and requires
two modeling assumptions: First, the force signal is resembles (statistically) to a correlated Gaussian random process and second, input-
6.2 Model-Based Input Estimation
65
Nk
Uk
f
Gaussian input Sensor model/
process prior workpiece model
+
Yk
Sensor measurements
Figure 6.2: Stochastic model used to develop a model-based filter for
estimation of the random input Uk given measurements
of Yk .
relevant dynamics of the sensor are captured well by a low-order SSM.
Experimental results confirm that, both assumption hold rather well.
The resulting model is shown in Figure 6.2. A continuous-time prior
models the force U (t) acting on the sensor. Since model-identification
yields discrete-time sensor models, input estimation with the estimator
from Chapter 5 estimates Uk = U (tk ) i.e., the input signal at discrete
times t1 , . . . , tK . Finally, sensor measurements Yk corresponding to sensor readings are presumed to be corrupted by white Gaussian noise with
2
variance σN
.
The SSM representation of Figure 6.2, is obtained by concatenating an
SSM with parameters A(p) , B(p) , and c(p) encoding prior assumptions
on the actual force with the SSM of the sensor model. An additional
state Ūk accounts for large random changes1 in the input signal Uk , as
they are typical during machining.
Denote the SSM parameters of the sensor model as A(s) , b(s) , and c(s) .
The complete SSM is then:
(p)
(p)
A
0
0
B
0
1
0 Xk + 0
Xk+1 = 0
(6.1)
Zk
b(s) c(p)
Yk = 0 0
b(s) A(s)
c(s) Xk + Nk
0
0
(6.2)
1 Such jumps in the signal may be detected from Ū using a glue factor as
k
described in [59]. However, we do not consider this problem further.
66
Input Estimation for Force Sensors
and the entries of the state vector Xk are defined as follows
(U )
Xk
(Ū )
Xk =
Xk .
(s)
Xk
(6.3)
(Ū )
The input signal offset Xk and the states of the spline prior (cf. Ap2
,
pendix C) are driven by white Gaussian noise Zk with variance σU
where is small to allow the offset state to slowly change over time.
An estimate of the input is eventually obtained from the a-posteriori
mean value, which is easily seen from (6.3) to be given by
Uk = cs 1 0 Xk .
|
{z
}
,Π
6.2.2
Input estimator implementation
The input estimate, i.e., the posterior mean E[Uk |y1 , . . . , yK ], requires
the marginals2 p(uk |y1 , . . . , yK ). Efficient computation of these marginals
is obtained by implementing the smoothing algorithm from Section 4.3.1
using the SSM (6.1). Specifically, a forward recursion based on the
−
→
→ followed by a backward recursion to compute W̃ and
messages V and −
m
W̃µ are executed. Eventually, the force estimate u1 , . . . , uK is computed
from
ûk , ΠmXk .
Since signals are typically long (K 1) and the physical model can be
assumed to be stationary, computational complexity can be reduced significantly by using the steady-state smoothing method presented in Section 4.3.2.
2 Our implementation does batch processing. Online processing algorithms are
derived likewise, as shown in e.g., [47]. The main idea is to use a sliding-window
approach and introduce a processing delay. The length of the processing delay can
be used to trade estimation performance (i.e., smoothing) and computational complexity (sliding window length).
6.3 Sensor Model Identification
67
Remark 6.1
Utilizing steady-state message passing, the smoothing is, in fact, the
sum of a causal IIR filter and an anti-causal IIR filter. The order of
both filters is the order of the employed SSM.
To improve numerically stability of the implementation, the SSM (6.1)
may be expressed in another state-space basis X̄k = TXk . It follows
that the input estimator Π must be transformed to ΠT−1 . In the present
implementation, this additional step was not necessary.
Parameters The proposed model offers two parameters to adapt input signal estimates to varying filtering settings. First, the order of the
spline prior may be increased (decreased) to get smoother (fast-varying)
2
2
. By inspection
/σN
estimates. The second parameter is the ratio λ , σU
of the variational representation of the input estimator [10, Theorem 2],
it is evident that the input estimate ûk only depends on this ratio, as
2
2
opposed to the absolute values of the variances σU
or σN
. In practical applications, this ratio may be used to adjust the resulting input
estimates to the given sensor measurement quality.
6.3
Sensor Model Identification
As outlined in Section 6.1, the goal is to infer a dynamometer model
from L input-output measurements u(1) , . . . , u(L) and y (1) , . . . , y (L) . In
the following we consider the case L = 1 i.e., only one input and output
signal set is available, and as a consequence forgo the corresponding
notation. The generalization to L > 1 is deferred to the end of the
description.
We seek an identification scheme that focuses on the final usage of the
system model: In this case, it is input estimation. To the best of our
knowledge, there are no results on identification methods that are suited
for input estimation tasks. We show a novel variational statistical model
that prefers system models with better (expected) input estimation performance and is therefore better suited for the given task than standard
methods as in e.g., [43].
To elaborate on this question, consider a stationary setting. Then input-
68
Input Estimation for Force Sensors
estimation as in Section 6.2.1 with input-power-to-noise
2
2
λ , σN
/σU
according to Proposition 5.3 corresponds to a Wiener filter
G(ejθ ) ,
¯
Ĥ(ejθ )
|Ĥ(ejθ )|2 + λ
,
with Ĥ(ejθ ) the frequency-response of the sensor model. Inevitably, the
sensor model will only recover a corrupted version of the actual discretetime sensor frequency response H(ejθ ) to a sinusoid at circular frequency
θ, thus
Ĥ(ejθ ) = H(ejθ ) + (ejθ ).
(6.4)
Application of this filter then results in
G(ejθ )H(ejθ )U (ejθ ) =
ˆ (ejθ )H(ejθ )U (ejθ )
H̄
|Ĥ(ejθ )|2 + λ
¯(ejθ )
jθ 2
|H(e )| 1 + ¯ jθ U (ejθ )
Ĥ(e )
=
2
(ejθ ) |H(ejθ )|2 1 + Ĥ(e
+λ
jθ ) 1
=
1+
≈
¯(ejθ )
¯
Ĥ(ejθ )
U (ejθ )
1 + |Ĥ(eλjθ )|2
U (ejθ )
(ejθ )
1 −
λ
1 + |Ĥ(ejθ )|2
Ĥ(ejθ )
| {z }
|
{z
}
MMSE estimate
,
(6.5)
model error
while the last approximation holds when (ejθ ) 1. Hence, a relative
modeling error (ejθ )/Ĥ(ejθ ) ≈ (ejθ )/H(ejθ ) results in an (absolute)
estimation error of the same magnitude. We conclude that for the task
at hand, system identification should try to minimize the relative model
error.
A variety of system identification methods might be assessed based on
this finding: Subspace methods, prediction-error methods, and ML estimators [43, 70]. Prediction error methods [43], minimize a (weighted)
6.3 Sensor Model Identification
69
2
N (u1 , σU
)
N1
b
N (0, VX1 )
=
X1
A
=
+
c
Ỹ1
2
)
N (y1 , σN
X2
···
c
Ỹ2
2
)
N (y2 , σN
Figure 6.3: First segments of the factor graph corresponding to the
probability density used in Theorem 6.1.
absolute model error, which might be very different from the relative
model error. Subspace methods lack a characterization with an error
criterion. On the other hand, maximum-likelihood methods naturally
offer the required properties as will be demonstrated by the following
theorem, which is proven in the Appendix B.
Theorem 6.1: ML Cost Function
Consider a linear single-input single-output SSM as shown in Figure 6.3
and define its K-point discrete Fourier transform (DFT) as
−1
S[k] = cT e−j2πk/K I − A
b.
Also, let u1 , . . . , uK and U [1], . . . , U [K] be the input signal and its DFT
and likewise y1 , . . . , yK and Y [1], . . . , Y [K] the output signal and its
DFT.
If K 1 the (scaled) log-likelihood L(A, b, c) , −2 log p(y, u|A, b, c)
fulfills
L(A, b, c) ≈
K
X
k=1
2
log(|S[k]|2 + λ) +
1 |Y [k] − S[k]U [k]|
+ C,
2
σU
λ + |S[k]|2
(6.6)
70
Input Estimation for Force Sensors
2
N (0, σU
)
2
N (0, σN
)
Ek
Uk
Hammer
impulse
Nk
+
Nk
f
Yk
+
Sensor model/
workpiece model
Impulse response
measurements
Figure 6.4: Schematic of the variational statistical model proposed for
model estimation from impulse response measurements.
The filled black boxes represent given input measurements
and output measurements. The modeled identification input signal is denoted Nk .
where C is an (approximately) constant term independent of the parameters A, b, and c.
Now, assume that Y [k] = H[k]U [k] + N [k], where N [k] is noise for DFT
bins k ∈ [1, K]. We now illustrate that the ML’s behavior is similar to
the desired one (6.5), when the signal is long enough for Theorem 6.1
to (approximately) hold. To simplify the exposition we also assume
2
that σU
1. Then, as the second term in (6.6) is weighted much more
than the log term, the log term is ignored. The maximum-likelihood
estimator of the SSM parameters in Figure 6.3 thus minimizes
min
S[1],...,S[K]
K
2
X
|[k]U [k] + N [k]|
k=1
λ + |S[k]|2
(6.7)
with [k] as in (6.4) dependent on S[k]. From the last expression it is
evident that the squared error in the numerator is weighted by 1/|S[k]|2
and hence, as the relative model error in (6.5).
In addition, observe that for k where λ |S[k]|2 holds, the MMSE
estimate in (6.5) is shrunken towards 0 and the ML estimator limits the
weighting factor at 1/λ. This makes sense, as the estimate S[k] ought
not to fit a noisy estimate for DFT bins, where the estimate will barely
influence the MMSE estimate.
6.3 Sensor Model Identification
71
In conclusion, when the goal is to identify a sensor model that is afterwards used for input estimation, maximum-likelihood estimators are
a suitable choice. Henceforth, we seek to compute (approximatively)
the maximum-likelihood estimate from the variational statistical model
2
and λ.
depicted in 6.4 with parameters σU
6.3.1
Tuning of Identification Parameters
In the present maximum-likelihood problem Figure 6.4, we consider the
noise variances as algorithmic tuning parameters3 .
From Theorem 6.1 we make the following observations that guide parameter choices:
2
• Tuning of σU
2
Varying σU
while holding λ fixed changes the relative weight of
the logarithmic term with respect to the model estimation error.
The logarithmic term in the optimization objective encourages that
small S[k] are driven towards zero i.e., shrinkage of the smaller
terms.
2
The parameter σU
may be used to adjust characteristics of the
EM algorithm employed to compute the ML estimate. Specifically,
2
for 1 σU
(observations are on the order of 1) the EM updates
become numerically more robust, since in the EM message the
influence of the covariance term decreases (cf. (A.1) and (A.3))
and covariances are more (numerically) error-prone than vectors
(cf. Section 3.1).
2
Additionally, the choice 1 σU
has been observed to increase
convergence rates of EM.
• Tuning of λ
As seen from (6.6), the ML objective essentially caps the weight
once attenuation of the SSM falls below λ. The parameter may
thus be tuned specifically to prescribe when the ML estimate must
pay less attention to data fit.
3 This is in contrast to deterministic values or random variables that may be
inferred from the probability model.
72
6.3.2
Input Estimation for Force Sensors
Implementation
Before applying an EM algorithm to compute an approximate solution
to the ML problem, the measurement data is appropriately scaled and
only the signal window is used, where there is significant power in the
sensor measurements.
We apply the EM algorithm, as outlined in [16], to our model Figure 6.4 and use the AR form of the SSM. The algorithm is stated in
Appendix A.3 on page 128.
6.3.3
Improvements of Model-Identification Algorithm
In the following, we particularize our model identification method in
Algorithm A.3 and present various extension that proved useful in the
current application.
Initialization
Since the EM iteratively optimizes a non-convex objective function, results can vary depending on the initial parameters that the algorithm
starts with. One possibility to initialize the EM algorithm is to obtain a
first estimate with another system identification method. As suggested
in [75] subspace system identification methods are well suited, as they
do not require an initial estimates themselves.
When identifying SSMs for dynamometer sensors, another approach,
which is employed in our final implementation, is to leverage prior knowledge on the model characteristics (e.g., sensor’s data sheet); given frequencies of the resonance anti-resonance pairs and the resonance itself, an initial SSM is generated as sketched in Appendix F. The mode
frequency is set according to the dynamometer’s data sheet and the
resonance/anti-resonance pairs are equally spaced in the frequency range
up to the resonance mode.
Multiple Independent Measurements
Multiple input-output measurements of the sensor u(1) , . . . , u(L) and
y (1) , . . . , y (L) can serve to improve the quality of the identified model or
make the method more robust in practical applications.
6.3 Sensor Model Identification
(1)
73
(2)
(1)
(2)
2
2
2
2
N (u1 , σU
)
) N (u1 , σU
) N (u2 , σU
) N (u2 , σU
N (0, VX (1) )
(1)
(1)
X2
1
X3
fS
fS
=
A, c
=
N (0, VX (2) )
(2)
X3
fS
(1)
(2)
···
(2)
X2
1
···
···
fS
(1)
(2)
2
2
2
2
)
N (y1 , σN
) N (y1 , σN
) N (y2 , σN
) N (y2 , σN
Figure 6.5: Factor graph of an SSM with two independent sets of input and output observations. The factor node fS denotes a
hard constraint that is equal to 1 iff the two state variables
and the input and output variable fulfill (2.1) and (2.2).
The dotted part of the graph is relevant in ML-based estimation. In EM-based estimation, only EM messages are
propagated on the dotted edges; the two SSMs are independent otherwise.
Consider the factor graph in Figure 6.5, which represents an SSM given
two sets of input and output observations. The node fS represents a hard
(`)
(`)
constraint to ensure that the two state variables Xk−1 and Xk and the
input and output variables fulfill the SSM equations (2.1) and (2.2). The
two SSMs, which are assumed to be parametrized in AR form, are only
connected through the parameters A and c, which of course may be seen
from the conditional independence of the likelihoods
p(y (1) , . . . , y (L) , u(1) , . . . , u(L) |A, c) =
L
Y
p(y (`) , u(`) |A, c).
(6.8)
`=1
EM-based system identification in the joint multiple measurements factor graph as in e.g., Figure 6.5 or the joint likelihood (6.8), can be accommodated into Algorithm A.3 on page 128 with the following two
74
Input Estimation for Force Sensors
adaptations:
• Computation of the marginal probability densities using Gaussian
message passing in SSMs is done separately on each of the L input
u(`) and output y (`) datasets. On each SSM factor graph and for
(`)
(`)
−
−
each time step, the EM messages ←
µA (A) and ←
µc (c) are calculated independently. These are steps 2) and 3) in Algorithm A.3.
• The EM messages are then joined through equality constraints (see
e.g., the dotted part of the graph in Figure 6.5). Correspondingly
in Algorithm A.3, EM messages from all SSMs must be combined
prior to step 4).
We recognize that the EM-based system identification method for multiple observations averages the estimated parameters over multiple measurement sets and is generally different from averaging all datasets before
system identification. The proposed approach is preferable to the latter
one, which is very common in system identification, as it is an (approximate) algorithm to find the joint ML estimate (cf. (6.8)).
Decimation
As seen in Section 3.2, decimating signals before using Gaussian message
passing-based methods can be a simple method to improve numerical stability when a slowly-varying band-limited SSM is used. Implementation
of decimation of the input-output measurements before application of
the EM algorithm is straightforward. As the estimated SSM is going to
be defined for a lower rate, it must then be up-sampled after estimation.
Let n be the decimation factor. We propose to up-sample the SSM by
computing A1/n and with the upsampled A and the original measurement data, reestimate c, while keeping A fixed.
Before decimation, it is advantageous to lowpass filter the data, in order to reduce noise. However, applying a lowpass filter may lead to
identification of a differently shaped SSM. To overcome this drawback,
we propose system identification with n cosets of the input and output
signals, resulting from decimation by a factor n. Specifically, define
(`)
(`)
(`)
u(`,k) = uk , un+k , u2n+k , . . .
6.3 Sensor Model Identification
75
for any measurement set ` and k ∈ [1, n]. The cosets are defined analogously for y. EM-based estimation as in the case of multiple measurements, where the cosets represent a measurement can now be used.
Analogously to the factor graph shown for the case of 2 sets in Figure 6.5, the (decimated) SSMs for each coset independently compute
marginal probabilities and EM messages for A1/n and c. The EM messages of each SSM with dataset u(s,k) and y (s,k) are then combined in
the M step by means of the equality constraints. When the EM-based
method terminates, the joint low-rate estimates of the SSM parameters
are up-sampled to obtain normal-rate model estimates.
Constrained Model Identification
Assume that constraints on the sensor model are given that can be expressed as linear functions of the SSM parameters A or c. In fact, for
dynamometer sensors, it is known that the steady-state gain must be 1,
i.e., the sensors are calibrated. It was observed that identification measurements are often corrupted4 with constant offsets and gains unequal
to 1.
A linear constraint can be directly integrated into the EM recursions
for model identification. The key observation is that in step 4) in Algorithm A.3 (i.e., where all Gaussian EM messages are combined with
equality factors) a linear constraint on the parameter corresponds to a
noise-free observation factor.
Specifically, consider an unit gain constraint on the estimated SSM.
Given A in AR form, where a is the first row of A, the constraint
is
c1 = 1 − a1.
(6.9)
This can readily be seen from the transfer function derived from an SSM
in AR form
c1 z n−1 + c2 z n−2 + . . . + cn
S(z) =
,
z n − a1 z n−1 − . . . − an
with steady-state gain given by S(z)|z=1 = 1 expression (6.9) follows.
From standard Gaussian message passing rules as in [47], it can be seen
4 These
fier.
effects can most likely be attributed to drift effects in the sensor’s ampli-
76
Input Estimation for Force Sensors
that (A.5) is replaced with
1 − 1T (a[j] + mc )
and
1T Wc−1 1
= Wc−1 (Wc mc + λ1)
λ=
c[j+1]
with 1 a column vector with all entries equal to 1.
6.4
Frequency-Based MMSE Filters
The resonating modes that occur in actual dynamometer sensors often
exert dynamic forces in multiple spatial direction. To account for spatial
characteristics of resonating modes and hence, improved suppression of
resonating modes, a frequency-based MMSE filter is presented next.
Let the actual force signal, measurements and estimate be denoted as:
(x)
(x)
Û (x) (t)
U (t)
Y (t)
(y)
U (t) = U (y) (t) , Y (t) = Y (y) (t) , Û (t) =
Û (t) .
(z)
(z)
(z)
U (t)
Y (t)
Û (t)
(6.10)
A consequence of these phenomena is crosstalk in the 3-dimensional sensor measurements e.g., a dynamic force u(x) (t) in just one spatial direction acting on the sensor will be recorded on other channels Y (y) and/or
Y (z) of the sensor as well. There are two ways to approach this issue:
considering the interference from other channels as noise and suppressing
it or joint estimators of all 3-dimensional channels.
Single-channel Wiener filters were presented in Section 5.3 for LMMSE
input-estimation in a wide-sense stationary setting. These filters are appropriate if we assume that between channels there is no interference.
Otherwise, Wiener filters for (correlated) multi-channel signals are necessary. Both approaches i.e., independent single-channel Wiener filters
and a multi-channel Wiener filter will be shown below and compared
with performance on experimental data.
6.4.1
Frequency-Based Filtering
To develop the frequency-based MMSE filter, we will take different assumptions to Section 6.2.1. Specifically, we assume that the sensor is
6.4 Frequency-Based MMSE Filters
77
linear and time invariant and that noise or external disturbances are
stationary. Then the sensor, sampled with sampling period T , can be
modeled with the discrete-time Fourier transform (DTFT) at frequency
θ = 2πT f as
(x) jθ
Ỹ (e )
(y) jθ
Ỹ (e ) =
|
Ỹ (z) (ejθ )
{z
}
H (xx) (ejθ ) H (xy) (ejθ ) H (xz) (ejθ )
Y (ejθ )
(yx) jθ
(e )
H
(zx) jθ
H
(e )
|
H (yy) (ejθ )
H (zy) (ejθ )
{z
H(ejθ )
U (x) (ejθ )
H (yz) (ejθ ) U (y) (ejθ ) . (6.11)
H (zz) (ejθ )
U (z) (ejθ )
}|
{z
}
U (ejθ )
Here H (ab) (ejθ ) represents the DTFT of the impulse response of an input
in direction a ∈ x, y, z measured at the sensor’s output ba ∈ x, y, z. Furthermore, let the input force U be modeled as a wide-sense stationary
white Gaussian5 noise process with power spectral density SU U T (ejθ ) ≡
σu2 I (i.e., we refrain from making any assumptions on the input force).
The observed sensor measurements Y are corrupted by wide-sense stationary noise N with power spectral density σn2 (ejθ )I.
As described in Section 2.3, the multi-dimensional Wiener filter for the
problem at hand is
G(ejθ ) = SU Y T (ejθ )S−1
(ejθ )
YYT
= σu2 H(ejθ )H σu2 H(ejθ )HH (ejθ ) + σn2 (ejθ )I
−1
.
(6.12)
Under certain circumstances (e.g., inaccurate sensor model estimates or
limited computational resources) it might be desirable to make independent MMSE estimates instead of one joint estimate of all three signal
dimensions. To this end, consider the scalar Wiener filter corresponding
to (6.12) for scalar signal, i.e.,
G(a) (ejθ ) =
SY (a) U (a) (ejθ )
SY (a) Y (a) (ejθ )
5 If the Gaussian assumption does not hold, all filters reduce to the corresponding
LMMSE filters [39].
78
Input Estimation for Force Sensors
=
σu2 H̄ (aa) (ejθ )
σu2 h(a) (ejθ )h(a)(ejθ )T + σn2 (ejθ )
(6.13)
for a ∈ {x, y, z} and with H̄ (aa) (ejθ ) the complex conjugate and h(a) (ejθ )
the row vector a of H(ejθ ). This interference suppression filter will be
denoted with K (a) (ejθ ). Most of the following treatment applies to this
filter as well as the MMSE filter, however, we will only refer to the MMSE
filter.
Of course, ignoring all interference one obtains the well-known scalar
Wiener filter for all channels a ∈ {x, y, z} and all frequencies:
L(a) (ejθ ) =
6.4.2
σu2 H̄ (a) (ejθ )
.
σu2 |H (a) (ejθ )|2 + σn2 (ejθ )
(6.14)
Frequency Response Function Estimation
In order to use the derived filters in a practical dynamometer filtering application, at first it is necessary to estimate the multichannel frequencydomain model H(ejθ ) for a sensor. To this end, we assume that L ≤ 1
impulse response measurements are available as in the identification setting outlined in Section 6.1.
Sensor transfer function identification To apply the frequencybased system identification method proposed next, the identification
measurements are chosen long enough such that the output signals of
the sensor are zero outside the given window of duration TW . Note that
this the SSM-based identification method in Section 6.3 did not have
this prerequisite.
Let K = TW /T and define6
U = {U1 [·], U2 [·], . . . , UL [·]}
Y = {Y1 [·], Y2 [·], . . . , YL [·]},
the sets of the K-point discrete Fourier transform (DFT) of the input
data sets and the output data sets.
Under the assumption that the measurements Y are corrupted by white
2
Gaussian noise with variance σE
and recalling well-known properties of
6 Note that a different notation for the multiple measurement sets is used than
in Section 6.3 to avoid confusion with the spatial indices x, y, and z.
6.4 Frequency-Based MMSE Filters
79
the DFT (i.e., white noise remains uncorrelated after taking a DFT),
the identification problem setup is
Y` [n] = H[n]U` [n] + E` [n],
for all Uk ∈ U, Yk ∈ Y and n ∈ [1, . . . , K]. Due to the linear Gaussian
problem, the ML estimate of the unknown frequency response H[·] is the
solution of 3L independent least-squares problems. To this end, define
for all K DFT points
ĥ(a) [n] = [Ĥ (ax) [n], Ĥ (ay) [n], Ĥ (az) [n]]T ,
i.e., the transfer functions with output channel a, where a ∈ {x, y, z}.
The least-squares problems that give the ML estimate of the unknown
sensor model can then be written more compactly as
ĥ(a) [n] ,
(a)
Y1 [n]
..
argmin .
(a)
h [n] Y (a) [n]
L
2
(x)
(y)
(z)
U1 [n] U1 [n] U1 [n]
..
−
h(a) [n]
.
(x)
(y)
(z)
U [n] U [n] U [n]
L
L
L
(6.15)
for all a ∈ {x, y, z}. The least-squares solution is [45]
ĥ(a) [n] = (
L
X
L
X
u` [n]H u` [n])−1 (
(a)
u`0 [n]H y`0 ).
`0 =1
`=1
Accidentally, this is equivalent to the (estimate of) cross-spectral density
of the input and output signal divided by the estimated power spectral
density for the input signal:
1
L
Ĥab [n] =
L
P
`=1
1
L
(a)
(b)
1
L
Y` [n] · Ū` [n]
L
P
`=1
=
(b)
|U` [n]|2
1
L
L
P
`=1
L
P
`=1
SY (a) U (b) [n]
`
`
.
(6.16)
SU (b) U (b) [n]
`
`
Corrections of the Frequency-Response Function In the application at hand, the sensor characteristic at very low frequencies is important for filtering performance, because typical input force signals display
80
Input Estimation for Force Sensors
very slow and large signal components (e.g., offsets). Corrupted estimates of the frequency-response function (FRF), caused by non-idealities
in the sensor-amplifier chain or noise, are remedied by linearly interpolating the first 10 FRF bins setting the first bin to 1.
Further improvement in performance is observed, when the resulting
Wiener filter has a steady-state (i.e., n = 0) gain of 1. We therefore
rescale (6.12) according to
−1
σn2 (0)
σn2 [n]
H
H
H
H[0]H [0] H[n] H[n]H [n] +
I
Ĝ[n] = I +
σu2
σu2
(6.17)
and also (6.13)–(6.14) in an analogous way.
6.4.3
Results
The three filters (6.12)–(6.14) are evaluated with measurements from
[EXPA] (cf. Appendix D). The frequency domain model is estimated
from 3 impulse response measurements per spatial axis (x,y, and z).
Since it is assumed that identification measurements are zero outside of
the measurement window, the DFT of the zero padded impulse responses
is used to approximate the DTFT. The zero-padding length is chosen
such that impulse responses, their DFT transform, and Wiener filter all
have the same number of DFT bins as the corresponding signal segment7 .
The parameter σn2 /σu2 was determined in a coarse sweep over the given
data and is set to 0.1. Finally, a low-pass filter (6th-order Butterworth
filter with cut-off frequency 3500 Hz) is applied.
The results are summarized in Figure 6.6, with the relative MSE improvement (2.6) taken with respect to the unfiltered sensor measurements. The reference dynamometer measurements were filtered with
method (6.12) (multi-dimensional Wiener filter) and subsequently lowpass filtered before being used as true input signal. The box plot depicts
the lower quartile, median, and upper quartile of the data. The whiskers
mark the smallest and largest sample in the set.
From the results shown in Figure 6.6 three observations can be made:
7 In practice, a fixed-length FIR type Wiener filter as noted in Remark 6.2 would
be computed from identification measurements to make the application more flexible.
Since performance will be at most as good as in the IIR Wiener filter case, we chose
this filter for presentation here.
6.4 Frequency-Based MMSE Filters
81
20
20
16.04
16.21
16
17.99
7.81
7.78
7.67
0.36
0.33
0.33
Ĝ(f )
Ĝ(f )
L̂(f )
15
∆MSE
∆MSE
15
19.08
17.32
10
8.42
8.56
8.33
5
10
5
2.92
0
Ĝ(f )
2.79
2.17
0
Ĝ(f )
L̂(f )
(a) x axis
(b) y axis
10
9.04
7.98
∆MSE
6.83
5
2.97
0
3.64
2.75
1.02
0.96
0.93
Ĝ(f )
Ĝ(f )
L̂(f )
(c) z axis
Figure 6.6: Distribution of MSE improvement (in dB) for a set of 23
measurements taken during milling obtained by means of
different frequency-domain filters.
82
Input Estimation for Force Sensors
• For measurements from the x-axis channel and the y-axis channel,
filtering crosstalk with either Ĝ(ejθ ) or Ĝ(ejθ ) improves the quality
of the estimate marginally.
• The estimation quality of signals measured on the z channel can
be improved by taking interference from the other channels into
account. However, median improvements are again very small, as
mostly a few signals can be estimated much better by taking into
account the crosstalk.
• Interestingly, merely suppressing the interference with the filter
Ĝ(ejθ ) leads to considerably better results than using the joint
estimator Ĝ(ejθ ) for the z-axis measurements.
6.5
Results
The MSE improvement factor (2.6) with respect to the unfiltered estimate, the sensor outputs yk , will be used to compare input estimation
performance across various methods. Specifically for estimate ûk and
true input uk :
PK
(ûk − uk )2
∆MSE , Pk=1
.
(6.18)
K
2
k=1 (yk − uk )
We show two different scenarios: First, a setting is shown where relevant sensor characteristics are captured accurately with SSMs and secondly, analyze an opposite scenario, where high-order techniques (i.e.,
high-order Wiener filter) outperform other methods. Details on the experimental setup are given in Appendix D.
6.5.1
System Identification Results
Single-channel measurements (channel x) are filtered with the proposed
model-based input estimation filter using a third order spline prior (cf.
Appendix C), which provides the flexibility to model continuous-time
signals with adjustable degrees of smoothness8 . The noise variance ratio
(Ũ )
8 As an aside, states in X
represent estimates of derivatives of Ũ (using the
k
SSM basis in (C.1)). These estimates might be of interest in certain applications,
e.g., peak detection.
6.5 Results
83
for the input estimation algorithm is set to λ = 10−3 to account for
modeling errors as the measurements are nearly noise-free.
The models are estimated via different system identification methods:
Standard EM with model order9 6, EM using decimated (by a factor
2) model (cf. Section 6.3.3) with orders 6 and 8, output-error identification method [43] with order 8. The latter method essentially finds the
most-likely SSM with deterministic inputs and noise at the output (i.e.,
2
2
model Figure 6.3 with σU
= 0 and σN
> 0). The output-error method
is forced to fit the measurements well in the frequency range from 0 Hz
to 3000 Hz by standard techniques [43]. The EM-based system identifi2
= 10−5 and λ = 100.
cation methods employed parameter σU
The FRF derived from the estimated SSMs are shown in Figure 6.7
(the 6th order decimated EM estimate is not shown). The largest discrepancy between the estimated FRFs is seen around 600 Hz and above
2000 Hz. Clearly, the 6th order EM estimate can not model the small
mode at 600 Hz, but otherwise follows the 8th order decimated EM estimate rather closely. Surprisingly, the output-error estimate also does
not recover the mode at 600 Hz. Because output error methods do
not minimize error criteria similar related to the relative model error,
the penalty incurred by not modeling the mode around 600 Hz is small
(around −5 dB). The 8th order decimated EM, which minimizes (6.6),
on the other hand, puts more weight on fitting errors when magnitude is
small and therefore penalizes the estimate more (under the assumption
that the parameter λ is chosen small enough, see Section 6.3.1).
Let us assess how these differences eventually affect input estimation
performance. Result for all the signals in [EXPB] are presented in Table 6.1 and additionally the (single-input single-output) Wiener filter
(cf. Section 6.4) for comparison. To assess model estimation quality, the
MSE figures were obtained by first applying a low-pass filter with cut-off
frequency 2500 Hz to the validation signal and the estimates. We observe
that the decimated estimates with order 6 and standard EM identify the
SSM equally well.
The additional mode at 600 Hz, appears to have a large impact on input
estimation performance. This is seen from the performance gains of the
8th order model compared to all other methods. The reasons are that
the machining forces have large spectral components in this frequency
9 Higher model orders cannot be computed numerically stable with standard message passing.
84
Input Estimation for Force Sensors
measurements
8th order decimated EM
8th order output error
6th order EM
50
Magnitude [dB]
40
30
20
10
0
−10
−20
Phase [deg]
90
0
−90
−180
0
500
1,000
1,500
2,000
2,500
3,000
f [Hz]
Figure 6.7: Empirical frequency response from measurements [EXPB]
and frequency responses derived from various identification
methods. Also marked with dashed lines are inferred pole
frequencies of the dynamometer-machine system.
6.5 Results
85
∆MSE in dB
Model order
i
ii
iii
6
9.4
8.7
10.6
6
10.8
9.6
11.4
8
14.8
14.6
15.4
Output-error method
8
10.6
9.4
11.4
Wiener filter
n.a.
8.4
8.2
8.9
Standard EM
Decimated EM
Table 6.1: Improvement in MSE compared to the unprocessed sensor
output after applying various filter methods. Figures were
computed with experimental data from [EXPB], which contains three signal sets i, ii, and iii.
band and that, when magnitude of the transfer function is small, model
errors have a greater effect on input estimates than in the opposite case.
Performance of the Wiener-filter estimate is, interestingly, slightly lower
than for model-based methods. This effect has also been observed with
other datasets and is due to discrepancies between the system’s characteristics during identification measurements and the characteristics during machining and processing. The model-based estimate is more robust
to these kind of changes as it tries to infer a low-order model that even
after small changes to the FRF still fits well.
6.5.2
Force Estimation Performance
In Table 6.1 it was already observed that the presented input estimation method can improve performance considerably compared to spectral inverse methods such as the Wiener filter. Stability of the sensor’s
characteristic plays a key role in determining estimation performance.
While, the proposed model-based approach appears more robust than
the Wiener filter, the prerequisite is that relevant sensor characteristics
are represented well by the model. Often a simple SSM of order 2 limits performance, as in Table 6.1, where at least models of order 8 are
necessary to achieve the best performance.
Let us analyze the last requirement. The same algorithm and settings
86
Input Estimation for Force Sensors
∆MSE in dB
Model order
Avg.
Min.
Max.
Standard EM
6
9.9
7
13.3
Decimated EM
6
9
4.7
12.5
Wiener filter
n.a.
11
5.9
18.8
Table 6.2: Improvement in MSE compared to the unprocessed sensor
output after applying various filter methods. Figures were
computed with experimental data from [EXPA] and averaged over 17 signals.
as above are applied to data from another experiment ([EXPA], cf. Appendix D). The results are shown in Table 6.2. For this experimental
data, the sensor’s transfer characteristics can not be captured well by a
low-order SSM (see Figure D.1) and the Wiener filter outperforms both
low-order methods.
6.5.3
Convergence Properties
The EM algorithm is well-known to exhibit slow convergence under certain circumstances (cf. Section 2.2). Convergence results for the proposed EM method using data from [EXPB] are shown in Figure 6.9.
The convergence of the estimated frequency of the three largest modes
towards the inferred pole frequencies marked in Figure 6.7 is shown.
For the subsequent analysis a conjugate gradient-based optimizer of the
log-likelihood was implemented (cf. Section 2.2.2). In each conjugate
gradient step, the optimal step size is found by line-search methods.
Observe that the strength of the mode mainly determines convergence
speed of the corresponding estimated frequency (resonance mode is located at around 1800 Hz and a strong resonance anti-resonance pair at
around 200 Hz). While convergence is fast for the strongest modes, estimation of weaker poles requires a lot EM steps.
Empirically, we observed “zig-zag” steps as illustrated in Figure 6.8. This
effect arises because EM updates for A and c factorize into independent
updates (cf. Section 2.2.4) with Xk as hidden variables; The EM is
6.5 Results
87
c
A
Figure 6.8: Schematic illustration of a typical issue encountered when
estimating models with multiple resonance/anti-resonance
modes; Poorly conditioned likelihood (i.e., high condition
number of the local Hessian of the likelihood) and decoupled optimization steps on c and A (responsible for “zigzag” steps), cause very slow convergence of the algorithm
(black path). The unfavorable conditioning of the problem
manifests itself graphically in long and thin level sets of
the likelihood.
not able to make joint updates, which would correspond to diagonal
moves in Figure 6.8. This issue is more prominent for poorly conditioned
problems, because their likelihood (locally) exhibits long and thin level
sets as in Figure 6.8.
But, this decoupling is not the only issue, as the conjugate gradient
optimizer, which is known to overcome such “zig-zag” behavior [11], does
not converge to the difficult to reach mode. The other issue appears
to be the high condition number of the Hessian of the log-likelihood,
which corresponds approximately to W in the EM [48]. It can be shown
that convergence rate of first order methods, such as conjugate gradient,
is proportional to that condition number, however, the EM method is
known to (partly) overcome this limitation [83].
In fact, when initializing EM with either the EM solution of A or c and
using real measurement data, convergence occurs in roughly up to 10
iterations.
Input Estimation for Force Sensors
First pole frequency [Hz]
88
standard EM
decimated EM
conjugate gradient
400
300
200
0
10
20
30
40
50
Second pole frequency [Hz]
Iteration
800
700
600
500
0
100
200
300
400
500
600
700
800
Third pole frequency [Hz]
Iteration
1,800
1,600
1,400
1,200
1,000
0
2
4
6
8
10
12
14
16
18
20
Iteration
Figure 6.9: Current pole frequencies estimate versus iteration for simulated identification data 6th order. The true frequencies
are marked with a thin line.
6.6 Conclusion
6.6
89
Conclusion
We have presented two solutions to improve force measurements from
dynamometer sensor measurements during machining processes. First,
we have introduced a model-based approach combining discrete-time input estimation techniques proposed by Bolliger et. al. and input signal
priors. We have presented a probabilistic variational model for system
identification, the pivotal step in this application, and have showed that
it renders better model-based estimators with ML system identification
than general state-of-the-art system identification techniques. To overcome numerical issues that severely limit applicability of EM-based estimation methods, a novel constrained EM algorithm in a decimated SSM
was proposed. Secondly, frequency-domain approaches that take into
account the three-dimensional nature of the sensor readings, have been
devised.
Practicality of the methods has been corroborated with experimental
results on real machining measurement data. Superior performance of
the variational probabilistic model for system identification has also been
confirmed by real-world data.
While the model-based method assumes low-order systems and estimators, the frequency-domain approach, implicitly relies on high-order system models. As we have demonstrated with different experimental data
sets, both approaches have a distinct raison d’être: In settings where
sensors have simple transfer functions, the model-based estimator has
outperformed the Wiener filter and, in addition, compensation performs
well on both stationary as well as non-stationary signals due to timedomain processing. Otherwise, the Wiener-filter has exhibited superior
performance and is able to jointly process multi-dimensional signals.
Two interesting future directions arise: On one hand, the positive results
of the variational probabilistic model encourage to consider other modelbased estimation tasks and develop techniques that do not necessarily
render the most accurate model estimates, but rather the model that
yields the best performance on the designated task. On the other hand,
prior knowledge on the cutting forces may be used to develop more
reliable, maybe even semi-blind, model identification methods.
Chapter 7
Sparse Bayesian Learning
in Factor Graphs
In this chapter, we show probabilistic models and algorithms that combine sparse methods with Gaussian message passing. A special focus
is on the Bayesian approach, which allows to quantify “confidence” and
“goodness-of-fit” of models or estimates. These measures prove to be
essential for approaching certain blind estimation problems e.g., blind
deconvolution [41]. Concepts from this chapter are specialized to linear
SSMs in Chapter 8.
We first introduce sparsity-promoting priors with variational representations (Section 7.1) and then integrate these priors into general factor
graph that express Gaussian functions and show estimation methods
(Section 7.2). Eventually, we focus on fast methods and show that a
highly efficient method to recover sparse estimates may be derived directly from message passing and the tools in Section 7.2 and we show
a new efficient analogue of this algorithm for multi-dimensional sparse
features (Section 7.4).
7.1
Variational Prior Representation
Heavy tails of the pdf and non-differentiability at zero are commonly
considered to induce sparsity [31, 37]. The latter is typically responsible for recovering exact zeros when used as a prior in an appropriate
estimator (e.g., MAP estimator) [66]. However, in practical applications, signals are typically not exactly sparse (i.e., with zeros), but only
91
92
Sparse Bayesian Learning in Factor Graphs
approximately. This notion has been formalized and particularized to
compressible 1 priors [13,31]. Roughly, a sequence of iid random variables
is compressible, if (infinitely long) realizations are always approximated
well (in quadratic distance) with fewer large entries.
The following particular class of symmetric probability densities defined
as follows allows us eventually to obtain tractable (approximate) statistical models for compressible priors:
Definition 7.1: Super-Gaussian Probabilities [52]
√
A symmetric pdf p(x) is strongly super-Gaussian if p( x) is log-concave
on (0, ∞).
Observe that the strongly super-Gaussian property implies heavy tails,
which are key to weakly-sparse realizations [31] with the corresponding probability distribution. In addition, strongly super-Gaussian pdfs
always admit a specific kind of variational representation [52]; each
2
strongly super-Gaussian pdf p(x) ≡ e−g(x ) may be represented as
x2
1
p(x) = sup √
e− 2γ φ(γ −1 ),
2πγ
γ>0
|
{z
}
(7.1)
,N (x|0,γ)
with
r
φ(α) =
2π g? (α/2)
e
α
and g ? the concave conjugate of g (see, e.g., [11]). In the sequel we define
f (γ) , φ(γ −1 ).
(7.2)
Example 7.1: Variational representation of Student’s t distribution
The Student’s t distribution with degree of freedom ν > 0 has pdf
x2 − ν+1
(7.3)
) 2 ,
ν
which is easily seen to be indeed strongly super-Gaussian. From the
concave conjugate of (7.3), the variational representation follows
p(x) ∝ (1 +
p(x) = sup Kν N (x|0, γ)γ −ν/2 e−ν/2γ
γ>0
1 This term should not be confused with the informmation-theoretic concept of
compression that relates to the entropy of a probability density.
7.2 Sparse Bayesian Learning in Factor Graphs
93
ν+1
2ν
with a constant Kν , 2π ν+1
eν+1 . The weight on γ thus corresponds to an inverse Gamma pdf.
The Student’s t distribution is seen to be a member of a larger class
of distributions that are provably compressible with suitable parameters
(ν < 3 for Student’s t) [31]. Compressibility means that iid realizations
follow almost surely a power-law decay, which is a way to formalize the
property weakly sparse. Using such distributions may lead to more consistent Bayesian interpretations (e.g., a-posteriori mean) from the statistical model [30], which is of special interest when model parameters
must also be estimated, as e.g., for the proposed blind input estimation
method or experimental design tasks [37]. This generative view is different from the common “optimization view”: Sparsifying regularizers are
interpreted as maximum a-posteriori estimates with a given prior (e.g.,
the LASSO problem is commonly linked to a maximum a-posteriori estimation with an iid Laplacian prior).
7.1.1
Multi-Dimensional Features
Akin to the group Lasso in [84], the representation of priors in Figure 7.1 a) may be extended to multidimensional features U ∈ Rd with
d > 1 by adapting (7.1) to
p(u) = sup N (u|0, γI)f (γ).
(7.4)
γ>0
Results are analogous to the scalar case.
The prior defined by (7.4) is useful to describe vectorial features. Particular application examples include multi-input linear SSMs or glue factor2
learning [59].
7.2
Sparse Bayesian Learning in Factor Graphs
Assume that U = U1 , . . . ,Q
UL ∈ R are iid super-Gaussian priors with
L
probability density p(u) = k=1 p(uk ) and a Gaussian3 likelihood func2 The glue factor is a special factor augmenting factor graphs of SSMs. Among
its many applications, its ability to localize a pulses is particularly interesting in this
context.
3 Non-Gaussian likelihoods may be treated by linearization (e.g., EKF) or local
variational approximation [78] (e.g., sigmoid for classification).
94
Sparse Bayesian Learning in Factor Graphs
tion p(y|u) that admits representation with a factor graph [40]. Also
let p(uk |γk ) = N (uk |0, γk ) for all k ∈ [1, L] assume that the variational
weights f (γk ) are continuous. In later parts, we will particularize p(y|u)
such that it corresponds to a linear SSM, but most of the current exposition extends beyond that.
Let ûk denote an estimate of uk that follows from a standard Bayesian
technique. One possibility to obtain such technique is a MAP estimate
û = argmax p(y|u)
u
L
Y
p(uk ),
(7.5)
k=1
called Type-I methods [79]. Now, using the variational representation for
all p(uk ) an interesting expression for the MAP estimate emerges:
ûI , argmax p(y|u)
u
L
Y
k=1
max p(uk |γk )f (γk )
γk
= argmax max p(y|u)p(u|γ)
γ
u
L
Y
f (γk ).
(7.6)
k=1
The maximization in (7.6) can readily be seen as a (non-trivial) instance
of the idea of opening and closing boxes [40]. Indeed, as illustrated
in Figure 7.1, using the variational representation of sparse priors and
the graphical model representation of p(y|u). Type-I methods are potentially equivalent to max-product message passing in the joint graph
in Figure 7.1.
Alternatively to MAP, finding a posterior distribution that capture inferred characteristics of u, i.e. p(u|y), is generally intractable. However,
the factor graph in Figure 7.1 a) represents by definition a (scaled) Gaussian distribution for γ fixed. Assume that all γk can be set to a fixed
value and the resulting Gaussian distribution approximates well the intractable one. Yet, one criterion to choose an approximate distribution
is by maximizing the evidence p(y) [67, 80] since
Z
p(y) = p(y|u)p(u)d u
Z
=
p(y|u)
L
Y
k=1
max p(uk |γk )f (γk )duk
γk
Z
≥ max
γ1 ,...,γL
p(y|u)
L
Y
k=1
p(uk |γk )f (γk )duk ,
(7.7)
7.2 Sparse Bayesian Learning in Factor Graphs
95
where the last expression is akin to a ML problem over γ with latent
variables u. When an approximation captures much of the evidence, i.e.,
is close to p(y), the maximizing (Gaussian) distribution may be used as
a tractable surrogate for p(u|y). This method, corresponding to sumproduct message passing on all edges in the graph except γ, is known as
a Type-II method [12]. Let us define
γ̂ II , argmax
Z
p(y|u)
γ1 ,...,γL
L
Y
p(uk |γk )f (γk )duk ,
(7.8)
k=1
and akin to (7.6) the mode, or equally the mean, as
ûII , argmax p(u|y, γ̂ II ).
(7.9)
u
For undetermined sparse recovery it was argued in [80] that bound (7.7)
is in fact sufficiently tight to obtain sparse ûII .
The relation between Type-I and Type-II estimates for a linear model
y = Θu + n,
(7.10)
where Θ ∈ Rm×L , u ∈ RL , y ∈ Rm , and n ∼ N (0, σ 2 I) can be uncovered through a variational characterization of the estimates [79]. This
variational expression can, alternatively to the derivation in [79], be readily obtained with Lemma B.2 as follows: Observe that the logarithm of
the integral over u in (7.8) can be identified with Lemma B.2 and hence,
γ̂ II , ûII = argmax log W̃Y /2
γ,u
+ q(y, u) +
L
X
p(uk |γk ) + log f (γk )
k=1
= −2 argmin log ΘΓΘT + σ 2 I
γ,u
+ σ −2 ky − Θuk2 +
L
X
u2
k
k=1
γk
− 2 log f (γk ),
(7.11)
where by (7.10), q(y, u) is the quadratic form corresponding to the
Gaussian likelihood p(y|u) = eq(y,u) or the factor graph representation
thereof.
Since we are interested in a-posteriori statistics rather than MAP point
estimates, the subsequent exposition focuses on Type-II estimation and
96
Sparse Bayesian Learning in Factor Graphs
Figure 7.1: Figure a) is a schematic of the “open box” principle; a
strongly super-Gaussian pdf p(u) is expanded into a maximization factor. Figure b) shows the convex-type representation (max) of the Laplace pdf. In this case f (γ) = λγ.
computation methods. It is noteworthy that computing Type-I estimates is generally more straightforward than computing Type-II estimates. Since (7.6) is an (nonlinear) optimization problem, a variety of optimization methods are applicable; One example is alternating
minimization, which corresponds to iteratively reweighted least-squarses
approaches [14], when f (γ) ≡ 1. Similarly, using the strongly superτ
Gaussian prior p(x) = e−|x| results in the weighting used in the sparse
recovery algorithm from [15]. With an additional regularization term
that inhibits entries in γ from going to 0 too fast, these methods deliver
very good performance recovering sparse solutions. However, the additional damping term is an additional parameter that needs to be chosen
carefully in practical applications [76] and it has been reported to have
many local minima, where optimization algorithms can get stuck [62].
7.3
Multiplier Optimization
In the following and if not stated otherwise, we assume that f (γ) ≡ 1. As
seen above, to get sparse estimates maximization of the likelihood with
respect to the multiplication factor4 γ is necessary. To this end, differ4 It is seen that the multiplier ξ only enters through its square in the likelihood.
We therefore, use γ = ξ 2 throughout the presentation and in the factor graph representation.
7.3 Multiplier Optimization
97
ent update rules are given in Table 7.1. To distinguish between other
maximum-likelihood methods, these updates will be denoted multiplier
update steps. Because the update rules in Table 7.1 boil down to local
computations (i.e., message passing), one may envision schemes for much
more complex models, by building upon Gaussian message passing relations and multiplier optimization rules. Derivations of the expressions
in Table 7.1 are deferred to Appendix B.5 on page 142.
In the following, a few observations are due. Firstly, when there are
multiple (connected) features U , the EM-based updates and MacKay
updates can be executed independently, i.e., in parallel on every edge
γ. Contrary, the marginal likelihood maximization (V.6) updates are
derived under the assumption that all multipliers γ, except the updated
one, are fixed.
When f (γ) is not constant, a closed-form expression for f (γ) given a
super-Gaussian prior pdf p(x) may be found through convex duality
(e.g., Student’s t distribution in Example 7.1). However, for general
p(x), hence general f (γ), only the EM multiplier update yields closedform updates - the other multiplier updates can usually not be expressed
in closed-form.
Convergence
Let
Consider (7.5) and an uninformative weight f (γ) ≡ 1.
Z
L(γ) ,
p(y|u)
L
Y
N (uk |0, γk )d u.
k=1
be the log-likelihood.
Convergence to a local maxima5 is guaranteed for (V.6). The MacKay
update rule, which is a gradient method, does not necessarily converge
and the EM update may converge to saddle points. Contrary to (V.6),
the EM update cannot reintroduce features that have been turned off,
i.e., γ set to 0.
A further connection between the update expressions, which eventually
shed some light onto their convergence properties is obtained with (2.11)
and (B.23). It is easily seen that the gradient of the log-likelihood is
2
d`(γ)
= W̃X − W̃X µX .
dγ
5 As the optimization problem over multiple γ is non-convex this is as good as we
can hope for.
98
Sparse Bayesian Learning in Factor Graphs
Node
Update rule
• EM (M step):
N (0, 1)
f
γ ← m2X + VX
max
(V.1)
or
γ
2 .
γ ← γ − γ 2 W̃X − W̃X µX
×
X
(V.2)
If f (γ) is not necessarily equal to 1 everywhere,
but is chosen s.t. the max box corresponds
to p(x) then [52]:
−1/2 1 d log p(x)
γ← −
.
√ 2
x
dx
x=
mX +VX
(V.3)
• MacKay update [67]:
γ←
m2X
1 − VX /γ
(V.4)
or
γ←
W̃X µX
W̃X
2
γ.
(V.5)
• Marginal likelihood maximization updates:
γ ← argmax p(y|γ)
γ
or
γ←
q
←
−2
m
Xn
0
←
−
− VXn
−
←
−2 ≥ ←
m
V Xn
Xn
(V.6)
otherwise,
Table 7.1: Overview of multiplier update rules for feature selection.
7.4 Fast Sparse Bayesian Learning
99
Looking at (V.2), we recognize that the EM update is in fact a gradient
step with variable step size γ 2
γ ← γ − γ2
d`(γ)
dγ
on the likelihood `(γ). Since the step size goes to zero faster than the
estimate γ, it is evident that EM updates show slow convergence for
γ small, i.e., when a feature is almost pruned from the model. It is
evident from (V.5) that this effect does not occur for MacKay updates.
It appears that this behavior, close to the zeros, plus specific initial
values for γ has strongly contributed to experimental findings and the
general belief that convergence of EM is “slow”.
Accelerated Updates Convergence of the EM updates and MacKay
updates can be accelerated with a simple idea that was used in [64] for
variational parameter estimation and for EM updates with a Gamma
←
−
− are held constant, as they are inprior in [56]. Messages V X and ←
m
X
dependent of γ, and we treat the updates on γ as fixed-point equations;
the updates are iterated several times. It can be seen that when doing
so, the fixed point of the update equations corresponds to (V.6). This
result was also recognized in [56, 64].
7.4
Fast Sparse Bayesian Learning
The fast sparse Bayesian learning (fast SBL) algorithm presented in [20]
combines low computational requirements with good marginal likelihood
optimization results. The appeal of this method lies in closed-form update computations and large computational savings due to the relatively
low dimensions of the involved matrices.
The algorithm optimizes the marginal likelihood of a single feature Un
at a time. As shown in [20], maximization can be expressed in closed
form and shows a threshold behavior, removing features from the model
in a single iteration. Various extensions of this algorithm have been
developed: In [5] a hierarchical prior is added to the features’ variances
(corresponding to priors on γn in Figure 7.2), which results in adaptive
feature estimation thresholds (cf. (7.13)), while in [64] variational meanfield methods are applied.
100
Sparse Bayesian Learning in Factor Graphs
N (0, 1)
f
N (0, 1)
γ1
..
.
×
U1
f
×
Un
U2
θ1
γN
θ2
...
θN
+
...
+
2
I)
N (0, σN
Ỹ
y
+
Figure 7.2: Factor graph of a general SBL setup expressed as a recursive least-squares graph.
We show that the fast SBL algorithm follows (if it exists) from a recursive
least-squares decomposition of our general likelihood (cf. Section 7.2) in
combination with W̃-based message representations. Specifically, consider the factor graph depicted in Figure 7.2, which has been decomposed into its features U1 , . . . , UN and where Y is an M -dimensional
observation vector.
Initially, all features are pruned from the model i.e., γn = 0 for all n ∈
[1, N ]. The pivotal quantities, messages W̃Y and W̃Y µY , are initialized
as in
−
→
←
− −1
− −−
→ W̃Y = VY + VY
W̃Y µY = W̃Y ←
m
m
Y
Y
−2
= σN
I
−2
= σN
y.
(7.12)
Now, a single feature Un is iteratively updated such that the marginal
likelihood p(y|γ\n , γn ) is maximized, where γ\n denotes the vector of
all γk except γn . Using (7.13),
(←
−
−
−2 − ←
−2 ≥ ←
VUn ←
m
V Un
m
Un
Un
γ̂n =
(7.13)
0
otherwise.
Between updates (7.13), messages can be efficiently updated by message
passing. The complete iterative algorithm follows from rules presented
in Chapter 4 then:
7.4 Fast Sparse Bayesian Learning
101
Algorithm 7.1: Fast Feature Selection
Initialize W̃Y as in (7.12) and γn = 0 for all n ∈ [1, N ]. Then the
following steps are iterated until convergence of γ.
1) Select a feature Un for updating according to a scheduling scheme.
Commonly, the next features are selected in a round-robin fashion or such that marginal log-likelihood increments are maximized
(greedy scheme). In the latter case, additional computation are
necessary as W̃Un and W̃Un µUn for all i ∈ [1, N ] must be evaluated
from which the marginal log-likelihood increment can be obtained
easily.
←
−
− , from W̃ = θT W̃ θ and W̃ µ =
2) Compute VUn and ←
m
Un
Un
Y n
Un Un
n
T
θn W̃Y µY , which follow from (4.1) and (4.4),
←
−
→
−
V Un + V Un =
←
−
V Un =
←
− −−
→ = W̃Un µUn
m
m
Un
Un
W̃Un
1
W̃Un
1
θnT W̃Y
θn
←
− = W̃Un µUn . (7.14)
m
Un
θnT W̃Y θn
− γn2
3) Set ∆γn = γn − γ̂n and then update γn to γ̂n according to (7.13).
4) Update W̃Y and W̃Y µY . To this end, consider the change of the
message that enters the “Σ” node. Straightforward use of the matrix inversion lemma [57] yields a rank-1 updates for W̃Y :
W̃Y−1 − ∆γn θn θnT
−1
= W̃Y −
W̃Y θn θT
n W̃Y
,
T
θn W̃Y θn − ∆−1
γn
while W̃Y µY is then obtained by multiplication of W̃Y and y, since
−
→ = 0 for all n ∈ [1, N ].
m
Un
Remark 7.1
It can be easily seen that the variables W̃Un and W̃Un µUn correspond
to the variables Sn and Qn , which are used in the original fast SBL
algorithm in [20].
Implementation Aspects The graph is cycle free and linear Gaussian, thus all computed marginals are exact. Step 2) in Algorithm 7.1
requires O(3M 2 ) multiplications (3 matrix-vector products of dimension M ) and updating W̃Y in step 4) O(M 2 ) multiplications (1 vector
102
Sparse Bayesian Learning in Factor Graphs
outer product of dimension M ). For M very large and very sparse features, computational complexity may be reduced by using VX instead
of W̃Y , as rows and columns of VX corresponding to indices with γ = 0
are zero as well. Thus, all matrix operations can be reduced to L dimensional operations with L the number of non-zero features. Variables W̃X
can then be recovered with (4.1). Of course, the scheduling scheme in
step 1) might be significantly more complex. Complexity can be further
reduced by using FFT techniques.
Noise Estimation When noise variance σN is unknown, EM-based
estimation steps or marginal likelihood updates can be integrated into
Algorithm 7.1 to concurrently estimate the noise variance [20, 67]. The
necessary marginal parameters are readily obtained from W̃Y by (4.2)
and (4.3):
2
4
VY = σN
I − σN
W̃Y
2
mY = y − σN
W̃Y y.
Note that with the current parameterization of the sparse prior, an up2
date of σN requires an expensive matrix inversion. Integrating σN
into
the prior, as suggested in [36] resolves this issue.
7.4.1
Multi-Dimensional Fast Sparse Bayesian Learning
Now consider N multi-dimensional features Un ∈ Rd with d > 1, as
shown in Figure 7.3. We will denote the d columns of the dictionary
that correspond to feature Un as Θn .
Our goal is to extend the advantages of the fast SBL to estimation of the,
presumably, sparse multi-dimensional features. To this end, we recognize that most steps in the presented fast feature selection Algorithm 7.1
can be easily extended to multi dimensional Un by using standard message passing rules. The difficulty lies in the adaptation of the marginal
likelihood maximization; For d > 1 the marginal log-likelihood (with
←
− −1
respect to γn ) is seen from (B.24) using W̃Un = γn2 I + VUn
and
7.4 Fast Sparse Bayesian Learning
103
N (0, I)
γn
×
Un
Θn
...
+
...
Figure 7.3: To accommodate multi-dimensional features, the inputs
in Figure 7.2 are replaced by the displayed vector feature Un while the rest of the factor graph remains unchanged.
− , the log-likelihood can be expressed as
W̃Un µUn = W̃Un ←
m
Un
←
−
− −1 ←
−T γ 2 I + ←
− .
2 log p(y|γ\i , γn ) ∝ − log |γn2 I + VUn | − ←
m
VUn
m
Un
Un
n
|
{z
} |
{z
}
concave
convex
(7.15)
Two important observations can be made: First, the log-likelihood (7.15)
is non-convex because it is a sum of a convex term and a concave term.
Also, a closed-form solution as in (7.13) is not readily available. Secondly,
consider the eigenvalue decomposition
←
−
VUn = Ddiag(Λ)DT
(7.16)
− . Then, (7.15) can be
m
where Λ , (λ1 , . . . , λd ) and define m , QT ←
Un
decomposed into a sum of d scalar problems as in (7.13):
2 log p(y|γ\i , γn2 ) ∝ −
d
X
l=1
log(γn2 + λl ) +
m2l
.
+ λl
γn2
(7.17)
Observe that each term in (7.17) is unimodal. Hence, we can conclude
that if m2l ≤ λl for all l ∈ [1, d] then θn = γn2 = 0 is a maximizer
of (7.15) and marginal log-likelihood optimization leads to thresholding
of γn (i.e., γn = 0). The sparsifying characteristic of this method carry
over to the multi-dimensional case.
104
Sparse Bayesian Learning in Factor Graphs
Computational Aspects One approach to maximize the marginal
likelihood in (7.15) is by using the EM updates or the MacKay updates
shown in Table 7.1 for multi-dimensional features combined with the
iteration technique mentioned in Section 7.3. However, such algorithms
often exhibit two drawbacks relevant to fast feature selection algorithms:
• Once a feature is pruned from the model, i.e., with γ = 0, it is
permanently removed from the model.
• Convergence of the iterative schemes can be very slow. As a consequence, many inner loops are necessary to compute step 3) when
extending Algorithm 7.1 to multi-dimensional features.
We seek an (approximate) method to solve (7.13) that can introduced
pruned features into the model and exhibits fast convergence. To this
end, we propose to adapt Algorithm 7.1 to the following scheme:
Algorithm 7.2: Fast Multi-Dimensional Feature Selection
[0]
[0]
Initialize all γ1 , . . . , γN to 0. Then perform the following steps until
convergence of all γk .
[t]
1) Select a feature Un as in step 1) of Algorithm 7.1. All γk with
k 6= n are not updated.
2) Compute messages (cf. Table 4.1):
W̃Un ← ΘTn W̃Y Θn ,
(7.18)
W̃Un µUn ← ΘTn W̃Y µY .
(7.19)
[t],2
If γn = 0: test if feature should be introduced into the model in
3)
[t],2
If γn > 0: reestimate γn2 in 4)
3) First evaluate
ςn ← kW̃Un µUn k2 − Tr W̃Un .
(7.20)
If ςn > 0 the feature Un is added to the model and γn is reestimated
(see next step). Otherwise, if ςn ≤ 0, the feature is not added,
[t+1]
γn
← 0 and skip to step 1) again.
7.4 Fast Sparse Bayesian Learning
105
4) Initialize
G ← W̃Un
and if more than 1 optimization iteration below is performed, compute also
−1
←
−
− = ΘT W̃ Θ −1 W̃ µ ,
VUn = ΘTn W̃Y Θn
− γn[t],2 I, ←
m
Un
Y
n
Un Un
n
then γn is estimated by iterating the following steps for a fixed
number of steps:
i)
γn[t],2 ← min
Tr
− 2
←
− ←
−T − ←
m
V
m
Un
Un G
Un
Tr G2
, 0
(7.21)
ii)
←
−1
−
G ← VUn + γn[t],2
5) Update W̃Y and W̃Y µY analogously to Algorithm 7.1. Specifically,
for W̃Y :
H ← ΘTn W̃Y Θn − ∆−1
γn I
−1
W̃Y ← W̃Y − W̃Y Θn HΘTn W̃Y ,
where
∆γn = γn[t+1],2 − γn[t],2 .
Before presenting performance results for this algorithm, we provide insight into (7.20) and (7.21). First note that computation of ςn from (7.20)
is very lightweight and thus allows to rule out features with few computations, akin to (7.13). Furthermore, we have
γ̂n2 = 0 ⇒ ςn ≤ 0,
(7.22)
i.e., ςn ≤ 0 is a necessary condition to remove a feature. Empirical
results shown below show that it is highly effective as well. The proof
of (7.22) is presented in Appendix B.5 on 145.
106
Sparse Bayesian Learning in Factor Graphs
The iterative optimization in (7.21) is a fixed-point iteration obtained
by setting the gradient of the marginal log-likelihood (B.25) to zero:
d
log p(y|γ\i , θn ) = 0
dθn
− ←
−T − Tr W̃Un + Tr W̃U2 n ←
m
Un mUn = 0
←
− − ←
−T Tr θn I + VUn W̃U2 n = Tr W̃U2 n ←
m
Un mUn
←
− − ←
−T
m
θn Tr W̃U2 n = Tr W̃U2 n ←
Un mUn − VUn
←
− − ←
−T
m
Tr W̃U2 n ←
Un mUn − VUn
.
γn2 =
Tr W̃U2 n
−2
Since θn = γn2 is non-negative, negative θn values are projected back
onto zero. Empirical evidence shows that the proposed fixed-point iteration converges very quickly, also to the value 0 where a feature may
subsequently be pruned from the model.
Performance We show simulations to compare performance of the
proposed multi-dimensional extension of the fast feature selection algorithm with the exact algorithm that uses a line search method to maximize the marginal log-likelihood (step 3) in Algorithm 7.1) in each update and with the oracle estimator, which is the MMSE-estimator of the
features knowing nonzero entries. For all methods the same scheduling
scheme and convergence criteria has been used: the algorithms iterate
through all feature groups in a round-robin fashion and are stopped once
the difference of the estimated Γ between two loops (through all feature
groups) is less than 0.1% and the number of active (γn2 > 0) features has
not changed.
To generate data for the sparse multi-dimensional features estimation we
construct a basis Φ ∈ RM ×N by drawing M N samples from a zero-mean
normal distribution with variance 1 M (unit-energy basis vectors). The
features vectors X ∈ RN have a fixed number of K nonzero sub-vectors,
constructed by randomly setting d-dimensional sub-vectors of X to 1.
2
Measurements are corrupted by Gaussian noise with variance σN
.
Simulation results for N = 200, M = 100, K = 5, and d = 5 are shown
7.4 Fast Sparse Bayesian Learning
107
1 fixed-point iteration
2 fixed-point iterations
exact maximization
oracle
NMSE
100
10−1
Avg. maximization steps
400
1 fixed-point iteration
2 fixed-point iterations
exact maximization
300
200
100
0
0
5
10
15
20
25
30
35
40
45
SNR [dB]
Figure 7.4: Sparse multi-dimensional feature estimation algorithm results averaged over 100 basis, coefficient, and noise realizations. Figure (a) shows the NMSE versus SNR. In (b) the
average number of maximization steps until convergence
at different SNR points is shown.
108
Sparse Bayesian Learning in Factor Graphs
in Figure 7.4. Definition of the SNR
SNR ,
Kd
2 ,
M σN
is standard and the NMSE as in (2.5) based on the MMSE estimate ŵ.
The proposed algorithm is executed with fixed numbers of iterations of
step (7.21). We observe that one iteration yields nearly optimal performance, while merely two iterations are enough to perform as well as the
exact algorithm. The former setting is particularly interesting, as it is
free of complex matrix operations such as matrix inversions or eigendecompositions; most similar in spirit to the standard (scalar) fast SBL
algorithm. The gap in error-performance between the fast feature selection algorithms and the oracle estimator is constant over various SNRs
and can be attributed to suboptimal solutions, due to the non-convex
nature of the full log-likelihood.
The benefit of criteria (7.20), is evident from the bottom plot in Figure 7.4. The proposed method is able to prune many features from the
model without having to perform costly optimization of the marginal
likelihood. Significant computational savings can be expected in practical applications.
Finally, it remains to check whether the reduction in maximization iterations comes at the expense of a higher number of active basis elements
during processing and at convergence. Simulations show that the active
set sizes are almost the same for all algorithms.
7.5
Conclusion
Variance estimation of zero-mean Gaussian random variables in probability models can be leveraged to infer sparsifying and/or heavy-tailed
priors in linear Gaussian models and, essentially, enabling sparse recovery of the respective signals. We have elaborated on the application of
this concept to message passing algorithms and have formalized it for
factor graphics, representing a linear Gaussian probability model, and
Gaussian message-passing methods. The sparse estimation approach
with linear Gaussian models goes as follows:
1. Priors or weights on the variances are chosen according to a desired
heavy-tailed or sparsifying distribution.
7.5 Conclusion
109
2. MAP/ML estimates of the multiplication factors viz. the variances
of the variational priors are computed.
3. Variances are fixed to their ML/MAP estimate and standard inference methods for linear Gaussian models can be employed.
We have presented various options to implement the second step above
and discussed computational trade-offs.
In the second part, we have presented a novel algorithm for multidimensional input (group-sparse) fast recovery based on a dual-precision
formulation of fast SBL algorithms and a new (approximate) marginal
likelihood criterion. Experimental results have corroborated that significant computational savings compared to standard marginal-likelihoodbased SBL algorithm are achievable without compromising estimation
performance.
Chapter 8
Sparse Input Estimation
in State Space Models
In this chapter, we concentrate on sparse-input estimation in SSM and,
in a second step, develop a joint input and model estimation method. To
this end, the general sparse estimation framework introduced in Chapter 7 is specialized to SSM and novel methods to for sparse input estimation in SSM (Section 8.1) are conceived. With a focus on a large variety
of practical applications where neither signal model is known a-priori
nor the input signal is given, we present a blind deconvolution method
that assumes sparse unknown input signals. This estimation method
discussed in Section 8.2, simultaneously infers an SSM representations
and an input signal. Effectiveness in real-world application and estimation performance of the proposed method are substantiated by means of
experimental results with a real-world application.
8.1
Sparse Input Estimation
Let us particularize the general setting from Section 7.2; consider a
single-input single-output linear SSM (cf. (2.3)) with the observed signal y1 , . . . , yL ∈ R. The sparse input-estimation problem, assumes that
the SSM is driven by a weakly sparse signal U0 , . . . , UL−1 ∈ R which is
iid distributed with a compressible prior p(uk ) according to Section 7.1.
Our goal is to estimate the input. The complete model results in the
factor graph shown in Figure 8.1, which is indeed a special case of our
general framework introduced in Section 7.2.
111
112
Sparse Input Estimation in State Space Models
N (0, 1)
max
f
ξ
×
Uk−1
b
···
0
Xk−1
A
+
Xk
=
Xk0
···
c
2
)
N (0, σN
+
yk
Figure 8.1: A factor graph representation of our sparse-input SSM,
where p(y|u) is decomposed into factors defined by the
SSM in Section 8.1.
Another way to generalize the statistical model in Figure 8.1, is by using
inputs and outputs with different rates. When the input is sampled at a
higher rate than the observations, we obtain a super-resolution problem.
On the other hand, when the output is available with shorter periods
relative to the input process, the problem setting essentially encompasses
multi-hypothesis testing problems or multiple glue factor learning. The
subsequent treatment can be readily extended in the aforementioned
cases.
8.1 Sparse Input Estimation
8.1.1
113
Algorithm
The proposed algorithm relies on the factor graph depicted in Figure 8.1
and uses W̃ and W̃µ to efficiently compute the posterior statistics. Once
the posterior statistics are computed, an update of the max box is performed according to an update rule in Table 7.1. A complete algorithm
statement is provided in Section A.4.
There is a number of key features that make the proposed Gaussian message passing scheme based on dual-precision message a particularly efficient choice for sparse input estimation and a highly efficient algorithm
in general. First, it is free of computationally costly matrix inversions.
Second, the marginal statistics for Uk for any k ∈ [1, L] are simple projections of dual precision quantities W̃Xk and W̃Xk µXk used in Kalman
smoothing:
VUk = σ 2 γk − (σ 2 γk )2 bT W̃Xk b
mUk = −σ 2 γk bT W̃Xk µXk .
In contrast to standard sparse recovery methods and owing to the local
character of the message passing algorithm, the proposed method scales
linearly with the signal length L.
With respect to numerical properties of the algorithm, as discussed in
Chapter 3, the matrix inversion-free computations allow applicability of
sparse input estimation to complex high-order state-space models.
8.1.2
Simulation Results
The favorable recovery performance of the proposed sparse input estimator is shown with a synthetic example. An exactly sparse signal is
generated and passed through a highly resonating filter of order 12 and
corrupted by white Gaussian noise. Refer to Figure 8.2 and Figure 8.3
for the original signal and the observed one for SNR of 30 dB and 10 dB
respectively.
In both examples, the Kalman-smoother based scheme described in Section 8.1.1 with iid Student’s t prior for the sparse input signal samples
with ν = 10−4 was used. The variances viz. multiplication factors, were
obtained with the EM-based method from Table 7.1 in 10 iterations.
114
Sparse Input Estimation in State Space Models
y
0.1
0.05
0
-0.05
true input U
estimate Û
LASSO-based estimate
2
0
−2
50
75
100 125 150 175 200 225 250 275 300 325 350
Index
Figure 8.2: Input estimation with SSM and sparsity promoting iid
prior using simulated data with SNR of 30 dB. Also shown
is the LASSO estimator for recovery of the sparse input.
8.1 Sparse Input Estimation
115
0.2
y
0.1
0
−0.1
−0.2
true input U
estimate Û
1
0
−1
50
75
100 125 150 175 200 225 250 275 300 325 350
Index
Figure 8.3: Input estimation with SSM and sparsity promoting iid
prior using simulated data with SNR of 10 dB.
116
Sparse Input Estimation in State Space Models
The LASSO estimate1 [33] of the sparse input signal is also depicted for
comparison. We observe that, for this example the LASSO estimate does
not work well due to strong coherence in the dictionary, which originates
in the slowly-decaying character of the impulse responses. Our proposed
Bayesian type-II estimator appears to be able to cope with coherence
much better, which confirms observations made in [77]. Apart from
potentially inferior estimation performance for sparse input estimation,
compared to the proposed Kalman-filter based scheme, LASSO-based
methods are computationally more demanding and do not scale linearly
in the signal length.
8.2
Blind Deconvolution
In practical applications, in addition to the unknown input signal, the
underlying model, i.e., the SSM representation, might also be unknown
a-priori. Sensible estimates may emerge by taking advantage of the
(weak) sparsity assumption imposed on the input signal to eliminate the
inherent ambiguity between unknown input signal and unknown SSM
representation. In general, methods that simultaneously estimate a random process and a dynamical system are termed blind deconvolution
schemes.
In order to derive a blind scheme, let us write (7.11) as a joint minimization problem
ky − Huk2
argmin −2 log p(y|γ, H) = argmin min log HΓHT + σ 2 I +
u
σ2
γ,H
γ,H
+
L
X
u2
k
k=0
γk
− log f (γk ).
(8.1)
To avoid an obvious scaling ambiguity, we impose that H is normalized,
i.e., the corresponding impulse response h = h1 , h2 . . . of the SSM has
energy 1.
8.2.1
Type-I Estimators vs. Type-II Estimators
While (8.1) is a type II estimate of the compressible input Uk , one may
wonder if type I estimators such as LASSO, which correspond to a similar
1 The
regularization weight is set such that the estimation error is minimal.
8.2 Blind Deconvolution
117
optimization problem
argmin −2 log p(y, u|H) = argmin σ −2 ky − Huk2
u,H
u,γ,H
+
L
X
u2
k
k=0
γk
+ log f (γk ).
(8.2)
may also prove useful for blind deconvolution. The answer is that type
II estimator should always be preferred in this case.
To elaborate on this statement, observe that the sole difference between (8.2) and (8.1) is the first log-determinant term. The term penalizes large γ values. It also regularizes the structure of the SSM represented by H in a highly desirable way. If the SSM is trivial, i.e., A = 0,
b = [1, 0, . . . , 0]T , and c = [1, 0, . . . , 0] then H is the identity operator.
This choice of SSM is always penalized by the log-determinant term;
From Hadamard’s inequality we now observe that
L
X
log HΓHT + σ 2 I ≤
log [HΓHT ]k,k + σ 2 ,
(8.3)
k=0
where equality holds when HΓHT is a diagonal matrix. Since, typically,
√
γ k and uk will be similar to yk (in either (8.2) and (8.1) due to the
quadratic term), the diagonal entries
[HΓHT ]n,n =
n−1
X
h2k γn−k
k=0
can be seen to be roughly equal to yk2 . Looking at (8.3), we then observe
that the righthand side corresponds to the trivial SSM estimate H. We
can thus conclude that the log-determinant term will always penalize2
the trivial SSM compared to other SSMs; the penalty acting as a driver
towards parsimonious SSMs.
2 The penalties added by the log-determinant term are commonly large. This may
be seen for example, when impulse responses approximately form an orthogonal basis
(i.e., time shifted versions of the impulse response are mutually orthonormal) and,
for simplicity, the number of observations is the same as the number of inputs. After
PL
Eigenvalue decomposition the log-determinant then corresponds to
log γk + σ 2
k=0
and as σ 2 is small and many γk → 0 the term can take large negative values. For
PL
H = 1, the log-determinant will be approximately
log yk2 + σ 2 and thus much
k=0
larger.
118
Sparse Input Estimation in State Space Models
Experimentally, it was observed that LASSO-based blind deconvolution
tends to converge to trivial SSMs estimates and thus, give meaningless
results.
8.2.2
Algorithm
The objective (8.1) may be conveniently solved by coordinate descent
in H or γ, while minimization over u corresponds to computation of
the Gaussian messages in the graph shown in Figure 8.1. Both descent
steps are derived from the corresponding EM update. Specifically, the
proposed blind deconvolution method, follows from EM-based multiplier
optimization rules (cf. Table 7.1) for the local EM boxes in Figure 8.1
and using EM-based identification of the SSM in AR form. Particularly,
we alternate between: i) sparse input estimation using the current SSM
estimate and then ii) a system identification step, similar to Algorithm
A.4, to estimate A and c.
Computational complexity of both maximization steps is similar, as both
necessitate a full Gaussian message passing smoothing step in order to
get the marginal statistics of Uk or Xk .
Initialization
The input estimate in the first iteration can be considered proportional
to the energy in y weighted by the spectrum of the initial SSM. With
no prior knowledge on model or u, an instantaneous energy detector is
sensible initial choice. To this end, the initial SSM is initialized with
A = 0,
(8.4)
b as all ones vector, and c is drawn randomly and scaled such that
cb = 1.
This type of initialization may also be argued by considering the EM
updates for A and c. We focus on c but the analogous applies to A. The
M-step for c, when estimated at multiple observations yk with k ∈ [1, N ]
!−1 K
K
X
X ←−
←−
−
c=
W
W ←
m
ck
=
k=1
Wc−1 Wc mc
ck
ck
k=1
(8.5)
8.2 Blind Deconvolution
119
(k)
can be interpreted as a weighted sum over local estimates Wc
(k)
(k)
(k)
Wc mc for every k. The weights Wc are computed with
and
Wc(k) = VXk + mXk mTXk .
Returning to the initialization, specifically the first system identification
update, we recognize that VXk only depends on γk and mXk mTXk on yk2
(k)
and γk . In the first iteration, when A = 0 was used to initialize, Wc is
proportional to the instantaneous observed energy. Hence, samples with
high instantaneous energy, which would be expected to contain most
signal information also get the largest weight on their local estimate
once the update is computed as in (8.5).
Energy-Constrained Updates
Since the EM algorithm maximizes the likelihood step wise, the unspecific input estimates in the first iterations imply a large ambiguity on
the SSM. Commonly, this results in system estimates that exhibit a gain
which is far from 1.
Using a non-trivial prior on γ, scaling of the dictionary or the sparse
signal does affect our estimation algorithms in two relevant ways: The
Bayesian model might not be describing the observations well enough
or methods used to optimize the likelihood, such as EM or gradient
updates, will exhibit poor convergence properties.
To prevent this effect, a constraint is imposed on the EM step for c.
Under the assumption that the SSM is stable, the energy of the SSM’s
impulse response h = h1 , h2 . . . is forced to 1. This is, of course, equivalent to a norm constraint on the dictionary columns. Consider the
following expression for the energy of the impulse response of our SSM
khk2 =
=
∞
X
h2n
n=1
∞
X
cAn bbT AT
n
c
n=0
∞
X
=c
n
T
A bb
T n
!
cT ,
A
n=0
|
{z
C(A,bbT )
}
(8.6)
120
Sparse Input Estimation in State Space Models
where we used hn = cAn b and C A, bbT is known as controllability Gramian [38] and can be obtained from the solution of a Lyapunov
equation with A and bbT . When A is (approximately) constant during
an EM update of the SSM, a quadratic constraint on c can be conceived
from (8.6) and then added to the t + 1th M-step for c:
min cWc cT − 2cWc mc
c
.
s.t. c C A[t] , bbT cT = 1
This optimization problem is a quadratically constrained quadratic problem and can be solved with Newton’s method [29].
Since the current value c[t] will typically be close to the optimal value,
we choose to linearize the quadratic constraint around the current value
which results in a quadratic programming problem with equality constraints, which can be solved (in closed form) using an augmented linear
system of equations.
8.2.3
Simulation Results
To evaluate the performance of the blind deconvolution algorithm we
create a highly resonating 4th order system3 . An input signal u ∈
{−1, 0, 1}N of length N = 300 is randomly generated such that only
8 components are non zero. Furthermore, the observations ỹ are corrupted by noise such that SNR is 30 dB.
The algorithm is initialized according to Section 8.2.2 and after 20 EM
iterations, the final result, shown in Figure 8.4, is conceived4 . The time
shift is largely due to the ambiguity related to the choice of observation
vector c in relation to the estimated signal Û , i.e. there are often multiple combinations for c and Û that describe observations well. This
ambiguity can not be circumvented without enforcing more constraints
on either c, Û or both. Furthermore, in many applications a small constant time shift is negligible.
3 The SSM has poles with absolute value ρ = 0.99 and phase randomly chosen
from [0, π), while the zeros are drawn completely randomly.
4 The sign of the input estimate, which is obviously non-identifiable was always
adapted such that it matches the true signal best.
8.2 Blind Deconvolution
121
0.2
measurements y
0.1
0
−0.1
−0.2
1
true input U
estimate Û
0
−1
50
100
150
200
250
300
350
Index
Figure 8.4: Blind input estimation example using 4th order SSM and
simulated data y with length 300 and SNR 30 dB.
122
Sparse Input Estimation in State Space Models
BCG
ECG
0
100
200
300
400
500
600
700
800
900
1,000
Index
Figure 8.5: A BCG measurement and an ECG measurement recorded
synchronously with a length of 1000 samples.
8.2.4
Heart Beat Detection for Ballistocardiography
Ballistocardiography (BCG), the measurement of forces on the body, exerted by heart contraction and subsequent blood ejection, is a method
for non-obstructive monitoring heart conditions [2]. Certain relevant
physiological parameters (e.g., the heart-rate variability) require accurate heart beat time stamps; this means that individual heart beats must
be detected, which is considerably more difficult than detecting pulses
which are strictly periodic. This task is severely complicated by the
typical characteristics of BCG measurements: the beats cause large oscillations in the mechanical measurement system, with the result that
beats can not be discerned anymore. A typical measurement series is depicted in Figure 8.5, where the BCG signal is plotted in comparison to
8.2 Blind Deconvolution
123
an Electrocardiography (ECG) measurement, recorded synchronously5 .
An additional complication in this application are movements of the patient or test subject, which change the shape of the signals and limits
application of signal template or pattern matching techniques.
We applied the proposed blind input estimation method to BCG measurements with the goal to detect single heart beats. To this end, a 4th
order single-input and single-output SSM is employed, where both SSM
parameters (A matrix and c vector) are estimated from the data and
it is initialized as described in Section 8.2.2. The weakly-sparse input
process is modeled with a Student’s t prior with ν = 1. The length of
the signal was 3000. The algorithm was run for 100 iterations.
The final input estimate are thresholded to read off the peaks’ time
stamps. When two detected peaks are very close together (less than
20 samples or 100 ms), the smaller peak is discarded. To assess the performance of our proposed method, we compare it with a likelihood-based
filter from [72]. The likelihood-based filter method utilizes a 16th order
SSM, which must first be estimated on a much longer signal window using the ECG signal as a proxy for the unknown input. Strictly speaking
that likelihood filter is not a blind estimator. In addition, the SSM employs two-dimensional outputs, as two-channel BCG measurements are
available.
The thresholded γ, the corresponding ECG measurement, and the result
of the likelihood-based filter6 are shown in Figure 8.6. If two peaks of
estimated γ are less than 20 samples (200 ms) apart, the larger peak is
selected and marked by a circle. Our input estimate has been shifted,
because the electrical excitation of the heart and the physical effects of
the blood flow are shifted and because blind input estimation estimates
the input signal up to a time shift.
It is evident that blind input estimation is able to detect well individual
heart beats and yields comparable performance as the non-blind likeli5 The BCG and ECG measurements were recorded by Daniel Waltisberger from
the electronics laboratory (IFE) at ETH. The provided BCG signal, was subject to
decimation from 1000 Hz to 200 Hz, to reduce noise, but primarily to reduce computational costs. A 10-th order, zero-phase bandpass filter (0.5 Hz-30 Hz) was combined
with a second order IIR notch filter at 50 Hz. The lower cut-off was required for
filtering respiratory movements, the upper cut-off is used to attenuate noise, whereas
the notch for filtering the 50 Hz mains.
6 The likelihood-based filter processes data in a window-based fashion. The length
of a window is adjusted such that it contains one heart beat with high probability. The
largest likelihood peak in each window corresponds to the estimated beat position.
124
Sparse Input Estimation in State Space Models
proposed method γ̂
detected heart beats
ECG
likelihood filter
500
1,000
1,500
2,000
2,500
3,000
Figure 8.6: Blind pulse (heart beat) detection from a ballistocardiographic signal. The top plot shows the estimated timevarying input variance and detected heart beats. The ECG
signal (middle) serves for validation (i.e., to provide the
ground truth). The bottom plot shows the result of the
(non-blind) heart beat detection method [72].
8.3 Conclusions and Outlook
125
hood filter. Interestingly, when multiple close inputs are estimated, the
likelihood filter suffers a drop in the overall likelihood level (see, e.g.,
around sampling index 1900). It is indicative that the model SSM does
not describe the data well around that peak.
8.3
Conclusions and Outlook
We have developed an efficient and robust method to estimate weakly
sparse (input) signals based on zero-mean normal inputs with timevarying variance and (iterative) Gaussian message passing. The approach was then extended to the case where the linear system is unknown and must be estimated as well. The practicality of the proposed
approach has been demonstrated by a real-world example.
Our algorithm has three appealing properties that set it apart from other
common sparse recovery methods. Firstly, no sparsity controlling parameter needs to be set to obtain good estimation performance, unlike for
instance the regularization parameter in LASSO methods. On the other
hand, it has been observed that even when sparsity is low, our method
converges stably towards standard MMSE input estimation. Secondly,
input estimation performs well when estimating inputs in models with
slowly decaying impulse responses. Typically these kind of sparse recovery problem is accompanied by highly coherent dictionaries. Thirdly, it
seems that the proposed blind input and model estimation method rules
out trivial solutions (e.g., a one-to-one pass-through SSM). In contrast to
standard sparse methods for blind estimation in e.g., image processing,
our scheme requires only mild additional constraint on the estimated
system model.
Generalizations of the proposed weakly-sparse input estimation method
to to multi-dimensional observations or non-stationary conditions (e.g.,
time varying SSMs) are immediate. The ease to cover these cases comes
from the flexibility of the SSM-based approach.
Another advantage of SSM-based Gaussian message passing is the ease
to adapt fixed-interval algorithms to recursive (online) settings and viceversa. Another research direction are recursive algorithms for sparse
input estimation and blind estimation. To this end, the idea of recursive
Kalman smoothing may be convenient.
Another interesting extension to the weakly-sparse input signal prior
126
Sparse Input Estimation in State Space Models
are multi-dimensional inputs based on group-sparse features (cf. Section 7.1.1). For instance, blind system identification for an application
as e.g., in Chapter 6, where distinct hammer hits on a sensor (sparse
input) with varying spatial direction are used for model identification,
might become practical.
Moreover, distinct sampling rates for inputs and outputs open a widerange of options: from super-resolution to multiple-hypothesis-testing
type of problems (see also Section 8.1). In particular, combining weakly
sparse input estimation with continuous-time input estimation or (glue
factor-based) pulse population learning might lead to interesting novel
Bayesian methods. The former would enable recovery of impulses in
a continuous-time input signal that lie off the sampling grid. In the
latter case, our blind scheme might be adapted to localize pulses prior
to learning.
Appendix A
Algorithm Statements
A.1
Square-Root Smoothing
→
−
←
−
Algorithm A.1: Smoothing with C and S
→
−
Given a length K SSM and initial message −
µinit and ←
µinit .
→
−
1) Initialize the forward factor CX1 with the Cholesky factorization
−
→
−
→
→
of −
µinit and mX1 with minit .
2) Perform a forward sweep through the graph using, in a general
→
−
SSM, (I.3) and (I.4), followed by (I.9) (with CY = B chol (VU ))
→
−
and (I.10), and (III.1) (with CY = chol (VN )) and (III.2).
←
−
3) If an initial message is given, the backward factor SX1 is initialized
analogously to 1). Otherwise initialize to 0.
←
−
4) Perform a backward sweep through the graph using messages SXk
←
− −
mXk . Tabulated rules are applied in a dual fashion to 2).
and SXk ←
5) Compute the marginal covariance factor and mean by using (II.1)
→
−
←
−
and (II.2) with CXk and SXk . Eventually, obtain VX by squaring
the covariance factor.
A.2
Continuous-Time Posterior Computation
Given a continuous-time SSM
dX(t) = AX(t)dt + bU (t)dt
127
128
Algorithm Statements
Y (t) = cX(t) + N (t)
with observations yk = Y (tk ) at discrete moments tk for k ∈ [1, K], the
following algorithm leads to an estimate of the posterior probability for
X(t) at any tk ≤ t ≤ tk+1 .
Algorithm A.2: Continuous-Time Smoothing
1) Initialization of the forward message analogously to step 1) in Algorithm 4.1.
−
→
→
2) Perform forward message passing with VX(t) and −
m
X(t) according
to [10, (I.1) and (I.2)]. Note that messages need only be computed
at observations tk .
3) Initialization of the backward message as in step 3) of Algorithm 4.1.
4) Backward recursion with messages W̃X(t) and W̃X(t) µX(t) computed at tk for all k ∈ Z using (IV.1) and (IV.2) with eA(tk+1 −tk )
instead of A and updates (IV.7) and (IV.8).
5) Computation of posterior probability of X(t) for any tk as in step
5) in Algorithm 4.1 and at any tk ≤ t ≤ tk+1 by computing first
−
→
VX(t) with [10, (I.2)] and then with At , eA(tk −t) and Ãt ,
eA(t−tk ) :
−
→
−
→
−
→
VX(t) = VX(t) − VX(t) ATt W̃X(tk ) At VX(t) ,
−
→
→
mX(t) = Ãt −
m
X(tk ) − VX(t) At W̃X(tk ) µX(tk )
Z t−tk
2
= Ãt mX(tk ) + σU
eAτ bbT eA(τ −t+tk ) dτ W̃X(tk ) µX(tk )
0
A.3
System Identification in Chapter 6
Algorithm A.3: Model Identification using EM
1) Given the order of the sensor model and parameters σu2 and σn2 ,
initialize A[0] and c[0] with AR form SSM generated according to
physical model in Appendix F. Fix b.
A.3 System Identification in Chapter 6
129
2) At iteration j, perform dual-precision-based forward-backward message passing (see Section 4.3.2) in the factor graph of the SSM with
parameters fixed to the current estimates A[j] and c[j] to obtain
the Gaussian messages VXk and mXk for the posterior probability
over Xk . In addition, compute the likelihood according to (see Figure 4.1 for notation)
L[j] , p(y1 , . . . , yn |A[j] , c[j] )
K
→ )2
−
→
1X
(yk − c[j] −
m
Xk
2
log 2π(σN
+ c[j] VXk c[j],T ) +
.
=−
−
→
2
2
[j]
σN + c VXk c[j],T
k=1
3) Compute the EM messages according to Section 4.3.3 and [16]. It
turns out that A[0] and c[0] can be handled independently (their
EM messages factorize) and both can be expressed as Gaussian
messages. Specifically, let
−
←− −
T←
Wθ θk +2θT Wθ ←
mθ
ηθk (θ) ∝ e−θ
,
be a Gaussian EM-message, where θk represents either the parameters of A or of c at time step k in the factor graph. The specific
←−
←− −
parameters Wak and Wak ←
mak for parameter A are computed as
in [16, (III.7) and(III.8)]
←−
Wak = VXk + mXk mTXk ,
(A.1)
←− ←
− =V
0
W m
+ m mT 0
,
(A.2)
ak
ak
Xk
Xk [Xk+1 ]1
[Xk+1 ]1
0
0
where [Xk+1
]1 is the first entry of the state variable Xk+1
. Then,
←−
←− ←
−
Wck and Wck mck , the parameters for c, are evaluated according
to [16, (III.1) and (III.2)]
←−
Wck = VXk + mXk mTXk ,
(A.3)
←− ←
− =y m .
W m
(A.4)
ck
ck
k
Xk
4) A new estimate c[k+1] is computed from all EM messages ηck (c)
by
!−1 K
K
X ←−
X
←−
−
[j+1]
c
=
Wck
Wck ←
m
(A.5)
ck
k=1
k=1
|
{z
,Wc
}|
{z
,Wc mc
}
130
Algorithm Statements
and likewise for A[j+1] .
5) If the termination criteria (i.e., maximum number of iterations)
is met or the estimates converged (i.e., sufficiently small relative
change in the log-likelihood), complete the algorithm. Otherwise,
repeat steps 2)-4).
A.4
Sparse Input Estimation
Under the assumptions stated in Section 8.1, the following algorithm
estimates a compressible input to an SSM.
Algorithm A.4: Sparse Input Estimation
−
→
→ ← 0, where V
1) Initialize the forward message VX1 ← Vinit , −
m
X1
init
2
2
.
= mink γk σN
is the solution of a DARE equation with σU
2) Recursively passing through k ∈ [1, . . . , L − 1] update messages and
intermediate results:
i)
−
→
−
→
2
0
← AVXk AT + σN
VXk+1
, BBT
−
→ 0 ← A−
→ .
m
m
Xk
Xk+1
ii)
1
,
Gk+1 ← −
→
2
0
cVXk+1
cT + σN
−
→
0
Fk+1 ← I − VXk+1
cT Gk+1 c ,
−
→
−
→
0
VXk+1 ← Fk+1 VXk+1
,
−
→
−
→
−
→ 0 −G
0
m
cT yk+1 .
Xk+1 ← Fk+1 mXk+1
k+1 VXk+1
3) Initialize the auxiliary messages
1 T
c c,
σ2
1
← 2 cT .
σ
W̃XK ←
W̃XK µXK
A.4 Sparse Input Estimation
131
4) Perform a backward (in time) sweep through k ∈ [K − 1, . . . , 1]
using intermediate computation results from 2):
i)
W̃Xk0 ← AT W̃Xk+1 A,
W̃Xk0 µXk0 ← AT W̃Xk+1 µXk+1 .
ii)
W̃Xk ← FkT W̃Xk0 Fk + cT Gk c,
→ 0 −y
W̃Xk µXk ← FkT W̃Xk0 µXk0 − cT Gk c−
m
k
Xk
iii) Estimate the input:
VUk ← σ 2 γk − (σ 2 γk )2 bT W̃Xk b
mUk ← −σ 2 γk bT W̃Xk µ̃Xk
iv) Based on an update from Table 7.1, compute a new value for
γk . Note that the previous step may be adapted to
0
W̃Uk ← bT W̃Xk+1
b,
0
0
W̃Uk µUk ← bT W̃Xk+1
µXk+1
.
In particular if the EM update is used with weight function
f (γ):
2
γk ← f 0 mU
+ VUk .
k
Appendix B
Proofs
Lemma B.1: Scaled Gaussian Max/Int Lemma
Consider the function f (x, y) represented by Figure B.1 with variables
x ∈ Rn and y ∈ Rm , and where VU and VN are invertible matrices.
Then first it holds that
Z
f (x, y) dx = |WX |
−1/2
max f (x, y).
x
(B.1)
Lemma B.2: Scaled Gaussian Max/Int Lemma with W̃
In addition to the conditions stated in Lemma B.1, let q(x, y) be the
quadratic function defined by
q(x, y) = log
f (x, y)
.
f (mX , mY )
N (mX , VU )
X
Θ
N (mY , VN )
Figure B.1: Factor graph used to define a Gaussian Max/Int Lemma
with scaling.
133
134
Proofs
Then it follows
Z
1/2
f (x, y) dx = log W̃Y + max q(x, y).
(B.2)
x
Note that this lemma holds for arbitrary dimensions n and m.
Proof of Lemma B.1 and Lemma B.2: First we recognize that for
all x = y it holds
f (x, y) = N (mX , VX )N (mY , VY )
= |VU |
−1/2
−1/2 −q(x,y)
|VN |
e
(B.3)
where q(x, y) is a quadratic function. Now recall the Gaussian Max/Int
Theorem and apply (B.3) to [47, Equation (138)]
Z
−1/2
f (x, y)dx = |VU |
−1/2
−1/2 − minx q(x,y)
|VN |
e
Z
T
e−x
−1/2
WX x
dx
−1/2
= |VU |
|VN |
e− minx q(x,y) |WX |
(B.4)
−1/2
−1/2 − minx q(x,y)
= VU ΘT VN−1 Θ + I
|VN |
e
−1
−1/2
−1/2
|VN |
e− minx q(x,y)
= VN ΘVU ΘT + I
−1/2 − minx q(x,y)
= ΘVU ΘT + VN e
,
(B.5)
where we first used the identity (from message passing)
←−
−→
WX = WX + WX
= ΘT VN−1 Θ + VU−1 .
and subsequently applied determinant identities. Identity (B.1) follows
from (B.4) and (B.3).
Then Identifying in (B.3) the term
−
→
←
−
ΘVU ΘT + VN = VY + VY
| {z }
W̃Y−1
shows (B.2).
B.1 Proofs for Chapter 3
B.1
135
Proofs for Chapter 3
Proof of Proposition 3.1: The proof consists of two steps: obtaining
the Jacobi matrix of an SSM reparametrization with respect to θ and
then expressing reparametrizations of (2.10) as matrix similarity transform. Given that rate matrices in two parameterizations are similar, it
follows that the eigenvalues are the same (see e.g. [34]) and therefore
that the local rate of convergence ρθ̂ is invariant to the chosen SSM
representation.
Assume that T ∈ Rn×n is a transformation matrix for state x ∈ Rn in
a time-invariant SSM with A, B, and C. Using vectorization properties
(cf. e.g. [34]), in the transformed SSM x0 , Tx the parameter vector θ
is linearly related to the transformed parameter vector θ 0 as well:
T
T
I
vec (B) = d
TT
vec (C)
|
{z
} |
vec (A)
θ
T−1
⊗
vec (A0 )
⊗ T−1
vec (B0 ),
⊗ Im
vec (C0 )
{z
}
{z
}|
θ0
,Π
where d and m are the column dimension of B and the row dimension
of C, respectively. The local convergence rate is given in (2.10) and
based on the parametrization θ0 . Applying the chain rule and using the
dθ
Jacobian dθ
0 = Π, we can establish similarity of the rate matrices for
different parameterizations:
∇M(θ̂ 0 )
=I−
∂ 2 Q(θ0 |θˆ0 ) ∂θ0,2 !−1
θ 0 =θˆ0
d2 `(θˆ0 )
dθ0 2
!−1
∂ 2 Q(θ|θ̂) d2 `(θ̂)
=I− Π
Π
ΠT
Π
2
∂θ
dθ2
θ=θ̂
!−1
2
2
d
`(
θ̂)
∂
Q(θ|
θ̂)
Π
= Π−1 I −
∂θ2 dθ2
T
θ=θ̂
=Π
−1
∇M(θ̂)Π.
136
B.2
Proofs
Proofs for Chapter 4
Proof of Proposition 4.1: From (IV.7) and (IV.1), the aggregated
one time-step update for W̃Xk is
−
→
−
→
W̃Xk−1 = AT (I − cT GcVX )W̃Xk (I − cT GcVX )T A + AT cT Gk cA. (B.6)
We recognize that if W̃Xk converges to a steady-state matrix, i.e.,
W̃Xk → W̃X∞ ,
then (B.6) is a discrete Lyapunov equation with matrices Alp and Qlp .
According to [38], W̃Xk convergences if Alp is asymptotically stable, i.e.,
if all its eigenvalues lie inside the unit circle. In fact, this property can
be shown as follows
−
→
det(Alp − λI) = det(AT (I − cT G∞ cVX ) − λI)
−
→
= det((I − cT G∞ cVX )T A − λI)
−
→
= det(A(I − VX cT G∞ c) − λI)
| {z }
,K
and since the system is observable, the matrix K, which is recognized
as the Kalman gain of the system, stabilizes the closed loop matrix [38].
Thus, the matrix A(I − Kc) has all poles inside the unit circle (irrespective of A being asymptotically stable or not).
Proof of (4.9): The proof uses the first part of the proof in [16, Appendix E]. Starting from [16, Equation (163)], the following steps show
the desired equation.
Using first (IV.1) and (4.2) we get
V(Y X)
−
→
−
→
= V(Y X) − V(Y X)
I
0
!
W̃Y I
−
→
0 V(Y X) .
Then inserting expression [16, Equation (163)] and simplifying, we iden
tify the lower left corner of V(Y X) as VX 0 XT .
k−1
k
B.3 Proofs for Chapter 5
137
N (0, V∆ )
−
→
µX(t)
X(t)
e
←
−
µX(t0 )
+
A∆
X(t0 )
Figure B.2: Factor graph for interpolation with t0 = t + ∆ > t.
B.3
Proofs for Chapter 5
Proof of Theorem 5.1: From (5.7), we have
2 T
û(t0 ) = σU
b W̃X(t0 ) µX(t0 ) .
(B.7)
Now consider Figure B.2, which shows a factor graph of the joint probability density of the relevant variables (cf. [10]). Using Tables 2 and 3
of [47] (specifically, equations (II.9), (II.10), (II.12), (III.2), and (III.8)
from these tables), we have, first,
−
→ 0 = eA∆ −
→
m
m
X(t )
X(t) ,
(B.8)
−
− 0 and thus
−A∆ ←
then ←
m
m
X(t) = e
X(t )
←
− 0 = eA∆ ←
−
m
m
X(t )
X(t) ,
(B.9)
T
and finally W̃X(t) = eA ∆ W̃X(t0 ) eA∆ and thus
T
W̃X(t0 ) = e−A ∆ W̃X(t) e−A∆ .
(B.10)
Inserting these expressions into (B.7) yields (5.8).
Proof of Theorem 5.2: We have to show that
ˆ(tk , ∆) = lim ũ
ˆ(tk + ∆, ∆)
lim ũ
∆→0
∆→0
(B.11)
if cT b = 0. The relevant part of the factor graph for the left-hand side of
(B.11) is shown in Figure B.3 (top), and the relevant part of the factor
138
Proofs
2
N (0, σU
/∆)
2
N (0, σU
/∆)
Ũ (tk , ∆)
Ũ (tk , ∆)
b∆
X̃(tk )
b∆
=
+
X(tk )
X̃(tk )
cT
=
+
X(tk )
cT
Y (tk )
Y (tk )
(b)
(a)
2
N (0, σU
/∆)
Ũ (tk + ∆, ∆)
b∆
X̃(tk )
=
eA∆
+
X(tk + ∆)
cT
Y (tk )
(c)
Figure B.3: Factor graph segments used for the proof of Theorem 5.2.
B.3 Proofs for Chapter 5
139
graph for the right-hand side of (B.11) is shown in Figure B.3 (bottom).
The former represents the equations
X(tk ) = X̃(tk ) + b∆Ũ (tk , ∆)
(B.12)
and
Y (tk ) = cT X(tk )
T
(B.13)
T
= c X̃(tk ) + c b∆Ũ (tk , ∆).
(B.14)
If cT b = 0, then (B.14) reduces to Y (tk ) = cT X̃(tk ), in which case
Figure B.3 (top) is equivalent to Figure B.3 (middle). But for ∆ → 0,
Figure B.3 (bottom) also turns smoothly into Figure B.3 (middle). Proof of Theorem 5.3: We now prove a more general result:
Let Y (c) (ω) ≡ G(ω)N (ω)U (ω) and
Y (ω) =
∞
X
k=−∞
Y (c) (ω + k
2π
).
T
We wish to construct a linear estimator H(ω) that minimizes the expected squared error over all frequencies
h
i
2
E |H(ω)Y (ω) − N (ω)U (ω)| .
Equivalently by the orthogonality principle, for all ω ∈ R,
i
h
!
0 = E Y (ω) (H(ω)Y (ω) − N (ω)U (ω))
i
h
h
i
2
= H(ω)E |Y (ω)| − E Y (ω)N (ω)U (ω)
!
2
X
2π
2π 2 2
= H(ω)
σU N (ω + k )G(ω + k ) + σN
T
T
k∈Z
−
2
σU
2
|N (ω)| G(ω),
where we used
the idefinition of Y (ω) and Y (c) (ω) and the white noise
h
2
2
property E |U (ω)| ≡ σU
. Equation (5.10) now follows from the last
expression and (5.9) is a consequence of the Poisson summation formula
applied to H(ω)Y (ω).
140
Proofs
2
N (0, σU
)
N (0, VX1 )
2
N (0, σU
)
2
N (0, σU
)
N (0, VU )
X1
Θ
···
X2
XK
N (mY , VN )
2
)
N (y1 , σN
2
)
N (y2 , σN
2
)
N (yK , σN
Figure B.4: General factor that describes a SSM realization of the
variational statistical model used for system identification.
The factor graph is used in the proof of Theorem 6.1.
B.4
Proofs for Chapter 6
Proof of Theorem 6.1: First observe from Figure B.4 that any SSM
corresponds to the factor graph from Lemma B.2. After applying the
lemma, the (scaled) log-likelihood L(A, b, c) , −2 log p(y, u|A, b, c) is
seen to be
L(A, b, c) =
min
x1 ,...,xK+1
subject to
f (A, b, c) + g(x, A, b, c)
(B.15)
state sequence fulfills (2.1)
2
2
f (A, b, c) , log |θVX1 θT + σU
ΘΘT + σN
IK |
g(x, A, b, c) ,
K
K
X
1
1 X
2
2
(y
−
cx
)
+
kxk+1 − Axk − buk k
k
k
2
2 kbk2
σN
σU
k=1
k=1
where Θ ∈ RK×K is the Toeplitz matrix constructed from the impulse
response of the SSM i.e.,
(
cA(i−j−1) b if i − j > 0
[Θ]i,j =
(B.16)
0
otherwise,
B.4 Proofs for Chapter 6
141
and θ ∈ RK is
[θ]i = cAi ,
(B.17)
and VX1 is the initial covariance, while the initial mean is assumed to
be 0.
We first consider the second term g(x, A, b, c). With the substitution
ek ,
1 T
b (xk+1 − Axk − buk ) ,
kbk
for all k ∈ [1, K] the minimization (B.15), under consideration of the
constraints (2.1) on the state, becomes
g(x, A, b, c)
K
K
1 X 2
1 X
2
(y
−
[Θ(u
+
e)]
)
+
ek
k
k
2
2
σN
σU
= min
e1 ,...,eK
k=1
1
= 2 min
σU e
k=1
2
λ ky − Θu + Θek + kek2 ,
(B.18)
where signals uk and ek were stacked to obtain vectors. Note that e
can be interpreted as an input-side innovations vector. The minimization (B.18) is a standard regularized least-squares problem (e.g., [39])
and with the addition of the matrix inversion lemma, we obtain
2
T
2
σU
g(A, b, c) = λ ky − Θuk − (y − Θu) Θ λI + ΘT Θ
−1
T
= (y − Θu) λI + ΘT Θ
(y − Θu) .
−1
ΘT (y − Θu)
For long sequences, i.e., K 1, the eigenvalues and eigenvectors of the
Toeplitz matrix Θ converge to the DFT values of the impulse response
and the DFT matrix columns [34], i.e.,
Θ ≈ DSDH ,
(B.19)
where S is a diagonal matrix
Si,i = S[i],
and D is the K × K DFT matrix. Defining the DFT of the input signal
U [k] and output signal Y [k] and S2 = SSH the log-likelihood converges
142
Proofs
to
−1
λI + DS2 DH
y − DSDH u
H
−1
= DH y − SDH u
λI + S2
DH y − SDH u
2
g(x, A, b, c) ≈ y − DSDH u
σU
=
H
K
2
X
|Y [k] − S[k]U [k]|
k=1
λ + |S[k]|2
.
(B.20)
For the first term f (A, b, c) we also invoke (B.19)
2 2
2
f (A, b, c) ≈ log |sVX1 sH + σU
S + σN
IK |
−1
2 2
2
= log |VX1 | + log |σU
S + σN
IK | + log |VX
1
2
2
+ (S−1 s)H (S−1 s)/(σU
+ σN
)|
≈
K
X
2
2
log(σU
|Sk,k |2 + σN
) + const.
k=1
where we defined s = DH θ, the d DFTs of the impulse responses from
either one of the states to the output and used the matrix inversion
lemma in the second step. In the last step we consider S−1 s approximately independent of A, b, c.
The final approximation is thus
L(A, b, c) ≈
K
X
2
2
log(σU
(|Sk,k |2 + λ)) +
k=1
1 |Y [k] − S[k]U [k]|
2
σU
λ + |S[k]|2
B.5
Proofs for Chapter 7
Remark B.1
In all the following proofs, we consider d sparse priors on X ∈ Rd and a
general Gaussian likelihood p(y|x). We also define
Z
d
Y
L(ξ) , log p(y|x)
p(xk |ξk2 )dx
k=1
Z
`(γ) , log
p(y|x)
d
Y
k=1
p(xk |γk )dx.
B.5 Proofs for Chapter 7
143
Proof of (V.1), (V.2) and (V.3): To derive the EM update of ξ [t]
we use the max box in Figure 7.1 as pAR -box, i.e., define X as hidden
variable. Other choices are also possible and have been explored. The
advantage of this choice of pAR when there is more than one prior, the
joint EM message factorizes and each ξ [t] may be optimized independently in the M-step.
Let η(ξ [t] ) be the EM message on ξ [t] , which follows from
p
x2
2
η(ξ) = EpAR − log 2πξ − 2
2ξ
2
m
+
VX
1
.
= − log (2πξ 2 ) − X 2
2
2ξ
(B.21)
(B.22)
When the weights f (ξ) ≡ 1, a new estimate ξ [t+1] is obatined by maximizing η(ξ)
ξ [t+1] ← argmax − log
ξ
=
q
p
m2 + V
X
2πξ 2 − X 2
2ξ
m2X + VX ,
which is (V.1). The alternative update (V.2) follows by applying (4.2)
and (4.3) from Table 4.1.
For a general f (ξ), the M-step is formally
ξ [t+1] ← argmax − log
ξ
p
m2 + VX
2πξ 2 − X 2
+ log f (ξ).
2ξ
Recall (7.2) and then take the derivative with respect to ξ, which results
in
dg ? (α) = m2X + VX .
dα α=ξ−2
and by properties of convex conjugates (see, e.g., [11, Section 3.3.2]),
the derivative of the conjugate g ? (α) is the inverse map of the
√ derivative
of g(z) and thus, recalling the definition of g(z) = − log p( z) in (7.1),
144
Proofs
the update is
ξ
[t+1]
←
dg(z)
dz
−1/2 z=m2 +VX
X
−1/2 1 d log p(x)
= −
x
dx
√
x=
.
m2X +VX
Proof of (V.4) and (V.5): MacKay updates are based on a gradient
step on γ [49]. By looking at (2.11) and that the EM update factorizes,
it is clear that the gradient for γ also decays into d scalar problems. Let
us thus pick one γ and use (B.22) and (2.11):
d`(γ)
dη(γ)
=
dγ
dγ
1 1 m2X + VX
=−
−
.
2 γ
γ2
(B.23)
Now with the current estimate γ [t] , we set the gradient to 0 and substi[t]
tute V˜X = VX /γ [t] :
[t]
γ 1 − V˜X
= m2X
from which (V.4) follows. Applying the definitions of W̃X and W̃X µX
to the righthand side of (V.4) yields (V.5).
Proof of (V.6) : Using scale factors [59], the marginal likelihood
evaluated on edge Xn corresponds to
s
2
(W̃ µ )
→ ←
−
−
W̃Xn − X2nW̃XXn
n
p(y|ξ\i , ξn ) = βXn βXn
e
2π
2
q
(W̃Xn µXn )
−
2W̃X
n
∝ W̃Xn e
,
(B.24)
where the second equation is proportional with respect to ξn . Making the
− ),
dependence on ξn explicit (W̃Xn (ξn ) and W̃Xn µXn (ξn ) = W̃Xn (ξn )←
m
Xn
B.5 Proofs for Chapter 7
145
taking the derivative with respect to ξn2 and considering stationary points
of the (scaled) log-marginal probability, i.e., ξn that fulfill
d
log p(y|ξ\i , ξn )
dξn2
d
−2 W̃ (ξ )
0 = 2 log W̃Xn (ξn ) − ←
m
Xn Xn n
dξn
0=
W̃X0 n (ξn )
−2 W̃ 0 (ξ )
=←
m
Xn Xn n
W̃Xn (ξn )
1
−2
=←
m
Xn
W̃Xn (ξn )
−
−2 − ←
ξn2 = ←
m
VXn
Xn
q
−
−2 − ←
γn = ←
m
VXn
Xn
where W̃X0 n (ξn ) denotes the derivative with respect to ξn2 and, finally,
using W̃X0 n (ξn ) > 0. It can be seen that the derivative of the log-marginal
−
−2 − ←
m
VXn , 0) and nonprobability is always negative for ξn2 > min(←
Xn
negative otherwise.
Proof of (7.22): We first note that for γn = 0, trivially, W̃Un =
←
−
←
− ←
− . Next recalling the Eigen decomposiVUn and W̃Un µUn = VU−1
m
Un
n
tion (7.16) and applying it to (7.20) we obtain
←
−
−T DΛDT −2 ←
−
ςn = − Tr VU−1
+←
m
m
Un
Un
n
d
X
1
m2
=−
− 2l
λl
λl
l=1
d
X
d
m2l
=−
log(θn + λl ) +
dθn
θn + λl θn =0
l=1
d
= −2
log p(y|γ\i , θn )
.
dθn
θn =0
(B.25)
Now if γ̂n2 = 0 local optimality requires that the derivative at zero is
non-positive from which (7.22) follows.
Appendix C
Spline Prior
When prior knowledge on real signals is limited to the fact that the
signals and some of its derivatives are continuous with bounded energy.
In this case, signals may be modeled with a continuous-time Gaussian
stochastic process X (n) (t) with n ∈ N arbitrary that is generated by
2
as
(scaled) n-fold integration of white Gaussian noise with variance σW
depicted in Figure C.1 (with a parameter T that will be explained below).
This process may also be expressed with a continuous-time SSM
dX = A(c) X + bdW,
X
(n)
X(0) = X0 ,
(C.1)
(t) = cX,
(C.2)
with parameters
(1
(c)
Ai,j
=
T
0,
, if i = j − 1
,
otherwise
√
b = [1/ T , 0, . . . , 0]T , c = [0, . . . , 0, 1].
The initial state is important when we want the prior to be able to model
signals with large offsets.
The parameter T is introduced to make the prior behave the same across
different time-scales. To explain this, consider two signals with different
time scales T1 T2 both modeled with the spline prior. The scaling
effectuates that both random signals will have the same average energy
on their respective scale. Put it differently, at the fine resolution of T1 ,
147
148
Spline Prior
1
√
T
Z
X1 (t) 1
T
X2 (t)
Z
1
T
...
Z
Xn (t)
WGN
Figure C.1: n-fold integration of white noise.
realizations of the random process will look the same as realizations of
the other signal at T2 . This is necessary to ensure consistent characteristics of the prior throughout different operating regimes (e.g., sampling
frequencies). For example, a large sensor will display resonances at low
frequencies and desired estimated signal’s frequencies will be low as well.
Nevertheless, the proposed spline prior remains scale free in the classical
sense as its distribution is homogenous with respect to time1 .
When X (n) (t) is eventually used together with a discrete-time system, a
reasonable choice of T is of course the sampling period and the equivalent
discrete-time SSM with time-steps T is
Xk+1 = AT Xk + Uk ,
(n)
Xk
iid
Uk ∼ N (0, VU ),
X0 = X0 ,
(C.3)
= [0, . . . , 0, 1]Xk ,
with
(
[AT ]i,j =
[VU ]i,j =
1
(i−j)! ,
if i ≥ j
0,
otherwise
,
2
σW
,
|i − j|!(i − j + 1)
which is readily obtained using [9, Section 2.4] and the nilpotent property
of A. It is apparent that the noise covariance VU is independent of T
and thus, the prior defined by (C.3) over Xn (t) indeed exhibits the same
“behaviour” at different time scales, i.e., different sampling rates.
C.1
Relation to Splines
The presented nth order spline prior is related to nth order polynomial
spline smoothing and spline interpolation [73]. A correspondence can
1 This means that mean and variance will be homogenous functions in the variable
t. A homogenous function obeys f (αt) = cαf (t).
C.1 Relation to Splines
149
be established by considering MMSE/MAP estimation of Xn (t) from a
set of observations y1 , . . . , yK at times Tmin ≤ t1 ≤ . . . ≤ tK ≤ Tmax
2
corrupted by Gaussian noise with variance σN
. Recall [10, Theorem 2],
then using a non-informative initial state (i.e., V → ∞), it follows that
the estimate, x̂n (t), for t ∈ [Tmin , Tmax ] minimizes
x̂(t) = argmin
x(t)
K
1 X
kyk − x(tk )k2 +
2
σN
k=1
√
2n−1
T
2
σW
Z
Tmax
Tmin
dn x(t)
dtn
2
dt.
(C.4)
(c)
Using Theorem (5.1) and the observation that in eA t monomials of
maximum order n appear, we can conclude that x̂(t) is an nth order
polynomial. In addition, a straightforward, but relevant, observation is
√ 2n−1 2 2
that λ , T
σN /σW trades data fit for smoothness and when λ → 0,
(C.4) reduces to an interpolation spline. Only the this quantity λ, the
ratio, are relevant for this prior.
Observe that our proposed scaling enforces more smoothness for larger
T (e.g., sampling time) by making λ larger.
A related version of objective (C.4) was first used for spline smoothing
in [73] . Similar connections between estimation in SSMs and spline
smoothing have been pointed out by various authors [18, 59].
Appendix D
Additional Material on
Dynamometer Filtering
D.1
Experimental Setup
The proposed methods are evaluated on experimental cutting data sets.
Raw dynamometer identification and machining data were recorded by
Daniel Spescha at the institute of Machine Tools and Manufacturing at
ETH Zurich. Relevant details on the experiment setups evaluated in this
thesis are given in Table D.1. The presented results are based on data
from two experiments. Standard equipment was used in all experiments.
The dynamometer sensor used to estimate the applied force and the
reference sensor were the same in both experiments, but mounted on
different machines. The mounting is standard usage and consisted in
stacking and fixing a metal plate, dynamometer sensor, reference sensor
and the work piece on top of each other. With both dynamometers forces
in x-direction, y-direction, and z-direction were recorded.
The measurements employed for identification were recorded by solicitation of the workpiece with an impulse hammer prior to the machining
process. During identification measurements the machine was turned
off and the system in resting conditions. Solicitation with the impulse
hammer was performed by the operator and the hammer also recorded
the applied force.
Experimental machining data was produced by processing a workpiece
at different cutting speeds and feed rates. Machining was made in x
direction and y direction. One machining test signal corresponds thus to
151
152
Additional Material on Dynamometer Filtering
Parameter
Experiment A
Experiment B
[EXPA]
[EXPB]
Sampling frequency [Hz]
20480
16384
Identification sets
3
4
Cutting sets
53
3
Signal length
8192
8192
Experimental Setup
Sensor
Kistler 9255B
Reference sensor
Machine
Kistler MiniDyn
Mikron VC1000
Fehlmann
Picomax 825
Table D.1: Specific details on the experiments performed at Inspire.
This data was used to evaluate the performance of the
present signal processing algorithms.
a complete cut through the workpiece with constant cutting parameters.
D.2
Measured Frequency Responses
In Figure D.1 the FRFs of the dynamometer sensor of setting [EXPA]
are shown. The shown FRF estimates Ĥ (xx) , Ĥ (xy) , and Ĥ (xz) are
estimated by the least-squares method (6.16) with parameters L = 3
and K = 8192.
The reference sensors’ frequency response is sufficiently flat up to frequencies of 1 kHz and for higher frequencies of interest, the frequency
response can be equalized reliably. The reference sensor does not display resonant modes below frequencies of 5 kHz.
D.2 Measured Frequency Responses
Ĥ (xx)
Ĥ (xy)
Ĥ (xz)
20
|Ĥ| [dB]
153
10
0
−10
∠Ĥ [deg]
90
0
−90
−180
0
500
1,000
1,500
2,000
2,500
3,000
3,500
f [Hz]
Figure D.1: Measured frequency responses at output channel x obtained in experiment A (see Table D.1).
Appendix E
Learnings from
Implementing Gaussian
Message Passing
Algorithms
Gaussian message passing algorithms with high-dimensional multivariate Gaussian messages may become numerically unstable. Once these
problems occur, it is usually difficult to locate the root cause of the
errors. In the following, we present a few error metrics that are useful “diagnostic” tools in these situtations. For approaches to overcome
sources of numerical errors, after they were identified, see Section 3. All
quantities used in the following are finite-precision approximations to
the exact quantities.
It was shown in [69] that, in the first order, a numerical implementation of a Kalman filter is exact (in the mean vector and the covariance
matrix), if numerical errors caused by finite-precision updates are symmetric. This type of errors can be avoided by carefully designing message
update computations such that they are symmetric (see, e.g., (4.6) in Al−
→ −
→
gorithm 4.1). Obviously, symmetrizing the covariances, i.e., (V+ VT )/2,
is
155
156Learnings from Implementing Gaussian Message Passing Algorithms
Backward Error of Steady-State Messages
When utilizing steady-state covariance matrices or precision matrices (cf. Section 6.2.2), the numerical precision of these matrices is important for the overall numerical stability of algorithms based on marginal probabilities (e.g., EM). The precision of the steady-state matrices can be assessed by estimating backward errors of the discrete-time algebraic Riccati equation. In words, backward errors quantify by how much the coefficients of a system of equations have to be modified such that the finite-precision solution solves the modified system exactly.
Assume that $\vec{V}_X$ is the finite-precision solution of the corresponding discrete-time algebraic Riccati equation in standard SSM notation. We then use the (Frobenius) norm of the relative backward error
\[
\frac{\Bigl\| A \bigl( \vec{V}_X - \vec{V}_X C^{\mathsf{T}} \bigl( C \vec{V}_X C^{\mathsf{T}} + V_n \bigr)^{-1} C \vec{V}_X \bigr) A^{\mathsf{T}} + B V_u B^{\mathsf{T}} - \vec{V}_X \Bigr\|_F}{\bigl\| \vec{V}_X \bigr\|_F}
\tag{E.1}
\]
as a metric to assess the loss in precision due to $\vec{V}_X$. Analogous reasoning is used for the precision matrix.
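A direct implementation of (E.1) is straightforward; the following NumPy sketch (our own illustration, with hypothetical helper names) computes the relative Frobenius backward error of a candidate solution Vx:

    import numpy as np

    def dare_backward_error(Vx, A, B, C, Vu, Vn):
        # Residual of the discrete-time algebraic Riccati equation at Vx.
        S = C @ Vx @ C.T + Vn
        R = (A @ (Vx - Vx @ C.T @ np.linalg.solve(S, C @ Vx)) @ A.T
             + B @ Vu @ B.T - Vx)
        # Relative backward error in the Frobenius norm, cf. (E.1).
        return np.linalg.norm(R, 'fro') / np.linalg.norm(Vx, 'fro')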
Sensitivity of Autoregressive Parameterization
The autoregressive parameterization of an SSM, where $A^{\mathsf{T}}$ is also known as a companion matrix [28], is a numerically sensitive representation of the system dynamics [70]. When we are only interested in marginalization on the factor graph and can assume that the coefficients of A are not affected by rounding errors, high sensitivities are of no concern. In parameter estimation settings, however, the sensitivity is highly relevant; among other effects, a high sensitivity can make the discrete-time algebraic Riccati equation ill-conditioned [32], and it can make recursive EM-based estimation of A unstable.
Let us quantify the sensitivity of an $n$th-order SSM in AR form with coefficients $a = [a_1, \dots, a_n]$ (the first row of $A$) in terms of the change $\Delta p_j$ of the pole $p_j$ (an eigenvalue of $A$) due to a perturbation $\delta a$ of the coefficients.
To first order, it can be shown that
\[
\Delta p_j = -\frac{1}{a'(p_j)}\,\delta a,
\tag{E.2}
\]
where $a'(x)$ is the first derivative of the polynomial
\[
a(x) \triangleq x^n - a_1 x^{n-1} - \dots - a_n.
\tag{E.3}
\]
We thus propose to measure the sensitivity as
\[
\max_{p \in \{p_1, \dots, p_n\}} \frac{1}{|a'(p)|}.
\tag{E.4}
\]
It was experimentally observed that large values of (E.4) are linked to stability problems.
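The metric (E.4) is cheap to evaluate numerically. The following sketch (our own, with a hypothetical function name) computes it directly from the AR coefficients:

    import numpy as np

    def ar_sensitivity(a):
        # Characteristic polynomial a(x) = x^n - a1 x^(n-1) - ... - an, cf. (E.3).
        coeffs = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
        poles = np.roots(coeffs)        # poles p_1, ..., p_n
        dcoeffs = np.polyder(coeffs)    # coefficients of a'(x)
        # Sensitivity metric (E.4): maximum over the poles of 1 / |a'(p)|.
        return max(1.0 / abs(np.polyval(dcoeffs, p)) for p in poles)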
Expectation-Maximization Updates
The log-likelihood is, of course, the most important diagnostic tool for EM algorithms and should be implemented first. Given an SSM with variables as in (2.3) in Section 2.1, the likelihood is readily obtained from forward message passing.
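As a sketch (our own; standard Kalman recursions with a scalar output assumed, and hypothetical function names), the log-likelihood can be accumulated from the innovations of the forward pass:

    import numpy as np

    def forward_log_likelihood(y, A, B, C, Vu, Vn, m0, V0):
        # m0: (n,1) initial mean; V0: (n,n) initial covariance; C: (1,n) row.
        m, V = m0.copy(), V0.copy()
        loglik = 0.0
        for yk in y:
            # Prediction.
            m = A @ m
            V = A @ V @ A.T + B @ Vu @ B.T
            # Innovation e_k and its variance s_k give p(y_k | y_1, ..., y_{k-1}).
            s = (C @ V @ C.T + Vn).item()
            e = (yk - C @ m).item()
            loglik += -0.5 * (np.log(2 * np.pi * s) + e * e / s)
            # Measurement update.
            G = (V @ C.T) / s
            m = m + G * e
            V = V - G @ (C @ V)
            V = (V + V.T) / 2   # keep finite-precision errors symmetric
        return loglik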
We have also found that consistency equations, which are backward errors in a wider sense, are well suited to detect numerical inaccuracies. For EM-based estimation of the state-transition matrix A, one possible such equation is constructed from the posterior means around a "+"-factor. For an SSM (2.3) with one-dimensional input, this is
\[
\sum_k \frac{\bigl\| m_{X_{k+1}} - m_{X_k'} - b\, m_{U_k} \bigr\|^2}{\bigl\| b\, m_{U_k} \bigr\|^2},
\]
which would be zero in exact arithmetic. In addition, empirical tests suggest that this relative error is of the same order as the actual relative error due to finite-precision arithmetic.
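This check is equally simple to compute from the posterior means; a minimal sketch (our own, with hypothetical variable names) follows:

    import numpy as np

    def plus_factor_consistency(mX_next, mX_prime, mU, b):
        # mX_next[k], mX_prime[k]: posterior means of X_{k+1} and X'_k;
        # mU[k]: posterior mean of the scalar input U_k; b: input vector.
        total = 0.0
        for mx1, mxp, mu in zip(mX_next, mX_prime, mU):
            r = mx1 - mxp - b * mu            # residual around the "+"-factor
            total += (r @ r) / ((b * mu) @ (b * mu))
        return total                          # zero in exact arithmetic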
Appendix F

Multi-Mass Resonator Model
In the following, we sketch the reasoning and observations behind our resonant system model. The implementation details are standard.
A simple example of a mechanical system with two masses is shown in Figure F.1. The coupling between the masses is modeled with a spring and a damping element. Let H(f) be the FRF from F2 to x2. The resulting magnitude and phase responses of H(f) are shown in Figure F.2. In this simple mechanical model, the parameters were chosen such that m1 may represent a typical machine and m2 the small, lightweight sensor on top of it. Specifically, the following assumptions lead to the demonstrated model:
1) the lower mass m1 is considerably larger than the top mass,
2) the spring constants are similar, and
3) damping is assumed to be much larger for m1 than for m2.
SSMs are generated from general N-mass dynamical systems by series concatenation of second-order SSMs, each representing one mass-spring-damper system. The first element in the series is the mass connected to the steady ground (m1 in Figure F.1). This model allows the construction of SSMs with a reasonable structure that serve as initial estimates for the system identification in Chapter 6.
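As an illustration of this construction, the following sketch (our own; the exact per-section parameterization used in the thesis is not reproduced here, so the mass-spring-damper section below is one plausible choice) builds an SSM by series concatenation of second-order sections:

    import numpy as np

    def mass_section(m, k, xi):
        # Continuous-time SSM of one mass-spring-damper: m x'' = -k x - xi x' + F.
        A = np.array([[0.0, 1.0], [-k / m, -xi / m]])
        B = np.array([[0.0], [1.0 / m]])
        C = np.array([[1.0, 0.0]])   # output: displacement x
        D = np.array([[0.0]])
        return A, B, C, D

    def series(s1, s2):
        # Series connection: the output of s1 drives the input of s2.
        A1, B1, C1, D1 = s1
        A2, B2, C2, D2 = s2
        n1, n2 = A1.shape[0], A2.shape[0]
        A = np.block([[A1, np.zeros((n1, n2))], [B2 @ C1, A2]])
        return A, np.vstack([B1, B2 @ D1]), np.hstack([D2 @ C1, C2]), D2 @ D1

    def multi_mass_ssm(masses, springs, dampers):
        # First element: the mass connected to the ground (m1 in Figure F.1).
        sections = [mass_section(m, k, x)
                    for m, k, x in zip(masses, springs, dampers)]
        ssm = sections[0]
        for s in sections[1:]:
            ssm = series(ssm, s)
        return ssm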
In addition, insights that hold for all realistic dynamometer sensor settings can be gained from the multi-mass models.
[Diagram: two stacked masses; the top mass m2 (displacement x2, force F2) is coupled through spring k2 and damper ξ2 to the lower mass m1 (displacement x1, force F1), which is connected to the ground through spring k1 and damper ξ1.]
Figure F.1: Mechanical model of simple multi-mass swinger example.
Observe that the FRF in Figure F.2 includes i) a large resonance (at f = 0.5) and ii) a resonance/anti-resonance pair (at f = 0.23); both are akin to what is seen in the measured FRFs in Chapter 6. In Figure F.2, the FRF of m2 alone (i.e., the isolated sensor) is shown as well. Comparing the two FRFs, it is evident that the coupling with m1 causes a shift in the resonance frequency of m2.
[Plot: magnitude |H(f)| [dB] and phase ∠H(f) [deg] versus f [Hz], for the full system and for the case with m1 fixed.]
Figure F.2: Magnitude and phase of the transfer function H(f) from input F2 to x2. The masses are m1 = 25 m2, the spring constants are k1 = 10 and k2 = 2, the first mass is damped with coefficient 0.5, and m2 is damped with coefficient 0.01. Forces measured at m2 are proportional to x2.
Bibliography
[1] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Trans. on Signal Processing, vol. 54, no. 11, pp. 4311–4322, 2006.
[2] J. Alihanka, K. Vaahtoranta, and I. Saarikivi, “A new method for
long-term monitoring of the ballistocardiogram, heart rate, and respiration,” Am J Physiol, vol. 240, no. 5, pp. R384–92, 1981.
[3] A. Amini, U. Kamilov, E. Bostan, and M. Unser, “Bayesian estimation for continuous-time sparse stochastic processes,” IEEE Trans.
on Signal Processing, vol. 61, no. 4, pp. 907–920, Feb 2013.
[4] S. D. Babacan, R. Molina, M. N. Do, and A. K. Katsaggelos,
“Bayesian blind deconvolution with general sparse image priors,”
in ECCV. Springer, 2012, pp. 341–355.
[5] S. D. Babacan, R. Molina, and A. K. Katsaggelos, “Bayesian compressive sensing using Laplace priors,” IEEE Trans. on Image Processing, vol. 19, no. 1, pp. 53–63, 2010.
[6] L. E. Baum, T. Petrie, G. Soules, and N. Weiss, “A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains,” The Annals of Mathematical Statistics, pp. 164–171, 1970.
[7] G. Bierman, “Factorization methods for discrete sequential estimation,” Mathematics in Science and Engineering, vol. 128, 1977.
[8] D. Biermann, R. Hense, and R. Surmann, “Korrektur gemessener Zerspankräfte beim Fräsen – Inverse Filterung von Kraftmessungen als Werkzeug beim Ermitteln von Zerspankraftparameter,” wt – Werkstattstechnik online, vol. 102, no. 11, pp. 789–794, 2012.
[9] L. Bolliger, “Digital estimation of continuous-time signals using factor graphs,” Ph.D. dissertation, ETH - Swiss Federal Institute of
Technology, 2012.
[10] L. Bolliger, H.-A. Loeliger, and C. Vogel, “LMMSE estimation and interpolation of continuous-time signals from discrete-time samples using factor graphs,” arXiv preprint arXiv:1301.4793, 2013.
[11] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2009.
[12] G. Casella and R. L. Berger, Statistical inference. Duxbury Pacific
Grove, CA, 2001, vol. 2.
[13] V. Cevher, “Learning with compressible priors,” in Advances in
Neural Information Processing Systems, 2009, pp. 261–269.
[14] R. Chartrand and W. Yin, “Iteratively reweighted algorithms for
compressive sensing,” in Proc. IEEE Int. Conf. Acoust., Speech,
Signal Process., 2008, pp. 3869–3872.
[15] I. Daubechies, R. DeVore, M. Fornasier, and C. S. Güntürk, “Iteratively reweighted least squares minimization for sparse recovery,”
Communications on Pure and Applied Mathematics, vol. 63, no. 1,
pp. 1–38, 2010.
[16] J. Dauwels, A. Eckford, S. Korl, and H.-A. Loeliger, “Expectation maximization as message passing – Part I: Principles and Gaussian messages,” arXiv preprint arXiv:0910.2832, 2009.
[17] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the
Royal Statistical Society. Series B (Methodological), pp. 1–38, 1977.
[18] J. Durbin and S. J. Koopman, Time series analysis by state space
methods. Oxford University Press, 2012, no. 38.
[19] S. Farahmand, G. B. Giannakis, and D. Angelosante, “Doubly robust smoothing of dynamical processes via outlier sparsity constraints,” IEEE Trans. on Signal Processing, vol. 59, no. 10, pp.
4529–4543, 2011.
[20] A. C. Faul and M. E. Tipping, “Fast marginal likelihood maximisation for sparse Bayesian models,” in Proc. of the Ninth International Workshop on Artificial Intelligence and Statistics, Key West, FL, Jan. 3–6, 2003.
[21] R. Fergus, B. Singh, A. Hertzmann, S. T. Roweis, and W. T. Freeman, “Removing camera shake from a single photograph,” in ACM
Trans. on Graphics (TOG), vol. 25, no. 3. ACM, 2006, pp. 787–794.
[22] C. Févotte and S. Godsill, “A Bayesian approach for blind separation of sparse sources,” IEEE Trans. on Audio, Speech, and Language Processing, vol. 14, no. 6, pp. 2174–2188, Nov. 2006.
[23] M. A. Figueiredo, “Adaptive sparseness for supervised learning,”
IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 25,
no. 9, pp. 1150–1159, Sept 2003.
[24] Z. Ghahramani and G. E. Hinton, “Parameter estimation for linear dynamical systems,” Technical Report CRG-TR-96-2, Dept. of Computer Science, University of Toronto, 1996.
[25] S. Gibson and B. Ninness, “Robust maximum-likelihood estimation
of multivariable dynamic systems,” Automatica, vol. 41, no. 10, pp.
1667–1682, 2005.
[26] S. Gillijns and B. De Moor, “Unbiased minimum-variance input
and state estimation for linear discrete-time systems,” Automatica,
vol. 43, no. 1, pp. 111–116, 2007.
[27] F. Girardin, D. Remond, and J.-F. Rigal, “High Frequency Correction of Dynamometer for Cutting Force Observation in Milling,”
Journal of Manufacturing Science and Engineering, vol. 132, 2010.
[28] G. H. Golub and C. F. Van Loan, Matrix Computations. Johns Hopkins University Press, 2012, vol. 3.
[29] G. Golub and U. von Matt, “Quadratically constrained least squares
and quadratic problems,” Numerische Mathematik, vol. 59, no. 1,
pp. 561–580, 1991.
[30] R. Gribonval, “Should penalized least squares regression be interpreted as maximum a posteriori estimation?” IEEE Trans. on Signal Processing, vol. 59, no. 5, pp. 2405–2410, 2011.
[31] R. Gribonval, V. Cevher, and M. E. Davies, “Compressible distributions for high-dimensional statistics,” IEEE Trans. on Information
Theory, vol. 58, no. 8, pp. 5016–5034, 2012.
[32] T. Gudmundsson, C. Kenney, and A. Laub, “Scaling of the discrete-time algebraic Riccati equation to enhance stability of the Schur solution method,” IEEE Trans. on Automatic Control, vol. 37, no. 4, pp. 513–518, Apr. 1992.
[33] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning. Springer, 2009, vol. 2, no. 1.
[34] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge University Press, 2012.
[35] D. R. Hunter and K. Lange, “A tutorial on MM algorithms,” The American Statistician, vol. 58, no. 1, pp. 30–37, 2004.
[36] S. Ji, D. Dunson, and L. Carin, “Multitask compressive sensing,”
IEEE Trans. on Signal Processing, vol. 57, no. 1, pp. 92–106, 2009.
[37] S. Ji, Y. Xue, and L. Carin, “Bayesian compressive sensing,” IEEE
Trans. on Signal Processing, vol. 56, no. 6, pp. 2346–2356, 2008.
[38] T. Kailath, B. Hassibi, and A. H. Sayed, Linear Estimation. Prentice-Hall, 2000.
[39] S. M. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory. Upper Saddle River, NJ, USA: Prentice-Hall, 1993.
[40] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger, “Factor graphs
and the sum-product algorithm,” IEEE Trans. on Information Theory, vol. 47, no. 2, pp. 498–519, 2001.
[41] A. Levin, Y. Weiss, F. Durand, and W. T. Freeman, “Understanding
and evaluating blind deconvolution algorithms,” in IEEE Conf. on
Computer Vision and Pattern Recognition, 2009, pp. 1964–1971.
[42] A. Lewandowski, C. Liu, S. Vander Wiel et al., “Parameter expansion and efficient inference,” Statistical Science, vol. 25, no. 4, pp.
533–544, 2010.
[43] L. Ljung, System Identification: Theory for the User. Pearson Education, 1998.
[44] H.-A. Loeliger, “An introduction to factor graphs,” IEEE Signal Processing Magazine, vol. 21, no. 1, pp. 28–41, Jan. 2004.
[45] ——, Signal and Information Processing: Modeling, Filtering, Learning. ISI, 2013.
[46] H.-A. Loeliger, L. Bolliger, G. Wilckens, and J. Biveroni, “Analog-to-digital conversion using unstable filters,” in Information Theory and Applications Workshop, 2011, pp. 1–4.
[47] H.-A. Loeliger, J. Dauwels, J. Hu, S. Korl, L. Ping, and F. Kschischang, “The factor graph approach to model-based signal processing,” Proc. of the IEEE, vol. 95, no. 6, pp. 1295–1322, June 2007.
[48] T. A. Louis, “Finding the observed information matrix when using the EM algorithm,” Journal of the Royal Statistical Society, Series B (Methodological), pp. 226–233, 1982.
[49] D. J. MacKay, “Bayesian methods for backpropagation networks,”
in Models of neural networks III. Springer, 1996, pp. 211–254.
[50] M. Magnevall, M. Lundblad, K. Ahlin, and G. Broman, “High Frequency Measurements of Cutting Forces in Milling by Inverse Filtering,” Machining Science and Technology, vol. 16, no. 4, pp. 487–500,
2012.
[51] G. McLachlan and T. Krishnan, The EM algorithm and extensions.
John Wiley & Sons, 2007, vol. 382.
[52] J. A. Palmer, D. P. Wipf, K. Kreutz-Delgado, and B. D. Rao, “Variational EM algorithms for non-Gaussian latent variable models,” in Advances in Neural Information Processing Systems 18. MIT Press, 2006, pp. 1059–1066.
[53] P. G. Park and T. Kailath, “New square-root smoothing algorithms,” IEEE Trans. on Automatic Control, vol. 41, no. 5, pp. 727–732, May 1996.
[54] S. S. Park and Y. Altintas, “Dynamic compensation of spindle integrated force sensors with Kalman filter,” Journal of Dynamic Systems, Measurement, and Control, vol. 126, no. 3, pp. 443–452, 2004.
[55] S. Park and Y. Altintas, “Dynamic Compensation of Spindle Integrated Force Sensors With Kalman Filter,” Journal of Dynamic Systems, Measurement, and Control, vol. 126, 2004.
[56] N. Pedersen, C. Manchón, M. Badiu, D. Shutin, and B. Fleury, “Sparse estimation using Bayesian hierarchical prior modeling for real and complex linear models,” Signal Processing, 2015.
[57] K. B. Petersen and M. S. Pedersen, “The matrix cookbook,” Tech.
Rep., November 2012.
[58] C. Rasmussen and C. Williams, Gaussian Processes for Machine
Learning. MIT Press, 2006.
[59] C. Reller, “State-space methods in statistical signal processing,”
Ph.D. dissertation, ETH - Swiss Federal Institute of Technology,
2013.
[60] M. Safonov and R. Chiang, “A Schur method for balanced-truncation model reduction,” IEEE Trans. on Automatic Control, vol. 34, no. 7, pp. 729–733, 1989.
[61] R. Salakhutdinov, S. Roweis, and Z. Ghahramani, “Optimization with EM and expectation-conjugate-gradient,” in ICML, 2003, pp. 672–679.
[62] M. Seeger and D. P. Wipf, “Variational Bayesian inference techniques,” IEEE Signal Processing Magazine, vol. 27, no. 6, pp. 81–91, 2010.
[63] R. H. Shumway and D. S. Stoffer, “An approach to time series smoothing and forecasting using the EM algorithm,” Journal of Time Series Analysis, vol. 3, no. 4, pp. 253–264, 1982.
[64] D. Shutin, T. Buchgraber, S. R. Kulkarni, and H. V. Poor, “Fast variational sparse Bayesian learning with automatic relevance determination for superimposed signals,” IEEE Trans. on Signal Processing, vol. 59, no. 12, pp. 6257–6261, 2011.
[65] D. Shutin and B. H. Fleury, “Sparse variational Bayesian SAGE algorithm with application to the estimation of multipath wireless channels,” IEEE Trans. on Signal Processing, vol. 59, no. 8, pp. 3609–3623, 2011.
[66] R. Tibshirani, “Regression shrinkage and selection via the lasso,”
Journal of the Royal Statistical Society, Series B, vol. 58, pp. 267–
288, 1994.
[67] M. E. Tipping, “The relevance vector machine,” in Advances in
Neural Information Processing Systems, vol. 12. MIT Press, 2000,
pp. 652–658.
[68] P. Van Overschee and B. De Moor, Subspace Identification for Linear Systems: Theory, Implementation, Applications. Kluwer Academic Publishers, 1996.
[69] M. Verhaegen and P. Van Dooren, “Numerical aspects of different Kalman filter implementations,” IEEE Trans. on Automatic Control, vol. 31, no. 10, pp. 907–917, 1986.
[70] M. Verhaegen and V. Verdult, Filtering and System Identification: A Least Squares Approach. Cambridge University Press, 2007.
[71] M. Vetterli, P. Marziliano, and T. Blu, “Sampling signals with finite
rate of innovation,” IEEE Trans. on Signal Processing, vol. 50, no. 6,
pp. 1417–1428, 2002.
[72] F. Wadehn, L. Bruderer, D. Waltisberg, T. Keresztfalvi, and H.-A. Loeliger, “Sparse-input detection algorithm with applications in electrocardiography and ballistocardiography,” in Proc. International Conf. on Bio-inspired Systems and Signal Processing, 2015.
[73] G. Wahba, Spline Models for Observational Data. SIAM, 1990, vol. 59.
[74] G. Wilckens, “A new perspective on analog-to-digital conversion of
continuous-time signals,” Ph.D. dissertation, ETH - Swiss Federal
Institute of Technology, 2013.
[75] A. Wills and B. Ninness, “On gradient-based search for multivariable system estimates,” IEEE Trans. on Automatic Control, vol. 53,
no. 1, pp. 298–306, 2008.
[76] D. Wipf and S. Nagarajan, “Iterative reweighted ℓ1 and ℓ2 methods for finding sparse solutions,” IEEE Journal of Selected Topics in Signal Processing, vol. 4, no. 2, pp. 317–329, 2010.
[77] D. P. Wipf, “Sparse estimation with structured dictionaries,” in Advances in Neural Information Processing Systems, 2011, pp. 2016–
2024.
[78] D. P. Wipf and S. S. Nagarajan, “A new view of automatic relevance
determination,” in Advances in neural information processing systems, 2008, pp. 1625–1632.
[79] D. P. Wipf, B. D. Rao, and S. Nagarajan, “Latent variable Bayesian models for promoting sparsity,” IEEE Trans. on Information Theory, vol. 57, no. 9, pp. 6236–6255, 2011.
[80] D. Wipf and B. Rao, “Sparse Bayesian learning for basis selection,” IEEE Trans. on Signal Processing, vol. 52, no. 8, pp. 2153–2164, Aug. 2004.
[81] C. J. Wu, “On the convergence properties of the EM algorithm,” The Annals of Statistics, pp. 95–103, 1983.
[82] T. T. Wu and K. Lange, “The MM alternative to EM,” Statistical Science, vol. 25, no. 4, pp. 492–505, 2010.
[83] L. Xu and M. I. Jordan, “On convergence properties of the EM algorithm for Gaussian mixtures,” Neural Computation, vol. 8, no. 1, pp. 129–151, 1996.
[84] M. Yuan and Y. Lin, “Model selection and estimation in regression
with grouped variables,” Journal of the Royal Statistical Society:
Series B (Statistical Methodology), vol. 68, no. 1, pp. 49–67, 2006.
[85] M. Zibulevsky and B. Pearlmutter, “Blind source separation by sparse decomposition in a signal dictionary,” Neural Computation, vol. 13, no. 4, pp. 863–882, 2001.
About the Author
Lukas Bruderer was born in Switzerland in 1985. He passed the Matura
in Trogen AR and received his Dipl. El.-Ing. (MSc ETH EEIT) degree
from ETH Zurich, Switzerland in 2009.
In 2007, he was first an exchange student (on an exchange-studies scholarship) and later a visiting scholar at Northwestern University, Chicago, IL, USA. During his studies he also worked as an intern for Huber+Suhner AG. From 2009 to 2011, he was a research assistant with the Integrated Systems Laboratory (IIS) at ETH Zurich. In 2010 he was also with Disney Research. Since 2011, he has been with the Signal and Information Processing Laboratory (ISI) at ETH Zurich.
Series in Signal and Information Processing
edited by Hans-Andrea Loeliger
Vol. 1: Hanspeter Schmid, Single-Amplifier Biquadratic MOSFET-C Filters. ISBN 3-89649-616-6
Vol. 2: Felix Lustenberger, On the Design of Analog VLSI Iterative Decoders. ISBN 3-89649-622-0
Vol. 3: Peter Theodor Wellig, Zerlegung von Langzeit-Elektromyogrammen zur Prävention von arbeitsbedingten Muskelschäden. ISBN 3-89649-623-9
Vol. 4: Thomas P. von Hoff, On the Convergence of Blind Source Separation and Deconvolution. ISBN 3-89649-624-7
Vol. 5: Markus Erne, Signal Adaptive Audio Coding using Wavelets and Rate Optimization. ISBN 3-89649-625-5
Vol. 6: Marcel Joho, A Systematic Approach to Adaptive Algorithms for Multichannel System Identification, Inverse Modeling, and Blind Identification. ISBN 3-89649-632-8
Vol. 7: Heinz Mathis, Nonlinear Functions for Blind Separation and Equalization. ISBN 3-89649-728-6
Vol. 8: Daniel Lippuner, Model-Based Step-Size Control for Adaptive Filters. ISBN 3-89649-755-3
Vol. 9: Ralf Kretzschmar, A Survey of Neural Network Classifiers for Local Wind Prediction. ISBN 3-89649-798-7
Vol. 10: Dieter M. Arnold, Computing Information Rates of Finite State Models with Application to Magnetic Recording. ISBN 3-89649-852-5
Vol. 11: Pascal O. Vontobel, Algebraic Coding for Iterative Decoding. ISBN 3-89649-865-7
Vol. 12: Qun Gao, Fingerprint Verification using Cellular Neural Networks. ISBN 3-89649-894-0
Vol. 13: Patrick P. Merkli, Message-Passing Algorithms and Analog Electronic Circuits. ISBN 3-89649-987-4
Vol. 14: Markus Hofbauer, Optimal Linear Separation and Deconvolution of Acoustical Convolutive Mixtures. ISBN 3-89649-996-3
Vol. 15: Sascha Korl, A Factor Graph Approach to Signal Modelling, System Identification and Filtering. ISBN 3-86628-032-7
Vol. 16: Matthias Frey, On Analog Decoders and Digitally Corrected Converters. ISBN 3-86628-074-2
Vol. 17: Justin Dauwels, On Graphical Models for Communications and Machine Learning: Algorithms, Bounds, and Analog Implementation. ISBN 3-86628-080-7
Vol. 18: Volker Maximillian Koch, A Factor Graph Approach to Model-Based Signal Separation. ISBN 3-86628-140-4
Vol. 19: Junli Hu, On Gaussian Approximations in Message Passing Algorithms with Application to Equalization. ISBN 3-86628-212-5
Vol. 20: Maja Ostojic, Multitree Search Decoding of Linear Codes. ISBN 3-86628-363-6
Vol. 21: Murti V.R.S. Devarakonda, Joint Matched Filtering, Decoding, and Timing Synchronization. ISBN 3-86628-417-9
Vol. 22: Lukas Bolliger, Digital Estimation of Continuous-Time Signals Using Factor Graphs. ISBN 3-86628-432-2
Vol. 23: Christoph Reller, State-Space Methods in Statistical Signal Processing: New Ideas and Applications. ISBN 3-86628-447-0
Vol. 24: Jonas Biveroni, On A/D Converters with Low-Precision Analog Circuits and Digital Post-Correction. ISBN 3-86628-452-7
Vol. 25: Georg Wilckens, A New Perspective on Analog-to-Digital Conversion of Continuous-Time Signals. ISBN 3-86628-469-1
Vol. 26: Jiun-Hung Yu, A Partial-Inverse Approach to Decoding Reed-Solomon Codes and Polynomial Remainder Codes. ISBN 3-86628-527-2
Hartung-Gorre Verlag Konstanz http://www.hartung-gorre.de