JMLR: Workshop and Conference Proceedings 1: xxx–xxx, ACML 2011
A general linear non-Gaussian state-space model:
Identifiability, identification, and applications
Kun Zhang
[email protected]
Max Planck Institute for Intelligent Systems
Spemannstr. 38, 72076 Tübingen, Germany
Aapo Hyvärinen
[email protected]
Dept of Computer Science, HIIT, and Dept of Mathematics and Statistics
University of Helsinki, Finland
Editor: Chun-Nan Hsu and Wee Sun Lee
Abstract
State-space modeling provides a powerful tool for system identification and prediction. In
linear state-space models the data are usually assumed to be Gaussian and the models have
certain structural constraints such that they are identifiable. In this paper we propose a
non-Gaussian state-space model which does not have such constraints. We prove that this
model is fully identifiable. We then propose an efficient two-step method for parameter
estimation: one first extracts the subspace of the latent processes based on the temporal
information of the data, and then performs multichannel blind deconvolution, making use
of both the temporal information and non-Gaussianity. We conduct a series of simulations
to illustrate the performance of the proposed method. Finally, we apply the proposed
model and parameter estimation method to real data, including major world stock indices
and magnetoencephalography (MEG) recordings. The experimental results are encouraging
and show the practical usefulness of the proposed model and method.
Keywords: State-space model, Non-Gaussianity, Identifiability, Causality
1. Introduction
Suppose that we have multiple parallel time series which are observable. Usually, the source
series of interest are not directly measurable, but hidden in them. In addition, the mixing
system generating the observable series from the sources is unknown. For simplicity, we
often assume that the mixing system is linear. The goal is to recover the latent sources of
interest, as well as to model their dynamics. This problem is referred to as blind source
separation (BSS; see, e.g., the books by Cichocki and Amari (2003) and by Hyvärinen et al.
(2001)). In the literature, statistical independence has played a central role in BSS; in most
BSS algorithms, the sources are assumed to be statistically independent. In the noiseless
case, certain techniques have been proposed to solve this problem efficiently. For example,
if the sources are non-Gaussian (or at most one of them is Gaussian), BSS can be solved
by the independent component analysis (ICA) technique (Hyvärinen et al., 2001). If the
sources are temporally correlated, simultaneous diagonalization of the cross-correlations
makes source separation possible (Belouchrani et al., 1997).
© 2011 Kun Zhang and Aapo Hyvärinen.
In practice the data are usually noisy, i.e., the observed data contain observation errors,
and the latent source processes exhibit some temporal structures (which may include delayed
influences between them). The state-space representation then offers a powerful modeling
approach. Here we are particularly interested in the linear state-space model (SSM) or
linear dynamic system (Kalman, 1960; van Overschee and de Moor, 1996). Denote by
$x_t = (x_{1t}, \ldots, x_{nt})^T$, $t = 1, \ldots, N$, the vector of observed signals, and by
$y_t = (y_{1t}, \ldots, y_{mt})^T$ the vector of latent processes, which are our main object of
interest.¹ The observed data are assumed to be linear mixtures of the latent processes together
with some noise effect, while the latent processes follow a vector autoregressive (VAR)
model. Mathematically, we have
$$x_t = A y_t + e_t, \tag{1}$$
$$y_t = \sum_{\tau=1}^{L} B_\tau y_{t-\tau} + \epsilon_t, \tag{2}$$
where $e_t = (e_{1t}, \ldots, e_{nt})^T$ and $\epsilon_t = (\epsilon_{1t}, \ldots, \epsilon_{mt})^T$ denote the observation error and the process
noise, respectively. Moreover, $e_t$ and $\epsilon_t$ are both temporally white and independent of
each other. One can see that, because of the state transition matrices $B_\tau$, the $y_{it}$ are generally
dependent, even if the $\epsilon_{it}$ are mutually independent.
In traditional SSMs, both $\epsilon_t$ and $e_t$ are assumed to be Gaussian; equivalently, one
makes use only of their covariance structure, and statistical properties beyond second order
are not considered. In Kalman filtering (Kalman, 1960), $A$ and $B_\tau$ are given, and the goal
is to do inference, i.e., to estimate $y_t$ based on $\{x_t\}$. Learning of the parameters $A$, $B_\tau$,
and the covariance matrices of $e_t$ and $\epsilon_t$ has also been studied; see, e.g., van Overschee and de
Moor (1991); Ghahramani and Hinton (1996). However, it is well known that under the
above assumptions the SSM is generally not identifiable (see, e.g., Arun and Kung
(1990)), and consequently one cannot use this model to recover the latent processes $y_{it}$.
Under specific structural constraints on $B_\tau$ or $A$, the SSM (1∼2) may become
identifiable, so that it can be used to reveal the underlying structure of the data. Many
existing models used for source separation or prediction of time series can be
considered as special cases of this model. For instance, temporal-structure-based source
separation (Murata et al., 2001) assumes that the $B_\tau$ are diagonal. The model also becomes
identifiable under some other structural constraints on $A$, as discussed in Xu (2002). However, one should note that in practice such constraints may not hold; for instance, for
electroencephalography (EEG) or magnetoencephalography (MEG) data, some underlying
processes or sources may have delayed influences on others, and forcing $B_\tau$ to be diagonal
would destroy this type of connectivity.
On the other hand, distributional information also helps system identification. One
can ignore the temporal information and perform system identification based on the non-Gaussianity of the data. For example, if the matrices $B_\tau$ are zero and the $\epsilon_{it}$ are non-Gaussian, the model reduces to the noisy ICA problem or the independent factor analysis (IFA)
model (Attias, 1999). In the noiseless case, ICA can recover the underlying linear mixing
system up to trivial indeterminacies. But in the noisy case, the model is only partially
identifiable (Davies, 2004); in fact, the distributions of $y_{it}$ cannot be uniquely determined,
even with given means and variances.

1. We use the terms latent processes, factors, and sources interchangeably in this paper, depending on
the application scenario.
For generality, we would like to keep the flexibility of the SSM (i.e., not to impose
specific structural constraints on $A$ and $B_\tau$), but still make it identifiable by incorporating
an assumption which usually holds in practice. Non-Gaussianity of the distribution
plays such a role: we assume that the process noise $\epsilon_t$ has non-Gaussian and independent
components. Hence our proposed model is called the non-Gaussian SSM (NG-SSM). We will
show that combining the temporal information and the distributional information makes
NG-SSM fully identifiable. This enables finding the sources, or latent processes, successfully
even when they are dependent. Note that a special case of the proposed model, which does
not have the observation error $e_t$, was recently exploited to analyze the connectivity between
brain sources for EEG or MEG (Haufe et al., 2010; Gómez-Herrero et al., 2008).
It is also interesting to note that NG-SSM can serve as another scheme for Granger
causality analysis (Granger, 1988; Hyvärinen et al., 2010). A time series $z_{1t}$ is said to
Granger cause another series $z_{2t}$ if $z_{1t}$ has incremental predictive power when forecasting
$z_{2t}$. Granger causality can be readily extended to the case with more than two time series;
see, e.g., Ding et al. (2006). VAR is a widely used way to represent Granger causality.
As (2) is a VAR model of $y_{it}$ with contemporaneously independent residuals, from another
point of view one can see that NG-SSM finds the latent factors $y_{it}$ which can be explained
well by Granger causality. In this sense, NG-SSM provides a way to do “Granger causal
factor” analysis.
Our contribution is two-fold. First, we prove the identifiability of NG-SSM, and more
specifically, we show how the non-Gaussianity of the process noise makes the model fully
identifiable. Second, we present an identification method which, as illustrated by numerical
studies, clearly outperforms the previous identification approach. The rest of this paper
is organized as follows. In Section 2 we give a rigorous proof of the identifiability of the
proposed model. A simple and efficient method for system identification is presented in
Section 3, followed by some simulation results in Section 4. Section 5 reports the results of
analyzing financial and MEG data with the proposed model. Finally we conclude the paper
and give further discussions in Section 6.
2. Identifiability
We consider the SSM (1∼2) under the non-Gaussianity assumption on $\epsilon_t$. Denote by $\Sigma_e$
and $\Sigma_\epsilon$ the covariance matrices of $e_t$ and $\epsilon_t$, respectively. Without loss of generality, we
assume that the data are zero-mean. Further, we make the following assumptions on the
proposed NG-SSM.
A1. $n \geq m$; both the observation error $e_t$ and the process noise $\epsilon_t$ are temporally white, and
$\epsilon_t$ has i.i.d., mutually independent components.
A2. $A$ is of full column rank, and $B_L$ is of full rank.
A3. The VAR process (2) is stable. That is, the roots of $\det\big(I - \sum_{\tau=1}^{L} B_\tau z^{-\tau}\big) = 0$, where
$z$ is the delay operator and $I$ the identity matrix, lie inside the complex unit circle,
i.e., have modulus smaller than one (an equivalent numerical check is sketched after this list).
A4. The process noise has unit variance, i.e., $\Sigma_\epsilon = I$. This assumption avoids the
scaling indeterminacy of $\epsilon_{it}$ or $y_{it}$.
A5. At most one component of the process noise $\epsilon_t$ is Gaussian; the observation error
$e_t$ may be either Gaussian or non-Gaussian.
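Assumption A3 can be verified numerically through the companion form of (2): the VAR is stable if and only if all eigenvalues of the companion matrix have modulus smaller than one. Below is a minimal sketch of this standard check (the code and its naming are ours, not a procedure from the paper):

```python
import numpy as np

def is_stable(B_list):
    """Check Assumption A3 for a VAR(L) with coefficient matrices B_1, ..., B_L.

    Equivalent condition: all eigenvalues of the (mL x mL) companion matrix
    lie strictly inside the complex unit circle.
    """
    L, m = len(B_list), B_list[0].shape[0]
    F = np.zeros((m * L, m * L))
    F[:m, :] = np.hstack(B_list)          # top block row: [B_1 ... B_L]
    F[m:, :-m] = np.eye(m * (L - 1))      # shifted identities below
    return np.max(np.abs(np.linalg.eigvals(F))) < 1.0
```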
Differently from the Kalman filtering task, which aims to do inference with given parameters,
the goal in this contribution is to learn $(A, \{B_\tau\}_{\tau=1}^{L}, \Sigma_e)$, as well as to recover the latent
processes $y_t$.
Now we show the identifiability of the proposed NG-SSM (1∼2). First, one can see
that without the non-Gaussianity assumption on the data, the temporal information helps
to partially identify the model. Similar results have been mentioned in the literature (Arun
and Kung, 1990; van Overschee and de Moor, 1996); however, for completeness of the theory
and consistency of the presentation, we give a rigorous formulation and proof; see Lemma 1.
We then prove that the non-Gaussianity of the process noise further allows NG-SSM
to be fully identifiable. For simplicity of the presentation, in Theorem 2 we consider the
case with $n = m$. The case with $n > m$ is considered later in Theorem 3. We start by
proving that using only the second-order temporal information in the data, the observation
error covariance matrix $\Sigma_e$, as well as other quantities such as $A B_k A^T$, $k = 0, 1, 2, \ldots$, are
identifiable, as stated in the following lemma.
Lemma 1 Consider the model given by (1∼2) with $n = m$ and given $L$. If assumptions
A1∼A4 hold, then by making use of the autocovariances of the observed data $x_t$, the noise
covariance $\Sigma_e$ and $A B_k A^T$ can be uniquely determined; furthermore, $A$ and $B_\tau$ can be identified
up to some rotation transformation. That is, suppose that the NG-SSM with parameters
$(A, \{B_\tau\}_{\tau=1}^{L}, \Sigma_e)$ and that with $(\tilde{A}, \{\tilde{B}_\tau\}_{\tau=1}^{L}, \tilde{\Sigma}_{\tilde{e}})$ are observationally equivalent;
we then have $\tilde{\Sigma}_{\tilde{e}} = \Sigma_e$, $\tilde{A} = AU$, $\tilde{B}_\tau = U^T B_\tau U$, where $U$ is an $m$-dimensional orthogonal
matrix.²

2. Note that here we have already assumed that $\mathrm{Var}(\epsilon_{it}) = 1$. Otherwise, given $\Sigma_\epsilon$ and $\tilde{\Sigma}_{\tilde{\epsilon}}$, both of which
are diagonal with positive diagonal entries, $\tilde{A}$ can be represented as $\tilde{A} = A \Sigma_\epsilon^{1/2} U \tilde{\Sigma}_{\tilde{\epsilon}}^{-1/2}$, and the expression
for $\tilde{B}_\tau$ is more complex.
Proof Define the autocovariance function of $y$ at lag $k$ as $R_y(k) = E(y_t y_{t+k}^T)$, and similarly
for $R_x(k)$. Clearly $R_y(-k) = R_y(k)^T$ and $R_x(-k) = R_x(k)^T$. Due to (2), we have

$$R_y(k) = E\Big[ y_t \Big( \sum_{\tau=1}^{L} B_\tau y_{t+k-\tau} + \epsilon_{t+k} \Big)^T \Big] = \begin{cases} \sum_{\tau=1}^{L} R_y(k-\tau)\, B_\tau^T, & \text{for } k \neq 0; \\ \sum_{\tau=1}^{L} R_y(\tau)^T B_\tau^T + I, & \text{for } k = 0. \end{cases} \tag{3}$$

Let $\tilde{x}_t \triangleq A y_t$, so $x_t = \tilde{x}_t + e_t$. From (1) one can see that

$$R_{\tilde{x}}(k) = A R_y(k) A^T. \tag{4}$$

Further considering the fact that $A$ is invertible, combining (3) and (4) gives that $R_{\tilde{x}}(k)$
satisfies the recursive relationship

$$R_{\tilde{x}}(k) = \begin{cases} \sum_{\tau=1}^{L} R_{\tilde{x}}(k-\tau)\, C_\tau^T, & \text{for } k \neq 0, \\ \sum_{\tau=1}^{L} R_{\tilde{x}}(k-\tau)\, C_\tau^T + A A^T, & \text{for } k = 0, \end{cases} \tag{5}$$
where $C_\tau \triangleq A B_\tau A^{-1}$. Also bearing in mind that $R_x(k) = R_{\tilde{x}}(k)$ for $k \neq 0$ and $R_x(0) = R_{\tilde{x}}(0) + \Sigma_e$, (5) can be written in the matrix form

$$\begin{bmatrix} R_x(0) - AA^T \\ R_x(1) \\ \vdots \\ R_x(L) \\ R_x(L+1) \\ \vdots \\ R_x(2L) \end{bmatrix} = H \cdot \underbrace{\begin{bmatrix} C_1^T \\ C_2^T \\ \vdots \\ C_L^T \end{bmatrix}}_{\triangleq\, C} - \; \Sigma_e \cdot \begin{bmatrix} -I \\ C_1^T \\ \vdots \\ C_L^T \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \tag{6}$$
where

$$H \triangleq \begin{bmatrix} R_x(1)^T & R_x(2)^T & \ldots & R_x(L)^T \\ R_x(0) & R_x(1)^T & \ldots & R_x(L-1)^T \\ \vdots & \vdots & \ddots & \vdots \\ R_x(L-1) & R_x(L-2) & \ldots & R_x(0) \\ R_x(L) & R_x(L-1) & \ldots & R_x(1) \\ \vdots & \vdots & \ddots & \vdots \\ R_x(2L-1) & R_x(2L-2) & \ldots & R_x(L) \end{bmatrix}.$$
The lower block of $H$ is actually $E(\vec{x}_t \vec{x}_{t+L}^T)$, where $\vec{x}_t = (x_t^T, x_{t-1}^T, \ldots, x_{t-L+1}^T)^T$. Let

$$\vec{C} \triangleq \begin{bmatrix} C, & \begin{matrix} I_{m(L-1)} \\ 0_{m \times m(L-1)} \end{matrix} \end{bmatrix},$$

where $I_{m(L-1)}$ denotes the $m(L-1)$-dimensional identity matrix and $0_{m \times m(L-1)}$ the $m \times m(L-1)$
zero matrix. As $\det(\vec{C}) = \det(C_L)$, which is not zero due to Assumption A2, $\vec{C}$ is nonsingular.
One can easily see that $E(\vec{x}_t \vec{x}_{t+1}^T) = E(\vec{x}_t \vec{x}_t^T)\, \vec{C}$.
Consequently $E(\vec{x}_t \vec{x}_{t+L}^T) = E(\vec{x}_t \vec{x}_t^T)\, \vec{C}^{\,L}$, and the lower block of $H$ is thus nonsingular.
Hence, we can find the unique solution for $C_\tau$, $\tau = 1, 2, \ldots, L$, from the bottom $mL$ equations
of (6). Substituting the solutions for $C_\tau$ into the top $2m$ equations, we can get the unique
solution for $\Sigma_e$ and $AA^T$.
Since $(A, \{B_\tau\}_{\tau=1}^{L}, \Sigma_e)$ and $(\tilde{A}, \{\tilde{B}_\tau\}_{\tau=1}^{L}, \tilde{\Sigma}_{\tilde{e}})$ produce the same $R_x(k)$, we have $\tilde{A}\tilde{A}^T = AA^T$ and $\tilde{A}\tilde{B}_\tau\tilde{A}^{-1} = AB_\tau A^{-1}$. As $\tilde{A}\tilde{A}^T = AA^T$ and $\tilde{A}$ is nonsingular, we have $A^{-1}\tilde{A} \cdot (A^{-1}\tilde{A})^T = I$, i.e., $A^{-1}\tilde{A} = U$, or $\tilde{A} = AU$, where $U$ is an orthogonal matrix. Furthermore, combining $\tilde{A}\tilde{B}_\tau\tilde{A}^{-1} = AB_\tau A^{-1}$ with $\tilde{A}\tilde{A}^T = AA^T$ gives $AB_\tau A^T = \tilde{A}\tilde{B}_\tau\tilde{A}^T = AU\tilde{B}_\tau U^T A^T$, i.e., $\tilde{B}_\tau = U^T B_\tau U$.
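For intuition, the constructive part of this proof can be mimicked numerically in the $n = m$ case: estimate the autocovariances, solve the bottom $mL$ equations of (6) for the $C_\tau$, and then read off $\Sigma_e$ and $AA^T$. Below is a minimal sketch with sample autocovariances (our own naming and least-squares shortcuts; an illustration of the lemma, not the estimator used later in the paper):

```python
import numpy as np

def autocov(X, k):
    """Sample R_x(k) = E(x_t x_{t+k}^T) for zero-mean data X of shape (N, m)."""
    N = X.shape[0]
    return X[:N - k].T @ X[k:] / (N - k)

def second_order_identification(X, L):
    """Recover C_tau = A B_tau A^{-1}, Sigma_e, and A A^T from autocovariances."""
    m = X.shape[1]
    R = {k: autocov(X, k) for k in range(2 * L + 1)}
    Rk = lambda k: R[k] if k >= 0 else R[-k].T          # R_x(-k) = R_x(k)^T
    # bottom mL equations of (6): R_x(k) = sum_tau R_x(k - tau) C_tau^T, k = L+1..2L
    G = np.vstack([np.hstack([Rk(k - tau) for tau in range(1, L + 1)])
                   for k in range(L + 1, 2 * L + 1)])
    rhs = np.vstack([Rk(k) for k in range(L + 1, 2 * L + 1)])
    Ct = np.linalg.lstsq(G, rhs, rcond=None)[0]         # stacked [C_1^T; ...; C_L^T]
    C = [Ct[i * m:(i + 1) * m].T for i in range(L)]
    # the k = L equation then gives Sigma_e (C_L is nonsingular by A2) ...
    S = sum(Rk(L - tau) @ C[tau - 1].T for tau in range(1, L + 1))
    Sigma_e = np.linalg.solve(C[L - 1], (S - Rk(L)).T).T
    # ... and the k = 0 equation gives A A^T
    AAt = Rk(0) - sum(Rk(tau).T @ C[tau - 1].T for tau in range(1, L + 1)) - Sigma_e
    return C, Sigma_e, AAt
```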
Next, we show how the assumption of non-Gaussian distributions leads to identifiability
in the basic case of $n = m$. A typical example in which the distributional information
guarantees model identifiability is ICA. In ICA, one has a set of observable signals,
which are assumed to be linear mixtures of some hidden independent sources, but the
mixing procedure is unknown. In the square case (with an equal number of sources and
observable signals), if at most one of the sources is Gaussian, then ICA is able to
estimate the mixing procedure and recover the sources up to some trivial indeterminacies.
Similarly, here considering the non-Gaussianity of the process noise allows the NG-SSM
with $n = m$ to be fully identifiable, as given by the following theorem.
Theorem 2 Consider the model given by (1∼2) with $n = m$ and given $L$. Suppose that
assumptions A1∼A5 hold. Then the model is identifiable. In particular, suppose that the
model with parameters $(A, \{B_\tau\}_{\tau=1}^{L}, \Sigma_e)$ and that with $(\tilde{A}, \{\tilde{B}_\tau\}_{\tau=1}^{L}, \tilde{\Sigma}_{\tilde{e}})$ are observationally
equivalent; we then have $\tilde{\Sigma}_{\tilde{e}} = \Sigma_e$, $\tilde{A} = AP$, $\tilde{B}_\tau = P^T B_\tau P$, and $\tilde{y}_t = P^T y_t$, where $P$ is a
signed permutation matrix (a permutation matrix with non-zero entries being 1 or −1). The
distribution of $\epsilon_{it}$ can also be uniquely determined up to the sign indeterminacy.³
Proof Lemma 1 gives the relationships between the two parameter sets $(A, \{B_\tau\}_{\tau=1}^{L})$
and $(\tilde{A}, \{\tilde{B}_\tau\}_{\tau=1}^{L})$ which produce the same $x_t$, under assumptions A1∼A4. Based on
Lemma 1, we have

$$\begin{aligned} x_t &= A\Big[I - \sum_{\tau=1}^{L} B_\tau z^{-\tau}\Big]^{-1}\epsilon_t + e_t = \tilde{A}\Big[I - \sum_{\tau=1}^{L} \tilde{B}_\tau z^{-\tau}\Big]^{-1}\tilde{\epsilon}_t + \tilde{e}_t \\ \Longrightarrow\; & A\Big[I - \sum_{\tau=1}^{L} B_\tau z^{-\tau}\Big]^{-1}\epsilon_t + e_t = AU\Big[I - \sum_{\tau=1}^{L} U^T B_\tau U z^{-\tau}\Big]^{-1}\tilde{\epsilon}_t + \tilde{e}_t \\ \Longrightarrow\; & \Big[I - \sum_{\tau=1}^{L} B_\tau z^{-\tau}\Big]^{-1}\epsilon_t + A^{-1}e_t = \Big[I - \sum_{\tau=1}^{L} B_\tau z^{-\tau}\Big]^{-1}U\tilde{\epsilon}_t + A^{-1}\tilde{e}_t \\ \Longrightarrow\; & \Big[I - \sum_{\tau=1}^{L} B_\tau z^{-\tau}\Big]^{-1}(\epsilon_t - U\tilde{\epsilon}_t) = A^{-1}(\tilde{e}_t - e_t) \\ \Longrightarrow\; & (\epsilon_t - U\tilde{\epsilon}_t) = \Big[I - \sum_{\tau=1}^{L} B_\tau z^{-\tau}\Big]A^{-1}(\tilde{e}_t - e_t). \end{aligned} \tag{7}$$
Let $d_t \triangleq A^{-1}(\tilde{e}_t - e_t)$. The right-hand side (RHS) of (7) is a moving average (MA)
process of $d_t$, and its autocovariance function at lag $L$ is $R_{\mathrm{RHS}}(L) = E\big[\big(-\sum_{j=0}^{L} B_j d_{t-j}\big)\big(-\sum_{i=0}^{L} B_i d_{t+L-i}\big)^T\big] = E(d_t d_t^T)\, B_L^T$, where we have defined $B_0 \triangleq -I$. On the other hand,
the left-hand side of (7) is i.i.d., and hence $E(d_t d_t^T) B_L^T = 0$. As $B_L$ is nonsingular, we must
have $E(d_t d_t^T) = 0$, i.e., $\tilde{e}_t = e_t$ and $\epsilon_t = U\tilde{\epsilon}_t$.
We then consider the condition $\epsilon_t = U\tilde{\epsilon}_t$, or $\tilde{\epsilon}_t = U^T \epsilon_t$. Assumption A5 states that
at most one of the $\epsilon_{it}$ is Gaussian. From the identifiability of the noiseless ICA model with a
square mixing matrix (see Theorem 11 of Comon (1994) or Theorem 10.3.1 of Kagan et al.
(1973)), $U$ must be a signed permutation matrix. Furthermore, the distributions of $\tilde{\epsilon}_{it}$ are
the same as those of $\epsilon_{it}$ (up to the permutation and sign indeterminacies).
3. If $\mathrm{Var}(\epsilon_{it})$ can be arbitrary, we have $\tilde{A} = AP\Lambda$ and $\tilde{B}_\tau = \Lambda^{-1} P^T B_\tau P \Lambda$, where $\Lambda$ is a diagonal matrix
with positive diagonal entries.
Finally, we consider the general case and show the identifiability of the NG-SSM model.
The identifiability in the case with n ≥ m follows as a consequence of the above theorem
and linear algebra.
Theorem 3 Under assumptions A1∼A5, the model given by (1∼2) with $n \geq m$ and
given $L$ is identifiable up to arbitrary permutations and signs of $y_{it}$.
Proof Let $A_i^T$ be the $i$th row of $A$. Recall that $A$ is of full column rank. Then for any
$i$, by making use of linear algebra, one can show that there always exist $(m-1)$ rows of $A$
such that they, together with $A_i^T$, form a full-rank matrix, denoted by $\bar{A}_i$. According to
Theorem 2, from the observed data corresponding to $\bar{A}_i$, $\{B_\tau\}_{\tau=1}^{L}$ and $\bar{A}_i$ can be uniquely
determined up to the permutation and sign indeterminacies. That is, all rows of $A$ are
identifiable, and thus $A$ is identifiable. Furthermore, $\mathrm{Cov}(Ay_t)$ is determined by $A$ and
$\{B_\tau\}_{\tau=1}^{L}$. As $\mathrm{Cov}(x_t) = \mathrm{Cov}(Ay_t) + \Sigma_e$, $\Sigma_e$ is also identifiable.
3. Identification
In the Gaussian case, one can efficiently find an estimate of the parameters of the SSM,
although it is only one among infinitely many observationally equivalent solutions. By treating
$y_t$ as missing values, one can write the complete-data likelihood $\log p(\{y_t\}_{t=1}^{N}, \{x_t\}_{t=1}^{N})$
of the model (1∼2) in closed form. The EM algorithm to maximize the data likelihood can
then be derived, allowing partial identification of the model; see, for instance, Ghahramani
and Hinton (1996). Alternatively, by rewriting the SSM as a large block-matrix formula,
one can adopt the subspace state-space system identification (4SID) methods (van Overschee
and de Moor, 1991). For a comparison between these two types of methods, see Smith et al. (2000).

However, when the process noise $\epsilon_{it}$ is non-Gaussian, one usually has to resort to
simulation-based methods, such as particle filtering, to do the inference, and parameter
estimation also becomes computationally demanding. For reasons of computational efficiency,
especially for large-scale problems (e.g., MEG data analysis), we propose an approximate but very efficient method which consists of two steps.
3.1 Step 1: Dimension reduction and noise suppression by recovering the subspace of the latent processes
In the case where n = m, this step is skipped and one directly performs the second step
given in Subsection 3.2. Otherwise in the first step we reduce the noise effect and estimate
the subspace of the latent processes yt .
In the proposed NG-SSM, the subspace spanned by the latent processes $y_{it}$ is temporally
colored, while the observation error $e_t$ is assumed to be temporally white. Based on these
properties, in this step we extract the subspace of $y_t$ from $x_t$ by finding an $m$-dimensional
colored subspace

$$\check{x}_t = W_c x_t \tag{8}$$

and making the remaining subspace temporally as white as possible. In this way the effect
of the observation error $e_t$ in the extracted subspace $\check{x}_t$ is greatly reduced. Consequently
$\check{x}_t$ and $y_t$ approximately lie in the same subspace. In other words, $\check{x}_t$ is expected to be a
linear, square transformation of $y_t$, i.e.,

$$\check{x}_t = \check{A} y_t. \tag{9}$$

How to find the transformation in (8) is discussed below; how to find $\check{A}$ and estimate
$y_t$ from $\check{x}_t$ will be explained in Subsection 3.2.
At first glance, a natural way to find the transformation in (8) is to resort to so-called
colored subspace analysis (CSA; Theis (2010)). We adopted the algorithm based on joint
low-rank approximation (JLA) for CSA, due to its attractive theoretical properties (Theis,
2010). However, in practice we found that when the data are very noisy, its solution is
highly sensitive to the initialization, and hence we need to develop an approach which
avoids local optima.
A closer inspection of (2), which specifies the generating process of the latent processes
$y_t$, together with Assumption A2, shows that $y_t$ corresponds to the subspace of $x_t$ which can
be predicted best from historical values, while the subspace which is mainly caused by the
observation error cannot be predicted. In fact, by exploiting the VAR model, the subspace of $y_t$, or (8),
can be estimated efficiently, and the solution is guaranteed to be globally optimal. First, let
us whiten the data. Assume the eigenvalue decomposition (EVD) of the covariance matrix
of $x_t$ is $\mathrm{Cov}(x_t) = E_1 D_1 E_1^T$, where $D_1$ is a diagonal matrix consisting of the eigenvalues
and the columns of $E_1$ are the corresponding eigenvectors. The whitened data are

$$\ddot{x}_t = D_1^{-1/2} E_1^T x_t.$$

Note that the components of $\ddot{x}_t$ are uncorrelated and have unit variance.

Second, one fits the VAR model with $L$ lags on $\ddot{x}_t$:

$$\ddot{x}_t = \sum_{\tau=1}^{L} M_\tau \ddot{x}_{t-\tau} + \tilde{\varepsilon}_t,$$

where the $M_\tau$ denote the coefficient matrices, and $\tilde{\varepsilon}_t$ is the residual series. The parameters involved
in the above equation can be estimated efficiently, with the solution given in closed form.
Note that the subspace of $\ddot{x}_t$ which can be predicted well coincides with the subspace of $\check{x}_t$
given in (8), and that it corresponds to the subspace of $\tilde{\varepsilon}_t$ which has small variance.

Third, let the EVD of the covariance matrix of $\tilde{\varepsilon}_t$ be $\mathrm{Cov}(\tilde{\varepsilon}_t) = E_2 D_2 E_2^T$, and let
$P$ be the matrix consisting of the eigenvectors corresponding to the $m$ smallest eigenvalues.
One can see that $P^T \tilde{\varepsilon}_t$ gives the $m$-dimensional minor subspace of the error $\tilde{\varepsilon}_t$, and
consequently, $P^T \ddot{x}_t$ corresponds to the subspace of $x_t$ that can be predicted best. That is,
the transformation in (8) is determined by

$$W_c = P^T D_1^{-1/2} E_1^T.$$

In our experiments, we use this solution for CSA. In our simulation studies we found that
this scheme always leads to good performance, i.e., the learned $\check{x}_t$ provides a good estimate
of the subspace of the latent processes.
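Putting the three steps together, here is a minimal sketch of this construction of $W_c$ (our variable names; the VAR is fit by ordinary least squares, which gives the closed-form solution mentioned above):

```python
import numpy as np

def extract_colored_subspace(X, m, L):
    """Step 1 sketch: estimate W_c in (8) from zero-mean data X of shape (N, n)."""
    N, n = X.shape
    # first, whiten:  x''_t = D1^{-1/2} E1^T x_t
    D1, E1 = np.linalg.eigh(np.cov(X.T))
    V = np.diag(D1 ** -0.5) @ E1.T
    Xw = X @ V.T
    # second, fit a VAR(L) on the whitened data by OLS (closed form)
    Z = np.hstack([Xw[L - tau:N - tau] for tau in range(1, L + 1)])
    M = np.linalg.lstsq(Z, Xw[L:], rcond=None)[0]
    resid = Xw[L:] - Z @ M
    # third, take the m-dimensional minor subspace of the residual covariance
    D2, E2 = np.linalg.eigh(np.cov(resid.T))
    P = E2[:, :m]                        # eigh sorts eigenvalues in ascending order
    return P.T @ V                       # W_c = P^T D1^{-1/2} E1^T

# extracted subspace, cf. (8):
# Wc = extract_colored_subspace(X, m=4, L=2);  X_check = X @ Wc.T
```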
3.2 Step 2: Multichannel blind deconvolution and post-processing
In the noiseless case, Haufe et al. (2010) discussed the relationship between the NG-SSM
and multichannel blind deconvolution (MBD; Cichocki and Amari (2003)), or the
convolutive ICA problem. Here, after dimension reduction in the first step, we also use
MBD for parameter estimation. From (2) and (9) one can see that

$$\epsilon_t = \Big[I - \sum_{\tau=1}^{L} B_\tau z^{-\tau}\Big] y_t = \check{A}^{-1}\check{x}_t - \sum_{\tau=1}^{L} B_\tau \check{A}^{-1} \check{x}_{t-\tau} = \sum_{k=0}^{L} W_k \check{x}_{t-k}, \tag{10}$$

where $W_0 \triangleq \check{A}^{-1}$ and $W_\tau \triangleq -B_\tau \check{A}^{-1}$ for $\tau > 0$. Recall that $\epsilon_t$ is assumed to be both
spatially and temporally independent. Consequently, the $W_k$ in (10) can be estimated by using
the MBD technique, which aims to make the output sequences spatially and temporally
independent.
Under the assumption that at most one of the $\epsilon_{it}$ is Gaussian, MBD can estimate the $W_k$
uniquely up to only the scaling, permutation, and time shift indeterminacies. Here the
permutation indeterminacy is trivial, and to tackle the scaling indeterminacy, we fix the
variance of $\hat{\epsilon}_{it}$ to one. With proper initialization (say, with large values for $W_0$), we can
avoid the time shift indeterminacy. In our experiments we used the natural gradient-based
algorithm for MBD (Cichocki and Amari, 2003) to determine the $W_k$.
Now suppose that we have obtained the estimates $\hat{W}_k$. According to (10), the estimates
of $\check{A}$ and $B_\tau$ can be constructed as

$$\hat{\check{A}} = \hat{W}_0^{-1}, \quad \text{and} \quad \hat{B}_\tau = -\hat{W}_\tau \hat{\check{A}} = -\hat{W}_\tau \hat{W}_0^{-1}.$$

Recalling that in (8) we used $W_c$ to extract the colored subspace, i.e., $\check{x}_t = W_c x_t = \check{A} y_t$,
one can then construct the estimate of $A$ as

$$\hat{A} = W_c^{\dagger} \hat{\check{A}} = W_c^{\dagger} \hat{W}_0^{-1},$$

where $\dagger$ denotes the pseudo-inverse. The proposed two-step method is denoted by CSA+MBD.
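A minimal sketch of this post-processing, assuming the MBD output $\hat{W}_0, \ldots, \hat{W}_L$ and the Step-1 map $W_c$ are given (our naming):

```python
import numpy as np

def reconstruct_parameters(W_hat, Wc):
    """Rebuild the NG-SSM parameters from MBD filters W_hat = [W_0, ..., W_L]."""
    A_check = np.linalg.inv(W_hat[0])                 # A_check estimate = W_0^{-1}
    B_hat = [-Wk @ A_check for Wk in W_hat[1:]]       # B_tau = -W_tau * A_check
    A_hat = np.linalg.pinv(Wc) @ A_check              # A = W_c^dagger * A_check
    return A_hat, B_hat

# the latent processes then follow as  y_hat_t = W_0 x_check_t = W_0 W_c x_t
```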
3.3 Significance assessment
We use a bootstrapping-based method to assess the significance of each influence from
$y_{i,t-\tau}$ ($\tau > 0$) to $y_{jt}$, or of the total effect from $\{y_{i,t-\tau}\}_{\tau=1}^{L}$ to $y_{jt}$, denoted by $S_{j \leftarrow i}$, following
the pathway proposed in Sec. 6 of Hyvärinen et al. (2010).
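A generic residual-bootstrap sketch in this spirit (our own simplification; the exact pathway, e.g., how p-values are derived, follows Hyvärinen et al. (2010) and may differ in detail):

```python
import numpy as np

def fit_var(Y, L):
    """OLS fit of (2) on recovered latent processes Y of shape (N, m)."""
    N, m = Y.shape
    Z = np.hstack([Y[L - tau:N - tau] for tau in range(1, L + 1)])
    coef = np.linalg.lstsq(Z, Y[L:], rcond=None)[0].T      # (m, m*L)
    B = [coef[:, (tau - 1) * m: tau * m] for tau in range(1, L + 1)]
    return B, Y[L:] - Z @ coef.T

def bootstrap_zscores(Y, L, n_boot=500, seed=0):
    """Crude z-scores for each b_{ji,tau}; large |z| means a small p-value."""
    rng = np.random.default_rng(seed)
    B_hat, resid = fit_var(Y, L)
    N, m = Y.shape
    draws = np.empty((n_boot, L, m, m))
    for b in range(n_boot):
        e = resid[rng.integers(0, len(resid), size=N)]     # resample residuals
        Yb = np.zeros((N, m)); Yb[:L] = Y[:L]
        for t in range(L, N):                              # regenerate under B_hat
            Yb[t] = sum(B_hat[tau - 1] @ Yb[t - tau]
                        for tau in range(1, L + 1)) + e[t]
        draws[b] = np.stack(fit_var(Yb, L)[0])
    return np.abs(np.stack(B_hat)) / draws.std(axis=0)
```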
4. Simulations
We used simulations to illustrate the performance of the proposed method CSA+MBD
for parameter estimation of the NG-SSM. The observed data $x_t$ were generated according
to (1∼2) with $n = 10$, $m = 4$, and sample size $N = 1000$. $\epsilon_{1t}$ and $\epsilon_{2t}$ are i.i.d.
super-Gaussian: they were generated by passing Gaussian samples through the power nonlinearity
with exponent 1.5 while keeping the original signs. $\epsilon_{3t}$ and $\epsilon_{4t}$ are uniformly distributed
(and thus sub-Gaussian). The lag number of the VAR process (2) was set to $L = 2$. We used
a heuristic way to enforce the stability of the process, by using small values for the entries
of $B_\tau$: the entries of $B_1$ were uniformly distributed between −0.5 and 0.5, and those of $B_2$
were drawn from the uniform distribution between −0.25 and 0.25. Entries of the mixing
matrix $A$ were drawn uniformly from $[-1.5, 1.5]$. The covariance matrix of $e_t$ was constructed
as the product of an $n \times n$ random matrix and its transpose. We use the average
signal-to-noise ratio (SNR) in $x_t$, defined as
$10\log_{10}\big[\sum_{i=1}^{n}\mathrm{Var}(x_{it}-e_{it}) \,/\, \sum_{i=1}^{n}\mathrm{Var}(e_{it})\big]$,
to measure the noise level in the observed data.
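For concreteness, a sketch of this data-generating procedure (our naming; the paper does not specify how $\Sigma_e$ is scaled to reach a target SNR, so the sketch only computes the SNR that results):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, N, L = 10, 4, 1000, 2

# process noise: two super-Gaussian components (power nonlinearity with
# exponent 1.5, signs kept) and two sub-Gaussian (uniform) components,
# all standardized to unit variance (Assumption A4)
g = rng.standard_normal((N, 2))
sup = np.sign(g) * np.abs(g) ** 1.5
eps = np.hstack([sup / sup.std(axis=0),
                 rng.uniform(-np.sqrt(3), np.sqrt(3), size=(N, 2))])

# small-valued transition matrices (redraw if the companion-matrix check
# from Section 2 reports instability)
B1 = rng.uniform(-0.5, 0.5, size=(m, m))
B2 = rng.uniform(-0.25, 0.25, size=(m, m))
A = rng.uniform(-1.5, 1.5, size=(n, m))

Y = np.zeros((N, m))
for t in range(L, N):
    Y[t] = B1 @ Y[t - 1] + B2 @ Y[t - 2] + eps[t]

R = rng.standard_normal((n, n))
E = rng.multivariate_normal(np.zeros(n), R @ R.T, size=N)  # observation error
X = Y @ A.T + E

snr_db = 10 * np.log10((X - E).var(axis=0).sum() / E.var(axis=0).sum())
```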
Figure 1: Scatterplots of the entries of $\hat{B}_\tau$, $\tau = 1, 2$, vs. the true ones with different levels
of $e_t$ and different estimation methods. The average SNRs in $x_t$ are 15dB (top) and 3dB
(bottom); the two methods are PCA+MBD (left) and CSA+MBD (right).
Principal component analysis (PCA) is a widely used approach for dimension reduction
which finds the components with maximal variance. One can use PCA instead of CSA in
Step 1 of the proposed method, leading to the method PCA+MBD, which was used in Haufe
et al. (2010). We compare our method CSA+MBD with PCA+MBD. We considered two
noise levels, with the average SNRs in $x_t$ being 15dB and 3dB, respectively. Each case
was repeated for 50 random replications. Since good estimates of $A$ often result in good
estimates of the influence matrices $B_\tau$, for conciseness of the presentation, here we only
show how well the $B_\tau$ were recovered; see Figure 1.
Figure 1 gives the scatterplots of the estimated entries of $B_\tau$ against the true ones. Note
that we already permuted the $\hat{y}_{it}$ and adjusted their signs such that they and the true latent
processes $y_{it}$ have the same order and positive correlations (the scaling indeterminacy is
avoided by enforcing Assumption A4). In addition, we report the SNRs of $\hat{B}_\tau$ and $\hat{y}_{it}$ w.r.t.
the true ones in Table 1. For completeness of the comparison, we also show the results of
ICA: note that although the latent factors $y_{it}$ are not mutually independent, one can
still apply ICA to find the components which are mutually as independent as possible;
one can then fit the VAR model (2) on the estimated “independent” components to estimate
$B_\tau$. We adopted the symmetric FastICA algorithm (Hyvärinen and Oja, 1997) with the
tanh nonlinearity to perform ICA.

SNR(x_t)  Method    SNR(B̂_1)  SNR(B̂_2)  SNR(ŷ_1t)     SNR(ŷ_2t)     SNR(ŷ_3t)     SNR(ŷ_4t)
15dB      PCA+MBD   25.3       18.6       27.6 (11.6)   26.1 (10.6)   25.3 (9.4)    27.1 (10.5)
          FastICA    6.1        4.3       12.6 (4.4)    12.6 (4.5)     8.8 (4.3)    10.2 (5.5)
          CSA+MBD   41.1       31.5       34.4 (7.9)    33.1 (9.1)    34.9 (10.4)   32.7 (7.6)
3dB       PCA+MBD    5.4        2.2       13.2 (10.7)   12.6 (5.8)    10.6 (7.4)     8.2 (8.3)
          FastICA    5.4        2.8       10.4 (4.7)    10.2 (4.0)     7.6 (3.9)     7.2 (4.1)
          CSA+MBD   25.5       17.7       20.4 (5.9)    20.3 (4.5)    20.4 (4.9)    20.0 (4.4)

Table 1: SNR (in dB) of the estimated $B_\tau$ and of the recovered latent processes $y_{it}$ at different
observation error levels and with different methods. Numbers in parentheses indicate
standard deviations.
From Table 1 one can see that in both the low-noise and high-noise situations, CSA+MBD
successfully estimated $B_\tau$ and the latent processes $y_t$ with very high SNRs. Its performance
is clearly better than that of PCA+MBD. This illustrates the validity of the proposed two-step
method CSA+MBD, and also shows that properly initialized CSA performs well for
dimension reduction and noise suppression (at least much better than PCA) in the estimation
of the proposed NG-SSM. Not surprisingly, since the latent processes are not mutually
independent, ICA gives the poorest performance here. Even in the lower-noise situation
(SNR = 15dB), the estimated $B_\tau$ are very noisy (as seen from the low SNRs).
5. Real-world Applications
5.1 Financial data
We exploited the NG-SSM to investigate underlying structures of nine major world
stock indices: the Dow Jones Industrial Average Index (DJI) and NASDAQ Composite
Index (NAS) in the USA, the FTSE 100 Index (FTSE) in England, the DAX Index in Germany,
the CAC 40 Index in France, the Nikkei 225 (N225) in Japan, the Hang Seng Index (HSI) in
Hong Kong, the Taiwan Weighted Index (TWI), and the Shanghai Stock Exchange Composite
Index (SSEC) in China. We used the daily dividend/split-adjusted closing prices from Dec. 4,
2001 to Jul. 11, 2006. For days when a price was not available, we used linear interpolation
to estimate it. Denoting the closing price of the $i$th index on day $t$ by $P_{it}$, the
corresponding return was calculated as $x_{it} = (P_{it} - P_{i,t-1})/P_{i,t-1}$. The data for analysis
are the 9-dimensional parallel return series with 1200 samples.
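A small sketch of this preprocessing (our naming; missing prices are assumed to be marked as NaN):

```python
import numpy as np

def interpolate_and_returns(P):
    """P: (T, 9) array of closing prices, NaN for unavailable days.
    Linearly interpolate the gaps, then compute the returns
    x_it = (P_it - P_{i,t-1}) / P_{i,t-1}."""
    P = P.copy()
    t = np.arange(len(P))
    for i in range(P.shape[1]):
        nan = np.isnan(P[:, i])
        P[nan, i] = np.interp(t[nan], t[~nan], P[~nan, i])
    return (P[1:] - P[:-1]) / P[:-1]
```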
We set the dimensionality of the latent process $y_t$ to $m = 9$ and the number of lags to
$L = 1$. After estimating all parameters, we selected four estimated processes $y_{it}$ of
interest for further analysis. The corresponding columns of the estimated mixing matrix $\hat{A}$
are given in Figure 2, with the large entries printed in the figure. These latent processes are
interesting to us for two reasons. First, they have large contributions to $x_t$, as measured
by the norms of the corresponding columns of $\hat{A}$; moreover, each of them contributes
significantly to at least three indices, while the others mainly contribute to one or two, so
they seem to be “common” factors of the indices. Second, compared to the other estimated
processes, they have stronger influences on each other, as measured by the corresponding
entries of $\hat{B}_1$. From Figure 2 one can see that $\hat{y}_{3t}$ and $\hat{y}_{4t}$ are closely related to the indices in
America as well as those in Europe, while $\hat{y}_{1t}$ and $\hat{y}_{2t}$ contribute mainly to those in Europe
and those in Asia, respectively. Figure 3 shows the entries of $\hat{B}_1$ corresponding to these
processes. One can see that the positive influences are mainly $\hat{y}_{4,t-1} \to \hat{y}_{1t}$, $\hat{y}_{3,t-1} \to \hat{y}_{2t}$, and
$\hat{y}_{1,t-1} \to \hat{y}_{2t}$. These relationships seem meaningful and natural; in fact they are consistent
with the direction of the information flow caused by the time differences between America,
Europe, and Asia. The results also suggest that the stock markets in Asia (except for that in China,
as seen from the fact that SSEC is not strongly related to any of the four factors shown in
Figure 2) are significantly influenced by those in America and Europe. However, this type
of information would be missed if one used ICA to analyze their relationships, due to its
assumption of independence between the latent factors.
Figure 2: Part of $\hat{A}$ (the four selected columns). The $\hat{y}_{it}$ are sorted in descending order of
their contributed variances to all $x_{it}$. The signs of the $\hat{y}_{it}$ have been adjusted such that the
sum of each column of $\hat{A}$ is positive.

Figure 3: The entries of the influence matrix $\hat{B}_1$ corresponding to the four factors. Refer to
Figure 2 for the relationships between the $\hat{y}_{it}$ and the stock indices. Solid (dashed) lines
show positive (negative) influences.
5.2 MEG data
Finally we applied NG-SSM to MEG data. The raw recordings consisted of 204 gradiometer
channels and were obtained from a healthy volunteer; they lasted about 8.2 minutes.
The data were down-sampled from 600 Hz to 75 Hz.⁴ As preprocessing, we used a band-pass
filter to remove the very low frequency part (below 4 Hz) and emphasize the signals in
the 4–26 Hz band. This band is believed to be informative, as Alpha, Beta,
and Theta waves mainly lie in this frequency range.

4. We are very grateful to Pavan Ramkumar and Riitta Hari of Aalto University, Finland, for access to the
MEG data.
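A sketch of this preprocessing chain; the concrete filter design (a Butterworth band-pass and scipy's decimate) is our assumption, as the paper does not state it:

```python
import numpy as np
from scipy.signal import butter, decimate, filtfilt

def preprocess_meg(X, fs=600.0):
    """Down-sample 600 Hz -> 75 Hz, then band-pass 4-26 Hz (Theta/Alpha/Beta).
    X: (N, 204) gradiometer recordings."""
    X = decimate(X, 8, axis=0)                     # 600 / 8 = 75 Hz
    b, a = butter(4, [4.0, 26.0], btype="bandpass", fs=75.0)
    return filtfilt(b, a, X, axis=0)
```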
We used the proposed two-step method, CSA+MBD, to extract the sources $y_{it}$. We
empirically set the number of sources to $m = 25$ and the number of lags to $L = 15$
(corresponding to 0.2 seconds). Correspondingly, in the first step the data
dimensionality was reduced to 25. MBD was then applied on the extracted subspace, and
finally 25 sources were obtained, together with $\{B_\tau\}_{\tau=1}^{L}$, which imply the influences between
them.
We found that the total effect $S_{j \leftarrow i}$ is significant at the 1% level for at most 50% of the
pairs $\{y_{i,t-\tau}\} \to y_{jt}$ with $i \neq j$. Therefore, for better readability, we show only the 11 sources
with the strongest effects (indicated by the smallest p-values) relative to each other.
The topographic distributions of these sources, together with their influences, are given in
Figure 4; the thicker the line, the stronger the effect. One can see that in many
cases sources that have significant influences between them tend to be located near each other.
Source #10 is of particular interest: it influences many other sources, but is hardly affected
by others itself. For comparison, we also give in Figure 5 the sources produced by FastICA
(Hyvärinen and Oja, 1997) that have the strongest correlations to these sources. One
can see that the sources found by NG-SSM usually have sharper or better-defined locations
than those found by FastICA, especially sources #4, #5, #7, #11, and #10.
6. Discussion
We have proposed a general linear state-space model with a non-Gaussianity assumption
on the process noise. The model takes into account the time-delayed interactions of the latent
processes and the observation error (or measurement noise) in the observed data. Thanks
to the non-Gaussianity assumption, we proved that this model, although very flexible, is
in general identifiable, which enables simultaneously recovering the latent processes and
estimating their delayed interactions (or, roughly speaking, delayed causal relations). A
computationally efficient method consisting of two separate steps has been given for
system identification. Simulation results suggest good performance of the proposed method,
and experimental results on real data illustrate the usefulness of the model.
References
K. S. Arun and S. Y. Kung. Balanced approximation of stochastic systems. SIAM Journal
on Matrix Analysis and Applications, 11:42–68, 1990.
H. Attias. Independent factor analysis. Neural Computation, 11(4):803–851, 1999.
A. Belouchrani, K. Abed-Meraim, J. Cardoso, and E. Moulines. A blind source separation
technique using second-order statistics. IEEE Trans. on Signal Processing, 45(2):434–444,
1997.
Figure 4: Some of the estimated sources for MEG signals and their relationships. The
thicker the line, the stronger the influence.
Figure 5: The sources estimated by FastICA which are most correlated with the sources in
Figure 4.
A. Cichocki and S. Amari. Adaptive Blind Signal and Image Processing: Learning Algorithms
and Applications. John Wiley & Sons, UK, corrected and revisited edition, 2003.
P. Comon. Independent component analysis – a new concept? Signal Processing, 36:
287–314, 1994.
M. Davies. Identifiability issues in noisy ICA. IEEE Signal Processing Letters, 11:470–473,
2004.
M. Ding, Y. Chen, and S. L. Bressler. Granger causality: Basic theory and application
to neuroscience. In B. Schelter, M. Winterhalder, and J. Timmer, editors, Handbook of
Time Series Analysis, pages 437–460. Wiley, Weinheim, 2006.
Z. Ghahramani and G. E. Hinton. Parameter estimation for linear dynamical systems. Technical
Report CRG-TR-96-2, Department of Computer Science, University of Toronto, Toronto,
Canada, 1996.
G. Gómez-Herrero, M. Atienza, K. Egiazarian, and J. L. Cantero. Measuring directional
coupling between EEG sources. NeuroImage, 43:497–508, 2008.
C. Granger. Some recent developments in a concept of causality. Journal of Econometrics,
39:199–211, 1988.
S. Haufe, R. Tomioka, G. Nolte, K. R. Müller, and M. Kawanabe. Modeling sparse connectivity
between underlying brain sources for EEG/MEG. IEEE Transactions on Biomedical
Engineering, 57(8):1954–1963, 2010.
A. Hyvärinen and E. Oja. A fast fixed-point algorithm for independent component analysis.
Neural Computation, 9(7):1483–1492, 1997.
A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. John Wiley &
Sons, Inc, 2001.
A. Hyvärinen, K. Zhang, S. Shimizu, and P. O. Hoyer. Estimation of a structural vector
autoregression model using non-Gaussianity. Journal of Machine Learning Research, 11:
1709–1731, 2010.
A. M. Kagan, Y. V. Linnik, and C. R. Rao. Characterization Problems in Mathematical
Statistics. Wiley, New York, 1973.
R.E. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic
Engineering, 82:35–45, 1960.
N. Murata, S. Ikeda, and A. Ziehe. An approach to blind source separation based on
temporal structure of speech signals. Neurocomputing, 41:1–24, 2001.
G. A. Smith, A. J. Robinson, and M. Niranjan. A comparison between the EM and subspace
algorithms for the time-invariant linear dynamical system. Technical Report
CUED/F-INFENG/TR.366, Cambridge University Engineering Department, Cambridge, UK, 2000.
F. J. Theis. Colored subspace analysis: Dimension reduction based on a signal's autocorrelation
structure. IEEE Transactions on Circuits and Systems I: Regular Papers, 57:
1463–1474, 2010.
P. van Overschee and B. de Moor. Subspace algorithms for the stochastic identification
problem. In Proceedings of the 30th IEEE Conference on Decision and Control, pages
1321–1326, Brighton, UK, 1991.
P. van Overschee and B. de Moor. Subspace Identification for Linear Systems: Theory,
Implementation, Applications. Kluwer Academic Publishers, Dordrecht, Netherlands,
1996.
L. Xu. Temporal factor analysis (TFA): stable-identifiable family, orthogonal flow learning,
and automated model selection. In Proceedings of the 2002 International Joint Conference
on Neural Networks, pages 472–476, Honolulu, HI, USA, 2002.