JMLR: Workshop and Conference Proceedings 1: xxx-xxx ACML 2011

A general linear non-Gaussian state-space model: Identifiability, identification, and applications

Kun Zhang  [email protected]
Max Planck Institute for Intelligent Systems
Spemannstr. 38, 72076 Tübingen, Germany

Aapo Hyvärinen  [email protected]
Dept of Computer Science, HIIT, and Dept of Mathematics and Statistics
University of Helsinki, Finland

Editors: Chun-Nan Hsu and Wee Sun Lee

© 2011 Kun Zhang and Aapo Hyvärinen.

Abstract

State-space modeling provides a powerful tool for system identification and prediction. In linear state-space models the data are usually assumed to be Gaussian, and the models are given certain structural constraints so that they are identifiable. In this paper we propose a non-Gaussian state-space model which does not have such constraints. We prove that this model is fully identifiable. We then propose an efficient two-step method for parameter estimation: one first extracts the subspace of the latent processes based on the temporal information of the data, and then performs multichannel blind deconvolution, making use of both the temporal information and non-Gaussianity. We conduct a series of simulations to illustrate the performance of the proposed method. Finally, we apply the proposed model and parameter estimation method to real data, including major world stock indices and magnetoencephalography (MEG) recordings. Experimental results are encouraging and show the practical usefulness of the proposed model and method.

Keywords: State-space model, Non-Gaussianity, Identifiability, Causality

1. Introduction

Suppose that we have multiple parallel time series which are observable. Usually, the source series of interest are not directly measurable but hidden in them. In addition, the mixing system generating the observable series from the sources is unknown. For simplicity, we often assume that the mixing system is linear. The goal is to recover the latent sources of interest, as well as to model their dynamics. This problem is referred to as blind source separation (BSS; see, e.g., the books by Cichocki and Amari (2003) and by Hyvärinen et al. (2001)). In the literature, statistical independence has played a central role in BSS; in most BSS algorithms, the sources are assumed to be statistically independent. In the noiseless case, certain techniques have been proposed to solve this problem efficiently. For example, if the sources are non-Gaussian (or at most one of them is Gaussian), BSS can be solved by the independent component analysis (ICA) technique (Hyvärinen et al., 2001). If the sources are temporally correlated, simultaneous diagonalization of the cross-correlations makes source separation possible (Belouchrani et al., 1997).

In practice the data are usually noisy, i.e., the observed data contain observation errors, and the latent source processes exhibit some temporal structure (which may include delayed influences between them). The state-space representation then offers a powerful modeling approach. Here we are particularly interested in the linear state-space model (SSM), or linear dynamic system (Kalman, 1960; van Overschee and de Moor, 1996).
Denote by x_t = (x_{1t}, ..., x_{nt})^T, t = 1, ..., N, the vector of the observed signals, and by y_t = (y_{1t}, ..., y_{mt})^T the vector of latent processes which are our main object of interest.¹ The observed data are assumed to be linear mixtures of the latent processes together with some noise effect, while the latent processes follow a vector autoregressive (VAR) model. Mathematically, we have

    x_t = A y_t + e_t,                                  (1)
    y_t = \sum_{τ=1}^{L} B_τ y_{t−τ} + ε_t,             (2)

where e_t = (e_{1t}, ..., e_{nt})^T and ε_t = (ε_{1t}, ..., ε_{mt})^T denote the observation error and process noise, respectively. Moreover, e_t and ε_t are both temporally white and independent of each other. One can see that because of the state transition matrices B_τ, the y_{it} are generally dependent, even if the ε_{it} are mutually independent.

¹ We use the terms latent processes, factors, and sources interchangeably in this paper, depending on the application scenario.

In traditional SSMs, both ε_t and e_t are assumed to be Gaussian; or equivalently, one makes use only of their covariance structure, and statistical properties beyond second order are not considered. In Kalman filtering (Kalman, 1960), A and B_τ are given, and the goal is to do inference, i.e., to estimate y_t based on {x_t}. Learning of the parameters A, B_τ, and the covariance matrices of e_t and ε_t has also been studied; see, e.g., van Overschee and de Moor (1991) and Ghahramani and Hinton (1996). However, it is well known that under the above assumptions the SSM is generally not identifiable (see, e.g., Arun and Kung, 1990), and consequently one cannot use this model to recover the latent processes y_{it}.

Under specific structural constraints on B_τ or A, the SSM (1∼2) may become identifiable, so that it can be used to reveal the underlying structure of the data. Many existing models used for source separation or prediction of time series can be considered special cases of this model. For instance, temporal-structure-based source separation (Murata et al., 2001) assumes that the B_τ are diagonal. The model also becomes identifiable with some other structural constraints on A, as discussed in Xu (2002). However, one should note that in practice such constraints may not hold; for instance, for electroencephalography (EEG) or magnetoencephalography (MEG) data, some underlying processes or sources may have delayed influences on others, and forcing the B_τ to be diagonal would destroy these types of connectivities.

On the other hand, distributional information also helps system identification. One can ignore the temporal information and perform system identification based on the non-Gaussianity of the data. For example, if the matrices B_τ are zero and the ε_{it} are non-Gaussian, the problem reduces to the noisy ICA problem, or the independent factor analysis (IFA) model (Attias, 1999). In the noiseless case, ICA can recover the underlying linear mixing system up to trivial indeterminacies. But in the noisy case, the model is only partially identifiable (Davies, 2004); in fact, the distributions of the y_{it} cannot be uniquely determined, even with a given mean and variance.

For generality, we would like to keep the flexibility of the SSM (i.e., not to impose specific structural constraints on A and B_τ), but still make it identifiable by incorporating an assumption which usually holds in practice. Non-Gaussianity of the distribution plays such a role: we assume that the process noise ε_t has non-Gaussian and mutually independent components.
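To make the data-generating process concrete, the following is a minimal simulation sketch of model (1∼2). The dimensions n = 10, m = 4, L = 2 and the ranges for A and B_τ mirror the simulation setup of Section 4; the Laplace process noise and the observation-error scale are our own illustrative choices (the paper instead uses a power nonlinearity and uniform noise), so this is a sketch rather than the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, L, N = 10, 4, 2, 1000      # observed dim, latent dim, VAR order, sample size

A = rng.uniform(-1.5, 1.5, (n, m))                   # mixing matrix
B = [rng.uniform(-0.5 / tau, 0.5 / tau, (m, m))      # small entries, as a
     for tau in range(1, L + 1)]                     # stability heuristic

y = np.zeros((N, m))
x = np.zeros((N, n))
for t in range(N):
    # unit-variance, super-Gaussian process noise (Laplace, illustrative choice)
    eps = rng.laplace(scale=1 / np.sqrt(2), size=m)
    y[t] = sum(B[tau - 1] @ y[t - tau]
               for tau in range(1, L + 1) if t >= tau) + eps   # VAR(L), Eq. (2)
    x[t] = A @ y[t] + 0.3 * rng.standard_normal(n)             # mixing + white error, Eq. (1)
```

Data generated this way provide a ground truth (A, B_τ, y_t) against which any identification procedure can be checked.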
Hence our proposed model is called the non-Gaussian SSM (NG-SSM). We will show that combining the temporal information and the distributional information makes NG-SSM fully identifiable. This enables finding the sources, or latent processes, successfully even when they are dependent. Note that a special case of the proposed model, which does not have the observation error e_t, was recently exploited to analyze the connectivity between brain sources for EEG or MEG (Haufe et al., 2010; Gómez-Herrero et al., 2008).

It is also interesting to note that NG-SSM can serve as another scheme for Granger causality analysis (Granger, 1988; Hyvärinen et al., 2010). A time series z_{1t} is said to Granger cause another series z_{2t} if z_{1t} has incremental predictive power when forecasting z_{2t}. Granger causality can readily be extended to the case with more than two time series; see, e.g., Ding et al. (1996). VAR is a widely used way to represent Granger causality. As (2) is a VAR model of the y_{it} with contemporaneously independent residuals, one can see from another point of view that NG-SSM finds the latent factors y_{it} which can be explained well by Granger causality. In this sense, NG-SSM provides a way to do "Granger causal factor" analysis.

Our contribution is two-fold. First, we prove the identifiability of NG-SSM; more specifically, we show how the non-Gaussianity of the process noise makes the model fully identifiable. Second, we present an identification method which, as illustrated by numerical studies, clearly outperforms the previous identification approach.

The rest of this paper is organized as follows. In Section 2 we give a rigorous proof of the identifiability of the proposed model. A simple and efficient method for system identification is presented in Section 3, followed by simulation results in Section 4. Section 5 reports the results of analyzing financial and MEG data with the proposed model. Finally, we conclude the paper and give further discussion in Section 6.

2. Identifiability

We consider the SSM (1∼2) under the non-Gaussianity assumption on ε_t. Denote by Σ_e and Σ_ε the covariance matrices of e_t and ε_t, respectively. Without loss of generality, we assume that the data are zero-mean. Further, we make the following assumptions on the proposed NG-SSM.

A1. n ≥ m; both the observation error e_t and the process noise ε_t are temporally white, and ε_t has i.i.d., mutually independent components.

A2. A is of full column rank, and B_L is of full rank.

A3. The VAR process (2) is stable. That is, the roots of det(I − \sum_{τ=1}^{L} B_τ z^{−τ}) = 0, where z is the delay operator and I the identity matrix, lie inside the complex unit circle, i.e., have modulus smaller than one.

A4. The process noise has unit variance, i.e., Σ_ε = I. This assumption avoids the scaling indeterminacy of ε_{it} or y_{it}.

A5. At most one component of the process noise ε_t is Gaussian; the observation error e_t may be either Gaussian or non-Gaussian.

Differently from the Kalman filtering task, which aims to do inference with given parameters, the goal in this contribution is to learn (A, {B_τ}_{τ=1}^{L}, Σ_e), as well as to recover the latent process y_t.

Now we show the identifiability of the proposed NG-SSM model (1∼2). First, one can see that without the non-Gaussianity assumption on the data, the temporal information helps to partially identify the model.
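Before turning to the identifiability results, note that Assumption A3 is easy to check numerically: multiplying det(I − \sum_{τ=1}^{L} B_τ z^{−τ}) by z^{mL} turns the stability condition into det(z^L I − \sum_{τ=1}^{L} B_τ z^{L−τ}) = 0, whose roots are exactly the eigenvalues of the VAR companion matrix. A small sketch (the function name and the list-of-matrices interface are our own):

```python
import numpy as np

def var_is_stable(B_list):
    """Check Assumption A3 for VAR coefficient matrices [B_1, ..., B_L]:
    all roots of det(I - sum_tau B_tau z^{-tau}) = 0 must lie strictly
    inside the unit circle, i.e., all eigenvalues of the companion
    matrix must have modulus < 1."""
    m, L = B_list[0].shape[0], len(B_list)
    F = np.zeros((m * L, m * L))
    F[:m, :] = np.hstack(B_list)          # top block row: [B_1, ..., B_L]
    F[m:, :-m] = np.eye(m * (L - 1))      # sub-diagonal identity blocks
    return np.abs(np.linalg.eigvals(F)).max() < 1.0
```

Section 4 enforces stability only heuristically, by keeping the entries of B_τ small; wrapping the random draw of the B_τ in a rejection loop around a check like this is a stricter alternative.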
Similar results have been mentioned in the literature (Arun and Kung, 1990; van Overschee and de Moor, 1996); however, for completeness of the theory and consistency of the presentation, we give a rigorous formulation and proof; see Lemma 1. We then prove that the non-Gaussianity of the process noise further allows NG-SSM to be fully identifiable. For simplicity of the presentation, in Theorem 2 we consider the case with n = m. The case with n > m is considered later in Theorem 3.

We start by proving that using only the second-order temporal information in the data, the observation error covariance matrix Σ_e, as well as other quantities such as A B_k A^T, k = 0, 1, 2, ..., are identifiable, as stated in the following lemma.

Lemma 1 Consider the model given by (1∼2) with n = m and given L. If assumptions A1∼A4 hold, then by making use of the autocovariances of the observed data x_t, the noise covariance Σ_e and A B_k A^T can be uniquely determined; furthermore, A and B_τ can be identified up to a rotation transformation. That is, suppose that the NG-SSM model with parameters (A, {B_τ}_{τ=1}^{L}, Σ_e) and that with (Ã, {B̃_τ}_{τ=1}^{L}, Σ̃_ẽ) are observationally equivalent; we then have Σ̃_ẽ = Σ_e, Ã = AU, and B̃_τ = U^T B_τ U, where U is an m-dimensional orthogonal matrix.²

² Note that here we have already assumed that Var(ε_{it}) = 1. Otherwise, given Σ_ε and Σ̃_ε̃, both of which are diagonal with positive diagonal entries, Ã can be represented as Ã = A Σ_ε^{1/2} U Σ̃_ε̃^{−1/2}, and the expression for B̃_τ is more complex.

Proof Define the autocovariance function of y at lag k as R_y(k) = E(y_t y_{t+k}^T), and similarly for R_x(k). Clearly R_y(−k) = R_y(k)^T and R_x(−k) = R_x(k)^T. Due to (2), we have

    R_y(k) = E[ y_t ( \sum_{τ=1}^{L} B_τ y_{t+k−τ} + ε_{t+k} )^T ]
           = \sum_{τ=1}^{L} R_y(k−τ) B_τ^T,              for k ≠ 0;
           = \sum_{τ=1}^{L} R_y(τ)^T B_τ^T + I,          for k = 0.       (3)

Let x̃_t ≜ A y_t, so that x_t = x̃_t + e_t. From (1) one can see that

    R_x̃(k) = A R_y(k) A^T.                                               (4)

Further considering the fact that A is invertible, combining (3) and (4) shows that R_x̃(k) satisfies the recursive relationship

    R_x̃(k) = \sum_{τ=1}^{L} R_x̃(k−τ) C_τ^T,              for k ≠ 0,
    R_x̃(0) = \sum_{τ=1}^{L} R_x̃(−τ) C_τ^T + A A^T,       for k = 0,      (5)

where C_τ ≜ A B_τ A^{−1}. Also bearing in mind that R_x(k) = R_x̃(k) for k ≠ 0 and R_x(0) = R_x̃(0) + Σ_e, (5) can be written in matrix form (semicolons denote vertical stacking of m × m blocks):

    [ R_x(0) − A A^T ; R_x(1) ; … ; R_x(L) ; R_x(L+1) ; … ; R_x(2L) ]
        = H · C − [ −Σ_e ; Σ_e C_1^T ; … ; Σ_e C_L^T ; 0 ; … ; 0 ],       (6)

where C ≜ [ C_1^T ; C_2^T ; … ; C_L^T ] and

    H ≜ [ R_x(1)^T     R_x(2)^T     …   R_x(L)^T
          R_x(0)       R_x(1)^T     …   R_x(L−1)^T
          ⋮            ⋮                ⋮
          R_x(L−1)     R_x(L−2)     …   R_x(0)
          R_x(L)       R_x(L−1)     …   R_x(1)
          ⋮            ⋮                ⋮
          R_x(2L−1)    R_x(2L−2)    …   R_x(L) ].

The lower block of H is actually E(\vec{x}_t \vec{x}_{t+L}^T), where \vec{x}_t = (x_t^T, x_{t−1}^T, ..., x_{t−L+1}^T)^T. Let \vec{C} ≜ [ C, ( I_{m(L−1)} ; 0_{m×m(L−1)} ) ], where I_{m(L−1)} denotes the m(L−1)-dimensional identity matrix and 0_{m×m(L−1)} the m × m(L−1) zero matrix. As det(\vec{C}) = det(C_L), which is nonzero due to Assumption A2, \vec{C} is nonsingular. One can easily see that E(\vec{x}_t \vec{x}_{t+1}^T) = E(\vec{x}_t \vec{x}_t^T) \vec{C}. Consequently E(\vec{x}_t \vec{x}_{t+L}^T) = E(\vec{x}_t \vec{x}_t^T) \vec{C}^L, and the lower block of H is thus nonsingular. Hence, we can find the unique solution for C_τ, τ = 1, 2, ..., L, from the bottom mL equations of (6). Substituting the solutions for C_τ into the top 2m equations, we obtain the unique solutions for Σ_e and A A^T.

Since (A, {B_τ}_{τ=1}^{L}, Σ_e) and (Ã, {B̃_τ}_{τ=1}^{L}, Σ̃_ẽ) produce the same R_x(k), we have Ã Ã^T = A A^T and Ã B̃_τ Ã^{−1} = A B_τ A^{−1}.
As Ã Ã^T = A A^T and Ã is nonsingular, we have (A^{−1} Ã) · (A^{−1} Ã)^T = I, i.e., A^{−1} Ã = U, or Ã = AU, where U is an orthogonal matrix. Furthermore, combining Ã B̃_τ Ã^{−1} = A B_τ A^{−1} with Ã Ã^T = A A^T gives A B_τ A^T = Ã B̃_τ Ã^T = A U B̃_τ U^T A^T, i.e., B̃_τ = U^T B_τ U.

Next, we show how the assumption of non-Gaussian distributions leads to identifiability in the basic case of n = m. A typical example in which distributional information guarantees model identifiability is ICA. In ICA, one has a set of observable signals which are assumed to be linear mixtures of some hidden independent sources, but the mixing procedure is unknown. In the square case (with an equal number of sources and observable signals), if at most one of the sources is Gaussian, then ICA is able to estimate the mixing procedure and recover the sources up to some trivial indeterminacies. Similarly, here the non-Gaussianity of the process noise allows the NG-SSM model with n = m to be fully identifiable, as given by the following theorem.

Theorem 2 Consider the model given by (1∼2) with n = m and given L. Suppose that assumptions A1∼A5 hold. Then the model is identifiable. In particular, suppose that the model with parameters (A, {B_τ}_{τ=1}^{L}, Σ_e) and that with (Ã, {B̃_τ}_{τ=1}^{L}, Σ̃_ẽ) are observationally equivalent; we then have Σ̃_ẽ = Σ_e, Ã = AP, B̃_τ = P^T B_τ P, and ỹ_t = P^T y_t, where P is a signed permutation matrix (a permutation matrix with nonzero entries being 1 or −1). The distribution of ε_{it} can also be uniquely determined up to the sign indeterminacy.³

³ If Var(ε_{it}) can be arbitrary, we have Ã = APΛ and B̃_τ = Λ^{−1} P^T B_τ P Λ, where Λ is a diagonal matrix with positive diagonal entries.

Proof Lemma 1 gives the relationships between the two parameter sets (A, {B_τ}_{τ=1}^{L}) and (Ã, {B̃_τ}_{τ=1}^{L}) which produce the same x_t under assumptions A1∼A4. Based on Lemma 1, we have

    x_t = A [I − \sum_{τ=1}^{L} B_τ z^{−τ}]^{−1} ε_t + e_t = Ã [I − \sum_{τ=1}^{L} B̃_τ z^{−τ}]^{−1} ε̃_t + ẽ_t
    ⟹ A [I − \sum_{τ=1}^{L} B_τ z^{−τ}]^{−1} ε_t + e_t = A U [I − \sum_{τ=1}^{L} U^T B_τ U z^{−τ}]^{−1} ε̃_t + ẽ_t
    ⟹ [I − \sum_{τ=1}^{L} B_τ z^{−τ}]^{−1} ε_t + A^{−1} e_t = [I − \sum_{τ=1}^{L} B_τ z^{−τ}]^{−1} U ε̃_t + A^{−1} ẽ_t
    ⟹ [I − \sum_{τ=1}^{L} B_τ z^{−τ}]^{−1} (ε_t − U ε̃_t) = A^{−1} (ẽ_t − e_t)
    ⟹ ε_t − U ε̃_t = [I − \sum_{τ=1}^{L} B_τ z^{−τ}] A^{−1} (ẽ_t − e_t).                  (7)

Let d_t ≜ A^{−1}(ẽ_t − e_t). The right-hand side (RHS) of (7) is a moving-average (MA) process of d_t: it can be written as −\sum_{j=0}^{L} B_j d_{t−j} with B_0 ≜ −I. Since d_t is temporally white, the autocovariance function of the RHS at lag L is

    R_RHS(L) = E[ ( −\sum_{j=0}^{L} B_j d_{t−j} ) ( −\sum_{i=0}^{L} B_i d_{t+L−i} )^T ] = B_0 E(d_t d_t^T) B_L^T = −E(d_t d_t^T) B_L^T.

On the other hand, the left-hand side of (7) is i.i.d., so its autocovariance at lag L vanishes, and hence E(d_t d_t^T) B_L^T = 0. As B_L is nonsingular, we must have E(d_t d_t^T) = 0, i.e., ẽ_t = e_t and ε_t = U ε̃_t.

We then consider the condition ε_t = U ε̃_t, or ε̃_t = U^T ε_t. Assumption A5 states that at most one of the ε_{it} is Gaussian. From the identifiability of the noiseless ICA model with a square mixing matrix (see Theorem 11 of Comon (1994) or Theorem 10.3.1 of Kagan et al. (1973)), U must be a signed permutation matrix. Furthermore, the distributions of the ε̃_{it} are the same as those of the ε_{it} (up to the permutation and sign indeterminacies).

Finally, we consider the general case and show the identifiability of the NG-SSM model. The identifiability in the case with n ≥ m follows as a consequence of the above theorem and linear algebra.

Theorem 3 Under assumptions A1∼A5, the model given by (1∼2) with n ≥ m and given L is identifiable up to arbitrary permutations and signs of y_{it}.

Proof Let A_i^T denote the ith row of A.
Recall that A is of full column rank. Then for any i, by making use of linear algebra, one can show that there always exist (m − 1) rows of A such that they, together with A_i^T, form a full-rank matrix, denoted by Ā_i. According to Theorem 2, from the observed data corresponding to Ā_i, {B_τ}_{τ=1}^{L} and Ā_i can be uniquely determined up to the permutation and sign indeterminacies. That is, all rows of A are identifiable, and thus A is identifiable. Furthermore, Cov(A y_t) is determined by A and {B_τ}_{τ=1}^{L}. As Cov(x_t) = Cov(A y_t) + Σ_e, Σ_e is also identifiable.

3. Identification

In the Gaussian case, one can efficiently find an estimate of all parameters in the SSM from an infinite number of possible solutions. By considering y_t as missing values, one can write the complete-data likelihood log p({y_t}_{t=1}^{N}, {x_t}_{t=1}^{N}) of the model (1∼2) in closed form. The EM algorithm to maximize the data likelihood can then be derived, allowing partial identification of the model; see, for instance, Ghahramani and Hinton (1996). Alternatively, by rewriting the SSM in a large block-matrix form, one can adopt the subspace state-space system identification (4SID) methods (van Overschee and de Moor, 1991). For a comparison between these two types of methods, see Smith et al. (2000). However, when the process noise ε_{it} is non-Gaussian, one usually has to resort to simulation-based methods, such as particle filtering, to do the inference, and parameter estimation also becomes computationally difficult. For computational efficiency, especially on large-scale problems (e.g., MEG data analysis), we propose an approximate but very efficient method which consists of two steps.

3.1 Step 1: Dimension reduction and noise suppression by recovering the subspace of the latent processes

In the case where n = m, this step is skipped and one directly performs the second step given in Subsection 3.2. Otherwise, in the first step we reduce the noise effect and estimate the subspace of the latent processes y_t. In the proposed NG-SSM model, the subspace spanned by the latent processes y_{it} is temporally colored, while the observation error e_t is assumed to be temporally white. Based on these properties, in this step we extract the subspace of y_t from x_t by finding an m-dimensional colored subspace

    x̌_t = W_c x_t                                       (8)

and making the remaining subspace temporally as white as possible. In this way the effect of the observation error e_t in the extracted subspace x̌_t is greatly reduced. Consequently, x̌_t and y_t approximately lie in the same subspace. In other words, x̌_t is expected to be a linear square transformation of y_t, i.e.,

    x̌_t = Ǎ y_t.                                        (9)

How to find the transformation in (8) is discussed below; how to find Ǎ and estimate y_t from x̌_t is explained in Subsection 3.2.

At first glance, a natural way to find the transformation in (8) is to resort to so-called colored subspace analysis (CSA, Theis (2010)). We adopted the algorithm based on joint low-rank approximation (JLA) for CSA because of its attractive theoretical properties (Theis, 2010). However, in practice we found that when the data are very noisy, its solution is highly sensitive to the initialization, and hence we need to develop an approach which avoids local optima.
A closer inspection of (2), which specifies the generating process of the latent processes y_t, together with Assumption A2, shows that y_t corresponds to the subspace of x_t which can be predicted best from historical values, while the subspace which is mainly caused by the observation error cannot be predicted. In fact, by exploiting the VAR model, the subspace of y_t, or (8), can be estimated efficiently, and the solution is guaranteed to be globally optimal.

First, we whiten the data. Let the eigenvalue decomposition (EVD) of the covariance matrix of x_t be Cov(x_t) = E_1 D_1 E_1^T, where D_1 is a diagonal matrix consisting of the eigenvalues and the columns of E_1 are the corresponding eigenvectors. The whitened data are ẍ_t = D_1^{−1/2} E_1^T x_t. Note that the components of ẍ_t are uncorrelated and have unit variance.

Second, one fits a VAR model with L lags on ẍ_t:

    ẍ_t = \sum_{τ=1}^{L} M_τ ẍ_{t−τ} + ε̃_t,

where the M_τ denote the coefficient matrices and ε̃_t is the residual series. The parameters in the above equation can be estimated efficiently, with the solution given in closed form. Note that the subspace of ẍ_t which can be predicted well coincides with the subspace of x̌_t given in (8), and that it corresponds to the subspace of ε̃_t which has small variance. Let the EVD of the covariance matrix of ε̃_t be Cov(ε̃_t) = E_2 D_2 E_2^T, and let P be the matrix consisting of the eigenvectors corresponding to the m smallest eigenvalues.

Third, one can see that P^T ε̃_t gives the m-dimensional minor subspace of the error ε̃_t, and consequently P^T ẍ_t corresponds to the subspace of x_t that can be predicted best. That is, the transformation in (8) is determined by

    W_c = P^T D_1^{−1/2} E_1^T.

In our experiments, we use this solution for CSA; a sketch of the procedure is given below. In our simulation studies we found that this scheme always leads to good performance, i.e., the learned x̌_t provides a good estimate of the subspace of the latent processes.
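The three stages above (whitening, a closed-form least-squares VAR fit, and extraction of the residual minor subspace) translate directly into code. Below is a minimal sketch; the function name and the (N, n) data-matrix interface are our own conventions, not the paper's.

```python
import numpy as np

def colored_subspace(X, m, L):
    """Estimate W_c such that W_c x_t spans the best-predictable (colored)
    m-dimensional subspace of x_t; X has shape (N, n)."""
    X = X - X.mean(axis=0)
    # 1) whiten: \ddot{x}_t = D_1^{-1/2} E_1^T x_t
    d1, E1 = np.linalg.eigh(np.cov(X.T))
    W_white = (E1 / np.sqrt(d1)).T               # D_1^{-1/2} E_1^T
    Xw = X @ W_white.T
    # 2) closed-form VAR(L) fit on the whitened data
    Y = Xw[L:]                                   # targets \ddot{x}_t
    Z = np.hstack([Xw[L - tau: -tau]             # regressors \ddot{x}_{t-tau}
                   for tau in range(1, L + 1)])
    M, *_ = np.linalg.lstsq(Z, Y, rcond=None)    # least-squares VAR coefficients
    resid = Y - Z @ M                            # residual series
    # 3) minor subspace of the residuals = best-predictable directions
    d2, E2 = np.linalg.eigh(np.cov(resid.T))     # eigenvalues in ascending order
    P = E2[:, :m]                                # m smallest residual eigenvalues
    return P.T @ W_white                         # W_c = P^T D_1^{-1/2} E_1^T
```

The extracted subspace is then x̌_t = W_c x_t, e.g., `X_check = X @ colored_subspace(X, m, L).T`.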
3.2 Step 2: Multichannel blind deconvolution and post-processing

In the noiseless case, Haufe et al. (2010) discussed the relationship between the NG-SSM model and multichannel blind deconvolution (MBD, Cichocki and Amari (2003)), also known as the convolutive ICA problem. Here, after the dimension reduction in the first step, we also use MBD for parameter estimation. From (2) and (9) one can see that

    ε_t = [I − \sum_{τ=1}^{L} B_τ z^{−τ}] y_t = Ǎ^{−1} x̌_t − \sum_{τ=1}^{L} B_τ Ǎ^{−1} x̌_{t−τ} = \sum_{k=0}^{L} W_k x̌_{t−k},      (10)

where W_0 ≜ Ǎ^{−1} and W_τ ≜ −B_τ Ǎ^{−1} for τ > 0. Recall that ε_t is assumed to be both spatially and temporally independent. Consequently, the W_k in (10) can be estimated by the MBD technique, which aims to make the output sequences spatially and temporally independent. Under the assumption that at most one of the ε_{it} is Gaussian, MBD can estimate the W_k uniquely up to only the scaling, permutation, and time-shift indeterminacies. Here the permutation indeterminacy is trivial, and to tackle the scaling indeterminacy we fix the variance of ε̂_{it} to one. With proper initialization (say, with large values for W_0), we can avoid the time-shift indeterminacy. In our experiments we used the natural gradient-based algorithm (Cichocki and Amari, 2003) for MBD to determine the W_k.

Now suppose that we have obtained the estimates Ŵ_k. According to (10), the estimates of Ǎ and B_τ can be constructed as

    \hat{Ǎ} = Ŵ_0^{−1},  and  B̂_τ = −Ŵ_τ \hat{Ǎ} = −Ŵ_τ Ŵ_0^{−1}.

Recalling that in (8) we used W_c to extract the colored subspace, i.e., x̌_t = W_c x_t = Ǎ y_t, one can then construct the estimate of A as

    Â = W_c^† \hat{Ǎ} = W_c^† Ŵ_0^{−1},

where † denotes the pseudo-inverse. The proposed two-step method is denoted by CSA+MBD; a sketch of this post-processing is given below.
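Given MBD filter estimates Ŵ_0, ..., Ŵ_L (e.g., from a natural gradient MBD routine, which we do not reproduce here) and the matrix W_c from Step 1, the post-processing amounts to a few matrix operations. A sketch under those assumptions, with our own function name:

```python
import numpy as np

def reconstruct_parameters(W_hat, W_c):
    """Recover (A_hat, [B_1_hat, ..., B_L_hat]) from MBD filter estimates
    W_hat = [W_0, ..., W_L] and the Step-1 extraction matrix W_c."""
    W0_inv = np.linalg.inv(W_hat[0])             # estimate of \check{A}
    B_hat = [-Wk @ W0_inv for Wk in W_hat[1:]]   # B_tau = -W_tau W_0^{-1}
    A_hat = np.linalg.pinv(W_c) @ W0_inv         # A = W_c^† W_0^{-1}
    return A_hat, B_hat
```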
3.3 Significance assessment

We use a bootstrapping-based method to assess the significance of each influence from y_{i,t−τ} (τ > 0) to y_{jt}, or of the total effect from {y_{i,t−τ}}_{τ=1}^{L} to y_{jt}, denoted by S_{j←i}, following the procedure proposed in Sec. 6 of Hyvärinen et al. (2010).

4. Simulations

We used simulations to illustrate the performance of the proposed method CSA+MBD for parameter estimation of the NG-SSM model. The observed data x_t were generated according to (1∼2) with n = 10, m = 4, and sample size N = 1000. ε_{1t} and ε_{2t} are i.i.d. super-Gaussian: they were generated by passing Gaussian samples through the power nonlinearity with exponent 1.5 while keeping the original signs. ε_{3t} and ε_{4t} are uniformly distributed (and thus sub-Gaussian). The lag number of the VAR process (2) was set to L = 2. We used a heuristic way to enforce the stability of the process, namely using small values for the entries of B_τ: the entries of B_1 were uniformly distributed between −0.5 and 0.5, and those of B_2 were drawn from the uniform distribution between −0.25 and 0.25. Entries of the mixing matrix A were drawn uniformly from [−1.5, 1.5]. The covariance matrix of e_t was constructed as the product of an n × n random matrix and its transpose. We use the average signal-to-noise ratio (SNR) in x_t, defined as 10 log_{10} [ \sum_{i=1}^{n} Var(x_{it} − e_{it}) / \sum_{i=1}^{n} Var(e_{it}) ], to measure the noise level in the observed data.

Principal component analysis (PCA) is a widely used approach for dimension reduction which finds the components with maximal variance. One can use PCA instead of CSA in Step 1 of the proposed method, leading to the method PCA+MBD, which was used in Haufe et al. (2010). We compare our method CSA+MBD with PCA+MBD. We considered two noise levels, with the average SNRs in x_t being 15 dB and 3 dB, respectively. Each case was repeated for 50 random replications. Since good estimates of A often result in good estimates of the influence matrices B_τ, for conciseness of the presentation we only show how well the B_τ were recovered. Figure 1 gives the scatterplots of the estimated entries of B_τ against the true ones. Note that we already permuted the ŷ_{it} and adjusted their signs such that they and the true latent processes y_{it} have the same order and positive correlations (the scaling indeterminacy is avoided by enforcing Assumption A4). In addition, we report the SNRs of B̂_τ and ŷ_{it} w.r.t. the true ones in Table 1.

Figure 1: Scatterplots of the entries of B̂_τ, τ = 1, 2, vs. the true ones at different levels of e_t and with different methods. The average SNRs in x_t are 15 dB (top) and 3 dB (bottom); the methods are PCA+MBD (left) and CSA+MBD (right).

For completeness of the comparison, we also show the results obtained by ICA. Note that although the latent factors y_{it} are not mutually independent, one can still apply ICA to find the components which are mutually as independent as possible; one can then fit the VAR model (2) on the estimated "independent" components to estimate B_τ. We adopted the symmetric FastICA algorithm (Hyvärinen and Oja, 1997) with the tanh nonlinearity to perform ICA.

Table 1: SNRs of the estimated B_τ and the recovered latent processes y_{it} at different observation error levels and with different methods. Numbers in parentheses indicate standard deviations.

    SNR(x_t)  Method    SNR(B̂_1)  SNR(B̂_2)  SNR(ŷ_1t)    SNR(ŷ_2t)    SNR(ŷ_3t)    SNR(ŷ_4t)
    15 dB     PCA+MBD   25.3       18.6       27.6 (11.6)  26.1 (10.6)  25.3 (9.4)   27.1 (10.5)
              FastICA   6.1        4.3        12.6 (4.4)   12.6 (4.5)   8.8 (4.3)    10.2 (5.5)
              CSA+MBD   41.1       31.5       34.4 (7.9)   33.1 (9.1)   34.9 (10.4)  32.7 (7.6)
    3 dB      PCA+MBD   5.4        2.2        13.2 (10.7)  12.6 (5.8)   10.6 (7.4)   8.2 (8.3)
              FastICA   5.4        2.8        10.4 (4.7)   10.2 (4.0)   7.6 (3.9)    7.2 (4.1)
              CSA+MBD   25.5       17.7       20.4 (5.9)   20.3 (4.5)   20.4 (4.9)   20.0 (4.4)

From Table 1 one can see that in both the low-noise and high-noise situations, CSA+MBD successfully estimated B_τ and the latent processes y_t with very high SNRs. Its performance is clearly better than that of PCA+MBD. This illustrates the validity of the proposed two-step method CSA+MBD, and also shows that properly initialized CSA performs well for dimension reduction and noise suppression (at least much better than PCA) in the estimation of the proposed NG-SSM. Not surprisingly, since the latent processes are not mutually independent, ICA gives the poorest performance. Even in the lower-noise situation (SNR = 15 dB), the B_τ estimated via ICA are very noisy (as seen from the low SNRs).

5. Real-world Applications

5.1 Financial data

We exploited the NG-SSM model to investigate underlying structures of nine major world stock indices: the Dow Jones Industrial Average (DJI) and the Nasdaq Composite (NAS) in the USA, the FTSE 100 (FTSE) in England, the DAX in Germany, the CAC 40 in France, the Nikkei 225 (N225) in Japan, the Hang Seng Index (HSI) in Hong Kong, the Taiwan Weighted Index (TWI), and the Shanghai Stock Exchange Composite Index (SSEC) in China. We used the daily dividend/split-adjusted closing prices from Dec. 4, 2001 to Jul. 11, 2006. For the days when a price was not available, we used linear interpolation to estimate it. Denoting the closing price of the ith index on day t by P_{it}, the corresponding return was calculated as x_{it} = (P_{it} − P_{i,t−1})/P_{i,t−1}. The data for analysis are the 9-dimensional parallel return series with 1200 samples.

We set the dimensionality of the latent processes y_t to m = 9 and the number of lags to L = 1. After estimating all parameters, we selected four estimated processes ŷ_{it} of interest for further analysis. The corresponding columns of the estimated mixing matrix Â are shown in Figure 2, with the large entries printed in the figure. These latent processes are interesting to us for two reasons. First, they have large contributions to x_t, as measured by the norms of the corresponding columns of Â; moreover, each of them contributes significantly to at least three indices, while the others mainly contribute to one or two, so they seem to be "common" factors of the indices. Second, compared to the other estimated processes, they have stronger influences on each other, as measured by the corresponding entries of B̂_1.
From Figure 2 one can see that ŷ_{3t} and ŷ_{4t} are closely related to the indices in America as well as those in Europe, while ŷ_{1t} and ŷ_{2t} contribute mainly to those in Europe and those in Asia, respectively. Figure 3 shows the entries of B̂_1 corresponding to these processes. One can see that the positive influences are mainly ŷ_{4,t−1} → ŷ_{1t}, ŷ_{3,t−1} → ŷ_{2t}, and ŷ_{1,t−1} → ŷ_{2t}. These relationships seem meaningful and natural; in fact, they are consistent with the direction of the information flow caused by the time differences between America, Europe, and Asia. This also suggests that the stock markets in Asia (except that in China, as seen from the fact that SSEC is not strongly related to any of the four factors shown in Figure 2) are significantly influenced by those in America and Europe. However, this type of information would be ignored if one used ICA to analyze the relationships, due to its assumption of independence between the latent factors.

Figure 2: Part of Â: the columns corresponding to the four selected factors (rows: DJI, NAS, FTSE, DAX, CAC, N225, HSI, TWI, SSEC; columns: ŷ_1, ..., ŷ_4). The ŷ_{it} are sorted in descending order of their contributed variances to all x_{it}. The signs of the ŷ_{it} have been adjusted such that the sum of each column of Â is positive.

Figure 3: The entries of the influence matrix B̂_1 corresponding to the four factors. Refer to Figure 2 for the relationships between the ŷ_{it} and the stock indices. Solid (dashed) lines show positive (negative) influences.

5.2 MEG data

Finally, we applied NG-SSM to MEG data. The raw recordings consisted of 204 gradiometer channels, were obtained from a healthy volunteer, and lasted about 8.2 minutes. The data were down-sampled from 600 Hz to 75 Hz.⁴ As preprocessing, we used a bandpass filter to filter out the very low frequency part (below 4 Hz) and amplify the signals in the frequency band 4–26 Hz. This band is believed to be informative, as alpha, beta, and theta waves mainly lie in this frequency range.

⁴ We are very grateful to Pavan Ramkumar and Riitta Hari of Aalto University, Finland, for access to the MEG data.

We used the proposed two-step method, CSA+MBD, to extract the sources y_{it}. We empirically set the number of sources to m = 25 and the number of lags to L = 15 (corresponding to 0.2 seconds); correspondingly, in the first step the data dimensionality was reduced to 25. MBD was then applied on the extracted subspace, and finally 25 sources were obtained, together with {B_τ}_{τ=1}^{L}, which imply the influences between them. We found that the total effect S_{j←i} is significant at the 1% level for at most 50% of the pairs {y_{i,t−τ}} → y_{jt} with i ≠ j. Therefore, for better readability, we only show the 11 sources which have the strongest effects (indicated by the smallest p-values) relative to each other. The topographic distributions of these sources, together with their influences, are given in Figure 4; note that the thicker the line, the stronger the effect. One can see that in many cases sources that have significant influences between them tend to be located near each other. Source #10 is of particular interest: it influences many other sources, but is hardly affected by the others itself. For comparison, Figure 5 shows the sources produced by FastICA (Hyvärinen and Oja, 1997) that are most strongly correlated with the sources given in Figure 4.
One can see that the sources found by NG-SSM usually have sharper or better-defined locations than those found by FastICA, especially for sources #4, #5, #7, #11, and #10.

Figure 4: Some of the estimated sources for the MEG signals and their relationships. The thicker the line, the stronger the influence.

Figure 5: The sources estimated by FastICA which are most correlated with the sources in Figure 4.

6. Discussion

We have proposed a general linear state-space model with a non-Gaussianity assumption on the process noise. The model takes into account the time-delayed interactions of the latent processes as well as the observation error (or measurement noise) in the observed data. Thanks to the non-Gaussianity assumption, we proved that this model, although very flexible, is in general identifiable, which enables simultaneously recovering the latent processes and estimating their delayed interactions (or, roughly speaking, delayed causal relations). A computationally efficient method consisting of two separate steps has been given for system identification. Simulation results suggest good performance of the proposed method, and experimental results on real data illustrate the usefulness of the model.

References

K. S. Arun and S. Y. Kung. Balanced approximation of stochastic systems. SIAM Journal on Matrix Analysis and Applications, 11:42–68, 1990.

H. Attias. Independent factor analysis. Neural Computation, 11(4):803–851, 1999.

A. Belouchrani, K. Abed-Meraim, J. Cardoso, and E. Moulines. A blind source separation technique using second-order statistics. IEEE Transactions on Signal Processing, 45(2):434–444, 1997.

A. Cichocki and S. Amari. Adaptive Blind Signal and Image Processing: Learning Algorithms and Applications. John Wiley & Sons, UK, corrected and revised edition, 2003.

P. Comon. Independent component analysis – a new concept? Signal Processing, 36:287–314, 1994.

M. Davies. Identifiability issues in noisy ICA. IEEE Signal Processing Letters, 11:470–473, 2004.

M. Ding, Y. Chen, and S. L. Bressler. Granger causality: Basic theory and application to neuroscience. In B. Schelter, M. Winterhalder, and J. Timmer, editors, Handbook of Time Series Analysis, pages 437–460. Wiley, Weinheim, 1996.

Z. Ghahramani and G. E. Hinton. Parameter estimation for linear dynamical systems. Technical report, Department of Computer Science, University of Toronto, Toronto, Canada, 1996.

G. Gómez-Herrero, M. Atienza, K. Egiazarian, and J. L. Cantero. Measuring directional coupling between EEG sources. NeuroImage, 43:497–508, 2008.

C. Granger. Some recent developments in a concept of causality. Journal of Econometrics, 39:199–211, 1988.

S. Haufe, R. Tomioka, G. Nolte, K. R. Müller, and M. Kawanabe. Modeling sparse connectivity between underlying brain sources for EEG/MEG. IEEE Transactions on Biomedical Engineering, 57(8):1954–1963, 2010.

A. Hyvärinen and E. Oja. A fast fixed-point algorithm for independent component analysis. Neural Computation, 9(7):1483–1492, 1997.

A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. John Wiley & Sons, Inc., 2001.

A. Hyvärinen, K. Zhang, S. Shimizu, and P. O. Hoyer. Estimation of a structural vector autoregression model using non-Gaussianity. Journal of Machine Learning Research, 11:1709–1731, 2010.

A. M. Kagan, Y. V. Linnik, and C. R. Rao.
Characterization Problems in Mathematical Statistics. Wiley, New York, 1973.

R. E. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82:35–45, 1960.

N. Murata, S. Ikeda, and A. Ziehe. An approach to blind source separation based on temporal structure of speech signals. Neurocomputing, 41:1–24, 2001.

G. A. Smith, A. J. Robinson, and M. Niranjan. A comparison between the EM and subspace algorithms for the time-invariant linear dynamical system. Technical Report CUED/F-INFENG/TR.366, Engineering Dept., Cambridge Univ., UK, 2000.

F. J. Theis. Colored subspace analysis: Dimension reduction based on a signal's autocorrelation structure. IEEE Transactions on Circuits and Systems I: Regular Papers, 57:1463–1474, 2010.

P. van Overschee and B. de Moor. Subspace algorithms for the stochastic identification problem. In Proceedings of the 30th IEEE Conference on Decision and Control, pages 1321–1326, Brighton, UK, 1991.

P. van Overschee and B. de Moor. Subspace Identification for Linear Systems: Theory, Implementation, Applications. Kluwer Academic Publishers, Dordrecht, Netherlands, 1996.

L. Xu. Temporal factor analysis (TFA): Stable-identifiable family, orthogonal flow learning, and automated model selection. In Proceedings of the 2002 International Joint Conference on Neural Networks, pages 472–476, Honolulu, HI, USA, 2002.