Parametric and non-parametric criteria for causal inference from time-series

Daniel Chicharro

Center for Neuroscience and Cognitive Systems@UniTn, Istituto Italiano di Tecnologia, Via Bettini 31, 38068 Rovereto (TN), e-mail: [email protected]
Abstract Granger causality constitutes a criterion for causal inference from time series that has been widely applied to study causal interactions in the brain from electrophysiological recordings. This criterion underlies the classical parametric implementation in terms of linear autoregressive processes as well as transfer entropy, a non-parametric implementation in the framework of information theory. In the spectral domain, partial directed coherence and the Geweke formulation are related to Granger causality but rely on alternative criteria for causal inference which are inherently based on the parametric formulation in terms of autoregressive processes. Here we clearly differentiate between criteria for causal inference and the measures used to test them. We compare the different criteria for causal inference from time-series and we further introduce new criteria that complete a unified picture of how the different approaches are related. Furthermore, we compare the different measures that implement these criteria in the information theory framework.
1 Introduction
The inference of causality in a system of interacting processes from recorded time-series is a subject of interest in many fields. Particularly successful has been the concept of Granger causality [29, 31], originally applied to economic time-series. In recent years, measures of causal inference have also been widely applied to electrophysiological signals, in particular to characterize causal interactions between different brain areas (see [46, 28, 10] for a review of Granger causality measures applied to neural data).
In the original formulation of Granger causality, causality from a process Y to a process X was examined based on the reduction of the prediction error of X when including the past of Y [60, 29]. However, this prediction error criterion generalizes to a criterion of conditional independence on probability distributions [31] that is generally applicable to stationary and non-stationary stochastic processes.
Here we consider the criterion of Granger causality together with related criteria of causal inference, like Sims causality [55]. We also consider the criteria underlying other measures that have been introduced to infer causality but for which the underlying criterion has not been made explicit. This includes the Geweke spectral measures of causality (GSC) [25, 26] and partial directed coherence (PDC) [5]. We make a clear distinction between criteria for causal inference and measures implementing them. Accordingly, by Granger causality we refer to the general criterion of causal inference and not, as is often the case, to the measure implementing it for linear processes. This means that we consider transfer entropy [54] as a particular measure to test for Granger causality in the information-theoretic framework (e.g. [56, 1]).
This distinction between criteria and measures is important because in practice one is usually not only interested in assessing the existence of a causal connection but also in evaluating its strength (e.g. [11, 9, 8, 52, 59]). Causal inference can be associated with the construction of a causal graph representing which connections exist in the system [19]. However, quantifying the causal effects resulting from these connections is a more difficult task. Recently, [16] examined how the general notion of causality developed by Pearl [45] can be applied to study the natural dynamics of complex systems. This notion is based on the idea of externally manipulating the system to evaluate causal effects. For example, if one is studying causal connectivity in the brain, this manipulation could be the deactivation of some connections between brain areas, or electrically stimulating a given area. It is clear that these manipulations alter the normal dynamics of the brain, the very dynamics one wants to analyze in order to understand neural computations. Accordingly, [16] pointed out that if the main interest is not the effect of external perturbations, but how the causal connections participate in the generation of the unperturbed dynamics of the system, then it is only in some cases meaningful to characterize interactions between different subsystems in terms of the effect of one subsystem over another. To identify these cases the notion of natural causal effects between dynamics was introduced and conditions for their existence were provided. Consequently, Granger causality measures, and in particular transfer entropy, cannot be used in general as measures of the strength of causal effects [4, 39]. Alternatively, a different approach was developed in [15]: instead of examining the causal effects resulting from the causal connections, a unifying multivariate framework was proposed to study the dynamic dependencies between the subsystems that arise from the causal interactions.
Considering this, we here focus on the criteria for causal inference, and the measures are only used as statistics to test these criteria. We closely follow [14] in relating the different formulations of Granger causality and the corresponding criteria of causal inference, and in integrating parametric and non-parametric formulations, as well as time-domain and spectral formulations, for both bivariate and multivariate systems. Furthermore, we do not discuss the fundamental assumptions that determine the valid applicability of the criterion of Granger causality. In particular we assume that all the relevant processes are observed and well-defined. This is of course a big idealization for real applications, but our purpose is to examine the relation between the different criteria and measures that appear in the different formulations of Granger causality. (For a detailed discussion of the limitations of these criteria see [58, 16].) More generally, [45] offers a complete explanation of the limitations of causal inference without intervening on the system.
This Chapter is organized as follows: In Section 2 we review the non-parametric formulation of the criteria of Granger and Sims causality and the information-theoretic measures, including transfer entropy, used to test them. In Section 3 we review the parametric autoregressive representation of the processes and the time-domain and spectral measures of Granger causality, in particular GSC and PDC. We make explicit the parametric criteria of causal inference underlying these measures and discuss their relation to the non-parametric criteria. Furthermore, we introduce related new criteria for causal inference that allow us to complete a consistent unifying picture that integrates all the criteria and measures. This picture is presented in full in Section 4.
2 Non-parametric approach to causal inference from time-series
We here review Granger causality and Sims causality as non-parametric criteria to infer causality from time-series, as well as some measures used to test them. Although both the criteria of Granger causality [29, 30] and Sims causality [55] were originally introduced in combination with a linear formulation, we here consider their general non-parametric expression [31, 12].
2.1 Non-parametric criteria for causal inference
A general criterion for causal inference from time-series, based on the comparison of two probability distributions, was stated in [31]. We consider first its bivariate formulation. Assume that for the processes X and Y we record two time-series $\{X\} = \{X_1, X_2, ..., X_N\}$ and $\{Y\} = \{Y_1, Y_2, ..., Y_N\}$. Granger causality states that there is no causality from Y to X if the equality

$$p(X_{t+1}|X^t) = p(X_{t+1}|X^t, Y^t) \quad \forall X^t, Y^t \qquad (1)$$

holds. Here $X^t = \{X_t, X_{t-1}, ..., X_1\}$ is the past of the process at time t. From now on we will assume stationarity, so that the results do not depend on the particular time. Therefore we consider $N \to \infty$ and select t such that $X^t$ accounts for the infinite past of the process. See [56, 15] for a non-stationary formulation. According to Eq. 1, Granger causality indicates that there is no causality from Y to X when the future $X_{t+1}$ is conditionally independent of the past $Y^t$ given its own past $X^t$. That is, the past of Y has no dependence with the future of X that cannot be accounted for by the past of X.
As an alternative criterion, Sims causality [55] examines the equality

$$p(X^{t+1:N}|X^t, Y^t) = p(X^{t+1:N}|X^t, Y^t, Y_{t+1}) \quad \forall X^t, Y^t, Y_{t+1}. \qquad (2)$$

It states that there is no causality from Y to X if the whole future $X^{t+1:N}$ is conditionally independent of $Y_{t+1}$ given the past of the two processes. In fact, assuming stationarity it is not necessary to condition on $Y^t$, so that, like Granger causality, the criterion indicates that the future of X is completely determined by its own past (see [37] for a detailed review of the relation between the two criteria).
While Granger causality and Sims causality are equivalent criteria for the bivariate case [12], this is not true for multivariate processes. When other processes also interact with X and Y it is necessary to distinguish a causal connection from Y to X from other connections that also result in statistical dependencies incompatible with the equality in Eq. 1. These other connections are indirect causal connections $Y \to Z \to X$ as well as the effect of common drivers, i.e. a common parent Z such that $Z \to Y$ and $Z \to X$. The formulation of Granger causality turns out to be easily generalizable to account for these influences, resulting in the equality

$$p(X_{t+1}|X^t, Z^t) = p(X_{t+1}|X^t, Y^t, Z^t) \quad \forall X^t, Y^t, Z^t, \qquad (3)$$

where $Z^t$ refers to the past of any other process that interacts with X and Y. In fact, which processes one needs to condition on depends on the particular causal structure of the system, which is exactly what one wants to infer. This renders the criterion of Granger causality context dependent [31]. This means that if Z does not include all the relevant processes a false positive can be obtained when testing for causality from Eq. 3. The problem of hidden variables for causal inference is an issue not specific to time-series that in general can only be addressed by an interventional treatment of causality [45]. In practice, from observational data, some procedures can help to optimize the selection of the variables on which to condition [22, 41]. In this Chapter we do not further deal with this problem and we assume that all the relevant processes are observed.
In contrast to Granger causality, Sims causality cannot be generalized to the multivariate case as a criterion for causal inference. The reason is that, since in Eq. 2 the whole future $X^{t+1:N}$ is considered jointly, there is no way to disentangle direct from indirect causal connections from Y to X. This means that for multivariate processes the criterion of Granger causality in Eq. 3 remains the unique non-parametric criterion for causal inference between the time series.
2.2 Measures to test for causality
In this Chapter we want to clearly differentiate between the criteria for causal inference and the particular measures used to test for causality according to these criteria. This is why by Granger causality we refer to the general criterion proposed in [31] (Eqs. 1 and 3), so that Granger causality measures include both the transfer entropy and the linear Granger causality measure. The linear measure, which quantifies the predictability improvement [60], implements for linear processes a test on the equality of the mean of the distributions appearing in Eq. 1. More generally, if one wants to test for the equality between two probability distributions without examining specific moments of a given order, the Kullback-Leibler divergence (KL-divergence) [38]

$$KL(p^*(x), p(x)) = \sum_x p^*(x) \log \frac{p^*(x)}{p(x)} \qquad (4)$$
is a non-negative measure that is zero if and only if the two distributions are identical. For a multivariate variable X, since it quantifies the divergence of the distribution p(x) from $p^*(x)$, one can construct p(x) to reflect a specific null-hypothesis about the dependence between the components of X. As a particular application of the KL-divergence to quantify the interdependence between random variables one has the conditional mutual information

$$I(X; Y|Z) = \sum_{x,y,z} p(x,y,z) \log \frac{p(x|y,z)}{p(x|z)}. \qquad (5)$$
We can see that the form of the probability distributions in the argument of the logarithm is the same as in Eqs. 1-3. Accordingly, testing the equality of Eq. 1 is equivalent to having a zero transfer entropy [54, 44]

$$T_{Y \to X} = I(X_{t+1}; Y^t|X^t) = 0. \qquad (6)$$
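As an illustration of how Eq. 6 can be tested in practice, the following minimal Python sketch computes a plug-in estimate of the transfer entropy. It is our own illustrative construction, not part of the original formulation: the conditioning past is truncated to a single lag, the data are discretized into equal-width bins, and the function name `transfer_entropy_plugin` is hypothetical; more careful estimators are discussed in [34].

```python
import numpy as np
from collections import Counter

def transfer_entropy_plugin(x, y, bins=4):
    """Plug-in estimate of T_{Y->X} = I(X_{t+1}; Y^t | X^t) (Eq. 6),
    truncating the past to one lag and using equal-width bins."""
    # Discretize both series into integer symbols
    xs = np.digitize(x, np.histogram_bin_edges(x, bins)[1:-1])
    ys = np.digitize(y, np.histogram_bin_edges(y, bins)[1:-1])
    triples = list(zip(xs[1:], xs[:-1], ys[:-1]))  # (X_{t+1}, X_t, Y_t)
    n = len(triples)
    n_abc = Counter(triples)                       # counts of (X_{t+1}, X_t, Y_t)
    n_ab = Counter((a, b) for a, b, _ in triples)  # counts of (X_{t+1}, X_t)
    n_bc = Counter((b, c) for _, b, c in triples)  # counts of (X_t, Y_t)
    n_b = Counter(b for _, b, _ in triples)        # counts of X_t
    # I(A; C | B) = sum_{a,b,c} p(a,b,c) log[ n(a,b,c) n(b) / (n(a,b) n(b,c)) ]
    return sum((k / n) * np.log(k * n_b[b] / (n_ab[(a, b)] * n_bc[(b, c)]))
               for (a, b, c), k in n_abc.items())
```

With a single lag this is a statistic for a Markovian approximation of Eq. 6, in the spirit of the original proposal of [54]; in practice significance would typically be assessed against surrogate data.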
An analogous information-theoretic measure of Sims causality is obtained, so that Eq. 2 leads to

$$S_{Y \to X} = I(Y_{t+1}; X^{t+1:N}|Y^t, X^t) = 0. \qquad (7)$$
For multivariate processes Eq. 3 leads to a zero conditional transfer entropy

$$T_{Y \to X|Z} = I(X_{t+1}; Y^t|X^t, Z^t) = 0. \qquad (8)$$
[54] introduced the transfer entropy to test the equality of Eq. 1, further assuming that the processes were Markovian with a finite order. A similar information-theoretic quantity, the directed information, has been introduced in the context of communication theory [42, 43, 36]. The directed information was originally formulated for the non-stationary case and naturally appears in a causal decomposition of the mutual information (e.g. [1]). Such a decomposition can also be expressed in terms of transfer entropies, and is valid both for a non-stationary formulation of the measures which is local in time and for another that is cumulative over the whole time series [15]. These two formulations converge for the stationary case, resulting in

$$I(X^N; Y^N) = T_{Y \to X} + T_{X \to Y} + T_{X \cdot Y}, \qquad (9)$$
where $T_{X \cdot Y}$ is a measure of instantaneous causality. From this relation it can be checked that, for both the cumulative non-stationary formulation and the stationary one, if there is no instantaneous causality

$$T_{Y \to X} = H(X_{i+1}|X^i) - H(X_{i+1}|X^i, Y^i) = H(Y_{i+1}|X^i, Y^i) - H(Y_{i+1}|X^N, Y^i) = S_{Y \to X}. \qquad (10)$$

This equality, restricted to the stationary linear case, is indicated already in Theorem 1(ii) of [25], where no instantaneous causality is enforced by a normalization of the covariance matrix.
Notice that here we consider the measures as particular instantiations of the KL-divergence used as a statistic for hypothesis testing [38]. This is important to keep in mind because the KL-divergence can also be interpreted in terms of code length [17], and in particular the transfer entropy (directed information) determines the error-free transmission rate when applied to specific communication channels with feedback [36] (see also [47] for a discussion of different applications of transfer entropy). Furthermore, any conditional mutual information can be evaluated as a difference of two conditional entropies, and interpreted as a reduction of uncertainty. To test for causality only the significance of nonzero values is of interest, but it is common to use the values of $T_{Y \to X}$ to characterize the causal dependencies. Alternatively, the value of $S_{Y \to X}$ could be used, giving a not necessarily equivalent characterization if the conditions of Eq. 10 are not fulfilled or depending on the particular estimation procedure.
More generally, the KL-divergence is not the only option to test the criteria of causality above in a non-parametric way. Other measures have been proposed based on the same criterion (e.g. [33, 2]) that are sensitive to higher-order moments of the distributions. A natural alternative that also considers all the moments of the distributions is to use the Fisher information

$$F(Y; x) = \int dY\, p(Y|x) \left( \frac{\partial \ln p(Y|x)}{\partial x} \right)^2 \qquad (11)$$

which, by means of the Cramér-Rao bound [17], is related to the accuracy of an unbiased estimator of x from Y. For the particular equality of Eq. 1 this leads to testing

$$E_{y^t}[F(X_{t+1}; y^t|X^t)] = 0. \qquad (12)$$
In the Appendix we examine in detail this expression for linear Gaussian
autoregressive processes.
3 Parametric approach to causal inference from time-series
The criteria of Section 2.1 do not assume any particular form of the processes. By contrast, in the implementation originally introduced by [29], the processes are assumed to have a linear autoregressive representation. Here by parametric we refer specifically to the assumption of this representation. Notice that this is different from a parametric approach in which not the processes but the probability distributions are estimated parametrically, for example using generalized linear models [49].
We first review the autoregressive representation of stationary stochastic processes for bivariate and multivariate systems, describing the projections used in the different linear formulations of Granger causality. We then review these formulations, in particular the Geweke formulation in the temporal and spectral domain [25, 26] and partial directed coherence [5, 53]. Apart from stationarity, we will assume that there is no instantaneous causality, i.e. that the covariance matrices of the innovation terms in the autoregressive representation are diagonal. This substantially simplifies the formulation, avoiding a normalization step [25, 18]. Furthermore, strictly speaking, the existence of instantaneous causality is a signature of temporal or spatial aggregation, or of the existence of hidden variables, which questions the validity of the causal inference [30].
3.1 The autoregressive process representation
Consider the system formed by the stationary stochastic processes X and Y. Two projections are required to construct the bivariate linear measure of Granger causality from Y to X. First, the projection of $X_{t+1}$ on its own past:

$$X_{t+1} = \sum_{s=0}^{\infty} a^{(x)}_{x,s} X_{t-s} + \epsilon^{(x)}_{x,t+1}, \quad \mathrm{var}(\epsilon^{(x)}_x) = \Sigma^{(x)}_x, \qquad (13)$$
second, its projection on the past of both X and Y:

$$X_{t+1} = \sum_{s=0}^{\infty} a^{(xy)}_{xx,s} X_{t-s} + a^{(xy)}_{xy,s} Y_{t-s} + \epsilon^{(xy)}_{x,t+1}$$
$$Y_{t+1} = \sum_{s=0}^{\infty} a^{(xy)}_{yx,s} X_{t-s} + a^{(xy)}_{yy,s} Y_{t-s} + \epsilon^{(xy)}_{y,t+1} \qquad (14)$$

$$\Sigma^{(xy)} = \begin{pmatrix} \Sigma^{(xy)}_{xx} & \Sigma^{(xy)}_{xy} \\ \Sigma^{(xy)}_{yx} & \Sigma^{(xy)}_{yy} \end{pmatrix} \qquad (15)$$

where $\Sigma^{(xy)}_{xx} = \mathrm{var}(\epsilon^{(xy)}_x)$, $\Sigma^{(xy)}_{yy} = \mathrm{var}(\epsilon^{(xy)}_y)$, $\Sigma^{(xy)}_{xy} = \mathrm{cov}(\epsilon^{(xy)}_x, \epsilon^{(xy)}_y)$, and $\Sigma^{(xy)}_{yx} = (\Sigma^{(xy)}_{xy})^T$. Notice that while the subscripts are used to refer to the corresponding variable or to components of a matrix, the superscripts refer to the particular projection. As we said above, we assume that $\Sigma^{(xy)}$ is diagonal.
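To make the projections above concrete, here is a minimal Python sketch that simulates a stationary bivariate autoregressive process with a single causal connection Y → X and a diagonal innovations covariance (i.e. no instantaneous causality). The function name `simulate_var` and the coefficient values are illustrative choices of our own, not taken from the chapter; the later numerical sketches assume this kind of input.

```python
import numpy as np

def simulate_var(n=20000, seed=0):
    """Simulate a stationary bivariate VAR(1) with a causal link Y -> X only.
    The innovations covariance is diagonal (no instantaneous causality)."""
    rng = np.random.default_rng(seed)
    # X_{t+1} = 0.5 X_t + 0.4 Y_t + eps_x;  Y_{t+1} = 0.7 Y_t + eps_y
    A1 = np.array([[0.5, 0.4],
                   [0.0, 0.7]])          # eigenvalues 0.5, 0.7 -> stationary
    w = np.zeros((n, 2))
    eps = rng.standard_normal((n, 2))    # Sigma^{(xy)} = identity (diagonal)
    for t in range(n - 1):
        w[t + 1] = A1 @ w[t] + eps[t + 1]
    return w[:, 0], w[:, 1]              # the X and Y time-series
```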
[25] also proved the equality between Granger and Sims causality measures for linear autoregressive processes. For that purpose, the projection of $Y_{t+1}$ on the whole process X is also needed:

$$Y_{t+1} = \sum_{s=-\infty}^{\infty} b^{(xy)}_{x,s} X_{t-s} + \eta^{(xy)}_{y,t+1}. \qquad (16)$$
For multivariate systems we consider the fully multivariate autoregressive representation of the system $W = \{X, Y, Z\}$:

$$X_{t+1} = \sum_{s=0}^{\infty} a^{(xyz)}_{xx,s} X_{t-s} + a^{(xyz)}_{xy,s} Y_{t-s} + a^{(xyz)}_{xz,s} Z_{t-s} + \epsilon^{(xyz)}_{x,t+1}$$
$$Y_{t+1} = \sum_{s=0}^{\infty} a^{(xyz)}_{yx,s} X_{t-s} + a^{(xyz)}_{yy,s} Y_{t-s} + a^{(xyz)}_{yz,s} Z_{t-s} + \epsilon^{(xyz)}_{y,t+1} \qquad (17)$$
$$Z_{t+1} = \sum_{s=0}^{\infty} a^{(xyz)}_{zx,s} X_{t-s} + a^{(xyz)}_{zy,s} Y_{t-s} + a^{(xyz)}_{zz,s} Z_{t-s} + \epsilon^{(xyz)}_{z,t+1}$$

$$\Sigma^{(xyz)} = \begin{pmatrix} \Sigma^{(xyz)}_{xx} & \Sigma^{(xyz)}_{xy} & \Sigma^{(xyz)}_{xz} \\ \Sigma^{(xyz)}_{yx} & \Sigma^{(xyz)}_{yy} & \Sigma^{(xyz)}_{yz} \\ \Sigma^{(xyz)}_{zx} & \Sigma^{(xyz)}_{zy} & \Sigma^{(xyz)}_{zz} \end{pmatrix}. \qquad (18)$$
As in the bivariate case, we assume that $\Sigma^{(xyz)}$ is diagonal. Apart from the joint autoregressive representation of W, to calculate the conditional GSC from Y to X the projection of $X_{t+1}$ only on the past of X and Z is also needed:

$$X_{t+1} = \sum_{s=0}^{\infty} a^{(xz)}_{xx,s} X_{t-s} + a^{(xz)}_{xz,s} Z_{t-s} + \epsilon^{(xz)}_{x,t+1}$$
$$Z_{t+1} = \sum_{s=0}^{\infty} a^{(xz)}_{zx,s} X_{t-s} + a^{(xz)}_{zz,s} Z_{t-s} + \epsilon^{(xz)}_{z,t+1} \qquad (19)$$

$$\Sigma^{(xz)} = \begin{pmatrix} \Sigma^{(xz)}_{xx} & \Sigma^{(xz)}_{xz} \\ \Sigma^{(xz)}_{zx} & \Sigma^{(xz)}_{zz} \end{pmatrix}. \qquad (20)$$
3.2 Parametric measures of causality
The autoregressive representations described in Section 3.1 have been used to define a good number of measures related to the criterion of Granger causality. We here focus on the Geweke measures [25, 26] and partial directed coherence [5]. Other measures introduce some variation or refinement of these measures to deal with estimation problems or to attenuate the influence of hidden variables (e.g. [13, 32, 52]). Furthermore, the directed transfer function [35] is another related measure [14], but it is only equivalent to the Geweke measure for bivariate systems [20].
3.2.1 The Geweke measures of Granger causality
The temporal formulation and the relation between linear Granger causality and transfer entropy

Granger [29, 30] proposed to test for causality from Y to X by examining whether there is an improvement in the predictability of $X_{t+1}$ when using the past of Y in addition to the past of X for an optimal linear predictor. For a linear predictor $h(X^t)$, using only information from the past of X, the squared error is determined by

$$E^{(x)} = \int dX_{t+1} dX^t\, (X_{t+1} - h(X^t))^2\, p(X_{t+1}, X^t), \qquad (21)$$

and analogously for $E^{(xy)}$ using information from the past of X and Y. Since the optimal linear predictor is the conditional mean [40], we have that

$$E^{(x)} = \int dX^t p(X^t) \int dX_{t+1}\, (X_{t+1} - E_{X_{t+1}}[X_{t+1}|X^t])^2\, p(X_{t+1}|X^t) = E_{X^t}[\sigma^2(X_{t+1}|X^t)]. \qquad (22)$$

If the autoregressive representation of Eq. 13 is assumed to be valid, the variance $\sigma^2(X_{t+1}|X^t)$ does not depend on the value of $X^t$ and we have

$$E^{(x)} = E_{X^t}[\sigma^2(X_{t+1}|X^t)] = \Sigma^{(x)}_x. \qquad (23)$$
An analogous equality is obtained for $E^{(xy)}$, so that the Geweke measure of Granger causality is defined as:

$$G_{Y \to X} = \ln \frac{\Sigma^{(x)}_x}{\Sigma^{(xy)}_{xx}}, \qquad (24)$$

using the autoregressive representations of Eqs. 13-15. This measure, as indicated in [31], tests if there is causality from Y to X in mean, that is, the equality:

$$E_{X_{t+1}}[X_{t+1}|X^t] = E_{X_{t+1}}[X_{t+1}|X^t, Y^t] \quad \forall X^t, Y^t. \qquad (25)$$
Accordingly, given Eqs. 1 and 25, it is clear that

$$G_{Y \to X} \neq 0 \Rightarrow T_{Y \to X} \neq 0, \qquad (26)$$

since the former only tests for a difference in the first-order moment while the latter tests the whole probability distribution. In principle, the opposite implication is not always true. However, since Eq. 25, as well as Eqs. 1-3, imposes a stack of constraints (one for each value of the conditioning variables), we expect that, at least in general, an inequality in higher-order moments is accompanied by one in the conditional means. Furthermore, when the autoregressive representations are assumed to be valid, testing for the equality in the mean or in the variance of the distributions is equivalent, given Eq. 23 and the fact that the conditional variance is independent of the conditioning value. Notice that Gaussianity does not have to be assumed for this equality; in [25] it is only further assumed in order to find the distribution of the measures under the null-hypothesis of no causality.
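As a sketch of how Eqs. 21-24 translate into an estimator, the following illustrative Python function computes $G_{Y \to X}$ from the two projections of Eqs. 13 and 14, with the infinite past truncated to p lags and the residual variances obtained by ordinary least squares; the names `lagged` and `geweke_gc` and the default p = 5 are our own assumptions.

```python
import numpy as np

def lagged(s, p):
    """Columns with lags 1..p of the series s, aligned with s[p:]."""
    return np.column_stack([s[p - k - 1:len(s) - k - 1] for k in range(p)])

def geweke_gc(x, y, p=5):
    """G_{Y->X} = ln(Sigma_x^{(x)} / Sigma_xx^{(xy)}) (Eq. 24), with the
    residual variances of the two projections estimated by OLS."""
    target = x[p:]                       # plays the role of X_{t+1}
    def resid_var(design):
        design = np.column_stack([np.ones(len(target)), design])
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        r = target - design @ beta
        return r @ r / len(r)
    restricted = resid_var(lagged(x, p))                        # Eq. 13
    full = resid_var(np.hstack([lagged(x, p), lagged(y, p)]))   # Eq. 14
    return np.log(restricted / full)
```

For the simulated process of the sketch in Section 3.1, this gives a clearly positive value in the Y → X direction and a value near zero in the X → Y direction.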
The explanation above further relates the distinction in [31] between causation in mean (Eq. 25) and causation prima facie (Eq. 1) to the equivalence between the Geweke linear measure of Granger causality $G_{Y \to X}$ and the transfer entropy for Gaussian processes. Since a Gaussian probability distribution is completely determined by its first two moments, and the conditional variance is independent of the conditioning value, it is clear from the explanation above that for Gaussian variables causation in mean and prima facie have to be equivalent. In practice this can be seen [7] taking into account that the entropy of an N-variate Gaussian distribution is completely determined by its covariance matrix $\Sigma$:

$$H(X^N_{\mathrm{Gaussian}}) = \frac{1}{2} \ln ((2\pi e)^N |\Sigma|). \qquad (27)$$

Accordingly, the two measures are such that:
$$G_{Y \to X} = 2\, T_{Y \to X}. \qquad (28)$$
For multivariate processes the conditional GSC [26] is defined in the time domain analogously to $G_{Y \to X}$ in Eq. 24, but now using the autoregressive representations of Eqs. 17-20:

$$G_{Y \to X|Z} = \ln \frac{\Sigma^{(xz)}_{xx}}{\Sigma^{(xyz)}_{xx}}. \qquad (29)$$

It is straightforward to see that, given the form of the entropy for Gaussian variables (Eq. 27) and the definition of the conditional transfer entropy $T_{Y \to X|Z}$ (Eq. 8), the relation between Granger causality and transfer entropy also holds for the conditional measures for Gaussian variables:

$$G_{Y \to X|Z} = 2\, T_{Y \to X|Z}. \qquad (30)$$
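The factor of two in Eqs. 28 and 30 can be checked numerically. The following is a minimal sketch, assuming Gaussian data and a past truncated to p lags: it computes the transfer entropy from conditional variances obtained from the sample covariance matrix, using the Gaussian entropy of Eq. 27 (in which the constants cancel), so it can be compared with the OLS measure of the previous sketch. The helper names `cond_var` and `gaussian_te` are our own.

```python
import numpy as np

def cond_var(S, a, b):
    """Conditional variance of component a given components b,
    from a covariance matrix S (Schur complement)."""
    Sab = S[np.ix_([a], b)]
    return S[a, a] - (Sab @ np.linalg.solve(S[np.ix_(b, b)], Sab.T))[0, 0]

def gaussian_te(x, y, p=5):
    """T_{Y->X} = H(X_{t+1}|X^t) - H(X_{t+1}|X^t, Y^t) for Gaussian data,
    with H = 0.5*ln(2*pi*e*var) (Eq. 27); the 2*pi*e constants cancel."""
    n = len(x)
    cols = [x[p:]]                                       # X_{t+1}
    cols += [x[p - k - 1:n - k - 1] for k in range(p)]   # past of X
    cols += [y[p - k - 1:n - k - 1] for k in range(p)]   # past of Y
    S = np.cov(np.column_stack(cols), rowvar=False)
    v_x = cond_var(S, 0, list(range(1, p + 1)))          # given X^t
    v_xy = cond_var(S, 0, list(range(1, 2 * p + 1)))     # given X^t, Y^t
    return 0.5 * np.log(v_x / v_xy)

# For Gaussian data, geweke_gc(x, y, p) ≈ 2 * gaussian_te(x, y, p) (Eq. 28)
```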
The spectral formulation

Geweke [25] also proposed a spectral decomposition of the time-domain Granger causality measure (Eq. 24). Geweke derived the spectral measure of causality from Y to X, $g_{Y \to X}(\omega)$, requiring the fulfillment of some properties:
1. The spectral measure should have an intuitive interpretation so that the spectral decomposition is useful for empirical applications.
2. The measure has to be nonnegative.
3. The temporal and spectral measures have to be related so that

$$\frac{1}{2\pi} \int_{-\pi}^{\pi} g_{Y \to X}(\omega) d\omega = G_{Y \to X}. \qquad (31)$$

Conditions two and three imply that

$$G_{Y \to X} = 0 \Leftrightarrow g_{Y \to X}(\omega) = 0 \ \forall \omega. \qquad (32)$$
The GSC is obtained from the spectral representation of the bivariate autoregressive process as follows. Fourier transforming Eq. 14 leads to:

$$\begin{pmatrix} A^{(xy)}_{xx}(\omega) & A^{(xy)}_{xy}(\omega) \\ A^{(xy)}_{yx}(\omega) & A^{(xy)}_{yy}(\omega) \end{pmatrix} \begin{pmatrix} X(\omega) \\ Y(\omega) \end{pmatrix} = \begin{pmatrix} \epsilon^{(xy)}_x(\omega) \\ \epsilon^{(xy)}_y(\omega) \end{pmatrix}, \qquad (33)$$

where we have $A^{(xy)}_{xx}(\omega) = 1 - \sum_{s=1}^{\infty} a^{(xy)}_{xx,s} e^{-i\omega s}$, as well as $A^{(xy)}_{xy}(\omega) = -\sum_{s=1}^{\infty} a^{(xy)}_{xy,s} e^{-i\omega s}$, and analogously for $A^{(xy)}_{yy}(\omega)$, $A^{(xy)}_{yx}(\omega)$. The coefficients matrix $\mathbf{A}^{(xy)}(\omega)$ can be inverted into the transfer function $\mathbf{H}^{(xy)}(\omega) = (\mathbf{A}^{(xy)}(\omega))^{-1}$, so that
$$\begin{pmatrix} X(\omega) \\ Y(\omega) \end{pmatrix} = \begin{pmatrix} H^{(xy)}_{xx}(\omega) & H^{(xy)}_{xy}(\omega) \\ H^{(xy)}_{yx}(\omega) & H^{(xy)}_{yy}(\omega) \end{pmatrix} \begin{pmatrix} \epsilon^{(xy)}_x(\omega) \\ \epsilon^{(xy)}_y(\omega) \end{pmatrix}. \qquad (34)$$
Accordingly, the spectral matrix can be expressed as:

$$\mathbf{S}^{(xy)}(\omega) = \mathbf{H}^{(xy)}(\omega)\, \Sigma^{(xy)}\, (\mathbf{H}^{(xy)})^*(\omega), \qquad (35)$$

where $*$ denotes complex conjugation and matrix transposition. Given the lack of instantaneous correlations,

$$S_{xx}(\omega) = \Sigma^{(xy)}_{xx} |H^{(xy)}_{xx}(\omega)|^2 + \Sigma^{(xy)}_{yy} |H^{(xy)}_{xy}(\omega)|^2. \qquad (36)$$
The GSC from Y to X at frequency $\omega$ is defined as:

$$g_{Y \to X}(\omega) = \ln \frac{S_{xx}(\omega)}{\Sigma^{(xy)}_{xx} |H^{(xy)}_{xx}(\omega)|^2}. \qquad (37)$$

This definition fulfills the requirement of being nonnegative since, given Eq. 36, $S_{xx}(\omega)$ is never smaller than $\Sigma^{(xy)}_{xx} |H^{(xy)}_{xx}(\omega)|^2$. It also fulfills the requirement of being intuitive, since $g_{Y \to X}(\omega)$ quantifies the portion of the power spectrum which is associated with the intrinsic innovation process of X. Furthermore, the third condition is also fulfilled (see [25, 57, 14] for details).
This can be seen considering that

$$g_{Y \to X}(\omega) = -\ln (1 - |C(X, \epsilon^{(xy)}_y)|^2), \qquad (38)$$

where $|C(X, \epsilon^{(xy)}_y)|^2$ is the squared coherence of X with the innovations $\epsilon^{(xy)}_y$ of Eq. 14. Given the general relation of the mutual information rate with the squared coherence [24], we have that for Gaussian variables

$$T_{Y \to X} = I(X^N; \epsilon^{(xy)N}_y) = \frac{-1}{4\pi} \int_{-\pi}^{\pi} \ln (1 - |C(X, \epsilon^{(xy)}_y)|^2) d\omega. \qquad (39)$$
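A minimal numerical sketch of Eqs. 33-37, assuming the VAR lag-coefficient matrices and the diagonal innovations covariance are given (for example the true matrices of the simulation sketch above, or OLS estimates): it evaluates the transfer function on a frequency grid and checks the integral condition of Eq. 31 by averaging. The function name `spectral_gc` is our own.

```python
import numpy as np

def spectral_gc(A_list, Sigma, n_freq=1024):
    """Geweke spectral measure g_{Y->X}(omega) (Eq. 37) for a bivariate VAR
    with lag-coefficient matrices A_list = [A_1, ..., A_p] and diagonal
    innovations covariance Sigma."""
    omegas = np.linspace(-np.pi, np.pi, n_freq, endpoint=False)
    g = np.empty(n_freq)
    for i, w in enumerate(omegas):
        A = np.eye(2, dtype=complex)
        for k, Ak in enumerate(A_list, start=1):
            A -= Ak * np.exp(-1j * w * k)        # A^{(xy)}(omega), Eq. 33
        H = np.linalg.inv(A)                     # transfer function, Eq. 34
        S = H @ Sigma @ H.conj().T               # spectral matrix, Eq. 35
        g[i] = np.log(S[0, 0].real / (Sigma[0, 0] * abs(H[0, 0]) ** 2))
    return omegas, g

# Check of Eq. 31: the frequency average of g recovers the time-domain G
# A1 = np.array([[0.5, 0.4], [0.0, 0.7]])
# omegas, g = spectral_gc([A1], np.eye(2)); print(g.mean())
```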
For the multivariate case, to derive the spectral representation of $G_{Y \to X|Z}$, for simplicity we assume again that there is no instantaneous causality and that $\Sigma^{(xyz)}$ and $\Sigma^{(xz)}$ are diagonal (see [18] for a detailed derivation when instantaneous correlations exist). We rewrite Eq. 19 after Fourier transforming as:

$$\begin{pmatrix} A^{(xz)}_{xx}(\omega) & A^{(xz)}_{xz}(\omega) \\ A^{(xz)}_{zx}(\omega) & A^{(xz)}_{zz}(\omega) \end{pmatrix} \begin{pmatrix} X(\omega) \\ Z(\omega) \end{pmatrix} = \begin{pmatrix} \epsilon^{(xz)}_x(\omega) \\ \epsilon^{(xz)}_z(\omega) \end{pmatrix}. \qquad (40)$$
Furthermore, we rewrite Eq. 17 using the transfer function $\mathbf{H}^{(xyz)}$:

$$\begin{pmatrix} X(\omega) \\ Y(\omega) \\ Z(\omega) \end{pmatrix} = \mathbf{H}^{(xyz)} \begin{pmatrix} \epsilon^{(xyz)}_x(\omega) \\ \epsilon^{(xyz)}_y(\omega) \\ \epsilon^{(xyz)}_z(\omega) \end{pmatrix}. \qquad (41)$$
Geweke [26] showed that

$$G_{Y \to X|Z} = G_{Y \epsilon^{(xz)}_z \to \epsilon^{(xz)}_x}. \qquad (42)$$

Accordingly, Eqs. 40 and 41 are combined to express Y, $\epsilon^{(xz)}_z$ and $\epsilon^{(xz)}_x$ in terms of the innovations of the fully multivariate process:

$$\begin{pmatrix} \epsilon^{(xz)}_x(\omega) \\ Y(\omega) \\ \epsilon^{(xz)}_z(\omega) \end{pmatrix} = \mathbf{D} \mathbf{H}^{(xyz)} \begin{pmatrix} \epsilon^{(xyz)}_x(\omega) \\ \epsilon^{(xyz)}_y(\omega) \\ \epsilon^{(xyz)}_z(\omega) \end{pmatrix}, \qquad (43)$$

where

$$\mathbf{D} = \begin{pmatrix} A^{(xz)}_{xx}(\omega) & 0 & A^{(xz)}_{xz}(\omega) \\ 0 & 1 & 0 \\ A^{(xz)}_{zx}(\omega) & 0 & A^{(xz)}_{zz}(\omega) \end{pmatrix}. \qquad (44)$$
Considering $\mathbf{Q} = \mathbf{D} \mathbf{H}^{(xyz)}$, the spectral matrix of $\epsilon^{(xz)}_x$, Y and $\epsilon^{(xz)}_z$ is:

$$\hat{\mathbf{S}}(\omega) = \mathbf{Q}(\omega)\, \Sigma^{(xyz)}\, \mathbf{Q}^*(\omega), \qquad (45)$$

and in particular

$$S_{\epsilon^{(xz)}_x \epsilon^{(xz)}_x}(\omega) = |Q_{xx}(\omega)|^2 \Sigma^{(xyz)}_{xx} + |Q_{xy}(\omega)|^2 \Sigma^{(xyz)}_{yy} + |Q_{xz}(\omega)|^2 \Sigma^{(xyz)}_{zz}. \qquad (46)$$
The conditional GSC from Y to X given Z is defined [26] as the portion of the power spectrum associated with $\epsilon^{(xyz)}_x$, in analogy to Eq. 37:

$$g_{Y \to X|Z}(\omega) = g_{Y \epsilon^{(xz)}_z \to \epsilon^{(xz)}_x}(\omega) = \ln \frac{S_{\epsilon^{(xz)}_x \epsilon^{(xz)}_x}(\omega)}{|Q_{xx}(\omega)|^2 \Sigma^{(xyz)}_{xx}}. \qquad (47)$$
This measure also fulfills the requirements that [25] imposed on the spectral measures. Furthermore, in analogy to Eq. 38, $g_{Y \to X|Z}(\omega)$ is related to a multiple coherence:

$$g_{Y \to X|Z}(\omega) = -\ln (1 - |C(\epsilon^{(xz)}_x, \epsilon^{(xyz)}_y \epsilon^{(xyz)}_z)|^2), \qquad (48)$$

where $|C(\epsilon^{(xz)}_x, \epsilon^{(xyz)}_y \epsilon^{(xyz)}_z)|^2$ is the squared multiple coherence [48]. This equality results from the direct application of the definition of the squared multiple coherence (see [14] for details).
Given the definition of $g_{Y \to X|Z}(\omega)$ in terms of the squared multiple coherence, it is clear that, analogously to $G_{Y \to X}$ (Eq. 39):

$$G_{Y \to X|Z} = 2\, I(\epsilon^{(xz)N}_x; \epsilon^{(xyz)N}_y \epsilon^{(xyz)N}_z). \qquad (49)$$
3.2.2 Partial directed coherence
The other measure related to Granger causality that we review here is partial directed coherence [6, 5], which is defined only in the spectral domain. In particular, the information partial directed coherence (iPDC) from Y to X [57] is defined in the bivariate case as:

$$i\pi^{(xy)}_{xy}(\omega) = C(\epsilon^{(xy)}_x, \eta^{(xy)}_y) = \frac{A^{(xy)}_{xy}(\omega) \sqrt{S_{yy|X}}}{\sqrt{\Sigma^{(xy)}_{xx}}}, \qquad (50)$$

where $\mathbf{A}^{(xy)}(\omega)$ is the spectral representation of the autoregressive coefficients matrix of Eq. 14 and $S_{yy|X}$ is the partial spectrum [48] of the Y process when partialized on process X. Furthermore, $\eta^{(xy)}_y$ refers to the partialized process resulting from the Y process when partialized on X, as results from Eq. 16.
As in the case of the GSC, a mutual information rate is associated with the iPDC [57], and it is further related to $S_{Y \to X}$ [14] for Gaussian variables:

$$S_{Y \to X} = I(\epsilon^{(xy)N}_x; \eta^{(xy)N}_y) = \frac{-1}{4\pi} \int_{-\pi}^{\pi} \ln (1 - |i\pi^{(xy)}_{xy}(\omega)|^2) d\omega. \qquad (51)$$
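For comparison with the sketch for the GSC, the following illustrative function evaluates the bivariate iPDC of Eq. 50 on a frequency grid, computing the partial spectrum as $S_{yy|X} = S_{yy} - |S_{yx}|^2/S_{xx}$. This is a sketch under the chapter's definition and our own naming (`ipdc_bivariate`), not a reference implementation.

```python
import numpy as np

def ipdc_bivariate(A_list, Sigma, n_freq=1024):
    """Information PDC from Y to X (Eq. 50):
    A_xy(omega) * sqrt(S_{yy|X}) / sqrt(Sigma_xx)."""
    omegas = np.linspace(-np.pi, np.pi, n_freq, endpoint=False)
    ipdc = np.empty(n_freq, dtype=complex)
    for i, w in enumerate(omegas):
        A = np.eye(2, dtype=complex)
        for k, Ak in enumerate(A_list, start=1):
            A -= Ak * np.exp(-1j * w * k)        # A^{(xy)}(omega)
        H = np.linalg.inv(A)
        S = H @ Sigma @ H.conj().T               # spectral matrix, Eq. 35
        Syy_X = S[1, 1].real - abs(S[1, 0]) ** 2 / S[0, 0].real  # partial spectrum
        ipdc[i] = A[0, 1] * np.sqrt(Syy_X) / np.sqrt(Sigma[0, 0])
    return omegas, ipdc
```

Via Eq. 51, half the frequency average of $-\ln(1 - |i\pi^{(xy)}_{xy}(\omega)|^2)$ over the grid then gives a numerical estimate of $S_{Y \to X}$.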
In the multivariate case the information partial directed coherence (iPDC) from Y to X [57] is:

$$i\pi^{(xyz)}_{xy}(\omega) = C(\epsilon^{(xyz)}_x, \eta^{(xyz)}_y) = \frac{A^{(xyz)}_{xy}(\omega) \sqrt{S_{yy|W \backslash y}}}{\sqrt{\Sigma^{(xyz)}_{xx}}}, \qquad (52)$$

where $\mathbf{A}^{(xyz)}(\omega)$ is the spectral representation of the autoregressive coefficients matrix of Eq. 17 and $S_{yy|W \backslash y}$ is the partial spectrum of the Y process when partialized on all the other processes in the multivariate process W. Furthermore, $\eta^{(xyz)}_y$ refers to the partialized process resulting from the Y process when partialized on all the others.
In the multivariate case, not even after integration across frequencies can the iPDC be expressed in terms of the variables of the observed processes X, Y, Z. The equality [57]

$$I(\epsilon^{(xyz)N}_x; \eta^{(xyz)N}_y) = \frac{-1}{4\pi} \int_{-\pi}^{\pi} \ln (1 - |i\pi^{(xyz)}_{xy}(\omega)|^2) d\omega, \qquad (53)$$

analogous to the one of the bivariate case (Eq. 51), provides only an expression which involves the innovation processes $\epsilon^{(xyz)}_x$ and $\eta^{(xyz)}_y$.
3.3 Parametric criteria for causal inference
From the review above of the spectral Geweke measures of Granger causality and of the partial directed coherence, one can see that alternative criteria for causal inference which involve the innovation processes intrinsic to the parametric autoregressive representation are implicit in the mutual information terms. In particular, for the bivariate case, the spectral Geweke measure is related (Eq. 39) to the criterion

$$p(X^N) = p(X^N|\epsilon^{(xy)N}_y) \quad \forall \epsilon^{(xy)N}_y. \qquad (54)$$
The bivariate PDC is related (Eq. 51) to

$$p(\epsilon^{(xy)N}_x) = p(\epsilon^{(xy)N}_x|\eta^{(xy)N}_y) \quad \forall \eta^{(xy)N}_y. \qquad (55)$$
For the multivariate case the Geweke measure is related (Eq. 49) to

$$p(\epsilon^{(xz)N}_x) = p(\epsilon^{(xz)N}_x|\epsilon^{(xyz)N}_y, \epsilon^{(xyz)N}_z) \quad \forall \epsilon^{(xyz)N}_y, \epsilon^{(xyz)N}_z, \qquad (56)$$
while the PDC is related (Eq. 53) to

$$p(\epsilon^{(xyz)N}_x) = p(\epsilon^{(xyz)N}_x|\eta^{(xyz)N}_y) \quad \forall \eta^{(xyz)N}_y. \qquad (57)$$
Comparing the non-parametric criteria of Section 2.1 with these parametric criteria we can see another main difference, apart from the fact that the parametric ones all involve some innovation process. This difference is that in Eqs. 54-57 no temporal separation between future and past is required to state the criteria, while the non-parametric criteria all rely explicitly on temporal precedence. The lack of temporal separation is exactly what allows one to construct the spectral measures based on the criteria of Eqs. 54-57. In [14] it was shown, based on this difference with respect to temporal separation, that transfer entropy does not have a non-parametric spectral representation. This lack of a non-parametric spectral representation of the transfer entropy can be further understood considering why a criterion without temporal separation that involves only the processes X, Y, and not innovation processes, cannot be used for causal inference: consider $p(X^N) = p(X^N|Y^N)$ as a criterion to infer causality from Y to X, in contrast to the ones of Eqs. 1 and 54. Using the chain rule for the probability distributions, this equality implies checking $p(X_{t+1}|X^t) = p(X_{t+1}|X^t, Y^N)$. But this equality does not hold if there is a causal connection in the opposite direction, from X to Y, because of the conditioning on the whole process $Y^N$ instead of only on its past. By contrast,
$$p(X^N|\epsilon^{(xy)N}_y) = \prod_{t=0}^{N-1} p(X_{t+1}|X^t, \epsilon^{(xy)N}_y) = \prod_{t=0}^{N-1} p(X_{t+1}|X^t, \epsilon^{(xy)t}_y) = \prod_{t=0}^{N-1} p(X_{t+1}|X^t, Y^t), \qquad (58)$$

since by construction there are no causal connections from the processes to the innovation processes. The last equality can be understood considering that the autoregressive projections described in Section 3.1 introduce a functional relation between the variables, such that, for example, given Eq. 14, $X_{t+1}$ is completely determined by $\epsilon^{(xy)\,t+1}_x, \epsilon^{(xy)\,t}_y$, and analogously for $Y_{t+1}$. Accordingly, it is equivalent to condition on $X^t, \epsilon^{(xy)\,t}_y$ or on $X^t, Y^t$.
The probability distributions in Eq. 1 and Eq. 54 are still not the same, as is clear from Eq. 58. However, under the assumption of stationarity, it is the functional relations that completely determine the processes from the innovation processes (and inversely) that lead to the equality in Eq. 39 of the transfer entropy with the mutual information corresponding to the comparison of the probability distributions in Eq. 54, and analogously for Eqs. 49 and 51. Remarkably, the mutual information associated with Eq. 57, as noticed above Eq. 53, is not equal to a mutual information associated with a non-parametric criterion. As indicated in Eq. 51 (see [14] for details), for bivariate processes the PDC is related to Sims causality. However, for the multivariate case, while there is no extension of Sims causality, it is clear from the comparison of the definitions in Eqs. 50 and 52, as well as from the comparison of the criteria of Eqs. 55 and 57, that the multivariate formulation appears as a natural extension of the bivariate one. This stresses the role of the functional relations that are assumed to implicitly define the innovation processes. It is not only the causal structure between the variables but the specific functional form in which they are related that guarantees the validity of the criteria in Eqs. 54-57. In general this functional form is not required to be linear, as long as it establishes that the processes and innovation processes are mutually determined.
Another interesting aspect is revealed from the comparison of the bivariate and multivariate criteria respectively associated with the GSC and PDC measures. While for the PDC the multivariate criterion is a straightforward extension of the bivariate one, this is not the case for the criteria associated with the GSC. This can also be noticed by comparing the autoregressive projections used for each measure. In particular, for the bivariate case, $g_{Y \to X}(\omega)$ is obtained directly from the bivariate autoregressive representation (Eq. 14), not by combining it with the univariate autoregressive representation of X (Eq. 13). By contrast, $g_{Y \to X|Z}(\omega)$ requires the combination of the full multivariate projection (Eq. 17) and the projection on the past of X, Z (Eq. 19). Below we show that in fact there is a natural counterpart for both the criteria of Eqs. 54 and 56, respectively.
3.4 Alternative Geweke spectral measures
Instead of constructing $g_{Y \to X}(\omega)$ just from the bivariate autoregressive representation, one could proceed alternatively, following the procedure used for the conditional case. This means combining Eq. 34 with the Fourier transform of Eq. 13:

$$\epsilon^{(x)}_x(\omega) = a^{(x)}_{xx}(\omega) X(\omega). \qquad (59)$$
This is analogous to combining Eqs. 40 and 41 in the conditional case. Combining Eqs. 34 and 59 we get an expression analogous to Eq. 43:

$$\begin{pmatrix} \epsilon^{(x)}_x(\omega) \\ Y(\omega) \end{pmatrix} = \mathbf{P} \mathbf{H}^{(xy)} \begin{pmatrix} \epsilon^{(xy)}_x(\omega) \\ \epsilon^{(xy)}_y(\omega) \end{pmatrix}, \qquad (60)$$

where

$$\mathbf{P} = \begin{pmatrix} a^{(x)}_{xx}(\omega) & 0 \\ 0 & 1 \end{pmatrix}. \qquad (61)$$
Considering $\hat{\mathbf{Q}} = \mathbf{P} \mathbf{H}^{(xy)}$, the spectrum of $\epsilon^{(x)}_x$ is

$$S_{\epsilon^{(x)}_x \epsilon^{(x)}_x}(\omega) = |a^{(x)}_{xx}(\omega)|^2 S_{xx}(\omega) = |\hat{Q}_{xx}|^2 \Sigma^{(xy)}_{xx} + |\hat{Q}_{xy}|^2 \Sigma^{(xy)}_{yy}, \qquad (62)$$

and comparing the total power to the portion related to $\epsilon^{(xy)}_x$ one can define

$$\hat{g}_{Y \to X}(\omega) = \ln \frac{S_{\epsilon^{(x)}_x \epsilon^{(x)}_x}(\omega)}{|\hat{Q}_{xx}|^2 \Sigma^{(xy)}_{xx}} = \ln \frac{S_{xx}(\omega)}{|H_{xx}|^2 \Sigma^{(xy)}_{xx}} = g_{Y \to X}(\omega). \qquad (63)$$
This shows that

$$g_{Y \to X} = -\ln (1 - |C(\epsilon^{(x)}_x, \epsilon^{(xy)}_y)|^2) = -\ln (1 - |C(X, \epsilon^{(xy)}_y)|^2) \qquad (64)$$

and

$$T_{Y \to X} = I(\epsilon^{(x)N}_x; \epsilon^{(xy)N}_y) = I(X^N; \epsilon^{(xy)N}_y). \qquad (65)$$
and
This equality indicates that although the procedure used for the multivariate
case is apparently not reducible to the bivariate case for Z = ∅, the spectral
decomposition gY →X (ω) is the same. The criterion for causal inference that
results from reducing to the bivariate case straightforwardly the one of Eq.
56 is
p(ϵ(x)N
) = p(ϵ(x)N
|ϵ(xy)N
) ∀ϵ(xy)N
.
(66)
x
x
y
y
Again the particular functional relation between the processes and the inno(x)N
vation processes determines that X N and ϵx
share the same information
(xy)N
with ϵy
, given that they are mutually determined in Eq. 13.
Analogously, we want to find the criterion that results from a straightforward extension of the one in Eq. 54. An alternative way to construct $g_{Y \to X|Z}(\omega)$ is suggested by the relation between the bivariate and the conditional measures stated by Geweke [26]:

$$G_{Y \to X|Z} = G_{YZ \to X} - G_{Z \to X}, \qquad (67)$$
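A brief OLS sketch of Eq. 67 (illustrative, with the infinite past truncated to p lags; a trivariate simulation in the style of the sketch of Section 3.1 is assumed as input, and the names `resid_var` and `conditional_gc` are our own): with plug-in residual variances the chain-rule decomposition reduces algebraically to the conditional measure of Eq. 29, which is exactly the content of the identity.

```python
import numpy as np

def resid_var(target, past_series, p=5):
    """OLS residual variance of target regressed on p lags of each series."""
    n = len(target)
    cols = [s[p - k - 1:n - k - 1] for s in past_series for k in range(p)]
    design = np.column_stack([np.ones(n - p)] + cols)
    beta, *_ = np.linalg.lstsq(design, target[p:], rcond=None)
    r = target[p:] - design @ beta
    return r @ r / len(r)

def conditional_gc(x, y, z, p=5):
    """G_{Y->X|Z} (Eq. 29) and the chain-rule form G_{YZ->X} - G_{Z->X} (Eq. 67)."""
    direct = np.log(resid_var(x, [x, z], p) / resid_var(x, [x, y, z], p))
    chain = (np.log(resid_var(x, [x], p) / resid_var(x, [x, y, z], p))
             - np.log(resid_var(x, [x], p) / resid_var(x, [x, z], p)))
    return direct, chain   # identical up to floating-point error
```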
which is just an application of the chain rule for the mutual information [17]. In analogy to Eq. 37,

$$g_{YZ \to X}(\omega) = \ln \frac{S_{xx}}{|H^{(xyz)}_{xx}|^2 \Sigma^{(xyz)}_{xx}} \qquad (68)$$

and

$$g_{Z \to X}(\omega) = \ln \frac{S_{xx}}{|H^{(xz)}_{xx}|^2 \Sigma^{(xz)}_{xx}}, \qquad (69)$$

where $\mathbf{H}^{(xz)}$ is the inverse of the coefficients matrix of Eq. 40. This leads to

$$\hat{g}_{Y \to X|Z}(\omega) = \ln \frac{|H^{(xz)}_{xx}|^2 \Sigma^{(xz)}_{xx}}{|H^{(xyz)}_{xx}|^2 \Sigma^{(xyz)}_{xx}}. \qquad (70)$$
Notice that while $g_{Y \to X}(\omega) = \hat{g}_{Y \to X}(\omega)$, the two measures are different in the conditional case. This means that two alternative spectral decompositions are possible, although their integration is equivalent. This can be seen considering that the integration of the logarithm terms including $|H^{(xyz)}_{xx}|^2$ and $|H^{(xz)}_{xx}|^2$ is zero, based on Theorem 4.2 of Rozanov [51] (see [14] for details). Accordingly,

$$T_{Y \to X|Z} = I(X^N; \epsilon^{(xyz)N}_y \epsilon^{(xyz)N}_z) - I(X^N; \epsilon^{(xz)N}_z), \qquad (71)$$
and the natural extension of the criterion in Eq. 54 is

$$p(X^N|\epsilon^{(xz)N}_z) = p(X^N|\epsilon^{(xyz)N}_y, \epsilon^{(xyz)N}_z) \quad \forall \epsilon^{(xyz)N}_y, \epsilon^{(xyz)N}_z. \qquad (72)$$
The fact that the conditioning variable on the left-hand side is not preserved among the conditioning variables on the right-hand side determines that the information-theoretic statistic to test this equality is not a single KL-divergence (in particular, a mutual information) but a difference of two. We examine whether the alternative spectral measures fulfill the three conditions imposed by Geweke described in Section 3.2.1. In the bivariate case the measure is equal, so it clearly does. In the multivariate case the measure has an intuitive interpretation and fulfills the relation with the time-domain measure under integration. However, nonnegativity is not guaranteed for every frequency, since the measure is related to a difference of mutual informations.
3.5 Alternative parametric criteria based on innovations partial dependence
Above we have shown that the different criteria underlying bivariate and multivariate GSC can be reduced or extended, respectively, to the other case. We indicated that the parametric criteria rely not only on the causal structure but also on the functional relations assumed between the processes and the innovation processes. This is particularly clear in the multivariate criteria (Eqs. 56, 57 and 72), because the criteria combine innovations from different projections. This prevents one from considering the autoregressive models as actual generative models whose structure can be mapped to a causal graph. Here we introduce an alternative type of parametric criteria which relies on a single projection, which can be considered as the model from which the processes are generated.
In the bivariate case the criterion is

$$p(\epsilon^{(xy)N}_x|X^N) = p(\epsilon^{(xy)N}_x|X^N, \epsilon^{(xy)N}_y) \quad \forall X^N, \epsilon^{(xy)N}_y, \qquad (73)$$

which can be tested with the mutual information

$$I(\epsilon^{(xy)N}_x; \epsilon^{(xy)N}_y|X^N) = 0. \qquad (74)$$

The innovations $\epsilon^{(xy)N}_x$ and $\epsilon^{(xy)N}_y$ are assumed to be independent (or are rendered independent after the normalization step in [25]) when there is no conditioning. The logic of the criterion is that if conditioning on $X^N$ introduces some dependence, this can only be because both innovation processes have a dependence with the process X (this is the conditioning on a joint child effect). Since by construction $\epsilon^{(xy)N}_x$ is associated with $X^N$, this effect occurs if and only if $\epsilon^{(xy)N}_y$ has an influence on $X^N$, which can only be through an existent connection from Y to X. In the multivariate case the criterion is straightforwardly extended to

$$p(\epsilon^{(xyz)N}_x|X^N, Z^N) = p(\epsilon^{(xyz)N}_x|X^N, Z^N, \epsilon^{(xyz)N}_y) \quad \forall X^N, Z^N, \epsilon^{(xyz)N}_y, \qquad (75)$$

and can be tested with the mutual information

$$I(\epsilon^{(xyz)N}_x; \epsilon^{(xyz)N}_y|X^N, Z^N) = 0. \qquad (76)$$

Here the conditioning on $Z^N$ is required so that the connection from $\epsilon^{(xyz)N}_y$ to $X^N$ is not indirect through Z.
These criteria have the advantage that they rely on a unique autoregressive representation. They are also useful to illustrate the difference between using the information-theoretic measures as statistics to test for causality or as measures to quantify the dependencies. In particular, the mutual informations of Eqs. 74 and 76 are either infinite or zero depending on whether or not there is causality from Y to X. This is clear when they are expressed in terms of the squared coherence; for example, $|C(X \epsilon^{(xy)}_x, \epsilon^{(xy)}_y)|^2$ is associated with Eq. 74, and is 1 when there is causality. This is because, since the two innovation processes completely determine X, inversely the innovations $\epsilon^{(xy)}_y$ can be known from the process X and its innovations. The same occurs in the multivariate case. In principle, this renders these mutual informations very powerful to test for causality and useless to quantify in some way the strength of the dependence. In practice, the actual value estimated would also reflect how valid the chosen autoregressive model is.
4 Comparison of non-parametric and parametric criteria for causal inference from time-series

We have reviewed different criteria for causal inference and introduced some related ones that together form a consistent framework for causal inference from time-series. Here we briefly summarize them, further highlighting their relations. In Tables 1 and 2 we collect all the criteria for causal inference, organized according to whether they are parametric or non-parametric, and bivariate or multivariate. We see that all the bivariate criteria have their multivariate counterpart except criterion 2, associated with Sims causality. In Table 3 we display the corresponding information-theoretic measures to test the criteria. We group the measures according to which of them are equal given the functional form that determines the processes from the innovation processes. In the bivariate case the measures in rows 1 and 2 are equivalent only when there is no instantaneous causality. All these measures can be used to test for causality from Y to X, but when a nonzero value is obtained they provide alternative characterizations of the dependencies.

Finally, in Table 4 we use the set $W = \{X, Y, Z\}$ to re-express the criteria of Tables 1 and 2 in a synthetic form that integrates the bivariate and multivariate notation used so far, making their link more transparent. For example, $\{W \backslash Y\}$ refers to all the processes except Y. Furthermore, for innovation processes, $\epsilon^{(\{W \backslash Y\})}_{\{W \backslash X,Y\}}$ refers to, given the projection $(\{W \backslash Y\})$ which includes all the processes except Y, all the innovation processes $\{W \backslash X, Y\}$, that is, all the ones in the projection except the ones associated with X and Y.
Table 1 Bivariate criteria for causal inference

Non-parametric
1. $p(X_{t+1}|X^t) = p(X_{t+1}|X^t, Y^t)$
2. $p(X^{t+1:N}|X^t, Y^t) = p(X^{t+1:N}|X^t, Y^t, Y_{t+1})$

Parametric
3. $p(\epsilon^{(x)N}_x) = p(\epsilon^{(x)N}_x|\epsilon^{(xy)N}_y)$
4. $p(X^N) = p(X^N|\epsilon^{(xy)N}_y)$
5. $p(\epsilon^{(xy)N}_x) = p(\epsilon^{(xy)N}_x|\eta^{(xy)N}_y)$
6. $p(\epsilon^{(xy)N}_x|X^N) = p(\epsilon^{(xy)N}_x|X^N, \epsilon^{(xy)N}_y)$
Table 2 Multivariate criteria for causal inference

Non-parametric
1. $p(X_{t+1}|X^t, Z^t) = p(X_{t+1}|X^t, Y^t, Z^t)$
2. —

Parametric
3. $p(\epsilon^{(xz)N}_x) = p(\epsilon^{(xz)N}_x|\epsilon^{(xyz)N}_y, \epsilon^{(xyz)N}_z)$
4. $p(X^N|\epsilon^{(xz)N}_z) = p(X^N|\epsilon^{(xyz)N}_y, \epsilon^{(xyz)N}_z)$
5. $p(\epsilon^{(xyz)N}_x) = p(\epsilon^{(xyz)N}_x|\eta^{(xyz)N}_y)$
6. $p(\epsilon^{(xyz)N}_x|X^N, Z^N) = p(\epsilon^{(xyz)N}_x|X^N, Z^N, \epsilon^{(xyz)N}_y)$
Table 3 Mutual information measures to test for causality

Bivariate
1. $I(X_{t+1}; Y^t|X^t) = I(X^N; \epsilon^{(xy)N}_y) = I(\epsilon^{(x)N}_x; \epsilon^{(xy)N}_y)$
2. $I(Y_{t+1}; X^{t+1:N}|Y^t, X^t) = I(\epsilon^{(xy)N}_x; \eta^{(xy)N}_y)$
3. $I(\epsilon^{(xy)N}_x; \epsilon^{(xy)N}_y|X^N)$

Multivariate
4. $I(X_{t+1}; Y^t|X^t, Z^t) = I(\epsilon^{(xz)N}_x; \epsilon^{(xyz)N}_y, \epsilon^{(xyz)N}_z) = I(X^N; \epsilon^{(xyz)N}_y, \epsilon^{(xyz)N}_z) - I(X^N; \epsilon^{(xz)N}_z)$
5. $I(\epsilon^{(xyz)N}_x; \eta^{(xyz)N}_y)$
6. $I(\epsilon^{(xyz)N}_x; \epsilon^{(xyz)N}_y|X^N, Z^N)$
Table 4 Criteria for causal inference

Non-parametric
1. $p(X_{t+1}|\{W \backslash Y\}^t) = p(X_{t+1}|\{W\}^t)$

Parametric
2. $p(\epsilon^{(\{W \backslash Y\})N}_x) = p(\epsilon^{(\{W \backslash Y\})N}_x|\epsilon^{(\{W\})N}_{\{W \backslash X\}})$
3. $p(X^N|\epsilon^{(\{W\})N}_{\{W \backslash X,Y\}}) = p(X^N|\epsilon^{(\{W\})N}_{\{W \backslash X\}})$
4. $p(\epsilon^{(\{W\})N}_x) = p(\epsilon^{(\{W\})N}_x|\eta^{(\{W\})N}_y)$
5. $p(\epsilon^{(\{W\})N}_x|\{W \backslash Y\}^N) = p(\epsilon^{(\{W\})N}_x|\{W \backslash Y\}^N, \epsilon^{(\{W\})N}_y)$
5 Conclusion

We have reviewed criteria for causal inference related to Granger causality and proposed some new ones in order to complete a unified framework of criteria and measures to test for causality in a parametric and non-parametric way, in the time or spectral domain, and for bivariate or multivariate processes. These criteria and measures are summarized in Tables 1-4. This offers
an integrating picture comprising the measures proposed by Geweke [25, 26]
and partial directed coherence [5]. The contributions of this Chapter are complementary to the work in [57] and [14]. The distinction between parametric
and non-parametric criteria further emphasizes the necessity to check the
validity of the autoregressive representation when applying a measure which
inherently relies on the definition of the innovation processes. The distinction between criteria and measures stresses that causal inference and the
characterization of the dynamic dependencies resulting from them should be
addressed by different approaches [16, 15].
Finally, we notice again that we have here focused on the formal relation between the different criteria and measures. For practical applications, problems like the influence of hidden variables [21] or spatial and temporal aggregation [23] constitute serious challenges that prevent these criteria from being successfully applied. For example, in the case of brain causal analysis
it is now clear that a successful characterization can only be obtained if
the application of these criteria is combined with a biologically plausible reconstruction of how the recorded data are generated by the neural activity
[50, 58, 23]. Even at a more practical level, estimating from small data sets
the information-theoretic measures to test for causality is complicated [34].
Most often stationarity is assumed for simplification, but event-related estimation is also possible [3, 27]. We believe that a clear understanding of the
underlying criteria for causal inference and their relation to measures can
also help to better interpret and address these practical problems.
6 Appendix: Fisher information measure of Granger causality for linear autoregressive Gaussian processes
In Eq. 12 we showed how the criterion of Granger causality of Eq. 1 can be tested using the Fisher information. For linear Gaussian autoregressive processes, considering the definition of the Fisher information (Eq. 11), we have

$$E_{y^t}[F(X_{t+1}; y^t|X^t)] = \int dy^t\, p(y^t) \int dX^t\, p(X^t|y^t) \int dX_{t+1}\, p(X_{t+1}|X^t, y^t) \left( \frac{\partial \log p(X_{t+1}|X^t, y^t)}{\partial y^t} \right)^2. \qquad (77)$$
We start by considering the term $F(X_{t+1}; y^t|x^t)$ corresponding to the innermost integral. For a Gaussian process, $p(X_{t+1}|x^t, y^t) = N(\mu(X_{t+1}|x^t, y^t), \sigma(X_{t+1}|x^t, y^t))$ is Gaussian. Therefore

$$F(X_{t+1}; y^t|x^t) = \int dX_{t+1}\, N(\mu(X_{t+1}|x^t, y^t), \sigma(X_{t+1}|x^t, y^t)) \left( \frac{\partial \log \sqrt{2\pi}\,\sigma(X_{t+1}|x^t, y^t)}{\partial y^t} + \frac{\partial\, \frac{1}{2} \Big( \frac{x_{t+1} - \sum_{s=0}^{\infty} (a^{(xy)}_{xx,s} X_{t-s} + a^{(xy)}_{xy,s} Y_{t-s})}{\sigma(X_{t+1}|x^t, y^t)} \Big)^2}{\partial y^t} \right)^2. \qquad (78)$$
The first summand inside the integral is zero because the term on which the derivative acts is independent of $y^t$. For the second summand, since it is linear, we consider for simplification just the partial derivative with respect to a single variable $y_t$. We get

$$F(X_{t+1}; y_t|x^t) = \int dX_{t+1}\, N(\mu(X_{t+1}|x^t, y_t), \sigma(X_{t+1}|x^t, y_t)) \left( \frac{x_{t+1} - \big( \sum_{s=0}^{\infty} a^{(xy)}_{xx,s} X_{t-s} + a^{(xy)}_{xy,t} y_t \big)}{\sigma(X_{t+1}|x^t, y_t)} \frac{a^{(xy)}_{xy,t}}{\sigma(X_{t+1}|x^t, y_t)} \right)^2 = \frac{a^2_{xy,t}}{\sigma^2(X_{t+1}|x^t, y_t)}. \qquad (79)$$
This term is independent both of $x^t$ and $y^t$, so that the other two integrations in Eq. 77 can be done straightforwardly. We have

$$E_{y^t}[F(X_{t+1}; y_t|X^t)] = \frac{a^2_{xy,t}}{\sigma^2(X_{t+1}|x^t, y^t)}, \qquad (80)$$

so that each coefficient in the autoregressive representation can be given a meaning in terms of the Fisher information. This relation further illuminates the relation between the coefficients and $G_{Y \to X}$ [55, 40]:

$$G_{Y \to X} = 0 \Leftrightarrow a^{(xy)}_{xy,s} = 0 \ \forall s. \qquad (81)$$
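The result of Eq. 80 can be illustrated with a short Monte Carlo sketch, with parameter values of our own choosing: for the conditional Gaussian of Eq. 14 the score with respect to $y_t$ is $(x_{t+1} - \mu)\, a_{xy,t}/\sigma^2$, and its second moment reproduces $a^2_{xy,t}/\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
a_xx, a_xy, sigma = 0.5, 0.4, 1.0             # illustrative coefficients
n = 200_000
x_t = rng.standard_normal(n)                  # values of the conditioning past
y_t = rng.standard_normal(n)
mu = a_xx * x_t + a_xy * y_t                  # conditional mean of X_{t+1}
x_next = mu + sigma * rng.standard_normal(n)
score = (x_next - mu) * a_xy / sigma**2       # d log p(x_{t+1}|x_t, y_t) / d y_t
print((score**2).mean(), a_xy**2 / sigma**2)  # both ≈ a_xy^2 / sigma^2 (Eq. 80)
```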
References

1. Amblard, P.O., Michel, O.: On directed information theory and Granger causality graphs. J Comput Neurosci 30, 7–16 (2011)
2. Ancona, N., Marinazzo, D., Stramaglia, S.: Radial basis function approach to nonlinear Granger causality of time series. Phys Rev E 70(5), 056221 (2004)
3. Andrzejak, R.G., Ledberg, A., Deco, G.: Detection of event-related time-dependent directional couplings. New J Phys 8, 6 (2006)
4. Ay, N., Polani, D.: Information flows in causal networks. Advances in Complex Systems 11, 17–41 (2008)
5. Baccala, L., Sameshima, K.: Partial directed coherence: a new concept in neural structure determination. Biol Cybern 84(1), 463–474 (2001)
6. Baccala, L., Sameshima, K., Ballester, G., Do Valle, A., Timo-Iaria, C.: Studying the interaction between brain structures via directed coherence and Granger causality. Appl Sig Process 5, 40–48 (1999)
7. Barnett, L., Barrett, A.B., Seth, A.K.: Granger causality and transfer entropy are equivalent for Gaussian variables. Phys Rev Lett 103(23), 238701 (2009)
8. Besserve, M., Schoelkopf, B., Logothetis, N.K., Panzeri, S.: Causal relationships between frequency bands of extracellular signals in visual cortex revealed by an information theoretic analysis. J Comput Neurosci 29(3), 547–566 (2010)
9. Bressler, S.L., Richter, C.G., Chen, Y., Ding, M.: Cortical functional network organization from autoregressive modeling of local field potential oscillations. Stat Med 26(21), 3875–3885 (2007)
10. Bressler, S.L., Seth, A.K.: Wiener-Granger causality: A well established methodology. Neuroimage 58(2), 323–329 (2011)
11. Brovelli, A., Ding, M., Ledberg, A., Chen, Y., Nakamura, R., Bressler, S.L.: Beta oscillations in a large-scale sensorimotor cortical network: Directional influences revealed by Granger causality. P Natl Acad Sci USA 101, 9849–9854 (2004)
12. Chamberlain, G.: The general equivalence of Granger and Sims causality. Econometrica 50(3), 569–581 (1982)
13. Chen, Y., Bressler, S., Ding, M.: Frequency decomposition of conditional Granger causality and application to multivariate neural field potential data. J Neurosci Meth 150(2), 228–237 (2006)
14. Chicharro, D.: On the spectral formulation of Granger causality. Biol Cybern 105(5-6), 331–347 (2011)
15. Chicharro, D., Ledberg, A.: Framework to study dynamic dependencies in networks of interacting processes. Phys Rev E 86, 041901 (2012)
16. Chicharro, D., Ledberg, A.: When two become one: The limits of causality analysis of brain dynamics. PLoS One 7(3), e32466 (2012)
17. Cover, T.M., Thomas, J.A.: Elements of Information Theory, 2nd edn. John Wiley and Sons (2006)
18. Ding, M., Chen, Y., Bressler, S.L.: Granger causality: Basic theory and application to neuroscience. In: Handbook of Time Series Analysis: Recent Theoretical Developments and Applications, pp. 437–460. Wiley-VCH Verlag (2006)
19. Eichler, M.: A graphical approach for evaluating effective connectivity in neural systems. Phil Trans R Soc B 360, 953–967 (2005)
20. Eichler, M.: On the evaluation of information flow in multivariate systems by the directed transfer function. Biol Cybern 94(6), 469–482 (2006)
21. Eichler, M.: Granger causality and path diagrams for multivariate time series. J Econometrics 137, 334–353 (2007)
22. Faes, L., Nollo, G., Porta, A.: Information-based detection of nonlinear Granger causality in multivariate processes via a nonuniform embedding technique. Phys Rev E 83(5), 051112 (2011)
23. Friston, K.J.: Functional and effective connectivity: A review. Brain Connectivity 1(1), 13–36 (2012)
24. Gelfand, I., Yaglom, A.: Calculation of the amount of information about a random function contained in another such function. Am Math Soc Transl Ser 2(12), 199–246 (1959)
25. Geweke, J.F.: Measurement of linear dependence and feedback between multiple time series. J Am Stat Assoc 77(378), 304–313 (1982)
26. Geweke, J.F.: Measures of conditional linear dependence and feedback between time series. J Am Stat Assoc 79(388), 907–915 (1984)
27. Gómez-Herrero, G., Wu, W., Rutanen, K., Soriano, M.C., Pipa, G., Vicente, R.: Assessing coupling dynamics from an ensemble of time series. arXiv:1008.0539v1 (2010)
28. Gourevitch, B., Le Bouquin-Jeannes, R., Faucon, G.: Linear and nonlinear causality between signals: methods, examples and neurophysiological applications. Biol Cybern 95(4), 349–369 (2006)
29. Granger, C.W.J.: Economic processes involving feedback. Information and Control 6, 28–48 (1963)
30. Granger, C.W.J.: Investigating causal relations by econometric models and cross-spectral methods. Econometrica 37(3), 424–438 (1969)
31. Granger, C.W.J.: Testing for causality: A personal viewpoint. J Econ Dynamics and Control 2(1), 329–352 (1980)
32. Guo, S., Seth, A.K., Kendrick, K.M., Zhou, C., Feng, J.: Partial Granger causality - eliminating exogenous inputs and latent variables. J Neurosci Meth 172(1), 79–93 (2008)
33. Hiemstra, C., Jones, J.D.: Testing for linear and nonlinear Granger causality in the stock price-volume relation. J Financ 49(5), 1639–1664 (1994)
34. Hlaváčková-Schindler, K., Paluš, M., Vejmelka, M., Bhattacharya, J.: Causality detection based on information-theoretic approaches in time-series analysis. Phys Rep 441, 1–46 (2007)
35. Kaminski, M., Blinowska, K.: A new method of the description of the information flow in the brain structures. Biol Cybern 65(3), 203–210 (1991)
36. Kramer, G.: Directed information for channels with feedback. PhD dissertation, Swiss Federal Institute of Technology, Zurich (1998)
37. Kuersteiner, G.: Granger-Sims causality, 2nd edn. The New Palgrave Dictionary of Economics (2008). DOI 10.1057/9780230226203.0665
38. Kullback, S.: Information Theory and Statistics. Dover, Mineola, NY (1959)
39. Lizier, J.T., Prokopenko, M., Zomaya, A.Y.: Local information transfer as a spatiotemporal filter for complex systems. Phys Rev E 77, 026110 (2008)
40. Lütkepohl, H.: New Introduction to Multiple Time Series Analysis. Springer-Verlag, Berlin (2006)
41. Marinazzo, D., Pellicoro, M., Stramaglia, S.: Causal information approach to partial conditioning in multivariate data sets. Comput Math Meth Med p. 303601 (2012). DOI 10.1155/2012/303601
42. Marko, H.: Bidirectional communication theory - generalization of information theory. IEEE T Commun 12, 1345–1351 (1973)
43. Massey, J.: Causality, feedback and directed information. Proc Intl Symp Info Th Appli, Waikiki, Hawaii, USA (1990)
44. Paluš, M., Komárek, V., Hrnčíř, Z., Štěrbová, K.: Synchronization as adjustment of information rates: Detection from bivariate time series. Phys Rev E 63, 046211 (2001)
45. Pearl, J.: Causality: Models, Reasoning, and Inference, 2nd edn. Cambridge University Press, New York (2009)
46. Pereda, E., Quian Quiroga, R., Bhattacharya, J.: Nonlinear multivariate analysis of neurophysiological signals. Prog Neurobiol 77, 1–37 (2005)
47. Permuter, H., Kim, Y., Weissman, T.: Interpretations of directed information in portfolio theory, data compression, and hypothesis testing. IEEE Trans Inf Theory 57(3), 3248–3259 (2009)
48. Priestley, M.: Spectral Analysis and Time Series. Academic Press Inc., San Diego (1981)
49. Quinn, C.J., Coleman, T.P., Kiyavash, N., Hatsopoulos, N.G.: Estimating the directed information to infer causal relationships in ensemble neural spike train recordings. J Comput Neurosci 30, 17–44 (2011)
50. Roebroeck, A., Formisano, E., Goebel, R.: The identification of interacting networks in the brain using fMRI: Model selection, causality and deconvolution. NeuroImage 58(2), 296–302 (2011)
51. Rozanov, Y.: Stationary Random Processes. Holden-Day, San Francisco, USA (1967)
52. Schelter, B., Timmer, J., Eichler, M.: Assessing the strength of directed influences among neural signals using renormalized partial directed coherence. J Neurosci Meth 179(1), 121–130 (2009)
53. Schelter, B., Winterhalder, M., Eichler, M., Peifer, M., Hellwig, B., Guschlbauer, B., Lucking, C., Dahlhaus, R., Timmer, J.: Testing for directed influences among neural signals using partial directed coherence. J Neurosci Meth 152(1-2), 210–219 (2006)
54. Schreiber, T.: Measuring information transfer. Phys Rev Lett 85, 461–464 (2000)
55. Sims, C.: Money, income, and causality. American Economic Rev 62(4), 540–552 (1972)
56. Solo, V.: On causality and mutual information. In: Proceedings of the 47th IEEE Conference on Decision and Control, pp. 4639–4944 (2008)
57. Takahashi, D.Y., Baccala, L.A., Sameshima, K.: Information theoretic interpretation of frequency domain connectivity measures. Biol Cybern 103(6), 463–469 (2010)
58. Valdes-Sosa, P., Roebroeck, A., Daunizeau, J., Friston, K.: Effective connectivity: Influence, causality and biophysical modeling. Neuroimage 58(2), 339–361 (2011)
59. Vicente, R., Wibral, M., Lindner, M., Pipa, G.: Transfer entropy: A model-free measure of effective connectivity for the neurosciences. J Comput Neurosci 30, 45–67 (2010)
60. Wiener, N.: The theory of prediction. In: Modern Mathematics for Engineers, pp. 165–190. McGraw-Hill, New York (1956)