CAUSALITY MEASURES IN NEUROSCIENCE:
WIENER-GRANGER CAUSALITY AND TRANSFER
ENTROPY APPLIED TO INTRACRANIAL EEG DATA
A 30 credit project submitted to the University of Manchester
for the degree of Master of Mathematics
2011
Kristjan Korjus
School of Mathematics
Contents

Abstract 5
Acknowledgments 6
1 Introduction 7
  1.1 Causality Measures 8
    1.1.1 Granger-Wiener Causality 8
    1.1.2 Transfer Entropy 9
  1.2 Problems in Neuroscience and application of causality measures 10
2 Wiener-Granger Causality 11
  2.1 Introduction 11
  2.2 Mathematical Description of a Linear Model 12
  2.3 Limitations 12
    2.3.1 Linearity 13
    2.3.2 Stationarity 13
    2.3.3 Dependence on observed variables 14
  2.4 Toolbox for Granger causality 14
3 Transfer Entropy 15
  3.1 Introduction 15
  3.2 Information Theory and Shannon Entropy 16
    3.2.1 Properties and extensions of Shannon Entropy 18
  3.3 Transfer Entropy 22
    3.3.1 Introduction 22
    3.3.2 Derivation of Transfer Entropy 23
  3.4 Application of Transfer Entropy in Neuroscience 24
    3.4.1 Introduction 25
    3.4.2 Computation of Transfer Entropy 26
    3.4.3 Summary of the parameters 30
    3.4.4 Particular problems in neuroscience data 31
    3.4.5 Delayed interactions 33
  3.5 Toolbox for transfer entropy 34
4 Granger Causality and Transfer Entropy with Gaussian Variables 35
5 Briefly about the Human Brain 39
  5.1 History of the scientific approach towards brain 39
  5.2 Brain as we know it today 41
  5.3 Theory of visual system 41
  5.4 Neuroimaging: EEG and intracranial EEG 43
6 Applying Causality Measures to Intracranial EEG data 45
  6.1 Methods 45
    6.1.1 Subjects 45
    6.1.2 Experimental set up 45
    6.1.3 Intracranial recordings 48
    6.1.4 Hypothesis 49
    6.1.5 Choosing the parameters for transfer entropy 50
    6.1.6 Choosing the parameters for Granger causality 51
    6.1.7 List of tests taken 51
  6.2 Results 52
    6.2.1 Behavioural results 52
    6.2.2 Results of transfer entropy 53
    6.2.3 Results of Granger causality 54
  6.3 Discussion 55
7 Future directions for the causality measure research for neuroscience 57
8 Conclusions 59
Bibliography 60
A Codes and instruction for calculating transfer entropy 65
B Codes and instruction for calculating Granger causality 67

Word count: 14 000
The University of Manchester
Kristjan Korjus
Master of Mathematics
Causality Measures in Neuroscience: Wiener-Granger Causality and Transfer Entropy applied to intracranial EEG data
May 18, 2011
The goal of this project is to study the mathematical tools applied in brain research.
In particular, the focus is on the analysis of causal relationships between neuronal
signals. For that purpose, Granger causality and transfer entropy - two main causality
measures used in neuroscience - are described, compared, analysed and applied to
intracranial EEG data.
Granger causality is a statistical measure of causal influence based on prediction, implemented via a vector autoregressive model. Developed originally by Clive Granger in the 1960s for the field of econometrics, it has since found application in a variety of fields, particularly in neuroscience.
As a newer and more complex measure, transfer entropy is the main focus of the current project. First, an overview of information theory is given. Information theory was developed to find fundamental limits on signal processing, such as compressing, reliably storing and communicating data. Most importantly, Shannon entropy, which measures the uncertainty of information, is defined.

Information theory leads to transfer entropy, defined by Thomas Schreiber in 2000. A derivation of transfer entropy is given, and its application to neuroscience, developed in 2010 by Raul Vicente and colleagues, is explained. In the process, a couple of important concepts, such as reconstruction of the state space and estimation of the entropies, are described, along with particular problems that arise in neuroscience data.
In the final theoretical part, based on Lionel Barnett’s work in 2009, it is shown that
for Gaussian variables, Granger causality and transfer entropy are entirely equivalent,
therefore bridging autoregressive and information-theoretic approaches.
Most importantly, both methods are applied to the intracranial EEG data of epilepsy
patients to study the interactions between the different brain areas during visual
recognition and to directly compare Granger causality and transfer entropy on the
same signals.
The results of the causality analysis confirm the main hypothesis that frontal regions
of the brain are actively influencing visual areas during visual recognition. Specifically, both methods show that there is a significant information flow from frontal
regions to visual areas. In addition, transfer entropy indicates that prior knowledge
of the stimulus is associated with stronger information flow from frontal to visual
regions.
Based on the work, it is concluded that transfer entropy and Granger causality are
powerful tools for neuroscience.
Acknowledgments
I would like to thank my friend Jaan Aru, a Ph.D. student at the Max Planck Institute for Brain Research, who introduced me to such an interesting field as neuroscience and who provided the intracranial EEG data used in the project. I am also thankful to my supervisor, Prof. Georgi Boshnakov, who has been very helpful; to Dr. Raul Vicente, a postdoc at the Max Planck Institute for Brain Research, who introduced and explained to me the concept of transfer entropy; to my friends Ivo Kund and Tom Harrison; and to my love Merit Valge.
Chapter 1
Introduction
"This is the way science works: Begin with simple, clearly formulated, tractable questions that can pave the way for eventually answering the Big Questions, such as 'What are qualia?', 'What is the self?', and even 'What is consciousness?'"
- V.S. Ramachandran
The brain is one of the weirdest objects in the universe. It is like a Turing machine asking about its own capacities, or like a Gödel function which wants to know its own properties.
The main aim of neuroscience is to understand the phenomena of intelligence and subjective consciousness. Although the last decade of the 20th century was labelled the Decade of the Brain [1], not much is known about the brain and the main goals are far from being met. It is not known how we record memories, how our extremely slow hardware1 can work so efficiently with incomplete information, or how subjective consciousness can arise at all from physical processes.
As in every other field of science, one can approach the understanding of the brain in
small steps. Neuroscience is getting gradually more complex and theories from various
1 Chemical transmission of signals across synapses is hundreds of times slower than our silicon hardware.
fields, such as mathematics, information technology and physics, are converging to tackle the complicated problems. Although there are many interesting sub-problems in the field of brain studies, this project concentrates on the effective connectivity, or causality, between different parts of the brain.

If one can understand how different parts of the brain communicate with each other under various circumstances, it can give huge hints towards understanding (or even replicating) the mechanisms of the brain. As the area is still very young, at the moment the problem reduces to analysing ordinary time series recorded from different areas of the brain. So the question arises: how does one measure connectivity or causality in time series?
1.1
Causality Measures
"Please, ma chérie. I have told you. We are all victims of causality. I drank too much wine, I must take a piss. Cause and effect."
- Merovingian, The Matrix Reloaded (2003)
The problem of causality has been a puzzling question for philosophers and scientists for centuries. In his essay "Testing for Causality" [2], C.W.J. Granger puts forward the view that philosophers have never converged on a definition of causality and that their general attitude is destructive rather than constructive. He says that in order to be constructive we need workable definitions of causality, and that merely criticising shortcomings is useless unless one proposes a better definition.
1.1.1 Granger-Wiener Causality
Granger himself puts forward the idea that causality means an increase in predictive power. The basic notion of Granger causality is rather simple: Y "G-causes" X if the past of Y helps to predict X. To be more precise, one can consider two time series X_t and Y_t, where t indexes the terms of the sequences, and try to forecast X_{t+1} using only past terms of X itself.
Then one tries to predict X_{t+1} again, but in addition to the past terms of X, the past terms of Y are also included in the prediction mechanism. If the prediction is found to be significantly more accurate, then the past of Y appears to contain some information about the future of X.
It is accepted that Granger causality is not a perfect measure of causality, but as it is easy to understand and implement, and it still captures some aspects of causality, it is a widely used measure.
The actual definition is a bit more complicated, due to the fact that some other variables might cause both X and Y. Granger causality is described thoroughly in Chapter 2, together with its shortcomings and applications.
1.1.2 Transfer Entropy
Another interesting way to tackle causality comes from information theory. Transfer entropy is an information-theoretic measure developed by Thomas Schreiber [3] in 2000, which tries to capture the flow of entropy from one subsystem to another. Information theory itself studies the properties of data and signals. It explains the redundancy of information, the capacities of information channels and many other phenomena [4]. Entropy can be thought of as the uncertainty in the next bit of the data. Transfer entropy measures the reduction of uncertainty between subsystems, which can be interpreted as the effective connectivity between systems. First, information theory is explained, and then a full description and mathematical definitions of transfer entropy are given in Chapter 3.
There are a couple of issues that one needs to overcome before applying transfer entropy in neuroscience, such as estimation of the state space, instantaneous mixing and delayed interactions, all of which are described in Chapter 3. Most of this work was done by Raul Vicente in 2010 [5], and it forms the basis of the current work. Alexander Kraskov's PhD thesis, published in 2004 [6], has also contributed a lot to the field.
As one of the main ideas of the project is to compare these two causality measures, an interesting side result by Lionel Barnett is also presented in Chapter 4. The result says that for Gaussian variables, transfer entropy and Granger causality are actually equivalent.
1.2 Problems in Neuroscience and application of causality measures
Our understanding of the brain, and particularly of the visual system, has evolved over time. In the early days, it was believed that most of the information flowed in only one direction: from the eye to the abstract processing areas of the brain, constructing the reality around us. With the discovery of various psychological phenomena and anatomical features, it was soon understood that actually more information travels in the opposite direction. The brain is constantly making predictions, trying to construct the best possible interpretation of the world using its whole knowledge space. A brief history of neuroscience and current ideas about the visual system are given in Chapter 5.
In order to study the interactions between the areas of the brain which deal with abstract information and the early visual areas which process the visual input, a specific experimental paradigm was created by Jaan Aru, a PhD student at the Max Planck Institute for Brain Research. Rare intracranial data was gathered by the Department of Epileptology, University of Bonn, Germany. Both of the methods, Granger causality and transfer entropy, were applied to the data set. The experimental set up, the data and the results are analysed in Chapter 6.
In Chapter 7, some future directions are discussed.
Chapter 2
Wiener-Granger Causality
2.1 Introduction
In "Testing for Causality" [2], C.W.J. Granger discusses his personal viewpoint on causality.
In the introduction, the simple idea of Granger causality was presented as an increase in prediction power, but some other variables might affect both of the original time series. To make the definition a bit stronger, a vector of other variables W_t is added to the prediction mechanism. It can be thought of as the universe of all the necessary knowledge.
So, let us predict X_{t+1} first from X_t and W_t, and then from X_t, W_t and Y_t. If the predictive power increases after the addition of the time series Y_t, then Y Granger causes X. A larger and better knowledge base W makes Granger causality represent our intuitive understanding of causality more accurately.
Normally, its mathematical description is based on linear regression modelling of stochastic processes. Many non-linear extensions are available as well, but they become increasingly complicated and unwieldy.
2.2 Mathematical Description of a Linear Model
One can consider a bivariate linear autoregressive model of two time series X1 and
X2 :
X_1(t) = \sum_{j=1}^{p} A_{11,j} X_1(t-j) + \sum_{j=1}^{p} A_{12,j} X_2(t-j) + E_1(t)    (2.1)

X_2(t) = \sum_{j=1}^{p} A_{21,j} X_1(t-j) + \sum_{j=1}^{p} A_{22,j} X_2(t-j) + E_2(t),
where p is the number of lagged observations included in the model and E contains
residuals for each time series. All the coefficients of the model are denoted by A.
If the variance of E_1 (or E_2) is reduced by the inclusion of the X_2 (or X_1) terms, then it is said that X_2 (or X_1) Granger causes X_1 (or X_2). In other words, if the coefficients A_{12,j} (or A_{21,j}) are significantly different from zero, then causality is detected.

Assuming covariance stationarity of X_1 and X_2, significance can be tested by an F-test of the null hypothesis that A_{12} = 0 (or A_{21} = 0). The magnitude of the Granger causality is estimated by the logarithm of the corresponding F-test statistic. [7]
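As a concrete illustration of equation 2.1, the following sketch fits the restricted model (p lags of X_1 only) and the full model (p lags of X_1 and X_2) by ordinary least squares and forms the F statistic. The function names `lagged` and `granger_stat` are illustrative, not part of any toolbox, and zero-mean, covariance-stationary data is assumed (no intercept term).

```python
import numpy as np

def lagged(x, p):
    """Design matrix whose row for time t holds (x[t-1], ..., x[t-p])."""
    return np.column_stack([x[p - j:len(x) - j] for j in range(1, p + 1)])

def granger_stat(x1, x2, p):
    """F statistic for the null hypothesis 'x2 does not Granger-cause x1'."""
    y = x1[p:]
    X_r = lagged(x1, p)                    # restricted: past of x1 only
    X_f = np.hstack([X_r, lagged(x2, p)])  # full: past of x1 and x2

    def rss(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return float(np.sum((y - X @ beta) ** 2))

    rss_r, rss_f = rss(X_r), rss(X_f)
    n, k_full = len(y), X_f.shape[1]
    # Reduction in residual variance per added lag, scaled by the
    # full-model residual variance.
    return ((rss_r - rss_f) / p) / (rss_f / (n - k_full))
```

The magnitude of the causality can then be reported as the logarithm of this statistic, and its significance read off the F distribution with (p, n - 2p) degrees of freedom.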
The linear model of Granger causality can be extended to the n-variable case, which is rather useful, as pairwise analysis of time series can give incorrect results. For example, pairwise analysis cannot distinguish between the two possible connectivity patterns shown in Figure 2.1. Another instance where a multivariate treatment would be better is a situation where one variable causally influences the other two, but with a delay. The n-variable analysis is one of the strengths of Granger causality. As one can later see, the method of transfer entropy cannot distinguish between these two situations.
2.3 Limitations
Granger causality has many limitations.
Figure 2.1: Two possible connectivities that cannot be distinguished by pairwise
analysis. Source: [8]
2.3.1 Linearity
The mathematical formulation given in equation 2.1 only accounts for linear information transfer. This is a problem in complex systems (such as the brain), because much information is also transferred non-linearly. There are many approaches to tackling the issue [8].
Winrich A. Freiwald and others used a method in which they divided non-linear data into small regions that they assumed to be locally linear [9]. Additionally, they generalised the definition of Granger causality so that it is also valid for non-linear systems.

Yonghong Chen and others analysed many different non-linear time series with Extended Granger causality [10]. They showed that in non-linear cases the Extended Granger causality method gave better results than the linear model, although it requires a massive amount of data for reliable analysis.

The linearity assumption is probably the most important shortcoming, and the reason for developing additional methods which take non-linear interactions into account and are preferably model-free, like transfer entropy.
2.3.2 Stationarity
In the current formulation it is strongly assumed that the time series are stationary. Unfortunately, in some cases this assumption does not hold, and different solutions to overcome the issue have recently been offered [8].

One possible solution was proposed by Wolfram Hesse and others. They showed that if the data is stationary over short time intervals, then it is possible to use the so-called windowing technique, where the causality measures are calculated for small overlapping time frames [11].
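The windowing technique can be sketched as follows; `windowed_measure` is an illustrative helper (not code from [11]) that evaluates any bivariate measure on short overlapping segments, so that only local stationarity is required.

```python
import numpy as np

def windowed_measure(x, y, measure, width, step):
    """Evaluate a bivariate measure on overlapping windows of two series.

    The data are assumed stationary only within each short window, so
    `measure` (any function of two equal-length segments) is computed
    for every window separately, giving a trace of the measure over time.
    """
    starts = range(0, len(x) - width + 1, step)
    return [measure(x[s:s + width], y[s:s + width]) for s in starts]

# Example with a simple stand-in measure (windowed correlation):
t = np.linspace(0.0, 8.0 * np.pi, 400)
x = np.sin(t)
y = np.sin(t) + 0.1 * np.random.default_rng(0).standard_normal(400)
trace = windowed_measure(x, y, lambda a, b: float(np.corrcoef(a, b)[0, 1]),
                         width=100, step=50)
```

In practice the stand-in correlation would be replaced by a Granger statistic computed on each window.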
Another approach was offered by Mingzhou Ding and Yonghong Chen, who analysed each trial separately, assuming each trial is individually stationary [12].
2.3.3 Dependence on observed variables
As mentioned earlier, Granger causality cannot distinguish actual direct causality from an interaction via a third process which is not included in the analysis. The choice of the factors is therefore important, and Granger causality should not be directly interpreted as physical causality.
2.4 Toolbox for Granger causality
In this report, Anil K. Seth's toolbox for Granger causality is used [13]. The toolbox provides a variety of MATLAB (MathWorks, Inc.) functions for evaluating the causal connectivity in the dynamics of a set of variables and for calculating various statistical inferences.

The toolbox is based on the Granger causality defined above, implemented via multivariate vector autoregressive (MVAR) modelling. In the two-variable case this is the same as the simple linear regression model, and the theory above still holds. There is no graphical interface for the toolbox; all the functions used in this report are shown in Appendix B.
Chapter 3
Transfer Entropy
3.1 Introduction
One can consider a sentence in English, for example "I love mathematics more than anything else", and decode its meaning instantly.1 But if a couple of characters were missing, as in the sentence "I lve mathmtics more than anyting else", one would still understand the full meaning without losing anything at all. This simple example gives us a hint that if a small part of the information goes missing, it might still be possible to restore the original message: there is some redundancy of information in the first sentence. It also shows that it is possible to compress the data for storage without losing any information. One can find similar ideas everywhere: playing music from a scratched CD, talking on a mobile phone or compressing data to a .zip file. The field which measures and quantifies these phenomena is called information theory.
Information theory is important for this study. It makes it possible to develop a model-free measure of effective connectivity which assumes neither stationarity nor linearity, unlike Granger causality. The method is called transfer entropy, and it measures the reduction of uncertainty.
1 The decoding process involves most of the brain, but mainly Broca's and Wernicke's areas. The exact full mechanism of understanding language is still unknown [14, p.19].
It has been a long journey from information theory to transfer entropy and then
to actual applications in neuroscience. The process will be explained here from the
beginning.
3.2 Information Theory and Shannon Entropy
”You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so
it already has a name. In the second place, and more important, no one
really knows what entropy really is, so in a debate you will always have
the advantage.”
- John von Neumann to Claude E. Shannon
Information theory was formulated by Claude E. Shannon, who published the legendary paper "A Mathematical Theory of Communication" in 1948 [4]. Information theory was developed to find the fundamental limits of signal processing operations, such as compressing data and reliably storing and communicating data [15]. Similar ideas have since been used in many different fields2, including neurobiology and neuroscience [15–17].
The measure of information is called entropy. Entropy quantifies the uncertainty in predicting the value of a random variable. For example, specifying the outcome of a fair coin flip provides less information than specifying the outcome of a roll of a die. In the context of the earlier example, a long string of uniformly distributed random letters has higher entropy than our everyday language, which has a lot of redundancy, as shown above.3
2 Including statistical inference, natural language processing, cryptography, networks other than communication networks (as in neurobiology), the evolution and function of molecular codes, model selection in ecology, thermal physics, quantum computing, plagiarism detection and other forms of data analysis [15].

3 The English prose with the highest entropy is possibly James Joyce's "Finnegans Wake", which is written in a largely made-up language consisting of a mixture of standard English lexical items and neologistic multilingual puns [4, 18].
The most natural choice for the measure is a logarithmic function, for the following reasons:

1. Practical reasons: important parameters such as time, bandwidth, number of relays, etc. vary exponentially with the number of possibilities but linearly with the logarithm of the number of possibilities. For example, adding an extra communication channel squares the number of possibilities but only doubles their logarithm.

2. Intuitive reasons: essentially the same as (1). Doubling the time spent sending information intuitively doubles the amount of information sent, which a logarithmic measure captures very well.

3. Mathematical reasons: many operations are simpler in terms of the logarithm than in terms of the number of possibilities.
Entropy is usually expressed as the average number of bits needed for storage or communication. A bit, or binary digit, is the basic unit of information in computing; it has two states, {0, 1}, which makes 2 a natural choice for the base of the logarithm.4

More precisely, suppose we have a set of possible symbols a_1, a_2, ..., a_n whose probabilities of occurrence are p_1, p_2, ..., p_n, and that this is all we know about the system. Is it possible to find a reasonable measure of the uncertainty of the outcome?
This measure, say H(p_1, p_2, ..., p_n), should have the following properties:

1. H should be continuous in the p_i.

2. If all the symbols are equally likely, the amount of choice should increase as the number of symbols increases. More precisely, if p_i = 1/n for all i, then H should be a monotonically increasing function of n.

3. If a choice is broken down into successive choices, then the original H should be the weighted sum of the individual values of H.
4 It will be assumed that the base of the logarithm is always 2 unless stated otherwise.
Theorem 1. The only H satisfying the three assumptions above is of the form

H = -K \sum_{i=1}^{n} p_i \log p_i,    (3.1)

where p_i is the probability of the ith symbol occurring in the message and the constant K is chosen arbitrarily to give meaningful measurement units.
Shannon proved the result in his original paper, but the proof will not be given here. The constant K is chosen to be 1 and is not mentioned any more. As said earlier, the base of the logarithm is 2, as binary digits are the basic unit of information in computing. The resulting measure of information is called Shannon entropy, and it is written out once again.
Definition 1. Shannon entropy is given by the formula

H = -\sum_{i=1}^{n} p_i \log p_i,    (3.2)

where p_i is the probability of the ith symbol occurring in the message.
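Equation 3.2 translates directly into a few lines of code. The sketch below (the function name is illustrative) uses the usual convention that terms with p_i = 0 contribute nothing to the sum:

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits (equation 3.2); 0 log 0 is taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin carries 1 bit per flip; a fair die log2(6) bits per roll,
# matching the intuition that the die outcome is more informative.
coin = shannon_entropy([0.5, 0.5])
die = shannon_entropy([1 / 6] * 6)
```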
Shannon Entropy has many interesting properties which will be explained in the
following subsection.
3.2.1 Properties and extensions of Shannon Entropy
To illustrate a number of the interesting properties of H, let us consider a system with two possibilities, occurring with probabilities p and q = 1 - p. This system has Shannon entropy H = -(p \log p + q \log q), which is a function of p, as seen in Figure 3.1.
The entropy is greater than or equal to zero

H = 0 if and only if one of the p_i is 1 and all the others are zero. This means that H is zero only if we are certain of the outcome; otherwise it is positive.
Figure 3.1: The entropy is maximized at 1 bit per trial when the two possible outcomes are equally probable, as in an unbiased coin toss. Source: [15]
Maximum value

If all the outcomes are equally likely, then H attains its maximum value, which is equal to \log n. This coincides with our intuition of maximum uncertainty.
Equalisation increases the entropy

Any change which makes the probabilities p_1, p_2, ..., p_n more even increases the entropy H of the system. More precisely, if the p'_i are obtained from the p_i by any operation of the form

p'_i = \sum_j a_{ij} p_j    for i = 1, ..., n,

where \sum_i a_{ij} = \sum_j a_{ij} = 1 and all a_{ij} >= 0, then H increases (or stays the same if the operation is just a permutation of the probabilities).
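This property can be checked numerically: mixing a distribution with a doubly stochastic matrix (all rows and columns summing to 1) pushes the probabilities toward uniformity and can only raise H. The matrix and distribution below are arbitrary illustrative choices:

```python
import math
import numpy as np

def shannon_entropy(probs):
    """Shannon entropy in bits; 0 log 0 is taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A doubly stochastic mixing operation: rows and columns each sum to 1.
A = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.2, 0.7]])
p = np.array([0.80, 0.15, 0.05])
p_mixed = A @ p   # still a probability distribution, but more even
```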
Joint Entropy
The entropy H was defined for a discrete distribution but if a random variable X has
that distribution we also say that H is the entropy of X and write H(X).
The same principle of entropy applies to a system with many variables, which motivates the next definition.

Definition 2. The joint entropy H(X, Y) of random variables X and Y is defined as

H(X, Y) = -\sum_{x,y} p(x, y) \log p(x, y),

while

H(X) = -\sum_{x,y} p(x, y) \log \sum_{y} p(x, y)    (3.3)

H(Y) = -\sum_{x,y} p(x, y) \log \sum_{x} p(x, y)    (3.4)

Also,

H(X, Y) \le H(X) + H(Y)    (3.5)

The uncertainty of a joint event is less than or equal to the sum of the individual uncertainties, with equality if X and Y are independent.
Conditional Entropy

If two processes interact with each other and we know the outcome of one of them, this can decrease the unpredictability of the other process (unless the two processes are independent). The idea is captured in the following definition.

Definition 3. The conditional entropy H(X|Y) of random variables X and Y is defined as

H(X|Y) = -\sum_{y \in Y} p(y) \sum_{x \in X} p(x|y) \log p(x|y) = -\sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(y)}

The uncertainty of the joint event of X and Y is the uncertainty of Y plus the uncertainty of X when Y is known. This basic property of conditional entropy is captured by

H(X|Y) = H(X, Y) - H(Y)    (3.6)
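Equation 3.6 is easy to verify numerically for a small example; the joint distribution below is an arbitrary illustrative choice:

```python
import math

# An arbitrary joint distribution p(x, y) on {0,1} x {0,1}.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginal p(y) and the entropies H(X, Y) and H(Y).
p_y = {y: sum(p for (_, y2), p in p_xy.items() if y2 == y) for y in (0, 1)}
H_joint = -sum(p * math.log2(p) for p in p_xy.values())
H_y = -sum(p * math.log2(p) for p in p_y.values())

# Conditional entropy H(X|Y) via the second form in Definition 3.
H_x_given_y = -sum(p * math.log2(p / p_y[y]) for (_, y), p in p_xy.items())
```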
Mutual Information

The mutual information5 of two random variables X and Y is a quantity that measures the mutual dependence of the two variables.

Definition 4. The mutual information I(X; Y) of random variables X and Y is defined as

I(X; Y) = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}

A basic property of the mutual information is that

I(X; Y) = H(X) - H(X|Y)    (3.7)

Mutual information is symmetric:

I(X; Y) = I(Y; X) = H(X) + H(Y) - H(X, Y)    (3.8)

All of these nice and simple properties make Shannon entropy a really good measure. Its popularity has been wide and profound in many areas of science. Figure 3.2 sums up most of the ideas mentioned above.
Figure 3.2: Joint entropy, conditional entropies, individual entropies and mutual
information of two random variables X and Y. Source: [19]
Intuitively, one can think of the entropy H(X) of a random variable X as a measure of uncertainty. In that case, H(X|Y) is the uncertainty of X given that Y is known. This coincides with the intuitive meaning of mutual information as the amount of information that knowing either variable provides about the other. It is summed up once more in the following equalities:

I(X; Y) = H(X) - H(X|Y)
        = H(Y) - H(Y|X)
        = H(X) + H(Y) - H(X, Y)
        = H(X, Y) - H(X|Y) - H(Y|X).

5 Sometimes known by the archaic term transinformation [19].
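The chain of equalities can be checked on the same kind of toy distribution, comparing Definition 4 against the identity I(X; Y) = H(X) + H(Y) - H(X, Y); the distribution is arbitrary:

```python
import math

p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_x = {x: sum(p for (x2, _), p in p_xy.items() if x2 == x) for x in (0, 1)}
p_y = {y: sum(p for (_, y2), p in p_xy.items() if y2 == y) for y in (0, 1)}

def H(dist):
    """Shannon entropy in bits of a dict of probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Mutual information directly from Definition 4 ...
I_direct = sum(p * math.log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in p_xy.items())
# ... and via the entropy identity (3.8).
I_via_entropies = H(p_x) + H(p_y) - H(p_xy)
```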
3.3 Transfer Entropy

3.3.1 Introduction
Important information about the structure of a system consisting of several subsystems can be found by measuring the information production and information exchange of the subsystems.

Mutual information (Definition 4) has been used to quantify the overlap of the information of two systems. Unfortunately, mutual information contains neither dynamical6 nor directional7 information. One possibility is to introduce a time delay into one of the observations, but this still would not capture the true information flow.
Thomas Schreiber derived an information-theoretic measure that quantifies the statistical coherence between systems evolving in time. An information-theoretic measure is especially important for complicated dynamical systems, as it does not require any model assumptions. In his approach, influences from a common source or a common history are excluded by appropriate conditioning of the transition probabilities [3]. The resulting transfer entropy is able to measure asymmetry in the interactions of subsystems.

6 How are the data points in the given time series interacting with each other?
7 Is A influencing B, or B influencing A?
The following derivation more closely follows the structure of Alexander Kraskov's PhD thesis, written in 2004, which gives a slightly simpler derivation.
3.3.2 Derivation of Transfer Entropy
Assume that X_i and Y_j are random variables and that X and Y are time series such that, for example, X_{i+1} denotes the value at the next moment in time with respect to X_i. For simplicity, define random variables consisting of words of length k and l:

X_i^{(k)} = (X_i, ..., X_{i-k+1})
Y_j^{(l)} = (Y_j, ..., Y_{j-l+1})    (3.9)

If X_i and Y_j are discrete random variables, the transition probabilities can be defined as follows:

p(x_{i+1} | x_i^{(k)}, y_j^{(l)}) = prob(X_{i+1} = x_{i+1} | X_i^{(k)} = x_i^{(k)}, Y_j^{(l)} = y_j^{(l)}),    (3.10)

p(y_{j+1} | y_j^{(l)}, x_i^{(k)}) = prob(Y_{j+1} = y_{j+1} | Y_j^{(l)} = y_j^{(l)}, X_i^{(k)} = x_i^{(k)}),    (3.11)

where x_i^{(k)} = (x_i, ..., x_{i-k+1}) is the state of X_i^{(k)} and the state of Y_j^{(l)} is defined as y_j^{(l)} = (y_j, ..., y_{j-l+1}). The transition probability (3.10) denotes the probability of finding X_{i+1} in the state x_{i+1} when X_i^{(k)} and Y_j^{(l)} are in the states x_i^{(k)} and y_j^{(l)}; similarly for Y_{j+1}.
Transfer entropy is closely related to conditional entropy (Definition 3). If the future state x_{i+1} of X depends on the k past states x_i^{(k)} but not on the l past states y_j^{(l)}, then the generalised Markov property p(x_{i+1} | x_i^{(k)}, y_j^{(l)}) = p(x_{i+1} | x_i^{(k)}) holds. If there is any such dependency, it can be quantified by the transfer entropy, which is obtained by comparing two conditional entropies:

T(X_{i+1} | X_i^{(k)}, Y_j^{(l)}) = H(X_{i+1} | X_i^{(k)}) - H(X_{i+1} | X_i^{(k)}, Y_j^{(l)})
= \sum_{x,y} p(x_{i+1}, x_i^{(k)}, y_j^{(l)}) \log \frac{p(x_{i+1} | x_i^{(k)}, y_j^{(l)})}{p(x_{i+1} | x_i^{(k)})}    (3.12)
So the information flow (effective connectivity, causality, transfer entropy) T from system Y to system X is the information about the future of X gathered from both X_i^{(k)} and Y_j^{(l)}, minus the information retrieved from X_i^{(k)} alone. It can be seen that transfer entropy is asymmetric and in essence captures the main idea of Granger causality. Writing out the information flow in the opposite direction, from X to Y, gives the following definition.

Definition 5. The transfer entropy T_{X \to Y} from a random variable X to a random variable Y is given by the formula

T(Y_{j+1} | Y_j^{(l)}, X_i^{(k)}) = \sum_{x,y} p(y_{j+1}, y_j^{(l)}, x_i^{(k)}) \log \frac{p(y_{j+1} | y_j^{(l)}, x_i^{(k)})}{p(y_{j+1} | y_j^{(l)})}.
Transfer entropy can be expressed as a difference of mutual informations:

T(X_{i+1} | X_i^{(k)}, Y_j^{(l)}) = I((X_{i+1}, X_i^{(k)}), Y_j^{(l)}) - I(X_i^{(k)}, Y_j^{(l)}),   (3.13)

or

T(X_{i+1} | X_i^{(k)}, Y_j^{(l)}) = I((X_i^{(k)}, Y_j^{(l)}), X_{i+1}) - I(X_i^{(k)}, X_{i+1}),   (3.14)

where I((X, Y), Z) denotes the mutual information (Definition 4) of the joint variable (X, Y) with Z.

By equation 3.7, T can also be written as a sum of Shannon entropies:

T(X_{i+1} | X_i^{(k)}, Y_j^{(l)}) = H(X_i^{(k)}, Y_j^{(l)}) - H(X_{i+1}^{(k+1)}, Y_j^{(l)}) + H(X_{i+1}^{(k+1)}) - H(X_i^{(k)}).   (3.15)
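The entropy-sum form in equation 3.15 lends itself to a direct plug-in estimate for discrete data, obtained by counting state visits. The following sketch (an illustration, not part of the thesis) estimates the transfer entropy between two binary sequences, where Y is a lagged copy of X so that X drives Y:

```python
import numpy as np
from collections import Counter

def entropy(samples):
    """Shannon entropy (in bits) of a list of hashable states."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * np.log2(c / n) for c in counts.values())

def transfer_entropy(x, y, k=1, l=1):
    """T_{X->Y} evaluated via the entropy sum in equation 3.15.

    Direct state counting; only suitable for discrete, well-sampled data.
    """
    m = max(k, l)
    ts = range(m - 1, min(len(x), len(y)) - 1)
    yl    = [tuple(y[t - l + 1:t + 1]) for t in ts]   # past words of y
    xk    = [tuple(x[t - k + 1:t + 1]) for t in ts]   # past words of x
    ynext = [y[t + 1] for t in ts]                    # next value of y
    # T = H(y^l, x^k) - H(y_{t+1}, y^l, x^k) + H(y_{t+1}, y^l) - H(y^l)
    return (entropy(list(zip(yl, xk)))
            - entropy(list(zip(ynext, yl, xk)))
            + entropy(list(zip(ynext, yl)))
            - entropy(yl))

rng = np.random.default_rng(0)
x = rng.integers(0, 2, 10000)   # a fair random bit sequence
y = np.roll(x, 1)               # y is x delayed by one step, so X drives Y
print(transfer_entropy(x, y))   # close to 1 bit
print(transfer_entropy(y, x))   # close to 0 bits
```

Since y_{t+1} is fully determined by x_t, conditioning on the past of X removes all uncertainty about the next value of Y, giving roughly one bit of transfer entropy in the driving direction and essentially none in the reverse direction.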
3.4 Application of Transfer Entropy in Neuroscience
"It will be all too easy for our somewhat artificial prosperity to collapse overnight when it is realized that the use of a few exciting words like information theory, entropy, redundancy, do not solve all our problems."
- Claude E. Shannon [20]
3.4.1 Introduction
Because of its model-free property and flexibility, information theory was quickly recognised in many fields, including neuroscience [21]. It started in 1952, when MacKay and McCulloch applied the ideas of information theory to analyse the limits of the transmission capacity of a nerve cell [22, 23]. This work grew into a very influential field termed "neural information flow", which quantifies how much information is transferred from one area to another and gives constraints on the capabilities of neural systems for computation, communication and behaviour.
Although information theory seems well suited to neuroscience, after an initial surge of interest and extensions to a few more systems, interest declined, possibly because the ideas were still too difficult to apply in practice [21].
The work of de Ruyter van Steveninck and Bialek started the so-called modern era of information-theoretic analysis in neuroscience, although applications were still very limited and it was understood that obtaining unbiased estimates was difficult. Since 2000, however, information-theoretic measures have found their way into neuroscience in many forms [21].
One of the most important recent advances was made by Raul Vicente and his team, who in 2010 developed a methodology for applying transfer entropy as a model-free measure of effective connectivity in neuroscience [5].
They demonstrated that their method of calculating transfer entropy satisfies the
following necessary conditions for a good model-free measure of causality:
1. It should not require an a priori definition of the interactions, which makes it a useful tool for gaining new knowledge about the system.
2. It should be able to measure non-linear interactions, as these are very common at all levels of brain functioning.
3. It should detect effective connectivity even with long delays, as these are again common in the brain due to different pathways and "computation speeds".
4. It should not be affected by linear cross-talk, a very common feature of modern EEG and MEG signals, as explained below.
The main factor in favour of this measure is its information-theoretic nature: it does not require any particular model, which is particularly valuable when trying to understand the brain.
First, a description of the computation of transfer entropy is given. There are many options for carrying it out, all of which introduce some error, but only the methods chosen by Vicente and his team are explained here. These include the reconstruction of the state space, the estimation of the transfer entropy, and the significance analysis. Then two particular problems, delayed interactions and instantaneous mixing, are studied.
3.4.2 Computation of Transfer Entropy
For two observed time series x_t and y_t, the transfer entropy can be written as

TE_{X \to Y} = \sum_{y_{t+u}, y_t^{d_y}, x_t^{d_x}} p(y_{t+u}, y_t^{d_y}, x_t^{d_x}) \log \frac{p(y_{t+u} | y_t^{d_y}, x_t^{d_x})}{p(y_{t+u} | y_t^{d_y})},

where t is a discrete time index, u is the prediction time, and y_t^{d_y} and x_t^{d_x} are d_y- and d_x-dimensional delay vectors.
It is possible to estimate transfer entropy with different methods. In Vicente's work, they argue that the estimator must be robust to moderate levels of noise, must be able to rely on a very limited number of data samples, and must remain reliable for high-dimensional data, since reconstructing the state space makes the problem high-dimensional.
Without any modelling assumptions, and using only some very general and biophysically motivated assumptions, they chose particular kernel-based estimators.
As the outcome would still be rather noisy, a significance analysis is necessary, testing against the null hypothesis of independent time series. Since there are no parametric formulas for the estimators, suitable surrogate data is needed, constructed such that the causal dependency of interest is removed but trivial dependencies of no interest are preserved. This will be explained later.
The combination of the estimator and suitable statistical tests was the main contribution of Vicente's team.
Reconstructing the state space
Measured data contains only a limited number of variables, which at best approximately describe the full set of possible states, yet sensible causality hypotheses should be formulated in terms of the full state of the system. To overcome this problem, Vicente's team used Takens' delay embedding theorem [24].
Each scalar time series is mapped into a trajectory in a state space of possibly high dimension, using delay coordinates to create a set of higher-dimensional points:

x_t^d = (x(t), x(t - \tau), x(t - 2\tau), \ldots, x(t - (d - 1)\tau)).   (3.16)
There are two parameters: the delay of the embedding, τ, and the dimension, d. There are many ways of choosing these parameters, and they affect the transfer entropy estimates. For example, too small a value of d might not be enough to describe the full state space, while too large a value makes the computation infeasible and leaves the probability space undersampled. Similarly, too small a value of τ means the coordinates carry mostly autocorrelation information, which is useful neither for reconstructing the state space nor for making predictions.
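A delay embedding of this form is straightforward to construct. The sketch below (an illustration, not code from TRENTOOL) builds the matrix of embedded points of equation 3.16 from a scalar series:

```python
import numpy as np

def delay_embed(x, d, tau):
    """Map a scalar series x to d-dimensional delay vectors with delay tau.

    Row t holds (x[t], x[t - tau], ..., x[t - (d - 1) * tau]),
    so the first valid time index is (d - 1) * tau.
    """
    x = np.asarray(x)
    start = (d - 1) * tau
    return np.column_stack([x[start - i * tau : len(x) - i * tau]
                            for i in range(d)])

x = np.arange(10.0)            # toy series 0, 1, ..., 9
emb = delay_embed(x, d=3, tau=2)
print(emb[0])                  # first embedded point: [4. 2. 0.]
```

Note how the embedding shortens the usable series by (d - 1)τ points, which is one reason the constraint relating d, τ and the data length matters in practice.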
Another important factor in choosing suitable values of d and τ is the length of the data. Quite often, stimulus-induced effects happen within a very short time, such as 200 ms. The constraint

d × τ ≪ length of the time series

must hold when choosing the values of the parameters.
Another danger is undersampling the data. For short stretches of data, information-theoretic functionals will not be accurate, and due to the embedding, the distribution can live in many dimensions, which makes it especially undersampled.
Following Vicente's methodology, a popular way to choose the embedding delay τ is to set it to the autocorrelation decay time (act) of the time series, or to the first minimum (if any) of the auto-information. The embedding dimension is chosen by the Cao criterion, which is based on a false-neighbours computation [24]. Together, these methods give a systematic procedure for estimating the parameters τ and d, and they give meaningful results, as shown in Vicente's work and in this study.
Estimating the transfer entropy
Once the state space is constructed, the next step is to estimate the transfer entropies. Using equation 3.7, the transfer entropy can be rewritten in the following way:
TE_{X \to Y} = H(y_t^{d_y}, x_t^{d_x}) - H(y_{t+u}, y_t^{d_y}, x_t^{d_x}) + H(y_{t+u}, y_t^{d_y}) - H(y_t^{d_y}).   (3.17)
It can be seen that the problem reduces to calculating various joint and marginal probability distributions, and there are many ways of estimating them. For discrete processes they can easily be determined by counting the visits to each state. The main interest of the current study is the continuous case, which is a lot more complicated, since the distributions must be approximated from a finite number of samples and the measure must converge to a common value as the coarsening scale is reduced. The reasoning behind Vicente's method is now given.
When choosing an estimator of any information-theoretic functional, one must be aware that for finite data there will always be statistical errors and a bias. This can quite often lead to negative transfer entropy values, which should not happen by the definition of transfer entropy.
It would be possible to fit the sample probability distribution to a known family of distributions. This is straightforward to calculate, but it would contradict our desire for a model-free measure, and if the data does not fit the chosen family, huge biases can easily be incurred.
There is a range of popular non-parametric approaches, such as fixed and adaptive histogram or partition methods, but it has been shown that other non-parametric approaches, such as kernel or nearest-neighbour estimators, are more data-efficient and accurate [25].
The nearest-neighbour method is used in Vicente's work. It is only asymptotically unbiased, and it assumes that the probability densities involved are smooth [6, 26]. The decision is mainly based on data efficiency, i.e. the estimator requires less data to reach the same bias error [5].
Nearest-neighbour techniques estimate smooth probability densities from the distribution of distances of each sample point to its k-th nearest neighbour. Vicente's team developed their estimator from the Kraskov, Stögbauer and Grassberger method for estimating mutual information [27]. The mutual information estimator based on this method has many good statistical properties; for example, it becomes an exact estimator in the case of independent variables [5].
In practice, the nearest-neighbour estimation method needs two parameters. The first is the mass of the nearest-neighbour search, k, which adjusts the statistical error and the bias of the estimate. Kraskov and others suggested setting k = 4, which was used in Vicente's analysis and will be used here [27].
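The flavour of nearest-neighbour entropy estimation can be illustrated with the classical Kozachenko-Leonenko estimator, a simpler relative of the Kraskov estimator (this sketch is an illustration only, not the TRENTOOL implementation). Using the maximum norm, the differential entropy of a sample is estimated from k-th nearest-neighbour distances:

```python
import numpy as np

def digamma(n):
    """Digamma at a positive integer: psi(n) = -gamma + sum_{j=1}^{n-1} 1/j."""
    gamma = 0.5772156649015329
    return -gamma + sum(1.0 / j for j in range(1, n))

def kl_entropy(samples, k=4):
    """Kozachenko-Leonenko entropy estimate (in nats, maximum norm).

    H ~ psi(N) - psi(k) + (d/N) * sum_i log(2 * eps_i), where eps_i is the
    max-norm distance from point i to its k-th nearest neighbour.
    """
    x = np.atleast_2d(np.asarray(samples, dtype=float).T).T  # shape (N, d)
    n, d = x.shape
    # max-norm distance from every point to every other point (O(N^2) memory)
    dist = np.max(np.abs(x[:, None, :] - x[None, :, :]), axis=2)
    np.fill_diagonal(dist, np.inf)                # exclude the point itself
    eps = np.sort(dist, axis=1)[:, k - 1]         # distance to k-th neighbour
    return digamma(n) - digamma(k) + d * np.mean(np.log(2 * eps))

rng = np.random.default_rng(1)
u = rng.uniform(0, 1, 1000)    # U(0,1): true differential entropy is 0 nats
print(kl_entropy(u, k=4))      # close to 0
```

The choice of k trades statistical error against bias, which is why a small fixed value such as k = 4 is commonly recommended.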
The other parameter is the Theiler correction (T), which excludes samples that are close in time, because such points would contribute autocorrelation information rather than genuine density information. In Vicente's work and in this report, T = 1 autocorrelation decay time (act).
3.4.3 Summary of the parameters
Although transfer entropy is model free, its numerical estimation needs at least five
different parameters:
1. τ - the embedding delay. It is constrained from above by the length of the data and from below by autocorrelation effects. The method used here sets τ to the autocorrelation decay time (act) of the signal. In addition, a maximum autocorrelation limit for the search is set, in order to ensure that the time interval stays large enough; trials where the maximum is exceeded are excluded from the analysis.
2. d - the embedding dimension. It is constrained from above by the length of the data and the available computing power, and from below by the need to reflect the actual dynamics of the system. The embedding dimension is chosen by the Cao criterion, which is based on a false-neighbours computation. In the analysis, a maximum dimension must be set for the Cao search, based on the length of the time interval studied and the maximum τ allowed.
3. k - the mass of the nearest-neighbour search. It is fixed to 4 on the advice of Kraskov et al.
4. T - the Theiler correction window. It excludes the samples which are too close in time; T = 1 act in this report.
5. u - the prediction time. Again constrained from above by the length of the data, and by the physical limits of how quickly signals can affect each other.
Significance analysis
Most of the time, two or more different conditions are compared. In Vicente's implementation and in the current report, a permutation test is used to assess the statistical significance of the difference between conditions. The permutation test first compares the means of the two conditions; then all trials are pooled into one set, and the same numbers of trials are drawn randomly from the pooled set [28]. This last step is repeated many times, and the original difference is finally compared to the differences obtained under permutation.
If the transfer entropy is calculated for only one condition, then suitable surrogate data is needed. The point of surrogate data is that it has the same statistical properties as the original data but all causal interactions removed. This can be achieved by shuffling the trials, provided stationarity and trial independence are assured: data for one of the channels is taken from one trial and data for the second channel from another trial. Again, the results are compared to the original data using the permutation test.
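The permutation test described above can be sketched as follows (an illustration of the general procedure, not the TRENTOOL code):

```python
import numpy as np

def permutation_test(a, b, n_perm=10000, rng=None):
    """Two-sample permutation test on the difference of means.

    Returns the p-value: the fraction of random relabellings whose absolute
    mean difference is at least as large as the observed one.
    """
    rng = np.random.default_rng(rng)
    a, b = np.asarray(a, float), np.asarray(b, float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)   # random relabelling of the pooled trials
        diff = abs(pooled[:len(a)].mean() - pooled[len(a):].mean())
        count += diff >= observed
    return (count + 1) / (n_perm + 1)

rng = np.random.default_rng(0)
same    = permutation_test(rng.normal(0, 1, 50), rng.normal(0, 1, 50), rng=1)
shifted = permutation_test(rng.normal(0, 1, 50), rng.normal(1, 1, 50), rng=1)
print(same, shifted)   # the shifted condition gives a much smaller p-value
```

In the transfer entropy setting, the two samples would be the per-trial TE values of the two conditions (or of the original and surrogate data).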
3.4.4 Particular problems in neuroscience data
Normally the data of interest is gathered with EEG or MEG (see Chapter 6). Both of these data types have characteristic properties which might affect the accuracy of the transfer entropy measure. Firstly, the recorded data might contain a linear mixture of a third, unmeasured process. Secondly, the data can contain delayed interactions. Both of these issues are discussed below and an additional test is described.
Instantaneous mixing
In MEG and EEG data, instantaneous linear mixing is almost always present and creates extra difficulties. Firstly, it may reduce signal asymmetry, which decreases the prediction power. This problem can be dealt with by increasing the number of trials or the amount of data in general.
Secondly, a problem can arise if a single source signal with an internal memory structure is observed multiple times on different channels, each with individual channel noise. Nolte and others have shown that this can lead to false-positive detection of effective connectivity [29].
To illustrate the problem, an example from Vicente's work is described [5]. Consider an AR process s(t) of order m:

s(t) = \sum_{i=1}^{m} \alpha_i s(t - i) + \eta_s(t),   (3.18)

which influences processes X' and Y' in the following way:

X'(t) = s(t),   Y'(t) = (1 - \epsilon) s(t) + \epsilon \eta_Y,   (3.19)

so that Y' can be written as

Y'(t) = (1 - \epsilon) \sum_{i=1}^{m} \alpha_i X'(t - i) + (1 - \epsilon) \eta_s + \epsilon \eta_Y.   (3.20)

In this system, transfer entropy would identify a causal relationship, as the past of X' influences the present of X', which is contained in Y' via the shared term (1 - \epsilon) \eta_s.
Another test is needed to detect these kinds of influences. Vicente's team proposed a 'time-shift test' to avoid false-positive reports under instantaneous linear mixing. The time series X' is shifted by one sample into the past, X''(t) = X'(t + 1). In that case the instantaneous mixing becomes lagged and causal. If some instantaneous mixing is present, then the transfer entropy from X''(t) to Y' will increase compared to that from the original time series X'(t) to Y'.
Such an increase in the transfer entropy values can therefore be interpreted as the presence of instantaneous mixing. One way to construct the actual test, with null and alternative hypotheses, is the following:
H0: TE(X''(t) → Y') ≥ TE(X'(t) → Y')
H1: TE(X''(t) → Y') < TE(X'(t) → Y')   (3.21)
If the transfer entropy values for the original data are not significantly greater than those for the shifted data, then the null hypothesis is not rejected, and the hypothesis of a causal interaction from X' to Y' must be discarded.
Therefore, if instantaneous mixing is possible, it is tested for before proceeding to test the effective connectivity hypothesis. In general, Vicente's team suggests using this test whenever there is a possibility of linear instantaneous mixing.
This approach is very conservative, as the default assumption is that instantaneous mixing is present. A more liberal approach would be to discard the data only when there is significant evidence for the presence of instantaneous mixing. In that case the hypotheses would be:
H0: TE(X''(t) → Y') ≤ TE(X'(t) → Y')
H1: TE(X''(t) → Y') > TE(X'(t) → Y')   (3.22)
In this study, the null hypothesis in 3.22 is used.
3.4.5 Delayed interactions
One very general difficulty with prediction methods is the problem of delayed interactions, which can give erroneously high transfer entropy values. Consider two processes X and Y with a lagged interaction and long autocorrelation, and assume that system X influences Y with delay δ, as shown in figure 3.3. Biased results can occur if transfer entropy is measured from Y to X. The problem arises if the embedding delay τ and embedding dimension d are too small, so that the embedding misses the earlier information in X which influences Y and then, in turn, the future of X. To avoid this, τ and d must be chosen sufficiently large.
Figure 3.3: Delayed interaction of information from X to Y and back to X, such that the transfer entropy measure would conclude that Y influences X. Source: [5]
3.5 Toolbox for transfer entropy
TRENTOOL is an open-source MATLAB (MathWorks, Inc.) package for analysing transfer entropy as described above. It is closely related to the popular FieldTrip toolbox, as it uses some FieldTrip functions. At the time of writing, TRENTOOL also depends on the TSTOOL toolbox for time series analysis (version 1.2, http://www.dpi.physik.uni-goettingen.de/tstool/) and on MATLAB's statistics toolbox.
TRENTOOL has no graphical user interface; all the commands used in this report are shown in Appendix A.
Chapter 4
Granger Causality and Transfer Entropy with Gaussian Variables
This chapter is based on the letter by Lionel Barnett et al., "Granger Causality and Transfer Entropy Are Equivalent for Gaussian Variables" [30]. They demonstrated that Granger causality (causal influence based on prediction via vector autoregression) and transfer entropy (an information-theoretic measure of information transfer between jointly dependent processes) are completely equivalent for Gaussian variables. This chapter uses the same notation as the original paper and is not related to the later analysis. The following notation will be used:
• x - bold type denotes a vector; all vectors are row vectors.
• M - upper case denotes matrices or random variables.
• X^T denotes the transpose of the matrix X.
• ⊕ denotes the concatenation of vectors, such that (x_1, \ldots, x_n) ⊕ (y_1, \ldots, y_m) = (x_1, \ldots, x_n, y_1, \ldots, y_m).
• Σ(X) denotes the n × n matrix of covariances cov(X_i, X_j).
• Σ(X, Y) denotes the n × m matrix of cross-covariances cov(X_i, Y_α).
• Σ(X|Y) denotes the n × n matrix defined, for invertible Σ(Y), in the following way:

Σ(X|Y) ≡ Σ(X) − Σ(X, Y) Σ(Y)^{-1} Σ(X, Y)^T.   (4.1)

Σ(X|Y) can be seen as the covariance matrix of the residuals of a linear regression of X on Y, and will be called the partial covariance of X given Y, by analogy with partial correlation.
• Let X_t be a multivariate stochastic process in discrete time such that the random variables X_{ti} are jointly distributed. The notation X_t^{(p)} ≡ X_t ⊕ X_{t−1} ⊕ \cdots ⊕ X_{t−p+1} denotes X itself with p − 1 lags, so that X_t^{(p)} is a 1 × np random vector for each t. If the lag p is given, the shorthand notation X_t^− ≡ X_{t−1}^{(p)} is used for a lagged variable.
Let X and Y be jointly distributed random vectors and let

X = α + Y · A + ε   (4.2)

be a linear regression model, where the m × n matrix A contains the regression coefficients, the vector α = (α_1, \ldots, α_n) contains the constants of the model, and ε = (ε_1, \ldots, ε_n) is the random vector of residuals. The mean squared error can be written in terms of the covariance matrix of the residuals: E² = trace[Σ(ε)] is just the sum of the variances, sometimes known as the total variance. Assuming Σ(Y) is invertible, the ordinary least squares (OLS) solution is found by minimising E²: A = Σ(Y)^{-1} Σ(X, Y)^T, and for this fit the covariance matrix of the residuals is given by

Σ(ε) = Σ(X|Y).   (4.3)

Also note that the same A which minimises the total variance also minimises the generalised variance |Σ(ε)|, where | · | denotes the determinant.
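The covariance form of the OLS solution is easy to verify numerically. The sketch below (an illustration, not from the paper) checks A = Σ(Y)^{-1} Σ(X, Y)^T against a direct least-squares fit on simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, n = 5000, 3, 2
Y = rng.normal(size=(N, m))                    # N samples of the regressor
A_true = rng.normal(size=(m, n))
X = Y @ A_true + 0.1 * rng.normal(size=(N, n)) # N samples of the response

# Sample covariance matrices (rows are observations)
Sig_Y  = np.cov(Y, rowvar=False)               # m x m
Sig_XY = np.cov(X, Y, rowvar=False)[:n, n:]    # n x m cross-covariance
A_cov  = np.linalg.solve(Sig_Y, Sig_XY.T)      # A = Sigma(Y)^{-1} Sigma(X,Y)^T

# Compare with a direct least-squares fit on demeaned data
A_ols, *_ = np.linalg.lstsq(Y - Y.mean(0), X - X.mean(0), rcond=None)
print(np.allclose(A_cov, A_ols, atol=1e-6))    # the two solutions coincide
```

Demeaning the data plays the role of the constant vector α, so the covariance formula and the least-squares fit give the same coefficient matrix.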
Suppose there are three jointly distributed, stationary multivariate stochastic processes X_t, Y_t, Z_t. Consider the regression models

X_t = α_t + (X_{t−1}^{(p)} ⊕ Z_{t−1}^{(r)}) · A + ε_t,   (4.4)

X_t = α'_t + (X_{t−1}^{(p)} ⊕ Y_{t−1}^{(q)} ⊕ Z_{t−1}^{(r)}) · A' + ε'_t,   (4.5)

so that the variable X is first regressed on the previous p lags of itself plus r lags of the variable Z, while the second model adds q lags of Y in addition to the lags of X and Z.
The G-causality of Y to X given Z is a measure of the extent to which the inclusion of Y in the second model 4.5 reduces the prediction error of the first model 4.4. The standard measure of G-causality is defined for univariate predictor and response variables Y and X, and is given by the natural logarithm of the ratio of the residual variance of equation 4.4 to that of equation 4.5. Here a general case is considered, where X and Y are vectors and the variances are replaced by generalised variances:

F_{Y→X|Z} = ln \frac{|Σ(ε_t)|}{|Σ(ε'_t)|} = ln \frac{|Σ(X|X^− ⊕ Z^−)|}{|Σ(X|X^− ⊕ Y^− ⊕ Z^−)|}.   (4.6)

By stationarity this expression does not depend on time t, so time subscripts are no longer necessary; X^− is the lagged variable of X, and F is the Granger causality. Note that F_{Y→X|Z} ≥ 0, as adding new terms cannot increase the residual variance.
Let X_t, Y_t, Z_t be as before. The transfer entropy of Y to X given Z is defined as the difference between the entropy of X conditioned on its own past and the past of Z, and its entropy conditioned additionally on the past of Y:

T_{Y→X|Z} = H(X | X^− ⊕ Z^−) − H(X | X^− ⊕ Y^− ⊕ Z^−),   (4.7)

where H(·) denotes the entropy (Definition 1) and H(·|·) the conditional entropy (Definition 3). Again, by stationarity there is no need for time indices, and T_{Y→X|Z} ≥ 0. The original definition is for the univariate case, but the extension is unproblematic.
For a multivariate Gaussian random variable X, the entropy can be written in terms of the determinant of the covariance matrix [31]:

H(X) = \frac{1}{2} ln(|Σ(X)|) + \frac{1}{2} n ln(2πe),

where n is the dimension of X.
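This closed form can be checked directly by Monte Carlo, since H(X) = −E[ln p(X)]. The sketch below (illustrative only) compares the two for a random 2-dimensional Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2
B = rng.normal(size=(n, n))
Sigma = B @ B.T + n * np.eye(n)        # a random positive-definite covariance

# Closed form: H(X) = 0.5 ln|Sigma| + 0.5 n ln(2 pi e)
H_formula = (0.5 * np.log(np.linalg.det(Sigma))
             + 0.5 * n * np.log(2 * np.pi * np.e))

# Monte Carlo: H(X) = -E[ln p(X)] over samples from N(0, Sigma)
x = rng.multivariate_normal(np.zeros(n), Sigma, size=200000)
inv = np.linalg.inv(Sigma)
logp = (-0.5 * np.einsum('ij,jk,ik->i', x, inv, x)
        - 0.5 * np.log(np.linalg.det(Sigma)) - 0.5 * n * np.log(2 * np.pi))
H_mc = -logp.mean()
print(abs(H_formula - H_mc) < 0.01)    # the two estimates agree closely
```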
The next step is to show that the conditional entropy H(X|Y) for two jointly multivariate Gaussian variables can be expressed in terms of the determinant of the corresponding partial covariance matrix:

H(X|Y) = \frac{1}{2} ln(|Σ(X|Y)|) + \frac{1}{2} n ln(2πe).   (4.8)

First,

H(X|Y) = H(X ⊕ Y) − H(Y) = \frac{1}{2} ln(|Σ(X ⊕ Y)|) − \frac{1}{2} ln(|Σ(Y)|) + \frac{1}{2} n ln(2πe).

Now

Σ(X ⊕ Y) = [ Σ(X)       Σ(X, Y) ]
           [ Σ(X, Y)^T  Σ(Y)    ]   (4.9)

and from the block-determinant identity

| A  B |
| C  D | = |D| |A − B D^{-1} C|,   (4.10)

the identity

|Σ(X ⊕ Y)| = |Σ(Y)| · |Σ(X|Y)|   (4.11)

follows. And from equation 4.11, equation 4.8 follows.
If the processes X_t, Y_t, Z_t are jointly multivariate Gaussian, then from equations 4.7 and 4.8 the transfer entropy can be written as

T_{Y→X|Z} = \frac{1}{2} ln \frac{|Σ(X|X^− ⊕ Z^−)|}{|Σ(X|X^− ⊕ Y^− ⊕ Z^−)|}.   (4.12)

Comparing equations 4.6 and 4.12 gives the final result,

F_{Y→X|Z} = 2 T_{Y→X|Z},   (4.13)

which proves the main assertion that Granger causality and transfer entropy are equivalent for Gaussian variables.
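The equivalence can be checked numerically. The sketch below (illustrative: univariate, one lag, no conditioning variable Z) estimates F from regression residual variances and T from the Gaussian partial-covariance form of equation 4.12, computing both from the same simulated VAR sample:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10000
x, y = np.zeros(N), np.zeros(N)
for t in range(1, N):                      # a simple bivariate Gaussian VAR(1)
    y[t] = 0.6 * y[t - 1] + rng.normal()
    x[t] = 0.5 * x[t - 1] + 0.4 * y[t - 1] + rng.normal()

X, Xm, Ym = x[1:], x[:-1], y[:-1]          # response, own past, past of y

def resid_var(target, *regressors):
    """Residual variance of an OLS fit of target on the regressors."""
    R = np.column_stack(regressors)
    R = R - R.mean(0); t = target - target.mean()
    beta, *_ = np.linalg.lstsq(R, t, rcond=None)
    r = t - R @ beta
    return r @ r / (len(t) - 1)

# Granger causality: log ratio of residual variances (equation 4.6, Z empty)
F = np.log(resid_var(X, Xm) / resid_var(X, Xm, Ym))

def partial_var(target, *regressors):
    """Partial covariance Sigma(X|Y) via equation 4.1 (univariate target)."""
    R = np.column_stack(regressors)
    S = np.cov(np.column_stack([target, R]), rowvar=False)
    return S[0, 0] - S[0, 1:] @ np.linalg.solve(S[1:, 1:], S[1:, 0])

# Transfer entropy under Gaussianity (equation 4.12)
T = 0.5 * np.log(partial_var(X, Xm) / partial_var(X, Xm, Ym))
print(F, 2 * T)                            # the two agree: F = 2T
```

Because the residual variance of an OLS fit equals the partial covariance of equation 4.1, the agreement here is exact up to floating-point error, which is precisely the content of equation 4.13.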
Chapter 5
Briefly about the Human Brain
"If the brain were simple enough for us to understand it, we would be too simple to understand it."
- Ken Hill
It is difficult to over-emphasise the importance and wonder of the human brain. The brain is the organ that makes our conscious existence possible: because of the brain we feel, talk, read and think. Yet it took us a long time to understand that the brain is the place to look for answers to questions about ourselves.
5.1 History of the scientific approach towards the brain
Brain science emerged relatively late, largely in the 19th century, although some observations naturally started earlier. For example, Aristotle (384-322 BC) was correct in his observation that the ratio of brain size to body size is higher in more intellectually advanced species. Unfortunately, he and everyone else believed that the brain was needed to cool down the blood circulating in the body, so he concluded that more intelligent species need a larger cooling system.
Galen (AD 129-199) discovered that there are nerves going to and from the brain, but he still believed that mental abilities come from the ventricles of the brain [32].
It took over 1500 years for this belief to change: the drawings of the brain always concentrated on the ventricles rather than the cortex. Gall (1758-1828) and Spurzheim (1776-1832) invented the notorious system called phrenology, which rested on two assumptions: each region of the brain performs a very specific function, and the size of each region predicts its quality and importance for a person. In modern terms these theories were completely wrong and unscientific; nevertheless, their ideas still echo in neuroscience.
In 1861, Broca documented two cases in which patients with brain damage lost their ability to speak while every other function was left untouched. He concluded that there must be a specialised area for language. Later it was realised that even a single aspect of language, such as speech recognition, speech production or conceptual knowledge, can be lost through brain damage, suggesting that the language area is probably subdivided into smaller specific areas.
Such arguments, however, concern the mind more than the brain, as they do not say where exactly and how these processes work. It can be said that brain science proper started with the discovery and understanding of neurons as the independent, discrete entities which make up the brain. In 1906, Camillo Golgi and Santiago Ramón y Cajal won the Nobel Prize for their work on neurons.
Modern cognitive neuroscience proper got started with advances in technology and the invention of the computer, which had a ground-breaking influence in two ways: people started to see the brain as a machine, and they started using computers and other technological wonders to solve the mysteries that surrounded it. The main technologies can be categorised as neuroimaging methods, which are described below.
5.2 Brain as we know it today
Enclosed in the cranium, the human brain is the centre of the nervous system. There are about 100-150 billion neurons in the human brain, each connected to about 10 000 other neurons. Neurons make up about 10% of the cells in the brain; the other cells serve support functions such as repair, protection and the supply of nutrients. The brain weighs about 1.5 kg and has a volume of about 1200 cm³.
The brain controls breathing, heart rate and other autonomic processes. It monitors and regulates the body's actions and reactions by analysing the continuous sensory input and responding accordingly. The neocortex is responsible for higher-order thinking, learning and memory; the cerebellum is responsible for the body's balance, posture and the coordination of movement. Figure 5.1 shows the main areas of the brain. Of foremost importance for this study is the visual system, which is explained in the next section.
Figure 5.1: Brain areas. Source: [33]
5.3 Theory of the visual system
Historically, the main model of the visual system was simple: as information reaches the eye and the retina, it is transformed into electrical signals and sent to the brain, where the information propagates serially along a bottom-up hierarchy of brain structures and a representation of reality is constructed.
It turns out that the brain is not that simple. Recent theories increasingly promote the role of top-down processing in recognition [34].
The core of these ideas can easily be illustrated with the following example. In figure 5.2, the same image has been rotated by 180°. Our brain has (presumably) learned that light usually comes from above, and therefore interprets the two situations accordingly.
Figure 5.2: The image on the left probably looks like a dent (concave), while the
image on the right probably looks like a bump (convex). Source: [35]
In other words, the brain uses memories to interpret reality. What is more, it is now known that there are more feedback connections in the brain than connections sending information in the forward direction.
It remains unclear how these top-down, memory-based mechanisms are triggered. M. Bar and others have proposed a model (see figure 5.3) in which low spatial frequencies are quickly sent from early visual cortex to the orbitofrontal cortex (OFC) and, in parallel, to late visual areas. In the OFC, first guesses and predictions are made, which are then sent to the late visual areas, where the raw information and the memory-based information are integrated.
Figure 5.3: An illustration of the proposed model by M. Bar and his team. The input
image is projected rapidly from early visual cortex to the OFC, slowly and in parallel
to late visual areas. This coarse representation activates a minimal set of the most
probable interpretations of the input, which are then integrated with the bottom-up
stream of analysis to facilitate recognition. Source: [34]
5.4 Neuroimaging: EEG and intracranial EEG
Neuroimaging is a relatively new field which uses various techniques to study and measure the activity and structure of the brain. The four main technologies are functional magnetic resonance imaging (fMRI) [36], electroencephalography (EEG) [37], magnetoencephalography (MEG) [38] and positron emission tomography (PET) [39]. In this study, intracranial EEG measurements are analysed.
Electroencephalography (EEG) is the recording of electrical activity along the scalp. EEG measures voltage fluctuations, of the order of microvolts, resulting from ionic current flows within the neurons of the brain [40]. Multiple locations are normally recorded simultaneously. In neurology, EEG is used to diagnose epilepsy; it used to be the best method for diagnosing tumours and strokes, but it has since been replaced by other methods.
The main EEG technique is the event-related potential (ERP), which involves averaging the EEG activity over multiple trials so that the effects of stimuli can be measured and analysed. It is a common technique in cognitive science, cognitive psychology and psychophysiological research.
Compared to the other methods, EEG has very good temporal resolution, and it measures electrical activity, which is assumed to carry information about our cognitive processes. It also has some shortcomings. EEG measurements are normally taken from the scalp, so the method has very low spatial resolution and it is not possible to locate the sources of electrical activity in three-dimensional space (the inverse problem). These problems can be overcome with invasive EEG.
For ethical reasons, invasive EEG cannot be used on healthy humans. It can be used on primates, but cognitive processes are more difficult to study in their brains. However, it is accepted practice to use intracranial EEG on patients with pharmacoresistant epilepsy, as it is a very efficient way of locating the epileptic focus, which can then be removed surgically. Since the patients must stay in hospital for an extended period in order to gather enough information about the epileptic focus, it is possible to run experiments with them and record the data. This method is used in this study, and the experimental set-up is explained in the next chapter.
Chapter 6
Applying Causality Measures to
Intracranial EEG data
6.1
Methods
6.1.1
Subjects
Intracranial EEG from a single patient with pharmacoresistant epilepsy is analysed.
In fact, there were six patients in total, but as there was no control over the placement
of the electrodes, four of them did not happen to have the frontal basal strip electrodes
which are the interest of this study, and the data of one patient was noisy and unusable.
Recordings were performed at the Department of Epileptology, University of Bonn,
Germany.
6.1.2
Experimental set up
To obtain stimuli where prior experience could have an impact on sensory processing,
relatively complex natural images were used: a set of grey-scale photographs with a
variety of backgrounds, each containing a single person in the foreground.
CHAPTER 6. APPLYING CAUSALITY MEASURES
46
After each picture in the test phase, the subjects had to indicate whether the person
in the degraded picture was male or female (objective performance) and whether
they had indeed seen a person in the picture or not (subjective performance). Occasionally, subjects were also asked whether the picture, now shown in the degraded
fashion, had been presented in the familiarisation phase of that block. After the
questions and before the next picture, a fixation cross was presented randomly for
1200-1400 ms. The response screens appeared 1 second after the onset of the picture.
Each block lasted 3-4 minutes and subjects could take breaks between the blocks.
The duration of the whole experiment was 35-45 minutes.
To study conscious recognition, one needs to make sure that the targets are not
consciously recognised in every trial. To achieve this in the test phase, brief
presentation durations (150 ms) and degradation of the target images were used.
For the latter, each target image was edited with random noise, where the amount of
noise could be controlled parametrically (see figure 6.1). For this experiment, stimuli
were prepared with noise levels from 60% to 90% in 5% steps. Before the
main experiment started, a threshold experiment was conducted to select, for each
subject, two neighbouring degradation levels (e.g. 75% noise and 70% noise) that
would provide suitable recognition performance (i.e. neither random guessing nor
ceiling effects).
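The thesis does not state the exact degradation algorithm, so the Python sketch below (with an invented helper `degrade`) simply replaces a parametrically controlled fraction of pixels with random grey values — one plausible way to obtain the 60%–90% noise levels described above:

```python
import numpy as np

def degrade(image, noise_level, rng):
    """Replace a fraction noise_level of pixels with uniform random values.

    This is one simple parametric degradation; the actual procedure used
    for the stimuli is not specified in the text.
    """
    out = image.copy()
    mask = rng.random(image.shape) < noise_level   # e.g. 0.75 -> ~75% noise
    out[mask] = rng.random(np.count_nonzero(mask)) # random grey values
    return out

rng = np.random.default_rng(1)
img = rng.random((64, 64))                # stand-in grey-scale photograph
levels = np.arange(0.60, 0.95, 0.05)      # 60% to 90% noise in 5% steps
stimuli = [degrade(img, a, rng) for a in levels]

# Higher noise levels leave fewer original pixels intact.
intact = [np.mean(s == img) for s in stimuli]
assert intact[0] > intact[-1]
```

The single parameter `noise_level` is what allows a threshold experiment to home in on two neighbouring levels per subject.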
The main experiment was divided into 11 experimental blocks. Each experimental
block consisted of two phases: the familiarisation phase and the test phase (illustrated
in figure 6.1). The goal of the familiarisation phase was to create prior knowledge
about some of the pictures. In the first phase, 4 pictures were each shown clearly
twice, for 3 seconds each time. The subjects were told to remember these pictures.
At the first clear presentation of each picture, the subjects were asked whether the
person in the picture was male or female and whether they thought he or she was
under or over 30 years old. At the second clear presentation of each picture, the
subjects simply had the task of memorising the picture.
In the test phase of each block these 4 familiarised pictures and 4 new pictures
Figure 6.1: Experimental paradigm. Each block is divided into 2 phases. In the
first phase prior knowledge is created about some of the images by showing them
clearly. In the second phase the effects of prior knowledge and sensory information
are tested by showing the pictures briefly and in degraded fashion. Degraded versions
of the pictures from phase 1 are presented together with new pictures (manipulation
of prior knowledge). In addition, both types of pictures can be shown in two different
degradation levels (manipulation of sensory evidence). Source: author
with people in them were presented, belonging to the conditions with and without
prior knowledge, respectively. Each picture was presented twice during the test phase.
Importantly, in the test phase the pictures were presented for 150 ms and in either
the lower or the higher degradation level, as explained above, which constituted the
bottom-up factor of sensory evidence. Therefore, each picture with a person in it
could fall into one of 4 conditions:
• With prior knowledge and high bottom-up (familiarised and lower degradation)
• With prior knowledge and low bottom-up (familiarised and higher degradation)
• Without prior knowledge and low bottom-up (not familiarised and higher degradation)
• Without prior knowledge and high bottom-up (not familiarised and lower degradation).
Additionally, in each test phase 4 catch trials with no persons in the pictures were
presented. They served to control for subjective perception, as subjects should not
have seen any persons in these pictures. Indeed, for the subject included in this
study, the number of false perceptions in the catch trials was very low.
6.1.3
Intracranial recordings
Multi-contact depth electrodes and temporal basal and lateral strips were inserted
for diagnostic purposes using a computed tomography-based stereotactic insertion
technique [41]. As all the electrodes were implanted for diagnostic purposes, there
was no influence on their positioning for this study. The location of the electrode
contacts was ascertained by MRI in each patient. EEG was referenced to linked
mastoids¹ and recorded at a sampling rate of 1000 Hz.
Figure 6.2 shows the electrode strips of the patient whose data is analysed in the
current study. Channels FPL02 and TLL02 were used. These electrodes were selected
because reliable task-related activity was observed at them. They were also anatomically
close to the areas of interest: the higher visual areas (TLL02) and the OFC (FPL02),
where the latter is assumed to be important for providing top-down facilitation in
visual recognition processes [34]. In other words, of the frontal and temporal lobe
electrodes, these were the ones most responsive to the stimuli, which is the interest
of the current study.
Intracranial EEG data were filtered (0.5-300 Hz) and segmented into epochs of −1000
ms to 1000 ms around stimulus presentation. Subsequently, trials were visually
inspected for artefacts (e.g. epileptiform spikes), and trials containing artefacts were
excluded from the analysis.
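A minimal Python sketch of this preprocessing step, assuming SciPy is available; the continuous signal and the onset times are simulated stand-ins, but the band limits, epoch length and sampling rate match the text:

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 1000                           # sampling rate in Hz
rng = np.random.default_rng(2)
raw = rng.normal(size=fs * 60)      # one minute of fake continuous iEEG

# Band-pass 0.5-300 Hz, zero-phase, mirroring the preprocessing in the text.
b, a = butter(4, [0.5, 300], btype="bandpass", fs=fs)
filtered = filtfilt(b, a, raw)

# Cut epochs of -1000 ms to +1000 ms around (hypothetical) stimulus onsets.
onsets = np.array([5, 15, 25, 35, 45]) * fs   # onset samples, made up here
epochs = np.stack([filtered[s - fs: s + fs] for s in onsets])
assert epochs.shape == (5, 2 * fs)
```

Zero-phase filtering (`filtfilt`) is the usual choice here because a causal filter would shift the apparent latencies of the responses.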
¹ For the reference signal, two linked electrodes located on the mastoids next to the ears were used.
Figure 6.2: The positions of multi-contact electrode strips and the electrodes used in
the current study. Source: Original from the Department of Epileptology, University
of Bonn, Germany. Modifications made by author
Figure 6.3 shows sample data (−500 ms to 500 ms) from a random trial of both
channels.
Figure 6.3: Sample intracranial data from electrodes TLL02 and FPL02. Source:
author
6.1.4
Hypothesis
The hypotheses of the study:
First
According to the theory, the frontal and visual areas should actively communicate
after the stimulus.
Second
One of the controlled conditions is the familiarity of the photo. The theory says
that if the photo had already been shown in the first stage of the experiment,
then more information should be sent from frontal to early visual areas.
Third
Another controlled condition is the degradation level of the photos. If the level
of degradation was higher (i.e. the early visual areas needed more information to
decipher the photos), more information should be sent from frontal areas to early
visual areas.
6.1.5
Choosing the parameters for transfer entropy
One of the major issues is the size of the time window. It is a compromise between
temporal resolution and the sample size available for creating the state space and
estimating the various entropies. It is known that the effects in the brain happen in
a short time frame, and ideally one would like to know the dynamics of neural
connectivity on a time scale of 50 to 100 ms. Unfortunately, with the current methods,
and especially with the information-theoretic measures, this is practically impossible:
transfer entropy ideally needs more than 1000 ms of data, which would measure the
general activity of the brain rather than the event-induced effects.
As most of the parameters were set following the recommendations of Vicente's team,
only the following five were left to choose: the embedding delay (τ), the maximum
autocorrelation time, the dimension (d), the prediction time (u) and the time window.
The preferred choice would be the smallest possible time window. By this criterion,
the dimension was chosen to be the smallest allowed by the toolbox, d = 5. The
embedding delay was recommended to be between 1 and 1.5 autocorrelation decay
times; again aiming for the smallest possible time window, τ = 1 was chosen. In
addition, the maximum limit for the autocorrelation search was chosen to be 40, and
trials with a larger autocorrelation decay time were excluded. The prediction time
was chosen to be u = 20 ms, as this was suggested to be the minimum time which
would still capture the dynamics of the two subsystems of the brain.²
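The autocorrelation decay time used in these choices can be estimated, for example, as the first lag at which the normalised autocorrelation drops below 1/e. The Python sketch below mirrors the search limit of 40 and the trial-exclusion rule, but the exact estimator inside TRENTOOL may differ:

```python
import numpy as np

def act(x, max_lag=40):
    """Autocorrelation decay time: first lag at which the normalised
    autocorrelation falls below 1/e. Returns None if it stays above up
    to max_lag; such trials would be excluded, as in the text."""
    x = x - x.mean()
    denom = np.dot(x, x)
    for lag in range(1, max_lag + 1):
        r = np.dot(x[:-lag], x[lag:]) / denom
        if r < 1 / np.e:
            return lag
    return None

rng = np.random.default_rng(3)
white = rng.normal(size=2000)     # white noise decorrelates immediately
assert act(white) == 1

# An AR(1) process with coefficient 0.9 decays much more slowly.
ar = np.zeros(2000)
for t in range(1, 2000):
    ar[t] = 0.9 * ar[t - 1] + rng.normal()
assert act(ar) > act(white)
```

With τ given in units of this decay time, a slowly decorrelating trial implies a much longer effective embedding window, which is why trials with a decay time above the search limit were dropped.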
6.1.6
Choosing the parameters for Granger causality
In the case of Granger causality, the choice of parameters is much smaller. Calculating
Granger causality is very efficient in terms of computing power. It also works
with much smaller time windows, as it does not need as many data points for the
construction of probability surfaces. Basically, there is only one parameter to choose:
the model order u. It is recommended to use a maximum order of u = 10, but as this
might not be enough to capture the full dynamics, u = 25 is used in this study.
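The role of the model order can be illustrated with a least-squares sketch of pairwise Granger influence in Python. This is a didactic stand-in for, not a reproduction of, the GCCA toolbox routines used in the study; the toy system and coefficients are invented:

```python
import numpy as np

def granger_log_ratio(x, y, p):
    """Granger influence of y on x at model order p: log ratio of residual
    sums of squares between the restricted model (past of x only) and the
    full model (past of x and y), fitted by ordinary least squares."""
    n = len(x)
    target = x[p:]
    lags_x = np.column_stack([x[p - k:n - k] for k in range(1, p + 1)])
    lags_y = np.column_stack([y[p - k:n - k] for k in range(1, p + 1)])

    def rss(design):
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ beta
        return resid @ resid

    return np.log(rss(lags_x) / rss(np.hstack([lags_x, lags_y])))

# Toy system in which y drives x with a one-sample delay.
rng = np.random.default_rng(4)
y = rng.normal(size=3000)
x = np.zeros(3000)
for t in range(1, 3000):
    x[t] = 0.5 * x[t - 1] + 0.8 * y[t - 1] + 0.1 * rng.normal()

# With the model order used in the study (u = 25), the driving direction
# shows a much larger influence than the reverse one.
assert granger_log_ratio(x, y, 25) > granger_log_ratio(y, x, 25)
```

The model order bounds how far back in time the fitted dynamics can reach, which is why an order too small to cover the true lags would miss the interaction.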
The biggest issue is the non-stationarity of the time series, which affects Granger
causality considerably. In order to reduce any negative effects, the mean value is
subtracted from each time series and a built-in test for stationarity is used: the KPSS
test [42], whose null hypothesis is that the data are stationary. All trials which did
not pass the test were excluded from the analysis.
6.1.7
List of tests taken
Four tests were conducted for both directions of the channel pair (FPL02 to TTL02
and vice versa, see figure 6.2). Only the first test is for a single condition, which
means that surrogate data was needed; the transfer entropy toolbox TRENTOOL has
built-in functionality for that purpose, while in the case of Granger causality the
surrogate data had to be created. The last three tests compare two conditions, and a
permutation test was used for the significance analysis. TRENTOOL has the
permutation test built in, but as it is again absent from the Granger causality
toolbox, an implementation was developed for this report. Most of the code can be
seen in Appendices A and B.

² Source: Raul Vicente, personal communication, 15/04/2011
Here is the list of tests:
Test 1. 0.05 to 0.5 s for all the trials (the only single-condition test)
Test 2. All trials, 0.05 to 0.5 s, versus all trials, −0.5 to −0.05 s
Test 3. 0.05 to 0.5 s of the trials with a familiar photo versus the trials with a
non-familiar photo in the same time frame
Test 4. 0.05 to 0.5 s of the trials with a more degraded photo versus the trials with
a less degraded photo in the same time frame
For the last three tests, it is hypothesised that the first subset of trials mentioned has
a higher level of information transfer than the second, but a two-sided test against
the null hypothesis that the means are equal is nevertheless used in all cases.
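The MATLAB `permtest` in Appendix B obtains a two-sided p-value by doubling a one-sided count; an equivalent formulation works directly with absolute mean differences. A Python sketch (the sample sizes and effect size in the demo are invented):

```python
import numpy as np

def perm_test(v1, v2, n_perm=10000, rng=None):
    """Two-sided permutation test for equal means: pool the two samples,
    shuffle the labels repeatedly, and count how often the permuted
    absolute mean difference is at least as large as the observed one."""
    if rng is None:
        rng = np.random.default_rng()
    v = np.concatenate([v1, v2])
    observed = abs(np.mean(v1) - np.mean(v2))
    count = 0
    for _ in range(n_perm):
        rng.shuffle(v)
        if abs(v[:len(v1)].mean() - v[len(v1):].mean()) >= observed:
            count += 1
    return count / n_perm

rng = np.random.default_rng(6)
p_diff = perm_test(rng.normal(size=50), rng.normal(loc=2.0, size=50), rng=rng)
assert p_diff < 0.05          # a mean shift of 2 is detected easily
```

The test makes no distributional assumptions, which matters here because transfer entropy and Granger values across trials need not be Gaussian.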
6.2
Results
6.2.1
Behavioural results
Behaviourally, it was observed that the subjects were more accurate and reported
seeing the person in the picture more often when they had prior knowledge about
the picture. Subjects also showed such perceptual and behavioural improvements
when the picture contained objectively more sensory evidence. In fact, providing
more sensory evidence and having prior knowledge led to similar behavioural benefits
[43]. However, comparable behavioural effects do not necessarily imply that the
neural mechanisms behind these two ways of perceptual enhancement (through more
sensory evidence or with the help of prior knowledge) are identical.
6.2.2
Results of transfer entropy
Transfer entropy did not find any significant results for information transfer from the
early visual areas to the frontal area. In the third test, instantaneous mixing was
detected and the test was excluded (see Chapter 3, page 31). However, several
significant results were found in the other direction. A summary of all the results of
transfer entropy is shown in table 6.1. A p-value of 0.05 is used as the level for
rejecting the null hypothesis.
Test   Channels     p-value                Reject H0
1      TTL to FPL   0.2739                 fail to reject
1      FPL to TTL   0.0166                 reject
2      TTL to FPL   0.3221                 fail to reject
2      FPL to TTL   0.00001                reject
3      TTL to FPL   Instantaneous mixing   fail to reject
3      FPL to TTL   0.0133                 reject
4      TTL to FPL   0.3837                 fail to reject
4      FPL to TTL   0.4753                 fail to reject

Table 6.1: Summary of all the results of Transfer Entropy
Firstly, the analysis shows that there is a significant information flow from the frontal
to the visual area (p = 0.0166), and compared to the pre-stimulus information flow the
result is even more significant (p = 0.00001). It can be concluded that there is an
information flow from the frontal areas to the early visual areas related to the process
of deciphering natural photos. This result coincides with recent understanding of
the visual system.
What is more, familiar photos, which were shown in the first stage, showed larger
information transfer (p = 0.0133) than non-familiar photos. This is a powerful result
which supports the idea that if the frontal areas recognise the scene, they send
information to the visual areas.
The third hypothesis (Section 6.1.4) did not receive any significant confirmation.
6.2.3
Results of Granger causality
A summary of all the results of Granger causality is shown in table 6.2.
Test   Channels     p-value   Reject H0
1      TTL to FPL   0.8926    fail to reject
1      FPL to TTL   0         reject
2      TTL to FPL   0.0003    reject*
2      FPL to TTL   0.0004    reject
3      TTL to FPL   0.6759    fail to reject
3      FPL to TTL   0.9244    fail to reject
4      TTL to FPL   0.9213    fail to reject
4      FPL to TTL   0.5224    fail to reject

Table 6.2: Summary of all the results of Granger Causality. *Only instance where the
difference of the means of the two conditions was the other way than hypothesised
Interestingly, in addition to the information flow from the frontal area to the visual
area (p = 0 for the single condition and p = 0.0004 when comparing post- and
pre-stimulus information flow), Granger causality found an information flow in the
other direction as well: from the visual area to the frontal area. The interaction was
the opposite of what was expected: the analysis shows that more information flows
from the visual areas to the frontal area before the stimulus than after it.
The result that less information flows from the visual areas to the frontal areas when
a photo is shown may seem contradictory, but the brain is a mysterious organ and
everything is possible. It may be that as soon as information is quickly sent from the
visual area to the frontal area, the former stops sending information to the latter;
instead, the actual visual scene and the predictions made by the frontal area are used
to decipher the visual scene. However, it is also possible that the method is unstable
and gives false results under some conditions. Given the problems with linearity and
stationarity, this would not be a surprise.
Granger causality did not show any significant difference between the two conditions:
familiar versus non-familiar photos, and more degraded versus less degraded photos.
6.3
Discussion
These two methods successfully found significant relationships in a complicated
situation. Table 6.3 shows the results of the tests where at least one of the methods
was able to reject the null hypothesis (that there is no causal connection, or that the
two conditions have an equal amount of causal connectivity).
Test   TE/GC   Channels     p-value   Reject H0
1      TE      FPL to TTL   0.0166    reject
1      GC      FPL to TTL   0         reject
2      TE      TTL to FPL   0.3221    fail to reject
2      GC      TTL to FPL   0.0003    reject*
2      TE      FPL to TTL   0.0000    reject
2      GC      FPL to TTL   0.0004    reject
3      TE      FPL to TTL   0.0133    reject
3      GC      FPL to TTL   0.9244    fail to reject

Table 6.3: Quick summary of all the results. *Only instance where the difference of the
means of the two conditions was the other way than hypothesised
Both methods agree that there is a significant information flow from the frontal area
to the visual area after the stimulus, and that it is significantly bigger than the
information flow before the stimulus. Again, this is a pleasant confirmation of the
reliability of the methods and the correctness of recent theories.
In addition, transfer entropy gives the very nice result that familiar photos give rise
to a bigger information flow from the frontal area to the visual area. This supports
the recent theories that our memory and pre-learnt understanding of reality help us
to understand visual scenes, and agrees with neurobiological observations that there
are more connections in the top-down direction than was previously expected.
Granger causality, on the other hand, shows the unexpected result that the information
flow from the visual area to the frontal area is smaller after the stimulus.
The difference between these two methods could be explained by their construction.
Transfer entropy is more complex, needs longer time frames and can detect non-linear
connections; Granger causality is simpler, uses much shorter time frames and is
linear. In the light of these ideas, the previous result might have a logical explanation
behind it.
Chapter 7
Future directions for the causality
measure research for neuroscience
Technologies, theories about the brain, experiments and mathematics are all advancing
at a rapid pace. There are many extensions and problems which will be tackled, and
probably solved, in the near future. The experience of this study suggests that the two
main problems are the time window of the analysis and the non-stationarity of the data.
The construction of probability surfaces in small time frames is problematic. G.
Gomez-Herrero and his team are trying to overcome this issue by taking advantage
of the trial-based structure of neuroscience experiments [44]. In that case, there
would be more data points for the construction of the state space (i.e. for the nearest-neighbour search) and the temporal resolution could be finer. This method is not
implemented in the TRENTOOL toolbox and is not used in the current study either,
but it is already under development for TRENTOOL.¹
In the case of non-stationarity, a windowing technique has been proposed, under the
assumption that shorter time frames are stationary. A proper framework and
statistical inference should be developed for this purpose.
¹ Source: Raul Vicente, personal communication, 15/04/2011

In addition, there are many important ideas for future development: extensions to
Granger causality measures which would take non-linearity into account; using the
multivariate data structure in the case of transfer entropy, which can currently only
deal with pairs of signals; and adding frequency-specific interactions to the methods.
It is always possible to further develop the estimators used for the construction of
probability surfaces and for choosing the parameters.
Some advances can be made without improving the methods themselves. For example,
neurophysiologists can try to design experiments in which the relevant brain processes
unfold over a longer time frame, so that information-theoretic measures can be used
more efficiently.
Chapter 8
Conclusions
Brain science is fascinating, and its hard problems are mysterious and complicated.
This study addresses a tiny sub-problem of neuroscience using ideas from statistics
and information theory, which in turn draw on many fields of mathematics, such as
algebra and algorithmics. The field tries to measure information flow, or effective
connectivity, which in turn can help to reject or confirm theories about the brain.
A lot of work has already been done. Both methods have their merits and have
shown their usefulness even in this study, demonstrating how the frontal area is
involved in the analysis of natural photos. Transfer entropy is model-free, takes
non-linear interactions into account and is well suited to studying a system that is
not very well known. Granger causality is lighter, uses less data and is much more
efficient in terms of computing power; in fact, it has been a very popular method in
many fields for decades. Mathematics and computer science have contributed a lot
to the quest to understand our brains.
But as described in the last chapters, there is still a lot to improve in the methods:
efficiency, reliability and usability. Mathematicians can and will make a great impact
on the quest for understanding consciousness.
Bibliography
[1] “Project on the Decade of the Brain.” www.loc.gov/loc/brain/. [Online; accessed 10-February-2011].
[2] C. Granger, “Testing for causality: A personal viewpoint,” Journal of Economic
Dynamics and Control, vol. 2, pp. 329–352, 1980.
[3] T. Schreiber, “Measuring information transfer,” Physical review letters, vol. 85,
no. 2, pp. 461–464, 2000.
[4] C. Shannon, “A mathematical theory of communication,” ACM SIGMOBILE
Mobile Computing and Communications Review, vol. 5, no. 1, pp. 3–55, 2001.
[5] R. Vicente, M. Wibral, M. Lindner, and G. Pipa, “Transfer entropy: a model-free
measure of effective connectivity for the neurosciences,” Journal of computational
neuroscience, pp. 1–23, 2010.
[6] A. Kraskov, “Synchronization and interdependence measures (phd thesis),”
2004.
[7] J. Geweke, “Measurement of linear dependence and feedback between multiple
time series,” Journal of the American Statistical Association, vol. 77, no. 378,
pp. 304–313, 1982.
[8] A. Seth, “Granger causality,” Scholarpedia, vol. 2, no. 7, p. 1667, 2007.
[9] W. Freiwald, P. Valdes, J. Bosch, R. Biscay, J. Jimenez, L. Rodriguez, V. Rodriguez, A. Kreiter, and W. Singer, “Testing non-linearity and directedness of
interactions between neural groups in the macaque inferotemporal cortex,” Journal of neuroscience methods, vol. 94, no. 1, pp. 105–119, 1999.
[10] Y. Chen, G. Rangarajan, J. Feng, and M. Ding, “Analyzing multiple nonlinear
time series with extended Granger causality,” Physics Letters A, vol. 324, no. 1,
pp. 26–35, 2004.
[11] W. Hesse, E. Möller, M. Arnold, and B. Schack, “The use of time-variant EEG
Granger causality for inspecting directed interdependencies of neural assemblies,” Journal of Neuroscience Methods, vol. 124, no. 1, pp. 27–44, 2003.
[12] M. Ding, Y. Chen, and S. Bressler, “Granger causality: basic theory and application to neuroscience,” Arxiv preprint q-bio/0608035, 2006.
[13] A. Seth, “A MATLAB toolbox for Granger causal connectivity analysis,” Journal
of neuroscience methods, vol. 186, no. 2, pp. 262–273, 2010.
[14] V. Ramachandran, The Tell-Tale Brain: A Neuroscientist’s Quest for What
Makes Us Human. WW Norton & Company, 2011.
[15] Wikipedia, “Information theory — Wikipedia, the free encyclopedia.” http://en.wikipedia.org/w/index.php?title=Information_theory&oldid=405796183, 2011. [Online; accessed 7-February-2011].
[16] F. Rieke, D. Warland, R. van Steveninck, and W. Bialek, “Spikes: Exploring the
Neural Code,” 1996.
[17] H. Hinrichs, T. Noesselt, and H.-J. Heinze, “Directed information flow: A
model free measure to analyze causal interactions in event related eeg-megexperiments,” Human Brain Mapping, vol. 29, no. 2, pp. 193–206, 2008.
[18] Wikipedia, “Finnegans wake — wikipedia, the free encyclopedia.” http://
en.wikipedia.org/w/index.php?title=Finnegans_Wake&oldid=412956594,
2011. [Online; accessed 10-February-2011].
[19] Wikipedia, “Mutual information — Wikipedia, the free encyclopedia.” http://en.wikipedia.org/w/index.php?title=Mutual_information&oldid=412591741, 2011. [Online; accessed 10-February-2011].
[20] C. Shannon, “The bandwagon,” IRE Transactions on Information Theory, vol. 2,
no. 3, p. 3, 1956.
[21] A. Dimitrov, A. Lazar, and J. Victor, “Information theory in neuroscience,”
Journal of Computational Neuroscience, pp. 1–5.
[22] D. MacKay and W. McCulloch, “The limiting information capacity of a neuronal
link,” Bulletin of Mathematical Biology, vol. 14, no. 2, pp. 127–135, 1952.
[23] W. McCulloch, “An upper bound on the informational capacity of a synapse,” in
Proceedings of the 1952 ACM national meeting (Pittsburgh), pp. 113–117, ACM,
1952.
[24] L. Cao, “Practical method for determining the minimum embedding dimension
of a scalar time series,” Physica D: Nonlinear Phenomena, vol. 110, no. 1-2,
pp. 43–50, 1997.
[25] J. Victor, “Binless strategies for estimation of information from neural data,”
Physical Review E, vol. 66, no. 5, p. 051903, 2002.
[26] K. Hlavackova-Schindler, M. Palus, M. Vejmelka, and J. Bhattacharya, “Causality detection based on information-theoretic approaches in time series analysis,”
Physics Reports, vol. 441, no. 1, pp. 1–46, 2007.
[27] A. Kraskov, H. Stögbauer, and P. Grassberger, “Estimating mutual information,” Physical Review E, vol. 69, no. 6, p. 066138, 2004.
[28] Wikipedia, “Resampling (statistics) — wikipedia, the free encyclopedia.”
http://en.wikipedia.org/w/index.php?title=Resampling_(statistics)
&oldid=427266368, 2011. [Online; accessed 8-May-2011].
[29] G. Nolte, A. Ziehe, V. Nikulin, A. Schlögl, N. Krämer, T. Brismar, and K. Müller,
“Robustly estimating the flow direction of information in complex physical systems,” Physical review letters, vol. 100, no. 23, p. 234101, 2008.
[30] L. Barnett, A. B. Barrett, and A. K. Seth, “Granger causality and transfer entropy are equivalent for gaussian variables,” Phys. Rev. Lett., vol. 103, p. 238701,
Dec 2009.
[31] A. Papoulis and S. Pillai, Probability, random variables, and stochastic processes.
McGraw-Hill, 2002.
[32] M. Gazzaniga, R. Ivry, G. Mangun, and M. Steven, Cognitive neuroscience: the
biology of the mind. Norton, 2009.
[33] Wikipedia, “Human brain — wikipedia, the free encyclopedia,” 2011. [Online;
accessed 16-May-2011].
[34] M. Bar, K. Kassam, A. Ghuman, J. Boshyan, A. Schmid, A. Dale,
M. Hämäläinen, K. Marinkovic, D. Schacter, B. Rosen, et al., “Top-down facilitation of visual recognition,” Proceedings of the National Academy of Sciences
of the United States of America, vol. 103, no. 2, p. 449, 2006.
[35] H. Farid, “Personal web site.” http://www.cs.dartmouth.edu/farid/illusions/bump.html. [Online; accessed 07-May-2011].
[36] S. Ogawa and Y. Sung, “Functional magnetic resonance imaging,” Scholarpedia,
vol. 2, no. 10, p. 3105, 2007.
[37] P. L. Nunez and R. Srinivasan, “Electroencephalogram,” Scholarpedia, vol. 2,
no. 2, p. 1348, 2007.
[38] G. Barnes, A. Hillebrand, and M. Hirata, “Magnetoencephalogram,” Scholarpedia, vol. 5, no. 7, p. 3172, 2010.
[39] Wikipedia, “Positron emission tomography — wikipedia, the free encyclopedia.”
http://en.wikipedia.org/w/index.php?title=Positron_emission_
tomography&oldid=427468889, 2011. [Online; accessed 8-May-2011].
[40] Wikipedia, “Electroencephalography — wikipedia, the free encyclopedia,” 2011.
[Online; accessed 15-May-2011].
[41] D. Van Roost, L. Solymosi, J. Schramm, B. van Oosterwyck, and C. Elger, “Depth electrode implantation in the length axis of the hippocampus
for the presurgical evaluation of medial temporal lobe epilepsy: a computed
tomography-based stereotactic insertion technique and its accuracy,” Neurosurgery, vol. 43, no. 4, p. 819, 1998.
[42] D. Kwiatkowski, P. C. B. Phillips, P. Schmidt, and Y. Shin, “Testing the null
hypothesis of stationarity against the alternative of a unit root: How sure are we
that economic time series have a unit root?,” Journal of Econometrics, vol. 54,
no. 1-3, pp. 159–178, 1992.
[43] J. Aru, N. Axmacher, A. Do Lam, J. Fell, W. Singer, and L. Melloni, “Previous
experience enhances visual perception: Intracranial EEG effects of familiarization on the processing of degraded natural images,” Society for Neuroscience,
2010.
[44] G. Gomez-Herrero, W. Wu, K. Rutanen, M. Soriano, G. Pipa, and R. Vicente,
“Assessing coupling dynamics from an ensemble of time series,” Arxiv preprint
arXiv:1008.0539, 2010.
Appendix A
Codes and instructions for
calculating transfer entropy
1. MATLAB® is needed
2. Install the open source MATLAB® toolbox TSTool from the Juelich Group
(http://www.physik3.gwdg.de/tstool/)
3. Install the open source MATLAB® toolbox FieldTrip from the Donders Centre
(http://fieldtrip.fcdonders.nl/)
4. Install the open source MATLAB® toolbox TRENTOOL from
http://www.trentool.de/
5. Run the following commands for paths:

currentFolder = pwd;
cd('OpenTSTOOL');
settspath;
cd('..');
addpath(strcat(currentFolder, '\TRENTOOL_20110119alpha\'));
addpath(genpath(strcat(currentFolder, '\fieldtrip-20110324\foot')));
6. Data must be in the FieldTrip data format. Single condition analysis:

load('data.mat')

%%% TE prepare settings %%%
cfg = [];
cfg.channel = all.label;            % All channels
cfg.optimizemethod = 'cao';         % Cao method to estimate embedding dimensions
cfg.caodim = 1:5;                   % Check dimensions between 1 and 5
cfg.tau = 1;                        % Embedding delay in units of act
cfg.caokth_neighbors = 1;           % Number of neighbors
cfg.trialselect = 'ACT';            % Thresholding of trials
cfg.actthrvalue = 70;               % Reject trials with a longer act
cfg.minnrtrials = 20;               % Min trials needed for calculations
cfg.predicttime_u = 20;             % The prediction time in ms
cfg.kth_neighbors = 4;              % Number of neighbors for fixed mass search
cfg.TheilerT = 'ACT';               % Theiler correction
cfg.Path2TSTOOL = 'C:\Users\Kristjan\Dropbox\dissertation\Data\OpenTSTOOL\';
cfg.feedback = 'no';
cfg.toi = [0.05 0.5];               % The time interval of interest in secs
data = TEprepare(cfg, all);         % Prepares the data for subsequent TE analysis

%%% stats section %%%
cfg = [];
cfg.surrogatetype = 'trialshuffling';  % Surrogate data for trials
cfg.shifttesttype = 'TEshift>TE';      % The type of shift test that is used
cfg.fileidout = strcat('Results');     % A prefix for the results filename
TEsurrogatestats(cfg, data);  % Computes transfer entropy, the surrogate data, the shift test and the statistical test
7. Comparing two conditions:

% First part is the same as in the single condition case.
cfg = [];
cfg.shifttesttype = 'TEshift>TE';   % The type of shift test
cfg.fileidout = strcat('Results');  % Results filename
TEconditionstatssingle(cfg, data1, data2);  % Computes transfer entropy, the shift test data and the permutation test
Appendix B
Codes and instructions for calculating Granger causality
1. MATLAB is needed.
2. Install the open source MATLAB toolbox "Granger Causal Connectivity Analysis" from http://www.informatics.sussex.ac.uk/users/anils/
3. Run the following commands for paths:
addpath(genpath(strcat(pwd, '\GCCA_toolbox_jan2011\')));
ccaStartup;
4. A permutation test that compares the means of two samples:
function [pValue] = permtest(v1, v2, n)
%PERMTEST Permutation test to compare means
%   v1, v2: data vectors; n: number of permutations
%   By Kristjan Korjus, 10/05/2011
if mean(v1) < mean(v2)
    temp = v2;
    v2 = v1;
    v1 = temp;
end
m = mean(v1) - mean(v2);
l1 = length(v1);
l2 = length(v2);
v = [v1 v2];
c = 0;
for i = 1:n
    a = randperm(l1 + l2);
    if (mean(v(a(1:l1))) - mean(v(a(l1+1:l1+l2)))) >= m
        c = c + 1;
    end
end
pValue = c/n * 2;
end
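For readers working outside MATLAB, the logic of the permutation test above can be sketched in Python. This is a hedged illustration of the same idea (order the samples so the first has the larger mean, count shuffled splits at least as extreme, double the one-sided p-value); the function name `perm_test` and its defaults are illustrative, not part of the thesis code.

```python
import random

def perm_test(v1, v2, n=10000, seed=0):
    """Two-sided permutation test comparing the means of v1 and v2,
    mirroring the MATLAB permtest listing above."""
    rng = random.Random(seed)
    # Ensure v1 is the sample with the larger mean
    if sum(v1) / len(v1) < sum(v2) / len(v2):
        v1, v2 = v2, v1
    m = sum(v1) / len(v1) - sum(v2) / len(v2)  # observed difference
    pooled = list(v1) + list(v2)
    l1 = len(v1)
    count = 0
    for _ in range(n):
        rng.shuffle(pooled)
        diff = sum(pooled[:l1]) / l1 - sum(pooled[l1:]) / (len(pooled) - l1)
        if diff >= m:  # at least as extreme as observed
            count += 1
    return 2 * count / n  # double the one-sided p-value

# Identical samples: observed difference is 0, so about half of the
# permutations are "at least as extreme" and the p-value is near 1.
p = perm_test([1, 2, 3, 4], [1, 2, 3, 4])
```

Note that, as in the MATLAB version, doubling the one-sided count can produce values above 1, which are simply read as non-significant.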
5. Single condition analysis
load('data.mat');
n = 138;             % Number of trials in each condition
l = 451;             % Length of a trial
modelorder = 25;     % Model order
results = zeros(1, n);
resultsSurrogate = zeros(1, n);
data = zeros(2, l);  % A matrix for surrogate data
for i = 1:n
    ret = cca_granger_regress(all.trial{1,i}(1:2, 1051:1501), modelorder);  % Granger
    results(1,i) = ret.gc(2,1);  % NB! Rows and columns swapped!
    data(1,:) = all.trial{1,i}(1, 1051:1501);
    data(2,:) = all.trial{1,n+1-i}(2, 1051:1501);  % Shuffling trials for surrogate data
    ret = cca_granger_regress(data, modelorder);
    resultsSurrogate(1,i) = ret.gc(2,1);
end
Mean1to2 = mean(results(1,:)) - mean(resultsSurrogate(1,:));  % Which has larger values?
PValue1to2 = permtest(results(1,:), resultsSurrogate(1,:), 100000);  % Permutation test
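The surrogate data in the listing above are built by pairing channel 1 of trial i with channel 2 of trial n+1-i, which destroys within-trial coupling while preserving each channel's statistics. A minimal Python sketch of that pairing (the helper `make_surrogates` and the toy data are hypothetical, for illustration only):

```python
def make_surrogates(trials):
    """Pair channel 1 of trial i with channel 2 of trial n+1-i
    (reversed trial order), so no surrogate trial keeps its original
    within-trial channel pairing.

    trials: list of (channel1, channel2) pairs, one per trial.
    """
    n = len(trials)
    return [(trials[i][0], trials[n - 1 - i][1]) for i in range(n)]

# Toy data: 6 trials, each a (channel1, channel2) pair of short signals
trials = [([i] * 3, [10 + i] * 3) for i in range(6)]
surr = make_surrogates(trials)
# surr[0] pairs channel 1 of trial 0 with channel 2 of trial 5
```

Causality estimated on such surrogates gives the null distribution against which the real per-trial estimates are compared by the permutation test.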
6. Comparing two conditions
load('causalitypa6TLLFPLall.mat');
n1 = 138;            % Number of trials in each condition
n2 = 138;            % Number of trials in each condition
l = 451;             % Length of a trial
modelorder = 25;
X1 = zeros(2, n1*l); % Changing the format of the data
X2 = zeros(2, n2*l);
for i = 1:n1
    X1(1:2, (i-1)*l+1:i*l) = all.trial{1,i}(1:2, 501:951);
end
for i = 1:n2
    X2(1:2, (i-1)*l+1:i*l) = all.trial{1,i}(1:2, 1051:1501);
end
[X1, M1] = cca_rm_ensemblemean(X1, n1, l, 0);  % Subtracting the mean
[X2, M2] = cca_rm_ensemblemean(X2, n1, l, 0);
[H1, ks1] = cca_kpss_mtrial(X1, n1, l, modelorder, 0.1);  % Stationary?
[H2, ks2] = cca_kpss_mtrial(X2, n1, l, modelorder, 0.1);
count1 = 1;
for i = 1:n1  % All trials for condition 1
    [ret] = cca_granger_regress(X1(1:2, (i-1)*l+1:i*l), modelorder);  % Granger for a single trial
    if H1(1,i) == 1  % Only using stationary trials
        ResultsPrior2(count1) = ret.gc(1,2);
        count1 = count1 + 1;
    end
end
count1 = 1;
for i = 1:n2  % All trials for condition 2
    [ret] = cca_granger_regress(X2(1:2, (i-1)*l+1:i*l), modelorder);
    if H2(1,i) == 1  % Only using stationary trials
        ResultsPost2(count1) = ret.gc(1,2);
        count1 = count1 + 1;
    end
end
pValuePermTest1 = permtest(ResultsPost2, ResultsPrior2, 200000)  % Permutation test