Information theory Entropy

The University
of Manchester
Introduction to the analysis of the neural code with methods from information theory
Dr Marcelo A Montemurro
[email protected]
Information theory
Entropy
Suppose there is a source that produces symbols, taken from a given alphabet
A = {a,b, c,..., z}
A = {0,1}
A = {1, 2, 3, 4, 5, 6}
A = {heads, tails}
Assume also that there is a certain probability distribution, with support over the
alphabet, that determines the outcome of the source (for the moment we assume iid
sources).
$x_i \in A, \quad i = 1, \dots, n$
$p(x_i)$ : probability of observing outcome $i$
Normalisation of a probability distribution:
$\sum_i p(x_i) = 1$
Empirical determination of a probability
There are ni outcomes of event i in a total of N trials. Then if N>>1
$p(x_i) \approx \dfrac{n_i}{N}$
We define the ‘surprise’ of event i as
$-\log_2 p(x_i)$   [bits]
Example
[Coin toss with outcomes heads and tails]
$p(\text{heads}) = 0.5, \qquad p(\text{tails}) = 0.5$
$-\log_2 p(\text{heads}) = 1$
$-\log_2 p(\text{tails}) = 1$
What is the average surprise?
Average of a random variable
$\langle X \rangle = \sum_i p(x_i)\, x_i$
Then the average surprise is
$H(X) = \langle -\log_2 p(x) \rangle = -\sum_i p(x_i)\log_2 p(x_i)$
Entropy
$H(X) = -\sum_i p(x_i)\log_2 p(x_i)$
For our coin,
$H = -p(\text{heads})\log_2 p(\text{heads}) - p(\text{tails})\log_2 p(\text{tails})$
$H = 1$
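As a quick numerical check of the definition above, here is a minimal Python sketch (not part of the original slides) that computes the entropy of any discrete distribution; it reproduces H = 1 bit for the fair coin and anticipates the die examples below.

```python
import numpy as np

def entropy(p):
    """Entropy in bits of a discrete distribution given as a probability vector."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # convention: 0 * log2(0) = 0
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))            # fair coin: 1.0 bit
print(entropy([0, 0, 0, 0, 0, 1]))    # loaded die: 0.0 bits
print(entropy(np.ones(6) / 6))        # fair die: log2(6) ≈ 2.585 bits
```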
Example
Frequency of letters in English text
p(a)=0.082; p(e)=0.127; p(q)=0.001
Surprise of letter 'e': $-\log_2 p(e) = 2.97$
Surprise of letter 'q': $-\log_2 p(q) = 9.97$
$H = -\sum_i p(x_i)\log_2 p(x_i), \qquad x_i \in \{a, b, c, \dots, z\}$
H = 4.19 [bits]
If all the letters appeared with the same probability, then
$p(x_i) = \dfrac{1}{26} = 0.0385, \qquad x_i \in \{a, b, c, \dots, z\}$
and
$H = -\sum_i p(x_i)\log_2 p(x_i) = -26\cdot\dfrac{1}{26}\log_2\dfrac{1}{26} = \log_2 26$
H = 4.70 [bits]
This is larger than the entropy of the real letter distribution. It can be shown that the entropy attains
its maximum value for a uniform distribution.
Imagine a loaded die that always produces the same outcome
A = {1,2, 3, 4, 5, 6}
p(A) = {0,0,0,0, 0,1}
What is the surprise of each outcome?
What is the average surprise?
$H = -\sum_i p(x_i)\log_2 p(x_i)$
What if the die is fair?
A = {1,2, 3, 4, 5, 6}
$p(A) = \{\tfrac{1}{6}, \tfrac{1}{6}, \tfrac{1}{6}, \tfrac{1}{6}, \tfrac{1}{6}, \tfrac{1}{6}\}$
What is the surprise of each outcome?
What is the average surprise?
$H = -\sum_i p(x_i)\log_2 p(x_i)$
In general, the less uniform (less random) a distribution, the lower its entropy.
p(0) = 0.8 p(1) = 0.2 H = 0.72
p(0) = 0.4 p(1) = 0.6 H = 0.97
p(0) = 0.6 p(1) = 0.4 H = 0.97
p(0) = 0.2 p(1) = 0.8 H = 0.72
In general, for the independent binary variable case
$p(0) = \alpha, \qquad p(1) = 1 - \alpha, \qquad \alpha \in [0, 1]$
$H = -p(0)\log_2 p(0) - p(1)\log_2 p(1)$
$H = -\alpha\log_2\alpha - (1-\alpha)\log_2(1-\alpha)$
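The binary entropy function is easy to evaluate numerically. The short sketch below (an illustration, not from the slides) reproduces the values quoted above (H ≈ 0.72 for p = 0.8/0.2 and H ≈ 0.97 for p = 0.6/0.4) and shows that the maximum of 1 bit is reached at α = 0.5.

```python
import numpy as np

def binary_entropy(alpha):
    """H(alpha) = -alpha*log2(alpha) - (1-alpha)*log2(1-alpha), in bits (H(0)=H(1)=0)."""
    if alpha in (0.0, 1.0):
        return 0.0
    return -alpha * np.log2(alpha) - (1 - alpha) * np.log2(1 - alpha)

for a in [0.2, 0.4, 0.5, 0.6, 0.8]:
    print(f"alpha = {a:.1f}  H = {binary_entropy(a):.2f} bits")
```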
Thus for a noiseless communication system the entropy quantifies the amount of
information that can be encoded in the signal
Signal with low entropy -> low information
Signal with high entropy -> high information
Noiseless channel
[Diagram: input a (probability p(a)) is mapped to output 0, and input b (probability p(b)) is mapped to output 1.]
However, many real systems, like neurons, have a noisy output
[Diagram: spike counts (in parentheses) recorded on repeated trials of Stimulus 1 and Stimulus 2; the response varies from trial to trial even when the stimulus is fixed.]
Because of the noise, a new source of variability has to be taken into account. On the one hand, we have
the variability due to the stimulus (good variability); on the other, the variability created by the noise
(bad variability).
How do we handle this more complex problem? How can we quantify information
in the presence of noise in the channel?
[Diagram: X → transmitter → noisy channel p(Y|X) → receiver → Y]
Noiseless channel
[Diagram: a → 0 and b → 1, deterministically.]
Noisy channel
[Diagram: a → 0 with probability 1−δ and a → 1 with probability δ; b → 1 with probability 1−γ and b → 0 with probability γ.]
[Diagram: stimulus s → P(r|s) → response r. The conditional distribution P(r|s) acts as a probabilistic dictionary between stimuli and responses.]
• The amount of information about the stimulus encoded in the neural response is
quantified by the Mutual Information I(S;R).
• In general, Mutual Information quantifies how much can be known about one variable
by looking at the other.
• It can be computed from real data by characterising the stimulus-response statistics.
Mutual Information
I(S, R) = H(R) - H(R | S)
Response entropy: variability of the whole response
$H(R) = -\sum_r P(r)\log_2 P(r), \qquad P(r) = \langle P(r | s)\rangle_s$
Noise entropy: variability of the response at fixed stimulus
$H(R | S) = \left\langle -\sum_r P(r | s)\log_2 P(r | s)\right\rangle_s$
Noisy binary channel
[Diagram: stimulus a is mapped to response 0 or, with crossover probability δ, to response 1; stimulus b is mapped to response 1 or, with crossover probability γ, to response 0.]
Stimulus = {a, b},  p(S) = {p(a), p(b)}
Response = {0, 1},  p(R) = {p(0), p(1)}
Probabilistic dictionary:
$P(R | S) = \begin{pmatrix} p(0 | a) & p(0 | b) \\ p(1 | a) & p(1 | b) \end{pmatrix} = \begin{pmatrix} 1-\delta & \gamma \\ \delta & 1-\gamma \end{pmatrix}$
Simple example
[Diagram: a → 0 and b → 1, each with the same crossover probability δ.]
$p(S) = \{0.5,\ 0.5\}$
$p(R | S) = \begin{pmatrix} 1-\delta & \delta \\ \delta & 1-\delta \end{pmatrix}$
Let us first find $p(R) = \{p(0),\, p(1)\}$:
$p(0) = p(0 | a)\,p(a) + p(0 | b)\,p(b) = \tfrac{1}{2}(1-\delta) + \tfrac{1}{2}\delta = \tfrac{1}{2}$
$p(1) = p(1 | a)\,p(a) + p(1 | b)\,p(b) = \tfrac{1}{2}\delta + \tfrac{1}{2}(1-\delta) = \tfrac{1}{2}$
then
$p(R) = \{\tfrac{1}{2},\, \tfrac{1}{2}\}$
Now we can find the entropies to compute the information.
$H(R) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1$
$H(R | S) = \left\langle -\sum_r P(r | s)\log_2 P(r | s)\right\rangle_s$
$= \big[-\sum_r P(r | a)\log_2 P(r | a)\big]\,p(a) + \big[-\sum_r P(r | b)\log_2 P(r | b)\big]\,p(b)$
$= \big[-p(0 | a)\log_2 p(0 | a) - p(1 | a)\log_2 p(1 | a)\big]\,p(a) + \big[-p(0 | b)\log_2 p(0 | b) - p(1 | b)\log_2 p(1 | b)\big]\,p(b)$
$= \tfrac{1}{2}\big[-(1-\delta)\log_2(1-\delta) - \delta\log_2\delta\big] + \tfrac{1}{2}\big[-\delta\log_2\delta - (1-\delta)\log_2(1-\delta)\big] = -(1-\delta)\log_2(1-\delta) - \delta\log_2\delta$
Then, to compute the information we just take the difference between the
two entropies
$I(S, R) = H(R) - H(R | S)$
$I(S, R) = 1 + (1-\delta)\log_2(1-\delta) + \delta\log_2\delta$
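The closed-form result above is easy to check numerically. The sketch below (an illustration, not part of the slides) evaluates I(S, R) = 1 + (1−δ)log₂(1−δ) + δ log₂ δ for a few values of δ: the information is 1 bit for a noiseless channel (δ = 0), drops to 0 bits at δ = 0.5, and rises back to 1 bit as δ → 1 (a fully inverted but still deterministic channel).

```python
import numpy as np

def bsc_information(delta):
    """Mutual information (bits) of the binary symmetric channel with
    equiprobable inputs and crossover probability delta:
    I = H(R) - H(R|S) = 1 + (1-delta)*log2(1-delta) + delta*log2(delta)."""
    terms = [x * np.log2(x) for x in (delta, 1 - delta) if x > 0]
    return 1 + sum(terms)

for d in [0.0, 0.1, 0.25, 0.5, 0.75, 1.0]:
    print(f"delta = {d:.2f}  I = {bsc_information(d):.3f} bits")
```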
What is the meaning of information?
I(S, R) = H(R) - H(R | S)
Response entropy: variability of the whole response
$H(R) = -\sum_i P(r_i)\log_2 P(r_i), \qquad P(r_i) = \langle P(r_i | s)\rangle_s$
Noise entropy: variability of the response at fixed stimulus
$H(R | S) = \left\langle -\sum_i P(r_i | s)\log_2 P(r_i | s)\right\rangle_s$
I(S, R) = H(S) - H(S | R)
Stimulus entropy: variability of the whole stimulus
$H(S) = -\sum_i P(s_i)\log_2 P(s_i), \qquad P(s_i) = \langle P(s_i | r)\rangle_r$
Noise entropy: variability of the stimulus at fixed response
$H(S | R) = \left\langle -\sum_i P(s_i | r)\log_2 P(s_i | r)\right\rangle_r$
Meaning 1: Number of yes/no questions needed to identify the stimulus
a) Deterministic responses
[Diagram: four stimuli, each mapped to its own distinct response (Stimulus 1 → Response 1, ..., Stimulus 4 → Response 4).]
$P(s_i) = \tfrac{1}{4}, \qquad P(s_i | r_j) = \delta_{ij}$
$H(S) = 2$
$H(S | R) = \left\langle -\sum_i P(s_i | r)\log_2 P(s_i | r)\right\rangle_r = 0$
Before observing the responses, H(S) questions need to be asked on average
When a response is observed, H(S | R) questions need to be asked on average
b) Overlapping responses
[Diagram: Stimulus 1 maps to Responses 1 and 2; Stimulus 2 maps to Responses 2 and 3, so the two stimuli overlap on Response 2.]
$P(s_i) = \tfrac{1}{2}$
$P(s_i | r_j) = \begin{pmatrix} 1 & \tfrac{1}{2} & 0 \\ 0 & \tfrac{1}{2} & 1 \end{pmatrix}$
$H(S) = 1$
$H(S | R) = \left\langle -\sum_i P(s_i | r)\log_2 P(s_i | r)\right\rangle_r$
Only Response 2 is ambiguous (it could have come from either stimulus), so only its term survives:
$H(S | R) = P(r_1)\cdot 0 + P(r_2)\log_2 2 + P(r_3)\cdot 0 = \tfrac{1}{3}$
Before observing the responses, H(S) questions need to be asked on average
When a response is observed, H(S | R) questions need to be asked on average
$I(S, R) = 1 - \tfrac{1}{3} = \tfrac{2}{3}$
Information measures the reduction in uncertainty about the stimulus, after the
responses are observed
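The overlapping-responses example can be checked with a generic mutual-information routine. The sketch below (illustrative, not from the slides) computes I(S;R) from a joint probability table; the table assumes that each stimulus produces its "private" response with probability 2/3 and the shared response with probability 1/3, which reproduces H(S|R) = 1/3 and I = 2/3 bits.

```python
import numpy as np

def entropy_bits(p):
    """Entropy in bits of any non-negative probability array (zeros ignored)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(joint):
    """I(S;R) in bits from a joint probability table joint[s, r]."""
    joint = np.asarray(joint, dtype=float)
    p_s = joint.sum(axis=1)           # stimulus marginal
    p_r = joint.sum(axis=0)           # response marginal
    # I(S;R) = H(S) + H(R) - H(S,R)
    return entropy_bits(p_s) + entropy_bits(p_r) - entropy_bits(joint)

# Assumed joint table for the overlapping-responses example:
# rows = stimuli 1..2, columns = responses 1..3
joint = np.array([[1/3, 1/6, 0.0],
                  [0.0, 1/6, 1/3]])

print(mutual_information(joint))      # 2/3 ≈ 0.667 bits
```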
Meaning 2: upper bound to the number of messages that can be transmitted
through a communication channel
[Diagram: spike counts across repeated trials of each stimulus, and the space of all responses partitioned into the sets of responses evoked by S1, S2, S3, and S4.]
Question:
what is the number of stimuli n that can be encoded in the neural response such that
their responses do not overlap?
Typical sequences
Consider an i.i.d. source producing sequences $x^{(n)} = x_1 x_2 \dots x_n$, where each symbol is drawn from an alphabet $\{a_j\},\ j = 1, \dots, k$, with probabilities $p(a_j) = p_j$.
Example: $x_i \in \{1, 0\}$, $k = 2$, i.i.d.; for instance $x^{(2)} = 01$ or $x^{(2)} = 00$.
What is the probability of a given sequence?
$p = p_1^{n_1}\, p_2^{n_2} \cdots p_k^{n_k}$
A typical sequence is one in which every symbol appears a number of times equal to its average:
$n_i \approx n\, p_i$
Then the probability of a typical sequence will be
$p \approx p_1^{n p_1}\, p_2^{n p_2} \cdots p_k^{n p_k}$
Taking logs,
$-\log_2 p = -\log_2\!\left(p_1^{n p_1}\, p_2^{n p_2} \cdots p_k^{n p_k}\right) = -n\sum_j p_j\log_2 p_j = nH$
Then
$p \approx 2^{-nH}$
j
This is the probability of each typical sequence. What is the probability of all typical sequences taken together?
If $\Omega$ is the number of typical sequences, then their total probability is $\Omega\, p$.
First, how many typical sequences are there?
Example: $x_i \in \{a, b\}$, $n = n_1 + n_2$, e.g. $x = \underbrace{aa\dots a}_{n_1}\underbrace{bb\dots b}_{n_2}$, together with all its rearrangements.
$\Omega = \binom{n}{n_1} = \dfrac{n!}{n_1!\, n_2!} = \dfrac{n!}{(p_1 n)!\,(p_2 n)!}$
When we have k symbols,
$\Omega = \dfrac{n!}{(p_1 n)!\,(p_2 n)!\cdots(p_k n)!}$
If the sequences are very long, $n \gg 1$, we can compute $\log\Omega$ using Stirling's approximation, $\log(n!) \approx n\log n - n$:
$\log_2\Omega \approx nH$
$\Omega \approx 2^{nH}$
and
$\Omega\, p \approx 1$
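These asymptotic relations can be checked numerically for a binary source. The sketch below (an illustration, not from the slides) compares the exact count of typical sequences, the binomial coefficient C(n, p₁n), with 2^{nH}, and the probability of one such sequence with 2^{-nH}; both ratios approach H as n grows.

```python
from math import comb, log2

p1, p2 = 0.3, 0.7                          # symbol probabilities
H = -p1 * log2(p1) - p2 * log2(p2)         # source entropy in bits

for n in [10, 100, 1000]:
    n1 = round(p1 * n)                     # typical count of symbol 1
    n2 = n - n1
    log_omega = log2(comb(n, n1))          # log2 of the number of typical sequences
    log_p = n1 * log2(p1) + n2 * log2(p2)  # log2 of the probability of one such sequence
    print(f"n = {n:5d}  log2(Omega)/n = {log_omega / n:.3f}  "
          f"-log2(p)/n = {-log_p / n:.3f}  H = {H:.3f}")
```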
[Diagram repeated: the space of all responses partitioned into the sets of responses evoked by S1, S2, S3, and S4.]
Question:
what is the number of stimuli that can be encoded in the neural response such that
their responses do not overlap?
Simple explanation
There are typically $2^{H(R)}$ responses that could be generated by the stimuli.
However, due to the 'noise' fluctuations in the response, a number $2^{H(R|S)}$ of
different responses can be attributed to the same stimulus.
Then, how many stimuli can be reliably encoded in the neural response?
$\dfrac{2^{H(R)}}{2^{H(R|S)}} = 2^{H(R) - H(R|S)} = 2^{I(R,S)}$
Therefore, finding that a neuron transmits n bits of information within a
behaviourally relevant time window means that there are potentially $2^n$ different
stimuli that can be discriminated on the basis of the neuron's response alone.
How do we estimate information in a neural
system?
[Diagram: an external stimulus drives a sensory system, which produces spike trains.]
Spike trains recorded in a time window of length T [ms] are discretised into bins of size Δt, giving binary response words $r = (r_1, r_2, \dots, r_L)$ of length $L = T/\Delta t$ (e.g. 1010, 1001, ..., 0010).
Each of the S stimulus conditions is presented with probability P(s), with $N_s$ trials per stimulus; from these trials we estimate the probabilistic dictionary P(r|s).
P(r|t): response probability conditional on the stimulus (at fixed time t)
P(r): unconditional response probability
$P(r) = \langle P(r | t)\rangle_t$
Response entropy: variability of the whole response
$H(R) = -\sum_r P(r)\log_2 P(r)$
Noise entropy: variability of the response at fixed time
$H(R | S) = \left\langle -\sum_r P(r | t)\log_2 P(r | t)\right\rangle_t$
with responses built from bins of size Δt within a time window T, so that $L = T/\Delta t$.
$I(R, S) = H(R) - H(R | S)$
Mutual Information quantifies how much variability is
left after subtracting the effect of noise. It is
measured in bits (Meaning 3)
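A direct ("plug-in") estimate of these quantities from data can be sketched as follows. This is illustrative code, not from the slides: it assumes the responses are already binned into binary words and stored in an array `words` of shape (n_stimuli, n_trials, L), and that stimuli are equiprobable; the conditional and unconditional word distributions are estimated by counting, and I = H(R) − H(R|S) is returned in bits.

```python
import numpy as np
from collections import Counter

def entropy_from_counts(counts):
    """Plug-in entropy (bits) from a list/array of word counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def plugin_information(words):
    """words: integer array (n_stimuli, n_trials, L) of binary responses.
    Returns the plug-in estimate of I(R,S) = H(R) - H(R|S), in bits,
    assuming each stimulus is presented with equal probability."""
    n_stim, n_trials, L = words.shape
    all_counts = Counter()
    h_noise = 0.0
    for s in range(n_stim):
        # Encode each binary word as a tuple so it can be counted.
        counts_s = Counter(tuple(w) for w in words[s])
        all_counts.update(counts_s)
        h_noise += entropy_from_counts(list(counts_s.values())) / n_stim
    h_response = entropy_from_counts(list(all_counts.values()))
    return h_response - h_noise

# Toy usage: 2 stimuli, 50 trials each, words of length L = 3
rng = np.random.default_rng(0)
words = np.stack([rng.random((50, 3)) < 0.2,      # stimulus 1: sparse responses
                  rng.random((50, 3)) < 0.8])     # stimulus 2: dense responses
print(plugin_information(words.astype(int)))
```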
Bias in the information estimation
To measure P(r|s) we need to estimate up to $2^L - 1$ parameters from the data.
The statistical errors in the estimation of P(r|s) lead to a systematic bias in the
entropies.
For $N_s \gg 1$ we can obtain a first-order approximation to the bias:
$\mathrm{Bias}[H] \approx -\dfrac{m - 1}{2N\ln 2}$
where m is the number of response 'words' with non-zero probability and $N = N_s S$.
Miller, G. A., Information Theory in Psychology (1955)
If the response is more random, the responses are spread more uniformly over the
possible response words: m is large, so the bias is large.
If the response is less random, the responses are concentrated on a few response
words: m is small, so the bias is small.
(Adapted from Panzeri et al., J. Neurophysiol., 2007)
Because each conditional distribution P(r|s) is estimated from fewer trials than P(r), the noise entropy is more strongly biased than the response entropy:
$-\mathrm{Bias}[H(R|S)] > -\mathrm{Bias}[H(R)] \;\Rightarrow\; \mathrm{Bias}[H(R) - H(R|S)] > 0$
Because of the bias, the information is overestimated: $I_{\mathrm{true}} < I$.
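A minimal sketch of the first-order (Miller) bias correction, assuming a counting-based entropy estimate as in the previous sketch. The helper name is illustrative; the correction subtracts the estimated (negative) bias −(m − 1)/(2N ln 2) from the plug-in entropy, with m the number of occupied response words and N the number of trials used to estimate that distribution.

```python
import numpy as np

def miller_corrected_entropy(counts):
    """Plug-in entropy (bits) with the first-order Miller bias correction.
    counts: array of occurrence counts of each response word."""
    counts = np.asarray(counts, dtype=float)
    counts = counts[counts > 0]
    N = counts.sum()                       # number of trials
    m = counts.size                        # occupied response words
    p = counts / N
    h_plugin = -np.sum(p * np.log2(p))
    bias = -(m - 1) / (2 * N * np.log(2))  # first-order bias estimate, in bits
    return h_plugin - bias                 # subtracting a negative bias increases H

# Example: 20 trials spread over 4 observed words
print(miller_corrected_entropy([10, 5, 3, 2]))
```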
A lower bound to the information
For words of length L, we would need to estimate up to $2^L - 1$ parameters per stimulus from the data!
Independent model:
$P_{\mathrm{ind}}(r | s) = P(r_1 | s)\,P(r_2 | s)\cdots P(r_L | s)$
To estimate this probability we need only 2L parameters!
In general, $P_{\mathrm{ind}}(r | s) \neq P(r | s)$.
Using the independent model we can compute
$H_{\mathrm{ind}}(R | S) = \left\langle -\sum_r P_{\mathrm{ind}}(r | t)\log_2 P_{\mathrm{ind}}(r | t)\right\rangle_t$
This entropy is much less biased.
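Under the independent model the word entropy factorises into a sum of single-bin entropies, so it can be estimated from just the L marginal firing probabilities per stimulus (or per time point). A minimal sketch, assuming binary words for one stimulus stored in an array of shape (n_trials, L):

```python
import numpy as np

def binary_entropy(q):
    """Entropy in bits of a Bernoulli(q) variable (with 0*log2(0) taken as 0)."""
    q = np.asarray(q, dtype=float)
    h = np.zeros_like(q)
    ok = (q > 0) & (q < 1)
    h[ok] = -q[ok] * np.log2(q[ok]) - (1 - q[ok]) * np.log2(1 - q[ok])
    return h

def independent_model_entropy(words_s):
    """H_ind(R|s) for one stimulus: sum of per-bin entropies of the marginal
    firing probabilities P(r_l = 1 | s).
    words_s: binary array of shape (n_trials, L)."""
    p_fire = words_s.mean(axis=0)          # P(r_l = 1 | s), one value per bin
    return float(np.sum(binary_entropy(p_fire)))

# Toy usage: 100 trials, 4 bins, independent firing probability 0.3 per bin
rng = np.random.default_rng(1)
words_s = (rng.random((100, 4)) < 0.3).astype(int)
print(independent_model_entropy(words_s))   # close to 4 * H(0.3) ≈ 3.52 bits
```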
There is an alternative way of estimating the entropy of the independent model.
Instead of neglecting the correlations by computing the marginals, we simply destroy
them in the original dataset.
Original data:
Trial 1:  r1 r2 r3 r4 = 1 0 1 0
Trial 2:  r1 r2 r3 r4 = 0 1 0 1
Trial 3:  r1 r2 r3 r4 = 1 0 0 1
After shuffling each bin independently across trials:
Trial 1:  1 1 1 1
Trial 2:  0 0 0 0
Trial 3:  1 0 0 1
Asymptotically the shuffled noise entropy $H_{sh}(R|S)$ equals $H_{ind}(R|S)$, but its bias is much closer to that of $H(R|S)$, essentially because shuffling creates a larger number of response words with non-zero probability.
Now we propose the following estimator for the information:
$I_{sh}(S, R) = H(R) - \big(H(R | S) - H_{sh}(R | S) + H_{ind}(R | S)\big)$
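A sketch of the shuffling step and of the Ish estimator, assuming plug-in entropy helpers like those in the earlier sketches (words of shape (n_stimuli, n_trials, L), equiprobable stimuli). The function and variable names are illustrative, not from any specific toolbox.

```python
import numpy as np

def shuffle_within_stimulus(words, rng):
    """Destroy correlations between bins: for each stimulus, independently
    permute each bin (column) across trials.
    words: binary array of shape (n_stimuli, n_trials, L)."""
    shuffled = words.copy()
    n_stim, n_trials, L = words.shape
    for s in range(n_stim):
        for l in range(L):
            shuffled[s, :, l] = rng.permutation(words[s, :, l])
    return shuffled

# Assuming helper functions response_entropy(words) -> H(R),
# noise_entropy(words) -> H(R|S) (plug-in, as in the earlier sketch) and
# independent_noise_entropy(words) -> H_ind(R|S):
#
#   rng  = np.random.default_rng(0)
#   H_sh = noise_entropy(shuffle_within_stimulus(words, rng))
#   I_sh = response_entropy(words) - (noise_entropy(words) - H_sh
#                                     + independent_noise_entropy(words))
```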
[Figure: information estimates I and Ish (in bits) as a function of the number of trials (log2 scale); two panels. Montemurro et al., Neural Computation (2007).]
Further improvements can be achieved with extrapolation methods
We have N trials. We then compute estimates of the information from subsets of the
data containing N, N/2, and N/4 trials. This gives three estimates: I1, I2, and I4.
$I_N(S, R) = I_\infty(S, R) + \dfrac{C_1}{N} + \dfrac{C_2}{N^2} + O\!\left(\dfrac{1}{N^3}\right)$
Up to second order this is the equation of a parabola in 1/N, so the three estimates can be
extrapolated to the limit of infinite data.
[Figure: quadratic extrapolation of the estimates I1, I2, and I4, plotted against 1/N, 2/N, and 4/N, back to 1/N → 0.]
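A minimal sketch of the quadratic extrapolation, assuming the three information estimates have already been computed from the full data and from random halves and quarters of the trials (by any of the estimators above). The fit is a parabola in 1/N and the intercept is the extrapolated value I∞; all names are illustrative.

```python
import numpy as np

def quadratic_extrapolation(I1, I2, I4, N):
    """Fit I(x) = I_inf + C1*x + C2*x**2 with x = 1/N_trials, using the
    estimates from all trials (I1, x = 1/N), half the trials (I2, x = 2/N)
    and a quarter of the trials (I4, x = 4/N). Returns the intercept I_inf."""
    x = np.array([1.0 / N, 2.0 / N, 4.0 / N])
    y = np.array([I1, I2, I4])
    coeffs = np.polyfit(x, y, deg=2)      # returns [C2, C1, I_inf]
    return coeffs[-1]

# Toy usage with made-up (hypothetical) estimates that grow as 1/N:
print(quadratic_extrapolation(I1=0.70, I2=0.75, I4=0.86, N=1024))
```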
The practical
Efficiency of the neural code of the H1 neuron of the fly
The experiment was done right before sunset, at midday, and right after sunset.
The same visual scene was presented 100-200 times.
1) Examine the data
2) Generate rasters for the three conditions
3) Compute the time-varying firing rate, allowing for different binnings
4) Compute spike-count information as a function of window size (a minimal sketch follows this list)
5) Compute spike-time information as a function of window size
6) Determine the maximum response word length for which the estimation is accurate
7) Compute the efficiency of the code: $\varepsilon = I(R,S)/H(R) = 1 - H(R|S)/H(R)$
8) Discuss
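For item 4 (and the efficiency in item 7), here is a minimal sketch of one way to proceed, assuming the data for one condition are available as a list `trials` of spike-time arrays (in ms, one array per repetition of the scene) and that time within the repeated scene plays the role of the stimulus, as in the slides above. All names here are hypothetical; the real practical data format may differ.

```python
import numpy as np

def spike_count_information(trials, duration, window):
    """Plug-in spike-count information (bits) using time as the stimulus:
    the response is the spike count in a window of length `window` (ms),
    the 'stimulus' is the position t of that window within the repeated scene.
    trials: list of 1-D arrays of spike times (ms); duration: scene length (ms)."""
    edges = np.arange(0.0, duration + window, window)
    # counts[i, t] = number of spikes of trial i in window t
    counts = np.array([np.histogram(tr, bins=edges)[0] for tr in trials])
    n_trials, n_windows = counts.shape

    def H(values):
        _, c = np.unique(values, return_counts=True)
        p = c / c.sum()
        return -np.sum(p * np.log2(p))

    h_r = H(counts.ravel())                                         # H(R)
    h_noise = np.mean([H(counts[:, t]) for t in range(n_windows)])  # H(R|S)
    info = h_r - h_noise
    efficiency = info / h_r if h_r > 0 else 0.0
    return info, efficiency

# Hypothetical usage (replace with the real H1 data):
# rng = np.random.default_rng(0)
# trials = [np.sort(rng.uniform(0, 10000, size=rng.poisson(400))) for _ in range(150)]
# for w in [2, 5, 10, 20, 50]:
#     I, eff = spike_count_information(trials, duration=10000, window=w)
#     print(w, I, eff)
```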