Part I: Information theory basics

Information-Theoretic Tools for Social Media
Greg Ver Steeg and Aram Galstyan
ICWSM Tutorial, July 11, 2013

You could be non-parametrically estimating entropies before the tutorial starts…
Wifi: Cambridge MS0711
Or visit http://www.isi.edu/~gregv/npeet.html to download the code.
If you don't have "scipy" (scientific Python) installed, I recommend the "Scipy Superpack": http://fonnesbeck.github.com/ScipySuperpack/
Information theory: Reliable communication over a noisy channel

[Diagram: 0 → Encoder → 000 → Noisy Channel → 001 → Decoder → 0]

"How much information can we send?" is an ill-posed question.
What is the maximum rate of error-free communication over all possible codes?
Surprises:
-  Error-free is possible!
-  There is a simple formula for this rate! (Mutual information)

Examples of noisy channels

1956: "Information theory has, in the last few years, become something of a scientific bandwagon… It will be all too easy for our somewhat artificial prosperity to collapse overnight when it is realized that the use of a few exciting words like information, entropy, redundancy do not solve all of our problems."
$I(X : Y) = E\left[\log \frac{p(Y|X)}{p(Y)}\right]$

e.g., mutual information. We will emphasize two things:
–  Estimation
–  Useful, meaningful measures

Outline:
•  Information Theory Basics
   –  Entropy, MI, discrete IT estimators
   –  Entropy estimation demo
   –  Example: predicting verdicts from text
•  Social network dynamics
   –  Entropic measures for time series
   –  Transfer entropy & Granger causality
   –  Examples
Coffee Break (4:00-4:30)
•  Content on social networks
   –  Representing content
   –  Continuous IT estimators

Information Theory Basics:
•  Plain Old Entropy
   –  Why "log"?, Building intuition
   –  Continuous variable caveats
•  Mutual information
   –  Definition/interpretation/forms
   –  Continuous variables
   –  Dependence/multivariate measures
•  Estimation, Part 1: Discrete variables
•  Demonstration
   –  My first name has 4 letters, therefore…
•  Information in human communication (using discrete measures)

Why "log"?
•  How to quantify uncertainty? X, H(X)
$p(X = x) = p(x) = 1/6, \quad x = 1, \dots, 6$

•  Two dice: 6·6 = 36 states
•  log(6·6) = log(6) + log(6) = 2 log(6)

Axiomatic approach
•  Which functions quantify uncertainty?
   –  Continuous (a small change in p(x) should lead to a small change in our uncertainty)
   –  Increasing (if there are n equally likely outcomes, uncertainty goes up with n)
   –  Composition (the uncertainty for two independent coins should equal the sum of uncertainties for each coin)

$H(X) = E\left[\log \frac{1}{p(x)}\right] = -\sum_x p(x) \log p(x)$
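To make the formula concrete, here is a minimal sketch (my own illustration in Python, not part of the tutorial code) that computes H(X) = −Σ p(x) log2 p(x) for the die above:

```python
import numpy as np

def entropy_bits(p):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # by convention, 0 * log(0) = 0
    return -np.sum(p * np.log2(p))

# A fair six-sided die: p(x) = 1/6 for x = 1..6
print(entropy_bits([1/6] * 6))        # log2(6) ≈ 2.585 bits
# Two independent dice: 36 equally likely states, uncertainty adds
print(entropy_bits([1/36] * 36))      # 2 * log2(6) ≈ 5.17 bits
```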
Alternate interpretation: compression

Guess-my-square game:
•  I pick a square uniformly at random
•  You can ask yes/no questions to determine the square
•  How many questions are required?
•  To distinguish between N squares, we need log2 N questions
•  In Round 2: I prefer the bottom two rows, and half the time pick one of those squares
•  Now you can find the correct square with fewer questions on average (see the sketch below)
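A small numeric illustration of that point (my own sketch; the slides do not specify the board size, so an 8×8 board is assumed here): the biased Round-2 distribution has entropy below log2 of the number of squares, which is exactly the saving in the average number of yes/no questions:

```python
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

n_squares = 64                                     # assumed 8x8 board (not stated in the slides)
print(entropy_bits([1 / n_squares] * n_squares))   # uniform: log2(64) = 6 questions

# Round 2: half the time a square from the bottom two rows (16 squares),
# otherwise one of the remaining 48 squares
biased = [0.5 / 16] * 16 + [0.5 / 48] * 48
print(entropy_bits(biased))                        # ≈ 5.79 bits < 6: fewer questions on average
```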
Entropy of a Continuous Random Variable

[Figure: a distribution p(x), the uniform density of height 1/α on the interval [0, α].]

•  What is the probability of observing x = 3.1415926… ?
•  p(x)dx tells us the probability of observing a number in [x, x+dx)

(Differential) Entropy

Discretize [0, α] into bins of width dx; each discrete bin has probability dx/α, and there are α/dx bins:

$H(X) = -\sum_{i=1}^{\alpha/dx} \frac{dx}{\alpha} \log \frac{dx}{\alpha} = \log \alpha - \log dx \;\to\; +\infty \quad \text{as } dx \to 0$

The differential entropy drops the diverging −log dx term; for this uniform distribution it is just log α:

$H_{\mathrm{diff}}(X) = -\int dx\, p(x) \log p(x) = E\left[\log \frac{1}{p(x)}\right]$
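A quick numerical check of this derivation (my own sketch, not tutorial code): the discretized entropy of a Uniform(0, α) variable grows like log α − log dx as the bin width dx shrinks, while the differential entropy stays at log α:

```python
import numpy as np

alpha = 2.0                                    # Uniform(0, alpha), density p(x) = 1/alpha
print("differential entropy (nats):", np.log(alpha))

for dx in [0.1, 0.01, 0.001]:
    n_bins = int(round(alpha / dx))            # each bin has probability dx/alpha
    p_bin = dx / alpha
    h_discrete = -n_bins * p_bin * np.log(p_bin)
    print(f"dx = {dx}: discrete entropy = {h_discrete:.3f}, "
          f"log(alpha) - log(dx) = {np.log(alpha) - np.log(dx):.3f}")
```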
Mutual information

[Diagram: X → Noisy Channel → Y]

$C = \max_{p(X)} I(X : Y)$

The channel capacity is a mutual information!

Mutual information:
$I(X : Y) = H(X) + H(Y) - H(X, Y)$
H(X) + H(Y) is the uncertainty if X and Y were independent; H(X, Y) is the uncertainty of (X, Y) considered as one system.

Some things to notice:
•  Symmetric
•  A difference of entropies
•  Non-negative

Conditional entropy:
$H(Y|X) = \sum_x p(x)\, H(Y|X = x)$

Read off the other ways of describing mutual information:
$I(X : Y) = H(X) + H(Y) - H(X, Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$
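A minimal sketch checking these identities on a small joint distribution table (my own illustration; the probabilities below are made up): it computes I(X:Y) both as H(X) + H(Y) − H(X,Y) and directly as E[log p(x,y)/(p(x)p(y))]:

```python
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# A small joint distribution p(x, y): rows index x, columns index y
pxy = np.array([[0.3, 0.1],
                [0.1, 0.5]])
px = pxy.sum(axis=1)                  # marginal p(x)
py = pxy.sum(axis=0)                  # marginal p(y)

mi_from_entropies = entropy_bits(px) + entropy_bits(py) - entropy_bits(pxy)
mi_direct = np.sum(pxy * np.log2(pxy / np.outer(px, py)))
print(mi_from_entropies, mi_direct)   # the two expressions agree
```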
Independence

$I(X : Y) = H(X) + H(Y) - H(X, Y)$
(H(X) + H(Y): uncertainty if X and Y were independent; H(X, Y): uncertainty considered as one system)

$H(X) = E\left[\log \frac{1}{p(x)}\right]$

$I(X : Y) = E\left[\log \frac{1}{p(x)} + \log \frac{1}{p(y)} - \log \frac{1}{p(x, y)}\right] = E\left[\log \frac{p(x, y)}{p(x)\,p(y)}\right]$

$I(X : Y) = 0 \;\rightarrow\; p(x, y) = p(x)\,p(y)$
Extends to Conditional Independence
•  Bayesian networks, e.g., can be read as encoding a set of "conditional independence" relationships

$p(X, Y|Z) = p(X|Z)\,p(Y|Z)\ \forall Z$ (written $X \perp Y \mid Z$)

$X \perp Y \mid Z \;\rightarrow\; I(X : Y|Z) = 0$

$I(X : Y|Z) = H(X|Z) - H(X|Z, Y)$
First useful(?) property for M.L.

$I(X : Y) = 0 \;\rightarrow\; p(x, y) = p(x)\,p(y)$

•  You don't get this for other "correlation" measures (Pearson, Kendall, Spearman, …)
•  MI captures nonlinear relationships, and the size of MI has many nice interpretations
•  Extends to multivariate relationships / conditional MI
•  But is it "useful"? We have to estimate p(x, y) first anyway…

Estimation, Part 1: Discrete Variables
•  Samples: $x^{(i)} \sim p(X),\ i = 1, \dots, N$
•  An "asymptotically unbiased" estimator satisfies
   $\lim_{N \to \infty} E\big[\hat{H}_N(X)\big] = H(X)$
•  For discrete entropy, the 'plug-in' estimator:
   $\hat{H}(X) = -\sum_x \hat{p}(x) \log \hat{p}(x)$, where $\hat{p}(x) = (\text{number of times } x \text{ is observed})/N$
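A minimal sketch of the plug-in estimator (my own illustration; the tutorial's NPEET package at http://www.isi.edu/~gregv/npeet.html provides its own, more careful estimators):

```python
import numpy as np
from collections import Counter

def plugin_entropy_bits(samples):
    """Plug-in estimate: H_hat = -sum_x p_hat(x) log2 p_hat(x),
    with p_hat(x) = (number of times x is observed) / N."""
    N = len(samples)
    p_hat = np.array([count / N for count in Counter(samples).values()])
    return -np.sum(p_hat * np.log2(p_hat))

# 16 equally likely states, N = 32 samples, as in the figures that follow
rng = np.random.default_rng(0)
samples = list(rng.integers(0, 16, size=32))
print(plugin_entropy_bits(samples))   # typically comes out below the true 4 bits
```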
How well do we do?
[Figure: true distribution p(X = i), uniform over 16 states, entropy = 4 bits; estimated distribution p̂(X = i) from N = 32 samples, estimated entropy = 3.5 bits.]

How well do we do?
[Figure: histogram of estimated entropies (bits) over repeated experiments with 16 states and 32 samples; the estimates fall between roughly 2.8 and 4.0 bits, below the true H(X) = 4 bits.]
Naïve estimator for MI?
Again, the standard formula using observed frequency counts:

$\hat{I}(X : Y) = \hat{E}\left[\log \frac{\hat{p}(x, y)}{\hat{p}(x)\,\hat{p}(y)}\right]$

One way to think of it: $\hat{I}(X : Y) = \hat{H}(X) - \hat{H}(X|Y)$
•  The (under-estimate) bias is worse for the conditional term, since there are fewer samples for each value of Y. So MI is over-estimated…

Bias for MI
E.g., for $x = 1, \dots, 16$ and $y = 1, \dots, 16$ with $p(x, y) = 1/(16 \cdot 16)$, then $I(X : Y) = 0$.
Again, let #samples = 2 · #states.
[Figure: histogram of estimated MI (bits) over repeated experiments; the true I(X : Y) = 0 is marked, and the estimates spread up to about 0.6 bits, all above the true value.]
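A sketch reproducing this bias experiment (my own code, not the tutorial's; it uses the equivalent form Î = Ĥ(X) + Ĥ(Y) − Ĥ(X,Y)): plug-in MI estimates on independent uniform pairs come out clearly above the true value of 0 bits:

```python
import numpy as np
from collections import Counter

def plugin_entropy_bits(samples):
    N = len(samples)
    p_hat = np.array([count / N for count in Counter(samples).values()])
    return -np.sum(p_hat * np.log2(p_hat))

def plugin_mi_bits(xs, ys):
    # Equivalent to H_hat(X) - H_hat(X|Y) for plug-in estimates
    return (plugin_entropy_bits(xs) + plugin_entropy_bits(ys)
            - plugin_entropy_bits(list(zip(xs, ys))))

rng = np.random.default_rng(0)
n_states = 16                              # x and y each take 16 values
n_samples = 2 * n_states * n_states        # 512 samples: assumed reading of "#samples = 2 * #states"
estimates = []
for _ in range(200):
    xs = list(rng.integers(0, n_states, size=n_samples))
    ys = list(rng.integers(0, n_states, size=n_samples))  # independent of xs: true MI = 0
    estimates.append(plugin_mi_bits(xs, ys))
print(np.mean(estimates))                  # clearly positive despite I(X:Y) = 0
```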
Three possible solutions
•  Analytic estimate of bias (Panzeri-Treves)
•  Bootstrap
•  Shuffle test

Bias for MI
The bias scales roughly as #states / #samples.
[Figure: the same histogram of estimated MI (bits) versus the true I(X : Y) = 0 as above.]
Bias for MI
-  Bootstrap: generate new samples based on $\hat{p}(x, y)$
-  Estimate the bias for those samples, and use it as a correction

Permutation test
•  For a given set of samples $(x^{(i)}, y^{(i)}),\ i = 1, \dots, N$
•  Generate many "shuffled" versions $(x^{\pi(i)}, y^{(i)}),\ i = 1, \dots, N$
•  For these, $I(X_{\text{shuffle}} : Y) = 0$, so this gives an empirical confidence interval for correlations that are due to chance (see the sketch below)
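A minimal sketch of the shuffle (permutation) test (illustration only; the function and variable names are my own): permuting x destroys any dependence with y, so MI estimates on shuffled data give an empirical null distribution against which the unshuffled estimate can be compared:

```python
import numpy as np
from collections import Counter

def plugin_entropy_bits(samples):
    N = len(samples)
    p_hat = np.array([count / N for count in Counter(samples).values()])
    return -np.sum(p_hat * np.log2(p_hat))

def plugin_mi_bits(xs, ys):
    return (plugin_entropy_bits(xs) + plugin_entropy_bits(ys)
            - plugin_entropy_bits(list(zip(xs, ys))))

def shuffle_test(xs, ys, n_shuffles=200, rng=None):
    """Observed MI estimate plus the 95th percentile of MI under shuffling."""
    rng = rng if rng is not None else np.random.default_rng()
    observed = plugin_mi_bits(xs, ys)
    null = [plugin_mi_bits(list(rng.permutation(xs)), ys) for _ in range(n_shuffles)]
    return observed, np.percentile(null, 95)

# Example: y depends (weakly) on x
rng = np.random.default_rng(0)
xs = list(rng.integers(0, 4, size=500))
ys = list((np.array(xs) + rng.integers(0, 3, size=500)) % 4)
observed, threshold = shuffle_test(xs, ys, rng=rng)
print(observed, threshold)   # observed above the shuffled 95th percentile => unlikely to be chance
```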
Example: Information in human speech