Introduction to Information Theory
L645, Dept. of Linguistics, Indiana University, Fall 2015
Background reading: T. Cover & J. Thomas (2006), Elements of Information Theory. Wiley.

Information theory answers two fundamental questions in communication theory:
- What is the ultimate data compression?
- What is the transmission rate of communication?
It applies to computer science, thermodynamics, economics, computational linguistics, ...

Information & Uncertainty
Information can be seen as a decrease in uncertainty:
- We have some uncertainty about a process, e.g., which symbol (A, B, C) will be generated by a device.
- We learn some information, e.g., that the previous symbol was B.
- How uncertain are we now about which symbol will appear?
- The more informative our information is, the less uncertain we will be.

Entropy
Entropy is the way we measure how informative a random variable is:

(1) H(p) = H(X) = -\sum_{x \in X} p(x) \log_2 p(x)

How do we get this formula?

Motivating Entropy
- Assume a device that emits one symbol (A): our uncertainty about what we will see is zero.
- Assume a device that emits two symbols (A, B): with one choice (A or B), our uncertainty is one; we could use one bit (0 or 1) to encode the outcome.
- Assume a device that emits four symbols (A, B, C, D): if we made a decision tree, there would be two levels (starting with "Is it A/B or C/D?"), so our uncertainty is two; we need two bits (00, 01, 10, 11) to encode the outcome.
We are describing a (base 2) logarithm: \log_2 M, where M is the number of symbols.

Adding probabilities
\log_2 M assumes every choice (A, B, C, D) is equally likely, which is not the case in general. Instead, we look at -\log_2 p(x), where x is the given choice, to tell us how surprising it is. If every choice x is equally likely:
- p(x) = 1/M (and M = 1/p(x))
- \log_2 M = \log_2 (1/p(x)) = -\log_2 p(x)
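A minimal sketch (in Python, not part of the original slides) of the surprisal -\log_2 p(x) for the equiprobable four-symbol device; the dictionary p and the helper surprisal are illustrative names:

```python
import math

# Assumed uniform four-symbol device (A, B, C, D), as in the example above
p = {"A": 1/4, "B": 1/4, "C": 1/4, "D": 1/4}

def surprisal(prob):
    """Surprisal in bits: -log2 p(x)."""
    return -math.log2(prob)

for symbol, prob in p.items():
    # Each equally likely symbol carries 2 bits of surprisal, matching log2 M = log2 4
    print(symbol, surprisal(prob))
```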
Average surprisal
-\log_2 p(x) tells us how surprising one particular symbol is. But on average, how surprising is a random variable? Summation gives a weighted average:

(2) H(X) = -\sum_{x \in X} p(x) \log_2 p(x) = E(\log_2 \frac{1}{p(X)})

i.e., we sum over all possible outcomes, multiplying the surprisal (-\log_2 p(x)) by the probability of seeing that outcome (p(x)). The amount of surprisal is the amount of information we need in order to not be surprised.
- H(X) = 0 if the outcome is certain.
- H(X) = 1 if there are 2 outcomes and both are equally likely.

Entropy example
Roll an 8-sided die (or pick a character from an alphabet of 8 characters). Because each outcome is equally likely, the entropy is:

(3) H(X) = -\sum_{i=1}^{8} p(i) \log_2 p(i) = -\sum_{i=1}^{8} (1/8) \log_2 (1/8) = \log_2 8 = 3

i.e., 3 bits are needed to encode this 8-character language:

A 000   E 001   I 010   O 011   U 100   F 101   G 110   H 111

Simplified Polynesian
Simplified Polynesian has 6 characters:

char:  P    T    K    A    I    U
prob:  1/8  1/4  1/8  1/4  1/8  1/8

Its entropy:

H(X) = -\sum_{i \in L} p(i) \log p(i) = -[4 \times (1/8) \log (1/8) + 2 \times (1/4) \log (1/4)] = 2.5

With a fixed-length code, we again need 3 bits:

char:  P    T    K    A    I    U
code:  000  001  010  011  100  101

BUT: since the distribution is NOT uniform, we can design a better code ...

Simplified Polynesian: Designing a better code

char:  P    T   K    A   I    U
code:  100  00  101  01  110  111

- 0 as the first digit: a 2-digit character.
- 1 as the first digit: a 3-digit character.
More likely characters get shorter codes.
Task: Code the word KATUPATI. How many bits do we need on average?
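A quick sketch (Python, not from the original slides) that checks the entropy of the distribution against the average length of the variable-length code and encodes the task word; prob and code simply transcribe the two tables above:

```python
import math

# Letter probabilities and the variable-length code from the tables above
prob = {"P": 1/8, "T": 1/4, "K": 1/8, "A": 1/4, "I": 1/8, "U": 1/8}
code = {"P": "100", "T": "00", "K": "101", "A": "01", "I": "110", "U": "111"}

# Entropy of the distribution and expected code length per character
entropy = -sum(p * math.log2(p) for p in prob.values())
avg_len = sum(prob[c] * len(code[c]) for c in prob)
print(entropy, avg_len)  # both come out to 2.5 bits

# Encoding the task word with the variable-length code
print("".join(code[c] for c in "KATUPATI"))
```

The average code length matches the entropy exactly here because every probability is a power of 1/2, so each code length can equal the corresponding surprisal.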
Joint entropy
For two random variables X & Y, the joint entropy is the average amount of information needed to specify both values:

(4) H(X, Y) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 p(x, y)

How much do the two values influence each other? E.g., the average surprisal at seeing two POS tags next to each other.

Conditional entropy
Conditional entropy: how much extra information is needed to find Y's value, given that we know X?

(5) H(Y|X) = \sum_{x \in X} p(x) H(Y|X = x)
           = \sum_{x \in X} p(x) [-\sum_{y \in Y} p(y|x) \log_2 p(y|x)]
           = -\sum_{x \in X} \sum_{y \in Y} p(x) p(y|x) \log_2 p(y|x)
           = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 p(y|x)

Chain Rule
Chain rule for entropy:

H(X, Y) = H(X) + H(Y|X)
H(X_1, ..., X_n) = H(X_1) + H(X_2|X_1) + H(X_3|X_1, X_2) + ... + H(X_n|X_1, ..., X_{n-1})

Syllables in Simplified Polynesian
Our earlier model of simplified Polynesian was too simple; joint entropy will help us build a better model.
- Probabilities for letters on a per-syllable basis, using C and V as separate random variables.
- Probabilities for consonants followed by a vowel (P(C, .)) & vowels preceded by a consonant (P(., V)):

  p    t    k        a    i    u
  1/8  3/4  1/8      1/2  1/4  1/4

On a per-letter basis, this would be the following (which we are not concerned with here):

  p     t    k        a    i    u
  1/16  3/8  1/16     1/4  1/8  1/8

Syllables in Simplified Polynesian (2)
More specifically, for CV structures the joint probability P(C, V) is:

        p      t      k
  a     1/16   3/8    1/16    1/2
  i     1/16   3/16   0       1/4
  u     0      3/16   1/16    1/4
        1/8    3/4    1/8

P(C, .) & P(., V) from before are the marginal probabilities (the bottom row and right column), e.g.:
- P(C = t) = 3/4
- P(C = t, V = a) = 3/8

Polynesian Syllables
Find H(C, V): how surprised we are on average to see a particular syllable structure.

(6) H(C, V) = H(C) + H(V|C) ≈ 1.061 + 1.375 ≈ 2.44
(7) a. H(C) = -\sum_{c \in C} p(c) \log_2 p(c) ≈ 1.061
    b. H(V|C) = -\sum_{c \in C} \sum_{v \in V} p(c, v) \log p(v|c) = 1.375

For the calculation of H(V|C), we can read the probability p(v|c) off the chart above, e.g., p(V = a | C = p) = 1/2 because 1/16 is half of 1/8.

Polynesian Syllables (2)

H(C) = -\sum_{i \in L} p(i) \log p(i)
     = -[2 \times (1/8) \log (1/8) + (3/4) \log (3/4)]
     = 2 \times (1/8) \log 8 + (3/4)(\log 4 - \log 3)
     = 2 \times (1/8) \times 3 + (3/4)(2 - \log 3)
     = 3/4 + 6/4 - (3/4) \log 3
     = 9/4 - (3/4) \log 3 ≈ 1.061

Polynesian Syllables (3)

H(V|C) = -\sum_{x \in C} \sum_{y \in V} p(x, y) \log p(y|x)
       = -[1/16 \log 1/2 + 3/8 \log 1/2 + 1/16 \log 1/2
          + 1/16 \log 1/2 + 3/16 \log 1/4 + 0 \log 0
          + 0 \log 0 + 3/16 \log 1/4 + 1/16 \log 1/2]
       = 1/16 \log 2 + 3/8 \log 2 + 1/16 \log 2 + 1/16 \log 2
          + 3/16 \log 4 + 3/16 \log 4 + 1/16 \log 2
       = 11/8 = 1.375

Polynesian Syllables (4)
Exercise: Verify this result by using H(C, V) = H(V) + H(C|V). (A sketch checking the numbers via the chain rule follows below.)
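A small sketch (Python, not from the original slides; the dictionary joint simply transcribes the P(C, V) table above) that verifies H(C) ≈ 1.061, H(V|C) = 1.375, and H(C, V) ≈ 2.44 using the chain rule:

```python
import math

# Joint probabilities P(C, V) transcribed from the table above
joint = {
    ("p", "a"): 1/16, ("t", "a"): 3/8,  ("k", "a"): 1/16,
    ("p", "i"): 1/16, ("t", "i"): 3/16, ("k", "i"): 0,
    ("p", "u"): 0,    ("t", "u"): 3/16, ("k", "u"): 1/16,
}

def H(dist):
    """Entropy in bits; 0 * log 0 terms are treated as 0."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Marginal distribution P(C), summed out of the joint table
p_c = {}
for (c, v), p in joint.items():
    p_c[c] = p_c.get(c, 0) + p

H_C = H(p_c)               # ~1.061
H_CV = H(joint)            # ~2.44
H_V_given_C = H_CV - H_C   # chain rule: H(V|C) = H(C,V) - H(C) = 1.375
print(H_C, H_V_given_C, H_CV)
```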
Pointwise mutual information
Mutual information (I(X; Y)): how related are two different random variables?
= The amount of information one random variable contains about another.
= The reduction in uncertainty of one random variable based on knowledge of the other.
Pointwise mutual information is mutual information for two points (not two distributions), e.g., a two-word collocation:

(8) I(x; y) = \log \frac{p(x, y)}{p(x) p(y)}

i.e., the probability of x and y occurring together vs. independently.
Exercise: Calculate the pointwise mutual information of C = p and V = i from the simplified Polynesian example.

Mutual information
Mutual information: on average, what is the common information between X and Y?

(9) I(X; Y) = \sum_{x, y} p(x, y) \log_2 \frac{p(x, y)}{p(x) p(y)}

Mutual information (2)
H(X|Y) is the information needed to specify X when Y is known.

(10) I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)

Take the information needed to specify X and subtract out the information when Y is known, i.e., the information shared by X and Y.
- When X and Y are independent, I(X; Y) = 0. (Note that \log_2 1 = 0, and this happens when p(x, y) = p(x)p(y).)
- When X and Y are very dependent, I(X; Y) is high.
Exercise: Calculate I(C; V) for Simplified Polynesian.

Relative entropy
How far off are two distributions from each other? Relative entropy, or Kullback-Leibler (KL) divergence, provides such a measure (for distributions p and q):

(11) D(p||q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}

Informally, this is the distance between p and q: if p is the correct distribution, D(p||q) is the average number of bits wasted by using distribution q instead of p.

Notes on relative entropy
- Often used as a distance measure in machine learning.
- D(p||q) is always nonnegative.
- D(p||q) = 0 if p = q.
- D(p||q) = \infty if there is an x \in X such that p(x) > 0 and q(x) = 0.
- Not symmetric: D(p||q) \neq D(q||p).

Divergence as mutual information
Our formula for mutual information was:

(12) I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}

meaning that it is the same as measuring how far a joint distribution (p(x, y)) is from independence (p(x)p(y)):

(13) I(X; Y) = D(p(x, y) || p(x) p(y))
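A brief sketch (Python, using a small hypothetical 2x2 joint distribution rather than the slide data) illustrating D(p||q) and the identity I(X; Y) = D(p(x, y) || p(x)p(y)); all names here are illustrative:

```python
import math

# A hypothetical 2x2 joint distribution (not from the slides), just for illustration
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_x = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}
p_y = {y: sum(p for (_, b), p in joint.items() if b == y) for y in (0, 1)}

def kl(p, q):
    """D(p||q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

# Mutual information as divergence of the joint from the product of the marginals
independent = {(x, y): p_x[x] * p_y[y] for x in p_x for y in p_y}
print(kl(joint, independent))  # ~0.278 bits for this particular distribution
print(kl(joint, joint))        # 0: D(p||p) = 0
```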
The noisy channel model
The idea of information comes from Claude Shannon's description of the noisy channel model. Many natural language tasks can be viewed as follows:
- There is an output, which we can observe.
- We want to guess what the input was ... but it has been "corrupted" along the way.
Example: machine translation from English to French:
- Assume that the true input is in French.
- But all we can see is the garbled (English) output.
- To what extent can we recover the "original" French?

The noisy channel model theoretically
Some questions behind information theory:
- How much loss of information can we prevent when we are attempting to compress the data? I.e., how redundant does the data need to be? And what is the theoretical maximum amount of compression? (This is entropy.)
- How fast can data be transmitted perfectly? A channel has a specific capacity (defined by mutual information); see the sketch below.
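A final sketch (Python, not from the slides): channel capacity is the maximum of I(X; Y) over input distributions. The binary symmetric channel with flip probability 0.1 is an assumed example, and the grid search is only an approximation of that maximum:

```python
import math

# An assumed binary symmetric channel that flips each bit with probability 0.1
flip = 0.1

def h2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def mutual_information(px1):
    """I(X;Y) = H(Y) - H(Y|X) when P(X=1) = px1 and the noise is symmetric."""
    py1 = px1 * (1 - flip) + (1 - px1) * flip
    return h2(py1) - h2(flip)

# Capacity = max over input distributions of I(X;Y), approximated by a grid search
capacity = max(mutual_information(i / 1000) for i in range(1001))
print(capacity)  # ~0.531 bits per channel use, i.e., 1 - H(0.1)
```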