Introduction to Information Theory
L645: Advanced NLP, Autumn 2009

Information Theory

Information theory answers two fundamental questions in communication theory:
• What is the ultimate data compression?
• What is the ultimate transmission rate of communication?

It is helpful in many areas: computer science (Kolmogorov complexity), thermodynamics, economics, computational linguistics.

Background reading: T. Cover, J. Thomas (2006) Elements of Information Theory. Wiley.

Information & Uncertainty

The idea of information comes from Claude Shannon's description of the noisy channel model, which is used to model many processes. We are going to talk about information as a decrease in uncertainty:
• We have some uncertainty about a process, e.g., which symbol (A, B, C) will be generated by a device
• We learn a piece of information, e.g., that the previous symbol was B
• How uncertain are we now about which symbol will appear?
  – The more informative our piece of information is, the less uncertain we will be.

The noisy channel model

Many natural language tasks can be viewed as follows:
• There is a certain output, which we can see
• We want to guess the input ... but it has been corrupted along the way

Example: machine translation from English to French
• Assume that the true input is in French
• But all we can see is the garbled (English) output
• To what extent can we recover the "original" French?

The noisy channel model theoretically

Some questions behind information theory:
• How much loss of information can we prevent when we are attempting to compress the data?
  – i.e., how redundant does the data need to be?
  – And what is the theoretical maximum amount of compression? (We'll see that this is entropy.)
• How fast can data be transmitted perfectly?
  → a channel has a specific capacity (defined by mutual information)

Entropy

Entropy is the way we measure how informative something (a random variable) is:

(1) H(p) = H(X) = − Σ_{x∈X} p(x) log2 p(x)

What does this formula mean? How do we get it?

Motivating Entropy

Let's assume that we have a device that emits one symbol (A). We have no uncertainty as to what we will see: uncertainty is zero.

A device that emits two symbols (A, B): we have one choice, either A or B, so our uncertainty is one.
• We could use one bit (0 or 1) to encode the outcome.

A device that emits four symbols (A, B, C, D): if we made a decision tree, there would be two levels (starting with "Is it A/B or C/D?"), so uncertainty is two.
• We would need two bits (00, 01, 10, 11) to encode the outcome.

What we are describing is a logarithm (base 2): log2 M, where M is the number of symbols.

Adding probabilities

But log2 M assumes that every choice (A, B, C, D) is equally likely → this is not generally true.

So, instead we look at − log2 p(x), where x is the given symbol, to tell us how surprising something is.
• If every choice x is equally likely, then p(x) = 1/M (and thus also M = 1/p(x))
• And then log2 M = log2 (1/p(x)) = − log2 p(x)
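To make the step from log2 M to − log2 p(x) concrete, here is a minimal Python sketch (an illustrative addition, not part of the original slides; the function name surprisal is mine): for M equally likely symbols the surprisal of each symbol is exactly log2 M bits, and rarer symbols are more surprising.

    import math

    def surprisal(p: float) -> float:
        """Surprisal (self-information) in bits of an outcome with probability p."""
        return -math.log2(p)

    # For M equally likely symbols, each outcome has probability 1/M,
    # so its surprisal is exactly log2 M bits.
    for m in (2, 4, 8):
        print(m, surprisal(1 / m))   # 2 -> 1.0, 4 -> 2.0, 8 -> 3.0

    # With unequal probabilities, rarer symbols are more surprising:
    print(surprisal(1 / 4), surprisal(1 / 8))   # 2.0 bits vs. 3.0 bits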
Average surprisal

But − log2 p(x) just tells us how surprising one particular symbol is. On average, though, how surprising is our random variable? That is where the summation (a weighted average) comes in:

(2) H(X) = − Σ_{x∈X} p(x) log2 p(x) = E(log2 (1/p(X)))

Sum over all possible outcomes, multiplying the surprisal (− log2 p(x)) by the probability of seeing that surprising outcome (p(x)).
• The amount of surprisal is the amount of information we need in order to not be surprised.
  – H(X) = 0 if the outcome is certain
  – H(X) = 1 if there are two outcomes and both are equally likely

Entropy example

Roll an 8-sided die (or pick a character from an alphabet of 8 characters).
• Because each outcome is equally likely, the entropy is:

(3) H(X) = − Σ_{i=1}^{8} p(i) log2 p(i) = − Σ_{i=1}^{8} (1/8) log2 (1/8) = log2 8 = 3

In other words, you need 3 bits to encode this (000, 001, 010, 011, ...), i.e., the number of bits needed to describe an 8-character language:

  char: A   E   I   O   U   F   G   H
  code: 000 001 010 011 100 101 110 111

• How many bits do we need if the entropy is 2.3?

Simplified Polynesian

• Simplified Polynesian has an alphabet L of 6 characters:

  char: p   t   k   a   i   u
  prob: 1/8 1/4 1/8 1/4 1/8 1/8

• Entropy: H(X) = − Σ_{i∈L} p(i) log2 p(i) = −[4 × (1/8) log2 (1/8) + 2 × (1/4) log2 (1/4)] = 2.5

Simplified Polynesian (2)

• With a fixed-length code, we again need 3 bits:

  char: p   t   k   a   i   u
  code: 000 001 010 011 100 101

• BUT: since the distribution is NOT uniform, we can design a better code:

  char: t  a  p   k   i   u
  code: 00 01 100 101 110 111

  – 0 as first digit: 2-digit character
  – 1 as first digit: 3-digit character
• More likely characters get shorter codes
• Code the following word: "KATUPATI"
• How many bits on average do we need?

Joint entropy

For two random variables X and Y, the joint entropy is the amount of information needed to specify both values, on average:

(4) H(X, Y) = − Σ_{x∈X} Σ_{y∈Y} p(x, y) log2 p(x, y)

In other words, we can talk about how two values are related and influence each other
• e.g., the average surprisal at seeing two POS tags next to each other

Conditional entropy

Conditional entropy is a similar idea: how much extra information is needed to find Y's value, given that we know the value of X.

(5) H(Y|X) = Σ_{x∈X} p(x) H(Y|X = x)
           = Σ_{x∈X} p(x) [− Σ_{y∈Y} p(y|x) log2 p(y|x)]
           = − Σ_{x∈X} Σ_{y∈Y} p(x) p(y|x) log2 p(y|x)
           = − Σ_{x∈X} Σ_{y∈Y} p(x, y) log2 p(y|x)

Chain Rule

And then we have the chain rule for entropy (resembling the chain rule of probability):

  H(X, Y) = H(X) + H(Y|X)
  H(X1, ..., Xn) = H(X1) + H(X2|X1) + H(X3|X1, X2) + ... + H(Xn|X1, ..., Xn−1)

Syllables in Simplified Polynesian

Our model of simplified Polynesian earlier was too simple; joint entropy will help us with a better model.
• Probabilities for letters on a per-syllable basis, using the consonant C and the vowel V as separate random variables
  – i.e., probabilities for consonants followed by a vowel (P(C, ·)) and for vowels preceded by a consonant (P(·, V)):

  consonant: p   t   k      vowel: a   i   u
  prob:      1/8 3/4 1/8    prob:  1/2 1/4 1/4

• On a per-letter basis, this would be the following (which we are not concerned about here):

  p    t   k    a   i   u
  1/16 3/8 1/16 1/4 1/8 1/8

Polynesian Syllables

More specifically, for CV structures the joint probability P(C, V) is represented as:

           p     t     k     | P(V)
  a        1/16  3/8   1/16  | 1/2
  i        1/16  3/16  0     | 1/4
  u        0     3/16  1/16  | 1/4
  ---------------------------+-----
  P(C)     1/8   3/4   1/8   |

Syllables in Simplified Polynesian (2)

We want to find H(C, V), in other words, how surprised we are on average to see a particular syllable structure:

(6) H(C, V) = H(C) + H(V|C) ≈ 1.061 + 1.375 ≈ 2.44

(7) a. H(C) = − Σ_{c∈C} p(c) log2 p(c) ≈ 1.061
    b. H(V|C) = − Σ_{c∈C} Σ_{v∈V} p(c, v) log2 p(v|c) = 1.375
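Before working these values out by hand below, here is a small Python sketch (an addition, not part of the original slides; the dictionary name joint is mine) that recomputes H(C), H(V|C), and H(C, V) directly from the joint table above, reproducing the numbers in (6) and (7).

    import math

    # Joint distribution P(C, V) for Simplified Polynesian syllables,
    # copied from the table above (consonants p, t, k; vowels a, i, u).
    joint = {
        ('p', 'a'): 1/16, ('t', 'a'): 3/8,  ('k', 'a'): 1/16,
        ('p', 'i'): 1/16, ('t', 'i'): 3/16, ('k', 'i'): 0,
        ('p', 'u'): 0,    ('t', 'u'): 3/16, ('k', 'u'): 1/16,
    }

    # Marginal P(C): sum the joint probabilities over the vowels.
    p_c = {}
    for (c, v), p in joint.items():
        p_c[c] = p_c.get(c, 0) + p

    # H(C) = -sum_c p(c) log2 p(c)
    h_c = -sum(p * math.log2(p) for p in p_c.values() if p > 0)

    # H(V|C) = -sum_{c,v} p(c,v) log2 p(v|c), with the convention 0 log 0 = 0
    h_v_given_c = -sum(p * math.log2(p / p_c[c])
                       for (c, v), p in joint.items() if p > 0)

    print(round(h_c, 3), round(h_v_given_c, 3), round(h_c + h_v_given_c, 3))
    # ~1.061, 1.375, ~2.436 -- matching (6) and (7) above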
For the calculation of H(V|C):
• The P(C, ·) and P(·, V) values from before are the marginal probabilities
• We can calculate the probability p(v|c) from the chart on the previous page
  – e.g., P(C = t, V = a) = 3/8 and P(C = t) = 3/4
  – e.g., p(V = a | C = p) = 1/2, because 1/16 is half of 1/8

Exercise: Verify this result by using H(C, V) = H(V) + H(C|V)

Polynesian Syllables (2)

  H(C) = − Σ_{i∈C} p(i) log2 p(i)
       = −[2 × (1/8) log2 (1/8) + (3/4) log2 (3/4)]
       = 2 × (1/8) log2 8 + (3/4)(log2 4 − log2 3)
       = 2 × (1/8) × 3 + (3/4)(2 − log2 3)
       = 3/4 + 6/4 − (3/4) log2 3
       = 9/4 − (3/4) log2 3 ≈ 1.061

Polynesian Syllables (3)

Taking 0 log2 0 = 0:

  H(V|C) = − Σ_{x∈C} Σ_{y∈V} p(x, y) log2 p(y|x)
         = −[1/16 log2 (1/2) + 3/8 log2 (1/2) + 1/16 log2 (1/2)
             + 1/16 log2 (1/2) + 3/16 log2 (1/4) + 0 log2 0
             + 0 log2 0 + 3/16 log2 (1/4) + 1/16 log2 (1/2)]
         = 1/16 log2 2 + 3/8 log2 2 + 1/16 log2 2
             + 1/16 log2 2 + 3/16 log2 4
             + 3/16 log2 4 + 1/16 log2 2
         = 11/8

Pointwise mutual information

Mutual information, I(X; Y), tells us how related two different random variables are, i.e., will knowing one tell you a lot about the other?
• It is the amount of information that one random variable contains about another, or the reduction in uncertainty of one random variable based on knowledge of the other.

Let's start by looking at pointwise mutual information, the mutual information for two points (not two distributions), e.g., a collocation of two words:

(8) I(x; y) = log2 (p(x, y) / (p(x) p(y)))

We compare the probability of x and y occurring together vs. the probability of x and the probability of y occurring independently.

Exercise: Calculate the pointwise mutual information of C = p and V = i

Mutual information

Mutual information makes this comparison for the random variables themselves, i.e., on average, what is the common information between X and Y?

(9) I(X; Y) = Σ_{x,y} p(x, y) log2 (p(x, y) / (p(x) p(y)))

Because of how we defined H(X, Y), namely as H(X) + H(Y|X) = H(Y) + H(X|Y), we have the following equivalent formulas:

(10) I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)
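As a companion to (8)-(10), here is a short Python sketch (an addition, not from the original slides; the function names are mine) that computes pointwise mutual information for a single pair and mutual information for a whole joint distribution. The commented-out lines show how it could be applied to the joint dictionary from the earlier sketch to check the two exercises; the answers are deliberately not printed here.

    import math

    def pmi(p_xy, p_x, p_y):
        """Pointwise mutual information of one outcome pair, as in (8)."""
        return math.log2(p_xy / (p_x * p_y))

    def mutual_information(joint):
        """I(X;Y) = sum_{x,y} p(x,y) log2 p(x,y)/(p(x)p(y)), as in (9).
        `joint` maps (x, y) pairs to probabilities."""
        p_x, p_y = {}, {}
        for (x, y), p in joint.items():
            p_x[x] = p_x.get(x, 0) + p
            p_y[y] = p_y.get(y, 0) + p
        return sum(p * math.log2(p / (p_x[x] * p_y[y]))
                   for (x, y), p in joint.items() if p > 0)

    # With the Simplified Polynesian `joint` table from the earlier sketch:
    # print(pmi(joint[('p', 'i')], 1/8, 1/4))   # pointwise MI of C=p, V=i
    # print(mutual_information(joint))          # I(C;V)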
Mutual information (2)

Recall that H(X|Y) is the information needed to specify X when Y is known.

(11) I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)

This formula means: take the information needed to specify X and subtract out the information still needed when Y is known; in other words, we are left with the information shared by X and Y.
• When X and Y are independent, I(X; Y) = 0
  – Note that log2 1 = 0, and that this happens when p(x, y) = p(x) p(y)
• When X and Y are highly dependent, I(X; Y) is very high

Exercise: Calculate I(C; V) for Simplified Polynesian

Noisy Channel Model (revisited)

Recall the noisy channel model, developed with data compression and transmission in mind (its basis: communication over a telephone line).
• Entropy = the optimal amount of compression
• Capacity C = the rate at which information can be transmitted:

  C = max_{p(X)} I(X; Y)

Relative entropy

If we know that one distribution is (more) true, but another distribution is perhaps easier to calculate, we might want to see how far off the two distributions are.

Relative entropy, or Kullback-Leibler (KL) divergence, provides such a measure (for distributions p and q):

(12) D(p||q) = Σ_{x∈X} p(x) log2 (p(x) / q(x))

Informally, this is the distance between p and q.

Notes on relative entropy

• Often used as a distance measure in machine learning
• D(p||q) is always nonnegative
  – D(p||q) = 0 if p = q
  – D(p||q) = ∞ if there is an x ∈ X such that p(x) > 0 and q(x) = 0
• Not symmetric: D(p||q) ≠ D(q||p)
• It is the average number of bits wasted by using distribution q instead of the correct p

Divergence as mutual information

Our formula for mutual information was:

(13) I(X; Y) = Σ_{x,y} p(x, y) log2 (p(x, y) / (p(x) p(y)))

meaning that it is the same as measuring how far a joint distribution (p(x, y)) is from independence (p(x) p(y)):

(14) I(X; Y) = D(p(x, y) || p(x) p(y))

Cross entropy

Let's say we have a probability model p_M (for language model M) covering some data we have seen.
• How close is this model to the true model p?

The cross entropy between a random variable X with true distribution p and a model of this, p_M, is:

(15) H(X, p_M) = − Σ_x p(x) log2 p_M(x)

Cross entropy as a measure of language models

The following inequality is true:

(16) − Σ_x p(x) log2 p(x) ≤ − Σ_x p(x) log2 q(x) = H(X, q)

This means that the lower the cross entropy for some model q, the closer it is to reality.
• If we can construct a series of models with decreasing cross entropy, we approach (and maybe even reach) the correct model
• Good language models tend to minimize cross entropy (and thus perplexity = 2^(cross entropy))

The Entropy of English

Depending on our language model, we have different (cross) entropy measures:
• Uniform model (27 characters, equiprobable): log2 27 = 4.76 bits per character
• Taking account of letter frequency: 4.03
• Taking account of bigrams: 2.8
• Human experiment: 1.34

The true entropy of English is lower than any of these.
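To tie (12)-(16) together, here is one more Python sketch (an addition, not part of the original slides; the two toy distributions are made up purely for illustration). It shows that D(p||q) is zero only when the model matches the true distribution, and that the cross entropy H(X, q) = H(p) + D(p||q) never drops below the true entropy, which is the sense in which lower cross entropy (and lower perplexity) means a better language model.

    import math

    def entropy(p):
        """H(p) = -sum_x p(x) log2 p(x)"""
        return -sum(px * math.log2(px) for px in p.values() if px > 0)

    def kl_divergence(p, q):
        """D(p||q) = sum_x p(x) log2 p(x)/q(x); infinite if q(x)=0 where p(x)>0."""
        total = 0.0
        for x, px in p.items():
            if px == 0:
                continue
            if q.get(x, 0) == 0:
                return math.inf
            total += px * math.log2(px / q[x])
        return total

    def cross_entropy(p, q):
        """H(X, q) = -sum_x p(x) log2 q(x) = H(p) + D(p||q), as in (16)."""
        return entropy(p) + kl_divergence(p, q)

    # Toy example: a "true" character distribution p and two models of it.
    p = {'a': 0.5, 'b': 0.25, 'c': 0.25}
    q_good = {'a': 0.5, 'b': 0.25, 'c': 0.25}   # matches p exactly
    q_poor = {'a': 1/3, 'b': 1/3, 'c': 1/3}     # uniform model

    print(entropy(p))                     # 1.5 bits: the best any model can do
    print(cross_entropy(p, q_good))       # 1.5 bits, since D(p||q_good) = 0
    print(cross_entropy(p, q_poor))       # ~1.585 bits = log2 3 (0.085 bits wasted)
    print(2 ** cross_entropy(p, q_poor))  # perplexity of the uniform model: ~3.0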