10-704: Information Processing and Learning                                          Spring 2012
Lecture 6: Source coding, AEP and Typicality
Lecturer: Aarti Singh                                          Scribes: Mark McCartney

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor.

6.1 Data compression/source coding

An i.i.d. source p(x) emits symbols x1, ..., xn, which a compressor maps to roughly nH(X) bits.

Source code: A source code C is a mapping from the range of a random variable (or a set of random variables) to finite-length strings of symbols from a D-ary alphabet.

The expected length of a source code, denoted L(C), is given by

    L(C) = Σ_{x ∈ X} p(x) l(x)

where l(x) is the length of the codeword c(x) for a symbol x ∈ X, and p(x) is the probability of the symbol.

Intuitively, a good code should preserve the information content of an outcome. Since information content depends on the probability of the outcome (it is higher if the probability is lower, or equivalently if the outcome is very uncertain), a good code will use fewer bits to encode a certain or high-probability outcome and more bits to encode a low-probability outcome. Thus, we expect that the smallest expected code length should be related to the average uncertainty of the random variable, i.e. the entropy. We will show that entropy is the fundamental limit of data compression; i.e.,

    ∀C : L(C) ≥ H(X)          (source coding theorem)

Instead of encoding individual symbols, we can also encode blocks of symbols together. A length-n block code encodes length-n strings of symbols together and is denoted by C(x1, ..., xn) =: C(x^n). First, we will show that there exists a length-n block code (an impractical one) with expected code length per symbol that is arbitrarily close to entropy as n → ∞. This argument is due to Shannon as presented in his seminal paper (available at http://cm.bell-labs.com/cm/ms/what/shannonday/paper.html). But first we need to introduce some new concepts.

6.2 Asymptotic equipartition property (AEP) & Typical Sets

Theorem 6.1 (AEP) If X1, ..., Xn ~ iid p(x), then

    -(1/n) log p(X1, ..., Xn) → H(X)   in probability.

The AEP is essentially the information-theoretic version of the Law of Large Numbers. The proof follows directly from the weak law of large numbers, applied to the i.i.d. random variables -log p(Xi), whose mean is H(X).
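As a quick sanity check of Theorem 6.1, the following Python sketch (not part of the original notes; the three-symbol distribution and the values of n are arbitrary choices) draws i.i.d. sequences and verifies that -(1/n) log p(x^n) concentrates around H(X) as n grows.

    import math
    import random

    # Hypothetical three-symbol source; any finite distribution would do.
    p = {"a": 0.5, "b": 0.25, "c": 0.25}
    H = -sum(q * math.log2(q) for q in p.values())   # entropy H(X) = 1.5 bits here

    def empirical_rate(xn):
        """-(1/n) log2 p(x1, ..., xn); by the AEP this concentrates at H(X)."""
        return -sum(math.log2(p[x]) for x in xn) / len(xn)

    random.seed(0)
    symbols, weights = zip(*p.items())
    for n in [10, 100, 1000, 10000]:
        xn = random.choices(symbols, weights=weights, k=n)   # i.i.d. draw of length n
        print(f"n = {n:5d}   -(1/n) log2 p(x^n) = {empirical_rate(xn):.3f}   H(X) = {H:.3f}")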
We will categorize length-n sequences of symbols into two types: typical and atypical. First, we discuss an intuitive notion of a typical sequence. If the probabilities of the symbols are p1, p2, ... and sequences are drawn i.i.d., then we expect symbol i to occur about n × pi times in a typical sequence x^n. Thus, for a typical sequence,

    p(x^n) = p(x1) ... p(xn) ≈ p1^{p1 n} p2^{p2 n} p3^{p3 n} ... p_{|X|}^{p_{|X|} n}

Thus, the Shannon information content of a typical sequence is

    log2 (1/p(x^n)) = -log2 ( p1^{p1 n} p2^{p2 n} ... ) = -n Σ_i pi log pi = nH(X).

Another way of expressing this: p(x^n) ≈ 2^{-nH(X)}.

We now formally state the definition of a typical sequence. The typical set, denoted A_ε^{(n)}, w.r.t. p(x) is the set of sequences x^n ∈ X^n that satisfy

    2^{-n(H(X)+ε)} ≤ p(x1, ..., xn) ≤ 2^{-n(H(X)-ε)}

or, equivalently,

    A_ε^{(n)} = { x^n ∈ X^n : p(x^n) ∈ [2^{-n(H(X)+ε)}, 2^{-n(H(X)-ε)}] }.

Properties of the typical set:

1. If (x1, ..., xn) ∈ A_ε^{(n)}, then H(X) - ε ≤ -(1/n) log p(x1, ..., xn) ≤ H(X) + ε.
2. Pr(A_ε^{(n)}) > 1 - ε for n sufficiently large.
3. |A_ε^{(n)}| ≤ 2^{n(H(X)+ε)}.
4. |A_ε^{(n)}| ≥ (1 - ε) 2^{n(H(X)-ε)} for n sufficiently large.

Proof of property 1: Follows from the definition.

Proof of property 2: By the AEP, convergence in probability implies that for any ε > 0 there exists n0 such that for all n ≥ n0, Pr(A_ε^{(n)}) > 1 - ε.

Proof of property 3:

    1 = Σ_{x^n} p(x^n) ≥ Σ_{x^n ∈ A_ε^{(n)}} p(x^n) ≥ 2^{-n(H(X)+ε)} |A_ε^{(n)}|.

Proof of property 4: By property 2, for n sufficiently large,

    1 - ε < Pr(A_ε^{(n)}) = Σ_{x^n ∈ A_ε^{(n)}} p(x^n) ≤ 2^{-n(H(X)-ε)} |A_ε^{(n)}|.

6.3 Using AEP for Data Compression

Given length-n i.i.d. sequences drawn from p(x), we encode a sequence x^n as follows. We use one bit to indicate whether or not the sequence is typical, and then use either the index of the sequence in the typical set (if it is typical) or the index of the sequence in the entire set of length-n sequences (if the sequence is atypical).

If x^n ∈ A_ε^{(n)}, we encode the sequence with the bit 0 followed by the index into the typical set. This requires at most 1 + ⌈n(H + ε)⌉ bits: the last term is log2 of the size of the typical set, plus one bit if log2 |A_ε^{(n)}| is not an integer.

If x^n ∉ A_ε^{(n)}, we encode the sequence with the bit 1 followed by the index of the sequence in the set of all length-n sequences. This requires 1 + ⌈n log |X|⌉ bits, since the set of all length-n sequences has size |X|^n. This can be improved, but suffices to show the result.

Notice that the above code is one-to-one and uniquely decodable. Also notice that typical sequences are encoded using many fewer bits than atypical ones if the entropy of the source H is small. (Recall that entropy is at most log2 |X|, attained when the source distribution is uniform.)

Expected length of a codeword:

    E[l(X^n)] = Σ_{x^n} p(x^n) l(x^n)
              ≤ Σ_{x^n ∈ A_ε^{(n)}} p(x^n) (2 + n(H + ε)) + Σ_{x^n ∉ A_ε^{(n)}} p(x^n) (2 + n log |X|)
              ≤ 2 + n(H + ε) + n log |X| (1 - Pr(A_ε^{(n)}))
              ≤ 2 + n(H + ε) + n ε log |X|          (for n sufficiently large)
              =: n(H + ε')

where ε' = ε + ε log |X| + 2/n.

Thus, the code we constructed has expected length per symbol arbitrarily close to entropy for large n.

Theorem 6.2 Let X1, ..., Xn ~ iid p(x) and let ε > 0. There exists a code mapping sequences of length n into binary strings such that the mapping is one-to-one and

    E[ l(X^n) / n ] ≤ H(X) + ε'

for n sufficiently large.

Thus, it is possible to compress a length-n sequence x^n = (x1, ..., xn) using about nH(X) bits on average. While this establishes that the fundamental limit of data compression (entropy) is achievable for i.i.d. sources, the code we have used is impractical since it requires maintaining lookup tables at the encoder and decoder that are exponentially large (i.e., of size ≈ 2^{nH(X)}).

6.3.1 High probability sets and typical sets

We might wonder whether the typical set is special, or whether we can come up with another high-probability set of smaller size, so that we can compress sequences in that high-probability set using even fewer bits.

Let B_δ^{(n)} ⊂ X^n be the smallest set such that Pr(B_δ^{(n)}) ≥ 1 - δ. This is the ideal set we would like to use for encoding; however, we use the typical set since we can bound its size more easily. In fact, the theorem below shows that using the typical set isn't suboptimal, since all other sets with probability > 1 - δ have about the same size.

Remark: Notice that there are sequences which belong to B_δ^{(n)} but are not typical. For example, consider a Bernoulli(0.9) source. The all-1s sequence is the most likely (highest-probability) sequence and is in B_δ^{(n)}; however, it is not a typical sequence, since a typical sequence has about 0.9 as its proportion of 1s. However, such sequences contribute negligible probability (a small numerical illustration is given below).
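To make the remark concrete, here is a small Python sketch (not part of the original notes; the choices n = 1000 and ε = 0.1 are arbitrary) for the Bernoulli(0.9) source. It checks that the single most likely sequence, all 1s, falls outside A_ε^{(n)} and has negligible probability, while the typical set carries almost all of the probability and has about 2^{nH} elements.

    import math
    from math import comb

    p1, n, eps = 0.9, 1000, 0.1
    H = -(p1 * math.log2(p1) + (1 - p1) * math.log2(1 - p1))   # H(X) ≈ 0.469 bits

    def rate(k):
        """-(1/n) log2 p(x^n) for a binary sequence with k ones and n - k zeros."""
        return -(k * math.log2(p1) + (n - k) * math.log2(1 - p1)) / n

    # The all-ones sequence: rate(n) = -log2(0.9) ≈ 0.152, far below H - eps,
    # so it is not typical, and its total probability 0.9^n is negligible.
    print("rate of all-ones sequence:", round(rate(n), 3), "   H(X) =", round(H, 3))
    print("probability of all-ones sequence:", p1 ** n)

    # For an i.i.d. binary source, typicality depends only on the number of ones k,
    # so the probability and size of A_eps^(n) can be summed over k.
    typical_k = [k for k in range(n + 1) if H - eps <= rate(k) <= H + eps]
    prob_typical = sum(comb(n, k) * p1 ** k * (1 - p1) ** (n - k) for k in typical_k)
    size_typical = sum(comb(n, k) for k in typical_k)

    print("Pr(A_eps^(n)) =", round(prob_typical, 4))                    # close to 1
    print("(1/n) log2 |A_eps^(n)| =", round(math.log2(size_typical) / n, 3))
    # The last value lies between H - eps and H + eps, as properties 3 and 4 require.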
Theorem 6.3 Let X1, ..., Xn ~ iid p(x), let C_δ^{(n)} ⊂ X^n, and let δ < 1/2 and δ' > 0. If Pr(C_δ^{(n)}) > 1 - δ, then (1/n) log |C_δ^{(n)}| > H - δ' for n sufficiently large.

The theorem implies that the typical set is of the same order of size as any high-probability set: |A_ε^{(n)}| ≈ |C_δ^{(n)}| ≈ 2^{nH}. More specifically,

    lim_{n→∞} (1/n) log ( |C_δ^{(n)}| / |A_ε^{(n)}| ) = 0.

Proof: Let ε < 1/2.

1. First, we argue that the set C_δ^{(n)} has high overlap with the typical set:

    Pr(A_ε^{(n)} ∩ C_δ^{(n)}) = 1 - Pr(A_ε^{(n)c} ∪ C_δ^{(n)c})
                              ≥ 1 - Pr(A_ε^{(n)c}) - Pr(C_δ^{(n)c})
                              ≥ 1 - ε - δ.

2. Now we lower bound the size of C_δ^{(n)}. Since every typical sequence has probability at most 2^{-n(H-ε)},

    Pr(A_ε^{(n)} ∩ C_δ^{(n)}) = Σ_{x^n ∈ A_ε^{(n)} ∩ C_δ^{(n)}} p(x^n) ≤ 2^{-n(H-ε)} |C_δ^{(n)}|.

Combining 1 and 2, we have

    |C_δ^{(n)}| ≥ (1 - ε - δ) 2^{n(H-ε)},

or equivalently,

    (1/n) log |C_δ^{(n)}| ≥ H - ε + (1/n) log(1 - ε - δ) ≥ H - δ'

for n sufficiently large.

6.4 Symbol coding

Now we'll discuss some more practical coding schemes. First, let's note that a lossless coding scheme needs to satisfy the following requirement.

Non-singular code: x1 ≠ x2 ⟹ C(x1) ≠ C(x2).

Since we usually have to send a stream of symbols and not a single symbol, we also need the following notions.

Extension of a code: C(x1, ..., xn) = C(x1) ... C(xn) (concatenation of the codewords).

Unique decodability: The extension is non-singular, i.e. x_i^n ≠ x_j^m ⟹ C(x_i^n) ≠ C(x_j^m).

Moreover, unique decodability is not enough if you have to wait until the end of the message to decode. For example, assume C(X) is the code shown in Table 6.1.

    x    C(x)
    1    10
    2    00
    3    11
    4    110

Table 6.1: A code which might change early symbols after decoding later symbols

The message 110000 decodes to 322, while the message 1100000 decodes to 422, changing the first decoded symbol two symbols after it was received. The class of codes which does not have this problem is known as prefix, instantaneous, or self-punctuating codes. For an example of a prefix code, see Table 6.2.

    x    C(x)
    1    0
    2    10
    3    110
    4    111

Table 6.2: An example of a prefix code
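To see the difference between the two codes in practice, here is a minimal Python sketch (not part of the original notes) that implements the prefix code of Table 6.2. Because no codeword is a prefix of another, the decoder can emit each symbol as soon as the last bit of its codeword arrives, with no lookahead; with the code of Table 6.1 this fails, since after reading 11 the decoder cannot yet tell whether it has seen symbol 3 or the start of symbol 4's codeword 110.

    # The prefix code of Table 6.2; keys are source symbols, values are codewords.
    code = {"1": "0", "2": "10", "3": "110", "4": "111"}
    decode_map = {cw: s for s, cw in code.items()}

    def encode(symbols):
        """The extension of the code: concatenate the codewords."""
        return "".join(code[s] for s in symbols)

    def decode(bits):
        """Instantaneous decoding: output a symbol the moment a codeword completes."""
        out, buf = [], ""
        for b in bits:
            buf += b
            if buf in decode_map:        # never ambiguous for a prefix code
                out.append(decode_map[buf])
                buf = ""
        return out

    msg = ["3", "2", "2", "1"]
    bits = encode(msg)                   # "11010100"
    print(bits, "->", decode(bits))      # 11010100 -> ['3', '2', '2', '1']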