
10-704: Information Processing and Learning
Spring 2012
Lecture 6: Source coding, AEP and Typicality
Lecturer: Aarti Singh
Scribes: Mark McCartney
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications.
They may be distributed outside this class only with the permission of the Instructor.
6.1 Data compression/source coding
[Figure: an i.i.d. source p(x) emits x_1, ..., x_n, which a compressor maps to roughly nH(X) bits.]
Source code: A source code C is a mapping from the range of a random variable, or a set of random variables, to finite-length strings of symbols from a D-ary alphabet.
The expected length of a source code C, denoted L(C), is
$$L(C) = \sum_{x \in \mathcal{X}} p(x)\, l(x)$$
where l(x) is the length of the codeword c(x) for a symbol $x \in \mathcal{X}$, and p(x) is the probability of the symbol.
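To make the definition concrete, here is a minimal Python sketch (the four-symbol distribution and codewords are made-up illustrative values, not from the lecture) that computes L(C) and, for comparison, the entropy H(X):

import math

# Hypothetical 4-symbol source and a binary code for it (illustrative values only).
p = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
code = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}

# Expected code length L(C) = sum_x p(x) * l(x).
L = sum(p[x] * len(code[x]) for x in p)

# Entropy H(X) = -sum_x p(x) * log2 p(x).
H = -sum(px * math.log2(px) for px in p.values())

print(L, H)  # this code is matched to p, so L(C) = H(X) = 1.75 bits

Here the codeword lengths equal $\log_2 \frac{1}{p(x)}$, so the expected length coincides with the entropy.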
Intuitively, a good code should preserve the information content of an outcome. Since information content depends on the probability of the outcome (it is higher if the probability is lower, or equivalently if the outcome is very uncertain), a good code will use fewer bits to encode a certain or high-probability outcome and more bits to encode a low-probability outcome. Thus, we expect that the smallest achievable expected code length should be related to the average uncertainty of the random variable, i.e., the entropy.
We will show that entropy is the fundamental limit of data compression, i.e., $\forall C : L(C) \geq H(X)$ (the source coding theorem).
Instead of encoding individual symbols, we can also encode blocks of symbols together. A length-n block code encodes strings of n symbols together and is denoted by $C(x_1, \ldots, x_n) =: C(x^n)$.
First, we will show that there exists a length-n block code (an impractical one) with expected code length per symbol that is arbitrarily close to the entropy as n → ∞. This argument is due to Shannon as presented in his seminal paper (available at http://cm.bell-labs.com/cm/ms/what/shannonday/paper.html).
But first we need to introduce some new concepts.
6.2 Asymptotic equipartition property (AEP) & Typical Sets
Theorem 6.1 (AEP) If $X_1, \ldots, X_n \stackrel{iid}{\sim} p(x)$, then
$$-\frac{1}{n} \log p(X_1, \ldots, X_n) \to H(X) \quad \text{in probability.}$$
The AEP is essentially the information-theoretic version of the Law of Large Numbers. The proof follows directly from the weak law of large numbers: since the $X_i$ are i.i.d., $-\frac{1}{n} \log p(X_1, \ldots, X_n) = -\frac{1}{n} \sum_{i=1}^n \log p(X_i)$ is a sample average of i.i.d. terms with mean $E[-\log p(X)] = H(X)$.
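As a numerical illustration (a minimal Python sketch; the three-symbol distribution and sample sizes are arbitrary choices), we can evaluate $-\frac{1}{n} \log_2 p(X_1, \ldots, X_n)$ on a single draw for increasing n and watch it approach H(X):

import math
import random

# Hypothetical source distribution (illustrative values only).
symbols = [0, 1, 2]
probs = [0.5, 0.3, 0.2]
H = -sum(q * math.log2(q) for q in probs)

random.seed(0)
for n in [10, 100, 1000, 10000]:
    xs = random.choices(symbols, weights=probs, k=n)
    # For an i.i.d. source, -1/n log2 p(x_1,...,x_n) = -1/n * sum_i log2 p(x_i).
    val = -sum(math.log2(probs[x]) for x in xs) / n
    print(n, round(val, 3), "vs H(X) =", round(H, 3))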
We will categorize n-length sequences of symbols into two types: typical and atypical. First, we discuss an intuitive notion of a typical sequence. If the probabilities of the symbols are given as $p_1, p_2, \ldots$ and sequences are drawn i.i.d., then we expect symbol 1 to occur about $n \times p_1$ times in a typical sequence $x^n$. Thus, for a typical sequence
$$p(x^{(n)}) = p(x_1) \cdots p(x_n) \approx p_1^{p_1 n}\, p_2^{p_2 n}\, p_3^{p_3 n} \cdots p_{|\mathcal{X}|}^{p_{|\mathcal{X}|} n}.$$
Thus, the Shannon information content of a typical sequence is
$$\log_2 \frac{1}{p(x^{(n)})} \approx -\log_2\left(p_1^{p_1 n}\, p_2^{p_2 n} \cdots\right) = -n \sum_i p_i \log_2 p_i = n H(X).$$
Another way of expressing this:
$$p(x^{(n)}) \approx 2^{-nH(X)}.$$
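As a quick sanity check, consider a Bernoulli(0.9) source with n = 10, so that $H(X) \approx 0.469$ bits. A sequence with the typical composition of nine 1s and one 0 has probability $0.9^9 \times 0.1 \approx 0.0387$, which matches $2^{-nH(X)} = 2^{-4.69} \approx 0.0387$.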
We now formally state the definition of a typical sequence. The typical set, denoted $A_\varepsilon^{(n)}$, w.r.t. p(x) is the set of sequences $x^n \in \mathcal{X}^n$ that satisfy
$$2^{-n(H(X)+\varepsilon)} \leq p(x_1, \ldots, x_n) \leq 2^{-n(H(X)-\varepsilon)}.$$
Or, equivalently,
$$A_\varepsilon^{(n)} = \left\{ x^n \in \mathcal{X}^n : p(x^n) \in [2^{-n(H(X)+\varepsilon)}, 2^{-n(H(X)-\varepsilon)}] \right\}.$$
Properties of the typical set:

1. If $(x_1, \ldots, x_n) \in A_\varepsilon^{(n)}$, then $H(X) - \varepsilon \leq -\frac{1}{n} \log p(x_1, \ldots, x_n) \leq H(X) + \varepsilon$.
2. $\Pr\left(A_\varepsilon^{(n)}\right) > 1 - \varepsilon$ for n sufficiently large.
3. $|A_\varepsilon^{(n)}| \leq 2^{n(H(X)+\varepsilon)}$.
4. $|A_\varepsilon^{(n)}| \geq (1 - \varepsilon)\, 2^{n(H(X)-\varepsilon)}$ for n sufficiently large.
Proof of property 1: Follows from the definition.

Proof of property 2: By the AEP, convergence in probability implies that for any $\varepsilon > 0$ there exists $n_0$ such that for all $n \geq n_0$, $\Pr\left(A_\varepsilon^{(n)}\right) > 1 - \varepsilon$.
Proof of property 3:
$$1 = \sum_{x^n} p(x^n) \geq \sum_{x^n \in A_\varepsilon^{(n)}} p(x^n) \geq 2^{-n(H(X)+\varepsilon)}\, |A_\varepsilon^{(n)}|.$$
Proof of property 4: By property 2, for sufficiently large n we have
$$1 - \varepsilon < \Pr\left(A_\varepsilon^{(n)}\right) = \sum_{x^n \in A_\varepsilon^{(n)}} p(x^n) \leq 2^{-n(H(X)-\varepsilon)}\, |A_\varepsilon^{(n)}|.$$
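The properties above are easy to check numerically for a small source. The following Python sketch (the Bernoulli(0.3) parameter, ε = 0.1, and n = 12 are arbitrary illustrative choices) enumerates all length-n binary sequences, forms $A_\varepsilon^{(n)}$, and reports its probability and size against the bounds in properties 2-4:

import math
from itertools import product

p1, eps, n = 0.3, 0.1, 12  # hypothetical Bernoulli(p1) source
H = -(p1 * math.log2(p1) + (1 - p1) * math.log2(1 - p1))

def prob(seq):
    k = sum(seq)  # number of 1s in the sequence
    return p1**k * (1 - p1)**(n - k)

# A sequence is typical if 2^{-n(H+eps)} <= p(x^n) <= 2^{-n(H-eps)}.
typical = [s for s in product([0, 1], repeat=n)
           if 2**(-n * (H + eps)) <= prob(s) <= 2**(-n * (H - eps))]

print("Pr(A) =", sum(prob(s) for s in typical))  # property 2 (holds once n is large enough)
print("|A|   =", len(typical))
print("upper bound 2^{n(H+eps)} =", 2**(n * (H + eps)))  # property 3
print("lower bound (1-eps) 2^{n(H-eps)} =", (1 - eps) * 2**(n * (H - eps)))  # property 4 (n large enough)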
6.3 Using AEP for Data Compression
Given length-n i.i.d. sequences drawn from p(x), we encode a sequence $x^n$ as follows. We use one bit to indicate whether or not the sequence is typical, and then use either the index of the sequence in the typical set (if it is typical) or the index of the sequence in the entire set of length-n sequences (if the sequence is atypical).
If $x^n \in A_\varepsilon^{(n)}$, we encode the sequence with the bit 0 followed by the index of the sequence in the typical set. This requires at most $1 + \lceil n(H + \varepsilon) \rceil$ bits; the last term is $\log_2$ of the size of the typical set, plus one bit if $\log_2 |A_\varepsilon^{(n)}|$ is not an integer.
If $x^n \notin A_\varepsilon^{(n)}$, we encode the sequence with the bit 1 followed by the index of the sequence in the set of all length-n sequences. This requires $1 + \lceil n \log |\mathcal{X}| \rceil$ bits, since the set of all length-n sequences has size $|\mathcal{X}|^n$. This can be improved, but suffices to show the result.
Notice that the above code is one-to-one and uniquely decodable. Also notice that typical sequences are encoded using far fewer bits than atypical ones if the entropy of the source H is small. (Recall that entropy is at most $\log_2 |\mathcal{X}|$, attained when the source distribution is uniform.)
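Here is a minimal Python sketch of this scheme (assuming a binary alphabet and a block length small enough that the typical set can be enumerated explicitly; the parameter values and helper names are invented for illustration):

import math
from itertools import product

p1, eps, n = 0.3, 0.1, 12  # hypothetical Bernoulli(p1) source and block length
H = -(p1 * math.log2(p1) + (1 - p1) * math.log2(1 - p1))

def prob(seq):
    k = sum(seq)
    return p1**k * (1 - p1)**(n - k)

all_seqs = list(product([0, 1], repeat=n))
typical = [s for s in all_seqs
           if 2**(-n * (H + eps)) <= prob(s) <= 2**(-n * (H - eps))]

def encode(xn):
    # Flag bit 0 + index into the typical set, or flag bit 1 + index into all of X^n.
    if xn in typical:
        width = math.ceil(math.log2(len(typical)))
        return '0' + format(typical.index(xn), '0%db' % width)
    return '1' + format(all_seqs.index(xn), '0%db' % n)  # |X|^n = 2^n, so n bits suffice

def decode(bits):
    table = typical if bits[0] == '0' else all_seqs
    return table[int(bits[1:], 2)]

For any length-n tuple xs, decode(encode(xs)) recovers xs; typical sequences get codewords of roughly $1 + n(H + \varepsilon)$ bits, while atypical ones need $1 + n$ bits, exactly as in the expected-length calculation below.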
Expected length of a codeword:
$$
\begin{aligned}
E[l(X^n)] = \sum_{x^n} p(x^n)\, l(x^n)
&\leq \sum_{x^n \in A_\varepsilon^{(n)}} p(x^n)\,(2 + n(H + \varepsilon)) + \sum_{x^n \notin A_\varepsilon^{(n)}} p(x^n)\,(2 + n \log |\mathcal{X}|) \\
&\leq 2 + n(H + \varepsilon) + n \log |\mathcal{X}|\,(1 - \Pr(A_\varepsilon^{(n)})) \\
&\leq 2 + n(H + \varepsilon) + n \log |\mathcal{X}|\,\varepsilon \quad \text{(for $n$ sufficiently large)} \\
&=: n(H + \varepsilon')
\end{aligned}
$$
where $\varepsilon' = \varepsilon + \varepsilon \log |\mathcal{X}| + \frac{2}{n}$.
Thus, the code we constructed has expected length per symbol arbitrarily close to entropy for large n.
Theorem 6.2 Let $X_1, \ldots, X_n \stackrel{iid}{\sim} p(x)$ and let $\varepsilon > 0$. There exists a code which maps sequences of length n into binary strings such that the mapping is one-to-one and
$$E\left[\frac{l(X^n)}{n}\right] \leq H(X) + \varepsilon'$$
for n sufficiently large.
Thus, it is possible to compress a length-n sequence $x^n = (x_1, \ldots, x_n)$ using about $nH(X)$ bits on average. While this establishes that the fundamental limit of data compression (entropy) is achievable for i.i.d. sources, the code we have used is impractical, since it requires maintaining lookup tables at the encoder and decoder that are exponentially large (i.e., of size $\approx 2^{nH(X)}$).
6.3.1 High probability sets and typical sets
We might wonder whether the typical set is special, or whether we can come up with another high-probability set of smaller size, so that we can compress sequences in that set using even fewer bits.
Let $B_\delta^{(n)} \subset \mathcal{X}^n$ be the smallest set such that $\Pr\left(B_\delta^{(n)}\right) \geq 1 - \delta$. This is the ideal set we would like to use for encoding; however, we use the typical set since we can bound the size of the typical set more easily. And in fact the following theorem shows that using the typical set isn't suboptimal, since all other sets with probability $> 1 - \delta$ have about the same size.
Remark: Notice that there are sequences which belong to $B_\delta^{(n)}$ but are not typical. For example, consider a Bernoulli(0.9) source. The all-1s sequence is the most likely (highest probability) sequence and is in $B_\delta^{(n)}$; however, it is not a typical sequence, since a typical sequence has about 0.9 as its proportion of 1s. Such sequences contribute negligible probability, however.
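This can also be seen numerically. The following Python sketch (a small illustrative example with Bernoulli(0.9), δ = ε = 0.1, and n = 12; the choices are arbitrary) builds the smallest set covering probability 1 − δ by greedily taking the most likely sequences and compares it to the typical set:

import math
from itertools import product

p1, delta, eps, n = 0.9, 0.1, 0.1, 12  # hypothetical parameters
H = -(p1 * math.log2(p1) + (1 - p1) * math.log2(1 - p1))

def prob(seq):
    k = sum(seq)
    return p1**k * (1 - p1)**(n - k)

seqs = sorted(product([0, 1], repeat=n), key=prob, reverse=True)

# Smallest set B with Pr(B) >= 1 - delta: add sequences in order of decreasing probability.
B, total = [], 0.0
for s in seqs:
    if total >= 1 - delta:
        break
    B.append(s)
    total += prob(s)

typical = [s for s in seqs
           if 2**(-n * (H + eps)) <= prob(s) <= 2**(-n * (H - eps))]

all_ones = tuple([1] * n)
print("|B| =", len(B), " |A_eps| =", len(typical))
print("all-ones sequence: in B?", all_ones in B, " typical?", all_ones in typical)

With these numbers the all-ones sequence is the single most likely sequence and lands in B, but its probability ($0.9^{12} \approx 0.28$) is far above $2^{-n(H-\varepsilon)}$, so it is not typical, matching the remark above.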
Theorem 6.3 Let $X_1, \ldots, X_n \stackrel{iid}{\sim} p(x)$. Let $C_\delta^{(n)} \subset \mathcal{X}^n$, let $\delta < \frac{1}{2}$, and let $\delta' > 0$. If $\Pr\left(C_\delta^{(n)}\right) > 1 - \delta$, then $\frac{1}{n} \log |C_\delta^{(n)}| > H - \delta'$ for n sufficiently large.

The theorem implies that the size of the typical set is of the same order as that of any high-probability set: $|A_\varepsilon^{(n)}| \approx |C_\delta^{(n)}| \approx 2^{nH}$. More specifically, $\lim_{n \to \infty} \frac{1}{n} \log \frac{|C_\delta^{(n)}|}{|A_\delta^{(n)}|} = 0$.
Proof: Let $\varepsilon < \frac{1}{2}$.

1. First, we argue that the set $C_\delta^{(n)}$ has high overlap with the typical set:
$$\Pr\left(A_\varepsilon^{(n)} \cap C_\delta^{(n)}\right) = 1 - \Pr\left(A_\varepsilon^{(n)c} \cup C_\delta^{(n)c}\right) \geq 1 - \Pr\left(A_\varepsilon^{(n)c}\right) - \Pr\left(C_\delta^{(n)c}\right) \geq 1 - \varepsilon - \delta.$$

2. Now we lower bound the size of $C_\delta^{(n)}$:
$$\Pr\left(A_\varepsilon^{(n)} \cap C_\delta^{(n)}\right) = \sum_{x^n \in A_\varepsilon^{(n)} \cap C_\delta^{(n)}} p(x^n) \leq 2^{-n(H-\varepsilon)}\, |C_\delta^{(n)}|.$$

Combining 1 and 2, we have
$$|C_\delta^{(n)}| \geq (1 - \varepsilon - \delta)\, 2^{n(H-\varepsilon)},$$
or equivalently,
$$\frac{1}{n} \log |C_\delta^{(n)}| \geq H - \varepsilon + \frac{1}{n} \log(1 - \varepsilon - \delta) \geq H - \delta'$$
for n sufficiently large.
6.4 Symbol coding
Now we'll discuss some more practical coding schemes. First, let's note that a lossless coding scheme needs to satisfy the following requirement:

Non-singular code: $x_1 \neq x_2 \implies C(x_1) \neq C(x_2)$
Since we usually have to send a stream of symbols and not a single symbol, we need the following:

Extension of a code: $C(x_1, \ldots, x_n) = C(x_1) \cdots C(x_n)$

Unique decodability: The extension is non-singular, i.e., $x_i^n \neq x_j^m \implies C(x_i^n) \neq C(x_j^m)$
Moreover, unique decodability is not enough, since you may have to wait until the end of the message to decode. For example, assume C(X) is the code shown in Table 6.1.
x    C(x)
1    10
2    00
3    11
4    110

Table 6.1: A code which might change early symbols after decoding later symbols
The message 110000 decodes to 322, while the message 1100000 decodes to 422, so the decoded value of the first symbol changes two symbols after it has been received. The class of codes which does not have this problem is known as prefix, instantaneous, or self-punctuating codes. For an example of a prefix code, see Table 6.2.
x    C(x)
1    0
2    10
3    110
4    111

Table 6.2: An example of a prefix code
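Because no codeword in Table 6.2 is a prefix of another, a decoder can emit each symbol as soon as its codeword is complete, with no look-ahead. A minimal Python sketch of such a streaming decoder (illustrative only):

# Prefix code from Table 6.2.
code = {'1': '0', '2': '10', '3': '110', '4': '111'}
inv = {cw: sym for sym, cw in code.items()}

def decode(bits):
    out, buf = [], ''
    for b in bits:
        buf += b
        if buf in inv:        # a complete codeword: emit the symbol immediately
            out.append(inv[buf])
            buf = ''
    return ''.join(out)

print(decode('0101100111'))   # '0' + '10' + '110' + '0' + '111' -> '12314'

By contrast, with the code of Table 6.1 the decoder cannot safely emit the first symbol of 110000 until it knows whether another 0 follows.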