
University of Cologne
Institute for Theoretical Physics
Information Theory and Statistical Physics
Lecturer: Prof. Johannes Berg
Exercises: Prasanna Bhogale and Chau Nguyen
Sheet 3
A core problem of information theory is source coding, or data
compression. In this sheet we explore some concepts of data
compression, and you will see that the information entropy
plays a central role.
A data string (x_1, x_2, ..., x_N) is a stream of N symbols, each drawn independently from the same random source X taking m values (base m); that is, a data string is an outcome of a random sequence of N i.i.d. draws (X_1, X_2, ..., X_N) of X. Usually, physical devices work with a different base b and a different letter set B; for example, most digital computers work with two-state devices (base b = 2 with letters B = {0, 1}). Data compression deals with how to write (encode, or compress) a data string of base m efficiently in the device alphabet B of base b; the result will be called the encoded string.
Figure 1: Tree representation of a prefix code (b = 2). Each branch of the extended tree (including the dashed branch) corresponds to one letter, 1 upward and 0 downward. A codeword of a prefix code terminates a branch of the tree at some node (red), which then becomes a leaf of the restricted tree. A codeword is the series of letters on the branches from the root to the leaf.

We begin with the question of how to encode a symbol, that is, how to write a single data symbol as a unique string, usually of several letters in B. Formally, this is an injective mapping c : A_X → B^+, where B^+ is the set of all strings of letters of B of arbitrary length, that is, B^+ = ∪_{k=1}^∞ B^k. For x ∈ A_X, c(x) is called a codeword, and the length of the codeword is denoted by l(x). The average length of a code c is then L(c) = Σ_{x ∈ A_X} p_X(x) l(x). We want to make L(c) as small as possible.

Note that a data string is a continuous stream of symbols; therefore the encoded string is also a continuous stream of letters. When reading an encoded string, we need to know where a codeword ends in order to decode the string. Prefix codes, or self-punctuating codes, in which no codeword is a prefix of any other codeword, make the end of a codeword easy to determine as the encoded string is read sequentially. A prefix code is usually represented as a tree whose leaves are the codewords (Figure 1).
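As a quick illustration (not part of the exercise sheet), the following Python sketch defines a small binary prefix code for a hypothetical three-symbol source; the probabilities and codewords are invented for illustration. It computes the average length L(c) and encodes a short data string by concatenating codewords.

    # Hypothetical example: a binary prefix code (b = 2) for a three-symbol source.
    # Probabilities and codewords are invented for illustration.
    p = {"a": 0.5, "b": 0.25, "c": 0.25}      # p_X(x)
    code = {"a": "0", "b": "10", "c": "11"}   # c(x); no codeword is a prefix of another

    # Average code length L(c) = sum_x p_X(x) * l(x)
    L = sum(p[x] * len(code[x]) for x in p)
    print("L(c) =", L)                        # 1.5 letters per symbol

    # Encoding a data string is simply the concatenation of codewords
    data = "abca"
    encoded = "".join(code[x] for x in data)
    print(encoded)                            # 010110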
1 Kraft inequality
For any self-punctuating code over an alphabet of size b, the codeword lengths l(x), where x ∈ A_X, must satisfy the inequality
\[
\sum_{x \in A_X} b^{-l(x)} \leq 1 . \tag{1}
\]
Conversely, given a set of codeword lengths that satisfy this inequality, there exists a prefix code
with these codeword lengths.
To prove this theorem, we will make use of the tree representation of prefix codes.
(a) (10 pts) Let l_max = max_{x ∈ A_X} l(x), which is the distance from the root to the farthest leaf. Pick a nearer leaf c(x), which is at level l(x) ≤ l_max. If this leaf continued to branch down to further levels, how many leaves would it have at level l_max?

(b) (10 pts) Now imagine that all the nearer leaves grow in this way and finally all reach level l_max. What is the total number of leaves now? Note that this total number of leaves is at most b^{l_max}, and thereby prove the Kraft inequality.
Conversely, given a set of codeword lengths {l_1, l_2, ..., l_m} with l_1 ≤ l_2 ≤ ... ≤ l_m, we can construct a prefix code in the following way: on the extended (infinite) tree, pick a node at level l_1 and erase all of its descendants; then go to level l_2, pick one of the remaining nodes and erase all of its descendants; and so on. If the code lengths satisfy the Kraft inequality, there will always be nodes left to pick all the way to l_m. This is a verbal proof of the converse part of the theorem.
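As an aside (not part of the sheet), this construction is easy to automate. The Python sketch below is one possible implementation, written for this note: it checks the Kraft inequality, then assigns codewords greedily in order of increasing length, which mirrors the tree argument of always picking the next free node and discarding its descendants.

    def prefix_code_from_lengths(lengths, b=2):
        """Construct a b-ary prefix code with the given codeword lengths,
        assuming they satisfy the Kraft inequality (checked below).
        Returns the codewords as strings over the digits 0..b-1,
        in the same order as `lengths`."""
        assert sum(b ** (-l) for l in lengths) <= 1 + 1e-12, "Kraft inequality violated"
        order = sorted(range(len(lengths)), key=lambda i: lengths[i])
        codewords = [None] * len(lengths)
        value, prev_len = 0, lengths[order[0]]
        for k, i in enumerate(order):
            l = lengths[i]
            if k > 0:
                # Next free node: step past the previous codeword and all
                # of its descendants by shifting to the new level.
                value = (value + 1) * b ** (l - prev_len)
            prev_len = l
            # Write `value` as an l-digit base-b string.
            digits, v = [], value
            for _ in range(l):
                digits.append(str(v % b))
                v //= b
            codewords[i] = "".join(reversed(digits))
        return codewords

    # Invented example: lengths {1, 2, 3, 3} in base 2.
    print(prefix_code_from_lengths([1, 2, 3, 3]))   # ['0', '10', '110', '111']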
2 Optimal codes
The expected length L(c) of any self-punctuating code c of base b for a random variable X is greater than or equal to the entropy H_b(X) (calculated in base b),
\[
L(c) \geq H_b(X), \tag{2}
\]
with equality iff b^{-l(x)} = p_X(x) for all x ∈ A_X.
(30 pts) Prove this theorem, using the hint below.

Hint: Let s = Σ_{x ∈ A_X} b^{-l(x)}. We introduce a distribution q(x) = b^{-l(x)}/s. Show that we can write
\[
L(c) - H_b(X) = \mathrm{KL}(p_X \,\|\, q) + \log_b \frac{1}{s}, \tag{3}
\]
where KL stands for Kullback-Leibler divergence (which is non-negative). From this equality,
proving the theorem is straightforward.
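For intuition (not a substitute for the proof), the identity (3) can be checked numerically. The Python sketch below uses an invented distribution and an invented set of codeword lengths; all numbers are illustrative.

    import math

    # Invented example: base b = 2, a distribution p_X and codeword lengths l(x).
    b = 2
    p = {"a": 0.5, "b": 0.3, "c": 0.2}
    l = {"a": 1, "b": 2, "c": 3}              # chosen to satisfy the Kraft inequality

    s = sum(b ** (-l[x]) for x in p)          # s = sum_x b^{-l(x)}
    q = {x: b ** (-l[x]) / s for x in p}      # the auxiliary distribution q(x)

    L  = sum(p[x] * l[x] for x in p)                        # average length L(c)
    H  = -sum(p[x] * math.log(p[x], b) for x in p)          # entropy H_b(X)
    KL = sum(p[x] * math.log(p[x] / q[x], b) for x in p)    # KL(p_X || q) >= 0

    # Both sides of identity (3) agree, and L - H >= 0 since KL >= 0 and s <= 1.
    print(L - H)                    # ~0.2145
    print(KL + math.log(1 / s, b))  # same value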
3 Shannon code
From the previous problem, we know that the inequality L(c) ≥ H_b(X) becomes an equality if l(x) = log_b(1/p_X(x)). However, the values log_b(1/p_X(x)) are usually not integers. The so-called Shannon code tries to get close to this equality by rounding these values up to the next integer, l(x) = ⌈log_b(1/p_X(x))⌉.
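As a quick illustration (using an invented four-symbol distribution, not the source of exercise (c) below), the Shannon code lengths can be computed directly:

    import math

    # Invented example distribution (base b = 2); not the source used in exercise (c).
    b = 2
    p = {"w": 0.4, "x": 0.3, "y": 0.2, "z": 0.1}

    # Shannon code lengths l(x) = ceil(log_b(1/p_X(x)))
    lengths = {s: math.ceil(math.log(1 / p[s], b)) for s in p}
    print(lengths)                                    # {'w': 2, 'x': 2, 'y': 3, 'z': 4}

    # The Kraft sum for these lengths:
    print(sum(b ** (-l) for l in lengths.values()))   # 0.6875 <= 1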
(a) (10 pts) Prove that the set of code lengths l(x) = ⌈log_b(1/p_X(x))⌉ satisfies the Kraft inequality. This means it is possible to construct a prefix code with such a set of code lengths.
(b) (10 pts) Denoting by L_S the average code length of a Shannon code, we know that L_S ≥ H_b(X). Prove that L_S ≤ H_b(X) + 1.
(c) (30 pts) As an example, consider a coding problem where the source X has base 3, A_X = {a, b, c}, with corresponding probabilities p_X = {1/2, 1/3, 1/6}. The alphabet (of the device) is binary, B = {0, 1}. Calculate the codeword lengths of the Shannon code for this source. Using the construction described in exercise 1, give a possible code for this set of code lengths. Using your code, give the encoded string for the data string abaaabbc. Think about what you would do to decode the encoded string.
With the Shannon code, it seems that we have come close to the optimal code, but it turns out that rounding log_b(1/p_X(x)) up to integer values loses too much, and the Shannon code may not be the optimal one! In 1952, D. A. Huffman, while he was a PhD student, described a simple code, now called the Huffman code, which can be proven to be optimal.
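For the curious, here is a compact Python sketch of Huffman's construction for the binary case, written for this note rather than taken from the lecture: repeatedly merge the two least probable groups of symbols and read the codewords off the merges.

    import heapq

    def huffman_code(p):
        """Binary Huffman code for a dict p of symbol -> probability.
        Returns a dict symbol -> codeword (a string of '0'/'1')."""
        # Heap entries: (probability, tie-breaker, {symbol: partial codeword}).
        heap = [(prob, i, {sym: ""}) for i, (sym, prob) in enumerate(p.items())]
        heapq.heapify(heap)
        counter = len(heap)
        while len(heap) > 1:
            p1, _, c1 = heapq.heappop(heap)   # the two least probable groups
            p2, _, c2 = heapq.heappop(heap)
            # Prepend a distinguishing letter to every codeword in each group.
            merged = {s: "0" + w for s, w in c1.items()}
            merged.update({s: "1" + w for s, w in c2.items()})
            heapq.heappush(heap, (p1 + p2, counter, merged))
            counter += 1
        return heap[0][2]

    # Invented example distribution; compare the average length with H_2(X) ~ 1.485.
    p = {"a": 0.5, "b": 0.3, "c": 0.2}
    code = huffman_code(p)
    print(code)                                 # e.g. {'a': '0', 'c': '10', 'b': '11'}
    print(sum(p[s] * len(code[s]) for s in p))  # average length L_H = 1.5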
Still, the story does not end here. Although we know that H_b(X) ≤ L_H ≤ L_S ≤ H_b(X) + 1, if H_b(X) is small (for example much smaller than 1), the average code length L_H of the Huffman code may still be too far from the ideal value H_b(X). The way to avoid this is to group the symbols of the data string into blocks, which serve as "super symbols", and to encode those super symbols. We can of course use our optimal Huffman code for the super symbols, but this is computationally complicated; possible simple solutions are the arithmetic code and the Lempel-Ziv code (which is the basis of our zip compressors).
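To see why blocking helps, here is a small illustrative Python check (it uses Shannon code lengths on the blocks rather than a Huffman code, and an invented, strongly skewed source): the average code length per symbol decreases toward H_2(X) as the block length k grows, because the rounding overhead of at most one letter is shared among k symbols.

    import itertools, math

    # Invented, strongly skewed binary source: H_2(X) is well below 1 bit.
    p = {"a": 0.95, "b": 0.05}
    H = -sum(q * math.log2(q) for q in p.values())

    for k in (1, 2, 4, 8):
        # Treat each block of k i.i.d. symbols as one "super symbol" and use
        # Shannon code lengths ceil(log2(1/prob)) for the super symbols.
        L_block = 0.0
        for block in itertools.product(p, repeat=k):
            prob = math.prod(p[s] for s in block)
            L_block += prob * math.ceil(math.log2(1 / prob))
        print(f"k = {k}: {L_block / k:.3f} letters/symbol, entropy H = {H:.3f}")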
As the blocks get larger and larger, the average code length per symbol of the arithmetic code and of the Lempel-Ziv code asymptotically approaches the entropy H_b(X), as the Huffman and Shannon codes do. Moreover, universal properties emerge in the limit where the block length tends to infinity; in particular, the equipartition principle holds, which is the basis of Shannon lossy data compression, as we will discuss (or have discussed) in the lectures.