
Huffman Coding


David A. Huffman (1951)
Huffman coding uses frequencies of symbols in a string to build a variable rate prefix code




Each symbol is mapped to a binary string
More frequent symbols have shorter codes
No code is a prefix of another
Example (prefix code shown as a binary tree, with 0 on left edges and 1 on right edges):
A = 0
B = 100
C = 101
D = 11
Variable Rate Codes

Example:
1) A → 00; B → 01; C → 10; D → 11;
2) A → 0; B → 100; C → 101; D → 11;

Two different encodings of AABDDCAA

1) 0000011111100000 (16 bits)
2) 00100111110100 (14 bits)
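
The gap is easy to check in code. Below is a minimal Python sketch (the table names and the encode helper are illustrative, not part of the slides) that encodes AABDDCAA with both tables and reports the bit counts.

# Two prefix codes for the alphabet {A, B, C, D}.
fixed_rate = {"A": "00", "B": "01", "C": "10", "D": "11"}
variable_rate = {"A": "0", "B": "100", "C": "101", "D": "11"}

def encode(text, table):
    # Concatenate the code word of every symbol in the text.
    return "".join(table[symbol] for symbol in text)

text = "AABDDCAA"
for name, table in [("fixed", fixed_rate), ("variable", variable_rate)]:
    bits = encode(text, table)
    print(name, bits, len(bits), "bits")
# fixed 0000011111100000 16 bits
# variable 00100111110100 14 bits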
Cost of Huffman Trees


Let A = {a1, a2, .., am} be the alphabet in which each symbol ai has probability pi.
We can define the cost of the Huffman tree HT as
C(HT) = ∑i=1..m pi·ri,
where ri is the length of the path from the root to ai.
The cost C(HT) is the expected length (in bits) of a code word represented by the tree HT. The value of C(HT) is called the bit rate of the code.
Cost of Huffman Trees - example

Example:

Let a1=A, p1=1/2; a2=B, p2=1/8; a3=C, p3=1/8; a4=D, p4=1/4
where r1=1, r2=3, r3=3, and r4=2
(Figure: tree HT with A at depth 1, D at depth 2, and B, C at depth 3; left edges are labelled 0, right edges 1.)
C(HT) = 1·1/2 + 3·1/8 + 3·1/8 + 2·1/4 = 1.75
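
As a quick check, the bit rate can be computed directly from the probabilities and the leaf depths; a short Python sketch (the variable names are illustrative):

probabilities = {"A": 1/2, "B": 1/8, "C": 1/8, "D": 1/4}
depths = {"A": 1, "B": 3, "C": 3, "D": 2}   # path lengths ri in the tree HT
cost = sum(probabilities[s] * depths[s] for s in probabilities)
print(cost)   # 1.75 bits per symbol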
Huffman Tree Property
Input: Given probabilities p1, p2, .., pm for symbols a1, a2, .., am from alphabet A
Output: A tree that minimizes the average number of bits (bit rate) needed to code a symbol from A
I.e., the goal is to minimize the function
C(HT) = ∑i pi·ri,
where ri is the length of the path from the root to leaf ai.
This is called a Huffman tree, or Huffman code, for alphabet A.

Construction of Huffman Trees



Form a (tree) node for each symbol ai with weight pi
Insert all nodes into a priority queue PQ (e.g., a heap) ordered by the nodes' probabilities
while (the priority queue has more than one node)
    min1 ← remove-min(PQ); min2 ← remove-min(PQ);
    create a new (tree) node T;
    T.weight ← min1.weight + min2.weight;
    T.left ← min1; T.right ← min2;
    insert(PQ, T)
return (last node in PQ)
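
The same procedure can be written in a few lines of Python with the standard heapq module as the priority queue. This is only a sketch of the pseudocode above (the tie-breaking counter and the codes helper are added for illustration); with equal weights the tree it builds may differ from the one on the following slides, but its bit rate is the same.

import heapq

def huffman_tree(probs):
    # probs maps symbol -> probability; returns the root of the Huffman tree.
    # Heap entries are (weight, tie-breaker, node); a node is a symbol or a (left, right) pair.
    pq = [(p, i, sym) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(pq)
    counter = len(pq)
    while len(pq) > 1:
        w1, _, n1 = heapq.heappop(pq)          # remove the two minima
        w2, _, n2 = heapq.heappop(pq)
        heapq.heappush(pq, (w1 + w2, counter, (n1, n2)))
        counter += 1
    return pq[0][2]

def codes(node, prefix=""):
    # Read the code words off the tree: left edge = 0, right edge = 1.
    if isinstance(node, tuple):
        left, right = node
        return {**codes(left, prefix + "0"), **codes(right, prefix + "1")}
    return {node: prefix or "0"}

print(codes(huffman_tree({"A": 0.4, "B": 0.1, "C": 0.3, "D": 0.1, "E": 0.1})))
# e.g. {'A': '0', 'C': '10', 'E': '110', 'B': '1110', 'D': '1111'} - expected length 2.1 bits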
Construction of Huffman Trees
P(A)= 0.4, P(B)= 0.1, P(C)= 0.3, P(D)= 0.1, P(E)= 0.1
(Figure: the five leaves D, E, B (weight 0.1 each), C (0.3) and A (0.4). Step 1: the two minima, D and E, are removed from the queue and merged into a node of weight 0.2, with D on the 0-edge and E on the 1-edge.)
Construction of Huffman Trees
(Figure, continued. Step 2: the two minima are now B (0.1) and the (D, E) node (0.2); they are merged into a node of weight 0.3, with B on the 0-edge and the (D, E) node on the 1-edge. C (0.3) and A (0.4) remain in the queue.)
Construction of Huffman Trees
(Figure, continued. Step 3: the two minima are C (0.3) and the (B, (D, E)) node (0.3); they are merged into a node of weight 0.6, with the (B, (D, E)) node on the 0-edge and C on the 1-edge. A (0.4) remains.)
Construction of Huffman Trees
(Figure, continued. Step 4: the last two nodes, A (0.4) and the node of weight 0.6, are merged into the root, with A on the 0-edge and the 0.6 node on the 1-edge. The queue now holds a single node: the complete Huffman tree.)
Construction of Huffman Trees
(Figure: the final Huffman tree.) Reading the edge labels from the root to each leaf gives the code words:
A = 0
B = 100
C = 11
D = 1010
E = 1011
Huffman Codes



Theorem: For any source S the Huffman code can be computed efficiently in time O(n·log n), where n is the size of the source S.
Proof: The time complexity of the Huffman coding algorithm is dominated by the priority-queue operations, each of which takes O(log n) time.
One can also prove that Huffman coding creates the most efficient set of prefix codes for a given text.
It is also one of the most efficient entropy coders.
Basics of Information Theory



The entropy of an information source (string) S built over alphabet A = {a1, a2, .., am} is defined as
H(S) = ∑i pi·log2(1/pi),
where pi is the probability that symbol ai occurs in S.
log2(1/pi) indicates the amount of information contained in ai, i.e., the number of bits needed to code ai.
For example, in an image with a uniform distribution of gray-level intensities, i.e., all pi = 1/256, the number of bits needed to encode each gray level is 8, so the entropy of this image is 8 bits.
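
A direct translation of the definition into Python (a small sketch; the function name is illustrative):

from math import log2

def entropy(probs):
    # H(S) = sum over all symbols of p * log2(1/p), ignoring zero-probability symbols.
    return sum(p * log2(1.0 / p) for p in probs if p > 0)

print(entropy([1 / 256] * 256))   # 8.0 - the gray-level image example above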
Huffman Code vs. Entropy
P(A)= 0.4, P(B)= 0.1, P(C)= 0.3, P(D)= 0.1, P(E)= 0.1

Entropy:
0.4 · log2(10/4) + 0.1 · log2(10) + 0.3 · log2(10/3) + 0.1 · log2(10) + 0.1 · log2(10) = 2.05 bits per symbol
Huffman code:
0.4 · 1 + 0.1 · 3 + 0.3 · 2 + 0.1 · 4 + 0.1 · 4 = 2.10 bits per symbol
Not bad, not bad at all.
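
Both numbers are easy to reproduce; a quick Python check (the code-word lengths are read off the tree constructed earlier):

from math import log2

probs = {"A": 0.4, "B": 0.1, "C": 0.3, "D": 0.1, "E": 0.1}
lengths = {"A": 1, "B": 3, "C": 2, "D": 4, "E": 4}   # |code word| for each symbol

entropy = sum(p * log2(1.0 / p) for p in probs.values())
bit_rate = sum(probs[s] * lengths[s] for s in probs)
print(round(entropy, 2), round(bit_rate, 2))   # 2.05 2.1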
Lempel-Ziv-Welch Compression



The Lempel-Ziv-Welch (LZW) compression algorithm is an example of the dictionary-based methods, in which longer fragments of the input text are replaced by much shorter references to code words stored in a special set called the dictionary.
LZW is an implementation of a lossless data compression algorithm developed by Abraham Lempel and Jacob Ziv. It was published by Terry Welch in 1984 as an improved version of the LZ78 dictionary coding algorithm developed by Lempel and Ziv.
LZW Compression



The key insight of the method is that it is possible to
automatically build a dictionary of previously seen strings
in the text being compressed.
The dictionary starts off with 256 entries, one for each
possible character (single byte string).
Every time a string not already in the dictionary is seen, a
longer string consisting of that string appended with the
single character following it in the text, is stored in the
dictionary.
LZW Compression



The output consists of integer indices into the dictionary.
These initially are 9 bits each, and as the dictionary
grows, can increase to up to 16 bits.
A special symbol is reserved for "flush the dictionary", which takes the dictionary back to the original 256 entries and 9-bit indices. This is useful when compressing a text whose characteristics vary, since a dictionary of early material is not of much use later in the text.
This use of progressively increasing index sizes is one of Welch's contributions. Another was to specify an efficient data structure to store the dictionary.
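
For instance, the width of the next emitted index can be derived from the current dictionary size. A tiny sketch under the assumptions stated above (a 9-bit start and a 16-bit cap; real LZW variants differ in the exact details):

def index_width(dictionary_size):
    # Bits needed to address every current dictionary entry, clamped to 9..16.
    width = max(9, (dictionary_size - 1).bit_length())
    return min(width, 16)

print(index_width(256))   # 9  - the initial 256-entry dictionary
print(index_width(513))   # 10 - the dictionary has outgrown 9-bit indices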
LZW Compression - example

Fibonacci language: w0 =a, w1=b, wi = wi-1·wi-2 for i>1

For example, w6 = babbababbabba

We show how LZW compresses babbababbabba
(Figure: LZW parses babbababbabba phrase by phrase - b, a, b, ba, bab, babb, a - emitting one dictionary index per phrase and adding the code words CW0-CW5 to the dictionary as it goes; the part of the last code word that reaches past the end of the text is its "virtual part".)
In general: CWi = CWj ∘ First(CWi+1), where j < i.
In particular: CW4 = CW3 ∘ First(CW5).
LZW Compression - example








cw-2 = b
cw-1 = a
cw0 = ba
cw1 = ab
cw2 = bb
cw3 = bab
cw4 = babb
cw5 = babba
(Figure: the dictionary stored as a trie; each code word cw-2 .. cw5 labels the node reached by spelling out its string edge by edge.)
LZW Compression - compression stage
cw ← empty string;
while ( read next symbol s from IN )
    if cw·s exists in the dictionary then
        cw ← cw·s;
    else
        add cw·s to the dictionary;
        save cw in OUT;
        cw ← s;
save cw in OUT;
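
A runnable Python version of the same loop (a sketch under common assumptions: the dictionary is seeded with the 256 single-character strings, and the output is left as a list of integer indices rather than packed into 9-16-bit fields):

def lzw_compress(text):
    # LZW compression: return the list of dictionary indices for the input string.
    dictionary = {chr(i): i for i in range(256)}    # initial single-character entries
    cw, output = "", []
    for s in text:
        if cw + s in dictionary:
            cw = cw + s                             # keep extending the current match
        else:
            dictionary[cw + s] = len(dictionary)    # add cw·s as a new code word
            output.append(dictionary[cw])           # save cw in OUT
            cw = s
    if cw:
        output.append(dictionary[cw])               # flush the final match
    return output

print(lzw_compress("babbababbabba"))
# [98, 97, 98, 256, 259, 260, 97] - seven indices for the thirteen input characters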
Decompression stage



Input IN – compressed file of integers.
Output OUT – decompressed file of characters.
|IN| = Z – size of the compressed file.
Copy all numbers from file IN to vector V[256 .. Z+255]
Create vector F[256 .. Z+255] containing the first character of each code word
Create vector CW[256 .. Z+255] of all code words:
for i = 256 to Z+255 do
    if V[i] < 256 then
        CW[i] ← Concatenate(char(V[i]), F[i+1])
    else
        CW[i] ← Concatenate(CW[V[i]], F[i+1])
Write to the output file OUT all code words without their last symbols
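
For comparison, here is a runnable decompressor written in the more common online style: it rebuilds the dictionary while reading the indices instead of precomputing the vectors V, F and CW above. The special case in the loop handles an index that refers to the entry created in the previous step, which is exactly the CWi = CWj ∘ First(·) property from the example.

def lzw_decompress(codes):
    # Invert lzw_compress: rebuild the original text from the list of indices.
    dictionary = {i: chr(i) for i in range(256)}    # initial single-character entries
    previous = dictionary[codes[0]]
    output = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:
            # The only index that can be unknown is the one created in the
            # previous step: previous + its own first character.
            entry = previous + previous[0]
        dictionary[len(dictionary)] = previous + entry[0]
        output.append(entry)
        previous = entry
    return "".join(output)

print(lzw_decompress([98, 97, 98, 256, 259, 260, 97]))   # babbababbabba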


Theorem: For any input string S, the LZW algorithm computes its compressed counterpart in time O(n), where n is the length of S.
Proof: The most complex and expensive operations are those performed on the dictionary. However, with the help of hash tables, all dictionary operations together can be performed in linear time.
The decompression process is also linear.