Huffman Coding

IS53010A (CIS325), Dr Ida Pu, Goldsmiths College, 2000–2005/6
Static Huffman Coding
The model for static Huffman coding
• Given a sequence of symbols from an alphabet
α = {s1, s2, ..., sn}, where each si, i = 1, ..., n, occurs with a
fixed probability pi.
• Suppose that the probability of finding si in the next
character position is always pi , regardless of what went
before.
(Note: assume that the probability distribution is known to both the
compressor and the decompressor.)
• Huffman coding is a fixed-to-variable coding method.
• Huffman codes are prefix codes (meaning
“prefix-free codes”), so they are uniquely decodable.
• Huffman codes are optimal only when the probabilities of the
symbols are all negative powers of 2, since each codeword
length li must be an integer (in bits).
• Huffman codes are fragile to decode: a single bit error
can corrupt the decoding of the entire file.
Huffman algorithm
• Build a binary tree where the leaves of the tree are the
symbols to be coded.
• The edges of the tree are labelled by a 0 or 1.
• The codeword for a symbol is obtained by “walking” from the
root of the tree down to the symbol's leaf, reading off the edge
labels along the way.
• Example 11: α = {A, B, C, D, E, F} with probabilities
0.25, 0.2, 0.15, 0.15, 0.15, 0.1.
Building the binary tree
• If there is only one symbol, the tree is a single node that is
both the root and the leaf.
• Otherwise, take the two symbols s and s′ in the alphabet which
have the lowest probabilities p and p′.
• Remove s and s′ from the alphabet and add a new symbol
[s, s′] with probability p + p′. The alphabet now has one
fewer symbol than before.
• Repeat this until there is only one symbol left in the alphabet.
• Then construct the tree recursively, starting from the root, by
expanding each combined symbol back into its two parts.
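For illustration, here is a minimal Python sketch of this construction
(not from the original notes): it repeatedly combines the two
lowest-probability subtrees using a priority queue, then walks the
finished tree to read off the codewords. It makes no attempt to follow
the tie-breaking rules introduced in the next section, so for equal
probabilities it may produce a different, equally optimal code.

    import heapq

    def huffman_codes(probs):
        # probs: dict mapping symbol -> probability.
        # Heap entries are (probability, tie_breaker, tree); a tree is
        # either a bare symbol or a pair (subtree_0, subtree_1). The
        # tie_breaker keeps tuple comparison away from the trees.
        heap = [(p, i, s) for i, (s, p) in enumerate(sorted(probs.items()))]
        heapq.heapify(heap)
        n = len(heap)
        while len(heap) > 1:
            # Combine the two subtrees of lowest total probability.
            p1, _, t1 = heapq.heappop(heap)
            p2, _, t2 = heapq.heappop(heap)
            heapq.heappush(heap, (p1 + p2, n, (t1, t2)))
            n += 1
        codes = {}
        def walk(tree, code):
            if isinstance(tree, tuple):            # internal node
                walk(tree[0], code + "0")
                walk(tree[1], code + "1")
            else:                                  # leaf: record codeword
                codes[tree] = code or "0"          # one-symbol alphabet case
        walk(heap[0][2], "")
        return codes

    # Example 11:
    print(huffman_codes({"A": 0.25, "B": 0.2, "C": 0.15,
                         "D": 0.15, "E": 0.15, "F": 0.1}))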
Canonical and minimum-variance
As we know,
• there can be items with equal probabilities, and
• the roles of 0 and 1 are interchangeable,
so the Huffman code is not unique. To avoid this ambiguity, we set
the following RULES:
1. A newly-created element goes as high in the list as possible.
2. When combining two items, the one higher up in the list is
assigned 0 and the one lower down 1.
• A code that follows these rules is called canonical.
• It is minimum-variance: the variation in the lengths of
the codewords is minimised.
• Huffman coding that follows these rules is called canonical
and minimum-variance Huffman coding.
Implementation efficiency
The algorithm described earlier requires
• maintaining the list of current symbols in decreasing order
of probability, and
• searching for the right place to insert each new symbol, which
gives an O(n²) worst case, where n is the number of symbols.
One solution
• Maintain 2 lists: one (Linit) contains the original symbols
in decreasing order of probability; the other (Lcomb) contains
only the “super-symbols” obtained by combining symbols, and is
initially empty.
• A new combined symbol always goes to the front of the list
Lcomb (in O(1) time). This keeps Lcomb in decreasing order,
because the weights of newly combined symbols never decrease.
• Each time, look at the ends of Linit and Lcomb to
determine which symbols to combine next (the pair may be two
super-symbols, two initial symbols, or a super-symbol with
an initial symbol).
• Example 12: Show how the algorithm works when the
alphabet is {A,B,C,D,E,F,G,H,I,J} and the probabilities
are (in %) 19, 17, 15, 13, 11, 9, 7, 5, 3, 1.
Linit: A 19, B 17, C 15, D 13, E 11, F 9, G 7, H 5, I 3, J 1
Lcomb: empty

Linit: A 19, B 17, C 15, D 13, E 11, F 9, G 7, H 5
Lcomb: [IJ] 4

Linit: A 19, B 17, C 15, D 13, E 11, F 9, G 7
Lcomb: [H [IJ]] 9

Linit: A 19, B 17, C 15, D 13, E 11
Lcomb: [FG] 16, [H [IJ]] 9

Linit: A 19, B 17, C 15, D 13
Lcomb: [E [H [IJ]]] 20, [FG] 16

Linit: A 19, B 17
Lcomb: [CD] 28, [E [H [IJ]]] 20, [FG] 16

Linit: A 19
Lcomb: [B [FG]] 33, [CD] 28, [E [H [IJ]]] 20

Linit: empty
Lcomb: [A [E [H [IJ]]]] 39, [B [FG]] 33, [CD] 28

Linit: empty
Lcomb: [[B [FG]] [CD]] 61, [A [E [H [IJ]]]] 39

Linit: empty
Lcomb: [[[B [FG]] [CD]] [A [E [H [IJ]]]]] 100
Construct the binary tree recursively from the root:
          [[[B [FG]] [CD]] [A [E [H [IJ]]]]]
                0 /        \ 1
    [[B [FG]] [CD]]        [A [E [H [IJ]]]]
          [[[B [FG]] [CD]] [A [E [H [IJ]]]]]
                0 /        \ 1
    [[B [FG]] [CD]]        [A [E [H [IJ]]]]
      0 /    \ 1             0 /    \ 1
 [B [FG]]    [CD]            A      [E [H [IJ]]]
          [[[B [FG]] [CD]] [A [E [H [IJ]]]]]
                0 /        \ 1
    [[B [FG]] [CD]]        [A [E [H [IJ]]]]
      0 /    \ 1             0 /    \ 1
 [B [FG]]    [CD]            A      [E [H [IJ]]]
 0 / \ 1    0 / \ 1                  0 /    \ 1
 B  [FG]    C    D                   E    [H [IJ]]
          [[[B [FG]] [CD]] [A [E [H [IJ]]]]]
                0 /        \ 1
    [[B [FG]] [CD]]        [A [E [H [IJ]]]]
      0 /    \ 1             0 /    \ 1
 [B [FG]]    [CD]            A      [E [H [IJ]]]
 0 / \ 1    0 / \ 1                  0 /    \ 1
 B  [FG]    C    D                   E    [H [IJ]]
   0 / \ 1                                0 /  \ 1
   F    G                                 H   [IJ]
                                             0 / \ 1
                                             I    J
So the code is:

A 10     B 000    C 010    D 011    E 110
F 0010   G 0011   H 1110   I 11110  J 11111
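A minimal Python sketch of the two-list construction follows (not part
of the original notes). The tie-breaking below — preferring the initial
symbol on equal weights, and putting the element from Linit on the
0 branch when the two lists mix — is one reading of the rules that
reproduces the trace and the codewords above.

    def two_list_huffman(symbols):
        # symbols: two or more (symbol, weight) pairs, decreasing weight.
        linit = list(symbols)   # original symbols, decreasing weight
        lcomb = []              # combined "super-symbols", decreasing weight

        def pop_smallest():
            # The lowest weight sits at the tail of each list; on a tie we
            # take the original symbol first (one reading of the rules).
            if linit and (not lcomb or linit[-1][1] <= lcomb[-1][1]):
                return linit.pop() + ("init",)
            return lcomb.pop() + ("comb",)

        while len(linit) + len(lcomb) > 1:
            t1, w1, src1 = pop_smallest()
            t2, w2, src2 = pop_smallest()
            # Rule 2: the item higher up takes the 0 branch. The second pop
            # is the higher one, except when an initial symbol is paired
            # with a super-symbol, where the initial symbol goes first.
            hi, lo = ((t1, t2) if (src1, src2) == ("init", "comb")
                      else (t2, t1))
            # Rule 1: the new element goes to the front of Lcomb, i.e. as
            # high as possible. O(1), and order is preserved because the
            # combined weights never decrease.
            lcomb.insert(0, ((hi, lo), w1 + w2))

        codes = {}
        def walk(tree, code):
            if isinstance(tree, tuple):
                walk(tree[0], code + "0")
                walk(tree[1], code + "1")
            else:
                codes[tree] = code
        walk(lcomb[0][0], "")
        return codes

    syms = [("A", 19), ("B", 17), ("C", 15), ("D", 13), ("E", 11),
            ("F", 9), ("G", 7), ("H", 5), ("I", 3), ("J", 1)]
    print(two_list_huffman(syms))   # reproduces the codewords listed above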
A problem with Huffman codes
• Remember that Huffman codes meet the entropy bound
only when all the probabilities are negative powers of 2.
• What if the alphabet is binary, α = {a, b}?
• The only optimal case is then pa = pb = 1/2.
• Hence, Huffman codes can be bad.
Example 13: Let pa = 0.8 and pb = 0.2.
• Since Huffman coding needs to use at least 1 bit per symbol
to encode the input, the Huffman codewords here take exactly
1 bit per symbol on average.
• However, the entropy of the distribution is
−(0.8 log2 0.8 + 0.2 log2 0.2) ≈ 0.72 bits,
• so the Huffman code is (1 − 0.72)/0.72 = 0.28/0.72 ≈ 39%
worse than optimal.
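A quick check of these figures:

    import math

    # Entropy of the source, in bits per symbol (pa = 0.8, pb = 0.2):
    H = -(0.8 * math.log2(0.8) + 0.2 * math.log2(0.2))   # ~0.7219
    # Huffman spends 1 bit per symbol here, so its redundancy is:
    print(H, (1 - H) / H)   # ~0.72, ~0.39: about 39% worse than optimal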
Extended Huffman coding
• Artificially increase the alphabet size.
• Example 14: Create a new alphabet α′ = {A, B, C, D},
where A stands for a sequence of two original symbols,
say aa. Similarly, B stands for ab, C for ba, and D for bb.
• Now the probabilities are: Pr[A] = Pr[a] × Pr[a] = 0.64,
Pr[B] = Pr[a] × Pr[b] = 0.16,
Pr[C] = Pr[b] × Pr[a] = 0.16,
Pr[D] = Pr[b] × Pr[b] = 0.04.
• The canonical minimum-variance code for this is A=0,
B=11, C=100, D=101.
• The average length is 1.56 bits per pair, so the output
becomes 1.56/2 = 0.78 bits per original symbol. This is only
about 8% worse than optimal.
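Again, a quick check of these figures (the pair probabilities and
codeword lengths are taken from above):

    import math

    pa, pb = 0.8, 0.2
    # Pair probabilities for the extended alphabet {A=aa, B=ab, C=ba, D=bb}
    # and the codeword lengths of A=0, B=11, C=100, D=101:
    prob = {"A": pa * pa, "B": pa * pb, "C": pb * pa, "D": pb * pb}
    length = {"A": 1, "B": 2, "C": 3, "D": 3}
    avg = sum(prob[s] * length[s] for s in prob)      # 1.56 bits per pair
    H = -(pa * math.log2(pa) + pb * math.log2(pb))    # ~0.72 bits per symbol
    print(avg / 2, (avg / 2 - H) / H)   # ~0.78 bits per symbol, ~8% worse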
Exercise: What improvement can be made if we encoded blocks of 3
original symbols at a time, i.e. an extended alphabet of 2³ = 8
block symbols? (About 1% worse than the optimal?)
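One way to check the hinted figure numerically (a sketch, not from the
original notes; it uses the fact that the average codeword length of a
Huffman code equals the sum of the combined weights over all merges):

    import heapq, itertools, math

    pa, pb = 0.8, 0.2
    p = {"a": pa, "b": pb}
    # Probabilities of all 2**3 = 8 blocks of 3 original symbols.
    weights = [p[x] * p[y] * p[z]
               for x, y, z in itertools.product("ab", repeat=3)]

    # Sum the combined weight of every merge; no explicit tree needed.
    heapq.heapify(weights)
    total = 0.0
    while len(weights) > 1:
        w = heapq.heappop(weights) + heapq.heappop(weights)
        total += w
        heapq.heappush(weights, w)

    H = -(pa * math.log2(pa) + pb * math.log2(pb))
    print(total / 3, (total / 3 - H) / H)   # ~0.728 bits/symbol, ~1% worse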
Adaptive Huffman Coding
• The model calculates symbol probabilities as it goes along.
• For ease of discussion, we use weights instead of
probabilities.
• The weight of a symbol is the number of times that
symbol has been seen before (i.e. the frequency of its
occurrence so far).
• Codewords are generated and output dynamically, and a
Huffman tree is updated regularly.
• Let the alphabet be α = {σ1, σ2, ..., σn}, and let g(σi) be
any fixed-length code for σi, e.g. an ASCII code.
• Define one special symbol † (∉ α) solely for
communication between the compressor and decompressor.
The compressor
The compression algorithm
• maintains a subset S ⊆ α of symbols that it has seen so
far. Initially, S = {†}.
• A Huffman code for all the symbols in S is also
maintained.
• The weight of † is always 0.
• The weight of any other symbol in S is its frequency so
far.
• Before any input has been seen, the Huffman tree has only
one node, for the symbol †.
• During the process, the Huffman tree will be used to
assign codes to the symbols in S.
• Let h(σi) be the current Huffman code for σi, and write DAG
for the special symbol † in the algorithm below.
Encoding algorithm
1. Initialise the Huffman tree T containing
   the only node DAG.
2. while (more characters remain) do
       s := next_symbol_in_text();
       if (s has been seen before) then output h(s)
       else output h(DAG) followed by g(s);
       T := update_tree(T);
   end
The function update_tree
It does the following:
1. If s is not in S, then add s to S and set weight_s := 1;
   else weight_s := weight_s + 1;
2. Recompute the Huffman tree for the new set
   of weights and/or symbols.
The decompressor
The decompression algorithm
• maintains a subset S′ ⊆ α of real symbols that it has seen
so far.
• Initially, S′ = {†}.
• The weight of † is always 0, while the weight of any other
symbol is the frequency of its occurrence so far in the
decoded output.
Decoding algorithm
1. Initialise the Huffman tree T
   with the single node DAG.
2. while (more bits remain) do
       s := huffman_next_sym();
       if (s == DAG) then s := read_unencoded_sym();
       output s;
       T := update_tree(T);
   end
Function huffman_next_sym()
It reads bits from the input until it reaches a leaf node, and
returns the symbol with which that leaf is labelled.
1. start at the root of the Huffman tree;
2. while (not at a leaf) do
       read_next_bit();
       follow the corresponding edge;
   end
3. return the symbol of the leaf reached.
Function read_unencoded_sym()
• It simply reads the next unencoded symbol from the input.
For example, if the original encoding was an ASCII code, it
would read the next 7 bits (8 bits if a parity bit is
included).
• The function update_tree does the following:
  1. If s is not in S, then add s to S with weight 1;
     else add 1 to the weight of s;
  2. Recompute the Huffman tree for
     the new set of weights and/or symbols.
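Putting the pieces together, here is a minimal runnable Python sketch
of the adaptive scheme (an illustration, not the original
implementation): update_tree is realised naively by recomputing the
whole Huffman code after every symbol, g is taken to be 8-bit ASCII,
and the tie-breaking inside the rebuild is an arbitrary but
deterministic choice, so the compressor and decompressor always
rebuild identical trees.

    import heapq

    DAG = "\u2020"   # the special symbol (dagger); must not occur in input

    def rebuild_codes(weights):
        # The naive update_tree: recompute a Huffman code from scratch
        # for the current weights. Deterministic, so both sides agree.
        heap = [(w, i, s) for i, (s, w) in enumerate(sorted(weights.items()))]
        heapq.heapify(heap)
        n = len(heap)
        while len(heap) > 1:
            w1, _, t1 = heapq.heappop(heap)
            w2, _, t2 = heapq.heappop(heap)
            heapq.heappush(heap, (w1 + w2, n, (t1, t2)))
            n += 1
        codes = {}
        def walk(t, c):
            if isinstance(t, tuple):
                walk(t[0], c + "0")
                walk(t[1], c + "1")
            else:
                codes[t] = c or "0"
        walk(heap[0][2], "")
        return codes

    def encode(text):
        weights = {DAG: 0}           # initially S = {dagger} with weight 0
        out = []
        for s in text:
            codes = rebuild_codes(weights)
            if s in weights:                      # seen before: output h(s)
                out.append(codes[s])
            else:                                 # new: h(DAG) then g(s)
                out.append(codes[DAG] + format(ord(s), "08b"))
            weights[s] = weights.get(s, 0) + 1    # update the weight of s
        return "".join(out)

    def decode(bits):
        weights = {DAG: 0}
        out, i = [], 0
        while i < len(bits):
            decoding = {c: s for s, c in rebuild_codes(weights).items()}
            j = i + 1
            while bits[i:j] not in decoding:      # huffman_next_sym()
                j += 1
            s, i = decoding[bits[i:j]], j
            if s == DAG:                          # read_unencoded_sym():
                s, i = chr(int(bits[i:i + 8], 2)), i + 8   # 8-bit chars
            out.append(s)
            weights[s] = weights.get(s, 0) + 1
        return "".join(out)

    msg = "abracadabra"
    assert decode(encode(msg)) == msg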
Disadvantages of Huffman coding
Problem 1: It is not optimal unless all probabilities are
negative powers of 2 (recall the particularly bad situation
for binary alphabets).
• Although, by encoding groups of symbols together, one may
achieve compression closer to the optimal, the grouping
method requires a larger alphabet to be handled.
Problem 2: Despite some clever methods for counting the
frequency of each symbol reasonably quickly, rebuilding the
entire tree for each symbol can be very slow. This matters
most when the probability distribution changes rapidly with
each symbol.