Huffman Tree

GCSE CS 4 AQA –Huffman – www.gcsecs.org
Published by paullong.net
Huffman coding and trees
Huffman coding is another method of lossless compression. It reduces the number
of bits needed to store data by looking at how often each data item (a character
in text or a pixel in an image) occurs. The data items that occur most frequently
are stored using the fewest bits.
Before looking at how Huffman coding works, let’s look at how to calculate the
number of bits needed to store a simple text file.
Each character in an ASCII text file uses 7 bits, but most text files use extended
ASCII which is 8 bits per character.
Therefore, to calculate the file size of an ASCII text file, count the number of
characters, including spaces, and that is the number of bytes. Then multiply by 8
to give the number of bits.
ASCII file size (bytes) = number of characters
ASCII file size (bits) = number of characters x 8
Example – ASCII file size
How many bits and bytes are in this sentence?
8 x spaces, 1 x ? and 36 letters = 45 characters = 45 bytes. Multiply by 8 for 360 bits.
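The file size rule can be checked with a short Python sketch. The function name ascii_size is an illustrative choice, not part of the worksheet, and it assumes every character is stored as one 8-bit extended ASCII byte.

```python
def ascii_size(text):
    """Return (bytes, bits) for text stored as 8-bit extended ASCII."""
    size_bytes = len(text)      # one byte per character, spaces included
    size_bits = size_bytes * 8  # eight bits in every byte
    return size_bytes, size_bits


sentence = "How many bits and bytes are in this sentence?"
print(ascii_size(sentence))  # (45, 360)
```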
Huffman coding identifies which characters occur most frequently. The most
frequent character is assigned the fewest bits for storage.
Using a Huffman Tree
A Huffman Tree is used to identify the bit pattern that should be used for each
character in a file. This is best explained using an example.
© paullong.net 2016
by Paul Long
Example – Huffman Tree
This is a Huffman Tree for the word
“abracadabra”. There are 5 letters in this
word – a, b, c, d and r.
Each circle is called a node.
The frequency of each letter (number of
times each letter appears in the word)
can be seen on the nodes above each
number:
a:5 r:2 b:2 c:1 d:1
The nodes above are the totals of adding
up each pair of nodes beneath them. For
example, for c and d, 1 + 1 = 2 and so 2
appears in the node above. The node at
the top indicates the total number of
characters in the phrase.
The Huffman Tree can now be used to identify the bit pattern to be used for each
character. This is done by inserting a 0 (zero) on every left hand branch and a 1
(one) on every right hand branch. The bit pattern for each letter is calculated by
following the branches and writing down the 0s and 1s passed to get to each
character.
Example – calculating the bit pattern
To get to the letter a, only a single 0 (zero) is passed from the top and so the
bit pattern for a is 0 (zero).
To get to the letter r, you have to pass a 1 and then a 0 so the bit pattern for
r is 10.
To get to the letter c, you have to pass a 1, then another 1, then another 1 and
then a 0 so the bit pattern for c is 1110.
Here are all the bit patterns for each character:
a: 0 r: 10 b: 110 c: 1110 d: 1111
Notice how the shortest bit pattern is used for the highest frequency character (a).
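Following the branches can also be sketched in Python. This is only an illustration: the nested-pair layout of TREE is an assumption inferred from the bit patterns listed for abracadabra, and the name bit_patterns is made up for this example.

```python
# The abracadabra tree written as nested pairs: (left branch, right branch).
# A plain string is a leaf holding one character.
TREE = ("a", ("r", ("b", ("c", "d"))))


def bit_patterns(node, path=""):
    """Walk the tree, adding 0 for each left branch and 1 for each right branch."""
    if isinstance(node, str):  # reached a character
        return {node: path}
    left, right = node
    patterns = bit_patterns(left, path + "0")
    patterns.update(bit_patterns(right, path + "1"))
    return patterns


print(bit_patterns(TREE))
# {'a': '0', 'r': '10', 'b': '110', 'c': '1110', 'd': '1111'}
```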
The Huffman Coding can now be calculated by replacing each character in the
file with its bit pattern.
Example – Huffman Coding
Each character of abracadabra is represented as follows:
a:0 r: 10 b: 110 c: 1110 d: 1111
Therefore, the Huffman Coding for the word will be:
a  b    r   a  c     a  d     a  b    r   a
0  110  10  0  1110  0  1111  0  110  10  0
This is written out as:
01101001110011110110100
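Replacing each character with its bit pattern is a simple lookup. A minimal sketch, assuming the bit patterns for abracadabra are held in a dictionary (the names CODES and huffman_encode are illustrative):

```python
CODES = {"a": "0", "r": "10", "b": "110", "c": "1110", "d": "1111"}


def huffman_encode(text, codes):
    """Replace each character with its bit pattern and join the results."""
    return "".join(codes[ch] for ch in text)


encoded = huffman_encode("abracadabra", CODES)
print(encoded)       # 01101001110011110110100
print(len(encoded))  # 23
```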
The Huffman Coding is now used to store the data. This will require fewer bits than
ASCII. The number of bits needed can be calculated by adding up the number of 1s
and 0s in the bit pattern. This can then be compared with the number of bits
needed to store the data using ASCII to show how much storage could be saved
using the Huffman method of compression.
Example – Huffman compression savings
Using Huffman Coding, abracadabra is stored as:
01101001110011110110100
Count up the number of 1s and 0s to calculate the number of bits required to
store. Total = 23 bits.
Compare this with the space needed to store in ASCII:
11 characters = 11 bytes
11 x 8 bits = 88 bits.
The saving is calculated as 88 – 23 = 65 bits saved.
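The same comparison can be written as a small function. This is a sketch assuming 8 bits per ASCII character; compression_saving is a hypothetical name used only here.

```python
def compression_saving(text, encoded_bits):
    """Return the bits saved by Huffman coding compared with 8-bit ASCII."""
    ascii_bits = len(text) * 8        # 8 bits for every character
    huffman_bits = len(encoded_bits)  # one bit for every 1 or 0 written down
    return ascii_bits - huffman_bits


saving = compression_saving("abracadabra", "01101001110011110110100")
print(saving)  # 65
```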
Example – using a Huffman Tree
This is a Huffman Tree for “poppy pop”.
The coding for p will be 1 (move right once).
The coding for o will be 00 (left, then left).
The coding for space will be 010 (left, right,
left).
The coding for y will be 011 (left, right, right).
Therefore, the Huffman Coding for the phrase will be:
p  o   p  p  y    space  p  o   p
1  00  1  1  011  010    1  00  1
This is written out as:
100110110101001
This uses a total of 15 bits.
Using ASCII, 9 characters of 8 bits each would be needed making a total of 72 bits.
There is a compression saving of 72 – 15 = 57 bits.
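Because no bit pattern in the tree is the start of another, the bit string can be read back without any gaps between the codes, which is why the compression is lossless. Decoding is not covered in this worksheet, so treat the following as an optional sketch; huffman_decode is an illustrative name.

```python
CODES = {"p": "1", "o": "00", " ": "010", "y": "011"}


def huffman_decode(bits, codes):
    """Rebuild the text by matching bit patterns from left to right."""
    lookup = {pattern: ch for ch, pattern in codes.items()}
    text, current = [], ""
    for bit in bits:
        current += bit
        if current in lookup:  # a complete pattern has been read
            text.append(lookup[current])
            current = ""
    return "".join(text)


print(huffman_decode("100110110101001", CODES))  # poppy pop
```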
Activity
1) The following Huffman tree has been created for “SHE SELLS SEA SHELLS”
a) Identify the binary code to be used for each character.
b) Write out the Huffman binary encoding for “SHE SELLS SEA SHELLS”.
c) Calculate how many bits would be required to store the sentence using
ASCII.
2) The following Huffman tree has been created for “EDDIE EDITED IT”
Created using http://huffman.ooz.ie
a) Write out the Huffman binary encoding for this sentence.
b) Calculate how many bits are needed to store this data using Huffman.
c) Calculate how many bits are needed to store this data using ASCII.
3) The following Huffman tree has been created for “WHICH WITCH IS WHICH”
a) Write out the Huffman binary encoding for this sentence.
b) Calculate how many bits are needed to store this data using Huffman.
c) Calculate how many bits are needed to store this data using ASCII.
4) The following Huffman tree has been created for “STUPID SUPERSTITION”
a) Identify the binary code to be used for each character.
b) Write out the Huffman binary encoding for this phrase.
c) Calculate how many bits are needed to store this data using Huffman.
d) Calculate how many bits are needed to store this data using ASCII.
5) The following Huffman tree has been created for “GOOD BLOOD, BAD BLOOD”
a) Identify the binary code to be used for each character.
b) Write out the Huffman binary encoding for this phrase.
c) Calculate how many bits are needed to store this data using Huffman.
d) Calculate how many bits are needed to store this data using ASCII.
Creating a Huffman Tree
Note: you do not need to be able to create a Huffman Tree in an exam, so this is a
bit of extension work. You may find it helpful to understand how the Trees
are created.
Creating a Huffman Tree is best understood with a video explanation.
Video
Watch Text Compression with Huffman Coding by Barry Brown on YouTube.
Work through this animation.
Example – creating a Huffman Tree 1
Peter Piper picked a peck of pickled peppers
To start with, note that the total number of characters, including spaces, is 44. In
normal ASCII encoding this would require 8 bits per character, making a total of
352 bits.
For the Huffman code, count the frequency of each letter:
Space (Δ) x 7   P x 2   i x 3   c x 3   f x 1
l x 1           p x 7   k x 3   s x 1   e x 8
d x 2           t x 1   a x 1   r x 3   o x 1
Add the total frequencies to check they add up to 44 because mistakes are very
easy to make. Now put these into ascending order along the bottom of a page:
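Counting by hand is where most mistakes creep in, so the tally can be checked with a few lines of Python. This sketch uses Counter from the standard library; the alphabetical tie-break in the sort is an assumption made for neatness, so letters with equal frequency may appear in a slightly different order from the list in this worksheet.

```python
from collections import Counter

phrase = "Peter Piper picked a peck of pickled peppers"

frequencies = Counter(phrase)           # capital P and lowercase p are different
assert sum(frequencies.values()) == 44  # check against the character count

# Ascending order of frequency, with ties broken alphabetically.
ordered = sorted(frequencies.items(), key=lambda item: (item[1], item[0]))
for character, frequency in ordered:
    label = "space" if character == " " else character
    print(label, frequency)
```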
a1 f1 l1 o1 s1 t1 P2 d2 c3 i3 k3 r3 p7 Δ7 e8
These are all now known as nodes in the Huffman Tree that is to be created. Start
by looking for the two nodes with the lowest frequencies. There are 6 to choose
from (a, f, l, o, s, t), so start with the left hand pair. Combine these to make a new
node with a number that represents the total frequency of characters within those
nodes (1 + 1 = 2):
[Diagram: a1 and f1 joined under the new node af2.
Top-level nodes: af2 l1 o1 s1 t1 P2 d2 c3 i3 k3 r3 p7 Δ7 e8]
Repeat this for the other four nodes with a frequency of 1:
[Diagram: new nodes af2 (a1, f1), lo2 (l1, o1) and st2 (s1, t1).
Top-level nodes: af2 lo2 st2 P2 d2 c3 i3 k3 r3 p7 Δ7 e8]
Repeat this again for the two nodes with the lowest frequency. There are 5 to
choose from (P, d, af, lo, st). Notice how we have to take account of the new
nodes that have been created. Therefore, join together from the left hand side
fitting in P with af, then d with lo (2 + 2 = 4):
[Diagram: new nodes Paf4 (P2, af2) and dlo4 (d2, lo2).
Top-level nodes: Paf4 dlo4 st2 c3 i3 k3 r3 p7 Δ7 e8]
Next move on to the next pair of nodes with the lowest frequency. This will be st
and one of c, i, k and r. Choose the left most of these to pair with st, which will be
c (2 + 3 = 5), and similarly pair i with k (3 + 3 = 6).
[Diagram: new nodes stc5 (st2, c3) and ik6 (i3, k3).
Top-level nodes: Paf4 dlo4 stc5 ik6 r3 p7 Δ7 e8]
The next lowest frequency pair will be r (3) and Paf (4) making a total of 7. The
other two lowest frequencies are dlo (4) and stc (5) making a total of 9.
Remember the lowest frequency moves to the left hand side of each pair of
branches.
[Diagram: new nodes rPaf7 (r3, Paf4) and dlostc9 (dlo4, stc5).
Top-level nodes: rPaf7 dlostc9 ik6 p7 Δ7 e8]
The next 2 lowest frequency nodes are ik (6) and p (7) making a total of 13. The
other two lowest frequencies are rPaf (7) and space/Δ (7) making a total of 14.
The top of each set of branches must be kept in ascending order of frequency so
rPafΔ moves to the right hand side because it is the largest frequency.
[Diagram: new nodes ikp13 (ik6, p7) and rPafΔ14 (rPaf7, Δ7).
Top-level nodes: dlostc9 ikp13 rPafΔ14 e8]
The next 2 lowest frequency nodes are e (8) and dlostc (9) making a total of 17.
e8 is moved to the left hand side of dlostc9 because it has a smaller frequency
and edlostc is moved to the right hand side because 17 is now the largest
frequency.
[Diagram: new node edlostc17 (e8, dlostc9).
Top-level nodes: ikp13 rPafΔ14 edlostc17]
The next 2 lowest frequency nodes are ikp (13) and rPafΔ (14) making a total of 27.
ikprPafΔ goes on the right hand side because it is the higher frequency.
[Diagram: new node ikprPafΔ27 (ikp13, rPafΔ14).
Top-level nodes: edlostc17 ikprPafΔ27]
Finally, the last 2 nodes are edlostc (17) and ikprPafΔ (27) with a total of 44,
which matches the number of characters, so the tree is consistent so far.
To complete the Huffman Tree, add a 0 on all the left hand branches and a 1 on
all the right hand branches.
44
  0: edlostc17
    0: e8
    1: dlostc9
      0: dlo4
        0: d2
        1: lo2
          0: l1 (01010)
          1: o1
      1: stc5
        0: st2
          0: s1
          1: t1 (01101)
        1: c3
  1: ikprPafΔ27
    0: ikp13
      0: ik6
        0: i3
        1: k3
      1: p7
    1: rPafΔ14
      0: rPaf7
        0: r3
        1: Paf4
          0: P2
          1: af2
            0: a1
            1: f1 (110111)
      1: Δ7
Now you can encode the characters into binary. Read from the top the 0s and 1s
down to each character. For example, follow from 44 to edlostc which is 0 and
then to e is another 0 so 00. Some are shown on the diagram above.
e (8) = 00        p (7) = 101       Δ (7) = 111
i (3) = 1000      k (3) = 1001      c (3) = 0111
r (3) = 1100      d (2) = 0100      P (2) = 11010
l (1) = 01010     o (1) = 01011     s (1) = 01100
t (1) = 01101     a (1) = 110110    f (1) = 110111
Notice how the most frequent characters have the shortest bit patterns.
“Peter Piper picked a peck of pickled peppers” can now be represented in binary:
11010 00 01101 00 1100 111 11010 1000 101 00 1100 111 101 1000 0111 1001 00 0100
111 110110 111 101 00 0111 1001 111 01011 110111 111 101 1000 0111 1001 01010 00
0100 111 101 00 101 101 00 1100 01100
Count up the number of 1s and 0s to calculate the number of bits required to
store. Total = 156.
Compare this with the space needed to store in ASCII:
44 characters x 8 bits = 352 bits.
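The whole process of joining the two lowest frequency nodes until one node is left can be sketched in Python. This is an illustration rather than the method an exam expects: it uses a priority queue (heapq) instead of a row of nodes on paper, and the names build_codes and tie_breaker are made up for this example. Where frequencies tie, the program may pair different letters from the worked example, so individual bit patterns can differ, but the total number of bits for a Huffman tree comes out the same.

```python
import heapq
from collections import Counter
from itertools import count


def build_codes(text):
    """Build Huffman bit patterns by repeatedly joining the two lowest nodes."""
    tie_breaker = count()  # keeps heap entries comparable when totals are equal
    heap = [(freq, next(tie_breaker), char) for char, freq in Counter(text).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        low_freq, _, low_node = heapq.heappop(heap)    # lowest total
        next_freq, _, next_node = heapq.heappop(heap)  # second lowest total
        merged = (low_node, next_node)                 # lower total goes on the left
        heapq.heappush(heap, (low_freq + next_freq, next(tie_breaker), merged))
    codes = {}

    def walk(node, path):
        if isinstance(node, str):
            codes[node] = path or "0"  # a one-character text still needs a bit
        else:
            walk(node[0], path + "0")  # left branch
            walk(node[1], path + "1")  # right branch

    walk(heap[0][2], "")
    return codes


phrase = "Peter Piper picked a peck of pickled peppers"
codes = build_codes(phrase)
total_bits = sum(len(codes[ch]) for ch in phrase)
print(total_bits)  # 156
```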
That was hard work! Huffman coding is not easy. The example above was as hard
as it should get, and in an exam you will not be expected to create a Huffman
tree. Here is a simpler example that might be easier to understand. It uses a
different method which does not write the combination of letters in each node,
but instead just puts in the number for each node:
Example – creating a Huffman Tree 2
Create a Huffman code tree for “the big bug bit the little beetle”
No capital letters are used and this method moves the original sequence about in
order to build the tree. The tool used for creating this tree is
http://www.algorasim.com/ and it puts characters into ascending order rather
than descending order.
Start by identifying the frequency of each letter:
Now combine the lowest frequency pair (1 and 2 = total 3):
Now combine the new lowest frequency pair of 2 and 3, moving the branches
along to the right to keep an ascending order of frequency:
Now combine the pair of 3s (l and ug) to make a total of 6:
Now combine 4 and 5 (total 9) and move to the right hand side:
Do the same for a pair of 6s (e and lug) to make a total of 12:
The last characters left are space (6) and t (6) with a total of 12:
Now combine 9 and 12 to make a total of 21:
Finally, 12 and 21 make a total of 33 – check this is the same as the total number of
characters. Remember to put 0s on left hand side of each branch and 1s on right
hand side of each branch:
The Huffman encoding is:
Space = 00
t = 01
e = 110
b = 100
i = 1011
h = 1010
l = 1110
u = 11110
g = 11111
The phrase “the big bug bit the little beetle” can now be represented in binary as:
01 1010 110 00 100 1011 11111 00 100 11110 11111 00 100 1011 01 00 01 1010 110 00
1110 1011 01 01 1110 110 00 100 110 110 01 1110 110
Count up the number of 1s and 0s to calculate the number of bits required to
store. Total = 101
Compare this with the space needed to store in ASCII:
33 characters x 8 bits = 264 bits.
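Counting every 1 and 0 by hand is slow, so a useful cross-check is to multiply each character's frequency by the length of its bit pattern and add the results. A hedged sketch using the codes from this example (the variable names are illustrative):

```python
from collections import Counter

CODES = {
    " ": "00", "t": "01", "e": "110", "b": "100", "i": "1011",
    "h": "1010", "l": "1110", "u": "11110", "g": "11111",
}

phrase = "the big bug bit the little beetle"
frequencies = Counter(phrase)

# Each character contributes (times it appears) x (bits in its pattern).
huffman_bits = sum(frequencies[ch] * len(CODES[ch]) for ch in frequencies)
ascii_bits = len(phrase) * 8

print(huffman_bits)               # 101
print(ascii_bits)                 # 264
print(ascii_bits - huffman_bits)  # 163
```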
Extension activity
1) Create a Huffman tree for each of:
a) A CLEAN CREAM CAN
b) WOULD A WOODCHUCK CHUCK WOOD?
c) FOUR FINE FRESH FISH FOR YOU
2) For each of the above phrases, write out the Huffman binary encoding.
3) For each of the above phrases, calculate the number of bits required for
storage using:
i) ASCII
ii) The Huffman Tree
Questions
1) Contrast lossy and lossless compression. [2]
2) Give 2 reasons for compressing data. [2]
3) Identify two methods of compression encoding. [2]