Run-length and Huffman algorithms

IS53010A (CIS325), Dr Ida Pu, Goldsmiths College, 2000–2005/6
Hardware Data Compression (HDC)
HDC algorithm
• Used by tape drives connected to IBM computer systems.
• A similar algorithm is used in the IBM Systems Network
Architecture (SNA) standard for data communication.
• Belongs to run-length coding, where the coder replaces
each sequence of consecutive identical symbols with three
elements:
1. a single symbol
2. a run-length count
3. an indicator that signifies how the symbol and count are
to be interpreted.
A simple algorithm
Uses
1. ASCII characters
2. 123 control characters
• r2, r3, ..., r63: the ri signify repeating characters.
• n3, n4, ..., n63: the ni signify non-repeating characters.
• A string of i consecutive blanks (2-63 bytes long) is
replaced by ri.
• A string of i consecutive identical characters other than
blanks (3-63 bytes long) is replaced by two characters: ri
followed by the character to repeat.
• A string of i non-repeating characters (1-63 bytes long) is
expanded by adding ni at the beginning of the
non-repeating character sequence.
• Characters are 8 bits wide: 128 values are the ASCII
characters, and the others are used as control characters
for compression.
Compression
Example 1 (␣ denotes a blank): ␣␣␣␣␣␣ABCDEF␣␣33GHJK␣MN3333333333
The compressed version: r6 n6 ABCDEF r2 n9 33GHJK␣MN r10 3
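The rules above can be sketched in Python roughly as follows. This is only an illustration, not the tape-drive implementation: the function name `hdc_encode` and the token representation (a tuple such as `("r", 6)` for a control character, a plain string for data characters) are assumptions made for the example.

```python
def hdc_encode(text, blank=" "):
    """Sketch of an HDC-style run-length encoder (hypothetical token format)."""
    tokens = []
    literal = []  # pending non-repeating characters

    def flush():
        # Emit pending non-repeating characters in chunks of at most 63.
        for k in range(0, len(literal), 63):
            chunk = "".join(literal[k:k + 63])
            tokens.append(("n", len(chunk)))
            tokens.append(chunk)
        literal.clear()

    i = 0
    while i < len(text):
        ch = text[i]
        j = i
        # Measure the run of identical characters, capped at 63.
        while j < len(text) and text[j] == ch and j - i < 63:
            j += 1
        run = j - i
        if ch == blank and run >= 2:
            flush()
            tokens.append(("r", run))      # blanks: the count alone suffices
        elif ch != blank and run >= 3:
            flush()
            tokens.append(("r", run))      # other characters: count, then
            tokens.append(ch)              # the character to repeat
        else:
            literal.extend(text[i:j])      # too short to be worth a run
        i = j
    flush()
    return tokens
```

Applied to the string of Example 1 (with an ordinary space as the blank), the sketch yields the token sequence r6, n6 ABCDEF, r2, n9 "33GHJK MN", r10 3.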
Decompression
Example 2 (␣ denotes a blank): Given r6 n6 ABCDEF r2 n9 33GHJK␣MN r10 3, the
original is:
␣␣␣␣␣␣ABCDEF␣␣33GHJK␣MN3333333333
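Decompression can be sketched in the same spirit. The token format is an assumption made for this example: control characters are tuples such as `("r", 6)` or `("n", 9)`, and data characters are plain strings. An r-token followed by a data character repeats that character; an r-token followed by another control token (or by nothing) stands for a run of blanks.

```python
def hdc_decode(tokens, blank=" "):
    """Sketch of an HDC-style decoder (hypothetical token format)."""
    out = []
    i = 0
    while i < len(tokens):
        kind, count = tokens[i]
        i += 1
        if kind == "n":
            # The next token holds `count` non-repeating characters.
            out.append(tokens[i][:count])
            i += 1
        elif i < len(tokens) and isinstance(tokens[i], str):
            # Repeat control followed by the character to repeat.
            out.append(tokens[i] * count)
            i += 1
        else:
            # Repeat control on its own: a run of blanks.
            out.append(blank * count)
    return "".join(out)
```

Feeding it the compressed version from Example 2 reproduces the original string, with an ordinary space standing in for the blank.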
Good and bad performance
Run-length coding is
• Excellent when the data has many runs of consecutive
identical symbols.
• Bad when there are few blanks but several runs of a
non-blank character such as A.
• Good when used as a subroutine in other, more
sophisticated coding schemes.
Huffman coding
Facts
• Characters are represented by fixed-length codes in
computers. The codes are often 8 bits long, as in the ASCII
and EBCDIC codes.
Example: In ASCII code, A is 1000001p, B is 1000010p and E
is 1000101p, where p is a parity bit.
• In English text, some characters occur far more frequently
than others, e.g. the letters E, A, O, T are much more
frequent than J, Q, X.
• Our aim is to reduce the total number of bits in the sequence
of 1s and 0s that represents the characters in a text.
Huffman’s idea
• A frequently occurring character is represented by a shorter
code.
• Huffman coding: a frequency-based coding scheme
(algorithm) in which the more frequently occurring letters
are assigned shorter codewords than the less frequently
occurring ones.
Example
Frequency of occurrence:
character   E   A   O    T    J    Q     X
frequency   5   5   5    3    3    2     1
A code:
character   E   A   O    T    J    Q     X
code        10  11  000  010  011  0010  0011
So, EEETTJX needs only 2 + 2 + 2 + 3 + 3 + 3 + 4 = 19
bits, instead of 8 × 7 = 56 bits.
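The bit count in this example can be checked with a few lines of Python. The dictionary simply transcribes the code table above; the helper name `encoded_bits` is made up for this illustration.

```python
# Code table from the example above.
code = {"E": "10", "A": "11", "O": "000", "T": "010",
        "J": "011", "Q": "0010", "X": "0011"}

def encoded_bits(message, code):
    """Total number of bits needed to encode `message` with `code`."""
    return sum(len(code[ch]) for ch in message)

variable_length = encoded_bits("EEETTJX", code)  # 19
fixed_length = 8 * len("EEETTJX")                # 56
```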
Huffman Coding
Algorithm
1. Construct a frequency table, sorted in descending order of
frequency.
2. Build a Huffman tree.
Repeat until the tree is complete: combine the last two
entries of the frequency table (which have the minimum
frequency) into one item and update the frequency table.
3. Starting at the root, trace down to every leaf; mark ‘0’ for
a left branch and ‘1’ for a right branch.
4. Assign a code to each symbol by reading the marks along
the path from the root to its leaf.
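As a rough sketch, the four steps can be written in Python as below. Note the swapped-in technique: instead of the sorted table used on these slides, the sketch keeps the entries in a priority queue (`heapq`), which also returns the two minimum-frequency items at each iteration. Ties may be broken differently, so individual codewords can differ from those on the slides, although the total encoded length is the same. The function name `huffman_code` is hypothetical, and a non-empty message is assumed.

```python
import heapq
from collections import Counter

def huffman_code(message):
    """Sketch: build a Huffman code for a non-empty message."""
    # Step 1: frequency table.
    freq = Counter(message)
    # Heap entries are (frequency, tie_breaker, tree); a tree is either
    # a single symbol or a (left, right) pair.
    heap = [(f, n, sym) for n, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    # Step 2: repeatedly combine the two least frequent items.
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (t1, t2)))
        counter += 1
    # Steps 3 and 4: '0' for a left branch, '1' for a right branch.
    code = {}

    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            code[tree] = prefix or "0"  # single-symbol alphabet

    walk(heap[0][2], "")
    return code
```

For the message ‘BILL BEATS BEN.’ the resulting code uses 48 bits in total, since every Huffman code for a given frequency table has the same total length.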
Example
Message to encode: ‘BILL BEATS BEN.’ (15 characters in
total)
character   B  I  L  E  A  T  S  N  SP  .
frequency   3  1  2  2  1  1  1  1  2   1
Sorted:
character   B  L  E  SP  I  A  T  S  N  .
frequency   3  2  2  2   1  1  1  1  1  1
(SP denotes the space character.)
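A frequency table like the one above can be produced with a short Python sketch. The helper name `frequency_table` is an assumption for this example, and the order of characters with equal frequency may differ from the order chosen on the slide.

```python
from collections import Counter

def frequency_table(message):
    """Return (character, frequency) pairs, most frequent first."""
    counts = Counter(message)
    # sorted() is stable, so ties keep their order of first appearance.
    return sorted(counts.items(), key=lambda item: -item[1])

table = frequency_table("BILL BEATS BEN.")
```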
Constructing a binary tree
There are two stages in each step: (1) combine the last two
items on the table; (2) adjust the position of the combined
item on the table so that the table remains sorted.
Combine:
B  L  E  SP  I  A  T  S  [N.]
3  2  2  2   1  1  1  1  2
Update:
B  [N.]  L  E  SP  I  A  T  S
3  2     2  2  2   1  1  1  1
B  [TS]  [N.]  L  E  SP  I  A
3  2     2     2  2  2   1  1

B  [IA]  [TS]  [N.]  L  E  SP
3  2     2     2     2  2  2

[E SP]  B  [IA]  [TS]  [N.]  L
4       3  2     2     2     2

[[N.] L]  [E SP]  B  [IA]  [TS]
4         4       3  2     2
[[IA][TS]]  [[N.] L]  [E SP]  B
4           4         4       3

[[E SP] B]  [[IA][TS]]  [[N.] L]
7           4           4

[[[IA][TS]] [[N.] L]]  [[E SP] B]
8                      7

[[[[IA][TS]] [[N.] L]] [[E SP] B]]
15
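The table-based construction traced above can be sketched in Python as follows. The insertion rule used here (a combined item goes in front of all items whose frequency is not larger) is inferred from the tables on these slides, and the name `build_tree` and the nested-tuple representation of the tree are assumptions for this example.

```python
def build_tree(table):
    """Sketch of the sorted-table construction.

    `table` is a list of (item, frequency) pairs in descending order of
    frequency. The result is a nested (left, right) tuple.
    """
    table = list(table)
    while len(table) > 1:
        # Stage 1: combine the last two items (minimum frequencies).
        (left, f1), (right, f2) = table[-2], table[-1]
        del table[-2:]
        combined = ((left, right), f1 + f2)
        # Stage 2: reinsert so that the table remains sorted; the new
        # item is placed ahead of items with an equal frequency.
        pos = 0
        while pos < len(table) and table[pos][1] > combined[1]:
            pos += 1
        table.insert(pos, combined)
    return table[0][0]

sorted_table = [("B", 3), ("L", 2), ("E", 2), (" ", 2), ("I", 1),
                ("A", 1), ("T", 1), ("S", 1), ("N", 1), (".", 1)]
tree = build_tree(sorted_table)
```

With the sorted table for ‘BILL BEATS BEN.’ as input, the nested tuple mirrors the final bracketed item [[[[IA][TS]] [[N.] L]] [[E SP] B]].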
The Binary Tree
                [[[[IA][TS]] [[N.] L]] [[E SP] B]]
                   /                        \
     [[[IA][TS]] [[N.] L]]              [[E SP] B]
         /           \                    /     \
    [[IA][TS]]     [[N.] L]           [E SP]     B
     /     \        /    \             /  \
   [IA]   [TS]    [N.]    L           E    SP
   / \    / \     / \
  I   A  T   S   N   .
Huffman tree
                [[[[IA][TS]] [[N.] L]] [[E SP] B]]
                  0/                       \1
     [[[IA][TS]] [[N.] L]]              [[E SP] B]
        0/          \1                   0/    \1
    [[IA][TS]]     [[N.] L]           [E SP]     B
    0/    \1       0/   \1            0/ \1
   [IA]   [TS]    [N.]    L           E    SP
  0/ \1  0/ \1   0/ \1
  I   A  T   S   N   .
Generate codes
symbol              I     A     T     S     N     .     L    E    SP   B
code                0000  0001  0010  0011  0100  0101  011  100  101  11
frequency × length  1×4   1×4   1×4   1×4   1×4   1×4   2×3  2×3  2×3  3×2
Total: 6 × 4 + 3 × 6 + 6 = 48 bits.
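Reading the codes off the tree can be sketched as a short recursive walk. The nested-tuple tree below is a transcription of the Huffman tree for ‘BILL BEATS BEN.’ (a space stands for SP), and the function name `assign_codes` is hypothetical.

```python
def assign_codes(tree, prefix=""):
    """Walk the tree: '0' for a left branch, '1' for a right branch."""
    if isinstance(tree, str):          # a leaf holds a single symbol
        return {tree: prefix}
    left, right = tree
    codes = assign_codes(left, prefix + "0")
    codes.update(assign_codes(right, prefix + "1"))
    return codes

# Tree transcribed from the slides; " " stands for SP.
tree = (((("I", "A"), ("T", "S")), (("N", "."), "L")), (("E", " "), "B"))
codes = assign_codes(tree)
total_bits = sum(len(codes[ch]) for ch in "BILL BEATS BEN.")
```

The resulting dictionary agrees with the code table above, and `total_bits` comes to 48.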
Saving Percentage
Comparison of Huffman coding with 8-bit ASCII or EBCDIC coding:
Huffman   ASCII/EBCDIC   Saving bits     Percentage
48        120            72              60%
                         120 − 48 = 72   72/120 = 60%
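The same arithmetic as a small Python sketch; the helper name `saving_percentage` is made up for this illustration.

```python
def saving_percentage(original_bits, compressed_bits):
    """Percentage of bits saved relative to the original size."""
    return 100 * (original_bits - compressed_bits) / original_bits

original = 8 * len("BILL BEATS BEN.")       # 15 characters, 8 bits each
saving = saving_percentage(original, 48)    # 48 bits with the Huffman code
```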
Decoding
Given a Huffman-coded message,
111000100101111000001001000111011100000110110101
What is the message?
(see the coding tree)
Message: BEN BEATS BILL.
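Because no codeword is a prefix of another, the bit string can be decoded from left to right without separators. A minimal Python sketch, assuming the code table generated for ‘BILL BEATS BEN.’ (a space stands for SP) and a hypothetical function name `huffman_decode`:

```python
def huffman_decode(bits, code):
    """Decode a string of '0'/'1' characters using a prefix code."""
    inverse = {word: sym for sym, word in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:       # a complete codeword has been read
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)

code = {"I": "0000", "A": "0001", "T": "0010", "S": "0011",
        "N": "0100", ".": "0101", "L": "011", "E": "100",
        " ": "101", "B": "11"}
bits = "111000100101111000001001000111011100000110110101"
message = huffman_decode(bits, code)
```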
Note
• A frequency table for the general (average) case in
English text can be used. However, one can always construct
the table from the actual text to achieve a higher saving
percentage.
• Huffman codes are not unique.
Example: a) The labels can be swapped: ‘0’ for a right
branch and ‘1’ for a left branch.
b) There are a number of different ways to insert a
combined item into the frequency table. This leads to
different binary trees.
• We use the canonical minimum-variance code in which the
difference in length among the codewords is kept to the
minimum.
• When the alphabet is small, a fixed-length code (shorter
than 8 bits) can also be used to save bits. For example, if
the size of the alphabet is not bigger than 32, we can
use five bits to code each character. This would give a
saving percentage of (8×32 − 5×32) / (8×32) = 37.5%.
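The fixed-length case can be sketched in Python as well. The helper name `fixed_length_saving` and its parameters are assumptions for this example; it uses the smallest number of bits that can distinguish every symbol in the alphabet.

```python
def fixed_length_saving(alphabet_size, original_bits_per_char=8):
    """Bits per character for a fixed-length code, and the saving in percent."""
    # Smallest number of bits b with 2**b >= alphabet_size (at least 1).
    bits = max(1, (alphabet_size - 1).bit_length())
    saving = 100 * (original_bits_per_char - bits) / original_bits_per_char
    return bits, saving

bits, saving = fixed_length_saving(32)   # five bits, 37.5% saving
```

The saving does not depend on the length of the text, because every character shrinks by the same factor.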