IS53010A (CIS325), Dr Ida Pu, Goldsmiths College, 2000–2005/6

Hardware Data Compression (HDC)

HDC algorithm
• Used by tape drives connected to IBM computer systems.
• A similar algorithm is used in the IBM Systems Network Architecture (SNA) standard for data communication.
• Belongs to run-length coding, where the coder replaces sequences of consecutive identical symbols with three elements:
1. a single symbol
2. a run-length count
3. an indicator that signifies how the symbol and count are to be interpreted.

A simple algorithm
Uses
1. ASCII characters
2. 123 control characters
• r2, r3, ..., r63: repeating characters
• n3, n4, ..., n63: the ni's signify non-repeating characters.

• A string of i consecutive blanks (2-63 bytes long) is replaced by ri.
• A string of i consecutive identical characters other than blanks (3-63 bytes long) is replaced by two characters: ri followed by the character to repeat.
• A string of i non-repeating characters (1-63 bytes long) is expanded by adding ni at the beginning of the non-repeating character sequence.
• Characters are 8 bits long: 128 of the codes are the ASCII characters and the others are used for compression.

Compression
Example 1 (␣ denotes a blank)
␣␣␣␣␣␣ABCDEF␣␣33GHJK␣MN3333333333
The compressed version:
r6 n6 ABCDEF r2 n9 33GHJK␣MN r10 3

Decompression
Example 2
Given r6 n6 ABCDEF r2 n9 33GHJK␣MN r10 3, the original is:
␣␣␣␣␣␣ABCDEF␣␣33GHJK␣MN3333333333

Good and bad performance
Run-length coding is
• Excellent when the data has many runs of consecutive symbols.
• Bad when there are few blanks but several runs of character A.
• Good when used as a subroutine in other more sophisticated coding.
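The HDC rules above can be sketched in Python. This is a minimal sketch, not the byte layout used by real hardware: the control characters ri and ni are modelled as tuples ("r", i) and ("n", i), and the names hdc_encode, hdc_decode, BLANK and MAX_COUNT are illustrative assumptions. Non-repeating strings follow the "1-63 bytes long" rule.

```python
BLANK = " "
MAX_COUNT = 63  # largest count a control character can carry (r63 / n63)


def hdc_encode(text):
    """Encode text as a list of tokens: control tuples and data strings."""
    tokens = []
    pending = []  # non-repeating characters waiting to be emitted

    def flush():
        # Emit pending characters as ni + characters, at most 63 per group.
        for k in range(0, len(pending), MAX_COUNT):
            chunk = "".join(pending[k:k + MAX_COUNT])
            tokens.append(("n", len(chunk)))
            tokens.append(chunk)
        pending.clear()

    i = 0
    while i < len(text):
        ch = text[i]
        j = i
        while j < len(text) and text[j] == ch and j - i < MAX_COUNT:
            j += 1
        run = j - i
        if ch == BLANK and run >= 2:
            flush()
            tokens.append(("r", run))      # blanks: ri alone
        elif ch != BLANK and run >= 3:
            flush()
            tokens.append(("r", run))      # other characters: ri + character
            tokens.append(ch)
        else:
            pending.extend(text[i:j])      # too short to count as a run
        i = j
    flush()
    return tokens


def hdc_decode(tokens):
    """Invert hdc_encode. Assumes a well-formed token list."""
    out = []
    i = 0
    while i < len(tokens):
        kind, count = tokens[i]
        i += 1
        if kind == "n":
            out.append(tokens[i])          # the next token holds the characters
            i += 1
        elif i < len(tokens) and isinstance(tokens[i], str):
            out.append(tokens[i] * count)  # ri followed by a data character
            i += 1
        else:
            out.append(BLANK * count)      # ri alone means blanks
    return "".join(out)


example = BLANK * 6 + "ABCDEF" + BLANK * 2 + "33GHJK MN" + "3" * 10
encoded = hdc_encode(example)
print(encoded)
print(hdc_decode(encoded) == example)
```

On the text of Example 1 the token list corresponds to r6 n6 ABCDEF r2 n9 33GHJK␣MN r10 3. The decoder can tell the two uses of ri apart because every data character is preceded by a control character, so an ri followed directly by another control character (or by the end of the input) must stand for blanks.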
Huffman coding

Facts
• Characters are represented by fixed-length codes in computers. The codes are often 8 bits long, such as the ASCII and EBCDIC codes. Example: in ASCII code, A is 1000001p, B is 1000010p and E is 1000101p, where p is a parity bit.
• In English text, some characters occur far more frequently than others, e.g. the letters E, A, O, T are much more frequent than J, Q, X.
• Our aim is to reduce the total number of bits in the sequence of 1s and 0s that represents the characters in a text.

Huffman’s idea
• A frequently occurring character is represented by a shorter code.
• Huffman coding: a frequency-based coding scheme (algorithm) in which the more frequently occurring letters are assigned fewer bits than the less frequently occurring ones.

Example
Frequency of occurrence:
E 5   A 5   O 5   T 3   J 3   Q 2   X 1
A code:
E 10   A 11   O 000   T 010   J 011   Q 0010   X 0011
So EEETTJX needs only 2 + 2 + 2 + 3 + 3 + 3 + 4 = 19 bits, instead of 8 × 7 = 56 bits.

Huffman coding algorithm
1. Construct a frequency table, sorted by descending frequency.
2. Build a Huffman tree. Repeat until the tree is complete: take the last two entries of the frequency table (those with the minimum frequencies), combine them into a single entry whose frequency is the sum of the two, and update the frequency table.
3. Starting at the root, trace down to every leaf; mark ‘0’ for a left branch and ‘1’ for a right branch.
4. Assign a code to each symbol by reading the marks on the path from the root to its leaf.

Example
Message to encode: ‘BILL BEATS BEN.’ (15 characters in total)

character  B  I  L  E  A  T  S  N  SP  .
frequency  3  1  2  2  1  1  1  1  2   1

Sorted:

character  B  L  E  SP  I  A  T  S  N  .
frequency  3  2  2  2   1  1  1  1  1  1

(SP denotes a space.)

Constructing a binary tree
There are two stages in each step:
(1) Combine the last two items on the table.
(2) Adjust the position of the combined item on the table so that the table remains sorted.

Combine:
B 3   L 2   E 2   SP 2   I 1   A 1   T 1   S 1   [N.] 2
Update:
B 3   [N.] 2   L 2   E 2   SP 2   I 1   A 1   T 1   S 1

The remaining steps, showing the table after each update:
B 3   [TS] 2   [N.] 2   L 2   E 2   SP 2   I 1   A 1
B 3   [IA] 2   [TS] 2   [N.] 2   L 2   E 2   SP 2
[E SP] 4   B 3   [IA] 2   [TS] 2   [N.] 2   L 2
[[N.] L] 4   [E SP] 4   B 3   [IA] 2   [TS] 2
[[IA][TS]] 4   [[N.] L] 4   [E SP] 4   B 3
[[E SP] B] 7   [[IA][TS]] 4   [[N.] L] 4
[[[IA][TS]] [[N.] L]] 8   [[E SP] B] 7
[[[[IA][TS]] [[N.] L]] [[E SP] B]] 15

The binary tree

                 [[[[IA][TS]] [[N.] L]] [[E SP] B]]
                /                                  \
   [[[IA][TS]] [[N.] L]]                        [[E SP] B]
      /              \                           /      \
 [[IA][TS]]       [[N.] L]                   [E SP]      B
   /    \          /    \                    /    \
 [IA]   [TS]    [N.]     L                  E     SP
 /  \   /  \    /  \
I    A T    S  N    .

Huffman tree

                 [[[[IA][TS]] [[N.] L]] [[E SP] B]]
               0/                                  \1
   [[[IA][TS]] [[N.] L]]                        [[E SP] B]
     0/              \1                         0/      \1
 [[IA][TS]]       [[N.] L]                   [E SP]      B
  0/    \1        0/    \1                  0/    \1
 [IA]   [TS]    [N.]     L                  E     SP
0/  \1 0/  \1  0/  \1
I    A T    S  N    .

Generate codes

symbol              I     A     T     S     N     .     L    E    SP   B
code                0000  0001  0010  0011  0100  0101  011  100  101  11
frequency           1     1     1     1     1     1     2    2    2    3
code length (bits)  4     4     4     4     4     4     3    3    3    2

Total: 6 × (1 × 4) + 3 × (2 × 3) + 3 × 2 = 48 bits.

Saving percentage
Comparison of Huffman coding with 8-bit ASCII or EBCDIC coding:

Huffman (bits)  ASCII/EBCDIC (bits)  Saving (bits)  Percentage
48              120                  72             60%

120 − 48 = 72
72/120 = 60%

Decoding
Given a Huffman-coded message,
111000100101111000001001000111011100000110110101
what is the message? (See the coding tree.)
Message: BEN BEATS BILL.
Note
• A frequency table for the general (average) case of English text can be used. However, one can always construct the table from the text at hand to achieve a higher saving percentage.
• Huffman codes are not unique. Examples:
a) ‘0’ may be assigned to the right branch and ‘1’ to the left.
b) There are a number of different ways to insert a combined item into the frequency table, which leads to different binary trees.
• We use the canonical minimum-variance code, in which the difference in length among the codewords is kept to a minimum.
• When the alphabet is small, a fixed-length code (of fewer than 8 bits) can also be used to save bits. For example, if the size of the alphabet is not bigger than 32, we can use five bits to code each character. This would give a saving percentage of
(8×32 − 5×32) / (8×32) = 37.5%
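The two saving percentages in these notes can be checked with a few lines of Python. The helper names saving_percentage and fixed_length_bits are illustrative assumptions, and the sketch takes 8 bits per character as the uncompressed size.

```python
def saving_percentage(original_bits, compressed_bits):
    """Percentage of bits saved relative to the original size."""
    return 100 * (original_bits - compressed_bits) / original_bits


def fixed_length_bits(alphabet_size):
    """Smallest number of bits that can distinguish alphabet_size symbols."""
    return max(1, (alphabet_size - 1).bit_length())


# Huffman example: 48 bits instead of 15 x 8 = 120 bits.
print(saving_percentage(120, 48))  # 60.0

# Fixed-length code for an alphabet of 32 symbols, over 32 characters.
n = 32
bits = fixed_length_bits(n)
print(bits)  # 5
print(saving_percentage(8 * n, bits * n))  # 37.5
```

The saving of a fixed-length code does not depend on the number of characters coded, since both sizes scale with it; only the ratio of 5 bits to 8 bits matters.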