Restricted Prefix Byte Codes
Culpepper and Moffat

Outline: Introduction, Related Work (BC, DBC, SCDBC, BHuff), New Method, Experiments, Conclusions

Enhanced byte codes with restricted prefix properties

J. Shane Culpepper (1), Alistair Moffat (2)

1. NICTA Victoria Laboratory, Department of Computer Science and Software Engineering, The University of Melbourne, Victoria 3010, Australia
2. Department of Computer Science and Software Engineering, The University of Melbourne, Victoria 3010, Australia

November 2, 2005


General Approach

Our approach uses strings of bytes to represent the channel alphabet in the compressed text:

• Byte manipulation is faster than bit manipulation;
• Skipping and seeking in byte-aligned files is more efficient than the same operations in bit-aligned files;
• Traditional pattern matching algorithms can be used directly on byte-aligned compressed text.


Static Byte Codes (bc)

The static byte coder is a variable-length radix-128 code that uses bytes as the channel alphabet. It uses a fixed model which assumes the source alphabet exhibits a natural, non-increasing probability ordering.

• The prefix-free property is easily enforced by making each byte of a codeword either a stopper or a continuer byte.
• A byte with a decimal value greater than 127 is a continuer and is always followed by another byte.
• A byte with a value less than 128 is a stopper byte.


Static Byte Codes (bc)

Example (Encoding Byte Codes)

1 → 001
1,000 → 134-104
1,000,000 → 188-131-064

Example (Decoding Byte Codes)

Consider the code 134-104.
To reconstruct the original value:

134 → (134 − 127) = 7
104 → (7 × 128) + 104 = 1,000


Dense Byte Codes (dbc)

The dense byte coder is an extension of the standard byte coder that introduces an alphabet ordering to match symbol ranks with symbol probabilities.

• All mapped symbols appear in the message and each symbol is represented by its rank.
• Dense byte codes require transmission of a prelude for decoding purposes.
• The simple ranking mechanism opens up the possibility of new prelude representations, because the alphabet ordering does not depend on exact symbol probabilities.


(S, C)-Dense Byte Codes (scdbc)

The (S, C)-dense byte coder is an extension of the dense byte coder in which the partitioning value of 128 is made a variable x such that 1 ≤ x ≤ 256.

• The constraint S + C = 256 is added, where S is the number of stoppers, C is the number of continuers, and 256 is the radix.
• The tag bit that identifies each byte is now essentially arithmetically coded, allowing more of each byte to be available for actual “data” bits.


Byte Huffman Codes (bhuff)

The byte Huffman coder is a constrained Huffman coder in which every codeword must have a length that falls on a byte boundary.

• Exactly matches the probability distribution and is minimum-redundancy over all byte codes.
• In practice, the code can be described by the 4-tuple (h1, h2, h3, h4), where hx is the number of x-byte codewords in the code.
• This method is clearly more flexible than the other approaches, but opens up the possibility of false matches in compressed pattern matching.
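
The stopper/continuer rule behind the byte code examples (1,000 → 134-104) and its (S, C) generalisation can be sketched in Python as follows. This is only an illustration: the names `sc_encode` and `sc_decode` are not from the slides, and the byte layout (stoppers 0..S−1, continuers S..255) and digit arithmetic are assumptions chosen so that S = C = 128 reproduces the worked bc examples.

```python
def sc_encode(x, s=128, radix=256):
    """Encode a non-negative integer as continuer bytes followed by one stopper.

    Assumed layout: stoppers are byte values 0..s-1, continuers are s..radix-1.
    """
    c = radix - s                      # number of continuers, so s + c == radix
    if x >= s and c == 0:
        raise ValueError("value needs a continuer byte but c == 0")
    out = [x % s]                      # the final (stopper) byte
    x //= s
    while x > 0:
        x -= 1                         # dense numbering: no codeword is skipped
        out.append(s + x % c)          # a continuer byte
        x //= c
    out.reverse()                      # most significant byte first
    return out


def sc_decode(code, s=128, radix=256):
    """Invert sc_encode for a single complete codeword."""
    c = radix - s
    v = 0
    for b in code[:-1]:                # every byte except the last is a continuer
        v = v * c + (b - s + 1)        # with s == 128 this is (b - 127)
    return v * s + code[-1]


if __name__ == "__main__":
    for value in (1, 1_000, 1_000_000):
        print(value, "->", "-".join(f"{b:03d}" for b in sc_encode(value)))
```

With the default s = 128 the decoder applies the same two steps as the decoding example: 134 − 127 = 7, then 7 × 128 + 104 = 1,000. Passing a different s changes only the split between stoppers and continuers.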
Restricted Prefix Byte Codes (rpbc)

The restricted prefix byte coder is a byte Huffman coder with an additional constraint: the first byte must completely define the suffix length.

• Instead of the 2-tuple (S, C) used by (S, C)-dense codes, a 4-tuple (v1, v2, v3, v4) is defined to completely specify the allowable code lengths.
• The tuple must satisfy $v_1 + v_2 R + v_3 R^2 + v_4 R^3 \geq n$, where R is the radix and n is the cardinality of the source alphabet.
• This still allows considerable flexibility in the assignment of total codespace to different code lengths.


Restricted Prefix Byte Codes (rpbc)

[Figure: code tree for 11 symbols with sorted frequencies 20, 11, 8, 5, 2, 2, 1, 1, 1, 1, 1. v1 = 2: the one-digit codewords 00 and 01. v2 = 1: prefix 10 followed by one digit (00, 01, 10, 11). v3 = 1: prefix 11 followed by two digits (00 00, 00 01, 00 10, 00 11, 01 00, ..., 11 11).]

Example of a restricted prefix code with R = 4, n = 11, and (v1, v2, v3) = (2, 1, 1). The 53 symbols are coded into 160 bits, compared to 144 bits if a bitwise Huffman code is calculated, and 148 bits if a radix-4 Huffman code is calculated. Prelude costs are additional.


Prelude Representation

The permutation approach is the traditional method, in which the complete alphabet is transmitted using a static binary code in decreasing frequency order.

• The length of each static binary code is easily calculated as log nmax, where nmax is the value of the largest symbol.
• If the source alphabet is small, the cost is minimal. For more general applications, the cost can be non-trivial.
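
The way a tuple such as (v1, v2, v3) = (2, 1, 1) carves up the codespace in the R = 4 restricted prefix example can be checked with a short Python sketch. The function name `rpbc_codeword` and the rank-to-codeword arithmetic are illustrative assumptions, not the authors' implementation, and the frequencies are the sorted values read from the example figure.

```python
def rpbc_codeword(rank, v, radix):
    """Return the digits of the codeword for a 0-based symbol rank.

    v[k] is the number of first-digit values reserved for codewords of
    length k + 1, so the first digit alone fixes the codeword length.
    Assumes sum(v) <= radix.
    """
    first = 0                          # first-digit values used by shorter codewords
    for k, vk in enumerate(v):
        block = vk * radix ** k        # number of codewords of length k + 1
        if rank < block:
            prefix, suffix = divmod(rank, radix ** k)
            digits = [first + prefix]
            for j in range(k - 1, -1, -1):
                digits.append(suffix // radix ** j % radix)
            return digits
        rank -= block
        first += vk
    raise ValueError("rank exceeds the capacity of the code")


if __name__ == "__main__":
    freqs = [20, 11, 8, 5, 2, 2, 1, 1, 1, 1, 1]    # sorted symbol frequencies
    v, radix, bits_per_digit = (2, 1, 1), 4, 2

    # Capacity check: v1 + v2*R + v3*R^2 must be at least n.
    capacity = sum(vk * radix ** k for k, vk in enumerate(v))
    print("capacity", capacity, ">= n =", len(freqs))

    total_bits = sum(
        f * len(rpbc_codeword(r, v, radix)) * bits_per_digit
        for r, f in enumerate(freqs)
    )
    print("total bits", total_bits)
```

Under these assumptions the two most frequent symbols take one digit each, the next four take two digits, and the remaining five take three digits, which gives the 160-bit total quoted for the example.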
Prelude Representation

The bitvector approach is based on a key observation: to rebuild the original mapping, the decoder only needs to know whether each alphabet symbol appears and, if so, its corresponding codeword length.

• A bitvector over the symbols 0 ... nmax is constructed, in which a 1 means the symbol appears and a 0 means it does not.
• If the symbol appears, an additional two bits can be used to encode the codeword length.


Prelude Representation

The gap approach is another alternative, which eliminates bit operations and attempts to exploit clustering in alphabet densities.

• Only four codeword lengths are possible, so the alphabet can be partitioned into four subsets.
• Each subset is then sorted and encoded as a sequence of d-gaps.
• Since the emphasis is on easily decodable streams, the d-gaps are encoded with the standard byte coder.


Semi-Dense Prelude Representation

In the semi-dense approach, a subset of symbols that are “interesting” because of their prevalence is assigned a second codeword, in addition to the normal “dense” codeword.

• The cutoff could be a fixed value such as 1,000.
• The cutoff could be all symbols above a particular frequency threshold.
• The cutoff could be all symbols in the one or two byte range.
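
The gap representation described above (partition the alphabet by codeword length, sort each subset, store d-gaps) can be sketched in Python as follows. The function names and the symbol-to-length mapping are hypothetical, and measuring the first gap from −1 is an assumed convention; in a real prelude each gap would then be written with the standard byte coder.

```python
def gap_prelude(lengths):
    """Partition symbols by codeword length (1..4) and convert each
    sorted subset into a list of d-gaps.

    `lengths` maps symbol number -> codeword length in bytes.
    """
    prelude = []
    for length in (1, 2, 3, 4):
        subset = sorted(s for s, l in lengths.items() if l == length)
        gaps, prev = [], -1            # assumed convention: first gap is from -1
        for s in subset:
            gaps.append(s - prev)      # every gap is therefore >= 1
            prev = s
        prelude.append(gaps)
    return prelude


def rebuild_lengths(prelude):
    """Recover the symbol -> length mapping from the four gap lists."""
    lengths = {}
    for length, gaps in enumerate(prelude, start=1):
        s = -1
        for g in gaps:
            s += g                     # running sum undoes the differencing
            lengths[s] = length
    return lengths


if __name__ == "__main__":
    # Hypothetical mapping: 11 symbols drawn from the range 0..14.
    lengths = {0: 1, 4: 1, 3: 2, 7: 2, 12: 2, 14: 2,
               2: 3, 5: 3, 8: 3, 11: 3, 13: 3}
    prelude = gap_prelude(lengths)
    print(prelude)
    print(rebuild_lengths(prelude) == lengths)
```

Because symbols that share a codeword length tend to cluster, the gaps are typically small numbers, which is what makes a byte code a reasonable fit for them.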
Semi-Dense Prelude Representation

[Figure: symbols 0 to 14 with frequencies 20, 0, 1, 8, 11, 1, 0, 5, 1, 0, 0, 1, 2, 1, 2, and a threshold at t = 4. Code tree: v1 = 3 (one-digit codewords 00, 01, 10) and v3 = 1 (prefix 11 followed by two digits: 00 00, 00 01, 00 10, 00 11, 01 00, ...).]

Example of a semi-dense restricted prefix code with R = 4, nmax = 14, and a threshold at t = 4. The modified prelude contains only four symbols.


Test Data

File name                     wsj267.seq   wsj267.repair
Total symbols                 58,421,983      19,254,349
Maximum value                    222,577         320,016
Self-information (bits/sym)        10.58           17.63

Parameters of two test files. The file wsj267.seq was generated using a spaceless words model. The file wsj267.repair represents phrase numbers from a recursive byte-pair parser.


Codeword Costs

File             bc      dbc     scdbc   rpbc
wsj267.seq       16.29   12.13   11.88   11.76
wsj267.repair    22.97   19.91   19.90   18.27

Average codeword cost for different byte coding methods. Values listed are in terms of bits per source symbol, excluding any necessary prelude components.


Prelude Costs

File             permutation   bit-vector   gaps   semi-dense
wsj267.seq       0.59          0.22         0.27   0.13+0.01
wsj267.repair    4.44          0.78         1.87   0.49+0.02

Average prelude cost for four different representations. Values listed represent the total cost of all of the block preludes, expressed in terms of bits per source symbol.
Decoding Efficiency

Method           bc       dbc     scdbc   rpbc    rpbc
Prelude          (none)   dense   dense   dense   semi-dense
wsj267.seq       59       24      24      36      43
wsj267.repair    49       9       9       12      30

Decoding speed on a 2.8 GHz Intel Xeon with 2 GB of RAM, in millions of symbols per second, for complete compressed messages including a prelude in each message block.


In Summary

• Restricted prefix byte codes provide a flexible approach to modelling diverse probability distributions without losing the attractive properties of current byte coding approaches.
• Semi-dense prelude representations are a practical alternative to traditional prelude representations.
• In combination, these two improvements provide significant enhancements to decoding efficiency, and give improved compression effectiveness.