Enhanced byte codes with restricted prefix properties

J. Shane Culpepper (1)
Alistair Moffat (2)

1. NICTA Victoria Laboratory,
Department of Computer Science and Software Engineering,
The University of Melbourne, Victoria 3010, Australia
2. Department of Computer Science and Software Engineering
The University of Melbourne, Victoria 3010, Australia
November 2, 2005

General Approach

Our approach uses strings of bytes to represent the channel alphabet in the compressed text, for three reasons:

• Byte manipulation is faster than bit manipulation;
• Skipping and seeking in byte-aligned files is more efficient than the same operations in bit-aligned files;
• Traditional pattern matching algorithms can be used directly on byte-aligned compressed text.

Static Byte Codes (bc)

The static byte coder is a variable-length radix-128 code that uses bytes as the channel alphabet. It uses a fixed model, which assumes that the source alphabet exhibits a natural, non-increasing probability ordering.

• The prefix-free property is easily enforced by making each byte of a codeword either a stopper or a continuer byte.
• A byte with a decimal value greater than 127 is a continuer and is always followed by another byte.
• A byte with a value less than 128 is a stopper byte.

Static Byte Codes (bc)

Example (Encoding Byte Codes)

        1 → 001
    1,000 → 134-104
1,000,000 → 188-131-064

Example (Decoding Byte Codes)

Consider the code 134-104. To reconstruct the original value:

134 → (134 − 127) = 7
104 → (7 × 128) + 104 = 1,000
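
The two examples above can be reproduced with a short Python sketch. It assumes, as the decoding example suggests, that a continuer byte b contributes the digit b − 127 (so continuers carry the digits 1 to 128) and that a stopper contributes its own value. The names bc_encode and bc_decode are illustrative and are not taken from the authors' implementation.

```python
def bc_encode(x):
    """Encode a non-negative integer as continuer bytes followed by one stopper."""
    if x < 0:
        raise ValueError("value must be non-negative")
    out = [x % 128]            # last byte is the stopper, in 0..127
    x //= 128
    while x > 0:
        x -= 1                 # continuer digits run 1..128, stored as 128..255
        out.append(128 + x % 128)
        x //= 128
    out.reverse()              # most significant byte first
    return out


def bc_decode(code):
    """Invert bc_encode for the bytes of a single codeword."""
    x = 0
    for b in code:
        if b > 127:            # continuer
            x = x * 128 + (b - 127)
        else:                  # stopper
            x = x * 128 + b
    return x


print(bc_encode(1000))         # [134, 104]
print(bc_encode(1000000))      # [188, 131, 64]
print(bc_decode([134, 104]))   # 1000
```

Under this reading every byte string that ends in a stopper decodes to a distinct value, so no codespace is wasted on leading zero digits.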

Dense Byte Codes (dbc)

The dense byte coder is an extension of the standard byte coder, which introduces an alphabet ordering to match symbol ranks with symbol probabilities.

• All mapped symbols appear in the message and each symbol is represented by its rank.
• Dense byte codes require transmission of a prelude for decoding purposes.
• The simple ranking mechanism opens up the possibility of new prelude representations because the alphabet ordering is not dependent on exact symbol probabilities.
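
A minimal sketch of the ranking step follows, using a made-up six-symbol message. Breaking frequency ties by symbol value is an assumption made here so that the output is deterministic; the ranks would then be passed to a byte coder, which is omitted.

```python
from collections import Counter


def build_ranking(message):
    """Return (prelude, rank_of): symbols in decreasing frequency order and the inverse map."""
    counts = Counter(message)
    # Assumption: ties are broken by symbol value.
    prelude = sorted(counts, key=lambda s: (-counts[s], s))
    rank_of = {s: r for r, s in enumerate(prelude)}
    return prelude, rank_of


message = [7, 3, 7, 9, 7, 3]                  # hypothetical source symbols
prelude, rank_of = build_ranking(message)
ranks = [rank_of[s] for s in message]

print(prelude)                                # [7, 3, 9]
print(ranks)                                  # [0, 1, 0, 2, 0, 1]
print([prelude[r] for r in ranks] == message) # True
```

The decoder needs only the prelude list, that is, the order of the symbols, and not their frequencies.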

(S, C)-Dense Byte Codes (scdbc)

The (S, C)-dense byte coder is an extension of the dense byte coder, where the partitioning value of 128 is made a variable x such that 1 ≤ x ≤ 256.

• The constraint S + C = 256 is imposed, where S is the number of stoppers, C is the number of continuers, and 256 is the radix.
• The tag bit that identifies each byte is essentially arithmetically coded now, allowing more of each byte to be available for actual “data” bits.
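
The following sketch shows one way to map ranks to (S, C)-dense codewords. It assumes that stoppers take the byte values 0 to S − 1 and continuers the values S to 255, which reduces to the static byte code when S = 128. The function names are illustrative.

```python
def sc_encode(rank, s):
    """Encode a 0-based rank using s stoppers and c = 256 - s continuers."""
    if not 1 <= s <= 256:
        raise ValueError("s must satisfy 1 <= s <= 256")
    if rank < 0:
        raise ValueError("rank must be non-negative")
    c = 256 - s
    out = [rank % s]               # stopper byte, in 0..s-1
    x = rank // s
    while x > 0:
        if c == 0:
            raise ValueError("rank needs a continuer but c = 0")
        x -= 1
        out.append(s + x % c)      # continuer byte, in s..255
        x //= c
    out.reverse()
    return out


def sc_decode(code, s):
    """Invert sc_encode for the bytes of a single codeword."""
    c = 256 - s
    x = 0
    for b in code:
        if b >= s:                 # continuer
            x = x * c + (b - s + 1)
        else:                      # stopper
            x = x * s + b
    return x


print(sc_encode(1000, 128))        # [134, 104]
print(sc_encode(150, 200))         # [150]
print(sc_encode(150, 128))         # [128, 22]
```

With S = 200 there are 200 one-byte codewords instead of 128, at the price of only 56 × 200 = 11,200 two-byte codewords instead of 128 × 128 = 16,384; the choice of S trades one against the other.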

Byte Huffman Codes (bhuff)

The byte Huffman coder is a constrained Huffman coder, where all codewords must have a length which falls on a byte boundary.

• Exactly matches the probability distribution and is minimum-redundancy over all byte codes.
• In reality, the code can be described by the 4-tuple (h1, h2, h3, h4), where hx is the number of x-byte codewords in the code.
• This method is clearly more flexible than other approaches, but opens up the possibility of false matches in compressed pattern matching.
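
As a sketch of how the 4-tuple arises, the code below builds a radix-R Huffman code in the textbook way (zero-weight dummy symbols are added so that every merge combines exactly R nodes) and counts the codewords of each length. The 300-symbol alphabet is made up for illustration, and the function names are hypothetical.

```python
import heapq
from itertools import count


def radix_huffman_lengths(freqs, radix):
    """Codeword length, in radix-`radix` digits, for each entry of freqs."""
    n = len(freqs)
    if n == 1:
        return [1]
    tie = count()                       # unique tiebreaker for the heap
    heap = [(f, next(tie), [i]) for i, f in enumerate(freqs)]
    while (len(heap) - 1) % (radix - 1) != 0:
        heap.append((0, next(tie), []))  # zero-weight dummy symbol
    heapq.heapify(heap)
    lengths = [0] * n
    while len(heap) > 1:
        weight, leaves = 0, []
        for _ in range(radix):          # merge the `radix` lightest nodes
            w, _t, ls = heapq.heappop(heap)
            weight += w
            leaves += ls
        for i in leaves:                # every leaf below gains one digit
            lengths[i] += 1
        heapq.heappush(heap, (weight, next(tie), leaves))
    return lengths


def length_tuple(lengths, max_len=4):
    """(h1, ..., h_max_len); assumes no codeword is longer than max_len."""
    h = [0] * max_len
    for length in lengths:
        h[length - 1] += 1
    return tuple(h)


freqs = list(range(300, 0, -1))         # hypothetical frequencies 300, 299, ..., 1
print(length_tuple(radix_huffman_lengths(freqs, 256)))   # (255, 45, 0, 0)
```

Here 255 symbols receive one-byte codewords and the 45 least frequent symbols share the remaining first-byte value as two-byte codewords.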

Restricted Prefix Byte Codes (rpbc)

The restricted prefix byte coder is a byte Huffman coder with an additional constraint: the first byte must completely define the suffix length.

• Instead of using the 2-tuple (S, C) as (S, C)-dense codes use, a 4-tuple (v1, v2, v3, v4) is defined to completely specify allowable code lengths.
• Also, v1 + v2·R + v3·R² + v4·R³ ≥ n, where R is the radix and n is the cardinality of the source alphabet, must be satisfied.
• This still allows considerable flexibility in assignment of total codespace to different code lengths.
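
A sketch of the rank-to-codeword mapping implied by the 4-tuple follows. It assumes that the first v1 values of the leading byte are complete one-byte codewords, the next v2 values introduce two-byte codewords, and so on, and that v1 + v2 + v3 + v4 ≤ R; this last condition is not stated on the slide but is needed for the leading byte to select the length. Function names are illustrative.

```python
def rpbc_encode(rank, v, radix=256):
    """Map a 0-based rank to a list of radix-`radix` digits under the tuple v."""
    if sum(v) > radix:
        raise ValueError("v1 + v2 + ... must not exceed the radix")
    first = 0                                # first-digit value opening this length class
    for extra, n_prefixes in enumerate(v):   # extra = number of suffix digits
        block = radix ** extra
        capacity = n_prefixes * block
        if rank < capacity:
            digits = [first + rank // block]
            suffix = rank % block
            for k in range(extra - 1, -1, -1):
                digits.append((suffix // radix ** k) % radix)
            return digits
        rank -= capacity
        first += n_prefixes
    raise ValueError("rank exceeds the capacity of the code")


def rpbc_decode(digits, v, radix=256):
    """Invert rpbc_encode; the first digit alone determines the codeword length."""
    d0, first, base = digits[0], 0, 0
    for extra, n_prefixes in enumerate(v):
        if d0 < first + n_prefixes:
            if len(digits) != extra + 1:
                raise ValueError("wrong number of suffix digits")
            suffix = 0
            for d in digits[1:]:
                suffix = suffix * radix + d
            return base + (d0 - first) * radix ** extra + suffix
        base += n_prefixes * radix ** extra
        first += n_prefixes
    raise ValueError("first digit is outside every length class")


# Radix-4 illustration with (v1, v2, v3) = (2, 1, 1): capacity 2 + 4 + 16 = 22.
print(rpbc_encode(1, (2, 1, 1), radix=4))    # [1]
print(rpbc_encode(5, (2, 1, 1), radix=4))    # [2, 3]
print(rpbc_encode(10, (2, 1, 1), radix=4))   # [3, 1, 0]
```

Because the decoder learns the codeword length from the first digit alone, it can fetch the suffix in one step rather than testing each byte for a stopper.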

Restricted Prefix Byte Codes (rpbc)

Rank     Frequency   Codeword    Length class
0        20          00          v1 = 2 (single digit)
1        11          01
2        8           10 00       v2 = 1 (10 prefix followed by one digit)
3        5           10 01
4        2           10 10
5        2           10 11
6        1           11 00 00    v3 = 1 (11 prefix followed by two digits)
7        1           11 00 01
8        1           11 00 10
9        1           11 00 11
10       1           11 01 00
...                  .....
21       (unused)    11 11 11

Example of a restricted prefix code with R = 4, n = 11, and (v1, v2, v3) = (2, 1, 1). The 53 symbols are coded into 160 bits, compared to 144 bits if a bitwise Huffman code is calculated, and 148 bits if a radix-4 Huffman code is calculated. Prelude costs are additional.
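
The three bit counts quoted in the caption can be checked with a few lines of Python. This is a sketch that uses the standard result that the cost of a Huffman code equals the sum of the weights created by its merges; the function names are illustrative.

```python
import heapq

FREQS = [20, 11, 8, 5, 2, 2, 1, 1, 1, 1, 1]    # the 11 symbol frequencies, 53 in total


def rpbc_cost_bits(freqs, v, radix, bits_per_digit):
    """Total cost when the most frequent symbols take the shortest codewords."""
    ordered = sorted(freqs, reverse=True)
    total, pos = 0, 0
    for extra, n_prefixes in enumerate(v):
        capacity = n_prefixes * radix ** extra
        total += sum(ordered[pos:pos + capacity]) * (extra + 1) * bits_per_digit
        pos += capacity
    if pos < len(ordered):
        raise ValueError("code capacity is smaller than the alphabet")
    return total


def huffman_cost_digits(freqs, radix):
    """Cost in radix-`radix` digits of a Huffman code, as the sum of merged weights."""
    heap = list(freqs)
    while (len(heap) - 1) % (radix - 1) != 0:
        heap.append(0)                          # zero-weight dummy symbols
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        merged = sum(heapq.heappop(heap) for _ in range(radix))
        total += merged
        heapq.heappush(heap, merged)
    return total


print(rpbc_cost_bits(FREQS, (2, 1, 1), radix=4, bits_per_digit=2))   # 160
print(huffman_cost_digits(FREQS, radix=2))                            # 144
print(huffman_cost_digits(FREQS, radix=4) * 2)                        # 148
```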

Prelude Representation

The permutation approach is the traditional method, in which the complete alphabet is transmitted in decreasing frequency order using a static binary code.

• The length of each static binary code is easily calculated as ⌈log2 nmax⌉ bits, where nmax is the value of the largest symbol.
• If the source alphabet is small, the cost is minimal. For more general applications, the cost can be non-trivial.
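
A small sketch of the permutation prelude follows. It takes the codeword width to be the number of bits needed to write nmax in binary, which agrees with the logarithmic width on the slide except when nmax is an exact power of two. The symbol order shown is an illustrative 11-symbol alphabet with nmax = 14, and bit strings stand in for a real bit-packed buffer.

```python
def pack_permutation(symbols_by_rank, nmax):
    """Write each symbol number as a fixed-width binary field, in rank order."""
    width = max(1, nmax.bit_length())
    return "".join(format(s, f"0{width}b") for s in symbols_by_rank)


def unpack_permutation(bits, nmax):
    """Recover the rank-ordered symbol list from the packed bit string."""
    width = max(1, nmax.bit_length())
    return [int(bits[i:i + width], 2) for i in range(0, len(bits), width)]


# Hypothetical alphabet, already sorted by decreasing frequency.
order = [0, 4, 3, 7, 12, 14, 2, 5, 8, 11, 13]
bits = pack_permutation(order, nmax=14)

print(len(bits))                                   # 44
print(unpack_permutation(bits, nmax=14) == order)  # True
```

The cost grows with the number of distinct symbols in each block, which is why it becomes significant for large alphabets.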

Prelude Representation

The bitvector approach is based on a key observation. The only requirement to rebuild the original mapping is to know that a particular alphabet symbol appears, and its corresponding codeword length.

• A bitvector from 0 . . . nmax, where a 1 means the symbol appears and 0 means it does not appear, is constructed.
• If the symbol appears, an additional two bits can be used to encode the symbol length.
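
One possible layout for such a prelude is sketched below. It assumes that the two length bits immediately follow each 1 bit and store the codeword length minus one; the slide does not say whether the length bits are interleaved or sent separately. The symbol-to-length table is illustrative.

```python
def bitvector_prelude(length_of, nmax):
    """length_of maps each symbol that appears to its codeword length (1..4)."""
    parts = []
    for s in range(nmax + 1):
        if s in length_of:
            parts.append("1" + format(length_of[s] - 1, "02b"))
        else:
            parts.append("0")
    return "".join(parts)


def parse_bitvector_prelude(bits, nmax):
    """Recover the symbol-to-length map from the bit string."""
    length_of, pos = {}, 0
    for s in range(nmax + 1):
        if bits[pos] == "1":
            length_of[s] = int(bits[pos + 1:pos + 3], 2) + 1
            pos += 3
        else:
            pos += 1
    return length_of


# Hypothetical alphabet: 11 of the symbols 0..14 appear.
LENGTHS = {0: 1, 4: 1, 3: 2, 7: 2, 12: 2, 14: 2, 2: 3, 5: 3, 8: 3, 11: 3, 13: 3}
bits = bitvector_prelude(LENGTHS, nmax=14)

print(len(bits))                                       # 37
print(parse_bitvector_prelude(bits, 14) == LENGTHS)    # True
```

The cost is nmax + 1 bits plus two bits per symbol that appears, here 15 + 22 = 37 bits.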

Prelude Representation

The gap approach is another alternative which eliminates bit operations and attempts to exploit clustering in alphabet densities.

• Only four possible codeword lengths are allowed so the alphabet can be partitioned into four subsets.
• Each individual subset is then sorted and encoded as a sequence of d-gaps.
• Since the emphasis is on easily decodable streams, the d-gaps are encoded with the standard byte coder.
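
The steps above can be sketched as follows. The layout is an assumption: each subset is preceded by its size, gaps are measured from an initial value of −1 so that every gap is at least 1, and the byte coder treats values above 127 as continuers carrying the digit b − 127. All names and the example table are illustrative.

```python
def bc_encode(x):
    """Static byte code: continuer bytes (128..255) followed by one stopper (0..127)."""
    out = [x % 128]
    x //= 128
    while x > 0:
        x -= 1
        out.append(128 + x % 128)
        x //= 128
    out.reverse()
    return out


def bc_read(data, pos):
    """Decode one byte-coded integer starting at data[pos]; return (value, next_pos)."""
    x = 0
    while True:
        b = data[pos]
        pos += 1
        if b > 127:
            x = x * 128 + (b - 127)
        else:
            return x * 128 + b, pos


def gap_prelude(length_of):
    """Emit, for each codeword length 1..4, the subset size and its sorted d-gaps."""
    out = bytearray()
    for length in (1, 2, 3, 4):
        subset = sorted(s for s, l in length_of.items() if l == length)
        out += bytes(bc_encode(len(subset)))
        prev = -1
        for s in subset:
            out += bytes(bc_encode(s - prev))   # d-gap to the previous symbol
            prev = s
    return bytes(out)


def parse_gap_prelude(data):
    """Rebuild the symbol-to-length map from the byte stream."""
    length_of, pos = {}, 0
    for length in (1, 2, 3, 4):
        n, pos = bc_read(data, pos)
        prev = -1
        for _ in range(n):
            gap, pos = bc_read(data, pos)
            prev += gap
            length_of[prev] = length
    return length_of


LENGTHS = {0: 1, 4: 1, 3: 2, 7: 2, 12: 2, 14: 2, 2: 3, 5: 3, 8: 3, 11: 3, 13: 3}
data = gap_prelude(LENGTHS)
print(len(data))                             # 15
print(parse_gap_prelude(data) == LENGTHS)    # True
```

When the symbols of one length class are clustered, most gaps fit in a single byte and the decoder never touches individual bits.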

Semi-Dense Prelude Representation

In the semi-dense approach, a subset of symbols that are “interesting” because of their prevalence is assigned a second codeword, in addition to the normal “dense” codeword.

• The cutoff could be a fixed value such as 1,000.
• The cutoff could be all symbols above a particular frequency threshold.
• The cutoff could be all symbols in the one or two byte range.

Semi-Dense Prelude Representation

[Figure: construction of a semi-dense code. The t = 4 most frequent symbols (frequencies 20, 11, 8, 5) are promoted to the front of the code; the remaining symbols keep their symbol-number order, with the vacated positions of promoted symbols shown as (0). Codewords are assigned with v1 = 3 (00, 01, 10) and v3 = 1 (11 prefix followed by two digits).]

Symbol       0   1   2   3   4   5   6   7   8   9   10  11  12  13  14
Frequency    20  0   1   8   11  1   0   5   1   0   0   1   2   1   2

Example of a semi-dense restricted prefix code with R = 4, nmax = 14, and a threshold at t = 4. The modified prelude contains only four symbols.
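
A sketch of the symbol-to-codeword-slot mapping in this example follows. It assumes that promoted symbols occupy slots 0 to t − 1 and that every other symbol s uses slot t + s; the exact offset used in the figure may differ, and the function names are illustrative.

```python
FREQ = [20, 0, 1, 8, 11, 1, 0, 5, 1, 0, 0, 1, 2, 1, 2]   # frequency of symbols 0..14


def build_semi_dense(freq_by_symbol, t):
    """Pick the t most frequent symbols; they form the whole prelude."""
    by_freq = sorted(range(len(freq_by_symbol)),
                     key=lambda s: (-freq_by_symbol[s], s))
    promoted = by_freq[:t]
    slot_of = {s: i for i, s in enumerate(promoted)}
    return promoted, slot_of


def to_slot(symbol, slot_of, t):
    """Promoted symbols use their short slot; all others use the dense slot t + symbol."""
    return slot_of.get(symbol, t + symbol)


def from_slot(slot, promoted):
    """Decoder side: only the promoted list is needed."""
    t = len(promoted)
    return promoted[slot] if slot < t else slot - t


promoted, slot_of = build_semi_dense(FREQ, t=4)
print(promoted)                         # [0, 4, 3, 7]
print(to_slot(4, slot_of, 4))           # 1
print(to_slot(12, slot_of, 4))          # 16
print(from_slot(16, promoted))          # 12
```

The decoder resolves every slot at or beyond t by arithmetic alone, so the prelude lists just the four promoted symbols.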

Test Data

File name        Total symbols   Maximum value   Self-information (bits/sym)
wsj267.seq       58,421,983      222,577         10.58
wsj267.repair    19,254,349      320,016         17.63

Parameters of the two test files. The file wsj267.seq was generated using a spaceless words model. The file wsj267.repair represents phrase numbers from a recursive byte-pair parser.

Codeword Costs

                 Method
File             bc      dbc     scdbc   rpbc
wsj267.seq       16.29   12.13   11.88   11.76
wsj267.repair    22.97   19.91   19.90   18.27

Average codeword cost for different byte coding methods. Values listed are in terms of bits per source symbol, excluding any necessary prelude components.

Prelude Costs

                 Prelude representation
File             permutation   bit-vector   gaps   semi-dense
wsj267.seq       0.59          0.22         0.27   0.13+0.01
wsj267.repair    4.44          0.78         1.87   0.49+0.02

Average prelude cost for four different representations. Values listed represent the total cost of all of the block preludes, expressed in terms of bits per source symbol.

Decoding Efficiency

Method           bc       dbc     scdbc   rpbc    rpbc
Prelude          (none)   dense   dense   dense   semi-dense
wsj267.seq       59       24      24      36      43
wsj267.repair    49       9       9       12      30

Decoding speed on a 2.8 GHz Intel Xeon with 2 GB of RAM, in millions of symbols per second, for complete compressed messages including a prelude in each message block.

In Summary

• Restricted prefix byte codes provide a flexible approach to modelling diverse probability distributions without losing the attractive properties of current byte coding approaches.
• Semi-dense prelude representations are a practical alternative to traditional prelude representations.
• In combination, these two improvements provide significant enhancements to decoding efficiency, and give improved compression effectiveness.