Czech Technical University in Prague
Faculty of Information Technology
Department of Theoretical Computer Science
Natural Language Compression using Byte Codes
by
Petr PROCHÁZKA
A thesis submitted to
the Faculty of Information Technology, Czech Technical University in Prague,
in partial fulfilment of the requirements for the degree of Doctor.
PhD programme: Informatics
Prague, March 2014
Thesis Supervisor:
Doc. Ing. Jan HOLUB, Ph.D.
Department of Theoretical Computer Science
Faculty of Information Technology
Czech Technical University in Prague
Thákurova 9
160 00 Prague 6
Czech Republic
Copyright © 2014 Petr PROCHÁZKA
Abstract and contributions
The so-called word-based approach is nowadays very often applied in natural language
compression. Recently, numerous compression methods started to combine the word-based
approach with so-called byte codes, where the codewords are represented as sequences
of bytes. This brought surprisingly interesting application possibilities. The byte codes
proved to be very efficient in compression and decompression and also in searching on the
compressed text. Probably the largest family of the byte codes is the family of dense codes.
In this thesis, we focus on further exploring the application possibilities of dense codes
in natural language compression. First, we propose a generalized concept of dense coding
called Open Dense Code (ODC), which aims to be a frame for the definition of many other
dense code schemes. Using the frame of ODC, we present two new word-based adaptive
compression methods based on the dense coding idea: Two-Byte Dense Code (TBDC) and
Self-Tuning Dense Code (STDC). Our compression methods improve the compression ratio
and are considerate to smaller files, which are very often neglected.
Next, we present a semi-adaptive version of TBDC: Semi-adaptive Two-Byte Dense Code
(STBDC), which is, due to its limited coding space, implicitly block-oriented. We show that
STBDC has some interesting properties which could be applied in digital libraries and
other textual databases. The compression method allows direct searching on compressed
text. Moreover, the vocabulary can be used as a block index which makes some kinds of
searching very fast. Another property is that the compressor can send single blocks of the
compressed text with only the corresponding part of the vocabulary, which is considerate
of limited bandwidth. In addition, the compressed file can be continuously extended without
any need for previous decompression.
Our next result is a modification of STBDC for compression of a large set of small text
files. We call this variant Set-of-Files Semi-adaptive Two-Byte Dense Code (SF-STBDC ).
SF-STBDC exploits the fact that the files in the set share common words. Every single
file is compressed using its own model, which ensures the best possible effectiveness.
However, SF-STBDC does not store the model of every file but only its changes in
comparison with a global model, which is a statistical model of all unique words contained
in the whole set of files. The compression method allows random access to the compressed
data and also direct searching on the compressed text. Thanks to its high search speed
and decompression speed, SF-STBDC represents an interesting alternative for compression
of text files used by contemporary web search engines.
Keywords:
Byte Codes, Dense Codes, Natural Language Compression, Word-based Compression,
Self-indexing, Block Indexes, Integer Coding.
Acknowledgements
First of all, I would like to express my gratitude to my dissertation thesis supervisor,
Dr. Jan Holub. He has been a constant source of encouragement and insight during my
research and helped me with numerous problems and professional advancements.
My research has also been partially supported by the following grants:
• Czech Science Foundation (GAČR) as project No. GA-13-03253S,
• Czech Science Foundation (GAČR) as project No. 201/09/0807,
• Ministry of Education, Youth and Sports under research program MSM 6840770014,
• Czech Technical University in Prague as project No. SGS10/306/OHK3/3T/18.
Finally, my greatest thanks go to my girlfriend and family members, for their infinite
patience, care and invaluable help during the writing of this thesis.
Dedication
To my beloved family for their patience with my studies.
Contents
List of Figures

List of Tables

1 Introduction
  1.1 Motivation
  1.2 Problem Statement
  1.3 Related Work/Previous Results
  1.4 Contributions of the Thesis
  1.5 Structure of the Thesis

2 Background and State-of-the-Art
  2.1 Basic Notions
    2.1.1 Stringology
    2.1.2 Data compression
    2.1.3 Information Theory
  2.2 Linguistic Basics
  2.3 Byte-oriented Huffman Code
  2.4 Variable Byte Codes
  2.5 End-Tagged Dense Code
    2.5.1 Searching on ETDC
    2.5.2 Adaptive version of ETDC
  2.6 (s,c)-Dense Code
    2.6.1 Adaptive version of SCDC
  2.7 Dynamic Lightweight End-Tagged Dense Code
  2.8 Restricted Prefix Byte Codes
  2.9 Variable-to-Variable Dense Code
  2.10 Boosting Text Compression with Byte Codes

3 Novel Approaches to Dense Coding

4 Contribution to Dense Coding
  4.1 Open Dense Code
    4.1.1 Two-Byte Dense Code
    4.1.2 Self-Tuning Dense Code
    4.1.3 Experiments
      4.1.3.1 Compression ratio
      4.1.3.2 Compression and decompression speed
  4.2 Semi-Adaptive Two-Byte Dense Code
    4.2.1 Code
    4.2.2 Model
    4.2.3 Algorithms and Data Structures
    4.2.4 Optimal Number of Stoppers and Continuers
    4.2.5 Searching on Compressed Text
    4.2.6 Modifications of the Compressed Text
    4.2.7 Experiments
      4.2.7.1 Compression Ratio
      4.2.7.2 Compression and Decompression Time
    4.2.8 Searching on Compressed Text
  4.3 Set-of-Files Semi-Adaptive Two-Byte Dense Code
    4.3.1 STBDC Modification for Compression of a Set of Files
      4.3.1.1 SF-STBDC variant 1
      4.3.1.2 SF-STBDC variant 2
      4.3.1.3 SF-STBDC variant 3
    4.3.2 Searching on the Compressed Text
    4.3.3 Experiments

5 Conclusions
  5.1 Summary
  5.2 Contributions of the Thesis
  5.3 Future Work

Bibliography

Publications of the Author
List of Figures
2.1  Heaps’ Law: Dependency of the number of unique words on the number of processed words (size of the text). The processed text is a Bible written in different languages (English, Czech, Hungarian, Danish, Swedish).
2.2  Zipf’s Law: Dependency of frequency fi of a word wi on its rank ri. The processed text is a Bible written in different languages. The graphs only depict the top 200 words.
2.3  False positive using a pattern-matching algorithm in Plain Huffman Code stream. The false positive is a substring of two following codewords corresponding to words “the” and “rose”.
2.4  False positive of the searched pattern “moose” using a pattern-matching algorithm in the ETDC data stream.
2.5  Dictionary data structure.
2.6  Dictionary: before and after UpdateDict algorithm was performed.
3.1  STBDC: the compression process.
3.2  SF-STBDC: the compression process.
4.1  STDC: Evolution of various parameters as the number of words grows.
4.2  Compression ratio / compression speed trade-off.
4.3  Compression ratio / decompression speed trade-off.
4.4  The number of unique words (total and for single blocks) depending on the number of processed words of the gut8 file.
4.5  The compressor evaluating the changes between two blocks.
4.6  The dependency of the STBDC compression ratio on the value of c = 256 − s.
4.7  Comparison of compression ratio in the case that only certain blocks are required by the client.
4.8  SF-STBDC variant 1: Evaluation of the swaps.
4.9  SF-STBDC: Shifts of words between single blocks depicted as a network flow problem.
4.10 SF-STBDC variant 1: Completion of arrays from and to.
4.11 SF-STBDC variant 3: Storage of the swaps.
4.12 SF-STBDC variant 3: Evaluation of the swaps.
4.13 FPMA and BPMA: Searching example.
4.14 Dependency of the compression ratio on the number of continuers c for different tested files. Each curve starts with the optimal value cmin and continues with incrementing the number of continuers c.
4.15 Searching algorithms: Number of comparisons. Algorithms FPMA and BPMA proposed in Section 4.3.2 are performed on the compressed text using the word-based approach. The algorithm MBMH is a multi-pattern variant of the classical Boyer-Moore-Horspool algorithm [36]. All algorithms search for the same set of randomly chosen patterns.
4.16 General comparison (Gutenberg text corpus): Trade-off between the compression ratio and the normalized speed. The speed values are in logarithmic scale. The linked points represent the same algorithm performed on different block sizes.
List of Tables
2.1  Empirical k-order entropy extracted from a Bible of different languages using spaceless word-based modelling.
2.2  Empirical k-order entropy extracted from a Bible of different languages using character-based modelling.
2.3  Representation of selected integers in the range 1 – 30 as uncompressed eight-bit integers, Elias gamma codes, Elias delta codes, Golomb codes with k = 3 and k = 10, and variable-byte integers.
2.4  Comparison of ETDC and THC for b = 3.
2.5  End-Tagged Dense Coding.
2.6  ETDC, SCDC codewords.
2.7  SCDC: encoding process of Example 2.6.2 (s = 132 and c = 124).
2.8  SCDC: decoding process of Example 2.6.2 (s = 132 and c = 124).
4.1  ETDC, SCDC: set of rules P.
4.2  DSCDC, TBDC codewords.
4.3  TBDC: set of rules P.
4.4  STDC: set of rules P.
4.5  Tested algorithms.
4.6  Tested files.
4.7  Compression ratio in %.
4.8  Compression speed in MB/s.
4.9  Decompression speed in MB/s.
4.10 Tested files.
4.11 Compression ratio in %.
4.12 Compression time in seconds.
4.13 Decompression time in seconds.
4.14 Search time in seconds.
4.15 Compression ratio, the size of the compressed file and the vocabulary file in the left part. Search time in milliseconds and the time improvement in comparison to MBMH in the right part of the table.
4.16 General comparison (Gutenberg text corpus): compression ratio, snippet speed and decompression speed. Normalized snippet and decompression speed are computed for methods without random access to compressed data using the “back of envelope” calculation [29]. Both speeds are stated in megabytes per second where 1 MB = 2^20 B of the original (uncompressed) file.
4.17 General comparison (WT2g text corpus): compression ratio, snippet speed and decompression speed. Normalized snippet and decompression speed are computed for methods without random access to compressed data using the “back of envelope” calculation [29]. Both speeds are stated in megabytes per second where 1 MB = 2^20 B of the original (uncompressed) file.
Chapter 1
Introduction
1.1 Motivation
The amount of information is growing rapidly. Data compression helps us to reduce two
of the most expensive resources: the time needed to transmit data and the space needed to
store data. The word-based approach, together with byte codes, represents a very interesting
way to reduce the space needed to store data while still providing very attractive properties
that can be exploited in practical implementations. The word-based approach considers
single words (not characters) as the symbols of the alphabet and so ensures very effective
compression of natural language text. On the other hand, the byte codes provide very
attractive speed of compression, decompression and compressed data processing since they
exploit the byte orientation of machine memory.
Nowadays, every competitive text compression method should allow fast random decompression and fast direct searching on the compressed text. These properties are necessary for compression methods with ambitions in the field of web search
engines, digital libraries and other textual databases. The compression methods using
byte codes and the word-based approach are able to achieve a compression ratio around
30 %. Furthermore, they are very fast in compression and especially in decompression.
They allow direct searching on compressed data and its random decompression. Another
significant advantage is their simple implementation.
1.2 Problem Statement
Dense codes [10, 16] belong to so-called “zero-order substitution” methods [19]. These
are techniques where the input text is split into symbols and each symbol is represented
by a unique codeword. Compression is achieved by assigning shorter codewords to more
frequent symbols. In the last decade, it has been shown that one can derive many coding
schemes from the idea of dense coding. These coding schemes differ in the syntax of the
resulting codewords (used in the compression method). However, the idea is
always the same. Our ambition is to give a formal rule that can define different coding
schemes for different purposes. Using the formal rule for the definition of dense codes, we
show various compression methods optimized for different situations of natural language
compression. These new compression methods outperform their competitors in compression
ratio and still keep their high compression and decompression speed. Furthermore, the
compression methods that are semi-adaptive allow random access to a compressed data
stream and direct searching on the compressed data.
1.3 Related Work/Previous Results
The first word-based compression algorithm was presented by Moffat in [52]. It was a
word-based Huffman code, which achieves a compression ratio of approximately 25 %.
Byte-oriented versions of word-based Huffman code (Plain Huffman and Tagged Huffman) were
described by Moura in [19]. Byte orientation induces a small loss of compression ratio,
but a significant improvement of decompression speed, since the decompressor can omit
all the bit operations. In addition, Tagged Huffman marks the beginning of each codeword
to allow direct searching on the compressed text.
Williams et al. introduced their variable byte code for compression of integer values
in [73]. Later, Scholer et al. improved and applied this coding scheme for the compression
of inverted indexes in [64]. Finally, the word-based approach and the byte code scheme for
compressing integers were combined in Dense codes by Brisaboa et al. [10] a few years ago.
Dense codes are also byte-oriented. They are easier to implement and much faster than
Huffman codes. Moreover, they keep the direct search capability.
Another practical and more general byte code was proposed by Culpepper and Moffat
in [39]. Their Restricted Prefix Byte Codes allow a flexible setting of the number of codewords
of different lengths. This setting is adjusted to the input text and is defined as a
parameter in the prelude of each message block.
1.4 Contributions of the Thesis
1. Establishment of Open Dense Code (ODC ) as a frame for definition of other dense
code schemes. ODC was originally introduced in [58] and later described in detail
in [62].
2. Definition of two new adaptive dense codes: Two-Byte Dense Code (TBDC ) and
Self-Tuning Dense Code (STDC ). Both of them were defined using ODC and both
of them outperform their closest competitors in compression ratio. TBDC and STDC
were introduced in [58].
3. Definition of a semi-adaptive version of TBDC : Semi-adaptive Two-Byte Dense Code
(STBDC ), which is a block-oriented compression method suitable and very effective
for the compression of large portions of natural language text. STBDC was introduced in [59] and later described in detail in [60].
1.5. STRUCTURE OF THE THESIS
3
4. Modification of STBDC for compression of a set of small text files: Set-of-Files Semi-adaptive Two-Byte Dense Code (SF-STBDC). This compression method represents
an attractive alternative for compressing text files in the context of web search engines.
SF-STBDC was introduced in [61].
1.5 Structure of the Thesis
The thesis is organized into chapters as follows:
1. Introduction: Describes the motivation behind our efforts, together with our goals.
There is also a list of contributions of this doctoral thesis.
2. Background and State-of-the-Art: Introduces the reader to the necessary theoretical
background and surveys the current state-of-the-art.
3. Novel Approaches to Dense Coding: We present some basic ideas of ODC and its
derived dense code schemes in this chapter.
4. Contribution to Dense Coding: We present step-by-step basic ideas and experimental
results of: Open Dense Code (ODC), Two-Byte Dense Code (TBDC), Self-Tuning Dense Code (STDC), Semi-adaptive Two-Byte Dense Code (STBDC) and
finally Set-of-Files Semi-adaptive Two-Byte Dense Code (SF-STBDC ).
5. Conclusions: Summarizes the results of our research, suggests possible topics for
further research, and concludes the thesis.
Chapter 2
Background and State-of-the-Art
2.1 Basic Notions
In this section, we define some basic notions that are used in this thesis.
2.1.1 Stringology
Definition 2.1.1 (Alphabet)
An alphabet Σ is a non-empty set of symbols.
Definition 2.1.2 (String)
A string u over Σ (u ∈ Σ∗ ) is any sequence of symbols from Σ.
Definition 2.1.3 (Substring)
String x is a substring of string y (x = y[i..j]) starting at position i and ending at position j,
if y = uxv, where x, y, u, v ∈ Σ∗ and 1 ≤ i ≤ j ≤ |y|.
Definition 2.1.4 (Length of string)
The length of string w = w[1..n] is the number of symbols in string w ∈ Σ∗ and it is
denoted |w| = n.
Definition 2.1.5 (Pattern Matching)
Given text t = t[1..n] and pattern p = p[1..m], we define string pattern matching as
verifying whether string p is a substring of text t.
Definition 2.1.6 (Nondeterministic finite automaton)
Nondeterministic finite automaton (NFA) is a quintuple (Q, Σ, δ, q0 , F ), where Q is a finite
set of states, Σ is a finite set of input symbols, δ is a mapping δ : Q × (Σ ∪ {ε}) → P(Q),
q0 ∈ Q is an initial state, and F ⊆ Q is a set of final states.
Definition 2.1.7 (Deterministic finite automaton)
Deterministic finite automaton (DFA) is a quintuple (Q, Σ, δ, q0 , F ) , where Q is a finite
set of states, Σ is a finite set of input symbols, δ is a mapping δ : Q × Σ → Q, q0 ∈ Q is an
initial state, and F ⊆ Q is a set of final states.
Definition 2.1.8 (Grammar)
Grammar is a quadruple G = (N, T, P, S), where N is a finite set of nonterminal symbols,
T is a finite set of terminal symbols (T ∩ N = ∅), P is a finite set of production rules
(subset of a set (N ∪ T )∗ .N.(N ∪ T )∗ × (N ∪ T )∗ ), an element (α,β) of P is denoted α → β
and is called a production rule, S ∈ N is an initial symbol of grammar.
2.1.2 Data compression
Definition 2.1.9 (Code)
Code is a triple K = (S, C, f ) , where S is a set of source symbols, C is a set of codewords
and f is a mapping f : S → C assigning just one codeword to every source symbol. Symbol
C + denotes a set of all strings with nonzero length constituted of symbols from C. Symbol
C ∗ denotes a set of all strings possibly with zero length constituted of symbols from C.
Definition 2.1.10 (Unambiguously Decodable Code)
Given a code K = (S, C, f ), a string x ∈ C + is unambiguously decodable in terms of
mapping f, if there exists just one string y ∈ S+ such that f(y) = x. The code K = (S, C, f) is
unambiguously decodable if and only if all possible strings of a set C + are unambiguously
decodable.
Definition 2.1.11 (Prefix Code)
Code K = (S, C, f ) is a prefix code if no codeword of C + is a prefix of another codeword
of C + . Source symbols encoded with a prefix code are unambiguously decodable when the
code is read from left to right.
Definition 2.1.12 (Optimal Code)
Given a code K = (S, C, f) and a probability distribution p of source symbols from set S
such that Σ_{i∈S} p(i) = 1. The code K = (S, C, f) is optimal if there exists no other code
K′ = (S, C′, f′) such that Σ_{i∈S} p(i)|f′(i)| < Σ_{i∈S} p(i)|f(i)|.
Definition 2.1.13 (Compression ratio)
Given a code K = (S, C, f), x ∈ S+, y ∈ C+ : y = f(x), a string x is compressed to a
string y. The compression ratio cr is: cr = |y| / |x|.
Definition 2.1.14 (Model [63])
Model of a data compression consists of two parts: (i) the structure that is a set of events
(source symbols) and their contexts, and (ii) the parameters that are the probabilities
assigned to the events.
Definition 2.1.15 (Coder [63])
Coder is responsible for representing sequences of source symbols as numbers in an order-preserving manner.
Definition 2.1.16 (Static Model)
A static model is a fixed model that is known by both the compressor and the decompressor
and does not depend on the data that is being compressed.
Definition 2.1.17 (Semi-adaptive Model)
A semi-adaptive model is a fixed model that is constructed from the data being compressed. The
model has to be attached as a part of the compressed data.
Definition 2.1.18 (Adaptive Model)
An adaptive model evolves during the compression. At a given point in compression,
the model is a function of the previously compressed part of the data. There is no need
to store the model since the decompressor can reconstruct it during decompression from
decompressed data on-the-fly.
2.1.3 Information Theory
Definition 2.1.19 (Entropy [67])
Suppose information source X with a set of distinct source symbols S = {s1 , s2 , ..., sn } and
probability distribution P = {p1 , p2 , ..., pn }. Each source symbol si ∈ S is associated with
just one probability pi ∈ P . Suppose the information source is memoryless, which means
that there is no dependence between states of the system, then the zero-order entropy of
the source X is defined by formula:
H(X) = − Σ_{i=1}^{n} p_i log p_i
Definition 2.1.20 (Redundancy)
Suppose coding of source symbol x denoted as f (x). The length of codeword f (x) in bits
is denoted as L(x). Redundancy R(x) of the code f (x) is defined as difference between the
length of codeword L(x) and the entropy of source symbol x:
R(x) = L(x) − H(x)
Definition 2.1.21 (Kullback-Leibler Divergence [44], [51])
Given two probability distributions P and Q, the Kullback-Leibler divergence of Q from
P, denoted as D_{KL}(P‖Q), is a measure of the information lost when Q is used to approximate P:

D_{KL}(P‖Q) = Σ_i P(i) log (P(i) / Q(i))

Definition 2.1.22 (Cross entropy [51])
Given two probability distributions P and Q, the cross entropy measures the average
number of bits needed to identify an event from a set of events, if a coding scheme is used
based on a given probability distribution Q, rather than the true distribution P:

H(P, Q) = − Σ_i P(i) log Q(i) = H(P) + D_{KL}(P‖Q) = − Σ_i P(i) log P(i) + Σ_i P(i) log (P(i) / Q(i))
The cross entropy H(P, Q) expresses the amount of coding redundancy when the probability distribution Q is used to encode the set of events with true probability P .
Definition 2.1.23 (Markov property)
A stochastic process has the Markov property if the conditional probability distribution
of future states of the process depends only upon the present state, not on the preceding
states. A process with this property is said to be Markovian or a Markov process.
Definition 2.1.24 (Markov chain)
A Markov chain is a sequence of random variables X1 , X2 , X3 , ..., Xn with the Markov
property, namely that, given the present state, the future and past states are independent.
Formally, P(X_{n+1} = x | X_1 = x_1, X_2 = x_2, ..., X_n = x_n) = P(X_{n+1} = x | X_n = x_n).
Definition 2.1.25 (Variable-order Markov model)
Let A be a state space (a finite alphabet) of size |A|. Variable-order Markov model attempts
to estimate conditional distributions of the form P (xi |s) for a symbol xi ∈ A given a context
s ∈ A∗ , i.e. context s is a sequence of states (alphabet symbols) of any length including
the empty context. The length of context |s| ≤ D varies depending on available statistics.
Definition 2.1.26 (0-order empirical entropy)
Suppose a string X = x_1 x_2 x_3 ... x_n composed of distinct source symbols S = {s_1, s_2, ..., s_m}
with frequencies F = {n_X^1, n_X^2, ..., n_X^m}. The 0-order entropy of the string is defined by the
formula:

H_0(X) = − Σ_{a∈S} (n_X^a / n) log_2 (n_X^a / n),

where n_X^a represents the frequency of a symbol a in terms of the string X and n is the
length of the string X. See also Definition 2.1.19 defining the entropy using the probability
mass function of a random variable.
Definition 2.1.27 (k-order empirical entropy)
Suppose a string X = x_1 x_2 x_3 ... x_n composed of distinct source symbols S = {s_1, s_2, ..., s_m}
with frequencies F = {n_X^1, n_X^2, ..., n_X^m}. The k-order entropy of the string is defined by the
formula:

H_k(X) = (1/n) Σ_{w∈S^k} |w_X| H_0(w_X),

where w_X represents the concatenation of the symbols following a substring w in the string X.
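To make Definitions 2.1.26 and 2.1.27 concrete, the following short Python sketch (an illustration of ours, not part of any referenced implementation) computes the 0-order and k-order empirical entropy of a sequence of symbols; the sequence may be a string of characters or a list of words.

from collections import Counter
from math import log2

def h0(seq):
    # 0-order empirical entropy of a sequence of symbols, in bits per symbol.
    n = len(seq)
    return -sum(c / n * log2(c / n) for c in Counter(seq).values())

def hk(seq, k):
    # k-order empirical entropy: for every context w of length k, w_X is the
    # list of symbols following the occurrences of w; the result is the
    # |w_X|-weighted average of h0(w_X), divided by the length of the sequence.
    n = len(seq)
    followers = {}
    for i in range(n - k):
        followers.setdefault(tuple(seq[i:i + k]), []).append(seq[i + k])
    return sum(len(w_x) * h0(w_x) for w_x in followers.values()) / n

For k = 0, hk reduces to h0, in agreement with the two definitions.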
Definition 2.1.28 (Joint entropy [51])
The joint entropy H(X, Y ) of a pair of discrete random variables X and Y with a joint
distribution p(x, y) is defined by the formula:
H(X, Y) = − Σ_x Σ_y p(x, y) log p(x, y)
Definition 2.1.29 (Conditional entropy [51])
Let X and Y be discrete random variables with joint distribution p(x, y) and conditional
distribution p(x|y), then the conditional entropy is defined by the formula:
H(X|Y) = − Σ_{x,y} p(x, y) log p(x|y)
Theorem 2.1.1 (Properties of entropy [51])
The entropy H(X) of a discrete random variable X = {x1 , ..., xn } and the joint entropy
H(X, Y ) have the following properties:
• Non-negativity: H(X) ≥ 0
• Upper bound : H(X) ≤ log(n)
• Chain rule: H(X, Y ) = H(X) + H(Y |X)
• Conditioning reduces entropy: H(X|Y ) ≤ H(X)
2.2 Linguistic Basics
Traditionally, standard symbol-based compressors use characters as the source symbols of
compressed text. It means that they regard the text as a sequence of alphanumeric or
non-alphanumeric characters, e.g. Huffman code [38] or Shannon-Fano code [22] belong to
the standard character-based statistical compressors. The basic idea of these compression
methods is to substitute the most frequent symbol with the shortest codeword. The compression effectiveness of these compressors is quite poor. They achieve only a 60 − 65%
compression ratio on typical English texts.
Another family of character-based compressors are dictionary compressors. Their basic
idea is to replace repeating substrings in the text by a link to the dictionary or to the
previous occurrence of the substring. The most widely-known compression methods of
this type belong to the so-called Lempel-Ziv family [78, 79]. These compression methods
achieve a better compression ratio (35 − 40% on typical English texts) on natural language
texts and are very fast in decompression. On the other hand, they are quite slow during
the compression process and their memory requirements are typically very high (up to a
few gigabytes).
A natural way to improve the compression ratio of character-based compression
methods is to use the context in the natural language text. It is natural that, for example, the
English character ‘t’ is followed by the character ‘h’ with higher probability than by
the character ‘x’. Claude Shannon performed his well-known experiment [68] in a similar
fashion. He let a human guess the next letter in a sample English text with knowledge
of the preceding letters. The success rate of a guess grew with the position of the letter
within a word. Similarly, the character-based entropy (k-order empirical entropy, see
Definition 2.1.27) decreases with higher order (as does the uncertainty about the next
letter).
One group of compression methods using k-th order statistics is the family of PPM (Prediction by
Partial Matching) compressors [18]. These compression methods use a left context of maximum length k (which is a parameter of the compression) to predict the following character in
the text. They work with the so-called variable-order Markov model (see Definition 2.1.25). The
PPM compressors are equipped with an arithmetic coder [75, 1] that encodes the probability interval of the following character based on the preceding context. The maximum size of
context k is a parameter giving a trade-off between the achieved compression ratio on one
side and memory requirements (upper bound of the number of contexts is |Σ|k , where |Σ|
is a size of the alphabet) and compression/decompression speed on the other side. These
compression methods achieve a very good compression ratio around 20 % when performed
on English texts. However, they are quite slow in compression and decompression and they
require a huge amount of memory.
The other group of compression methods uses a right context to predict a preceding character. These compression methods are based on the Burrows-Wheeler transform
(BWT) [17] that returns a reversible permutation of the input text. The permutation is
derived from a sorting by the right context so the same characters very often follow each
other since they come from a similar context. After performing the permutation, the compressor applies the Move-To-Front strategy [6] followed by the Huffman coder [38]. The
decompression process is inverse. These compression methods usually divide the input
text into smaller blocks so they do not consume such a large amount of memory as PPM
methods. BWT-based methods still achieve a very competitive compression ratio around
25 % when performed on English text.
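As a rough illustration of the two transformations mentioned above, the following Python sketch of ours uses the naive rotation-sorting construction of the Burrows-Wheeler transform (practical compressors use suffix arrays and process the text in blocks) together with the Move-To-Front step:

def bwt(s):
    # Burrows-Wheeler transform via naive sorting of all rotations;
    # '\0' serves as a unique end-of-text sentinel (assumed absent from s).
    s = s + '\0'
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return ''.join(rot[-1] for rot in rotations)

def move_to_front(seq, alphabet):
    # Replace each symbol by its current position in the table and move that
    # symbol to the front; runs of equal symbols become runs of small numbers.
    table = list(alphabet)
    out = []
    for c in seq:
        i = table.index(c)
        out.append(i)
        table.insert(0, table.pop(i))
    return out

The output of move_to_front is then handed to a Huffman (or similar) coder, which is the final step described above.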
Another way to exploit the context in natural language text is to use whole
words as the source symbols of the alphabet. By using the so-called word-based approach,
the compressor parses the input text into a sequence of words. To our knowledge, the
word-based approach in compression was used for the first time in [52, 37].
A word is defined as a sequence of strictly alphanumeric characters or a sequence
of strictly non-alphanumeric characters. Clearly, the input text is parsed into a strictly
alternating sequence of alphanumeric and non-alphanumeric words. One can exploit this
property of alternating words and build and maintain two models: a model of alphanumeric
words and a model of non-alphanumeric words [37]. This approach allows the use of one
codeword twice (first in alphanumeric context, second in non-alphanumeric context).
Another way to exploit the syntax of a clause is to assume a so-called implicit
separator, which is a single space character ‘ ’. This approach is called spaceless word-based
modelling [20]. The compressor uses a common word-based model for both alphanumeric
and non-alphanumeric words. When an alphanumeric word is followed by a single space
character ‘ ’, the compressor does not encode this separator and continues with encoding
the following alphanumeric word. On the other hand, when the decompressor meets two
consecutive alphanumeric words, it outputs a single space character ‘ ’ as the implicit separator. The spaceless word-based model usually achieves a slightly better compression ratio
than the approach with separate alphanumeric and non-alphanumeric models. The word-based statistical compression methods achieve a competitive compression ratio; e.g., word-based
Huffman code achieves a compression ratio around 25 % when performed on English text.
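A minimal Python sketch of the spaceless word-based parsing described above might look as follows (the function names are ours; production implementations additionally handle case folding and multi-byte characters):

import re

def spaceless_tokenize(text):
    # Split into a strictly alternating sequence of alphanumeric and
    # non-alphanumeric words.
    tokens = re.findall(r'[0-9A-Za-z]+|[^0-9A-Za-z]+', text)
    out = []
    for i, tok in enumerate(tokens):
        # A single space following an alphanumeric word is the implicit
        # separator and is not encoded.
        if tok == ' ' and i > 0 and re.match(r'[0-9A-Za-z]', tokens[i - 1]):
            continue
        out.append(tok)
    return out

def spaceless_detokenize(tokens):
    # Re-insert a single space between two adjacent alphanumeric words.
    out, prev_alnum = [], False
    for tok in tokens:
        alnum = bool(re.match(r'[0-9A-Za-z]', tok))
        if alnum and prev_alnum:
            out.append(' ')
        out.append(tok)
        prev_alnum = alnum
    return ''.join(out)

For example, spaceless_tokenize("the red rose") returns ['the', 'red', 'rose'], and detokenizing that list restores the original text.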
The reason why the word-based approach is so effective in natural language compression
lies in the properties of a given language. The word-based model covers longer correlations
in the text and, at the same time, it is more biased than the character-based model which
predetermines its good compressibility by statistical compression methods. A small complication of the word-based approach is that the input alphabet (i.e. set of unique words) is
theoretically unbounded. However, in practice, according to the so-called Heaps’ Law [34],
the size of the alphabet grows sublinearly with the number of words.
V_r(n) = K × n^β        (2.1)
Let us now recall some basic properties of natural language that cause the high efficiency of the word-based model. Heaps’ Law [34] describes the dependency of the number
of unique words on the total number of processed words. It can be described by Equation 2.1, where V_r represents a good estimate of the number of unique words, n represents
the number of processed words and K and β are language-specific constants (K ∈ [10; 100]
and β ∈ [0.4; 0.6] for English). Heaps’ Law shows that the vocabulary (the number of
unique words) is limited in practice, which is very important for any compression scheme
since it must be able to cover all words (symbols of the alphabet). More unique words
(alphabet symbols) mean more codewords, some of them of greater length. Generally,
a higher number of unique words means worse compressibility. Clearly, a text composed
of one repeating word (symbol) can be compressed very efficiently by replacing the
word with the shortest possible codeword. On the other hand, when compressing a text
composed entirely of unique words (i.e. each word in the text occurs only once), the compressor cannot exploit the idea of substituting frequent words with shorter codewords and the text
remains uncompressed.
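As a rough worked example (the constants are chosen purely for illustration, within the English ranges quoted above), taking K = 14 and β = 0.5 for the English Bible with n = 889 574 processed words gives

V_r(889 574) ≈ 14 × 889 574^{0.5} ≈ 14 × 943 ≈ 13 200,

which is close to the 13 506 unique words reported for the English Bible in Table 2.1. The sublinear exponent also means that a tenfold larger text only roughly triples (√10 ≈ 3.16) the vocabulary.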
Figure 2.1: Heaps’ Law: Dependency of the number of unique words on the number of
processed words (size of the text). The processed text is a Bible written in different
languages (English, Czech, Hungarian, Danish, Swedish).
From this point of view, the weakly inflected languages, such as English, profit from
their limited inflection and achieve better compression since they produce a lower number
of unique words in standard texts. Inflection means the modification of a word to express
different grammatical categories, e.g. case, number, gender (in case of nouns) and tense,
person, voice (in case of verbs). The inflection is expressed with a prefix, suffix, infix
or another internal modification of a word. For example, the word “vidím” represents
the verb “see” in Czech with the following categories: the first person, singular, present
simple, indicative; and the word “vidíš” represents the verb “see” in Czech with the following
categories: the second person, singular, present simple, indicative. In English, all forms of
the verb “see” in the present tense are the same except the third person singular, where the
suffix ‘s’ is added.
The languages with a high degree of inflection are called highly inflected languages and
to this group belong: Latin and Romance languages (Spanish, Italian, French, Portuguese,
...), Slavic languages (Czech, Russian, Polish, ...), Baltic languages (Latvian or Lithuanian).
Other languages like English, Swedish, Danish or Norwegian belong to the so-called weakly
inflected languages. The languages of this group typically produce a very low number of
unique words in the text. The third group of languages is called agglutinative languages
since the words of these languages are formed by so-called agglutination which is a stringing
together of morphemes, each with a single grammatical meaning. Typical representatives of
this group are Hungarian and Finnish and these languages usually produce an extremely
high number of unique words in the text. Figure 2.1 depicts the Heaps’ Law curve of
different languages (English, Czech, Hungarian, Danish and Swedish) extracted from a
Bible written in different languages.
f_i(r_i; s, N) = (1/r_i^s) / Σ_{j=1}^{N} (1/j^s)        (2.2)
Another assumption for good compressibility of a word-based model is that its distribution is not uniform. For natural languages this assumption is satisfied since the distribution
of single words is strongly biased. This property of natural languages is described by the
so-called Zipf’s Law [77] that approximates the relationship between the rank ri of word wi
(in terms of a word-based model that is sorted in non-increasing order by the frequency fi )
and its frequency fi . This relationship is described by Equation 2.2, where ri represents the
rank of word wi , N represents the number of unique words and s is a constant specific for a
given language (for English s ≈ 1). Basically, the relationship says that the frequency fi of
a word w_i depends inversely on its rank r_i. The biased distribution makes it possible to maximally
exploit the idea of substituting more frequent symbols with shorter codewords. Figure 2.2
depicts the Zipf’s Law curve of different languages (English, Czech, Hungarian, Danish
and Swedish) extracted from a Bible written in different languages.
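As a quick worked consequence of Equation 2.2, note that the denominator is the same for every word, so the frequencies of two words relate as

f_i / f_j = (r_j / r_i)^s.

With s ≈ 1 (English), the second most frequent word therefore occurs roughly half as often as the most frequent one, and the tenth roughly a tenth as often, which corresponds to the steep initial drop visible in Figure 2.2.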
H(X) = − Σ_{i=1}^{n} p(x_i) log_2 p(x_i) = − Σ_{i=1}^{n} (1/n) log_2 (1/n) = −(log_2 1 − log_2 n) = log_2 n        (2.3)
The next property describing the compressibility of a given language is the empirical
entropy (see Definitions 2.1.26 and 2.1.27). The entropy determines the amount of uncertainty (information). Higher entropy means higher uncertainty, i.e. a greater amount of surprise
about which symbol (word) occurs in the text. Basically, lower entropy means better compression.
The upper bound of the entropy H(X) of a random variable X = x_1, ..., x_n is log(n) (see
Theorem 2.1.1). This upper bound is achieved for the uniform distribution of the random
variable X, where p(x_i) = 1/n. This can be easily proved by the definition of the entropy (see
Figure 2.2: Zipf’s Law: Dependency of frequency fi of a word wi on its rank ri . The
processed text is a Bible written in different languages. The graphs only depict the top
200 words.
Equation 2.3). It means that a more biased distribution of words enables better compression of the input text.
Language   # words     # un.words   k=0     k=1    k=2    k=3    k=4    k=5
EN           889 574      13 506     8.53    5.51   3.26   1.45   0.56   0.21
CZ         1 248 279      32 095     8.93    5.73   2.99   1.47   0.61   0.25
HU           807 290      64 313    10.04    5.71   2.64   0.88   0.23   0.07
DK         1 035 468      23 394     9.07    5.62   3.01   1.40   0.52   0.18
SW         1 399 468      24 522     8.85    5.36   3.10   1.69   0.77   0.32
Table 2.1: Empirical k-order entropy extracted from a Bible of different languages using
spaceless word-based modelling.
Language   # chars     # un.chars   k=0     k=1    k=2    k=3    k=4    k=5
EN         4 077 770       64        4.37    3.24   2.46   1.90   1.56   1.36
CZ         4 071 844       79        4.52    3.44   2.86   2.37   1.96   1.63
HU         4 332 385       78        4.49    3.41   2.69   2.13   1.75   1.47
DK         4 007 666       81        4.50    3.43   2.59   2.09   1.77   1.53
SW         5 212 160       96        4.47    3.38   2.50   1.99   1.66   1.44
Table 2.2: Empirical k-order entropy extracted from a Bible of different languages using
character-based modelling.
We performed experiments measuring word empirical entropy and character empirical
entropy on texts of different natural languages. Table 2.1 presents measured word empirical
entropy for different order k (0 ≤ k ≤ 5). The entropy values were extracted from a Bible
written in different languages using spaceless word-based modelling. Each row represents
a different version of the Bible and also contains the total number
of words and the number of unique words. The noticeable difference in the number of unique
words is between the Germanic languages (English, Danish and Swedish), which belong to
the so-called weakly inflected languages, Czech, which is a highly inflected language, and
Hungarian, which is an agglutinative language. The values of word entropy of higher orders
k are extremely low which creates an illusion of high compressibility. The extremely low
entropy values are caused by the fact that word contexts contain a very low number of
words occurring in the given context. However, the resulting compression ratio is always
spoiled by the overhead of encoding the new words occurring in contexts. The semi-adaptive
methods have to encode them as a table and the adaptive methods have to escape each
new word in a given context using a so-called escape symbol.
Table 2.2 gives the comparison to character entropy for different order k (0 ≤ k ≤ 5).
The entropy values are extracted from a Bible written in different languages as well. The
zero-order character entropy value is 4.37 bits per character for English. Suppose standard
ASCII coding. Then this entropy value predicts a lower bound for the compression ratio of
4.37 ÷ 8 = 0.546 for zero-order character-based compressors such as Huffman coding [38]
or arithmetic coding [75, 1].
The character-based approach has the advantage that the alphabet, as well as the
number of contexts, is limited. For the comparison of zero-order character entropy and
zero-order word entropy, we need to normalize the values of character entropy according to
the average word length. For English, the average length of a word is 4 077 770 ÷ 889 574 =
4.58 characters per word; therefore the normalized value of the zero-order character entropy is
4.37 × 4.58 = 20.01 bits per word for English. The zero-order word entropy of 8.53 bits per
word is then significantly below the value of 20.01 bits per word since the word-based model
can cover longer correlations in the text.
2.3 Byte-oriented Huffman Code
Huffman code belongs to the so-called “zero-order substitution” compression methods. The
original version of the code was proposed by D. A. Huffman in [38]. This version is bit-oriented. It means that the codewords are formed as a sequence of bits and, at the same
time, the codewords satisfy the prefix property (see Definition 2.1.11). The compressor
groups the individual symbols into tree nodes according to their frequency and builds a
binary tree. Leaves of the Huffman tree correspond to the symbols of the alphabet and
edges are labelled by the value ‘0’ for a left child and by value ‘1’ for a right child. A path
leading from the root to a leaf (concatenation of the labels of single edges) represents a
codeword assigned to the corresponding symbol of the alphabet.
The original version of the Huffman code works with the character-based approach,
where the alphabet is limited (e.g. the set of characters of the ASCII table). Versions using the
word-based approach appeared later in [6, 52, 37]. In these versions, single words (sequences
of alphanumeric or non-alphanumeric characters) are considered as the symbols of the alphabet.
The byte-oriented version of the Huffman code was for the first time presented by
Moura et al. in [19]. Their byte orientation means that each codeword is composed of a
sequence of bytes instead of bits. The shortest codeword has the size of one byte, which
predetermines the necessity of the word-based approach in natural language compression.
The byte orientation was achieved by the extension of the degree of the Huffman tree
from 2 to 256. It means that each of the internal nodes of the Huffman tree can have
up to 256 descendants. The byte orientation means a small loss in compression ratio.
The byte-oriented word-based Huffman code achieves a compression ratio around 30 %
on typical English text while the bit-oriented word-based Huffman code can achieve a
compression ratio around 25 %. On the other hand, the byte orientation provides much
faster compression and decompression. The reason is that bit shifts and masking operations
are not necessary during compression, decompression and searching on the compressed text.
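The extension of the Huffman construction from degree 2 to degree 256 can be illustrated with the following Python sketch of ours (not code from the cited works): it computes codeword lengths in bytes for a byte-oriented Huffman code. The padding with zero-frequency dummy symbols is the standard ingredient of d-ary Huffman coding that guarantees every merge combines exactly 256 nodes.

import heapq

def byte_huffman_code_lengths(freqs, degree=256):
    # Codeword lengths (in bytes) of a degree-ary Huffman code.
    # freqs: dict mapping a word (string) to its frequency.
    heap = [(f, [w]) for w, f in freqs.items()]
    # Pad with zero-frequency dummies so that repeatedly merging `degree`
    # nodes ends with exactly one node (the root).
    while (len(heap) - 1) % (degree - 1) != 0:
        heap.append((0, []))
    heapq.heapify(heap)
    lengths = {w: 0 for w in freqs}
    while len(heap) > 1:
        total, merged = 0, []
        for _ in range(degree):
            f, words = heapq.heappop(heap)
            total += f
            merged += words
        for w in merged:
            lengths[w] += 1     # one more byte on the path to the root
        heapq.heappush(heap, (total, merged))
    return lengths

Assigning the actual byte sequences then amounts to handing out, level by level, the 256 possible byte values to the children of each internal node.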
In [19] two variants of byte-oriented Huffman code are presented. The first variant with
the degree of the Huffman tree 256 is called Plain Huffman code. The second variant has
a Huffman tree with a degree of 128 and is called Tagged Huffman code. In the Tagged
Huffman code each byte uses 7 bits for the Huffman code and 1 bit to signal the beginning
of a codeword. The most significant bit is set to one when the byte is the first in the
sequence representing a codeword. This change means another small loss in compression
ratio (33 – 35 % on typical English text). However, this change gives the Huffman code a
fast self-synchronizing property, i.e. the single codewords are unambiguously recognizable
in a compressed data stream since the most significant bit signals the beginning of a
codeword. The self-synchronizing property provides a possibility of direct searching on the
compressed text using any conventional pattern-matching algorithm. Furthermore, the
decompression can start from an arbitrary point of the compressed text. Three variants
of search algorithms were introduced in [19]: one for the Tagged Huffman code and two
for the Plain Huffman code. The search algorithms in the Plain Huffman code must work
aligned with the codewords or must check the surrounding of each match to verify the
validity of that match. Not only byte-oriented Huffman code can be directly searched.
The standard bit-oriented Huffman code also allows direct searching on the compressed
data stream [71, 41]. However, the searching performance is always poorer. Thus, we
conclude that Tagged Huffman is not the only code with the self-synchronizing property [28]. However, self-synchronization can mean the overhead of processing an arbitrary number of codewords
in the case that the compressor does not use a special synchronizing mark [7, 8].
2.4 Variable Byte Codes
The compression scheme based on variable byte code (VBC ) was originally proposed in [73].
Scholer et al. later proved the efficiency of the compression scheme when used on inverted
indexes of a textual database in [64].
Variable byte code gives an efficient solution for compressing an array of integers.
Efficient processing of arrays of integers is the crucial issue for the efficient resolution of
queries to databases. Especially the inverted index [74], which is commonly used in web
search engines, is suitable for compression by variable byte codes. The inverted index
consists of two major components: the vocabulary of terms (words) and inverted
lists, which consist of vectors that contain information about all occurrences of the terms
(usually an integer value identifying a document where the term occurs). Furthermore, the
inverted lists can contain an array with offset positions in terms of a given document.
However, the problem of word-based natural language compression can be easily transformed into the problem of compressing an array of integers. Notice the basic idea of all
“zero-order substitution” compression methods: substitute more frequent symbols with
shorter codewords. This idea implies that the compression model has to be stored as a set
of unique symbols wi (words in the case of a word-based model) sorted by their frequency fi
in non-increasing order. The natural language text can then be perceived as a sequence of
ranks ri in the compression model corresponding to words wi occurring on a given position
in the text.
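The transformation sketched in the previous paragraph can be written in a few lines of Python (an illustration of ours, not a quotation of any particular implementation):

from collections import Counter

def words_to_ranks(words):
    # Build a vocabulary sorted by non-increasing frequency and replace every
    # word by its rank (1 = most frequent), turning the text into an array of
    # small integers suitable for a byte code such as VBC.
    freq = Counter(words)
    vocab = [w for w, _ in freq.most_common()]
    rank = {w: i + 1 for i, w in enumerate(vocab)}
    return vocab, [rank[w] for w in words]

The vocabulary is stored alongside the compressed stream (the semi-adaptive model), while the sequence of ranks is handed to the integer coder.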
Integer values are usually stored using a byte sequence with a fixed size of four bytes.
The idea of variable byte code is to use the “zero-order substitution” compression technique
and assign shorter codewords to more frequent integers. However, at the same time, the
byte orientation must be preserved to keep high processing speed. The variable byte
codeword consists of a sequence of bytes in which seven bits are used to code an integer
while the least significant bit of each byte is used to mark whether the byte is the last in
the sequence or not. In comparison to the bit-oriented coding scheme, variable byte codes
cover many more integer values by just a one-byte codeword: 2^7 = 128. On the other hand,
the coding scheme of variable byte codes does not provide a shorter codeword than one
byte. Table 2.3 [73] gives the comparison of encoded integer values using Elias gamma and
delta codes [21], the Golomb code [32] and variable-byte code [73].
Algorithm 1 represents a method of decompression of variable-byte code. The algorithm
reads byte-by-byte from the array of bytes using the head function (see line 5). In every
step the algorithm checks the least significant bit (see line 4). If the least significant bit is
one, then the reading can continue. Otherwise, the while loop must be left. The operation
v ← (v C7)+((iB1) bit-and 0x7F) performs a left shift of the read value stored in variable
v and concatenation with the last read value stored in variable i. The symbols C and B
represent the left and right bit shift operations. Algorithm 1 demonstrates the simplicity
of the structure of the variable-byte codeword. The codeword is formed as a sequence of
bytes, whose least significant bit is set to one except the last byte, whose least significant
bit is set to zero. Concatenation of the other bits of the bytes represents the encoded
integer value.
The byte orientation of the code ensures high decompression and searching speed.
Thanks to the least significant bit of each byte, the borders of single codewords are easily
recognizable so the variable-byte code is self-synchronizing and allows direct searching on
the compressed text using standard pattern-matching algorithms. For exhaustive experiments performed on variable-byte code, see [73] and [64].
Algorithm 1 Pseudocode for decompression of variable-byte integers.
1: function VariableByteRead(A)
2:     i ← 0x1
3:     v ← 0
4:     while (i bit-and 0x1) = 0x1 do
5:         i ← head(A)
6:         v ← (v ≪ 7) + ((i ≫ 1) bit-and 0x7F)
7:     return v + 1
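For completeness, a Python sketch of ours of the corresponding encoder is given below. It mirrors Algorithm 1, which stores a positive integer value as value − 1 and returns v + 1 on decoding (note that Table 2.3 writes the integer itself into the payload, a slightly different but equivalent convention).

def variable_byte_write(value):
    # Encode a positive integer so that VariableByteRead (Algorithm 1)
    # reconstructs it: the stored number is value - 1, split into 7-bit
    # groups; bit 0 of every byte except the last is set to 1.
    v = value - 1
    groups = [v & 0x7F]
    v >>= 7
    while v > 0:
        groups.append(v & 0x7F)
        v >>= 7
    groups.reverse()                 # most significant group first
    return bytes((g << 1) | (0 if i == len(groups) - 1 else 1)
                 for i, g in enumerate(groups))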
Decimal   Uncomp.    Elias γ      Elias δ       Golomb (k = 3)   Golomb (k = 10)   Variable-byte
1         00000001   1            1             1 10             1 001             0000001 0
2         00000010   0 10         0 100         1 11             1 010             0000010 0
3         00000011   0 11         0 101         01 0             1 011             0000011 0
4         00000100   00 100       0 1100        01 10            1 100             0000100 0
5         00000101   00 101       0 1101        01 11            1 101             0000101 0
6         00000110   00 110       0 1110        001 0            1 1100            0000110 0
7         00000111   00 111       0 1111        001 10           1 1101            0000111 0
8         00001000   000 1000     00 100000     001 11           1 1110            0001000 0
9         00001001   000 1001     00 100001     0001 0           1 1111            0001001 0
10        00001010   000 1010     00 100010     0001 10          01 000            0001010 0
15        00001111   000 1111     00 100111     000001 0         01 101            0001111 0
20        00010100   0000 10100   00 1010100    0000001 11       001 000           0010100 0
25        00011001   0000 11001   00 1011001    000000001 10     001 101           0011001 0
30        00011110   0000 11110   00 1011110    00000000001 0    0001 000          0011110 0

Table 2.3: Representation of selected integers in the range 1 – 30 as uncompressed eight-bit
integers, Elias gamma codes, Elias delta codes, Golomb codes with k = 3 and k = 10, and
variable-byte integers.

2.5 End-Tagged Dense Code

End-Tagged Dense Coding (ETDC) is a word-based compression method proposed by
Brisaboa et al. in [10]. The word-based approach of the method means that the algorithm
follows all basic ideas proposed in [6, 52, 37]. The source symbols are words instead of
characters and the compressor processes a strictly alternating sequence of alphanumeric
and non-alphanumeric words.
ETDC was designed to be very fast in compression and decompression. The word-based approach makes it a little bit faster. Especially the fact that ETDC is byte-oriented
improves the speed of compression and decompression substantially. The byte orientation
means that the lowest coding unit is one byte. The compressor encodes a symbol as a
sequence of bytes instead of bits. This can make the compression ratio a little bit
worse. However, we avoid all bit-level operations (shifting and comparing), which makes
ETDC (and all byte-level compression algorithms) much faster.
ETDC is inspired by the Tagged Huffman Code proposed by E. de Moura et al. in [19].
Word-based byte-oriented Huffman code is often called Plain Huffman. Tagged Huffman
Code is like Plain Huffman, but the most significant bit of every byte is used to mark
the beginning of a codeword. If a codeword starts with a given byte, then the bit is set
to one, otherwise to zero. The remaining 7 bits are filled with the Huffman code of the encoded
word. Tagging is used to enable a direct search on compressed text without previous
decompression. The searched word is simply encoded into a pattern. The pattern is then
compared to single codewords which are recognized according to the most significant bit
of each byte in the encoded stream.
ETDC is derived from Tagged Huffman Code by very simple modification. The most
significant bit is not used to mark the beginning of a codeword but the end of a codeword.
Whenever a given byte is the last byte of a codeword, it must start with the most significant
bit set to one. Otherwise, the most significant bit must be set to zero. After this change,
20
CHAPTER 2. BACKGROUND AND STATE-OF-THE-ART
we do not have to maintain the codeword as Huffman code (generally as a prefix code)
and we can exploit all combinations of the remaining 7 bits to encode the source symbols.
Suppose codewords X and Y, |X| < |Y|. Encoded by ETDC, X can never be a prefix of Y because they differ in the most significant bit of the |X|-th byte. Using all combinations of the remaining 7 bits compensates for the compression inefficiency caused by the byte-orientation of ETDC.
The idea of tagging the last byte of the codeword is also used in variable byte coding
(VBC) described in Section 2.4. ETDC, unlike VBC, not only exploits all combinations of the remaining bits in the current number of bytes, but it also accounts for the combinations already used by codewords with a lower number of bytes. Example 2.5.1 demonstrates the advantage of ETDC
in comparison with VBC.
Example 2.5.1 (ETDC vs. VBC )
Encode number 16 400 using VBC and ETDC.
VBC(16 400) = ⟨00000011⟩ ⟨00000001⟩ ⟨00100000⟩
ETDC(16 400) = ⟨01111111⟩ ⟨10010000⟩
ETDC, unlike VBC, considers that there was a possibility to encode 128 most frequent
source symbols using just one byte. Hence, it subtracts these 128 source symbols and in
fact it encodes number 16 400 − 128 = 16 272. As a result, ETDC needs only 2 bytes unlike
VBC which needs 3 bytes.
Let us finally give a formal definition of End-Tagged Dense Code, as it was originally
defined in [10].
Definition 2.5.1 (ETDC ) [10]
Given source symbols S = {s1, ..., sn}, End-Tagged Dense Code assigns the number i − 1 to the i-th most frequent source symbol. This number is represented in base 2^{b−1}, as a sequence of digits from the most to the least significant. Each such digit is represented using b bits. The exception is the least significant digit d0, for which we represent 2^{b−1} + d0 instead of just d0.
In general, ETDC can be defined for an arbitrary number of bits b. However, the byte-oriented version of ETDC can be used only for b equal to a multiple of 8. In the following text, we will always consider b = 8. To emphasize the efficiency of ETDC in comparison with Tagged Huffman Code (THC), the authors of [12] used Table 2.4. The size of the block is set to b = 3.
Suppose a vocabulary containing a list of source symbols si ∈ S ordered by non-increasing frequency counts fi ∈ F. Table 2.5 contains some codewords assigned to those source symbols. As we can see, the compressor encodes the 128 most frequent source symbols with one-byte codewords, the next 16 384 most frequent source symbols with two-byte codewords, and so on.
Rank | ETDC    | THC
1    | 100     | 100
2    | 101     | 101
3    | 110     | 110
4    | 111     | 111 000
5    | 000 100 | 111 001
6    | 000 101 | 111 010
7    | 000 110 | 111 011 000
8    | 000 111 | 111 011 001
9    | 001 100 | 111 011 010
10   | 001 101 | 111 011 011
Table 2.4: Comparison of ETDC and THC for b = 3.
The number of codewords that ETDC is able to cover by one byte is usually higher than the number of codewords of length up to 8 bits covered by the Huffman code, because the Huffman coding tree is very pruned. On the other hand, Huffman code better exploits the probabilities of the source symbols and so achieves a better compression ratio than ETDC.
Rank  | ETDC
1     | 10000000
2     | 10000001
128   | 11111111
129   | 00000000 10000000
130   | 00000000 10000001
16512 | 01111111 11111111
16513 | 00000000 00000000 10000000
Table 2.5: End-Tagged Dense Coding
Algorithm 2 and Algorithm 3 describe the encoding and decoding of a given number (rank) using ETDC. In the ETDC coding scheme, the coder still has to consider the ranks that could be encoded using a lower number of bytes and subtract the sum of “codewords encoded by a lower number of bytes” from rank r. The most efficient way to achieve this is to subtract one from the remaining rank (see Algorithm 2, line 7) in every step of the iteration; and vice-versa: the decoder has to consider the sum of “codewords encoded by a lower number of bytes” and add it to the decoded rank r (see Algorithm 3, line 12). The decoder computes this sum on lines 9 and 10 using the variables base and tot. These lines can be expressed by the formula Σ_{i=1}^{b−1} 128^i, where b is the number of bytes of the given codeword. So the rank is not expressed by its binary representation but obtains the lowest free combination of bits (see Definition 2.5.1).
Algorithm 2 ETDC encode algorithm.
1: function Encode(r)
2:     ▷ prepend adds the element to the beginning of the list.
3:     bytes ← {}
4:     prepend(bytes, 128 + (r mod 128))
5:     r ← r div 128
6:     while r > 0 do
7:         r ← r − 1
8:         prepend(bytes, r mod 128)
9:         r ← r div 128
Algorithm 3 ETDC decode algorithm.
1: function Decode(bytes)
2:     r ← 0
3:     base ← 0
4:     tot ← 128
5:     b ← 1
6:     while bytes[b] < 128 do
7:         r ← r × 128 + bytes[b]
8:         b ← b + 1
9:         base ← base + tot
10:        tot ← tot × 128
11:    r ← r × 128 + (bytes[b] − 128)
12:    r ← r + base
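The following minimal Python sketch (our own transliteration of Algorithms 2 and 3, with r being the 0-based rank i − 1 of the i-th most frequent word) encodes and decodes a rank and reproduces the codeword of Example 2.5.1:

def etdc_encode(r: int) -> bytes:
    # Mirror of Algorithm 2.
    out = [128 + (r % 128)]       # the last byte carries the tag bit
    r //= 128
    while r > 0:
        r -= 1                    # account for ranks covered by shorter codewords
        out.insert(0, r % 128)    # prepend a continuation byte (tag bit 0)
        r //= 128
    return bytes(out)

def etdc_decode(code: bytes) -> int:
    # Mirror of Algorithm 3.
    r, base, tot, i = 0, 0, 128, 0
    while code[i] < 128:          # continuation bytes have the tag bit cleared
        r = r * 128 + code[i]
        i += 1
        base += tot               # ranks covered by shorter codewords
        tot *= 128
    return r * 128 + (code[i] - 128) + base

assert etdc_encode(16_400) == bytes([0b01111111, 0b10010000])   # ETDC of Example 2.5.1
assert all(etdc_decode(etdc_encode(r)) == r for r in range(70_000))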
From now on, the abbreviation ETDC will refer to the semi-adaptive version of ETDC.
The semi-adaptive version of ETDC works with two standard passes over the input data.
The first pass is used to collect necessary statistics (frequency counts fi ) of processed source
symbols si . The parser stores single words in a vocabulary (a hash table) and increments
their frequency counts during the first pass. After the first pass, the compressor performs
sorting of the source symbols si according to their frequency counts fi (using the Quicksort
algorithm [35]) and each source symbol si (word) obtains its rank ri . Furthermore, the
compressor performs the second pass over the input data and encodes each source symbol
si using its rank ri as a parameter of the function described by Algorithm 2. At the end
of the compression process, the compressor needs to store the vocabulary. The vocabulary
of ETDC is stored as a sequence of source symbols si (words) separated by a single ⟨00⟩ character. Notice that there is no need to store the frequency fi of each source symbol si, since only the rank ri is necessary for correct decoding.
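As an illustration of the first pass, the following minimal Python sketch (our own simplification: whitespace splitting instead of the spaceless word model, and the built-in sort instead of quicksort) produces the 0-based ranks that the second pass feeds into the encoder of Algorithm 2:

from collections import Counter

def build_ranks(text: str) -> dict:
    # Count word frequencies and assign ranks by non-increasing frequency,
    # the most frequent word obtaining rank 0.
    freq = Counter(text.split())
    ordered = sorted(freq, key=freq.get, reverse=True)
    return {word: rank for rank, word in enumerate(ordered)}

# Second pass: every word w is replaced by the codeword for its rank;
# the vocabulary file stores the words in rank order, separated by a ⟨00⟩ byte.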
The decoding process is even simpler than the encoding. The decompressor reads a
vocabulary file (words sorted according to their ranks). Furthermore, the compressed data
stream is read and single codewords are parsed using the most significant bit of each byte
as a mark of the end of codeword. Each codeword is decoded using the Algorithm 3 and
the decompressor outputs a source symbol si corresponding to the decoded rank ri .
2.5.1 Searching on ETDC
Searching on compressed text is a very attractive issue nowadays. Suppose static or semi-adaptive encoding, and suppose the single codewords are unambiguously distinguishable in the compressed stream. Only then are we able to perform searching on ETDC-compressed text. The main steps of the searching algorithm are: encode the searched source symbol into a pattern and compare the single codewords in the compressed data stream with this pattern. Provided the byte-orientation is preserved (as in plain text), the search time decreases roughly in proportion to the compression ratio cr of the given compression scheme, simply because fewer bytes have to be scanned.
Algorithm 4 Searching on the compressed text.
function Search(compressed message C, vocabulary D, searched source symbol s)
    p ← getCodeword(D, s)
    while not EOF do
        apply some pattern-matching algorithm
        if p is found then
            check whether it is a false positive
            if not a false positive then
                output the actual position
The general algorithm of searching in the compressed data stream is described in Algorithm 4. Using e.g. Plain Huffman Code (or other bit-oriented compression methods) without any synchronization marks [7, 8], we cannot simply perform searching in the compressed text, because we are not able to distinguish single codewords without decompression (traversing the Huffman coding tree and determining the end of one codeword and the beginning of another). If we apply some pattern-matching algorithm searching for a match with a codeword in the compressed data stream, we may also obtain many false positives. These false positives are caused by substrings of the compressed data stream that arise from the concatenation of two codewords and happen to match the searched pattern. An example of a false positive is depicted in Figure 2.3. It means that this kind of searching must always be accompanied by some heuristic confirming or disproving each occurrence.
In ETDC, some false positives may occur when the searched pattern is a suffix of
some codeword. Unlike Plain Huffman Code, ETDC has a simple mechanism of how to
distinguish real occurrences from false positives. The searching algorithm simply checks the most significant bit of the byte preceding the match in the compressed data stream. If the most significant bit is set to zero, then it is a false positive and we have found only a suffix. If the most significant bit is set to one, then we have found a real occurrence of the searched source symbol. Figure 2.4 depicts a false positive in an ETDC data stream.
(Figure: codewords the = 011, rose = 10111, dog = 11011; the encoded text “. . . the rose . . .” = . . . 011 10111 . . . contains the substring 11011, which matches the codeword of “dog”.)
Figure 2.3: False positive using a pattern-matching algorithm in a Plain Huffman Code stream. The false positive is a substring of two consecutive codewords corresponding to the words “the” and “rose”.
(Figure: codewords mad = 01101011 10001011 and moose = 10001011; in the encoded text “. . . mad moose . . .” the second byte of the codeword of “mad” matches the searched codeword of “moose”.)
Figure 2.4: False positive of the searched pattern “moose” using a pattern-matching algorithm in the ETDC data stream.
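A minimal Python sketch of this check (our own illustration, using the built-in byte search in place of a dedicated pattern-matching algorithm and the codewords of Figure 2.4):

def etdc_find(data: bytes, codeword: bytes):
    # Scan the compressed stream for a codeword; a match starting at position i
    # is real only if it starts at the beginning of the stream or right after a
    # byte whose most significant bit is set (i.e. after the end of a codeword).
    hits, start = [], 0
    while True:
        i = data.find(codeword, start)
        if i < 0:
            return hits
        if i == 0 or data[i - 1] >= 128:
            hits.append(i)
        start = i + 1

# "mad" = 01101011 10001011, "moose" = 10001011 (Figure 2.4):
stream = bytes([0b01101011, 0b10001011, 0b10001011])    # ... mad moose ...
assert etdc_find(stream, bytes([0b10001011])) == [2]    # the match at offset 1 is a false positive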
Besides standard searching for words or phrases (concatenations of several words), ETDC allows approximate searching, searching for patterns containing wildcards [19], and even searching for arbitrary patterns (e.g. substrings). This technique uses a preprocessing step in which the searched pattern is mapped onto all possible ETDC dictionary codewords [42], and the search for a single pattern is transformed into a search for a set of patterns in parallel. This kind of searching is, of course, much slower.
2.5.2 Adaptive version of ETDC
An adaptive version of ETDC [10, 15, 14] is called Dynamic End-Tagged Dense Code. The
abbreviation DETDC will denote the adaptive version of ETDC from now.
Algorithm 5 and Algorithm 6 represent a skeleton of most of the statistical adaptive
compressors and decompressors, respectively. The algorithms perform only one pass over
the data and continuously update their model (vocabulary). When a new source symbol
occurs, the compressor outputs a special escape symbol (see line 6) and stores the new
source symbol to the vocabulary (see line 8). When a source symbol is already in the
vocabulary, the compressor outputs a corresponding codeword and updates the vocabulary
(see lines 10 and 11). Vice versa: when the special escape symbol is read, the decompressor proceeds by reading a new source symbol in plain text form (see line 5). Otherwise,
the decompressor decodes the read codeword (see line 10) and outputs the corresponding
source symbol (see line 13).
For adaptive compression, it is crucial that the compressor and the decompressor keep
Algorithm 5 DETDC encode algorithm.
1: function DETDC-Encode(source text T)
2:     vocabulary ← ⟨ESC⟩
3:     while not EOF do
4:         Read next source symbol si
5:         if si ∉ vocabulary then
6:             Output ⟨ESC⟩
7:             Write plain word si
8:             vocabulary ← si
9:         else
10:            Output ETDC(si)
11:            fi ← fi + 1
12:            update vocabulary
Algorithm 6 DETDC decode algorithm.
1: function DETDC-Decode(compressed message C)
2:     vocabulary ← ⟨ESC⟩
3:     while not EOF do Read next codeword ci
4:         if ci = ⟨ESC⟩ then
5:             sj ← read plain word
6:             vocabulary ← sj
7:             fj ← 1
8:             Output sj
9:         else
10:            sx ← read source symbol assigned to ci
11:            fx ← fx + 1
12:            update vocabulary
13:            Output sx
the vocabulary updated during the whole process of compression or decompression. A
naive solution is to always sort the vocabulary when a new source symbol is inserted or
when the frequency of some source symbol is incremented. However, the sorting after
reading each symbol is inefficient (since all symbols, except for possibly one, remain sorted) and time-consuming. The authors of [15] proposed an elegant solution for DETDC using a special dictionary data structure.
(Figure: the hash-table arrays word, posInVoc and freq, the auxiliary array top indexed by frequency, and the array posInHT holding the vocabulary ordered by non-increasing frequency.)
Figure 2.5: Dictionary data structure
Algorithm 7 Update dictionary algorithm
function UpdateDict(index)
    posInVocInd ← posInVoc[index];
    posInVocTop ← top[freq[index]];
    posInHTInd ← index;
    posInHTTop ← posInHT[top[freq[index]]];
    posInVoc[posInHTInd] ← posInVocTop;
    posInVoc[posInHTTop] ← posInVocInd;
    posInHT[posInVocTop] ← posInHTInd;
    posInHT[posInVocInd] ← posInHTTop;
    top[freq[index]] ← top[freq[index]] + 1;
    freq[index] ← freq[index] + 1;
The dictionary data structure (see Figure 2.5) is formed by a hash table and another
two arrays. The hash table is composed of three arrays, which are indexed by a hash value
of each word. The word array stores the textual representation of the word, the posInVoc
array stores the pointer to the vocabulary, which is in fact the rank of the word. The freq
array stores the frequency of the given word. There are two other arrays. The first array
is called top: it is indexed by frequency and it stores the first word position of a given
frequency in the vocabulary. The second one is the posInHT array (vocabulary itself),
which stores the pointers to words sorted by a non-increasing frequency. The posInHT
27
2.6. (S,C) - DENSE CODE
array contains pointers to words that are grouped into virtual blocks of words with the
same frequency. The auxiliary top array then points to the first pointer of the virtual block
with a given frequency.
The point of the dictionary data structure is that the update operation can be performed very fast (in constant time) by swapping the word that has just occurred with the top word of its original frequency block. The update function is described as Algorithm 7. Both the vocabulary data structure and the update function are also used in our adaptive implementations (TBDC and STDC).
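For concreteness, here is a minimal Python sketch of this data structure (our own illustration; the hash table is replaced by a plain Python dict and the arrays by lists), showing that the update amounts to one swap and two increments:

class DenseDict:
    def __init__(self):
        self.word_id = {}       # word -> internal id (stands in for the hash-table slot)
        self.freq = []          # id -> frequency
        self.pos_in_voc = []    # id -> rank (position in the vocabulary)
        self.pos_in_ht = []     # rank -> id (the vocabulary itself)
        self.top = {}           # frequency -> first rank of the block with that frequency

    def add(self, word):
        # Insert a new word with frequency 1 at the end of the vocabulary.
        wid = len(self.freq)
        self.word_id[word] = wid
        self.freq.append(1)
        self.pos_in_voc.append(len(self.pos_in_ht))
        self.pos_in_ht.append(wid)
        self.top.setdefault(1, self.pos_in_voc[wid])

    def update(self, word):
        # UpdateDict (Algorithm 7): swap the word with the first word of its
        # frequency block, shrink that block and increment the frequency -- O(1).
        wid = self.word_id[word]
        f = self.freq[wid]
        rank, top_rank = self.pos_in_voc[wid], self.top[f]
        top_wid = self.pos_in_ht[top_rank]
        self.pos_in_voc[wid], self.pos_in_voc[top_wid] = top_rank, rank
        self.pos_in_ht[top_rank], self.pos_in_ht[rank] = wid, top_wid
        self.top[f] = top_rank + 1
        self.freq[wid] = f + 1
        self.top.setdefault(f + 1, top_rank)

d = DenseDict()
d.add("to"); d.add("be")
d.update("be")                                   # "be" moves in front of "to"
assert d.pos_in_voc[d.word_id["be"]] == 0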
Example 2.5.2 Suppose the Dense code compressor reads Hamlet’s phrase “To be, or not
to be: that is the question.” The state of the dictionary data structure before and after
reading the first word ‘be’ is depicted in Figure 2.6. The pointers of the word ‘be’ and the
word ‘to’ (which is the top word of a given frequency) in arrays posInVoc and posInHT
were swapped. Furthermore, the top pointer of the original frequency and the frequency
itself were incremented.
(Figure: two snapshots of the hash-table entries and the top/posInHT arrays for the words “to” and “be”: (a) Before UpdateDict, (b) After UpdateDict.)
Figure 2.6: Dictionary: before and after UpdateDict algorithm was performed.
2.6 (s,c)-Dense Code
Another word-based compression method proposed by Brisaboa et al. is (s,c)-Dense Code (SCDC) [16]. The main drawback of ETDC is that it cannot adjust its coding scheme to the word distribution of the compressed text (see Example 2.6.1). SCDC, unlike ETDC, does not use the most significant bit of each block to mark the end of the codeword. It distinguishes so-called stoppers and continuers, where s stands for the number of stoppers, c stands for the number of continuers and s + c = 2^b, where b is the size of a block. An SCDC codeword is then designed as a sequence of zero or more continuers closed by one stopper. SCDC is, in fact, a generalization of ETDC, as ETDC is the (128,128)-Dense Code. SCDC is defined as follows in [16].
Definition 2.6.1 Given source symbols with non-increasing probabilities {pi }0≤i<n , an (s,c)
stop-cont code (where c and s are integers larger than zero) assigns to each source symbol i
a unique target code formed by a base-c digit sequence terminated by a digit between c and
c + s − 1.
Word rank | Codeword assigned by ETDC    | Codeword assigned by SCDC
0         | ⟨1000 0000⟩                  | ⟨0000 1100⟩
1         | ⟨1000 0001⟩                  | ⟨0000 1101⟩
...       | ...                          | ...
127       | ⟨1111 1111⟩                  | ⟨1000 1011⟩
128       | ⟨0000 0000⟩⟨1000 0000⟩       | ⟨1000 1100⟩
...       | ...                          | ...
243       | ⟨0000 0000⟩⟨1111 0011⟩       | ⟨1111 1111⟩
244       | ⟨0000 0000⟩⟨1111 0100⟩       | ⟨0000 0000⟩⟨0000 1100⟩
...       | ...                          | ...
2 999     | ⟨0001 0110⟩⟨1011 0111⟩       | ⟨0000 1011⟩⟨0101 0011⟩
Table 2.6: ETDC, SCDC codewords.
Example 2.6.1 Suppose a text file with 3 000 unique words, each with a frequency fi .
Furthermore, suppose two Dense code schemes: ETDC and SCDC with parameters s =
244 and c = 256 − s = 12. Codewords assigned to the unique words are shown in Table 2.6
(starting with the word with the highest frequency).
We can see that SCDC is a more flexible coding scheme. Using the parameters s and c, SCDC can adjust its coding scheme to the input data. In this example, SCDC, compared to ETDC, saves Σ_{i=128}^{243} fi bytes. SCDC uses the stopper values ⟨0000 1100⟩ – ⟨1111 1111⟩ and the continuer values ⟨0000 0000⟩ – ⟨0000 1011⟩.
Algorithm 8 SCDC encode algorithm.
1: function Encode(r)
2:     ▷ prepend adds the element to the beginning of the list.
3:     bytes ← {}
4:     prepend(bytes, c + (r mod s))
5:     r ← r div s
6:     while r > 0 do
7:         r ← r − 1
8:         prepend(bytes, r mod c)
9:         r ← r div c
Algorithms 8 and 9, performing SCDC encoding and decoding, are very similar to the algorithms ETDC encode and ETDC decode. The reason is that, as stated before, ETDC is a special case of SCDC, namely the (128,128)-Dense Code. The SCDC coder and decoder just take into account the distinction between continuer bytes and stopper bytes (bytes that terminate codewords). For byte-oriented encoding (b = 8), the complexity of the encoding algorithm is O(log_c r) and the complexity of the decoding algorithm is O(length(SCDC(r))).
Algorithm 9 SCDC decode algorithm.
1: function Decode(bytes)
2:     r ← 0
3:     base ← 0
4:     tot ← s
5:     b ← 1
6:     while bytes[b] < c do
7:         r ← r × c + bytes[b]
8:         b ← b + 1
9:         base ← base + tot
10:        tot ← tot × c
11:    r ← r × s + (bytes[b] − c)
12:    r ← r + base
Example 2.6.2 (SCDC )
We encode and decode back a source symbol with rank r = 13 243 using SCDC with parameters s = 132 and c = 124.
We apply Algorithms 8 and 9 to encode and decode rank r = 13 243. The most
important steps of the encoding process and decoding process are described in Table 2.7
and Table 2.8.
Step | Line | Variables
5    | 5    | bytes = {167}, r = 100
8    | 8    | bytes = {99, 167}, r = 99
9    | 9    | bytes = {99, 167}, r = 0
Table 2.7: SCDC: encoding process of Example 2.6.2 (s = 132 and c = 124).
The values s (the number of stoppers) and c (the number of continuers) are clearly the parameters of the SCDC coding scheme. In [16, 12] it is proved that, for real text collections, the size of the compressed stream viewed as a function of the parameter s has only one local minimum.
Step | Line | Variables
5    | 5    | bytes = {99, 167}, r = 0, base = 0, tot = 132, b = 1
10   | 10   | bytes = {99, 167}, r = 99, base = 132, tot = 16 368, b = 2
11   | 11   | bytes = {99, 167}, r = 13 111, base = 132, tot = 16 368, b = 2
12   | 12   | bytes = {99, 167}, r = 13 243, base = 132, tot = 16 368, b = 2
Table 2.8: SCDC: decoding process of Example 2.6.2 (s = 132 and c = 124).
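A direct Python transliteration of Algorithms 8 and 9 (our own sketch; byte-oriented, with continuers occupying the byte values 0 .. c−1 and stoppers c .. 255 as in Example 2.6.1) reproduces the example above:

def scdc_encode(r: int, s: int, c: int) -> list:
    # Mirror of Algorithm 8.
    out = [c + (r % s)]         # the stopper byte terminates the codeword
    r //= s
    while r > 0:
        r -= 1
        out.insert(0, r % c)    # continuer bytes
        r //= c
    return out

def scdc_decode(code: list, s: int, c: int) -> int:
    # Mirror of Algorithm 9.
    r, base, tot, i = 0, 0, s, 0
    while code[i] < c:
        r = r * c + code[i]
        i += 1
        base += tot
        tot *= c
    return r * s + (code[i] - c) + base

# Example 2.6.2: s = 132, c = 124, rank 13 243 <-> bytes {99, 167}
assert scdc_encode(13_243, 132, 124) == [99, 167]
assert scdc_decode([99, 167], 132, 124) == 13_243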
Consider a vocabulary data structure storing the source symbols si sorted by their frequencies fi in non-increasing order. Furthermore, suppose some additional information: a cumulative frequency cfi for each source symbol si such that cfi = Σ_{j=0}^{i} fj. Consider the SCDC coding scheme with parameters s and c. It provides s one-byte codewords, sc two-byte codewords, sc^2 three-byte codewords and so on. Let ℓ be the maximal length of a codeword in the SCDC coding scheme, i.e. the minimal number such that N ≤ Σ_{i=0}^{ℓ−1} sc^i, where N is the number of unique words in the text collection. Then the size of the compressed stream can easily be expressed as Σ_{i=0}^{ℓ−1} (i + 1) cf_{sc^i}.
Since there is only one local minimum, we can use a binary search algorithm and choose the next subinterval according to the value given by the formula Σ_{i=0}^{ℓ−1} (i + 1) cf_{sc^i}. The optimal value of s can be found in log2 256 = 8 steps. The overhead of computing the cfi values is O(N).
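A minimal Python sketch of this search (our own illustration: the stream size is recomputed directly from the sorted frequencies instead of using the cumulative values cfi, and the helper names are ours):

def scdc_size(freqs, s):
    # Total size in bytes of the SCDC-compressed stream for frequencies sorted
    # in non-increasing order and s stoppers (c = 256 - s): the first s words
    # cost 1 byte per occurrence, the next s*c words 2 bytes, and so on.
    c = 256 - s
    total, covered, length, block = 0, 0, 1, s
    while covered < len(freqs):
        total += length * sum(freqs[covered:covered + block])
        covered += block
        length += 1
        block *= c
    return total

def best_s(freqs):
    # Binary search over s in [1, 255]; valid because the size, viewed as a
    # function of s, has a single local minimum on real text collections.
    lo, hi = 1, 255
    while lo < hi:
        mid = (lo + hi) // 2
        if scdc_size(freqs, mid) < scdc_size(freqs, mid + 1):
            hi = mid
        else:
            lo = mid + 1
    return lo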
2.6.1 Adaptive version of SCDC
The adaptive version of SCDC is called Dynamic (s,c)-Dense Code (DSCDC) [14]. The
compressor processes the input text and when a new word occurs, a special escape symbol
is transmitted followed by the plain text form of the word. When an already known word
occurs, the compressor transmits the codeword assigned to the word and updates the
vocabulary (see Algorithm 5).
Algorithm 10 DSCDC tuning algorithm.
1: function TuneS(s, wi)    ▷ s represents the number of stoppers and wi is the just-read word from the input.
2:     prev ← prev + CountBytes(wi, s − 1);
3:     curr ← curr + CountBytes(wi, s);
4:     next ← next + CountBytes(wi, s + 1);
5:     if curr − prev > thresh then
6:         s ← s − 1;
7:         prev ← 0; curr ← 0; next ← 0;
8:     if curr − next > thresh then
9:         s ← s + 1;
10:        prev ← 0; curr ← 0; next ← 0;
The compressor and decompressor also adjust the values s and c = 2^b − s according to the word distribution as the compression continues. However, this tuning technique does not treat each byte (block) position separately; it tunes the values globally, regardless of the single bytes (blocks). The tuning function is described as Algorithm 10. The compressor (as
well as the decompressor) defines three variables: curr, prev and next storing the number
of output bytes used to encode the input text using the number of stoppers s (current s
value), s−1 and s+1, respectively. When the difference curr −prev or curr −next exceeds
some threshold thresh (using parameter s − 1 or s + 1 is more effective, see lines 5 and 8),
then the number of stoppers is decremented or incremented (lines 6 and 9) and all three
variables curr, prev and next are set to zero (lines 7 and 10).
Another evident drawback of SCDC is that this tuning technique is very unfriendly to
small files (with a size lower than 1 MB).
2.7 Dynamic Lightweight End-Tagged Dense Code
Dynamic Lightweight End-Tagged Dense Code (DLETDC), proposed by Brisaboa et al.
in [13], is an adaptive dense code using the same coding scheme as DETDC. Its contribution
is not in the improvement of the compression ratio but in its application possibilities
in the field of digital libraries and other textual databases. To the knowledge of the
authors, DLETDC is the first adaptive compression method with direct search capabilities,
permitting direct search of the compressed text without decompressing it. Up to now, it
was usual that only semi-adaptive compression methods were suitable for direct searching
on the compressed text. Imagine the classic server/client architecture. In the situation
when only some parts of the compressed data (e.g. containing some of the keywords)
should be presented on the client side, it is necessary to send the wanted portion of the
compressed data together with the corresponding vocabulary. The problem is that the
vocabulary is usually very large, since it covers a much bigger portion of the compressed data. This situation is very inconvenient, especially when the bandwidth between the server and the client is limited.
DLETDC is designed exactly for these situations. Compression and decompression
processes of DLETDC are asymmetrical. During the compression, the statistical model is
normally updated. The words change their ranks according to their frequency. However,
the change of the rank does not automatically mean a change of the codeword. The
codeword is changed only if the word also changes (by the change of its rank) the length
of its codeword. The authors of [13] proved that these changes appear especially at the
beginning of the compression and are not so frequent later. The change of the codeword
is expressed by the triple ⟨Cswap, Ci, Cj⟩, where Cswap is the reserved codeword and Ci and Cj are the swapped codewords.
This gives the decompression process very interesting properties. The decompressor, unlike the compressor, does not keep the statistical model, but only the mapping of words to codewords. Since changes of the codewords occur less frequently as the compression proceeds, the decompressor only rarely has to change the mapping of words to codewords. The decompression process is then much faster and consists only of substituting the codewords with the words.
Direct searching on the compressed text is very simple in this setting. First, the searched word w appears in plain text form and an initial codeword Cw is assigned to it. The algorithm then just looks for occurrences of the codeword Cw and also for any possible change of the codeword, signalled by a triple ⟨Cswap, Ci, Cj⟩ where Ci = Cw or Cj = Cw.
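A minimal sketch of the decompressor side of this mechanism (our own illustration; how the triple is serialized in the stream is not modelled here): the decompressor keeps only a codeword-to-word map, and a triple ⟨Cswap, Ci, Cj⟩ makes the words assigned to Ci and Cj exchange their codewords.

def apply_swap(code_to_word: dict, ci: bytes, cj: bytes) -> None:
    # Apply a <Cswap, Ci, Cj> triple: the words currently assigned to the
    # codewords Ci and Cj exchange their codewords; ordinary decoding remains
    # a plain lookup in code_to_word.
    code_to_word[ci], code_to_word[cj] = code_to_word.get(cj), code_to_word.get(ci)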
2.8 Restricted Prefix Byte Codes
The proposal of Restricted Prefix Byte Codes (RPBC) by Culpepper and Moffat in [39] extends the idea of SCDC and adds more flexibility to the coding scheme. The authors argue that codewords with a maximum length of four bytes are sufficient for most compressed text collections. The basic idea of RPBC is that the first byte of the codeword uniquely identifies the number of following bytes (in the given codeword). The coding scheme then has four parameters (v1, v2, v3, v4), which, in sequence, define the number of codewords of length one, two, three and four bytes. The coding scheme then provides v1 one-byte codewords, v2 × 256 two-byte codewords, v3 × 256^2 three-byte codewords and v4 × 256^3 four-byte codewords.
Two conditions must be satisfied: v1 + v2 + v3 + v4 ≤ 256 (ensuring that the first byte determines the parameters) and v1 + v2 × 256 + v3 × 256^2 + v4 × 256^3 ≥ N (ensuring that the size of the coding space is greater than or equal to the number of unique words N).
The first byte of an RPBC codeword always determines the length of the codeword. The decompressor exploits a simple array suffix[i] returning the number of bytes that follow a first byte with value i ∈ {0, 1, ..., 255}. The coding scheme of RPBC is more complex (using the four parameters v1, v2, v3 and v4) than the coding scheme of SCDC. Thus, the decompressor uses a second auxiliary array first[i] to store the number of combinations (ranks) provided by lower numbers of bytes. Algorithm 11 describes the decompression of a message compressed using RPBC. The decompressor reads the first byte of a codeword (see line 4), then reads the next suffix[b] bytes and accumulates them in radix 256 (see line 7). Finally, the decompressor outputs the result (variable offset) increased by the number of combinations provided by codewords of shorter length (see line 8). The initialization of the arrays suffix and first is performed in the function CreateTables (see line 9).
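The following minimal Python sketch (our own transliteration of Algorithm 11 below, with names following the pseudocode) builds the two tables and decodes a whole byte stream:

def rpbc_tables(v, R=256):
    # Mirror of CreateTables: suffix[b] is the number of bytes that follow a
    # first byte with value b, first[b] the smallest rank whose codeword starts
    # with byte value b.
    suffix, first = [], []
    start, step = 0, 1
    for extra_bytes, count in enumerate(v):      # v = (v1, v2, v3, v4)
        for _ in range(count):
            suffix.append(extra_bytes)
            first.append(start)
            start += step
        step *= R
    return suffix, first

def rpbc_decode(stream, v, R=256):
    # Mirror of DecodeMessage for a whole byte stream.
    suffix, first = rpbc_tables(v, R)
    out, it = [], iter(stream)
    for b in it:
        offset = 0
        for _ in range(suffix[b]):               # read the remaining bytes of the codeword
            offset = offset * R + next(it)
        out.append(first[b] + offset)
    return out

# With v = (2, 1, 0, 0): ranks 0 and 1 get one-byte codewords and byte value 2
# opens a two-byte codeword, so [0, 2, 7] decodes to the ranks [0, 2 + 7].
assert rpbc_decode([0, 2, 7], (2, 1, 0, 0)) == [0, 9]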
RPBC allows direct searching on the compressed text. However, RPBC, unlike ETDC
and SCDC, cannot use standard pattern-matching algorithms with their shifting techniques. It has to use its own shifting to keep the scanned codewords aligned. The shifting
is given by the first byte of each codeword that determines its length (suffix [i ]).
The natural question is how the values of the parameters v1, v2, v3 and v4 are determined. The authors of [39] propose a brute-force approach using (similarly to SCDC) the cumulative frequencies cfi of the single source symbols. Algorithm 12 describes this brute-force computation of the parameters v1, v2, v3 and v4. The algorithm first computes the cumulative frequencies of the single symbols (see line 4).
Algorithm 11 RPBC decode message algorithm.
1: function DecodeMessage(m)    ▷ m represents the length of the input message.
2:     CreateTables(v1, v2, v3, v4, 256)
3:     for i ← 0 to m − 1 do
4:         b ← get_byte()
5:         offset ← 0
6:         for j ← 1 to suffix[b] do
7:             offset ← offset × 256 + get_byte()
8:         output_block[i] ← first[b] + offset
9: function CreateTables(v1, v2, v3, v4, R)
10:    start ← 0
11:    for i ← 0 to v1 − 1 do
12:        suffix[i] ← 0; first[i] ← start; start ← start + 1
13:    for i ← v1 to v1 + v2 − 1 do
14:        suffix[i] ← 1; first[i] ← start; start ← start + R
15:    for i ← v1 + v2 to v1 + v2 + v3 − 1 do
16:        suffix[i] ← 2; first[i] ← start; start ← start + R²
17:    for i ← v1 + v2 + v3 to v1 + v2 + v3 + v4 − 1 do
18:        suffix[i] ← 3; first[i] ← start; start ← start + R³
Furthermore, the algorithm traverses all viable combinations (v1, v2, v3, v4) (the validity of a combination is checked by the conditions on lines 13, 15 and 17) and evaluates the size of the compressed text for the current combination of parameters (v1, v2, v3, v4). If the current combination produces a smaller output, it is stored as the temporary optimum (see lines 11 and 12). The time complexity of the algorithm is O(N + N³) = O(N³).
Another aim discussed in [39] is to reduce the overhead when coding the input text
as single message blocks. A so-called prelude is transmitted prior to each block and it
defines the parameters (v1 , v2 , v3 , v4 ) as well as the permutation of the alphabet defining
the vocabulary of the code. There are three ways proposed to express the permutation of
the alphabet. The permutation can be expressed as bit-vectors defining the presence of the
word in the block and its codeword length. Furthermore, the bit-vectors can be encoded
as a sequence of gaps, reducing the overhead when the bit-vectors are sparse. Finally, the words can be divided into high-frequency words, which are defined in the prelude, and low-frequency words, for which default codewords are used. TBDC (see Section 4.1) can be seen as RPBC with parameters (v1, v2, 0, 0). However, it comes from an even more general coding scheme, Open Dense Code (ODC), proposed independently in [58].
Algorithm 12 RPBC calculate (v1, v2, v3, v4).
1: function CalcParam(N, R)
2:     C[0] ← 0
3:     for i ← 0 to N − 1 do
4:         C[i + 1] ← C[i] + f[i]
5:     mincost ← partial_sum(0, N) × 4
6:     for i1 ← 0 to R do
7:         for i2 ← 0 to R − i1 do
8:             for i3 ← 0 to R − i1 − i2 do
9:                 i4 ← ⌈(N − i1 − i2R − i3R²)/R³⌉
10:                if i1 + i2 + i3 + i4 ≤ R and cost(i1, i2, i3, i4) < mincost then
11:                    (v1, v2, v3, v4) ← (i1, i2, i3, i4)
12:                    mincost ← cost(i1, i2, i3, i4)
13:                if i1 + i2R + i3R² ≥ N then
14:                    return
15:            if i1 + i2R ≥ N then
16:                return
17:        if i1 ≥ N then
18:            return
19:    return (v1, v2, v3, v4)
20: function partial_sum(lo, hi)
21:    if lo > N then
22:        lo ← N
23:    if hi > N then
24:        hi ← N
25:    return C[hi] − C[lo]
26: function cost(i1, i2, i3, i4)
27:    return partial_sum(0, i1) × 1 +
28:           partial_sum(i1, i1 + i2R) × 2 +
29:           partial_sum(i1 + i2R, i1 + i2R + i3R²) × 3 +
30:           partial_sum(i1 + i2R + i3R², i1 + i2R + i3R² + i4R³) × 4
2.9 Variable-to-Variable Dense Code
All of the byte-oriented compressors presented up to now could be called fixed-to-variable compressors. They follow the word-based approach and assign codewords of variable length to single words of fixed length, which is always one symbol in terms of the word-based approach. Brisaboa et al. proposed a new Variable-to-Variable Dense Code (V2VDC) [11]. V2VDC makes it possible to assign codewords from the same set of codewords to both single words and phrases (sequences of words).
The compression ratio of all byte-oriented word-based compression methods is lower-bounded by Plain Huffman [19], and these methods usually achieve substantially poorer results than Bzip2 [17] or PPM-based compressors [18]. V2VDC overcomes this bound by taking into account the frequent phrases occurring in the text. Clearly, the key to an attractive compression ratio is the selection of the phrases. However, this is a more general problem and it is studied especially in the field of grammar-based compression. Grammar-based compression methods (e.g. Re-Pair [45] or Sequitur [56]) basically try to find the minimal grammar for a given input text, which is an NP-complete problem [47]. This compressed grammar then describes the input message.
The strategy of V2VDC is to find all non-overlapping phrases with a frequency higher
than some parameter minFreq. The searched phrases are flat, i.e. not containing each
other. The algorithm makes the first pass over the text and substitutes all words with
their ranks (ids) according to their frequency. Then it builds a suffix array SA [49] of
the input text and looks for the longest common prefix LCP (in terms of assigned ids)
between the suffixes SA[j − 1] and SA[j]. The phrases that occur at least minFreq times
are included into the phrase candidates. The phrase candidates are evaluated according to
an appropriate heuristic in the next phase and so the final phrase book is formed.
The simplest heuristic to accept or discard a phrase phi depends only on the following condition [11]: freq(phi) ≥ minFreq. A more sophisticated heuristic tries to estimate the compression gain obtained by accepting a phrase phi into the final phrase book. The gain estimate is computed as follows: (bytes_before − bytes_after) × (freq(phi) − 1) − 2, where bytes_before = Σ_j |C_wj| is the size of the codewords of the single words wj that appear in phrase phi (assuming phi is discarded); bytes_after = |C_phi| is the size of the codeword assigned to the phrase phi (assuming phi is accepted); and −1 and −2 are related to the cost of adding phi to the phrase book.
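For illustration, the gain estimate is a one-line computation (a sketch under the formula above; the parameter names are ours):

def phrase_gain(bytes_before: int, bytes_after: int, freq: int) -> int:
    # bytes_before: total codeword length of the single words of the phrase,
    # bytes_after: codeword length of the phrase itself, freq: freq(ph_i).
    return (bytes_before - bytes_after) * (freq - 1) - 2

# A two-word phrase whose words need 2 bytes each, which itself gets a 2-byte
# codeword and occurs 50 times, saves an estimated (4 - 2) * 49 - 2 = 96 bytes.
assert phrase_gain(4, 2, 50) == 96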
Final phrases are then sorted by their frequency and each phrase occurrence in the text
(ids of the words of the phrase) is substituted by the id of the phrase. At the end, the
text (the sequence of ids) is compressed using ETDC.
V2VDC achieves only a slightly worse compression ratio than Re-Pair or PPM-based compression methods. On the other hand, its compression ratio clearly outperforms Gzip and is very often better than Bzip2. The compression gain is between 8 and 10 percentage points in comparison to the standard fixed-to-variable dense compressors (ETDC, SCDC).
What is most important: V2VDC is very fast in decompression and allows simple and fast
searching on the compressed text. For detailed results see [11].
2.10 Boosting Text Compression with Byte Codes
Word-based compression methods using byte codes are very attractive for their acceptable
compression ratio and for high compression and, especially, decompression speed. However,
byte codes are not very suitable for the context compression of higher order. The reason is
that the codewords composed of a sequence of bytes cannot easily encode extremely biased
distributions with a small number of elements that occur in the contexts of higher order.
Suppose the average length of a word w in English is 5 characters [5], |w| = 5, and suppose the minimum length of a codeword c is 1 byte, |c| = 1. It means that the lower bound of the compression ratio is |c|/|w| × 100 = 1/5 × 100 = 20 % for any context.
However, Fariña et al. revealed in [26, 27] that the byte codes can successfully work
as so-called compression boosters for the context compression methods operating with
higher order such as PPM (Prediction by Partial Matching) compressors [18]. The wordbased compressors with byte-oriented codewords are used as a preprocessing step, and
boost not only the compression ratio, but also the compression and decompression time.
Furthermore, the sequential searching of a pattern in the compressed text is also boosted.
The results presented in [27] prove that the preprocessing improves the compression ratio
up to 11 percentage points, accelerates compression (up to 5 times) and decompression (up
to 85 %).
Another similar compression boost was presented by Adiego et al. in [3] as an adaptive
compressor MPPM. This compressor maps the words occurring in the text to two-byte codewords and encodes the resulting byte stream using a PPM compressor. As the two-byte coding scheme is limited to 2^16 = 65 536 identifiers, MPPM has to implement some pruning strategy to release positions for newly arriving words. A semi-adaptive version, MPPM2, was proposed later in [2].
The authors of [26, 27] prove in their experiments that the preprocessing step with a
dense compressor (generally a word-based byte-oriented compressor) improves the compression ratio and (de)compression time. The reason for the improvement in terms of
compression ratio is that the backend compressor (e.g. PPM) can work with a lower order
(of context) to achieve the same compression ratio as without the preprocessing step. Suppose the average length of a word is 5 characters in English. Compressors such as PPM
need the context of order 10 to cover the correlation between two consecutive words. Such
a high order usually means an extremely high number of contexts. The upper bound of the number of contexts is 26^10 (the real number is much lower, but still very high). On the other hand, while compressing a byte stream produced by a dense compressor, the backend compressor can use only order 3 (since the average codeword length of a word in the text is less than 2 bytes for dense compression) to cover the relationship between two consecutive words. In other words, the preprocessing step enables the backend modeller to capture a longer correlation in the text and so achieve a better compression ratio. The reduction of the order allows for more sophisticated modelling, while also reducing memory consumption and processing time.
Another important parameter in context compression is the number of contexts, which
significantly influences the resulting compression ratio. Every new context must be encoded (using a table for semi-adaptive methods or an escape symbol for adaptive compression methods). The number of new contexts grows with the order of the context; thus, the number is significantly lower when a dense compressor is used as a preprocessing step. The authors of [27] report 12 647 531 contexts of order up to k = 10 for a character-based approach without any preprocessing and 1 853 531 contexts of order up to k = 3 for text preprocessed by ETDC. Suppose that every entry of a context is escaped with a cost of log2 |Σ| bits, where |Σ| is the alphabet size. Then another important parameter influencing the compression ratio is the size of the alphabet. It is |Σ| = 256 for an ETDC-preprocessed data stream, but it can be much higher for the word-based approach, where the alphabet is unbounded. In this regard, the ETDC-preprocessed text also outperforms a pure word-based approach without any preprocessing (e.g. word-based PPM).
Chapter 3
Novel Approaches to Dense Coding
The basic motivation of our work is to define a set of compression methods applicable to compressing natural language text in practice. Different situations require different compression approaches. Digital libraries, where the text files are compressed once when they enter the database and can be searched in their compressed form, require a different compression approach than, for example, chunks of text sent between a server and a client that need to be processed on the fly. On the same basis, we define a few compression methods that are applicable in different situations. The common basis has the following properties:
1. Spaceless word-based modelling,
2. byte coding (i.e. the codewords in the form of a sequence of bytes),
3. a common definition frame called Open Dense Code (ODC ).
Word-based modelling allows capturing longer correlations in the text while still keeping the possibility of arbitrary searching (phrase searching, searching allowing errors or substring searching [42]). For details about word-based modelling see Section 2.2. Byte coding provides very fast compression, decompression and pattern matching in the compressed data stream at the expense of a small loss in the compression ratio.
The idea of Open Dense Code (ODC) [58, 62] is to give a formalized prescription for different dense codes applicable in different situations. ODC combines two ideas: (i) the idea of dense coding, i.e. expressing the rank ri of a given word wi as a combination of bits in a sequence of bytes; (ii) the idea of defining the syntax of codewords using a grammar. ODC underlines the common features of the dense code schemes but, at the same time, allows expressing the differences between them.
Using the frame of ODC, we present two new word-based statistical adaptive compression algorithms based on the dense coding idea: Two-Byte Dense Code (TBDC) and Self-Tuning Dense Code (STDC). The aim of these two compression methods is to provide a simple compression scheme suitable for on-the-fly compression of natural language text. Both compression methods are very fast in compression and decompression. TBDC, as well
as STDC, uses adaptive modelling, and both are very considerate to small files, a property neglected by their immediate competitors DETDC (see Section 2.5.2) and DSCDC (see Section 2.6.1).
TBDC is based on the idea of limiting the coding scheme. If a codeword is defined as either a single byte or a sequence of two bytes, then the idea of marking every byte can be abandoned and each codeword needs to carry only a single binary piece of information: whether its length is one or two bytes. The coding space saved by this simpler marking can be exploited for additional positions storing the ranks ri of the words wi. Thus TBDC achieves a better compression ratio thanks to its coding scheme.
STDC is based on DSCDC, so its coding scheme is defined by the parameters s (the number of stoppers) and c = 256 − s (the number of continuers). For details about a compression scheme using stoppers and continuers see Section 2.6. STDC (unlike DSCDC) is focused on tuning each byte separately. This can have a surprisingly large impact on the resulting compression ratio, especially when compressing very small files.
Another proposed compression method is Semi-adaptive Two-Byte Dense Code (STBDC) [59, 60], which is a semi-adaptive version of TBDC. The syntax of the codewords is the same as in the case of TBDC. The semi-adaptive modelling, as well as the coding scheme limited to two bytes, predetermines the block orientation of STBDC. The compression method is focused on very large, potentially artificial portions of natural language text. STBDC splits the input text into several blocks and exploits the locality of the text to transform the compression model to be optimal for a given block, i.e. global words (words occurring in all blocks) are kept and new local words substitute the local words of the previous block. The point of this approach lies in the improvement of the compression ratio together with other interesting application properties. One of them is that the vocabulary can serve as a simple block index [50], since it is composed as a set of changes transforming the model of the previous block to the model optimal for the current block.
(Figure: the input text is split into blocks Bi−1, Bi, Bi+1; in the first pass the compressor builds the models αi and stores the changes βi in the vocabulary file; in the second pass it applies the changes βi to obtain the models γi and writes the compressed blocks Ci−1, Ci, Ci+1 to the output file.)
Figure 3.1: STBDC: the compression process.
Figure 3.1 illustrates the compression process. The compressor processes single
blocks Bi of the input text. During the first pass over the text, the compressor collects
frequency counts of single words and so builds the statistical model αi . Furthermore, the
compressor evaluates the changes transforming the statistical model of block Bi−1 to the
model of block Bi and stores these changes βi to the vocabulary file. During the second
pass of the compression process, the compressor evaluates the changes βi and transforms
model γi−1 to the model γi . Using this optimal model γi , the compressor performs encoding
of a current block Bi and stores the compressed block Ci to the output file. The decompression process mimics the second pass of the compression process. The decompressor
performs the changes βi to obtain an optimal model γi and then performs the decoding
according to the optimal model γi .
The last of the proposed compression methods is called Set-of-Files Semi-adaptive Two-Byte Dense Code (SF-STBDC) [61]. This compression method is applicable in situations where it is necessary to compress a set of files, where the single files must be searchable without decompression and must provide random access and decompression of an arbitrary part of the file. This situation arises in the case of web search engines, which work with a huge number of text files and need to search the compressed files and decompress small portions of the text to assemble a so-called snippet¹.
(Figure: the files f1, ..., fn share a global model; for a given file fi the local model βi transforms the global model into the proper model γi used to compress fi.)
Figure 3.2: SF-STBDC: the compression process.
SF-STBDC is based on the idea of transforming the global model (a model common
for all files of the set f1 , ..., fn ) to the proper model γi of a given file fi . The set of
changes βi (a local model ), transforming the global model for a given file fi , is stored as a
part of the vocabulary file. During the compression process (see Figure 3.2) of a file fi , the
compressor collects necessary statistics and builds the proper model γi . The compressor
further evaluates the changes βi transforming the global model to the proper model γi and
stores them to the vocabulary file. Then the compressor performs a compression according
to the proper model γi . Thanks to the usage of the global model, SF-STBDC achieves an
extraordinarily good compression ratio and still allows searching on the compressed text
and random access to compressed files.
¹A short phrase surrounding the searched pattern in the text.
Chapter 4
Contribution to Dense Coding
4.1 Open Dense Code
Let us introduce a novel concept of dense coding called Open Dense Code (ODC). ODC attempts to cover ETDC, SCDC and other codes based on the dense coding idea. The basic motivation is to allow the definition of a formalized prescription of the codewords, which can be followed in combination with the dense coding idea.
Definition 4.1.1 The b-ary Open Dense Code (ODC) is a couple ⟨b, G⟩, where b is the size of a block and G = (N, T, P, S) is a grammar defining the syntax of the code. ODC assigns a codeword cr of k blocks to the r-th most frequent symbol (starting with r = 0), which satisfies the following conditions:
(1) cr ∈ L(G),
(2) cr is not a prefix of any other codeword ci ∈ L(G),
(3) Σ_{i=1}^{k−1} Π_{j=1}^{i} vji ≤ r < Σ_{i=1}^{k} Π_{j=1}^{i} vji, where vji is the number of combinations that can occur as the j-th block in a codeword of length i.
The first condition ensures that the codeword cr is defined by the grammar G. The next
condition ensures that the code is a prefix code, which means that it is unambiguously
decodable. The last condition provides the dense coding property.
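As a small illustration of condition (3) (our own sketch; the nested list v is an assumed representation of the values vji), the number of blocks k of the codeword assigned to rank r can be computed as follows:

from math import prod

def codeword_length(r: int, v) -> int:
    # v[i-1] lists the per-block combination counts (v_1i, ..., v_ii) for
    # codewords of length i; the rank r (starting at 0) gets the smallest k
    # with r < sum over i <= k of prod_j v_ji.
    covered = 0
    for k, blocks in enumerate(v, start=1):
        covered += prod(blocks)
        if r < covered:
            return k
    raise ValueError("rank outside the coding space")

# ETDC as an ODC instance: 128 one-byte and 128*128 two-byte codewords.
etdc_v = [[128], [128, 128]]
assert codeword_length(127, etdc_v) == 1 and codeword_length(128, etdc_v) == 2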
Lemma 4.1.1 ETDC and SCDC belong to ODC and are defined as follows:
b = 8; G = (N, T, P, S) :
N = {Codeword },
T = {s, c},
P is defined in Table 4.1,
S = Codeword,
# Rule
1 Codeword → c Codeword
2 Codeword → s
Table 4.1: ETDC, SCDC : set of rules P .
where symbol c represents a continuer symbol, c ∈ {0, ..., 127} for ETDC, c ∈
{0, ..., cont − 1} for SCDC, and symbol s represents a stopper symbol, s ∈ {128, ..., 255}
for ETDC, s ∈ {cont, ..., 255} for SCDC.
Proof. ETDC is in fact a special case of SCDC with parameters s = c = 128; therefore,
we need to prove the lemma only for SCDC. We need to prove that the definition of SCDC
satisfies the three conditions mentioned in Definition 4.1.1.
(1) The grammar G derives language L(G) = {cn s; n ≥ 0}. SCDC codewords are defined as a sequence of zero or more continuers terminated by one stopper. So obviously all
SCDC codewords are words of language L(G).
(2) Suppose two codewords X and Y , where |X| = n, |Y | = m where n < m. X cannot be
a prefix of Y since n-th block of X is a stopper while n-th block of Y is a continuer and
the sets of stoppers and continuers are disjoint.
(3) Within SCDC, vji = c for j < i and vji = s for j = i. An SCDC codeword composed of one block provides s different combinations, which means s distinct codewords. Two-block codewords cover c × s distinct codewords, and so on. Generally, n blocks provide c^{n−1} × s different combinations for c^{n−1} × s distinct codewords. The expression c^{n−1} × s can also be written as Π_{j=1}^{n} vj, where vj = c for j < n and vj = s for j = n. Then the number of codewords covered by up to n blocks can be expressed as Σ_{i=1}^{n} Π_{j=1}^{i} vji, where vji = c for j < i and vji = s for j = i.
The assignment of the codewords is done in order of frequency (a more frequent symbol gets a shorter codeword). It means that if the r-th most frequent symbol gets a codeword composed of k blocks, rank r must be greater than or equal (r starts with zero) to the number of codewords covered by up to k − 1 blocks. It must also hold that r is lower than the number of codewords covered by up to k blocks. Otherwise, the codeword assigned to the r-th most frequent symbol could not be k blocks long.
4.1.1 Two-Byte Dense Code
TBDC is an adaptive compressor based on the dense coding idea. The basic motivation of TBDC is to improve the compression ratio while keeping the very good compression and decompression speed of all dense compressors, which is very close to that of character-based statistical compression methods (e.g. Huffman code). Suppose that we compress only natural portions of text. Then the set of unique words occurring in a single portion is very limited. Following Heaps’ Law [34], the English version of the Bible is approximately 4 MB in size and contains only 13 413 unique words. When performing DETDC on any natural language text, the compressor never uses codewords longer than 3 bytes, because one-byte, two-byte and three-byte codewords provide coding space for 128 + 128² + 128³ = 2 113 664 unique words, which satisfies almost all natural language texts.
The facts mentioned above lead us to the idea of a code that is focused on natural portions of text (with a size between 1 and 10 MB) and uses only one-byte and two-byte codewords. This change allows the compressor to abandon the marking of every byte¹ in any form (stopper/continuer as in DSCDC, or the most significant bit of each byte as in DETDC). Instead, the compressor needs to mark only whether the codeword has size 1 B or 2 B. It means that only the first byte is affected by this marking, so the one-byte and two-byte codewords can cover more words and the algorithm can achieve a better compression ratio (see Example 4.1.1). On the other hand, TBDC must be implemented with some pruning technique, which can prune the vocabulary in case the input text contains more unique words than the coding space can cover.
To mark one-byte or two-byte codewords, TBDC uses an idea similar to DSCDC : the
idea of stoppers and continuers. The codewords of size 1B are composed only of one stopper
and the codewords of size 2B are composed of one continuer followed by another byte in
which any combination of bits is allowed. Using stoppers and continuers in the first byte
of the codeword ensures that TBDC is a prefix code. Suppose two codewords X and Y ,
where |X| = 1 and |Y | = 2. The codeword X is composed of one stopper and the codeword
Y is composed of one continuer followed by another byte. It means that the codewords X
and Y already differ in the first byte and so X cannot be a prefix of Y .
Word rank | Codeword assigned by DSCDC             | Codeword assigned by TBDC
0         | ⟨0010 0111⟩                            | ⟨0000 0000⟩
1         | ⟨0010 1000⟩                            | ⟨0000 0001⟩
...       | ...                                    | ...
216       | ⟨1111 1111⟩                            | ⟨1101 1000⟩
217       | ⟨0000 0000⟩⟨0010 0111⟩                 | ⟨1101 1001⟩⟨0000 0000⟩
...       | ...                                    | ...
8 679     | ⟨0010 0110⟩⟨1111 1111⟩                 | ⟨1111 1001⟩⟨0000 1110⟩
8 680     | ⟨0000 0000⟩⟨0000 0000⟩⟨0010 0111⟩      | ⟨1111 1001⟩⟨0000 1111⟩
...       | ...                                    | ...
9 999     | ⟨0000 0000⟩⟨0000 0110⟩⟨0001 0001⟩      | ⟨1111 1111⟩⟨0011 0110⟩
Table 4.2: DSCDC, TBDC codewords.
¹We define only byte-oriented codes in this section (i.e. with the size of block b = 8). Thus, we substitute the word block for the word byte in the following text.
Example 4.1.1 Suppose a text file with 10 000 unique words each with a frequency fi .
Furthermore, suppose two Dense coding schemes: DSCDC and TBDC, both with parameters s = 217 and c = 256 − s = 39. Notice that we compare only the coding schemes,
not the compression methods, so we omit some conditions of the real compression, e.g.
the escape symbol. The codewords assigned to the unique words are shown in Table 4.2
(starting with the word with the highest frequency).
Using the parameters s and c, both DSCDC and TBDC can adjust their coding schemes to the input data. Furthermore, TBDC can exploit all possible combinations in the second byte and so cover all 10 000 unique words, whereas DSCDC has to use three-byte codewords for the words wi where i ∈ [8 680; 9 999]. Thus, in this example, TBDC compared to DSCDC saves Σ_{i=8 680}^{9 999} fi bytes.
Definition 4.1.2 TBDC is ODC (see Definition 4.1.1), where we can define TBDC as
follows:
b = 8; G = (N, T, P, S) :
N = {Codeword },
T = {s, c, b},
P is defined in Table 4.3,
S = Codeword,
# Rule
1 Codeword → c b
2 Codeword → s
Table 4.3: TBDC : set of rules P .
where symbol s represents a stopper symbol, s ∈ {1, ..., si }; symbol c represents a continuer symbol, c ∈ {si + 1, ..., 255}; symbol b represents a byte in which any combination of
bits is allowed, b ∈ {0, ..., 255}. Symbol si represents the number of stoppers in the i-th step
of compression. Similarly, symbol ci represents the number of continuers in the i-th step
of compression. In every step i of the compression, the following must hold: si + ci = 255.
The codeword 0 is reserved for a special escape symbol.
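A minimal Python sketch of this coding scheme (our own illustration following Definition 4.1.2; the escape symbol only reserves byte value 0 here, and the adaptive maintenance of s is omitted):

def tbdc_encode(r: int, s: int) -> bytes:
    # Byte value 0 is the escape symbol, values 1..s are stoppers (one-byte
    # codewords for the s most frequent words), values s+1..255 are continuers
    # (first bytes of two-byte codewords); the second byte may hold any value.
    if r < s:
        return bytes([r + 1])
    r -= s
    return bytes([s + 1 + r // 256, r % 256])

def tbdc_decode(code: bytes, s: int) -> int:
    if code[0] <= s:
        return code[0] - 1
    return s + (code[0] - s - 1) * 256 + code[1]

# The coding space covers s + (255 - s) * 256 ranks, which is exactly the
# quantity checked by CheckSpace in Algorithm 13.
assert all(tbdc_decode(tbdc_encode(r, 200), 200) == r
           for r in range(200 + (255 - 200) * 256))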
It is evident from Definition 4.1.2 that the number of stoppers si can change as the
compression proceeds. We adjust the number of stoppers si to the actual number of
unique words in the vocabulary. In every step i of the compression, it must hold that
si + (255 − si ) × 256 ≥ top[0]. The data structure of the vocabulary is the same as
in Figure 2.5 and top[0] represents the number of unique words stored in the vocabulary.
Whenever the previous condition is broken, the number of stoppers si must be decremented.
To avoid degradation of the coding scheme, we need to state a lower bound for si . When the
compressor achieves this lower bound, the number of stoppers si is no longer decremented,
but some pruning technique is applied.
We have implemented two different pruning techniques. Least Frequently Used (LFU) is very simple and very fast, as the vocabulary is sorted by frequency, but it has a greater negative effect on the compression ratio, because we prune words that were recently added and are more connected with the actual context. This negative effect can be eliminated by the other well-known technique, Least Recently Used (LRU). On the other hand, LRU is a little more time-consuming.
Algorithm 13 describes the process of compression. The process of decompression is
the reverse. We can see that the Encode function has the parameter s. At the beginning,
the parameter s (the current number of stoppers) has the maximal value 255 (see line 18).
As the compression proceeds, it is necessary to decrement the number of stoppers and
increment the number of continuers to enlarge the coding space (see lines 13 and 14).
To avoid degradation of the scheme, the compressor respects a lower bound smin and needs to apply a pruning technique on the dictionary after reaching this bound (see lines 15 and 16). The logic of maintaining the correct number of stoppers is implemented by the function CheckSpace. The function UpdateDict (see line 26) is defined as Algorithm 7.
4.1.2 Self-Tuning Dense Code
STDC is another adaptive compressor based on the dense coding idea. STDC, unlike
TBDC, allows codewords of arbitrary size. As in DSCDC, the code uses the idea of stoppers
and continuers.
Definition 4.1.3 STDC is ODC (see Definition 4.1.1), where we can define STDC as
follows:
b = 8; G = (N, T, P, S):
    N = {Codeword},
    T = {s_a, c_a},
    P is defined in Table 4.4,
    S = Codeword,

    #   Rule
    1   Codeword → c_a Codeword
    2   Codeword → s_a

    Table 4.4: STDC: set of rules P.
where symbol s_a represents a stopper symbol of the a-th byte of the codeword: for a = 1, s_a ∈ {1, ..., s_{a_i}}; for a > 1, s_a ∈ {0, ..., s_{a_i} − 1}. Symbol c_a represents a continuer symbol: for a = 1, c_a ∈ {s_{a_i} + 1, ..., 255}; for a > 1, c_a ∈ {s_{a_i}, ..., 255}. Symbol s_{a_i} represents the number of stoppers in byte a in the i-th step of compression. Similarly, symbol c_{a_i} represents the number of continuers in byte a in the i-th step of compression. In every step i of the compression, it must hold that s_{a_i} + c_{a_i} = 255 for a = 1 and s_{a_i} + c_{a_i} = 256 for a > 1. The codeword 0 is reserved for a special escape symbol.
In fact, the only difference between STDC and DSCDC is that DSCDC sets the same
number of stoppers and continuers for all blocks, while STDC allows a different number
of stoppers and continuers for different blocks.
Algorithm 13 TBDC compressor main algorithm
 1: function Encode(r, s)    ▷ r is the rank of the encoded word and s is the current number of stoppers.
 2:   if r < 255 − s then
 3:     Send(r + 1);
 4:   else
 5:     r ← r − (255 − s);
 6:     B1 ← 256 − s;
 7:     B2 ← 0;
 8:     B1 ← B1 + r ÷ 256;
 9:     B2 ← B2 + r mod 256;
10:     Send(B1);
11:     Send(B2);
12: function CheckSpace(s)
13:   if s + (255 − s) × 256 < top[0] ∧ s > s_min then
14:     s ← s − 1; return s;
15:   if s + (255 − s) × 256 < top[0] ∧ s = s_min then
16:     prune the dictionary; return s;
17: function Main
18:   s ← 255; top[0] ← 0;
19:   while not EOF do
20:     read word w_i from input;
21:     index ← hash(w_i);
22:     while word[index] ≠ w_i ∧ f[index] ≠ 0 do
23:       index ← index + k;
24:     if word[index] = w_i then
25:       Encode(r_i, s);
26:       UpdateDict(index);
27:     else
28:       Send(w_i);
29:       word[index] ← w_i;
30:       f[index] ← 1;
31:       posInVoc[index] ← top[0];
32:       posInHT[top[0]] ← index;
33:       top[0] ← top[0] + 1;
34:     s ← CheckSpace(s);
This seemingly insignificant change can bring an interesting improvement in the compression ratio. It allows us to change the number of stoppers and continuers of single blocks as necessary.
Suppose that a codeword of size k blocks is assigned to the last word in the vocabulary in the i-th step of compression. The coding space of the (k − 1)-th byte is tuned by a technique similar to the one used in TBDC. In every step i of the compression, the following condition must hold: Σ_{a=1}^{k} s_{a_i} × Π_{b=1}^{a−1} c_{b_i} ≥ top[0], where s_{a_i} represents the number of stoppers of byte a in step i, c_{b_i} represents the number of continuers of byte b in step i and, finally, top[0] represents the number of unique words stored in the vocabulary. Whenever this condition is violated, the number of stoppers of the (k − 1)-th byte must be decremented. Again, to avoid degradation of the coding scheme, we need to state some lower bound for s_{k−1}. After reaching this bound, the number of blocks of the last word in the vocabulary, k, must be incremented.
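The condition above translates directly into code. The following short Python sketch is our illustration (not the thesis implementation), with the per-byte stopper and continuer counts assumed to be kept in two lists.

    def stdc_coding_space(s: list, c: list, k: int) -> int:
        """Number of words addressable by codewords of at most k bytes,
        i.e. the sum over a of s[a] * c[0] * ... * c[a-1]."""
        total, prefix = 0, 1
        for a in range(k):
            total += s[a] * prefix   # codewords that end with a stopper in byte a+1
            prefix *= c[a]           # continuer combinations of the first a+1 bytes
        return total

    # In every step the compressor checks stdc_coding_space(s, c, k) >= top[0];
    # when the check fails, the number of stoppers of the (k-1)-th byte is
    # decremented, and once it hits its lower bound, k is incremented.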
Algorithm 14 STDC encode algorithm
 1: function Encode(r, s, c)    ▷ r is the rank of the encoded word, s is a vector storing the current number of stoppers in single blocks and c is a vector storing the current number of continuers in single blocks.
 2:   i ← 0;
 3:   c ← 1;
 4:   b ← s_i;
 5:   while r ≥ b do
 6:     c ← c × c_i;
 7:     i ← i + 1;
 8:     r ← r − b;
 9:     b ← c × s_i;
10:   j ← i;
11:   while i ≥ 0 do
12:     if i = j then
13:       if i = 0 then
14:         Send(1 + (r mod s_i));
15:       else
16:         Send(r mod s_i);
17:       r ← r ÷ s_i;
18:     else
19:       if i = 0 then
20:         Send(1 + s_i + (r mod c_i));
21:       else
22:         Send(s_i + (r mod c_i));
23:       r ← r ÷ c_i;
24:     i ← i − 1;
The coding spaces of the blocks ℓ, where 0 ≤ ℓ < k − 1, are tuned independently using a
tuning technique proposed by Brisaboa et al. in [14]. For each ℓ there exists a unique
minimum of the function which expresses the dependency between the compression ratio and
s_ℓ (the number of stoppers in the ℓ-th byte). This fact is exploited in the tuning technique.
The compressor and decompressor store the size of the encoded part of the file in three
variables: prev, curr and next. The variable curr stores the size of the encoded part of the
file using s_ℓ, the variable prev stores the size using s_ℓ − 1 and the variable next stores the
size using s_ℓ + 1. When the difference curr − prev or curr − next exceeds a threshold,
s_ℓ is decremented or incremented, respectively, and curr, prev and next are set to zero.
Algorithm 15 STDC compressor tuning functions
 1: function CheckSpace(s, k)
 2:   if Σ_{a=1}^{k} s_a × Π_{b=1}^{a−1} c_b < top[0] ∧ s_{k−1} > s_min then
 3:     s_{k−1} ← s_{k−1} − 1;
 4:   if Σ_{a=1}^{k} s_a × Π_{b=1}^{a−1} c_b < top[0] ∧ s_{k−1} = s_min then
 5:     s_k ← s_k − 1;
 6:     k ← k + 1;
 7: function TuneS(s, k, w_i)    ▷ s stores the current number of stoppers in single blocks, k is the current maximal number of blocks and w_i is the word just read from the input.
 8:   for a = 0 to k − 2 do
 9:     prev_a ← prev_a + CountBytes(w_i, s_a − 1);
10:     curr_a ← curr_a + CountBytes(w_i, s_a);
11:     next_a ← next_a + CountBytes(w_i, s_a + 1);
12:     if curr_a − prev_a > thresh then
13:       s_a ← s_a − 1;
14:       prev_a ← 0; curr_a ← 0; next_a ← 0;
15:     if curr_a − next_a > thresh then
16:       s_a ← s_a + 1;
17:       prev_a ← 0; curr_a ← 0; next_a ← 0;
The main algorithm of the compressor is basically the same as in TBDC (see Algorithm 13). The only difference is naturally in the Encode function (see Algorithm 14) and
in the tuning technique (see Algorithm 15). The definition of the CheckSpace function is
more general now. Its parameters are s, the current number of stoppers in single blocks, and k, the current maximal number of blocks. The function compares
the size of the coding space with the actual size of the dictionary top[0]. When the actual
coding space is not sufficient, the compressor must decrement the number of stoppers in
the (k − 1)-th byte (see lines 2 and 3) or take into account another byte and increment k
(see lines 4 – 6). The function TuneS is called in every iteration of the main cycle and is
used to tune the number of stoppers in blocks lower than k − 1. Every byte has its own
variables preva , curra and nexta . These variables store a sum of bytes needed to encode the
word with different numbers of stoppers (see lines 9 – 11). The number of bytes needed to
encode a word with a given number of stoppers is counted by the CountBytes function.
When the difference exceeds a threshold thresh, the number of stoppers is incremented or
decremented (see lines 12 – 17) and all the variables (prev_a, curr_a and next_a) are set to zero.
We can observe the evolution of the tuning parameters of STDC in Figure 4.1 which
depicts the compression of file gut3. The number of unique words grows exactly according
to Heaps’ Law [34] (see Figure 4.1(a)). The number of stoppers in the first byte rapidly
falls down during the first phase when only the first two bytes are used (see Figure 4.1(b)).
During the next phase, the number of stoppers oscillates about the value 192. Similarly,
the number of stoppers in the second byte is constant during the first phase and it falls
down during the second phase (as soon as three-byte codewords are involved). However,
this fall is not as fast as in the previous case because the coding space is larger and also
the number of unique words does not grow as quickly in this phase (see Figure 4.1(c)). The
boundary between the first and the second phase (the moment when three-byte codewords
are involved) is approximately at the level of 650 000 processed words (x axis of the graph).
4.1.3 Experiments
The test set of compression algorithms is very diverse in order to show the advantages and disadvantages of our implementations in comparison with the other algorithms. We have chosen both word-based and character-based algorithms, of both the statistical and the dictionary type. Table 4.5 provides an overview of the tested compression algorithms.
We ran the algorithms on various files with English natural language content. The tested files come mainly from standard corpora (the Calgary and Canterbury corpora) and from Project Gutenberg². We created four larger corpora (all1.txt, all2.txt, all3.txt and all4.txt) by concatenating many files from the standard corpora and especially from Project Gutenberg. All tested files are listed in Table 4.6.
We used the spaceless word model [20] in our implementations. It means that the vocabulary is common for alphanumeric and non-alphanumeric words. The model takes a
single space as a default separator. When the alphanumeric word is followed by a space, the
compressor encodes just the alphanumeric word. When the alphanumeric word is followed
by another non-alphanumeric word (separator), the compressor encodes both alphanumeric
and non-alphanumeric words.
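The spaceless word model can be illustrated with a short tokenizer. This is only our Python sketch (the thesis implementation parses the input in C while filling the hash-table vocabulary): a single space following an alphanumeric word is implicit, while any other separator becomes a vocabulary word of its own.

    import re

    def spaceless_tokenize(text: str) -> list:
        tokens = []
        # the input is an alternation of alphanumeric runs and separator runs
        for m in re.finditer(r"[A-Za-z0-9]+|[^A-Za-z0-9]+", text):
            tok = m.group(0)
            if tok.isalnum():
                tokens.append(tok)          # alphanumeric word
            elif tok != " ":
                tokens.append(tok)          # separator other than a single space
        return tokens

    print(spaceless_tokenize("Hello, world. How are you?"))
    # ['Hello', ', ', 'world', '. ', 'How', 'are', 'you', '?']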
We performed our tests on an AMD Athlon 64 Processor 3200+ with 2518 MB RAM, running Fedora Linux with kernel version 2.6.23.15-80.fc7. We used the gcc compiler, version 3.4.6, with the optimization flag -O3. We measure user time + system time in our experiments.
² The Gutenberg Project (www.gutenberg.org) is the first and largest single collection of free electronic books.
Figure 4.1: STDC: Evolution of various parameters as the number of words grows: (a) the number of unique words, (b) the s value in the first byte, and (c) the s value in the second byte, each plotted against the number of processed words.
Algorithm                               | Notation | Approach | Family      | Proposed             | Implemented
----------------------------------------|----------|----------|-------------|----------------------|---------------------
Huffman Coding                          | huff     | cb       | statistical | J. S. Vitter [72]    | D. Scott [65]
LZ77                                    | lz77     | cb       | dictionary  | Lempel, Ziv [78]     | M. Geelnard [31]
Arithmetic Coding                       | cac      | cb       | statistical | Moffat et al. [1]    | Moffat et al. [53]
GNU zip                                 | gzip     | cb       | statistical | J. Gailly            | J. Gailly [46]
Two-Byte Dense Code (LFU)               | tbdc1    | wb       | statistical | our proposal         | our implementation
Two-Byte Dense Code (LRU)               | tbdc2    | wb       | statistical | our proposal         | our implementation
Self-Tuning Dense Code                  | stdc     | wb       | statistical | our proposal         | our implementation
Dyn. End-tagged Dense Code              | detdc    | wb       | statistical | Brisaboa et al. [10] | Brisaboa et al. [24]
Dyn. (s,c)-Dense Code                   | dscdc    | wb       | statistical | Brisaboa et al. [16] | Brisaboa et al. [24]
Dyn. Lightweight End-tagged Dense Code  | dletdc   | wb       | statistical | Brisaboa et al. [13] | Brisaboa et al. [24]
Dyn. Plain Huffman                      | dph      | wb       | statistical | Brisaboa et al. [23] | Brisaboa et al. [24]

Table 4.5: Tested algorithms (cb - character-based approach, wb - word-based approach).
4.1.3.1 Compression ratio
The results are summarized in Table 4.7. Our implementation of STDC achieved the best
compression ratio in all tested files among all dense compressors. It can adjust the coding
scheme to the actual distribution of the alphabet and achieve a better compression ratio.
Performed on small- and medium-sized files (with size lower than approximately 4 MB),
STDC and both TBDC variants achieve the same compression ratio. Both STDC and
TBDC apply the same tuning technique while they only use the first two bytes to encode a
symbol. At the moment when the algorithms exceed the size of the two-byte scheme, they start to apply different approaches to encoding a symbol. TBDC(LFU) and TBDC(LRU) need to use
some pruning technique because their coding space is limited. TBDC(LFU) uses the Least
Frequently Used pruning technique and TBDC(LRU) uses the Least Recently Used pruning
technique. On the other hand, STDC can continue in the compression without pruning
the vocabulary and it involves the third byte in its coding scheme. Then the algorithm
starts to tune the second byte and the first byte is tuned by the technique proposed by
Brisaboa et al. in [14] (see Algorithm 15).
The comparison of the achieved compression ratios of all algorithms proves that word-based compression algorithms are much better. Our implementations (TBDC and STDC)
are considerate to smaller files, and they achieved an outstanding improvement in compression
ratio in comparison with DETDC or DSCDC when performed on files like cal3, cal4 and can1.
DSCDC is very unfriendly to small files and sometimes even achieves a compression ratio
higher than 100 %. When performed on larger files, both variants of TBDC drag behind.
File          | Notation | Source           | Size [B]   | # words    | # unique words | Entropy*
--------------|----------|------------------|------------|------------|----------------|---------
bible.txt     | canL     | Large Canterbury | 4 047 389  | 889 575    | 13 413         | 8.4574
alice29.txt   | can1     | Canterbury       | 148 460    | 34 040     | 3 210          | 8.6816
plrabn12.txt  | can2     | Canterbury       | 471 161    | 102 773    | 10 937         | 9.6155
book1         | cal1     | Calgary          | 768 770    | 175 853    | 13 497         | 9.3386
book2         | cal2     | Calgary          | 610 855    | 133 338    | 10 420         | 9.6751
paper1        | cal3     | Calgary          | 53 160     | 11 143     | 2 175          | 9.1154
paper2        | cal4     | Calgary          | 82 198     | 17 281     | 2 669          | 8.8772
wrnpc11.txt   | gut1     | Gutenberg        | 3 217 389  | 697 342    | 19 740         | 9.2460
clarissa.txt  | gut2     | Gutenberg        | 5 233 126  | 1 209 613  | 22 109         | 8.9695
all1.txt      | gut3     | Gutenberg        | 12 483 578 | 2 793 686  | 38 211         | 9.1329
all2.txt      | gut4     | Gutenberg        | 19 352 946 | 4 280 943  | 58 475         | 9.2302
all3.txt      | gut5     | Gutenberg        | 28 727 290 | 6 366 654  | 70 794         | 9.2396
all4.txt      | gut6     | Gutenberg        | 48 610 669 | 10 799 288 | 96 351         | 9.2451

Table 4.6: Tested files (* zero-order word-based entropy).
File/Alg. | huff  | lz77  | cac   | gzip  | tbdc1 | tbdc2 | stdc  | dph   | detdc | dscdc  | dletdc
----------|-------|-------|-------|-------|-------|-------|-------|-------|-------|--------|-------
canL      | 54.81 | 45.85 | 54.36 | 29.43 | 29.75 | 29.75 | 29.75 | 30.41 | 31.70 | 30.33  | 32.09
can1      | 57.02 | 57.97 | 56.70 | 36.14 | 43.00 | 43.00 | 43.00 | 45.67 | 47.92 | 67.69  | 51.08
can2      | 56.52 | 66.97 | 56.10 | 41.11 | 46.22 | 46.22 | 46.22 | 48.74 | 49.97 | 55.59  | 52.65
cal1      | 57.04 | 65.53 | 56.71 | 40.76 | 42.72 | 42.72 | 42.72 | 45.18 | 46.41 | 49.35  | 48.43
cal2      | 60.32 | 52.25 | 60.04 | 33.84 | 42.61 | 42.61 | 42.61 | 43.45 | 45.01 | 48.64  | 43.36
cal3      | 62.96 | 54.80 | 62.81 | 34.94 | 57.74 | 57.74 | 57.74 | 59.95 | 62.36 | 121.57 | 68.64
cal4      | 58.09 | 58.25 | 57.91 | 36.20 | 49.29 | 49.29 | 49.29 | 52.69 | 54.54 | 92.54  | 59.19
gut1      | 56.25 | 59.86 | 55.94 | 37.34 | 33.51 | 33.51 | 33.51 | 34.80 | 35.89 | 34.92  | 36.64
gut2      | 56.75 | 60.89 | 56.28 | 37.94 | 33.13 | 33.17 | 33.03 | 33.89 | 35.17 | 34.04  | 35.69
gut3      | 56.17 | 55.73 | 55.78 | 35.02 | 32.78 | 32.44 | 32.10 | 32.87 | 34.01 | 33.03  | 34.53
gut4      | 56.29 | 57.31 | 55.94 | 36.03 | 33.97 | 33.19 | 32.41 | 33.24 | 34.27 | 33.41  | 34.73
gut5      | 56.35 | 58.13 | 55.99 | 36.41 | 33.97 | 33.24 | 32.20 | 32.91 | 33.95 | 33.10  | 34.31
gut6      | 56.60 | 59.14 | 56.25 | 37.10 | 34.52 | 33.02 | 32.12 | 32.73 | 33.73 | 32.93  | 33.98

Table 4.7: Compression ratio in %.
Because they prune the vocabulary, both TBDC variants lose precise information about the alphabet distribution. On the other hand, STDC still achieves a better compression ratio: approximately a 1 % improvement in comparison with DSCDC and approximately a 2 % improvement in comparison with DETDC.
4.1.3.2 Compression and decompression speed
The results of compression and decompression speed are summarized in Table 4.8 and
Table 4.9. The compression speed is defined as the ratio of the size of the original file and
the compression time. Similarly, the decompression speed is defined as the ratio of the
size of the original file and the decompression time. Thus, the compression ratio of the single algorithms has no effect on the resulting speed. We chose speed as the efficiency measure because we wanted to compare the efficiency of the algorithms on files of different sizes.
File/Alg. | huff | lz77 | cac  | gzip | tbdc1 | tbdc2 | stdc  | dph   | detdc | dscdc | dletdc
----------|------|------|------|------|-------|-------|-------|-------|-------|-------|-------
canL      | 2.22 | 0.03 | 7.15 | 7.58 | 24.12 | 24.20 | 18.39 | 18.92 | 19.90 | 19.30 | 17.90
can1      | 2.08 | 0.03 | 7.08 | 6.81 | 22.47 | 22.12 | 17.92 |  3.37 |  2.78 |  2.28 |  2.57
can2      | 2.11 | 0.02 | 8.99 | 5.76 | 19.37 | 18.80 | 15.71 |  7.49 |  6.72 |  8.03 |  6.19
cal1      | 2.11 | 0.02 | 7.33 | 6.36 | 19.71 | 18.99 | 15.63 |  9.78 |  8.94 | 13.09 |  8.36
cal2      | 2.02 | 0.03 | 7.28 | 8.15 | 21.42 | 20.66 | 16.60 |  8.57 |  7.98 | 10.59 |  7.53
cal3      | 1.75 | 0.08 | 7.72 | 7.68 | 21.12 | 21.12 | 16.90 |  1.37 |  1.06 |  0.86 |  0.73
cal4      | 2.01 | 0.05 | 7.83 | 7.00 | 22.40 | 22.40 | 17.82 |  2.12 |  1.57 |  2.12 |  1.15
gut1      | 2.15 | 0.02 | 7.45 | 6.75 | 21.44 | 20.24 | 15.87 | 15.82 | 17.64 | 16.68 | 15.58
gut2      | 2.12 | 0.02 | 7.13 | 6.28 | 20.76 | 19.94 | 15.22 | 16.75 | 19.12 | 17.70 | 17.15
gut3      | 2.16 | 0.02 | 7.30 | 6.77 | 20.76 | 19.53 | 14.52 | 16.49 | 19.52 | 18.12 | 17.54
gut4      | 2.16 | 0.02 | 7.10 | 6.72 | 19.84 | 17.92 | 13.57 | 14.76 | 19.51 | 17.73 | 16.83
gut5      | 2.16 | 0.02 | 7.21 | 6.67 | 19.83 | 18.45 | 13.36 | 16.13 | 19.11 | 17.21 | 16.66
gut6      | 2.14 | 0.02 | 7.29 | 6.60 | 19.20 | 18.55 | 12.95 | 14.12 | 18.41 | 16.81 | 16.12

Table 4.8: Compression speed in MB/s.

File/Alg. | huff | lz77   | cac  | gzip  | tbdc1 | tbdc2 | stdc  | dph   | detdc | dscdc | dletdc
----------|------|--------|------|-------|-------|-------|-------|-------|-------|-------|-------
canL      | 2.59 | 214.44 | 6.23 | 73.38 | 39.67 | 39.63 | 33.39 | 25.74 | 55.95 | 48.26 | 67.65
can1      | 2.44 | 202.21 | 7.08 | 38.27 | 35.40 | 34.53 | 28.89 |  5.24 | 47.21 | 23.60 | 36.30
can2      | 2.47 | 172.93 | 6.42 | 51.06 | 29.76 | 28.44 | 25.24 | 11.24 | 40.86 | 34.57 | 44.62
cal1      | 2.47 | 178.88 | 6.11 | 55.54 | 31.33 | 30.17 | 26.00 | 12.86 | 45.83 | 31.88 | 46.97
cal2      | 2.36 | 208.01 | 5.83 | 55.48 | 31.32 | 32.19 | 27.48 | 13.55 | 48.55 | 25.33 | 49.08
cal3      | 1.95 | 253.49 | 7.53 | 20.28 | 29.82 | 28.17 | 26.68 |  2.11 | 50.70 | 15.68 | 16.90
cal4      | 2.31 | 261.30 | 7.83 | 28.00 | 32.66 | 32.66 | 29.03 |  3.14 | 39.20 |   -   | 19.61
gut1      | 2.49 | 191.77 | 6.23 | 64.06 | 32.13 | 31.74 | 27.89 | 21.76 | 47.95 | 43.22 | 60.06
gut2      | 2.49 | 191.95 | 6.24 | 64.81 | 31.95 | 31.46 | 26.27 | 23.32 | 46.65 | 39.61 | 61.59
gut3      | 2.51 | 198.42 | 6.14 | 67.76 | 33.11 | 32.27 | 22.46 | 23.81 | 45.62 | 38.41 | 63.09
gut4      |  -   | 196.35 | 6.13 | 66.53 | 31.74 | 30.21 | 21.21 | 22.59 | 43.53 | 36.34 | 61.43
gut5      |  -   | 194.29 | 6.12 | 64.89 | 31.50 | 30.42 | 20.14 | 22.08 | 41.83 | 36.29 | 61.77
gut6      |  -   | 191.57 | 6.14 | 64.16 | 29.75 | 29.29 | 19.32 | 21.23 | 40.74 | 34.78 | 61.10

Table 4.9: Decompression speed in MB/s.
All the tested compressors were run without any knowledge of the number of unique
words and their frequencies in the tested files. It means that none of the tested compressors
could exploit this knowledge to optimize the compression speed.
TBDC(LFU) is the fastest algorithm among the dense compressors in compression.
It is faster than STDC because it only uses a two-byte coding space and it does not need
to care about tuning any byte except the first one. It is also faster than TBDC(LRU)
because the Least Frequently Used pruning technique is faster, as the vocabulary is sorted
by frequency.
In decompression, the fastest algorithm is lz77, which is very asymmetrical. Gzip
is also very fast in decompression since it is based on the deflate algorithm, which uses
a combination of lz77 and Huffman coding. DLETDC is the fastest algorithm among
the dense compressors. Namely, it exploits its asymmetry: the decompressor swaps
words in the vocabulary only when the length of their codewords changes. All
our implementations fall behind in decompression speed when they are performed on
larger files.
Figure 4.2: Compression ratio / compression speed trade-off (compression speed in MB/s against compression ratio in %) for tbdc1, tbdc2, stdc, dph, detdc, dscdc, dletdc and gzip.
The main advantage of all dense compressors is a very good compression ratio / compression speed trade-off. The reason why the adaptive dense compressors are so fast is that they
are byte-oriented and they use a very efficient dictionary data structure (see Figure 2.5).
This data structure allows the update operation to be performed in constant time
(see Algorithm 7). The compression ratio is low because of the word-based approach and the
efficient definition of the coding space. The trade-off between the compression ratio and
the compression speed for the most competitive compression algorithms is depicted in Figure 4.2. The algorithms were performed on the gut2 file. We can observe that all the dense
compressors showed a very good trade-off, while both variants of TBDC achieved the best
result. Gzip falls a little behind; nevertheless, it is very fast in decompression. The
trade-off between compression ratio and decompression speed is depicted in Figure 4.3.
The champion in this parameter is not clear.
Figure 4.3: Compression ratio / decompression speed trade-off (decompression speed in MB/s against compression ratio in %) for tbdc1, tbdc2, stdc, dph, detdc, dscdc, dletdc and gzip.
The dense compressors achieve a better compression ratio (the best is our algorithm STDC), yet gzip and DLETDC are significantly better in decompression speed.
Our compressors achieve approximately 1% improvement in the compression ratio in
comparison to DPH, while they keep the same or slightly better compression and decompression speed. When we compare our compressors to DLETDC, we can see that there is
a big difference caused by the special orientation of DLETDC. Our compressors achieve
significantly better results in the compression ratio, while DLETDC is significantly faster
in the decompression and, moreover, it allows the direct searching on the compressed data
stream. The results for our compressors and DLETDC in compression speed are very
similar when it comes to large files.
4.2 Semi-Adaptive Two-Byte Dense Code
Only a few existing compression methods, such as DLETDC and DLSCDC presented
in [13] and [11], address the common problems of maintaining and working with large
textual databases such as digital libraries. One typical scenario can be as follows.
A client is searching for some keywords in the digital library. When some or all of the searched
keywords are found in a document, the client wants to download the document and present
it to the user.
The scenario above implies some basic properties which a compression method has
to satisfy in order to be used in this setting. The compression method has to permit efficient searching on the compressed text. It has to be able to send only the part of the compressed text (with
its corresponding vocabulary) where the keywords have occurred. Furthermore, it should
enable the extension of the compressed file by adding other compressed text without previous decompression, which is applicable in dynamically growing databases of similar
textual data. In this section, we present a compression method that satisfies the aforementioned requirements and is still able to achieve a competitive compression ratio in
acceptable compression and decompression time.
4.2.1 Code
Semi-Adaptive Two-Byte Dense Code (STBDC) is a semi-adaptive version of TBDC proposed in [58] and presented in Section 4.1. The structure of the STBDC code is practically
the same. The only difference is that STBDC does not have to use a special codeword
Cnew (an escape symbol) to express that a new word in plain text form occurs. The idea
of STBDC was briefly introduced in [59] and later described in detail in [60].
Definition 4.2.1 Given source symbols with probabilities {p_i}_{0≤i<s+c×256}, STBDC assigns to each source symbol i a unique codeword formed by one byte b_1 ∈ {0, ..., s − 1} for i < s, or a unique codeword formed by two bytes b_1 ∈ {s, ..., 255} and b_2 ∈ {0, ..., 255} for s ≤ i < s + c × 256. Symbols s and c represent the number of stoppers and continuers.
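The mapping of Definition 4.2.1 is easy to state in code. The sketch below is our Python illustration (not the thesis implementation); note that, unlike adaptive TBDC, no byte has to be reserved for an escape symbol.

    def stbdc_codeword(i: int, s: int) -> bytes:
        """Codeword of the i-th most frequent source symbol for a given s."""
        c = 256 - s
        assert 0 <= i < s + c * 256, "symbol outside the two-byte coding space"
        if i < s:
            return bytes([i])                     # one byte, b1 in 0..s-1
        r = i - s
        return bytes([s + r // 256, r % 256])     # b1 in s..255, b2 in 0..255

    # With the default s = 171 used later in Section 4.2.4, the 171 most
    # frequent words get one byte and the remaining 21 760 words get two bytes.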
4.2.2 Model
Following Definition 4.2.1, it is obvious that the STBDC coding space is limited to two
bytes. This limitation predetermines the block orientation of STBDC.³ The input text is
split into consecutive text blocks. Each block represents a portion of the input text that
can be covered (with respect to the number of unique words) by the STBDC coding scheme. The
end of a block must always come when the coding space given by the number of stoppers
s is exhausted. This implies that STBDC is semi-adaptive in terms of a single block, but
adaptive in terms of the whole input. The handling of the prelude is important: it is not
necessary to express the permutation of the whole input alphabet. Only the changes in
the model between two consecutive blocks are encoded, and thus the overhead of the prelude
is significantly reduced.

³ A block represents a portion of the input text in terms of this section, not the part of a codeword according to Definition 4.1.1.
Large natural language text usually consists of global words and local words. Global
words are the words that appear with approximately the same frequency almost in all
parts of natural language texts (e.g. stopwords, common nouns, verbs, etc.). On the
other hand, local words are the words that are strongly related to a certain part of the
compressed text (e.g. names, special terms, archaisms). The idea behind storing the model
of a certain block (in the vocabulary file) is that the compressor stores only the changes that
must be performed to transform the model of the previous block into the model
of the current block. In fact, the global words are kept in the model and the local words are
exchanged. STBDC thus allows each block to use its own model without wasting
space by storing the global words again.
Figure 4.4: The number of unique words (total and for single blocks) depending on the number of processed words of the gut8 file.
Figure 4.4 presents the Heaps’ Law curve for the gut8 file. The cross points represent
the curve for the whole file and the circle points represent the curves for single STBDC
blocks. The curves prove two important facts. First, the curve of each separate block
is very steep. Thus, the space requirements would increase rapidly without evaluation
and coding of the model changes. Second, the extremely large files with natural language
content can consist of many different (artificially concatenated) parts. These parts appear
as the successive Heaps’ Law curves in Figure 4.4.
The STBDC compressor works in two passes over the input text. During the first pass,
it collects the necessary statistics (the frequencies of single words). Unlike the coding space, the
vocabulary data structure of the compressor is not limited, so the compressor always
has exact information about the frequency f_i and rank r_i of each word w_i. When the
number of unique words in the current block reaches the size of the coding space s + c × 256,
the block must end. At the end of the block, the compressor evaluates:
(i) which words in the block are new, (ii) which words are recycled (words which appeared in
some preceding block, were flushed and reappeared in the current block), (iii) which words
were unused and should be flushed, and (iv) which words should be swapped (they have
a two-byte codeword but should have a one-byte codeword, or vice versa). The compressor
has to remember all these changes for the second pass.
During the second pass, at the beginning of every block, the compressor and decompressor must perform the changes detected in the first pass. There are basically three types of
changes: adding a new word, recycling a word and swapping two words (one with a one-byte
codeword and the other with a two-byte codeword). Adding and recycling are always
performed at the expense of an old word used previously but unused in the current block.
After the update of the model, every block is either compressed or decompressed using its
own model.
After the second pass, the compressor stores the changes evaluated during the first
pass in a vocabulary file. The structure of the vocabulary file is key to an acceptable
compression ratio and also to other possible applications of STBDC. The vocabulary file
is also block-oriented and each block of the vocabulary file represents the prelude of the
corresponding block in the compressed file. The stored data describe the changes that
must be performed so that the model of the previous block becomes applicable to the
current block. Some reserved bytes are needed in the vocabulary file. Byte ⟨00⟩ stands for
the end of one block and the beginning of another. Byte ⟨01⟩ represents recycling of a
word and is followed by the code of its current position in the vocabulary and the code
of its new position in the vocabulary. The vocabulary data structure of the compressor
and decompressor is not limited, so we can express the recycled word by its position in
the vocabulary. However, the code expressing its position is not STBDC, but a slightly
different dense code, which can address a larger coding space. Byte ⟨02⟩ represents swapping
of two words and is followed by the STBDC representation of their positions. Finally, byte
⟨03⟩ represents adding a new word and is followed by the STBDC representation of the
old word, which is substituted, and the plain text representation of the new word. For the first
block, only the plain text representation is stated, since the additions are ordered by their
position in the vocabulary. Basically, all three types of change represent the exchange
of two words, the old word and the new word. However, each type of change addresses
a different space. That is the reason why these types must be distinguished.
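The structure of these change records can be sketched as follows. This is only our rough illustration: the real vocabulary file encodes the positions with dense codes, whereas the sketch writes plain 4-byte integers and terminates plain-text words with a zero byte to keep the example short.

    import struct

    END_OF_BLOCK, RECYCLE, SWAP, ADD = 0x00, 0x01, 0x02, 0x03

    def write_prelude(out, changes):
        """changes: ('recycle', old_position, new_position),
                    ('swap', position_a, position_b) or
                    ('add', old_position, new_word)"""
        for ch in changes:
            if ch[0] == 'recycle':
                out.write(bytes([RECYCLE]) + struct.pack('<II', ch[1], ch[2]))
            elif ch[0] == 'swap':
                out.write(bytes([SWAP]) + struct.pack('<II', ch[1], ch[2]))
            elif ch[0] == 'add':
                out.write(bytes([ADD]) + struct.pack('<I', ch[1])
                          + ch[2].encode() + b'\x00')
        out.write(bytes([END_OF_BLOCK]))          # end of this block's prelude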
At the beginning of the vocabulary file, the number of blocks is inserted followed by
their offsets in the compressed file. This is necessary for the searching in single blocks
and subsequently decompressing only some of the blocks. As we mentioned, one of the
typical applications for digital libraries can be fast searching and decompressing of single
portions of the compressed text. The offsets stored in the vocabulary file enable searching
only in the blocks where the word occurs, which is apparent from the analysis of the
vocabulary. Thus, the vocabulary serves as a simple block index (see [50]). Furthermore,
when the searched word is found in some block, it is necessary to decompress and present
this block to the user. For a client/server architecture and a usual semi-adaptive
compression algorithm, this could mean sending the interesting part of the compressed
file together with the whole vocabulary, which can often be very large. For STBDC, this
situation only means synchronizing the vocabulary (performing the changes from the
beginning up to the wanted block) and sending the interesting block together with its own
vocabulary.
Another advantage of the STBDC block orientation is that the compressed file can be
extended at any time without previous decompression, which would be needed for usual semi-adaptive
compression techniques. The compressor always holds the exact frequencies of single words,
so it is possible to export this information to an additional vocabulary file. The
compressor then imports this information when an extension of the compressed file is
needed and continues by adding new blocks. The compressor pays for this possibility with
the overhead of the additional vocabulary file.
4.2.3 Algorithms and Data Structures
The vocabulary data structure is very similar to that proposed by Brisaboa et al. in [10]
(see Figure 2.5). Two arrays indexed by the hash value of a given word were added so that
the compressor would be able to evaluate the changes between two blocks (see Figure 4.5).
The first array is pBl, where the position of the given word in the vocabulary in the previous
block is stored. The other array is act and it is used to mark whether the word appeared
in the current block or not. Other auxiliary arrays (old, new, swapUp and swapDown) are
necessary to store old, new and swapped words. Finally, it is necessary to evaluate these
words and store the resulting changes. To store the changes, the compressor uses three
other arrays: chType, source and dest. The chType array stores the type of the change.
Number 1 stands for recycling, number 2 for swapping and number 3 for adding. The arrays source and dest store references to the words involved in the given change.
Algorithm 16 represents the key function that the compressor has to perform during
the first pass at the end of each block. The algorithm iterates from the beginning of the
vocabulary (posInHT array) and analyses the words. When the word is active and it did
not appear in the previous block (it was not in the vocabulary at all or it was out of the
coding space), then it is a new word (see line 4). When the word is not active and it
appeared in the previous block, then it is an old word (see line 6). When the word is active
and had a two-byte codeword in the previous block but now has a one-byte codeword, then
it is a word which should be swapped upward (see line 8). Conversely, when the word in
Algorithm 16 Evaluation of the changes between two blocks
 1: function changeEval
 2:   j ← 0; k ← 0; l ← 0; m ← 0; max ← (s + c × 256)
 3:   for i = 0 to top[0] do
 4:     if act[posInHT[i]] = 1 ∧ (pBl[posInHT[i]] = −1 ∨ pBl[posInHT[i]] ≥ max) then
 5:       new[j] ← posInHT[i]; j ← j + 1
 6:     if act[posInHT[i]] = 0 ∧ pBl[posInHT[i]] < max then
 7:       old[k] ← posInHT[i]; k ← k + 1
 8:     if act[posInHT[i]] = 1 ∧ pBl[posInHT[i]] > s ∧ pBl[posInHT[i]] < max ∧ i < s then
 9:       swapUp[l] ← posInHT[i]; l ← l + 1
10:     if act[posInHT[i]] = 1 ∧ pBl[posInHT[i]] ≥ 0 ∧ pBl[posInHT[i]] < s ∧ i ≥ s then
11:       swapDown[m] ← posInHT[i]; m ← m + 1
12:     act[posInHT[i]] ← 0
13:   for i = 0 to l do
14:     chType[chPointer] ← 2
15:     source[chPointer] ← pBl[swapUp[i]]
16:     dest[chPointer] ← pBl[swapDown[i]]
17:     pBl[swapUp[i]] ← dest[chPointer]
18:     pBl[swapDown[i]] ← source[chPointer]
19:     chPointer ← chPointer + 1
20:   for i = 0 to j do
21:     if pBl[new[i]] ≥ max then
22:       chType[chPointer] ← 1
23:       source[chPointer] ← pBl[new[i]]
24:       dest[chPointer] ← pBl[old[i]]
25:       pBl[old[i]] ← source[chPointer]
26:     else
27:       chType[chPointer] ← 3
28:       source[chPointer] ← new[i]
29:       dest[chPointer] ← pBl[old[i]]
30:       pBl[old[i]] ← spLast
31:       spLast ← spLast + 1
32:     pBl[new[i]] ← dest[chPointer]
33:     chPointer ← chPointer + 1
the previous block had a one-byte codeword but now has a two-byte codeword, then it is
a word which should be swapped downward (see line 10).
In the next two for cycles (see lines 13 and 20), the changes are gathered into the arrays
chType, source and dest. In the first cycle, the swapping is stored, and in the second cycle
the recycling and the adding are stored. It is always necessary to update the position
in the previous block in the pBl array. For an adding change, the old word obtains its pBl
value from the spLast variable, which expresses the number of unique words seen by the
compressor so far.
Figure 4.5: The compressor evaluating the changes between two blocks (the contents of the arrays pBl, act, word, posInVoc, old, new, swapUp, swapDown, chType, source and dest for the situation of Example 4.2.1).
Example 4.2.1 Suppose STBDC compresses the famous Russian novel War and Peace
by Leo Tolstoy. After the first pass over some block, the compressor has the information
stored in the arrays pBl, act, word and posInVoc (see Figure 4.5). Before the second pass
over the block, when the words are substituted with the codewords, the compressor has
to evaluate the changes between the previous and the current block (see Algorithm 16).
The number of stoppers in the coding scheme is s = 150. The changes of the words are
evaluated from the beginning of the vocabulary (the position in the vocabulary expresses
the array posInVoc). “Natasha” is evaluated as the first of the six words. However, this
word brings no change since it is active with the one-byte codeword and it was also active
with the one-byte codeword in the previous block. The word “Sonya” is active with a
one-byte codeword, but it had only a two-byte codeword in the previous block. Hence,
“Sonya” needs to be swapped up and its reference is stored in the array swapUp. “Pierre”
is clearly a recycled word since it is active with a two-byte codeword, but it was out of
the coding scheme (28 999 > 150 + 106 × 256 = 27 286) in the previous block. Thus,
its reference is stored in the array new. “Nicholas” is a word which needs to be swapped
down. In the previous block, this word had a one-byte codeword, but it has only a two-byte
codeword in the current block. Its reference is stored in the array swapDown. The word
which appeared for the first time in the current block is “Denisov ” and must be stored
in the array new. The next two words “Kutuzov ” and “Napoleon” are not active in the
current block, but were active in the previous block. Their references are stored in the
array old.
In the next step, the arrays chType, source and dest are filled. “Sonya” is swapped
with “Nicholas”, “Pierre” with “Kutuzov ” and “Denisov ” with “Napoleon”. Notice that
all the references in the arrays source and dest are positions in the vocabulary except the
word “Denisov ” which is a new word in the current block.
4.2.4 Optimal Number of Stoppers and Continuers
A natural question regards the ideal values of the parameters s and c = 256 − s. The
authors of SCDC [16] solved a similar problem and presented a time-efficient algorithm
which returns the optimal values of the parameters s and c. However, their problem was
much easier. They were looking for values s and c such that f(s) = Σ_{i=0}^{n−1} f_i × len(i, s) is
minimal (the len function returns the length of the codeword; its parameters are the
rank of the word r_i = i and the number of stoppers s; f_i represents the frequency of word
w_i). It was proved in [16] that the function f(s) has only one minimum, so binary
search can be used to find the optimal value of s. Later, in [12], Brisaboa et al. revised their
claim with the result that the uniqueness property of s should hold for all natural language
text collections.
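To make the contrast with our setting concrete, the following sketch (our own Python illustration, not the algorithm from [16]) shows how such a unimodal cost f(s) can be minimized by binary search when the word frequencies are fixed.

    def scdc_len(i: int, s: int) -> int:
        """Codeword length (in bytes) of the i-th most frequent word in SCDC."""
        c = 256 - s
        length, capacity, block = 1, s, s
        while i >= capacity:        # s one-byte, s*c two-byte, s*c^2 three-byte ...
            block *= c
            capacity += block
            length += 1
        return length

    def optimal_s(freq: list) -> int:
        """freq: word frequencies sorted in non-increasing order."""
        def f(s):
            return sum(fi * scdc_len(i, s) for i, fi in enumerate(freq))
        lo, hi = 1, 255
        while lo < hi:              # binary search over the unimodal f(s)
            mid = (lo + hi) // 2
            if f(mid) <= f(mid + 1):
                hi = mid
            else:
                lo = mid + 1
        return lo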
Our problem is in fact more complicated. We need to find the value of the parameter s
that minimizes the function f(s) = Σ_{i=0}^{s−1+(256−s)×256} f_i × len(i, s). As explained in
Section 4.2.2, the size of the coding space predetermines the size of the single blocks. It
means that the value of the parameter s affects not only the size of the single codewords,
but also the number of unique words n = s + (256 − s) × 256 and their frequencies f_i.
Decreasing the value of s means increasing the size of the block together with increasing
the number of unique words and increasing the frequencies of single words. Considering the
size of the vocabulary file, we also need to take into account the number of changes in the
model between two consecutive blocks. We conclude that this value is practically unattainable
without performing the compression itself.
Another complication is that, in the case of STBDC, the function expressing the dependency of the compression ratio on the value of s (or c = 256 − s) has many local minima. Several counterexamples are presented in Figure 4.6, which depicts the dependency of the STBDC compression ratio on the value of c for four files with natural language content.
The aforementioned complications forced us to choose a different approach to determining
the values of s and c. We analysed several files with natural language content
and noticed that the change in the compression ratio, depending on the value of c, is usually
quite small (less than 0.5 %) on the interval c ∈ [50; 125]. Brisaboa et al. presented
a similar observation in [12]. Following our experiments, we made a good estimate of
s = 171 and c = 85, which is, on average, the optimal value over all analysed files. It holds that
arg min_s {Σ_i cr(f_i, s)} = 171, where cr(f_i, s) represents the compression ratio achieved on
the file f_i with the parameter s, and the f_i range over all analysed files. The analysed files were large
files (consisting of more than one block) with different ratios of the total number of words
to the number of unique words.
Figure 4.6: The dependency of the STBDC compression ratio on the value of c = 256 − s (for the files all.txt, all2.txt, all3.txt and all4.txt).
Still, STBDC is considerate to small files which consist of only one block. The
compressor and the decompressor simply set the value s > 171 to the maximum
value which satisfies (s + (256 − s) × 256) ≥ n, where n represents the number of unique
words. This setting makes the compression most effective, since the maximum number of
words obtain one-byte codewords.
In all experiments presented in Section 4.2.7, the STBDC compressor uses its default
parameter s = 171. The only exceptions are the files consisting of only one STBDC block
(canL, can1, can2, cal1, cal2 and gut1), where the parameter s > 171 was automatically
set by the compressor according to the aforementioned condition.
4.2.5 Searching on Compressed Text
The direct search on compressed text was originally introduced by Manber and Wu in [48].
Later, it was nicely described for the byte codes by de Moura et al. in [19]. Navarro et al.
in [54] made a proposal of searching on compressed text using the block index which was
originally proposed by Manber and Wu in [50].
STBDC is semi-adaptive in terms of single blocks. It means that it is possible to perform the searching on compressed text using some standard pattern-matching algorithm.
Since the searched word can have a different codeword in every block, it is necessary to
distinguish different blocks in the compressed data stream. Furthermore, the vocabulary
file can be seen as a block index used in the first step of two-level searching described
in [50]. In the first step, only the vocabulary file is sequentially searched and the blocks
with no occurrence are excluded. This can be useful especially for so-called proximity
searching, when several words are searched for at approximately the same position in the text.
In the second step, the sequential searching in the predetermined blocks is performed. The
spatial complexity of this built-in block index (which is the vocabulary at the same time) is
O(n^β + c × r), where n is the number of words, n^β is, according to Heaps' Law [34], the
number of unique words, r is the number of blocks and c is the average number of swaps
between two blocks. This complexity is comparable with the complexity analysed in [54]
and, according to [50], this means another 3 – 4 % of space savings in comparison to
uncompressed text.
At the beginning of the vocabulary file, the number of blocks and their offsets in
the compressed file are stored; therefore, the searching algorithm knows where every block
starts. The algorithm needs to go through the vocabulary file first and track the codewords
assigned to the searched word in single blocks. This is of course a kind of complication,
but, on the other hand, when the word does not occur in some block, the algorithm is
already able to discover this fact in the vocabulary file and then omit the given block when
searching on the compressed file. Thus, the vocabulary file serves as a block index.
Word tracking in the vocabulary file is easy. When the word occurs in a block for
the first time, a codeword is assigned to it. The word can, in some later blocks, swap
its position with another word. Every change in the vocabulary file contains the original
codeword as the reference. Thus, it is necessary to search the codeword and, after the
change of the codeword, change the searched pattern as well. Sequential searching in the
vocabulary file seems to be ineffective, but it is necessary for some kinds of searching such
as: searching allowing errors or searching for regular expressions.
We consider only single-pattern or multi-pattern searching in our implementation. For
the present moment, we omitted some special kinds of searching such as phrase searching
or searching allowing errors. We implemented a simple forward pattern-matching automaton for searching a set of patterns. We argue that no shifting technique is needed for an
STBDC byte stream since the shifts are limited by the minimal length of a pattern, which
is in the case of STBDC⁴ l_min ≤ 2. Furthermore, when the search is aligned with the
codewords, a simple shift is still possible. Namely, the first byte of an STBDC codeword
expresses the length of the codeword: if its value is greater than or equal to the number of
stoppers s, then the searching algorithm can shift two bytes, otherwise it must shift only
one byte.

⁴ We implemented a backward pattern-matching algorithm for SF-STBDC (a compression method optimized for a set of files) and the gain in comparison to the forward pattern-matching algorithm is minimal. See Section 4.3 for details.
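The codeword-aligned shift described above can be sketched in a few lines (our Python illustration, not the thesis implementation). Because the first byte of every STBDC codeword reveals its length, the scanner can always skip whole codewords.

    def find_codeword(data: bytes, pattern: bytes, s: int) -> int:
        """Offset of the first codeword-aligned occurrence of `pattern`
        (itself an STBDC codeword of one or two bytes), or -1."""
        i = 0
        while i < len(data):
            step = 1 if data[i] < s else 2   # value < s: stopper, otherwise continuer
            if data[i:i + step] == pattern:
                return i
            i += step
        return -1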
4.2.6 Modifications of the Compressed Text
The data stored in textual databases are not only created once and then presented many
times. The data can also be modified in many ways during the time of its existence. The
compressed text can be extended (another text can be concatenated to it) or modified, i.e.
some chapter or paragraph can be rewritten or some term can be substituted by another
term in the entire compressed text.
Recently, the IRT algorithm (a variant of Burrows-Wheeler Transform [17]) was presented in [9]. This algorithm enables the direct searching and modification of the compressed text. We argue that the Dense codes are more suitable for the text compression
(even considering searching capabilities and modifications of the text). First, the Dense
codes are much faster than IRT in compression and especially in decompression. Our
STBDC is nine times faster in compression and thirty times faster in decompression than
IRT (compared on the canL file). Second, the attractive properties of IRT come at the
expense of the compression ratio (approximately 40 % for IRT in comparison to approximately 30 % for our STBDC, both on the canL file). Third, the searching efficiency of
IRT also seems to be poorer: compared on the canL file, the search for one or more
patterns is approximately ten times faster on our STBDC.
The question is how STBDC can perform the modifications of the compressed text.
Suppose a large text file divided into blocks 3 MB in size, which is approximately the size
of one uncompressed STBDC block. We argue that the majority of text modifications
are related to only one block, and these modifications consist in adding or changing certain portions of the text (paragraphs, chapters). STBDC (similarly to the other Dense
codes) theoretically allows some kinds of modifications without decompression. However,
these changes cannot be propagated to the compression model, which then becomes outdated.
Another problem is that, due to the word-based approach, the modifications can bring in
some new words that are not associated with a certain codeword.
So, for the modification of STBDC compressed text, decompression is always needed.
The only exception is when the compressed text is extended (other text is added at the end
of the compressed text). Then this extending text is added as a sequence of new blocks.
STBDC, in comparison to other Dense codes, has the advantage that the model of
each block depends only on the previous blocks. Suppose the uniform probability of the
update of single blocks; then STBDC has to decompress and again compress on average
only a half of the text (the updated block plus all of the following blocks). Considering the
high compression and decompression speed of all Dense codes, the time for modifying the
compressed text could be comparable with IRT. We plan further experimental comparison
for future work. However, at the moment, we have to draw on the results presented in [9].
4.2.7 Experiments
In our experiments, we tested the compression ratio, compression and decompression speed
and the efficiency of the searching on compressed text.
File          | Notation | Source           | Size [B]   | # words    | # unique words | Entropy*
--------------|----------|------------------|------------|------------|----------------|---------
bible.txt     | canL     | Large Canterbury | 4 047 389  | 889 575    | 13 413         | 8.4574
alice29.txt   | can1     | Canterbury       | 148 460    | 34 040     | 3 210          | 8.6816
plrabn12.txt  | can2     | Canterbury       | 471 161    | 102 773    | 10 937         | 9.6155
book1         | cal1     | Calgary          | 768 770    | 175 853    | 13 497         | 9.3386
book2         | cal2     | Calgary          | 610 855    | 133 338    | 10 420         | 9.6751
wrnpc11.txt   | gut1     | Gutenberg        | 3 217 389  | 697 342    | 19 740         | 9.2460
clarissa.txt  | gut2     | Gutenberg        | 5 233 126  | 1 209 613  | 22 109         | 8.9695
all1.txt      | gut3     | Gutenberg        | 12 483 578 | 2 793 686  | 38 211         | 9.1329
all2.txt      | gut4     | Gutenberg        | 19 352 946 | 4 280 943  | 58 475         | 9.2302
all3.txt      | gut5     | Gutenberg        | 28 727 290 | 6 366 654  | 70 794         | 9.2396
all4.txt      | gut6     | Gutenberg        | 48 610 669 | 10 799 288 | 96 351         | 9.2451
all5.txt      | gut7     | Gutenberg        | 12 095 037 | 2 536 509  | 90 583         | 9.7081
all6.txt      | gut8     | Gutenberg        | 11 179 123 | 2 455 409  | 116 522        | 9.9030

Table 4.10: Tested files (* zero-order word-based entropy).
We ran the compared algorithms on various files, mostly with English text (see
Table 4.10). The tested files come especially from standard corpora (the Calgary and Canterbury corpora) and from the Gutenberg Project. We created six larger corpora (gut3,
gut4, gut5, gut6, gut7 and gut8) by concatenating many files from the standard corpora and
especially from the Gutenberg Project. The files gut7 and gut8 have a very high number of
unique words since they are composed of different parts in different languages. These files
should demonstrate the behaviour of the algorithms when they are performed on block-oriented files where different words occur in every block (language).
4.2.7.1 Compression Ratio
We compared the compression ratio of STBDC against other dense compressors ETDC [10],
SCDC [16], DLETDC [13] and also against standard compression programs Gzip [78] and
Bzip2 [17]. The results in Table 4.11 represent the compression ratios including the size
of the vocabulary file.
As we expected, Bzip2 achieved the best compression ratio on all tested files. Among
the dense compressors, SCDC and STBDC achieved similar results. STBDC proved to be
slightly better in the compression of smaller files (can1, can2, cal1 and cal2 ) and files with
a higher number of unique words (gut7 and gut8 ). While compressing the files composed
of different languages, STBDC successfully exploits its block orientation. Moreover, in
compression ratio, STBDC is always better than DLETDC [13], which has very similar
application possibilities, described at the beginning of Section 4.2.
Consider an experiment in which only the first n blocks of the file gut6 are required by the client. The compression ratio in this case is (cs + vs)/os, where cs represents the compressed text size, vs the vocabulary size and os the original text size.
File/Alg. | etdc    | scdc    | dletdc  | stbdc   | gzip    | bzip2
----------|---------|---------|---------|---------|---------|--------
canL      | 31.6902 | 30.3839 | 32.0929 | 30.2653 | 29.4281 | 20.8937
can1      | 47.6977 | 45.3967 | 51.0750 | 45.0337 | 36.1437 | 29.1230
can2      | 49.8878 | 48.7286 | 52.6455 | 48.5137 | 41.1074 | 30.8907
cal1      | 46.3579 | 45.3934 | 48.4262 | 44.7205 | 40.7633 | 30.2559
cal2      | 44.9383 | 43.3636 | 46.8224 | 44.3602 | 33.8357 | 25.7742
gut1      | 35.8399 | 34.8148 | 36.6410 | 34.5113 | 37.3404 | 27.5142
gut2      | 35.1278 | 33.9001 | 35.6981 | 33.9038 | 37.9449 | 27.3021
gut3      | 34.2172 | 32.9340 | 34.5323 | 32.7412 | 35.0167 | 25.3695
gut4      | 35.2309 | 33.3655 | 34.7294 | 33.3008 | 36.0279 | 26.4533
gut5      | 35.2265 | 33.0472 | 34.3083 | 33.0943 | 36.4101 | 26.7119
gut6      | 35.5081 | 32.8937 | 33.9817 | 33.1199 | 37.1015 | 27.2265
gut7      | 41.6004 | 40.7842 | 42.5883 | 39.8884 | 35.2931 | 26.0626
gut8      | 47.3221 | 46.3814 | 48.5401 | 44.9636 | 35.3554 | 26.9774

Table 4.11: Compression ratio in %.
The experiment depicted in Figure 4.7 compares three compression methods: STBDC, SCDC and SCDC(S), which is SCDC applied to single blocks separately. Figure 4.7 shows that for one
block, SCDC(S) achieves the best compression ratio and STBDC stays just a little behind.
However, SCDC is very ineffective since it has to attach the entire vocabulary file, which
is quite large. On the other hand, when the whole file is required (all five blocks), SCDC
achieves the best compression ratio and STBDC stays a little behind again. SCDC(S)
achieves the worst compression ratio since it has to repeatedly send the same global words
in the vocabulary files of single blocks. We conclude that STBDC is the ideal solution for
storing the compressed file and sending the single blocks required by the client.
4.2.7.2 Compression and Decompression Time
We performed our tests on an AMD Athlon 64 Processor 3200+ with 2518 MB RAM, running Fedora Linux with kernel version 2.6.23.15-80.fc7. We used the gcc compiler, version 3.4.6, with the optimization flag -O3. We measure user time + system time in our experiments.
We compared our STBDC with other Dense codes and with standard compression programs Gzip and Bzip2. The results of compression and decompression time are summarized
in Table 4.12 and Table 4.13. STBDC proved to be very fast when performed on small files
composed of only one block. However, when performed on larger files composed of more
than one block, the compression and the decompression time are worsened by the overhead
caused by evaluating and performing the changes of the model. On larger files, STBDC
is up to 1.3 times worse in compression time and up to 1.5 times worse in decompression
time than SCDC.
Figure 4.7: Comparison of compression ratio (STBDC, SCDC and SCDC(S)) in the case that only certain blocks are required by the client.
File/Alg. | etdc   | scdc   | dletdc | stbdc  | gzip   | bzip2
----------|--------|--------|--------|--------|--------|--------
canL      | 0.2600 | 0.2560 | 0.1990 | 0.2480 | 0.4560 | 1.3500
can1      | 0.0510 | 0.0500 | 0.0550 | 0.0090 | 0.0190 | 0.0320
can2      | 0.0750 | 0.0750 | 0.0700 | 0.0320 | 0.0710 | 0.1330
cal1      | 0.0960 | 0.0990 | 0.0840 | 0.0540 | 0.1050 | 0.2600
cal2      | 0.0820 | 0.0840 | 0.0750 | 0.0390 | 0.0650 | 0.1810
gut1      | 0.2350 | 0.2350 | 0.1840 | 0.2310 | 0.4150 | 1.0770
gut2      | 0.3589 | 0.3549 | 0.2730 | 0.4029 | 0.7180 | 1.7720
gut3      | 0.8089 | 0.8009 | 0.6409 | 0.9459 | 1.5830 | 4.2480
gut4      | 1.2568 | 1.2538 | 1.0278 | 1.6517 | 2.4620 | 6.5850
gut5      | 1.8417 | 1.8547 | 1.5428 | 2.4246 | 3.7020 | 9.7890
gut6      | 3.1705 | 3.1785 | 2.6885 | 4.3063 | 6.2770 | 16.6300
gut7      | 0.8569 | 0.8589 | 0.6889 | 1.1818 | 1.5800 | 4.1530
gut8      | 0.8449 | 0.8539 | 0.6759 | 1.3688 | 1.4850 | 3.7950

Table 4.12: Compression time in seconds.
File/Alg. | etdc   | scdc   | dletdc | stbdc  | gzip   | bzip2
----------|--------|--------|--------|--------|--------|-------
canL      | 0.0630 | 0.0590 | 0.0580 | 0.0800 | 0.0600 | 0.4120
can1      | 0.0070 | 0.0070 | 0.0050 | 0.0030 | 0.0050 | 0.0130
can2      | 0.0130 | 0.0130 | 0.0110 | 0.0110 | 0.0100 | 0.0570
cal1      | 0.0170 | 0.0180 | 0.0150 | 0.0190 | 0.0150 | 0.0980
cal2      | 0.0150 | 0.0150 | 0.0140 | 0.0140 | 0.0120 | 0.0670
gut1      | 0.0550 | 0.0540 | 0.0520 | 0.0710 | 0.0550 | 0.3720
gut2      | 0.0860 | 0.0810 | 0.0830 | 0.1170 | 0.0840 | 0.6220
gut3      | 0.1990 | 0.1960 | 0.1920 | 0.2770 | 0.1890 | 1.3960
gut4      | 0.3110 | 0.3050 | 0.3020 | 0.4469 | 0.2980 | 2.1890
gut5      | 0.4549 | 0.4289 | 0.4549 | 0.6539 | 0.4620 | 3.2810
gut6      | 0.7649 | 0.7689 | 0.7789 | 1.1508 | 0.7530 | 5.6220
gut7      | 0.2040 | 0.2050 | 0.2100 | 0.3000 | 0.1820 | 1.3470
gut8      | 0.1990 | 0.1990 | 0.1980 | 0.3040 | 0.1680 | 1.2280

Table 4.13: Decompression time in seconds.
4.2.8 Searching on Compressed Text
We compared our simple forward pattern-matching algorithm (see Section 4.2.5), performed on compressed STBDC text, with an implementation of the Boyer-Moore-Horspool
algorithm [36] performed on the original text. The results of the Boyer-Moore-Horspool
algorithm do not include the decompression time.
We performed the searching algorithms on the gut4 file, which consists of five STBDC
blocks. We tested searching for sets of five and of ten distinct words of different lengths
and different frequencies. The words were randomly chosen from the vocabulary of the
gut4 file.
The results of searching on compressed text are summarized in Table 4.14. The Boyer-Moore-Horspool algorithm is marked as MBMH and our forward pattern-matching algorithm
is marked as FPMA-or. We also introduce another variant of our searching algorithm, marked
as FPMA-and. Suppose so-called proximity searching, when the user searches for some set
of patterns and wants to find their occurrences approximately at the same position in the
text. When this kind of search is performed, the indexing property of the STBDC vocabulary
is exploited much more. Namely, many more blocks can be excluded after the vocabulary
analysis, since it is not very often that all the searched words occur in the same block or in
two subsequent blocks.
The search on compressed text (FPMA-or and FPMA-and ) is significantly faster for all
test instances than MBMH. We can observe that MBMH is faster for longer patterns and
for a lower number of searched patterns since the maximal possible shift depends on the
minimal length of pattern. We did not find any significant dependency of FPMA-or
on the length or the frequency of searched patterns or on the number of searched patterns.
On the other hand, FPMA-and logically worsens with the higher frequency of searched
patterns. However, FPMA-and can never be worse than FPMA-or since in the worst case,
it performs the search in the same blocks as FPMA-or.
# patterns   Alg.        frequency (len ≤ 5)               frequency (len > 5)
                         1-10     11-100   101-1000        1-10     11-100   101-1000
5            MBMH        0.2220   0.0825   0.3205          0.0815   0.0825   0.0815
5            FPMA-or     0.0510   0.0325   0.0555          0.0535   0.0575   0.0520
5            FPMA-and    0.0090   0.0175   0.0485          0.0105   0.0425   0.0460
10           MBMH        0.2520   0.0980   0.3450          0.1050   0.1100   0.1050
10           FPMA-or     0.0560   0.0310   0.0560          0.0590   0.0580   0.0630
10           FPMA-and    0.0070   0.0110   0.0450          0.0080   0.0350   0.0490
Table 4.14: Search time in seconds.
4.3 Set-of-Files Semi-Adaptive Two-Byte Dense Code
Much information is stored in small files less than 1 MB in size. Considering databases storing literary works (e.g. the Gutenberg Project), the average size of a stored text file is around 200 kB. The English version of the Bible has a size of around 4 MB (with only 13 413 unique words considering the so-called spaceless word model [52, 20]). The longest real novel written in English, Clarissa, or, the History of a Young Lady, is just a little over 5 MB in size (with only 22 109 unique words). Still, a much larger amount of information is stored in the form of even smaller files. A typical web page is about 20 kB in size. Every day, web crawlers scan millions of documents with natural language content and store their content as raw text files. However, we do not focus only on web content, but generally on natural language text stored in the form of small files. In addition to web pages, emails and logs are common textual records stored as files a few kilobytes in size.
H(p, q) = − Σ_i p_i log q_i = − Σ_i p_i log p_i + Σ_i p_i log(p_i / q_i) = H(p) + D_KL(p ‖ q)        (4.1)
Using a statistical compression method, the best compression ratio is achieved when the algorithm uses the proper model given by the exact probabilities of words in the compressed file. This claim can be easily proved by the definition of the cross-entropy (see Equation 4.1), which states the lower bound for compression of model p using a different statistical model q over the same probability space, since the Kullback-Leibler divergence D_KL is always non-negative. The factors p_i and q_i represent the probability of a given symbol s_i in the corresponding statistical model. Each term −p_i log q_i represents the amount of binary encoded information (−log q_i) multiplied by the probability of occurrence of the symbol s_i (p_i). However, even if the proper model is used, the final compression ratio is, in terms of compression of small files, always spoiled by the space overhead of the model.
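To make the argument concrete, the following sketch (in Python, with hypothetical word probabilities that are not taken from any of our test files) evaluates both sides of Equation 4.1 and shows that the cross-entropy never drops below the entropy of the proper model:

import math

def entropy(p):
    # H(p) = -sum_i p_i log2 p_i, the optimum in bits per word.
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    # H(p, q) = -sum_i p_i log2 q_i: expected code length when the data
    # follows p but the coder is built for the model q.
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    # D_KL(p || q) = sum_i p_i log2 (p_i / q_i), always non-negative.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# p plays the role of the proper model of one file, q of a global model
# that assigns the words slightly different probabilities.
p = [0.50, 0.25, 0.15, 0.10]
q = [0.40, 0.30, 0.20, 0.10]

print(entropy(p))                        # ~1.74 bits per word
print(cross_entropy(p, q))               # ~1.78 bits per word, never below H(p)
print(entropy(p) + kl_divergence(p, q))  # equals H(p, q)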
Standard compression algorithms, which are usually used for the compression of raw text data, are typically unfriendly to small files (say under 500 kB). The reason is that the overhead of the vocabulary burdens the resulting compression ratio of statistical compressors.
Dictionary compressors are also not an ideal option since they need a relatively large context to look for repetitions in the text.
One obvious way to improve the compression effectiveness is to group the files into blocks and exploit the higher repetitiveness of single words in the whole block. Unfortunately, the most efficient compression methods - dictionary compressors (LZMA [57], Gzip [46]) and context compressors (Bzip2 [66], PPMd [69, 74]) - do not allow random access to the compressed data stream. The size of the block is then clearly a trade-off between the compression ratio and the decompression time: the whole block must be decompressed whenever one wants to access only one file.
There are other rather efficient compression methods that allow random access to the data stream (and also searching on compressed text), such as Tagged Huffman [19] or the so-called Dense codes [10, 16]. Index data structures are another possible approach. In particular, the recently discovered compressed self-indexes [55] seem to be very competitive. A compressed self-index is an index data structure based on the Burrows-Wheeler transform5 T^bwt [17]. One of the representatives of self-indexes is Sadakane's Compressed Suffix Array (Sad-CSA), which is also used in our experiments (see Section 4.3.3). Sad-CSA occupies a space of 1/ε · nH_0 + O(n log σ) bits and allows to perform the locate operation in time O(log^ε n), where n is the size of the text, σ is the size of the alphabet and ε > 0 is an arbitrary constant. Sad-CSA further allows to display an arbitrary portion of the text (l characters in time O(l + log n)), so there is no need to store the text separately.
In this section, we discuss a general scenario of compression of single text files distributed over multiple computers that need to be randomly accessed and retrieved in a very short time. A typical example of the aforementioned scenario is a web search engine that needs to retrieve a few top-ranked web pages (their corresponding stored text files) in an extremely short period of time, build so-called snippets6 for them and display the result to the client.
Disregarding the compression efficiency, the best way of compression supporting individual access to files is to compress single files separately. This approach supports not only fast individual access to the files but also easy manipulation of individual files (updating, deleting or adding new files to the collection). The compression of single files further allows sending only the requested files (not a whole block) with their corresponding part of the vocabulary without the need for re-compression.
We propose a compression method optimized for the compression of small files with
natural language content. The idea of the method is a combination of two models: the
global model containing all the words of the collection and the local model, which transforms
the global model to a proper model. The proper model is the optimal model for a given
file assigning such codewords that provide the shortest output data stream. Here we focus
only on the compression of the raw text content. However, we suppose the final solution is
equipped with a compressed non-positional inverted index [76] pointing to single docIDs. In
very recent work [4], the authors discuss the question of using positional inverted indexes
5 Unlike most of the self-indexes, the recently discovered LZ77-Index [43] is based on LZ77 parsing.
6 A short phrase in the text surrounding the searched pattern.
(the indexes giving the exact position of a term in the text) in search engines. They
conclude that the positional index became a bottleneck in index compression and a suitable
method compressing the raw text could achieve higher effectiveness at the expense of only
a slightly worse search speed. We need to realize that the two most important operations
that exploit the positional index - positional ranking and snippet extraction - are usually
performed on quite a small subset of documents. According to [4], the web search engine
uses an inverted index to get only the top-k1 documents (usually k1 = 200). Furthermore, the web search engine gets snippets for only the top-k2 documents (usually k2 = 10). Thus, the worsened search speed causes only a small increase of the total search time.
There are three key ideas of our compression method leading to as efficient compression
as possible: (i) each file is compressed using its proper model ; (ii) the coding scheme is
adjusted exactly to the compressed file; (iii) the local vocabulary (the changes leading to
the proper model ) is encoded in an efficient way.
The combination of two models in the compression using byte codes is not a new idea at
all. In [39], the authors present their Restricted Prefix Byte Code (RPBC), together with
the strategies of handling the prelude of single compressed blocks. The text compression
per blocks using Two-Byte Dense Code (TBDC) is discussed in [60]. However, exactly the
same problem (compression of a set of files using byte codes) was first stated in [25]. Our
main aim is to solve this problem with a better compression ratio and the same speed of
searching on the compressed text as in [25]. The aforementioned three key ideas help our compression method to do so.
4.3.1 STBDC Modification for Compression of a Set of Files
The version of STBDC adapted for compression of a set of small files is called the Set-of-Files Semi-Adaptive Two-Byte Dense Code (SF-STBDC). The search engines, or generally the textual databases, usually work with a large set of files with natural language content. The size of the single files is quite small (under 1 MB7), so they should always fit the SF-STBDC coding scheme. The SF-STBDC coding scheme can easily cover 15 000 codewords and still achieve a very good compression ratio (using parameter s = 198). SF-STBDC allows stating a lower bound for the number of stoppers s, which determines the limit on unique words in the proper model: s + (256 − s) × 256. When the limit of unique words is exceeded, the input file must be compressed piecewise, where each of the pieces (separate compressed files) must satisfy the limit. This approach is similar to block-wise STBDC [60] (see Section 4.2).
Thanks to its simple coding scheme, SF-STBDC can always easily compute its optimal parameter s_min in constant time. See Equation 4.3, where s_min is the optimal number of stoppers, c_min is the optimal number of continuers and n represents the total number of unique words in the vocabulary. The first row of the equation represents the basic property of the coding scheme, which must cover all n unique words. The algorithms obtaining the optimal parameter s_min for SCDC [16] and RPBC [39] are more complicated. The corresponding algorithm for SCDC works in time O(σ + log(σ) × log log(σ)) = O(σ), where σ represents the number of unique words in the vocabulary. Similarly, the cost of the algorithm for RPBC is again O(σ). However, both algorithms are fast enough for practical use.
7 The size of a text file in the Gutenberg Project is 200 kB on average. The size of a typical web page is just 20 kB according to [29].
n ≤ (256 − c_min) + 256 × c_min
n ≤ 256 + 255 × c_min
c_min = ⌈(n − 256) / 255⌉        (4.2)
s_min = 256 − c_min
s_min = 256 − ⌈(n − 256) / 255⌉        (4.3)
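Equation 4.3 translates into a few lines of code; the sketch below (ours, in Python) only illustrates the constant-time computation and uses the scheme s = 200, c = 56 of Example 4.3.1 as a check:

def optimal_parameters(n):
    # Equation 4.3: the scheme must satisfy n <= s + (256 - s) * 256,
    # i.e. n <= 256 + 255 * c, so c_min is the smallest integer solution.
    if n <= 256:
        return 256, 0                 # every word gets a one-byte codeword
    c_min = -(-(n - 256) // 255)      # ceiling division
    return 256 - c_min, c_min

# The scheme s = 200, c = 56 covers 200 + 56 * 256 = 14 536 unique words.
print(optimal_parameters(14_536))     # -> (200, 56)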
The SF-STBDC modification for the compression of a set of files consists mainly of the redefinition of the global vocabulary and the local vocabulary. The global vocabulary is composed of all the unique words occurring in the whole set of files and is supposed to be held in the main memory of the computer. The advantage of the change is obvious: it is much easier to handle the words through their links (3B or 4B pointers) to the global vocabulary. Furthermore, for words with three or more characters, it is more efficient to store the link to the global vocabulary than the word itself.
[Diagram: situations a)–f) of words moving between blocks 1B, 2B and nB of the global and proper models, together with the from, to and active arrays and the sets F2B→1B, FnB→1B, FnB→2B and T1B.]
Figure 4.8: SF-STBDC variant 1: Evaluation of the swaps.
Notice that the number of unique words in the entire set of files can be unusually high.
Suppose the average length of a word is 4.5 characters in English and suppose the Heaps’
Law [34] with its constants for English. A rough estimate of the number of unique words is one million for a set of files with a size of 1 GB. Nevertheless, the server storing the data should have no problem fitting such an amount of data (for one million words, assume 10 MB at most) into its main memory.
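This back-of-the-envelope estimate can be reproduced with Heaps' law V = k · N^β; the constants below are assumed ballpark values from the usual English range, not values measured on our corpus:

def heaps_vocabulary(text_size_bytes, avg_word_len=5.5, k=30.0, beta=0.55):
    # Heaps' law: V = k * N**beta, where N is the number of word tokens.
    # avg_word_len includes the separator (4.5 characters plus one space);
    # k and beta are assumed values, chosen from the usual English range.
    tokens = text_size_bytes / avg_word_len
    return k * tokens ** beta

# Around one million unique words for 1 GB of text, in line with the
# rough estimate given above.
print(round(heaps_vocabulary(1 << 30)))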
The local vocabulary stores the swaps between two words that transform the global
model to the proper model of a compressed file. The swap can be perceived as the mutual
change of the position in the global model. The swaps are established based on arrays from
and to. The array from addresses the global model and stores the links to the words that
should be swapped. The array to addresses the proper model of the file and stores the
positions to which the words should be swapped.
The process of compression of one file has two passes. In the first pass, the compressor collects the necessary statistics and sorts the words by frequency in decreasing order. The sorted words represent the proper model of the compressed file. As the next step, the compressor evaluates the necessary changes and stores them in the arrays from and to. The pairs of values of the arrays from and to are stored in the vocabulary file as the swaps. Finally, the compressor performs the second pass, in which the input is encoded according to the proper model of the compressed file. The decompression process consists of reading the local vocabulary, performing the swaps to transform the global model to the proper model and performing the decompression itself.
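The two passes can be outlined as follows (a simplified sketch in Python; the byte code below is a minimal stand-in for the TBDC scheme and the swap evaluation of Algorithm 17 is left out):

from collections import Counter

def encode_rank(r, s=200):
    # Minimal (s, c) byte code: ranks below s get a one-byte codeword
    # (a stopper), the remaining ranks get a continuer byte followed by
    # a second byte.  A simplification, not the thesis implementation.
    if r < s:
        return bytes([r])
    r -= s
    return bytes([s + r // 256, r % 256])

def compress_file(words):
    # First pass: collect frequencies and build the proper model
    # (words sorted by decreasing frequency).
    freq = Counter(words)
    proper_model = [w for w, _ in freq.most_common()]
    rank = {w: i for i, w in enumerate(proper_model)}
    # The swaps against the global model would be evaluated here
    # (Algorithm 17) and written to the local vocabulary.
    # Second pass: encode the input according to the proper model.
    payload = b"".join(encode_rank(rank[w]) for w in words)
    return proper_model, payload

model, data = compress_file("to be or not to be".split())
print(model, len(data), "bytes")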
4.3.1.1 SF-STBDC variant 1:
The obvious question is how to store the swaps in the local vocabulary. The first variant is to store the words as absolute references to the global vocabulary. This variant is apparently not very efficient since most of the swaps are stored as 2 + 3 = 5 bytes. Nevertheless, it shows some beneficial properties for searching on compressed text. If there is no inverted index over the textual database, then this kind of vocabulary storage can accelerate queries asking whether a given word is in the file or not, since the word is stored as a direct reference to the global vocabulary. So there is no need for any preprocessing of the vocabulary file, which can be directly searched by standard pattern-matching algorithms. Furthermore, this variant is suitable for tasks where only the positions of a given word in the compressed text are requested. The swap defines exactly the codeword, and so the word can be directly searched without any need for synchronizing the entire vocabulary, which is necessary when e.g. the extraction of a part of the text is requested.
Another question is how exactly to establish the set of swaps. Clearly, we cannot evaluate the swaps according to the exact rank r_i in the vocabulary, but we can do so according to the length of the corresponding codeword c_i. In other words, a swap is stored only in the case when the ranks of the same word in the global model and the proper model predetermine codewords of different lengths. Suppose the global model is divided into three blocks b: 1B – a block of words with a one-byte codeword, 2B – a block of words with a two-byte codeword and finally nB – a block of words outside the coding scheme of the proper model. Furthermore, suppose F_b to be the set of words (positions in the global model) that do not belong to a given block b in the global model and, at the same time, belong to block b in the proper model. And vice versa: suppose T_b to be the set of words (positions in the global model) that do belong to block b in the global model and, at the same time, do not belong to block b in the proper model. Then the relation S_b ⊆ F_b × T_b represents the set of swaps for a block b. Actually, only the blocks 1B and 2B are relevant for the proper model of a compressed file.
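The sets F_b and T_b can be read off directly from the codeword lengths that the two models assign, as in this sketch (block labels and helper names are ours):

def block_of(rank, s):
    # Block of a word given its rank: '1B' for one-byte codewords,
    # '2B' for two-byte codewords, 'nB' outside the coding scheme.
    if rank < s:
        return "1B"
    if rank < s + (256 - s) * 256:
        return "2B"
    return "nB"

def swap_sets(proper_rank, global_rank, s):
    # F[b]: global positions of words that enter block b only in the
    # proper model; T[b]: global positions of words that leave block b.
    F = {"1B": set(), "2B": set(), "nB": set()}
    T = {"1B": set(), "2B": set(), "nB": set()}
    for word, pr in proper_rank.items():
        gr = global_rank[word]
        pb, gb = block_of(pr, s), block_of(gr, s)
        if pb != gb:                 # only length changes produce a swap
            F[pb].add(gr)
            T[gb].add(gr)
    return F, T

F, T = swap_sets({"the": 0, "rare": 1}, {"the": 0, "rare": 50_000}, s=200)
print(F["1B"], T["nB"])              # 'rare' moves from block nB into block 1B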
[Diagram: directed graph with vertices 1B, 2B and nB and edges labelled F1B→2B, F2B→1B, F1B→nB, FnB→1B, F2B→nB and FnB→2B.]
Figure 4.9: SF-STBDC : Shifts of words between single blocks depicted as a network flow
problem.
The size of the local vocabulary is predetermined by the number of swaps and by the way these swaps are stored. The number of swaps is clearly |S1B| + |S2B| = |F1B| + |F2B| = |T1B| + |T2B|. However, this number can still be lowered if we pair the words that swap from 1B to 2B with those that swap vice versa. Suppose F1B→2B ⊆ F2B to be the set of words that swap from 1B to 2B and suppose F2B→1B ⊆ F1B to be the set of words that swap from 2B to 1B. The number of saved swaps is then σ = min{|F1B→2B|, |F2B→1B|}. Finally, the total number of swaps is |S1B| + |S2B| − σ.
Notice that the problem of word shifts among blocks 1B, 2B and nB can easily be depicted as a network flow problem of graph theory [30]. The global model is represented as a directed graph G = ⟨E, V⟩, where V = {1B, 2B, nB} is a set of vertices representing single blocks of the global model and E ⊆ V × V is a set of edges leading from each vertex to another. Each edge of the graph is associated with a corresponding set of words defined above, e.g. the edge e = [1B, nB] is associated with the set F1B→nB of words that shift from block 1B to block nB. The size of the corresponding set of words determines the flow f : E → Z≥0 of each edge e ∈ E. As for network flows, the number of incoming words must be equal to the number of outgoing words, i.e. for each vertex u ∈ V it must hold that Σ_{(u,v)∈E} f(u, v) − Σ_{(w,u)∈E} f(w, u) = 0. The graph representing the word shifts is depicted in Figure 4.9.
Figure 4.8 depicts how the swaps are evaluated. The solid rectangle represents the global model and the dotted rectangle represents the proper model of a compressed file. The rectangles are divided into three blocks of words with different sizes of their codewords: 1B, 2B and more than 2B (marked as nB). The algorithm evaluating the swaps iterates over the proper model (blocks 1B and 2B) and compares the positions of the words in the proper model (dotted arrow) and in the global model (solid arrow).
Algorithm 17 describes the function syncVoc, which performs the extraction of the
Algorithm 17 SF-STBDC1: Evaluation of swaps between global model and proper model
 1: function syncVoc(vocabulary v1, vocabulary v2)   ▷ The function evaluates the changes between two vocabularies: a file vocabulary v1 and the static (global) vocabulary v2
 2:   v1.s ← computeS(v1)
 3:   v1.fromPointer ← 0; v1.toPointer ← 0
 4:   for i = 0 to s − 1 do
 5:     sPos ← getPosInVoc(v2, v1.word[i])
 6:     if sPos ≥ s ∧ sPos < v1.lastPos then          ▷ a)
 7:       v1.from[v1.fromPointer] ← sPos
 8:       v1.fromPointer ← v1.fromPointer + 1
 9:   for i = 0 to v1.lastPos do
10:     sPos ← getPosInVoc(v2, v1.word[i])
11:     if (i < v1.s ∧ sPos ≥ v1.lastPos)
12:        ∨ (i ≥ s ∧ sPos ≥ v1.lastPos) then         ▷ b), d)
13:       v1.from[v1.fromPointer] ← sPos
14:       v1.fromPointer ← v1.fromPointer + 1
15:     else if i ≥ v1.s ∧ sPos < v1.s then           ▷ e)
16:       v1.to[v1.toPointer] ← sPos
17:       if v1.from[v1.toPointer] ≠ 0 then
18:         v1.active[sPos] ← 1
19:         v1.active[v1.from[v1.toPointer]] ← 1
20:       v1.toPointer ← v1.toPointer + 1
21:     else if i < v1.s ∧ sPos ≥ s
22:        ∧ sPos < v1.lastPos then                   ▷ a)
23:       continue
24:     else                                          ▷ c), f)
25:       v1.active[sPos] ← 1
26:   for i = 0 to v1.lastPos do
27:     if v1.active[i] = 0 then
28:       v1.to[v1.toPointer] ← i
29:       v1.toPointer ← v1.toPointer + 1
swaps based on two different models (vocabularies): the global model and the proper model of the compressed file. The values of the two variables i and sPos are crucial for the evaluation of the swaps. The variable i represents the position of a word in the proper model and the variable sPos is set to the position of the same word in the global model (see line 5). The first for cycle (line 4) iterates over the block 1B of the proper model and collects the words (positions) into the F2B→1B set, i.e. the words with a position in block 2B in the global model and with a position in block 1B in the proper model. This condition (see line 6) meets situation a) in Figure 4.8. This first for cycle ensures that the words that
[Figure content: six words (Sonya, Natasha, Nicholas, Pierre, Denisov, Napoleon), their positions in the global model (1 357, 21 953, 29 025, 144, 195, 295) and in the proper model (125, 91, 284, 49, 1 395, 3 295), the situations a)–f), the arrays from (1 357, 21 953, 29 025, ...) and to (195, ?, ?, ...), and the coding scheme s = 200, c = 56, n = 200 + 56 × 256 = 14 536.]
Figure 4.10: SF-STBDC variant 1: Completion of arrays from and to.
are swapped from block 2B (and not from block nB) are the first in the from array and
can later constitute a swap with words that swap vice versa, i.e. from block 1B to block
2B. The second for cycle (line 9) traverses the whole proper model (vocabulary v1 ) and it
collects the words (positions) that meet the corresponding condition (see line 11) to the sets
FnB→1B and FnB→2B . This condition meets situations b) and d) as depicted in Figure 4.8.
The next condition (see line 15) meets the situation e) in Figure 4.8 and corresponding
words (positions) are stored in the to array. These positions represent members of the T1B
set. Furthermore, if some position already appears at the corresponding position in the from array, then this position, as well as the sPos position, is marked as an already engaged position in the active array (see lines 17 – 19). The next condition (see line 21) covers the same cases as the first for cycle; these cases are therefore skipped this time. The last condition (see line 24) represents the words whose positions stay in the same blocks. These words (positions) are marked (using the active array) as words whose position will not be changed. The
third for cycle iterates over the active array and the positions that are not marked are
stored in the to array (see lines 26 – 29) as free positions. Finally, the swaps are established
as the pairs of positions stored on corresponding positions in the from and to arrays. The
local vocabulary stores these pairs encoded using a simple byte code.
Example 4.3.1 Figure 4.10 shows the situation in terms of the global model and proper
model of the following six words: Sonya, Natasha, Nicholas, Pierre, Denisov, Napoleon.
The coding scheme of SF-STBDC is the following. Number of stoppers s = 200. Number
of continuers c = 256 − s = 56. Number of unique words covered by the coding scheme
n = s+c×256 = 14 536. The situation of each word is evaluated according to Algorithm 17
and it is expressed by the letter (corresponding to Figure 4.8) above the word. Situations
c) and f) cause no swaps. Situations a), b) and d) fill a value (a position in the global
model ) to the from array. Situation e) fills a value (a position in the global model ) to the
to array. The question marks in the to array will be substituted by the positions that are
inside of block 1B or block 2B in terms of the global model and are outside of the coding
scheme in terms of the proper model. These positions are filled in the last for cycle in
Algorithm 17. The positions in arrays from and to form the swaps needed to transform
the global model to the proper model of a given file.
We give a theoretical estimate of the local vocabulary’s size for SF-STBDC1. The local
vocabulary stores the following parts (sets) that together compose the swaps: F2B→1B ,
FnB→1B , FnB→2B , T1B and T2B . SF-STBDC1 uses simple byte codes (some of them with
variable length, some of them with fixed length) to encode the positions of the sets. We
give the following estimate of the size of the local vocabulary considering the average length
of byte codes: 2 × |F2B→1B | + 3 × |FnB→1B | + 3 × |FnB→2B | + |T1B | + 2 × |T2B | bytes. We
note that the positions are stored as the absolute references and so the local vocabulary
can serve as a simple index, answering the questions whether a given word occurs in the
file or not.
4.3.1.2 SF-STBDC variant 2:
The second variant uses a bit vector determining which of the words of the global model are active in the proper model. Obviously, we cannot use the bit vector for the entire global model, which is very large. The density of the active words would be very low towards the end of the global model and such a way of storage would become inefficient. We use a simple heuristic to determine the moment when to stop storing a bit vector and start storing the distance from the last active word. Suppose the density of the active words in the global model is higher at the beginning of the model, which is sorted by frequency in decreasing order. The heuristic checks every thousand of the words and chooses the better way of storage according to the number of active words. If the number of active words in the thousand is higher than 1000/8 = 125, then the bit vector is more efficient. Notice that the distance between two consecutive active words is then stored using only one byte on average, which is a plausible assumption for a density of 125 words in one thousand, when the average distance is 1000/125 = 8. When the number of active words falls below the limit of 125 in one thousand, the remaining words are stored as distances between two consecutive active words.
between two consecutive active words. A simple variable-length byte code is used to store
the distances between two consecutive words. Of course, only the words that should have
two-byte codewords (F2B set) are stored in the bit vector. The swapped words that should
have a one-byte codeword (F1B set) must be stored explicitly using their links to the global
model. Fortunately, these words are few.
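A simplified sketch of this density heuristic (ours): the sorted positions are consumed window by window, and the encoder switches from the bit vector to gap coding once a window of one thousand positions holds fewer than 125 active words.

def split_dense_sparse(positions, threshold=125, window=1000):
    # positions: sorted global positions of the active words.
    # Returns (bit_vector_bits, gaps): the length of the bit vector that
    # covers the dense prefix, and the gap-encoded sparse suffix.
    if not positions:
        return 0, []
    cut = len(positions)
    i = 0
    for win_start in range(positions[0], positions[-1] + 1, window):
        count = 0
        while i < len(positions) and positions[i] < win_start + window:
            i += 1
            count += 1
        if count < threshold:        # density fell below 125 per thousand
            cut = i - count          # first position of the sparse suffix
            break
    dense, sparse = positions[:cut], positions[cut:]
    bit_vector_bits = (dense[-1] - dense[0] + 1) if dense else 0
    gaps = [b - a for a, b in zip(sparse, sparse[1:])]
    return bit_vector_bits, gaps

bits, gaps = split_dense_sparse(list(range(0, 4000, 4)) + [10_000, 12_500])
print(bits, gaps)                    # a ~4 000-bit vector plus one gap of 2 500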
The problem of encoding the T2B set and a part of the FnB→2B set can be seen as a succinct subset encoding. The concept of succinct data structures was originally introduced in [40] and a practical implementation of succinct data structures was presented in [33]. Succinct data structures provide, among others, two basic operations over a bit array: rank (giving the number of set bits up to some position) and select (giving the position of the i-th set bit). Both operations can be performed in constant time at the expense of some (not negligible) space overhead. However, in our solution, we always use sequential scanning of the local vocabulary (bit vectors). Furthermore, the bit vectors are not sparse at all: they contain at least 125 set bits per one thousand. Thus, SF-STBDC has no use for the rank and select operations. Our solution, however, uses an operation similar to popcount (described in [33]) for decoding and encoding the bit vectors representing the local vocabulary.
We give a theoretical estimate of the local vocabulary’s size for SF-STBDC2. SF-
4.3. SET-OF-FILES SEMI-ADAPTIVE TWO-BYTE DENSE CODE
81
STBDC2 uses a bit vector to store the whole T2B set and a part of the FnB→2B set.
Let us divide the FnB→2B set into two subsets: FnB→2B A and FnB→2B B . An element
α = max FnB→2B A is the break point when the algorithm leaves a bit vector and starts
to encode single elements of FnB→2B B (as a distance between two consecutive elements).
An element β = min FnB→2B A represents
the first position of FnB→2B A being encoded
represents
the number of bytes used to store the
as a bit vector. The expression α−β
8
c×256 bit vector of FnB→2B A . The number
is a good estimate of size (in bytes) of a
8
bit vector storing the T2B set since c × 256 represents the size of the entire block 2B.
We suppose, on average, that two bytes are needed to encode the distance between two
consecutive elements of the FnB→2B B set. The
final
estimate of size of the local vocabulary
α−β
c×256
is: 2×|F2B→1B |+3×|FnB→1B |+|T1B |+ 8 + 8 +2×|FnB→2B B | bytes. The theoretical
gain of SF-STBDC2 in comparison to SF-STBDC1 is composed only of difference
in the
c×256
size of the local vocabulary, which can be expressed as: 3 × |FnB→2B | + 2 × |T2B | − 8 −
α−β − 2 × |FnB→2B B | bytes.
8
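The two estimates can be compared with a small helper; the set sizes below are purely illustrative numbers, not measurements:

def voc_size_v1(f2b_to_1b, fnb_to_1b, fnb_to_2b, t1b, t2b):
    # SF-STBDC1: swaps stored as absolute references,
    # with the average codeword lengths given in the text.
    return 2 * f2b_to_1b + 3 * fnb_to_1b + 3 * fnb_to_2b + t1b + 2 * t2b

def voc_size_v2(f2b_to_1b, fnb_to_1b, t1b, c, alpha, beta, fnb_to_2b_sparse):
    # SF-STBDC2: block 2B and the dense part of F_nB->2B as bit vectors,
    # the sparse part gap-encoded with ~2 bytes per element.
    return (2 * f2b_to_1b + 3 * fnb_to_1b + t1b
            + (c * 256) // 8 + (alpha - beta) // 8 + 2 * fnb_to_2b_sparse)

v1 = voc_size_v1(f2b_to_1b=40, fnb_to_1b=60, fnb_to_2b=900, t1b=100, t2b=850)
v2 = voc_size_v2(f2b_to_1b=40, fnb_to_1b=60, t1b=100,
                 c=56, alpha=12_000, beta=300, fnb_to_2b_sparse=150)
print(v1, v2, v1 - v2)               # the difference is the gain of variant 2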
4.3.1.3 SF-STBDC variant 3:
[Diagram: the global model with its 1B and 2B codeword ranges (boundaries s1 and s1 + c1 × 256) and the proper model consisting of a dense part, a sparse part and a second dense part; the original coding space ends at s1 + c1 × 256 and the extended coding space at s2 + c2 × 256 after decrementing s1 to s2.]
Figure 4.11: SF-STBDC variant 3: Storage of the swaps.
The third variant is similar to the semi-dense technique described in [39]. The basic difference is that our proper model consists of two dense parts (see Figure 4.11) instead of just one dense part as in [39]. The second dense part is necessary since the coding scheme of SF-STBDC is limited to two bytes. Because the huge global model does not fit the two-byte coding scheme, it is not possible to keep sparse codes for all words except the most frequent ones.
The idea of this variant is to care about the exact swaps only for the words that should have a one-byte codeword (the sets F2B→1B and FnB→1B). The words that should have a two-byte codeword and are not within the bounds of the coding scheme (the FnB→2B set) are swapped to the first free positions following the coding scheme. The coding scheme is extended at the end by decrementing the parameter s1 to the necessary value s2 (see Equation 4.3). Figure 4.11 depicts the coding scheme of SF-STBDC divided into dense and sparse parts. The first dense part is block 1B, where all positions are engaged by the words that stay in block 1B and by the words that are swapped to block 1B (the T1B set).
Algorithm 18 SF-STBDC3: Evaluation of swaps between the global model and the proper model
 1: function syncVoc(vocabulary v1, vocabulary v2)   ▷ The function evaluates the changes between two vocabularies: a file vocabulary v1 and the static vocabulary v2
 2:   v1.s ← computeS(v1)
 3:   v1.fromPointer ← 0; v1.toPointer ← 0
 4:   for i = 0 to v1.lastPos do
 5:     sPos ← getPosInVoc(v2, v1.word[i])
 6:     if i = v1.s then
 7:       v1.fromPointer1B ← v1.fromPointer
 8:     if (i < v1.s ∧ sPos ≥ v1.s)
 9:        ∨ (i ≥ v1.s ∧ sPos ≥ v1.lastPos) then      ▷ a), b), d)
10:       v1.from[v1.fromPointer] ← sPos
11:       v1.fromPointer ← v1.fromPointer + 1
12:     else if i ≥ v1.s ∧ sPos < v1.s then           ▷ e)
13:       v1.to[v1.toPointer] ← sPos
14:       v1.toPointer ← v1.toPointer + 1
15:       v1.active[sPos] ← 1
16:     else                                          ▷ c), f)
17:       v1.active[sPos] ← 1
18:   v1.toPointer1B ← v1.toPointer
19:   for i = 0 to v1.s do
20:     if v1.active[i] = 0 then
21:       v1.to[v1.toPointer] ← i
22:       v1.toPointer ← v1.toPointer + 1
23:   radixSort(v1.from, 0, v1.fromPointer1B)
24:   radixSort(v1.to, 0, v1.toPointer1B)
25:   radixSort(v1.from, v1.fromPointer1B, v1.fromPointer)
The next sparse part corresponds to the original block 2B (before decrementing the parameter s1), where only some positions are engaged by the words that stay in block 2B. Other positions are unused. The last dense block is the appendix created after extending the coding scheme and is composed of the words of the FnB→2B set. The coding scheme change (extension of block 2B and narrowing of block 1B) is denoted by dotted lines.
This variant saves the space needed to store some swaps. On the other hand, it loses
a few one-byte codewords (by decrementing the parameter s1 ) which has a negative effect
on the size of the encoded text. Unlike the semi-dense technique [39], our approach even
swaps the words that do not belong to the dense part of the model. This means that
these words obtain two-byte codewords instead of their three-byte codewords or four-byte
codewords in the global model and so the compression is more efficient. This variant of
swapped word storage is the most efficient for most of the tested files. Nonetheless, due to
Algorithm 19 SF-STBDC3: Serialization of the swaps to the vocabulary file
 1: function writeVoc(vocabulary v1)   ▷ The function performs the serialization of the file vocabulary v1.
 2:   v1.vocBuffer[v1.vocPointer] ← s
 3:   v1.vocPointer ← v1.vocPointer + 1
 4:   v1.vocBuffer[v1.vocPointer] ← v1.toPointer1B
 5:   v1.vocPointer ← v1.vocPointer + 1
 6:   v1.vocBuffer[v1.vocPointer] ← v1.fromPointer1B
 7:   v1.vocPointer ← v1.vocPointer + 1
 8:   for i = 0 to v1.fromPointer1B do
 9:     v1.vocBuffer[v1.vocPointer] ← v1.to[i]
10:     v1.vocPointer ← v1.vocPointer + 1
11:   for i = 0 to v1.fromPointer1B do
12:     encodeVBC(v1, v1.from[i])
13:   from ← v1.fromPointer1B
14:   fPointer ← from
15:   nextThousand ← v1.from[from]
16:   repeat
17:     thousandCount ← 0
18:     nextThousand ← nextThousand + 1000
19:     while v1.from[fPointer] < nextThousand do
20:       fPointer ← fPointer + 1
21:       thousandCount ← thousandCount + 1
22:   until thousandCount > 125
23:   encodeVBC(v1, fPointer)
24:   encodeBitwise(v1, v1.from, from, fPointer)
25:   while fPointer < v1.fromPointer do
26:     d ← v1.from[fPointer] − v1.from[fPointer − 1]
27:     encodeVBC(v1, d)
28:     fPointer ← fPointer + 1
its semi-dense property, it loses the possibility of being decompressed using a vocabulary composed of the words sorted by their ranks. Both previous variants have this possibility,
so the compressed files can be directly sent to the remote decompressor with the attached
vocabulary composed of the words sorted by their ranks. Such a vocabulary can be easily
obtained by performing the swaps stored in the local vocabulary.
Figure 4.12 depicts how the swaps are evaluated using this variant. The corresponding Algorithm 18 describes the function syncVoc that defines the swaps. Algorithm 19 describes the function writeVoc that specifies how the local vocabulary is stored. The syncVoc function consists of two for cycles. The first for cycle (see line 4) iterates over the entire vocabulary v1.
[Diagram: situations a)–f) of words moving between blocks 1B, 2B and nB, with the from, to and active arrays and the sets F2B→1B, FnB→1B, FnB→2B and T1B, analogous to Figure 4.8.]
Figure 4.12: SF-STBDC variant 3: Evaluation of the swaps.
The variable i represents a position in the proper model and the variable sPos again represents a position in the global model. When i reaches a value corresponding to the parameter s (the number of stoppers), the current value of the from pointer is stored (see line 6). Later,
this value is used to determine parts of the from array that need to be sorted separately.
The next condition (see line 8) covers situations a), b) and d) depicted in Figure 4.12 when
the positions of the sets F2B→1B , FnB→1B and FnB→2B are collected. The next condition
(see line 12) corresponds to situation e) in Figure 4.12 when the positions of the set T1B are
collected. At the same time these positions are marked as engaged positions (see line 15).
The last condition (see line 16) represents situations c) and f) when the words (positions)
in the global model and the proper model stay in the same blocks (with the same length
of corresponding codeword). It means that these positions are marked as engaged again
(see line 17). The next for cycle (see line 19) iterates over block 1B and collects the
positions that are not engaged and stores them in the to array. Finally, the two parts of
the from array (the first part corresponding to F2B→1B and FnB→1B sets, the second part
corresponding to the FnB→2B set) are sorted in increasing order (see lines 23 and 25). The
to array is also sorted in increasing order (see line 24). There are two reasons to sort the
arrays. First, the sorted from and to arrays can establish the swaps between block 1B and
block 2B (similarly to the first variant of SF-STBDC ). Second, the positions in the from
array need to be sorted for encoding using a bit vector.
The function writeVoc (Algorithm 19) starts by storing some necessary parameters: s, |T1B| and |F1B| (see lines 2 – 7). The first for cycle (see line 8) stores the positions of the T1B set using simple one-byte codewords. The following for cycle (see line 11) stores the positions of the F1B set using a simple byte code with codewords of at most three bytes. Next, the function has to evaluate which part of the FnB→2B set should be stored as a bit vector and which part should be encoded as single positions in the global vocabulary. The function uses a simple heuristic. The first repeat-until cycle (see lines 16 – 22) iterates over thousands of positions in the global vocabulary (starting after the last position of the F1B set) until the thousand contains at least 125 words (positions) of the FnB→2B set. The inner while cycle (see lines 19 – 21) iterates over the from array and counts the positions in the current thousand. Finally, the first part of the FnB→2B set (the from array up to fPointer) is encoded using a bit vector (see line 24). The remaining members of FnB→2B are encoded as distances between two consecutive active words using a simple variable-length byte code (see line 27).
We give a theoretical estimate of the local vocabulary’s size for SF-STBDC3. The only
difference between SF-STBDC3 and SF-STBDC2 is that SF-STBDC3 does not encode the
T2B set and extends the coding scheme to cover the words of the FnB→2B set. Suppose that
FnB→2B A , FnB→2B B , α and β are defined, as in the case of SF-STBDC2 (see Section 4.3.1.2).
The final estimate of the local vocabulary’s size is: 2 × |F2B→1B | + 3 × |FnB→1B | + |T
l 1B | +
m
α−β |T2B |
+
2
×
|F
|
bytes.
The
extension
of
the
coding
scheme
causes
that
nB→2B B
8
256
unique words have to use two-byte codewords instead of one-byte codewords. Suppose
the original number of stoppers sold and
l a mnew number of stoppers (after an extension
2B |
of the coding scheme) snew = sold − |T256
. The loss in compression (because of the
Psold
extension of the coding scheme) can be expressed as:
i=snew fi . The theoretical gain of
SF-STBDC3, in comparison to SF-STBDC2, is composed only of the difference in the size
of the localPvocabulary and of the compression loss caused by the coding scheme extension:
old
d c×256
fi .
e − si=s
8
new
4.3.2 Searching on the Compressed Text
The attractiveness of the compression algorithm presented in [25] is in its simplicity and
especially in the possibility of fast searching directly on the compressed text. We want
to keep all these advantages for our SF-STBDC and still improve the compression ratio,
which is one of the crucial parameters.
p_s = (1/n) × ( Σ_{i=0}^{s−1} f_i + Σ_{i=0}^{c−1} Σ_{j=s+256×i}^{s+256×i+s−1} f_j )        (4.4)

p_c = (1/n) × ( Σ_{i=0}^{c−1} Σ_{j=s+256×i}^{s+256×i+s−1} f_j + Σ_{i=0}^{c−1} Σ_{j=s+256×i+s}^{s+256×i+255} 2 × f_j )        (4.5)

p_dec = Σ_{i=0}^{k−1} p_c^i × p_s        (4.6)
The obvious problem of SF-STBDC is that it is not a fully self-synchronizing code.
It means that for two adjacent codewords, the suffix of the first codeword and the prefix
of the second one can form another codeword which can be incorrectly regarded as the
[Figure content: two panels, FPMA (left) and BPMA (right), showing the search automata built for the codewords ⟨201⟩⟨137⟩ and ⟨212⟩⟨212⟩ and the shifts performed in the individual steps over a byte stream containing the values 107, 195, 201, 137 and 212, together with the conditions under which an occurrence is confirmed or disproved.]
Figure 4.13: FPMA and BPMA: Searching example.
occurrence of some searched pattern. On the other hand, the SF-STBDC code stream is always synchronized by the stoppers (byte values lower than s). When a stopper value occurs, it is certain that it marks the end of a codeword, no matter whether it is the first or the second byte of the codeword. Considering no context, Equations 4.4 and 4.5
express the probability of the stopper value and the probability of the continuer value, respectively, in the compressed byte stream. Both equations consist of sums iterating over some members of the proper model, where f_i represents the frequency of the word w_i and n represents the total number of words in the compressed file. Suppose the search algorithm evaluates each hit by checking k preceding bytes; then the probability that the hit can be confirmed or disproved can be expressed as Equation 4.6.
For all tested files, the probability p_s is around 0.7 and the probability p_c is around 0.3. These values give a probability of confirming or disproving the hit of approximately p_dec = 0.973 using only three preceding bytes. However, the search algorithm always checks the preceding bytes until the hit is confirmed or disproved. It means that no false matches are produced.
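The same computation can be scripted directly from the definitions; the sketch below implements Equations 4.4–4.6, and the final line reproduces the figure p_dec ≈ 0.973 for p_s = 0.7, p_c = 0.3 and k = 3:

def stopper_continuer_probabilities(freq, s, c):
    # Equations 4.4 and 4.5: freq[i] is the frequency of the word of rank i,
    # n is the total number of words in the compressed file.
    n = sum(freq)
    p_s = sum(freq[:s])                            # one-byte codewords
    p_c = 0.0
    for i in range(c):
        base = s + 256 * i
        with_stopper = sum(freq[base:base + s])    # second byte is a stopper
        p_s += with_stopper
        p_c += with_stopper                        # ...their first byte is a continuer
        p_c += 2 * sum(freq[base + s:base + 256])  # both bytes are continuers
    return p_s / n, p_c / n

def p_dec(p_s, p_c, k):
    # Equation 4.6: probability that k preceding bytes decide the hit.
    return sum(p_c ** i * p_s for i in range(k))

print(round(p_dec(0.7, 0.3, 3), 3))                # -> 0.973, as reported above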
We propose two search algorithms for SF-STBDC: a backward and a forward variant. The backward search algorithm (BPMA) is a multi-pattern variant of the classical Boyer-Moore-Horspool algorithm [36]. Notice that due to the simple coding scheme of SF-STBDC, the shifting possibilities are very limited. In the case that a one-byte codeword is among the searched patterns, the maximal allowed shift degrades to one byte and does not provide any benefit. The technique of confirming a hit using the preceding bytes is added to this variant of the search algorithm.
The forward search algorithm (FPMA) is even simpler than the backward variant. The shifts are aligned to the codewords and so the synchronization is provided. It means that for a value b < s on the first position of a codeword, the maximal allowed shift is one byte. For a value b ≥ s on the first position of a codeword, the maximal allowed shift is two bytes. Finally, for any value b on the second position of a codeword, the maximal allowed shift is one byte.
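A minimal sketch of the forward, codeword-aligned scan (ours, not the thesis implementation); because the scan never leaves codeword boundaries, a matching codeword is always a true occurrence:

def fpma_search(stream, patterns, s):
    # Forward codeword-aligned search over an SF-STBDC-like byte stream.
    # patterns: set of codewords (bytes of length 1 or 2); s: number of stoppers.
    hits = []
    i = 0
    while i < len(stream):
        # A stopper on the first position is a whole one-byte codeword,
        # a continuer starts a two-byte codeword.
        length = 1 if stream[i] < s else 2
        if bytes(stream[i:i + length]) in patterns:
            hits.append(i)
        i += length                   # shift by one or two bytes, staying aligned
    return hits

# Hypothetical byte stream; the codewords <201><137> and <212><212> and
# s = 192 are taken from Example 4.3.2 below.
data = bytes([107, 195, 33, 212, 212, 201, 137, 5])
print(fpma_search(data, {bytes([201, 137]), bytes([212, 212])}, s=192))  # [3, 5]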
Both variants of the search algorithm first analyse the local vocabulary to catch a possible change of the codewords. Afterwards, a search automaton is built and the compressed text is searched using the shifting explained above. We have already mentioned in the introduction that we suppose the final solution is equipped with non-positional inverted indexes [76] pointing to single documents containing a given word.
Example 4.3.2 Suppose we have a set of two patterns whose codewords in the SF-STBDC encoded text (using parameter s = 192) correspond to the following set P = {⟨201⟩⟨137⟩, ⟨212⟩⟨212⟩}. Furthermore, suppose that we want to search these patterns using the FPMA and BPMA search algorithms.
The first step for both algorithms is to build the search automata: a forward pattern-matching automaton for FPMA (see the left side of Figure 4.13) and a backward pattern-matching automaton for BPMA (see the right side of Figure 4.13). Notice that each automaton is composed as a set of paths from an initial state to final states that correspond to the searched codewords. Moreover, an error state is added to each automaton to indicate that no occurrence was found and a shift must be performed. In addition to the automata, both search algorithms are equipped with shift tables that determine the length of a shift, which can be one or two bytes in the compressed byte stream. The shift heuristic is very simple for FPMA. The shift is one byte for a value C[i] < s (where C represents a
compressed byte stream and i a position in the stream) and two bytes otherwise. We use
a simple “bad character shift” heuristic for the BPMA search algorithm.
Now, follow the left part of Figure 4.13, which corresponds to the FPMA search algorithm. In the first step, the algorithm encounters byte ⟨107⟩, which corresponds to the error state in the search automaton. Since 107 < s = 192, a shift by one byte has to be performed. Step 2 reads byte ⟨195⟩, which again means the error state and a shift by two bytes since 195 ≥ s = 192. In step 3, the algorithm encounters byte ⟨212⟩ and the automaton moves from the initial state to state "3". Finally, in step 4, another byte ⟨212⟩ is read and the automaton moves to the final state "4". The final state automatically means that a match was found since the FPMA search algorithm always shifts so that it stays aligned with the codewords in the compressed byte stream.
The right part of Figure 4.13 corresponds to the BPMA search algorithm. The algorithm starts by reading the second byte of the compressed byte stream, whose value is ⟨195⟩. This means that the error state is met in the search automaton and a shift by two bytes has to be performed, since the value ⟨195⟩ is not a substring of any of the searched codewords. In step 2, the algorithm reads byte ⟨212⟩ and the automaton moves to state "3". In step 3, the algorithm reads byte ⟨212⟩ again and the automaton moves to the final state "4". However, since the searching is not aligned with the codewords, the occurrence must be confirmed or disproved by the preceding bytes in the compressed data stream. Basically, the algorithm looks for the first stopper value (C[j] < s) to the left of the occurrence (at position i). If the difference (i − j) is odd, then the occurrence is confirmed. Otherwise, it is disproved. This time the occurrence is disproved by the stopper value ⟨107⟩ at position j = 1, where the difference is even: 3 − 1 = 2. However, a shift by one byte has to be performed at the end of step 3, since the value ⟨212⟩ appears at the second position of one of the searched codewords. In step 4, the value ⟨212⟩ is read and the automaton moves to state "3". Finally, in step 5, the algorithm reads byte ⟨212⟩ and moves the automaton to the final state "4". This time, the occurrence is confirmed by the same stopper value C[1] = ⟨107⟩ since the difference of positions is odd: 4 − 1 = 3.
4.3.3 Experiments
Our aim is to follow the idea presented in [25]. We want to focus our SF-STBDC on compression optimized for search engines. Such a compression method must allow fast searching directly on the compressed text and still achieve an attractive compression ratio. The idea of our compression method is to keep the search speed of the modified SCDC [25] and improve the compression ratio.
Unfortunately, despite the efforts of the authors of [25], it was possible to obtain neither the implementation of the modified SCDC nor the text collection used in their experiments. Furthermore, some necessary details of the implementation (e.g. the way the common vocabulary and the specific vocabulary are combined) are missing in [25]. Thus, we cannot present a comparison of our SF-STBDC to the modified SCDC. However, we have every reason to believe that our SF-STBDC keeps the searching performance and
[Plot: compression ratio (28–38 %) versus the number of continuers c (0–120) for the files test_10k.txt, test_20k.txt, test_100k.txt, test_200k.txt and test_500k.txt.]
Figure 4.14: Dependency of the compression ratio on the number of continuers c for different tested files. Each curve starts with the optimal value c_min and continues with incrementing the number of continuers c.
achieves a better compression ratio: (i) SF-STBDC works with the proper model of each compressed file, which ensures the best possible compression ratio; (ii) the coder of SF-STBDC (unlike the modified SCDC) uses the optimal value s_min for each file; (iii) the coding scheme of SF-STBDC provides a higher number of words covered by codewords with a size of up to two bytes; (iv) the global vocabulary contains all the words of the corpus, which means that the words can be addressed by links to the global vocabulary. The matter of using the proper model in the modified SCDC [25] is not clear since the authors do not clarify the way of combining the common vocabulary and the specific vocabulary. Thus, we have no reason to believe that the words obtain codewords of the same size as if the file were compressed separately using standard SCDC.
We present a small experiment to support our argument (ii) concerning the optimality of the number of stoppers s_min. Figure 4.14 shows the results of the experiment. We performed our SF-STBDC3 variant on different tested files with different values of the parameter c = 256 − s. SF-STBDC does not allow using some average value s_avg, since the common value of s for all files of the collection is driven by the file with the highest number of unique words that must be covered by the coding scheme. Thus, one file of the collection with an extremely high number of unique words could spoil the common s value for all files of the collection. The results presented in Figure 4.14 show that the compression ratio is
not sensitive to the exact value of c_min. However, with a higher difference c − c_min (in the case of a file with an extremely high number of unique words), the compression ratio can worsen significantly (e.g. with the difference c − c_min = 80, the resulting compression ratio is up to 1.5 % worse for the "test 100k.txt" file). The oscillating behaviour of the presented dependency is very interesting for the small tested files ("test 10k.txt" and "test 20k.txt"). We give the following explanation for this oscillation. On the one hand, an increasing number c forces more words to be encoded using two-byte codewords, which causes an increase of the compression ratio. On the other hand, the decreasing size of block 1B causes a lower number of swaps between block 1B and block 2B, which means a smaller size of the local vocabulary. The gain in the local vocabulary is small; however, with the small word frequencies found in small files, it can compensate for the loss caused by a higher number of two-byte codewords.
We perform our tests on an Intel® Core™ 2 Duo CPU T6600 at 2.20 GHz with 4 GB RAM. We use the gcc compiler version 3.4.6 with the optimization -O3. We measure user time + system time in our experiments.
We present three variants of SF-STBDC varying in the way of storage of the swaps.
SF-STBDC1 stores the swaps as the absolute addresses to the global vocabulary (see Section 4.3.1.1). SF-STBDC2 combines the storage as the bit vector and the distances between
two consecutive words (see Section 4.3.1.2). Finally, SF-STBDC3 stores the exact swaps
only for one-byte codewords (see Section 4.3.1.3). The other words obtain the first free
positions in the coding scheme.
We compose two large text corpora. The first one, Gutenberg, is composed as a subset of text files from the Gutenberg Project. The size of the corpus is more than 600 MB and it contains files in different world languages. We have chosen a few files of different sizes: 10 kB, 20 kB, 30 kB, 50 kB, 100 kB, 200 kB and 500 kB for our experiments. All tested files as well as the text corpus are publicly available at www.stringology.org/cj_corpus. The experiments presented in Table 4.15 and Table 4.16 were performed on our Gutenberg corpus. The average size of one file in the corpus is 200 kB.
The second text corpus, WT2g, is composed as a set of web pages converted to text form. Our WT2g corpus is a subset of the WT2g collection used in the TREC conference, which is available at http://ir.dcs.gla.ac.uk/test_collections/access_to_data.html8. The experiments presented in Table 4.17 were performed on our WT2g corpus. The average size of one file in the corpus is 8 kB.
We carried out all variants of SF-STBDC on single small files from the test collection (see Table 4.15) and compared the achieved compression ratio to the baseline compressors Gzip [46] and SCDC [16]. All variants of SF-STBDC exploit the access to the global vocabulary as well as their simple byte code and significantly outperform their competitors. A similar comparison is also presented in [25].
The right part of Table 4.15 reports the search times achieved on SF-STBDC3 compressed text, which are crucial for the localization of single terms in the compressed text
8 Unfortunately, we cannot publish the resulting files due to copyright statements. However, the files can be easily extracted from the corpus and converted from HTML format to pure text.
File            Size        SF-STBDC1   SF-STBDC2   SF-STBDC3   SCDC        Gzip        MBMH       FPMA       BPMA
test 10k.txt    13 267 B    50.19 %     35.51 %     36.14 %     73.14 %     41.60 %     0.670 ms   0.155 ms   0.155 ms
                            3 608 B     3 608 B     3 618 B     3 605 B     5 519 B     −          76.87 %    76.87 %
                            3 051 B     1 103 B     1 177 B     6 099 B     −
test 20k.txt    20 176 B    49.28 %     35.76 %     35.76 %     75.83 %     44.46 %     0.671 ms   0.172 ms   0.172 ms
                            5 835 B     5 835 B     5 850 B     5 854 B     8 971 B     −          74.37 %    74.37 %
                            4 108 B     1 380 B     1 365 B     9 445 B     −
test 30k.txt    32 028 B    51.26 %     35.05 %     34.71 %     81.91 %     45.79 %     0.811 ms   0.218 ms   0.202 ms
                            8 928 B     8 928 B     8 969 B     8 976 B     14 666 B    −          73.12 %    75.09 %
                            7 489 B     2 297 B     2 148 B     17 259 B    −
test 50k.txt    51 388 B    38.25 %     31.25 %     31.03 %     59.56 %     40.63 %     0.842 ms   0.233 ms   0.279 ms
                            14 563 B    14 563 B    14 616 B    14 686 B    20 877 B    −          72.33 %    66.86 %
                            5 093 B     1 497 B     1 331 B     15 920 B    −
test 100k.txt   103 312 B   36.72 %     32.45 %     32.26 %     55.49 %     40.73 %     1.076 ms   0.374 ms   0.342 ms
                            31 353 B    31 353 B    31 486 B    31 625 B    42 074 B    −          65.24 %    68.22 %
                            6 579 B     2 171 B     1 839 B     25 701 B    −
test 200k.txt   192 535 B   38.09 %     31.78 %     31.70 %     61.71 %     42.81 %     1.606 ms   0.624 ms   0.623 ms
                            55 676 B    55 676 B    56 102 B    55 970 B    82 418 B    −          61.15 %    61.21 %
                            17 659 B    5 519 B     4 923 B     62 852 B    −
test 300k.txt   300 698 B   30.43 %     27.45 %     27.54 %     43.32 %     36.09 %     2.027 ms   0.732 ms   0.763 ms
                            78 486 B    78 486 B    79 164 B    79 585 B    108 536 B   −          63.89 %    62.36 %
                            13 027 B    4 051 B     3 647 B     50 689 B    −
test 500k.txt   499 032 B   31.23 %     29.12 %     29.23 %     41.50 %     37.35 %     3.041 ms   1.138 ms   1.060 ms
                            140 696 B   140 696 B   141 963 B   142 083 B   186 380 B   −          62.58 %    65.14 %
                            15 174 B    4 611 B     3 914 B     65 022 B    −
Table 4.15: Compression ratio, size of the compressed text and size of the vocabulary file (the three rows of each file) in the left part. Search time in milliseconds and the time improvement in comparison to MBMH in the right part of the table.
and for building the snippet. We tested the search of multiple patterns in the compressed file. Ten patterns were randomly chosen from each of the tested files. The patterns are of different lengths and different frequencies; English stop words were omitted. The average length of the chosen patterns is 7.49 characters. All pattern files are also available at www.stringology.cz/corpus.
We compare both proposed variants of the search algorithm, FPMA and BPMA. We use the classical multi-pattern Boyer-Moore-Horspool algorithm (MBMH) [36] performed on uncompressed text as a baseline. Figure 4.15 illustrates the number of comparisons (i.e. the efficiency of shifting) performed by the individual tested search algorithms when searching on tested files of different sizes.
The search process starts with reading the local vocabulary and composing the swaps (using the from and to arrays). Next, the swaps must be performed and the global model is thus transformed into the proper model of the tested file. As the next step, a multi-pattern search automaton has to be built according to the codewords of the searched patterns. Finally, the compressed text can be scanned using the search automaton. The aforementioned overhead spoils the resulting search time on small files. E.g. the search time of "test 20k.txt" is only 3.6 times smaller than the search time of "test 200k.txt", although its size is ten times lower.
Suppose the scenario of building the snippet. The search engine retrieves s top-scoring
pages (typically s = 10 [29]) that contain the queried keyword using the non-positional
[Plot: number of comparisons (0–200 000) versus file size (0–500 kB) for MBMH, FPMA and BPMA.]
Figure 4.15: Searching algorithms: Number of comparisons. The algorithms FPMA and BPMA proposed in Section 4.3.2 are performed on the compressed text using the word-based approach. The algorithm MBMH is a multi-pattern variant of the classical Boyer-Moore-Horspool algorithm [36]. All algorithms search for the same set of randomly chosen patterns.
First, it must determine the positions of the keyword in the page (the corresponding raw text file). Second, it must decompress a neighbourhood of each such position and compose the snippet. Suppose the number of occurrences of the keyword in the text is very low; we can then neglect the decompression time of a rather small neighbourhood and define the snippet time in our experiment as follows: 1. the search time over the compressed text for compression methods allowing the decompression of a random part of the file (including compressed self-indexes and their locate operation); 2. the decompression time plus the search time over the decompressed text for the other methods. The search algorithm of SF-STBDC already includes the transformation of the global model into the proper model of the searched file. The decompression of a single SF-STBDC codeword can be performed in constant time, so for SF-STBDC, building a snippet means: (i) reading the codewords in the neighbourhood of the occurrence; (ii) transforming the codewords to ranks; (iii) reading the words of the proper model pointed to by the ranks. This allows us to neglect the constant time needed to decompress the snippet.
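A minimal sketch of this three-step snippet extraction is given below. It reuses the toy fixed two-byte codewords of the previous sketch; the real code space mixes one-byte and two-byte codewords, so the window arithmetic is simplified, and the model and stream are made-up examples.

```python
def build_snippet(compressed, hit_index, proper_model, radius=5):
    """(i) Read the codewords in a window around the hit, (ii) turn each
    codeword into a rank, (iii) look the words up in the proper model."""
    n_codewords = len(compressed) // 2          # toy fixed 2-byte codewords
    start = max(0, hit_index - radius)
    end = min(n_codewords, hit_index + radius + 1)
    words = []
    for i in range(start, end):
        rank = (compressed[2 * i] << 8) | compressed[2 * i + 1]
        words.append(proper_model[rank])
    return " ".join(words)

# Tiny self-contained example with a made-up model and stream.
model = ["snippet", "speed", "dense", "code", "text"]
stream = bytes(b for r in [4, 2, 3, 0, 1] for b in (r >> 8, r & 0xFF))
print(build_snippet(stream, hit_index=2, proper_model=model, radius=1))
# -> dense code snippet
```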
Table 4.16 summarizes the general comparison to other classical compression methods that are usually used for the compression of raw text data.
Compressor   Block size   Comp. ratio   Snippet speed   N. snippet speed   Dec. speed    N. dec. speed
LZMA         1 MB         32.68 %        12.71 MB/s       2.35 MB/s         13.96 MB/s     2.58 MB/s
LZMA         2 MB         30.90 %        13.76 MB/s       1.46 MB/s         15.14 MB/s     1.61 MB/s
LZMA         5 MB         29.04 %        14.91 MB/s       0.59 MB/s         16.46 MB/s     0.65 MB/s
Gzip         1 MB         40.69 %        10.92 MB/s       2.02 MB/s         11.83 MB/s     2.19 MB/s
Gzip         2 MB         38.99 %        13.97 MB/s       1.49 MB/s         15.39 MB/s     1.64 MB/s
Gzip         5 MB         37.99 %        19.37 MB/s       0.76 MB/s         22.07 MB/s     0.87 MB/s
Bzip2        1 MB         31.73 %         5.19 MB/s       0.96 MB/s          5.39 MB/s     1.00 MB/s
Bzip2        2 MB         30.26 %         5.35 MB/s       0.57 MB/s          5.55 MB/s     0.59 MB/s
Bzip2        5 MB         29.15 %         5.98 MB/s       0.24 MB/s          6.21 MB/s     0.24 MB/s
SCDC         1 MB         47.32 %        98.17 MB/s      98.17 MB/s         35.12 MB/s    35.12 MB/s
SCDC         2 MB         42.43 %       150.14 MB/s     150.14 MB/s         30.04 MB/s    30.04 MB/s
SCDC         5 MB         38.12 %       259.90 MB/s     259.90 MB/s         27.38 MB/s    27.38 MB/s
SF-STBDC3    0.2 MB       31.70 %       294.73 MB/s     294.73 MB/s         24.62 MB/s    24.62 MB/s
Sad-CSA      100 MB       42.97 %       925.93 MB/s     925.93 MB/s          0.23 MB/s     0.23 MB/s
Table 4.16: General comparison (Gutenberg text corpus): compression ratio, snippet speed and decompression speed. Normalized snippet and decompression speeds are computed for methods without random access to compressed data using the “back of the envelope” calculation [29]. Both speeds are stated in megabytes per second, where 1 MB = 2^20 B of the original (uncompressed) file.
[Figure 4.16: two scatter plots of compression ratio (x-axis, 28–50 %) against (a) log(snippet speed) and (b) log(decompression speed) for LZMA, Gzip, Bzip2, SCDC, SF-STBDC3 and Sad-CSA.]
Figure 4.16: General comparison (Gutenberg text corpus): trade-off between the compression ratio and the normalized speed. The speed values are on a logarithmic scale. The linked points represent the same algorithm performed on different block sizes.
We compare the following compression methods in terms of compression ratio, snippet speed and decompression speed. Gzip [46] (version 1.4), Bzip2 [66] (version 1.0.6)^9 and LZMA^10 [57] represent the methods that do not allow random decompression. Conversely, SCDC [16], our SF-STBDC3 and the compressed self-index Sad-CSA^11 [55] are the methods that allow the decompression of randomly chosen parts. The algorithms that are not optimized for the compression of small files were performed on blocks 1, 2 or 5 MB in size. The Sad-CSA self-index was performed on a block 100 MB in size (the largest that the implementation allowed). Our SF-STBDC3 was performed on the single file “test 200k.txt”, whose size (approximately 200 kB) equals the average size of the files in the tested collection. We used the “back of the envelope” calculation mentioned in [29] to achieve the fairest possible comparison between the compressors that can randomly access single files and those that must decompress a whole block. This calculation (used to express the normalized snippet speed and normalized decompression speed in Table 4.16) multiplies the measured speed by the fraction P/B, where P is the average page size (P = 200 kB) and B is the block size (1, 2 or 5 MB).
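The normalization itself reduces to a one-line formula. The sketch below only restates it with illustrative placeholder numbers (not values from Table 4.16); the function name is ours.

```python
def normalized_speed(raw_speed_mb_s, page_size_mb, block_size_mb):
    """A block-oriented compressor must decompress a whole block of size B
    to reach one page of size P, so its effective speed scales by P / B."""
    return raw_speed_mb_s * (page_size_mb / block_size_mb)

# Illustrative only: a compressor decompressing 15 MB/s on 1 MB blocks,
# queried for 0.2 MB pages, delivers about 3 MB/s of useful page data.
print(normalized_speed(15.0, page_size_mb=0.2, block_size_mb=1.0))  # 3.0
```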
SF-STBDC3 achieves a very good trade-off between the compression ratio and the snippet / decompression speed (see Figure 4.16). Its compression ratio is at the level of the best standard compressors (LZMA and Bzip2) and it significantly outperforms Gzip, SCDC and Sad-CSA. The snippet speed of SF-STBDC3 is the best among the standard compressors and is surpassed only by Sad-CSA, which, being an index data structure, is unbeatable in this respect. However, the snippet speed of Sad-CSA strongly depends, among other things, on the probability of the searched patterns, since the locate operation depends on the number of occurrences. We performed our experiments with patterns of different probabilities p_i. We state some of the results as pairs (p_i; snippet speed in MB/s): (2.78 × 10^-7; 24 630.54), (1.06 × 10^-5; 925.93), (3.28 × 10^-5; 321.54), (6.72 × 10^-4; 19.19).
Due to its more complicated vocabulary structure, SF-STBDC3 is slightly worse in decompression speed than SCDC. We conclude that SF-STBDC3 is well balanced in all three tested parameters and thus represents an interesting alternative for compressing a set of text files without the need for a positional inverted index.
We repeated the same set of experiments presented in Table 4.16 on our WT2g corpus (i.e. on the textual content of web pages). The results of these experiments are summarized in Table 4.17. The algorithms that cannot effectively compress single small files were performed on large blocks composed as concatenations of single text files. The algorithms LZMA, Gzip, Bzip2 and SCDC were performed on blocks 1 MB, 2 MB and 5 MB in size. Sad-CSA is fast enough for any size, so it was performed on a block 100 MB in size (the largest that the implementation allowed). Finally, our SF-STBDC3 was performed on
single text files (corresponding to single web pages) 0.008 MB (which is the average size
of a file in our WT2g corpus), 0.02 MB and 0.2 MB in size. These three files are listed
in the WT2g corpus using the following names: “WT16-B20-10”, “WT02-B18-152” and
“WT08-B38-1”.
Concerning the 0.2 MB file, the results of our SF-STBDC3 are very similar to those presented in Table 4.16. The compression ratio, as well as the snippet speed, naturally worsens for smaller files (sizes of 0.02 MB and 0.008 MB); the snippet speed gets worse due to the overhead of initializing the search algorithm.
^9 Both Bzip2 and Gzip were run with the −9 parameter to achieve maximum compression.
^10 The compression performance of dictionary compression methods can be further improved by URL-based or similarity-based sorting. We suppose this technique would have only a minor effect on our pure text collection composed of literary works. See the experiments in [29] including different document sortings.
^11 The performance of Sad-CSA strongly depends on its sampling parameters. We set samplerate and samplepsi to 128 to achieve a good trade-off between the compression ratio and decompression speed.
Compressor   Block size   Comp. ratio   Snippet speed     N. snippet speed   Dec. speed       N. dec. speed
LZMA         1 MB         24.09 %        24.8737 MB/s      0.1989 MB/s        36.1223 MB/s     0.2888 MB/s
LZMA         2 MB         19.45 %        38.1358 MB/s      0.1555 MB/s        66.5580 MB/s     0.2714 MB/s
LZMA         5 MB         20.70 %        53.7208 MB/s      0.0831 MB/s       122.7458 MB/s     0.1899 MB/s
Gzip         1 MB         29.58 %        13.3067 MB/s      0.1064 MB/s        15.9666 MB/s     0.1277 MB/s
Gzip         2 MB         25.74 %        18.4936 MB/s      0.0754 MB/s        23.3235 MB/s     0.0951 MB/s
Gzip         5 MB         27.56 %        22.2765 MB/s      0.0345 MB/s        29.0508 MB/s     0.0449 MB/s
Bzip2        1 MB         22.92 %         5.4067 MB/s      0.0432 MB/s         5.7992 MB/s     0.0464 MB/s
Bzip2        2 MB         20.87 %         6.1061 MB/s      0.0249 MB/s         6.5542 MB/s     0.0267 MB/s
Bzip2        5 MB         22.18 %         6.6141 MB/s      0.0102 MB/s         7.1061 MB/s     0.0110 MB/s
SCDC         1 MB         37.65 %        88.7236 MB/s     88.7236 MB/s        14.2636 MB/s    14.2636 MB/s
SCDC         2 MB         38.50 %       140.0878 MB/s    140.0878 MB/s        21.3320 MB/s    21.3320 MB/s
SCDC         5 MB         47.00 %       263.0180 MB/s    263.0180 MB/s        29.1485 MB/s    29.1485 MB/s
SF-STBDC3    0.008 MB     36.63 %        66.9826 MB/s     66.9826 MB/s        31.4876 MB/s    31.4876 MB/s
SF-STBDC3    0.02 MB      34.34 %       101.9378 MB/s    101.9378 MB/s        31.1980 MB/s    31.1980 MB/s
SF-STBDC3    0.2 MB       31.12 %       202.9859 MB/s    202.9859 MB/s        30.9858 MB/s    30.9858 MB/s
Sad-CSA      100 MB       40.80 %       159.1656 MB/s    159.1656 MB/s         0.2292 MB/s     0.2292 MB/s
Table 4.17: General comparison (WT2g text corpus): compression ratio, snippet speed and decompression speed. Normalized snippet and decompression speeds are computed for methods without random access to compressed data using the “back of the envelope” calculation [29]. Both speeds are stated in megabytes per second, where 1 MB = 2^20 B of the original (uncompressed) file.
However, the compression ratio of SF-STBDC3 is always better than that of SCDC and Sad-CSA.
There is more inter-document redundancy in web collections, which explains the significant improvement of the compression ratio (up to 10 %) for the following compression methods: LZMA, Gzip, Bzip2 and SCDC. Along with the compression ratio, the methods LZMA and Gzip also naturally improved their decompression speed and snippet speed.
Table 4.17 contains the normalized snippet speed and normalized decompression speed to achieve the fairest possible comparison of the algorithms allowing random access to single files (SCDC, SF-STBDC3 and Sad-CSA) with those that do not allow random access (LZMA, Gzip and Bzip2). However, the average page size is very low this time (P = 8 kB), so the normalized results of LZMA, Gzip and Bzip2 are very poor in comparison with the other algorithms.
We conclude that SF-STBDC3 achieves an acceptable compression ratio (between 31 %
and 37 %) and still provides very fast access to single files with a snippet speed between 66
MB/s and 202 MB/s and decompression speed above 30 MB/s.
Chapter 5
Conclusions
5.1 Summary
In this thesis we have presented an extensive overview of different kinds of dense codes together with their possible applications. In addition, we have proposed a novel concept, Open Dense Code (ODC), which became a frame for the definition of two new coding schemes: Two-Byte Dense Code (TBDC) and Self-Tuning Dense Code (STDC).
These two coding schemes can easily be used as simple adaptive byte-code compressors that achieve a very good compression ratio (better than that of their competitors) and are still very fast in compression and decompression. Both adaptive compressors are very considerate to small files with natural language content; thus, they are especially suited to fast on-the-fly compression and decompression of this kind of file.
Furthermore, the application of the TBDC coding scheme was also extended to semi-adaptive compression and to the compression of a huge set of small files with natural language content. Semi-Adaptive Two-Byte Dense Code (STBDC) offers efficient block-oriented compression of large files with possibly inconsistent natural language content (e.g. a concatenation of texts in different languages or of a different character).
Set-of-Files Semi-Adaptive Two-Byte Dense Code (SF-STBDC) is a compressor optimized for semi-adaptive compression of a huge set of small files. It works with a so-called global model: the set of all words occurring in any file of the compressed set. The compressor thus exploits the repetition of the same words in different files of the set, and thanks to this technique the resulting compression ratio is significantly reduced.
All of the presented compressors are optimized for the compression of natural language content. Naturally, all of them also exploit some basic linguistic properties. First of all, we have to mention word-based modelling [52, 37], which is a cornerstone of all the compression methods presented in this thesis. The concept of byte coding, which fits the word-based entropy not optimally but very efficiently, is equally important; moreover, it provides very fast compression and especially decompression. Besides these pillars, our compression methods include other linguistic tweaks of minor importance: spaceless word-based modelling [52, 20] and the well-known technique of capital conversion [70] of the words at the beginning of a sentence.
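As a rough illustration of these two tweaks (a sketch of the general idea, not the exact transforms implemented in our compressors), the following code treats a single space between two words as implicit and lowercases capitalized words while emitting an explicit flag token; the real capital conversion targets only words at the beginning of a sentence, and the flag token name is hypothetical.

```python
import re

FLAG = "<cap>"   # hypothetical marker token for a converted capital letter

def spaceless_tokenize(text):
    """Split text into word and separator tokens; a single space between two
    words is implicit and therefore not emitted as a token."""
    tokens, prev_word = [], False
    for tok in re.findall(r"[A-Za-z]+|[^A-Za-z]+", text):
        if tok.isalpha():
            if tok[0].isupper() and tok[1:].islower():
                tokens.append(FLAG)          # capital conversion (simplified:
                tok = tok.lower()            # applied to any capitalized word)
            tokens.append(tok)
            prev_word = True
        else:
            if not (tok == " " and prev_word):
                tokens.append(tok)           # only non-default separators kept
            prev_word = False
    return tokens

print(spaceless_tokenize("The code works. It is fast."))
# ['<cap>', 'the', 'code', 'works', '. ', '<cap>', 'it', 'is', 'fast', '.']
```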
5.2 Contributions of the Thesis
We consider the Open Dense Code concept the basic contribution of this thesis. ODC enabled us to describe all existing dense coding schemes and to define two new coding schemes: TBDC and STDC. Based on these coding schemes, we developed different compression methods applicable in different situations. Let us recall the contributions of the individual compression methods in separate paragraphs.
Both our adaptive dense compressors TBDC and STDC proved to be very considerate to small files and still outperform their competitors in compression ratio (approximately 1 % improvement in comparison to DSCDC and approximately 2 % improvement in comparison to DETDC). Furthermore, both adaptive compressors achieve a very good trade-off between the compression ratio and the (de)compression speed.
Semi-Adaptive Two-Byte Dense Code is a block-oriented semi-adaptive version of TBDC. It achieves a compression ratio comparable to other dense compressors when performed on standard files. Moreover, when it is performed on text composed of different parts (different languages, different topics, different XML elements), it achieves better results in compression ratio than its competitors. The block orientation of STBDC does not imply an extraordinary overhead in processing time, since STBDC is only up to 1.3 times slower in compression and up to 1.5 times slower in decompression in comparison to SCDC. We have shown some interesting properties of STBDC: (i) Single blocks of the compressed file can be sent together with only the (possibly much smaller) part of the vocabulary corresponding to the given block. Unlike other semi-adaptive compression methods, STBDC can easily determine the necessary part of the vocabulary and send it together with the given block. Obviously, this property can substantially save limited bandwidth when only a part of a file is requested by the client. (ii) The files compressed by STBDC can later be extended without previous decompression. (iii) STBDC is suitable for searching on the compressed text and it can use its own vocabulary as a built-in block index, which significantly accelerates some types of searches.
Set-of-Files Semi-Adaptive Two-Byte Dense Code is a compression method optimized for the compression of a set of files, and its possible application lies in the field of search engines. SF-STBDC achieves an extraordinary compression ratio thanks to the usage of the global vocabulary and, at the same time, it provides the standard properties that are necessary in this field: (i) SF-STBDC allows searching on the compressed text. (ii) SF-STBDC enables random decompression of an arbitrary part of the compressed text. These two properties are necessary for searching a set of keywords on the compressed text and for basic operations of search engines, such as positional ranking and snippet extraction. The main contribution of SF-STBDC is its trade-off between the compression ratio and snippet speed and its trade-off between the compression ratio and decompression speed. This equilibrium makes SF-STBDC an interesting alternative for the compression of the text files used by search engines.
5.3 Future Work
In our future work, we want to focus on the following topics:
• Generally, we would like to involve more linguistics in our compression methods. For example, many dependencies in a sentence arise at the level of morphology. Exploiting these dependencies in a preprocessing step of compression could lower the entropy of the compressed natural language text.
• In terms of STBDC, we want to address the problem of a smart determination of the end of a block in a large compressed file. The resulting compression method could then be called block-wise instead of block-oriented. Furthermore, we want to focus on simple updatability of the compressed text without the need for previous decompression.
• In terms of SF-STBDC, we would like to develop a simple search engine using SF-STBDC compressed files in combination with a non-positional inverted index. Next, we would like to give a fair comparison of a solution equipped with a positional inverted index with our solution using SF-STBDC and a non-positional inverted index. This could help answer the question posed by the authors of [4]: to index or not to index web textual content using a positional inverted index?
Bibliography
[1] A. Moffat, R. M. Neal, and I. H. Witten. Arithmetic coding revisited. ACM Transactions on Information Systems, 16:256–294, July 1998.
[2] J. Adiego, M.A. Martinez-Prieto, and P. Fuente. High performance word-codeword
mapping algorithm on ppm. In Proceedings of the Data Compression Conference (DCC
2009), pages 23–32. IEEE Computer Society, 2009.
[3] Joaquin Adiego and Pablo de la Fuente. On the use of words as source alphabet
symbols in ppm. In Proceedings of the Data Compression Conference (DCC 2006),
page 435, Washington, DC, USA, 2006. IEEE Computer Society.
[4] Diego Arroyuelo, Senén González, Mauricio Marin, Mauricio Oyarzún, and Torsten
Suel. To index or not to index: time-space trade-offs in search engines with positional
ranking functions. In Proceedings of the 35th international ACM SIGIR conference on
Research and development in information retrieval, SIGIR ’12, pages 255–264, New
York, NY, USA, 2012. ACM.
[5] T.C. Bell, J.G. Cleary, and I.H. Witten. Text compression. Prentice Hall advanced
reference series: Computer science. Prentice Hall, 1990.
[6] Jon Louis Bentley, Daniel D. Sleator, Robert E. Tarjan, and Victor K. Wei. A locally
adaptive data compression scheme. Commun. ACM, 29(4):320–330, April 1986.
[7] M.T. Biskup. Guaranteed synchronization of huffman codes. In Proceedings of the
Data Compression Conference (DCC 2008), pages 462–471. IEEE Computer Society,
2008.
[8] M.T. Biskup and W. Plandowski. Guaranteed synchronization of huffman codes with
known position of decoder. In Proceedings of the Data Compression Conference (DCC
2009), pages 33–42. IEEE Computer Society, 2009.
[9] S. Bottcher, A. Bultmann, and R. Hartel. Search and modification in compressed
texts. In Proceedings of the Data Compression Conference (DCC 2011), pages 403–
412. IEEE Computer Society, 2011.
[10] N. Brisaboa. An efficient compression code for text databases. In 25th European
Conference on IR Research (ECIR 2003), pages 468–481, 2003.
[11] N. Brisaboa, A. Fariña, J. López, G. Navarro, and E. López. A new searchable
variable-to-variable compressor. In Proceedings of the Data Compression Conference
(DCC 2010), pages 199–208. IEEE Computer Society, 2010.
[12] N. Brisaboa, A. Fariña, G. Navarro, and J. Paramá. Lightweight natural language
text compression. Information Retrieval, 10:1–33, 2007.
[13] N. Brisaboa, A. Fariña, G. Navarro, and J.L. Paramá. Efficiently decodable and searchable natural language adaptive compression. In Proc. 28th Annual International ACM
SIGIR Conference on Research and Development in Information Retrieval (SIGIR),
pages 234–241. ACM Press, 2005.
[14] N. Brisaboa, A. Fariña, G. Navarro, and J.L. Paramá. New adaptive compressors for
natural language text. Software – Practice and Experience, 38(13):1429–1450, 2008.
[15] N. Brisaboa, A. Fariña, G. Navarro, and José Paramá. Simple, fast, and efficient
natural language adaptive compression. In Proc. 11th International Symposium on
String Processing and Information Retrieval (SPIRE), LNCS 3246, pages 230–241.
Springer, 2004.
[16] Nieves R. Brisaboa, Antonio Fariña, Gonzalo Navarro, and María F. Esteller. (s,c)-dense coding: An optimized compression code for natural language text databases. In
String Processing and Information Retrieval, volume 2857 of Lecture Notes in Computer Science, pages 122–136. Springer Berlin / Heidelberg, 2003.
[17] M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm.
Technical report, Digital SRC Research Report, 1994.
[18] John G. Cleary and Ian H. Witten. Data compression using adaptive coding and
partial string matching. IEEE Transactions on Communications, 32:396–402, 1984.
[19] E. de Moura. Fast and flexible word searching on compressed text. ACM Transactions on Information Systems, pages 113–139, 2000.
[20] Edleno Silva de Moura, Gonzalo Navarro, Nivio Ziviani, and Ricardo Baeza-Yates.
Fast searching on compressed text allowing errors. In Proceedings of the 21st annual
international ACM SIGIR conference on Research and development in information
retrieval, SIGIR ’98, pages 298–306, 1998.
[21] P. Elias. Universal codeword sets and representations of the integers. Information
Theory, IEEE Transactions on, 21(2):194–203, 1975.
[22] R.M. Fano. The transmission of information. Technical Report No. 65, Research Laboratory of Electronics at MIT, Cambridge, MA, USA, 1949.
[23] Antonio Fariña. New Compression Codes for Text Databases. PhD thesis, Database
Laboratory, University of La Coruña, Spain, 2005.
[24] Antonio Fariña. Family of dense compressors. http://vios.dc.fi.udc.es/codes,
2011.
[25] Antonio Fariña, Nieves R. Brisaboa, Cristina Parı́s, and José R. Paramá. Fast and
flexible compression for web search engines. Electronic Notes in Theoretical Computer
Science, 142:129–141, January 2006.
[26] Antonio Fariña, Gonzalo Navarro, and José R. Paramá. Word-based statistical compressors as natural language compression boosters. In Proceedings of the Data Compression Conference (DCC 2008), DCC ’08, pages 162–171, Washington, DC, USA,
2008. IEEE Computer Society.
[27] Antonio Fariña, Gonzalo Navarro, and José R. Paramá. Boosting text compression
with word-based statistical encoding. Computer Journal, 55(1):111–131, January
2012.
[28] T. Ferguson and J. Rabinowitz. Self-synchronizing huffman codes (corresp.). Information Theory, IEEE Transactions on, 30(4):687–693, 1984.
[29] Paolo Ferragina and Giovanni Manzini. On compressing the textual web. In Proceedings of the third ACM international conference on Web search and data mining,
WSDM ’10, pages 391–400, New York, NY, USA, 2010. ACM.
[30] L.R. Ford, D.R. Fulkerson, and R.G. Bland. Flows in Networks. Princeton Landmarks
in Mathematics. Princeton University Press, 2010.
[31] Marcus Geelnard. Basic compression library. http://bcl.comli.eu/, 2006. [Online;
accessed 19-December-2010].
[32] S.W. Golomb. Run length encodings. Information Theory, IEEE Transactions on,
12(3):399–401, 1966.
[33] Rodrigo González, Szymon Grabowski, Veli Mäkinen, and Gonzalo Navarro. Practical
implementation of rank and select queries. In Poster Proceedings Volume of the 4th Workshop on Efficient and Experimental Algorithms (WEA 2005), Greece, pages 27–38,
2005.
[34] H. S. Heaps. Information retrieval: Computational and theoretical aspects. Academic
Press, 1978.
[35] C. A. R. Hoare. Algorithm 64: Quicksort. Communications of the ACM, 4(7):321,
July 1961.
[36] R. Nigel Horspool. Practical fast searching in strings. Software – Practice and Experience, 10:501–506, 1980.
[37] R. Nigel Horspool and Gordon V. Cormack. Constructing word-based text compression algorithms. In Proceedings of the Data Compression Conference (DCC 1992),
pages 62–71. IEEE Computer Society, 1992.
[38] D.A. Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40(9):1098–1101, 1952.
[39] J. S. Culpepper and A. Moffat. Enhanced byte codes with restricted prefix properties. In
SPIRE 2005, pages 1–12, 2005.
[40] Guy Joseph Jacobson. Succinct Static Data Structures. PhD thesis, Pittsburgh, PA,
USA, 1988. AAI8918056.
[41] S.T. Klein and D. Shapira. Pattern matching in huffman encoded texts. In Proceedings
of the Data Compression Conference (DCC 2001), pages 449–458. IEEE Computer
Society, 2001.
[42] S.T. Klein and D. Shapira. The string-to-dictionary matching problem. In Proceedings
of the Data Compression Conference (DCC 2011), pages 143–152. IEEE Computer
Society, 2011.
[43] Sebastián Kreft and Gonzalo Navarro. LZ77-like Compression with Fast Random
Access. In Proceedings of the Data Compression Conference (DCC 2010). IEEE Computer Society, 2010.
[44] S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical
Statistics, 22:49–86, 1951.
[45] N. Jesper Larsson and Alistair Moffat. Offline dictionary-based compression. In Proceedings of the Data Compression Conference (DCC 1999), pages 296–305. IEEE Computer Society, 1999.
[46] Jean-loup Gailly and Mark Adler. GNU zip compression utility. http://www.gnu.
org/software/gzip/, 2013. [Online; accessed 17-November-2013].
[47] M. Charikar, E. Lehman, D. Liu, et al. The smallest grammar problem. IEEE Transactions
on Information Theory, 51(7):2554–2576, 2005.
[48] Udi Manber. A text compression scheme that allows fast searching directly in the
compressed file. ACM Transactions on Information Systems, 15:124–136, April 1997.
[49] Udi Manber and Gene Myers. Suffix arrays: a new method for on-line string searches.
In Proceedings of the first annual ACM-SIAM symposium on Discrete algorithms,
SODA ’90, pages 319–327, Philadelphia, PA, USA, 1990. Society for Industrial and
Applied Mathematics.
[50] Udi Manber and Sun Wu. Glimpse: A tool to search through entire file systems.
In Proceedings of the USENIX Winter 1994 Technical Conference, WTEC’94, pages
23–32, Berkeley, CA, USA, 1994. USENIX Association.
[51] Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural
Language Processing. MIT Press, Cambridge, MA, USA, 1999.
[52] A. Moffat. Word-based text compression. Software – Practice and Experience, 19(2):185–198, 1989.
[53] Alistair Moffat. The arithmetic coding page. http://www.cs.mu.oz.au/~alistair/
arith_coder/, 1999. [Online; accessed 19-December-2010].
[54] Gonzalo Navarro, Edleno Silva de Moura, Marden Neubert, Nivio Ziviani, and Ricardo
Baeza-Yates. Adding compression to block addressing inverted indexes. Information
Retrieval, 3:49–77, 2000.
[55] Gonzalo Navarro and Veli Mäkinen. Compressed full-text indexes. ACM Computing
Surveys, 39(1), April 2007.
[56] Craig G. Nevill-Manning, Ian H. Witten, and David L. Maulsby. Compression by induction of hierarchical grammars. In Proceedings of the Data Compression Conference
(DCC 1994), pages 244–253. IEEE Computer Society, 1994.
[57] I. Pavlov. 7-zip. http://www.7-zip.org/, 2012. [Online; accessed 9-October-2012].
[58] Petr Procházka and Jan Holub. New word-based adaptive dense compressors. In
Proc. International Workshop on Combinatorial Algorithms (IWOCA 2009), pages
420–431, 2009.
[59] Petr Procházka and Jan Holub. Block-oriented dense compressor. In Proceedings of
the Data Compression Conference (DCC 2011), page 472. IEEE Computer Society,
2011.
[60] Petr Procházka and Jan Holub. Natural language compression per blocks. In First International Conference on Data Compression, Communications and Processing (CCP
2011), pages 67–75, 2011.
[61] Petr Procházka and Jan Holub. Natural language compression optimized for large set
of files. In Proceedings of the Data Compression Conference (DCC 2013), page 514.
IEEE Computer Society, 2013.
[62] Petr Procházka and Jan Holub. ODC: Frame for definition of dense codes. European
Journal of Combinatorics, 34(1):52–68, 2013.
[63] J. Rissanen and G. G. Langdon, Jr. Universal modeling and coding. Information
Theory, IEEE Transactions on, 27(1):12–23, 1981.
[64] Falk Scholer, Hugh E. Williams, John Yiannis, and Justin Zobel. Compression of
inverted indexes for fast query evaluation. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information
Retrieval, SIGIR ’02, pages 222–229. ACM, 2002.
[65] David A. Scott. Vitter adaptive compression. http://bijective.dogma.net, 2002.
[Online; accessed 19-December-2010].
[66] Julian Seward. bzip2. http://bzip.org/, 2013. [Online; accessed 17-November-2013].
[67] Claude E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:623–656, 1948.
[68] Claude E. Shannon. Prediction and entropy of printed English. Bell System Technical Journal, 30(1):50–64, 1951.
[69] D. Shkarin. Ppm: one step to practicality. In Proceedings of the Data Compression
Conference (DCC 2002), pages 202–211. IEEE Computer Society, 2002.
[70] Przemyslaw Skibinski, Szymon Grabowski, and Sebastian Deorowicz. Revisiting
dictionary-based compression. Software – Practice and Experience, 35(15):1455–1476,
2005.
[71] A. Turpin and A. Moffat. Fast file search using text compression. In Proceedings of the 20th
Australasian Computer Science Conference, pages 1–8, 1997.
[72] Jeffrey Scott Vitter. Algorithm 673: Dynamic huffman coding. ACM Transactions on
Mathematical Software, 15:158–167, 1989.
[73] Hugh E. Williams and Justin Zobel. Compressing integers for fast file access. The
Computer Journal, 42:193–201, 1999.
[74] Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing gigabytes (2nd ed.):
compressing and indexing documents and images. Morgan Kaufmann Publishers Inc.,
San Francisco, CA, USA, 1999.
[75] Ian H. Witten, Radford M. Neal, and John G. Cleary. Arithmetic coding for data
compression. Commun. ACM, 30(6):520–540, 1987.
[76] Hao Yan, Shuai Ding, and Torsten Suel. Inverted index compression and query processing with optimized document ordering. In Proceedings of the 18th international
conference on World wide web, WWW ’09, pages 401–410, New York, NY, USA, 2009.
ACM.
[77] G. K. Zipf. Human behaviour and the principle of least effort. Addison-Wesley, 1949.
[78] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. Information Theory, IEEE Transactions on, 23(3):337–343, 1977.
[79] J. Ziv and A. Lempel. Compression of individual sequences via variable-rate coding.
Information Theory, IEEE Transactions on, 24(5):530–536, 1978.
Publications of the Author
[A.1] Petr Procházka, Jan Holub. ODC: Frame for definition of Dense codes. European
Journal of Combinatorics (EJC), 34(1):52–68, 2013.
[A.2] Petr Procházka, Jan Holub. Compressing Similar Biological Sequences using FM-index. Data Compression Conference 2014: 312–321, IEEE Computer Society
Press, Snowbird, UT, USA, 2014.
[A.3] Petr Procházka, Jan Holub. Natural Language Compression Optimized for Large Set
of Files. Data Compression Conference 2013: 514, IEEE Computer Society Press,
Snowbird, UT, USA, 2013.
[A.4] Petr Procházka, Jan Holub. Natural Language Compression per Blocks. First International Conference on Data Compression, Communications and Processing, CCP
2011: 67–75, Palinuro, Cilento Coast, Italy, 2011.
[A.5] Petr Procházka, Jan Holub. Block-Oriented Dense Compressor. Data Compression
Conference 2011: 472, IEEE Computer Society Press, Snowbird, UT, USA, 2011.
[A.6] Jakub Jaroš, Petr Procházka, Jan Holub. Natural Language and Formal Language
Data Compression. Czech Technical University Workshop 2011, Prague, Czech
Republic, 2011.
[A.7] Petr Procházka, Jan Holub. New Word-Based Adaptive Dense Compressors. IWOCA
2009: 420–431, Hradec nad Moravicí, CZ, 2009.
The paper has been cited in:
• S. Grabowski, W. Bieniecki. Tight and Simple Web Graph Compression for
Forward and Reverse Neighbor Queries, Discrete Applied Mathematics 163, Elsevier Science Publishers B. V., 298–306, January 2014.
• S. Grabowski, J. Swacha. Language-Independent Word-based Text compression
with Fast Decompression, Perspective Technologies and Methods in MEMS Design (MEMSTECH), 2010 Proceedings of VIth International Conference on.
IEEE, 2010.
• S. Grabowski, W. Bieniecki. Tight and Simple Web Graph Compression, arXiv
preprint arXiv:1006.0809 (2010).
[A.8] Jakub Jaroš, Petr Procházka, Jan Holub. On Implementation of Word-Based Compression Methods. MEMICS 2008 - 4th Doctoral Workshop on Mathematical and
Engineering Methods in Computer Science, Znojmo, Czech Republic, 2008.