Syllable-based Burrows-Wheeler Transform
Jan Lánský, Katsiaryna Chernik, and Zuzana Vlčková
Charles University, Faculty of Mathematics and Physics
Malostranské nám. 25, 118 00 Praha 1, Czech Republic
XBW Algorithm
1: Replacement of tag names
2: Division into words or syllables
3: Dictionary encoding
4: Burrows-Wheeler transformation (BWT)
5: Move to Front transformation (MTF)
6: Run Length Encoding of null sequences (RLE)
7: Canonical Huffman
Burrows-Wheeler Transform (BWT)
•a transformation that reorders the input string into a form better suited to a subsequent compression step
•the permuted string is usually processed by the Move-to-Front transform and then by Huffman coding
•the original method from 1994 was designed for an alphabet of letters
•2001: versions working with alphabets of words and n-grams were presented
•the newest version works with an alphabet of syllables
•our goal: compare BWT compression over alphabets of letters, syllables, words, 3-grams and 5-grams
Equality of names
•the XBW method was not named very appropriately
•it can easily be confused with the name xbw, used by the authors of the paper "Structuring labeled trees for optimal succinctness, and beyond" (Ferragina, P., Luccio, F., Manzini, G., Muthukrishnan, S.) for a transformation of XML into a format more suitable for searching
•in another article these authors renamed their transformation from xbw to XBW
•they used it in compression methods called XBzip and XBzipIndex
1: Replacement of tag names
Input
<note importance="high">money</note>
<note importance="low">your money</note>
Element dictionary
EA  End-of-Tag
A1  note
Attribute dictionary
E1  importance
E2  empty
Output
A1 E1 high EA money EA A1 E1 low EA your money EA
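The replacement shown above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `replace_tags` is hypothetical, the two dictionaries are passed in ready-made rather than built while reading the document, and a regular expression stands in for a real XML parser (self-closing tags, comments and entities are not handled).

```python
import re

def replace_tags(xml, elements, attributes, end_of_tag="EA"):
    """Replace element and attribute names by dictionary codes (toy sketch)."""
    out = []
    # Each match is either a tag (group 1) or a run of text (group 2).
    for tag, text in re.findall(r"(<[^>]+>)|([^<]+)", xml):
        if text:
            if text.strip():              # drop whitespace between elements
                out.append(text.strip())
        elif tag.startswith("</"):
            out.append(end_of_tag)        # closing tag
        else:
            name = re.match(r"<\s*([\w:-]+)", tag).group(1)
            out.append(elements[name])
            for attr, value in re.findall(r'([\w:-]+)\s*=\s*"([^"]*)"', tag):
                out.append(attributes[attr])
                out.append(value)
            out.append(end_of_tag)        # end of the opening tag
    return " ".join(out)

xml = '<note importance="high">money</note>\n<note importance="low">your money</note>'
print(replace_tags(xml, {"note": "A1"}, {"importance": "E1"}))
```

With the dictionaries from the slide, the call prints the output sequence given above.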
2: Division into words or syllables
Input
A1 E1 high EA money EA A1 E1 low EA your money EA
Syllables dictionary (PUML)
01 high
02 mo
03 ney
04 low
05 your
06 " " (space)
Output
A1 E1 01 EA 02 03 EA A1 E1 04 EA 05 06 02 03 EA
Syllables and their semantic
1: find blocks of vowels (max length of 3 letters)
2: attach the consonants preceding the 1st block to that block
3: attach the consonants following the last block to that block
4: use one of the strategies below to assign the remaining consonants
Example (the Czech word "odstrčenou")
Algorithm  Syllables
PUL        odst-rč-en-ou
PUR        o-dstr-če-nou
PUML       ods-tr-če-nou
PUMR       od-str-če-nou
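The four strategies can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function name `split_syllables` and the `vowels` parameter are hypothetical, the limit of 3 letters per vowel block is not enforced, and two rules are inferred from the example alone, namely that a lone consonant between two blocks goes to the right under PUML, and that the "r" in this word behaves as a vowel block.

```python
def split_syllables(word, strategy, vowels="aeiouy"):
    """Split a word into syllables with PUL, PUR, PUML or PUMR (sketch)."""
    # Step 1: find maximal blocks of vowels as half-open (start, end) ranges.
    blocks, i = [], 0
    while i < len(word):
        if word[i] in vowels:
            j = i
            while j < len(word) and word[j] in vowels:
                j += 1
            blocks.append((i, j))
            i = j
        else:
            i += 1
    if not blocks:
        return [word]
    # Steps 2-4: leading and trailing consonants stay with the first and
    # last block; consonants between two blocks are divided by the strategy.
    cuts = [0]
    for (_, left_end), (right_start, _) in zip(blocks, blocks[1:]):
        gap = right_start - left_end
        if strategy == "PUL":        # all consonants to the left block
            to_left = gap
        elif strategy == "PUR":      # all consonants to the right block
            to_left = 0
        elif strategy == "PUML":     # split in the middle, odd one to the left
            to_left = 0 if gap == 1 else (gap + 1) // 2   # inferred exception
        elif strategy == "PUMR":     # split in the middle, odd one to the right
            to_left = gap // 2
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        cuts.append(left_end + to_left)
    cuts.append(len(word))
    return [word[a:b] for a, b in zip(cuts, cuts[1:])]

for name in ("PUL", "PUR", "PUML", "PUMR"):
    print(name, "-".join(split_syllables("odstrčenou", name, vowels="aeiouyr")))
```

Called with `vowels="aeiouyr"`, the loop reproduces the four decompositions of the example word.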
Examples of string parsing into words, syllables, letters, 3-grams and 5-grams
parsing type  parsed text elements
original      "consists of"
letters       "c", "o", "n", "s", "i", "s", "t", "s", " ", "o", "f"
3-grams       "con", "sis", "ts ", "of"
5-grams       "consi", "sts o", "f"
syllables     "con", "sists", " ", "of"
words         "consists", " ", "of"
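The letter, n-gram and word rows of the example can be reproduced with a short Python sketch. The function names are hypothetical, and the word rule used here (alternating runs of word and non-word characters) is an assumption that happens to match the example; the authors' definition of a word may differ. Syllables are left out because they need language-specific rules.

```python
import re

def parse_ngrams(text, n):
    """Cut the text into consecutive, non-overlapping blocks of n characters."""
    return [text[i:i + n] for i in range(0, len(text), n)]

def parse_words(text):
    """Alternate runs of word characters and runs of separators."""
    return re.findall(r"\w+|\W+", text)

text = "consists of"
print(list(text))             # letters
print(parse_ngrams(text, 3))  # 3-grams
print(parse_ngrams(text, 5))  # 5-grams
print(parse_words(text))      # words
```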
3: Dictionary encoding
[Figure: Encoding of the dictionary of used elements]
Procedure EncodeNode
EncodeNode3(node) {
    /* encode the number of sons */
    output->WriteGamma0(count + 1);
    if (count == 0) return;
    /* is the node a string? */
    if (represents)
        output->WriteBit(1);
    else
        output->WriteBit(0);
    if (IsKnownTypeOfSons)
        previous = first(TypeOfSymbol(This->Symbol));
    else
        previous = 0;
    /* iterate and encode all sons of this node */
    for (i = 0; i < count; i++) {
        /* count and encode the distances between sons */
        distance = ord(son[i]->extension) - previous;
        output->WriteDelta0(distance + 1);
        /* call the procedure on the given son */
        EncodeNode3(son[i]);
        previous = ord(son[i]->extension);
    }
}
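The slides do not define WriteGamma0 and WriteDelta0. If they are read as Elias gamma and Elias delta codes for positive integers (an assumption, including whatever the "0" suffix means), the bit strings they write could be sketched in Python as below. The function names are hypothetical and the codes are returned as strings of "0" and "1" for readability.

```python
def elias_gamma(n):
    """Elias gamma code: (len - 1) zeros, then n in binary. Assumes n >= 1."""
    if n < 1:
        raise ValueError("Elias codes are defined for positive integers")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary

def elias_delta(n):
    """Elias delta code: gamma code of the bit length, then n without its leading 1."""
    if n < 1:
        raise ValueError("Elias codes are defined for positive integers")
    binary = bin(n)[2:]
    return elias_gamma(len(binary)) + binary[1:]

print(elias_gamma(5))    # 00101
print(elias_delta(10))   # 00100010
```

Both codes give short code words to small numbers, which would fit the procedure above: son counts and distances between neighbouring sons tend to be small, and adding 1 keeps the encoded value positive.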
4: Burrows-Wheeler transformation
01 02 03 EA A1 E1 04 05 06 02 03 EA A1 E1
02 03 EA A1 E1 01 02 03 EA A1 E1 04 05 06
02 03 EA A1 E1 04 05 06 02 03 EA A1 E1 01
03 EA A1 E1 01 02 03 EA A1 E1 04 05 06 02
03 EA A1 E1 04 05 06 02 03 EA A1 E1 01 02
04 05 06 02 03 EA A1 E1 01 02 03 EA A1 E1
05 06 02 03 EA A1 E1 01 02 03 EA A1 E1 04
06 02 03 EA A1 E1 01 02 03 EA A1 E1 04 05
A1 E1 01 02 03 EA A1 E1 04 05 06 02 03 EA
A1 E1 04 05 06 02 03 EA A1 E1 01 02 03 EA
E1 01 02 03 EA A1 E1 04 05 06 02 03 EA A1
E1 04 05 06 02 03 EA A1 E1 01 02 03 EA A1
EA A1 E1 01 02 03 EA A1 E1 04 05 06 02 03
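EA A1 E1 04 05 06 02 03 EA A1 E1 01 02 03
Output
8 / E1 06 01 02 02 E1 04 05 EA EA A1 A1 03 03
(8 is the zero-based position of the original sequence among the sorted rotations; the symbols after the slash are the last column)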
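The sorted rotations above can be reproduced with a naive Python sketch. It assumes that symbols are compared as plain strings, so that digits sort before letters, which matches the row order shown. The function name `bwt` is hypothetical, and sorting all rotations explicitly is only suitable for illustration; practical implementations use suffix sorting.

```python
def bwt(symbols):
    """Return (index of the original rotation, last column) for a symbol list."""
    n = len(symbols)
    # Sort the starting positions of all rotations by the rotated sequence.
    order = sorted(range(n), key=lambda i: symbols[i:] + symbols[:i])
    # The last symbol of the rotation starting at i is symbols[i - 1].
    last_column = [symbols[i - 1] for i in order]
    return order.index(0), last_column

sequence = "A1 E1 01 02 03 EA A1 E1 04 05 06 02 03 EA".split()
index, last = bwt(sequence)
print(index, "/", " ".join(last))
```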
5: Move to Front transformation
•the output stream of the BWT is passed through a further transformation step, MTF
•this step translates the sequence of symbols into a sequence of numbers
•given a numbered list of the alphabet elements, MTF reads the input elements one by one and writes each element's current position in the list
•after an element is processed, it is moved to the front of the list
5: Move to Front transformation
Input
Initial list: 01 02 03 04 05 06 A1 E1 EA
Stream: E1 06 01 02 02 E1 04 05 EA EA A1 A1 03 03
Output
Final list: 03 A1 EA 05 04 E1 02 01 06
Stream: 76230356808080
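A minimal Python sketch of the Move to Front step, using zero-based list positions as in the example. The function name `move_to_front` is hypothetical, and the linear search in a Python list is chosen for clarity rather than speed.

```python
def move_to_front(stream, alphabet):
    """Return the list positions written by MTF and the final state of the list."""
    table = list(alphabet)
    positions = []
    for symbol in stream:
        position = table.index(symbol)          # current position of the symbol
        positions.append(position)
        table.insert(0, table.pop(position))    # move the symbol to the front
    return positions, table

alphabet = "01 02 03 04 05 06 A1 E1 EA".split()
stream = "E1 06 01 02 02 E1 04 05 EA EA A1 A1 03 03".split()
positions, table = move_to_front(stream, alphabet)
print("".join(map(str, positions)))
print(" ".join(table))
```

Repeated symbols, which the BWT tends to place next to each other, come out as zeros; this is what the following RLE step exploits.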
6: Run Length Encoding of null sequences
•the MTF step may generate long runs of zeros (null sequences)
•where such runs occur, the RLE step shrinks them, replacing each run with a special symbol that represents a null sequence of a given length
•the final output is a stream of numbers and special symbols
6: Run Length Encoding of null sequences
Input
02 01 00 00 00 03 00 07 00 00
Output
02 01 N3 03 00 07 N2
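The example can be reproduced with the Python sketch below. The function name `rle_zero_runs` and the `min_run` parameter are hypothetical; the threshold of two zeros is inferred from the example, where a single zero is left unchanged, and the special symbol is written as the string "N" followed by the run length.

```python
from itertools import groupby

def rle_zero_runs(values, min_run=2):
    """Replace each run of at least min_run zeros by one symbol 'N<length>'."""
    out = []
    for value, group in groupby(values):
        run = len(list(group))
        if value == 0 and run >= min_run:
            out.append(f"N{run}")        # special symbol for a null sequence
        else:
            out.extend([value] * run)    # everything else is copied unchanged
    return out

print(rle_zero_runs([2, 1, 0, 0, 0, 3, 0, 7, 0, 0]))
```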
7: Canonical Huffman
Input
76230356808080
Output
1111 011 110 010 00 010 1110
011 10 00 10 00 10 00
[Figure: Huffman tree for the input, root weight 14]
After the RLE step the stream is encoded by a canonical Huffman code. Huffman coding assigns shorter codes to more frequent symbols than to less frequent ones.
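A hedged Python sketch of this step: code lengths come from the standard Huffman construction, and code words are then assigned in the usual canonical order (by length, then by symbol). The function names are hypothetical. Because ties between equal frequencies can be broken in several ways, the individual code words need not match those in the example; the total encoded length, 37 bits for this input, is the same for every Huffman code.

```python
import heapq
from collections import Counter

def huffman_code_lengths(stream):
    """Return {symbol: code length} from the standard Huffman construction."""
    freq = Counter(stream)
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    # Heap items: (weight, tie-breaker, symbols below this node).
    heap = [(w, i, [s]) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freq, 0)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for symbol in s1 + s2:           # every merge adds one bit below it
            lengths[symbol] += 1
        heapq.heappush(heap, (w1 + w2, tie, s1 + s2))
        tie += 1
    return lengths

def canonical_codes(lengths):
    """Assign canonical code words: ordered by (length, symbol), counting up."""
    codes, code, previous = {}, 0, 0
    for symbol, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= length - previous
        codes[symbol] = format(code, f"0{length}b")
        code += 1
        previous = length
    return codes

stream = [int(c) for c in "76230356808080"]
codes = canonical_codes(huffman_code_lengths(stream))
encoded = "".join(codes[s] for s in stream)
print(codes, len(encoded))
```

The point of the canonical form is that only the code lengths need to be stored with the compressed data; the decoder can rebuild the code words from them.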
Experiments
File  original  gzip  bzip  XMLPPM  XBWW  XBWS
1     14664     3062  2381  2153    1445  1432
2     14968     3072  2393  2141    1384  1379
3     14914     3135  2443  2197    1407  1400
4     14265     3018  2363  2122    1357  1352
5     14543     3096  2409  2165    1408  1399
6     14512     3070  2392  2138    1387  1377
7     13989     3002  2336  2089    1335  1325
8     13948     3020  2358  2113    1381  1373
9     13869     2948  2296  2053    1320  1312
10    14153     3012  2355  2107    1363  1357
mean  14383     3044  2372  2128    1379  1371
Table: Results of compression (10 files, 1000 English pages each). Sizes in kB; XBWW = XBW over words, XBWS = XBW over syllables.
Lang.  Method     100 B-1 kB  1-10 kB  10-50 kB  50-200 kB  200-500 kB  500 kB-2 MB
CZ     Symbols    5.715       4.346    3.512     3.200      2.998       2.846
CZ     Syllables  6.712       4.996    3.765     3.280      3.003       2.825
CZ     Words      7.751       6.111    4.629     3.871      3.476       3.149
CZ     3-grams    8.539       6.432    4.629     3.851      3.463       3.166
CZ     5-grams    10.104      8.415    6.566     5.448      4.796       4.265
EN     Symbols    5.042       3.018    2.552     2.647      2.513       2.336
EN     Syllables  5.974       3.267    2.647     2.685      2.486       2.282
EN     Words      6.323       3.651    2.969     2.944      2.668       2.382
EN     3-grams    7.740       4.571    3.421     3.148      2.823       2.530
EN     5-grams    9.358       6.293    4.877     4.367      3.769       3.246
GE     Symbols    4.545       3.853    2.914     2.629      2.491       2.323
GE     Syllables  5.591       4.671    3.201     2.724      2.505       2.295
GE     Words      6.343       5.491    3.679     3.119      2.865       2.545
GE     3-grams    6.813       5.583    3.760     3.117      2.820       2.525
GE     5-grams    8.545       7.429    5.324     4.320      3.744       3.237

File size 2 MB-5 MB (values for only two of the three languages, in the order given)
Symbols    2.066  2.416
Syllables  1.996  2.354
Words      2.014  2.608
3-grams    2.136  2.519
5-grams    2.530  3.004
Table: Comparison of different input parsing strategies for the Burrows-Wheeler transform, by file size. Values are in bits per character.
Method           Compressed  Ratio
none             14 383 kB   100.00%
gzip             3 044 kB    21.16%
bzip2            2 372 kB    16.50%
XMLPPM           2 128 kB    14.80%
XBW (words)      1 379 kB    9.59%
XBW (syllables)  1 371 kB    9.53%
Table: Results of compression (10 files, 1000 English pages each)
Conclusion
•letter-based, syllable-based and word-based compression
•character-based compression is most suitable for small files
(up to 200 kB)
•syllable-based compression is most suitable for files of size
200 kB - 5 MB
•compression over natural text units (words or syllables) is 10-30% better than compression over 3-grams and 5-grams
Bibliography
Lansky, J., Zemlicka, M.:
Compression of a Dictionary.
In: Snasel, V., Richta, K., Pokorny, J.: Proceedings of the Dateso 2006 Annual International Workshop on Databases, Texts, Specifications and Objects. CEUR-WS, Vol. 176 (2006) 11-20
Lansky, J., Zemlicka, M.:
Text Compression: Syllables.
In: Richta, K., Snasel, V., Pokorny, J.: Proceedings of the Dateso 2005 Annual International Workshop on Databases, Texts, Specifications and Objects. CEUR-WS, Vol. 129 (2005) 32-45
Galambos, L., Lansky, J., Chernik, K.:
Compression of Semistructured Documents.
In: International Enformatika Conference IEC 2006, Enformatika, Transactions on Engineering, Computing and Technology, Vol. 14 (2006) 222-227
Burrows, M., Wheeler, D. J.:
A Block Sorting Lossless Data Compression Algorithm.
Technical report, Digital Equipment Corporation, Palo Alto, CA, USA (1994)
Isal, R.Y.K., Moffat, A.:
Parsing Strategies for BWT Compression.
In: Data Compression Conference, IEEE CS Press, Los Alamitos, CA, USA (2001) 429-438