Information and Coding - Marek Rychlik

Definition of coding
Compression of plain (ASCII) text
Shannon lower bound
Information and Coding
Part I: Data Compression
Marek Rychlik
Department of Mathematics
University of Arizona
February 25, 2016
What is coding — informally?
Definition (Coding)
Coding is the algorithmic rewriting of data so as to achieve desirable objectives.
Three main types of objectives
1. Reducing the cost of storage and the transmission time required (data compression);
2. Preventing data loss due to random causes, such as defects in physical media (error-correcting codes);
3. Achieving privacy of communications (data encryption).
Without data compression we would not have
1. high-definition digital streaming video;
2. cellular telephony;
3. high-definition digital cinema.
Coriolanus — the play by William Shakespeare
“Before we proceed any further, hear me speak. Speak, speak. You are all resolved rather to die than to famish? Resolved, resolved. First, you know Caius Marcius is chief enemy to the people. We know’t, we know’t. Let us kill him, and we’ll have corn at our own price. Is’t a verdict? No more talking on’t; let it be done: away, away! One word, good citizens... [total of 141,830 characters]”
Figure: W. Shakespeare
An early printed scientific paper
Figure: Nikolaus Kopernikus, 1543, “Polish astronomer”
(Encyclopedia Britannica)
Character encoding
Characters are numbered.
The oldest standard is ASCII (American Standard Code for Information Interchange).
ASCII is pronounced “as key”.
ASCII encodes characters as numbers 0–127; extended ASCII variants use the full byte range 0–255.
The successor to ASCII is Unicode. It aims at encoding all characters in all human alphabets.
Only ASCII is used in the current talk (see the MATLAB example below).
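A quick way to inspect ASCII codes in MATLAB (a minimal illustration using the built-in double and char conversions):

% Characters to ASCII codes and back
double('Before')            % ans = 66  101  102  111  114  101
char([72 101 108 108 111])  % ans = 'Hello'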
Bits and Bytes
The term bit stands for a binary digit; a bit is either 0 or 1.
Bits are the digits of the base-2 (binary) representation of numbers, e.g. 6 in decimal is 110 in binary:

(110)₂ = 1 · 2² + 1 · 2¹ + 0 · 2⁰ = (6)₁₀

The bit is a normalized unit of information; we may think of information as a sequence of answers to Yes/No questions. A fraction of a bit makes sense, statistically speaking (e.g. tossing a biased coin).
A byte is a sequence of 8 bits. A byte can also be thought of as an unsigned integer in the range 0–255, e.g.

(10000100)₂ = 1 · 2⁷ + 1 · 2² = 128 + 4 = (132)₁₀

(See the MATLAB check below.)
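Both conversions can be verified directly in MATLAB (a minimal check using the built-in dec2bin and bin2dec):

dec2bin(6)            % ans = '110'
bin2dec('10000100')   % ans = 132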
Hexadecimal and Octal notation for integers
1 byte = 8 bits = 2 groups of 4 bits; 2⁴ = 16, hence “hexadecimal” = base 16.
16 digits are required for hexadecimal: 0–9 and A, B, C, D, E, F (letters used as digits).

(1010 0100)₂ = (8 + 2) · 16 + 4 = (A4)₁₆ = /xA4

Octal notation uses groups of 3 bits; 8 digits are required: 0–7.

(10 100 100)₂ = (244)₈ = /o244

NOTE: 8 is not divisible by 3, so the leading group of a byte contains only 2 bits; hence the leading octal digit of a byte is at most 3. (See the MATLAB check below.)
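The same conversions in MATLAB (a minimal check; dec2hex and dec2base are built-in functions):

dec2hex(164)        % ans = 'A4'
dec2base(164, 8)    % ans = '244'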
ASCII codes
Extended ASCII codes
Reading some characters of “Coriolanus” in MATLAB
%% File: p1.m
% Read first 10 characters from Coriolanus
f=fopen('TextCompression/coriolan.txt','r');
A=fread(f,10,'uint8');
fclose(f);
A'          % Print in decimal
dec2hex(A)  % Print in hexadecimal
dec2bin(A)  % Print in binary
Output (decimal, hexadecimal, binary)

>> p1

ans =
    66   101   102   111   114   101    32   119   101    32

ans =
42
65
66
6F
72
65
20
77
65
20

ans =
1000010
1100101
1100110
1101111
1110010
1100101
0100000
1110111
1100101
0100000
An important paradigm of data compression
Data Compression = Data Modeling + Source Coding
A data model of “Coriolanus”
Example (Data Modeling)
Our model of “Coriolanus” is a random data model: every character is drawn at random from the alphabet, according to the probability distribution

P = (p_1, p_2, …, p_n)

The numbers p_j are parameters to be estimated.
George Box, a statistician, circa 1979
All models are wrong but some are useful.
Arithmetic coding as a source coding approach
Example (Source Coding)
Our source coding technique is arithmetic coding, to be explained later. “Coriolanus” will be encoded as a single binary fraction with approximately 640,000 binary digits (bits) after the binary point!
The alphabet and “table” of Coriolanus in MATLAB
%% File: p2.m
%% Read all characters of Coriolanus as ASCII codes
f=fopen('TextCompression/coriolan.txt','r');
A=fread(f,inf,'uint8');
fclose(f);
%% Conduct frequency analysis (build "table")
% C is the alphabet: as per documentation
% of unique, IA and IC are index arrays s.t.
%    C = A(IA) and A = C(IC)
[C,IA,IC] = unique(A);
table = histcounts(IC, 1:(length(C)+1));
%% Bar plots of alphabet and table
bar(C); bar(table)
The alphabet and “table” bar plots
Entropy of a probability distribution
Definition (Shannon Entropy)
Let P = (p_1, p_2, …, p_n) be a probability distribution on the numbers 1–n. The number H(P) defined by the formula

H(P) = ∑_{j=1}^{n} (−log₂ p_j) · p_j

is called the Shannon entropy of the distribution P.
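For example, a fair coin, P = (1/2, 1/2), has entropy exactly 1 bit, while a biased coin, P = (0.9, 0.1), has entropy of only about 0.469 bits. A quick MATLAB check:

p = [0.9 0.1];
sum(-p.*log2(p))    % ans = 0.4690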
Figure: C. Shannon
Entropy of the “Coriolanus” distribution
%% File: p3.m
%% Read data, find table
p2;
%% Find entropy
p=table/sum(table);
entropy=sum(-p.*log2(p))

>> p3
entropy =
    4.5502
>>
Meaning of entropy in data compression
Entropy as the mean number of bits per symbol
If the entropy is h = 4.5502 bits, we expect about h bits to be emitted in the compressed message for every symbol of the original message. Since ASCII encoding uses 1 byte (= 8 bits) per symbol, we expect the compression ratio to be

8 / 4.5502 = 1.7582
Calculation of arithmetic code in MATLAB
%% File: p4.m
%% Read text file into array A, compute table, alphabet, index array IC
p2;
%% Use arithmetic coding to arithmetically encode the sequence of indices
code=arithenco(IC,table);
compression_ratio = (8*length(A))/length(code)

>> p4
compression_ratio =
    1.7581
What is arithmetic code?
Arithmetic code
The arithmetic code is a binary code representing the digits of a real number x ∈ [0, 1] uniquely assigned to the original message (Shakespeare’s play “Coriolanus”). The number has approximately

H(P) · N = 645,354

binary digits after the binary point:

x = 0.d_1 d_2 …    where d_j ∈ {0, 1}.
Commentary on arithmetic coding
The idea of arithmetic coding comes from
Shannon-Fano-Elias coding.
The algorithm was implemented in a practical manner by
IBM around 1980.
IBM obtained a patent for arithmetic coding and defended
it vigorously for approximately 20 years. The value of the
algorithm as intellectual property was estimated at tens of
billions of dollars.
Now there are many freely available implementations of
arithmetic coding, including the MATLAB implementation.
The principle of arithmetic coding I
1. Let m = (a_0, a_1, …, a_N) be a message with a_j ∈ {1, 2, …, n}.
2. Let P = (p_1, p_2, …, p_n) be the relative frequency table of the symbols in the message.
3. Every partial message m_K = (a_0, a_1, …, a_K) is assigned a semi-open subinterval I_K of I = [0, 1).
4. m_0 (the empty message) is assigned I_0 = I.
5. If m_{K−1} is assigned the interval I_{K−1}, then I_K ⊆ I_{K−1} and the length of I_K is proportional to p_{a_K} (the probability of the K-th symbol); a sketch of this interval narrowing follows below.
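A minimal MATLAB sketch of the interval narrowing, assuming a toy 3-symbol alphabet (the variables P, m, cumP below are illustrative and not part of the encoder shown later):

% Toy alphabet {1,2,3} with probabilities P and a short message m
P = [0.5 0.3 0.2];
m = [1 3 2 1];
cumP = [0 cumsum(P)];        % cumulative probabilities: 0, 0.5, 0.8, 1.0
lo = 0; hi = 1;              % I_0 = [0,1)
for K = 1:length(m)
    a = m(K);
    w = hi - lo;             % length of I_{K-1}
    hi = lo + w*cumP(a+1);   % right endpoint of I_K
    lo = lo + w*cumP(a);     % left endpoint of I_K
end
fprintf('I_N = [%f, %f), length %f\n', lo, hi, hi-lo)
prod(P(m))                   % the length equals the product of symbol probabilities: 0.015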
The principle of arithmetic coding II
6. The length of the interval I_N is

|I_N| = ∏_{j=0}^{N} p_{a_j},

hence

−log₂ |I_N| = ∑_{j=0}^{N} (−log₂ p_{a_j}) = N · ∑_{l=1}^{n} (−log₂ p_l) · (N_l/N) = N · ∑_{l=1}^{n} (−log₂ p_l) · p_l = N · H(P),

where N_l is the number of times the symbol l appears in the message, N is the length of the message, and p_l = N_l/N (a small numeric check is sketched below).

7. We select a number in I_N with the smallest number of binary digits, approximately N · H(P).
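A small numeric verification of this identity (a toy message; p is its empirical distribution, so the equality is exact):

m = [1 1 1 2 2 3];                  % toy message of N = 6 symbols
N = length(m);
p = histcounts(m, 1:4)/N;           % empirical probabilities p_l = N_l/N
lenIN = prod(p(m));                 % |I_N| = product of the symbol probabilities
[-log2(lenIN)  N*sum(-p.*log2(p))]  % both entries are approx 8.7549 bits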
Useful definitions I
An alphabet is any finite set A.
A message m = a_0 a_1 … a_n (or m = (a_0, a_1, …, a_n)) over the alphabet A is any finite sequence of elements of A. The set of all messages is denoted A⁺. We have the following disjoint union of Cartesian powers of the alphabet:

A⁺ = ⊍_{n=0}^{∞} ⨉_{j=1}^{n} A

A code is a map C : A⁺ → B⁺.
Useful definitions II
A symbol code is a map C : A → B⁺, i.e. a map that maps symbols of one alphabet A to messages (or blocks of symbols) over another alphabet B. A symbol code extends to a code C : A⁺ → B⁺ from all messages over the alphabet A to all messages over the alphabet B, by concatenation (illustrated in the sketch below):

C(a_0 a_1 … a_n) = C(a_0) C(a_1) … C(a_n)

The length of a message is its number of symbols; the function ℓ : A⁺ → {0, 1, …} maps messages to their lengths.
A binary (symbol) code is a symbol code C : A → {0, 1}⁺.
A code C : A⁺ → B⁺ is lossless if it is 1:1 (injective) as a map.
A symbol code C : A → B⁺ is called uniform iff ℓ ∘ C = const.
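A minimal MATLAB sketch of extension by concatenation, assuming a toy alphabet {a, b, c} and an illustrative (made-up) binary symbol code a→0, b→10, c→11:

% Toy binary symbol code on {'a','b','c'} (illustrative only)
C = containers.Map({'a','b','c'}, {'0','10','11'});
msg = 'abac';
encoded = '';
for k = 1:length(msg)
    encoded = [encoded C(msg(k))];   % extension by concatenation
end
encoded                              % ans = '010011'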
Morse code — an old, binary, non-uniform code with
some compression
An Exercise — “Coriolanus” in Morse Code
Problem (Coriolanus to Morse Code)
Write a program which will encode Shakespeare’s “Coriolanus” in Morse code. Compare the output length to that of the original message (141,830 characters). Also, estimate the duration in “units” of the transmission of “Coriolanus”, assuming that the “unit” is 1/4 second. (A partial lookup-table sketch follows the benchmark quote below.)
Speed of Morse Code “Benchmark”
“High-speed telegraphy contests are held; according to the
Guinness Book of Records in June 2005 at the International
Amateur Radio Union’s 6th World Championship in High Speed
Telegraphy in Primorsko, Bulgaria, Andrei Bindasov of Belarus
transmitted 230 morse code marks of mixed text in one minute.”
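A possible starting point for the exercise (not a full solution): store the Morse table in a containers.Map and encode by lookup and concatenation. Only a few letters are shown here for illustration:

% Partial Morse table (a few letters only, for illustration)
morse = containers.Map({'a','b','e','t'}, {'.-','-...','.','-'});
word = 'beat';
out = '';
for k = 1:length(word)
    out = [out morse(word(k)) ' '];  % a space separates the letters
end
out                                  % ans = '-... . .- - '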
The expected length of the code
Definition (Expected length of a binary symbol code)
Given a probability distribution P : A → [0, 1] on the alphabet A, the expected length of a binary symbol code C : A → {0, 1}⁺ is defined as:

E(ℓ ∘ C) = ∑_{a∈A} ℓ(C(a)) · P(a).
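For example, with the illustrative symbol code a→0, b→10, c→11 and the toy distribution P = (0.5, 0.3, 0.2), the expected length is 0.5·1 + 0.3·2 + 0.2·2 = 1.5 bits. In MATLAB:

P   = [0.5 0.3 0.2];   % toy distribution on {a,b,c}
len = [1 2 2];         % code word lengths of the code 0, 10, 11
sum(len.*P)            % expected length: ans = 1.5

For comparison, H(P) ≈ 1.485 bits here, consistent with the lower bound on the next slide.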
Shannon source coding theorem
Theorem (Shannon source coding theorem — Shannon, 1948)
If a binary symbol code C : A → {0, 1}⁺ is lossless, then

E(ℓ ∘ C) ≥ H(P),

where H(P) is the Shannon entropy of the distribution P:

H(P) = ∑_{a∈A} (−log₂ P(a)) · P(a)
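A quick empirical check of the bound, assuming the Communications Toolbox (the same toolbox that provides arithenco/arithdeco) and its huffmandict function:

p = [0.5 0.3 0.2];                    % toy distribution
[dict, avglen] = huffmandict(1:3, p); % Huffman symbol code for the symbols 1,2,3
H = sum(-p.*log2(p));                 % entropy, approx 1.4855 bits
[avglen H]                            % avglen = 1.5 >= H, as the theorem requires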
Kraft inequality (extended variant)
Theorem (Kraft Inequality)
If A is countable and C : A → {0, 1}⁺ is a lossless symbol code, then

∑_{a∈A} 2^{−D(a)} ≤ 1,

where D(a) = ℓ(C(a)).
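For the illustrative code 0, 10, 11 used above, the Kraft sum is 2⁻¹ + 2⁻² + 2⁻² = 1. A one-line MATLAB check:

D = [1 2 2];      % code word lengths D(a) of the toy code 0, 10, 11
sum(2.^(-D))      % ans = 1, so the Kraft inequality holds with equality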
The existence of nearly optimal codes
Theorem (Shannon-Fano, 1948)
For every alphabet A and probability distribution P : A → (0, 1] there exists a binary code C : A → {0, 1}⁺ such that:

H(P) ≤ E(ℓ ∘ C) ≤ H(P) + 1.
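The standard construction behind this theorem assigns each symbol a code word of length ℓ(a) = ⌈−log₂ P(a)⌉. A small MATLAB sketch (toy distribution, illustrative only) checking that these lengths satisfy the Kraft inequality and the stated bounds:

p = [0.4 0.3 0.2 0.1];     % toy distribution
L = ceil(-log2(p));        % Shannon-Fano code word lengths: 2 2 3 4
sum(2.^(-L))               % Kraft sum 0.6875 <= 1, so such a code exists
H = sum(-p.*log2(p));      % entropy, approx 1.8464 bits
sum(L.*p)                  % expected length 2.4 satisfies H <= 2.4 <= H+1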
Full encoder in MATLAB I
%% Specify input and output filenames
infile='coriolan.txt';
outfile='coriolan.zzz';
%% Read uncompressed text file
fileID=fopen(infile);
A=fread(fileID,inf,'uint8');
fclose(fileID);
disp('Entering RTG DEMO text encoder...');
%% Conduct frequency analysis (build "table")
% C is the alphabet: as per documentation of unique,
% IA and IC are index arrays C = A(IA) and A = C(IC)
[C,IA,IC] = unique(A);
table = histcounts(IC, 1:(length(C)+1));
%% Find entropy
p=table/sum(table);
entropy=sum(-p.*log2(p));
disp(['Expecting ', num2str(entropy), ' bits per symbol']);
%% Use arithmetic coding to encode the sequence
code=arithenco(IC,table);
%% Calculate bits per symbol
bitspersymbol=length(code)/length(A);
disp(['Achieved ', num2str(bitspersymbol), ' bits per symbol']);
%% Save compressed data
f=fopen(outfile,'wb+');
fwrite(f,'_RTG');
%% Write data to file
Full encoder in MATLAB II
% ---------------- Write metadata ----------------
% Write unencoded data length
fwrite(f,length(A),'uint32');
% Write alphabet length
fwrite(f,length(C),'uint8');
% Write alphabet
fwrite(f,C,'uint8');
% Write table
bits_per_table_entry=floor(log2(max(table)))+1;
bytes_per_table_entry=ceil(bits_per_table_entry/8);
if bytes_per_table_entry > 2
    count_class='uint32';
elseif bytes_per_table_entry > 1
    count_class='uint16';
else
    count_class='uint8';
end
fh=str2func(count_class);
counts=fh(table);
fwrite(f,bytes_per_table_entry,'uint8');
fwrite(f,counts,count_class);
% ---------------- Write binary code ----------------
% Pad code to a multiple of 8 bits
padd=ceil(length(code)/8)*8-length(code);
code=[code; zeros(padd,1)];
% Pack code into bytes
packed_code=zeros(length(code)/8,1,'uint8');
Full encoder in MATLAB III
for i=1:length(packed_code)
    packed_code(i,1)=bi2de(code((1+8*(i-1)):(8+8*(i-1)))');
end
fwrite(f,length(packed_code),'uint32');
fwrite(f,packed_code,'uint8');
%------------------------------------------------
fclose(f);
disp('Exiting RTG text encoder.');
Full decoder in MATLAB I
disp('Entering RTG text decoder...');
infile='coriolan.zzz';
outfile='coriolan_d.txt';
%% Decoder
f=fopen(infile,'r');
%% Read and check magic number
magic=fread(f,4,'*char')';
if all(magic=='_RTG')
    disp('Great! Magic number matches.');
else
    disp('Magic number does not match');
    return;
end
%% Read unencoded data length
len=fread(f,1,'uint32');
%% Read alphabet length
alphabet_len=fread(f,1,'uint8');
%% Read alphabet
C=fread(f,alphabet_len,'uint8');
%% Read table
bytes_per_table_entry=fread(f,1,'uint8');
if bytes_per_table_entry > 2
    count_class='uint32';
elseif bytes_per_table_entry > 1
    count_class='uint16';
else
Full decoder in MATLAB II
    count_class='uint8';
end
recovered_table=fread(f,alphabet_len,count_class)';
%% Read packed code
packed_code_len=fread(f,1,'uint32');
packed_code=fread(f,packed_code_len,'uint8');
%% Unpack the code
code_len=packed_code_len*8;
recovered_code=zeros(code_len,1,'double');
recovered_code=de2bi(packed_code,8)';
recovered_code=recovered_code(:);
%% Recover index array
IC=arithdeco(recovered_code,recovered_table,len);
%% Apply alphabet
recovered=C(IC);
%% Write the file
f=fopen(outfile,'w');
fwrite(f,recovered);
fclose(f);
%-----------------------------------------
disp('Exiting RTG text decoder...');
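A natural sanity check after running the encoder and the decoder (a short sketch; coriolan_d.txt is the file written by the decoder above):

% Verify that decompression recovered the original text exactly
original  = fileread('coriolan.txt');
recovered = fileread('coriolan_d.txt');
isequal(original, recovered)   % should be true (1)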