
Compression codes
Compression
• The idea is to shorten the string while
maintaining all the information
• This means looking at the probabilities for
each bit and trying to get them as close to
50/50 as possible
Section 7.2 in the textbook
An ASCII example
• “booboobooboobooboobooboo”
• 01100010011011110110111101100010...
• uncompressed data
• 01001011110010101110111110001111...
• compressed with gzip
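A quick way to see this in practice (a sketch of my own, not from the slides): Python's gzip module squeezes a very repetitive string dramatically.

import gzip

data = b"boo" * 1000                         # 3000 bytes of very repetitive data
compressed = gzip.compress(data)

print(len(data), "bytes uncompressed")       # 3000
print(len(compressed), "bytes after gzip")   # much smaller: the repetition compresses away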
Codes
• Homomorphisms over the free monoid of
letters (“strings”)
• Provide a mapping between strings and
other strings
elephants eat sweet eggs

symbol    code   frequency        symbol     code   frequency
e         0000   6                t          0110   3
l         0001   1                s          0111   3
p         0010   1                <space>    1000   3
h         0011   1                w          1001   1
a         0100   2                g          1010   2
n         0101   1

000000010000001000110100010101100111...

24 letters * 4 bits each = 96 bits
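As a sanity check on that arithmetic, here is a small sketch (mine, not the slides') that counts the letter frequencies and totals the cost of the fixed 4-bit code.

from collections import Counter

text = "elephants eat sweet eggs"
print(Counter(text))          # e: 6, t: 3, s: 3, ' ': 3, a: 2, g: 2, ...
print(4 * len(text), "bits")  # every symbol costs 4 bits: 24 * 4 = 96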
'e' was encoded as 0000; 'l' was encoded as 0001
After receiving 000, p(0) = 0.857 and p(1) = 0.143: only 0.566 bits of entropy!
elephants eat sweet eggs

symbol    frequency   code         symbol     frequency   code
e         6           10           t          3           110
l         1           00000        s          3           011
p         1           1110         <space>    3           010
h         1           0011         w          1           1111
a         2           0001         g          2           0010
n         1           00001
1000000101110001100010000111001101010...

A total of 77 bits
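The 77-bit total can be re-derived from the code table; this is a sketch of mine, not course code.

text = "elephants eat sweet eggs"
codes = {'e': '10', 't': '110', 'l': '00000', 's': '011', 'p': '1110',
         ' ': '010', 'h': '0011', 'w': '1111', 'a': '0001', 'g': '0010',
         'n': '00001'}

encoded = ''.join(codes[c] for c in text)
print(len(encoded), "bits")   # 77 bits, versus 96 for the fixed-length code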
e has frequency 6; t+w+p have frequency 5
After receiving a 1, p(0) = 0.545 and p(1) = 0.455: 0.994 bits of entropy!
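That entropy figure checks out; here's a quick sketch of the binary-entropy calculation for p(0) = 6/11.

from math import log2

def binary_entropy(p):
    # Entropy, in bits, of a binary source that emits 0 with probability p.
    return -p * log2(p) - (1 - p) * log2(1 - p)

# After a leading 1, the next bit is 0 for 'e' (frequency 6) and 1 for t, w, p (frequency 5).
print(binary_entropy(6 / 11))   # ~0.994 bits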
• All coding schemes aim to give short codes to frequently used symbols and long codes to seldom used symbols
• Huffman codes are better than most
Huffman coding
• They produce optimal codes
• They're easy to generate
6  3  3  3  2  2  1  1  1  1  1
e  #  s  t  g  a  w  n  h  p  l          (# stands for the space)

[Slide sequence building the Huffman tree: repeatedly merge the two lowest-weight nodes (first p and l, then n and h, and so on) into a parent whose weight is the sum of its children, until a single tree of total weight 24 remains.]
From tree to codes
• Start with the empty string
• Each time you follow a branch left, append a 0; each time you follow a branch right, append a 1
• E.g., e = 10
• E.g., w = 0011
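Here's a minimal Huffman sketch (mine, not the course's reference code) that builds the tree with a priority queue and then reads the codes off the branches exactly as described above. Ties can be broken in different orders, which is why two equally optimal runs can produce different code tables.

import heapq
from collections import Counter

def huffman_codes(text):
    # One heap entry per symbol: (frequency, tiebreaker, tree). A leaf is just the symbol.
    heap = [(freq, i, sym) for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        # Repeatedly merge the two lowest-frequency nodes under a new parent.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (left, right)))
        counter += 1

    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):        # internal node: left branch appends 0, right appends 1
            walk(node[0], prefix + '0')
            walk(node[1], prefix + '1')
        else:                              # leaf: the accumulated prefix is the code
            codes[node] = prefix or '0'
        return codes
    return walk(heap[0][2], '')

print(huffman_codes("elephants eat sweet eggs"))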
Huffman properties
• The first Huffman codes I introduced were different from the one we just derived
• Both are optimal! Huffman codes are not unique!
• Huffman codes are always prefix-free
• Decompression requires no backtracking! (see the decoder sketch below)
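Because no code is a prefix of another, a decoder can walk the bit stream left to right and emit a symbol the instant the accumulated bits match a code; it never has to back up. A sketch of mine, using the code table from the example above:

codes = {'e': '10', 't': '110', 'l': '00000', 's': '011', 'p': '1110',
         ' ': '010', 'h': '0011', 'w': '1111', 'a': '0001', 'g': '0010',
         'n': '00001'}
decode_table = {code: sym for sym, code in codes.items()}

def decode(bits):
    out, current = [], ''
    for b in bits:
        current += b
        if current in decode_table:     # a full code has been read: emit it and start over
            out.append(decode_table[current])
            current = ''
    return ''.join(out)

print(decode('10000001011100011000100001110011'))  # "elephants"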
Huffman properties
• What would happen if we tried to compress binary data with Huffman codes?
• E.g., 000110001001000000000100000
• Frequency of 0 = 23, of 1 = 5 (28 bits in total)
• The tree has just two leaves: the code for 0 is 0 and the code for 1 is 1, so the "compressed" output is identical to the input
Huffman properties
• Huffman is only optimal if both parties already know what the codes are!
• For a practical compression scheme, you would have to send along the codes as well as the compressed data!
• Note: Huffman coding does not give optimal compression. It gives optimal code-based compression
Lempel-Ziv(-Welch)
• Good news: we don't have to send a code table along with our data
• Bad news: our codes are not optimal :(
mama and amamau

Start with a dictionary containing every single character, using 5-bit codes:

a        00000
b        00001
c        00010
d        00011
...
y        11001
z        11010
<space>  11011

Scan the input left to right, always emitting the code for the longest match already in the dictionary, then adding a new entry (that match plus the next character). Writing # for <space> and $ for end-of-string, the entries added while encoding are:

ma    11100        an    100000        #am   100011
am    11101        nd    100001        mam   100100
ma#   11110        d#    100010        mau   100101
#a    11111                            u$    100110

Once the dictionary grows past 31 entries, every code widens to 6 bits (a = 000000, ..., <space> = 011011, ma = 011100, am = 011101, and so on).

The output grows one code at a time:

01101                                             m
0110100000                                        a
011010000011100                                   ma
01101000001110011011                              <space>
01101000001110011011000000                        a   (6-bit codes from here on)
01101000001110011011000000001110                  n
01101000001110011011000000001110000011            d
01101000001110011011000000001110000011011111      #a
01101000001110011011000000001110000011011111011100011100    ma, ma

...and finally the 6-bit code for u: 4*5 + 7*6 = 62 bits in all.
Lempel-Ziv
• Our string was 15 characters long
• Without compression, it would have been 75 = 15 * 5 bits
• We did it in 4*5 + 7*6 = 62 bits
• BFD, we saved 13 freaking bits, and on a totally contrived example to boot
• It works better on much longer data
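A small LZW-style encoder sketch (mine; real LZW implementations differ in details like code widths and dictionary resets). It follows the same recipe as the worked example above: emit the code for the longest dictionary match, then add that match plus the next character as a new entry.

def lzw_encode(text):
    # Dictionary starts with every single character that appears in the input.
    dictionary = {ch: i for i, ch in enumerate(sorted(set(text)))}
    output, current = [], ''
    for ch in text:
        if current + ch in dictionary:
            current += ch                               # keep extending the match
        else:
            output.append(dictionary[current])          # emit the longest match
            dictionary[current + ch] = len(dictionary)  # learn a new pattern
            current = ch
    output.append(dictionary[current])                  # flush the final match
    return output, dictionary

codes, dictionary = lzw_encode("mama and amamau")
print(codes)        # 11 codes for 15 characters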
Lempel-Ziv(-Welch)
• The principle behind LZW is "if we saw a pattern before, we're likely to see it again later"
• If this is true, then it's clear how LZW helps to even up the probability distributions (so p(0) and p(1) get closer to 0.5)
• If this isn't true, LZW will just make things worse :(
01100

The same idea applied to binary data: start with a dictionary containing just 0 and 1, and grow it as patterns repeat. Encoding the string 01100, the dictionary grows to

0    000        01   010        10   100
1    001        11   011        00   101
                                0$   110

and the output is 001001000000.

Run-length encoding
• Simplest form of compression
• “If you see one symbol, there are probably
a bunch more of the same symbol coming”
• E.g., AAABAAAAACCCC => A3B1A5C4
• Problem: you can make your data a whole
lot longer if your assumption isn’t true
Code-based
compression schemes
• Huffman is optimal if both parties know the
code table beforehand
• Transmitting the code table can be
expensive, especially for short data
• Other schemes make assumptions about
repetition which aren’t always true
Back to Huffman
• I said Huffman was an optimal code-based
compression scheme, but not an optimal
compression scheme
• What’s the difference?
Data as a computation
• Idea: data and computations are actually the
same thing
• Any computation can be expressed as the
output it produces
• This output might be infinite :(
• Any data can be expressed by a
computation that produces it
Data as a computation
• Computations are actually finite (!) descriptions of possibly infinite data
• In 3331 you will (hopefully) discuss "encoding Turing machines"
• Even without 3331, you have already seen finite descriptions of computations
• They're called "programming languages"
Kolmogorov complexity
• Andrey Kolmogorov (and many others) was interested in the inherent "complexity" of a string
• 010111001 is more "complex" than 000000000
• The Kolmogorov complexity of a string (finite or infinite) is the size of the smallest Turing machine that produces it
K-complexity minus Turing
• Remember Monday I asked you to think about the most concise syntax/notation for describing a computation
• Let's just pretend it's SPARC machine code
• We can compress data down to a SPARC program that produces it
0000000000000000....
• While we're pretending, pretend the SPARC assembly below is the smallest code possible to produce this string (this is a lie)
• The string 0000... (for some given positive length) has complexity 160 (32 bits per instr)
• We can encode any string of all zeroes in 192 bits
allZeroes:
    save   %sp, -96, %sp    ! set up a register window / stack frame
    call   writeChar        ! output one character...
    mov    '0', %o0         ! ...the '0' goes in %o0 (call delay slot)
    bnz    . - 8            ! branch back to the call if the count hasn't hit zero
    deccc  %i0              ! delay slot: decrement the count, setting the condition codes
101001000100000100...
• I coded this up in 15 instructions (you can probably do it in less)
• That means a string following this pattern, no matter how long, can be compressed down to 512 bits
• No code-based compression scheme could do that, not even Huffman!
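The same idea sketched in Python rather than SPARC (an illustration of mine, not from the slides): the "compressed" form of a very regular string is just a tiny program plus its parameters.

# Instead of storing a million '0' characters, store a recipe that regenerates them.
def all_zeroes(n):
    return "0" * n

data = all_zeroes(1_000_000)
print(len(data))   # a million characters of output from a couple of lines of "description"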
Compressing everything
• There is some overhead in doing this
• It would work fantastic for long strings
• Any other compression scheme (RLE, Huffman, etc.) can be simulated, which means this method is at least as good as any other
Stupid undecidability
• For those of you who haven't taken 3331 yet, "undecidable" means "impossible to compute"
• Finding the optimal encoding for a given string is undecidable
• Even finding out how big that optimal encoding would be (i.e., K-complexity) is undecidable :(
Who cares about K-complexity?
• Just like entropy, Kolmogorov complexity is a purely theoretical tool
• It gives us a framework for determining what's possible
• It doesn't tell us how to actually do what's possible :(
Compression wrap-up
• Huffman coding is optimal if you can set up your codes beforehand
• Algorithmic (non-code-based) schemes are generally very good, but highly specialized to your data
• Lempel-Ziv is good enough for most things
• A perfect solution is impossible!
Lossy compression
• Shannon's source coding theorem gave us a theoretical bound on how much we could compress things without losing information
• ...what if we don't mind losing information?
Lossy compression
• lol omg r u srs
• JPEG, etc. (photos)
• MPEG, etc. (video)
• MP3, etc. (audio)
• The last 3 domains all carry a lot of
information (entropy)
• Humans ignore most of it
Lossy compression
• Time permitting, we will talk about lossy
compression schemes (JPEG, MPEG, etc.)
later in the course
• They’re primarily built on psychology
General lossy compression
• A general method for removing information is to reduce fidelity
• Take out the "low-order bits"

r = 11001001  ->  r = 11000000
g = 11000001  ->  g = 11000000
b = 00011100  ->  b = 00010000
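A sketch of that fidelity reduction (my own example code, not from the slides): clear the low-order bits of each 8-bit color channel with a mask.

def reduce_fidelity(r, g, b, keep_bits=4):
    # Keep only the top keep_bits bits of each 8-bit channel; zero out the rest.
    mask = (0xFF << (8 - keep_bits)) & 0xFF      # keep_bits=4 gives mask 11110000
    return r & mask, g & mask, b & mask

r, g, b = 0b11001001, 0b11000001, 0b00011100
print([format(v, "08b") for v in reduce_fidelity(r, g, b)])
# ['11000000', '11000000', '00010000']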