Compression codes
(Section 7.2 in the textbook)

Compression
• The idea is to shorten the string while maintaining all of its information
• This means looking at the probabilities for each bit and trying to get them as close to 50/50 as possible

An ASCII example
• "booboobooboobooboobooboo"
• 01100010011011110110111101100010...  — the uncompressed ASCII data
• 01001011110010101110111110001111...  — the same data compressed with gzip

Codes
• Homomorphisms over the free monoid of letters ("strings")
• Provide a mapping between strings and other strings

A fixed-length code for "elephants eat sweet eggs"
  e 0000    t 0110
  l 0001    s 0111
  p 0010    <space> 1000
  h 0011    w 1001
  a 0100    g 1010
  n 0101
• Encoded: 000000010000001000110100010101100111...
• 24 letters * 4 bits each = 96 bits

Letter frequencies in "elephants eat sweet eggs"
  e 6    t 3
  l 1    s 3
  p 1    <space> 3
  h 1    w 1
  a 2    g 2
  n 1

Why the fixed-length code wastes bits
• 'e' was encoded as 0000 and 'l' was encoded as 0001
• After receiving 000, the only symbols still possible are 'e' (frequency 6) and 'l' (frequency 1)
• So for the next bit, p(0) = 0.857 and p(1) = 0.143 — only about 0.59 bits of entropy!

A variable-length code for "elephants eat sweet eggs"
  e 10       t 110
  l 00000    s 011
  p 1110     <space> 010
  h 0011     w 1111
  a 0001     g 0010
  n 00001
• Encoded: 1000000101110001100010000111001101010...
• A total of 77 bits, down from 96
• e has frequency 6; t, w, and p together have frequency 5
• After receiving 1, p(0) = 0.545 and p(1) = 0.455 for the next bit — 0.994 bits of entropy!
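To make the bit-count comparison concrete, here is a minimal sketch in Python (my choice; the slides don't use any particular programming language). It encodes the example string with the variable-length table above, checks the 96-bit and 77-bit totals, and shows that a prefix-free code decodes with no backtracking.

# A sketch, assuming Python; the code table is the variable-length one from the slides.
text = "elephants eat sweet eggs"

var_code = {
    'e': '10',    't': '110',
    'l': '00000', 's': '011',
    'p': '1110',  ' ': '010',
    'h': '0011',  'w': '1111',
    'a': '0001',  'g': '0010',
    'n': '00001',
}

# Fixed-length code: every one of the 24 symbols costs 4 bits.
print(len(text) * 4)                        # 96

# Variable-length code: frequent symbols get short codewords.
encoded = ''.join(var_code[ch] for ch in text)
print(len(encoded))                         # 77

# Because no codeword is a prefix of another, decoding needs no backtracking:
# read bits until the buffer matches a codeword, emit the symbol, start over.
decode = {bits: ch for ch, bits in var_code.items()}
out, buf = [], ''
for bit in encoded:
    buf += bit
    if buf in decode:
        out.append(decode[buf])
        buf = ''
print(''.join(out) == text)                 # True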
• All coding schemes aim to give short codes to frequently used symbols and long codes to seldom-used symbols
• Huffman codes are better than most

Huffman coding
• They produce optimal codes
• They're easy to generate

Building the Huffman tree (stepped through on the slides; # is the space character)
• Start with one leaf per symbol, weighted by its frequency: e 6, # 3, s 3, t 3, g 2, a 2, w 1, n 1, h 1, p 1, l 1
• Repeatedly merge the two lowest-weight nodes into a new node whose weight is their sum (the slides start by merging p and l, two of the weight-1 leaves)
• Stop when a single tree remains; its root weight is 24, the length of the string

From tree to codes
• Start with the empty string at the root
• Each time you follow a branch left, append a 0; each time you follow a branch right, append a 1
• E.g., e = 10
• E.g., w = 0011

Huffman code properties
• The Huffman code I introduced first was different from the one we just derived
• Both are optimal! Huffman codes are not unique!
• Huffman codes are always prefix-free
• Decompression requires no backtracking!

Huffman properties
• What would happen if we tried to compress binary data with Huffman codes?
• E.g., 000110001001000000000100000
• Frequency of 0 is 23, frequency of 1 is 5
• The tree has just two leaves: the code for 0 is 0 and the code for 1 is 1 — no compression at all

Huffman properties
• Huffman is only optimal if both parties already know what the codes are!
• For a practical compression scheme, you would have to send along the codes as well as the compressed data!
• Note that Huffman coding does not give optimal compression; it gives optimal code-based compression
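As a concrete illustration of the greedy construction just described, here is a minimal sketch, again assuming Python (heapq is just my choice of priority queue; the function name is made up). Because ties between equal weights can be broken in different ways, the codewords it prints may differ from the ones on the slides, but the total encoded length is the same.

# A sketch of Huffman tree construction, assuming Python.
# Ties may be broken differently than on the slides, so the exact
# codewords can differ -- Huffman codes are not unique.
import heapq
from collections import Counter

def huffman_codes(text):
    freq = Counter(text)
    # Heap entries: (weight, tiebreaker, tree), where a tree is either
    # a symbol or a (left, right) pair of subtrees.
    heap = [(w, i, sym) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)   # two lowest-weight nodes...
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, counter, (t1, t2)))  # ...merged into one
        counter += 1
    _, _, tree = heap[0]

    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):       # internal node: recurse into children
            walk(node[0], prefix + '0')   # left branch appends a 0
            walk(node[1], prefix + '1')   # right branch appends a 1
        else:
            codes[node] = prefix or '0'   # degenerate single-symbol case
        return codes
    return walk(tree, '')

codes = huffman_codes("elephants eat sweet eggs")
print(codes)
print(sum(len(codes[c]) for c in "elephants eat sweet eggs"))  # 77 bits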
Lempel-Ziv(-Welch)
• Good news: we don't have to send a code table along with our data
• Bad news: our codes are not optimal :(

LZW example: "mama and amamau" (stepped through on the slides; # is the space character)
• Start with a dictionary containing every single character, each with a 5-bit code: a 00000, b 00001, c 00010, d 00011, ..., y 11001, z 11010, <space> 11011
• At each step, emit the code for the longest dictionary entry matching the remaining input, and add that entry plus the next character to the dictionary
• New entries created from this string: ma 11100, am 11101, ma# 11110, #a 11111, an 100000, nd 100001, d# 100010, #am 100011, mam 100100, mau 100101, u$ 100110
• Once a new entry needs a code that doesn't fit in 5 bits (an 100000), all codes widen to 6 bits (a 000000, ..., ma 011100, ...)
• The decoder can rebuild exactly the same dictionary from the codes it receives, so no table ever has to be transmitted

Lempel-Ziv
• Our string was 15 characters long
• Without compression, at 5 bits per character, it would have been 75 = 15 * 5 bits
• We did it in 4*5 + 7*6 = 62 bits (4 codes emitted before the widening, 7 after)
• BFD, we saved 13 freaking bits, and on a totally contrived example to boot
• It works much better on longer data

Lempel-Ziv(-Welch)
• The principle behind LZW is "if we saw a pattern before, we're likely to see it again later"
• If this is true, then it's clear how LZW helps to even up the probability distributions (so p(0) and p(1) get closer to 0.5)
• If this isn't true, LZW will just make things worse :(
• (The slides also step through LZW on the 5-bit binary string 01100, where the output, 001001000000, comes out longer than the input.)
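Here is a minimal sketch of the dictionary-growing idea, again assuming Python. It follows the standard LZW recipe but emits integer dictionary indices rather than the variable-width bit codes of the example above, so treat it as an illustration of the principle, not a reproduction of the slides' bit stream.

# A sketch of LZW compression, assuming Python. It emits integer dictionary
# indices rather than the 5-to-6-bit codes used in the slides' example.
def lzw_compress(text):
    # Start with a dictionary of every single character in the alphabet.
    alphabet = "abcdefghijklmnopqrstuvwxyz "
    table = {ch: i for i, ch in enumerate(alphabet)}
    out = []
    w = ""
    for c in text:
        if w + c in table:
            w = w + c                  # keep extending the current match
        else:
            out.append(table[w])       # emit the longest match so far...
            table[w + c] = len(table)  # ...and remember the new pattern
            w = c
    if w:
        out.append(table[w])
    return out, table

codes, table = lzw_compress("mama and amamau")
print(codes)   # 11 codes for 15 characters
# The decoder rebuilds the same table from the codes alone,
# so no table needs to be transmitted alongside the data.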
Run-length encoding
• Simplest form of compression
• "If you see one symbol, there are probably a bunch more of the same symbol coming"
• E.g., AAABAAAAACCCC => A3B1A5C4
• Problem: you can make your data a whole lot longer if your assumption isn't true

Code-based compression schemes
• Huffman is optimal if both parties know the code table beforehand
• Transmitting the code table can be expensive, especially for short data
• Other schemes make assumptions about repetition which aren't always true

Back to Huffman
• I said Huffman was an optimal code-based compression scheme, but not an optimal compression scheme
• What's the difference?

Data as a computation
• Idea: data and computations are actually the same thing
• Any computation can be expressed as the output it produces (this output might be infinite :( )
• Any data can be expressed by a computation that produces it
• Computations are finite (!) descriptions of possibly infinite data
• In 3331 you will (hopefully) discuss "encoding Turing machines"
• Even without 3331, you have already seen finite descriptions of computations: they're called "programming languages"

Kolmogorov complexity
• Andrey Kolmogorov (and many others) was interested in the inherent "complexity" of a string
• 010111001 is more "complex" than 000000000
• The Kolmogorov complexity of a string (finite or infinite) is the size of the smallest Turing machine that produces it
• Remember, on Monday I asked you to think about the most concise syntax/notation for describing a computation — let's just pretend it's SPARC machine code
• Then we can compress data down to a SPARC program that produces it
• While we're pretending, pretend the SPARC assembly below is the smallest code possible to produce the string 0000000000000000.... (this is a lie)

    allZeroes: save  %sp, -96, %sp
               call  writeChar
               mov   '0', %o0
               bnz   . - 8
               deccc %i0

• The string 0000... (for some given positive length) has complexity 160 (32 bits per instruction)
• We can encode any string of all zeroes in 192 bits

K-complexity minus Turing
• 101001000100000100...
• I coded this one up in 15 instructions (you can probably do it in less)
• That means this string, no matter how long it continues, can be compressed down to 512 bits
• No code-based compression scheme could do that, not even Huffman!

Compressing everything
• It would work fantastically well for long strings
• Any other compression scheme (RLE, Huffman, etc.) can be simulated, which means this method is at least as good as any other
• There is some overhead in doing this

Stupid undecidability
• For those of you who haven't taken 3331 yet, "undecidable" means "impossible to compute"
• Finding the optimal encoding for a given string is undecidable
• Even finding out how big that optimal encoding would be (i.e., the K-complexity) is undecidable :(

Who cares about K-complexity?
• Just like entropy, Kolmogorov complexity is a purely theoretical tool
• It gives us a framework for determining what's possible
• It doesn't tell us how to actually do what's possible :(

Compression wrap-up
• Algorithmic (non-code-based) schemes are generally very good, but highly specialized to your data
• Huffman coding is optimal if you can set up your codes beforehand
• Lempel-Ziv is good enough for most things
• A perfect solution is impossible!

Lossy compression
• Shannon's source coding theorem gave us a theoretical bound on how much we could compress things without losing information
• ...what if we don't mind losing information?
• lol omg r u srs
• JPEG, etc. (photos); MPEG, etc. (video); MP3, etc. (audio)
• These three domains all carry a lot of information (entropy), and humans ignore most of it
• Time permitting, we will talk about these lossy compression schemes later in the course
• They're primarily built on psychology

General lossy compression
• A general method for removing information is to reduce fidelity
• Take out the "low-order bits" of each sample, e.g. for an RGB pixel:
    r = 11001001   ->   r = 11000000
    g = 11000001   ->   g = 11000000
    b = 00011100   ->   b = 00010000
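To make the fidelity-reduction idea concrete, here is a minimal sketch, again assuming Python (the function name and keep_bits parameter are my own). It keeps only the top four bits of each 8-bit color channel, which reproduces the r/g/b values shown above; real formats like JPEG are far more sophisticated, but the principle of throwing away low-order information is the same.

# A sketch of crude lossy compression, assuming Python: zero out the
# low-order bits of each 8-bit color channel, keeping only the top 4.
def reduce_fidelity(r, g, b, keep_bits=4):
    mask = 0xFF & ~((1 << (8 - keep_bits)) - 1)   # 11110000 for keep_bits=4
    return r & mask, g & mask, b & mask

r, g, b = 0b11001001, 0b11000001, 0b00011100
print([format(x, '08b') for x in reduce_fidelity(r, g, b)])
# ['11000000', '11000000', '00010000'] -- matches the example above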