INF-10845-20091 Multimedia Coding
Lecture 3: Fundamentals of Information Theory
Shujun Li (李树钧)
May 7, 2009
Outlines
• Review: Multimedia Coding
• Review: Coding
• Information Theory: Basic Concepts
• Information Theory: Entropy
• Information Theory: Shannon Source Coding Theorem
Review: Multimedia Coding

Why is compression possible: Spatial redundancy
[Figure: scatter plots of horizontally adjacent pixels (pixel at (1,i) vs. pixel at (1,i+1)) and vertically adjacent pixels (pixel at (i,1) vs. pixel at (i+1,1)) for a picture of Konstanz (13.12.2008), together with the autocorrelation in the horizontal direction of natural images. Adjacent pixel values are strongly correlated.]
⇒ Spatial predictive coding and transform coding become useful!
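The correlation shown in the figure can be estimated directly from pixel values. Below is a minimal sketch (not from the slides, assuming NumPy and a grayscale image stored as a 2-D array); the synthetic gradient image is only a stand-in for a real photograph.

```python
# Minimal sketch (not from the slides): estimating spatial redundancy as the
# correlation coefficient between horizontally adjacent pixels.
import numpy as np

def horizontal_correlation(img):
    """Correlation coefficient between pixel (i, j) and its right neighbour (i, j+1)."""
    left = img[:, :-1].astype(np.float64).ravel()
    right = img[:, 1:].astype(np.float64).ravel()
    return float(np.corrcoef(left, right)[0, 1])

img = np.add.outer(np.arange(64), np.arange(64)).astype(np.uint8)  # smooth test image
print(horizontal_correlation(img))  # close to 1, as for natural images
```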
Why is compression possible: Temporal redundancy
[Figure: the 1st frame, the 2nd frame, and the difference between them.]
⇒ Temporal predictive coding (motion estimation and compensation) becomes useful!

Why is compression possible: Psychovisual redundancy
The HVS (human visual system) is a highly nonlinear system!
• Luminance Masking: ∆I/I ≈ 0.02
• Texture Masking
• Frequency Masking
• Temporal Masking
• Color Masking: Luminance > Chrominance
⇒ Lossy coding becomes useful!
Why is compression possible: Statistical redundancy
And finally, lossless coding (lossless data compression) is always useful to further remove any statistical redundancy remaining in the data.

Image and video coding: Where is IT?
[Block diagram: Input Image/Video → Pre-Processing → Predictive Coding → Lossy Coding → Lossless Coding → Encoded Image/Video (…11011001…); the decoder reverses the chain (Lossless Coding → Lossy Coding → Predictive Coding → Post-Processing → Decoded Image/Video), and a Visual Quality Measurement block compares the decoded image/video with the input.]
Review: Coding

What is coding?
Coding is just some kind of mapping.
• It maps each symbol in a set X to a codeword in another set Y*=Y∪Y^2∪Y^3∪…
• It further maps a message (a number of consecutive symbols in X) to another message (a number of consecutive codewords in Y*).
One image coding example
X={black, gray, white}, Y={0,1}
• Symbols to codewords: black=00, gray=01, white=1
• A message to another message:
  [Figure: a 4×4 black/gray/white image] ⇒
  1 01 1 1
  1 00 1 01
  00 00 00 1
  1 01 1 00

Coding: A Formal Definition
Input: x=(x1,…,xn), where xi∈X
Output: y=(y1,…,yn), where yi∈Y*=Y∪Y^2∪…
Encoding: y=F(x)=f(x1)…f(xn), where f(xi)=yi
• yi is the codeword corresponding to the input symbol xi.
• The mapping f: X→Y* is called a code.
• F is called the extension of f.
• If F is an injection, i.e., x1≠x2 ⇒ y1≠y2, then f is a uniquely decodable (UD) code.
Decoding: x*=F^-1(y)
• When x*=x, we have a lossless code.
• When x*≠x, we have a lossy code.
• A lossy code cannot be a UD code.
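As a small illustration of the definitions above (not part of the original slides), the example code black=00, gray=01, white=1 and its extension F can be written down directly; decoding by scanning from the left works here because no codeword is a prefix of another.

```python
# Minimal sketch (not from the slides): the code f and its extension F for
# X = {black, gray, white}, Y = {0, 1}.
f = {"black": "00", "gray": "01", "white": "1"}

def F(message):
    """Extension of f: encode a sequence of symbols into one bit string."""
    return "".join(f[symbol] for symbol in message)

def F_inverse(bits):
    """Decode left to right; works because no codeword is a prefix of another."""
    inverse = {codeword: symbol for symbol, codeword in f.items()}
    symbols, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:          # a complete codeword has been read
            symbols.append(inverse[buffer])
            buffer = ""
    return symbols

message = ["white", "gray", "black", "white"]
encoded = F(message)
print(encoded)                         # 101001
print(F_inverse(encoded) == message)   # True: this f is uniquely decodable
```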
Why do we need coding?
Source Coding
• Coding for Economy: Compression
  • "thanks" ⇒ "谢谢" (compression rate: 1.25)
• Coding for Security: Cryptography
  • "thanks" ⇒ "iwpczh" (Caesar cipher, key = 15)
• Coding for Convenience
  • "李" ⇒ 0xC0EE (GBK) ⇒ 0x674E (UNICODE)
• Coding for Fun: See next slide for an example ☺
Channel Coding
• Coding for Reliability: Error Detection and Correction
[Diagram: x → Source Encoding → y → Channel Encoding → z → (channel + noise) → z' → Channel Decoding → y' → Source Decoding → x'; Source Encoding + Channel Encoding = Encoder, Channel Decoding + Source Decoding = Decoder, Encoder + Decoder = Codec]

Information Theory: Basic Concepts
The name of the game
What is information theory?
• Wikipedia: Information theory is a branch of applied mathematics and electrical engineering involving the quantification of information. Historically, information theory was developed by Claude E. Shannon to find fundamental limits on compressing and reliably communicating data.
• Elements of Information Theory (Thomas M. Cover & Joy A. Thomas): Information theory answers two fundamental questions in communication theory: What is the ultimate data compression, and what is the ultimate transmission rate of communication.

Where does information come from?
Information comes from a source with some kind of statistical distribution.
There are different sources.
• Deterministic source: Prob(…)=1
  • It could be as complicated as a deterministic but random-like sequence. ⇒ Typical examples include chaotic sources.
• Random source: Prob(…)<1
  • Memoryless source = i.i.d. (independent, identically distributed) source: Prob(ai aj)=Prob(ai)Prob(aj)
  • Stationary source: Prob(ai)=f(Prob(ai-1),…,Prob(ai-k))
  • Time-varying source: Prob(ai)=fi(Prob(ai-1),…,Prob(ai-k(i)))
Memoryless random (i.i.d.) source
Notations
• A source emits symbols in a set X.
• At any time, each symbol xi is emitted from the source with a fixed probability pi, independent of any other symbols.
• Any two emitted symbols are independent of each other: the probability that a symbol xj appears after another symbol xi is pi·pj.
• ⇒ There is a discrete distribution P={Prob(xi)=pi | ∀xi∈X}, which describes the statistical behavior of the source.
A memoryless random source is simply represented as a 2-tuple (X, P).
• P can be simply represented as a vector P=[p1,p2,…], once we define an order of all the elements in X.

Coding an information source
Goal
• Try to use symbols in another set Y to better represent (i.e., to code) symbols in X.
• When working with digital computers, Y={0,1}.
• The definition of "better" depends on the purpose of coding.
How to achieve data compression?
• Given a random source (X, P), if different symbols occur with different probabilities, then assign longer codewords to symbols with smaller occurrence probabilities and shorter codewords to symbols with larger probabilities.
Measuring compression performance
Denoting the length of each codeword by L(xi), the average codeword-length is
L = E(L(x)) = p1·L(x1)+…+pn·L(xn),
where E() denotes the mathematical expectation and n is the size of X.
• The smaller L is, the better the compression performance is.
A simple example: X={1,2,3}, P=[0.8,0.1,0.1], Y={0,1} (see the sketch below).
• A naïve code: f(1)=00, f(2)=01, f(3)=10 → L=2 (bits).
• A better code: f(1)=0, f(2)=10, f(3)=11 → L=0.8×1+0.1×2+0.1×2=1.2 (bits) < 2 (bits).
• The best code: apparently L>0, so there must be a lower limit. What is it? Is the second code the best one?

Optimal code
Given a source (X, P), a UD code is called an optimal code if there is no other UD code with a smaller average codeword-length.
Our main goal is to find the optimal code for any given source and determine its compression performance.
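A minimal sketch (not from the slides) that reproduces the average codeword-lengths of the naïve and the better code for the example source above:

```python
# Minimal sketch (not from the slides): average codeword-length L = E(L(x))
# for X = {1, 2, 3}, P = [0.8, 0.1, 0.1].
def average_length(code, P):
    """code maps each symbol to a bit string; P gives each symbol's probability."""
    return sum(p * len(code[x]) for x, p in P.items())

P = {1: 0.8, 2: 0.1, 3: 0.1}
naive  = {1: "00", 2: "01", 3: "10"}
better = {1: "0",  2: "10", 3: "11"}

print(average_length(naive, P))   # ≈ 2.0 bits per symbol
print(average_length(better, P))  # ≈ 1.2 bits per symbol
```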
Prefix-free (PF) code
We say a code f: X→Y* is prefix-free (PF), or instantaneous,
• if no codeword is a prefix of another codeword, i.e., there do not exist two distinct symbols x1, x2∈X such that f(x1) is a prefix of f(x2).
Properties of PF codes
• PF codes are always UD codes.
• PF codes can be uniquely represented by a b-ary tree.
• PF codes can be decoded without reference to the future.

PF code: A simple example
X={1,2,3}, Y={0,1}
f: X→Y*: f(1)=0, f(2)=10, f(3)=11
[Figure: the binary tree of the code — branch 0 from the root ends at x=1; branch 1 followed by 0 ends at x=2; branch 1 followed by 1 ends at x=3.]
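A minimal sketch (not from the slides) of the prefix-free test implied by the definition above:

```python
# Minimal sketch (not from the slides): checking the prefix-free property of a
# code given as a dict mapping symbols to codewords.
from itertools import permutations

def is_prefix_free(code):
    """True iff no codeword is a prefix of another codeword."""
    return not any(b.startswith(a) for a, b in permutations(code.values(), 2))

print(is_prefix_free({1: "0", 2: "10", 3: "11"}))  # True  (the example above)
print(is_prefix_free({1: "0", 2: "01", 3: "11"}))  # False ("0" is a prefix of "01")
```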
Kraft-McMillan number
Kraft-McMillan Number: K = Σ_{x∈X} 1/b^L(x),
where L(x) denotes the length of the codeword f(x) and b is the size of Y.
Theorem 1 (Kraft): K≤1 ⇔ a PF code with the given codeword lengths exists.
• K≤1 is often called the Kraft Inequality.
Theorem 2 (McMillan): a code is UD ⇒ K≤1.
Theorems 1+2: a UD code always has a PF counterpart (with the same codeword lengths).
⇒ We need not study general UD codes, only PF ones.

Proof of Kraft's Theorem *
Represent a PF code as a b-ary tree.
Assume the maximal codeword-length is Lmax.
Extend each leaf node of the b-ary tree to the maximal level Lmax.
• The number of added descendants of a leaf node at level L(x) is b^(Lmax−L(x)).
• For leaf nodes at the maximal level, there is only one node, i.e., b^(Lmax−Lmax) = 1.
Apparently, the sum of all nodes at the maximal level cannot exceed b^Lmax.
⇒ Σ_{x∈X} b^(Lmax−L(x)) ≤ b^Lmax ⇒ Σ_{x∈X} 1/b^L(x) ≤ 1
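A minimal sketch (not from the slides) that evaluates the Kraft-McMillan number for a set of binary codeword lengths:

```python
# Minimal sketch (not from the slides): the Kraft-McMillan number
# K = sum over codewords of 1/b^L(x), here for a binary code (b = 2).
def kraft_mcmillan_number(lengths, b=2):
    return sum(b ** (-L) for L in lengths)

# The PF code f(1)=0, f(2)=10, f(3)=11 has lengths [1, 2, 2]:
print(kraft_mcmillan_number([1, 2, 2]))   # 1.0  -> Kraft inequality holds
# Lengths [1, 1, 2] give K = 1.25 > 1, so no UD (hence no PF) binary code
# with these codeword lengths can exist.
print(kraft_mcmillan_number([1, 1, 2]))   # 1.25
```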
Proof of Kraft's Theorem *
For a slightly different proof, refer to the following reference:
• Norman L. Biggs, Codes: An Introduction to Information Communication and Cryptography, Springer, 2008 (Chapter 2, proof of Theorem 2.9, pages 19-21).

Information Theory: Entropy
First, two questions…
Q: Why do we define entropy?
A: It is the key to finding the optimal code and the lower limit of compression performance (and more).
Q: What is entropy?
A: Roughly speaking, it is a measure of the information contained in a source.

Definition of entropy (Shannon, 1948)
Given a memoryless random source (X, P) with probability distribution P=[p1, …, pn], its entropy to base b is defined as follows:
Hb(X) = Hb(P) = Σ_{i=1}^{n} pi·logb(1/pi)
When b=2, the subscript b may be omitted, and we have H(X)=H(P).
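A minimal sketch (not from the slides) of this definition, reusing the earlier example source P=[0.8, 0.1, 0.1]:

```python
# Minimal sketch (not from the slides): entropy of a memoryless source to base b.
from math import log

def entropy(P, b=2):
    """H_b(P) = sum_i p_i * log_b(1/p_i); terms with p_i = 0 contribute 0."""
    return sum(p * log(1 / p, b) for p in P if p > 0)

print(entropy([0.8, 0.1, 0.1]))    # ≈ 0.922 bits (the earlier example source)
print(entropy([1/3, 1/3, 1/3]))    # ≈ 1.585 bits = log2(3), the maximum for n = 3
```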
History of entropy
Etymology
• 1868, from Ger. Entropie "measure of the disorder of a system," coined 1865 by physicist Rudolph Clausius (1822-1888) from Gk. εντροπία "a turning toward," from en- "in" + trope "a turning".
Entropy in thermodynamics (physics, 1870s)
• S = k Σ_{i=1}^{n} pi·ln(1/pi), where k is the Boltzmann constant, equal to 1.38066×10^-23 J K^-1.

A simple example: Binary entropy function
X={x1,x2}, P=[p,1-p], Y={0,1}
⇒ H(X)=H(P)=p·log2(1/p)+(1-p)·log2(1/(1-p))
[Figure: plot of the binary entropy function H2(P)=p·log2(1/p)+(1-p)·log2(1/(1-p)) over p∈[0,1]; it is 0 at p=0 and p=1 and reaches its maximum of 1 bit at p=0.5.]
Why log: Information measure
Assume an event which happens with probability pi.
• When pi→0, logb(1/pi)→∞
• When pi=1, logb(1/pi)=0
• For any pi and pj, logb(1/(pi·pj))=logb(1/pi)+logb(1/pj)
• When pi<pj, logb(1/pi)>logb(1/pj)
⇒ logb(1/pi) turns out to reflect the surprise we get if the corresponding event occurs.
⇒ So, logb(1/pi) measures the quantity of information we get from the fact that the event occurs, or the uncertainty we have before the event happens.

Why log: Average information
The entropy of a source is a measure of the average information contained in the source, or the uncertainty associated with the source.
But why log (again)?
Are there other choices for defining the entropy?
• Q1: Consider a source emitting b^n symbols with equal probability; what is the most natural way to code these b^n symbols with the numbers {0,…,b-1}?
• A1: Intuitively, coding the i-th symbol by the base-b representation of the index i seems to be the most natural way.
• Q2: Then, what is the average length of each symbol's representation?
• A2: Apparently it is logb(b^n)=n.

What does the base b mean?
It is the size of Y.
Changing b just changes the unit of entropy:
logb2(x) = logb1(x)/logb1(b2)
By using different values of b, we have different units of information measure:
• b=2: bit
• b=3: trit
• b=e: nat
• b=10: dit
• …
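A small numerical check (not from the slides) of the change-of-base identity and its consequence for entropy units:

```python
# Minimal sketch (not from the slides): changing the base only rescales the
# logarithm, i.e. log_{b2}(x) = log_{b1}(x) / log_{b1}(b2).
from math import log, e

x = 10.0
print(log(x, e))               # ≈ 2.3026 (natural log of x)
print(log(x, 2) / log(e, 2))   # same value, computed via base 2
# Consequently H_e(P) = H_2(P) * ln 2: entropy in nats = entropy in bits * ln 2.
```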
Properties of entropy
The comparison theorem: if P=[p1,…,pn] and Q=[q1,…,qn] are two probability distributions, then
Hb(P) = Σ_{i=1}^{n} pi·logb(1/pi) ≤ Σ_{i=1}^{n} pi·logb(1/qi),
and the equality holds if and only if P=Q.
⇒ Hb(P) ≤ logb(n), and the equality holds if and only if p1=…=pn=1/n (uniform distribution).
Hb(P^n) = n·Hb(P) (for the n-fold extension of a memoryless source)

Information Theory: Shannon Source Coding Theorem
Shannon’s source coding theorem (I)
Shannon’s source coding theorem (II)
Thee entropy
e opy of
o a memoryless
e o y ess random
do source
sou ce
defines the lower bound of the efficiency of
all UD codes.
Given a memoryless random source (X,P)
with probability distribution P=[p1, …, pn].
Make a PF Code (Shannon Code) as follow:
• Denoting by L the average length of all the
• Finding L=[L1,…,Ln], where Li is the least positive
codewords of a UD code, this theorem says
•
L≥ Hb(X)
•
Question: Also an upper bound? – Yes!
integer such that bLi≥1/pi, i.e., Li = dlogb (1/pi )e .
One can prove K=∑(1/b
K ∑(1/bLi)≤1,
)≤1 then Kraft’s
K ft’
Theorem ensures there must be a PF code.
Then, for this PF code, we can prove
Hb(X)≤L<Hb(X)+1
36
Shujun LI (李树钧): INF-10845-20091 Multimedia Coding
37
Shujun LI (李树钧): INF-10845-20091 Multimedia Coding
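A minimal sketch (not from the slides) of the Shannon-code construction for the earlier example source P=[0.8, 0.1, 0.1]:

```python
# Minimal sketch (not from the slides): Shannon-code lengths Li = ceil(log_b(1/pi)),
# the resulting Kraft-McMillan number, and the bound H(X) <= L < H(X) + 1 (b = 2).
from math import ceil, log2

def shannon_code_lengths(P):
    return [ceil(log2(1 / p)) for p in P]

P = [0.8, 0.1, 0.1]
L = shannon_code_lengths(P)                 # [1, 4, 4]
K = sum(2 ** -Li for Li in L)               # 0.625 <= 1, so a PF code exists
avg = sum(p * Li for p, Li in zip(P, L))    # 1.6 bits per symbol
H = sum(p * log2(1 / p) for p in P)         # ≈ 0.922 bits per symbol
print(L, K, avg, H)                         # H <= avg < H + 1 holds
# Note: the Shannon code need not be optimal; the earlier code with lengths
# [1, 2, 2] averaged only 1.2 bits per symbol.
```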
Approaching the entropy
Given a memoryless random source (X,P), generate an extended source (X^n,P^n).
Hb(X^n) ≤ L^(n) < Hb(X^n)+1 ⇒ Hb(X) ≤ L < Hb(X)+1/n,
where Hb(X^n)=n·Hb(X) and L^(n)=n·L (L is the average codeword-length per original symbol).
Let n→∞; then L→Hb(X).
Problem: n might be too large to be used in practice.

Shannon code: Making an example
Assignment: Construct an example of your own, and show how the entropy is approached as n increases from 1 to 3 (or even a larger value, as you like).
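One possible way to carry out the assignment (a sketch, not from the slides; the two-symbol source with P=[0.9, 0.1] is my own choice):

```python
# Minimal sketch (not from the slides): Shannon-code the n-th extension of a
# two-symbol source and watch the bits per original symbol approach the entropy.
from itertools import product
from math import ceil, log2, prod

P = {"a": 0.9, "b": 0.1}
H = sum(p * log2(1 / p) for p in P.values())              # ≈ 0.469 bits/symbol

for n in (1, 2, 3):
    # n-th extension: all blocks of n i.i.d. symbols, with product probabilities
    block_probs = [prod(P[s] for s in block) for block in product(P, repeat=n)]
    lengths = [ceil(log2(1 / q)) for q in block_probs]     # Shannon-code lengths
    bits_per_symbol = sum(q * L for q, L in zip(block_probs, lengths)) / n
    print(n, round(bits_per_symbol, 3), round(H, 3))
# Prints roughly 1.3, 0.8 and 0.633 bits/symbol for n = 1, 2, 3, against H ≈ 0.469.
```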