Information Theory
Lecturer: Cong Ling
Office: 815
2
Lectures
Introduction
1 Introduction - 1
Entropy Properties
2 Entropy I - 19
3 Entropy II – 32
Lossless Source Coding
4 Theorem - 47
5 Algorithms - 60
Channel Capacity
6 Data Processing Theorem - 76
7 Typical Sets - 86
8 Channel Capacity - 98
9 Joint Typicality - 112
10 Coding Theorem - 123
11 Separation Theorem – 131
Continuous Variables
12 Differential Entropy - 143
13 Gaussian Channel - 158
14 Parallel Channels – 171
Lossy Source Coding
15 Rate Distortion Theory - 184
Network Information Theory
16 NIT I - 214
17 NIT II - 229
Revision
18 Revision - 247
19
20
3
Claude Shannon
• C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, 1948.
• Two fundamental questions in communication theory:
  – Ultimate limit on data compression: entropy
  – Ultimate transmission rate of communication: channel capacity
• Almost all important topics in information theory were initiated by Shannon
1916 - 2001
4
Origin of Information Theory
• Common wisdom in 1940s:
– It is impossible to send information error-free at a positive rate
– Error control by retransmission: the rate → 0 if error-free transmission is demanded
• Still in use today
– ARQ (automatic repeat request) in TCP/IP computer networking
• Shannon showed reliable communication is possible for
all rates below channel capacity
• As long as source entropy is less than channel capacity,
asymptotically error-free communication can be achieved
• And anything can be represented in bits
– Rise of digital information technology
5
Relationship to Other Fields
6
Course Objectives
• In this course we will (with a focus on communication theory):
– Define what we mean by information.
– Show how we can compress the information in a
source to its theoretically minimum value and show
the tradeoff between data compression and
distortion.
– Prove the channel coding theorem and derive the
information capacity of different channels.
– Generalize from point-to-point to network information
theory.
7
Relevance to Practice
• Information theory suggests means of achieving
ultimate limits of communication
– Unfortunately, these theoretically optimum schemes
are computationally impractical
– So some say “little info, much theory” (wrong)
• Today, information theory offers useful
guidelines to design of communication systems
– Turbo codes (approach channel capacity)
– CDMA (has a higher capacity than FDMA/TDMA)
– Channel-coding approach to source coding (duality)
– Network coding (goes beyond routing)
8
Books/Reading
Book of the course:
• Elements of Information Theory by T M Cover & J A Thomas, Wiley,
£39 for 2nd ed. 2006, or £14 for 1st ed. 1991 (Amazon)
Free references
• Information Theory and Network Coding by R. W. Yeung, Springer
http://iest2.ie.cuhk.edu.hk/~whyeung/book2/
• Information Theory, Inference, and Learning Algorithms by D
MacKay, Cambridge University Press
http://www.inference.phy.cam.ac.uk/mackay/itila/
• Lecture Notes on Network Information Theory by A. E. Gamal and
Y.-H. Kim, (Book is published by Cambridge University Press)
http://arxiv.org/abs/1001.3404
• C. E. Shannon, ”A mathematical theory of communication,” Bell
System Technical Journal, Vol. 27, pp. 379–423, 623–656, July,
October, 1948.
9
Other Information
• Course webpage:
http://www.commsp.ee.ic.ac.uk/~cling
• Assessment: Exam only – no coursework.
• Students are encouraged to do the problems in
problem sheets.
• Background knowledge
– Mathematics
– Elementary probability
• Needs intellectual maturity
– Doing problems is not enough; spend some time
thinking
10
Notation
• Vectors and matrices
  – v = vector, V = matrix
• Scalar random variables
  – x = random variable, x = specific value, X = alphabet
• Random column vector of length N
  – x = random vector, x = specific value, X^N = alphabet
  – x_i and x_i are particular vector elements (random and specific respectively)
• Ranges
  – a:b denotes the range a, a+1, …, b
• Cardinality
  – |X| = the number of elements in set X
11
Discrete Random Variables
• A random variable x takes a value x from
the alphabet X with probability px(x). The
vector of probabilities is px .
Examples:
  X = [1; 2; 3; 4; 5; 6], p_x = [1/6; 1/6; 1/6; 1/6; 1/6; 1/6]
  p_x is a "probability mass vector"
  "english text": X = [a; b; …; y; z; <space>]
  p_x = [0.058; 0.013; …; 0.016; 0.0007; 0.193]
Note: we normally drop the subscript from p_x if unambiguous
12
Expected Values
• If g(x) is a function defined on X then
    E_x[g(x)] = Σ_{x∈X} p(x) g(x)      (we often write E for E_x)
Examples:
  X = [1;2;3;4;5;6], p_x = [1/6; 1/6; 1/6; 1/6; 1/6; 1/6]
  E[x] = 3.5
  E[x²] = 15.17
  E[sin(0.1x)] = 0.338
  E[−log₂ p(x)] = 2.58   ← this is the "entropy" of X
13
Shannon Information Content
• The Shannon Information Content (SIC) of an outcome with probability p is −log₂ p
• Shannon’s contribution – a statistical view
– Messages, noisy channels are random
– Pre-Shannon era: deterministic approach (Fourier…)
• Example 1: Coin tossing
– X = [Head; Tail], p = [½; ½], SIC = [1; 1] bits
• Example 2: Is it my birthday ?
– X = [No; Yes], p = [364/365; 1/365],
SIC = [0.004; 8.512] bits
Unlikely outcomes give more information
14
Minesweeper
• Where is the bomb?
• 16 possibilities – needs 4 bits to specify
    Guess   Prob    Answer   SIC = −log₂ p
    1.      15/16   No       0.093 bits
    2.      14/15   No       0.100 bits
    3.      13/14   No       0.107 bits
    4.      1/13    Yes      3.700 bits
    Total                    4.000 bits
16
Entropy
    H(x) = E[−log₂ p_x(x)] = −Σ_{x∈X} p_x(x) log₂ p_x(x)
– H(x) = the average Shannon Information Content of x
– H(x) = the average information gained by knowing its value
– the average number of "yes-no" questions needed to find x is in the range [H(x), H(x)+1)
– H(x) = the amount of uncertainty before we know its value
We use log(x) ≡ log₂(x) and measure H(x) in bits
– if you use log_e it is measured in nats
– 1 nat = log₂(e) bits = 1.44 bits
•   log₂(x) = ln(x)/ln(2)        d(log₂ x)/dx = (log₂ e)/x
H(x) depends only on the probability vector p_x, not on the alphabet X, so we can write H(p_x)
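The definition above is easy to check numerically. The following is a minimal Python sketch (not part of the original notes) that evaluates H(p) for the fair-die and four-coloured-shapes examples used in these slides:

```python
import numpy as np

def entropy(p, base=2.0):
    """H(p) = -sum_i p_i log(p_i), with 0 log 0 treated as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # drop zero-probability outcomes
    return float(-np.sum(p * np.log(p)) / np.log(base))

print(entropy([1/6] * 6))                              # fair die: log2(6) = 2.585 bits
print(entropy([0.5, 0.25, 0.125, 0.125]))              # 1.75 bits (four coloured shapes)
print(entropy([0.5, 0.25, 0.125, 0.125], base=np.e))   # same quantity measured in nats
```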
17
Entropy Examples
(1) Bernoulli Random Variable
    X = [0; 1], p_x = [1−p; p]
    H(x) = −(1−p) log(1−p) − p log p
  Very common – we write H(p) to mean H([1−p; p]).
  Maximum is when p = 1/2.
  [Plot of H(p) versus p: zero at p = 0 and p = 1, peak of 1 bit at p = 1/2]
    H(p)  = −(1−p) log(1−p) − p log p
    H'(p) = log(1−p) − log p
    H''(p) = −(p⁻¹ + (1−p)⁻¹) log e
(2) Four Coloured Shapes
    X = [four coloured shapes], p_x = [½; ¼; 1/8; 1/8]
    H(x) = H(p_x) = Σ (−log p(x)) p(x)
         = 1×½ + 2×¼ + 3×1/8 + 3×1/8 = 1.75 bits
18
Comments on Entropy
• Entropy plays a central role in information
theory
• Origin in thermodynamics
– S = k ln Ω, k: Boltzmann's constant, Ω: number of microstates
– The second law: the entropy of an isolated system is non-decreasing
• Shannon entropy
– Agrees with intuition: additive, monotonic, continuous
– Logarithmic measure could be derived from an
axiomatic approach (Shannon 1948)
19
Lecture 2
• Joint and Conditional Entropy
– Chain rule
• Mutual Information
– If x and y are correlated, their mutual information is
the average information that y gives about x
• E.g. Communication Channel: x transmitted but y received
• It is the amount of information transmitted through the
channel
• Jensen’s Inequality
20
Joint and Conditional Entropy
Joint Entropy: H(x,y)
    p(x,y):          y=0    y=1
            x=0      ½      ¼
            x=1      0      ¼
    H(x,y) = E[−log p(x,y)]
           = −½ log ½ − ¼ log ¼ − 0 log 0 − ¼ log ¼ = 1.5 bits
    Note: 0 log 0 = 0
Conditional Entropy: H(y|x)
    p(y|x):          y=0    y=1      (rows sum to 1)
            x=0      2/3    1/3
            x=1      0      1
    H(y|x) = E[−log p(y|x)] = −Σ_{x,y} p(x,y) log p(y|x)
           = −½ log 2/3 − ¼ log 1/3 − 0 log 0 − ¼ log 1 = 0.689 bits
21
Conditional Entropy – View 1
Additional Entropy:
    p(x,y):          y=0    y=1    p(x)
            x=0      ½      ¼      ¾
            x=1      0      ¼      ¼
    p(y|x) = p(x,y) / p(x), so
    H(y|x) = E[−log p(y|x)] = E[−log p(x,y)] − E[−log p(x)]
           = H(x,y) − H(x) = H(½, ¼, 0, ¼) − H(¾, ¼) = 0.689 bits
H(y|x) is the average additional information in y when you already know x
    [Venn diagram: H(x,y) is H(x) plus the additional H(y|x)]
22
Conditional Entropy – View 2
Average Row Entropy:
    p(x,y):          y=0    y=1    H(y|x=x)    p(x)
            x=0      ½      ¼      H(1/3)      ¾
            x=1      0      ¼      H(1) = 0    ¼
    H(y|x) = E[−log p(y|x)] = −Σ_{x,y} p(x,y) log p(y|x)
           = −Σ_{x∈X} p(x) Σ_{y∈Y} p(y|x) log p(y|x)
           = Σ_{x∈X} p(x) H(y|x=x) = ¾ H(1/3) + ¼ H(0) = 0.689 bits
Take a weighted average of the entropy of each row, using p(x) as the weight
23
Chain Rules
• Probabilities
    p(x,y,z) = p(z|x,y) p(y|x) p(x)
• Entropy
    H(x,y,z) = H(z|x,y) + H(y|x) + H(x)
    H(x_{1:n}) = Σ_{i=1}^n H(x_i | x_{1:i−1})
The log in the definition of entropy converts products of probability into sums of entropy
24
Mutual Information
Mutual information is the average amount of information that you get about x from observing the value of y
  – or the reduction in the uncertainty of x due to knowledge of y
    I(x;y) = H(x) − H(x|y) = H(x) + H(y) − H(x,y)
    (information in x) − (information in x when you already know y)
Mutual information is symmetrical: I(x;y) = I(y;x)
    [Venn diagram: H(x,y) partitioned into H(x|y), I(x;y) and H(y|x); H(x) and H(y) overlap in I(x;y)]
Use ";" to avoid ambiguities between I(x;y,z) and I(x,y;z)
25
Mutual Information Example
    p(x,y):          y=0    y=1
            x=0      ½      ¼
            x=1      0      ¼
• If you try to guess y you have a 50% chance of being correct.
• However, what if you know x? Best guess: choose y = x
  – If x=0 (p=0.75) then 66% correct prob
  – If x=1 (p=0.25) then 100% correct prob
  – Overall 75% correct probability
    I(x;y) = H(x) − H(x|y) = H(x) + H(y) − H(x,y)
    H(x) = 0.811,  H(y) = 1,  H(x,y) = 1.5
    I(x;y) = 0.311
    [Venn diagram: H(x,y)=1.5 split into H(x|y)=0.5, I(x;y)=0.311, H(y|x)=0.689; H(x)=0.811, H(y)=1]
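The numbers in this example can be verified directly from the joint probability table. A minimal sketch (not from the notes), using only the chain rule and the Venn-diagram identities above:

```python
import numpy as np

def H(p):
    """Entropy in bits of any probability array (zero entries ignored)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

pxy = np.array([[0.5, 0.25],      # rows: x = 0, 1; columns: y = 0, 1
                [0.0, 0.25]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

Hx, Hy, Hxy = H(px), H(py), H(pxy)
Hy_given_x = Hxy - Hx             # chain rule: H(y|x) = H(x,y) - H(x)
Ixy = Hx + Hy - Hxy               # mutual information

print(Hx, Hy, Hxy)                # 0.811  1.0  1.5
print(Hy_given_x, Ixy)            # 0.689  0.311
```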
26
Conditional Mutual Information
Conditional Mutual Information
    I(x;y|z) = H(x|z) − H(x|y,z)
             = H(x|z) + H(y|z) − H(x,y|z)
    Note: the conditioning on z applies to both x and y
Chain Rule for Mutual Information
    I(x₁,x₂,x₃; y) = I(x₁;y) + I(x₂;y|x₁) + I(x₃;y|x₁,x₂)
    I(x_{1:n}; y) = Σ_{i=1}^n I(x_i; y | x_{1:i−1})
27
Review/Preview
    [Venn diagram: H(x,y) = H(x|y) + I(x;y) + H(y|x)]
• Entropy:
    H(x) = E[−log₂ p_x(x)] = −Σ_{x∈X} p(x) log₂ p(x)
  – Positive and bounded: 0 ≤ H(x) ≤ log|X| *
• Chain Rule:
    H(x,y) = H(x) + H(y|x) ≤ H(x) + H(y) *
  – Conditioning reduces entropy: H(y|x) ≤ H(y) *
• Mutual Information:
    I(y;x) = H(y) − H(y|x) = H(x) + H(y) − H(x,y)
  – Positive and symmetrical: I(x;y) = I(y;x) ≥ 0 *
  – x and y independent ⇔ H(x,y) = H(y) + H(x) ⇔ I(x;y) = 0
(* = inequalities not yet proved)
28
Convex & Concave functions
f(x) is strictly convex over (a,b) if
    f(λu + (1−λ)v) < λ f(u) + (1−λ) f(v)    ∀ u ≠ v ∈ (a,b), 0 < λ < 1
  – every chord of f(x) lies above f(x)
  – f(x) is concave if −f(x) is convex
• Examples
  – Strictly convex: x², x⁴, eˣ, x log x  [x ≥ 0]
  – Strictly concave: log x, √x  [x ≥ 0]
  – Convex and concave: x
• Test:  d²f/dx² > 0  ∀ x ∈ (a,b)  ⇒  f(x) is strictly convex
  "convex" (not strictly) uses "≤" in the definition and "≥ 0" in the test
29
Jensen's Inequality
Jensen's Inequality: (a) f(x) convex ⇒ E f(x) ≥ f(E x)
                     (b) f(x) strictly convex ⇒ E f(x) > f(E x) unless x is constant
Proof by induction on |X|:
  – |X| = 1:  E f(x) = f(x₁) = f(E x)
  – |X| = k:  assume Jensen's inequality holds for |X| = k−1; the weights p_i/(1−p_k), i = 1:k−1, sum to 1, so
      E f(x) = Σ_{i=1}^{k} p_i f(x_i) = p_k f(x_k) + (1−p_k) Σ_{i=1}^{k−1} (p_i/(1−p_k)) f(x_i)
             ≥ p_k f(x_k) + (1−p_k) f( Σ_{i=1}^{k−1} (p_i/(1−p_k)) x_i )
             ≥ f( p_k x_k + (1−p_k) Σ_{i=1}^{k−1} (p_i/(1−p_k)) x_i ) = f(E x)
  The last step follows from the definition of convexity for a two-mass-point distribution.
30
Jensen's Inequality Example
Mnemonic example:
    f(x) = x² : strictly convex
    X = [−1; +1], p = [½; ½]
    E x = 0, so f(E x) = 0
    E f(x) = 1 > f(E x)
    [Plot of f(x) = x² with the two mass points at x = ±1]
31
Summary
    [Venn diagram: H(x,y) = H(x|y) + I(x;y) + H(y|x)]
• Chain Rule:  H(x,y) = H(y|x) + H(x)
• Conditional Entropy:
    H(y|x) = H(x,y) − H(x) = Σ_{x∈X} p(x) H(y|x=x)
  – Conditioning reduces entropy: H(y|x) ≤ H(y) *
• Mutual Information:  I(x;y) = H(x) − H(x|y) ≤ H(x)
  – In communications, mutual information is the amount of information transmitted through a noisy channel
• Jensen's Inequality:  f(x) convex ⇒ E f(x) ≥ f(E x)
(* = inequalities not yet proved)
32
Lecture 3
• Relative Entropy
– A measure of how different two probability mass
vectors are
• Information Inequality and its consequences
– Relative Entropy is always positive
• Mutual information is positive
• Uniform bound
• Conditioning and correlation reduce entropy
• Stochastic Processes
– Entropy Rate
– Markov Processes
33
Relative Entropy
Relative Entropy or Kullback-Leibler Divergence between two probability mass vectors p and q:
    D(p||q) = Σ_{x∈X} p(x) log (p(x)/q(x)) = E_p[ log (p(x)/q(x)) ] = E_p[−log q(x)] − H(p)
where E_p denotes an expectation performed using probabilities p.
D(p||q) measures the "distance" between the probability mass functions p and q.
We must have p_i = 0 whenever q_i = 0, else D(p||q) = ∞.
Beware: D(p||q) is not a true distance because:
  – (1) it is asymmetric between p and q, and
  – (2) it does not satisfy the triangle inequality.
34
Relative Entropy Example
    X = [1 2 3 4 5 6]ᵀ
    p = [1/6  1/6  1/6  1/6  1/6  1/6 ]      H(p) = 2.585
    q = [1/10 1/10 1/10 1/10 1/10 1/2 ]      H(q) = 2.161
    D(p||q) = E_p(−log q_x) − H(p) = 2.935 − 2.585 = 0.35
    D(q||p) = E_q(−log p_x) − H(q) = 2.585 − 2.161 = 0.424
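Both divergences (and their asymmetry) are easy to reproduce; the following short Python sketch (not part of the notes) uses the same p and q as above:

```python
import numpy as np

def kl_divergence(p, q):
    """D(p||q) in bits; requires q_i > 0 wherever p_i > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p = np.full(6, 1/6)
q = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])
print(kl_divergence(p, q))   # ~0.35
print(kl_divergence(q, p))   # ~0.424  (not symmetric)
```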
35
Information Inequality
Information (Gibbs') Inequality:  D(p||q) ≥ 0
• Define A = {x : p(x) > 0} ⊆ X
• Proof:
    −D(p||q) = −Σ_{x∈A} p(x) log (p(x)/q(x)) = Σ_{x∈A} p(x) log (q(x)/p(x))
             ≤ log Σ_{x∈A} p(x) (q(x)/p(x))           (Jensen's inequality; log is concave)
             = log Σ_{x∈A} q(x) ≤ log Σ_{x∈X} q(x) = log 1 = 0
If D(p||q) = 0: since log( ) is strictly concave, we have equality in the proof only if q(x)/p(x), the argument of the log, equals a constant.
But Σ_{x∈X} p(x) = Σ_{x∈X} q(x) = 1, so the constant must be 1 and p ≡ q.
36
Information Inequality Corollaries
• Uniform distribution has highest entropy
  – Set q = [|X|⁻¹, …, |X|⁻¹]ᵀ, giving H(q) = log|X| bits:
      D(p||q) = E_p[−log q(x)] − H(p) = log|X| − H(p) ≥ 0
• Mutual Information is non-negative
      I(y;x) = H(x) + H(y) − H(x,y) = E[ log ( p(x,y) / (p(x) p(y)) ) ]
             = D( p(x,y) || p(x) p(y) ) ≥ 0
  with equality only if p(x,y) ≡ p(x)p(y), i.e. x and y are independent.
37
More Corollaries
• Conditioning reduces entropy
    0 ≤ I(x;y) = H(y) − H(y|x)   ⇒   H(y|x) ≤ H(y)
  with equality only if x and y are independent.
• Independence Bound
    H(x_{1:n}) = Σ_{i=1}^n H(x_i | x_{1:i−1}) ≤ Σ_{i=1}^n H(x_i)
  with equality only if all the x_i are independent.
  E.g.: if all the x_i are identical, H(x_{1:n}) = H(x₁).
38
Conditional Independence Bound
• Conditional Independence Bound
    H(x_{1:n} | y_{1:n}) = Σ_{i=1}^n H(x_i | x_{1:i−1}, y_{1:n}) ≤ Σ_{i=1}^n H(x_i | y_i)
• Mutual Information Independence Bound
  If all the x_i are independent or, by symmetry, if all the y_i are independent:
    I(x_{1:n}; y_{1:n}) = H(x_{1:n}) − H(x_{1:n} | y_{1:n})
                        ≥ Σ_{i=1}^n H(x_i) − Σ_{i=1}^n H(x_i | y_i) = Σ_{i=1}^n I(x_i; y_i)
  E.g.: if n=2 with x_i i.i.d. Bernoulli (p=0.5) and y₁=x₂, y₂=x₁, then I(x_i;y_i)=0 but I(x_{1:2}; y_{1:2}) = 2 bits.
39
Stochastic Process
Stochastic Process {x_i} = x₁, x₂, …
  Entropy: H({x_i}) = H(x₁) + H(x₂|x₁) + …   (often → ∞)
  Entropy Rate: H(X) = lim_{n→∞} n⁻¹ H(x_{1:n})   if the limit exists
  – Entropy rate estimates the additional entropy per new sample.
  – Gives a lower bound on the number of code bits per sample.
Examples:
  – Typewriter with m equally likely letters each time: H(X) = log m
  – x_i i.i.d. random variables: H(X) = H(x_i)
40
Stationary Process
Stochastic process {x_i} is stationary iff
    p(x_{1:n} = a_{1:n}) = p(x_{k+(1:n)} = a_{1:n})    ∀ k, n, a_i ∈ X
If {x_i} is stationary then H(X) exists and
    H(X) = lim_{n→∞} n⁻¹ H(x_{1:n}) = lim_{n→∞} H(x_n | x_{1:n−1})
Proof:
    0 ≤ H(x_n | x_{1:n−1}) ≤(a) H(x_n | x_{2:n−1}) =(b) H(x_{n−1} | x_{1:n−2})
    (a) conditioning reduces entropy, (b) stationarity
  Hence H(x_n | x_{1:n−1}) is positive and decreasing, so it tends to a limit, say b.
  Hence H(x_k | x_{1:k−1}) → b and
    n⁻¹ H(x_{1:n}) = n⁻¹ Σ_{k=1}^n H(x_k | x_{1:k−1}) → b = H(X)
41
Markov Process (Chain)
A discrete-valued stochastic process {x_i} is
• Independent iff p(x_n | x_{0:n−1}) = p(x_n)
• Markov iff p(x_n | x_{0:n−1}) = p(x_n | x_{n−1})
  – time-invariant iff p(x_n = b | x_{n−1} = a) = p_ab, independent of n
  – States; transition matrix T = {t_ab}
    • Rows sum to 1: T1 = 1, where 1 is a vector of 1's
    • p_n = Tᵀ p_{n−1}
    • Stationary distribution: p_∞ = Tᵀ p_∞
  [State-transition diagram with states 1–4 and transition probabilities t_ab]
An independent stochastic process is the easiest to deal with; Markov is the next easiest.
42
Stationary Markov Process
If a Markov process is
  a) irreducible: you can go from any state a to any state b in a finite number of steps
  b) aperiodic: for every state a, the possible times to go from a back to a have highest common factor 1
then it has exactly one stationary distribution, p_∞:
  – p_∞ is the eigenvector of Tᵀ with eigenvalue λ = 1:  Tᵀ p_∞ = p_∞
  – (Tᵀ)ⁿ → p_∞ 1ᵀ as n → ∞, where 1 = [1 1 … 1]ᵀ
  – The initial distribution becomes irrelevant (asymptotically stationary):
      (Tᵀ)ⁿ p₀ → p_∞ 1ᵀ p₀ = p_∞    ∀ p₀
43
Chess Board
Random Walk
• Move to a neighbouring square with equal probability
• p₁ = [1 0 … 0]ᵀ:  H(p₁) = 0
• p_∞ = 1/40 × [3 5 3 5 8 5 3 5 3]ᵀ:  H(p_∞) = 3.0855
• Entropy rate:
    H(X) = lim_{n→∞} H(x_n | x_{n−1})
         = lim_{n→∞} −Σ p(x_n, x_{n−1}) log p(x_n | x_{n−1})
         = −Σ_{i,j} p_{∞,i} t_{i,j} log t_{i,j} = 2.2365
• Successive step distributions (next slide) converge towards p_∞, so H(p_n) → H(p_∞) and H(p_n|p_{n−1}) → H(X).
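The stationary distribution and entropy rate quoted above can be reproduced numerically. The sketch below is an illustration only: it assumes the board is a 3×3 grid with moves to any adjacent square (so corner/edge/centre squares have 3/5/8 neighbours), which is consistent with the stated p_∞ = [3 5 3 5 8 5 3 5 3]/40.

```python
import numpy as np

def king_walk_transition(rows=3, cols=3):
    """Transition matrix of an equal-probability walk to any adjacent square."""
    n = rows * cols
    T = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            nbrs = [(r + dr, c + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                    if (dr, dc) != (0, 0) and 0 <= r + dr < rows and 0 <= c + dc < cols]
            for rr, cc in nbrs:
                T[r * cols + c, rr * cols + cc] = 1.0 / len(nbrs)
    return T

T = king_walk_transition()

# Stationary distribution: eigenvector of T^T with eigenvalue 1, normalised to sum to 1
w, v = np.linalg.eig(T.T)
p_stat = np.real(v[:, np.argmin(np.abs(w - 1))])
p_stat /= p_stat.sum()
print(np.round(40 * p_stat))                       # [3 5 3 5 8 5 3 5 3]

H_stat = -np.sum(p_stat * np.log2(p_stat))         # ~3.0855 bits
logT = np.log2(np.where(T > 0, T, 1.0))            # 0 log 0 treated as 0
H_rate = -np.sum(p_stat[:, None] * T * logT)       # -sum_ij p_inf,i t_ij log t_ij
print(H_stat, H_rate)                              # ~3.0855, ~2.2365
```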
45
Chess Board Frames
[Frames showing the distribution p_n of the random walk after n steps, starting from a corner square:]
    n:                 1       2        3        4        5        6        7        8
    H(p_n):            0       1.58496  2.99553  3.09141  3.111    3.10287  3.07129  3.0827
    H(p_n | p_{n−1}):  0       1.58496  2.09299  2.24987  2.30177  2.54795  2.20683  2.23038
46
Summary
• Relative Entropy:  D(p||q) = E_p[ log (p(x)/q(x)) ] ≥ 0
  – D(p||q) = 0 iff p ≡ q
• Corollaries
  – Uniform bound: uniform p maximizes H(p)
  – I(x;y) ≥ 0;  conditioning reduces entropy
  – Independence bounds:
      H(x_{1:n}) ≤ Σ_{i=1}^n H(x_i)
      H(x_{1:n} | y_{1:n}) ≤ Σ_{i=1}^n H(x_i | y_i)
      I(x_{1:n}; y_{1:n}) ≥ Σ_{i=1}^n I(x_i; y_i)    if the x_i or the y_i are independent
• Entropy rate of a stochastic process:
  – {x_i} stationary:  H(X) = lim_{n→∞} H(x_n | x_{1:n−1})
  – {x_i} stationary Markov:  H(X) = H(x_n | x_{n−1}) = −Σ_{i,j} p_{∞,i} t_{i,j} log t_{i,j}
47
Lecture 4
• Source Coding Theorem
– n i.i.d. random variables each with entropy H(X) can
be compressed into more than nH(X) bits as n tends
to infinity
• Instantaneous Codes
– Symbol-by-symbol coding
– Uniquely decodable
• Kraft Inequality
– Constraint on the code length
• Optimal Symbol Code lengths
– Entropy Bound
48
Source Coding
• Source Code: C is a mapping X → D⁺
  – x is a random variable representing the message
  – D⁺ = set of all finite length strings from D
  – D is often binary
  – e.g. {E, F, G} → {0,1}⁺ :  C(E)=0, C(F)=10, C(G)=11
• Extension: C⁺ is the mapping X⁺ → D⁺ formed by concatenating the C(x_i) without punctuation
  – e.g. C⁺(EFEEGE) = 01000110
49
Desired Properties
• Non-singular: x₁ ≠ x₂ ⇒ C(x₁) ≠ C(x₂)
– Unambiguous description of a single letter of X
• Uniquely Decodable: C+ is non-singular
– The sequence C+(x+) is unambiguous
– A stronger condition
– Any encoded string has only one possible source
string producing it
– However, one may have to examine the entire
encoded string to determine even the first source
symbol
– One could use punctuation between two codewords
but inefficient
50
Instantaneous Codes
• Instantaneous (or Prefix) Code
  – No codeword is a prefix of another
  – Can be decoded instantaneously without reference to future codewords
• Instantaneous ⇒ Uniquely Decodable ⇒ Non-singular
Examples (I = instantaneous, U = uniquely decodable):
  – C(E,F,G,H) = (0, 1, 00, 11)       not I, not U
  – C(E,F)     = (0, 101)             I, U
  – C(E,F)     = (1, 101)             not I, U
  – C(E,F,G,H) = (00, 01, 10, 11)     I, U
  – C(E,F,G,H) = (0, 01, 011, 111)    not I, U
51
Code Tree
Instantaneous code: C(E,F,G,H) = (00, 11, 100, 101)
Form a D-ary tree where D = |D|
  – D branches at each node
  – Each codeword is a leaf
  – Each node along the path to a leaf is a prefix of the leaf, so it can't be a leaf itself
  – Some leaves may be unused
  [Binary code tree: branches labelled 0/1; leaves E = 00, F = 11, G = 100, H = 101]
Decoding example: 111011000000 → F H G E E
52
Kraft Inequality (instantaneous codes)
• Limit on the codeword lengths of instantaneous codes: not all codewords can be too short.
• Codeword lengths l₁, l₂, …, l_|X| must satisfy
    Σ_{i=1}^{|X|} 2^{−l_i} ≤ 1
• Proof idea:
  – Label each node at depth l with 2^{−l}
  – Each node equals the sum of all its leaves
  – Equality iff all leaves are utilised
  – Total code budget = 1
    e.g. codeword 00 uses up ¼ of the budget; codeword 100 uses up 1/8 of the budget
  [Code tree with node labels 1, ½, ¼, 1/8 illustrating the budget argument]
• The same argument works with a D-ary tree (replace 2 by D).
53
McMillan Inequality (uniquely decodable codes)
If a uniquely decodable code C has codeword lengths l₁, l₂, …, l_|X|, then
    Σ_{i=1}^{|X|} D^{−l_i} ≤ 1       (the same bound as the Kraft inequality)
Proof: Let S = Σ_{i=1}^{|X|} D^{−l_i} and M = max_i l_i. Then for any N,
    S^N = Σ_{i₁=1}^{|X|} … Σ_{i_N=1}^{|X|} D^{−(l_{i₁} + … + l_{i_N})}
        = Σ_{x∈X^N} D^{−length{C⁺(x)}}                       (sum over all sequences of length N)
        = Σ_{l=1}^{NM} |{x : length{C⁺(x)} = l}| D^{−l}       (re-order the sum by total encoded length)
        ≤ Σ_{l=1}^{NM} D^l D^{−l} = NM
since D^l is the maximum number of distinct encoded sequences of length l (unique decodability).
If S > 1 then S^N > NM for some N. Hence S ≤ 1.
Implication: uniquely decodable codes do not offer any further reduction of codeword lengths over instantaneous codes.
55
How Short are Optimal Codes?
If l(x) = length(C(x)) then C is optimal if L = E l(x) is as small as possible.
We want to minimize
    Σ_{x∈X} p(x) l(x)
subject to
  1. Σ_{x∈X} D^{−l(x)} ≤ 1
  2. all the l(x) are integers
Simplified version: ignore condition 2 and assume condition 1 is satisfied with equality.
This is less restrictive, so the lengths may be shorter than actually possible ⇒ a lower bound.
56
Optimal Codes (non-integer l_i)
• Minimize Σ_{i=1}^{|X|} p(x_i) l_i subject to Σ_{i=1}^{|X|} D^{−l_i} = 1
Use a Lagrange multiplier: define
    J = Σ_{i=1}^{|X|} p(x_i) l_i + λ Σ_{i=1}^{|X|} D^{−l_i}   and set ∂J/∂l_i = 0
    ∂J/∂l_i = p(x_i) − λ ln(D) D^{−l_i} = 0   ⇒   D^{−l_i} = p(x_i) / (λ ln D)
  also Σ_{i=1}^{|X|} D^{−l_i} = 1  ⇒  λ = 1/ln(D)  ⇒  l_i = −log_D(p(x_i))
Then
    E l(x) = E[−log_D p(x)] = E[−log₂ p(x)] / log₂ D = H(x) / log₂ D = H_D(x)
No uniquely decodable code can do better than this.
57
Bounds on Optimal Code Length
Round up the optimal code lengths:  l_i = ⌈−log_D p(x_i)⌉
• These l_i are bound to satisfy the Kraft inequality (since the optimum lengths do)
• For this choice,  −log_D(p(x_i)) ≤ l_i < −log_D(p(x_i)) + 1
• Average length of the best integer-length code, L*:
    H_D(x) ≤ L* < H_D(x) + 1       (since we added < 1 to the optimum values)
• We can do better by encoding blocks of n symbols:
    n⁻¹ H_D(x_{1:n}) ≤ n⁻¹ E l(x_{1:n}) < n⁻¹ H_D(x_{1:n}) + n⁻¹
• If the entropy rate of x_i exists (e.g. x_i is a stationary process):
    n⁻¹ H_D(x_{1:n}) → H_D(X)  and so  n⁻¹ E l(x_{1:n}) → H_D(X)
Also known as the source coding theorem.
58
Block Coding Example
X = [A; B], p_x = [0.9; 0.1],  H(x_i) = 0.469
Huffman coding of blocks of n symbols:
• n = 1
    sym   A     B
    prob  0.9   0.1
    code  0     1
    n⁻¹ E l = 1
• n = 2
    sym   AA    AB    BA    BB
    prob  0.81  0.09  0.09  0.01
    code  0     11    100   101
    n⁻¹ E l = 0.645
• n = 3
    sym   AAA    AAB   …   BBA     BBB
    prob  0.729  0.081 …   0.009   0.001
    code  0      101   …   10010   10011
    n⁻¹ E l = 0.583
[Plot of n⁻¹ E l versus n, decreasing towards H(x_i) = 0.469]
The extra 1-bit inefficiency becomes insignificant for large blocks.
59
Summary
• McMillan Inequality for D-ary codes:
  – any uniquely decodable C has Σ_{i=1}^{|X|} D^{−l_i} ≤ 1
• Any uniquely decodable code:  E l(x) ≥ H_D(x)
• Source coding theorem
  – Symbol-by-symbol encoding:  H_D(x) ≤ E l(x) < H_D(x) + 1
  – Block encoding:  n⁻¹ E l(x_{1:n}) → H_D(X)
60
Lecture 5
• Source Coding Algorithms
• Huffman Coding
• Lempel-Ziv Coding
61
Huffman Code
An optimal binary instantaneous code must satisfy:
1. p(x_i) > p(x_j)  ⇒  l_i ≤ l_j
(else swap codewords)
2. The two longest codewords have the same length
(else chop a bit off the longer codeword)
3. two longest codewords differing only in the last bit
(else chop a bit off all of them)
Huffman Code construction
1. Take the two smallest p(xi) and assign each a
different last bit. Then merge into a single symbol.
2. Repeat step 1 until only one symbol remains
Used in JPEG, MP3…
62
Huffman Code Example
X = [a, b, c, d, e], px = [0.25 0.25 0.2 0.15 0.15]
Read diagram backwards for codewords:
C(X) = [01 10 11 000 001], L = 2.3, H(x) = 2.286
For D-ary code, first add extra zero-probability symbols until
|X|–1 is a multiple of D–1 and then group D symbols at a time
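The merge procedure described above is short to implement. The following Python sketch (not from the notes) builds a binary Huffman code for the example alphabet; the individual codewords may differ from the slide's choice, since many optimal codes exist, but the average length matches.

```python
import heapq

def huffman_code(probs):
    """Binary Huffman code for a dict {symbol: probability}; returns {symbol: codeword}."""
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)                    # tie-breaker so dicts are never compared
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)  # merge the two least probable groups,
        p1, _, c1 = heapq.heappop(heap)  # giving each a different last bit
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, count, merged))
        count += 1
    return heap[0][2]

probs = {"a": 0.25, "b": 0.25, "c": 0.2, "d": 0.15, "e": 0.15}
code = huffman_code(probs)
L = sum(probs[s] * len(w) for s, w in code.items())
print(code)   # an optimal prefix code
print(L)      # 2.3 bits, as on the slide (H(x) = 2.286)
```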
63
Huffman Code is Optimal Instantaneous Code
Huffman traceback gives codes for progressively larger alphabets:
  [Merge diagram: 0.15+0.15→0.3, 0.25+0.2→0.45, 0.25+0.3→0.55, 0.55+0.45→1.0, with 0/1 branch labels]
    p₂ = [0.55 0.45],               c₂ = [0 1],                L₂ = 1
    p₃ = [0.45 0.3 0.25],           c₃ = [1 00 01],            L₃ = 1.55
    p₄ = [0.3 0.25 0.25 0.2],       c₄ = [00 01 10 11],        L₄ = 2
    p₅ = [0.25 0.25 0.2 0.15 0.15], c₅ = [01 10 11 000 001],   L₅ = 2.3
We want to show that all of these codes are optimal, including c₅.
65
Huffman Optimality Proof
Suppose one of these codes is sub-optimal:
  – ∃ m > 2 with c_m the first sub-optimal code (note that c₂ is definitely optimal)
  – An optimal code c′_m must have L_{c′_m} < L_{c_m}
  – Rearrange the symbols with the longest codewords in c′_m so that the two lowest probabilities p_i and p_j differ only in the last digit (doesn't change optimality)
  – Merge x_i and x_j to create a new code c′_{m−1} as in the Huffman procedure
  – L_{c′_{m−1}} = L_{c′_m} − p_i − p_j, since the codes are identical except one bit shorter with probability p_i + p_j
  – But also L_{c_{m−1}} = L_{c_m} − p_i − p_j, hence L_{c′_{m−1}} < L_{c_{m−1}}, which contradicts the assumption that c_m is the first sub-optimal code
Hence Huffman coding satisfies H_D(x) ≤ L < H_D(x) + 1
Note: Huffman is just one out of many possible optimal codes.
66
Shannon-Fano Code
Fano code:
  1. Put the probabilities in decreasing order
  2. Split as close to 50-50 as possible; repeat with each half
    sym   a     b     c     d     e     f     g     h
    prob  0.20  0.19  0.17  0.15  0.14  0.06  0.05  0.04
    code  00    010   011   100   101   110   1110  1111
H(x) = 2.81 bits
L_SF = 2.89 bits
Not necessarily optimal: the best code for this p actually has L = 2.85 bits.
67
Shannon versus Huffman
Shannon code:
    F_i = Σ_{k=1}^{i−1} p(x_k),  with p(x₁) ≥ p(x₂) ≥ … ≥ p(x_m)
    encoding: write the number F_i ∈ [0,1) to ⌈−log p(x_i)⌉ bits
    H_D(x) ≤ L_S < H_D(x) + 1    (exercise)
Example:
    p_x = [0.36 0.34 0.25 0.05],  H(x) = 1.78 bits
    −log₂ p_x = [1.47 1.56 2 4.32]
    l_S = ⌈−log₂ p_x⌉ = [2 2 2 5],  L_S = 2.15 bits
Huffman:
    l_H = [1 2 3 3],  L_H = 1.94 bits
Individual codewords may be longer in Huffman than in Shannon, but not the average.
68
Issues with Huffman Coding
• Requires the probability distribution of the
source
– Must recompute entire code if any symbol probability
changes
• A block of N symbols needs |X|N pre-calculated probabilities
• For many practical applications, however, the
underlying probability distribution is unknown
– Estimate the distribution
• Arithmetic coding: extension of Shannon-Fano coding; can
deal with large block lengths
– Without the distribution
• Universal coding: Lempel-Ziv coding
69
Universal Coding
• Does not depend on the distribution of the
source
• Compression of an individual sequence
• Run length coding
– Runs of data are stored (e.g., in fax machines)
Example: WWWWWWWWWBBWWWWWWWBBBBBBWW
9W2B7W6B2W
• Lempel-Ziv coding
– Generalization that takes advantage of runs of strings
of characters (such as WWWWWWWWWBB)
– Adaptive dictionary compression algorithms
– Asymptotically optimum: achieves the entropy rate
for any stationary ergodic source
70
Lempel-Ziv Coding (LZ78)
Memorize previously occurring substrings in the input data
  – parse the input into the shortest possible distinct 'phrases', i.e., each phrase is the shortest phrase not seen earlier
  – number the phrases starting from 1 (0 is the empty string)
      A|B|AA|BA|BAB|BB|AB      (phrases 1–7 of ABAABABABBBAB…)
Look up a dictionary
  – each phrase consists of a previously occurring phrase (head) followed by an additional A or B (tail)
  – encoding: give the location of the head followed by the additional symbol for the tail
      0A 0B 1A 2A 4B 2B 1B …   (locations were underlined on the original slide)
  – the decoder uses an identical dictionary
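The parsing rule is only a few lines of code. A minimal Python sketch (not from the notes) that produces the (location, tail) pairs for the example string:

```python
def lz78_parse(s):
    """LZ78: parse s into shortest distinct phrases; encode each phrase as
    (index of its longest previously seen prefix, final symbol)."""
    dictionary = {"": 0}             # phrase -> index; 0 is the empty phrase
    output, phrase = [], ""
    for ch in s:
        if phrase + ch in dictionary:
            phrase += ch             # keep extending until the phrase is new
        else:
            output.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                       # leftover (possibly repeated) final phrase
        output.append((dictionary[phrase[:-1]], phrase[-1]))
    return output

print(lz78_parse("ABAABABABBBAB"))
# [(0,'A'), (0,'B'), (1,'A'), (2,'A'), (4,'B'), (2,'B'), (1,'B')]  i.e. 0A 0B 1A 2A 4B 2B 1B
```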
71
Lempel-Ziv Example
Input = 1011010100010010001001010010
Parsing into phrases, and encoding each phrase as (location of head, tail bit):
    #    Phrase    Send     Decode
    1    1         1        1
    2    0         00       0
    3    11        011      11
    4    01        101      01
    5    010       1000     010
    6    00        0100     00
    7    10        0010     10
    8    0100      1010     0100
    9    01001     10001    01001
    10   010010    10010    010010
(no need to always send 4 location bits – only ⌈log₂ K⌉ bits, where K is the current dictionary size)
Remarks:
  • No need to send the dictionary (imagine zip and unzip!) – it can be reconstructed by the decoder
  • Need to send the 0's in 01, 010 and 001 to avoid ambiguity (i.e., instantaneous code)
72
Lempel-Ziv Comments
The dictionary D contains K entries D(0), …, D(K−1). We need to send M = ⌈log K⌉ bits to specify a dictionary entry. Initially K=1, D(0) = ∅ = the null string and M = ⌈log K⌉ = 0 bits.
  Input "1": "1" ∉ D, so send "1" and set D(1)="1". Now K=2, M=1.
  Input "0": "0" ∉ D, so split it up as ""+"0" and send location "0" (since D(0)=∅) followed by "0". Then set D(2)="0", making K=3, M=2.
  Input "1": "1" ∈ D, so don't send anything yet – just read the next input bit.
  Input "11": "11" ∉ D, so split it up as "1"+"1" and send location "01" (since D(1)="1" and M=2) followed by "1". Then set D(3)="11", making K=4, M=2.
  Input "0": "0" ∈ D, so don't send anything yet – just read the next input bit.
  Input "01": "01" ∉ D, so split it up as "0"+"1" and send location "10" (since D(2)="0" and M=2) followed by "1". Then set D(4)="01", making K=5, M=3.
  Input "0", then "01": still in D, so don't send anything yet – just read the next input bit.
  Input "010": "010" ∉ D, so split it up as "01"+"0" and send location "100" (since D(4)="01" and M=3) followed by "0". Then set D(5)="010", making K=6, M=3.
So far we have sent 1 00 011 101 1000 = 1000111011000, where the dictionary entry numbers were shown in red on the slide.
73
Lempel-Ziv Properties
• Simple to implement
• Widely used because of its speed and efficiency
– applications: compress, gzip, GIF, TIFF, modem …
– variations: LZW (considering last character of the
current phrase as part of the next phrase, used in
Adobe Acrobat), LZ77 (sliding window)
– different dictionary handling, etc
• Excellent compression in practice
– many files contain repetitive sequences
– worse than arithmetic coding for text files
74
Asymptotic Optimality
• Asymptotically optimum for stationary ergodic
source (i.e. achieves entropy rate)
• Let c(n) denote the number of phrases for a
sequence of length n
• Compressed sequence consists of c(n) pairs
(location, last bit)
• Needs c(n)[log c(n) + 1] bits in total
• {X_i} stationary ergodic ⇒
    limsup_{n→∞} n⁻¹ l(X_{1:n}) = limsup_{n→∞} n⁻¹ c(n)[log c(n) + 1] ≤ H(X)   with probability 1
  – Proof: C&T chapter 12.10
  – may only approach this for an enormous file
75
Summary
• Huffman Coding:  H_D(x) ≤ E l(x) < H_D(x) + 1
  – Bottom-up design
  – Optimal (shortest average length)
• Shannon-Fano Coding:  H_D(x) ≤ E l(x) < H_D(x) + 1
  – Intuitively natural top-down design
• Lempel-Ziv Coding
– Does not require probability distribution
– Asymptotically optimum for stationary ergodic source (i.e.
achieves entropy rate)
76
Lecture 6
• Markov Chains
– Have a special meaning
– Not to be confused with the standard
definition of Markov chains (which are
sequences of discrete random variables)
• Data Processing Theorem
– You can’t create information from nothing
• Fano’s Inequality
– Lower bound for error in estimating X from Y
77
Markov Chains
If we have three random variables x, y, z:
    p(x,y,z) = p(z|x,y) p(y|x) p(x)
They form a Markov chain x→y→z if
    p(z|x,y) = p(z|y)   ⇔   p(x,y,z) = p(z|y) p(y|x) p(x)
A Markov chain x→y→z means that
  – the only way that x affects z is through the value of y
  – if you already know y, then observing x gives you no additional information about z, i.e. I(x;z|y) = 0, equivalently H(z|y) = H(z|x,y)
  – if you know y, then observing z gives you no additional information about x.
78
Data Processing
• Estimate z = f(y), where f is a function
• A special case of a Markov chain: x → y → f(y)
    Data x → [Channel] → y → [Estimator] → z
• Does processing of y increase the information that y contains about x?
79
Markov Chain Symmetry
If x→y→z:
    p(x,z|y) = p(x,y,z)/p(y) =(a) p(x,y) p(z|y) / p(y) = p(x|y) p(z|y)
    (a) uses p(z|x,y) = p(z|y)
Hence x and z are conditionally independent given y.
Also x→y→z iff z→y→x, since
    p(x|y,z) = p(x,y,z)/p(y,z) =(a) p(x,z|y) p(y)/p(y,z) = p(x|y) p(z|y) p(y)/p(y,z) = p(x|y)
The Markov chain property is symmetrical.
80
Data Processing Theorem
If x→y→z then I(x;y) ≥ I(x;z)
  – processing y cannot add new information about x
If x→y→z then I(x;y) ≥ I(x;y|z)
  – knowing z does not increase the amount y tells you about x
Proof: apply the chain rule in different ways:
    I(x;y,z) = I(x;y) + I(x;z|y) = I(x;z) + I(x;y|z)
  but I(x;z|y) = 0 (a), hence I(x;y) = I(x;z) + I(x;y|z),
  so I(x;y) ≥ I(x;z) and I(x;y) ≥ I(x;y|z).
(a) I(x;z|y) = 0 iff x and z are conditionally independent given y; Markov ⇒ p(x,z|y) = p(x|y) p(z|y)
81
So Why Processing?
• One can not create information by manipulating
the data
• But no information is lost if equality holds
• Sufficient statistic
– z contains all the information in y about x
– Preserves mutual information I(x ;y) = I(x ; z)
• The estimator should be designed in a way such
that it outputs sufficient statistics
• Can the estimation be arbitrarily accurate?
82
Fano's Inequality
If we estimate x from y (x → y → x̂), what is p_e = p(x̂ ≠ x)?
    H(x|y) ≤ H(p_e) + p_e log|X|
    p_e ≥ (H(x|y) − H(p_e)) / log|X| ≥(a) (H(x|y) − 1) / log|X|
    (a) the second form is weaker but easier to use
Proof: define a random variable e = 1 if x̂ ≠ x, e = 0 if x̂ = x. Then
    H(e,x|x̂) = H(x|x̂) + H(e|x,x̂) = H(e|x̂) + H(x|e,x̂)     (chain rule)
  H(e|x,x̂) = 0 and H(e|x̂) ≤ H(e) = H(p_e), so
    H(x|x̂) ≤ H(e) + H(x|e,x̂)
           = H(e) + H(x|x̂, e=0)(1−p_e) + H(x|x̂, e=1) p_e
           ≤ H(p_e) + 0·(1−p_e) + p_e log|X|
Finally H(x|y) ≤ H(x|x̂), since I(x;x̂) ≤ I(x;y) (Markov chain / data processing).
83
Implications
• Zero probability of error:  p_e = 0 ⇒ H(x|y) = 0
• A low probability of error is possible only if H(x|y) is small
• If H(x|y) is large then the probability of error is high
• The bound can be slightly strengthened to
    H(x|y) ≤ H(p_e) + p_e log(|X| − 1)
• Fano's inequality is used whenever you need to show that errors are inevitable
  – e.g., converse to the channel coding theorem
84
Fano Example
X = {1:5}, p_x = [0.35, 0.35, 0.1, 0.1, 0.1]ᵀ
Y = {1:2}: if x ≤ 2 then y = x with probability 6/7, while if x > 2 then y = 1 or 2 with equal probability.
Our best strategy is to guess x̂ = y   (x → y → x̂)
  – p_{x|y=1} = [0.6, 0.1, 0.1, 0.1, 0.1]ᵀ
  – actual error prob: p_e = 0.4
Fano bound (strengthened form):
    p_e ≥ (H(x|y) − 1) / log(|X| − 1) = (1.771 − 1) / log(4) = 0.3855    (exercise)
Main use: to show when error-free transmission is impossible, since p_e > 0.
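The quantities in this example are quick to check numerically. A minimal sketch (not from the notes), exploiting the fact that p(x|y=1) and p(x|y=2) have the same entropy by symmetry:

```python
import numpy as np

def H(p):
    """Entropy in bits of a probability vector."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

px_given_y = np.array([0.6, 0.1, 0.1, 0.1, 0.1])   # p(x | y=1); same entropy for y=2
H_x_given_y = H(px_given_y)                        # ~1.771 bits
pe_actual = 1 - 0.6                                # guessing x_hat = y
pe_fano = (H_x_given_y - 1) / np.log2(5 - 1)       # strengthened Fano lower bound
print(H_x_given_y, pe_actual, pe_fano)             # 1.771, 0.4, ~0.386
```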
85
Summary
• Markov: x y z p( z | x, y ) p( z | y ) I (x ; z | y ) 0
• Data Processing Theorem: if xyz then
– I(x ; y) I(x ; z), I(y ; z) I(x ; z)
can be false if not Markov
– I(x ; y) I(x ; y | z)
– Long Markov chains: If x1 x2 x3 x4 x5 x6,
then Mutual Information increases as you get closer
together:
• e.g. I(x3, x4) I(x2, x4) I(x1, x5) I(x1, x6)
• Fano’s Inequality: if x y xˆ then
pe
H ( x | y ) H ( pe ) H ( x | y ) 1 H ( x | y ) 1
log| X | 1
log| X | 1
log | X |
weaker but easier to use since independent of pe
86
Lecture 7
• Law of Large Numbers
  – The sample mean is close to the expected value
• Asymptotic Equipartition Principle (AEP)
  – −log p(x₁,x₂,…,x_n)/n is close to the entropy H
• The Typical Set
  – Probability of each sequence close to 2^{−nH}
  – Size (≈ 2^{nH}) and total probability (≈ 1)
• The Atypical Set
  – Unimportant and can be ignored
87
Typicality: Example
X = {a, b, c, d}, p = [0.5 0.25 0.125 0.125]
−log p = [1 2 3 3],  H(p) = 1.75 bits
Sample eight i.i.d. values:
• typical (correct proportions):  adbabaac   −log p(x) = 14 = 8×1.75 = nH(x)
• not typical (−log p(x) ≠ nH(x)):  dddddddd   −log p(x) = 24
88
Convergence of Random Variables
• Convergence
    x_n → y  ⇔  ∀ε>0, ∃m such that ∀n>m, |x_n − y| < ε
  Example:  x_n = 2^{−n}, y = 0;  choose m = −log ε
• Convergence in probability (weaker than convergence)
    x_n →(prob) y  ⇔  ∀ε>0, P(|x_n − y| > ε) → 0
  Example:  x_n ∈ {0; 1} with p = [1 − n⁻¹; n⁻¹]
    for any small ε,  p(|x_n| < ε) = 1 − n⁻¹ → 1
    so x_n →(prob) 0 (but x_n does not converge to 0)
Note: y can be a constant or another random variable.
89
Law of Large Numbers
Given i.i.d. {x_i}, the sample mean is s_n = n⁻¹ Σ_{i=1}^n x_i
  – E s_n = E x = μ,   Var s_n = n⁻¹ Var x = n⁻¹ σ²
As n increases, Var s_n gets smaller and the values become clustered around the mean.
LLN:  s_n →(prob) μ,  i.e.  ∀ε>0, P(|s_n − μ| > ε) → 0
The expected value of a random variable is equal to the long-term average when sampling repeatedly.
90
Asymptotic Equipartition Principle
• x is the i.i.d. sequence {x_i} for 1 ≤ i ≤ n
  – Prob of a particular sequence:  p(x) = Π_{i=1}^n p(x_i)
  – Average:  E[−log p(x)] = −n E[log p(x_i)] = nH(x)
• AEP:   −n⁻¹ log p(x) →(prob) H(x)
• Proof:
    −n⁻¹ log p(x) = −n⁻¹ Σ_{i=1}^n log p(x_i) →(prob) E[−log p(x_i)] = H(x)
  by the law of large numbers.
91
Typical Set
Typical set (for finite n):
    T_ε^{(n)} = { x ∈ X^n : |−n⁻¹ log p(x) − H(x)| ≤ ε }
Example:
  – x_i Bernoulli with p(x_i = 1) = p
  – e.g. p([0 1 1 0 0 0]) = p²(1−p)⁴
  – For p = 0.2, H(X) = 0.72 bits
  [Histograms of −n⁻¹ log p(x) for N = 1, 2, 4, …, 128 with ε = 0.1; the red bar marks T_0.1^{(n)}, whose probability p_T grows: 0%, 0%, 41%, 29%, 45%, 62%, 72%, 85%]
92
Typical Set Frames
[Separate histogram frames of −N⁻¹ log p(x) for N = 1, 2, 4, 8, 16, 32, 64, 128 (p = 0.2, ε = 0.1), each annotated with example sequences; the typical-set probability p_T grows with N: 0%, 0%, 41%, 29%, 45%, 62%, 72%, 85%]
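The p_T values quoted in these frames can be reproduced by summing binomial probabilities over the typical set. A minimal Python sketch (not part of the notes); a tiny tolerance is added because some sequences sit exactly on the ±ε boundary:

```python
import numpy as np
from math import comb

def typical_set_prob(n, p=0.2, eps=0.1):
    """P(x in T_eps^(n)) for an i.i.d. Bernoulli(p) sequence of length n."""
    H = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    total = 0.0
    for k in range(n + 1):                     # k = number of ones in the sequence
        per_symbol = -(k * np.log2(p) + (n - k) * np.log2(1 - p)) / n
        if abs(per_symbol - H) <= eps + 1e-9:  # tolerance for boundary cases
            total += comb(n, k) * p**k * (1 - p)**(n - k)
    return total

for n in (1, 2, 4, 8, 16, 32, 64, 128):
    print(n, round(typical_set_prob(n), 2))    # 0, 0, 0.41, 0.29, 0.45, 0.62, 0.72, 0.85
```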
93
Typical Set: Properties
1. Individual prob:  x ∈ T_ε^{(n)}  ⇒  |log p(x) + nH(x)| ≤ nε
2. Total prob:       p(x ∈ T_ε^{(n)}) > 1 − ε   for n > N_ε
3. Size:             (1−ε) 2^{n(H(x)−ε)} < |T_ε^{(n)}| ≤ 2^{n(H(x)+ε)}   (lower bound for n > N_ε)
Proof 2:  −n⁻¹ log p(x) = −n⁻¹ Σ_{i=1}^n log p(x_i) →(prob) E[−log p(x_i)] = H(x)
  Hence ∀ε > 0, ∃N_ε s.t. ∀n > N_ε:  p( |−n⁻¹ log p(x) − H(x)| > ε ) < ε
Proof 3a:  1 ≥ Σ_{x∈T_ε^{(n)}} p(x) ≥ Σ_{x∈T_ε^{(n)}} 2^{−n(H(x)+ε)} = 2^{−n(H(x)+ε)} |T_ε^{(n)}|
Proof 3b:  for large enough n,  1 − ε < p(x ∈ T_ε^{(n)}) ≤ Σ_{x∈T_ε^{(n)}} 2^{−n(H(x)−ε)} = 2^{−n(H(x)−ε)} |T_ε^{(n)}|
94
Consequence
• For any ε and for n > N_ε:
  "Almost all events are almost equally surprising"
    p(x ∈ T_ε^{(n)}) > 1 − ε    and    |log p(x) + nH(x)| ≤ nε
• Coding consequence (|X|^n sequences in all, about 2^{n(H(x)+ε)} of them typical):
  – x ∈ T_ε^{(n)}: send '0' followed by at most 1 + n(H+ε) bits
  – x ∉ T_ε^{(n)}: send '1' followed by at most 1 + n log|X| bits
  – L = average code length
      ≤ p(x ∈ T_ε^{(n)}) [2 + n(H+ε)] + p(x ∉ T_ε^{(n)}) [2 + n log|X|]
      ≤ n(H + ε) + εn log|X| + 2 = n(H + ε')   where ε' = ε + ε log|X| + 2n⁻¹
95
Source Coding & Data Compression
For any choice of ε' > 0 we can, by choosing the block size n large enough, do the following:
• make a lossless code using only H(x) + ε' bits per symbol on average:  n⁻¹ L ≤ H + ε'
• The coding is one-to-one and decodable
  – however impractical due to exponential complexity
• Typical sequences have short descriptions of length ≈ nH
  – another proof of the source coding theorem (Shannon's original proof)
• However, encoding/decoding complexity is exponential in n
96
Smallest high-probability Set
T_ε^{(n)} is a small subset of X^n containing most of the probability mass. Can you get even smaller?
For any 0 < δ < 1, choose N₀ = −ε⁻¹ log δ. Then for any n > max(N₀, N_ε) and any subset S^{(n)} satisfying |S^{(n)}| ≤ 2^{n(H(x)−2ε)}:
    p(x ∈ S^{(n)}) = p(x ∈ S^{(n)} ∩ T_ε^{(n)}) + p(x ∈ S^{(n)} ∩ T_ε^{(n)c})
                   ≤ |S^{(n)}| max_{x∈T_ε^{(n)}} p(x) + p(x ∉ T_ε^{(n)})
                   ≤ 2^{n(H−2ε)} 2^{−n(H−ε)} + ε = 2^{−nε} + ε ≤ δ + ε
    (2^{−nε} ≤ δ for n > N₀;  p(x ∉ T_ε^{(n)}) < ε for n > N_ε)
Answer: No – any set with significant probability must contain close to 2^{nH} elements.
97
Summary
• Typical Set
  – Individual prob:  x ∈ T_ε^{(n)}  ⇒  |log p(x) + nH(x)| ≤ nε
  – Total prob:       p(x ∈ T_ε^{(n)}) > 1 − ε   for n > N_ε
  – Size:             (1−ε) 2^{n(H(x)−ε)} < |T_ε^{(n)}| ≤ 2^{n(H(x)+ε)}
• No other high-probability set can be much smaller than T_ε^{(n)}
• Asymptotic Equipartition Principle
  – Almost all event sequences are equally surprising
  – Can be used to prove the source coding theorem
98
Lecture 8
• Channel Coding
• Channel Capacity
– The highest rate in bits per channel use that can be
transmitted reliably
– The maximum mutual information
• Discrete Memoryless Channels
– Symmetric Channels
– Channel capacity
• Binary Symmetric Channel
• Binary Erasure Channel
• Asymmetric Channel
(the capacity results marked here are proved later by the channel coding theorem)
99
Model of Digital Communication
    In → Compress → Encode → Noisy Channel → Decode → Decompress → Out
    (Compress/Decompress: source coding;  Encode/Decode: channel coding)
• Source Coding
  – Compresses the data to remove redundancy
• Channel Coding
  – Adds redundancy/structure to protect against channel errors
100
Discrete Memoryless Channel
• Input x ∈ X, output y ∈ Y:    x → [Noisy Channel] → y
• Time-invariant transition-probability matrix
    Q_{y|x}[i,j] = p(y = y_j | x = x_i)
  – Hence p_y = Qᵀ_{y|x} p_x
  – Q: each row sums to 1; the average column sum is |X| |Y|⁻¹
• Memoryless:  p(y_n | x_{1:n}, y_{1:n−1}) = p(y_n | x_n)
• DMC = Discrete Memoryless Channel
101
Binary Channels
• Binary Symmetric Channel (BSC)
  – X = [0 1], Y = [0 1]
  – each input is flipped with probability f and received correctly with probability 1−f
• Binary Erasure Channel (BEC)
  – X = [0 1], Y = [0 ? 1]
  – each input is erased to "?" with probability f, otherwise received correctly (probability 1−f)
• Z Channel
  – X = [0 1], Y = [0 1]
  – input 0 is always received correctly; input 1 is received as 0 with probability f and as 1 with probability 1−f
Symmetric: rows are permutations of each other; columns are permutations of each other
Weakly symmetric: rows are permutations of each other; columns have the same sum
102
Weakly Symmetric Channels
Weakly symmetric:
1. All columns of Q have the same sum = |X| |Y|⁻¹
   – If x is uniform (i.e. p(x) = |X|⁻¹) then y is uniform:
       p(y) = Σ_{x∈X} p(y|x) p(x) = |X|⁻¹ Σ_{x∈X} p(y|x) = |X|⁻¹ |X| |Y|⁻¹ = |Y|⁻¹
2. All rows are permutations of each other
   – Each row of Q has the same entropy, so
       H(y|x) = Σ_{x∈X} p(x) H(y|x=x) = Σ_{x∈X} p(x) H(Q_{1,:}) = H(Q_{1,:})
     where H(Q_{1,:}) is the entropy of the first (or any other) row of the Q matrix
Symmetric:
1. All rows are permutations of each other
2. All columns are permutations of each other
Symmetric ⇒ weakly symmetric
103
Channel Capacity
• Capacity of a DMC channel:
    C = max_{p_x} I(x;y)
  – Mutual information (not entropy itself) is what can be transmitted through the channel
  – The maximum is over all possible input distributions p_x
  – There is only one maximum, since I(x;y) is concave in p_x for fixed p_{y|x}
  – We want to find the p_x that maximizes I(x;y)
  – Limits on C:  0 ≤ C ≤ min(H(x), H(y)) ≤ min(log|X|, log|Y|)
• Capacity for n uses of the channel:
    C^{(n)} = n⁻¹ max_{p_{x_{1:n}}} I(x_{1:n}; y_{1:n})
  (shown two pages later to equal C for a DMC)
104
Mutual Information Plot
Binary symmetric channel with error probability f and Bernoulli(p) input:
    I(x;y) = H(y) − H(y|x) = H(f + p − 2pf) − H(f)
[Surface plot of I(x;y) versus input Bernoulli prob p and channel error prob f:
 for each fixed f, I is concave in p with its maximum at p = 1/2]
105
Mutual Information Concave in pX
Mutual information I(x;y) is concave in p_x for fixed p_{y|x}.
Proof: let u and v have probability mass vectors p_u and p_v
  – Define z: a Bernoulli random variable with p(z=1) = λ
  – Let x = u if z=1 and x = v if z=0, so p_x = λ p_u + (1−λ) p_v
      I(x,z; y) = I(x;y) + I(z;y|x) = I(z;y) + I(x;y|z)
  but I(z;y|x) = H(y|x) − H(y|x,z) = 0, so
      I(x;y) ≥ I(x;y|z)
             = λ I(x;y|z=1) + (1−λ) I(x;y|z=0)
             = λ I(u;y) + (1−λ) I(v;y)
Special case: y = x  ⇒  I(x;x) = H(x) is concave in p_x
106
Mutual Information Convex in pY|X
Mutual information I(x;y) is convex in p_{y|x} for fixed p_x.
Proof: define u, v, z and x as before, but now mix the channels:
  – p_{y|x} = λ p_{u|x} + (1−λ) p_{v|x}
      I(x; y,z) = I(x;y|z) + I(x;z) = I(x;y) + I(x;z|y)
  but I(x;z) = 0 and I(x;z|y) ≥ 0, so
      I(x;y) ≤ I(x;y|z) = λ I(x;y|z=1) + (1−λ) I(x;y|z=0) = λ I(x;u) + (1−λ) I(x;v)
107
n-use Channel Capacity
For a Discrete Memoryless Channel:
    I(x_{1:n}; y_{1:n}) = H(y_{1:n}) − H(y_{1:n} | x_{1:n})
                        = Σ_{i=1}^n H(y_i | y_{1:i−1}) − Σ_{i=1}^n H(y_i | x_i)    (chain rule; memoryless)
                        ≤ Σ_{i=1}^n H(y_i) − Σ_{i=1}^n H(y_i | x_i)                (conditioning reduces entropy)
                        = Σ_{i=1}^n I(x_i; y_i)
with equality if the y_i are independent ⇐ the x_i are independent.
We can maximize I(x_{1:n}; y_{1:n}) by maximizing each I(x_i; y_i) independently and taking the x_i to be i.i.d.
  – We will concentrate on maximizing I(x;y) for a single channel use
  – The elements of x_i are not necessarily i.i.d.
108
Capacity of Symmetric Channel
If the channel is weakly symmetric:
    I(x;y) = H(y) − H(y|x) = H(y) − H(Q_{1,:}) ≤ log|Y| − H(Q_{1,:})
with equality iff the input distribution is uniform.
Information capacity of a weakly symmetric channel:  C = log|Y| − H(Q_{1,:})
For a binary symmetric channel (BSC):
  – |Y| = 2,  H(Q_{1,:}) = H(f)
  – I(x;y) ≤ 1 − H(f)
Information capacity of a BSC:  C = 1 − H(f)
109
Binary Erasure Channel (BEC)
    I(x;y) = H(x) − H(x|y)
           = H(x) − [ p(y=0)·0 + p(y=?)·H(x) + p(y=1)·0 ]     (H(x|y) = 0 when y=0 or y=1)
           = H(x) − f H(x) = (1−f) H(x)
           ≤ 1 − f        (since the maximum of H(x) is 1, with equality when x is uniform)
    C = 1 − f
Since a fraction f of the bits are lost, the capacity is only 1−f, and this is achieved when x is uniform.
110
Asymmetric Channel Capacity
Ternary channel: inputs 0 and 1 pass through a BSC with error probability f; input 2 is received noiselessly:
    Q = [ 1−f   f    0
          f     1−f  0
          0     0    1 ]
Let p_x = [a  a  1−2a]ᵀ. Then p_y = Qᵀ p_x = p_x and
    H(y) = −2a log a − (1−2a) log(1−2a)
    H(y|x) = 2a H(f) + (1−2a) H(1) = 2a H(f)
To find C, maximize I(x;y) = H(y) − H(y|x):
    I = −2a log a − (1−2a) log(1−2a) − 2a H(f)
    dI/da = −2 log e − 2 log a + 2 log e + 2 log(1−2a) − 2H(f) = 0     (note: d(log x)/dx = x⁻¹ log e)
    ⇒ log((1−2a)/a) = H(f)   ⇒   a = (2 + 2^{H(f)})⁻¹
    C = −2a log a − (1−2a) log(1−2a) − 2a H(f)
Examples:
    f = 0:  H(f) = 0, a = 1/3, C = log 3 = 1.585 bits/use
    f = ½:  H(f) = 1, a = ¼,  C = log 2 = 1 bit/use
111
Summary
• Given the channel, mutual information is concave in the input distribution
• Channel capacity:  C = max_{p_x} I(x;y)
  – The maximum exists and is unique
• DMC capacity
  – Weakly symmetric channel: log|Y| − H(Q_{1,:})
  – BSC: 1 − H(f)
  – BEC: 1 − f
  – In general it is very hard to obtain a closed form;
    use a numerical method based on convex optimization instead
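One such numerical method is the standard Blahut–Arimoto alternating maximisation (not covered explicitly in these notes). The following is a minimal Python sketch; it reproduces the BSC capacity 1−H(f) and the asymmetric ternary example above, and assumes every output symbol of Q is reachable.

```python
import numpy as np

def blahut_arimoto(Q, iters=500):
    """Capacity (bits/use) of a DMC with Q[i, j] = p(y=j | x=i), by Blahut-Arimoto."""
    def dkl_rows(Q, q):
        # row-wise relative entropy D(Q[i,:] || q), with 0 log 0 = 0
        return np.sum(Q * np.log2(np.where(Q > 0, Q / q, 1.0)), axis=1)

    p = np.full(Q.shape[0], 1.0 / Q.shape[0])   # start from a uniform input distribution
    for _ in range(iters):
        q = p @ Q                               # induced output distribution
        w = p * 2.0 ** dkl_rows(Q, q)           # multiplicative update of the input distribution
        p = w / w.sum()
    return float(p @ dkl_rows(Q, p @ Q)), p

f = 0.1
C_bsc, _ = blahut_arimoto(np.array([[1 - f, f], [f, 1 - f]]))
print(C_bsc)                                    # ~0.531 = 1 - H(0.1)

# Asymmetric ternary channel of the previous slide with f = 0.5
Q = np.array([[0.5, 0.5, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0]])
C_asym, p_opt = blahut_arimoto(Q)
print(C_asym, p_opt)                            # ~1.0 bit/use, p ~ [1/4, 1/4, 1/2]
```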
112
Lecture 9
• Jointly Typical Sets
• Joint AEP
• Channel Coding Theorem
– Ultimate limit on information transmission is
channel capacity
– The central and most successful story of
information theory
– Random Coding
– Jointly typical decoding
113
Intuition on the Ultimate Limit
• Consider blocks of n symbols:   x_{1:n} → [Noisy Channel] → y_{1:n}
  – For large n, an average input sequence x_{1:n} corresponds to about 2^{nH(y|x)} typical output sequences
  – There are a total of about 2^{nH(y)} typical output sequences
  – For nearly error-free transmission, we select a number of input sequences whose corresponding sets of output sequences hardly overlap
  – The maximum number of such distinct sets of output sequences is 2^{n(H(y)−H(y|x))} = 2^{nI(y;x)}
  – So one can send I(y;x) bits per channel use
⇒ for large n we can transmit at any rate < C with negligible errors
114
Jointly Typical Set
x, y is the i.i.d. sequence {x_i, y_i} for 1 ≤ i ≤ n
  – Prob of a particular sequence pair:  p(x,y) = Π_{i=1}^n p(x_i, y_i)
  – Average:  E[−log p(x,y)] = −n E[log p(x_i, y_i)] = nH(x,y)
  – Jointly typical set:
      J_ε^{(n)} = { x,y ∈ (X×Y)^n : |−n⁻¹ log p(x) − H(x)| ≤ ε,
                                     |−n⁻¹ log p(y) − H(y)| ≤ ε,
                                     |−n⁻¹ log p(x,y) − H(x,y)| ≤ ε }
115
Jointly Typical Example
Binary symmetric channel with f = 0.2 and p_x = [0.75 0.25]ᵀ:
    p_y = [0.65 0.35]ᵀ,    P_xy = [ 0.6   0.15
                                    0.05  0.2 ]
Jointly typical example (for any ε):
    x = 11111000000000000000
    y = 11110111000000000000
All combinations of x and y occur with exactly the right frequencies.
116
Jointly Typical Diagram
Each point defines both an x sequence and a y sequence.
[Diagram: the full rectangle of all 2^{n log(|X||Y|)} sequence pairs; an inner rectangle of pairs that are typical in x or y individually, with sides 2^{nH(x)} and 2^{nH(y)}; dots mark the ≈ 2^{nH(x,y)} jointly typical pairs]
• Dots represent jointly typical pairs (x,y)
• The inner rectangle represents pairs that are typical in x or y but not necessarily jointly typical
• There are about 2^{nH(x)} typical x's in all
• Each typical y is jointly typical with about 2^{nH(x|y)} of these typical x's
• The jointly typical pairs are a fraction 2^{−nI(x;y)} of the inner rectangle
Channel code: choose x's whose jointly typical y's don't overlap; use joint typicality for decoding.
There are about 2^{nI(x;y)} such codewords x.
117
Joint Typical Set Properties
1. Indiv prob:  x,y ∈ J_ε^{(n)}  ⇒  |log p(x,y) + nH(x,y)| ≤ nε
2. Total prob:  p(x,y ∈ J_ε^{(n)}) > 1 − ε   for n > N_ε
3. Size:        (1−ε) 2^{n(H(x,y)−ε)} < |J_ε^{(n)}| ≤ 2^{n(H(x,y)+ε)}   (lower bound for n > N_ε)
Proof 2 (weak law of large numbers): choose N₁ such that ∀n > N₁,
    p( |−n⁻¹ log p(x) − H(x)| > ε ) < ε/3
  Similarly choose N₂, N₃ for the other two conditions and set N_ε = max(N₁, N₂, N₃).
Proof 3:
    1 ≥ Σ_{x,y∈J_ε^{(n)}} p(x,y) ≥ |J_ε^{(n)}| min_{J_ε^{(n)}} p(x,y) ≥ |J_ε^{(n)}| 2^{−n(H(x,y)+ε)}
    1 − ε < Σ_{x,y∈J_ε^{(n)}} p(x,y) ≤ |J_ε^{(n)}| max_{J_ε^{(n)}} p(x,y) ≤ |J_ε^{(n)}| 2^{−n(H(x,y)−ε)}    for n > N_ε
118
Properties
4. If p_{x'} = p_x and p_{y'} = p_y with x' and y' independent:
    (1−ε) 2^{−n(I(x;y)+3ε)} ≤ p(x',y' ∈ J_ε^{(n)}) ≤ 2^{−n(I(x;y)−3ε)}    for n > N_ε
Proof: |J| × (min prob) ≤ total prob ≤ |J| × (max prob):
    p(x',y' ∈ J_ε^{(n)}) = Σ_{x',y'∈J_ε^{(n)}} p(x') p(y') ≤ |J_ε^{(n)}| max p(x') p(y')
                        ≤ 2^{n(H(x,y)+ε)} 2^{−n(H(x)−ε)} 2^{−n(H(y)−ε)} = 2^{−n(I(x;y)−3ε)}
    p(x',y' ∈ J_ε^{(n)}) ≥ |J_ε^{(n)}| min p(x') p(y')
                        ≥ (1−ε) 2^{n(H(x,y)−ε)} 2^{−n(H(x)+ε)} 2^{−n(H(y)+ε)} = (1−ε) 2^{−n(I(x;y)+3ε)}    for n > N_ε
119
Channel Coding
    w ∈ 1:M → [Encoder] → x_{1:n} → [Noisy Channel] → y_{1:n} → [Decoder g(y)] → ŵ ∈ 1:M
• Assume a Discrete Memoryless Channel with known Q_{y|x}
• An (M, n) code is
  – a fixed set of M codewords x(w) ∈ X^n for w = 1:M
  – a deterministic decoder g(y) ∈ 1:M
• The rate of an (M, n) code:  R = (log M)/n bits/transmission
• Error probability:  λ_w = p(g(y) ≠ w | w) = Σ_{y∈Y^n} p(y | x(w)) [g(y) ≠ w]
  – Maximum error probability:  λ^{(n)} = max_{1≤w≤M} λ_w
  – Average error probability:  P_e^{(n)} = M⁻¹ Σ_{w=1}^M λ_w
  ([C] = 1 if C is true or 0 if it is false)
120
Shannon’s ideas
• Channel coding theorem: the basic theorem of
information theory
– Proved in his original 1948 paper
• How do you correct all errors?
• Shannon’s ideas
– Allowing arbitrarily small but nonzero error probability
– Using the channel many times in succession so that
AEP holds
– Consider a randomly chosen code and show the
expected average error probability is small
• Use the idea of typical sequences
• Show this means at least one code with small max error
prob
• Sadly it doesn’t tell you how to construct the code
121
Channel Coding Theorem
• A rate R is achievable if R < C and not achievable if R > C
  – If R < C, there exists a sequence of (2^{nR}, n) codes with maximum probability of error λ^{(n)} → 0 as n → ∞
  – Any sequence of (2^{nR}, n) codes with maximum probability of error λ^{(n)} → 0 as n → ∞ must have R ≤ C
A very counterintuitive result: despite channel errors you can get arbitrarily low bit error rates, provided that R < C.
[Plot of minimum bit-error probability versus rate R/C: zero for R/C < 1, rising above it; the regions are labelled "Achievable" and "Impossible"]
122
Summary
• Jointly typical set:
    |log p(x,y) + nH(x,y)| ≤ nε
    p(x,y ∈ J_ε^{(n)}) > 1 − ε
    (1−ε) 2^{n(H(x,y)−ε)} < |J_ε^{(n)}| ≤ 2^{n(H(x,y)+ε)}
    (1−ε) 2^{−n(I(x;y)+3ε)} ≤ p(x',y' ∈ J_ε^{(n)}) ≤ 2^{−n(I(x;y)−3ε)}
• This is the machinery used to prove the channel coding theorem
123
Lecture 10
• Channel Coding Theorem
  – Proof
    • Using joint typicality
    • Arguably the simplest of many possible approaches
    • Limitation: does not reveal the error exponent, P_e ~ e^{−nE(R)}
  – Converse (next lecture)
124
Channel Coding Principle
• Consider blocks of n symbols:   x_{1:n} → [Noisy Channel] → y_{1:n}
  – An average input sequence x_{1:n} corresponds to about 2^{nH(y|x)} typical output sequences (out of about 2^{nH(y)} in total)
  – Random codes: choose M = 2^{nR} (R ≤ I(x;y)) random codewords x(w)
    • their typical output sequences are unlikely to overlap much
  – Joint typical decoding: a received vector y is very likely to be in the typical output set of the transmitted x(w) and of no others. Decode as this w.
Channel Coding Theorem: for large n, we can transmit at any rate R < C with negligible errors.
125
Random (2^{nR}, n) Code
• Choose the error probability ε; joint typicality gives N_ε; choose n > N_ε
• Choose p_x so that I(x;y) = C, the information capacity
• Use p_x to choose a code C with random codewords x(w) ∈ X^n, w = 1:2^{nR}
  – the receiver knows this code and also the transition matrix Q
• Assume the message w ∈ 1:2^{nR} is uniformly distributed
• If the received value is y, decode by seeing how many x(w)'s are jointly typical with y
  – if x(k) is the only one, then k is the decoded message
  – if there are 0 or ≥ 2 possible k's, then declare an error (message 0)
• We calculate the error probability averaged over all codes C and all w:
    p(E) = Σ_C p(C) 2^{−nR} Σ_{w=1}^{2^{nR}} λ_w(C) = 2^{−nR} Σ_{w=1}^{2^{nR}} Σ_C p(C) λ_w(C) =(a) Σ_C p(C) λ_1(C) = p(E | w=1)
    (a) since the error averaged over all possible codes is independent of w
126
Decoding Errors
• Assume we transmit x(1) and receive y:   x_{1:n} → [Noisy Channel] → y_{1:n}
• Define the joint-typicality events  e_w = { (x(w), y) ∈ J_ε^{(n)} }  for w = 1:2^{nR}
• Decode using joint typicality
• We have an error if either e₁ is false or e_w is true for some w ≥ 2
• The x(w) for w ≠ 1 are independent of x(1) and hence also independent of y. So, by the joint AEP,
    p(e_w true) ≤ 2^{−n(I(x;y)−3ε)}    for any w ≠ 1
127
Error Probability for Random Code
• Upper bound (union bound: p(A∪B) ≤ p(A) + p(B)):
    p(E) = p(E | w=1) = p(ē₁ ∪ e₂ ∪ e₃ ∪ … ∪ e_{2^{nR}}) ≤ p(ē₁) + Σ_{w=2}^{2^{nR}} p(e_w)
         ≤ ε + 2^{nR} 2^{−n(I(x;y)−3ε)}          (1: joint typicality; 2: joint AEP)
         = ε + 2^{−n(C−R−3ε)}                    (we have chosen p(x) such that I(x;y) = C)
         ≤ 2ε   for R < C − 3ε and n > (−log ε)/(C − R − 3ε)
• Since the average of p(E) over all codes is ≤ 2ε, there must be at least one code for which this is true:
  this code has 2^{−nR} Σ_w λ_w ≤ 2ε
• Now throw away the worst half of the codewords; the remaining ones must all have λ_w ≤ 4ε.
  The resultant code has rate R − n⁻¹ ≈ R.
  (both steps are proved on the next page)
128
Code Selection & Expurgation
• Since the average of p(E) over all codes is ≤ 2ε, there must be at least one code for which this is true.
  Proof:  2ε ≥ K⁻¹ Σ_{i=1}^K P_{e,i}^{(n)} ≥ K⁻¹ Σ_{i=1}^K min_i P_{e,i}^{(n)} = min_i P_{e,i}^{(n)}      (K = number of codes)
• Expurgation: throw away the worst half of the codewords; the remaining ones must all have λ_w ≤ 4ε.
  Proof: assume the λ_w are in descending order; then
      2ε ≥ M⁻¹ Σ_{w=1}^M λ_w ≥ M⁻¹ Σ_{w=1}^{½M} λ_w ≥ M⁻¹ (½M) λ_{½M} = ½ λ_{½M}
  so the retained codewords (w > ½M) all have λ_w ≤ λ_{½M} ≤ 4ε.
  M/2 = ½ 2^{nR} messages in n channel uses  ⇒  rate = n⁻¹ log(M/2) = R − n⁻¹
129
Summary of Procedure
• For any R < C − 3ε, set n > max(N_ε, (−log ε)/(C − R − 3ε))           see (a), (b), (c) below
• Find the optimum p_x so that I(x;y) = C
• (a) Choose codewords randomly (using p_x) to construct codes with 2^{nR} codewords, and use joint typicality as the decoder
• (b) Since the average of p(E) over all codes is ≤ 2ε, there must be at least one code for which this is true
• (c) Throw away the worst half of the codewords. Now the worst codeword has an error probability ≤ 4ε, with rate = R − n⁻¹ > R − ε
• The resultant code transmits at a rate as close to C as desired with an error probability that can be made as small as desired (but n may be unnecessarily large)
Note: ε determines both the error probability and the closeness to capacity.
130
Remarks
• Random coding is a powerful method of proof,
not a method of signaling
• Picking randomly will give a good code
• But n has to be large (AEP)
• Without a structure, it is difficult to
encode/decode
– Table lookup requires exponential size
• Channel coding theorem does not provide a
practical coding scheme
• Folk theorem (but outdated now):
– Almost all codes are good, except those we can think
of
131
Lecture 11
• Converse of Channel Coding Theorem
– Cannot achieve R>C
• Capacity with feedback
– No gain for DMC but simpler encoding/
decoding
• Joint Source-Channel Coding
– No point for a DMC
132
Converse of Coding Theorem
• Fano's inequality: if P_e^{(n)} is the error probability when estimating w from y,
    H(w|y) ≤ 1 + P_e^{(n)} log|W| = 1 + nR P_e^{(n)}
• Hence
    nR = H(w) = H(w|y) + I(w;y)                       (definition of I; w equiprobable)
       ≤ H(w|y) + I(x(w); y)                          (Markov: w → x → y → ŵ; data processing)
       ≤ 1 + nR P_e^{(n)} + I(x;y)                    (Fano)
       ≤ 1 + nR P_e^{(n)} + nC                        (n-use DMC capacity)
    ⇒ P_e^{(n)} ≥ 1 − C/R − (nR)⁻¹ → 1 − C/R > 0  if R > C
• For large (hence for all) n, P_e^{(n)} has a lower bound of ≈ (R−C)/R if w is equiprobable
  – If it were achievable for small n, it could be achieved also for large n by concatenation.
133
Minimum Bit-Error Rate
    w_{1:nR} → [Encoder] → x_{1:n} → [Noisy Channel] → y_{1:n} → [Decoder] → ŵ_{1:nR}
Suppose
  – w_{1:nR} are i.i.d. bits with H(w_i) = 1
  – the bit-error rate is P_b = E_i p(w_i ≠ ŵ_i) = E_i p(e_i), where e_i = (w_i ≠ ŵ_i)
Then
    nC ≥(a) I(x_{1:n}; y_{1:n}) ≥(b) I(w_{1:nR}; ŵ_{1:nR}) = H(w_{1:nR}) − H(w_{1:nR} | ŵ_{1:nR})
       = nR − Σ_{i=1}^{nR} H(w_i | ŵ_{1:nR}, w_{1:i−1}) ≥(c) nR − Σ_{i=1}^{nR} H(w_i | ŵ_i) = nR (1 − E_i H(w_i | ŵ_i))
       =(d) nR (1 − E_i H(e_i | ŵ_i)) ≥(c) nR (1 − E_i H(e_i)) ≥(e) nR (1 − H(E_i P(e_i))) = nR (1 − H(P_b))
    (a) n-use capacity   (b) data processing theorem   (c) conditioning reduces entropy
    (d) e_i = w_i ⊕ ŵ_i   (e) Jensen: E H(x) ≤ H(E x), since H is concave
Hence
    R ≤ C / (1 − H(P_b))     ⇔     P_b ≥ H⁻¹(1 − C/R)
[Plot of the minimum bit-error probability H⁻¹(1 − C/R) versus rate R/C]
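The bound P_b ≥ H⁻¹(1 − C/R) involves the inverse of the binary entropy function, which has no closed form but is easy to evaluate numerically. A minimal sketch (not from the notes) using bisection:

```python
import numpy as np

def Hb(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def Hb_inv(h):
    """Smaller root of Hb(p) = h, found by bisection on [0, 1/2]."""
    lo, hi = 0.0, 0.5
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if Hb(mid) < h else (lo, mid)
    return 0.5 * (lo + hi)

# Minimum achievable bit-error rate when transmitting at R/C times capacity
for ratio in (1.0, 1.5, 2.0, 3.0):
    pb = 0.0 if ratio <= 1 else Hb_inv(1 - 1 / ratio)
    print(ratio, round(pb, 4))      # 0, ~0.062, ~0.11, ~0.174
```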
134
Coding Theory and Practice
• Construction for good codes
– Ever since Shannon founded information theory
– Practical: Computation & memory nk for some k
• Repetition code: rate 0
• Block codes: encode a block at a time
– Hamming code: correct one error
– Reed-Solomon code, BCH code: multiple errors (1950s)
• Convolutional code: convolve bit stream with a filter
• Concatenated code: RS + convolutional
• Capacity-approaching codes:
– Turbo code: combination of two interleaved convolutional codes
(1993)
– Low-density parity-check (LDPC) code (1960)
– Dream has come true for some channels today
135
Channel with Feedback
    w ∈ 1:2^{nR} → [Encoder] → x_i → [Noisy Channel] → y_i → [Decoder] → ŵ
    (the encoder also sees the past outputs y_{1:i−1} via an error-free feedback link)
• Assume error-free feedback: does it increase capacity?
• A (2^{nR}, n) feedback code is
  – a sequence of mappings x_i = x_i(w, y_{1:i−1}) for i = 1:n
  – a decoding function ŵ = g(y_{1:n})
• A rate R is achievable if there exists a sequence of (2^{nR}, n) feedback codes such that P_e^{(n)} = P(ŵ ≠ w) → 0 as n → ∞
• The feedback capacity, C_FB ≥ C, is the supremum of the achievable rates
136
Feedback Doesn't Increase Capacity
    I(w;y) = H(y) − H(y|w)
           = H(y) − Σ_{i=1}^n H(y_i | y_{1:i−1}, w)
           = H(y) − Σ_{i=1}^n H(y_i | y_{1:i−1}, w, x_i)      (since x_i = x_i(w, y_{1:i−1}))
           = H(y) − Σ_{i=1}^n H(y_i | x_i)                    (since y_i directly depends only on x_i)
           ≤ Σ_{i=1}^n H(y_i) − Σ_{i=1}^n H(y_i | x_i)        (conditioning reduces entropy)
           = Σ_{i=1}^n I(x_i; y_i) ≤ nC                       (DMC)
Hence
    nR = H(w) = H(w|y) + I(w;y) ≤ 1 + nR P_e^{(n)} + nC       (Fano)
    ⇒ P_e^{(n)} ≥ 1 − C/R − (nR)⁻¹
Any rate > C is unachievable: the DMC does not benefit from feedback, C_FB = C.
137
Example: BEC with feedback
• Capacity is 1 – f
• Encoding algorithm
  – If yi = ?, tell the sender to retransmit bit i
  – Average number of transmissions per bit: 1 + f + f² + … = 1/(1 – f)
[BEC diagram: x = 0 → y = 0 w.p. 1–f, y = ? w.p. f;  x = 1 → y = 1 w.p. 1–f, y = ? w.p. f]
• Average number of successfully recovered bits per transmission = 1 – f
  – Capacity is achieved!
• Capacity unchanged but the encoding/decoding algorithm is much simpler.
138
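As a quick check of this scheme, here is my own simulation sketch (not part of the notes): it retransmits each bit over a BEC with erasure probability f until it gets through, and compares the measured throughput with the capacity 1 – f.

```python
# Hedged sketch (my own): simulate retransmission over a BEC with erasure prob f.
import numpy as np

def bec_feedback_throughput(f, n_bits=100_000, rng=np.random.default_rng(0)):
    """Return (avg transmissions per bit, bits recovered per transmission)."""
    transmissions = 0
    for _ in range(n_bits):
        # keep retransmitting this bit until it is not erased
        while True:
            transmissions += 1
            if rng.random() >= f:       # received correctly with prob 1 - f
                break
    return transmissions / n_bits, n_bits / transmissions

for f in [0.1, 0.3, 0.5]:
    avg_tx, throughput = bec_feedback_throughput(f)
    print(f"f={f}: transmissions/bit = {avg_tx:.3f} (theory {1/(1-f):.3f}), "
          f"throughput = {throughput:.3f} (capacity {1-f:.3f})")
```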
Joint Source-Channel Coding
w1:n → Encoder → x1:n → Noisy Channel → y1:n → Decoder → ŵ1:n
• Assume wi satisfies the AEP and |W| < ∞
  – Examples: i.i.d.; Markov; stationary ergodic
• Capacity of the DMC is C
  – if time-varying: C = lim_{n→∞} n^–1 max I(x1:n; y1:n)
• Joint Source-Channel Coding Theorem:
  ∃ codes with Pe(n) = P(ŵ1:n ≠ w1:n) → 0 as n → ∞ iff H(W) < C
  – errors arise for two reasons
    • Incorrect encoding of w
    • Incorrect decoding of y
⇐, ⇒ proved on the next pages
139
Source-Channel Proof (⇐)
• Achievability is proved by using two-stage encoding
  – Source coding
  – Channel coding
• For n > Nε there are only 2^{n(H(W)+ε)} w's in the typical set: encode these using n(H(W)+ε) bits
  – encoder error < ε
• Transmit with error prob less than ε so long as H(W) + ε < C
• Total error prob < 2ε
140
Source-Channel Proof (⇒)
w1:n → Encoder → x1:n → Noisy Channel → y1:n → Decoder → ŵ1:n
Fano's Inequality: H(w | ŵ) ≤ 1 + Pe(n) n log|W|
H(W) ← n^–1 H(w1:n)                                              [entropy rate of stationary process]
     = n^–1 H(w1:n | ŵ1:n) + n^–1 I(w1:n; ŵ1:n)                  [definition of I]
     ≤ n^–1 (1 + Pe(n) n log|W|) + n^–1 I(x1:n; y1:n)            [Fano + Data Proc Inequ]
     ≤ n^–1 + Pe(n) log|W| + C                                   [memoryless channel]
Let n → ∞, Pe(n) → 0   ⇒   H(W) ≤ C
141
Separation Theorem
• Important result: source coding and channel coding might as well be done separately — the same capacity is achieved
  – Joint design is more difficult
• Practical implication: for a DMC we can design the
source encoder and the channel coder separately
– Source coding: efficient compression
– Channel coding: powerful error-correction codes
• Not necessarily true for
– Correlated channels
– Multiuser channels
• Joint source-channel coding: still an area of research
– Redundancy in human languages helps in a noisy environment
142
Summary
• Converse to channel coding theorem
– Proved using Fano’s inequality
– Capacity is a clear dividing point:
  • If R < C, error prob → 0
  • Otherwise, error prob → 1
• Feedback doesn't increase the capacity of a DMC
  – May increase the capacity of channels with memory (e.g.,
  ARQ in TCP/IP)
• Source-channel separation theorem for DMC and
stationary sources
143
Lecture 12
• Continuous Random Variables
• Differential Entropy
– can be negative
– not really a measure of the information in x
– coordinate-dependent
• Maximum entropy distributions
– Uniform over a finite range
– Gaussian for a given variance
144
Continuous Random Variables
Changing Variables
• pdf: fx(x)     CDF: Fx(x) = ∫_{–∞}^{x} fx(t) dt
• For g(x) monotonic: y = g(x) ⇔ x = g^–1(y)
  Fy(y) = Fx(g^–1(y))  or  1 – Fx(g^–1(y))   according to the slope of g(x)
  fy(y) = dFy(y)/dy = fx(g^–1(y)) |dg^–1(y)/dy| = fx(x) |dx/dy|,  where x = g^–1(y)
• Examples:
  Suppose fx(x) = 0.5 for x ∈ (0, 2), so Fx(x) = 0.5x
  (a) y = 4x ⇒ x = 0.25y:   fy(y) = 0.5 × 0.25 = 0.125  for y ∈ (0, 8)
  (b) z = x⁴ ⇒ x = z^¼:     fz(z) = 0.5 × ¼ z^–¾ = 0.125 z^–¾  for z ∈ (0, 16)
145
Joint Distributions
Joint pdf: fx,y(x, y)
Marginal pdf: fx(x) = ∫ fx,y(x, y) dy
Independence: fx,y(x, y) = fx(x) fy(y)
Conditional pdf: fx|y(x) = fx,y(x, y) / fy(y)
Example:
  fx,y = 1 for y ∈ (0, 1), x ∈ (y, y+1)
  fx|y = 1 for x ∈ (y, y+1)
  fy|x = 1 / min(x, 2–x)  for y ∈ (max(0, x–1), min(x, 1))
[Sketches of the marginals fx and fy]
146
Entropy of Continuous R.V.
• Given a continuous pdf f(x), we divide the range of x into bins of width Δ
  – For each i, ∃ xi with f(xi)Δ = ∫_{iΔ}^{(i+1)Δ} f(x) dx        [mean value theorem]
• Define a discrete random variable Y
  – Y = {xi} and pY = {f(xi)Δ}
  – Scaled, quantised version of f(x) with slightly unevenly spaced xi
• H(y) = –Σ f(xi)Δ log(f(xi)Δ) = –Σ f(xi)Δ log f(xi) – log Δ
       → –∫ f(x) log f(x) dx – log Δ = h(x) – log Δ   as Δ → 0
• Differential entropy: h(x) = –∫ fx(x) log fx(x) dx
  – Similar to the entropy of a discrete r.v., but there are differences
147
Differential Entropy
Differential Entropy: h(x) = –∫ fx(x) log fx(x) dx = –E log fx(x)
Bad News:
– h(x) does not give the amount of information in x
– h(x) is not necessarily positive
– h(x) changes with a change of coordinate system
Good News:
– h1(x) – h2(x) does compare the uncertainty of two continuous
random variables provided they are quantised to the same
precision
– Relative Entropy and Mutual Information still work fine
– If the range of x is normalized to 1 and then x is quantised to n
bits, the entropy of the resultant discrete random variable is
approximately h(x)+n
148
Differential Entropy Examples
• Uniform Distribution: x ~ U(a, b)
  – f(x) = (b–a)^–1 for x ∈ (a, b) and f(x) = 0 elsewhere
  – h(x) = –∫_a^b (b–a)^–1 log(b–a)^–1 dx = log(b–a)
  – Note that h(x) < 0 if (b–a) < 1
• Gaussian Distribution: x ~ N(μ, σ²)
  – f(x) = (2πσ²)^–½ exp(–(x–μ)²/(2σ²))
  – h(x) = –(log e) ∫ f(x) ln f(x) dx                              [log y = (log e) ln y]
         = (log e) ∫ f(x) [½ ln(2πσ²) + (x–μ)²/(2σ²)] dx
         = (log e) [½ ln(2πσ²) + E(x–μ)²/(2σ²)]
         = ½ (log e) [ln(2πσ²) + 1] = ½ log(2πeσ²) ≈ log(4.1σ) bits
149
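The two closed forms are easy to sanity-check numerically. The sketch below is my own (not from the notes) and estimates h(x) = –E log₂ f(x) by Monte Carlo for both examples:

```python
# Hedged sketch (my own): check h(U(a,b)) = log2(b-a) and h(N(mu,sigma^2)) = 0.5*log2(2*pi*e*sigma^2).
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Uniform on (a, b) with b - a < 1, so the differential entropy is negative
a, b = 0.0, 0.5
h_unif_mc = -np.mean(np.log2(np.full(n, 1.0 / (b - a))))      # f(x) is constant on (a,b)
print(f"Uniform : MC {h_unif_mc:.4f}   closed form {np.log2(b - a):.4f}")

# Gaussian N(0, sigma^2)
sigma = 2.0
x = rng.normal(0.0, sigma, n)
f = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
h_gauss_mc = -np.mean(np.log2(f))
h_gauss_th = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)
print(f"Gaussian: MC {h_gauss_mc:.4f}   closed form {h_gauss_th:.4f}   ~ log2(4.13*sigma) = {np.log2(4.1327 * sigma):.4f}")
```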
Multivariate Gaussian
Given mean m and symmetric positive definite covariance matrix K,
  x1:n ~ N(m, K):   f(x) = |2πK|^–½ exp(–½ (x – m)ᵀ K^–1 (x – m))
h(f) = –(log e) ∫ f(x) [–½ (x – m)ᵀ K^–1 (x – m) – ½ ln|2πK|] dx
     = ½ (log e) [ln|2πK| + E (x – m)ᵀ K^–1 (x – m)]
     = ½ (log e) [ln|2πK| + E tr((x – m)(x – m)ᵀ K^–1)]          [tr(AB) = tr(BA)]
     = ½ (log e) [ln|2πK| + tr(K K^–1)]                          [E (x – m)(x – m)ᵀ = K]
     = ½ (log e) [ln|2πK| + n]                                   [tr(I) = n = ln(eⁿ)]
     = ½ log|2πK| + ½ log eⁿ
     = ½ log|2πeK| = ½ log((2πe)ⁿ |K|) bits
150
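A short numerical sketch (my own, assuming the formula just derived) evaluates ½ log₂((2πe)ⁿ|K|) with a log-determinant and compares it with the independence bound Σᵢ h(xᵢ):

```python
# Sketch (my own): differential entropy of a multivariate Gaussian via slogdet.
import numpy as np

def gaussian_entropy(K):
    """h(x) in bits for x ~ N(m, K); the mean does not matter."""
    n = K.shape[0]
    sign, logdet = np.linalg.slogdet(K)
    assert sign > 0, "K must be positive definite"
    return 0.5 * (n * np.log2(2 * np.pi * np.e) + logdet / np.log(2))

K = np.array([[2.0, 1.0],
              [1.0, 2.0]])
h_joint = gaussian_entropy(K)
h_indep = sum(gaussian_entropy(np.array([[K[i, i]]])) for i in range(2))
print(f"h(x1,x2) = {h_joint:.4f} bits  <=  h(x1) + h(x2) = {h_indep:.4f} bits")
```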
Other Differential Quantities
Joint Differential Entropy
  h(x, y) = –∫∫ fx,y(x, y) log fx,y(x, y) dx dy = –E log fx,y(x, y)
Conditional Differential Entropy
  h(x | y) = –∫∫ fx,y(x, y) log fx|y(x | y) dx dy = h(x, y) – h(y)
Mutual Information
  I(x; y) = ∫∫ fx,y(x, y) log [fx,y(x, y) / (fx(x) fy(y))] dx dy = h(x) + h(y) – h(x, y)
Relative Differential Entropy of two pdf's:
  D(f || g) = ∫ f(x) log [f(x)/g(x)] dx = –hf(x) – Ef log g(x)
  (a) requires the support of f to be contained in the support of g
  (b) continuity ⇒ 0 log(0/0) = 0
151
Differential Entropy Properties
Chain Rules
  h(x, y) = h(x) + h(y | x) = h(y) + h(x | y)
  I(x, y; z) = I(x; z) + I(y; z | x)
Information Inequality: D(f || g) ≥ 0
Proof: Define S = {x : f(x) > 0}
  –D(f || g) = ∫_S f(x) log [g(x)/f(x)] dx = Ef log [g(x)/f(x)]
             ≤ log Ef [g(x)/f(x)]                       [Jensen + log() is concave]
             = log ∫_S g(x) dx ≤ log 1 = 0
all the same as for the discrete r.v. H()
152
Information Inequality Corollaries
Mutual Information ≥ 0
  I(x; y) = D(fx,y || fx fy) ≥ 0
Conditioning reduces Entropy
  h(x) – h(x | y) = I(x; y) ≥ 0
Independence Bound
  h(x1:n) = Σ_{i=1}^n h(xi | x1:i–1) ≤ Σ_{i=1}^n h(xi)
all the same as for H()
153
Change of Variable
Change Variable: y = g(x)
  fy(y) = fx(g^–1(y)) |dg^–1(y)/dy|                          [from earlier]
  h(y) = –E log fy(y) = –E log fx(g^–1(y)) – E log|dx/dy|
       = –E log fx(x) + E log|dy/dx| = h(x) + E log|dy/dx|
Examples:
  – Translation:  y = x + a  ⇒  dy/dx = 1  ⇒  h(y) = h(x)
  – Scaling:      y = cx     ⇒  dy/dx = c  ⇒  h(y) = h(x) + log|c|
  – Vector version: y1:n = A x1:n  ⇒  h(y) = h(x) + log|det(A)|
not the same as for H()
154
Concavity & Convexity
• Differential Entropy:
  – h(x) is a concave function of fx(x) ⇒ a maximum exists
• Mutual Information:
  – I(x; y) is a concave function of fx(x) for fixed fy|x(y)
  – I(x; y) is a convex function of fy|x(y) for fixed fx(x)
Proofs:
Exactly the same as for the discrete case: pz = [1–λ, λ]ᵀ
[Diagrams: mixing constructions with a binary switch z showing H(x) ≥ H(x|z), I(x; y) ≥ I(x; y|z), and I(x; y) ≤ I(x; y|z)]
155
Uniform Distribution Entropy
What distribution over the finite range (a,b) maximizes the
entropy ?
Answer: A uniform distribution u(x) = (b–a)^–1
Proof:
Suppose f(x) is a distribution for x ∈ (a, b)
  0 ≤ D(f || u) = –hf(x) – Ef log u(x) = –hf(x) + log(b–a)
  ⇒ hf(x) ≤ log(b–a)
156
Maximum Entropy Distribution
What zero-mean distribution maximizes the entropy on (–∞, ∞)ⁿ for a given covariance matrix K?
Answer: A multivariate Gaussian   φ(x) = |2πK|^–½ exp(–½ xᵀ K^–1 x)
Proof:
  0 ≤ D(f || φ) = –hf(x) – Ef log φ(x)
               = –hf(x) + (log e) [½ ln|2πK| + Ef ½ xᵀ K^–1 x]
               = –hf(x) + ½ (log e) [ln|2πK| + tr(Ef x xᵀ K^–1)]       [Ef x xᵀ = K]
               = –hf(x) + ½ (log e) [ln|2πK| + tr(I)]                  [tr(I) = n = ln(eⁿ)]
               = –hf(x) + ½ log|2πeK| = –hf(x) + hφ(x)
Since translation doesn't affect h(x), we can assume zero mean w.l.o.g.
157
Summary
• Differential Entropy: h(x) = –∫ fx(x) log fx(x) dx
  – Not necessarily positive
  – h(x+a) = h(x),   h(ax) = h(x) + log|a|
• Many properties are formally the same
  – h(x|y) ≤ h(x)
  – I(x; y) = h(x) + h(y) – h(x, y) ≥ 0,   D(f||g) = E log(f/g) ≥ 0
  – h(x) concave in fx(x); I(x; y) concave in fx(x)
• Bounds:
  – Finite range: uniform distribution has max: h(x) = log(b–a)
  – Fixed covariance: Gaussian has max: h(x) = ½ log((2πe)ⁿ|K|)
158
Lecture 13
• Discrete-Time Gaussian Channel Capacity
• Continuous Typical Set and AEP
• Gaussian Channel Coding Theorem
• Bandlimited Gaussian Channel
  – Shannon Capacity
159
Capacity of Gaussian Channel
Discrete-time channel: yi = xi + zi
  – Zero-mean Gaussian i.i.d. noise: zi ~ N(0, N)
  – Average power constraint: n^–1 Σ_{i=1}^n xi² ≤ P
  E y² = E(x + z)² = E x² + 2E(x)E(z) + E z² ≤ P + N              [x, z indep and E z = 0]
Information Capacity
  – Define the information capacity: C = max_{E x² ≤ P} I(x; y)
  I(x; y) = h(y) – h(y | x) = h(y) – h(x + z | x)
          =(a) h(y) – h(z | x) = h(y) – h(z)                      [(a) translation independence]
          ≤ ½ log 2πe(P + N) – ½ log 2πeN                         [Gaussian limit, with equality when x ~ N(0, P)]
          = ½ log(1 + P/N)
The optimal input is Gaussian
160
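A trivial numerical companion (my own sketch, not from the notes): compute C = ½ log₂(1 + P/N) and confirm it equals h(y) – h(z) when the input is Gaussian with power P:

```python
# Sketch (my own): AWGN capacity and the h(y) - h(z) form with a Gaussian input.
import numpy as np

def awgn_capacity(P, N):
    return 0.5 * np.log2(1 + P / N)

P, N = 4.0, 1.0
print(f"C = {awgn_capacity(P, N):.4f} bits/use")

# With x ~ N(0,P), y = x + z ~ N(0, P+N), so I(x;y) = h(y) - h(z):
h_y = 0.5 * np.log2(2 * np.pi * np.e * (P + N))
h_z = 0.5 * np.log2(2 * np.pi * np.e * N)
print(f"h(y) - h(z) = {h_y - h_z:.4f} bits/use")
```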
Achievability
w ∈ 1:2^nR → Encoder → x → (+z) → y → Decoder → ŵ ∈ 0:2^nR
• An (M, n) code for a Gaussian channel with power constraint is
  – A set of M codewords x(w) ∈ Xⁿ for w = 1:M with x(w)ᵀx(w) ≤ nP ∀w
  – A deterministic decoder g(y) ∈ 0:M, where 0 denotes failure
  – Errors: per codeword: λi;  max: λ(n) = max_i λi;  average: Pe(n)
• Rate R is achievable if ∃ a seq of (2^nR, n) codes with λ(n) → 0 as n → ∞
• Theorem: R achievable iff R < C = ½ log(1 + PN^–1)
⇐, ⇒ proved on the next pages
161
Argument by Sphere Packing
• Each transmitted x is received as a probabilistic cloud y
  – cloud 'radius' = √Var(y | x) = √(nN)                  [law of large numbers]
• The energy of y is constrained to n(P+N), so the clouds must fit into a hypersphere of radius √(n(P+N))
• Volume of a hypersphere ∝ rⁿ
• Max number of non-overlapping clouds:
  (n(P+N))^{½n} / (nN)^{½n} = 2^{½n log(1 + PN^–1)}
• Max achievable rate is ½ log(1 + P/N)
162
Continuous AEP
Typical Set: continuous distribution, discrete time, i.i.d.
For any ε > 0 and any n, the typical set with respect to f(x) is
  Tε(n) = {x ∈ Sⁿ : |–n^–1 log f(x) – h(x)| ≤ ε}
where
  S is the support of f, {x : f(x) > 0}
  f(x) = Π_{i=1}^n f(xi), since the xi are independent
  h(x) = –E log f(x) = –n^–1 E log f(x1:n)
Typical Set Properties
1. p(x ∈ Tε(n)) > 1 – ε for n > Nε                                   [Proof: LLN]
2. (1–ε) 2^{n(h(x)–ε)} ≤ Vol(Tε(n)) ≤ 2^{n(h(x)+ε)} for n > Nε,  where Vol(A) = ∫_A dx
                                                                     [Proof: integrate max/min prob]
163
Continuous AEP Proof
Proof 1: by the law of large numbers
  –n^–1 log f(x1:n) = –n^–1 Σ_{i=1}^n log f(xi) →(prob) –E log f(x) = h(x)
  Reminder: xn →(prob) y means ∀ε, δ > 0, ∃N such that ∀n > N, P(|xn – y| > ε) < δ
Proof 2a (upper bound on volume):
  1 ≥ ∫_{Tε(n)} f(x) dx ≥ ∫_{Tε(n)} 2^–n(h(x)+ε) dx = 2^–n(h(x)+ε) Vol(Tε(n))       [min f(x) within T]
Proof 2b (lower bound on volume):
  1 – ε < ∫_{Tε(n)} f(x) dx ≤ ∫_{Tε(n)} 2^–n(h(x)–ε) dx = 2^–n(h(x)–ε) Vol(Tε(n))   for n > Nε
                                                                    [Property 1; max f(x) within T]
164
Jointly Typical Set
Jointly Typical: (xi, yi) i.i.d. with fx,y(xi, yi)
  Jε(n) = {(x, y) : |–n^–1 log fx(x) – h(x)| ≤ ε,
                    |–n^–1 log fy(y) – h(y)| ≤ ε,
                    |–n^–1 log fx,y(x, y) – h(x, y)| ≤ ε}
Properties:
1. Indiv p.d.:   (x, y) ∈ Jε(n) ⇒ |log fx,y(x, y) + n h(x, y)| ≤ nε
2. Total Prob:   p((x, y) ∈ Jε(n)) > 1 – ε for n > Nε
3. Size:         (1–ε) 2^{n(h(x,y)–ε)} ≤ Vol(Jε(n)) ≤ 2^{n(h(x,y)+ε)} for n > Nε
4. Indep x′, y′: (1–ε) 2^{–n(I(x;y)+3ε)} ≤ p((x′, y′) ∈ Jε(n)) ≤ 2^{–n(I(x;y)–3ε)} for n > Nε
Proof of 4: integrate max/min of f(x′, y′) = f(x′)f(y′), then use the known bounds on Vol(J)
165
Gaussian Channel Coding Theorem
R is achievable iff R < C = ½ log(1 + PN^–1)
Proof (⇐):
Choose ε > 0
Random codebook: xw ∈ ℝⁿ for w = 1:2^nR, entries i.i.d. ~ N(0, P – ε)
Use joint typicality decoding
Errors: 1. Power too big:         p(xᵀx > nP) → 0 as n → ∞
        2. y not J.T. with x:      p((x, y) ∉ Jε(n)) ≤ ε for n > Nε
        3. another x J.T. with y:  Σ_{j=2}^{2^nR} p((xj, y1) ∈ Jε(n)) ≤ (2^nR – 1) 2^–n(I(x;y)–3ε)
                                   ≤ ε for large n if R < I(x; y) – 3ε
Total error: E P(ε(n)) ≤ 3ε for large n
Expurgation: remove the worst half of the codebook*: now the max error λ(n) ≤ 6ε
We have constructed a code achieving rate R – n^–1
*: the worst half includes every xi with xiᵀxi > nP
166
Gaussian Channel Coding Theorem
Proof (⇒): Assume Pe(n) → 0 and n^–1 x(w)ᵀx(w) ≤ P for each x(w)
nR = H(w) = I(w; y1:n) + H(w | y1:n)
   ≤ I(x1:n; y1:n) + H(w | y1:n)                                 [Data Proc Inequal]
   = h(y1:n) – h(y1:n | x1:n) + H(w | y1:n)
   ≤ Σ_{i=1}^n h(yi) – h(z1:n) + H(w | y1:n)                     [Indep Bound + Translation]
   = Σ_{i=1}^n I(xi; yi) + H(w | y1:n) ≤ Σ_{i=1}^n I(xi; yi) + 1 + nR Pe(n)    [z i.i.d. + Fano, |W| = 2^nR]
   ≤ n · ½ log(1 + PN^–1) + 1 + nR Pe(n)                         [max Information Capacity]
⇒ R ≤ ½ log(1 + PN^–1) + n^–1 + R Pe(n) → ½ log(1 + PN^–1)
167
Bandlimited Channel
• Channel bandlimited to f ∈ (–W, W) and signal duration T
  – Not exactly
  – Most energy in the bandwidth, most energy in the interval
• Nyquist: the signal is defined by 2WT samples
  – white noise with double-sided p.s.d. ½N0 becomes i.i.d. Gaussian N(0, ½N0) added to each coefficient
  – Signal power constraint P ⇒ signal energy ≤ PT
    • Energy constraint per coefficient: n^–1 xᵀx < PT/(2WT) = ½W^–1P
• Capacity:
  C = (2WT/T) · ½ log(1 + (½P/W)/(½N0)) = W log(1 + P/(WN0)) bits/second
• More precisely, the signal can be represented in a vector space of about n = 2WT dimensions with prolate spheroidal functions as an orthonormal basis
Compare the discrete-time version: ½ log(1 + PN^–1) bits per channel use
168
Limit of Infinite Bandwidth
C = W log(1 + P/(WN0)) bits/second
C → (P/N0) log e   as W → ∞
Minimum signal-to-noise ratio (SNR) per bit:
  Eb/N0 = P Tb/N0 = P/(C N0) ≥ ln 2 = –1.6 dB   (limit as W → ∞)
Given capacity, trade-off between P and W
  – Increase P, decrease W
  – Increase W, decrease P
    – spread spectrum
    – ultra wideband
169
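The wideband limit can be seen numerically. The sketch below is my own (not from the notes); it sweeps the bandwidth and reports Eb/N0 = P/(C·N0), which approaches ln 2 ≈ –1.6 dB:

```python
# Sketch (my own): bandlimited capacity and the -1.6 dB minimum Eb/N0 limit.
import numpy as np

def bandlimited_capacity(P, N0, W):
    return W * np.log2(1 + P / (W * N0))

P, N0 = 1.0, 1e-3
for W in [10, 100, 1000, 1e5]:
    C = bandlimited_capacity(P, N0, W)
    Eb_N0_dB = 10 * np.log10(P / (C * N0))     # Eb/N0 = P/(C*N0)
    print(f"W = {W:8.0f} Hz:  C = {C:10.1f} b/s,  Eb/N0 = {Eb_N0_dB:6.2f} dB")
print(f"limit: 10*log10(ln 2) = {10 * np.log10(np.log(2)):.2f} dB")
```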
Channel Code Performance
• Power Limited
  – High bandwidth
  – Spacecraft, pagers
  – Use QPSK/4-QAM
  – Block/convolutional codes
• Bandwidth Limited
  – Modems, DVB, mobile phones
  – 16-QAM to 256-QAM
  – Convolutional codes
• Value of 1 dB for space
  – Better range, lifetime, weight, bit rate
  – $80 M (1999)
Diagram from “An overview of Communications” by John R Barry
170
Summary
• Gaussian channel capacity
  C = ½ log(1 + P/N) bits/transmission
• Proved by using the continuous AEP
• Bandlimited channel
  C = W log(1 + P/(WN0)) bits/second
  – Minimum SNR Eb/N0 = –1.6 dB as W → ∞
171
Lecture 14
• Parallel Gaussian Channels
– Waterfilling
• Gaussian Channel with Feedback
– Memoryless: no gain
– Memory: at most ½ bit/transmission gain
172
Parallel Gaussian Channels
• n independent Gaussian channels: yi = xi + zi, with independent noise zi ~ N(0, Ni)
  – A model for a nonwhite-noise wideband channel where each component represents a different frequency
  – e.g. digital audio, digital TV, broadband ADSL, WiFi (multicarrier/OFDM)
• Average power constraint: E xᵀx ≤ nP
• Information Capacity: C = max_{f(x): E xᵀx ≤ nP} I(x; y)
• R < C ⇒ R achievable
  – proof as before
• What is the optimal f(x)?
173
Parallel Gaussian: Max Capacity
Need to find f(x):   C = max_{f(x): E xᵀx ≤ nP} I(x; y)
I(x; y) = h(y) – h(y | x) = h(y) – h(z | x)                      [translation invariance]
        = h(y) – h(z) = h(y) – Σ_{i=1}^n h(zi)                   [x, z indep; zi indep]
        ≤(a) Σ_{i=1}^n [h(yi) – h(zi)] ≤(b) Σ_{i=1}^n ½ log(1 + Pi Ni^–1)
(a) indep bound;  (b) capacity limit
Equality when: (a) yi indep ⇐ xi indep;  (b) xi ~ N(0, Pi)
We need to find the Pi that maximise Σ_{i=1}^n ½ log(1 + Pi Ni^–1)
174
Parallel Gaussian: Optimal Powers
We need to find the Pi that maximise Σ_{i=1}^n ½ log(1 + Pi Ni^–1) = Σ_{i=1}^n ½ (log e) ln(1 + Pi Ni^–1)
  – subject to the power constraint Σ_{i=1}^n Pi = nP
  – use a Lagrange multiplier:
    J = Σ_{i=1}^n ½ ln(1 + Pi Ni^–1) – λ Σ_{i=1}^n Pi
    ∂J/∂Pi = ½ (Pi + Ni)^–1 – λ = 0   ⇒   Pi + Ni = v (the same for every i)
  – Also Σ_{i=1}^n Pi = nP   ⇒   v = P + n^–1 Σ_{i=1}^n Ni
Water Filling: put most power into the least noisy channels, making power + noise equal in every channel
[Figure: water-filling with powers P1, P2, P3 poured on top of noise levels N1, N2, N3 up to the common level v]
175
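The following is my own water-filling sketch (not part of the notes): it bisects on the water level v so that Σᵢ max(v – Nᵢ, 0) equals the power budget, then reports the per-channel powers and the resulting capacity:

```python
# Sketch (my own) of water-filling over parallel Gaussian channels.
import numpy as np

def waterfill(noise, total_power):
    """Return powers P_i = max(v - N_i, 0) with sum(P_i) = total_power."""
    lo, hi = min(noise), max(noise) + total_power
    for _ in range(100):                       # bisection on the water level v
        v = 0.5 * (lo + hi)
        if np.sum(np.maximum(v - noise, 0)) > total_power:
            hi = v
        else:
            lo = v
    P = np.maximum(v - noise, 0)
    C = 0.5 * np.sum(np.log2(1 + P / noise))
    return P, v, C

noise = np.array([1.0, 2.0, 5.0])
P, v, C = waterfill(noise, total_power=3.0)
print("powers:", np.round(P, 3), " water level v =", round(v, 3), " C =", round(C, 3), "bits")
```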
Very Noisy Channels
• What if there is not enough water?
• Must have Pi ≥ 0 ∀i
• If v < Ni then set Pi = 0 and recalculate v (i.e., Pi = max(v – Ni, 0))
[Figure: water level v below N3; only channels 1 and 2 receive power]
Kuhn-Tucker Conditions: (not examinable)
  – Max f(x) subject to Ax + b = 0 and gi(x) ≥ 0 for i = 1:M, with f, gi concave
  – set J(x) = f(x) + Σ_{i=1}^M μi gi(x) + λᵀ(Ax + b)
  – Solution x0, λ, μi iff
    ∇J(x0) = 0,  Ax0 + b = 0,  gi(x0) ≥ 0,  μi ≥ 0,  μi gi(x0) = 0
176
Colored Gaussian Noise
• Suppose y = x + z where E zzᵀ = Kz and E xxᵀ = Kx
• We want to find Kx to maximize capacity subject to the power constraint:
  Σ_{i=1}^n E xi² ≤ nP  ⇔  tr(Kx) ≤ nP
  – Find the noise eigenvectors: Kz = QΛQᵀ with QQᵀ = I
    Qᵀy = Qᵀx + Qᵀz = Qᵀx + w, where E wwᵀ = E QᵀzzᵀQ = QᵀKzQ = Λ is diagonal
  – Now the wi are independent (so the previous result on parallel Gaussian channels applies)
  – The power constraint is unchanged: tr(QᵀKxQ) = tr(Kx QQᵀ) = tr(Kx)
  – Use water-filling and independent messages: QᵀKxQ + Λ = vI
  – Choose Kx = Q(vI – Λ)Qᵀ, where v = P + n^–1 tr(Λ)
177
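A compact sketch of this recipe (my own code, not from the notes): diagonalize Kz, water-fill on its eigenvalues, and rotate the resulting diagonal input covariance back with Q:

```python
# Sketch (my own): capacity of a Gaussian channel with colored noise via eigen water-filling.
import numpy as np

def colored_noise_capacity(Kz, total_power):
    lam, Q = np.linalg.eigh(Kz)                      # K_Z = Q diag(lam) Q^T
    lo, hi = lam.min(), lam.max() + total_power      # bisection on the water level v
    for _ in range(100):
        v = 0.5 * (lo + hi)
        if np.sum(np.maximum(v - lam, 0)) > total_power:
            hi = v
        else:
            lo = v
    P = np.maximum(v - lam, 0)
    Kx = Q @ np.diag(P) @ Q.T                        # optimal input covariance
    C = 0.5 * np.sum(np.log2(1 + P / lam)) / len(lam)   # bits per channel use
    return Kx, C

Kz = np.array([[2.0, 1.0],
               [1.0, 2.0]])
Kx, C = colored_noise_capacity(Kz, total_power=4.0)
print("Kx =\n", np.round(Kx, 3), "\nC =", round(C, 3), "bits/use")
```

With this Kz and a total power of 4 the result matches the no-feedback figure of about 0.604 bits/use quoted in the toy example a few slides below.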
Power Spectrum Water Filling
• If z is from a stationary process, then diag(Λ) → the power spectrum N(f) as n → ∞
  – To achieve capacity use water-filling on the noise power spectrum:
    P = ∫_{–W}^{W} max(v – N(f), 0) df
    C = ∫_{–W}^{W} ½ log(1 + max(v – N(f), 0)/N(f)) df
  – Water-filling in the spectral domain
[Figures: noise spectrum N(f) with water level v; resulting transmit spectrum max(v – N(f), 0)]
178
Gaussian Channel + Feedback
xi = xi(w, y1:i–1),   yi = xi + zi
Does feedback add capacity?
  – White noise (& DMC) – No
  – Coloured noise – Not much
I(w; y) = h(y) – h(y | w) = h(y) – Σ_{i=1}^n h(yi | w, y1:i–1)           [chain rule]
        = h(y) – Σ_{i=1}^n h(yi | w, y1:i–1, x1:i, z1:i–1)               [xi = xi(w, y1:i–1), z = y – x]
        = h(y) – Σ_{i=1}^n h(zi | w, y1:i–1, x1:i, z1:i–1)               [z = y – x and translation invariance]
        = h(y) – Σ_{i=1}^n h(zi | z1:i–1)                                [z may be colored; zi depends only on z1:i–1]
        = h(y) – h(z) ≤ ½ log(|Ky| / |Kz|)                               [chain rule; h(z) = ½ log|2πeKz| bits]
To maximize I(w; y) we maximize h(y) ⇒ y Gaussian; we can take z and x = y – z jointly Gaussian
179
Maximum Benefit of Feedback
w → Coder → xi → (+zi) → yi   (feedback of y to the coder)
Cn,FB = max_{tr(Kx) ≤ nP} ½ n^–1 log(|Ky| / |Kz|)
      ≤ max_{tr(Kx) ≤ nP} ½ n^–1 log(|2(Kx + Kz)| / |Kz|)        [Lemmas 1 & 2: |2(Kx + Kz)| ≥ |Ky|]
      = max_{tr(Kx) ≤ nP} ½ n^–1 log(2ⁿ |Kx + Kz| / |Kz|)        [|kA| = kⁿ|A|]
      = ½ + max_{tr(Kx) ≤ nP} ½ n^–1 log(|Kx + Kz| / |Kz|)
      = ½ + Cn bits/transmission                [Ky = Kx + Kz if no feedback; Cn: capacity without feedback]
Having feedback adds at most ½ bit per transmission for colored Gaussian noise channels
180
Max Benefit of Feedback: Lemmas
Lemma 1: Kx+z + Kx–z = 2(Kx + Kz)
  Kx+z + Kx–z = E(x + z)(x + z)ᵀ + E(x – z)(x – z)ᵀ
              = E[xxᵀ + xzᵀ + zxᵀ + zzᵀ] + E[xxᵀ – xzᵀ – zxᵀ + zzᵀ]
              = 2E xxᵀ + 2E zzᵀ = 2Kx + 2Kz
Lemma 2: If F, G are positive definite then |F + G| ≥ |F|
  Consider two indep random vectors f ~ N(0, F), g ~ N(0, G)
  ½ log((2πe)ⁿ |F + G|) = h(f + g) ≥ h(f + g | g)          [conditioning reduces h()]
                        = h(f | g)                         [translation invariance]
                        = h(f) = ½ log((2πe)ⁿ |F|)         [f, g independent]
Hence: |2(Kx + Kz)| = |Kx+z + Kx–z| ≥ |Kx+z| = |Ky|
181
Gaussian Feedback Coder
x and z jointly Gaussian, xi = xi(w, y1:i–1):
  x = Bz + v(w)
where v is indep of z and B is strictly lower triangular, since xi depends only on z1:i–1.
  y = x + z = (B + I)z + v
  Kx = E xxᵀ = E[Bzzᵀ Bᵀ + vvᵀ] = BKzBᵀ + Kv
  Ky = E yyᵀ = E[(B + I)zzᵀ(B + I)ᵀ + vvᵀ] = (B + I)Kz(B + I)ᵀ + Kv
Capacity:
  Cn,FB = max_{Kv, B} ½ n^–1 log(|Ky| / |Kz|) = max_{Kv, B} ½ n^–1 log(|(B + I)Kz(B + I)ᵀ + Kv| / |Kz|)
  subject to tr(Kx) = tr(BKzBᵀ + Kv) ≤ nP
hard to solve in closed form; the optimization can be done numerically
182
Gaussian Feedback: Toy Example
n = 2,  P = 2,  Kz = [2 1; 1 2],  B = [0 0; b 0]
x = Bz + v:   x1 = v1,   x2 = b z1 + v2
Goal: maximize (w.r.t. Kv and b)
  |Ky| = |(B + I)Kz(B + I)ᵀ + Kv|
Subject to:
  Power constraint: tr(BKzBᵀ + Kv) ≤ 4;   powers Px1 = Pv1, Px2 = Pv2 + Pbz1
  Kv must be positive definite
Solution (via numerical search):
  b = 0:      |Ky| = 16,    C = 0.604 bits
  b = 0.69:   |Ky| = 20.7,  C = 0.697 bits
Feedback increases C by 16%
[Figures: det(Ky) vs b; power split Pv1, Pv2, Pbz1 vs b]
183
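The numerical search can be reproduced with a crude grid. The code below is my own sketch (not the original search used for the slide): it sweeps b, the split of the remaining power between v1 and v2, and their correlation, and should land close to the values quoted above.

```python
# Sketch (my own grid search) for the two-use Gaussian feedback toy example.
import numpy as np

Kz = np.array([[2.0, 1.0], [1.0, 2.0]])
n, nP = 2, 4.0
best = (0.0, None)

for b in np.linspace(0.0, 1.4, 141):
    B = np.array([[0.0, 0.0], [b, 0.0]])
    power_left = nP - np.trace(B @ Kz @ B.T)           # power left for v1, v2
    if power_left <= 2e-3:
        continue
    for Pv1 in np.linspace(1e-3, power_left - 1e-3, 40):
        Pv2 = power_left - Pv1
        for rho in np.linspace(-0.999, 0.999, 41):     # correlation of v1, v2 (Kv stays PD)
            c = rho * np.sqrt(Pv1 * Pv2)
            Kv = np.array([[Pv1, c], [c, Pv2]])
            Ky = (B + np.eye(2)) @ Kz @ (B + np.eye(2)).T + Kv
            det = np.linalg.det(Ky)
            if det > best[0]:
                best = (det, b)

detKy, b = best
C_fb = 0.5 / n * np.log2(detKy / np.linalg.det(Kz))
print(f"best b = {b:.2f}, |Ky| = {detKy:.1f}, C_FB = {C_fb:.3f} bits/use "
      f"(no feedback: |Ky| = 16, C = 0.604)")
```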
Summary
• Water-filling for the parallel Gaussian channel ((x)+ = max(x, 0)):
  C = Σ_{i=1}^n ½ log(1 + (v – Ni)+/Ni),   with Σ_i (v – Ni)+ = nP
• Colored Gaussian noise (λi: eigenvalues of Kz):
  C = Σ_{i=1}^n ½ log(1 + (v – λi)+/λi),   with Σ_i (v – λi)+ = nP
• Continuous (stationary) Gaussian channel:
  C = ∫_{–W}^{W} ½ log(1 + (v – N(f))+/N(f)) df
• Feedback bound:
  Cn,FB ≤ Cn + ½
184
Lecture 15
• Lossy Source Coding
– For both discrete and continuous sources
– Bernoulli Source, Gaussian Source
• Rate Distortion Theory
– What is the minimum distortion achievable at
a particular rate?
– What is the minimum rate to achieve a
particular distortion?
• Channel/Source Coding Duality
185
Lossy Source Coding
x1:n → Encoder f(·) → f(x1:n) ∈ 1:2^nR → Decoder g(·) → x̂1:n
Distortion function: d(x, x̂) ≥ 0
  – examples: (i) squared error dS(x, x̂) = (x – x̂)²;  (ii) Hamming dH(x, x̂) = 0 if x = x̂, 1 if x ≠ x̂
  – sequences: d(x, x̂) = n^–1 Σ_{i=1}^n d(xi, x̂i)
Distortion of a code fn(·), gn(·):  D = E_{x~X} d(x, x̂) = E d(x, g(f(x)))
Rate distortion pair (R, D) is achievable for source X if ∃ a sequence fn(·), gn(·) such that
  lim_{n→∞} E_{x~X} d(x, gn(fn(x))) ≤ D
186
Rate Distortion Function
The rate distortion function for {xi} with pdf p(x) is defined as
  R(D) = min{R} such that (R, D) is achievable
Theorem: R(D) = min I(x; x̂) over all p(x, x̂) = p(x) q(x̂ | x) such that:
  (a) p(x) is correct
  (b) Ex,x̂ d(x, x̂) ≤ D
  – this expression is the rate distortion function for X
We will prove this next lecture
Lossless coding: if D = 0 then we have R(D) = I(x; x) = H(x)
187
R(D) bound for Bernoulli Source
Bernoulli: X = {0, 1}, pX = [1–p, p], assume p ≤ ½
  – Hamming distance: d(x, x̂) = |x – x̂|
  – If D ≥ p, R(D) = 0 since we can set g(·) ≡ 0
  – For D < p ≤ ½, if E d(x, x̂) ≤ D then
    I(x; x̂) = H(x) – H(x | x̂)
            = H(p) – H(x ⊕ x̂ | x̂)      [⊕ is one-to-one given x̂]
            ≥ H(p) – H(x ⊕ x̂)           [conditioning reduces entropy]
            ≥ H(p) – H(D)                [Prob(x ⊕ x̂ = 1) ≤ D ≤ ½ and H(·) monotonic on [0, ½]]
Hence R(D) ≥ H(p) – H(D)
[Figure: R(D) curves for p = 0.2 and p = 0.5]
188
R(D) for Bernoulli source
We know the optimum satisfies R(D) ≥ H(p) – H(D)
  – We show we can find a p(x̂, x) that attains this.
  – Peculiarly, we consider a test channel with x̂ as the input and error probability D:
    x̂ = 0 w.p. 1–r, x̂ = 1 w.p. r;   x = x̂ with prob 1–D, x = 1–x̂ with prob D
Now choose r to give x the correct probabilities:
  r(1 – D) + (1 – r)D = p   ⇒   r = (p – D)(1 – 2D)^–1,  for D < p
Now I(x; x̂) = H(x) – H(x | x̂) = H(p) – H(D),  and p(x ≠ x̂) = D ⇒ distortion D
Hence R(D) = H(p) – H(D)
If D ≥ p or D ≥ 1 – p, we can achieve R(D) = 0 trivially.
189
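For concreteness, here is my own small sketch (not from the notes) evaluating R(D) = (H(p) – H(D))⁺ and checking the test-channel parameter r = (p – D)/(1 – 2D):

```python
# Sketch (my own): Bernoulli rate-distortion function and test-channel check.
import numpy as np

def H(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def rate_distortion_bernoulli(p, D):
    if D >= min(p, 1 - p):
        return 0.0
    return H(p) - H(D)

p, D = 0.2, 0.05
r = (p - D) / (1 - 2 * D)
print(f"R({D}) = {rate_distortion_bernoulli(p, D):.4f} bits")
print(f"test channel: P(xhat=1) = r = {r:.4f}; "
      f"check P(x=1) = r(1-D)+(1-r)D = {r*(1-D) + (1-r)*D:.4f}")
```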
R(D) bound for Gaussian Source
• Assume x ~ N(0, σ²) and d(x, x̂) = (x – x̂)²
• Want to minimize I(x; x̂) subject to E(x – x̂)² ≤ D
  I(x; x̂) = h(x) – h(x | x̂)
          = ½ log 2πeσ² – h(x – x̂ | x̂)            [translation invariance]
          ≥ ½ log 2πeσ² – h(x – x̂)                 [conditioning reduces entropy]
          ≥ ½ log 2πeσ² – ½ log 2πe Var(x – x̂)     [Gaussian maximizes entropy for given covariance]
          ≥ ½ log 2πeσ² – ½ log 2πeD               [Var(x – x̂) ≤ E(x – x̂)² ≤ D]
Hence I(x; x̂) ≥ max(½ log(σ²/D), 0)               [I(x; y) is always ≥ 0]
190
R(D) for Gaussian Source
To show that we can find a p(x̂, x) that achieves the bound, we construct a test channel that
introduces distortion D < σ²:   x̂ ~ N(0, σ² – D),  z ~ N(0, D),  x = x̂ + z ~ N(0, σ²)
  I(x; x̂) = h(x) – h(x | x̂)
          = ½ log 2πeσ² – h(x – x̂ | x̂) = ½ log 2πeσ² – h(z | x̂)
          = ½ log 2πeσ² – ½ log 2πeD = ½ log(σ²/D)
Hence
  R(D) = max(½ log(σ²/D), 0)   ⇔   D(R) = σ² 2^–2R
[Figure: R(D) vs D/σ²; the region above the curve is achievable, below it is impossible]
cf. PCM with loading factor mp: D(R) = (mp²/3) σ² 2^–2R;  for mp = 4, D(R) = (16/3) σ² 2^–2R
191
Lloyd Algorithm
Problem: find the optimum quantization levels for a Gaussian pdf.
At the optimum:
a. Bin boundaries are midway between quantization levels
b. Each quantization level equals the mean value of its own bin
Lloyd algorithm: pick random quantization levels, then apply conditions (a) and (b) in turn until convergence.
[Figures: quantization levels and bin boundaries vs iteration (solid lines are bin boundaries; initial levels uniform in [–1, +1]); MSE vs iteration]
Best mean square error for 8 levels = 0.0345σ².  Predicted D(R) = (σ/8)² = 0.0156σ²
192
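The iteration is short to implement. The sketch below is my own (not the original experiment): it runs Lloyd's alternation for 8 levels on a unit-variance Gaussian, using Monte Carlo samples in place of exact Gaussian integrals, and should reproduce an MSE near 0.0345:

```python
# Sketch (my own) of the Lloyd algorithm for an 8-level Gaussian quantizer.
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal(200_000)          # unit-variance Gaussian source
levels = np.linspace(-1.0, 1.0, 8)              # 8 levels, initially uniform in [-1, 1]

for it in range(100):
    boundaries = 0.5 * (levels[:-1] + levels[1:])        # (a) midpoints between levels
    idx = np.searchsorted(boundaries, samples)           # assign each sample to a bin
    levels = np.array([samples[idx == k].mean() if np.any(idx == k) else levels[k]
                       for k in range(len(levels))])     # (b) centroid of each bin

boundaries = 0.5 * (levels[:-1] + levels[1:])
idx = np.searchsorted(boundaries, samples)
mse = np.mean((samples - levels[idx]) ** 2)
print("levels:", np.round(levels, 3))
print(f"MSE = {mse:.4f}  (notes: ~0.0345 for 8 levels; D(R) bound = 2^(-2*3) = {2**-6:.4f})")
```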
Vector Quantization
To get D(R), you have to quantize many values together
  – True even if the values are independent
[Figure: scalar quantization (D = 0.035) vs vector quantization (D = 0.028) of two Gaussian variables; one quadrant shown]
  – Independent quantization puts dense levels in low-probability areas
  – Vector quantization is better (even more so if the values are correlated)
193
Multiple Gaussian Variables
• Assume x1:n are independent Gaussian sources with different variances. How should we apportion the available total distortion between the sources?
• Assume xi ~ N(0, σi²) and d(x, x̂) = n^–1 (x – x̂)ᵀ(x – x̂) ≤ D
  I(x1:n; x̂1:n) ≥ Σ_{i=1}^n I(xi; x̂i)                     [mutual information independence bound, for independent xi]
               ≥ Σ_{i=1}^n R(Di) = Σ_{i=1}^n max(½ log(σi²/Di), 0)     [R(D) for an individual Gaussian]
We must find the Di that minimize Σ_{i=1}^n max(½ log(σi²/Di), 0) such that n^–1 Σ_{i=1}^n Di = D
  (Solution: Di = D0 if D0 < σi², otherwise Di = σi².)
194
Reverse Water-filling
Minimize Σ_{i=1}^n max(½ log(σi²/Di), 0) subject to Σ_{i=1}^n Di = nD
Use a Lagrange multiplier:
  J = Σ_{i=1}^n ½ log(σi²/Di) + λ Σ_{i=1}^n Di
  ∂J/∂Di = –½ Di^–1 + λ = 0   ⇒   Di = D0 (a constant),  with nD0 = nD
  Ri = ½ log(σi²/D0)
Choose the Ri for equal distortion Di = D0 = D
• If σi² < D0 then set Ri = 0 (meaning Di = σi²) and increase D0 to maintain the average distortion equal to D
[Figure: reverse water-filling on ½ log σi²; Ri is the part of ½ log σi² above the level ½ log D0, and sources with σi² < D0 get Ri = 0]
195
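Here is my own reverse water-filling sketch (not from the notes): it bisects on the common distortion level D0 so that Σᵢ min(D0, σᵢ²) = nD, then reports the per-source distortions and rates:

```python
# Sketch (my own) of reverse water-filling across independent Gaussian sources.
import numpy as np

def reverse_waterfill(variances, D):
    n = len(variances)
    lo, hi = 0.0, max(variances)
    for _ in range(100):                       # bisection on the distortion level D0
        D0 = 0.5 * (lo + hi)
        if np.sum(np.minimum(D0, variances)) < n * D:
            lo = D0
        else:
            hi = D0
    Di = np.minimum(D0, variances)
    Ri = np.maximum(0.5 * np.log2(variances / Di), 0)
    return D0, Di, Ri

variances = np.array([4.0, 1.0, 0.1])
D0, Di, Ri = reverse_waterfill(variances, D=0.5)
print("D0 =", round(D0, 3), " Di =", np.round(Di, 3), " Ri =", np.round(Ri, 3),
      " total rate =", round(Ri.sum(), 3), "bits")
```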
Channel/Source Coding Duality
• Channel Coding
– Find codes separated enough to
give non-overlapping output
images.
– Image size = channel noise
– The maximum number (highest
rate) is when the images just don’t
overlap (some gap).
Noisy channel
Sphere Packing
Lossy encode
• Source Coding
– Find regions that cover the sphere
– Region size = allowed distortion
– The minimum number (lowest rate)
is when they just fill the sphere
(with no gap).
Sphere Covering
196
Gaussian Channel/Source
• Capacity of the Gaussian channel (n: block length)
  – Radius of the big sphere: √(n(P+N))
  – Radius of the small spheres: √(nN)
  – Capacity: 2^nC ≤ (n(P+N))^{n/2} / (nN)^{n/2} = (1 + P/N)^{n/2}
    [maximum number of small spheres that can be packed in the big sphere]
• Rate distortion for a Gaussian source
  – Variance σ² ⇒ radius of the big sphere: √(nσ²)
  – Radius of the small spheres: √(nD) for distortion D
  – Rate: 2^{nR(D)} ≥ (σ²/D)^{n/2}
    [minimum number of small spheres needed to cover the big sphere]
197
Channel Decoder as Source Encoder
w ∈ 1:2^nR → Channel Encoder → y1:n ~ N(0, σ²–D) → + z1:n (z ~ N(0, D)) → x1:n ~ N(0, σ²) → Channel Decoder → ŵ ∈ 1:2^nR
• For R < C = ½ log(1 + (σ² – D)D^–1) = ½ log(σ²/D), we can find a channel encoder/decoder so that p(ŵ ≠ w) < ε and E(xi – yi)² ≈ D
• Now reverse the roles of encoder and decoder. Since p(x̂ ≠ y) = p(ŵ ≠ w) < ε,
  E(xi – x̂i)² ≈ E(xi – yi)² ≈ D
Source coding: x1:n ~ N(0, σ²) → Channel Decoder (= source encoder) → ŵ ∈ 1:2^nR → Channel Encoder (= source decoder) → x̂1:n
We have encoded x at rate R = ½ log(σ²D^–1) with distortion D!
198
Summary
• Lossy source coding: tradeoff between rate and distortion
• Rate distortion function
  R(D) = min_{p(x̂|x) s.t. E d(x,x̂) ≤ D} I(x; x̂)
• Bernoulli source: R(D) = (H(p) – H(D))+
• Gaussian source (reverse water-filling): R(D) = ½ log(σ²/D)
• Duality: channel decoding (encoding) ⇔ source encoding (decoding)
199
Nothing But Proof
• Proof of Rate Distortion Theorem
– Converse: if the rate is less than R(D), then
distortion of any code is higher than D
– Achievability: if the rate is higher than R(D),
then there exists a rate-R code which
achieves distortion D
Quite technical!
200
Review
x1:n → Encoder fn(·) → f(x1:n) ∈ 1:2^nR → Decoder gn(·) → x̂1:n
The rate distortion function for x whose px(x) is known is
  R(D) = inf{R : ∃ fn, gn with lim_{n→∞} E_{x~X} d(x, x̂) ≤ D}
Rate Distortion Theorem:
  R(D) = min I(x; x̂) over all p(x̂ | x) such that Ex,x̂ d(x, x̂) ≤ D
We will prove this theorem for discrete X and bounded d(x, y) ≤ dmax
The R(D) curve depends on your choice of d(·,·); it is decreasing and convex
[Figure: R(D) vs D/σ²; the region above the curve is achievable, below it is impossible]
201
Converse: Rate Distortion Bound
Suppose we have found an encoder and decoder at rate R0 with expected distortion D for independent xi (worst case).
We want to prove that R0 ≥ R(D) = R(E d(x; x̂))
  – We show first that R0 ≥ n^–1 Σ_i I(xi; x̂i)
  – We know that I(xi; x̂i) ≥ R(E d(xi; x̂i))              [defn of R(D)]
  – and use convexity of R(D) to show
    n^–1 Σ_i R(E d(xi; x̂i)) ≥ R(n^–1 Σ_i E d(xi; x̂i)) = R(E d(x; x̂)) = R(D)
We prove convexity first and then the rest
202
Convexity of R(D)
If p1(x̂ | x) and p2(x̂ | x) are associated with (D1, R1) and (D2, R2) on the R(D) curve, define
  pλ(x̂ | x) = λ p1(x̂ | x) + (1–λ) p2(x̂ | x)
Then
  E_pλ d(x, x̂) = λD1 + (1–λ)D2 = Dλ
  R(Dλ) ≤ I_pλ(x; x̂)                                     [R(D) = min_{p(x̂|x)} I(x; x̂)]
        ≤ λ I_p1(x; x̂) + (1–λ) I_p2(x; x̂)                [I(x; x̂) convex w.r.t. p(x̂ | x)]
        = λ R(D1) + (1–λ) R(D2)                           [p1 and p2 lie on the R(D) curve]
[Figure: R(D) curve with the chord between (D1, R1) and (D2, R2) lying above the curve]
203
Proof that R R(D)
nR0 ≥ H(x̂1:n) ≥ H(x̂1:n) – H(x̂1:n | x1:n)          [uniform bound; H(x̂ | x) ≥ 0]
    = I(x̂1:n; x1:n)                                [definition of I(;)]
    ≥ Σ_{i=1}^n I(xi; x̂i)                          [xi indep: mutual information independence bound]
    ≥ Σ_{i=1}^n R(E d(xi; x̂i))                     [definition of R]
    = n · n^–1 Σ_{i=1}^n R(E d(xi; x̂i))
    ≥ n R(n^–1 Σ_{i=1}^n E d(xi; x̂i))              [convexity]
    = n R(E d(x1:n; x̂1:n))                         [defn of vector d(·)]
    ≥ n R(D)                                       [original assumption that E(d) ≤ D, and R(D) monotonically decreasing]
204
Rate Distortion Achievability
x1:n → Encoder fn(·) → f(x1:n) ∈ 1:2^nR → Decoder gn(·) → x̂1:n
We want to show that for any D, we can find an encoder and decoder that compress x1:n to about nR(D) bits.
• pX is given
• Assume we know the p(x̂ | x) that gives I(x; x̂) = R(D)
• Random codebook: choose 2^nR random codewords x̂ ~ p(x̂)
  – There must be at least one code that is as good as the average
• Encoder: use joint typicality to encode
  – We show that there is almost always a suitable codeword
First define the typical set we will use, then prove two preliminary results.
205
Distortion Typical Set
Distortion Typical: (xi, x̂i) ∈ X × X̂ drawn i.i.d. ~ p(x, x̂)
  Jd,ε(n) = {(x, x̂) ∈ Xⁿ × X̂ⁿ : |–n^–1 log p(x) – H(x)| ≤ ε,
                                  |–n^–1 log p(x̂) – H(x̂)| ≤ ε,
                                  |–n^–1 log p(x, x̂) – H(x, x̂)| ≤ ε,
                                  |d(x, x̂) – E d(x, x̂)| ≤ ε}        [new condition]
Properties of the typical set:
1. Indiv p.d.:   (x, x̂) ∈ Jd,ε(n) ⇒ |log p(x, x̂) + nH(x, x̂)| ≤ nε
2. Total Prob:   p((x, x̂) ∈ Jd,ε(n)) > 1 – ε for n > Nε
                 [weak law of large numbers; the d(xi, x̂i) are i.i.d.]
206
Conditional Probability Bound
Lemma: (x, x̂) ∈ Jd,ε(n) ⇒ p(x̂) ≥ p(x̂ | x) 2^–n(I(x;x̂)+3ε)
Proof:
  p(x̂ | x) = p(x̂, x)/p(x) = p(x̂) · p(x̂, x)/(p(x̂) p(x))
           ≤ p(x̂) · 2^–n(H(x,x̂)–ε) / (2^–n(H(x)+ε) 2^–n(H(x̂)+ε))     [take max of top and min of bottom; bounds from defn of J]
           = p(x̂) 2^{n(I(x;x̂)+3ε)}                                    [defn of I]
207
Curious but Necessary Inequality
Lemma: for u, v ∈ [0,1] and m > 0:   (1 – uv)^m ≤ 1 – u + e^–vm
Proof:
  u = 0:  e^–vm ≥ 0 ⇒ (1 – 0)^m = 1 ≤ 1 – 0 + e^–vm
  u = 1:  define f(v) = e^–v – 1 + v;  f(0) = 0 and f′(v) = 1 – e^–v ≥ 0 for v ≥ 0
          hence for v ∈ [0,1], 0 ≤ 1 – v ≤ e^–v  ⇒  (1 – v)^m ≤ e^–vm
  0 < u < 1:  define gv(u) = (1 – uv)^m
          gv″(u) = m(m – 1)v²(1 – uv)^{m–2} ≥ 0 ⇒ gv(u) convex
          (1 – uv)^m = gv(u) ≤ (1 – u) gv(0) + u gv(1)                [convexity, for u, v ∈ [0,1]]
                     = (1 – u) + u(1 – v)^m ≤ 1 – u + u e^–vm ≤ 1 – u + e^–vm
208
Achievability of R(D): preliminaries
x1:n → Encoder fn(·) → f(x1:n) ∈ 1:2^nR → Decoder gn(·) → x̂1:n
• Choose D and find a p(x̂ | x) such that I(x; x̂) = R(D) and E d(x, x̂) ≤ D
  Choose ε > 0 and define p(x̂) = Σ_x p(x) p(x̂ | x)
• Decoder: for each w ∈ 1:2^nR choose gn(w) = x̂w drawn i.i.d. ~ pⁿ(x̂)
• Encoder: fn(x) = min{w : (x, x̂w) ∈ Jd,ε(n)}, else 1 if no such w exists
• Expected Distortion: D̄ = Ex,g d(x, x̂)
  – averaged over all input vectors x and all random decoding functions g
  – for large n we show D̄ ≈ D, so there must be at least one good code
209
Expected Distortion
We can divide the input vectors x into two categories:
a) if ∃w such that (x, x̂w) ∈ Jd,ε(n), then d(x, x̂w) ≤ D + ε since E d(x, x̂) ≤ D
b) if no such w exists, we must have d(x, x̂w) ≤ dmax, since we are assuming that d(·,·) is bounded. Suppose the probability of this situation is Pe.
Hence D̄ = Ex,g d(x, x̂) ≤ (1 – Pe)(D + ε) + Pe dmax ≤ D + ε + Pe dmax
We need to show that the expected value of Pe is small
210
Error Probability
Define the set of valid inputs for a (random) code g:
  V(g) = {x : ∃w with (x, g(w)) ∈ Jd,ε(n)}
We have
  Pe = Σ_g p(g) Σ_{x∉V(g)} p(x) = Σ_x p(x) Σ_{g: x∉V(g)} p(g)           [change the order]
Define K(x, x̂) = 1 if (x, x̂) ∈ Jd,ε(n), else 0
Prob that a random x̂ does not match x:   1 – Σ_x̂ p(x̂) K(x, x̂)
Prob that an entire code does not match x:   [1 – Σ_x̂ p(x̂) K(x, x̂)]^{2^nR}     [codewords are i.i.d.]
Hence  Pe = Σ_x p(x) [1 – Σ_x̂ p(x̂) K(x, x̂)]^{2^nR}
211
Achievability for Average Code
Since (x, x̂) ∈ Jd,ε(n) ⇒ p(x̂) ≥ p(x̂ | x) 2^–n(I(x;x̂)+3ε),
  Pe = Σ_x p(x) [1 – Σ_x̂ p(x̂) K(x, x̂)]^{2^nR}
     ≤ Σ_x p(x) [1 – 2^–n(I(x;x̂)+3ε) Σ_x̂ p(x̂ | x) K(x, x̂)]^{2^nR}
Using (1 – uv)^m ≤ 1 – u + e^–vm with
  u = Σ_x̂ p(x̂ | x) K(x, x̂);   v = 2^–n(I(x;x̂)+3ε);   m = 2^nR       [note: 0 ≤ u, v ≤ 1 as required]
  Pe ≤ Σ_x p(x) [1 – Σ_x̂ p(x̂ | x) K(x, x̂)] + exp(–2^–n(I(x;x̂)+3ε) 2^nR)
212
Achievability for Average Code
Pe ≤ Σ_x p(x) [1 – Σ_x̂ p(x̂ | x) K(x, x̂)] + exp(–2^–n(I(x;x̂)+3ε) 2^nR)
   = 1 – Σ_{x,x̂} p(x, x̂) K(x, x̂) + exp(–2^{n(R – I(x;x̂) – 3ε)})
   = P((x, x̂) ∉ Jd,ε(n)) + exp(–2^{n(R – I(x;x̂) – 3ε)})        [mutual information does not involve a particular x]
   → 0 as n → ∞,
since both terms → 0 as n → ∞ provided R > I(x; x̂) + 3ε
Hence ∀ε > 0, D̄ = Ex,g d(x, x̂) can be made ≤ D + ε
213
Achievability
Since ∀ε > 0, D̄ = Ex,g d(x, x̂) can be made ≤ D + ε,
there must be at least one g with Ex d(x, x̂) ≤ D + ε
Hence (R, D) is achievable for any R > R(D),
that is, lim_{n→∞} E d(x, x̂) ≤ D
x1:n → Encoder fn(·) → f(x1:n) ∈ 1:2^nR → Decoder gn(·) → x̂1:n
In fact a stronger result is true (proof in C&T 10.6):
  ∀ε > 0, ∀D and R > R(D), ∃ fn, gn with p(d(x, x̂) ≤ D + ε) → 1 as n → ∞
214
Lecture 16
• Introduction to network information
theory
• Multiple access
• Distributed source coding
215
Network Information Theory
• System with many senders and receivers
• New elements: interference, cooperation,
competition, relay, feedback…
• Problem: decide whether or not the sources can
be transmitted over the channel
– Distributed source coding
– Distributed communication
– The general problem has not yet been solved, so we
consider various special cases
• Results are presented without proof (can be
done using mutual information, joint AEP)
216
Implications to Network Design
• Examples of large information networks
– Computer networks
– Satellite networks
– Telephone networks
• A complete theory of network communications
would have wide implications for the design of
communication and computer networks
• Examples
– CDMA (code-division multiple access): mobile phone
network
– Network coding: significant capacity gain compared to
routing-based networks
217
Network Models Considered
•
•
•
•
•
•
•
Multi-access channel
Broadcast channel
Distributed source coding
Relay channel
Interference channel
Two-way channel
General communication network
218
State of the Art
• Triumphs
• Unknowns
– Multi-access channel
– The simplest relay
channel
– Gaussian broadcast
channel
– The simplest
interference channel
Reminder: Networks being built (ad hoc networks, sensor
networks) are much more complicated
219
Multi-Access Channel
• Example: many users
communicate with a
common base station over
a common channel
• What rates are achievable
simultaneously?
• Best understood multiuser
channel
• Very successful: 3G CDMA
mobile phone networks
220
Capacity Region
• Capacity of a single-user Gaussian channel
  C = ½ log(1 + P/N) ≜ C(P/N)
• Gaussian multi-access channel with m users
  Y = Σ_{i=1}^m Xi + Z;  each Xi has equal power P, and the noise Z has variance N
• Capacity region (Ri: rate for user i)
  Ri < C(P/N)
  Ri + Rj < C(2P/N)
  Ri + Rj + Rk < C(3P/N)
  …
  Σ_{i=1}^m Ri < C(mP/N)
Transmission: independent and simultaneous (i.i.d. Gaussian codebooks)
Decoding: joint decoding, look for m codewords whose sum is closest to Y
The last inequality dominates when all rates are the same
The sum rate goes to ∞ with m
221
Two-User Channel
• Capacity region
  R1 < C(P1/N)
  R2 < C(P2/N)
  R1 + R2 < C((P1 + P2)/N)
  [Figure: pentagonal capacity region; corner points = onion peeling; FDMA/TDMA curve inside; naïve TDMA smaller still]
  – Corresponds to CDMA
  – Surprising fact: sum rate = rate achieved by a single sender with power P1 + P2
• Achieves a higher sum rate than treating interference as noise, i.e., than
  C(P1/(P2 + N)) + C(P2/(P1 + N))
222
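The numbers are easy to check. The sketch below is my own (example powers are mine, not from the notes): it evaluates the single-user corners, the sum-rate bound, the SIC corner point, and the interference-as-noise sum rate:

```python
# Sketch (my own): two-user Gaussian MAC rates vs treating interference as noise.
import numpy as np

def C(x):
    return 0.5 * np.log2(1 + x)

P1, P2, N = 10.0, 5.0, 1.0
print(f"single-user rates: R1 < {C(P1/N):.3f}, R2 < {C(P2/N):.3f}")

# corner point via onion peeling / SIC: decode user 2 first (user 1 as noise), then user 1
R2_sic = C(P2 / (P1 + N))
R1_sic = C(P1 / N)
print(f"sum rate C((P1+P2)/N) = {C((P1+P2)/N):.3f}  =  SIC corner {R1_sic + R2_sic:.3f}")
print(f"interference-as-noise sum rate = {C(P1/(P2+N)) + C(P2/(P1+N)):.3f}  (strictly smaller)")
```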
Onion Peeling
• Interpretation of corner point: onion-peeling
– First stage: decode user 2, treating user 1 as noise
– Second stage: subtract out user 2, then decode user 1
• In fact, it can achieve the entire capacity region
– Any rate-pairs between two corner points achievable by timesharing
• Its technical term is successive interference cancelation
(SIC)
– Removes the need for joint decoding
– Uses a sequence of single-user decoders
• SIC is implemented in the uplink of CDMA 2000 EV-DO
(evolution-data optimized)
– Increases throughput by about 65%
223
Comparison with TDMA and FDMA
• FDMA (frequency-division multiple access)
  R1 < W1 log(1 + P1/(N0W1)),   R2 < W2 log(1 + P2/(N0W2))
  Total bandwidth W = W1 + W2; varying W1 and W2 traces out the curve in the figure
• TDMA (time-division multiple access)
– Each user is allotted a time slot, transmits and other
users remain silent
– Naïve TDMA: dashed line
– Can do better while still maintaining the same
average power constraint; the same capacity region
as FDMA
• CDMA capacity region is larger
– But needs a more complex decoder
224
Distributed Source Coding
• Associate with nodes are sources that are
generally dependent
• How do we take advantage of the dependence
to reduce the amount of information
transmitted?
• Consider the special case where channels are
noiseless and without interference
• Finding the set of rates associate with each
source such that all required sources can be
decoded at destination
• Data compression dual to multi-access channel
225
Two-User Distributed Source Coding
• X and Y are correlated
• But the encoders cannot
communicate; have to
encode independently
• A single source: R > H(X)
• Two sources: R > H(X,Y) if
encoding together
• What if encoding separately?
• Of course one can do R > H(X) + H(Y)
• Surprisingly, R = H(X,Y) is sufficient (Slepian-Wolf
coding, 1973)
• Sadly, the coding scheme was not practical (again)
226
Slepian-Wolf Coding
• Achievable rate region
R1 ≥ H(X | Y)
R2 ≥ H(Y | X)
R1 + R2 ≥ H(X, Y)
• Core idea: joint typicality
• Interpretation of corner point R1 =
H(X), R2 = H(Y|X)
– X can encode as usual
– Associate with each xn is a jointly
typical fan (however Y doesn’t know)
– Y sends the color (thus compression)
– Decoder uses the color to determine
the point in jointly typical fan
associated with xn
• Straight line: achieved by timesharing
xn
yn
227
Wyner-Ziv Coding
•
•
•
•
Distributed source coding with side information
Y is encoded at rate R2
Only X to be recovered
How many bits R1 are
required?
• If R2 = H(Y), then R1 = H(X|Y) by Slepian-Wolf coding
• In general
  R1 ≥ H(X | U),   R2 ≥ I(Y; U)
where U is an auxiliary random variable (can be
thought of as approximate version of Y)
228
Rate-Distortion
• Given Y at the decoder, what is the rate-distortion function to describe X?
  RY(D) = min_{p(w|x)} min_f {I(X; W) – I(Y; W)}
  over all decoding functions f : Y × W → X̂ and all p(w | x) such that E d(x, x̂) ≤ D
  (compare the single-source R(D) = min I(x; x̂) over p(x̂ | x) with E d(x, x̂) ≤ D)
• The general problem of
rate-distortion for
correlated sources
remains unsolved
229
Lecture 17
• Network information theory – II
– Broadcast
– Relay
– Interference channel
– Two-way channel
– Comments on general communication
networks
230
Broadcast Channel
X
Y2
Y1
• One-to-many: HDTV station sending different
information simultaneously to many TV receivers
over a common channel; lecturer in classroom
• What are the achievable rates for all different
receivers?
• How does the sender encode information meant
for different signals in a common signal?
• Only partial answers are known.
231
Two-User Broadcast Channel
• Consider a memoryless broadcast channel with
one encoder and two decoders
• Independent messages at rate R1 and R2
• Degraded broadcast channel: p(y1, y2|x) = p(y1|x)
p(y2| y1)
– Meaning X Y1 Y2 (Markov chain)
– Y2 is a degraded version of Y1 (receiver 1 is better)
• Capacity region of the degraded broadcast channel:
  R2 ≤ I(U; Y2),   R1 ≤ I(X; Y1 | U)
  where U is an auxiliary random variable
232
Scalar Gaussian Broadcast Channel
• All scalar Gaussian broadcast channels belong to the class of degraded channels
  Y1 = X + Z1,  Y2 = X + Z2;  assume noise variances N1 < N2
• Capacity region (0 ≤ α ≤ 1):
  R1 ≤ C(αP/N1)
  R2 ≤ C((1–α)P/(αP + N2))
Coding Strategy
  Encoding: one codebook with power αP at rate R1, another with power (1–α)P at rate R2; send the sum of the two codewords
  Decoding: the bad receiver Y2 treats the rate-R1 codeword as noise; the good receiver Y1 first decodes Y2's message, subtracts it out, then decodes his own message
233
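A short sketch (my own, with example parameters that are not from the notes) sweeps the power split α and evaluates the corresponding rate pairs on the boundary of this region:

```python
# Sketch (my own): boundary of the degraded Gaussian broadcast capacity region.
import numpy as np

def C(x):
    return 0.5 * np.log2(1 + x)

P, N1, N2 = 10.0, 1.0, 4.0            # receiver 1 is the better one (N1 < N2)
for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
    R1 = C(alpha * P / N1)
    R2 = C((1 - alpha) * P / (alpha * P + N2))
    print(f"alpha = {alpha:4.2f}:  R1 = {R1:.3f}, R2 = {R2:.3f} bits/use")
```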
Relay Channel
• One source, one destination, one or more
intermediate relays
• Example: one relay
– A broadcast channel (X to Y and Y1)
– A multi-access channel (X and X1 to Y)
– Capacity is unknown! Upper bound:
  C ≤ sup_{p(x,x1)} min{ I(X, X1; Y), I(X; Y, Y1 | X1) }
– Max-flow min-cut interpretation
  • First term: maximum rate from X and X1 to Y
  • Second term: maximum rate from X to Y and Y1
234
Degraded Relay Channel
• In general, the max-flow min-cut bound cannot
be achieved
• Reason
– Interference
– What for the relay to forward?
– How to forward?
• Capacity is known for the degraded relay channel (i.e., Y is a degradation of Y1, or the relay is better than the receiver); there the upper bound is achieved:
  C = sup_{p(x,x1)} min{ I(X, X1; Y), I(X; Y, Y1 | X1) }
235
Gaussian Relay Channel
• Channel model
  Y1 = X + Z1,            Var(Z1) = N1
  Y = X + Z1 + X1 + Z2,   Var(Z2) = N2
  X has power P;  X1 has power P1
• Encoding at the relay:  X1,i = fi(Y1,1, Y1,2, …, Y1,i–1)
• Capacity
  C = max_{0≤α≤1} min{ C((P + P1 + 2√((1–α)PP1))/(N1 + N2)),  C(αP/N1) }
  [first term: relay → destination SNR;  second term: source → relay SNR]
  – If P1/N2 ≥ P/N1, then C = C(P/N1) (the capacity from source to relay can be achieved; exercise)
  – The rate C(P/(N1 + N2)) without the relay is increased by the relay to C(P/N1)
236
Interference Channel
• Two senders, two receivers, with crosstalk
– Y1 listens to X1 and doesn’t care
what X2 speaks or what Y2 hears
– Similarly with X2 and Y2
• Neither a broadcast channel nor a multiaccess channel
• This channel has not been solved
– Capacity is known to within one bit (Etkin, Tse, Wang
2008)
– A promising technique: interference alignment (Cadambe, Jafar 2008)
237
Symmetric Interference Channel
• Model
  Y1 = X1 + aX2 + Z1
  Y2 = X2 + aX1 + Z2
  equal power P;   Var(Z1) = Var(Z2) = N
• Capacity has been derived in the strong interference case (a ≥ 1) (Han, Kobayashi, 1981)
  – Very strong interference (a² ≥ 1 + P/N) is equivalent to no interference whatsoever
• Symmetric capacity (for each user R1 = R2)
  SNR = P/N,   INR = a²P/N
238
Capacity
239
Very strong interference = no interference
• Each sender has power P and rate C(P/N)
• Independently sends a codeword from a
Gaussian codebook
• Consider receiver 1
– Treats sender 1 as interference
– Can decode sender 2 at rate C(a²P/(P+N))
– If C(a²P/(P+N)) ≥ C(P/N), i.e., rate 2→1 ≥ rate 2→2 (the cross link is better),
  he can perfectly decode sender 2
– Subtracting it from received signal, he sees a clean
channel with capacity C(P/N)
240
An Example
• Two cell-edge users (bottleneck of the cellular network)
• No exchange of data between the base stations or
between the mobiles
• Traditional approaches
– Orthogonalizing the two links (reuse ½)
– Universal frequency reuse and treating interference as noise
• Higher capacity can be achieved by advanced
interference management
241
Two-Way Channel
• Similar to interference channel, but in both
directions (Shannon 1961)
• Feedback
– Sender 1 can use previously
received symbols from sender 2, and vice versa
– They can cooperate with each other
• Gaussian channel:
– Capacity region is known (not the case in general)
– Decompose into two independent channels
  R1 < C(P1/N1),   R2 < C(P2/N2)
Coding strategy: Sender 1 sends a codeword;
so does sender 2. Receiver 2 receives a sum
but he can subtract out his own thus having an
interference-free channel from sender 1.
242
General Communication Network
• Many nodes trying to communicate with each other
• Allows computation at each node using it own message
and all past received symbols
• All the models we have considered are special cases
• A comprehensive theory of network information flow is
yet to be found
243
Capacity Bound for a Network
• Max-flow min-cut
– Minimizing the maximum
flow across cut sets
yields an upper bound
on the capacity of a
network
• Outer bound on capacity
region
  Σ_{i∈S, j∈S^c} R(i,j) ≤ I(X^(S); Y^(S^c) | X^(S^c))
– Not achievable in general
244
Questions to Answer
• Why multi-hop relay? Why
decode and forward? Why
treat interference as noise?
• Source-channel separation?
Feedback?
• What is really the best way
to operate wireless
networks?
• What are the ultimate limits
to information transfer over
wireless networks?
245
Scaling Law for Wireless Networks
• High signal attenuation: (transport) capacity is O(n) bit-meters/sec for a planar network with n nodes (Xie-Kumar '04)
• Low attenuation: capacity can
grow superlinearly
• Requires cooperation between
nodes
• Multi-hop relay is suboptimal
but order optimal
246
Network Coding
• Routing: store and forward
(as in Internet)
• Network coding: recompute
and redistribute
• Given the network topology,
coding can increase
capacity (Ahlswede, Cai, Li,
Yeung, 2000)
– Doubled capacity for butterfly
network
• Active area of research
B
A
Butterfly Network
247
Lecture 18
• Revision Lecture
248
Summary (1)
• Entropy: H(x) = –Σ_{x∈X} p(x) log2 p(x) = E[–log2 pX(x)]
  – Bounds: 0 ≤ H(x) ≤ log|X|
  – Conditioning reduces entropy: H(y | x) ≤ H(y)
  – Chain Rule: H(x1:n) = Σ_{i=1}^n H(xi | x1:i–1) ≤ Σ_{i=1}^n H(xi)
    H(x1:n | y1:n) ≤ Σ_{i=1}^n H(xi | yi)
• Relative Entropy: D(p || q) = Ep log(p(x)/q(x)) ≥ 0
249
Summary (2)
• Mutual Information:
  I(y; x) = H(y) – H(y | x) = H(x) + H(y) – H(x, y) = D(px,y || px py)
  [Venn diagram: H(x, y) = H(x | y) + I(x; y) + H(y | x)]
  – Positive and symmetrical: I(x; y) = I(y; x) ≥ 0
  – x, y indep ⇔ H(x, y) = H(y) + H(x) ⇔ I(x; y) = 0
  – Chain Rule: I(x1:n; y) = Σ_{i=1}^n I(xi; y | x1:i–1)
    xi independent ⇒ I(x1:n; y1:n) ≥ Σ_{i=1}^n I(xi; yi)
    p(yi | x1:n, y1:i–1) = p(yi | xi) ⇒ I(x1:n; y1:n) ≤ Σ_{i=1}^n I(xi; yi)    [n-use DMC capacity]
250
Summary (3)
• Convexity: f″(x) ≥ 0 ⇒ f(x) convex ⇒ E f(x) ≥ f(E x)
  – H(p) concave in p
  – I(x; y) concave in px for fixed py|x
  – I(x; y) convex in py|x for fixed px
• Markov: x → y → z ⇔ p(z | x, y) = p(z | y) ⇔ I(x; z | y) = 0
  ⇒ I(x; y) ≥ I(x; z) and I(x; y) ≥ I(x; y | z)
• Fano: x → y → x̂ ⇒ p(x̂ ≠ x) ≥ (H(x | y) – 1) / log(|X| – 1)
• Entropy Rate:
  – Stationary process: H(X) = lim_{n→∞} n^–1 H(x1:n) = lim_{n→∞} H(xn | x1:n–1)
  – Markov process: H(X) = lim_{n→∞} H(xn | xn–1)   (= H(x2 | x1) if stationary)
251
Summary (4)
• Kraft: uniquely decodable ⇒ Σ_{i=1}^{|X|} D^–li ≤ 1 ⇒ ∃ instantaneous code
• Average Length: uniquely decodable ⇒ LC = E l(x) ≥ HD(x)
• Shannon-Fano: top-down 50% splits. LSF ≤ HD(x) + 1
• Huffman: bottom-up design. Optimal. LH ≤ HD(x) + 1
  – Designing with wrong probabilities q ⇒ penalty of D(p||q)
  – Long blocks disperse the 1-bit overhead
• Lempel-Ziv Coding:
  – Does not depend on source distribution
  – Efficient algorithm widely used
  – Approaches entropy rate for stationary ergodic sources
252
Summary (5)
• Typical Set
  – Individual prob: x ∈ Tε(n) ⇒ |log p(x) + nH(x)| ≤ nε
  – Total prob: p(x ∈ Tε(n)) > 1 – ε for n > Nε
  – Size: (1–ε) 2^{n(H(x)–ε)} ≤ |Tε(n)| ≤ 2^{n(H(x)+ε)} for n > Nε
  – No other high-probability set can be much smaller
• Asymptotic Equipartition Principle
  – Almost all event sequences are equally surprising
253
Summary (6)
• DMC Channel Capacity: C = max_{px} I(x; y)
• Coding Theorem
– Can achieve capacity: random codewords, joint typical decoding
– Cannot beat capacity: Fano inequality
• Feedback doesn’t increase capacity of DMC but could
simplify coding/decoding
• Joint Source-Channel Coding doesn’t increase capacity of
DMC
254
Summary (7)
• Differential Entropy: h(x) = –E log fx(x)
  – Not necessarily positive
  – h(x+a) = h(x),   h(ax) = h(x) + log|a|,   h(x|y) ≤ h(x)
  – I(x; y) = h(x) + h(y) – h(x, y) ≥ 0,   D(f||g) = E log(f/g) ≥ 0
• Bounds:
  – Finite range: uniform distribution has max: h(x) = log(b–a)
  – Fixed covariance: Gaussian has max: h(x) = ½ log((2πe)ⁿ|K|)
• Gaussian Channel
  – Discrete time: C = ½ log(1 + PN^–1)
  – Bandlimited: C = W log(1 + P N0^–1 W^–1)
    • For constant C: Eb/N0 = P C^–1 N0^–1 = (W/C)(2^{C/W} – 1) ≥ ln 2 = –1.6 dB
  – Feedback: adds at most ½ bit for coloured noise
255
Summary (8)
• Parallel Gaussian Channels: total power constraint Σ Pi ≤ nP
  – White noise: water-filling: Pi = max(v – Ni, 0)
  – Correlated noise: water-fill on the noise eigenvalues
• Rate Distortion: R(D) = min_{p(x̂|x) s.t. E d(x,x̂) ≤ D} I(x; x̂)
  – Bernoulli source with Hamming d: R(D) = max(H(px) – H(D), 0)
  – Gaussian source with mean square d: R(D) = max(½ log(σ²D^–1), 0)
  – Can encode at rate R: random decoder, joint typical encoder
  – Can't encode below rate R: independence bound
256
Summary (9)
• Gaussian multiple access channel (C(x) = ½ log(1 + x)):
  R1 < C(P1/N),   R2 < C(P2/N),   R1 + R2 < C((P1 + P2)/N)
• Distributed source coding – Slepian-Wolf coding:
  R1 ≥ H(X | Y),   R2 ≥ H(Y | X),   R1 + R2 ≥ H(X, Y)
• Scalar Gaussian broadcast channel (0 ≤ α ≤ 1):
  R1 ≤ C(αP/N1),   R2 ≤ C((1–α)P/(αP + N2))
• Gaussian relay channel:
  C = max_{0≤α≤1} min{ C((P + P1 + 2√((1–α)PP1))/(N1 + N2)),  C(αP/N1) }
257
Summary (10)
• Interference channel
– Very strong interference = no interference
• Gaussian two-way channel
– Decompose into two independent channels
• General communication network
– Max-flow min-cut theorem
– Not achievable in general
– But achievable for multiple access channel
and Gaussian relay channel