Information Theory - Communications and signal processing

Information Theory
Lecturer: Cong Ling
Office: 815
2
Lectures
Introduction
1 Introduction - 1
Entropy Properties
2 Entropy I - 19
3 Entropy II – 32
Lossless Source Coding
4 Theorem - 47
5 Algorithms - 60
Channel Capacity
6 Data Processing Theorem - 76
7 Typical Sets - 86
8 Channel Capacity - 98
9 Joint Typicality - 112
10 Coding Theorem - 123
11 Separation Theorem – 131
Continuous Variables
12 Differential Entropy - 143
13 Gaussian Channel - 158
14 Parallel Channels – 171
Lossy Source Coding
15 Rate Distortion Theory - 184
Network Information Theory
16 NIT I - 214
17 NIT II - 229
Revision
18 Revision - 247
19
20
3
Claude Shannon
• C. E. Shannon, ”A mathematical theory
of communication,” Bell System
Technical Journal, 1948.
• Two fundamental questions in
communication theory:
• Ultimate limit on data compression
– entropy
• Ultimate transmission rate of
communication
– channel capacity
• Almost all important topics in
information theory were initiated by
Shannon
1916 - 2001
4
Origin of Information Theory
• Common wisdom in 1940s:
– It is impossible to send information error-free at a positive rate
– Error control by using retransmission: rate → 0 if error-free
• Still in use today
– ARQ (automatic repeat request) in TCP/IP computer networking
• Shannon showed reliable communication is possible for
all rates below channel capacity
• As long as source entropy is less than channel capacity,
asymptotically error-free communication can be achieved
• And anything can be represented in bits
– Rise of digital information technology
5
Relationship to Other Fields
6
Course Objectives
• In this course we will (focus on communication
theory):
– Define what we mean by information.
– Show how we can compress the information in a
source to its theoretically minimum value and show
the tradeoff between data compression and
distortion.
– Prove the channel coding theorem and derive the
information capacity of different channels.
– Generalize from point-to-point to network information
theory.
7
Relevance to Practice
• Information theory suggests means of achieving
ultimate limits of communication
– Unfortunately, these theoretically optimum schemes
are computationally impractical
– So some say “little info, much theory” (wrong)
• Today, information theory offers useful
guidelines to design of communication systems
– Turbo code (approaches channel capacity)
– CDMA (has a higher capacity than FDMA/TDMA)
– Channel-coding approach to source coding (duality)
– Network coding (goes beyond routing)
8
Books/Reading
Book of the course:
• Elements of Information Theory by T M Cover & J A Thomas, Wiley,
£39 for 2nd ed. 2006, or £14 for 1st ed. 1991 (Amazon)
Free references
• Information Theory and Network Coding by R. W. Yeung, Springer
http://iest2.ie.cuhk.edu.hk/~whyeung/book2/
• Information Theory, Inference, and Learning Algorithms by D
MacKay, Cambridge University Press
http://www.inference.phy.cam.ac.uk/mackay/itila/
• Lecture Notes on Network Information Theory by A. E. Gamal and
Y.-H. Kim, (Book is published by Cambridge University Press)
http://arxiv.org/abs/1001.3404
• C. E. Shannon, ”A mathematical theory of communication,” Bell
System Technical Journal, Vol. 27, pp. 379–423, 623–656, July,
October, 1948.
9
Other Information
• Course webpage:
http://www.commsp.ee.ic.ac.uk/~cling
• Assessment: Exam only – no coursework.
• Students are encouraged to do the problems in
problem sheets.
• Background knowledge
– Mathematics
– Elementary probability
• Needs intellectual maturity
– Doing problems is not enough; spend some time
thinking
10
Notation
• Vectors and matrices
– v=vector, V=matrix
• Scalar random variables
– x = R.V, x = specific value, X = alphabet
• Random column vector of length N
– x = R.V, x = specific value, XN = alphabet
– xi and xi are particular vector elements
• Ranges
– a:b denotes the range a, a+1, …, b
• Cardinality
– |X| = the number of elements in set X
11
Discrete Random Variables
• A random variable x takes a value x from
the alphabet X with probability px(x). The
vector of probabilities is px .
Examples:
X = [1;2;3;4;5;6], px = [1/6; 1/6; 1/6; 1/6; 1/6; 1/6]
pX is a “probability mass vector”
“english text”
X = [a; b;…, y; z; <space>]
px = [0.058; 0.013; …; 0.016; 0.0007; 0.193]
Note: we normally drop the subscript from px if unambiguous
12
Expected Values
• If g(x) is a function defined on X then
E_x g(x) = Σ_{x∈X} p(x) g(x)    (we often write E for E_x)
Examples:
X = [1;2;3;4;5;6], px = [1/6; 1/6; 1/6; 1/6; 1/6; 1/6]
E x = 3.5 = μ
E x² = 15.17 = σ² + μ²
E sin(0.1x) = 0.338
E −log₂(p(x)) = 2.58
This is the “entropy” of X
13
Shannon Information Content
• The Shannon Information Content of an outcome
with probability p is –log2p
• Shannon’s contribution – a statistical view
– Messages, noisy channels are random
– Pre-Shannon era: deterministic approach (Fourier…)
• Example 1: Coin tossing
– X = [Head; Tail], p = [½; ½], SIC = [1; 1] bits
• Example 2: Is it my birthday ?
– X = [No; Yes], p = [364/365; 1/365],
SIC = [0.004; 8.512] bits
Unlikely outcomes give more information
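The SIC values above are easy to reproduce; the following minimal Python sketch (an illustration, not part of the original slides) uses only the standard library.

```python
import math

def sic(p):
    """Shannon information content, -log2(p), in bits."""
    return -math.log2(p)

# Example 1: coin tossing
print([round(sic(p), 3) for p in (0.5, 0.5)])        # [1.0, 1.0]
# Example 2: "is it my birthday?"
print([round(sic(p), 3) for p in (364/365, 1/365)])  # [0.004, 8.512]
```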
14
Minesweeper
• Where is the bomb ?
• 16 possibilities – needs 4 bits to specify
Guess   Prob    Answer   SIC
1.      15/16   No       0.093 bits
2.      14/15   No       0.100 bits
3.      13/14   No       0.107 bits
4.      1/13    Yes      3.700 bits
Total                    4.000 bits

SIC = –log2 p
16
Entropy
H (x )  E  log 2 ( px (x ))    px ( x) log 2 px ( x)
xx
– H(x) = the average Shannon Information Content of x
– H(x) = the average information gained by knowing its value
– the average number of “yes-no” questions needed to find x is in
the range [H(x),H(x)+1)
– H(x) = the amount of uncertainty before we know its value
We use log(x) ≡ log₂(x) and measure H(x) in bits
– if you use logₑ it is measured in nats
– 1 nat = log₂(e) bits = 1.44 bits
• Useful identities:  log₂(x) = ln(x)/ln(2),   d(log₂ x)/dx = (log₂ e)/x
H(X ) depends only on the probability vector pX not on the alphabet X,
so we can write H(pX)
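As a quick numerical check of this definition, the short Python sketch below (illustrative, not part of the course material) computes H for a fair die and a biased coin.

```python
import math

def entropy(p):
    """Entropy in bits of a probability mass vector p (using 0 log 0 = 0)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

print(round(entropy([1/6] * 6), 3))     # 2.585 bits: the fair-die value quoted above
print(round(entropy([0.75, 0.25]), 3))  # 0.811 bits for a biased coin
```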
17
Entropy Examples
(1) Bernoulli Random Variable
X = [0;1], px = [1−p; p]
H(x) = −(1−p)log(1−p) − p log p
Very common – we write H(p) to mean H([1−p; p]).
Maximum is H(p) = 1 bit when p = 1/2
[Figure: plot of H(p) against p for 0 ≤ p ≤ 1]
H(p) = −(1−p)log(1−p) − p log p
H′(p) = log(1−p) − log p
H″(p) = −p⁻¹(1−p)⁻¹ log e

(2) Four Coloured Shapes
X = [four coloured shapes], px = [½; ¼; 1/8; 1/8]
H(x) = H(px) = −Σ log(p(x)) p(x)
     = 1×½ + 2×¼ + 3×1/8 + 3×1/8 = 1.75 bits
18
Comments on Entropy
• Entropy plays a central role in information
theory
• Origin in thermodynamics
– S = k ln, k: Boltzman’s constant, : number of
microstates
– The second law: entropy of an isolated system is nondecreasing
• Shannon entropy
– Agrees with intuition: additive, monotonic, continuous
– Logarithmic measure could be derived from an
axiomatic approach (Shannon 1948)
19
Lecture 2
• Joint and Conditional Entropy
– Chain rule
• Mutual Information
– If x and y are correlated, their mutual information is
the average information that y gives about x
• E.g. Communication Channel: x transmitted but y received
• It is the amount of information transmitted through the
channel
• Jensen’s Inequality
20
Joint and Conditional Entropy
Joint Entropy: H(x,y)
p(x,y):        y=0   y=1
      x=0      ½     ¼
      x=1      0     ¼
H(x,y) = E[−log p(x,y)]
       = −½ log ½ − ¼ log ¼ − 0 log 0 − ¼ log ¼ = 1.5 bits
Note: 0 log 0 = 0

Conditional Entropy: H(y|x)
p(y|x):        y=0   y=1      (rows sum to 1)
      x=0      2/3   1/3
      x=1      0     1
H(y|x) = E[−log p(y|x)] = −Σ_{x,y} p(x,y) log p(y|x)
       = −½ log(2/3) − ¼ log(1/3) − 0 log 0 − ¼ log 1 = 0.689 bits
21
Conditional Entropy – View 1
Additional Entropy view:
p(x,y):        y=0   y=1   p(x)
      x=0      ½     ¼     ¾
      x=1      0     ¼     ¼
p(y|x) = p(x,y) ÷ p(x)
H(y|x) = E[−log p(y|x)]
       = E[−log p(x,y)] − E[−log p(x)]
       = H(x,y) − H(x) = H(½, ¼, 0, ¼) − H(¾, ¼) = 0.689 bits
H(y|x) is the average additional information in y when you know x
[Diagram: H(x,y) split into H(x) and H(y|x)]
22
Conditional Entropy – View 2
Average Row Entropy view:
p(x,y):        y=0   y=1   H(y|x=x)   p(x)
      x=0      ½     ¼     H(1/3)     ¾
      x=1      0     ¼     H(1)       ¼
H(y|x) = E[−log p(y|x)] = −Σ_{x,y} p(x,y) log p(y|x)
       = −Σ_{x,y} p(x) p(y|x) log p(y|x) = Σ_{x∈X} p(x) [ −Σ_{y∈Y} p(y|x) log p(y|x) ]
       = Σ_{x∈X} p(x) H(y|x=x) = ¾ × H(1/3) + ¼ × H(0) = 0.689 bits
Take a weighted average of the entropy of each row, using p(x) as the weight
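Both views can be checked numerically. The sketch below (illustrative only) uses the joint table above and computes H(x,y) and H(y|x) by the two routes.

```python
import math

def H(p):
    return -sum(v * math.log2(v) for v in p if v > 0)

# Joint distribution p(x,y) from the slide: rows are x = 0,1; columns are y = 0,1
pxy = [[0.5, 0.25],
       [0.0, 0.25]]
joint = [v for row in pxy for v in row]
px = [sum(row) for row in pxy]

H_xy = H(joint)             # 1.5 bits
H_y_given_x = H_xy - H(px)  # View 1: H(y|x) = H(x,y) - H(x) = 0.689 bits
# View 2: weighted average of the row entropies
H_y_given_x_v2 = sum(px[i] * H([v / px[i] for v in pxy[i]])
                     for i in range(2) if px[i] > 0)
print(round(H_xy, 3), round(H_y_given_x, 3), round(H_y_given_x_v2, 3))
```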
23
Chain Rules
• Probabilities
p(x,y,z) = p(z|x,y) p(y|x) p(x)
• Entropy
H(x,y,z) = H(z|x,y) + H(y|x) + H(x)
H(x_{1:n}) = Σ_{i=1}^n H(x_i | x_{1:i−1})
The log in the definition of entropy converts products of
probability into sums of entropy
24
Mutual Information
Mutual information is the average amount of
information that you get about x from observing
the value of y
- Or the reduction in the uncertainty of x due to knowledge of y

I(x;y) = H(x) − H(x|y) = H(x) + H(y) − H(x,y)
  H(x) = information in x;  H(x|y) = information in x when you already know y

Mutual information is symmetrical:  I(x;y) = I(y;x)
[Venn diagram: H(x,y) partitioned into H(x|y), I(x;y), H(y|x); the circles H(x) and H(y) overlap in I(x;y)]
Use ";" to avoid ambiguities between I(x;y,z) and I(x,y;z)
25
Mutual Information Example
p(x,y):        y=0   y=1
      x=0      ½     ¼
      x=1      0     ¼
• If you try to guess y you have a 50% chance of being correct.
• However, what if you know x?
  – Best guess: choose y = x
  – If x=0 (p=0.75) then 66% correct prob
  – If x=1 (p=0.25) then 100% correct prob
  – Overall 75% correct probability

I(x;y) = H(x) − H(x|y) = H(x) + H(y) − H(x,y)
H(x) = 0.811,  H(y) = 1,  H(x,y) = 1.5   ⇒   I(x;y) = 0.311
[Diagram: H(x,y)=1.5 split into H(x|y)=0.5, I(x;y)=0.311, H(y|x)=0.689; H(x)=0.811, H(y)=1]
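The same joint table gives the mutual information directly; a small sketch (again illustrative, not from the slides):

```python
import math

def H(p):
    return -sum(v * math.log2(v) for v in p if v > 0)

pxy = [[0.5, 0.25], [0.0, 0.25]]
px = [sum(row) for row in pxy]                        # [0.75, 0.25]
py = [sum(col) for col in zip(*pxy)]                  # [0.5, 0.5]
I = H(px) + H(py) - H([v for r in pxy for v in r])    # I(x;y) = H(x)+H(y)-H(x,y)
print(round(H(px), 3), round(H(py), 3), round(I, 3))  # 0.811 1.0 0.311
```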
26
Conditional Mutual Information
Conditional Mutual Information
I(x;y|z) = H(x|z) − H(x|y,z)
         = H(x|z) + H(y|z) − H(x,y|z)
Note: the z conditioning applies to both x and y
Chain Rule for Mutual Information
I(x₁,x₂,x₃; y) = I(x₁;y) + I(x₂;y|x₁) + I(x₃;y|x₁,x₂)
I(x_{1:n}; y) = Σ_{i=1}^n I(x_i; y | x_{1:i−1})
27
Review/Preview
[Venn diagram: H(x,y) split into H(x|y), I(x;y), H(y|x); circles H(x), H(y)]
• Entropy:
  H(x) = −Σ_{x∈X} p(x) log₂ p(x) = E[−log₂ p_X(x)]
  – Positive and bounded:  0 ≤ H(x) ≤ log|X|  (*)
• Chain Rule:
  H(x,y) = H(x) + H(y|x) ≤ H(x) + H(y)  (*)
  – Conditioning reduces entropy:  H(y|x) ≤ H(y)  (*)
• Mutual Information:
  I(y;x) = H(y) − H(y|x) = H(x) + H(y) − H(x,y)
  – Positive and symmetrical:  I(x;y) = I(y;x) ≥ 0  (*)
  – x and y independent ⇔ H(x,y) = H(y) + H(x) ⇔ I(x;y) = 0
(*) = inequalities not yet proved
28
Convex & Concave functions
f(x) is strictly convex over (a,b) if
  f(λu + (1−λ)v) < λ f(u) + (1−λ) f(v)   ∀ u ≠ v ∈ (a,b), 0 < λ < 1
– every chord of f(x) lies above f(x)
– f(x) is concave ⇔ −f(x) is convex
• Examples
  – Strictly convex: x², x⁴, eˣ, x log x [x ≥ 0]
  – Strictly concave: log x, √x [x ≥ 0]
  – Both convex and concave: x
• Test:  d²f/dx² > 0 ∀ x ∈ (a,b)  ⇒  f(x) is strictly convex
"convex" (not strictly) uses "≤" in the definition and "≥ 0" in the test
29
Jensen’s Inequality
Jensen's Inequality:
(a) f(x) convex ⇒ E f(x) ≥ f(E x)
(b) f(x) strictly convex ⇒ E f(x) > f(E x) unless x is constant

Proof by induction on |X|:
– |X| = 1:  E f(x) = f(x₁) = f(E x)
– |X| = k:  assume Jensen's inequality is true for |X| = k−1. The weights p_i/(1−p_k), i = 1..k−1, sum to 1, so
  E f(x) = Σ_{i=1}^{k} p_i f(x_i) = p_k f(x_k) + (1−p_k) Σ_{i=1}^{k−1} (p_i/(1−p_k)) f(x_i)
         ≥ p_k f(x_k) + (1−p_k) f( Σ_{i=1}^{k−1} (p_i/(1−p_k)) x_i )          [Jensen for k−1 points]
         ≥ f( p_k x_k + (1−p_k) Σ_{i=1}^{k−1} (p_i/(1−p_k)) x_i ) = f(E x)
The last step follows from the definition of convexity for a two-mass-point distribution.
30
Jensen’s Inequality Example
Mnemonic example:
f(x) = x²: strictly convex
X = [−1; +1], p = [½; ½]
E x = 0,  f(E x) = 0
E f(x) = 1 > f(E x)
[Figure: parabola f(x) = x² with the chord between x = −1 and x = +1]
31
Summary
[Venn diagram: H(x,y) split into H(x|y), I(x;y), H(y|x)]
• Chain Rule:  H(x,y) = H(y|x) + H(x)
• Conditional Entropy:  H(y|x) = H(x,y) − H(x) = Σ_{x∈X} p(x) H(y|x=x)
  – Conditioning reduces entropy:  H(y|x) ≤ H(y)  (*)
• Mutual Information:  I(x;y) = H(x) − H(x|y) ≤ H(x)  (*)
  – In communications, mutual information is the amount of information transmitted through a noisy channel
• Jensen's Inequality:  f(x) convex ⇒ E f(x) ≥ f(E x)
(*) = inequalities not yet proved
32
Lecture 3
• Relative Entropy
– A measure of how different two probability mass
vectors are
• Information Inequality and its consequences
– Relative Entropy is always positive
• Mutual information is positive
• Uniform bound
• Conditioning and correlation reduce entropy
• Stochastic Processes
– Entropy Rate
– Markov Processes
33
Relative Entropy
Relative Entropy or Kullback-Leibler Divergence
between two probability mass vectors p and q
D(p||q) = Σ_{x∈X} p(x) log( p(x)/q(x) ) = E_p log( p(x)/q(x) ) = E_p[−log q(x)] − H(x)
where E_p denotes an expectation performed using probabilities p
D(p||q) measures the "distance" between the probability mass functions p and q.
We must have p_i = 0 whenever q_i = 0, else D(p||q) = ∞
Beware: D(p||q) is not a true distance because:
– (1) it is asymmetric between p and q, and
– (2) it does not satisfy the triangle inequality.
34
Relative Entropy Example
X = [1 2 3 4 5 6]ᵀ
p = [1/6  1/6  1/6  1/6  1/6  1/6]   ⇒  H(p) = 2.585
q = [1/10 1/10 1/10 1/10 1/10 1/2]   ⇒  H(q) = 2.161
D(p||q) = E_p(−log q(x)) − H(p) = 2.935 − 2.585 = 0.35
D(q||p) = E_q(−log p(x)) − H(q) = 2.585 − 2.161 = 0.424
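A short sketch reproducing both divergences (assuming, as reconstructed above, that q places 1/2 on the last symbol):

```python
import math

def D(p, q):
    """Relative entropy D(p||q) in bits; requires p_i = 0 wherever q_i = 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [1/6] * 6
q = [1/10] * 5 + [1/2]
print(round(D(p, q), 3), round(D(q, p), 3))   # 0.35 0.424
```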
35
Information Inequality
Information (Gibbs') Inequality:  D(p||q) ≥ 0
• Define A = {x : p(x) > 0} ⊆ X
• Proof:
  −D(p||q) = −Σ_{x∈A} p(x) log( p(x)/q(x) ) = Σ_{x∈A} p(x) log( q(x)/p(x) )
           ≤ log( Σ_{x∈A} p(x) q(x)/p(x) )                      [Jensen's inequality]
           = log( Σ_{x∈A} q(x) ) ≤ log( Σ_{x∈X} q(x) ) = log 1 = 0
If D(p||q) = 0: since log( ) is strictly concave we have equality in the proof only if
q(x)/p(x), the argument of log, equals a constant. But
  Σ_{x∈X} p(x) = Σ_{x∈X} q(x) = 1
so the constant must be 1 and p ≡ q.
36
Information Inequality Corollaries
• Uniform distribution has highest entropy
  – Set q = [|X|⁻¹, …, |X|⁻¹]ᵀ, giving H(q) = log|X| bits
  D(p||q) = E_p[−log q(x)] − H(p) = log|X| − H(p) ≥ 0
• Mutual Information is non-negative
  I(y;x) = H(x) + H(y) − H(x,y) = E log[ p(x,y) / (p(x)p(y)) ]
         = D( p(x,y) || p(x)p(y) ) ≥ 0
  with equality only if p(x,y) ≡ p(x)p(y), i.e. x and y are independent.
37
More Corollaries
• Conditioning reduces entropy
  0 ≤ I(x;y) = H(y) − H(y|x)  ⇒  H(y|x) ≤ H(y)
  with equality only if x and y are independent.
• Independence Bound
  H(x_{1:n}) = Σ_{i=1}^n H(x_i | x_{1:i−1}) ≤ Σ_{i=1}^n H(x_i)
  with equality only if all x_i are independent.
  E.g.: if all x_i are identical, H(x_{1:n}) = H(x₁)
38
Conditional Independence Bound
• Conditional Independence Bound
  H(x_{1:n} | y_{1:n}) = Σ_{i=1}^n H(x_i | x_{1:i−1}, y_{1:n}) ≤ Σ_{i=1}^n H(x_i | y_i)
• Mutual Information Independence Bound
  If all x_i are independent or, by symmetry, if all y_i are independent:
  I(x_{1:n}; y_{1:n}) = H(x_{1:n}) − H(x_{1:n} | y_{1:n})
                      ≥ Σ_{i=1}^n H(x_i) − Σ_{i=1}^n H(x_i | y_i) = Σ_{i=1}^n I(x_i; y_i)
  E.g.: if n = 2 with x_i i.i.d. Bernoulli(p = 0.5) and y₁ = x₂, y₂ = x₁,
  then I(x_i; y_i) = 0 but I(x_{1:2}; y_{1:2}) = 2 bits.
39
Stochastic Process
Stochastic Process {x_i} = x₁, x₂, …
Entropy:  H({x_i}) = H(x₁) + H(x₂|x₁) + ⋯   (often infinite)
Entropy Rate:  H(X) = lim_{n→∞} n⁻¹ H(x_{1:n})   if the limit exists
– Entropy rate estimates the additional entropy per new sample.
– Gives a lower bound on the number of code bits per sample.
Examples:
– Typewriter with m equally likely letters each time: H(X) = log m
– x_i i.i.d. random variables: H(X) = H(x_i)
40
Stationary Process
Stochastic process {x_i} is stationary iff
  p(x_{1:n} = a_{1:n}) = p(x_{k+(1:n)} = a_{1:n})   ∀ k, n, a_i ∈ X
If {x_i} is stationary then H(X) exists and
  H(X) = lim_{n→∞} n⁻¹ H(x_{1:n}) = lim_{n→∞} H(x_n | x_{1:n−1})
Proof:
  0 ≤ H(x_n | x_{1:n−1}) ≤(a) H(x_n | x_{2:n−1}) =(b) H(x_{n−1} | x_{1:n−2})
  (a) conditioning reduces entropy, (b) stationarity
Hence H(x_n | x_{1:n−1}) is positive and decreasing ⇒ it tends to a limit, say b.
Hence
  H(x_k | x_{1:k−1}) → b  ⇒  n⁻¹ H(x_{1:n}) = n⁻¹ Σ_{k=1}^n H(x_k | x_{1:k−1}) → b = H(X)
(the Cesàro mean of a convergent sequence converges to the same limit)
41
Markov Process (Chain)
Discrete-valued stochastic process {x_i} is
• Independent iff  p(x_n | x_{0:n−1}) = p(x_n)
• Markov iff  p(x_n | x_{0:n−1}) = p(x_n | x_{n−1})
  – time-invariant iff p(x_n = b | x_{n−1} = a) = p_ab, independent of n
  – States and transition matrix: T = {t_ab}
    • Rows sum to 1: T1 = 1, where 1 is a vector of 1's
    • p_n = Tᵀ p_{n−1}
    • Stationary distribution: p$ = Tᵀ p$
[Figure: 4-state transition diagram with transition probabilities t_ab]
An independent stochastic process is the easiest to deal with; Markov is the next easiest.
42
Stationary Markov Process
If a Markov process is
a) irreducible: you can go from any state a to any state b in a finite number of steps
b) aperiodic: ∀ state a, the possible times to go from a back to a have highest common factor = 1
then it has exactly one stationary distribution, p$.
– p$ is the eigenvector of Tᵀ with eigenvalue λ = 1:  Tᵀ p$ = p$
– Tⁿ → 1 p$ᵀ as n → ∞, where 1 = [1 1 ⋯ 1]ᵀ
– The initial distribution becomes irrelevant (asymptotically stationary):
  (Tᵀ)ⁿ p₀ → p$ 1ᵀ p₀ = p$   ∀ p₀
43
Chess Board
Random walk on a 3×3 board
[Figure: successive distributions p₁ … p₈ of the random walk, with H(p_n) and H(p_n | p_{n−1}) for each frame]
• Move to any neighbouring square (horizontally, vertically or diagonally) with equal probability
• p₁ = [1 0 … 0]ᵀ   –   H(p₁) = 0
• p$ = 1/40 × [3 5 3 5 8 5 3 5 3]ᵀ   –   H(p$) = 3.0855
• H(X) = lim_{n→∞} H(x_n | x_{n−1})
       = lim_{n→∞} −Σ p(x_n, x_{n−1}) log p(x_n | x_{n−1}) = −Σ_{i,j} p$,i t_{i,j} log t_{i,j} = 2.2365
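The stationary distribution and entropy rate can be reproduced in a few lines of Python. This sketch assumes the move set reconstructed above (any neighbouring square of a 3×3 board, diagonals included); for a random walk on an undirected graph the stationary probability of a node is proportional to its degree.

```python
import math

cells = [(r, c) for r in range(3) for c in range(3)]

def neighbours(s):
    r, c = s
    return [(r + dr, c + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0) and 0 <= r + dr < 3 and 0 <= c + dc < 3]

deg = {s: len(neighbours(s)) for s in cells}
total = sum(deg.values())                      # 40
p_stat = {s: deg[s] / total for s in cells}    # stationary dist: proportional to degree

H_stat = -sum(p * math.log2(p) for p in p_stat.values())
# Entropy rate H(X) = -sum_{i,j} p$_i t_ij log t_ij; here t_ij = 1/deg(i) for each neighbour j
H_rate = sum(p_stat[s] * math.log2(deg[s]) for s in cells)
print(round(H_stat, 4), round(H_rate, 4))      # 3.0855 2.2365
```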
45
Chess Board Frames
[Figure: eight snapshots of the distribution p_n of the random walk, n = 1…8]
n    H(p_n)      H(p_n | p_{n−1})
1    0           0
2    1.58496     1.58496
3    3.10287     2.54795
4    2.99553     2.09299
5    3.111       2.30177
6    3.07129     2.20683
7    3.09141     2.24987
8    3.0827      2.23038
46
Summary
• Relative Entropy:  D(p||q) = E_p log( p(x)/q(x) ) ≥ 0
  – D(p||q) = 0 iff p ≡ q
• Corollaries
  – Uniform Bound: uniform p maximizes H(p)
  – I(x;y) ≥ 0  ⇒  conditioning reduces entropy
  – Independence bounds:
    H(x_{1:n}) ≤ Σ_{i=1}^n H(x_i)
    H(x_{1:n} | y_{1:n}) ≤ Σ_{i=1}^n H(x_i | y_i)
    I(x_{1:n}; y_{1:n}) ≥ Σ_{i=1}^n I(x_i; y_i)   if the x_i or the y_i are independent
• Entropy Rate of a stochastic process:
  – {x_i} stationary:  H(X) = lim_{n→∞} H(x_n | x_{1:n−1})
  – {x_i} stationary Markov:  H(X) = H(x_n | x_{n−1}) = −Σ_{i,j} p$,i t_{i,j} log t_{i,j}
47
Lecture 4
• Source Coding Theorem
– n i.i.d. random variables each with entropy H(X) can
be compressed into more than nH(X) bits as n tends
to infinity
• Instantaneous Codes
– Symbol-by-symbol coding
– Uniquely decodable
• Kraft Inequality
– Constraint on the code length
• Optimal Symbol Code lengths
– Entropy Bound
48
Source Coding
• Source Code: C is a mapping X → D+
  – x: a random variable (the message)
  – D+ = set of all finite length strings from D
  – D is often binary
  – e.g. {E, F, G} → {0,1}+ :  C(E)=0, C(F)=10, C(G)=11
• Extension: C+ is the mapping X+ → D+ formed by
  concatenating C(x_i) without punctuation
  – e.g. C+(EFEEGE) = 01000110
49
Desired Properties
• Non-singular:  x₁ ≠ x₂  ⇒  C(x₁) ≠ C(x₂)
– Unambiguous description of a single letter of X
• Uniquely Decodable: C+ is non-singular
– The sequence C+(x+) is unambiguous
– A stronger condition
– Any encoded string has only one possible source
string producing it
– However, one may have to examine the entire
encoded string to determine even the first source
symbol
– One could use punctuation between two codewords
but inefficient
50
Instantaneous Codes
• Instantaneous (or Prefix) Code
– No codeword is a prefix of another
– Can be decoded instantaneously without reference to
future codewords
• Instantaneous ⇒ Uniquely Decodable ⇒ Non-singular
Examples:
– C(E,F,G,H) = (0, 1, 00, 11):       not instantaneous, not uniquely decodable
– C(E,F) = (0, 101):                 instantaneous and uniquely decodable
– C(E,F) = (1, 101):                 not instantaneous, but uniquely decodable
– C(E,F,G,H) = (00, 01, 10, 11):     instantaneous and uniquely decodable
– C(E,F,G,H) = (0, 01, 011, 111):    not instantaneous, but uniquely decodable
51
Code Tree
Instantaneous code: C(E,F,G,H) = (00, 11, 100, 101)
Form a D-ary tree where D = |D|
– D branches at each node
– Each codeword is a leaf
– Each node along the path to a leaf is a prefix of the leaf ⇒ it can't be a leaf itself
– Some leaves may be unused
[Figure: binary code tree with leaves E=00, F=11, G=100, H=101]
Decoding example:  111011000000 → F H G E E
52
Kraft Inequality (instantaneous codes)
• Limit on codeword lengths of instantaneous codes
  – Not all codewords can be too short
• Codeword lengths l₁, l₂, …, l_|X| satisfy  Σ_{i=1}^{|X|} 2^{−l_i} ≤ 1
• Label each node at depth l with 2^{−l}
• Each node equals the sum of all its leaves
• Equality iff all leaves are utilised
• Total code budget = 1
  – Code 00 uses up ¼ of the budget
  – Code 100 uses up 1/8 of the budget
[Figure: code tree for (00, 11, 100, 101) with node weights ½, ¼, 1/8]
Same argument works with a D-ary tree:  Σ_{i=1}^{|X|} D^{−l_i} ≤ 1
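A two-line helper makes the budget argument concrete (an illustrative sketch, not part of the slides):

```python
def kraft_sum(lengths, D=2):
    """Sum of D^{-l_i}; an instantaneous (or uniquely decodable) code needs this <= 1."""
    return sum(D ** (-l) for l in lengths)

print(kraft_sum([2, 2, 3, 3]))   # 0.75 <= 1: lengths of (00, 11, 100, 101) are feasible
print(kraft_sum([1, 1, 2, 2]))   # 1.5  >  1: no instantaneous code has these lengths
```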
53
54
McMillan Inequality (uniquely decodable codes)
If a uniquely decodable code C has codeword lengths l₁, l₂, …, l_|X|, then
  Σ_{i=1}^{|X|} D^{−l_i} ≤ 1      (the same bound as the Kraft inequality)
Proof: let S = Σ_{i=1}^{|X|} D^{−l_i} and M = max_i l_i. Then for any N,
  S^N = ( Σ_{i=1}^{|X|} D^{−l_i} )^N = Σ_{i₁=1}^{|X|} ⋯ Σ_{i_N=1}^{|X|} D^{−(l_{i₁}+⋯+l_{i_N})}
      = Σ_{x∈X^N} D^{−length{C⁺(x)}}                               [sum over all sequences of length N]
      = Σ_{l=1}^{NM} D^{−l} | {x : l = length{C⁺(x)}} |             [re-order the sum by total length]
      ≤ Σ_{l=1}^{NM} D^{−l} D^{l} = NM
        [unique decodability ⇒ at most D^l distinct encoded sequences of length l]
If S > 1 then S^N > NM for some N, since S^N grows exponentially while NM grows only linearly. Hence S ≤ 1.
Implication: uniquely decodable codes don't offer any further reduction of codeword lengths compared with instantaneous codes.
55
How Short are Optimal Codes?
If l(x) = length(C(x)) then C is optimal if
L=E l(x) is as small as possible.
We want to minimize  L = Σ_{x∈X} p(x) l(x)  subject to
1.  Σ_{x∈X} D^{−l(x)} ≤ 1
2.  all the l(x) are integers
Simplified version:
Ignore condition 2 and assume condition 1 is satisfied with equality.
less restrictive so lengths may be shorter than actually possible  lower bound
56
Optimal Codes (non-integer li)
• Minimize  Σ_{i=1}^{|X|} p(x_i) l_i  subject to  Σ_{i=1}^{|X|} D^{−l_i} ≤ 1
Use a Lagrange multiplier: define
  J = Σ_{i=1}^{|X|} p(x_i) l_i + λ Σ_{i=1}^{|X|} D^{−l_i}   and set  ∂J/∂l_i = 0
  ∂J/∂l_i = p(x_i) − λ ln(D) D^{−l_i} = 0   ⇒   D^{−l_i} = p(x_i) / (λ ln D)
also  Σ_{i=1}^{|X|} D^{−l_i} = 1   ⇒   λ = 1/ln(D)   ⇒   l_i = −log_D p(x_i)
  E l(x) = E[ −log_D p(x) ] = E[ −log₂ p(x) ] / log₂ D = H(x)/log₂ D = H_D(x)
No uniquely decodable code can do better than this.
57
Bounds on Optimal Code Length
Round up the optimal code lengths:  l_i = ⌈ −log_D p(x_i) ⌉
• These l_i are bound to satisfy the Kraft inequality (since the optimum lengths do)
• For this choice,  −log_D p(x_i) ≤ l_i < −log_D p(x_i) + 1
• Average shortest length:  H_D(x) ≤ L* < H_D(x) + 1   (since we added < 1 to the optimum values)
• We can do better by encoding blocks of n symbols:
  n⁻¹ H_D(x_{1:n}) ≤ n⁻¹ E l(x_{1:n}) < n⁻¹ H_D(x_{1:n}) + n⁻¹
• If the entropy rate of x_i exists (x_i is a stationary process):
  n⁻¹ H_D(x_{1:n}) → H_D(X)   ⇒   n⁻¹ E l(x_{1:n}) → H_D(X)
Also known as source coding theorem
58
Block Coding Example
X = [A;B], px = [0.9; 0.1]   ⇒   H(x_i) = 0.469
Huffman coding:
• n=1:  sym:  A     B
        prob: 0.9   0.1
        code: 0     1                          ⇒  n⁻¹ E l = 1
• n=2:  sym:  AA    AB    BA    BB
        prob: 0.81  0.09  0.09  0.01
        code: 0     11    100   101            ⇒  n⁻¹ E l = 0.645
• n=3:  sym:  AAA    AAB    …   BBA     BBB
        prob: 0.729  0.081  …   0.009   0.001
        code: 0      101    …   10010   10011  ⇒  n⁻¹ E l = 0.583
[Figure: n⁻¹ E l versus block length n, decreasing towards H(x_i) = 0.469]
The extra 1-bit inefficiency becomes insignificant for large blocks
59
Summary
• McMillan Inequality for D-ary codes:
  any uniquely decodable C has  Σ_{i=1}^{|X|} D^{−l_i} ≤ 1
• Any uniquely decodable code:  E l(x) ≥ H_D(x)
• Source coding theorem
  • Symbol-by-symbol encoding:  H_D(x) ≤ E l(x) < H_D(x) + 1
  • Block encoding:  n⁻¹ E l(x_{1:n}) → H_D(X)
60
Lecture 5
• Source Coding Algorithms
• Huffman Coding
• Lempel-Ziv Coding
61
Huffman Code
An optimal binary instantaneous code must satisfy:
1. p(x_i) > p(x_j)  ⇒  l_i ≤ l_j   (else swap codewords)
2. The two longest codewords have the same length
   (else chop a bit off the longer codeword)
3. ∃ two longest codewords differing only in the last bit
   (else chop a bit off all of them)
Huffman Code construction
1. Take the two smallest p(xi) and assign each a
different last bit. Then merge into a single symbol.
2. Repeat step 1 until only one symbol remains
Used in JPEG, MP3…
62
Huffman Code Example
X = [a, b, c, d, e], px = [0.25 0.25 0.2 0.15 0.15]
Read diagram backwards for codewords:
C(X) = [01 10 11 000 001], L = 2.3, H(x) = 2.286
For D-ary code, first add extra zero-probability symbols until
|X|–1 is a multiple of D–1 and then group D symbols at a time
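A compact Huffman construction in Python (a sketch, not the course's reference implementation; ties are broken arbitrarily, so the codewords may differ from the slide while the average length is the same):

```python
import heapq

def huffman(probs):
    """Binary Huffman code: returns a dict symbol -> codeword."""
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)                   # unique tie-breaker for the heap
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # merge the two smallest probabilities
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

p = {"a": 0.25, "b": 0.25, "c": 0.2, "d": 0.15, "e": 0.15}
code = huffman(p)
L = sum(p[s] * len(w) for s, w in code.items())
print(code, round(L, 2))   # average length 2.3 bits, as on the slide
```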
64
Huffman Code is Optimal Instantaneous Code
Huffman traceback gives codes for progressively larger
alphabets:
p2=[0.55 0.45],
c2=[0 1], L2=1
p3=[0.45 0.3 0.25],
c3=[1 00 01], L3=1.55
p4=[0.3 0.25 0.25 0.2],
c4=[00 01 10 11], L4=2
p5=[0.25 0.25 0.2 0.15 0.15],
c5=[01 10 11 000 001], L5=2.3
We want to show that all these codes are optimal including C5
65
Huffman Optimality Proof
Suppose one of these codes is sub-optimal:
– ∃ m > 2 with c_m the first sub-optimal code (note c₂ is definitely optimal)
– An optimal c′_m must have L_{c′m} < L_{cm}
– Rearrange the symbols with the longest codes in c′_m so that the two lowest probabilities p_i and p_j differ only in the last digit (doesn't change optimality)
– Merge x_i and x_j to create a new code c′_{m−1} as in the Huffman procedure
– L_{c′m−1} = L_{c′m} − p_i − p_j, since it is identical except one bit shorter with probability p_i + p_j
– But also L_{cm−1} = L_{cm} − p_i − p_j, hence L_{c′m−1} < L_{cm−1}, which contradicts the assumption that c_m is the first sub-optimal code
Hence Huffman coding satisfies  H_D(x) ≤ L < H_D(x) + 1
Note: Huffman is just one out of many possible optimal codes
66
Shannon-Fano Code
Fano code
1. Put probabilities in decreasing order
2. Split as close to 50-50 as possible; repeat with each half

sym   a     b     c     d     e     f     g     h
prob  0.20  0.19  0.17  0.15  0.14  0.06  0.05  0.04
code  00    010   011   100   101   110   1110  1111

H(x) = 2.81 bits
L_SF = 2.89 bits
Not necessarily optimal: the best code for this p actually has L = 2.85 bits
67
Shannon versus Huffman
Shannon code:
  F_i = Σ_{k=1}^{i−1} p(x_k),   with p(x₁) ≥ p(x₂) ≥ ⋯ ≥ p(x_m)
  encoding: round the number F_i ∈ [0,1) to ⌈ −log p(x_i) ⌉ bits
  H_D(x) ≤ L_S < H_D(x) + 1   (exercise)
Example:  px = [0.36 0.34 0.25 0.05]   ⇒   H(x) = 1.78 bits
  −log₂ px = [1.47 1.56 2 4.32]
  Shannon:  l_S = ⌈−log₂ px⌉ = [2 2 2 5]   ⇒   L_S = 2.15 bits
  Huffman:  l_H = [1 2 3 3]                ⇒   L_H = 1.94 bits
Individual codewords may be longer in Huffman than in Shannon coding, but not the average.
68
Issues with Huffman Coding
• Requires the probability distribution of the
source
– Must recompute entire code if any symbol probability
changes
• A block of N symbols needs |X|^N pre-calculated probabilities
• For many practical applications, however, the
underlying probability distribution is unknown
– Estimate the distribution
• Arithmetic coding: extension of Shannon-Fano coding; can
deal with large block lengths
– Without the distribution
• Universal coding: Lempel-Ziv coding
69
Universal Coding
• Does not depend on the distribution of the
source
• Compression of an individual sequence
• Run length coding
– Runs of data are stored (e.g., in fax machines)
Example: WWWWWWWWWBBWWWWWWWBBBBBBWW
9W2B7W6B2W
• Lempel-Ziv coding
– Generalization that takes advantage of runs of strings
of characters (such as WWWWWWWWWBB)
– Adaptive dictionary compression algorithms
– Asymptotically optimum: achieves the entropy rate
for any stationary ergodic source
70
Lempel-Ziv Coding (LZ78)
Memorize previously occurring substrings in the input data
– parse input into the shortest possible distinct ‘phrases’, i.e., each
phrase is the shortest phrase not seen earlier
– number the phrases starting from 1 (0 is the empty string)
ABAABABABBBAB… parses into A | B | AA | BA | BAB | BB | AB   (phrases numbered 1–7)
Look up a dictionary
– each phrase consists of a previously occurring phrase
(head) followed by an additional A or B (tail)
– encoding: give the location of the head followed by the additional symbol for the tail:
  (0,A)(0,B)(1,A)(2,A)(4,B)(2,B)(1,B)…
– the decoder uses an identical dictionary
71
Lempel-Ziv Example
Input = 1011010100010010001001010010

Parse into phrases:  1 | 0 | 11 | 01 | 010 | 00 | 10 | 0100 | 01001 | 010010

Dictionary (location : phrase):
0: ∅   1: 1   2: 0   3: 11   4: 01   5: 010   6: 00   7: 10   8: 0100   9: 01001   10: 010010

Send:    1   00   011   101   1000   0100   0010   1010   10001   10010
Decode:  1   0    11    01    010    00     10     0100   01001   010010

No need to always send 4 bits for the location – the field grows with the dictionary.
Remarks:
• No need to send the dictionary (imagine zip and unzip!)
• It can be reconstructed by the decoder
• Need to send the leading 0's in locations such as 01, 010 and 001 to avoid ambiguity (i.e., an instantaneous code)
72
Lempel-Ziv Comments
Dictionary D contains K entries D(0), …, D(K–1). We need to send M = ceil(log K) bits to
specify a dictionary entry. Initially K=1, D(0) = ∅ = null string and M = ceil(log K) = 0 bits.

Input  Action
1      "1" ∉ D so send "1" and set D(1)="1". Now K=2 ⇒ M=1.
0      "0" ∉ D so split it up as ""+"0" and send location "0" (since D(0)=∅) followed by "0".
       Then set D(2)="0", making K=3 ⇒ M=2.
1      "1" ∈ D so don't send anything yet – just read the next input bit.
1      "11" ∉ D so split it up as "1"+"1" and send location "01" (since D(1)="1" and M=2)
       followed by "1". Then set D(3)="11", making K=4 ⇒ M=2.
0      "0" ∈ D so don't send anything yet – just read the next input bit.
1      "01" ∉ D so split it up as "0"+"1" and send location "10" (since D(2)="0" and M=2)
       followed by "1". Then set D(4)="01", making K=5 ⇒ M=3.
0      "0" ∈ D so don't send anything yet – just read the next input bit.
1      "01" ∈ D so don't send anything yet – just read the next input bit.
0      "010" ∉ D so split it up as "01"+"0" and send location "100" (since D(4)="01" and M=3)
       followed by "0". Then set D(5)="010", making K=6 ⇒ M=3.
So far we have sent 1 | 00 | 011 | 101 | 1000 = 1000111011000, where each group is a dictionary location followed by the new tail bit.
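An LZ78 encoder along the lines traced above can be sketched in a few lines of Python (illustrative only; it uses ceil(log2 K) bits for the location field, as in the table, and silently drops any incomplete final phrase):

```python
import math

def lz78_encode(bits):
    """LZ78 sketch: returns the transmitted bit-string for a binary input string."""
    dictionary = {"": 0}          # location 0 is the empty phrase
    out, phrase = "", ""
    for b in bits:
        if phrase + b in dictionary:
            phrase += b           # keep extending until the phrase is new
        else:
            loc_bits = math.ceil(math.log2(len(dictionary)))
            if loc_bits:
                out += format(dictionary[phrase], "0{}b".format(loc_bits))
            out += b              # send (location of head, new tail bit)
            dictionary[phrase + b] = len(dictionary)
            phrase = ""
    return out

print(lz78_encode("1011010100010010001001010010"))
# 1 00 011 101 1000 0100 0010 1010 10001 10010 (concatenated), as in the table above
```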
73
Lempel-Ziv Properties
• Simple to implement
• Widely used because of its speed and efficiency
– applications: compress, gzip, GIF, TIFF, modem …
– variations: LZW (considering last character of the
current phrase as part of the next phrase, used in
Adobe Acrobat), LZ77 (sliding window)
– different dictionary handling, etc
• Excellent compression in practice
– many files contain repetitive sequences
– worse than arithmetic coding for text files
74
Asymptotic Optimality
• Asymptotically optimum for stationary ergodic
source (i.e. achieves entropy rate)
• Let c(n) denote the number of phrases for a
sequence of length n
• Compressed sequence consists of c(n) pairs
(location, last bit)
• Needs c(n)[logc(n)+1] bits in total
• {X_i} stationary ergodic ⇒
  limsup_{n→∞} n⁻¹ l(X_{1:n}) = limsup_{n→∞} c(n)[log c(n) + 1]/n ≤ H(X)   with probability 1
– Proof: C&T chapter 12.10
– may only approach this for an enormous file
75
Summary
• Huffman Coding:  H_D(x) ≤ E l(x) < H_D(x) + 1
  – Bottom-up design
  – Optimal, i.e. shortest average length
• Shannon-Fano Coding:  H_D(x) ≤ E l(x) < H_D(x) + 1
  – Intuitively natural top-down design
• Lempel-Ziv Coding
– Does not require probability distribution
– Asymptotically optimum for stationary ergodic source (i.e.
achieves entropy rate)
76
Lecture 6
• Markov Chains
– Have a special meaning
– Not to be confused with the standard
definition of Markov chains (which are
sequences of discrete random variables)
• Data Processing Theorem
– You can’t create information from nothing
• Fano’s Inequality
– Lower bound for error in estimating X from Y
77
Markov Chains
If we have three random variables x, y, z:
  p(x,y,z) = p(z|x,y) p(y|x) p(x)
They form a Markov chain x→y→z if
  p(z|x,y) = p(z|y)   ⇔   p(x,y,z) = p(z|y) p(y|x) p(x)
A Markov chain x→y→z means that
– the only way that x affects z is through the value of y
– if you already know y, then observing x gives you no additional information about z,
  i.e. I(x;z|y) = 0 ⇔ H(z|y) = H(z|x,y)
– if you know y, then observing z gives you no additional information about x.
78
Data Processing
• Estimate z = f(y), where f is a function
• A special case of a Markov chain:  x → y → f(y)
  [Block diagram: data x → channel → y → estimator → z]
• Does processing of y increase the information that y contains about x?
79
Markov Chain Symmetry
If x→y→z:
  p(x,z|y) = p(x,y,z)/p(y) =(a) p(x,y) p(z|y)/p(y) = p(x|y) p(z|y)
  (a) p(z|x,y) = p(z|y)
Hence x and z are conditionally independent given y.
Also x→y→z iff z→y→x, since
  p(x|y) = p(x|y) p(z|y) p(y) / p(y,z) =(a) p(x,z|y) p(y) / p(y,z) = p(x,y,z)/p(y,z) = p(x|y,z)
  (a) p(x,z|y) = p(x|y) p(z|y)   (conditional independence)
The Markov chain property is symmetrical.
80
Data Processing Theorem
If x→y→z then I(x;y) ≥ I(x;z)
  – processing y cannot add new information about x
If x→y→z then I(x;y) ≥ I(x;y|z)
  – knowing z does not increase the amount y tells you about x
Proof: apply the chain rule in different ways:
  I(x;y,z) = I(x;y) + I(x;z|y) = I(x;z) + I(x;y|z)
  but I(x;z|y) = 0   (a)
  hence I(x;y) = I(x;z) + I(x;y|z)
  so I(x;y) ≥ I(x;z) and I(x;y) ≥ I(x;y|z)
(a) I(x;z|y) = 0 iff x and z are conditionally independent given y; Markov ⇒ p(x,z|y) = p(x|y) p(z|y)
81
So Why Processing?
• One can not create information by manipulating
the data
• But no information is lost if equality holds
• Sufficient statistic
– z contains all the information in y about x
– Preserves mutual information I(x ;y) = I(x ; z)
• The estimator should be designed in a way such
that it outputs sufficient statistics
• Can the estimation be arbitrarily accurate?
82
Fano’s Inequality
If we estimate x from y, what is p_e = p(x̂ ≠ x)?      [Markov chain: x → y → x̂]
  H(x|y) ≤ H(p_e) + p_e log|X|
  p_e ≥ [H(x|y) − H(p_e)] / log|X| ≥(a) [H(x|y) − 1] / log|X|
  (a) the second form is weaker but easier to use
Proof: define a random variable e = 1 if x̂ ≠ x, and e = 0 if x̂ = x
  H(e,x|x̂) = H(x|x̂) + H(e|x,x̂) = H(e|x̂) + H(x|e,x̂)      [chain rule]
  H(e|x,x̂) = 0  and  H(e|x̂) ≤ H(e)
  ⇒ H(x|x̂) ≤ H(e) + H(x|e,x̂)
  H(x|e,x̂) = H(x|x̂, e=0)(1−p_e) + H(x|x̂, e=1) p_e ≤ 0×(1−p_e) + p_e log|X|
  H(e) = H(p_e)
  Finally H(x|y) ≤ H(x|x̂), since I(x;x̂) ≤ I(x;y)          [Markov chain]
83
Implications
• Zero probability of error:  p_e = 0  ⇒  H(x|y) = 0
• Low probability of error if H(x|y) is small
• If H(x|y) is large then the probability of error is high
• Could be slightly strengthened to
  H(x|y) ≤ H(p_e) + p_e log(|X| − 1)
• Fano’s inequality is used whenever you need to
show that errors are inevitable
– E.g., Converse to channel coding theorem
84
Fano Example
X = {1:5}, px = [0.35, 0.35, 0.1, 0.1, 0.1]ᵀ
Y = {1:2}: if x ≤ 2 then y = x with probability 6/7,
while if x > 2 then y = 1 or 2 with equal probability.
Our best strategy is to guess x̂ = y   (x → y → x̂)
– p_{x|y=1} = [0.6, 0.1, 0.1, 0.1, 0.1]ᵀ
– actual error prob: p_e = 0.4
Fano bound:
  p_e ≥ [H(x|y) − 1] / log(|X|−1) = (1.771 − 1)/log 4 = 0.3855   (exercise)
Main use: to show when error-free transmission is impossible, since p_e > 0
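A one-line helper for the weaker Fano bound used in this example (illustrative sketch):

```python
import math

def fano_lower_bound(H_x_given_y, alphabet_size):
    """Weaker form of Fano: p_e >= (H(x|y) - 1) / log2(|X| - 1)."""
    return (H_x_given_y - 1) / math.log2(alphabet_size - 1)

# Example above: H(x|y) = 1.771 and |X| = 5
print(round(fano_lower_bound(1.771, 5), 4))   # 0.3855, versus the actual p_e = 0.4
```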
85
Summary
• Markov:  x→y→z  ⇔  p(z|x,y) = p(z|y)  ⇔  I(x;z|y) = 0
• Data Processing Theorem: if x→y→z then
  – I(x;y) ≥ I(x;z),  I(y;z) ≥ I(x;z)      (can be false if not Markov)
  – I(x;y) ≥ I(x;y|z)
  – Long Markov chains: if x₁→x₂→x₃→x₄→x₅→x₆, then mutual information increases as the variables get closer together:
    e.g. I(x₃;x₄) ≥ I(x₂;x₄) ≥ I(x₁;x₅) ≥ I(x₁;x₆)
• Fano's Inequality: if x→y→x̂ then
  p_e ≥ [H(x|y) − H(p_e)] / log(|X|−1) ≥ [H(x|y) − 1] / log(|X|−1) ≥ [H(x|y) − 1] / log|X|
  (the last form is weaker but easier to use since it is independent of p_e)
86
Lecture 7
• Law of Large Numbers
– Sample mean is close to expected value
• Asymptotic Equipartition Principle (AEP)
– -logP(x1,x2,…,xn)/n is close to entropy H
• The Typical Set
– Probability of each sequence is close to 2^{−nH}
– Size (≈ 2^{nH}) and total probability (≈ 1)
• The Atypical Set
– Unimportant and could be ignored
87
Typicality: Example
X = {a, b, c, d}, p = [0.5 0.25 0.125 0.125]
–log p = [1 2 3 3]  H(p) = 1.75 bits
Sample eight i.i.d. values
• typical ⇒ correct proportions
  "adbabaac":  −log p(x) = 14 = 8 × 1.75 = nH(x)
• not typical ⇒ −log p(x) ≠ nH(x)
  "dddddddd":  −log p(x) = 24
88
Convergence of Random Variables
• Convergence
  x_n → y   ⇔   ∀ε>0, ∃m such that ∀n>m, |x_n − y| < ε
  Example: x_n = 2^{−n}, y = 0; choose m = −log ε
• Convergence in probability (weaker than convergence)
  x_n →(prob) y   ⇔   ∀ε>0, P(|x_n − y| > ε) → 0
  Example: x_n ∈ {0; 1}, p_n = [1 − n⁻¹; n⁻¹]
  for any small ε,  p(|x_n| > ε) = n⁻¹ → 0
  so x_n → 0 in probability (but not x_n → 0)
Note: y can be a constant or another random variable
89
Law of Large Numbers
Given i.i.d. {x_i}, the sample mean is  s_n = n⁻¹ Σ_{i=1}^n x_i
– E s_n = E x = μ,   Var s_n = n⁻¹ Var x = n⁻¹ σ²
As n increases, Var s_n gets smaller and the values become clustered around the mean.
LLN:  s_n →(prob) μ,  i.e.  ∀ε>0, P(|s_n − μ| > ε) → 0 as n → ∞
The expected value of a random variable is equal to the long-term average when sampling repeatedly.
90
Asymptotic Equipartition Principle
• x is the i.i.d. sequence {x_i} for 1 ≤ i ≤ n
  – Prob of a particular sequence is  p(x) = Π_{i=1}^n p(x_i)
  – Average:  E[−log p(x)] = n E[−log p(x_i)] = n H(x)
• AEP:  −n⁻¹ log p(x) →(prob) H(x)
• Proof:
  −n⁻¹ log p(x) = n⁻¹ Σ_{i=1}^n ( −log p(x_i) ) →(prob) E[−log p(x_i)] = H(x)   [law of large numbers]
91
Typical Set
Typical set (for finite n):
  T_ε^(n) = { x ∈ X^n : |−n⁻¹ log p(x) − H(x)| ≤ ε }
Example:
  – x_i Bernoulli with p(x_i = 1) = p
  – e.g. p([0 1 1 0 0 0]) = p²(1−p)⁴
  – For p = 0.2, H(X) = 0.72 bits
[Figure: histograms of −N⁻¹ log p(x) for p = 0.2, ε = 0.1 and N = 1, 2, 4, …, 128;
 the red bar marks T_0.1^(N), whose probability p_T grows from 0% (N = 1) to 85% (N = 128)]
92
Typical Set Frames
[Figure: histograms of −N⁻¹ log p(x) for a Bernoulli(p = 0.2) source with ε = 0.1 and
 N = 1, 2, 4, 8, 16, 32, 64, 128; the probability of the typical set is
 p_T = 0%, 0%, 41%, 29%, 45%, 62%, 72%, 85% respectively, and an example typical
 sequence is shown for each N]
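A quick Monte-Carlo sketch (illustrative, not from the slides) reproduces the trend in these frames: the probability of the typical set grows towards 1 as n increases.

```python
import math, random

p, eps = 0.2, 0.1
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))   # 0.72 bits

def in_typical_set(x):
    logp = sum(math.log2(p if xi else 1 - p) for xi in x)
    return abs(-logp / len(x) - H) <= eps

random.seed(0)
for n in (1, 8, 64, 512):
    trials = 2000
    hits = sum(in_typical_set([random.random() < p for _ in range(n)]) for _ in range(trials))
    print(n, round(hits / trials, 2))   # the fraction inside T grows towards 1 with n
```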
93
Typical Set: Properties
1. Individual prob:  x ∈ T_ε^(n)  ⇒  log p(x) = −nH(x) ± nε
2. Total prob:  p( x ∈ T_ε^(n) ) > 1 − ε   for n > N_ε
3. Size:  (1 − ε) 2^{n(H(x)−ε)}  <  |T_ε^(n)|  ≤  2^{n(H(x)+ε)}   (lower bound for n > N_ε)
Proof of 2:
  −n⁻¹ log p(x) = n⁻¹ Σ_{i=1}^n ( −log p(x_i) ) →(prob) E[−log p(x_i)] = H(x)
  Hence ∀ε>0 ∃N_ε s.t. ∀n>N_ε,  p( |−n⁻¹ log p(x) − H(x)| > ε ) < ε
Proof of 3a (lower bound, for large enough n):
  1 − ε < p( x ∈ T_ε^(n) ) = Σ_{x∈T_ε^(n)} p(x) ≤ Σ_{x∈T_ε^(n)} 2^{−n(H(x)−ε)} = 2^{−n(H(x)−ε)} |T_ε^(n)|
Proof of 3b (upper bound):
  1 ≥ Σ_{x∈T_ε^(n)} p(x) ≥ Σ_{x∈T_ε^(n)} 2^{−n(H(x)+ε)} = 2^{−n(H(x)+ε)} |T_ε^(n)|
Consequence
• For any ε and for n > N_ε:  "almost all events are almost equally surprising"
  p( x ∈ T_ε^(n) ) > 1 − ε   and   log p(x) = −nH(x) ± nε
• Coding consequence  (|X|^n sequences in total, of which ≈ 2^{n(H(x)+ε)} are typical):
  – x ∈ T_ε^(n):  send '0' + at most 1 + n(H+ε) bits
  – x ∉ T_ε^(n):  send '1' + at most 1 + n log|X| bits
  – L = average code length
      ≤ p( x ∈ T_ε^(n) ) [2 + n(H+ε)] + p( x ∉ T_ε^(n) ) [2 + n log|X|]
      ≤ n(H + ε) + ε n log|X| + 2 + 2ε = n[ H + ε + ε log|X| + 2(1+ε) n⁻¹ ] = n(H + ε′)
95
Source Coding & Data Compression
For any choice of ε > 0, we can, by choosing the block size n large enough, do the following:
• make a lossless code using only H(x) + ε bits per symbol on average:  L/n ≤ H + ε
• The coding is one-to-one and decodable
– However impractical due to exponential complexity
• Typical sequences have short descriptions of length  nH
– Another proof of source coding theorem (Shannon’s original
proof)
• However, encoding/decoding complexity is exponential
in n
96
Smallest high-probability Set
T(n) is a small subset of Xn containing most of the
probability mass. Can you get even smaller ?
For any 0 <  < 1, choose N0 = ––1log  , then for any
n>max(N0,N) and any subset S(n) satisfying S ( n)  2 n(H ( x )  2ε)

 


p x  S ( n )  p x  S ( n )  T( n )  p x  S ( n )  T( n )

(n)
 S ( n ) max
p
(
x
)

p
x

T

(n)
xT

 2 n ( H  2 ) 2  n ( H   )  
 2n    2
for n  N 0,

for n  N 
2  n  2 log   
Answer: No
97
Summary
• Typical Set
  – Individual prob:  x ∈ T_ε^(n)  ⇒  log p(x) = −nH(x) ± nε
  – Total prob:  p( x ∈ T_ε^(n) ) > 1 − ε  for n > N_ε
  – Size:  (1 − ε) 2^{n(H(x)−ε)} < |T_ε^(n)| ≤ 2^{n(H(x)+ε)}   (lower bound for n > N_ε)
• No other high-probability set can be much smaller than T_ε^(n)
• Asymptotic Equipartition Principle
  – Almost all event sequences are equally surprising
• Can be used to prove the source coding theorem
98
Lecture 8
• Channel Coding
• Channel Capacity
  – The highest rate in bits per channel use that can be transmitted reliably (*)
  – The maximum mutual information
• Discrete Memoryless Channels
  – Symmetric Channels
  – Channel capacity
    • Binary Symmetric Channel
    • Binary Erasure Channel
    • Asymmetric Channel
(*) = proved in the channel coding theorem
99
Model of Digital Communication
[Block diagram: In → Compress → Encode → Noisy Channel → Decode → Decompress → Out;
 compress/decompress = source coding, encode/decode = channel coding]
• Source Coding
– Compresses the data to remove redundancy
• Channel Coding
– Adds redundancy/structure to protect against
channel errors
100
Discrete Memoryless Channel
• Input x ∈ X, output y ∈ Y     [x → Noisy Channel → y]
• Time-invariant transition-probability matrix:
  Q_{y|x}[i,j] = p( y = y_j | x = x_i )
  – Hence p_y = Q_{y|x}ᵀ p_x
  – Q: each row sums to 1; the average column sum is |X| |Y|⁻¹
• Memoryless:  p(y_n | x_{1:n}, y_{1:n−1}) = p(y_n | x_n)
• DMC = Discrete Memoryless Channel
101
Binary Channels
• Binary Symmetric Channel (BSC)
  – X = [0 1], Y = [0 1]
  – Q = [ 1−f  f ;  f  1−f ]
• Binary Erasure Channel (BEC)
  – X = [0 1], Y = [0 ? 1]
  – Q = [ 1−f  f  0 ;  0  f  1−f ]
• Z Channel
  – X = [0 1], Y = [0 1]
  – Q = [ 1  0 ;  f  1−f ]
[Channel diagrams: crossover probability f for the BSC, erasure probability f for the BEC,
 and a 1→0 error probability f for the Z channel]
Symmetric: rows are permutations of each other; columns are permutations of each other
Weakly symmetric: rows are permutations of each other; columns have the same sum
102
Weakly Symmetric Channels
Weakly symmetric:
1. All columns of Q have the same sum = |X| |Y|⁻¹
   – If x is uniform (i.e. p(x) = |X|⁻¹) then y is uniform:
     p(y) = Σ_{x∈X} p(y|x) p(x) = |X|⁻¹ Σ_{x∈X} p(y|x) = |X|⁻¹ · |X| |Y|⁻¹ = |Y|⁻¹
2. All rows are permutations of each other
   – Each row of Q has the same entropy, so
     H(y|x) = Σ_{x∈X} p(x) H(y|x=x) = H(Q_{1,:}) Σ_{x∈X} p(x) = H(Q_{1,:})
     where Q_{1,:} is the first (or any other) row of the Q matrix
Symmetric:
1. All rows are permutations of each other
2. All columns are permutations of each other
Symmetric ⇒ weakly symmetric
103
Channel Capacity
• Capacity of a DMC channel:  C = max_{p_x} I(x;y)
  – Mutual information (not entropy itself) is what can be transmitted through the channel
  – The maximum is over all possible input distributions p_x
  – ∃ only one maximum, since I(x;y) is concave in p_x for fixed p_{y|x}
  – We want to find the p_x that maximizes I(x;y)
  – Limits on C:  0 ≤ C ≤ min( H(x), H(y) ) ≤ min( log|X|, log|Y| )
• Capacity for n uses of the channel:
  C^(n) = n⁻¹ max_{p_{x_{1:n}}} I(x_{1:n}; y_{1:n}) = C   (*)
(*) = proved two pages on
104
Mutual Information Plot
Binary Symmetric Channel with Bernoulli input:
  x ~ Bernoulli(p), crossover probability f
  I(x;y) = H(y) − H(y|x) = H( f + p − 2pf ) − H(f)
[Figure: surface plot of I(x;y) against the input Bernoulli probability p and the channel error probability f]
105
Mutual Information Concave in pX
Mutual information I(x;y) is concave in p_x for fixed p_{y|x}
Proof: let u and v have probability mass vectors p_u and p_v
– Define z: a Bernoulli random variable with p(z=1) = λ
– Let x = u if z=1 and x = v if z=0   ⇒   p_x = λ p_u + (1−λ) p_v
[Diagram: z selects between u and v to form x, which passes through p(y|x) to give y]
  I(x,z;y) = I(x;y) + I(z;y|x) = I(z;y) + I(x;y|z)
  but I(z;y|x) = H(y|x) − H(y|x,z) = 0, so
  I(x;y) ≥ I(x;y|z) = λ I(x;y|z=1) + (1−λ) I(x;y|z=0) = λ I(u;y) + (1−λ) I(v;y)
Special case: y = x  ⇒  I(x;x) = H(x) is concave in p_x
106
Mutual Information Convex in pY|X
Mutual information I(x;y) is convex in p_{y|x} for fixed p_x
Proof: define u, v, x, z as before, with
  p_{y|x} = λ p_{u|x} + (1−λ) p_{v|x}
  I(x;y,z) = I(x;y|z) + I(x;z) = I(x;y) + I(x;z|y)
  but I(x;z) = 0 and I(x;z|y) ≥ 0, so
  I(x;y) ≤ I(x;y|z) = λ I(x;y|z=1) + (1−λ) I(x;y|z=0) = λ I(x;u) + (1−λ) I(x;v)
107
n-use Channel Capacity
For a Discrete Memoryless Channel:
  I(x_{1:n}; y_{1:n}) = H(y_{1:n}) − H(y_{1:n} | x_{1:n})
    = Σ_{i=1}^n H(y_i | y_{1:i−1}) − Σ_{i=1}^n H(y_i | x_i)        [chain rule; memoryless]
    ≤ Σ_{i=1}^n H(y_i) − H(y_i | x_i) = Σ_{i=1}^n I(x_i; y_i)      [conditioning reduces entropy]
with equality if the y_i are independent ⇐ the x_i are independent.
⇒ We can maximize I(x;y) by maximizing each I(x_i;y_i) independently and taking the x_i to be i.i.d.
  – We will concentrate on maximizing I(x;y) for a single channel use
  – The elements of x_i are not necessarily i.i.d.
108
Capacity of Symmetric Channel
If the channel is weakly symmetric:
  I(x;y) = H(y) − H(y|x) = H(y) − H(Q_{1,:}) ≤ log|Y| − H(Q_{1,:})
with equality iff the input distribution is uniform
⇒ the information capacity of a weakly symmetric channel is C = log|Y| − H(Q_{1,:})
For a binary symmetric channel (BSC) with crossover probability f:
  – |Y| = 2
  – H(Q_{1,:}) = H(f)
  – I(x;y) ≤ 1 − H(f)
⇒ the information capacity of a BSC is 1 − H(f)
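A small sketch evaluating C = 1 − H(f) for a few crossover probabilities (illustrative only):

```python
import math

def Hb(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(f):
    return 1 - Hb(f)

for f in (0.0, 0.1, 0.2, 0.5):
    print(f, round(bsc_capacity(f), 3))   # 1.0, 0.531, 0.278, 0.0
```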
109
Binary Erasure Channel (BEC)
1  f

 0
f
f
0 

1 f 
0
 (1  f ) H (x )
 1 f
C  1 f
0
f
x
1
I (x ; y )  H (x )  H (x | y )
 H (x )  p(y  0)  0  p (y  ?) H (x )  p (y  1)  0
 H (x )  H (x ) f
1–f
? y
f
1–f
1
H(x |y) = 0 when y=0 or y=1
since max value of H(x) = 1
with equality when x is uniform
since a fraction f of the bits are lost, the capacity is only 1–f
and this is achieved when x is uniform
110
Asymmetric Channel Capacity
Channel: x ∈ {0, 1, 2}, y ∈ {0, 1, 2} with
  Q = [ 1−f  f  0 ;  f  1−f  0 ;  0  0  1 ]
  (inputs 0 and 1 are crossed over with probability f; input 2 is received noiselessly)
Let p_x = [a  a  1−2a]ᵀ   ⇒   p_y = Qᵀ p_x = p_x
  H(y) = −2a log a − (1−2a) log(1−2a)
  H(y|x) = 2a H(f) + (1−2a) H(1) = 2a H(f)
To find C, maximize I(x;y) = H(y) − H(y|x):
  I = −2a log a − (1−2a) log(1−2a) − 2a H(f)
  dI/da = −2 log e − 2 log a + 2 log e + 2 log(1−2a) − 2 H(f) = 0
          (note: d(log x)/dx = x⁻¹ log e)
  ⇒ log( (1−2a)/a ) = H(f)   ⇒   a = ( 2 + 2^{H(f)} )⁻¹
  ⇒ C = −2a log( a · 2^{H(f)} ) − (1−2a) log(1−2a)
Examples:
  f = 0  ⇒ H(f) = 0 ⇒ a = 1/3 ⇒ C = log 3 = 1.585 bits/use
  f = ½ ⇒ H(f) = 1 ⇒ a = ¼  ⇒ C = log 2 = 1 bits/use
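The closed form just derived is easy to evaluate numerically; a sketch (using the expressions for a and C derived above):

```python
import math

def Hb(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def asymmetric_capacity(f):
    """Capacity of the 3-input channel above via the closed form on this slide."""
    a = 1.0 / (2 + 2 ** Hb(f))                 # optimal input p_x = [a, a, 1-2a]
    C = -2 * a * math.log2(a * 2 ** Hb(f)) - (1 - 2 * a) * math.log2(1 - 2 * a)
    return a, C

for f in (0.0, 0.5):
    a, C = asymmetric_capacity(f)
    print(f, round(a, 4), round(C, 4))   # f=0: a=1/3, C=1.585;  f=0.5: a=0.25, C=1.0
```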
111
Summary
• Given the channel, mutual information is concave in the input distribution
• Channel capacity:  C = max_{p_x} I(x;y)
  – The maximum exists and is unique
• DMC capacity
  – Weakly symmetric channel:  log|Y| − H(Q_{1,:})
  – BSC:  1 − H(f)
  – BEC:  1 − f
  – In general it is very hard to obtain a closed form; use numerical methods based on convex optimization instead
112
Lecture 9
• Jointly Typical Sets
• Joint AEP
• Channel Coding Theorem
– Ultimate limit on information transmission is
channel capacity
– The central and most successful story of
information theory
– Random Coding
– Jointly typical decoding
113
Intuition on the Ultimate Limit
• Consider blocks of n symbols:
  [x_{1:n} → Noisy Channel → y_{1:n};  ≈2^{nH(x)} typical inputs, ≈2^{nH(y)} typical outputs,
   each input fanning out to ≈2^{nH(y|x)} typical outputs]
– For large n, an average input sequence x_{1:n} corresponds to about 2^{nH(y|x)} typical output sequences
– There are a total of 2^{nH(y)} typical output sequences
– For nearly error-free transmission, we select a number of input sequences whose corresponding sets of output sequences hardly overlap
– The maximum number of distinct sets of output sequences is 2^{n(H(y)−H(y|x))} = 2^{nI(y;x)}
– One can send I(y;x) bits per channel use
For large n we can transmit at any rate < C with negligible errors.
114
Jointly Typical Set
x,y is the i.i.d. sequence {x_i, y_i} for 1 ≤ i ≤ n
– Prob of a particular sequence pair:  p(x,y) = Π_{i=1}^n p(x_i, y_i)
– E[−log p(x,y)] = n E[−log p(x_i, y_i)] = n H(x,y)
– Jointly typical set:
  J_ε^(n) = { (x,y) ∈ (XY)^n :  |−n⁻¹ log p(x) − H(x)| ≤ ε,
                                |−n⁻¹ log p(y) − H(y)| ≤ ε,
                                |−n⁻¹ log p(x,y) − H(x,y)| ≤ ε }
115
Jointly Typical Example
Binary Symmetric Channel with f = 0.2:
  p_x = [0.75  0.25]ᵀ,   p_y = [0.65  0.35]ᵀ,   P_xy = [ 0.6  0.15 ;  0.05  0.2 ]
Jointly typical example (for any ε):
  x = 11111000000000000000
  y = 11110111000000000000
All combinations of x and y have exactly the right frequencies.
116
Jointly Typical Diagram
Each point defines both an x sequence and a y sequence.
[Figure: of the 2^{n log(|X||Y|)} possible sequence pairs, an inner rectangle of
 2^{n(H(x)+H(y))} pairs is typical in x and in y, and about 2^{nH(x,y)} dots inside it are jointly typical]
• Dots represent jointly typical pairs (x,y)
• The inner rectangle represents pairs that are typical in x or y but not necessarily jointly typical
• There are about 2^{nH(x)} typical x's in all
• Each typical y is jointly typical with about 2^{nH(x|y)} of these typical x's
• The jointly typical pairs are a fraction 2^{−nI(x;y)} of the inner rectangle
• Channel code: choose x's whose jointly typical y's don't overlap; use joint typicality for decoding
• There are about 2^{nI(x;y)} such codewords x
Joint Typical Set Properties
1. Individual prob:  (x,y) ∈ J_ε^(n)  ⇒  log p(x,y) = −nH(x,y) ± nε
2. Total prob:  p( (x,y) ∈ J_ε^(n) ) > 1 − ε   for n > N_ε
3. Size:  (1 − ε) 2^{n(H(x,y)−ε)}  <  |J_ε^(n)|  ≤  2^{n(H(x,y)+ε)}   (lower bound for n > N_ε)
Proof of 2 (weak law of large numbers):
  choose N₁ such that ∀n > N₁,  p( |−n⁻¹ log p(x) − H(x)| > ε ) < ε/3;
  similarly choose N₂, N₃ for the other two conditions and set N_ε = max(N₁, N₂, N₃)
Proof of 3:
  1 ≥ Σ_{(x,y)∈J_ε^(n)} p(x,y) ≥ |J_ε^(n)| min_{J_ε^(n)} p(x,y) ≥ |J_ε^(n)| 2^{−n(H(x,y)+ε)}
  1 − ε < Σ_{(x,y)∈J_ε^(n)} p(x,y) ≤ |J_ε^(n)| max_{J_ε^(n)} p(x,y) ≤ |J_ε^(n)| 2^{−n(H(x,y)−ε)}   (n > N_ε)
Properties
4. If p_{x'} = p_x and p_{y'} = p_y with x' and y' independent:
   (1 − ε) 2^{−n(I(x;y)+3ε)}  ≤  p( (x',y') ∈ J_ε^(n) )  ≤  2^{−n(I(x;y)−3ε)}   for n > N_ε
Proof:  |J| × (min prob) ≤ total prob ≤ |J| × (max prob)
  p( (x',y') ∈ J_ε^(n) ) = Σ_{(x',y')∈J_ε^(n)} p(x') p(y') ≤ |J_ε^(n)| max p(x') p(y')
      ≤ 2^{n(H(x,y)+ε)} 2^{−n(H(x)−ε)} 2^{−n(H(y)−ε)} = 2^{−n(I(x;y)−3ε)}
  p( (x',y') ∈ J_ε^(n) ) ≥ |J_ε^(n)| min p(x') p(y')
      ≥ (1 − ε) 2^{n(H(x,y)−ε)} 2^{−n(H(x)+ε)} 2^{−n(H(y)+ε)} = (1 − ε) 2^{−n(I(x;y)+3ε)}   for n > N_ε
Channel Coding
w 1:M
Encoder
x1:n
Noisy
Channel
y1:n
Decoder
g(y)
^  1:M
w
• Assume Discrete Memoryless Channel with known Qy|x
• An (M, n) code is
– A fixed set of M codewords x(w)Xn for w=1:M
– A deterministic decoder g(y)1:M
• The rate of an (M,n) code: R=(log M)/n bits/transmission
• Error probability w  p  g  y  w    w  
 p  y | x (w)  
yY
n
g ( y ) w
– Maximum Error Probability ( n)  max w
1 w M
– Average Error probability
Pe( n )
1

M
M

w 1
w
C = 1 if C is true or 0 if it is false
120
Shannon’s ideas
• Channel coding theorem: the basic theorem of
information theory
– Proved in his original 1948 paper
• How do you correct all errors?
• Shannon’s ideas
– Allowing arbitrarily small but nonzero error probability
– Using the channel many times in succession so that
AEP holds
– Consider a randomly chosen code and show the
expected average error probability is small
• Use the idea of typical sequences
• Show this means ∃ at least one code with small maximum error prob
• Sadly it doesn’t tell you how to construct the code
121
Channel Coding Theorem
• A rate R is achievable if R<C and not achievable if R>C
– If R < C, ∃ a sequence of (2^{nR}, n) codes with maximum probability of error λ^(n) → 0 as n → ∞
– Any sequence of (2^{nR}, n) codes with maximum probability of error λ^(n) → 0 as n → ∞ must have R ≤ C
A very counterintuitive result: despite channel errors you can get arbitrarily low bit error rates provided that R < C.
[Figure: minimum bit error probability versus rate R/C — zero (achievable) for R/C < 1, rising (impossible) for R/C > 1]
122
Summary
• Jointly typical set
  (x,y) ∈ J_ε^(n)  ⇒  log p(x,y) = −nH(x,y) ± nε
  p( (x,y) ∈ J_ε^(n) ) > 1 − ε
  |J_ε^(n)| ≤ 2^{n(H(x,y)+ε)}
  (1 − ε) 2^{−n(I(x;y)+3ε)} ≤ p( (x',y') ∈ J_ε^(n) ) ≤ 2^{−n(I(x;y)−3ε)}
• Machinery to prove the channel coding theorem
123
Lecture 10
• Channel Coding Theorem
– Proof
• Using joint typicality
• Arguably the simplest one among many possible
ways
• Limitation: does not reveal the error-exponent behaviour P_e ~ e^{−nE(R)}
– Converse (next lecture)
124
Channel Coding Principle
• Consider blocks of n symbols:
  [x_{1:n} → Noisy Channel → y_{1:n}]
– An average input sequence x_{1:n} corresponds to about 2^{nH(y|x)} typical output sequences
– Random codes: choose M = 2^{nR} (R ≤ I(x;y)) random codewords x(w)
  • their typical output sequences are unlikely to overlap much
– Joint typical decoding: a received vector y is very likely to be in the typical output set of the transmitted x(w) and of no other codeword. Decode as this w.
Channel Coding Theorem: for large n, we can transmit at any rate R < C with negligible errors.
125
Random (2nR,n) Code
• Choose ε (this will bound the error probability); joint typicality gives N_ε; choose n > N_ε
• Choose p_x so that I(x;y) = C, the information capacity
• Use p_x to choose a code C with random codewords x(w) ∈ X^n, w = 1:2^{nR}
  – the receiver knows this code and also the transition matrix Q
• Assume the message w ∈ 1:2^{nR} is uniformly distributed
• If the received value is y, decode the message by seeing how many x(w)'s are jointly typical with y
  – if x(k) is the only one, then k is the decoded message
  – if there are 0 or ≥ 2 possible k's, then declare an error (message 0)
• We calculate the error probability averaged over all codes C and all messages w:
  p(E) = Σ_C p(C) 2^{−nR} Σ_{w=1}^{2^{nR}} λ_w(C) = 2^{−nR} Σ_{w=1}^{2^{nR}} Σ_C p(C) λ_w(C) =(a) Σ_C p(C) λ_1(C) = p(E | w = 1)
(a) since the error averaged over all possible codes is independent of w
126
Decoding Errors
• Assume we transmit x(1) and receive y     [x_{1:n} → Noisy Channel → y_{1:n}]
• Define the joint typicality events  e_w = { (x(w), y) ∈ J_ε^(n) }  for w = 1:2^{nR}
• Decode using joint typicality
• We have an error if either e₁ is false or e_w is true for some w ≥ 2
• The x(w) for w ≠ 1 are independent of x(1) and hence also independent of y, so
  p(e_w true) ≤ 2^{−n(I(x;y)−3ε)}   for any w ≠ 1     [Joint AEP]
127
Error Probability for Random Code
p(A∪B) ≤ p(A) + p(B)
• Upper bound:
  p(E) = p(E | w=1) = p( ¬e₁ ∪ e₂ ∪ e₃ ∪ ⋯ ∪ e_{2^{nR}} ) ≤ p(¬e₁) + Σ_{w=2}^{2^{nR}} p(e_w)
       ≤ ε + Σ_{w=2}^{2^{nR}} 2^{−n(I(x;y)−3ε)}                 [(1) joint typicality, (2) joint AEP]
       ≤ ε + 2^{nR} 2^{−n(I(x;y)−3ε)} = ε + 2^{−n(C−R−3ε)}      [we chose p(x) such that I(x;y) = C]
       ≤ 2ε   for R < C − 3ε and n > (−log ε)/(C − R − 3ε)
• Since the average of P(E) over all codes is ≤ 2ε, there must be at least one code for which this is true: this code has 2^{−nR} Σ_w λ_w ≤ 2ε   (*)
• Now throw away the worst half of the codewords; the remaining ones must all have λ_w ≤ 4ε. The resultant code has rate R − n⁻¹ ≈ R.   (*)
(*) = proved on the next page
Code Selection & Expurgation
• Since the average of P(E) over all codes is ≤ 2ε, there must be at least one code for which this is true.
  Proof:  2ε ≥ K⁻¹ Σ_{i=1}^K P_{e,i}^(n) ≥ K⁻¹ Σ_{i=1}^K min_i P_{e,i}^(n) = min_i P_{e,i}^(n)      (K = number of codes)
• Expurgation: throw away the worst half of the codewords; the remaining ones must all have λ_w ≤ 4ε.
  Proof: assume the λ_w are in descending order. Then
  2ε ≥ M⁻¹ Σ_{w=1}^M λ_w ≥ M⁻¹ Σ_{w=1}^{½M} λ_w ≥ M⁻¹ · ½M · λ_{½M} = ½ λ_{½M}
  so λ_{½M} ≤ 4ε, and every retained codeword w > ½M has λ_w ≤ λ_{½M} ≤ 4ε.
  M′ = ½ · 2^{nR} messages in n channel uses  ⇒  R′ = n⁻¹ log M′ = R − n⁻¹
129
Summary of Procedure
• For any $R < C - 3\epsilon$ set $n > \max\{N_\epsilon,\ -(\log\epsilon)/(C-R-3\epsilon),\ \epsilon^{-1}\}$ — see (a),(b),(c) below
• Find the optimum $p_X$ so that I(x;y) = C
• (a) Choose codewords randomly (using $p_X$) to construct codes with $2^{nR}$ codewords and use joint typicality as the decoder
• (b) Since the average of P(E) over all codes is ≤ 2ε there must be at least one code for which this is true.
• (c) Throw away the worst half of the codewords. Now the worst codeword has error prob ≤ 4ε with rate $= R - n^{-1} > R - \epsilon$
• The resultant code transmits at a rate as close to C as desired with an error probability that can be made as small as desired (but with n unnecessarily large).
Note: ε determines both the error probability and the closeness to capacity
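The procedure above can be imitated numerically on a small binary symmetric channel: sample random codebooks, transmit one codeword, and watch the average block error probability fall with n once R < C. The sketch below is only an illustration under assumed parameters (crossover f = 0.1, rate R = 0.2) and uses nearest-codeword (minimum Hamming distance) decoding in place of the joint-typicality decoder, which is a proof device rather than a practical algorithm.

```python
# Minimal Monte Carlo illustration of random coding on a BSC (assumed parameters;
# nearest-codeword decoding stands in for joint-typicality decoding).
import numpy as np

rng = np.random.default_rng(0)

def bsc_capacity(f):
    """C = 1 - H(f) bits per channel use for a BSC with crossover probability f."""
    return 1.0 + f * np.log2(f) + (1 - f) * np.log2(1 - f)

def avg_error(n, R, f, trials=200):
    """Average block error probability over random codebooks of rate R."""
    M = int(round(2 ** (n * R)))                 # number of codewords
    errors = 0
    for _ in range(trials):
        code = rng.integers(0, 2, size=(M, n))   # random codebook, Bernoulli(1/2)
        y = (code[0] + (rng.random(n) < f)) % 2  # send codeword 0 through the BSC
        w_hat = np.argmin((code != y).sum(axis=1))
        errors += (w_hat != 0)
    return errors / trials

f, R = 0.1, 0.2
print("C =", round(bsc_capacity(f), 3), "bits/use;  R =", R)
for n in (20, 40, 60):                           # error falls as n grows since R < C
    print("n =", n, " P(error) ~", avg_error(n, R, f))
```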
130
Remarks
• Random coding is a powerful method of proof,
not a method of signaling
• Picking randomly will give a good code
• But n has to be large (AEP)
• Without a structure, it is difficult to
encode/decode
– Table lookup requires exponential size
• Channel coding theorem does not provide a
practical coding scheme
• Folk theorem (but outdated now):
– Almost all codes are good, except those we can think
of
131
Lecture 11
• Converse of Channel Coding Theorem
– Cannot achieve R>C
• Capacity with feedback
– No gain for DMC but simpler encoding/
decoding
• Joint Source-Channel Coding
– No point for a DMC
132
Converse of Coding Theorem
• Fano's Inequality: if $P_e^{(n)}$ is the error prob when estimating w from y,
$H(w|y) \le 1 + P_e^{(n)}\log|W| = 1 + nRP_e^{(n)}$
• Hence
$nR = H(w) = H(w|y) + I(w;y)$   (definition of I)
$\le H(w|y) + I(x(w);y)$   (Markov: $w \to x \to y \to \hat w$, data processing)
$\le 1 + nRP_e^{(n)} + I(x;y)$   (Fano)
$\le 1 + nRP_e^{(n)} + nC$   (n-use DMC capacity)
• $P_e^{(n)} \ge \dfrac{R - C - n^{-1}}{R} \xrightarrow{n\to\infty} 1 - \dfrac CR > 0$ if R > C
For large (hence for all) n, $P_e^{(n)}$ has a lower bound of (R−C)/R if w is equiprobable
– If achievable for small n, it could be achieved also for large n by
concatenation.
133
Minimum Bit-Error Rate
[$w_{1:nR} \to$ Encoder $\to x_{1:n} \to$ Noisy Channel $\to y_{1:n} \to$ Decoder $\to \hat w_{1:nR}$]
Suppose
– $w_{1:nR}$ is i.i.d. bits with $H(w_i)=1$
– The bit-error rate is $P_b = E_i\, p(w_i \ne \hat w_i) = E_i\, p(e_i)$
Then
$nC \overset{(a)}{\ge} I(x_{1:n};y_{1:n}) \overset{(b)}{\ge} I(w_{1:nR};\hat w_{1:nR}) = H(w_{1:nR}) - H(w_{1:nR}\mid\hat w_{1:nR})$
$= nR - \sum_{i=1}^{nR} H(w_i\mid\hat w_{1:nR},w_{1:i-1}) \overset{(c)}{\ge} nR - \sum_{i=1}^{nR} H(w_i\mid\hat w_i) = nR\big(1 - E_i\,H(w_i\mid\hat w_i)\big)$
$\overset{(d)}{=} nR\big(1 - E_i\,H(e_i\mid\hat w_i)\big) \overset{(c)}{\ge} nR\big(1 - E_i\,H(e_i)\big) \overset{(e)}{\ge} nR\big(1 - H(E_i\,P(e_i))\big) = nR\big(1 - H(P_b)\big)$
(a) n-use capacity; (b) data processing theorem; (c) conditioning reduces entropy; (d) $e_i = w_i \oplus \hat w_i$; (e) Jensen: $E\,H(x) \le H(E\,x)$
Hence $R \le \dfrac{C}{1 - H(P_b)}$ and $P_b \ge H^{-1}(1 - C/R)$
[Figure: lower bound on bit error probability versus rate R/C]
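The bound $P_b \ge H^{-1}(1 - C/R)$ involves the inverse of the binary entropy function on [0, ½], which has no closed form; a short numerical inversion reproduces the curve sketched above. The sketch below assumes nothing beyond the formula just derived.

```python
# Evaluate the minimum bit-error rate P_b >= H^{-1}(1 - C/R) by bisection.
import math

def H(p):
    """Binary entropy in bits (H(0) = H(1) = 0)."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def H_inv(h):
    """Inverse of the binary entropy function, restricted to [0, 1/2]."""
    lo, hi = 0.0, 0.5
    for _ in range(60):
        mid = (lo + hi) / 2
        if H(mid) < h:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for ratio in (1.0, 1.5, 2.0, 3.0):          # R/C
    pb = 0.0 if ratio <= 1 else H_inv(1 - 1 / ratio)
    print(f"R/C = {ratio}:  P_b >= {pb:.4f}")
```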
134
Coding Theory and Practice
• Construction for good codes
– Ever since Shannon founded information theory
– Practical: computation & memory $\propto n^k$ for some k
• Repetition code: rate → 0
• Block codes: encode a block at a time
– Hamming code: correct one error
– Reed-Solomon code, BCH code: multiple errors (1950s)
• Convolutional code: convolve bit stream with a filter
• Concatenated code: RS + convolutional
• Capacity-approaching codes:
– Turbo code: combination of two interleaved convolutional codes
(1993)
– Low-density parity-check (LDPC) code (1960)
– Dream has come true for some channels today
135
Channel with Feedback
w1:2nR
y1:i–1
Encoder
xi
Noisy
Channel
yi
^
w
Decoder
• Assume error-free feedback: does it increase capacity ?
• A (2nR, n) feedback code is
– A sequence of mappings xi = xi(w,y1:i–1) for i=1:n
– A decoding function wˆ  g (y 1:n )
• A rate R is achievable if  a sequence of (2nR,n) feedback
 0
codes such that Pe( n )  P(wˆ  w ) n

• Feedback capacity, CFB  C, is the sup of achievable rates
136
Feedback Doesn’t Increase Capacity
$I(w;y) = H(y) - H(y|w) = H(y) - \sum_{i=1}^n H(y_i\mid y_{1:i-1}, w)$   (chain rule)
$= H(y) - \sum_{i=1}^n H(y_i\mid y_{1:i-1}, w, x_i)$   since $x_i = x_i(w, y_{1:i-1})$
$= H(y) - \sum_{i=1}^n H(y_i\mid x_i)$   since $y_i$ only directly depends on $x_i$ (DMC)
$\le \sum_{i=1}^n H(y_i) - \sum_{i=1}^n H(y_i\mid x_i) = \sum_{i=1}^n I(x_i;y_i) \le nC$   (conditioning reduces entropy)
Hence $nR = H(w) = H(w|y) + I(w;y) \le 1 + nRP_e^{(n)} + nC$   (Fano)
$\Rightarrow P_e^{(n)} \ge \dfrac{R - C - n^{-1}}{R} \Rightarrow$ any rate > C is unachievable
The DMC does not benefit from feedback: $C_{FB} = C$
137
Example: BEC with feedback
• Capacity is 1 – f
• Encode algorithm
– If yi = ?, tell the sender to retransmit bit i
– Average number of transmissions per bit: $1 + f + f^2 + \cdots = \dfrac{1}{1-f}$
[BEC diagram: x → y; y = x with probability 1−f, y = ? (erasure) with probability f]
• Average number of successfully recovered bits per
transmission = 1 – f
– Capacity is achieved!
• Capacity unchanged but encoding/decoding algorithm
much simpler.
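The retransmission scheme is easy to check by simulation: with erasure probability f, the throughput settles at 1 − f bits per transmission, the capacity. The sketch below uses an assumed f = 0.3.

```python
# Simulate the BEC-with-feedback scheme: retransmit each bit until it gets through.
import numpy as np

rng = np.random.default_rng(0)
f, n_bits = 0.3, 100_000                       # assumed erasure probability

transmissions = 0
for _ in range(n_bits):
    while True:                                # keep sending until not erased
        transmissions += 1
        if rng.random() >= f:
            break

print("average transmissions per bit:", round(transmissions / n_bits, 3))  # ~1/(1-f)
print("bits per transmission:", round(n_bits / transmissions, 3), " capacity:", 1 - f)
```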
138
Joint Source-Channel Coding
[$w_{1:n} \to$ Encoder $\to x_{1:n} \to$ Noisy Channel $\to y_{1:n} \to$ Decoder $\to \hat w_{1:n}$]
• Assume $w_i$ satisfies the AEP and $|W| < \infty$
– Examples: i.i.d.; Markov; stationary ergodic
• Capacity of the DMC channel is C
– if time-varying: $C = \lim_{n\to\infty} n^{-1} I(x;y)$
• Joint Source-Channel Coding Theorem:
$\exists$ codes with $P_e^{(n)} = P(\hat w_{1:n} \ne w_{1:n}) \xrightarrow{n\to\infty} 0$ iff H(W) < C (*)
– errors arise from two reasons
• Incorrect encoding of w
• Incorrect decoding of y
(*) proved on the next page
139
Source-Channel Proof ()
• Achievability is proved by using two-stage
encoding
– Source coding
– Channel coding
• For n > N there are only 2n(H(W)+) w’s in the
typical set: encode using n(H(W)+) bits
– encoder error < 
• Transmit with error prob less than  so long as
H(W)+  < C
• Total error prob < 2
140
Source-Channel Proof ()
w1:n
Encoder
x1:n
Noisy
Channel
y1:n
^
Decoder
w1:n
ˆ )  1  Pe( n ) n log W
Fano’s Inequality: H ( w | w
H (W )  n 1 H (w 1:n )
entropy rate of stationary process
 n 1 H (w 1:n | wˆ 1:n )  n 1 I (w 1:n ;wˆ 1:n )
 n 1 1  Pe( n ) n log W   n 1 I (x 1:n ; y 1:n )
 n 1  Pe( n ) log W  C
Let
n    Pe( n )  0  H (W )  C
definition of I
Fano + Data Proc Inequ
Memoryless channel
141
Separation Theorem
• Important result: source coding and channel coding might as well be done separately, since the same rates are achievable
– Joint design is more difficult
• Practical implication: for a DMC we can design the
source encoder and the channel coder separately
– Source coding: efficient compression
– Channel coding: powerful error-correction codes
• Not necessarily true for
– Correlated channels
– Multiuser channels
• Joint source-channel coding: still an area of research
– Redundancy in human languages helps in a noisy environment
142
Summary
• Converse to channel coding theorem
– Proved using Fano’s inequality
– Capacity is a clear dividing point:
• If R < C, error prob. → 0
• Otherwise, error prob. → 1
• Feedback doesn’t increase the capacity of DMC
– May increase the capacity of memory channels (e.g.,
ARQ in TCP/IP)
• Source-channel separation theorem for DMC and
stationary sources
143
Lecture 12
• Continuous Random Variables
• Differential Entropy
– can be negative
– not really a measure of the information in x
– coordinate-dependent
• Maximum entropy distributions
– Uniform over a finite range
– Gaussian for a fixed variance
144
Continuous Random Variables
Changing Variables
• pdf: $f_x(x)$;  CDF: $F_x(x) = \int_{-\infty}^{x} f_x(t)\,dt$
• For g(x) monotonic: $y = g(x) \Leftrightarrow x = g^{-1}(y)$
$F_y(y) = F_x\big(g^{-1}(y)\big)$ or $1 - F_x\big(g^{-1}(y)\big)$ according to the slope of g(x)
$f_y(y) = \dfrac{dF_y(y)}{dy} = f_x\big(g^{-1}(y)\big)\Big|\dfrac{dg^{-1}(y)}{dy}\Big| = f_x(x)\Big|\dfrac{dx}{dy}\Big|$ where $x = g^{-1}(y)$
• Examples: suppose $f_x(x) = 0.5$ for $x\in(0,2)$, so $F_x(x) = 0.5x$
(a) $y = 4x \Rightarrow x = 0.25y \Rightarrow f_y(y) = 0.5\times0.25 = 0.125$ for $y\in(0,8)$
(b) $z = x^4 \Rightarrow x = z^{1/4} \Rightarrow f_z(z) = 0.5\times\tfrac14 z^{-3/4} = 0.125\,z^{-3/4}$ for $z\in(0,16)$
145
Joint Distributions
Joint pdf: $f_{x,y}(x,y)$
Marginal pdf: $f_x(x) = \int f_{x,y}(x,y)\,dy$
Independence: $x \perp y \Leftrightarrow f_{x,y}(x,y) = f_x(x)f_y(y)$
Conditional pdf: $f_{x|y}(x) = \dfrac{f_{x,y}(x,y)}{f_y(y)}$
Example: $f_{x,y} = 1$ for $y\in(0,1),\ x\in(y,y+1)$
$f_{x|y} = 1$ for $x\in(y,y+1)$
$f_{y|x} = \dfrac{1}{\min(x,\,2-x)}$ for $y \in \big(\max(0,x-1),\ \min(x,1)\big)$
[Figure: the marginal densities $f_X$ and $f_Y$]
146
Entropy of Continuous R.V.
• Given a continuous pdf f(x), we divide the range of x into bins of width Δ
– For each i, $\exists x_i$ with $f(x_i)\Delta = \int_{i\Delta}^{(i+1)\Delta} f(x)\,dx$   (mean value theorem)
• Define a discrete random variable Y
– $Y = \{x_i\}$ and $p_Y = \{f(x_i)\Delta\}$
– Scaled, quantised version of f(x) with slightly unevenly spaced $x_i$
• $H(y) = -\sum_i f(x_i)\Delta\log\big(f(x_i)\Delta\big) = -\log\Delta - \sum_i f(x_i)\Delta\log f(x_i)$
$\to -\log\Delta - \int f(x)\log f(x)\,dx = -\log\Delta + h(x)$ as $\Delta\to0$
• Differential entropy: $h(x) = -\int f_x(x)\log f_x(x)\,dx$
- Similar to entropy of discrete r.v. but there are differences
147
Differential Entropy
Differential Entropy: $h(x) = -\int f_x(x)\log f_x(x)\,dx = -E\log f_x(x)$
Bad News:
– h(x) does not give the amount of information in x
– h(x) is not necessarily positive
– h(x) changes with a change of coordinate system
Good News:
– h1(x) – h2(x) does compare the uncertainty of two continuous
random variables provided they are quantised to the same
precision
– Relative Entropy and Mutual Information still work fine
– If the range of x is normalized to 1 and then x is quantised to n
bits, the entropy of the resultant discrete random variable is
approximately h(x)+n
148
Differential Entropy Examples
• Uniform Distribution: $x \sim U(a,b)$
– $f(x) = (b-a)^{-1}$ for $x\in(a,b)$ and f(x) = 0 elsewhere
– $h(x) = -\int_a^b (b-a)^{-1}\log(b-a)^{-1}\,dx = \log(b-a)$
– Note that h(x) < 0 if (b−a) < 1
• Gaussian Distribution: $x \sim N(\mu,\sigma^2)$
– $f(x) = (2\pi\sigma^2)^{-1/2}\exp\big(-\tfrac12(x-\mu)^2\sigma^{-2}\big)$
– $h(x) = -(\log e)\int f(x)\ln f(x)\,dx = -(\log e)\int f(x)\big(-\tfrac12\ln(2\pi\sigma^2) - \tfrac12(x-\mu)^2\sigma^{-2}\big)dx$
$= \tfrac12(\log e)\big(\ln(2\pi\sigma^2) + \sigma^{-2}E(x-\mu)^2\big) = \tfrac12(\log e)\big(\ln(2\pi\sigma^2)+1\big) = \tfrac12\log(2\pi e\sigma^2) = \log(4.1\sigma)$ bits
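Both formulas are easy to sanity-check numerically via Monte Carlo, using $h(x) = -E\log f(x)$. The sketch below checks the uniform and Gaussian cases with assumed parameters (b − a = 3, σ = 2).

```python
# Monte Carlo check of differential entropy: h(x) = -E[log2 f(x)].
import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000

# Uniform on (0, 3): h = log2(3)
a, b = 0.0, 3.0
x = rng.uniform(a, b, N)
h_unif = -np.mean(np.log2(np.full(N, 1.0 / (b - a))))
print("uniform :", round(h_unif, 4), " theory:", round(np.log2(b - a), 4))

# Gaussian with sigma = 2: h = 0.5*log2(2*pi*e*sigma^2)
sigma = 2.0
x = rng.normal(0.0, sigma, N)
f = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
h_gauss = -np.mean(np.log2(f))
print("gaussian:", round(h_gauss, 4),
      " theory:", round(0.5 * np.log2(2 * np.pi * np.e * sigma**2), 4))
```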
149
Multivariate Gaussian
Given mean m and symmetric positive definite covariance matrix K,
$x_{1:n} \sim N(m,K) \Rightarrow f(x) = |2\pi K|^{-1/2}\exp\big(-\tfrac12(x-m)^TK^{-1}(x-m)\big)$
$h(f) = -(\log e)\int f(x)\big(-\tfrac12(x-m)^TK^{-1}(x-m) - \tfrac12\ln|2\pi K|\big)dx$
$= \tfrac12\log e\,\big(\ln|2\pi K| + E\,(x-m)^TK^{-1}(x-m)\big)$
$= \tfrac12\log e\,\big(\ln|2\pi K| + E\,\mathrm{tr}\big((x-m)(x-m)^TK^{-1}\big)\big)$   ($\mathrm{tr}(AB) = \mathrm{tr}(BA)$)
$= \tfrac12\log e\,\big(\ln|2\pi K| + \mathrm{tr}(KK^{-1})\big) = \tfrac12\log e\,\big(\ln|2\pi K| + n\big)$   ($E(x-m)(x-m)^T = K$; $\mathrm{tr}(I) = n = \ln e^n$)
$= \tfrac12\log|2\pi K| + \tfrac12\log e^n = \tfrac12\log|2\pi eK| = \tfrac12\log\big((2\pi e)^n|K|\big)$ bits
150
Other Differential Quantities
Joint Differential Entropy:
$h(x,y) = -\int_{x,y} f_{x,y}(x,y)\log f_{x,y}(x,y)\,dx\,dy = -E\log f_{x,y}(x,y)$
Conditional Differential Entropy:
$h(x|y) = -\int_{x,y} f_{x,y}(x,y)\log f_{x|y}(x|y)\,dx\,dy = h(x,y) - h(y)$
Mutual Information:
$I(x;y) = \int_{x,y} f_{x,y}(x,y)\log\dfrac{f_{x,y}(x,y)}{f_x(x)f_y(y)}\,dx\,dy = h(x) + h(y) - h(x,y)$
Relative Differential Entropy of two pdf's:
$D(f\|g) = \int f(x)\log\dfrac{f(x)}{g(x)}\,dx = -h_f(x) - E_f\log g(x)$
(a) must have g(x) = 0 ⇒ f(x) = 0;  (b) continuity ⇒ 0 log(0/0) = 0
151
Differential Entropy Properties
Chain Rules:
$h(x,y) = h(x) + h(y|x) = h(y) + h(x|y)$
$I(x,y;z) = I(x;z) + I(y;z|x)$
Information Inequality: $D(f\|g) \ge 0$
Proof: define $S = \{x : f(x) > 0\}$; then
$-D(f\|g) = \int_{x\in S} f(x)\log\dfrac{g(x)}{f(x)}\,dx = E_f\Big[\log\dfrac{g(x)}{f(x)}\Big] \le \log E_f\Big[\dfrac{g(x)}{f(x)}\Big] = \log\int_S g(x)\,dx \le \log 1 = 0$
(Jensen + log() is concave)
All the same as for the discrete r.v. H()
152
Information Inequality Corollaries
Mutual Information ≥ 0: $I(x;y) = D(f_{x,y}\|f_xf_y) \ge 0$
Conditioning reduces Entropy: $h(x) - h(x|y) = I(x;y) \ge 0$
Independence Bound: $h(x_{1:n}) = \sum_{i=1}^n h(x_i|x_{1:i-1}) \le \sum_{i=1}^n h(x_i)$
All the same as for H()
153
Change of Variable
Change Variable: $y = g(x)$; from earlier, $f_y(y) = f_x\big(g^{-1}(y)\big)\Big|\dfrac{dg^{-1}(y)}{dy}\Big|$
$h(y) = -E\log f_y(y) = -E\log f_x\big(g^{-1}(y)\big) - E\log\Big|\dfrac{dx}{dy}\Big| = -E\log f_x(x) + E\log\Big|\dfrac{dy}{dx}\Big| = h(x) + E\log\Big|\dfrac{dy}{dx}\Big|$
Examples:
– Translation: $y = x + a \Rightarrow dy/dx = 1 \Rightarrow h(y) = h(x)$
– Scaling: $y = cx \Rightarrow dy/dx = c \Rightarrow h(y) = h(x) + \log|c|$
– Vector version: $y_{1:n} = Ax_{1:n} \Rightarrow h(y) = h(x) + \log|\det(A)|$
Not the same as for H()
154
Concavity & Convexity
• Differential Entropy:
– h(x) is a concave function of $f_x(x)$ ⇒ $\exists$ a maximum
• Mutual Information:
– I(x;y) is a concave function of $f_x(x)$ for fixed $f_{y|x}(y)$
– I(x;y) is a convex function of $f_{y|x}(y)$ for fixed $f_x(x)$
Proofs: exactly the same as for the discrete case, mixing with $p_z = [1-\lambda,\ \lambda]^T$:
[Diagrams: a binary switch z selecting between two sources or channels gives $H(x) \ge H(x|z)$ and $I(x;y) \ge I(x;y|z)$ for a mixture of input distributions, and $I(x;y) \le I(x;y|z)$ for a mixture of channels]
155
Uniform Distribution Entropy
What distribution over the finite range (a,b) maximizes the
entropy ?
Answer: A uniform distribution u(x)=(b–a)–1
Proof:
Suppose f(x) is a distribution for x(a,b)
$0 \le D(f\|u) = -h_f(x) - E_f\log u(x) = -h_f(x) + \log(b-a) \;\Rightarrow\; h_f(x) \le \log(b-a)$
156
Maximum Entropy Distribution
What zero-mean distribution maximizes the entropy on $(-\infty,\infty)^n$ for a given covariance matrix K?
Answer: a multivariate Gaussian, $\phi(x) = |2\pi K|^{-1/2}\exp\big(-\tfrac12 x^TK^{-1}x\big)$
Proof:
$0 \le D(f\|\phi) = -h_f(x) - E_f\log\phi(x)$
$= -h_f(x) + (\log e)\,E_f\big[\tfrac12\ln|2\pi K| + \tfrac12 x^TK^{-1}x\big]$
$= -h_f(x) + \tfrac12\log e\,\big(\ln|2\pi K| + \mathrm{tr}(E_f\,xx^T\,K^{-1})\big)$
$= -h_f(x) + \tfrac12\log e\,\big(\ln|2\pi K| + \mathrm{tr}\,I\big)$   ($E_f\,xx^T = K$; $\mathrm{tr}(I) = n = \ln e^n$)
$= -h_f(x) + \tfrac12\log\big(2\pi e K\big| \Rightarrow h_f(x) \le \tfrac12\log\big((2\pi e)^n|K|\big)$
Since translation doesn’t affect h(X),
we can assume zero-mean w.l.o.g.
157
Summary
• Differential Entropy:
$h(x) = -\int_{-\infty}^{\infty} f_x(x)\log f_x(x)\,dx$
– Not necessarily positive
– h(x+a) = h(x),
h(ax) = h(x) + log|a|
• Many properties are formally the same
– h(x|y)  h(x)
– I(x; y) = h(x) + h(y) – h(x, y)  0, D(f||g)=E log(f/g)  0
– h(x) concave in fx(x); I(x; y) concave in fx(x)
• Bounds:
– Finite range: Uniform distribution has max: h(x) =
log(b–a)
– Fixed Covariance: Gaussian has max: $h(x) = \tfrac12\log\big((2\pi e)^n|K|\big)$
158
Lecture 13
• Discrete-Time Gaussian Channel Capacity
• Continuous Typical Set and AEP
• Gaussian Channel Coding Theorem
• Bandlimited Gaussian Channel
– Shannon Capacity
159
Capacity of Gaussian Channel
Discrete-time channel: $y_i = x_i + z_i$   [$x_i \to \oplus \to y_i$ with noise $z_i$]
– Zero-mean Gaussian i.i.d. $z_i \sim N(0,N)$
– Average power constraint $n^{-1}\sum_{i=1}^n x_i^2 \le P$
$E\,y^2 = E(x+z)^2 = Ex^2 + 2E(x)E(z) + Ez^2 \le P + N$   (x, z indep and Ez = 0)
Information Capacity
– Define the information capacity: $C = \max_{Ex^2\le P} I(x;y)$
$I(x;y) = h(y) - h(y|x) = h(y) - h(x+z|x) \overset{(a)}{=} h(y) - h(z|x) = h(y) - h(z)$
$\le \tfrac12\log 2\pi e(P+N) - \tfrac12\log 2\pi eN = \tfrac12\log\Big(1 + \dfrac PN\Big)$
(a) translation independence; Gaussian limit, with equality when $x \sim N(0,P)$
The optimal input is Gaussian
160
Achievability
[$w \in 1:2^{nR} \to$ Encoder $\to x \to \oplus z \to y \to$ Decoder $\to \hat w \in 0:2^{nR}$]
• An (M,n) code for a Gaussian channel with power constraint is
– A set of M codewords $x(w) \in X^n$ for w = 1:M with $x(w)^Tx(w) \le nP\ \forall w$
– A deterministic decoder $g(y) \in 0:M$ where 0 denotes failure
– Errors: per codeword: $\lambda_i$; max: $\lambda^{(n)} = \max_i \lambda_i$; average: $P_e^{(n)} = \mathrm{mean}_i\,\lambda_i$
• Rate R is achievable if $\exists$ a sequence of $(2^{nR},n)$ codes with $\lambda^{(n)} \xrightarrow{n\to\infty} 0$
• Theorem: R achievable iff $R < C = \tfrac12\log\big(1 + PN^{-1}\big)$ (*)
(*) proved on the next pages
161
Argument by Sphere Packing
• Each transmitted xi is received as a
probabilistic cloud $y_i$
– cloud 'radius' = $\sqrt{n\,\mathrm{Var}(y|x)} = \sqrt{nN}$   (law of large numbers)
• Energy of $y_i$ constrained to n(P+N) so the clouds must fit into a hypersphere of radius $\sqrt{n(P+N)}$
• Volume of an n-dimensional hypersphere $\propto r^n$
• Max number of non-overlapping clouds: $\dfrac{\big(n(P+N)\big)^{n/2}}{(nN)^{n/2}} = 2^{\frac n2\log(1+PN^{-1})}$
• Max achievable rate is ½log(1+P/N)
162
Continuous AEP
Typical Set: continuous distribution, discrete time i.i.d.
For any ε > 0 and any n, the typical set with respect to f(x) is
$T_\epsilon^{(n)} = \big\{x \in S^n : \big|-n^{-1}\log f(x) - h(x)\big| \le \epsilon\big\}$
where S is the support of f = {x : f(x) > 0}, $f(x) = \prod_{i=1}^n f(x_i)$ since the $x_i$ are independent, and $h(x) = E[-\log f(x_i)] = n^{-1}E[-\log f(x)]$
Typical Set Properties:
1. $p(x\in T_\epsilon^{(n)}) > 1-\epsilon$ for $n > N_\epsilon$   (Proof: LLN)
2. $(1-\epsilon)\,2^{n(h(x)-\epsilon)} \le \mathrm{Vol}(T_\epsilon^{(n)}) \le 2^{n(h(x)+\epsilon)}$ (lower bound for $n > N_\epsilon$), where $\mathrm{Vol}(A) = \int_{x\in A} dx$   (Proof: integrate max/min prob)
163
Continuous AEP Proof
Proof 1: By the law of large numbers,
$-n^{-1}\log f(x_{1:n}) = -n^{-1}\sum_{i=1}^n \log f(x_i) \xrightarrow{\text{prob}} E[-\log f(x)] = h(x)$
Reminder: $x_n \xrightarrow{\text{prob}} y \Leftrightarrow \forall\epsilon>0\ \exists N_\epsilon$ such that $\forall n>N_\epsilon,\ P(|x_n - y| > \epsilon) < \epsilon$
Proof 2a: for $n > N_\epsilon$ (Property 1; max f(x) within T),
$1-\epsilon < \int_{T_\epsilon^{(n)}} f(x)\,dx \le 2^{-n(h(x)-\epsilon)}\int_{T_\epsilon^{(n)}} dx = 2^{-n(h(x)-\epsilon)}\,\mathrm{Vol}\big(T_\epsilon^{(n)}\big)$
Proof 2b: (min f(x) within T)
$1 = \int_{S^n} f(x)\,dx \ge \int_{T_\epsilon^{(n)}} f(x)\,dx \ge 2^{-n(h(x)+\epsilon)}\int_{T_\epsilon^{(n)}} dx = 2^{-n(h(x)+\epsilon)}\,\mathrm{Vol}\big(T_\epsilon^{(n)}\big)$
164
Jointly Typical Set
Jointly Typical: $x_i, y_i$ i.i.d. from $f_{x,y}(x_i,y_i)$
$J_\epsilon^{(n)} = \big\{(x,y) \in \mathbb R^{2n} :\ |-n^{-1}\log f_X(x) - h(x)| \le \epsilon,\ |-n^{-1}\log f_Y(y) - h(y)| \le \epsilon,\ |-n^{-1}\log f_{X,Y}(x,y) - h(x,y)| \le \epsilon\big\}$
Properties:
1. Indiv p.d.: $(x,y) \in J_\epsilon^{(n)} \Rightarrow |\log f_{x,y}(x,y) + nh(x,y)| \le n\epsilon$
2. Total Prob: $p\big((x,y)\in J_\epsilon^{(n)}\big) > 1-\epsilon$ for $n > N_\epsilon$
3. Size: $(1-\epsilon)\,2^{n(h(x,y)-\epsilon)} \le \mathrm{Vol}\big(J_\epsilon^{(n)}\big) \le 2^{n(h(x,y)+\epsilon)}$ (lower bound for $n > N_\epsilon$)
4. Indep x′,y′: $(1-\epsilon)\,2^{-n(I(x;y)+3\epsilon)} \le p\big((x',y')\in J_\epsilon^{(n)}\big) \le 2^{-n(I(x;y)-3\epsilon)}$ (lower bound for $n > N_\epsilon$)
Proof of 4: integrate max/min of f(x′,y′) = f(x′)f(y′), then use the known bounds on Vol(J)
165
Gaussian Channel Coding Theorem
R is achievable iff $R < C = \tfrac12\log(1+PN^{-1})$
Proof (achievability):
Choose ε > 0
Random codebook: $x(w) \in \mathbb R^n$ for $w = 1:2^{nR}$, entries i.i.d. ~ $N(0, P-\epsilon)$
Use joint-typicality decoding
Errors: 1. Power too big: $p\big(x^Tx > nP\big) \to 0 < \epsilon$ for $n > M_\epsilon$
2. y not J.T. with x: $p\big((x,y)\notin J_\epsilon^{(n)}\big) < \epsilon$ for $n > N_\epsilon$
3. another x J.T. with y: $\sum_{j=2}^{2^{nR}} p\big((x_j,y_1)\in J_\epsilon^{(n)}\big) \le (2^{nR}-1)\,2^{-n(I(x;y)-3\epsilon)} < \epsilon$ for large n if $R < I(X;Y)-3\epsilon$
Total error $\le \epsilon + \epsilon + \epsilon = 3\epsilon$
Expurgation: remove half of the codebook*: now the max error $\lambda^{(n)} \le 6\epsilon$
We have constructed a code achieving rate $R - n^{-1}$
*: the worst codebook half includes any $x_i$ with $x_i^Tx_i > nP$ (for which $\lambda_i = 1$)
166
Gaussian Channel Coding Theorem
Proof (): Assume Pe( n )  0 and n-1x T x  P for each x ( w)
nR  H (w )  I (w ; y 1:n )  H (w | y 1:n )
 I (x 1:n ; y 1:n )  H (w | y 1:n )
Data Proc Inequal
 h(y 1:n )  h(y 1:n | x 1:n )  H (w | y 1:n )
n
  h(y i )  h(z 1:n )  H (w | y 1:n )
Indep Bound + Translation
i 1
n
  I (x i ; y i )  1  nRPe( n )
Z i.i.d + Fano, |W|=2nR
i 1
n
  ½ log 1  PN 1   1  nRPe( n )
i 1
max Information Capacity

1
R  ½ log 1  PN 1   n 1  RPe( n )  ½ log 1  PN

167
Bandlimited Channel
• Channel bandlimited to f(–W,W) and signal duration T
– Not exactly
– Most energy in the bandwidth, most energy in the interval
• Nyquist: Signal is defined by 2WT samples
– white noise with double-sided p.s.d. ½N0 becomes
i.i.d gaussian N(0,½N0) added to each coefficient
– Signal power constraint = P ⇒ signal energy ≤ PT
• Energy constraint per coefficient: $n^{-1}x^Tx \le PT/(2WT) = \tfrac12 W^{-1}P$
• Capacity:
$C = \dfrac12\log\Big(1 + \dfrac{P/(2W)}{N_0/2}\Big)\cdot\dfrac{2WT}{T} = W\log\Big(1 + \dfrac{P}{WN_0}\Big)$ bits/second
• More precisely, it can be represented in a vector space of
about n=2WT dimensions with prolate spheroidal
functions as an orthonormal basis
Compare discrete time version: ½log(1+PN–1) bits per channel use
168
Limit of Infinite Bandwidth
$C = W\log\Big(1 + \dfrac{P}{WN_0}\Big)$ bits/second $\xrightarrow{W\to\infty} \dfrac{P}{N_0}\log e$
Minimum signal-to-noise ratio (SNR):
$\dfrac{E_b}{N_0} = \dfrac{PT_b}{N_0} = \dfrac{P/C}{N_0} \ge \ln 2 = -1.6\ \mathrm{dB}$ (as W → ∞)
Given the capacity, there is a trade-off between P and W:
- Increase P, decrease W
- Increase W, decrease P
  - spread spectrum
  - ultra wideband
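The W → ∞ limit and the −1.6 dB floor can be checked directly from C = W log(1 + P/(W N₀)). The sketch below uses assumed values P = 1, N₀ = 1.

```python
# Capacity of the bandlimited Gaussian channel as W grows, and the Eb/N0 floor.
import math

P, N0 = 1.0, 1.0                                    # assumed power and noise p.s.d.
limit = P / N0 * math.log2(math.e)                  # C -> (P/N0) log2 e as W -> infinity

for W in (0.5, 1, 10, 100, 1000):
    C = W * math.log2(1 + P / (W * N0))             # bits/second
    EbN0_dB = 10 * math.log10(P / (C * N0))         # Eb/N0 = P/(C N0)
    print(f"W = {W:6}:  C = {C:.4f}  Eb/N0 = {EbN0_dB:+.2f} dB")

print("limit C =", round(limit, 4),
      "  ln 2 =", round(10 * math.log10(math.log(2)), 2), "dB")
```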
169
Channel Code Performance
• Power Limited
– High bandwidth
– Spacecraft, pagers
– Use QPSK/4-QAM
– Block/convolutional codes
• Bandwidth Limited
– Modems, DVB, Mobile
phones
– 16-QAM to 256-QAM
– Convolution Codes
• Value of 1 dB for space
– Better range, lifetime,
weight, bit rate
– $80 M (1999)
Diagram from “An overview of Communications” by John R Barry
170
Summary
• Gaussian channel capacity
$C = \tfrac12\log\Big(1 + \dfrac PN\Big)$ bits/transmission
• Proved by using the continuous AEP
• Bandlimited channel
$C = W\log\Big(1 + \dfrac{P}{WN_0}\Big)$ bits/second
– Minimum SNR = −1.6 dB as W → ∞
171
Lecture 14
• Parallel Gaussian Channels
– Waterfilling
• Gaussian Channel with Feedback
– Memoryless: no gain
– Memory: gains at most ½ bit/transmission
172
Parallel Gaussian Channels
• n independent Gaussian channels: $y_i = x_i + z_i$
– A model for a nonwhite-noise wideband channel where each component represents a different frequency
– e.g. digital audio, digital TV, broadband ADSL, WiFi (multicarrier/OFDM)
• Noise is independent, $z_i \sim N(0,N_i)$
• Average power constraint $E\,x^Tx \le nP$
• Information capacity: $C = \max_{f(x):\,E_f x^Tx \le nP} I(x;y)$
• R < C ⇒ R achievable (proof as before)
• What is the optimal f(x)?
173
Parallel Gaussian: Max Capacity
Need to find f(x): $C = \max_{f(x):\,E_f x^Tx \le nP} I(x;y)$
$I(x;y) = h(y) - h(y|x) = h(y) - h(z|x)$   (translation invariance)
$= h(y) - h(z) = h(y) - \sum_{i=1}^n h(z_i)$   (x, z indep; $z_i$ indep)
$\overset{(a)}{\le} \sum_{i=1}^n \big(h(y_i) - h(z_i)\big) \overset{(b)}{\le} \sum_{i=1}^n \tfrac12\log\big(1 + P_iN_i^{-1}\big)$
(a) independence bound; (b) capacity limit
Equality when: (a) $y_i$ indep ⇔ $x_i$ indep; (b) $x_i \sim N(0,P_i)$
We need to find the $P_i$ that maximise $\sum_{i=1}^n \tfrac12\log\big(1 + P_iN_i^{-1}\big)$
174
Parallel Gaussian: Optimal Powers
1
We need to find the Pi that maximise

log(
e
)
½
ln
1

P
N

i i 
n
i 1
– subject to power constraint  Pi  nP
i 1
n
– use Lagrange multiplier

n
J   ½ ln 1  Pi N
i 1
1
i
   P
J
1
 ½Pi  N i     0
Pi
n
n
i 1

Also  Pi  nP  v  P  n
i 1
v
i
Pi  N i  v
1
P1
P2
P3
n
N
i 1
i
Water Filling: put most power into
least noisy channels to make equal
power + noise in each channel
N1
N2
N3
175
Very Noisy Channels
• What if there is not enough water?
• Must have $P_i \ge 0\ \forall i$
• If $v < N_i$ then set $P_i = 0$ and recalculate v, i.e. $P_i = \max(v - N_i,\ 0)$ — see the sketch below
[Figure: water level below $N_3$; only channels 1 and 2 receive power]
Kuhn-Tucker Conditions (not examinable):
– Max f(x) subject to $Ax + b = 0$ and $g_i(x) \ge 0$ for i = 1:M, with $f, g_i$ concave
– set $J(x) = f(x) + \sum_{i=1}^M \mu_i g_i(x) + \lambda^TAx$
– $x_0, \lambda, \mu_i$ is a solution iff $\nabla J(x_0) = 0,\ Ax_0 + b = 0,\ g_i(x_0) \ge 0,\ \mu_i \ge 0,\ \mu_i g_i(x_0) = 0$
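The water level v can be found numerically, e.g. by bisection. The sketch below implements the $P_i = \max(v - N_i, 0)$ rule for assumed noise variances.

```python
# Water-filling power allocation for parallel Gaussian channels.
import numpy as np

def waterfill(N, total_power, iters=60):
    """Return powers P_i = max(v - N_i, 0) with sum(P_i) = total_power."""
    N = np.asarray(N, float)
    lo, hi = N.min(), N.max() + total_power          # bracket the water level v
    for _ in range(iters):
        v = (lo + hi) / 2
        if np.maximum(v - N, 0).sum() < total_power:
            lo = v
        else:
            hi = v
    P = np.maximum(v - N, 0)
    C = 0.5 * np.log2(1 + P / N).sum()               # capacity of the block of channels
    return P, C

N = [1.0, 2.0, 5.0]                                  # assumed noise variances N_i
P, C = waterfill(N, total_power=4.0)
print("powers:", np.round(P, 3))                     # most power to the least noisy channel
print("capacity:", round(C, 3), "bits per block of", len(N), "uses")
```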
176
Colored Gaussian Noise
• Suppose y = x + z where $E\,zz^T = K_Z$ and $E\,xx^T = K_X$
• We want to find $K_X$ to maximize capacity subject to the power constraint $E\sum_{i=1}^n x_i^2 \le nP \Leftrightarrow \mathrm{tr}(K_X) \le nP$
– Find the noise eigenvectors: $K_Z = Q\Lambda Q^T$ with $QQ^T = I$
– Now $Q^Ty = Q^Tx + Q^Tz = Q^Tx + w$ where $E\,ww^T = E\,Q^Tzz^TQ = Q^TK_ZQ = \Lambda$ is diagonal
⇒ the $w_i$ are now independent (so the previous result on parallel Gaussian channels applies)
– The power constraint is unchanged: $\mathrm{tr}\big(Q^TK_XQ\big) = \mathrm{tr}\big(K_XQQ^T\big) = \mathrm{tr}(K_X)$
– Use water-filling and independent messages: choose $Q^TK_XQ + \Lambda = vI$, i.e. $Q^TK_XQ = vI - \Lambda$ where $v = P + n^{-1}\mathrm{tr}(\Lambda)$
⇒ $K_X = Q(vI - \Lambda)Q^T$
177
Power Spectrum Water Filling
• If z is from a stationary process then, as n → ∞, diag(Λ) → the noise power spectrum N(f)
– To achieve capacity use waterfilling on the noise power spectrum:
$P = \int_{-W}^{W}\max\big(v - N(f),\ 0\big)\,df$
$C = \int_{-W}^{W}\tfrac12\log\Big(1 + \dfrac{\max(v - N(f),\ 0)}{N(f)}\Big)\,df$
– Waterfilling in the spectral domain
[Figures: noise spectrum with water level v; resulting transmit spectrum]
178
Gaussian Channel + Feedback
$x_i = x_i(w, y_{1:i-1})$, $y_i = x_i + z_i$
Does feedback add capacity?
– White noise (& DMC) – no
– Coloured noise – not much
$I(w;y) = h(y) - h(y|w) = h(y) - \sum_{i=1}^n h(y_i\mid w, y_{1:i-1})$   (chain rule)
$= h(y) - \sum_{i=1}^n h(y_i\mid w, y_{1:i-1}, x_{1:i}, z_{1:i-1})$   since $x_i = x_i(w, y_{1:i-1})$ and z = y − x
$= h(y) - \sum_{i=1}^n h(z_i\mid w, y_{1:i-1}, x_{1:i}, z_{1:i-1})$   (z = y − x and translation invariance)
$= h(y) - \sum_{i=1}^n h(z_i\mid z_{1:i-1})$   (z may be coloured; $z_i$ depends only on $z_{1:i-1}$)
$= h(y) - h(z)$   (chain rule), with $h(z) = \tfrac12\log|2\pi eK_Z|$ bits
$\Rightarrow I(w;y) \le \tfrac12\log\dfrac{|K_y|}{|K_z|}$
⇒ maximize I(w;y) by maximizing h(y) ⇒ y Gaussian
⇒ we can take z and x = y − z jointly Gaussian
179
Maximum Benefit of Feedback
$C_{n,FB} = \max_{\mathrm{tr}(K_x)\le nP} \tfrac12 n^{-1}\log\dfrac{|K_y|}{|K_z|}$
$\le \max_{\mathrm{tr}(K_x)\le nP} \tfrac12 n^{-1}\log\dfrac{|2(K_x+K_z)|}{|K_z|}$   (Lemmas 1 & 2: $|2(K_x+K_z)| \ge |K_y|$)
$= \max_{\mathrm{tr}(K_x)\le nP} \tfrac12 n^{-1}\log\dfrac{2^n|K_x+K_z|}{|K_z|}$   ($|kA| = k^n|A|$)
$= \tfrac12 + \max_{\mathrm{tr}(K_x)\le nP} \tfrac12 n^{-1}\log\dfrac{|K_x+K_z|}{|K_z|} = \tfrac12 + C_n$ bits/transmission
($K_y = K_x + K_z$ if there is no feedback; $C_n$ = capacity without feedback)
Having feedback adds at most ½ bit per transmission for coloured Gaussian noise channels
180
Max Benefit of Feedback: Lemmas
Lemma 1: $K_{x+z} + K_{x-z} = 2(K_x + K_z)$
$K_{x+z} + K_{x-z} = E\big[(x+z)(x+z)^T\big] + E\big[(x-z)(x-z)^T\big]$
$= E\big[xx^T + xz^T + zx^T + zz^T\big] + E\big[xx^T - xz^T - zx^T + zz^T\big] = 2E\,xx^T + 2E\,zz^T = 2(K_x + K_z)$
Lemma 2: if F, G are positive definite then $|F+G| \ge |F|$
Consider two independent random vectors $f \sim N(0,F)$, $g \sim N(0,G)$:
$\tfrac12\log\big((2\pi e)^n|F+G|\big) = h(f+g) \ge h(f+g\mid g) = h(f\mid g) = h(f) = \tfrac12\log\big((2\pi e)^n|F|\big)$
(conditioning reduces h(); translation invariance; f, g independent)
Hence $|2(K_x+K_z)| = |K_{x+z}+K_{x-z}| \ge |K_{x+z}| = |K_y|$
181
Gaussian Feedback Coder
x and z jointly Gaussian ⇒ $x = Bz + v(w)$, where v is independent of z and B is strictly lower triangular since $x_i$ is independent of $z_j$ for $j \ge i$.
$y = x + z = (B+I)z + v$
$K_y = E\,yy^T = E\big[(B+I)zz^T(B+I)^T + vv^T\big] = (B+I)K_z(B+I)^T + K_v$
$K_x = E\,xx^T = E\big[Bzz^TB^T + vv^T\big] = BK_zB^T + K_v$
Capacity: $C_{n,FB} = \max_{K_v,B}\ \tfrac12 n^{-1}\log\dfrac{|K_y|}{|K_z|} = \max_{K_v,B}\ \tfrac12 n^{-1}\log\dfrac{|(B+I)K_z(B+I)^T + K_v|}{|K_z|}$
subject to $\mathrm{tr}(K_x) = \mathrm{tr}\big(BK_zB^T + K_v\big) \le nP$
Hard to solve analytically; the optimization can be done numerically.
182
Gaussian Feedback: Toy Example
0 0
2 1



n  2, P  2, K Z  
, B  

b 0
 1 2
Goal: Maximize (w.r.t. Kv and b)
15
10
| K Y ||  B  I  K z  B  I   K v |
T
5
Subject to:
0
-1.5


Power constraint : tr BK z B  K v  4
T
Solution (via numerically search):
b=0:
|Ky|=16
b=0.69: |Ky|=20.7
C=0.604 bits
C=0.697 bits
-0.5
0
b
0.5
4
PV2
3
1
1.5
PbZ1
2
PV1
1
0
-1.5
Feedback increases C by 16%
-1
5
Power: V1 V2 bZ1
K v must be positive definite
det(Ky)
20
det(KY)
x  Bz  v  x1  v1 , x2  bz1  v2
25
-1
-0.5
Px 1  Pv 1
0
b
0.5
1
1.5
Px 2  Pv 2  Pbz 1
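The numerical search for this toy example is easy to reproduce. The sketch below parametrises $K_v = LL^T$ (an assumption, to keep $K_v$ positive semidefinite) and scales it so the full power budget is used; it should recover approximately b ≈ 0.69 and |K_y| ≈ 20.7 as quoted above.

```python
# Numerical search for the toy Gaussian feedback example (n=2, P=2,
# K_z = [[2,1],[1,2]], B = [[0,0],[b,0]]). K_v = L L^T is an assumed
# parametrisation; a few Nelder-Mead restarts stand in for a careful optimiser.
import numpy as np
from scipy.optimize import minimize

Kz = np.array([[2.0, 1.0], [1.0, 2.0]])
nP = 4.0

def det_Ky(params):
    b, l00, l10, l11 = params
    B = np.array([[0.0, 0.0], [b, 0.0]])
    Kv = np.array([[l00, 0.0], [l10, l11]]) @ np.array([[l00, l10], [0.0, l11]])
    spare = nP - np.trace(B @ Kz @ B.T)          # power left after the feedback term
    if spare <= 0 or np.trace(Kv) <= 1e-9:
        return 0.0
    Kv = Kv * (spare / np.trace(Kv))             # spend the full power budget on v
    Ky = (B + np.eye(2)) @ Kz @ (B + np.eye(2)).T + Kv
    return np.linalg.det(Ky)

starts = [[0.0, 1, 0, 1], [0.7, 1, -0.5, 1], [-0.7, 1, 0.5, 1]]
best = max((minimize(lambda p: -det_Ky(p), x0, method="Nelder-Mead") for x0 in starts),
           key=lambda r: -r.fun)

C = 0.25 * np.log2(-best.fun / np.linalg.det(Kz))   # C = (1/2n) log2 |Ky|/|Kz|
print(f"b = {best.x[0]:.2f}  |Ky| = {-best.fun:.1f}  C = {C:.3f} bits")   # ~0.69, 20.7, 0.70
```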
183
Summary
• Water-filling for parallel Gaussian channels ($x^+ = \max(x,0)$):
$C = \sum_{i=1}^n \tfrac12\log\Big(1 + \dfrac{(v-N_i)^+}{N_i}\Big)$ with $\sum_i (v-N_i)^+ = nP$
• Coloured Gaussian noise ($\lambda_i$ = eigenvalues of $K_z$):
$C = \sum_{i=1}^n \tfrac12\log\Big(1 + \dfrac{(v-\lambda_i)^+}{\lambda_i}\Big)$ with $\sum_i (v-\lambda_i)^+ = nP$
• Continuous Gaussian channel:
$C = \int_{-W}^{W}\tfrac12\log\Big(1 + \dfrac{(v-N(f))^+}{N(f)}\Big)\,df$
• Feedback bound: $C_{n,FB} \le C_n + \tfrac12$
184
Lecture 15
• Lossy Source Coding
– For both discrete and continuous sources
– Bernoulli Source, Gaussian Source
• Rate Distortion Theory
– What is the minimum distortion achievable at
a particular rate?
– What is the minimum rate to achieve a
particular distortion?
• Channel/Source Coding Duality
185
Lossy Source Coding
[$x_{1:n} \to$ Encoder $f(\cdot) \to f(x_{1:n}) \in 1:2^{nR} \to$ Decoder $g(\cdot) \to \hat x_{1:n}$]
Distortion function: $d(x,\hat x) \ge 0$
– examples: (i) $d_S(x,\hat x) = (x-\hat x)^2$; (ii) $d_H(x,\hat x) = 0$ if $x = \hat x$, 1 if $x \ne \hat x$
– sequences: $d(x,\hat x) = n^{-1}\sum_{i=1}^n d(x_i,\hat x_i)$
Distortion of code $f_n(\cdot), g_n(\cdot)$: $D = E_{x\sim X}\,d(x,\hat x) = E\,d\big(x, g_n(f_n(x))\big)$
A rate-distortion pair (R,D) is achievable for source X if $\exists$ a sequence $f_n(\cdot), g_n(\cdot)$ such that $\lim_{n\to\infty} E_{x\sim X}\,d\big(x, g_n(f_n(x))\big) \le D$
186
Rate Distortion Function
The Rate Distortion function for $\{x_i\}$ with pdf p(x) is defined as R(D) = min{R} such that (R,D) is achievable
Theorem: $R(D) = \min I(x;\hat x)$ over all $p(x,\hat x) = p(x)q(\hat x|x)$ such that (a) p(x) is correct, and (b) $E_{x,\hat x}\,d(x,\hat x) \le D$
– this expression is the Rate Distortion function for X
We will prove this next lecture
Lossless coding: if D = 0 then we have R(D) = I(x;x) = H(x)
187
R(D) bound for Bernoulli Source
Bernoulli: X = {0,1}, $p_X = [1-p,\ p]$, assume p ≤ ½
– Hamming distance: $d(x,\hat x) = |x - \hat x|$
– If D ≥ p, R(D) = 0 since we can set g(·) ≡ 0
– For D < p ≤ ½: if $E\,d(x,\hat x) \le D$ then
$I(x;\hat x) = H(x) - H(x|\hat x) = H(p) - H(x\oplus\hat x \mid \hat x)$   (⊕ is one-to-one)
$\ge H(p) - H(x\oplus\hat x)$   (conditioning reduces entropy)
$\ge H(p) - H(D)$   ($\mathrm{Prob}(x\oplus\hat x = 1) \le D \le \tfrac12$ and H() is monotonic on [0,½])
Hence R(D) ≥ H(p) − H(D)
[Figure: R(D) curves for p = 0.2 and p = 0.5]
188
R(D) for Bernoulli source
We know the optimum satisfies R(D) ≥ H(p) − H(D)
– We show we can find a $p(\hat x, x)$ that attains this.
– Peculiarly, we consider a test channel with $\hat x$ as the input and error probability D:
[BSC(D) test channel: $\hat x \to x$ with crossover probability D, and $P(\hat x = 1) = r$]
Now choose r to give x the correct probabilities: $r(1-D) + (1-r)D = p \Rightarrow r = (p-D)(1-2D)^{-1}$, valid for D < p
Now $I(x;\hat x) = H(x) - H(x|\hat x) = H(p) - H(D)$ and $p(x\ne\hat x) = D \Rightarrow$ distortion = D
Hence R(D) = H(p) − H(D)
If D ≥ p or D ≥ 1−p, we can achieve R(D) = 0 trivially.
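The test-channel construction can be verified numerically: build the joint distribution $p(\hat x, x)$ from r and D, and check that the marginal of x is (1−p, p), that E d = D, and that $I(x;\hat x) = H(p) - H(D)$. The sketch below uses assumed values p = 0.3, D = 0.1.

```python
# Verify the Bernoulli R(D) test channel: x_hat ~ Bern(r) through a BSC(D) gives x.
import numpy as np

def H(p):
    q = np.array([p, 1 - p])
    return float(-(q * np.log2(q)).sum())

p, D = 0.3, 0.1                                  # assumed source and distortion
r = (p - D) / (1 - 2 * D)                        # P(x_hat = 1)

# joint distribution p(x_hat, x): rows x_hat in {0,1}, columns x in {0,1}
joint = np.array([[(1 - r) * (1 - D), (1 - r) * D],
                  [r * D,             r * (1 - D)]])

px = joint.sum(axis=0)                           # marginal of x: should be (1-p, p)
Ed = joint[0, 1] + joint[1, 0]                   # expected Hamming distortion
pxh = joint.sum(axis=1)
I = sum(joint[i, j] * np.log2(joint[i, j] / (pxh[i] * px[j]))
        for i in range(2) for j in range(2))

print("marginal of x:", np.round(px, 3))         # (0.7, 0.3)
print("E d =", round(Ed, 3), "  I =", round(I, 4),
      "  H(p)-H(D) =", round(H(p) - H(D), 4))
```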
189
R(D) bound for Gaussian Source
• Assume $X \sim N(0,\sigma^2)$ and $d(x,\hat x) = (x-\hat x)^2$
• Want to minimize $I(x;\hat x)$ subject to $E(x-\hat x)^2 \le D$
$I(x;\hat x) = h(x) - h(x|\hat x) = \tfrac12\log 2\pi e\sigma^2 - h(x-\hat x\mid\hat x)$   (translation invariance)
$\ge \tfrac12\log 2\pi e\sigma^2 - h(x-\hat x)$   (conditioning reduces entropy)
$\ge \tfrac12\log 2\pi e\sigma^2 - \tfrac12\log 2\pi e\,\mathrm{Var}(x-\hat x)$   (Gaussian maximizes entropy for given covariance)
$\ge \tfrac12\log 2\pi e\sigma^2 - \tfrac12\log 2\pi eD$   (require $\mathrm{Var}(x-\hat x) \le E(x-\hat x)^2 \le D$)
$\Rightarrow I(x;\hat x) \ge \max\Big(\tfrac12\log\dfrac{\sigma^2}{D},\ 0\Big)$   (I(x;y) is always non-negative)
190
R(D) for Gaussian Source
To show that we can find a $p(\hat x,x)$ that achieves the bound, we construct a test channel that introduces distortion D < σ²:
$\hat x \sim N(0,\sigma^2-D)$, $z \sim N(0,D)$, $x = \hat x + z \sim N(0,\sigma^2)$
$I(x;\hat x) = h(x) - h(x|\hat x) = \tfrac12\log 2\pi e\sigma^2 - h(x-\hat x\mid\hat x) = \tfrac12\log 2\pi e\sigma^2 - h(z\mid\hat x) = \tfrac12\log\dfrac{\sigma^2}{D}$
$\Rightarrow R(D) = \max\Big\{\tfrac12\log\dfrac{\sigma^2}{D},\ 0\Big\}$, equivalently $D(R) = \sigma^2 2^{-2R}$ (achievable)
cf. PCM: $D(R) = \dfrac{m_p^2/3}{2^{2R}}$; with $m_p = 4\sigma$, $D(R) = \dfrac{(16/3)\sigma^2}{2^{2R}}$
[Figure: R(D) versus $D/\sigma^2$, achievable and impossible regions]
191
Lloyd Algorithm
Problem: find the optimum quantization levels for a Gaussian pdf
a. Bin boundaries are midway between quantization levels
b. Each quantization level equals the mean value of its own bin
Lloyd algorithm: pick random quantization levels then apply conditions (a) and (b) in turn until convergence.
[Figures: quantization levels versus iteration (solid lines are bin boundaries; initial levels uniform in [−1,+1]) and MSE versus iteration]
Best mean square error for 8 levels = 0.0345σ². Predicted D(R) = (σ/8)² = 0.0156σ²
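A minimal sketch of the Lloyd algorithm for a unit-variance Gaussian source is given below. It uses Monte Carlo samples instead of exact integrals (an assumption made to keep the example short); with 8 levels the MSE should approach the ≈0.0345σ² quoted above.

```python
# Lloyd algorithm for a scalar Gaussian source, approximated with samples.
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal(200_000)          # unit-variance Gaussian source
levels = np.linspace(-1.0, 1.0, 8)              # initial levels uniform in [-1, 1]

for _ in range(100):
    # (a) bin boundaries midway between adjacent levels
    edges = (levels[:-1] + levels[1:]) / 2
    idx = np.digitize(samples, edges)           # assign each sample to a bin
    # (b) each level becomes the mean of its own bin
    levels = np.array([samples[idx == k].mean() for k in range(len(levels))])

edges = (levels[:-1] + levels[1:]) / 2
idx = np.digitize(samples, edges)
mse = np.mean((samples - levels[idx]) ** 2)
print("levels:", np.round(levels, 3))
print("MSE:", round(mse, 4))                    # ~0.0345 for 8 levels
```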
192
Vector Quantization
To get D(R), you have to quantize many values together
– True even if the values are independent
[Figure: scalar quantization (D = 0.035) versus vector quantization (D = 0.028) of two Gaussian variables; one quadrant shown]
– Independent quantization puts dense levels in low-probability areas
– Vector quantization is better (even more so if the values are correlated)
193
Multiple Gaussian Variables
• Assume $x_{1:n}$ are independent Gaussian sources with different variances. How should we apportion the available total distortion between the sources?
• Assume $x_i \sim N(0,\sigma_i^2)$ and $d(x,\hat x) = n^{-1}(x-\hat x)^T(x-\hat x) \le D$
$I(x_{1:n};\hat x_{1:n}) \ge \sum_{i=1}^n I(x_i;\hat x_i)$   (mutual information independence bound for independent $x_i$)
$\ge \sum_{i=1}^n R(D_i) = \sum_{i=1}^n \max\Big(\tfrac12\log\dfrac{\sigma_i^2}{D_i},\ 0\Big)$   (R(D) for an individual Gaussian)
We must find the $D_i$ that minimize $\sum_{i=1}^n \max\big(\tfrac12\log\tfrac{\sigma_i^2}{D_i},\ 0\big)$ such that $n^{-1}\sum_{i=1}^n D_i = D$;
the answer is $D_i = D_0$ if $D_0 < \sigma_i^2$, and $D_i = \sigma_i^2$ otherwise.
194
Reverse Water-filling
Minimize $\sum_{i=1}^n \max\big(\tfrac12\log\tfrac{\sigma_i^2}{D_i},\ 0\big)$ subject to $\sum_{i=1}^n D_i = nD$
Use a Lagrange multiplier:
$J = \sum_{i=1}^n \tfrac12\log\dfrac{\sigma_i^2}{D_i} + \lambda\sum_{i=1}^n D_i$
$\dfrac{\partial J}{\partial D_i} = -\tfrac12 D_i^{-1} + \lambda = 0 \Rightarrow D_i = \tfrac{1}{2\lambda} = D_0$; and $\sum_{i=1}^n D_i = nD_0 = nD \Rightarrow D_0 = D$
$R_i = \tfrac12\log\dfrac{\sigma_i^2}{D_0}$: choose the $R_i$ for equal distortion $D_i = D_0$
• If $\sigma_i^2 < D_0$ then set $R_i = 0$ (meaning $D_i = \sigma_i^2$) and increase $D_0$ to maintain the average distortion equal to D
[Figure: reverse water-filling; $R_i = \tfrac12\log\sigma_i^2 - \tfrac12\log D_0$ above the water level; here $R_2 = 0$ since $\sigma_2^2 < D_0$]
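Reverse water-filling is also a one-line bisection on the water level $D_0$. The sketch below uses made-up variances and an assumed target average distortion.

```python
# Reverse water-filling for independent Gaussian sources.
import numpy as np

def reverse_waterfill(var, D, iters=60):
    """Return per-source distortions D_i and rates R_i (bits) for average distortion D."""
    var = np.asarray(var, float)
    lo, hi = 0.0, var.max()                 # bisect on the water level D0
    for _ in range(iters):
        D0 = (lo + hi) / 2
        Di = np.minimum(D0, var)            # D_i = min(D0, sigma_i^2)
        if Di.mean() < D:
            lo = D0
        else:
            hi = D0
    Di = np.minimum(D0, var)
    Ri = np.maximum(0.5 * np.log2(var / Di), 0.0)
    return Di, Ri

var = [4.0, 1.0, 0.25]                      # assumed sigma_i^2
Di, Ri = reverse_waterfill(var, D=0.5)
print("D_i:", np.round(Di, 3), " mean:", round(Di.mean(), 3))
print("R_i:", np.round(Ri, 3), " total rate:", round(Ri.sum(), 3), "bits per source vector")
```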
195
Channel/Source Coding Duality
• Channel Coding
– Find codes separated enough to
give non-overlapping output
images.
– Image size = channel noise
– The maximum number (highest
rate) is when the images just don’t
overlap (some gap).
Noisy channel
Sphere Packing
Lossy encode
• Source Coding
– Find regions that cover the sphere
– Region size = allowed distortion
– The minimum number (lowest rate)
is when they just fill the sphere
(with no gap).
Sphere Covering
196
Gaussian Channel/Source
• Capacity of the Gaussian channel (n: block length)
– Radius of the big sphere: $\sqrt{n(P+N)}$; radius of the small spheres: $\sqrt{nN}$
– Capacity: $2^{nC} = \dfrac{\big(n(P+N)\big)^{n/2}}{(nN)^{n/2}} = \Big(\dfrac{P+N}{N}\Big)^{n/2}$ = maximum number of small spheres packed in the big sphere
• Rate distortion for the Gaussian source
– Variance σ² ⇒ radius of the big sphere: $\sqrt{n\sigma^2}$; radius of the small spheres: $\sqrt{nD}$ for distortion D
– Rate: $2^{nR(D)} = \Big(\dfrac{\sigma^2}{D}\Big)^{n/2}$ = minimum number of small spheres needed to cover the big sphere
197
Channel Decoder as Source Encoder
[$w \in 2^{nR} \to$ Channel Encoder $\to y_{1:n} \sim N(0,\sigma^2-D) \to \oplus\, z_{1:n} \sim N(0,D) \to x_{1:n} \sim N(0,\sigma^2) \to$ Channel Decoder $\to \hat w \in 2^{nR}$]
• For $R < C = \tfrac12\log\big(1 + (\sigma^2-D)D^{-1}\big)$, we can find a channel encoder/decoder so that $p(\hat w \ne w) < \epsilon$ and $E(x_i - y_i)^2 \approx D$
• Now reverse the roles of encoder and decoder. Since $p(\hat x \ne y) = p(w \ne \hat w) < \epsilon$ and $E(x_i - \hat x_i)^2 \approx E(x_i - y_i)^2 \approx D$:
[source encoder = the channel decoder; source decoder = the channel encoder]
We have encoded x at rate $R = \tfrac12\log(\sigma^2D^{-1})$ with distortion D!
198
Summary
• Lossy source coding: tradeoff between rate and
distortion
• Rate distortion function: $R(D) = \min_{p(\hat x|x)\,:\,E\,d(x,\hat x)\le D} I(x;\hat x)$
• Bernoulli source: $R(D) = \big(H(p) - H(D)\big)^+$
• Gaussian source (reverse waterfilling): $R(D) = \big(\tfrac12\log\tfrac{\sigma^2}{D}\big)^+$
• Duality: channel decoding (encoding) ↔ source encoding (decoding)
199
Nothing But Proof
• Proof of Rate Distortion Theorem
– Converse: if the rate is less than R(D), then
distortion of any code is higher than D
– Achievability: if the rate is higher than R(D),
then there exists a rate-R code which
achieves distortion D
Quite technical!
200
Review
[$x_{1:n} \to$ Encoder $f_n(\cdot) \to f(x_{1:n}) \in 1:2^{nR} \to$ Decoder $g_n(\cdot) \to \hat x_{1:n}$]
The Rate Distortion function for x whose $p_x(x)$ is known:
$R(D) = \inf R$ such that $\exists f_n, g_n$ with $\lim_{n\to\infty} E\,d(x,\hat x) \le D$
Rate Distortion Theorem: $R(D) = \min I(x;\hat x)$ over all $p(\hat x|x)$ such that $E_{x,\hat x}\,d(x,\hat x) \le D$
We will prove this theorem for discrete X and bounded $d(x,y) \le d_{max}$
The R(D) curve depends on your choice of d(·,·); it is decreasing and convex
[Figure: R(D) versus $D/\sigma^2$ with achievable and impossible regions]
201
Converse: Rate Distortion Bound
Suppose we have found an encoder and decoder at rate $R_0$ with expected distortion D for independent $x_i$ (the worst case)
We want to prove that $R_0 \ge R(D) = R\big(E\,d(x;\hat x)\big)$
– We show first that $R_0 \ge n^{-1}\sum_i I(x_i;\hat x_i)$
– We know that $I(x_i;\hat x_i) \ge R\big(E\,d(x_i;\hat x_i)\big)$   (defn of R(D))
– and use convexity of R(D) to show
$n^{-1}\sum_i R\big(E\,d(x_i;\hat x_i)\big) \ge R\Big(n^{-1}\sum_i E\,d(x_i;\hat x_i)\Big) = R\big(E\,d(x;\hat x)\big) = R(D)$
We prove convexity first and then the rest
202
Convexity of R(D)
If $p_1(\hat x|x)$ and $p_2(\hat x|x)$ are associated with $(D_1,R_1)$ and $(D_2,R_2)$ on the R(D) curve, define $p_\lambda(\hat x|x) = \lambda p_1(\hat x|x) + (1-\lambda)p_2(\hat x|x)$
Then $E_{p_\lambda}\,d(x,\hat x) = \lambda D_1 + (1-\lambda)D_2 = D_\lambda$
$R(D_\lambda) \le I_{p_\lambda}(x;\hat x)$   ($R(D) = \min_{p(\hat x|x)} I(x;\hat x)$)
$\le \lambda I_{p_1}(x;\hat x) + (1-\lambda)I_{p_2}(x;\hat x)$   ($I(x;\hat x)$ is convex w.r.t. $p(\hat x|x)$)
$= \lambda R(D_1) + (1-\lambda)R(D_2)$   ($p_1$ and $p_2$ lie on the R(D) curve)
[Figure: the chord joining $(D_1,R_1)$ and $(D_2,R_2)$ lies above the R(D) curve]
203
Proof that R  R(D)
$nR_0 \ge H(\hat x_{1:n}) \ge H(\hat x_{1:n}) - H(\hat x_{1:n}|x_{1:n})$   (uniform bound; $H(\hat x|x) \ge 0$)
$= I(\hat x_{1:n};x_{1:n})$   (definition of I(;))
$\ge \sum_{i=1}^n I(x_i;\hat x_i)$   ($x_i$ indep: mutual information independence bound)
$\ge \sum_{i=1}^n R\big(E\,d(x_i;\hat x_i)\big) = n\Big[n^{-1}\sum_{i=1}^n R\big(E\,d(x_i;\hat x_i)\big)\Big]$   (definition of R)
$\ge nR\Big(n^{-1}\sum_{i=1}^n E\,d(x_i;\hat x_i)\Big) = nR\big(E\,d(x_{1:n};\hat x_{1:n})\big)$   (convexity; defn of vector d())
$\ge nR(D)$   (original assumption that E(d) ≤ D, and R(D) is monotonically decreasing)
204
Rate Distortion Achievability
[$x_{1:n} \to$ Encoder $f_n(\cdot) \to f(x_{1:n}) \in 1:2^{nR} \to$ Decoder $g_n(\cdot) \to \hat x_{1:n}$]
We want to show that for any D, we can find an encoder and decoder that compresses $x_{1:n}$ to nR(D) bits.
• $p_X$ is given
• Assume we know the $p(\hat x|x)$ that gives $I(x;\hat x) = R(D)$
• Random codebook: choose $2^{nR}$ random $\hat x_i \sim p_{\hat x}$
– There must be at least one code that is as good as the average
• Encoder: use joint typicality
– We show that there is almost always a suitable codeword
First define the typical set we will use, then prove two preliminary results.
205
Distortion Typical Set
Distortion Typical: $(x_i,\hat x_i) \in X\times\hat X$ drawn i.i.d. ~ $p(x,\hat x)$
$J_{d,\epsilon}^{(n)} = \big\{(x,\hat x) \in X^n\times\hat X^n :\ |-n^{-1}\log p(x) - H(x)| \le \epsilon,\ |-n^{-1}\log p(\hat x) - H(\hat x)| \le \epsilon,\ |-n^{-1}\log p(x,\hat x) - H(x,\hat x)| \le \epsilon,\ |d(x,\hat x) - E\,d(x,\hat x)| \le \epsilon\big\}$   (the last is the new condition)
Properties of the typical set:
1. Indiv p.d.: $(x,\hat x) \in J_{d,\epsilon}^{(n)} \Rightarrow |\log p(x,\hat x) + nH(x,\hat x)| \le n\epsilon$
2. Total prob: $p\big((x,\hat x)\in J_{d,\epsilon}^{(n)}\big) > 1-\epsilon$ for $n > N_\epsilon$   (weak law of large numbers; the $d(x_i,\hat x_i)$ are i.i.d.)
206
Conditional Probability Bound
Lemma: $(x,\hat x) \in J_{d,\epsilon}^{(n)} \Rightarrow p(\hat x) \ge p(\hat x|x)\,2^{-n(I(x;\hat x)+3\epsilon)}$
Proof:
$p(\hat x|x) = \dfrac{p(\hat x,x)}{p(x)} = p(\hat x)\dfrac{p(\hat x,x)}{p(\hat x)p(x)} \le p(\hat x)\dfrac{2^{-n(H(x,\hat x)-\epsilon)}}{2^{-n(H(x)+\epsilon)}\,2^{-n(H(\hat x)+\epsilon)}} = p(\hat x)\,2^{n(I(x;\hat x)+3\epsilon)}$
(take the max of the numerator and the min of the denominator; bounds from the defn of J; defn of I)
207
Curious but Necessary Inequality
Lemma: $u,v \in [0,1],\ m > 0 \Rightarrow (1-uv)^m \le 1 - u + e^{-vm}$
Proof:
u = 0: $e^{-vm} > 0 \Rightarrow (1-0)^m = 1 \le 1 - 0 + e^{-vm}$
u = 1: define $f(v) = e^{-v} - 1 + v \Rightarrow f'(v) = 1 - e^{-v}$; $f(0) = 0$ and $f'(v) \ge 0$ for $v \ge 0$, so $0 \le 1 - v \le e^{-v}$ for $v \in [0,1]$, hence $(1-v)^m \le e^{-vm}$
0 < u < 1: define $g_v(u) = (1-uv)^m \Rightarrow g_v''(u) = m(m-1)v^2(1-uv)^{m-2} \ge 0 \Rightarrow g_v(u)$ convex, so for $u,v\in[0,1]$
$(1-uv)^m = g_v(u) \le (1-u)g_v(0) + u\,g_v(1) = (1-u) + u(1-v)^m \le 1 - u + u\,e^{-vm} \le 1 - u + e^{-vm}$
208
Achievability of R(D): preliminaries
[$x_{1:n} \to$ Encoder $f_n(\cdot) \to f(x_{1:n}) \in 1:2^{nR} \to$ Decoder $g_n(\cdot) \to \hat x_{1:n}$]
• Choose D and find a $p(\hat x|x)$ such that $I(x;\hat x) = R(D)$ and $E\,d(x,\hat x) \le D$
• Choose ε > 0 and define $p_{\hat x}(\hat x) = \sum_x p(x)p(\hat x|x)$
• Decoder: for each $w \in 1:2^{nR}$ choose $g_n(w) = \hat x_w$ drawn i.i.d. ~ $p_{\hat x}^n$
• Encoder: $f_n(x) = \min w$ such that $(x,\hat x_w) \in J_{d,\epsilon}^{(n)}$, else 1 if there is no such w
• Expected distortion: $\bar D = E_{x,g}\,d(x,\hat x)$
– over all input vectors x and all random decoding functions g
– for large n we show $\bar D \le D + \epsilon$ (essentially), so there must be one good code
209
Expected Distortion
We can divide the input vectors x into two categories:
a) if $\exists w$ such that $(x,\hat x_w) \in J_{d,\epsilon}^{(n)}$ then $d(x,\hat x_w) \le D + \epsilon$, since $E\,d(x,\hat x) \le D$
b) if no such w exists we must have $d(x,\hat x_w) \le d_{max}$, since we are assuming that d(·,·) is bounded. Suppose the probability of this situation is $P_e$.
Hence $\bar D = E_{x,g}\,d(x,\hat x) \le (1-P_e)(D+\epsilon) + P_e d_{max} \le D + \epsilon + P_e d_{max}$
We need to show that the expected value of $P_e$ is small
210
Error Probability
Define the set of valid inputs for the (random) code g: $V(g) = \{x : \exists w$ with $(x,g(w)) \in J_{d,\epsilon}^{(n)}\}$
We have $P_e = \sum_g p(g)\sum_{x\notin V(g)} p(x) = \sum_x p(x)\sum_{g:\,x\notin V(g)} p(g)$   (change the order)
Define $K(x,\hat x) = 1$ if $(x,\hat x) \in J_{d,\epsilon}^{(n)}$, else 0
Prob that a random $\hat x$ does not match x: $1 - \sum_{\hat x} p(\hat x)K(x,\hat x)$
Prob that an entire code does not match x: $\big(1 - \sum_{\hat x} p(\hat x)K(x,\hat x)\big)^{2^{nR}}$   (codewords are i.i.d.)
Hence $P_e = \sum_x p(x)\Big(1 - \sum_{\hat x} p(\hat x)K(x,\hat x)\Big)^{2^{nR}}$
211
Achievability for Average Code
Since $(x,\hat x) \in J_{d,\epsilon}^{(n)} \Rightarrow p(\hat x) \ge p(\hat x|x)\,2^{-n(I(x;\hat x)+3\epsilon)}$,
$P_e = \sum_x p(x)\Big(1 - \sum_{\hat x} p(\hat x)K(x,\hat x)\Big)^{2^{nR}} \le \sum_x p(x)\Big(1 - 2^{-n(I(x;\hat x)+3\epsilon)}\sum_{\hat x} p(\hat x|x)K(x,\hat x)\Big)^{2^{nR}}$
Using $(1-uv)^m \le 1 - u + e^{-vm}$ with $u = \sum_{\hat x} p(\hat x|x)K(x,\hat x)$, $v = 2^{-n(I(x;\hat x)+3\epsilon)}$, $m = 2^{nR}$ (note $0 \le u,v \le 1$ as required):
$P_e \le \sum_x p(x)\Big(1 - \sum_{\hat x} p(\hat x|x)K(x,\hat x)\Big) + \exp\big(-2^{-n(I(x;\hat x)+3\epsilon)}\,2^{nR}\big)$
212
Achievability for Average Code
$P_e \le \sum_x p(x)\Big(1 - \sum_{\hat x} p(\hat x|x)K(x,\hat x)\Big) + \exp\big(-2^{-n(I(x;\hat x)+3\epsilon)}\,2^{nR}\big)$
$= 1 - \sum_{x,\hat x} p(x,\hat x)K(x,\hat x) + \exp\big(-2^{n(R-I(x;\hat x)-3\epsilon)}\big)$   (the mutual information does not involve a particular x)
$= P\big((x,\hat x)\notin J_{d,\epsilon}^{(n)}\big) + \exp\big(-2^{n(R-I(x;\hat x)-3\epsilon)}\big) \xrightarrow{n\to\infty} 0$
since both terms → 0 as n → ∞ provided $R > I(x;\hat x) + 3\epsilon$
Hence for any ε > 0, $\bar D = E_{x,g}\,d(x,\hat x)$ can be made ≤ D + ε for n large enough
213
Achievability
Since, for any ε > 0, $\bar D = E_{x,g}\,d(x,\hat x)$ can be made ≤ D + ε, there must be at least one code g with $E_x\,d(x,\hat x) \le D + \epsilon$
Hence (R,D) is achievable for any R > R(D), that is $\lim_{n\to\infty} E\,d(x,\hat x) \le D$
[$x_{1:n} \to$ Encoder $f_n(\cdot) \to f(x_{1:n}) \in 1:2^{nR} \to$ Decoder $g_n(\cdot) \to \hat x_{1:n}$]
In fact a stronger result is true (proof in C&T 10.6): $\forall\epsilon>0$, D and R > R(D), $\exists f_n, g_n$ with $p\big(d(x,\hat x) \le D+\epsilon\big) \xrightarrow{n\to\infty} 1$
214
Lecture 16
• Introduction to network information
theory
• Multiple access
• Distributed source coding
215
Network Information Theory
• System with many senders and receivers
• New elements: interference, cooperation,
competition, relay, feedback…
• Problem: decide whether or not the sources can
be transmitted over the channel
– Distributed source coding
– Distributed communication
– The general problem has not yet been solved, so we
consider various special cases
• Results are presented without proof (can be
done using mutual information, joint AEP)
216
Implications to Network Design
• Examples of large information networks
– Computer networks
– Satellite networks
– Telephone networks
• A complete theory of network communications
would have wide implications for the design of
communication and computer networks
• Examples
– CDMA (code-division multiple access): mobile phone
network
– Network coding: significant capacity gain compared to
routing-based networks
217
Network Models Considered
• Multi-access channel
• Broadcast channel
• Distributed source coding
• Relay channel
• Interference channel
• Two-way channel
• General communication network
218
State of the Art
• Triumphs
– Multi-access channel
– Gaussian broadcast channel
• Unknowns
– The simplest relay channel
– The simplest interference channel
Reminder: Networks being built (ad hoc networks, sensor
networks) are much more complicated
219
Multi-Access Channel
• Example: many users
communicate with a
common base station over
a common channel
• What rates are achievable
simultaneously?
• Best understood multiuser
channel
• Very successful: 3G CDMA
mobile phone networks
220
Capacity Region
• Capacity of the single-user Gaussian channel: $C = \tfrac12\log\big(1+\tfrac PN\big) \equiv C\big(\tfrac PN\big)$
• Gaussian multi-access channel with m users: $Y = \sum_{i=1}^m X_i + Z$; each $X_i$ has equal power P, the noise Z has variance N
• Capacity region ($R_i$: rate for user i):
$R_i < C\big(\tfrac PN\big)$
$R_i + R_j < C\big(\tfrac{2P}{N}\big)$
$R_i + R_j + R_k < C\big(\tfrac{3P}{N}\big)$
…
$\sum_{i=1}^m R_i < C\big(\tfrac{mP}{N}\big)$
Transmission: independent and simultaneous (i.i.d. Gaussian codebooks)
Decoding: joint decoding, look for the m codewords whose sum is closest to Y
The last inequality dominates when all rates are the same
The sum rate goes to ∞ with m
221
Two-User Channel
• Capacity region (a numeric check follows below):
$R_1 < C\big(\tfrac{P_1}{N}\big)$,  $R_2 < C\big(\tfrac{P_2}{N}\big)$,  $R_1 + R_2 < C\big(\tfrac{P_1+P_2}{N}\big)$
– Corresponds to CDMA
– Surprising fact: the sum rate equals the rate achieved by a single sender with power $P_1+P_2$
[Figure: pentagonal capacity region; corner points = onion peeling; FDMA/TDMA boundary; naïve TDMA line]
• Achieves a higher sum rate than treating interference as noise, i.e. than $C\big(\tfrac{P_1}{P_2+N}\big) + C\big(\tfrac{P_2}{P_1+N}\big)$
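The sketch below checks these expressions numerically for assumed example numbers (P₁ = P₂ = 10, N = 1): the onion-peeling corner point sums to the sum capacity, while treating interference as noise gives less.

```python
# Numeric check of the two-user Gaussian MAC region.
import math

def C(x):
    """C(x) = 0.5*log2(1+x) bits per channel use."""
    return 0.5 * math.log2(1 + x)

P1, P2, N = 10.0, 10.0, 1.0                         # assumed powers and noise
print("single-user rates:", round(C(P1 / N), 3), round(C(P2 / N), 3))
print("sum capacity:", round(C((P1 + P2) / N), 3))

# corner point via onion peeling: decode user 2 first (user 1 treated as noise)
R2_corner = C(P2 / (P1 + N))
R1_corner = C(P1 / N)
print("corner point (R1, R2):", round(R1_corner, 3), round(R2_corner, 3),
      " sum:", round(R1_corner + R2_corner, 3))      # equals the sum capacity
print("interference as noise:",
      round(C(P1 / (P2 + N)) + C(P2 / (P1 + N)), 3)) # strictly smaller
```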
222
Onion Peeling
• Interpretation of corner point: onion-peeling
– First stage: decode user 2, treating user 1 as noise
– Second stage: subtract out user 2, then decode user 1
• In fact, it can achieve the entire capacity region
– Any rate-pairs between two corner points achievable by timesharing
• Its technical term is successive interference cancelation
(SIC)
– Removes the need for joint decoding
– Uses a sequence of single-user decoders
• SIC is implemented in the uplink of CDMA 2000 EV-DO
(evolution-data optimized)
– Increases throughput by about 65%
223
Comparison with TDMA and FDMA
• FDMA (frequency-division multiple access):
$R_1 = W_1\log\big(1 + \tfrac{P_1}{N_0W_1}\big)$,  $R_2 = W_2\log\big(1 + \tfrac{P_2}{N_0W_2}\big)$
Total bandwidth $W = W_1 + W_2$; varying $W_1$ and $W_2$ traces out the curve in the figure
• TDMA (time-division multiple access)
– Each user is allotted a time slot, transmits and other
users remain silent
– Naïve TDMA: dashed line
– Can do better while still maintaining the same
average power constraint; the same capacity region
as FDMA
• CDMA capacity region is larger
– But needs a more complex decoder
224
Distributed Source Coding
• Associated with the nodes are sources that are generally dependent
• How do we take advantage of the dependence
to reduce the amount of information
transmitted?
• Consider the special case where channels are
noiseless and without interference
• Find the set of rates associated with each source such that all required sources can be decoded at the destination
• Data compression dual to multi-access channel
225
Two-User Distributed Source Coding
• X and Y are correlated
• But the encoders cannot
communicate; have to
encode independently
• A single source: R > H(X)
• Two sources: R > H(X,Y) if
encoding together
• What if encoding separately?
• Of course one can do R > H(X) + H(Y)
• Surprisingly, R = H(X,Y) is sufficient (Slepian-Wolf
coding, 1973)
• Sadly, the coding scheme was not practical (again)
226
Slepian-Wolf Coding
• Achievable rate region
R1  H ( X | Y )
achievable
R2  H (Y | X )
R1  R2  H ( X , Y )
• Core idea: joint typicality
• Interpretation of corner point R1 =
H(X), R2 = H(Y|X)
– X can encode as usual
– Associate with each xn is a jointly
typical fan (however Y doesn’t know)
– Y sends the color (thus compression)
– Decoder uses the color to determine
the point in jointly typical fan
associated with xn
• Straight line: achieved by timesharing
227
Wyner-Ziv Coding
• Distributed source coding with side information
• Y is encoded at rate R2
• Only X is to be recovered
• How many bits R1 are required?
• If R2 = H(Y), then R1 = H(X|Y) by Slepian-Wolf coding
• In general
R1  H ( X | U )
R2  I (Y ;U )
where U is an auxiliary random variable (can be
thought of as approximate version of Y)
228
Rate-Distortion
• Given Y, what is the rate-distortion function for describing X?
$R_Y(D) = \min_{p(w|x)}\ \min_f\ \{I(X;W) - I(Y;W)\}$
over all decoding functions $f: Y\times W \to \hat X$ and all p(w|x) such that $E_{x,w,y}\,d(x,\hat x) \le D$
(compare the single-source case: $R(D) = \min I(x;\hat x)$ over all $p(\hat x|x)$ with $E_{x,\hat x}\,d(x,\hat x) \le D$)
• The general problem of
rate-distortion for
correlated sources
remains unsolved
229
Lecture 17
• Network information theory – II
– Broadcast
– Relay
– Interference channel
– Two-way channel
– Comments on general communication
networks
230
Broadcast Channel
• One-to-many: HDTV station sending different
information simultaneously to many TV receivers
over a common channel; lecturer in classroom
• What are the achievable rates for all different
receivers?
• How does the sender encode information meant for different receivers in a common signal?
• Only partial answers are known.
231
Two-User Broadcast Channel
• Consider a memoryless broadcast channel with
one encoder and two decoders
• Independent messages at rate R1 and R2
• Degraded broadcast channel: p(y1, y2|x) = p(y1|x)
p(y2| y1)
– Meaning X  Y1  Y2 (Markov chain)
– Y2 is a degraded version of Y1 (receiver 1 is better)
• Capacity region of degraded broadcast channel
$R_2 \le I(U;Y_2)$,  $R_1 \le I(X;Y_1\mid U)$, where U is an auxiliary random variable
232
Scalar Gaussian Broadcast Channel
• All scalar Gaussian broadcast channels belong to the class of degraded channels:
$Y_1 = X + Z_1$, $Y_2 = X + Z_2$; assume variance $N_1 < N_2$
• Capacity region:
$R_1 < C\big(\tfrac{\alpha P}{N_1}\big)$,  $R_2 < C\big(\tfrac{(1-\alpha)P}{\alpha P + N_2}\big)$,  $0 \le \alpha \le 1$
Coding strategy
Encoding: one codebook with power αP at rate R1, another with power (1−α)P at rate R2; send the sum of the two codewords
Decoding: the bad receiver Y2 treats the other codeword as noise; the good receiver Y1 first decodes Y2's message, subtracts it out, then decodes his own message
233
Relay Channel
• One source, one destination, one or more
intermediate relays
• Example: one relay
– A broadcast channel (X to Y and Y1)
– A multi-access channel (X and X1 to Y)
– Capacity is unknown! Upper bound:
$C \le \sup_{p(x,x_1)}\min\{I(X,X_1;Y),\ I(X;Y,Y_1\mid X_1)\}$
– Max-flow min-cut interpretation
• First term: maximum rate from X and X1 to Y
• Second term: maximum rate from X to Y and Y1
234
Degraded Relay Channel
• In general, the max-flow min-cut bound cannot
be achieved
• Reason
– Interference
– What for the relay to forward?
– How to forward?
• Capacity is known for the degraded relay channel (i.e., Y is a degraded version of Y1, or the relay is better than the receiver); the upper bound is achieved:
$C = \sup_{p(x,x_1)}\min\{I(X,X_1;Y),\ I(X;Y,Y_1\mid X_1)\}$
235
Gaussian Relay Channel
• Channel model:
$Y_1 = X + Z_1$, $\mathrm{Var}(Z_1) = N_1$   (source → relay)
$Y = X + Z_1 + X_1 + Z_2$, $\mathrm{Var}(Z_2) = N_2$; X has power P, $X_1$ has power $P_1$
• Encoding at the relay: $X_{1i} = f_i(Y_{11}, Y_{12}, \ldots, Y_{1,i-1})$
• Capacity:
$C = \max_{0\le\alpha\le1}\min\Big\{C\Big(\dfrac{P + P_1 + 2\sqrt{(1-\alpha)PP_1}}{N_1+N_2}\Big),\ C\Big(\dfrac{\alpha P}{N_1}\Big)\Big\}$
– If $\dfrac{P_1}{N_2} \ge \dfrac{P}{N_1}$ (relay→destination SNR ≥ source→relay SNR) then $C = C(P/N_1)$: the capacity from source to relay can be achieved (exercise)
– The rate $C\big(P/(N_1+N_2)\big)$ without the relay is increased by the relay to $C(P/N_1)$
236
Interference Channel
• Two senders, two receivers, with crosstalk
– Y1 listens to X1 and doesn’t care
what X2 speaks or what Y2 hears
– Similarly with X2 and Y2
• Neither a broadcast channel nor a multiaccess channel
• This channel has not been solved
– Capacity is known to within one bit (Etkin, Tse, Wang
2008)
– A promising technique: interference alignment (Cadambe, Jafar 2008)
237
Symmetric Interference Channel
• Model: $Y_1 = X_1 + aX_2 + Z_1$, $Y_2 = X_2 + aX_1 + Z_2$; equal power P; $\mathrm{Var}(Z_1) = \mathrm{Var}(Z_2) = N$
• Capacity has been derived in the strong interference case (a ≥ 1) (Han, Kobayashi, 1981)
– Very strong interference ($a^2 \ge 1 + P/N$) is equivalent to no interference whatsoever
• Symmetric capacity (for each user $R_1 = R_2$); SNR = P/N, INR = $a^2P/N$
238
Capacity
239
Very strong interference = no interference
• Each sender has power P and rate C(P/N)
• Independently sends a codeword from a
Gaussian codebook
• Consider receiver 1
– Treats sender 1 as interference
– Can decode sender 2 at rate C(a2P/(P+N))
– If C(a2P/(P+N)) > C(P/N), i.e.,
rate 2  1 > rate 2  2 (crosslink is better)
he can perfectly decode sender 2
– Subtracting it from received signal, he sees a clean
channel with capacity C(P/N)
240
An Example
• Two cell-edge users (bottleneck of the cellular network)
• No exchange of data between the base stations or
between the mobiles
• Traditional approaches
– Orthogonalizing the two links (reuse ½)
– Universal frequency reuse and treating interference as noise
• Higher capacity can be achieved by advanced
interference management
241
Two-Way Channel
• Similar to interference channel, but in both
directions (Shannon 1961)
• Feedback
– Sender 1 can use previously
received symbols from sender 2, and vice versa
– They can cooperate with each other
• Gaussian channel:
– Capacity region is known (not the case in general)
– Decompose into two independent channels
$R_1 \le C\big(\tfrac{P_1}{N_1}\big)$,  $R_2 \le C\big(\tfrac{P_2}{N_2}\big)$
Coding strategy: Sender 1 sends a codeword;
so does sender 2. Receiver 2 receives a sum
but he can subtract out his own thus having an
interference-free channel from sender 1.
242
General Communication Network
• Many nodes trying to communicate with each other
• Allows computation at each node using it own message
and all past received symbols
• All the models we have considered are special cases
• A comprehensive theory of network information flow is
yet to be found
243
Capacity Bound for a Network
• Max-flow min-cut
– Minimizing the maximum
flow across cut sets
yields an upper bound
on the capacity of a
network
• Outer bound on capacity
region
$\sum_{i\in S,\,j\in S^c} R^{(i,j)} \le I\big(X^{(S)};\ Y^{(S^c)} \,\big|\, X^{(S^c)}\big)$
– Not achievable in general
244
Questions to Answer
• Why multi-hop relay? Why
decode and forward? Why
treat interference as noise?
• Source-channel separation?
Feedback?
• What is really the best way
to operate wireless
networks?
• What are the ultimate limits
to information transfer over
wireless networks?
245
Scaling Law for Wireless Networks
• High signal attenuation:
(transport) capacity is O(n)
bit-meter/sec for a planar
network with n nodes (XieKumar’04)
• Low attenuation: capacity can
grow superlinearly
• Requires cooperation between
nodes
• Multi-hop relay is suboptimal
but order optimal
246
Network Coding
• Routing: store and forward
(as in Internet)
• Network coding: recompute
and redistribute
• Given the network topology,
coding can increase
capacity (Ahlswede, Cai, Li,
Yeung, 2000)
– Doubled capacity for butterfly
network
• Active area of research
[Figure: the butterfly network, with sources A and B]
247
Lecture 18
• Revision Lecture
248
Summary (1)
• Entropy: $H(x) = \sum_{x\in X} p(x)\big(-\log_2 p(x)\big) = -E\log_2 p_X(x)$
– Bounds: $0 \le H(x) \le \log|X|$
– Conditioning reduces entropy: $H(y|x) \le H(y)$
– Chain Rule: $H(x_{1:n}) = \sum_{i=1}^n H(x_i|x_{1:i-1}) \le \sum_{i=1}^n H(x_i)$; also $H(x_{1:n}|y_{1:n}) \le \sum_{i=1}^n H(x_i|y_i)$
• Relative Entropy: $D(p\|q) = E_p\log\big(p(x)/q(x)\big) \ge 0$
249
Summary (2)
• Mutual Information:
$I(y;x) = H(y) - H(y|x) = H(x) + H(y) - H(x,y) = D\big(p_{x,y}\|p_xp_y\big)$
[Venn diagram: H(x,y) = H(x|y) + I(x;y) + H(y|x)]
– Positive and symmetrical: $I(x;y) = I(y;x) \ge 0$
– x, y indep ⇔ $H(x,y) = H(x) + H(y) \Leftrightarrow I(x;y) = 0$
– Chain Rule: $I(x_{1:n};y) = \sum_{i=1}^n I(x_i;y\mid x_{1:i-1})$
– $x_i$ independent ⇒ $I(x_{1:n};y_{1:n}) \ge \sum_{i=1}^n I(x_i;y_i)$
– $p(y_i\mid x_{1:n},y_{1:i-1}) = p(y_i\mid x_i) \Rightarrow I(x_{1:n};y_{1:n}) \le \sum_{i=1}^n I(x_i;y_i)$   (n-use DMC capacity)
250
Summary (3)
• Convexity: $f''(x) \ge 0 \Rightarrow f(x)$ convex $\Rightarrow E f(x) \ge f(Ex)$
– H(p) concave in p
– I(x;y) concave in $p_x$ for fixed $p_{y|x}$
– I(x;y) convex in $p_{y|x}$ for fixed $p_x$
• Markov: $x \to y \to z \Leftrightarrow p(z|x,y) = p(z|y) \Leftrightarrow I(x;z|y) = 0$
⇒ $I(x;y) \ge I(x;z)$ and $I(x;y) \ge I(x;y|z)$
• Fano: $x \to y \to \hat x \Rightarrow p(\hat x \ne x) \ge \dfrac{H(x|y) - 1}{\log(|X| - 1)}$
• Entropy Rate:
– Stationary process: $H(\mathcal X) = \lim_{n\to\infty} n^{-1}H(x_{1:n}) = \lim_{n\to\infty} H(x_n\mid x_{1:n-1})$
– Markov process: $H(\mathcal X) = \lim_{n\to\infty} H(x_n\mid x_{n-1})$ if stationary
251
Summary (4)
• Kraft: uniquely decodable $\Rightarrow \sum_{i=1}^{|X|} D^{-l_i} \le 1 \Rightarrow \exists$ an instantaneous code
• Average Length: uniquely decodable $\Rightarrow L_C = E\,l(x) \ge H_D(x)$
• Shannon-Fano: top-down 50% splits. $L_{SF} \le H_D(x) + 1$
• Huffman: bottom-up design. Optimal. $L_H \le H_D(x) + 1$
– Designing with the wrong probabilities q ⇒ penalty of D(p||q)
– Long blocks disperse the 1-bit overhead
• Lempel-Ziv Coding:
– Does not depend on the source distribution
– Efficient algorithm, widely used
– Approaches the entropy rate for stationary ergodic sources
252
Summary (5)
• Typical Set
– Individual prob: $x \in T_\epsilon^{(n)} \Rightarrow |\log p(x) + nH(x)| \le n\epsilon$
– Total prob: $p(x\in T_\epsilon^{(n)}) > 1-\epsilon$ for $n > N_\epsilon$
– Size: $(1-\epsilon)\,2^{n(H(x)-\epsilon)} \le |T_\epsilon^{(n)}| \le 2^{n(H(x)+\epsilon)}$ (lower bound for $n > N_\epsilon$)
– No other high-probability set can be much smaller
• Asymptotic Equipartition Principle
– Almost all event sequences are equally surprising
253
Summary (6)
• DMC Channel Capacity: $C = \max_{p_x} I(x;y)$
• Coding Theorem
– Can achieve capacity: random codewords, joint typical decoding
– Cannot beat capacity: Fano inequality
• Feedback doesn't increase the capacity of a DMC but can simplify coding/decoding
• Joint source-channel coding doesn't increase the capacity of a DMC
254
Summary (7)
• Differential Entropy: $h(x) = -E\log f_x(x)$
– Not necessarily positive
– $h(x+a) = h(x)$, $h(ax) = h(x) + \log|a|$, $h(x|y) \le h(x)$
– $I(x;y) = h(x) + h(y) - h(x,y) \ge 0$, $D(f\|g) = E\log(f/g) \ge 0$
• Bounds:
– Finite range: uniform distribution has the max: $h(x) = \log(b-a)$
– Fixed covariance: Gaussian has the max: $h(x) = \tfrac12\log\big((2\pi e)^n|K|\big)$
• Gaussian Channel
– Discrete time: $C = \tfrac12\log(1+PN^{-1})$
– Bandlimited: $C = W\log(1+PN_0^{-1}W^{-1})$
– For constant C: $E_b/N_0 = \dfrac{P}{CN_0} = \dfrac WC\big(2^{C/W}-1\big) \ge \ln 2 = -1.6$ dB
– Feedback: adds at most ½ bit for coloured noise
255
Summary (8)
• Parallel Gaussian Channels: total power constraint $\sum P_i = nP$
– White noise: waterfilling: $P_i = \max(v - N_i,\ 0)$
– Correlated noise: waterfill on the noise eigenvectors
• Rate Distortion: $R(D) = \min_{p(\hat x|x)\,:\,E\,d(x,\hat x)\le D} I(x;\hat x)$
– Bernoulli source with Hamming d: $R(D) = \max\big(H(p_x) - H(D),\ 0\big)$
– Gaussian source with mean-square d: $R(D) = \max\big(\tfrac12\log(\sigma^2D^{-1}),\ 0\big)$
– Can encode at rate R: random decoder, joint typical encoder
– Can't encode below rate R: independence bound
256
Summary (9)
• Gaussian multiple access channel ($C(x) = \tfrac12\log(1+x)$):
$R_1 < C\big(\tfrac{P_1}{N}\big)$,  $R_2 < C\big(\tfrac{P_2}{N}\big)$,  $R_1+R_2 < C\big(\tfrac{P_1+P_2}{N}\big)$
• Distributed source coding – Slepian-Wolf coding:
$R_1 \ge H(X|Y)$,  $R_2 \ge H(Y|X)$,  $R_1+R_2 \ge H(X,Y)$
• Scalar Gaussian broadcast channel:
$R_1 < C\big(\tfrac{\alpha P}{N_1}\big)$,  $R_2 < C\big(\tfrac{(1-\alpha)P}{\alpha P+N_2}\big)$,  $0 \le \alpha \le 1$
• Gaussian Relay channel:
$C = \max_{0\le\alpha\le1}\min\Big\{C\Big(\dfrac{P+P_1+2\sqrt{(1-\alpha)PP_1}}{N_1+N_2}\Big),\ C\Big(\dfrac{\alpha P}{N_1}\Big)\Big\}$
257
Summary (10)
• Interference channel
– Very strong interference = no interference
• Gaussian two-way channel
– Decompose into two independent channels
• General communication network
– Max-flow min-cut theorem
– Not achievable in general
– But achievable for multiple access channel
and Gaussian relay channel