Chapter 10: Rate distortion theory
University of Illinois at Chicago ECE 534, Fall 2009, Natasha Devroye
Chapter 10 outline
• Quantization
• Definitions
• Calculation of the rate-distortion function
• Converse of rate distortion theorem
• Strongly typical sequences
• Achievability of rate distortion theorem
• Characterization of the rate-distortion function
• Computation of channel capacity and the rate-distortion function
Rate-distortion
[Block diagram: Source → Encoder → Decoder → Destination]
E[# bits] used by the lossy representation < minimum E[# bits] for error-free representation
There will be errors and distortion
in reconstructing the source!
Rate-distortion theory describes the trade-off between
lossy compression rate and the resulting distortion.
Quantization
• Consider representing a continuous valued random source - need infinite
precision to represent it exactly!
• Q: what is the best possible representation of X for a given data rate?
[Block diagram: Source → Encoder → Decoder → Destination]
random variable. Since a continuous random source requires infinite precision to represent exactly, we cannot reproduce it exactly using a finite-rate
code. The question is then to find the best possible representation for any
given data rate.
We first consider the problem of representing a single sample from the
source. Let the random variable to be represented be X and let the representation of X be denoted as X̂(X). If we are given R bits to represent X, the function X̂ can take on 2^R values. The problem is to find the optimum
set of values for X̂ (called the reproduction points or code points) and
the regions that are associated with each value X̂.
For example, let X ∼ N(0, σ²), and assume a squared-error distortion measure. In this case we wish to find the function X̂(X) such that X̂ takes on at most 2^R values and minimizes E(X − X̂(X))². If we are given one
bit to represent X, it is clear that the bit should distinguish whether or
not X > 0. To minimize squared error, each reproduced symbol should
be the conditional mean of its region. This is illustrated in Figure 10.1.
Thus,

\hat{X}(x) = \begin{cases} +\sqrt{\dfrac{2}{\pi}}\,\sigma, & \text{if } x \ge 0,\\[4pt] -\sqrt{\dfrac{2}{\pi}}\,\sigma, & \text{if } x < 0. \end{cases}   (10.1)
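As a quick check of (10.1) (an added derivation, not in the original text): the conditional mean of the positive half of a N(0, σ²) density is

E[X \mid X \ge 0] \;=\; \int_0^\infty x\,\frac{2}{\sqrt{2\pi}\,\sigma}\, e^{-x^2/(2\sigma^2)}\,dx \;=\; \sqrt{\tfrac{2}{\pi}}\,\sigma,

which is exactly the reproduction point for the region x ≥ 0.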
Quantization example: 1 bit Gaussian
If we are given 2 bits to represent the sample, the situation is not as
simple. Clearly, we want to divide the real line into four regions and use
[Figure: Gaussian density f(x) versus x, with the two reconstruction points at −0.79 and +0.79.]
FIGURE 10.1. One-bit quantization of Gaussian random variable.
Quantization example: 1 bit Gaussian
• Lloyd algorithm - an iterative way of finding a “good” quantizer (a small numerical sketch follows this list)
• Find set of reconstruction points (centroids if MSE)
• Find optimal reconstruction regions
• Benefits to quantizing many RVs at once?
• Yes! n iid RVs represented using nR bits
• Surprisingly, better to represent whole sequence than each RV
independently, even though chosen iid!!!
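The Lloyd iteration above is easy to try numerically. Below is a minimal illustrative sketch (not from the slides; it assumes NumPy, and the function name, sample size, and tolerance are arbitrary choices) that alternates the two steps on N(0, 1) samples; with two levels it recovers reconstruction points near ±0.80 and MSE near (π − 2)/π ≈ 0.363, matching the 1-bit Gaussian example.

```python
# Minimal sketch of the Lloyd algorithm for scalar MSE quantization on N(0,1) samples.
import numpy as np

def lloyd(samples, n_levels=2, tol=1e-6, max_iter=200):
    # Initialize reconstruction points evenly over the sample range.
    points = np.linspace(samples.min(), samples.max(), n_levels)
    for _ in range(max_iter):
        # Step 1: optimal regions for fixed points (nearest-neighbor cells).
        idx = np.argmin(np.abs(samples[:, None] - points[None, :]), axis=1)
        # Step 2: optimal points for fixed regions (conditional means under MSE).
        new_points = np.array([samples[idx == k].mean() if np.any(idx == k) else points[k]
                               for k in range(n_levels)])
        if np.max(np.abs(new_points - points)) < tol:
            points = new_points
            break
        points = new_points
    # Recompute the cells for the final points and estimate the distortion.
    idx = np.argmin(np.abs(samples[:, None] - points[None, :]), axis=1)
    mse = np.mean((samples - points[idx]) ** 2)
    return np.sort(points), mse

rng = np.random.default_rng(0)
x = rng.standard_normal(200_000)
print(lloyd(x, n_levels=2))   # points near ±0.80, MSE near (pi - 2)/pi ≈ 0.363
```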
Definitions
[Block diagram: Source → Encoder → Decoder → Destination]
Main Theorem
[Block diagram: Source → Encoder → Decoder → Destination]
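For reference, the standard statement of the theorem (consistent with the definition of R(D) used later in (10.53)) is:

R(D) \;=\; \min_{p(\hat{x}\mid x):\; \mathbb{E}\,d(X,\hat{X}) \le D} I(X; \hat{X}),

and the rate distortion theorem asserts that R(D) is the minimum rate at which the source can be described with expected distortion at most D: every rate R > R(D) is achievable with long block lengths, and no rate R < R(D) is.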
A few examples
Calculating R(D) - binary source
r(1 − D) + (1 − r)D = p,   (10.20)

or

r = \frac{p − D}{1 − 2D}.   (10.21)

If D ≤ p ≤ 1/2, then Pr(X̂ = 1) ≥ 0 and Pr(X̂ = 0) ≥ 0. We then have

I(X; X̂) = H(X) − H(X | X̂) = H(p) − H(D),   (10.22)

and the expected distortion is Pr(X ≠ X̂) = D.
If D ≥ p, we can achieve R(D) = 0 by letting X̂ = 0 with probability
1. In this case, I (X; X̂) = 0 and D = p. Similarly, if D ≥ 1 − p, we can
achieve R(D) = 0 by setting X̂ = 1 with probability 1. Hence, the rate
distortion function for a binary source is
R(D) = \begin{cases} H(p) − H(D), & 0 \le D \le \min\{p, 1 − p\},\\ 0, & D > \min\{p, 1 − p\}. \end{cases}   (10.23)
This function is illustrated in Figure 10.4.
[Figure: R(D) = H(p) − H(D) plotted versus D; for the Bernoulli(1/2) source it falls from 1 bit at D = 0 to 0 at D = 1/2.]
FIGURE 10.4. Rate distortion function for a Bernoulli(1/2) source.
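As a small illustration (added code, not from the slides; assumes NumPy), the curve in Figure 10.4 can be evaluated directly from (10.23):

```python
# Evaluate the binary rate-distortion function R(D) = H(p) - H(D) per (10.23).
import numpy as np

def binary_entropy(q):
    q = np.clip(q, 1e-12, 1 - 1e-12)
    return float(-q * np.log2(q) - (1 - q) * np.log2(1 - q))

def rate_distortion_binary(p, D):
    if D >= min(p, 1 - p):
        return 0.0
    return binary_entropy(p) - binary_entropy(D)

# The Bernoulli(1/2) source of Figure 10.4: R(0.1) is about 0.531 bits.
print(rate_distortion_binary(0.5, 0.1))
```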
Calculating R(D) - Gaussian source
Calculating R(D) - Gaussian source
FIGURE 10.6. Rate distortion function for a Gaussian source.
Each bit of description reduces the expected distortion by a factor of 4.
With a 1-bit description, the best expected square error is σ 2 /4. We can
compare this with the result of simple 1-bit quantization of a N(0, σ 2 )
random variable as described in Section 10.1. In this case, using the two
regions corresponding to the positive and negative real lines and reproduction points as the centroids of the respective regions, the expected distortion is ((π − 2)/π)σ² = 0.3633σ² (see Problem 10.1). As we prove later, the
rate distortion limit R(D) is achieved by considering long block lengths.
This example shows that we can achieve a lower distortion by considering several distortion problems in succession (long block lengths) than can
be achieved by considering each problem separately. This is somewhat
surprising because we are quantizing independent random variables.
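For reference, the curve in Figure 10.6 is the standard Gaussian rate-distortion function under squared error:

R(D) = \begin{cases} \dfrac{1}{2}\log\dfrac{\sigma^2}{D}, & 0 \le D \le \sigma^2,\\[4pt] 0, & D > \sigma^2, \end{cases} \qquad\text{equivalently}\qquad D(R) = \sigma^2\, 2^{-2R},

which is exactly why each extra bit of description divides the achievable expected distortion by 4, and a 1-bit description achieves σ²/4.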
Calculating R(D) - Gaussian source
10.3.3 Simultaneous Description of Independent Gaussian
Random Variables
Consider the case of representing m independent (but not identically distributed) normal random sources X₁, . . . , Xₘ, where Xᵢ ∼ N(0, σᵢ²),
with squared-error distortion. Assume that we are given R bits with which
to represent this random vector. The question naturally arises as to how
we should allot these bits to the various components to minimize the
total distortion. Extending the definition of the information rate distortion
Calculating R(D) - Gaussian source
[Figure: component variances σ₁², …, σ₆² with the water level λ; components with σᵢ² > λ get distortion Dᵢ = λ, the others Dᵢ = σᵢ².]
FIGURE 10.7. Reverse water-filling for independent Gaussian random variables.
This gives rise to a kind of reverse water-filling, as illustrated in
Figure 10.7. We choose a constant λ and only describe those random variables with variances greater than λ. No bits are used to describe random
variables with variance less than λ. Summarizing, if
X \sim \mathcal{N}\!\left(0, \begin{pmatrix} \sigma_1^2 & \cdots & 0\\ \vdots & \ddots & \vdots\\ 0 & \cdots & \sigma_m^2 \end{pmatrix}\right), \quad\text{then}\quad \hat{X} \sim \mathcal{N}\!\left(0, \begin{pmatrix} \hat{\sigma}_1^2 & \cdots & 0\\ \vdots & \ddots & \vdots\\ 0 & \cdots & \hat{\sigma}_m^2 \end{pmatrix}\right),
Calculating R(D) - Gaussian source
and E(Xᵢ − X̂ᵢ)² = Dᵢ, where Dᵢ = min{λ, σᵢ²}. More generally, the rate
distortion function for a multivariate normal vector can be obtained by
reverse water-filling on the eigenvalues. We can also apply the same arguments to a Gaussian stochastic process. By the spectral representation
theorem, a Gaussian stochastic process can be represented as an integral of independent Gaussian processes in the various frequency bands.
Reverse water-filling on the spectrum yields the rate distortion function.
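Reverse water-filling is also easy to compute numerically. The sketch below (illustrative code, not from the slides; assumes NumPy, and the variances and distortion budget are made-up values) finds the water level λ by bisection and reads off the per-component distortions Dᵢ = min(λ, σᵢ²) and rates Rᵢ = ½ log(σᵢ²/Dᵢ):

```python
# Hedged sketch of reverse water-filling for independent Gaussian sources.
import numpy as np

def reverse_waterfill(variances, total_distortion, tol=1e-10):
    variances = np.asarray(variances, dtype=float)
    lo, hi = 0.0, variances.max()
    # The total distortion sum(min(lam, sigma_i^2)) increases with lam, so bisect on lam.
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        if np.minimum(lam, variances).sum() > total_distortion:
            hi = lam          # water level too high: too much distortion
        else:
            lo = lam
    lam = 0.5 * (lo + hi)
    D = np.minimum(lam, variances)
    R = 0.5 * np.log2(variances / D)   # bits per component; 0 when D_i = sigma_i^2
    return lam, D, R

lam, D, R = reverse_waterfill([4.0, 2.0, 1.0, 0.25], total_distortion=1.5)
print(lam, D, R, R.sum())
```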
• Reverse water-filling on independent Gaussian RVs
• Reverse water-filling on general multivariate Gaussian RVs
• Reverse water-filling on a Gaussian stochastic process

10.4 CONVERSE TO THE RATE DISTORTION THEOREM
In this section we prove the converse to Theorem 10.2.1 by showing that we cannot achieve a distortion of less than D if we describe X at a rate less than R(D), where

R(D) = \min_{p(\hat{x}\mid x):\; \sum_{(x,\hat{x})} p(x)\,p(\hat{x}\mid x)\,d(x,\hat{x}) \le D} I(X; \hat{X}).   (10.53)
Main Theorem
[Block diagram: Source → Encoder → Decoder → Destination]

Rate-distortion theorem: CONVERSE
A similar argument can be applied when the encoded source is passed
through a noisy channel and hence we have the equivalent of the source
channel separation theorem with distortion:
Source-channel separation with distortion
Theorem 10.4.1 (Source–channel separation theorem with distortion)
Let V₁, V₂, . . . , Vₙ be a finite-alphabet i.i.d. source which is encoded as a sequence of n input symbols Xⁿ of a discrete memoryless channel with capacity C. The output of the channel Yⁿ is mapped onto the reconstruction alphabet V̂ⁿ = g(Yⁿ). Let D = E d(Vⁿ, V̂ⁿ) = (1/n) Σᵢ₌₁ⁿ E d(Vᵢ, V̂ᵢ) be the
average distortion achieved by this combined source and channel coding
scheme. Then distortion D is achievable if and only if C > R(D).
[Block diagram: Vⁿ → Xⁿ(Vⁿ) → channel of capacity C → Yⁿ → V̂ⁿ]
Proof: See Problem 10.17. □
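A quick worked consequence (an added example, not from the slides), combining this theorem with the binary rate distortion function above and assuming one channel use per source symbol: for a Bernoulli(1/2) source with Hamming distortion sent over a channel of capacity C < 1, distortion D is achievable iff

1 − H(D) < C, \quad\text{i.e.,}\quad D > H^{-1}(1 − C),

so, for example, C = 0.5 bit per use forces a Hamming distortion of at least about 0.11.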
10.5 ACHIEVABILITY OF THE RATE DISTORTION FUNCTION
We now prove the achievability of the rate distortion function. We begin with a modified version of the joint AEP in which we add the condition that the pair of sequences be typical with respect to the distortion measure.

Achievability of R(D)
• We will skip the Section 10.5 proof and go directly to an achievability proof based on strong typicality
• Strong typicality holds only for discrete alphabets and sequences.
• Why do we need it?
• To find an upper bound on the probability that a given source sequence is
NOT well represented by a randomly chosen codeword. Analogous to
probability of error calculations in channel coding / capacity theorems.
Two types of typicality
• Strong typicality:
• Weak typicality:
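As a reminder (standard definitions, added for completeness; one common convention):

Weak (entropy) typicality: \; x^n \in A_\epsilon^{(n)} \iff \Bigl| -\tfrac{1}{n}\log p(x^n) - H(X) \Bigr| < \epsilon.

Strong typicality: \; x^n \in A_\epsilon^{*(n)} \iff \Bigl| \tfrac{1}{n} N(a \mid x^n) - p(a) \Bigr| < \tfrac{\epsilon}{|\mathcal{X}|} \text{ for every } a \in \mathcal{X}, \text{ and } N(a \mid x^n) = 0 \text{ whenever } p(a) = 0,

where N(a | xⁿ) is the number of occurrences of the symbol a in xⁿ.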
Examples of typicality
Strong joint typicality
Examples of joint typicality
Some useful Lemmas
• Strong typicality is a very powerful technique more thoroughly explored in
Chapters 11 and 12. Related to the Method of Types, and useful in proving
stronger results than can be obtained using weak typicality - universal source
coding, rate distortion theory, large deviation theory.
BONUS homework during midterm 2 week (due 11/09) - 10.16
Proof of achievability
[Figure: source sequences divided into (i) typical sequences with a jointly typical codeword, (ii) typical sequences without a jointly typical codeword, and (iii) nontypical sequences.]
FIGURE 10.8. Classes of source sequences in rate distortion theorem.
where the expectation is over the random choice of codebook. For a fixed
codebook C, we divide the sequences x n ∈ X n into three categories, as
shown in Figure 10.8.
Some interesting parallels
• Channel coding
• Rate distortion
Some more interesting parallels
• Channel coding for Gaussian channel
• Rate-distortion for Gaussian channel
Sphere packing (channel coding) vs. sphere covering (rate distortion) — see the formulas below
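To make the parallel concrete, here are the standard Gaussian formulas (added for reference):

C = \frac{1}{2}\log\Bigl(1 + \frac{P}{N}\Bigr) \quad\text{— sphere packing: about } 2^{nC} \text{ noise spheres fit inside the received-signal sphere;}

R(D) = \frac{1}{2}\log\frac{\sigma^2}{D} \quad\text{— sphere covering: about } 2^{nR(D)} \text{ distortion-}D\text{ balls are needed to cover the source sphere.}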
Characterization of the Rate-Distortion Function
• How do we actually go about finding R(D)?
• What’s the tough part?
• Pose as a convex, constrained optimization problem
• Can check if a given q(x̂) is a solution to the minimization, but still cannot always solve for it!

This characterization will enable us to check if a given q(x̂) is a solution to the minimization problem. However, it is not easy to solve for the optimum output distribution from these equations. In the next section we provide an iterative algorithm for computing the rate distortion function. This algorithm is a special case of a general algorithm for finding the minimum relative entropy distance between two convex sets of probability densities.

10.8 COMPUTATION OF CHANNEL CAPACITY AND THE RATE DISTORTION FUNCTION

Computation of the rate-distortion function
• How can one find the minimum distance between two convex sets?

Consider the following problem: Given two convex sets A and B in Rⁿ as shown in Figure 10.9, we would like to find the minimum distance between them:

d_{\min} = \min_{a \in A,\; b \in B} d(a, b),   (10.130)

where d(a, b) is the Euclidean distance between a and b. An intuitively obvious algorithm to do this would be to take any point x ∈ A and find the y ∈ B that is closest to it. Then fix this y and find the closest point in A. Repeating this process, it is clear that the distance decreases at each stage. Does it converge to the minimum distance between the two sets? Csiszár and Tusnády [155] have shown that if the sets are convex and if the distance satisfies certain conditions, this alternating minimization algorithm will indeed converge to the minimum. In particular, if the sets are sets of probability distributions and the distance measure is the relative entropy, the algorithm does converge to the minimum relative entropy between the two sets of distributions.

[Figure: two convex sets A and B with the minimum-distance segment between them.]
FIGURE 10.9. Distance between convex sets.
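For reference (a paraphrase of Lemma 10.8.1, added here): for a fixed conditional q(x̂|x), the output distribution r(x̂) that minimizes Σ_{x,x̂} p(x) q(x̂|x) log [q(x̂|x)/r(x̂)] is the marginal r*(x̂) = Σ_x p(x) q(x̂|x); this is the fact invoked at (10.143) below.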
The proof of the second part of Lemma 10.8.1 is left as an exercise. □
• Connection of minimum distance with R(D)?
• Write R(D) optimization as minimum of relative entropy between two sets!!

We can use this lemma to rewrite the minimization in the definition of the rate distortion function as a double minimization,

R(D) = \min_{r(\hat{x})}\; \min_{q(\hat{x}\mid x):\, \sum_{(x,\hat{x})} p(x) q(\hat{x}\mid x) d(x,\hat{x}) \le D}\; \sum_{x} \sum_{\hat{x}} p(x)\, q(\hat{x}\mid x) \log \frac{q(\hat{x}\mid x)}{r(\hat{x})}.   (10.140)

If A is the set of all joint distributions with marginal p(x) that satisfy the distortion constraints and if B is the set of product distributions p(x)r(x̂) with arbitrary r(x̂), we can write

R(D) = \min_{q \in B}\; \min_{p \in A} D(p \,\|\, q).   (10.141)

We now apply the process of alternating minimization, which is called the Blahut–Arimoto algorithm in this case. We begin with a choice of λ and an initial output distribution r(x̂) and calculate the q(x̂|x) that minimizes the mutual information subject to the distortion constraint. We can use the method of Lagrange multipliers for this minimization to obtain

q(\hat{x}\mid x) = \frac{r(\hat{x})\, e^{-\lambda d(x,\hat{x})}}{\sum_{\hat{x}} r(\hat{x})\, e^{-\lambda d(x,\hat{x})}}.   (10.142)

For this conditional distribution q(x̂|x), we calculate the output distribution r(x̂) that minimizes the mutual information, which by Lemma 10.8.1 is

r(\hat{x}) = \sum_{x} p(x)\, q(\hat{x}\mid x).   (10.143)

We use this output distribution as the starting point of the next iteration. Each step in the iteration, minimizing over q(·|·) and then minimizing over r(·), reduces the right-hand side of (10.140). Thus, there is a limit, and the limit has been shown to be R(D) by Csiszár [139], where the value of D and R(D) depends on λ. Thus, choosing λ appropriately sweeps out the R(D) curve.

A similar procedure can be applied to the calculation of channel capacity. Again we rewrite the definition of channel capacity,

C = \max_{r(x)} I(X; Y) = \max_{r(x)} \sum_{x} \sum_{y} r(x)\, p(y\mid x) \log \frac{p(y\mid x)}{\sum_{x'} r(x')\, p(y\mid x')}.   (10.144)

Computation of the rate-distortion function
• Apply the Blahut–Arimoto algorithm (a sketch follows below)
• Analogous results for computing capacity! (see pg. 335)
x̂|x).
(10.143)
r(
−λd(x,
x̂)
is
x̂ r(x̂)e(see
Analogous
computing
capacity!
pg. 335)
of D and R(D) •depends
on λ.results
Thus,for
choosing
λ appropriately
sweeps
out
"
x
the R(D) curve.
p(x)q(
(10.143)
r(x̂) = q(x̂|x),
For this conditional distribution
wex̂|x).
calculate the output distribuUniversity of Illinois at Chicago ECE 534, Fall 2009, Natasha
Devroye
We
use
this
output
distribution
as
the
starting
point of the next iteration.
x
A similar procedure can
applied
to the calculation
channel capaction be
r(x̂)
that minimizes
the mutual of
information,
which by Lemma 10.8.1
• Write R(D) optimizationEach
as minimum
of
relative
stepcapacity,
in the iteration, minimizing over q(·|·) and then minimizing over
is definition of channel
ity. Again we rewrite the
Webetween
use this output
distribution
as the starting point of the next iteration.
entropy
two sets!!
r(·), reduces "
the right-hand side of (10.140). Thus, there is a limit, and
© Copyright 2026 Paperzz