A First Course in Information Theory

RAYMOND W. YEUNG
The Chinese University of Hong Kong

Kluwer Academic Publishers
Boston/Dordrecht/London

Contents

Preface
Acknowledgments
Foreword

1. THE SCIENCE OF INFORMATION

2. INFORMATION MEASURES
   2.1 Independence and Markov Chains
   2.2 Shannon's Information Measures
   2.3 Continuity of Shannon's Information Measures
   2.4 Chain Rules
   2.5 Informational Divergence
   2.6 The Basic Inequalities
   2.7 Some Useful Information Inequalities
   2.8 Fano's Inequality
   2.9 Entropy Rate of Stationary Source
   Problems
   Historical Notes

3. ZERO-ERROR DATA COMPRESSION
   3.1 The Entropy Bound
   3.2 Prefix Codes
   3.3 Redundancy of Prefix Codes
   Problems
   Historical Notes

4. WEAK TYPICALITY
   4.1 The Weak AEP
   4.2 The Source Coding Theorem
   4.3 Efficient Source Coding
   4.4 The Shannon-McMillan-Breiman Theorem
   Problems
   Historical Notes

5. STRONG TYPICALITY
   5.1 Strong AEP
   5.2 Strong Typicality Versus Weak Typicality
   5.3 Joint Typicality
   5.4 An Interpretation of the Basic Inequalities
   Problems
   Historical Notes

6. THE I-MEASURE
   6.1 Preliminaries
   6.2 The I-Measure for Two Random Variables
   6.3 Construction of the I-Measure μ*
   6.4 μ* Can be Negative
   6.5 Information Diagrams
   6.6 Examples of Applications
   Problems
   Historical Notes

7. MARKOV STRUCTURES
   7.1 Conditional Mutual Independence
   7.2 Full Conditional Mutual Independence
   7.3 Markov Random Field
   7.4 Markov Chain
   Problems
   Historical Notes

8. CHANNEL CAPACITY
   8.1 Discrete Memoryless Channels
   8.2 The Channel Coding Theorem
   8.3 The Converse
   8.4 Achievability of the Channel Capacity
   8.5 A Discussion
   8.6 Feedback Capacity
   8.7 Separation of Source and Channel Coding
   Problems
   Historical Notes

9. RATE DISTORTION THEORY
   9.1 Single-Letter Distortion Measures
   9.2 The Rate Distortion Function
   9.3 Rate Distortion Theorem
   9.4 The Converse
   9.5 Achievability of R_I(D)
   Problems
   Historical Notes

10. THE BLAHUT-ARIMOTO ALGORITHMS
   10.1 Alternating Optimization
   10.2 The Algorithms
   10.3 Convergence
   Problems
   Historical Notes

11. SINGLE-SOURCE NETWORK CODING
   11.1 A Point-to-Point Network
   11.2 What is Network Coding?
   11.3 A Network Code
   11.4 The Max-Flow Bound
   11.5 Achievability of the Max-Flow Bound
   Problems
   Historical Notes

12. INFORMATION INEQUALITIES
   12.1 The Region Γ*_n
   12.2 Information Expressions in Canonical Form
   12.3 A Geometrical Framework
   12.4 Equivalence of Constrained Inequalities
   12.5 The Implication Problem of Conditional Independence
   Problems
   Historical Notes

13. SHANNON-TYPE INEQUALITIES
   13.1 The Elemental Inequalities
   13.2 A Linear Programming Approach
   13.3 A Duality
   13.4 Machine Proving – ITIP
   13.5 Tackling the Implication Problem
   13.6 Minimality of the Elemental Inequalities
   Problems
   Historical Notes

14. BEYOND SHANNON-TYPE INEQUALITIES
   14.1 Characterizations of Γ*_2, Γ*_3, and Γ*_n
   14.2 A Non-Shannon-Type Unconstrained Inequality
   14.3 A Non-Shannon-Type Constrained Inequality
   14.4 Applications
   Problems
   Historical Notes
15. MULTI-SOURCE NETWORK CODING
   15.1 Two Characteristics
   15.2 Examples of Application
   15.3 A Network Code for Acyclic Networks
   15.4 An Inner Bound
   15.5 An Outer Bound
   15.6 The LP Bound and Its Tightness
   15.7 Achievability of R_in
   Problems
   Historical Notes

16. ENTROPY AND GROUPS
   16.1 Group Preliminaries
   16.2 Group-Characterizable Entropy Functions
   16.3 A Group Characterization of Γ*_n
   16.4 Information Inequalities and Group Inequalities
   Problems
   Historical Notes

Bibliography
Index

Preface

Cover and Thomas wrote a book on information theory [49] ten years ago which covers most of the major topics with considerable depth. Their book has since become the standard textbook in the field, and it was no doubt a remarkable success. Instead of writing another comprehensive textbook on the subject, which has become more difficult as new results keep emerging, my goal is to write a book on the fundamentals of the subject in a unified and coherent manner.

During the last ten years, significant progress has been made in understanding the entropy function and information inequalities of discrete random variables. The results along this direction are not only of core interest in information theory, but they also have applications in network coding theory and group theory, and possibly in physics. This book is an up-to-date treatment of information theory for discrete random variables, which forms the foundation of the theory at large. There are eight chapters on classical topics (Chapters 1, 2, 3, 4, 5, 8, 9, and 10), five chapters on fundamental tools (Chapters 6, 7, 12, 13, and 14), and three chapters on selected topics (Chapters 11, 15, and 16). The chapters are arranged according to the logical order instead of the chronological order of the results in the literature.

What is in this book

Out of the sixteen chapters in this book, the first thirteen chapters are basic topics, while the last three chapters are advanced topics for the more enthusiastic reader. A brief rundown of the chapters will give a better idea of what is in this book.

Chapter 1 is a very high level introduction to the nature of information theory and the main results in Shannon's original paper in 1948 which founded the field. There are also pointers to Shannon's biographies and his works.

Chapter 2 introduces Shannon's information measures and their basic properties. Useful identities and inequalities in information theory are derived and explained. Extra care is taken in handling joint distributions with zero probability masses. The chapter ends with a section on the entropy rate of a stationary information source.

Chapter 3 is a discussion of zero-error data compression by uniquely decodable codes, with prefix codes as a special case. A proof of the entropy bound for prefix codes which involves neither the Kraft inequality nor the fundamental inequality is given. This proof facilitates the discussion of the redundancy of prefix codes.

Chapter 4 is a thorough treatment of weak typicality. The weak asymptotic equipartition property and the source coding theorem are discussed. An explanation of the fact that a good data compression scheme produces almost i.i.d. bits is given. There is also a brief discussion of the Shannon-McMillan-Breiman theorem.
Chapter 5 introduces a new definition of strong typicality which does not involve the cardinalities of the alphabet sets. The treatment of strong typicality here is more detailed than that in Berger [21] but less abstract than that in Csiszár and Körner [52]. A new exponential convergence result is proved in Theorem 5.3.

Chapter 6 is an introduction to the theory of the I-Measure which establishes a one-to-one correspondence between Shannon's information measures and set theory. A number of examples are given to show how the use of information diagrams can simplify the proofs of many results in information theory. Most of these examples are previously unpublished. In particular, Example 6.15 is a generalization of Shannon's perfect secrecy theorem.

Chapter 7 explores the structure of the I-Measure for Markov structures. Set-theoretic characterizations of full conditional independence and Markov random field are discussed. The treatment of Markov random field here is perhaps too specialized for the average reader, but the structure of the I-Measure and the simplicity of the information diagram for a Markov chain are best explained as a special case of a Markov random field.

Chapter 8 consists of a new treatment of the channel coding theorem. Specifically, a graphical model approach is employed to explain the conditional independence of random variables. Great care is taken in discussing feedback.

Chapter 9 is an introduction to rate distortion theory. The results in this chapter are stronger than those in a standard treatment of the subject, although essentially the same techniques are used in the derivations.

In Chapter 10, the Blahut-Arimoto algorithms for computing channel capacity and the rate distortion function are discussed, and a simplified proof for convergence is given. Great care is taken in handling distributions with zero probability masses.

Chapter 11 is an introduction to network coding theory. The surprising fact that coding at the intermediate nodes can improve the throughput when an information source is multicast in a point-to-point network is explained. The max-flow bound for network coding with a single information source is explained in detail. Multi-source network coding will be discussed in Chapter 15 after the necessary tools are developed in the next three chapters.

Information inequalities are sometimes called the laws of information theory because they govern the impossibilities in information theory. In Chapter 12, the geometrical meaning of information inequalities and the relation between information inequalities and conditional independence are explained in depth. The framework for information inequalities discussed here is the basis of the next two chapters.

Chapter 13 explains how the problem of proving information inequalities can be formulated as a linear programming problem. This leads to a complete characterization of all information inequalities which can be proved by conventional techniques. These are called Shannon-type inequalities, which can now be proved by the software ITIP which comes with this book. It is also shown how Shannon-type inequalities can be used to tackle the implication problem of conditional independence in probability theory.

All information inequalities we used to know were Shannon-type inequalities. Recently, a few non-Shannon-type inequalities have been discovered. This means that there exist laws in information theory beyond those laid down by Shannon.
These inequalities and their applications are explained in depth in Chapter 14.

Network coding theory is further developed in Chapter 15. The situation when more than one information source are multicast in a point-to-point network is discussed. The surprising fact that a multi-source problem is not equivalent to a few single-source problems even when the information sources are mutually independent is clearly explained. Implicit and explicit bounds on the achievable coding rate region are discussed. These characterizations of the achievable coding rate region involve almost all the tools that have been developed earlier in the book, in particular, the framework for information inequalities.

Chapter 16 explains an intriguing relation between information theory and group theory. Specifically, for every information inequality satisfied by any joint distribution, there is a corresponding group inequality satisfied by any finite group and its subgroups, and vice versa. Inequalities of the latter type govern the orders of any finite group and their subgroups. Group-theoretic proofs of Shannon-type information inequalities are given. At the end of this chapter, a group inequality is obtained from a non-Shannon-type inequality discussed in Chapter 14. The meaning and the implication of this inequality are yet to be understood.

[Chart: the suggested reading order of the chapters, showing which chapters depend on which.]

How to use this book

You are recommended to read the chapters according to the above chart. However, you will not have too much difficulty jumping around in the book because there should be sufficient references to the previous relevant sections.

As a relatively slow thinker, I feel uncomfortable whenever I do not reason in the most explicit way. This probably has helped in writing this book, in which all the derivations are from first principles. In the book, I try to explain all the subtle mathematical details without sacrificing the big picture. Interpretations of the results are usually given before the proofs are presented. The book also contains a large number of examples. Unlike the examples in most books, which are supplementary, the examples in this book are essential.

This book can be used as a reference book or a textbook. For a two-semester course on information theory, this would be a suitable textbook for the first semester. This would also be a suitable textbook for a one-semester course if only information theory for discrete random variables is covered. If the instructor also wants to include topics on continuous random variables, this book can be used as a textbook or a reference book in conjunction with another suitable textbook. The instructor will find this book a good source for homework problems because many problems here do not appear in other textbooks. The instructor only needs to explain to the students the general idea, and they should be able to read the details by themselves.

Just like any other lengthy document, this book for sure contains errors and omissions. If you find any error or if you have any comment on the book, please email them to me at [email protected]. An errata will be maintained at the book website http://www.ie.cuhk.edu.hk/IT book/.

To my parents and my family

Acknowledgments

Writing a book is always a major undertaking.
This is especially so for a book on information theory, because it is extremely easy to make mistakes if one has not done research on a particular topic. Besides, the time needed for writing a book cannot be underestimated. Thanks to the generous support of a fellowship from the Croucher Foundation, I have had the luxury of taking a one-year leave in 2000-01 to write this book. I also thank my department for the support which made this arrangement possible.

There are many individuals who have directly or indirectly contributed to this book. First, I am indebted to Toby Berger who taught me information theory and writing. Venkat Anantharam, Dick Blahut, Dave Delchamps, Terry Fine, and Chris Heegard all had much influence on me when I was a graduate student at Cornell. I am most thankful to Zhen Zhang for his friendship and inspiration. Without the results obtained through our collaboration, this book would not be complete. I would also like to thank Julia Abrahams, Agnes and Vincent Chan, Tom Cover, Imre Csiszár, Bob Gallager, Bruce Hajek, Te Sun Han, Jim Massey, Alon Orlitsky, Shlomo Shamai, Sergio Verdú, Victor Wei, Frans Willems, and Jack Wolf for their encouragement throughout the years. In particular, Imre Csiszár has closely followed my work and has given invaluable feedback. I would also like to thank all the collaborators of my work for their contribution and all the anonymous reviewers for their useful comments.

During the process of writing this book, I received tremendous help from Ning Cai and Fangwei Fu in proofreading the manuscript. Lihua Song and Terence Chan helped compose the figures and gave useful suggestions. Without their help, it would have been quite impossible to finish the book within the set time frame. I also thank Andries Hekstra, Frank Kschischang, Siu-Wai Ho, Kingo Kobayashi, Amos Lapidoth, Li Ping, Prakash Narayan, Igal Sason, Emre Telatar, Chunxuan Ye, and Ken Zeger for their valuable inputs. The code for ITIP was written by Ying-On Yan, and Jack Lee helped modify the code and adapt it for the PC. Gordon Yeung helped on numerous Unix and LaTeX problems. The two-dimensional Venn diagram representation for four sets which is repeatedly used in the book was taught to me by Yong Nan Yeh.

On the domestic side, I am most grateful to my wife Rebecca for her love and support. My daughter Shannon has grown from an infant to a toddler during this time. She has added much joy to our family. I also thank my mother-in-law Mrs. Tsang and my sister-in-law Ophelia for coming over from time to time to take care of Shannon. Life would have been a lot more hectic without their generous help.

Foreword

Chapter 1

THE SCIENCE OF INFORMATION

In a communication system, we try to convey information from one point to another, very often in a noisy environment. Consider the following scenario. A secretary needs to send facsimiles regularly and she wants to convey as much information as possible on each page. She has a choice of the font size, which means that more characters can be squeezed onto a page if a smaller font size is used. In principle, she can squeeze as many characters as desired on a page by using a small enough font size. However, there are two factors in the system which may cause errors. First, the fax machine has a finite resolution.
Second, the characters transmitted may be received incorrectly due to noise in the telephone line. Therefore, if the font size is too small, the characters may not be recognizable on the facsimile. On the other hand, although some characters on the facsimile may not be recognizable, the recipient can still figure out the words from the context provided that the number of such characters is not excessive. In other words, it is not necessary to choose a font size such that all the characters on the facsimile are recognizable almost surely. Then we are motivated to ask: What is the maximum amount of meaningful information which can be conveyed on one page of facsimile?

This question may not have a definite answer because it is not very well posed. In particular, we do not have a precise measure of meaningful information. Nevertheless, this question is an illustration of the kind of fundamental questions we can ask about a communication system.

Information, which is not a physical entity but an abstract concept, is hard to quantify in general. This is especially the case if human factors are involved when the information is utilized. For example, when we play Beethoven's violin concerto from a compact disc, we receive the musical information from the loudspeakers. We enjoy this information because it arouses certain kinds of emotion within ourselves. While we receive the same information every time we play the same piece of music, the kinds of emotions aroused may be different from time to time because they depend on our mood at that particular moment. In other words, we can derive utility from the same information every time in a different way. For this reason, it is extremely difficult to devise a measure which can quantify the amount of information contained in a piece of music.

In 1948, Bell Telephone Laboratories scientist Claude E. Shannon (1916-2001) published a paper entitled "A Mathematical Theory of Communication" [173] which laid the foundation of an important field now known as information theory. In his paper, the model of a point-to-point communication system depicted in Figure 1.1 is considered.

[Figure 1.1. Schematic diagram for a general point-to-point communication system: an information source produces a message, which the transmitter converts into a signal; the signal may be corrupted by a noise source, and the receiver converts the received signal back into a message for the destination.]

In this model, a message is generated by the information source. The message is converted by the transmitter into a signal which is suitable for transmission. In the course of transmission, the signal may be contaminated by a noise source, so that the received signal may be different from the transmitted signal. Based on the received signal, the receiver then makes an estimate of the message and delivers it to the destination.

In this abstract model of a point-to-point communication system, one is only concerned about whether the message generated by the source can be delivered correctly to the receiver without worrying about how the message is actually used by the receiver. In a way, Shannon's model does not cover all the aspects of a communication system. However, in order to develop a precise and useful theory of information, the scope of the theory has to be restricted.

In [173], Shannon introduced two fundamental concepts about 'information' from the communication point of view. First, information is uncertainty.
More specifically, if a piece of information we are interested in is deterministic, then it has no value at all because it is already known with no uncertainty. From this point of view, for example, the continuous transmission of a still picture on a television broadcast channel is superfluous. Consequently, an information source is naturally modeled as a random variable or a random process, and probability is employed to develop the theory of information. Second, information to be transmitted is digital. This means that the information source should first be converted into a stream of 0's and 1's called bits, and the remaining task is to deliver these bits to the receiver correctly with no reference to their actual meaning. This is the foundation of all modern digital communication systems. In fact, this work of Shannon appears to contain the first published use of the term bit, which stands for binary digit.

In the same work, Shannon also proved two important theorems. The first theorem, called the source coding theorem, introduces entropy as the fundamental measure of information which characterizes the minimum rate of a source code representing an information source essentially free of error. The source coding theorem is the theoretical basis for lossless data compression¹. The second theorem, called the channel coding theorem, concerns communication through a noisy channel. It was shown that associated with every noisy channel is a parameter, called the capacity, which is strictly positive except for very special channels, such that information can be communicated reliably through the channel as long as the information rate is less than the capacity. These two theorems, which give fundamental limits in point-to-point communication, are the two most important results in information theory.

¹A data compression scheme is lossless if the data can be recovered with an arbitrarily small probability of error.

In science, we study the laws of Nature which must be obeyed by any physical system. These laws are used by engineers to design systems to achieve specific goals. Therefore, science is the foundation of engineering. Without science, engineering can only be done by trial and error. In information theory, we study the fundamental limits in communication regardless of the technologies involved in the actual implementation of the communication systems. These fundamental limits are not only used as guidelines by communication engineers, but they also give insights into what optimal coding schemes are like. Information theory is therefore the science of information.

Since Shannon published his original paper in 1948, information theory has been developed into a major research field in both communication theory and applied probability. After more than half a century's research, it is quite impossible for a book on the subject to cover all the major topics with considerable depth. This book is a modern treatment of information theory for discrete random variables, which is the foundation of the theory at large. The book consists of two parts. The first part, namely Chapter 1 to Chapter 13, is a thorough discussion of the basic topics in information theory, including fundamental results, tools, and algorithms. The second part, namely Chapter 14 to Chapter 16, is a selection of advanced topics which demonstrate the use of the tools
developed in the first part of the book. The topics discussed in this part of the book also represent new research directions in the field.

An undergraduate level course on probability is the only prerequisite for this book. For a non-technical introduction to information theory, we refer the reader to Encyclopedia Britannica [62]. In fact, we strongly recommend the reader to first read this excellent introduction before starting this book. For biographies of Claude Shannon, a legend of the 20th Century who made fundamental contributions to the Information Age, we refer the readers to [36] and [185]. The latter is also a complete collection of Shannon's papers.

Unlike most branches of applied mathematics in which physical systems are studied, abstract systems of communication are studied in information theory. In reading this book, it is not unusual for a beginner to be able to understand all the steps in a proof but have no idea what the proof is leading to. The best way to learn information theory is to study the materials first and come back at a later time. Many results in information theory are rather subtle, to the extent that an expert in the subject may from time to time realize that his/her understanding of certain basic results has been inadequate or even incorrect. While a novice should expect to raise his/her level of understanding of the subject by reading this book, he/she should not be discouraged to find, after finishing the book, that there are actually more things yet to be understood. In fact, this is exactly the challenge and the beauty of information theory.

Chapter 2

INFORMATION MEASURES

Shannon's information measures refer to entropy, conditional entropy, mutual information, and conditional mutual information. They are the most important measures of information in information theory. In this chapter, we introduce these measures and prove some of their basic properties. The physical meaning of these measures will be discussed in depth in subsequent chapters. We then introduce informational divergence which measures the distance between two probability distributions and prove some useful inequalities in information theory. The chapter ends with a section on the entropy rate of a stationary information source.

2.1 INDEPENDENCE AND MARKOV CHAINS

We begin our discussion in this chapter by reviewing two basic notions in probability: independence of random variables and Markov chain. All the random variables in this book are discrete.

Let $X$ be a random variable taking values in an alphabet $\mathcal{X}$. The probability distribution for $X$ is denoted as $\{p_X(x), x \in \mathcal{X}\}$, with $p_X(x) = \Pr\{X = x\}$. When there is no ambiguity, $p_X(x)$ will be abbreviated as $p(x)$, and $\{p_X(x)\}$ will be abbreviated as $p(x)$. The support of $X$, denoted by $S_X$, is the set of all $x \in \mathcal{X}$ such that $p(x) > 0$. If $S_X = \mathcal{X}$, we say that $p$ is strictly positive. Otherwise, we say that $p$ is not strictly positive, or $p$ contains zero probability masses. All the above notations naturally extend to two or more random variables. As we will see, probability distributions with zero probability masses are very delicate in general, and they need to be handled with great care.
Definition 2.1 (Independence) Two random variables $X$ and $Y$ are independent, denoted by $X \perp Y$, if

$$p(x,y) = p(x)p(y) \tag{2.1}$$

for all $x$ and $y$ (i.e., for all $(x,y) \in \mathcal{X} \times \mathcal{Y}$).

For more than two random variables, we distinguish between two types of independence.

Definition 2.2 (Mutual Independence) For $n \ge 3$, random variables $X_1, X_2, \ldots, X_n$ are mutually independent if

$$p(x_1, x_2, \ldots, x_n) = p(x_1)p(x_2)\cdots p(x_n) \tag{2.2}$$

for all $x_1, x_2, \ldots, x_n$.

Definition 2.3 (Pairwise Independence) For $n \ge 3$, random variables $X_1, X_2, \ldots, X_n$ are pairwise independent if $X_i$ and $X_j$ are independent for all $1 \le i < j \le n$.

Note that mutual independence implies pairwise independence. We leave it as an exercise for the reader to show that the converse is not true.

Definition 2.4 (Conditional Independence) For random variables $X$, $Y$, and $Z$, $X$ is independent of $Z$ conditioning on $Y$, denoted by $X \perp Z \mid Y$, if

$$p(x,y,z)p(y) = p(x,y)p(y,z) \tag{2.3}$$

for all $x$, $y$, and $z$, or equivalently,

$$p(x,y,z) = \begin{cases} p(x,y)p(z|y) & \text{if } p(y) > 0 \\ 0 & \text{otherwise.} \end{cases} \tag{2.4}$$

The first definition of conditional independence above is sometimes more convenient to use because it is not necessary to distinguish between the cases $p(y) > 0$ and $p(y) = 0$. However, the physical meaning of conditional independence is more explicit in the second definition.

Proposition 2.5 For random variables $X$, $Y$, and $Z$, $X \perp Z \mid Y$ if and only if

$$p(x,y,z) = a(x,y)b(y,z) \tag{2.5}$$

for all $x$, $y$, and $z$ such that $p(y) > 0$.

Proof The 'only if' part follows immediately from the definition of conditional independence in (2.4), so we will only prove the 'if' part. Assume

$$p(x,y,z) = a(x,y)b(y,z) \tag{2.6}$$

for all $x$, $y$, and $z$ such that $p(y) > 0$. Then

$$p(x,y) = \sum_z p(x,y,z) = \sum_z a(x,y)b(y,z) = a(x,y)\sum_z b(y,z), \tag{2.7}$$

$$p(y,z) = \sum_x p(x,y,z) = \sum_x a(x,y)b(y,z) = b(y,z)\sum_x a(x,y), \tag{2.8}$$

and

$$p(y) = \sum_z p(y,z) = \left(\sum_x a(x,y)\right)\left(\sum_z b(y,z)\right). \tag{2.9}$$

Therefore,

$$\frac{p(x,y)p(y,z)}{p(y)} = \frac{\left(a(x,y)\sum_z b(y,z)\right)\left(b(y,z)\sum_x a(x,y)\right)}{\left(\sum_x a(x,y)\right)\left(\sum_z b(y,z)\right)} \tag{2.10}$$
$$= a(x,y)b(y,z) \tag{2.11}$$
$$= p(x,y,z). \tag{2.12}$$

Hence, $X \perp Z \mid Y$. The proof is accomplished.

Definition 2.6 (Markov Chain) For random variables $X_1, X_2, \ldots, X_n$, where $n \ge 3$, $X_1 \rightarrow X_2 \rightarrow \cdots \rightarrow X_n$ forms a Markov chain if

$$p(x_1, x_2, \ldots, x_n)p(x_2)p(x_3)\cdots p(x_{n-1}) = p(x_1,x_2)p(x_2,x_3)\cdots p(x_{n-1},x_n) \tag{2.13}$$

for all $x_1, x_2, \ldots, x_n$, or equivalently,

$$p(x_1, x_2, \ldots, x_n) = \begin{cases} p(x_1,x_2)p(x_3|x_2)\cdots p(x_n|x_{n-1}) & \text{if } p(x_2), p(x_3), \ldots, p(x_{n-1}) > 0 \\ 0 & \text{otherwise.} \end{cases} \tag{2.14}$$

We note that $X \perp Z \mid Y$ is equivalent to the Markov chain $X \rightarrow Y \rightarrow Z$.

Proposition 2.7 $X_1 \rightarrow X_2 \rightarrow \cdots \rightarrow X_n$ forms a Markov chain if and only if $X_n \rightarrow X_{n-1} \rightarrow \cdots \rightarrow X_1$ forms a Markov chain.

Proof This follows directly from the symmetry in the definition of a Markov chain in (2.13).

In the following, we state two basic properties of a Markov chain. The proofs are left as an exercise.

Proposition 2.8 $X_1 \rightarrow X_2 \rightarrow \cdots \rightarrow X_n$ forms a Markov chain if and only if

$$X_1 \rightarrow X_2 \rightarrow X_3$$
$$(X_1, X_2) \rightarrow X_3 \rightarrow X_4 \tag{2.15}$$
$$\vdots$$
$$(X_1, X_2, \ldots, X_{n-2}) \rightarrow X_{n-1} \rightarrow X_n$$

form Markov chains.
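To make Definition 2.4 concrete, here is a minimal Python sketch (not from the text; the helper name, the example distribution, and the tolerance are illustrative choices) that checks the factorization $p(x,y,z)p(y) = p(x,y)p(y,z)$ directly from a joint probability mass function stored as a dictionary.

```python
from itertools import product
from collections import defaultdict

def is_cond_indep(p_xyz, tol=1e-12):
    """Check X ⊥ Z | Y for a joint pmf given as {(x, y, z): prob}."""
    p_xy, p_yz, p_y = defaultdict(float), defaultdict(float), defaultdict(float)
    for (x, y, z), pr in p_xyz.items():
        p_xy[(x, y)] += pr
        p_yz[(y, z)] += pr
        p_y[y] += pr
    xs = {x for x, _, _ in p_xyz}
    ys = {y for _, y, _ in p_xyz}
    zs = {z for _, _, z in p_xyz}
    # Definition 2.4: p(x,y,z) p(y) = p(x,y) p(y,z) for all x, y, z.
    return all(abs(p_xyz.get((x, y, z), 0.0) * p_y[y] - p_xy[(x, y)] * p_yz[(y, z)]) < tol
               for x, y, z in product(xs, ys, zs))

# Example: X and Z i.i.d. fair bits, Y = X, so X ⊥ Z | Y holds trivially.
pmf = {(x, x, z): 0.25 for x in (0, 1) for z in (0, 1)}
print(is_cond_indep(pmf))  # True
```

Working with the first form of the definition avoids dividing by $p(y)$, which is exactly the point made in the text about distributions with zero probability masses.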
Proposition 2.9 $X_1 \rightarrow X_2 \rightarrow \cdots \rightarrow X_n$ forms a Markov chain if and only if

$$p(x_1, x_2, \ldots, x_n) = f_1(x_1,x_2)f_2(x_2,x_3)\cdots f_{n-1}(x_{n-1},x_n) \tag{2.16}$$

for all $x_1, x_2, \ldots, x_n$ such that $p(x_2), p(x_3), \ldots, p(x_{n-1}) > 0$.

Note that Proposition 2.9 is a generalization of Proposition 2.5. From Proposition 2.9, one can prove the following important property of a Markov chain. Again, the details are left as an exercise.

Proposition 2.10 (Markov subchains) Let $\mathcal{N}_n = \{1, 2, \ldots, n\}$ and let $X_1 \rightarrow X_2 \rightarrow \cdots \rightarrow X_n$ form a Markov chain. For any subset $\alpha$ of $\mathcal{N}_n$, denote $(X_i, i \in \alpha)$ by $X_\alpha$. Then for any disjoint subsets $\alpha_1, \alpha_2, \ldots, \alpha_m$ of $\mathcal{N}_n$ such that

$$k_1 < k_2 < \cdots < k_m \tag{2.17}$$

for all $k_j \in \alpha_j$, $j = 1, 2, \ldots, m$,

$$X_{\alpha_1} \rightarrow X_{\alpha_2} \rightarrow \cdots \rightarrow X_{\alpha_m} \tag{2.18}$$

forms a Markov chain. That is, a subchain of $X_1 \rightarrow X_2 \rightarrow \cdots \rightarrow X_n$ is also a Markov chain.

Example 2.11 Let $X_1 \rightarrow X_2 \rightarrow \cdots \rightarrow X_{10}$ form a Markov chain and $\alpha_1 = \{1,2\}$, $\alpha_2 = \{4\}$, $\alpha_3 = \{6,7,8\}$, and $\alpha_4 = \{10\}$ be subsets of $\mathcal{N}_{10}$. Then Proposition 2.10 says that

$$X_{\alpha_1} \rightarrow X_{\alpha_2} \rightarrow X_{\alpha_3} \rightarrow X_{\alpha_4} \tag{2.19}$$

also forms a Markov chain.

We have been very careful in handling probability distributions with zero probability masses. In the rest of the section, we show that such distributions are very delicate in general. We first prove the following property of a strictly positive probability distribution involving four random variables¹.

¹Proposition 2.12 is called the intersection axiom in Bayesian networks. See [151].

Proposition 2.12 Let $X_1, X_2, X_3$, and $X_4$ be random variables such that $p(x_1,x_2,x_3,x_4)$ is strictly positive. Then

$$\left.\begin{array}{l} X_1 \perp X_4 \mid (X_2, X_3) \\ X_1 \perp X_3 \mid (X_2, X_4) \end{array}\right\} \;\Rightarrow\; X_1 \perp (X_3, X_4) \mid X_2. \tag{2.20}$$

Proof If $X_1 \perp X_4 \mid (X_2, X_3)$, we have

$$p(x_1,x_2,x_3,x_4) = \frac{p(x_1,x_2,x_3)p(x_2,x_3,x_4)}{p(x_2,x_3)}. \tag{2.21}$$

On the other hand, if $X_1 \perp X_3 \mid (X_2, X_4)$, we have

$$p(x_1,x_2,x_3,x_4) = \frac{p(x_1,x_2,x_4)p(x_2,x_3,x_4)}{p(x_2,x_4)}. \tag{2.22}$$

Equating (2.21) and (2.22), we have

$$p(x_1,x_2,x_3) = \frac{p(x_1,x_2,x_4)p(x_2,x_3)}{p(x_2,x_4)}. \tag{2.23}$$

Then

$$p(x_1,x_2) = \sum_{x_3} p(x_1,x_2,x_3) \tag{2.24}$$
$$= \sum_{x_3} \frac{p(x_1,x_2,x_4)p(x_2,x_3)}{p(x_2,x_4)} \tag{2.25}$$
$$= \frac{p(x_1,x_2,x_4)p(x_2)}{p(x_2,x_4)}, \tag{2.26}$$

or

$$\frac{p(x_1,x_2,x_4)}{p(x_2,x_4)} = \frac{p(x_1,x_2)}{p(x_2)}. \tag{2.27}$$

Hence from (2.22),

$$p(x_1,x_2,x_3,x_4) = \frac{p(x_1,x_2,x_4)p(x_2,x_3,x_4)}{p(x_2,x_4)} = \frac{p(x_1,x_2)p(x_2,x_3,x_4)}{p(x_2)} \tag{2.28}$$

for all $x_1, x_2, x_3$, and $x_4$, i.e., $X_1 \perp (X_3, X_4) \mid X_2$.

If $p(x_1,x_2,x_3,x_4) = 0$ for some $x_1, x_2, x_3$, and $x_4$, i.e., $p$ is not strictly positive, the arguments in the above proof are not valid. In fact, the proposition may not hold in this case. For instance, let $X_1 = Y$, $X_2 = Z$, and $X_3 = X_4 = (Y, Z)$, where $Y$ and $Z$ are independent random variables. Then $X_1 \perp X_4 \mid (X_2, X_3)$ and $X_1 \perp X_3 \mid (X_2, X_4)$, but $X_1 \not\perp (X_3, X_4) \mid X_2$. Note that for this construction, $p$ is not strictly positive because $p(x_1,x_2,x_3,x_4) = 0$ if $x_3 \neq (x_1, x_2)$ or $x_4 \neq (x_1, x_2)$.
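The counterexample above can also be checked numerically. The sketch below is only an illustration under the stated construction ($Y, Z$ i.i.d. fair bits, $X_1 = Y$, $X_2 = Z$, $X_3 = X_4 = (Y,Z)$); the helper functions are hypothetical and not part of the text.

```python
from collections import defaultdict
from itertools import product

# Joint pmf of (X1, X2, X3, X4) with Y, Z i.i.d. fair bits,
# X1 = Y, X2 = Z, X3 = X4 = (Y, Z).
joint = {(y, z, (y, z), (y, z)): 0.25 for y in (0, 1) for z in (0, 1)}

def marginal(p, idx):
    """Marginalize a joint pmf onto the coordinates listed in idx."""
    q = defaultdict(float)
    for outcome, pr in p.items():
        q[tuple(outcome[i] for i in idx)] += pr
    return q

def cond_indep(p, a, b, c):
    """Check X_a ⊥ X_b | X_c via p(a,b,c) p(c) = p(a,c) p(b,c)."""
    p_abc = marginal(p, a + b + c)
    p_ac, p_bc, p_c = marginal(p, a + c), marginal(p, b + c), marginal(p, c)
    la, lb, lc = len(a), len(b), len(c)
    keys = set(p_abc)
    # Also test zero-probability combinations of the observed values.
    vals = [set(k[i] for k in keys) for i in range(la + lb + lc)]
    for combo in product(*vals):
        av, bv, cv = combo[:la], combo[la:la + lb], combo[la + lb:]
        lhs = p_abc.get(combo, 0.0) * p_c.get(cv, 0.0)
        rhs = p_ac.get(av + cv, 0.0) * p_bc.get(bv + cv, 0.0)
        if abs(lhs - rhs) > 1e-12:
            return False
    return True

print(cond_indep(joint, (0,), (3,), (1, 2)))  # X1 ⊥ X4 | (X2, X3): True
print(cond_indep(joint, (0,), (2,), (1, 3)))  # X1 ⊥ X3 | (X2, X4): True
print(cond_indep(joint, (0,), (2, 3), (1,)))  # X1 ⊥ (X3, X4) | X2: False
```

The first two checks succeed while the third fails, which is exactly the failure of Proposition 2.12 for a distribution that is not strictly positive.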
The above example is somewhat counter-intuitive because it appears that Proposition 2.12 should hold for all probability distributions via a continuity argument. Specifically, such an argument goes like this. For any distribution $p$, let $\{p_k\}$ be a sequence of strictly positive distributions such that $p_k \rightarrow p$ and $p_k$ satisfies (2.21) and (2.22) for all $k$, i.e.,

$$p_k(x_1,x_2,x_3,x_4)p_k(x_2,x_3) = p_k(x_1,x_2,x_3)p_k(x_2,x_3,x_4) \tag{2.29}$$

and

$$p_k(x_1,x_2,x_3,x_4)p_k(x_2,x_4) = p_k(x_1,x_2,x_4)p_k(x_2,x_3,x_4). \tag{2.30}$$

Then by the proposition, $p_k$ also satisfies (2.28), i.e.,

$$p_k(x_1,x_2,x_3,x_4)p_k(x_2) = p_k(x_1,x_2)p_k(x_2,x_3,x_4). \tag{2.31}$$

Letting $k \rightarrow \infty$, we have

$$p(x_1,x_2,x_3,x_4)p(x_2) = p(x_1,x_2)p(x_2,x_3,x_4) \tag{2.32}$$

for all $x_1, x_2, x_3$, and $x_4$, i.e., $X_1 \perp (X_3, X_4) \mid X_2$. Such an argument would be valid if there always exists a sequence $\{p_k\}$ as prescribed. However, the existence of the distribution $p(x_1,x_2,x_3,x_4)$ constructed immediately after Proposition 2.12 simply says that it is not always possible to find such a sequence. Therefore, probability distributions which are not strictly positive can be very delicate. In fact, these distributions have very complicated conditional independence structures. For a complete characterization of the conditional independence structures of strictly positive distributions, we refer the reader to Chan and Yeung [41].

2.2 SHANNON'S INFORMATION MEASURES

We begin this section by introducing the entropy of a random variable. As we will see shortly, all Shannon's information measures can be expressed as linear combinations of entropies.

Definition 2.13 The entropy $H(X)$ of a random variable $X$ is defined by

$$H(X) = -\sum_x p(x)\log p(x). \tag{2.33}$$

In the above and subsequent definitions of information measures, we adopt the convention that summation is taken over the corresponding support. Such a convention is necessary because $\log p(x)$ in (2.33) is undefined if $p(x) = 0$. The base of the logarithm in (2.33) can be chosen to be any convenient real number greater than 1. We write $H(X)$ as $H_\alpha(X)$ when the base of the logarithm is $\alpha$. When the base of the logarithm is 2, the unit for entropy is the bit. When the base of the logarithm is an integer $D \ge 2$, the unit for entropy is the $D$-it ($D$-ary digit). In the context of source coding, the base is usually taken to be the size of the code alphabet. This will be discussed in Chapter 3.

In computer science, a bit means an entity which can take the value 0 or 1. In information theory, a bit is a measure of information, which depends on the probabilities of occurrence of 0 and 1. The reader should distinguish these two meanings of a bit from each other carefully.

Let $g(X)$ be any function of a random variable $X$. We will denote the expectation of $g(X)$ by $Eg(X)$, i.e.,

$$Eg(X) = \sum_x p(x)g(x), \tag{2.34}$$

where the summation is over $S_X$. Then the definition of the entropy of a random variable $X$ can be written as

$$H(X) = -E\log p(X). \tag{2.35}$$

Expressions of Shannon's information measures in terms of expectations will be useful in subsequent discussions.

The entropy $H(X)$ of a random variable $X$ is a functional of the probability distribution $p(x)$ which measures the amount of information contained in $X$, or equivalently, the amount of uncertainty removed upon revealing the outcome of $X$. Note that $H(X)$ depends only on $p(x)$ but not on the actual values in $\mathcal{X}$. For $0 \le p \le 1$, define

$$h_b(p) = -p\log p - (1-p)\log(1-p) \tag{2.36}$$

with the convention $0\log 0 = 0$, so that $h_b(0) = h_b(1) = 0$. With this convention, $h_b(p)$ is continuous at $p = 0$ and $p = 1$. $h_b$ is called the binary entropy function. For a binary random variable $X$ with distribution $\{p, 1-p\}$,

$$H(X) = h_b(p). \tag{2.37}$$
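As a quick numerical companion (not from the text; the example distribution and function names are arbitrary), the following sketch computes $H(X)$ in bits, with the convention that the sum is taken over the support, together with the binary entropy function $h_b$.

```python
from math import log2

def entropy(p):
    """H(X) in bits for a sequence of probabilities; terms with p(x) = 0 are skipped."""
    return -sum(px * log2(px) for px in p if px > 0)

def h_b(p):
    """Binary entropy function h_b(p), using the convention 0 log 0 = 0."""
    return entropy([p, 1.0 - p])

print(entropy([0.5, 0.25, 0.25]))   # 1.5 bits
print(h_b(0.5))                     # 1.0, the maximum of h_b
print(h_b(0.0), h_b(1.0))           # both 0, by the convention 0 log 0 = 0
```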
[Figure 2.1. The binary entropy function $h_b(p)$ versus $p$ in the base 2.]

Figure 2.1 shows the graph of $h_b(p)$ versus $p$ in the base 2. Note that $h_b(p)$ achieves the maximum value 1 when $p = \frac{1}{2}$.

For two random variables, their joint entropy is defined as below. Extension of this definition to more than two random variables is straightforward.

Definition 2.14 The joint entropy $H(X,Y)$ of a pair of random variables $X$ and $Y$ is defined by

$$H(X,Y) = -\sum_{x,y} p(x,y)\log p(x,y) = -E\log p(X,Y). \tag{2.38}$$

Definition 2.15 For random variables $X$ and $Y$, the conditional entropy of $Y$ given $X$ is defined by

$$H(Y|X) = -\sum_{x,y} p(x,y)\log p(y|x) = -E\log p(Y|X). \tag{2.39}$$

From (2.39), we can write

$$H(Y|X) = \sum_x p(x)\left[-\sum_y p(y|x)\log p(y|x)\right]. \tag{2.40}$$

The inner sum is the entropy of $Y$ conditioning on a fixed $x \in S_X$. Thus we are motivated to express $H(Y|X)$ as

$$H(Y|X) = \sum_x p(x)H(Y|X=x), \tag{2.41}$$

where

$$H(Y|X=x) = -\sum_y p(y|x)\log p(y|x). \tag{2.42}$$

Similarly, for $H(Y|X,Z)$, we write

$$H(Y|X,Z) = \sum_z p(z)H(Y|X, Z=z), \tag{2.43}$$

where

$$H(Y|X, Z=z) = -\sum_{x,y} p(x,y|z)\log p(y|x,z). \tag{2.44}$$

Proposition 2.16

$$H(X,Y) = H(X) + H(Y|X) \tag{2.45}$$

and

$$H(X,Y) = H(Y) + H(X|Y). \tag{2.46}$$

Proof Consider

$$H(X,Y) = -E\log p(X,Y) \tag{2.47}$$
$$= -E\log[p(X)p(Y|X)] \tag{2.48}$$
$$= -E\log p(X) - E\log p(Y|X) \tag{2.49}$$
$$= H(X) + H(Y|X). \tag{2.50}$$

Note that (2.48) is justified because the summation of the expectation is over $S_{XY}$, and we have used the linearity of expectation² to obtain (2.49). This proves (2.45), and (2.46) follows by symmetry.

²See Problem 5 at the end of the chapter.

This proposition has the following interpretation. Consider revealing the outcome of $X$ and $Y$ in two steps: first the outcome of $X$ and then the outcome of $Y$. Then the proposition says that the total amount of uncertainty removed upon revealing both $X$ and $Y$ is equal to the sum of the uncertainty removed upon revealing $X$ (uncertainty removed in the first step) and the uncertainty removed upon revealing $Y$ once $X$ is known (uncertainty removed in the second step).

Definition 2.17 For random variables $X$ and $Y$, the mutual information between $X$ and $Y$ is defined by

$$I(X;Y) = \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)} = E\log\frac{p(X,Y)}{p(X)p(Y)}. \tag{2.51}$$

Remark $I(X;Y)$ is symmetrical in $X$ and $Y$.

Proposition 2.18 The mutual information between a random variable $X$ and itself is equal to the entropy of $X$, i.e., $I(X;X) = H(X)$.

Proof This can be seen by considering

$$I(X;X) = E\log\frac{p(X)}{p(X)^2} \tag{2.52}$$
$$= -E\log p(X) \tag{2.53}$$
$$= H(X). \tag{2.54}$$

The proposition is proved.

Remark The entropy of $X$ is sometimes called the self-information of $X$.

Proposition 2.19

$$I(X;Y) = H(X) - H(X|Y), \tag{2.55}$$
$$I(X;Y) = H(Y) - H(Y|X), \tag{2.56}$$

and

$$I(X;Y) = H(X) + H(Y) - H(X,Y). \tag{2.57}$$

The proof of this proposition is left as an exercise.

From (2.55), we can interpret $I(X;Y)$ as the reduction in uncertainty about $X$ when $Y$ is given, or equivalently, the amount of information about $X$ provided by $Y$. Since $I(X;Y)$ is symmetrical in $X$ and $Y$, from (2.56), we can as well interpret $I(X;Y)$ as the amount of information about $Y$ provided by $X$.
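The identities above are easy to sanity-check numerically. Here is a small sketch (illustrative only; the joint distribution is an arbitrary choice) that computes $H(X)$, $H(Y)$, $H(X,Y)$, $H(Y|X)$, and $I(X;Y)$ from a joint pmf and confirms (2.45) and (2.57).

```python
from math import log2
from collections import defaultdict

def H(p):
    """Entropy in bits of a pmf given as a dict {outcome: probability}."""
    return -sum(v * log2(v) for v in p.values() if v > 0)

# An arbitrary joint pmf p(x, y) on {0,1} x {0,1}.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

p_x, p_y = defaultdict(float), defaultdict(float)
for (x, y), v in p_xy.items():
    p_x[x] += v
    p_y[y] += v

# H(Y|X) from Definition 2.15 and I(X;Y) from Definition 2.17.
H_y_given_x = -sum(v * log2(v / p_x[x]) for (x, y), v in p_xy.items() if v > 0)
I_xy = sum(v * log2(v / (p_x[x] * p_y[y])) for (x, y), v in p_xy.items() if v > 0)

print(abs(H(p_xy) - (H(p_x) + H_y_given_x)) < 1e-12)    # (2.45): H(X,Y) = H(X) + H(Y|X)
print(abs(I_xy - (H(p_x) + H(p_y) - H(p_xy))) < 1e-12)  # (2.57): I(X;Y) = H(X) + H(Y) - H(X,Y)
```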
The relations between the (joint) entropies, conditional entropies, and mutual information for two random variables $X$ and $Y$ are given in Propositions 2.16 and 2.19. These relations can be summarized by the diagram in Figure 2.2, which is a variation of the Venn diagram³. One can check that all the relations between Shannon's information measures for $X$ and $Y$ which are shown in Figure 2.2 are consistent with the relations given in Propositions 2.16 and 2.19. This one-to-one correspondence between Shannon's information measures and set theory is not just a coincidence for two random variables. We will discuss this in depth when we introduce the I-Measure in Chapter 6.

³The rectangle representing the universal set in a usual Venn diagram is missing in Figure 2.2.

[Figure 2.2. Relationship between entropies and mutual information for two random variables: two overlapping regions representing $H(X)$ and $H(Y)$, with $H(X|Y)$, $I(X;Y)$, and $H(Y|X)$ as the three parts and $H(X,Y)$ as their union.]

Analogous to entropy, there is a conditional version of mutual information called conditional mutual information.

Definition 2.20 For random variables $X$, $Y$ and $Z$, the mutual information between $X$ and $Y$ conditioning on $Z$ is defined by

$$I(X;Y|Z) = \sum_{x,y,z} p(x,y,z)\log\frac{p(x,y|z)}{p(x|z)p(y|z)} = E\log\frac{p(X,Y|Z)}{p(X|Z)p(Y|Z)}. \tag{2.58}$$

Remark $I(X;Y|Z)$ is symmetrical in $X$ and $Y$.

Analogous to conditional entropy, we write

$$I(X;Y|Z) = \sum_z p(z)I(X;Y|Z=z), \tag{2.59}$$

where

$$I(X;Y|Z=z) = \sum_{x,y} p(x,y|z)\log\frac{p(x,y|z)}{p(x|z)p(y|z)}. \tag{2.60}$$

Similarly, when conditioning on two random variables, we write

$$I(X;Y|Z,T) = \sum_t p(t)I(X;Y|Z, T=t), \tag{2.61}$$

where

$$I(X;Y|Z, T=t) = \sum_{x,y,z} p(x,y,z|t)\log\frac{p(x,y|z,t)}{p(x|z,t)p(y|z,t)}. \tag{2.62}$$

Conditional mutual information satisfies the same set of relations given in Propositions 2.18 and 2.19 for mutual information except that all the terms are now conditioned on a random variable $Z$. We state these relations in the next two propositions. The proofs are omitted.

Proposition 2.21 The mutual information between a random variable $X$ and itself conditioning on a random variable $Z$ is equal to the conditional entropy of $X$ given $Z$, i.e., $I(X;X|Z) = H(X|Z)$.

Proposition 2.22

$$I(X;Y|Z) = H(X|Z) - H(X|Y,Z), \tag{2.63}$$
$$I(X;Y|Z) = H(Y|Z) - H(Y|X,Z), \tag{2.64}$$

and

$$I(X;Y|Z) = H(X|Z) + H(Y|Z) - H(X,Y|Z). \tag{2.65}$$
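As with the unconditional case, these conditional identities can be checked numerically. The sketch below (an illustration, not from the text; the joint pmf and helper names are arbitrary) computes $I(X;Y|Z)$ directly from Definition 2.20 and compares it with the entropy expression in (2.65).

```python
from math import log2
from collections import defaultdict

def marginal(p, idx):
    q = defaultdict(float)
    for k, v in p.items():
        q[tuple(k[i] for i in idx)] += v
    return q

def H(p):
    return -sum(v * log2(v) for v in p.values() if v > 0)

# Arbitrary joint pmf p(x, y, z) on {0,1}^3 (weights normalized to sum to 1).
w = {(x, y, z): 1 + x + 2*y + 3*z for x in (0, 1) for y in (0, 1) for z in (0, 1)}
total = sum(w.values())
p_xyz = {k: v / total for k, v in w.items()}

p_xz, p_yz, p_z = marginal(p_xyz, (0, 2)), marginal(p_xyz, (1, 2)), marginal(p_xyz, (2,))

# Definition 2.20 rewritten with joint probabilities:
# I(X;Y|Z) = sum p(x,y,z) log[ p(x,y,z) p(z) / (p(x,z) p(y,z)) ].
I_def = sum(v * log2(v * p_z[(z,)] / (p_xz[(x, z)] * p_yz[(y, z)]))
            for (x, y, z), v in p_xyz.items() if v > 0)

# (2.65): I(X;Y|Z) = H(X|Z) + H(Y|Z) - H(X,Y|Z)
#                  = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z).
I_ent = H(p_xz) + H(p_yz) - H(p_xyz) - H(p_z)

print(abs(I_def - I_ent) < 1e-12, I_def >= -1e-12)  # identity holds; I(X;Y|Z) >= 0
```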
2.3 CONTINUITY OF SHANNON'S INFORMATION MEASURES

This section is devoted to a discussion of the continuity of Shannon's information measures. To begin with, we first show that all Shannon's information measures are special cases of conditional mutual information. Let $\Phi$ be a degenerate random variable, i.e., $\Phi$ takes a constant value with probability 1. Consider the mutual information $I(X;Y|Z)$. When $X = Y$ and $Z = \Phi$, $I(X;Y|Z)$ becomes the entropy $H(X)$. When $X = Y$, $I(X;Y|Z)$ becomes the conditional entropy $H(X|Z)$. When $Z = \Phi$, $I(X;Y|Z)$ becomes the mutual information $I(X;Y)$. Thus all Shannon's information measures are special cases of conditional mutual information. Therefore, we only need to discuss the continuity of conditional mutual information.

Recall the definition of conditional mutual information:

$$I_p(X;Y|Z) = \sum_{(x,y,z)\in S_{XYZ}(p)} p(x,y,z)\log\frac{p(x,y|z)}{p(x|z)p(y|z)}, \tag{2.66}$$

where we have written $I(X;Y|Z)$ and $S_{XYZ}$ as $I_p(X;Y|Z)$ and $S_{XYZ}(p)$, respectively, to emphasize their dependence on $p$. Since $a\log a$ is continuous in $a$ for $a > 0$, $I_p(X;Y|Z)$ varies continuously with $p$ as long as the support $S_{XYZ}(p)$ does not change. The problem arises when some positive probability masses become zero or some zero probability masses become positive. Since continuity in the former case implies continuity in the latter case and vice versa, we only need to consider the former case. As $p(x,y,z) \rightarrow 0$ and eventually becomes zero for some $x$, $y$, and $z$, the support $S_{XYZ}(p)$ is reduced, and there is a discrete change in the number of terms in (2.66). Now for $x$, $y$, and $z$ such that $p(x,y,z) > 0$, consider

$$p(x,y,z)\log\frac{p(x,y|z)}{p(x|z)p(y|z)} \tag{2.67}$$
$$= p(x,y,z)\log\frac{p(x,y,z)p(z)}{p(x,z)p(y,z)} \tag{2.68}$$
$$= p(x,y,z)\log p(x,y,z) + p(x,y,z)\log p(z) - p(x,y,z)\log p(x,z) - p(x,y,z)\log p(y,z). \tag{2.69}$$

It can readily be verified by L'Hospital's rule that $p(x,y,z)\log p(x,y,z) \rightarrow 0$ as $p(x,y,z) \rightarrow 0$. For $p(x,y,z)\log p(z)$, since $p(x,y,z) \le p(z) \le 1$, we have

$$0 \ge p(x,y,z)\log p(z) \ge p(x,y,z)\log p(x,y,z) \rightarrow 0. \tag{2.70}$$

Thus $p(x,y,z)\log p(z) \rightarrow 0$ as $p(x,y,z) \rightarrow 0$. Similarly, we can show that both $p(x,y,z)\log p(x,z)$ and $p(x,y,z)\log p(y,z) \rightarrow 0$ as $p(x,y,z) \rightarrow 0$. Therefore,

$$p(x,y,z)\log\frac{p(x,y|z)}{p(x|z)p(y|z)} \rightarrow 0 \tag{2.71}$$

as $p(x,y,z) \rightarrow 0$. Hence, $I_p(X;Y|Z)$ varies continuously with $p$ even when $p(x,y,z) \rightarrow 0$ for some $x$, $y$, and $z$.

To conclude, for any $p(x,y,z)$, if $\{p_k\}$ is a sequence of joint distributions such that $p_k \rightarrow p$ as $k \rightarrow \infty$, then

$$\lim_{k\rightarrow\infty} I_{p_k}(X;Y|Z) = I_p(X;Y|Z). \tag{2.72}$$

2.4 CHAIN RULES

In this section, we present a collection of information identities known as the chain rules which are often used in information theory.

Proposition 2.23 (Chain rule for entropy)

$$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^n H(X_i \mid X_1, \ldots, X_{i-1}). \tag{2.73}$$

Proof The chain rule for $n = 2$ has been proved in Proposition 2.16. We prove the chain rule by induction on $n$. Assume (2.73) is true for $n = m$, where $m \ge 2$. Then

$$H(X_1, \ldots, X_m, X_{m+1}) = H(X_1, \ldots, X_m) + H(X_{m+1} \mid X_1, \ldots, X_m) \tag{2.74}$$
$$= \sum_{i=1}^m H(X_i \mid X_1, \ldots, X_{i-1}) + H(X_{m+1} \mid X_1, \ldots, X_m) \tag{2.75}$$
$$= \sum_{i=1}^{m+1} H(X_i \mid X_1, \ldots, X_{i-1}), \tag{2.76}$$

where in (2.74), we have used (2.45), and in (2.75), we have used (2.73) for $n = m$. This proves the chain rule for entropy.

The chain rule for entropy has the following conditional version.

Proposition 2.24 (Chain rule for conditional entropy)

$$H(X_1, X_2, \ldots, X_n \mid Y) = \sum_{i=1}^n H(X_i \mid X_1, \ldots, X_{i-1}, Y). \tag{2.77}$$

Proof This can be proved by considering

$$H(X_1, X_2, \ldots, X_n \mid Y) = \sum_y p(y)H(X_1, X_2, \ldots, X_n \mid Y = y) \tag{2.78}$$
$$= \sum_y p(y)\sum_{i=1}^n H(X_i \mid X_1, \ldots, X_{i-1}, Y = y) \tag{2.79}$$
$$= \sum_{i=1}^n \sum_y p(y)H(X_i \mid X_1, \ldots, X_{i-1}, Y = y) \tag{2.80}$$
$$= \sum_{i=1}^n H(X_i \mid X_1, \ldots, X_{i-1}, Y), \tag{2.81}$$

where we have used (2.41) in (2.78) and (2.43) in (2.81), and (2.79) follows from Proposition 2.23.

Proposition 2.25 (Chain rule for mutual information)

$$I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^n I(X_i; Y \mid X_1, \ldots, X_{i-1}). \tag{2.82}$$

Proof Consider

$$I(X_1, X_2, \ldots, X_n; Y) = H(X_1, X_2, \ldots, X_n) - H(X_1, X_2, \ldots, X_n \mid Y) \tag{2.83}$$
$$= \sum_{i=1}^n \left[H(X_i \mid X_1, \ldots, X_{i-1}) - H(X_i \mid X_1, \ldots, X_{i-1}, Y)\right] \tag{2.84}$$
$$= \sum_{i=1}^n I(X_i; Y \mid X_1, \ldots, X_{i-1}), \tag{2.85}$$

where in (2.84), we have invoked both Propositions 2.23 and 2.24.
The chain rule for mutual information is proved.

Proposition 2.26 (Chain rule for conditional mutual information) For random variables $X_1, X_2, \ldots, X_n$, $Y$, and $Z$,

$$I(X_1, X_2, \ldots, X_n; Y \mid Z) = \sum_{i=1}^n I(X_i; Y \mid X_1, \ldots, X_{i-1}, Z). \tag{2.86}$$

Proof This is the conditional version of the chain rule for mutual information. The proof is similar to that for Proposition 2.24. The details are omitted.
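A small numerical sanity check of the chain rule for entropy (2.73) is sketched below (illustrative; the three-variable joint pmf is an arbitrary choice and the helper functions are hypothetical, not from the text).

```python
from math import log2
from collections import defaultdict

def marginal(p, idx):
    q = defaultdict(float)
    for k, v in p.items():
        q[tuple(k[i] for i in idx)] += v
    return q

def H(p):
    return -sum(v * log2(v) for v in p.values() if v > 0)

def H_cond(p, a, b):
    """H(X_a | X_b) = H(X_a, X_b) - H(X_b) for coordinate index tuples a, b."""
    return H(marginal(p, a + b)) - H(marginal(p, b))

# Arbitrary joint pmf p(x1, x2, x3) on {0,1}^3.
w = {(a, b, c): 1 + a + 2*b + 4*c for a in (0, 1) for b in (0, 1) for c in (0, 1)}
s = sum(w.values())
p = {k: v / s for k, v in w.items()}

lhs = H(p)                                   # H(X1, X2, X3)
rhs = (H(marginal(p, (0,)))                  # H(X1)
       + H_cond(p, (1,), (0,))               # + H(X2 | X1)
       + H_cond(p, (2,), (0, 1)))            # + H(X3 | X1, X2)
print(abs(lhs - rhs) < 1e-12)  # True: the chain rule (2.73) holds
```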
2.5 INFORMATIONAL DIVERGENCE

Let $p$ and $q$ be two probability distributions on a common alphabet $\mathcal{X}$. We very often want to measure how much $p$ is different from $q$, and vice versa. In order to be useful, this measure must satisfy the requirements that it is always nonnegative and it takes the zero value if and only if $p = q$. We denote the support of $p$ and $q$ by $S_p$ and $S_q$, respectively. The informational divergence defined below serves this purpose.

Definition 2.27 The informational divergence between two probability distributions $p$ and $q$ on a common alphabet $\mathcal{X}$ is defined as

$$D(p\|q) = \sum_x p(x)\log\frac{p(x)}{q(x)} = E_p\log\frac{p(X)}{q(X)}, \tag{2.87}$$

where $E_p$ denotes expectation with respect to $p$.

In the above definition, in addition to the convention that the summation is taken over $S_p$, we further adopt the convention $p(x)\log\frac{p(x)}{0} = \infty$ if $q(x) = 0$.

In the literature, informational divergence is also referred to as relative entropy or the Kullback-Leibler distance. We note that $D(p\|q)$ is not symmetrical in $p$ and $q$, so it is not a true metric or "distance."

[Figure 2.3. The fundamental inequality $\ln a \le a - 1$.]

In the rest of the book, informational divergence will be referred to as divergence for brevity. Before we prove that divergence is always nonnegative, we first establish the following simple but important inequality called the fundamental inequality in information theory.

Lemma 2.28 (Fundamental Inequality) For any $a > 0$,

$$\ln a \le a - 1 \tag{2.88}$$

with equality if and only if $a = 1$.

Proof Let $f(a) = \ln a - a + 1$. Then $f'(a) = \frac{1}{a} - 1$ and $f''(a) = -\frac{1}{a^2}$. Since $f(1) = 0$, $f'(1) = 0$, and $f''(1) = -1 < 0$, we see that $f(a)$ attains its maximum value 0 when $a = 1$. This proves (2.88). It is also clear that equality holds in (2.88) if and only if $a = 1$. Figure 2.3 is an illustration of the fundamental inequality.

Corollary 2.29 For any $a > 0$,

$$\ln a \ge 1 - \frac{1}{a} \tag{2.89}$$

with equality if and only if $a = 1$.

Proof This can be proved by replacing $a$ by $\frac{1}{a}$ in (2.88).

We can see from Figure 2.3 that the fundamental inequality results from the convexity of the logarithmic function. In fact, many important results in information theory are also direct or indirect consequences of the convexity of the logarithmic function!

Theorem 2.30 (Divergence Inequality) For any two probability distributions $p$ and $q$ on a common alphabet $\mathcal{X}$,

$$D(p\|q) \ge 0 \tag{2.90}$$

with equality if and only if $p = q$.

Proof If $q(x) = 0$ for some $x \in S_p$, then $D(p\|q) = \infty$ and the theorem is trivially true. Therefore, we assume that $q(x) > 0$ for all $x \in S_p$. Consider

$$D(p\|q) = (\log e)\sum_{x\in S_p} p(x)\ln\frac{p(x)}{q(x)} \tag{2.91}$$
$$\ge (\log e)\sum_{x\in S_p} p(x)\left(1 - \frac{q(x)}{p(x)}\right) \tag{2.92}$$
$$= (\log e)\left[\sum_{x\in S_p} p(x) - \sum_{x\in S_p} q(x)\right] \tag{2.93}$$
$$\ge (\log e)(1 - 1) \tag{2.94}$$
$$= 0, \tag{2.95}$$

where (2.92) results from an application of (2.89), and (2.94) follows from

$$\sum_{x\in S_p} q(x) \le 1 = \sum_{x\in S_p} p(x). \tag{2.96}$$

This proves (2.90). Now for equality to hold in (2.90), equality must hold in (2.92) for all $x \in S_p$, and equality must hold in (2.94). For the former, we see from Lemma 2.28 that this is the case if and only if $p(x) = q(x)$ for all $x \in S_p$. For the latter, i.e.,

$$\sum_{x\in S_p} q(x) = 1, \tag{2.97}$$

this is the case if and only if $S_p$ and $S_q$ coincide. Therefore, we conclude that equality holds in (2.90) if and only if $p(x) = q(x)$ for all $x \in \mathcal{X}$, i.e., $p = q$. The theorem is proved.

We now prove a very useful consequence of the divergence inequality called the log-sum inequality.

Theorem 2.31 (Log-sum inequality) For positive numbers $a_1, a_2, \ldots$ and nonnegative numbers $b_1, b_2, \ldots$ such that $\sum_i a_i < \infty$ and $0 < \sum_i b_i < \infty$,

$$\sum_i a_i\log\frac{a_i}{b_i} \ge \left(\sum_i a_i\right)\log\frac{\sum_i a_i}{\sum_i b_i} \tag{2.98}$$

with the convention that $\log\frac{a_i}{0} = \infty$. Moreover, equality holds if and only if $\frac{a_i}{b_i}$ is a constant for all $i$.

The log-sum inequality can easily be understood by writing it out for the case when there are two terms in each of the summations:

$$a_1\log\frac{a_1}{b_1} + a_2\log\frac{a_2}{b_2} \ge (a_1 + a_2)\log\frac{a_1 + a_2}{b_1 + b_2}. \tag{2.99}$$

Proof of Theorem 2.31 Let $a_i' = a_i/\sum_j a_j$ and $b_i' = b_i/\sum_j b_j$. Then $\{a_i'\}$ and $\{b_i'\}$ are probability distributions. Using the divergence inequality, we have

$$0 \le \sum_i a_i'\log\frac{a_i'}{b_i'} \tag{2.100}$$
$$= \sum_i \frac{a_i}{\sum_j a_j}\log\left(\frac{a_i}{b_i}\cdot\frac{\sum_j b_j}{\sum_j a_j}\right) \tag{2.101}$$
$$= \frac{1}{\sum_j a_j}\left[\sum_i a_i\log\frac{a_i}{b_i} - \left(\sum_i a_i\right)\log\frac{\sum_j a_j}{\sum_j b_j}\right], \tag{2.102}$$

which implies (2.98). Equality holds if and only if $a_i' = b_i'$ for all $i$, or $\frac{a_i}{b_i}$ is a constant for all $i$. The theorem is proved.

One can also prove the divergence inequality by using the log-sum inequality (see Problem 14), so the two inequalities are in fact equivalent. The log-sum inequality also finds application in proving the next theorem which gives a lower bound on the divergence between two probability distributions on a common alphabet in terms of the variational distance between them.

Definition 2.32 Let $p$ and $q$ be two probability distributions on a common alphabet $\mathcal{X}$. The variational distance between $p$ and $q$ is defined by

$$V(p,q) = \sum_{x\in\mathcal{X}} |p(x) - q(x)|. \tag{2.103}$$

Theorem 2.33 (Pinsker's inequality)

$$D(p\|q) \ge \frac{1}{2\ln 2}V^2(p,q). \tag{2.104}$$

The proof of this theorem is left as an exercise (see Problem 17). We will see further applications of the log-sum inequality when we discuss the convergence of some iterative algorithms in Chapter 10.
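To see the divergence inequality and Pinsker's inequality in action, here is a minimal sketch (illustrative only; the two distributions are arbitrary choices) that evaluates $D(p\|q)$ in bits, the variational distance $V(p,q)$, and checks (2.90) and (2.104).

```python
from math import log2, log, inf

def D(p, q):
    """Informational divergence D(p||q) in bits; p, q are lists over the same alphabet."""
    total = 0.0
    for px, qx in zip(p, q):
        if px > 0:
            if qx == 0:
                return inf          # convention: p(x) log(p(x)/0) = infinity
            total += px * log2(px / qx)
    return total

def V(p, q):
    """Variational distance V(p,q)."""
    return sum(abs(px - qx) for px, qx in zip(p, q))

p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]

print(D(p, q) >= 0)                              # divergence inequality (2.90)
print(D(p, q) >= V(p, q) ** 2 / (2 * log(2)))    # Pinsker's inequality (2.104)
print(D(p, q), D(q, p))                          # generally different: D is not symmetrical
```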
2.6 THE BASIC INEQUALITIES

In this section, we prove that all Shannon's information measures, namely entropy, conditional entropy, mutual information, and conditional mutual information, are always nonnegative. By this, we mean that these quantities are nonnegative for all joint distributions for the random variables involved.

Theorem 2.34 For random variables $X$, $Y$, and $Z$,

$$I(X;Y|Z) \ge 0, \tag{2.105}$$

with equality if and only if $X$ and $Y$ are independent conditioning on $Z$.

Proof This can be readily proved by observing that

$$I(X;Y|Z) = \sum_{x,y,z} p(x,y,z)\log\frac{p(x,y|z)}{p(x|z)p(y|z)} \tag{2.106}$$
$$= \sum_z p(z)\sum_{x,y} p(x,y|z)\log\frac{p(x,y|z)}{p(x|z)p(y|z)} \tag{2.107}$$
$$= \sum_z p(z)\,D(p_{XY|z}\,\|\,p_{X|z}p_{Y|z}), \tag{2.108}$$

where we have used $p_{XY|z}$ to denote $\{p(x,y|z),\ (x,y)\in\mathcal{X}\times\mathcal{Y}\}$, etc. Since for a fixed $z$, both $p_{XY|z}$ and $p_{X|z}p_{Y|z}$ are joint probability distributions on $\mathcal{X}\times\mathcal{Y}$, we have

$$D(p_{XY|z}\,\|\,p_{X|z}p_{Y|z}) \ge 0. \tag{2.109}$$

Therefore, we conclude that $I(X;Y|Z) \ge 0$. Finally, we see from Theorem 2.30 that $I(X;Y|Z) = 0$ if and only if for all $z \in S_Z$,

$$p(x,y|z) = p(x|z)p(y|z), \tag{2.110}$$

or

$$p(x,y,z) = p(x,z)p(y|z) \tag{2.111}$$

for all $x$ and $y$. Therefore, $X$ and $Y$ are independent conditioning on $Z$. The proof is accomplished.

As we have seen in Section 2.3, all Shannon's information measures are special cases of conditional mutual information; therefore we have already proved that all Shannon's information measures are always nonnegative. The nonnegativity of all Shannon's information measures is collectively referred to as the basic inequalities.

For entropy and conditional entropy, we offer the following more direct proof for their nonnegativity. Consider the entropy $H(X)$ of a random variable $X$. For all $x \in S_X$, since $0 < p(x) \le 1$, $\log p(x) \le 0$. It then follows from the definition in (2.33) that $H(X) \ge 0$. For the conditional entropy $H(Y|X)$ of random variable $Y$ given random variable $X$, since $H(Y|X=x) \ge 0$ for each $x \in S_X$, we see from (2.41) that $H(Y|X) \ge 0$.

Proposition 2.35 $H(X) = 0$ if and only if $X$ is deterministic.

Proof If $X$ is deterministic, i.e., there exists $x^* \in \mathcal{X}$ such that $p(x^*) = 1$ and $p(x) = 0$ for all $x \neq x^*$, then $H(X) = -p(x^*)\log p(x^*) = 0$. On the other hand, if $X$ is not deterministic, i.e., there exists $x^* \in \mathcal{X}$ such that $0 < p(x^*) < 1$, then $H(X) \ge -p(x^*)\log p(x^*) > 0$. Therefore, we conclude that $H(X) = 0$ if and only if $X$ is deterministic.

Proposition 2.36 $H(Y|X) = 0$ if and only if $Y$ is a function of $X$.

Proof From (2.41), we see that $H(Y|X) = 0$ if and only if $H(Y|X=x) = 0$ for each $x \in S_X$. Then from the last proposition, this happens if and only if $Y$ is deterministic for each given $x$. In other words, $Y$ is a function of $X$.

Proposition 2.37 $I(X;Y) = 0$ if and only if $X$ and $Y$ are independent.

Proof This is a special case of Theorem 2.34 with $Z$ being a degenerate random variable.

One can regard (conditional) mutual information as a measure of (conditional) dependency between two random variables. When the (conditional) mutual information is exactly equal to 0, the two random variables are (conditionally) independent.

We refer to inequalities involving Shannon's information measures only (possibly with constant terms) as information inequalities. The basic inequalities are important examples of information inequalities. Likewise, we refer to identities involving Shannon's information measures only as information identities. From the information identities (2.45), (2.55), and (2.63), we see that all Shannon's information measures can be expressed as linear combinations of entropies. Specifically,

$$H(Y|X) = H(X,Y) - H(X), \tag{2.112}$$
$$I(X;Y) = H(X) + H(Y) - H(X,Y), \tag{2.113}$$

and

$$I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z). \tag{2.114}$$

Therefore, an information inequality is an inequality which involves only entropies.

In information theory, information inequalities form the most important set of tools for proving converse coding theorems. Except for a few so-called non-Shannon-type inequalities, all known information inequalities are implied by the basic inequalities. Information inequalities will be studied systematically in Chapter 12 to Chapter 14. In the next section, we will prove some consequences of the basic inequalities which are often used in information theory.
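Before moving on, Theorem 2.34 can be probed numerically. The sketch below (illustrative; random distributions and hypothetical helper names, not from the text) draws random joint distributions on three binary variables and confirms that $I(X;Y|Z)$, computed as in (2.108) rewritten with joint probabilities, is never negative.

```python
import random
from math import log2
from collections import defaultdict

def marginal(p, idx):
    q = defaultdict(float)
    for k, v in p.items():
        q[tuple(k[i] for i in idx)] += v
    return q

def cond_mutual_info(p_xyz):
    """I(X;Y|Z) in bits from a joint pmf {(x, y, z): prob}."""
    p_xz, p_yz, p_z = marginal(p_xyz, (0, 2)), marginal(p_xyz, (1, 2)), marginal(p_xyz, (2,))
    return sum(v * log2(v * p_z[(z,)] / (p_xz[(x, z)] * p_yz[(y, z)]))
               for (x, y, z), v in p_xyz.items() if v > 0)

random.seed(0)
outcomes = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]
for _ in range(1000):
    weights = [random.random() for _ in outcomes]
    s = sum(weights)
    p = {o: w / s for o, w in zip(outcomes, weights)}
    assert cond_mutual_info(p) >= -1e-12   # basic inequality (2.105), up to rounding

print("I(X;Y|Z) was nonnegative in all trials.")
```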
2.7 SOME USEFUL INFORMATION INEQUALITIES

In this section, we prove some useful consequences of the basic inequalities introduced in the last section. Note that the conditional versions of these inequalities can be proved by similar techniques as in Proposition 2.24.

Theorem 2.38 (Conditioning Does Not Increase Entropy)
$$H(Y|X) \leq H(Y) \qquad (2.115)$$
with equality if and only if $X$ and $Y$ are independent.

Proof This can be proved by considering
$$H(Y|X) = H(Y) - I(X;Y) \leq H(Y), \qquad (2.116)$$
where the inequality follows because $I(X;Y)$ is always nonnegative. The inequality is tight if and only if $I(X;Y) = 0$, which is equivalent to $X$ and $Y$ being independent by Proposition 2.37.

Similarly, it can be shown that $H(Y|X,Z) \leq H(Y|Z)$, which is the conditional version of the above proposition. These results have the following interpretation. Suppose $Y$ is a random variable we are interested in, and $X$ and $Z$ are side-information about $Y$. Then our uncertainty about $Y$ cannot be increased on the average upon knowing side-information $X$. Once we know $X$, our uncertainty about $Y$ again cannot be increased on the average upon further knowing side-information $Z$.

Remark Unlike entropy, the mutual information between two random variables can be increased by conditioning on a third random variable. We refer the reader to Section 6.4 for a discussion.

Theorem 2.39 (Independence Bound for Entropy)
$$H(X_1, X_2, \ldots, X_n) \leq \sum_{i=1}^n H(X_i) \qquad (2.117)$$
with equality if and only if $X_1, X_2, \ldots, X_n$ are mutually independent.

Proof By the chain rule for entropy,
$$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^n H(X_i | X_1, \ldots, X_{i-1}) \qquad (2.118)$$
$$\leq \sum_{i=1}^n H(X_i), \qquad (2.119)$$
where the inequality follows because we have proved in the last theorem that conditioning does not increase entropy. The inequality is tight if and only if it is tight for each $i$, i.e.,
$$H(X_i | X_1, \ldots, X_{i-1}) = H(X_i) \qquad (2.120)$$
for $1 \leq i \leq n$. From the last theorem, this is equivalent to $X_i$ being independent of $X_1, X_2, \ldots, X_{i-1}$ for each $i$. Then
$$p(x_1, x_2, \ldots, x_n) = p(x_1, x_2, \ldots, x_{n-1})\, p(x_n) \qquad (2.121)$$
$$= p(x_1, x_2, \ldots, x_{n-2})\, p(x_{n-1})\, p(x_n) \qquad (2.122)$$
$$\vdots$$
$$= p(x_1)\, p(x_2) \cdots p(x_n) \qquad (2.123)$$
for all $x_1, x_2, \ldots, x_n$, i.e., $X_1, X_2, \ldots, X_n$ are mutually independent.

Alternatively, we can prove the theorem by considering
$$\sum_{i=1}^n H(X_i) - H(X_1, X_2, \ldots, X_n)$$
$$= -E \log \big[ p(X_1)\, p(X_2) \cdots p(X_n) \big] + E \log p(X_1, X_2, \ldots, X_n) \qquad (2.125)$$
$$= E \log \frac{p(X_1, X_2, \ldots, X_n)}{p(X_1)\, p(X_2) \cdots p(X_n)} \qquad (2.126)$$
$$= D\big( p_{X_1 X_2 \cdots X_n} \,\big\|\, p_{X_1} p_{X_2} \cdots p_{X_n} \big) \qquad (2.127)$$
$$\geq 0, \qquad (2.128)$$
where equality holds if and only if
$$p(x_1, x_2, \ldots, x_n) = p(x_1)\, p(x_2) \cdots p(x_n) \qquad (2.129)$$
for all $x_1, x_2, \ldots, x_n$, i.e., $X_1, X_2, \ldots, X_n$ are mutually independent.

Theorem 2.40
$$I(X; Y, Z) \geq I(X; Y), \qquad (2.130)$$
with equality if and only if $X \to Y \to Z$ forms a Markov chain.

Proof By the chain rule for mutual information, we have
$$I(X; Y, Z) = I(X; Y) + I(X; Z | Y) \geq I(X; Y). \qquad (2.131)$$
The above inequality is tight if and only if $I(X; Z|Y) = 0$, or $X \to Y \to Z$ forms a Markov chain. The theorem is proved.
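As a quick numerical illustration (with a hypothetical correlated joint distribution on two bits, not an example from the text), the following sketch checks Theorem 2.38 and Theorem 2.39 for a pair of random variables.

```python
from math import log2

# hypothetical joint pmf p(x, y) on {0,1}^2 with X and Y positively correlated
pXY = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def entropy(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

pX = {x: sum(p for (a, _), p in pXY.items() if a == x) for x in (0, 1)}
pY = {y: sum(p for (_, b), p in pXY.items() if b == y) for y in (0, 1)}

H_XY, H_X, H_Y = entropy(pXY), entropy(pX), entropy(pY)
H_Y_given_X = H_XY - H_X

assert H_Y_given_X <= H_Y + 1e-12      # conditioning does not increase entropy
assert H_XY <= H_X + H_Y + 1e-12       # independence bound for entropy
print(f'H(Y|X) = {H_Y_given_X:.3f} <= H(Y) = {H_Y:.3f}')
```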
Lemma 2.41 If $X \to Y \to Z$ forms a Markov chain, then
$$I(X; Z) \leq I(X; Y) \qquad (2.132)$$
and
$$I(X; Z) \leq I(Y; Z). \qquad (2.133)$$

The meaning of this inequality is the following. Suppose $X$ is a random variable we are interested in, and $Y$ is an observation of $X$. If we infer $X$ via $Y$, our uncertainty about $X$ on the average is $H(X|Y)$. Now suppose we process $Y$ (either deterministically or probabilistically) to obtain a random variable $Z$. If we infer $X$ via $Z$, our uncertainty about $X$ on the average is $H(X|Z)$. Since $X \to Y \to Z$ forms a Markov chain, from (2.132), we have
$$H(X|Z) = H(X) - I(X;Z) \qquad (2.134)$$
$$\geq H(X) - I(X;Y) \qquad (2.135)$$
$$= H(X|Y), \qquad (2.136)$$
i.e., further processing $Y$ can only increase our uncertainty about $X$ on the average.

Proof of Lemma 2.41 Assume $X \to Y \to Z$, i.e., $X \perp Z \mid Y$. By Theorem 2.34, we have
$$I(X; Z | Y) = 0. \qquad (2.137)$$
Then
$$I(X; Z) = I(X; Y, Z) - I(X; Y | Z) \qquad (2.138)$$
$$\leq I(X; Y, Z) \qquad (2.139)$$
$$= I(X; Y) + I(X; Z | Y) \qquad (2.140)$$
$$= I(X; Y). \qquad (2.141)$$
In (2.138) and (2.140), we have used the chain rule for mutual information. The inequality in (2.139) follows because $I(X;Y|Z)$ is always nonnegative, and (2.141) follows from (2.137). This proves (2.132).

Since $X \to Y \to Z$ is equivalent to $Z \to Y \to X$, we also have proved (2.133). This completes the proof of the lemma.

From Lemma 2.41, we can prove the more general data processing theorem.

Theorem 2.42 (Data Processing Theorem) If $U \to X \to Y \to V$ forms a Markov chain, then
$$I(U; V) \leq I(X; Y). \qquad (2.142)$$

Proof Assume $U \to X \to Y \to V$. Then by Proposition 2.10, we have $U \to X \to Y$ and $U \to Y \to V$. From the first Markov chain and Lemma 2.41, we have
$$I(U; Y) \leq I(X; Y). \qquad (2.143)$$
From the second Markov chain and Lemma 2.41, we have
$$I(U; V) \leq I(U; Y). \qquad (2.144)$$
Combining (2.143) and (2.144), we obtain (2.142), proving the theorem.
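The data processing theorem can be seen concretely by cascading two noisy channels. The sketch below uses hypothetical crossover probabilities and the binary symmetric channel purely as a convenient example (none of these numbers come from the text) to check that $I(X;Z) \leq I(X;Y)$ when $X \to Y \to Z$.

```python
from math import log2

def h_b(p):                        # binary entropy function
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def flip_prob(p_one, crossover):
    # Pr{output = 1} of a BSC(crossover) when Pr{input = 1} = p_one
    return p_one * (1 - crossover) + (1 - p_one) * crossover

pX1, e1, e2 = 0.3, 0.1, 0.2        # assumed Pr{X=1} and the two crossover probabilities
pY1 = flip_prob(pX1, e1)           # Y = X through BSC(e1)
pZ1 = flip_prob(pY1, e2)           # Z = Y through BSC(e2), so X -> Y -> Z

I_XY = h_b(pY1) - h_b(e1)          # I(X;Y) = H(Y) - H(Y|X) for a BSC
e12 = e1 * (1 - e2) + (1 - e1) * e2   # the cascade is a BSC with this crossover
I_XZ = h_b(pZ1) - h_b(e12)

assert I_XZ <= I_XY + 1e-12
print(f'I(X;Y) = {I_XY:.4f} bits, I(X;Z) = {I_XZ:.4f} bits')
```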
2.8 FANO'S INEQUALITY

In the last section, we have proved a few information inequalities involving only Shannon's information measures. In this section, we first prove an upper bound on the entropy of a random variable in terms of the size of the alphabet. This inequality is then used in the proof of Fano's inequality, which is extremely useful in proving converse coding theorems in information theory.

Theorem 2.43 For any random variable $X$,
$$H(X) \leq \log |\mathcal{X}|, \qquad (2.145)$$
where $|\mathcal{X}|$ denotes the size of the alphabet $\mathcal{X}$. This upper bound is tight if and only if $X$ distributes uniformly on $\mathcal{X}$.

Proof Let $u$ be the uniform distribution on $\mathcal{X}$, i.e., $u(x) = |\mathcal{X}|^{-1}$ for all $x \in \mathcal{X}$. Then
$$\log |\mathcal{X}| - H(X) = -\sum_{x \in \mathcal{S}_X} p(x) \log |\mathcal{X}|^{-1} + \sum_{x \in \mathcal{S}_X} p(x) \log p(x) \qquad (2.146)$$
$$= -\sum_{x \in \mathcal{S}_X} p(x) \log u(x) + \sum_{x \in \mathcal{S}_X} p(x) \log p(x) \qquad (2.147)$$
$$= \sum_{x \in \mathcal{S}_X} p(x) \log \frac{p(x)}{u(x)} \qquad (2.148)$$
$$= D(p \| u) \qquad (2.149)$$
$$\geq 0, \qquad (2.150)$$
proving (2.145). This upper bound is tight if and only if $D(p\|u) = 0$, which from Theorem 2.30 is equivalent to $p(x) = u(x)$ for all $x \in \mathcal{X}$, completing the proof.

Corollary 2.44 The entropy of a random variable may take any nonnegative real value.

Proof Consider a random variable $X$ and let $|\mathcal{X}|$ be fixed. We see from the last theorem that $H(X) = \log|\mathcal{X}|$ is achieved when $X$ distributes uniformly on $\mathcal{X}$. On the other hand, $H(X) = 0$ is achieved when $X$ is deterministic. For $0 \leq a \leq \log|\mathcal{X}|$, by the intermediate value theorem, there exists a distribution for $X$ such that $H(X) = a$. Then we see that $H(X)$ can take any positive value by letting $|\mathcal{X}|$ be sufficiently large. This accomplishes the proof.

Remark Let $|\mathcal{X}| = D$, or the random variable $X$ is a $D$-ary symbol. When the base of the logarithm is $D$, (2.145) becomes
$$H_D(X) \leq 1. \qquad (2.151)$$
Recall that the unit of entropy is the $D$-it when the logarithm is in the base $D$. This inequality says that a $D$-ary symbol can carry at most 1 $D$-it of information. This maximum is achieved when $X$ has a uniform distribution. We have already seen the binary case when we discussed the binary entropy function $h_b$ in Section 2.2.

We see from Theorem 2.43 that the entropy of a random variable is finite as long as it has a finite alphabet. However, if a random variable has an infinite alphabet, its entropy may or may not be finite. This will be shown in the next two examples.

Example 2.45 Let $X$ be a random variable such that
$$\Pr\{X = i\} = 2^{-i}, \quad i = 1, 2, \ldots \qquad (2.152)$$
Then
$$H_2(X) = \sum_{i=1}^{\infty} i\, 2^{-i} = 2, \qquad (2.153)$$
which is finite.

Example 2.46 Let $Y$ be a random variable which takes values in the subset of pairs of integers
$$\left\{ (i, j) : 1 \leq i < \infty \text{ and } 1 \leq j \leq \frac{2^{2^i}}{2^i} \right\} \qquad (2.154)$$
such that
$$\Pr\{Y = (i, j)\} = 2^{-2^i} \qquad (2.155)$$
for all $i$ and $j$. First, we check that
$$\sum_{i=1}^{\infty} \sum_{j=1}^{2^{2^i}/2^i} \Pr\{Y = (i,j)\} = \sum_{i=1}^{\infty} \frac{2^{2^i}}{2^i} \cdot 2^{-2^i} = \sum_{i=1}^{\infty} 2^{-i} = 1. \qquad (2.156)$$
Then
$$H_2(Y) = -\sum_{i=1}^{\infty} \sum_{j=1}^{2^{2^i}/2^i} 2^{-2^i} \log_2 2^{-2^i} = \sum_{i=1}^{\infty} \frac{2^{2^i}}{2^i} \cdot 2^{-2^i} \cdot 2^i = \sum_{i=1}^{\infty} 1, \qquad (2.157)$$
which does not converge.

Let $X$ be a random variable and $\hat{X}$ be an estimate of $X$ which takes value in the same alphabet $\mathcal{X}$. Let the probability of error $P_e$ be
$$P_e = \Pr\{X \neq \hat{X}\}. \qquad (2.158)$$
If $P_e = 0$, i.e., $X = \hat{X}$ with probability 1, then $H(X|\hat{X}) = 0$ by Proposition 2.36. Intuitively, if $P_e$ is small, i.e., $X = \hat{X}$ with probability close to 1, then $H(X|\hat{X})$ should be close to 0. Fano's inequality makes this intuition precise.

Theorem 2.47 (Fano's Inequality) Let $X$ and $\hat{X}$ be random variables taking values in the same alphabet $\mathcal{X}$. Then
$$H(X|\hat{X}) \leq h_b(P_e) + P_e \log(|\mathcal{X}| - 1), \qquad (2.159)$$
where $h_b$ is the binary entropy function.

Proof Define a random variable
$$Y = \begin{cases} 0 & \text{if } X = \hat{X} \\ 1 & \text{if } X \neq \hat{X}. \end{cases} \qquad (2.160)$$
The random variable $Y$ is an indicator of the error event $\{X \neq \hat{X}\}$, with $\Pr\{Y = 1\} = P_e$ and $H(Y) = h_b(P_e)$. Since $Y$ is a function of $X$ and $\hat{X}$,
$$H(Y|X, \hat{X}) = 0. \qquad (2.161)$$
Then
$$H(X|\hat{X}) = I(X; Y|\hat{X}) + H(X|\hat{X}, Y) \qquad (2.162)$$
$$= H(Y|\hat{X}) - H(Y|X, \hat{X}) + H(X|\hat{X}, Y) \qquad (2.163)$$
$$= H(Y|\hat{X}) + H(X|\hat{X}, Y) \qquad (2.164)$$
$$\leq H(Y) + H(X|\hat{X}, Y) \qquad (2.165)$$
$$= H(Y) + \sum_{\hat{x} \in \mathcal{X}} \Big[ \Pr\{\hat{X} = \hat{x}, Y = 0\}\, H(X|\hat{X} = \hat{x}, Y = 0) + \Pr\{\hat{X} = \hat{x}, Y = 1\}\, H(X|\hat{X} = \hat{x}, Y = 1) \Big]. \qquad (2.166)$$
In the above, (2.164) follows from (2.161), (2.165) follows because conditioning does not increase entropy, and (2.166) follows from an application of (2.41). Now $X$ must take the value $\hat{x}$ if $\hat{X} = \hat{x}$ and $Y = 0$. In other words, $X$ is conditionally deterministic given $\hat{X} = \hat{x}$ and $Y = 0$. Therefore, by Proposition 2.35,
$$H(X|\hat{X} = \hat{x}, Y = 0) = 0. \qquad (2.167)$$
If $\hat{X} = \hat{x}$ and $Y = 1$, then $X$ must take values in the set $\{x \in \mathcal{X} : x \neq \hat{x}\}$, which contains $|\mathcal{X}| - 1$ elements. From the last theorem, we have
$$H(X|\hat{X} = \hat{x}, Y = 1) \leq \log(|\mathcal{X}| - 1), \qquad (2.168)$$
where this upper bound does not depend on $\hat{x}$. Hence,
$$H(X|\hat{X}) \leq h_b(P_e) + \sum_{\hat{x} \in \mathcal{X}} \Pr\{\hat{X} = \hat{x}, Y = 1\} \log(|\mathcal{X}| - 1) \qquad (2.169)$$
$$= h_b(P_e) + \Pr\{Y = 1\} \log(|\mathcal{X}| - 1) \qquad (2.170)$$
$$= h_b(P_e) + P_e \log(|\mathcal{X}| - 1), \qquad (2.171)$$
proving (2.159). This accomplishes the proof.

Very often, we only need the following simplified version when we apply Fano's inequality. The proof is omitted.

Corollary 2.48 $H(X|\hat{X}) < 1 + P_e \log |\mathcal{X}|$.
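A small numerical example makes the bound concrete. The sketch below uses a hypothetical joint distribution of $X$ and its estimate $\hat{X}$ on a ternary alphabet (the numbers are assumptions, not from the text) and checks (2.159).

```python
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# hypothetical joint pmf p(x, x_hat) on the alphabet {0, 1, 2}
joint = {(0, 0): 0.30, (1, 1): 0.30, (2, 2): 0.20,
         (0, 1): 0.05, (1, 2): 0.05, (2, 0): 0.10}
alphabet = {0, 1, 2}

P_e = sum(p for (x, xh), p in joint.items() if x != xh)

# H(X | X_hat) = H(X, X_hat) - H(X_hat)
p_xhat = {xh: sum(p for (_, b), p in joint.items() if b == xh) for xh in alphabet}
H_X_given_Xhat = entropy(joint.values()) - entropy(p_xhat.values())

fano_bound = entropy([P_e, 1 - P_e]) + P_e * log2(len(alphabet) - 1)
assert H_X_given_Xhat <= fano_bound + 1e-12
print(f'H(X|X_hat) = {H_X_given_Xhat:.3f} <= Fano bound = {fano_bound:.3f}')
```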
Fano's inequality has the following implication. If the alphabet $\mathcal{X}$ is finite, as $P_e \to 0$, the upper bound in (2.159) tends to 0, which implies that $H(X|\hat{X})$ also tends to 0. However, this is not necessarily the case if $\mathcal{X}$ is infinite, which is shown in the next example.

Example 2.49 Let $\hat{X}$ take the value 0 with probability 1. Let $Z$ be an independent binary random variable taking values in $\{0, 1\}$. Define the random variable
$$X = \begin{cases} 0 & \text{if } Z = 0 \\ Y & \text{if } Z = 1, \end{cases} \qquad (2.172)$$
where $Y$ is the random variable in Example 2.46 whose entropy is infinity. Let
$$P_e = \Pr\{X \neq \hat{X}\} = \Pr\{Z = 1\}. \qquad (2.173)$$
Then
$$H(X|\hat{X}) = H(X) \qquad (2.174)$$
$$\geq H(X|Z) \qquad (2.175)$$
$$= \Pr\{Z = 0\}\, H(X|Z=0) + \Pr\{Z = 1\}\, H(X|Z=1) \qquad (2.176)$$
$$= (1 - P_e) \cdot 0 + P_e \cdot H(Y) \qquad (2.177)$$
$$= \infty \qquad (2.178)$$
for any $P_e > 0$. Therefore, $H(X|\hat{X})$ does not tend to 0 as $P_e \to 0$.

2.9 ENTROPY RATE OF STATIONARY SOURCE

In the previous sections, we have discussed various properties of the entropy of a finite collection of random variables. In this section, we discuss the entropy rate of a discrete-time information source.

A discrete-time information source $\{X_k, k \geq 1\}$ is an infinite collection of random variables indexed by the set of positive integers. Since the index set is ordered, it is natural to regard the indices as time indices. We will refer to the random variables $X_k$ as letters.

We assume that $H(X_k) < \infty$ for all $k$. Then for any finite subset $A$ of the index set $\{k : k \geq 1\}$, we have
$$H(X_k, k \in A) \leq \sum_{k \in A} H(X_k) < \infty. \qquad (2.180)$$
However, it is not meaningful to discuss $H(X_k, k \geq 1)$ because the joint entropy of an infinite collection of letters is infinite except for very special cases. On the other hand, since the indices are ordered, we can naturally define the entropy rate of an information source, which gives the average entropy per letter of the source.

Definition 2.50 The entropy rate of an information source $\{X_k\}$ is defined as
$$H_X = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n) \qquad (2.181)$$
when the limit exists.

We show in the next two examples that the entropy rate of a source may or may not exist.

Example 2.51 Let $\{X_k\}$ be an i.i.d. source with generic random variable $X$. Then
$$\lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n) = \lim_{n \to \infty} \frac{n H(X)}{n} = \lim_{n \to \infty} H(X) = H(X),$$
i.e., the entropy rate of an i.i.d. source is the entropy of a single letter.

Example 2.52 Let $\{X_k\}$ be a source such that $X_k$ are mutually independent and $H(X_k) = k$ for $k \geq 1$. Then
$$\frac{1}{n} H(X_1, X_2, \ldots, X_n) = \frac{1}{n} \sum_{k=1}^{n} k = \frac{1}{n} \cdot \frac{n(n+1)}{2} = \frac{n+1}{2},$$
which does not converge although $H(X_k) < \infty$ for all $k$. Therefore, the entropy rate of $\{X_k\}$ does not exist.

Toward characterizing the asymptotic behavior of $\{X_k\}$, it is natural to consider the limit
$$H'_X = \lim_{n \to \infty} H(X_n | X_1, X_2, \ldots, X_{n-1}) \qquad (2.188)$$
if it exists. The quantity $H(X_n | X_1, X_2, \ldots, X_{n-1})$ is interpreted as the conditional entropy of the next letter given that we know all the past history of the source, and $H'_X$ is the limit of this quantity after the source has been run for an indefinite amount of time.

Definition 2.53 An information source $\{X_k\}$ is stationary if
$$X_1, X_2, \ldots, X_m \qquad (2.189)$$
and
$$X_{1+l}, X_{2+l}, \ldots, X_{m+l} \qquad (2.190)$$
have the same joint distribution for any $m, l \geq 1$.
In the rest of the section, we will show that stationarity is a sufficient condition for the existence of the entropy rate of an information source.

Lemma 2.54 Let $\{X_k\}$ be a stationary source. Then $H'_X$ exists.

Proof Since $H(X_n | X_1, X_2, \ldots, X_{n-1})$ is lower bounded by zero for all $n$, it suffices to prove that $H(X_n | X_1, X_2, \ldots, X_{n-1})$ is non-increasing in $n$ to conclude that the limit $H'_X$ exists. Toward this end, for $n \geq 2$, consider
$$H(X_n | X_1, X_2, \ldots, X_{n-1}) \leq H(X_n | X_2, X_3, \ldots, X_{n-1}) \qquad (2.191)$$
$$= H(X_{n-1} | X_1, X_2, \ldots, X_{n-2}), \qquad (2.192)$$
where the last step is justified by the stationarity of $\{X_k\}$. The lemma is proved.

Lemma 2.55 (Cesàro Mean) Let $a_k$ and $b_k$ be real numbers. If $a_n \to a$ as $n \to \infty$ and $b_n = \frac{1}{n} \sum_{k=1}^n a_k$, then $b_n \to a$ as $n \to \infty$.

Proof The idea of the lemma is the following. If $a_n \to a$ as $n \to \infty$, then the average of the first $n$ terms in $\{a_k\}$, namely $b_n$, also tends to $a$ as $n \to \infty$. The lemma is formally proved as follows. Since $a_n \to a$ as $n \to \infty$, for every $\epsilon > 0$, there exists $N(\epsilon)$ such that $|a_n - a| < \epsilon$ for all $n > N(\epsilon)$. For $n > N(\epsilon)$, consider
$$|b_n - a| = \left| \frac{1}{n} \sum_{k=1}^{n} a_k - a \right| \qquad (2.193)$$
$$= \left| \frac{1}{n} \sum_{k=1}^{n} (a_k - a) \right| \qquad (2.194)$$
$$\leq \frac{1}{n} \sum_{k=1}^{n} |a_k - a| \qquad (2.195)$$
$$= \frac{1}{n} \left( \sum_{k=1}^{N(\epsilon)} |a_k - a| + \sum_{k=N(\epsilon)+1}^{n} |a_k - a| \right) \qquad (2.196)$$
$$< \frac{1}{n} \sum_{k=1}^{N(\epsilon)} |a_k - a| + \frac{n - N(\epsilon)}{n}\, \epsilon \qquad (2.197)$$
$$< \frac{1}{n} \sum_{k=1}^{N(\epsilon)} |a_k - a| + \epsilon. \qquad (2.198)$$
The first term tends to 0 as $n \to \infty$. Therefore, for any $\epsilon > 0$, by taking $n$ to be sufficiently large, we can make $|b_n - a| < 2\epsilon$. Hence $b_n \to a$ as $n \to \infty$, proving the lemma.

We now prove that $H'_X$ is an alternative definition/interpretation of the entropy rate of $\{X_k\}$ when $\{X_k\}$ is stationary.

Theorem 2.56 For a stationary source $\{X_k\}$, the entropy rate $H_X$ exists, and it is equal to $H'_X$.

Proof Since we have proved in Lemma 2.54 that $H'_X$ always exists for a stationary source $\{X_k\}$, in order to prove the theorem, we only have to prove that $H_X = H'_X$. By the chain rule for entropy,
$$\frac{1}{n} H(X_1, X_2, \ldots, X_n) = \frac{1}{n} \sum_{k=1}^{n} H(X_k | X_1, X_2, \ldots, X_{k-1}). \qquad (2.199)$$
Since
$$\lim_{k \to \infty} H(X_k | X_1, X_2, \ldots, X_{k-1}) = H'_X \qquad (2.200)$$
from (2.188), it follows from Lemma 2.55 that
$$H_X = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n) = H'_X. \qquad (2.201)$$
The theorem is proved.

In this theorem, we have proved that the entropy rate of a random source $\{X_k\}$ exists under the fairly general assumption that $\{X_k\}$ is stationary. However, the entropy rate of a stationary source $\{X_k\}$ may not carry any physical meaning unless $\{X_k\}$ is also ergodic. This will be explained when we discuss the Shannon-McMillan-Breiman Theorem in Section 4.4.
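The behavior established in Lemma 2.54 and Theorem 2.56 can be observed numerically. The sketch below uses a hypothetical two-state Markov source started from its stationary distribution (so that the source is stationary); the transition probabilities are assumptions for illustration only.

```python
from itertools import product
from math import log2

P  = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}   # assumed transition probabilities
pi = {0: 2/3, 1: 1/3}                             # its stationary distribution

def joint_prob(seq):
    prob = pi[seq[0]]
    for a, b in zip(seq, seq[1:]):
        prob *= P[a][b]
    return prob

def H_joint(n):                                   # H(X_1, ..., X_n) in bits
    return -sum(p * log2(p) for seq in product([0, 1], repeat=n)
                for p in [joint_prob(seq)])

for n in range(2, 8):
    cond = H_joint(n) - H_joint(n - 1)            # H(X_n | X_1, ..., X_{n-1})
    rate = H_joint(n) / n                         # (1/n) H(X_1, ..., X_n)
    print(f'n = {n}: conditional entropy = {cond:.4f}, per-letter entropy = {rate:.4f}')
# for this Markov source the conditional entropy is already constant (about 0.553
# bits) for every n >= 2, while the per-letter entropy decreases toward the same
# limit H'_X, as Theorem 2.56 asserts
```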
PROBLEMS

1. Let $X$ and $Y$ be random variables with alphabets $\mathcal{X} = \mathcal{Y} = \{1,2,3,4,5\}$ and joint distribution $p(x,y)$ given by a $5 \times 5$ table of probabilities, each entry a multiple of $\frac{1}{25}$. Determine $H(X)$, $H(Y)$, $H(X|Y)$, $H(Y|X)$, and $I(X;Y)$.

2. Prove Propositions 2.8, 2.9, 2.10, 2.19, 2.21, and 2.22.

3. Give an example which shows that pairwise independence does not imply mutual independence.

4. Verify that $p(x,y,z)$ as defined in Definition 2.4 is a probability distribution. You should exclude all the zero probability masses from the summation carefully.

5. Linearity of expectation It is well-known that expectation is linear, i.e., $E[f(X) + g(Y)] = E f(X) + E g(Y)$, where the summation in an expectation is taken over the corresponding alphabet. However, we adopt in information theory the convention that the summation in an expectation is taken over the corresponding support. Justify carefully the linearity of expectation under this convention.

6. Let $C_\alpha = \sum_{n=2}^{\infty} \big[ n (\log n)^\alpha \big]^{-1}$.
a) Prove that
$$C_\alpha \begin{cases} < \infty & \text{if } \alpha > 1 \\ = \infty & \text{if } 0 \leq \alpha \leq 1, \end{cases}$$
so that $p_\alpha(n) = \big[ C_\alpha\, n (\log n)^\alpha \big]^{-1}$, $n = 2, 3, \ldots$, is a probability distribution for $\alpha > 1$.
b) Prove that
$$H(p_\alpha) \begin{cases} < \infty & \text{if } \alpha > 2 \\ = \infty & \text{if } 1 < \alpha \leq 2. \end{cases}$$

7. Prove that $H(p)$ is concave in $p$, i.e.,
$$\lambda H(p_1) + \bar{\lambda} H(p_2) \leq H(\lambda p_1 + \bar{\lambda} p_2),$$
where $\bar{\lambda} = 1 - \lambda$ and $0 \leq \lambda \leq 1$.

8. Let $(X, Y) \sim p(x, y) = p(x)\, p(y|x)$.
a) Prove that for fixed $p(x)$, $I(X;Y)$ is a convex functional of $p(y|x)$.
b) Prove that for fixed $p(y|x)$, $I(X;Y)$ is a concave functional of $p(x)$.

9. Do $I(X;Y) = 0$ and $I(X;Y|Z) = 0$ imply each other? If so, give a proof. If not, give a counterexample.

10. Let $X$ be a function of $Y$. Prove that $H(X) \leq H(Y)$. Interpret this result.

11. Prove that for any $n \geq 2$,
$$H(X_1, X_2, \ldots, X_n) \geq \sum_{i=1}^{n} H(X_i \mid X_j, j \neq i).$$

12. Prove that
$$H(X_1, X_2) + H(X_2, X_3) + H(X_1, X_3) \geq 2\, H(X_1, X_2, X_3).$$
Hint: Sum the identities
$$H(X_1, X_2, X_3) = H(X_j, j \neq i) + H(X_i \mid X_j, j \neq i)$$
for $i = 1, 2, 3$ and apply the result in Problem 11.

13. Let $\mathcal{N}_n = \{1, 2, \ldots, n\}$ and denote $H(X_i, i \in A)$ by $H(X_A)$ for any subset $A$ of $\mathcal{N}_n$. For $1 \leq k \leq n$, let
$$H_k = \frac{1}{\binom{n}{k}} \sum_{A : |A| = k} \frac{H(X_A)}{k}.$$
Prove that
$$H_1 \geq H_2 \geq \cdots \geq H_n.$$
This sequence of inequalities, due to Han [87], is a generalization of the independence bound for entropy (Theorem 2.39).

14. Prove the divergence inequality by using the log-sum inequality.

15. Prove that $D(p\|q)$ is convex in the pair $(p, q)$, i.e., if $(p_1, q_1)$ and $(p_2, q_2)$ are two pairs of probability distributions on a common alphabet, then
$$D(\lambda p_1 + \bar{\lambda} p_2 \,\|\, \lambda q_1 + \bar{\lambda} q_2) \leq \lambda D(p_1 \| q_1) + \bar{\lambda} D(p_2 \| q_2)$$
for all $0 \leq \lambda \leq 1$, where $\bar{\lambda} = 1 - \lambda$.

16. Let $p_{XY}$ and $q_{XY}$ be two probability distributions on $\mathcal{X} \times \mathcal{Y}$. Prove that $D(p_{XY} \| q_{XY}) \geq D(p_X \| q_X)$.

17. Pinsker's inequality Let $V(p, q)$ denote the variational distance between two probability distributions $p$ and $q$ on a common alphabet $\mathcal{X}$. We will determine the largest $c$ which satisfies
$$D(p\|q) \geq c\, V^2(p, q).$$
a) Let $A = \{x : p(x) \geq q(x)\}$, $\hat{p} = \{p(A), 1 - p(A)\}$, and $\hat{q} = \{q(A), 1 - q(A)\}$. Show that $D(p\|q) \geq D(\hat{p}\|\hat{q})$ and $V(p, q) = V(\hat{p}, \hat{q})$.
b) Show that toward determining the largest value of $c$, we only have to consider the case when $\mathcal{X}$ is binary.
c) By virtue of b), it suffices to determine the largest $c$ such that
$$p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q} - 4c\,(p - q)^2 \geq 0$$
for all $0 \leq p, q \leq 1$, with the convention that $0 \log \frac{0}{a} = 0$ for $a \geq 0$ and $a \log \frac{a}{0} = \infty$ for $a > 0$. By observing that equality in the above holds if $p = q$ and considering the derivative of the left hand side with respect to $q$, show that the largest value of $c$ is equal to $\frac{1}{2 \ln 2}$.

18. Find a necessary and sufficient condition for Fano's inequality to be tight.

19. Let $p = \{p_1, p_2, \ldots, p_n\}$ and $q = \{q_1, q_2, \ldots, q_n\}$ be two probability distributions such that $p_i \geq p_{i+1}$ and $q_i \geq q_{i+1}$ for all $i$, and $\sum_{i=1}^{k} p_i \geq \sum_{i=1}^{k} q_i$ for all $k$. Prove that $H(p) \leq H(q)$.
a) Show that for $p \neq q$, there exist indices $1 \leq j < k \leq n$ which satisfy the following:
i) $p_j > q_j$;
ii) $p_k < q_k$;
iii) $p_i = q_i$ for all $j < i < k$.
b) Consider the distribution $\hat{q} = \{\hat{q}_1, \hat{q}_2, \ldots, \hat{q}_n\}$ defined by $\hat{q}_i = q_i$ for $i \neq j, k$ and
$$(\hat{q}_j, \hat{q}_k) = \begin{cases} (p_j,\ q_j + q_k - p_j) & \text{if } p_j - q_j \leq q_k - p_k \\ (q_j + q_k - p_k,\ p_k) & \text{if } p_j - q_j > q_k - p_k. \end{cases}$$
Note that either $\hat{q}_j = p_j$ or $\hat{q}_k = p_k$. Show that
i) $\hat{q}_i \geq \hat{q}_{i+1}$ for all $i$;
ii) $\sum_{i=1}^{k} p_i \geq \sum_{i=1}^{k} \hat{q}_i$ for all $k$;
iii) $H(\hat{q}) \leq H(q)$.
c) Prove the result by induction on the Hamming distance between $p$ and $q$, i.e., the number of places where $p$ and $q$ differ.
(Hardy, Littlewood, and Pólya [91].)
HISTORICAL NOTES

The concept of entropy has its root in thermodynamics. Shannon [173] was the first to use entropy as a measure of information. Informational divergence was introduced by Kullback and Leibler [119], and it has been studied extensively by Csiszár [50] and Amari [11].

The material in this chapter can be found in most textbooks in information theory. The main concepts and results are due to Shannon [173]. Pinsker's inequality is due to Pinsker [155]. Fano's inequality has its origin in the converse proof of the channel coding theorem (to be discussed in Chapter 8) by Fano [63].

Chapter 3

ZERO-ERROR DATA COMPRESSION

In a random experiment, a coin is tossed $n$ times. Let $X_i$ be the outcome of the $i$th toss, with
$$\Pr\{X_i = \text{HEAD}\} = p \quad \text{and} \quad \Pr\{X_i = \text{TAIL}\} = 1 - p, \qquad (3.1)$$
where $0 \leq p \leq 1$. It is assumed that $X_i$ are i.i.d., and the value of $p$ is known. We are asked to describe the outcome of the random experiment without error (with zero error) by using binary symbols. One way to do this is to encode a HEAD by a '0' and a TAIL by a '1.' Then the outcome of the random experiment is encoded into a binary codeword of length $n$. When the coin is fair, i.e., $p = 0.5$, this is the best we can do because the probability of every outcome of the experiment is equal to $2^{-n}$. In other words, all the outcomes are equally likely.

However, if the coin is biased, i.e., $p \neq 0.5$, the probability of an outcome of the experiment depends on the number of HEADs and the number of TAILs in the outcome. In other words, the probabilities of the outcomes are no longer uniform. It turns out that we can take advantage of this by encoding more likely outcomes into shorter codewords and less likely outcomes into longer codewords. By doing so, it is possible to use less than $n$ bits on the average to describe the outcome of the random experiment. In particular, when $p = 0$ or 1, we actually do not need to describe the outcome of the experiment because it is deterministic.

At the beginning of Chapter 2, we mentioned that the entropy $H(X)$ measures the amount of information contained in a random variable $X$. In this chapter, we substantiate this claim by exploring the role of entropy in the context of zero-error data compression.

3.1 THE ENTROPY BOUND

In this section, we establish that $H(X)$ is a fundamental lower bound on the expected number of symbols needed to describe the outcome of a random variable $X$ with zero error. This is called the entropy bound.

Definition 3.1 A $D$-ary source code $\mathcal{C}$ for a source random variable $X$ is a mapping from $\mathcal{X}$ to $\mathcal{D}^*$, the set of all finite length sequences of symbols taken from a $D$-ary code alphabet $\mathcal{D}$.
Consider an information source $\{X_k, k \geq 1\}$, where $X_k$ are discrete random variables which take values in the same alphabet. We apply a source code $\mathcal{C}$ to each $X_k$ and concatenate the codewords. Once the codewords are concatenated, the boundaries of the codewords are no longer explicit. In other words, when the code $\mathcal{C}$ is applied to a source sequence, a sequence of code symbols is produced, and the codewords may no longer be distinguishable. We are particularly interested in uniquely decodable codes, which are defined as follows.

Definition 3.2 A code $\mathcal{C}$ is uniquely decodable if for any finite source sequence, the sequence of code symbols corresponding to this source sequence is different from the sequence of code symbols corresponding to any other (finite) source sequence.

Suppose we use a code $\mathcal{C}$ to encode a source file into a coded file. If $\mathcal{C}$ is uniquely decodable, then we can always recover the source file from the coded file. An important class of uniquely decodable codes, called prefix codes, is discussed in the next section. But we first look at an example of a code which is not uniquely decodable.

Example 3.3 Let $\mathcal{X} = \{A, B, C, D\}$. Consider the code $\mathcal{C}$ defined by

x    C(x)
A    0
B    1
C    01
D    10

Then all the three source sequences AAD, ACA, and AABA produce the code sequence 0010. Thus from the code sequence 0010, we cannot tell which of the three source sequences it comes from. Therefore, $\mathcal{C}$ is not uniquely decodable.

In the next theorem, we prove that for any uniquely decodable code, the lengths of the codewords have to satisfy an inequality called the Kraft inequality.

Theorem 3.4 (Kraft Inequality) Let $\mathcal{C}$ be a $D$-ary source code, and let $l_1, l_2, \ldots, l_m$ be the lengths of the codewords. If $\mathcal{C}$ is uniquely decodable, then
$$\sum_{k=1}^{m} D^{-l_k} \leq 1. \qquad (3.2)$$

Proof Let $N$ be an arbitrary positive integer, and consider the identity
$$\left( \sum_{k=1}^{m} D^{-l_k} \right)^{\!N} = \sum_{k_1=1}^{m} \sum_{k_2=1}^{m} \cdots \sum_{k_N=1}^{m} D^{-(l_{k_1} + l_{k_2} + \cdots + l_{k_N})}. \qquad (3.3)$$
By collecting terms on the right-hand side, we write
$$\left( \sum_{k=1}^{m} D^{-l_k} \right)^{\!N} = \sum_{i=1}^{N l_{\max}} A_i\, D^{-i}, \qquad (3.4)$$
where
$$l_{\max} = \max_{1 \leq k \leq m} l_k \qquad (3.5)$$
and $A_i$ is the coefficient of $D^{-i}$ in $\big( \sum_k D^{-l_k} \big)^N$. Now observe that $A_i$ gives the total number of sequences of $N$ codewords with a total length of $i$ code symbols. Since the code is uniquely decodable, these code sequences must be distinct, and therefore $A_i \leq D^i$ because there are $D^i$ distinct sequences of $i$ code symbols. Substituting this inequality into (3.4), we have
$$\left( \sum_{k=1}^{m} D^{-l_k} \right)^{\!N} \leq \sum_{i=1}^{N l_{\max}} 1 = N l_{\max}, \qquad (3.6)$$
or
$$\sum_{k=1}^{m} D^{-l_k} \leq (N l_{\max})^{1/N}. \qquad (3.7)$$
Since this inequality holds for any $N$, upon letting $N \to \infty$, we obtain (3.2), completing the proof.
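The failure of unique decodability in Example 3.3 can be exhibited mechanically, and the Kraft sum of that code already signals the problem (it exceeds 1, so by Theorem 3.4 the code cannot be uniquely decodable). A brute-force sketch, offered only as an illustration:

```python
from itertools import product

code = {'A': '0', 'B': '1', 'C': '01', 'D': '10'}   # the code of Example 3.3

def encode(word):
    return ''.join(code[ch] for ch in word)

seen = {}
for n in range(1, 5):                    # search over short source strings
    for word in product(code, repeat=n):
        bits = encode(word)
        if bits in seen and seen[bits] != word:
            print(''.join(seen[bits]), 'and', ''.join(word), 'both encode to', bits)
            break
        seen.setdefault(bits, word)
    else:
        continue
    break

kraft_sum = sum(2 ** -len(c) for c in code.values())
print('Kraft sum =', kraft_sum)          # 1.5 > 1, consistent with Theorem 3.4
```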
Let $X$ be a source random variable with probability distribution
$$\{p_1, p_2, \ldots, p_m\}, \qquad (3.8)$$
where $m \geq 2$. When we use a uniquely decodable code $\mathcal{C}$ to encode the outcome of $X$, we are naturally interested in the expected length of a codeword, which is given by
$$L = \sum_{k} p_k l_k. \qquad (3.9)$$
We will also refer to $L$ as the expected length of the code $\mathcal{C}$. The quantity $L$ gives the average number of symbols we need to describe the outcome of $X$ when the code $\mathcal{C}$ is used, and it is a measure of the efficiency of the code $\mathcal{C}$. Specifically, the smaller the expected length $L$ is, the better the code $\mathcal{C}$ is.

In the next theorem, we will prove a fundamental lower bound on the expected length of any uniquely decodable $D$-ary code. We first explain why this is the lower bound we should expect. In a uniquely decodable code, we use $L$ $D$-ary symbols on the average to describe the outcome of $X$. Recall from the remark following Theorem 2.43 that a $D$-ary symbol can carry at most one $D$-it of information. Then the maximum amount of information which can be carried by the codeword on the average is $L \cdot 1 = L$ $D$-its. Since the code is uniquely decodable, the amount of entropy carried by the codeword on the average is $H_D(X)$. Therefore, we have
$$H_D(X) \leq L. \qquad (3.10)$$
In other words, the expected length of a uniquely decodable code is at least the entropy of the source. This argument is rigorized in the proof of the next theorem.

Theorem 3.5 (Entropy Bound) Let $\mathcal{C}$ be a $D$-ary uniquely decodable code for a source random variable $X$ with entropy $H_D(X)$. Then the expected length of $\mathcal{C}$ is lower bounded by $H_D(X)$, i.e.,
$$L \geq H_D(X). \qquad (3.11)$$
This lower bound is tight if and only if $l_k = -\log_D p_k$ for all $k$.

Proof Since $\mathcal{C}$ is uniquely decodable, the lengths of its codewords satisfy the Kraft inequality. Write
$$L = \sum_k p_k l_k = \sum_k p_k \log_D D^{l_k} \qquad (3.12)$$
and recall from Definition 2.33 that
$$H_D(X) = -\sum_k p_k \log_D p_k. \qquad (3.13)$$
Then
$$L - H_D(X) = \sum_k p_k \log_D (p_k D^{l_k}) \qquad (3.14)$$
$$= (\ln D)^{-1} \sum_k p_k \ln (p_k D^{l_k}) \qquad (3.15)$$
$$\geq (\ln D)^{-1} \sum_k p_k \left( 1 - \frac{1}{p_k D^{l_k}} \right) \qquad (3.16)$$
$$= (\ln D)^{-1} \left( \sum_k p_k - \sum_k D^{-l_k} \right) \qquad (3.17)$$
$$\geq (\ln D)^{-1} (1 - 1) \qquad (3.18)$$
$$= 0, \qquad (3.19)$$
where we have invoked the fundamental inequality in (3.16) and the Kraft inequality in (3.18). This proves (3.11). In order for this lower bound to be tight, both (3.16) and (3.18) have to be tight simultaneously. Now (3.16) is tight if and only if $p_k D^{l_k} = 1$, or $l_k = -\log_D p_k$, for all $k$. If this holds, we have
$$\sum_k D^{-l_k} = \sum_k p_k = 1, \qquad (3.20)$$
i.e., (3.18) is also tight. This completes the proof of the theorem.

The entropy bound can be regarded as a generalization of Theorem 2.43, as is seen from the following corollary.

Corollary 3.6 $H(X) \leq \log |\mathcal{X}|$.

Proof Consider encoding each outcome of a random variable $X$ by a distinct symbol in $\{1, 2, \ldots, |\mathcal{X}|\}$. This is obviously a $|\mathcal{X}|$-ary uniquely decodable code with expected length 1. Then by the entropy bound, we have
$$H_{|\mathcal{X}|}(X) \leq 1, \qquad (3.21)$$
which becomes
$$H(X) \leq \log |\mathcal{X}| \qquad (3.22)$$
when the base of the logarithm is not specified.

Motivated by the entropy bound, we now introduce the redundancy of a uniquely decodable code.

Definition 3.7 The redundancy $R$ of a $D$-ary uniquely decodable code is the difference between the expected length of the code and the entropy of the source.

We see from the entropy bound that the redundancy of a uniquely decodable code is always nonnegative.
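The quantities in this section are straightforward to compute. The following sketch uses an assumed set of binary codeword lengths satisfying the Kraft inequality and an assumed source distribution (neither comes from the text) to evaluate the expected length, the entropy, and the redundancy, and to confirm the entropy bound.

```python
from math import log2

lengths = {'A': 1, 'B': 2, 'C': 3, 'D': 3}              # assumed codeword lengths
probs   = {'A': 0.5, 'B': 0.25, 'C': 0.15, 'D': 0.10}   # assumed source distribution

assert sum(2 ** -l for l in lengths.values()) <= 1      # Kraft inequality

L = sum(probs[s] * lengths[s] for s in probs)           # expected length (3.9)
H = -sum(p * log2(p) for p in probs.values())           # H_2(X)
print(f'L = {L:.3f}, H(X) = {H:.3f}, redundancy = {L - H:.3f}')
assert L >= H                                           # the entropy bound (3.11)
```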
3.2 PREFIX CODES

3.2.1 DEFINITION AND EXISTENCE

Definition 3.8 A code is called a prefix-free code if no codeword is a prefix of any other codeword. For brevity, a prefix-free code will be referred to as a prefix code.

Example 3.9 The code $\mathcal{C}$ in Example 3.3 is not a prefix code because the codeword 0 is a prefix of the codeword 01, and the codeword 1 is a prefix of the codeword 10. It can easily be checked that the following code $\mathcal{C}'$ is a prefix code.

x    C'(x)
A    0
B    10
C    110
D    1111

A $D$-ary tree is a graphical representation of a collection of finite sequences of $D$-ary symbols. In a $D$-ary tree, each node has at most $D$ children. If a node has at least one child, it is called an internal node, otherwise it is called a leaf. The children of an internal node are labeled by the $D$ symbols in the code alphabet.

A $D$-ary prefix code can be represented by a $D$-ary tree with the leaves of the tree being the codewords. Such a tree is called the code tree for the prefix code. Figure 3.1 shows the code tree for the prefix code $\mathcal{C}'$ in Example 3.9.

[Figure 3.1. The code tree for the code $\mathcal{C}'$, with leaves 0, 10, 110, and 1111.]

As we have mentioned in Section 3.1, once a sequence of codewords is concatenated, the boundaries of the codewords are no longer explicit. Prefix codes have the desirable property that the end of a codeword can be recognized instantaneously, so that it is not necessary to make reference to the future codewords during the decoding process. For example, for the source sequence BCDAC, the code $\mathcal{C}'$ in Example 3.9 produces the code sequence 1011011110110. Based on this binary sequence, the decoder can reconstruct the source sequence as follows. The first bit 1 cannot form the first codeword because 1 is not a valid codeword. The first two bits 10 must form the first codeword because it is a valid codeword and it is not the prefix of any other codeword. The same procedure is repeated to locate the end of the next codeword, and the code sequence is parsed as 10, 110, 1111, 0, 110. Then the source sequence BCDAC can be reconstructed correctly.
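The instantaneous decoding procedure just described is easily sketched in code. The following is a minimal illustration (not part of the text's development) using the prefix code $\mathcal{C}'$ of Example 3.9: the decoder accumulates code symbols until the buffer matches a codeword, outputs the symbol, and starts over.

```python
code = {'A': '0', 'B': '10', 'C': '110', 'D': '1111'}
inverse = {bits: symbol for symbol, bits in code.items()}

def decode(bitstring):
    output, buffer = [], ''
    for bit in bitstring:
        buffer += bit
        if buffer in inverse:              # a codeword is complete; since no codeword
            output.append(inverse[buffer])  # is a prefix of another, this is safe
            buffer = ''
    assert buffer == '', 'trailing bits do not form a codeword'
    return ''.join(output)

encoded = ''.join(code[s] for s in 'BCDAC')
print(encoded, '->', decode(encoded))       # 1011011110110 -> BCDAC
```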
Since a prefix code can always be decoded correctly, it is a uniquely decodable code. Therefore, by Theorem 3.4, the codeword lengths of a prefix code also satisfy the Kraft inequality. In the next theorem, we show that the Kraft inequality fully characterizes the existence of a prefix code.

Theorem 3.10 There exists a $D$-ary prefix code with codeword lengths $l_1, l_2, \ldots, l_m$ if and only if the Kraft inequality
$$\sum_{k=1}^{m} D^{-l_k} \leq 1 \qquad (3.23)$$
is satisfied.

Proof We only need to prove the existence of a $D$-ary prefix code with codeword lengths $l_1, l_2, \ldots, l_m$ if these lengths satisfy the Kraft inequality. Without loss of generality, assume that $l_1 \leq l_2 \leq \cdots \leq l_m$.

Consider all the $D$-ary sequences of lengths less than or equal to $l_m$ and regard them as the nodes of the full $D$-ary tree of depth $l_m$. We will refer to a sequence of length $l$ as a node of order $l$. Our strategy is to choose nodes as codewords in nondecreasing order of the codeword lengths. Specifically, we choose a node of order $l_1$ as the first codeword, then a node of order $l_2$ as the second codeword, and so on, such that each newly chosen codeword is not prefixed by any of the previously chosen codewords. If we can successfully choose all the codewords, then the resultant set of codewords forms a prefix code with the desired set of lengths.

There are $D^{l_1} \geq D > 1$ (since $l_1 \geq 1$) nodes of order $l_1$ which can be chosen as the first codeword. Thus choosing the first codeword is always possible. Assume that the first $i$ codewords have been chosen successfully, where $1 \leq i \leq m - 1$, and we want to choose a node of order $l_{i+1}$ as the $(i+1)$st codeword such that it is not prefixed by any of the previously chosen codewords. In other words, the $(i+1)$st node to be chosen cannot be a descendant of any of the previously chosen codewords. Observe that for $1 \leq j \leq i$, the codeword with length $l_j$ has $D^{l_{i+1} - l_j}$ descendants of order $l_{i+1}$. Since all the previously chosen codewords are not prefixes of each other, their descendants of order $l_{i+1}$ do not overlap. Therefore, upon noting that the total number of nodes of order $l_{i+1}$ is $D^{l_{i+1}}$, the number of nodes which can be chosen as the $(i+1)$st codeword is
$$D^{l_{i+1}} - D^{l_{i+1} - l_1} - \cdots - D^{l_{i+1} - l_i}. \qquad (3.24)$$
If $l_1, l_2, \ldots, l_m$ satisfy the Kraft inequality, we have
$$D^{-l_1} + \cdots + D^{-l_i} + D^{-l_{i+1}} \leq 1. \qquad (3.25)$$
Multiplying by $D^{l_{i+1}}$ and rearranging the terms, we have
$$D^{l_{i+1}} - D^{l_{i+1} - l_1} - \cdots - D^{l_{i+1} - l_i} \geq 1. \qquad (3.26)$$
The left hand side is the number of nodes which can be chosen as the $(i+1)$st codeword as given in (3.24). Therefore, it is possible to choose the $(i+1)$st codeword. Thus we have shown the existence of a prefix code with codeword lengths $l_1, l_2, \ldots, l_m$, completing the proof.

A probability distribution $\{p_k\}$ such that for all $k$, $p_k = D^{-n_k}$, where $n_k$ is a positive integer, is called a $D$-adic distribution. When $D = 2$, $\{p_k\}$ is called a dyadic distribution. From Theorem 3.5 and the above theorem, we can obtain the following result as a corollary.

Corollary 3.11 There exists a $D$-ary prefix code which achieves the entropy bound for a distribution $\{p_k\}$ if and only if $\{p_k\}$ is $D$-adic.

Proof Consider a $D$-ary prefix code which achieves the entropy bound for a distribution $\{p_k\}$. Let $l_k$ be the length of the codeword assigned to the probability $p_k$. By Theorem 3.5, for all $k$, $l_k = -\log_D p_k$, or $p_k = D^{-l_k}$. Thus $\{p_k\}$ is $D$-adic.

Conversely, suppose $\{p_k\}$ is $D$-adic, and let $p_k = D^{-n_k}$ for all $k$. Let $l_k = n_k$ for all $k$. Then by the Kraft inequality, there exists a prefix code with codeword lengths $\{l_k\}$, because
$$\sum_k D^{-l_k} = \sum_k D^{-n_k} = \sum_k p_k = 1. \qquad (3.27)$$
Assigning the codeword with length $l_k$ to the probability $p_k$ for all $k$, we see from Theorem 3.5 that this code achieves the entropy bound.
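The existence claim of Theorem 3.10 can also be realized by a short constructive sketch. The version below uses a "canonical" assignment (a different bookkeeping from the tree argument in the proof, offered only as an illustration): given binary codeword lengths sorted in nondecreasing order and satisfying the Kraft inequality, each codeword is the first $l_k$ bits of the running Kraft sum.

```python
def prefix_code_from_lengths(lengths):
    # lengths must be sorted in nondecreasing order and satisfy Kraft
    assert lengths == sorted(lengths)
    assert sum(2 ** -l for l in lengths) <= 1, 'Kraft inequality violated'
    codewords, value, prev_len = [], 0, 0
    for l in lengths:
        value <<= (l - prev_len)              # rescale the running sum to l bits
        codewords.append(format(value, '0{}b'.format(l)))
        value += 1                            # advance past the subtree just used
        prev_len = l
    return codewords

print(prefix_code_from_lengths([1, 2, 3, 3]))  # ['0', '10', '110', '111']
print(prefix_code_from_lengths([2, 2, 2]))     # ['00', '01', '10']
```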
3.2.2 HUFFMAN CODES

As we have mentioned, the efficiency of a uniquely decodable code is measured by its expected length. Thus for a given source $X$, we are naturally interested in prefix codes which have the minimum expected length. Such codes, called optimal codes, can be constructed by the Huffman procedure, and these codes are referred to as Huffman codes. In general, there exists more than one optimal code for a source, and some optimal codes cannot be constructed by the Huffman procedure.

For simplicity, we first discuss binary Huffman codes. A binary prefix code for a source $X$ with distribution $\{p_k\}$ is represented by a binary code tree, with each leaf in the code tree corresponding to a codeword. The Huffman procedure is to form a code tree such that the expected length is minimum. The procedure is described by a very simple rule:

Keep merging the two smallest probability masses until one probability mass (i.e., 1) is left.

The merging of two probability masses corresponds to the formation of an internal node of the code tree. We now illustrate the Huffman procedure by the following example.

[Figure 3.2. The Huffman procedure: the probability masses 0.35, 0.1, 0.15, 0.2, 0.2 are successively merged into 0.25, 0.4, 0.6, and finally 1, yielding the codewords 00, 010, 011, 10, 11.]

Example 3.12 Let $X$ be the source with $\mathcal{X} = \{A, B, C, D, E\}$, and the probabilities are 0.35, 0.1, 0.15, 0.2, 0.2, respectively. The Huffman procedure is shown in Figure 3.2. In the first step, we merge probability masses 0.1 and 0.15 into a probability mass 0.25. In the second step, we merge probability masses 0.2 and 0.2 into a probability mass 0.4. In the third step, we merge probability masses 0.35 and 0.25 into a probability mass 0.6. Finally, we merge probability masses 0.6 and 0.4 into a probability mass 1. A code tree is then formed. Upon assigning 0 and 1 (in any convenient way) to each pair of branches at an internal node, we obtain the codeword assigned to each source symbol.

In the Huffman procedure, sometimes there is more than one choice of merging the two smallest probability masses. We can take any one of these choices without affecting the optimality of the code eventually obtained.

For an alphabet of size $m$, it takes $m - 1$ steps to complete the Huffman procedure for constructing a binary code, because we merge two probability masses in each step. In the resulting code tree, there are $m$ leaves and $m - 1$ internal nodes.

In the Huffman procedure for constructing a $D$-ary code, the smallest $D$ probability masses are merged in each step. If the resulting code tree is formed in $k + 1$ steps, where $k \geq 0$, then there will be $k + 1$ internal nodes and $D + k(D - 1)$ leaves, where each leaf corresponds to a source symbol in the alphabet. If the alphabet size $m$ has the form $D + k(D - 1)$, then we can apply the Huffman procedure directly. Otherwise, we need to add a few dummy symbols with probability 0 to the alphabet in order to make the total number of symbols have the form $D + k(D - 1)$.

Example 3.13 If we want to construct a quaternary Huffman code ($D = 4$) for the source in the last example, we need to add 2 dummy symbols so that the total number of symbols becomes $7 = 4 + (1)(4 - 1)$, where $k = 1$. In general, we need to add at most $D - 2$ dummy symbols.

In Section 3.1, we have proved the entropy bound for a uniquely decodable code. This bound also applies to a prefix code since a prefix code is uniquely decodable. In particular, it applies to a Huffman code, which is a prefix code by construction. Thus the expected length of a Huffman code is at least the entropy of the source. In Example 3.12, the entropy $H(X)$ is 2.202 bits, while the expected length of the Huffman code is
$$0.35(2) + 0.1(3) + 0.15(3) + 0.2(2) + 0.2(2) = 2.25. \qquad (3.28)$$
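The merging rule translates directly into a short program. The sketch below (a compact heap-based rendering of the procedure, not the text's own presentation) is applied to the source of Example 3.12; ties may be broken differently from Figure 3.2, but the expected length is again 2.25 bits.

```python
import heapq

def huffman(probs):
    # heap entries: (probability mass, tie-breaking counter, {symbol: codeword so far})
    heap = [(p, i, {s: ''}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)          # the two smallest masses
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: '0' + w for s, w in c0.items()}
        merged.update({s: '1' + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, counter, merged))
        counter += 1
    return heap[0][2]

probs = {'A': 0.35, 'B': 0.1, 'C': 0.15, 'D': 0.2, 'E': 0.2}
code = huffman(probs)
L = sum(probs[s] * len(w) for s, w in code.items())
print(code, 'expected length =', L)               # expected length = 2.25
```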
We now turn to proving the optimality of a Huffman code. For simplicity, we will only prove the optimality of a binary Huffman code. Extension of the proof to the general case is straightforward. Without loss of generality, assume that
$$p_1 \geq p_2 \geq \cdots \geq p_m. \qquad (3.29)$$
Denote the codeword assigned to $p_i$ by $c_i$, and denote its length by $l_i$. To prove that a Huffman code is actually optimal, we make the following observations.

Lemma 3.14 In an optimal code, shorter codewords are assigned to larger probabilities.

Proof Consider $1 \leq i < j \leq m$ such that $p_i > p_j$. Assume that in a code, the codewords $c_i$ and $c_j$ are such that $l_i > l_j$, i.e., a shorter codeword is assigned to a smaller probability. Then by exchanging $c_i$ and $c_j$, the expected length of the code is changed by
$$(p_i l_j + p_j l_i) - (p_i l_i + p_j l_j) = (p_i - p_j)(l_j - l_i) < 0 \qquad (3.30)$$
since $p_i > p_j$ and $l_i > l_j$. In other words, the code can be improved and therefore is not optimal. The lemma is proved.

Lemma 3.15 There exists an optimal code in which the codewords assigned to the two smallest probabilities are siblings, i.e., the two codewords have the same length and they differ only in the last symbol.

Proof The reader is encouraged to trace the steps in this proof by drawing a code tree. Consider any optimal code. From the last lemma, the codeword $c_m$ assigned to $p_m$ has the longest length. Then the sibling of $c_m$ cannot be the prefix of another codeword.

We claim that the sibling of $c_m$ must be a codeword. To see this, assume that it is not a codeword (and it is not the prefix of another codeword). Then we can replace $c_m$ by its parent to improve the code, because the length of the codeword assigned to $p_m$ is reduced by 1 while all the other codewords remain unchanged. This is a contradiction to the assumption that the code is optimal. Therefore, the sibling of $c_m$ must be a codeword.

If the sibling of $c_m$ is assigned to $p_{m-1}$, then the code already has the desired property, i.e., the codewords assigned to the two smallest probabilities are siblings. If not, assume that the sibling of $c_m$ is assigned to $p_i$, where $i < m - 1$. Since $p_i \geq p_{m-1}$, we have $l_i \leq l_{m-1}$. On the other hand, by Lemma 3.14, $l_{m-1} \leq l_m = l_i$, which implies $l_i = l_{m-1}$. Then we can exchange the codewords for $p_i$ and $p_{m-1}$ without changing the expected length of the code (i.e., the code remains optimal) to obtain the desired code. The lemma is proved.

Suppose $c_i$ and $c_j$ are siblings in a code tree. Then $l_i = l_j$. If we replace $c_i$ and $c_j$ by a common codeword at their parent, call it $c_{ij}$, then we obtain a reduced code tree, and the probability of $c_{ij}$ is $p_i + p_j$. Accordingly, the probability set becomes a reduced probability set with $p_i$ and $p_j$ replaced by a probability $p_i + p_j$. Let $L$ and $L'$ be the expected lengths of the original code and the reduced code, respectively. Then
$$L - L' = (p_i l_i + p_j l_j) - (p_i + p_j)(l_i - 1) \qquad (3.31)$$
$$= (p_i l_i + p_j l_i) - (p_i + p_j)(l_i - 1) \qquad (3.32)$$
$$= p_i + p_j, \qquad (3.33)$$
which implies
$$L = L' + (p_i + p_j). \qquad (3.34)$$
This relation says that the difference between the expected length of the original code and the expected length of the reduced code depends only on the values of the two probabilities merged, but not on the structure of the reduced code tree.

Theorem 3.16 The Huffman procedure produces an optimal prefix code.

Proof Consider an optimal code in which $c_m$ and $c_{m-1}$ are siblings. Such an optimal code exists by Lemma 3.15. Let $\{p'_k\}$ be the reduced probability set obtained from $\{p_k\}$ by merging $p_m$ and $p_{m-1}$. From (3.34), we see that $L$ is the expected length of an optimal code for $\{p_k\}$ if and only if $L'$ is the expected length of an optimal code for $\{p'_k\}$. Therefore, if we can find an optimal code for $\{p'_k\}$, we can use it to construct an optimal code for $\{p_k\}$. Note that by merging $p_m$ and $p_{m-1}$, the size of the problem, namely the total number of probability masses, is reduced by one. To find an optimal code for $\{p'_k\}$, we again merge the two smallest probabilities in $\{p'_k\}$. This is repeated until the size of the problem is eventually reduced to 2, for which we know that an optimal code has two codewords of length 1. In the last step of the Huffman procedure, two probability masses are merged, which corresponds to the formation of a code with two codewords of length 1. Thus the Huffman procedure indeed produces an optimal code.
We have seen that the expected length of a Huffman code is lower bounded by the entropy of the source. On the other hand, it would be desirable to obtain an upper bound in terms of the entropy of the source. This is given in the next theorem.

Theorem 3.17 The expected length of a Huffman code, denoted by $L_{\mathrm{Huff}}$, satisfies
$$L_{\mathrm{Huff}} < H_D(X) + 1. \qquad (3.35)$$
This bound is the tightest among all the upper bounds on $L_{\mathrm{Huff}}$ which depend only on the source entropy.

Proof We will construct a prefix code with expected length less than $H(X) + 1$. Then, because a Huffman code is an optimal prefix code, its expected length $L_{\mathrm{Huff}}$ is upper bounded by $H(X) + 1$.

Consider constructing a prefix code with codeword lengths $\{l_k\}$, where
$$l_k = \lceil -\log_D p_k \rceil. \qquad (3.36)$$
Then
$$-\log_D p_k \leq l_k < -\log_D p_k + 1, \qquad (3.37)$$
or
$$p_k \geq D^{-l_k}. \qquad (3.38)$$
Thus
$$\sum_k D^{-l_k} \leq \sum_k p_k = 1, \qquad (3.39)$$
i.e., $\{l_k\}$ satisfies the Kraft inequality, which implies that it is possible to construct a prefix code with codeword lengths $\{l_k\}$.

It remains to show that $L$, the expected length of this code, is less than $H(X) + 1$. Toward this end, consider
$$L = \sum_k p_k l_k \qquad (3.40)$$
$$< \sum_k p_k (-\log_D p_k + 1) \qquad (3.41)$$
$$= -\sum_k p_k \log_D p_k + \sum_k p_k \qquad (3.42)$$
$$= H(X) + 1, \qquad (3.43)$$
where (3.41) follows from the upper bound in (3.37). Thus we conclude that
$$L_{\mathrm{Huff}} \leq L < H(X) + 1. \qquad (3.44)$$
To see that this upper bound is the tightest possible, we have to show that there exists a sequence of distributions $P_k$ such that $L_{\mathrm{Huff}}$ approaches $H(P_k) + 1$ as $k \to \infty$. This can be done by considering the sequence of $D$-ary distributions
$$P_k = \left\{ 1 - \frac{D - 1}{k}, \frac{1}{k}, \ldots, \frac{1}{k} \right\}, \qquad (3.45)$$
where $k \geq D$. The Huffman code for each $P_k$ consists of $D$ codewords of length 1. Thus $L_{\mathrm{Huff}}$ is equal to 1 for all $k$. As $k \to \infty$, $H(P_k) \to 0$, and hence $L_{\mathrm{Huff}}$ approaches $H(P_k) + 1$. The theorem is proved.

The code constructed in the above proof is known as the Shannon code. The idea is that in order for the code to be near-optimal, we should choose $l_k$ close to $-\log p_k$ for all $k$. When $\{p_k\}$ is $D$-adic, $l_k$ can be chosen to be exactly $-\log p_k$ because the latter are integers. In this case, the entropy bound is tight.

From the entropy bound and the above theorem, we have
$$H(X) \leq L_{\mathrm{Huff}} < H(X) + 1. \qquad (3.46)$$
Now suppose we use a Huffman code to encode $X_1, X_2, \ldots, X_n$, which are i.i.d. copies of $X$. Let us denote the expected length of this Huffman code by $L^n_{\mathrm{Huff}}$. Then (3.46) becomes
$$n H(X) \leq L^n_{\mathrm{Huff}} < n H(X) + 1. \qquad (3.47)$$
Dividing by $n$, we obtain
$$H(X) \leq \frac{1}{n} L^n_{\mathrm{Huff}} < H(X) + \frac{1}{n}. \qquad (3.48)$$
As $n \to \infty$, the upper bound approaches the lower bound. Therefore, $\frac{1}{n} L^n_{\mathrm{Huff}}$, the coding rate of the code, namely the average number of code symbols needed to encode a source symbol, approaches $H(X)$ as $n \to \infty$. But of course, as $n$ becomes large, constructing a Huffman code becomes very complicated. Nevertheless, this result indicates that entropy is a fundamental measure of information.
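The Shannon code of the proof of Theorem 3.17 is simple to construct explicitly. The sketch below (an illustration only) computes the lengths $l_k = \lceil -\log_2 p_k \rceil$ for the source of Example 3.12, checks the Kraft inequality, and verifies $H(X) \leq L < H(X) + 1$.

```python
from math import ceil, log2

probs = {'A': 0.35, 'B': 0.1, 'C': 0.15, 'D': 0.2, 'E': 0.2}
lengths = {s: ceil(-log2(p)) for s, p in probs.items()}    # (3.36) with D = 2

kraft = sum(2 ** -l for l in lengths.values())
L = sum(probs[s] * lengths[s] for s in probs)
H = -sum(p * log2(p) for p in probs.values())

print('lengths =', lengths, 'Kraft sum =', round(kraft, 4))
print(f'H(X) = {H:.3f} <= L = {L:.3f} < H(X) + 1 = {H + 1:.3f}')
assert kraft <= 1 and H <= L < H + 1
```

Note that the Shannon code here has expected length 2.75 bits, noticeably worse than the Huffman code's 2.25 bits, even though both satisfy the bounds (3.46).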
3.3 REDUNDANCY OF PREFIX CODES

The entropy bound for a uniquely decodable code has been proved in Section 3.1. In this section, we present an alternative proof specifically for prefix codes which offers much insight into the redundancy of such codes.

Let $X$ be a source random variable with probability distribution
$$\{p_1, p_2, \ldots, p_m\}, \qquad (3.49)$$
where $m \geq 2$. A $D$-ary prefix code for $X$ can be represented by a $D$-ary code tree with $m$ leaves, where each leaf corresponds to a codeword. We denote the leaf corresponding to $p_k$ by $c_k$ and the order of $c_k$ by $l_k$, and assume that the alphabet is
$$\{0, 1, \ldots, D - 1\}. \qquad (3.50)$$
Let $\mathcal{I}$ be the index set of all the internal nodes (including the root) in the code tree.

Instead of matching codewords by brute force, we can use the code tree of a prefix code for more efficient decoding. To decode a codeword, we trace the path specified by the codeword from the root of the code tree until it terminates at the leaf corresponding to that codeword. Let $q_k$ be the probability of reaching an internal node $k \in \mathcal{I}$ during the decoding process. The probability $q_k$ is called the reaching probability of internal node $k$. Evidently, $q_k$ is equal to the sum of the probabilities of all the leaves descending from node $k$.

Let $p_{k,j}$ be the probability that the $j$th branch of node $k$ is taken during the decoding process. The probabilities $p_{k,j}$, $0 \leq j \leq D - 1$, are called the branching probabilities of node $k$, and
$$q_k = \sum_j p_{k,j}. \qquad (3.51)$$
Once node $k$ is reached, the conditional branching distribution is
$$\left\{ \frac{p_{k,0}}{q_k}, \frac{p_{k,1}}{q_k}, \ldots, \frac{p_{k,D-1}}{q_k} \right\}. \qquad (3.52)$$
Then define the conditional entropy of node $k$ by
$$h_k = H_D\!\left( \left\{ \frac{p_{k,0}}{q_k}, \frac{p_{k,1}}{q_k}, \ldots, \frac{p_{k,D-1}}{q_k} \right\} \right), \qquad (3.53)$$
where, with a slight abuse of notation, we have used $H_D(\cdot)$ to denote the entropy in the base $D$ of the conditional branching distribution in the parenthesis. By Theorem 2.43, $h_k \leq 1$. The following lemma relates the entropy of $X$ with the structure of the code tree.

Lemma 3.18 $H_D(X) = \sum_{k \in \mathcal{I}} q_k h_k$.

Proof We prove the lemma by induction on the number of internal nodes of the code tree. If there is only one internal node, it must be the root of the tree. Then the lemma is trivially true upon observing that the reaching probability of the root is equal to 1.

Assume the lemma is true for all code trees with $n$ internal nodes. Now consider a code tree with $n + 1$ internal nodes. Let $k$ be an internal node such that $k$ is the parent of a leaf $c$ with maximum order. Each sibling of $c$ may or may not be a leaf. If it is not a leaf, then it cannot be the ascendant of another leaf because we assume that $c$ is a leaf with maximum order. Now consider revealing the outcome of $X$ in two steps. In the first step, if the outcome of $X$ is not a leaf descending from node $k$, we identify the outcome exactly; otherwise, we identify the outcome to be a descendant of node $k$. We call this random variable $V$. If we do not identify the outcome exactly in the first step, which happens with probability $q_k$, we further identify in the second step which of the descendant(s) of node $k$ the outcome is (node $k$ has only one descendant if all the siblings of $c$ are not leaves). We call this random variable $W$. If the second step is not necessary, we assume that $W$ takes a constant value with probability 1. Then $X = (V, W)$.

The outcome of $V$ can be represented by a code tree with $n$ internal nodes which is obtained by pruning the original code tree at node $k$. Then by the induction hypothesis,
$$H(V) = \sum_{k' \in \mathcal{I} \setminus \{k\}} q_{k'} h_{k'}. \qquad (3.54)$$
By the chain rule for entropy, we have
$$H(X) = H(V) + H(W|V) \qquad (3.55)$$
$$= \sum_{k' \in \mathcal{I} \setminus \{k\}} q_{k'} h_{k'} + (1 - q_k) \cdot 0 + q_k h_k \qquad (3.56)$$
$$= \sum_{k' \in \mathcal{I}} q_{k'} h_{k'}. \qquad (3.57)$$
The lemma is proved.
The next lemma expresses the expected length of a prefix code in terms of the reaching probabilities of the internal nodes of the code tree.

Lemma 3.19 $L = \sum_{k \in \mathcal{I}} q_k$.

Proof Define
$$a_{jk} = \begin{cases} 1 & \text{if leaf } c_j \text{ is a descendant of internal node } k \\ 0 & \text{otherwise.} \end{cases} \qquad (3.58)$$
Then
$$l_j = \sum_{k \in \mathcal{I}} a_{jk}, \qquad (3.59)$$
because there are exactly $l_j$ internal nodes of which $c_j$ is a descendant if the order of $c_j$ is $l_j$. On the other hand,
$$q_k = \sum_j a_{jk}\, p_j. \qquad (3.60)$$
Then
$$L = \sum_j p_j l_j \qquad (3.61)$$
$$= \sum_j p_j \sum_{k \in \mathcal{I}} a_{jk} \qquad (3.62)$$
$$= \sum_{k \in \mathcal{I}} \sum_j a_{jk}\, p_j \qquad (3.63)$$
$$= \sum_{k \in \mathcal{I}} q_k, \qquad (3.64)$$
proving the lemma.

Define the local redundancy of an internal node $k$ by
$$r_k = q_k (1 - h_k). \qquad (3.65)$$
This quantity is local to node $k$ in the sense that it depends only on the branching probabilities of node $k$, and it vanishes if and only if $p_{k,j}/q_k = D^{-1}$ for all $j$, i.e., the node is balanced. Note that $r_k \geq 0$ because $h_k \leq 1$.

The next theorem says that the redundancy of a prefix code is equal to the sum of the local redundancies of all the internal nodes of the code tree.

Theorem 3.20 (Local Redundancy Theorem) Let $L$ be the expected length of a $D$-ary prefix code for a source random variable $X$, and $R$ be the redundancy of the code. Then
$$R = \sum_{k \in \mathcal{I}} r_k. \qquad (3.66)$$

Proof By Lemmas 3.18 and 3.19, we have
$$R = L - H_D(X) \qquad (3.67)$$
$$= \sum_{k \in \mathcal{I}} q_k - \sum_{k \in \mathcal{I}} q_k h_k \qquad (3.68)$$
$$= \sum_{k \in \mathcal{I}} q_k (1 - h_k) \qquad (3.69)$$
$$= \sum_{k \in \mathcal{I}} r_k. \qquad (3.70)$$
The theorem is proved.

We now present a slightly different version of the entropy bound.

Corollary 3.21 (Entropy Bound) Let $R$ be the redundancy of a prefix code. Then $R \geq 0$ with equality if and only if all the internal nodes in the code tree are balanced.

Proof Since $r_k \geq 0$ for all $k$, it is evident from the local redundancy theorem that $R \geq 0$. Moreover, $R = 0$ if and only if $r_k = 0$ for all $k$, which means that all the internal nodes in the code tree are balanced.

Remark Before the entropy bound was stated in Theorem 3.5, we gave the intuitive explanation that the entropy bound results from the fact that a $D$-ary symbol can carry at most one $D$-it of information. Therefore, when the entropy bound is tight, each code symbol has to carry exactly one $D$-it of information. Now consider revealing a random codeword one symbol after another. The above corollary states that in order for the entropy bound to be tight, all the internal nodes in the code tree must be balanced. That is, as long as the codeword is not completed, the next code symbol to be revealed always carries one $D$-it of information because it distributes uniformly on the alphabet. This is consistent with the intuitive explanation we gave for the entropy bound.

Example 3.22 The local redundancy theorem allows us to lower bound the redundancy of a prefix code based on partial knowledge of the structure of the code tree. More specifically,
$$R \geq \sum_{k \in \mathcal{I}'} r_k \qquad (3.71)$$
for any subset $\mathcal{I}'$ of $\mathcal{I}$. Let $p_{m-1}$ and $p_m$ be the two smallest probabilities in the source distribution. In constructing a binary Huffman code, $p_{m-1}$ and $p_m$ are merged. Then the redundancy of a Huffman code is lower bounded by
$$(p_{m-1} + p_m) \left[ 1 - H\!\left( \left\{ \frac{p_{m-1}}{p_{m-1} + p_m}, \frac{p_m}{p_{m-1} + p_m} \right\} \right) \right], \qquad (3.72)$$
the local redundancy of the parent of the two leaves corresponding to $p_{m-1}$ and $p_m$. See Yeung [214] for progressive lower and upper bounds on the redundancy of a Huffman code.
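Lemmas 3.18 and 3.19 and the local redundancy theorem are easy to verify numerically for a concrete code tree. The sketch below does so for the binary prefix code of Example 3.9 under an assumed source distribution (the probabilities are illustrative, not from the text); the internal nodes of the code tree are the proper prefixes of the codewords.

```python
from math import log2

code  = {'A': '0', 'B': '10', 'C': '110', 'D': '1111'}   # the code of Example 3.9
probs = {'A': 0.5, 'B': 0.25, 'C': 0.15, 'D': 0.10}      # assumed distribution

def entropy(ps):
    return -sum(p * log2(p) for p in ps if p > 0)

# internal nodes of the code tree = proper prefixes of codewords (root = '')
internal = {w[:i] for w in code.values() for i in range(len(w))}

q_sum = qh_sum = r_sum = 0.0
for node in internal:
    branch = [sum(p for s, p in probs.items() if code[s].startswith(node + b))
              for b in '01']                             # branching probabilities
    q = sum(branch)                                      # reaching probability
    h = entropy([p / q for p in branch])                 # conditional entropy of node
    q_sum += q
    qh_sum += q * h
    r_sum += q * (1 - h)                                 # local redundancy (3.65)

L = sum(probs[s] * len(w) for s, w in code.items())
H = entropy(probs.values())
assert abs(q_sum - L) < 1e-9                             # Lemma 3.19
assert abs(qh_sum - H) < 1e-9                            # Lemma 3.18
assert abs(r_sum - (L - H)) < 1e-9                       # Theorem 3.20
print(f'L = {L:.4f}, H(X) = {H:.4f}, R = {L - H:.4f}')
```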
PROBLEMS

1. Construct a binary Huffman code for the distribution $\{0.25, 0.05, 0.1, 0.13, 0.2, \ldots\}$.

2. Construct a ternary Huffman code for the source distribution in Problem 1.

3. Construct an optimal binary prefix code for the source distribution in Problem 1 such that all the codewords have even lengths.

4. Prove directly that the codeword lengths of a prefix code satisfy the Kraft inequality without using Theorem 3.4.

5. Prove that if $p_1 > 0.4$, then the shortest codeword of a binary Huffman code has length equal to 1. Then prove that the redundancy of such a Huffman code is lower bounded by $1 - h_b(p_1)$. (Johnsen [103].)

6. Suffix codes A code is a suffix code if no codeword is a suffix of any other codeword. Show that a suffix code is uniquely decodable.

7. Fix-free codes A code is a fix-free code if it is both a prefix code and a suffix code. Let $l_1, l_2, \ldots, l_m$ be $m$ positive integers. Prove that if
$$\sum_{k=1}^{m} 2^{-l_k} \leq \frac{1}{2},$$
then there exists a binary fix-free code with codeword lengths $l_1, l_2, \ldots, l_m$. (Ahlswede et al. [5].)

8. Random coding for prefix codes Construct a binary prefix code with codeword lengths $l_1 \leq l_2 \leq \cdots \leq l_m$ as follows. For each $1 \leq k \leq m$, the codeword with length $l_k$ is chosen independently from the set of all $2^{l_k}$ possible binary strings with length $l_k$ according to the uniform distribution. Let $P_m(\mathrm{good})$ be the probability that the code so constructed is a prefix code.
a) Prove that $P_2(\mathrm{good}) = (1 - 2^{-l_1})^+$, where
$$(x)^+ = \begin{cases} x & \text{if } x \geq 0 \\ 0 & \text{if } x < 0. \end{cases}$$
b) Prove by induction on $m$ that
$$P_m(\mathrm{good}) = \prod_{k=1}^{m} \left( 1 - \sum_{j=1}^{k-1} 2^{-l_j} \right)^{\!+}.$$
c) Observe that there exists a prefix code with codeword lengths $l_1, l_2, \ldots, l_m$ if and only if $P_m(\mathrm{good}) > 0$. Show that $P_m(\mathrm{good}) > 0$ is equivalent to the Kraft inequality.
By using this random coding method, one can derive the Kraft inequality without knowing the inequality ahead of time. (Ye and Yeung [210].)

9. Let $X$ be a source random variable. Suppose a certain probability mass $p_k$ in the distribution of $X$ is given. Let
$$l_j = \begin{cases} \lceil -\log p_j \rceil & \text{if } j = k \\ \lceil -\log (p_j + x_j) \rceil & \text{if } j \neq k, \end{cases}$$
where
$$x_j = \frac{p_j}{1 - p_k} \left( p_k - 2^{-\lceil -\log p_k \rceil} \right)$$
for all $j \neq k$.
a) Show that $l_j \leq \lceil -\log p_j \rceil$ for all $j$.
b) Show that $\{l_j\}$ satisfies the Kraft inequality.
c) Obtain an upper bound on $L_{\mathrm{Huff}}$ in terms of $H(X)$ and $p_k$ which is tighter than $H(X) + 1$. This shows that when partial knowledge about the source distribution in addition to the source entropy is available, tighter upper bounds on $L_{\mathrm{Huff}}$ can be obtained.
(Ye and Yeung [211].)

HISTORICAL NOTES

The foundation for the material in this chapter can be found in Shannon's original paper [173]. The Kraft inequality for uniquely decodable codes was first proved by McMillan [142]. The proof given here is due to Karush [106]. The Huffman coding procedure was devised and proved to be optimal by Huffman [97]. The same procedure was devised independently by Zimmerman [224]. Linder et al. [125] have proved the existence of an optimal prefix code for an infinite source alphabet which can be constructed from Huffman codes for truncations of the source distribution. The local redundancy theorem is due to Yeung [214].

Chapter 4

WEAK TYPICALITY

In the last chapter, we have discussed the significance of entropy in the context of zero-error data compression. In this chapter and the next, we explore entropy in terms of the asymptotic behavior of i.i.d. sequences. Specifically, two versions of the asymptotic equipartition property (AEP), namely the weak AEP and the strong AEP, are discussed. The role of these AEP's in information theory is analogous to the role of the weak law of large numbers in probability theory. In this chapter, the weak AEP and its relation with the source coding theorem are discussed. All the logarithms are in the base 2 unless otherwise specified.

4.1 THE WEAK AEP

We consider an information source $\{X_k, k \geq 1\}$ where $X_k$ are i.i.d. with distribution $p(x)$. We use $X$ to denote the generic random variable and $H(X)$ to denote the common entropy for all $X_k$, where $H(X) < \infty$. Let $\mathbf{X} = (X_1, X_2, \ldots, X_n)$. Since $X_k$ are i.i.d.,
$$p(\mathbf{X}) = p(X_1)\, p(X_2) \cdots p(X_n). \qquad (4.1)$$
Note that $p(\mathbf{X})$ is a random variable because it is a function of the random variables $X_1, X_2, \ldots, X_n$. We now prove an asymptotic property of $p(\mathbf{X})$ called the weak asymptotic equipartition property (weak AEP).
We use to denote the generic random variable and Hi . Let to denote the common entropy for all , where . Since are i.i.d., D R A F T September 13, 2001, 6:27pm D R A F T 62 A FIRST COURSE IN INFORMATION THEORY ® ·K®H·c·K® are i.i.d., by (4.1), ß ß ì ÃfÄÅ ² @ Ý ¹ì º ÃfÄÅ ² ® Ý Ö (4.4) ¿ ¿ ² ® Ý are also i.i.d. Then by the weak law of large The random variables ÃfÄÅ ¿ of (4.4) tends to numbers, the right hand side ì ÃfÄÅ ² ® Ý ¹ Ú ² ® Ý · (4.5) ¿ Proof Since in probability, proving the theorem. The weak AEP is nothing more than a straightforward application of the weak law of large numbers. However, as we will see shortly, this property has significant implications. ؤ[ä æ;äøå;äJ¥§æÛ§«]© The weakly typical set & F with respect to ²JÀ7Ý J² À · À H·· À Ý ÍÓÏ suchÆG that ¿ ° of sequences H ¹ ß ì ÃfÄÅ ¿ ² H Ý ì Ú ² ® Ý Þ «· = ß Ú ² ® Ý ì « Þ ì fà ÄÅ ¿ ² H YÝ Þ Ú ² ® 4Ý ó «· (4.7) where « is an arbitrarily small positive real number. The sequences in are called weakly « -typical sequences. ß ß ì fà ÄÅ ² H Ý ¹ ì º fà ÄÅ ²JÀ Ý ¿ ¿ The quantity (4.6) or equivalently, is the set H & F ÆG ° (4.8) Ú ²® Ý is called the empirical entropy of the sequence . The empirical entropy of a weakly typical sequence is close to the true entropy . The important properties of the set are summarized in the next theorem which we will ° see is equivalent to the weak AEP. & F Æ G ¢a£;¤!¥§¦7¤¨§«]¬ÛõBA ¤¸üDCE á ÿ The following hold for any ï ¶ : 1) If HLÍI& F , then ÆG KJ Æ Þ ¿ ² H ÝYÞ KJ Æ Ö \ 3 « ° D R A F T ¯ h¯ ² °³² ¯ h¯ ² °¶² September 13, 2001, 6:27pm (4.9) D R A F T 63 Weak Typicality 2) For sufficiently large, ï ß ì «Ö ÌL@ëÍI& F Æ G ° Ô | sufficiently large, ².ß ì « Ý ¯MJ´¯ Æ ² °¶² Þ ´N& F ´ Þ ¯MJ´¯ Æ Æ G ° _a` 3) For ² (4.10) °¶² Ö (4.11) & F ÆG Proof Property 1 follows immediately from the definition of in (4.7). ° Property 2 is equivalent to Theorem 4.1. To prove Property 3, we use the lower bound in (4.9) and consider Þ _P` Ì7& F Ô Þ|ß · (4.12) ´N& F Æ G ° ´ ¯KJh¯ Æ ² °³² ÆG ° which implies N´ & F ÆG ° ´ Þ ¯MJ´¯ Æ ² ß °¶² Ö (4.13) Note that this upper bound holds for any 5µ . On the other hand, using the upper bound in (4.9) and Theorem 4.1, for sufficiently large, we have ßì « Þ _a` Ì7& F Ô Þ ´N& F ´ ¯MJ´¯ Æ ² °¶² Ö (4.14) Æ G ° Æ G ° Then N´ & F ÆG ° ´µ ².ß ì « Ý ¯MJ´¯ Æ ² °¶² Ö (4.15) Combining (4.13) and (4.15) gives Property 3. The theorem is proved. Remark 1 Theorem 4.3 is a consequence of Theorem 4.1. However, Property 2 in Theorem 4.3 is equivalent to Theorem 4.1. Therefore, Theorem 4.1 and Theorem 4.3 are equivalent, and they will both be referred to as the weak AEP. ® Ý @ ¹ ² ® ·K® H· c· ²JÀ7Ý The weak AEP has the following interpretation. Suppose is drawn i.i.d. according to , where is large. After the sequence is drawn, we ask what the probability of occurrence of the sequence is. The weak AEP says that the probability of occurrence of the sequence drawn is close to ´¯ ² with very high probability. Such a sequence is called a weakly typical sequence. Moreover, the total number of weakly typical sequences is approx imately equal to ´¯ ² . The weak AEP, however, does not say that most of the sequences in are weakly typical. In fact, the number of weakly typical sequences is in general insignificant compared with the total number of sequences, because ¿ J Æ Ï J Æ ´N& F Æ GPO ´ ´)ÏÓ´ Q D R A F T J Æ ¹ h¯ Ê)ÌÍ ² È È } È } È J´¯ Æ ¯ËÊ)ÌÍ ²² / September 13, 2001, 6:27pm ¶ (4.16) D R A F T 64 Ú ²® Ý A FIRST COURSE IN INFORMATION THEORY ÃfÄÅδ)ÏÓ´ is strictly less than . 
The idea is that, as G/ i as long as although the size of the weakly typical set may be insignificant compared with the size of the set of all sequences, the former has almost all the probability. When is large, one can almost think of the sequence as being obtained by choosing a sequence from the weakly typical set according to the uniform distribution. Very often, we concentrate on the properties of typical sequences because any property which is proved to be true for typical sequences will then be true with high probability. This in turn determines the average behavior of a large sample. @ Remark The most likely sequence is in general not weakly typical although the probability of a weakly typical set is close to 1 when is large. For example, for i.i.d. with and , is the most likely sequence, but it is not weakly typical because its empirical entropy is not close to the true entropy. The idea is that as /¹i , the probability of every sequence, including that of the most likely sequence, tends to 0. Therefore, it is not necessary for a weakly typical set to include the most likely sequence in order to make up a probability close to 1. ² ¶ Ý ¹ ¶¸Ö ß ¿ ® 4.2 ².ßcÝ ¹ ¶¸ÖPR .² ß · ß · T· ßcÝ ¿ @ ¹ ² ® ·K® H· ·K® Ý THE SOURCE CODING THEOREM ²JÀÝ drawn i.i.d. accordTo encode a random sequence ing to by a block code, we construct a one-to-one mapping from a subset of to an index set (4.17) S Ï¿ ´ S ´ ¹ T Þ ´)ÏÓ´ " " ¹ Ì ß · · T·*T Ô · ´)Ï´ where . We do not have to assume that is finite. The indices in are called codewords, and the integer is called the block length of the code. If a sequence occurs, the encoder outputs the correspondbits. If a sequence ing codeword which is specified by approximately occurs, the encoder outputs the constant codeword 1. In either case, the codeword output by the encoder is decoded to the sequence in corresponding to that codeword by the decoder. If a sequence occurs, then is decoded correctly by the decoder. If a sequence occurs, then is not decoded correctly by the decoder. For such a code, its performance is measured by the coding rate defined as (in bits per source symbol), and the probability of error is given by H Íê S H5Í S ÃfÄÅT H Ã]ÄÅT v,w T ¹ _a` ÌL@ Í ê S Ô Ö S ¹ Ï Hê S Í S H Í ¹ ¶ S H (4.18) If the code is not allowed to make any error, i.e., v6w , it is clear that must be taken to be , or . In that case, the coding rate is equal to . However, if we allow v w to be any small quantity, Shannon [173] showed that there exists a block code whose coding rate is arbitrarily ÃfÄÅÊ´)ÏÓ´ D R A F T ´)Ï´ September 13, 2001, 6:27pm D R A F T 65 Weak Typicality Ú ²® Ý @ when is sufficiently large. This is the direct part of Shannon’s close to source coding theorem, and in this sense the source sequence is said to be reconstructed almost perfectly. We now prove the direct part of the source coding theorem by constructing a desired code. First, we fix « and take ï ¶ S ¹ & F Æ G T ¹ ´S ´Ö and (4.19) ° (4.20) For sufficiently large , by the weak AEP, Þ T ¹ ´ S ´ ¹ N´ & F ´ Þ ¯MJ´¯ Æ ² °¶² Ö Æ G ° Therefore, the coding rate ÃfÄÅ T satisfies ß ².ß Ý;ó ² Ý Þ ß Ã fÄÅ ì « Ú ® ì « ÃfÄÅT Þ Ú ² ® Ý;ó « Ö ².ß ì « Ý ¯MJ´¯ Æ ² (4.21) °¶² Also by the weak AEP, ÌL@ Í ê S Ô ¹ (4.22) (4.23) ÌL@ ÍUê & F Æ G ° ÔaÜ «Ö ¶ , the coding rate tends to Ú ² ® Ý , while v6w tends to 0. This proves ¹ v,w _a` _a` Letting «$/ the direct part of the source coding theorem. 
The converse part of the source coding theorem says that if we use a block code with block length and coding rate less than , where does not change with , then v,w^/ as ·/:i . To prove this, consider any code with block length and coding rate less than , so that , the total number of codewords, is at most ¯ ´¯ ² ² . We can use some of these codewords for the typical sequences , and some for the non-typical ° . The total probability of the typical sequences covered sequences ° by the code, by the weak AEP, is upper bounded by ß Æ H+ÍZM& J F Æ G H Í[ê & F Æ G Y MJ Æ ¯ ´¯ Y ó ² ² MJ Æ ¯ ´¯ ² °¶² ¹ Ú ²® Y Ú ² ® ÝW ì V ÝX ì V Y Ö Ý ¯ (4.24) °¶² ÌL@ Í\ê & F Æ G ° ÔÜ ¯ YÝ °¶² ó T V ï ¶ Therefore, the total probability covered by the code is upper bounded by ßì for Ý ¯ °¶² _a` « (4.25) @ sufficiently large, again by the weak AEP. This probability is equal to because v w is the probability that the source sequence is not covered by the code. Thus v,w « (4.26) ¯ °³² v w D R A F T ß ì Ü Y ó · September 13, 2001, 6:27pm D R A F T 66 A FIRST COURSE IN INFORMATION THEORY ï|ß ì ² ¯ Y °³² ó « Ý Ö (4.27) ï ¶ , in particular This inequality holds when is sufficiently large for any « ï ß V V ì for «ZÜ . Then Ü , v6w ~« when is sufficiently large. ß asforº/ anyi «Zand Hence, v,w/ then «/í¶ . This proves the converse part of or v6w the source coding theorem. ¢a£;¤!¥§¦7¤EFFICIENT ¨§« Let ] ¹ SOURCE ² ¯ ·'¯ H·CODING sequence of ·'¯. Ý be a random binary ² Ý Þ length . Then Ú ] with equality ß if and only if ¯ are drawn i.i.d. according to the uniform distribution on Ìc¶¸· Ô . 4.3 Proof By the independence bound for entropy, ² Ý v Þ º Ú ] Ú ² ¯ Ý with equality if and only if (4.28) ¯ with equality if and only if (4.28) and (4.29), we have ¯ are mutually independent. By Theorem 2.43, Ú ² ¯ vÝ Þ fà ÄÅÉ ¹ ß distributes uniformly on (4.29) Ìc¶¸· ß Ô . Combining ² Ý = Þ º Ú ] Ú ² ¯ YÝ Þ ý· (4.30) ¯ ß Ìc¶¸· Ô where this upper bound is tight if and only if are mutually independent and each of them distributes uniformly on . The theorem is proved. ] ¹ ² ¯ ·'¯ · T·'¯ Ý Ìc¶¸· ß Ô ¯ be a sequence of length such that are drawn Let i.i.d. according to the uniform distribution on , and let denote the generic random variable. Then . According to the source coding theorem, for almost perfect reconstruction of , the coding rate of the source code must be at least 1. It turns out that in this case it is possible to use a source code with coding rate exactly equal to 1 while the source sequence can be reconstructed with zero error. This can be done by simply encoding all the possible binary sequences of length , i.e., by taking . Then the coding rate is given by Ú ²¯ Ý ¹ ß ] ¯ ^ ¹ ] ] Ã]ÄÅδ ^´ ¹ ÃfÄÅ ¹ ß Ö ] (4.31) Since each symbol in is a bit and the rate of the best possible code describing is 1 bit per symbol, are called raw bits, with the connotation that they are incompressible. D R A F T ¯ ·'¯ · ·'¯ September 13, 2001, 6:27pm D R A F T 67 Weak Typicality It turns out that the whole idea of efficient source coding by a block code is to describe the information source by a binary sequence consisting of “approximately raw” bits. Consider a sequence of block codes which encode , where are i.i.d. with into , generic random variable , is a binary sequence with length and ?/ i . For simplicity, we assume that the common alphabet is fi nite. Let u be the reconstruction of by the decoder and v,w be the probability of error, i.e. _a` u (4.32) v6w @ ¹ ² ® ·K® · T·K® Ý ® ] @ ÍoÏ ] ¹ ² ¯ ·'¯ · ·'¯ Ý @ ÌL@ ¹ ê @ Ô Ö ¹ [¶ ® as ÿ/i . 
We will show that Further assume v w / mately raw bits. By Fano’s inequality, Since @ u Ú ² @L´M@ u ÝvÞ|ßvó v,w!ÃfÄÅ´)Ï´ ¹ ßËó is a function of It follows that ] ] Q ;Ï Ú ² ® Ý consists of approxi- !Ã]ÄÅδ)ÏÓ´ Ö v,w (4.33) , Ú ² ] Ý ¹ Ú ² ]+· @ u Ý µÚ ² @ u Ý Ö Ú ² ] Ý µ Ú± ² ² @ u Ý Ý µ @ ³ @ u Ý ¹ Ú ²@ Ý ì Ú ²ý @ M@ u ´ µ ;Ú ² ² ® ² Ý Ý ì ².ßvó v,w!ÃfÄÝ ÅÊ´)ÏÓß ´ Ý ¹ Ú ® ì v w fà ÄÅδ)ÏÓ´ ì Ö (4.34) (4.35) (4.36) (4.37) (4.38) (4.39) On the other hand, by Theorem 4.4, Ú ² ] vÝ Þ ýÖ (4.40) Combining (4.39) and (4.40), we have ² Ú ² ® Ý ì v6wÃfÄÅδ)ÏÓ´ Ý ì ß Þ Ú ² ] vÝ Þ ýÖ (4.41) ² Ý is approximately Since v w / ¶ as º/¹i , the above lower bound on Ú ] equal to ²® Ý ;Ú (4.42) Q when is large. Therefore, (4.43) Ú ² ] Ý ýÖ ] ] Q In light of Theorem 4.4, almost attains the maximum possible entropy. In this sense, we say that consists of approximately raw bits. D R A F T September 13, 2001, 6:27pm D R A F T 68 4.4 A FIRST COURSE IN INFORMATION THEORY THE SHANNON-MCMILLAN-BREIMAN THEOREM Ì® Ô ® and ¿ ß ì ÃfÄÅ ² @ Ý / Ú ² ® Ý (4.44) ¿ ² Ý ² Ý is the in probability as !/ i , where @ ¹ ® ·K®H·c·K® . Here Ú ® entropy of the generic random variables ® as well as the entropy rate of the source Ì® Ô . In Section 2.9, we showed that the entropy rate Ú of a source Ì ® Ô exists if the source is stationary. The Shannon-McMillan-Breiman theorem states that if Ì ® Ô is also ergodic, then ß _a` ?ì (4.45) à à fÄÅ _P` LÌ @ZÔ ¹ Ú ¹ ß Ö ~ b ® Ô is stationary and ergodic, then ì Ã]ÄÅ _a` LÌ @Ô not This means that if Ì only almost always converges, but it also almost always converges to Ú . For ²JÀÝ For an i.i.d. information source with generic random variable generic distribution , the weak AEP states that this reason, the Shannon-McMillan-Breiman theorem is also referred to as the weak AEP for stationary ergodic sources. The formal definition of an ergodic source and the statement of the ShannonMcMillan-Breiman theorem require the use of measure theory which is beyond the scope of this book. We point out that the event in (4.45) involves an infinite collection of random variables which cannot be described by a joint distribution unless for very special cases. Without measure theory, the probability of this event in general cannot be properly defined. However, this does not prevent us from developing some appreciation of the Shannon-McMillan-Breiman theorem. be the common alphabet for a stationary source . Roughly Let speaking, a stationary source is ergodic if the time average exhibited by a single realization of the source is equal to the ensemble average with probability 1. Moree specifically, for any T Þ , Ï Ì® Ô Ì® Ô _P` à ~ ß b º Å ß _· · · ² ® ·K® ·c·K®_ Ý # % ¹ Å ² ® ·K® ·T·K®`_ Ý l ¹ ß · (4.46) # % where Å is a function defined on Ï which satisfies suitable conditions. For e the special case that Ì®Ô satisfies ß _a` º ® ¹ ® l ¹ ß · à (4.47) ~ b D R A F T September 13, 2001, 6:27pm D R A F T 69 Weak Typicality Ì®Ô is mean ergodic. ;ü:¨â[S¤Ó§«]è The i.i.d. source Ì®Ô we say that \^] is mean ergodic under suitable conditions because the strong law of the large numbers states that (4.47) is satisfied. ;ü:¨â[S¤Ó§«]ð Ì® Ô \^] ° ® ¹ ° Ìc¶¸· ß Ô ß ß º ® ¹ ¶l ¹ Consider the source defined as follows. Let be a e binary random variable uniformly distributed on . For all , let . Then à _a` ~ e and à _P` Since ® ¹ Ì®Ô Ã K ß b º ® ¹ (4.49) ® l ¹ ¸¶ Ö (4.50) is not mean ergodic and hence not ergodic. 
Ì® Ô If an information source McMillan-Breiman theorem, ß @ b (4.48) ß º ® ¹ ßl ¹ Ö ß e , _a` Therefore, ~ b is stationary and ergodic, by the Shannon- ÃfÄÅ _a` LÌ @Ô Q J (4.51) when is large, i.e., with probability close to 1, the probability of the sequence which occurs is approximately equal to . Then by means of arguments similar to the proof of Theorem 4.3, we see that there exist approx imately sequences in whose probabilities are approximately equal to , and the total probability of these sequences is almost 1. Therefore, by encoding these sequences with approximate bits, the source sequence can be recovered with an arbitrarily small probability of error when the block length is sufficiently large. This is a generalization of the direct part of the source coding theorem which gives a physical meaning to the entropy rate of a stationary ergodic sources. We remark that if a source is stationary but not ergodic, although the entropy rate always exists, it may not carry any physical meaning. J J D R A F T J Ï ;Ú September 13, 2001, 6:27pm @ D R A F T 70 A FIRST COURSE IN INFORMATION THEORY PROBLEMS 1. Show that for any « ï ¶ , & F Æ G ° is nonempty for sufficiently large . " 2. The source coding theorem with a general block code In proving the converse of the source coding theorem, we assume that each codeword in corresponds to a unique sequence in . More generally, a block code Æg / with block length isg defined by an encoding function and a . Prove that v,w^/ as ·/:i even if we decoding function Å / are allowed to use a general block code. Ï " íÏ O¶ Ï a" 3. Following Problem 2, we further assume that we can use a block code with probabilistic encoding and decoding. For such a code, encoding is defined to and decoding is defined by a transiby a transition matrix from . Prove that v,wm/ as /ci even if we are tion matrix from to allowed to use such a code. c b " Ï " Ï N¶ 4. In the discussion in Section 4.3, we made the assumption that the common alphabet is finite. Can you draw the same conclusion when is infinite but is finite? Hint: use Problem 2. 5. 6. 7. Ï Ï Ú ²® Ý Let and Ú ² beÝÊtwo distributions on the same finite alphabet Ï ² Ú Ý . Show that ï ¶ such ¹ ê Ú probability such¿ that Ú there exists an « that ¿ þ » Jh} » > * ý9d * * Ù Ê)̹ êÍ9> * ¯ * ² J´¯e² Lf °*g ß as ÿ/¶ . Give an example that ¿ Ú but the above convergence does not hold. Let and Ú be two probability distributions on the same finite alphabet Ï with¿ the same support. ï ¶, a) Prove that for any h » J} * Ù Ê)ÌÍ6e * ¯ » * ² ¯KJh¯ >ª² {W ¯ >jike² ² O þ * ý7d * > g f as /Hi . ï ¶, b) Prove that for any h » Jh} * Ù Ê)ÌÍ(e * ¯ » * ² ¯MJ´¯ >M² §W ¯ >jike²² O ü *LlPmnl o*p 4q l o%rstp ju pwv d * gg ßf ¹ ÌÌ® ¯Kyd² · ×µ Ô ÍZÙ"Ô be a family of Universal source coding Let x ; a common alphabet i.i.d. information sources indexed by a finite set Ù with Ï . Define Ú Ó ¹ JMéOL û Ú ² ® ¯Myز Ý y ß Ô , and where ® ¯Myز is the generic random variable for Ì® ¯Kyd² · µ ² Ý { Ù° ¹ z JOL & F Æ l}|~p G · y D R A F T ° September 13, 2001, 6:27pm D R A F T 71 Weak Typicality where « ï ¶. a) Prove that for all b) ; ÍÙ , ÌL@ ¯Kyd² Í ° ² Ù Ý Ô{/ ß ² Ý as /Hi , where @ ¯Kyd² ¹ ® ¯Kyd² ·K® ¯Kyd² ·c·K® ¯Kyd² . ï «, Prove that for any « å ´ ² Ù Ý ´ Þ ¯9J ° ² Ö _a` ° x c) Suppose we know that an information source is in the family but we do not know which one it is. Devise a compression scheme for the information source such that it is asymptotically optimal for every possible source in . x 8. ß Ô be an i.i.d. 
information source with generic random variLet Ì®·µ able ® and finite alphabet Ï . Assume º ²JÀÝ ÃfÄÅ ²JÀÝ Üi »5¿ ¿ and define ² Ý ° ¹ì ÃfÄÅ ¿ @ ì Ú ² ® Ý & ß · · . Prove that ° / ° in distribution, where ° is the for ¹ ²JÀÝ Ã]ÄÅ ²JÀ7Ý ì Gaussian random variable with mean 0 and variance ¦ » ² Ý ¿ ¿ Ú ® . HISTORICAL NOTES The weak asymptotic equipartition property (AEP), which is instrumental in proving the source coding theorem, was first proved by Shannon in his original paper [173]. In this paper, he also stated that this property can be extended to a stationary ergodic source. Subsequently, McMillan [141] and Breiman [34] proved this property for a stationary ergodic source with a finite alphabet. Chung [44] extended the theme to a countable alphabet. D R A F T September 13, 2001, 6:27pm D R A F T Chapter 5 STRONG TYPICALITY Weak typicality requires that the empirical entropy of a sequence is close to the true entropy. In this chapter, we introduce a stronger notion of typicality which requires that the relative frequency of each possible outcome is close to the corresponding probability. As we will see later, strong typicality is more powerful and flexible than weak typicality as a tool for theorem proving for memoryless problems. However, strong typicality can be used only for random variables with finite alphabets. Throughout this chapter, typicality refers to strong typicality and all the logarithms are in the base 2 unless otherwise specified. 5.1 STRONG AEP ²JÀÝ Ì® · µ ß Ô ® ® Ú ²® Ý Ü Ú ²® ¹Ý @ where are i.i.d. with We consider an information source distribution . We use to denote the generic random variable and Hi . Let to denote the common entropy for all , where . ® ¿ ² ® ·K® H· c·K® Ý =ؤ[ä æ;äøå;äJ¥§æúè7«¶5 ²JÀ7Ý is the set The strongly typical set F with respect to ²JÀ · À T·T· À Ý ÍÓÏ suchÆGPthat ¿ O of sequences H ¹ ß º ¬ ²JÀ ³H Ý ì ²JÀÝ Þ h· (5.1) ¿ » ²JÀ ³H Ý is the number of occurrences of À in the sequence H , and h is where ¬ are called an arbitrarily small positive real number. The sequences in F Æ P G O strongly h -typical sequences. Throughout this chapter, we adopt the convention that all the summations, products, unions, etc are taken over the corresponding supports. The strongly D R A F T September 13, 2001, 6:27pm D R A F T 74 A FIRST COURSE IN INFORMATION THEORY F ÆGPO typical set shares similar properties with its weakly typical counterpart, which is summarized as the strong asymptotic equipartition property (strong AEP) below. The interpretation of the strong AEP is similar to that of the weak AEP. ¢a£;¤!¥§¦7¤¨ªè7«]©Ûõ`å;¦¥§æ4ùE \áaÿ In the following, is a small positive quantity such that X/ ¶ as h{/ ¶ . 
1) If HL Í F ÆGPO , then ² H ÝYÞ ¯KJh¯ Æ ² ² Ö Þ (5.2) ¯KJh¯ Æ ² ² ¿ 2) For sufficiently large, _a` ï ß ì h Ö ÌL@ëÍ F Æ GPO Ô | 3) For sufficiently large, ².ß ì h Ý ¯MJ´¯ Æ ² ² Þ ´ F ´ Þ Æ GPO Proof To prove Property 1, we write ¿ ²H Ý ¹ : »Z¿ (5.3) MJ Æ Ö ¯ ´¯ ² (5.4) ² ²JÀ7Ý ®h¯ »9 ² Ö (5.5) Then Ã]ÄÅ ¿ ² H Ý ²JÀ Ý ²JÀÝ ¹ º ¬ ³H ÃfÄÅ ¿ » ² J ² À Ý ¹ º ¬ ³H ì ²JÀ7Ý7ó ¿ » ¹ º ²JÀÝ ÃfÄÅ ²JÀÝ ì » ¿ ¿ Q ß ¹ ì Ú ² ® Ý4ó º » ¬ If HLÍ F ÆGPO , then D R A F T º ²JÀ7ÝKÝ ÃfÄÅ J² À7Ý ¿ Qß ¿ ²JÀ ³ H Ý ì ²JÀ7Ý R ² ì Ã]ÄÅ ²JÀ7ÝKÝ º ¬ » ¿ ¿ ²JÀ ³H Ý ì ²JÀÝ R ² ì ÃfÄÅ ²JÀ7ÝKÝ Ö ¿ ¿ ß J² À Ý ²JÀÝ Þ ì » ¬ ³ H ¿ h· September 13, 2001, 6:27pm (5.6) (5.7) (5.8) (5.9) (5.10) D R A F T 75 Strong Typicality which implies ß J² À Ý ²JÀÝ ² ì ì fà ÄÅ ²JÀ7ÝKÝ ¬ ³ H R » ¿ ¿ ß Þ ì ÃfÄÅ ýt» â ²JÀ7Ý þ º ¬ J² À ³ H Ý ì ²JÀÝ ¿ ¿ » Þ ì h4ÃfÄÅ ý » â ²JÀ7Ý þ ¿ ¹ · where ¹,ì h:ÃfÄÅ ýt» â ¿ ²JÀÝ þ ï ¶¸Ö On the other hand, Q ß ²JÀ ³ H Ý ì ²JÀÝ R ² ì ÃfÄÅ ²JÀÝKÝ º ¬ » ß ²J¿À Ý ²JÀÝ ¿ ² µ ìÓº » ¬ ³ H ì ¿ ì ÃfÄÅ ¿ ²JÀÝKÝ ß J ² À Ý ²JÀ ³H Ý ì ²JÀÝ q þ p ì º ì ¬ µ ÃfÄÅ ýt» â ¿ » ¿ ß ¹ ÃfÄÅ ýt» â ²JÀÝ þ º ¬ ²JÀ ³ H Ý ì ²JÀÝ ¿ » ¿ J ² 7 À Ý µ h§Ã]ÄÅ ýt» â ¿ þ ¹ ì Ö º Q (5.11) (5.12) (5.13) (5.14) (5.15) (5.16) (5.17) (5.18) (5.19) Combining (5.13) and (5.19), we have ì Þ º » Q ß J² À Ý ²JÀ7Ý ² ì ì fà ÄÅ ²JÀÝKÝYÞ Ö R ¬ ³ H ¿ ¿ (5.20) It then follows from (5.9) that or ì ² Ú ² ® ;Ý ó =Ý Þ fà ÄÅ ² H Ý=Þ ì ² Ú ² ® Ý ì Ý · ¿ ² H Ý=Þ ¯MJ´¯ Æ ² ² · Þ ¯MJ´¯ Æ ² ² ¿ ¶ as h{/ ¶ , proving Property 1. (5.21) (5.22) where / To prove Property 2, we write ¬ D R A F T ²JÀ ³*@ Ý ¹ º ì ²JÀÝ · September 13, 2001, 6:27pm (5.23) D R A F T 76 A FIRST COURSE IN INFORMATION THEORY ß if ® ¹ À J ² À Ý ¹ < ì ¶ if ® ¹ ê À . ²JÀÝ · ¹ ß ··T· are i.i.d. random variables with Then ì _a` Ì ì ²JÀÝ ¹ ß Ô ¹ ¿ ²JÀÝ ª and _P` Ìªì ²JÀ7Ý ¹ ¶Ô ¹ ß ì ¿ ²JÀÝ Ö Note that ²JÀ7Ý ¹ ².ß ì ²JÀÝKÝ c¶ ó ²JÀÝ ß ¹ ²JÀÝ Ö ì ¿ ¿ï ¿ e By the weak law of large numbers, for any h ¶ and for any À ÍÓÏ ß ²JÀ7Ý ²JÀÝ ï h _a` h º ì ì l Ü ) ´ Ó Ï ´ ) ´ Ï ´ ¿ where (5.24) (5.25) (5.26) (5.27) , (5.28) for sufficiently large. Then ß J² À Ý ²JÀ7Ý ï h À ì ¬ e * ³ @ for some ¿ ´)ÏÓ´ ß _a` J² À7Ý ì ²JÀ7Ý ï h for some À º e e ì ¿ ´)ÏÓ´ ß _a` z e º ì ²JÀÝ ì ²JÀÝ ï h lml ¿ ´)ÏÓ´ » ß º _a` º ì ²JÀ7Ý ì ²JÀ7Ý ï h l » ¿ ´)ÏÓ´ º h » ´)Ï´ h· _P` ¹ ¹ Þ ¹ Ü l (5.29) (5.30) (5.31) (5.32) (5.33) where we have used the union bound1 to obtain (5.31). Since º ß J² À Ý ²JÀÝ ï ì » ¬ ³ H ¿ h ß J² À Ý ²JÀÝ ï h ¬ ³ H ì ¿ ´)ÏÓ´ implies 1 The union bound refers to D R A F T for some (5.34) À ÍÓÏ , (5.35) 8 ) 0 + 6 ) + 6 ) + v Ø ü Ø September 13, 2001, 6:27pm D R A F T 77 Strong Typicality we have _a` d(@ÑÍ F ÆGPO g ß J² À Ý ²JÀÝ ¹ º *³ @ ì ¿ » ß ²JÀ Ý º ¹ ßì ³*@ ì ¿ »ß ²JÀ ³*@ Ý ì ²JÀ7Ý µ ßì ¿ ï ß ì ch · e _P` _P` Þ hl ²JÀ7Ý ï h l ï h for some À ÍÏ ´)Ï´ e ¬ ¬ _P` ¬ (5.36) (5.37) (5.38) (5.39) proving Property 2. Finally, Property 3 follows from Property 1 and Property 2 in exactly the same way as in Theorem 4.3, so it is omitted. àµ ß ´ F ÆGPO ´ h ï ¶ Remark Analogous to weak typicality, we note that the upper bound on in Property 3 holds for all , and for any , there exists at least one strongly typical sequence when is sufficiently large. See Problem 1 in Chapter 4. In the rest of the section, we prove an enhancement of Property 2 of the strong AEP which gives an exponential bound on the probability of obtaining a non-typical vector2 . This result, however, will not be used until Chapter 15. 
¢a£;¤!¥§¦7¤¨ªè7«]¬ For sufficiently large , there exists _P` ÌL@ Íê F Æ GPO ÔÜ j ¯ O ² Ö ² h ÝYï ¶ such that (5.40) The proof of this theorem is based on the Chernoff bound [43] which we prove in the next lemma. v¤¨Ê¨üè7« úõnö£;¤¦7æ§¥¤I¥;æ4÷:ÿ ¯ 4 Let be a real random variable and be any nonnegative real number. Then for any real number U , and Ã]ÄÅ _a` cÌ ¯ªµU!Ô Þ ì U ó ÃfÄÅ ; Ã]ÄÅ _a` Ìc¯ Þ U!Ô Þ U ó fà ÄÅ ; ) ) yÇ yÇ Ö ; (5.41) (5.42) 2 This result is due to Ning Cai and Raymond W. Yeung. An alternative proof based on Pinsker’s inequality (Theorem 2.33) and the method of types has been given by Prakash Narayan (private communication). D R A F T September 13, 2001, 6:27pm D R A F T 78 A FIRST COURSE IN INFORMATION THEORY 2 1.8 2 1.6 s ( y−a) 1.4 1.2 u ( y−a) 1 1 0.8 0.6 0.4 0.2 0 −1 0 1 a 2 3 4 K |tl}9p . Figure 5.1. An illustration of ï Proof Let I Then for any ; µ¶ , Since I I ²¯ ì U Ý ¹ we see that _a` ì ¯ ² ¯ ì U Ý 4Þ ) y ¯ Ç _P` Oð ² Á Ý ¹ ß if Á µ¶ ¶ if Á Ü¶Ö ² Ý=Þ (y ¯ ½ á ² Ö I Á ì U This is illustrated in Fig. 5.1. Then y 5 á ² ¹ (5.43) (5.44) y á Ìc¯ªµUÔ´ ßËó_a` Ìc¯ ÜU!Ômc¶ ¹ Ìc¯ªµU!Ô Þ y á ) yÇ ¹ ) yÇ Ö _P` Ìc¯ µUÔ · F y | G Ö á Ê)ÌÍ (5.45) ì (5.46) (5.47) Then (5.41) is obtained by taking logarithm in the base 2. Upon replacing by and U by U in (5.41), (5.42) is obtained. The lemma is proved. £¡ ¢ ¤ ¥§¦w¡©¨ª[« ¬Kj®°¯=±=²´µ ³ ¶ ¦w¡º¨°»[¼½¦¥§¦w¡º¨©¾À¿9¨9Á ¶%·D¸4¹  Ã!Ä ¼½¦¥Å¦w¡º¨©¾À¿9¨Æ¾ ¬Kj®°ÇÉÈËÊjÌjÍZÎÏBÐÑ7Ò Ï-ÓÔLÕ Ö ¯ Proof of Theorem 5.3 We will follow the notation in the proof of Theorem 5.2. Consider such that . Applying (5.41), we have D R A F T September 13, 2001, 6:27pm (5.48) D R A F T 79 Strong Typicality Ø× Õ !Ã Ä ½¼ ¦P¥§¦w¡©¨Æ¾À¿9¨Æ¾ ¬Kj®ÙÀ¶-Ú ·D³ ¸ Ç È ÊjÌBÒ Ï ÓÔLÕ Ö%Û Ø Õ Ã!Ä ½¼ ¦P¥§¦w¡©¨Æ¾À¿9¨Æ¾Ü¼ ¬Kj® ¦Ý à ¥§¦w¡©¨Æ¾U¥§¦w¡©¨ Ê Ì ¨ ÂÞ Õ Ã!Ä ½¼ ¦P¥§¦w¡©¨Æ¾À¿9¨Æ¾Ü¼¦ ¬ËßàÊ ¨`á ¸ ¦ à ¥§¦w¡º¨Æ¾\¥§¦w¡º¨ ÊjÌ ¨ Ø Ã ¼ È Ä ¦ ¥§¦w¡©¨©¾À¿9¨n¾â¦ ¬ËßàÊ ¨ á ¸ ¥Å¦w¡º¨%¦Ý Ã Ê Ì ¨ Ö=ã (5.49) Z where a) follows because (5.50) (5.51) (5.52) ¹ ¶ ¦w¡º¨ are mutually independent; b) is a direct evaluation of the expectation from the definition of (5.24); c) follows from the fundamental inequality ¬ßåä  ä Ã Ý . 
æ Ô ¦ Ä ã ¿9¨ Ø Ä ¦ ¥§¦w¡º¨©¾W¿7¨D¾â¦ ¬ËßàÊ ¨ á ¸ ¥§¦w¡©¨%¦Ý Ã Ê Ì ¨ ã ¹ ¶ ¦w¡º¨ in In (5.52), upon defining (5.53) we have ¬Kj®¯±ç² µ ³ ¶ w¦ ¡©¨°»W¼½¦¥§¦w¡©¨©¾À¿9¨7Á Âèà ¼ æ Ô ¦ Ä ã ¿9¨ ã ¶-·D¸ ¹ or ¯±=²éµ ³ ¶ ¦w¡©¨°»W¼½¦¥§¦w¡©¨Æ¾À¿9¨7Á Â Ê á ³7ê,ë Ó Ì*ì í Õî ¶-·D¸ ¹ It is readily seen that æ Ô ¦ « ã ¿7¨ Ø « î Ä Regarding ¿ as fixed and differentiate with respect to , we have æÆÔï ¦ Ä ã ¿9¨ Ø ¥§¦w¡º¨%¦Ý Ã Ê Ì ¨Æ¾À¿ î Then æ Ôï ¦ « ã ¿7¨ Ø ¿ª[« and it is readily verified that for D R A F T æÆÔï ¦ Ä ã ¿9¨ð»[« « ÂÄñ ¬Kj®½ò ݾ ¥§¦w¿ º¡ ¨ôó î September 13, 2001, 6:27pm (5.54) (5.55) (5.56) (5.57) (5.58) (5.59) (5.60) D R A F T 80 A FIRST COURSE IN INFORMATION THEORY æ Ô ¦ Ä ã ¿7¨ is strictly positive for «öõ Äñ ¬Kj®½ò ݾ ¥§¦w¿ ¡º¨ôó î Therefore, we conclude that (5.61) On the other hand, by applying (5.42), we can obtain in the same fashion the bound or where Then and ¬Kj®¯±ç²éµ ³ ¶ ¦w¡©¨  ¼½¦¥§¦w¡©¨ à ¿9¨7Á Âèà ¼Æ÷ Ô ¦ Ä ã ¿9¨ ã ¶-·D¸ ¹ ¯±=² µ ³ ¶ ¦w¡©¨  ¼½¦¥§¦w¡©¨ à ¿7¨jÁ Â Ê á ³9ø ë Ó Ì*ì í Õ ã ¶-·D¸ ¹ ÷ Ô ¦ Ä ã ¿7¨ Ø Ã!Ä ¦¥Å¦w¡º¨ à ¿9¨Æ¾â¦ ¬ßàÊ ¨`á ¸ ¥§¦w¡º¨%¦Ý Ã Ê á Ì ¨ î ÷ Ô ¦ « ã ¿7¨ Ø « ã ÷ Ôï ¦ Ä ã ¿9¨ Ø ¥§¦w¡©¨%¦ Ê á Ì Ã ÝL¨Æ¾À¿ ã (5.62) (5.63) (5.64) (5.65) (5.66) which is nonnegative for (5.67) « ÂÄùÂèà K¬ j®úò Ý Ã ¥§¦w¿ ¡©¨ôó î In particular, ÷ Ôï ¦ ã« ã ¿9¨ Ø ¿ûª[« î (5.68) Ä Therefore, we conclude that ÷ Ô ¦ ¿7¨ is strictly positive for «üõ ÄùÂèà ¬Kj® ò Ý Ã ¥§¦w¿ ¡©¨ ó î (5.69) Ä By choosing satisfying (5.70) «üõ ÄùÂ[ýöþ ß\ÿˬKj®úò ݾ ¥§¦w¿ ¡©¨ôó ã à ¬Ëj®½ò Ý Ã ¥Å¦w¿ ¡º¨ôó ã æ Äã Äã both Ô ¦ ¿9¨ and ÷ Ô ¦ ¿9¨ are strictly positive. From (5.55) and (5.63), we have ¯± ² Ý µ ³ ¶ ¦w¡º¨ à ¥§¦w¡º¨ »¿ Á ¼ ¶%D· ¸ ¹ D R A F T September 13, 2001, 6:27pm D R A F T 81 Strong Typicality Ø =¯ ±=² ¶-µ ·D³ ¸ ¹ ¶ ¦w¡©¨ à ¼(¥§¦w¡©¨ »W¼n¿Á (5.71)  ¯=± ² µ ³ ¶ ¦w¡º¨°»W¼½¦¥Å¦w¡º¨Æ¾À¿9¨ Á ¶%·D¸ ¹ (5.72) ¾ ¯=±ç²é¶%µ ·D³ ¸ ¹ ¶ ¦w¡º¨  ¼½¦¥§¦w¡º¨ à ¿9¨7Á Â Ê á ³ôê,ë Ó Ì*ì í Õ ¾ Ê á ³9ø-ë Ó Ì*ì í Õ (5.73) Â Ê LÊ á ³ Ó ê,ë Ó Ì*ì í Õ ì ø-ë Ó Ìì í ÕÕÑ (5.74) Ó Ó Õ Ó Õ Õ Ê * Ì ì í ì * Ì ì í Ø á ³ Ó Õ ê,ã ë ø-ë á Î (5.75) Ø Ê á ³ jë í (5.76) where ý½þ ß ¦ æ Ô ¦ Ä ã ¿9¨ ã ÷ Ô ¦ Ä ã ¿9¨¨ Ã Ý î Ø Ô ~ ¦ 9 ¿ ¨ (5.77) ¼ æ Äã Then Ô ~¦ ã ¿9¨ is strictly positive for sufficiently large ¼ because both Ô ¦ ¿7¨ Ä and ÷ Ô ¦ ¿9¨ are strictly positive. Finally, consider ¯± ¢ ³ í Ø ¯=± ² µ Ô Ý ¦w"¡ ! ¨ à ¥§¦w¡©¨  ¿ Á (5.78) ¼ (5.79) » ¯= ± # ¼ Ý ¦w"¡ ! ¨ à ¥§¦w¡©¨  ¿ for all ¡\%¢ $'& Ø Ý Ã ¯= (± # Ý ¦w"¡ ! ¨ à ¥§ ¦w¡º¨ ª¿ for some ¡£)¢ $)& (5.80) ¼ (5.81) » Ý Ã µ Ô ¯=(± # ¼ Ý ¦w"¡ ! ¨ à ¥§¦w¡º¨ ª*¿ & µ µ = ¯ ç ± ² ³ Ý Ã Ã Ø Ý Ô ¶ ¦w¡©¨ ¥§¦w¡©¨ ª ¿Á (5.82) ¹ ¶ D · ¸ ¼ Ø Ý Ã µ ¯=±=² Ý ¶-µ ·D³ ¸4¹ ¶ ¦w¡©¨ à ¥§¦w¡©¨ ª ¿Á (5.83) Ô+ ,9ÓÔLÕ.-0/ ¼ (5.84) » Ý Ã Ô+ ,9µ ÓÔLÕ.-0/ Ê á ³ 3ë Ó í Õ ã where the last step follows from (5.76). Define D R A F T ¦~¿7¨ Ø 2ÊÝ 1 Ô + ,9ýöÓÔLþ Õ3ß -0/ Ô ¦~¿9¨54 î September 13, 2001, 6:27pm (5.85) D R A F T 82 A FIRST COURSE IN INFORMATION THEORY ¼ Then for sufficiently large , ¯=± ¢ ³ í6 ª ¯± 8¢7 ³ 9 í or where 5.2 ¦~¿9¨ Ý Ã Ê á ³ Ó í Õ ã õ Ê á ³: Ó í Õ ã (5.86) (5.87) is strictly positive. The theorem is proved. STRONG TYPICALITY VERSUS WEAK TYPICALITY As we have mentioned at the beginning of the chapter, strong typicality is more powerful and flexible than weak typicality as a tool for theorem proving for memoryless problems, but it can be used only for random variables with finite alphabets. We will prove in the next proposition that strong typicality is stronger than weak typicality in the sense that the former implies the latter. ; <>="?@=BACEDFC.="GIHJEH RTS « as ¿ S « . 
For any Kè¢L$ ³ , if K ¢M ³ 9 í , then K ¢ON ³ QP , where ¢ ³ 9 í , then Ê á ³ ÓVUàÓ ÕXW P Õ Â ¥§¦YKD¨ Â Ê á ³ VÓ UàÓ Õ á P Õ ã (5.88) Z Z K ¬ j ® Ý Â Ã Â Ã R ã R § ¥ YKD¨ ¦ w ¦ 1 ¤ n ¨ ¾ ¦w¤I¨ (5.89) ¼ « as ¿ S « . Then K1¢%N ³ QP by Definition 4.2. The proposition is Proof By Property 1 of strong AEP (Theorem 5.2), if K or where R[S proved. We have proved in this proposition that strong typicality implies weak typicality, but the converse is not true. The idea can easily be explained without on $ different from the any detailed analysis. Consider a distribution such that distribution ¥§¦w¡º¨ ¥ ï ¦w¡º¨ à µ ¥ ï ¦w¡º¨ ¬Ëj® ¥ ï ¦w¡º¨ Ø Z ¦w¤1¨ ã (5.90) Ô ï i.e., the distribution ¥ ¦w¡º¨ has the same entropy as the distribution ¥§¦w¡º¨ . Such ï a distribution ¥ ¦w¡º¨ can always be found. For example, the two distributions ¥§¦w¡º¨ and ¥ ï ¦w¡©¨ on $ Ø « ã Ý defined respectively by (5.91) ¥§¦ «4¨ Ø « ]î \ ã ¥§¦ÝL¨ Ø « _î ^ ! and ¥ ï ¦ «4¨ Ø « _î ^ ã ¥ ï ¦ÝL¨ Ø « ]î \ (5.92) D R A F T September 13, 2001, 6:27pm D R A F T 83 Strong Typicality î]\ have the same entropy ` ¦ « ¨ . Now consider a sequence Kè¢a$ ³ such that the relative occurrence of each ¡£% ¢ $ is close to ¥ ï ¦w¡©¨ . By the above theorem, ï Z the empirical entropy of K is close to the entropy of the distribution ¥ ¦w¡©¨ , i.e., ¦w¤I¨ . Therefore, K is weakly typical with respect to ¥§¦w¡©¨ , but it is not strongly typical with respect to ¥Å¦w¡º¨ because the relative occurrence of each ¡£%¢ $ is close to ¥ ï ¦w¡©¨ , which is different from ¥§¦w¡©¨ . Z 5.3 JOINT TYPICALITY In this section, we discuss strong joint typicality with respect to a bivariate distribution. Generalization to a multivariate distribution cb ed is straightforward. cb Consider a bivariate information source where gf cb are i.i.d. with distribution . We use to denote the pair of generic random variables. ¥§¦w¡ ã ¨ ¦w¤ã ¶ ã ¶ ¨ ã » Ý ¦w¤ ¨ ¦w¤ ¶ ã ¶ ¨ hTi*jkClGFCXDFCY=BGMHJEm ³ on í with respect to ¥§¦w¡ ãgf ¨ The strongly jointly typical set ãgp ¨ð¢%$ ³rqts!³ such that is the set of ¦YK µ µvu Ý ¦w¡ ãgf !gK ãgp ¨ à ¥§¦w¡ ãgf ¨  ¿ ã (5.93) Ô ¼ ãgf ãgp ¨ is the number of occurrences of ¦w¡ ãgf ¨ in the pair of sewhere ¦w¡ ãgp !gK quences ¦YK ãg¨ p , and ¿ is an arbitrarily small positive real number. A pair of ¨ is called strongly jointly ¿ -typical if it is in ³ on í . sequences ¦YK ãgp wyxFi0=B<i*z{HJX|~}e=BG"A6C.ADFi*G"* If Y ¦ K p ¢ ³ n í . ãgp ¨ð¢ ³ n í , then Proof If ¦YK µ µ u Ý ¦w¡ ãgf !gK ãgp ¨ Ô ¼ Strong typicality satisfies the following consistency property. Upon observing that we have ¨ ¢ ³ on í , then K ¢ ³ í à ¥§¦w¡ gã f ¨  ¿ î ¦w¡B!gKD¨ Ø µu ¦w¡ ãgf !gK gã p ¨ ã µ Ý w¦ ¡"!gKD¨ à §¥ ¦w¡º¨ Ô ¼ u µ µ ã g f gã p à µ u gã f Ý Ø Ô w ¦ ¡ !gK ¨ ¥ § w ¦ ¡ ¨ ¼ D R A F T September 13, 2001, 6:27pm and (5.94) (5.95) (5.96) D R A F T 84 A FIRST COURSE IN INFORMATION THEORY Ø µ Ô vµ u ò Ý ¦w¡ ãgf !gK ãgp ¨ à ¥Å¦w¡ ãgf ¨ (5.97) ¼ ó  µ µ u Ý w¦ ¡ ãgf !gK ãgp ¨ à ¥§¦w¡ ãgf ¨ (5.98) Ôã ¼  ¿ (5.99) p ¢ ³ n> í . The theorem is proved. i.e., K ¢ ³ 9 í . Similarly, cã b For a bivariate i.i.d. source ¦w¤ ¶ ¶ ¨ , we have the strong joint asymptotic equipartition property (strong JAEP), which can readily be obtained by cã b applying the strong AEP to the source ¦w¤ ¶ ¶ ¨ . wyxFi0=B<i*z{HJE2}e@DF<=BG"0; Let ã c b ã gã Ø ¦ ¨ ¦¦w¤ ¸ ¸ ¨ ¦w¤T cã b L¨ ã ã ¦w¤ ³ cã b ³ ¨¨ ã (5.100) cã b cã b ¤ ~¨ are i.i.d. with generic pair of randomS variablesS ¦w¤ ¨ . In the where ¦w « as ¿ « . 
following, is a small positive quantity such that gã p ¨à¢ ³ n í , then 1) If Y¦ K Ê á ³ VÓ UàÓ ì n lÕ W>LÕ Â ¥§Y¦ K gã p ¨ Â Ê á ³ Ó UåÓ ì n Õ á LÕ î (5.101) 2) For ¼ sufficiently large, ¯=± ¦ g㠨𢠳 on í ª Ý Ã ¿ î (5.102) 3) For ¼ sufficiently large,  ³ nF í 6Â Ê ³ VÓ UàÓ ì n XÕ W>LÕ î ¦Ý Ã ¿7¨ Ê ³ VÓ UàÓ ì n Õ á LÕ (5.103) ¦ ã ¨ Ê ³ UàÓ ì n Õ Ê ³ UåÓ Õ Ê ³ àÓ ì Õ Ê ³ åÓ Õ From the JAEP, we n are approxigp can see the following. SinceU there U strong mately typical YK p pairs andgpapproximately typical K , for a typical K , the number of such that YK is jointly typical is approximately ¦ ã ¨ Ø Ê ³ UåÓ n Õ (5.104) on the average. The next theorem reveals that this is not only true on the average, p but it is ingp fact true for every typical K as long as there exists at least one such that YK is jointly typical. wyxFi0=B<i*z{HJE ¦ ã ¨ D R A F T 1¢t ³ í , define ³ n9 9 í9¦YKD¨ Ø p ¢ ³ n í¦YK gã p à ¨ ¢ ³ nF í î For any K September 13, 2001, 6:27pm (5.105) D R A F T 85 Strong Typicality If n9 Y¦ KD¨ ³ í where S » Ý , then Ê ³ ÓVUàÓ n Õ á0 Õ Â ³ n í7¦YKD¨ (Â Ê ³ ÓUåÓ n ÕXW Õ ã « as ¼ S¢¡ and ¿ S « . (5.106) We first prove the following lemma which is along the line of Stirling’s approximation [67]. £i*zz¤HJ3¥¦ ¼1ª« , ¬¼ Ëß ¼ à ¼  ¬Ëß ¼ §  w¦ ¼ú¾âÝL¨ ¬Ëß ¦w¼¾ ÝL¨ à ¼ î For any (5.107) ¾ ˬ ß ¼ î ¬ß Since ¡ is a monotonically increasing function of ¡ , we have ¨ ¶ ¬Ëß d ¨ ¶ W ¸ ¬Ëß î Ë ¬ ß ¶ á ¸ ¡©3¡\õ õ ¶ T¡ ©3¡  d  ¼ , we have Summing over Ý ¨ ¬Ëß ³/ T¡ ©3¡Uõ ¬Ëß ¼ §õ ¨ ¸ ³ W ¸ ¬ß ª¡ ©4¡ ã or ¼ ¬Ëß ¼ à ¼1õ ¬Ëß ¼ §õ ¦w¼ú¾âÝL¨ ¬Ëß ¦w¼¾ ÝL¨ à ¼ î Proof First, we write ¬Ëß ¼ § Ø ¬ ß Ý ¾ ˬ ßàÊ ¾ (5.108) (5.109) (5.110) (5.111) The lemma is proved. ¿ ¢ ³ í µ Ý ¦w¤~!gKD¨ à ¥Å¦w¡º¨  ¿ î Ô ¼ ¢ $ , This implies that for all ¡£% ã Ý ¦w« ¤ !gKD¨ à ¥§¦w¡º¨  ¿ ¼ or ¥§¦w¡©¨ à ¿  ¼ Ý ¦w¡B!gKD¨  ¥§¦w¡©¨Æ¾À¿ î Proof of Theorem 5.9 Let be a small positive real number and positive integer to be specified later. Fix an K t , so that D R A F T September 13, 2001, 6:27pm ¼ be a large (5.112) (5.113) (5.114) D R A F T 86 A FIRST COURSE IN INFORMATION THEORY Assume that n Y¦ KD¨ ³ í » Ý , and let ¬ gã f ã ãgf ¦w¡ ¨ ¦w¡ ¨å¢$ qts (5.115) be any set of nonnegative integers such that µvu 1. for all ¡£¢)$ ¬ ¦w¡ gã f ¨ Ø w¦ ¡"!gKD¨ (5.116) , and gã f ãgp Ø ¬ ãgf w ¦ ¡ !gK ¨ ¦w¡ ¨ (5.117) gf ã gã p for all ¦w¡ ¨°¢%$ qts , then ¦YK 𨠢t ³ on í . ¬ ãgf ¦w¡ ¨ satisfy Then from (5.93), µ vµ u Ý ¬ ¦w¡ gã f ¨ à ¥Å¦w¡ gã f ¨  ¿ ã (5.118) Ô ¼ gã f which implies that for all ¦w¡ ¨°%¢ $ qts , Ý ¬ ¦w¡ gã f ¨ à ¥Å¦w¡ gã f ¨  ¿ ã (5.119) ¼ or ¥§¦w¡ gã f ¨ à ¿  ¼ Ý ¬ ¦w¡ gã f ¨  ¥§¦w¡ gã f ¨Æ¾À¿ î (5.120) ã g f ¬ Such a set ¦w¡ ¨ exists because ³ n í Y¦ KD¨ is assumedp to be nonempty. 2. if Straightforward combinatorics reveals that the number of constraints in (5.117) is equal to ® ¦¬ ¨ Ø ÚÔ u î ° ¬ ¦w¡B!gKDãg¨¯f § ¦w¡ ¨¯§ ¬Ëß ® ¬ Using Lemma 5.10, we can lower bound ¬Ëß ® ¦ ¬ ¨ » µ Ô # ¦w¡B!gKD¨ ¬Ëß Ã µvu ¦ ¬ ¦w¡ ãgf ¨D¾ Ø× Õ µ Ô ÿ ¦w¡"!gKD¨ ¬Ëß D R A F T which satisfy the (5.121) ¦ ¨ as follows. ¦w¡"!gKD¨ à ¦w¡B!gKD¨ ÝL¨ ¬Ëß ¦ ¬ ¦w¡ ãgf ¨n¾ ÝL¨ à ¬ w¦ ¡ gã f ¨ Á ¦wB¡ !gKD¨ September 13, 2001, 6:27pm (5.122) D R A F T 87 Strong Typicality à µ u ¦ ¬ ¦w¡ gã f ƨ ¾âÝL¨ ¬Ëß ¦ ¬ ¦w¡ ãgf ¨Æ¾ ÝL¨ 4 (5.123) ±Õ µ » Ô ¦w¡"!gKD¨ ¬ß ¦w¼½¦¥§¦w¡©¨ à ¿7¨`¨ à µ u ¦ ¬ ¦w¡ gã f ¨Æ¾âÝL¨ ¬Ëß ÿ ¼ ò ¥§¦w¡ ãgf ¨Æ¾W¿ð¾ Ý Á î (5.124) ¼åó In the above, a) follows from (5.116), and b) is obtained by applying ¸ ¸ ¬ ãgf the lower bound on ¼ á ¦w¡"!gKD¨ in (5.114) and the¬Ëß upper bound on ¼ á ¦w¡ ¨ in (5.120). 
Also from (5.116), the coefficient of ¼ in (5.124) is given by µ 1 ¦w"¡ !gKD¨ à µ u ¦ ¬ ¦w¡ gã f ¨n¾ ÝL¨ 4 Ø Ã $ s î (5.125) Ô Let ¿ be sufficiently small and ¼ be sufficiently large so that (5.126) «üõ1¥§¦w¡©¨ à ¿õ Ý and gã f Ý õ Ý (5.127) § ¥ w ¦ ¡ © ¨ W ¾ ð ¿ ¾ ¼ f for all ¡ and . Then in (5.124), both the logarithms ¬Ëß ¦¥§¦w¡º¨ à ¿7¨ (5.128) and ¬ßò ¥§¦w¡ gã f ¨Æ¾À¿å¾ Ý (5.129) ¼ó are negative. Note that the logarithm in (5.128) is well-defined by virtue of (5.126). Rearranging the terms in (5.124), applying the upper bound in (5.114) and the lower bound3 in (5.120), and dividing by , we have ¼ ¸ Ë ¬ ß ® ¬ ¼á ¦ ¨ µ » Ô ¦ ¥§¦w¡º¨©¾W¿7¨ ¬Ëß ¦¥§¦w¡º¨ à ¿7¨ à µ Ô µvu ò ¥§¦w¡ gã f ¨ à ¿ð¾ ¼°Ý ó ¬q Ëßò ¥§¦w¡ ãgf ¨n¾À¿å¾ Ý Ã $ s ¬ß ¼ ZT² ZT² Ø Zà ² b ¦w¤1¨n¾ ¦w¤ ã ãcb ¼ã ¨Æó ¾´³µt¦w¼ ã ¿7¼¨ Ø ¦ ¤I¨ÆI¾ ³9µt¦w¼ ¿9¨ u u u ú · ¸ ,9ÓÔ ì Õ Ô ,9ÓÔ ì ÕW í W Î Ñ - ¸ (5.130) (5.131) (5.132) 3 For the degenerate case when for some and , , and the logarithm in (5.129) is in fact positive. Then the upper bound instead of the lower bound should be applied. The details are omitted. D R A F T September 13, 2001, 6:27pm D R A F T 88 A FIRST COURSE IN INFORMATION THEORY ã where ³9µt¦w¼ ¿9¨ denotes a function of ¼ and ¿ which tends to 0 as ¼ S¶¡ and ¿ S « . Using a similar method, we can obtain a corresponding upper bound as follows. ¬Ëß ® ¦ ¬ ¨  µ # ¦ ¦w"¡ !gKD¨Æ¾âÝL¨ ¬Ëß ¦ ¦w"¡ !gKD¨n¾âÝL¨ à ¦w"¡ !gKD¨ Ô Ã µ u ¬ ¦w¡ gã f ¨ ¬Ëß ¬ ¦w¡ gã f ¨ à ¬ ¦w¡ gã f ¨ Á (5.133) Ø µ Ô ÿ ¦ ¦w"¡ !gKD¨D¾ ÝL¨ ¬Ëß ¦ ¦w"¡ !gKD¨n¾ ÝL¨ à µ u ¬ ¦w¡ gã f ¨ ¬Ëß ¬ ¦w¡ gã f ¨ 4 (5.134)  µ # ¦ ¦w¡"!gKD¨Æ¾âÝL¨ ¬ËßIÿ ¼ ò ¥§¦w¡©¨©¾À¿å¾ Ý Ô ¼å ó à µ u ¬ ¦w¡ gã f ¨ ¬Ëß ¼½¦¥§¦w¡ gã f ¨ à ¿7¨ Á î (5.135) ¬Ëß In the above, the coefficient of ¼ is given by µ 1 ¦ ¦w¡B!gKD¨n¾ ÝL¨ à vµ u ¬ ¦w¡ gã f ¨ 4 Ø $ î (5.136) Ô ¿ ¼ be sufficiently large so that ¥§¦w¡©¨©¾À¿å¾ ¼ Ý õ Ý «üõ1¥Å¦w¡ gã f ¨ à ¿õ Ý We let be sufficiently small and and for all ¡ f (5.137) (5.138) and . Then in (5.135), both the logarithms ¬ßò ¥§¦w¡º¨©¾W¿ð¾ Ý ¼åó ¬Ëß ¦¥§¦w¡ gã f ¨ à ¿9¨ and (5.139) (5.140) are negative, and the logarithm in (5.140) is well-defined by virtue of (5.138). Rearranging the terms in (5.135), applying the lower bound4 in (5.114) and the 4 If ,9ÓÔLÕ ·£¸ 9, ÓÔLÕW í W Î Ñ -T/ , , and the logarithm in (5.139) is in fact positive. Then the upper bound instead of the lower bound should be applied. The details are omitted. D R A F T September 13, 2001, 6:27pm D R A F T 89 Strong Typicality ¼ ¸ ¼Åá ¬Ëß ® ¦ ¬ ¨  µ ò ¥§¦w¡º¨ à ¿ð¾ Ý ¬Ëß ò ¥§¦w¡©¨©¾À¿å¾ Ý Ô ¼ó ¼ ó ¬Ëß Ã µ µu ¦¥§¦w¡ ãgf ¨Æ¾X¿9¨ ¬Ëß ¦¥Å¦w¡ ãgf ¨ à ¿9¨Æ¾ $ ¼ ¼ ZTÔ ² ZT² ãcb ã Ã Ø Z ² b ¦w¤I¨D¾ ¦w¤ ã ¨Æã ¾´³·¦w¼ ¿9¨ Ø ¦ ¤I¨Æ´¾ ³ · ¦w¼ ¿7¨ ã where ³ · ¦w¼ ¿9¨ denotes a function of ¼ and ¿ which tends to 0 as ¼ ¿ S «. upper bound in (5.120), and dividing by , we have (5.141) (5.142) (5.143) S¸¡ and Now from (5.132) and (5.143) and changing the base of the logarithm to 2, we have Z b ¦ ¤I¨D¾´³9µk¦w¼ ã ¿9¨  ¼ á ¸ ˬ j® ® ¦ ¬ ¨  Z ¦ b I¤ ¨n¾´³ · w¦ ¼ ã ¿9¨ î (5.144) Z b ¸ K ¬ j ® n 9 (5.145) ¼Åá ³ í7¦YKD¨ » ¦ ¤1¨Æ¾´³µt¦w¼ ã ¿7¨ î To upper bound ³ n 9 í Y¦ KD¨ , we ¬ ãgf have to consider that there are in general ¬ more ¦w¡ ¨ . Toward this end, by observing that¬ ¦w¡ gã f ãgf ¨ than one possible set of takes values between 0 and ¼ , the total number of possible sets of ¦w¡ ¨ is at most ¦w¼¾ ÝL¨ ¹o º( î (5.146) From the lower bound above, we immediately have This is a very crude bound but it is good enough for our purpose. 
Then from the upper bound in (5.144), we obtain ¹o º» Z b ¸ ¸ Ë ¬ j ® K ¬ j ® 6  n 9 ¼Åá ³ í9¦YKD¨ ¼Åá ¦w¼ù¾ÜÝL¨ ¾ ¦ I¤ ¨(¾%³·¦w¼ ã ¿9¨ î (5.147) Since the first term of the above upper bound tends to 0 as ¼ S¸¡ , combining (5.145) and (5.147), it is possible to choose a (depending on ¼ and ¿ ) which tends to 0 as ¼ S¸¡ and ¿ S « such that Z ¼ á ¸ ¬Kj® n í _Ó ¼jÕ ôà ¦ b ¤I¨ ( ã (5.148) Î or Ê ³ VÓ UàÓ n Õ á0 Õ Â ³ n í7¦YKD¨ (Â Ê ³ Ó UåÓ n XÕ W Õ î (5.149) The theorem is proved. p ¦ ã ¨ Ê ³ åÓ Õ The above theorem says that for any typical K , as long as thereU isn9one typical gp p such that YK is jointly typical, there are approximately such D R A F T September 13, 2001, 6:27pm D R A F T 90 A FIRST COURSE IN INFORMATION THEORY ½ ¾ nH (Y)¿ Ãy 2 . . S [Y] . . . À ¿) ½ ¾ nH (X,Y 2 à )  S ¾ nÀ (x,y . . . . . . . . [X] ¦ gã p ¨ . . ½ ¾ nH (XÀ ¿ ) 2 Áx ÂS ¾ n  ¾n [XY] Figure 5.2. A two-dimensional strong joint typicality array. that YK is jointly typical. This theorem has the following corollary that the number of such typical K grows with at almost the same rate as the total number of typical K . ¼ ¥§¦w¡ ãgf ¨ on $ qTs , let Æ ³ í be the ¢ ³ í ³ í ¦YKD¨ is nonempty. Then Æ í » ¦ Ý Ã ¿7¨ Ê ÓVUàÓ Õ á*Ç Õ ã (5.150) ³ ³ where È S « as ¼ S¢¡ and ¿ S « . gã p ¨¢ Proof By the consistency of strong typicality (Theorem 5.7), if Y¦ K ³ on> í , then K1¢ ³ í . In particular, K1) 9 ¢ Æ ³ í . Then î gã p p ³ nF í Ø É YK ¨Ñ ¢ ³ n9 9 í9¦YKD¨ ¦ (5.151) ¼ËÊÌ ÍÎ Î>Ï Ð Using the lower bound on ³ nF í in Theorem 5.8 and the upper bound on n 9 Y¦ KD¨ in the last theorem, we have ³ í ¦Ý Ã ¿7¨ Ê ³ VÓ UàÓ ì n Õ á ôÕ Â ³ nF í 6Â Æ ³ 9 í Ê ³ Ó UåÓ n XÕ W Õ (5.152) =B<=BÄÅĤB<ÅaHJ3¥*¥ For a joint distribution set of all sequences K 9 such that n 9 » ¦Ý à ¿7¨ Ê ³ VÓ UàÓ Õ á ÓVW ÕÕ î The theorem is proved upon letting È Ø û¾I . which implies Æ í ³ ¥§¦w¡ ã ¨ (5.153) In this section, We have established a rich set of structural gf properties for , which is sumstrong typicality with respect to a bivariate distribution marized in the two-dimensional strong joint typicality array in Figure 5.2. In this array, the rows and the columns are the typical sequences K MÆ 9 and D R A F T September 13, 2001, 6:27pm ¢ ³ í D R A F T 91 Strong Typicality Ó ÔÕ 2nH(Z) Ü ÓnÝ Þ z S [Z ] zØ 0 Ò2ÓnHÔ (Y)Õ Úy Ü S Ó n Þ [Y] Ö(×xØ Ù,ÚyØ Û) 0 0 Ó Ôß Õ 2nH(X) x ÜS Ó nß Þ [X] Figure 5.3. A three-dimensional strong joint typicality array. p The total number of rows and columns are approxi¢Æ ³ n> í , respectively. UàÓ Õ Ê Ê UåÓ n Õ , respectively. An entry indexed by Y¦ K ãgp ¨ mately equal to ³ gã p and ³ receives a dot if Y¦ K ¨ is Êstrongly UàÓ ì n Õ jointly typical. The total number of dots . The number of dots in each row is apis approximately equalÊ to UàÓ n9³ Õ proximately equal to ³ Ê UàÓ n , Õ while the number of dots in each column is . approximately equal to ³ For reasons which will become clear in Chapter 16, the strong joint typicality array in Figure 5.2 is said to exhibit an asymptotic quasi-uniform structure. By a two-dimensional asymptotic quasi-uniform structure, we mean that in the array all the columns have approximately the same number of dots, and all the rows have approximately the same number of dots. The strong joint typicality array for a multivariate distribution continues to exhibit an asymptotic quasi-uniform structure. The three-dimensional strong joint typicality array gf cà is illustrated in Figure 5.3. As before, with respectgp tocá a distribution gp cá an entry YK receives a dot if YK is strongly jointly typical. 
This is not shown in the figure otherwise it will be very confusing. n â total number U The of dots in the whole array is approximately equal to . These dots are distributed in the array such that all the planes parallel to each other have approximately the same number of dots, and all the cylinders parallel to each other have approximately the same number of á / dots. More specifically, the total number of dots on the plane for any fixed OÆ â (as shown) is approxi- ¥§¦w¡ ã ã ã ¨ =ã ¦ ¨ ¦ ã =ã ¨ Ê ³ àÓ ì ì Õ ¢ ³ í UàÓ ì n â Õ Ê mately equal , and the total number of dots in the cylinder ãgp to ³ Ê UåÓ âfor any ìn Õ , fixed ¦YK / / ¨ pair in Æ ³ on> í (as shown) is approximately equal to ³ so on and so forth. D R A F T September 13, 2001, 6:27pm D R A F T 92 5.4 A FIRST COURSE IN INFORMATION THEORY AN INTERPRETATION OF THE BASIC INEQUALITIES The asymptotic quasi-uniform structure exhibited in a strong joint typicality array discussed in the last section is extremely important in information theory. In this section, we show how the basic inequalities can be revealed by examining this structure. It has further been shown by Chan [39] that all unconstrained information inequalities can be obtained from this structure, thus giving a physical meaning to thesecb inequalities. á , gand and a fixed Consider random variables ã äÆ c á â . By the p cá 2 on â , then YK 2 â consistency p cá of strong typicality, if YK n and â . Thus ¦ 㠨𢠳 í ¤ ã ã ã ¦ û¨ ¢ ³ í á ³ n â íô¦ ¨ åI ³ â íô¦ á ¨ q ³ n9 â í7¦ á ¨ã ¢ ã³ í ¦ ¨¢ ³ í (5.154) n â í ¦ á ¨ ( â í ¦ á ¨ n â í ¦ á ¨ î (5.155) ³ ³ ³ á on Applying the á lower bound á in Theorem 5.9 to ³ â í ¦ ¨ and the upper bound n to ³ â í ¦ ¨ and ³ â í ¦ ¨ , we have Ê ³ ÓåU Ó ì n â Õ á0æ Õ Â Ê ³ ÓåU Ó â ÕXW@ôç Õ Ê ³ ÓVàU Ó n â ÕlW>èôÕ ã (5.156) ë§ê ãeì S ã « as ¼ S¶¡ and ¿ S « . Taking logarithm to the base 2 and where é dividing by ¼ , we obtain Z Z b cb  Z ã ¦w¤ ã ¨ ¦w¤ 㠨ƾ ¦ ã ¨ (5.157) upon letting ¼ Sí¡ and ¿ S « . This inequality is equivalent to î ¦w¤«! b ã ¨°»[« î (5.158) which implies Thus we have proved the nonnegativity of conditional mutual information. Since all Shannon’s information measures are special cases of conditional mutual information, we have proved the nonnegativity of all Shannon’s information measures, namely the basic inequalities. ¦YK ã U¨ ¢ ³ ì nF í and ¦ pãcá ¨U¢ ³ n ì â í do not imply ¦YK ãcá ¨\¢ ³ì í ã ã ã ¤ ¨ , where ¤ ¶ are i.i.d. with generic random vari 2. Let Ø ¦w¤ ¸ ¤T ³ able ¤ . Prove that ¯± ¢ ³ í » Ý Ã $ ï ¼n¿ PROBLEMS gp 1. Show that â . D R A F T September 13, 2001, 6:27pm D R A F T Strong Typicality for any ¼ S¸¡ ¼ and if ð ¼n¿ ¿S¢ ª «¡ . This shows that ¯=± ¢« ³ í . 93 S Ý as ¿ S « and ¤ 3. Prove that for a discrete random variable with an infinite alphabet, Property 2 of the strong AEP holds, while Properties 1 and 3 do not hold. Ø ¦w¡ ¸ ã ¡ ã ã ¡ ³ ¨ ¡ ¦w¡º¨ Ø w¦ ¡ D¨ ô¼ ¡£¢ ¸ × ×á ¸ ³ä ³ á gã f 5. Show that for a joint distribution ¥§¦w¡ ¨ , for any ¿UªÉ« , there exists «Iõ ¿ ï õ¿ such that ³ 9 øí ÷ åOÆ ³ 9 í for sufficiently large ¼ . 6. Let ù½¦5$ú¨ be the set of all probability distributions over a finite alphabet $ . Find a polynomial úö¦w¼D¨ such that for any integer ¼ , there exists a subset ù ³ 5¦ $¨ of ù½5¦ $ú¨ such that ( úö¦w¼D¨ ; a) ù 5¦ $¨ ³ b) for all òét ¢ ù½5¦ $¨ , there exists ò ³ t¢ ù ³ 5¦ $¨ such that ò ¦w¡©¨ à ò½¦w¡º¨ õ Ý ³ ¼ for all ¡£% ¢ $ . Hint: Let ù 5¦ $¨ be the set of all probability distributions over $ such that ³ all the probability masses can be expressed as fractions with denominator ¼. 7. 
Let ¥ be anyã probability distribution over a finite set $ and R be a real number in ¦ « ÝL¨ . Prove that for any subset û of $ ³ with ¥ ³ ¦.û0¨°» R , ûMü[ í » Ê Ó UåÓ ,LÕ á í ÷ Õ ã ³ ³ ï where ¿ S « as ¿ S « and ¼ S¢¡ . @ where 0 is in some finite alpha4. Consider a sequence K bet $ for all ñ . The type ò ¼ of the sequence K is the empirical probability K gó for all %$ . Prove that distribution over $ in W K , ¹oi.e., ò ¼ W ± there are a total of ô distinct types ò ¼ . Hint: There are ô õ ways to distribute identicalõ balls in ö boxes. HISTORICAL NOTES Berger [21] introduced strong typicality which was further developed into the method of types in the book by Csiszþ ý r and K ÿ rner [52]. The treatment of the subject here has a level between the two, while avoiding the heavy notation in [52]. The interpretation of the basic inequalities in Section 5.4 is a preamble to the relation between entropy and groups to be discussed in Chapter 16. D R A F T September 13, 2001, 6:27pm D R A F T Chapter 6 THE -MEASURE In Chapter 2, we have shown the relationship between Shannon’s information measures for two random variables by the diagram in Figure 2.2. For convenience, Figure 2.2 is reproduced in Figure 6.1 with the random variables b and replaced by and T , respectively. This diagram is very illuminating because it suggests that Shannon’s information measures for any random variables may have a set-theoretic structure. In this chapter, we develop a theory which establishes a one-to-one correspondence between Shannon’s information measures and set theory in full generality. With this correspondence, manipulations of Shannon’s information measures can be viewed as set operations, thus allowing the rich set of tools in set theory to be used in information theory. Moreover, the structure of Shannon’s information measures can easily ¤ ¸ ¤ ¼è» Ê ¤ H ( X1,X 2) H ( X1 X 2) H ( X 2 X1) H ( X1) I ( X1; X 2) H (X 2) Figure 6.1. Relationship between entropies and mutual information for two random variables. D R A F T September 13, 2001, 6:27pm D R A F T 96 A FIRST COURSE IN INFORMATION THEORY be visualized by means of an information diagram if four random variables or less are involved. The main concepts to be used in this chapter are from measure theory. However, it is not necessary for the reader to know measure theory to read this chapter. 6.1 PRELIMINARIES In this section, we introduce a few basic concepts in measure theory which will be used subsequently. These concepts will be illustrated by simple examples. ã ¤ ã ã ¤ ¸ ¤ ³ ³ ã ã ã ¤ ¸ ¤ ¤ ³ b b hTi*jkClGFCXDFCY=BGMmJ D · ¸ The atoms of are sets of the form , where ü is ³ Þ ³ ¤ . either ¤ or ¤ , the complement of Ê Ê There are ³ atoms and Î sets in all the atoms in are ³ . Evidently, ³ disjoint, and each set in can be expressed uniquely as the union of a subset ã ¤ ã ã ¤ intersect with ³ . We assume that the sets ¤ ¸ T of the atoms of ³ ³ are nonempty unless otherwise each other generically, i.e., all the atoms of ³ specified. F¤»z?*ÄËi mJ ¤ generate the field . The atoms of The sets ¤ ¸ and T are (6.1) ¤ ¸ ü ¤ ã ¤ ¸Þ ü ¤ ã ¤ ¸ ü ¤ Þ ã ¤ ¸Þ ü ¤ Þ ã hTi*jkClGFCXDFCY=BGMmJ3¥ The field generated by sets T is the collection of sets which can be obtained by any sequence of usual set operations . (union, intersection, complement, and difference) on 1 which are represented by the four distinct regions in the Venn diagram in Figure 6.2. The field consists of the unions of subsets of the atoms in (6.1). 
There are a total of 16 sets in , which are precisely all the sets which can be obtained from and T by the usual set operations. ¤ ¸ hTi*jkClGFCXDFCY=BGMmJ ¤ ³ A real function defined on is called a signed measure if it is set-additive, i.e., for disjoint û and in , ¹ ³ î Ø ¦.û ¹ ¨ ¦.û ƨ ¾ ¦ ¹ ¨ For a signed measure , we have ¦ 3¨ Ø « ã (6.2) 1 We adopt the convention that the union of the empty subset of the atoms of D R A F T (6.3) Î is the empty set. September 13, 2001, 6:27pm D R A F T î 97 The -Measure X1 X2 Figure 6.2. The Venn diagram for which can be seen as follows. For any û in Ñ and . ³, ¦.û0¨ Ø ¦.û 3¨ Ø ¦.û ¨Æ¾ ¦ 3¨ (6.4) by set-additivity because û and are disjoint, which implies (6.3). A signed measure on is completely specified by its values on the atoms of . The values of on the other sets in can be obtained via set-additivity. ³ ³ F¤»z?*ÄËi mJEH ³ A signed measure on is completely specified by the val- ¦ ¤ ¸ ü T¤ L¨ ã ¦ ¤ ¸Þ ü ¤TL¨ ã ¦ ¤ ¸ ü ¤ Þ ¨ ã ¦ ¤ ¸Þ ü ¤ Þ ¨ î The value of on ¤ ¸ , for example, can be obtained as ¦ ¤ ¸ ¨ Ø ¦¦ ¤ ¸ ü ¤ ¨ ¦ ¤ ¸ ü ¤ Þ Þ ¨¨ î Ø ¦ ¤ ¸ ü ¤TL¨Æ¾ ¦ ¤ ¸ ü ¤ ¨ ues (6.5) 6.2 ! (6.6) (6.7) " THE # -MEASURE FOR TWO RANDOM VARIABLES To fix ideas, we first formulate in this section the one-to-one correspondence between Shannon’s information measures and set theory for two random variables. For random variables and T , let and T be sets corresponding to and T , respectively. The sets and T generates the field whose atoms are listed in (6.1). In our formulation, we set the universal set $ to " T for reasons which will become clear later. With this choice of $ , the Venn diagram for and T is represented by the diagram in Figure 6.3. For simplicity, the sets and T are respectively labeled by and T in the diagram. We call this the information diagram for the random variables and . In this diagram, the universal set, which is the union of and , is not shown explicitly as in a usual Venn diagram. Note that with our choice of ¤ ¸ ¤ ¸ ¤ ¤ ¸ ¤ ¤ D R A F T ¤ ¸¸ ¤ ¤ ¤ ¤ ¸ ¤ ¸ ¤ ¤ ¤ ¤ ¸ September 13, 2001, 6:27pm ¤ ¸ ¤ ¤ ¤ ¸ D R A F T 98 A FIRST COURSE IN INFORMATION THEORY X1 X2 Figure 6.3. The generic information diagram for Ñ and % . ¤ ¸Þ ü ¤ Þ is degenerated to the empty set, because (6.8) ¤ ¸Þ ü ¤ Þ Ø ¦ ¤ ¸ ¤ ¨ Þ Ø Þ Ø î Thus this atom is not shown in the information diagram in Figure 6.3. For random variables ¤ ¸ and ¤T , the Shannon’s information measures are Z Z Z Z Z ¦w¤ ¸ ¨ ã ¦wT¤ ô¨ ã ¦w¤ ¸ T¤ L¨ ã ¦wT¤ ¤ ¸ ¨ ã ¦w¤ ¸ ã T¤ L¨ ã î ¦w¤ ¸ !T¤ L¨ î (6.9) Þ as û à , we now define a signed measure by Writing ûIü ¹ ¹ à Z Ø ¸ (6.10) ¦ ¤ à T¤ -¨ Z ¦w¤ ¸ T¤ L¨ ã Ø ¸ ¸ ¦ T¤ ¤ ¨ ¦wT¤ ¤ ¨ (6.11) and ¦ ¤ ¸ ü ¤TL¨ Ø î ¦w¤ ¸ !¤TL¨ î (6.12) These are theÞ values of on the nonempty atoms of (i.e., atoms of Þ ¸ ü ¤ ). The values of on the other sets in can be obtained other than ¤ the universal set, the atom $ '& & & ( & '& ) & via set-additivity. In particular, the relations ¦ ¤ ¸ ¤ ¨ Ø Z ¦w¤ ¸ ã 㤠¨ ¦ ¤ ¸ ¨ Ø ¦w¤ ¸ ¨ Z Ø ¦ ¤TL¨ w¦ ¤TL¨ Z & & * and (6.13) (6.14) & (6.15) can readily be verified. For example, (6.13) is seen to be true by considering ¦ ¤ ¸ T¤ L¨ Ã Ø Z ¦ ¤ ¸ T¤ L¨©Z ¾ ¦ ¤T à ¤ ¸ ƨ ¾ ¦ ¤ ¸ ü T¤ L¨ Ø Z ¦w¤ ¸ ã ¤ ¨©î¾ ¦w¤ ¤ ¸ ƨ ¾ î ¦w¤ ¸ !¤ ¨ Ø ¦w¤ ¸ T¤ L¨ & + " D R A F T & & , & September 13, 2001, 6:27pm (6.16) (6.17) (6.18) D R A F T î 99 The -Measure ¤ ¸ ¤ The right hand side of (6.10) to (6.15) are the six Shannon’s information measures for and T in (6.9). 
Now observe that (6.10) to (6.15) are consistent with how the Shannon’s information measures on the right hand side are identified in Figure 6.1, with the left circle and the right circle representing the sets and T , respectively. Specifically, in each of these equations, the left hand side and the right hand side correspond to each other via the following substitution of symbols: ¤ ¸ ¤ Z î ó ã - ! - & - - (6.19) Ãî ü ¤ ¸ Z ¤ î Note that we make no distinction between the symbols and in this substitution. Thus for two random variables and , Shannon’s information measures can formally be regarded as a signed measure on . We will refer to '& as the I-Measure for the random variables and T 2 . Upon realizing that Shannon’s information measures can be viewed as a signed measure, we can apply the rich set of operations in set theory in information theory. This explains why Figure 6.1 or Figure 6.3 represents the relationship between all Shannon’s information measures for two random variables correctly. As an example, consider the following set identity which is readily identified in Figure 6.3: ¦ ¤ ¸ ¤TL¨ Ø & & ¦ ¤ ¸ ¨©¾ & ¤ ¸ ¦ ¤TL¨ à ¤ & ¦ ¤ ¸ ü T¤ L¨ (6.20) This identity is a special case of the inclusion-exclusion formula in set theory. By means of the substitution of symbols in (6.19), we immediately obtain the information identity Z Z Z ã Ø ¸ ¸ w¦ ¤ ¤TL¨ w¦ ¤ ¨©¾ ¦w¤TL¨ à î w¦ ¤ ¸ ! ¤TL¨ î (6.21) ¤ ¸Þ ¤ Þ We end this section with a remark. The value of '& on the atom ü has no apparent information-theoretic meaning. In our formulation, we set the universal set $ to T so that the atom ü is degenerated to the empty set. Then & * ü. naturally vanishes because & is a measure, and '& is completely specified by all Shannon’s information measures involving the random variables and . ¤ ¸ ¸Þ ¤ Þ ¦¤ ¤ ¨ ¤ ¸ ¤ ¤ ¸Þ ¤ Þ 2 The reader should not confuse /10 with the probability measure defining the random variables The former, however, is determined by the latter. D R A F T September 13, 2001, 6:27pm Ñ and . D R A F T 100 A FIRST COURSE IN INFORMATION THEORY 6.3 CONSTRUCTION OF THE # -MEASURE 2 î * ¼é» Ê We have constructed the -Measure for two random variables in the last î section. In this section, we construct the -Measure for any random variables. Consider random variables . For any random variable , let be a set corresponding to . Let ¤ ¸ã¤ ã 㤠³ ¤ س Ý ã Ê ã ã ¼ î ¼ ¤ 3 ¤ ã ã ã ¤ , i.e., Define the universal set to be the union of the sets ¤ ¸ ¤T ³ Ø É ¤ î (6.23) Ê Î ã ã ã ¤ . The set to denote the field generated by ¤ ¸ ¤T We use ³ ³ Þ Ø / û ¤ (6.24) Ê Î is called the empty atom of ³ because Þ ÞØ É ¤ Ø ÞØ î (6.25) ¤ Ê Î Ê Î All the atoms of other than û / are called nonempty atoms. , the cardinality of ³ Let be theÊ set of all nonempty atoms of . Then Ã Ý . A signed measure on ³ ³ is completely specified by , is equal to ³ the values of on the nonempty atoms of . ã ³ ¤ ñU¢ ù¨ and ¤ to To simplify notation, we will use ¤ to denote ¦w denote » Ê ¤ for any nonempty subset of ³ . wyxFi0=B<i*z{mJEm Let Ø ¤ 2 is a nonempty subset of ³ î (6.26) ã ¢ , Then a signed measure on is completely specified by ¦ ¨ ³ ¹ ¹ which can be any set of real numbers. Proof The number ofÊ elements in is equal to the Ê number of nonempty Ã Ý . Thus ã Ø Ø ³ Ã Ý . Let d d Ø Ê ³ Ãsub-Ý . , which isd ³ sets of ³ Let ã be a column -vector of .¦ û ¨ û ¢ , and be a column -vector of ¦ ¹ ¨ ¹ ¢ . 
Since all the sets in can expressed uniquely as the union of some nonempty atoms in , by the set-additivity of , for each ¢ ¹ , ¦ ¹ ¨ can be expressed uniquely as the sum of some components of . Thus Ø ³ ã (6.27) (6.22) $ $ 54 6 74 6 98: 54 <;= 74 $ > > > @? ? .A 3 @ ? B A DC 3 ? @ EA GF B B 3 B > H B B > > I B H I D R A F T KJ H September 13, 2001, 6:27pm D R A F T î The -Measure d ³ 101 d where J is a unique q matrix. On the other hand, it can be shown (see Appendix 6.A) that for each û > , .û can be expressed as a linear comB by applications, if necessary, of the following two bination of identities: ¦ ¹ ¦.ûIü ¹ Ãà ¦.û ¢ ¦ ¨ ¨ã¹ ¢ ¨ Ø ¦.û à ©¨ þ ¦ ¹ Ã î ¨ à ¦.û ¹ à ¨ ¹ ¨ Ø .¦ û ¹ ¨ ¦ ¹ ¨ ,L ,L L ,L (6.28) (6.29) However, the existence of the said expression does not imply its uniqueness. Nevertheless, we can write NM (6.30) H I d d Ø ³ into (6.30), we obtain ³ . UponØ substituting (6.27) ã (6.31) ¦ ³ ³¨ which implies that inverse of as (6.31) holds regardless the ã û ¢ of are ³ isistheunique, ³ choice of . Since so is . Therefore, ¦ . 0 û ¨ ã ¢ are³ specified. Hence, a signed mea³ uniquely determined once ¦ ¨ ¹ ¹ ã ¢ , which can be any set is completely specified by ¦ ¨ sure on ¹ ¹ ³ of real numbers. The theorem is proved. for some q matrix M M H J M OH J J M B P> B We now prove the following two lemmas which are related by the substitution of symbols in (6.19). £i*zz¤mJX| ¦.ûIü ¹ à ¨ Ø ¦.û ,L ¨©¾ ¦ ¹ L , ¨ à ¦.û ¹ L L ¨ à ¦ ¨ î L (6.32) Proof From (6.28) and (6.29), we have ¦.ûMü Ø Ø Ø Ã ¹¦.û à æ ¦ ¦.û ¦.û .¦ û ,L L Q Q ¨ L The lemma is proved. £i*zz¤mJE î ¦w¤~! b ã D R A F T ¨©¾ à ¦ ¹ ¨ ¦ ¹¨©¾ ¦ ¨ ¹ L L , à ¨ à ¦.û à ¨ ¨Ã¨Æ¾â¦ ¦ ¹ ¨ ¹ à ¦ ¨¨ ¦ à ¨¨ ¨ ¦.û ¹ ¨ à ¦ ¨ î ,L , Q L L L L L (6.33) L L L Z Z Z Z ب ¦w¤ ã ã Æ¨ ¾ ¦ bã ã ¨ à w¦ ¤ ãcbã ã ¨ à 5¦ ã ¨ î September 13, 2001, 6:27pm (6.34) (6.35) (6.36) D R A F T 102 A FIRST COURSE IN INFORMATION THEORY Proof Consider ¦w¤«! b Z ã ¨ Ø Z ¦w¤ Ø Z ¦w¤ Ø ¦w¤ î è Z Z w¦ ¤ bã ã Z ¨ ã ã ¨ à Z ¦5ã ¨ à ¦ Z ¦w¤ ãcbã ã ¨ à Z Z ¦ bã ã ¨¨ ã 㠨ƾ ¦ bã ã ¨ à w¦ ¤ ãcbã ã ¨ à 5¦ ã ¨ î ã (6.37) (6.38) (6.39) The lemma is proved. î ³ We now construct the -Measure '& on Z using Theorem 6.6 by defining (6.40) ¦ ¤ ¨ Ø ¦w¤ ¨ for all nonempty subsets of ³ . In order for to be meaningful, it has to be consistent with all Shannon’s information measures (via the substitution of symbols in (6.19)). ã In that the following must hold for all (not necessarily ï ã ï case, ï of disjoint) subsets ³Ã where î and are nonempty: (6.41) ¦ ¤ ü ¤ ÷ ¤ ÷ ÷ ¨ Ø ¦w¤ !¤ ÷ ¤ ÷ ÷ ¨ î ï ï Ø , (6.41) becomes When (6.42) ¦ ¤ tü ¤ ÷ ¨ Ø î ¦w¤ !¤ ÷ ¨ î ï , (6.41) becomes When Ø Z Ã Ø ¦ ¤ ¤ ÷ ÷ ¨ ¦w¤ ¤ ÷ ÷ ¨ î (6.43) ï and ï ï Ø , (6.41) becomes When Ø Z Ø ¦ ¤ =¨ ¦w¤ =¨ î (6.44) & ? ? 3 A & 3 A A A & A A ? ? A ? ? ? ? & A ? @ ? @? ? A A A A & ? @ ? @? ? & * @ ? @? Thus (6.41) covers all the four cases of Shannon’s information measures, and it is the necessary and sufficient condition for & to be consistent with all Shannon’s information measures. wyxFi0=B<i*z{mJE '& is the unique signed measure on with all Shannon’s information measures. ³ which is consistent Proof Consider Ø ¦¤ tü"¤ ? ÷ à ¤ ? ÷ ÷ ¨ à Z Z & ¦ ¤ ?RS? ÷ ÷ ¨©¾,Z & ¦ ¤ ? ÷ RT? ÷ ÷ ¨ ¦î w¤ ?RS? ÷ ÷ ¨Æ ¾ ¦wã ¤ ? ÷ RS? ÷ ÷ ¨ à ¦w¤@?! ¤ ? ÷ ¤ ? ÷ ÷ ¨ & + @ ? Ø Ø D R A F T & ¦w¤ ¦¤ ?RS? ?RS? ÷ RS? ÷ ÷ ¨ à Z ÷ RS? ÷ ÷ ¨ à September 13, 2001, 6:27pm ¦ ¤ ÷÷ ¨ ¦w¤ ÷ ÷ ¨ & ? ? 
(6.45) (6.46) (6.47) D R A F T î 103 The -Measure where (6.45) and (6.47) follow from Lemmas 6.7 and 6.8, respectively, and (6.46) follows from (6.40), the definition of & . Thus we have proved (6.41), i.e., '& is consistent with all Shannon’s information measures. In order that '& is consistent with all Shannon’s information measures, for 3 , & has to satisfy (6.44), which in fact is the all nonempty subsets A of definition of '& in (6.40). Therefore, '& is the unique signed measure on which is consistent with all Shannon’s information measures. ³ 6.4 2 * ³ CAN BE NEGATIVE î In the previous sections, we have been cautious in referring to the -Measure 3 '& as a signed measure instead of a measure . In this section, we show that '& \ in fact can take negative values for . For , the three nonempty atoms of are ¼1» ¤ ¸ ü ¤ 㤠¸ à ¤ 㤠à ¤ ¸î ¼ Ø Ê The values of (6.48) on these atoms are respectively & Z Z ¦w¤ ¸ ! ¤TL¨ ã ¦w¤ ¸ T¤ L¨ ã ¦w¤T ¤ ¸ ¨ î î (6.49) ¼ Ø Ê These quantities are Shannon’s information measures and hence nonnegative by the basic\ inequalities. Therefore, '& is always nonnegative for . ï , the seven nonempty atoms of are For ¼ Ø ¤ à ¤ ì ¶ ã ¤ 0ü ¤ à ¤ ¶ ã ¤ ¸ ü ¤T9ü ¤ ï ã (6.50) d \  ñõ ½õ  . The values of on the first two types of atoms are where Ý Z Ã Ø ¶ (6.51) ¦ ¤ ¤ ì ¨ ¦w¤ ¤ ã ¤ ¶ ¨ and ¦ ¤0ü ¤ à ¤ ¶ ¨ Ø î ¦w¤ë!¤ ¤ ¶ ¨ ã (6.52) respectively, which are Shannon’s information measures and therefore nonnegative. However, ¦ ¤ ¸ ü ¤Tü ¤ ï ¨ does not correspond to a Shannon’s information measure. In the next example, we show that j¦ ¤ ¸ ü T ¤ Ñü ¤ ï ¨ can actually be negative. F¤»z?*ÄËi mJ3¥¦ In this example, all the entropies are in the base 2. Let ¤ ¸ ¤ be independent binary random variables with and T ¯± ¤ Ø « Ø ¯± ¤ Ø Ý Ø « î ã (6.53) YX V U<W " W [ " Z '& & & * & + U<W " W \ YX W W ] '& _^ 3A measure can only take nonnegative values. D R A F T September 13, 2001, 6:27pm D R A F T Ø Ý ã Ê . Let 104 ñ A FIRST COURSE IN INFORMATION THEORY ¤ ï Ø ¤ ¸ ¾Ü¤T ý Ê î a` (6.54) It is easy to check that ¤ ï has the same marginal distribution as ¤ ¸ and ¤T . Z Thus, (6.55) ¦w¤ ¨ Ø Ý ã ã \ Ê Ø ¸ for ñ Ý . Moreover, ¤ , Z ¤T , and ¤ ï are pairwise independent. Therefore, ¦w¤ ã ¤ ¨ Ø Ê (6.56) and î ¦w¤ !¤ ¨ Ø « (6.57) \   for Ý . We further see from (6.54) that each random variable is a ñðõ W W [ function of the other two random variables. Then by the chain rule for entropy, we have w¦ ¤ ¸ ã ¤T ã ¤ ï ¨ Ø Ê ¦w¤ ¸ ã ¤TL¨Æ¾ Ø î À¾ « Ø Ê Â ñ=õ ½õ d  \ , Now for Ý î ¦w¤ë!¤ Z ¤ ¶ ã ¨ Z Z Ø ¦w¤ ¤ ¶ ¨©¾ ¦w¤ ã ¤ ¶ ¨ à ¦w¤ Ø Ê ã¾ Ê Ã Ê Ã Ý Ø Ý Z Z ¦w¤ ï ¤ ¸ ã T¤ L¨ Z (6.58) (6.59) (6.60) [ Z ¸ ã ¤ ã ¤ ï ¨ à w¦ ¤ ¶ ¨ W W (6.61) (6.62) (6.63) where we have invoked Lemma 6.8. It then follows that ¸ ü ¤ ¨ à ¦ ¤ ¸ ü ¤ à ¤ ï ¨ (6.64) ¦ ¤ Ø î ¦w¤ ¸ !¤TL¨ à î w¦ ¤ ¸ !¤T ¤ ï ¨ (6.65) Ø « ÃîÝ (6.66) Ø ÃÝ (6.67) Thus takes a negative value on the atom ¤ ¸ ü ¤T9ü ¤ ï . Motivated by the substitution of symbols in (6.19) î for Shannon’s information measures, we will write 7¦ ¤ ¸ ü ¤TBü ¤ ï ¨ as ¦w¤ ¸ !¤T!¤ ï ¨ . In general, we will write ¦ ¤ Ñ ü ¤ ü ü ¤ à ¤ ç¨ (6.68) as î ¦w¤ Ñ !¤ ! !¤ ¤ ¨ (6.69) ¦¤ ¸ ü ¤ & ü ¤ ï¨ Ø & & '& & ? @ ? @ @? D R A F T @? & " Z ?cb @ @?cb d e ed September 13, 2001, 6:27pm D R A F T î 105 The -Measure j h f f f I ( X 1 ; X 2 ; X i 3) h f f f f I ( X1 ; X 2 X i3) g X2 g h f H ( X 2 X 1) h H ( X 1) Xi3 X1 g h f f f j h f H ( X 1 X 2 , X i3) f I ( X 1 ; X i3) Figure 6.4. 
The generic information diagram for Ñ, % , and %k ¤ Ñ ã ¤ ã ã ¤ ¤ î ¸ ¦w¤ !¤ !¤ ï ¨ Ø î w¦ ¤ ¸ !¤ ¨ Ã î ¦w¤ ¸ !¤ ¤ ï ¨ î î For this example, ¦w¤ ¸ !¤T!¤ ï ¨°õ[« , which implies î ¸ ¦w¤ !¤T ¤ ï ¨ª î ¦w¤ ¸ !T¤ L¨ î . and refer to it as the mutual information between @? @? @?cb conditioning on ed . Then (6.64) in the above example can be written as (6.70) (6.71) Therefore, unlike entropy, the mutual information between two random variables can be increased by conditioning on a third random variable. Also, we note in (6.70) that although the expression on the right hand side is not symbolically symmetrical in , T , and ï , we see from the left hand side that it is in fact symmetrical in , T , and ï . ¤ ¸ ¸¤ ¤ ¤ 6.5 ¤ ¤ INFORMATION DIAGRAMS We have established in Section 6.3 a one-to-one correspondence between Shannon’s information measures and set theory. Therefore, it is valid to use an information diagram, which is a variation of a Venn diagram, to represent the relationship between Shannon’s information measures. For simplicity, a set will be labeled by in an information diagram. We have seen the generic information in Figure 6.3. A generic \ diagram for information diagram for is shown in Figure 6.4. The informationtheoretic labeling of the values of & on some of the sets in ï is shown in î the diagram. As an example, the information diagram for the -Measure for T , and ï discussed in Example 6.10 is shown in Figrandom variables ure 6.5. For ml , it is not possible to display an information diagram perfectly in two dimensions. In general, an information diagram for random variables ¤ ¼X» D R A F T ¤ ¸ã¤ ¼ Ø ¤ ¼ Ø Ê ¤ ¼ September 13, 2001, 6:27pm D R A F T 106 A FIRST COURSE IN INFORMATION THEORY ¼ Ã Ý ¼ Ø needs dimensions to be displayed perfectly. Nevertheless, for l , an information diagram can be displayed in two dimensions almost perfectly as shown in Figure 6.6. This information diagram is correct in that the region representing the set o n splits each atom in Figure 6.4 into two atoms. However, the adjacency of certain atoms are not displayed correctly. For example, the set ü T ü n , which consists of the atoms ü T ü ï ü n and üpT "üq ï ür n , is not represented by a connected region because the two atoms are not adjacent to each other. When '& takes the value zero on an atom û of , we do not need to display the atom û in an information diagram because the atom û does not contribute for any set containing the atom û . As we will see shortly, this to '& can happen if certain Markov constraints are imposed on the random variables involved, and the information diagram can be simplified accordingly. In a generic information diagram (i.e., when there is no constraint on the random variables), however, all the atoms have to be displayed, as is implied by the next theorem. ¤¸ ¸ ¤ Þ ¤ Þ Þ ¤ ¤ ¤ ¤ ¤ 9¦ ¹ ¨ ¤ ¸ ¤ ¤ Þ ¤ ³ ¹ ¤ ¸ ã ¤ ã ã ¤ ³ , then ³. wyxFi0=B<i*z{mJ3¥*¥ If there is no constraint on take any set of nonnegative values on the nonempty atoms of '& can Proof We will prove the theorem by constructing a '& which can take any set of nonnegative values on the nonempty atoms of . Recall that > is the set of bts û > be mutually independent all nonempty atoms of . Let random variables. Now define the random variables ñ by ã ¢ ³ ã ã ã ã Ø ¤ Ý Ê ¼ (6.72) ¤ Ø ¦ b û ¢ and û å ¤ t¨ î ã ã ã î We determine the -Measure for ¤ ¸ ¤ ¤ ³ so defined as follows. b Since are mutually independent, for all nonempty subsets of , we ³ Z Z have ¦w¤ =¨ Ø Ê µ + ¦ b ¨ î (6.73) ³ ts u> '& ts 3 A Gs @? 
s s'xzy *{ wv | X2 0 1 1 1 | X1 0 Figure 6.5. The information diagram for D R A F T | 1 Ñ, 0 , and } %k X3 in Example 6.10. September 13, 2001, 6:27pm D R A F T î 107 The -Measure X2 ~ ~ X1 X 4 Figure 6.6. The generic information diagram for On the other hand, Z ¦w¤ =¨ Ø @? ¦ ¤ Ũ Ø & * @ ? µ Êwv + s Ñ, & s'xzy *{ X3 , %k ¦.û0¨ î , and . (6.74) Equating the right hand sides of (6.73) and (6.74), we have µ Êv+ sx + y { 3 s µ Z bts ¦ ¨Ø Êv + s'x + y { s & ¦.û ¨ î (6.75) Evidently, we can make the above equality hold for all nonempty subsets A of Z bGs by taking & .û (6.76) ³ ¦ ¨Ø ¦ ¨ ¢ ¦ ¨ for all Z û u bGs > . By the uniqueness of '& , this is also the only possibility for '& . Since can take any nonnegative value by Corollary 2.44, & can take any set of nonnegative values on the nonempty atoms of . The theorem is proved. ³ ¸ ¤ ¤ ¤ ¸ ¤ ¤³ (6.77) ¦ ¤ ¸ ü ¤ Þ ü ¤ ï ¨ Ø î ¦w¤ ¸ !¤ ï ¤TL¨ Ø « ã Þ the atom ¤ ¸ ü ¤ ü ¤ ï does not have to be displayed in an information diagram. As such, in constructing the information diagram, the regions representing the random variables ¤ ¸ , T ¤ , and ¤ ï should overlap Þ with each other such that the region corresponding to the atom ¤ ¸ ü ¤ ü ¤ ï is empty, while the regions In the rest of this section, weexplore the structure of Shannon’s information S measures when \ S T S forms a Markov chain. To start with, S T S ï forms a Markov chain. Since we consider , i.e., ¤ ¼ Ø & corresponding to all other nonempty atoms are nonempty. Figure 6.7 shows D R A F T September 13, 2001, 6:27pm D R A F T 108 A FIRST COURSE IN INFORMATION THEORY X2 X1 X3 Figure 6.7. The information diagram for the Markov chain ¤ ¸ ¤ Ñ G ¤ k . such a construction, in which each random variable is represented by a mountain4 . From Figure 6.7, we see that ürT Büp ï , as the only atom on which '& may take a negative value, now becomes identical to the atom ü ï. Therefore, we have ¦w¤ ¸ !¤T:!¤ ï ¨ Ø ¤ ¸ ¤ ¦ ¤ ¸ ü ¤T9ü ¤ ï ¨ (6.78) (6.79) ¦¤ ¸ ü ¤ ï ¨ Ø î î¦w¤ ¸ !¤ ï ¨ (6.80) (6.81) » « ¤ S ¤ ï forms a Markov chain, is Hence, we conclude that when ¤ ¸ S T always nonnegative. ¤ S ¤ ï S ¤ forms a Markov Next, we consider ¼ Ø , i.e., ¤ ¸ S T î Ø & & l & n chain. With reference to Figure 6.6, we first show that under this Markov constraint, '& always vanishes on certain nonempty atoms: ¤ ¸ S ¤T S ¤ ï implies î ¸ ï w¦ ¤ !¤ !¤ ¤TL¨©¾ î ¦w¤ ¸ !¤ ï ¤T ã ¤ ¨ Ø î w¦ ¤ ¸ ! ¤ ï T¤ L¨ Ø « î (6.82) ¤ ¸ S T¤ S ¤ implies î ¸ ï w¦ ¤ !¤ !¤ ¤TL¨©¾ î ¦w¤ ¸ !¤ ¤T ã ¤ ï ¨ Ø î w¦ ¤ ¸ ! ¤ T¤ L¨ Ø « î (6.83) ¤ ¸ S ¤ ï S ¤ implies î ¸ w¦ ¤ !¤ !¤ ¤ ï ¨©¾ î ¦w¤ ¸ !¤ ¤ ã ¤ ï ¨ Ø î w¦ ¤ ¸ ! ¤ ¤ ï ¨ Ø « î (6.84) 1. The Markov chain n n 2. The Markov chain n n n 3. The Markov chain on 4 This n n on on form of an information diagram for a Markov chain first appeared in Kawabata [107]. D R A F T September 13, 2001, 6:27pm D R A F T î 109 The -Measure ¤ ¤ ï S ¤ implies î ¸ ¦w¤ !¤T!¤ ¤ ï ©¨ ¾ î ¦w¤T!¤ ¤ ¸ ã ¤ ï ¨ Ø î ¦w¤T!¤ ¤ ï ¨ Ø « î ã 5. The Markov chain ¦w¤ ¸ ¤Tô¨ S ¤ ï S ¤ implies î ¸ ¦P¤ !¤Tî !¤ ã ¤ ï ¨©¾ î ¦w¤ ¸ !¤ T¤ ã ¤ ï ¨Æ¾ î ¦wT¤ !¤ ¤ ¸ ã ¤ ï ¨ Ø î¦w¤ ¸ T¤ !¤ ¤ ï ¨ Ø « S 4. The Markov chain T n n n n (6.85) n n n n n ¦w¤ ¸ ! ¤ T¤ ã ¤ ï ¨ Ø î w¦ ¤ ¸ ! ¤ ï T¤ ã ¤ ¨ ã (6.84) and (6.88) imply î ¸ ¦w¤ !¤T!¤ ¤ ï ¨ Ø Ã î ¦w¤ ¸ !¤ ï ¤T ã ¤ ¨ ã (6.86) (6.87) Now (6.82) and (6.83) imply î n (6.88) n n (6.89) n ¦w¤T:!¤ ¤ ¸ ã ¤ ï ¨ Ø î w¦ ¤ ¸ ! ¤ ï T¤ ã ¤ ¨ î and (6.85) and (6.89) imply î n (6.90) n The terms on the left hand sides of (6.88), (6.89), and (6.90) are the three terms on the left hand side of (6.87). 
Then we substitute (6.88), (6.89), and (6.90) in (6.87) to obtain ¦ ¤ ¸ ü ¤ Þ ü ¤ & * " Z ï üZ¤ ¦ ¤ ¸ ü ¤ ÞÞ ü ¤ ï Þ ü ¤ ¦ ¤ ¸ ü ¤ ü ¤ ïÞ ü ¤ ¦ ¤ ¸Þ ü ¤T9ü ¤ ï Þ ü ¤ ¦ ¤ ¸ ü T¤ 9ü ¤ ï ü ¤ n Þ ¨ Ø î w¦ ¤ ¸ ! ¤ ï T¤ ã ¤ ¨ Ø « î n From (6.82), (6.88), (6.89), and (6.90), (6.91) implies & & * & & * L¨ Ø îî w¦ ¤ ¸ !¤ ï ! ¤ ã ¤ ¨ Ø î ¦w¤ ¸ !¤ ¤T ¤ ¨ Ø î ¦w¤ ¸ !¤T! ¤ ã ¤ ¨ Ø ¦w¤T:!¤ ¤ ¸ ¤ n o " " Z n n Z " Z n on n n n ¨Ø « ï¨ Ø « ï¨ Ø « î ï¨ Ø « (6.91) (6.92) (6.93) (6.94) (6.95) From (6.91) to (6.95), we see that '& always vanishes on the atoms ¤ ¸ ü ¤ ÞÞ ü ¤ ï ü ¤ Þ ¤ ¸ ü ¤ Þ ü ¤ ïÞ ü ¤ ¤ ¸ ü ¤ ü ¤ ïÞ ü ¤ ¤ ¸Þ ü T¤ 9ü ¤ ï Þ ü ¤ ¤ ¸ü ¤ ü ¤ ïü ¤ " n n o D R A F T Z Z n (6.96) n n o September 13, 2001, 6:27pm D R A F T 110 A FIRST COURSE IN INFORMATION THEORY X2 * * ~ X1 * * * X3 ~ Figure 6.8. The atoms of Markov chain. 0 on which X4 vanishes when Ñ k forms a ¦w¤ ¸ ¤ ¤ ã ¤ ¨ Ø ä » « diagram in Figure 6.8. of n , which we mark by an asterisk in the î information ! ï T n in (6.82) In fact, the reader can benefit by letting and trace the subsequent steps leading to the above conclusion in the information diagram in Figure 6.6. It is not necessary to display the five atoms in (6.96) in an information diagram because '& always vanishes on these atoms. Therefore, in constructing the information diagram, the regions representing the random variables should overlap with each other such that the regions corresponding to these five nonempty atoms are empty, while the regions corresponding to the other ten nonempty atoms, namely ¤ ¸ ü ¤ Þ ü ¤ ¸ ü ¤T9ü ¤ ¸ ü ¤T9ü ¤ ¸Þ ü T¤ 9ü ¤ ¸Þ ü ¤T9ü ¤ ¸Þ ü ¤ ü ¤ ¸Þ ü ¤T9Þ ü ¤ ¸Þ ü ¤ Þ ü ¤ ¸Þ ü ¤ Þ ü ¤ ¸ü ¤ ü ¤ ï ÞÞ ü ¤ Þ Þ ¤ ïü ¤ Þ ¤ ïü ¤ ¤ ïÞ ü ¤ Þ ¤ ïü ¤ Þ ¤ ïü ¤ ¤ ïü ¤ Þ ¤ ïü ¤ ¤ ïÞ ü ¤ ã ¤ ïü ¤ ) " ) n n ) " Z n n n n n n Z " Z n n (6.97) are nonempty. Figure 6.9 shows such a construction. The reader should compare the information diagrams in Figures 6.7 and 6.9 and observe that the latter is an extension of the former. D R A F T September 13, 2001, 6:27pm D R A F T î 111 The -Measure X1 X 2 X3 Figure 6.9. The information diagram for the Markov chain X 4 Ñ G % %k . From Figure 6.9, we see that the values of '& on the ten nonempty atoms in (6.97) are equivalent to ¦î w¤ ¸ T¤ ã ¤ ï ã ã ¤ ¨ ¦î w¤ ¸ !T¤ ¤ ï ¤ ¨ ¦î w¤ ¸ !¤ ï ¤ ô¨ Z ¦w¤ ¸ !¤ ¨ ã ã ¦w¤ ¤ ¸ ¤ ï ¤ ô¨ î ¦wT¤ !¤ ï ¤ ¸ !¤ ¨ î ¤ !¤ ¤ã ¸ ¨ ã Z ¦wT ¦w¤ ï ¤ ¸ T¤ ã ¤ ¨ î ï ¤ã L¨ ã Z ¦w¤ !¤ ¤ã ¸ T ¸ ¦w¤ ¤ ¤ ¤ ï ¨ Z n n on n on (6.98) n n n n on respectively5 . Since these are all Shannon’s information measures and thus nonnegative, we concludethat \ S '& is always nonnegative. S S T When forms a Markov chain, for , there is only one nonempty atom, namely ü ü ï , on which '& always vanishes. This atom can be determined directly from the Markov constraint î . For ! ï T l , the five nonempty atoms on which '& always vanishes are listed in (6.96). The determination of these atoms, \ as we have and l , '& seen, is not straightforward. We have also shown that for is always nonnegative. We will extend this theme in Chapter 7 to finite Markov random field with Markov chain being a special case. For a Markov chain, the information diagram can always be displayed in two dimensions, and & is always nonnegative. These will be explained in Chapter 7. ¤ ¸ ¤ ¦w¤ ¸ ¤ ¤ L¨ Ø « 5A ¼ Ø ¼ Ø ¤ ³ ¸ Þ ¤ ¤ ¤ ¼ Ø ¼ Ø formal proof will be given in Theorem 7.30. 
D R A F T September 13, 2001, 6:27pm D R A F T 112 6.6 A FIRST COURSE IN INFORMATION THEORY EXAMPLES OF APPLICATIONS In this section, we give a few simple applications of information diagrams. The use of information diagrams simplifies many difficult proofs in information theory problems. More importantly, these results, which may be difficult to discover, can easily be obtained by inspection of an information diagram. The use of an information diagram is very intuitive. To obtain an information identity from an information diagram is WYSIWYG6. However, how to obtain an information inequality from an information diagram needs some explanation. Very often, we use a Venn diagram to represent a measure which takes such nonnegative values. If we see in the Venn diagram two sets û and that û is a subset of , then we can immediately conclude that .û because .û û (6.99) ¦ 0¨  ¹ ¦ ¹ ¨ ¹ ¦ ¹ ¨ à ¦ 0¨ Ø ¦ ¹ à 0¨°»[« î î However, an -Measure can take negative values. Therefore, when we see in an information diagram that û is a subset of , we cannot base on this fact  9¦ ¨ unless we¹ know from the setup of the alone conclude that 9¦.û ¨ ¹ problem that is nonnegative. (For example, is nonnegative if the random & '& '& & & variables involved form a Markov chain.) Instead, information inequalities can be obtained from an information diagram in conjunction with the basic inequalities. The following examples will illustrate how it works. F¤»z?*ÄËi mJ3¥T ~}e=BG"0¤T(CEDa=Bji*GD<>=B?k Let ¤ ¥ ¦w¡º¨ . Let ¤ ¥§¦w¡©¨ Ø j ¥ ¸ ¦w¡º¨Æ¾j ¥@j¦w¡º¨ ã Â Â Ý and Ø Ý Ã . We will show that where « Z Z Z ¦w¤1¨ð»a ¦w¤ ¸ ¨©¾ ¸ ¥ ¸ ¦w¡º¨ ¤T (6.100) ¦w¤TL¨ î and (6.101) Consider the system in Figure 6.10 in which the position of the switch is determined by a random variable ã with ¯± ã Ø Ý Ø and ¯± ã Ø Ê Ø ã (6.102) whereã ãÊ is independent of ¤ ¸ and ¤T . The switch takes position ñ if ã Ø ñ , ñ Ø Ý . Figure 6.11 shows information diagram for ¤ and ã . From the à ã the diagram, we see that ¤ is a subset of ¤ . Since is nonnegative for two random variables, we can conclude that ¦ ¤ ¨°» ¦ ¤ à ã!¨ ã (6.103) 6 What & ( & * '& you see is what you get. D R A F T September 13, 2001, 6:27pm D R A F T î 113 The -Measure Z=1 X1 X2 X Z=2 Figure 6.10. The schematic diagram for Example 6.12. Z which is equivalent to Then Z ¦w¤I¨å» ¦w¤ ã ¨ î Z (6.104) Z ¦w¤I¨ » =¯ ±¦w ¤ ã ¨ Z (6.105) Z Ø Z ã Ø Ý Z ¦w¤ ã Ø ã LÝ ¨Æ¾ ¯=± ã Ø Ê ¦w¤ ã Ø Ê ¨ (6.106) Ø ¦w¤ ¸ ¨Æ¾ ¦w¤TL¨ (6.107) Z proving (6.101). This shows that ¦w¤1¨ is a concave functional of ¥§¦w¡º¨ . F¤»z?*ÄËi mJ3¥T~}e=BGG»i»CEDa=Bjz>D>¤BÄClGFjË="<z¤ DFCY=BGB ãcb ãgf Ø f î ¦w¤ Let ¨ ¥Å¦w¡ ¨ ¥§¦w¡©¨¥§¦ ¡©¨ (6.108) b f î § ¥ w ¦ º ¡ ¨ We will show that for fixed , ¦w¤«! ¨ is a convex functional of ¥§¦ ¡º¨ . f f Let ¥ ¸ ¦ ¡º¨ and ¥@j¦ ¡©¨ be two transition matrices. Consider the system in Figure 6.12 in which the position of the switch is determined by a random variable ã as in the last example, where ã is independent of ¤ , i.e., î ¦w«¤ !¯ã ¨ Ø « î (6.109) X Z Figure 6.11. The information diagram for Example 6.12. D R A F T September 13, 2001, 6:27pm D R A F T 114 A FIRST COURSE IN INFORMATION THEORY Z=1 p1 ( y x ) X Y p2 ( y x ) Z=2 Figure 6.12. The schematic diagram for Example 6.13. î ¤ b , and ã in Figure 6.13, let ¦w¤«!¯ã ¨ Ø ä [» « î In the information diagram for , b ¦w¤~!¯ã ¨ Ø « , we see that î ¦w~¤ ! b !¯ã ¨ Ø Ã ä ã because î ¦w¤~!¯ã ¨ Ø î ¦w¤~!¯ã b ¨©¾ î w¦ ¤«! b Since î (6.110) (6.111) !¯ã ¨î (6.112) Then ¦w¤«Â ! b î ¨ b (6.113) ¦w¤«! ã ¨ Ø ¯=± ã Ø ã Ý î f ¦w¤~! 
b ã Ø ÝL¨Æ¾ ã ¯± f ã Ø ã Ê î ¦w¤~! b ã Ø Ê ¨ (6.114) Ø î ¦¥§¦w¡º¨ ¥ ¸ ¦ ¡º¨¨n¾ î ¦¥§¦w¡º¨ ¥4¦ ¡©¨¨ (6.115) ã f î where ¦¥§¦w¡º¨ ¥0`¦ ¡º¨¨ denotes the mutual information between the inputf and ¥§¦w¡º¨ and transition matrixf 0¥ ¦ ¡©¨ . output of a channel with input distribution b î This shows that for fixed ¥§¦w¡©¨ , ¦w~ ¤ ! ¨ is a convex functional of ¥§¦ ¡º¨ . î Y -a X a Z Figure 6.13. The information diagram for Example 6.13. D R A F T September 13, 2001, 6:27pm D R A F T 115 The -Measure ¢ Z =1 ¡ p 1 (x ) X ¡ p 2 (x ) ¢ ¡ p (y |x ) Y Z =2 Figure 6.14. The schematic diagram for Example 6.14. £¤¥§¦¨ª©S«¬®¯±°²´³¶µ¸·'¹º¥T»*¼½G¾Nµ¸¿À¦Ác½Ác¥¸©À¼Â·¿Sµ'æ!¥Ä½¼Qµ¸·¸Å Let ÆQÇÉÈËÊeÌ+ÍÎ*ÆQÏ'ÈÑЪÌ(ÒΧÆQÏÌÓΧÆQÐÔ ÏÌÖÕ Î§ÆQÐÔ ÏcÌ (6.116) ÆQÇØ×ËÊ@Ì Î§ÆQÏÌ We will show that for fixed , is a concave functional of . Consider the system in Figure 6.14, where the position of the switch is deterÇ mined by a random variable ÊÙ as in the last example. In this system, when Ç Ê is given, Ù is independent of , or ÙNÚ Ú forms aÊ Markov chain. Then Ç Û'Ü is nonnegative, and the information diagram for , , and Ù is shown in Figure 6.15. ÇDÞ ÊNß ÇàÞ Ê Û'Ü is nonnegFrom Figure 6.15, since Ý Ý Ù Ý is a subset of Ý Ý and ative, we immediately see that ÆQÇØ×ËÊeÌ á Ò ÆQÇØ×ËÊuÔ Ì Ù Ò æ åwç Ò â(ã5ä ÆQÇ×ËÊÉÔ ÒæåèÌéâêãëä Ù Ù ÆñÎòèÆQÏcÌÖÈ<ΧÆQÐÔ ÏcÌÑ̸éóð ð ÒíìSç Ù ÆñÎtôïÆQÏcÌÖÈ<Î*ÆQÐÔ ÏÌÑÌÖÕ ÆQÇ×ËÊÉÔ Î§ÆQÐÔ ÏcÌ , ÆQÇØ×ËÊ@Ì is a concave functional of £¤¥§¦¨ª©S«¬®¯Tõ.²Ñö´¦¨ª«ªÃ¿a«º¹º½"÷諺¹G뺹ª¾ø½ù«ºµ¸Ã«ª¦EÅ text, Ù This shows that for fixed Ê ÒîìïÌ be the cipher text, and Ù ú Z (6.117) (6.118) (6.119) ΧÆQÏÌ . Ç Let be the plain be the key in a secret key cryptosystem. Since X Y Figure 6.15. The information diagram for Example 6.14. D R A F T September 13, 2001, 6:27pm D R A F T 116 A FIRST COURSE IN INFORMATION THEORY Y û ÿ ý a þ d b ü 0 c X Z Figure 6.16. The information diagram for Example 6.15. Ç can be recovered from Ê and Ù , we have ÆQÇÔ ÊêÈ Ì(Ò Ù Õ (6.120) We will show that this constraint alone implies ÆQÇØ×ËÊeÌ ÆQÇ á Let ÆQÇØ×ËÊuÔ × ÔÇ ÌêÒ Ù Ù ÆQÇ×ËÊ (See Figure 6.16.) Since Ì Ù Ì Æ (6.122) á Ì+Ò (6.123) È á (6.124) ªÕ , (6.125) á *é ÆQÇ × Ù Æ<ÊV× (6.121) á Ô ÇÉÈËÊeÌ(Ò and ÌÖÕ Ì(Ò Ù Æ Æ Ù Æ<Ê Ì§ß Õ á (6.126) Ì ÆQÇØ× Ô ÊeÌ Ù Ù InÆQÇØ comparing with , we do not have to consider and ×ËÊ × Ì ÆQÇ Ì Æ Ì Ù since they belong to both and Ù . Then we see from Figure 6.16 that ÆQÇ Ì*ß Æ Ì+Ò ß ß wÕ Ù (6.127) Therefore, ÆQÇ×ËÊeÌ Ò ¶é zß á á Ò zß ß ÆQÇ Ì*ß Æ ÌÖÈ Ù (6.128) (6.129) (6.130) (6.131) where (6.129) and (6.130) follow from (6.126) and (6.124), respectively, proving (6.121). D R A F T September 13, 2001, 6:27pm D R A F T The -Measure X Figure 6.17. 117 Z Y The information diagram for the Markov chain . ÆQÇ×ËÊ@Ì The quantity is a measure of the security level of the cryptosystem. ÆQÇØ×ËÊeÌ small so that the evesdroper cannot In general, we want to make Ç by observing the cipher obtain too much information about the plain text Ê text . This result says that the system can attain a certain level of security Æ Ì Ù only if (often called the key length) is sufficiently large. In particular, if ÆQÇØ×ËÊeÌ+Ò Æ Ì perfect secrecy is required, i.e., , then must be at least equal Ù ÆQÇ Ì to . This special case is known as Shannon’s perfect secrecy theorem [174]7 . Æ<Ê Ô ÇÀÈ ÌêÒ , i.e., Note that in deriving our result, the assumptions that Ù ÆQÇ× Ì(Ò Ù the cipher text is a function of the plain text and the key, and , i.e., the plain text and the key are independent, are not necessary. 
£¤¥§¦¨ª©S«¬®¯T¬ Figure 6.17 shows the information diagram for the Markov Ç Ê chain Ú Ú Ù . From this diagram, we can identify the following two information identities: ÆQÇØ×ËÊeÌ Ò ÆQÇÔ ÊeÌ ÆQÇØ×ËÊ Ò È ÆQÇÔ Ê Ì È Ù (6.132) (6.133) ÌÖÕ Ù ÇqÞ Ý Since Û'Ü is nonnegative and ÆQÇØ× Ì Ù ÇqÞ Ý is a subset of Ù Ý Ê Ý , we have ÆQÇØ×ËÊeÌÖÈ (6.134) which has already been obtained in Lemma 2.41. Similarly, we can also obtain ÆQÇÔ ÊeÌ £¤¥§¦¨ª©S«¬®¯ ز ¶¥Ä½a¥ ÆQÇÔ ¨ªÃcµ¸¹G«º÷÷Y¼Q· ÌÖÕ (6.135) Ù ½ù«tµ'몦EÅ formation diagram for the Markov chain Ç Ê Ú Ú Figure 6.18 shows the inÙ)Ú . Since Û Ü is 7 Shannon used a combinatorial argument to prove this theorem. An information-theoretic proof can be found in Massey [133]. D R A F T September 13, 2001, 6:27pm D R A F T 118 A FIRST COURSE IN INFORMATION THEORY Z Y X T Figure 6.18. The information diagram for the Markov chain nonnegative and ÇqÞ Ý Ý is a subset of ÆQÇØ× & ÊîÞ Ý Ì Ù Ý . , we have Æ<ÊV× !"#$!% ÌÖÈ (6.136) Ù which is the data processing theorem (Theorem 2.42). APPENDIX 6.A: A VARIATION OF THE INCLUSIONEXCLUSION FORMULA ' (*) +-,.'0/ +-,.1/3241 (65 8 :9 <; + =?@ > + ACBED ' AGF 1IHJLD4MOK N.M > +-,.' N F 1/ F D4MOKN.PRQSM > +-,.' NUT ' Q F V1 / WYXCXCXZW , F\[ / >^] D +-,.' DET '`_ T XCXaX T ' > F 1V/3b , can be expressed as a linear combiIn this appendix, we show that for each via applications of (6.28) and (6.29). We first prove by using (6.28) the nation of following variation of the inclusive-exclusive formula. 7 ù«ºµ¸Ã«ª¦P¬ Ö¬® ®¯ For a set-additive function , ced [ c (6.A.1) Proof The theorem will be proved by induction on . First, (6.A.1) is obviously true for . Now consider Assume (6.A.1) is true for some = >f@ ] D + ACBED ' AGF 1IH =g="@ > J + ACBED ' A Hhi' >f] D F 1IH = @> =k= @ > W f > ] J + AaBjD ' A F 1 H +-,.' D F V1 / F + ACBED ' A H T ' ^> ] D F 1 H J l D4MOK N.M > +-,.' N F 1/ F D4MmKN:PnQSM > +-,.' N T ' Q F 1V/ WoXaXCX D R A F T September 13, 2001, 6:27pm c6J [ . (6.A.2) (6.A.3) D R A F T 119 APPENDIX 6.A: A Variation of the Inclusion-Exclusion Formula W , \F [ / >f] D +-,.' D T ' _ T XCXX T ' > F 1V/^p W +q,.' >f] D F 1V/ =?@ > F + AaBjD ,.' A T ' >f] D / F 1IH J l D4MOK N.M > +q,.' N F 1/ F D4MmKN:PnQSM > +-,.' N T ' Q F 1/ WoXaXCX W , F\[ / >f] D +-,.' D T ' _ T XCXCX T ' > F 1V/^p W +-,.' >f] D F 1V/ (6.A.4) F l 4D MmK N:M > +q,.' N T ' >f] D F 1V/ F D4MOKN.PRQSM > +-,.' N T ' Q T ' >^] D F 1/ W$XCXCXSW , F\[ / >^] D +-,.' D T ' _ T XCXaX T ' > T ' >f] D F 1V/^p J D4MOKN.M >f] D +-,.' N F 1V/ F D4MON.PRK QSM >^] D +q,.' NOT ' Q F 1/ W$XCXX W , F\[ / >f] _ +-,.' D T ' _ T XCXX T ' >^] D F 1/3b (6.A.5) (6.A.6) In the above, (6.28) was used in obtaining (6.A.3), and the induction hypothesis was used in obtaining (6.A.4) and (6.A.5). The theorem is proved. Now a nonempty atom of N r> @> N2 NsBjD has the form t N 6t Nu (6.A.7) v where is either or , and there exists at least one such that write the atom in (6.A.7) as N Jwt N @ N F~ ZQ Z b N:x ySz{B&|} z Qx yCB&|} . Then we can (6.A.8) '() +-,.'`/ +-,.1/3241 (65 Note that the intersection above is always nonempty. Then using (6.A.1) and (6.29), we see that for each , can be expressed as a linear combination of . PROBLEMS 1. Show that ÆQÇØ×ËÊ × .m Ù Ì+Ò and obtain a general formula for D R A F T ΧÆQÇÀÈËÊoÌÓÎ*Æ<ÊêÈ Î§ÆQÇ ÌÓΧÆQÇÀÈ Ù ÌÓÎ*Æ<ÊoÌÓÎ*Æ ÌÓÎ*ÆQÇÉÈËÊ RRR Y Ù ÆQÇ òY×ÑÇeô±Èë× ×ÑÇ Ì Ù È Ì Ù ÄÌ . September 13, 2001, 6:27pm D R A F T 120 A FIRST COURSE IN INFORMATION THEORY ÆQÇ×ËÊV× 2. 
Show that hold: Ç a) b) Ê , Ì vanishes if at least one of the following conditions Ù , and Ù are mutually independent; Ç Ê Ú Ç forms a Markov chain and Ú9Ù ÆQÇ×ËÊ 3. a) Verify that × Ì S S m m S nO and Ù are independent. S given by S m S R ΧÆQϸÈÑÐtÈ SÌ vanishes for the distribution S <m n S S m O Ù Î§Æ È È TÌêÒ Õ 1ì aÈDÎ§Æ È ÈYåèÌ Ò Õ Î§Æ ÈYå±ÈYåèÌêÒ Õ 1ì aÈDΧÆOå±È È TÌ Ò Õ 1ì aÈ aå aÈ Î§ÆOå±ÈYå±È TÌêÒ Õå aÈDΧÆOå±ÈYå±ÈYåèÌ Ò Õ aÕ Î§Æ ÈYå±È TÌÒ Õ Î§ÆOå±È ÈYåèÌÒ Õå1å 1ì b) Verify that the distribution in part b) does not satisfy the condition in part a). Ç Ê is weakly independent of 4. Weak independence Î*ÆQÏ*Ô ÐªÌ sition matrix are linearly dependent. Ç a) Show that if Ê of . Ê and i) ii) iii) Ç Ç are independent, then Ç b) Show that for random variables Ù satisfying if the rows of the tranis weakly independent Ê and , there exists a random variable Ê Ú Ç Ê Ú Ù and Ù are independent and Ù are not independent if and only if Ç is weakly independent of Ê . (Berger and Yeung [22].) 5. Prove that b) ÆQÇØ×ËÊ a) .¡ ¢£¤¡ × Ì ÆQÇØ×ËÊ × ß á Ù Ì Ù 6. a) Prove that if tä ºä Ç and Ê ÆQÇØ×ËÊÉÔ ÆQÇØ×ËÊ@ÌÖÈ ÌÖÈ Ù Æ<ÊV× Æ<ÊV× ÔÇ ÌÖÈ Ù ÌÖÈ Ù ÆQÇÉÈ Ô Ê@Ì´ç ÆQÇØ× Ù Ì´çïÕ Ù are independent, then ÆQÇÉÈËÊ × Ì Ù á ÆQÇØ×ËÊuÔ Ì Ù . b) Show that the inequality in part a) is not valid in general by giving a counterexample. ÆQÇ×ËÊ@Ì á ÆQÇ Ì!ß Æ Ì 7. In Example 6.15, it Ê was shown that , where Ù Ç is the plain text, is the cipher text, and Ù is the key in a secret key cryptosystem. Give an example of a secret key cryptosystem such that this inequality is tight. D R A F T September 13, 2001, 6:27pm D R A F T APPENDIX 6.A: A Variation of the Inclusion-Exclusion Formula ¥ ¥ § ¬ « ¦ § £ § ± § 121 ¦ 8. Secret sharing For a given finite set and a collection of subsets of , a secretäYÇ sharing scheme is a random variable and a family of random 5Î ç such that for all , variables ©¨Yª « ¥ and for all Æ ®°«¯ ¦ , Æ § ÔÇ +Ì+Ò ÔÇ ¥ È Ì+Ò Æ ÌÖÕ ¨ Here, is Î the secret and is the set of participants of the scheme. A Ç of the secret. The set participant of the scheme possesses a share specifies the access structure of the scheme: For a subset of , by pooling their shares, if , the participants in can reconstruct , otherwise they can know nothing about . ¦ ¬ « ¦ ¬ ¬ § ¥ § ¬ ³ ® ²´¥ , if ®µ«¯ ¦ and ¬¶$® « ¦ , then £ ± £ ± § § « ¦ , then ii) Prove that if ® £ ± £ ± § a) È i) Prove that for ÆQÇ %Ô Ç êÌêÒ ÆQÇ Æ Ô Ç Ìé êÌ+Ò ÆQÇ ÆQÇ %Ô Ç Ô Ç %È È ÌÖÕ ÌÖÕ · « ¦ , ¬ ® ¸· ¹ ² ¥ such that ¬¶ µ ® ¶ ·µ« ¦ ± º § (Capocelli et al. [37].) · «¯ ¦ È b) Prove that for , then È ÆQÇ ×ÑÇ ÔÇ (Ì Æ á , and ÌÖÕ (van Dijk [193].) ÇÀÈËÊêÈ ³ 9. Consider four random variables which satisfy the followÙ , and Æ ÔÇ Ì¶Ò Æ Ì Æ Ô ÇÀÈËÊeÌ¶Ò Æ Ô ÊoÌEÒ Æ Ì ingÆ<Êu constraints: , , , Ô Ì+Ò Æ Ô Ì+Ò , and . Prove that Ù Ù Æ a) c) ÆQÇØ× b) ÆQÇØ× Ô ÇÀÈËÊêÈ Ì(Ò Ù ÔÊ × Ù Æ<ÊÔ ÇÉÈ d) Ì(Ò Ù f) ÆQÇØ×ËÊ È Ì Æ<Ê × Ì Ò ê . . Ì Ù Ù Ì(Ò × ÆQÇØ× × Ù ÆQÇØ×ËÊ á á Æ Æ È Ù e) Ô ÇÉÈËÊeÌ(Ò × × Ì+Ò Ù Æ ÇØ×ËÊÉÔ Q Ù Ô Ì+Ò Ù Æ<ÊV× Õ È Õ . Ô ÇÉÈ Ì(Ò Ù Õ Ì(Ò Ì The inequality in f) finds application in a secret sharing problem studied by Blundo et al. [32]. D R A F T September 13, 2001, 6:27pm D R A F T 122 A FIRST COURSE IN INFORMATION THEORY ¼» Ç In the following, we use given Ù . pÊ Ô ½» Ç to denote that Ù ¾» ¾» Ç 10. a) Prove that under the Çconstraint thatÇ Ç íÊ Ô and imply chain, Ù Ù Ê and Ê Ú mÊ Ú Ù are independent forms a Markov . b) Prove that the implication in a) continues to be valid without the Markov chain constraint. 
³» does not imply ³» by giving a coun¿» implies ¿» b) Prove that conditioning on À . 12. Prove that for random variables , , , and , Á» \»  ÃÃÄ ¹» Ãà ǻ ¹» Á» ÅÆ È» 6» are equivalent to Hint: Observe that and À and use an information diagram. 13. Prove that ½»  ÃÄ Êà » ½» » Ë Ã Ã » Ì ¾» » ÅÉ ¾» (StudenÎ Í [189].) Ê Ô 11. a) Show that terexample. Ê Ô ÆQÇÉÈ Ù Ê Ô Ê Ô ÆQÇÀÈ Ù Ê Ì Ù Ì Ç Ù Ú Ú9ÙmÚ Ç Ê Ù Ç ÔÊ Ù ÆQÇÉÈËÊeÌ Ê Ô Ù Ô Ù Ê Ê Õ Ù ÔÇ Ù Ç Ç ÔÊ ÆQÇÉÈËÊeÌ Ô Ù Ê Ç Ù Ú Ú9ÙmÚ Ç mÊ Ç mÊÔ Æ ÔÇ Ù È Ù Ì Ù Ç ÔÊ Ô ÆQÇÀÈËÊeÌ Ù Ç mÊÔ mÊÔ Ù Õ Ù HISTORICAL NOTES The original work on the set-theoretic structure of Shannon’s information measures is due to Hu [96]. It was established in this paper that every information identity implies a set identity via a substitution of symbols. This allows the tools for proving information identities to be used in proving set identities. Since the paper was published in Russian, it was largely unknown to the West until it was described in Csisz r and K rner [52]. Throughout the years, the use of Venn diagrams to represent the structure of Shannon’s information measures for two or three random variables has been suggested by various authors, for example, Reza [161], Abramson [2], and Papoulis [150], but no formal justification was given until Yeung [213] introduced the -Measure. Most of the examples in Section 6.6 were previously unpublished. ÏÍ D R A F T Ð September 13, 2001, 6:27pm D R A F T APPENDIX 6.A: A Variation of the Inclusion-Exclusion Formula 123 McGill [140] proposed a multiple mutual information for any number of random variables which is equivalent to the mutual information between two or more random variables discussed here. Properties of this quantity have been investigated by Kawabata [107] and Yeung [213]. Along a related direction, Han [86] viewed the linear combination of entropies as a vector space and developed a lattice-theoretic description of Shannon’s information measures. D R A F T September 13, 2001, 6:27pm D R A F T Chapter 7 MARKOV STRUCTURES Ç ò Çeô £Ñ Ç YÒ Ç We have proved in Section 6.5 that if Ú Ú Ú Markov chain, the -Measure Û Ü always vanishes on the five atoms forms a Ó £Ñ Ò Ó Ó £Ñ YÒ Ó ÔÑ Ó YÒ (7.1) Ñ Ó YÒ Ó ÑÓ Ò Consequently, the -Measure is completely specified by the values of Ò on the other ten nonempty atoms of Õ , and the information diagram for four Ç ò§Þ Ý Ç Ç ò§Þ Ç Ý Ç ò§Þ Ý Ç Ý ò Þ ô Þ ô Þ Ç ô Þ Ç Ý ò§Þ Ý Ç Ç Ý Ý ÇeôêÞ Ý Ç ô Þ Ý Ç Ý êÞ Ç êÞ Ç Ý Ý Ý Þ Ç Ç Ý Ç Ý Þ Ý Ç Þ Ç Ý Ý Õ Ý Û'Ü Û'Ü random variables forming a Markov chain can be displayed in two dimensions as in Figure 6.10. Ç ò Çeô Ú Ú FigureÇ 7.1 is a graph which represents the Markov chain Ç Ú . The observant reader would notice that Û'Ü always vanishes on of if and only if the graph in Figure 7.1 becomes a nonempty atom disconnected upon removing all the vertices corresponding to the complemented set Ç variables in . For example, Û'Ü always vanishes on the atom Ç ò§Þ Ç ô Þ (Þ Ç , and the graph in Figure 7.1 becomes disconnected upon Ý Ý Ý Ý removing vertices 2 and 4.Þ On the other hand, Û'Ü does not necessarily vanish Ç Ç ò Þ Ç ô Þ Ç on the atom Ý , and the graph in Figure 7.1 remains connected Ý Ý Ý £Ñ YÒ Ó £Ñ ¬ Ó ÕÒ ÒÓ ¬ Ñ ÒÓ 1 Ö2 3 Figure 7.1. The graph representing the Markov chain D R A F T ×4 D _ !ÙØÚ September 13, 2001, 6:27pm . D R A F T 126 A FIRST COURSE IN INFORMATION THEORY upon removing vertices 1 and 4. This observation will be explained in a more general setting in the subsequent sections. 
The theory of -Measure establishes a one-to-one correspondence between Shannon’s information measures and set theory. Based on this theory, we develop in this chapter a set-theoretic characterization of a Markov structure called full conditional mutual independence. A Markov chain, and more generally a Markov random field, is a collection of full conditional mutual independencies. We will show that if a collection of random variables forms a Markov random field, then the structure of the -Measure can be simplified. In particular, when the random variables form a Markov chain, the -Measure exhibits a very simple structure so that the information diagram can be displayed in two dimensions regardless of the length of the Markov chain. The topics to be covered in this chapter are fundamental. Unfortunately, the proofs of the results are very heavy. At first reading, the reader should study Example 7.8 at the end of the chapter to develop some appreciation of the results. Then the reader may decide whether to skip the detailed discussions in this chapter. 7.1 CONDITIONAL MUTUAL INDEPENDENCE In this section, we explore the effect of conditional mutual independence on the structure of the -Measure Û'Ü . We begin with a simple example. £¤¥§¦¨ª©S« c®¯ Let Then Ê , , and ÆQÇ×ËÊoÌêÒ ÆQÇØ×ËÊuÔ Ì á Ù Ù be mutually independent random variables. ÆQÇ×ËÊV× , we let Since Ç Ìé Ì+Ò Ù so that Æ<Ê × Ì(Òæß and Ìé ÆQÇØ× ÆQÇ×ËÊV× Ù Ì(Ò × Ìé ºÕ (7.4) Æ<Ê × ÔÇ Ù Ì(Ò ÆQÇØ× Ô ÊoÌêÒ Õ Ô ÊeÌ(Ò Ù ÌêÒ Ù Then from (7.4), we obtain ÔÇ Ù ÆQÇ× Ù (7.2) (7.3) Æ<ÊV× Ù ÆQÇØ×ËÊ Ù È á Ù Ì+Ò Õ g ÆQÇ×ËÊV× Similarly, Ì(Ò Ù ÆQÇØ×ËÊÉÔ ÆQÇØ×ËÊuÔ Ù tÕ (7.5) (7.6) (7.7) The relations (7.3), (7.4), and (7.7) are shown in the information diagram in Ç Ê Figure 7.2, which indicates that , , and Ù are pairwise independent. D R A F T September 13, 2001, 6:27pm D R A F T 127 Markov Structures Y Ûa Ûa -a Ûa ÜX , Figure 7.2. , and are pairwise independent. Ç We have proved in Theorem 2.39 that if and only if ÆQÇÉÈËÊ È ÌêÒ ÝZ ÆQÇ , Ê , and Ù are mutually independent Ìé Æ<ÊoÌé Æ ÌÖÕ Ù (7.8) Ù By counting atoms in the information diagram, we see that Ò ÆQÇ Ò Ò Thus Ì'é Æ<ÊoÌé ÆQÇØ×ËÊuÔ Ìé Ù tÕ Æ Æ<Ê Ì¸ß Ù × ÔÇ ÆQÇÉÈËÊ Ì'é È Ù ÆQÇ× Ì Ù Ôe Ê Ìéì Ù ÆQÇØ×ËÊ × Ì Ù , which implies (7.9) (7.10) (7.11) eÒ ÆQÇ×ËÊÉÔ ÌÖÈ Ù Æ<Ê × are all equal to 0. Equivalently, ÇqÞ Ý Ê ß Ý Ù Ý È ÊíÞ Ý ÔÇ Ù ÆQÇ× Û'Ü ß Ù Ý ÌÖÈ Ô Ê@ÌÖÈ Ù ÆQÇØ×ËÊ × Ì (7.12) Ù vanishes on Ç,È q Ç Þ Ý Ý ß Ù Ý Ê@È q Ç Þ Ý Ý ÊíÞ Ý È Ù Ý (7.13) which are precisely the atoms in the intersection of any two of the set variables Ç Ê Ý , Ý , and Ù Ý . Conversely, ifÇ Û Ü Ê vanishes on the sets in (7.13), then we see from (7.10) that Ç Ê (7.8) holds, i.e., , , and Ù are mutually independent. Therefore, , , and are mutually independent if and only if Û'Ü vanishes on the sets in (7.13). Ù This is shown in the information diagram in Figure 7.3. The theme of this example will be extended to conditional mutual independence among collections of random variables in Theorem 7.9, which is the main result in this section. In the rest of the section, we will develop the necessary tools for proving this theorem. At first reading, the reader should try to understand the results by studying the examples without getting into the details of the proofs. D R A F T September 13, 2001, 6:27pm D R A F T 128 A FIRST COURSE IN INFORMATION THEORY Y Þ0 Þ Þ0 0 Þ0 X Z Figure 7.3. , , and are mutually independent. 
In Theorem 2.39, we have proved that pendent if and only if RRR Y ÆQÇ òYÈÑÇeôïÈ YÈÑÇ By conditioning on a random variable 7 á .â ù«ºµ¸Ã«ª¦ Ç ÄÌ+Ò Ê RRR Y òYÈÑÇeô±È Kß.à YÈÑÇ ÆQÇ ò ß are mutually inde- ÌÖÕ (7.14) , one can readily prove the following. RRR Y are mutually independent conditioning on ß RRR Kß.à (7.15) òYÈÑÇeô±È c® Ç if and only if ÆQÇ ò ÈÑÇ èÈÑÇ ô È Ê èÈÑÇ Ô ÊÌ+Ò ÆQÇ Ô ÊeÌÖÕ ò We now prove two alternative characterizations of conditional mutual independence. RRR Y are mutually independent conditioning on ä´åV´æ , if and only if for all ß èç é ¯ å (7.16) ß ç é ¯ å are independent conditioning on . and i.e., 7 á .ã ù«ºµ¸Ã«ª¦ Ç c® òYÈÑÇeô±È èÈÑÇ Ê å ÆQÇ ×ÑÇ 1È Ò 7Ô ÊeÌ(Ò È Ç ÆQÇ È Ò Ì Ê Remark A conditional independency is a special case of a conditional mutual independency. However, this theorem says that a conditional mutual independency is equivalent to a set of conditional independencies. RRR Y å ß Proof of Theorem 7.3 It suffices to proveYthat (7.15) and (7.16) are equivalent. Ç òYÈÑÇeô±È ÈÑÇ Assume (7.15) is true, so that are mutually independent conÊ Ç ÆQÇ È Ò Ì ditioning on . Then for all , is independent of conditioning Ê on . This proves (7.16). D R A F T ç é ¯ å September 13, 2001, 6:27pm D R A F T 129 Markov Structures ä¢åI´æ . Consider Now assume that (7.16) is true for all ß ç é ¯ å ß RRR ß{ê ß ß:ë RRR Y RRR ß{ê å Ò Ò ÆQÇ ×ÑÇ ÆQÇ ×ÑÇ é È ÆQÇ Ò 7Ô ÊoÌ òYÈÑÇeô±È ÈÑÇ òYÈ ÈÑÇ ×ÑÇ (7.17) òwÔ ÊÌ cÔ Ê ÈÑÇ òYÈÑÇeô1È YÈÑÇ ò7ÌÖÕ (7.18) ß RRR ß{ê Since mutual information is always nonnegative, this implies ÆQÇ ×ÑÇ òYÈ ÈÑÇ òwÔ ÊÌ(Ò È (7.19) ß ì ß ê R R R and are independent conditioning on . Therefore, or RRR Y are mutually independent conditioning on (see the proof of Theorem 2.39), proving (7.15). Hence, the theorem is proved. RRR Y are mutually independent conditioning on 7 á if and only if ß èç é ¯ å RRR Y Kß.à (7.20) Ç Ç ÆQÇ òYÈÑÇeô±È òYÈÑÇeô1È èÈÑÇ ò7Ì Ê YÈÑÇ ù«ºµ¸Ã«ª¦ Ê Ç c®° ÆQÇ òYÈÑÇeô±È èÈÑÇ òYÈÑÇeô±È YÈÑÇ Ê Ô ÊoÌ(Ò ÆQÇ Ô ÊêÈÑÇ ïÈ Ò ÌÖÕ ò RRR Y ß Proof It suffices to prove èthat (7.15) and (7.20) are equivalent. Assume (7.15) Ç òYÈÑÇeô±È ÈÑÇ Ê are mutually independent conditioning on . is true, so that Ç Ç È Ò Ê Since for all , is independent of conditioning on , å ß ÆQÇ ç é ¯ å ß èç é ¯ å Ô ÊoÌ(Ò ÆQÇ Ô ÊêÈÑÇ ïÈ Ò Ì (7.21) Therefore, (7.15) implies (7.20). Now assume that (7.20) is true. Consider ÆQÇ Ò RRR ß RRR ìß ê Kß.à (7.22) ß ß.ë RRR Y RRR ß{ê Kß.à ß èç é ¯ å (7.23) ß ß . ß ë èç é ¯ å Kß.à RRR Y RRR ß{ê Kß.à ò ÈÑÇ ô È èÈÑÇ Ô ÊÌ ÆQÇ Ô ÊêÈÑÇ òYÈ ÆQÇ Ô ÊêÈÑÇ ïÈ èÈÑÇ òëÌ ò Ò Ò Ìé Ò ÆQÇ ×ÑÇ òYÈ ÈÑÇ cÔ Ê ÈÑÇ òèÈ cÔ Ê ÈÑÇ YÈÑÇ ò7Ì ò ÆQÇ Ô ÊêÈÑÇ ïÈ Ò Ìé ÆQÇ ×ÑÇ òYÈ ÈÑÇ òèÈ YÈÑÇ ò7ÌÖÕ ò ò (7.24) Then (7.20) implies Kß.à D R A F T RRR {ß ê (7.25) September 13, 2001, 6:27pm D R A F T ÆQÇ ò ß :ß ë RRR Y ×ÑÇ òYÈ ÈÑÇ Ô Ê ÈÑÇ òYÈ YÈÑÇ ò7Ì(Ò Õ 130 A FIRST COURSE IN INFORMATION THEORY å Since all the terms in the above summation are nonnegative, they must all be §Òæå equal to 0. In particular, for , we have ÆQÇ RRR Y ß èç é ¯ å òY×ÑÇeô±È èÈÑÇ Ô ÊÌ(Ò By symmetry, it can be shown that ÆQÇ ×ÑÇ ïÈ Ò Õ (7.26) 7Ô ÊeÌ+Ò ä´åI¢æ . Then this implies (7.15) by the last theorem, completing the for all proof. 7 á Let · and í ß be disjoint index sets and î ß be a subset of í ß ï#å6¹ð , where ð . Assume å such that forß Áñ that there exist atº least two ß ò £ ó S õ ô « ö £ å ° ð £ ó S ôõ«#· be î ¯ . Let N í ò *åkand ð N ß Áñ are mutually independent collections of random variables. If ÷ º N conditioning on º , then ò N ê ÷ N , 6such ´åøthat ð î. 
¯ are mutually independent conditioning on å ù«ºµ¸Ã«ª¦ (7.27) c®õ å ì á Ò Ç Ò\ÆQÇ È ÌÖÈYå Ç Ç Ç Ç ÆQÇ È Ì Ò ÈÑÇ å Ò]ÆQÇ ÈYå ªÌ £Ñ YÒ £ù £ú are mutually inde£ù £ú are , , and £Ñ YÒ û . ò N "öåïµð are mutually independent We first give an example before we prove the theorem. £¤¥§¦¨ª©S« Ç c®¬ û òYÈèÆQÇeô±ÈÑÇ ±ÈÑÇ ±ÌÖÈ Suppose Ç and pendent conditioning on . By Theorem 7.5, ÆQÇ ±ÈÑÇ 1ÈÑÇ mutually independent conditioning on º Proof of Theorem 7.5 Assume Ç conditioning on , i.e., Ç ÈYå ±ÈÑÇ Ç ò Ì Çeô ÆQÇ ±ÈÑÇ èÌ YÌ ÈYå ò N ©´åø¢ð º ÆQÇ ÆQÇ Ô Ç Ì(Ò ß.K à ü ò N º ÆQÇ ò ÔÇ ÌÖÕ (7.28) ÷ N ä¢åøð º ò N ê ÷ N ä´åVð ò N ©´åIð º ò N ê ÷ N ä´åVð º Kß.à ü ò N º ò N ê ÷ N º ò Q ê ÷ Q ©é£¢å Kß.à ü Kß.à ü ò N º ò Q ê ÷ Q äé ´å Consider ÆQÇ Ò ÈYå Ô Ç ÆQÇ Ò ÈÑÇ ÈYå ÈYå Ô Ç ÆQÇ ÔÇ Ì§ß ªÌ ÆQÇ ÈYå Ô Ç Ì (7.29) Ì ò ß ÆQÇ ÔÇ ÈÑÇ ÈYå ß åèÌ ß åèÌ ò á ÆQÇ ÔÇ ò ýKß.à ü ß D R A F T ÈYå 'ß åèÌ ò N ê ÷ N º ò Q ê ÷ Q ©é£¢å ÆQÇ ò ÈÑÇ ÔÇ ÈÑÇ ÈYå (7.30) September 13, 2001, 6:27pm (7.31) D R A F T 131 Markov Structures Kß.à ü Ò Kß.à ü á ÷ N º ò Q ê ÷ ÷ N º ò Q ê ÷ ÆQÇ ÔÇ êÈÑÇ ÆQÇ ÔÇ ÈÑÇ ò ò Q äé ´å Q äé ð ÈYå Ì ÈYå ªÌÖÕ (7.32) (7.33) In the second step we have used (7.28), and the two inequalities follow because conditioning does not increase entropy. On the other hand, by the chain rule for entropy, we have ÆQÇ Ò ÷ N ©´åIð º ò N ê ÷ Kß.à ü ÷ N º ò Q ê ÷ ÈYå Ô Ç ÈÑÇ ÔÇ ÈèÆQÇ ÆQÇ ò N ä¢åIð Q ©þéYð ÷gÿ ä¢ô0´å ÈYå ªÌ ÈYå ªÌÖÈèÆQÇ OÈYå 'ßåèÌÑÌÖÕ (7.34) Therefore, it follows from (7.33) that Kß.à ü ÷ N ÷ Kß.à ü ÆQÇ ò º ò Q ê ÷ Q äé ð N ©åø ð º ò N ê ÷ N è´åIð ÷ N º ò Q ê ÷ Q ©þéYð ÷gÿ ä¢ô0´å ÔÇ ÈÑÇ ÆQÇ Ò ÈYå ÈYå Ô Ç ÆQÇ ò ªÌ ÔÇ ÈÑÇ ÈYå ÈèÆQÇ ªÌ ÈYå ªÌÖÈèÆQÇ OÈYå (7.35) (7.36) 'ßåèÌÑÌÖÕ (7.37) å However, since conditioning does not increase entropy, the th term in the sumå mation in (7.35) is lower bounded by the th term in the summation in (7.37). Thus we conclude that the inequality in (7.36) is an equality. Hence, the conditional entropy in (7.36) is equal to the summation in (7.35), i.e., ÷ N ä´åVð º ò N ê ÷ N ä´åVð Kß.à ü ÷ N º ò Q ê ÷ Q äé ð ÆQÇ Ò ÈYå Ô Ç ÆQÇ ÔÇ ÈÑÇ ÈYå ÈÑÇ ÈYå ªÌ ªÌÖÕ ò (7.38) (7.39) The theorem is proved. Theorem 7.5 specifies a set of conditional mutual independencies (CMI’s) which is implied by a CMI. This theorem is crucial for understanding the effect of a CMI on the structure of the -Measure Û'Ü , which we discuss next. «ª¦¦!¥ ¤ c® ables, where Let ì D R A F T á ß RRR ß N ïå ò È Æ Ù È , and let Ê Ù ÌÖÈYå ß RRR ß N be collections ofÆ random vari Ì òYÈ èÈ be a random variable, such that Ù Ù , September 13, 2001, 6:27pm D R A F T 132 A FIRST COURSE IN INFORMATION THEORY ä´åI å Ê are mutually independent conditioning on ~ ß.@ à Ü Û @à N ßç ç %ß Ê Ò ò Ù Ý ò . Then Ý Õ (7.40) We first prove the following set identity which will be used in proving this lemma. .; «ª¦¦!¥ § c® ¬ ß and ® Let and be disjoint index sets, and be a set-additive function. Then be sets. Let Û ~ = ß @ ¬ ß H ~ ç @ ¬ ç ® ë K K ¬ ® ¬ ® ¬ ß ¬ ß. denotes ¶ where ¬ Proof The right hand ë side of (7.41) is equal to ë K K ¬ ® K K ¬ ë K K ¬ ® Now ë K K ¬ ® K ¬ ® K Þ Û Ò Æ ß åèÌ ëÆ ß Ìé åèÌ ß Ìé Ì§ß Æ ß Æ ß Æ ß ß Æ Û Æ Û ß Æ Û ® ß åèÌ Æ Û ß ß Ì(Ò Æ ß åèÌ Æ Û Æ Û ß ÌÖÕ Æ Û åèÌ åèÌ ß Ì ÌÑÌGÈ (7.41) Æ Û Æ ß ß ® Ì (7.42) Æ ß åèÌ Õ (7.43) Since K Æ ß åèÌ ïÒ ü Kà Ô Ô by the binomial formula1 , we conclude that K 1 This K Æ ß åèÌ à " ë ò and ! 
!$#&% '(% à àïê Æ Û ¬ ß åèÌ ü Ò ® Ì+Ò Õ (7.44) (7.45) ò in the binomial formula KACB % '(% *),+ D R A F T can be obtained by letting Æ ß ë ð ü A !/% '0% 1 A 32 .- September 13, 2001, 6:27pm D R A F T Markov Structures K Similarly, K Æ ß åèÌ 133 ë ¬ Æ Û ® ß Ì+Ò Õ Therefore, (7.41) is equivalent to ~ = ß @ ¬ ß H Û ~ ç @ ¬ ç ® Þ ß K Ò K ë ë Æ ß åèÌ (7.46) ò ¬ Æ Û ß ® Ì (7.47) which can readily be obtained from Theorem 6.A.1. Hence, the lemma is proved. Proof of Lemma 7.7 We first prove the lemma for ~ ß.@ à Ü Û K D 54 ò76989898 6 K Æ ß _ ß åèÌ Ê ò Ù Ý ò <54 ò76989898 6 ;: @à N ßç ç ô Ý ë ,=> . By Lemma 7.8, Ò ~ ç ¶Òíì Ü Û : ìç ò ß Ù Ý Ê Ý ìç ¶ ~ (7.48) ü ü ü The expression in the square bracket is equal to ìç é « § Zðo« ìç éï« § Zð$ü « (7.49) ü ì ç é « Z ? ð « which is equal to 0 because § and ü are independent conditioning on . Therefore the lemma is proved for . For , we write ê N N ß ~ ß.@ à ç @ à ç ~ ~ ß.@ à ç @ à ß ç ~ ç @ à ç (7.50) ß RRR ß N ÔáåY RRR Since and are independent conditioning on , upon applying the lemma for , we see that ~ ß.@ à ç @ à N ß ç (7.51) é Û ~ ü Ü ß ô Ê Ù Ý Ý ß ò Æ ~ ~ ç Ü Û È ò ïÈ ÌÖÈèÆ Æ ÌëÔ ÊoÌÖÈ DB Ì Æ ô È Ì Ù B ò ß Ê ò Ù Ý ò ÆÑÆ Õ Òíì B ì FE Ü @?A Ý Ô ÊÌ È ô ò 1È Ù Ê DB Ù CB Ê È ô Ù Ù Û Æ CB ÆÑÆ ß ô Ù Ý Ô ÊÌ'é Ù ß ò Ù Ý òYÈ Ù Ò Û Ý YÈ Ê ÌÖÈYå G Þ Ü ò HßåèÌ Æ Ù Ù ß ò Ù Ý ò Ù Ý òYÈ YÈ Òíì Ù Ê Õ Ý G Ì Û Ü ß ò ò Ù Ý Ê Ò Õ Ý The lemma is proved. D R A F T September 13, 2001, 6:27pm D R A F T 7 á Let and ß í ß è¢å ð be disjoint index sets, where ð , ò N £ó Sô©« í Ô³*å ð and £ó Sôè« be collecand let ò á £ å ö ð N independent tions of random variables. Then ß ² í ß, Rß RareR mutually conditioning on if and only if for any , where î î î î ä´åIð , if there exist at least two å such that î !¯ ñ , thenü ~ ~ ß:@ à ü ç @ ÷ èç A BED ò N ê ÷ N (7.52) N N 134 A FIRST COURSE IN INFORMATION THEORY ù«ºµ¸Ã«ª¦ ÈYå c®I Ç Ò ÆQÇ È á ÌÖÈYå Ç Ç Ò ÆQÇ È ì Ì ÈYå Ç òYÈ ôwÈ èÈ Ò å Ç Ü Û ß ò 5 Ç Ý " Ý Ò " Õ ## We first give an example before proving this fundamental result. The reader should compare this example with Example 7.6. £¤¥§¦¨ª©S« c®¯J Suppose Ç dependent conditioning on ò Ü Æ Ç Û Þ ô Ç Ý û Ç Þ £Ñ YÒ òYÈèÆQÇeô1ÈÑÇ ÌÖÈ and . By Theorem 7.9, ù Ç Ý ±ÈÑÇ Þ Ý ú Ç Ý ß £ù £ú ÆQÇ ïÈÑÇ èÌ are mutually in- Ñ ¶ Ò ¶ û Æ Ç Ç ZÝ Ý Ç "Ý ÌÑÌ(Ò Õ (7.53) However, the theorem does not say, for instance, that Ü Æ Çeô Ý Û YÒ Þ Ç ß Æ Ç ò Ý Ý ¶ £Ñ ¶ £ù ¶ £ú ¶ û Ç "Ý Ç ZÝ Ç "Ý Ç ëÌÑÌ "Ý (7.54) is equal to 0. å î î RRR ß î ü ñ î ß ²Áí ß "°å µð î ¯ ò N ä´åVð (7.55) NsA BjD ò N (7.56) ±K ® where § consists of sets of the form ~ ß.@ à ü ç @ ÷ ç A BjD ò N ê ÷ N (7.57) s N N ß ²í ß for *å&!ð and there exists at least one å such that î ß ¯ ñ . with î ® « § is such that there exist at least two å for which Byß our ñ assumption, if ö î ¯ , then ® . Therefore,å if ® is ß possibly ¯ ñ . Nownonzero, 6¢then ô0ß®ð , must be such tható there exists a unique for which î forß define the set § consisting of sets of the form in (7.57) with î ²¾í for Proof of Theorem 7.9 We first prove the ‘if’ part. Assume that for any òëÈ ô±È èÈ å , where , , if there exist at least two Ò such that , then (7.52) holds. 
Then ÆQÇ ÈYå Ô Ç Ì Ò Ç ß Ý Ç Ý " Ç Ý Ü Æ Û ß Ý + Ò ò Ç Ü Û - Ì " ## å Ò Û'Ü Æ Ò ÌzÒ Û'Ü Æ Ì Ò D R A F T September 13, 2001, 6:27pm å D R A F T 6´åøð , î ó ¯ ñ , and î ß ñ of the form ~ ç @ ÷gÿ èç 135 Markov Structures å Ò Ò Ç Then ±K ò N Ç ß Ý Ò ß 5 Ç ® Æ Ý Õ " # Ü Æ Û ò ÌÖÕ (7.59) Ì K Ç MN L Ì <K Kó à ü ± K ÿ ® Ì(Ò Q BN ò Q ÿ <ÿ ~ ç @ ÷kÿ èç ÿB ± ÿ® 5 Ç Æ Ý Ü Æ Û => ß 5 Ç Ý N B ÿ:ò N ò ÿ ê ÷kÿ Æ Ý Ì <K " ?A # (7.60) 0 K O L Ò Ò Ý Now å ¯ ô . In other words, § ó consists of atoms (7.58) NB ÿò N ò ÿê ÷ ÿ for Õ (7.61) Since Û'Ü is set-additive, we have Û ò N Ç Ü + ß 5 Ç Ý Ý Æ Q BN ò Q K ò N ä¢åIð Kß.à ü ± K N ® Kß.à ü ò N òN Kß.à ü ±K ÿ ® Ò Ì Û - Ü Æ ÌÖÕ (7.62) Hence, from (7.56) and (7.59), we have ÆQÇ Ò ÈYå Û ò Ò Ü Æ Û ò Ò Ô Ç Ç Ü + Ì ß ò ÔÇ (7.63) 5 Ç Ý ÆQÇ Ì Ý ÈÑÇ Æ Q BN ò Q K ò Q é ¯ å È Ò Ì (7.64) - ÌÖÈ (7.65) ò N Lå?¿ð ò N ?µåïÁð RRR å î î î ß ¯ î ñü Ç ÈYå where (7.64) follows from (7.62). ByÇ Theorem 7.4, are mutually independent conditioning on . Ç ÈYå We now prove the ‘only if’ part. Assume areô±È mutually Ç òëÈ YÈ . For any collection of sets , independent conditioning on å Ò where ,Ç , if there exist at least two such that , ÈYå by Theorem 7.5, are mutually independent conditioning on ÆQÇ ÈÑÇ ÈYå ªÌ . By Lemma 7.7, we obtain (7.52). The theorem is proved. î ßö ² í ß ÷ ³å*oá ð åe³ð ò N ê ÷ N *åkN ð D R A F T September 13, 2001, 6:27pm D R A F T 136 A FIRST COURSE IN INFORMATION THEORY RRR Y is A conditional mutual independency on R R R full if all are involved. Such a conditional mutual independency is called a full conditional mutual independency (FCMI). <â For æ , YÒ , and £ù are mutually independent conditioning on £Ñ , 7.2 ¢ FULL CONDITIONAL MUTUAL INDEPENDENCE e«ª¿a¼Â·¼Ó½¼Qµ¸· £¤¥§¦¨ª©S« Ç Ç c®¯ª¯ ò ÈÑÇ Ç ô È Çeô±ÈÑÇ Ç is an FCMI. However, Ç ò , ô Ç èÈÑÇ ÀÒ c®¯ ò òYÈÑÇeô±È ÈÑÇ , and Ç Ç ù YÒ are mutually independent conditioning on Ç is not an FCMI because Ç Ñ is not involved. RRR æ As in the previous chapters, we let ÒKäTå±È´ìaÈ P wÈ *çïÕ (7.66) = ü ß (7.67) ¶ ß. à í H ß © å ð defines the following FCMI on RRR then the tuple í Y : ª ò D ò _ RRR ò A are mutually independent conditioning on . ß ©åVð . We will denote by í RRR ¢ <ã Let í ß è¢åV ð be an FCMI on . The image of , denoted by of Õ which ß , ²is the ß , setä¢ofåIallð atoms î í has the form of the set in (7.57), where , and there exist at ß !¯ ñ . å least two such that î ´ Let í í be an FCI (full conditional indeRRR Y . Then pendency) on ¬ « ¦ ª ¬² ò D ò _ (7.68) RRR ´ Let í ß ä¢åIð be an FCMI on Y . Then ÂÄ Ë ÊÌ « ª ¬ ¦ ¬² ß ç ò N ò Q (7.69) Å ü In Theorem 7.9, if Ò È P ò Æ È ÈYå ªÌ Ç òëÈÑÇeô±È èÈ Ç Q ïÇ ÈÑÇ È èÈÑÇ Æ Q e«ª¿a¼Â·¼Ó½¼Qµ¸· Ç c®¯ È ÈYå ªÌ ÒDÆ Q Ç È Q ÈYå RTS Ç ªÌ Æ Q òëÈÑÇeô±È èÈ Ì å Ò Ãcµ'¨tµ¸÷è¼½¼<µ'· U c®¯±° Ç RTS U Ãcµ'¨tµ¸÷è¼½¼<µ'· Æ Q c®¯Tõ ÒqÆ Q òYÈÑÇeô±È È ò È ô Ì YÈÇ Ì(ÒKä Æ Ç Þ Ç ß Q Ý Ò Æ È Ç Ý ÈYå Ì´çïÕ Ý ªÌ Ç òYÈÑÇeô±È èÈ Ç RTS Æ Q Ì(Ò D R A F T Æ Ç ò7V W 3V Þ Ý Ç ß Ý Ç Ì Õ Ý September 13, 2001, 6:27pm D R A F T 137 Markov Structures Æ Q Ì . Their These two propositions greatly simplify the description of RTS proofs are elementary and they are left as exercises. We first illustrate these two propositions in the following example. 4ñ £¤¥§¦¨ª©S« Q ô!Ò æ Z c®¯T¬ Æ aÈ5äTåwçïÈ5äìaÈ ¬ « ¦ ª ¬² Ñ ¶ ¦ ª ¬² « on . Let forbeallan¬ FCMI Æ Q RTS ¬ « 7 á and only if ¬ and RTS Æ Q ôèÌ+Ò ä ù«ºµ¸Ã«ª¦ Æ ò7Ì(ÒKä Æ Ç Æ Ç ò5Þ 4 ô6 Ç Ý c®¯ Û'Ü ÒHX Ý ò§Þ Ý Ç Q Æ Q ÌêÒ RTS Proof First, (7.67) is true if written as Q Ý Æ Ç Ò £Ñ 4 ô6 Ç Ý Ì : òzÒZÆ ä SçïÈ5äTåwçïÈ5äìaÈ7XçwÌ Q Consider and FCMI’s SçïÈ5äYXç±Ì . 
Then ß : Ç YÌ´ç (7.70) Ý Ñ YÒ ¶ YÒ RRR Y . Then 4 ô6 Þ : Ç òYÈÑÇeô±È Ì èÌ Æ Ç ò5Þ Ý and Ç Ý èÌ´çïÕ (7.71) Ý èÈÑÇ Q holds if is an FCMI. Then the set in (7.57) can be ~ ç @ A ÷ èç > ê NA BED ÷ N (7.72) N BED N which is seen to be an atom of Õ . The theorem can then be proved by a direct application of Theorem 7.9 to the FCMI . ß.à ß Let ¬ be a nonempty atom of Õ . Define the set nåø« ª ß ß Ó (7.73) Note that ¬ is uniquely specified by because ¬ ~ ß @ > ê ß ~ ß @ ß Ó ~ ß @ > ê ß (7.74) !æ as the weight of the atom ¬ , the number of ßß in ¬ Define ¬ äí åø which not complemented. We now show that an FCMIß ß åø ð are is uniquely specified by . First, by letting î í for ð in Definition 7.13, we see that the atom ~ ç @ A ò èç Z (7.75) NsBjD N is in , and it is the unique atom in with From ß the°largest å ð weight. this atom, can be determined. To determine , we define í Ó as follows. For ô Sô « Ó , ô Sô is in if and only ifa relation on Ç ß Ç[Z Ý Ý È Q ÒîÞ ò Ê Ý ÒKä \ Ê P Ò Ç Ý çïÕ Ý \ Ò Ç Z Æ Ì 5]*^ Ò _ Þ Ç Ò ]*^ Ý ußíÔ \ Ç Z Ý ß ] ^ Ç Ý Ç ÒøÆ Q RTS Æ Q Ì Ò Ç Æ Q ß ` D R A F T ba ÈYå å Ý Æ Q RTS P Ý È Ç Ý Ì Ò Õ Ý Ô ªÌ RTS 5]^ Ì ÈYå È Æ B È Ì B September 13, 2001, 6:27pm ` D R A F T 138 A FIRST COURSE IN INFORMATION THEORY ô ô ; or 'Ò i) B @ ç D4Q MnB Qÿ M ÿ > ii) there exists an atom of the form £ó £ó Ç SÞ Ç Ý Þ Ê Ý TK (7.76) Ý c Eç èç or ç Ó . ô Sô Õ .Ôå The Recall that ¦ is the setô Sofô nonempty atoms of idea of ii) is that ß « ¢ w ð is in if and only if . Then is reflexive í for some and symmetric by construction, and is transitive by virtue of the structure of Ó . In other words, is an equivalence relation which partitions into uniquely specify each other. í ß ©´åI ð . Therefore, and completely characterizes effect of on the The image of an FCMI RRR Y . The joint effect of morethethan -Measure for one FCMI can ¦ in ß Æ Q RTS Ì , where Ê !Ò Ý Ç Ç Ý Ý Æ È ` Æ Q RTS ä È Ì B å ` B Ì ÈYå ` tç Q Ç òYÈÑÇeô±È Æ Q RTS Q Ì Q YÈÑÇ ó ©´ô` (7.77) ó holds if and only if vanishes on be a set of FCMI’s.ó By Theorem 7.9, ó Y ! ô the atoms in Im . Then à hold simultaneously if and only if ó . This ó vanishes on the atoms in ¶ ü is summarized as follows. ó ô is ¢ <; The image of a set of FCMI’s easily be described in terms of the images of the individual FCMI’s. Let ÒKä Q d ÈYå ç fe Q Æ Q QÌ ÈYå Q ò Û'Ü e«ª¿a¼Â·¼Ó½¼Qµ¸· Æ Q RTS á c®¯*I if and only if Û'Ü Æ Òøä Q d Æ d RTS ù«ºµ¸Ã«ª¦ QÌ c®¯ defined as 7 Û'Ü ge ¬ óà ü Ò QÌÖÕ be a set of FCMI’s for Æ d Ì for all . RTS Ç ç he ó Æ Q ò RTS ¬ « d Let Ì(Ò Ì ÈYå (7.78) RRR Y . Then òYÈÑÇeô±È ÈÑÇ d holds In probability problems, we are often given a set of conditional independencies and we need to see whether another given conditional independency is logically implied. This is called the implication problem which will be discussed in detail in Section 12.5. The next theorem gives a solution to this problem if only FCMI’s are involved. 7 á .â ù«ºµ¸Ã«ª¦ c® J if and only if RTS ² Æ d Let ôèÌ ² d ò iRTS and Æ ô d d ò7Ì Æ d be two sets of FCMI’s. Then . ôèÌ ² Æ d ò7Ì d ò d ò implies d ô ¬ Ç ¬ d ô Proof We first Æ prove that ifò RTS iRTS , then implies . Assume Æ d ô Ì Æ Ì!Ò d ò Ì d Û'Ü and holds. Then by Theorem 7.19, for all RTS jRTS Æ d ò7Ì Æ d ôèÌ Æ d ò7Ì Æ Ì Ò RTS . Since RTS kRTS , this implies that Û'Ü for all ¬ « D R A F T ² September 13, 2001, 6:27pm D R A F T 139 Markov Structures ¬ « Æ d ² ôèÌ ô d . Again by Theorem 7.19, this implies also holds. Therefore, Æ d ò7Ì d ò d ô if RTS , then implies . 
iRTS Æ d ôèÌ Æ d ò7Ì d ò d ô RTS iRTS We now proved that if implies ,Æ then . To prove this, ò Æ d ò7Ì d ô d ôèÌ implies but RTS , and we will showÆ d that we assume that gRTS Æ d ôèÌ ß ò7Ì RTS RTS this leads to a contradiction. Fix a nonempty atom . Ç òYÈÑÇeô±È YÈÑÇ such By Theorem 6.11, we can construct random variables Û'Ü that Û'Ü vanishes Æ on all the atoms of except for . Then vanishes on all Æ d ôèÌ d ò7Ì RTS but not on all the atoms in . By Theorem 7.19, the atoms in RTS Ç òëÈÑÇeô1È YÈÑÇ d ò d ô this implies that for so constructed, holds but does not d ò d ô hold. Therefore, does not imply , which is a contradiction. The theorem is proved. RTS Æ d ôèÌ ² ²¯ RRR Y Õ ¬ « ¬ RRR Y Remark In the course of proving this theorem and all its preliminaries, we have used nothing more than the basic inequalities. Therefore, we have shown that the basic inequalities are a sufficient set of tools to solve the implication problem if only FCMI’s are involved. .â ³¶µ¸ÃGµ¸©T©ï¥¸ÃT¾ c® ¯ Two sets of FCMI’s are equivalent if and only if their im- ages are identical. Proof Two set of FCMI’s d ò d ò d ô d and ô ô d and Æ ² are equivalent if and only if ò Õ d Æ Æ d ôèÌ ² (7.79) Æ d Then by the last theorem, this isÆ d equivalent to RTS iRTS Æ d ôèÌ+Ò ò7Ì Æ d ôèÌ RTS iRTS , i.e., RTS . The corollary is proved. ò7Ì and RTS Æ d ò7Ì Thus a set of FCMI’s is completely characterized by its image. A set of FCMI’s is a set of probabilistic constraints, but the characterization by its image is purely set-theoretic! This characterization offers an intuitive settheoretic interpretation of the joint Æ effect of FCMI’s on the -Measure for Ç òYÈÑÇeô±È YÈÑÇ ò7Ì%Þ Æ Q ôèÌ Q . For example, is interpreted as the efRTS RTS ò ô Æ Q ò7Ì%ß Æ Q ôèÌ Q Q RTS RTS fect commonly due to and , is interpreted as the ò ô Q Q effect due to but not , etc. We end this section with an example. RRR Y .â â £¤¥§¦¨ª©S« c® Q Q and let d òÒKä Q Consider 4ñ Ñ 4ñ ÀÒiX Z . Let òÒ Æ aÈ5äTå±È´ìaÈ SçïÈ5äYXçwÌÖÈ !Ò Æ aÈ5äTå±È´ìSçïÈ5ä aÈ7XçwÌÖÈ òYÈ Q ôwç and RTS D R A F T æ Æ d d ô!ÒKä Q ò7Ì+Ò RTS 4ñ Ò 4ñ Z Ñ Ò . Then ¶ ô!Ò Æ aÈ5äTå±È´ìaÈ7XçïÈ5ä SçwÌ Q Ò Æ aÈ5äTå±È SçïÈ5äìaÈ7XçwÌ ±È Q Æ Q Q (7.80) (7.81) ±ç ò7Ì lRTS Æ Q ôèÌ September 13, 2001, 6:27pm (7.82) D R A F T 140 A FIRST COURSE IN INFORMATION THEORY and Æ d RTS where RTS RTS RTS Æ Q ò7Ì Æ Q Ò ä Æ Q Ñ ô Ì èÌ Ò ä Ì Ò ä Ò RTS Æ Q Ò ä ô ÌêÒ RTS Æ Q ¬ «« ¦ ªª ¬#² ¬ « ¦ ª# ¬ ² ¬ « ¦ ª ¬#² ¬ ¦ ¬#² Ñ ¶ Ì 4 ò76 ô6 Æ Ç Ý 4 ò76 ô6 Æ Ç Ý 4 ò76 ô Æ Ç Ý Æ Ç Ý 4 ò76 Ñ Ò Æ Q mRTS Ò ÌÖÈ Ñ YÒ Þ Ç : Ì´ç Ý Þ : Ç Ý Þ Ç Þ Ç : Ñ Ý 4 ô6 (7.84) Ì´ç ÑÒ 4 6 Ý : (7.83) Ò (7.85) : : Ì´ç (7.86) Ì´çïÕ Æ d ò7Ì ² (7.87) Æ d ôèÌ nRTS . It can readilyô be seen by using an information diagram that RTS d d ò implies . Note that no probabilistic argument is involved in Therefore, this proof. 7.3 MARKOV RANDOM FIELD A Markov random field is a generalization of a discrete time Markov chain in the sense that the time index for the latter, regarded as a chain, is replaced by a general graph for the former. Historically, the study of Markov random field stems from statistical physics. The classical Ising model, which is defined on a rectangular lattice, was used to explain certain empirically observed facts about ferromagnetic materials. In this section, we explore the structure of the -Measure for a Markov random field. p ÒpÆ$p(È zÌ Let o be an undirected graph, where is the set of vertices and is the set of edges. We assume that there is no loop in o , i.e., there is no \ edge in o whicha joins a vertex to itself. 
For any (possibly empty) subset of p \ , \ denote by o the graph obtained from \ o by eliminating all the vertices Æ \ Ì q in and all the edges joining a vertex in . Let be the number of a \ distinct components in . Denote the sets of vertices of these components o pòèÆ \ ÌÖÈTpºôTÆ \ ÌÖÈ ÈTpr ] Æ \ Ì Æ \ ÌsEå \ " q by . If , we say that is a cutset in o . # S RRR ¢ .â ã e«ª¿a¼Â·¼Ó½¼Qµ¸· c® .²ut¥¸Ãwv'µG» RRR æ ÃG¥¸·yx§µ'¦m¿a¼Â«º©*x§Å ß S ÒDÆ$p+È Ì Let o Ç be an undiP p]Ò Ò äTå±È´ìaÈ wÈ *ç rected graph with , and let be a random variÇ òYÈÑÇeô±È èÈÑÇ able corresponding to vertex . Then form a Markov random \ field byèÈÑo Ç{if for all cutsets in o , the sets of random variables ] ] represented ] Ç{z ÈÑÇ{z È z3|~} ] Ç " " " . # # # are mutually independent conditioning on D _ RRR å RRR Y RRR Y RRR RRR Y This definition of a Markov random field is referred to as the global Markov Ç òYÈÑÇeô±È YÈÑÇ property in the literature. If form aÈÑMarkov random field repò ÈÑÇ ô È Ç Ç resented by a graph o , we also say Ç that form a Markov graph òYÈÑÇeô±È YÈÑÇ o . When o is a chain, we say that form a Markov chain. D R A F T September 13, 2001, 6:27pm D R A F T 141 Markov Structures in specifies an RRR ª D RRR are mutually independent conditioning on . R R R For a collection of cutsets the notation ü in ^, weÔRintroduce R R R R ü (7.88) ü RRR Y \ In the definition ofÈÑaÇ Markov random field, each cutset ò ÈÑÇ ô È Ç \ , denoted by . Formally, FCMI on TÇ{z \ ] " È ÈÑÇ{z3|~} ] " # òYÈ \ ô±È Ç òYÈ \ ô±È È \ èÈ \ GÒ o ò ( \ \ ô ( where ‘ ’ denotes ‘logical AND.’ Using this notation, Markov graph o if and only if ² ª ¯ p \ ] # \ \ o \ Ògp and q Æ \ ^ ÌEKå \ Ç ò5ÈÑÇeô±È ÈÑÇ form a Õ (7.89) Therefore, a Markov random field is simply a collection of FCMI’s induced by a graph. with respect to a graph We now define two types of nonempty atoms of \ o . Recall the definition of the set for a nonempty atom of in (7.73). ¢ .â e«ª¿a¼Â·¼Ó½¼Qµ¸· c® a° ¬ Õ Õ ¬ ¬ ¬ Õ Æ \ §ÌÒqå a \ For a nonempty atom of , if q , i.e., o is connected, then is a Type I atom, otherwise is aò Type IIô atom. The sets of all Type I and Type II atoms of are denoted by and , respectively. 7 á .â ù«ºµ¸Ã«ª¦ Ç c® ªõ ò ÈÑÇ RRR ô È YÈÑÇ Õ form a Markov graph if and only if o Û'Ü vanishes on all Type II atoms. Before we prove this theorem, we first state the following proposition which is the graph-theoretic analog of Theorem 7.5. The proof is trivial and is omitted. This proposition and Theorem 7.5 together establish an analogy between the structure of conditional mutual independence and the connectivity of a graph. This analogy will play a key role in proving Theorem 7.25. ´ß .â Let · and í ß ß be disjoint subsets of the vertex set of a *!åi#ð ß, where ð . Assume that graph and î be a subset of í forß å such that î in there at least two ¯ ñ . If í o åe ð· are disjoint ß mß · ,exist ¶ iß.ü à í ß then those î which are nonempty are disjoint in î . .â- In the graph in Figure 7.4, , Z , and Z are U . Then Proposition 7.26 says that , , and Z are disjoint in ¸ . disjoint in for a nonempty Proof of Theorem 7.25 Recall the definition of the set « ¬ ¦ contains precisely all the proper atom ¬ in (7.73). . 
ThusWethenotesetthat subsets of of FCMI’s specified by the graph can be written as ª ¬ « ¦ and ^ (7.90) Ãcµ'¨tµ¸÷è¼½¼<µ'· U p c® ª¬ å o Ò ÈYå aÆ a o ì á o ÌÑÌ £¤¥§¦¨ª©S« äTåwç c® o aTä Tç o äìaÈ aÈ7Xç äTåwç ò Æ ß ä aÈ Sç äìSç ä aÈ Sç aTä aÈ7XÄÈ Tç o \ ä \ È ç P o \ D R A F T q Æ \ §ÌsEKå September 13, 2001, 6:27pm D R A F T 142 A FIRST COURSE IN INFORMATION THEORY 2 4 3 1 5 7 Figure 7.4. The graph ª ¬ « ¦ 6 in Example 7.27. ^ (cf. (7.89)). By Theorem 7.19, it suffices to prove that Æ \ RTS We first prove that ² Consider ¬ß« an, atom ðß ´åVí for äand ¬ ² ô Ò p q Æ \ Æ \ Ì å +Ì ô Òp Æ \ q ô Ò \ Æ \ §Ì *ÌsEå QÌêÒ Æ \ q Æ \ q Ì *ÌEæå Ò å q Æ \ q Æ \ Æ Æ RTS " Ò QÌ \ QÌ \ ò RTS ]*^ \ Ì §ÌsEåwç Æ r Ò (7.91) *ÌEå QÌÖÕ RTS ä ô±Õ ª ¬ « ¦ and ^ (7.92) 7.13, let ß so that ßî , . By considering for °å£ . In Definition « . Therefore, , we see that ¬ « ¦ ª (7.93) (7.94) (7.95) ª ¬ « ¦ and ^ Æ iRTS Æ \ and q # \ q Æ \ ÌEå QÌÖÕ ª ¬ « ¦ and ^ ² (7.96) « ´ ª « ^ « ¦ ¬ « ¦ and ¸ Consider ¬ . Then there exists ¬ with such that ¬ . From Definition 7.13, ê÷N (7.97) ¬ ~ ç @ ÷ ç N N E B D N BED N ß ß åÙ and there exist at least two å such that where ßî !¯ ñ î . It ² follows from, è(7.97) that and the definition of ¶ ß. à ß î ß (7.98) We now prove that Æ Æ q Æ \ RTS wëÌEå q \ Æ RTS Ò |~} ^ Æ \ \ RTS Ç ß Ç Ý Ý \ Æ \ q *ÌsEå QÌ ô±Õ *ÌEDå QÌ Ü w QÌ ] ^ |~} ^ " z " ] ^ È # + p Æ \ wYÌ å q Ò Æ \ Ò \ - wëÌ \ r \ # " ] ^ w # Æ$p Æ \ wëÌ§ß ÌÖÕ ò D R A F T September 13, 2001, 6:27pm D R A F T 143 Markov Structures 1 3 \ 4 2 Figure 7.5. The graph · ß in Example 7.28. 7 p íß ß î wY \ playing the role of and playing the role of in ProposiWith tion 7.26, we see by applying the proposition that those (at least two) which are nonempty are disjoint in ~ a \ o \ ~ ß. à ß î ß (7.99) « . Therefore, we have proved (7.96), and , i.e., ¬ r ] ^ " # p E¡ are² ³ ² C´ ³ With respect to the graph £ ² ´ ³¶µ · C \ ² ´ ² ² ´ µ ³ » ³¹¸º h a \ o ¢w£ This implies q hence the theorem is proved. ¤¥y¦C§¨ª©0«¬®¯ª° , ² £ ³ ´ in Figure 7.5, the Type II atoms ± ² ³¶µ · ´ ² ³¹¸º ³ ² C´ ² ´ ³»µ£ ³¶µ · ² ´ ³¹¸º (7.100) ¸ while the other twelve nonempty atoms of ¼ are Type I atoms. The random À £ º7³ ³ º7³ º ³ ¸ · variables and form a Markov graph ± if and only if ½¿¾ À Á for all Type II atoms . 7.4 MARKOV CHAIN When the graph ± representing a Markov random field is a chain, the Markov random field becomes a Markov chain. In this section, we further explore the structure of the  -Measure for a Markov random field for this special case. Specifically, we will show that the  -Measure ½¿¾ for a Markov chain is always nonnegative, and it exhibits a very simple structure so that the information diagram can always be displayed in two dimensions. The nonnegativity of ½ ¾ facilitates the use of the information diagram because if à is seen to be a subset of ÃÅÄ in the information diagram, then ½ ¾ Ã Ä ½ ¾ wÆ Ã ½ ¾ Ã Ä sÇ Ã ½ ¾ à (7.101) These properties are not possessed by the  -Measure for a general Markov random field. D R A F T September 13, 2001, 6:27pm D R A F T 144 A FIRST COURSE IN INFORMATION THEORY 1 ... 2 Figure 7.6. The graph É È È n-1 n representing the Markov chain ÊË5ÌÊÍ,ÌhÎ/Î/ÎYÌgÊÏ . Without loss of generality, we assume that the Markov chain is represented ÑÐ ³ by£ the graph ± in Figure 7.6. This corresponds to the Markov chain Ð Ò3Ò3ÒÐ ³ ³¹Ó . We first prove the following characterization of a Type I atom for a Markov chain. 
Ôs«ª§Õ§D¦i¬®¯ªÖ For in Figure 7.6, À the Markov chain represented by the graph ± Ó a nonempty atom of ¼ is a Type I atom if and only if × Fà à Þ ÓØ*ÙCÚ ÜÛÝ º Ý Æn º Ò3Ò3Ò º7޿ߺ (7.102) À àâá Ý , i.e., the indices of the set variables in where complemented are consecutive. which are not À Proof It is easy to see that for a nonempty atom ,À if (7.102) is satisfied, then Ø*ÙCÚ ÙÚ Ó ± is connected, i.e., ã . Therefore, is a Type I atom of ¼ . Ø*ÙCÚ On the other hand, if (7.102) is not satisfied, then ± is not connected, i.e., À ÙÚ ä Ó ã , or is a Type II atom of ¼ . The lemma is proved. We now show how the information diagram for a Markov chain with any áåÇæ length can be constructed in two dimensions. Since ½¿¾ vanishes on Ó all the Type II atoms of ¼ , it is not necessary to display these atoms in the information diagram. In constructing the information diagram, the regions rep £ º Ò3Ò3Ò º7³¹Ó ³ ³ should overlap with each other resenting the random variables , such that the regions corresponding to all the Type II atoms are empty, while the regions corresponding to all the Type I atoms are nonempty. Figure 7.7 shows such a construction. Note that this information diagram includes Figures 6.7 and 6.9 as special cases, which are information diagrams for Markov chains with lengths 3 and 4, respectively. ç X1 ç ç Xè 2 ç Xé n-1 Xé n ... Figure 7.7. The information diagram for the Markov chain Ê D R A F T Ë ÌÊ Í ÌkÎ/Î7ÎêÌÊ September 13, 2001, 6:27pm Ï . D R A F T 145 Markov Structures We have already shown that ½ ¾ is nonnegative for a Markov chain with áëÇiæ length 3 or 4. Toward suffices À ìÇ proving that this is trueÀ for any length À , it Ó Á Á to show that ½ ¾ for all Type I atoms of ¼ because ½ ¾ for Ó ¼ . We have seen in Lemma 7.29 that for a Type I atom all Type II atoms A of À À Ó Ù Ú of ¼ , has the form as prescribed in (7.102). Consider any such atom . ² ² ² Then an inspection of the information diagram in² Figure 7.7 reveals that À ¾ ½ Â Ç ´ ³Ñí ¾ ½ ´ ï5ô ³ ³Ñí$ó7³ Ò3Ò3Ò ´ ³Ñíî ³¹ï ³ñð*ò (7.103) (7.104) (7.105) ð ò Á This shows that ½¿¾ is always nonnegative. However, since Figure 7.7 involves an indefinite number of random variables, we give a formal proof of this result in the following theorem. õ@öy«÷,øw«ª§ù¬®úû For a Markov chain Ð ³ ÐüÒ3Ò3Ò(Ð £ ³ ³ Ó , ½¿¾ is nonneg- ative. À Ó Á ½ ¾ Proof for all Type II atoms A of ¼ , it suffices to show that À ýSince À Ç Ó Á for allÀ Type Ó I atoms of . We have seen in Lemma 7.29 that ½¿¾ ¼ ÙCÚ for a Type I atom of ¼ , has the form as prescribed in (7.102). Consider À any such atom and define the set þ Æn º Ò3Ò3Ò º7Þ ÜÛÝ f ß (7.106) Then  ³Ñí ´ ² ² ï ³ ð ò ³ ² ½ ² ´ ³ ² ´ ³ ² ² ³¹ï ² ³Ñí ¾ ² ² ´ ³Ñí ¾ ½ ² ¾ ½ ô ³ñ ² ðò ³Ñíÿó7³¹ï ³ñð ò ² ´ ² ³¹ï (7.107) ³ñð ò (7.108) ² (7.109) þ , namely In the´ above ´ summation, except for the atom corresponding to Ò3Ò3Ò ´ ³Ñí ³Ñí î ³¹ï ³ñð*ò atoms are Type II² atoms. Therefore, ² , all the ² ²  ô ³ñð*ò ³Ñí&ó7³¹ï ¾ ½ Hence, ³Ñí ² À ½ ¾ ½ Â Ç D R A F T Á ¾ ³Ñí ³Ñí$ó7³ ´ ³Ñí î ´ Ò3Ò3Ò ´ ² ´ ³Ñíî ï5ô ³ ³¹ï ² ð ò ´ Ò3Ò3Ò ´ ³¹ï ³ñðò (7.110) ² ³ñð*ò September 13, 2001, 6:27pm (7.111) (7.112) (7.113) D R A F T 146 A FIRST COURSE IN INFORMATION THEORY X Y . Z T . . . . .. . .. U .. . . .. !#"%$'&()!#"+*,& . Figure 7.8. The atoms of The theorem is proved. á We end this section by giving an application of this information diagram for .- . 
0/214365 87:9 <; => 09 709 @?A=CB ¤¥y¦C§¨ª©0«¬®ú y© ¨ª©*« Ñ« 5ø <¨ ÷ In this example, we prove ³ º4DsºFE with the help of an information diagram that for five random variables , Ð Ð Ð Ð ¢ ¢ ³ D E Ù Ù , and such that forms a Markov chain, G D E 4D @ó7³»º  º ¢ º Ù 4D ¿Æ ³ ¿Æ ó ¢ º  ³ G 4D FE G D E yÆ º Ù º sº ¢ ô º ¢ G E ¢ yÆ ô (7.114) Ù In the information for , and 7.8, we first identify G D diagram G ¢ì inbyFigure and then the atoms of marking of them by the atoms of G D and G ¢ , it receiveseach a dot. If an atom belongs to both two dots. The resulting diagram represents G D wÆ G By repeating the same procedure for  E 4D @ó7³»º º ¢ º Ù ¿Æ  ³ 4D º ó ¢ º Ù yÆ ¢ (7.115) G D E ô yÆ G ¢ ô E º (7.116) we obtain the information diagram in Figure 7.9. Comparing these two information diagrams, we find that they are identical. Hence, the information Ð ³ D Ð identity in (7.114) always holds conditioning on the Markov chain Ð Ð ¢ E Ù . This identity is critical in proving an outer bound on the achievable coding rate region of the multiple descriptions problem in Fu et al. [73]. It is virtually impossible to discover this identity without the help of an information diagram! D R A F T September 13, 2001, 6:27pm D R A F T 147 Markov Structures X Y . Z T . . . .. Figure 7.9. The atoms of . .. . U . . .. .. HI"KJML ONP$QNR*ANSM&(THI" ON$L*:NUSM&V(T!#"%$XW J@&(Y!#"+*'W JA& . &Ê Ê PROBLEMS 1. Prove Proposition 7.14 and Proposition 7.15. 2. In Example 7.22, it was shown that Z\[ implies Z^] . Show that Z^] does not Ø imply Z [ . Hint: Use an information diagram to determine _F`baZ [c _F`daZ ]ec . 3. Alternative definition of the global Markov property: any partition f ] and f [ For Ù ºFf Û ]Ø*Ù ºFf [ ß of f such that the sets of³Yvertices are disconnected g and ³Yg are independent , the sets of random variables condiin ± Ë Í ð ³ tioning on . Show that this definition is equivalent to the global Markov property in Definition 7.23. ji , Tk and gMlm:nl k are indepeno k is the set of neighbors2 of vertex i in 4. The local Markov property: For h ³Ym0n dent conditioning on , where ± . à sàfá ³ ³ a) Show that the global Markov property implies the local Markov property. b) Show that the local Markov property does not imply the global Markov property by giving a counterexample. Hint: Consider a joint distribution which is not strictly positive. 5. Construct a Markov random field whose  -Measure values. Hint: Consider a Markov “star.” ], [, ]qpra [ 6. a) Show that ³ ³ ³ ³ º7³ · ³ º7³ · ¸ , and c º,³ ³ ¸ ½ ¾ can take negative are mutually independent if and only if [Xpra ³ · º7³ ¸ c ] ô³ º,³ · p ³ ¸ ô a ] ³ º7³ [ cFs Hint: Use an information diagram. 2 Vertices k t k t and in an undirected graph are neighbors if and are connected by an edge. D R A F T September 13, 2001, 6:27pm D R A F T 148 A FIRST COURSE IN INFORMATION THEORY b) Generalize the result in a) to á random variables. [ 7. Determine the Markov random field with four random variables ] ³ º ³ ¸ · and which is characterized by the following conditional independencies: ³ º7³ º7³ ³ ¸ ô³ ³ · º7³ ¸ · ô ³ º7³ ô ³ º7³ ³¹¸ º7³ ³ º a ] [ )u c u p [ pva c a ] c c a [ )u Fc s ]qpva ³ ³ º7³ · What are the other conditional independencies pertaining to this Markov random field? HISTORICAL NOTES A Markov random field can be regarded as a generalization of a discretetime Markov chain. Historically, the study of Markov random field stems from statistical physics. 
The classical Ising model, which is defined on a rectangular lattice, was used to explain certain empirically observed facts about ferromagnetic materials. The foundation of the theory of Markov random fields can be found in Preston [157] or Spitzer [188]. The structure of the  -Measure for a Markov chain was first investigated in the unpublished work of Kawabata [107]. Essentially the same result was independently obtained by Yeung eleven years later in the context of the  -Measure, and the result was eventually published in Kawabata and Yeung [108]. Full conditional independencies were shown to be axiomatizable by Malvestuto [128]. The results in this chapter are due to Yeung et al. [218], where they obtained a set-theoretic characterization of full conditional independencies and investigated the structure of the  -Measure for a Markov random field. These results have been further applied by Ge and Ye [78] to characterize certain graphical models. D R A F T September 13, 2001, 6:27pm D R A F T Chapter 8 CHANNEL CAPACITY In all practical communication systems, when a signal is transmitted from one point to another point, the signal is inevitably contaminated by random noise, i.e., the signal received is correlated with but possibly different from the signal transmitted. We use a noisy channel to model such a situation. A noisy channel is a “system” which has one input and one output1 , with the input connected to the transmission point and the output connected to the receiving point. The signal is transmitted at the input and received at the output of the channel. When the signal is transmitted through the channel, it is distorted in a random way which depends on the channel characteristics. As such, the signal received may be different from the signal transmitted. In communication engineering, we are interested in conveying messages reliably through a noisy channel at the maximum possible rate. We first look at a very simple channel called the binary symmetric channel (BSC), which is represented by the transition ³ D diagram in Figure 8.1.Á º Inß this channel, both the input and the output take values in the set Û h . There is a certain probability, denoted by w , that the output is not equal to the input. That is, if the input is 0, then the output is 0 with probability h xyw , and is 1 with probability w . Likewise, if the input is 1, then the output is 1 with probability hxyw , and is 0 with probability w . The parameter w is called the crossover probability of the BSC. À º ß Let Û be the message set which contains two possible messages to à à Á w{z Á s - . We further À assume that the be conveyed through a BSC with À two messages and à are equally likely. If the message is , we map it to the codeword 0, and if the message is à , we map it to the codeword 1. This 1 The discussion on noisy channels here is confined to point-to-point channels. D R A F T September 13, 2001, 6:27pm D R A F T 150 A FIRST COURSE IN INFORMATION THEORY 1 0 0 X Y 1 1 1 Figure 8.1. A binary symmetric channel. is the simplest example of a channel code. The codeword is then transmitted through the channel. Our task is to decode the message based on the output of the channel, and an error is said to occur if the message is decoded incorrectly. 
Consider |~} |} À Û ô Dv Á |} Dr |~} |~} Dr Dv |} s - ahXxyw c Dv As ß Û D2 Since wz s- , (8.1) ß (8.2) Á ß |~} Dv h ß Û O hxyw ~| } |} Dv O hx Dr ~| } |~} D r Dv z ô ô Á ô À ô Û Á sÁ (8.4) ß ß Á Û Ã Dv (8.3) ß Á ß Á À Á Á Á Û Û Ã ô³ Á Û Û by symmetry2 , it follows that |~} and Á ß Á ß Û |~} ô Á ³ Û Since ³ Û À ß ô Û (8.5) O w s Á ß Á ß (8.6) s (8.7) Therefore, in order to minimize the probability of error, we decode a received À 0 to the message . By symmetry, we decode a received 1 to the message à . 2 More explicitly, VPY V U P V AU VY# A e u V e u d e ue lX e u ] e u ] Ú Ú î î î D R A F T September 13, 2001, 6:27pm D R A F T 151 Channel Capacity An error occurs ifÀ a 0 is received and the message is à , or if a 1 is received and the message is . Therefore, the probability of error, denoted by , , is |} |~} |} |~} given by Û Á w º s- w Dr Æ Á Á ß ô Û Ã s- w Dv Æ ß Á |~} (8.6) because|} where (8.9) follows from À Û ô Dv h O ß Dv h Û ô Û Ã D2 ß À Û ô Dv h w Á ß ß (8.8) (8.9) (8.10) (8.11) by symmetry. Á . Then the above scheme obviously does not Let us assume that w. provide perfectly reliable communication. If we are allowed to use the channel only once, then this is already the best we can do. However, if we are allowed to use the same channel repeatedly, then we can improve the reliability by generalizing the above scheme. We now consider the following channel code which we refer to as the binary á Ç h be an odd positive integer which is called the block repetition code. Let À length of the code. In this code, the message is mapped to the sequence of á á 0’s, and the message à is mapped to the sequence of 1’s. The codeword, á á which consists á of a sequence of either 0’s or 1’s, á is transmitted through the channel in uses. Upon receiving a sequence of bits at the output of the channel, we use the majority vote to decode the message, i.e., if there are more 0’s than 1’s in the sequence, we decode the sequence to the message À , otherwise we decode the sequence to the message à . Note that á theh block , this length is chosen to be odd so that there cannot be a tie. When scheme reduces to the previous scheme. general scheme, we continue to denote the probability of error For this more by . Let o and o)] be the number of 0’s and 1’s in the received sequence, respectively. Clearly, (8.12) o Æ o ] á s á À For large , if the message is , the number of 0’s received is approximately equal to ô À¡ á o ahxyw c (8.13) and the number of 1’s received is approximately equal to o)] ô X¡ À w á (8.14) numbers. with high probability by the weak law of large This implies that the z¢o)] ß , is small because probability of an error, namely the event Û o á D R A F T ah x£w c äá w September 13, 2001, 6:27pm (8.15) D R A F T 152 A FIRST COURSE IN INFORMATION THEORY with the assumption |} that wz Û error À ô ß s - . |Specifically, } |} o z¤o ] |} x¥oY] z¤o)] |} o)] s o)] aw §¦ c Á à ä z ¦ z Á ¦ so that ä Û ô À Æ s - xyw Á á Á Û where ß á Û à À ô Û ô À (8.16) (8.17) (8.18) (8.19) ß ß á À ô ߺ º (8.20) w j¦ z s - s (8.21) ¦ Note |~} that such a exists because w¨z s - . Then by the © weak law of large numbers, the upper bound in (8.19) . By symmetry, ª© tends to 0 as error also tends . Therefore, |~} to 0|~as} |} |~} error error (8.22) © tends to 0 as . In other words, by using a long enough repetition code, we can make arbitrarily small. In this sense, we say that reliable is positive and Æ Á Á Ð á Û ô à á ß À Û á ß Û Ð ô À ß Æ Û Ã ß Û ô Ã ß Ð communication is achieved asymptotically. 
ä Á , for any given transmitted sequence We point out that for a BSC with w á á of length , the probability of receiving any given sequence of length is nonzero. It follows that for any two distinct input sequences, there is always a nonzero probability that the same output sequence is produced so that the two input sequences become indistinguishable. Therefore, except for very special Á ), no matter channels (e.g., the BSC with w how the encoding/decoding scheme is devised, a nonzero probability of error is inevitable, and asymptotically reliable communication is the best we can hope for. Though a rather naive approach, asymptotically reliable communication can be achieved by using the repetition code. The repetition code, however, is not without catch. For a channel code, the rate of the code in bit(s) per use, is defined as the ratio of the logarithm of the size of the message set in the base 2 to the block length of the code. Roughly speaking, the rate of a channel code is the average number of bits the channel code attempts to convey through the channel in one use of the channel. For a binary repetion Ð code © with block á á Ó ] , which ] tends to 0 as . Thus in length , the rate is ÓO«%¬I'® order to achieve asymptotic reliability by using the repetition code, we cannot communicate through the noisy channel at any positive rate! In this chapter, we characterize the maximum rate at which information can be communicated through a discrete memoryless channel (DMC) with an arbitrarily small probability of error. This maximum rate, which is generally D R A F T September 13, 2001, 6:27pm D R A F T 153 Channel Capacity positive, is known as the channel capacity. Then we discuss the use of feedback in communicating through a channel, and show that feedback does not increase the capacity. At the end of the chapter, we discuss transmitting an information source through a DMC, and we show that asymptotic optimality can be achieved by separating source coding and channel coding. 8.1 DISCRETE MEMORYLESS CHANNELS ; °¯V9±?:9K7:9 @? / ¶ Let ² and ³ be discrete alphabets, and ´,aRµ c be a tranô¶ c M ´ R a µ sition matrix from ² to ³ . A discrete channel is a single-input single³ taking values in ² and output output system with input random variable D |~} that random variable |} taking values in ³ such Ñ« ÷ ô °w® · ¶ 4Dv µ · ¶ ´MaRµ ¶ c (8.23) ¶ |} for all a µ cq¸ ²º¹»³ . |} we see that if ·|~ } ¶ , then Remark From (8.23), ~ | } ¶ 4 v D Dv ¼ ¶ s ´,|aRµ } ¶ c · ¶ µ (8.24) |} µ D2 µ · ¶ is not defined if · ¶ ½ . Nevertheless, However, ³ Û º ß ³ Û ô ß º Û Û ô ³ º Û ô³ Û ³ ³ ä ß Á ß ô³ Û ß ß Û ³ ß ß Á (8.23) is valid for both cases. ; °¯V9±?:9K7:9 @? Ñ« ÷ °w®¯ A discrete memoryless channel is a sequence of replicates of a generici discreteichannel. Thesei discrete channels are indexed by a discreteÇ , where , with the th channel being available for transmission time index h i at time . Transmission through a channel is assumed to be instantaneous. The DMC is the simplest nontrivial model ô for ¶ a communication channel. For simplicity, a DMC will be specified by ´,aRµ c , the transition matrix of the generic³ discreteD channel. i k k Let iÇ and be respectively the input and the output of a DMC at time , where h . Figure 8.2 is an illustration of a discrete memoryless channel. In the i figure, the attribute of³ the channel manifests itself by that for i memoryless k as its input. 
channel only has all , the th i discrete Ç |} For each h |, } i Û ³ k ¶ 4D k µ O º ß Û ³ k ¶ ´MaRµ ¶ c ß ô (8.25) since the th discrete channel is a replicate of the generic discrete D k ß channel Û ´,aRµ ô ¶ c . However, the dependency of the output random variables on the ³ k ß cannot be further described without specifying input random variables Û how the DMC is being used. In this chapter, we will study two representative models. In the first model, the channel is used without feedback. This will be D R A F T September 13, 2001, 6:27pm D R A F T 154 A FIRST COURSE IN INFORMATION THEORY 1 Á p  ( ¿ y À xà ) Y1 ¾X 2 Á p  ( ¿ y À xà ) Y2 ¾X 3 Á p  ( ¿ y À xà ) Y3 ... ¾X Figure 8.2. A discrete memoryless channel. discussed in Section 8.2 through Section 8.5. In the second model, the channel is used with complete feedback. This will be discussed in Section 8.6. To keep our discussion simple, we will assume that the alphabets ² and ³ are finite. ; °¯V9±?:9K7:9 @? Ñ« ÷ °w®ú The capacity of a discrete memoryless channel Ä ÅYÆÈÇ 4D c É ËÊ Qa defined as D ³ ³ó  ´,aRµ ¶ c º ô is (8.26) where and are respectively the input and the output of the generic discrete ¶ channel, and the maximum is taken over all input distributions ´,a c . Ä From the above definition, we see that Ç Á Qa 4D c (8.27) because Ç ³ó Á (8.28) ¶ for all input distributions ´,a c . By Theorem 2.43, we have Ä Å)ÆÈÇ 4D c Å)ËÆÈÊÇ G a c «%¬I ² s (8.29) É ÌÊ a É Ä «%¬I Likewise, we have ³ s (8.30) Ä Å)ÍÏÎ «%¬I Therefore, (8.31) a ² «%¬I ³ cFs ¶ ¶ 4 D Ñ c is a continuous functional of ´,a c and the set of all ´,a c 4isD a Since a c compact set (i.e., closed and bounded) in Ð , the maximum value of a Â à ³ëó  à à  ô ³ ô ô ô ô ô º ô ô ³ëó  ³ëó can be attained3 . This justifies taking the maximum rather than the supremum in the definition of channel capacity in (8.26). Ñ 3 The assumption that D R A F T is finite is essential in this argument. September 13, 2001, 6:27pm D R A F T 155 Channel Capacity ÒZ X Y Figure 8.3. An alternative representation for a binary symmetric channel. Ä We will prove subsequently that is in fact the maximum rate at which information can be communicated reliably through a DMC. We first give some examples of DMC’s for which the capacities can be obtained in closed form. ³ D In the following, and denote respectively the input and the output of the generic discrete channel, and all logarithms are in the base 2. ¤¥y¦C§¨ª©0« ÌÓÔ1Õ9±? Ö¼×ØÖ °w® w¦,ø Ù7 09>ÛÚ @?:? VB C§Õ§« yø [ö¦ y«© The binary symmetric channel (BSC) has been shown in Figure 8.1. Alternatively, a BSC can be E represented by Figure 8.3. Here, is a binary random variable representing |~} |~} the noise of the channel, with and E Û E O hxyw Á ß is independent of ³ . Then Dv i³ E Æ E h w ß Û and mod ® º (8.32) s (8.33) 4D c In order to determine the capacity of the BSC, we first bound a  follows. a 4D c  G aD c x G aD G a D c x ´,a ¶ Ê G a D c x ´,a ¶ Ê G a D c x ÜØÝ aw c hx Ü°Ý aw c ô³ ³ó à c c G a D · ¶ c ô³ cÜ°Ý aw c º ³ó as (8.34) (8.35) (8.36) (8.37) (8.38) where we have used ÜØÝ to denote the binary entropy function in the base 2. In G D h , i.e., the output order to achieve this upper bound, we have to make a c ¶ c be the , ´ a distribution of the BSC is º uniform. This can be done by letting ß ³ó4D Á c can be uniform distribution on Û h . Therefore, the upper bound on Âa achieved, and we conclude that Ä hx ÜØÝ aw c D R A F T bit per use. September 13, 2001, 6:27pm (8.39) D R A F T 156 A FIRST COURSE IN INFORMATION THEORY C 1 0 0.5 1 Figure 8.4. 
The capacity of a binary symmetric channel. Ä Ä the capacity versus the crossover probability Figure 8.4 is a plot of Á or w w . Weh , w see from the plot that attains the maximum value 1 when Ä the minimum value 0 when w Á s - . When w Á , it is easy to see and attains that h is the maximum rate at which information can be communicated through the channel reliably. This can be achieved simply by transmitting unencoded bits through the channel, and no is necessary because all the decoding h , the same can be achieved with the bits are received unchanged. When w additional decoding step which complements all the received bits. By doing so, the bits transmitted through the channel can be recovered without error. Thus from the communication point of view, for binary channels, a channel which never makes error and a channel which always makes errors are equally good. Á - , the channel output is independent of the channel input. Theres When w fore, no information can possibly be communicated through the channel. ¤¥y¦C§¨ª©0« ÏÞß1Õ9±? Ö °w® w¦,ø »¤ ,=5 Ú @?A? VB ø5¦ yøw« ö¦ y«ª© The binary erasure channel ß Á º Û is shown in Figure 8.5. In this channel, the input alphabet is , while h ºà ß à Á º the output alphabet is Û h . With probability á , the erasure symbol is produced at the output, which means that the input bit is lost; otherwise the input bit is reproduced at the output without error. The parameter á is called the erasure probability. To determine the capacity of this channel, we first consider Ä D R A F T Å)ÌÆÈÊÇ É Å)ÌÆÈÊÇ É Å)ÌÆÈÊÇ É a 4D c a G a D c x G a D cc G a D c x Ü°Ý a±á cFs  ³ëó ô³ September 13, 2001, 6:27pm (8.40) (8.41) (8.42) D R A F T 157 Channel Capacity â0 â0 1 ãe X 1 Y 1 1 Figure 8.5. A binary erasure channel. G a D c . To this end, let · O.ä Thus we only have to maximize|~} Û ³ Á ß and define a binary random variable Ûå The random variable D function of . Then Á if if h D . à D2. à. G aD c G a c G aD c Ü°Ý a±á c ahxdá cÜ°Ý a ä Fc s sº Æ Ä by (8.44) indicates whether an erasure has occured, and it is a G aD c Hence, (8.43) ô Æ (8.45) (8.46) (8.47) Å)ËÆÈÊÇ G a D c x ØÜ Ý a±á c (8.48) É ¡ Å)ÆÈÇ Ü°Ý a±á c ah xdá cÜ°Ý a ä c x Ü°Ý a±á c (8.49) æ ahxdá c Å)ÆÈÇ Ü°Ý a ä c (8.50) æ ahxdá c (8.51) ä¤ s - , i.e., the input distribution is where capacity is achieved by letting Æ º Á uniform. It is in general not possible to obtain the capacity of a DMC in closed form, and we have to resort to numerical computation. In Chapter 10, we will discuss the Blahut-Arimoto algorithm for computing channel capacity. D R A F T September 13, 2001, 6:27pm D R A F T 158 8.2 A FIRST COURSE IN INFORMATION THEORY THE CHANNEL CODING THEOREM We will justify the definition of the capacity of a DMC by the proving the channel coding theorem. This theorem, which consists of two parts, will be formally stated at the end of the section. The direct part of the theorem asserts that information can be communicated through a DMC with an arbitrarily small probability of error at any rate less than the channel capacity. Here it is assumed that the decoder knows the transition matrix of the DMC. The converse part of the theorem asserts that if information is communicated through a DMC at a rate higher than the capacity, then the probability of error is bounded away from zero. For better appreciation of the definition of channel capacity, we will first prove the converse part in Section 8.3 and then prove the direct part in Section 8.4. ; °¯V9±?:9K7:9 @? 
Ñ« ÷ Ïç °w® put alphabet ² 4è c code for a discrete memoryless channel with inAn a and output alphabet ³ is defined by an encoding function á º é¨ê h ® º Û and a decoding function h® ® ah c a c Û The é set º é º º º Ò3Ò3Ò Ò3Ò3Ò º ë ê³ Ó º 4è Ò3Ò3Ò º ¡ß h ® Ð º Û º Ð Ó ² Ò3Ò3Ò 4è (8.52) º ¡ß s (8.53) 4è by ì , is called the message set. The sequences é a è c , indenoted ² are called codewords, and the set of codewords º ¡ß Ó is called the codebook. In order to distinguish a channel code as defined above from a channel code with feedback which will be discussed in Section 8.6, we will refer to the former as a channel code without feedback. þ is randomly chosen from the message set ì We assume that a message according to the uniform distribution. Therefore, G a c «%¬I è s ¶ With respect to a channel code for a DMC ´MaRµ c , we let íÔ a ] [ c î D 4D and a ] [ 4D c þ (8.54) ô ³ º7³ º º º Ò3Ò3Ò Ò3Ò3Ò º º7³ Ó Ó (8.55) (8.56) be the input sequence and the output sequence of the channel, respectively. Evidently, íï é a þ cFs (8.57) We also let D R A F T ðñ ë î a Fc s þ September 13, 2001, 6:27pm (8.58) D R A F T 159 Channel Capacity W õX Encoder ò óô Y Channel p(y|x) Message W Decoder Estimate of message ö Figure 8.6. A channel code with block length . Figure 8.6 is the block diagram for a channel code. j÷ è , let |} ; °¯V9±?:9K7:|~9 }@? For all h î í é ÷ ð ÷ ú ÷ O øù ù a c û È ü þý ÿ û ÷ be the conditional probability of error given that the message is . Ñ« ÷ à °w®¬ ô þ þ Û à ß ô Û ß (8.59) Ï We now define two performance measures for a channel code. ; °¯V9±?:9K7:9 @? Ñ« ÷ °w®° á ø æ Ê ÅYù ÆÈÇ °ø ù s fined as ; °¯V9±?:9K7:9 @? Ñ« a 4è c The maximal probability of error of an ÷ a 4è c á The average probability of error of an |~} °w®Ö @ fined as ð þ Û º º code is de(8.60) code is de- s gþkß (8.61) |~} of From the definition , we have ð ~| } |~} ð ú ÷ ñ ÷ ù |} ð ÷ èh ÷ ù èh ø ù ù i.e., @ is the arithmetic mean of ø ù , h j÷ è . It then follows that @ ø æ Ê s Û þ Û gþkß þ ß Û ô þ gþ ô þ þ Û þ ß ß º à Ñ« ÷ / °w® û per use. D R A F T The rate of an a 4è c á º (8.63) (8.64) (8.65) Üà à ; °¯V9±?:9K7:9 @? (8.62) channel code is á September 13, 2001, 6:27pm l ] «%¬I è (8.66) in bits D R A F T 160 A FIRST COURSE IN INFORMATION THEORY ; °¯V9±?:9K7:9 @? /°/ A rate is asymptotically achievable for a discrete mem¶ , there exists for sufficiently large an oryless channel ´,aRµ c if for any w 4 è c code such that a h «%¬I è xyw (8.67) Ñ« ÷ °w® ä ô á á Á º ä á ø æ Ê ¤ z ws and (8.68) For brevity, an asymptotically achievable rate will be referred to as an achievable rate. In other words, a rate is achievable if there exists a sequence of codes whose rates approach and whose probabilities of error approach zero. We end this section by stating the channel coding theorem, which gives a characterization of all achievable rates. This theorem will be proved in the next two sections. / y1Ú @?A? bÚ @9±? ¶ for a discrete memoryless channel ´,aRµ c õ@öy«÷,øw«ª§H°w® *¯ ö¦ y«ª© @÷ ô OB Ä is achievable A rate à if and only if , the capacity of õ@ö¿«÷¿øw«§ the channel. 8.3 THE CONVERSE random Let us consider a channel code with block length ð variables á þ ³Tk DØk for h à i.à The þ involved in this code are , and , and . We see from the definition of a channel code in Definition 8.6 that all the random variables are generated sequentially according to some deterministic or probabilistic rules. Specifically, the ðrandom variables are generated in the order Ò3Ò3Ò þ º7³ ] º4D ] º7³ [ º4D [ º º7³ Ó º4D Ó º þ . 
The generation of these random variables can be represented by the dependency graph4 in Figure 8.7. In this graph,³ a node represents a random variable. If there is a (directed) edge from node D ³ D to node , then node is called a parent of node . We further distinguish a solid edge and a dotted edge: a solid edge represents functional (deterministic) dependency, while a dotted edge represents the probabilistic dependency ô¶ c of the generic discrete channel. For a M ´ R a µ induced by the transition matrix ³ node , its parent nodes represent all the random variables on which random ³ variable depends when it is generated. We will use to denote the joint ¶k distributioni of these random variables as well as all the marginals, and let denote the th component of a sequence . From the dependency graph, we 4A á dependency graph can be regarded as a Bayesian network [151]. D R A F T September 13, 2001, 6:27pm D R A F T 161 Channel Capacity X W Y p y ( | x) X1 Y1 X2 Y2 X3 Y3 Xn Yn W Figure 8.7. The dependency graph for a channel code without feedback. a÷ ÷ a Ó Ó c¸ ì ¹¨² ¹»³ such that a c , c a ÷ c k a ¶ k ÷ c k ´,aRµ k ¶ k c s (8.69) ] ] ÷ ÷ ¶ k ÷ c are well-defined, and a ¶ k ÷ c Note that a c for all so that Ùa í î set of nodes ] [ are deterministic. Denote the by and the set D 4 D D , by . We notice the following structure in the depenof nodes ] [ í dency î graph: all the edges from end in í , and all theî edges from end 4í , and form the Markov in . This suggests that the random variables î chain í s (8.70) The validity of this Markov chain÷ canbe formally justified by applying Propo c¸ ì¹^² ¹³ such that Ùa c , sition 2.9 to (8.69). Then for all a º see that for all º ä Ó º ä Ó ô º ô ô Á ô ³ º º Ò3Ò3Ò Á º7³ Ò3Ò3Ò º º7³ Ó Ó þ þ Ð þ Ð º a ÷ we can write º Ó º Ó c a ÷ c a ÷ c a Fc s º º ô ô ä Á (8.71) By identifying on the right hand sides of (8.69) and (8.71) the terms which depend only on and , we see that Ó a c k ,´ aRµ k ¶ k c ] ô where ô º (8.72) is the normalization constant which is readily seen to be 1. Hence, Ó a D R A F T ô c k ,´ aRµ k ¶ k Fc s ] ô September 13, 2001, 6:27pm (8.73) D R A F T 162 A FIRST COURSE IN INFORMATION THEORY W X 0 0 0 Y W 0 0 0 0 0 Figure 8.8. The information diagram for Ì ÌfÌ . The Markov chain in (8.70) and the relation in (8.73) are apparent from the setup of the problem, and the above justification may seem superfluous. However, the methodology developed here is necessary to handle the more delicate situation which arises when the channel is used with feedback. This will be discussed in Section 8.6. Consider aî channel ð code whose probability of error is arbitrarily small. þ º4í¶º þ , and form the Markov chain in (8.70), the information í diaSince gram for these four random variables is asî shown in Figure 8.8. Moreover, is ð þ þ a function of , and is a function of . These two relations are equivalent to G a í ôþ c Á º (8.74) G a ð î c and ô þ Á º ð (8.75) þ þ respectively. Since the probability of error is arbitrarily small, and are essentially identical. 
ð To gain insight into the problem, we assume for the time þ þ being that and are equivalent, so that G a ð þ c G a ô þ ô þ þ ð c Á s (8.76) Since the  -Measure ½¿¾ for a Markov chain is nonnegative, the constraints in (8.74) to (8.76) imply that ½ ¾ vanishes on all the atoms in Figure 8.8 marked with a ‘0.’ Immediately, we see that G a c Qa í î cFs þ ¶ó  (8.77) That is, the amount of information conveyed through the channel is essentially the mutual information between the input sequence and the output sequence of the channel. For a single transmission, we see from the definition of channel capacity that the mutual information between theࢠinput the output cannot exceed i àá and the capacity of the channel, i.e., for all h , a k 4D k c  D R A F T ³ ó à Ä s September 13, 2001, 6:27pm (8.78) D R A F T 163 Channel Capacity i Summing á from 1 to , we have Ä Ó k a k 4D k c ] ³  àá ó s (8.79) Upon establishing in the next lemma that î í c Qa Ó k Qa k 4D k c ] à »ó  ³ ó  º (8.80) the converse of the channel coding theorem then follows from á h «%¬I è hG a c h Qa í î c h Qa k 4D k c k Ä ] s á à / (8.81) ¶ó  (8.82) Ó á à Ôs«ª§Õ§D¦ þ á ³  ó (8.83) (8.84) °w® *ú For a discrete memoryless channel used with a channel code á Ç without feedback, for any h, î Qa í c  »ó wherei time . ³ k and Dk Proof For any a holds. Therefore, Ó à k Qa k 4D k c ] ³ Â ó º (8.85) are respectively the input and the output of the channel at º c¸ ² Ó ¹¨³ Ó , if a º c ä Á , then a c ä Á î í Dk k c k ,´ a c a ] c in the support of a c . Then holds for all a î x «%¬I a í c x «Ï¬I k ´,a Dk Tk c x k %« ¬I M´ a DØk Tk c ] ] ô ô³ º ô Ó ô³ ô³ G a î í c G a D k k cFs k ] º (8.87) Ó ô D R A F T (8.86) º Ó or and (8.73) Ó ô³ September 13, 2001, 6:27pm (8.88) D R A F T 164 A FIRST COURSE IN INFORMATION THEORY Hence, a í  ¶ó î G aî c x G aî í c k G a D k c x k G a D k k c ] ] k Qa Tk 4Dk cFs ] c ô Ó à (8.89) Ó ô³ (8.90) Ó Â ³ ;ó (8.91) The lemma is proved. We now formally prove the converse of the channel coding theorem. Let ä á Á be an achievable rate, i.e., for any , there exists for sufficiently large an w a á º4è c code such that á h «%¬I è ä xyw (8.92) ø æ Ê ¤ z ws and Consider «Ï¬I è æ G a c G a ð c ð Ý G c a ð G a c ! G a ð c þ þ ô þ Æ þ ô þ Æ þ ô þ Æ þ ô þ Æ à µ à (8.93) à Â Â Ó (8.94) (8.95) »ó  á ð c Qa î Qa í c k Qa k 4D k c ] Ä þkó þ ³ (8.96) ó (8.97) º (8.98) where a) follows from (8.54); b) follows from the data processing theorem since þ Ð í î Ð Ð þ ð ; c) follows from Lemma 8.13; d) follows from (8.79). From (8.61) and Fano’s inequality (cf. Corollary 2.48), we have G a D R A F T þ ô þ ð c z h Æ %« ¬I è s September 13, 2001, 6:27pm (8.99) D R A F T 165 Channel Capacity Therefore, from (8.98), «%¬I è z h à z Ä ,ø" %« ¬IÊ è è Ä æ «%¬I Ä w «Ï¬I è Æ h Æ Æ h á Æ Æ Æ (8.100) (8.101) (8.102) á á º where we have used (8.66) and (8.93), respectively to obtain the last two iná equalities. Dividing by and rearranging the terms, we have Ä h «%¬I è z ] h xyw Æ Ó á and from (8.92), we obtain ] (8.103) Ä Æ Ó º xywz h xyw s For any Ъ © á w ä Á (8.104) á , the Ð above inequality holds for all sufficiently large . Letting Á and then w , we conclude that à Ä s (8.105) This completes the proof for the converse of the channel coding theorem. From the above proof, we can obtainÄ an asymptotic bound on when the ] è is greater than . Consider (8.100) and obtain rate of the code Ó «%¬I Ä hx h %« ¬I è @ Æëá Ç Ä h x ] è s ] «%¬I Ó Æ (8.106) Ó Ä Ä hx ] «%¬I è # hx ] «%¬I è (8.107) Ä when is large. 
This asymptotic bound on , which is strictly positive if ½] «%¬I è , is illustrated in Figure 8.9. ] è in (8.106) implies that for all if «%¬I %$ bound Ä In fact, the lower for some , then for all h , by concatenating because if copies of the code, we ' & ($ obtain a code with the same rate and block length equal to such that , which is a contradiction to our conclusion that when is large. Therefore, if we use a code whose rate is greater than Then Ó Ç ] Æ Ó Ó á ä Ó ä Ó á Á á Ó ä Ç Ó á ä Á Á Á á the channel capacity, the probability of error is non-zero for all block lengths. The converse of the channel coding theorem we have proved is called the weak converse. A stronger version of this result can Ð Ð © called the strong converse á ä Á Ä be proved,è which says that as if there exists an such , h w Ç Æ ] that Ó «%¬I w for all á . D R A F T September 13, 2001, 6:27pm D R A F T 166 A FIRST COURSE IN INFORMATION THEORY P+ e 1 ) 1 *n log M C Figure 8.9. An asymptotic upper bound on ,.- . ACHIEVABILITY OF THE CHANNEL CAPACITY Ä 8.4 We have shown in the last section that the channel capacity is an upper Ä Ä bound on all achievable rates for a DMC. In thisà section, we show that the rate is achievable, which implies that any rate is achievable. ô¶ Consider a DMC ´MaRµ c , and denote the input and the output of the generic ³ D respectively. For every input distribution ´,a ¶ c , discrete channel by and ,³ë ó4D c is achievable by showing for large á the we will prove that the rate Âa existence of a channel code such that 1. the rate of the code is arbitrarily close to Âa 2. the maximal probability of error ø" æ Ê 4D c ; ³ëó is arbitrarily small. ¶ Ä achieves the ´,a c to be one which Then by choosing the input distribution ³ëó4D Ä , we conclude c channel capacity, i.e., that the rate is achievable. Âa ä Á Fix any w and let / be a small positive quantity to be specified later. Toward proving the existence of a desiredô ¶ code, we fix an input distribution ¶ ´,a c for the generic discrete channel ´MaRµ c , and let è be an even integer satisfying a 4D c x ®w z h «%¬I è z Qa 4D c x 0 w  ³ó á f ³ó º (8.108) á where is sufficiently large. We now describe a random coding scheme in the following steps: 4è è Ó c code randomly by generating 1. Construct the codebook 1 of an a ¶ c Ó . Denote codewords in ² independently and Ò3Ò3Ò identically according to ´,a í º4í a ® c º º4í a è c . these codewords by ah c á º 2. Reveal the codebook 1 to both the encoder and the decoder. D R A F T September 13, 2001, 6:27pm D R A F T 167 Channel Capacity is chosen from ì þ 3. A message according to the uniform distribution. í a c , which is then trans- þ þ is encoded into the codeword 4. The message mitted through the channel. |~} î 5. The channel outputs a sequence Û according to î í a c ô Ó þ O M´ aRµ k ¶ k c k ] ß î (cf. (8.73) and Remark 1 in Section 8.6). 6. The sequence is decoded to the message ÷ ÷ í ÷ there î does not exists Ä such that a a ÷ î is decoded to a constant message in which is decoded. ì ô í a ÷ c î c½¸32 4 q6587 and :587 c c9 ¸ 2 4 ð . Otherwise, if î a Ä (8.109) Ó º Ó º . Denote by þ the message to Ó Remark 1 There are a total of ² possible codebooks which can be constructed by the random procedure in Step 2, where we regard two codebooks whose sets of codewords are permutations of each other as two different codebooks. ô ô ; Remark 2 Strong typicality is used in defining the decoding function in Step 6. This is made possible by the assumption that the alphabets ² and ³ are finite. 
We now analyze the performance of this random coding scheme. Let <>=?= ð þ Û |} þhß (8.110) <>=?=(ß be the event of a decoding error. In the following, we analyze Û , the probability of a decoding error for the random code constructed above. For all h àj÷Üà è , define the event î A587 s a í a ÷ c c ¸@2 4 (8.111) ~| } |~} |~} ; <B=C= O <>=?= ú ÷ ú ÷ s (8.112) ù ] ÷ are identical for all ÷ by symmetry in the code con ù Now |} (ß Û <>=?= ô þ Since Û struction, we have ß Û þ ß ß <B=C=0ß D R A F T ß ô þ Û |~} Û Ó º Û |} | } Û Û ; ú h ù ] ú h <>=?= ô þ ß <>=?= ô þ ߺ |} Û ú ÷ þ September 13, 2001, 6:27pm ß (8.113) (8.114) D R A F T 168 A FIRST COURSE IN INFORMATION THEORY i.e., we can assume without loss of generalityî that the message 1 is chosen. Then decoding is correct if the received ù vector is decoded to® à the÷ message à è . 1.It does not occur for all This is the case if ] occurs but |~} that5 |} follows <B=C= Û ú h µô þ |~} which implies Û ô þ µô þ ´ µ ´ ú h <B=C= ô þ · |~} à ß ´ µ · µ Û By the union bound, we have Û ´ µ Û |~} ´ µ · ß ´ Û [ µ ´ Ò3Ò3Ò ´ ; ú h µ ô þ ߺ (8.115) ß Û à ´ ] ú| } h hx |} <>=?= ú h |~h} x ] |~} a ] [ [ ]ED [ D D <B=C= Û Ç ß · ´ µ Ò3Ò3Ò ´ Ò3Ò3Ò µ ô þ Û ú ; ú h c h ; ú ; h s µ ô þ µ µô þ Æ ß ß |~} ; ù (8.116) (8.117) (8.118) (8.119) ß ß ô þ D ñ h ] Ò3Ò3Ò ´ Û [ ù ú h s ô þ ß (8.120) ú h , then a í ah c î c are i.i.d. copies of the pair of generic random |~} JAEP (Theorem 5.8), for any F a 4D c . By the strong , q:587 ú ú h ía ah c î c ¸G 2 4 h zF (8.121) ] . ÷ ÷ c î c è í for sufficiently large . Second, if , h , then for ® a a 4D c , where are i.i.d. copies of the pair of generic random variables a D D î and have same marginal distributions , respectively, and D aretheindependent í ah c andasí a ÷ cand and because are independent and í depends only on|~} ah c . Therefore, ù |ú } h î q6 587 ú (8.122) a í a ÷ c c¸G2 4 h (8.123) 'HJI û L K M NAOQP R ´,a c ´,a cFs : 587 587 T 4 4 { c ¸ S 2 ¸ S 2 By the consistency of strong typicality, for , and a 587 A 4 ¸ 2 . By the strong AEP, all ´,a c and U ´,a c in the above summation l® WV l"X8 satisfy ´,a c (8.124) ]]\ ù \ ; þ á º First, if |} variables ä ³»º µ ô þ Û ß Ó º Û á ô þ ß à þ ¡à á ³ ³ ³ Ä Ä Á Ä º º ³ Ä Ä Ä Û ô þ ß Ó º Û ô þ ß Ï º Ó Ó Ó à Ó 5 If w , the received vector ^ is decoded to the Y Ë does not occur or Y[Z occurs for some constant message, which may happen to be the message 1. Therefore, the inequality in (8.115) is not an equality in general. D R A F T September 13, 2001, 6:27pm D R A F T Channel Capacity and ´,a c Ð where a ºcb Á Ð Á as / Ð ® l _V l[` Ó à º (8.125) . Again by the strong JAEP, Ó ô Ð 169 2 4 :587 ® WV ed I Ó ô à î º (8.126) |} as / . Then from (8.123), we have ù ú h ® l _V_ V I eVd ® lgl V_ VI l l"d Xel" Xl[® `l WV l[` ® ji ® l Wh8 j i l d l"Xl[` ® l W h8 "l k Á where f Á ô þ Û ß î Ó à Ó Ò Ó Ò (8.127) (8.128) (8.129) (8.130) î Ó Ó Ó º l where Ð Æ f Ð b Æ a Á (8.131) Á as / . From the upper bound in (8.108), we j have i è z ® Wh8 l>nm Ó s (8.132) è Using (8.121), (8.130), and the above upper bound on , it follows from |~} that (8.114) and (8.120) ji oi <>=?=(ß Û l Ð Since Á Ð Á as / Á Wh8 l>nm l 'h8 ® l® nm l"k s ® Æ F F Æ Ó l nm l"k ® , so that Ó l"k (8.133) Ó 0 Ó |} w x Ð l Á Û á Ò (8.134) , for sufficiently small / , we have for any w follows from (8.134) that ä z ä as <>=?=0ß á Á Ð © (8.135) . Then by letting z ®w |} 6z F · , it (8.136) for sufficiently large . <>=?=0ß Ó The main idea of the above analysis on Û is the following. 
In conè codewords in ² ô ¶according structing codebook, we randomly generate ¶ c Ó , the , ´ a to and one of the codewords is sent through the channel ´,aRµ c . When á is large, with high probability, the received sequence is jointly typical with ¶ º µ c . If the number è the codeword sent with respect to of codewords , ´ a á ³ó4D c , then the probability that the received grows with at a rate less than Âa D R A F T September 13, 2001, 6:27pm D R A F T 170 A FIRST COURSE IN INFORMATION THEORY sequence is jointly typical with a codeword other than the one sent through the channel is negligible. Accordingly, the message can be decoded correctly with probability arbitrarily close to 1. |~} In constructing the codebook by the randomß procedure in Step 2, we choose a codebook 1 with a certain probability Û 1 from the ensemble of all possible codebooks. By conditioning on the|~codebook |~} } |~} chosen, we have Û |~} O cp <>=?=*ß Û |~} |~} <B=C=0ß ß 1 <B=C= Û <B=C= ô ô 1 ߺ (8.137) ß is a weighted average of ô Û 1 over all 1 in the ensemble i.e., Û <B=C= ß is the average of all possible codebooks, where Û 1 |~} probability of error of the code, i.e., , when the codebook 1 is chosen (cf. Definition 8.9). The <B=C=0ß reader should compare the two different expansions of Û in (8.137) and (8.112). 1y¾ such that Therefore, there exists |} at least one codebook |} <>=?= Û ô 1 ¾ à ß Thus we have shown that for any w á º4è c code such that a á (cf. (8.108)) and ä h «%¬I è Û Á <>=?=*ß z ®w s (8.138) , there exists for sufficiently large a 4D c x ®w ä  ³ó á an (8.139) z ®w s (8.140) 4D c is achievable We are still one step away proving that the rate ÂQa ø" æ Ê from because we require that instead of is arbitrarily small. Toward this end, we write (8.140) as ³ó ; è h ù ø ù z ®w ] ; è ù ø°ù r z q t ® s ws ] º or (8.141) (8.142) Upon ordering the codewords according to their conditional probabilities of error, we observe that the conditional probabilities of error of the better half è codewords of the are less than w , otherwise the conditional probabilities of error of; the worse half of the codewords are at least w , and they contribute at least a [ c w to the summation in (8.142), which is a contradiction. D R A F T September 13, 2001, 6:27pm D R A F T 171 Channel Capacity Thus by discarding the worse half of the ø" codewords in 1 ¾ , for the resulting Ê æ codebook, the maximal probability of error is less than w . Using (8.139) and considering á h «%¬I è ® h «Ï¬I è x h 4D c x w x h q a ®s 4 D c xyw Qa á ä á * ä Â á ³ó á ³ó (8.143) (8.144) (8.145) when is sufficiently large, we see that the rate of the resulting code is greater ³ó4D ³ëó4D c xyw . Hence, we conclude that the rate c is achievable. than ÂQa Âa ¶ c Ä , ´ a Ä Finally, upon letting the input distribution be one which achieves the ³ó4D c , we have proved that the rate is achievchannel capacity, i.e., ÂQa able. This completes the proof of the direct part of the channel coding theorem. 8.5 A DISCUSSION In the last two sections, we have proved the channel coding theorem which Ä asserts that reliable communication through a DMC at rate is possible if and only if z , the channel capacity. By reliable communicationá at rate , we mean that the size of the message set grows exponentially with at rate , Ä with probability arbitrarily close while the message can be decoded correctly Ъ © á to 1 as . Therefore, the capacity is a fundamental characterization of a DMC. 
The capacity of a noisy channel is analogous to the capacity of a water pipe in the following way. For a water pipe, if we pump water through the pipe at a rate higher than its capacity, the pipe would burst and water would be lost. For a communication channel, if we communicate through the channel at a rate higher than the capacity, the probability of error is bounded away from zero, i.e., information is lost. In proving the direct part of the channel coding theorem, weÄ showed that and whose there exists a channel code whose rate is arbitrarily close to probability of error is arbitrarily close to zero. Moreover, the existence of such á a code is guaranteed only when the block length is large. However, the proof does not indicate how we can find such a codebook. For this reason, the proof we gave is called an existence proof (as oppose to a constructive proof). á For a fixed block length , we in principle can search through the ensemble of all possible codebooks for a good one, but this is quite prohibitive even for á small because the number of all possible codebooks growsá double exponená º4è Ó c Ä a the total number of all possible codebooks tially with ô . ô Specifically, ; è is approximately is equal toÓ(u ² . When the rate of the code is close to , equal to ® Ó . Therefore, the number of codebooks we need to search through ô ô [ Ïwv is about ² . D R A F T September 13, 2001, 6:27pm D R A F T 172 A FIRST COURSE IN INFORMATION THEORY Nevertheless, the proof of the direct part of the channel coding theorem does indicate that if we generate a|~codebook randomly as prescribed, the codebook } is most likely to be good. More precisely, we now show that the probabilityä of <B=C= ô ß Á Û is greater than any prescribed x choosing a code 1 such that 1 á |~} small when is sufficiently |~} |~} large. Consider is arbitrarily p <B=C=0ß Û Û ý Ç Vzy" {|{8 p c~ Vzy" {|{8 p c~ ý } | } From (8.138), we have p Û for any w Á when á } Û 1 1 1 ý <>=?= ô 1 ô 1 ß ß (8.147) (8.148) (8.149) <B=C=(ß x s z ®w Û } <>=?= ߺ Û à ß <>=?=*ß Vzy" {|{8 p c~ Û ß 1 |~} 1 ß (8.150) (8.151) is sufficiently large. |~} Then p ß |~} Û Û |} which implies ä |} ô |~} ß 1 <>=?= Û Û Û (8.146) ß 1 |~} } } ý ß |~} 1 |~} ý zy {|{ p c~ p ô Û zy {|{ p c~ p ý x <B=C= |~} \"} Æ p Û zy {|{ p p Ç ß 1 à w ®x s (8.152) Since x is fixed, this upper bound can be made arbitrarily small by choosing a sufficiently small w . Although the proof of the direct part of the channel coding theorem does not provide an explicit construction of a good code, it does give much insight into what a good code is like. Figure 8.10 is an illustration of a channel code which Ä distribution ´,a ¶ c achieves the channel capacity. Here we assume that the input ³ó4D c is one which achieves the channel capacity, i.e., ÂQa Ó . The idea ¶ is that most of the codewords are typical sequences in ² with respect to ´,a c . (For this reason, the repetition code is not a good code.) When such a codeword is ' the channel,Ó the received sequence is likely to be one of Ó V through transmitted ® about are jointly typical with the transmitted 'sequences to ´,ina ¶ ³ º µ c which Ó V respect codeword with . The associationÓ between a codeword and the about ® corresponding sequences in ³ is shown as a cone in the figure. As we require that the probability of decoding error is small, the cones D R A F T September 13, 2001, 6:27pm D R A F T 173 Channel Capacity p ( y | x ) n H (Y | X ) 2 n I( X;Y ) n H (Y ) 2 sequences in T[nY ] ... 2 codewords in T[nX ] Figure 8.10. 
A channel code which achieves capacity. essentially do not overlap with each other. Since the number of typical sequences with respect to " is about %('E , the number of codewords cannot exceed about %BW ('T > (wj ( (8.153) This is consistent with the converse of the channel coding theorem. The direct part of the channel coding theorem says that when is large, as long as the number of codewords generated randomly is not more than about % W¡[¢| , the overlap among the cones is negligible with very high probability. Therefore, instead of searching through the ensemble of all possible codebooks for a good one, we can generate a codebook randomly, and it is likely to be good. However, such a code is difficult to use due to the following implementation issues. A codebook with block length and rate £ consists of ¤ symbols from the input alphabet ¥ . This means that the size of the codebook, i.e., the amount of storage required to store the codebook, grows exponentially with . This also makes the encoding process very inefficient. Another issue is regarding the computation required for decoding. Based on the sequence received at the output of the channel, the decoder needs to decide which of the about %¤ codewords was the one transmitted. This requires an exponential amount of computation. In practice, we are satisfied with the reliability of communication once it exceeds a certain level. Therefore, the above implementation issues may eventually be resolved with the advancement of microelectronics. But before then, we still have to deal with these issues. For this reason, the entire field of coding theory has been developed since the 1950’s. Researchers in this field are devoted to searching for good codes and devising efficient decoding algorithms. D R A F T September 13, 2001, 6:27pm D R A F T 174 A FIRST COURSE IN INFORMATION THEORY « W Message ¦ Encoder X i =f i (W, Y i- 1 ) Channel § p (y ¨ |x © ) Yi ª Decoder W Estimate of message Figure 8.11. A channel code with feedback. In fact, almost all the codes studied in coding theory are linear codes. By taking advantage of the linear structures of these codes, efficient encoding and decoding can be achieved. In particular, Berrou el al [25] have proposed a linear code called the turbo code6 in 1993, which is now generally believed to be the practical way to achieve the channel capacity. Today, channel coding has been widely used in home entertainment systems (e.g., audio CD and DVD), computer storage systems (e.g., CD-ROM, hard disk, floppy disk, and magnetic tape), computer communication, wireless communication, and deep space communication. The most popular channel codes used in existing systems include the Hamming code, the Reed-Solomon code7 , the BCH code, and convolutional codes. We refer the interested reader to textbooks on coding theory [29] [124] [202] for discussions of this subject. 8.6 FEEDBACK CAPACITY Feedback is very common in practical communication systems for correcting possible errors which occur during transmission. As an example, during a telephone conversation, we very often have to request the speaker to repeat due to poor voice quality of the telephone line. As another example, in data communication, the receiver may request a packet to be retransmitted if the parity check bits received are incorrect. 
8.6 FEEDBACK CAPACITY

Feedback is very common in practical communication systems for correcting possible errors which occur during transmission. As an example, during a telephone conversation, we often have to ask the speaker to repeat what was said because of the poor voice quality of the telephone line. As another example, in data communication, the receiver may request a packet to be retransmitted if the parity check bits received are incorrect.

In general, when feedback from the receiver is available at the transmitter, the transmitter can at any time decide what to transmit next based on the feedback so far, and can potentially transmit information through the channel reliably at a higher rate. In this section, we study a model in which a DMC is used with complete feedback. The block diagram for the model is shown in Figure 8.11.

[Figure 8.11. A channel code with feedback: message $W$, encoder with $X_i = f_i(W, \mathbf{Y}_{i-1})$, channel $p(y|x)$, output $Y_i$, decoder, estimate $\hat W$.]

In this model, the symbol $Y_i$ received at the output of the channel at time $i$ is available instantaneously at the encoder without error. Then, depending on the message $W$ and all the previous feedback $Y_1, Y_2, \ldots, Y_i$, the encoder decides the value of $X_{i+1}$, the next symbol to be transmitted. Such a channel code is formally defined below.

Definition 8.14  An $(n, M)$ code with complete feedback for a discrete memoryless channel with input alphabet $\mathcal{X}$ and output alphabet $\mathcal{Y}$ is defined by encoding functions

$f_i: \{1, 2, \ldots, M\} \times \mathcal{Y}^{i-1} \to \mathcal{X} \qquad (8.154)$

for $1 \le i \le n$ and a decoding function

$g: \mathcal{Y}^n \to \{1, 2, \ldots, M\}. \qquad (8.155)$

We will use $\mathbf{Y}_i$ to denote $(Y_1, Y_2, \ldots, Y_i)$ and let $X_i = f_i(W, \mathbf{Y}_{i-1})$. We note that a channel code without feedback is a special case of a channel code with complete feedback because for the latter, the encoder can simply ignore the feedback.
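The structure in Definition 8.14 can be made concrete with a small simulation harness. The sketch below is an editorial illustration with hypothetical function names, not part of the text: it runs one transmission of an $(n, M)$ code with complete feedback over a simulated DMC, where at each time the encoder sees the message together with all past channel outputs.

```python
import random

def run_feedback_code(encoders, decoder, channel, w, n, rng):
    """Simulate one transmission of an (n, M) code with complete feedback.

    encoders[i](w, y_past) plays the role of f_{i+1}(W, Y_1, ..., Y_i);
    channel(x, rng) draws one output symbol according to p(y | x);
    decoder(y) plays the role of g(Y_1, ..., Y_n).
    """
    y = []
    for i in range(n):
        x = encoders[i](w, tuple(y))      # X_i depends on W and the feedback so far
        y.append(channel(x, rng))         # Y_i is fed back to the encoder without error
    return decoder(tuple(y))

# A trivial example: ignore the feedback (a code without feedback is a special case).
n = 3
encoders = [lambda w, y_past: w for _ in range(n)]            # repeat the message bit
decoder = lambda y: int(sum(y) > n / 2)                       # majority vote
bsc = lambda x, rng: x ^ (rng.random() < 0.1)                 # BSC with crossover 0.1
print(run_feedback_code(encoders, decoder, bsc, w=1, n=n, rng=random.Random(0)))
```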
Definition 8.15  A rate $R$ is achievable with complete feedback for a discrete memoryless channel $p(y|x)$ if for any $\epsilon > 0$, there exists for sufficiently large $n$ an $(n, M)$ code with complete feedback such that

$\frac{1}{n}\log M > R - \epsilon \qquad (8.156)$

and

$\lambda_{max} < \epsilon. \qquad (8.157)$

Definition 8.16  The feedback capacity, $C_{FB}$, of a discrete memoryless channel is the supremum of all the rates achievable by codes with complete feedback.

Proposition 8.17  The supremum in the definition of $C_{FB}$ in Definition 8.16 is the maximum.

Proof  Consider rates $R^{(k)}$ which are achievable with complete feedback such that $\lim_{k\to\infty} R^{(k)} = R$. Then for any $\epsilon > 0$ and any $k$, there exists for sufficiently large $n$ an $(n, M^{(k)})$ code with complete feedback such that $\frac{1}{n}\log M^{(k)} > R^{(k)} - \epsilon$ and $\lambda_{max}^{(k)} < \epsilon$. By virtue of the convergence of $R^{(k)}$ to $R$, let $k(\epsilon)$ be an integer such that $|R - R^{(k)}| < \epsilon$ for all $k > k(\epsilon)$, which implies $R^{(k)} > R - \epsilon$. Then for all $k > k(\epsilon)$,

$\frac{1}{n}\log M^{(k)} > R^{(k)} - \epsilon > R - 2\epsilon.$

Since $\epsilon$ is arbitrary, it follows that $R$ is achievable with complete feedback. This implies that the supremum in Definition 8.16, which can be achieved, is in fact the maximum.

Since a channel code without feedback is a special case of a channel code with complete feedback, any rate $R$ achievable by the former is also achievable by the latter. Therefore,

$C_{FB} \ge C. \qquad (8.164)$

The interesting question is whether $C_{FB}$ is greater than $C$. The answer surprisingly turns out to be no for a DMC, as we now show.

From the description of a channel code with complete feedback, we obtain the dependency graph for the random variables $W, \mathbf{X}, \mathbf{Y}, \hat W$ in Figure 8.12.

[Figure 8.12. The dependency graph for a channel code with feedback: $W$ and the fed-back outputs determine each $X_i$; each $Y_i$ depends on $X_i$ through $p(y|x)$; $\hat W$ is a function of $Y_1, \ldots, Y_n$.]

From this dependency graph, we see that

$q(w, \mathbf{x}, \mathbf{y}, \hat w) = q(w)\left(\prod_{i=1}^{n} q(x_i | w, \mathbf{y}_{i-1})\, p(y_i | x_i)\right) q(\hat w | \mathbf{y}) \qquad (8.165)$

for all $(w, \mathbf{x}, \mathbf{y}, \hat w) \in \mathcal{W} \times \mathcal{X}^n \times \mathcal{Y}^n \times \mathcal{W}$ such that $q(w, \mathbf{y}_{i-1}) > 0$ for $1 \le i \le n$ and $q(\mathbf{y}) > 0$, where $\mathbf{y}_{i-1} = (y_1, y_2, \ldots, y_{i-1})$. Note that $q(x_i | w, \mathbf{y}_{i-1})$ and $q(\hat w | \mathbf{y})$ are deterministic.

Lemma 8.18  For all $1 \le i \le n$,

$(W, \mathbf{Y}_{i-1}) \to X_i \to Y_i \qquad (8.166)$

forms a Markov chain.

Proof  The dependency graph for the random variables $W$, $\mathbf{X}_i$, and $\mathbf{Y}_i$ is shown in Figure 8.13.

[Figure 8.13. The dependency graph for $W$, $\mathbf{X}_i$, and $\mathbf{Y}_i$.]

Denote the set of nodes $W$, $\mathbf{X}_{i-1}$, and $\mathbf{Y}_{i-1}$ by $Z$. Then we see that all the edges from $Z$ end at $X_i$, and the only edge from $X_i$ ends at $Y_i$. This means that $Y_i$ depends on $(W, \mathbf{X}_{i-1}, \mathbf{Y}_{i-1})$ only through $X_i$, i.e., $(W, \mathbf{X}_{i-1}, \mathbf{Y}_{i-1}) \to X_i \to Y_i$ forms a Markov chain, or

$I(W, \mathbf{X}_{i-1}, \mathbf{Y}_{i-1}; Y_i | X_i) = 0.$

This can be formally justified by Proposition 2.9, and the details are omitted here. Since

$0 = I(W, \mathbf{X}_{i-1}, \mathbf{Y}_{i-1}; Y_i | X_i) = I(W, \mathbf{Y}_{i-1}; Y_i | X_i) + I(\mathbf{X}_{i-1}; Y_i | W, X_i, \mathbf{Y}_{i-1})$

and mutual information is nonnegative, we obtain $I(W, \mathbf{Y}_{i-1}; Y_i | X_i) = 0$, or $(W, \mathbf{Y}_{i-1}) \to X_i \to Y_i$ forms a Markov chain. The lemma is proved.

From the definition of $C_{FB}$ and by virtue of Proposition 8.17, if $R \le C_{FB}$, then $R$ is a rate achievable with complete feedback. We will show that if a rate $R$ is achievable with complete feedback, then $R \le C$. If so, then $R \le C_{FB}$ implies $R \le C$, which can be true if and only if $C_{FB} \le C$. Then from (8.164), we can conclude that $C_{FB} = C$.

Let $R$ be a rate achievable with complete feedback, i.e., for any $\epsilon > 0$, there exists for sufficiently large $n$ an $(n, M)$ code with complete feedback such that

$\frac{1}{n}\log M > R - \epsilon \qquad (8.173)$

and $\lambda_{max} < \epsilon$. We bound $\log M$ as in the proof of the converse of the channel coding theorem. First,

$I(W; \mathbf{Y}) = H(\mathbf{Y}) - H(\mathbf{Y}|W) = H(\mathbf{Y}) - \sum_{i=1}^{n} H(Y_i | \mathbf{Y}_{i-1}, W) \stackrel{a)}{=} H(\mathbf{Y}) - \sum_{i=1}^{n} H(Y_i | \mathbf{Y}_{i-1}, W, X_i) \stackrel{b)}{=} H(\mathbf{Y}) - \sum_{i=1}^{n} H(Y_i | X_i) \le \sum_{i=1}^{n} H(Y_i) - \sum_{i=1}^{n} H(Y_i | X_i) = \sum_{i=1}^{n} I(X_i; Y_i) \le nC, \qquad (8.182)$

where a) follows because $X_i$ is a function of $W$ and $\mathbf{Y}_{i-1}$, and b) follows from Lemma 8.18. Second, since $\hat W$ is a function of $\mathbf{Y}$,

$H(W | \mathbf{Y}) = H(W | \mathbf{Y}, \hat W) \le H(W | \hat W).$

Thus

$\log M = H(W) = I(W; \mathbf{Y}) + H(W | \mathbf{Y}) \le nC + H(W | \hat W),$

which is the same as (8.98). Then by (8.173) and an application of Fano's inequality, we conclude as in the proof of the converse of the channel coding theorem that $R \le C$. Hence, we have proved that $C_{FB} = C$.

Remark 1  The proof of the converse of the channel coding theorem in Section 8.3 depends critically on the Markov chain

$W \to \mathbf{X} \to \mathbf{Y} \to \hat W \qquad (8.186)$

and the relation in (8.73) (the latter implies Lemma 8.13). Neither holds in general in the presence of feedback.

Remark 2  The proof of $C_{FB} = C$ in this section is also a proof of the converse of the channel coding theorem, so we actually do not need the proof in Section 8.3. However, the proof here and the proof in Section 8.3 have very different spirits. Without comparing the two proofs, one cannot fully appreciate the subtlety of the result that feedback does not increase the capacity of a DMC.

Remark 3  Although feedback does not increase the capacity of a DMC, the availability of feedback very often makes coding much simpler.
For some channels, communication through the channel with zero probability of error can be achieved in the presence of feedback by using a variable-length channel code. This is illustrated in the next example.

Example 8.19  Consider the binary erasure channel in Example 8.5 whose capacity is $1 - \gamma$, where $\gamma$ is the erasure probability. In the presence of complete feedback, for every information bit to be transmitted, the encoder can transmit the same information bit through the channel repeatedly until an erasure does not occur, i.e., until the information bit is received correctly. The number of uses of the channel it takes to transmit an information bit correctly then has a truncated geometric distribution whose mean is $(1-\gamma)^{-1}$. Therefore, the effective rate at which information can be transmitted through the channel is $1 - \gamma$. In other words, the channel capacity is achieved by using a very simple variable-length code. Moreover, the channel capacity is achieved with zero probability of error. In the absence of feedback, the rate $1 - \gamma$ can also be achieved, but with an arbitrarily small probability of error and a much more complicated code.
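The scheme in Example 8.19 is easy to check numerically. The following sketch (an illustration, not from the text) repeats each information bit over a simulated erasure link until it gets through and measures the resulting effective rate, which concentrates around $1 - \gamma$.

```python
import random

def channel_uses_with_feedback(num_bits, erasure_prob, rng):
    """Total channel uses needed to deliver num_bits information bits when each bit is
    retransmitted until it is not erased (the scheme of Example 8.19)."""
    uses = 0
    for _ in range(num_bits):
        while True:
            uses += 1
            if rng.random() >= erasure_prob:   # not erased; feedback tells the encoder to move on
                break
    return uses

rng = random.Random(0)
k, gamma = 100000, 0.3
print(k / channel_uses_with_feedback(k, gamma, rng))   # effective rate, close to 1 - gamma = 0.7
```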
8.7 SEPARATION OF SOURCE AND CHANNEL CODING

We have so far considered the situation in which we want to convey a message through a DMC, where the message is randomly selected from a finite set according to the uniform distribution. However, in most situations, we want to convey an information source through a DMC. Let $\{U_k\}$ be a stationary ergodic information source with entropy rate $H$. Denote the common alphabet by $\mathcal{U}$ and assume that $\mathcal{U}$ is finite. To convey $\{U_k\}$ through the channel, we can employ a source code with rate $R_s$ and a channel code with rate $R_c$, where $R_s < R_c$, as shown in Figure 8.14.

[Figure 8.14. Separation of source coding and channel coding: $\mathbf{U}$, source encoder, $W$, channel encoder, $\mathbf{X}$, channel $p(y|x)$, $\mathbf{Y}$, channel decoder, $\hat W$, source decoder, $\hat{\mathbf{U}}$.]

Let $f^s$ and $g^s$ be respectively the encoding function and the decoding function of the source code, and $f^c$ and $g^c$ be respectively the encoding function and the decoding function of the channel code. The block of $n$ information symbols $\mathbf{U} = (U_{-(n-1)}, U_{-(n-2)}, \ldots, U_0)$ is first encoded by the source encoder into an index

$W = f^s(\mathbf{U}),$

called the source codeword. Then $W$ is mapped by the channel encoder to a distinct channel codeword

$\mathbf{X} = f^c(W),$

where $\mathbf{X} = (X_1, X_2, \ldots, X_n)$. This is possible because there are about $2^{nR_s}$ source codewords and about $2^{nR_c}$ channel codewords, and we assume that $R_s < R_c$. Then $\mathbf{X}$ is transmitted through the DMC $p(y|x)$, and the sequence $\mathbf{Y} = (Y_1, Y_2, \ldots, Y_n)$ is received. Based on $\mathbf{Y}$, the channel decoder first estimates $W$ as $\hat W = g^c(\mathbf{Y})$. Finally, the source decoder decodes $\hat W$ to

$\hat{\mathbf{U}} = g^s(\hat W). \qquad (8.190)$

For this scheme, an error occurs if $\hat{\mathbf{U}} \ne \mathbf{U}$, and we denote the probability of error by $P_e$.

We now show that if $H < C$, the capacity of the DMC $p(y|x)$, then it is possible to convey $\mathbf{U}$ through the channel with an arbitrarily small probability of error. First, we choose $R_s$ and $R_c$ such that

$H < R_s < R_c < C.$

Observe that if $\hat W = W$ and $g^s(W) = \mathbf{U}$, then from (8.190), $\hat{\mathbf{U}} = g^s(\hat W) = g^s(W) = \mathbf{U}$, i.e., an error does not occur. In other words, if an error occurs, either $\hat W \ne W$ or $g^s(W) \ne \mathbf{U}$. Then by the union bound, we have

$P_e \le \Pr\{\hat W \ne W\} + \Pr\{g^s(W) \ne \mathbf{U}\}.$

For any $\epsilon > 0$ and sufficiently large $n$, by the Shannon-McMillan-Breiman theorem, there exists a source code such that

$\Pr\{g^s(W) \ne \mathbf{U}\} \le \epsilon.$

By the channel coding theorem, there exists a channel code such that $\lambda_{max} \le \epsilon$, where $\lambda_{max}$ is the maximal probability of error. This implies

$\Pr\{\hat W \ne W\} = \sum_{w} \Pr\{\hat W \ne W | W = w\}\Pr\{W = w\} \le \lambda_{max} \le \epsilon.$

Combining the last two bounds, we have $P_e \le 2\epsilon$. Therefore, we conclude that as long as $H < C$, it is possible to convey $\{U_k\}$ through the DMC reliably.

In the scheme we have discussed, source coding and channel coding are separated. In general, source coding and channel coding can be combined; this technique is called joint source-channel coding. It is then natural to ask whether it is possible to convey information through the channel reliably at a higher rate by using joint source-channel coding. In the rest of the section, we show that the answer is no, to the extent that for asymptotic reliability we must have $H \le C$. However, whether asymptotic reliability can be achieved for $H = C$ depends on the specific information source and channel.

We base our discussion on the general assumption that complete feedback is available at the encoder, as shown in Figure 8.15.

[Figure 8.15. Joint source-channel coding: $\mathbf{U}$, source-channel encoder with $X_i = f_i^{sc}(\mathbf{U}, \mathbf{Y}_{i-1})$, channel $p(y|x)$, $Y_i$, source-channel decoder, $\hat{\mathbf{U}}$.]

Let $f_i^{sc}$, $1 \le i \le n$, be the encoding functions and $g^{sc}$ be the decoding function of the source-channel code. Then

$X_i = f_i^{sc}(\mathbf{U}, \mathbf{Y}_{i-1})$

for $1 \le i \le n$, where $\mathbf{Y}_{i-1} = (Y_1, Y_2, \ldots, Y_{i-1})$, and $\hat{\mathbf{U}} = g^{sc}(\mathbf{Y})$, where $\mathbf{U} = (U_{-(n-1)}, U_{-(n-2)}, \ldots, U_0)$. In exactly the same way as we proved (8.182) in the last section, we can prove that

$I(\mathbf{U}; \mathbf{Y}) \le nC.$

Since $\hat{\mathbf{U}}$ is a function of $\mathbf{Y}$,

$I(\mathbf{U}; \hat{\mathbf{U}}) \le I(\mathbf{U}; \mathbf{Y}) \le nC.$

For any $\epsilon > 0$, $n(H - \epsilon) \le H(\mathbf{U})$ for sufficiently large $n$. Then

$n(H - \epsilon) \le H(\mathbf{U}) = H(\mathbf{U} | \hat{\mathbf{U}}) + I(\mathbf{U}; \hat{\mathbf{U}}) \le H(\mathbf{U} | \hat{\mathbf{U}}) + nC.$

Applying Fano's inequality (Corollary 2.48), we obtain

$n(H - \epsilon) \le 1 + n P_e \log|\mathcal{U}| + nC,$

or

$H - \epsilon \le \frac{1}{n} + P_e \log|\mathcal{U}| + C.$

For asymptotic reliability, $P_e \to 0$ as $n \to \infty$. Therefore, by letting $n \to \infty$ and then $\epsilon \to 0$, we conclude that

$H \le C.$

This result, sometimes called the separation theorem for source and channel coding, says that asymptotic optimality can be achieved by separating source coding and channel coding. This theorem has significant engineering implications because the source code and the channel code can be designed separately without losing asymptotic optimality: we only need to design the best source code for the information source and the best channel code for the channel. Moreover, separation of source coding and channel coding facilitates the transmission of different information sources on the same channel, because we only need to change the source code for each information source. Likewise, it facilitates the transmission of the same information source on different channels, because we only need to change the channel code for each channel.
We remark that although asymptotic optimality can be achieved by separating source coding and channel coding, for finite block length the probability of error can generally be reduced by using joint source-channel coding.

PROBLEMS

In the following, $\mathbf{X} = (X_1, X_2, \ldots, X_n)$, $\mathbf{x} = (x_1, x_2, \ldots, x_n)$, and so on.

1. Show that the capacity of a DMC with complete feedback cannot be increased by using probabilistic encoding and/or decoding schemes.

2. Memory increases capacity  Consider a BSC with crossover probability $0 < \epsilon < 1$ represented by $Y_i = X_i + Z_i \bmod 2$, where $X_i$, $Y_i$, and $Z_i$ are respectively the input, the output, and the noise random variable at time $i$. Then $\Pr\{Z_i = 0\} = 1 - \epsilon$ and $\Pr\{Z_i = 1\} = \epsilon$ for all $i$. We make no assumption that the $Z_i$ are i.i.d., so that the channel may have memory.
   a) Prove that $I(\mathbf{X}; \mathbf{Y}) \le n - h_b(\epsilon)$.
   b) Show that the upper bound in a) can be achieved by letting the $X_i$ be i.i.d. bits taking the values 0 and 1 with equal probability and $Z_1 = Z_2 = \cdots = Z_n$.
   c) Show that with the assumptions in b), $I(\mathbf{X}; \mathbf{Y}) > nC$, where $C = 1 - h_b(\epsilon)$ is the capacity of the BSC if it is memoryless.

3. Show that the capacity of a channel can be increased by feedback if the channel has memory.

4. In Remark 1 toward the end of Section 8.6, it was mentioned that in the presence of feedback, both the Markov chain $W \to \mathbf{X} \to \mathbf{Y} \to \hat W$ and Lemma 8.13 do not hold in general. Give examples to substantiate this remark.

5. Prove that when a DMC is used with complete feedback,

$\Pr\{Y_i = y_i \mid \mathbf{X}_i = \mathbf{x}_i, \mathbf{Y}_{i-1} = \mathbf{y}_{i-1}\} = \Pr\{Y_i = y_i \mid X_i = x_i\}$

for all $i \ge 1$. This relation, which is a consequence of the causality of the code, says that given the current input, the current output does not depend on the past inputs and outputs of the DMC.

6. Let

$P(\epsilon) = \begin{bmatrix} 1-\epsilon & \epsilon \\ \epsilon & 1-\epsilon \end{bmatrix}$

be the transition matrix for a BSC with crossover probability $\epsilon$. Define $a * b = a(1-b) + b(1-a)$ for $0 \le a, b \le 1$.
   a) Prove that a DMC with transition matrix $P(\epsilon_1)P(\epsilon_2)$ is equivalent to a BSC with crossover probability $\epsilon_1 * \epsilon_2$. Such a channel is the cascade of two BSC's with crossover probabilities $\epsilon_1$ and $\epsilon_2$, respectively.
   b) Repeat a) for a DMC with transition matrix $P(\epsilon_2)P(\epsilon_1)$.
   c) Prove that $1 - h_b(\epsilon_1 * \epsilon_2) \le \min\big(1 - h_b(\epsilon_1),\, 1 - h_b(\epsilon_2)\big)$. This means that the capacity of the cascade of two BSC's is at most the capacity of either of the two BSC's.
   d) Prove that a DMC with transition matrix $P(\epsilon)^n$ is equivalent to a BSC with crossover probability $\frac{1}{2}\big(1 - (1-2\epsilon)^n\big)$.

7. Symmetric channel  A DMC is symmetric if the rows of the transition matrix $p(y|x)$ are permutations of each other and so are the columns. Determine the capacity of such a channel. See Section 4.5 in Gallager [77] for a more general discussion.

8. Let $C_1$ and $C_2$ be the capacities of two DMC's with transition matrices $P_1$ and $P_2$, respectively, and let $C$ be the capacity of the DMC with transition matrix $P_1 P_2$. Prove that $C \le \min(C_1, C_2)$.

9. Two independent parallel channels  Let $C_1$ and $C_2$ be the capacities of two DMC's $p_1(y_1|x_1)$ and $p_2(y_2|x_2)$, respectively. Determine the capacity of the DMC

$p(y_1, y_2 | x_1, x_2) = p_1(y_1|x_1)\, p_2(y_2|x_2).$

Hint: Prove that $I(X_1, X_2; Y_1, Y_2) \le I(X_1; Y_1) + I(X_2; Y_2)$ if $p(y_1, y_2|x_1, x_2) = p_1(y_1|x_1)p_2(y_2|x_2)$.
10. Maximum likelihood decoding  In maximum likelihood decoding for a given channel and a given codebook, if a received sequence $\mathbf{y}$ is decoded to a codeword $\mathbf{x}$, then $\mathbf{x}$ maximizes $\Pr\{\mathbf{y}|\mathbf{x}'\}$ among all codewords $\mathbf{x}'$ in the codebook.
   a) Prove that maximum likelihood decoding minimizes the average probability of error.
   b) Does maximum likelihood decoding also minimize the maximal probability of error? Give an example if your answer is no.

11. Minimum distance decoding  The Hamming distance between two binary sequences $\mathbf{x}$ and $\mathbf{y}$, denoted by $d(\mathbf{x}, \mathbf{y})$, is the number of places where $\mathbf{x}$ and $\mathbf{y}$ differ. In minimum distance decoding for a memoryless BSC, if a received sequence $\mathbf{y}$ is decoded to a codeword $\mathbf{x}$, then $\mathbf{x}$ minimizes $d(\mathbf{x}', \mathbf{y})$ over all codewords $\mathbf{x}'$ in the codebook. Prove that minimum distance decoding is equivalent to maximum likelihood decoding if the crossover probability of the BSC is less than 0.5.

12. The following figure shows a communication system with two DMC's with complete feedback. The capacities of the two channels are respectively $C_1$ and $C_2$.

[Figure: $W$, Encoder 1, Channel 1, Encoder 2, Channel 2, Decoder 2, $\hat W$.]

   a) Give the dependency graph for all the random variables involved in the coding scheme.
   b) Prove that the capacity of the system is $\min(C_1, C_2)$.
   For the capacity of a network of DMC's, see Song et al. [186].

13. Binary arbitrarily varying channel  Consider a memoryless BSC whose crossover probability is time-varying. Specifically, the crossover probability $\epsilon(i)$ at time $i$ is an arbitrary value in $[\epsilon_1, \epsilon_2]$, where $0 \le \epsilon_1 < \epsilon_2 < 0.5$. Prove that the capacity of this channel is $1 - h_b(\epsilon_2)$. (Ahlswede and Wolfowitz [9].)

14. Consider a BSC with crossover probability $\epsilon \in [\epsilon_1, \epsilon_2]$, where $0 < \epsilon_1 < \epsilon_2 < 0.5$, but the exact value of $\epsilon$ is unknown. Prove that the capacity of this channel is $1 - h_b(\epsilon_2)$.

HISTORICAL NOTES

The concept of channel capacity was introduced in Shannon's original paper [173], where he stated the channel coding theorem and outlined a proof. The first rigorous proof was due to Feinstein [65]. The random coding error exponent was developed by Gallager [76] in a simplified proof. The converse of the channel coding theorem was proved by Fano [63], where he used an inequality now bearing his name. The strong converse was first proved by Wolfowitz [205]. An iterative algorithm for calculating the channel capacity, developed independently by Arimoto [14] and Blahut [27], will be discussed in Chapter 10. It was proved by Shannon [175] that the capacity of a discrete memoryless channel cannot be increased by feedback. The proof given here, based on dependency graphs, is inspired by Bayesian networks.

Chapter 9

RATE DISTORTION THEORY

Let $H$ be the entropy rate of an information source. By the source coding theorem, it is possible to design a source code with rate $R$ which reconstructs the source sequence $\mathbf{X} = (X_1, X_2, \ldots, X_n)$ with an arbitrarily small probability of error, provided $R > H$ and the block length $n$ is sufficiently large. However, there are situations in which we want to convey an information source with a source code whose rate is less than $H$. We are then motivated to ask: what is the best we can do when $R < H$?

A natural approach is to design a source code such that for part of the time the source sequence is reconstructed correctly, while for the other part of the time the source sequence is reconstructed incorrectly, i.e., an error occurs.
In designing such a code, we try to minimize the probability of error. However, this approach is not viable asymptotically, because the converse of the source coding theorem says that if $R < H$, then the probability of error inevitably tends to 1 as $n \to \infty$. Therefore, if $R < H$, no matter how the source code is designed, the source sequence is almost always reconstructed incorrectly when $n$ is large.

An alternative approach is to design a source code, called a rate distortion code, which reproduces the source sequence with distortion. In order to formulate the problem properly, we need a distortion measure between each source sequence and each reproduction sequence. We then try to design a rate distortion code which with high probability reproduces the source sequence with a distortion within a tolerance level. Clearly, a smaller distortion can potentially be achieved if we are allowed to use a higher coding rate. Rate distortion theory, the subject matter of this chapter, gives a characterization of the asymptotically optimal tradeoff between the coding rate of a rate distortion code for a given information source and the allowed distortion in the reproduction sequence with respect to a distortion measure.

9.1 SINGLE-LETTER DISTORTION MEASURES

Let $\{X_k, k \ge 1\}$ be an i.i.d. information source with generic random variable $X$. We assume that the source alphabet $\mathcal{X}$ is finite. Let $p(x)$ be the probability distribution of $X$, and we assume without loss of generality that the support of $X$ is equal to $\mathcal{X}$. Consider a source sequence

$\mathbf{x} = (x_1, x_2, \ldots, x_n) \qquad (9.1)$

and a reproduction sequence

$\hat{\mathbf{x}} = (\hat x_1, \hat x_2, \ldots, \hat x_n). \qquad (9.2)$

The components of $\hat{\mathbf{x}}$ can take values in $\mathcal{X}$, but more generally, they can take values in any finite set $\hat{\mathcal{X}}$ which may be different from $\mathcal{X}$. The set $\hat{\mathcal{X}}$, which is also assumed to be finite, is called the reproduction alphabet. To measure the distortion between $\mathbf{x}$ and $\hat{\mathbf{x}}$, we introduce the single-letter distortion measure and the average distortion measure.

Definition 9.1  A single-letter distortion measure is a mapping

$d: \mathcal{X} \times \hat{\mathcal{X}} \to \Re^+, \qquad (9.3)$

where $\Re^+$ is the set of nonnegative real numbers. (Note that $d(x, \hat x)$ is therefore finite for all $(x, \hat x)$.) The value $d(x, \hat x)$ denotes the distortion incurred when a source symbol $x$ is reproduced as $\hat x$.

Definition 9.2  The average distortion between a source sequence $\mathbf{x} \in \mathcal{X}^n$ and a reproduction sequence $\hat{\mathbf{x}} \in \hat{\mathcal{X}}^n$ induced by a single-letter distortion measure $d$ is defined by

$d(\mathbf{x}, \hat{\mathbf{x}}) = \frac{1}{n}\sum_{k=1}^{n} d(x_k, \hat x_k). \qquad (9.4)$

In Definition 9.2, we have used $d$ to denote both the single-letter distortion measure and the average distortion measure, but this abuse of notation should cause no ambiguity. Henceforth, we will refer to a single-letter distortion measure simply as a distortion measure.
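Definition 9.2 translates directly into code. The following short sketch (an illustration, not from the text) computes the average distortion for the two single-letter measures discussed in the examples that follow:

```python
def avg_distortion(x, x_hat, d):
    """Average distortion (1/n) * sum_k d(x_k, xhat_k) of Definition 9.2."""
    assert len(x) == len(x_hat)
    return sum(d(a, b) for a, b in zip(x, x_hat)) / len(x)

hamming = lambda a, b: 0 if a == b else 1      # Hamming distortion measure
square_error = lambda a, b: (a - b) ** 2       # square-error distortion measure

print(avg_distortion([0, 1, 1, 0], [0, 1, 0, 0], hamming))        # 0.25
print(avg_distortion([1.0, 2.0], [1.5, 2.0], square_error))       # 0.125
```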
Very often, the source sequence $\mathbf{x}$ represents quantized samples of a continuous signal, and the user attempts to recognize certain objects and derive meaning from the reproduction sequence $\hat{\mathbf{x}}$. For example, $\mathbf{x}$ may represent a video signal, an audio signal, or an image. The ultimate purpose of a distortion measure is to reflect the distortion between $\mathbf{x}$ and $\hat{\mathbf{x}}$ as perceived by the user. This goal is difficult to achieve in general because measurements of the distortion between $\mathbf{x}$ and $\hat{\mathbf{x}}$ must be made within context unless the symbols in $\mathcal{X}$ carry no physical meaning. Specifically, when the user derives meaning from $\hat{\mathbf{x}}$, the distortion in $\hat{\mathbf{x}}$ as perceived by the user depends on the context. For example, the perceived distortion is small for a portrait contaminated by a fairly large noise, while the perceived distortion is large for the image of a book page contaminated by the same noise. Hence, a good distortion measure should be context dependent. Although the average distortion is not necessarily the best way to measure the distortion between a source sequence and a reproduction sequence, it has the merit of being simple and easy to use. Moreover, rate distortion theory, which is based on the average distortion measure, provides a framework for data compression when distortion is inevitable.

Example 9.3  When the symbols in $\mathcal{X}$ and $\hat{\mathcal{X}}$ represent real values, a popular distortion measure is the square-error distortion measure, defined by

$d(x, \hat x) = (x - \hat x)^2. \qquad (9.5)$

The average distortion measure so induced is often referred to as the mean-square error.

Example 9.4  When $\mathcal{X}$ and $\hat{\mathcal{X}}$ are identical and the symbols in $\mathcal{X}$ do not carry any particular meaning, a frequently used distortion measure is the Hamming distortion measure, defined by

$d(x, \hat x) = 0$ if $x = \hat x$, and $d(x, \hat x) = 1$ if $x \ne \hat x$. $\qquad (9.6)$

The Hamming distortion measure indicates the occurrence of an error. In particular, for an estimate $\hat X$ of $X$, we have

$E\, d(X, \hat X) = \Pr\{X = \hat X\}\cdot 0 + \Pr\{X \ne \hat X\}\cdot 1 = \Pr\{X \ne \hat X\},$

i.e., the expectation of the Hamming distortion measure between $X$ and $\hat X$ is the probability of error. For $\mathbf{x} \in \mathcal{X}^n$ and $\hat{\mathbf{x}} \in \hat{\mathcal{X}}^n$, the average distortion $d(\mathbf{x}, \hat{\mathbf{x}})$ induced by the Hamming distortion measure gives the frequency of error in the reproduction sequence $\hat{\mathbf{x}}$.

Definition 9.5  For a distortion measure $d$ and for each $x \in \mathcal{X}$, let $\hat x^*(x) \in \hat{\mathcal{X}}$ minimize $d(x, \hat x)$ over all $\hat x \in \hat{\mathcal{X}}$. A distortion measure $d$ is said to be normal if

$c_x := d(x, \hat x^*(x)) = 0 \qquad (9.8)$

for all $x \in \mathcal{X}$.

The square-error distortion measure and the Hamming distortion measure are examples of normal distortion measures. Basically, a normal distortion measure is one which allows $X$ to be reproduced with zero distortion. Although a distortion measure $d$ is not normal in general, a normalization of $d$ can always be obtained by defining the distortion measure

$\tilde d(x, \hat x) = d(x, \hat x) - c_x \qquad (9.9)$

for all $(x, \hat x) \in \mathcal{X} \times \hat{\mathcal{X}}$. Evidently, $\tilde d$ is a normal distortion measure, and it is referred to as the normalization of $d$.

Example 9.6  Let $d$ be a distortion measure given by (rows indexed by $x = 1, 2$ and columns by $\hat x = 1, 2, 3$)

  $d$:   row $x=1$: 2  7  5     row $x=2$: 4  3  8

Then $\tilde d$, the normalization of $d$, is given by

  $\tilde d$:   row $x=1$: 0  5  3     row $x=2$: 1  0  5

Note that for every $x \in \mathcal{X}$, there exists an $\hat x \in \hat{\mathcal{X}}$ such that $\tilde d(x, \hat x) = 0$.
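The normalization in (9.9) is just a row-wise subtraction of minima, as in the following sketch (an illustration, not from the text; the matrix used here is arbitrary rather than the one of Example 9.6):

```python
import numpy as np

def normalize_distortion(d):
    """Normalization d~ of a distortion matrix d (rows: source symbols x, columns:
    reproduction symbols xhat): subtract from each row its minimum, cf. (9.9)."""
    d = np.asarray(d, dtype=float)
    return d - d.min(axis=1, keepdims=True)

d = np.array([[1.0, 3.0, 2.0],
              [5.0, 4.0, 6.0]])        # an arbitrary non-normal distortion measure
print(normalize_distortion(d))          # every row of the result contains a zero
```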
Let $\hat X$ be any estimate of $X$ which takes values in $\hat{\mathcal{X}}$, and denote the joint distribution of $X$ and $\hat X$ by $p(x, \hat x)$. Then

$E\, d(X, \hat X) = \sum_x \sum_{\hat x} p(x, \hat x)\, d(x, \hat x) = \sum_x \sum_{\hat x} p(x, \hat x)\big[\tilde d(x, \hat x) + c_x\big] = E\, \tilde d(X, \hat X) + \sum_x p(x)\, c_x \sum_{\hat x} p(\hat x | x) = E\, \tilde d(X, \hat X) + \Delta,$

where

$\Delta = \sum_x p(x)\, c_x \qquad (9.16)$

is a constant which depends only on $p(x)$ and $d$ but not on the conditional distribution $p(\hat x | x)$. In other words, for a given $X$ and a distortion measure $d$, the expected distortion between $X$ and an estimate $\hat X$ of $X$ is always reduced by a constant upon using $\tilde d$ instead of $d$ as the distortion measure. For reasons which will be explained in Section 9.3, it is sufficient for us to assume that a distortion measure is normal.

Definition 9.7  Let $\hat x^*$ minimize $E\, d(X, \hat x)$ over all $\hat x \in \hat{\mathcal{X}}$, and define

$D_{max} = E\, d(X, \hat x^*). \qquad (9.17)$

$\hat x^*$ is the best estimate of $X$ if we know nothing about $X$, and $D_{max}$ is the minimum expected distortion between $X$ and a constant estimate of $X$. The significance of $D_{max}$ can be seen by taking the reproduction sequence $\hat{\mathbf{X}}$ to be $(\hat x^*, \hat x^*, \ldots, \hat x^*)$. Since the $d(X_k, \hat x^*)$ are i.i.d., by the weak law of large numbers

$d(\mathbf{X}, \hat{\mathbf{X}}) = \frac{1}{n}\sum_{k=1}^{n} d(X_k, \hat x^*) \to E\, d(X, \hat x^*) = D_{max}$

in probability, i.e., for any $\epsilon > 0$,

$\Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) > D_{max} + \epsilon\} \le \epsilon \qquad (9.19)$

for sufficiently large $n$. Note that $\hat{\mathbf{X}}$ is a constant sequence which does not depend on $\mathbf{X}$. In other words, even when no description of $\mathbf{X}$ is available, we can still achieve an average distortion no more than $D_{max} + \epsilon$ with probability arbitrarily close to 1 when $n$ is sufficiently large.

The notation $D_{max}$ may seem confusing because the quantity stands for the minimum rather than the maximum expected distortion between $X$ and a constant estimate of $X$. We see from the above discussion, however, that the notation is in fact appropriate because $D_{max}$ is the maximum distortion we ever need to be concerned about. Specifically, it is not meaningful to impose a constraint $D \ge D_{max}$ on the reproduction sequence, because such a distortion can be achieved even without any knowledge about the source sequence.
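Computing $D_{max}$ is a one-line minimization over constant estimates, as in this sketch (an illustration, not from the text):

```python
def d_max(p, d, reproduction_alphabet):
    """D_max = min over constant estimates xhat of E[d(X, xhat)] (Definition 9.7).
    p maps source symbols to probabilities; d(x, xhat) is the distortion measure."""
    expected = lambda xh: sum(px * d(x, xh) for x, px in p.items())
    best = min(reproduction_alphabet, key=expected)
    return expected(best), best

hamming = lambda a, b: 0 if a == b else 1
# A binary source with Pr{X = 0} = 0.7: the best constant guess is 0 and D_max = 0.3.
print(d_max({0: 0.7, 1: 0.3}, hamming, [0, 1]))
```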
9.2 THE RATE DISTORTION FUNCTION $R(D)$

Throughout this chapter, all the discussions are with respect to an i.i.d. information source $\{X_k, k \ge 1\}$ with generic random variable $X$ and a distortion measure $d$. All logarithms are in base 2 unless otherwise specified.

[Figure 9.1. A rate distortion code with block length $n$: source sequence $\mathbf{X}$, encoder, index $f(\mathbf{X})$, decoder, reproduction sequence $\hat{\mathbf{X}}$.]

Definition 9.8  An $(n, M)$ rate distortion code is defined by an encoding function

$f: \mathcal{X}^n \to \{1, 2, \ldots, M\} \qquad (9.20)$

and a decoding function

$g: \{1, 2, \ldots, M\} \to \hat{\mathcal{X}}^n. \qquad (9.21)$

The set $\{1, 2, \ldots, M\}$, denoted by $\mathcal{I}$, is called the index set. The reproduction sequences $g(1), g(2), \ldots, g(M)$ in $\hat{\mathcal{X}}^n$ are called codewords, and the set of codewords is called the codebook. Figure 9.1 is an illustration of a rate distortion code.

Definition 9.9  The rate of an $(n, M)$ rate distortion code is $n^{-1}\log M$ in bits per symbol.

Definition 9.10  A rate distortion pair $(R, D)$ is asymptotically achievable if for any $\epsilon > 0$, there exists for sufficiently large $n$ an $(n, M)$ rate distortion code such that

$\frac{1}{n}\log M \le R + \epsilon \qquad (9.22)$

and

$\Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) > D + \epsilon\} \le \epsilon, \qquad (9.23)$

where $\hat{\mathbf{X}} = g(f(\mathbf{X}))$. For brevity, an asymptotically achievable pair will be referred to as an achievable pair.

Remark  It is clear from the definition that if $(R, D)$ is achievable, then $(R', D)$ and $(R, D')$ are also achievable for all $R' \ge R$ and $D' \ge D$.

Definition 9.11  The rate distortion region is the subset of $\Re^2$ containing all achievable pairs $(R, D)$.

Theorem 9.12  The rate distortion region is closed and convex.

Proof  We first show that the rate distortion region is closed. Consider achievable rate distortion pairs $(R^{(k)}, D^{(k)})$ such that

$\lim_{k \to \infty} (R^{(k)}, D^{(k)}) = (R, D). \qquad (9.24)$

Then for any $\epsilon > 0$ and any $k$, there exists for sufficiently large $n$ an $(n, M^{(k)})$ code such that

$\frac{1}{n}\log M^{(k)} \le R^{(k)} + \epsilon \quad \text{and} \quad \Pr\{d(\mathbf{X}, \hat{\mathbf{X}}^{(k)}) > D^{(k)} + \epsilon\} \le \epsilon,$

where $f^{(k)}$ and $g^{(k)}$ are respectively the encoding function and the decoding function of the $(n, M^{(k)})$ code, and $\hat{\mathbf{X}}^{(k)} = g^{(k)}(f^{(k)}(\mathbf{X}))$. By virtue of (9.24), let $k(\epsilon)$ be an integer such that for all $k > k(\epsilon)$,

$|R - R^{(k)}| < \epsilon \quad \text{and} \quad |D - D^{(k)}| < \epsilon,$

which imply $R^{(k)} < R + \epsilon$ and $D^{(k)} < D + \epsilon$, respectively. Then for all $k > k(\epsilon)$,

$\frac{1}{n}\log M^{(k)} \le R^{(k)} + \epsilon < R + 2\epsilon$

and

$\Pr\{d(\mathbf{X}, \hat{\mathbf{X}}^{(k)}) > D + 2\epsilon\} \le \Pr\{d(\mathbf{X}, \hat{\mathbf{X}}^{(k)}) > D^{(k)} + \epsilon\} \le \epsilon,$

where the first inequality follows because $D + 2\epsilon > D^{(k)} + \epsilon$. Since $\epsilon$ is arbitrary, we see that $(R, D)$ is also achievable. Thus we have proved that the rate distortion region is closed.
We will prove the convexity of the rate distortion region by a time-sharing argument whose idea is the following. Roughly speaking, if we can use a code $\mathcal{C}_1$ to achieve $(R^{(1)}, D^{(1)})$ and a code $\mathcal{C}_2$ to achieve $(R^{(2)}, D^{(2)})$, then for any rational number $\lambda$ between 0 and 1, we can use $\mathcal{C}_1$ for a fraction $\lambda$ of the time and use $\mathcal{C}_2$ for a fraction $\bar\lambda = 1 - \lambda$ of the time to achieve $(R^{(\lambda)}, D^{(\lambda)})$, where

$R^{(\lambda)} = \lambda R^{(1)} + \bar\lambda R^{(2)} \quad \text{and} \quad D^{(\lambda)} = \lambda D^{(1)} + \bar\lambda D^{(2)}.$

Since the rate distortion region is closed as we have proved, $\lambda$ can then be taken to be any real number between 0 and 1, and the convexity of the region follows.

We now give a formal proof of the convexity of the rate distortion region. Let

$\lambda = \frac{r}{r + s},$

where $r$ and $s$ are positive integers; then $\lambda$ is a rational number between 0 and 1. We prove that if $(R^{(1)}, D^{(1)})$ and $(R^{(2)}, D^{(2)})$ are achievable, then $(R^{(\lambda)}, D^{(\lambda)})$ is also achievable. Assume $(R^{(1)}, D^{(1)})$ and $(R^{(2)}, D^{(2)})$ are achievable. Then for any $\epsilon > 0$ and sufficiently large $n$, there exist an $(n, M^{(1)})$ code and an $(n, M^{(2)})$ code such that

$\frac{1}{n}\log M^{(i)} \le R^{(i)} + \epsilon \qquad (9.38)$

and

$\Pr\{d(\mathbf{X}, \hat{\mathbf{X}}^{(i)}) > D^{(i)} + \epsilon\} \le \epsilon \qquad (9.39)$

for $i = 1, 2$. Let

$M(\lambda) = \big(M^{(1)}\big)^{r} \big(M^{(2)}\big)^{s}$

and $n(\lambda) = (r + s)n$. We now construct an $(n(\lambda), M(\lambda))$ code by concatenating $r$ copies of the $(n, M^{(1)})$ code followed by $s$ copies of the $(n, M^{(2)})$ code. We call these $r + s$ codes the subcodes of the $(n(\lambda), M(\lambda))$ code. For this code, let

$\mathbf{X} = (\mathbf{X}(1), \mathbf{X}(2), \ldots, \mathbf{X}(r+s)) \quad \text{and} \quad \hat{\mathbf{X}} = (\hat{\mathbf{X}}(1), \hat{\mathbf{X}}(2), \ldots, \hat{\mathbf{X}}(r+s)),$

where $\mathbf{X}(j)$ and $\hat{\mathbf{X}}(j)$ are the source sequence and the reproduction sequence of the $j$th subcode, respectively. Then for this $(n(\lambda), M(\lambda))$ code,

$\frac{1}{n(\lambda)}\log M(\lambda) = \frac{1}{(r+s)n}\big(r\log M^{(1)} + s\log M^{(2)}\big) = \lambda\,\frac{1}{n}\log M^{(1)} + \bar\lambda\,\frac{1}{n}\log M^{(2)} \le \lambda\big(R^{(1)} + \epsilon\big) + \bar\lambda\big(R^{(2)} + \epsilon\big) = R^{(\lambda)} + \epsilon,$

where the inequality follows from (9.38), and

$\Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) > D^{(\lambda)} + \epsilon\} = \Pr\Big\{\frac{1}{r+s}\sum_{j=1}^{r+s} d(\mathbf{X}(j), \hat{\mathbf{X}}(j)) > D^{(\lambda)} + \epsilon\Big\} \le \Pr\Big\{d(\mathbf{X}(j), \hat{\mathbf{X}}(j)) > D^{(1)} + \epsilon \text{ for some } 1 \le j \le r, \text{ or } d(\mathbf{X}(j), \hat{\mathbf{X}}(j)) > D^{(2)} + \epsilon \text{ for some } r+1 \le j \le r+s\Big\} \le \sum_{j=1}^{r}\Pr\{d(\mathbf{X}(j), \hat{\mathbf{X}}(j)) > D^{(1)} + \epsilon\} + \sum_{j=r+1}^{r+s}\Pr\{d(\mathbf{X}(j), \hat{\mathbf{X}}(j)) > D^{(2)} + \epsilon\} \le (r+s)\epsilon,$

where the second inequality follows from the union bound and the last from (9.39). Hence, we conclude that the rate distortion pair $(R^{(\lambda)}, D^{(\lambda)})$ is achievable. This completes the proof of the theorem.

Definition 9.13  The rate distortion function $R(D)$ is the minimum of all rates $R$ for a given distortion $D$ such that $(R, D)$ is achievable.

Definition 9.14  The distortion rate function $D(R)$ is the minimum of all distortions $D$ for a given rate $R$ such that $(R, D)$ is achievable.

Both the functions $R(D)$ and $D(R)$ are equivalent descriptions of the boundary of the rate distortion region. They are sufficient to describe the rate distortion region because the region is closed. Note that in defining $R(D)$, the minimum instead of the infimum is taken because for a fixed $D$, the set of all $R$ such that $(R, D)$ is achievable is closed and lower bounded by zero. Similarly, the minimum instead of the infimum is taken in defining $D(R)$. In the subsequent discussions, only $R(D)$ will be used.

Theorem 9.15  The following properties hold for the rate distortion function $R(D)$:
1. $R(D)$ is non-increasing in $D$.
2. $R(D)$ is convex.
3. $R(D) = 0$ for $D \ge D_{max}$.
4. $R(0) \le H(X)$.

Proof  From the remark following Definition 9.10, since $(R(D), D)$ is achievable, $(R(D), D')$ is also achievable for all $D' \ge D$. Therefore, $R(D) \ge R(D')$ because $R(D')$ is the minimum of all $R$ such that $(R, D')$ is achievable. This proves Property 1.

Property 2 follows immediately from the convexity of the rate distortion region, which was proved in Theorem 9.12.

From the discussion toward the end of the last section, we see that for any $\epsilon > 0$, it is possible to achieve

$\Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) > D_{max} + \epsilon\} \le \epsilon \qquad (9.54)$

for sufficiently large $n$ with no description of $\mathbf{X}$ available. Therefore, $(0, D)$ is achievable for all $D \ge D_{max}$, proving Property 3.

Property 4 is a consequence of the assumption that the distortion measure $d$ is normal, which can be seen as follows. By the source coding theorem, for any $\epsilon > 0$, by using a rate no more than $H(X) + \epsilon$, we can describe the source sequence $\mathbf{X}$ of length $n$ with probability of error less than $\epsilon$ when $n$ is sufficiently large. Since $d$ is normal, for each $k \ge 1$, let $\hat X_k = \hat x^*(X_k)$ (cf. Definition 9.5), so that whenever an error does not occur,

$d(X_k, \hat X_k) = d(X_k, \hat x^*(X_k)) = 0$

by (9.8) for each $k$, and hence

$d(\mathbf{X}, \hat{\mathbf{X}}) = \frac{1}{n}\sum_{k=1}^{n} d(X_k, \hat X_k) = 0.$

Therefore,

$\Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) > \epsilon\} \le \epsilon,$

which shows that the pair $(H(X), 0)$ is achievable. This in turn implies that $R(0) \le H(X)$ because $R(0)$ is the minimum of all $R$ such that $(R, 0)$ is achievable.

Figure 9.2 is an illustration of a rate distortion function $R(D)$; the reader should note the four properties in Theorem 9.15.

[Figure 9.2. A rate distortion function $R(D)$: the curve starts at $R(0) \le H(X)$, decreases convexly, and reaches zero at $D_{max}$; the rate distortion region is the set of pairs on or above the curve.]

The rate distortion theorem, which will be stated in the next section, gives a characterization of $R(D)$.

9.3 RATE DISTORTION THEOREM

Definition 9.16  For $D \ge 0$, the information rate distortion function is defined by

$R_I(D) = \min_{\hat X:\; E d(X, \hat X) \le D} I(X; \hat X). \qquad (9.59)$
In defining $R_I(D)$, the minimization is taken over all random variables $\hat X$ jointly distributed with $X$ such that

$E\, d(X, \hat X) \le D. \qquad (9.60)$

Since $p(x)$ is given, the minimization is taken over the set of all conditional distributions $p(\hat x | x)$ for which (9.60) is satisfied, namely the set

$\Big\{ p(\hat x | x) : \sum_x \sum_{\hat x} p(x)\, p(\hat x | x)\, d(x, \hat x) \le D \Big\}. \qquad (9.61)$

Since this set is compact in $\Re^{|\mathcal{X}||\hat{\mathcal{X}}|}$ and $I(X; \hat X)$ is a continuous functional of $p(\hat x | x)$, the minimum value of $I(X; \hat X)$ can be attained. (The assumption that both $\mathcal{X}$ and $\hat{\mathcal{X}}$ are finite is essential in this argument.) This justifies taking the minimum instead of the infimum in the definition of $R_I(D)$.

We have seen in Section 9.1 that for any distortion measure $d$ we can obtain a normalization $\tilde d$ with

$E\, d(X, \hat X) = E\, \tilde d(X, \hat X) + \Delta \qquad (9.62)$

for any $\hat X$, where $\Delta$ is a constant which depends only on $p(x)$ and $d$. Thus if $d$ is not normal, we can always replace $d$ by $\tilde d$ and $D$ by $D - \Delta$ in the definition of $R_I(D)$ without changing the minimization problem. Therefore, we do not lose any generality by assuming that a distortion measure is normal.

Theorem 9.17 (The Rate Distortion Theorem)  $R(D) = R_I(D)$.

The rate distortion theorem, which is the main result in rate distortion theory, says that the minimum coding rate for achieving a distortion $D$ is $R_I(D)$. This theorem will be proved in the next two sections. In the next section, we will prove the converse of this theorem, i.e., $R(D) \ge R_I(D)$, and in Section 9.5, we will prove the achievability of $R_I(D)$, i.e., $R(D) \le R_I(D)$.

In order for $R_I(D)$ to be a characterization of $R(D)$, it has to satisfy the same properties as $R(D)$. In particular, the four properties of $R(D)$ in Theorem 9.15 should also be satisfied by $R_I(D)$.

Theorem 9.18  The following properties hold for the information rate distortion function $R_I(D)$:
1. $R_I(D)$ is non-increasing in $D$.
2. $R_I(D)$ is convex.
3. $R_I(D) = 0$ for $D \ge D_{max}$.
4. $R_I(0) \le H(X)$.

Proof  Referring to the definition of $R_I(D)$ in (9.59), for a larger $D$, the minimization is taken over a larger set. Therefore, $R_I(D)$ is non-increasing in $D$, proving Property 1.

To prove Property 2, consider any $D^{(1)}, D^{(2)} \ge 0$ and let $\lambda$ be any number between 0 and 1. Let $\hat X^{(i)}$ achieve $R_I(D^{(i)})$ for $i = 1, 2$, i.e.,

$R_I(D^{(i)}) = I(X; \hat X^{(i)}),$

where $E\, d(X, \hat X^{(i)}) \le D^{(i)}$, and let $\hat X^{(i)}$ be defined by the transition matrix $p_i(\hat x | x)$. Let $\hat X^{(\lambda)}$ be jointly distributed with $X$ and defined by the transition matrix

$p_\lambda(\hat x | x) = \lambda p_1(\hat x | x) + \bar\lambda p_2(\hat x | x), \qquad (9.65)$

where $\bar\lambda = 1 - \lambda$. Then

$E\, d(X, \hat X^{(\lambda)}) = \sum_{x, \hat x} p(x)\, p_\lambda(\hat x|x)\, d(x,\hat x) = \lambda\, E\, d(X, \hat X^{(1)}) + \bar\lambda\, E\, d(X, \hat X^{(2)}) \le \lambda D^{(1)} + \bar\lambda D^{(2)} = D^{(\lambda)},$

where $D^{(\lambda)} = \lambda D^{(1)} + \bar\lambda D^{(2)}$. Now consider

$\lambda R_I(D^{(1)}) + \bar\lambda R_I(D^{(2)}) = \lambda I(X; \hat X^{(1)}) + \bar\lambda I(X; \hat X^{(2)}) \ge I(X; \hat X^{(\lambda)}) \ge R_I(D^{(\lambda)}),$

where the first inequality follows from the convexity of mutual information with respect to the transition matrix $p(\hat x|x)$ (see Example 6.13), and the second follows from $E\, d(X,\hat X^{(\lambda)}) \le D^{(\lambda)}$ and the definition of $R_I(D)$. Therefore, we have proved Property 2.

To prove Property 3, let $\hat X$ take the value $\hat x^*$ as defined in Definition 9.7 with probability 1. Then $I(X; \hat X) = 0$ and

$E\, d(X, \hat X) = E\, d(X, \hat x^*) = D_{max}.$

Then for $D \ge D_{max}$, $R_I(D) \le I(X; \hat X) = 0$. On the other hand, since $R_I(D)$ is nonnegative, we conclude that $R_I(D) = 0$. This proves Property 3.

Finally, to prove Property 4, we let

$\hat X = \hat x^*(X),$

where $\hat x^*(x)$ is defined in Definition 9.5. Then

$E\, d(X, \hat X) = E\, d(X, \hat x^*(X)) = \sum_x p(x)\, d(x, \hat x^*(x)) = 0$

by (9.8), since we assume that $d$ is a normal distortion measure. Moreover,

$I(X; \hat X) \le H(X).$

Then Property 4 and hence the theorem is proved.
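Definition 9.16 is a finite-dimensional minimization and can be approximated numerically even by brute force. The sketch below is an editorial illustration, not an algorithm from the text: it grid-searches the conditional distributions $p(\hat x|x)$ for a binary source with the Hamming distortion measure, and the output can be checked against the closed form derived in the example below.

```python
import numpy as np
from itertools import product

def mutual_information(p_x, Q):
    """I(X; X_hat) in bits, for input distribution p_x and transition matrix Q[x, xhat]."""
    joint = p_x[:, None] * Q
    p_xhat = joint.sum(axis=0)
    mask = joint > 0
    return float((joint[mask] *
                  np.log2(joint[mask] / (p_x[:, None] * p_xhat[None, :])[mask])).sum())

def R_I_grid(p_x, d, D, steps=200):
    """Crude grid search for R_I(D) = min { I(X; X_hat) : E d(X, X_hat) <= D }
    over binary-input, binary-output conditional distributions (Definition 9.16)."""
    best = np.inf
    grid = np.linspace(0.0, 1.0, steps + 1)
    for a, b in product(grid, repeat=2):          # a = Q(xhat=1 | x=0), b = Q(xhat=1 | x=1)
        Q = np.array([[1 - a, a], [1 - b, b]])
        if (p_x[:, None] * Q * d).sum() <= D:     # the constraint E d(X, X_hat) <= D
            best = min(best, mutual_information(p_x, Q))
    return best

p_x = np.array([0.5, 0.5])
d = np.array([[0.0, 1.0], [1.0, 0.0]])            # Hamming distortion measure
print(round(R_I_grid(p_x, d, D=0.25), 3))          # about 0.189 = 1 - h_b(0.25)
```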
Corollary 9.19  If $R_I(0) > 0$, then $R_I(D)$ is strictly decreasing for $0 \le D \le D_{max}$, and the inequality constraint in Definition 9.16 for $R_I(D)$ can be replaced by an equality constraint.

Proof  Assume that $R_I(0) > 0$. We first show by contradiction that $R_I(D) > 0$ for $0 \le D < D_{max}$. Suppose $R_I(D') = 0$ for some $0 \le D' < D_{max}$, and let $R_I(D')$ be achieved by some $\hat X$. Then

$R_I(D') = I(X; \hat X) = 0$

implies that $X$ and $\hat X$ are independent, or

$p(x, \hat x) = p(x)\, p(\hat x)$

for all $x$ and $\hat x$. It follows that

$D' \ge E\, d(X, \hat X) = \sum_{\hat x}\sum_{x} p(x)\, p(\hat x)\, d(x, \hat x) = \sum_{\hat x} p(\hat x)\, E\, d(X, \hat x) \ge \sum_{\hat x} p(\hat x)\, E\, d(X, \hat x^*) = \sum_{\hat x} p(\hat x)\, D_{max} = D_{max},$

where $\hat x^*$ and $D_{max}$ are defined in Definition 9.7. This is a contradiction because we have assumed that $D' < D_{max}$. Therefore, we conclude that $R_I(D) > 0$ for $0 \le D < D_{max}$.

Since $R_I(0) > 0$ and $R_I(D_{max}) = 0$, and $R_I(D)$ is non-increasing and convex by the above theorem, $R_I(D)$ must be strictly decreasing for $0 \le D \le D_{max}$. We now prove by contradiction that the inequality constraint in Definition 9.16 for $R_I(D)$ can be replaced by an equality constraint. Assume that $R_I(D)$ is achieved by some $\hat X^*$ such that

$E\, d(X, \hat X^*) = D'' < D \le D_{max}.$

Then

$R_I(D'') = \min_{\hat X:\, E d(X,\hat X)\le D''} I(X; \hat X) \le I(X; \hat X^*) = R_I(D).$

This is a contradiction because $R_I(D)$ is strictly decreasing for $0 \le D \le D_{max}$. Hence,

$E\, d(X, \hat X^*) = D.$

This implies that the inequality constraint in Definition 9.16 for $R_I(D)$ can be replaced by an equality constraint.

Remark  In all problems of interest, $R(0) = R_I(0) > 0$. Otherwise, $R(D) = 0$ for all $D \ge 0$ because $R(D)$ is nonnegative and non-increasing.

Example 9.20 (Binary source)  Let $X$ be a binary random variable with

$\Pr\{X = 0\} = 1 - \gamma \quad \text{and} \quad \Pr\{X = 1\} = \gamma.$

Let $\hat{\mathcal{X}} = \{0, 1\}$ be the reproduction alphabet for $X$, and let $d$ be the Hamming distortion measure. We first consider the case $0 \le \gamma \le 1/2$. If we make a guess on the value of $X$, we should guess 0 in order to minimize the expected distortion. Therefore, $\hat x^* = 0$ and

$D_{max} = E\, d(X, 0) = \Pr\{X = 1\} = \gamma.$

We will show that for $0 \le D < \gamma$,

$R_I(D) = h_b(\gamma) - h_b(D). \qquad (9.102)$

Let $\hat X$ be an estimate of $X$ taking values in $\hat{\mathcal{X}}$, and let $Z$ be the Hamming distortion between $X$ and $\hat X$, i.e., $Z = d(X, \hat X)$. Observe that conditioning on $\hat X$, $X$ and $Z$ determine each other. Therefore,

$H(X | \hat X) = H(Z | \hat X).$

Then for $D < \gamma = D_{max}$ and any $\hat X$ such that

$E\, d(X, \hat X) \le D, \qquad (9.105)$

we have

$I(X; \hat X) = H(X) - H(X|\hat X) = h_b(\gamma) - H(Z|\hat X) \ge h_b(\gamma) - H(Z) \ge h_b(\gamma) - h_b(D), \qquad (9.110)$

where the last inequality is justified because

$\Pr\{Z = 1\} = \Pr\{X \ne \hat X\} = E\, d(X, \hat X) \le D$

and $h_b(a)$ is increasing for $0 \le a \le 1/2$. Minimizing over all $\hat X$ satisfying (9.105) in (9.110), we obtain the lower bound

$R_I(D) \ge h_b(\gamma) - h_b(D). \qquad (9.112)$

To show that this lower bound is achievable, we need to construct an $\hat X$ such that both inequalities in (9.110) are tight. The tightness of the last inequality simply says that

$\Pr\{Z = 1\} = \Pr\{X \ne \hat X\} = D, \qquad (9.113)$

while the tightness of the first inequality says that $Z$ should be independent of $\hat X$. It would be difficult to make $Z$ independent of $\hat X$ if we specify $\hat X$ by $p(\hat x | x)$. Instead, we specify the joint distribution of $X$ and $\hat X$ by means of a reverse binary symmetric channel (BSC) with crossover probability $D$, as shown in Figure 9.3.

[Figure 9.3. Achieving $R_I(D)$ for a binary source via a reverse binary symmetric channel with crossover probability $D$.]
Here, we regard $\hat X$ as the input and $X$ as the output of the BSC. Then $Z$ is independent of the input $\hat X$ because the error event is independent of the input for a BSC, and (9.113) is satisfied by setting the crossover probability to $D$. However, we need to ensure that the marginal distribution of $X$ so specified is equal to $p(x)$. Toward this end, we let

$\Pr\{\hat X = 0\} = \alpha$

and consider

$\gamma = \Pr\{X = 1\} = \Pr\{\hat X = 0\}\,D + \Pr\{\hat X = 1\}(1 - D) = \alpha D + (1 - \alpha)(1 - D),$

which gives

$\alpha = \frac{1 - D - \gamma}{1 - 2D}.$

Since $0 \le D < \gamma \le 1/2$, we have $1 - D - \gamma > 0$ and $1 - 2D > 0$, so $\alpha > 0$; and $D < \gamma$ gives $1 - D - \gamma < 1 - 2D$, so $\alpha < 1$. Hence, $\alpha$ is a valid probability, and we have shown that the lower bound on $I(X; \hat X)$ in (9.110) can be achieved. Therefore $R_I(D)$ is as given in (9.102).

For $1/2 \le \gamma \le 1$, by exchanging the roles of the symbols 0 and 1 in the above argument, we obtain $R_I(D)$ as in (9.102) except that $\gamma$ is replaced by $1 - \gamma$. Combining the two cases, we have

$R_I(D) = h_b(\gamma) - h_b(D)$ if $0 \le D < \min(\gamma, 1-\gamma)$, and $R_I(D) = 0$ if $D \ge \min(\gamma, 1-\gamma)$, $\qquad (9.124)$

for $0 \le \gamma \le 1$. The function $R_I(D)$ for the uniform binary source is illustrated in Figure 9.4.

[Figure 9.4. The function $R_I(D)$ for the uniform binary source with the Hamming distortion measure: it decreases from 1 at $D = 0$ to 0 at $D = 0.5$.]

Remark  In the above example, $R_I(0) = h_b(\gamma) = H(X)$. Then by the rate distortion theorem, $H(X)$ is the minimum rate of a rate distortion code which achieves an arbitrarily small average Hamming distortion. It is tempting to regard this special case of the rate distortion theorem as a version of the source coding theorem and to conclude that the rate distortion theorem is a generalization of the source coding theorem. However, this is incorrect, because the rate distortion theorem only guarantees that the average Hamming distortion between $\mathbf{X}$ and $\hat{\mathbf{X}}$ is small with probability arbitrarily close to 1, whereas the source coding theorem guarantees that $\mathbf{X} = \hat{\mathbf{X}}$ with probability arbitrarily close to 1, which is much stronger.
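The closed form in (9.124) is easy to evaluate; a small sketch (an illustration, not from the text):

```python
from math import log2

def h_b(p):
    """Binary entropy function in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def rate_distortion_binary(D, gamma):
    """R_I(D) for a binary source with Pr{X=1} = gamma under the Hamming
    distortion measure: h_b(gamma) - h_b(D) for D < min(gamma, 1-gamma), else 0."""
    if D < min(gamma, 1.0 - gamma):
        return h_b(gamma) - h_b(D)
    return 0.0

for D in (0.0, 0.1, 0.25, 0.5):
    print(D, round(rate_distortion_binary(D, gamma=0.5), 3))   # 1.0, 0.531, 0.189, 0.0
```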
It is in general not possible to obtain the rate distortion function in closed form, and we have to resort to numerical computation. In Chapter 10, we will discuss the Blahut-Arimoto algorithm for computing the rate distortion function.

9.4 THE CONVERSE

In this section, we prove that the rate distortion function $R(D)$ is lower bounded by the information rate distortion function $R_I(D)$, i.e., $R(D) \ge R_I(D)$. Specifically, we will prove that for any achievable rate distortion pair $(R, D)$, $R \ge R_I(D)$. Then by fixing $D$ and minimizing $R$ over all achievable pairs $(R, D)$, we conclude that $R(D) \ge R_I(D)$.

Let $(R, D)$ be any achievable rate distortion pair. Then for any $\epsilon > 0$, there exists for sufficiently large $n$ an $(n, M)$ code such that

$\frac{1}{n}\log M \le R + \epsilon \qquad (9.125)$

and

$\Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) > D + \epsilon\} \le \epsilon, \qquad (9.126)$

where $\hat{\mathbf{X}} = g(f(\mathbf{X}))$. Then

$n(R + \epsilon) \ge \log M \ge H(f(\mathbf{X})) \ge H(g(f(\mathbf{X}))) = H(\hat{\mathbf{X}}) \ge I(\mathbf{X}; \hat{\mathbf{X}}) = H(\mathbf{X}) - H(\mathbf{X} | \hat{\mathbf{X}}) = \sum_{k=1}^{n} H(X_k) - \sum_{k=1}^{n} H(X_k | \hat{\mathbf{X}}, X_1, \ldots, X_{k-1}) \ge \sum_{k=1}^{n}\big[H(X_k) - H(X_k | \hat X_k)\big] = \sum_{k=1}^{n} I(X_k; \hat X_k) \ge \sum_{k=1}^{n} R_I\big(E\, d(X_k, \hat X_k)\big) = n\Big[\frac{1}{n}\sum_{k=1}^{n} R_I\big(E\, d(X_k, \hat X_k)\big)\Big] \ge n\, R_I\Big(\frac{1}{n}\sum_{k=1}^{n} E\, d(X_k, \hat X_k)\Big) = n\, R_I\big(E\, d(\mathbf{X}, \hat{\mathbf{X}})\big). \qquad (9.141)$

In the above, the first inequality follows from (9.125); the bound $H(X_k | \hat{\mathbf{X}}, X_1, \ldots, X_{k-1}) \le H(X_k | \hat X_k)$ holds because conditioning does not increase entropy; the bound $I(X_k; \hat X_k) \ge R_I(E\, d(X_k, \hat X_k))$ follows from the definition of $R_I$ in Definition 9.16; and the last inequality follows from the convexity of $R_I$ proved in Theorem 9.18 together with Jensen's inequality.

Now let

$d_{max} = \max_{x, \hat x} d(x, \hat x) \qquad (9.142)$

be the maximum value which can be taken by the distortion measure $d$. (The reader should not confuse $d_{max}$ with $D_{max}$ in Definition 9.7.) From (9.126), we have

$E\, d(\mathbf{X}, \hat{\mathbf{X}}) = E\big[d(\mathbf{X}, \hat{\mathbf{X}}) \,\big|\, d(\mathbf{X}, \hat{\mathbf{X}}) > D + \epsilon\big]\Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) > D + \epsilon\} + E\big[d(\mathbf{X}, \hat{\mathbf{X}}) \,\big|\, d(\mathbf{X}, \hat{\mathbf{X}}) \le D + \epsilon\big]\Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) \le D + \epsilon\} \le d_{max}\cdot\epsilon + (D + \epsilon)\cdot 1 = D + (1 + d_{max})\epsilon. \qquad (9.145)$

This shows that if the probability that the average distortion between $\mathbf{X}$ and $\hat{\mathbf{X}}$ exceeds $D + \epsilon$ is small, then the expected average distortion between $\mathbf{X}$ and $\hat{\mathbf{X}}$ can exceed $D$ only by a small amount. (The converse is not true.)

Following (9.141), we have

$R + \epsilon \ge R_I\big(E\, d(\mathbf{X}, \hat{\mathbf{X}})\big) \ge R_I\big(D + (1 + d_{max})\epsilon\big),$

where the last inequality follows from (9.145) because $R_I(D)$ is non-increasing in $D$. We note that the convexity of $R_I(D)$ implies that it is a continuous function of $D$. Then taking the limit as $\epsilon \to 0$, we obtain

$R \ge \lim_{\epsilon \to 0} R_I\big(D + (1 + d_{max})\epsilon\big) = R_I(D), \qquad (9.150)$

where we have invoked the continuity of $R_I(D)$ in obtaining the equality. Upon minimizing $R$ over all achievable pairs $(R, D)$ for a fixed $D$, we have proved that

$R(D) \ge R_I(D).$

This completes the proof of the converse of the rate distortion theorem.

9.5 ACHIEVABILITY OF $R_I(D)$

In this section, we prove that the rate distortion function $R(D)$ is upper bounded by the information rate distortion function $R_I(D)$, i.e., $R(D) \le R_I(D)$. Combining this with the result $R(D) \ge R_I(D)$ from the last section, we conclude that $R(D) = R_I(D)$, and the rate distortion theorem is proved.

For any $0 \le D \le D_{max}$, we will prove that for every random variable $\hat X$ taking values in $\hat{\mathcal{X}}$ such that

$E\, d(X, \hat X) \le D, \qquad (9.152)$

the rate distortion pair $(I(X; \hat X), D)$ is achievable. This will be proved by showing for sufficiently large $n$ the existence of a rate distortion code such that

1. the rate of the code is not more than $I(X; \hat X) + \epsilon$;
2. $d(\mathbf{X}, \hat{\mathbf{X}}) \le D + \epsilon$ with probability almost 1.

Then by minimizing $I(X; \hat X)$ over all $\hat X$ satisfying (9.152), we conclude that the rate distortion pair $(R_I(D), D)$ is achievable, which implies $R(D) \le R_I(D)$ because $R(D)$ is the minimum of all $R$ such that $(R, D)$ is achievable.

Fix any $0 \le D \le D_{max}$ and any $\epsilon > 0$, and let $\delta$ be a small positive quantity to be specified later. Toward proving the existence of a desired code, we fix a random variable $\hat X$ which satisfies (9.152) and let $M$ be an integer satisfying

$I(X; \hat X) + \frac{\epsilon}{2} \le \frac{1}{n}\log M \le I(X; \hat X) + \epsilon, \qquad (9.153)$

where $n$ is sufficiently large. We now describe a random coding scheme in the following steps:

1. Construct a codebook $\mathcal{C}$ of an $(n, M)$ code by randomly generating $M$ codewords in $\hat{\mathcal{X}}^n$ independently and identically according to $p(\hat x)^n$. Denote these codewords by $\hat{\mathbf{X}}(1), \hat{\mathbf{X}}(2), \ldots, \hat{\mathbf{X}}(M)$.
2. Reveal the codebook $\mathcal{C}$ to both the encoder and the decoder.
3. The source sequence $\mathbf{X}$ is generated according to $p(x)^n$.
4. The encoder encodes the source sequence $\mathbf{X}$ into an index $K$ in the set $\mathcal{I} = \{1, 2, \ldots, M\}$. The index $K$ takes the value $k$ if
   a) $(\mathbf{X}, \hat{\mathbf{X}}(k)) \in T^n_{[X\hat X]\delta}$, and
   b) for all $k' \in \mathcal{I}$, if $(\mathbf{X}, \hat{\mathbf{X}}(k')) \in T^n_{[X\hat X]\delta}$, then $k' \le k$;
   otherwise, $K$ takes the constant value 1.
5. The index $K$ is delivered to the decoder.
6. The decoder outputs $\hat{\mathbf{X}}(K)$ as the reproduction sequence $\hat{\mathbf{X}}$.

Remark  Strong typicality is used in defining the encoding function in Step 4. This is made possible by the assumption that both the source alphabet $\mathcal{X}$ and the reproduction alphabet $\hat{\mathcal{X}}$ are finite.
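The random coding scheme above can be mimicked in a few lines of code. The following sketch is a simplified editorial illustration, not the formal construction: it replaces the strong-typicality test of Step 4 with a direct average-distortion test, which is the property that the typicality argument ultimately delivers.

```python
import random

def rd_random_encode(x, p_xhat, R, D_target, d, rng):
    """Generate about 2^{nR} random codewords i.i.d. according to p_xhat and return the
    largest index whose codeword is within average distortion D_target of x (index 1 if none).
    This mirrors Steps 1-6 above with the typicality check replaced by a distortion check."""
    n = len(x)
    M = max(1, int(round(2 ** (n * R))))
    symbols, weights = zip(*p_xhat.items())
    codebook = [tuple(rng.choices(symbols, weights=weights, k=n)) for _ in range(M)]
    for k in range(M, 0, -1):                       # Step 4b: prefer the largest qualifying index
        cw = codebook[k - 1]
        if sum(d(a, b) for a, b in zip(x, cw)) / n <= D_target:
            return k, cw
    return 1, codebook[0]                           # otherwise K = 1

rng = random.Random(7)
hamming = lambda a, b: 0 if a == b else 1
x = [rng.randrange(2) for _ in range(40)]           # a uniform binary source sequence
k, cw = rd_random_encode(x, {0: 0.5, 1: 0.5}, R=0.35, D_target=0.3, d=hamming, rng=rng)
print(k, sum(hamming(a, b) for a, b in zip(x, cw)) / len(x))
```

With the rate chosen above the value $R_I(0.3) \approx 0.12$ for this source, a qualifying codeword is found with high probability, as the analysis below makes precise.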
Let us further explain the encoding scheme described in Step 4. After the source sequence $\mathbf{X}$ is generated, we search through all the codewords in the codebook $\mathcal{C}$ for those which are jointly typical with $\mathbf{X}$ with respect to $p(x, \hat x)$. If there is at least one such codeword, we let $K$ be the largest index of such codewords. If such a codeword does not exist, we let $K = 1$.

The event $\{K = 1\}$ occurs in one of the following two scenarios:
1. $\hat{\mathbf{X}}(1)$ is the only codeword in $\mathcal{C}$ which is jointly typical with $\mathbf{X}$.
2. No codeword in $\mathcal{C}$ is jointly typical with $\mathbf{X}$.

In either scenario, $\mathbf{X}$ is not jointly typical with any of the codewords $\hat{\mathbf{X}}(2), \hat{\mathbf{X}}(3), \ldots, \hat{\mathbf{X}}(M)$. Define the event

$E_k = \big\{(\mathbf{X}, \hat{\mathbf{X}}(k)) \in T^n_{[X\hat X]\delta}\big\} \qquad (9.154)$

that $\mathbf{X}$ is jointly typical with the codeword $\hat{\mathbf{X}}(k)$. Since the codewords are generated i.i.d., the events $E_k$ are mutually independent and all have the same probability. Moreover, from the discussion above,

$\{K = 1\} \subset E_2^c \cap E_3^c \cap \cdots \cap E_M^c,$

so that

$\Pr\{K = 1\} \le \Pr\{E_2^c \cap E_3^c \cap \cdots \cap E_M^c\} = \prod_{k=2}^{M}\Pr\{E_k^c\} = \big(1 - \Pr\{E_1\}\big)^{M-1}. \qquad (9.159)$

We now obtain a lower bound on $\Pr\{E_1\}$ as follows. Since $\mathbf{X}$ and $\hat{\mathbf{X}}(1)$ are generated independently, their joint distribution factorizes as $p(\mathbf{x})\,p(\hat{\mathbf{x}})$. Then

$\Pr\{E_1\} = \sum_{(\mathbf{x}, \hat{\mathbf{x}}) \in T^n_{[X\hat X]\delta}} p(\mathbf{x})\, p(\hat{\mathbf{x}}).$

By the consistency of strong typicality (Theorem 5.7), if $(\mathbf{x}, \hat{\mathbf{x}}) \in T^n_{[X\hat X]\delta}$, then $\mathbf{x} \in T^n_{[X]\delta}$ and $\hat{\mathbf{x}} \in T^n_{[\hat X]\delta}$. By the strong AEP (Theorem 5.2), all $p(\mathbf{x})$ and $p(\hat{\mathbf{x}})$ in the above summation satisfy

$p(\mathbf{x}) \ge 2^{-n(H(X)+\eta)} \quad \text{and} \quad p(\hat{\mathbf{x}}) \ge 2^{-n(H(\hat X)+\xi)},$

where $\eta, \xi \to 0$ as $\delta \to 0$. Again by the strong AEP, the number of pairs in $T^n_{[X\hat X]\delta}$ is at least $(1 - \delta)\, 2^{n(H(X, \hat X) - \zeta)}$, where $\zeta \to 0$ as $\delta \to 0$. Then

$\Pr\{E_1\} \ge (1 - \delta)\, 2^{n(H(X,\hat X) - \zeta)}\, 2^{-n(H(X)+\eta)}\, 2^{-n(H(\hat X)+\xi)} = (1 - \delta)\, 2^{-n(I(X;\hat X) + \tau)}, \qquad (9.168)$

where $\tau = \zeta + \eta + \xi \to 0$ as $\delta \to 0$. Following (9.159), we have

$\Pr\{K = 1\} \le \Big(1 - (1-\delta)\, 2^{-n(I(X;\hat X)+\tau)}\Big)^{M-1}. \qquad (9.170)$

The lower bound in (9.153) implies $M \ge 2^{n(I(X;\hat X) + \epsilon/2)}$. Then, taking the natural logarithm in (9.170) and using the fundamental inequality $\ln a \le a - 1$,

$\ln \Pr\{K = 1\} \le (M - 1)\ln\Big(1 - (1-\delta)2^{-n(I(X;\hat X)+\tau)}\Big) \le -(M-1)(1-\delta)\,2^{-n(I(X;\hat X)+\tau)} \le -\Big(2^{n(I(X;\hat X)+\epsilon/2)} - 1\Big)(1-\delta)\,2^{-n(I(X;\hat X)+\tau)} = -(1-\delta)\Big(2^{n(\epsilon/2 - \tau)} - 2^{-n(I(X;\hat X)+\tau)}\Big). \qquad (9.174)$

By letting $\delta$ be sufficiently small so that

$\frac{\epsilon}{2} - \tau > 0, \qquad (9.175)$

the above upper bound on $\ln \Pr\{K=1\}$ tends to $-\infty$ as $n \to \infty$. This implies $\Pr\{K = 1\} \to 0$ as $n \to \infty$, i.e.,

$\Pr\{K = 1\} < \epsilon \qquad (9.176)$

for sufficiently large $n$.

The main idea of the above upper bound on $\Pr\{K = 1\}$ is the following. In constructing the codebook, we randomly generate $M$ codewords in $\hat{\mathcal{X}}^n$ according to $p(\hat x)^n$. If $M$ grows with $n$ at a rate higher than $I(X; \hat X)$, then the probability that there exists at least one codeword which is jointly typical with the source sequence $\mathbf{X}$ with respect to $p(x, \hat x)$ is very high when $n$ is large. Further, the average distortion between $\mathbf{X}$ and such a codeword is close to $E\, d(X, \hat X)$, because the empirical joint distribution of the symbol pairs in $\mathbf{X}$ and such a codeword is close to $p(x, \hat x)$.
Then by letting the reproduction sequence $\hat{\mathbf{X}}$ be such a codeword, the average distortion between $\mathbf{X}$ and $\hat{\mathbf{X}}$ is less than $D + \epsilon$ with probability arbitrarily close to 1, since $E\, d(X, \hat X) \le D$. These ideas are made precise in the rest of the proof.

Now for sufficiently large $n$, consider

$\Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) > D + \epsilon\} = \Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) > D + \epsilon \mid K = 1\}\Pr\{K = 1\} + \Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) > D + \epsilon \mid K \ne 1\}\Pr\{K \ne 1\} \le 1\cdot\epsilon + \Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) > D + \epsilon \mid K \ne 1\}\cdot 1, \qquad (9.179)$

where we have used (9.176). We will show that by choosing the value of $\delta$ carefully, $d(\mathbf{X}, \hat{\mathbf{X}})$ is always at most $D + \epsilon$ provided $K \ne 1$. Since $(\mathbf{X}, \hat{\mathbf{X}}) \in T^n_{[X\hat X]\delta}$ conditioning on $\{K \ne 1\}$, we have

$d(\mathbf{X}, \hat{\mathbf{X}}) = \frac{1}{n}\sum_{k=1}^{n} d(X_k, \hat X_k) = \sum_{(x,\hat x)} \frac{1}{n}N(x, \hat x \mid \mathbf{X}, \hat{\mathbf{X}})\, d(x, \hat x) \le \sum_{(x,\hat x)} p(x,\hat x)\, d(x,\hat x) + \sum_{(x,\hat x)}\Big|\frac{1}{n}N(x, \hat x \mid \mathbf{X}, \hat{\mathbf{X}}) - p(x,\hat x)\Big| d(x,\hat x) \le E\, d(X, \hat X) + d_{max}\,\delta \le D + d_{max}\,\delta, \qquad (9.188)$

where $N(x, \hat x \mid \mathbf{X}, \hat{\mathbf{X}})$ is the number of occurrences of the pair $(x, \hat x)$ in $(\mathbf{X}, \hat{\mathbf{X}})$, $d_{max}$ is defined in (9.142), the second-to-last inequality follows from the definition of strong joint typicality, and the last follows from (9.152). Therefore, by also requiring

$\delta \le \frac{\epsilon}{d_{max}}, \qquad (9.189)$

we obtain $d(\mathbf{X}, \hat{\mathbf{X}}) \le D + \epsilon$ whenever $K \ne 1$. Therefore,

$\Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) > D + \epsilon \mid K \ne 1\} = 0,$

and it follows from (9.179) that

$\Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) > D + \epsilon\} \le \epsilon. \qquad (9.192)$

Thus we have shown that for sufficiently large $n$, there exists an $(n, M)$ random code which satisfies

$\frac{1}{n}\log M \le I(X; \hat X) + \epsilon \qquad (9.193)$

(this follows from the upper bound in (9.153)) and (9.192). This implies the existence of an $(n, M)$ rate distortion code which satisfies (9.193) and (9.192). Therefore, the rate distortion pair $(I(X; \hat X), D)$ is achievable. Then upon minimizing over all $\hat X$ which satisfy (9.152), we conclude that the rate distortion pair $(R_I(D), D)$ is achievable, which implies $R(D) \le R_I(D)$. The proof is completed.

PROBLEMS

1. Obtain the forward channel description of $R(D)$ for the binary source with the Hamming distortion measure.

2. Binary covering radius  The Hamming ball with center $\mathbf{c} \in \{0,1\}^n$ and radius $r$ is the set

$S_r(\mathbf{c}) = \big\{\mathbf{x} \in \{0,1\}^n : d(\mathbf{x}, \mathbf{c}) \le r\big\},$

where $d$ denotes the Hamming distance. Let $M_{r,n}$ be the minimum number $M$ such that there exist Hamming balls $S_r(\mathbf{c}_1), \ldots, S_r(\mathbf{c}_M)$ with the property that every $\mathbf{x} \in \{0,1\}^n$ belongs to $S_r(\mathbf{c}_j)$ for some $j$.
   a) Show that

$M_{r,n} \ge \frac{2^n}{\sum_{k=0}^{r}\binom{n}{k}}.$

   b) What is the relation between $M_{r,n}$ and the rate distortion function for the binary source with the Hamming distortion measure?

3. Consider a source random variable $X$ with the Hamming distortion measure.
   a) Prove that

$R(D) \ge H(X) - D\log(|\hat{\mathcal{X}}| - 1) - h_b(D)$

for $0 \le D \le D_{max}$.
   b) Show that the above lower bound on $R(D)$ is tight if $X$ is distributed uniformly on $\mathcal{X}$.
   See Jerohin [102] (also see [52], p.133) for the tightness of this lower bound for a general source. This bound is a special case of the Shannon lower bound for the rate distortion function [176] (also see [49], p.369).

4. Product source  Let $X$ and $X'$ be two independent source random variables with reproduction alphabets $\hat{\mathcal{X}}$ and $\hat{\mathcal{X}}'$ and distortion measures $d$ and $d'$, and denote the rate distortion functions for $X$ and $X'$ by $R_X(D)$ and $R_{X'}(D)$, respectively.
Now for the product source , define a distor0 þ S P þ TSUP T Î tion measure G by 0i ( V Vþ (þ 0 W þ ( ( 7 Prove0 that the rate distortion function sure is given by W h1Z D R A F T ]_^a` h1[ ¼ " 0XQ VY V þ 8 -0 for 7 Q 7 " with distortion meaQ 8 h September 13, 2001, 6:27pm D R A F T 213 Rate Distortion Theory -0 0 þ þ Hint: Prove that ! independent. (Shannon [176].) -0 $ ! 0 þ " ! þ 0 if and are 0`_ ß 5. Compound source Let \ be an index set and ]^ be a \ Gba collection of source random variables. The random variables in ]þ ^ have a common source alphabet 0 , a common reproduction alphabet , and a 0dc common distortion measure . A compound source is an i.i.d. information ß c , where e is equal to some source whose generic random variable is a 0dc we do not know which one it is. The rate distortion function but \ for has the same definition as the rate distortion function defined in this chapter except that (9.23) is replaced by § 0i _ þ Show that P c Wg_fihkj for all a ® _¶ ^ ö _¶ where "·® ß . \ 0`_ is the rate distortion function for . 6. Show that asymptotic optimality can be achieved by separating rate distortion coding and channel coding when the information source is i.i.d. (with a single-letter distortion measure) and the channel is memoryless. 7. Slepian-Wolf b coding Let ® , and Ú be small positive quantities. For Ï © Þ , randomly and independently select with replacement bºY p n ÿÿ ¾k mlo © à â ãåä according to the uniform distribution to é vectorsò from ºYÿ ÿ l º r l a fixed pair of vectors in à â b ãåä . Prove the be form a bin q . Let º l following by choosing ® , and Ú appropriately: é r a) the probability thaté r b) given thatò such that s r ß is in some q tends to 1 as ¯<Î / ; r , the probability that there exists another 6ßæà â b / ãåä tends to 0 as ¯<Î . º q U V l ß U é q s ¥ ( ut Let . The results in a) and b) say that if is º jointly typical, whichs happens with probability é close to 1 for large ¯ , then s it is évery likely that is in some bin q , and that is the unique vector as side-information, in q which is jointly typical with . If is available s then bybspecifying the index of the bin containing , which takes about s © bits, can be uniquely specified. Note that no knowledge about s ºvÿmlon is involved in specifying the index of the bin containing . This is the basis of the Slepian-Wolf coding [184] which launched the whole area of multiterminal source coding (see Berger [21]). D R A F T September 13, 2001, 6:27pm D R A F T 214 A FIRST COURSE IN INFORMATION THEORY HISTORICAL NOTES Transmission of an information source with distortion was first conceived by Shannon in his 1948 paper [173]. He returned to the problem in 1959 and proved the rate distortion theorem. The normalization of the rate distortion function is due to Pinkston [154]. The rate distortion theorem proved here is a stronger version of the original theorem. Extensions of the theorem to more general sources were proved in the book by Berger [20]. An iterative algorithm for computing the rate distortion function developed by Blahut [27] will be discussed in Chapter 10. Rose [169] has developed an algorithm for the same purpose based on a mapping approach. D R A F T September 13, 2001, 6:27pm D R A F T Chapter 10 THE BLAHUT-ARIMOTO ALGORITHMS For a discrete memoryless channel V ¥ )( , the capacity -0 w .]W C ÊË 7 ! 
(10.1) ÿ 0 where and are respectively the input and the output of the generic channel and A ( is the input distribution, characterizes the maximum asymptotically achievable rate at which information can be transmitted through the channel w reliably. The expression for in (10.1) is called a single-letter characterization because it depends only the transition matrix of the generic channel but for the channel. When both the input alnot on the block length ¯ of a code w P phabet and the output alphabet are finite, the computation of becomes 0 a finite-dimensional maximization problem. x $ » 0 with generic random variable For an i.i.d. information source , the rate distortion function X ] ^a` W 7 c de b6f b X W y ÿ 7 n ÿ -0 ! 0 þ (10.2) gih characterizes the minimum asymptotically achievable rate of a rate distortion code which reproduces the information source with an average0 distortion no more than with respect to a single-letter distortion measure . Again, the 0 expression for in (10.2) is a single-letter characterization because it depends only on the generic random variable but not on the block length ¯ of a rate distortion code. When both the source alphabet and the reproduction þ alphabet are finite, the computation of becomes a finite-dimensional minimization problem. Unless for very special cases, it is not possible to obtain an expression for w or in closed form, and we have to resort to numerical computation. D R A F T September 13, 2001, 6:27pm D R A F T 216 A FIRST COURSE IN INFORMATION THEORY However, computing these quantities is not straightforward because the associated optimization problem is nonlinear. In this chapter, we discuss the Blahut-Arimoto algorithms (henceforth the BA algorithms), which is an iterative algorithm devised for this purpose. In order to better understand how and why the BA algorithm works, we will first describe the algorithm in a general setting in the next section. Specializaw tions of the algorithm for the computation of and will be discussed in Section 10.2, and convergence of the algorithm will be proved in Section 10.3. 10.1 ALTERNATING OPTIMIZATION In this section, we describe an alternating optimization algorithm. This algorithm will be specialized in the next section for computing the channel capacity and the rate distortion function. Consider the double supremum fihkj z|{ { z * ö~} é ¹ ( 1@ fihkj (10.3) * ö~} ¹ ¹ is a convex subset of º for Þ © , and is a function defined on S @ @ @ . The function is bounded from above, and is continuousband has S ß >;¿(b@ continuous partial derivatives on . Further assume that for all , :ß there exists a unique such that where @ ¹ ?> ( @ ¹ ( @ W ]_ÊË { z { and for all j ö~} ¹ ( > @ ( ¹ ( W ]_ÊË z * j ö~} Let (W b@ and (10.4) @ :ß , there exists a unique >@H(W ß @ U * such that @ U 8 (10.5) @ S . Then (10.3) can be written as ¹ ( 8 fhkj z (10.6) ¹ ö~} { * In other words, the supremum of is taken over a subset of{ º º which is * equal to the Cartesian product of two convex subsets of º and º , respec¹ tively. + We now describe an alternating optimization algorithm computing , ( for » @ » x » $ the value of the double supremum in (10.3). Let for ÿ ÿ ÿ Ñ which are defined as follows. Let be an arbitrarily chosen vector in , @ >@H( ÿ x » Ñ Ñ $ and let ÿ . For , ÿ is defined by ÿ ÿ » D R A F T >K¿( @ »¿¾ ÿ September 13, 2001, 6:27pm (10.7) D R A F T 217 The Blahut-Arimoto Algorithms Figure 10.1. Alternating optimization. 
and » >@¶( @ » ÿ @ » » 8 ÿ Ñ @ Ñ (10.8) @ @ @ In words, ÿ and ÿ are generated in the order ÿ other ÿ ÿ ÿ @ ½½½ , where each vector in the sequence is a function of the previous ÿ ÿ Ñ vector except that ÿ is arbitrarily chosen in . Let ¹ ¹ ( » » ÿ 8 ÿ (10.9) Then from (10.4) and (10.5), ¹ » ¹ ( » ¹ ( ÿ ¹ ( ÿ ÿ $ $ ¹ x $ ¹ » »¿¾ » ÿ »'¾ @ » ÿ @ »¿¾ ÿ @ »¿¾ ÿ (10.10) (10.11) ÿ (10.12) (10.13) ¹ ¹ ¹ for . Since the sequence ÿ is non-decreasing, it must converge because » + ¹ is bounded from above. We will show in Section 10.3 that if Î ÿ @ ¹ is concave. Figure 10.1 is an illustration of the alternating ¹maximization » + algorithm, where in this case both ¯ and ¯ are equal to 1, and ÿ Î . The alternating optimization algorithm can be explained by the following analogy. Suppose a hiker wants to reach the summit of a mountain. Starting from a certain point in the mountain, the hiker moves north-south and eastwest alternately. (In our problem, the north-south and east-west directions can be multi-dimensional.) In each move, the hiker moves to the highest possible point. The question is whether the hiker can eventually approach the summit starting from any point in the mountain. D R A F T September 13, 2001, 6:27pm D R A F T 218 A FIRST COURSE IN INFORMATION THEORY ¹ Replacing infimum ¹ by in (10.3), the double supremum becomes the double z|{ ^ `k ö~} { z * ¹ ( 1@ ^ `k * ö~} 8 (10.14) ¹ ¹ @ All the previous assumptions on , , and remain valid except that is now assumed to be bounded from below instead of bounded from above. The double infimum in (10.14) can be¹ computed by the¹ same alternating optimization algorithm. Note that with replaced by , the maximums in (10.4) and (10.5) become minimums, and the inequalities in (10.11) and (10.12) are reversed. 10.2 THE ALGORITHMS In this section, we specialize the alternating optimization algorithm described in the last section to compute the channel capacity and the rate distortion function. The corresponding algorithms are known as the BA algorithms. 10.2.1 CHANNEL CAPACITY an input distribution A ( , and we write WP - if We will use to denote ( is strictly positive, i.e., A for all ( ß . If is not strictly positive, we P $ write . Similar notations will be introduced as appropriate. qoQo¬G"NHtLON 1P - V )( Let A ( ¥ be a given joint distribution on P , and let be a transition matrix from to . Then ]WÊË = = Q 7 A ( ¥ V V )( ±´³¶µ A ( ) ( "= = Q A 7 ( ¥ V )( S<P + ±´³¶µ A such that ( ) V ( (10.15) where the maximization is taken over all such that and (*) V .- + if and only if ( ) ¥ V A M 7 j V ) ( W.- V ( ¥ )( V ( U ¥ )( U A (10.16) (10.17) V the which corresponds to the input distribution i.e., the maximizing is ¥ )( . the transition matrix and V In (10.15) and theV sequel, we adopt the convention that the summation is )( -0 P taken over all ( and such that A ( P - and ¥ . Note that the right ! when is the input hand side of (10.15) gives the mutual information V )( distribution for the generic channel ¥ . Proof Let V = 7 D R A F T A ( U ¥ V )( U (10.18) j September 13, 2001, 6:27pm D R A F T 219 The Blahut-Arimoto Algorithms V P V ,¥ in (10.17). We assume with loss of generality that forV all ß for some ( ß . Since ?P - , for all , and hence P well-defined. 
Rearranging (10.17), we have A Consider = = Q A 7 V ( ¥ ±´³¶µ = = Q ( ¥ A 7 V = 7 V = Q V = Q $ (*) + V 9= + (*) V V 8 K A (*) is )( ±´³¶µ A V ( ) ( (10.20) V V (10.21) V (*) V (*) V ( ¥ - (10.19) (*) + ±´³¶µ V ( ) ( ) )( P V + (*) V + V ( ) (*) = Q 7 ±´³¶µ + - 7 #= ( A )( + + ±´³¶µ = Q V V ( ) V ) ( W + )( V ( ¥ V (10.22) V (*) (10.23) (10.24) where (10.21) follows from (10.19), and the last step is an application of the divergence inequality. Then the proof is completed by noting in (10.17) that + satisfies (10.16) because 1P - . $ q ACBMqoZNHtLas V For a discrete memoryless channel ¥ w . = ]WÊË gfihkj Ñ = Q ( ¥ A 7 V )( ±´³¶µ A )( V , ( ) ( (10.25) where the maximization is taken over all which satisfies (10.16). -0 i ! when is the input Proof Let denote the mutual information V ¥ )( distribution for the generic channel . Then we can write w Let + w achieves w + . If P ]_ÊË . Ñ i i ]_ÊË . Ñ ]_ÊË = . D R A F T ]WÊË fihkj Ñ (10.26) = Q 7 A 8 (10.27) i , then ]_ÊË . Ñ .]_ÊË . Ñ = 7 = Q A V ( ¥ ( ¥ V )( ±´³¶µ A )( ±´³¶µ A (10.28) V (*) ( (10.29) V ( ) ( September 13, 2001, 6:27pm (10.30) D R A F T 220 A FIRST COURSE IN INFORMATION THEORY where (10.29) follows from Lemma 10.1 (and the maximization is over all i which satisfies (10.16)). $ . Since is continuous in , Next, we consider the case when + for any ®lP , there exists Ú2P such that if then + w i # S (10.31) Ú S (10.32) ® + + where denotes the Euclidean distance between and . In particular, which satisfies (10.31) and (10.32). Then there exists 2 P w ]_ÊË Ñ $ $ (10.33) i (10.34) w P i Ñ fihkj . i (10.35) (10.36) ® where the last step follows because satisfies (10.32). Thus we have w - Finally, by letting ®:Î w w o Ñ . 8 (10.37) , we conclude that gfihkj Ñ i fihj S ® i gfhkj ]_ÊË = Ñ . = Q A 7 V ( ¥ V (*) 8 ( )( ±´³¶µ A (10.38) This accomplishes the proof. Now for the double supremum in (10.3), let ¹ W= = Q ( ¥ A 7 A ( ( ß A G V (*) ( )( A (10.39) 1@ and ±a³¶µ W with and playing the roles of V ( , respectively. Let - P M and 7 A ( W (10.40) and @ ¥ if M and V (*) (*) V 7 P V ( (*) V lß , TSUP (*) W V W.- G if V for all (*) ¥ V P ) ( .- ß V P - , 8 (10.41) D R A F T September 13, 2001, 6:27pm D R A F T 221 The Blahut-Arimoto Algorithms @ Then is a subset of n n @ and is a subset of checked that both and are convex. For all Lemma 10.1, ¹ = = Q A 7 = Q 7 V ±´³¶µ (10.42) (10.43) A (10.44) (10.45) (10.46) ) )8 ±´³¶µ ¹ @ , and it is readily and ß , by ! -0 > )( -0 V ( ¥ A n nn n (*) ( A V + (*) ( )( ±´³¶µ = ß V ( ¥ @ V Thus is bounded from above. Since for all ß , ( ) W- for all ( and V V ¥ (*) o ¹ such that , these components of are degenerated. In fact, these components of do not appear in the definition of in (10.39), which can be seen as follows. V Recall the convention that the double summation in V V )( P . If ( ) - , (10.39) V is over all ( and such that A ( P - and ¥ ) ( u , and hence the corresponding term¹ is not included in the then ¥ double summation. Therefore, it is readily seen that is continuous and has continuous partial derivatives on because all the probabilities involved in @ ¹ the double summation in (10.39) are strictly positive. Moreover, for any given ß , by Lemma 10.1, there exists a unique @ ß which maximizes . It will be shown shortly that for any given ß , there also exists a unique ¹ ß which maximizes . The double supremum in (10.3) now becomes fihkj ö~} = fihkj { ö} = Q ( ¥ A 7 * V )( ±a³¶µ A V (*) ( (10.47) @ w which by Theorem 10.2 is equal to , where the supremum over all ß is in fact a maximum. 
We then apply the alternating optimization algorithm in w the last section to compute . First, we arbitrarily choose a strictly positive Ñ Ñ input distribution in and let it be ÿ . Then we define ÿ and in general x » $ for by V » ÿ » ÿ (*) V W A M 7 ÿ j A » ÿ ( ¥ ( U ¥ )( V 7 D R A F T A » in view of Lemma 10.1. In order to define ÿ ¹ and in general we need to find the ß which maximizes for a given ß constraints on are = (10.48) )( U ( September 13, 2001, 6:27pm ÿ @ x $ for , , where the (10.49) D R A F T 222 A FIRST COURSE IN INFORMATION THEORY and ( A for all ( - P ß . (10.50) We now use the method of Lagrange multipliers to find the best by ignoring temporarily the positivity constraints in (10.50). Let = = Q A 7 V ( ¥ V ( ) ( )( ±´³¶µ A # = A 7 ( 8 (10.51) For convenience sake, we assume that the logarithm is the natural logarithm. Differentiating with respect to A ( gives A ( C Upon setting ¡ V ¥ = Q )( ±a³¶µ K- ¡;¢ 7 k ±´³¶µ ( kÌl# 8 A (10.52) , we have ÿ ( W A ±´³¶µ V (*) V ¥ = Q )( ±´³¶µ or ( ¤£ ¾ A & ÿ ñ kª¨# (10.53) V ( ) Q V (*) Q 7 -¥ ÿ 8 n (10.54) By considering the normalization constraint in (10.49), we can eliminate and Q Q obtain V 7 A ( 7 V (*) Q ¦ M j ¦ ¥ ( U ) V V ÿ ¥ Q n 7 ÿ n 8 j (10.55) V )( The above product is over all such that ¥ , and (*) P - for all P V such . This implies that both the numerator and the denominator on the right ( hand side above are positive, and therefore A P . In other words, the thus obtained happen to satisfy the positivity constraints in (10.50) although these ¹ constraints were ignored when we set up the Lagrange multipliers. We will @ which is show in Section 10.3.2 that is concave. ¹ Then as given in (10.55), ß unique, indeed achieves the maximum of for a given because is in x » $ the interior of . In view of (10.55), we define for by ÿ Q » A @ » ÿ Q ( » M »¿¾ Q ÿ 7 ¦ j ¦ »'¾ ÿ ( ) V ¥ ( U ) 7 V ÿ ¥ Q n ÿ 7 n j 8 (10.56) Ñ Ñ @ The vectors ÿ and ÿ are defined in the order ÿ ÿ ÿ ÿ ÿ ½½½ , where each vector in the sequence is a function of the previous vector ÿ Ñ except» that is x arbitrarily chosen in@ .x It remains» to show by induction » » ÿ $ $ that ÿ ß for and ÿ ß for . If ÿ ß , i.e., ÿ P - , D R A F T September 13, 2001, 6:27pm D R A F T 223 The Blahut-Arimoto Algorithms ¨ © R( D) ¨ © R( Dª s) - ª s Dª s ¨ ¨ © R( Dª s ) ¨ © (Dª s ¬ ,R ( Dª s )) § Dª s Figure 10.2. D D« max A tangent to the 6a curve with slope equal to . » V V @ then we see from (10.48) that ÿ (*» ) - @ if and only if ¥ ( ) Z- , i.e., » ß ß @ if @ . On the» other hand, , then we see from (10.56) that » » ÿ» ÿ ß ß ß P ¹ ¹ , i.e., ÿ . Therefore, and for » all x ÿ $ » » » ÿ » ÿ .x Upon determining ÿ ÿ , we can¹ compute » ÿ ÿ ÿ w for all . It will be shown in Section 10.3 that ÿ Î . 10.2.2 THE RATE DISTORTION FUNCTION This discussion in this section is analogous to the discussion in Section 10.2.1. Some of the details will be omitted for brevity. $ -F problems of interest, P . Otherwise, W.- for all For all since is nonnegative and non-increasing. Therefore, we assume without loss of generality that -F P - . -F P We have shown in Corollary 9.19 that if , then is strictly -® 4 6 3 5 7 decreasing for . Since is convex, for any ¯ - , there ° 43:5;7 such that the slope of exists a point on the curve for a tangent1 to the curve at that point is equal to ¯ . Denote such a point ` ± ` ± on the curve by , which is not necessarily unique. Then this tangent intersects with the ordinate at ± k ¯ ± . This is illustrated in D² -0 0 þ D² Figure 10.2. 
Let denote /1 the mutual information and denote 0i-0 0 þ 0 ² the expected distortion when0 þ is the distribution the for ² and is ² ² þ transition matrix from to defining . Then for any , is a point in the rate distortion region, and the line with slope ¯ passing through 1 We say that a line is a tangent to the ³ ÿ D R A F T h curve if it touches the ³ ÿ h curve from below. September 13, 2001, 6:27pm D R A F T 224 A FIRST COURSE IN INFORMATION THEORY D² D² D² ² intersects the ordinate at . Since the curve defines the boundary of the rate distortion region and it is above the tangent in Figure 10.2, we see that ¯ ± k ¯ .]W^a` ´ ± D² * D² ¯ à8 (10.57) À ² a ± ² which achieves ² the above minimum, For each ¯ - , if we can find ± : ± , i.e., the tangent in then the line passing through - ¯ Figure 10.2, gives a tight lower bound on the curve. In particular, if ± ± is unique, D² ± ± (10.58) and ² `± W. ± 8 (10.59) By varying over all ¯ ?- , we can then trace out the whole curve. In the rest of the section, we will devise an iterative algorithm for the minimization problem in (10.57). qoQo¬G"NHtL¶µ ² that ]_^ ` º 7 ¥ = X = Ñ ( · þ ¸S (*) ( Let be a given joint distribution on þ , and let ¹ be any distribution on such that ¹P - . Then P ¥ ( · · (þ ) ( » ±´³¶µ 7 (þ ) ( * (þ ¥ = X = 7 such ( · þ · (þ ) ( ±´³¶µ 7 (þ ) ( * » + (þ (10.60) where » + » ( þ ¥ = ( · (þ ) ( (10.61) 7 þ ( is the distribution on i.e., the minimizing ² distribution and the transition matrix . þ corresponding to the input Proof It suffices to prove that = 7 = X ¥ ( · » ±´³¶µ 7 for all ¹V² P because · (þ ) ( P (þ ) ( * (þ $ = = X 7 ¥ ( · · (þ ) ( ±´³¶µ 7 (þ ) ( » + (þ . The details are left as an exercise. Note in (10.61) that . D² ² (10.62) ¹ + P - ² Since and are continuous in , via an argument similar to the one we used in the proof of Theorem 10.2, we can replace the minimum ² ² D² over all P over all in (10.57) by the infimum . By noting that the right hand side of (10.60) is equal to and ² W= 7 =YX D R A F T ¥ ( · (þ )( 0D ( (þ 7 September 13, 2001, 6:27pm (10.63) D R A F T 225 The Blahut-Arimoto Algorithms we can apply Lemma 10.3 to obtain ± k ¯ ± X M ¼ ½.¾ ¼½ ¿ÁÀvÂÄÃÅ Æ À ¥ û ZÇ Z M ¼½.¾ ¼½ ¿ÁÀvÂÒÅ Æ Àv à ZÇ Z 7 y ÿ ÿ ZÐÎ n X 7 y ¥ û X û M 7 .ÈÉÊ2ËÏ Ì Z Í ZiÎ ¾ ± Ì û 7 ÿ û ¾ ËÌ Z Ï Ì û Í ZiÎ ZÐÎ 7 .ÈÉÊ 7 ÿ n ¥ û ZÇ Z M ± ZÇ Z 7 ÿ ÿ ¹ = E · = ¯ (þ ) ( * ±´³¶µ ( · » 0D (þ )( (þ ( ( þ :ß þ )( (* · þ TS for all ( 7 and @ » (þ (þ þ ß » G X (þ = $ (þ (10.66) (þ )( · ¥ 7 -0 ! 0 þ ( · » ±´³¶µ = X 7 · (þ ) ( ( · ß and 1@ P (10.67) I X M » 7 ( þ W (10.68) and , respectively. Then is a subset , and it is readily checked that both n n 7 = $ ¥ = X 7 X ¹ P W with and ¹ playing the roles of @ of n @ nn Dn and is a subset of and are convex. Since ¯ Ì- , ¹ ²H (10.65) (þ ) ( * G =YX ² · (þ ) ( 7 f7 Ñ 8 ÿ n 7 ( ¥ = X 7 ( · 7 ¥ = X 7 (10.64) X 7 e ÿ Now in the double infimum in (10.14), let ¹ ²Ó ÿ 7 ÿ 7 f 7 Ñ n X 7 y ¥ û X 7 e 7 y (þ ) ( * (þ = ¯ 7 ¥ =YX ( · (þ ) ( 0i (þ ( 7 (10.69) · (þ ) ( ±´³¶µ (þ ) ( * » + (þ " (10.70) (10.71) (10.72) - 8 ¹ Therefore, is bounded from below. The double infimum in (10.14) now becomes ´ ^a`k ö~} { º ^a` ö~} * Å = X = 7 7 ¥ ( · · (þ ) ( * ±a³¶µ » (þ ) ( * (þ ¯ = X = 7 ¥ ( · (þ ) ( 0D ( ( þ ÇÆ 7 (10.73) D R A F T September 13, 2001, 6:27pm D R A F T 226 A FIRST COURSE IN INFORMATION THEORY @ where the infimum over all ¹ ß is in fact a minimum. We then apply ¹the alternating optimization algorithm described in Section 10.2 to compute + , the value of (10.73). 
First,² we arbitrarily choose a strictly positive transition » Ñ Ñ matrix in and let it be . Then we define and in general ¹ ÿ for ¹ x ÿ ÿ $ by » þ » þ » ( "= ¥ ( · () ( (10.74) ÿ ÿ 7 ² » ² in view of Lemma 10.3. In order to define ÿ ¹ and in general ² ß we need to find² the which minimizes for a given ¹ ß constraints on are þ )( ( · P and =YX for all ( þ ) ( W · ( ( þ ¨ß for all ( 7 þ ÔS ß @ ÿ x $ for , , where the , (10.75) . (10.76) As we did for the computation of the channel capacity, we first ignore the positivity constraints in (10.75) when setting up X the Lagrange multipliers. Then we obtain ± e 7 f7 þ » · ( þ ) ( X M 7 » j ( Ð £ ÿ e ( þ U У ± X 7 f7 ÿ j - 8 P » ² The details are left as an exercise. We then define · ÿ »¿¾ » » þ ( ) ( W M 7 X ÿ þ ( У »¿¾ » j ¹ ÿ » 7 f7 ÿ e ( þ U У ± ¹ ² that It will be shown in the next section ÿ ÿ If there exists a unique point ± ± on the slope of a tangent at that point is equal to ¯ , then ² D² » ÿ » ² D² » ÿ Î by 8 j ÿ » ¹ (10.78) » ÿ ± $ X 7 f7 x for Xÿ ± e (10.77) ¹ x + as Î / . curve such that the Î ± 8 (10.79) » is arbitrarilyx close to the segment of the Otherwise, ÿ ÿ curve at which the slope is equal to ¯ when is sufficiently large. These facts are easily shown to be true. 10.3 CONVERGENCE ¹ ¹ » ¹ + . We then In this section, we first prove that if is concave, then ÿ Î apply this sufficient condition to prove the convergence of the BA algorithm for computing the channel capacity. The convergence of the BA algorithm for computing the rate distortion function can be proved likewise. The details are omitted. D R A F T September 13, 2001, 6:27pm D R A F T 227 The Blahut-Arimoto Algorithms 10.3.1 A SUFFICIENT CONDITION In the alternating optimization algorithm in Section 10.1, we see from (10.7) and (10.8) that » ( ÿ x $ for - » @ » ÿ . Define Õ ÿ ¹ ( ¹ ?>;'(b@ Then ¹ » ÿ ¹ ÿ ¹ ?> ÿ ( Õ * @ » ÿ » ¹ ( (10.80) 8 (10.81) ¹ (W 1@ @ » ÿ ¹ ( k » @ » ÿ ÿ (10.82) (10.83) 8 ÿ ¹ > @ ?> ÿ( » @ » ÿ ¹ ( > @ ?> ( >@H?>;(1@ » ÿ ¹ ( » ?> ( @ » W ¹ (10.84) ¹ » + We will prove that¹ being concave is sufficient for ÿ Î . To this end, ¹ ( first prove ¹ we that if is concave, then the algorithm cannot be trapped at if + S . ¹ qoQo¬G"NHtLmÖ Let ¹ ¹ » be concave. If Õ ÿ ¹ ( + S ¹ » , then P ÿ ¹ » ÿ . ¹ ( ¹ ( Proof We will ¹ prove that ¹ for any ß such that P » » + S , we see from (10.84) that Then if ÿ Õ ÿ ¹ » ¹ » ÿ ÿ ¹ ( » ÿ S + . P ¹ (10.85) and the lemma is proved. Õ ¹ ( ¹ Õ + ß S ¹ ( ( Consider any such¹ that . We will prove by contradiction .P that . Assume . Then it follows from (10.81) that ¹ ?> ( > @ ?> ( @ Now we see from (10.5) that ¹ ?>;'(b@ > ( @ If >@F?>;¿(1@ 8 ¹ ?>;'(b@ $ @ D (10.86) 1@ 8 (10.87) ¡ , then ¹ ?>;'(1@ ¹ ( 1@ b@ >K¿(1@ because ¹ ( @ P (10.88) is unique. Combining (10.87) and (10.88), we have ¹ ?>;'(b@ >@H?>;¿(b@ ¹ (W 1@ P (10.89) which is a contradiction to (10.86). Therefore, >;'(b@ W D R A F T 8 September 13, 2001, 6:27pm (10.90) D R A F T 228 A FIRST COURSE IN INFORMATION THEORY Ø Ù Ø Ü Ú × z2 × Ø Ù Ú (v 1 Û , v 2 ) (u 1 Û , v 2 ) z Ú × (u 1 Û , u 2 ) Ø Ü Ú (v 1 Û , u 2) z1 á(à Figure 10.3. The vectors Ý , Þ , à ß { á and à * . Using this, we see from (10.86) that ¹ ( >@F(W which implies ¹ (W b@ W 8 >@H?>;¿(1@ ¹ is unique. , there exists â + S ß ¹ â S 8 â (10.93) â (10.92) such that ¹ ( Consider (10.91) >@H(W 1@ because ¹ ( Since -F @ " 1@ â 8 (10.94) Win @ the direction of â , ã be the unit vector in Let ã be the unit vector ã -F @ direction b@ the of â , and be the unit vector in the direction of -F â . 
Then â ã ä or where ¦ Ù © @ ã â â @ ¦ " é é " .¦ @ ã ã â @ ã (10.95) (10.96) é b@ ã â (10.97) @ ¹ . Figure 10.3 is an illustration of the vectors , â , ã ã and ã (.W b@ @ We see from (10.90) that attains ¹ its maximum value at 1@ attains b@ its maximum ¹ when is fixed. In (particular, value at alone the line passing through and â . Let å denotes the gradient of Þ D R A F T September 13, 2001, 6:27pm D R A F T 229 The Blahut-Arimoto Algorithms ¹ ¹ ¹ ¹ . Since is continuous and has continuous partial derivatives, the directional ã ã ½ ¹ ¹ of derivative of at in the direction exists and is given by å . It ( 1@ from the concavity 1@ ¹ passing through that is concave along the line follows of ¹ (Wits bmaximum @ value b@ and â . Since attains at , the derivative of along the line passing through and â vanishes. Then we see that ¹ .- 8 ½ ã (10.98) å Similarly, we see from (10.92) that ¹ @ .- ½ ã å 8 (10.99) ¹ Then from (10.96), the directional derivative of given by @ ¹ ¹ ½ ã å ¹ Since .¦ ½ ã å ¦ " at ¹ $ â .- 8 ½ ã å ¹ ã is @ (10.100) is concave along the line passing through ¹ ( in the direction of and â , this implies Õ (10.101) ¹ ( which is a contradiction to (10.93). Hence, we conclude that P - . ¹ ( ¹ ¹ Although we have proved that the algorithm cannot be trapped at if ¹ » S » + , ÿ does not necessarily converge to because the increment in ÿ in each step may be arbitrarily small. In order to prove the desired convergence, we will show in next theorem that this cannot be the case. ¹ + $ q ACBMqoZNHtL¶æ ¹ If ¹ ¹ » is concave, then + Î ÿ . ¹ » Proof ¹ We have already shown in Section 10.1 that ÿ necessarily converges, x say to U . Hence, for any ®lP - and all sufficiently large , ¹ ¹ U ¹ » ® ÿ Õ Let ¹ ( ç«]_^a` z where ¹ U G (10.102) (10.103) ö} j ¹ ß U 8 ¹ ( U ® Õ ¹ o U 8 ¹ ( (10.104) Since has continuous partial derivatives, is a continuous function of . Then the minimum in (10.103) exists because U is compact2 . 2 } j is compact because it is the inverse image of a closed interval under a continuous function and bounded. D R A F T September 13, 2001, 6:27pm } is D R A F T 230 A FIRST COURSE IN INFORMATION THEORY ¹ ¹ ¹ Õ + S U ¹ ( We¹ now show that will lead to a contradiction if is concave. If + , then from ¹ Lemma ¹ 10.4, ( we see that P for all ß U and » » » ß hence P . Since ÿ satisfies (10.102), ÿ U , and Õ ÿ ¹ U S ¹ » ¹ ÿ ¹ ( » » ÿ ÿ $ (10.105) ¹ x ¹ for all sufficiently large . ¹ Therefore, no matter how smaller » U ¹ eventually be greater than¹ , which is a contradiction to ÿ » + we conclude that ÿ Î . 10.3.2 is,¹ U Î » will . Hence, ÿ CONVERGENCE TO THE CHANNEL CAPACITY In order to show that the ¹ BA algorithm for computing the channel capacity » w ¹ converges as intended, i.e., ÿ Î , we only need to show that the function defined in (10.39) is concave. Toward this end, for ¹ W = = Q ( ¥ A 7 V V ( ) ( )( ±´³¶µè A (10.106) @ @ @ defined in (10.39), we consider two ordered pairs and in , where and are defined in (10.40) and (10.41), respectively. For any -é Ó" CU and ê , an application of the log-sum inequality (Theorem 2.31) gives @ ( A "" A ' ( ±´¿ ³¶ µ A § ( A ±´³¶µ 'A ( ) V ( ) ' @H ( A " ¿ $ A A ±´³¶ § µ ¹ Therefore, @ "" ±´³¶µ V A "" A " ( V (*) 8 (10.107) @¶ V "" ( ( ) @H "" A (*) @H ( A V ( @ ¶ @õ A @¶ ±´³¶µ A V (10.108) ( and summing over all ( and , we obtain @ ( ) ' @H V )( V ( ) ¿ ( ( and upon multiplying by ¥ ¹ ( V ±´³¶µ Taking reciprocal in the logarithms yields ' ( ) ( "" A " @ ( V @H ( "" A @¶ ( $ ¹ ¹ @ "" is concave. Hence, we have shown that @ ¹ ÿ » Î 8 w (10.109) . 
PROBLEMS 1. Implement the BA algorithm for computing channel capacity. D R A F T September 13, 2001, 6:27pm D R A F T 231 The Blahut-Arimoto Algorithms 2. Implement the BA algorithm for computing the rate-distortion function. 3. Explain why in the BA Algorithm for computing channel capacity, we should not choose an initial input distribution which contains zero probability masses. 4. Prove Lemma 10.3. ¹ ²Ó 5. Consider function. ¹ in the BA algorithm for computing the rate-distortion ¹ ²Ó a) Show that for fixed ¯ and ¹ , · ¹ ²H b) Show that ¹ ( þ ) ( * M ¹ is minimized by X ± e 7 f7 (þ Ð X £ ÿ± e 7 f7 8 » (þ Ð U £ 7 j j ÿ X » is convex. HISTORICAL NOTES An iterative algorithm for computing the channel capacity was developed by Arimoto [14], where the convergence of the algorithm was proved. Blahut [27] independently developed two similar algorithms, the first for computing the channel capacity and the second for computing the rate distortion function. The convergence of Blahut’s second algorithm was proved by CsiszÊ ë r [51]. These two algorithms are now commonly referred to as the Blahut-Arimoto algorithms. The simplified proof of convergence in this chapter is based on Yeung and Berger [217]. The Blahut-Arimoto algorithms are special cases of a general iterative algorithm due to CsiszÊ ë r and TusnÊ ë dy [55] which also include the EM algorithm [59] for fitting models from incomplete data and the algorithm for finding the log-optimal portfolio for a stock market due to Cover [46]. D R A F T September 13, 2001, 6:27pm D R A F T Chapter 11 SINGLE-SOURCE NETWORK CODING In a point-to-point communication system, the transmitting point and the receiving point are connected by a communication channel. An information source is generated at the transmission point, and the purpose of the communication system is to deliver the information generated at the transmission point to the receiving point via the channel. In a point-to-point network communication system, there are a number of points, called nodes. Between certain pair of nodes, there exist point-to-point communication channels on which information can be transmitted. On these channels, information can be sent only in the specified direction. At a node, one or more information sources may be generated, and each of them is multicast1 to a set of destination nodes on the network. Examples of point-to-point network communication systems include the telephone network and certain computer networks such as the Internet backbone. For simplicity, we will refer to a point-to-point network communication system as a point-to-point communication network. It is clear that a point-to-point communication system is a special case of a point-to-point communication network. Point-to-point communication systems have been thoroughly studied in classical information theory. However, point-to-point communication networks have been studied in depth only during the last ten years, and there are a lot of problems yet to be solved. In this chapter and Chapter 15, we focus on point-to-point communication networks satisfying the following: 1. the communication channels are free of error; 2. the information is received at the destination nodes with zero error or almost perfectly. 1 Multicast means to send information to a specified set of destinations. D R A F T September 13, 2001, 6:27pm D R A F T 234 A FIRST COURSE IN INFORMATION THEORY In this chapter, we consider networks with one information source. 
Networks with multiple information sources will be discussed in Chapter 15 after we have developed the necessary tools. 11.1 A POINT-TO-POINT NETWORK 8í / í , A point-to-point network is represented by/ a directed graph ì where is the set of nodes in the network and is the set of edges in ì which represent the communication channels. An edge from node Þ to node L , which í represents the communication channel from node Þ to node L , is denoted by Þ L . We assume that ì is finite, i.e., î îï / . For a network with one information source, which is the subject of discussion in this chapter, the node at which information is generated is referred to as the source node, denoted by ¯ @ , and the destination nodes are referred to as the sink nodes, denoted by » ðððÒ »4ñ » . é For a communication channel from node Þ to node L , let ò J be the rate constraint, i.e., theé maximum number of bits which can be sent per unit time on the channel. ò J is also referred to as the capacity2 of edge óÞ LÄô . Let õ ÷ö òøùJ G ó-ú (11.1) LÄôûçüþý be the rate constraints for graph ì . To simplify our discussion, we assume that òøùJ are (nonnegative) integers for all ó-úiÿ|ôûçü . In the following, we introduce some notions in graph theory which will facilitate the description of a point-to-point network. Temporarily regard an edge in ì as a water pipe and the graph ì as a network of water pipes. Fix a » » sink node and assume that all the sink nodes except for is blocked. Suppose water is generated at a constant rate at node ¯ . We assume that the rate of water flow in each pipe does not exceed its capacity. We also assume that there is no leakage in the network, so that water is conserved at every node other than ¯ » and in the sense that the total rate of water flowing into the node is equal to the total rate of water flowing out of the node. The water generated at node ¯ » is eventually drained at node . A flow ö ø ó-úiÿ|ôûUüý (11.2) » õ in ì from node ¯ to node with respect to rate constraints is a valid assignment of a nonnegative integer ø to every edge ó-úÿ|ôûUü such that ø is equal to the rate of water flow in edge ó-úiÿÄô under all the above assumptions. on edge ó-úiÿ|ô . Specifically, is a flow in ì ø is referred to as the value of » from node ¯ to node if for all ó-úÿ|ôûUü , 2 Here ø òøvÿ (11.3) the term “capacity” is used in the sense of graph theory. D R A F T September 13, 2001, 6:27pm D R A F T 235 Single-Source Network Coding and for all úû except for and , where ó-úÐô ó-úÐô ø (11.4) ó-ú4ô.ÿ ø (11.5) ø "! (11.6) ÿ and ó-úÐô ø ÿ In the above, ó-ú4ô is the total flow into node ú and # ó-ú4ô is the total flow out of node ú , and (11.4) are called the conservation conditions. Since the conservation conditions require that the resultant flow out of any node other than and is zero, it is intuitively clear and not difficult to show that the resultant flow out of node is equal to the resultant flow into node . This common value is called the value of . is a max-flow from node to õ node in $ with respect to rate constraints if is a flow from to whose value is greater than or equal to the value of any other flow from to . A cut between node and node is a subset % of such that û&% and (' û)% . Let -, ' ü+* ó-úiÿÄôûçü.XúWû/% and û/%0 (11.7) be the set of edges across the cut % . The capacity of the cut % with respect to õ rate constraints is defined as the sum of the capacities of all the edges across the cut, i.e., (11.8) òø 3! 
ø ÿ 21 is a min-cut between node and node if % is a cut between and whose capacity is less than or equal to the capacity of any other cut between and . A min-cut between node and node can be thought of as a bottleneck between and . Therefore, it is intuitively clear that the value of a max-flow from node to node cannot exceed the capacity of a min-cut between node and node . The following theorem, known as the max-flow min-cut theorem, states that the capacity of a min-cut is always achievable. This theorem will play a key role in subsequent discussions in this chapter. % 457698;:26=<?>=>=@ >ACBED7F2GHJIK8MLNBPORQSGUTSVXWY4576M8Z:26=<\[^]=_J`a Let $ be a graph ððð õ ñ Ò b c ÿ ~ d ÿ c ÿ with sourceg node and sink nodes , and be the rate constraints. f ððð Then for e ÿhÄÿ ÿCi , the value of a max-flow from node to node is equal to the capacity of a min-cut between node and node . D R A F T September 13, 2001, 6:27pm D R A F T 236 A FIRST COURSE IN INFORMATION THEORY k m k s 2 2 2 1 1 1 l j 2 1 1 l t1 j (a) n 1 1 2 k s 2 b1bp2 n 1 n 2 b2 t1 l j (b) s bo3 b1 2 b1bo3 t1 (c) Figure 11.1. A one-sink network. 11.2 WHAT IS NETWORK CODING? Let q be the rate at which information is multicast from the source node ððð õ ñ ÿc to the sink nodes bÿcdÿ in a network $ with rate constraints . We are naturally interested in the maximum possible value of q . With a slight abuse of notation, we denote the value of a max-flow from the source node to a sink node by rtsuJvMwyxèó Rÿc ô . It is intuitive that q{z&r|suJv9w}xó Rÿc gf for all e (11.9) ô ððð ÿhÄÿ ÿCi , i.e., q{z&rt ~ rtsuJvMwyxèó Rÿc ôU! (11.10) This is called the max-flow bound, which will be proved in Section 11.4. Further, it will be proved in Section 11.5 that the max-flow bound can always be achieved. In this section, we first show that the max-flow bound can be achieved in a few simple examples. In these examples, the unit of information is the bit. First, we consider the network in Figure 11.1 with one sink node., Figure 11.1(a) f shows the capacity of each edge. By identifying the min-cut to be Rÿ ÿhJ0 and applying max-flow min-cut theorem, we see that rtsuKv9wyxó XÿcUbô ! (11.11) Therefore the flow in Figure 11.1(b) is a max-flow. In Figure 11.1(c), we show how we can send three bits b ÿ d , and U from node to node b based on the max-flow in Figure 11.1(b). Evidently, the max-flow bound is achieved. In fact, we can easily see that the max-flow bound can always be achieved when there is only one sink node in the network. In this case, we only need to treat the information bits constituting the message as physical entities and route them through the network according to any fixed routing scheme. Eventually, all the bits will arrive at the sink node. Since the routing scheme is fixed, the D R A F T September 13, 2001, 6:27pm D R A F T 237 Single-Source Network Coding 2 1 4 s 1 1 2 3 1 2 3 5 1 3 3 s b t1 b4 b3 b3b4b5 b1b 2 b1 b2b 3 2 t2 b3 3 b4b 5 t1 (a) b1b2b3 t2 (b) Figure 11.2. A two-sink network without coding. sink node knows which bit is coming in from which edge, and the message can be recovered accordingly. Next, we consider the network in Figure 11.2 which has two sink nodes. Figure 11.2(a) shows the capacity of each edge. It is easy to see that (11.12) rtsuKv9w}xó Rÿcbô and rtsuKv9wyxó XÿcdKô ! (11.13) So the max-flow bound asserts that we cannot send more than 5 bits to both Ub and d . Figure 11.2(b) shows a scheme which sends 5 bits ybÿdÿ ÿU , and to both b and d . Therefore, the max-flow bound is achieved. 
In this scheme, b and Ud are replicated at node 3, is replicated at node , while U and are replicated node 1. Note that each bit is replicated exactly once in the network because two copies of each bit are needed to send to the two sink nodes. We now consider the network in Figure 11.3 which again has two sink nodes. Figure 11.3(a) shows the capacity of each edge. It is easy to see that rtsuKv9wyxéó Rÿc f ô hÄÿ (11.14) . So the max-flow bound asserts that we cannot send more than 2 bits to both Ub and d . In Figure 11.3(b), we try to devise a routing scheme which sends 2 bits yb and d to both Ub and d . By symmetry, we sendf one bit on each output channel at node . In this case, b is sent on channel ó Xÿ ô and d is sent f ÿh , .ø is replicated and the copies are sent on channel ó XÿhXô . At node ú , ú on the two output channels. At node 3, since both yb and d are received but there is only one output channel, we have to choose one of the two bits to send on the output channel ó ÿc|ô . Suppose we choose yb as in Figure 11.3(b). Then the bit is replicated at node 4 and the copies are sent to nodes b and d . At node d , both b and Ud are received. However, at node Ub , two copies of yb are e ÿh D R A F T September 13, 2001, 6:27pm D R A F T 238 A FIRST COURSE IN INFORMATION THEORY s 1 1 1 2 1 b1 1 4 1 1 t t1 b1 1 b1 t1 2 b1 1 b 2 2 b1+b 2 2 b1+b2 b t 2 2 2 3 b1 4 b2 b2 b s b1 3 2 b1 t b2 b2 b1 4 b (b) s b1 b1 t1 2 (a) 2 3 b1 1 b2 b2 1 s b1 3 1 1 b1b2 4 b1 t t1 2 (d) (c) Figure 11.3. A two-sink network with coding. received but d cannot be recovered. Thus this routing scheme does not work. Similarly, if d instead of b is sent on channel ó ÿc|ô , b cannot be recovered at node d . Therefore, we conclude that for this network, the max-flow bound cannot be achieved by routing and replication of bits. However, if coding is allowed at the nodes, it is actually possible to achieve the max-flow bound. Figure 11.3(c) shows a scheme which sends 2 bits yb and d to both nodes b and d , where ‘+’ denotes modulo 2 addition. At node b , b is received, and d can be recovered by adding yb and yb Yd , because d b ¤ó b Y d U ô ! (11.15) Similarly, d is received at node d , and yb can be recovered by adding d and b {Ud . Therefore, the max-flow bound is achieved. In this scheme, b and d are encoded into the codeword yb;Yd which is then sent on channel ó ÿc|ô . If coding at a node is not allowed, in order to send both yb and d to nodes Ub and shows such a scheme. d , at least one more bit has to be sent. Figure 11.3(d) In this scheme, however, the capacity of channel ó ÿc|ô is exceeded by 1 bit. D R A F T September 13, 2001, 6:27pm D R A F T 239 Single-Source Network Coding ¢ 1 1 1 ¤ ¢ s 1 1 1 1 b1 £ 2 1 3 t1 ¡ 2 1 ¤ ¦ t ¤ 3 b 1+b 2 b2 2 b2 b1 1 1 ¤ ¥ t s £ b2 3 b 1+b 2 b1 ¤ ¥ t t1 ¡ (a) 2 ¤ ¦ t 3 (b) Figure 11.4. A diversity coding scheme. Finally, we consider the network in Figure 11.4 which has three sink nodes. Figure 11.4(a) shows the capacity of each edge. It is easy to see that rtsuJvMwyxéó Xÿc ô h (11.16) for all e . In Figure 11.4(b), we show how to multicast 2 bits yb and d to all the sink nodes. Therefore, the max-flow bound is achieved. Again, it is necessary to code at the nodes in order to multicast the maximum number of bits to all the sink nodes. The network in Figure 11.4 is of special interest in practice because it is a special case of the diversity coding scheme used in commercial disk arrays, which are a kind of fault-tolerant data storage system. 
For simplicity, assume the disk array has three disks which are represented by nodes 1, 2, and 3 in the network, and the information to be stored are the bits yb and d . The information is encoded into three pieces, namely b , d , and b & d , which are stored on the disks represented by nodes 1, 2, and 3, respectively. In the system, there are three decoders, represented by sink nodes Ub , d , and , such that each of them has access to a distinct set of two disks. The idea is that when any one disk is out of order, the information can still be recovered from the remaining two disks. For example, if the disk represented by node 1 is out of order, then the information can be recovered by the decoder represented by the sink node which has access to the disks represented by node 2 and node 3. When all the three disks are functioning, the information can be recovered by any decoder. The above examples reveal the surprising fact that although information may be generated at the source node in the form of raw bits3 which are incompressible (see Section 4.3), coding within the network plays an essential role in 3 Raw bits refer to i.i.d. bits, each distributing uniformly on §¨ b© . D R A F T September 13, 2001, 6:27pm D R A F T 240 A FIRST COURSE IN INFORMATION THEORY Figure 11.5. A node with two input channels and two output channels. multicasting the information to two or more sink nodes. In particular, unless there is only one sink node in the network, it is generally not valid to regard information bits as physical entities. We refer to coding at the nodes in a network as network coding. In the paradigm of network coding, a node consists of a set of encoders which are dedicated to the individual output channels. Each of these encoders encodes the information received from all the input channels (together with the information generated by the information source if the node is the source node) into codewords and sends them on the corresponding output channel. At a sink node, there is also a decoder which decodes the multicast information. Figure 11.5 shows a node with two input channels and two output channels. In existing computer networks, each node functions as a switch in the sense that a data packet received from an input channel is replicated if necessary and sent on one or more output channels. When a data packet is multicast in the network, the packet is replicated and routed at the intermediate nodes according to a certain scheme so that eventually a copy of the packet is received by every sink node. Such a scheme can be regarded as a degenerate special case of network coding. 11.3 A NETWORK CODE In this section, we describe a coding scheme for the network defined in the previous sections. Such a coding scheme is referred to as a network code. Since the max-flow bound concerns only the values of max-flows from the source node to the sink nodes, we assume without loss of generality that ' there is no loop in $ , i.e., ó-úiÿú4ô û ü for all údûg , because such edges do not increase the value of a max-flow from node to a sink node. For the same ' reason, , we assume that there is no input edge at node , i.e., ó-úiÿ~ô û ü for all úûª¬« "0 . We consider a block code of length . Let ® denote the information source and assume that ¯ , the outcome of ® , is obtained by selecting an index from a D R A F T September 13, 2001, 6:27pm D R A F T 241 Single-Source Network Coding set ° according to the uniform distribution. The elements in ° are called messages. 
For ó-úÿ|ôèû ü , node ú can send information to node which depends only on the information previously received by node ú . An óÿKó± ø ó-úiÿ|ôûuüèô.ÿC² ô(³ -code on a graph $ is defined by the components listed below. The construction of an ³ -code from these components will be described after their definitions are given. 1) A positive integer ´ . 2) µ ,¶f and ¼ µ such that -,¶f 3) ÀÂÁ ¼ f , ÀÂÁÄÃÅ0 ÿhÄÿ¸·¸·¸·~ÿC´½0¾¹» (11.18) . ó ¿ ôôûçü ÿhÄÿ¸·¸·¸·ÿ}à (11.17) ,¶f ó ¿ ô.ÿ ó ÿhÄÿ¸·¸·¸·ÿC´E0º¹» , such that z¿Æz{´ Ç where µ Ì 4) If µ -,¶f ø ó ¿ ô ± ø à À+ÁËà Á yÈ}É Ê z¿Æz&´Í , then (11.19) ÿ ¼ ó ¿ ô.ÿ ó ó ¿ ôô ó-úÿ|ô03! Î (11.21) ÁÏJ°¹ÐÀÂÁ|ÿ where µ If -,¶f ó ¿ ô ' ° , if Ö ¼ -,¶f Î Ç ÁÚ Á à gf , whereä D R A F T Ç (11.23) (11.24) ÿhÄÿ¸·¸·¸·;ÿCi ¼ z¿æz&´Í ,è à (11.25) À+ÁUÛ2¹»°Uÿ ÁÛ "áãâ å,¶f çf such that for all e ó ¿ ô0 be an arbitrary constant taken from À+Á . 5) ÿhÄÿ¸·¸·¸·~ÿCi ¿K× ô ó ÀÂÁÛ2¹ÞÀÂÁJß ÁÛ ÕÜ7Ý Î otherwise, let µ z¿K×Mؿ٠is nonempty, then (11.22) ÿhÄÿ¸·¸·¸·;ÿJÑhÒÓÕÔÕ03! Á e (11.20) ó ¿ ô 0 (11.26) ó¯ ô ¯ September 13, 2001, 6:27pm (11.27) D R A F T 242 à A FIRST COURSE IN INFORMATION THEORY è è for Î all ¯®û° , whereà is the function à from ° to ° inducedà inductively by f and such that ó¯ô denotes the value of as a function Á|ÿ z¿éz´ of ¯ . The quantity ² is the rate of the information source ® , which is also the rate at which information is multicast from the source node to all the sink nodes. The óÿKó±vøê ó-úiÿÄôûuüô.ÿC² ô ³ -code is constructed from these components as follows. At the beginning of a coding session, the value of ® is available to node . During the coding session, there are ´ transactions which take place µ ë a node sending inforin chronological order, where each transaction refers to Î mation to another node. In the ¿ th transaction, node ¿=ì encodes according ¼ ë Î µ ë in À+Á to node ¿=ì . The domain to encoding function Á and sends an index µíë information ë far, received byÎ node ¿=ì µíso of Á is the andÖ we distinguish two ' , the domain of Á is ° . If ¿=ì , Á gives the indices µíë cases. If ¿=ì of all the previousÎ transactions for which information was sent to node ¿=ì , Ì so the domain of Á is î ÁÛ ÕÜ7Ý ÀÂÁ Û . The set ø gives the indices of all the transactions for which information is sent from node ú to node , so ±vø is the ä number of possible index tuples that can be sent from node ú to node dur gives à the indices of all the transactions for ing the coding session. Finally, which information is sent to node , and is the decoding function at node which recovers ¯ with zero error. 11.4 THE MAX-FLOW BOUND In this section, we formally state and prove the max-flow bound discussed in Section 11.2. The achievability of this bound will be proved in the next ï section. 6=HKORQ7OW7O8;Q>=>=@ñð õ For a graph $ with rate constraints , an information ë ë ë rate qÍò is asymptotically achievable if for any óõô , there exists for sufficiently large an ÿ ± ø úiÿKìû®üêì.ÿC²9ì;³ -code on $ such that ë for all úë ÿJìUû ü , where channel úÿJì , and b=ö b ö (11.28) w"÷ d ±vøÚz&øø ùYó w"÷ d ±vø is the average bit rate of the code on (11.29) ²ÙòYq)úõó¸! For brevity, an asymptotically achievable information rate will be referred to as an achievable information rate. is achievable, then Remark It follows from the above definition that if q&ò f Á q × is also û z q g z q q ÿ¿Yò achievable for all . Also, if is achievable, × ÿ ö Á then q ~ñr¬Áýüþ)q is also achievable. 
Therefore, the set of all achievable ÿ information rates is closed and fully characterized by the maximum value in the set. D R A F T September 13, 2001, 6:27pm D R A F T 243 Single-Source Network Coding 457698;:26=<?>=>=@ñÿEACBED7F2G õ IJ8MLÚ8ZVZQ a For a graph $ with rate constraints , if q is achievable, then ë q{z&rt ~ rtsuJvMwyx Rÿc U ì ! (11.30) õ ë ë rate constraints ë Proof It suffices to prove that for a graph $ with , if for any óô there exists for sufficiently large an ÿ ±vø úÿJì$û#ü¬ì.ÿC²9ì#³ -code on $ such that b ö w"÷ d ±vøÚz&øø ùYó (11.31) ë for all úiÿKìûçü , and (11.32) ²ÙòYq)ú½óÒÿ then q satisfies (11.30). Consider such a code for a fixed ó and a sufficiently large , and consider gf è ÿhÄÿ¸·¸·¸·;ÿCi % between node and node . Let any e and any cut ë ë Î ë Ì è Î Î where ¯ f ¯2ì Á è J¿ ¯Xì ì.ÿ (11.33) ë Î and Á is the Î function from ° to ÀÂÁ induced inductively by Á such that Á ¯2ì denotes the value of as a function of è . is all the information known by node during the whole coding ¯ ¯2ì Î ë session when the message isµ ¯ ë . Since Á ¯2ì is a function of ë the information è â ¿=ì , we see inductively that ¯Xì is a function previously received by node Î ë Ì 71 of Á ¯Xì.ÿ¿ , where -, ë ÁUÛ(ÿ ë zN¿ × ° zN¿ ÿKì * is the set of edges across the cut , we have determined at node ë ë ® ì z Î 1 Á yÈ}É Ê ö w"÷ d 71 D R A F T ë Î ö 71 è R Õ 1 ö (11.37) ë Á 1 Á yÈ}É Ê z can be Ì ®/ì.ÿ¿ z ¯ (11.35) (11.36) ìcì ë Á (11.34) ë â ®/ìcì è z ' /%0 and as previously defined. Since % â ® ë ë ®®ÿ /% ®/ìcì w"÷ d à ÀÂÁÄà (11.38) (11.39) Ç Á yÈ}É Ê Ã À+ÁËà w"÷ d ± Õ! September 13, 2001, 6:27pm (11.40) (11.41) D R A F T 244 A FIRST COURSE IN INFORMATION THEORY Thus q)úEó z ² z b=ö z b=ö b z z w"÷ d Ñh ÒÓ Ô w"÷ ë d ðà ®/ì b ö 71 (11.42) (11.43) (11.44) (11.45) w"÷ d ± ë Yóì ø 71 z (11.46) 71 ùà ø (11.47) (11.48) à óÿ where (11.46) follows from (11.41). Minimizing the right hand side over all % , we have q)úEó zrt~ñ ø à à ó¸! (11.49) * Õ 1 The first term on the right hand side is the capacity of a min-cut between node and node . By the max-flow min-cut theorem,ë it is equal to the value of a max-flow from node to node , i.e., rtsuJvMwyx Rÿc ì . Letting ót¹ , we obtain ë q&z{rtsuKv9wyx Xÿc ìU! (11.50) gf Since this upper bound on q holds for all e ÿhÄÿ¸·¸·¸·~ÿCi ë q{z&rt ~ rtsuJvMwyx , Rÿc U ì ! (11.51) The theorem is proved. Remark 1 The max-flow bound cannot be proved by a straightforward applië ë because for a cutë % ë between node and cation of the data processing theorem ë , the random variables ® , ®/ì "!Ú*íì , ® ì+¶#"! * × ì , and node â ®/ì in general do not form a Markov chain in this order, where å, !Ú* and ë . ë å, !ê* × ÿKì % ç ÿJì& for some $t0 * (11.52) * for some t0 (11.53) are the two sets of nodes on the boundary of the cut % . The time parameter and the causality of an ³ -code must be taken into account in proving the max-flow bound. D R A F T September 13, 2001, 6:27pm D R A F T 245 Single-Source Network Coding Remark 2 Even if we allow an arbitrarily small probability of decoding error in the usual Shannon sense, by modifying our proof by means of a standard application of Fano’s inequality, it can be shown that it is still necessary for q to satisfy (11.51). The details are omitted here. 11.5 ACHIEVABILITY OF THE MAX-FLOW BOUND The max-flow bound has been proved in the last section. The achievability of this bound is stated in the following theorem. This theorem will be proved after the necessary preliminaries are presented. 
457698;:26=<?>=>=@(' For a graph $ with rate constraints ) , if ë q{z&rt ~ rtsuJvMwyx Rÿc . ì ÿ (11.54) then q is achievable. A directed path in a graph $ is a finite non-null sequence of nodes ¼ ë ¼ ¼* ¼ bÿ ¼ Ðÿ bì f are ÿhÄÿ¸·¸·¸·~ÿ,+?ú dÿ¸·¸·¸·Kÿ »f ¼* (11.55) bÒÿ ë ¼ f ¼ ÿhÄÿ¸·¸·¸·;ÿ,+ ú for all . The edges ÿ bì , referred to as the edges on¼ the directed path. Such a ¼ ¼-* ¼* sequence is called a directed path from b to . If b , then it is called a directed cycle. If there exists a directed cycle in $ , then $ is cyclic, otherwise $ is acyclic. In Section 11.5.1, we will first prove Theorem 11.4 for the special case when the network is acyclic. Acyclic networks are easier to handle because the nodes in the network can be ordered in a way which allows coding at the nodes to be done in a sequential and consistent manner. The following proposition describes such an order, and the proof shows how it can be obtained. In Section 11.5.2, Theorem 11.4 will be proved in full generality. ?f that such .º:X80/M821yOñW7O8ZQ >=>=@43 ë If $ Wÿ ì is a finite acyclic graph, then it is possible to order the nodes of $ in a sequence such that if there is an edge from node to node , then node is before node in the sequence. Proof We partition into subsets b ÿU d ÿ¸·¸·¸· , such that node is in 9Á if and a longest directed path ending at node is equal to ¿ . We only if the length of is in 9Á such that ¿½zû¿ × , then there claim that if node is in 9ÁUÛ and node exists no directed path from node to node . This is proved by contradiction as follows. Assume that there exists a directed path from node to node . Since there exists a directed path ending at node of length ¿ × , there exists f a directed path ending at node containing node whose length is at least ¿ × . D R A F T September 13, 2001, 6:27pm D R A F T 246 A FIRST COURSE IN INFORMATION THEORY Since node is in equal to ¿ . Then , the length of a longest directed path ending at node is 9Á f z¿S! (11.56) ô¿ × ò¿S! (11.57) ¿ × However, this is a contradiction because f ¿ × Therefore, we conclude that there exists a directed path from a node in =ÁÛ to a node in =Á , then ¿ × Ø¿ . Hence, by listing the nodes of $ in a sequence such that the nodes in 9Á appear before the nodes in 9ÁUÛ if ¿ Øå¿ × , where the order of the nodes within each 9Á is arbitrary, we obtain an order of the nodes of $ with the desired property. 5ÂF7D <6/=IJ6Y>=>=@ñ] Consider ordering the nodes in the acyclic graph in Figure 11.3 by the sequence f (11.58) XÿhÄÿ ÿ ÿckÿcd|ÿcb! in this sequence, if there is a directed path from node It is easy to check that to node , then node appears before node . 11.5.1 ACYCLIC NETWORKS In this section, we prove Theorem 11.4 for the special case when f f the graph is acyclic. Let the vertices in $ be labeled by ÿ ÿ¸·¸·¸·~ÿ}ÃÆÃ3ú in the fol has the label ë 0. The other vertices lowing way. The source are labeled in a f f way such that for z/z̦̾ú , ÿJì7 implies Ø . Such a labeling is possible by Proposition 11.5. 
We regard RÿcbÒÿ¸·¸·¸·Kÿc98 as aliases of the ë corresponding vertices.ë ë ì.ÿC²9ì;: -code on the graph $ defined We will consider an ÿ ±Ú ÿKì& by $ ë RÿJì< 1) for all , an encoding function Î= where ÿKì& such that Î ¾ , ¹ ÿhÄÿ¸·¸·¸·ÿc± (11.59) Õ0Xÿ -,¶f ° ë 2) for all = ,¶f +˰ ë Ç Û Û ' ÿhÄÿ¸·¸·¸·;ÿJÑh (11.60) ÒÓ ÔÕ03ß , an encoding function ,¶f ,¶f ÿhÄÿ¸·¸·¸·~ÿc± Û 0º¹ ÿhÄÿ¸·¸·¸·ÿc± Õ0 (11.61) Î (if × × ÿ ì> we adopt the convention that is an arbi0 is empty, ,¶f trary constant taken from ÿhÄÿ¸·¸·¸·~ÿc± Õ0 ); D R A F T September 13, 2001, 6:27pm D R A F T 247 Single-Source Network Coding gf 3) for all e , a decoding function ÿhÄÿ¸·¸·¸·~ÿCi à ,¶f Ç ÿhÄÿ¸·¸·¸·~ÿc± R â such that à à è ë ¯ à denotes the value of ¯2ì Î (11.62) ¯2ì ë for all ¯?° . (Recall that è â 0+¹N° ë (11.63) as a function of ¯ .) à Î In the above, is the encoding function for edgeÎ ÿKì , and is the decoding function forÎ sink node . In a Î coding session, is applied before Û CÛ if Ø × . This ë defines the order in which × , and is applied before Û if ÆØ{ if × ÿ ì , a node does not the encoding functions are applied. Since × Ø encode until all the necessary information is received on the input channels. A : -code is a special case of an ³ -code defined in Section 11.3. ë Assume that q satisfies (11.54) with respect to rate constraints ) . Itë suffices ë show that for any óÙô an ÿ ± / to , there exists for sufficiently large ÿJì ì.ÿq½úõóUì: -code on $ such that ë b ö w"÷ d ±Úz&ø ùYó (11.64) ë ë ë . Instead, we will show the existence of an ÿ ± ÿJì@ -code satisfying the same set of conditions. This will be done by constructing a random code. In constructing this code, we temporarily replace ° by -,¶f ° × ÿhÄÿ¸·¸·¸·;ÿJÑBAh ÒDC ÔÕ0Xÿ (11.65) ÿJì@ for all ì.ÿq : ì Î= where A is any constant ë greater than 1. Thus the domain of is expanded . from ° to ° × for all XÿKì& ë We now construct the encoding Î= functions ë as follows. For all E such that = {° × , let ¯Xì be a value selected independently ë from XÿKìF ,¶f , for all ¯G set ÿhÄÿ¸·¸·¸·ÿc± 0 according to the uniform distribution. For all ÿJì the ' ÿ , and for all H Î ë ,¶f Ç ÿhÄÿ¸·¸·¸·~ÿc± Û Û Õ H ^Û( 0Xÿ ,¶f let ì be a value selected independently from the set ing to the uniform distribution. H Let =yë and for %ÿ ¯2ì ' , let H ¯Xì ë Î ë ë ÿhÄÿ¸·¸·¸·~ÿc± Õ0 ¯Xì.ÿ ÿKì accord- (11.67) ¯bÿ è ë D R A F T (11.66) ì.ÿ September 13, 2001, 6:27pm (11.68) D R A F T 248 A FIRST COURSE IN INFORMATION THEORY H è Î ë Î ë where ¯I° × and ¯Xì denotes the value of as a function of ¯ . ¯Xì is all the information received by node during the coding session when the H H ë distinct ë ¯bÿc¯ × P° , ¯ and ¯ × are indistinguishable at sink message is ¯ . For â ¯ × ì . For all ¯J° , define if and only if â ¯Xì f gf LM NK ë ¯Xì H H ë if for some e exists ¯ ÿhÄÿ¸ë ·¸·¸·;ÿC i , there ' â ¯ × ¯ , such that â ¯2ì ¯ ×ì , otherwise. × ° , (11.69) ë ¯Xì is equal to 1 if and only if ¯ cannot be f uniquely determined at at least one of the sink nodes. Now fix ¯Oõ° and z e z-i . Consider any ¯ × õ° not equal to ¯ and define the sets H H -, ë . % ¨ H -, and %ùb ¯Xì ë ë ' ¯ × ì0 H ë ¯2ì ç (11.70) ¯ × ì03! (11.71) % ¨ is the set of nodes at which the two messages ¯ and ¯ × are distinguishable, and %b is the set of H nodes atH which ¯ and ¯ × are indistinguishable. Obviously, ë ë >/% ¨ . â â ¯Xì ¯ × ì . Then % ¨ % Now suppose for some %P , where é' ë % and Q ?% ,è i.e., % isè a cut H betweenH node and node . For any ÿJì RTS , Î , ë Î ë ë ë ' b ¯Xì ¯ × ì¸Ã ¯Xì ¯ × ì0 ± ! 
(11.72) Therefore, RUS , % ¨ RTS , %0 R S , % ¨ T %oÿý% ¨@V RTS , % ¨ %tÃÅ% ¨ V è ÃÅ% ¨ V RTS %t ë , Î @ z Ç % ¨ R Õ21 b ± Ç R Õ21 where %Ï0 RUS , %Ï0 %Ï0 Î ë ¯Xì H %0 V ë ¯ × ì¸Ã ¯Xì H ' ë ¯ × ì0 (11.76) (11.77) ÿ å, ë * % ¨ è (11.73) (11.74) (11.75) ÿJìW /%oÿ ' /%Ï0 (11.78) ë is the set of all the edges across the cut % as previously defined. Let ó be any fixed positive real number. For all ÿJì& , take ± such that ø6 D R A F T YX¬z{ b=ö w"÷ù±Úz&ø ùPó September 13, 2001, 6:27pm (11.79) D R A F T 249 Single-Source Network Coding for some ØZX . Then Ø&ó RUS , % ¨ z Õ21 Ò^D_ h j m z à h ]`Ga É4b Êdcfeg 1 [ É Êih (11.84) C ;] (11.85) ÿ 71 (11.83) ; ÃËò ò&rt~ñ ø (11.82) f * Ò h b) follows because _ Ò z 1 (11.81) a Ékb Ê9c4elg 1 [ É Ê h Ò ^ h = ]onUp9qsrtdu â v \[ É Ê;] Ò ]. ` z a) follows because h Õ21 (11.80) Ç z where b ± Ç %0 * Û ë 21 where % × is a cut between node theorem; ø rtsuKv9wyx Xÿc . ì ÿ (11.86) and node , by the max-flow min-cut c) follows from (11.54). Note that this upper bound does not depend on and H has h _ _ H subsets, RUS , ë hw_ z % is some subset of â ¯ × ì0 % ¨ . Since ë â ¯ S ì, RTX % _h % ;]. Ò C for some cut % between node and node z Ø ë â ¯2ì 0 (11.87) (11.88) H H by the union bound. Further, RTS , ë â ¯ × ì for some ¯ × ° ë f ;] ð × Ã}ú ìh _ _ h Ò C ;] Ò C h _ _h Ò C A h x Ú ] A h _ _h Ò ÿ Ú ×,¯ × ' ¯Æ0 (11.89) (11.90) (11.91) where (11.89) follows from the union bound and (11.90) follows from (11.65). Therefore, y ë ¯Xì{z D R A F T September 13, 2001, 6:27pm D R A F T 250 A FIRST COURSE IN INFORMATION THEORY RUS , RUS}| ë 8 gf ¯2ì H , 0 ë â ¯2ì (~ Ø i< ë AÚh _ b _h Ò (11.92) H ë â ¯ ×ì × ° for some ¯ × ,¯ × ' ¯Æ0 (11.93) ] (11.94) (11.95) ÿXKì by the union bound, where ë ÿXJì i<AÚhw_ ] _h Ò (11.96) ! Now the total number of messages which can be uniquely determined at all the sink nodes is equal to ë f ë ú ¯2ìcìU! (11.97) Û By taking expectation for the random code we have constructed, we have ë f Û ú ë ë f ¯Xìcì Û ô ë ¯2ì{zì (11.98) ë ë f ë Û ë f ò y ú ÿXKìcì ú (11.99) ÿXJìcì,AÚhÒxC1ÿ ú (11.100) where (11.99) follows from (11.95), and the last step follows from (11.65). Hence, there exists a deterministic code for which the number of messages which can be uniquely determined at all the sink nodes is at least ë ë f ÿXJìcì,AÚh ÒxC ú (11.101) ÿ ë which is greater than h ÒxC for sufficiently large since Let ° to be any set of Ñh H ÒxC Ô such messages in ° × . For e ,¶f Ç upon defining where ¯?° ÿhÄÿ¸·¸·¸·~ÿc± Û Û â à H such that ë ë ë H 0 ÿ Û â X as é¹ . and ÿhÄÿ¸·¸·¸·~ÿCi (11.102) ì ÿXK g f ì#¹ H (11.103) ¯ ë â ¯Xì.ÿ ë ÿJì (11.104) we have obtained a desired ÿ ± ì.ÿq ì<: -code. Hence, Theorem 11.4 is proved for the special case when $ is acyclic. D R A F T September 13, 2001, 6:27pm D R A F T 251 Single-Source Network Coding 11.5.2 CYCLIC NETWORKS For cyclic networks, there is no natural ordering of the nodes which allows coding in a sequential manner as for acyclic networks. In this section, we will prove Theorem 11.4 in full generality, which involves the construction of a more elaborate code on a time-parametrized acyclic graph. Consider the graph $ with rate constraints ) we have defined in the previë ous sections but without the assumption that $ is acyclic. We first construct Kÿ ì from the graph $ . The set a time-parametrized graph $> f consists of / layers of nodes, each of which is a copy of . Specifically, where ~ (11.105) ÿ ¨ -, ªt03! (11.106) As we will see later, is interpreted as the time parameter. 
The set of the following three types of edges: ë ¨ ÿ ë 1. 2. ë 3. ì.ÿ ÿc ÿ consists b zZ z zZ ë zQ½ú f ì.ÿ f ì.ÿ ÿJìW f ÿ zZÙzEú . ¨ For $> , let x be the source node, and Þ letf , be a sink node which corresponds to the sink node in $ , e ÿhÄÿ¸·¸·¸·;ÿCi . Clearly, $ is ëµ ¼ each edge in $ ends at a vertex in ëaµ layer acyclic because with a larger index. ¼ be the capacity of an edge ÿ ì , where Let ø> ÿ ÿ ì ëµ ø M NK ø ¼ ë if ÿ ì ÿ and zZézQEú otherwise, and let ) y f ì ÿJì for some (11.107) ÿ ëµ ø ë b ÿ ¼ ÿ ì z8ÿ (11.108) which is referred to as the rate constraints for graph $ . Þ 5ÂF7D <6/=IJ6Y>=>=@f In Figure 11.6, we show the graph graph $ in Figure 11.1(a). ù6=<+< D node >=>=@4 gf $ for for the For e in graph $ from ÿhÄÿ¸·¸·¸·~ÿCi , there exists a max-flow to node which is expressible as the sum of a number of flows (from D R A F T September 13, 2001, 6:27pm D R A F T 252 A FIRST COURSE IN INFORMATION THEORY s (5) s* (0) =s (5) (0) 2 1 (0) (5) * t1 = t1 t1 =0 =1 =2 =3 Figure 11.6. The graph U with ¡%¢#£ for the graph =4 =5 in Figure 11.1(a). node to node ), each of them consisting of a simple path (i.e., a directed path without cycle) from node to node only. Proof Let be a flow from node to node in the graph $ . If a directed path in $ is such that the value of on every edge on the path is strictly positive, then we say that the path is a positive directed path. If a positive directed path forms a cycle, it is called a positive directed cycle. Suppose contains a positive directed cycle. Let ¤ be the minimum value of on an edge in the cycle. By subtracting on every edge ¤ from the value of in the cycle, we obtain another flow × from node to node such that the resultant flow out of each node is the same as . Consequently, the values of and is eliminated in × are the same. Note that the positive directed cycle in × . By repeating this procedure if necessary, one can obtain a flow from node to node containing no positive directed cycle such that the value of this flow is the same as the value of . Let be a max-flow from node to node in $ . From the foregoing, we assume without loss of generality that does not contain a positive directed è cycle. Let ¥b be any positive directed path from to in (evidently ¥b is è b simple), and let ¦}b be è the minimum value of on an edge along ¥#b . Let be b the flow from node to node along ¥#b with value ¦}b . Subtracting from , b is reduced to ú , a flow from node to node which does not contain a positive directed cycle. Then we can apply the same procedure repeatedly until is reduced to the zero flow4 , and § is seen to be the sum of all the flows which have been subtracted from § . The lemma is proved. 5ÂF7D <6/=IJ6Y>=>=@ñ_ The max-flow § from node to node Ub in Figure 11.1(b), whose value is 3, can be expressed as the sum of the three flows in Figure 11.7, 4 The process must terminate because the components of ¨ are integers. D R A F T September 13, 2001, 6:27pm D R A F T 253 Single-Source Network Coding « ¬ s « s s 1 1 1 ª 1 2 1 1 2 ª 1 1 1 t1 © 1 t1 © (a) 2 t1 © (b) (c) Figure 11.7. The max-flow in Figure 11.1(b) decomposed into three flows. each of them consisting of a simple path from node to node ® only. The value of each of these flows is equal to 1. ù6=<+< D f² >=>=@ >°¯ ² ² h ·¸·¸· i , if the value of a max-flow from to 9 in For e}± $ is greater than or equal to q ë , then the f value of a max-flow from to in $> is greater than or equal to ½ú"³´¶ ìq , where ³µ is the maximum length of a simple path from to 9 . 
We first use the last two examples to illustrate the idea of the proof of this lemma which will be given next. For the graph $ in Figure 11.1(a), ³Äb , the maximum length of a simple path from node to node Ub , is equal to 3. In Example 11.9, we have expressed the max-flow § in Figure 11.1(b), whose value is equal to 3, as the sum of the three flows in Figure 11.7. Based on these three flows, we now construct a flow from x to ®b in the graph $> in Figure 11.8. In this figure, copies of these three flows are initiated at nodes ¨ , b d , and , and they traverse in $¶ as shown. The reader should think of the b d ¨ flows initiated at nodes and as being ë ë at node ±å and ² generated ² b d b d delivered to nodes and via edges ì and ì , respectively, whose capacities are infinite. These are not shown in the figure for the sake of · =1 =0 =2 =3 ¸ =4 =5 Figure 11.8. A flow from ¹ to º » in the graph in Figure 11.6. D R A F T September 13, 2001, 6:27pm D R A F T 254 A FIRST COURSE IN INFORMATION THEORY b d clarity. Since ³=b ± , the flows initiated at ¨ , , and all terminate at b for z . Therefore, total value of the flow is equal to three ² times ² the ¨ b value of , i.e., 9. Since the copies of the three flows initiated at and § d interleave with each other in a way which is consistent with the original flow ¼ , it is not difficult to see that the value of the flow so constructed does not exceed the capacity in each edge. This shows that the value of a max-flow from - to , is at least 9. We now give a formal proof for Lemma 11.10. The reader may skip this proof without affecting further reading of this section. f Proof of Lemma 11.10 For a fixed z-e zåi , let § be a max-flow from node to node ® in $ with value q such that § does not contain a positive è è can write è directed cycle. Using Lemma 11.8, we è f² ² ²Á §± § b ² d § ·¸·¸·Õ §¾½ (11.109) h ·¸·¸· where §W¿ , À± contains a positive simple path ¥ from node ¿ node ® only. Specifically,è to ë ²{Æ ¦ ì&?¥ if ¿ otherwise, Ç ¿ § ¿Ã ±ÅÄ (11.110) where ¦}b G¦ýd·¸·¸·Y¦ ±qã! ë ²{Æ ½ ì¾ ¥ , Let È be the length of ¥ . For an edge ¿ ¿ ¿ of node from node along ¥ . Clearly, ¿ ë ²{Æ f É ì zQÈ ú ¿ ¿ (11.111) ë ²{Æ let É ì ¿ be the distance (11.112) and Now for È Ç zZézQõúʳ´ § where ¼ ¿ , define ¿ ëµ ² ¼ KÎÎ ¦ ± MÎ ¦ Î ÎÎ N ¦ ¿ ¿ Ç ¿ ±ÌË\¼ ¿ ë ² (11.113) ² ëµ ² ¼ ì ² sÍ (11.114) ²f if ëµ ² ¼ ìT± ë ì zQõú"³´ ë {² Æ ²{zZ Æ é jsÏ Â jsÏ Â b if ëµ ² ¼ ìT± ë ;Ð ² ì where ì&?¥ ¿ Ï if ìT± ì otherwise. (11.115) ë ²{Æ Since êGÉ D R A F T z³´ ! ¿ ¿ ² f ìZ zZÏYÈ ¿ zÑÏG³µ2zQ September 13, 2001, 6:27pm (11.116) D R A F T 255 Single-Source Network Coding ¿ is well-defined the second caseè and the third case in (11.115) and hence § Ç ¿ is a flow from node to node in $ derived for zÒPzÓYúY³µ . § from the flow §¿ in $ as follows. A flow of ¦ is generated at node x and ¿ . Then the flow traverses consecutive enters the th layer of nodes from layers of nodes by emulating the path ¥ in $ until it eventually terminate at ;Ð Ï ¿ ¿ , we construct node via node . Based on § § ½ ± ~ ÕÔ â § ± ¿ ² (11.117) b ¿ and § § ~ (11.118) ! ¨ We will prove that § z) componentwise. Then § is a flow from node to node , in $> , and from (11.111), its value is given by ÕÔ â ~ ¨ ~ ¿ ÕÔ â ½ ¦ b ± ¿ ë qÖ± ~ f õúʳµJ ìqã! (11.119) ¨ This wouldë imply that f the value of a max-flow from node to node in $ ëµ ² ¼ is at least õú"³´K ìq , and the lemma is proved. 
such Toward proving that § z×) , we only need to consider ì ëµ ² ¼ ë ²{Æ that b ìU± ì (11.120) ë ²{Æ f Ç ì& for some and zZ zQ½ú , because ø> is infinite otherwise (cf. (11.107)). For notational convenience, we adopt the convention that § for Ø Ç . Now for Ç f zZézQ½ú ± ~ Û ÕÔ â ± ~ Û ½ ± ~ ¿ ¿ ~ ½ b ½ ¼ ~ , b ¨ (11.122) ¼ aÙÛ Ø c Ùa ØÚ c ¿ »  (11.123) ¼ aÙÛ Ø c Ùa ØÚ c ¿ »  (11.124) ¿ ÕÔ â b Û ± ~ ¨ (11.121) aÃÛ Ø c Ùa ØÚ c »  ¼ ¨ Ç ì ÕÔ â D R A F T ± ë ²{Æ and aÃØ c aÙØÚ c » ¼  ¿ aÃ Ø c j Ùa ØÏ Ú Â c » ¿  September 13, 2001, 6:27pm (11.125) D R A F T 256 A FIRST COURSE IN INFORMATION THEORY è z ½ ~ ¿ ± (11.126) ¼2àø àø z ± ¼ ÿ  b (11.127) (11.128) (11.129) aÙØ c Ùa ØsÚ c » !  In the above derivation, (11.125) follows from the second case in (11.115) aÙÛ Ø aÃØsÚ c because ¼ c ¿  » is possibly nonzero only when ë ²{Æ ±ZÜ GÉ ² (11.130) ì ¿ ë ²{Æ or ÜJ±×|úÊÉ ² (11.131) ì ¿ ë ²{Æ and (11.126) can be justified as follows. First, the inequality is justified for ÙØQÉ ì since by (11.121), ¿ § ë ²{Æ Ç Û ¿ Ç ± (11.132) ²{Æ òÝÉ ì , we distinguish two cases. From (11.115) and for ÜgØ . ë For ¿ (11.110), if ìW#¥ , we have è ¿ ¼ ë ²{Æ If ì ' #¥ ¿ aÙ Ø c jÙa ØÏ Ú R  c » ¿  ¼ ÿ  ±I¦ ± , we have (11.133) ! ¿ è aÙ Ø j Ùa ØÏ Ú Â c ¼ c » ¿ ¼ ÿ  ± ± Ç (11.134) ! Thus the inequality is justified for all cases. Hence we conclude that §¾+zQ) , and the lemma is proved. ë f From Lemma 11.10 and the result in Section 11.5.1, we see that )ú ³ is achievable for graph $ with rate constraints ) , where ³7± Thus, ² ë for everyf ó+ô õúʳ+ ëµ ² ¼ ìq rtsu ³´ ! , there exists for sufficiently large Þ a Þ ì: -code on $¶ such that b ö w"÷ d ± zø ìq (11.135) ë Ç Þ D Yó ² ë ëµ ² ¼ $ß ± ìW (11.136) ëµ ² ¼ . For thisÎ : -code on $ , we denote the encoding function for all ìà by , and denote the decoding function at the sink for an edge ìF D R A F T September 13, 2001, 6:27pm D R A F T 257 Single-Source Network Coding f² à ® by , } e ± node f zZ zQ , ² ² h ·¸·¸· . Without loss of generality, we assume that for i Î = and for = Ùa Ø ë c ,¶f² for all ¯ in zZézQõú Î for all ¤ in aÙØ â c â f ² b C ÔÕ0 (11.138) ¤=ìU±Z¤ ² ± Á ·¸·¸· Á Á â ë ,¶f² Ç (11.137) ¯ ÕÔ Ñhâ ·¸·¸· , ¯XìU± ² °á± f Ç (11.139) aÙØã c aÙØ c » â 0 ² (11.140) ë . Note that if the : -code does not satisfy these assumptions, ² it can ì , readily be converted into ë one because the capacities of the edges f ² f Ç are infinite. zZ zQ and the edges ì , zZézQEú ë ² ë integer, ë ²{Æ and let ² Q ë ±Å&Þ . Using Let be a positive the : -code on $> , we f now construct an ±ÃÂ ß ì ì õúʳ¾ ìq<ä- ìå -code on $ , where z-e z i ±ÃÂ ë ²{Æ ì for ± ~ b aÙØ c Ùa ØÚ c »  ± ¨ (11.141) . 
The code is defined by the following components: ë ²{Æ ì& 1) for Ç Î ' b such that ± , an arbitrary constant à,¶f² ² ² h ± ·¸·¸· taken from the set aÃæ c a c 0 ß Â » 3 (11.142) f zZÙz 2) for , an encoding function Î = Â Æ ª for all ,¶f² ß ° ë ²{Æ such that h ì < ,¶f² ² ¹ = aÙØã c aÙØ c »  0 ± (11.143) , where ² °Å± ² ·¸·¸· ² h Ñhâ ·¸·¸· ÕÔ b ² C ÔÕ0 (11.144) and for h¬zZézQ , an encoding function Î ÃÂ ß ë ²{Æ ,¶f² Ç Á Á h ² ² ·¸·¸· D R A F T ì G for all such that Î adopt ,¶f² ² the² convention that àaÙØã c aÙØ c h ·¸·¸· ± 0 ); »  aÙØlãç c Ãa Ølã c » 0+¹ ± Á ,¶f² h ² ² ·¸·¸· ± aÙØã c aÙØ c »  0 ë ² (11.145) , 0 is empty, we (if ¿ ß ¿ ìÖ is an arbitrary constant taken from the set ± ' September 13, 2001, 6:27pm D R A F T 258 A FIRST COURSE IN INFORMATION THEORY f² ² 3) for e± ² h ·¸·¸· à ß à è , a decoding function i Ç b ~ ,¶f² Ç ² ² h â ¨ ë ²{Æ è à ì i) for ë such that ¯Xì<±¯ for all ¯è)° as a function of ¯ ); where aÙØ c Ùa ØÚ c â » 0º¹»° ± ·¸·¸· (11.146) ë à (recall that ¯2ì denotes the value of ' such that ± , Î Î b à,¶f² ± Î aÃæ a (ëµ ² c  » c is an arbitrary constant in ¨ 0 is empty); ì< aÃæ c a c  » ² h (11.147) ² aÃæ c a c  » 0 ± ·¸·¸· , µ ß since f ii) for zZÙz , for all ¯?° , Î = ë for all ¤ in ² ·¸·¸· Î à ë for all in Ç ~ b à ¶â ¨ R (11.151) ì ² h (11.150) H ë ìU± (11.149) aÃØlãç c Ùa Øã c » 03ß ± Á H ,¶f² Ç c Ùa Ø c »  ¤ ì Ä ² ·¸·¸· ' ë aÙØã ² h , i ¤=ìU± ,¶f² (11.148) such that ± , ë Á Á ² h H éÂ Ç f² iii) for eê± ì& ² ë c Ãa Ø c »  ¯ ì X ë ²{Æ and for h¬zZézQ and all Î Î = aÃØlã ¯2ìU±  ² ± ·¸·¸· aÙØ c Ùa ØÚ c â » 03! (11.152) f The coding process of the å -code consists of / ë ²{Æ 1. In Phase 1,Æ for all and for all ìF such that ë where àD R A F T h|z×/zI ¯Xì ' Î b Æ such that ±g , node Î sends , à  to node ²{Æ Æ = b ë ì , node sends Â è ¯Xì to node . ë ²{Æ è 2. In Phase ,ë Î phases: , for all ì6 Î denotes the value of ÃÂ Î ë Æ , node sends à  ¯Xì to node , as a function of ¯ , and it depends September 13, 2001, 6:27pm D R A F T 259 Single-Sourceè Network Coding Î b ë ² ë only on Á ¯Xì for all ¿?½ suchf that received by node during Phase |ú . f² f 3. In Phase / ² , for e0± ì6 ¿ , i.e., the information ² h ·¸·¸· à , the sink node 9 uses to decode ¯ . i ë ²{Æ ² ë ë From the definitions, we see that anë ² ë ±é ßë ²{Æ ì& å -code on $ is a special case of an ±éÂ ß ì7 ³ -code on $ . For the å -code we have constructed, b ö ë w"÷ d ± à± <Þ9ì b ± ± Þ ¨ b ö b ë ø ~ b w"÷ d>ë b b ~ z b ö ± ë ²{Æ ¨ ~ Ç b ~ ¨ ± w"÷ d ± ² ë f ì ² ë /úJ³ ì aÃØ c Ùa ØÚ cíì »  aÙØ c Ãa ØsÚ c »  &ú³¬ f ìq<ä- ì ìq<ä- ì (11.153) (11.154) aÃØ c Ùa ØÚ c » PóUì  (11.155) b ë ÃÂùYóì ø (11.156) ¨ ø6àYó (11.157) ì& , where (11.155) follows from (11.136), and (11.156) follows for all Ç from (11.107). Finally, for any óãô , by taking a sufficiently large , we have ë f õúʳº ìq (11.158) ôYq)úEó! Hence, we conclude that q is achievable for the graph ) . $ with rate constraints PROBLEMS In the following problems, the rate constraint for an edge is in bits per unit time. 1. Consider the following network. We want to multicast information to the , sink² nodes ² ² at the maximum rate without using network coding. Let !ݱ yb d ·¸·¸· ë ï² îË 0 be the set of bits befmulticast. to Let ! be the set of bits sent in edge ì , where à ! cô±h , ² ² ± h . At node , the received bits are duplicated and sent in the two out-going edges. Thus two bits are sent in each edge in the network. f a) Show that !±I!6 ë ! for any b) Show that ! D R A F T !bð!+dìU±Z! z Æ Ø z . . 
September 13, 2001, 6:27pm D R A F T 260 A FIRST COURSE IN INFORMATION THEORY ò t1 1 ñ s ò 2 t2 ò ó t3 3 ë c) Show that à ! !bð!+dì¸ÃKzgà ! ø à !bø à !+d¶Ã}ú à !b!+d3à . d) Determine the maximum value of ô and devise a network code which achieves this maximum value. e) What is the percentage of improvement if network coding is used? (Ahlswede et al. [6].) 2. Consider the following network. ø 1 ö s õ 3 2 5 4 ÷ 6 Devise a network coding scheme which multicasts two bits yb and Ud from node to all the other nodes such that nodes 3, 5, and 6 receive yb and d ë ² after 1 unit time and nodes 1, 2, and 4 receive yb and d after 2 units of time. In other words, node receives information at a rate equal to rtsuJvMwyx ì ' for all ± . See Li et al. [123] for such a scheme for a general network. 3. Determine the maximum rate at which information can be multicast to nodes 5 and 6 only in the network in Problem 2 if network coding is not used. Devise a network coding scheme which achieves this maximum rate. ë ² 4. Convolutional network code f² ² In the following network, rtsuKv9wyx ®RìU± for eê± h . The max-flow bound asserts that 3 bits can be multicast to all the three sink nodes per unit time. We now describe a network coding scheme which achieve this. Let 3 D R A F T September 13, 2001, 6:27pm D R A F T 261 Single-Source Network Coding û ú t uù 2 ü ù v u1 ü 2 ý û 0 s ü ú v t1 v1 û ù t 0 2 uú 0 ë ² ë ² f² ë ² h ·¸·¸· , where bits yb ¿=ì d ¿=ì ¿=ì be generated at node ë at time ¿#± ë without loss of generality that ï ¿=ë ì is an element of the finite we assume Ç Ç field f $>¼ h3ì . We adopt the convention that ï ¿=ì ± for ¿½z . At time ¿Ùò , information transactions T1 to T11 occur in the following order: ë ¼ Ç ²f² h T1. ¼ sends ï ë¿=ì to µ , eê± ²f² Ç T2. µ sends s ¿= to , , and , ì 9 ( þ ® ÿ þ e ± b d ë ë ë µh f f T3. µ ¨ sends ¨ ë ¿=ì7Yyb ë ¿¬ú f ì7Yd ë ¿¬ú f ì to b T4. µ b sends ¨ ë ¿=ì7Yyb ë ¿¬ú ì7Y ë d ¿¬ µ ì to d f ú T5. µ b sends ¨ ë ¿=ì7Yyb ë ¿=ì7Yd ë ¿êú f ì to d T6. µ d sends ¨ ë ¿=ì7Yyb ë ¿=ì7Yd ë ¿êú µ ì to ¨ T7. µ d sends ¨ ë ¿=ì7Y b ë ¿=ì7Y d ë ¿=ì to ¨ T8. ¨ sends ¨ ¿=ë ì7Yyfb ¿=ì7Yd ¿=ì to Ub ì T9. d decodes d ¿¬ ë ú T10. ¨ decodes ¨ ë ¿=ì T11. Ub decodes yb ¿=ì where “ ” denotes modulo 3 addition and “+” denotes modulo 2 addition. a) Show thatf the information transactions T1 to T11 can be performed at time ¿%± . f by induction b) Show that T1 to T11 can be performed at any time ¿éò on ¿ . ë ë c) Verify that at time ¿ , nodes d ¿ × ì for all ¿ × z¿ . D R A F T ¨ and b can recover ¨ September 13, 2001, 6:27pm ë ¿ × ì , b ¿ × ì , and D R A F T 262 A FIRST COURSE IN INFORMATION THEORY ë ë d) Verify thatë at time ¿ , node d canf recover ¨ ¿ × ì and yb ¿ × ì for all ¿ × z ¿ , and d ë ¿ × ì for all ¿ × z\¿Ùú . Note the unit time delay for d to recover d ¿=ì . (Ahlswede et al. [6].) HISTORICAL NOTES Network coding was introduced by Ahlswede et al. [6], where it was shown that information can be multicast in a point-to-point network at a higher rate by coding at the intermediate nodes. In the same paper, the achievability of the max-flow bound for single-source network coding was proved. Special cases of single-source network coding has previously been studied by Roche et al. [165], Rabin [158], Ayanoglu et al. [17], and Roche [164]. The achievability of the max-flow bound was shown in [6] by a random coding technique. Li et al. [123] have subsequently shown that the max-flow bound can be achieved by linear network codes. 
D R A F T September 13, 2001, 6:27pm D R A F T Chapter 12 INFORMATION INEQUALITIES An information expression refers to a linear combination1 of Shannon’s information measures involving a finite number of random variables. For example, (12.1) and (12.2) are information expressions. An information inequality has the form !#" " (12.3) where the constant is usually equal to zero. We consider non-strict inequalities only because these are usually the form of inequalities in information theory. Likewise, an information identity has the form $%"'& (12.4) ($)" We point out that an information identity is equivalent to the pair of information inequalities and . An information inequality or identity is said to always hold if it holds for any joint distribution for the random variables involved. For example, we say that the information inequality *+" !,+" +- (12.5) 1 More generally, an information expression can be nonlinear, but they do not appear to be useful in information theory. D R A F T September 13, 2001, 6:27pm D R A F T 264 /102 A FIRST COURSE IN INFORMATION THEORY . . On the other always holds because it holds for any joint distribution hand, we say that an information inequality does not always hold if there exists a joint distribution for which the inequality does not hold. Consider the information inequality (12.6) 3 ,+-4& Since +(12.7) always holds, (12.6) is equivalent to 3 (12.8) $5 only if and are independent. In other words, (12.6) which holds if and does not hold if and are not independent. Therefore, we say that (12.6) does not always hold. As we have seen in the previous chapters, information inequalities are the major tools for proving converse coding theorems. These inequalities govern the impossibilities in information theory. More precisely, information inequalities imply that certain things cannot happen. As such, they are referred to as the laws of information theory. The basic inequalities form the most important set of information inequalities. In fact, almost all the information inequalities known to date are implied by the basic inequalities. These are called Shannon-type inequalities. On the other hand, if an information inequality always holds but is not implied by the basic inequalities, then it is called a non-Shannon-type inequality. We have not yet explained what it means by that an inequality is or is not implied by the basic inequalities, but this will become clear later in the chapter. Let us now rederive the inequality obtained in Example 6.15 (Imperfect secrecy theorem) without using an information diagram. In this example, three , and are involved, and the setup of the problem is random variables equivalent to the constraint * 67 8 Then $5-4& (12.9) 3 9:6;<= $ 9:6;<=#> ?=@A CB $ 9:6;<= ? 9:6;<=#> @D;ED ?CB $ 9F6@D;GH367 8 $ 9F6@ D R A F T September 13, 2001, 6:27pm (12.10) (12.11) (12.12) (12.13) (12.14) (12.15) D R A F T Information Inequalities where we have used in obtaining (12.12), and @I * +;G +- 265 (12.16) (12.17) and (12.9) in obtaining (12.15). This derivation is less transparent than the one we presented in Example 6.15, but the point here is that the final inequality we obtain in (12.15) can be proved by invoking the basic inequalities (12.16) and (12.17). In other words, (12.15) is implied by the basic inequalities. Therefore, it is a (constrained) Shannon-type inequality. We are motivated to ask the following two questions: 1. How can Shannon-type inequalities be characterized? 
That is, given an information inequality, how can we tell whether it is implied by the basic inequalities? 2. Are there any non-Shannon-type information inequalities? These are two very fundamental questions in information theory. We point out that the first question naturally comes before the second question because if we cannot characterize all Shannon-type inequalities, even if we are given a non-Shannon-type inequality, we cannot tell that it actually is one. In this chapter, we develop a geometric framework for information inequalities which allows them to be studied systematically. This framework naturally leads to an answer to the first question, which makes machine-proving of all Shannon-type inequalities possible. This will be discussed in the next chapter. The second question will be answered positively in Chapter 14. In other words, there do exist laws in information theory beyond those laid down by Shannon. JLMK STVUVUVUW1XFYZ N Let M (12.18) $ ORQ P X [ < \]1^8_ N Y where , and let M $PO (12.19) X [ be any collection of random variables. Associated with are ` $ M Q (12.20) X dceexample, 1f forg $ba , the 7 joint entropies associated with joint entropies. For random variables c and 6 f are g c 1 f f 1 g c 1 g 6 c 1 f 1 g & (12.21) 12.1 THE REGION D R A F T September 13, 2001, 6:27pm D R A F T 266 Let and N A FIRST COURSE IN INFORMATION THEORY h j <\k1^?_ $ i lm 6jn i l $ & denote the set of real numbers. For any nonempty subset i of M , let (12.22) p orq (12.23) l tasZ fixed , we can then view as a set function from to h with For $u- , i.e., we adopt the convention that the entropy ofl an empty set [ the entropy of random variable is equal to zero. For this reason, we call function of . ` jx v M _(beportheqTy -dimensional szY j Euclidean space with the coordinates lm labeled Let i [ X O , where w corresponds to the value of i for any by w X vl M as the entropy space collection of random variables. We will refer to _ can be represented by for random variables. Then an entropy function M |l vector { v M is[ calledX a column vector in v . On the other hand, a column of some collection of entropic if { is equal to the entropy function random variables. We are motivated to define the following region in v M : } MK $PO'{ _ v M~ { is entropicY & (12.24) } g to as entropy functions. For convenience, theX vectors in MK will also be referred As an example, for $%a , the coordinates of v by c f g cCf cCg f1g arecCf1g labeled (12.25) cCf1g g c w f gS w w w } w g w w where w denotes w , etc, and K is the region in v of all entropy functions for 3 random variables. } While further characterizations of MK will be given later, we first point out a } few basic properties of MK : } 1. MK contains the origin. } } 2. MK , the closure of MK , is convex. } 3. MK is in the nonnegative orthant of the entropy space v M . X [ 2 The origin of the entropy space corresponds to the entropy function of degenerate random variables taking constant values. Hence, Property 1 follows. Property 2 will be proved in Chapter 14. Properties 1 and 2 imply that is a convex cone. Property 3 is true because the coordinates in the entropy space correspond to joint entropies, which are always nonnegative. } MK v M 2 The nonnegative orthant of D R A F T q is the region ] q1e for all j @f xZ 1 k September 13, 2001, 6:27pm . 
D R A F T 267 Information Inequalities 12.2 INFORMATION EXPRESSIONS IN CANONICAL FORM Any Shannon’s information measure other than a joint entropy can be expressed as a linear combination of joint entropies by application of one of the following information identities: 6 7 ;< $ 3 9;<= $ ;8= ?F@ $ & X (12.26) (12.27) (12.28) The first and the second identity are special cases of the third identity, which has already been proved in Lemma 6.8. Thus any information expression which involves random variables can be expressed as a linear combination of the associated joint entropies. We call this the canonical form of an information expression. When we write an information expression as , it means that is in canonical form. Since an information expression in canonical form is a linear combination of the joint entropies, it has the form ` { F { h (12.29) where denotes the transpose of a constant column vector in . The identities in (12.26) to (12.28) provide a way to express every information expression in canonical form. However, it is not clear whether such a canonical form is unique. To illustrate the point, we consider obtaining the in two ways. First, canonical form of ;< $ & Second, 67 < 6F $ 6F#;;<=;d 1 $ 6F#;;<= D91 $ 6* =;A $ & (12.30) (12.31) (12.32) (12.33) (12.34) 67 < Thus it turns out that we can obtain the same canonical form for via two different expansions. This is not accidental, as it is implied by the uniqueness of the canonical form which we will prove shortly. Recall from the proof of Theorem 6.6 that the vector represents the values of the -Measure on the unions in . Moreover, is related to the values of on the atoms of , represented as , by K D R A F T K M M { { {!$P M September 13, 2001, 6:27pm (12.35) D R A F T 268 `67` M A FIRST COURSE IN INFORMATION THEORY where is a unique matrix (cf. (6.27)). We now state the following lemma which is a rephrase of Theorem 6.11. This lemma is essential for proving the next theorem which implies the uniqueness of the canonical form. ?HL (¡R¢D£t¡ ¤ MK $PO' _ h ~ M _ } MK Y & (12.36) ¤ Then the nonnegative orthant of h is a subset of MK . ¥§¦n¨©Dª¡R¢D£«¢ Let be an information expression. Then the unconstrained information identity $5- always holds if and only if is the zero function. Proof Without loss of generality, assume is in canonical form and let { $ F {8& (12.37) ® Assume 7$¬- always holds and is not the zero function, i.e., $¬- . We will show that this leads to a contradiction. Now $¯- , or more precisely the YZ set ~ O'{ {!$5(12.38) is a hyperplane in the entropy space which has zero Lebesgue measure . If +$°- always holds, i.e., it holds for all joint distributions, then _ } MK } must be 3 $b - , otherwise there _ } exists an { MK which contained in the hyperplane MK , it corresponds to the $±- . Since { is not on ¬$±- , i.e., { distribution. This means that there exists a joint entropy function of some joint distribution such that { $²- does not hold, which cannot be true because d$%- } always holds. If MK has positive Lebesgue measure, it cannot be contained in the hyperplane} *$P- which has zero Lebesgue measure. Therefore, it suffices to show that MK has positive Lebesgue measure. To this end, we see from Lemma 12.1 that the nonnegative orthant of v M , which has positive Lebesgue ¤ ¤ } measure, is a subset of MK . Thus MK has positive Lebesgue measure. Since MK is an invert¤ ible transformation of MK , its Lebesgue measure is also positive. 
} Therefore, MK is not contained in the hyperplane *$P- , which implies that there exists a joint distribution for which !$P- does not hold. This leads to a contradiction because we have assumed that 3$®- always holds. Hence, we have proved that if E$5- always holds, then must be the zero function. Conversely, if is the zero function, then it is trivial that P$³- always holds. The theorem is proved. ] ´µ ´Z¶ µ q Let 3 4 3 If , then is equal to . Lebesque measure can be thought of as “volume" in the Euclidean space if the reader is not familiar with measure theory. 4 The D R A F T September 13, 2001, 6:27pm D R A F T 269 Information Inequalities · ¨©¨¸R¸Z ©R¹P¡R¢D£«º c f Proof Let and and The canonical form of an information expression is unique. » c be canonical forms of an information expression . Since »<$% (12.39) f »<$% (12.40) c f always hold, c $5f (12.41) alwaysc holds.f By the above theorem, is the zero function, which implies that and are identical. The corollary is proved. c Due to the uniqueness of the canonical form of an information expression, itf is an easy matter to check whether for two information expressions and the unconstrained information identity c f $¯ c f (12.42) in canonical form. If all always holds. All we need to do is to express the coefficients are zero, then (12.42) always holds, otherwise it does not. 12.3 A GEOMETRICAL FRAMEWORK } MK In the last section, we have seen the role of the region in proving unconstrained information identities. In this section, we explain the geometrical meanings of unconstrained information inequalities, constrained information inequalities, and constrained information identities in terms of . Without loss of generality, we assume that all information expressions are in canonical form. } MK an unconstrained information inequality ¼½- , where { $ Consider { . Then !+- corresponds to_ the set Y F ~ M O'{ v {+(12.43) _ in the entropy which is a half-space space v M containing the origin. Specifv M , { ¾- if and only if { belongs to this set. For ically, for any { X we will refer to this set as the half-space !+- . As an example, for simplicity, $ , theinformation c 1 f inequality c f c 1 f $ +(12.44) 12.3.1 UNCONSTRAINED INEQUALITIES D R A F T September 13, 2001, 6:27pm D R A F T 270 A FIRST COURSE IN INFORMATION THEORY ¿f n 0 Figure 12.1. An illustration for c f w corresponds to the half-space _ c ~ M fO'{ v w in the entropy space v . written as w w w cCf f ÀÁG always holds. +cCf Y w +- & (12.45) (12.46) Since an information inequality always holds if and only if it is satisfied by the entropy function of any joint distribution for the random variables involved, we have the following geometrical interpretation of an information inequality: _ Y !#- always holds if and only if } M|K à O'{ v M|~ { #- & This gives a complete characterization of all unconstrained inequalities in terms } } X we in principle can determine whether any information of MK . If MK is known, inequality involving random variables always holds. The two possible cases } for Ľ- are illustrated in Figure 12.1 and Fig_ 6(- , ure 12.2. In Figure 12.1, MK is completely included in the half-space so * +rÅ always holds. In Figure 12.2, there exists a vector { } MK such that { - . Thus the inequality #- does not always hold. 12.3.2 CONSTRAINED INEQUALITIES In information theory, we very often deal with information inequalities (identities) with certain constraints on the joint distribution for the random variables involved. 
These are called constrained information inequalities (identities), and the constraints on the joint distribution can usually be expressed as linear constraints on the entropies. The following are such examples: 1. dc f g 6dce1f1gÆ , c D,and f are mutually g independent if and only if $ . D R A F T September 13, 2001, 6:27pm D R A F T Information Inequalities Çf 271 0 n .h 0 Figure 12.2. An illustration for ÀÁ| not always holds. dc f g dce1fÆ 1 1 f g c g , , and are pairwise independent if and only if 2. $ $ 5 $ . c f 6 c f 3. is a function of if and only if $5- . c 1 g f crÈ fLÈ gLÈ <É c 1 f 1<Éz forms g a Markov chain if and only if 4. $%- and $5- . Suppose there are Ê linear constraints on the entropies given by Ë {9$5- (12.47) Ë 9 ` where is a Ê matrix. Ë Here we do not assume that the Ê constraints are linearly independent, so is not necessarily full rank. Let Ì $PO'{ _ v M ~ Ë {9$5- Y & (12.48) Ì In other words, the Ê constraints confine { to a linear subspace in the entropy space. Parallel to our discussion on unconstrained inequalities, we have the following geometrical interpretation of a constrained inequality: } Ì Ì Y , 3(- always holds if and only if M§ Under the constraint KÍ Ã O'{ ~ { +- . This } gives a complete Ì characterization of all constrained inequalities in terms of MK . Note that $(v M when there is no constraint on the entropies. In this sense, an unconstrained inequality is a special case Ì of a constrained inequality. The two cases of !+- under the constraint are illustrated in Figure 12.3 and Figure 12.4. Figure 12.3 shows the case when Î- always holds under Ì the constraint . Note that 7P- may or may not always hold when there is no constraint. Figure 12.4 shows the case when #Ä- does not always hold D R A F T September 13, 2001, 6:27pm D R A F T 272 A FIRST COURSE IN INFORMATION THEORY Ïf 0 ÀHÁ| always holds under the constraint Ð . Ì under the constraint . In this case, !#- does not always hold when there is no constraint, because } MK Í Ì Ã O'{ ~ { +- Y (12.49) implies } MK à O'{ ~ { +- Y & (12.50) Figure 12.3. An illustration for 12.3.3 CONSTRAINED IDENTITIES As we have pointed out at the beginning of the chapter, an identity $%- *+- !,#- (12.51) always holds if and only if both the inequalities and always hold. Then following our discussion on constrained inequalities, we have Ñ f Ò0 Figure 12.4. An illustration for D R A F T ÀÁ| Рnot always holds under the constraint . September 13, 2001, 6:27pm D R A F T 273 Information Inequalities Ô c h Ób Ôh 0 0 Õ ¶Ö Á| and × ¶nÖ ÁG under the constraint Ð . Y Ì , $Ø - always Y holds if and only if } M§K Í Ì Ã Under the constraint O'{ ~ { +- Í O'{ ~ { ,+- , or Y Ì , $Ø- always holds if and only if } M§K Í Ì Ã Under the constraint O'{ ~ { $5- . } Ì This condition says that the intersection of MK and is contained in the hyperplane $%- . Figure 12.5. 12.4 Equivalence of EQUIVALENCE OF CONSTRAINED INEQUALITIES When there is no constraint on the entropies, two information inequalities { #Ù and Ù {+are equivalent if and only if $PÚ , where Ú (12.52) (12.53) v M Ì $Û Ì is a positive constant. However, this is not the case under a non-trivial constraint . This situation is illustrated in Figure 12.5. In this figure, although the inequalities in (12.52) and (12.53) correspond to different half-spaces in the entropy space, they actually impose the same constraint on when is confined to . In this section, we present a characterization of (12.52) and (12.53) being equivalent under a set of linear constraint . 
The reader may skip this section at first reading. Let be the rank of in (12.47). Since is in the null space of , we can write (12.54) { Ü D R A F T Ë { Ì { {$ÞË Ý {ß September 13, 2001, 6:27pm Ë D R A F T ` Ë ` E Ü where Ý is a 274 A FIRST COURSE IN INFORMATION THEORY Ë ËÝ {ß ` Ü matrix such that the rows of form a basis of the orthogonal complement of the row space of , and is a column vector. Then using (12.54), (12.52) and (12.53) can be written as and Ë Ý { ß + Ù Ë Ý { ß +- (12.55) Ë respectively in terms of the set of basis given by the columns of Ý . Then (12.55) and (12.56) are equivalent if and only if Ù Ë Ë (12.57) Ý $5Ú Ý where Ú is a positive constant, or Ù Ë (12.58) Ú Ý $5-4& Ù Ù Ú is in the orthogonal In words, complement ofthe row space of Ë Ý other Ë Ë ` , i.e., whose Ú is in the row space Ë . ( Ë ofcan.beLettakenß beasanË Ü ß if Ë matrix is full rank.) row space is the same as that of Ë Ë Ë Since the rankË of isË Ü and ß has Ü rows, the rows of ß form a basis for the row space of , and ß is full rank. Ì Then from (12.58), (12.55) and (12.56) are equivalent under the constraint if and only if Ù % Ë à ß $%Ú (12.59) for some positive constant Ú Ù and some column Ü -vector à . Suppose for given and , we want to see whether (12.55) and (12.56) are Ì . We first consider the case when either Ù under the constraint equivalent Ë or is in the row space of . This is actually not an interesting case because Ë if , for example, is in the row space of , then F Ë Ý $%(12.60) in (12.55), which Ì means that (12.55) imposes no additional constraint under the constraint . ¥§¦nÙ ¨©Dª¡R¢D£âá If either or Ù is in the row space of Ë , then {¯Ù {+- are equivalentË under the constraint Ì if and only if both and and are in the row space of . The proof of this theorem is left asÙ an exercise. We now turn to the more nor is in the row space of Ë . The interesting case when neither following theorem gives anÌ explicit condition for (12.55) and (12.56) to be equivalent under the constraint . D R A F T September 13, 2001, 6:27pm (12.56) D R A F T Information Inequalities Ù 275 ¥§¦nÙ ¨©Dª¡R¢D£«ã If neither nor is in the row space of Ë , then {3+Ì and {3+- are equivalent under the constraint if and only if ä Ë EåIæ à Ù ß (12.61) Úç $ & Ë has a unique solution with Úè%- , where ß is any matrix whose row space is Ë the same as that of . and Ù not in the row space of Ë , we want to see when we Proof For à can find unknowns Ú and satisfying (12.59) with Ú5è¾- . To this end, we write Ë (12.59) Ë in matrix form as ä (12.61). Since is not in the column space of ß and ß is full rank, Ë ß då is also full rank. Then (12.61) has either a unique solution or no solution. Therefore, the necessary and sufficient condition for (12.55) and (12.56) to be equivalent is that (12.61) has a unique . The theorem is proved. solution and GÚ è+é ê = ë¸z7¡R¢D£«ì Consider three random variables c 1 f and g with the dce1gR fÆ Markov constraint (12.62) $5which is equivalent to c 1 f f 1 g =6 c 1 f 1 g 6 f $5-4& (12.63) g In terms of the coordinates in the entropy space v , this constraint is written as Ë {9$5- (12.64) where Ë $ > - Qí- Q Qî- Q B (12.65) > c f g cCf f1g cCg cCf1gEB and {$ w w w w w w w & (12.66) We now show that under the constraint in (12.64), the inequalities dcW gÆ=dcW fÆ (12.67) + c 1 f g and +(12.68) Ù equivalent. 
Toward this end, we write (12.67) and (12.68) as { are in fact - and {+- , respectively, where > $ - Q Q Qî- Qï- B (12.69) D R A F T September 13, 2001, 6:27pm D R A F T 276 A FIRST COURSE IN INFORMATION THEORY B Ù > Q - Q Q Q & $ - - î (12.70) Ë Ë Ë . Upon solving Since is full rank, we may take ß $ ä ËEdå<æ à Ù (12.71) Ú ç $ à à Q matrix). we obtain the unique solution Ú6$ðQ!èñ- and $³Q ( is a Q Therefore, (12.67) and (12.68) are equivalent under the constraint in (12.64). Ù Ë Ì Under the constraint , if neither nor is in the row space of , it can be shown that the identities {!$%(12.72) Ù and (12.73) {!$5and are equivalent if and only if (12.61) has a unique solution. We leave the proof as an exercise. 12.5 THE IMPLICATION PROBLEM OF CONDITIONAL INDEPENDENCE jòÎ<óD Aô We use j <ó Aô to denote the conditional independency (CI) . j|ò#<óD Aô We have proved in Theorem 2.34 that is equivalent to jD1<óD AôT $%-4& (12.74) s j ò²<óx ô , becomes an unconditional independency which When õ7$ we regard as a special case of a conditional independency. When iª$÷ö , j: AôT (12.74) becomes $5j Aô (12.75) and are conditionally independent given which we see from Proposition 2.36 that is a function of . For this reason, we also regard functional dependency as a special case of conditional independency. In probability problems, we are often given a set of CI’s and we need to determine whether another given CI is logically implied. This is called the implication problem, which is perhaps the most basic problem in probability theory. We have seen in Section 7.2 that the implication problem has a solution if only full conditional mutual independencies are involved. However, the general problem is extremely difficult, and it has recently been solved only up to four random variables by Mat [137]. ùø ûú D R A F T September 13, 2001, 6:27pm D R A F T Information Inequalities c 1 f VUVUVUÆ1 } MK 277 M j òÎ<óD ô N jx1<ó AôT M à where i ö õ 6 jZüýô . Since <ó üýô =6 jZü ó üý$5ô =- isequivalent ô to (12.77) $5jòÎ<óx Aô corresponds to the hyperplane jZüôH ó üýô jZü ó üýô ô Y O'{ ~ w w w w $5- & (12.78) M Y corresponding to þ by ÿ þ . For a CI þ , we denote the hyperplane in v Let %$POÆ þ be a collection of CI’s, and we want to determine whether implies a given CI þ . This would be the case if and only if the following is _ } _ _ true: M K if { For all { ÿ þ then { ÿ þ & We end this section by explaining the relation between the implication problem and the region . A CI involving random variables has the form (12.76) Equivalently, } implies þ if and only if ÿ þ Í MK à ÿ þ & } Therefore, the implication problem can be solved if MK can be characterized. } Hence, the region MK is not only of fundamental importance in information theory, but is also of fundamental importance in probability theory. PROBLEMS 1. Symmetrical information expressions An information expression is said to be symmetrical if it is identical under every permutation of the random variables involved. However, sometimes a symmetrical information expression cannot be readily recognized symbolically. For example, is symmetrical in , and but it is not symmetrical symbolically. Devise a general method for recognizing symmetrical information expressions. c 1 f g c 1 f g c 1 f 4 2. The canonical form of an information expression is unique when there is no constraint on the random variables involved. Show by an example that this does not hold when certain constraints are imposed on the random variables involved. \ < \ ú 3. 
Alternative canonical form Denote Í Ý by and let $ ú ~ is a nonempty subset of N M & D R A F T September 13, 2001, 6:27pm D R A F T 278 ¼_ A FIRST COURSE IN INFORMATION THEORY a) Prove Y that a signed measure on M Oe c 1 f VUVUVUÆ1 M can be b) Prove that an information expression involving expressed uniquely as N a linear combination of K ú , where are nonempty subsets of M . M information expressions _ 4. Uniqueness of the canonicalÈ form for nonlinear ` ~ aY function h h , where $ Consider Q such that O'{ h ~ { $5- has zero Lebesgue measure. } a) Prove that cannot be identically zero on MK . canonical form for b) Use the result in a) to show the uniqueness of the the class of information expressions of the form » { where » is a polyis completely specified by , which can be any set of real numbers. nomial. Ù Ë $(- ,Ù if neither nor is in the row 5. Prove thatË under the constraint {!$%{- and space of , the identities {!$5- are equivalent if and only (Yeung [216].) if (12.61) has a unique solution. HISTORICAL NOTES The uniqueness of the canonical form for linear information expressions was first proved by Han [86]. The same result was independently obtained in the book by Csisz r and K rner [52]. The geometrical framework for information inequalities is due to Yeung [216]. The characterization of equivalent constrained inequalities in Section 12.4 is previously unpublished. ø D R A F T September 13, 2001, 6:27pm D R A F T Chapter 13 SHANNON-TYPE INEQUALITIES The basic inequalities form the most important set of information inequalities. In fact, almost all the information inequalities known to date are implied by the basic inequalities. These are called Shannon-type inequalities. In this chapter, we show that verification of Shannon-type inequalities can be formulated as a linear programming problem, thus enabling machine-proving of all such inequalities. 13.1 THE ELEMENTAL INEQUALITIES G1< (13.1) random G1variables < and appear more than once. It is readily in which the seen that can :7be;|written < as (13.2) 67 ;G < * Consider the conditional mutual information where in both and , each random variable appears only once. A Shannon’s information measure is said to be reducible if there exists a random variable which appears more than once in the information measure, otherwise the information measure is said to be irreducible. Without loss of generality, we will consider irreducible Shannon’s information measures only, because a reducible Shannon’s information measure can always be written as the sum of irreducible Shannon’s information measures. The nonnegativity of all Shannon’s information measures form a set of inequalities called the basic inequalities. The set of basic inequalities, however, is not minimal in the sense that some basic inequalities are implied by the D R A F T September 13, 2001, 6:27pm D R A F T 280 A FIRST COURSE IN INFORMATION THEORY others. For example, 3 and # (13.3) + (13.4) which are both basic inequalities involving random variables and , imply 6 D $ (13.5) +TVUVUVUWinequality 1XFY X and . which N again is aSbasic involving , where . Unless otherwise specified, all inforLet M $PORQ mation VUVUVUe1 in this chapterX involve some or all of the random variables c 1 f expressions M . The value of will be specified when necessary. 
Through application of the identities 6 (13.6) 6* $ 9:76;E 9 (13.7) ?H $ 3 D3A < (13.8) 67 $ ?H3 H (13.9) $ :;E (13.10) 3 8A $ 3 DA 8 (13.11) $ any Shannon’s information measure can be expressed as the sum of Shannon’s information measures of the following two elemental forms: <\S o?q \ 1^8_ N i) \ 1 z ^M ii) , where ! $ X and N ^ Y þ à M O . This will be illustrated in the next example. It is not difficult to check that the total number of the two elemental forms of Shannon’s information measures for random variables is equal to | X $# X !% M f " $ & (13.12) c 1 f X into a sum of elemental forms of $%a by applying the identities in (13.6) The proof of (13.12) is left as an exercise. é ê = ë¸z7¡RºD£t¡ We can expand Shannon’s information measures for to (13.11) as follows: 6dce1fÆ c D6 f c $ D R A F T September 13, 2001, 6:27pm (13.13) D R A F T c f 1 g D c 1 f 1 g D76 f c 1 g $ f1gR dc c f 1 g D c 1 f D7 c 1 g f $ f c 1 g D f 1 g c & 281 Shannon-Type Inequalities (13.14) (13.15) The nonnegativity of the two elemental forms of Shannon’s information measures form a proper subset of the set of basic inequalities. We call the inequalities in this smaller set the elemental inequalities. They are equivalent to the basic inequalities because each basic inequality which is not an elemental inequality can be obtained as the sum of a set of elemental inequalities in view of (13.6) to (13.11). This will be illustrated in the next example. The proof for the minimality of the set of elemental inequalities is deferred to Section 13.6. " é ê = ë¸z7¡RºD£«¢ In the last example, we expressed c 1 f as c f 1 g c 1 f D7 c 1 g f f c 1 g D f 1 g c & (13.16) X information measures in the above expression are in All the five Shannon’s elemental form for $¯a . Then the basic inequality dce1f' (13.17) +dcW f1gÆ c 1 f c 1 g f f c 1 g f 1 g c can be obtained as the sum of the following elemental inequalities: 13.2 - (13.18) (13.19) (13.20) (13.21) (13.22) - -4& A LINEAR PROGRAMMING APPROACH c ` 1 $ f VUVM UVUÆ1 Q Recall from Section 12.2 that any information expression can be expressed uniquely in canonical form, i.e., a linear combination of the joint entropies involving some or all of the random variables . If the elemental inequalities are expressed in canonical form, they become linear inequalities in the entropy space . Denote this set of inequalities by , where is an matrix, and define - & D R A F T " ` v M } M $(O'{ ~ &E{+- Y & September 13, 2001, 6:27pm M d& {6 (13.23) D R A F T 282 }M }M A FIRST COURSE IN INFORMATION THEORY v M à ý Q ,', ` We first show that is a pyramid in the nonnegative orthant of the entropy . Evidently, contains the origin. Let , be the column space -vector whose th component is equal to 1 and all the other components are equal to 0. Then the inequality ` à {#- _ } M { (13.24) corresponds to the nonnegativity of a joint entropy, which is a basic inequality. Since the set of elemental inequalities is equivalent to the set of basic inequalities, if , i.e., satisfies all the elemental inequalities, then also satisfies the basic inequality in (13.24). In other words, { { } M à O'{ ~ à { +- Y (13.25) ` . This implies that } M is in the nonnegative orthant of the for all Q,(6, } entropy space. Since M contains the origin and the constraints &E{(®- are } linear, we conclude that M is a pyramid in the nonnegative orthant of v M . 
X Since the elemental c inequalities 1 f VUVUVUÆ1 are satisfied by} the entropy function of any M , for any { in MK , { is also in } M , i.e., random variables } MK à } M & (13.26) Therefore, for any unconstrained inequality !#- , if } M à O'{ ~ { +- YZ (13.27) then } M|K à O'{ ~ { +- YZ (13.28) i.e., 7¼- always holds. In other words, (13.27) is a sufficient condition for ت- to always hold. Moreover, an inequality ؾ- such that (13.27) is because if { satisfies the basic satisfied is implied _ by} the basic inequalities, M inequalities, i.e., { , then { satisfies { #- . For constrained inequalities, following our discussion in Section 12.3, we impose the constraint (13.29) Ë {9$5and let Ì $PO'{ ~ Ë {!$5- Y & For an inequality !+- , if } Ì YZ M Í Ã O'{ ~ { #} Ì YZ then by (13.26), MK Í Ã O'{ ~ { #D R A F T September 13, 2001, 6:27pm (13.30) (13.31) (13.32) D R A F T 283 Shannon-Type Inequalities b h 0 Figure 13.1. #Ä- n ) q is contained in * Ö,+ Õ ¶Ö Á|Â- . Ì Ì i.e., always holds under the constraint . In other words, (13.31) is a sufficient condition for to always hold under the constraint . Moreover, under the constraint such that (13.31) is satisfied is an inequality implied by the basic inequalities and the constraint , because if and satisfies the basic inequalities, i.e., , then satisfies . ®ð- 13.2.1 !+- Ì _ } Ì { MÍ Ì _ { Ì { { #- { {7(Y ~ O'{ % { Ø- UNCONSTRAINED INEQUALITIES }M To check whether an unconstrained inequality is a Shannon-type inequality, we need to check whether is a subset of . The following theorem induces a computational procedure for this purpose. ¥§¦n¨©Dª¡RºD£«º ª { ³- is a Shannon-type inequality if and only if the minimum of the problem Minimize { , subject to &E{#- (13.33) is zero. In this case, the minimum occurs at the origin. }M Y O'{ ~ Ä { ªY { }M Remark The idea of this theorem is illustrated in Figure 13.1 and Figure 13.2. In Figure 13.1, is contained in . The minimum of subject to occurs at the origin with the minimum equal to 0. In Figure 13.2, is not contained in . The minimum of subject to is . A formal proof of the theorem is given next. }/M . }M O'{ ~ 3{ +- { Y } ~ M {+Proof of Theorem 13.3 We have to prove that is a subset of O'{ _ } only if the if and minimum of the problem in (13.33) is zero. First of all, since Y of the problem in (13.33) - M and -|} $Ä- for any , the minimum { ¬- and the minimum ofistheat most 0. Assume M is a subset of O'{ ~ D R A F T September 13, 2001, 6:27pm D R A F T 284 A FIRST COURSE IN INFORMATION THEORY 1b 2h 0 0n Figure 13.2. ) q is not contained in * Ö3+ Õ ¶nÖ Á|Â4- . _ } M problem in (13.33) is negative. Then there exists an { Å { which implies } M à O'{ ~ {+- YZ } which is a contradiction. Therefore, if M is a subset of O'{ the minimum of the problem in (13.33) is zero. } _ not a subset of O'{ To prove the converse, assume M is } M such that (13.35) is true. Then there exists an { = { Å -4& such that (13.34) Y ~ {7¯- , then Y ~ { ²- , i.e. (13.35) (13.36) This implies that the minimum of the problem in (13.33) is negative, i.e., it is not equal to zero. Finally, if the minimum of the problem in (13.33) is zero, since the contains the origin and , the minimum occurs at the origin. -§$5- }M {¯¼{5$²- {($ª{ ñ¼ By virtue of this theorem, to check whether is an unconstrained Shannon-type inequality, all we need to do is to apply the optimality test of the simplex method [56] to check whether the point is optimal for the minimization problem in (13.33). 
If is optimal, then is an unconstrained Shannon-type inequality, otherwise it is not. 13.2.2 CONSTRAINED INEQUALITIES AND IDENTITIES { #- under the constraint Ì is a Shannon} M Í Ì is a subset of O'{ ~ 3{ +- Y . To check whether an inequality type inequality, we need to check whether D R A F T September 13, 2001, 6:27pm D R A F T 285 Shannon-Type Inequalities ¥§¦n¨©Dª¡RºD£âá { #- is a Shannon-type inequality under the constraint Ì if and only if the minimum of the problem { , subject to &d{3+- and Ë {9$5Minimize (13.37) is zero. In this case, the minimum occurs at the origin. Ì The proof of this theorem is similar to that for Theorem 13.3, so it is omitted. By taking advantage of the linear structure of the constraint , we can reformulate the minimization problem in (13.37) as follows. Let be the rank of . Since is in the null space of , we can write Ë Ü Ë { ` Ë ` E where Ý is a Ü {$ Ë Ý {ß Ë ËÝ {ß (13.38) ` Ü matrix such that the rows of form a basis of the is a column orthogonal complement of the row space of , and vector. Then the elemental inequalities can be expressed as { ß,} M Ë& Ý {:ßx+- (13.39) } ßM $PO'{ß ~ &¼Ë Ý {:ßx+- YZ (13.40) 65 which is a pyramid in h (but not necessarily Ë Ý { ß . in the nonnegative orthant). { can be expressed Likewise, as With all the information expressions in terms of { ß , the problem in (13.37) becomes Ë Ý { ß , subject to &ØË Ý { ß +- . Minimize (13.41) Therefore, Ì to check whether {7%- is a Shannon-type inequality under the constraint , all we need to do is to apply the optimality test of the simplex method to check whether the point { ß $5- is optimal for the problem in (13.41). {ذ- is a Shannon-type inequality under the If { ß $Û- Ì is optimal, then constraint , otherwise it is not. Ì remains By imposing the constraint , the number of elemental inequalities ` ` Ü. the same, while the dimension problem decreases from to {¯of$ñthe Finally, to verify that is a Shannon-type identity under the {3$P- is implied by the basic inequalities, all we need toconÌ straint , i.e., do is to verify that both Ì {¼²- and {Ø,¬- are Shannon-type inequalities under the constraint . and in terms of 13.3 becomes A DUALITY A nonnegative linear combination is a linear combination whose coefficients are all nonnegative. It is clear that a nonnegative linear combination D R A F T September 13, 2001, 6:27pm D R A F T 286 A FIRST COURSE IN INFORMATION THEORY of basic inequalities is a Shannon-type inequality. However, it is not clear that all Shannon-type inequalities are of this form. By applying the duality theorem in linear programming [182], we will see that this is in fact the case. The dual of the primal linear programming problem in (13.33) is U Maximize 7 - where {b½- subject to 7+- and 7 &ª, > 0 c UVUVU²098 B 7 $ & , (13.42) (13.43) By the duality theorem, if the minimum of the primal problem is zero, which is a Shannon-type inequality, the maximum of the happens when dual problem is also zero. Since the cost function in the dual problem is zero, the maximum of the dual problem is zero if and only if the feasible region ¤ $PO:7 ~ 73+- and Y 7 &½, (13.44) $ ¥§¦n¨©Dª¡RºD£«ã + (- is; a Shannon-type inequality if and only if ; & for some ; +- ,{ where is a column " -vector, i.e., is a nonnegative linear combination of the rows of G. $ ; & for ¤ Proof We have to prove that ¤ is nonempty if and only if some ; +- . The feasible region is nonempty if and only if =< & (13.45) for some <#- , where < is a column " -vector. 
Consider any < which satisfies (13.45), and let > (13.46) $ < &¾+-4& à ` is equal to 1 and all Denote by the column -vector whose th component à ` , d, . Then { is a joint entropy. the other components are equal to 0, Q ? à Since every joint entropy can be expressed as the sum of elemental forms of Shannon’s information measures, can be expressed as a nonnegative linear combination of the rows of G. Write c @'f UVUVU @ B > A> @ýB $ (13.47) @ ` where #- for all Q,C, . Then > D c @ à (13.48) $ µ is nonempty. can also be expressed as a nonnegative linear combinations of the rows of G, i.e., (13.49) > D R A F T $FE & September 13, 2001, 6:27pm D R A F T 287 Shannon-Type Inequalities for some where that Eª+- . From (13.46), we see $ E < Û ; & $ & (13.50) ; +- . The proof is accomplished. From this theorem, we see that all Shannon-type inequalities are actually trivially implied by the basic inequalities! However, the verification of a Shannontype inequality requires a computational procedure as described in the last section. 13.4 MACHINE PROVING – ITIP Theorems 13.3 and 13.4 transform the problem of verifying a Shannon-type inequality into a linear programming problem. This enables machine-proving of all Shannon-type inequalities. A software package called ITIP1 , which runs on MATLAB, has been developed for this purpose. Both the PC version and the Unix version of ITIP are included in this book. Using ITIP is very simple and intuitive. The following examples illustrate the use of ITIP: 1. >> ITIP(’H(XYZ) <= H(X) + H(Y) + H(Z)’) True 2. >> ITIP(’I(X;Z) = 0’,’I(X;Z|Y) = 0’,’I(X;Y) = 0’) True 3. >> ITIP(’I(Z;U) - I(Z;U|X) - I(Z;U|Y) <= 0.5 I(X;Y) + 0.25 I(X;ZU) + 0.25 I(Y;ZU)’) Not provable by ITIP È È In the first example, we prove an unconstrained inequality. In the second examforms a Markov ple, we prove that and are independent if chain and and are independent. The first identity is what we want to prove, while the second and the third expressions specify the Markov chain and the independency of and , respectively. In the third example, ITIP returns the clause “Not provable by ITIP,” which means that the inequality is not a Shannon-type inequality. This, however, does not mean that the inequality to be proved cannot always hold. In fact, this inequality is one of the few known non-Shannon-type inequalities which will be discussed in Chapter 14. We note that most of the results we have previously obtained by using information diagrams can also be proved by ITIP. However, the advantage of È È 1 ITIP stands for Information-Theoretic Inequality Prover. D R A F T September 13, 2001, 6:27pm D R A F T 288 A FIRST COURSE IN INFORMATION THEORY Y GZ HX T Figure 13.3. The information diagram for I ,J ,K , and L in Example 13.6. using information diagrams is that one can visualize the structure of the problem. Therefore, the use of information diagrams and ITIP very often complement each other. In the rest of the section, we give a few examples which demonstrate the use of ITIP. The features of ITIP are described in details in the readme file. é êÈ = ë¸z7¡RºD£«ì È È ÈMarkov È chain È È By Proposition 2.10, the long implies the two short Markov chains and . We want to see whether the two short Markov chains also imply the long Markov chain. If so, they are equivalent to each other. Using ITIP, we have >> ITIP(’X/Y/Z/T’, ’X/Y/Z’, ’Y/Z/T’) Not provable by ITIP In the above, we have used a macro in ITIP to specify the three Markov chains. 
The above result from ITIP says that the long Markov chain cannot be proved from the two short Markov chains by means of the basic inequalities. This strongly suggests that the two short Markov chains is weaker than the long Markov chain. However, in order to prove that this is in fact the case, we need an explicit construction of a joint distribution for , , , and which satisfies the two short Markov chains but not the long Markov chain. Toward this end, we resort to the information diagram in Figure 13.3. The Markov chain is equivalent to , i.e., È È 3I NM D NM $%- OM K Ý Í Ý Í Ý Í Ý È K È Ý Í Ý Í Ý Í Ý $5-4& Similarly, the Markov chain M D M is equivalent M to :K Ý Í Ý Í Ý Í Ý :K Ý Í Ý Í Ý Í Ý $5-4& D R A F T September 13, 2001, 6:27pm (13.51) (13.52) D R A F T 289 Shannon-Type Inequalities Y * * HX * * GZ * T Figure 13.4. The atoms of Markov chain. PRQ on which SUT vanishes when È È È IWVXJYVZK[V\L forms a The four atoms involved in the constraints (13.51) and (13.52) are marked by a dagger in Figure 13.3. In Section 6.5, we have seen that the Markov holds if and only if takes zero value on the chain set of atoms in Figure 13.4 which are marked with an asterisk2 . Comparing Figure 13.3 and Figure 13.4, we see that the only atom marked in Figure 13.4 . Thus if we can construct a but not in Figure 13.3 is such that it takes zero value on all atoms except for , then the corresponding joint distribution satisfies the two short Markov chains but not the long Markov chain. This would show that the two short Markov chains are in fact weaker than the long Markov chain. Following Theorem 6.11, such a can be constructed. In fact, the required joint distribution can be obtained by simply letting , where is any random variable such that , and letting and be degenerate random variables taking constant values. Then it is easy to see that and hold, while does not hold. K $ M M Ý Í Ý Í Ý Í Ý K È È M M ÝÍ Ý Í Ý Í Ý 6 È È é ê = ë¸z7¡RºD£^] È È È È $ è( È È È The data processing theorem says that if forms a Markov chain, then ;| K È È È & (13.53) We want to see whether this inequality holds under the weaker condition that and form two short Markov chains. By using ITIP, we can show that (13.53) is not a Shannon-type inequality under the Markov 2 This information diagram is essentially a reproduction of Figure 6.8. D R A F T September 13, 2001, 6:27pm D R A F T 290 A FIRST COURSE IN INFORMATION THEORY A $ 5 (13.54) ;|<H and (13.55) $%-4& This strongly suggests that (13.53) does not always hold under the constraint this has to be proved by an explicit of the two short Markov chains. However, conditions construction of a joint distribution for , , , and which satisfies (13.54) and (13.55) but not (13.53). The construction at the end of the last example serves this purpose. é ê = ë¸z7¡RºD£`_badcxfe©DUgFcD¦x ©ihkjRlnm¡o]2¢qpr Consider the following secret sharing problem. Let s be a secret to be encoded into three pieces, , , and . The scheme has to satisfy the following two secret sharing requirements: 1. s can be recovered from any two of the three encoded pieces. 2. No information about pieces. 
s can be obtained from any one of the three encoded 6 ?H 6 * s $ s $ s $5(13.56) while the second requirement is equivalent to the constraints 19 < s $ s $ s $5 (13.57) -4& Since the secret s can be recovered if all , , and are known, 6:;A:66@ (13.58) s & We are naturally interested in the maximum constant " which satisfies 6;A:6@ (13.59) +" s & The first requirement is equivalent to the constraints " We can explore the possible values of by ITIP. After a few trials, we find that ITIP returns a “True” for all , and returns the clause “Not provable by ITIP” for any slightly larger than 3, say 3.0001. This means that the maximum value of is lower bounded by 3. This lower bound is in fact tight, as we can see from the following construction. Let and be mutually independent ternary random variables uniformly distributed on , and define " ",#a " D R A F T s $ t $ s tvu q w a t SzY OÆ- Q September 13, 2001, 6:27pm (13.60) (13.61) D R A F T 291 Shannon-Type Inequalities and 7 $ s txu qw aT& Then it is easy to verify that P qw s $ 6 u a $ +3W u qw a $ u qw aT& S zY OÆ- Q 976;<@ (13.62) 1* (13.63) (13.64) (13.65) Thus the requirements in (13.56) are satisfied. It is also readily verified that the requirements in (13.57) are satisfied. Finally, all , and distribute uniformly on . Therefore, " s 6 $%a s & (13.66) This proves that the maximum constant which satisfies (13.59) is 3. Using the approach in this example, almost all information-theoretic bounds reported in the literature for this class of problems can be obtained when a definite number of random variables are involved. 13.5 TACKLING THE IMPLICATION PROBLEM We have already mentioned in Section 12.5 that the implication problem of conditional independence is extremely difficult except for the special case that only full conditional mutual independencies are involved. In this section, we employ the tools we have developed in this chapter to tackle this problem. In Bayesian network (see [151]), the following four axioms are often used for proving implications of conditional independencies: òÎdzy ØòÎ7 Symmetry: ò¼;?V|{ ò5EH~} )ò=<H (13.67) Decomposition: Weak Union: D R A F T )òÄ;8V{ )ò5E September 13, 2001, 6:27pm (13.68) (13.69) D R A F T 292 A FIRST COURSE IN INFORMATION THEORY )ò5ER} ò=A ?H{ )òÄ;8V Contraction: & (13.70) These axioms form a system called semi-graphoid and were first proposed by Dawid [58] as heuristic properties of conditional independence. The axiom of symmetry is trivial in the context of probability3 . The other three axioms can be summarized by )ò¼;?Vzy ò5ER}Ø ò=A ?H (13.71) & This can easily be proved as follows. Consider the identity ?A D3A ?H $ & (13.72) mutual ?Ainformations are always nonnegative A ?by the basic Since conditional inequalities, if vanishes, and also vanish, and vice versa. This proves (13.71). In other words, (13.71) is the result of a specific application of the basic inequalities. Therefore, any implication which can be proved by invoking these four axioms can also be proved by ITIP. In fact, ITIP is considerably more powerful than the above four axioms. This will be shown in the next example in which we give an implication which can be proved by ITIP but not by these four axioms4 . We will see some implications which cannot be proved by ITIP when we discuss non-Shannon-type inequalities in the next chapter. é ê = ë¸z7¡RºD£` 3 A 3 A A 3A We will show that $5- $5- { $5- $5$5- $5can be proved by invoking the basic inequalities. 
First, we write 3 E dA & $ 3 E Since $5- and 3 E #- , we let $ÎÚ (13.73) (13.74) (13.75) 3 These 4 This four axioms can be used beyond the context of probability. example is due to Zhen Zhang, private communication. D R A F T September 13, 2001, 6:27pm D R A F T 293 Shannon-Type Inequalities Y + _ _ HX _ + GZ + T Figure 13.5. The information diagram for I ,J ,K L , and . GÚ A $ Ú (13.76) (13.74). In the information diagram indFigure <H 13.5, we mark the atom from bya “+” <Hand theatom 3 | <H:3 byA 8a “ .” Then we write (13.77) & A $ G A Since $5- and A ?$ Ú , we get (13.78) $5Ú& A ? for some nonnegative real number , so that In the information diagram, we mark the atom with a “+.” Continue in this fashion, the five CI’s on the left hand side of (13.73) imply that all the atoms marked with a “+” in the information diagram take the value , while all the atoms marked with a “ ” take the value . From the information diagram, we see that Ú 3 3 dA D3 k D $ $ Ú Ú $5- Ú (13.79) which proves our claim. Since we base our proof on the basic inequalities, this implication can also be proved by ITIP. Due to the form of the five given CI’s in (13.73), none of the axioms in (13.68) to (13.70) can be applied. Thus we conclude that the implication in (13.73) cannot be proved by the four axioms in (13.67) to (13.70). 13.6 MINIMALITY OF THE ELEMENTAL INEQUALITIES We have already seen in Section 13.1 that the set of basic inequalities is not minimal in the sense that in the set, some inequalities are implied by the others. D R A F T September 13, 2001, 6:27pm D R A F T 294 A FIRST COURSE IN INFORMATION THEORY We then showed that the set of basic inequalities is equivalent to the smaller set of elemental inequalities. Again, we can ask whether the set of elemental inequalities is minimal. In this section, we prove that the set of elemental inequalities is minimal. This result is important for efficient implementation of ITIP because it says that we cannot consider a smaller set of inequalities. The proof, however, is rather technical. The reader may skip this proof without missing the essence of this chapter. The elemental inequalities in set-theoretic notations have one of the following two forms: <\x orq \ 1. Ý <\ Ý #- ^ N ^ Y 2. Ý Í Ý Ý +- , $! and þ à M O , where denotes a set-additive function defined on M . They will be referred to as i -inequalities and ö -inequalities, respectively. We are to show that all the elemental inequalities are nonredundant, i.e., none of them is implied by the an <others. \x o For \ i -inequality q Ý Ý #<\S or(13.80) \ q 9 since it is the only elemental inequality which involves the atom Ý Ý , it is clearly not implied by the other elemental inequalities. Therefore we only need to show that all -inequalities are nonredundant. To show that a inequality is nonredundant, it suffices to show that there exists a measure on which satisfies all other elemental inequalities except for that -inequality. We will show that the -inequality M ö ö ö ö <\ Ý Í Ý Ý + Y N ^ (13.81) nonredundant. ^ \ our discussion, ^S we denote M <þ \ O by is To facilitate þ , and we let s s à þ be the atoms in Ý Í Ý Ý , where \ <\ M M \ s $ Ý Í Ý Í Ý ^ Í Ý Í Ý s 6 & N ^ (13.82) Y M þ $ , i.e., þ³$ O . We We first consider the case when construct a measure by <\ $ïÝ Í Ý Ý b $ Q Q ifotherwise, (13.83) ¼ _ <\ atom Å with measure . In other words, Ý Í Ý where Ý <\is the only Q ; all other atoms have measure 1. Then^ _ ÝN Í Ý Ý - is trivially M true. 
It is also trivial to check \ that foro?qany \ ß , (13.84) Ý Ý $ØQ# D R A F T September 13, 2001, 6:27pm D R A F T 295 ^ ^ ^ N ^ Y and for any ß ß þ ß $ \ þ such that ß $! ß and þ ß Ã M O ß ß , (13.85) N N ^ Y Ý Í Ý Ý $ØQ +M \ O ß ß . On the other hand, if þ ß is a proper subset of M if^ þ ß Y $ O ß ß , then Ý Í Ý Ý \contains at least two atoms, and therefore (13.86) Ý Í Ý Ý +-4& ^ the proof for the ö -inequality in (13.81) to be nonredundant This completes ^ ^ V when þ $ . þ W $ <, \or þ ½Q . We We now consider the case when the atoms in Ý Í Ý Ý , let construct a measure as follows. For k ^ \ 1 Q s$ ^ þ k Q (13.87) s b $ Q s $ þ & \ <\ For s , if s is odd, it is referred to as an odd < atom \ of Ý Í Ý Ý , \ it is referred and if to as an even atom of Ý Í Ý s _ is even, Ý . For any atom Ý Í Ý Ý , we let $ØQ& (13.88) This completes the construction of . <\ rÅ We first prove that Ý Í Ý Ý -4& (13.89) Consider <\ \ 1 Ý Í Ý Ý $ 6D \ \ s 6 # ^ V k 5 þ % Q $ 5D Q Ü µ $ Q where the last equality follows from the binomial formula X D M # % k 5 Q $5(13.90) 5 Ü µ X ^ _ for (Q . This proves (13.89). \ that o q satisfies <\ We note that for any ß \ all i -inequalities. N Next we prove M , the atom Ý Ý is not in Ý Í Ý Ý . Thus \ \ (13.91) Ý Ý o?q $ØQ +-4& Shannon-Type Inequalities D R A F T September 13, 2001, 6:27pm D R A F T 296 A FIRST COURSE IN INFORMATION THEORY ^ ^ satisfies Y i.e., all ^ ö -inequalities except N for ^ (13.81), M à þ such O ß ß, ß ßþ ß $ \ that ß $! ß and þ ß Ý Í Ý Ý +-4& (13.92) Consider \ Ý Í 1 Ý \ Ý <\ 1 $ 1Ý \Í Ý Ý Í #Ý <Í \ Ý Ý 1 Ý Í Ý Ý Ý Í Ý Ý & (13.93) It remains to prove that for any \ <\ Ý Í Ý Ý Í Ý Í Ý Ý is nonempty if and only if ^ Y ^ Y Í O ß Zß þ $ and O Í þ!ß$& The nonnegativity of the second term above follows from (13.88). For the first term, (13.94) $ (13.95) If this condition is not satisfied, then the first term in (13.93) becomes , and (13.92) follows immediately. Let us assume that the condition in (13.95) is satisfied. Then by simple counting, we see that the number atoms in - \ <\ Ý Í Ý Ý Í Ý Í Ý Ý (13.96) 9 is equal to , where XE# ^S Y¢¡ ^ Y£¡ ¡ (13.97) O O f ß ß þ þ ß & X $ $ ¤ , there For example, for g in <ÉÆ c aref ¥I$ c atoms c f g Ý É M Í Ý§ ¦ Í § ¨ Ý Í Ý \ Ý <\ \ M ^ (13.98) Í Í Í Í Í namely Ý , where $ Ý or Ý for $[© ¤ . We Ý Ý Ý Î Sz¢Y ¡ ªY ¡ ¡ YT check that $ ¤ ORQ [ ORQ a :O ¥ $ & (13.99) We first consider the case when $5- , i.e., ^ ªY ¡ ^ «Y ¡ ¡ N M $PO O ß ß þ þ ß& (13.100) \ <\ Then Ý Í Ý Ý Í Ý Í Ý Ý \ §(13.101) contains exactly one atom. If this atom is an even atom of Ý Í Ý Ý , then the first term in (13.93) is either 0 or 1 (cf., (13.87)), and (13.92) follows D R A F T September 13, 2001, 6:27pm D R A F T APPENDIX 13.A: The Basic Inequalities and the Polymatroidal Axioms 297 ° , ® ¯ ± is an odd atom of ¬ ¬ ¬ , then the first term in immediately. If this°atom (13.93) is equal to . This happens if and¯ only ¬ ¾ one ®¼ ¬ ½ ¼ ° if ²:¬ ³µ¾ ´·¶¼k¿ ¸ ° andº»,¬ ²:® ³¯ ¹´· ¶9¬ ¹k½ ¸ ° have ¿ is common element, which implies that º»¬ nonempty. Therefore the second term in (13.93) is at least 1, and hence (13.92) ± follows. . Using the binomial formula Finally, we consider the case when ÀÂÁ ,®4¯ ¬ ½ ° ¬ in¾ (13.90), we see that the number of odd atoms and even atoms of ¬ in º ¬ ® ¼¯ ¬ ½µ¼ ° ¬ ¾«¼ ¿ ¯ º ,¬ ®Ã¯ ¬ ½ ° ¬ ¾ ¿ (13.102) °± are the same. Therefore the first term in (13.93) is equal to if ° Ä ® ½Å ¾ ºÆ/ºk³µ´·¶9´µÇ ¿È¿ÊÉ ¬ ® ¼U¯ ¬ ½d¼ ¬ ¾«¼ ´ (13.103) ° The ° former ° if and and is equal to 0¼Ãotherwise. 
is true only if Ç ¹»Ë Ç , which ½ ¾ ¯ d ½ ¼ « ¾ ¼ , ® ¯ ® ¿ ¿ ¬ ¬ º ¬ ¬ ¬ is nonempty, or that the implies that º ¬ second term is at least 1. Thus in either case (13.92) is true. This completes the proof that (13.81) is nonredundant. APPENDIX 13.A: THE BASIC INEQUALITIES AND THE POLYMATROIDAL AXIOMS show that the basic inequalities for a collection of Ì random variables ÍÏÎÑIn еthis Ò«ÓÔappendix, ÕÃÖ«×/Ø9Ù isweequivalent ÔÜÛNÝÞ×OØ , to the following polymatroidal axioms: For all Ú æ Î å P1. ߪà§á`âäã . Û ÝéÛ . P2. ß à áçÚ6ãèß à á ã if Ú Û Û ãoê,ß à áçÚOí Û ã . P3. ß à áçÚ6ãêéß à á ã§ëß à áçÚOì We first that the Ý Ú show ÝÞ×/polymatroidal Ø , we have axioms imply the basic inequalities. From P1 and P2, since â for any Ú ßªàáçÚ6ã§ëîߣà§á`âã ÎÏå:Ô (13.A.1) or (13.A.2) ß3á Òðï ã§ë åòñ This shows that entropy is nonnegative. ÎôÛõ Ú , we have In P2, letting ó ߣà§áçÚ6ã§èîߣàiáçÚOíªóã Ô (13.A.3) or ß× 3Ø á Ò«ö9÷ Ò ï ãë åòñ (13.A.4) Here, ó and Ú are disjoint subsets of . ÎôÛõ Ú , ø Î Ú/ì Û , and In P3, letting ó ù Î Ú õµÛ , we have (13.A.5) ß à áçùÊíÊøãê,ß à áó«í¢øäãëß à áçøäãêß à áçù£íOøúíªóã Ô or û Ò«ü9ýÒ«öþ÷ Òðÿ å:ñ á ãë (13.A.6) D R A F T September 13, 2001, 6:27pm D R A F T 298 Again, ù Ô ø , and ó A FIRST COURSE IN INFORMATION THEORY × Ø ø Î â , from P3, we have û «Ò ü9.ýÜWhen á Òðö ã§ë åòñ are disjoint subsets of (13.A.7) Thus P1 to P3 imply that entropy is nonnegative, and that conditional entropy, mutual information, and conditional mutual information are nonnegative provided that they are irreducible. However, it has been shown in Section 13.1 that a reducible Shannon’s information measure can always be written as the sum of irreducible Shannon’s information measures. Therefore, we have shown that the polymatroidal axioms P1 to P3 imply the basic inequalities. The converse is trivial and the proof is omitted. PROBLEMS 1. Prove (13.12) for the total number of elemental forms of Shannon’s information measures for random variables. ´ ´ò´ 2. Shannon-type inequalities for random variables refer to all information inequalities implied by the basic inequalities for these random variables. Show that no new information inequality can be generated by considering the basic inequalities for more than random variables. 3. Show by an example that the decomposition of an information expression into a sum of elemental forms of Shannon’s information measures is not unique. ´ ´ò´ 4. Elemental forms of conditional independencies Consider random vari. A conditional independency is said to be elemenables tal if it corresponds to setting an elemental form of Shannon’s information measure to zero. Show that any conditional independency involving is equivalent to a collection of elemental conditional independencies. ´ ´´ 5. Symmetrical information inequalities ´ ´ò´ ´ ,® "!#6® ¿ ® º a) Show that every symmetrical information expression (cf. Problem 1 in Chapter 12) involving random variable can be written in the form where ±%$'&$ and for D R A F T °=± , @ ,®8AÈ ½ ¾ ¿CB ) (6Ó5®+4 *6879½ 4;(,: : <= º -/.1023 - 2?> September 13, 2001, 6:27pm D R A F T 299 APPENDIX 13.A: The Basic Inequalities and the Polymatroidal Axioms ±D$E&F$ °n± Note that is the sum of all Shannon’s information measures of , is the sum the first elemental form, and for of all Shannon’s information measures of the second elemental form conditioning on random variables. &N° ± b) c) G & Á H always holds if Á'H for all . & Show that I $Kall &L$ . 
Hint: °Â± Construct Show that if ÁJH always holds, then ÁIH for variables ´ ´´ $'& for$ each random ° ± H & & such that G M H and G ¼ H for all H ¹ and ¹ON . (Han [87].) 6. Strictly positive probability distributions It was shown in Proposition 2.12 that Q P R? º ´ S ¿ P S º ´ R¿UTWV QP º S ´ R ¿ S R:¿ M H for all Y , Y , Y S , and Y R . Show by using ITIP if X º+Y )´ Y ´)Y ´)Y that this implication is not implied by the basic inequalities. This strongly suggests that this implication does not hold in general, which was shown to be the case by the construction following Proposition 2.12. 7. a) Verify by ITIP that @ º ´ A[Z\ ´ Z] ¿ $ @ º ^A[Z\ ¿_ Z\ Z] ´ ¿ under the constraint º ´ @ º A[Z` ¿ º Z\G ¿a_ Z]1 ¿ . º This constrained inequality was used in Problem 9 in Chapter 8 to obtain the capacity of two independent parallel channels. b) Verify by ITIP that @ º ´ A[Z ´ Z ¿ Á @ º A[Z ¿_ @ º A[Z ¿ @ ^AÈ ¿ H . This constrained inequality was under the constraint º used in Problem 4 in Chapter 9 to obtain the rate distortion function for a product source. 8. Repeat Problem 9 in Chapter 6 with the help of ITIP. cb 9. Prove the implications in Problem 13 in Chapter 6 by ITIP and show that they cannot be deduced from the semi-graphoidal axioms. (Studen [189].) D R A F T September 13, 2001, 6:27pm D R A F T 300 A FIRST COURSE IN INFORMATION THEORY HISTORICAL NOTES For almost half a century, all information inequalities known in the literature are consequences of the basic inequalities due to Shannon [173]. Fujishige [74] showed that the entropy function is a polymatroid (see Appendix 13.A). Yeung [216] showed that verification of all such inequalities, referred to Shannon-type inequalities, can be formulated as a linear programming problem if the number of random variables involved is fixed. Non-Shannon-type inequalities, which are discovered only recently, will be discussed in the next chapter. The recent interest in the implication problem of conditional independence has been fueled by Bayesian networks. For a number of years, researchers in Bayesian networks generally believed that the semi-graphoidal axioms form a complete set of axioms for conditional independence until it was refuted by Studen [189]. cb D R A F T September 13, 2001, 6:27pm D R A F T Chapter 14 BEYOND SHANNON-TYPE INEQUALITIES dOe d e d d f In Chapter 12, we introduced the regions and in the entropy space for random variables. From , one in principle can determine whether any information inequality always holds. The region , defined by the set of all basic inequalities (equivalently all elemental inequalities) involving random variables, is an outer bound on . From , one can determine whether any information inequality is implied by the basic inequalities. If so, it is called a Shannon-type inequality. Since the basic inequalities always hold, so do all Shannon-type inequalities. In the last chapter, we have shown how machineproving of all Shannon-type inequalities can be made possible by taking advantage of the linear structure of . d e d e d d d If the two regions and are identical, then all information inequalities which always hold are Shannon-type inequalities, and hence all information inequalities can be completely characterized. However, if is a proper subset of , then there exist constraints on an entropy function which are not implied by the basic inequalities. Such a constraint, if in the form of an inequality, is referred to a non-Shannon-type inequality. 
d e d d e N d There is a point here which needs further explanation. The fact that does not necessarily imply the existence of a non-Shannon-type inequality. As an example, suppose contains all but an isolated point in . Then this does not lead to the existence of a non-Shannon-type inequality for random variables. d d d e d e In this chapter, we present characterizations of which are more refined than . These characterizations lead to the existence of non-Shannon-type inequalities for . D R A F T ÁJg September 13, 2001, 6:27pm D R A F T 302 14.1 A FIRST COURSE IN INFORMATION THEORY @ CHARACTERIZATIONS OF h e , h Se , AND h k l n m i &l i e Recall from the proof of Theorem 6.6 that the vector represents the values of the -Measure on the unions in . Moreover, is related to the values on the atoms of , represented as , by of j/e je k &pq& i (14.1) °=± o m D r & where is a unique matrix with (cf. (6.27)). Let s be the -dimensional Euclidean space with the coordinates labeled by the components of l . Note that each coordinate in s corresponds to the value of j e on a nonempty atom of k . Recall from Lemma 12.1 the definition t É "umv É of the region e ²Gl s l d e ¸þ´ (14.2) which is obtained from the region d e via the linear transformation induced by m . Analogously, we define the region t É um É B (14.3) ²Gl s l d ¸ The region d e , as we will see, is extremely difficult to characterize for a general Dr . . Therefore, we start our discussion with the simplest case, namely wox\yz|{y,}~|~ d e d . Dr , the elemental inequalities are Proof For º ¿ j e º»¬ °° ¬ ¿ ÁH (14.4) ¿ ¿ (14.5) @ º AÈ ¿ j e º ¬ ¯ ¬ ¿ ÁH B (14.6) º j e º ¬ ¬ ÁH Note that the quantities on the left hand sides above are precisely the values of j e on the atoms of k . Therefore, t É u ²Gl s lCÁIHq¸þ´ t t t (14.7) t t orthant of s . Since d|t e Ë d t , e Ë . On is the nonnegative i.e., e by Lemma 12.1. Thus e the other hand, , which implies d e d . The proof isË accomplished. D . Next, we prove that Theorem 14.1 cannot even be generalized to wox\yz|{y,}~|5 d Se N d S . D , the elemental inequalities are Proof For ,®[ ½ ´ ¿ j e º ,¬ ® ° ¬ ½ ° ¬ ¿ ÁH º (14.8) ° ¿ @ ,®9AÈ ½ ¿ ½ , f ® ¯ º j e º ¬ ¬ ¬ ÁHU´ (14.9) D R A F T September 13, 2001, 6:27pm D R A F T Beyond Shannon-Type Inequalities 0 X X 303 2 a a a 0 0 a 1 Figure 14.1. The set-theoretic structure of the point X 3 á åòÔ·å:ÔåòÔ9Ô9Ô+þÔ9\ ã in . and @ º , ®9AÈ ½ ¿ j e º»,¬ ®f¯ ¬ ½ ¿ (14.10) j e º ¬ ® ¯ ¬ ½ ¯ ¬ ¿_ j e º ¬ ® ¯ ¬ ½ ° ¬ ¿ (14.11) (14.12) ±$ &$Á H É S ³C¶" s , let . For l for ´) S ´) R ´)` ´)] ´)] ¿ ´ l + º ) ´ (14.13) ±$ $ ® ³ correspond to the values where ´ ° ° Sò¿ ° ° S:¿ S ° ° ¿ j e º» ¬ ¯ ¬ ° ¬ S ¿ ´Oj e º» ¬ ¯ ¬ S ° ¬ ¿ ´j e º ¬ ¯ ¬ S ° ¬ ¿ ´ j e º ¬ ¯ ¬ ¯ ¬ Sò¿ ´Oj e º ¬ ¬ ¬ ´j e º ¬ ¬ ¬ ´ (14.14) j e º»¬ ¬ ¬ ´ S respectively. These are the values of j e on the nonempty atoms of k . Then from (14.8), (14.9), and (14.12), ±%$ we$'see A that½ _ $ $' t S É S u ® ²Gl s ÁIHU´ ³ ]/° Á'HU´|g ¶ ¸ B (14.15) ¿ for any ÁKH is in t S . It is easy to check that the point ºHU´[HU´[HU´ ´ ´ ´ This is illustrated in Figure 14.1, and it is readily seen that the relations ,® ½ ´ ¿ H (14.16) º and @ º ,®8AÈ ½ ¿ H (14.17) ±$ &$ ³C¶" for are satisfied, i.e., each random variable is a function of ± r are pairwise the other two, and the three random ,® , ³ variables É independent. É ¡ , > ´ ´ Y Y Let Ó be the support of . 
For any and are independent, we have since and X º+Y ´)Y ¿ X º+Y ¿ X»º+Y ¿ M H B (14.18) D R A F T September 13, 2001, 6:27pm D R A F T 304 A FIRST COURSE IN INFORMATION THEORY and , there is a unique Y S/É such that S ¿ ¿ ¿ ¿ B (14.19) X»º+Y ´)Y ´)Y X º+Y ´)Y X º+Y X º+Y M H is a function of and S , and and S are independent, Now since we can write X º+Y ´)Y ´)Y S:¿ X º+Y ´)Y S4¿ X º+Y ¿ X º+Y Sò¿CB (14.20) Since S is a function of Equating (14.19) and (14.20), we have X»º+Y ¿ X º+Y S:¿CB (14.21) É ¡ such that Yù N Y . Since and S are indeNow consider any Yf¹ pendent, we have X º+Y ¹ ´)Y Sò¿ X º+Y ¹ ¿ X»º+Y Sò¿ M H B (14.22) S É > such that Since is a function of and , there is a unique Y ¹ (14.23) X»º+Y ¹ ´)Y ¹ ´)Y Sò¿ X º+Y ¹ ´)Y Sò¿ X º+Y ¹ ¿ X º+Y Sò¿ M H B is a function of and S , and and S are independent, Now since we can write (14.24) X º+Y ¹ ´)Y ¹ ´)Y S ¿ X º+Y ¹ ´)Y S ¿ X º+Y ¹ ¿ X º+Y S ¿CB S Similarly, since is a function of and , and and are independent, we can write X º+Y ¹ ´)Y ¹ )´ Y S:¿ X º+Y ¹ ´)Y ¹ ¿ X º+Y ¹ ¿ X º+Y ¹ ¿CB (14.25) Equating (14.24) and (14.25), we have and from (14.21), we have X»º+Y ¹ ¿ X º+Y :S ¿ ´ (14.26) X»º+Y ¹ ¿ X º+Y ¿CB (14.27) S ¿ Sò¿_ @ ^AÈ? òS ¿_ @ º ^AÈ S ¿ º _ @ º ^AÈ´ AÈ Sò¿ º H _ º _ _ º° ¿ ´ and similarly ¿ º Sò¿ B º Therefore must have a uniform distribution on its support. The same can and . Now from Figure 14.1, be proved for D R A F T September 13, 2001, 6:27pm (14.28) (14.29) (14.30) (14.31) D R A F T 305 Beyond Shannon-Type Inequalities a=0 log 2 Figure 14.2. The values of log 3 log 4 (a,a,a,a,2a,2a,2a) for which á 9Ô9Ô9Ô£¢[9Ô£¢ 9Ô£¢[þÔ¢ ã is in ¤¥ . ¨¦ §©Qª S t S t S ª e N ª Then the only values that can take are , where (a positive integer) , and . In other words, if is not is the cardinality of the supports of , for some positive integer , then the point equal to is not in . This proves that , which implies . The theorem is proved. ¦¨t §©QS ª e ° ºS HU´[HUS ´[HU´ ´ ´ ´ ¿ |d e N d i É f S , let i º« ´ « ´ « S ´ « £ ´ « S ´ « S ´ « £ S° ¿CB (14.32) t ¿ in S corresponds From Figure 14.1, wer see that the point ºHU´[HU´[HU´ ´ ´ ´ r r r r r ¿ S to the point º ´ ´ ´ ´ ´ ´ . Evidently, the point º ´ ´ ´ ´ ´ r ´ r±o$ ¿ in d S satisfies &W $ the 6 elemental in d inequalities given in (14.8) and (14.12) S ³¬ ¶ for with equality. Since d is defined by all the elemental inequalities, the set ²oº ´ ´ ´ r ´ r ´ r ´ r ¿/É d S u ÁIHq¸ (14.33) S is in the intersection of 6 hyperplanes in f (i.e., ) defining the boundary of Sd , and hence it defines an extreme direction of d S . Then the proof says that S along this extreme direction of d , only certain discrete points, namely those points with equals ¦5§©®ª for some positive integer ª , are entropic. This is S illustrated in Figure 14.2. As a consequence, the region d e is not convex. S S Having proved that d e N d , it is natural to conjecture that the gap between d|Se and d S hasS Lebesgue measure 0. In other words, d Se d S , where d Se is the closure of d . This conjecture is indeed true and will be proved at the end of the section. More generally, we are interested in characterizing d e , the closure of d e . Although the region d e is not sufficient for characterizing all information inThe proof above has the following interpretation. For equalities, it is actually sufficient for characterizing all unconstrained information inequalities. This can be seen as follows. 
Following the discussion in Section 12.3.1, an unconstrained information inequality involving random variables always hold if and only if ¯ ÁnH Since ²Gi u ¯ ºi ¿ I Á Hq¸ D R A F T d e Ë ²Gi u ¯ ºi ¿ I Á Hq¸ B (14.34) is closed, upon taking closure on both sides, we have d e Ë ²Gi u ¯ ºi ¿ I Á Hq¸ B September 13, 2001, 6:27pm (14.35) D R A F T 306 A FIRST COURSE IN INFORMATION THEORY On the other hand, if ¯ÁIH satisfies (14.35), then d e Ë d e Ë ²Gi u ¯ ºi ¿ ÁIHq¸ B (14.36) d e Therefore, (14.34) and (14.35) are equivalent, and hence is sufficient for characterizing all unconstrained information inequalities. We will prove in the next theorem an important property of the region for . This result will be used in the proof for . Further, this result all will be used in Chapter 15 when we use to characterize the achievable information rate region for multi-source networking coding. It will also be used in Chapter 16 when we establish a fundamental relation between information theory and group theory. to denote the We first prove a simple lemma. In the following, we use set . Á r d e ± r ² ´ ´ ´)»¸ ±®y,} }a²³~|5´ If i d e d Se d S ° Ri ¹ are in d e , then i _ iú¹ is in d e . iR¹ in d e . Let i represents the entropy function for ranProof Consider i and dom variables µ ´)µ ´ò´)µ , and let iú¹ represents the entropy ¿ and º+µ function ¹ ´)µ ¹ ´for ò´ random variables µ ¹ ´)µ ¹ ´:´)µ ¹ . Let º+µ ´)µ ´´)µ µÑ ¹ ¿ be independent, and define random variables Z/ ´ Z] ´ò´ Z` by Z`¶/ º+µ ¶ ´)µ ¶ ¹ ¿ (14.37) É ° . Then for any subset · of ° , for all ³ Z]¸ ¿ º+µ ¸ ¿_ º+µ ¸ ¹ ¿ « ¸ _ « ¹¸ B (14.38) º _ i ¹ , which represents the entropy function for Z/ ´ Z` ´ò´ Z` , is Therefore, i in d e . The lemma is proved. ¹ z|{]z|ºº1²|{»n~|¼ If i É d e , then & i É d e for any positive integer & . and Proof It suffices to write & _ _ _ i ½ i i [¾ ¿ i À (14.39) and apply Lemma 14.3. wox\yz|{y,}~|5Á d e is a convex cone. µ )´ µ ´´)µ · of ° , Proof Consider the entropy function for random variables taking constant values with probability 1. Then for all subset ¸¿ HB º+µ D R A F T September 13, 2001, 6:27pm all (14.40) D R A F T Beyond Shannon-Type Inequalities 307 Od e f i Z\ Z]iú ¹ d Z`e ´ ´:´  ´C  ´ò´C ± de Åà ±Ç à i iú ¹ d e ¿ Äà i & _WÅÃiR¹ d e HÆ'Z\à Z] º ´ ¿ ´:´ Z` ¿ ºÊÉ ´ É ´º+´ÄÈ É ´)¿ È ´& ò´)È ºÂ ´C ´ò´CÂ Ë ÌÎÍ ²Ë Hq¸ ±¬ÇÐÏÑÇ j ´ ÌÍ ²Ë ± ¸ Ï ´ ÌÍ ²Ë Òr ¸ j B by letting Now construct random variables µ ´)µ ´ò´)µ ¶µ ÔÕ ÖÓ ÈH ¶ ifif ËË H ± ¶ if Ë ×r . É ¿ÎØ H as Ï ´)j Ø H . Then for any nonempty subset · of ° , Note that ºÊË $ ¸ ¸ ´ÄË ¿ ¿ + º µ (14.41) º+µ / ¸ \ ¿ _ ¿ (14.42) ºÊºÊËË ¿\_ Ï& º+µ º Z]¸ Ë ¿_ j & ºÂ ¸ ¿CB (14.43) On the other hand, ¸ ¿ Á º+µ ¸/ Ë ¿ Ï& º Z`¸ ¿/_ j & ºÂ ¸ ¿CB + º µ (14.44) Combining the above, we have $ Ç Ï& Z ¸ ¿_ & ¸ ¿È¿ $ ¸ ¿ ¿CB (14.45) H º+µ º º j ºÂ ºÊË Now take Ï &à (14.46) Therefore, contains the origin in . Let and in be the entropy functions for any two sets of random variables and , respectively. In view of Corolis a convex cone, we only need to show that lary 14.4, in order to prove that if and are in , then is in for all , where . be independent copies of and Let be independent copies of . Let be a ternary random variable independent of all other random variables such that Å j &à (14.47) $ Ç $ to obtain ¸ ] Z ¸ ¸ ¿ \ ¿ L _ Å È ¿ ¿ ¿CB º à º à º  (14.48) ºÊË & H º+µ By letting be sufficiently large, the upper bound can be made arbitrarily _ÙÅÃÄi ¹ É d e . The theorem is proved. small. 
This shows that ÃÄi S S t In the next theorem, we prove that d e and d aret almost identical. Analo gous to d e , we will use e to denote the closure of e . and D R A F T September 13, 2001, 6:27pm D R A F T 308 wox\yz|{y,}~|5Ú d Se d S . S dS Proof We first note that d e Since A FIRST COURSE IN INFORMATION THEORY t S t SB e (14.49) d Se Ë d S (14.50) if and only if S t SB on both sides in the above, we obtain d Se Ë and d is closed, by taking Sd B This implies thatt t Se Ë closure to prove the theorem, it S Ë t Se B Therefore, in order Ç ¿ t S suffices to show that e for all M H . We first show that the point is in º U H [ ´ U H [ ´ U H ´ ´ ´ ´ S be defined as in Example 6.10, i.e., µ ± Let random variables µ ´)µ , and µ and µ are two independent binary random variables taking values in ²HU´ ¸ according to the uniform distribution, and µ S µ _ µ rB (14.51) É d Se represents the entropy function for µ ´)µ , and µ S , and let Let i l nm S i ±%B $ $ (14.52) ¶ As in the proof of Theorem 14.2, we let , ³ be the coordinates of s S mod which correspond to the values of the quantities in (14.14), respectively. From Example 6.10, we have ± r ± for H ³ ¶ Õ ÖÓ Ç± for ³ g ´´ Ûq´´ (14.53) . for ³ ± ± ± DZ ¿ t S @ µ & , Thus the point ºHU´[HU´[HU´ ´ ´ ´ is in e , and the -Measure j e & for& µ & ´)ÇÑ S ¿ & t and µ t is shown in Figure 14.3. Then by Corollary 14.4, ºHU´[HU´[HU´ ´ ´ ´ is t S in Se and hence in Se for all positive integer . S Since d Se contains the origin, t S e also contains the origin. By Theorem Ç14.5,¿ d e ist convex. This implies e t S ºHU´[HU´[HU´ ´ ´ ´ is in Se for all M H . is also convex. Therefore, É Consider any l . Referring to (14.15), we have ¶ ÁIH (14.54) ±Ü$ $Ý for ³ . Thus is the only component of l which can possibly be negative. We first consider the case when t ÁIH . Then l is in the nonnegative S S orthant of s , and by Lemma 12.1, l is in e . Next, consider the case when Ç Ç Ç ] IH . Let Þ ºHU´[HU´[HU´ ]9´ ] ´ ò´) ¿CB (14.55) D R A F T September 13, 2001, 6:27pm D R A F T Beyond Shannon-Type Inequalities ß0 X1 û ß0 Figure 14.3. The -Measure âã 1 1 1 for àX 1 309 2 ß0 Ò > ÔÒ ¡ , and Ò á X3 in the proof of Theorem 14.6. ä _ Þ´ l (14.56) where äF º+ )´ ´) S ´) R _ ´) _ ´) _ ´[H ¿CB (14.57) Ç Þ É t S e . From (14.15), we Since M H , we see from the foregoing that have (14.58) ¶ _ ]/ÁIH t S ä S Þ É t S orthant in s and hence in e by for ³ g´ Ûq´ . Thus is in the nonnegative e such that Lemma 12.1. Now for any å M H , let ¹ æ Þ Ç Þ ¹ æ 'å4´ (14.59) Þæ Ç Þ æ Þ Þ where ¹ denotes the Euclidean distance Þ B between and ¹ , and let ä _ ¹ (14.60) Þ t S l¹ t S ä Since both and ¹ are in e , by Lemma 14.3, lú¹ is also in e , and æ l Ç l ¹ æ æ Þ Ç Þ ¹ æ 'å B (14.61) t S t S t S É e . Hence, Ë e , and the theorem is proved. Therefore, l S S Remark 1 Han [88] has found that d is the smallest cone that contains d e . This result together with Theorem 14.5 implies Theorem 14.6. Theorem 14.6 is also a consequence of the theorem in Matç`b éè [135]. Remark 2 We have shown that the region d e completely characterizes all unconstrained information inequalities involving random variables. Since d Se d S , it follows that there exists no unconstrained information inequaliThen ties involving three random variables other than the Shannon-type inequalities. However, whether there exist constrained non-Shannon-type inequalities involving three random variables is still unknown. 
D R A F T September 13, 2001, 6:27pm D R A F T 310 14.2 A FIRST COURSE IN INFORMATION THEORY A NON-SHANNON-TYPE UNCONSTRAINED INEQUALITY d Se d S ÁJg . We have proved in Theorem 14.6 at the end of the last section that It is natural to conjecture that this theorem can be generalized to . If this conjecture is true, then it follows that all unconstrained information inequalities involving a finite number of random variables are Shannon-type inequalities, and they can all be proved by ITIP running on a sufficiently powerful computer. However, it turns out that this is not the case even for . We will prove in the next theorem an unconstrained information inequality involving four random variables. Then we will show that this inequality is a non-Shannon-type inequality, and that . g wox\yz|{y,}~|ëê d Re N d R r@ º+µ SA µ R ¿ $ @ +º µ A µ ¿\_ _ @ º+µ S A µ R µ ¿/_ @ +º µ For any four random variables @ +º µ SAµ µ )´ µ ´)µ S , and µ R , A µ S ´)µ R ¿ R µ ¿CB (14.62) auxiliary random ì Towardì proving this theorem, we introduce ´)µ ´)µ S ,two ía R such that íaì ® Òvariables µíQì î and µ jointly distributed with µ and µ FíQ . To simplify notation, we will use X £ S8RÄï ï º+Y ´)Y ´)Y S ´)Y R ´ Y ì ´ Y ì and ¿ to S R ¿ ì ì ï ï denote X > ð¡9 ñ > ð¡ º+Y ´)Y ´)Y ´)Y ´ Y ´ Y , etc. The joint distribution for ì S R ì the six random variables µ ´)µ ´)µ ´)µ ´ µ , and µ is defined by ï ï º+Y ´)Y ´)Y S ´)Y R ´ Y ì ´ Y ì ¿ X £ S8RÄòF ó > ¡ ñ ¼ô >9õ ô ¡ õ ô ó õ ô ñ ó > ¡ ñ ô ï >)õ ô ï ¡ õ ô õ ô ñ S8R S R¿ M H if X º+Y ´)Y ñ öô õ ô ñ S8R S R¿ H . (14.63) H if X º+Y ´)Y ±®y,} }a²³~|5÷ ì ì º+µ ´)µ ¿ðØ º+µ S ´)µ R ¿ðØ º µ ´ µ ¿ ì ì (14.64) S R ¿ and º µ ´ µ ´)µ S ´)µ R ¿ forms a Markov chain. Moreover, º+µ ´)µ ´µ ´)µ have the same marginal distribution. Proof The Markov chain in (14.64) is readily seen by invoking Proposition 2.5. The second part of the lemma is readily seen to be true by noting in (14.63) is symmetrical in and and in and . that ì ì X £ S8RÄï ï µ µ µ µ variables ì Fromì the above lemma, we see that the pair of auxiliary random ¿ in the º µ ´ µ ì ¿ ì corresponds + º µ ) ´ µ to the pair of random variables S R¿ have the same marginal distribution as º+µ ´)µ ´)µ S sense that º µ ´ µ ´Cµ ´)µ ´)µ R¿ . We need to prove two inequalities regarding these six random variables before we prove Theorem 14.7. D R A F T September 13, 2001, 6:27pm D R A F T Beyond Shannon-Type Inequalities ±®y,} }a²³~|5ø ì µ @ º+µ SA µ R ¿ Ç @ º+µ SA µ ?R µ µ ´)µ ì µ ¿ Ç @ º+µ S A µ R? µ ¿ $ ´)µ S For any four random variables , and and as defined in (14.63), iary random variables µ R 311 and auxil- @ º+µ A µ ì ¿CB @ º+µ S A µ R¿ Ç @ º+µ S A µ R µ ¿ Ç @ º+µ S A µ R µ ¿ ù ú @ º+µ S A µ R¿ Ç @ Ç º+µ S A µ R µ ¿£û Ç @ º+µ S A µ R µ ì ¿ @ º+µ A µ S1A µ R ¿ @ º+µ SA µ R? µ ì ¿ ú @ º+µ GA µ S A µ R A µ ì ¿\Ç _ @ º+µ ^A µ S A µ R µ Ç ì ¿£û Ç @ º+µ @ º+µ A µ S1A µ RA µ ì ¿ Ç ú @ º+µ SA µ R? µ ì ¿ @ º+µ A µ @ º+µ ^A µ S A µ R A µ ì Ç ¿ @ º+µ S A µ R µ ´ µ ì Ç ¿ ú @ º+µ GA µ R A µ Ç ì ¿ @ º+µ A µ R A µ Çì µ Sò¿£û @ º+µ S A µ Çú @ º+µ GA µ ì ¿ @ º+µ ^A µ ì Ç ? µ R¿£û ú @ º+µ A µ ì µ Sò¿ @ º+µ ^A µ ì µ S ´)µ R¿£û @ º+µ S A µ R µ ´ µ ì ¿ ü @Ç º+µ A µ ì ¿ Ç @ º+µ A µ ì µ R ¿ Ç @ º+µ A µ ì µ S ¿ ì $ @ @ º+µ ^A S ìA µ ¿ R µ ´ µ ¿ º+µ µ ´ (14.65) Proof Consider ì º µ ´)µ S )´ µ R¿ SAµ SA µ (14.66) (14.67) (14.68) (14.69) (14.70) (14.71) R µ ì ¿ R? 
µ ì ¿£û R µ ´ µ ì ¿ (14.72) (14.73) (14.74) º+µ ´)µ S ´)µ R¿ where a) follows because we see from Lemma 14.8 that and have the same marginal distribution, and b) follows because @ º+µ ^ A µ ì µ S ´)µ R¿ H (14.75) from the Markov chain in (14.64). The lemma is proved. ±®y,} }a²³~|~1ý µ ´)µ ì ì µ µ Ç @ º+µ S A µ R¿ r @ +º µ S A µ R µ ¿ $ @ +º µ ´)µ S , and µ R For any four random variables iliary random variables and as defined in (14.63), ì ì µ and aux- ^A µ ì ¿CB (14.76) µ µ Proof Notice that (14.76) can be obtained from (14.65) by replacing by and by in (14.65). The inequality (14.76) can be proved by replacing by and by in (14.66) through (14.74) in the proof of the last lemma. The details are omitted. µ µ µ D R A F T ì µ ì µ September 13, 2001, 6:27pm D R A F T 312 A FIRST COURSE IN INFORMATION THEORY r@ $ º+µ S A µ R ¿ Ç @ º+µ S A µ @ º+µ GA µ ì ¿_ @ º+µ @ º+µ GA µ ì ¿_ ú @ º+µ ú @ º+µ A µ ì ¿\_ @ º+µ @ º+µ GA µ ì ´ µ ì ¿\_ $ @ º+µ GA µ ì ´ µ ì ¿\_ @ GA ì ì ¿\_ ù$ º+µ µ ´ µ @ GA S R¿\_ ü @ º+µ GA µ S ´)µ R¿\_ º+µ µ ´)µ R µ ¿ Ç @ º+µ S A µ R µ ¿ A µ ì ¿ (14.77) A µ ì µ ì ¿_ @ º+µ A µ ì ^A µ ì ¿£û (14.78) A µ ì µ ì ¿£ûþ_ @ º+µ A µ ì A µ ì ¿ (14.79) @ º+µ A µ ì ^A µ ì Ç ¿ (14.80) ú @ º µ ì A µ ì ¿ @ º µ ì ^A µ ì µ ¿£û (14.81) @ º µ ì A µ ì ¿ (14.82) @ º µ ì A µ ì ¿ (14.83) @ º+µ A µ ¿ ´ (14.84) where a) follows from the Markov and b) follows because ì chain ì ¿ in (14.64), ¿ and º+µ ´)µ have we see from Lemma 14.8 that º µ ´ µ ì theì same marginal distribution. Note that the auxiliary random variables µ and µ disappear in Proof of Theorem 14.7 By adding (14.65) and (14.76), we have (14.84) after the sequence of manipulations. The theorem is proved. wox\yz|{y,}~|~,~ R R and d e N d . The inequality (14.62) is a non-Shannon-type inequality, M H the point i«ì º ¿«É f R , where ì ì ì ì ¿ Dr «ì £ º ¿ ¿ « º ¿ ì S « S ¿ º ¿ ì R « R º ¿ D ´ «ì S º ¿ g ì R ´ « ¿ º ì S8R « ¿ Dº ´ (14.85) «ì £ S º ¿ « ì £ º R ¿ « ì º S8 R ¿ ´ ì S8R ¿ ì £ S8R ¿ B « º « º « º « º « º g ì ¿ The set-theoretic structure of i«º is illustrated by the information diagram in Figure 14.4.ì The reader should check that this information diagram correctly ì ¿ ¿ represents iðº as defined. It is also easy to check from this diagram that i£º satisfies ì ¿É allR the elemental inequalities for four random variables, and therefore iðº ì ¿ d . However, upon substituting the corresponding values in (14.62) for i«º with the help of Figure 14.4, we have r $ H _ _ H _ H ´ (14.86) ì ¿ which is a contradiction because M H . In other words, i«º does not satisfy (14.62). Equivalently, ì i«º ¿ É N ²Gi É f R u i satisfies (14.62) ¸ B (14.87) Proof Consider for any D R A F T September 13, 2001, 6:27pm D R A F T 313 Beyond Shannon-Type Inequalities X2 0 a a 0 a a a 0 X1 a 0 a a 0 X3 0 ÿ 0 X4 Figure 14.4. The set-theoretic structure of Since ì i£º «¿ É d R , we conclude that d R Ë N ²Gi É f R u i á ã. satisfies (14.62) ¸þ´ (14.88) i.e., (14.62) is not implied by the basic inequalities for four random variables. Hence, (14.62) is a non-Shannon-type inequality. Since (14.62) is satisfied by all entropy functions for four random variables, we have satisfies (14.62) (14.89) d Re Ë ²Gi É f %R u i ¸þ´ and upon taking closure on both sides, we have (14.90) d Re Ë ²Gi É f R u i satisfies (14.62)¸ B ì ¿ É Re ì ¿ÞÉ R ì ¿ É Re Then (14.87) implies i£º R R N d . Since i«º d and iðº N d , we conclude that d e N d . The theorem is proved. 
Remark We have shown in the proof of Theorem 14.11 that the inequality (14.62) cannot be proved by invoking the basic inequalities for four random variables. However, (14.62) can be proved by invoking the basic inequalities for the six random variables , and with the joint probability distribution as constructed in (14.63). ì )´ µ ´)µ S ´)µ R ´ µ ì µ µ X £ S8R ï ï The inequality (14.62) remains valid when the indices 1, 2, 3, ± and 4 are S and µ R , g r r distinct permuted. Since (14.62) is symmetrical in µ versions of (14.62) can be obtained by permuting the indices, and all these twelve inequalities are simultaneously satisfied by the entropy function of any set of random variables , and . We will denote these twelve inequalities collectively by <14.62>. Now define the region µ ´)µ )´ µ S ì d R ²Gi É d R u i D R A F T µ R satisfies <14.62> ¸B September 13, 2001, 6:27pm (14.91) D R A F T 314 A FIRST COURSE IN INFORMATION THEORY Evidently, ìR R Since both d and d ì dR R ì d Re Ë d R Ë d RB (14.92) are closed, upon taking closure, we also have ì d Re Ë d R Ë d RB (14.93) dR d Re Since <14.62> are non-Shannon-type inequalities as we have proved in the last is a proper subset of and hence a tighter outer bound on and theorem, than . In the course of proving that (14.62) is of non-Shannon-type, it was shown in the proof of Theorem 14.11 that there exists as defined in (14.85) which does not satisfy (14.62). By investigating the geometrical relation between and , we prove in the next theorem that (14.62) in fact induces non-Shannon-type constrained inequalities. Applications of a class of some of these inequalities will be discussed in Section 14.4. d Re d ì Ç ±d R i«º r ¿ R F ì i«º «¿ É d R wox\yz|{y,}~|~ The inequality (14.62) is a non-Shannon-type inequality conditioning on setting any nonempty subset of the following 14 Shannon’s information measures to zero: @ +º µ A µ ¿ ´ @ º+µ A µ µ S ¿ ´ @ º+µ A µ µ R ¿ ´ @ º+µ A µ S µ R ¿ ´ @ º+µ ^ A µ R µ Sò¿ ´ @ º+µ A µ S µ R¿ ´ @ º+µ A µ R µ Sò¿ ´ @ º+µ S A µ R µ ¿ ´ (14.94) @ º+µ S A µ R µ ¿ ´ @ º+µ S A µ R µ ´)µ ¿ ´ º+µ µ ´)µ S ´)µ R¿ ´ S R¿ S R¿ R S¿CB º+µ µ ´)µ )´ µ ´ º+µ µ ´)µ ´)µ ´ º+µ µ ´)µ ´)µ ì ¿ Proof It is easy to verify from Figure 14.4 that in exactly 14 hyð i R (i.e., ) defining the boundary ofº d R lies perplanes in f which correspond ì ¿ to setting the 14 Shannon’s measures in (14.94) to zero. Therefore, iðº for Á IH define an extreme direction of d R . R ì ¿ Now for any linear subspace of f containing iðº , where M H , we have ì ¿«É R (14.95) i«º d ì ¿ and i«º does not satisfy (14.62). Therefore, ºd R ¿ Ë N ²Gi É f R u i satisfies (14.62)¸ B (14.96) This means that (14.62) is a non-Shannon-type inequality under the constraint . From the above, we see that can be taken to be the intersection of any nonempty subset of the 14 hyperplanes containing . Thus (14.62) is a non-Shannon-type inequality conditioning on any nonempty subset of the 14 Shannon’s measures in (14.94) being equal to zero. Hence, (14.62) induces a ì i£º ¿ D R A F T September 13, 2001, 6:27pm D R A F T R Ç ± r class of 315 Beyond Shannon-Type Inequalities non-Shannon-type constrained inequalities. The theorem is proved. Remark It is not true that the inequality (14.62) is of non-Shannon-type under any constraint. 
Suppose we impose the constraint @ º+µ S A µ R¿ H B (14.97) Then the left hand side of (14.62) becomes zero, and the inequality is trivially implied by the basic inequalities because only mutual informations with positive coefficients appear on the right hand side. Then (14.62) becomes a Shannon-type inequality under the constraint in (14.97). 14.3 A NON-SHANNON-TYPE CONSTRAINED INEQUALITY d Re N d R R d e r R Ç[d ± Re In the last section, we proved a non-Shannon-type unconstrained inequality . This inequality induces for four random variables which implies which is a tighter outer bound on and then . We fura region ther showed that this inequality induces a class of non-Shannon-type constrained inequalities for four random variables. In this section, we prove a non-Shannon-type constrained inequality for four random variables. Unlike the non-Shannon-type unconstrained inequality we proved in the last section, this constrained inequality is not strong enough to . However, the latter is not implied by the former. imply that d ìR dR d Re N d R ±®y,} }a²³~|~´ Let X º+Y ´)Y ´)Y S ´)Y R¿ be any probability distribution. Then òFó öô >8õ ô ó õ ô ñ ó öô ¡ õ ô õ ô ñ S R:¿ M H ìX º+Y ´)Y ´)Y S ´)Y R¿ if X º+Y ´)Y ¼ô õ ô ñ (14.98) S R:¿ H if X º+Y ´)Y H is also a probability distribution. Moreover, and for all X ì º+Y ´)Y S ´)Y Rò¿ X º+Y )´ Y S ´)Y R¿ (14.99) X ì º+Y ´)Y S ´)Y Rò¿ X º+Y )´ Y S ´)Y R¿ (14.100) Y ´)Y ´)Y S , and Y R . Proof The proof for the first part of the lemma is straightforward (see Problem 4 in Chapter 2). The details are omitted here. To prove the second part of the lemma, it suffices to prove (14.99) for all , and because is symmetrical in and . We first Y ´)Y S YR D R A F T »X ì º+Y ´)Y ´)Y S ´)Y R¿ Y September 13, 2001, 6:27pm Y D R A F T 316 A FIRST COURSE IN INFORMATION THEORY Y , Y S , and Y R such that X º+Y S ´)Y R¿ M H . From (14.98), we have ì S R¿ X»ì º+Y )´ Y S ´)Y R¿ X º+Y ´)Y ´)Y ´)Y (14.101) ô ¡ S Rò¿ S R¿ X º+Y ´)Y ´)Y S X º+Y R¿ ´)Y ´)Y (14.102) ô ¡ S RX»¿ º+Y ´)Y X º+Y ´)S Y ´)RòY ¿ X»º+Y ´)Y S ´)Y R ¿ (14.103) X º+Y ´)Y ¡ ô R¿ S R¿ X º+Y ´)S Y S ´)RY (14.104) ¿ X º+Y ´)Y X + º Y ) ´ Y X º+Y ´)Y S ´)Y R ¿CB (14.105) S R S R ¿ For Y ´)Y , and Y such$ that X º+Y ´)Y $ H , weS have S R ¿ (14.106) H X º+Y ´)Y ´)Y X º+Y ´)Y R ¿ HU´ which implies (14.107) X º+Y ´)Y S ´)Y R ¿ H B consider ì S R¿ X ì º+Y ´)Y S ´)Y Rò¿ ~X º+Y ´)Y ´)Y ´)Y (14.108) ¡ ô H (14.109) ¡ ô H (14.110) X º+Y ´)Y S ´)Y R¿CB (14.111) S R Thus we have proved (14.99) for all Y ´)Y , and Y , and the lemma is proved. wox\yz|{y,}~|~ For any four random variables µ ´)µ ´)µ S , and µ R , if @ º+µ A µ ¿ ×@ º+µ ^A µ ? µ Sò¿ HU´ (14.112) then @ º+µ S A µ R¿ $ @ º+µ S A µ R µ ¿\_ @ º+µ S A µ R µ ¿CB (14.113) Therefore, from (14.98), we have @ º+µ 1S A µ R ¿ Ç @ º+µ SA µ R¥ µ ¿ Ç @ º+µ SA µ R? µ ¿ 4 ñ > 4 > 4 ñ 4 4 ¡ 4 4 ¡ 4 4 ñ ó > õ õ õ ¡ ñ ñ > ¡ > ñ ¡ ñ > 4 4 ¡ 4 4 4 4 ñ ¼ô ô ô ô > ¡ ñ ó ¦¨§© X º+µ S S¿ ´)µ R R ¿ X ¿ º+µ ´)µ¿ Sò¿ X» º+µ¿ ´)µ R¿ X S º+µ R ´)µ¿ Sò¿ X º+µ S ´)µ RR¿ ¿ ´ X»º+µ X»º+µ X º+µ X º+µ X º+µ ´)µ ´)µ X º+µ ´)µ ´)µ Proof Consider (14.114) D R A F T September 13, 2001, 6:27pm D R A F T 317 Beyond Shannon-Type Inequalities ó X º+Y ´)Y ´)Y S ´)Y R¿ . ó ï ¨¦ §© X»º+µ SòS¿ ´)µ RR¿ X ¿ º+µ ´)µ¿ Sò¿ X º+µ¿ )´ µ R ¿ X S º+µ R´)µ¿ Sò¿ X» º+µ S ´)µ RR¿ ¿ ´ X º+µ X º+µ X º+µ X»º+µ X º+µ )´ µ ´)µ X º+µ ´)µ ´)µ (14.115) S ò R ¿ ì is defined in (14.98). where X~º+Y ´)Y ´)Y ´)Y Toward proving that the claim is correct, we note that (14.115) is the sum ì of a number of expectations with respect to X . 
Let us consider one of these to denote expectation with respect to where we have used We claim that the above expectation is equal to expectations, say 4 > 4 ¡ 4 4 4 4 ñ X ì +º Y ´)Y ´)Y S ´)Y > ¡ ñ ì S Note that in the above summation, if X»º+Y ´)Y ´)Y ´)Y we see that X º+Y ´)Y S ´)Y R¿ M HU´ and hence X»º+Y ´)Y Sò¿ M H B ó ï ¨¦ §©/X º+µ ´)µ Sò¿ R¿ ¦5§©\X º+Y )´ Y S:¿CB (14.116) R¿ M H , then from (14.98), (14.117) (14.118) Therefore, the summation in (14.116) is always well-defined. Further, it can be written as ´)Y Sò¿ ì º+Y ´)Y ´)Y S ´)Y R¿ ¦ ¨ § \ © X + º Y X " ! ï ó ¡ > õ ¡ õ õ ñ ô )> õ ô õ ô ñ ìX»º+Y ´)Y ô S ´)Y ¼Rô ¿ ¦¨ô §©\ô X»º+ô Y ´)Y Sò¿CB (14.119) >ô 8õ ô õ ô ñ S ¿ depends on X ì º+Y ´)Y ´)Y S ´)Y R ¿ only through X ì º+Y ´)Y S ´)Y R ¿ , ï Thus ó ¦¨§©X º+µ ´)µ S R¿ which by Lemma 14.13 is equal to X º+Y ´)Y ´)Y . It then follows that ó ï ¦¨§©\X º+µ ´)µ S¿ X ì º+Y ´)Y S ´)Y R¿ ¦¨§©/X º+Y ´)Y Sò¿ (14.120) >9õ ô õ ô ñ ô X º+Y ´)Y S ´)Y R¿ ¦¨§©/X º+Y ´)Y Sò¿ (14.121) >9õ ô õ ô ñ ô ó ¦¨§©X»º+µ ´)µ Sò¿CB (14.122) ò S ¿ In other words, the expectation on ¦¨§©/X º+µ ´)µ can be taken with respect S R ¿ S R ¿ ì or X º+Y ´)Y ´)Y ´)Y without affecting its value. By to either X»º+Y ´)Y ´)Y ´)Y observing that all the marginals of X in the logarithm in (14.115) involve only S R S R subsets of either ²^µ ´)µ ´)µ ¸ or ²^µ ´)µ ´)µ ¸ , we see that similar conclusions can be drawn for all the other expectations in (14.115), and hence the claim is proved. D R A F T September 13, 2001, 6:27pm D R A F T 318 A FIRST COURSE IN INFORMATION THEORY @ º+µ S A µ R ¿ Ç @ º+µ S A µ R µ ¿ Ç @ º+µ S A µ R µ ¿ ó ï ¨¦ §© X º+µ S4S¿ ´)µ R R ¿ X ¿ º+µ ´)µ¿ Sò¿ X» º+µ¿ ´)µ R¿ X S º+µ R ´)µ¿ Sò¿ X º+µ S ´)µ RR¿ ¿ »X º+µ X»ó ï º+µ >)õ X ¡ õ º+µ õ ñ X º+µ X º+4 µ ñ ´)µ > 4 ´) µ > X 4 º+ñµ 4 4 ´)¡ µ 4 ´)µ 4 ¡ 4 4 ñ 4 4 4 öô ô ô ô ñ > ¡ > ñ ¡ ñ > > 4 ¡ ¡ 4 4 ññ Ç 4 4 4 X~ì º+Y ´)Y ´)Y S ´)Y R ¿ ¦¨§© X~ì$ º+Y ´)Y ´)Y SS ´)Y RòRò¿¿q´ (14.123) > X~º+Y ´)Y ´)Y ´)Y > 4 ¡ ¡ 4 4 ñ ñ# Thus the claim implies that where ~X $ º+Y ò)´ Y ó ´)Y S ´)Y R:ó ¿ ó ö ô >8õ ô ó öô >öô ó >)¼õ ôô ñ¡ ó ¼¼ôô ¡ õ ôó ¼ô ó ñ ö ô ¡ õ ô ñ H X º+Y ¿ ´ X»º+Y ¿ ´ X º+Y òS ¿ ´ X º+Y R ¿ M H if otherwise. (14.124) X ì º+Y ´)Y X º+Y ´)Y Sò¿ ´X Y ´)Y ´)Y S , and Y R are ´)Y S )´ Y R ¿ M H º+Y ´)Y R¿ ´X º+Y ´)Y òS ¿ ´ X º+Y )´ Y R ¿ ´ X º+Y ¿ ´ X º+Y ¿ ´ X»º+Y Sò¿ ´X º+Y R ¿ $ S R(14.125) ¿ M H. are all strictly positive, and we see from (14.124) that$ X~º+ Y ´) Y ´)Y ´)Y S R ¿ To complete the proof, we only need to show that X º+Y ´)Y ´)Y ´)Y is a prob- The equality in (14.123) is justified by observing that if such that , then ability distribution. Once this is proven, the conclusion of the theorem follows immediately because the summation in (14.123), which is identified as the $ and , is always nonnegdivergence between ative by the divergence inequality (Theorem 2.30). 
Toward this end, we notice that for , and such that , X ì º+Y ´)Y )´ Y S ´)Y R ¿ X º+Y ´)Y ´)Y S ´)Y R ¿ YS X º+Y S:¿ M H ´)Y Sò¿ X º+Y ´)Y Sò¿ ò S ¿ X + º Y X»º+Y ´)Y ´)Y X º+Y S ¿ Y ´)Y by the assumption and for all Y Y , and by the assumption D R A F T (14.126) @ º+µ ^ A µ µ òS ¿ HU´ (14.127) X º+Y )´ Y ¿ X º+Y ¿ X º+Y ¿ (14.128) @ º+µ ^ A µ ¿ H B (14.129) September 13, 2001, 6:27pm D R A F T 319 Beyond Shannon-Type Inequalities Then $ S R¿ X » >ô 8õ ô ¡ õ ô õ ô ñ º+Y ´)Y ´)Y ´)Y 4 > 4 ¡ 4 4 4 4 ñ% X $ º+Y ´)Y ´)Y S ´)Y R ¿ & > ¡ ñ# X º+Y ´)Y S ¿ X º+ Y ¿ ´)Y R ¿¿ X º+Y Sò¿´)Y S ¿ X R¿ º+Y ´)Y R ¿ > 4 4 4 X º+Y X º+Y X º+Y X»º+Y > 4 ¡ ¡ 4 4ñ% ñ# ù X º+Y ´)Y ´)Y S: ¿¿ X º+Y ¿´)Y R¿ X R¿ º+Y ´)Y R¿ > 4 4 4 X º+Y X º+Y X º+Y > 4 ¡ ¡ 4 4%ñ ñ # ü X º+Y ´)Y ´)Y S ¿ X º+Y ¿ ´)Y R R¿ X ¿ º+Y ´)Y R ¿ > 4 4 4 X»º+Y ´)Y X º+Y > 4 ¡ ¡ 4 4ñ% ñ # S Y ´)Y ¿ »º+Y ´)Y Rò¿ X Ròº+¿ Y ´)Y R¿ X X + º Y ( ! 4 > 4 ¡ 4 4 ñ ó X º+Y > ¡ ñ ' ô ¼ô R R ¿ ¿ X»º+Y ´)Y X Ròº+¿ Y ´)Y 4 > 4 ¡ 4 4 ñ X º+Y > ¡ ñ ' G 4 ¡ 4 ñ X º+Y ´)Y R¿ > ó > ( ! X º+Y Y R¿ ¡ ñ ô öô ) X º+Y ´)Y R¿ ¡ 4 4 ñ ¡ ñ * ± ´ Y X º+Y ¿ H ¿ X º+Y R Y ¿ B G R ¿ X + º Y X º+Y Y H X º+Y R ¿ Therefore G Y R¿ X»º+Y G Y R¿ ± B X + º Y ( ! ô> ô > ó ¼ô > R Finally, the equality in d) is justified as follows. For Y and Y R ¿ : R ¿ or X º+Y vanishes, X º+Y ´)Y must vanish because $ $ H X º+Y ´)Y R¿ X»º+Y ¿ (14.130) (14.131) (14.132) (14.133) (14.134) (14.135) (14.136) (14.137) (14.138) where a) and b) follows from (14.126) and (14.128), respectively. The equality in c) is justified as follows. For such that , D R A F T September 13, 2001, 6:27pm (14.139) (14.140) such that X º+Y ¿ (14.141) D R A F T 320 A FIRST COURSE IN INFORMATION THEORY and H Therefore, $ $ R:¿CB R ¿ X º+Y ´)Y X º+Y (14.142) )´ Y R¿ X º+Y )´ Y Rò¿ ± B X + º Y 4 ô ¡ õô ñ ¡ ¡ 4 ñ%ñ (14.143) The theorem is proved. wox\yz|{y,}~|~Á The constrained inequality in Theorem 14.14 is a nonShannon-type inequality. ì i«º ¿ôÉ f R Proof The theorem can be proved by considering the point for as in the proof of Theorem 14.11. The details are left as an exercise. M H The constrained inequality in Theorem 14.14 has the following geometrical interpretation. The constraints in (14.112) correspond to the intersection of which define the boundary of . Then the inequaltwo hyperplanes in ity (14.62) says that a certain region on the boundary of is not in . It can further be proved by computation1 that the constrained inequality in Theorem 14.14 is not implied by the twelve distinct versions of the unconstrained inequality in Theorem 14.7 (i.e., <14.62>) together with the basic inequalities. We have proved in the last section that the non-Shannon-type inequality constrained non-Shannon-type inequalities. (14.62) implies a class of We end this section by proving a similar result for the non-Shannon-type constrained inequality in Theorem 14.14. f R dR R d d Re r R Ç[± wox\yz|{y,}~|~Ú The inequality @ º+µ S A µ R ¿ $ @ º+µ S A µ R µ ¿_ @ º+µ S A µ R µ ¿ @ º+µ A µ µ S ¿ (14.144) @ º+µ ^ A µ ¿ is a non-Shannon-type inequality conditioning on setting both and and any subset of the following 12 Shannon’s information measures to zero: @ +º µ ^A µ µ R¿ ´ @ @ º+µ A µ S µ R¿ ´ @ @ º+µ SA µ ?R µ ¿ ´ @ S º+µ µ ´)µ )´ µ 1 Ying-On º+µ AA µ RS µ +º µ SA µ R? µ º+Rµ ¿ µ µ S ´ º+µ R¿ ´ @ º+µ A µ R µ Sò¿ ´ Sò¿ ´ @ º+µ S A µ R µ ¿ ´ ´)µ ¿ ´ º+µ µ ´)µ µ ´)µ ´)µ R¿ ´ º+µ R S ´)µ R ¿ ´ µ ´)µ ´)µ S¿CB (14.145) Yan, private communication. 
D R A F T September 13, 2001, 6:27pm D R A F T 321 Beyond Shannon-Type Inequalities @ º+µ A µ ¿ @ º+µ A µ µ S ¿ Proof The proof of this theorem is very similar to the proof of Theorem 14.12. and together with the 12 Shannon’s We first note that information measures in (14.145) are exactly the 14 Shannon’s information measures in (14.94). We have already shown in the proof of Theorem 14.12 that (cf. Figure 14.4) lies in exactly 14 hyperplanes defining the boundary which correspond to setting these 14 Shannon’s information measures to of zero. We also have shown that for define an extreme direction of . Denote by the intersection of the two hyperplanes in which correand to zero. Since for any spond to setting satisfies ,+ (14.146) d ì iR𺠿 ì i𺠿 Á H @ ^A ¿ @ GA Sò¿ f R ì ¿ º+µ µ º+µ µ µ i«º M H @ º+µ ^A µ ¿ ×@ º+µ ^A µ µ òS ¿ H ì ì i𺠿 is in . Now for any linear subspace of f R containing i𺠿 such that Ë , we have ì i𺠿«É d R B ì ¿ (14.147) Upon substituting the corresponding values in (14.113) for i«º with the help $ _ of Figure 14.4, we have (14.148) H M H ,H + ì ¿ which is a contradiction because H . Therefore, iðº does not satisfy (14.113). Therefore, ºd R ¿ . (14.149) Ë N - i É f R u i satisfies (14.113) / B dR This means that (14.113) is a non-Shannon-type inequality under the constraint . From the above, we see that can be taken to be the intersection of and any subset of the 12 hyperplanes which correspond to setting the 12 Shannon’s information measures in (14.145) to zero. Hence, (14.113) is a non-Shannontype inequality conditioning on , , and any subset of the 12 Shannon’s information measures in (14.145) being equal to zero. In other words, the constrained inequality in Theorem 14.14 in fact induces a class constrained non-Shannon-type inequalities. The theorem is proved. of @ º+µ ^ A µ ¿ @ º+µ ^ A µ µ Sò¿ r £ 14.4 APPLICATIONS As we have mentioned in Chapter 12, information inequalities are the laws of information theory. In this section, we give several applications of the nonShannon-type inequalities we have proved in this chapter in probability theory and information theory. An application of the unconstrained inequality proved in Section 14.2 in group theory will be discussed in Chapter 16. 021\²O}23,º?y ~|~?ê For the constrained inequality in Theorem 14.14, if we further impose the constraints @ º+µ S A µ R µ ¿ × @ +º µ S A µ R µ ¿ H,+ D R A F T September 13, 2001, 6:27pm (14.150) D R A F T 322 A FIRST COURSE IN INFORMATION THEORY then the right hand side of (14.113) becomes zero. This implies because @ º+µ S A µ R¿ @ º+µ S A µ R¿ H (14.151) is nonnegative. This means that µ PP µ S µ S P µ R µ µ S P µ R1 µ µ µ µ 4 556 7 55 V µ S P µ RB (14.152) We leave it as an exercise for the reader to show that this implication cannot be deduced from the basic inequalities. 021\²O}23,º?y ~|~÷ If we impose the constraints @ +º µ ^ A µ ¿ ×@ +º µ ^A µ S +)µ R¿ ×@ º+µ S A µ R µ ¿ × @ +º µ S A µ R µ ¿ H,+ (14.153) then the right hand side of (14.62) becomes zero, which implies µ Q PP µ S P µ S P µ This means that @ º+µ S A µ R ¿ H B µ S R¿ 4 556 S P RB º+µ +)µ µ RR µ 7 55 V µ µ µ µ (14.154) (14.155) Note that (14.152) and (14.155) differ only in the second constraint. Again, we leave it as an exercise for the reader to show that this implication cannot be deduced from the basic inequalities. 
021\²O}23,º?y ~|~ø µ )µ )µ S )µ R $ ½ ¶ ; :¿ H,+=< º+µ µ +98 N Consider a fault-tolerant data storage system consisting of random variables such that any three random variables can + + + recover the remaining one, i.e., : +98 $ B g (14.156) We are interested in the set of all entropy functions subject to these constraints, denoted by > , which characterizes the amount of joint information which can possibly be stored in such a data storage system. Let -i É f R ui satisfies (14.156) / d Re B d Re (14.157) Then the set > is equal to the intersection between and , i.e., . Since each constraint in (14.156) is one of the 14 constraints specified in Theorem 14.12, we see that (14.62) is a non-Shannon-type inequality under D R A F T September 13, 2001, 6:27pm D R A F T 323 Beyond Shannon-Type Inequalities ì d R (cf. (14.91)) is a tighter outer bound Rd S 021\²O}23,º?y ~|5þý R Consider four random variables µ +)µ +)µ , and µ such S Ø Q ¿ Ø R º+µ +)µ µ forms a Markov chain. This Markov condition is that µ equivalent to @ º+µ S A µ R µ +)µ ¿ H B (14.158) the constraints in (14.156). Then on > than . It can be proved by invoking the basic inequalities (using ITIP) that @ º+µ S A µ R ¿ $ @ º+µ S A µ R µ ¿\_ @ º+µ S A µ R µ ¿/_ H B Û @ º+µ A µ ¿ _ @ º+µ ^A µ S +)µ R¿\_ º< Ç ¿ @ º+µ A µ S +)µ R¿ + (14.159) $ $ B r H B Û , and this is the best possible. where H Û Now observe that the Markov condition (14.158) is one of the 14 constraints specified in Theorem 14.12. Therefore, (14.62) is a non-Shannon-type inequality under this Markov condition. By replacing and by each other in (14.62), we obtain r@ º+µ S A µ R¿ $ _ @ º+µ S A µ µ @ +º µ ^ A µ ¿\_ R µ ¿/_ @ +º µ µ @ º+µ A µ S +)µ R¿ S A µ R µ ¿CB (14.160) Upon adding (14.62) and (14.160) and dividing by 4, we obtain @ º+µ S A µ R ¿ $ _ H Br Û @ @ º+µ S A µ R µ ¿\_ º+µ A µ S +)µ R\¿ _ H @ º+µ S A µ R µ ¿/_ H B Û @ º+µ A µ ¿ B r Û @ º+µ A µ S +)µ R ¿CB (14.161) Comparing the last two terms in (14.159) and the last two terms in (14.161), we see that (14.161) is a sharper upper bound than (14.159). The Markov chain arises in many communication + situations. As an example, consider a person listening to an audio source. Then being the sound the situation can be modeled by this Markov chain with wave generated at the source, and being the sound waves received at the two ear drums, and being the nerve impulses which eventually arrive at the brain. The inequality (14.161) gives an upper bound on which is tighter than what can be implied by the basic inequalities. There is some resemblance between the constrained inequality (14.161) and the data processing theorem, but there does not seem to be any direct relation between them. µ S Ø º+µ )µ ¿%Ø µ R µ R D R A F T µ µ µ S September 13, 2001, 6:27pm @ º+µ S A µ R ¿ D R A F T 324 A FIRST COURSE IN INFORMATION THEORY PROBLEMS 1. Verify by ITIP that the unconstrained information inequality in Theorem 14.7 is of non-Shannon-type. 2. Verify by ITIP and prove analytically that the constrained information inequality in Theorem 14.14 is of non-Shannon-type. 3. Use ITIP to verify the unconstrained information inequality in Theorem 14.7. Hint: Create two auxiliary random variables as in the proof of Theorem 14.7 and impose appropriate constraints on the random variables. 4. Verify by ITIP that the implications in Examples 14.17 and 14.18 cannot be deduced from the basic inequalities. 5. Can you show that the sets of constraints in Examples 14.17 and 14.18 are in fact different? 6. 
Let µ ¶ + : r <'+ +?+) ,  , and @ be discrete random variables. Ç @ ºÂ A @ ¿ @ º  A @ µ ½ ¿ Ç @ ºÂ A @ µ ¶ ¿ ½ $ @ ¶8A ¿/_ Ç º+µ ÂA+@ ½ +º µ ½ ¿ º+µ +)µ +B+)µ ¿CB Ýr , this inequality reduces to the unconstrained nonHint: When a) Prove that Shannon-type inequality in Theorem 14.7. b) Prove that @ ºÂ A @ $ < ¿Ç r @ ¶ @ A ½¿ ½ º  @ µ Ç ½ 8 ¶ A ¿CB / ¿ _ ¿ º+µ ÂC+@ ½ º+µ º+µ )+ µ + B+)µ ( Zhang and Yeung [223].) XS º+Y )Y R )Y S )Y R¿ @ A S ¿ @ A ?R S ¿ µ µ º+µ µ µ º+µ µ µ H µ ìµ X 7. Let + + + be the joint distribution for random variables , , , and such that , and let be defined in (14.98). a) Show that X è º+Y +)Y ò +)Y S +)ó Y R¿ ó ó ¡ õ ñ > > õ õ õ ¡ ñ ¼ ô ô ô ö ô ô ó ó >)õ ô ¡ öô ñ öô ô ö ô H D R A F T X»º+Y )Y ¿ X º+Y R ¿ M H + + if otherwise September 13, 2001, 6:27pm D R A F T 325 Beyond Shannon-Type Inequalities D Á <. S ò S ¿ S ò ¿ ì X º+Y +)Y +)Y for all Y , Y , and Y . Prove that X º+Y +)Y +)Y æ ¿ ì è By considering Eº X X ÁIH , prove that Sò¿/_ R¿_ S¿_ R¿\_ S8R¿ º+µ Sò¿_ º+µ R¿\_ º+µ £ ¿\_ º+µ S8R¿\_ º+µ S8R¿ Á º+µ º+µ º+µ º+µ º+µ + S8R¿ denotes º+µ +)µ S +)µ R¿ , etc. where º+µ defines a probability distribution for an appropriate b) c) d) Prove that under the constraints in (14.112), the inequality in (14.113) is equivalent to the inequality in c). The inequality in c) is referred to as the Ingleton inequality for entropy in the literature. For the origin of the Ingleton inequality, see Problem 9 in Chapter 16. (Mat [137].) ç`b éè HISTORICAL NOTES In 1986, Pippenger [156] asked whether there exist constraints on the entropy function other than the polymatroidal axioms, which are equivalent to the basic inequalities. He called the constraints on the entropy function the laws of information theory. The problem had been open since then until Zhang and Yeung discovered for four random variables first the constrained nonShannon-type inequality in Theorem 14.14 [222] and then the unconstrained non-Shannon-type inequality in Theorem 14.7 [223]. Yeung and Zhang [221] have subsequently shown that each of the inequalities reported in [222] and [223] implies a class of non-Shannon-type inequalities, and they have applied some of these inequalities in information theory problems. The existence of these inequalities implies that there are laws in information theory beyond those laid down by Shannon [173]. Meanwhile, Mat and Studen [135][138][136] had been studying the structure of conditional independence (which subsumes the implication problem) of random variables. Mat [137] finally settled the problem for four random variables by means of a constrained non-Shannon-type inequality which is a variation of the inequality reported in [222]. The non-Shannon-type inequalities that have been discovered induce outer bounds on the region which are tighter than . Mat and Studen [138] showed that an entropy function in is entropic if it satisfies the Ingleton inequality (see Problem 9 in Chapter 16). This gives an inner bound on . A more explicit proof of this inner bound can be found in Zhang and Yeung [223], where they showed that this bound is not tight. Along a related direction, Hammer et al. [84] have shown that all linear inequalities which always hold for Kolmogorov complexity also always hold for entropy, and vice versa. 
ç]b éè ç]b éè d Re D R A F T cb dR dR ç`b éè September 13, 2001, 6:27pm cb d|Re D R A F T Chapter 15 MULTI-SOURCE NETWORK CODING In Chapter 11, we have discussed the single-source network coding problem in which an information source is multicast in a point-to-point communication network. The maximum rate at which information can be multicast has a simple characterization in terms of the maximum flows in the graph representing the network. In this chapter, we consider the more general multi-source network coding problem in which more than one mutually independent information sources are generated at possibly different nodes, and each of the information sources is multicast to a specific set of nodes. We continue to assume that the point-to-point communication channels in the network are free of error. The achievable information rate region for a multi-source network coding problem, which will be formally defined in Section 15.3, refers to the set of all possible rates at which multiple information sources can be multicast simultaneously on a network. In a single-source network coding problem, we are interested in characterizing the maximum rate at which information can be multicast from the source node to all the sink nodes. In a multi-source network coding problem, we are interested in characterizing the achievable information rate region. Multi-source network coding turns out not to be a simple extension of singlesource network coding. This will become clear after we have discussed two characteristics of multi-source network coding in the next section. Unlike the single-source network coding problem which has an explicit solution, the multi-source network coding problem has not been completely solved. The best characterizations of the achievable information rate region of the latter problem (for acyclic networks) which have been obtained so far make use of the tools we have developed for information inequalities in Chapter 12 to Chapter 14. This will be explained in the subsequent sections. D R A F T September 13, 2001, 6:27pm D R A F T 328 A FIRST COURSE IN INFORMATION THEORY H H X 1 XI 2 J 1 [ X 1] F 2 F [ XI 2] 3 2 G 1 JK b3 4 H H [ X 1 X 2] 15.1 3 G b4 4 (b) (a) Figure 15.1. b2 b1 A network which achieves the max-flow bound. TWO CHARACTERISTICS In this section, we discuss two characteristics of multi-source networking coding which differentiate it from single-source network coding. In the following discussion, the unit of information is the bit. 15.1.1 THE MAX-FLOW BOUNDS The max-flow bound, which fully characterizes the maximum rate at which information can be multicast, plays a central role in single-source network coding. We now revisit this bound in the context of multi-source network coding. Consider the graph in Figure 15.1(a). The capacity of each edge is equal and with rates L and L , to 1. Two independent information sources respectively are generated at node 1. Suppose we want to multicast to nodes 2 and 4 and multicast to nodes 3 and 4. In the figure, an information source in square brackets is one which is to be received at that node. It is easy to see that the values of a max-flow from node 1 to node 2, from node 1 to node 3, and from node 1 to node 4 are respectively 1, 1, and 2. At node 2 and node 3, information is received at rates L and L , respectively. At node 4, information is received at rate L L because and are independent. Applying the max-flow bound at nodes 2, 3, and 4, we have µ µ _ L L and º ¿ L $$ µ µ µ µ < (15.1) (15.2) < _ L $ r + (15.3) respectively. 
We refer to (15.1) to (15.3) as the max-flow bounds. Figure 15.2 is an illustration of all ML +L which satisfy these bounds, where L and L are obviously nonnegative. D R A F T September 13, 2001, 6:27pm D R A F T 329 Multi-Source Network Coding 2 (0,2) (1,1) (2,0) 1 Figure 15.2. The max-flow bounds for the network in Figure 15.1. µ à º ¿ µ à à We now show that the rate pair <'+N< is achievable. Let be a bit generated and be a bit generated by . In the scheme in Figure 15.1(b), by is received at node 2, is received at node 3, and both and are received at node 4. Thus the multicast requirements are satisfied, and the information which satisfy the rate pair <'+N< is achievable. This implies that all ML +L max-flow bounds are achievable because they are all inferior to <'+N< (see Figure. 15.2). In this sense, we say that the max-flow bounds are achievable. Suppose we now want to multicast to nodes 2, 3, and 4 and multicast to node 4 as illustrated in Figure 15.3. Applying the max-flow bound at either node 2 or node 3 gives (15.4) L <'+ º ¿ à à º ¿ µ $ à º _ L $ rB ¿ µ and applying the max-flow bound at node 4 gives L (15.5) X1X2 1 [X1] 2 3 O [X1] 4 [X1X2] Figure 15.3. A network which does not achieve the max-flow bounds. D R A F T September 13, 2001, 6:27pm D R A F T 330 A FIRST COURSE IN INFORMATION THEORY º ¿ º µ à à ¿ which satisfy these bounds. Figure 15.4 is an illustration of all ML +L We now show that the information rate pair <'+N< is not achievable. Suppose we need to send a bit generated by to nodes 2, 3, and 4 and send a bit generated by to node 4. Since has to be recovered at node 2, the bit sent to node 2 must be an invertible transformation of . This implies that the bit sent to node 2 cannot not depend on . Similarly, the bit sent to node 3 also cannot depend on . Therefore, it is impossible for node 4 to recover because both the bits received at nodes 2 and 3 do not depend on . Thus the information rate pair <'+N< is not achievable, which implies that the max-flow bounds (15.4) and (15.5) are not achievable. From the last example, we see that the max-flow bounds do not always fully characterize the achievable information rate region. Nevertheless, the maxflow bounds always give an outer bound on the achievable information rate region. à à µ à º 15.1.2 à à à ¿ SUPERPOSITION CODING Consider a point-to-point communication system represented by the graph in Figure 15.5(a), where node 1 is the transmitting point and node 2 is the receiving point. The capacity of channel (1,2) is equal to 1. Let two independent and , whose rates are L and L , respectively, be information sources and are received at node 2. generated at node 1. It is required that both We first consider coding the information sources and individually. We will refer to such a coding method as superposition coding. To do so, we decompose the graph in Figure 15.5(a) into the two graphs in Figures 15.5(b) and (c). In Figure 15.5(b), is generated at node 1 and received at node 2. In Figure 15.5(c), is generated at node 1 and received at node 2. Let P RQ be µ µ µ µ µ µ µ µ £ 2 (0,2) (1,1) (2,0) 1 Figure 15.4. The max-flow bounds for the network in Figure 15.3. D R A F T September 13, 2001, 6:27pm D R A F T 331 Multi-Source Network Coding T T T X1X2 X1 XV 2 1 1 1 W W (a) S W (b) (c) S 2 2 T T U T U [X1X2] 2 T U [X1] [X2] Figure 15.5. A network for which superposition coding is optimal. µ the bit rate on channel (1,2) for transmitting £P ÊZ Y L and Y P\`[^]_ ] which imply L ] Q,X r <'+ . 
Then (15.6) + (15.7) ` Y P ` [ _ P `[^]_ L ` L ]ba ] a ]dc (15.8) ` Pe`[ _ \P `[^]_ < ]ba ] f c g (15.9) On the other hand, from the rate constraint for edge (1,2), we have From (15.8) and (15.9), we obtain L ` a L ] f < (15.10) c µ µ Next, we consider coding the information sources ` and jointly. To are independent, the total rate at which] information this end, since ` and ] L . From the rate constraint for channel is received at node 2 is equal to L ` a ] (1,2), we again obtain (15.10). Therefore, we conclude that when ` and are independent, superposition coding is always optimal. The achievable ] information rate region is illustrated in Figure 15.6. The above example is a degenerate example of multi-source networking coding because there are only two nodes in the network. Nevertheless, one may hope that superposition coding is optimal for all multi-source networking coding problems, so that any multi-source network coding problem can be decomposed into single-source network coding problems which can be solved individually. Unfortunately, this is not the case, as we now explain. Consider the graph in Figure 15.7. The capacities of all the edges are equal µ µ µ µ D R A F T September 13, 2001, 6:27pm D R A F T 332 A FIRST COURSE IN INFORMATION THEORY 2 (0,1) (1,0) 1 Figure 15.6. The achievable information rate region for the network in Figure 15.5(a). to 1. We want to multicast h ` to nodes 2, 5, 6, and 7, and multicast h to ] nodes 5, 6, and 7. We first consider coding h ` and h individually. Let ] i M`^[ jk _ Yml (15.11) be the bit rate on edge no'p9qer for the transmission of h , where qtsvu\p%w\px and Xyszo'p%u . Then the rate constraints on edges no'p%u{r , no'p%w{j r , and no'pxer imply ` i ` [ _ i ` R[ ]_ ` a ] i `9[ ]| } _ i `9[R]} _ ` i `[ ~ _ a i `[R]~ _ a o (15.12) o (15.13) f f f o (15.14) c X1X2 1 [X1] 2 5 [X1X2] 3 4 6 7 [X1X2] [X1X2] Figure 15.7. A network for which superposition coding is not optimal. D R A F T September 13, 2001, 6:27pm D R A F T 333 Multi-Source Network Coding X X2 1 1 r 12(1) r 13(1) 5 [X1] r 12(2) 3 6 [X1] [X1] [X1] 2 1 r 14(1) 3 5 6 [X2] [X2] [X 2] r 14(2) 2 4 7 r 13(2) (a) 4 7 (b) Figure 15.8. The individual multicast requirements for the network in Figure 15.7. Together with (15.11), we see that l f i M` R[ jRk _ f o (15.15) for all qs.u\p%w\px and Xyszo'p%u . We now decompose the graph in Figure 15.7 into the two graphs in Figures 15.8(a) and (b) which show the individual multicast requirements for h ` and h , respectively. In these two graphs, the edges (1,2), (1,3), and (1,4) are labeled] by the bit rates for transmitting the corresponding information source. We now show by contradiction that the information rate pair (1,1) is not achievable by coding h ` and h individually. Assume that the contrary is true, ] is achievable. Referring to Figure 15.8(a), i.e., the information rate pair (1,1) ` since node 2 receives h at rate 1, ` i `[ _ Y o c ] (15.16) At node 7, since h ` is received via nodes 3 and 4, ` i 9`[ } _ a ` i `[ ~ _ Y o c (15.17) Similar considerations at nodes 5 and 6 in Figure 15.8(b) give and i ` [R]_ i `9R[ ]} _ Y o ] a (15.18) i `[^]_ i `^[ ]~ _ Y o c ]ba (15.19) ` i `[ _ so c ] (15.20) Now (15.16) and (15.15) implies D R A F T September 13, 2001, 6:27pm D R A F T 334 A FIRST COURSE IN INFORMATION THEORY 1 b2 b1 2 b1 b1 b2 5 Figure 15.9. b1+b2 b1+b2 3 4 b1+b2 b2 6 7 A coding scheme for the network in Figure 15.7. 
With (15.12) and (15.15), this implies i `[^]_ s l c ] (15.21) From (15.21), (15.18), and (15.15), we have i `9[^]} _ so c (15.22) c (15.23) With (15.13) and (15.15), this implies ` i `9[ } _ s l It then follows from (15.23), (15.17), and (15.15) that ` i `[ ~ _ so (15.24) c Together with (15.14) and (15.15), this implies i `^[ ]~ _ s l c (15.25) However, upon adding (15.21) and (15.25), we obtain i `[^]_ i `^[ ]~ _ s l p ]ba (15.26) which is a contradiction to (15.19). Thus we have shown that the information rate pair (1,1) cannot be achieved by coding h ` and h individually. ] However, the information rate pair (1,1) can actually be achieved by coding h ` and h jointly. Figure 15.9 shows such a scheme, where ` is a bit gen] erated by h ` and is a bit generated by h . Note that at node 4, the bits ` ] ] and , which are generated by two different information sources, are jointly ] coded. Hence, we conclude that superposition coding is not necessarily optimal in multi-source network coding. In other words, in certain problems, optimality can be achieved only by coding the information sources jointly. D R A F T September 13, 2001, 6:27pm D R A F T 335 Multi-Source Network Coding X X X 1 2 3 [X1] [X1] Level 1 [X1] [X1X2] [X1X2] [X1X2] [X1X2X3] Level 2 Level 3 Figure 15.10. A 3-level diversity coding system. 15.2 EXAMPLES OF APPLICATION Multi-source network coding is a very rich model which encompasses many communication situations arising from fault-tolerant network communication, disk array, satellite communication, etc. In this section, we discuss some applications of the model. 15.2.1 Let h ` p h MULTILEVEL DIVERSITY CODING p?ph be information sources in decreasing order of im] portance. These information sources are encoded into pieces of information. There are a number of users, each of them having access to a certain subset of the information pieces. Each user belongs to a level between 1 and , where a Level user can decode h ` ph pBpht . This model, called multilevel diversity coding, finds applications ] in fault-tolerant network communication, disk array, and distributed data retrieval. Figure 15.10 shows a graph which represents a 3-level diversity coding system. The graph consists of three layers of nodes. The top layer consists of a node at which information sources h ` , h , and h } are generated. These ] information sources are encoded into three pieces, each of which is stored in a distinct node in the middle layer. The nodes in the bottom layer represent the users, each of them belonging to one of the three levels. Each of the three Level 1 users has access to a distinct node in the middle layer and decodes h ` . Each of the three Level 2 users has access to a distinct set of two nodes in the middle layer and decodes h ` and h . There is only one Level 3 user, who has ] access to all the three nodes in the middle layer and decodes h ` , h , and h } . ] D R A F T September 13, 2001, 6:27pm D R A F T 336 A FIRST COURSE IN INFORMATION THEORY The model represented by the graph in Figure 15.10 is called symmetrical 3level diversity coding because the model is unchanged by permuting the nodes in the middle layer. By degenerating information sources h ` and h } , the model is reduced to the diversity coding model discussed in Section 11.2. 
In the following, we describe two applications of symmetrical multilevel diversity coding: Fault-Tolerant Network Communication In a computer network, a data packet can be lost due to buffer overflow, false routing, breakdown of communication links, etc. Suppose the packet carries messages, h ` ph pNph , ] in decreasing order of importance. For improved reliability, the packet is encoded into sub-packets, each of which is sent over a different channel. If any sub-packets are received, then the messages h ` ph pBpht can be re] covered. Disk Array Consider a disk array which consists of disks. The data to be stored in the disk array are segmented into pieces, h ` ph pNph , in ] decreasing order of importance. Then h ` ph pBph are encoded into ] pieces, each of which is stored on a separate disk. When any out of the disks are functioning, the data h ` ph pNpht can be recovered. ] 15.2.2 SATELLITE COMMUNICATION NETWORK In a satellite communication network, a user is at any time covered by one or more satellites. A user can be a transmitter, a receiver, or both. Through the satellite network, each information source generated at a transmitter is multicast to a certain set of receivers. A transmitter can transmit to all the satellites within the line of sight, while a receiver can receive from all the satellites within the line of sight. Neighboring satellites may also communicate with each other. Figure 15.11 is an illustration of a satellite communication network. The satellite communication network in Figure 15.11 can be represented by the graph in Figure 15.12 which consists of three layers of nodes. The top layer represents the transmitters, the middle layer represents the satellites, and the bottom layer represents the receivers. If a satellite is within the line-of-sight of a transmitter, then the corresponding pair of nodes are connected by a directed edge. Likewise, if a receiver is within the line-of-sight of a satellite, then the corresponding pair nodes are connected by a directed edge. The edges between two nodes in the middle layer represent the communication links between the two neighboring satellites corresponding to the two nodes. Each information source is multicast to a specified set of receiving nodes as shown. D R A F T September 13, 2001, 6:27pm D R A F T 337 Multi-Source Network Coding transmitter ¡ receiver Figure 15.11. A satellite communication network. 15.3 A NETWORK CODE FOR ACYCLIC NETWORKS Consider a network represented by an acyclic directed graph ¢£s¤n¥¦p§r , where ¨¥©¨\ª¬« . Without loss of generality, assume that ¥vs®¯o'p%u\p?p?¨¥°¨²±{p (15.27) and the nodes are indexed such that if n´³p9q\r¶µ·§ , then ³¸ª¹q . Such an indexing of the nodes is possible by Theorem 11.5. Let º2» k be the rate constraint on channel n´³p9qer , and let ¼ sz½ º¾» kÀ¿ n´³p9q\r¶µ·§ÂÁ (15.28) be the rate constraints for graph ¢ . A set of multicast requirements on ¢ consists of the following elements: Ë X 1 XÉ2 Å X Å XÆ Å XÇ XÈ3 4 5 6 XÃ7 X Ä8 Å Å Å Å Å Å Å transmitters satellites receivers Å Å Å Å [X 1 XÄ8 ] [XÉ2 X Ã7 ] Å [X Ê4 ] [XÆ5 XÇ6 ] [X È3 XÆ5 ] [X È3 X Ã7 X Ä8 ] Figure 15.12. A graph representing a satellite communication network. D R A F T September 13, 2001, 6:27pm D R A F T 338 A FIRST COURSE IN INFORMATION THEORY X1 Ð ÍÐ X2X3 1 Ì [X 1 ] 2 3 Î Ï 4 5 Ð Ð Ì [X 1 X 3 ] 6 Ì [XÍ 2 ] Figure 15.13. The acyclic graph for Example 15.1. 
1) Ñ , the set of information sources; 2) Ò ¿ Ñ®Ó ¥ , which specifies the node at which an information source is generated; 3) Ô ¿ ¥ÕÓ uÖ , which specifies the set of information sources received at each node. The information source × is generated at node Òn"×dr . For all ³¸µØ¥ , let Ù n´³Úr¸sv?×ÛµØÑ ¿ Òtn"×'rÜs;³± (15.29) be the set of information sources generated at node ³ . The set of information sources Ôn´³Úr is received at node ³ . Ý2Þàßâá2ãåäeæèç¯éyê"ç The nodes in the graph in Figure 15.13 are indexed such that if n´³p9qer©µ¬§ , then ³ªëq . The set of multicast requirements as illustrated is specified by (15.30) Ñìsv¯o'p%u\p%we±{p and ÒtnoBrÜso'pÒn"u{r¸s.u\pÒn"w{r¸s.u\p (15.31) Ô noBrs¬Ôn"u{rÜs¬Ô©n´xer¸s.íîp Ôn"w{r¸s®¯od±{pÔn"ï{rs®¯o'p%we±{pÔn"ð{r¶sv?ue±{ñ (15.32) (15.33) We consider a block code with block length ò which is similar to the ó -code we defined in Section 11.5.1 for acyclic single-source networks. An information source × is represented by a random variable hô which takes values in the D R A F T September 13, 2001, 6:27pm D R A F T 339 Multi-Source Network Coding õ set ôs®¯o'p%u\p?peö(ud÷dø9ùú'± (15.34) according to the uniform distribution. The rate of information source × is ûNô . It is assumed that hôNp%×µbÑ are mutually independently. Unlike the ó -code we defined for acyclic single-source networks which is zero-error, the code we now define allows an arbitrarily small probability of ü error. Let (15.35) svN³µb¥ ¿ Ôn´³ÚrÀs¬ ý þe± be the set of nodes which receive at least one information source. An n´ò¦pBn´ÿ» k¿ n´³%p9qerCµ·§ rpBn(ûNô ¿ ×µ ÑÜrr (15.36) code on graph ¢ (with respect to a set of multicast requirements) is defined by 1) for all n´³p9q\r¶µ·§ , an encoding function » k ¿ õ ô l pNo'p'pÿ »R» ±¾Ó l pNo'p?pÿ» k ± (15.37) ô» »» » Ù (if we adopt the convention that k both n´³Úr and N³ ¿ n´³(p³ÚrAµØ§ ± are empty, l k » is an arbitrary constant taken from Np o'pdpÿ» ± ); ü 2) for all ³Üµ , a decoding function õ l » ¿ pNo 'pÿ » R» ±¾Ó ôñ » » » ô » (15.38) ü In the above, » k is the encoding function for edge n´³%p9qer . For ³¾µ , » is the k decoding function for node ³ . In a coding session, » is applied before » k if ³ ª¬³ , and » k is applied before » k if q°ªgq . This defines the order in which the encoding functions are applied. Since ³ yªg³ if n´³ (p³ÚrAµ·§ , a node does not ü the necessary information is received on the input channels. encode until all For all ³Üµ , define # »ys "!y$ # »n´hô ¿ × µbÑÜr2sz ý n´hô ¿ × µ Ôn´³Úrr%±p (15.39) where »n´hô ¿ ×Àµ ÑÜr denotes the value of » as a function of n´hô ¿ ×µbѸr . » is the probability that the set of information sources Ôn´³Úr is decoded incorrectly at node ³ . Throughout this chapter, all the logarithms are in the base 2 unless otherwise specified. % '&)(*+(-,+(/.0*.ç¯éyê1 æ rate tuple D R A F T For a graph ¢ 2 ¼ with rate constraints 3 szn ¦ô ¿ × µbÑÜrp September 13, 2001, 6:27pm , an information (15.40) D R A F T 254 340 A FIRST COURSE IN INFORMATION THEORY l where (componentwise), is asymptotically achievable if for any there exists for sufficiently large ò an n´ò¦pBn´ÿ» k¿ n´³p9qerAµ·§ rpBn(ûNô ¿ ×µ ÑÜrr 687 l , (15.41) :9<;'=?>A@ÿ» kCB º¾» k"D 6 (15.42) k for all n´³%p9qer µ § , where ò 9<; =?>A@¸ÿ» is the average bit rate of the code on channel n´³%p9qer , 4 ûNô 3 ô:EF6 ¦ (15.43) for all × µbÑ , and »B 6 (15.44) ü code on ¢ such that ò for all ³µ . For brevity, an asymptotically achievable information rate tuple will be referred to as an achievable information rate tuple. 
% '&)(*+(-,+(/.0*.ç¯éyêG 2 æ The achievable information rate region, denoted by is the set of all achievable information rate tuples . 2 2 4 2 H , is B 2B 2 Remark It follows from the definition of the achievability of an information l is achievable for all . rate vector that if is achievable, then Also, if , o are achievable, then it can be proved by techniques similar to those in the proof of Theorem 9.12 that 2 is also achievable, i.e., H 2 IM=L8JK N s (15.45) is closed. The details are omitted here. H In the rest of the chapter, we will prove inner and outer bounds on the achievable information rate region . 15.4 AN INNER BOUND H H In this section, we first state an inner bound ©» on in terms of a set of ÷ auxiliary random variables. Subsequently, we will cast this inner bound in the framework of information inequalities developed in Chapter 12. % '&)(*+(-,+(/.0*.ç¯éyêO æ H P 2 Q Let be the set of all information rate tuples such that there exist auxiliary random variables ô p%× µÑ and » k pBn´³p9qer·µ®§ which satisfy the following conditions: R UQ R P R P n ô ¿ × µbÑÜr UQ VQ Ù n â» k ¨^n ô ¿ Â × µ ´n ³ÚrrpBn » » ¿ ´n ³ p³Úr µ·§rr n ô ¿ × µ· Ô n´³Úr¨ » R» ¿ ´n ³ p³Úr¶µ §r º »k n ôr D R A F T P R R TS s ô l Ö s s 7 R l UQ P n ôr P W7 3 n »k r ¦ô ñ September 13, 2001, 6:27pm (15.46) (15.47) (15.48) (15.49) (15.50) D R A F T 341 Multi-Source Network Coding ü Note that for ³Âµ ý , the constraint (15.48) in the above definition is degenerate because Ôn´³Úr is empty. % '&)(*+(-,+(/.0*.ç¯éyê#é Let H» s XY>Zîn/H r , the convex closure of H . []\àæ^.0y_ æåáç¯éyê` H» H ÷ . ÷ba In the definition of H , P ô is an auxiliary random variable associated with Q » k is an auxiliary random variable associated with information source × , and î the codeword sent on channel n´³p9q\r . The interpretations of (15.46) to (15.50) are as follows. The equality in (15.46) says that P ôNp%×µ Ñ are mutually independent, which corresponds to the assumption that all the information sources are mutually independent. The equality in (15.47) says that Q » k is a function of Ù nP ô ¿ ×µ n´³Úrr and nUQ » » ¿ n´³´ p³Úr µ|§ r , which corresponds to the requirement æ that the codeword received from channel n´³%p9qer depends only on the information sources generated at node ³ and the codewords received from the input channels at node ³ . The equality in (15.48) says that n ô ¿ ×µØÔn´³Úrr is a function of n » » ¿ n´³ (p³ÚrAµb§ r , which corresponds to the requirement that the information sources to be received at node ³ can be decoded from the codewords received from the input channels at node ³ . The inequality in (15.49) says that the entropy of î» k is strictly less than º2» k , the rate constraint for channel n´³p9q\r . The inequality in (15.50) says that the entropy of ô is strictly greater than ô , the rate of information source × . The proof of Theorem 15.6 is very tedious and will be postponed to Section 15.7. We note that the same inner bound has been proved in [187] for variable length zero-error network codes. In the rest of the section, we cast the region ©» in the framework of infor÷ mation inequalities we developed in Chapter 12. The notation we will use is a slight modification of the notation in Chapter 12. 
Let P UQ Q P 3 H c dP feYQâ» k¿ n´³p9q\r¶µ·§t± sv ô ¿ × µbÑ (15.51) be a collection of discrete random variables whose joint distribution is unspecified, and let (15.52) Òtn rÜs.u ¯?þe±{ñ c Agih Then c ¨Òn r¨¯s.ukj g jEZo'ñ (15.53) c Let l be the ¨Òtn c r¨ -dimensional Euclidean space with the coordinates lag beled by m'nÜpToëµ|Òn r . We will refer to l c g as the entropy space for the set of random variables . A vector p sznqm^n ¿ o®µbÒn c rr (15.54) D R A F T September 13, 2001, 6:27pm D R A F T 342 A FIRST COURSE IN INFORMATION THEORY R r ¿ r µ c is said to be entropic if there exists a joint distribution for n that c for all o®µbÒtn so tm'n r . We then define uv sv p µwl ¿ p is entropic±{ñ g g n´h ¿ hÕµ Àr¦s r such (15.55) o Tp ox To simplify notation in the sequel, for any nonempty define m n j ^n stm nyn EFm n p c µ®Òtn c (15.56) r , we (15.57) where we use juxtaposition to denote the union of two sets. In using the above notation, we do not distinguish elements and singletons of , i.e., for a random variable Dµ , is the same as . To describe ©» in terms of , we observe that the constraints (15.46) to ÷ correspond to the following constraints in , (15.50) in the definition of respectively: r c ^m z H % u v ym { zy| H g l g m } ù ôT Ö s m~ j } ù ô» - q ~ »- »? q s m } ù ô» - j ~ »- »? q s º »k 7 m'} ù 7 TS m^} ù ô l Ö 3 l m ~ ôNñ H . (15.58) (15.59) (15.60) (15.61) (15.62) '&)(*+(-,+(/.0*.ç¯éyê- Let H be the set of all information rate tuples 2 such p u v which satisfies (15.58) to (15.62) for all × µmÑ and that there exists µ g n´³p9qerAµ·§ . Although the original definition of H as given in Definition 15.4 is more intuitive, the region so defined appears to be totally different from case to case. On the other hand, the alternative definition of H above enables to u # istheanregion be described on u the same footing foru all cases. Moreover, if explicit v , upon replacing v by u # in the above definition g inner bound on H , g an explicit innerg boundg on H » ÷ for all cases. Weof will we immediately obtain see further advantage of this alternative definition when we discuss an explicit outer bound on H in Section 15.6. Then we have the following alternative definition of æ 15.5 AN OUTER BOUND uv uv In this section, we prove an outer bound terms of , the closure of . g D R A F T g H on H . This outer bound is in September 13, 2001, 6:27pm D R A F T % 2 Multi-Source Network Coding 343 '&)(*+(-,+(/.0*.ç¯éyê Letv H be the set of all information rate tuples such p u which satisfies the following constraints for all × µbÑ that there exists µ g and n´³%p9qer¶µØ§ : m } ù ôT Ö s ôTS m^} ù (15.63) (15.64) m~ j } ù ô » - q ~ »- »? q s ll Ö m } ù ô» - j ~ »- »? q s4 (15.65) k º¾» 4 m ~ (15.66) (15.67) m}ù 3 ôñ The definition of H is the same as the alternative definition of Hb (Definition 15.7) except that u v is replaced by u v . 1. g g æ 2. The inequalities in (15.61) and (15.62) are strict, while the inequalities in (15.66) and (15.67) are nonstrict. H and H , it is clear that H a H ñ (15.68) v u (Theorem 14.5) implies the conIt is easy to verify that the convexity of g vexity of H . Also, it is readily seen that H is closed. Then upon taking convex closure in (15.68), we see that H» ÷ s XY>Zîn/H r a XY>Z n/H rÜs H ñ (15.69) From the definition of However, it is not apparent that the two regions coincide in general. This will be further discussed in the next section. []\àæ^.0_yæåáç¯éyê H H . a 2 Proof Let be an achievable information rate tuple and ò l large integer. 
Then for any 67 , there exists an n´ò¦pBn´ÿ» k¿ n´³%p9qerCµ·§ rpBn(ûNô ¿ ×µ ÑÜrr code on ¢ such that for all n´³p9q\r¶µ·§ , for all × µbÑ , and D R A F T ò 9<; =?>A@ÿ» kiB º¾» kD 6 4 ûNô 3 ô:EF6 ¦ »B 6 September 13, 2001, 6:27pm be a sufficiently (15.70) (15.71) (15.72) (15.73) D R A F T 344 A FIRST COURSE IN INFORMATION THEORY ü for all ³¸µ . We consider such a code for a fixed and a sufficiently large ò . For ´n ³p9q\rAµ § , let â» k s » k n´hô ¿ ×ÀµbѸr (15.74) # Q 6 Q be the codeword sent on channel n´³p9q\r . Since î» k is a function of the information sources generated at node ³ and the codewords received on the input channels at node ³ , R For ³Üµ ü UQ UQ Ù l n î» k ¨^n´hô ¿ ×µ n´³ÚrrpBn » » ¿ n´³ p³Úr µØ§ rrÜs ñ R (15.75) , by Fano’s inequality (Corollary 2.48), we have VQ D =>A@ R T D R D 6 n´hô ¿ × µ·Ôn´³Úr¨ » R» ¿ n´³ p³ÚrAµ·§ r õ o » ¨ ôd¨ ô R» s o » n´hô ¿ ×µ·Ô©n´³Úrr o n´hô ¿ × µ·Ôn´³Úrrp B B (15.76) (15.77) (15.78) õ where (15.77) follows because hô distributes uniformly on ô and hô , × µ|Ñ are mutually independent, and (15.78) follows from (15.73). Then R n´hô ¿ Â × µØÔn´³Úrr s nn´hô ¿ × µ·Ôn´³Úrr Bn » » ¿ n´³ p ³Úr¶µ·§ rr n´hô ¿ × µ·Ôn´³Úr¨ » R» ¿ ´n ³ pÚ³ r¶µ·§ r (15.79) B (15.80) (15.81) B B B R D e UQ VQ nn´hô ¿R × µ·Ôn´³ÚrrBe nUQ » » ¿ n´³ p³Úr¶µ·§ rr RD o D 6 n´h ô ¿ × µ·Ôn´³Úrr R nUQ »²» ¿ n´³ p³ÚrAµ § r D o D R 6 n´hô ¿ × µ·Ôn´³rr SR»/ »? =?>A@Üÿ »^» D o D 6 n´hô ¿ × µ·Ôn´³Úrr R SR»/ »? ò¸n(º »^» D 6 r D o D 6 n´hô ¿ × µ·Ôn´³Úrrp (15.82) (15.83) where a) follows from (15.78); b) follows from Theorem 2.43; c) follows from (15.71). D R A F T September 13, 2001, 6:27pm D R A F T 345 Multi-Source Network Coding R Rearranging the terms in (15.83), we obtain n´h ô ¿ ×µ·Ô©n´³Úrr B oòEF6 S n(º »^» D 6r D ò o (15.84) »/ » q ª udò S n(º »^» D 6r (15.85) »/ » q for sufficiently small 6 and sufficiently large ò . Substituting (15.85) into (15.78), we have R n´hô ¿ ×µ Ô©n´³Úr¨VQ » » ¿ n´³ p³Úr¶µ §r o D uA6S n(º »^» D 6 rU (15.86) ª ò ò »/ » q (15.87) s ò:? »n´ò¦p6 rp where o D l ? »n´ò¦p6 r¦s (15.88) uA6S n(º »^» D 6 rUë Ó ò »/ » 6 Ó 4 l . From (15.71), for all n´³4 p9q\R rAµØ§ , as òØÓ « and then ò¸n(º¾» kD uA6 r =?>A@ n´ÿ» kfD oBrÜs =?>AÀ@ ¨VîQ » k ¨ nUâQ » k rñ (15.89) For all ×ÂRµ Ñ , from (15.72), õ 4 4 (15.90) n´hô r¦s =?>AÀ @ ¨ ô?¨s =?>A@ ö(u ÷?ø ùú òàûNô ò¸n3 ôEF6 rñ Thus for this code, we have R R ¿ n´hô × µbÑÜr s S n´hô r (15.91) R ôT Ö nUî Q » k ¨^R n´hô ¿ × µ Ù n´³ÚrrpBnUQ » » ¿ n´³ p³Úr µ·§rr s l (15.92) ¿ ¿ B R (15.93) n´h ô ×µØÔn´³Úr¨VQ » » n´³ p³Úr µ·§rr 4 ò: » n´ò¦p6 r " k D k R ò¸n(º¾» uA6 r 4 nUî Q»r (15.94) n´hô r ò¸n¦ 3 ô:EF6 rñ (15.95) We note the one-to-one correspondence between (15.91) to (15.95) andp (15.63) uv to (15.67). By letting P ôCsDhô for all ×µ Ñ , we see that there exists µ g such that (15.96) m } ù ôT Ö s ôTS m } ù Ö D R A F T September 13, 2001, 6:27pm D R A F T 346 A FIRST COURSE IN INFORMATION THEORY m ~ j } ù ô» - q ~ » » 0 s l (15.97) B (15.98) m } ù ô» - j ~ » » 0 4 ò: » n´ò¦p6r k D ò¸n(º¾» uA6 r 4 m ~ (15.99) (15.100) m'} ù ò¸n3 ôEF6rñ u v is a convex cone. Therefore, if p µ u v , then ò 9<; p µ By 14.5, vu .Theorem g through gp p Dividing (15.96) (15.100) by ò and replacing ò 9<; by , we see v p u g that there exists µ g such that (15.101) m } ù ô Ö s ôS m } ù m ~ j } ù ôTAR »? ~ R »- » s l Ö (15.102) B (15.103) m } ù ôTR »? j ~ R »- » 4 » n´ò¦p6 r " k D º2» uA6 4 m ~ (15.104) (15.105) m } ù 3 ô EFN6 ñ 6 Ó l to conclude that there exists p µ u vg which We then let òØÓ « and then satisfies (15.63) to (15.67). 
Hence, H a H , and the theorem is proved. 15.6 THE LP BOUND AND ITS TIGHTNESS u v (the In Section 15.4, we stated the inner bound H » on H in terms of ÷ proof is deferred to Section 15.7), and in Section 15.5, we proved theg outer v u bound H Uu d on H u v in terms of far, there exists no full characterization v or . Therefore, these g . Sobounds cannot be evaluated explicitly. In on either g this section,g we give a geometrical interpretation of these bounds which leads to an explicit outer bound on H called the LP bound (LP for linear programming). p c Let o be a subset of Òn r . For a vector µl , let g p n snqmz ¿ r µsÀo rñ (15.106) For a subset of l g , let¡ (15.107) !T>¢ n n¶ r¸sv p n ¿ p µsC ± z pD r µ£o . For a subset of be the projection of the set on the coordinates m^Ü l g , define¤ p ¿ l B p ª p for some p µs ± (15.108) n ArÜsv µwl g ¤¥ and p p p p l n Ar¦sv µwl (15.109) g ¿ B B for some µs A±{ñ D R A F T September 13, 2001, 6:27pm D R A F T 347 4p l ¤ ¤ p A vector is in n Ar if and only if it is strictly inferior top some vector in , and is in n¶ r if and only if it is inferior to some vector in . Define the following subsets of l g: ¦ s § p µsl ¿ m } ôT s¨S m^} (15.110) ; g ù Ö ôT Ö ù© ¦$ª s « p µwl ¿ m ~ } ôTAR » - ~ » » s l for all n´³p9qerCµ·§¬ g j ù ¦ } s « p µwl ¿ m } ôT » ~ R »- » 0 s l for all ³µ ü ¬ (15.111) (15.112) g j ù ¦ ~ s « p µwl ¿ º¾» k 7m ~ for all n´³p9q\r¶µ·§®¾¬ ñ (15.113) g ¦ is a hyperplane in l . Each of the sets ¦ ª and ¦ } is the intersecThe set ¦ ; g tion of a collection of hyperplanes in l . The set ~ is the intersection of a g . Then from the alternative definition of collection of open half-spaces in l ¤ ¡ we see that g H (Definition 15.7), H s n !>¢ } ù ôT Ö n u vg°¯ ¦ ; ¯ ¦ ª ¯ ¦ } ¯ ¦ ~ rrñ (15.114) ¤ ¡ and H » ÷ s XY>Z n n !T>¢ } ù ô Ö n u vg ¯ ¦ ; ¯ ¦ ª ¯ ¦ } ¯ ¦ ~ rrrñ (15.115) Similarly, we see ¤that¡ ¥ H s n !>¢ } ù ô Ö n u vg±¯ ¦ ; ¯ ¦ ª ¯ ¦ } ¯ ¦ ~ rrñ (15.116) u v n ¦ ¦ ª ¦ } r is dense in u v n ¦ ¦ ª ¦ } r , It can be shown that if ; ; ¯ i.e., u v n ¦ ¦ ª g ¯¦ } r¦s ¯ u v ¯ n ¦ ¦ ª ¦ } rp g²¯ ¯ (15.117) g ¯ ;¯ ¯ g ¯ ;¯ ¯ then H s H a XY>Z n/H r¦s³©H » ÷ p (15.118) which implies (15.119) H » ÷ s³H ñ ¦ ¦ ¦ ª } Note that n subset of l . However, while ; ¯ ¯ r is a closed g uv ¦ uv ¦ (15.120) a ¯ F g ¯ g ¦ for any closed subset of l true that g u ,v it is ¦ not inu v general ¦ (15.121) g ¯ s g ¯ ñ Multi-Source Network Coding D R A F T September 13, 2001, 6:27pm D R A F T 348 A FIRST COURSE IN INFORMATION THEORY u v ¦# ¦ # sv¯ p µ u0}v u v ¦# ¯ As a counterexample, as we have shown in the proof of Theorem 14.2, } is a proper subset of } , where ª s;³ ª } s;³ } ;´ ´ ;´ k k and we have used ³ to denote m Eµm k . ´ j To facilitate our discussion, we further define ³ n n stm^nsE°m n n^ ´ j ¿ ³ s l ±¸p (15.122) (15.123) and ³ n n n stm n n EFm n n^n^ (15.124) j p j ´uj p c for o pTo pTo µ Òtn r . Let be the set of µ¶l such that satisfies all c g g the basic inequalities involving some or all of the random variables in , i.e., c for all oÛpTo pTo µØÒn r , 4 m'n 4 ll (15.125) (15.126) m n j n 4 l ³ n n 4 (15.127) ´ l (15.128) ³n n n ñ ´ j v u u . Then upon replacing u v by u We know from Section 13.2 that g a g g g in the definition of H , we immediately obtain an outer bound on H . This is called the LP bound, which is denoted by H ·)¸ . 
In other words, H·¹¸ is obtained by replacing ¤ ¥ ¡ u vg by u g on the right hand side of (15.116), i.e., (15.129) H ·)¸ s n !T>¢ } ù ô Ö n u g ¯ ¦ ; ¯ ¦$ª ¯ ¦ } ¯ ¦ ~ rrñ Since all the constraints defining H·¹¸ are linear, H·¹¸ can be evaluated explicitly. We now show that the outer bound H ·)¸ is tight for the special case of single-source network coding. Without loss of generality, let Ñìs®¯od±{ñ ü Then the information source represented by h for all ³¸µ , 3 ; ºH (15.130) ; is generated at node ÒtnoBr , and Ôn´³ÚrÜsDÑìsv¯od±{ñ H Ud (15.131) Consider any µ . Following the proof that is an outer bound l on in the last section (Theorem 15.9), we see that for any and a sufficiently large ò , there exist random variables î» k pBn´³p9qer¾µ|§ which satisfy H D R A F T Q September 13, 2001, 6:27pm 6F7 D R A F T 349 Multi-Source Network Coding the constraints in (15.92) to (15.94). Upon specializing for a single source, these constraints become R R for q ¿ nÒtnoBrp9qerCµ·§ (15.132) UQ» ; q k ¨ h ; r s ll k R nUQî» ¨VQ »² » n´³ p³Úr µØ§ r s for n´³p9q\ü r¶µ·§ (15.133) ¿ B (15.134) n´h ¨VQ » n´³p¼ r¶µØ§ r 4 ò: ¼µ ; ò¸n(º¾» kD uA6r R nUQîn´ò¦» k pr 6r for for n´³p9qerCµ § , (15.135) l 6 Ó l and ò Ó « . Moreover, upon specializing where n´ò¦p6 r·Ó as Ø (15.95) for a single source, we R also have 4 n´h r (15.136) ; ò¸n3 ; EF6rñ ü Now fix a sink node ¼ µ and consider any cut ½ between node ÒtnoBr and ¼ µ¾ý ½ , and let node ¼ , i.e., ÒnoBr µ¾½ and ¾ (15.137) §x ¿ sven´³p9qerCµ·§ ¿ ³¸µ£½ and q·µ£ý ½ë± be the set of edges across the cut ½ . Then S »/ À ò¸n(º¾» kfD uA6 r k R 4 SR»/ qÀ nUîQ » k r (15.138) k R 4 k¿ (15.139) RR nUîQ » k¿ n´³p9qerCµ·§ ¿ r s4 nUî Q » n´³p9qerCµ·§ ¿ or both ³p9q µ²ý ® ½ r (15.140) ¿ R nUîQ » n´³%p¼ r µØ§ r (15.141) ¿ D ¿ s nUî Q » n´³%p¼ r µØ§ ¨ h ; r nUîQ » n´³p¼ r µ·§be h ; r (15.142) R nUîQ » ¿ n´³pR¼ r µ·§be h ; r s (15.143) ¿ (15.144) s 4 R n´h ; r0E n´h ; ¨VQ » n´³p¼ r¶µ §r (15.145) Á4 n´h ; r0ìE ò: n´ò¦p6 r (15.146) ò¸n3 EF6 r0ì E ò: n´ò¦p6 r ; s ò¸n3 EF6E n´ò¦p6 rrp (15.147) ; n where a) follows from (15.135); Q t½ b) follows because » k , where both ³p9q;µ ý , are functions of § by virtue of (15.133) and the acyclicity of the graph ¢ ; ¿ D R A F T September 13, 2001, 6:27pm Q »/ k pBn´³p9qer©µ D R A F T 350 A FIRST COURSE IN INFORMATION THEORY c) follows from (15.134); d) follows from (15.136). 6 Dividing by ò and letting Ó l and òØÓ 4 « in (15.147), we obtain (15.148) S/ yÀ º¾» k 3 ; ñ ü In other words, if 3 inequality for ev; µÃH , then 3 ; satisfies the above ery cut ½ between node ÒtnoBr and any sink node ¼ µ , which is precisely the max-flow bound for single-source network coding. Since we know from Chapter 11 that this bound is both necessary and sufficient, we conclude that H Ud » k is tight for single-source network coding. However, we note that the max-flow bound was discussed in Chapter 11 in the more general context that the graph ¢ may be cyclic. In fact, it has been proved in [220] that is tight for all other special cases of multi-source network coding for which the achievable information region is known. In addition to single-source network coding, these also include enthe models described in [92], [215], [166], [219], and [220]. Since compasses all Shannon-type information inequalities and the converse proofs of the achievable information rate region for all these special cases do not infor all these cases volve non-Shannon-type inequalities, the tightness of is expected. 
H·¹¸ H ·)¸ H ·)¸ 15.7 ACHIEVABILITY OF Ä » ÷ H In this section, we prove the achievability of » , namely Theorem 15.6. ÷ To facilitate our discussion, we first introduce a few functions regarding the in Definition 15.4. auxiliary random variables in the definition of Ù From (15.47), since â» k is a function of n ô ¿ × µ n´³Úrr and n » » ¿ n´³ p³rCµ § rr , we write Q Qî» k s Å » k nnP ô P UQ Hb UQ ¿ × µ Ù n´³ÚrrpBn » » ¿ n´³ p³Úr µØ§ rrñ (15.149) Since the graph ¢ is acyclic, we see inductively that all the auxiliary random variables » k pBn´³p9qerCµ·§ are functions of the auxiliary random variables ô p%× µ Ñ . Thus we also write (15.150) â» k s » k n ô ¿ ×ÀµbѸrñ Q P Q ÆÅ # P Equating (15.149) and (15.150), we obtain Å » k nnP ô UQ ÇÅ # » k nP ô ¿ × For ×·µZÔn´³Úr , since P ô is a function of nUQ »R» ¿ n´³ p³Úrµm§ write P ôs È ôR»? nUQ »R» ¿ n´³ p³r µ·§ rñ ¿ × µ Ù n´³ÚrrpBn » ^» ¿ n´³ p³ÚrAµ·§ rrÜs D R A F T µbÑÜrñ (15.151) r from (15.48), we September 13, 2001, 6:27pm (15.152) D R A F T 351 Multi-Source Network Coding Substituting (15.150) into (15.152), we obtain P ô s³È ô » nYÅ # »» nP ôÉ ¿ × µ ÑÜr ¿ n´³ p³Úr µØ§ rñ (15.153) These relations will be useful in the proof of Theorem 15.6. Before we prove Theorem 15.6, we first prove in the following lemma that strong typicality is preserved when a function is applied to a vector. Êæåá¾á Ë ßDç¯éyê"ç P Let s n´h|r . If Ì sën/Í ; pÍ ª pBpÍ ÷ rµÎ Ï ÷ ÐÑVÒ p then n Ì r¦szn/È pÈ ª pBpÈ rµÎ Ï ÷ } ÑVÒ p ; ÷ B B n/Í »r for o ³ ò . where È»ys Proof Consider Ì µÎ Ï ÷ Ð"ÑVÒ , i.e., SÓÇÔÔ ò o c n/Í0e Ì r:EÕ n/Í rÖÔÔ ª×Nñ ÔÔ ÔÔ P s n´h|r , Since ë Õ n/È r¦s Ó AØS Ù)ÚÜÛÝ âÕ n/yÍ r for all È µsÞ . On the other hand, c n/Èe n Ì rrs Ó S c n/Íe Ì r Ø Ù¹Ú ÜÛY for all È µsÞ . Then S Û ÔÔ ò o c n/Èe n Ì rr:EâÕ n/åÈ rÖÔÔ ÔÔ ÔÔ s S Û ÔÔÔ Ó ØS Ù)ÚÜÛY ß ò o c n/Íe Ì r:EâÕ n/Í rUà ÔÔÔ ÔÔ ÔÔ Ô B S Ó S ÔÔ ò o c n/Íe Ì r:EwâÕ n/Í r ÔÔ Ô Û ØÙ¹ÚÛÝ ÔÔ ÔÔ S Ó ÔÔ ò o c n/Í0e Ì rEâÕ n/yÍ r ÔÔ s Ô ÔÔ ª B× ñ Ô Therefore, n Ì rµwÎ Ï ÷ } ÑVÒ , proving the lemma. D R A F T September 13, 2001, 6:27pm (15.154) (15.155) (15.156) (15.157) (15.158) (15.159) (15.160) (15.161) (15.162) D R A F T 352 A FIRST COURSE IN INFORMATION THEORY 2 Proof of Theorem 15.6 Consider an information rate tuple H 3 szn ô ¿ × µØÑÜr P Q (15.163) in , i.e., there exist auxiliary random variables ô p%×ÛµØÑ and â» k pBn´³p9qerCµ·§ which satisfy (15.46) to (15.50). We first impose the constraint that all the auxiliary random variables have finite alphabets. The purpose of imposing this additional constraint is to enable the use of strong typicality which applies only to random variables with finite alphabets. This constraint will be removed later. l . We now construct an Consider any fixed 67 3 "EF6 n´ò¦pBn´ÿ» k ¿ n´³p9q\r¶µ·§rpBn ô ¿ × µbÑÜrr (15.164) random code on graph ¢ for sufficiently large ò such that ò for all n´³p9q\rAµ § , and ü (15.165) (15.166) × be small positive quantities such that (15.167) × ªB× ñ The actual values of × and × will be specified later. The coding scheme is for all ³¸µ × 9<; =?>A@ÿ» kCB º¾» k"D 6 »B 6 . Let and described in the following steps, where ò is assumed to be a sufficiently large integer. 1. For × µbÑ , by the strong AEP (Theorem 5.2), á ôâÝs ãä ¨ Î Ï ÷ } ÑVÒ ¨ 4 noåEµ× rud÷ ?æå} ù 9$ç ù p (15.168) ù l l where è ô Ó as × Ó . 
By letting × be sufficiently small, by virtue of R (15.50), we have nP ôr0E¶,è ô7Â3 ôNp (15.169) and æ} ù 9$ç ù 7ê'é ôp (15.170) noåEF× rud÷ where õ (15.171) é ô s ö(ud÷ Vë ù 9^ì ú¾sz¨ ô ¨ñ Then it follows from (15.168) and (15.170) that á ô7êé'ôñ (15.172) á Denote the vectors in Î Ï ÷ } ÑVÒ by í ô n"årpNo B B ô . ù D R A F T September 13, 2001, 6:27pm D R A F T Multi-Source Network Coding o ; To ª Î Ï } VÑ Ò Toxî 353 ï/é 9<; á ñð n"×'rp n"×drp?p "n ×'r almost into subsets 2. Arbitrarily partition ÷ ù ù ô or uniformly, such that the size of each subset is either equal to ô ö ô ôú . Define a random mapping /é 9<; á ò ó òô Té ¿ ¯o'p%u\pdp 'ô ±¾Ó á ¯o'p%u\pdp ô ± B ó B é by letting ôBn "r be an index randomly chosen in ô ô. uniform distribution for o o jù (15.173) n"×dr according to the Î Ï ~ ÑVÒ by ô » k n"år , o B B á » k , where á » k së¨ Î Ï ÷ ~ ÑVÒ ¨ñ (15.174) 3. For n´³p9q\r¶µ·§ , denote the vectors in ÷ By the strong AEP, á » k së¨ Î Ï ÷ ~ ÑVÒ ¨ B ud÷ ?æå ~ õ ç p (15.175) l l where èe» k Ó as × Ó . Then by letting × be sufficiently small, by virtue R of (15.49), we have (15.176) nUQ » k r D è » k B º » k D 6 and hence æ ~ -õ ç B u ÷ ö õ ì ñ (15.177) u ÷ Then we can choose an integer ÿ» k such that ud÷ æ ~ õ ç B ÿ» 8k B ud÷ ? ö õ ì ñ (15.178) The upper bound above ensures that ÿ» k satisfies (15.165). 4. Define the encoding function » k ¿ õ ô l pNo'pdpÿ »R» ± Ó l pNo'pdpÿ » k ± (15.179) ôTA» »» » as follows. Let Í ô be the outcome of hô . If ÷ ò Ù nqí ô n ôBn/Í ôrr ¿ ×Àµ n´³ÚrrpBn/ô »R » n # » » n/Í ô ¿ × µbÑÜr ¿ n´³ p³Úr µ·§rrø (15.180) µÎ Ï } ô » - q ~ » »? q ÑVÒ p ù where # »² » n/Í ô ¿ שµ¹ÑÜr denotes the value of »² » as a function of n/Í ô ¿ ×µ Ѹr , then by Lemma 15.10, ÷ Å » k nqí ô n ò ôBn/Í ôrr ¿ × µ Ù n´³ÚrrpBn/ô » » n # »» n/Í ô ¿ × µbÑÜr ¿ n´³ p³r µ·§ rr ø (15.181) D R A F T September 13, 2001, 6:27pm D R A F T 354 A FIRST COURSE IN INFORMATION THEORY Î Ï ~ ÑVÒ (cf. (15.149) for the definition of Å » k ). Then let » k ÷ n ò ô?n/Í ôr ¿ ×Àµ Ù n´³ÚrrpBn # » » n/Í ô ¿ ×ÀµbѸr ¿ n´³ p³Úr¶µ §rø is in ÷ (15.182) ÷ Å » k nqí ô n ò ô n/Í ô rr ¿ ×µ Ù n´³ÚrrpBn/ô » k n # »²» n/Í ô ¿ ×µ ÑÜr ¿ n´³ p³Úr µØ§ rrø (15.183) Ï V Ñ Ò in Î ÷ ~ (which is greater than or equal to 1), i.e., ô » k n # » k n/Í ô ¿ × µbÑÜrr¸s ÷ Å » k nqí ô n ò ô?n/Í ôrr ¿ ×µ Ù n´³ÚrrpBn/ô » k n # »² » n/Í ô ¿ ×µ ÑÜr ¿ n´³ p³Úr µØ§ rrøÀñ be the index of (15.184) » k ÷ n ò ôBn/Í ôr If (15.180) is not satisfied, then let # /Í ø ¿ ×Àµ Ù n´³ÚrrpBn » » n ô ¿ ×ÀµbÑÜr ¿ n´³ p³Úr¶µ·§r s l ñ (15.185) Note that the lower bound on ÿ» k in (15.178) ensures that » k is properly defined because from (15.175), we have 5. For ³¸µ ÿ »k ü 4 æ ~ -õ 4 Ï ÑVÒ á k ç ¨ Î ÷ ~ ¨s » ñ u ÷ (15.186) , define the decoding function as follows. If » ¿ l pNo dpÿ »R» ±¾Ó õ »- »- » q ôT » # » » n/Í ô ¿ × µbÑÜr2s ý l ô (15.187) (15.188) B óMô B é'ô such that ÷ È ô » ô »²» n # »» n/Í ô ¿ × µØÑÜrr ¿ n´³ p³Úr¶µ·§]østí ô n ò ô nó ô rr (15.189) R»? R»? (cf. the definition of È ô in (15.152) and note that È ô is applied to a vector as in Lemma 15.10), then let » ÷ # » » n/Í ô ¿ × µbÑÜr ¿ n´³ p³Úr¶µ § ø sznóMô ¿ ×ÀµØÔn´³Úrrñ (15.190) Note that if an Mó ô as specified above exists, then it must be unique because í ô n ò ô?n"ó rr stý í ô n ò ôBnó rr (15.191) for all n´³ p³Úr µØ§ , and if for all × µ·Ôn´³Úr there exists o D R A F T September 13, 2001, 6:27pm D R A F T 355 Multi-Source Network Coding ó ó/ . For all other cases, let » ÷ # » » n/Í ô ¿ × µbÑÜr ¿ n´³ p³r µ·§]øsënù o'pNo'púû ?pNüo rp j R » j õ a constant vector in ý ôTR » ô . 
if s ý (15.192) ü random code. Specifically, we will We now analyze the performance of this l show that (15.166) is satisfied for all ³Üµ . We claim that for any , if qí ò /Í Î Ï } ù ô Ö ÑVÒ p ×þ7 n ô n ôBn ôrr ¿ × µbÑÜr µ 2÷ (15.193) then the following are true: i) (15.180) holds for all ³Üµb¥ ; # » k n/Í ô ¿ ×µbѸr¾s ý l for all n´³p9q\r¶µ·§ ; iii) ô » k n # » k n/Í ô ¿ ×µ ÑÜrrÜsÆÅ # » k nqí ô n ò ôBn/Í ô rr ii) ¿ × µbÑÜr for all n´³p9q\r¶µ·§ . We will prove the claim by induction on ³ , the index of a node. We first consider node 1, which is the first node to encode. It is evident that N³ Ø µ ¥ ¿ n´³ pNoBrAµ·§ ± Ù noBrr is a function of n ô ¿ ×mµ is empty. Since n ô ¿ ×mµ P P (15.194) ÑÜr , (15.180) follows from (15.193) via an application of Lemma 15.10. This proves i), and ii) follows by the definition of the encoding function » k . For all no'p9qerµg§ , consider ô ; k n # ; k n/Í ô ¿ × µbÑÜrr s s Å ; k nqí ô n ò ô?n/Í ôrr Å # ; k nqí ô n ò ô n/Í ô rr ¿ ×Àµ Ù noBrr (15.195) ¿ ×ÀµbѸrp (15.196) P where the first and the second equality follow from (15.184) and (15.151), respectively. The latter can be seen by replacing the random variable ô by the vector ô n ô?n ôrr in (15.151). This proves iii). Now assume that the claim is ³ ¬o for some u ³ ¨¥©¨ , and we will prove that the true for all node q claim is true for node ³ . For all n´³ p³r µ·§ , í ò /Í B :E B B ô »R» n # »» n/Í ô ¿ × µbÑÜrrÜsÇÅ # »²» nqí ô n ò ôBn/Í ôrr ¿ × µbÑÜr (15.197) Ù by iii) of the induction hypothesis. Since nnP ô ¿ × µ n´³rrpBnUQ » » ¿ n´³ p³ÚrAµ·§ rr ¿ is a function of nP ô × µDÑÜr (cf. (15.150)), (15.180) follows from (15.193) via an application of Lemma 15.10, proving i). Again, ii) follows immediately D R A F T September 13, 2001, 6:27pm D R A F T 356 A FIRST COURSE IN INFORMATION THEORY from i). For all n´³p9q\r¶µ·§ , it follows from (15.184) and (15.151) that ô » k n # » k n/Í ô ÷ ¿ ×µ ÑÜrr s Å » k nqí ô n ò ôBn/Í ôrr ¿ × µ Ù n´³ÚrrpBn/ô »²» n # »²» n/Í ô ¿ × µbÑÜr ¿ n´³ p³Úr¶µ·§ rrø (15.198) ò k ¿ (15.199) s Å # » nqí ô n ôBn/Í ôrr × µbÑÜrp where the last equality is obtained by replacing the random variable P ô by the vector í ô n ò ô?n/Í ôrr and the random variable Q »» by the vector ô »²» n # »²» n/Í ô ¿ ×µ ÑÜr in (15.151). This proves iii). Thus the claim is proved. If (15.193) holds, for שµ¹Ô©n´³Úr , it follows from iii) of the above claim and È ô » n/ô »» n # »²» n/Í ô ¿ × µbÑÜrr ¿ n´³ p³r¶µ § r (15.200) s È ô » ò nYÅ # »» nqí ôÉ n ò ôÉ n/Í ô rr ¿ × µbѸr ¿ n´³ p³Úr µ·§r s í ô n ô?n/Í ôrrp (15.201) where the last equality is obtained by replacing P ôÉ by í ôÉ n ò ôÉ n/Í ô rr in (15.153). Then by the definition of the decoding function » , we have » ÷ # » » n/Í ô ¿ ×µØÑÜr ¿ n´³ p³Úr µ·§ ø szn/Í ô ¿ × µ Ôn´³Úrrñ (15.202) Hence, Í ô p%×Ûµ Ôn´³Úr can be decoded correctly. Therefore, it suffices to consider the probability that (15.193) does not hold, since this is theá only case for a decoding error to occur. For × µgÑ , consider any o B ô B ô , where ô µso n"×'r for some o B ó ô B é ô . Then jù N! ò ôBn´hôrÜò s.eô ± (15.203) s "! ò ô n´h ô rÜs ô ph ô s ó ô ± s "! ôdn´hôrÜs ¯ô?¨ hô s Mó ô±d! Nhô¶s ´ó ô ± (15.204) (15.205) s ¨ o n"×dr¨V9<;é$ô 9<; ñ jù á á Since ¨ o n"×'r¨ is equal to either ï/é ô 9<; ôTð or ö/é ô 9<; ôú , we have ùj 4 á 4 á (15.206) ¨ o n"×'r¨ ï/é ô9<; ô ð é ô 9<; ô ZE o'ñ jù (15.153) that Therefore, ! 
ò ôBn´hô r¦s.¯ô± B né ô9<; á ôo EgoBré'ô s á ô µE o é ô ñ ë á æ} ùR , where Since é ô"ÿ u ÷ ù and ô"ÿ u ÷ nP ôr7 3 ô D R A F T September 13, 2001, 6:27pm (15.207) (15.208) D R A F T Multi-Source Network Coding 357 á é from (15.50), dô is negligible compared with ô when ò is large. Then by the l strong AEP, there exists ô n r such that è q× f7 ?æå} ù 9$ç ù Ò - B á ôEµé'ôAª á ô B u ÷ æ} ù õ ç ù Ò p noåEF× ru ÷ l l where èô nq×dMrÜÓ as ×d Ó . Therefore, ! ò ôBn´hô r¦s.¯ô± B noåE°× r 9<; u 9å÷ æ} ù 9$ç ù Ò ñ (15.209) (15.210) Note that this lower bound does not depend on the value of eô . It then follows that ! ò ô?n´hôrÜs.¯ô ¿ ×Àµ ѱ æ} Ò B noåE°× r9 j Ö j u¹9å÷ ù 9$ç ù ô ?æå} ù 9$ç ù Ò - s noåE°× r 9 j Ö j u 9å÷ ù 9 ÷ ?n æ}ù æ ôT } ù 9 ù Ò ç -ù Ò r s noåE°× r9 j Ö j u å s noåE°× r 9 j Ö j u 9å÷ ù Ö 9$ç p (15.211) (15.212) (15.213) (15.214) where the last equality follows from (15.46) with which tends to 0 as !Öí ô n ò ô n´h ×d Ó l è nq× r¦s S ô è ô nq× r (15.215) . In other words, for every n ô ¿ × µbÑÜrµ ô rrÜs ô ¿ × µbѱ ý ô Î Ï ÷ } ù ÑVÒ , B no)E]× r9 j Ö j u¹9å÷ æå} ù ôT Ö 9$ç Ò ñ (15.216) denote ò i.i.d. copies of the random variable P ô . For every ô µ ÏÎ ÷ } Let Ñù VÒ , byô the strong AEP, 4 "! ôs ô± u 9å÷ Ï æå} ù -õ ù Ò Ñ p (15.217) l l where ôBnq×M rÜÓ as × Ó . Then for every n ô ¿ ×µbÑÜr µ ý ô Î Ï ÷ } ÑVÒ , ù 4 Ï Ò Ñ å æ } õ (15.218) ! ô¶s ô ¿ × µbѱ ô u¹å9 ÷ ù ù 9 ÷ ?n æå} ù æå ô } ù - õ õ Ò ù ù Ò r (15.219) s u å (15.220) s u¹9å÷ p ù Ö where D R A F T q× S ô åôBnq× r¦Ó ân r¦s l September 13, 2001, 6:27pm (15.221) D R A F T 358 as A FIRST COURSE IN INFORMATION THEORY ×d Ó l £ý ô Î Ï ÷ } ù ÑVÒ , from (15.216) and (15.220), we have "! Öí ô n ò ô n´h ô rrÜs ô ¿ ×ÀÒ µb -õ Ñ ± Ò æ} ôT -õ Ò - s noåEF× r 9 j Ö j u ÷ ç u 9å÷ (15.222) ù Ö Ò Ò õ B noåEF× r9 j Ö j ud÷ ç Ò N! ô s ô ¿ ×µbÑ ± (15.223) ¿ s noåEF× r 9 j Ö j u ÷ "! ô s ô ×Àµ ѱ{p (15.224) . For every n ô ¿ ×µ ÑÜr µ where q× è nq× r D ânq× rñ n r¦s (15.225) (15.226) "! Öí ô n ò ô n´h ô rrÜs ô ¿ ×ÀµbѱÀs l for all n ô ¿ ×Ûµ Ѹr µ ý ý ô Î Ï ÷ } ÑVÒ , (15.224) is in fact true for all n ô ¿ × µ|ÑÜr¶µ ù nÞ ô ÷ ¿ ×µ ÑÜr . By Theorem 5.3, (15.227) !Nen ô ¿ × µbÑÜr2µý Î Ï ÷ } ù ô Ö ÑVÒ ± B u¹9å÷ Ò p l l l where nq×'r7 and ¶nq'× r¸Ó as × Ó . Then by summing over all n ô ¿ ×µ ÑÜr¾µ® ý Î Ï ÷ } ôT ÑVÒ in (15.224), we have ù Ö !enqí ô n ò ô?n´hôrr ¿ ×µbÑ¸Ò r2µý Î Ï ÷ } ù ôT Ö ÑVÒ ± B noåE°× r9 j Ö j ud÷ ! en ô ¿ ×µ ÑÜr2µý Î Ï ÷ } ù ôT Ö ÑVÒ ± (15.228) B noåE°× r 9 j Ö j u 9å÷ Ò 9 Ò - ñ (15.229) Since × ª× (cf. (15.167)), we can let ×d be sufficiently small so that l (15.230) ¶nqd× rE nq× rf7 Since ü and the upper bound in (15.229) tends to 0 as ò sufficiently large, for all ³¸µ , » B !« nqí ô n ò ô?n´hôrr Ó « . Thus, when ò is wÎ Ï ÷ } ù ôT Ö ÑVÒ ¬ B 6ñ ¿ × µØÑÜr¾µ ý (15.231) Hence, we have constructed a desired random code with the extraneous constraint that the all the auxiliary random variables have finite alphabets. We now show that when some of the auxiliary random variables do not have finite supports1 , they can be approximated by ones with finite supports while all the independence and functional dependence constraints required 1 These random variables nevertheless are assumed to have finite entropies. D R A F T September 13, 2001, 6:27pm D R A F T Multi-Source Network Coding P 359 continue to hold. Suppose some of the auxiliary random variables ôp%× µmÑ and » k pBn´³p9q\r¾µ|§ do not have finite supports. Without loss of generality, assume that ô take values in the set of positive integers. 
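Before going through the formal truncation argument below, the effect of truncating a random variable with countably infinite support can be seen numerically: restrict the distribution to its first m mass points, renormalize, and the entropy of the truncated variable approaches that of the original. The following is a minimal Python sketch of this behavior; the geometric distribution, the horizon N, and all variable names are our own choices for illustration and are not part of the construction in the text.

```python
from math import log2

def entropy(p):
    """Entropy in bits of a list of probabilities (zero masses contribute zero)."""
    return -sum(x * log2(x) for x in p if x > 0)

# A distribution on the positive integers with finite entropy; a geometric
# distribution with parameter q is used here purely for illustration.
q = 0.3
N = 10000                                  # numerical horizon standing in for infinity
p = [q * (1 - q) ** (k - 1) for k in range(1, N + 1)]
h_full = entropy(p)                        # approximately H(X); the tail beyond N is negligible

for m in (2, 5, 10, 20, 50, 100):
    head = p[:m]
    total = sum(head)
    trunc = [x / total for x in head]      # distribution of X'(m): X conditioned on X <= m
    print(f"m = {m:4d}   H(X'(m)) = {entropy(trunc):.6f}   H(X) = {h_full:.6f}")
```

The printed values approach H(X), which is what Appendix 15.A establishes in general.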
For all × µ Ñ , define a random variable ô nØr which takes values in Q P P sv¯o'p%u\pdp ± (15.232) !dP ô n Ør¦s. ± s ""!d! dP P ôCô µ s. ± ± (15.233) R R for all °µ , i.e., Pþô n Ør is a truncation of P ô up to . It is intuitively clear that nPþô n ØrrÓ nP ô r as Ó « . The proof is given in Appendix 15.A. Let Pþô n Ørp%×µbÑ be mutually independent so that they satisfy (15.46) with all the random variables in (15.46) primed. Now construct random variables Qx» k n Ør inductively by letting Q » k n Ør¦s³Å » k nnP ô n Ør ¿ × µ Ù n´³ÚrrpBnUQ » » n Ør ¿ n´³ p³Úr¶µ·§ rr (15.234) for all n´³p9q\rbµë§ (cf. (15.149)). Then P ô n Ørp%×èµ Ñ and Q » k n ØrpBn´³%p9qer µ § satisfy (15.47) and (15.48) with all the random variables in (15.47) and R(15.48) primed.R Using the exactly the same argument for P ô n Ør , we see that nUQ » k n ØrrÜÓ nUQ » k r as Ó « . Q » k pBn´³p9qerCµ·§ satisfy If P ôp%×µbÑ and â l R (15.49) and (15.50), then there exists 7 such that (15.235) º¾» k 7 nUâ Q »k r D R and nP ôr:E 7¦3 ôñ (15.236) Then for sufficiently large R, we have R k k D º2» 7 nUâ Q » r 7 nUQ » k n Ørr (15.237) R R and nP ô n Ørrf7 nP ôr0E 7 3 ôñ (15.238) In other words, P ô n Ørp%×gµ Ñ and Q » k n ØrpBn´³p9q\r µz§ , whose supports are finite, satisfy (15.49) and (15.50) with all the random variables in (15.49) and (15.50) primed. 2 Therefore, we have proved that all information rate tuples in H are achievable. By means of a time-sharing argument 2 similar 2 ª to the one we used in the proof of Theorem 9.12, we see that if ; and are achievable, then for any rational number between 0 and 1, 2 2 D 2 ª s n å o E r (15.239) ; such that D R A F T September 13, 2001, 6:27pm D R A F T 360 A FIRST COURSE IN INFORMATION THEORY 2 4 is also achievable. As the remark immediately after Definition 15.3 asserts that , o are achievable, then if 2 2 IM=L8JK N 2 is also achievable, we see that every in XY>Zîn/H r words, H» ÷ s XY>îZ n/H r a ·H p s (15.240) is achievable. In other (15.241) and the theorem is proved. APPENDIX 15.A: APPROXIMATION OF RANDOM VARIABLES WITH INFINITE ALPHABETS In this appendix, we prove that "!#!%$& ! as '$)( , where we assume that ù ù !*+( . For every ,.- , define the binary random variable ù 1 / 3 "!0 ù%2 if if - ù%4 (15.A.1) . Consider ù S !50 6 S 6 ù !*+( , 0 ! Ú / ùGX / ù ù Ú e "!#!ZY\[] ù^ / 0.?A ù <L=H> ù <L=;> ù ù 0@?ACBMDNF 0.?ARBMDNF <=H> <L=;> U<L=H> 7KJ 0@?ACBEDGF ù 0@?AS$' 0@?ALBEDNF <L=;> ù (15.A.2) 0@?APO 0@?AS$ (15.A.3) !TO 3 (15.A.4) / "!#! / 3 "!L0_-`! "!0a-bAKYc d!0 A X /X / 3 L < ; = > d!0 AKY\[] "!#! ^ / / 3 0 < =;> "!#! d!f0a-`AgY\ "!0 ! bX / / 3 e L < T = > d!0 AKY\[] "!#!TO ^ <=;> / / 3 , "!#!S$ since "!j0k-`Al$m- . This implies ] [ 0 As h$i( because S 9H: I <=H> S as ?V$W( . Now consider ù Ú 9H: I 6 9H7 : ù <=;> 7KJ As Q$'( , Since Ú 689;7 : ù ù D R A F T ù ù <L=H> / ùb^ "!#! (15.A.6) ù ù [] (15.A.5) 2 / "!#!TO September 13, 2001, 6:27pm ù (15.A.7) ^ / "!#!n$ 3 (15.A.8) D R A F T APPENDIX 15.A: Approximation of Random Variables with Infinite Alphabets 361 In (15.A.7), we further consider ù / S 6 S 6 Ú <=;> 7KJ S ZoK0.?APBMDNF Ú S 7KJ 6 3 (15.A.10) Ab! ZoK0@?ARBEDGF Zog0.?AN! <=;> Ú / ZoK0@?AqBEDGF U<=H> 7KJ r 9H: I 3 "!0 3 Zog0.?AN! <=;> / ARBEDGF <L=;> 3 d!f0 (15.A.12) APO <=;> , the summation above tends to 0 by (15.A.4). Since / 3 3 3 ARBEDGF d!f0 AS$ . 
Therefore, <L=;> <=;> Zo and we see from (15.A.7) that (15.A.11) A <L=H> ZoK0@?ARBEDGF / 7KJLs "!0 Y t$u( / "!0 Zog0@?A <L=;> <L=> <L=> As (15.A.9) <L=;> Yp 9H: I 0 ù 0@?A / 3 A <=;> d!f0 <=;> Ú 9H: I 6 A 0.?ACBEDGF U<=;> / 6lBEDGF 7KJ "!0 0 3 d!0 ù 9H: I 0 ! <=;> 9H: I 0 / 3 d!L0 X / X 3 d!f0 / ! 3 "!0 < =H> zo y d!#!f${ o ! / 3 Av$ 3 , <L=;> 3xw AS$ "!d0 (15.A.13) as Q$h( . PROBLEMS 1. Consider the following network. | ~ ~ | X1X}2 2 2 1 ~ 2 1 [X1X2] 2 [X1X2] 1 2 1 1 1 | | [X1X2] [X1] a) Let n be the rate of information source the max-flow bounds. [X1] . Determine and illustrate b) Are the max-flow bounds achievable? D R A F T September 13, 2001, 6:27pm D R A F T 362 A FIRST COURSE IN INFORMATION THEORY c) Is superposition coding always optimal? 2. Repeat Problem 1 for the following network in which the capacities of all the edges are equal to 1. X 1 X2 [X1X2] [X1X2] [X1] [X1] 3. Consider a disk array with 3 disks. Let , \ , and \ be 3 mutually independent pieces of information to be retrieved from the disk array, and let n , , and be the data to be stored separately in the 3 disks. It is required that can be retrieved from , bCb , \ can be retrieved from NL , dQ\ , and can be retrieved from N N . R R a) Prove that for ShbCb , #z R b) Prove that for R ck ` GK c) Prove that R R GK b GN R , R G \xg R % PNNZ \xz R d) Prove that for ShbCb . R R e) Prove that #g R V k GN R Gg R \xg R f) Prove that Z¢¤£% ¥¢¤£¦ ¢§ D R A F T \xN R ¡ ;N \xN ;N \xN R UKQ R \K September 13, 2001, 6:27pm \xN D R A F T APPENDIX 15.A: Approximation of Random Variables with Infinite Alphabets 363 where ShbCb and for "Qb¡c R % Ng R R K©¬ . R g) Prove that if ©ck if ©ck K© K¨©ª« R R R g R V R R Gg R R \xg \xN Parts d) to g) give constraints on % G , , and in terms of , , and . It was shown in Roche et al. [166] that these constraints are the tightest possible. 4. Generalize the setup in Problem 3 to ® ¯ r R R disks and show that r ¯ V ¢® ± °% ¤°% ± ² Hint: Use the inequalities in Problem 13 in Chapter 2 to prove that for ³ µ´RP¶P¶P¶`®t¬¢ , r ¯ R R r #· ¸g® ¤°% ½ ² ¹ ± °% R r ¾g¿MÀ ¾%À ± ¯ ¼ ¾ ° ® º ¹; » \P¶P¶P¶ ³ Á ¹ ¹;» by induction on ³ , where  is a subset of ÃÄbCP¶P¶P¶`®Å . 5. Determine the regions ÆvÈÇ , Æ+ÉTÊxË , and ÆvÌCÍ for the network in Figure 15.13. 6. Show that if there exists an ³ ¸jxÎ ÐÏÑ`¡CÓÒÕÔNx×Ö Ï (15.A.14) Òaz ¹ code which satisfies (15.71) and (15.73), then there always exists an ¸jxÎ ÐÏÑ`¡CÓÒÕÔNx×ÖRØ Ï ³ (15.A.15) Òaz ¹ code which satisfies (15.71) and (15.73), where Ö use a random coding argument. D R A F T ¹ Ø for all ³ ¢Ö Òa . Hint: ¹ September 13, 2001, 6:27pm D R A F T 364 A FIRST COURSE IN INFORMATION THEORY HISTORICAL NOTES Multilevel diversity coding was introduced by Yeung [215], where it was shown that superposition coding is not always optimal. Roche et al. [166] showed that superposition coding is optimal for symmetrical three-level diversity coding. This result was extended to ® & levels by Yeung and Zhang [219] with a painstaking proof. Hau [92] studied all the one hundred configurations of a three-encoder diversity coding systems and found that superposition is optimal for eighty-six configurations. Yeung and Zhang [220] introduced the distributed source coding model discussed in Section 15.2.2 which subsumes multilevel diversity coding. The regions Ù%ÇÚ and ÙgÇ previously introduced by Yeung [216] for studying information inequalities enabled them to obtain inner and outer bounds on the coding rate region for a variety of networks. 
Distributed source coding is equivalent to multi-source network coding in a two-tier acyclic network. Recently, the results in [220] have been generalized to an arbitrary acyclic network by Song et al. [187], on which the discussion in this chapter is based. Koetter and Médard [112] have developed an algebraic approach to multi-source network coding.

Chapter 16

ENTROPY AND GROUPS

The group is the first major mathematical structure in abstract algebra, while entropy is the most basic measure of information. Group theory and information theory are two seemingly unrelated subjects which turn out to be intimately related to each other. This chapter explains this intriguing relation between these two fundamental subjects. Those readers who have no knowledge in group theory may skip this introduction and go directly to the next section.

Let X_1 and X_2 be any two random variables. Then

H(X_1) + H(X_2) ≥ H(X_1, X_2),    (16.1)

which is equivalent to the basic inequality

I(X_1; X_2) ≥ 0.    (16.2)

Let G be any finite group and G_1 and G_2 be subgroups of G. We will show in Section 16.4 that

|G| |G_1 ∩ G_2| ≥ |G_1| |G_2|,    (16.3)

where |G| denotes the order of G and G_1 ∩ G_2 denotes the intersection of G_1 and G_2 (G_1 ∩ G_2 is also a subgroup of G, see Proposition 16.13). By rearranging the terms, the above inequality can be written as

log (|G| / |G_1|) + log (|G| / |G_2|) ≥ log (|G| / |G_1 ∩ G_2|).    (16.4)

By comparing (16.1) and (16.4), one can easily identify the one-to-one correspondence between these two inequalities, namely that H(X_i) corresponds to log(|G| / |G_i|), i = 1, 2, and H(X_1, X_2) corresponds to log(|G| / |G_1 ∩ G_2|). While (16.1) is true for any pair of random variables X_1 and X_2, (16.4) is true for any finite group G and subgroups G_1 and G_2.

Recall from Chapter 12 that the region Γ*_n characterizes all information inequalities (involving n random variables). In particular, we have shown in Section 14.1 that the region Γ̄*_n is sufficient for characterizing all unconstrained information inequalities, i.e., by knowing Γ̄*_n, one can determine whether any unconstrained information inequality always holds. The main purpose of this chapter is to obtain a characterization of Γ̄*_n in terms of finite groups. An important consequence of this result is a one-to-one correspondence between unconstrained information inequalities and group inequalities. Specifically, for every unconstrained information inequality, there is a corresponding group inequality, and vice versa. A special case of this correspondence has been given in (16.1) and (16.4). By means of this result, unconstrained information inequalities can be proved by techniques in group theory, and a certain form of inequalities in group theory can be proved by techniques in information theory. In particular, the unconstrained non-Shannon-type inequality in Theorem 14.7 corresponds to the group inequality

|G_1 ∩ G_3|^3 |G_1 ∩ G_4|^3 |G_3 ∩ G_4|^3 |G_2 ∩ G_3| |G_2 ∩ G_4|
    ≤ |G_1| |G_3|^2 |G_4|^2 |G_1 ∩ G_2| |G_1 ∩ G_3 ∩ G_4|^4 |G_2 ∩ G_3 ∩ G_4|,    (16.5)

where G_i are subgroups of a finite group G, i = 1, 2, 3, 4. The meaning of this inequality and its implications in group theory are yet to be understood.

16.1 GROUP PRELIMINARIES

In this section, we present the definition and some basic properties of a group which are essential for subsequent discussions.

Definition 16.1 A group is a set of objects G together with a binary operation on the elements of G, denoted by "∘" unless otherwise specified, which satisfy the following four axioms:

1.
Closure For every ÷ , ø in ß , ÷öÓø is also in ß . 2. Associativity For every ÷ , ø , ù in ß , ÷"ö"#øjöÓùzé×÷öÓøPgöVù . 3. Existence of Identity There exists an element ú in ß such that ÷¦öúdÁúzö÷ Á÷ for every ÷ in ß . 4. Existence of Inverse For every that ÷"öãøãÁøjöÓ÷Áú . ûdüò¥ýÑò%þxí¤ðgí×ò¥îµóÄôKõ¤ÿ Proof Let ú and identity element, ú Ø ÷ ß , there exists an element ø in such ß For any group ß , the identity element is unique. be both identity elements in a group ú Ø ö D R A F T in ß ú¦ÁúZ September 13, 2001, 6:27pm . Since ú is an (16.6) D R A F T 367 Entropy and Groups and since ú Ø is also an identity element, (16.7) ú Ø öãúdÁú Ø It follows by equating the right hand sides of (16.6) and (16.7) that which implies the uniqueness of the identity element of a group. ûdüò¥ýÑò%þxí¤ðgí×ò¥îµóÄôKõ ú ú Ø , For every element ÷ in a group ß , its inverse is unique. Proof Let ø and ø Ø be inverses of an element ÷ , so that ÷"öãøãÁøjöÓ÷Áú (16.8) ÷"öãøNØfÁøNØ]öÓ÷ÁúZ (16.9) and Then ø øzöÓú øzöd×÷öãø Ø #øzöÓ÷LKöãøNØ úVö øNØ ø Ø (16.10) (16.11) (16.12) (16.13) (16.14) where (16.11) and (16.13) follow from (16.9) and (16.8), respectively, and (16.12) is by associativity. Therefore, the inverse of ÷ is unique. Thus the inverse of a group element denoted by ÷ . ÷ is a function of ÷ , and it will be ê\ëìCíïîgíñðgíò%îÁóÄôKõ The number of elements of a group ß is called the order of ß , denoted by àßc . If àßc% , ß is called a finite group, otherwise it is called an infinite group. There is an unlimited supply of examples of groups. Some familiar examples are: the integers under addition, the rationals excluding zero under multiplication, and the set of real-valued ½ matrices under addition, where addition and multiplication refer to the usual addition and multiplication for real numbers and matrices. In each of these examples, the operation (addition or multiplication) plays the role of the binary operation “ ö ” in Definition 16.2. All the above are examples of infinite groups. In this chapter, however, we are concerned with finite groups. In the following, we discuss two examples of finite groups in details. D R A F T September 13, 2001, 6:27pm D R A F T 368 A FIRST COURSE IN INFORMATION THEORY ¦ý ò ëóÄôKõ %í¤ðgí×ò¥î òÿ The trivial group consists of only the identity element. The simplest nontrivial group is the group of modulo 2 addition. The order of this group is 2, and the elements are Ãx´RÅ . The binary operation, denoted by “ ,” is defined by following table: + 0 1 0 0 1 1 1 0 The four axioms of a group simply say that certain constraints must hold in the above table. We now check that all these axioms are satisfied. First, the closure axiom requires that all the entries in the table are elements in the group, which is easily seen to be the case. Second, it is required that associativity holds. To this end, it can be checked in the above table that for all ÷ , ø , and ù , ÷"Á#øz ùzé×÷"øPg ù (16.15) For example, ´¦ÁHVµxzµ´¦´ªµ´R (16.16) ×´¦µx¥µ¦hlµ´R (16.17) while which is the same as ´ Hdéx . Third, the element 0 is readily identified as the unique identity. Fourth, it is readily seen that an inverse exists for each element in the group. For example, the inverse of 1 is 1, because (16.18) µlµ´R Thus the above table defines a group of order 2. It happens in this example that the inverse of each element is the element itself, which is not true for a group in general. We remark that in the context of a group, the elements in the group should be regarded strictly as symbols only. 
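Checking the four axioms against a small operation table is a purely mechanical task. The following Python sketch automates the check for the modulo 2 addition table above, keeping the two elements as the bare symbols '0' and '1'; the dictionary encoding of the table is our own and is only one of several possible representations.

```python
from itertools import product

# The modulo 2 addition table of Example 16.5, with the elements
# treated purely as symbols, as remarked above.
G = ['0', '1']
op = {('0', '0'): '0', ('0', '1'): '1',
      ('1', '0'): '1', ('1', '1'): '0'}

# 1. Closure: every entry of the table is again an element of G.
assert all(op[(a, b)] in G for a, b in product(G, G))

# 2. Associativity: a * (b * c) == (a * b) * c for all a, b, c.
assert all(op[(a, op[(b, c)])] == op[(op[(a, b)], c)]
           for a, b, c in product(G, G, G))

# 3. Existence of identity: some e satisfies e * a == a * e == a for all a.
identities = [e for e in G if all(op[(e, a)] == a == op[(a, e)] for a in G)]
assert identities == ['0']

# 4. Existence of inverse: every a has some b with a * b == b * a == identity.
e = identities[0]
assert all(any(op[(a, b)] == e == op[(b, a)] for b in G) for a in G)

print("All four group axioms hold for the modulo 2 addition table.")
```

The same brute-force check applies verbatim to any finite operation table, which is convenient for the small groups used later in this chapter.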
In particular, one should not associate group elements with magnitudes as we do for real numbers. For instance, in the above example, one should not think of 0 as being less than 1. The element 0, however, is a special symbol which plays the role of the identity of the group. We also notice that for the group in the above example, ÷\ø is equal to øÓÁ÷ for all group elements ÷ and ø . A group with this property is called a commutative group or an Abelian group1 . ¦ý ;ûdëfü ð Lðgíò%î"!cüògý# ëóÄôKõ¤ô components of a vector $ 1 The &% '% Consider a permutation of the '%( P¶P¶P¶ (16.19) Abelian group is named after the Norwegian mathematician Niels Henrik Abel (1802-1829). D R A F T September 13, 2001, 6:27pm D R A F T 369 Entropy and Groups )+* $-, given by &%/.10 32 ' %/.10 2 '%/.10 ( 2 é ) where N (16.20) ÃÄbCP¶P¶P¶ CÅ (16.21) P¶P¶P¶x '4 6 Å 5 ') 4 ÏRÃÄbCP¶P¶P¶ is a one-to-one mapping. The one-to-one mapping on ÃÄbCP¶P¶P¶'4CÅ , which is represented by ) ) ) é ) HxN ) ) &4 #ZNP¶P¶P¶] ) For two ) permutations and , define Lö and . For example, for 4µ§ , suppose is called a permutation (16.22) ÄN ) ) as the composite function of ) (16.23) Vé#C§LbZ ) and ) Then ) nö is given by ) ) ) ) ) ) ö ) nö ) nö nö ) ) #Z #Z § ) Hx ) ) ) ) ) x ) x x ) nö Z#Z Z§ ) ) Hx § #Z #Z §L (16.25) ) #CbC§ N (16.26) §LbCbZN (16.27) The reader can easily check that ) Z#Z ) Hx ) or ) (16.24) éH§LbCbZN ö ) which is different from nö . Therefore, the operation “ ö ” is not commutative. We now show that the set of all permutations on ÃÄbCP¶P¶P¶'4 Å and the operation ) “ ö ” form group. First, for two permuta) a group, called ) the permutation ) ) ) tions and , since both and are one-to-one mappings, so) is ) Sö . Therefore, the closure axiom is satisfied. Second, for permutations , , and ) , ) ) nö ) ö ) ) xUH ) ) ¥ö ) ) nö ö ) ) ]; )Z )Z; ) ] Z) ; xKö H (16.28) (16.29) (16.30) (16.31) ) for +é"74 . Therefore, associativity is satisfied. Third, it is clear that the identity map is the identity element. Fourth, for a permutation , it is clear D R A F T September 13, 2001, 6:27pm D R A F T 370 A FIRST COURSE IN INFORMATION THEORY ) ) ) that its inverse is , the inverse mapping of which is defined because is one-to-one. Therefore, the set of all permutations on ÃÄbCP¶P¶P¶'4CÅ and the operation “ ö ” form a group. The order of this group is evidently equal to &498â . ;: ê\ëìCíïîgíñðgíò%îÁóÄôKõ Let ß be a group with operation “ ö ,” and be a subset of ß . If is a group with respect to the operation “ ö ,” then is called a subgroup of ß . < ê\ëìCíïîgíñðgíò%îÁóÄôKõ Let be a subgroup of a group ß and ÷ be an element of . The left coset of with respect to ÷ is the set ÷ªö"'tÃx÷ö ³ Ï ³ ҵŠ. Similarly, the right coset of with respect to ÷ is the set ö ÷Wà ³ ö ÷Ï ³ Ò_Å . ß In the sequel, only the left coset will be used. However, any result which applies to the left coset also applies to the right coset, and vice versa. For simplicity, ÷"ö will be denoted by ÷L . = ûdüò¥ýÑò%þxí¤ðgí×ò¥îµóÄôKõ For ÷ and ÷C in ß , ÷f G and ÷CP are either identical or disjoint. Further, ÷f G and ÷ x are identical if and only if ÷f and ÷C belong to the same left coset of . Proof Suppose ÷f b and in ÷ á ÷ such that are not disjoint. Then there exists an element ÷C ³ øãµ÷ nö ³ Vµ÷Czö ø (16.32) for some ³ in , nhb . Then ³ ÷f Vé×÷ ö ³ gö µ÷ zö" ³ ö ³ where d > ³ ö ³ is in . We now show that ³ ÷ nö in ÷f G , where ³ Ò_ , ÷ ö ³ é×÷ B>Kö ö ³ µ÷ &> öd ¥ö ?>G (16.33) ÷ . For an element µ÷ ö A@ ÷ ³ jµ÷ DC% ö (16.34) where CÕE>¥ö ³ is in . 
This implies that ÷f Sö ³ is in ÷C . Thus, ÷f GF@÷Cx . By symmetry, ÷ xG@&÷ N . Therefore, ÷ GW ÷C . Hence, if ÷f N and ÷ are not disjoint, then they are identical. Equivalently, ÷f b and ÷ are either identical or disjoint. This proves the first part of the proposition. We now prove the second part of the proposition. Since is a group, it contains ú , the identity element. Then for any group element ÷ , ÷a÷ö¦ú is in ÷L because ú is in . If ÷f b and ÷ x are identical, then ÷f Òh÷f G and ÷ dÒÕ÷C µ÷f G . Therefore, ÷ and ÷ belong to the same left coset of . To prove the converse, assume ÷ and ÷ belong to the same left coset of . From the first part of the proposition, we see that a group element belongs D R A F T September 13, 2001, 6:27pm D R A F T 371 Entropy and Groups to one and only one left coset of . Since ÷f is in ÷ G and ÷ ÷ and ÷ belong to the same left coset of , we see that ÷ identical. The proposition is proved. is in ÷C , and and ÷ are IH ûdüò¥ýÑò%þxí¤ðgí×ò¥îµóÄôKõ#ó Let be a subgroup of a group ß and ÷ be an element of ß . Then ÷L R)à , i.e., the numbers of elements in all the left cosets of are the same, and they are equal to the order of . ³ Proof Consider two elements ÷Ðö in such that ³ ÷dö ³ and ÷Ðö ³ µ÷ö in ÷Ðö , where ³ and ³ are (16.35) Then ÷ ×÷ öd×÷ö ³ N öã÷RKö ³ ³ úö ³ ÷ ×÷ úVö ³ ö"×÷ö ³ öV÷R¥ö ³ (16.36) (16.37) (16.38) (16.39) ³ Thus each element in corresponds to a unique element in ÷R Ä à for all ÷vÒaß . ÷R . Therefore, We are just one step away from obtaining the celebrated Lagrange’s Theorem stated below. JLKgëfò%üKëióÄôKõ#óóM'N+O%ü/%îO%ëP¤þJQKgëfò%üKëR à divides àßv If is a subgroup of ß , then . Proof Since ÷ Ò÷R for every ÷ÒÁß , every element of ß belongs to a left coset of . Then from Proposition 16.9, we see that the distinct left cosets of partition ß . Therefore àßc , the total number of elements in ß , is equal to the number of distinct cosets of multiplied by the number of elements in each left coset, which is equal to à by Proposition 16.10. This implies that à divides àßv , proving the theorem. The following corollary is immediate from the proof of Lagrange’s Theorem. S IT%üIU'óÄôKõ#óÄÿ ò%üò tinct left cosets of D R A F T Let be aÀ VSsubgroup of a group À is equal to À WÀ . ß . The number of dis- September 13, 2001, 6:27pm D R A F T 372 A FIRST COURSE IN INFORMATION THEORY 16.2 GROUP-CHARACTERIZABLE ENTROPY FUNCTIONS Recall from Chapter 12 that the region Ù%ÇÚ consists of all the entropy functions in the entropy space XcÇ for ¸ random variables. As a first step toward establishing the relation between entropy and groups, we discuss in this section entropy functions in Ù ÇÚ which can be described by a finite group ß and subgroups ß Nß P¶P¶P¶Nß Ç . Such entropy functions are said to be groupcharacterizable. The significance of this class of entropy functions will become clear in the next section. In the sequel, we will make use of the intersections of subgroups extensively. We first prove that the intersection of two subgroups is also a subgroup. ûdüò¥ýÑò%þxí¤ðgí×ò¥îµóÄôKõ#ó ß" Let ß is also a subgroup of ß . and ß be subgroups of a group ß . Then ß á Proof It suffices to show that ߪ á ß together with the operation “ ö ” satisfy all the axioms of a group. First, consider two elements ÷ and ø of ß in ߪ á ß" . Since both ÷ and ø are in ß , ×÷vöÐøU is in ß . Likewise, ×÷vöøP is in ß" . Therefore, ÷\ödø is in ß á ß . Thus the closure axiom holds for ß á ß . Second, associativity for ߪ á ß" inherits from ß . 
Third, ߪ and ß" both contain the identity element because they are groups. Therefore, the identity element is in ߪ á ß" . Fourth, for an element ÷+Òaßd , since ß" is a group, ÷ is in ßd , b . Thus for an element ÷ÕÒ ßª á ß , ÷ is also in ߪ á ß" . Therefore, ß á ß" is a group and hence a subgroup of ß . S IT%üIU'óÄôKõ#óY ò%üò á Ç ¤°% ß" Let ߪ PNßP¶P¶P¶NßdÇ be subgroups of a group is also a subgroup of ß . ß In the rest of the chapter, we let ZÇhÃÄbCP¶P¶P¶¸SÅ and denote á \ [ ± , where ² is a nonempty subset of ZÇ . ß Në6]{óÄôKõ#óI Ò ² Let ß be subgroups of a group ß and ÷ . Then ± ßd be elements of by ß , . Then á \[ ± ÷Ä¡ßdZ « if ^ \[ ± ÷ÄßdDE _ ` otherwise. ± àß ´ Proof For the special case when ÒaZÇ , (16.40) reduces to ² is a singleton, i.e., ÷ ¡ßdHÄ (16.40) ² àßd`â ÃÅ for some (16.41) which has already been proved in Proposition 16.10. D R A F T September 13, 2001, 6:27pm D R A F T 373 Entropy and Groups Gb 2 cG cG 1,2 1 Figure 16.1. The membership table for a finite group d and subgroups d and dfe . s Let ² be any nonempty subset of ZÇ . If ^ &[ ± ÷ ¡ßddg` , then (16.40) is _ ` , then there exists %QÒj^ \[ ± ÷Äßd such that obviously true. If ^ &[ ± ÷ ¡ßdhi for all Ò ² , %@µ÷ Lö ³ ; (16.42) where ³ nÒ ßd . For any Ò ² %ö?k\é×÷ and for any kvÒ_ß ö ³ ± , consider Dk\µ÷Äö" ³ LöBkLN Kö (16.43) Since both ³ and k are in ß" , ³ göRk is in ßd . Thus %@öRk is in ÷Äßd for all ² ¦Ò , or %völk is in ^ \[ ± ÷Äßd . Moreover, for k'k Ø Òß ± , if %cö]kÕ7%vömk Ø , then kÕAk Ø . Therefore, each element in ß ± corresponds to a unique element in ^ &[ ± ÷ ß . Hence, (16.44) án&[ ± ÷ ß ]8àß ± â proving the lemma. The relation between a finite group ß and subgroups ߪ and ß is illustrated by the membership table in Figure 16.1. In this table, an element of ß is represented by a dot. The first column represents the subgroup ߪ , with the dots in the first column being the elements in ߪ . The other columns represent the left cosets of ß . By Proposition 16.10, all the columns have the same number of dots. Similarly, the first row represents the subgroup ß and the other rows represent the left cosets of ß . Again, all the rows have the same number of dots. The upper left entry in the table represents the subgroup ß á ß . There are àߪ á ß dots in this entry, with one of them representing the identity element. Any other entry represents the intersection between a left coset of ߪ and a left coset of ß , and by Lemma 16.15, the number of dots in each of these entries is either equal to àߪ á ß] or zero. D R A F T September 13, 2001, 6:27pm D R A F T 374 A FIRST COURSE IN INFORMATION THEORY s on n o nH (Y)p ty 2 . . n o nH (Xq p ) 2 rx sS o n . [X] S [Y] . . . . q p) n o nH (X,Y 2 t ) s S o nq (x,y . . . . . . . . [XY] Figure 16.2. A two-dimensional strong typicality array. Since all the column have the same numbers of dots and all the rows have the same number of dots, we say that the table in Figure 16.1 exhibits a quasiuniform structure. We have already seen a similar structure in Figure 5.1 for the two-dimensional strong joint typicality array, which we reproduce in Figure 16.2. In this array, when ¸ is large, all the columns have approximately the same number of dots and all the rows have approximately the same number of dots. For this reason, we say that the two-dimensional strong typicality array exhibits an asymptotic quasi-uniform structure. In a strong typicality array, however, each entry can contain only one dot, while in a membership table, each entry can contain multiple dots. 
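The quasi-uniform structure just described can be verified directly on a small example. The following Python sketch takes G to be the group of all permutations of three objects and G1, G2 to be two subgroups of order 2 (an illustrative choice of ours, not an example from the text); it tabulates the intersections of the left cosets of G1 with those of G2 and confirms that every cell contains either |G1 ∩ G2| elements or none, while every column contains |G1| elements and every row contains |G2| elements.

```python
from itertools import permutations

def compose(p, q):
    """(p ∘ q)[i] = p[q[i]]: apply q first, then p."""
    return tuple(p[q[i]] for i in range(len(q)))

# G is the group of all permutations of {0, 1, 2} (order 3! = 6).
G = list(permutations(range(3)))

# Two subgroups of order 2, each generated by a transposition.
G1 = [(0, 1, 2), (1, 0, 2)]          # {identity, swap of 0 and 1}
G2 = [(0, 1, 2), (2, 1, 0)]          # {identity, swap of 0 and 2}
G12 = [g for g in G1 if g in G2]     # G1 ∩ G2 = {identity}

def left_cosets(G, H):
    """Distinct left cosets aH of the subgroup H in G."""
    cosets = []
    for a in G:
        aH = frozenset(compose(a, h) for h in H)
        if aH not in cosets:
            cosets.append(aH)
    return cosets

cols = left_cosets(G, G1)            # columns of the membership table
rows = left_cosets(G, G2)            # rows of the membership table

# Each cell holds either |G1 ∩ G2| elements or none; each column holds |G1|
# elements in total, and each row holds |G2| elements in total.
for r in rows:
    print([len(r & c) for c in cols])
assert all(len(r & c) in (0, len(G12)) for r in rows for c in cols)
assert all(sum(len(r & c) for r in rows) == len(G1) for c in cols)
assert all(sum(len(r & c) for c in cols) == len(G2) for r in rows)
print("Table size:", len(rows), "rows x", len(cols), "columns")
```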
One can make a similar comparison between a strong joint typicality array for any ¸ random variables and the membership table for a finite group with ¸ subgroups. The details are omitted here. JLKgëfò%üKëióÄôKõ#óÄô Let uZÇ be subgroups of a group ß . Then vÒXcÇ w ßd;ãÒ defined by ± ä¤å]æ àßc àß (16.45) ± for all nonempty subset ² of ZÇ is entropic, i.e., v Ò Ù ÇÚ . Proof It suffices to show that there exists a collection of random variables \P¶P¶P¶Ç such that x ± z äå]æ àßc ± àß (16.46) ² for y all nonempty subset of ZÇ . We first introduce a uniform random variable defined on the sample space ß with probability mass function z|{ y à D R A F T Á÷Åd àßc September 13, 2001, 6:27pm (16.47) D R A F T 375 Entropy and Groups y for all ÷+Òaß . Fory any zÒ}ZÇ , let random variable be a function of such that µ÷Lß if µ÷ . Let ² y be a nonempty subset of ZÇ . Since Ó÷ #ß" for all "Ò ² if and only if is equal to some ølÒ áS\[ ± ÷ ß , z+{ ² Ã¥µ÷ ¡ßdnÏ]Ò '^ & [ ± Å ÷ ¡ßd àßc À LÀ À SÀ ~ V/ V (16.48) _ ` if ^ &[ ± ÷ ß otherwise ´ (16.49) ² ;ÕÒ distributes uniformly on its by Lemma 16.15. In other words, À VSÀ support whose cardinality is À V/RÀ . Then (16.46) follows and the theorem is proved. ê\ëìCíïîgíñðgíò%îÁóÄôKõ#ó : w Let ß be a finite group and ß Nß P¶P¶P¶xNß Ç be subgroups À VSÀ äå]æ À V À for all nonempty subsets ² of of ß . Let v be a vector in XvÇ . If ± ZÇ , then ßNß UP¶P¶P¶NßdÇL is a group characterization of v . Theorem 16.16 asserts that certain entropy functions in ÙnÇÚ have a group characterization. These are called group-characterizable entropy functions, which will be used in the next section to obtain a group characterization of the region Ù ÇÚ . We end this section by giving a few examples of such entropy functions. ¦ý I< ëóÄôKõ#ó v Ò X+ by Fix any subset of Z@ w ± äå]æ « ÃÄbCb Å and define a vector _ ` if ² á FE otherwise. ´ (16.50) We now show that v has a group characterization. Let ßu Ãx´RÅ be the group of modulo 2 addition in Example 16.5, and for ShbCb , let if zÒ otherwise. Ãx´CÅ ß"K« ß (16.51) _ ` , there exists an in ² such Then for a nonempty subset ² of Z@ , if ² á EA that is also in , and hence by definition ß WÃx´CÅ . Thus, ± ß &[ ± ß (16.52) 'Ãx´CÅZ Therefore, äå]æ àßv àß D R A F T ± äå]æ àßc MÃx´CÅC äå]æ äå]æ C September 13, 2001, 6:27pm (16.53) D R A F T 376 If ² A FIRST COURSE IN INFORMATION THEORY á a` , then ßdgß for all Ò ± ß ² &[ , and (16.54) ß"Kß ± Therefore, ä¤å]æ àßc ± àß ä¤å]æ àßc äå]æ àßc (16.55) µ´R Then we see from (16.50), (16.53), and (16.55) that w ± ¦ý ä¤å]æ àßc àß of Z@ . Hence, ² for all nonempty subset terization of v . (16.56) ± ßNߪ NßNß" is a group charac- I= ëóÄôKõ#ó This is a generalization of the last example. Fix any nonempty subset of ZÇ and define a vector vÒXvÇ by w äå]æ ± « if ² á "E _ ` otherwise. ´ (16.57) Then ßNߪ NßP¶P¶P¶xNßdÇR is a group characterization of v , where group of modulo 2 addition, and ß"K Ãx´CÅ « ß if zÒ otherwise. ß is the (16.58) By letting 7` , vW´ . Thus we see that ßNߪ NßP¶P¶P¶NßdÇL is a group characterization of the origin of XcÇ , with ß{ß {ß" W¶P¶P¶Cß"Ç . ¦ý H ëóÄôKõ¤ÿ v Ò X+ as follows: Define a vector w ± Let ßNߪ NßNß D R A F T (16.59) âbZN be the group of modulo 2 addition, ß{ ß Ã ×´R`´ÄNxH`´ÄbÅ ß Ã ×´R`´ÄNx×´RxbÅ Ã ×´R`´ÄNxHxbÅZ ß Then Eag` ² ½ , and (16.60) (16.61) (16.62) is a group characterization of v . 
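The construction used in the proof of Theorem 16.16 can be checked mechanically for the group characterization in the last example: take a random variable distributed uniformly on G, let X_i be the left coset of G_i containing it, and compare the resulting joint entropies with log(|G| / |G_α|). The following Python sketch performs this check by exact enumeration; the variable names are ours, and entropies are computed in bits.

```python
from itertools import combinations, chain
from math import log2
from collections import Counter

# G = {0,1}^2 under componentwise modulo 2 addition, with the three
# subgroups of the last example.
G  = [(0, 0), (0, 1), (1, 0), (1, 1)]
G1 = [(0, 0), (1, 0)]
G2 = [(0, 0), (0, 1)]
G3 = [(0, 0), (1, 1)]
subgroups = {1: G1, 2: G2, 3: G3}

def add(a, b):
    return tuple((x + y) % 2 for x, y in zip(a, b))

def coset(a, H):
    """The left coset aH, as a frozenset; this is the value of X_i when the
    uniform group element equals a."""
    return frozenset(add(a, h) for h in H)

def entropy(counts, total):
    return -sum((c / total) * log2(c / total) for c in counts.values())

n = len(subgroups)
alphas = chain.from_iterable(combinations(range(1, n + 1), r) for r in range(1, n + 1))
for alpha in alphas:
    # The group element is uniform on G; X_alpha collects the cosets (aG_i : i in alpha).
    joint = Counter(tuple(coset(a, subgroups[i]) for i in alpha) for a in G)
    h_emp = entropy(joint, len(G))
    # Theorem 16.16 predicts log(|G| / |G_alpha|), where G_alpha is the
    # intersection of G_i over i in alpha.
    G_alpha = [g for g in G if all(g in subgroups[i] for i in alpha)]
    h_grp = log2(len(G) / len(G_alpha))
    print(alpha, round(h_emp, 6), round(h_grp, 6))
    assert abs(h_emp - h_grp) < 1e-9
```

For this example the check confirms h_1 = h_2 = h_3 = 1 and h_12 = h_13 = h_23 = h_123 = 2 bits, in agreement with |G| = 4, |G_i| = 2, and |G_i ∩ G_j| = 1.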
September 13, 2001, 6:27pm D R A F T 377 Entropy and Groups 16.3 A GROUP CHARACTERIZATION OF ÇÚ We have introduced in the last section the class of entropy functions in Ù ÇÚ which have a group characterization. However, an entropy function vWÒµÙ ÇÚ may not have a group characterization due to the following observation. Suppose vQÒ Ù ÇÚ . Then there exists a collection of random variables \P¶P¶P¶x w Ç such that x ± ± (16.63) for all nonempty subset ² of ZÇ . If tion of v , then x ßNß UP¶P¶P¶NßdÇL ± j äå]æ is a group characteriza- àßc (16.64) ± àß x for all nonempty subset of Z Ç . Since both àßc and àß ± are integers, ± must be the logarithm of a rational number. However, the joint entropy of a set of random variables in general is not necessarily the logarithm of a rational number (see Corollary 2.44). Therefore, it is possible to construct an entropy function v Ò Ù ÇÚ which has no group characterization. Although vÒ.Ù ÇÚ does not imply v has a group characterization, it turns out that the set of all v¢Ò Ù%ÇÚ which have a group characterization is almost good enough to characterize the region Ù ÇÚ , as we will see next. ê\ëìCíïîgíñðgíò%îÁóÄôKõ¤ÿgó Define the following region in XvÇ : v XcÇvÏv has a group characterizationÅZ (16.65) Ç'Ã Ò By Theorem 16.16, if vµÒXvÇ has a group characterization, then vµÒÙnÇÚ . Therefore, Ç@)Ù ÇÚ . We will prove as a corollary of the next theorem that ù¸ ÇL , the convex closure of Ç , is in fact equal to Ù ÇÚ , the closure of Ù%ÇÚ . JLKgëfò%üKëióÄôKõ¤ÿÿ For any v ä 0 ( 2 Ev that ('L ( . ÒÕÙ ÇÚ , there exists a sequence à 0( 2 Å in Ç such We need the following lemma to prove this theorem. The proof of this lemma resembles the proof of Theorem 5.9. Nevertheless, we give a sketch of the proof for the sake of completeness. Në6]{óÄôKõ¤ÿ Let be a random variable such that and the distribution Ã'S&%bÅ is rational, i.e., n&% is a rational number for all %Ò . Without loss of generality, assume n&% is a rational number with denominator for all % ÒM . Then for 4Ð b b P¶P¶P¶ , ä (h 4 D R A F T äå]æ 4 8 &4n&%8 x ©N September 13, 2001, 6:27pm (16.66) D R A F T 378 A FIRST COURSE IN INFORMATION THEORY Proof Applying Lemma 5.10, we can obtain ä 94 8 &4n&%K8 a 4 r xa ä 4 a 4 ¬ x ä 4 ä B4 ¡ g&4 4¢ ä xa B4 (16.67) (16.68) 4-¢ as 4L5£ . On the other hand, we can obtain _ ¤S&%g ä µxn¬ 94 8 &4n&%K8 r ä ©¥ This upper bound tends to 4 ä n&%K n&%Kg ¬ 4 ¢ ¤n&%Kg 4 ¢ ä ¬ B4 4 (16.69) © as 4L5£ . Then the proof is completed This lower bound also tends to by changing the base of the logarithm if necessary. Proof of Theorem 16.22 For any vÒÕÙnÇÚ , there exists a collection of random variables P\P¶P¶P¶Ç such that w ± x ± (16.70) for all nonempty subset ² of ZÇ . We first consider the special case that for all ÒZ Ç and the joint distribution of P¶P ¶P¶ Ç is ra j¡ 0( 2 Å tional. We want to show that there exists a sequence à in Ç such that ä 0( 2 . (h ( Ev Denote &[ ± by ± . For any nonempty subset ² of ZÇ , let ¥ ± be the marginal distribution of ± . Assume without loss of generality that for any nonempty subset ² of Z Ç and for all ÷vÒM ± , ¥ ± ×÷L is a rational number with denominator . For each 4Ð b b P¶P¶P¶ , fix a sequence $¦D§ ¦D§T¨ ¦?§I¨ '% P¶P¶P¶% ( ¦?§T¨ ¨ ¦D§ ¦ § bCP¶P¶P¶'4F% ¦ § where for$ all &f % ÏÒ©ZÇR¢$Òª , such ¦ ×§ ÷K ¦ § of occurrences of ÷ in sequence that « , is equal , the number ×÷L for all ÷ÁÒ¬ § to 4I¥ . The existence of such a¦Dsequence is guaran &% ¦D§1¨ P are rational numteed by that all the values of the joint distribution of . 
Also,¨ we denote ¨ ¨ the sequence$ of 4 elements of ± , bers¨ with¨ denominator &% ± '% ± P¶P¶P¶% ± ( , where $ % ± &%Ñ ÏÒ ² , by ± . Let ÷ Òj ± . It is $ easy to check that « ×÷ ± , the number of occurrences of ÷ in the sequence ± , is equal to 4I¥ ± ×÷R for all ÷vÒ ± . D R A F T September 13, 2001, 6:27pm D R A F T 379 Entropy and Groups Let ß be the group of permutations on ÃÄbCP¶P¶P¶'4CÅ . The group G depends on 4 , but for simplicity, we do not state this dependency explicitly. For any zÒ®ZÇ , define )+* $ , ) ß"KWà ÒaßhÏ )+* $ , where $ ¨ ¨ &% T. 0 32 ' % 1. 0 2 é ;ÅZ ¨ '% 1. 0 ( 2 P¶P¶P¶ (16.71) N It is easy to check that ß" is a subgroup of ß . Let ² be a nonempty subset of ZÇ . Then ß ß &[ ± ) ) *$ , $ + TÅ Ã Ò ßéÏ & [ ) ± )+* $ , $ à ) ÒaßéÏ )+* $ , $ for all Ò ± (16.72) à )+* $ where , ± For any ÷cÒM ± ± ÒaßéÏ &% é ¨ ± ¨ ± .10 2 ¨ '% ± P¶P¶P¶ define the set ¯+° ² (16.74) (16.75) Å ± ÅZ .T0 32 ' % (16.73) '4 ¯+° ± )+* $ Á÷ÅZ , ± . Then ± ×÷L contains the “locations” of ) ÷ in ¯+° ¯+° for all ÷vÒM ± , \Ò ×÷L implies ÓÒ ×÷L . Since ¯+° E«×÷K $ ×÷RPZ and therefore àßc ± àß ±4I¥ ± j ² & 4T¥ ³ 1[ ´ ± Z àß (h 4 ä¤å]æ àßc ± àß D R A F T if and only if (16.78) 8 (16.79) 8 ± ×÷L (16.80) w x ± z ± Recall that ß and hence all its subgroups depend on 4 . Define µ 0( 2 ± ± ± ×÷LN By Lemma 16.23, ä (16.77) $ ± ×÷R 498 ³ [1´ &4I¥ (16.76) N ¨ T% ×÷Lz'ÃNcÒ©ÃÄbCP¶P¶P¶ CÅÏ $ .T0 ( 2 äå]æ àßv àß ± September 13, 2001, 6:27pm (16.81) 0( 2 by (16.82) D R A F T 380 A FIRST COURSE IN INFORMATION THEORY for all nonempty subset ² of ZÇ . Then 0( 2 Ò and Ç ä 0 ( 2 Ev (h 4 (16.83) We have already proved the theorem for the special case that v is the entropy function of a collection of random variables P\]P¶P¶P¶Ç with finite alphabets and a rational joint distribution. To complete the proof, we only have to 0¶ 2 Å note that for any ä vQÒ Ù%ÇÚ , it is always possible to construct a sequence Ãv 0 ¶ 0 ¶ 2 2 in Ù ÇÚ such that ¶ L v ·v , where v is the entropy function of a 0¸¶ 2 0¸¶ 2 0¶ 2 collection of random variables P¶P¶P¶x Ç with finite alphabets and a rational joint distribution. This can be proved by techniques similar to those used in Appendix 15.A together with the continuity of the entropy function. The details are omitted here. S IT%üIU'óÄôKõ¤ÿ ò%üò ¤ ù ¸ Ç j Ù ÇÚ . Proof First of all, ǹ@{Ù ÇÚ . By taking convex closure, we have ù¤¸ ÇR6@ ù¸×ÙnÇÚ . By Theorem 14.5, Ù ÇÚ is convex. Therefore, ù¤¸×Ù%ÇÚ Ó Ù ÇÚ , and we Ú have ù¤¸ ÇR}@ Ù Ç . On the other hand, we have shown in Example 16.19 that the origin of X Ç has a group characterization and therefore is in Ç . It Ú then follows from Theorem 16.22 that Ù Ç @ ù¸ ÇL . Hence, we conclude that Ù ÇÚ ù¤¸ ÇL , completing the proof. 16.4 INFORMATION INEQUALITIES AND GROUP INEQUALITIES We have proved in Section 14.1 that an unconstrained information inequality º¼» v always holds if and only if @ÁÃvÒXvÇcÏ ÇÚ Ù (16.84) k´ º » v (16.85) ¢´CÅZ In other words, all unconstrained information inequalities are fully character ized by Ù ÇÚ . We also have proved at the end of the last section that ù¤¸ ÇRz Ù ÇÚ . Since Ç®@¢Ù ÇÚ @ Ù ÇÚ , if (16.85) holds, then }@ÁÃv ¹ Ò XcÇcÏ Ç º» v ¢´CÅZ º » v On the other hand, if (16.86) holds, since Ãv ÒMXcÇÏ convex, by taking convex closure in (16.86), we obtain Ù ÇÚ ù ¸ D@ÁÃv Ò XcÇvÏ ÇL º » v (16.86) µ´CÅ ¢´CÅZ is closed and (16.87) Therefore, (16.85) and (16.86) are equivalent. D R A F T September 13, 2001, 6:27pm D R A F T 381 Entropy and Groups Now (16.86) is equivalent to º » Since vÒ Ç if and only if v ¢´ for all v Ò . 
Ç (16.88) w ± ä¤å]æ àßc àß (16.89) ± for all nonempty subset ² of ZÇ for some finite group ß and subgroups ß U ß"L¶P¶P¶NßdÇ , we see that the inequality (16.84) holds for all random variables w à\¶P¶P¶àÇ if and only if the inequality obtained from (16.84) by replacing À VSÀ äå]æ ± by À V À for all nonempty subset ² of ZÇ holds for all finite group ß and subgroups ߪ PNß¶P¶P¶xNßdÇ . In other words, for every unconstrained information inequality, there is a corresponding group inequality, and vice versa. Therefore, inequalities in information theory can be proved by methods in group theory, and inequalities in group theory can be proved by methods in information theory. In the rest of the section, we explore this one-to-one correspondence between information theory and group theory. We first give a group-theoretic proof of the basic inequalities in information theory. At the end of the section, we will give an information-theoretic proof for the group inequality in (16.5). ê\ëìCíïîgíñðgíò%îÁóÄôKõ¤ÿ Let ߪ and ß be subgroups of a finite group ߪ %ö ßãWÃx÷"öÓølÏZ÷vÒaß and ø¦Òaß"ÅZ ß . Define (16.90) is in general not a subgroup of ß . However, it can be shown that is a subgroup of ß if ß is Abelian (see Problem 1). ß jödß" ß ö ß ûdüò¥ýÑò%þxí¤ðgí×ò¥îµóÄôKõ¤ÿfô Let ߪ and ß be subgroups of a finite group ß . Then àߪ àß"Ä àߪ nöãß"ZZ Proof Fix ×÷f U`÷CxvÒ{ߪ ½ #ø bøGVÒaß ß" ½ ß àߪ , Then á ß"Z is in ÷ Vö"÷ (16.91) ß Vöß" . Consider any such that (16.92) øx nöãøGãÁ÷ nöÓ÷C We will determine the number of tion. From (16.92), we have ø #ø D R A F T öd#øx nöãøGP in ß #øx bøGx ø öãø göÓø ø øG ø ß" which satisfies this rela- ö"×÷ nöV÷Cx ½ öÓ÷ öÓ÷ öÓ÷ %öÓ÷C September 13, 2001, 6:27pm (16.93) (16.94) (16.95) D R A F T 382 A FIRST COURSE IN INFORMATION THEORY Then ø öÓ÷ Áø öÓ÷ ö×÷ öÓ÷ zÁø öÓ÷ (16.96) Let ½ be this common element in ß , i.e., ½\ÁøGöÓ÷ ø (16.97) öV÷f Since ø öã÷ Ò ß and ø öÓ÷ Ò©ß , ½ is in ß ná ß . In other words, for given ×÷f `÷CxVÒ ßª ½ ß" , if #øx bøNxÓÒaߪ ½ ß satisfies (16.92), then #øx bøNx satisfies (16.97) for some ½+Òaߪ á ß" . On the other hand, if #øx PbøNxÓÒaߪ ½ ß satisfies (16.97) for some ½+Òaß á ß" , then (16.96) is satisfied, which implies (16.92). Therefore, for given ×÷ `÷CxÒaߪ ½ ß , #øx bøGVÒaß ½ ß" satisfies (16.92) if and only if #ø bø satisfies (16.97) for some ½Òaß %á ß . Now from (16.97), we obtain ¾½ ¾½ ø x z ÐöÓ÷ (16.98) and ¾½ ½ÐöÓ÷C (16.99) øG] where we have written øx and øG as øx ¾½ and øGZ¾½ to emphasize their dependence on ½ . Now consider ½¿½ Ø Ò ßª á ß such that ¾½ ¾½ ¾½ ¾½ #øx NbøGÄ zé#øx Since ø x¾½zø ¾½ Ø Ø NbøGZ (16.100) Ø N , from (16.98), we have ¾½ ÐöÓ÷ which implies ¾½ é CØÄöÓ÷ (16.101) ½\E½CØï (16.102) Therefore, each ½+Òaߪ á ß" corresponds to a unique pair #øx UbøGxVÒ_ߪ ½ ß which satisfies (16.92). Therefore, we see that the number of distinct elements in ß nöãß" is given by àß öãß Z ½ àß àß á ß"Z ß"Z àß xàßZ àߪ á ßZ (16.103) completing the proof. JLKgëfò%üKëióÄôKõ¤ÿ: Let ß Nß , and ß be subgroups of a finite group ß . Then àß]àß ¡ÄC Wàß ¡]àßÄâ D R A F T September 13, 2001, 6:27pm (16.104) D R A F T 383 Entropy and Groups Proof First of all, á ߪ ¡ ß á hߪ ßP á á ß" á ß"z{ߪ á ß" ß ßª ¡ (16.105) By Proposition 16.26, we have àߪ ¡öãß"Ä] It is readily seen that ߪ ¡zö ß" àß ¡zö àߪ ¡Äàß"Z (16.106) àߪ ¡Ä is a subset of ß" , Therefore, ¡ àß àß ß"ZZ (16.107) WàßÄâ àß ¡Ä The theorem is proved. S IT%üIU'óÄôKõ¤ÿ< ò%üò For random variables P\ , and \ , Ý (16.108) Þ \Ä \xV ¢´R Proof Let ߪ UNß" , and ß" be subgroups of a finite group ß . 
Then (16.109) àß"Äàߪ ¡Ä hàߪ ¡Zàß"Ä by Theorem 16.27, or àßv àß ¡ àß àßv àß àß ¡ (16.110) This is equivalent to äå]æ àßc àß ¡Z äå]æ àßc àßÄ äå]æ àßv àß"Z äå]æ àßc àߪ ¡Z (16.111) This group inequality corresponds to the information inequality x \xK x which is equivalent to \\ Ý x \xg x \\xN Þ \Ä \xV ¢´R (16.112) (16.113) The above corollary shows that all the basic inequalities in information theory has a group-theoretic proof. Of course, Theorem 16.27 is also implied by the basic inequalities. As a remark, the inequality in (16.3) is seen to be a special case of Theorem 16.27 by letting ß ß . D R A F T September 13, 2001, 6:27pm D R A F T 384 A FIRST COURSE IN INFORMATION THEORY We are now ready to prove the group inequality in (16.5). The non-Shannontype inequality we have proved in Theorem 14.7 can be expressed in canonical form as x x GK x l§ x x è g x\xg x x \x¥ x \\xg x x \x¥ è è x xèg \è (16.114) \èN which corresponds to the group inequality äå]æ àßv àß x l§ äå]æ äå]æ àß ¡] àßv àß ¡Hè äå]æ äå]æ àß äå]æ àßv äå]æ àßZ àßv àß"è àßc àß"HèÄ àßv äå]æ àßc ¡ àßv äå]æ àßv àß ä¤å]æ è àßv Hè àß äå]æ àßc àß (16.115) àß"Hè Upon rearranging the terms, we obtain àߪ á 'àߪ àß ßZ á àߪ á ß"Zàß"Z ßdèÄ àßdèÄ àß àß á á ßdèÄ ß" á àß" á ßdèÄ è ß]àß á àß á ß" ßdèÄ á ßdèÄâ (16.116) which is the group inequality in (16.5). The meaning of this inequality and its implications in group theory are yet to be understood. PROBLEMS 1. Let ߪ and ß be subgroups of a finite group subgroup if ß is Abelian. ß . Show that ߪ zölß is a 2. Let Àg and Àf be group characterizable entropy functions. a) Prove that Á_ ¿Àg z"Á.¤Àf is group characterizable, where Á_ and Á. are any positive integers. b) For any positive real numbers ÷f and ÷C , construct a sequence of group 0¸¶ 2 characterizable entropy functions for ½cébCP¶P¶P¶ such that ä ; ¶ h where v_µ÷f ¿ÀK % D R A F T ¸0 ¶ 2 0¸¶ 2 v v Ó ÂÀ . ÷C f September 13, 2001, 6:27pm D R A F T 385 Entropy and Groups 3. Let ßNߪ PNßP¶P¶P¶NßdÇL be a group characterization of À&Ò ÙnÇÚ , where À is the entropy function for random variables \P¶P¶P¶xÇ . Fix any nonempty subset ² of ZÇ , and wà define v à by jÄ T± Å ÆÄ à w à ± ¬ for subsets of Z Ç . It can easily be checked that x all nonempty ± . Show that ×®Õ`®. x`®cP¶P¶P¶x`®\Ç is a group characterization of v , where ® ß ± and ® {ß á ß ± . 4. Let ßNߪ PNß"P¶P¶P¶xNß"ÇR be a group characterization of ÀkÒQÙ%ÇÚ , where À is the entropy function for random variables P¶P¶P¶x Ç . Show that if ² is a function of ÏcÒ , then ß ± is a subgroup of ßd . 5. Let ß UNßNß" be subgroups of a finite group ß . Prove that á àßcàߪ ß á ß"Z Wàß á á ß"Zàß" ßàß á ß"Zâ Hint: Use the information-theoretic approach. w w w 6. Let vQÒÙ%Ú be the entropy function for random variables and \ such l ¡ , i.e. and \ are independent. Let ßNß PNß" be a that n ¯ group characterization of v , and define a mapping ÏRß ½ ß 5uß by ¯ ×÷ÑbøPzµ÷"öãø ¯ a) Prove that the mapping is onto, i.e., for any element exists ×÷ÑbøUãÒ ßª ½ ß such that ÷döÓø Áù . b) Prove that ߪ %ö ß" is a group. w w ùaÒ'ß , there w 7. Denote an entropy function v ÒWÙ Ú by ¡ . Construct a group characterization for each of the following entropy functions: a) v b) v% c) v% äå]æ `´R C Z äå]æ äå]æ é×´R C Z äå]æ äå]æ ä¤] å æ é C C é äå]æ Verify that Ù functions. Z . is the minimal convex set containing the above three entropy w w w w w w w 8. Denote an entropy function v5ÒiÙ Ú by ¡ ¡ ¡ . 
Construct a group characterization for each of the following entropy functions: a) vz é b) v% é äå]æ äå]æ D R A F T äå]æ äå]æ äå]æ C`´R`´R `´R C C Z äå]æ ä¤] å æ äå]æ äå]æ äå]æ C C`´R C C C Z September 13, 2001, 6:27pm D R A F T 386 A FIRST COURSE IN INFORMATION THEORY v c) % d) v è äå]æ é é äå]æ äå]æ C äå]æ C ä¤å]æ C ä¤å]æ C äå]æ C äå]æ C äå]æ C äå]æ §L ä¤å]æ C ä¤å]æ §L äå]æ C Z äå]æ §L . § 9. Ingleton inequality Let ß be a finite Abelian group and ß Nß Nß , and ß è be subgroups of ß . Let ßNߪ PNß"NßNßdèx be a group characterization of À , where À is the entropy function for random variables U\Z\ , and è . Prove the following statements: a) ná ß Hint: Show that gö"ß ß ßª á öãß %á è PCWàß ß D@µßª á ßPKöߪ %á ßdèx ß á ß"zö ö è P ß . ß"èx b) àß ö ß àߪ àß"zö è C á ß"èZàß á àß á ß" á ß"Zàߪ ß"èZ ß"èÄ c) àß ¥ö ß"zöãß"zö àߪ nö ß"èZR ßjöãß"èÄàß"öãß"öãß"èÄ àß è öãß d) àß %ö ß"zö ßjö ßdèÄ á àß àß"Äàß"Zàß"è àߪ á ß" á ß"èÄàß" á ß" ß"èÄ àß á ß"Zàߪ á ßdèZàß" á ß]àß á ßdèÄàß" á ßdèZ àß á ß"Zàߪ á ßdèZàß" á ß]àß á ßdèÄàß" á ßdèZ á ßdèZâ e) àßZàßdè àߪ f) x x x ¡ ¥ \x¥ x x á ßZàߪ è ¥ èg x x x á ß" á K ¡xg x ß"èÄàß" x á ß Hè K ¡Hèg x x Hè \HèN where ¡Hè denotes \è , etc. g) Is the inequality in f) implied by the basic inequalities? And does it always hold? Explain. The Ingleton inequality [99] (see also [149]) was originally obtained as a constraint on the rank functions of vector spaces. The inequality in e) was obtained in the same spirit by Chan [38] for subgroups of a finite group. The inequality in f) is referred to as the Ingleton inequality for entropy in the literature. (See also Problem 7 in Chapter 14.) D R A F T September 13, 2001, 6:27pm D R A F T Entropy and Groups 387 HISTORICAL NOTES The results in this chapter are due to Chan and Yeung [40], whose work was inspired by a one-to-one correspondence between entropy and quasi-uniform arrays previously established by Chan [38] (also Chan [39]). Romashchenko et al. [168] have developed an interpretation of Kolmogorov complexity similar to the combinatorial interpretation of entropy in Chan [38]. D R A F T September 13, 2001, 6:27pm D R A F T Bibliography [1] J. Abrahams, “Code and parse trees for lossless source encoding,” Comm. Inform. & Syst., 1: 113-146, 2001. [2] N. Abramson, Information Theory and Coding, McGraw-Hill, New York, 1963. [3] Y. S. Abu-Mostafa, Ed., Complexity in Information Theory, Springer-Verlag, New York, 1988. [4] J. AczÈ Ç l and Z. DarÉ Ç czy, On Measures of Information and Their Characterizations, Academic Press, New York, 1975. [5] R. Ahlswede, B. Balkenhol and L. Khachatrian, “Some properties of fix-free codes,” preprint 97-039, Sonderforschungsbereich 343, UniversitË Ê t Bielefeld, 1997. [6] R. Ahlswede, N. Cai, S.-Y. R. Li, and R. W. Yeung, “Network information flow,” IEEE Trans. Inform. Theory, IT-46: 1204-1216, 2000. [7] R. Ahlswede and J. KÉ Ê rner, “Source coding with side information and a converse for degraded broadcast channels,” IEEE Trans. Inform. Theory, IT-21: 629-637, 1975. [8] R. Ahlswede and I. Wegener, Suchprobleme, Teubner Studienbcher. B. G. Teubner, Stuttgart, 1979 (in German). English translation: Search Problems, Wiley, New York, 1987. [9] R. Ahlswede and J. Wolfowitz, “The capacity of a channel with arbitrarily varying cpf’s and binary output alphabet,” Zeitschrift fÌ Ê r Wahrscheinlichkeitstheorie und verwandte Gebiete, 15: 186-194, 1970. [10] P. Algoet and T. M. Cover, “A sandwich proof of the Shannon-McMillan-Breiman theorem,” Ann. Prob., 16: 899-909, 1988. [11] S. 
Amari, Differential-Geometrical Methods in Statistics, Springer-Verlag, New York, 1985. [12] J. B. Anderson and S. Mohan, Source and Channel Coding: An Algorithmic Approach, Kluwer Academic Publishers, Boston, 1991. D R A F T September 13, 2001, 6:27pm D R A F T 390 A FIRST COURSE IN INFORMATION THEORY [13] S. Arimoto, “Encoding and decoding of Í -ary group codes and the correction system,” Information Processing in Japan, 2: 321-325, 1961 (in Japanese). [14] S. Arimoto, “An algorithm for calculating the capacity of arbitrary discrete memoryless channels,” IEEE Trans. Inform. Theory, IT-18: 14-20, 1972. [15] S. Arimoto, “On the converse to the coding theorem for discrete memoryless channels,” IEEE Trans. Inform. Theory, IT-19: 357-359, 1973. [16] R. B. Ash, Information Theory, Interscience, New York, 1965. [17] E. Ayanoglu, R. D. Gitlin, C.-L. I, and J. Mazo, “Diversity coding for transparent selfhealing and fault-tolerant communication networks,” 1990 IEEE International Symposium on Information Theory, San Diego, CA, Jan. 1990. [18] A. R. Barron, “The strong ergodic theorem for densities: Generalized ShannonMcMillan-Breiman theorem,” Ann. Prob., 13: 1292-1303, 1985. [19] L. A. Bassalygo, R. L. Dobrushin, and M. S. Pinsker, “Kolmogorov remembered,” IEEE Trans. Inform. Theory, IT-34: 174-175, 1988. [20] T. Berger, Rate Distortion Theory: A Mathematical Basis for Data Compression, Prentice-Hall, Englewood Cliffs, New Jersey, 1971. [21] T. Berger, “Multiterminal source coding,” in The Information Theory Approach to Communications, G. Longo, Ed., CISM Courses and Lectures #229, Springer-Verlag, New York, 1978. [22] T. Berger and R. W. Yeung, “Multiterminal source coding with encoder breakdown,” IEEE Trans. Inform. Theory, IT-35: 237-244, 1989. [23] E. R. Berlekamp, “Block coding for the binary symmetric channel with noiseless, delayless feedback,” in H. B. Mann, Error Correcting Codes, Wiley, New York, 1968. [24] E. R. Berlekamp, Ed., Key Papers in the Development of Coding Theory, IEEE Press, New York, 1974. [25] C. Berrou, A. Glavieux, and P. Thitimajshima, “Near Shannon limit error-correcting coding and decoding: Turbo codes,” Proceedings of the 1993 International Conferences on Communications, 1064-1070, 1993. [26] D. Blackwell, L. Breiman, and A. J. Thomasian, “The capacities of certain channel classes under random coding,” Ann. Math. Stat., 31: 558-567, 1960. [27] R. E. Blahut, “Computation of channel capacity and rate distortion functions,” IEEE Trans. Inform. Theory, IT-18: 460-473, 1972. [28] R. E. Blahut, “Information bounds of the Fano-Kullback type,” IEEE Trans. Inform. Theory, IT-22: 410-421, 1976. [29] R. E. Blahut, Theory and Practice of Error Control Codes, Addison-Wesley, Reading, Massachusetts, 1983. [30] R. E. Blahut, Principles and Practice of Information Theory, Addison-Wesley, Reading, Massachusetts, 1987. D R A F T September 13, 2001, 6:27pm D R A F T 391 BIBLIOGRAPHY [31] R. E. Blahut, D. J. Costello, Jr., U. Maurer, and T. Mittelholzer, Ed., Communications and Cryptography: Two Sides of One Tapestry, Kluwer Academic Publishers, Boston, 1994. [32] C. Blundo, A. De Santis, R. De Simone, and U. Vaccaro, “Tight bounds on the information rate of secret sharing schemes,” Designs, Codes and Cryptography, 11: 107-110, 1997. [33] R. C. Bose and D. K. Ray-Chaudhuri, “On a class of error correcting binary group codes,” Inform. Contr., 3: 68-79, Mar. 1960. [34] L. Breiman, “The individual ergodic theorems of information theory,” Ann. Math. Stat., 28: 809-811, 1957. [35] M. 