A First Course in Information Theory

RAYMOND W. YEUNG
The Chinese University of Hong Kong

Kluwer Academic Publishers
Boston/Dordrecht/London

Contents

Preface
Acknowledgments
Foreword

1. THE SCIENCE OF INFORMATION

2. INFORMATION MEASURES
   2.1 Independence and Markov Chains
   2.2 Shannon's Information Measures
   2.3 Continuity of Shannon's Information Measures
   2.4 Chain Rules
   2.5 Informational Divergence
   2.6 The Basic Inequalities
   2.7 Some Useful Information Inequalities
   2.8 Fano's Inequality
   2.9 Entropy Rate of Stationary Source
   Problems
   Historical Notes

3. ZERO-ERROR DATA COMPRESSION
   3.1 The Entropy Bound
   3.2 Prefix Codes
   3.3 Redundancy of Prefix Codes
   Problems
   Historical Notes

4. WEAK TYPICALITY
   4.1 The Weak AEP
   4.2 The Source Coding Theorem
   4.3 Efficient Source Coding
   4.4 The Shannon-McMillan-Breiman Theorem
   Problems
   Historical Notes

5. STRONG TYPICALITY
   5.1 Strong AEP
   5.2 Strong Typicality Versus Weak Typicality
   5.3 Joint Typicality
   5.4 An Interpretation of the Basic Inequalities
   Problems
   Historical Notes

6. THE I-MEASURE
   6.1 Preliminaries
   6.2 The I-Measure for Two Random Variables
   6.3 Construction of the I-Measure μ*
   6.4 μ* Can be Negative
   6.5 Information Diagrams
   6.6 Examples of Applications
   Problems
   Historical Notes

7. MARKOV STRUCTURES
   7.1 Conditional Mutual Independence
   7.2 Full Conditional Mutual Independence
   7.3 Markov Random Field
   7.4 Markov Chain
   Problems
   Historical Notes

8. CHANNEL CAPACITY
   8.1 Discrete Memoryless Channels
   8.2 The Channel Coding Theorem
   8.3 The Converse
   8.4 Achievability of the Channel Capacity
   8.5 A Discussion
   8.6 Feedback Capacity
   8.7 Separation of Source and Channel Coding
   Problems
   Historical Notes

9. RATE DISTORTION THEORY
   9.1 Single-Letter Distortion Measures
   9.2 The Rate Distortion Function
   9.3 Rate Distortion Theorem
   9.4 The Converse
   9.5 Achievability of R_I(D)
   Problems
   Historical Notes

10. THE BLAHUT-ARIMOTO ALGORITHMS
   10.1 Alternating Optimization
   10.2 The Algorithms
   10.3 Convergence
   Problems
   Historical Notes

11. SINGLE-SOURCE NETWORK CODING
   11.1 A Point-to-Point Network
   11.2 What is Network Coding?
   11.3 A Network Code
   11.4 The Max-Flow Bound
   11.5 Achievability of the Max-Flow Bound
   Problems
   Historical Notes

12. INFORMATION INEQUALITIES
   12.1 The Region Γ*_n
   12.2 Information Expressions in Canonical Form
   12.3 A Geometrical Framework
   12.4 Equivalence of Constrained Inequalities
   12.5 The Implication Problem of Conditional Independence
   Problems
   Historical Notes

13. SHANNON-TYPE INEQUALITIES
   13.1 The Elemental Inequalities
   13.2 A Linear Programming Approach
   13.3 A Duality
   13.4 Machine Proving – ITIP
   13.5 Tackling the Implication Problem
   13.6 Minimality of the Elemental Inequalities
   Problems
   Historical Notes

14. BEYOND SHANNON-TYPE INEQUALITIES
   14.1 Characterizations of Γ*_2, Γ*_3, and Γ*_n
   14.2 A Non-Shannon-Type Unconstrained Inequality
   14.3 A Non-Shannon-Type Constrained Inequality
   14.4 Applications
   Problems
   Historical Notes
15. MULTI-SOURCE NETWORK CODING
   15.1 Two Characteristics
   15.2 Examples of Application
   15.3 A Network Code for Acyclic Networks
   15.4 An Inner Bound
   15.5 An Outer Bound
   15.6 The LP Bound and Its Tightness
   15.7 Achievability of R_in
   Problems
   Historical Notes

16. ENTROPY AND GROUPS
   16.1 Group Preliminaries
   16.2 Group-Characterizable Entropy Functions
   16.3 A Group Characterization of Γ*_n
   16.4 Information Inequalities and Group Inequalities
   Problems
   Historical Notes

Bibliography
Index

Preface

Cover and Thomas wrote a book on information theory [49] ten years ago which covers most of the major topics with considerable depth. Their book has since become the standard textbook in the field, and it was no doubt a remarkable success. Instead of writing another comprehensive textbook on the subject, which has become more difficult as new results keep emerging, my goal is to write a book on the fundamentals of the subject in a unified and coherent manner.

During the last ten years, significant progress has been made in understanding the entropy function and information inequalities of discrete random variables. The results along this direction are not only of core interest in information theory, but they also have applications in network coding theory and group theory, and possibly in physics. This book is an up-to-date treatment of information theory for discrete random variables, which forms the foundation of the theory at large. There are eight chapters on classical topics (Chapters 1, 2, 3, 4, 5, 8, 9, and 10), five chapters on fundamental tools (Chapters 6, 7, 12, 13, and 14), and three chapters on selected topics (Chapters 11, 15, and 16). The chapters are arranged according to the logical order instead of the chronological order of the results in the literature.

What is in this book

Out of the sixteen chapters in this book, the first thirteen chapters are basic topics, while the last three chapters are advanced topics for the more enthusiastic reader. A brief rundown of the chapters will give a better idea of what is in this book.

Chapter 1 is a very high level introduction to the nature of information theory and the main results in Shannon's original paper in 1948 which founded the field. There are also pointers to Shannon's biographies and his works.

Chapter 2 introduces Shannon's information measures and their basic properties. Useful identities and inequalities in information theory are derived and explained. Extra care is taken in handling joint distributions with zero probability masses. The chapter ends with a section on the entropy rate of a stationary information source.

Chapter 3 is a discussion of zero-error data compression by uniquely decodable codes, with prefix codes as a special case. A proof of the entropy bound for prefix codes which involves neither the Kraft inequality nor the fundamental inequality is given. This proof facilitates the discussion of the redundancy of prefix codes.

Chapter 4 is a thorough treatment of weak typicality. The weak asymptotic equipartition property and the source coding theorem are discussed. An explanation of the fact that a good data compression scheme produces almost i.i.d. bits is given. There is also a brief discussion of the Shannon-McMillan-Breiman theorem.
Chapter 5 introduces a new definition of strong typicality which does not involve the cardinalities of the alphabet sets. The treatment of strong typicality here is more detailed than that in Berger [21] but less abstract than that in Csiszár and Körner [52]. A new exponential convergence result is proved in Theorem 5.3.

Chapter 6 is an introduction to the theory of the I-Measure which establishes a one-to-one correspondence between Shannon's information measures and set theory. A number of examples are given to show how the use of information diagrams can simplify the proofs of many results in information theory. Most of these examples are previously unpublished. In particular, Example 6.15 is a generalization of Shannon's perfect secrecy theorem.

Chapter 7 explores the structure of the I-Measure for Markov structures. Set-theoretic characterizations of full conditional independence and Markov random field are discussed. The treatment of Markov random field here is perhaps too specialized for the average reader, but the structure of the I-Measure and the simplicity of the information diagram for a Markov chain are best explained as a special case of a Markov random field.

Chapter 8 consists of a new treatment of the channel coding theorem. Specifically, a graphical model approach is employed to explain the conditional independence of random variables. Great care is taken in discussing feedback.

Chapter 9 is an introduction to rate distortion theory. The results in this chapter are stronger than those in a standard treatment of the subject, although essentially the same techniques are used in the derivations.

In Chapter 10, the Blahut-Arimoto algorithms for computing channel capacity and the rate distortion function are discussed, and a simplified proof for convergence is given. Great care is taken in handling distributions with zero probability masses.

Chapter 11 is an introduction to network coding theory. The surprising fact that coding at the intermediate nodes can improve the throughput when an information source is multicast in a point-to-point network is explained. The max-flow bound for network coding with a single information source is explained in detail. Multi-source network coding will be discussed in Chapter 15 after the necessary tools are developed in the next three chapters.

Information inequalities are sometimes called the laws of information theory because they govern the impossibilities in information theory. In Chapter 12, the geometrical meaning of information inequalities and the relation between information inequalities and conditional independence are explained in depth. The framework for information inequalities discussed here is the basis of the next two chapters.

Chapter 13 explains how the problem of proving information inequalities can be formulated as a linear programming problem. This leads to a complete characterization of all information inequalities which can be proved by conventional techniques. These are called Shannon-type inequalities, which can now be proved by the software ITIP which comes with this book. It is also shown how Shannon-type inequalities can be used to tackle the implication problem of conditional independence in probability theory.

All information inequalities we used to know were Shannon-type inequalities. Recently, a few non-Shannon-type inequalities have been discovered. This means that there exist laws in information theory beyond those laid down by Shannon.
These inequalities and their applications are explained in depth in Chapter 14.

Network coding theory is further developed in Chapter 15. The situation when more than one information source are multicast in a point-to-point network is discussed. The surprising fact that a multi-source problem is not equivalent to a few single-source problems even when the information sources are mutually independent is clearly explained. Implicit and explicit bounds on the achievable coding rate region are discussed. These characterizations of the achievable coding rate region involve almost all the tools that have been developed earlier in the book, in particular, the framework for information inequalities.

Chapter 16 explains an intriguing relation between information theory and group theory. Specifically, for every information inequality satisfied by any joint distribution, there is a corresponding group inequality satisfied by any finite group and its subgroups, and vice versa. Inequalities of the latter type govern the orders of any finite group and their subgroups. Group-theoretic proofs of Shannon-type information inequalities are given. At the end of this chapter, a group inequality is obtained from a non-Shannon-type inequality discussed in Chapter 14. The meaning and the implication of this inequality are yet to be understood.

[Chart: the suggested reading order of the chapters, showing which chapters depend on which.]

How to use this book

You are recommended to read the chapters according to the above chart. However, you will not have too much difficulty jumping around in the book because there should be sufficient references to the previous relevant sections.

As a relatively slow thinker, I feel uncomfortable whenever I do not reason in the most explicit way. This probably has helped in writing this book, in which all the derivations are from first principles. In the book, I try to explain all the subtle mathematical details without sacrificing the big picture. Interpretations of the results are usually given before the proofs are presented. The book also contains a large number of examples. Unlike the examples in most books, which are supplementary, the examples in this book are essential.

This book can be used as a reference book or a textbook. For a two-semester course on information theory, this would be a suitable textbook for the first semester. This would also be a suitable textbook for a one-semester course if only information theory for discrete random variables is covered. If the instructor also wants to include topics on continuous random variables, this book can be used as a textbook or a reference book in conjunction with another suitable textbook. The instructor will find this book a good source for homework problems because many problems here do not appear in other textbooks. The instructor only needs to explain to the students the general idea, and they should be able to read the details by themselves.

Just like any other lengthy document, this book for sure contains errors and omissions. If you find any error or if you have any comment on the book, please email them to me at [email protected]. An errata will be maintained at the book website http://www.ie.cuhk.edu.hk/IT book/.

To my parents and my family

Acknowledgments

Writing a book is always a major undertaking.
This is especially so for a book on information theory, because it is extremely easy to make mistakes if one has not done research on a particular topic. Besides, the time needed for writing a book cannot be underestimated. Thanks to the generous support of a fellowship from the Croucher Foundation, I have had the luxury of taking a one-year leave in 2000-01 to write this book. I also thank my department for the support which made this arrangement possible.

There are many individuals who have directly or indirectly contributed to this book. First, I am indebted to Toby Berger who taught me information theory and writing. Venkat Anantharam, Dick Blahut, Dave Delchamps, Terry Fine, and Chris Heegard all had much influence on me when I was a graduate student at Cornell. I am most thankful to Zhen Zhang for his friendship and inspiration. Without the results obtained through our collaboration, this book would not be complete. I would also like to thank Julia Abrahams, Agnes and Vincent Chan, Tom Cover, Imre Csiszár, Bob Gallager, Bruce Hajek, Te Sun Han, Jim Massey, Alon Orlitsky, Shlomo Shamai, Sergio Verdú, Victor Wei, Frans Willems, and Jack Wolf for their encouragement throughout the years. In particular, Imre Csiszár has closely followed my work and has given invaluable feedback. I would also like to thank all the collaborators of my work for their contribution and all the anonymous reviewers for their useful comments.

During the process of writing this book, I received tremendous help from Ning Cai and Fangwei Fu in proofreading the manuscript. Lihua Song and Terence Chan helped compose the figures and gave useful suggestions. Without their help, it would have been quite impossible to finish the book within the set time frame. I also thank Andries Hekstra, Frank Kschischang, Siu-Wai Ho, Kingo Kobayashi, Amos Lapidoth, Li Ping, Prakash Narayan, Igal Sason, Emre Telatar, Chunxuan Ye, and Ken Zeger for their valuable inputs. The code for ITIP was written by Ying-On Yan, and Jack Lee helped modify the code and adapt it for the PC. Gordon Yeung helped on numerous Unix and LaTeX problems. The two-dimensional Venn diagram representation for four sets which is repeatedly used in the book was taught to me by Yong Nan Yeh.

On the domestic side, I am most grateful to my wife Rebecca for her love and support. My daughter Shannon has grown from an infant to a toddler during this time. She has added much joy to our family. I also thank my mother-in-law Mrs. Tsang and my sister-in-law Ophelia for coming over from time to time to take care of Shannon. Life would have been a lot more hectic without their generous help.

Foreword

Chapter 1

THE SCIENCE OF INFORMATION

In a communication system, we try to convey information from one point to another, very often in a noisy environment. Consider the following scenario. A secretary needs to send facsimiles regularly and she wants to convey as much information as possible on each page. She has a choice of the font size, which means that more characters can be squeezed onto a page if a smaller font size is used. In principle, she can squeeze as many characters as desired on a page by using a small enough font size. However, there are two factors in the system which may cause errors. First, the fax machine has a finite resolution.
Second, the characters transmitted may be received incorrectly due to noise in the telephone line. Therefore, if the font size is too small, the characters may not be recognizable on the facsimile. On the other hand, although some characters on the facsimile may not be recognizable, the recipient can still figure out the words from the context provided that the number of such characters is not excessive. In other words, it is not necessary to choose a font size such that all the characters on the facsimile are recognizable almost surely. Then we are motivated to ask: What is the maximum amount of meaningful information which can be conveyed on one page of facsimile?

This question may not have a definite answer because it is not very well posed. In particular, we do not have a precise measure of meaningful information. Nevertheless, this question is an illustration of the kind of fundamental questions we can ask about a communication system.

Information, which is not a physical entity but an abstract concept, is hard to quantify in general. This is especially the case if human factors are involved when the information is utilized. For example, when we play Beethoven's violin concerto from a compact disc, we receive the musical information from the loudspeakers. We enjoy this information because it arouses certain kinds of emotion within ourselves. While we receive the same information every time we play the same piece of music, the kinds of emotions aroused may be different from time to time because they depend on our mood at that particular moment. In other words, we can derive utility from the same information every time in a different way. For this reason, it is extremely difficult to devise a measure which can quantify the amount of information contained in a piece of music.

In 1948, Bell Telephone Laboratories scientist Claude E. Shannon (1916-2001) published a paper entitled "A Mathematical Theory of Communication" [173] which laid the foundation of an important field now known as information theory. In his paper, the model of a point-to-point communication system depicted in Figure 1.1 is considered.

[Figure 1.1. Schematic diagram for a general point-to-point communication system: an information source produces a message, which the transmitter converts into a signal; the signal may be corrupted by a noise source, and the receiver converts the received signal back into a message for the destination.]

In this model, a message is generated by the information source. The message is converted by the transmitter into a signal which is suitable for transmission. In the course of transmission, the signal may be contaminated by a noise source, so that the received signal may be different from the transmitted signal. Based on the received signal, the receiver then makes an estimate of the message and delivers it to the destination.

In this abstract model of a point-to-point communication system, one is only concerned about whether the message generated by the source can be delivered correctly to the receiver without worrying about how the message is actually used by the receiver. In a way, Shannon's model does not cover all the aspects of a communication system. However, in order to develop a precise and useful theory of information, the scope of the theory has to be restricted.

In [173], Shannon introduced two fundamental concepts about 'information' from the communication point of view. First, information is uncertainty.
More specifically, if a piece of information we are interested in is deterministic, then it has no value at all because it is already known with no uncertainty. From this point of view, for example, the continuous transmission of a still picture on a television broadcast channel is superfluous. Consequently, an information source is naturally modeled as a random variable or a random process, and probability is employed to develop the theory of information. Second, information to be transmitted is digital. This means that the information source should first be converted into a stream of 0's and 1's called bits, and the remaining task is to deliver these bits to the receiver correctly with no reference to their actual meaning. This is the foundation of all modern digital communication systems. In fact, this work of Shannon appears to contain the first published use of the term bit, which stands for binary digit.

In the same work, Shannon also proved two important theorems. The first theorem, called the source coding theorem, introduces entropy as the fundamental measure of information which characterizes the minimum rate of a source code representing an information source essentially free of error. The source coding theorem is the theoretical basis for lossless data compression¹. The second theorem, called the channel coding theorem, concerns communication through a noisy channel. It was shown that associated with every noisy channel is a parameter, called the capacity, which is strictly positive except for very special channels, such that information can be communicated reliably through the channel as long as the information rate is less than the capacity. These two theorems, which give fundamental limits in point-to-point communication, are the two most important results in information theory.

¹A data compression scheme is lossless if the data can be recovered with an arbitrarily small probability of error.

In science, we study the laws of Nature which must be obeyed by any physical system. These laws are used by engineers to design systems to achieve specific goals. Therefore, science is the foundation of engineering. Without science, engineering can only be done by trial and error. In information theory, we study the fundamental limits in communication regardless of the technologies involved in the actual implementation of the communication systems. These fundamental limits are not only used as guidelines by communication engineers, but they also give insights into what optimal coding schemes are like. Information theory is therefore the science of information.

Since Shannon published his original paper in 1948, information theory has been developed into a major research field in both communication theory and applied probability. After more than half a century's research, it is quite impossible for a book on the subject to cover all the major topics with considerable depth. This book is a modern treatment of information theory for discrete random variables, which is the foundation of the theory at large. The book consists of two parts. The first part, namely Chapter 1 to Chapter 13, is a thorough discussion of the basic topics in information theory, including fundamental results, tools, and algorithms. The second part, namely Chapter 14 to Chapter 16, is a selection of advanced topics which demonstrate the use of the tools
developed in the first part of the book. The topics discussed in this part of the book also represent new research directions in the field.

An undergraduate level course on probability is the only prerequisite for this book. For a non-technical introduction to information theory, we refer the reader to Encyclopedia Britannica [62]. In fact, we strongly recommend the reader to first read this excellent introduction before starting this book. For biographies of Claude Shannon, a legend of the 20th Century who made fundamental contributions to the Information Age, we refer the readers to [36] and [185]. The latter is also a complete collection of Shannon's papers.

Unlike most branches of applied mathematics in which physical systems are studied, abstract systems of communication are studied in information theory. In reading this book, it is not unusual for a beginner to be able to understand all the steps in a proof but have no idea what the proof is leading to. The best way to learn information theory is to study the materials first and come back at a later time. Many results in information theory are rather subtle, to the extent that an expert in the subject may from time to time realize that his/her understanding of certain basic results has been inadequate or even incorrect. While a novice should expect to raise his/her level of understanding of the subject by reading this book, he/she should not be discouraged to find, after finishing the book, that there are actually more things yet to be understood. In fact, this is exactly the challenge and the beauty of information theory.

Chapter 2

INFORMATION MEASURES

Shannon's information measures refer to entropy, conditional entropy, mutual information, and conditional mutual information. They are the most important measures of information in information theory. In this chapter, we introduce these measures and prove some of their basic properties. The physical meaning of these measures will be discussed in depth in subsequent chapters. We then introduce informational divergence which measures the distance between two probability distributions and prove some useful inequalities in information theory. The chapter ends with a section on the entropy rate of a stationary information source.

2.1 INDEPENDENCE AND MARKOV CHAINS

We begin our discussion in this chapter by reviewing two basic notions in probability: independence of random variables and Markov chain. All the random variables in this book are discrete.

Let $X$ be a random variable taking values in an alphabet $\mathcal{X}$. The probability distribution for $X$ is denoted as $\{p_X(x), x \in \mathcal{X}\}$, with $p_X(x) = \Pr\{X = x\}$. When there is no ambiguity, $p_X(x)$ will be abbreviated as $p(x)$, and $\{p_X(x)\}$ will be abbreviated as $p(x)$. The support of $X$, denoted by $S_X$, is the set of all $x \in \mathcal{X}$ such that $p(x) > 0$. If $S_X = \mathcal{X}$, we say that $p$ is strictly positive. Otherwise, we say that $p$ is not strictly positive, or $p$ contains zero probability masses. All the above notations naturally extend to two or more random variables. As we will see, probability distributions with zero probability masses are very delicate in general, and they need to be handled with great care.
Definition 2.1 (Independence) Two random variables $X$ and $Y$ are independent, denoted by $X \perp Y$, if

$$p(x,y) = p(x)p(y) \tag{2.1}$$

for all $x$ and $y$ (i.e., for all $(x,y) \in \mathcal{X} \times \mathcal{Y}$).

For more than two random variables, we distinguish between two types of independence.

Definition 2.2 (Mutual Independence) For $n \ge 3$, random variables $X_1, X_2, \ldots, X_n$ are mutually independent if

$$p(x_1, x_2, \ldots, x_n) = p(x_1)p(x_2)\cdots p(x_n) \tag{2.2}$$

for all $x_1, x_2, \ldots, x_n$.

Definition 2.3 (Pairwise Independence) For $n \ge 3$, random variables $X_1, X_2, \ldots, X_n$ are pairwise independent if $X_i$ and $X_j$ are independent for all $1 \le i < j \le n$.

Note that mutual independence implies pairwise independence. We leave it as an exercise for the reader to show that the converse is not true.

Definition 2.4 (Conditional Independence) For random variables $X$, $Y$, and $Z$, $X$ is independent of $Z$ conditioning on $Y$, denoted by $X \perp Z \mid Y$, if

$$p(x,y,z)p(y) = p(x,y)p(y,z) \tag{2.3}$$

for all $x$, $y$, and $z$, or equivalently,

$$p(x,y,z) = \begin{cases} p(x,y)p(z|y) & \text{if } p(y) > 0 \\ 0 & \text{otherwise.} \end{cases} \tag{2.4}$$

The first definition of conditional independence above is sometimes more convenient to use because it is not necessary to distinguish between the cases $p(y) > 0$ and $p(y) = 0$. However, the physical meaning of conditional independence is more explicit in the second definition.

Proposition 2.5 For random variables $X$, $Y$, and $Z$, $X \perp Z \mid Y$ if and only if

$$p(x,y,z) = a(x,y)b(y,z) \tag{2.5}$$

for all $x$, $y$, and $z$ such that $p(y) > 0$.

Proof The 'only if' part follows immediately from the definition of conditional independence in (2.4), so we will only prove the 'if' part. Assume

$$p(x,y,z) = a(x,y)b(y,z) \tag{2.6}$$

for all $x$, $y$, and $z$ such that $p(y) > 0$. Then

$$p(x,y) = \sum_z p(x,y,z) = \sum_z a(x,y)b(y,z) = a(x,y)\sum_z b(y,z), \tag{2.7}$$

$$p(y,z) = \sum_x p(x,y,z) = \sum_x a(x,y)b(y,z) = b(y,z)\sum_x a(x,y), \tag{2.8}$$

and

$$p(y) = \sum_z p(y,z) = \left(\sum_x a(x,y)\right)\left(\sum_z b(y,z)\right). \tag{2.9}$$

Therefore,

$$\frac{p(x,y)p(y,z)}{p(y)} = \frac{\left(a(x,y)\sum_z b(y,z)\right)\left(b(y,z)\sum_x a(x,y)\right)}{\left(\sum_x a(x,y)\right)\left(\sum_z b(y,z)\right)} \tag{2.10}$$
$$= a(x,y)b(y,z) \tag{2.11}$$
$$= p(x,y,z). \tag{2.12}$$

Hence, $X \perp Z \mid Y$. The proof is accomplished.

Definition 2.6 (Markov Chain) For random variables $X_1, X_2, \ldots, X_n$, where $n \ge 3$, $X_1 \rightarrow X_2 \rightarrow \cdots \rightarrow X_n$ forms a Markov chain if

$$p(x_1, x_2, \ldots, x_n)p(x_2)p(x_3)\cdots p(x_{n-1}) = p(x_1,x_2)p(x_2,x_3)\cdots p(x_{n-1},x_n) \tag{2.13}$$

for all $x_1, x_2, \ldots, x_n$, or equivalently,

$$p(x_1, x_2, \ldots, x_n) = \begin{cases} p(x_1,x_2)p(x_3|x_2)\cdots p(x_n|x_{n-1}) & \text{if } p(x_2), p(x_3), \ldots, p(x_{n-1}) > 0 \\ 0 & \text{otherwise.} \end{cases} \tag{2.14}$$

We note that $X \perp Z \mid Y$ is equivalent to the Markov chain $X \rightarrow Y \rightarrow Z$.

Proposition 2.7 $X_1 \rightarrow X_2 \rightarrow \cdots \rightarrow X_n$ forms a Markov chain if and only if $X_n \rightarrow X_{n-1} \rightarrow \cdots \rightarrow X_1$ forms a Markov chain.

Proof This follows directly from the symmetry in the definition of a Markov chain in (2.13).

In the following, we state two basic properties of a Markov chain. The proofs are left as an exercise.

Proposition 2.8 $X_1 \rightarrow X_2 \rightarrow \cdots \rightarrow X_n$ forms a Markov chain if and only if

$$X_1 \rightarrow X_2 \rightarrow X_3$$
$$(X_1, X_2) \rightarrow X_3 \rightarrow X_4 \tag{2.15}$$
$$\vdots$$
$$(X_1, X_2, \ldots, X_{n-2}) \rightarrow X_{n-1} \rightarrow X_n$$

form Markov chains.
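To make Definition 2.4 concrete, here is a minimal Python sketch (not from the text; the helper name, the example distribution, and the tolerance are illustrative choices) that checks the factorization $p(x,y,z)p(y) = p(x,y)p(y,z)$ directly from a joint probability mass function stored as a dictionary.

```python
from itertools import product
from collections import defaultdict

def is_cond_indep(p_xyz, tol=1e-12):
    """Check X ⊥ Z | Y for a joint pmf given as {(x, y, z): prob}."""
    p_xy, p_yz, p_y = defaultdict(float), defaultdict(float), defaultdict(float)
    for (x, y, z), pr in p_xyz.items():
        p_xy[(x, y)] += pr
        p_yz[(y, z)] += pr
        p_y[y] += pr
    xs = {x for x, _, _ in p_xyz}
    ys = {y for _, y, _ in p_xyz}
    zs = {z for _, _, z in p_xyz}
    # Definition 2.4: p(x,y,z) p(y) = p(x,y) p(y,z) for all x, y, z.
    return all(abs(p_xyz.get((x, y, z), 0.0) * p_y[y] - p_xy[(x, y)] * p_yz[(y, z)]) < tol
               for x, y, z in product(xs, ys, zs))

# Example: X and Z i.i.d. fair bits, Y = X, so X ⊥ Z | Y holds trivially.
pmf = {(x, x, z): 0.25 for x in (0, 1) for z in (0, 1)}
print(is_cond_indep(pmf))  # True
```

Working with the first form of the definition avoids dividing by $p(y)$, which is exactly the point made in the text about distributions with zero probability masses.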
Proposition 2.9 $X_1 \rightarrow X_2 \rightarrow \cdots \rightarrow X_n$ forms a Markov chain if and only if

$$p(x_1, x_2, \ldots, x_n) = f_1(x_1,x_2)f_2(x_2,x_3)\cdots f_{n-1}(x_{n-1},x_n) \tag{2.16}$$

for all $x_1, x_2, \ldots, x_n$ such that $p(x_2), p(x_3), \ldots, p(x_{n-1}) > 0$.

Note that Proposition 2.9 is a generalization of Proposition 2.5. From Proposition 2.9, one can prove the following important property of a Markov chain. Again, the details are left as an exercise.

Proposition 2.10 (Markov subchains) Let $\mathcal{N}_n = \{1, 2, \ldots, n\}$ and let $X_1 \rightarrow X_2 \rightarrow \cdots \rightarrow X_n$ form a Markov chain. For any subset $\alpha$ of $\mathcal{N}_n$, denote $(X_i, i \in \alpha)$ by $X_\alpha$. Then for any disjoint subsets $\alpha_1, \alpha_2, \ldots, \alpha_m$ of $\mathcal{N}_n$ such that

$$k_1 < k_2 < \cdots < k_m \tag{2.17}$$

for all $k_j \in \alpha_j$, $j = 1, 2, \ldots, m$,

$$X_{\alpha_1} \rightarrow X_{\alpha_2} \rightarrow \cdots \rightarrow X_{\alpha_m} \tag{2.18}$$

forms a Markov chain. That is, a subchain of $X_1 \rightarrow X_2 \rightarrow \cdots \rightarrow X_n$ is also a Markov chain.

Example 2.11 Let $X_1 \rightarrow X_2 \rightarrow \cdots \rightarrow X_{10}$ form a Markov chain and $\alpha_1 = \{1,2\}$, $\alpha_2 = \{4\}$, $\alpha_3 = \{6,7,8\}$, and $\alpha_4 = \{10\}$ be subsets of $\mathcal{N}_{10}$. Then Proposition 2.10 says that

$$X_{\alpha_1} \rightarrow X_{\alpha_2} \rightarrow X_{\alpha_3} \rightarrow X_{\alpha_4} \tag{2.19}$$

also forms a Markov chain.

We have been very careful in handling probability distributions with zero probability masses. In the rest of the section, we show that such distributions are very delicate in general. We first prove the following property of a strictly positive probability distribution involving four random variables¹.

¹Proposition 2.12 is called the intersection axiom in Bayesian networks. See [151].

Proposition 2.12 Let $X_1, X_2, X_3$, and $X_4$ be random variables such that $p(x_1,x_2,x_3,x_4)$ is strictly positive. Then

$$\left.\begin{array}{l} X_1 \perp X_4 \mid (X_2, X_3) \\ X_1 \perp X_3 \mid (X_2, X_4) \end{array}\right\} \;\Rightarrow\; X_1 \perp (X_3, X_4) \mid X_2. \tag{2.20}$$

Proof If $X_1 \perp X_4 \mid (X_2, X_3)$, we have

$$p(x_1,x_2,x_3,x_4) = \frac{p(x_1,x_2,x_3)p(x_2,x_3,x_4)}{p(x_2,x_3)}. \tag{2.21}$$

On the other hand, if $X_1 \perp X_3 \mid (X_2, X_4)$, we have

$$p(x_1,x_2,x_3,x_4) = \frac{p(x_1,x_2,x_4)p(x_2,x_3,x_4)}{p(x_2,x_4)}. \tag{2.22}$$

Equating (2.21) and (2.22), we have

$$p(x_1,x_2,x_3) = \frac{p(x_1,x_2,x_4)p(x_2,x_3)}{p(x_2,x_4)}. \tag{2.23}$$

Then

$$p(x_1,x_2) = \sum_{x_3} p(x_1,x_2,x_3) \tag{2.24}$$
$$= \sum_{x_3} \frac{p(x_1,x_2,x_4)p(x_2,x_3)}{p(x_2,x_4)} \tag{2.25}$$
$$= \frac{p(x_1,x_2,x_4)p(x_2)}{p(x_2,x_4)}, \tag{2.26}$$

or

$$\frac{p(x_1,x_2,x_4)}{p(x_2,x_4)} = \frac{p(x_1,x_2)}{p(x_2)}. \tag{2.27}$$

Hence from (2.22),

$$p(x_1,x_2,x_3,x_4) = \frac{p(x_1,x_2,x_4)p(x_2,x_3,x_4)}{p(x_2,x_4)} = \frac{p(x_1,x_2)p(x_2,x_3,x_4)}{p(x_2)} \tag{2.28}$$

for all $x_1, x_2, x_3$, and $x_4$, i.e., $X_1 \perp (X_3, X_4) \mid X_2$.

If $p(x_1,x_2,x_3,x_4) = 0$ for some $x_1, x_2, x_3$, and $x_4$, i.e., $p$ is not strictly positive, the arguments in the above proof are not valid. In fact, the proposition may not hold in this case. For instance, let $X_1 = Y$, $X_2 = Z$, and $X_3 = X_4 = (Y, Z)$, where $Y$ and $Z$ are independent random variables. Then $X_1 \perp X_4 \mid (X_2, X_3)$ and $X_1 \perp X_3 \mid (X_2, X_4)$, but $X_1 \not\perp (X_3, X_4) \mid X_2$. Note that for this construction, $p$ is not strictly positive because $p(x_1,x_2,x_3,x_4) = 0$ if $x_3 \neq (x_1, x_2)$ or $x_4 \neq (x_1, x_2)$.
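The counterexample above can also be checked numerically. The sketch below is only an illustration under the stated construction ($Y, Z$ i.i.d. fair bits, $X_1 = Y$, $X_2 = Z$, $X_3 = X_4 = (Y,Z)$); the helper functions are hypothetical and not part of the text.

```python
from collections import defaultdict
from itertools import product

# Joint pmf of (X1, X2, X3, X4) with Y, Z i.i.d. fair bits,
# X1 = Y, X2 = Z, X3 = X4 = (Y, Z).
joint = {(y, z, (y, z), (y, z)): 0.25 for y in (0, 1) for z in (0, 1)}

def marginal(p, idx):
    """Marginalize a joint pmf onto the coordinates listed in idx."""
    q = defaultdict(float)
    for outcome, pr in p.items():
        q[tuple(outcome[i] for i in idx)] += pr
    return q

def cond_indep(p, a, b, c):
    """Check X_a ⊥ X_b | X_c via p(a,b,c) p(c) = p(a,c) p(b,c)."""
    p_abc = marginal(p, a + b + c)
    p_ac, p_bc, p_c = marginal(p, a + c), marginal(p, b + c), marginal(p, c)
    la, lb, lc = len(a), len(b), len(c)
    keys = set(p_abc)
    # Also test zero-probability combinations of the observed values.
    vals = [set(k[i] for k in keys) for i in range(la + lb + lc)]
    for combo in product(*vals):
        av, bv, cv = combo[:la], combo[la:la + lb], combo[la + lb:]
        lhs = p_abc.get(combo, 0.0) * p_c.get(cv, 0.0)
        rhs = p_ac.get(av + cv, 0.0) * p_bc.get(bv + cv, 0.0)
        if abs(lhs - rhs) > 1e-12:
            return False
    return True

print(cond_indep(joint, (0,), (3,), (1, 2)))  # X1 ⊥ X4 | (X2, X3): True
print(cond_indep(joint, (0,), (2,), (1, 3)))  # X1 ⊥ X3 | (X2, X4): True
print(cond_indep(joint, (0,), (2, 3), (1,)))  # X1 ⊥ (X3, X4) | X2: False
```

The first two checks succeed while the third fails, which is exactly the failure of Proposition 2.12 for a distribution that is not strictly positive.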
The above example is somewhat counter-intuitive because it appears that Proposition 2.12 should hold for all probability distributions via a continuity argument. Specifically, such an argument goes like this. For any distribution $p$, let $\{p_k\}$ be a sequence of strictly positive distributions such that $p_k \rightarrow p$ and $p_k$ satisfies (2.21) and (2.22) for all $k$, i.e.,

$$p_k(x_1,x_2,x_3,x_4)p_k(x_2,x_3) = p_k(x_1,x_2,x_3)p_k(x_2,x_3,x_4) \tag{2.29}$$

and

$$p_k(x_1,x_2,x_3,x_4)p_k(x_2,x_4) = p_k(x_1,x_2,x_4)p_k(x_2,x_3,x_4). \tag{2.30}$$

Then by the proposition, $p_k$ also satisfies (2.28), i.e.,

$$p_k(x_1,x_2,x_3,x_4)p_k(x_2) = p_k(x_1,x_2)p_k(x_2,x_3,x_4). \tag{2.31}$$

Letting $k \rightarrow \infty$, we have

$$p(x_1,x_2,x_3,x_4)p(x_2) = p(x_1,x_2)p(x_2,x_3,x_4) \tag{2.32}$$

for all $x_1, x_2, x_3$, and $x_4$, i.e., $X_1 \perp (X_3, X_4) \mid X_2$. Such an argument would be valid if there always exists a sequence $\{p_k\}$ as prescribed. However, the existence of the distribution $p(x_1,x_2,x_3,x_4)$ constructed immediately after Proposition 2.12 simply says that it is not always possible to find such a sequence. Therefore, probability distributions which are not strictly positive can be very delicate. In fact, these distributions have very complicated conditional independence structures. For a complete characterization of the conditional independence structures of strictly positive distributions, we refer the reader to Chan and Yeung [41].

2.2 SHANNON'S INFORMATION MEASURES

We begin this section by introducing the entropy of a random variable. As we will see shortly, all Shannon's information measures can be expressed as linear combinations of entropies.

Definition 2.13 The entropy $H(X)$ of a random variable $X$ is defined by

$$H(X) = -\sum_x p(x)\log p(x). \tag{2.33}$$

In the above and subsequent definitions of information measures, we adopt the convention that summation is taken over the corresponding support. Such a convention is necessary because $\log p(x)$ in (2.33) is undefined if $p(x) = 0$. The base of the logarithm in (2.33) can be chosen to be any convenient real number greater than 1. We write $H(X)$ as $H_\alpha(X)$ when the base of the logarithm is $\alpha$. When the base of the logarithm is 2, the unit for entropy is the bit. When the base of the logarithm is an integer $D \ge 2$, the unit for entropy is the $D$-it ($D$-ary digit). In the context of source coding, the base is usually taken to be the size of the code alphabet. This will be discussed in Chapter 3.

In computer science, a bit means an entity which can take the value 0 or 1. In information theory, a bit is a measure of information, which depends on the probabilities of occurrence of 0 and 1. The reader should distinguish these two meanings of a bit from each other carefully.

Let $g(X)$ be any function of a random variable $X$. We will denote the expectation of $g(X)$ by $Eg(X)$, i.e.,

$$Eg(X) = \sum_x p(x)g(x), \tag{2.34}$$

where the summation is over $S_X$. Then the definition of the entropy of a random variable $X$ can be written as

$$H(X) = -E\log p(X). \tag{2.35}$$

Expressions of Shannon's information measures in terms of expectations will be useful in subsequent discussions.

The entropy $H(X)$ of a random variable $X$ is a functional of the probability distribution $p(x)$ which measures the amount of information contained in $X$, or equivalently, the amount of uncertainty removed upon revealing the outcome of $X$. Note that $H(X)$ depends only on $p(x)$ but not on the actual values in $\mathcal{X}$. For $0 \le p \le 1$, define

$$h_b(p) = -p\log p - (1-p)\log(1-p) \tag{2.36}$$

with the convention $0\log 0 = 0$, so that $h_b(0) = h_b(1) = 0$. With this convention, $h_b(p)$ is continuous at $p = 0$ and $p = 1$. $h_b$ is called the binary entropy function. For a binary random variable $X$ with distribution $\{p, 1-p\}$,

$$H(X) = h_b(p). \tag{2.37}$$
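As a quick numerical companion (not from the text; the example distribution and function names are arbitrary), the following sketch computes $H(X)$ in bits, with the convention that the sum is taken over the support, together with the binary entropy function $h_b$.

```python
from math import log2

def entropy(p):
    """H(X) in bits for a sequence of probabilities; terms with p(x) = 0 are skipped."""
    return -sum(px * log2(px) for px in p if px > 0)

def h_b(p):
    """Binary entropy function h_b(p), using the convention 0 log 0 = 0."""
    return entropy([p, 1.0 - p])

print(entropy([0.5, 0.25, 0.25]))   # 1.5 bits
print(h_b(0.5))                     # 1.0, the maximum of h_b
print(h_b(0.0), h_b(1.0))           # both 0, by the convention 0 log 0 = 0
```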
[Figure 2.1. The binary entropy function $h_b(p)$ versus $p$ in the base 2.]

Figure 2.1 shows the graph of $h_b(p)$ versus $p$ in the base 2. Note that $h_b(p)$ achieves the maximum value 1 when $p = \frac{1}{2}$.

For two random variables, their joint entropy is defined as below. Extension of this definition to more than two random variables is straightforward.

Definition 2.14 The joint entropy $H(X,Y)$ of a pair of random variables $X$ and $Y$ is defined by

$$H(X,Y) = -\sum_{x,y} p(x,y)\log p(x,y) = -E\log p(X,Y). \tag{2.38}$$

Definition 2.15 For random variables $X$ and $Y$, the conditional entropy of $Y$ given $X$ is defined by

$$H(Y|X) = -\sum_{x,y} p(x,y)\log p(y|x) = -E\log p(Y|X). \tag{2.39}$$

From (2.39), we can write

$$H(Y|X) = \sum_x p(x)\left[-\sum_y p(y|x)\log p(y|x)\right]. \tag{2.40}$$

The inner sum is the entropy of $Y$ conditioning on a fixed $x \in S_X$. Thus we are motivated to express $H(Y|X)$ as

$$H(Y|X) = \sum_x p(x)H(Y|X=x), \tag{2.41}$$

where

$$H(Y|X=x) = -\sum_y p(y|x)\log p(y|x). \tag{2.42}$$

Similarly, for $H(Y|X,Z)$, we write

$$H(Y|X,Z) = \sum_z p(z)H(Y|X, Z=z), \tag{2.43}$$

where

$$H(Y|X, Z=z) = -\sum_{x,y} p(x,y|z)\log p(y|x,z). \tag{2.44}$$

Proposition 2.16

$$H(X,Y) = H(X) + H(Y|X) \tag{2.45}$$

and

$$H(X,Y) = H(Y) + H(X|Y). \tag{2.46}$$

Proof Consider

$$H(X,Y) = -E\log p(X,Y) \tag{2.47}$$
$$= -E\log[p(X)p(Y|X)] \tag{2.48}$$
$$= -E\log p(X) - E\log p(Y|X) \tag{2.49}$$
$$= H(X) + H(Y|X). \tag{2.50}$$

Note that (2.48) is justified because the summation of the expectation is over $S_{XY}$, and we have used the linearity of expectation² to obtain (2.49). This proves (2.45), and (2.46) follows by symmetry.

²See Problem 5 at the end of the chapter.

This proposition has the following interpretation. Consider revealing the outcome of $X$ and $Y$ in two steps: first the outcome of $X$ and then the outcome of $Y$. Then the proposition says that the total amount of uncertainty removed upon revealing both $X$ and $Y$ is equal to the sum of the uncertainty removed upon revealing $X$ (uncertainty removed in the first step) and the uncertainty removed upon revealing $Y$ once $X$ is known (uncertainty removed in the second step).

Definition 2.17 For random variables $X$ and $Y$, the mutual information between $X$ and $Y$ is defined by

$$I(X;Y) = \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)} = E\log\frac{p(X,Y)}{p(X)p(Y)}. \tag{2.51}$$

Remark $I(X;Y)$ is symmetrical in $X$ and $Y$.

Proposition 2.18 The mutual information between a random variable $X$ and itself is equal to the entropy of $X$, i.e., $I(X;X) = H(X)$.

Proof This can be seen by considering

$$I(X;X) = E\log\frac{p(X)}{p(X)^2} \tag{2.52}$$
$$= -E\log p(X) \tag{2.53}$$
$$= H(X). \tag{2.54}$$

The proposition is proved.

Remark The entropy of $X$ is sometimes called the self-information of $X$.

Proposition 2.19

$$I(X;Y) = H(X) - H(X|Y), \tag{2.55}$$
$$I(X;Y) = H(Y) - H(Y|X), \tag{2.56}$$

and

$$I(X;Y) = H(X) + H(Y) - H(X,Y). \tag{2.57}$$

The proof of this proposition is left as an exercise.

From (2.55), we can interpret $I(X;Y)$ as the reduction in uncertainty about $X$ when $Y$ is given, or equivalently, the amount of information about $X$ provided by $Y$. Since $I(X;Y)$ is symmetrical in $X$ and $Y$, from (2.56), we can as well interpret $I(X;Y)$ as the amount of information about $Y$ provided by $X$.
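The identities above are easy to sanity-check numerically. Here is a small sketch (illustrative only; the joint distribution is an arbitrary choice) that computes $H(X)$, $H(Y)$, $H(X,Y)$, $H(Y|X)$, and $I(X;Y)$ from a joint pmf and confirms (2.45) and (2.57).

```python
from math import log2
from collections import defaultdict

def H(p):
    """Entropy in bits of a pmf given as a dict {outcome: probability}."""
    return -sum(v * log2(v) for v in p.values() if v > 0)

# An arbitrary joint pmf p(x, y) on {0,1} x {0,1}.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

p_x, p_y = defaultdict(float), defaultdict(float)
for (x, y), v in p_xy.items():
    p_x[x] += v
    p_y[y] += v

# H(Y|X) from Definition 2.15 and I(X;Y) from Definition 2.17.
H_y_given_x = -sum(v * log2(v / p_x[x]) for (x, y), v in p_xy.items() if v > 0)
I_xy = sum(v * log2(v / (p_x[x] * p_y[y])) for (x, y), v in p_xy.items() if v > 0)

print(abs(H(p_xy) - (H(p_x) + H_y_given_x)) < 1e-12)    # (2.45): H(X,Y) = H(X) + H(Y|X)
print(abs(I_xy - (H(p_x) + H(p_y) - H(p_xy))) < 1e-12)  # (2.57): I(X;Y) = H(X) + H(Y) - H(X,Y)
```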
The relations between the (joint) entropies, conditional entropies, and mutual information for two random variables $X$ and $Y$ are given in Propositions 2.16 and 2.19. These relations can be summarized by the diagram in Figure 2.2, which is a variation of the Venn diagram³. One can check that all the relations between Shannon's information measures for $X$ and $Y$ which are shown in Figure 2.2 are consistent with the relations given in Propositions 2.16 and 2.19. This one-to-one correspondence between Shannon's information measures and set theory is not just a coincidence for two random variables. We will discuss this in depth when we introduce the I-Measure in Chapter 6.

³The rectangle representing the universal set in a usual Venn diagram is missing in Figure 2.2.

[Figure 2.2. Relationship between entropies and mutual information for two random variables: two overlapping regions representing $H(X)$ and $H(Y)$, with $H(X|Y)$, $I(X;Y)$, and $H(Y|X)$ as the three parts and $H(X,Y)$ as their union.]

Analogous to entropy, there is a conditional version of mutual information called conditional mutual information.

Definition 2.20 For random variables $X$, $Y$ and $Z$, the mutual information between $X$ and $Y$ conditioning on $Z$ is defined by

$$I(X;Y|Z) = \sum_{x,y,z} p(x,y,z)\log\frac{p(x,y|z)}{p(x|z)p(y|z)} = E\log\frac{p(X,Y|Z)}{p(X|Z)p(Y|Z)}. \tag{2.58}$$

Remark $I(X;Y|Z)$ is symmetrical in $X$ and $Y$.

Analogous to conditional entropy, we write

$$I(X;Y|Z) = \sum_z p(z)I(X;Y|Z=z), \tag{2.59}$$

where

$$I(X;Y|Z=z) = \sum_{x,y} p(x,y|z)\log\frac{p(x,y|z)}{p(x|z)p(y|z)}. \tag{2.60}$$

Similarly, when conditioning on two random variables, we write

$$I(X;Y|Z,T) = \sum_t p(t)I(X;Y|Z, T=t), \tag{2.61}$$

where

$$I(X;Y|Z, T=t) = \sum_{x,y,z} p(x,y,z|t)\log\frac{p(x,y|z,t)}{p(x|z,t)p(y|z,t)}. \tag{2.62}$$

Conditional mutual information satisfies the same set of relations given in Propositions 2.18 and 2.19 for mutual information except that all the terms are now conditioned on a random variable $Z$. We state these relations in the next two propositions. The proofs are omitted.

Proposition 2.21 The mutual information between a random variable $X$ and itself conditioning on a random variable $Z$ is equal to the conditional entropy of $X$ given $Z$, i.e., $I(X;X|Z) = H(X|Z)$.

Proposition 2.22

$$I(X;Y|Z) = H(X|Z) - H(X|Y,Z), \tag{2.63}$$
$$I(X;Y|Z) = H(Y|Z) - H(Y|X,Z), \tag{2.64}$$

and

$$I(X;Y|Z) = H(X|Z) + H(Y|Z) - H(X,Y|Z). \tag{2.65}$$
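As with the unconditional case, these conditional identities can be checked numerically. The sketch below (an illustration, not from the text; the joint pmf and helper names are arbitrary) computes $I(X;Y|Z)$ directly from Definition 2.20 and compares it with the entropy expression in (2.65).

```python
from math import log2
from collections import defaultdict

def marginal(p, idx):
    q = defaultdict(float)
    for k, v in p.items():
        q[tuple(k[i] for i in idx)] += v
    return q

def H(p):
    return -sum(v * log2(v) for v in p.values() if v > 0)

# Arbitrary joint pmf p(x, y, z) on {0,1}^3 (weights normalized to sum to 1).
w = {(x, y, z): 1 + x + 2*y + 3*z for x in (0, 1) for y in (0, 1) for z in (0, 1)}
total = sum(w.values())
p_xyz = {k: v / total for k, v in w.items()}

p_xz, p_yz, p_z = marginal(p_xyz, (0, 2)), marginal(p_xyz, (1, 2)), marginal(p_xyz, (2,))

# Definition 2.20 rewritten with joint probabilities:
# I(X;Y|Z) = sum p(x,y,z) log[ p(x,y,z) p(z) / (p(x,z) p(y,z)) ].
I_def = sum(v * log2(v * p_z[(z,)] / (p_xz[(x, z)] * p_yz[(y, z)]))
            for (x, y, z), v in p_xyz.items() if v > 0)

# (2.65): I(X;Y|Z) = H(X|Z) + H(Y|Z) - H(X,Y|Z)
#                  = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z).
I_ent = H(p_xz) + H(p_yz) - H(p_xyz) - H(p_z)

print(abs(I_def - I_ent) < 1e-12, I_def >= -1e-12)  # identity holds; I(X;Y|Z) >= 0
```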
2.3 CONTINUITY OF SHANNON'S INFORMATION MEASURES

This section is devoted to a discussion of the continuity of Shannon's information measures. To begin with, we first show that all Shannon's information measures are special cases of conditional mutual information. Let $\Phi$ be a degenerate random variable, i.e., $\Phi$ takes a constant value with probability 1. Consider the mutual information $I(X;Y|Z)$. When $X = Y$ and $Z = \Phi$, $I(X;Y|Z)$ becomes the entropy $H(X)$. When $X = Y$, $I(X;Y|Z)$ becomes the conditional entropy $H(X|Z)$. When $Z = \Phi$, $I(X;Y|Z)$ becomes the mutual information $I(X;Y)$. Thus all Shannon's information measures are special cases of conditional mutual information. Therefore, we only need to discuss the continuity of conditional mutual information.

Recall the definition of conditional mutual information:

$$I_p(X;Y|Z) = \sum_{(x,y,z)\in S_{XYZ}(p)} p(x,y,z)\log\frac{p(x,y|z)}{p(x|z)p(y|z)}, \tag{2.66}$$

where we have written $I(X;Y|Z)$ and $S_{XYZ}$ as $I_p(X;Y|Z)$ and $S_{XYZ}(p)$, respectively, to emphasize their dependence on $p$. Since $a\log a$ is continuous in $a$ for $a > 0$, $I_p(X;Y|Z)$ varies continuously with $p$ as long as the support $S_{XYZ}(p)$ does not change. The problem arises when some positive probability masses become zero or some zero probability masses become positive. Since continuity in the former case implies continuity in the latter case and vice versa, we only need to consider the former case. As $p(x,y,z) \rightarrow 0$ and eventually becomes zero for some $x$, $y$, and $z$, the support $S_{XYZ}(p)$ is reduced, and there is a discrete change in the number of terms in (2.66). Now for $x$, $y$, and $z$ such that $p(x,y,z) > 0$, consider

$$p(x,y,z)\log\frac{p(x,y|z)}{p(x|z)p(y|z)} \tag{2.67}$$
$$= p(x,y,z)\log\frac{p(x,y,z)p(z)}{p(x,z)p(y,z)} \tag{2.68}$$
$$= p(x,y,z)\log p(x,y,z) + p(x,y,z)\log p(z) - p(x,y,z)\log p(x,z) - p(x,y,z)\log p(y,z). \tag{2.69}$$

It can readily be verified by L'Hospital's rule that $p(x,y,z)\log p(x,y,z) \rightarrow 0$ as $p(x,y,z) \rightarrow 0$. For $p(x,y,z)\log p(z)$, since $p(x,y,z) \le p(z) \le 1$, we have

$$0 \ge p(x,y,z)\log p(z) \ge p(x,y,z)\log p(x,y,z) \rightarrow 0. \tag{2.70}$$

Thus $p(x,y,z)\log p(z) \rightarrow 0$ as $p(x,y,z) \rightarrow 0$. Similarly, we can show that both $p(x,y,z)\log p(x,z)$ and $p(x,y,z)\log p(y,z) \rightarrow 0$ as $p(x,y,z) \rightarrow 0$. Therefore,

$$p(x,y,z)\log\frac{p(x,y|z)}{p(x|z)p(y|z)} \rightarrow 0 \tag{2.71}$$

as $p(x,y,z) \rightarrow 0$. Hence, $I_p(X;Y|Z)$ varies continuously with $p$ even when $p(x,y,z) \rightarrow 0$ for some $x$, $y$, and $z$.

To conclude, for any $p(x,y,z)$, if $\{p_k\}$ is a sequence of joint distributions such that $p_k \rightarrow p$ as $k \rightarrow \infty$, then

$$\lim_{k\rightarrow\infty} I_{p_k}(X;Y|Z) = I_p(X;Y|Z). \tag{2.72}$$

2.4 CHAIN RULES

In this section, we present a collection of information identities known as the chain rules which are often used in information theory.

Proposition 2.23 (Chain rule for entropy)

$$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^n H(X_i \mid X_1, \ldots, X_{i-1}). \tag{2.73}$$

Proof The chain rule for $n = 2$ has been proved in Proposition 2.16. We prove the chain rule by induction on $n$. Assume (2.73) is true for $n = m$, where $m \ge 2$. Then

$$H(X_1, \ldots, X_m, X_{m+1}) = H(X_1, \ldots, X_m) + H(X_{m+1} \mid X_1, \ldots, X_m) \tag{2.74}$$
$$= \sum_{i=1}^m H(X_i \mid X_1, \ldots, X_{i-1}) + H(X_{m+1} \mid X_1, \ldots, X_m) \tag{2.75}$$
$$= \sum_{i=1}^{m+1} H(X_i \mid X_1, \ldots, X_{i-1}), \tag{2.76}$$

where in (2.74), we have used (2.45), and in (2.75), we have used (2.73) for $n = m$. This proves the chain rule for entropy.

The chain rule for entropy has the following conditional version.

Proposition 2.24 (Chain rule for conditional entropy)

$$H(X_1, X_2, \ldots, X_n \mid Y) = \sum_{i=1}^n H(X_i \mid X_1, \ldots, X_{i-1}, Y). \tag{2.77}$$

Proof This can be proved by considering

$$H(X_1, X_2, \ldots, X_n \mid Y) = \sum_y p(y)H(X_1, X_2, \ldots, X_n \mid Y = y) \tag{2.78}$$
$$= \sum_y p(y)\sum_{i=1}^n H(X_i \mid X_1, \ldots, X_{i-1}, Y = y) \tag{2.79}$$
$$= \sum_{i=1}^n \sum_y p(y)H(X_i \mid X_1, \ldots, X_{i-1}, Y = y) \tag{2.80}$$
$$= \sum_{i=1}^n H(X_i \mid X_1, \ldots, X_{i-1}, Y), \tag{2.81}$$

where we have used (2.41) in (2.78) and (2.43) in (2.81), and (2.79) follows from Proposition 2.23.

Proposition 2.25 (Chain rule for mutual information)

$$I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^n I(X_i; Y \mid X_1, \ldots, X_{i-1}). \tag{2.82}$$

Proof Consider

$$I(X_1, X_2, \ldots, X_n; Y) = H(X_1, X_2, \ldots, X_n) - H(X_1, X_2, \ldots, X_n \mid Y) \tag{2.83}$$
$$= \sum_{i=1}^n \left[H(X_i \mid X_1, \ldots, X_{i-1}) - H(X_i \mid X_1, \ldots, X_{i-1}, Y)\right] \tag{2.84}$$
$$= \sum_{i=1}^n I(X_i; Y \mid X_1, \ldots, X_{i-1}), \tag{2.85}$$

where in (2.84), we have invoked both Propositions 2.23 and 2.24.
The chain rule for mutual information is proved.

Proposition 2.26 (Chain rule for conditional mutual information) For random variables $X_1, X_2, \ldots, X_n$, $Y$, and $Z$,

$$I(X_1, X_2, \ldots, X_n; Y \mid Z) = \sum_{i=1}^n I(X_i; Y \mid X_1, \ldots, X_{i-1}, Z). \tag{2.86}$$

Proof This is the conditional version of the chain rule for mutual information. The proof is similar to that for Proposition 2.24. The details are omitted.
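A small numerical sanity check of the chain rule for entropy (2.73) is sketched below (illustrative; the three-variable joint pmf is an arbitrary choice and the helper functions are hypothetical, not from the text).

```python
from math import log2
from collections import defaultdict

def marginal(p, idx):
    q = defaultdict(float)
    for k, v in p.items():
        q[tuple(k[i] for i in idx)] += v
    return q

def H(p):
    return -sum(v * log2(v) for v in p.values() if v > 0)

def H_cond(p, a, b):
    """H(X_a | X_b) = H(X_a, X_b) - H(X_b) for coordinate index tuples a, b."""
    return H(marginal(p, a + b)) - H(marginal(p, b))

# Arbitrary joint pmf p(x1, x2, x3) on {0,1}^3.
w = {(a, b, c): 1 + a + 2*b + 4*c for a in (0, 1) for b in (0, 1) for c in (0, 1)}
s = sum(w.values())
p = {k: v / s for k, v in w.items()}

lhs = H(p)                                   # H(X1, X2, X3)
rhs = (H(marginal(p, (0,)))                  # H(X1)
       + H_cond(p, (1,), (0,))               # + H(X2 | X1)
       + H_cond(p, (2,), (0, 1)))            # + H(X3 | X1, X2)
print(abs(lhs - rhs) < 1e-12)  # True: the chain rule (2.73) holds
```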
2.5 INFORMATIONAL DIVERGENCE

Let $p$ and $q$ be two probability distributions on a common alphabet $\mathcal{X}$. We very often want to measure how much $p$ is different from $q$, and vice versa. In order to be useful, this measure must satisfy the requirements that it is always nonnegative and it takes the zero value if and only if $p = q$. We denote the support of $p$ and $q$ by $S_p$ and $S_q$, respectively. The informational divergence defined below serves this purpose.

Definition 2.27 The informational divergence between two probability distributions $p$ and $q$ on a common alphabet $\mathcal{X}$ is defined as

$$D(p\|q) = \sum_x p(x)\log\frac{p(x)}{q(x)} = E_p\log\frac{p(X)}{q(X)}, \tag{2.87}$$

where $E_p$ denotes expectation with respect to $p$.

In the above definition, in addition to the convention that the summation is taken over $S_p$, we further adopt the convention $p(x)\log\frac{p(x)}{0} = \infty$ if $q(x) = 0$.

In the literature, informational divergence is also referred to as relative entropy or the Kullback-Leibler distance. We note that $D(p\|q)$ is not symmetrical in $p$ and $q$, so it is not a true metric or "distance."

[Figure 2.3. The fundamental inequality $\ln a \le a - 1$.]

In the rest of the book, informational divergence will be referred to as divergence for brevity. Before we prove that divergence is always nonnegative, we first establish the following simple but important inequality called the fundamental inequality in information theory.

Lemma 2.28 (Fundamental Inequality) For any $a > 0$,

$$\ln a \le a - 1 \tag{2.88}$$

with equality if and only if $a = 1$.

Proof Let $f(a) = \ln a - a + 1$. Then $f'(a) = \frac{1}{a} - 1$ and $f''(a) = -\frac{1}{a^2}$. Since $f(1) = 0$, $f'(1) = 0$, and $f''(1) = -1 < 0$, we see that $f(a)$ attains its maximum value 0 when $a = 1$. This proves (2.88). It is also clear that equality holds in (2.88) if and only if $a = 1$. Figure 2.3 is an illustration of the fundamental inequality.

Corollary 2.29 For any $a > 0$,

$$\ln a \ge 1 - \frac{1}{a} \tag{2.89}$$

with equality if and only if $a = 1$.

Proof This can be proved by replacing $a$ by $\frac{1}{a}$ in (2.88).

We can see from Figure 2.3 that the fundamental inequality results from the convexity of the logarithmic function. In fact, many important results in information theory are also direct or indirect consequences of the convexity of the logarithmic function!

Theorem 2.30 (Divergence Inequality) For any two probability distributions $p$ and $q$ on a common alphabet $\mathcal{X}$,

$$D(p\|q) \ge 0 \tag{2.90}$$

with equality if and only if $p = q$.

Proof If $q(x) = 0$ for some $x \in S_p$, then $D(p\|q) = \infty$ and the theorem is trivially true. Therefore, we assume that $q(x) > 0$ for all $x \in S_p$. Consider

$$D(p\|q) = (\log e)\sum_{x\in S_p} p(x)\ln\frac{p(x)}{q(x)} \tag{2.91}$$
$$\ge (\log e)\sum_{x\in S_p} p(x)\left(1 - \frac{q(x)}{p(x)}\right) \tag{2.92}$$
$$= (\log e)\left[\sum_{x\in S_p} p(x) - \sum_{x\in S_p} q(x)\right] \tag{2.93}$$
$$\ge (\log e)(1 - 1) \tag{2.94}$$
$$= 0, \tag{2.95}$$

where (2.92) results from an application of (2.89), and (2.94) follows from

$$\sum_{x\in S_p} q(x) \le 1 = \sum_{x\in S_p} p(x). \tag{2.96}$$

This proves (2.90). Now for equality to hold in (2.90), equality must hold in (2.92) for all $x \in S_p$, and equality must hold in (2.94). For the former, we see from Lemma 2.28 that this is the case if and only if $p(x) = q(x)$ for all $x \in S_p$. For the latter, i.e.,

$$\sum_{x\in S_p} q(x) = 1, \tag{2.97}$$

this is the case if and only if $S_p$ and $S_q$ coincide. Therefore, we conclude that equality holds in (2.90) if and only if $p(x) = q(x)$ for all $x \in \mathcal{X}$, i.e., $p = q$. The theorem is proved.

We now prove a very useful consequence of the divergence inequality called the log-sum inequality.

Theorem 2.31 (Log-sum inequality) For positive numbers $a_1, a_2, \ldots$ and nonnegative numbers $b_1, b_2, \ldots$ such that $\sum_i a_i < \infty$ and $0 < \sum_i b_i < \infty$,

$$\sum_i a_i\log\frac{a_i}{b_i} \ge \left(\sum_i a_i\right)\log\frac{\sum_i a_i}{\sum_i b_i} \tag{2.98}$$

with the convention that $\log\frac{a_i}{0} = \infty$. Moreover, equality holds if and only if $\frac{a_i}{b_i}$ is a constant for all $i$.

The log-sum inequality can easily be understood by writing it out for the case when there are two terms in each of the summations:

$$a_1\log\frac{a_1}{b_1} + a_2\log\frac{a_2}{b_2} \ge (a_1 + a_2)\log\frac{a_1 + a_2}{b_1 + b_2}. \tag{2.99}$$

Proof of Theorem 2.31 Let $a_i' = a_i/\sum_j a_j$ and $b_i' = b_i/\sum_j b_j$. Then $\{a_i'\}$ and $\{b_i'\}$ are probability distributions. Using the divergence inequality, we have

$$0 \le \sum_i a_i'\log\frac{a_i'}{b_i'} \tag{2.100}$$
$$= \sum_i \frac{a_i}{\sum_j a_j}\log\left(\frac{a_i}{b_i}\cdot\frac{\sum_j b_j}{\sum_j a_j}\right) \tag{2.101}$$
$$= \frac{1}{\sum_j a_j}\left[\sum_i a_i\log\frac{a_i}{b_i} - \left(\sum_i a_i\right)\log\frac{\sum_j a_j}{\sum_j b_j}\right], \tag{2.102}$$

which implies (2.98). Equality holds if and only if $a_i' = b_i'$ for all $i$, or $\frac{a_i}{b_i}$ is a constant for all $i$. The theorem is proved.

One can also prove the divergence inequality by using the log-sum inequality (see Problem 14), so the two inequalities are in fact equivalent. The log-sum inequality also finds application in proving the next theorem which gives a lower bound on the divergence between two probability distributions on a common alphabet in terms of the variational distance between them.

Definition 2.32 Let $p$ and $q$ be two probability distributions on a common alphabet $\mathcal{X}$. The variational distance between $p$ and $q$ is defined by

$$V(p,q) = \sum_{x\in\mathcal{X}} |p(x) - q(x)|. \tag{2.103}$$

Theorem 2.33 (Pinsker's inequality)

$$D(p\|q) \ge \frac{1}{2\ln 2}V^2(p,q). \tag{2.104}$$

The proof of this theorem is left as an exercise (see Problem 17). We will see further applications of the log-sum inequality when we discuss the convergence of some iterative algorithms in Chapter 10.
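To see the divergence inequality and Pinsker's inequality in action, here is a minimal sketch (illustrative only; the two distributions are arbitrary choices) that evaluates $D(p\|q)$ in bits, the variational distance $V(p,q)$, and checks (2.90) and (2.104).

```python
from math import log2, log, inf

def D(p, q):
    """Informational divergence D(p||q) in bits; p, q are lists over the same alphabet."""
    total = 0.0
    for px, qx in zip(p, q):
        if px > 0:
            if qx == 0:
                return inf          # convention: p(x) log(p(x)/0) = infinity
            total += px * log2(px / qx)
    return total

def V(p, q):
    """Variational distance V(p,q)."""
    return sum(abs(px - qx) for px, qx in zip(p, q))

p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]

print(D(p, q) >= 0)                              # divergence inequality (2.90)
print(D(p, q) >= V(p, q) ** 2 / (2 * log(2)))    # Pinsker's inequality (2.104)
print(D(p, q), D(q, p))                          # generally different: D is not symmetrical
```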
2.6 THE BASIC INEQUALITIES

In this section, we prove that all Shannon's information measures, namely entropy, conditional entropy, mutual information, and conditional mutual information, are always nonnegative. By this, we mean that these quantities are nonnegative for all joint distributions for the random variables involved.

Theorem 2.34 For random variables $X$, $Y$, and $Z$,

$$I(X;Y|Z) \ge 0, \tag{2.105}$$

with equality if and only if $X$ and $Y$ are independent conditioning on $Z$.

Proof This can be readily proved by observing that

$$I(X;Y|Z) = \sum_{x,y,z} p(x,y,z)\log\frac{p(x,y|z)}{p(x|z)p(y|z)} \tag{2.106}$$
$$= \sum_z p(z)\sum_{x,y} p(x,y|z)\log\frac{p(x,y|z)}{p(x|z)p(y|z)} \tag{2.107}$$
$$= \sum_z p(z)\,D(p_{XY|z}\,\|\,p_{X|z}p_{Y|z}), \tag{2.108}$$

where we have used $p_{XY|z}$ to denote $\{p(x,y|z),\ (x,y)\in\mathcal{X}\times\mathcal{Y}\}$, etc. Since for a fixed $z$, both $p_{XY|z}$ and $p_{X|z}p_{Y|z}$ are joint probability distributions on $\mathcal{X}\times\mathcal{Y}$, we have

$$D(p_{XY|z}\,\|\,p_{X|z}p_{Y|z}) \ge 0. \tag{2.109}$$

Therefore, we conclude that $I(X;Y|Z) \ge 0$. Finally, we see from Theorem 2.30 that $I(X;Y|Z) = 0$ if and only if for all $z \in S_Z$,

$$p(x,y|z) = p(x|z)p(y|z), \tag{2.110}$$

or

$$p(x,y,z) = p(x,z)p(y|z) \tag{2.111}$$

for all $x$ and $y$. Therefore, $X$ and $Y$ are independent conditioning on $Z$. The proof is accomplished.

As we have seen in Section 2.3, all Shannon's information measures are special cases of conditional mutual information; therefore we have already proved that all Shannon's information measures are always nonnegative. The nonnegativity of all Shannon's information measures is collectively referred to as the basic inequalities.

For entropy and conditional entropy, we offer the following more direct proof for their nonnegativity. Consider the entropy $H(X)$ of a random variable $X$. For all $x \in S_X$, since $0 < p(x) \le 1$, $\log p(x) \le 0$. It then follows from the definition in (2.33) that $H(X) \ge 0$. For the conditional entropy $H(Y|X)$ of random variable $Y$ given random variable $X$, since $H(Y|X=x) \ge 0$ for each $x \in S_X$, we see from (2.41) that $H(Y|X) \ge 0$.

Proposition 2.35 $H(X) = 0$ if and only if $X$ is deterministic.

Proof If $X$ is deterministic, i.e., there exists $x^* \in \mathcal{X}$ such that $p(x^*) = 1$ and $p(x) = 0$ for all $x \neq x^*$, then $H(X) = -p(x^*)\log p(x^*) = 0$. On the other hand, if $X$ is not deterministic, i.e., there exists $x^* \in \mathcal{X}$ such that $0 < p(x^*) < 1$, then $H(X) \ge -p(x^*)\log p(x^*) > 0$. Therefore, we conclude that $H(X) = 0$ if and only if $X$ is deterministic.

Proposition 2.36 $H(Y|X) = 0$ if and only if $Y$ is a function of $X$.

Proof From (2.41), we see that $H(Y|X) = 0$ if and only if $H(Y|X=x) = 0$ for each $x \in S_X$. Then from the last proposition, this happens if and only if $Y$ is deterministic for each given $x$. In other words, $Y$ is a function of $X$.

Proposition 2.37 $I(X;Y) = 0$ if and only if $X$ and $Y$ are independent.

Proof This is a special case of Theorem 2.34 with $Z$ being a degenerate random variable.

One can regard (conditional) mutual information as a measure of (conditional) dependency between two random variables. When the (conditional) mutual information is exactly equal to 0, the two random variables are (conditionally) independent.

We refer to inequalities involving Shannon's information measures only (possibly with constant terms) as information inequalities. The basic inequalities are important examples of information inequalities. Likewise, we refer to identities involving Shannon's information measures only as information identities. From the information identities (2.45), (2.55), and (2.63), we see that all Shannon's information measures can be expressed as linear combinations of entropies. Specifically,

$$H(Y|X) = H(X,Y) - H(X), \tag{2.112}$$
$$I(X;Y) = H(X) + H(Y) - H(X,Y), \tag{2.113}$$

and

$$I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z). \tag{2.114}$$

Therefore, an information inequality is an inequality which involves only entropies.

In information theory, information inequalities form the most important set of tools for proving converse coding theorems. Except for a few so-called non-Shannon-type inequalities, all known information inequalities are implied by the basic inequalities. Information inequalities will be studied systematically in Chapter 12 to Chapter 14. In the next section, we will prove some consequences of the basic inequalities which are often used in information theory.
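Before moving on, Theorem 2.34 can be probed numerically. The sketch below (illustrative; random distributions and hypothetical helper names, not from the text) draws random joint distributions on three binary variables and confirms that $I(X;Y|Z)$, computed as in (2.108) rewritten with joint probabilities, is never negative.

```python
import random
from math import log2
from collections import defaultdict

def marginal(p, idx):
    q = defaultdict(float)
    for k, v in p.items():
        q[tuple(k[i] for i in idx)] += v
    return q

def cond_mutual_info(p_xyz):
    """I(X;Y|Z) in bits from a joint pmf {(x, y, z): prob}."""
    p_xz, p_yz, p_z = marginal(p_xyz, (0, 2)), marginal(p_xyz, (1, 2)), marginal(p_xyz, (2,))
    return sum(v * log2(v * p_z[(z,)] / (p_xz[(x, z)] * p_yz[(y, z)]))
               for (x, y, z), v in p_xyz.items() if v > 0)

random.seed(0)
outcomes = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]
for _ in range(1000):
    weights = [random.random() for _ in outcomes]
    s = sum(weights)
    p = {o: w / s for o, w in zip(outcomes, weights)}
    assert cond_mutual_info(p) >= -1e-12   # basic inequality (2.105), up to rounding

print("I(X;Y|Z) was nonnegative in all trials.")
```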
2.7 SOME USEFUL INFORMATION INEQUALITIES

In this section, we prove some useful consequences of the basic inequalities introduced in the last section. Note that the conditional versions of these inequalities can be proved by similar techniques as in Proposition 2.24.

Theorem 2.38 (Conditioning Does Not Increase Entropy)
$$H(Y|X) \leq H(Y) \qquad (2.115)$$
with equality if and only if $X$ and $Y$ are independent.

Proof This can be proved by considering
$$H(Y|X) = H(Y) - I(X;Y) \leq H(Y), \qquad (2.116)$$
where the inequality follows because $I(X;Y)$ is always nonnegative. The inequality is tight if and only if $I(X;Y) = 0$, which is equivalent to $X$ and $Y$ being independent by Proposition 2.37.

Similarly, it can be shown that $H(Y|X,Z) \leq H(Y|Z)$, which is the conditional version of the above proposition. These results have the following interpretation. Suppose $Y$ is a random variable we are interested in, and $X$ and $Z$ are side-information about $Y$. Then our uncertainty about $Y$ cannot be increased on the average upon knowing side-information $X$. Once we know $X$, our uncertainty about $Y$ again cannot be increased on the average upon further knowing side-information $Z$.

Remark Unlike entropy, the mutual information between two random variables can be increased by conditioning on a third random variable. We refer the reader to Section 6.4 for a discussion.

Theorem 2.39 (Independence Bound for Entropy)
$$H(X_1, X_2, \ldots, X_n) \leq \sum_{i=1}^n H(X_i) \qquad (2.117)$$
with equality if and only if $X_1, X_2, \ldots, X_n$ are mutually independent.

Proof By the chain rule for entropy,
$$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^n H(X_i | X_1, \ldots, X_{i-1}) \qquad (2.118)$$
$$\leq \sum_{i=1}^n H(X_i), \qquad (2.119)$$
where the inequality follows because we have proved in the last theorem that conditioning does not increase entropy. The inequality is tight if and only if it is tight for each $i$, i.e.,
$$H(X_i | X_1, \ldots, X_{i-1}) = H(X_i) \qquad (2.120)$$
for $1 \leq i \leq n$. From the last theorem, this is equivalent to $X_i$ being independent of $X_1, X_2, \ldots, X_{i-1}$ for each $i$. Then
$$p(x_1, x_2, \ldots, x_n) = p(x_1, x_2, \ldots, x_{n-1})\, p(x_n) \qquad (2.121)$$
$$= p(x_1, x_2, \ldots, x_{n-2})\, p(x_{n-1})\, p(x_n) \qquad (2.122)$$
$$\vdots$$
$$= p(x_1)\, p(x_2) \cdots p(x_n) \qquad (2.123)$$
for all $x_1, x_2, \ldots, x_n$, i.e., $X_1, X_2, \ldots, X_n$ are mutually independent.

Alternatively, we can prove the theorem by considering
$$\sum_{i=1}^n H(X_i) - H(X_1, X_2, \ldots, X_n)$$
$$= -E \log \big[ p(X_1)\, p(X_2) \cdots p(X_n) \big] + E \log p(X_1, X_2, \ldots, X_n) \qquad (2.125)$$
$$= E \log \frac{p(X_1, X_2, \ldots, X_n)}{p(X_1)\, p(X_2) \cdots p(X_n)} \qquad (2.126)$$
$$= D\big( p_{X_1 X_2 \cdots X_n} \,\big\|\, p_{X_1} p_{X_2} \cdots p_{X_n} \big) \qquad (2.127)$$
$$\geq 0, \qquad (2.128)$$
where equality holds if and only if
$$p(x_1, x_2, \ldots, x_n) = p(x_1)\, p(x_2) \cdots p(x_n) \qquad (2.129)$$
for all $x_1, x_2, \ldots, x_n$, i.e., $X_1, X_2, \ldots, X_n$ are mutually independent.

Theorem 2.40
$$I(X; Y, Z) \geq I(X; Y), \qquad (2.130)$$
with equality if and only if $X \to Y \to Z$ forms a Markov chain.

Proof By the chain rule for mutual information, we have
$$I(X; Y, Z) = I(X; Y) + I(X; Z | Y) \geq I(X; Y). \qquad (2.131)$$
The above inequality is tight if and only if $I(X; Z|Y) = 0$, or $X \to Y \to Z$ forms a Markov chain. The theorem is proved.
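As a quick numerical illustration (with a hypothetical correlated joint distribution on two bits, not an example from the text), the following sketch checks Theorem 2.38 and Theorem 2.39 for a pair of random variables.

```python
from math import log2

# hypothetical joint pmf p(x, y) on {0,1}^2 with X and Y positively correlated
pXY = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def entropy(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

pX = {x: sum(p for (a, _), p in pXY.items() if a == x) for x in (0, 1)}
pY = {y: sum(p for (_, b), p in pXY.items() if b == y) for y in (0, 1)}

H_XY, H_X, H_Y = entropy(pXY), entropy(pX), entropy(pY)
H_Y_given_X = H_XY - H_X

assert H_Y_given_X <= H_Y + 1e-12      # conditioning does not increase entropy
assert H_XY <= H_X + H_Y + 1e-12       # independence bound for entropy
print(f'H(Y|X) = {H_Y_given_X:.3f} <= H(Y) = {H_Y:.3f}')
```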
Lemma 2.41 If $X \to Y \to Z$ forms a Markov chain, then
$$I(X; Z) \leq I(X; Y) \qquad (2.132)$$
and
$$I(X; Z) \leq I(Y; Z). \qquad (2.133)$$

The meaning of this inequality is the following. Suppose $X$ is a random variable we are interested in, and $Y$ is an observation of $X$. If we infer $X$ via $Y$, our uncertainty about $X$ on the average is $H(X|Y)$. Now suppose we process $Y$ (either deterministically or probabilistically) to obtain a random variable $Z$. If we infer $X$ via $Z$, our uncertainty about $X$ on the average is $H(X|Z)$. Since $X \to Y \to Z$ forms a Markov chain, from (2.132), we have
$$H(X|Z) = H(X) - I(X;Z) \qquad (2.134)$$
$$\geq H(X) - I(X;Y) \qquad (2.135)$$
$$= H(X|Y), \qquad (2.136)$$
i.e., further processing $Y$ can only increase our uncertainty about $X$ on the average.

Proof of Lemma 2.41 Assume $X \to Y \to Z$, i.e., $X \perp Z \mid Y$. By Theorem 2.34, we have
$$I(X; Z | Y) = 0. \qquad (2.137)$$
Then
$$I(X; Z) = I(X; Y, Z) - I(X; Y | Z) \qquad (2.138)$$
$$\leq I(X; Y, Z) \qquad (2.139)$$
$$= I(X; Y) + I(X; Z | Y) \qquad (2.140)$$
$$= I(X; Y). \qquad (2.141)$$
In (2.138) and (2.140), we have used the chain rule for mutual information. The inequality in (2.139) follows because $I(X;Y|Z)$ is always nonnegative, and (2.141) follows from (2.137). This proves (2.132).

Since $X \to Y \to Z$ is equivalent to $Z \to Y \to X$, we also have proved (2.133). This completes the proof of the lemma.

From Lemma 2.41, we can prove the more general data processing theorem.

Theorem 2.42 (Data Processing Theorem) If $U \to X \to Y \to V$ forms a Markov chain, then
$$I(U; V) \leq I(X; Y). \qquad (2.142)$$

Proof Assume $U \to X \to Y \to V$. Then by Proposition 2.10, we have $U \to X \to Y$ and $U \to Y \to V$. From the first Markov chain and Lemma 2.41, we have
$$I(U; Y) \leq I(X; Y). \qquad (2.143)$$
From the second Markov chain and Lemma 2.41, we have
$$I(U; V) \leq I(U; Y). \qquad (2.144)$$
Combining (2.143) and (2.144), we obtain (2.142), proving the theorem.
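The data processing theorem can be seen concretely by cascading two noisy channels. The sketch below uses hypothetical crossover probabilities and the binary symmetric channel purely as a convenient example (none of these numbers come from the text) to check that $I(X;Z) \leq I(X;Y)$ when $X \to Y \to Z$.

```python
from math import log2

def h_b(p):                        # binary entropy function
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def flip_prob(p_one, crossover):
    # Pr{output = 1} of a BSC(crossover) when Pr{input = 1} = p_one
    return p_one * (1 - crossover) + (1 - p_one) * crossover

pX1, e1, e2 = 0.3, 0.1, 0.2        # assumed Pr{X=1} and the two crossover probabilities
pY1 = flip_prob(pX1, e1)           # Y = X through BSC(e1)
pZ1 = flip_prob(pY1, e2)           # Z = Y through BSC(e2), so X -> Y -> Z

I_XY = h_b(pY1) - h_b(e1)          # I(X;Y) = H(Y) - H(Y|X) for a BSC
e12 = e1 * (1 - e2) + (1 - e1) * e2   # the cascade is a BSC with this crossover
I_XZ = h_b(pZ1) - h_b(e12)

assert I_XZ <= I_XY + 1e-12
print(f'I(X;Y) = {I_XY:.4f} bits, I(X;Z) = {I_XZ:.4f} bits')
```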
2.8 FANO'S INEQUALITY

In the last section, we have proved a few information inequalities involving only Shannon's information measures. In this section, we first prove an upper bound on the entropy of a random variable in terms of the size of the alphabet. This inequality is then used in the proof of Fano's inequality, which is extremely useful in proving converse coding theorems in information theory.

Theorem 2.43 For any random variable $X$,
$$H(X) \leq \log |\mathcal{X}|, \qquad (2.145)$$
where $|\mathcal{X}|$ denotes the size of the alphabet $\mathcal{X}$. This upper bound is tight if and only if $X$ distributes uniformly on $\mathcal{X}$.

Proof Let $u$ be the uniform distribution on $\mathcal{X}$, i.e., $u(x) = |\mathcal{X}|^{-1}$ for all $x \in \mathcal{X}$. Then
$$\log |\mathcal{X}| - H(X) = -\sum_{x \in \mathcal{S}_X} p(x) \log |\mathcal{X}|^{-1} + \sum_{x \in \mathcal{S}_X} p(x) \log p(x) \qquad (2.146)$$
$$= -\sum_{x \in \mathcal{S}_X} p(x) \log u(x) + \sum_{x \in \mathcal{S}_X} p(x) \log p(x) \qquad (2.147)$$
$$= \sum_{x \in \mathcal{S}_X} p(x) \log \frac{p(x)}{u(x)} \qquad (2.148)$$
$$= D(p \| u) \qquad (2.149)$$
$$\geq 0, \qquad (2.150)$$
proving (2.145). This upper bound is tight if and only if $D(p\|u) = 0$, which from Theorem 2.30 is equivalent to $p(x) = u(x)$ for all $x \in \mathcal{X}$, completing the proof.

Corollary 2.44 The entropy of a random variable may take any nonnegative real value.

Proof Consider a random variable $X$ and let $|\mathcal{X}|$ be fixed. We see from the last theorem that $H(X) = \log|\mathcal{X}|$ is achieved when $X$ distributes uniformly on $\mathcal{X}$. On the other hand, $H(X) = 0$ is achieved when $X$ is deterministic. For $0 \leq a \leq \log|\mathcal{X}|$, by the intermediate value theorem, there exists a distribution for $X$ such that $H(X) = a$. Then we see that $H(X)$ can take any positive value by letting $|\mathcal{X}|$ be sufficiently large. This accomplishes the proof.

Remark Let $|\mathcal{X}| = D$, or the random variable $X$ is a $D$-ary symbol. When the base of the logarithm is $D$, (2.145) becomes
$$H_D(X) \leq 1. \qquad (2.151)$$
Recall that the unit of entropy is the $D$-it when the logarithm is in the base $D$. This inequality says that a $D$-ary symbol can carry at most 1 $D$-it of information. This maximum is achieved when $X$ has a uniform distribution. We have already seen the binary case when we discussed the binary entropy function $h_b$ in Section 2.2.

We see from Theorem 2.43 that the entropy of a random variable is finite as long as it has a finite alphabet. However, if a random variable has an infinite alphabet, its entropy may or may not be finite. This will be shown in the next two examples.

Example 2.45 Let $X$ be a random variable such that
$$\Pr\{X = i\} = 2^{-i}, \quad i = 1, 2, \ldots \qquad (2.152)$$
Then
$$H_2(X) = \sum_{i=1}^{\infty} i\, 2^{-i} = 2, \qquad (2.153)$$
which is finite.

Example 2.46 Let $Y$ be a random variable which takes values in the subset of pairs of integers
$$\left\{ (i, j) : 1 \leq i < \infty \text{ and } 1 \leq j \leq \frac{2^{2^i}}{2^i} \right\} \qquad (2.154)$$
such that
$$\Pr\{Y = (i, j)\} = 2^{-2^i} \qquad (2.155)$$
for all $i$ and $j$. First, we check that
$$\sum_{i=1}^{\infty} \sum_{j=1}^{2^{2^i}/2^i} \Pr\{Y = (i,j)\} = \sum_{i=1}^{\infty} \frac{2^{2^i}}{2^i} \cdot 2^{-2^i} = \sum_{i=1}^{\infty} 2^{-i} = 1. \qquad (2.156)$$
Then
$$H_2(Y) = -\sum_{i=1}^{\infty} \sum_{j=1}^{2^{2^i}/2^i} 2^{-2^i} \log_2 2^{-2^i} = \sum_{i=1}^{\infty} \frac{2^{2^i}}{2^i} \cdot 2^{-2^i} \cdot 2^i = \sum_{i=1}^{\infty} 1, \qquad (2.157)$$
which does not converge.

Let $X$ be a random variable and $\hat{X}$ be an estimate of $X$ which takes value in the same alphabet $\mathcal{X}$. Let the probability of error $P_e$ be
$$P_e = \Pr\{X \neq \hat{X}\}. \qquad (2.158)$$
If $P_e = 0$, i.e., $X = \hat{X}$ with probability 1, then $H(X|\hat{X}) = 0$ by Proposition 2.36. Intuitively, if $P_e$ is small, i.e., $X = \hat{X}$ with probability close to 1, then $H(X|\hat{X})$ should be close to 0. Fano's inequality makes this intuition precise.

Theorem 2.47 (Fano's Inequality) Let $X$ and $\hat{X}$ be random variables taking values in the same alphabet $\mathcal{X}$. Then
$$H(X|\hat{X}) \leq h_b(P_e) + P_e \log(|\mathcal{X}| - 1), \qquad (2.159)$$
where $h_b$ is the binary entropy function.

Proof Define a random variable
$$Y = \begin{cases} 0 & \text{if } X = \hat{X} \\ 1 & \text{if } X \neq \hat{X}. \end{cases} \qquad (2.160)$$
The random variable $Y$ is an indicator of the error event $\{X \neq \hat{X}\}$, with $\Pr\{Y = 1\} = P_e$ and $H(Y) = h_b(P_e)$. Since $Y$ is a function of $X$ and $\hat{X}$,
$$H(Y|X, \hat{X}) = 0. \qquad (2.161)$$
Then
$$H(X|\hat{X}) = I(X; Y|\hat{X}) + H(X|\hat{X}, Y) \qquad (2.162)$$
$$= H(Y|\hat{X}) - H(Y|X, \hat{X}) + H(X|\hat{X}, Y) \qquad (2.163)$$
$$= H(Y|\hat{X}) + H(X|\hat{X}, Y) \qquad (2.164)$$
$$\leq H(Y) + H(X|\hat{X}, Y) \qquad (2.165)$$
$$= H(Y) + \sum_{\hat{x} \in \mathcal{X}} \Big[ \Pr\{\hat{X} = \hat{x}, Y = 0\}\, H(X|\hat{X} = \hat{x}, Y = 0) + \Pr\{\hat{X} = \hat{x}, Y = 1\}\, H(X|\hat{X} = \hat{x}, Y = 1) \Big]. \qquad (2.166)$$
In the above, (2.164) follows from (2.161), (2.165) follows because conditioning does not increase entropy, and (2.166) follows from an application of (2.41). Now $X$ must take the value $\hat{x}$ if $\hat{X} = \hat{x}$ and $Y = 0$. In other words, $X$ is conditionally deterministic given $\hat{X} = \hat{x}$ and $Y = 0$. Therefore, by Proposition 2.35,
$$H(X|\hat{X} = \hat{x}, Y = 0) = 0. \qquad (2.167)$$
If $\hat{X} = \hat{x}$ and $Y = 1$, then $X$ must take values in the set $\{x \in \mathcal{X} : x \neq \hat{x}\}$, which contains $|\mathcal{X}| - 1$ elements. From the last theorem, we have
$$H(X|\hat{X} = \hat{x}, Y = 1) \leq \log(|\mathcal{X}| - 1), \qquad (2.168)$$
where this upper bound does not depend on $\hat{x}$. Hence,
$$H(X|\hat{X}) \leq h_b(P_e) + \sum_{\hat{x} \in \mathcal{X}} \Pr\{\hat{X} = \hat{x}, Y = 1\} \log(|\mathcal{X}| - 1) \qquad (2.169)$$
$$= h_b(P_e) + \Pr\{Y = 1\} \log(|\mathcal{X}| - 1) \qquad (2.170)$$
$$= h_b(P_e) + P_e \log(|\mathcal{X}| - 1), \qquad (2.171)$$
proving (2.159). This accomplishes the proof.

Very often, we only need the following simplified version when we apply Fano's inequality. The proof is omitted.

Corollary 2.48 $H(X|\hat{X}) < 1 + P_e \log |\mathcal{X}|$.
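A small numerical example makes the bound concrete. The sketch below uses a hypothetical joint distribution of $X$ and its estimate $\hat{X}$ on a ternary alphabet (the numbers are assumptions, not from the text) and checks (2.159).

```python
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# hypothetical joint pmf p(x, x_hat) on the alphabet {0, 1, 2}
joint = {(0, 0): 0.30, (1, 1): 0.30, (2, 2): 0.20,
         (0, 1): 0.05, (1, 2): 0.05, (2, 0): 0.10}
alphabet = {0, 1, 2}

P_e = sum(p for (x, xh), p in joint.items() if x != xh)

# H(X | X_hat) = H(X, X_hat) - H(X_hat)
p_xhat = {xh: sum(p for (_, b), p in joint.items() if b == xh) for xh in alphabet}
H_X_given_Xhat = entropy(joint.values()) - entropy(p_xhat.values())

fano_bound = entropy([P_e, 1 - P_e]) + P_e * log2(len(alphabet) - 1)
assert H_X_given_Xhat <= fano_bound + 1e-12
print(f'H(X|X_hat) = {H_X_given_Xhat:.3f} <= Fano bound = {fano_bound:.3f}')
```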
Fano's inequality has the following implication. If the alphabet $\mathcal{X}$ is finite, as $P_e \to 0$, the upper bound in (2.159) tends to 0, which implies that $H(X|\hat{X})$ also tends to 0. However, this is not necessarily the case if $\mathcal{X}$ is infinite, which is shown in the next example.

Example 2.49 Let $\hat{X}$ take the value 0 with probability 1. Let $Z$ be an independent binary random variable taking values in $\{0, 1\}$. Define the random variable
$$X = \begin{cases} 0 & \text{if } Z = 0 \\ Y & \text{if } Z = 1, \end{cases} \qquad (2.172)$$
where $Y$ is the random variable in Example 2.46 whose entropy is infinity. Let
$$P_e = \Pr\{X \neq \hat{X}\} = \Pr\{Z = 1\}. \qquad (2.173)$$
Then
$$H(X|\hat{X}) = H(X) \qquad (2.174)$$
$$\geq H(X|Z) \qquad (2.175)$$
$$= \Pr\{Z = 0\}\, H(X|Z=0) + \Pr\{Z = 1\}\, H(X|Z=1) \qquad (2.176)$$
$$= (1 - P_e) \cdot 0 + P_e \cdot H(Y) \qquad (2.177)$$
$$= \infty \qquad (2.178)$$
for any $P_e > 0$. Therefore, $H(X|\hat{X})$ does not tend to 0 as $P_e \to 0$.

2.9 ENTROPY RATE OF STATIONARY SOURCE

In the previous sections, we have discussed various properties of the entropy of a finite collection of random variables. In this section, we discuss the entropy rate of a discrete-time information source.

A discrete-time information source $\{X_k, k \geq 1\}$ is an infinite collection of random variables indexed by the set of positive integers. Since the index set is ordered, it is natural to regard the indices as time indices. We will refer to the random variables $X_k$ as letters.

We assume that $H(X_k) < \infty$ for all $k$. Then for any finite subset $A$ of the index set $\{k : k \geq 1\}$, we have
$$H(X_k, k \in A) \leq \sum_{k \in A} H(X_k) < \infty. \qquad (2.180)$$
However, it is not meaningful to discuss $H(X_k, k \geq 1)$ because the joint entropy of an infinite collection of letters is infinite except for very special cases. On the other hand, since the indices are ordered, we can naturally define the entropy rate of an information source, which gives the average entropy per letter of the source.

Definition 2.50 The entropy rate of an information source $\{X_k\}$ is defined as
$$H_X = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n) \qquad (2.181)$$
when the limit exists.

We show in the next two examples that the entropy rate of a source may or may not exist.

Example 2.51 Let $\{X_k\}$ be an i.i.d. source with generic random variable $X$. Then
$$\lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n) = \lim_{n \to \infty} \frac{n H(X)}{n} = \lim_{n \to \infty} H(X) = H(X),$$
i.e., the entropy rate of an i.i.d. source is the entropy of a single letter.

Example 2.52 Let $\{X_k\}$ be a source such that $X_k$ are mutually independent and $H(X_k) = k$ for $k \geq 1$. Then
$$\frac{1}{n} H(X_1, X_2, \ldots, X_n) = \frac{1}{n} \sum_{k=1}^{n} k = \frac{1}{n} \cdot \frac{n(n+1)}{2} = \frac{n+1}{2},$$
which does not converge although $H(X_k) < \infty$ for all $k$. Therefore, the entropy rate of $\{X_k\}$ does not exist.

Toward characterizing the asymptotic behavior of $\{X_k\}$, it is natural to consider the limit
$$H'_X = \lim_{n \to \infty} H(X_n | X_1, X_2, \ldots, X_{n-1}) \qquad (2.188)$$
if it exists. The quantity $H(X_n | X_1, X_2, \ldots, X_{n-1})$ is interpreted as the conditional entropy of the next letter given that we know all the past history of the source, and $H'_X$ is the limit of this quantity after the source has been run for an indefinite amount of time.

Definition 2.53 An information source $\{X_k\}$ is stationary if
$$X_1, X_2, \ldots, X_m \qquad (2.189)$$
and
$$X_{1+l}, X_{2+l}, \ldots, X_{m+l} \qquad (2.190)$$
have the same joint distribution for any $m, l \geq 1$.
In the rest of the section, we will show that stationarity is a sufficient condition for the existence of the entropy rate of an information source.

Lemma 2.54 Let $\{X_k\}$ be a stationary source. Then $H'_X$ exists.

Proof Since $H(X_n | X_1, X_2, \ldots, X_{n-1})$ is lower bounded by zero for all $n$, it suffices to prove that $H(X_n | X_1, X_2, \ldots, X_{n-1})$ is non-increasing in $n$ to conclude that the limit $H'_X$ exists. Toward this end, for $n \geq 2$, consider
$$H(X_n | X_1, X_2, \ldots, X_{n-1}) \leq H(X_n | X_2, X_3, \ldots, X_{n-1}) \qquad (2.191)$$
$$= H(X_{n-1} | X_1, X_2, \ldots, X_{n-2}), \qquad (2.192)$$
where the last step is justified by the stationarity of $\{X_k\}$. The lemma is proved.

Lemma 2.55 (Cesàro Mean) Let $a_k$ and $b_k$ be real numbers. If $a_n \to a$ as $n \to \infty$ and $b_n = \frac{1}{n} \sum_{k=1}^n a_k$, then $b_n \to a$ as $n \to \infty$.

Proof The idea of the lemma is the following. If $a_n \to a$ as $n \to \infty$, then the average of the first $n$ terms in $\{a_k\}$, namely $b_n$, also tends to $a$ as $n \to \infty$. The lemma is formally proved as follows. Since $a_n \to a$ as $n \to \infty$, for every $\epsilon > 0$, there exists $N(\epsilon)$ such that $|a_n - a| < \epsilon$ for all $n > N(\epsilon)$. For $n > N(\epsilon)$, consider
$$|b_n - a| = \left| \frac{1}{n} \sum_{k=1}^{n} a_k - a \right| \qquad (2.193)$$
$$= \left| \frac{1}{n} \sum_{k=1}^{n} (a_k - a) \right| \qquad (2.194)$$
$$\leq \frac{1}{n} \sum_{k=1}^{n} |a_k - a| \qquad (2.195)$$
$$= \frac{1}{n} \left( \sum_{k=1}^{N(\epsilon)} |a_k - a| + \sum_{k=N(\epsilon)+1}^{n} |a_k - a| \right) \qquad (2.196)$$
$$< \frac{1}{n} \sum_{k=1}^{N(\epsilon)} |a_k - a| + \frac{n - N(\epsilon)}{n}\, \epsilon \qquad (2.197)$$
$$< \frac{1}{n} \sum_{k=1}^{N(\epsilon)} |a_k - a| + \epsilon. \qquad (2.198)$$
The first term tends to 0 as $n \to \infty$. Therefore, for any $\epsilon > 0$, by taking $n$ to be sufficiently large, we can make $|b_n - a| < 2\epsilon$. Hence $b_n \to a$ as $n \to \infty$, proving the lemma.

We now prove that $H'_X$ is an alternative definition/interpretation of the entropy rate of $\{X_k\}$ when $\{X_k\}$ is stationary.

Theorem 2.56 For a stationary source $\{X_k\}$, the entropy rate $H_X$ exists, and it is equal to $H'_X$.

Proof Since we have proved in Lemma 2.54 that $H'_X$ always exists for a stationary source $\{X_k\}$, in order to prove the theorem, we only have to prove that $H_X = H'_X$. By the chain rule for entropy,
$$\frac{1}{n} H(X_1, X_2, \ldots, X_n) = \frac{1}{n} \sum_{k=1}^{n} H(X_k | X_1, X_2, \ldots, X_{k-1}). \qquad (2.199)$$
Since
$$\lim_{k \to \infty} H(X_k | X_1, X_2, \ldots, X_{k-1}) = H'_X \qquad (2.200)$$
from (2.188), it follows from Lemma 2.55 that
$$H_X = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n) = H'_X. \qquad (2.201)$$
The theorem is proved.

In this theorem, we have proved that the entropy rate of a random source $\{X_k\}$ exists under the fairly general assumption that $\{X_k\}$ is stationary. However, the entropy rate of a stationary source $\{X_k\}$ may not carry any physical meaning unless $\{X_k\}$ is also ergodic. This will be explained when we discuss the Shannon-McMillan-Breiman Theorem in Section 4.4.
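The behavior established in Lemma 2.54 and Theorem 2.56 can be observed numerically. The sketch below uses a hypothetical two-state Markov source started from its stationary distribution (so that the source is stationary); the transition probabilities are assumptions for illustration only.

```python
from itertools import product
from math import log2

P  = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}   # assumed transition probabilities
pi = {0: 2/3, 1: 1/3}                             # its stationary distribution

def joint_prob(seq):
    prob = pi[seq[0]]
    for a, b in zip(seq, seq[1:]):
        prob *= P[a][b]
    return prob

def H_joint(n):                                   # H(X_1, ..., X_n) in bits
    return -sum(p * log2(p) for seq in product([0, 1], repeat=n)
                for p in [joint_prob(seq)])

for n in range(2, 8):
    cond = H_joint(n) - H_joint(n - 1)            # H(X_n | X_1, ..., X_{n-1})
    rate = H_joint(n) / n                         # (1/n) H(X_1, ..., X_n)
    print(f'n = {n}: conditional entropy = {cond:.4f}, per-letter entropy = {rate:.4f}')
# for this Markov source the conditional entropy is already constant (about 0.553
# bits) for every n >= 2, while the per-letter entropy decreases toward the same
# limit H'_X, as Theorem 2.56 asserts
```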
PROBLEMS

1. Let $X$ and $Y$ be random variables with alphabets $\mathcal{X} = \mathcal{Y} = \{1,2,3,4,5\}$ and joint distribution $p(x,y)$ given by a $5 \times 5$ table of probabilities, each entry a multiple of $\frac{1}{25}$. Determine $H(X)$, $H(Y)$, $H(X|Y)$, $H(Y|X)$, and $I(X;Y)$.

2. Prove Propositions 2.8, 2.9, 2.10, 2.19, 2.21, and 2.22.

3. Give an example which shows that pairwise independence does not imply mutual independence.

4. Verify that $p(x,y,z)$ as defined in Definition 2.4 is a probability distribution. You should exclude all the zero probability masses from the summation carefully.

5. Linearity of expectation It is well-known that expectation is linear, i.e., $E[f(X) + g(Y)] = E f(X) + E g(Y)$, where the summation in an expectation is taken over the corresponding alphabet. However, we adopt in information theory the convention that the summation in an expectation is taken over the corresponding support. Justify carefully the linearity of expectation under this convention.

6. Let $C_\alpha = \sum_{n=2}^{\infty} \big[ n (\log n)^\alpha \big]^{-1}$.
a) Prove that
$$C_\alpha \begin{cases} < \infty & \text{if } \alpha > 1 \\ = \infty & \text{if } 0 \leq \alpha \leq 1, \end{cases}$$
so that $p_\alpha(n) = \big[ C_\alpha\, n (\log n)^\alpha \big]^{-1}$, $n = 2, 3, \ldots$, is a probability distribution for $\alpha > 1$.
b) Prove that
$$H(p_\alpha) \begin{cases} < \infty & \text{if } \alpha > 2 \\ = \infty & \text{if } 1 < \alpha \leq 2. \end{cases}$$

7. Prove that $H(p)$ is concave in $p$, i.e.,
$$\lambda H(p_1) + \bar{\lambda} H(p_2) \leq H(\lambda p_1 + \bar{\lambda} p_2),$$
where $\bar{\lambda} = 1 - \lambda$ and $0 \leq \lambda \leq 1$.

8. Let $(X, Y) \sim p(x, y) = p(x)\, p(y|x)$.
a) Prove that for fixed $p(x)$, $I(X;Y)$ is a convex functional of $p(y|x)$.
b) Prove that for fixed $p(y|x)$, $I(X;Y)$ is a concave functional of $p(x)$.

9. Do $I(X;Y) = 0$ and $I(X;Y|Z) = 0$ imply each other? If so, give a proof. If not, give a counterexample.

10. Let $X$ be a function of $Y$. Prove that $H(X) \leq H(Y)$. Interpret this result.

11. Prove that for any $n \geq 2$,
$$H(X_1, X_2, \ldots, X_n) \geq \sum_{i=1}^{n} H(X_i \mid X_j, j \neq i).$$

12. Prove that
$$H(X_1, X_2) + H(X_2, X_3) + H(X_1, X_3) \geq 2\, H(X_1, X_2, X_3).$$
Hint: Sum the identities
$$H(X_1, X_2, X_3) = H(X_j, j \neq i) + H(X_i \mid X_j, j \neq i)$$
for $i = 1, 2, 3$ and apply the result in Problem 11.

13. Let $\mathcal{N}_n = \{1, 2, \ldots, n\}$ and denote $H(X_i, i \in A)$ by $H(X_A)$ for any subset $A$ of $\mathcal{N}_n$. For $1 \leq k \leq n$, let
$$H_k = \frac{1}{\binom{n}{k}} \sum_{A : |A| = k} \frac{H(X_A)}{k}.$$
Prove that
$$H_1 \geq H_2 \geq \cdots \geq H_n.$$
This sequence of inequalities, due to Han [87], is a generalization of the independence bound for entropy (Theorem 2.39).

14. Prove the divergence inequality by using the log-sum inequality.

15. Prove that $D(p\|q)$ is convex in the pair $(p, q)$, i.e., if $(p_1, q_1)$ and $(p_2, q_2)$ are two pairs of probability distributions on a common alphabet, then
$$D(\lambda p_1 + \bar{\lambda} p_2 \,\|\, \lambda q_1 + \bar{\lambda} q_2) \leq \lambda D(p_1 \| q_1) + \bar{\lambda} D(p_2 \| q_2)$$
for all $0 \leq \lambda \leq 1$, where $\bar{\lambda} = 1 - \lambda$.

16. Let $p_{XY}$ and $q_{XY}$ be two probability distributions on $\mathcal{X} \times \mathcal{Y}$. Prove that $D(p_{XY} \| q_{XY}) \geq D(p_X \| q_X)$.

17. Pinsker's inequality Let $V(p, q)$ denote the variational distance between two probability distributions $p$ and $q$ on a common alphabet $\mathcal{X}$. We will determine the largest $c$ which satisfies
$$D(p\|q) \geq c\, V^2(p, q).$$
a) Let $A = \{x : p(x) \geq q(x)\}$, $\hat{p} = \{p(A), 1 - p(A)\}$, and $\hat{q} = \{q(A), 1 - q(A)\}$. Show that $D(p\|q) \geq D(\hat{p}\|\hat{q})$ and $V(p, q) = V(\hat{p}, \hat{q})$.
b) Show that toward determining the largest value of $c$, we only have to consider the case when $\mathcal{X}$ is binary.
c) By virtue of b), it suffices to determine the largest $c$ such that
$$p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q} - 4c\,(p - q)^2 \geq 0$$
for all $0 \leq p, q \leq 1$, with the convention that $0 \log \frac{0}{a} = 0$ for $a \geq 0$ and $a \log \frac{a}{0} = \infty$ for $a > 0$. By observing that equality in the above holds if $p = q$ and considering the derivative of the left hand side with respect to $q$, show that the largest value of $c$ is equal to $\frac{1}{2 \ln 2}$.

18. Find a necessary and sufficient condition for Fano's inequality to be tight.

19. Let $p = \{p_1, p_2, \ldots, p_n\}$ and $q = \{q_1, q_2, \ldots, q_n\}$ be two probability distributions such that $p_i \geq p_{i+1}$ and $q_i \geq q_{i+1}$ for all $i$, and $\sum_{i=1}^{k} p_i \geq \sum_{i=1}^{k} q_i$ for all $k$. Prove that $H(p) \leq H(q)$.
a) Show that for $p \neq q$, there exist indices $1 \leq j < k \leq n$ which satisfy the following:
i) $p_j > q_j$;
ii) $p_k < q_k$;
iii) $p_i = q_i$ for all $j < i < k$.
b) Consider the distribution $\hat{q} = \{\hat{q}_1, \hat{q}_2, \ldots, \hat{q}_n\}$ defined by $\hat{q}_i = q_i$ for $i \neq j, k$ and
$$(\hat{q}_j, \hat{q}_k) = \begin{cases} (p_j,\ q_j + q_k - p_j) & \text{if } p_j - q_j \leq q_k - p_k \\ (q_j + q_k - p_k,\ p_k) & \text{if } p_j - q_j > q_k - p_k. \end{cases}$$
Note that either $\hat{q}_j = p_j$ or $\hat{q}_k = p_k$. Show that
i) $\hat{q}_i \geq \hat{q}_{i+1}$ for all $i$;
ii) $\sum_{i=1}^{k} p_i \geq \sum_{i=1}^{k} \hat{q}_i$ for all $k$;
iii) $H(\hat{q}) \leq H(q)$.
c) Prove the result by induction on the Hamming distance between $p$ and $q$, i.e., the number of places where $p$ and $q$ differ.
(Hardy, Littlewood, and Pólya [91].)
HISTORICAL NOTES

The concept of entropy has its root in thermodynamics. Shannon [173] was the first to use entropy as a measure of information. Informational divergence was introduced by Kullback and Leibler [119], and it has been studied extensively by Csiszár [50] and Amari [11].

The material in this chapter can be found in most textbooks in information theory. The main concepts and results are due to Shannon [173]. Pinsker's inequality is due to Pinsker [155]. Fano's inequality has its origin in the converse proof of the channel coding theorem (to be discussed in Chapter 8) by Fano [63].

Chapter 3

ZERO-ERROR DATA COMPRESSION

In a random experiment, a coin is tossed $n$ times. Let $X_i$ be the outcome of the $i$th toss, with
$$\Pr\{X_i = \text{HEAD}\} = p \quad \text{and} \quad \Pr\{X_i = \text{TAIL}\} = 1 - p, \qquad (3.1)$$
where $0 \leq p \leq 1$. It is assumed that $X_i$ are i.i.d., and the value of $p$ is known. We are asked to describe the outcome of the random experiment without error (with zero error) by using binary symbols. One way to do this is to encode a HEAD by a '0' and a TAIL by a '1.' Then the outcome of the random experiment is encoded into a binary codeword of length $n$. When the coin is fair, i.e., $p = 0.5$, this is the best we can do because the probability of every outcome of the experiment is equal to $2^{-n}$. In other words, all the outcomes are equally likely.

However, if the coin is biased, i.e., $p \neq 0.5$, the probability of an outcome of the experiment depends on the number of HEADs and the number of TAILs in the outcome. In other words, the probabilities of the outcomes are no longer uniform. It turns out that we can take advantage of this by encoding more likely outcomes into shorter codewords and less likely outcomes into longer codewords. By doing so, it is possible to use less than $n$ bits on the average to describe the outcome of the random experiment. In particular, when $p = 0$ or 1, we actually do not need to describe the outcome of the experiment because it is deterministic.

At the beginning of Chapter 2, we mentioned that the entropy $H(X)$ measures the amount of information contained in a random variable $X$. In this chapter, we substantiate this claim by exploring the role of entropy in the context of zero-error data compression.

3.1 THE ENTROPY BOUND

In this section, we establish that $H(X)$ is a fundamental lower bound on the expected number of symbols needed to describe the outcome of a random variable $X$ with zero error. This is called the entropy bound.

Definition 3.1 A $D$-ary source code $\mathcal{C}$ for a source random variable $X$ is a mapping from $\mathcal{X}$ to $\mathcal{D}^*$, the set of all finite length sequences of symbols taken from a $D$-ary code alphabet $\mathcal{D}$.
Consider an information source $\{X_k, k \geq 1\}$, where $X_k$ are discrete random variables which take values in the same alphabet. We apply a source code $\mathcal{C}$ to each $X_k$ and concatenate the codewords. Once the codewords are concatenated, the boundaries of the codewords are no longer explicit. In other words, when the code $\mathcal{C}$ is applied to a source sequence, a sequence of code symbols is produced, and the codewords may no longer be distinguishable. We are particularly interested in uniquely decodable codes, which are defined as follows.

Definition 3.2 A code $\mathcal{C}$ is uniquely decodable if for any finite source sequence, the sequence of code symbols corresponding to this source sequence is different from the sequence of code symbols corresponding to any other (finite) source sequence.

Suppose we use a code $\mathcal{C}$ to encode a source file into a coded file. If $\mathcal{C}$ is uniquely decodable, then we can always recover the source file from the coded file. An important class of uniquely decodable codes, called prefix codes, is discussed in the next section. But we first look at an example of a code which is not uniquely decodable.

Example 3.3 Let $\mathcal{X} = \{A, B, C, D\}$. Consider the code $\mathcal{C}$ defined by

x    C(x)
A    0
B    1
C    01
D    10

Then all the three source sequences AAD, ACA, and AABA produce the code sequence 0010. Thus from the code sequence 0010, we cannot tell which of the three source sequences it comes from. Therefore, $\mathcal{C}$ is not uniquely decodable.

In the next theorem, we prove that for any uniquely decodable code, the lengths of the codewords have to satisfy an inequality called the Kraft inequality.

Theorem 3.4 (Kraft Inequality) Let $\mathcal{C}$ be a $D$-ary source code, and let $l_1, l_2, \ldots, l_m$ be the lengths of the codewords. If $\mathcal{C}$ is uniquely decodable, then
$$\sum_{k=1}^{m} D^{-l_k} \leq 1. \qquad (3.2)$$

Proof Let $N$ be an arbitrary positive integer, and consider the identity
$$\left( \sum_{k=1}^{m} D^{-l_k} \right)^{\!N} = \sum_{k_1=1}^{m} \sum_{k_2=1}^{m} \cdots \sum_{k_N=1}^{m} D^{-(l_{k_1} + l_{k_2} + \cdots + l_{k_N})}. \qquad (3.3)$$
By collecting terms on the right-hand side, we write
$$\left( \sum_{k=1}^{m} D^{-l_k} \right)^{\!N} = \sum_{i=1}^{N l_{\max}} A_i\, D^{-i}, \qquad (3.4)$$
where
$$l_{\max} = \max_{1 \leq k \leq m} l_k \qquad (3.5)$$
and $A_i$ is the coefficient of $D^{-i}$ in $\big( \sum_k D^{-l_k} \big)^N$. Now observe that $A_i$ gives the total number of sequences of $N$ codewords with a total length of $i$ code symbols. Since the code is uniquely decodable, these code sequences must be distinct, and therefore $A_i \leq D^i$ because there are $D^i$ distinct sequences of $i$ code symbols. Substituting this inequality into (3.4), we have
$$\left( \sum_{k=1}^{m} D^{-l_k} \right)^{\!N} \leq \sum_{i=1}^{N l_{\max}} 1 = N l_{\max}, \qquad (3.6)$$
or
$$\sum_{k=1}^{m} D^{-l_k} \leq (N l_{\max})^{1/N}. \qquad (3.7)$$
Since this inequality holds for any $N$, upon letting $N \to \infty$, we obtain (3.2), completing the proof.
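The failure of unique decodability in Example 3.3 can be exhibited mechanically, and the Kraft sum of that code already signals the problem (it exceeds 1, so by Theorem 3.4 the code cannot be uniquely decodable). A brute-force sketch, offered only as an illustration:

```python
from itertools import product

code = {'A': '0', 'B': '1', 'C': '01', 'D': '10'}   # the code of Example 3.3

def encode(word):
    return ''.join(code[ch] for ch in word)

seen = {}
for n in range(1, 5):                    # search over short source strings
    for word in product(code, repeat=n):
        bits = encode(word)
        if bits in seen and seen[bits] != word:
            print(''.join(seen[bits]), 'and', ''.join(word), 'both encode to', bits)
            break
        seen.setdefault(bits, word)
    else:
        continue
    break

kraft_sum = sum(2 ** -len(c) for c in code.values())
print('Kraft sum =', kraft_sum)          # 1.5 > 1, consistent with Theorem 3.4
```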
Let $X$ be a source random variable with probability distribution
$$\{p_1, p_2, \ldots, p_m\}, \qquad (3.8)$$
where $m \geq 2$. When we use a uniquely decodable code $\mathcal{C}$ to encode the outcome of $X$, we are naturally interested in the expected length of a codeword, which is given by
$$L = \sum_{k} p_k l_k. \qquad (3.9)$$
We will also refer to $L$ as the expected length of the code $\mathcal{C}$. The quantity $L$ gives the average number of symbols we need to describe the outcome of $X$ when the code $\mathcal{C}$ is used, and it is a measure of the efficiency of the code $\mathcal{C}$. Specifically, the smaller the expected length $L$ is, the better the code $\mathcal{C}$ is.

In the next theorem, we will prove a fundamental lower bound on the expected length of any uniquely decodable $D$-ary code. We first explain why this is the lower bound we should expect. In a uniquely decodable code, we use $L$ $D$-ary symbols on the average to describe the outcome of $X$. Recall from the remark following Theorem 2.43 that a $D$-ary symbol can carry at most one $D$-it of information. Then the maximum amount of information which can be carried by the codeword on the average is $L \cdot 1 = L$ $D$-its. Since the code is uniquely decodable, the amount of entropy carried by the codeword on the average is $H_D(X)$. Therefore, we have
$$H_D(X) \leq L. \qquad (3.10)$$
In other words, the expected length of a uniquely decodable code is at least the entropy of the source. This argument is rigorized in the proof of the next theorem.

Theorem 3.5 (Entropy Bound) Let $\mathcal{C}$ be a $D$-ary uniquely decodable code for a source random variable $X$ with entropy $H_D(X)$. Then the expected length of $\mathcal{C}$ is lower bounded by $H_D(X)$, i.e.,
$$L \geq H_D(X). \qquad (3.11)$$
This lower bound is tight if and only if $l_k = -\log_D p_k$ for all $k$.

Proof Since $\mathcal{C}$ is uniquely decodable, the lengths of its codewords satisfy the Kraft inequality. Write
$$L = \sum_k p_k l_k = \sum_k p_k \log_D D^{l_k} \qquad (3.12)$$
and recall from Definition 2.33 that
$$H_D(X) = -\sum_k p_k \log_D p_k. \qquad (3.13)$$
Then
$$L - H_D(X) = \sum_k p_k \log_D (p_k D^{l_k}) \qquad (3.14)$$
$$= (\ln D)^{-1} \sum_k p_k \ln (p_k D^{l_k}) \qquad (3.15)$$
$$\geq (\ln D)^{-1} \sum_k p_k \left( 1 - \frac{1}{p_k D^{l_k}} \right) \qquad (3.16)$$
$$= (\ln D)^{-1} \left( \sum_k p_k - \sum_k D^{-l_k} \right) \qquad (3.17)$$
$$\geq (\ln D)^{-1} (1 - 1) \qquad (3.18)$$
$$= 0, \qquad (3.19)$$
where we have invoked the fundamental inequality in (3.16) and the Kraft inequality in (3.18). This proves (3.11). In order for this lower bound to be tight, both (3.16) and (3.18) have to be tight simultaneously. Now (3.16) is tight if and only if $p_k D^{l_k} = 1$, or $l_k = -\log_D p_k$, for all $k$. If this holds, we have
$$\sum_k D^{-l_k} = \sum_k p_k = 1, \qquad (3.20)$$
i.e., (3.18) is also tight. This completes the proof of the theorem.

The entropy bound can be regarded as a generalization of Theorem 2.43, as is seen from the following corollary.

Corollary 3.6 $H(X) \leq \log |\mathcal{X}|$.

Proof Consider encoding each outcome of a random variable $X$ by a distinct symbol in $\{1, 2, \ldots, |\mathcal{X}|\}$. This is obviously a $|\mathcal{X}|$-ary uniquely decodable code with expected length 1. Then by the entropy bound, we have
$$H_{|\mathcal{X}|}(X) \leq 1, \qquad (3.21)$$
which becomes
$$H(X) \leq \log |\mathcal{X}| \qquad (3.22)$$
when the base of the logarithm is not specified.

Motivated by the entropy bound, we now introduce the redundancy of a uniquely decodable code.

Definition 3.7 The redundancy $R$ of a $D$-ary uniquely decodable code is the difference between the expected length of the code and the entropy of the source.

We see from the entropy bound that the redundancy of a uniquely decodable code is always nonnegative.
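The quantities in this section are straightforward to compute. The following sketch uses an assumed set of binary codeword lengths satisfying the Kraft inequality and an assumed source distribution (neither comes from the text) to evaluate the expected length, the entropy, and the redundancy, and to confirm the entropy bound.

```python
from math import log2

lengths = {'A': 1, 'B': 2, 'C': 3, 'D': 3}              # assumed codeword lengths
probs   = {'A': 0.5, 'B': 0.25, 'C': 0.15, 'D': 0.10}   # assumed source distribution

assert sum(2 ** -l for l in lengths.values()) <= 1      # Kraft inequality

L = sum(probs[s] * lengths[s] for s in probs)           # expected length (3.9)
H = -sum(p * log2(p) for p in probs.values())           # H_2(X)
print(f'L = {L:.3f}, H(X) = {H:.3f}, redundancy = {L - H:.3f}')
assert L >= H                                           # the entropy bound (3.11)
```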
3.2 PREFIX CODES

3.2.1 DEFINITION AND EXISTENCE

Definition 3.8 A code is called a prefix-free code if no codeword is a prefix of any other codeword. For brevity, a prefix-free code will be referred to as a prefix code.

Example 3.9 The code $\mathcal{C}$ in Example 3.3 is not a prefix code because the codeword 0 is a prefix of the codeword 01, and the codeword 1 is a prefix of the codeword 10. It can easily be checked that the following code $\mathcal{C}'$ is a prefix code.

x    C'(x)
A    0
B    10
C    110
D    1111

A $D$-ary tree is a graphical representation of a collection of finite sequences of $D$-ary symbols. In a $D$-ary tree, each node has at most $D$ children. If a node has at least one child, it is called an internal node, otherwise it is called a leaf. The children of an internal node are labeled by the $D$ symbols in the code alphabet.

A $D$-ary prefix code can be represented by a $D$-ary tree with the leaves of the tree being the codewords. Such a tree is called the code tree for the prefix code. Figure 3.1 shows the code tree for the prefix code $\mathcal{C}'$ in Example 3.9.

[Figure 3.1. The code tree for the code $\mathcal{C}'$, with leaves 0, 10, 110, and 1111.]

As we have mentioned in Section 3.1, once a sequence of codewords is concatenated, the boundaries of the codewords are no longer explicit. Prefix codes have the desirable property that the end of a codeword can be recognized instantaneously, so that it is not necessary to make reference to the future codewords during the decoding process. For example, for the source sequence BCDAC, the code $\mathcal{C}'$ in Example 3.9 produces the code sequence 1011011110110. Based on this binary sequence, the decoder can reconstruct the source sequence as follows. The first bit 1 cannot form the first codeword because 1 is not a valid codeword. The first two bits 10 must form the first codeword because it is a valid codeword and it is not the prefix of any other codeword. The same procedure is repeated to locate the end of the next codeword, and the code sequence is parsed as 10, 110, 1111, 0, 110. Then the source sequence BCDAC can be reconstructed correctly.
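The instantaneous decoding procedure just described is easily sketched in code. The following is a minimal illustration (not part of the text's development) using the prefix code $\mathcal{C}'$ of Example 3.9: the decoder accumulates code symbols until the buffer matches a codeword, outputs the symbol, and starts over.

```python
code = {'A': '0', 'B': '10', 'C': '110', 'D': '1111'}
inverse = {bits: symbol for symbol, bits in code.items()}

def decode(bitstring):
    output, buffer = [], ''
    for bit in bitstring:
        buffer += bit
        if buffer in inverse:              # a codeword is complete; since no codeword
            output.append(inverse[buffer])  # is a prefix of another, this is safe
            buffer = ''
    assert buffer == '', 'trailing bits do not form a codeword'
    return ''.join(output)

encoded = ''.join(code[s] for s in 'BCDAC')
print(encoded, '->', decode(encoded))       # 1011011110110 -> BCDAC
```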
Since a prefix code can always be decoded correctly, it is a uniquely decodable code. Therefore, by Theorem 3.4, the codeword lengths of a prefix code also satisfy the Kraft inequality. In the next theorem, we show that the Kraft inequality fully characterizes the existence of a prefix code.

Theorem 3.10 There exists a $D$-ary prefix code with codeword lengths $l_1, l_2, \ldots, l_m$ if and only if the Kraft inequality
$$\sum_{k=1}^{m} D^{-l_k} \leq 1 \qquad (3.23)$$
is satisfied.

Proof We only need to prove the existence of a $D$-ary prefix code with codeword lengths $l_1, l_2, \ldots, l_m$ if these lengths satisfy the Kraft inequality. Without loss of generality, assume that $l_1 \leq l_2 \leq \cdots \leq l_m$.

Consider all the $D$-ary sequences of lengths less than or equal to $l_m$ and regard them as the nodes of the full $D$-ary tree of depth $l_m$. We will refer to a sequence of length $l$ as a node of order $l$. Our strategy is to choose nodes as codewords in nondecreasing order of the codeword lengths. Specifically, we choose a node of order $l_1$ as the first codeword, then a node of order $l_2$ as the second codeword, and so on, such that each newly chosen codeword is not prefixed by any of the previously chosen codewords. If we can successfully choose all the codewords, then the resultant set of codewords forms a prefix code with the desired set of lengths.

There are $D^{l_1} \geq D > 1$ (since $l_1 \geq 1$) nodes of order $l_1$ which can be chosen as the first codeword. Thus choosing the first codeword is always possible. Assume that the first $i$ codewords have been chosen successfully, where $1 \leq i \leq m - 1$, and we want to choose a node of order $l_{i+1}$ as the $(i+1)$st codeword such that it is not prefixed by any of the previously chosen codewords. In other words, the $(i+1)$st node to be chosen cannot be a descendant of any of the previously chosen codewords. Observe that for $1 \leq j \leq i$, the codeword with length $l_j$ has $D^{l_{i+1} - l_j}$ descendants of order $l_{i+1}$. Since all the previously chosen codewords are not prefixes of each other, their descendants of order $l_{i+1}$ do not overlap. Therefore, upon noting that the total number of nodes of order $l_{i+1}$ is $D^{l_{i+1}}$, the number of nodes which can be chosen as the $(i+1)$st codeword is
$$D^{l_{i+1}} - D^{l_{i+1} - l_1} - \cdots - D^{l_{i+1} - l_i}. \qquad (3.24)$$
If $l_1, l_2, \ldots, l_m$ satisfy the Kraft inequality, we have
$$D^{-l_1} + \cdots + D^{-l_i} + D^{-l_{i+1}} \leq 1. \qquad (3.25)$$
Multiplying by $D^{l_{i+1}}$ and rearranging the terms, we have
$$D^{l_{i+1}} - D^{l_{i+1} - l_1} - \cdots - D^{l_{i+1} - l_i} \geq 1. \qquad (3.26)$$
The left hand side is the number of nodes which can be chosen as the $(i+1)$st codeword as given in (3.24). Therefore, it is possible to choose the $(i+1)$st codeword. Thus we have shown the existence of a prefix code with codeword lengths $l_1, l_2, \ldots, l_m$, completing the proof.

A probability distribution $\{p_k\}$ such that for all $k$, $p_k = D^{-n_k}$, where $n_k$ is a positive integer, is called a $D$-adic distribution. When $D = 2$, $\{p_k\}$ is called a dyadic distribution. From Theorem 3.5 and the above theorem, we can obtain the following result as a corollary.

Corollary 3.11 There exists a $D$-ary prefix code which achieves the entropy bound for a distribution $\{p_k\}$ if and only if $\{p_k\}$ is $D$-adic.

Proof Consider a $D$-ary prefix code which achieves the entropy bound for a distribution $\{p_k\}$. Let $l_k$ be the length of the codeword assigned to the probability $p_k$. By Theorem 3.5, for all $k$, $l_k = -\log_D p_k$, or $p_k = D^{-l_k}$. Thus $\{p_k\}$ is $D$-adic.

Conversely, suppose $\{p_k\}$ is $D$-adic, and let $p_k = D^{-n_k}$ for all $k$. Let $l_k = n_k$ for all $k$. Then by the Kraft inequality, there exists a prefix code with codeword lengths $\{l_k\}$, because
$$\sum_k D^{-l_k} = \sum_k D^{-n_k} = \sum_k p_k = 1. \qquad (3.27)$$
Assigning the codeword with length $l_k$ to the probability $p_k$ for all $k$, we see from Theorem 3.5 that this code achieves the entropy bound.
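The existence claim of Theorem 3.10 can also be realized by a short constructive sketch. The version below uses a "canonical" assignment (a different bookkeeping from the tree argument in the proof, offered only as an illustration): given binary codeword lengths sorted in nondecreasing order and satisfying the Kraft inequality, each codeword is the first $l_k$ bits of the running Kraft sum.

```python
def prefix_code_from_lengths(lengths):
    # lengths must be sorted in nondecreasing order and satisfy Kraft
    assert lengths == sorted(lengths)
    assert sum(2 ** -l for l in lengths) <= 1, 'Kraft inequality violated'
    codewords, value, prev_len = [], 0, 0
    for l in lengths:
        value <<= (l - prev_len)              # rescale the running sum to l bits
        codewords.append(format(value, '0{}b'.format(l)))
        value += 1                            # advance past the subtree just used
        prev_len = l
    return codewords

print(prefix_code_from_lengths([1, 2, 3, 3]))  # ['0', '10', '110', '111']
print(prefix_code_from_lengths([2, 2, 2]))     # ['00', '01', '10']
```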
3.2.2 HUFFMAN CODES

As we have mentioned, the efficiency of a uniquely decodable code is measured by its expected length. Thus for a given source $X$, we are naturally interested in prefix codes which have the minimum expected length. Such codes, called optimal codes, can be constructed by the Huffman procedure, and these codes are referred to as Huffman codes. In general, there exists more than one optimal code for a source, and some optimal codes cannot be constructed by the Huffman procedure.

For simplicity, we first discuss binary Huffman codes. A binary prefix code for a source $X$ with distribution $\{p_k\}$ is represented by a binary code tree, with each leaf in the code tree corresponding to a codeword. The Huffman procedure is to form a code tree such that the expected length is minimum. The procedure is described by a very simple rule:

Keep merging the two smallest probability masses until one probability mass (i.e., 1) is left.

The merging of two probability masses corresponds to the formation of an internal node of the code tree. We now illustrate the Huffman procedure by the following example.

[Figure 3.2. The Huffman procedure: the probability masses 0.35, 0.1, 0.15, 0.2, 0.2 are successively merged into 0.25, 0.4, 0.6, and finally 1, yielding the codewords 00, 010, 011, 10, 11.]

Example 3.12 Let $X$ be the source with $\mathcal{X} = \{A, B, C, D, E\}$, and the probabilities are 0.35, 0.1, 0.15, 0.2, 0.2, respectively. The Huffman procedure is shown in Figure 3.2. In the first step, we merge probability masses 0.1 and 0.15 into a probability mass 0.25. In the second step, we merge probability masses 0.2 and 0.2 into a probability mass 0.4. In the third step, we merge probability masses 0.35 and 0.25 into a probability mass 0.6. Finally, we merge probability masses 0.6 and 0.4 into a probability mass 1. A code tree is then formed. Upon assigning 0 and 1 (in any convenient way) to each pair of branches at an internal node, we obtain the codeword assigned to each source symbol.

In the Huffman procedure, sometimes there is more than one choice of merging the two smallest probability masses. We can take any one of these choices without affecting the optimality of the code eventually obtained.

For an alphabet of size $m$, it takes $m - 1$ steps to complete the Huffman procedure for constructing a binary code, because we merge two probability masses in each step. In the resulting code tree, there are $m$ leaves and $m - 1$ internal nodes.

In the Huffman procedure for constructing a $D$-ary code, the smallest $D$ probability masses are merged in each step. If the resulting code tree is formed in $k + 1$ steps, where $k \geq 0$, then there will be $k + 1$ internal nodes and $D + k(D - 1)$ leaves, where each leaf corresponds to a source symbol in the alphabet. If the alphabet size $m$ has the form $D + k(D - 1)$, then we can apply the Huffman procedure directly. Otherwise, we need to add a few dummy symbols with probability 0 to the alphabet in order to make the total number of symbols have the form $D + k(D - 1)$.

Example 3.13 If we want to construct a quaternary Huffman code ($D = 4$) for the source in the last example, we need to add 2 dummy symbols so that the total number of symbols becomes $7 = 4 + (1)(4 - 1)$, where $k = 1$. In general, we need to add at most $D - 2$ dummy symbols.

In Section 3.1, we have proved the entropy bound for a uniquely decodable code. This bound also applies to a prefix code since a prefix code is uniquely decodable. In particular, it applies to a Huffman code, which is a prefix code by construction. Thus the expected length of a Huffman code is at least the entropy of the source. In Example 3.12, the entropy $H(X)$ is 2.202 bits, while the expected length of the Huffman code is
$$0.35(2) + 0.1(3) + 0.15(3) + 0.2(2) + 0.2(2) = 2.25. \qquad (3.28)$$
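The merging rule translates directly into a short program. The sketch below (a compact heap-based rendering of the procedure, not the text's own presentation) is applied to the source of Example 3.12; ties may be broken differently from Figure 3.2, but the expected length is again 2.25 bits.

```python
import heapq

def huffman(probs):
    # heap entries: (probability mass, tie-breaking counter, {symbol: codeword so far})
    heap = [(p, i, {s: ''}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)          # the two smallest masses
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: '0' + w for s, w in c0.items()}
        merged.update({s: '1' + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, counter, merged))
        counter += 1
    return heap[0][2]

probs = {'A': 0.35, 'B': 0.1, 'C': 0.15, 'D': 0.2, 'E': 0.2}
code = huffman(probs)
L = sum(probs[s] * len(w) for s, w in code.items())
print(code, 'expected length =', L)               # expected length = 2.25
```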
We now turn to proving the optimality of a Huffman code. For simplicity, we will only prove the optimality of a binary Huffman code. Extension of the proof to the general case is straightforward. Without loss of generality, assume that
$$p_1 \geq p_2 \geq \cdots \geq p_m. \qquad (3.29)$$
Denote the codeword assigned to $p_i$ by $c_i$, and denote its length by $l_i$. To prove that a Huffman code is actually optimal, we make the following observations.

Lemma 3.14 In an optimal code, shorter codewords are assigned to larger probabilities.

Proof Consider $1 \leq i < j \leq m$ such that $p_i > p_j$. Assume that in a code, the codewords $c_i$ and $c_j$ are such that $l_i > l_j$, i.e., a shorter codeword is assigned to a smaller probability. Then by exchanging $c_i$ and $c_j$, the expected length of the code is changed by
$$(p_i l_j + p_j l_i) - (p_i l_i + p_j l_j) = (p_i - p_j)(l_j - l_i) < 0 \qquad (3.30)$$
since $p_i > p_j$ and $l_i > l_j$. In other words, the code can be improved and therefore is not optimal. The lemma is proved.

Lemma 3.15 There exists an optimal code in which the codewords assigned to the two smallest probabilities are siblings, i.e., the two codewords have the same length and they differ only in the last symbol.

Proof The reader is encouraged to trace the steps in this proof by drawing a code tree. Consider any optimal code. From the last lemma, the codeword $c_m$ assigned to $p_m$ has the longest length. Then the sibling of $c_m$ cannot be the prefix of another codeword.

We claim that the sibling of $c_m$ must be a codeword. To see this, assume that it is not a codeword (and it is not the prefix of another codeword). Then we can replace $c_m$ by its parent to improve the code, because the length of the codeword assigned to $p_m$ is reduced by 1 while all the other codewords remain unchanged. This is a contradiction to the assumption that the code is optimal. Therefore, the sibling of $c_m$ must be a codeword.

If the sibling of $c_m$ is assigned to $p_{m-1}$, then the code already has the desired property, i.e., the codewords assigned to the two smallest probabilities are siblings. If not, assume that the sibling of $c_m$ is assigned to $p_i$, where $i < m - 1$. Since $p_i \geq p_{m-1}$, we have $l_i \leq l_{m-1}$. On the other hand, by Lemma 3.14, $l_{m-1} \leq l_m = l_i$, which implies $l_i = l_{m-1}$. Then we can exchange the codewords for $p_i$ and $p_{m-1}$ without changing the expected length of the code (i.e., the code remains optimal) to obtain the desired code. The lemma is proved.

Suppose $c_i$ and $c_j$ are siblings in a code tree. Then $l_i = l_j$. If we replace $c_i$ and $c_j$ by a common codeword at their parent, call it $c_{ij}$, then we obtain a reduced code tree, and the probability of $c_{ij}$ is $p_i + p_j$. Accordingly, the probability set becomes a reduced probability set with $p_i$ and $p_j$ replaced by a probability $p_i + p_j$. Let $L$ and $L'$ be the expected lengths of the original code and the reduced code, respectively. Then
$$L - L' = (p_i l_i + p_j l_j) - (p_i + p_j)(l_i - 1) \qquad (3.31)$$
$$= (p_i l_i + p_j l_i) - (p_i + p_j)(l_i - 1) \qquad (3.32)$$
$$= p_i + p_j, \qquad (3.33)$$
which implies
$$L = L' + (p_i + p_j). \qquad (3.34)$$
This relation says that the difference between the expected length of the original code and the expected length of the reduced code depends only on the values of the two probabilities merged, but not on the structure of the reduced code tree.

Theorem 3.16 The Huffman procedure produces an optimal prefix code.

Proof Consider an optimal code in which $c_m$ and $c_{m-1}$ are siblings. Such an optimal code exists by Lemma 3.15. Let $\{p'_k\}$ be the reduced probability set obtained from $\{p_k\}$ by merging $p_m$ and $p_{m-1}$. From (3.34), we see that $L$ is the expected length of an optimal code for $\{p_k\}$ if and only if $L'$ is the expected length of an optimal code for $\{p'_k\}$. Therefore, if we can find an optimal code for $\{p'_k\}$, we can use it to construct an optimal code for $\{p_k\}$. Note that by merging $p_m$ and $p_{m-1}$, the size of the problem, namely the total number of probability masses, is reduced by one. To find an optimal code for $\{p'_k\}$, we again merge the two smallest probabilities in $\{p'_k\}$. This is repeated until the size of the problem is eventually reduced to 2, for which we know that an optimal code has two codewords of length 1. In the last step of the Huffman procedure, two probability masses are merged, which corresponds to the formation of a code with two codewords of length 1. Thus the Huffman procedure indeed produces an optimal code.
We have seen that the expected length of a Huffman code is lower bounded by the entropy of the source. On the other hand, it would be desirable to obtain an upper bound in terms of the entropy of the source. This is given in the next theorem.

Theorem 3.17 The expected length of a Huffman code, denoted by $L_{\mathrm{Huff}}$, satisfies
$$L_{\mathrm{Huff}} < H_D(X) + 1. \qquad (3.35)$$
This bound is the tightest among all the upper bounds on $L_{\mathrm{Huff}}$ which depend only on the source entropy.

Proof We will construct a prefix code with expected length less than $H(X) + 1$. Then, because a Huffman code is an optimal prefix code, its expected length $L_{\mathrm{Huff}}$ is upper bounded by $H(X) + 1$.

Consider constructing a prefix code with codeword lengths $\{l_k\}$, where
$$l_k = \lceil -\log_D p_k \rceil. \qquad (3.36)$$
Then
$$-\log_D p_k \leq l_k < -\log_D p_k + 1, \qquad (3.37)$$
or
$$p_k \geq D^{-l_k}. \qquad (3.38)$$
Thus
$$\sum_k D^{-l_k} \leq \sum_k p_k = 1, \qquad (3.39)$$
i.e., $\{l_k\}$ satisfies the Kraft inequality, which implies that it is possible to construct a prefix code with codeword lengths $\{l_k\}$.

It remains to show that $L$, the expected length of this code, is less than $H(X) + 1$. Toward this end, consider
$$L = \sum_k p_k l_k \qquad (3.40)$$
$$< \sum_k p_k (-\log_D p_k + 1) \qquad (3.41)$$
$$= -\sum_k p_k \log_D p_k + \sum_k p_k \qquad (3.42)$$
$$= H(X) + 1, \qquad (3.43)$$
where (3.41) follows from the upper bound in (3.37). Thus we conclude that
$$L_{\mathrm{Huff}} \leq L < H(X) + 1. \qquad (3.44)$$
To see that this upper bound is the tightest possible, we have to show that there exists a sequence of distributions $P_k$ such that $L_{\mathrm{Huff}}$ approaches $H(P_k) + 1$ as $k \to \infty$. This can be done by considering the sequence of $D$-ary distributions
$$P_k = \left\{ 1 - \frac{D - 1}{k}, \frac{1}{k}, \ldots, \frac{1}{k} \right\}, \qquad (3.45)$$
where $k \geq D$. The Huffman code for each $P_k$ consists of $D$ codewords of length 1. Thus $L_{\mathrm{Huff}}$ is equal to 1 for all $k$. As $k \to \infty$, $H(P_k) \to 0$, and hence $L_{\mathrm{Huff}}$ approaches $H(P_k) + 1$. The theorem is proved.

The code constructed in the above proof is known as the Shannon code. The idea is that in order for the code to be near-optimal, we should choose $l_k$ close to $-\log p_k$ for all $k$. When $\{p_k\}$ is $D$-adic, $l_k$ can be chosen to be exactly $-\log p_k$ because the latter are integers. In this case, the entropy bound is tight.

From the entropy bound and the above theorem, we have
$$H(X) \leq L_{\mathrm{Huff}} < H(X) + 1. \qquad (3.46)$$
Now suppose we use a Huffman code to encode $X_1, X_2, \ldots, X_n$, which are i.i.d. copies of $X$. Let us denote the expected length of this Huffman code by $L^n_{\mathrm{Huff}}$. Then (3.46) becomes
$$n H(X) \leq L^n_{\mathrm{Huff}} < n H(X) + 1. \qquad (3.47)$$
Dividing by $n$, we obtain
$$H(X) \leq \frac{1}{n} L^n_{\mathrm{Huff}} < H(X) + \frac{1}{n}. \qquad (3.48)$$
As $n \to \infty$, the upper bound approaches the lower bound. Therefore, $\frac{1}{n} L^n_{\mathrm{Huff}}$, the coding rate of the code, namely the average number of code symbols needed to encode a source symbol, approaches $H(X)$ as $n \to \infty$. But of course, as $n$ becomes large, constructing a Huffman code becomes very complicated. Nevertheless, this result indicates that entropy is a fundamental measure of information.
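The Shannon code of the proof of Theorem 3.17 is simple to construct explicitly. The sketch below (an illustration only) computes the lengths $l_k = \lceil -\log_2 p_k \rceil$ for the source of Example 3.12, checks the Kraft inequality, and verifies $H(X) \leq L < H(X) + 1$.

```python
from math import ceil, log2

probs = {'A': 0.35, 'B': 0.1, 'C': 0.15, 'D': 0.2, 'E': 0.2}
lengths = {s: ceil(-log2(p)) for s, p in probs.items()}    # (3.36) with D = 2

kraft = sum(2 ** -l for l in lengths.values())
L = sum(probs[s] * lengths[s] for s in probs)
H = -sum(p * log2(p) for p in probs.values())

print('lengths =', lengths, 'Kraft sum =', round(kraft, 4))
print(f'H(X) = {H:.3f} <= L = {L:.3f} < H(X) + 1 = {H + 1:.3f}')
assert kraft <= 1 and H <= L < H + 1
```

Note that the Shannon code here has expected length 2.75 bits, noticeably worse than the Huffman code's 2.25 bits, even though both satisfy the bounds (3.46).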
3.3 REDUNDANCY OF PREFIX CODES

The entropy bound for a uniquely decodable code has been proved in Section 3.1. In this section, we present an alternative proof specifically for prefix codes which offers much insight into the redundancy of such codes.

Let $X$ be a source random variable with probability distribution
$$\{p_1, p_2, \ldots, p_m\}, \qquad (3.49)$$
where $m \geq 2$. A $D$-ary prefix code for $X$ can be represented by a $D$-ary code tree with $m$ leaves, where each leaf corresponds to a codeword. We denote the leaf corresponding to $p_k$ by $c_k$ and the order of $c_k$ by $l_k$, and assume that the alphabet is
$$\{0, 1, \ldots, D - 1\}. \qquad (3.50)$$
Let $\mathcal{I}$ be the index set of all the internal nodes (including the root) in the code tree.

Instead of matching codewords by brute force, we can use the code tree of a prefix code for more efficient decoding. To decode a codeword, we trace the path specified by the codeword from the root of the code tree until it terminates at the leaf corresponding to that codeword. Let $q_k$ be the probability of reaching an internal node $k \in \mathcal{I}$ during the decoding process. The probability $q_k$ is called the reaching probability of internal node $k$. Evidently, $q_k$ is equal to the sum of the probabilities of all the leaves descending from node $k$.

Let $p_{k,j}$ be the probability that the $j$th branch of node $k$ is taken during the decoding process. The probabilities $p_{k,j}$, $0 \leq j \leq D - 1$, are called the branching probabilities of node $k$, and
$$q_k = \sum_j p_{k,j}. \qquad (3.51)$$
Once node $k$ is reached, the conditional branching distribution is
$$\left\{ \frac{p_{k,0}}{q_k}, \frac{p_{k,1}}{q_k}, \ldots, \frac{p_{k,D-1}}{q_k} \right\}. \qquad (3.52)$$
Then define the conditional entropy of node $k$ by
$$h_k = H_D\!\left( \left\{ \frac{p_{k,0}}{q_k}, \frac{p_{k,1}}{q_k}, \ldots, \frac{p_{k,D-1}}{q_k} \right\} \right), \qquad (3.53)$$
where, with a slight abuse of notation, we have used $H_D(\cdot)$ to denote the entropy in the base $D$ of the conditional branching distribution in the parenthesis. By Theorem 2.43, $h_k \leq 1$. The following lemma relates the entropy of $X$ with the structure of the code tree.

Lemma 3.18 $H_D(X) = \sum_{k \in \mathcal{I}} q_k h_k$.

Proof We prove the lemma by induction on the number of internal nodes of the code tree. If there is only one internal node, it must be the root of the tree. Then the lemma is trivially true upon observing that the reaching probability of the root is equal to 1.

Assume the lemma is true for all code trees with $n$ internal nodes. Now consider a code tree with $n + 1$ internal nodes. Let $k$ be an internal node such that $k$ is the parent of a leaf $c$ with maximum order. Each sibling of $c$ may or may not be a leaf. If it is not a leaf, then it cannot be the ascendant of another leaf because we assume that $c$ is a leaf with maximum order. Now consider revealing the outcome of $X$ in two steps. In the first step, if the outcome of $X$ is not a leaf descending from node $k$, we identify the outcome exactly; otherwise, we identify the outcome to be a descendant of node $k$. We call this random variable $V$. If we do not identify the outcome exactly in the first step, which happens with probability $q_k$, we further identify in the second step which of the descendant(s) of node $k$ the outcome is (node $k$ has only one descendant if all the siblings of $c$ are not leaves). We call this random variable $W$. If the second step is not necessary, we assume that $W$ takes a constant value with probability 1. Then $X = (V, W)$.

The outcome of $V$ can be represented by a code tree with $n$ internal nodes which is obtained by pruning the original code tree at node $k$. Then by the induction hypothesis,
$$H(V) = \sum_{k' \in \mathcal{I} \setminus \{k\}} q_{k'} h_{k'}. \qquad (3.54)$$
By the chain rule for entropy, we have
$$H(X) = H(V) + H(W|V) \qquad (3.55)$$
$$= \sum_{k' \in \mathcal{I} \setminus \{k\}} q_{k'} h_{k'} + (1 - q_k) \cdot 0 + q_k h_k \qquad (3.56)$$
$$= \sum_{k' \in \mathcal{I}} q_{k'} h_{k'}. \qquad (3.57)$$
The lemma is proved.
The next lemma expresses the expected length of a prefix code in terms of the reaching probabilities of the internal nodes of the code tree.

Lemma 3.19 $L = \sum_{k \in \mathcal{I}} q_k$.

Proof Define
$$a_{jk} = \begin{cases} 1 & \text{if leaf } c_j \text{ is a descendant of internal node } k \\ 0 & \text{otherwise.} \end{cases} \qquad (3.58)$$
Then
$$l_j = \sum_{k \in \mathcal{I}} a_{jk}, \qquad (3.59)$$
because there are exactly $l_j$ internal nodes of which $c_j$ is a descendant if the order of $c_j$ is $l_j$. On the other hand,
$$q_k = \sum_j a_{jk}\, p_j. \qquad (3.60)$$
Then
$$L = \sum_j p_j l_j \qquad (3.61)$$
$$= \sum_j p_j \sum_{k \in \mathcal{I}} a_{jk} \qquad (3.62)$$
$$= \sum_{k \in \mathcal{I}} \sum_j a_{jk}\, p_j \qquad (3.63)$$
$$= \sum_{k \in \mathcal{I}} q_k, \qquad (3.64)$$
proving the lemma.

Define the local redundancy of an internal node $k$ by
$$r_k = q_k (1 - h_k). \qquad (3.65)$$
This quantity is local to node $k$ in the sense that it depends only on the branching probabilities of node $k$, and it vanishes if and only if $p_{k,j}/q_k = D^{-1}$ for all $j$, i.e., the node is balanced. Note that $r_k \geq 0$ because $h_k \leq 1$.

The next theorem says that the redundancy of a prefix code is equal to the sum of the local redundancies of all the internal nodes of the code tree.

Theorem 3.20 (Local Redundancy Theorem) Let $L$ be the expected length of a $D$-ary prefix code for a source random variable $X$, and $R$ be the redundancy of the code. Then
$$R = \sum_{k \in \mathcal{I}} r_k. \qquad (3.66)$$

Proof By Lemmas 3.18 and 3.19, we have
$$R = L - H_D(X) \qquad (3.67)$$
$$= \sum_{k \in \mathcal{I}} q_k - \sum_{k \in \mathcal{I}} q_k h_k \qquad (3.68)$$
$$= \sum_{k \in \mathcal{I}} q_k (1 - h_k) \qquad (3.69)$$
$$= \sum_{k \in \mathcal{I}} r_k. \qquad (3.70)$$
The theorem is proved.

We now present a slightly different version of the entropy bound.

Corollary 3.21 (Entropy Bound) Let $R$ be the redundancy of a prefix code. Then $R \geq 0$ with equality if and only if all the internal nodes in the code tree are balanced.

Proof Since $r_k \geq 0$ for all $k$, it is evident from the local redundancy theorem that $R \geq 0$. Moreover, $R = 0$ if and only if $r_k = 0$ for all $k$, which means that all the internal nodes in the code tree are balanced.

Remark Before the entropy bound was stated in Theorem 3.5, we gave the intuitive explanation that the entropy bound results from the fact that a $D$-ary symbol can carry at most one $D$-it of information. Therefore, when the entropy bound is tight, each code symbol has to carry exactly one $D$-it of information. Now consider revealing a random codeword one symbol after another. The above corollary states that in order for the entropy bound to be tight, all the internal nodes in the code tree must be balanced. That is, as long as the codeword is not completed, the next code symbol to be revealed always carries one $D$-it of information because it distributes uniformly on the alphabet. This is consistent with the intuitive explanation we gave for the entropy bound.

Example 3.22 The local redundancy theorem allows us to lower bound the redundancy of a prefix code based on partial knowledge of the structure of the code tree. More specifically,
$$R \geq \sum_{k \in \mathcal{I}'} r_k \qquad (3.71)$$
for any subset $\mathcal{I}'$ of $\mathcal{I}$. Let $p_{m-1}$ and $p_m$ be the two smallest probabilities in the source distribution. In constructing a binary Huffman code, $p_{m-1}$ and $p_m$ are merged. Then the redundancy of a Huffman code is lower bounded by
$$(p_{m-1} + p_m) \left[ 1 - H\!\left( \left\{ \frac{p_{m-1}}{p_{m-1} + p_m}, \frac{p_m}{p_{m-1} + p_m} \right\} \right) \right], \qquad (3.72)$$
the local redundancy of the parent of the two leaves corresponding to $p_{m-1}$ and $p_m$. See Yeung [214] for progressive lower and upper bounds on the redundancy of a Huffman code.
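Lemmas 3.18 and 3.19 and the local redundancy theorem are easy to verify numerically for a concrete code tree. The sketch below does so for the binary prefix code of Example 3.9 under an assumed source distribution (the probabilities are illustrative, not from the text); the internal nodes of the code tree are the proper prefixes of the codewords.

```python
from math import log2

code  = {'A': '0', 'B': '10', 'C': '110', 'D': '1111'}   # the code of Example 3.9
probs = {'A': 0.5, 'B': 0.25, 'C': 0.15, 'D': 0.10}      # assumed distribution

def entropy(ps):
    return -sum(p * log2(p) for p in ps if p > 0)

# internal nodes of the code tree = proper prefixes of codewords (root = '')
internal = {w[:i] for w in code.values() for i in range(len(w))}

q_sum = qh_sum = r_sum = 0.0
for node in internal:
    branch = [sum(p for s, p in probs.items() if code[s].startswith(node + b))
              for b in '01']                             # branching probabilities
    q = sum(branch)                                      # reaching probability
    h = entropy([p / q for p in branch])                 # conditional entropy of node
    q_sum += q
    qh_sum += q * h
    r_sum += q * (1 - h)                                 # local redundancy (3.65)

L = sum(probs[s] * len(w) for s, w in code.items())
H = entropy(probs.values())
assert abs(q_sum - L) < 1e-9                             # Lemma 3.19
assert abs(qh_sum - H) < 1e-9                            # Lemma 3.18
assert abs(r_sum - (L - H)) < 1e-9                       # Theorem 3.20
print(f'L = {L:.4f}, H(X) = {H:.4f}, R = {L - H:.4f}')
```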
PROBLEMS

1. Construct a binary Huffman code for the distribution $\{0.25, 0.05, 0.1, 0.13, 0.2, \ldots\}$.

2. Construct a ternary Huffman code for the source distribution in Problem 1.

3. Construct an optimal binary prefix code for the source distribution in Problem 1 such that all the codewords have even lengths.

4. Prove directly that the codeword lengths of a prefix code satisfy the Kraft inequality without using Theorem 3.4.

5. Prove that if $p_1 > 0.4$, then the shortest codeword of a binary Huffman code has length equal to 1. Then prove that the redundancy of such a Huffman code is lower bounded by $1 - h_b(p_1)$. (Johnsen [103].)

6. Suffix codes A code is a suffix code if no codeword is a suffix of any other codeword. Show that a suffix code is uniquely decodable.

7. Fix-free codes A code is a fix-free code if it is both a prefix code and a suffix code. Let $l_1, l_2, \ldots, l_m$ be $m$ positive integers. Prove that if
$$\sum_{k=1}^{m} 2^{-l_k} \leq \frac{1}{2},$$
then there exists a binary fix-free code with codeword lengths $l_1, l_2, \ldots, l_m$. (Ahlswede et al. [5].)

8. Random coding for prefix codes Construct a binary prefix code with codeword lengths $l_1 \leq l_2 \leq \cdots \leq l_m$ as follows. For each $1 \leq k \leq m$, the codeword with length $l_k$ is chosen independently from the set of all $2^{l_k}$ possible binary strings with length $l_k$ according to the uniform distribution. Let $P_m(\mathrm{good})$ be the probability that the code so constructed is a prefix code.
a) Prove that $P_2(\mathrm{good}) = (1 - 2^{-l_1})^+$, where
$$(x)^+ = \begin{cases} x & \text{if } x \geq 0 \\ 0 & \text{if } x < 0. \end{cases}$$
b) Prove by induction on $m$ that
$$P_m(\mathrm{good}) = \prod_{k=1}^{m} \left( 1 - \sum_{j=1}^{k-1} 2^{-l_j} \right)^{\!+}.$$
c) Observe that there exists a prefix code with codeword lengths $l_1, l_2, \ldots, l_m$ if and only if $P_m(\mathrm{good}) > 0$. Show that $P_m(\mathrm{good}) > 0$ is equivalent to the Kraft inequality.
By using this random coding method, one can derive the Kraft inequality without knowing the inequality ahead of time. (Ye and Yeung [210].)

9. Let $X$ be a source random variable. Suppose a certain probability mass $p_k$ in the distribution of $X$ is given. Let
$$l_j = \begin{cases} \lceil -\log p_j \rceil & \text{if } j = k \\ \lceil -\log (p_j + x_j) \rceil & \text{if } j \neq k, \end{cases}$$
where
$$x_j = \frac{p_j}{1 - p_k} \left( p_k - 2^{-\lceil -\log p_k \rceil} \right)$$
for all $j \neq k$.
a) Show that $l_j \leq \lceil -\log p_j \rceil$ for all $j$.
b) Show that $\{l_j\}$ satisfies the Kraft inequality.
c) Obtain an upper bound on $L_{\mathrm{Huff}}$ in terms of $H(X)$ and $p_k$ which is tighter than $H(X) + 1$. This shows that when partial knowledge about the source distribution in addition to the source entropy is available, tighter upper bounds on $L_{\mathrm{Huff}}$ can be obtained.
(Ye and Yeung [211].)

HISTORICAL NOTES

The foundation for the material in this chapter can be found in Shannon's original paper [173]. The Kraft inequality for uniquely decodable codes was first proved by McMillan [142]. The proof given here is due to Karush [106]. The Huffman coding procedure was devised and proved to be optimal by Huffman [97]. The same procedure was devised independently by Zimmerman [224]. Linder et al. [125] have proved the existence of an optimal prefix code for an infinite source alphabet which can be constructed from Huffman codes for truncations of the source distribution. The local redundancy theorem is due to Yeung [214].

Chapter 4

WEAK TYPICALITY

In the last chapter, we have discussed the significance of entropy in the context of zero-error data compression. In this chapter and the next, we explore entropy in terms of the asymptotic behavior of i.i.d. sequences. Specifically, two versions of the asymptotic equipartition property (AEP), namely the weak AEP and the strong AEP, are discussed. The role of these AEP's in information theory is analogous to the role of the weak law of large numbers in probability theory. In this chapter, the weak AEP and its relation with the source coding theorem are discussed. All the logarithms are in the base 2 unless otherwise specified.

4.1 THE WEAK AEP

We consider an information source $\{X_k, k \geq 1\}$ where $X_k$ are i.i.d. with distribution $p(x)$. We use $X$ to denote the generic random variable and $H(X)$ to denote the common entropy for all $X_k$, where $H(X) < \infty$. Let $\mathbf{X} = (X_1, X_2, \ldots, X_n)$. Since $X_k$ are i.i.d.,
$$p(\mathbf{X}) = p(X_1)\, p(X_2) \cdots p(X_n). \qquad (4.1)$$
Note that $p(\mathbf{X})$ is a random variable because it is a function of the random variables $X_1, X_2, \ldots, X_n$. We now prove an asymptotic property of $p(\mathbf{X})$ called the weak asymptotic equipartition property (weak AEP).
We use to denote the generic random variable and Hi . Let to denote the common entropy for all , where . Since are i.i.d., D R A F T September 13, 2001, 6:27pm D R A F T 62 A FIRST COURSE IN INFORMATION THEORY ® ·K®H·c·K® are i.i.d., by (4.1), ß ß ì ÃfÄÅ ² @ Ý ¹ì º ÃfÄÅ ² ® Ý Ö (4.4) ¿ ¿ ² ® Ý are also i.i.d. Then by the weak law of large The random variables ÃfÄÅ ¿ of (4.4) tends to numbers, the right hand side ì ÃfÄÅ ² ® Ý ¹ Ú ² ® Ý · (4.5) ¿ Proof Since in probability, proving the theorem. The weak AEP is nothing more than a straightforward application of the weak law of large numbers. However, as we will see shortly, this property has significant implications. ؤ[ä æ;äøå;äJ¥§æÛ§«]© The weakly typical set & F with respect to ²JÀ7Ý J² À · À H·· À Ý ÍÓÏ suchÆG that ¿ ° of sequences H ¹ ß ì ÃfÄÅ ¿ ² H Ý ì Ú ² ® Ý Þ «· = ß Ú ² ® Ý ì « Þ ì fà ÄÅ ¿ ² H YÝ Þ Ú ² ® 4Ý ó «· (4.7) where « is an arbitrarily small positive real number. The sequences in are called weakly « -typical sequences. ß ß ì fà ÄÅ ² H Ý ¹ ì º fà ÄÅ ²JÀ Ý ¿ ¿ The quantity (4.6) or equivalently, is the set H & F ÆG ° (4.8) Ú ²® Ý is called the empirical entropy of the sequence . The empirical entropy of a weakly typical sequence is close to the true entropy . The important properties of the set are summarized in the next theorem which we will ° see is equivalent to the weak AEP. & F Æ G ¢a£;¤!¥§¦7¤¨§«]¬ÛõBA ¤¸üDCE á ÿ The following hold for any ï ¶ : 1) If HLÍI& F , then ÆG KJ Æ Þ ¿ ² H ÝYÞ KJ Æ Ö \ 3 « ° D R A F T ¯ h¯ ² °³² ¯ h¯ ² °¶² September 13, 2001, 6:27pm (4.9) D R A F T 63 Weak Typicality 2) For sufficiently large, ï ß ì «Ö ÌL@ëÍI& F Æ G ° Ô | sufficiently large, ².ß ì « Ý ¯MJ´¯ Æ ² °¶² Þ ´N& F ´ Þ ¯MJ´¯ Æ Æ G ° _a` 3) For ² (4.10) °¶² Ö (4.11) & F ÆG Proof Property 1 follows immediately from the definition of in (4.7). ° Property 2 is equivalent to Theorem 4.1. To prove Property 3, we use the lower bound in (4.9) and consider Þ _P` Ì7& F Ô Þ|ß · (4.12) ´N& F Æ G ° ´ ¯KJh¯ Æ ² °³² ÆG ° which implies N´ & F ÆG ° ´ Þ ¯MJ´¯ Æ ² ß °¶² Ö (4.13) Note that this upper bound holds for any 5µ . On the other hand, using the upper bound in (4.9) and Theorem 4.1, for sufficiently large, we have ßì « Þ _a` Ì7& F Ô Þ ´N& F ´ ¯MJ´¯ Æ ² °¶² Ö (4.14) Æ G ° Æ G ° Then N´ & F ÆG ° ´µ ².ß ì « Ý ¯MJ´¯ Æ ² °¶² Ö (4.15) Combining (4.13) and (4.15) gives Property 3. The theorem is proved. Remark 1 Theorem 4.3 is a consequence of Theorem 4.1. However, Property 2 in Theorem 4.3 is equivalent to Theorem 4.1. Therefore, Theorem 4.1 and Theorem 4.3 are equivalent, and they will both be referred to as the weak AEP. ® Ý @ ¹ ² ® ·K® H· c· ²JÀ7Ý The weak AEP has the following interpretation. Suppose is drawn i.i.d. according to , where is large. After the sequence is drawn, we ask what the probability of occurrence of the sequence is. The weak AEP says that the probability of occurrence of the sequence drawn is close to ´¯ ² with very high probability. Such a sequence is called a weakly typical sequence. Moreover, the total number of weakly typical sequences is approx imately equal to ´¯ ² . The weak AEP, however, does not say that most of the sequences in are weakly typical. In fact, the number of weakly typical sequences is in general insignificant compared with the total number of sequences, because ¿ J Æ Ï J Æ ´N& F Æ GPO ´ ´)ÏÓ´ Q D R A F T J Æ ¹ h¯ Ê)ÌÍ ² È È } È } È J´¯ Æ ¯ËÊ)ÌÍ ²² / September 13, 2001, 6:27pm ¶ (4.16) D R A F T 64 Ú ²® Ý A FIRST COURSE IN INFORMATION THEORY ÃfÄÅδ)ÏÓ´ is strictly less than . 
The idea is that, as G/ i as long as although the size of the weakly typical set may be insignificant compared with the size of the set of all sequences, the former has almost all the probability. When is large, one can almost think of the sequence as being obtained by choosing a sequence from the weakly typical set according to the uniform distribution. Very often, we concentrate on the properties of typical sequences because any property which is proved to be true for typical sequences will then be true with high probability. This in turn determines the average behavior of a large sample. @ Remark The most likely sequence is in general not weakly typical although the probability of a weakly typical set is close to 1 when is large. For example, for i.i.d. with and , is the most likely sequence, but it is not weakly typical because its empirical entropy is not close to the true entropy. The idea is that as /¹i , the probability of every sequence, including that of the most likely sequence, tends to 0. Therefore, it is not necessary for a weakly typical set to include the most likely sequence in order to make up a probability close to 1. ² ¶ Ý ¹ ¶¸Ö ß ¿ ® 4.2 ².ßcÝ ¹ ¶¸ÖPR .² ß · ß · T· ßcÝ ¿ @ ¹ ² ® ·K® H· ·K® Ý THE SOURCE CODING THEOREM ²JÀÝ drawn i.i.d. accordTo encode a random sequence ing to by a block code, we construct a one-to-one mapping from a subset of to an index set (4.17) S Ï¿ ´ S ´ ¹ T Þ ´)ÏÓ´ " " ¹ Ì ß · · T·*T Ô · ´)Ï´ where . We do not have to assume that is finite. The indices in are called codewords, and the integer is called the block length of the code. If a sequence occurs, the encoder outputs the correspondbits. If a sequence ing codeword which is specified by approximately occurs, the encoder outputs the constant codeword 1. In either case, the codeword output by the encoder is decoded to the sequence in corresponding to that codeword by the decoder. If a sequence occurs, then is decoded correctly by the decoder. If a sequence occurs, then is not decoded correctly by the decoder. For such a code, its performance is measured by the coding rate defined as (in bits per source symbol), and the probability of error is given by H Íê S H5Í S ÃfÄÅT H Ã]ÄÅT v,w T ¹ _a` ÌL@ Í ê S Ô Ö S ¹ Ï Hê S Í S H Í ¹ ¶ S H (4.18) If the code is not allowed to make any error, i.e., v6w , it is clear that must be taken to be , or . In that case, the coding rate is equal to . However, if we allow v w to be any small quantity, Shannon [173] showed that there exists a block code whose coding rate is arbitrarily ÃfÄÅÊ´)ÏÓ´ D R A F T ´)Ï´ September 13, 2001, 6:27pm D R A F T 65 Weak Typicality Ú ²® Ý @ when is sufficiently large. This is the direct part of Shannon’s close to source coding theorem, and in this sense the source sequence is said to be reconstructed almost perfectly. We now prove the direct part of the source coding theorem by constructing a desired code. First, we fix « and take ï ¶ S ¹ & F Æ G T ¹ ´S ´Ö and (4.19) ° (4.20) For sufficiently large , by the weak AEP, Þ T ¹ ´ S ´ ¹ N´ & F ´ Þ ¯MJ´¯ Æ ² °¶² Ö Æ G ° Therefore, the coding rate ÃfÄÅ T satisfies ß ².ß Ý;ó ² Ý Þ ß Ã fÄÅ ì « Ú ® ì « ÃfÄÅT Þ Ú ² ® Ý;ó « Ö ².ß ì « Ý ¯MJ´¯ Æ ² (4.21) °¶² Also by the weak AEP, ÌL@ Í ê S Ô ¹ (4.22) (4.23) ÌL@ ÍUê & F Æ G ° ÔaÜ «Ö ¶ , the coding rate tends to Ú ² ® Ý , while v6w tends to 0. This proves ¹ v,w _a` _a` Letting «$/ the direct part of the source coding theorem. 
The converse part of the source coding theorem says that if we use a block code with block length and coding rate less than , where does not change with , then v,w^/ as ·/:i . To prove this, consider any code with block length and coding rate less than , so that , the total number of codewords, is at most ¯ ´¯ ² ² . We can use some of these codewords for the typical sequences , and some for the non-typical ° . The total probability of the typical sequences covered sequences ° by the code, by the weak AEP, is upper bounded by ß Æ H+ÍZM& J F Æ G H Í[ê & F Æ G Y MJ Æ ¯ ´¯ Y ó ² ² MJ Æ ¯ ´¯ ² °¶² ¹ Ú ²® Y Ú ² ® ÝW ì V ÝX ì V Y Ö Ý ¯ (4.24) °¶² ÌL@ Í\ê & F Æ G ° ÔÜ ¯ YÝ °¶² ó T V ï ¶ Therefore, the total probability covered by the code is upper bounded by ßì for Ý ¯ °¶² _a` « (4.25) @ sufficiently large, again by the weak AEP. This probability is equal to because v w is the probability that the source sequence is not covered by the code. Thus v,w « (4.26) ¯ °³² v w D R A F T ß ì Ü Y ó · September 13, 2001, 6:27pm D R A F T 66 A FIRST COURSE IN INFORMATION THEORY ï|ß ì ² ¯ Y °³² ó « Ý Ö (4.27) ï ¶ , in particular This inequality holds when is sufficiently large for any « ï ß V V ì for «ZÜ . Then Ü , v6w ~« when is sufficiently large. ß asforº/ anyi «Zand Hence, v,w/ then «/í¶ . This proves the converse part of or v6w the source coding theorem. ¢a£;¤!¥§¦7¤EFFICIENT ¨§« Let ] ¹ SOURCE ² ¯ ·'¯ H·CODING sequence of ·'¯. Ý be a random binary ² Ý Þ length . Then Ú ] with equality ß if and only if ¯ are drawn i.i.d. according to the uniform distribution on Ìc¶¸· Ô . 4.3 Proof By the independence bound for entropy, ² Ý v Þ º Ú ] Ú ² ¯ Ý with equality if and only if (4.28) ¯ with equality if and only if (4.28) and (4.29), we have ¯ are mutually independent. By Theorem 2.43, Ú ² ¯ vÝ Þ fà ÄÅÉ ¹ ß distributes uniformly on (4.29) Ìc¶¸· ß Ô . Combining ² Ý = Þ º Ú ] Ú ² ¯ YÝ Þ ý· (4.30) ¯ ß Ìc¶¸· Ô where this upper bound is tight if and only if are mutually independent and each of them distributes uniformly on . The theorem is proved. ] ¹ ² ¯ ·'¯ · T·'¯ Ý Ìc¶¸· ß Ô ¯ be a sequence of length such that are drawn Let i.i.d. according to the uniform distribution on , and let denote the generic random variable. Then . According to the source coding theorem, for almost perfect reconstruction of , the coding rate of the source code must be at least 1. It turns out that in this case it is possible to use a source code with coding rate exactly equal to 1 while the source sequence can be reconstructed with zero error. This can be done by simply encoding all the possible binary sequences of length , i.e., by taking . Then the coding rate is given by Ú ²¯ Ý ¹ ß ] ¯ ^ ¹ ] ] Ã]ÄÅδ ^´ ¹ ÃfÄÅ ¹ ß Ö ] (4.31) Since each symbol in is a bit and the rate of the best possible code describing is 1 bit per symbol, are called raw bits, with the connotation that they are incompressible. D R A F T ¯ ·'¯ · ·'¯ September 13, 2001, 6:27pm D R A F T 67 Weak Typicality It turns out that the whole idea of efficient source coding by a block code is to describe the information source by a binary sequence consisting of “approximately raw” bits. Consider a sequence of block codes which encode , where are i.i.d. with into , generic random variable , is a binary sequence with length and ?/ i . For simplicity, we assume that the common alphabet is fi nite. Let u be the reconstruction of by the decoder and v,w be the probability of error, i.e. _a` u (4.32) v6w @ ¹ ² ® ·K® · T·K® Ý ® ] @ ÍoÏ ] ¹ ² ¯ ·'¯ · ·'¯ Ý @ ÌL@ ¹ ê @ Ô Ö ¹ [¶ ® as ÿ/i . 
We will show that Further assume v w / mately raw bits. By Fano’s inequality, Since @ u Ú ² @L´M@ u ÝvÞ|ßvó v,w!ÃfÄÅ´)Ï´ ¹ ßËó is a function of It follows that ] ] Q ;Ï Ú ² ® Ý consists of approxi- !Ã]ÄÅδ)ÏÓ´ Ö v,w (4.33) , Ú ² ] Ý ¹ Ú ² ]+· @ u Ý µÚ ² @ u Ý Ö Ú ² ] Ý µ Ú± ² ² @ u Ý Ý µ @ ³ @ u Ý ¹ Ú ²@ Ý ì Ú ²ý @ M@ u ´ µ ;Ú ² ² ® ² Ý Ý ì ².ßvó v,w!ÃfÄÝ ÅÊ´)ÏÓß ´ Ý ¹ Ú ® ì v w fà ÄÅδ)ÏÓ´ ì Ö (4.34) (4.35) (4.36) (4.37) (4.38) (4.39) On the other hand, by Theorem 4.4, Ú ² ] vÝ Þ ýÖ (4.40) Combining (4.39) and (4.40), we have ² Ú ² ® Ý ì v6wÃfÄÅδ)ÏÓ´ Ý ì ß Þ Ú ² ] vÝ Þ ýÖ (4.41) ² Ý is approximately Since v w / ¶ as º/¹i , the above lower bound on Ú ] equal to ²® Ý ;Ú (4.42) Q when is large. Therefore, (4.43) Ú ² ] Ý ýÖ ] ] Q In light of Theorem 4.4, almost attains the maximum possible entropy. In this sense, we say that consists of approximately raw bits. D R A F T September 13, 2001, 6:27pm D R A F T 68 4.4 A FIRST COURSE IN INFORMATION THEORY THE SHANNON-MCMILLAN-BREIMAN THEOREM Ì® Ô ® and ¿ ß ì ÃfÄÅ ² @ Ý / Ú ² ® Ý (4.44) ¿ ² Ý ² Ý is the in probability as !/ i , where @ ¹ ® ·K®H·c·K® . Here Ú ® entropy of the generic random variables ® as well as the entropy rate of the source Ì® Ô . In Section 2.9, we showed that the entropy rate Ú of a source Ì ® Ô exists if the source is stationary. The Shannon-McMillan-Breiman theorem states that if Ì ® Ô is also ergodic, then ß _a` ?ì (4.45) à à fÄÅ _P` LÌ @ZÔ ¹ Ú ¹ ß Ö ~ b ® Ô is stationary and ergodic, then ì Ã]ÄÅ _a` LÌ @Ô not This means that if Ì only almost always converges, but it also almost always converges to Ú . For ²JÀÝ For an i.i.d. information source with generic random variable generic distribution , the weak AEP states that this reason, the Shannon-McMillan-Breiman theorem is also referred to as the weak AEP for stationary ergodic sources. The formal definition of an ergodic source and the statement of the ShannonMcMillan-Breiman theorem require the use of measure theory which is beyond the scope of this book. We point out that the event in (4.45) involves an infinite collection of random variables which cannot be described by a joint distribution unless for very special cases. Without measure theory, the probability of this event in general cannot be properly defined. However, this does not prevent us from developing some appreciation of the Shannon-McMillan-Breiman theorem. be the common alphabet for a stationary source . Roughly Let speaking, a stationary source is ergodic if the time average exhibited by a single realization of the source is equal to the ensemble average with probability 1. Moree specifically, for any T Þ , Ï Ì® Ô Ì® Ô _P` à ~ ß b º Å ß _· · · ² ® ·K® ·c·K®_ Ý # % ¹ Å ² ® ·K® ·T·K®`_ Ý l ¹ ß · (4.46) # % where Å is a function defined on Ï which satisfies suitable conditions. For e the special case that Ì®Ô satisfies ß _a` º ® ¹ ® l ¹ ß · à (4.47) ~ b D R A F T September 13, 2001, 6:27pm D R A F T 69 Weak Typicality Ì®Ô is mean ergodic. ;ü:¨â[S¤Ó§«]è The i.i.d. source Ì®Ô we say that \^] is mean ergodic under suitable conditions because the strong law of the large numbers states that (4.47) is satisfied. ;ü:¨â[S¤Ó§«]ð Ì® Ô \^] ° ® ¹ ° Ìc¶¸· ß Ô ß ß º ® ¹ ¶l ¹ Consider the source defined as follows. Let be a e binary random variable uniformly distributed on . For all , let . Then à _a` ~ e and à _P` Since ® ¹ Ì®Ô Ã K ß b º ® ¹ (4.49) ® l ¹ ¸¶ Ö (4.50) is not mean ergodic and hence not ergodic. 
Ì® Ô If an information source McMillan-Breiman theorem, ß @ b (4.48) ß º ® ¹ ßl ¹ Ö ß e , _a` Therefore, ~ b is stationary and ergodic, by the Shannon- ÃfÄÅ _a` LÌ @Ô Q J (4.51) when is large, i.e., with probability close to 1, the probability of the sequence which occurs is approximately equal to . Then by means of arguments similar to the proof of Theorem 4.3, we see that there exist approx imately sequences in whose probabilities are approximately equal to , and the total probability of these sequences is almost 1. Therefore, by encoding these sequences with approximate bits, the source sequence can be recovered with an arbitrarily small probability of error when the block length is sufficiently large. This is a generalization of the direct part of the source coding theorem which gives a physical meaning to the entropy rate of a stationary ergodic sources. We remark that if a source is stationary but not ergodic, although the entropy rate always exists, it may not carry any physical meaning. J J D R A F T J Ï ;Ú September 13, 2001, 6:27pm @ D R A F T 70 A FIRST COURSE IN INFORMATION THEORY PROBLEMS 1. Show that for any « ï ¶ , & F Æ G ° is nonempty for sufficiently large . " 2. The source coding theorem with a general block code In proving the converse of the source coding theorem, we assume that each codeword in corresponds to a unique sequence in . More generally, a block code Æg / with block length isg defined by an encoding function and a . Prove that v,w^/ as ·/:i even if we decoding function Å / are allowed to use a general block code. Ï " íÏ O¶ Ï a" 3. Following Problem 2, we further assume that we can use a block code with probabilistic encoding and decoding. For such a code, encoding is defined to and decoding is defined by a transiby a transition matrix from . Prove that v,wm/ as /ci even if we are tion matrix from to allowed to use such a code. c b " Ï " Ï N¶ 4. In the discussion in Section 4.3, we made the assumption that the common alphabet is finite. Can you draw the same conclusion when is infinite but is finite? Hint: use Problem 2. 5. 6. 7. Ï Ï Ú ²® Ý Let and Ú ² beÝÊtwo distributions on the same finite alphabet Ï ² Ú Ý . Show that ï ¶ such ¹ ê Ú probability such¿ that Ú there exists an « that ¿ þ » Jh} » > * ý9d * * Ù Ê)̹ êÍ9> * ¯ * ² J´¯e² Lf °*g ß as ÿ/¶ . Give an example that ¿ Ú but the above convergence does not hold. Let and Ú be two probability distributions on the same finite alphabet Ï with¿ the same support. ï ¶, a) Prove that for any h » J} * Ù Ê)ÌÍ6e * ¯ » * ² ¯KJh¯ >ª² {W ¯ >jike² ² O þ * ý7d * > g f as /Hi . ï ¶, b) Prove that for any h » Jh} * Ù Ê)ÌÍ(e * ¯ » * ² ¯MJ´¯ >M² §W ¯ >jike²² O ü *LlPmnl o*p 4q l o%rstp ju pwv d * gg ßf ¹ ÌÌ® ¯Kyd² · ×µ Ô ÍZÙ"Ô be a family of Universal source coding Let x ; a common alphabet i.i.d. information sources indexed by a finite set Ù with Ï . Define Ú Ó ¹ JMéOL û Ú ² ® ¯Myز Ý y ß Ô , and where ® ¯Myز is the generic random variable for Ì® ¯Kyd² · µ ² Ý { Ù° ¹ z JOL & F Æ l}|~p G · y D R A F T ° September 13, 2001, 6:27pm D R A F T 71 Weak Typicality where « ï ¶. a) Prove that for all b) ; ÍÙ , ÌL@ ¯Kyd² Í ° ² Ù Ý Ô{/ ß ² Ý as /Hi , where @ ¯Kyd² ¹ ® ¯Kyd² ·K® ¯Kyd² ·c·K® ¯Kyd² . ï «, Prove that for any « å ´ ² Ù Ý ´ Þ ¯9J ° ² Ö _a` ° x c) Suppose we know that an information source is in the family but we do not know which one it is. Devise a compression scheme for the information source such that it is asymptotically optimal for every possible source in . x 8. ß Ô be an i.i.d. 
information source with generic random variLet Ì®·µ able ® and finite alphabet Ï . Assume º ²JÀÝ ÃfÄÅ ²JÀÝ Üi »5¿ ¿ and define ² Ý ° ¹ì ÃfÄÅ ¿ @ ì Ú ² ® Ý & ß · · . Prove that ° / ° in distribution, where ° is the for ¹ ²JÀÝ Ã]ÄÅ ²JÀ7Ý ì Gaussian random variable with mean 0 and variance ¦ » ² Ý ¿ ¿ Ú ® . HISTORICAL NOTES The weak asymptotic equipartition property (AEP), which is instrumental in proving the source coding theorem, was first proved by Shannon in his original paper [173]. In this paper, he also stated that this property can be extended to a stationary ergodic source. Subsequently, McMillan [141] and Breiman [34] proved this property for a stationary ergodic source with a finite alphabet. Chung [44] extended the theme to a countable alphabet. D R A F T September 13, 2001, 6:27pm D R A F T Chapter 5 STRONG TYPICALITY Weak typicality requires that the empirical entropy of a sequence is close to the true entropy. In this chapter, we introduce a stronger notion of typicality which requires that the relative frequency of each possible outcome is close to the corresponding probability. As we will see later, strong typicality is more powerful and flexible than weak typicality as a tool for theorem proving for memoryless problems. However, strong typicality can be used only for random variables with finite alphabets. Throughout this chapter, typicality refers to strong typicality and all the logarithms are in the base 2 unless otherwise specified. 5.1 STRONG AEP ²JÀÝ Ì® · µ ß Ô ® ® Ú ²® Ý Ü Ú ²® ¹Ý @ where are i.i.d. with We consider an information source distribution . We use to denote the generic random variable and Hi . Let to denote the common entropy for all , where . ® ¿ ² ® ·K® H· c·K® Ý =ؤ[ä æ;äøå;äJ¥§æúè7«¶5 ²JÀ7Ý is the set The strongly typical set F with respect to ²JÀ · À T·T· À Ý ÍÓÏ suchÆGPthat ¿ O of sequences H ¹ ß º ¬ ²JÀ ³H Ý ì ²JÀÝ Þ h· (5.1) ¿ » ²JÀ ³H Ý is the number of occurrences of À in the sequence H , and h is where ¬ are called an arbitrarily small positive real number. The sequences in F Æ P G O strongly h -typical sequences. Throughout this chapter, we adopt the convention that all the summations, products, unions, etc are taken over the corresponding supports. The strongly D R A F T September 13, 2001, 6:27pm D R A F T 74 A FIRST COURSE IN INFORMATION THEORY F ÆGPO typical set shares similar properties with its weakly typical counterpart, which is summarized as the strong asymptotic equipartition property (strong AEP) below. The interpretation of the strong AEP is similar to that of the weak AEP. ¢a£;¤!¥§¦7¤¨ªè7«]©Ûõ`å;¦¥§æ4ùE \áaÿ In the following, is a small positive quantity such that X/ ¶ as h{/ ¶ . 
1) If HL Í F ÆGPO , then ² H ÝYÞ ¯KJh¯ Æ ² ² Ö Þ (5.2) ¯KJh¯ Æ ² ² ¿ 2) For sufficiently large, _a` ï ß ì h Ö ÌL@ëÍ F Æ GPO Ô | 3) For sufficiently large, ².ß ì h Ý ¯MJ´¯ Æ ² ² Þ ´ F ´ Þ Æ GPO Proof To prove Property 1, we write ¿ ²H Ý ¹ : »Z¿ (5.3) MJ Æ Ö ¯ ´¯ ² (5.4) ² ²JÀ7Ý ®h¯ »9 ² Ö (5.5) Then Ã]ÄÅ ¿ ² H Ý ²JÀ Ý ²JÀÝ ¹ º ¬ ³H ÃfÄÅ ¿ » ² J ² À Ý ¹ º ¬ ³H ì ²JÀ7Ý7ó ¿ » ¹ º ²JÀÝ ÃfÄÅ ²JÀÝ ì » ¿ ¿ Q ß ¹ ì Ú ² ® Ý4ó º » ¬ If HLÍ F ÆGPO , then D R A F T º ²JÀ7ÝKÝ ÃfÄÅ J² À7Ý ¿ Qß ¿ ²JÀ ³ H Ý ì ²JÀ7Ý R ² ì Ã]ÄÅ ²JÀ7ÝKÝ º ¬ » ¿ ¿ ²JÀ ³H Ý ì ²JÀÝ R ² ì ÃfÄÅ ²JÀ7ÝKÝ Ö ¿ ¿ ß J² À Ý ²JÀÝ Þ ì » ¬ ³ H ¿ h· September 13, 2001, 6:27pm (5.6) (5.7) (5.8) (5.9) (5.10) D R A F T 75 Strong Typicality which implies ß J² À Ý ²JÀÝ ² ì ì fà ÄÅ ²JÀ7ÝKÝ ¬ ³ H R » ¿ ¿ ß Þ ì ÃfÄÅ ýt» â ²JÀ7Ý þ º ¬ J² À ³ H Ý ì ²JÀÝ ¿ ¿ » Þ ì h4ÃfÄÅ ý » â ²JÀ7Ý þ ¿ ¹ · where ¹,ì h:ÃfÄÅ ýt» â ¿ ²JÀÝ þ ï ¶¸Ö On the other hand, Q ß ²JÀ ³ H Ý ì ²JÀÝ R ² ì ÃfÄÅ ²JÀÝKÝ º ¬ » ß ²J¿À Ý ²JÀÝ ¿ ² µ ìÓº » ¬ ³ H ì ¿ ì ÃfÄÅ ¿ ²JÀÝKÝ ß J ² À Ý ²JÀ ³H Ý ì ²JÀÝ q þ p ì º ì ¬ µ ÃfÄÅ ýt» â ¿ » ¿ ß ¹ ÃfÄÅ ýt» â ²JÀÝ þ º ¬ ²JÀ ³ H Ý ì ²JÀÝ ¿ » ¿ J ² 7 À Ý µ h§Ã]ÄÅ ýt» â ¿ þ ¹ ì Ö º Q (5.11) (5.12) (5.13) (5.14) (5.15) (5.16) (5.17) (5.18) (5.19) Combining (5.13) and (5.19), we have ì Þ º » Q ß J² À Ý ²JÀ7Ý ² ì ì fà ÄÅ ²JÀÝKÝYÞ Ö R ¬ ³ H ¿ ¿ (5.20) It then follows from (5.9) that or ì ² Ú ² ® ;Ý ó =Ý Þ fà ÄÅ ² H Ý=Þ ì ² Ú ² ® Ý ì Ý · ¿ ² H Ý=Þ ¯MJ´¯ Æ ² ² · Þ ¯MJ´¯ Æ ² ² ¿ ¶ as h{/ ¶ , proving Property 1. (5.21) (5.22) where / To prove Property 2, we write ¬ D R A F T ²JÀ ³*@ Ý ¹ º ì ²JÀÝ · September 13, 2001, 6:27pm (5.23) D R A F T 76 A FIRST COURSE IN INFORMATION THEORY ß if ® ¹ À J ² À Ý ¹ < ì ¶ if ® ¹ ê À . ²JÀÝ · ¹ ß ··T· are i.i.d. random variables with Then ì _a` Ì ì ²JÀÝ ¹ ß Ô ¹ ¿ ²JÀÝ ª and _P` Ìªì ²JÀ7Ý ¹ ¶Ô ¹ ß ì ¿ ²JÀÝ Ö Note that ²JÀ7Ý ¹ ².ß ì ²JÀÝKÝ c¶ ó ²JÀÝ ß ¹ ²JÀÝ Ö ì ¿ ¿ï ¿ e By the weak law of large numbers, for any h ¶ and for any À ÍÓÏ ß ²JÀ7Ý ²JÀÝ ï h _a` h º ì ì l Ü ) ´ Ó Ï ´ ) ´ Ï ´ ¿ where (5.24) (5.25) (5.26) (5.27) , (5.28) for sufficiently large. Then ß J² À Ý ²JÀ7Ý ï h À ì ¬ e * ³ @ for some ¿ ´)ÏÓ´ ß _a` J² À7Ý ì ²JÀ7Ý ï h for some À º e e ì ¿ ´)ÏÓ´ ß _a` z e º ì ²JÀÝ ì ²JÀÝ ï h lml ¿ ´)ÏÓ´ » ß º _a` º ì ²JÀ7Ý ì ²JÀ7Ý ï h l » ¿ ´)ÏÓ´ º h » ´)Ï´ h· _P` ¹ ¹ Þ ¹ Ü l (5.29) (5.30) (5.31) (5.32) (5.33) where we have used the union bound1 to obtain (5.31). Since º ß J² À Ý ²JÀÝ ï ì » ¬ ³ H ¿ h ß J² À Ý ²JÀÝ ï h ¬ ³ H ì ¿ ´)ÏÓ´ implies 1 The union bound refers to D R A F T for some (5.34) À ÍÓÏ , (5.35) 8 ) 0 + 6 ) + 6 ) + v Ø ü Ø September 13, 2001, 6:27pm D R A F T 77 Strong Typicality we have _a` d(@ÑÍ F ÆGPO g ß J² À Ý ²JÀÝ ¹ º *³ @ ì ¿ » ß ²JÀ Ý º ¹ ßì ³*@ ì ¿ »ß ²JÀ ³*@ Ý ì ²JÀ7Ý µ ßì ¿ ï ß ì ch · e _P` _P` Þ hl ²JÀ7Ý ï h l ï h for some À ÍÏ ´)Ï´ e ¬ ¬ _P` ¬ (5.36) (5.37) (5.38) (5.39) proving Property 2. Finally, Property 3 follows from Property 1 and Property 2 in exactly the same way as in Theorem 4.3, so it is omitted. àµ ß ´ F ÆGPO ´ h ï ¶ Remark Analogous to weak typicality, we note that the upper bound on in Property 3 holds for all , and for any , there exists at least one strongly typical sequence when is sufficiently large. See Problem 1 in Chapter 4. In the rest of the section, we prove an enhancement of Property 2 of the strong AEP which gives an exponential bound on the probability of obtaining a non-typical vector2 . This result, however, will not be used until Chapter 15. 
¢a£;¤!¥§¦7¤¨ªè7«]¬ For sufficiently large , there exists _P` ÌL@ Íê F Æ GPO ÔÜ j ¯ O ² Ö ² h ÝYï ¶ such that (5.40) The proof of this theorem is based on the Chernoff bound [43] which we prove in the next lemma. v¤¨Ê¨üè7« úõnö£;¤¦7æ§¥¤I¥;æ4÷:ÿ ¯ 4 Let be a real random variable and be any nonnegative real number. Then for any real number U , and Ã]ÄÅ _a` cÌ ¯ªµU!Ô Þ ì U ó ÃfÄÅ ; Ã]ÄÅ _a` Ìc¯ Þ U!Ô Þ U ó fà ÄÅ ; ) ) yÇ yÇ Ö ; (5.41) (5.42) 2 This result is due to Ning Cai and Raymond W. Yeung. An alternative proof based on Pinsker’s inequality (Theorem 2.33) and the method of types has been given by Prakash Narayan (private communication). D R A F T September 13, 2001, 6:27pm D R A F T 78 A FIRST COURSE IN INFORMATION THEORY 2 1.8 2 1.6 s ( y−a) 1.4 1.2 u ( y−a) 1 1 0.8 0.6 0.4 0.2 0 −1 0 1 a 2 3 4 K |tl}9p . Figure 5.1. An illustration of ï Proof Let I Then for any ; µ¶ , Since I I ²¯ ì U Ý ¹ we see that _a` ì ¯ ² ¯ ì U Ý 4Þ ) y ¯ Ç _P` Oð ² Á Ý ¹ ß if Á µ¶ ¶ if Á Ü¶Ö ² Ý=Þ (y ¯ ½ á ² Ö I Á ì U This is illustrated in Fig. 5.1. Then y 5 á ² ¹ (5.43) (5.44) y á Ìc¯ªµUÔ´ ßËó_a` Ìc¯ ÜU!Ômc¶ ¹ Ìc¯ªµU!Ô Þ y á ) yÇ ¹ ) yÇ Ö _P` Ìc¯ µUÔ · F y | G Ö á Ê)ÌÍ (5.45) ì (5.46) (5.47) Then (5.41) is obtained by taking logarithm in the base 2. Upon replacing by and U by U in (5.41), (5.42) is obtained. The lemma is proved. £¡ ¢ ¤ ¥§¦w¡©¨ª[« ¬Kj®°¯=±=²´µ ³ ¶ ¦w¡º¨°»[¼½¦¥§¦w¡º¨©¾À¿9¨9Á ¶%·D¸4¹  Ã!Ä ¼½¦¥Å¦w¡º¨©¾À¿9¨Æ¾ ¬Kj®°ÇÉÈËÊjÌjÍZÎÏBÐÑ7Ò Ï-ÓÔLÕ Ö ¯ Proof of Theorem 5.3 We will follow the notation in the proof of Theorem 5.2. Consider such that . Applying (5.41), we have D R A F T September 13, 2001, 6:27pm (5.48) D R A F T 79 Strong Typicality Ø× Õ !Ã Ä ½¼ ¦P¥§¦w¡©¨Æ¾À¿9¨Æ¾ ¬Kj®ÙÀ¶-Ú ·D³ ¸ Ç È ÊjÌBÒ Ï ÓÔLÕ Ö%Û Ø Õ Ã!Ä ½¼ ¦P¥§¦w¡©¨Æ¾À¿9¨Æ¾Ü¼ ¬Kj® ¦Ý à ¥§¦w¡©¨Æ¾U¥§¦w¡©¨ Ê Ì ¨ ÂÞ Õ Ã!Ä ½¼ ¦P¥§¦w¡©¨Æ¾À¿9¨Æ¾Ü¼¦ ¬ËßàÊ ¨`á ¸ ¦ à ¥§¦w¡º¨Æ¾\¥§¦w¡º¨ ÊjÌ ¨ Ø Ã ¼ È Ä ¦ ¥§¦w¡©¨©¾À¿9¨n¾â¦ ¬ËßàÊ ¨ á ¸ ¥Å¦w¡º¨%¦Ý Ã Ê Ì ¨ Ö=ã (5.49) Z where a) follows because (5.50) (5.51) (5.52) ¹ ¶ ¦w¡º¨ are mutually independent; b) is a direct evaluation of the expectation from the definition of (5.24); c) follows from the fundamental inequality ¬ßåä  ä Ã Ý . 
æ Ô ¦ Ä ã ¿9¨ Ø Ä ¦ ¥§¦w¡º¨©¾W¿7¨D¾â¦ ¬ËßàÊ ¨ á ¸ ¥§¦w¡©¨%¦Ý Ã Ê Ì ¨ ã ¹ ¶ ¦w¡º¨ in In (5.52), upon defining (5.53) we have ¬Kj®¯±ç² µ ³ ¶ w¦ ¡©¨°»W¼½¦¥§¦w¡©¨©¾À¿9¨7Á Âèà ¼ æ Ô ¦ Ä ã ¿9¨ ã ¶-·D¸ ¹ or ¯±=²éµ ³ ¶ ¦w¡©¨°»W¼½¦¥§¦w¡©¨Æ¾À¿9¨7Á Â Ê á ³7ê,ë Ó Ì*ì í Õî ¶-·D¸ ¹ It is readily seen that æ Ô ¦ « ã ¿7¨ Ø « î Ä Regarding ¿ as fixed and differentiate with respect to , we have æÆÔï ¦ Ä ã ¿9¨ Ø ¥§¦w¡º¨%¦Ý Ã Ê Ì ¨Æ¾À¿ î Then æ Ôï ¦ « ã ¿7¨ Ø ¿ª[« and it is readily verified that for D R A F T æÆÔï ¦ Ä ã ¿9¨ð»[« « ÂÄñ ¬Kj®½ò ݾ ¥§¦w¿ º¡ ¨ôó î September 13, 2001, 6:27pm (5.54) (5.55) (5.56) (5.57) (5.58) (5.59) (5.60) D R A F T 80 A FIRST COURSE IN INFORMATION THEORY æ Ô ¦ Ä ã ¿7¨ is strictly positive for «öõ Äñ ¬Kj®½ò ݾ ¥§¦w¿ ¡º¨ôó î Therefore, we conclude that (5.61) On the other hand, by applying (5.42), we can obtain in the same fashion the bound or where Then and ¬Kj®¯±ç²éµ ³ ¶ ¦w¡©¨  ¼½¦¥§¦w¡©¨ à ¿9¨7Á Âèà ¼Æ÷ Ô ¦ Ä ã ¿9¨ ã ¶-·D¸ ¹ ¯±=² µ ³ ¶ ¦w¡©¨  ¼½¦¥§¦w¡©¨ à ¿7¨jÁ Â Ê á ³9ø ë Ó Ì*ì í Õ ã ¶-·D¸ ¹ ÷ Ô ¦ Ä ã ¿7¨ Ø Ã!Ä ¦¥Å¦w¡º¨ à ¿9¨Æ¾â¦ ¬ßàÊ ¨`á ¸ ¥§¦w¡º¨%¦Ý Ã Ê á Ì ¨ î ÷ Ô ¦ « ã ¿7¨ Ø « ã ÷ Ôï ¦ Ä ã ¿9¨ Ø ¥§¦w¡©¨%¦ Ê á Ì Ã ÝL¨Æ¾À¿ ã (5.62) (5.63) (5.64) (5.65) (5.66) which is nonnegative for (5.67) « ÂÄùÂèà K¬ j®úò Ý Ã ¥§¦w¿ ¡©¨ôó î In particular, ÷ Ôï ¦ ã« ã ¿9¨ Ø ¿ûª[« î (5.68) Ä Therefore, we conclude that ÷ Ô ¦ ¿7¨ is strictly positive for «üõ ÄùÂèà ¬Kj® ò Ý Ã ¥§¦w¿ ¡©¨ ó î (5.69) Ä By choosing satisfying (5.70) «üõ ÄùÂ[ýöþ ß\ÿˬKj®úò ݾ ¥§¦w¿ ¡©¨ôó ã à ¬Ëj®½ò Ý Ã ¥Å¦w¿ ¡º¨ôó ã æ Äã Äã both Ô ¦ ¿9¨ and ÷ Ô ¦ ¿9¨ are strictly positive. From (5.55) and (5.63), we have ¯± ² Ý µ ³ ¶ ¦w¡º¨ à ¥§¦w¡º¨ »¿ Á ¼ ¶%D· ¸ ¹ D R A F T September 13, 2001, 6:27pm D R A F T 81 Strong Typicality Ø =¯ ±=² ¶-µ ·D³ ¸ ¹ ¶ ¦w¡©¨ à ¼(¥§¦w¡©¨ »W¼n¿Á (5.71)  ¯=± ² µ ³ ¶ ¦w¡º¨°»W¼½¦¥Å¦w¡º¨Æ¾À¿9¨ Á ¶%·D¸ ¹ (5.72) ¾ ¯=±ç²é¶%µ ·D³ ¸ ¹ ¶ ¦w¡º¨  ¼½¦¥§¦w¡º¨ à ¿9¨7Á Â Ê á ³ôê,ë Ó Ì*ì í Õ ¾ Ê á ³9ø-ë Ó Ì*ì í Õ (5.73) Â Ê LÊ á ³ Ó ê,ë Ó Ì*ì í Õ ì ø-ë Ó Ìì í ÕÕÑ (5.74) Ó Ó Õ Ó Õ Õ Ê * Ì ì í ì * Ì ì í Ø á ³ Ó Õ ê,ã ë ø-ë á Î (5.75) Ø Ê á ³ jë í (5.76) where ý½þ ß ¦ æ Ô ¦ Ä ã ¿9¨ ã ÷ Ô ¦ Ä ã ¿9¨¨ Ã Ý î Ø Ô ~ ¦ 9 ¿ ¨ (5.77) ¼ æ Äã Then Ô ~¦ ã ¿9¨ is strictly positive for sufficiently large ¼ because both Ô ¦ ¿7¨ Ä and ÷ Ô ¦ ¿9¨ are strictly positive. Finally, consider ¯± ¢ ³ í Ø ¯=± ² µ Ô Ý ¦w"¡ ! ¨ à ¥§¦w¡©¨  ¿ Á (5.78) ¼ (5.79) » ¯= ± # ¼ Ý ¦w"¡ ! ¨ à ¥§¦w¡©¨  ¿ for all ¡\%¢ $'& Ø Ý Ã ¯= (± # Ý ¦w"¡ ! ¨ à ¥§ ¦w¡º¨ ª¿ for some ¡£)¢ $)& (5.80) ¼ (5.81) » Ý Ã µ Ô ¯=(± # ¼ Ý ¦w"¡ ! ¨ à ¥§¦w¡º¨ ª*¿ & µ µ = ¯ ç ± ² ³ Ý Ã Ã Ø Ý Ô ¶ ¦w¡©¨ ¥§¦w¡©¨ ª ¿Á (5.82) ¹ ¶ D · ¸ ¼ Ø Ý Ã µ ¯=±=² Ý ¶-µ ·D³ ¸4¹ ¶ ¦w¡©¨ à ¥§¦w¡©¨ ª ¿Á (5.83) Ô+ ,9ÓÔLÕ.-0/ ¼ (5.84) » Ý Ã Ô+ ,9µ ÓÔLÕ.-0/ Ê á ³ 3ë Ó í Õ ã where the last step follows from (5.76). Define D R A F T ¦~¿7¨ Ø 2ÊÝ 1 Ô + ,9ýöÓÔLþ Õ3ß -0/ Ô ¦~¿9¨54 î September 13, 2001, 6:27pm (5.85) D R A F T 82 A FIRST COURSE IN INFORMATION THEORY ¼ Then for sufficiently large , ¯=± ¢ ³ í6 ª ¯± 8¢7 ³ 9 í or where 5.2 ¦~¿9¨ Ý Ã Ê á ³ Ó í Õ ã õ Ê á ³: Ó í Õ ã (5.86) (5.87) is strictly positive. The theorem is proved. STRONG TYPICALITY VERSUS WEAK TYPICALITY As we have mentioned at the beginning of the chapter, strong typicality is more powerful and flexible than weak typicality as a tool for theorem proving for memoryless problems, but it can be used only for random variables with finite alphabets. We will prove in the next proposition that strong typicality is stronger than weak typicality in the sense that the former implies the latter. ; <>="?@=BACEDFC.="GIHJEH RTS « as ¿ S « . 
For any Kè¢L$ ³ , if K ¢M ³ 9 í , then K ¢ON ³ QP , where ¢ ³ 9 í , then Ê á ³ ÓVUàÓ ÕXW P Õ Â ¥§¦YKD¨ Â Ê á ³ VÓ UàÓ Õ á P Õ ã (5.88) Z Z K ¬ j ® Ý Â Ã Â Ã R ã R § ¥ YKD¨ ¦ w ¦ 1 ¤ n ¨ ¾ ¦w¤I¨ (5.89) ¼ « as ¿ S « . Then K1¢%N ³ QP by Definition 4.2. The proposition is Proof By Property 1 of strong AEP (Theorem 5.2), if K or where R[S proved. We have proved in this proposition that strong typicality implies weak typicality, but the converse is not true. The idea can easily be explained without on $ different from the any detailed analysis. Consider a distribution such that distribution ¥§¦w¡º¨ ¥ ï ¦w¡º¨ à µ ¥ ï ¦w¡º¨ ¬Ëj® ¥ ï ¦w¡º¨ Ø Z ¦w¤1¨ ã (5.90) Ô ï i.e., the distribution ¥ ¦w¡º¨ has the same entropy as the distribution ¥§¦w¡º¨ . Such ï a distribution ¥ ¦w¡º¨ can always be found. For example, the two distributions ¥§¦w¡º¨ and ¥ ï ¦w¡©¨ on $ Ø « ã Ý defined respectively by (5.91) ¥§¦ «4¨ Ø « ]î \ ã ¥§¦ÝL¨ Ø « _î ^ ! and ¥ ï ¦ «4¨ Ø « _î ^ ã ¥ ï ¦ÝL¨ Ø « ]î \ (5.92) D R A F T September 13, 2001, 6:27pm D R A F T 83 Strong Typicality î]\ have the same entropy ` ¦ « ¨ . Now consider a sequence Kè¢a$ ³ such that the relative occurrence of each ¡£% ¢ $ is close to ¥ ï ¦w¡©¨ . By the above theorem, ï Z the empirical entropy of K is close to the entropy of the distribution ¥ ¦w¡©¨ , i.e., ¦w¤I¨ . Therefore, K is weakly typical with respect to ¥§¦w¡©¨ , but it is not strongly typical with respect to ¥Å¦w¡º¨ because the relative occurrence of each ¡£%¢ $ is close to ¥ ï ¦w¡©¨ , which is different from ¥§¦w¡©¨ . Z 5.3 JOINT TYPICALITY In this section, we discuss strong joint typicality with respect to a bivariate distribution. Generalization to a multivariate distribution cb ed is straightforward. cb Consider a bivariate information source where gf cb are i.i.d. with distribution . We use to denote the pair of generic random variables. ¥§¦w¡ ã ¨ ¦w¤ã ¶ ã ¶ ¨ ã » Ý ¦w¤ ¨ ¦w¤ ¶ ã ¶ ¨ hTi*jkClGFCXDFCY=BGMHJEm ³ on í with respect to ¥§¦w¡ ãgf ¨ The strongly jointly typical set ãgp ¨ð¢%$ ³rqts!³ such that is the set of ¦YK µ µvu Ý ¦w¡ ãgf !gK ãgp ¨ à ¥§¦w¡ ãgf ¨  ¿ ã (5.93) Ô ¼ ãgf ãgp ¨ is the number of occurrences of ¦w¡ ãgf ¨ in the pair of sewhere ¦w¡ ãgp !gK quences ¦YK ãg¨ p , and ¿ is an arbitrarily small positive real number. A pair of ¨ is called strongly jointly ¿ -typical if it is in ³ on í . sequences ¦YK ãgp wyxFi0=B<i*z{HJX|~}e=BG"A6C.ADFi*G"* If Y ¦ K p ¢ ³ n í . ãgp ¨ð¢ ³ n í , then Proof If ¦YK µ µ u Ý ¦w¡ ãgf !gK ãgp ¨ Ô ¼ Strong typicality satisfies the following consistency property. Upon observing that we have ¨ ¢ ³ on í , then K ¢ ³ í à ¥§¦w¡ gã f ¨  ¿ î ¦w¡B!gKD¨ Ø µu ¦w¡ ãgf !gK gã p ¨ ã µ Ý w¦ ¡"!gKD¨ à §¥ ¦w¡º¨ Ô ¼ u µ µ ã g f gã p à µ u gã f Ý Ø Ô w ¦ ¡ !gK ¨ ¥ § w ¦ ¡ ¨ ¼ D R A F T September 13, 2001, 6:27pm and (5.94) (5.95) (5.96) D R A F T 84 A FIRST COURSE IN INFORMATION THEORY Ø µ Ô vµ u ò Ý ¦w¡ ãgf !gK ãgp ¨ à ¥Å¦w¡ ãgf ¨ (5.97) ¼ ó  µ µ u Ý w¦ ¡ ãgf !gK ãgp ¨ à ¥§¦w¡ ãgf ¨ (5.98) Ôã ¼  ¿ (5.99) p ¢ ³ n> í . The theorem is proved. i.e., K ¢ ³ 9 í . Similarly, cã b For a bivariate i.i.d. source ¦w¤ ¶ ¶ ¨ , we have the strong joint asymptotic equipartition property (strong JAEP), which can readily be obtained by cã b applying the strong AEP to the source ¦w¤ ¶ ¶ ¨ . wyxFi0=B<i*z{HJE2}e@DF<=BG"0; Let ã c b ã gã Ø ¦ ¨ ¦¦w¤ ¸ ¸ ¨ ¦w¤T cã b L¨ ã ã ¦w¤ ³ cã b ³ ¨¨ ã (5.100) cã b cã b ¤ ~¨ are i.i.d. with generic pair of randomS variablesS ¦w¤ ¨ . In the where ¦w « as ¿ « . 
following, is a small positive quantity such that gã p ¨à¢ ³ n í , then 1) If Y¦ K Ê á ³ VÓ UàÓ ì n lÕ W>LÕ Â ¥§Y¦ K gã p ¨ Â Ê á ³ Ó UåÓ ì n Õ á LÕ î (5.101) 2) For ¼ sufficiently large, ¯=± ¦ g㠨𢠳 on í ª Ý Ã ¿ î (5.102) 3) For ¼ sufficiently large,  ³ nF í 6Â Ê ³ VÓ UàÓ ì n XÕ W>LÕ î ¦Ý Ã ¿7¨ Ê ³ VÓ UàÓ ì n Õ á LÕ (5.103) ¦ ã ¨ Ê ³ UàÓ ì n Õ Ê ³ UåÓ Õ Ê ³ àÓ ì Õ Ê ³ åÓ Õ From the JAEP, we n are approxigp can see the following. SinceU there U strong mately typical YK p pairs andgpapproximately typical K , for a typical K , the number of such that YK is jointly typical is approximately ¦ ã ¨ Ø Ê ³ UåÓ n Õ (5.104) on the average. The next theorem reveals that this is not only true on the average, p but it is ingp fact true for every typical K as long as there exists at least one such that YK is jointly typical. wyxFi0=B<i*z{HJE ¦ ã ¨ D R A F T 1¢t ³ í , define ³ n9 9 í9¦YKD¨ Ø p ¢ ³ n í¦YK gã p à ¨ ¢ ³ nF í î For any K September 13, 2001, 6:27pm (5.105) D R A F T 85 Strong Typicality If n9 Y¦ KD¨ ³ í where S » Ý , then Ê ³ ÓVUàÓ n Õ á0 Õ Â ³ n í7¦YKD¨ (Â Ê ³ ÓUåÓ n ÕXW Õ ã « as ¼ S¢¡ and ¿ S « . (5.106) We first prove the following lemma which is along the line of Stirling’s approximation [67]. £i*zz¤HJ3¥¦ ¼1ª« , ¬¼ Ëß ¼ à ¼  ¬Ëß ¼ §  w¦ ¼ú¾âÝL¨ ¬Ëß ¦w¼¾ ÝL¨ à ¼ î For any (5.107) ¾ ˬ ß ¼ î ¬ß Since ¡ is a monotonically increasing function of ¡ , we have ¨ ¶ ¬Ëß d ¨ ¶ W ¸ ¬Ëß î Ë ¬ ß ¶ á ¸ ¡©3¡\õ õ ¶ T¡ ©3¡  d  ¼ , we have Summing over Ý ¨ ¬Ëß ³/ T¡ ©3¡Uõ ¬Ëß ¼ §õ ¨ ¸ ³ W ¸ ¬ß ª¡ ©4¡ ã or ¼ ¬Ëß ¼ à ¼1õ ¬Ëß ¼ §õ ¦w¼ú¾âÝL¨ ¬Ëß ¦w¼¾ ÝL¨ à ¼ î Proof First, we write ¬Ëß ¼ § Ø ¬ ß Ý ¾ ˬ ßàÊ ¾ (5.108) (5.109) (5.110) (5.111) The lemma is proved. ¿ ¢ ³ í µ Ý ¦w¤~!gKD¨ à ¥Å¦w¡º¨  ¿ î Ô ¼ ¢ $ , This implies that for all ¡£% ã Ý ¦w« ¤ !gKD¨ à ¥§¦w¡º¨  ¿ ¼ or ¥§¦w¡©¨ à ¿  ¼ Ý ¦w¡B!gKD¨  ¥§¦w¡©¨Æ¾À¿ î Proof of Theorem 5.9 Let be a small positive real number and positive integer to be specified later. Fix an K t , so that D R A F T September 13, 2001, 6:27pm ¼ be a large (5.112) (5.113) (5.114) D R A F T 86 A FIRST COURSE IN INFORMATION THEORY Assume that n Y¦ KD¨ ³ í » Ý , and let ¬ gã f ã ãgf ¦w¡ ¨ ¦w¡ ¨å¢$ qts (5.115) be any set of nonnegative integers such that µvu 1. for all ¡£¢)$ ¬ ¦w¡ gã f ¨ Ø w¦ ¡"!gKD¨ (5.116) , and gã f ãgp Ø ¬ ãgf w ¦ ¡ !gK ¨ ¦w¡ ¨ (5.117) gf ã gã p for all ¦w¡ ¨°¢%$ qts , then ¦YK 𨠢t ³ on í . ¬ ãgf ¦w¡ ¨ satisfy Then from (5.93), µ vµ u Ý ¬ ¦w¡ gã f ¨ à ¥Å¦w¡ gã f ¨  ¿ ã (5.118) Ô ¼ gã f which implies that for all ¦w¡ ¨°%¢ $ qts , Ý ¬ ¦w¡ gã f ¨ à ¥Å¦w¡ gã f ¨  ¿ ã (5.119) ¼ or ¥§¦w¡ gã f ¨ à ¿  ¼ Ý ¬ ¦w¡ gã f ¨  ¥§¦w¡ gã f ¨Æ¾À¿ î (5.120) ã g f ¬ Such a set ¦w¡ ¨ exists because ³ n í Y¦ KD¨ is assumedp to be nonempty. 2. if Straightforward combinatorics reveals that the number of constraints in (5.117) is equal to ® ¦¬ ¨ Ø ÚÔ u î ° ¬ ¦w¡B!gKDãg¨¯f § ¦w¡ ¨¯§ ¬Ëß ® ¬ Using Lemma 5.10, we can lower bound ¬Ëß ® ¦ ¬ ¨ » µ Ô # ¦w¡B!gKD¨ ¬Ëß Ã µvu ¦ ¬ ¦w¡ ãgf ¨D¾ Ø× Õ µ Ô ÿ ¦w¡"!gKD¨ ¬Ëß D R A F T which satisfy the (5.121) ¦ ¨ as follows. ¦w¡"!gKD¨ à ¦w¡B!gKD¨ ÝL¨ ¬Ëß ¦ ¬ ¦w¡ ãgf ¨n¾ ÝL¨ à ¬ w¦ ¡ gã f ¨ Á ¦wB¡ !gKD¨ September 13, 2001, 6:27pm (5.122) D R A F T 87 Strong Typicality à µ u ¦ ¬ ¦w¡ gã f ƨ ¾âÝL¨ ¬Ëß ¦ ¬ ¦w¡ ãgf ¨Æ¾ ÝL¨ 4 (5.123) ±Õ µ » Ô ¦w¡"!gKD¨ ¬ß ¦w¼½¦¥§¦w¡©¨ à ¿7¨`¨ à µ u ¦ ¬ ¦w¡ gã f ¨Æ¾âÝL¨ ¬Ëß ÿ ¼ ò ¥§¦w¡ ãgf ¨Æ¾W¿ð¾ Ý Á î (5.124) ¼åó In the above, a) follows from (5.116), and b) is obtained by applying ¸ ¸ ¬ ãgf the lower bound on ¼ á ¦w¡"!gKD¨ in (5.114) and the¬Ëß upper bound on ¼ á ¦w¡ ¨ in (5.120). 
Also from (5.116), the coefficient of ¼ in (5.124) is given by µ 1 ¦w"¡ !gKD¨ à µ u ¦ ¬ ¦w¡ gã f ¨n¾ ÝL¨ 4 Ø Ã $ s î (5.125) Ô Let ¿ be sufficiently small and ¼ be sufficiently large so that (5.126) «üõ1¥§¦w¡©¨ à ¿õ Ý and gã f Ý õ Ý (5.127) § ¥ w ¦ ¡ © ¨ W ¾ ð ¿ ¾ ¼ f for all ¡ and . Then in (5.124), both the logarithms ¬Ëß ¦¥§¦w¡º¨ à ¿7¨ (5.128) and ¬ßò ¥§¦w¡ gã f ¨Æ¾À¿å¾ Ý (5.129) ¼ó are negative. Note that the logarithm in (5.128) is well-defined by virtue of (5.126). Rearranging the terms in (5.124), applying the upper bound in (5.114) and the lower bound3 in (5.120), and dividing by , we have ¼ ¸ Ë ¬ ß ® ¬ ¼á ¦ ¨ µ » Ô ¦ ¥§¦w¡º¨©¾W¿7¨ ¬Ëß ¦¥§¦w¡º¨ à ¿7¨ à µ Ô µvu ò ¥§¦w¡ gã f ¨ à ¿ð¾ ¼°Ý ó ¬q Ëßò ¥§¦w¡ ãgf ¨n¾À¿å¾ Ý Ã $ s ¬ß ¼ ZT² ZT² Ø Zà ² b ¦w¤1¨n¾ ¦w¤ ã ãcb ¼ã ¨Æó ¾´³µt¦w¼ ã ¿7¼¨ Ø ¦ ¤I¨ÆI¾ ³9µt¦w¼ ¿9¨ u u u ú · ¸ ,9ÓÔ ì Õ Ô ,9ÓÔ ì ÕW í W Î Ñ - ¸ (5.130) (5.131) (5.132) 3 For the degenerate case when for some and , , and the logarithm in (5.129) is in fact positive. Then the upper bound instead of the lower bound should be applied. The details are omitted. D R A F T September 13, 2001, 6:27pm D R A F T 88 A FIRST COURSE IN INFORMATION THEORY ã where ³9µt¦w¼ ¿9¨ denotes a function of ¼ and ¿ which tends to 0 as ¼ S¶¡ and ¿ S « . Using a similar method, we can obtain a corresponding upper bound as follows. ¬Ëß ® ¦ ¬ ¨  µ # ¦ ¦w"¡ !gKD¨Æ¾âÝL¨ ¬Ëß ¦ ¦w"¡ !gKD¨n¾âÝL¨ à ¦w"¡ !gKD¨ Ô Ã µ u ¬ ¦w¡ gã f ¨ ¬Ëß ¬ ¦w¡ gã f ¨ à ¬ ¦w¡ gã f ¨ Á (5.133) Ø µ Ô ÿ ¦ ¦w"¡ !gKD¨D¾ ÝL¨ ¬Ëß ¦ ¦w"¡ !gKD¨n¾ ÝL¨ à µ u ¬ ¦w¡ gã f ¨ ¬Ëß ¬ ¦w¡ gã f ¨ 4 (5.134)  µ # ¦ ¦w¡"!gKD¨Æ¾âÝL¨ ¬ËßIÿ ¼ ò ¥§¦w¡©¨©¾À¿å¾ Ý Ô ¼å ó à µ u ¬ ¦w¡ gã f ¨ ¬Ëß ¼½¦¥§¦w¡ gã f ¨ à ¿7¨ Á î (5.135) ¬Ëß In the above, the coefficient of ¼ is given by µ 1 ¦ ¦w¡B!gKD¨n¾ ÝL¨ à vµ u ¬ ¦w¡ gã f ¨ 4 Ø $ î (5.136) Ô ¿ ¼ be sufficiently large so that ¥§¦w¡©¨©¾À¿å¾ ¼ Ý õ Ý «üõ1¥Å¦w¡ gã f ¨ à ¿õ Ý We let be sufficiently small and and for all ¡ f (5.137) (5.138) and . Then in (5.135), both the logarithms ¬ßò ¥§¦w¡º¨©¾W¿ð¾ Ý ¼åó ¬Ëß ¦¥§¦w¡ gã f ¨ à ¿9¨ and (5.139) (5.140) are negative, and the logarithm in (5.140) is well-defined by virtue of (5.138). Rearranging the terms in (5.135), applying the lower bound4 in (5.114) and the 4 If ,9ÓÔLÕ ·£¸ 9, ÓÔLÕW í W Î Ñ -T/ , , and the logarithm in (5.139) is in fact positive. Then the upper bound instead of the lower bound should be applied. The details are omitted. D R A F T September 13, 2001, 6:27pm D R A F T 89 Strong Typicality ¼ ¸ ¼Åá ¬Ëß ® ¦ ¬ ¨  µ ò ¥§¦w¡º¨ à ¿ð¾ Ý ¬Ëß ò ¥§¦w¡©¨©¾À¿å¾ Ý Ô ¼ó ¼ ó ¬Ëß Ã µ µu ¦¥§¦w¡ ãgf ¨Æ¾X¿9¨ ¬Ëß ¦¥Å¦w¡ ãgf ¨ à ¿9¨Æ¾ $ ¼ ¼ ZTÔ ² ZT² ãcb ã Ã Ø Z ² b ¦w¤I¨D¾ ¦w¤ ã ¨Æã ¾´³·¦w¼ ¿9¨ Ø ¦ ¤I¨Æ´¾ ³ · ¦w¼ ¿7¨ ã where ³ · ¦w¼ ¿9¨ denotes a function of ¼ and ¿ which tends to 0 as ¼ ¿ S «. upper bound in (5.120), and dividing by , we have (5.141) (5.142) (5.143) S¸¡ and Now from (5.132) and (5.143) and changing the base of the logarithm to 2, we have Z b ¦ ¤I¨D¾´³9µk¦w¼ ã ¿9¨  ¼ á ¸ ˬ j® ® ¦ ¬ ¨  Z ¦ b I¤ ¨n¾´³ · w¦ ¼ ã ¿9¨ î (5.144) Z b ¸ K ¬ j ® n 9 (5.145) ¼Åá ³ í7¦YKD¨ » ¦ ¤1¨Æ¾´³µt¦w¼ ã ¿7¨ î To upper bound ³ n 9 í Y¦ KD¨ , we ¬ ãgf have to consider that there are in general ¬ more ¦w¡ ¨ . Toward this end, by observing that¬ ¦w¡ gã f ãgf ¨ than one possible set of takes values between 0 and ¼ , the total number of possible sets of ¦w¡ ¨ is at most ¦w¼¾ ÝL¨ ¹o º( î (5.146) From the lower bound above, we immediately have This is a very crude bound but it is good enough for our purpose. 
Then from the upper bound in (5.144), we obtain ¹o º» Z b ¸ ¸ Ë ¬ j ® K ¬ j ® 6  n 9 ¼Åá ³ í9¦YKD¨ ¼Åá ¦w¼ù¾ÜÝL¨ ¾ ¦ I¤ ¨(¾%³·¦w¼ ã ¿9¨ î (5.147) Since the first term of the above upper bound tends to 0 as ¼ S¸¡ , combining (5.145) and (5.147), it is possible to choose a (depending on ¼ and ¿ ) which tends to 0 as ¼ S¸¡ and ¿ S « such that Z ¼ á ¸ ¬Kj® n í _Ó ¼jÕ ôà ¦ b ¤I¨ ( ã (5.148) Î or Ê ³ VÓ UàÓ n Õ á0 Õ Â ³ n í7¦YKD¨ (Â Ê ³ Ó UåÓ n XÕ W Õ î (5.149) The theorem is proved. p ¦ ã ¨ Ê ³ åÓ Õ The above theorem says that for any typical K , as long as thereU isn9one typical gp p such that YK is jointly typical, there are approximately such D R A F T September 13, 2001, 6:27pm D R A F T 90 A FIRST COURSE IN INFORMATION THEORY ½ ¾ nH (Y)¿ Ãy 2 . . S [Y] . . . À ¿) ½ ¾ nH (X,Y 2 à )  S ¾ nÀ (x,y . . . . . . . . [X] ¦ gã p ¨ . . ½ ¾ nH (XÀ ¿ ) 2 Áx ÂS ¾ n  ¾n [XY] Figure 5.2. A two-dimensional strong joint typicality array. that YK is jointly typical. This theorem has the following corollary that the number of such typical K grows with at almost the same rate as the total number of typical K . ¼ ¥§¦w¡ ãgf ¨ on $ qTs , let Æ ³ í be the ¢ ³ í ³ í ¦YKD¨ is nonempty. Then Æ í » ¦ Ý Ã ¿7¨ Ê ÓVUàÓ Õ á*Ç Õ ã (5.150) ³ ³ where È S « as ¼ S¢¡ and ¿ S « . gã p ¨¢ Proof By the consistency of strong typicality (Theorem 5.7), if Y¦ K ³ on> í , then K1¢ ³ í . In particular, K1) 9 ¢ Æ ³ í . Then î gã p p ³ nF í Ø É YK ¨Ñ ¢ ³ n9 9 í9¦YKD¨ ¦ (5.151) ¼ËÊÌ ÍÎ Î>Ï Ð Using the lower bound on ³ nF í in Theorem 5.8 and the upper bound on n 9 Y¦ KD¨ in the last theorem, we have ³ í ¦Ý Ã ¿7¨ Ê ³ VÓ UàÓ ì n Õ á ôÕ Â ³ nF í 6Â Æ ³ 9 í Ê ³ Ó UåÓ n XÕ W Õ (5.152) =B<=BÄÅĤB<ÅaHJ3¥*¥ For a joint distribution set of all sequences K 9 such that n 9 » ¦Ý à ¿7¨ Ê ³ VÓ UàÓ Õ á ÓVW ÕÕ î The theorem is proved upon letting È Ø û¾I . which implies Æ í ³ ¥§¦w¡ ã ¨ (5.153) In this section, We have established a rich set of structural gf properties for , which is sumstrong typicality with respect to a bivariate distribution marized in the two-dimensional strong joint typicality array in Figure 5.2. In this array, the rows and the columns are the typical sequences K MÆ 9 and D R A F T September 13, 2001, 6:27pm ¢ ³ í D R A F T 91 Strong Typicality Ó ÔÕ 2nH(Z) Ü ÓnÝ Þ z S [Z ] zØ 0 Ò2ÓnHÔ (Y)Õ Úy Ü S Ó n Þ [Y] Ö(×xØ Ù,ÚyØ Û) 0 0 Ó Ôß Õ 2nH(X) x ÜS Ó nß Þ [X] Figure 5.3. A three-dimensional strong joint typicality array. p The total number of rows and columns are approxi¢Æ ³ n> í , respectively. UàÓ Õ Ê Ê UåÓ n Õ , respectively. An entry indexed by Y¦ K ãgp ¨ mately equal to ³ gã p and ³ receives a dot if Y¦ K ¨ is Êstrongly UàÓ ì n Õ jointly typical. The total number of dots . The number of dots in each row is apis approximately equalÊ to UàÓ n9³ Õ proximately equal to ³ Ê UàÓ n , Õ while the number of dots in each column is . approximately equal to ³ For reasons which will become clear in Chapter 16, the strong joint typicality array in Figure 5.2 is said to exhibit an asymptotic quasi-uniform structure. By a two-dimensional asymptotic quasi-uniform structure, we mean that in the array all the columns have approximately the same number of dots, and all the rows have approximately the same number of dots. The strong joint typicality array for a multivariate distribution continues to exhibit an asymptotic quasi-uniform structure. The three-dimensional strong joint typicality array gf cà is illustrated in Figure 5.3. As before, with respectgp tocá a distribution gp cá an entry YK receives a dot if YK is strongly jointly typical. 
This is not shown in the figure otherwise it will be very confusing. n â total number U The of dots in the whole array is approximately equal to . These dots are distributed in the array such that all the planes parallel to each other have approximately the same number of dots, and all the cylinders parallel to each other have approximately the same number of á / dots. More specifically, the total number of dots on the plane for any fixed OÆ â (as shown) is approxi- ¥§¦w¡ ã ã ã ¨ =ã ¦ ¨ ¦ ã =ã ¨ Ê ³ àÓ ì ì Õ ¢ ³ í UàÓ ì n â Õ Ê mately equal , and the total number of dots in the cylinder ãgp to ³ Ê UåÓ âfor any ìn Õ , fixed ¦YK / / ¨ pair in Æ ³ on> í (as shown) is approximately equal to ³ so on and so forth. D R A F T September 13, 2001, 6:27pm D R A F T 92 5.4 A FIRST COURSE IN INFORMATION THEORY AN INTERPRETATION OF THE BASIC INEQUALITIES The asymptotic quasi-uniform structure exhibited in a strong joint typicality array discussed in the last section is extremely important in information theory. In this section, we show how the basic inequalities can be revealed by examining this structure. It has further been shown by Chan [39] that all unconstrained information inequalities can be obtained from this structure, thus giving a physical meaning to thesecb inequalities. á , gand and a fixed Consider random variables ã äÆ c á â . By the p cá 2 on â , then YK 2 â consistency p cá of strong typicality, if YK n and â . Thus ¦ 㠨𢠳 í ¤ ã ã ã ¦ û¨ ¢ ³ í á ³ n â íô¦ ¨ åI ³ â íô¦ á ¨ q ³ n9 â í7¦ á ¨ã ¢ ã³ í ¦ ¨¢ ³ í (5.154) n â í ¦ á ¨ ( â í ¦ á ¨ n â í ¦ á ¨ î (5.155) ³ ³ ³ á on Applying the á lower bound á in Theorem 5.9 to ³ â í ¦ ¨ and the upper bound n to ³ â í ¦ ¨ and ³ â í ¦ ¨ , we have Ê ³ ÓåU Ó ì n â Õ á0æ Õ Â Ê ³ ÓåU Ó â ÕXW@ôç Õ Ê ³ ÓVàU Ó n â ÕlW>èôÕ ã (5.156) ë§ê ãeì S ã « as ¼ S¶¡ and ¿ S « . Taking logarithm to the base 2 and where é dividing by ¼ , we obtain Z Z b cb  Z ã ¦w¤ ã ¨ ¦w¤ 㠨ƾ ¦ ã ¨ (5.157) upon letting ¼ Sí¡ and ¿ S « . This inequality is equivalent to î ¦w¤«! b ã ¨°»[« î (5.158) which implies Thus we have proved the nonnegativity of conditional mutual information. Since all Shannon’s information measures are special cases of conditional mutual information, we have proved the nonnegativity of all Shannon’s information measures, namely the basic inequalities. ¦YK ã U¨ ¢ ³ ì nF í and ¦ pãcá ¨U¢ ³ n ì â í do not imply ¦YK ãcá ¨\¢ ³ì í ã ã ã ¤ ¨ , where ¤ ¶ are i.i.d. with generic random vari 2. Let Ø ¦w¤ ¸ ¤T ³ able ¤ . Prove that ¯± ¢ ³ í » Ý Ã $ ï ¼n¿ PROBLEMS gp 1. Show that â . D R A F T September 13, 2001, 6:27pm D R A F T Strong Typicality for any ¼ S¸¡ ¼ and if ð ¼n¿ ¿S¢ ª «¡ . This shows that ¯=± ¢« ³ í . 93 S Ý as ¿ S « and ¤ 3. Prove that for a discrete random variable with an infinite alphabet, Property 2 of the strong AEP holds, while Properties 1 and 3 do not hold. Ø ¦w¡ ¸ ã ¡ ã ã ¡ ³ ¨ ¡ ¦w¡º¨ Ø w¦ ¡ D¨ ô¼ ¡£¢ ¸ × ×á ¸ ³ä ³ á gã f 5. Show that for a joint distribution ¥§¦w¡ ¨ , for any ¿UªÉ« , there exists «Iõ ¿ ï õ¿ such that ³ 9 øí ÷ åOÆ ³ 9 í for sufficiently large ¼ . 6. Let ù½¦5$ú¨ be the set of all probability distributions over a finite alphabet $ . Find a polynomial úö¦w¼D¨ such that for any integer ¼ , there exists a subset ù ³ 5¦ $¨ of ù½5¦ $ú¨ such that ( úö¦w¼D¨ ; a) ù 5¦ $¨ ³ b) for all òét ¢ ù½5¦ $¨ , there exists ò ³ t¢ ù ³ 5¦ $¨ such that ò ¦w¡©¨ à ò½¦w¡º¨ õ Ý ³ ¼ for all ¡£% ¢ $ . Hint: Let ù 5¦ $¨ be the set of all probability distributions over $ such that ³ all the probability masses can be expressed as fractions with denominator ¼. 7. 
Let ¥ be anyã probability distribution over a finite set $ and R be a real number in ¦ « ÝL¨ . Prove that for any subset û of $ ³ with ¥ ³ ¦.û0¨°» R , ûMü[ í » Ê Ó UåÓ ,LÕ á í ÷ Õ ã ³ ³ ï where ¿ S « as ¿ S « and ¼ S¢¡ . @ where 0 is in some finite alpha4. Consider a sequence K bet $ for all ñ . The type ò ¼ of the sequence K is the empirical probability K gó for all %$ . Prove that distribution over $ in W K , ¹oi.e., ò ¼ W ± there are a total of ô distinct types ò ¼ . Hint: There are ô õ ways to distribute identicalõ balls in ö boxes. HISTORICAL NOTES Berger [21] introduced strong typicality which was further developed into the method of types in the book by Csiszþ ý r and K ÿ rner [52]. The treatment of the subject here has a level between the two, while avoiding the heavy notation in [52]. The interpretation of the basic inequalities in Section 5.4 is a preamble to the relation between entropy and groups to be discussed in Chapter 16. D R A F T September 13, 2001, 6:27pm D R A F T Chapter 6 THE -MEASURE In Chapter 2, we have shown the relationship between Shannon’s information measures for two random variables by the diagram in Figure 2.2. For convenience, Figure 2.2 is reproduced in Figure 6.1 with the random variables b and replaced by and T , respectively. This diagram is very illuminating because it suggests that Shannon’s information measures for any random variables may have a set-theoretic structure. In this chapter, we develop a theory which establishes a one-to-one correspondence between Shannon’s information measures and set theory in full generality. With this correspondence, manipulations of Shannon’s information measures can be viewed as set operations, thus allowing the rich set of tools in set theory to be used in information theory. Moreover, the structure of Shannon’s information measures can easily ¤ ¸ ¤ ¼è» Ê ¤ H ( X1,X 2) H ( X1 X 2) H ( X 2 X1) H ( X1) I ( X1; X 2) H (X 2) Figure 6.1. Relationship between entropies and mutual information for two random variables. D R A F T September 13, 2001, 6:27pm D R A F T 96 A FIRST COURSE IN INFORMATION THEORY be visualized by means of an information diagram if four random variables or less are involved. The main concepts to be used in this chapter are from measure theory. However, it is not necessary for the reader to know measure theory to read this chapter. 6.1 PRELIMINARIES In this section, we introduce a few basic concepts in measure theory which will be used subsequently. These concepts will be illustrated by simple examples. ã ¤ ã ã ¤ ¸ ¤ ³ ³ ã ã ã ¤ ¸ ¤ ¤ ³ b b hTi*jkClGFCXDFCY=BGMmJ D · ¸ The atoms of are sets of the form , where ü is ³ Þ ³ ¤ . either ¤ or ¤ , the complement of Ê Ê There are ³ atoms and Î sets in all the atoms in are ³ . Evidently, ³ disjoint, and each set in can be expressed uniquely as the union of a subset ã ¤ ã ã ¤ intersect with ³ . We assume that the sets ¤ ¸ T of the atoms of ³ ³ are nonempty unless otherwise each other generically, i.e., all the atoms of ³ specified. F¤»z?*ÄËi mJ ¤ generate the field . The atoms of The sets ¤ ¸ and T are (6.1) ¤ ¸ ü ¤ ã ¤ ¸Þ ü ¤ ã ¤ ¸ ü ¤ Þ ã ¤ ¸Þ ü ¤ Þ ã hTi*jkClGFCXDFCY=BGMmJ3¥ The field generated by sets T is the collection of sets which can be obtained by any sequence of usual set operations . (union, intersection, complement, and difference) on 1 which are represented by the four distinct regions in the Venn diagram in Figure 6.2. The field consists of the unions of subsets of the atoms in (6.1). 
There are a total of 16 sets in , which are precisely all the sets which can be obtained from and T by the usual set operations. ¤ ¸ hTi*jkClGFCXDFCY=BGMmJ ¤ ³ A real function defined on is called a signed measure if it is set-additive, i.e., for disjoint û and in , ¹ ³ î Ø ¦.û ¹ ¨ ¦.û ƨ ¾ ¦ ¹ ¨ For a signed measure , we have ¦ 3¨ Ø « ã (6.2) 1 We adopt the convention that the union of the empty subset of the atoms of D R A F T (6.3) Î is the empty set. September 13, 2001, 6:27pm D R A F T î 97 The -Measure X1 X2 Figure 6.2. The Venn diagram for which can be seen as follows. For any û in Ñ and . ³, ¦.û0¨ Ø ¦.û 3¨ Ø ¦.û ¨Æ¾ ¦ 3¨ (6.4) by set-additivity because û and are disjoint, which implies (6.3). A signed measure on is completely specified by its values on the atoms of . The values of on the other sets in can be obtained via set-additivity. ³ ³ F¤»z?*ÄËi mJEH ³ A signed measure on is completely specified by the val- ¦ ¤ ¸ ü T¤ L¨ ã ¦ ¤ ¸Þ ü ¤TL¨ ã ¦ ¤ ¸ ü ¤ Þ ¨ ã ¦ ¤ ¸Þ ü ¤ Þ ¨ î The value of on ¤ ¸ , for example, can be obtained as ¦ ¤ ¸ ¨ Ø ¦¦ ¤ ¸ ü ¤ ¨ ¦ ¤ ¸ ü ¤ Þ Þ ¨¨ î Ø ¦ ¤ ¸ ü ¤TL¨Æ¾ ¦ ¤ ¸ ü ¤ ¨ ues (6.5) 6.2 ! (6.6) (6.7) " THE # -MEASURE FOR TWO RANDOM VARIABLES To fix ideas, we first formulate in this section the one-to-one correspondence between Shannon’s information measures and set theory for two random variables. For random variables and T , let and T be sets corresponding to and T , respectively. The sets and T generates the field whose atoms are listed in (6.1). In our formulation, we set the universal set $ to " T for reasons which will become clear later. With this choice of $ , the Venn diagram for and T is represented by the diagram in Figure 6.3. For simplicity, the sets and T are respectively labeled by and T in the diagram. We call this the information diagram for the random variables and . In this diagram, the universal set, which is the union of and , is not shown explicitly as in a usual Venn diagram. Note that with our choice of ¤ ¸ ¤ ¸ ¤ ¤ ¸ ¤ ¤ D R A F T ¤ ¸¸ ¤ ¤ ¤ ¤ ¸ ¤ ¸ ¤ ¤ ¤ ¤ ¸ September 13, 2001, 6:27pm ¤ ¸ ¤ ¤ ¤ ¸ D R A F T 98 A FIRST COURSE IN INFORMATION THEORY X1 X2 Figure 6.3. The generic information diagram for Ñ and % . ¤ ¸Þ ü ¤ Þ is degenerated to the empty set, because (6.8) ¤ ¸Þ ü ¤ Þ Ø ¦ ¤ ¸ ¤ ¨ Þ Ø Þ Ø î Thus this atom is not shown in the information diagram in Figure 6.3. For random variables ¤ ¸ and ¤T , the Shannon’s information measures are Z Z Z Z Z ¦w¤ ¸ ¨ ã ¦wT¤ ô¨ ã ¦w¤ ¸ T¤ L¨ ã ¦wT¤ ¤ ¸ ¨ ã ¦w¤ ¸ ã T¤ L¨ ã î ¦w¤ ¸ !T¤ L¨ î (6.9) Þ as û à , we now define a signed measure by Writing ûIü ¹ ¹ à Z Ø ¸ (6.10) ¦ ¤ à T¤ -¨ Z ¦w¤ ¸ T¤ L¨ ã Ø ¸ ¸ ¦ T¤ ¤ ¨ ¦wT¤ ¤ ¨ (6.11) and ¦ ¤ ¸ ü ¤TL¨ Ø î ¦w¤ ¸ !¤TL¨ î (6.12) These are theÞ values of on the nonempty atoms of (i.e., atoms of Þ ¸ ü ¤ ). The values of on the other sets in can be obtained other than ¤ the universal set, the atom $ '& & & ( & '& ) & via set-additivity. In particular, the relations ¦ ¤ ¸ ¤ ¨ Ø Z ¦w¤ ¸ ã 㤠¨ ¦ ¤ ¸ ¨ Ø ¦w¤ ¸ ¨ Z Ø ¦ ¤TL¨ w¦ ¤TL¨ Z & & * and (6.13) (6.14) & (6.15) can readily be verified. For example, (6.13) is seen to be true by considering ¦ ¤ ¸ T¤ L¨ Ã Ø Z ¦ ¤ ¸ T¤ L¨©Z ¾ ¦ ¤T à ¤ ¸ ƨ ¾ ¦ ¤ ¸ ü T¤ L¨ Ø Z ¦w¤ ¸ ã ¤ ¨©î¾ ¦w¤ ¤ ¸ ƨ ¾ î ¦w¤ ¸ !¤ ¨ Ø ¦w¤ ¸ T¤ L¨ & + " D R A F T & & , & September 13, 2001, 6:27pm (6.16) (6.17) (6.18) D R A F T î 99 The -Measure ¤ ¸ ¤ The right hand side of (6.10) to (6.15) are the six Shannon’s information measures for and T in (6.9). 
Now observe that (6.10) to (6.15) are consistent with how the Shannon’s information measures on the right hand side are identified in Figure 6.1, with the left circle and the right circle representing the sets and T , respectively. Specifically, in each of these equations, the left hand side and the right hand side correspond to each other via the following substitution of symbols: ¤ ¸ ¤ Z î ó ã - ! - & - - (6.19) Ãî ü ¤ ¸ Z ¤ î Note that we make no distinction between the symbols and in this substitution. Thus for two random variables and , Shannon’s information measures can formally be regarded as a signed measure on . We will refer to '& as the I-Measure for the random variables and T 2 . Upon realizing that Shannon’s information measures can be viewed as a signed measure, we can apply the rich set of operations in set theory in information theory. This explains why Figure 6.1 or Figure 6.3 represents the relationship between all Shannon’s information measures for two random variables correctly. As an example, consider the following set identity which is readily identified in Figure 6.3: ¦ ¤ ¸ ¤TL¨ Ø & & ¦ ¤ ¸ ¨©¾ & ¤ ¸ ¦ ¤TL¨ à ¤ & ¦ ¤ ¸ ü T¤ L¨ (6.20) This identity is a special case of the inclusion-exclusion formula in set theory. By means of the substitution of symbols in (6.19), we immediately obtain the information identity Z Z Z ã Ø ¸ ¸ w¦ ¤ ¤TL¨ w¦ ¤ ¨©¾ ¦w¤TL¨ à î w¦ ¤ ¸ ! ¤TL¨ î (6.21) ¤ ¸Þ ¤ Þ We end this section with a remark. The value of '& on the atom ü has no apparent information-theoretic meaning. In our formulation, we set the universal set $ to T so that the atom ü is degenerated to the empty set. Then & * ü. naturally vanishes because & is a measure, and '& is completely specified by all Shannon’s information measures involving the random variables and . ¤ ¸ ¸Þ ¤ Þ ¦¤ ¤ ¨ ¤ ¸ ¤ ¤ ¸Þ ¤ Þ 2 The reader should not confuse /10 with the probability measure defining the random variables The former, however, is determined by the latter. D R A F T September 13, 2001, 6:27pm Ñ and . D R A F T 100 A FIRST COURSE IN INFORMATION THEORY 6.3 CONSTRUCTION OF THE # -MEASURE 2 î * ¼é» Ê We have constructed the -Measure for two random variables in the last î section. In this section, we construct the -Measure for any random variables. Consider random variables . For any random variable , let be a set corresponding to . Let ¤ ¸ã¤ ã 㤠³ ¤ س Ý ã Ê ã ã ¼ î ¼ ¤ 3 ¤ ã ã ã ¤ , i.e., Define the universal set to be the union of the sets ¤ ¸ ¤T ³ Ø É ¤ î (6.23) Ê Î ã ã ã ¤ . The set to denote the field generated by ¤ ¸ ¤T We use ³ ³ Þ Ø / û ¤ (6.24) Ê Î is called the empty atom of ³ because Þ ÞØ É ¤ Ø ÞØ î (6.25) ¤ Ê Î Ê Î All the atoms of other than û / are called nonempty atoms. , the cardinality of ³ Let be theÊ set of all nonempty atoms of . Then Ã Ý . A signed measure on ³ ³ is completely specified by , is equal to ³ the values of on the nonempty atoms of . ã ³ ¤ ñU¢ ù¨ and ¤ to To simplify notation, we will use ¤ to denote ¦w denote » Ê ¤ for any nonempty subset of ³ . wyxFi0=B<i*z{mJEm Let Ø ¤ 2 is a nonempty subset of ³ î (6.26) ã ¢ , Then a signed measure on is completely specified by ¦ ¨ ³ ¹ ¹ which can be any set of real numbers. Proof The number ofÊ elements in is equal to the Ê number of nonempty Ã Ý . Thus ã Ø Ø ³ Ã Ý . Let d d Ø Ê ³ Ãsub-Ý . , which isd ³ sets of ³ Let ã be a column -vector of .¦ û ¨ û ¢ , and be a column -vector of ¦ ¹ ¨ ¹ ¢ . 
Since all the sets in can expressed uniquely as the union of some nonempty atoms in , by the set-additivity of , for each ¢ ¹ , ¦ ¹ ¨ can be expressed uniquely as the sum of some components of . Thus Ø ³ ã (6.27) (6.22) $ $ 54 6 74 6 98: 54 <;= 74 $ > > > @? ? .A 3 @ ? B A DC 3 ? @ EA GF B B 3 B > H B B > > I B H I D R A F T KJ H September 13, 2001, 6:27pm D R A F T î The -Measure d ³ 101 d where J is a unique q matrix. On the other hand, it can be shown (see Appendix 6.A) that for each û > , .û can be expressed as a linear comB by applications, if necessary, of the following two bination of identities: ¦ ¹ ¦.ûIü ¹ Ãà ¦.û ¢ ¦ ¨ ¨ã¹ ¢ ¨ Ø ¦.û à ©¨ þ ¦ ¹ Ã î ¨ à ¦.û ¹ à ¨ ¹ ¨ Ø .¦ û ¹ ¨ ¦ ¹ ¨ ,L ,L L ,L (6.28) (6.29) However, the existence of the said expression does not imply its uniqueness. Nevertheless, we can write NM (6.30) H I d d Ø ³ into (6.30), we obtain ³ . UponØ substituting (6.27) ã (6.31) ¦ ³ ³¨ which implies that inverse of as (6.31) holds regardless the ã û ¢ of are ³ isistheunique, ³ choice of . Since so is . Therefore, ¦ . 0 û ¨ ã ¢ are³ specified. Hence, a signed mea³ uniquely determined once ¦ ¨ ¹ ¹ ã ¢ , which can be any set is completely specified by ¦ ¨ sure on ¹ ¹ ³ of real numbers. The theorem is proved. for some q matrix M M H J M OH J J M B P> B We now prove the following two lemmas which are related by the substitution of symbols in (6.19). £i*zz¤mJX| ¦.ûIü ¹ à ¨ Ø ¦.û ,L ¨©¾ ¦ ¹ L , ¨ à ¦.û ¹ L L ¨ à ¦ ¨ î L (6.32) Proof From (6.28) and (6.29), we have ¦.ûMü Ø Ø Ø Ã ¹¦.û à æ ¦ ¦.û ¦.û .¦ û ,L L Q Q ¨ L The lemma is proved. £i*zz¤mJE î ¦w¤~! b ã D R A F T ¨©¾ à ¦ ¹ ¨ ¦ ¹¨©¾ ¦ ¨ ¹ L L , à ¨ à ¦.û à ¨ ¨Ã¨Æ¾â¦ ¦ ¹ ¨ ¹ à ¦ ¨¨ ¦ à ¨¨ ¨ ¦.û ¹ ¨ à ¦ ¨ î ,L , Q L L L L L (6.33) L L L Z Z Z Z ب ¦w¤ ã ã Æ¨ ¾ ¦ bã ã ¨ à w¦ ¤ ãcbã ã ¨ à 5¦ ã ¨ î September 13, 2001, 6:27pm (6.34) (6.35) (6.36) D R A F T 102 A FIRST COURSE IN INFORMATION THEORY Proof Consider ¦w¤«! b Z ã ¨ Ø Z ¦w¤ Ø Z ¦w¤ Ø ¦w¤ î è Z Z w¦ ¤ bã ã Z ¨ ã ã ¨ à Z ¦5ã ¨ à ¦ Z ¦w¤ ãcbã ã ¨ à Z Z ¦ bã ã ¨¨ ã 㠨ƾ ¦ bã ã ¨ à w¦ ¤ ãcbã ã ¨ à 5¦ ã ¨ î ã (6.37) (6.38) (6.39) The lemma is proved. î ³ We now construct the -Measure '& on Z using Theorem 6.6 by defining (6.40) ¦ ¤ ¨ Ø ¦w¤ ¨ for all nonempty subsets of ³ . In order for to be meaningful, it has to be consistent with all Shannon’s information measures (via the substitution of symbols in (6.19)). ã In that the following must hold for all (not necessarily ï ã ï case, ï of disjoint) subsets ³Ã where î and are nonempty: (6.41) ¦ ¤ ü ¤ ÷ ¤ ÷ ÷ ¨ Ø ¦w¤ !¤ ÷ ¤ ÷ ÷ ¨ î ï ï Ø , (6.41) becomes When (6.42) ¦ ¤ tü ¤ ÷ ¨ Ø î ¦w¤ !¤ ÷ ¨ î ï , (6.41) becomes When Ø Z Ã Ø ¦ ¤ ¤ ÷ ÷ ¨ ¦w¤ ¤ ÷ ÷ ¨ î (6.43) ï and ï ï Ø , (6.41) becomes When Ø Z Ø ¦ ¤ =¨ ¦w¤ =¨ î (6.44) & ? ? 3 A & 3 A A A & A A ? ? A ? ? ? ? & A ? @ ? @? ? A A A A & ? @ ? @? ? & * @ ? @? Thus (6.41) covers all the four cases of Shannon’s information measures, and it is the necessary and sufficient condition for & to be consistent with all Shannon’s information measures. wyxFi0=B<i*z{mJE '& is the unique signed measure on with all Shannon’s information measures. ³ which is consistent Proof Consider Ø ¦¤ tü"¤ ? ÷ à ¤ ? ÷ ÷ ¨ à Z Z & ¦ ¤ ?RS? ÷ ÷ ¨©¾,Z & ¦ ¤ ? ÷ RT? ÷ ÷ ¨ ¦î w¤ ?RS? ÷ ÷ ¨Æ ¾ ¦wã ¤ ? ÷ RS? ÷ ÷ ¨ à ¦w¤@?! ¤ ? ÷ ¤ ? ÷ ÷ ¨ & + @ ? Ø Ø D R A F T & ¦w¤ ¦¤ ?RS? ?RS? ÷ RS? ÷ ÷ ¨ à Z ÷ RS? ÷ ÷ ¨ à September 13, 2001, 6:27pm ¦ ¤ ÷÷ ¨ ¦w¤ ÷ ÷ ¨ & ? ? 
(6.45) (6.46) (6.47) D R A F T î 103 The -Measure where (6.45) and (6.47) follow from Lemmas 6.7 and 6.8, respectively, and (6.46) follows from (6.40), the definition of & . Thus we have proved (6.41), i.e., '& is consistent with all Shannon’s information measures. In order that '& is consistent with all Shannon’s information measures, for 3 , & has to satisfy (6.44), which in fact is the all nonempty subsets A of definition of '& in (6.40). Therefore, '& is the unique signed measure on which is consistent with all Shannon’s information measures. ³ 6.4 2 * ³ CAN BE NEGATIVE î In the previous sections, we have been cautious in referring to the -Measure 3 '& as a signed measure instead of a measure . In this section, we show that '& \ in fact can take negative values for . For , the three nonempty atoms of are ¼1» ¤ ¸ ü ¤ 㤠¸ à ¤ 㤠à ¤ ¸î ¼ Ø Ê The values of (6.48) on these atoms are respectively & Z Z ¦w¤ ¸ ! ¤TL¨ ã ¦w¤ ¸ T¤ L¨ ã ¦w¤T ¤ ¸ ¨ î î (6.49) ¼ Ø Ê These quantities are Shannon’s information measures and hence nonnegative by the basic\ inequalities. Therefore, '& is always nonnegative for . ï , the seven nonempty atoms of are For ¼ Ø ¤ à ¤ ì ¶ ã ¤ 0ü ¤ à ¤ ¶ ã ¤ ¸ ü ¤T9ü ¤ ï ã (6.50) d \  ñõ ½õ  . The values of on the first two types of atoms are where Ý Z Ã Ø ¶ (6.51) ¦ ¤ ¤ ì ¨ ¦w¤ ¤ ã ¤ ¶ ¨ and ¦ ¤0ü ¤ à ¤ ¶ ¨ Ø î ¦w¤ë!¤ ¤ ¶ ¨ ã (6.52) respectively, which are Shannon’s information measures and therefore nonnegative. However, ¦ ¤ ¸ ü ¤Tü ¤ ï ¨ does not correspond to a Shannon’s information measure. In the next example, we show that j¦ ¤ ¸ ü T ¤ Ñü ¤ ï ¨ can actually be negative. F¤»z?*ÄËi mJ3¥¦ In this example, all the entropies are in the base 2. Let ¤ ¸ ¤ be independent binary random variables with and T ¯± ¤ Ø « Ø ¯± ¤ Ø Ý Ø « î ã (6.53) YX V U<W " W [ " Z '& & & * & + U<W " W \ YX W W ] '& _^ 3A measure can only take nonnegative values. D R A F T September 13, 2001, 6:27pm D R A F T Ø Ý ã Ê . Let 104 ñ A FIRST COURSE IN INFORMATION THEORY ¤ ï Ø ¤ ¸ ¾Ü¤T ý Ê î a` (6.54) It is easy to check that ¤ ï has the same marginal distribution as ¤ ¸ and ¤T . Z Thus, (6.55) ¦w¤ ¨ Ø Ý ã ã \ Ê Ø ¸ for ñ Ý . Moreover, ¤ , Z ¤T , and ¤ ï are pairwise independent. Therefore, ¦w¤ ã ¤ ¨ Ø Ê (6.56) and î ¦w¤ !¤ ¨ Ø « (6.57) \   for Ý . We further see from (6.54) that each random variable is a ñðõ W W [ function of the other two random variables. Then by the chain rule for entropy, we have w¦ ¤ ¸ ã ¤T ã ¤ ï ¨ Ø Ê ¦w¤ ¸ ã ¤TL¨Æ¾ Ø î À¾ « Ø Ê Â ñ=õ ½õ d  \ , Now for Ý î ¦w¤ë!¤ Z ¤ ¶ ã ¨ Z Z Ø ¦w¤ ¤ ¶ ¨©¾ ¦w¤ ã ¤ ¶ ¨ à ¦w¤ Ø Ê ã¾ Ê Ã Ê Ã Ý Ø Ý Z Z ¦w¤ ï ¤ ¸ ã T¤ L¨ Z (6.58) (6.59) (6.60) [ Z ¸ ã ¤ ã ¤ ï ¨ à w¦ ¤ ¶ ¨ W W (6.61) (6.62) (6.63) where we have invoked Lemma 6.8. It then follows that ¸ ü ¤ ¨ à ¦ ¤ ¸ ü ¤ à ¤ ï ¨ (6.64) ¦ ¤ Ø î ¦w¤ ¸ !¤TL¨ à î w¦ ¤ ¸ !¤T ¤ ï ¨ (6.65) Ø « ÃîÝ (6.66) Ø ÃÝ (6.67) Thus takes a negative value on the atom ¤ ¸ ü ¤T9ü ¤ ï . Motivated by the substitution of symbols in (6.19) î for Shannon’s information measures, we will write 7¦ ¤ ¸ ü ¤TBü ¤ ï ¨ as ¦w¤ ¸ !¤T!¤ ï ¨ . In general, we will write ¦ ¤ Ñ ü ¤ ü ü ¤ à ¤ ç¨ (6.68) as î ¦w¤ Ñ !¤ ! !¤ ¤ ¨ (6.69) ¦¤ ¸ ü ¤ & ü ¤ ï¨ Ø & & '& & ? @ ? @ @? D R A F T @? & " Z ?cb @ @?cb d e ed September 13, 2001, 6:27pm D R A F T î 105 The -Measure j h f f f I ( X 1 ; X 2 ; X i 3) h f f f f I ( X1 ; X 2 X i3) g X2 g h f H ( X 2 X 1) h H ( X 1) Xi3 X1 g h f f f j h f H ( X 1 X 2 , X i3) f I ( X 1 ; X i3) Figure 6.4. 
The generic information diagram for Ñ, % , and %k ¤ Ñ ã ¤ ã ã ¤ ¤ î ¸ ¦w¤ !¤ !¤ ï ¨ Ø î w¦ ¤ ¸ !¤ ¨ Ã î ¦w¤ ¸ !¤ ¤ ï ¨ î î For this example, ¦w¤ ¸ !¤T!¤ ï ¨°õ[« , which implies î ¸ ¦w¤ !¤T ¤ ï ¨ª î ¦w¤ ¸ !T¤ L¨ î . and refer to it as the mutual information between @? @? @?cb conditioning on ed . Then (6.64) in the above example can be written as (6.70) (6.71) Therefore, unlike entropy, the mutual information between two random variables can be increased by conditioning on a third random variable. Also, we note in (6.70) that although the expression on the right hand side is not symbolically symmetrical in , T , and ï , we see from the left hand side that it is in fact symmetrical in , T , and ï . ¤ ¸ ¸¤ ¤ ¤ 6.5 ¤ ¤ INFORMATION DIAGRAMS We have established in Section 6.3 a one-to-one correspondence between Shannon’s information measures and set theory. Therefore, it is valid to use an information diagram, which is a variation of a Venn diagram, to represent the relationship between Shannon’s information measures. For simplicity, a set will be labeled by in an information diagram. We have seen the generic information in Figure 6.3. A generic \ diagram for information diagram for is shown in Figure 6.4. The informationtheoretic labeling of the values of & on some of the sets in ï is shown in î the diagram. As an example, the information diagram for the -Measure for T , and ï discussed in Example 6.10 is shown in Figrandom variables ure 6.5. For ml , it is not possible to display an information diagram perfectly in two dimensions. In general, an information diagram for random variables ¤ ¼X» D R A F T ¤ ¸ã¤ ¼ Ø ¤ ¼ Ø Ê ¤ ¼ September 13, 2001, 6:27pm D R A F T 106 A FIRST COURSE IN INFORMATION THEORY ¼ Ã Ý ¼ Ø needs dimensions to be displayed perfectly. Nevertheless, for l , an information diagram can be displayed in two dimensions almost perfectly as shown in Figure 6.6. This information diagram is correct in that the region representing the set o n splits each atom in Figure 6.4 into two atoms. However, the adjacency of certain atoms are not displayed correctly. For example, the set ü T ü n , which consists of the atoms ü T ü ï ü n and üpT "üq ï ür n , is not represented by a connected region because the two atoms are not adjacent to each other. When '& takes the value zero on an atom û of , we do not need to display the atom û in an information diagram because the atom û does not contribute for any set containing the atom û . As we will see shortly, this to '& can happen if certain Markov constraints are imposed on the random variables involved, and the information diagram can be simplified accordingly. In a generic information diagram (i.e., when there is no constraint on the random variables), however, all the atoms have to be displayed, as is implied by the next theorem. ¤¸ ¸ ¤ Þ ¤ Þ Þ ¤ ¤ ¤ ¤ ¤ 9¦ ¹ ¨ ¤ ¸ ¤ ¤ Þ ¤ ³ ¹ ¤ ¸ ã ¤ ã ã ¤ ³ , then ³. wyxFi0=B<i*z{mJ3¥*¥ If there is no constraint on take any set of nonnegative values on the nonempty atoms of '& can Proof We will prove the theorem by constructing a '& which can take any set of nonnegative values on the nonempty atoms of . Recall that > is the set of bts û > be mutually independent all nonempty atoms of . Let random variables. Now define the random variables ñ by ã ¢ ³ ã ã ã ã Ø ¤ Ý Ê ¼ (6.72) ¤ Ø ¦ b û ¢ and û å ¤ t¨ î ã ã ã î We determine the -Measure for ¤ ¸ ¤ ¤ ³ so defined as follows. b Since are mutually independent, for all nonempty subsets of , we ³ Z Z have ¦w¤ =¨ Ø Ê µ + ¦ b ¨ î (6.73) ³ ts u> '& ts 3 A Gs @? 
s s'xzy *{ wv | X2 0 1 1 1 | X1 0 Figure 6.5. The information diagram for D R A F T | 1 Ñ, 0 , and } %k X3 in Example 6.10. September 13, 2001, 6:27pm D R A F T î 107 The -Measure X2 ~ ~ X1 X 4 Figure 6.6. The generic information diagram for On the other hand, Z ¦w¤ =¨ Ø @? ¦ ¤ Ũ Ø & * @ ? µ Êwv + s Ñ, & s'xzy *{ X3 , %k ¦.û0¨ î , and . (6.74) Equating the right hand sides of (6.73) and (6.74), we have µ Êv+ sx + y { 3 s µ Z bts ¦ ¨Ø Êv + s'x + y { s & ¦.û ¨ î (6.75) Evidently, we can make the above equality hold for all nonempty subsets A of Z bGs by taking & .û (6.76) ³ ¦ ¨Ø ¦ ¨ ¢ ¦ ¨ for all Z û u bGs > . By the uniqueness of '& , this is also the only possibility for '& . Since can take any nonnegative value by Corollary 2.44, & can take any set of nonnegative values on the nonempty atoms of . The theorem is proved. ³ ¸ ¤ ¤ ¤ ¸ ¤ ¤³ (6.77) ¦ ¤ ¸ ü ¤ Þ ü ¤ ï ¨ Ø î ¦w¤ ¸ !¤ ï ¤TL¨ Ø « ã Þ the atom ¤ ¸ ü ¤ ü ¤ ï does not have to be displayed in an information diagram. As such, in constructing the information diagram, the regions representing the random variables ¤ ¸ , T ¤ , and ¤ ï should overlap Þ with each other such that the region corresponding to the atom ¤ ¸ ü ¤ ü ¤ ï is empty, while the regions In the rest of this section, weexplore the structure of Shannon’s information S measures when \ S T S forms a Markov chain. To start with, S T S ï forms a Markov chain. Since we consider , i.e., ¤ ¼ Ø & corresponding to all other nonempty atoms are nonempty. Figure 6.7 shows D R A F T September 13, 2001, 6:27pm D R A F T 108 A FIRST COURSE IN INFORMATION THEORY X2 X1 X3 Figure 6.7. The information diagram for the Markov chain ¤ ¸ ¤ Ñ G ¤ k . such a construction, in which each random variable is represented by a mountain4 . From Figure 6.7, we see that ürT Büp ï , as the only atom on which '& may take a negative value, now becomes identical to the atom ü ï. Therefore, we have ¦w¤ ¸ !¤T:!¤ ï ¨ Ø ¤ ¸ ¤ ¦ ¤ ¸ ü ¤T9ü ¤ ï ¨ (6.78) (6.79) ¦¤ ¸ ü ¤ ï ¨ Ø î î¦w¤ ¸ !¤ ï ¨ (6.80) (6.81) » « ¤ S ¤ ï forms a Markov chain, is Hence, we conclude that when ¤ ¸ S T always nonnegative. ¤ S ¤ ï S ¤ forms a Markov Next, we consider ¼ Ø , i.e., ¤ ¸ S T î Ø & & l & n chain. With reference to Figure 6.6, we first show that under this Markov constraint, '& always vanishes on certain nonempty atoms: ¤ ¸ S ¤T S ¤ ï implies î ¸ ï w¦ ¤ !¤ !¤ ¤TL¨©¾ î ¦w¤ ¸ !¤ ï ¤T ã ¤ ¨ Ø î w¦ ¤ ¸ ! ¤ ï T¤ L¨ Ø « î (6.82) ¤ ¸ S T¤ S ¤ implies î ¸ ï w¦ ¤ !¤ !¤ ¤TL¨©¾ î ¦w¤ ¸ !¤ ¤T ã ¤ ï ¨ Ø î w¦ ¤ ¸ ! ¤ T¤ L¨ Ø « î (6.83) ¤ ¸ S ¤ ï S ¤ implies î ¸ w¦ ¤ !¤ !¤ ¤ ï ¨©¾ î ¦w¤ ¸ !¤ ¤ ã ¤ ï ¨ Ø î w¦ ¤ ¸ ! ¤ ¤ ï ¨ Ø « î (6.84) 1. The Markov chain n n 2. The Markov chain n n n 3. The Markov chain on 4 This n n on on form of an information diagram for a Markov chain first appeared in Kawabata [107]. D R A F T September 13, 2001, 6:27pm D R A F T î 109 The -Measure ¤ ¤ ï S ¤ implies î ¸ ¦w¤ !¤T!¤ ¤ ï ©¨ ¾ î ¦w¤T!¤ ¤ ¸ ã ¤ ï ¨ Ø î ¦w¤T!¤ ¤ ï ¨ Ø « î ã 5. The Markov chain ¦w¤ ¸ ¤Tô¨ S ¤ ï S ¤ implies î ¸ ¦P¤ !¤Tî !¤ ã ¤ ï ¨©¾ î ¦w¤ ¸ !¤ T¤ ã ¤ ï ¨Æ¾ î ¦wT¤ !¤ ¤ ¸ ã ¤ ï ¨ Ø î¦w¤ ¸ T¤ !¤ ¤ ï ¨ Ø « S 4. The Markov chain T n n n n (6.85) n n n n n ¦w¤ ¸ ! ¤ T¤ ã ¤ ï ¨ Ø î w¦ ¤ ¸ ! ¤ ï T¤ ã ¤ ¨ ã (6.84) and (6.88) imply î ¸ ¦w¤ !¤T!¤ ¤ ï ¨ Ø Ã î ¦w¤ ¸ !¤ ï ¤T ã ¤ ¨ ã (6.86) (6.87) Now (6.82) and (6.83) imply î n (6.88) n n (6.89) n ¦w¤T:!¤ ¤ ¸ ã ¤ ï ¨ Ø î w¦ ¤ ¸ ! ¤ ï T¤ ã ¤ ¨ î and (6.85) and (6.89) imply î n (6.90) n The terms on the left hand sides of (6.88), (6.89), and (6.90) are the three terms on the left hand side of (6.87). 
Then we substitute (6.88), (6.89), and (6.90) in (6.87) to obtain ¦ ¤ ¸ ü ¤ Þ ü ¤ & * " Z ï üZ¤ ¦ ¤ ¸ ü ¤ ÞÞ ü ¤ ï Þ ü ¤ ¦ ¤ ¸ ü ¤ ü ¤ ïÞ ü ¤ ¦ ¤ ¸Þ ü ¤T9ü ¤ ï Þ ü ¤ ¦ ¤ ¸ ü T¤ 9ü ¤ ï ü ¤ n Þ ¨ Ø î w¦ ¤ ¸ ! ¤ ï T¤ ã ¤ ¨ Ø « î n From (6.82), (6.88), (6.89), and (6.90), (6.91) implies & & * & & * L¨ Ø îî w¦ ¤ ¸ !¤ ï ! ¤ ã ¤ ¨ Ø î ¦w¤ ¸ !¤ ¤T ¤ ¨ Ø î ¦w¤ ¸ !¤T! ¤ ã ¤ ¨ Ø ¦w¤T:!¤ ¤ ¸ ¤ n o " " Z n n Z " Z n on n n n ¨Ø « ï¨ Ø « ï¨ Ø « î ï¨ Ø « (6.91) (6.92) (6.93) (6.94) (6.95) From (6.91) to (6.95), we see that '& always vanishes on the atoms ¤ ¸ ü ¤ ÞÞ ü ¤ ï ü ¤ Þ ¤ ¸ ü ¤ Þ ü ¤ ïÞ ü ¤ ¤ ¸ ü ¤ ü ¤ ïÞ ü ¤ ¤ ¸Þ ü T¤ 9ü ¤ ï Þ ü ¤ ¤ ¸ü ¤ ü ¤ ïü ¤ " n n o D R A F T Z Z n (6.96) n n o September 13, 2001, 6:27pm D R A F T 110 A FIRST COURSE IN INFORMATION THEORY X2 * * ~ X1 * * * X3 ~ Figure 6.8. The atoms of Markov chain. 0 on which X4 vanishes when Ñ k forms a ¦w¤ ¸ ¤ ¤ ã ¤ ¨ Ø ä » « diagram in Figure 6.8. of n , which we mark by an asterisk in the î information ! ï T n in (6.82) In fact, the reader can benefit by letting and trace the subsequent steps leading to the above conclusion in the information diagram in Figure 6.6. It is not necessary to display the five atoms in (6.96) in an information diagram because '& always vanishes on these atoms. Therefore, in constructing the information diagram, the regions representing the random variables should overlap with each other such that the regions corresponding to these five nonempty atoms are empty, while the regions corresponding to the other ten nonempty atoms, namely ¤ ¸ ü ¤ Þ ü ¤ ¸ ü ¤T9ü ¤ ¸ ü ¤T9ü ¤ ¸Þ ü T¤ 9ü ¤ ¸Þ ü ¤T9ü ¤ ¸Þ ü ¤ ü ¤ ¸Þ ü ¤T9Þ ü ¤ ¸Þ ü ¤ Þ ü ¤ ¸Þ ü ¤ Þ ü ¤ ¸ü ¤ ü ¤ ï ÞÞ ü ¤ Þ Þ ¤ ïü ¤ Þ ¤ ïü ¤ ¤ ïÞ ü ¤ Þ ¤ ïü ¤ Þ ¤ ïü ¤ ¤ ïü ¤ Þ ¤ ïü ¤ ¤ ïÞ ü ¤ ã ¤ ïü ¤ ) " ) n n ) " Z n n n n n n Z " Z n n (6.97) are nonempty. Figure 6.9 shows such a construction. The reader should compare the information diagrams in Figures 6.7 and 6.9 and observe that the latter is an extension of the former. D R A F T September 13, 2001, 6:27pm D R A F T î 111 The -Measure X1 X 2 X3 Figure 6.9. The information diagram for the Markov chain X 4 Ñ G % %k . From Figure 6.9, we see that the values of '& on the ten nonempty atoms in (6.97) are equivalent to ¦î w¤ ¸ T¤ ã ¤ ï ã ã ¤ ¨ ¦î w¤ ¸ !T¤ ¤ ï ¤ ¨ ¦î w¤ ¸ !¤ ï ¤ ô¨ Z ¦w¤ ¸ !¤ ¨ ã ã ¦w¤ ¤ ¸ ¤ ï ¤ ô¨ î ¦wT¤ !¤ ï ¤ ¸ !¤ ¨ î ¤ !¤ ¤ã ¸ ¨ ã Z ¦wT ¦w¤ ï ¤ ¸ T¤ ã ¤ ¨ î ï ¤ã L¨ ã Z ¦w¤ !¤ ¤ã ¸ T ¸ ¦w¤ ¤ ¤ ¤ ï ¨ Z n n on n on (6.98) n n n n on respectively5 . Since these are all Shannon’s information measures and thus nonnegative, we concludethat \ S '& is always nonnegative. S S T When forms a Markov chain, for , there is only one nonempty atom, namely ü ü ï , on which '& always vanishes. This atom can be determined directly from the Markov constraint î . For ! ï T l , the five nonempty atoms on which '& always vanishes are listed in (6.96). The determination of these atoms, \ as we have and l , '& seen, is not straightforward. We have also shown that for is always nonnegative. We will extend this theme in Chapter 7 to finite Markov random field with Markov chain being a special case. For a Markov chain, the information diagram can always be displayed in two dimensions, and & is always nonnegative. These will be explained in Chapter 7. ¤ ¸ ¤ ¦w¤ ¸ ¤ ¤ L¨ Ø « 5A ¼ Ø ¼ Ø ¤ ³ ¸ Þ ¤ ¤ ¤ ¼ Ø ¼ Ø formal proof will be given in Theorem 7.30. 
D R A F T September 13, 2001, 6:27pm D R A F T 112 6.6 A FIRST COURSE IN INFORMATION THEORY EXAMPLES OF APPLICATIONS In this section, we give a few simple applications of information diagrams. The use of information diagrams simplifies many difficult proofs in information theory problems. More importantly, these results, which may be difficult to discover, can easily be obtained by inspection of an information diagram. The use of an information diagram is very intuitive. To obtain an information identity from an information diagram is WYSIWYG6. However, how to obtain an information inequality from an information diagram needs some explanation. Very often, we use a Venn diagram to represent a measure which takes such nonnegative values. If we see in the Venn diagram two sets û and that û is a subset of , then we can immediately conclude that .û because .û û (6.99) ¦ 0¨  ¹ ¦ ¹ ¨ ¹ ¦ ¹ ¨ à ¦ 0¨ Ø ¦ ¹ à 0¨°»[« î î However, an -Measure can take negative values. Therefore, when we see in an information diagram that û is a subset of , we cannot base on this fact  9¦ ¨ unless we¹ know from the setup of the alone conclude that 9¦.û ¨ ¹ problem that is nonnegative. (For example, is nonnegative if the random & '& '& & & variables involved form a Markov chain.) Instead, information inequalities can be obtained from an information diagram in conjunction with the basic inequalities. The following examples will illustrate how it works. F¤»z?*ÄËi mJ3¥T ~}e=BG"0¤T(CEDa=Bji*GD<>=B?k Let ¤ ¥ ¦w¡º¨ . Let ¤ ¥§¦w¡©¨ Ø j ¥ ¸ ¦w¡º¨Æ¾j ¥@j¦w¡º¨ ã Â Â Ý and Ø Ý Ã . We will show that where « Z Z Z ¦w¤1¨ð»a ¦w¤ ¸ ¨©¾ ¸ ¥ ¸ ¦w¡º¨ ¤T (6.100) ¦w¤TL¨ î and (6.101) Consider the system in Figure 6.10 in which the position of the switch is determined by a random variable ã with ¯± ã Ø Ý Ø and ¯± ã Ø Ê Ø ã (6.102) whereã ãÊ is independent of ¤ ¸ and ¤T . The switch takes position ñ if ã Ø ñ , ñ Ø Ý . Figure 6.11 shows information diagram for ¤ and ã . From the à ã the diagram, we see that ¤ is a subset of ¤ . Since is nonnegative for two random variables, we can conclude that ¦ ¤ ¨°» ¦ ¤ à ã!¨ ã (6.103) 6 What & ( & * '& you see is what you get. D R A F T September 13, 2001, 6:27pm D R A F T î 113 The -Measure Z=1 X1 X2 X Z=2 Figure 6.10. The schematic diagram for Example 6.12. Z which is equivalent to Then Z ¦w¤I¨å» ¦w¤ ã ¨ î Z (6.104) Z ¦w¤I¨ » =¯ ±¦w ¤ ã ¨ Z (6.105) Z Ø Z ã Ø Ý Z ¦w¤ ã Ø ã LÝ ¨Æ¾ ¯=± ã Ø Ê ¦w¤ ã Ø Ê ¨ (6.106) Ø ¦w¤ ¸ ¨Æ¾ ¦w¤TL¨ (6.107) Z proving (6.101). This shows that ¦w¤1¨ is a concave functional of ¥§¦w¡º¨ . F¤»z?*ÄËi mJ3¥T~}e=BGG»i»CEDa=Bjz>D>¤BÄClGFjË="<z¤ DFCY=BGB ãcb ãgf Ø f î ¦w¤ Let ¨ ¥Å¦w¡ ¨ ¥§¦w¡©¨¥§¦ ¡©¨ (6.108) b f î § ¥ w ¦ º ¡ ¨ We will show that for fixed , ¦w¤«! ¨ is a convex functional of ¥§¦ ¡º¨ . f f Let ¥ ¸ ¦ ¡º¨ and ¥@j¦ ¡©¨ be two transition matrices. Consider the system in Figure 6.12 in which the position of the switch is determined by a random variable ã as in the last example, where ã is independent of ¤ , i.e., î ¦w«¤ !¯ã ¨ Ø « î (6.109) X Z Figure 6.11. The information diagram for Example 6.12. D R A F T September 13, 2001, 6:27pm D R A F T 114 A FIRST COURSE IN INFORMATION THEORY Z=1 p1 ( y x ) X Y p2 ( y x ) Z=2 Figure 6.12. The schematic diagram for Example 6.13. î ¤ b , and ã in Figure 6.13, let ¦w¤«!¯ã ¨ Ø ä [» « î In the information diagram for , b ¦w¤~!¯ã ¨ Ø « , we see that î ¦w~¤ ! b !¯ã ¨ Ø Ã ä ã because î ¦w¤~!¯ã ¨ Ø î ¦w¤~!¯ã b ¨©¾ î w¦ ¤«! b Since î (6.110) (6.111) !¯ã ¨î (6.112) Then ¦w¤«Â ! b î ¨ b (6.113) ¦w¤«! ã ¨ Ø ¯=± ã Ø ã Ý î f ¦w¤~! 
b ã Ø ÝL¨Æ¾ ã ¯± f ã Ø ã Ê î ¦w¤~! b ã Ø Ê ¨ (6.114) Ø î ¦¥§¦w¡º¨ ¥ ¸ ¦ ¡º¨¨n¾ î ¦¥§¦w¡º¨ ¥4¦ ¡©¨¨ (6.115) ã f î where ¦¥§¦w¡º¨ ¥0`¦ ¡º¨¨ denotes the mutual information between the inputf and ¥§¦w¡º¨ and transition matrixf 0¥ ¦ ¡©¨ . output of a channel with input distribution b î This shows that for fixed ¥§¦w¡©¨ , ¦w~ ¤ ! ¨ is a convex functional of ¥§¦ ¡º¨ . î Y -a X a Z Figure 6.13. The information diagram for Example 6.13. D R A F T September 13, 2001, 6:27pm D R A F T 115 The -Measure ¢ Z =1 ¡ p 1 (x ) X ¡ p 2 (x ) ¢ ¡ p (y |x ) Y Z =2 Figure 6.14. The schematic diagram for Example 6.14. £¤¥§¦¨ª©S«¬®¯±°²´³¶µ¸·'¹º¥T»*¼½G¾Nµ¸¿À¦Ác½Ác¥¸©À¼Â·¿Sµ'æ!¥Ä½¼Qµ¸·¸Å Let ÆQÇÉÈËÊeÌ+ÍÎ*ÆQÏ'ÈÑЪÌ(ÒΧÆQÏÌÓΧÆQÐÔ ÏÌÖÕ Î§ÆQÐÔ ÏcÌ (6.116) ÆQÇØ×ËÊ@Ì Î§ÆQÏÌ We will show that for fixed , is a concave functional of . Consider the system in Figure 6.14, where the position of the switch is deterÇ mined by a random variable ÊÙ as in the last example. In this system, when Ç Ê is given, Ù is independent of , or ÙNÚ Ú forms aÊ Markov chain. Then Ç Û'Ü is nonnegative, and the information diagram for , , and Ù is shown in Figure 6.15. ÇDÞ ÊNß ÇàÞ Ê Û'Ü is nonnegFrom Figure 6.15, since Ý Ý Ù Ý is a subset of Ý Ý and ative, we immediately see that ÆQÇØ×ËÊeÌ á Ò ÆQÇØ×ËÊuÔ Ì Ù Ò æ åwç Ò â(ã5ä ÆQÇ×ËÊÉÔ ÒæåèÌéâêãëä Ù Ù ÆñÎòèÆQÏcÌÖÈ<ΧÆQÐÔ ÏcÌÑ̸éóð ð ÒíìSç Ù ÆñÎtôïÆQÏcÌÖÈ<Î*ÆQÐÔ ÏÌÑÌÖÕ ÆQÇ×ËÊÉÔ Î§ÆQÐÔ ÏcÌ , ÆQÇØ×ËÊ@Ì is a concave functional of £¤¥§¦¨ª©S«¬®¯Tõ.²Ñö´¦¨ª«ªÃ¿a«º¹º½"÷諺¹G뺹ª¾ø½ù«ºµ¸Ã«ª¦EÅ text, Ù This shows that for fixed Ê ÒîìïÌ be the cipher text, and Ù ú Z (6.117) (6.118) (6.119) ΧÆQÏÌ . Ç Let be the plain be the key in a secret key cryptosystem. Since X Y Figure 6.15. The information diagram for Example 6.14. D R A F T September 13, 2001, 6:27pm D R A F T 116 A FIRST COURSE IN INFORMATION THEORY Y û ÿ ý a þ d b ü 0 c X Z Figure 6.16. The information diagram for Example 6.15. Ç can be recovered from Ê and Ù , we have ÆQÇÔ ÊêÈ Ì(Ò Ù Õ (6.120) We will show that this constraint alone implies ÆQÇØ×ËÊeÌ ÆQÇ á Let ÆQÇØ×ËÊuÔ × ÔÇ ÌêÒ Ù Ù ÆQÇ×ËÊ (See Figure 6.16.) Since Ì Ù Ì Æ (6.122) á Ì+Ò (6.123) È á (6.124) ªÕ , (6.125) á *é ÆQÇ × Ù Æ<ÊV× (6.121) á Ô ÇÉÈËÊeÌ(Ò and ÌÖÕ Ì(Ò Ù Æ Æ Ù Æ<Ê Ì§ß Õ á (6.126) Ì ÆQÇØ× Ô ÊeÌ Ù Ù InÆQÇØ comparing with , we do not have to consider and ×ËÊ × Ì ÆQÇ Ì Æ Ì Ù since they belong to both and Ù . Then we see from Figure 6.16 that ÆQÇ Ì*ß Æ Ì+Ò ß ß wÕ Ù (6.127) Therefore, ÆQÇ×ËÊeÌ Ò ¶é zß á á Ò zß ß ÆQÇ Ì*ß Æ ÌÖÈ Ù (6.128) (6.129) (6.130) (6.131) where (6.129) and (6.130) follow from (6.126) and (6.124), respectively, proving (6.121). D R A F T September 13, 2001, 6:27pm D R A F T The -Measure X Figure 6.17. 117 Z Y The information diagram for the Markov chain . ÆQÇ×ËÊ@Ì The quantity is a measure of the security level of the cryptosystem. ÆQÇØ×ËÊeÌ small so that the evesdroper cannot In general, we want to make Ç by observing the cipher obtain too much information about the plain text Ê text . This result says that the system can attain a certain level of security Æ Ì Ù only if (often called the key length) is sufficiently large. In particular, if ÆQÇØ×ËÊeÌ+Ò Æ Ì perfect secrecy is required, i.e., , then must be at least equal Ù ÆQÇ Ì to . This special case is known as Shannon’s perfect secrecy theorem [174]7 . Æ<Ê Ô ÇÀÈ ÌêÒ , i.e., Note that in deriving our result, the assumptions that Ù ÆQÇ× Ì(Ò Ù the cipher text is a function of the plain text and the key, and , i.e., the plain text and the key are independent, are not necessary. 
£¤¥§¦¨ª©S«¬®¯T¬ Figure 6.17 shows the information diagram for the Markov Ç Ê chain Ú Ú Ù . From this diagram, we can identify the following two information identities: ÆQÇØ×ËÊeÌ Ò ÆQÇÔ ÊeÌ ÆQÇØ×ËÊ Ò È ÆQÇÔ Ê Ì È Ù (6.132) (6.133) ÌÖÕ Ù ÇqÞ Ý Since Û'Ü is nonnegative and ÆQÇØ× Ì Ù ÇqÞ Ý is a subset of Ù Ý Ê Ý , we have ÆQÇØ×ËÊeÌÖÈ (6.134) which has already been obtained in Lemma 2.41. Similarly, we can also obtain ÆQÇÔ ÊeÌ £¤¥§¦¨ª©S«¬®¯ ز ¶¥Ä½a¥ ÆQÇÔ ¨ªÃcµ¸¹G«º÷÷Y¼Q· ÌÖÕ (6.135) Ù ½ù«tµ'몦EÅ formation diagram for the Markov chain Ç Ê Ú Ú Figure 6.18 shows the inÙ)Ú . Since Û Ü is 7 Shannon used a combinatorial argument to prove this theorem. An information-theoretic proof can be found in Massey [133]. D R A F T September 13, 2001, 6:27pm D R A F T 118 A FIRST COURSE IN INFORMATION THEORY Z Y X T Figure 6.18. The information diagram for the Markov chain nonnegative and ÇqÞ Ý Ý is a subset of ÆQÇØ× & ÊîÞ Ý Ì Ù Ý . , we have Æ<ÊV× !"#$!% ÌÖÈ (6.136) Ù which is the data processing theorem (Theorem 2.42). APPENDIX 6.A: A VARIATION OF THE INCLUSIONEXCLUSION FORMULA ' (*) +-,.'0/ +-,.1/3241 (65 8 :9 <; + =?@ > + ACBED ' AGF 1IHJLD4MOK N.M > +-,.' N F 1/ F D4MOKN.PRQSM > +-,.' NUT ' Q F V1 / WYXCXCXZW , F\[ / >^] D +-,.' DET '`_ T XCXaX T ' > F 1V/3b , can be expressed as a linear combiIn this appendix, we show that for each via applications of (6.28) and (6.29). We first prove by using (6.28) the nation of following variation of the inclusive-exclusive formula. 7 ù«ºµ¸Ã«ª¦P¬ Ö¬® ®¯ For a set-additive function , ced [ c (6.A.1) Proof The theorem will be proved by induction on . First, (6.A.1) is obviously true for . Now consider Assume (6.A.1) is true for some = >f@ ] D + ACBED ' AGF 1IH =g="@ > J + ACBED ' A Hhi' >f] D F 1IH = @> =k= @ > W f > ] J + AaBjD ' A F 1 H +-,.' D F V1 / F + ACBED ' A H T ' ^> ] D F 1 H J l D4MOK N.M > +-,.' N F 1/ F D4MmKN:PnQSM > +-,.' N T ' Q F 1V/ WoXaXCX D R A F T September 13, 2001, 6:27pm c6J [ . (6.A.2) (6.A.3) D R A F T 119 APPENDIX 6.A: A Variation of the Inclusion-Exclusion Formula W , \F [ / >f] D +-,.' D T ' _ T XCXX T ' > F 1V/^p W +q,.' >f] D F 1V/ =?@ > F + AaBjD ,.' A T ' >f] D / F 1IH J l D4MOK N.M > +q,.' N F 1/ F D4MmKN:PnQSM > +-,.' N T ' Q F 1/ WoXaXCX W , F\[ / >f] D +-,.' D T ' _ T XCXCX T ' > F 1V/^p W +-,.' >f] D F 1V/ (6.A.4) F l 4D MmK N:M > +q,.' N T ' >f] D F 1V/ F D4MOKN.PRQSM > +-,.' N T ' Q T ' >^] D F 1/ W$XCXCXSW , F\[ / >^] D +-,.' D T ' _ T XCXaX T ' > T ' >f] D F 1V/^p J D4MOKN.M >f] D +-,.' N F 1V/ F D4MON.PRK QSM >^] D +q,.' NOT ' Q F 1/ W$XCXX W , F\[ / >f] _ +-,.' D T ' _ T XCXX T ' >^] D F 1/3b (6.A.5) (6.A.6) In the above, (6.28) was used in obtaining (6.A.3), and the induction hypothesis was used in obtaining (6.A.4) and (6.A.5). The theorem is proved. Now a nonempty atom of N r> @> N2 NsBjD has the form t N 6t Nu (6.A.7) v where is either or , and there exists at least one such that write the atom in (6.A.7) as N Jwt N @ N F~ ZQ Z b N:x ySz{B&|} z Qx yCB&|} . Then we can (6.A.8) '() +-,.'`/ +-,.1/3241 (65 Note that the intersection above is always nonempty. Then using (6.A.1) and (6.29), we see that for each , can be expressed as a linear combination of . PROBLEMS 1. Show that ÆQÇØ×ËÊ × .m Ù Ì+Ò and obtain a general formula for D R A F T ΧÆQÇÀÈËÊoÌÓÎ*Æ<ÊêÈ Î§ÆQÇ ÌÓΧÆQÇÀÈ Ù ÌÓÎ*Æ<ÊoÌÓÎ*Æ ÌÓÎ*ÆQÇÉÈËÊ RRR Y Ù ÆQÇ òY×ÑÇeô±Èë× ×ÑÇ Ì Ù È Ì Ù ÄÌ . September 13, 2001, 6:27pm D R A F T 120 A FIRST COURSE IN INFORMATION THEORY ÆQÇ×ËÊV× 2. 
Show that hold: Ç a) b) Ê , Ì vanishes if at least one of the following conditions Ù , and Ù are mutually independent; Ç Ê Ú Ç forms a Markov chain and Ú9Ù ÆQÇ×ËÊ 3. a) Verify that × Ì S S m m S nO and Ù are independent. S given by S m S R ΧÆQϸÈÑÐtÈ SÌ vanishes for the distribution S <m n S S m O Ù Î§Æ È È TÌêÒ Õ 1ì aÈDÎ§Æ È ÈYåèÌ Ò Õ Î§Æ ÈYå±ÈYåèÌêÒ Õ 1ì aÈDΧÆOå±È È TÌ Ò Õ 1ì aÈ aå aÈ Î§ÆOå±ÈYå±È TÌêÒ Õå aÈDΧÆOå±ÈYå±ÈYåèÌ Ò Õ aÕ Î§Æ ÈYå±È TÌÒ Õ Î§ÆOå±È ÈYåèÌÒ Õå1å 1ì b) Verify that the distribution in part b) does not satisfy the condition in part a). Ç Ê is weakly independent of 4. Weak independence Î*ÆQÏ*Ô ÐªÌ sition matrix are linearly dependent. Ç a) Show that if Ê of . Ê and i) ii) iii) Ç Ç are independent, then Ç b) Show that for random variables Ù satisfying if the rows of the tranis weakly independent Ê and , there exists a random variable Ê Ú Ç Ê Ú Ù and Ù are independent and Ù are not independent if and only if Ç is weakly independent of Ê . (Berger and Yeung [22].) 5. Prove that b) ÆQÇØ×ËÊ a) .¡ ¢£¤¡ × Ì ÆQÇØ×ËÊ × ß á Ù Ì Ù 6. a) Prove that if tä ºä Ç and Ê ÆQÇØ×ËÊÉÔ ÆQÇØ×ËÊ@ÌÖÈ ÌÖÈ Ù Æ<ÊV× Æ<ÊV× ÔÇ ÌÖÈ Ù ÌÖÈ Ù ÆQÇÉÈ Ô Ê@Ì´ç ÆQÇØ× Ù Ì´çïÕ Ù are independent, then ÆQÇÉÈËÊ × Ì Ù á ÆQÇØ×ËÊuÔ Ì Ù . b) Show that the inequality in part a) is not valid in general by giving a counterexample. ÆQÇ×ËÊ@Ì á ÆQÇ Ì!ß Æ Ì 7. In Example 6.15, it Ê was shown that , where Ù Ç is the plain text, is the cipher text, and Ù is the key in a secret key cryptosystem. Give an example of a secret key cryptosystem such that this inequality is tight. D R A F T September 13, 2001, 6:27pm D R A F T APPENDIX 6.A: A Variation of the Inclusion-Exclusion Formula ¥ ¥ § ¬ « ¦ § £ § ± § 121 ¦ 8. Secret sharing For a given finite set and a collection of subsets of , a secretäYÇ sharing scheme is a random variable and a family of random 5Î ç such that for all , variables ©¨Yª « ¥ and for all Æ ®°«¯ ¦ , Æ § ÔÇ +Ì+Ò ÔÇ ¥ È Ì+Ò Æ ÌÖÕ ¨ Here, is Î the secret and is the set of participants of the scheme. A Ç of the secret. The set participant of the scheme possesses a share specifies the access structure of the scheme: For a subset of , by pooling their shares, if , the participants in can reconstruct , otherwise they can know nothing about . ¦ ¬ « ¦ ¬ ¬ § ¥ § ¬ ³ ® ²´¥ , if ®µ«¯ ¦ and ¬¶$® « ¦ , then £ ± £ ± § § « ¦ , then ii) Prove that if ® £ ± £ ± § a) È i) Prove that for ÆQÇ %Ô Ç êÌêÒ ÆQÇ Æ Ô Ç Ìé êÌ+Ò ÆQÇ ÆQÇ %Ô Ç Ô Ç %È È ÌÖÕ ÌÖÕ · « ¦ , ¬ ® ¸· ¹ ² ¥ such that ¬¶ µ ® ¶ ·µ« ¦ ± º § (Capocelli et al. [37].) · «¯ ¦ È b) Prove that for , then È ÆQÇ ×ÑÇ ÔÇ (Ì Æ á , and ÌÖÕ (van Dijk [193].) ÇÀÈËÊêÈ ³ 9. Consider four random variables which satisfy the followÙ , and Æ ÔÇ Ì¶Ò Æ Ì Æ Ô ÇÀÈËÊeÌ¶Ò Æ Ô ÊoÌEÒ Æ Ì ingÆ<Êu constraints: , , , Ô Ì+Ò Æ Ô Ì+Ò , and . Prove that Ù Ù Æ a) c) ÆQÇØ× b) ÆQÇØ× Ô ÇÀÈËÊêÈ Ì(Ò Ù ÔÊ × Ù Æ<ÊÔ ÇÉÈ d) Ì(Ò Ù f) ÆQÇØ×ËÊ È Ì Æ<Ê × Ì Ò ê . . Ì Ù Ù Ì(Ò × ÆQÇØ× × Ù ÆQÇØ×ËÊ á á Æ Æ È Ù e) Ô ÇÉÈËÊeÌ(Ò × × Ì+Ò Ù Æ ÇØ×ËÊÉÔ Q Ù Ô Ì+Ò Ù Æ<ÊV× Õ È Õ . Ô ÇÉÈ Ì(Ò Ù Õ Ì(Ò Ì The inequality in f) finds application in a secret sharing problem studied by Blundo et al. [32]. D R A F T September 13, 2001, 6:27pm D R A F T 122 A FIRST COURSE IN INFORMATION THEORY ¼» Ç In the following, we use given Ù . pÊ Ô ½» Ç to denote that Ù ¾» ¾» Ç 10. a) Prove that under the Çconstraint thatÇ Ç íÊ Ô and imply chain, Ù Ù Ê and Ê Ú mÊ Ú Ù are independent forms a Markov . b) Prove that the implication in a) continues to be valid without the Markov chain constraint. 
³» does not imply ³» by giving a coun¿» implies ¿» b) Prove that conditioning on À . 12. Prove that for random variables , , , and , Á» \»  ÃÃÄ ¹» Ãà ǻ ¹» Á» ÅÆ È» 6» are equivalent to Hint: Observe that and À and use an information diagram. 13. Prove that ½»  ÃÄ Êà » ½» » Ë Ã Ã » Ì ¾» » ÅÉ ¾» (StudenÎ Í [189].) Ê Ô 11. a) Show that terexample. Ê Ô ÆQÇÉÈ Ù Ê Ô Ê Ô ÆQÇÀÈ Ù Ê Ì Ù Ì Ç Ù Ú Ú9ÙmÚ Ç Ê Ù Ç ÔÊ Ù ÆQÇÉÈËÊeÌ Ê Ô Ù Ô Ù Ê Ê Õ Ù ÔÇ Ù Ç Ç ÔÊ ÆQÇÉÈËÊeÌ Ô Ù Ê Ç Ù Ú Ú9ÙmÚ Ç mÊ Ç mÊÔ Æ ÔÇ Ù È Ù Ì Ù Ç ÔÊ Ô ÆQÇÀÈËÊeÌ Ù Ç mÊÔ mÊÔ Ù Õ Ù HISTORICAL NOTES The original work on the set-theoretic structure of Shannon’s information measures is due to Hu [96]. It was established in this paper that every information identity implies a set identity via a substitution of symbols. This allows the tools for proving information identities to be used in proving set identities. Since the paper was published in Russian, it was largely unknown to the West until it was described in Csisz r and K rner [52]. Throughout the years, the use of Venn diagrams to represent the structure of Shannon’s information measures for two or three random variables has been suggested by various authors, for example, Reza [161], Abramson [2], and Papoulis [150], but no formal justification was given until Yeung [213] introduced the -Measure. Most of the examples in Section 6.6 were previously unpublished. ÏÍ D R A F T Ð September 13, 2001, 6:27pm D R A F T APPENDIX 6.A: A Variation of the Inclusion-Exclusion Formula 123 McGill [140] proposed a multiple mutual information for any number of random variables which is equivalent to the mutual information between two or more random variables discussed here. Properties of this quantity have been investigated by Kawabata [107] and Yeung [213]. Along a related direction, Han [86] viewed the linear combination of entropies as a vector space and developed a lattice-theoretic description of Shannon’s information measures. D R A F T September 13, 2001, 6:27pm D R A F T Chapter 7 MARKOV STRUCTURES Ç ò Çeô £Ñ Ç YÒ Ç We have proved in Section 6.5 that if Ú Ú Ú Markov chain, the -Measure Û Ü always vanishes on the five atoms forms a Ó £Ñ Ò Ó Ó £Ñ YÒ Ó ÔÑ Ó YÒ (7.1) Ñ Ó YÒ Ó ÑÓ Ò Consequently, the -Measure is completely specified by the values of Ò on the other ten nonempty atoms of Õ , and the information diagram for four Ç ò§Þ Ý Ç Ç ò§Þ Ç Ý Ç ò§Þ Ý Ç Ý ò Þ ô Þ ô Þ Ç ô Þ Ç Ý ò§Þ Ý Ç Ç Ý Ý ÇeôêÞ Ý Ç ô Þ Ý Ç Ý êÞ Ç êÞ Ç Ý Ý Ý Þ Ç Ç Ý Ç Ý Þ Ý Ç Þ Ç Ý Ý Õ Ý Û'Ü Û'Ü random variables forming a Markov chain can be displayed in two dimensions as in Figure 6.10. Ç ò Çeô Ú Ú FigureÇ 7.1 is a graph which represents the Markov chain Ç Ú . The observant reader would notice that Û'Ü always vanishes on of if and only if the graph in Figure 7.1 becomes a nonempty atom disconnected upon removing all the vertices corresponding to the complemented set Ç variables in . For example, Û'Ü always vanishes on the atom Ç ò§Þ Ç ô Þ (Þ Ç , and the graph in Figure 7.1 becomes disconnected upon Ý Ý Ý Ý removing vertices 2 and 4.Þ On the other hand, Û'Ü does not necessarily vanish Ç Ç ò Þ Ç ô Þ Ç on the atom Ý , and the graph in Figure 7.1 remains connected Ý Ý Ý £Ñ YÒ Ó £Ñ ¬ Ó ÕÒ ÒÓ ¬ Ñ ÒÓ 1 Ö2 3 Figure 7.1. The graph representing the Markov chain D R A F T ×4 D _ !ÙØÚ September 13, 2001, 6:27pm . D R A F T 126 A FIRST COURSE IN INFORMATION THEORY upon removing vertices 1 and 4. This observation will be explained in a more general setting in the subsequent sections. 
The theory of -Measure establishes a one-to-one correspondence between Shannon’s information measures and set theory. Based on this theory, we develop in this chapter a set-theoretic characterization of a Markov structure called full conditional mutual independence. A Markov chain, and more generally a Markov random field, is a collection of full conditional mutual independencies. We will show that if a collection of random variables forms a Markov random field, then the structure of the -Measure can be simplified. In particular, when the random variables form a Markov chain, the -Measure exhibits a very simple structure so that the information diagram can be displayed in two dimensions regardless of the length of the Markov chain. The topics to be covered in this chapter are fundamental. Unfortunately, the proofs of the results are very heavy. At first reading, the reader should study Example 7.8 at the end of the chapter to develop some appreciation of the results. Then the reader may decide whether to skip the detailed discussions in this chapter. 7.1 CONDITIONAL MUTUAL INDEPENDENCE In this section, we explore the effect of conditional mutual independence on the structure of the -Measure Û'Ü . We begin with a simple example. £¤¥§¦¨ª©S« c®¯ Let Then Ê , , and ÆQÇ×ËÊoÌêÒ ÆQÇØ×ËÊuÔ Ì á Ù Ù be mutually independent random variables. ÆQÇ×ËÊV× , we let Since Ç Ìé Ì+Ò Ù so that Æ<Ê × Ì(Òæß and Ìé ÆQÇØ× ÆQÇ×ËÊV× Ù Ì(Ò × Ìé ºÕ (7.4) Æ<Ê × ÔÇ Ù Ì(Ò ÆQÇØ× Ô ÊoÌêÒ Õ Ô ÊeÌ(Ò Ù ÌêÒ Ù Then from (7.4), we obtain ÔÇ Ù ÆQÇ× Ù (7.2) (7.3) Æ<ÊV× Ù ÆQÇØ×ËÊ Ù È á Ù Ì+Ò Õ g ÆQÇ×ËÊV× Similarly, Ì(Ò Ù ÆQÇØ×ËÊÉÔ ÆQÇØ×ËÊuÔ Ù tÕ (7.5) (7.6) (7.7) The relations (7.3), (7.4), and (7.7) are shown in the information diagram in Ç Ê Figure 7.2, which indicates that , , and Ù are pairwise independent. D R A F T September 13, 2001, 6:27pm D R A F T 127 Markov Structures Y Ûa Ûa -a Ûa ÜX , Figure 7.2. , and are pairwise independent. Ç We have proved in Theorem 2.39 that if and only if ÆQÇÉÈËÊ È ÌêÒ ÝZ ÆQÇ , Ê , and Ù are mutually independent Ìé Æ<ÊoÌé Æ ÌÖÕ Ù (7.8) Ù By counting atoms in the information diagram, we see that Ò ÆQÇ Ò Ò Thus Ì'é Æ<ÊoÌé ÆQÇØ×ËÊuÔ Ìé Ù tÕ Æ Æ<Ê Ì¸ß Ù × ÔÇ ÆQÇÉÈËÊ Ì'é È Ù ÆQÇ× Ì Ù Ôe Ê Ìéì Ù ÆQÇØ×ËÊ × Ì Ù , which implies (7.9) (7.10) (7.11) eÒ ÆQÇ×ËÊÉÔ ÌÖÈ Ù Æ<Ê × are all equal to 0. Equivalently, ÇqÞ Ý Ê ß Ý Ù Ý È ÊíÞ Ý ÔÇ Ù ÆQÇ× Û'Ü ß Ù Ý ÌÖÈ Ô Ê@ÌÖÈ Ù ÆQÇØ×ËÊ × Ì (7.12) Ù vanishes on Ç,È q Ç Þ Ý Ý ß Ù Ý Ê@È q Ç Þ Ý Ý ÊíÞ Ý È Ù Ý (7.13) which are precisely the atoms in the intersection of any two of the set variables Ç Ê Ý , Ý , and Ù Ý . Conversely, ifÇ Û Ü Ê vanishes on the sets in (7.13), then we see from (7.10) that Ç Ê (7.8) holds, i.e., , , and Ù are mutually independent. Therefore, , , and are mutually independent if and only if Û'Ü vanishes on the sets in (7.13). Ù This is shown in the information diagram in Figure 7.3. The theme of this example will be extended to conditional mutual independence among collections of random variables in Theorem 7.9, which is the main result in this section. In the rest of the section, we will develop the necessary tools for proving this theorem. At first reading, the reader should try to understand the results by studying the examples without getting into the details of the proofs. D R A F T September 13, 2001, 6:27pm D R A F T 128 A FIRST COURSE IN INFORMATION THEORY Y Þ0 Þ Þ0 0 Þ0 X Z Figure 7.3. , , and are mutually independent. 
In Theorem 2.39, we have proved that pendent if and only if RRR Y ÆQÇ òYÈÑÇeôïÈ YÈÑÇ By conditioning on a random variable 7 á .â ù«ºµ¸Ã«ª¦ Ç ÄÌ+Ò Ê RRR Y òYÈÑÇeô±È Kß.à YÈÑÇ ÆQÇ ò ß are mutually inde- ÌÖÕ (7.14) , one can readily prove the following. RRR Y are mutually independent conditioning on ß RRR Kß.à (7.15) òYÈÑÇeô±È c® Ç if and only if ÆQÇ ò ÈÑÇ èÈÑÇ ô È Ê èÈÑÇ Ô ÊÌ+Ò ÆQÇ Ô ÊeÌÖÕ ò We now prove two alternative characterizations of conditional mutual independence. RRR Y are mutually independent conditioning on ä´åV´æ , if and only if for all ß èç é ¯ å (7.16) ß ç é ¯ å are independent conditioning on . and i.e., 7 á .ã ù«ºµ¸Ã«ª¦ Ç c® òYÈÑÇeô±È èÈÑÇ Ê å ÆQÇ ×ÑÇ 1È Ò 7Ô ÊeÌ(Ò È Ç ÆQÇ È Ò Ì Ê Remark A conditional independency is a special case of a conditional mutual independency. However, this theorem says that a conditional mutual independency is equivalent to a set of conditional independencies. RRR Y å ß Proof of Theorem 7.3 It suffices to proveYthat (7.15) and (7.16) are equivalent. Ç òYÈÑÇeô±È ÈÑÇ Assume (7.15) is true, so that are mutually independent conÊ Ç ÆQÇ È Ò Ì ditioning on . Then for all , is independent of conditioning Ê on . This proves (7.16). D R A F T ç é ¯ å September 13, 2001, 6:27pm D R A F T 129 Markov Structures ä¢åI´æ . Consider Now assume that (7.16) is true for all ß ç é ¯ å ß RRR ß{ê ß ß:ë RRR Y RRR ß{ê å Ò Ò ÆQÇ ×ÑÇ ÆQÇ ×ÑÇ é È ÆQÇ Ò 7Ô ÊoÌ òYÈÑÇeô±È ÈÑÇ òYÈ ÈÑÇ ×ÑÇ (7.17) òwÔ ÊÌ cÔ Ê ÈÑÇ òYÈÑÇeô1È YÈÑÇ ò7ÌÖÕ (7.18) ß RRR ß{ê Since mutual information is always nonnegative, this implies ÆQÇ ×ÑÇ òYÈ ÈÑÇ òwÔ ÊÌ(Ò È (7.19) ß ì ß ê R R R and are independent conditioning on . Therefore, or RRR Y are mutually independent conditioning on (see the proof of Theorem 2.39), proving (7.15). Hence, the theorem is proved. RRR Y are mutually independent conditioning on 7 á if and only if ß èç é ¯ å RRR Y Kß.à (7.20) Ç Ç ÆQÇ òYÈÑÇeô±È òYÈÑÇeô1È èÈÑÇ ò7Ì Ê YÈÑÇ ù«ºµ¸Ã«ª¦ Ê Ç c®° ÆQÇ òYÈÑÇeô±È èÈÑÇ òYÈÑÇeô±È YÈÑÇ Ê Ô ÊoÌ(Ò ÆQÇ Ô ÊêÈÑÇ ïÈ Ò ÌÖÕ ò RRR Y ß Proof It suffices to prove èthat (7.15) and (7.20) are equivalent. Assume (7.15) Ç òYÈÑÇeô±È ÈÑÇ Ê are mutually independent conditioning on . is true, so that Ç Ç È Ò Ê Since for all , is independent of conditioning on , å ß ÆQÇ ç é ¯ å ß èç é ¯ å Ô ÊoÌ(Ò ÆQÇ Ô ÊêÈÑÇ ïÈ Ò Ì (7.21) Therefore, (7.15) implies (7.20). Now assume that (7.20) is true. Consider ÆQÇ Ò RRR ß RRR ìß ê Kß.à (7.22) ß ß.ë RRR Y RRR ß{ê Kß.à ß èç é ¯ å (7.23) ß ß . ß ë èç é ¯ å Kß.à RRR Y RRR ß{ê Kß.à ò ÈÑÇ ô È èÈÑÇ Ô ÊÌ ÆQÇ Ô ÊêÈÑÇ òYÈ ÆQÇ Ô ÊêÈÑÇ ïÈ èÈÑÇ òëÌ ò Ò Ò Ìé Ò ÆQÇ ×ÑÇ òYÈ ÈÑÇ cÔ Ê ÈÑÇ òèÈ cÔ Ê ÈÑÇ YÈÑÇ ò7Ì ò ÆQÇ Ô ÊêÈÑÇ ïÈ Ò Ìé ÆQÇ ×ÑÇ òYÈ ÈÑÇ òèÈ YÈÑÇ ò7ÌÖÕ ò ò (7.24) Then (7.20) implies Kß.à D R A F T RRR {ß ê (7.25) September 13, 2001, 6:27pm D R A F T ÆQÇ ò ß :ß ë RRR Y ×ÑÇ òYÈ ÈÑÇ Ô Ê ÈÑÇ òYÈ YÈÑÇ ò7Ì(Ò Õ 130 A FIRST COURSE IN INFORMATION THEORY å Since all the terms in the above summation are nonnegative, they must all be §Òæå equal to 0. In particular, for , we have ÆQÇ RRR Y ß èç é ¯ å òY×ÑÇeô±È èÈÑÇ Ô ÊÌ(Ò By symmetry, it can be shown that ÆQÇ ×ÑÇ ïÈ Ò Õ (7.26) 7Ô ÊeÌ+Ò ä´åI¢æ . Then this implies (7.15) by the last theorem, completing the for all proof. 7 á Let · and í ß be disjoint index sets and î ß be a subset of í ß ï#å6¹ð , where ð . Assume å such that forß Áñ that there exist atº least two ß ò £ ó S õ ô « ö £ å ° ð £ ó S ôõ«#· be î ¯ . Let N í ò *åkand ð N ß Áñ are mutually independent collections of random variables. If ÷ º N conditioning on º , then ò N ê ÷ N , 6such ´åøthat ð î. 
¯ are mutually independent conditioning on å ù«ºµ¸Ã«ª¦ (7.27) c®õ å ì á Ò Ç Ò\ÆQÇ È ÌÖÈYå Ç Ç Ç Ç ÆQÇ È Ì Ò ÈÑÇ å Ò]ÆQÇ ÈYå ªÌ £Ñ YÒ £ù £ú are mutually inde£ù £ú are , , and £Ñ YÒ û . ò N "öåïµð are mutually independent We first give an example before we prove the theorem. £¤¥§¦¨ª©S« Ç c®¬ û òYÈèÆQÇeô±ÈÑÇ ±ÈÑÇ ±ÌÖÈ Suppose Ç and pendent conditioning on . By Theorem 7.5, ÆQÇ ±ÈÑÇ 1ÈÑÇ mutually independent conditioning on º Proof of Theorem 7.5 Assume Ç conditioning on , i.e., Ç ÈYå ±ÈÑÇ Ç ò Ì Çeô ÆQÇ ±ÈÑÇ èÌ YÌ ÈYå ò N ©´åø¢ð º ÆQÇ ÆQÇ Ô Ç Ì(Ò ß.K à ü ò N º ÆQÇ ò ÔÇ ÌÖÕ (7.28) ÷ N ä¢åøð º ò N ê ÷ N ä´åVð ò N ©´åIð º ò N ê ÷ N ä´åVð º Kß.à ü ò N º ò N ê ÷ N º ò Q ê ÷ Q ©é£¢å Kß.à ü Kß.à ü ò N º ò Q ê ÷ Q äé ´å Consider ÆQÇ Ò ÈYå Ô Ç ÆQÇ Ò ÈÑÇ ÈYå ÈYå Ô Ç ÆQÇ ÔÇ Ì§ß ªÌ ÆQÇ ÈYå Ô Ç Ì (7.29) Ì ò ß ÆQÇ ÔÇ ÈÑÇ ÈYå ß åèÌ ß åèÌ ò á ÆQÇ ÔÇ ò ýKß.à ü ß D R A F T ÈYå 'ß åèÌ ò N ê ÷ N º ò Q ê ÷ Q ©é£¢å ÆQÇ ò ÈÑÇ ÔÇ ÈÑÇ ÈYå (7.30) September 13, 2001, 6:27pm (7.31) D R A F T 131 Markov Structures Kß.à ü Ò Kß.à ü á ÷ N º ò Q ê ÷ ÷ N º ò Q ê ÷ ÆQÇ ÔÇ êÈÑÇ ÆQÇ ÔÇ ÈÑÇ ò ò Q äé ´å Q äé ð ÈYå Ì ÈYå ªÌÖÕ (7.32) (7.33) In the second step we have used (7.28), and the two inequalities follow because conditioning does not increase entropy. On the other hand, by the chain rule for entropy, we have ÆQÇ Ò ÷ N ©´åIð º ò N ê ÷ Kß.à ü ÷ N º ò Q ê ÷ ÈYå Ô Ç ÈÑÇ ÔÇ ÈèÆQÇ ÆQÇ ò N ä¢åIð Q ©þéYð ÷gÿ ä¢ô0´å ÈYå ªÌ ÈYå ªÌÖÈèÆQÇ OÈYå 'ßåèÌÑÌÖÕ (7.34) Therefore, it follows from (7.33) that Kß.à ü ÷ N ÷ Kß.à ü ÆQÇ ò º ò Q ê ÷ Q äé ð N ©åø ð º ò N ê ÷ N è´åIð ÷ N º ò Q ê ÷ Q ©þéYð ÷gÿ ä¢ô0´å ÔÇ ÈÑÇ ÆQÇ Ò ÈYå ÈYå Ô Ç ÆQÇ ò ªÌ ÔÇ ÈÑÇ ÈYå ÈèÆQÇ ªÌ ÈYå ªÌÖÈèÆQÇ OÈYå (7.35) (7.36) 'ßåèÌÑÌÖÕ (7.37) å However, since conditioning does not increase entropy, the th term in the sumå mation in (7.35) is lower bounded by the th term in the summation in (7.37). Thus we conclude that the inequality in (7.36) is an equality. Hence, the conditional entropy in (7.36) is equal to the summation in (7.35), i.e., ÷ N ä´åVð º ò N ê ÷ N ä´åVð Kß.à ü ÷ N º ò Q ê ÷ Q äé ð ÆQÇ Ò ÈYå Ô Ç ÆQÇ ÔÇ ÈÑÇ ÈYå ÈÑÇ ÈYå ªÌ ªÌÖÕ ò (7.38) (7.39) The theorem is proved. Theorem 7.5 specifies a set of conditional mutual independencies (CMI’s) which is implied by a CMI. This theorem is crucial for understanding the effect of a CMI on the structure of the -Measure Û'Ü , which we discuss next. «ª¦¦!¥ ¤ c® ables, where Let ì D R A F T á ß RRR ß N ïå ò È Æ Ù È , and let Ê Ù ÌÖÈYå ß RRR ß N be collections ofÆ random vari Ì òYÈ èÈ be a random variable, such that Ù Ù , September 13, 2001, 6:27pm D R A F T 132 A FIRST COURSE IN INFORMATION THEORY ä´åI å Ê are mutually independent conditioning on ~ ß.@ à Ü Û @à N ßç ç %ß Ê Ò ò Ù Ý ò . Then Ý Õ (7.40) We first prove the following set identity which will be used in proving this lemma. .; «ª¦¦!¥ § c® ¬ ß and ® Let and be disjoint index sets, and be a set-additive function. Then be sets. Let Û ~ = ß @ ¬ ß H ~ ç @ ¬ ç ® ë K K ¬ ® ¬ ® ¬ ß ¬ ß. denotes ¶ where ¬ Proof The right hand ë side of (7.41) is equal to ë K K ¬ ® K K ¬ ë K K ¬ ® Now ë K K ¬ ® K ¬ ® K Þ Û Ò Æ ß åèÌ ëÆ ß Ìé åèÌ ß Ìé Ì§ß Æ ß Æ ß Æ ß ß Æ Û Æ Û ß Æ Û ® ß åèÌ Æ Û ß ß Ì(Ò Æ ß åèÌ Æ Û Æ Û ß ÌÖÕ Æ Û åèÌ åèÌ ß Ì ÌÑÌGÈ (7.41) Æ Û Æ ß ß ® Ì (7.42) Æ ß åèÌ Õ (7.43) Since K Æ ß åèÌ ïÒ ü Kà Ô Ô by the binomial formula1 , we conclude that K 1 This K Æ ß åèÌ à " ë ò and ! 
!$#&% '(% à àïê Æ Û ¬ ß åèÌ ü Ò ® Ì+Ò Õ (7.44) (7.45) ò in the binomial formula KACB % '(% *),+ D R A F T can be obtained by letting Æ ß ë ð ü A !/% '0% 1 A 32 .- September 13, 2001, 6:27pm D R A F T Markov Structures K Similarly, K Æ ß åèÌ 133 ë ¬ Æ Û ® ß Ì+Ò Õ Therefore, (7.41) is equivalent to ~ = ß @ ¬ ß H Û ~ ç @ ¬ ç ® Þ ß K Ò K ë ë Æ ß åèÌ (7.46) ò ¬ Æ Û ß ® Ì (7.47) which can readily be obtained from Theorem 6.A.1. Hence, the lemma is proved. Proof of Lemma 7.7 We first prove the lemma for ~ ß.@ à Ü Û K D 54 ò76989898 6 K Æ ß _ ß åèÌ Ê ò Ù Ý ò <54 ò76989898 6 ;: @à N ßç ç ô Ý ë ,=> . By Lemma 7.8, Ò ~ ç ¶Òíì Ü Û : ìç ò ß Ù Ý Ê Ý ìç ¶ ~ (7.48) ü ü ü The expression in the square bracket is equal to ìç é « § Zðo« ìç éï« § Zð$ü « (7.49) ü ì ç é « Z ? ð « which is equal to 0 because § and ü are independent conditioning on . Therefore the lemma is proved for . For , we write ê N N ß ~ ß.@ à ç @ à ç ~ ~ ß.@ à ç @ à ß ç ~ ç @ à ç (7.50) ß RRR ß N ÔáåY RRR Since and are independent conditioning on , upon applying the lemma for , we see that ~ ß.@ à ç @ à N ß ç (7.51) é Û ~ ü Ü ß ô Ê Ù Ý Ý ß ò Æ ~ ~ ç Ü Û È ò ïÈ ÌÖÈèÆ Æ ÌëÔ ÊoÌÖÈ DB Ì Æ ô È Ì Ù B ò ß Ê ò Ù Ý ò ÆÑÆ Õ Òíì B ì FE Ü @?A Ý Ô ÊÌ È ô ò 1È Ù Ê DB Ù CB Ê È ô Ù Ù Û Æ CB ÆÑÆ ß ô Ù Ý Ô ÊÌ'é Ù ß ò Ù Ý òYÈ Ù Ò Û Ý YÈ Ê ÌÖÈYå G Þ Ü ò HßåèÌ Æ Ù Ù ß ò Ù Ý ò Ù Ý òYÈ YÈ Òíì Ù Ê Õ Ý G Ì Û Ü ß ò ò Ù Ý Ê Ò Õ Ý The lemma is proved. D R A F T September 13, 2001, 6:27pm D R A F T 7 á Let and ß í ß è¢å ð be disjoint index sets, where ð , ò N £ó Sô©« í Ô³*å ð and £ó Sôè« be collecand let ò á £ å ö ð N independent tions of random variables. Then ß ² í ß, Rß RareR mutually conditioning on if and only if for any , where î î î î ä´åIð , if there exist at least two å such that î !¯ ñ , thenü ~ ~ ß:@ à ü ç @ ÷ èç A BED ò N ê ÷ N (7.52) N N 134 A FIRST COURSE IN INFORMATION THEORY ù«ºµ¸Ã«ª¦ ÈYå c®I Ç Ò ÆQÇ È á ÌÖÈYå Ç Ç Ò ÆQÇ È ì Ì ÈYå Ç òYÈ ôwÈ èÈ Ò å Ç Ü Û ß ò 5 Ç Ý " Ý Ò " Õ ## We first give an example before proving this fundamental result. The reader should compare this example with Example 7.6. £¤¥§¦¨ª©S« c®¯J Suppose Ç dependent conditioning on ò Ü Æ Ç Û Þ ô Ç Ý û Ç Þ £Ñ YÒ òYÈèÆQÇeô1ÈÑÇ ÌÖÈ and . By Theorem 7.9, ù Ç Ý ±ÈÑÇ Þ Ý ú Ç Ý ß £ù £ú ÆQÇ ïÈÑÇ èÌ are mutually in- Ñ ¶ Ò ¶ û Æ Ç Ç ZÝ Ý Ç "Ý ÌÑÌ(Ò Õ (7.53) However, the theorem does not say, for instance, that Ü Æ Çeô Ý Û YÒ Þ Ç ß Æ Ç ò Ý Ý ¶ £Ñ ¶ £ù ¶ £ú ¶ û Ç "Ý Ç ZÝ Ç "Ý Ç ëÌÑÌ "Ý (7.54) is equal to 0. å î î RRR ß î ü ñ î ß ²Áí ß "°å µð î ¯ ò N ä´åVð (7.55) NsA BjD ò N (7.56) ±K ® where § consists of sets of the form ~ ß.@ à ü ç @ ÷ ç A BjD ò N ê ÷ N (7.57) s N N ß ²í ß for *å&!ð and there exists at least one å such that î ß ¯ ñ . with î ® « § is such that there exist at least two å for which Byß our ñ assumption, if ö î ¯ , then ® . Therefore,å if ® is ß possibly ¯ ñ . Nownonzero, 6¢then ô0ß®ð , must be such tható there exists a unique for which î forß define the set § consisting of sets of the form in (7.57) with î ²¾í for Proof of Theorem 7.9 We first prove the ‘if’ part. Assume that for any òëÈ ô±È èÈ å , where , , if there exist at least two Ò such that , then (7.52) holds. 
Then ÆQÇ ÈYå Ô Ç Ì Ò Ç ß Ý Ç Ý " Ç Ý Ü Æ Û ß Ý + Ò ò Ç Ü Û - Ì " ## å Ò Û'Ü Æ Ò ÌzÒ Û'Ü Æ Ì Ò D R A F T September 13, 2001, 6:27pm å D R A F T 6´åøð , î ó ¯ ñ , and î ß ñ of the form ~ ç @ ÷gÿ èç 135 Markov Structures å Ò Ò Ç Then ±K ò N Ç ß Ý Ò ß 5 Ç ® Æ Ý Õ " # Ü Æ Û ò ÌÖÕ (7.59) Ì K Ç MN L Ì <K Kó à ü ± K ÿ ® Ì(Ò Q BN ò Q ÿ <ÿ ~ ç @ ÷kÿ èç ÿB ± ÿ® 5 Ç Æ Ý Ü Æ Û => ß 5 Ç Ý N B ÿ:ò N ò ÿ ê ÷kÿ Æ Ý Ì <K " ?A # (7.60) 0 K O L Ò Ò Ý Now å ¯ ô . In other words, § ó consists of atoms (7.58) NB ÿò N ò ÿê ÷ ÿ for Õ (7.61) Since Û'Ü is set-additive, we have Û ò N Ç Ü + ß 5 Ç Ý Ý Æ Q BN ò Q K ò N ä¢åIð Kß.à ü ± K N ® Kß.à ü ò N òN Kß.à ü ±K ÿ ® Ò Ì Û - Ü Æ ÌÖÕ (7.62) Hence, from (7.56) and (7.59), we have ÆQÇ Ò ÈYå Û ò Ò Ü Æ Û ò Ò Ô Ç Ç Ü + Ì ß ò ÔÇ (7.63) 5 Ç Ý ÆQÇ Ì Ý ÈÑÇ Æ Q BN ò Q K ò Q é ¯ å È Ò Ì (7.64) - ÌÖÈ (7.65) ò N Lå?¿ð ò N ?µåïÁð RRR å î î î ß ¯ î ñü Ç ÈYå where (7.64) follows from (7.62). ByÇ Theorem 7.4, are mutually independent conditioning on . Ç ÈYå We now prove the ‘only if’ part. Assume areô±È mutually Ç òëÈ YÈ . For any collection of sets , independent conditioning on å Ò where ,Ç , if there exist at least two such that , ÈYå by Theorem 7.5, are mutually independent conditioning on ÆQÇ ÈÑÇ ÈYå ªÌ . By Lemma 7.7, we obtain (7.52). The theorem is proved. î ßö ² í ß ÷ ³å*oá ð åe³ð ò N ê ÷ N *åkN ð D R A F T September 13, 2001, 6:27pm D R A F T 136 A FIRST COURSE IN INFORMATION THEORY RRR Y is A conditional mutual independency on R R R full if all are involved. Such a conditional mutual independency is called a full conditional mutual independency (FCMI). <â For æ , YÒ , and £ù are mutually independent conditioning on £Ñ , 7.2 ¢ FULL CONDITIONAL MUTUAL INDEPENDENCE e«ª¿a¼Â·¼Ó½¼Qµ¸· £¤¥§¦¨ª©S« Ç Ç c®¯ª¯ ò ÈÑÇ Ç ô È Çeô±ÈÑÇ Ç is an FCMI. However, Ç ò , ô Ç èÈÑÇ ÀÒ c®¯ ò òYÈÑÇeô±È ÈÑÇ , and Ç Ç ù YÒ are mutually independent conditioning on Ç is not an FCMI because Ç Ñ is not involved. RRR æ As in the previous chapters, we let ÒKäTå±È´ìaÈ P wÈ *çïÕ (7.66) = ü ß (7.67) ¶ ß. à í H ß © å ð defines the following FCMI on RRR then the tuple í Y : ª ò D ò _ RRR ò A are mutually independent conditioning on . ß ©åVð . We will denote by í RRR ¢ <ã Let í ß è¢åV ð be an FCMI on . The image of , denoted by of Õ which ß , ²is the ß , setä¢ofåIallð atoms î í has the form of the set in (7.57), where , and there exist at ß !¯ ñ . å least two such that î ´ Let í í be an FCI (full conditional indeRRR Y . Then pendency) on ¬ « ¦ ª ¬² ò D ò _ (7.68) RRR ´ Let í ß ä¢åIð be an FCMI on Y . Then ÂÄ Ë ÊÌ « ª ¬ ¦ ¬² ß ç ò N ò Q (7.69) Å ü In Theorem 7.9, if Ò È P ò Æ È ÈYå ªÌ Ç òëÈÑÇeô±È èÈ Ç Q ïÇ ÈÑÇ È èÈÑÇ Æ Q e«ª¿a¼Â·¼Ó½¼Qµ¸· Ç c®¯ È ÈYå ªÌ ÒDÆ Q Ç È Q ÈYå RTS Ç ªÌ Æ Q òëÈÑÇeô±È èÈ Ì å Ò Ãcµ'¨tµ¸÷è¼½¼<µ'· U c®¯±° Ç RTS U Ãcµ'¨tµ¸÷è¼½¼<µ'· Æ Q c®¯Tõ ÒqÆ Q òYÈÑÇeô±È È ò È ô Ì YÈÇ Ì(ÒKä Æ Ç Þ Ç ß Q Ý Ò Æ È Ç Ý ÈYå Ì´çïÕ Ý ªÌ Ç òYÈÑÇeô±È èÈ Ç RTS Æ Q Ì(Ò D R A F T Æ Ç ò7V W 3V Þ Ý Ç ß Ý Ç Ì Õ Ý September 13, 2001, 6:27pm D R A F T 137 Markov Structures Æ Q Ì . Their These two propositions greatly simplify the description of RTS proofs are elementary and they are left as exercises. We first illustrate these two propositions in the following example. 4ñ £¤¥§¦¨ª©S« Q ô!Ò æ Z c®¯T¬ Æ aÈ5äTåwçïÈ5äìaÈ ¬ « ¦ ª ¬² Ñ ¶ ¦ ª ¬² « on . Let forbeallan¬ FCMI Æ Q RTS ¬ « 7 á and only if ¬ and RTS Æ Q ôèÌ+Ò ä ù«ºµ¸Ã«ª¦ Æ ò7Ì(ÒKä Æ Ç Æ Ç ò5Þ 4 ô6 Ç Ý c®¯ Û'Ü ÒHX Ý ò§Þ Ý Ç Q Æ Q ÌêÒ RTS Proof First, (7.67) is true if written as Q Ý Æ Ç Ò £Ñ 4 ô6 Ç Ý Ì : òzÒZÆ ä SçïÈ5äTåwçïÈ5äìaÈ7XçwÌ Q Consider and FCMI’s SçïÈ5äYXç±Ì . 
Then ß : Ç YÌ´ç (7.70) Ý Ñ YÒ ¶ YÒ RRR Y . Then 4 ô6 Þ : Ç òYÈÑÇeô±È Ì èÌ Æ Ç ò5Þ Ý and Ç Ý èÌ´çïÕ (7.71) Ý èÈÑÇ Q holds if is an FCMI. Then the set in (7.57) can be ~ ç @ A ÷ èç > ê NA BED ÷ N (7.72) N BED N which is seen to be an atom of Õ . The theorem can then be proved by a direct application of Theorem 7.9 to the FCMI . ß.à ß Let ¬ be a nonempty atom of Õ . Define the set nåø« ª ß ß Ó (7.73) Note that ¬ is uniquely specified by because ¬ ~ ß @ > ê ß ~ ß @ ß Ó ~ ß @ > ê ß (7.74) !æ as the weight of the atom ¬ , the number of ßß in ¬ Define ¬ äí åø which not complemented. We now show that an FCMIß ß åø ð are is uniquely specified by . First, by letting î í for ð in Definition 7.13, we see that the atom ~ ç @ A ò èç Z (7.75) NsBjD N is in , and it is the unique atom in with From ß the°largest å ð weight. this atom, can be determined. To determine , we define í Ó as follows. For ô Sô « Ó , ô Sô is in if and only ifa relation on Ç ß Ç[Z Ý Ý È Q ÒîÞ ò Ê Ý ÒKä \ Ê P Ò Ç Ý çïÕ Ý \ Ò Ç Z Æ Ì 5]*^ Ò _ Þ Ç Ò ]*^ Ý ußíÔ \ Ç Z Ý ß ] ^ Ç Ý Ç ÒøÆ Q RTS Æ Q Ì Ò Ç Æ Q ß ` D R A F T ba ÈYå å Ý Æ Q RTS P Ý È Ç Ý Ì Ò Õ Ý Ô ªÌ RTS 5]^ Ì ÈYå È Æ B È Ì B September 13, 2001, 6:27pm ` D R A F T 138 A FIRST COURSE IN INFORMATION THEORY ô ô ; or 'Ò i) B @ ç D4Q MnB Qÿ M ÿ > ii) there exists an atom of the form £ó £ó Ç SÞ Ç Ý Þ Ê Ý TK (7.76) Ý c Eç èç or ç Ó . ô Sô Õ .Ôå The Recall that ¦ is the setô Sofô nonempty atoms of idea of ii) is that ß « ¢ w ð is in if and only if . Then is reflexive í for some and symmetric by construction, and is transitive by virtue of the structure of Ó . In other words, is an equivalence relation which partitions into uniquely specify each other. í ß ©´åI ð . Therefore, and completely characterizes effect of on the The image of an FCMI RRR Y . The joint effect of morethethan -Measure for one FCMI can ¦ in ß Æ Q RTS Ì , where Ê !Ò Ý Ç Ç Ý Ý Æ È ` Æ Q RTS ä È Ì B å ` B Ì ÈYå ` tç Q Ç òYÈÑÇeô±È Æ Q RTS Q Ì Q YÈÑÇ ó ©´ô` (7.77) ó holds if and only if vanishes on be a set of FCMI’s.ó By Theorem 7.9, ó Y ! ô the atoms in Im . Then à hold simultaneously if and only if ó . This ó vanishes on the atoms in ¶ ü is summarized as follows. ó ô is ¢ <; The image of a set of FCMI’s easily be described in terms of the images of the individual FCMI’s. Let ÒKä Q d ÈYå ç fe Q Æ Q QÌ ÈYå Q ò Û'Ü e«ª¿a¼Â·¼Ó½¼Qµ¸· Æ Q RTS á c®¯*I if and only if Û'Ü Æ Òøä Q d Æ d RTS ù«ºµ¸Ã«ª¦ QÌ c®¯ defined as 7 Û'Ü ge ¬ óà ü Ò QÌÖÕ be a set of FCMI’s for Æ d Ì for all . RTS Ç ç he ó Æ Q ò RTS ¬ « d Let Ì(Ò Ì ÈYå (7.78) RRR Y . Then òYÈÑÇeô±È ÈÑÇ d holds In probability problems, we are often given a set of conditional independencies and we need to see whether another given conditional independency is logically implied. This is called the implication problem which will be discussed in detail in Section 12.5. The next theorem gives a solution to this problem if only FCMI’s are involved. 7 á .â ù«ºµ¸Ã«ª¦ c® J if and only if RTS ² Æ d Let ôèÌ ² d ò iRTS and Æ ô d d ò7Ì Æ d be two sets of FCMI’s. Then . ôèÌ ² Æ d ò7Ì d ò d ò implies d ô ¬ Ç ¬ d ô Proof We first Æ prove that ifò RTS iRTS , then implies . Assume Æ d ô Ì Æ Ì!Ò d ò Ì d Û'Ü and holds. Then by Theorem 7.19, for all RTS jRTS Æ d ò7Ì Æ d ôèÌ Æ d ò7Ì Æ Ì Ò RTS . Since RTS kRTS , this implies that Û'Ü for all ¬ « D R A F T ² September 13, 2001, 6:27pm D R A F T 139 Markov Structures ¬ « Æ d ² ôèÌ ô d . Again by Theorem 7.19, this implies also holds. Therefore, Æ d ò7Ì d ò d ô if RTS , then implies . 
iRTS Æ d ôèÌ Æ d ò7Ì d ò d ô RTS iRTS We now proved that if implies ,Æ then . To prove this, ò Æ d ò7Ì d ô d ôèÌ implies but RTS , and we will showÆ d that we assume that gRTS Æ d ôèÌ ß ò7Ì RTS RTS this leads to a contradiction. Fix a nonempty atom . Ç òYÈÑÇeô±È YÈÑÇ such By Theorem 6.11, we can construct random variables Û'Ü that Û'Ü vanishes Æ on all the atoms of except for . Then vanishes on all Æ d ôèÌ d ò7Ì RTS but not on all the atoms in . By Theorem 7.19, the atoms in RTS Ç òëÈÑÇeô1È YÈÑÇ d ò d ô this implies that for so constructed, holds but does not d ò d ô hold. Therefore, does not imply , which is a contradiction. The theorem is proved. RTS Æ d ôèÌ ² ²¯ RRR Y Õ ¬ « ¬ RRR Y Remark In the course of proving this theorem and all its preliminaries, we have used nothing more than the basic inequalities. Therefore, we have shown that the basic inequalities are a sufficient set of tools to solve the implication problem if only FCMI’s are involved. .â ³¶µ¸ÃGµ¸©T©ï¥¸ÃT¾ c® ¯ Two sets of FCMI’s are equivalent if and only if their im- ages are identical. Proof Two set of FCMI’s d ò d ò d ô d and ô ô d and Æ ² are equivalent if and only if ò Õ d Æ Æ d ôèÌ ² (7.79) Æ d Then by the last theorem, this isÆ d equivalent to RTS iRTS Æ d ôèÌ+Ò ò7Ì Æ d ôèÌ RTS iRTS , i.e., RTS . The corollary is proved. ò7Ì and RTS Æ d ò7Ì Thus a set of FCMI’s is completely characterized by its image. A set of FCMI’s is a set of probabilistic constraints, but the characterization by its image is purely set-theoretic! This characterization offers an intuitive settheoretic interpretation of the joint Æ effect of FCMI’s on the -Measure for Ç òYÈÑÇeô±È YÈÑÇ ò7Ì%Þ Æ Q ôèÌ Q . For example, is interpreted as the efRTS RTS ò ô Æ Q ò7Ì%ß Æ Q ôèÌ Q Q RTS RTS fect commonly due to and , is interpreted as the ò ô Q Q effect due to but not , etc. We end this section with an example. RRR Y .â â £¤¥§¦¨ª©S« c® Q Q and let d òÒKä Q Consider 4ñ Ñ 4ñ ÀÒiX Z . Let òÒ Æ aÈ5äTå±È´ìaÈ SçïÈ5äYXçwÌÖÈ !Ò Æ aÈ5äTå±È´ìSçïÈ5ä aÈ7XçwÌÖÈ òYÈ Q ôwç and RTS D R A F T æ Æ d d ô!ÒKä Q ò7Ì+Ò RTS 4ñ Ò 4ñ Z Ñ Ò . Then ¶ ô!Ò Æ aÈ5äTå±È´ìaÈ7XçïÈ5ä SçwÌ Q Ò Æ aÈ5äTå±È SçïÈ5äìaÈ7XçwÌ ±È Q Æ Q Q (7.80) (7.81) ±ç ò7Ì lRTS Æ Q ôèÌ September 13, 2001, 6:27pm (7.82) D R A F T 140 A FIRST COURSE IN INFORMATION THEORY and Æ d RTS where RTS RTS RTS Æ Q ò7Ì Æ Q Ò ä Æ Q Ñ ô Ì èÌ Ò ä Ì Ò ä Ò RTS Æ Q Ò ä ô ÌêÒ RTS Æ Q ¬ «« ¦ ªª ¬#² ¬ « ¦ ª# ¬ ² ¬ « ¦ ª ¬#² ¬ ¦ ¬#² Ñ ¶ Ì 4 ò76 ô6 Æ Ç Ý 4 ò76 ô6 Æ Ç Ý 4 ò76 ô Æ Ç Ý Æ Ç Ý 4 ò76 Ñ Ò Æ Q mRTS Ò ÌÖÈ Ñ YÒ Þ Ç : Ì´ç Ý Þ : Ç Ý Þ Ç Þ Ç : Ñ Ý 4 ô6 (7.84) Ì´ç ÑÒ 4 6 Ý : (7.83) Ò (7.85) : : Ì´ç (7.86) Ì´çïÕ Æ d ò7Ì ² (7.87) Æ d ôèÌ nRTS . It can readilyô be seen by using an information diagram that RTS d d ò implies . Note that no probabilistic argument is involved in Therefore, this proof. 7.3 MARKOV RANDOM FIELD A Markov random field is a generalization of a discrete time Markov chain in the sense that the time index for the latter, regarded as a chain, is replaced by a general graph for the former. Historically, the study of Markov random field stems from statistical physics. The classical Ising model, which is defined on a rectangular lattice, was used to explain certain empirically observed facts about ferromagnetic materials. In this section, we explore the structure of the -Measure for a Markov random field. p ÒpÆ$p(È zÌ Let o be an undirected graph, where is the set of vertices and is the set of edges. We assume that there is no loop in o , i.e., there is no \ edge in o whicha joins a vertex to itself. 
For any (possibly empty) subset of p \ , \ denote by o the graph obtained from \ o by eliminating all the vertices Æ \ Ì q in and all the edges joining a vertex in . Let be the number of a \ distinct components in . Denote the sets of vertices of these components o pòèÆ \ ÌÖÈTpºôTÆ \ ÌÖÈ ÈTpr ] Æ \ Ì Æ \ ÌsEå \ " q by . If , we say that is a cutset in o . # S RRR ¢ .â ã e«ª¿a¼Â·¼Ó½¼Qµ¸· c® .²ut¥¸Ãwv'µG» RRR æ ÃG¥¸·yx§µ'¦m¿a¼Â«º©*x§Å ß S ÒDÆ$p+È Ì Let o Ç be an undiP p]Ò Ò äTå±È´ìaÈ wÈ *ç rected graph with , and let be a random variÇ òYÈÑÇeô±È èÈÑÇ able corresponding to vertex . Then form a Markov random \ field byèÈÑo Ç{if for all cutsets in o , the sets of random variables ] ] represented ] Ç{z ÈÑÇ{z È z3|~} ] Ç " " " . # # # are mutually independent conditioning on D _ RRR å RRR Y RRR Y RRR RRR Y This definition of a Markov random field is referred to as the global Markov Ç òYÈÑÇeô±È YÈÑÇ property in the literature. If form aÈÑMarkov random field repò ÈÑÇ ô È Ç Ç resented by a graph o , we also say Ç that form a Markov graph òYÈÑÇeô±È YÈÑÇ o . When o is a chain, we say that form a Markov chain. D R A F T September 13, 2001, 6:27pm D R A F T 141 Markov Structures in specifies an RRR ª D RRR are mutually independent conditioning on . R R R For a collection of cutsets the notation ü in ^, weÔRintroduce R R R R ü (7.88) ü RRR Y \ In the definition ofÈÑaÇ Markov random field, each cutset ò ÈÑÇ ô È Ç \ , denoted by . Formally, FCMI on TÇ{z \ ] " È ÈÑÇ{z3|~} ] " # òYÈ \ ô±È Ç òYÈ \ ô±È È \ èÈ \ GÒ o ò ( \ \ ô ( where ‘ ’ denotes ‘logical AND.’ Using this notation, Markov graph o if and only if ² ª ¯ p \ ] # \ \ o \ Ògp and q Æ \ ^ ÌEKå \ Ç ò5ÈÑÇeô±È ÈÑÇ form a Õ (7.89) Therefore, a Markov random field is simply a collection of FCMI’s induced by a graph. with respect to a graph We now define two types of nonempty atoms of \ o . Recall the definition of the set for a nonempty atom of in (7.73). ¢ .â e«ª¿a¼Â·¼Ó½¼Qµ¸· c® a° ¬ Õ Õ ¬ ¬ ¬ Õ Æ \ §ÌÒqå a \ For a nonempty atom of , if q , i.e., o is connected, then is a Type I atom, otherwise is aò Type IIô atom. The sets of all Type I and Type II atoms of are denoted by and , respectively. 7 á .â ù«ºµ¸Ã«ª¦ Ç c® ªõ ò ÈÑÇ RRR ô È YÈÑÇ Õ form a Markov graph if and only if o Û'Ü vanishes on all Type II atoms. Before we prove this theorem, we first state the following proposition which is the graph-theoretic analog of Theorem 7.5. The proof is trivial and is omitted. This proposition and Theorem 7.5 together establish an analogy between the structure of conditional mutual independence and the connectivity of a graph. This analogy will play a key role in proving Theorem 7.25. ´ß .â Let · and í ß ß be disjoint subsets of the vertex set of a *!åi#ð ß, where ð . Assume that graph and î be a subset of í forß å such that î in there at least two ¯ ñ . If í o åe ð· are disjoint ß mß · ,exist ¶ iß.ü à í ß then those î which are nonempty are disjoint in î . .â- In the graph in Figure 7.4, , Z , and Z are U . Then Proposition 7.26 says that , , and Z are disjoint in ¸ . disjoint in for a nonempty Proof of Theorem 7.25 Recall the definition of the set « ¬ ¦ contains precisely all the proper atom ¬ in (7.73). . 
ThusWethenotesetthat subsets of of FCMI’s specified by the graph can be written as ª ¬ « ¦ and ^ (7.90) Ãcµ'¨tµ¸÷è¼½¼<µ'· U p c® ª¬ å o Ò ÈYå aÆ a o ì á o ÌÑÌ £¤¥§¦¨ª©S« äTåwç c® o aTä Tç o äìaÈ aÈ7Xç äTåwç ò Æ ß ä aÈ Sç äìSç ä aÈ Sç aTä aÈ7XÄÈ Tç o \ ä \ È ç P o \ D R A F T q Æ \ §ÌsEKå September 13, 2001, 6:27pm D R A F T 142 A FIRST COURSE IN INFORMATION THEORY 2 4 3 1 5 7 Figure 7.4. The graph ª ¬ « ¦ 6 in Example 7.27. ^ (cf. (7.89)). By Theorem 7.19, it suffices to prove that Æ \ RTS We first prove that ² Consider ¬ß« an, atom ðß ´åVí for äand ¬ ² ô Ò p q Æ \ Æ \ Ì å +Ì ô Òp Æ \ q ô Ò \ Æ \ §Ì *ÌsEå QÌêÒ Æ \ q Æ \ q Ì *ÌEæå Ò å q Æ \ q Æ \ Æ Æ RTS " Ò QÌ \ QÌ \ ò RTS ]*^ \ Ì §ÌsEåwç Æ r Ò (7.91) *ÌEå QÌÖÕ RTS ä ô±Õ ª ¬ « ¦ and ^ (7.92) 7.13, let ß so that ßî , . By considering for °å£ . In Definition « . Therefore, , we see that ¬ « ¦ ª (7.93) (7.94) (7.95) ª ¬ « ¦ and ^ Æ iRTS Æ \ and q # \ q Æ \ ÌEå QÌÖÕ ª ¬ « ¦ and ^ ² (7.96) « ´ ª « ^ « ¦ ¬ « ¦ and ¸ Consider ¬ . Then there exists ¬ with such that ¬ . From Definition 7.13, ê÷N (7.97) ¬ ~ ç @ ÷ ç N N E B D N BED N ß ß åÙ and there exist at least two å such that where ßî !¯ ñ î . It ² follows from, è(7.97) that and the definition of ¶ ß. à ß î ß (7.98) We now prove that Æ Æ q Æ \ RTS wëÌEå q \ Æ RTS Ò |~} ^ Æ \ \ RTS Ç ß Ç Ý Ý \ Æ \ q *ÌsEå QÌ ô±Õ *ÌEDå QÌ Ü w QÌ ] ^ |~} ^ " z " ] ^ È # + p Æ \ wYÌ å q Ò Æ \ Ò \ - wëÌ \ r \ # " ] ^ w # Æ$p Æ \ wëÌ§ß ÌÖÕ ò D R A F T September 13, 2001, 6:27pm D R A F T 143 Markov Structures 1 3 \ 4 2 Figure 7.5. The graph · ß in Example 7.28. 7 p íß ß î wY \ playing the role of and playing the role of in ProposiWith tion 7.26, we see by applying the proposition that those (at least two) which are nonempty are disjoint in ~ a \ o \ ~ ß. à ß î ß (7.99) « . Therefore, we have proved (7.96), and , i.e., ¬ r ] ^ " # p E¡ are² ³ ² C´ ³ With respect to the graph £ ² ´ ³¶µ · C \ ² ´ ² ² ´ µ ³ » ³¹¸º h a \ o ¢w£ This implies q hence the theorem is proved. ¤¥y¦C§¨ª©0«¬®¯ª° , ² £ ³ ´ in Figure 7.5, the Type II atoms ± ² ³¶µ · ´ ² ³¹¸º ³ ² C´ ² ´ ³»µ£ ³¶µ · ² ´ ³¹¸º (7.100) ¸ while the other twelve nonempty atoms of ¼ are Type I atoms. The random À £ º7³ ³ º7³ º ³ ¸ · variables and form a Markov graph ± if and only if ½¿¾ À Á for all Type II atoms . 7.4 MARKOV CHAIN When the graph ± representing a Markov random field is a chain, the Markov random field becomes a Markov chain. In this section, we further explore the structure of the  -Measure for a Markov random field for this special case. Specifically, we will show that the  -Measure ½¿¾ for a Markov chain is always nonnegative, and it exhibits a very simple structure so that the information diagram can always be displayed in two dimensions. The nonnegativity of ½ ¾ facilitates the use of the information diagram because if à is seen to be a subset of ÃÅÄ in the information diagram, then ½ ¾ Ã Ä ½ ¾ wÆ Ã ½ ¾ Ã Ä sÇ Ã ½ ¾ à (7.101) These properties are not possessed by the  -Measure for a general Markov random field. D R A F T September 13, 2001, 6:27pm D R A F T 144 A FIRST COURSE IN INFORMATION THEORY 1 ... 2 Figure 7.6. The graph É È È n-1 n representing the Markov chain ÊË5ÌÊÍ,ÌhÎ/Î/ÎYÌgÊÏ . Without loss of generality, we assume that the Markov chain is represented ÑÐ ³ by£ the graph ± in Figure 7.6. This corresponds to the Markov chain Ð Ò3Ò3ÒÐ ³ ³¹Ó . We first prove the following characterization of a Type I atom for a Markov chain. 
Ôs«ª§Õ§D¦i¬®¯ªÖ For in Figure 7.6, À the Markov chain represented by the graph ± Ó a nonempty atom of ¼ is a Type I atom if and only if × Fà à Þ ÓØ*ÙCÚ ÜÛÝ º Ý Æn º Ò3Ò3Ò º7޿ߺ (7.102) À àâá Ý , i.e., the indices of the set variables in where complemented are consecutive. which are not À Proof It is easy to see that for a nonempty atom ,À if (7.102) is satisfied, then Ø*ÙCÚ ÙÚ Ó ± is connected, i.e., ã . Therefore, is a Type I atom of ¼ . Ø*ÙCÚ On the other hand, if (7.102) is not satisfied, then ± is not connected, i.e., À ÙÚ ä Ó ã , or is a Type II atom of ¼ . The lemma is proved. We now show how the information diagram for a Markov chain with any áåÇæ length can be constructed in two dimensions. Since ½¿¾ vanishes on Ó all the Type II atoms of ¼ , it is not necessary to display these atoms in the information diagram. In constructing the information diagram, the regions rep £ º Ò3Ò3Ò º7³¹Ó ³ ³ should overlap with each other resenting the random variables , such that the regions corresponding to all the Type II atoms are empty, while the regions corresponding to all the Type I atoms are nonempty. Figure 7.7 shows such a construction. Note that this information diagram includes Figures 6.7 and 6.9 as special cases, which are information diagrams for Markov chains with lengths 3 and 4, respectively. ç X1 ç ç Xè 2 ç Xé n-1 Xé n ... Figure 7.7. The information diagram for the Markov chain Ê D R A F T Ë ÌÊ Í ÌkÎ/Î7ÎêÌÊ September 13, 2001, 6:27pm Ï . D R A F T 145 Markov Structures We have already shown that ½ ¾ is nonnegative for a Markov chain with áëÇiæ length 3 or 4. Toward suffices À ìÇ proving that this is trueÀ for any length À , it Ó Á Á to show that ½ ¾ for all Type I atoms of ¼ because ½ ¾ for Ó ¼ . We have seen in Lemma 7.29 that for a Type I atom all Type II atoms A of À À Ó Ù Ú of ¼ , has the form as prescribed in (7.102). Consider any such atom . ² ² ² Then an inspection of the information diagram in² Figure 7.7 reveals that À ¾ ½ Â Ç ´ ³Ñí ¾ ½ ´ ï5ô ³ ³Ñí$ó7³ Ò3Ò3Ò ´ ³Ñíî ³¹ï ³ñð*ò (7.103) (7.104) (7.105) ð ò Á This shows that ½¿¾ is always nonnegative. However, since Figure 7.7 involves an indefinite number of random variables, we give a formal proof of this result in the following theorem. õ@öy«÷,øw«ª§ù¬®úû For a Markov chain Ð ³ ÐüÒ3Ò3Ò(Ð £ ³ ³ Ó , ½¿¾ is nonneg- ative. À Ó Á ½ ¾ Proof for all Type II atoms A of ¼ , it suffices to show that À ýSince À Ç Ó Á for allÀ Type Ó I atoms of . We have seen in Lemma 7.29 that ½¿¾ ¼ ÙCÚ for a Type I atom of ¼ , has the form as prescribed in (7.102). Consider À any such atom and define the set þ Æn º Ò3Ò3Ò º7Þ ÜÛÝ f ß (7.106) Then  ³Ñí ´ ² ² ï ³ ð ò ³ ² ½ ² ´ ³ ² ´ ³ ² ² ³¹ï ² ³Ñí ¾ ² ² ´ ³Ñí ¾ ½ ² ¾ ½ ô ³ñ ² ðò ³Ñíÿó7³¹ï ³ñð ò ² ´ ² ³¹ï (7.107) ³ñð ò (7.108) ² (7.109) þ , namely In the´ above ´ summation, except for the atom corresponding to Ò3Ò3Ò ´ ³Ñí ³Ñí î ³¹ï ³ñð*ò atoms are Type II² atoms. Therefore, ² , all the ² ²  ô ³ñð*ò ³Ñí&ó7³¹ï ¾ ½ Hence, ³Ñí ² À ½ ¾ ½ Â Ç D R A F T Á ¾ ³Ñí ³Ñí$ó7³ ´ ³Ñí î ´ Ò3Ò3Ò ´ ² ´ ³Ñíî ï5ô ³ ³¹ï ² ð ò ´ Ò3Ò3Ò ´ ³¹ï ³ñðò (7.110) ² ³ñð*ò September 13, 2001, 6:27pm (7.111) (7.112) (7.113) D R A F T 146 A FIRST COURSE IN INFORMATION THEORY X Y . Z T . . . . .. . .. U .. . . .. !#"%$'&()!#"+*,& . Figure 7.8. The atoms of The theorem is proved. á We end this section by giving an application of this information diagram for .- . 
0/214365 87:9 <; => 09 709 @?A=CB ¤¥y¦C§¨ª©0«¬®ú y© ¨ª©*« Ñ« 5ø <¨ ÷ In this example, we prove ³ º4DsºFE with the help of an information diagram that for five random variables , Ð Ð Ð Ð ¢ ¢ ³ D E Ù Ù , and such that forms a Markov chain, G D E 4D @ó7³»º  º ¢ º Ù 4D ¿Æ ³ ¿Æ ó ¢ º  ³ G 4D FE G D E yÆ º Ù º sº ¢ ô º ¢ G E ¢ yÆ ô (7.114) Ù In the information for , and 7.8, we first identify G D diagram G ¢ì inbyFigure and then the atoms of marking of them by the atoms of G D and G ¢ , it receiveseach a dot. If an atom belongs to both two dots. The resulting diagram represents G D wÆ G By repeating the same procedure for  E 4D @ó7³»º º ¢ º Ù ¿Æ  ³ 4D º ó ¢ º Ù yÆ ¢ (7.115) G D E ô yÆ G ¢ ô E º (7.116) we obtain the information diagram in Figure 7.9. Comparing these two information diagrams, we find that they are identical. Hence, the information Ð ³ D Ð identity in (7.114) always holds conditioning on the Markov chain Ð Ð ¢ E Ù . This identity is critical in proving an outer bound on the achievable coding rate region of the multiple descriptions problem in Fu et al. [73]. It is virtually impossible to discover this identity without the help of an information diagram! D R A F T September 13, 2001, 6:27pm D R A F T 147 Markov Structures X Y . Z T . . . .. Figure 7.9. The atoms of . .. . U . . .. .. HI"KJML ONP$QNR*ANSM&(THI" ON$L*:NUSM&V(T!#"%$XW J@&(Y!#"+*'W JA& . &Ê Ê PROBLEMS 1. Prove Proposition 7.14 and Proposition 7.15. 2. In Example 7.22, it was shown that Z\[ implies Z^] . Show that Z^] does not Ø imply Z [ . Hint: Use an information diagram to determine _F`baZ [c _F`daZ ]ec . 3. Alternative definition of the global Markov property: any partition f ] and f [ For Ù ºFf Û ]Ø*Ù ºFf [ ß of f such that the sets of³Yvertices are disconnected g and ³Yg are independent , the sets of random variables condiin ± Ë Í ð ³ tioning on . Show that this definition is equivalent to the global Markov property in Definition 7.23. ji , Tk and gMlm:nl k are indepeno k is the set of neighbors2 of vertex i in 4. The local Markov property: For h ³Ym0n dent conditioning on , where ± . à sàfá ³ ³ a) Show that the global Markov property implies the local Markov property. b) Show that the local Markov property does not imply the global Markov property by giving a counterexample. Hint: Consider a joint distribution which is not strictly positive. 5. Construct a Markov random field whose  -Measure values. Hint: Consider a Markov “star.” ], [, ]qpra [ 6. a) Show that ³ ³ ³ ³ º7³ · ³ º7³ · ¸ , and c º,³ ³ ¸ ½ ¾ can take negative are mutually independent if and only if [Xpra ³ · º7³ ¸ c ] ô³ º,³ · p ³ ¸ ô a ] ³ º7³ [ cFs Hint: Use an information diagram. 2 Vertices k t k t and in an undirected graph are neighbors if and are connected by an edge. D R A F T September 13, 2001, 6:27pm D R A F T 148 A FIRST COURSE IN INFORMATION THEORY b) Generalize the result in a) to á random variables. [ 7. Determine the Markov random field with four random variables ] ³ º ³ ¸ · and which is characterized by the following conditional independencies: ³ º7³ º7³ ³ ¸ ô³ ³ · º7³ ¸ · ô ³ º7³ ô ³ º7³ ³¹¸ º7³ ³ º a ] [ )u c u p [ pva c a ] c c a [ )u Fc s ]qpva ³ ³ º7³ · What are the other conditional independencies pertaining to this Markov random field? HISTORICAL NOTES A Markov random field can be regarded as a generalization of a discretetime Markov chain. Historically, the study of Markov random field stems from statistical physics. 
The classical Ising model, which is defined on a rectangular lattice, was used to explain certain empirically observed facts about ferromagnetic materials. The foundation of the theory of Markov random fields can be found in Preston [157] or Spitzer [188]. The structure of the  -Measure for a Markov chain was first investigated in the unpublished work of Kawabata [107]. Essentially the same result was independently obtained by Yeung eleven years later in the context of the  -Measure, and the result was eventually published in Kawabata and Yeung [108]. Full conditional independencies were shown to be axiomatizable by Malvestuto [128]. The results in this chapter are due to Yeung et al. [218], where they obtained a set-theoretic characterization of full conditional independencies and investigated the structure of the  -Measure for a Markov random field. These results have been further applied by Ge and Ye [78] to characterize certain graphical models. D R A F T September 13, 2001, 6:27pm D R A F T Chapter 8 CHANNEL CAPACITY In all practical communication systems, when a signal is transmitted from one point to another point, the signal is inevitably contaminated by random noise, i.e., the signal received is correlated with but possibly different from the signal transmitted. We use a noisy channel to model such a situation. A noisy channel is a “system” which has one input and one output1 , with the input connected to the transmission point and the output connected to the receiving point. The signal is transmitted at the input and received at the output of the channel. When the signal is transmitted through the channel, it is distorted in a random way which depends on the channel characteristics. As such, the signal received may be different from the signal transmitted. In communication engineering, we are interested in conveying messages reliably through a noisy channel at the maximum possible rate. We first look at a very simple channel called the binary symmetric channel (BSC), which is represented by the transition ³ D diagram in Figure 8.1.Á º Inß this channel, both the input and the output take values in the set Û h . There is a certain probability, denoted by w , that the output is not equal to the input. That is, if the input is 0, then the output is 0 with probability h xyw , and is 1 with probability w . Likewise, if the input is 1, then the output is 1 with probability hxyw , and is 0 with probability w . The parameter w is called the crossover probability of the BSC. À º ß Let Û be the message set which contains two possible messages to à à Á w{z Á s - . We further À assume that the be conveyed through a BSC with À two messages and à are equally likely. If the message is , we map it to the codeword 0, and if the message is à , we map it to the codeword 1. This 1 The discussion on noisy channels here is confined to point-to-point channels. D R A F T September 13, 2001, 6:27pm D R A F T 150 A FIRST COURSE IN INFORMATION THEORY 1 0 0 X Y 1 1 1 Figure 8.1. A binary symmetric channel. is the simplest example of a channel code. The codeword is then transmitted through the channel. Our task is to decode the message based on the output of the channel, and an error is said to occur if the message is decoded incorrectly. 
Consider |~} |} À Û ô Dv Á |} Dr |~} |~} Dr Dv |} s - ahXxyw c Dv As ß Û D2 Since wz s- , (8.1) ß (8.2) Á ß |~} Dv h ß Û O hxyw ~| } |} Dv O hx Dr ~| } |~} D r Dv z ô ô Á ô À ô Û Á sÁ (8.4) ß ß Á Û Ã Dv (8.3) ß Á ß Á À Á Á Á Û Û Ã ô³ Á Û Û by symmetry2 , it follows that |~} and Á ß Á ß Û |~} ô Á ³ Û Since ³ Û À ß ô Û (8.5) O w s Á ß Á ß (8.6) s (8.7) Therefore, in order to minimize the probability of error, we decode a received À 0 to the message . By symmetry, we decode a received 1 to the message à . 2 More explicitly, VPY V U P V AU VY# A e u V e u d e ue lX e u ] e u ] Ú Ú î î î D R A F T September 13, 2001, 6:27pm D R A F T 151 Channel Capacity An error occurs ifÀ a 0 is received and the message is à , or if a 1 is received and the message is . Therefore, the probability of error, denoted by , , is |} |~} |} |~} given by Û Á w º s- w Dr Æ Á Á ß ô Û Ã s- w Dv Æ ß Á |~} (8.6) because|} where (8.9) follows from À Û ô Dv h O ß Dv h Û ô Û Ã D2 ß À Û ô Dv h w Á ß ß (8.8) (8.9) (8.10) (8.11) by symmetry. Á . Then the above scheme obviously does not Let us assume that w. provide perfectly reliable communication. If we are allowed to use the channel only once, then this is already the best we can do. However, if we are allowed to use the same channel repeatedly, then we can improve the reliability by generalizing the above scheme. We now consider the following channel code which we refer to as the binary á Ç h be an odd positive integer which is called the block repetition code. Let À length of the code. In this code, the message is mapped to the sequence of á á 0’s, and the message à is mapped to the sequence of 1’s. The codeword, á á which consists á of a sequence of either 0’s or 1’s, á is transmitted through the channel in uses. Upon receiving a sequence of bits at the output of the channel, we use the majority vote to decode the message, i.e., if there are more 0’s than 1’s in the sequence, we decode the sequence to the message À , otherwise we decode the sequence to the message à . Note that á theh block , this length is chosen to be odd so that there cannot be a tie. When scheme reduces to the previous scheme. general scheme, we continue to denote the probability of error For this more by . Let o and o)] be the number of 0’s and 1’s in the received sequence, respectively. Clearly, (8.12) o Æ o ] á s á À For large , if the message is , the number of 0’s received is approximately equal to ô À¡ á o ahxyw c (8.13) and the number of 1’s received is approximately equal to o)] ô X¡ À w á (8.14) numbers. with high probability by the weak law of large This implies that the z¢o)] ß , is small because probability of an error, namely the event Û o á D R A F T ah x£w c äá w September 13, 2001, 6:27pm (8.15) D R A F T 152 A FIRST COURSE IN INFORMATION THEORY with the assumption |} that wz Û error À ô ß s - . |Specifically, } |} o z¤o ] |} x¥oY] z¤o)] |} o)] s o)] aw §¦ c Á à ä z ¦ z Á ¦ so that ä Û ô À Æ s - xyw Á á Á Û where ß á Û à À ô Û ô À (8.16) (8.17) (8.18) (8.19) ß ß á À ô ߺ º (8.20) w j¦ z s - s (8.21) ¦ Note |~} that such a exists because w¨z s - . Then by the © weak law of large numbers, the upper bound in (8.19) . By symmetry, ª© tends to 0 as error also tends . Therefore, |~} to 0|~as} |} |~} error error (8.22) © tends to 0 as . In other words, by using a long enough repetition code, we can make arbitrarily small. In this sense, we say that reliable is positive and Æ Á Á Ð á Û ô à á ß À Û á ß Û Ð ô À ß Æ Û Ã ß Û ô Ã ß Ð communication is achieved asymptotically. 
ä Á , for any given transmitted sequence We point out that for a BSC with w á á of length , the probability of receiving any given sequence of length is nonzero. It follows that for any two distinct input sequences, there is always a nonzero probability that the same output sequence is produced so that the two input sequences become indistinguishable. Therefore, except for very special Á ), no matter channels (e.g., the BSC with w how the encoding/decoding scheme is devised, a nonzero probability of error is inevitable, and asymptotically reliable communication is the best we can hope for. Though a rather naive approach, asymptotically reliable communication can be achieved by using the repetition code. The repetition code, however, is not without catch. For a channel code, the rate of the code in bit(s) per use, is defined as the ratio of the logarithm of the size of the message set in the base 2 to the block length of the code. Roughly speaking, the rate of a channel code is the average number of bits the channel code attempts to convey through the channel in one use of the channel. For a binary repetion Ð code © with block á á Ó ] , which ] tends to 0 as . Thus in length , the rate is ÓO«%¬I'® order to achieve asymptotic reliability by using the repetition code, we cannot communicate through the noisy channel at any positive rate! In this chapter, we characterize the maximum rate at which information can be communicated through a discrete memoryless channel (DMC) with an arbitrarily small probability of error. This maximum rate, which is generally D R A F T September 13, 2001, 6:27pm D R A F T 153 Channel Capacity positive, is known as the channel capacity. Then we discuss the use of feedback in communicating through a channel, and show that feedback does not increase the capacity. At the end of the chapter, we discuss transmitting an information source through a DMC, and we show that asymptotic optimality can be achieved by separating source coding and channel coding. 8.1 DISCRETE MEMORYLESS CHANNELS ; °¯V9±?:9K7:9 @? / ¶ Let ² and ³ be discrete alphabets, and ´,aRµ c be a tranô¶ c M ´ R a µ sition matrix from ² to ³ . A discrete channel is a single-input single³ taking values in ² and output output system with input random variable D |~} that random variable |} taking values in ³ such Ñ« ÷ ô °w® · ¶ 4Dv µ · ¶ ´MaRµ ¶ c (8.23) ¶ |} for all a µ cq¸ ²º¹»³ . |} we see that if ·|~ } ¶ , then Remark From (8.23), ~ | } ¶ 4 v D Dv ¼ ¶ s ´,|aRµ } ¶ c · ¶ µ (8.24) |} µ D2 µ · ¶ is not defined if · ¶ ½ . Nevertheless, However, ³ Û º ß ³ Û ô ß º Û Û ô ³ º Û ô³ Û ³ ³ ä ß Á ß ô³ Û ß ß Û ³ ß ß Á (8.23) is valid for both cases. ; °¯V9±?:9K7:9 @? Ñ« ÷ °w®¯ A discrete memoryless channel is a sequence of replicates of a generici discreteichannel. Thesei discrete channels are indexed by a discreteÇ , where , with the th channel being available for transmission time index h i at time . Transmission through a channel is assumed to be instantaneous. The DMC is the simplest nontrivial model ô for ¶ a communication channel. For simplicity, a DMC will be specified by ´,aRµ c , the transition matrix of the generic³ discreteD channel. i k k Let iÇ and be respectively the input and the output of a DMC at time , where h . Figure 8.2 is an illustration of a discrete memoryless channel. In the i figure, the attribute of³ the channel manifests itself by that for i memoryless k as its input. 
channel only has all , the th i discrete Ç |} For each h |, } i Û ³ k ¶ 4D k µ O º ß Û ³ k ¶ ´MaRµ ¶ c ß ô (8.25) since the th discrete channel is a replicate of the generic discrete D k ß channel Û ´,aRµ ô ¶ c . However, the dependency of the output random variables on the ³ k ß cannot be further described without specifying input random variables Û how the DMC is being used. In this chapter, we will study two representative models. In the first model, the channel is used without feedback. This will be D R A F T September 13, 2001, 6:27pm D R A F T 154 A FIRST COURSE IN INFORMATION THEORY 1 Á p  ( ¿ y À xà ) Y1 ¾X 2 Á p  ( ¿ y À xà ) Y2 ¾X 3 Á p  ( ¿ y À xà ) Y3 ... ¾X Figure 8.2. A discrete memoryless channel. discussed in Section 8.2 through Section 8.5. In the second model, the channel is used with complete feedback. This will be discussed in Section 8.6. To keep our discussion simple, we will assume that the alphabets ² and ³ are finite. ; °¯V9±?:9K7:9 @? Ñ« ÷ °w®ú The capacity of a discrete memoryless channel Ä ÅYÆÈÇ 4D c É ËÊ Qa defined as D ³ ³ó  ´,aRµ ¶ c º ô is (8.26) where and are respectively the input and the output of the generic discrete ¶ channel, and the maximum is taken over all input distributions ´,a c . Ä From the above definition, we see that Ç Á Qa 4D c (8.27) because Ç ³ó Á (8.28) ¶ for all input distributions ´,a c . By Theorem 2.43, we have Ä Å)ÆÈÇ 4D c Å)ËÆÈÊÇ G a c «%¬I ² s (8.29) É ÌÊ a É Ä «%¬I Likewise, we have ³ s (8.30) Ä Å)ÍÏÎ «%¬I Therefore, (8.31) a ² «%¬I ³ cFs ¶ ¶ 4 D Ñ c is a continuous functional of ´,a c and the set of all ´,a c 4isD a Since a c compact set (i.e., closed and bounded) in Ð , the maximum value of a Â à ³ëó  à à  ô ³ ô ô ô ô ô º ô ô ³ëó  ³ëó can be attained3 . This justifies taking the maximum rather than the supremum in the definition of channel capacity in (8.26). Ñ 3 The assumption that D R A F T is finite is essential in this argument. September 13, 2001, 6:27pm D R A F T 155 Channel Capacity ÒZ X Y Figure 8.3. An alternative representation for a binary symmetric channel. Ä We will prove subsequently that is in fact the maximum rate at which information can be communicated reliably through a DMC. We first give some examples of DMC’s for which the capacities can be obtained in closed form. ³ D In the following, and denote respectively the input and the output of the generic discrete channel, and all logarithms are in the base 2. ¤¥y¦C§¨ª©0« ÌÓÔ1Õ9±? Ö¼×ØÖ °w® w¦,ø Ù7 09>ÛÚ @?:? VB C§Õ§« yø [ö¦ y«© The binary symmetric channel (BSC) has been shown in Figure 8.1. Alternatively, a BSC can be E represented by Figure 8.3. Here, is a binary random variable representing |~} |~} the noise of the channel, with and E Û E O hxyw Á ß is independent of ³ . Then Dv i³ E Æ E h w ß Û and mod ® º (8.32) s (8.33) 4D c In order to determine the capacity of the BSC, we first bound a  follows. a 4D c  G aD c x G aD G a D c x ´,a ¶ Ê G a D c x ´,a ¶ Ê G a D c x ÜØÝ aw c hx Ü°Ý aw c ô³ ³ó à c c G a D · ¶ c ô³ cÜ°Ý aw c º ³ó as (8.34) (8.35) (8.36) (8.37) (8.38) where we have used ÜØÝ to denote the binary entropy function in the base 2. In G D h , i.e., the output order to achieve this upper bound, we have to make a c ¶ c be the , ´ a distribution of the BSC is º uniform. This can be done by letting ß ³ó4D Á c can be uniform distribution on Û h . Therefore, the upper bound on Âa achieved, and we conclude that Ä hx ÜØÝ aw c D R A F T bit per use. September 13, 2001, 6:27pm (8.39) D R A F T 156 A FIRST COURSE IN INFORMATION THEORY C 1 0 0.5 1 Figure 8.4. 
The capacity of a binary symmetric channel. Ä Ä the capacity versus the crossover probability Figure 8.4 is a plot of Á or w w . Weh , w see from the plot that attains the maximum value 1 when Ä the minimum value 0 when w Á s - . When w Á , it is easy to see and attains that h is the maximum rate at which information can be communicated through the channel reliably. This can be achieved simply by transmitting unencoded bits through the channel, and no is necessary because all the decoding h , the same can be achieved with the bits are received unchanged. When w additional decoding step which complements all the received bits. By doing so, the bits transmitted through the channel can be recovered without error. Thus from the communication point of view, for binary channels, a channel which never makes error and a channel which always makes errors are equally good. Á - , the channel output is independent of the channel input. Theres When w fore, no information can possibly be communicated through the channel. ¤¥y¦C§¨ª©0« ÏÞß1Õ9±? Ö °w® w¦,ø »¤ ,=5 Ú @?A? VB ø5¦ yøw« ö¦ y«ª© The binary erasure channel ß Á º Û is shown in Figure 8.5. In this channel, the input alphabet is , while h ºà ß à Á º the output alphabet is Û h . With probability á , the erasure symbol is produced at the output, which means that the input bit is lost; otherwise the input bit is reproduced at the output without error. The parameter á is called the erasure probability. To determine the capacity of this channel, we first consider Ä D R A F T Å)ÌÆÈÊÇ É Å)ÌÆÈÊÇ É Å)ÌÆÈÊÇ É a 4D c a G a D c x G a D cc G a D c x Ü°Ý a±á cFs  ³ëó ô³ September 13, 2001, 6:27pm (8.40) (8.41) (8.42) D R A F T 157 Channel Capacity â0 â0 1 ãe X 1 Y 1 1 Figure 8.5. A binary erasure channel. G a D c . To this end, let · O.ä Thus we only have to maximize|~} Û ³ Á ß and define a binary random variable Ûå The random variable D function of . Then Á if if h D . à D2. à. G aD c G a c G aD c Ü°Ý a±á c ahxdá cÜ°Ý a ä Fc s sº Æ Ä by (8.44) indicates whether an erasure has occured, and it is a G aD c Hence, (8.43) ô Æ (8.45) (8.46) (8.47) Å)ËÆÈÊÇ G a D c x ØÜ Ý a±á c (8.48) É ¡ Å)ÆÈÇ Ü°Ý a±á c ah xdá cÜ°Ý a ä c x Ü°Ý a±á c (8.49) æ ahxdá c Å)ÆÈÇ Ü°Ý a ä c (8.50) æ ahxdá c (8.51) ä¤ s - , i.e., the input distribution is where capacity is achieved by letting Æ º Á uniform. It is in general not possible to obtain the capacity of a DMC in closed form, and we have to resort to numerical computation. In Chapter 10, we will discuss the Blahut-Arimoto algorithm for computing channel capacity. D R A F T September 13, 2001, 6:27pm D R A F T 158 8.2 A FIRST COURSE IN INFORMATION THEORY THE CHANNEL CODING THEOREM We will justify the definition of the capacity of a DMC by the proving the channel coding theorem. This theorem, which consists of two parts, will be formally stated at the end of the section. The direct part of the theorem asserts that information can be communicated through a DMC with an arbitrarily small probability of error at any rate less than the channel capacity. Here it is assumed that the decoder knows the transition matrix of the DMC. The converse part of the theorem asserts that if information is communicated through a DMC at a rate higher than the capacity, then the probability of error is bounded away from zero. For better appreciation of the definition of channel capacity, we will first prove the converse part in Section 8.3 and then prove the direct part in Section 8.4. ; °¯V9±?:9K7:9 @? 
Ñ« ÷ Ïç °w® put alphabet ² 4è c code for a discrete memoryless channel with inAn a and output alphabet ³ is defined by an encoding function á º é¨ê h ® º Û and a decoding function h® ® ah c a c Û The é set º é º º º Ò3Ò3Ò Ò3Ò3Ò º ë ê³ Ó º 4è Ò3Ò3Ò º ¡ß h ® Ð º Û º Ð Ó ² Ò3Ò3Ò 4è (8.52) º ¡ß s (8.53) 4è by ì , is called the message set. The sequences é a è c , indenoted ² are called codewords, and the set of codewords º ¡ß Ó is called the codebook. In order to distinguish a channel code as defined above from a channel code with feedback which will be discussed in Section 8.6, we will refer to the former as a channel code without feedback. þ is randomly chosen from the message set ì We assume that a message according to the uniform distribution. Therefore, G a c «%¬I è s ¶ With respect to a channel code for a DMC ´MaRµ c , we let íÔ a ] [ c î D 4D and a ] [ 4D c þ (8.54) ô ³ º7³ º º º Ò3Ò3Ò Ò3Ò3Ò º º7³ Ó Ó (8.55) (8.56) be the input sequence and the output sequence of the channel, respectively. Evidently, íï é a þ cFs (8.57) We also let D R A F T ðñ ë î a Fc s þ September 13, 2001, 6:27pm (8.58) D R A F T 159 Channel Capacity W õX Encoder ò óô Y Channel p(y|x) Message W Decoder Estimate of message ö Figure 8.6. A channel code with block length . Figure 8.6 is the block diagram for a channel code. j÷ è , let |} ; °¯V9±?:9K7:|~9 }@? For all h î í é ÷ ð ÷ ú ÷ O øù ù a c û È ü þý ÿ û ÷ be the conditional probability of error given that the message is . Ñ« ÷ à °w®¬ ô þ þ Û à ß ô Û ß (8.59) Ï We now define two performance measures for a channel code. ; °¯V9±?:9K7:9 @? Ñ« ÷ °w®° á ø æ Ê ÅYù ÆÈÇ °ø ù s fined as ; °¯V9±?:9K7:9 @? Ñ« a 4è c The maximal probability of error of an ÷ a 4è c á The average probability of error of an |~} °w®Ö @ fined as ð þ Û º º code is de(8.60) code is de- s gþkß (8.61) |~} of From the definition , we have ð ~| } |~} ð ú ÷ ñ ÷ ù |} ð ÷ èh ÷ ù èh ø ù ù i.e., @ is the arithmetic mean of ø ù , h j÷ è . It then follows that @ ø æ Ê s Û þ Û gþkß þ ß Û ô þ gþ ô þ þ Û þ ß ß º à Ñ« ÷ / °w® û per use. D R A F T The rate of an a 4è c á º (8.63) (8.64) (8.65) Üà à ; °¯V9±?:9K7:9 @? (8.62) channel code is á September 13, 2001, 6:27pm l ] «%¬I è (8.66) in bits D R A F T 160 A FIRST COURSE IN INFORMATION THEORY ; °¯V9±?:9K7:9 @? /°/ A rate is asymptotically achievable for a discrete mem¶ , there exists for sufficiently large an oryless channel ´,aRµ c if for any w 4 è c code such that a h «%¬I è xyw (8.67) Ñ« ÷ °w® ä ô á á Á º ä á ø æ Ê ¤ z ws and (8.68) For brevity, an asymptotically achievable rate will be referred to as an achievable rate. In other words, a rate is achievable if there exists a sequence of codes whose rates approach and whose probabilities of error approach zero. We end this section by stating the channel coding theorem, which gives a characterization of all achievable rates. This theorem will be proved in the next two sections. / y1Ú @?A? bÚ @9±? ¶ for a discrete memoryless channel ´,aRµ c õ@öy«÷,øw«ª§H°w® *¯ ö¦ y«ª© @÷ ô OB Ä is achievable A rate à if and only if , the capacity of õ@ö¿«÷¿øw«§ the channel. 8.3 THE CONVERSE random Let us consider a channel code with block length ð variables á þ ³Tk DØk for h à i.à The þ involved in this code are , and , and . We see from the definition of a channel code in Definition 8.6 that all the random variables are generated sequentially according to some deterministic or probabilistic rules. Specifically, the ðrandom variables are generated in the order Ò3Ò3Ò þ º7³ ] º4D ] º7³ [ º4D [ º º7³ Ó º4D Ó º þ . 
The generation of these random variables can be represented by the dependency graph4 in Figure 8.7. In this graph,³ a node represents a random variable. If there is a (directed) edge from node D ³ D to node , then node is called a parent of node . We further distinguish a solid edge and a dotted edge: a solid edge represents functional (deterministic) dependency, while a dotted edge represents the probabilistic dependency ô¶ c of the generic discrete channel. For a M ´ R a µ induced by the transition matrix ³ node , its parent nodes represent all the random variables on which random ³ variable depends when it is generated. We will use to denote the joint ¶k distributioni of these random variables as well as all the marginals, and let denote the th component of a sequence . From the dependency graph, we 4A á dependency graph can be regarded as a Bayesian network [151]. D R A F T September 13, 2001, 6:27pm D R A F T 161 Channel Capacity X W Y p y ( | x) X1 Y1 X2 Y2 X3 Y3 Xn Yn W Figure 8.7. The dependency graph for a channel code without feedback. a÷ ÷ a Ó Ó c¸ ì ¹¨² ¹»³ such that a c , c a ÷ c k a ¶ k ÷ c k ´,aRµ k ¶ k c s (8.69) ] ] ÷ ÷ ¶ k ÷ c are well-defined, and a ¶ k ÷ c Note that a c for all so that Ùa í î set of nodes ] [ are deterministic. Denote the by and the set D 4 D D , by . We notice the following structure in the depenof nodes ] [ í dency î graph: all the edges from end in í , and all theî edges from end 4í , and form the Markov in . This suggests that the random variables î chain í s (8.70) The validity of this Markov chain÷ canbe formally justified by applying Propo c¸ ì¹^² ¹³ such that Ùa c , sition 2.9 to (8.69). Then for all a º see that for all º ä Ó º ä Ó ô º ô ô Á ô ³ º º Ò3Ò3Ò Á º7³ Ò3Ò3Ò º º7³ Ó Ó þ þ Ð þ Ð º a ÷ we can write º Ó º Ó c a ÷ c a ÷ c a Fc s º º ô ô ä Á (8.71) By identifying on the right hand sides of (8.69) and (8.71) the terms which depend only on and , we see that Ó a c k ,´ aRµ k ¶ k c ] ô where ô º (8.72) is the normalization constant which is readily seen to be 1. Hence, Ó a D R A F T ô c k ,´ aRµ k ¶ k Fc s ] ô September 13, 2001, 6:27pm (8.73) D R A F T 162 A FIRST COURSE IN INFORMATION THEORY W X 0 0 0 Y W 0 0 0 0 0 Figure 8.8. The information diagram for Ì ÌfÌ . The Markov chain in (8.70) and the relation in (8.73) are apparent from the setup of the problem, and the above justification may seem superfluous. However, the methodology developed here is necessary to handle the more delicate situation which arises when the channel is used with feedback. This will be discussed in Section 8.6. Consider aî channel ð code whose probability of error is arbitrarily small. þ º4í¶º þ , and form the Markov chain in (8.70), the information í diaSince gram for these four random variables is asî shown in Figure 8.8. Moreover, is ð þ þ a function of , and is a function of . These two relations are equivalent to G a í ôþ c Á º (8.74) G a ð î c and ô þ Á º ð (8.75) þ þ respectively. Since the probability of error is arbitrarily small, and are essentially identical. 
ð To gain insight into the problem, we assume for the time þ þ being that and are equivalent, so that G a ð þ c G a ô þ ô þ þ ð c Á s (8.76) Since the  -Measure ½¿¾ for a Markov chain is nonnegative, the constraints in (8.74) to (8.76) imply that ½ ¾ vanishes on all the atoms in Figure 8.8 marked with a ‘0.’ Immediately, we see that G a c Qa í î cFs þ ¶ó  (8.77) That is, the amount of information conveyed through the channel is essentially the mutual information between the input sequence and the output sequence of the channel. For a single transmission, we see from the definition of channel capacity that the mutual information between theࢠinput the output cannot exceed i àá and the capacity of the channel, i.e., for all h , a k 4D k c  D R A F T ³ ó à Ä s September 13, 2001, 6:27pm (8.78) D R A F T 163 Channel Capacity i Summing á from 1 to , we have Ä Ó k a k 4D k c ] ³  àá ó s (8.79) Upon establishing in the next lemma that î í c Qa Ó k Qa k 4D k c ] à »ó  ³ ó  º (8.80) the converse of the channel coding theorem then follows from á h «%¬I è hG a c h Qa í î c h Qa k 4D k c k Ä ] s á à / (8.81) ¶ó  (8.82) Ó á à Ôs«ª§Õ§D¦ þ á ³  ó (8.83) (8.84) °w® *ú For a discrete memoryless channel used with a channel code á Ç without feedback, for any h, î Qa í c  »ó wherei time . ³ k and Dk Proof For any a holds. Therefore, Ó à k Qa k 4D k c ] ³ Â ó º (8.85) are respectively the input and the output of the channel at º c¸ ² Ó ¹¨³ Ó , if a º c ä Á , then a c ä Á î í Dk k c k ,´ a c a ] c in the support of a c . Then holds for all a î x «%¬I a í c x «Ï¬I k ´,a Dk Tk c x k %« ¬I M´ a DØk Tk c ] ] ô ô³ º ô Ó ô³ ô³ G a î í c G a D k k cFs k ] º (8.87) Ó ô D R A F T (8.86) º Ó or and (8.73) Ó ô³ September 13, 2001, 6:27pm (8.88) D R A F T 164 A FIRST COURSE IN INFORMATION THEORY Hence, a í  ¶ó î G aî c x G aî í c k G a D k c x k G a D k k c ] ] k Qa Tk 4Dk cFs ] c ô Ó à (8.89) Ó ô³ (8.90) Ó Â ³ ;ó (8.91) The lemma is proved. We now formally prove the converse of the channel coding theorem. Let ä á Á be an achievable rate, i.e., for any , there exists for sufficiently large an w a á º4è c code such that á h «%¬I è ä xyw (8.92) ø æ Ê ¤ z ws and Consider «Ï¬I è æ G a c G a ð c ð Ý G c a ð G a c ! G a ð c þ þ ô þ Æ þ ô þ Æ þ ô þ Æ þ ô þ Æ à µ à (8.93) à Â Â Ó (8.94) (8.95) »ó  á ð c Qa î Qa í c k Qa k 4D k c ] Ä þkó þ ³ (8.96) ó (8.97) º (8.98) where a) follows from (8.54); b) follows from the data processing theorem since þ Ð í î Ð Ð þ ð ; c) follows from Lemma 8.13; d) follows from (8.79). From (8.61) and Fano’s inequality (cf. Corollary 2.48), we have G a D R A F T þ ô þ ð c z h Æ %« ¬I è s September 13, 2001, 6:27pm (8.99) D R A F T 165 Channel Capacity Therefore, from (8.98), «%¬I è z h à z Ä ,ø" %« ¬IÊ è è Ä æ «%¬I Ä w «Ï¬I è Æ h Æ Æ h á Æ Æ Æ (8.100) (8.101) (8.102) á á º where we have used (8.66) and (8.93), respectively to obtain the last two iná equalities. Dividing by and rearranging the terms, we have Ä h «%¬I è z ] h xyw Æ Ó á and from (8.92), we obtain ] (8.103) Ä Æ Ó º xywz h xyw s For any Ъ © á w ä Á (8.104) á , the Ð above inequality holds for all sufficiently large . Letting Á and then w , we conclude that à Ä s (8.105) This completes the proof for the converse of the channel coding theorem. From the above proof, we can obtainÄ an asymptotic bound on when the ] è is greater than . Consider (8.100) and obtain rate of the code Ó «%¬I Ä hx h %« ¬I è @ Æëá Ç Ä h x ] è s ] «%¬I Ó Æ (8.106) Ó Ä Ä hx ] «%¬I è # hx ] «%¬I è (8.107) Ä when is large. 
This asymptotic bound on , which is strictly positive if ½] «%¬I è , is illustrated in Figure 8.9. ] è in (8.106) implies that for all if «%¬I %$ bound Ä In fact, the lower for some , then for all h , by concatenating because if copies of the code, we ' & ($ obtain a code with the same rate and block length equal to such that , which is a contradiction to our conclusion that when is large. Therefore, if we use a code whose rate is greater than Then Ó Ç ] Æ Ó Ó á ä Ó ä Ó á Á á Ó ä Ç Ó á ä Á Á Á á the channel capacity, the probability of error is non-zero for all block lengths. The converse of the channel coding theorem we have proved is called the weak converse. A stronger version of this result can Ð Ð © called the strong converse á ä Á Ä be proved,è which says that as if there exists an such , h w Ç Æ ] that Ó «%¬I w for all á . D R A F T September 13, 2001, 6:27pm D R A F T 166 A FIRST COURSE IN INFORMATION THEORY P+ e 1 ) 1 *n log M C Figure 8.9. An asymptotic upper bound on ,.- . ACHIEVABILITY OF THE CHANNEL CAPACITY Ä 8.4 We have shown in the last section that the channel capacity is an upper Ä Ä bound on all achievable rates for a DMC. In thisà section, we show that the rate is achievable, which implies that any rate is achievable. ô¶ Consider a DMC ´MaRµ c , and denote the input and the output of the generic ³ D respectively. For every input distribution ´,a ¶ c , discrete channel by and ,³ë ó4D c is achievable by showing for large á the we will prove that the rate Âa existence of a channel code such that 1. the rate of the code is arbitrarily close to Âa 2. the maximal probability of error ø" æ Ê 4D c ; ³ëó is arbitrarily small. ¶ Ä achieves the ´,a c to be one which Then by choosing the input distribution ³ëó4D Ä , we conclude c channel capacity, i.e., that the rate is achievable. Âa ä Á Fix any w and let / be a small positive quantity to be specified later. Toward proving the existence of a desiredô ¶ code, we fix an input distribution ¶ ´,a c for the generic discrete channel ´MaRµ c , and let è be an even integer satisfying a 4D c x ®w z h «%¬I è z Qa 4D c x 0 w  ³ó á f ³ó º (8.108) á where is sufficiently large. We now describe a random coding scheme in the following steps: 4è è Ó c code randomly by generating 1. Construct the codebook 1 of an a ¶ c Ó . Denote codewords in ² independently and Ò3Ò3Ò identically according to ´,a í º4í a ® c º º4í a è c . these codewords by ah c á º 2. Reveal the codebook 1 to both the encoder and the decoder. D R A F T September 13, 2001, 6:27pm D R A F T 167 Channel Capacity is chosen from ì þ 3. A message according to the uniform distribution. í a c , which is then trans- þ þ is encoded into the codeword 4. The message mitted through the channel. |~} î 5. The channel outputs a sequence Û according to î í a c ô Ó þ O M´ aRµ k ¶ k c k ] ß î (cf. (8.73) and Remark 1 in Section 8.6). 6. The sequence is decoded to the message ÷ ÷ í ÷ there î does not exists Ä such that a a ÷ î is decoded to a constant message in which is decoded. ì ô í a ÷ c î c½¸32 4 q6587 and :587 c c9 ¸ 2 4 ð . Otherwise, if î a Ä (8.109) Ó º Ó º . Denote by þ the message to Ó Remark 1 There are a total of ² possible codebooks which can be constructed by the random procedure in Step 2, where we regard two codebooks whose sets of codewords are permutations of each other as two different codebooks. ô ô ; Remark 2 Strong typicality is used in defining the decoding function in Step 6. This is made possible by the assumption that the alphabets ² and ³ are finite. 
We now analyze the performance of this random coding scheme. Let <>=?= ð þ Û |} þhß (8.110) <>=?=(ß be the event of a decoding error. In the following, we analyze Û , the probability of a decoding error for the random code constructed above. For all h àj÷Üà è , define the event î A587 s a í a ÷ c c ¸@2 4 (8.111) ~| } |~} |~} ; <B=C= O <>=?= ú ÷ ú ÷ s (8.112) ù ] ÷ are identical for all ÷ by symmetry in the code con ù Now |} (ß Û <>=?= ô þ Since Û struction, we have ß Û þ ß ß <B=C=0ß D R A F T ß ô þ Û |~} Û Ó º Û |} | } Û Û ; ú h ù ] ú h <>=?= ô þ ß <>=?= ô þ ߺ |} Û ú ÷ þ September 13, 2001, 6:27pm ß (8.113) (8.114) D R A F T 168 A FIRST COURSE IN INFORMATION THEORY i.e., we can assume without loss of generalityî that the message 1 is chosen. Then decoding is correct if the received ù vector is decoded to® à the÷ message à è . 1.It does not occur for all This is the case if ] occurs but |~} that5 |} follows <B=C= Û ú h µô þ |~} which implies Û ô þ µô þ ´ µ ´ ú h <B=C= ô þ · |~} à ß ´ µ · µ Û By the union bound, we have Û ´ µ Û |~} ´ µ · ß ´ Û [ µ ´ Ò3Ò3Ò ´ ; ú h µ ô þ ߺ (8.115) ß Û à ´ ] ú| } h hx |} <>=?= ú h |~h} x ] |~} a ] [ [ ]ED [ D D <B=C= Û Ç ß · ´ µ Ò3Ò3Ò ´ Ò3Ò3Ò µ ô þ Û ú ; ú h c h ; ú ; h s µ ô þ µ µô þ Æ ß ß |~} ; ù (8.116) (8.117) (8.118) (8.119) ß ß ô þ D ñ h ] Ò3Ò3Ò ´ Û [ ù ú h s ô þ ß (8.120) ú h , then a í ah c î c are i.i.d. copies of the pair of generic random |~} JAEP (Theorem 5.8), for any F a 4D c . By the strong , q:587 ú ú h ía ah c î c ¸G 2 4 h zF (8.121) ] . ÷ ÷ c î c è í for sufficiently large . Second, if , h , then for ® a a 4D c , where are i.i.d. copies of the pair of generic random variables a D D î and have same marginal distributions , respectively, and D aretheindependent í ah c andasí a ÷ cand and because are independent and í depends only on|~} ah c . Therefore, ù |ú } h î q6 587 ú (8.122) a í a ÷ c c¸G2 4 h (8.123) 'HJI û L K M NAOQP R ´,a c ´,a cFs : 587 587 T 4 4 { c ¸ S 2 ¸ S 2 By the consistency of strong typicality, for , and a 587 A 4 ¸ 2 . By the strong AEP, all ´,a c and U ´,a c in the above summation l® WV l"X8 satisfy ´,a c (8.124) ]]\ ù \ ; þ á º First, if |} variables ä ³»º µ ô þ Û ß Ó º Û á ô þ ß à þ ¡à á ³ ³ ³ Ä Ä Á Ä º º ³ Ä Ä Ä Û ô þ ß Ó º Û ô þ ß Ï º Ó Ó Ó à Ó 5 If w , the received vector ^ is decoded to the Y Ë does not occur or Y[Z occurs for some constant message, which may happen to be the message 1. Therefore, the inequality in (8.115) is not an equality in general. D R A F T September 13, 2001, 6:27pm D R A F T Channel Capacity and ´,a c Ð where a ºcb Á Ð Á as / Ð ® l _V l[` Ó à º (8.125) . Again by the strong JAEP, Ó ô Ð 169 2 4 :587 ® WV ed I Ó ô à î º (8.126) |} as / . Then from (8.123), we have ù ú h ® l _V_ V I eVd ® lgl V_ VI l l"d Xel" Xl[® `l WV l[` ® ji ® l Wh8 j i l d l"Xl[` ® l W h8 "l k Á where f Á ô þ Û ß î Ó à Ó Ò Ó Ò (8.127) (8.128) (8.129) (8.130) î Ó Ó Ó º l where Ð Æ f Ð b Æ a Á (8.131) Á as / . From the upper bound in (8.108), we j have i è z ® Wh8 l>nm Ó s (8.132) è Using (8.121), (8.130), and the above upper bound on , it follows from |~} that (8.114) and (8.120) ji oi <>=?=(ß Û l Ð Since Á Ð Á as / Á Wh8 l>nm l 'h8 ® l® nm l"k s ® Æ F F Æ Ó l nm l"k ® , so that Ó l"k (8.133) Ó 0 Ó |} w x Ð l Á Û á Ò (8.134) , for sufficiently small / , we have for any w follows from (8.134) that ä z ä as <>=?=0ß á Á Ð © (8.135) . Then by letting z ®w |} 6z F · , it (8.136) for sufficiently large . <>=?=0ß Ó The main idea of the above analysis on Û is the following. 
In conè codewords in ² ô ¶according structing codebook, we randomly generate ¶ c Ó , the , ´ a to and one of the codewords is sent through the channel ´,aRµ c . When á is large, with high probability, the received sequence is jointly typical with ¶ º µ c . If the number è the codeword sent with respect to of codewords , ´ a á ³ó4D c , then the probability that the received grows with at a rate less than Âa D R A F T September 13, 2001, 6:27pm D R A F T 170 A FIRST COURSE IN INFORMATION THEORY sequence is jointly typical with a codeword other than the one sent through the channel is negligible. Accordingly, the message can be decoded correctly with probability arbitrarily close to 1. |~} In constructing the codebook by the randomß procedure in Step 2, we choose a codebook 1 with a certain probability Û 1 from the ensemble of all possible codebooks. By conditioning on the|~codebook |~} } |~} chosen, we have Û |~} O cp <>=?=*ß Û |~} |~} <B=C=0ß ß 1 <B=C= Û <B=C= ô ô 1 ߺ (8.137) ß is a weighted average of ô Û 1 over all 1 in the ensemble i.e., Û <B=C= ß is the average of all possible codebooks, where Û 1 |~} probability of error of the code, i.e., , when the codebook 1 is chosen (cf. Definition 8.9). The <B=C=0ß reader should compare the two different expansions of Û in (8.137) and (8.112). 1y¾ such that Therefore, there exists |} at least one codebook |} <>=?= Û ô 1 ¾ à ß Thus we have shown that for any w á º4è c code such that a á (cf. (8.108)) and ä h «%¬I è Û Á <>=?=*ß z ®w s (8.138) , there exists for sufficiently large a 4D c x ®w ä  ³ó á an (8.139) z ®w s (8.140) 4D c is achievable We are still one step away proving that the rate ÂQa ø" æ Ê from because we require that instead of is arbitrarily small. Toward this end, we write (8.140) as ³ó ; è h ù ø ù z ®w ] ; è ù ø°ù r z q t ® s ws ] º or (8.141) (8.142) Upon ordering the codewords according to their conditional probabilities of error, we observe that the conditional probabilities of error of the better half è codewords of the are less than w , otherwise the conditional probabilities of error of; the worse half of the codewords are at least w , and they contribute at least a [ c w to the summation in (8.142), which is a contradiction. D R A F T September 13, 2001, 6:27pm D R A F T 171 Channel Capacity Thus by discarding the worse half of the ø" codewords in 1 ¾ , for the resulting Ê æ codebook, the maximal probability of error is less than w . Using (8.139) and considering á h «%¬I è ® h «Ï¬I è x h 4D c x w x h q a ®s 4 D c xyw Qa á ä á * ä Â á ³ó á ³ó (8.143) (8.144) (8.145) when is sufficiently large, we see that the rate of the resulting code is greater ³ó4D ³ëó4D c xyw . Hence, we conclude that the rate c is achievable. than ÂQa Âa ¶ c Ä , ´ a Ä Finally, upon letting the input distribution be one which achieves the ³ó4D c , we have proved that the rate is achievchannel capacity, i.e., ÂQa able. This completes the proof of the direct part of the channel coding theorem. 8.5 A DISCUSSION In the last two sections, we have proved the channel coding theorem which Ä asserts that reliable communication through a DMC at rate is possible if and only if z , the channel capacity. By reliable communicationá at rate , we mean that the size of the message set grows exponentially with at rate , Ä with probability arbitrarily close while the message can be decoded correctly Ъ © á to 1 as . Therefore, the capacity is a fundamental characterization of a DMC. 
The capacity of a noisy channel is analogous to the capacity of a water pipe in the following way. For a water pipe, if we pump water through the pipe at a rate higher than its capacity, the pipe would burst and water would be lost. For a communication channel, if we communicate through the channel at a rate higher than the capacity, the probability of error is bounded away from zero, i.e., information is lost. In proving the direct part of the channel coding theorem, weÄ showed that and whose there exists a channel code whose rate is arbitrarily close to probability of error is arbitrarily close to zero. Moreover, the existence of such á a code is guaranteed only when the block length is large. However, the proof does not indicate how we can find such a codebook. For this reason, the proof we gave is called an existence proof (as oppose to a constructive proof). á For a fixed block length , we in principle can search through the ensemble of all possible codebooks for a good one, but this is quite prohibitive even for á small because the number of all possible codebooks growsá double exponená º4è Ó c Ä a the total number of all possible codebooks tially with ô . ô Specifically, ; è is approximately is equal toÓ(u ² . When the rate of the code is close to , equal to ® Ó . Therefore, the number of codebooks we need to search through ô ô [ Ïwv is about ² . D R A F T September 13, 2001, 6:27pm D R A F T 172 A FIRST COURSE IN INFORMATION THEORY Nevertheless, the proof of the direct part of the channel coding theorem does indicate that if we generate a|~codebook randomly as prescribed, the codebook } is most likely to be good. More precisely, we now show that the probabilityä of <B=C= ô ß Á Û is greater than any prescribed x choosing a code 1 such that 1 á |~} small when is sufficiently |~} |~} large. Consider is arbitrarily p <B=C=0ß Û Û ý Ç Vzy" {|{8 p c~ Vzy" {|{8 p c~ ý } | } From (8.138), we have p Û for any w Á when á } Û 1 1 1 ý <>=?= ô 1 ô 1 ß ß (8.147) (8.148) (8.149) <B=C=(ß x s z ®w Û } <>=?= ߺ Û à ß <>=?=*ß Vzy" {|{8 p c~ Û ß 1 |~} 1 ß (8.150) (8.151) is sufficiently large. |~} Then p ß |~} Û Û |} which implies ä |} ô |~} ß 1 <>=?= Û Û Û (8.146) ß 1 |~} } } ý ß |~} 1 |~} ý zy {|{ p c~ p ô Û zy {|{ p c~ p ý x <B=C= |~} \"} Æ p Û zy {|{ p p Ç ß 1 à w ®x s (8.152) Since x is fixed, this upper bound can be made arbitrarily small by choosing a sufficiently small w . Although the proof of the direct part of the channel coding theorem does not provide an explicit construction of a good code, it does give much insight into what a good code is like. Figure 8.10 is an illustration of a channel code which Ä distribution ´,a ¶ c achieves the channel capacity. Here we assume that the input ³ó4D c is one which achieves the channel capacity, i.e., ÂQa Ó . The idea ¶ is that most of the codewords are typical sequences in ² with respect to ´,a c . (For this reason, the repetition code is not a good code.) When such a codeword is ' the channel,Ó the received sequence is likely to be one of Ó V through transmitted ® about are jointly typical with the transmitted 'sequences to ´,ina ¶ ³ º µ c which Ó V respect codeword with . The associationÓ between a codeword and the about ® corresponding sequences in ³ is shown as a cone in the figure. As we require that the probability of decoding error is small, the cones D R A F T September 13, 2001, 6:27pm D R A F T 173 Channel Capacity p ( y | x ) n H (Y | X ) 2 n I( X;Y ) n H (Y ) 2 sequences in T[nY ] ... 2 codewords in T[nX ] Figure 8.10. 
A channel code which achieves capacity. essentially do not overlap with each other. Since the number of typical sequences with respect to " is about %('E , the number of codewords cannot exceed about %BW ('T > (wj ( (8.153) This is consistent with the converse of the channel coding theorem. The direct part of the channel coding theorem says that when is large, as long as the number of codewords generated randomly is not more than about % W¡[¢| , the overlap among the cones is negligible with very high probability. Therefore, instead of searching through the ensemble of all possible codebooks for a good one, we can generate a codebook randomly, and it is likely to be good. However, such a code is difficult to use due to the following implementation issues. A codebook with block length and rate £ consists of ¤ symbols from the input alphabet ¥ . This means that the size of the codebook, i.e., the amount of storage required to store the codebook, grows exponentially with . This also makes the encoding process very inefficient. Another issue is regarding the computation required for decoding. Based on the sequence received at the output of the channel, the decoder needs to decide which of the about %¤ codewords was the one transmitted. This requires an exponential amount of computation. In practice, we are satisfied with the reliability of communication once it exceeds a certain level. Therefore, the above implementation issues may eventually be resolved with the advancement of microelectronics. But before then, we still have to deal with these issues. For this reason, the entire field of coding theory has been developed since the 1950’s. Researchers in this field are devoted to searching for good codes and devising efficient decoding algorithms. D R A F T September 13, 2001, 6:27pm D R A F T 174 A FIRST COURSE IN INFORMATION THEORY « W Message ¦ Encoder X i =f i (W, Y i- 1 ) Channel § p (y ¨ |x © ) Yi ª Decoder W Estimate of message Figure 8.11. A channel code with feedback. In fact, almost all the codes studied in coding theory are linear codes. By taking advantage of the linear structures of these codes, efficient encoding and decoding can be achieved. In particular, Berrou el al [25] have proposed a linear code called the turbo code6 in 1993, which is now generally believed to be the practical way to achieve the channel capacity. Today, channel coding has been widely used in home entertainment systems (e.g., audio CD and DVD), computer storage systems (e.g., CD-ROM, hard disk, floppy disk, and magnetic tape), computer communication, wireless communication, and deep space communication. The most popular channel codes used in existing systems include the Hamming code, the Reed-Solomon code7 , the BCH code, and convolutional codes. We refer the interested reader to textbooks on coding theory [29] [124] [202] for discussions of this subject. 8.6 FEEDBACK CAPACITY Feedback is very common in practical communication systems for correcting possible errors which occur during transmission. As an example, during a telephone conversation, we very often have to request the speaker to repeat due to poor voice quality of the telephone line. As another example, in data communication, the receiver may request a packet to be retransmitted if the parity check bits received are incorrect. 
8.6 FEEDBACK CAPACITY

Feedback is very common in practical communication systems for correcting possible errors which occur during transmission. As an example, during a telephone conversation, we often have to ask the speaker to repeat what was said because of the poor voice quality of the telephone line. As another example, in data communication, the receiver may request a packet to be retransmitted if the parity check bits received are incorrect.

In general, when feedback from the receiver is available at the transmitter, the transmitter can at any time decide what to transmit next based on the feedback so far, and can potentially transmit information through the channel reliably at a higher rate. In this section, we study a model in which a DMC is used with complete feedback. The block diagram for the model is shown in Figure 8.11.

[Figure 8.11. A channel code with feedback: message $W$, encoder with $X_i = f_i(W, \mathbf{Y}_{i-1})$, channel $p(y|x)$, output $Y_i$, decoder, estimate $\hat W$.]

In this model, the symbol $Y_i$ received at the output of the channel at time $i$ is available instantaneously at the encoder without error. Then, depending on the message $W$ and all the previous feedback $Y_1, Y_2, \ldots, Y_i$, the encoder decides the value of $X_{i+1}$, the next symbol to be transmitted. Such a channel code is formally defined below.

Definition 8.14  An $(n, M)$ code with complete feedback for a discrete memoryless channel with input alphabet $\mathcal{X}$ and output alphabet $\mathcal{Y}$ is defined by encoding functions

$f_i: \{1, 2, \ldots, M\} \times \mathcal{Y}^{i-1} \to \mathcal{X} \qquad (8.154)$

for $1 \le i \le n$ and a decoding function

$g: \mathcal{Y}^n \to \{1, 2, \ldots, M\}. \qquad (8.155)$

We will use $\mathbf{Y}_i$ to denote $(Y_1, Y_2, \ldots, Y_i)$ and let $X_i = f_i(W, \mathbf{Y}_{i-1})$. We note that a channel code without feedback is a special case of a channel code with complete feedback because for the latter, the encoder can simply ignore the feedback.
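The structure in Definition 8.14 can be made concrete with a small simulation harness. The sketch below is an editorial illustration with hypothetical function names, not part of the text: it runs one transmission of an $(n, M)$ code with complete feedback over a simulated DMC, where at each time the encoder sees the message together with all past channel outputs.

```python
import random

def run_feedback_code(encoders, decoder, channel, w, n, rng):
    """Simulate one transmission of an (n, M) code with complete feedback.

    encoders[i](w, y_past) plays the role of f_{i+1}(W, Y_1, ..., Y_i);
    channel(x, rng) draws one output symbol according to p(y | x);
    decoder(y) plays the role of g(Y_1, ..., Y_n).
    """
    y = []
    for i in range(n):
        x = encoders[i](w, tuple(y))      # X_i depends on W and the feedback so far
        y.append(channel(x, rng))         # Y_i is fed back to the encoder without error
    return decoder(tuple(y))

# A trivial example: ignore the feedback (a code without feedback is a special case).
n = 3
encoders = [lambda w, y_past: w for _ in range(n)]            # repeat the message bit
decoder = lambda y: int(sum(y) > n / 2)                       # majority vote
bsc = lambda x, rng: x ^ (rng.random() < 0.1)                 # BSC with crossover 0.1
print(run_feedback_code(encoders, decoder, bsc, w=1, n=n, rng=random.Random(0)))
```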
Definition 8.15  A rate $R$ is achievable with complete feedback for a discrete memoryless channel $p(y|x)$ if for any $\epsilon > 0$, there exists for sufficiently large $n$ an $(n, M)$ code with complete feedback such that

$\frac{1}{n}\log M > R - \epsilon \qquad (8.156)$

and

$\lambda_{max} < \epsilon. \qquad (8.157)$

Definition 8.16  The feedback capacity, $C_{FB}$, of a discrete memoryless channel is the supremum of all the rates achievable by codes with complete feedback.

Proposition 8.17  The supremum in the definition of $C_{FB}$ in Definition 8.16 is the maximum.

Proof  Consider rates $R^{(k)}$ which are achievable with complete feedback such that $\lim_{k\to\infty} R^{(k)} = R$. Then for any $\epsilon > 0$ and any $k$, there exists for sufficiently large $n$ an $(n, M^{(k)})$ code with complete feedback such that $\frac{1}{n}\log M^{(k)} > R^{(k)} - \epsilon$ and $\lambda_{max}^{(k)} < \epsilon$. By virtue of the convergence of $R^{(k)}$ to $R$, let $k(\epsilon)$ be an integer such that $|R - R^{(k)}| < \epsilon$ for all $k > k(\epsilon)$, which implies $R^{(k)} > R - \epsilon$. Then for all $k > k(\epsilon)$,

$\frac{1}{n}\log M^{(k)} > R^{(k)} - \epsilon > R - 2\epsilon.$

Since $\epsilon$ is arbitrary, it follows that $R$ is achievable with complete feedback. This implies that the supremum in Definition 8.16, which can be achieved, is in fact the maximum.

Since a channel code without feedback is a special case of a channel code with complete feedback, any rate $R$ achievable by the former is also achievable by the latter. Therefore,

$C_{FB} \ge C. \qquad (8.164)$

The interesting question is whether $C_{FB}$ is greater than $C$. The answer surprisingly turns out to be no for a DMC, as we now show.

From the description of a channel code with complete feedback, we obtain the dependency graph for the random variables $W, \mathbf{X}, \mathbf{Y}, \hat W$ in Figure 8.12.

[Figure 8.12. The dependency graph for a channel code with feedback: $W$ and the fed-back outputs determine each $X_i$; each $Y_i$ depends on $X_i$ through $p(y|x)$; $\hat W$ is a function of $Y_1, \ldots, Y_n$.]

From this dependency graph, we see that

$q(w, \mathbf{x}, \mathbf{y}, \hat w) = q(w)\left(\prod_{i=1}^{n} q(x_i | w, \mathbf{y}_{i-1})\, p(y_i | x_i)\right) q(\hat w | \mathbf{y}) \qquad (8.165)$

for all $(w, \mathbf{x}, \mathbf{y}, \hat w) \in \mathcal{W} \times \mathcal{X}^n \times \mathcal{Y}^n \times \mathcal{W}$ such that $q(w, \mathbf{y}_{i-1}) > 0$ for $1 \le i \le n$ and $q(\mathbf{y}) > 0$, where $\mathbf{y}_{i-1} = (y_1, y_2, \ldots, y_{i-1})$. Note that $q(x_i | w, \mathbf{y}_{i-1})$ and $q(\hat w | \mathbf{y})$ are deterministic.

Lemma 8.18  For all $1 \le i \le n$,

$(W, \mathbf{Y}_{i-1}) \to X_i \to Y_i \qquad (8.166)$

forms a Markov chain.

Proof  The dependency graph for the random variables $W$, $\mathbf{X}_i$, and $\mathbf{Y}_i$ is shown in Figure 8.13.

[Figure 8.13. The dependency graph for $W$, $\mathbf{X}_i$, and $\mathbf{Y}_i$.]

Denote the set of nodes $W$, $\mathbf{X}_{i-1}$, and $\mathbf{Y}_{i-1}$ by $Z$. Then we see that all the edges from $Z$ end at $X_i$, and the only edge from $X_i$ ends at $Y_i$. This means that $Y_i$ depends on $(W, \mathbf{X}_{i-1}, \mathbf{Y}_{i-1})$ only through $X_i$, i.e., $(W, \mathbf{X}_{i-1}, \mathbf{Y}_{i-1}) \to X_i \to Y_i$ forms a Markov chain, or

$I(W, \mathbf{X}_{i-1}, \mathbf{Y}_{i-1}; Y_i | X_i) = 0.$

This can be formally justified by Proposition 2.9, and the details are omitted here. Since

$0 = I(W, \mathbf{X}_{i-1}, \mathbf{Y}_{i-1}; Y_i | X_i) = I(W, \mathbf{Y}_{i-1}; Y_i | X_i) + I(\mathbf{X}_{i-1}; Y_i | W, X_i, \mathbf{Y}_{i-1})$

and mutual information is nonnegative, we obtain $I(W, \mathbf{Y}_{i-1}; Y_i | X_i) = 0$, or $(W, \mathbf{Y}_{i-1}) \to X_i \to Y_i$ forms a Markov chain. The lemma is proved.

From the definition of $C_{FB}$ and by virtue of Proposition 8.17, if $R \le C_{FB}$, then $R$ is a rate achievable with complete feedback. We will show that if a rate $R$ is achievable with complete feedback, then $R \le C$. If so, then $R \le C_{FB}$ implies $R \le C$, which can be true if and only if $C_{FB} \le C$. Then from (8.164), we can conclude that $C_{FB} = C$.

Let $R$ be a rate achievable with complete feedback, i.e., for any $\epsilon > 0$, there exists for sufficiently large $n$ an $(n, M)$ code with complete feedback such that

$\frac{1}{n}\log M > R - \epsilon \qquad (8.173)$

and $\lambda_{max} < \epsilon$. We bound $\log M$ as in the proof of the converse of the channel coding theorem. First,

$I(W; \mathbf{Y}) = H(\mathbf{Y}) - H(\mathbf{Y}|W) = H(\mathbf{Y}) - \sum_{i=1}^{n} H(Y_i | \mathbf{Y}_{i-1}, W) \stackrel{a)}{=} H(\mathbf{Y}) - \sum_{i=1}^{n} H(Y_i | \mathbf{Y}_{i-1}, W, X_i) \stackrel{b)}{=} H(\mathbf{Y}) - \sum_{i=1}^{n} H(Y_i | X_i) \le \sum_{i=1}^{n} H(Y_i) - \sum_{i=1}^{n} H(Y_i | X_i) = \sum_{i=1}^{n} I(X_i; Y_i) \le nC, \qquad (8.182)$

where a) follows because $X_i$ is a function of $W$ and $\mathbf{Y}_{i-1}$, and b) follows from Lemma 8.18. Second, since $\hat W$ is a function of $\mathbf{Y}$,

$H(W | \mathbf{Y}) = H(W | \mathbf{Y}, \hat W) \le H(W | \hat W).$

Thus

$\log M = H(W) = I(W; \mathbf{Y}) + H(W | \mathbf{Y}) \le nC + H(W | \hat W),$

which is the same as (8.98). Then by (8.173) and an application of Fano's inequality, we conclude as in the proof of the converse of the channel coding theorem that $R \le C$. Hence, we have proved that $C_{FB} = C$.

Remark 1  The proof of the converse of the channel coding theorem in Section 8.3 depends critically on the Markov chain

$W \to \mathbf{X} \to \mathbf{Y} \to \hat W \qquad (8.186)$

and the relation in (8.73) (the latter implies Lemma 8.13). Neither holds in general in the presence of feedback.

Remark 2  The proof of $C_{FB} = C$ in this section is also a proof of the converse of the channel coding theorem, so we actually do not need the proof in Section 8.3. However, the proof here and the proof in Section 8.3 have very different spirits. Without comparing the two proofs, one cannot fully appreciate the subtlety of the result that feedback does not increase the capacity of a DMC.

Remark 3  Although feedback does not increase the capacity of a DMC, the availability of feedback very often makes coding much simpler.
For some channels, communication through the channel with zero probability of error can be achieved in the presence of feedback by using a variable-length channel code. This is illustrated in the next example.

Example 8.19  Consider the binary erasure channel in Example 8.5 whose capacity is $1 - \gamma$, where $\gamma$ is the erasure probability. In the presence of complete feedback, for every information bit to be transmitted, the encoder can transmit the same information bit through the channel repeatedly until an erasure does not occur, i.e., until the information bit is received correctly. The number of uses of the channel it takes to transmit an information bit correctly then has a truncated geometric distribution whose mean is $(1-\gamma)^{-1}$. Therefore, the effective rate at which information can be transmitted through the channel is $1 - \gamma$. In other words, the channel capacity is achieved by using a very simple variable-length code. Moreover, the channel capacity is achieved with zero probability of error. In the absence of feedback, the rate $1 - \gamma$ can also be achieved, but with an arbitrarily small probability of error and a much more complicated code.
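The scheme in Example 8.19 is easy to check numerically. The following sketch (an illustration, not from the text) repeats each information bit over a simulated erasure link until it gets through and measures the resulting effective rate, which concentrates around $1 - \gamma$.

```python
import random

def channel_uses_with_feedback(num_bits, erasure_prob, rng):
    """Total channel uses needed to deliver num_bits information bits when each bit is
    retransmitted until it is not erased (the scheme of Example 8.19)."""
    uses = 0
    for _ in range(num_bits):
        while True:
            uses += 1
            if rng.random() >= erasure_prob:   # not erased; feedback tells the encoder to move on
                break
    return uses

rng = random.Random(0)
k, gamma = 100000, 0.3
print(k / channel_uses_with_feedback(k, gamma, rng))   # effective rate, close to 1 - gamma = 0.7
```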
8.7 SEPARATION OF SOURCE AND CHANNEL CODING

We have so far considered the situation in which we want to convey a message through a DMC, where the message is randomly selected from a finite set according to the uniform distribution. However, in most situations, we want to convey an information source through a DMC. Let $\{U_k\}$ be a stationary ergodic information source with entropy rate $H$. Denote the common alphabet by $\mathcal{U}$ and assume that $\mathcal{U}$ is finite. To convey $\{U_k\}$ through the channel, we can employ a source code with rate $R_s$ and a channel code with rate $R_c$, where $R_s < R_c$, as shown in Figure 8.14.

[Figure 8.14. Separation of source coding and channel coding: $\mathbf{U}$, source encoder, $W$, channel encoder, $\mathbf{X}$, channel $p(y|x)$, $\mathbf{Y}$, channel decoder, $\hat W$, source decoder, $\hat{\mathbf{U}}$.]

Let $f^s$ and $g^s$ be respectively the encoding function and the decoding function of the source code, and $f^c$ and $g^c$ be respectively the encoding function and the decoding function of the channel code. The block of $n$ information symbols $\mathbf{U} = (U_{-(n-1)}, U_{-(n-2)}, \ldots, U_0)$ is first encoded by the source encoder into an index

$W = f^s(\mathbf{U}),$

called the source codeword. Then $W$ is mapped by the channel encoder to a distinct channel codeword

$\mathbf{X} = f^c(W),$

where $\mathbf{X} = (X_1, X_2, \ldots, X_n)$. This is possible because there are about $2^{nR_s}$ source codewords and about $2^{nR_c}$ channel codewords, and we assume that $R_s < R_c$. Then $\mathbf{X}$ is transmitted through the DMC $p(y|x)$, and the sequence $\mathbf{Y} = (Y_1, Y_2, \ldots, Y_n)$ is received. Based on $\mathbf{Y}$, the channel decoder first estimates $W$ as $\hat W = g^c(\mathbf{Y})$. Finally, the source decoder decodes $\hat W$ to

$\hat{\mathbf{U}} = g^s(\hat W). \qquad (8.190)$

For this scheme, an error occurs if $\hat{\mathbf{U}} \ne \mathbf{U}$, and we denote the probability of error by $P_e$.

We now show that if $H < C$, the capacity of the DMC $p(y|x)$, then it is possible to convey $\mathbf{U}$ through the channel with an arbitrarily small probability of error. First, we choose $R_s$ and $R_c$ such that

$H < R_s < R_c < C.$

Observe that if $\hat W = W$ and $g^s(W) = \mathbf{U}$, then from (8.190), $\hat{\mathbf{U}} = g^s(\hat W) = g^s(W) = \mathbf{U}$, i.e., an error does not occur. In other words, if an error occurs, either $\hat W \ne W$ or $g^s(W) \ne \mathbf{U}$. Then by the union bound, we have

$P_e \le \Pr\{\hat W \ne W\} + \Pr\{g^s(W) \ne \mathbf{U}\}.$

For any $\epsilon > 0$ and sufficiently large $n$, by the Shannon-McMillan-Breiman theorem, there exists a source code such that

$\Pr\{g^s(W) \ne \mathbf{U}\} \le \epsilon.$

By the channel coding theorem, there exists a channel code such that $\lambda_{max} \le \epsilon$, where $\lambda_{max}$ is the maximal probability of error. This implies

$\Pr\{\hat W \ne W\} = \sum_{w} \Pr\{\hat W \ne W | W = w\}\Pr\{W = w\} \le \lambda_{max} \le \epsilon.$

Combining the last two bounds, we have $P_e \le 2\epsilon$. Therefore, we conclude that as long as $H < C$, it is possible to convey $\{U_k\}$ through the DMC reliably.

In the scheme we have discussed, source coding and channel coding are separated. In general, source coding and channel coding can be combined; this technique is called joint source-channel coding. It is then natural to ask whether it is possible to convey information through the channel reliably at a higher rate by using joint source-channel coding. In the rest of the section, we show that the answer is no, to the extent that for asymptotic reliability we must have $H \le C$. However, whether asymptotic reliability can be achieved for $H = C$ depends on the specific information source and channel.

We base our discussion on the general assumption that complete feedback is available at the encoder, as shown in Figure 8.15.

[Figure 8.15. Joint source-channel coding: $\mathbf{U}$, source-channel encoder with $X_i = f_i^{sc}(\mathbf{U}, \mathbf{Y}_{i-1})$, channel $p(y|x)$, $Y_i$, source-channel decoder, $\hat{\mathbf{U}}$.]

Let $f_i^{sc}$, $1 \le i \le n$, be the encoding functions and $g^{sc}$ be the decoding function of the source-channel code. Then

$X_i = f_i^{sc}(\mathbf{U}, \mathbf{Y}_{i-1})$

for $1 \le i \le n$, where $\mathbf{Y}_{i-1} = (Y_1, Y_2, \ldots, Y_{i-1})$, and $\hat{\mathbf{U}} = g^{sc}(\mathbf{Y})$, where $\mathbf{U} = (U_{-(n-1)}, U_{-(n-2)}, \ldots, U_0)$. In exactly the same way as we proved (8.182) in the last section, we can prove that

$I(\mathbf{U}; \mathbf{Y}) \le nC.$

Since $\hat{\mathbf{U}}$ is a function of $\mathbf{Y}$,

$I(\mathbf{U}; \hat{\mathbf{U}}) \le I(\mathbf{U}; \mathbf{Y}) \le nC.$

For any $\epsilon > 0$, $n(H - \epsilon) \le H(\mathbf{U})$ for sufficiently large $n$. Then

$n(H - \epsilon) \le H(\mathbf{U}) = H(\mathbf{U} | \hat{\mathbf{U}}) + I(\mathbf{U}; \hat{\mathbf{U}}) \le H(\mathbf{U} | \hat{\mathbf{U}}) + nC.$

Applying Fano's inequality (Corollary 2.48), we obtain

$n(H - \epsilon) \le 1 + n P_e \log|\mathcal{U}| + nC,$

or

$H - \epsilon \le \frac{1}{n} + P_e \log|\mathcal{U}| + C.$

For asymptotic reliability, $P_e \to 0$ as $n \to \infty$. Therefore, by letting $n \to \infty$ and then $\epsilon \to 0$, we conclude that

$H \le C.$

This result, sometimes called the separation theorem for source and channel coding, says that asymptotic optimality can be achieved by separating source coding and channel coding. This theorem has significant engineering implications because the source code and the channel code can be designed separately without losing asymptotic optimality: we only need to design the best source code for the information source and the best channel code for the channel. Moreover, separation of source coding and channel coding facilitates the transmission of different information sources on the same channel, because we only need to change the source code for each information source. Likewise, it facilitates the transmission of the same information source on different channels, because we only need to change the channel code for each channel.
We remark that although asymptotic optimality can be achieved by separating source coding and channel coding, for finite block length the probability of error can generally be reduced by using joint source-channel coding.

PROBLEMS

In the following, $\mathbf{X} = (X_1, X_2, \ldots, X_n)$, $\mathbf{x} = (x_1, x_2, \ldots, x_n)$, and so on.

1. Show that the capacity of a DMC with complete feedback cannot be increased by using probabilistic encoding and/or decoding schemes.

2. Memory increases capacity  Consider a BSC with crossover probability $0 < \epsilon < 1$ represented by $Y_i = X_i + Z_i \bmod 2$, where $X_i$, $Y_i$, and $Z_i$ are respectively the input, the output, and the noise random variable at time $i$. Then $\Pr\{Z_i = 0\} = 1 - \epsilon$ and $\Pr\{Z_i = 1\} = \epsilon$ for all $i$. We make no assumption that the $Z_i$ are i.i.d., so that the channel may have memory.
   a) Prove that $I(\mathbf{X}; \mathbf{Y}) \le n - h_b(\epsilon)$.
   b) Show that the upper bound in a) can be achieved by letting the $X_i$ be i.i.d. bits taking the values 0 and 1 with equal probability and $Z_1 = Z_2 = \cdots = Z_n$.
   c) Show that with the assumptions in b), $I(\mathbf{X}; \mathbf{Y}) > nC$, where $C = 1 - h_b(\epsilon)$ is the capacity of the BSC if it is memoryless.

3. Show that the capacity of a channel can be increased by feedback if the channel has memory.

4. In Remark 1 toward the end of Section 8.6, it was mentioned that in the presence of feedback, both the Markov chain $W \to \mathbf{X} \to \mathbf{Y} \to \hat W$ and Lemma 8.13 do not hold in general. Give examples to substantiate this remark.

5. Prove that when a DMC is used with complete feedback,

$\Pr\{Y_i = y_i \mid \mathbf{X}_i = \mathbf{x}_i, \mathbf{Y}_{i-1} = \mathbf{y}_{i-1}\} = \Pr\{Y_i = y_i \mid X_i = x_i\}$

for all $i \ge 1$. This relation, which is a consequence of the causality of the code, says that given the current input, the current output does not depend on the past inputs and outputs of the DMC.

6. Let

$P(\epsilon) = \begin{bmatrix} 1-\epsilon & \epsilon \\ \epsilon & 1-\epsilon \end{bmatrix}$

be the transition matrix for a BSC with crossover probability $\epsilon$. Define $a * b = a(1-b) + b(1-a)$ for $0 \le a, b \le 1$.
   a) Prove that a DMC with transition matrix $P(\epsilon_1)P(\epsilon_2)$ is equivalent to a BSC with crossover probability $\epsilon_1 * \epsilon_2$. Such a channel is the cascade of two BSC's with crossover probabilities $\epsilon_1$ and $\epsilon_2$, respectively.
   b) Repeat a) for a DMC with transition matrix $P(\epsilon_2)P(\epsilon_1)$.
   c) Prove that $1 - h_b(\epsilon_1 * \epsilon_2) \le \min\big(1 - h_b(\epsilon_1),\, 1 - h_b(\epsilon_2)\big)$. This means that the capacity of the cascade of two BSC's is at most the capacity of either of the two BSC's.
   d) Prove that a DMC with transition matrix $P(\epsilon)^n$ is equivalent to a BSC with crossover probability $\frac{1}{2}\big(1 - (1-2\epsilon)^n\big)$.

7. Symmetric channel  A DMC is symmetric if the rows of the transition matrix $p(y|x)$ are permutations of each other and so are the columns. Determine the capacity of such a channel. See Section 4.5 in Gallager [77] for a more general discussion.

8. Let $C_1$ and $C_2$ be the capacities of two DMC's with transition matrices $P_1$ and $P_2$, respectively, and let $C$ be the capacity of the DMC with transition matrix $P_1 P_2$. Prove that $C \le \min(C_1, C_2)$.

9. Two independent parallel channels  Let $C_1$ and $C_2$ be the capacities of two DMC's $p_1(y_1|x_1)$ and $p_2(y_2|x_2)$, respectively. Determine the capacity of the DMC

$p(y_1, y_2 | x_1, x_2) = p_1(y_1|x_1)\, p_2(y_2|x_2).$

Hint: Prove that $I(X_1, X_2; Y_1, Y_2) \le I(X_1; Y_1) + I(X_2; Y_2)$ if $p(y_1, y_2|x_1, x_2) = p_1(y_1|x_1)p_2(y_2|x_2)$.
10. Maximum likelihood decoding  In maximum likelihood decoding for a given channel and a given codebook, if a received sequence $\mathbf{y}$ is decoded to a codeword $\mathbf{x}$, then $\mathbf{x}$ maximizes $\Pr\{\mathbf{y}|\mathbf{x}'\}$ among all codewords $\mathbf{x}'$ in the codebook.
   a) Prove that maximum likelihood decoding minimizes the average probability of error.
   b) Does maximum likelihood decoding also minimize the maximal probability of error? Give an example if your answer is no.

11. Minimum distance decoding  The Hamming distance between two binary sequences $\mathbf{x}$ and $\mathbf{y}$, denoted by $d(\mathbf{x}, \mathbf{y})$, is the number of places where $\mathbf{x}$ and $\mathbf{y}$ differ. In minimum distance decoding for a memoryless BSC, if a received sequence $\mathbf{y}$ is decoded to a codeword $\mathbf{x}$, then $\mathbf{x}$ minimizes $d(\mathbf{x}', \mathbf{y})$ over all codewords $\mathbf{x}'$ in the codebook. Prove that minimum distance decoding is equivalent to maximum likelihood decoding if the crossover probability of the BSC is less than 0.5.

12. The following figure shows a communication system with two DMC's with complete feedback. The capacities of the two channels are respectively $C_1$ and $C_2$.

[Figure: $W$, Encoder 1, Channel 1, Encoder 2, Channel 2, Decoder 2, $\hat W$.]

   a) Give the dependency graph for all the random variables involved in the coding scheme.
   b) Prove that the capacity of the system is $\min(C_1, C_2)$.
   For the capacity of a network of DMC's, see Song et al. [186].

13. Binary arbitrarily varying channel  Consider a memoryless BSC whose crossover probability is time-varying. Specifically, the crossover probability $\epsilon(i)$ at time $i$ is an arbitrary value in $[\epsilon_1, \epsilon_2]$, where $0 \le \epsilon_1 < \epsilon_2 < 0.5$. Prove that the capacity of this channel is $1 - h_b(\epsilon_2)$. (Ahlswede and Wolfowitz [9].)

14. Consider a BSC with crossover probability $\epsilon \in [\epsilon_1, \epsilon_2]$, where $0 < \epsilon_1 < \epsilon_2 < 0.5$, but the exact value of $\epsilon$ is unknown. Prove that the capacity of this channel is $1 - h_b(\epsilon_2)$.

HISTORICAL NOTES

The concept of channel capacity was introduced in Shannon's original paper [173], where he stated the channel coding theorem and outlined a proof. The first rigorous proof was due to Feinstein [65]. The random coding error exponent was developed by Gallager [76] in a simplified proof. The converse of the channel coding theorem was proved by Fano [63], where he used an inequality now bearing his name. The strong converse was first proved by Wolfowitz [205]. An iterative algorithm for calculating the channel capacity, developed independently by Arimoto [14] and Blahut [27], will be discussed in Chapter 10. It was proved by Shannon [175] that the capacity of a discrete memoryless channel cannot be increased by feedback. The proof given here, based on dependency graphs, is inspired by Bayesian networks.

Chapter 9

RATE DISTORTION THEORY

Let $H$ be the entropy rate of an information source. By the source coding theorem, it is possible to design a source code with rate $R$ which reconstructs the source sequence $\mathbf{X} = (X_1, X_2, \ldots, X_n)$ with an arbitrarily small probability of error, provided $R > H$ and the block length $n$ is sufficiently large. However, there are situations in which we want to convey an information source with a source code whose rate is less than $H$. We are then motivated to ask: what is the best we can do when $R < H$?

A natural approach is to design a source code such that for part of the time the source sequence is reconstructed correctly, while for the other part of the time the source sequence is reconstructed incorrectly, i.e., an error occurs.
In designing such a code, we try to minimize the probability of error. However, this approach is not viable asymptotically, because the converse of the source coding theorem says that if $R < H$, then the probability of error inevitably tends to 1 as $n \to \infty$. Therefore, if $R < H$, no matter how the source code is designed, the source sequence is almost always reconstructed incorrectly when $n$ is large.

An alternative approach is to design a source code, called a rate distortion code, which reproduces the source sequence with distortion. In order to formulate the problem properly, we need a distortion measure between each source sequence and each reproduction sequence. We then try to design a rate distortion code which with high probability reproduces the source sequence with a distortion within a tolerance level. Clearly, a smaller distortion can potentially be achieved if we are allowed to use a higher coding rate. Rate distortion theory, the subject matter of this chapter, gives a characterization of the asymptotically optimal tradeoff between the coding rate of a rate distortion code for a given information source and the allowed distortion in the reproduction sequence with respect to a distortion measure.

9.1 SINGLE-LETTER DISTORTION MEASURES

Let $\{X_k, k \ge 1\}$ be an i.i.d. information source with generic random variable $X$. We assume that the source alphabet $\mathcal{X}$ is finite. Let $p(x)$ be the probability distribution of $X$, and we assume without loss of generality that the support of $X$ is equal to $\mathcal{X}$. Consider a source sequence

$\mathbf{x} = (x_1, x_2, \ldots, x_n) \qquad (9.1)$

and a reproduction sequence

$\hat{\mathbf{x}} = (\hat x_1, \hat x_2, \ldots, \hat x_n). \qquad (9.2)$

The components of $\hat{\mathbf{x}}$ can take values in $\mathcal{X}$, but more generally, they can take values in any finite set $\hat{\mathcal{X}}$ which may be different from $\mathcal{X}$. The set $\hat{\mathcal{X}}$, which is also assumed to be finite, is called the reproduction alphabet. To measure the distortion between $\mathbf{x}$ and $\hat{\mathbf{x}}$, we introduce the single-letter distortion measure and the average distortion measure.

Definition 9.1  A single-letter distortion measure is a mapping

$d: \mathcal{X} \times \hat{\mathcal{X}} \to \Re^+, \qquad (9.3)$

where $\Re^+$ is the set of nonnegative real numbers. (Note that $d(x, \hat x)$ is therefore finite for all $(x, \hat x)$.) The value $d(x, \hat x)$ denotes the distortion incurred when a source symbol $x$ is reproduced as $\hat x$.

Definition 9.2  The average distortion between a source sequence $\mathbf{x} \in \mathcal{X}^n$ and a reproduction sequence $\hat{\mathbf{x}} \in \hat{\mathcal{X}}^n$ induced by a single-letter distortion measure $d$ is defined by

$d(\mathbf{x}, \hat{\mathbf{x}}) = \frac{1}{n}\sum_{k=1}^{n} d(x_k, \hat x_k). \qquad (9.4)$

In Definition 9.2, we have used $d$ to denote both the single-letter distortion measure and the average distortion measure, but this abuse of notation should cause no ambiguity. Henceforth, we will refer to a single-letter distortion measure simply as a distortion measure.
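Definition 9.2 translates directly into code. The following short sketch (an illustration, not from the text) computes the average distortion for the two single-letter measures discussed in the examples that follow:

```python
def avg_distortion(x, x_hat, d):
    """Average distortion (1/n) * sum_k d(x_k, xhat_k) of Definition 9.2."""
    assert len(x) == len(x_hat)
    return sum(d(a, b) for a, b in zip(x, x_hat)) / len(x)

hamming = lambda a, b: 0 if a == b else 1      # Hamming distortion measure
square_error = lambda a, b: (a - b) ** 2       # square-error distortion measure

print(avg_distortion([0, 1, 1, 0], [0, 1, 0, 0], hamming))        # 0.25
print(avg_distortion([1.0, 2.0], [1.5, 2.0], square_error))       # 0.125
```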
Very often, the source sequence $\mathbf{x}$ represents quantized samples of a continuous signal, and the user attempts to recognize certain objects and derive meaning from the reproduction sequence $\hat{\mathbf{x}}$. For example, $\mathbf{x}$ may represent a video signal, an audio signal, or an image. The ultimate purpose of a distortion measure is to reflect the distortion between $\mathbf{x}$ and $\hat{\mathbf{x}}$ as perceived by the user. This goal is difficult to achieve in general because measurements of the distortion between $\mathbf{x}$ and $\hat{\mathbf{x}}$ must be made within context unless the symbols in $\mathcal{X}$ carry no physical meaning. Specifically, when the user derives meaning from $\hat{\mathbf{x}}$, the distortion in $\hat{\mathbf{x}}$ as perceived by the user depends on the context. For example, the perceived distortion is small for a portrait contaminated by a fairly large noise, while the perceived distortion is large for the image of a book page contaminated by the same noise. Hence, a good distortion measure should be context dependent. Although the average distortion is not necessarily the best way to measure the distortion between a source sequence and a reproduction sequence, it has the merit of being simple and easy to use. Moreover, rate distortion theory, which is based on the average distortion measure, provides a framework for data compression when distortion is inevitable.

Example 9.3  When the symbols in $\mathcal{X}$ and $\hat{\mathcal{X}}$ represent real values, a popular distortion measure is the square-error distortion measure, defined by

$d(x, \hat x) = (x - \hat x)^2. \qquad (9.5)$

The average distortion measure so induced is often referred to as the mean-square error.

Example 9.4  When $\mathcal{X}$ and $\hat{\mathcal{X}}$ are identical and the symbols in $\mathcal{X}$ do not carry any particular meaning, a frequently used distortion measure is the Hamming distortion measure, defined by

$d(x, \hat x) = 0$ if $x = \hat x$, and $d(x, \hat x) = 1$ if $x \ne \hat x$. $\qquad (9.6)$

The Hamming distortion measure indicates the occurrence of an error. In particular, for an estimate $\hat X$ of $X$, we have

$E\, d(X, \hat X) = \Pr\{X = \hat X\}\cdot 0 + \Pr\{X \ne \hat X\}\cdot 1 = \Pr\{X \ne \hat X\},$

i.e., the expectation of the Hamming distortion measure between $X$ and $\hat X$ is the probability of error. For $\mathbf{x} \in \mathcal{X}^n$ and $\hat{\mathbf{x}} \in \hat{\mathcal{X}}^n$, the average distortion $d(\mathbf{x}, \hat{\mathbf{x}})$ induced by the Hamming distortion measure gives the frequency of error in the reproduction sequence $\hat{\mathbf{x}}$.

Definition 9.5  For a distortion measure $d$ and for each $x \in \mathcal{X}$, let $\hat x^*(x) \in \hat{\mathcal{X}}$ minimize $d(x, \hat x)$ over all $\hat x \in \hat{\mathcal{X}}$. A distortion measure $d$ is said to be normal if

$c_x := d(x, \hat x^*(x)) = 0 \qquad (9.8)$

for all $x \in \mathcal{X}$.

The square-error distortion measure and the Hamming distortion measure are examples of normal distortion measures. Basically, a normal distortion measure is one which allows $X$ to be reproduced with zero distortion. Although a distortion measure $d$ is not normal in general, a normalization of $d$ can always be obtained by defining the distortion measure

$\tilde d(x, \hat x) = d(x, \hat x) - c_x \qquad (9.9)$

for all $(x, \hat x) \in \mathcal{X} \times \hat{\mathcal{X}}$. Evidently, $\tilde d$ is a normal distortion measure, and it is referred to as the normalization of $d$.

Example 9.6  Let $d$ be a distortion measure given by (rows indexed by $x = 1, 2$ and columns by $\hat x = 1, 2, 3$)

  $d$:   row $x=1$: 2  7  5     row $x=2$: 4  3  8

Then $\tilde d$, the normalization of $d$, is given by

  $\tilde d$:   row $x=1$: 0  5  3     row $x=2$: 1  0  5

Note that for every $x \in \mathcal{X}$, there exists an $\hat x \in \hat{\mathcal{X}}$ such that $\tilde d(x, \hat x) = 0$.
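The normalization in (9.9) is just a row-wise subtraction of minima, as in the following sketch (an illustration, not from the text; the matrix used here is arbitrary rather than the one of Example 9.6):

```python
import numpy as np

def normalize_distortion(d):
    """Normalization d~ of a distortion matrix d (rows: source symbols x, columns:
    reproduction symbols xhat): subtract from each row its minimum, cf. (9.9)."""
    d = np.asarray(d, dtype=float)
    return d - d.min(axis=1, keepdims=True)

d = np.array([[1.0, 3.0, 2.0],
              [5.0, 4.0, 6.0]])        # an arbitrary non-normal distortion measure
print(normalize_distortion(d))          # every row of the result contains a zero
```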
Let $\hat X$ be any estimate of $X$ which takes values in $\hat{\mathcal{X}}$, and denote the joint distribution of $X$ and $\hat X$ by $p(x, \hat x)$. Then

$E\, d(X, \hat X) = \sum_x \sum_{\hat x} p(x, \hat x)\, d(x, \hat x) = \sum_x \sum_{\hat x} p(x, \hat x)\big[\tilde d(x, \hat x) + c_x\big] = E\, \tilde d(X, \hat X) + \sum_x p(x)\, c_x \sum_{\hat x} p(\hat x | x) = E\, \tilde d(X, \hat X) + \Delta,$

where

$\Delta = \sum_x p(x)\, c_x \qquad (9.16)$

is a constant which depends only on $p(x)$ and $d$ but not on the conditional distribution $p(\hat x | x)$. In other words, for a given $X$ and a distortion measure $d$, the expected distortion between $X$ and an estimate $\hat X$ of $X$ is always reduced by a constant upon using $\tilde d$ instead of $d$ as the distortion measure. For reasons which will be explained in Section 9.3, it is sufficient for us to assume that a distortion measure is normal.

Definition 9.7  Let $\hat x^*$ minimize $E\, d(X, \hat x)$ over all $\hat x \in \hat{\mathcal{X}}$, and define

$D_{max} = E\, d(X, \hat x^*). \qquad (9.17)$

$\hat x^*$ is the best estimate of $X$ if we know nothing about $X$, and $D_{max}$ is the minimum expected distortion between $X$ and a constant estimate of $X$. The significance of $D_{max}$ can be seen by taking the reproduction sequence $\hat{\mathbf{X}}$ to be $(\hat x^*, \hat x^*, \ldots, \hat x^*)$. Since the $d(X_k, \hat x^*)$ are i.i.d., by the weak law of large numbers

$d(\mathbf{X}, \hat{\mathbf{X}}) = \frac{1}{n}\sum_{k=1}^{n} d(X_k, \hat x^*) \to E\, d(X, \hat x^*) = D_{max}$

in probability, i.e., for any $\epsilon > 0$,

$\Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) > D_{max} + \epsilon\} \le \epsilon \qquad (9.19)$

for sufficiently large $n$. Note that $\hat{\mathbf{X}}$ is a constant sequence which does not depend on $\mathbf{X}$. In other words, even when no description of $\mathbf{X}$ is available, we can still achieve an average distortion no more than $D_{max} + \epsilon$ with probability arbitrarily close to 1 when $n$ is sufficiently large.

The notation $D_{max}$ may seem confusing because the quantity stands for the minimum rather than the maximum expected distortion between $X$ and a constant estimate of $X$. We see from the above discussion, however, that the notation is in fact appropriate because $D_{max}$ is the maximum distortion we ever need to be concerned about. Specifically, it is not meaningful to impose a constraint $D \ge D_{max}$ on the reproduction sequence, because such a distortion can be achieved even without any knowledge about the source sequence.
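Computing $D_{max}$ is a one-line minimization over constant estimates, as in this sketch (an illustration, not from the text):

```python
def d_max(p, d, reproduction_alphabet):
    """D_max = min over constant estimates xhat of E[d(X, xhat)] (Definition 9.7).
    p maps source symbols to probabilities; d(x, xhat) is the distortion measure."""
    expected = lambda xh: sum(px * d(x, xh) for x, px in p.items())
    best = min(reproduction_alphabet, key=expected)
    return expected(best), best

hamming = lambda a, b: 0 if a == b else 1
# A binary source with Pr{X = 0} = 0.7: the best constant guess is 0 and D_max = 0.3.
print(d_max({0: 0.7, 1: 0.3}, hamming, [0, 1]))
```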
9.2 THE RATE DISTORTION FUNCTION $R(D)$

Throughout this chapter, all the discussions are with respect to an i.i.d. information source $\{X_k, k \ge 1\}$ with generic random variable $X$ and a distortion measure $d$. All logarithms are in base 2 unless otherwise specified.

[Figure 9.1. A rate distortion code with block length $n$: source sequence $\mathbf{X}$, encoder, index $f(\mathbf{X})$, decoder, reproduction sequence $\hat{\mathbf{X}}$.]

Definition 9.8  An $(n, M)$ rate distortion code is defined by an encoding function

$f: \mathcal{X}^n \to \{1, 2, \ldots, M\} \qquad (9.20)$

and a decoding function

$g: \{1, 2, \ldots, M\} \to \hat{\mathcal{X}}^n. \qquad (9.21)$

The set $\{1, 2, \ldots, M\}$, denoted by $\mathcal{I}$, is called the index set. The reproduction sequences $g(1), g(2), \ldots, g(M)$ in $\hat{\mathcal{X}}^n$ are called codewords, and the set of codewords is called the codebook. Figure 9.1 is an illustration of a rate distortion code.

Definition 9.9  The rate of an $(n, M)$ rate distortion code is $n^{-1}\log M$ in bits per symbol.

Definition 9.10  A rate distortion pair $(R, D)$ is asymptotically achievable if for any $\epsilon > 0$, there exists for sufficiently large $n$ an $(n, M)$ rate distortion code such that

$\frac{1}{n}\log M \le R + \epsilon \qquad (9.22)$

and

$\Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) > D + \epsilon\} \le \epsilon, \qquad (9.23)$

where $\hat{\mathbf{X}} = g(f(\mathbf{X}))$. For brevity, an asymptotically achievable pair will be referred to as an achievable pair.

Remark  It is clear from the definition that if $(R, D)$ is achievable, then $(R', D)$ and $(R, D')$ are also achievable for all $R' \ge R$ and $D' \ge D$.

Definition 9.11  The rate distortion region is the subset of $\Re^2$ containing all achievable pairs $(R, D)$.

Theorem 9.12  The rate distortion region is closed and convex.

Proof  We first show that the rate distortion region is closed. Consider achievable rate distortion pairs $(R^{(k)}, D^{(k)})$ such that

$\lim_{k \to \infty} (R^{(k)}, D^{(k)}) = (R, D). \qquad (9.24)$

Then for any $\epsilon > 0$ and any $k$, there exists for sufficiently large $n$ an $(n, M^{(k)})$ code such that

$\frac{1}{n}\log M^{(k)} \le R^{(k)} + \epsilon \quad \text{and} \quad \Pr\{d(\mathbf{X}, \hat{\mathbf{X}}^{(k)}) > D^{(k)} + \epsilon\} \le \epsilon,$

where $f^{(k)}$ and $g^{(k)}$ are respectively the encoding function and the decoding function of the $(n, M^{(k)})$ code, and $\hat{\mathbf{X}}^{(k)} = g^{(k)}(f^{(k)}(\mathbf{X}))$. By virtue of (9.24), let $k(\epsilon)$ be an integer such that for all $k > k(\epsilon)$,

$|R - R^{(k)}| < \epsilon \quad \text{and} \quad |D - D^{(k)}| < \epsilon,$

which imply $R^{(k)} < R + \epsilon$ and $D^{(k)} < D + \epsilon$, respectively. Then for all $k > k(\epsilon)$,

$\frac{1}{n}\log M^{(k)} \le R^{(k)} + \epsilon < R + 2\epsilon$

and

$\Pr\{d(\mathbf{X}, \hat{\mathbf{X}}^{(k)}) > D + 2\epsilon\} \le \Pr\{d(\mathbf{X}, \hat{\mathbf{X}}^{(k)}) > D^{(k)} + \epsilon\} \le \epsilon,$

where the first inequality follows because $D + 2\epsilon > D^{(k)} + \epsilon$. Since $\epsilon$ is arbitrary, we see that $(R, D)$ is also achievable. Thus we have proved that the rate distortion region is closed.
We will prove the convexity of the rate distortion region by a time-sharing argument whose idea is the following. Roughly speaking, if we can use a code $\mathcal{C}_1$ to achieve $(R^{(1)}, D^{(1)})$ and a code $\mathcal{C}_2$ to achieve $(R^{(2)}, D^{(2)})$, then for any rational number $\lambda$ between 0 and 1, we can use $\mathcal{C}_1$ for a fraction $\lambda$ of the time and use $\mathcal{C}_2$ for a fraction $\bar\lambda = 1 - \lambda$ of the time to achieve $(R^{(\lambda)}, D^{(\lambda)})$, where

$R^{(\lambda)} = \lambda R^{(1)} + \bar\lambda R^{(2)} \quad \text{and} \quad D^{(\lambda)} = \lambda D^{(1)} + \bar\lambda D^{(2)}.$

Since the rate distortion region is closed as we have proved, $\lambda$ can then be taken to be any real number between 0 and 1, and the convexity of the region follows.

We now give a formal proof of the convexity of the rate distortion region. Let

$\lambda = \frac{r}{r + s},$

where $r$ and $s$ are positive integers; then $\lambda$ is a rational number between 0 and 1. We prove that if $(R^{(1)}, D^{(1)})$ and $(R^{(2)}, D^{(2)})$ are achievable, then $(R^{(\lambda)}, D^{(\lambda)})$ is also achievable. Assume $(R^{(1)}, D^{(1)})$ and $(R^{(2)}, D^{(2)})$ are achievable. Then for any $\epsilon > 0$ and sufficiently large $n$, there exist an $(n, M^{(1)})$ code and an $(n, M^{(2)})$ code such that

$\frac{1}{n}\log M^{(i)} \le R^{(i)} + \epsilon \qquad (9.38)$

and

$\Pr\{d(\mathbf{X}, \hat{\mathbf{X}}^{(i)}) > D^{(i)} + \epsilon\} \le \epsilon \qquad (9.39)$

for $i = 1, 2$. Let

$M(\lambda) = \big(M^{(1)}\big)^{r} \big(M^{(2)}\big)^{s}$

and $n(\lambda) = (r + s)n$. We now construct an $(n(\lambda), M(\lambda))$ code by concatenating $r$ copies of the $(n, M^{(1)})$ code followed by $s$ copies of the $(n, M^{(2)})$ code. We call these $r + s$ codes the subcodes of the $(n(\lambda), M(\lambda))$ code. For this code, let

$\mathbf{X} = (\mathbf{X}(1), \mathbf{X}(2), \ldots, \mathbf{X}(r+s)) \quad \text{and} \quad \hat{\mathbf{X}} = (\hat{\mathbf{X}}(1), \hat{\mathbf{X}}(2), \ldots, \hat{\mathbf{X}}(r+s)),$

where $\mathbf{X}(j)$ and $\hat{\mathbf{X}}(j)$ are the source sequence and the reproduction sequence of the $j$th subcode, respectively. Then for this $(n(\lambda), M(\lambda))$ code,

$\frac{1}{n(\lambda)}\log M(\lambda) = \frac{1}{(r+s)n}\big(r\log M^{(1)} + s\log M^{(2)}\big) = \lambda\,\frac{1}{n}\log M^{(1)} + \bar\lambda\,\frac{1}{n}\log M^{(2)} \le \lambda\big(R^{(1)} + \epsilon\big) + \bar\lambda\big(R^{(2)} + \epsilon\big) = R^{(\lambda)} + \epsilon,$

where the inequality follows from (9.38), and

$\Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) > D^{(\lambda)} + \epsilon\} = \Pr\Big\{\frac{1}{r+s}\sum_{j=1}^{r+s} d(\mathbf{X}(j), \hat{\mathbf{X}}(j)) > D^{(\lambda)} + \epsilon\Big\} \le \Pr\Big\{d(\mathbf{X}(j), \hat{\mathbf{X}}(j)) > D^{(1)} + \epsilon \text{ for some } 1 \le j \le r, \text{ or } d(\mathbf{X}(j), \hat{\mathbf{X}}(j)) > D^{(2)} + \epsilon \text{ for some } r+1 \le j \le r+s\Big\} \le \sum_{j=1}^{r}\Pr\{d(\mathbf{X}(j), \hat{\mathbf{X}}(j)) > D^{(1)} + \epsilon\} + \sum_{j=r+1}^{r+s}\Pr\{d(\mathbf{X}(j), \hat{\mathbf{X}}(j)) > D^{(2)} + \epsilon\} \le (r+s)\epsilon,$

where the second inequality follows from the union bound and the last from (9.39). Hence, we conclude that the rate distortion pair $(R^{(\lambda)}, D^{(\lambda)})$ is achievable. This completes the proof of the theorem.

Definition 9.13  The rate distortion function $R(D)$ is the minimum of all rates $R$ for a given distortion $D$ such that $(R, D)$ is achievable.

Definition 9.14  The distortion rate function $D(R)$ is the minimum of all distortions $D$ for a given rate $R$ such that $(R, D)$ is achievable.

Both the functions $R(D)$ and $D(R)$ are equivalent descriptions of the boundary of the rate distortion region. They are sufficient to describe the rate distortion region because the region is closed. Note that in defining $R(D)$, the minimum instead of the infimum is taken because for a fixed $D$, the set of all $R$ such that $(R, D)$ is achievable is closed and lower bounded by zero. Similarly, the minimum instead of the infimum is taken in defining $D(R)$. In the subsequent discussions, only $R(D)$ will be used.

Theorem 9.15  The following properties hold for the rate distortion function $R(D)$:
1. $R(D)$ is non-increasing in $D$.
2. $R(D)$ is convex.
3. $R(D) = 0$ for $D \ge D_{max}$.
4. $R(0) \le H(X)$.

Proof  From the remark following Definition 9.10, since $(R(D), D)$ is achievable, $(R(D), D')$ is also achievable for all $D' \ge D$. Therefore, $R(D) \ge R(D')$ because $R(D')$ is the minimum of all $R$ such that $(R, D')$ is achievable. This proves Property 1.

Property 2 follows immediately from the convexity of the rate distortion region, which was proved in Theorem 9.12.

From the discussion toward the end of the last section, we see that for any $\epsilon > 0$, it is possible to achieve

$\Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) > D_{max} + \epsilon\} \le \epsilon \qquad (9.54)$

for sufficiently large $n$ with no description of $\mathbf{X}$ available. Therefore, $(0, D)$ is achievable for all $D \ge D_{max}$, proving Property 3.

Property 4 is a consequence of the assumption that the distortion measure $d$ is normal, which can be seen as follows. By the source coding theorem, for any $\epsilon > 0$, by using a rate no more than $H(X) + \epsilon$, we can describe the source sequence $\mathbf{X}$ of length $n$ with probability of error less than $\epsilon$ when $n$ is sufficiently large. Since $d$ is normal, for each $k \ge 1$, let $\hat X_k = \hat x^*(X_k)$ (cf. Definition 9.5), so that whenever an error does not occur,

$d(X_k, \hat X_k) = d(X_k, \hat x^*(X_k)) = 0$

by (9.8) for each $k$, and hence

$d(\mathbf{X}, \hat{\mathbf{X}}) = \frac{1}{n}\sum_{k=1}^{n} d(X_k, \hat X_k) = 0.$

Therefore,

$\Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) > \epsilon\} \le \epsilon,$

which shows that the pair $(H(X), 0)$ is achievable. This in turn implies that $R(0) \le H(X)$ because $R(0)$ is the minimum of all $R$ such that $(R, 0)$ is achievable.

Figure 9.2 is an illustration of a rate distortion function $R(D)$; the reader should note the four properties in Theorem 9.15.

[Figure 9.2. A rate distortion function $R(D)$: the curve starts at $R(0) \le H(X)$, decreases convexly, and reaches zero at $D_{max}$; the rate distortion region is the set of pairs on or above the curve.]

The rate distortion theorem, which will be stated in the next section, gives a characterization of $R(D)$.

9.3 RATE DISTORTION THEOREM

Definition 9.16  For $D \ge 0$, the information rate distortion function is defined by

$R_I(D) = \min_{\hat X:\; E d(X, \hat X) \le D} I(X; \hat X). \qquad (9.59)$
In defining $R_I(D)$, the minimization is taken over all random variables $\hat X$ jointly distributed with $X$ such that

$E\, d(X, \hat X) \le D. \qquad (9.60)$

Since $p(x)$ is given, the minimization is taken over the set of all conditional distributions $p(\hat x | x)$ for which (9.60) is satisfied, namely the set

$\Big\{ p(\hat x | x) : \sum_x \sum_{\hat x} p(x)\, p(\hat x | x)\, d(x, \hat x) \le D \Big\}. \qquad (9.61)$

Since this set is compact in $\Re^{|\mathcal{X}||\hat{\mathcal{X}}|}$ and $I(X; \hat X)$ is a continuous functional of $p(\hat x | x)$, the minimum value of $I(X; \hat X)$ can be attained. (The assumption that both $\mathcal{X}$ and $\hat{\mathcal{X}}$ are finite is essential in this argument.) This justifies taking the minimum instead of the infimum in the definition of $R_I(D)$.

We have seen in Section 9.1 that for any distortion measure $d$ we can obtain a normalization $\tilde d$ with

$E\, d(X, \hat X) = E\, \tilde d(X, \hat X) + \Delta \qquad (9.62)$

for any $\hat X$, where $\Delta$ is a constant which depends only on $p(x)$ and $d$. Thus if $d$ is not normal, we can always replace $d$ by $\tilde d$ and $D$ by $D - \Delta$ in the definition of $R_I(D)$ without changing the minimization problem. Therefore, we do not lose any generality by assuming that a distortion measure is normal.

Theorem 9.17 (The Rate Distortion Theorem)  $R(D) = R_I(D)$.

The rate distortion theorem, which is the main result in rate distortion theory, says that the minimum coding rate for achieving a distortion $D$ is $R_I(D)$. This theorem will be proved in the next two sections. In the next section, we will prove the converse of this theorem, i.e., $R(D) \ge R_I(D)$, and in Section 9.5, we will prove the achievability of $R_I(D)$, i.e., $R(D) \le R_I(D)$.

In order for $R_I(D)$ to be a characterization of $R(D)$, it has to satisfy the same properties as $R(D)$. In particular, the four properties of $R(D)$ in Theorem 9.15 should also be satisfied by $R_I(D)$.

Theorem 9.18  The following properties hold for the information rate distortion function $R_I(D)$:
1. $R_I(D)$ is non-increasing in $D$.
2. $R_I(D)$ is convex.
3. $R_I(D) = 0$ for $D \ge D_{max}$.
4. $R_I(0) \le H(X)$.

Proof  Referring to the definition of $R_I(D)$ in (9.59), for a larger $D$, the minimization is taken over a larger set. Therefore, $R_I(D)$ is non-increasing in $D$, proving Property 1.

To prove Property 2, consider any $D^{(1)}, D^{(2)} \ge 0$ and let $\lambda$ be any number between 0 and 1. Let $\hat X^{(i)}$ achieve $R_I(D^{(i)})$ for $i = 1, 2$, i.e.,

$R_I(D^{(i)}) = I(X; \hat X^{(i)}),$

where $E\, d(X, \hat X^{(i)}) \le D^{(i)}$, and let $\hat X^{(i)}$ be defined by the transition matrix $p_i(\hat x | x)$. Let $\hat X^{(\lambda)}$ be jointly distributed with $X$ and defined by the transition matrix

$p_\lambda(\hat x | x) = \lambda p_1(\hat x | x) + \bar\lambda p_2(\hat x | x), \qquad (9.65)$

where $\bar\lambda = 1 - \lambda$. Then

$E\, d(X, \hat X^{(\lambda)}) = \sum_{x, \hat x} p(x)\, p_\lambda(\hat x|x)\, d(x,\hat x) = \lambda\, E\, d(X, \hat X^{(1)}) + \bar\lambda\, E\, d(X, \hat X^{(2)}) \le \lambda D^{(1)} + \bar\lambda D^{(2)} = D^{(\lambda)},$

where $D^{(\lambda)} = \lambda D^{(1)} + \bar\lambda D^{(2)}$. Now consider

$\lambda R_I(D^{(1)}) + \bar\lambda R_I(D^{(2)}) = \lambda I(X; \hat X^{(1)}) + \bar\lambda I(X; \hat X^{(2)}) \ge I(X; \hat X^{(\lambda)}) \ge R_I(D^{(\lambda)}),$

where the first inequality follows from the convexity of mutual information with respect to the transition matrix $p(\hat x|x)$ (see Example 6.13), and the second follows from $E\, d(X,\hat X^{(\lambda)}) \le D^{(\lambda)}$ and the definition of $R_I(D)$. Therefore, we have proved Property 2.

To prove Property 3, let $\hat X$ take the value $\hat x^*$ as defined in Definition 9.7 with probability 1. Then $I(X; \hat X) = 0$ and

$E\, d(X, \hat X) = E\, d(X, \hat x^*) = D_{max}.$

Then for $D \ge D_{max}$, $R_I(D) \le I(X; \hat X) = 0$. On the other hand, since $R_I(D)$ is nonnegative, we conclude that $R_I(D) = 0$. This proves Property 3.

Finally, to prove Property 4, we let

$\hat X = \hat x^*(X),$

where $\hat x^*(x)$ is defined in Definition 9.5. Then

$E\, d(X, \hat X) = E\, d(X, \hat x^*(X)) = \sum_x p(x)\, d(x, \hat x^*(x)) = 0$

by (9.8), since we assume that $d$ is a normal distortion measure. Moreover,

$I(X; \hat X) \le H(X).$

Then Property 4 and hence the theorem is proved.
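Definition 9.16 is a finite-dimensional minimization and can be approximated numerically even by brute force. The sketch below is an editorial illustration, not an algorithm from the text: it grid-searches the conditional distributions $p(\hat x|x)$ for a binary source with the Hamming distortion measure, and the output can be checked against the closed form derived in the example below.

```python
import numpy as np
from itertools import product

def mutual_information(p_x, Q):
    """I(X; X_hat) in bits, for input distribution p_x and transition matrix Q[x, xhat]."""
    joint = p_x[:, None] * Q
    p_xhat = joint.sum(axis=0)
    mask = joint > 0
    return float((joint[mask] *
                  np.log2(joint[mask] / (p_x[:, None] * p_xhat[None, :])[mask])).sum())

def R_I_grid(p_x, d, D, steps=200):
    """Crude grid search for R_I(D) = min { I(X; X_hat) : E d(X, X_hat) <= D }
    over binary-input, binary-output conditional distributions (Definition 9.16)."""
    best = np.inf
    grid = np.linspace(0.0, 1.0, steps + 1)
    for a, b in product(grid, repeat=2):          # a = Q(xhat=1 | x=0), b = Q(xhat=1 | x=1)
        Q = np.array([[1 - a, a], [1 - b, b]])
        if (p_x[:, None] * Q * d).sum() <= D:     # the constraint E d(X, X_hat) <= D
            best = min(best, mutual_information(p_x, Q))
    return best

p_x = np.array([0.5, 0.5])
d = np.array([[0.0, 1.0], [1.0, 0.0]])            # Hamming distortion measure
print(round(R_I_grid(p_x, d, D=0.25), 3))          # about 0.189 = 1 - h_b(0.25)
```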
Corollary 9.19  If $R_I(0) > 0$, then $R_I(D)$ is strictly decreasing for $0 \le D \le D_{max}$, and the inequality constraint in Definition 9.16 for $R_I(D)$ can be replaced by an equality constraint.

Proof  Assume that $R_I(0) > 0$. We first show by contradiction that $R_I(D) > 0$ for $0 \le D < D_{max}$. Suppose $R_I(D') = 0$ for some $0 \le D' < D_{max}$, and let $R_I(D')$ be achieved by some $\hat X$. Then

$R_I(D') = I(X; \hat X) = 0$

implies that $X$ and $\hat X$ are independent, or

$p(x, \hat x) = p(x)\, p(\hat x)$

for all $x$ and $\hat x$. It follows that

$D' \ge E\, d(X, \hat X) = \sum_{\hat x}\sum_{x} p(x)\, p(\hat x)\, d(x, \hat x) = \sum_{\hat x} p(\hat x)\, E\, d(X, \hat x) \ge \sum_{\hat x} p(\hat x)\, E\, d(X, \hat x^*) = \sum_{\hat x} p(\hat x)\, D_{max} = D_{max},$

where $\hat x^*$ and $D_{max}$ are defined in Definition 9.7. This is a contradiction because we have assumed that $D' < D_{max}$. Therefore, we conclude that $R_I(D) > 0$ for $0 \le D < D_{max}$.

Since $R_I(0) > 0$ and $R_I(D_{max}) = 0$, and $R_I(D)$ is non-increasing and convex by the above theorem, $R_I(D)$ must be strictly decreasing for $0 \le D \le D_{max}$. We now prove by contradiction that the inequality constraint in Definition 9.16 for $R_I(D)$ can be replaced by an equality constraint. Assume that $R_I(D)$ is achieved by some $\hat X^*$ such that

$E\, d(X, \hat X^*) = D'' < D \le D_{max}.$

Then

$R_I(D'') = \min_{\hat X:\, E d(X,\hat X)\le D''} I(X; \hat X) \le I(X; \hat X^*) = R_I(D).$

This is a contradiction because $R_I(D)$ is strictly decreasing for $0 \le D \le D_{max}$. Hence,

$E\, d(X, \hat X^*) = D.$

This implies that the inequality constraint in Definition 9.16 for $R_I(D)$ can be replaced by an equality constraint.

Remark  In all problems of interest, $R(0) = R_I(0) > 0$. Otherwise, $R(D) = 0$ for all $D \ge 0$ because $R(D)$ is nonnegative and non-increasing.

Example 9.20 (Binary source)  Let $X$ be a binary random variable with

$\Pr\{X = 0\} = 1 - \gamma \quad \text{and} \quad \Pr\{X = 1\} = \gamma.$

Let $\hat{\mathcal{X}} = \{0, 1\}$ be the reproduction alphabet for $X$, and let $d$ be the Hamming distortion measure. We first consider the case $0 \le \gamma \le 1/2$. If we make a guess on the value of $X$, we should guess 0 in order to minimize the expected distortion. Therefore, $\hat x^* = 0$ and

$D_{max} = E\, d(X, 0) = \Pr\{X = 1\} = \gamma.$

We will show that for $0 \le D < \gamma$,

$R_I(D) = h_b(\gamma) - h_b(D). \qquad (9.102)$

Let $\hat X$ be an estimate of $X$ taking values in $\hat{\mathcal{X}}$, and let $Z$ be the Hamming distortion between $X$ and $\hat X$, i.e., $Z = d(X, \hat X)$. Observe that conditioning on $\hat X$, $X$ and $Z$ determine each other. Therefore,

$H(X | \hat X) = H(Z | \hat X).$

Then for $D < \gamma = D_{max}$ and any $\hat X$ such that

$E\, d(X, \hat X) \le D, \qquad (9.105)$

we have

$I(X; \hat X) = H(X) - H(X|\hat X) = h_b(\gamma) - H(Z|\hat X) \ge h_b(\gamma) - H(Z) \ge h_b(\gamma) - h_b(D), \qquad (9.110)$

where the last inequality is justified because

$\Pr\{Z = 1\} = \Pr\{X \ne \hat X\} = E\, d(X, \hat X) \le D$

and $h_b(a)$ is increasing for $0 \le a \le 1/2$. Minimizing over all $\hat X$ satisfying (9.105) in (9.110), we obtain the lower bound

$R_I(D) \ge h_b(\gamma) - h_b(D). \qquad (9.112)$

To show that this lower bound is achievable, we need to construct an $\hat X$ such that both inequalities in (9.110) are tight. The tightness of the last inequality simply says that

$\Pr\{Z = 1\} = \Pr\{X \ne \hat X\} = D, \qquad (9.113)$

while the tightness of the first inequality says that $Z$ should be independent of $\hat X$. It would be difficult to make $Z$ independent of $\hat X$ if we specify $\hat X$ by $p(\hat x | x)$. Instead, we specify the joint distribution of $X$ and $\hat X$ by means of a reverse binary symmetric channel (BSC) with crossover probability $D$, as shown in Figure 9.3.

[Figure 9.3. Achieving $R_I(D)$ for a binary source via a reverse binary symmetric channel with crossover probability $D$.]
Here, we regard $\hat X$ as the input and $X$ as the output of the BSC. Then $Z$ is independent of the input $\hat X$ because the error event is independent of the input for a BSC, and (9.113) is satisfied by setting the crossover probability to $D$. However, we need to ensure that the marginal distribution of $X$ so specified is equal to $p(x)$. Toward this end, we let

$\Pr\{\hat X = 0\} = \alpha$

and consider

$\gamma = \Pr\{X = 1\} = \Pr\{\hat X = 0\}\,D + \Pr\{\hat X = 1\}(1 - D) = \alpha D + (1 - \alpha)(1 - D),$

which gives

$\alpha = \frac{1 - D - \gamma}{1 - 2D}.$

Since $0 \le D < \gamma \le 1/2$, we have $1 - D - \gamma > 0$ and $1 - 2D > 0$, so $\alpha > 0$; and $D < \gamma$ gives $1 - D - \gamma < 1 - 2D$, so $\alpha < 1$. Hence, $\alpha$ is a valid probability, and we have shown that the lower bound on $I(X; \hat X)$ in (9.110) can be achieved. Therefore $R_I(D)$ is as given in (9.102).

For $1/2 \le \gamma \le 1$, by exchanging the roles of the symbols 0 and 1 in the above argument, we obtain $R_I(D)$ as in (9.102) except that $\gamma$ is replaced by $1 - \gamma$. Combining the two cases, we have

$R_I(D) = h_b(\gamma) - h_b(D)$ if $0 \le D < \min(\gamma, 1-\gamma)$, and $R_I(D) = 0$ if $D \ge \min(\gamma, 1-\gamma)$, $\qquad (9.124)$

for $0 \le \gamma \le 1$. The function $R_I(D)$ for the uniform binary source is illustrated in Figure 9.4.

[Figure 9.4. The function $R_I(D)$ for the uniform binary source with the Hamming distortion measure: it decreases from 1 at $D = 0$ to 0 at $D = 0.5$.]

Remark  In the above example, $R_I(0) = h_b(\gamma) = H(X)$. Then by the rate distortion theorem, $H(X)$ is the minimum rate of a rate distortion code which achieves an arbitrarily small average Hamming distortion. It is tempting to regard this special case of the rate distortion theorem as a version of the source coding theorem and to conclude that the rate distortion theorem is a generalization of the source coding theorem. However, this is incorrect, because the rate distortion theorem only guarantees that the average Hamming distortion between $\mathbf{X}$ and $\hat{\mathbf{X}}$ is small with probability arbitrarily close to 1, whereas the source coding theorem guarantees that $\mathbf{X} = \hat{\mathbf{X}}$ with probability arbitrarily close to 1, which is much stronger.
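The closed form in (9.124) is easy to evaluate; a small sketch (an illustration, not from the text):

```python
from math import log2

def h_b(p):
    """Binary entropy function in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def rate_distortion_binary(D, gamma):
    """R_I(D) for a binary source with Pr{X=1} = gamma under the Hamming
    distortion measure: h_b(gamma) - h_b(D) for D < min(gamma, 1-gamma), else 0."""
    if D < min(gamma, 1.0 - gamma):
        return h_b(gamma) - h_b(D)
    return 0.0

for D in (0.0, 0.1, 0.25, 0.5):
    print(D, round(rate_distortion_binary(D, gamma=0.5), 3))   # 1.0, 0.531, 0.189, 0.0
```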
It is in general not possible to obtain the rate distortion function in closed form, and we have to resort to numerical computation. In Chapter 10, we will discuss the Blahut-Arimoto algorithm for computing the rate distortion function.

9.4 THE CONVERSE

In this section, we prove that the rate distortion function $R(D)$ is lower bounded by the information rate distortion function $R_I(D)$, i.e., $R(D) \ge R_I(D)$. Specifically, we will prove that for any achievable rate distortion pair $(R, D)$, $R \ge R_I(D)$. Then by fixing $D$ and minimizing $R$ over all achievable pairs $(R, D)$, we conclude that $R(D) \ge R_I(D)$.

Let $(R, D)$ be any achievable rate distortion pair. Then for any $\epsilon > 0$, there exists for sufficiently large $n$ an $(n, M)$ code such that

$\frac{1}{n}\log M \le R + \epsilon \qquad (9.125)$

and

$\Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) > D + \epsilon\} \le \epsilon, \qquad (9.126)$

where $\hat{\mathbf{X}} = g(f(\mathbf{X}))$. Then

$n(R + \epsilon) \ge \log M \ge H(f(\mathbf{X})) \ge H(g(f(\mathbf{X}))) = H(\hat{\mathbf{X}}) \ge I(\mathbf{X}; \hat{\mathbf{X}}) = H(\mathbf{X}) - H(\mathbf{X} | \hat{\mathbf{X}}) = \sum_{k=1}^{n} H(X_k) - \sum_{k=1}^{n} H(X_k | \hat{\mathbf{X}}, X_1, \ldots, X_{k-1}) \ge \sum_{k=1}^{n}\big[H(X_k) - H(X_k | \hat X_k)\big] = \sum_{k=1}^{n} I(X_k; \hat X_k) \ge \sum_{k=1}^{n} R_I\big(E\, d(X_k, \hat X_k)\big) = n\Big[\frac{1}{n}\sum_{k=1}^{n} R_I\big(E\, d(X_k, \hat X_k)\big)\Big] \ge n\, R_I\Big(\frac{1}{n}\sum_{k=1}^{n} E\, d(X_k, \hat X_k)\Big) = n\, R_I\big(E\, d(\mathbf{X}, \hat{\mathbf{X}})\big). \qquad (9.141)$

In the above, the first inequality follows from (9.125); the bound $H(X_k | \hat{\mathbf{X}}, X_1, \ldots, X_{k-1}) \le H(X_k | \hat X_k)$ holds because conditioning does not increase entropy; the bound $I(X_k; \hat X_k) \ge R_I(E\, d(X_k, \hat X_k))$ follows from the definition of $R_I$ in Definition 9.16; and the last inequality follows from the convexity of $R_I$ proved in Theorem 9.18 together with Jensen's inequality.

Now let

$d_{max} = \max_{x, \hat x} d(x, \hat x) \qquad (9.142)$

be the maximum value which can be taken by the distortion measure $d$. (The reader should not confuse $d_{max}$ with $D_{max}$ in Definition 9.7.) From (9.126), we have

$E\, d(\mathbf{X}, \hat{\mathbf{X}}) = E\big[d(\mathbf{X}, \hat{\mathbf{X}}) \,\big|\, d(\mathbf{X}, \hat{\mathbf{X}}) > D + \epsilon\big]\Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) > D + \epsilon\} + E\big[d(\mathbf{X}, \hat{\mathbf{X}}) \,\big|\, d(\mathbf{X}, \hat{\mathbf{X}}) \le D + \epsilon\big]\Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) \le D + \epsilon\} \le d_{max}\cdot\epsilon + (D + \epsilon)\cdot 1 = D + (1 + d_{max})\epsilon. \qquad (9.145)$

This shows that if the probability that the average distortion between $\mathbf{X}$ and $\hat{\mathbf{X}}$ exceeds $D + \epsilon$ is small, then the expected average distortion between $\mathbf{X}$ and $\hat{\mathbf{X}}$ can exceed $D$ only by a small amount. (The converse is not true.)

Following (9.141), we have

$R + \epsilon \ge R_I\big(E\, d(\mathbf{X}, \hat{\mathbf{X}})\big) \ge R_I\big(D + (1 + d_{max})\epsilon\big),$

where the last inequality follows from (9.145) because $R_I(D)$ is non-increasing in $D$. We note that the convexity of $R_I(D)$ implies that it is a continuous function of $D$. Then taking the limit as $\epsilon \to 0$, we obtain

$R \ge \lim_{\epsilon \to 0} R_I\big(D + (1 + d_{max})\epsilon\big) = R_I(D), \qquad (9.150)$

where we have invoked the continuity of $R_I(D)$ in obtaining the equality. Upon minimizing $R$ over all achievable pairs $(R, D)$ for a fixed $D$, we have proved that

$R(D) \ge R_I(D).$

This completes the proof of the converse of the rate distortion theorem.

9.5 ACHIEVABILITY OF $R_I(D)$

In this section, we prove that the rate distortion function $R(D)$ is upper bounded by the information rate distortion function $R_I(D)$, i.e., $R(D) \le R_I(D)$. Combining this with the result $R(D) \ge R_I(D)$ from the last section, we conclude that $R(D) = R_I(D)$, and the rate distortion theorem is proved.

For any $0 \le D \le D_{max}$, we will prove that for every random variable $\hat X$ taking values in $\hat{\mathcal{X}}$ such that

$E\, d(X, \hat X) \le D, \qquad (9.152)$

the rate distortion pair $(I(X; \hat X), D)$ is achievable. This will be proved by showing for sufficiently large $n$ the existence of a rate distortion code such that

1. the rate of the code is not more than $I(X; \hat X) + \epsilon$;
2. $d(\mathbf{X}, \hat{\mathbf{X}}) \le D + \epsilon$ with probability almost 1.

Then by minimizing $I(X; \hat X)$ over all $\hat X$ satisfying (9.152), we conclude that the rate distortion pair $(R_I(D), D)$ is achievable, which implies $R(D) \le R_I(D)$ because $R(D)$ is the minimum of all $R$ such that $(R, D)$ is achievable.

Fix any $0 \le D \le D_{max}$ and any $\epsilon > 0$, and let $\delta$ be a small positive quantity to be specified later. Toward proving the existence of a desired code, we fix a random variable $\hat X$ which satisfies (9.152) and let $M$ be an integer satisfying

$I(X; \hat X) + \frac{\epsilon}{2} \le \frac{1}{n}\log M \le I(X; \hat X) + \epsilon, \qquad (9.153)$

where $n$ is sufficiently large. We now describe a random coding scheme in the following steps:

1. Construct a codebook $\mathcal{C}$ of an $(n, M)$ code by randomly generating $M$ codewords in $\hat{\mathcal{X}}^n$ independently and identically according to $p(\hat x)^n$. Denote these codewords by $\hat{\mathbf{X}}(1), \hat{\mathbf{X}}(2), \ldots, \hat{\mathbf{X}}(M)$.
2. Reveal the codebook $\mathcal{C}$ to both the encoder and the decoder.
3. The source sequence $\mathbf{X}$ is generated according to $p(x)^n$.
4. The encoder encodes the source sequence $\mathbf{X}$ into an index $K$ in the set $\mathcal{I} = \{1, 2, \ldots, M\}$. The index $K$ takes the value $k$ if
   a) $(\mathbf{X}, \hat{\mathbf{X}}(k)) \in T^n_{[X\hat X]\delta}$, and
   b) for all $k' \in \mathcal{I}$, if $(\mathbf{X}, \hat{\mathbf{X}}(k')) \in T^n_{[X\hat X]\delta}$, then $k' \le k$;
   otherwise, $K$ takes the constant value 1.
5. The index $K$ is delivered to the decoder.
6. The decoder outputs $\hat{\mathbf{X}}(K)$ as the reproduction sequence $\hat{\mathbf{X}}$.

Remark  Strong typicality is used in defining the encoding function in Step 4. This is made possible by the assumption that both the source alphabet $\mathcal{X}$ and the reproduction alphabet $\hat{\mathcal{X}}$ are finite.
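The random coding scheme above can be mimicked in a few lines of code. The following sketch is a simplified editorial illustration, not the formal construction: it replaces the strong-typicality test of Step 4 with a direct average-distortion test, which is the property that the typicality argument ultimately delivers.

```python
import random

def rd_random_encode(x, p_xhat, R, D_target, d, rng):
    """Generate about 2^{nR} random codewords i.i.d. according to p_xhat and return the
    largest index whose codeword is within average distortion D_target of x (index 1 if none).
    This mirrors Steps 1-6 above with the typicality check replaced by a distortion check."""
    n = len(x)
    M = max(1, int(round(2 ** (n * R))))
    symbols, weights = zip(*p_xhat.items())
    codebook = [tuple(rng.choices(symbols, weights=weights, k=n)) for _ in range(M)]
    for k in range(M, 0, -1):                       # Step 4b: prefer the largest qualifying index
        cw = codebook[k - 1]
        if sum(d(a, b) for a, b in zip(x, cw)) / n <= D_target:
            return k, cw
    return 1, codebook[0]                           # otherwise K = 1

rng = random.Random(7)
hamming = lambda a, b: 0 if a == b else 1
x = [rng.randrange(2) for _ in range(40)]           # a uniform binary source sequence
k, cw = rd_random_encode(x, {0: 0.5, 1: 0.5}, R=0.35, D_target=0.3, d=hamming, rng=rng)
print(k, sum(hamming(a, b) for a, b in zip(x, cw)) / len(x))
```

With the rate chosen above the value $R_I(0.3) \approx 0.12$ for this source, a qualifying codeword is found with high probability, as the analysis below makes precise.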
Let us further explain the encoding scheme described in Step 4. After the source sequence $\mathbf{X}$ is generated, we search through all the codewords in the codebook $\mathcal{C}$ for those which are jointly typical with $\mathbf{X}$ with respect to $p(x, \hat x)$. If there is at least one such codeword, we let $K$ be the largest index of such codewords. If such a codeword does not exist, we let $K = 1$.

The event $\{K = 1\}$ occurs in one of the following two scenarios:
1. $\hat{\mathbf{X}}(1)$ is the only codeword in $\mathcal{C}$ which is jointly typical with $\mathbf{X}$.
2. No codeword in $\mathcal{C}$ is jointly typical with $\mathbf{X}$.

In either scenario, $\mathbf{X}$ is not jointly typical with any of the codewords $\hat{\mathbf{X}}(2), \hat{\mathbf{X}}(3), \ldots, \hat{\mathbf{X}}(M)$. Define the event

$E_k = \big\{(\mathbf{X}, \hat{\mathbf{X}}(k)) \in T^n_{[X\hat X]\delta}\big\} \qquad (9.154)$

that $\mathbf{X}$ is jointly typical with the codeword $\hat{\mathbf{X}}(k)$. Since the codewords are generated i.i.d., the events $E_k$ are mutually independent and all have the same probability. Moreover, from the discussion above,

$\{K = 1\} \subset E_2^c \cap E_3^c \cap \cdots \cap E_M^c,$

so that

$\Pr\{K = 1\} \le \Pr\{E_2^c \cap E_3^c \cap \cdots \cap E_M^c\} = \prod_{k=2}^{M}\Pr\{E_k^c\} = \big(1 - \Pr\{E_1\}\big)^{M-1}. \qquad (9.159)$

We now obtain a lower bound on $\Pr\{E_1\}$ as follows. Since $\mathbf{X}$ and $\hat{\mathbf{X}}(1)$ are generated independently, their joint distribution factorizes as $p(\mathbf{x})\,p(\hat{\mathbf{x}})$. Then

$\Pr\{E_1\} = \sum_{(\mathbf{x}, \hat{\mathbf{x}}) \in T^n_{[X\hat X]\delta}} p(\mathbf{x})\, p(\hat{\mathbf{x}}).$

By the consistency of strong typicality (Theorem 5.7), if $(\mathbf{x}, \hat{\mathbf{x}}) \in T^n_{[X\hat X]\delta}$, then $\mathbf{x} \in T^n_{[X]\delta}$ and $\hat{\mathbf{x}} \in T^n_{[\hat X]\delta}$. By the strong AEP (Theorem 5.2), all $p(\mathbf{x})$ and $p(\hat{\mathbf{x}})$ in the above summation satisfy

$p(\mathbf{x}) \ge 2^{-n(H(X)+\eta)} \quad \text{and} \quad p(\hat{\mathbf{x}}) \ge 2^{-n(H(\hat X)+\xi)},$

where $\eta, \xi \to 0$ as $\delta \to 0$. Again by the strong AEP, the number of pairs in $T^n_{[X\hat X]\delta}$ is at least $(1 - \delta)\, 2^{n(H(X, \hat X) - \zeta)}$, where $\zeta \to 0$ as $\delta \to 0$. Then

$\Pr\{E_1\} \ge (1 - \delta)\, 2^{n(H(X,\hat X) - \zeta)}\, 2^{-n(H(X)+\eta)}\, 2^{-n(H(\hat X)+\xi)} = (1 - \delta)\, 2^{-n(I(X;\hat X) + \tau)}, \qquad (9.168)$

where $\tau = \zeta + \eta + \xi \to 0$ as $\delta \to 0$. Following (9.159), we have

$\Pr\{K = 1\} \le \Big(1 - (1-\delta)\, 2^{-n(I(X;\hat X)+\tau)}\Big)^{M-1}. \qquad (9.170)$

The lower bound in (9.153) implies $M \ge 2^{n(I(X;\hat X) + \epsilon/2)}$. Then, taking the natural logarithm in (9.170) and using the fundamental inequality $\ln a \le a - 1$,

$\ln \Pr\{K = 1\} \le (M - 1)\ln\Big(1 - (1-\delta)2^{-n(I(X;\hat X)+\tau)}\Big) \le -(M-1)(1-\delta)\,2^{-n(I(X;\hat X)+\tau)} \le -\Big(2^{n(I(X;\hat X)+\epsilon/2)} - 1\Big)(1-\delta)\,2^{-n(I(X;\hat X)+\tau)} = -(1-\delta)\Big(2^{n(\epsilon/2 - \tau)} - 2^{-n(I(X;\hat X)+\tau)}\Big). \qquad (9.174)$

By letting $\delta$ be sufficiently small so that

$\frac{\epsilon}{2} - \tau > 0, \qquad (9.175)$

the above upper bound on $\ln \Pr\{K=1\}$ tends to $-\infty$ as $n \to \infty$. This implies $\Pr\{K = 1\} \to 0$ as $n \to \infty$, i.e.,

$\Pr\{K = 1\} < \epsilon \qquad (9.176)$

for sufficiently large $n$.

The main idea of the above upper bound on $\Pr\{K = 1\}$ is the following. In constructing the codebook, we randomly generate $M$ codewords in $\hat{\mathcal{X}}^n$ according to $p(\hat x)^n$. If $M$ grows with $n$ at a rate higher than $I(X; \hat X)$, then the probability that there exists at least one codeword which is jointly typical with the source sequence $\mathbf{X}$ with respect to $p(x, \hat x)$ is very high when $n$ is large. Further, the average distortion between $\mathbf{X}$ and such a codeword is close to $E\, d(X, \hat X)$, because the empirical joint distribution of the symbol pairs in $\mathbf{X}$ and such a codeword is close to $p(x, \hat x)$.
Then by letting the reproduction sequence $\hat{\mathbf{X}}$ be such a codeword, the average distortion between $\mathbf{X}$ and $\hat{\mathbf{X}}$ is less than $D + \epsilon$ with probability arbitrarily close to 1, since $E\, d(X, \hat X) \le D$. These ideas are made precise in the rest of the proof.

Now for sufficiently large $n$, consider

$\Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) > D + \epsilon\} = \Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) > D + \epsilon \mid K = 1\}\Pr\{K = 1\} + \Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) > D + \epsilon \mid K \ne 1\}\Pr\{K \ne 1\} \le 1\cdot\epsilon + \Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) > D + \epsilon \mid K \ne 1\}\cdot 1, \qquad (9.179)$

where we have used (9.176). We will show that by choosing the value of $\delta$ carefully, $d(\mathbf{X}, \hat{\mathbf{X}})$ is always at most $D + \epsilon$ provided $K \ne 1$. Since $(\mathbf{X}, \hat{\mathbf{X}}) \in T^n_{[X\hat X]\delta}$ conditioning on $\{K \ne 1\}$, we have

$d(\mathbf{X}, \hat{\mathbf{X}}) = \frac{1}{n}\sum_{k=1}^{n} d(X_k, \hat X_k) = \sum_{(x,\hat x)} \frac{1}{n}N(x, \hat x \mid \mathbf{X}, \hat{\mathbf{X}})\, d(x, \hat x) \le \sum_{(x,\hat x)} p(x,\hat x)\, d(x,\hat x) + \sum_{(x,\hat x)}\Big|\frac{1}{n}N(x, \hat x \mid \mathbf{X}, \hat{\mathbf{X}}) - p(x,\hat x)\Big| d(x,\hat x) \le E\, d(X, \hat X) + d_{max}\,\delta \le D + d_{max}\,\delta, \qquad (9.188)$

where $N(x, \hat x \mid \mathbf{X}, \hat{\mathbf{X}})$ is the number of occurrences of the pair $(x, \hat x)$ in $(\mathbf{X}, \hat{\mathbf{X}})$, $d_{max}$ is defined in (9.142), the second-to-last inequality follows from the definition of strong joint typicality, and the last follows from (9.152). Therefore, by also requiring

$\delta \le \frac{\epsilon}{d_{max}}, \qquad (9.189)$

we obtain $d(\mathbf{X}, \hat{\mathbf{X}}) \le D + \epsilon$ whenever $K \ne 1$. Therefore,

$\Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) > D + \epsilon \mid K \ne 1\} = 0,$

and it follows from (9.179) that

$\Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) > D + \epsilon\} \le \epsilon. \qquad (9.192)$

Thus we have shown that for sufficiently large $n$, there exists an $(n, M)$ random code which satisfies

$\frac{1}{n}\log M \le I(X; \hat X) + \epsilon \qquad (9.193)$

(this follows from the upper bound in (9.153)) and (9.192). This implies the existence of an $(n, M)$ rate distortion code which satisfies (9.193) and (9.192). Therefore, the rate distortion pair $(I(X; \hat X), D)$ is achievable. Then upon minimizing over all $\hat X$ which satisfy (9.152), we conclude that the rate distortion pair $(R_I(D), D)$ is achievable, which implies $R(D) \le R_I(D)$. The proof is completed.

PROBLEMS

1. Obtain the forward channel description of $R(D)$ for the binary source with the Hamming distortion measure.

2. Binary covering radius  The Hamming ball with center $\mathbf{c} \in \{0,1\}^n$ and radius $r$ is the set

$S_r(\mathbf{c}) = \big\{\mathbf{x} \in \{0,1\}^n : d(\mathbf{x}, \mathbf{c}) \le r\big\},$

where $d$ denotes the Hamming distance. Let $M_{r,n}$ be the minimum number $M$ such that there exist Hamming balls $S_r(\mathbf{c}_1), \ldots, S_r(\mathbf{c}_M)$ with the property that every $\mathbf{x} \in \{0,1\}^n$ belongs to $S_r(\mathbf{c}_j)$ for some $j$.
   a) Show that

$M_{r,n} \ge \frac{2^n}{\sum_{k=0}^{r}\binom{n}{k}}.$

   b) What is the relation between $M_{r,n}$ and the rate distortion function for the binary source with the Hamming distortion measure?

3. Consider a source random variable $X$ with the Hamming distortion measure.
   a) Prove that

$R(D) \ge H(X) - D\log(|\hat{\mathcal{X}}| - 1) - h_b(D)$

for $0 \le D \le D_{max}$.
   b) Show that the above lower bound on $R(D)$ is tight if $X$ is distributed uniformly on $\mathcal{X}$.
   See Jerohin [102] (also see [52], p.133) for the tightness of this lower bound for a general source. This bound is a special case of the Shannon lower bound for the rate distortion function [176] (also see [49], p.369).

4. Product source  Let $X$ and $X'$ be two independent source random variables with reproduction alphabets $\hat{\mathcal{X}}$ and $\hat{\mathcal{X}}'$ and distortion measures $d$ and $d'$, and denote the rate distortion functions for $X$ and $X'$ by $R_X(D)$ and $R_{X'}(D)$, respectively.
Now for the product source , define a distor0 þ S P þ TSUP T Î tion measure G by 0i ( V Vþ (þ 0 W þ ( ( 7 Prove0 that the rate distortion function sure is given by W h1Z D R A F T ]_^a` h1[ ¼ " 0XQ VY V þ 8 -0 for 7 Q 7 " with distortion meaQ 8 h September 13, 2001, 6:27pm D R A F T 213 Rate Distortion Theory -0 0 þ þ Hint: Prove that ! independent. (Shannon [176].) -0 $ ! 0 þ " ! þ 0 if and are 0`_ ß 5. Compound source Let \ be an index set and ]^ be a \ Gba collection of source random variables. The random variables in ]þ ^ have a common source alphabet 0 , a common reproduction alphabet , and a 0dc common distortion measure . A compound source is an i.i.d. information ß c , where e is equal to some source whose generic random variable is a 0dc we do not know which one it is. The rate distortion function but \ for has the same definition as the rate distortion function defined in this chapter except that (9.23) is replaced by § 0i _ þ Show that P c Wg_fihkj for all a ® _¶ ^ ö _¶ where "·® ß . \ 0`_ is the rate distortion function for . 6. Show that asymptotic optimality can be achieved by separating rate distortion coding and channel coding when the information source is i.i.d. (with a single-letter distortion measure) and the channel is memoryless. 7. Slepian-Wolf b coding Let ® , and Ú be small positive quantities. For Ï © Þ , randomly and independently select with replacement bºY p n ÿÿ ¾k mlo © à â ãåä according to the uniform distribution to é vectorsò from ºYÿ ÿ l º r l a fixed pair of vectors in à â b ãåä . Prove the be form a bin q . Let º l following by choosing ® , and Ú appropriately: é r a) the probability thaté r b) given thatò such that s r ß is in some q tends to 1 as ¯<Î / ; r , the probability that there exists another 6ßæà â b / ãåä tends to 0 as ¯<Î . º q U V l ß U é q s ¥ ( ut Let . The results in a) and b) say that if is º jointly typical, whichs happens with probability é close to 1 for large ¯ , then s it is évery likely that is in some bin q , and that is the unique vector as side-information, in q which is jointly typical with . If is available s then bybspecifying the index of the bin containing , which takes about s © bits, can be uniquely specified. Note that no knowledge about s ºvÿmlon is involved in specifying the index of the bin containing . This is the basis of the Slepian-Wolf coding [184] which launched the whole area of multiterminal source coding (see Berger [21]). D R A F T September 13, 2001, 6:27pm D R A F T 214 A FIRST COURSE IN INFORMATION THEORY HISTORICAL NOTES Transmission of an information source with distortion was first conceived by Shannon in his 1948 paper [173]. He returned to the problem in 1959 and proved the rate distortion theorem. The normalization of the rate distortion function is due to Pinkston [154]. The rate distortion theorem proved here is a stronger version of the original theorem. Extensions of the theorem to more general sources were proved in the book by Berger [20]. An iterative algorithm for computing the rate distortion function developed by Blahut [27] will be discussed in Chapter 10. Rose [169] has developed an algorithm for the same purpose based on a mapping approach. D R A F T September 13, 2001, 6:27pm D R A F T Chapter 10 THE BLAHUT-ARIMOTO ALGORITHMS For a discrete memoryless channel V ¥ )( , the capacity -0 w .]W C ÊË 7 ! 
(10.1) ÿ 0 where and are respectively the input and the output of the generic channel and A ( is the input distribution, characterizes the maximum asymptotically achievable rate at which information can be transmitted through the channel w reliably. The expression for in (10.1) is called a single-letter characterization because it depends only the transition matrix of the generic channel but for the channel. When both the input alnot on the block length ¯ of a code w P phabet and the output alphabet are finite, the computation of becomes 0 a finite-dimensional maximization problem. x $ » 0 with generic random variable For an i.i.d. information source , the rate distortion function X ] ^a` W 7 c de b6f b X W y ÿ 7 n ÿ -0 ! 0 þ (10.2) gih characterizes the minimum asymptotically achievable rate of a rate distortion code which reproduces the information source with an average0 distortion no more than with respect to a single-letter distortion measure . Again, the 0 expression for in (10.2) is a single-letter characterization because it depends only on the generic random variable but not on the block length ¯ of a rate distortion code. When both the source alphabet and the reproduction þ alphabet are finite, the computation of becomes a finite-dimensional minimization problem. Unless for very special cases, it is not possible to obtain an expression for w or in closed form, and we have to resort to numerical computation. D R A F T September 13, 2001, 6:27pm D R A F T 216 A FIRST COURSE IN INFORMATION THEORY However, computing these quantities is not straightforward because the associated optimization problem is nonlinear. In this chapter, we discuss the Blahut-Arimoto algorithms (henceforth the BA algorithms), which is an iterative algorithm devised for this purpose. In order to better understand how and why the BA algorithm works, we will first describe the algorithm in a general setting in the next section. Specializaw tions of the algorithm for the computation of and will be discussed in Section 10.2, and convergence of the algorithm will be proved in Section 10.3. 10.1 ALTERNATING OPTIMIZATION In this section, we describe an alternating optimization algorithm. This algorithm will be specialized in the next section for computing the channel capacity and the rate distortion function. Consider the double supremum fihkj z|{ { z * ö~} é ¹ ( 1@ fihkj (10.3) * ö~} ¹ ¹ is a convex subset of º for Þ © , and is a function defined on S @ @ @ . The function is bounded from above, and is continuousband has S ß >;¿(b@ continuous partial derivatives on . Further assume that for all , :ß there exists a unique such that where @ ¹ ?> ( @ ¹ ( @ W ]_ÊË { z { and for all j ö~} ¹ ( > @ ( ¹ ( W ]_ÊË z * j ö~} Let (W b@ and (10.4) @ :ß , there exists a unique >@H(W ß @ U * such that @ U 8 (10.5) @ S . Then (10.3) can be written as ¹ ( 8 fhkj z (10.6) ¹ ö~} { * In other words, the supremum of is taken over a subset of{ º º which is * equal to the Cartesian product of two convex subsets of º and º , respec¹ tively. + We now describe an alternating optimization algorithm computing , ( for » @ » x » $ the value of the double supremum in (10.3). Let for ÿ ÿ ÿ Ñ which are defined as follows. Let be an arbitrarily chosen vector in , @ >@H( ÿ x » Ñ Ñ $ and let ÿ . For , ÿ is defined by ÿ ÿ » D R A F T >K¿( @ »¿¾ ÿ September 13, 2001, 6:27pm (10.7) D R A F T 217 The Blahut-Arimoto Algorithms Figure 10.1. Alternating optimization. 
and » >@¶( @ » ÿ @ » » 8 ÿ Ñ @ Ñ (10.8) @ @ @ In words, ÿ and ÿ are generated in the order ÿ other ÿ ÿ ÿ @ ½½½ , where each vector in the sequence is a function of the previous ÿ ÿ Ñ vector except that ÿ is arbitrarily chosen in . Let ¹ ¹ ( » » ÿ 8 ÿ (10.9) Then from (10.4) and (10.5), ¹ » ¹ ( » ¹ ( ÿ ¹ ( ÿ ÿ $ $ ¹ x $ ¹ » »¿¾ » ÿ »'¾ @ » ÿ @ »¿¾ ÿ @ »¿¾ ÿ (10.10) (10.11) ÿ (10.12) (10.13) ¹ ¹ ¹ for . Since the sequence ÿ is non-decreasing, it must converge because » + ¹ is bounded from above. We will show in Section 10.3 that if Î ÿ @ ¹ is concave. Figure 10.1 is an illustration of the alternating ¹maximization » + algorithm, where in this case both ¯ and ¯ are equal to 1, and ÿ Î . The alternating optimization algorithm can be explained by the following analogy. Suppose a hiker wants to reach the summit of a mountain. Starting from a certain point in the mountain, the hiker moves north-south and eastwest alternately. (In our problem, the north-south and east-west directions can be multi-dimensional.) In each move, the hiker moves to the highest possible point. The question is whether the hiker can eventually approach the summit starting from any point in the mountain. D R A F T September 13, 2001, 6:27pm D R A F T 218 A FIRST COURSE IN INFORMATION THEORY ¹ Replacing infimum ¹ by in (10.3), the double supremum becomes the double z|{ ^ `k ö~} { z * ¹ ( 1@ ^ `k * ö~} 8 (10.14) ¹ ¹ @ All the previous assumptions on , , and remain valid except that is now assumed to be bounded from below instead of bounded from above. The double infimum in (10.14) can be¹ computed by the¹ same alternating optimization algorithm. Note that with replaced by , the maximums in (10.4) and (10.5) become minimums, and the inequalities in (10.11) and (10.12) are reversed. 10.2 THE ALGORITHMS In this section, we specialize the alternating optimization algorithm described in the last section to compute the channel capacity and the rate distortion function. The corresponding algorithms are known as the BA algorithms. 10.2.1 CHANNEL CAPACITY an input distribution A ( , and we write WP - if We will use to denote ( is strictly positive, i.e., A for all ( ß . If is not strictly positive, we P $ write . Similar notations will be introduced as appropriate. qoQo¬G"NHtLON 1P - V )( Let A ( ¥ be a given joint distribution on P , and let be a transition matrix from to . Then ]WÊË = = Q 7 A ( ¥ V V )( ±´³¶µ A ( ) ( "= = Q A 7 ( ¥ V )( S<P + ±´³¶µ A such that ( ) V ( (10.15) where the maximization is taken over all such that and (*) V .- + if and only if ( ) ¥ V A M 7 j V ) ( W.- V ( ¥ )( V ( U ¥ )( U A (10.16) (10.17) V the which corresponds to the input distribution i.e., the maximizing is ¥ )( . the transition matrix and V In (10.15) and theV sequel, we adopt the convention that the summation is )( -0 P taken over all ( and such that A ( P - and ¥ . Note that the right ! when is the input hand side of (10.15) gives the mutual information V )( distribution for the generic channel ¥ . Proof Let V = 7 D R A F T A ( U ¥ V )( U (10.18) j September 13, 2001, 6:27pm D R A F T 219 The Blahut-Arimoto Algorithms V P V ,¥ in (10.17). We assume with loss of generality that forV all ß for some ( ß . Since ?P - , for all , and hence P well-defined. 
Rearranging (10.17), we have A Consider = = Q A 7 V ( ¥ ±´³¶µ = = Q ( ¥ A 7 V = 7 V = Q V = Q $ (*) + V 9= + (*) V V 8 K A (*) is )( ±´³¶µ A V ( ) ( (10.20) V V (10.21) V (*) V (*) V ( ¥ - (10.19) (*) + ±´³¶µ V ( ) ( ) )( P V + (*) V + V ( ) (*) = Q 7 ±´³¶µ + - 7 #= ( A )( + + ±´³¶µ = Q V V ( ) V ) ( W + )( V ( ¥ V (10.22) V (*) (10.23) (10.24) where (10.21) follows from (10.19), and the last step is an application of the divergence inequality. Then the proof is completed by noting in (10.17) that + satisfies (10.16) because 1P - . $ q ACBMqoZNHtLas V For a discrete memoryless channel ¥ w . = ]WÊË gfihkj Ñ = Q ( ¥ A 7 V )( ±´³¶µ A )( V , ( ) ( (10.25) where the maximization is taken over all which satisfies (10.16). -0 i ! when is the input Proof Let denote the mutual information V ¥ )( distribution for the generic channel . Then we can write w Let + w achieves w + . If P ]_ÊË . Ñ i i ]_ÊË . Ñ ]_ÊË = . D R A F T ]WÊË fihkj Ñ (10.26) = Q 7 A 8 (10.27) i , then ]_ÊË . Ñ .]_ÊË . Ñ = 7 = Q A V ( ¥ ( ¥ V )( ±´³¶µ A )( ±´³¶µ A (10.28) V (*) ( (10.29) V ( ) ( September 13, 2001, 6:27pm (10.30) D R A F T 220 A FIRST COURSE IN INFORMATION THEORY where (10.29) follows from Lemma 10.1 (and the maximization is over all i which satisfies (10.16)). $ . Since is continuous in , Next, we consider the case when + for any ®lP , there exists Ú2P such that if then + w i # S (10.31) Ú S (10.32) ® + + where denotes the Euclidean distance between and . In particular, which satisfies (10.31) and (10.32). Then there exists 2 P w ]_ÊË Ñ $ $ (10.33) i (10.34) w P i Ñ fihkj . i (10.35) (10.36) ® where the last step follows because satisfies (10.32). Thus we have w - Finally, by letting ®:Î w w o Ñ . 8 (10.37) , we conclude that gfihkj Ñ i fihj S ® i gfhkj ]_ÊË = Ñ . = Q A 7 V ( ¥ V (*) 8 ( )( ±´³¶µ A (10.38) This accomplishes the proof. Now for the double supremum in (10.3), let ¹ W= = Q ( ¥ A 7 A ( ( ß A G V (*) ( )( A (10.39) 1@ and ±a³¶µ W with and playing the roles of V ( , respectively. Let - P M and 7 A ( W (10.40) and @ ¥ if M and V (*) (*) V 7 P V ( (*) V lß , TSUP (*) W V W.- G if V for all (*) ¥ V P ) ( .- ß V P - , 8 (10.41) D R A F T September 13, 2001, 6:27pm D R A F T 221 The Blahut-Arimoto Algorithms @ Then is a subset of n n @ and is a subset of checked that both and are convex. For all Lemma 10.1, ¹ = = Q A 7 = Q 7 V ±´³¶µ (10.42) (10.43) A (10.44) (10.45) (10.46) ) )8 ±´³¶µ ¹ @ , and it is readily and ß , by ! -0 > )( -0 V ( ¥ A n nn n (*) ( A V + (*) ( )( ±´³¶µ = ß V ( ¥ @ V Thus is bounded from above. Since for all ß , ( ) W- for all ( and V V ¥ (*) o ¹ such that , these components of are degenerated. In fact, these components of do not appear in the definition of in (10.39), which can be seen as follows. V Recall the convention that the double summation in V V )( P . If ( ) - , (10.39) V is over all ( and such that A ( P - and ¥ ) ( u , and hence the corresponding term¹ is not included in the then ¥ double summation. Therefore, it is readily seen that is continuous and has continuous partial derivatives on because all the probabilities involved in @ ¹ the double summation in (10.39) are strictly positive. Moreover, for any given ß , by Lemma 10.1, there exists a unique @ ß which maximizes . It will be shown shortly that for any given ß , there also exists a unique ¹ ß which maximizes . The double supremum in (10.3) now becomes fihkj ö~} = fihkj { ö} = Q ( ¥ A 7 * V )( ±a³¶µ A V (*) ( (10.47) @ w which by Theorem 10.2 is equal to , where the supremum over all ß is in fact a maximum. 
We then apply the alternating optimization algorithm in w the last section to compute . First, we arbitrarily choose a strictly positive Ñ Ñ input distribution in and let it be ÿ . Then we define ÿ and in general x » $ for by V » ÿ » ÿ (*) V W A M 7 ÿ j A » ÿ ( ¥ ( U ¥ )( V 7 D R A F T A » in view of Lemma 10.1. In order to define ÿ ¹ and in general we need to find the ß which maximizes for a given ß constraints on are = (10.48) )( U ( September 13, 2001, 6:27pm ÿ @ x $ for , , where the (10.49) D R A F T 222 A FIRST COURSE IN INFORMATION THEORY and ( A for all ( - P ß . (10.50) We now use the method of Lagrange multipliers to find the best by ignoring temporarily the positivity constraints in (10.50). Let = = Q A 7 V ( ¥ V ( ) ( )( ±´³¶µ A # = A 7 ( 8 (10.51) For convenience sake, we assume that the logarithm is the natural logarithm. Differentiating with respect to A ( gives A ( C Upon setting ¡ V ¥ = Q )( ±a³¶µ K- ¡;¢ 7 k ±´³¶µ ( kÌl# 8 A (10.52) , we have ÿ ( W A ±´³¶µ V (*) V ¥ = Q )( ±´³¶µ or ( ¤£ ¾ A & ÿ ñ kª¨# (10.53) V ( ) Q V (*) Q 7 -¥ ÿ 8 n (10.54) By considering the normalization constraint in (10.49), we can eliminate and Q Q obtain V 7 A ( 7 V (*) Q ¦ M j ¦ ¥ ( U ) V V ÿ ¥ Q n 7 ÿ n 8 j (10.55) V )( The above product is over all such that ¥ , and (*) P - for all P V such . This implies that both the numerator and the denominator on the right ( hand side above are positive, and therefore A P . In other words, the thus obtained happen to satisfy the positivity constraints in (10.50) although these ¹ constraints were ignored when we set up the Lagrange multipliers. We will @ which is show in Section 10.3.2 that is concave. ¹ Then as given in (10.55), ß unique, indeed achieves the maximum of for a given because is in x » $ the interior of . In view of (10.55), we define for by ÿ Q » A @ » ÿ Q ( » M »¿¾ Q ÿ 7 ¦ j ¦ »'¾ ÿ ( ) V ¥ ( U ) 7 V ÿ ¥ Q n ÿ 7 n j 8 (10.56) Ñ Ñ @ The vectors ÿ and ÿ are defined in the order ÿ ÿ ÿ ÿ ÿ ½½½ , where each vector in the sequence is a function of the previous vector ÿ Ñ except» that is x arbitrarily chosen in@ .x It remains» to show by induction » » ÿ $ $ that ÿ ß for and ÿ ß for . If ÿ ß , i.e., ÿ P - , D R A F T September 13, 2001, 6:27pm D R A F T 223 The Blahut-Arimoto Algorithms ¨ © R( D) ¨ © R( Dª s) - ª s Dª s ¨ ¨ © R( Dª s ) ¨ © (Dª s ¬ ,R ( Dª s )) § Dª s Figure 10.2. D D« max A tangent to the 6a curve with slope equal to . » V V @ then we see from (10.48) that ÿ (*» ) - @ if and only if ¥ ( ) Z- , i.e., » ß ß @ if @ . On the» other hand, , then we see from (10.56) that » » ÿ» ÿ ß ß ß P ¹ ¹ , i.e., ÿ . Therefore, and for » all x ÿ $ » » » ÿ » ÿ .x Upon determining ÿ ÿ , we can¹ compute » ÿ ÿ ÿ w for all . It will be shown in Section 10.3 that ÿ Î . 10.2.2 THE RATE DISTORTION FUNCTION This discussion in this section is analogous to the discussion in Section 10.2.1. Some of the details will be omitted for brevity. $ -F problems of interest, P . Otherwise, W.- for all For all since is nonnegative and non-increasing. Therefore, we assume without loss of generality that -F P - . -F P We have shown in Corollary 9.19 that if , then is strictly -® 4 6 3 5 7 decreasing for . Since is convex, for any ¯ - , there ° 43:5;7 such that the slope of exists a point on the curve for a tangent1 to the curve at that point is equal to ¯ . Denote such a point ` ± ` ± on the curve by , which is not necessarily unique. Then this tangent intersects with the ordinate at ± k ¯ ± . This is illustrated in D² -0 0 þ D² Figure 10.2. 
Let denote /1 the mutual information and denote 0i-0 0 þ 0 ² the expected distortion when0 þ is the distribution the for ² and is ² ² þ transition matrix from to defining . Then for any , is a point in the rate distortion region, and the line with slope ¯ passing through 1 We say that a line is a tangent to the ³ ÿ D R A F T h curve if it touches the ³ ÿ h curve from below. September 13, 2001, 6:27pm D R A F T 224 A FIRST COURSE IN INFORMATION THEORY D² D² D² ² intersects the ordinate at . Since the curve defines the boundary of the rate distortion region and it is above the tangent in Figure 10.2, we see that ¯ ± k ¯ .]W^a` ´ ± D² * D² ¯ à8 (10.57) À ² a ± ² which achieves ² the above minimum, For each ¯ - , if we can find ± : ± , i.e., the tangent in then the line passing through - ¯ Figure 10.2, gives a tight lower bound on the curve. In particular, if ± ± is unique, D² ± ± (10.58) and ² `± W. ± 8 (10.59) By varying over all ¯ ?- , we can then trace out the whole curve. In the rest of the section, we will devise an iterative algorithm for the minimization problem in (10.57). qoQo¬G"NHtL¶µ ² that ]_^ ` º 7 ¥ = X = Ñ ( · þ ¸S (*) ( Let be a given joint distribution on þ , and let ¹ be any distribution on such that ¹P - . Then P ¥ ( · · (þ ) ( » ±´³¶µ 7 (þ ) ( * (þ ¥ = X = 7 such ( · þ · (þ ) ( ±´³¶µ 7 (þ ) ( * » + (þ (10.60) where » + » ( þ ¥ = ( · (þ ) ( (10.61) 7 þ ( is the distribution on i.e., the minimizing ² distribution and the transition matrix . þ corresponding to the input Proof It suffices to prove that = 7 = X ¥ ( · » ±´³¶µ 7 for all ¹V² P because · (þ ) ( P (þ ) ( * (þ $ = = X 7 ¥ ( · · (þ ) ( ±´³¶µ 7 (þ ) ( » + (þ . The details are left as an exercise. Note in (10.61) that . D² ² (10.62) ¹ + P - ² Since and are continuous in , via an argument similar to the one we used in the proof of Theorem 10.2, we can replace the minimum ² ² D² over all P over all in (10.57) by the infimum . By noting that the right hand side of (10.60) is equal to and ² W= 7 =YX D R A F T ¥ ( · (þ )( 0D ( (þ 7 September 13, 2001, 6:27pm (10.63) D R A F T 225 The Blahut-Arimoto Algorithms we can apply Lemma 10.3 to obtain ± k ¯ ± X M ¼ ½.¾ ¼½ ¿ÁÀvÂÄÃÅ Æ À ¥ û ZÇ Z M ¼½.¾ ¼½ ¿ÁÀvÂÒÅ Æ Àv à ZÇ Z 7 y ÿ ÿ ZÐÎ n X 7 y ¥ û X û M 7 .ÈÉÊ2ËÏ Ì Z Í ZiÎ ¾ ± Ì û 7 ÿ û ¾ ËÌ Z Ï Ì û Í ZiÎ ZÐÎ 7 .ÈÉÊ 7 ÿ n ¥ û ZÇ Z M ± ZÇ Z 7 ÿ ÿ ¹ = E · = ¯ (þ ) ( * ±´³¶µ ( · » 0D (þ )( (þ ( ( þ :ß þ )( (* · þ TS for all ( 7 and @ » (þ (þ þ ß » G X (þ = $ (þ (10.66) (þ )( · ¥ 7 -0 ! 0 þ ( · » ±´³¶µ = X 7 · (þ ) ( ( · ß and 1@ P (10.67) I X M » 7 ( þ W (10.68) and , respectively. Then is a subset , and it is readily checked that both n n 7 = $ ¥ = X 7 X ¹ P W with and ¹ playing the roles of @ of n @ nn Dn and is a subset of and are convex. Since ¯ Ì- , ¹ ²H (10.65) (þ ) ( * G =YX ² · (þ ) ( 7 f7 Ñ 8 ÿ n 7 ( ¥ = X 7 ( · 7 ¥ = X 7 (10.64) X 7 e ÿ Now in the double infimum in (10.14), let ¹ ²Ó ÿ 7 ÿ 7 f 7 Ñ n X 7 y ¥ û X 7 e 7 y (þ ) ( * (þ = ¯ 7 ¥ =YX ( · (þ ) ( 0i (þ ( 7 (10.69) · (þ ) ( ±´³¶µ (þ ) ( * » + (þ " (10.70) (10.71) (10.72) - 8 ¹ Therefore, is bounded from below. The double infimum in (10.14) now becomes ´ ^a`k ö~} { º ^a` ö~} * Å = X = 7 7 ¥ ( · · (þ ) ( * ±a³¶µ » (þ ) ( * (þ ¯ = X = 7 ¥ ( · (þ ) ( 0D ( ( þ ÇÆ 7 (10.73) D R A F T September 13, 2001, 6:27pm D R A F T 226 A FIRST COURSE IN INFORMATION THEORY @ where the infimum over all ¹ ß is in fact a minimum. We then apply ¹the alternating optimization algorithm described in Section 10.2 to compute + , the value of (10.73). 
First,² we arbitrarily choose a strictly positive transition » Ñ Ñ matrix in and let it be . Then we define and in general ¹ ÿ for ¹ x ÿ ÿ $ by » þ » þ » ( "= ¥ ( · () ( (10.74) ÿ ÿ 7 ² » ² in view of Lemma 10.3. In order to define ÿ ¹ and in general ² ß we need to find² the which minimizes for a given ¹ ß constraints on are þ )( ( · P and =YX for all ( þ ) ( W · ( ( þ ¨ß for all ( 7 þ ÔS ß @ ÿ x $ for , , where the , (10.75) . (10.76) As we did for the computation of the channel capacity, we first ignore the positivity constraints in (10.75) when setting up X the Lagrange multipliers. Then we obtain ± e 7 f7 þ » · ( þ ) ( X M 7 » j ( Ð £ ÿ e ( þ U У ± X 7 f7 ÿ j - 8 P » ² The details are left as an exercise. We then define · ÿ »¿¾ » » þ ( ) ( W M 7 X ÿ þ ( У »¿¾ » j ¹ ÿ » 7 f7 ÿ e ( þ U У ± ¹ ² that It will be shown in the next section ÿ ÿ If there exists a unique point ± ± on the slope of a tangent at that point is equal to ¯ , then ² D² » ÿ » ² D² » ÿ Î by 8 j ÿ » ¹ (10.78) » ÿ ± $ X 7 f7 x for Xÿ ± e (10.77) ¹ x + as Î / . curve such that the Î ± 8 (10.79) » is arbitrarilyx close to the segment of the Otherwise, ÿ ÿ curve at which the slope is equal to ¯ when is sufficiently large. These facts are easily shown to be true. 10.3 CONVERGENCE ¹ ¹ » ¹ + . We then In this section, we first prove that if is concave, then ÿ Î apply this sufficient condition to prove the convergence of the BA algorithm for computing the channel capacity. The convergence of the BA algorithm for computing the rate distortion function can be proved likewise. The details are omitted. D R A F T September 13, 2001, 6:27pm D R A F T 227 The Blahut-Arimoto Algorithms 10.3.1 A SUFFICIENT CONDITION In the alternating optimization algorithm in Section 10.1, we see from (10.7) and (10.8) that » ( ÿ x $ for - » @ » ÿ . Define Õ ÿ ¹ ( ¹ ?>;'(b@ Then ¹ » ÿ ¹ ÿ ¹ ?> ÿ ( Õ * @ » ÿ » ¹ ( (10.80) 8 (10.81) ¹ (W 1@ @ » ÿ ¹ ( k » @ » ÿ ÿ (10.82) (10.83) 8 ÿ ¹ > @ ?> ÿ( » @ » ÿ ¹ ( > @ ?> ( >@H?>;(1@ » ÿ ¹ ( » ?> ( @ » W ¹ (10.84) ¹ » + We will prove that¹ being concave is sufficient for ÿ Î . To this end, ¹ ( first prove ¹ we that if is concave, then the algorithm cannot be trapped at if + S . ¹ qoQo¬G"NHtLmÖ Let ¹ ¹ » be concave. If Õ ÿ ¹ ( + S ¹ » , then P ÿ ¹ » ÿ . ¹ ( ¹ ( Proof We will ¹ prove that ¹ for any ß such that P » » + S , we see from (10.84) that Then if ÿ Õ ÿ ¹ » ¹ » ÿ ÿ ¹ ( » ÿ S + . P ¹ (10.85) and the lemma is proved. Õ ¹ ( ¹ Õ + ß S ¹ ( ( Consider any such¹ that . We will prove by contradiction .P that . Assume . Then it follows from (10.81) that ¹ ?> ( > @ ?> ( @ Now we see from (10.5) that ¹ ?>;'(b@ > ( @ If >@F?>;¿(1@ 8 ¹ ?>;'(b@ $ @ D (10.86) 1@ 8 (10.87) ¡ , then ¹ ?>;'(1@ ¹ ( 1@ b@ >K¿(1@ because ¹ ( @ P (10.88) is unique. Combining (10.87) and (10.88), we have ¹ ?>;'(b@ >@H?>;¿(b@ ¹ (W 1@ P (10.89) which is a contradiction to (10.86). Therefore, >;'(b@ W D R A F T 8 September 13, 2001, 6:27pm (10.90) D R A F T 228 A FIRST COURSE IN INFORMATION THEORY Ø Ù Ø Ü Ú × z2 × Ø Ù Ú (v 1 Û , v 2 ) (u 1 Û , v 2 ) z Ú × (u 1 Û , u 2 ) Ø Ü Ú (v 1 Û , u 2) z1 á(à Figure 10.3. The vectors Ý , Þ , à ß { á and à * . Using this, we see from (10.86) that ¹ ( >@F(W which implies ¹ (W b@ W 8 >@H?>;¿(1@ ¹ is unique. , there exists â + S ß ¹ â S 8 â (10.93) â (10.92) such that ¹ ( Consider (10.91) >@H(W 1@ because ¹ ( Since -F @ " 1@ â 8 (10.94) Win @ the direction of â , ã be the unit vector in Let ã be the unit vector ã -F @ direction b@ the of â , and be the unit vector in the direction of -F â . 
Then â ã ä or where ¦ Ù © @ ã â â @ ¦ " é é " .¦ @ ã ã â @ ã (10.95) (10.96) é b@ ã â (10.97) @ ¹ . Figure 10.3 is an illustration of the vectors , â , ã ã and ã (.W b@ @ We see from (10.90) that attains ¹ its maximum value at 1@ attains b@ its maximum ¹ when is fixed. In (particular, value at alone the line passing through and â . Let å denotes the gradient of Þ D R A F T September 13, 2001, 6:27pm D R A F T 229 The Blahut-Arimoto Algorithms ¹ ¹ ¹ ¹ . Since is continuous and has continuous partial derivatives, the directional ã ã ½ ¹ ¹ of derivative of at in the direction exists and is given by å . It ( 1@ from the concavity 1@ ¹ passing through that is concave along the line follows of ¹ (Wits bmaximum @ value b@ and â . Since attains at , the derivative of along the line passing through and â vanishes. Then we see that ¹ .- 8 ½ ã (10.98) å Similarly, we see from (10.92) that ¹ @ .- ½ ã å 8 (10.99) ¹ Then from (10.96), the directional derivative of given by @ ¹ ¹ ½ ã å ¹ Since .¦ ½ ã å ¦ " at ¹ $ â .- 8 ½ ã å ¹ ã is @ (10.100) is concave along the line passing through ¹ ( in the direction of and â , this implies Õ (10.101) ¹ ( which is a contradiction to (10.93). Hence, we conclude that P - . ¹ ( ¹ ¹ Although we have proved that the algorithm cannot be trapped at if ¹ » S » + , ÿ does not necessarily converge to because the increment in ÿ in each step may be arbitrarily small. In order to prove the desired convergence, we will show in next theorem that this cannot be the case. ¹ + $ q ACBMqoZNHtL¶æ ¹ If ¹ ¹ » is concave, then + Î ÿ . ¹ » Proof ¹ We have already shown in Section 10.1 that ÿ necessarily converges, x say to U . Hence, for any ®lP - and all sufficiently large , ¹ ¹ U ¹ » ® ÿ Õ Let ¹ ( ç«]_^a` z where ¹ U G (10.102) (10.103) ö} j ¹ ß U 8 ¹ ( U ® Õ ¹ o U 8 ¹ ( (10.104) Since has continuous partial derivatives, is a continuous function of . Then the minimum in (10.103) exists because U is compact2 . 2 } j is compact because it is the inverse image of a closed interval under a continuous function and bounded. D R A F T September 13, 2001, 6:27pm } is D R A F T 230 A FIRST COURSE IN INFORMATION THEORY ¹ ¹ ¹ Õ + S U ¹ ( We¹ now show that will lead to a contradiction if is concave. If + , then from ¹ Lemma ¹ 10.4, ( we see that P for all ß U and » » » ß hence P . Since ÿ satisfies (10.102), ÿ U , and Õ ÿ ¹ U S ¹ » ¹ ÿ ¹ ( » » ÿ ÿ $ (10.105) ¹ x ¹ for all sufficiently large . ¹ Therefore, no matter how smaller » U ¹ eventually be greater than¹ , which is a contradiction to ÿ » + we conclude that ÿ Î . 10.3.2 is,¹ U Î » will . Hence, ÿ CONVERGENCE TO THE CHANNEL CAPACITY In order to show that the ¹ BA algorithm for computing the channel capacity » w ¹ converges as intended, i.e., ÿ Î , we only need to show that the function defined in (10.39) is concave. Toward this end, for ¹ W = = Q ( ¥ A 7 V V ( ) ( )( ±´³¶µè A (10.106) @ @ @ defined in (10.39), we consider two ordered pairs and in , where and are defined in (10.40) and (10.41), respectively. For any -é Ó" CU and ê , an application of the log-sum inequality (Theorem 2.31) gives @ ( A "" A ' ( ±´¿ ³¶ µ A § ( A ±´³¶µ 'A ( ) V ( ) ' @H ( A " ¿ $ A A ±´³¶ § µ ¹ Therefore, @ "" ±´³¶µ V A "" A " ( V (*) 8 (10.107) @¶ V "" ( ( ) @H "" A (*) @H ( A V ( @ ¶ @õ A @¶ ±´³¶µ A V (10.108) ( and summing over all ( and , we obtain @ ( ) ' @H V )( V ( ) ¿ ( ( and upon multiplying by ¥ ¹ ( V ±´³¶µ Taking reciprocal in the logarithms yields ' ( ) ( "" A " @ ( V @H ( "" A @¶ ( $ ¹ ¹ @ "" is concave. Hence, we have shown that @ ¹ ÿ » Î 8 w (10.109) . 
PROBLEMS 1. Implement the BA algorithm for computing channel capacity. D R A F T September 13, 2001, 6:27pm D R A F T 231 The Blahut-Arimoto Algorithms 2. Implement the BA algorithm for computing the rate-distortion function. 3. Explain why in the BA Algorithm for computing channel capacity, we should not choose an initial input distribution which contains zero probability masses. 4. Prove Lemma 10.3. ¹ ²Ó 5. Consider function. ¹ in the BA algorithm for computing the rate-distortion ¹ ²Ó a) Show that for fixed ¯ and ¹ , · ¹ ²H b) Show that ¹ ( þ ) ( * M ¹ is minimized by X ± e 7 f7 (þ Ð X £ ÿ± e 7 f7 8 » (þ Ð U £ 7 j j ÿ X » is convex. HISTORICAL NOTES An iterative algorithm for computing the channel capacity was developed by Arimoto [14], where the convergence of the algorithm was proved. Blahut [27] independently developed two similar algorithms, the first for computing the channel capacity and the second for computing the rate distortion function. The convergence of Blahut’s second algorithm was proved by CsiszÊ ë r [51]. These two algorithms are now commonly referred to as the Blahut-Arimoto algorithms. The simplified proof of convergence in this chapter is based on Yeung and Berger [217]. The Blahut-Arimoto algorithms are special cases of a general iterative algorithm due to CsiszÊ ë r and TusnÊ ë dy [55] which also include the EM algorithm [59] for fitting models from incomplete data and the algorithm for finding the log-optimal portfolio for a stock market due to Cover [46]. D R A F T September 13, 2001, 6:27pm D R A F T Chapter 11 SINGLE-SOURCE NETWORK CODING In a point-to-point communication system, the transmitting point and the receiving point are connected by a communication channel. An information source is generated at the transmission point, and the purpose of the communication system is to deliver the information generated at the transmission point to the receiving point via the channel. In a point-to-point network communication system, there are a number of points, called nodes. Between certain pair of nodes, there exist point-to-point communication channels on which information can be transmitted. On these channels, information can be sent only in the specified direction. At a node, one or more information sources may be generated, and each of them is multicast1 to a set of destination nodes on the network. Examples of point-to-point network communication systems include the telephone network and certain computer networks such as the Internet backbone. For simplicity, we will refer to a point-to-point network communication system as a point-to-point communication network. It is clear that a point-to-point communication system is a special case of a point-to-point communication network. Point-to-point communication systems have been thoroughly studied in classical information theory. However, point-to-point communication networks have been studied in depth only during the last ten years, and there are a lot of problems yet to be solved. In this chapter and Chapter 15, we focus on point-to-point communication networks satisfying the following: 1. the communication channels are free of error; 2. the information is received at the destination nodes with zero error or almost perfectly. 1 Multicast means to send information to a specified set of destinations. D R A F T September 13, 2001, 6:27pm D R A F T 234 A FIRST COURSE IN INFORMATION THEORY In this chapter, we consider networks with one information source. 
Networks with multiple information sources will be discussed in Chapter 15 after we have developed the necessary tools. 11.1 A POINT-TO-POINT NETWORK 8í / í , A point-to-point network is represented by/ a directed graph ì where is the set of nodes in the network and is the set of edges in ì which represent the communication channels. An edge from node Þ to node L , which í represents the communication channel from node Þ to node L , is denoted by Þ L . We assume that ì is finite, i.e., î îï / . For a network with one information source, which is the subject of discussion in this chapter, the node at which information is generated is referred to as the source node, denoted by ¯ @ , and the destination nodes are referred to as the sink nodes, denoted by » ðððÒ »4ñ » . é For a communication channel from node Þ to node L , let ò J be the rate constraint, i.e., theé maximum number of bits which can be sent per unit time on the channel. ò J is also referred to as the capacity2 of edge óÞ LÄô . Let õ ÷ö òøùJ G ó-ú (11.1) LÄôûçüþý be the rate constraints for graph ì . To simplify our discussion, we assume that òøùJ are (nonnegative) integers for all ó-úiÿ|ôûçü . In the following, we introduce some notions in graph theory which will facilitate the description of a point-to-point network. Temporarily regard an edge in ì as a water pipe and the graph ì as a network of water pipes. Fix a » » sink node and assume that all the sink nodes except for is blocked. Suppose water is generated at a constant rate at node ¯ . We assume that the rate of water flow in each pipe does not exceed its capacity. We also assume that there is no leakage in the network, so that water is conserved at every node other than ¯ » and in the sense that the total rate of water flowing into the node is equal to the total rate of water flowing out of the node. The water generated at node ¯ » is eventually drained at node . A flow ö ø ó-úiÿ|ôûUüý (11.2) » õ in ì from node ¯ to node with respect to rate constraints is a valid assignment of a nonnegative integer ø to every edge ó-úÿ|ôûUü such that ø is equal to the rate of water flow in edge ó-úiÿÄô under all the above assumptions. on edge ó-úiÿ|ô . Specifically, is a flow in ì ø is referred to as the value of » from node ¯ to node if for all ó-úÿ|ôûUü , 2 Here ø òøvÿ (11.3) the term “capacity” is used in the sense of graph theory. D R A F T September 13, 2001, 6:27pm D R A F T 235 Single-Source Network Coding and for all úû except for and , where ó-úÐô ó-úÐô ø (11.4) ó-ú4ô.ÿ ø (11.5) ø "! (11.6) ÿ and ó-úÐô ø ÿ In the above, ó-ú4ô is the total flow into node ú and # ó-ú4ô is the total flow out of node ú , and (11.4) are called the conservation conditions. Since the conservation conditions require that the resultant flow out of any node other than and is zero, it is intuitively clear and not difficult to show that the resultant flow out of node is equal to the resultant flow into node . This common value is called the value of . is a max-flow from node to õ node in $ with respect to rate constraints if is a flow from to whose value is greater than or equal to the value of any other flow from to . A cut between node and node is a subset % of such that û&% and (' û)% . Let -, ' ü+* ó-úiÿÄôûçü.XúWû/% and û/%0 (11.7) be the set of edges across the cut % . The capacity of the cut % with respect to õ rate constraints is defined as the sum of the capacities of all the edges across the cut, i.e., (11.8) òø 3! 
ø ÿ 21 is a min-cut between node and node if % is a cut between and whose capacity is less than or equal to the capacity of any other cut between and . A min-cut between node and node can be thought of as a bottleneck between and . Therefore, it is intuitively clear that the value of a max-flow from node to node cannot exceed the capacity of a min-cut between node and node . The following theorem, known as the max-flow min-cut theorem, states that the capacity of a min-cut is always achievable. This theorem will play a key role in subsequent discussions in this chapter. % 457698;:26=<?>=>=@ >ACBED7F2GHJIK8MLNBPORQSGUTSVXWY4576M8Z:26=<\[^]=_J`a Let $ be a graph ððð õ ñ Ò b c ÿ ~ d ÿ c ÿ with sourceg node and sink nodes , and be the rate constraints. f ððð Then for e ÿhÄÿ ÿCi , the value of a max-flow from node to node is equal to the capacity of a min-cut between node and node . D R A F T September 13, 2001, 6:27pm D R A F T 236 A FIRST COURSE IN INFORMATION THEORY k m k s 2 2 2 1 1 1 l j 2 1 1 l t1 j (a) n 1 1 2 k s 2 b1bp2 n 1 n 2 b2 t1 l j (b) s bo3 b1 2 b1bo3 t1 (c) Figure 11.1. A one-sink network. 11.2 WHAT IS NETWORK CODING? Let q be the rate at which information is multicast from the source node ððð õ ñ ÿc to the sink nodes bÿcdÿ in a network $ with rate constraints . We are naturally interested in the maximum possible value of q . With a slight abuse of notation, we denote the value of a max-flow from the source node to a sink node by rtsuJvMwyxèó Rÿc ô . It is intuitive that q{z&r|suJv9w}xó Rÿc gf for all e (11.9) ô ððð ÿhÄÿ ÿCi , i.e., q{z&rt ~ rtsuJvMwyxèó Rÿc ôU! (11.10) This is called the max-flow bound, which will be proved in Section 11.4. Further, it will be proved in Section 11.5 that the max-flow bound can always be achieved. In this section, we first show that the max-flow bound can be achieved in a few simple examples. In these examples, the unit of information is the bit. First, we consider the network in Figure 11.1 with one sink node., Figure 11.1(a) f shows the capacity of each edge. By identifying the min-cut to be Rÿ ÿhJ0 and applying max-flow min-cut theorem, we see that rtsuKv9wyxó XÿcUbô ! (11.11) Therefore the flow in Figure 11.1(b) is a max-flow. In Figure 11.1(c), we show how we can send three bits b ÿ d , and U from node to node b based on the max-flow in Figure 11.1(b). Evidently, the max-flow bound is achieved. In fact, we can easily see that the max-flow bound can always be achieved when there is only one sink node in the network. In this case, we only need to treat the information bits constituting the message as physical entities and route them through the network according to any fixed routing scheme. Eventually, all the bits will arrive at the sink node. Since the routing scheme is fixed, the D R A F T September 13, 2001, 6:27pm D R A F T 237 Single-Source Network Coding 2 1 4 s 1 1 2 3 1 2 3 5 1 3 3 s b t1 b4 b3 b3b4b5 b1b 2 b1 b2b 3 2 t2 b3 3 b4b 5 t1 (a) b1b2b3 t2 (b) Figure 11.2. A two-sink network without coding. sink node knows which bit is coming in from which edge, and the message can be recovered accordingly. Next, we consider the network in Figure 11.2 which has two sink nodes. Figure 11.2(a) shows the capacity of each edge. It is easy to see that (11.12) rtsuKv9w}xó Rÿcbô and rtsuKv9wyxó XÿcdKô ! (11.13) So the max-flow bound asserts that we cannot send more than 5 bits to both Ub and d . Figure 11.2(b) shows a scheme which sends 5 bits ybÿdÿ ÿU , and to both b and d . Therefore, the max-flow bound is achieved. 
In this scheme, b and Ud are replicated at node 3, is replicated at node , while U and are replicated node 1. Note that each bit is replicated exactly once in the network because two copies of each bit are needed to send to the two sink nodes. We now consider the network in Figure 11.3 which again has two sink nodes. Figure 11.3(a) shows the capacity of each edge. It is easy to see that rtsuKv9wyxéó Rÿc f ô hÄÿ (11.14) . So the max-flow bound asserts that we cannot send more than 2 bits to both Ub and d . In Figure 11.3(b), we try to devise a routing scheme which sends 2 bits yb and d to both Ub and d . By symmetry, we sendf one bit on each output channel at node . In this case, b is sent on channel ó Xÿ ô and d is sent f ÿh , .ø is replicated and the copies are sent on channel ó XÿhXô . At node ú , ú on the two output channels. At node 3, since both yb and d are received but there is only one output channel, we have to choose one of the two bits to send on the output channel ó ÿc|ô . Suppose we choose yb as in Figure 11.3(b). Then the bit is replicated at node 4 and the copies are sent to nodes b and d . At node d , both b and Ud are received. However, at node Ub , two copies of yb are e ÿh D R A F T September 13, 2001, 6:27pm D R A F T 238 A FIRST COURSE IN INFORMATION THEORY s 1 1 1 2 1 b1 1 4 1 1 t t1 b1 1 b1 t1 2 b1 1 b 2 2 b1+b 2 2 b1+b2 b t 2 2 2 3 b1 4 b2 b2 b s b1 3 2 b1 t b2 b2 b1 4 b (b) s b1 b1 t1 2 (a) 2 3 b1 1 b2 b2 1 s b1 3 1 1 b1b2 4 b1 t t1 2 (d) (c) Figure 11.3. A two-sink network with coding. received but d cannot be recovered. Thus this routing scheme does not work. Similarly, if d instead of b is sent on channel ó ÿc|ô , b cannot be recovered at node d . Therefore, we conclude that for this network, the max-flow bound cannot be achieved by routing and replication of bits. However, if coding is allowed at the nodes, it is actually possible to achieve the max-flow bound. Figure 11.3(c) shows a scheme which sends 2 bits yb and d to both nodes b and d , where ‘+’ denotes modulo 2 addition. At node b , b is received, and d can be recovered by adding yb and yb Yd , because d b ¤ó b Y d U ô ! (11.15) Similarly, d is received at node d , and yb can be recovered by adding d and b {Ud . Therefore, the max-flow bound is achieved. In this scheme, b and d are encoded into the codeword yb;Yd which is then sent on channel ó ÿc|ô . If coding at a node is not allowed, in order to send both yb and d to nodes Ub and shows such a scheme. d , at least one more bit has to be sent. Figure 11.3(d) In this scheme, however, the capacity of channel ó ÿc|ô is exceeded by 1 bit. D R A F T September 13, 2001, 6:27pm D R A F T 239 Single-Source Network Coding ¢ 1 1 1 ¤ ¢ s 1 1 1 1 b1 £ 2 1 3 t1 ¡ 2 1 ¤ ¦ t ¤ 3 b 1+b 2 b2 2 b2 b1 1 1 ¤ ¥ t s £ b2 3 b 1+b 2 b1 ¤ ¥ t t1 ¡ (a) 2 ¤ ¦ t 3 (b) Figure 11.4. A diversity coding scheme. Finally, we consider the network in Figure 11.4 which has three sink nodes. Figure 11.4(a) shows the capacity of each edge. It is easy to see that rtsuJvMwyxéó Xÿc ô h (11.16) for all e . In Figure 11.4(b), we show how to multicast 2 bits yb and d to all the sink nodes. Therefore, the max-flow bound is achieved. Again, it is necessary to code at the nodes in order to multicast the maximum number of bits to all the sink nodes. The network in Figure 11.4 is of special interest in practice because it is a special case of the diversity coding scheme used in commercial disk arrays, which are a kind of fault-tolerant data storage system. 
For simplicity, assume the disk array has three disks which are represented by nodes 1, 2, and 3 in the network, and the information to be stored are the bits yb and d . The information is encoded into three pieces, namely b , d , and b & d , which are stored on the disks represented by nodes 1, 2, and 3, respectively. In the system, there are three decoders, represented by sink nodes Ub , d , and , such that each of them has access to a distinct set of two disks. The idea is that when any one disk is out of order, the information can still be recovered from the remaining two disks. For example, if the disk represented by node 1 is out of order, then the information can be recovered by the decoder represented by the sink node which has access to the disks represented by node 2 and node 3. When all the three disks are functioning, the information can be recovered by any decoder. The above examples reveal the surprising fact that although information may be generated at the source node in the form of raw bits3 which are incompressible (see Section 4.3), coding within the network plays an essential role in 3 Raw bits refer to i.i.d. bits, each distributing uniformly on §¨ b© . D R A F T September 13, 2001, 6:27pm D R A F T 240 A FIRST COURSE IN INFORMATION THEORY Figure 11.5. A node with two input channels and two output channels. multicasting the information to two or more sink nodes. In particular, unless there is only one sink node in the network, it is generally not valid to regard information bits as physical entities. We refer to coding at the nodes in a network as network coding. In the paradigm of network coding, a node consists of a set of encoders which are dedicated to the individual output channels. Each of these encoders encodes the information received from all the input channels (together with the information generated by the information source if the node is the source node) into codewords and sends them on the corresponding output channel. At a sink node, there is also a decoder which decodes the multicast information. Figure 11.5 shows a node with two input channels and two output channels. In existing computer networks, each node functions as a switch in the sense that a data packet received from an input channel is replicated if necessary and sent on one or more output channels. When a data packet is multicast in the network, the packet is replicated and routed at the intermediate nodes according to a certain scheme so that eventually a copy of the packet is received by every sink node. Such a scheme can be regarded as a degenerate special case of network coding. 11.3 A NETWORK CODE In this section, we describe a coding scheme for the network defined in the previous sections. Such a coding scheme is referred to as a network code. Since the max-flow bound concerns only the values of max-flows from the source node to the sink nodes, we assume without loss of generality that ' there is no loop in $ , i.e., ó-úiÿú4ô û ü for all údûg , because such edges do not increase the value of a max-flow from node to a sink node. For the same ' reason, , we assume that there is no input edge at node , i.e., ó-úiÿ~ô û ü for all úûª¬« "0 . We consider a block code of length . Let ® denote the information source and assume that ¯ , the outcome of ® , is obtained by selecting an index from a D R A F T September 13, 2001, 6:27pm D R A F T 241 Single-Source Network Coding set ° according to the uniform distribution. The elements in ° are called messages. 
For ó-úÿ|ôèû ü , node ú can send information to node which depends only on the information previously received by node ú . An óÿKó± ø ó-úiÿ|ôûuüèô.ÿC² ô(³ -code on a graph $ is defined by the components listed below. The construction of an ³ -code from these components will be described after their definitions are given. 1) A positive integer ´ . 2) µ ,¶f and ¼ µ such that -,¶f 3) ÀÂÁ ¼ f , ÀÂÁÄÃÅ0 ÿhÄÿ¸·¸·¸·~ÿC´½0¾¹» (11.18) . ó ¿ ôôûçü ÿhÄÿ¸·¸·¸·ÿ}à (11.17) ,¶f ó ¿ ô.ÿ ó ÿhÄÿ¸·¸·¸·ÿC´E0º¹» , such that z¿Æz{´ Ç where µ Ì 4) If µ -,¶f ø ó ¿ ô ± ø à À+ÁËà Á yÈ}É Ê z¿Æz&´Í , then (11.19) ÿ ¼ ó ¿ ô.ÿ ó ó ¿ ôô ó-úÿ|ô03! Î (11.21) ÁÏJ°¹ÐÀÂÁ|ÿ where µ If -,¶f ó ¿ ô ' ° , if Ö ¼ -,¶f Î Ç ÁÚ Á à gf , whereä D R A F T Ç (11.23) (11.24) ÿhÄÿ¸·¸·¸·;ÿCi ¼ z¿æz&´Í ,è à (11.25) À+ÁUÛ2¹»°Uÿ ÁÛ "áãâ å,¶f çf such that for all e ó ¿ ô0 be an arbitrary constant taken from À+Á . 5) ÿhÄÿ¸·¸·¸·~ÿCi ¿K× ô ó ÀÂÁÛ2¹ÞÀÂÁJß ÁÛ ÕÜ7Ý Î otherwise, let µ z¿K×Mؿ٠is nonempty, then (11.22) ÿhÄÿ¸·¸·¸·;ÿJÑhÒÓÕÔÕ03! Á e (11.20) ó ¿ ô 0 (11.26) ó¯ ô ¯ September 13, 2001, 6:27pm (11.27) D R A F T 242 à A FIRST COURSE IN INFORMATION THEORY è è for Î all ¯®û° , whereà is the function à from ° to ° inducedà inductively by f and such that ó¯ô denotes the value of as a function Á|ÿ z¿éz´ of ¯ . The quantity ² is the rate of the information source ® , which is also the rate at which information is multicast from the source node to all the sink nodes. The óÿKó±vøê ó-úiÿÄôûuüô.ÿC² ô ³ -code is constructed from these components as follows. At the beginning of a coding session, the value of ® is available to node . During the coding session, there are ´ transactions which take place µ ë a node sending inforin chronological order, where each transaction refers to Î mation to another node. In the ¿ th transaction, node ¿=ì encodes according ¼ ë Î µ ë in À+Á to node ¿=ì . The domain to encoding function Á and sends an index µíë information ë far, received byÎ node ¿=ì µíso of Á is the andÖ we distinguish two ' , the domain of Á is ° . If ¿=ì , Á gives the indices µíë cases. If ¿=ì of all the previousÎ transactions for which information was sent to node ¿=ì , Ì so the domain of Á is î ÁÛ ÕÜ7Ý ÀÂÁ Û . The set ø gives the indices of all the transactions for which information is sent from node ú to node , so ±vø is the ä number of possible index tuples that can be sent from node ú to node dur gives à the indices of all the transactions for ing the coding session. Finally, which information is sent to node , and is the decoding function at node which recovers ¯ with zero error. 11.4 THE MAX-FLOW BOUND In this section, we formally state and prove the max-flow bound discussed in Section 11.2. The achievability of this bound will be proved in the next ï section. 6=HKORQ7OW7O8;Q>=>=@ñð õ For a graph $ with rate constraints , an information ë ë ë rate qÍò is asymptotically achievable if for any óõô , there exists for sufficiently large an ÿ ± ø úiÿKìû®üêì.ÿC²9ì;³ -code on $ such that ë for all úë ÿJìUû ü , where channel úÿJì , and b=ö b ö (11.28) w"÷ d ±vøÚz&øø ùYó w"÷ d ±vø is the average bit rate of the code on (11.29) ²ÙòYq)úõó¸! For brevity, an asymptotically achievable information rate will be referred to as an achievable information rate. is achievable, then Remark It follows from the above definition that if q&ò f Á q × is also û z q g z q q ÿ¿Yò achievable for all . Also, if is achievable, × ÿ ö Á then q ~ñr¬Áýüþ)q is also achievable. 
Therefore, the set of all achievable ÿ information rates is closed and fully characterized by the maximum value in the set. D R A F T September 13, 2001, 6:27pm D R A F T 243 Single-Source Network Coding 457698;:26=<?>=>=@ñÿEACBED7F2G õ IJ8MLÚ8ZVZQ a For a graph $ with rate constraints , if q is achievable, then ë q{z&rt ~ rtsuJvMwyx Rÿc U ì ! (11.30) õ ë ë rate constraints ë Proof It suffices to prove that for a graph $ with , if for any óô there exists for sufficiently large an ÿ ±vø úÿJì$û#ü¬ì.ÿC²9ì#³ -code on $ such that b ö w"÷ d ±vøÚz&øø ùYó (11.31) ë for all úiÿKìûçü , and (11.32) ²ÙòYq)ú½óÒÿ then q satisfies (11.30). Consider such a code for a fixed ó and a sufficiently large , and consider gf è ÿhÄÿ¸·¸·¸·;ÿCi % between node and node . Let any e and any cut ë ë Î ë Ì è Î Î where ¯ f ¯2ì Á è J¿ ¯Xì ì.ÿ (11.33) ë Î and Á is the Î function from ° to ÀÂÁ induced inductively by Á such that Á ¯2ì denotes the value of as a function of è . is all the information known by node during the whole coding ¯ ¯2ì Î ë session when the message isµ ¯ ë . Since Á ¯2ì is a function of ë the information è â ¿=ì , we see inductively that ¯Xì is a function previously received by node Î ë Ì 71 of Á ¯Xì.ÿ¿ , where -, ë ÁUÛ(ÿ ë zN¿ × ° zN¿ ÿKì * is the set of edges across the cut , we have determined at node ë ë ® ì z Î 1 Á yÈ}É Ê ö w"÷ d 71 D R A F T ë Î ö 71 è R Õ 1 ö (11.37) ë Á 1 Á yÈ}É Ê z can be Ì ®/ì.ÿ¿ z ¯ (11.35) (11.36) ìcì ë Á (11.34) ë â ®/ìcì è z ' /%0 and as previously defined. Since % â ® ë ë ®®ÿ /% ®/ìcì w"÷ d à ÀÂÁÄà (11.38) (11.39) Ç Á yÈ}É Ê Ã À+ÁËà w"÷ d ± Õ! September 13, 2001, 6:27pm (11.40) (11.41) D R A F T 244 A FIRST COURSE IN INFORMATION THEORY Thus q)úEó z ² z b=ö z b=ö b z z w"÷ d Ñh ÒÓ Ô w"÷ ë d ðà ®/ì b ö 71 (11.42) (11.43) (11.44) (11.45) w"÷ d ± ë Yóì ø 71 z (11.46) 71 ùà ø (11.47) (11.48) à óÿ where (11.46) follows from (11.41). Minimizing the right hand side over all % , we have q)úEó zrt~ñ ø à à ó¸! (11.49) * Õ 1 The first term on the right hand side is the capacity of a min-cut between node and node . By the max-flow min-cut theorem,ë it is equal to the value of a max-flow from node to node , i.e., rtsuJvMwyx Rÿc ì . Letting ót¹ , we obtain ë q&z{rtsuKv9wyx Xÿc ìU! (11.50) gf Since this upper bound on q holds for all e ÿhÄÿ¸·¸·¸·~ÿCi ë q{z&rt ~ rtsuJvMwyx , Rÿc U ì ! (11.51) The theorem is proved. Remark 1 The max-flow bound cannot be proved by a straightforward applië ë because for a cutë % ë between node and cation of the data processing theorem ë , the random variables ® , ®/ì "!Ú*íì , ® ì+¶#"! * × ì , and node â ®/ì in general do not form a Markov chain in this order, where å, !Ú* and ë . ë å, !ê* × ÿKì % ç ÿJì& for some $t0 * (11.52) * for some t0 (11.53) are the two sets of nodes on the boundary of the cut % . The time parameter and the causality of an ³ -code must be taken into account in proving the max-flow bound. D R A F T September 13, 2001, 6:27pm D R A F T 245 Single-Source Network Coding Remark 2 Even if we allow an arbitrarily small probability of decoding error in the usual Shannon sense, by modifying our proof by means of a standard application of Fano’s inequality, it can be shown that it is still necessary for q to satisfy (11.51). The details are omitted here. 11.5 ACHIEVABILITY OF THE MAX-FLOW BOUND The max-flow bound has been proved in the last section. The achievability of this bound is stated in the following theorem. This theorem will be proved after the necessary preliminaries are presented. 
457698;:26=<?>=>=@(' For a graph $ with rate constraints ) , if ë q{z&rt ~ rtsuJvMwyx Rÿc . ì ÿ (11.54) then q is achievable. A directed path in a graph $ is a finite non-null sequence of nodes ¼ ë ¼ ¼* ¼ bÿ ¼ Ðÿ bì f are ÿhÄÿ¸·¸·¸·~ÿ,+?ú dÿ¸·¸·¸·Kÿ »f ¼* (11.55) bÒÿ ë ¼ f ¼ ÿhÄÿ¸·¸·¸·;ÿ,+ ú for all . The edges ÿ bì , referred to as the edges on¼ the directed path. Such a ¼ ¼-* ¼* sequence is called a directed path from b to . If b , then it is called a directed cycle. If there exists a directed cycle in $ , then $ is cyclic, otherwise $ is acyclic. In Section 11.5.1, we will first prove Theorem 11.4 for the special case when the network is acyclic. Acyclic networks are easier to handle because the nodes in the network can be ordered in a way which allows coding at the nodes to be done in a sequential and consistent manner. The following proposition describes such an order, and the proof shows how it can be obtained. In Section 11.5.2, Theorem 11.4 will be proved in full generality. ?f that such .º:X80/M821yOñW7O8ZQ >=>=@43 ë If $ Wÿ ì is a finite acyclic graph, then it is possible to order the nodes of $ in a sequence such that if there is an edge from node to node , then node is before node in the sequence. Proof We partition into subsets b ÿU d ÿ¸·¸·¸· , such that node is in 9Á if and a longest directed path ending at node is equal to ¿ . We only if the length of is in 9Á such that ¿½zû¿ × , then there claim that if node is in 9ÁUÛ and node exists no directed path from node to node . This is proved by contradiction as follows. Assume that there exists a directed path from node to node . Since there exists a directed path ending at node of length ¿ × , there exists f a directed path ending at node containing node whose length is at least ¿ × . D R A F T September 13, 2001, 6:27pm D R A F T 246 A FIRST COURSE IN INFORMATION THEORY Since node is in equal to ¿ . Then , the length of a longest directed path ending at node is 9Á f z¿S! (11.56) ô¿ × ò¿S! (11.57) ¿ × However, this is a contradiction because f ¿ × Therefore, we conclude that there exists a directed path from a node in =ÁÛ to a node in =Á , then ¿ × Ø¿ . Hence, by listing the nodes of $ in a sequence such that the nodes in 9Á appear before the nodes in 9ÁUÛ if ¿ Øå¿ × , where the order of the nodes within each 9Á is arbitrary, we obtain an order of the nodes of $ with the desired property. 5ÂF7D <6/=IJ6Y>=>=@ñ] Consider ordering the nodes in the acyclic graph in Figure 11.3 by the sequence f (11.58) XÿhÄÿ ÿ ÿckÿcd|ÿcb! in this sequence, if there is a directed path from node It is easy to check that to node , then node appears before node . 11.5.1 ACYCLIC NETWORKS In this section, we prove Theorem 11.4 for the special case when f f the graph is acyclic. Let the vertices in $ be labeled by ÿ ÿ¸·¸·¸·~ÿ}ÃÆÃ3ú in the fol has the label ë 0. The other vertices lowing way. The source are labeled in a f f way such that for z/z̦̾ú , ÿJì7 implies Ø . Such a labeling is possible by Proposition 11.5. 
We regard RÿcbÒÿ¸·¸·¸·Kÿc98 as aliases of the ë corresponding vertices.ë ë ì.ÿC²9ì;: -code on the graph $ defined We will consider an ÿ ±Ú ÿKì& by $ ë RÿJì< 1) for all , an encoding function Î= where ÿKì& such that Î ¾ , ¹ ÿhÄÿ¸·¸·¸·ÿc± (11.59) Õ0Xÿ -,¶f ° ë 2) for all = ,¶f +˰ ë Ç Û Û ' ÿhÄÿ¸·¸·¸·;ÿJÑh (11.60) ÒÓ ÔÕ03ß , an encoding function ,¶f ,¶f ÿhÄÿ¸·¸·¸·~ÿc± Û 0º¹ ÿhÄÿ¸·¸·¸·ÿc± Õ0 (11.61) Î (if × × ÿ ì> we adopt the convention that is an arbi0 is empty, ,¶f trary constant taken from ÿhÄÿ¸·¸·¸·~ÿc± Õ0 ); D R A F T September 13, 2001, 6:27pm D R A F T 247 Single-Source Network Coding gf 3) for all e , a decoding function ÿhÄÿ¸·¸·¸·~ÿCi à ,¶f Ç ÿhÄÿ¸·¸·¸·~ÿc± R â such that à à è ë ¯ à denotes the value of ¯2ì Î (11.62) ¯2ì ë for all ¯?° . (Recall that è â 0+¹N° ë (11.63) as a function of ¯ .) à Î In the above, is the encoding function for edgeÎ ÿKì , and is the decoding function forÎ sink node . In a Î coding session, is applied before Û CÛ if Ø × . This ë defines the order in which × , and is applied before Û if ÆØ{ if × ÿ ì , a node does not the encoding functions are applied. Since × Ø encode until all the necessary information is received on the input channels. A : -code is a special case of an ³ -code defined in Section 11.3. ë Assume that q satisfies (11.54) with respect to rate constraints ) . Itë suffices ë show that for any óÙô an ÿ ± / to , there exists for sufficiently large ÿJì ì.ÿq½úõóUì: -code on $ such that ë b ö w"÷ d ±Úz&ø ùYó (11.64) ë ë ë . Instead, we will show the existence of an ÿ ± ÿJì@ -code satisfying the same set of conditions. This will be done by constructing a random code. In constructing this code, we temporarily replace ° by -,¶f ° × ÿhÄÿ¸·¸·¸·;ÿJÑBAh ÒDC ÔÕ0Xÿ (11.65) ÿJì@ for all ì.ÿq : ì Î= where A is any constant ë greater than 1. Thus the domain of is expanded . from ° to ° × for all XÿKì& ë We now construct the encoding Î= functions ë as follows. For all E such that = {° × , let ¯Xì be a value selected independently ë from XÿKìF ,¶f , for all ¯G set ÿhÄÿ¸·¸·¸·ÿc± 0 according to the uniform distribution. For all ÿJì the ' ÿ , and for all H Î ë ,¶f Ç ÿhÄÿ¸·¸·¸·~ÿc± Û Û Õ H ^Û( 0Xÿ ,¶f let ì be a value selected independently from the set ing to the uniform distribution. H Let =yë and for %ÿ ¯2ì ' , let H ¯Xì ë Î ë ë ÿhÄÿ¸·¸·¸·~ÿc± Õ0 ¯Xì.ÿ ÿKì accord- (11.67) ¯bÿ è ë D R A F T (11.66) ì.ÿ September 13, 2001, 6:27pm (11.68) D R A F T 248 A FIRST COURSE IN INFORMATION THEORY H è Î ë Î ë where ¯I° × and ¯Xì denotes the value of as a function of ¯ . ¯Xì is all the information received by node during the coding session when the H H ë distinct ë ¯bÿc¯ × P° , ¯ and ¯ × are indistinguishable at sink message is ¯ . For â ¯ × ì . For all ¯J° , define if and only if â ¯Xì f gf LM NK ë ¯Xì H H ë if for some e exists ¯ ÿhÄÿ¸ë ·¸·¸·;ÿC i , there ' â ¯ × ¯ , such that â ¯2ì ¯ ×ì , otherwise. × ° , (11.69) ë ¯Xì is equal to 1 if and only if ¯ cannot be f uniquely determined at at least one of the sink nodes. Now fix ¯Oõ° and z e z-i . Consider any ¯ × õ° not equal to ¯ and define the sets H H -, ë . % ¨ H -, and %ùb ¯Xì ë ë ' ¯ × ì0 H ë ¯2ì ç (11.70) ¯ × ì03! (11.71) % ¨ is the set of nodes at which the two messages ¯ and ¯ × are distinguishable, and %b is the set of H nodes atH which ¯ and ¯ × are indistinguishable. Obviously, ë ë >/% ¨ . â â ¯Xì ¯ × ì . Then % ¨ % Now suppose for some %P , where é' ë % and Q ?% ,è i.e., % isè a cut H betweenH node and node . For any ÿJì RTS , Î , ë Î ë ë ë ' b ¯Xì ¯ × ì¸Ã ¯Xì ¯ × ì0 ± ! 
(11.72) Therefore, RUS , % ¨ RTS , %0 R S , % ¨ T %oÿý% ¨@V RTS , % ¨ %tÃÅ% ¨ V è ÃÅ% ¨ V RTS %t ë , Î @ z Ç % ¨ R Õ21 b ± Ç R Õ21 where %Ï0 RUS , %Ï0 %Ï0 Î ë ¯Xì H %0 V ë ¯ × ì¸Ã ¯Xì H ' ë ¯ × ì0 (11.76) (11.77) ÿ å, ë * % ¨ è (11.73) (11.74) (11.75) ÿJìW /%oÿ ' /%Ï0 (11.78) ë is the set of all the edges across the cut % as previously defined. Let ó be any fixed positive real number. For all ÿJì& , take ± such that ø6 D R A F T YX¬z{ b=ö w"÷ù±Úz&ø ùPó September 13, 2001, 6:27pm (11.79) D R A F T 249 Single-Source Network Coding for some ØZX . Then Ø&ó RUS , % ¨ z Õ21 Ò^D_ h j m z à h ]`Ga É4b Êdcfeg 1 [ É Êih (11.84) C ;] (11.85) ÿ 71 (11.83) ; ÃËò ò&rt~ñ ø (11.82) f * Ò h b) follows because _ Ò z 1 (11.81) a Ékb Ê9c4elg 1 [ É Ê h Ò ^ h = ]onUp9qsrtdu â v \[ É Ê;] Ò ]. ` z a) follows because h Õ21 (11.80) Ç z where b ± Ç %0 * Û ë 21 where % × is a cut between node theorem; ø rtsuKv9wyx Xÿc . ì ÿ (11.86) and node , by the max-flow min-cut c) follows from (11.54). Note that this upper bound does not depend on and H has h _ _ H subsets, RUS , ë hw_ z % is some subset of â ¯ × ì0 % ¨ . Since ë â ¯ S ì, RTX % _h % ;]. Ò C for some cut % between node and node z Ø ë â ¯2ì 0 (11.87) (11.88) H H by the union bound. Further, RTS , ë â ¯ × ì for some ¯ × ° ë f ;] ð × Ã}ú ìh _ _ h Ò C ;] Ò C h _ _h Ò C A h x Ú ] A h _ _h Ò ÿ Ú ×,¯ × ' ¯Æ0 (11.89) (11.90) (11.91) where (11.89) follows from the union bound and (11.90) follows from (11.65). Therefore, y ë ¯Xì{z D R A F T September 13, 2001, 6:27pm D R A F T 250 A FIRST COURSE IN INFORMATION THEORY RUS , RUS}| ë 8 gf ¯2ì H , 0 ë â ¯2ì (~ Ø i< ë AÚh _ b _h Ò (11.92) H ë â ¯ ×ì × ° for some ¯ × ,¯ × ' ¯Æ0 (11.93) ] (11.94) (11.95) ÿXKì by the union bound, where ë ÿXJì i<AÚhw_ ] _h Ò (11.96) ! Now the total number of messages which can be uniquely determined at all the sink nodes is equal to ë f ë ú ¯2ìcìU! (11.97) Û By taking expectation for the random code we have constructed, we have ë f Û ú ë ë f ¯Xìcì Û ô ë ¯2ì{zì (11.98) ë ë f ë Û ë f ò y ú ÿXKìcì ú (11.99) ÿXJìcì,AÚhÒxC1ÿ ú (11.100) where (11.99) follows from (11.95), and the last step follows from (11.65). Hence, there exists a deterministic code for which the number of messages which can be uniquely determined at all the sink nodes is at least ë ë f ÿXJìcì,AÚh ÒxC ú (11.101) ÿ ë which is greater than h ÒxC for sufficiently large since Let ° to be any set of Ñh H ÒxC Ô such messages in ° × . For e ,¶f Ç upon defining where ¯?° ÿhÄÿ¸·¸·¸·~ÿc± Û Û â à H such that ë ë ë H 0 ÿ Û â X as é¹ . and ÿhÄÿ¸·¸·¸·~ÿCi (11.102) ì ÿXK g f ì#¹ H (11.103) ¯ ë â ¯Xì.ÿ ë ÿJì (11.104) we have obtained a desired ÿ ± ì.ÿq ì<: -code. Hence, Theorem 11.4 is proved for the special case when $ is acyclic. D R A F T September 13, 2001, 6:27pm D R A F T 251 Single-Source Network Coding 11.5.2 CYCLIC NETWORKS For cyclic networks, there is no natural ordering of the nodes which allows coding in a sequential manner as for acyclic networks. In this section, we will prove Theorem 11.4 in full generality, which involves the construction of a more elaborate code on a time-parametrized acyclic graph. Consider the graph $ with rate constraints ) we have defined in the previë ous sections but without the assumption that $ is acyclic. We first construct Kÿ ì from the graph $ . The set a time-parametrized graph $> f consists of / layers of nodes, each of which is a copy of . Specifically, where ~ (11.105) ÿ ¨ -, ªt03! (11.106) As we will see later, is interpreted as the time parameter. 
The set of the following three types of edges: ë ¨ ÿ ë 1. 2. ë 3. ì.ÿ ÿc ÿ consists b zZ z zZ ë zQ½ú f ì.ÿ f ì.ÿ ÿJìW f ÿ zZÙzEú . ¨ For $> , let x be the source node, and Þ letf , be a sink node which corresponds to the sink node in $ , e ÿhÄÿ¸·¸·¸·;ÿCi . Clearly, $ is ëµ ¼ each edge in $ ends at a vertex in ëaµ layer acyclic because with a larger index. ¼ be the capacity of an edge ÿ ì , where Let ø> ÿ ÿ ì ëµ ø M NK ø ¼ ë if ÿ ì ÿ and zZézQEú otherwise, and let ) y f ì ÿJì for some (11.107) ÿ ëµ ø ë b ÿ ¼ ÿ ì z8ÿ (11.108) which is referred to as the rate constraints for graph $ . Þ 5ÂF7D <6/=IJ6Y>=>=@f In Figure 11.6, we show the graph graph $ in Figure 11.1(a). ù6=<+< D node >=>=@4 gf $ for for the For e in graph $ from ÿhÄÿ¸·¸·¸·~ÿCi , there exists a max-flow to node which is expressible as the sum of a number of flows (from D R A F T September 13, 2001, 6:27pm D R A F T 252 A FIRST COURSE IN INFORMATION THEORY s (5) s* (0) =s (5) (0) 2 1 (0) (5) * t1 = t1 t1 =0 =1 =2 =3 Figure 11.6. The graph U with ¡%¢#£ for the graph =4 =5 in Figure 11.1(a). node to node ), each of them consisting of a simple path (i.e., a directed path without cycle) from node to node only. Proof Let be a flow from node to node in the graph $ . If a directed path in $ is such that the value of on every edge on the path is strictly positive, then we say that the path is a positive directed path. If a positive directed path forms a cycle, it is called a positive directed cycle. Suppose contains a positive directed cycle. Let ¤ be the minimum value of on an edge in the cycle. By subtracting on every edge ¤ from the value of in the cycle, we obtain another flow × from node to node such that the resultant flow out of each node is the same as . Consequently, the values of and is eliminated in × are the same. Note that the positive directed cycle in × . By repeating this procedure if necessary, one can obtain a flow from node to node containing no positive directed cycle such that the value of this flow is the same as the value of . Let be a max-flow from node to node in $ . From the foregoing, we assume without loss of generality that does not contain a positive directed è cycle. Let ¥b be any positive directed path from to in (evidently ¥b is è b simple), and let ¦}b be è the minimum value of on an edge along ¥#b . Let be b the flow from node to node along ¥#b with value ¦}b . Subtracting from , b is reduced to ú , a flow from node to node which does not contain a positive directed cycle. Then we can apply the same procedure repeatedly until is reduced to the zero flow4 , and § is seen to be the sum of all the flows which have been subtracted from § . The lemma is proved. 5ÂF7D <6/=IJ6Y>=>=@ñ_ The max-flow § from node to node Ub in Figure 11.1(b), whose value is 3, can be expressed as the sum of the three flows in Figure 11.7, 4 The process must terminate because the components of ¨ are integers. D R A F T September 13, 2001, 6:27pm D R A F T 253 Single-Source Network Coding « ¬ s « s s 1 1 1 ª 1 2 1 1 2 ª 1 1 1 t1 © 1 t1 © (a) 2 t1 © (b) (c) Figure 11.7. The max-flow in Figure 11.1(b) decomposed into three flows. each of them consisting of a simple path from node to node ® only. The value of each of these flows is equal to 1. ù6=<+< D f² >=>=@ >°¯ ² ² h ·¸·¸· i , if the value of a max-flow from to 9 in For e}± $ is greater than or equal to q ë , then the f value of a max-flow from to in $> is greater than or equal to ½ú"³´¶ ìq , where ³µ is the maximum length of a simple path from to 9 . 
We first use the last two examples to illustrate the idea of the proof of this lemma which will be given next. For the graph $ in Figure 11.1(a), ³Äb , the maximum length of a simple path from node to node Ub , is equal to 3. In Example 11.9, we have expressed the max-flow § in Figure 11.1(b), whose value is equal to 3, as the sum of the three flows in Figure 11.7. Based on these three flows, we now construct a flow from x to ®b in the graph $> in Figure 11.8. In this figure, copies of these three flows are initiated at nodes ¨ , b d , and , and they traverse in $¶ as shown. The reader should think of the b d ¨ flows initiated at nodes and as being ë ë at node ±å and ² generated ² b d b d delivered to nodes and via edges ì and ì , respectively, whose capacities are infinite. These are not shown in the figure for the sake of · =1 =0 =2 =3 ¸ =4 =5 Figure 11.8. A flow from ¹ to º » in the graph in Figure 11.6. D R A F T September 13, 2001, 6:27pm D R A F T 254 A FIRST COURSE IN INFORMATION THEORY b d clarity. Since ³=b ± , the flows initiated at ¨ , , and all terminate at b for z . Therefore, total value of the flow is equal to three ² times ² the ¨ b value of , i.e., 9. Since the copies of the three flows initiated at and § d interleave with each other in a way which is consistent with the original flow ¼ , it is not difficult to see that the value of the flow so constructed does not exceed the capacity in each edge. This shows that the value of a max-flow from - to , is at least 9. We now give a formal proof for Lemma 11.10. The reader may skip this proof without affecting further reading of this section. f Proof of Lemma 11.10 For a fixed z-e zåi , let § be a max-flow from node to node ® in $ with value q such that § does not contain a positive è è can write è directed cycle. Using Lemma 11.8, we è f² ² ²Á §± § b ² d § ·¸·¸·Õ §¾½ (11.109) h ·¸·¸· where §W¿ , À± contains a positive simple path ¥ from node ¿ node ® only. Specifically,è to ë ²{Æ ¦ ì&?¥ if ¿ otherwise, Ç ¿ § ¿Ã ±ÅÄ (11.110) where ¦}b G¦ýd·¸·¸·Y¦ ±qã! ë ²{Æ ½ ì¾ ¥ , Let È be the length of ¥ . For an edge ¿ ¿ ¿ of node from node along ¥ . Clearly, ¿ ë ²{Æ f É ì zQÈ ú ¿ ¿ (11.111) ë ²{Æ let É ì ¿ be the distance (11.112) and Now for È Ç zZézQõúʳ´ § where ¼ ¿ , define ¿ ëµ ² ¼ KÎÎ ¦ ± MÎ ¦ Î ÎÎ N ¦ ¿ ¿ Ç ¿ ±ÌË\¼ ¿ ë ² (11.113) ² ëµ ² ¼ ì ² sÍ (11.114) ²f if ëµ ² ¼ ìT± ë ì zQõú"³´ ë {² Æ ²{zZ Æ é jsÏ Â jsÏ Â b if ëµ ² ¼ ìT± ë ;Ð ² ì where ì&?¥ ¿ Ï if ìT± ì otherwise. (11.115) ë ²{Æ Since êGÉ D R A F T z³´ ! ¿ ¿ ² f ìZ zZÏYÈ ¿ zÑÏG³µ2zQ September 13, 2001, 6:27pm (11.116) D R A F T 255 Single-Source Network Coding ¿ is well-defined the second caseè and the third case in (11.115) and hence § Ç ¿ is a flow from node to node in $ derived for zÒPzÓYúY³µ . § from the flow §¿ in $ as follows. A flow of ¦ is generated at node x and ¿ . Then the flow traverses consecutive enters the th layer of nodes from layers of nodes by emulating the path ¥ in $ until it eventually terminate at ;Ð Ï ¿ ¿ , we construct node via node . Based on § § ½ ± ~ ÕÔ â § ± ¿ ² (11.117) b ¿ and § § ~ (11.118) ! ¨ We will prove that § z) componentwise. Then § is a flow from node to node , in $> , and from (11.111), its value is given by ÕÔ â ~ ¨ ~ ¿ ÕÔ â ½ ¦ b ± ¿ ë qÖ± ~ f õúʳµJ ìqã! (11.119) ¨ This wouldë imply that f the value of a max-flow from node to node in $ ëµ ² ¼ is at least õú"³´K ìq , and the lemma is proved. 
such Toward proving that § z×) , we only need to consider ì ëµ ² ¼ ë ²{Æ that b ìU± ì (11.120) ë ²{Æ f Ç ì& for some and zZ zQ½ú , because ø> is infinite otherwise (cf. (11.107)). For notational convenience, we adopt the convention that § for Ø Ç . Now for Ç f zZézQ½ú ± ~ Û ÕÔ â ± ~ Û ½ ± ~ ¿ ¿ ~ ½ b ½ ¼ ~ , b ¨ (11.122) ¼ aÙÛ Ø c Ùa ØÚ c ¿ »  (11.123) ¼ aÙÛ Ø c Ùa ØÚ c ¿ »  (11.124) ¿ ÕÔ â b Û ± ~ ¨ (11.121) aÃÛ Ø c Ùa ØÚ c »  ¼ ¨ Ç ì ÕÔ â D R A F T ± ë ²{Æ and aÃØ c aÙØÚ c » ¼  ¿ aÃ Ø c j Ùa ØÏ Ú Â c » ¿  September 13, 2001, 6:27pm (11.125) D R A F T 256 A FIRST COURSE IN INFORMATION THEORY è z ½ ~ ¿ ± (11.126) ¼2àø àø z ± ¼ ÿ  b (11.127) (11.128) (11.129) aÙØ c Ùa ØsÚ c » !  In the above derivation, (11.125) follows from the second case in (11.115) aÙÛ Ø aÃØsÚ c because ¼ c ¿  » is possibly nonzero only when ë ²{Æ ±ZÜ GÉ ² (11.130) ì ¿ ë ²{Æ or ÜJ±×|úÊÉ ² (11.131) ì ¿ ë ²{Æ and (11.126) can be justified as follows. First, the inequality is justified for ÙØQÉ ì since by (11.121), ¿ § ë ²{Æ Ç Û ¿ Ç ± (11.132) ²{Æ òÝÉ ì , we distinguish two cases. From (11.115) and for ÜgØ . ë For ¿ (11.110), if ìW#¥ , we have è ¿ ¼ ë ²{Æ If ì ' #¥ ¿ aÙ Ø c jÙa ØÏ Ú R  c » ¿  ¼ ÿ  ±I¦ ± , we have (11.133) ! ¿ è aÙ Ø j Ùa ØÏ Ú Â c ¼ c » ¿ ¼ ÿ  ± ± Ç (11.134) ! Thus the inequality is justified for all cases. Hence we conclude that §¾+zQ) , and the lemma is proved. ë f From Lemma 11.10 and the result in Section 11.5.1, we see that )ú ³ is achievable for graph $ with rate constraints ) , where ³7± Thus, ² ë for everyf ó+ô õúʳ+ ëµ ² ¼ ìq rtsu ³´ ! , there exists for sufficiently large Þ a Þ ì: -code on $¶ such that b ö w"÷ d ± zø ìq (11.135) ë Ç Þ D Yó ² ë ëµ ² ¼ $ß ± ìW (11.136) ëµ ² ¼ . For thisÎ : -code on $ , we denote the encoding function for all ìà by , and denote the decoding function at the sink for an edge ìF D R A F T September 13, 2001, 6:27pm D R A F T 257 Single-Source Network Coding f² à ® by , } e ± node f zZ zQ , ² ² h ·¸·¸· . Without loss of generality, we assume that for i Î = and for = Ùa Ø ë c ,¶f² for all ¯ in zZézQõú Î for all ¤ in aÙØ â c â f ² b C ÔÕ0 (11.138) ¤=ìU±Z¤ ² ± Á ·¸·¸· Á Á â ë ,¶f² Ç (11.137) ¯ ÕÔ Ñhâ ·¸·¸· , ¯XìU± ² °á± f Ç (11.139) aÙØã c aÙØ c » â 0 ² (11.140) ë . Note that if the : -code does not satisfy these assumptions, ² it can ì , readily be converted into ë one because the capacities of the edges f ² f Ç are infinite. zZ zQ and the edges ì , zZézQEú ë ² ë integer, ë ²{Æ and let ² Q ë ±Å&Þ . Using Let be a positive the : -code on $> , we f now construct an ±ÃÂ ß ì ì õúʳ¾ ìq<ä- ìå -code on $ , where z-e z i ±ÃÂ ë ²{Æ ì for ± ~ b aÙØ c Ùa ØÚ c »  ± ¨ (11.141) . 
The code is defined by the following components: ë ²{Æ ì& 1) for Ç Î ' b such that ± , an arbitrary constant à,¶f² ² ² h ± ·¸·¸· taken from the set aÃæ c a c 0 ß Â » 3 (11.142) f zZÙz 2) for , an encoding function Î = Â Æ ª for all ,¶f² ß ° ë ²{Æ such that h ì < ,¶f² ² ¹ = aÙØã c aÙØ c »  0 ± (11.143) , where ² °Å± ² ·¸·¸· ² h Ñhâ ·¸·¸· ÕÔ b ² C ÔÕ0 (11.144) and for h¬zZézQ , an encoding function Î ÃÂ ß ë ²{Æ ,¶f² Ç Á Á h ² ² ·¸·¸· D R A F T ì G for all such that Î adopt ,¶f² ² the² convention that àaÙØã c aÙØ c h ·¸·¸· ± 0 ); »  aÙØlãç c Ãa Ølã c » 0+¹ ± Á ,¶f² h ² ² ·¸·¸· ± aÙØã c aÙØ c »  0 ë ² (11.145) , 0 is empty, we (if ¿ ß ¿ ìÖ is an arbitrary constant taken from the set ± ' September 13, 2001, 6:27pm D R A F T 258 A FIRST COURSE IN INFORMATION THEORY f² ² 3) for e± ² h ·¸·¸· à ß à è , a decoding function i Ç b ~ ,¶f² Ç ² ² h â ¨ ë ²{Æ è à ì i) for ë such that ¯Xì<±¯ for all ¯è)° as a function of ¯ ); where aÙØ c Ùa ØÚ c â » 0º¹»° ± ·¸·¸· (11.146) ë à (recall that ¯2ì denotes the value of ' such that ± , Î Î b à,¶f² ± Î aÃæ a (ëµ ² c  » c is an arbitrary constant in ¨ 0 is empty); ì< aÃæ c a c  » ² h (11.147) ² aÃæ c a c  » 0 ± ·¸·¸· , µ ß since f ii) for zZÙz , for all ¯?° , Î = ë for all ¤ in ² ·¸·¸· Î à ë for all in Ç ~ b à ¶â ¨ R (11.151) ì ² h (11.150) H ë ìU± (11.149) aÃØlãç c Ùa Øã c » 03ß ± Á H ,¶f² Ç c Ùa Ø c »  ¤ ì Ä ² ·¸·¸· ' ë aÙØã ² h , i ¤=ìU± ,¶f² (11.148) such that ± , ë Á Á ² h H éÂ Ç f² iii) for eê± ì& ² ë c Ãa Ø c »  ¯ ì X ë ²{Æ and for h¬zZézQ and all Î Î = aÃØlã ¯2ìU±  ² ± ·¸·¸· aÙØ c Ùa ØÚ c â » 03! (11.152) f The coding process of the å -code consists of / ë ²{Æ 1. In Phase 1,Æ for all and for all ìF such that ë where àD R A F T h|z×/zI ¯Xì ' Î b Æ such that ±g , node Î sends , à  to node ²{Æ Æ = b ë ì , node sends Â è ¯Xì to node . ë ²{Æ è 2. In Phase ,ë Î phases: , for all ì6 Î denotes the value of ÃÂ Î ë Æ , node sends à  ¯Xì to node , as a function of ¯ , and it depends September 13, 2001, 6:27pm D R A F T 259 Single-Sourceè Network Coding Î b ë ² ë only on Á ¯Xì for all ¿?½ suchf that received by node during Phase |ú . f² f 3. In Phase / ² , for e0± ì6 ¿ , i.e., the information ² h ·¸·¸· à , the sink node 9 uses to decode ¯ . i ë ²{Æ ² ë ë From the definitions, we see that anë ² ë ±é ßë ²{Æ ì& å -code on $ is a special case of an ±éÂ ß ì7 ³ -code on $ . For the å -code we have constructed, b ö ë w"÷ d ± à± <Þ9ì b ± ± Þ ¨ b ö b ë ø ~ b w"÷ d>ë b b ~ z b ö ± ë ²{Æ ¨ ~ Ç b ~ ¨ ± w"÷ d ± ² ë f ì ² ë /úJ³ ì aÃØ c Ùa ØÚ cíì »  aÙØ c Ãa ØsÚ c »  &ú³¬ f ìq<ä- ì ìq<ä- ì (11.153) (11.154) aÃØ c Ùa ØÚ c » PóUì  (11.155) b ë ÃÂùYóì ø (11.156) ¨ ø6àYó (11.157) ì& , where (11.155) follows from (11.136), and (11.156) follows for all Ç from (11.107). Finally, for any óãô , by taking a sufficiently large , we have ë f õúʳº ìq (11.158) ôYq)úEó! Hence, we conclude that q is achievable for the graph ) . $ with rate constraints PROBLEMS In the following problems, the rate constraint for an edge is in bits per unit time. 1. Consider the following network. We want to multicast information to the , sink² nodes ² ² at the maximum rate without using network coding. Let !ݱ yb d ·¸·¸· ë ï² îË 0 be the set of bits befmulticast. to Let ! be the set of bits sent in edge ì , where à ! cô±h , ² ² ± h . At node , the received bits are duplicated and sent in the two out-going edges. Thus two bits are sent in each edge in the network. f a) Show that !±I!6 ë ! for any b) Show that ! D R A F T !bð!+dìU±Z! z Æ Ø z . . 
September 13, 2001, 6:27pm D R A F T 260 A FIRST COURSE IN INFORMATION THEORY ò t1 1 ñ s ò 2 t2 ò ó t3 3 ë c) Show that à ! !bð!+dì¸ÃKzgà ! ø à !bø à !+d¶Ã}ú à !b!+d3à . d) Determine the maximum value of ô and devise a network code which achieves this maximum value. e) What is the percentage of improvement if network coding is used? (Ahlswede et al. [6].) 2. Consider the following network. ø 1 ö s õ 3 2 5 4 ÷ 6 Devise a network coding scheme which multicasts two bits yb and Ud from node to all the other nodes such that nodes 3, 5, and 6 receive yb and d ë ² after 1 unit time and nodes 1, 2, and 4 receive yb and d after 2 units of time. In other words, node receives information at a rate equal to rtsuJvMwyx ì ' for all ± . See Li et al. [123] for such a scheme for a general network. 3. Determine the maximum rate at which information can be multicast to nodes 5 and 6 only in the network in Problem 2 if network coding is not used. Devise a network coding scheme which achieves this maximum rate. ë ² 4. Convolutional network code f² ² In the following network, rtsuKv9wyx ®RìU± for eê± h . The max-flow bound asserts that 3 bits can be multicast to all the three sink nodes per unit time. We now describe a network coding scheme which achieve this. Let 3 D R A F T September 13, 2001, 6:27pm D R A F T 261 Single-Source Network Coding û ú t uù 2 ü ù v u1 ü 2 ý û 0 s ü ú v t1 v1 û ù t 0 2 uú 0 ë ² ë ² f² ë ² h ·¸·¸· , where bits yb ¿=ì d ¿=ì ¿=ì be generated at node ë at time ¿#± ë without loss of generality that ï ¿=ë ì is an element of the finite we assume Ç Ç field f $>¼ h3ì . We adopt the convention that ï ¿=ì ± for ¿½z . At time ¿Ùò , information transactions T1 to T11 occur in the following order: ë ¼ Ç ²f² h T1. ¼ sends ï ë¿=ì to µ , eê± ²f² Ç T2. µ sends s ¿= to , , and , ì 9 ( þ ® ÿ þ e ± b d ë ë ë µh f f T3. µ ¨ sends ¨ ë ¿=ì7Yyb ë ¿¬ú f ì7Yd ë ¿¬ú f ì to b T4. µ b sends ¨ ë ¿=ì7Yyb ë ¿¬ú ì7Y ë d ¿¬ µ ì to d f ú T5. µ b sends ¨ ë ¿=ì7Yyb ë ¿=ì7Yd ë ¿êú f ì to d T6. µ d sends ¨ ë ¿=ì7Yyb ë ¿=ì7Yd ë ¿êú µ ì to ¨ T7. µ d sends ¨ ë ¿=ì7Y b ë ¿=ì7Y d ë ¿=ì to ¨ T8. ¨ sends ¨ ¿=ë ì7Yyfb ¿=ì7Yd ¿=ì to Ub ì T9. d decodes d ¿¬ ë ú T10. ¨ decodes ¨ ë ¿=ì T11. Ub decodes yb ¿=ì where “ ” denotes modulo 3 addition and “+” denotes modulo 2 addition. a) Show thatf the information transactions T1 to T11 can be performed at time ¿%± . f by induction b) Show that T1 to T11 can be performed at any time ¿éò on ¿ . ë ë c) Verify that at time ¿ , nodes d ¿ × ì for all ¿ × z¿ . D R A F T ¨ and b can recover ¨ September 13, 2001, 6:27pm ë ¿ × ì , b ¿ × ì , and D R A F T 262 A FIRST COURSE IN INFORMATION THEORY ë ë d) Verify thatë at time ¿ , node d canf recover ¨ ¿ × ì and yb ¿ × ì for all ¿ × z ¿ , and d ë ¿ × ì for all ¿ × z\¿Ùú . Note the unit time delay for d to recover d ¿=ì . (Ahlswede et al. [6].) HISTORICAL NOTES Network coding was introduced by Ahlswede et al. [6], where it was shown that information can be multicast in a point-to-point network at a higher rate by coding at the intermediate nodes. In the same paper, the achievability of the max-flow bound for single-source network coding was proved. Special cases of single-source network coding has previously been studied by Roche et al. [165], Rabin [158], Ayanoglu et al. [17], and Roche [164]. The achievability of the max-flow bound was shown in [6] by a random coding technique. Li et al. [123] have subsequently shown that the max-flow bound can be achieved by linear network codes. 
D R A F T September 13, 2001, 6:27pm D R A F T Chapter 12 INFORMATION INEQUALITIES An information expression refers to a linear combination1 of Shannon’s information measures involving a finite number of random variables. For example, (12.1) and (12.2) are information expressions. An information inequality has the form !#" " (12.3) where the constant is usually equal to zero. We consider non-strict inequalities only because these are usually the form of inequalities in information theory. Likewise, an information identity has the form $%"'& (12.4) ($)" We point out that an information identity is equivalent to the pair of information inequalities and . An information inequality or identity is said to always hold if it holds for any joint distribution for the random variables involved. For example, we say that the information inequality *+" !,+" +- (12.5) 1 More generally, an information expression can be nonlinear, but they do not appear to be useful in information theory. D R A F T September 13, 2001, 6:27pm D R A F T 264 /102 A FIRST COURSE IN INFORMATION THEORY . . On the other always holds because it holds for any joint distribution hand, we say that an information inequality does not always hold if there exists a joint distribution for which the inequality does not hold. Consider the information inequality (12.6) 3 ,+-4& Since +(12.7) always holds, (12.6) is equivalent to 3 (12.8) $5 only if and are independent. In other words, (12.6) which holds if and does not hold if and are not independent. Therefore, we say that (12.6) does not always hold. As we have seen in the previous chapters, information inequalities are the major tools for proving converse coding theorems. These inequalities govern the impossibilities in information theory. More precisely, information inequalities imply that certain things cannot happen. As such, they are referred to as the laws of information theory. The basic inequalities form the most important set of information inequalities. In fact, almost all the information inequalities known to date are implied by the basic inequalities. These are called Shannon-type inequalities. On the other hand, if an information inequality always holds but is not implied by the basic inequalities, then it is called a non-Shannon-type inequality. We have not yet explained what it means by that an inequality is or is not implied by the basic inequalities, but this will become clear later in the chapter. Let us now rederive the inequality obtained in Example 6.15 (Imperfect secrecy theorem) without using an information diagram. In this example, three , and are involved, and the setup of the problem is random variables equivalent to the constraint * 67 8 Then $5-4& (12.9) 3 9:6;<= $ 9:6;<=#> ?=@A CB $ 9:6;<= ? 9:6;<=#> @D;ED ?CB $ 9F6@D;GH367 8 $ 9F6@ D R A F T September 13, 2001, 6:27pm (12.10) (12.11) (12.12) (12.13) (12.14) (12.15) D R A F T Information Inequalities where we have used in obtaining (12.12), and @I * +;G +- 265 (12.16) (12.17) and (12.9) in obtaining (12.15). This derivation is less transparent than the one we presented in Example 6.15, but the point here is that the final inequality we obtain in (12.15) can be proved by invoking the basic inequalities (12.16) and (12.17). In other words, (12.15) is implied by the basic inequalities. Therefore, it is a (constrained) Shannon-type inequality. We are motivated to ask the following two questions: 1. How can Shannon-type inequalities be characterized? 
That is, given an information inequality, how can we tell whether it is implied by the basic inequalities? 2. Are there any non-Shannon-type information inequalities? These are two very fundamental questions in information theory. We point out that the first question naturally comes before the second question because if we cannot characterize all Shannon-type inequalities, even if we are given a non-Shannon-type inequality, we cannot tell that it actually is one. In this chapter, we develop a geometric framework for information inequalities which allows them to be studied systematically. This framework naturally leads to an answer to the first question, which makes machine-proving of all Shannon-type inequalities possible. This will be discussed in the next chapter. The second question will be answered positively in Chapter 14. In other words, there do exist laws in information theory beyond those laid down by Shannon. JLMK STVUVUVUW1XFYZ N Let M (12.18) $ ORQ P X [ < \]1^8_ N Y where , and let M $PO (12.19) X [ be any collection of random variables. Associated with are ` $ M Q (12.20) X dceexample, 1f forg $ba , the 7 joint entropies associated with joint entropies. For random variables c and 6 f are g c 1 f f 1 g c 1 g 6 c 1 f 1 g & (12.21) 12.1 THE REGION D R A F T September 13, 2001, 6:27pm D R A F T 266 Let and N A FIRST COURSE IN INFORMATION THEORY h j <\k1^?_ $ i lm 6jn i l $ & denote the set of real numbers. For any nonempty subset i of M , let (12.22) p orq (12.23) l tasZ fixed , we can then view as a set function from to h with For $u- , i.e., we adopt the convention that the entropy ofl an empty set [ the entropy of random variable is equal to zero. For this reason, we call function of . ` jx v M _(beportheqTy -dimensional szY j Euclidean space with the coordinates lm labeled Let i [ X O , where w corresponds to the value of i for any by w X vl M as the entropy space collection of random variables. We will refer to _ can be represented by for random variables. Then an entropy function M |l vector { v M is[ calledX a column vector in v . On the other hand, a column of some collection of entropic if { is equal to the entropy function random variables. We are motivated to define the following region in v M : } MK $PO'{ _ v M~ { is entropicY & (12.24) } g to as entropy functions. For convenience, theX vectors in MK will also be referred As an example, for $%a , the coordinates of v by c f g cCf cCg f1g arecCf1g labeled (12.25) cCf1g g c w f gS w w w } w g w w where w denotes w , etc, and K is the region in v of all entropy functions for 3 random variables. } While further characterizations of MK will be given later, we first point out a } few basic properties of MK : } 1. MK contains the origin. } } 2. MK , the closure of MK , is convex. } 3. MK is in the nonnegative orthant of the entropy space v M . X [ 2 The origin of the entropy space corresponds to the entropy function of degenerate random variables taking constant values. Hence, Property 1 follows. Property 2 will be proved in Chapter 14. Properties 1 and 2 imply that is a convex cone. Property 3 is true because the coordinates in the entropy space correspond to joint entropies, which are always nonnegative. } MK v M 2 The nonnegative orthant of D R A F T q is the region ] q1e for all j @f xZ 1 k September 13, 2001, 6:27pm . 
D R A F T 267 Information Inequalities 12.2 INFORMATION EXPRESSIONS IN CANONICAL FORM Any Shannon’s information measure other than a joint entropy can be expressed as a linear combination of joint entropies by application of one of the following information identities: 6 7 ;< $ 3 9;<= $ ;8= ?F@ $ & X (12.26) (12.27) (12.28) The first and the second identity are special cases of the third identity, which has already been proved in Lemma 6.8. Thus any information expression which involves random variables can be expressed as a linear combination of the associated joint entropies. We call this the canonical form of an information expression. When we write an information expression as , it means that is in canonical form. Since an information expression in canonical form is a linear combination of the joint entropies, it has the form ` { F { h (12.29) where denotes the transpose of a constant column vector in . The identities in (12.26) to (12.28) provide a way to express every information expression in canonical form. However, it is not clear whether such a canonical form is unique. To illustrate the point, we consider obtaining the in two ways. First, canonical form of ;< $ & Second, 67 < 6F $ 6F#;;<=;d 1 $ 6F#;;<= D91 $ 6* =;A $ & (12.30) (12.31) (12.32) (12.33) (12.34) 67 < Thus it turns out that we can obtain the same canonical form for via two different expansions. This is not accidental, as it is implied by the uniqueness of the canonical form which we will prove shortly. Recall from the proof of Theorem 6.6 that the vector represents the values of the -Measure on the unions in . Moreover, is related to the values of on the atoms of , represented as , by K D R A F T K M M { { {!$P M September 13, 2001, 6:27pm (12.35) D R A F T 268 `67` M A FIRST COURSE IN INFORMATION THEORY where is a unique matrix (cf. (6.27)). We now state the following lemma which is a rephrase of Theorem 6.11. This lemma is essential for proving the next theorem which implies the uniqueness of the canonical form. ?HL (¡R¢D£t¡ ¤ MK $PO' _ h ~ M _ } MK Y & (12.36) ¤ Then the nonnegative orthant of h is a subset of MK . ¥§¦n¨©Dª¡R¢D£«¢ Let be an information expression. Then the unconstrained information identity $5- always holds if and only if is the zero function. Proof Without loss of generality, assume is in canonical form and let { $ F {8& (12.37) ® Assume 7$¬- always holds and is not the zero function, i.e., $¬- . We will show that this leads to a contradiction. Now $¯- , or more precisely the YZ set ~ O'{ {!$5(12.38) is a hyperplane in the entropy space which has zero Lebesgue measure . If +$°- always holds, i.e., it holds for all joint distributions, then _ } MK } must be 3 $b - , otherwise there _ } exists an { MK which contained in the hyperplane MK , it corresponds to the $±- . Since { is not on ¬$±- , i.e., { distribution. This means that there exists a joint entropy function of some joint distribution such that { $²- does not hold, which cannot be true because d$%- } always holds. If MK has positive Lebesgue measure, it cannot be contained in the hyperplane} *$P- which has zero Lebesgue measure. Therefore, it suffices to show that MK has positive Lebesgue measure. To this end, we see from Lemma 12.1 that the nonnegative orthant of v M , which has positive Lebesgue ¤ ¤ } measure, is a subset of MK . Thus MK has positive Lebesgue measure. Since MK is an invert¤ ible transformation of MK , its Lebesgue measure is also positive. 
} Therefore, MK is not contained in the hyperplane *$P- , which implies that there exists a joint distribution for which !$P- does not hold. This leads to a contradiction because we have assumed that 3$®- always holds. Hence, we have proved that if E$5- always holds, then must be the zero function. Conversely, if is the zero function, then it is trivial that P$³- always holds. The theorem is proved. ] ´µ ´Z¶ µ q Let 3 4 3 If , then is equal to . Lebesque measure can be thought of as “volume" in the Euclidean space if the reader is not familiar with measure theory. 4 The D R A F T September 13, 2001, 6:27pm D R A F T 269 Information Inequalities · ¨©¨¸R¸Z ©R¹P¡R¢D£«º c f Proof Let and and The canonical form of an information expression is unique. » c be canonical forms of an information expression . Since »<$% (12.39) f »<$% (12.40) c f always hold, c $5f (12.41) alwaysc holds.f By the above theorem, is the zero function, which implies that and are identical. The corollary is proved. c Due to the uniqueness of the canonical form of an information expression, itf is an easy matter to check whether for two information expressions and the unconstrained information identity c f $¯ c f (12.42) in canonical form. If all always holds. All we need to do is to express the coefficients are zero, then (12.42) always holds, otherwise it does not. 12.3 A GEOMETRICAL FRAMEWORK } MK In the last section, we have seen the role of the region in proving unconstrained information identities. In this section, we explain the geometrical meanings of unconstrained information inequalities, constrained information inequalities, and constrained information identities in terms of . Without loss of generality, we assume that all information expressions are in canonical form. } MK an unconstrained information inequality ¼½- , where { $ Consider { . Then !+- corresponds to_ the set Y F ~ M O'{ v {+(12.43) _ in the entropy which is a half-space space v M containing the origin. Specifv M , { ¾- if and only if { belongs to this set. For ically, for any { X we will refer to this set as the half-space !+- . As an example, for simplicity, $ , theinformation c 1 f inequality c f c 1 f $ +(12.44) 12.3.1 UNCONSTRAINED INEQUALITIES D R A F T September 13, 2001, 6:27pm D R A F T 270 A FIRST COURSE IN INFORMATION THEORY ¿f n 0 Figure 12.1. An illustration for c f w corresponds to the half-space _ c ~ M fO'{ v w in the entropy space v . written as w w w cCf f ÀÁG always holds. +cCf Y w +- & (12.45) (12.46) Since an information inequality always holds if and only if it is satisfied by the entropy function of any joint distribution for the random variables involved, we have the following geometrical interpretation of an information inequality: _ Y !#- always holds if and only if } M|K à O'{ v M|~ { #- & This gives a complete characterization of all unconstrained inequalities in terms } } X we in principle can determine whether any information of MK . If MK is known, inequality involving random variables always holds. The two possible cases } for Ľ- are illustrated in Figure 12.1 and Fig_ 6(- , ure 12.2. In Figure 12.1, MK is completely included in the half-space so * +rÅ always holds. In Figure 12.2, there exists a vector { } MK such that { - . Thus the inequality #- does not always hold. 12.3.2 CONSTRAINED INEQUALITIES In information theory, we very often deal with information inequalities (identities) with certain constraints on the joint distribution for the random variables involved. 
These are called constrained information inequalities (identities), and the constraints on the joint distribution can usually be expressed as linear constraints on the entropies. The following are such examples: 1. dc f g 6dce1f1gÆ , c D,and f are mutually g independent if and only if $ . D R A F T September 13, 2001, 6:27pm D R A F T Information Inequalities Çf 271 0 n .h 0 Figure 12.2. An illustration for ÀÁ| not always holds. dc f g dce1fÆ 1 1 f g c g , , and are pairwise independent if and only if 2. $ $ 5 $ . c f 6 c f 3. is a function of if and only if $5- . c 1 g f crÈ fLÈ gLÈ <É c 1 f 1<Éz forms g a Markov chain if and only if 4. $%- and $5- . Suppose there are Ê linear constraints on the entropies given by Ë {9$5- (12.47) Ë 9 ` where is a Ê matrix. Ë Here we do not assume that the Ê constraints are linearly independent, so is not necessarily full rank. Let Ì $PO'{ _ v M ~ Ë {9$5- Y & (12.48) Ì In other words, the Ê constraints confine { to a linear subspace in the entropy space. Parallel to our discussion on unconstrained inequalities, we have the following geometrical interpretation of a constrained inequality: } Ì Ì Y , 3(- always holds if and only if M§ Under the constraint KÍ Ã O'{ ~ { +- . This } gives a complete Ì characterization of all constrained inequalities in terms of MK . Note that $(v M when there is no constraint on the entropies. In this sense, an unconstrained inequality is a special case Ì of a constrained inequality. The two cases of !+- under the constraint are illustrated in Figure 12.3 and Figure 12.4. Figure 12.3 shows the case when Î- always holds under Ì the constraint . Note that 7P- may or may not always hold when there is no constraint. Figure 12.4 shows the case when #Ä- does not always hold D R A F T September 13, 2001, 6:27pm D R A F T 272 A FIRST COURSE IN INFORMATION THEORY Ïf 0 ÀHÁ| always holds under the constraint Ð . Ì under the constraint . In this case, !#- does not always hold when there is no constraint, because } MK Í Ì Ã O'{ ~ { +- Y (12.49) implies } MK à O'{ ~ { +- Y & (12.50) Figure 12.3. An illustration for 12.3.3 CONSTRAINED IDENTITIES As we have pointed out at the beginning of the chapter, an identity $%- *+- !,#- (12.51) always holds if and only if both the inequalities and always hold. Then following our discussion on constrained inequalities, we have Ñ f Ò0 Figure 12.4. An illustration for D R A F T ÀÁ| Рnot always holds under the constraint . September 13, 2001, 6:27pm D R A F T 273 Information Inequalities Ô c h Ób Ôh 0 0 Õ ¶Ö Á| and × ¶nÖ ÁG under the constraint Ð . Y Ì , $Ø - always Y holds if and only if } M§K Í Ì Ã Under the constraint O'{ ~ { +- Í O'{ ~ { ,+- , or Y Ì , $Ø- always holds if and only if } M§K Í Ì Ã Under the constraint O'{ ~ { $5- . } Ì This condition says that the intersection of MK and is contained in the hyperplane $%- . Figure 12.5. 12.4 Equivalence of EQUIVALENCE OF CONSTRAINED INEQUALITIES When there is no constraint on the entropies, two information inequalities { #Ù and Ù {+are equivalent if and only if $PÚ , where Ú (12.52) (12.53) v M Ì $Û Ì is a positive constant. However, this is not the case under a non-trivial constraint . This situation is illustrated in Figure 12.5. In this figure, although the inequalities in (12.52) and (12.53) correspond to different half-spaces in the entropy space, they actually impose the same constraint on when is confined to . In this section, we present a characterization of (12.52) and (12.53) being equivalent under a set of linear constraint . 
The reader may skip this section at first reading. Let be the rank of in (12.47). Since is in the null space of , we can write (12.54) { Ü D R A F T Ë { Ì { {$ÞË Ý {ß September 13, 2001, 6:27pm Ë D R A F T ` Ë ` E Ü where Ý is a 274 A FIRST COURSE IN INFORMATION THEORY Ë ËÝ {ß ` Ü matrix such that the rows of form a basis of the orthogonal complement of the row space of , and is a column vector. Then using (12.54), (12.52) and (12.53) can be written as and Ë Ý { ß + Ù Ë Ý { ß +- (12.55) Ë respectively in terms of the set of basis given by the columns of Ý . Then (12.55) and (12.56) are equivalent if and only if Ù Ë Ë (12.57) Ý $5Ú Ý where Ú is a positive constant, or Ù Ë (12.58) Ú Ý $5-4& Ù Ù Ú is in the orthogonal In words, complement ofthe row space of Ë Ý other Ë Ë ` , i.e., whose Ú is in the row space Ë . ( Ë ofcan.beLettakenß beasanË Ü ß if Ë matrix is full rank.) row space is the same as that of Ë Ë Ë Since the rankË of isË Ü and ß has Ü rows, the rows of ß form a basis for the row space of , and ß is full rank. Ì Then from (12.58), (12.55) and (12.56) are equivalent under the constraint if and only if Ù % Ë à ß $%Ú (12.59) for some positive constant Ú Ù and some column Ü -vector à . Suppose for given and , we want to see whether (12.55) and (12.56) are Ì . We first consider the case when either Ù under the constraint equivalent Ë or is in the row space of . This is actually not an interesting case because Ë if , for example, is in the row space of , then F Ë Ý $%(12.60) in (12.55), which Ì means that (12.55) imposes no additional constraint under the constraint . ¥§¦nÙ ¨©Dª¡R¢D£âá If either or Ù is in the row space of Ë , then {¯Ù {+- are equivalentË under the constraint Ì if and only if both and and are in the row space of . The proof of this theorem is left asÙ an exercise. We now turn to the more nor is in the row space of Ë . The interesting case when neither following theorem gives anÌ explicit condition for (12.55) and (12.56) to be equivalent under the constraint . D R A F T September 13, 2001, 6:27pm (12.56) D R A F T Information Inequalities Ù 275 ¥§¦nÙ ¨©Dª¡R¢D£«ã If neither nor is in the row space of Ë , then {3+Ì and {3+- are equivalent under the constraint if and only if ä Ë EåIæ à Ù ß (12.61) Úç $ & Ë has a unique solution with Úè%- , where ß is any matrix whose row space is Ë the same as that of . and Ù not in the row space of Ë , we want to see when we Proof For à can find unknowns Ú and satisfying (12.59) with Ú5è¾- . To this end, we write Ë (12.59) Ë in matrix form as ä (12.61). Since is not in the column space of ß and ß is full rank, Ë ß då is also full rank. Then (12.61) has either a unique solution or no solution. Therefore, the necessary and sufficient condition for (12.55) and (12.56) to be equivalent is that (12.61) has a unique . The theorem is proved. solution and GÚ è+é ê = ë¸z7¡R¢D£«ì Consider three random variables c 1 f and g with the dce1gR fÆ Markov constraint (12.62) $5which is equivalent to c 1 f f 1 g =6 c 1 f 1 g 6 f $5-4& (12.63) g In terms of the coordinates in the entropy space v , this constraint is written as Ë {9$5- (12.64) where Ë $ > - Qí- Q Qî- Q B (12.65) > c f g cCf f1g cCg cCf1gEB and {$ w w w w w w w & (12.66) We now show that under the constraint in (12.64), the inequalities dcW gÆ=dcW fÆ (12.67) + c 1 f g and +(12.68) Ù equivalent. 
Toward this end, we write (12.67) and (12.68) as { are in fact - and {+- , respectively, where > $ - Q Q Qî- Qï- B (12.69) D R A F T September 13, 2001, 6:27pm D R A F T 276 A FIRST COURSE IN INFORMATION THEORY B Ù > Q - Q Q Q & $ - - î (12.70) Ë Ë Ë . Upon solving Since is full rank, we may take ß $ ä ËEdå<æ à Ù (12.71) Ú ç $ à à Q matrix). we obtain the unique solution Ú6$ðQ!èñ- and $³Q ( is a Q Therefore, (12.67) and (12.68) are equivalent under the constraint in (12.64). Ù Ë Ì Under the constraint , if neither nor is in the row space of , it can be shown that the identities {!$%(12.72) Ù and (12.73) {!$5and are equivalent if and only if (12.61) has a unique solution. We leave the proof as an exercise. 12.5 THE IMPLICATION PROBLEM OF CONDITIONAL INDEPENDENCE jòÎ<óD Aô We use j <ó Aô to denote the conditional independency (CI) . j|ò#<óD Aô We have proved in Theorem 2.34 that is equivalent to jD1<óD AôT $%-4& (12.74) s j ò²<óx ô , becomes an unconditional independency which When õ7$ we regard as a special case of a conditional independency. When iª$÷ö , j: AôT (12.74) becomes $5j Aô (12.75) and are conditionally independent given which we see from Proposition 2.36 that is a function of . For this reason, we also regard functional dependency as a special case of conditional independency. In probability problems, we are often given a set of CI’s and we need to determine whether another given CI is logically implied. This is called the implication problem, which is perhaps the most basic problem in probability theory. We have seen in Section 7.2 that the implication problem has a solution if only full conditional mutual independencies are involved. However, the general problem is extremely difficult, and it has recently been solved only up to four random variables by Mat [137]. ùø ûú D R A F T September 13, 2001, 6:27pm D R A F T Information Inequalities c 1 f VUVUVUÆ1 } MK 277 M j òÎ<óD ô N jx1<ó AôT M à where i ö õ 6 jZüýô . Since <ó üýô =6 jZü ó üý$5ô =- isequivalent ô to (12.77) $5jòÎ<óx Aô corresponds to the hyperplane jZüôH ó üýô jZü ó üýô ô Y O'{ ~ w w w w $5- & (12.78) M Y corresponding to þ by ÿ þ . For a CI þ , we denote the hyperplane in v Let %$POÆ þ be a collection of CI’s, and we want to determine whether implies a given CI þ . This would be the case if and only if the following is _ } _ _ true: M K if { For all { ÿ þ then { ÿ þ & We end this section by explaining the relation between the implication problem and the region . A CI involving random variables has the form (12.76) Equivalently, } implies þ if and only if ÿ þ Í MK à ÿ þ & } Therefore, the implication problem can be solved if MK can be characterized. } Hence, the region MK is not only of fundamental importance in information theory, but is also of fundamental importance in probability theory. PROBLEMS 1. Symmetrical information expressions An information expression is said to be symmetrical if it is identical under every permutation of the random variables involved. However, sometimes a symmetrical information expression cannot be readily recognized symbolically. For example, is symmetrical in , and but it is not symmetrical symbolically. Devise a general method for recognizing symmetrical information expressions. c 1 f g c 1 f g c 1 f 4 2. The canonical form of an information expression is unique when there is no constraint on the random variables involved. Show by an example that this does not hold when certain constraints are imposed on the random variables involved. \ < \ ú 3. 
Alternative canonical form Denote Í Ý by and let $ ú ~ is a nonempty subset of N M & D R A F T September 13, 2001, 6:27pm D R A F T 278 ¼_ A FIRST COURSE IN INFORMATION THEORY a) Prove Y that a signed measure on M Oe c 1 f VUVUVUÆ1 M can be b) Prove that an information expression involving expressed uniquely as N a linear combination of K ú , where are nonempty subsets of M . M information expressions _ 4. Uniqueness of the canonicalÈ form for nonlinear ` ~ aY function h h , where $ Consider Q such that O'{ h ~ { $5- has zero Lebesgue measure. } a) Prove that cannot be identically zero on MK . canonical form for b) Use the result in a) to show the uniqueness of the the class of information expressions of the form » { where » is a polyis completely specified by , which can be any set of real numbers. nomial. Ù Ë $(- ,Ù if neither nor is in the row 5. Prove thatË under the constraint {!$%{- and space of , the identities {!$5- are equivalent if and only (Yeung [216].) if (12.61) has a unique solution. HISTORICAL NOTES The uniqueness of the canonical form for linear information expressions was first proved by Han [86]. The same result was independently obtained in the book by Csisz r and K rner [52]. The geometrical framework for information inequalities is due to Yeung [216]. The characterization of equivalent constrained inequalities in Section 12.4 is previously unpublished. ø D R A F T September 13, 2001, 6:27pm D R A F T Chapter 13 SHANNON-TYPE INEQUALITIES The basic inequalities form the most important set of information inequalities. In fact, almost all the information inequalities known to date are implied by the basic inequalities. These are called Shannon-type inequalities. In this chapter, we show that verification of Shannon-type inequalities can be formulated as a linear programming problem, thus enabling machine-proving of all such inequalities. 13.1 THE ELEMENTAL INEQUALITIES G1< (13.1) random G1variables < and appear more than once. It is readily in which the seen that can :7be;|written < as (13.2) 67 ;G < * Consider the conditional mutual information where in both and , each random variable appears only once. A Shannon’s information measure is said to be reducible if there exists a random variable which appears more than once in the information measure, otherwise the information measure is said to be irreducible. Without loss of generality, we will consider irreducible Shannon’s information measures only, because a reducible Shannon’s information measure can always be written as the sum of irreducible Shannon’s information measures. The nonnegativity of all Shannon’s information measures form a set of inequalities called the basic inequalities. The set of basic inequalities, however, is not minimal in the sense that some basic inequalities are implied by the D R A F T September 13, 2001, 6:27pm D R A F T 280 A FIRST COURSE IN INFORMATION THEORY others. For example, 3 and # (13.3) + (13.4) which are both basic inequalities involving random variables and , imply 6 D $ (13.5) +TVUVUVUWinequality 1XFY X and . which N again is aSbasic involving , where . Unless otherwise specified, all inforLet M $PORQ mation VUVUVUe1 in this chapterX involve some or all of the random variables c 1 f expressions M . The value of will be specified when necessary. 
Through application of the identities 6 (13.6) 6* $ 9:76;E 9 (13.7) ?H $ 3 D3A < (13.8) 67 $ ?H3 H (13.9) $ :;E (13.10) 3 8A $ 3 DA 8 (13.11) $ any Shannon’s information measure can be expressed as the sum of Shannon’s information measures of the following two elemental forms: <\S o?q \ 1^8_ N i) \ 1 z ^M ii) , where ! $ X and N ^ Y þ à M O . This will be illustrated in the next example. It is not difficult to check that the total number of the two elemental forms of Shannon’s information measures for random variables is equal to | X $# X !% M f " $ & (13.12) c 1 f X into a sum of elemental forms of $%a by applying the identities in (13.6) The proof of (13.12) is left as an exercise. é ê = ë¸z7¡RºD£t¡ We can expand Shannon’s information measures for to (13.11) as follows: 6dce1fÆ c D6 f c $ D R A F T September 13, 2001, 6:27pm (13.13) D R A F T c f 1 g D c 1 f 1 g D76 f c 1 g $ f1gR dc c f 1 g D c 1 f D7 c 1 g f $ f c 1 g D f 1 g c & 281 Shannon-Type Inequalities (13.14) (13.15) The nonnegativity of the two elemental forms of Shannon’s information measures form a proper subset of the set of basic inequalities. We call the inequalities in this smaller set the elemental inequalities. They are equivalent to the basic inequalities because each basic inequality which is not an elemental inequality can be obtained as the sum of a set of elemental inequalities in view of (13.6) to (13.11). This will be illustrated in the next example. The proof for the minimality of the set of elemental inequalities is deferred to Section 13.6. " é ê = ë¸z7¡RºD£«¢ In the last example, we expressed c 1 f as c f 1 g c 1 f D7 c 1 g f f c 1 g D f 1 g c & (13.16) X information measures in the above expression are in All the five Shannon’s elemental form for $¯a . Then the basic inequality dce1f' (13.17) +dcW f1gÆ c 1 f c 1 g f f c 1 g f 1 g c can be obtained as the sum of the following elemental inequalities: 13.2 - (13.18) (13.19) (13.20) (13.21) (13.22) - -4& A LINEAR PROGRAMMING APPROACH c ` 1 $ f VUVM UVUÆ1 Q Recall from Section 12.2 that any information expression can be expressed uniquely in canonical form, i.e., a linear combination of the joint entropies involving some or all of the random variables . If the elemental inequalities are expressed in canonical form, they become linear inequalities in the entropy space . Denote this set of inequalities by , where is an matrix, and define - & D R A F T " ` v M } M $(O'{ ~ &E{+- Y & September 13, 2001, 6:27pm M d& {6 (13.23) D R A F T 282 }M }M A FIRST COURSE IN INFORMATION THEORY v M à ý Q ,', ` We first show that is a pyramid in the nonnegative orthant of the entropy . Evidently, contains the origin. Let , be the column space -vector whose th component is equal to 1 and all the other components are equal to 0. Then the inequality ` à {#- _ } M { (13.24) corresponds to the nonnegativity of a joint entropy, which is a basic inequality. Since the set of elemental inequalities is equivalent to the set of basic inequalities, if , i.e., satisfies all the elemental inequalities, then also satisfies the basic inequality in (13.24). In other words, { { } M à O'{ ~ à { +- Y (13.25) ` . This implies that } M is in the nonnegative orthant of the for all Q,(6, } entropy space. Since M contains the origin and the constraints &E{(®- are } linear, we conclude that M is a pyramid in the nonnegative orthant of v M . 
X Since the elemental c inequalities 1 f VUVUVUÆ1 are satisfied by} the entropy function of any M , for any { in MK , { is also in } M , i.e., random variables } MK à } M & (13.26) Therefore, for any unconstrained inequality !#- , if } M à O'{ ~ { +- YZ (13.27) then } M|K à O'{ ~ { +- YZ (13.28) i.e., 7¼- always holds. In other words, (13.27) is a sufficient condition for ت- to always hold. Moreover, an inequality ؾ- such that (13.27) is because if { satisfies the basic satisfied is implied _ by} the basic inequalities, M inequalities, i.e., { , then { satisfies { #- . For constrained inequalities, following our discussion in Section 12.3, we impose the constraint (13.29) Ë {9$5and let Ì $PO'{ ~ Ë {!$5- Y & For an inequality !+- , if } Ì YZ M Í Ã O'{ ~ { #} Ì YZ then by (13.26), MK Í Ã O'{ ~ { #D R A F T September 13, 2001, 6:27pm (13.30) (13.31) (13.32) D R A F T 283 Shannon-Type Inequalities b h 0 Figure 13.1. #Ä- n ) q is contained in * Ö,+ Õ ¶Ö Á|Â- . Ì Ì i.e., always holds under the constraint . In other words, (13.31) is a sufficient condition for to always hold under the constraint . Moreover, under the constraint such that (13.31) is satisfied is an inequality implied by the basic inequalities and the constraint , because if and satisfies the basic inequalities, i.e., , then satisfies . ®ð- 13.2.1 !+- Ì _ } Ì { MÍ Ì _ { Ì { { #- { {7(Y ~ O'{ % { Ø- UNCONSTRAINED INEQUALITIES }M To check whether an unconstrained inequality is a Shannon-type inequality, we need to check whether is a subset of . The following theorem induces a computational procedure for this purpose. ¥§¦n¨©Dª¡RºD£«º ª { ³- is a Shannon-type inequality if and only if the minimum of the problem Minimize { , subject to &E{#- (13.33) is zero. In this case, the minimum occurs at the origin. }M Y O'{ ~ Ä { ªY { }M Remark The idea of this theorem is illustrated in Figure 13.1 and Figure 13.2. In Figure 13.1, is contained in . The minimum of subject to occurs at the origin with the minimum equal to 0. In Figure 13.2, is not contained in . The minimum of subject to is . A formal proof of the theorem is given next. }/M . }M O'{ ~ 3{ +- { Y } ~ M {+Proof of Theorem 13.3 We have to prove that is a subset of O'{ _ } only if the if and minimum of the problem in (13.33) is zero. First of all, since Y of the problem in (13.33) - M and -|} $Ä- for any , the minimum { ¬- and the minimum ofistheat most 0. Assume M is a subset of O'{ ~ D R A F T September 13, 2001, 6:27pm D R A F T 284 A FIRST COURSE IN INFORMATION THEORY 1b 2h 0 0n Figure 13.2. ) q is not contained in * Ö3+ Õ ¶nÖ Á|Â4- . _ } M problem in (13.33) is negative. Then there exists an { Å { which implies } M à O'{ ~ {+- YZ } which is a contradiction. Therefore, if M is a subset of O'{ the minimum of the problem in (13.33) is zero. } _ not a subset of O'{ To prove the converse, assume M is } M such that (13.35) is true. Then there exists an { = { Å -4& such that (13.34) Y ~ {7¯- , then Y ~ { ²- , i.e. (13.35) (13.36) This implies that the minimum of the problem in (13.33) is negative, i.e., it is not equal to zero. Finally, if the minimum of the problem in (13.33) is zero, since the contains the origin and , the minimum occurs at the origin. -§$5- }M {¯¼{5$²- {($ª{ ñ¼ By virtue of this theorem, to check whether is an unconstrained Shannon-type inequality, all we need to do is to apply the optimality test of the simplex method [56] to check whether the point is optimal for the minimization problem in (13.33). 
If is optimal, then is an unconstrained Shannon-type inequality, otherwise it is not. 13.2.2 CONSTRAINED INEQUALITIES AND IDENTITIES { #- under the constraint Ì is a Shannon} M Í Ì is a subset of O'{ ~ 3{ +- Y . To check whether an inequality type inequality, we need to check whether D R A F T September 13, 2001, 6:27pm D R A F T 285 Shannon-Type Inequalities ¥§¦n¨©Dª¡RºD£âá { #- is a Shannon-type inequality under the constraint Ì if and only if the minimum of the problem { , subject to &d{3+- and Ë {9$5Minimize (13.37) is zero. In this case, the minimum occurs at the origin. Ì The proof of this theorem is similar to that for Theorem 13.3, so it is omitted. By taking advantage of the linear structure of the constraint , we can reformulate the minimization problem in (13.37) as follows. Let be the rank of . Since is in the null space of , we can write Ë Ü Ë { ` Ë ` E where Ý is a Ü {$ Ë Ý {ß Ë ËÝ {ß (13.38) ` Ü matrix such that the rows of form a basis of the is a column orthogonal complement of the row space of , and vector. Then the elemental inequalities can be expressed as { ß,} M Ë& Ý {:ßx+- (13.39) } ßM $PO'{ß ~ &¼Ë Ý {:ßx+- YZ (13.40) 65 which is a pyramid in h (but not necessarily Ë Ý { ß . in the nonnegative orthant). { can be expressed Likewise, as With all the information expressions in terms of { ß , the problem in (13.37) becomes Ë Ý { ß , subject to &ØË Ý { ß +- . Minimize (13.41) Therefore, Ì to check whether {7%- is a Shannon-type inequality under the constraint , all we need to do is to apply the optimality test of the simplex method to check whether the point { ß $5- is optimal for the problem in (13.41). {ذ- is a Shannon-type inequality under the If { ß $Û- Ì is optimal, then constraint , otherwise it is not. Ì remains By imposing the constraint , the number of elemental inequalities ` ` Ü. the same, while the dimension problem decreases from to {¯of$ñthe Finally, to verify that is a Shannon-type identity under the {3$P- is implied by the basic inequalities, all we need toconÌ straint , i.e., do is to verify that both Ì {¼²- and {Ø,¬- are Shannon-type inequalities under the constraint . and in terms of 13.3 becomes A DUALITY A nonnegative linear combination is a linear combination whose coefficients are all nonnegative. It is clear that a nonnegative linear combination D R A F T September 13, 2001, 6:27pm D R A F T 286 A FIRST COURSE IN INFORMATION THEORY of basic inequalities is a Shannon-type inequality. However, it is not clear that all Shannon-type inequalities are of this form. By applying the duality theorem in linear programming [182], we will see that this is in fact the case. The dual of the primal linear programming problem in (13.33) is U Maximize 7 - where {b½- subject to 7+- and 7 &ª, > 0 c UVUVU²098 B 7 $ & , (13.42) (13.43) By the duality theorem, if the minimum of the primal problem is zero, which is a Shannon-type inequality, the maximum of the happens when dual problem is also zero. Since the cost function in the dual problem is zero, the maximum of the dual problem is zero if and only if the feasible region ¤ $PO:7 ~ 73+- and Y 7 &½, (13.44) $ ¥§¦n¨©Dª¡RºD£«ã + (- is; a Shannon-type inequality if and only if ; & for some ; +- ,{ where is a column " -vector, i.e., is a nonnegative linear combination of the rows of G. $ ; & for ¤ Proof We have to prove that ¤ is nonempty if and only if some ; +- . The feasible region is nonempty if and only if =< & (13.45) for some <#- , where < is a column " -vector. 
Consider any < which satisfies (13.45), and let > (13.46) $ < &¾+-4& à ` is equal to 1 and all Denote by the column -vector whose th component à ` , d, . Then { is a joint entropy. the other components are equal to 0, Q ? à Since every joint entropy can be expressed as the sum of elemental forms of Shannon’s information measures, can be expressed as a nonnegative linear combination of the rows of G. Write c @'f UVUVU @ B > A> @ýB $ (13.47) @ ` where #- for all Q,C, . Then > D c @ à (13.48) $ µ is nonempty. can also be expressed as a nonnegative linear combinations of the rows of G, i.e., (13.49) > D R A F T $FE & September 13, 2001, 6:27pm D R A F T 287 Shannon-Type Inequalities for some where that Eª+- . From (13.46), we see $ E < Û ; & $ & (13.50) ; +- . The proof is accomplished. From this theorem, we see that all Shannon-type inequalities are actually trivially implied by the basic inequalities! However, the verification of a Shannontype inequality requires a computational procedure as described in the last section. 13.4 MACHINE PROVING – ITIP Theorems 13.3 and 13.4 transform the problem of verifying a Shannon-type inequality into a linear programming problem. This enables machine-proving of all Shannon-type inequalities. A software package called ITIP1 , which runs on MATLAB, has been developed for this purpose. Both the PC version and the Unix version of ITIP are included in this book. Using ITIP is very simple and intuitive. The following examples illustrate the use of ITIP: 1. >> ITIP(’H(XYZ) <= H(X) + H(Y) + H(Z)’) True 2. >> ITIP(’I(X;Z) = 0’,’I(X;Z|Y) = 0’,’I(X;Y) = 0’) True 3. >> ITIP(’I(Z;U) - I(Z;U|X) - I(Z;U|Y) <= 0.5 I(X;Y) + 0.25 I(X;ZU) + 0.25 I(Y;ZU)’) Not provable by ITIP È È In the first example, we prove an unconstrained inequality. In the second examforms a Markov ple, we prove that and are independent if chain and and are independent. The first identity is what we want to prove, while the second and the third expressions specify the Markov chain and the independency of and , respectively. In the third example, ITIP returns the clause “Not provable by ITIP,” which means that the inequality is not a Shannon-type inequality. This, however, does not mean that the inequality to be proved cannot always hold. In fact, this inequality is one of the few known non-Shannon-type inequalities which will be discussed in Chapter 14. We note that most of the results we have previously obtained by using information diagrams can also be proved by ITIP. However, the advantage of È È 1 ITIP stands for Information-Theoretic Inequality Prover. D R A F T September 13, 2001, 6:27pm D R A F T 288 A FIRST COURSE IN INFORMATION THEORY Y GZ HX T Figure 13.3. The information diagram for I ,J ,K , and L in Example 13.6. using information diagrams is that one can visualize the structure of the problem. Therefore, the use of information diagrams and ITIP very often complement each other. In the rest of the section, we give a few examples which demonstrate the use of ITIP. The features of ITIP are described in details in the readme file. é êÈ = ë¸z7¡RºD£«ì È È ÈMarkov È chain È È By Proposition 2.10, the long implies the two short Markov chains and . We want to see whether the two short Markov chains also imply the long Markov chain. If so, they are equivalent to each other. Using ITIP, we have >> ITIP(’X/Y/Z/T’, ’X/Y/Z’, ’Y/Z/T’) Not provable by ITIP In the above, we have used a macro in ITIP to specify the three Markov chains. 
The above result from ITIP says that the long Markov chain cannot be proved from the two short Markov chains by means of the basic inequalities. This strongly suggests that the two short Markov chains is weaker than the long Markov chain. However, in order to prove that this is in fact the case, we need an explicit construction of a joint distribution for , , , and which satisfies the two short Markov chains but not the long Markov chain. Toward this end, we resort to the information diagram in Figure 13.3. The Markov chain is equivalent to , i.e., È È 3I NM D NM $%- OM K Ý Í Ý Í Ý Í Ý È K È Ý Í Ý Í Ý Í Ý $5-4& Similarly, the Markov chain M D M is equivalent M to :K Ý Í Ý Í Ý Í Ý :K Ý Í Ý Í Ý Í Ý $5-4& D R A F T September 13, 2001, 6:27pm (13.51) (13.52) D R A F T 289 Shannon-Type Inequalities Y * * HX * * GZ * T Figure 13.4. The atoms of Markov chain. PRQ on which SUT vanishes when È È È IWVXJYVZK[V\L forms a The four atoms involved in the constraints (13.51) and (13.52) are marked by a dagger in Figure 13.3. In Section 6.5, we have seen that the Markov holds if and only if takes zero value on the chain set of atoms in Figure 13.4 which are marked with an asterisk2 . Comparing Figure 13.3 and Figure 13.4, we see that the only atom marked in Figure 13.4 . Thus if we can construct a but not in Figure 13.3 is such that it takes zero value on all atoms except for , then the corresponding joint distribution satisfies the two short Markov chains but not the long Markov chain. This would show that the two short Markov chains are in fact weaker than the long Markov chain. Following Theorem 6.11, such a can be constructed. In fact, the required joint distribution can be obtained by simply letting , where is any random variable such that , and letting and be degenerate random variables taking constant values. Then it is easy to see that and hold, while does not hold. K $ M M Ý Í Ý Í Ý Í Ý K È È M M ÝÍ Ý Í Ý Í Ý 6 È È é ê = ë¸z7¡RºD£^] È È È È $ è( È È È The data processing theorem says that if forms a Markov chain, then ;| K È È È & (13.53) We want to see whether this inequality holds under the weaker condition that and form two short Markov chains. By using ITIP, we can show that (13.53) is not a Shannon-type inequality under the Markov 2 This information diagram is essentially a reproduction of Figure 6.8. D R A F T September 13, 2001, 6:27pm D R A F T 290 A FIRST COURSE IN INFORMATION THEORY A $ 5 (13.54) ;|<H and (13.55) $%-4& This strongly suggests that (13.53) does not always hold under the constraint this has to be proved by an explicit of the two short Markov chains. However, conditions construction of a joint distribution for , , , and which satisfies (13.54) and (13.55) but not (13.53). The construction at the end of the last example serves this purpose. é ê = ë¸z7¡RºD£`_badcxfe©DUgFcD¦x ©ihkjRlnm¡o]2¢qpr Consider the following secret sharing problem. Let s be a secret to be encoded into three pieces, , , and . The scheme has to satisfy the following two secret sharing requirements: 1. s can be recovered from any two of the three encoded pieces. 2. No information about pieces. 
s can be obtained from any one of the three encoded 6 ?H 6 * s $ s $ s $5(13.56) while the second requirement is equivalent to the constraints 19 < s $ s $ s $5 (13.57) -4& Since the secret s can be recovered if all , , and are known, 6:;A:66@ (13.58) s & We are naturally interested in the maximum constant " which satisfies 6;A:6@ (13.59) +" s & The first requirement is equivalent to the constraints " We can explore the possible values of by ITIP. After a few trials, we find that ITIP returns a “True” for all , and returns the clause “Not provable by ITIP” for any slightly larger than 3, say 3.0001. This means that the maximum value of is lower bounded by 3. This lower bound is in fact tight, as we can see from the following construction. Let and be mutually independent ternary random variables uniformly distributed on , and define " ",#a " D R A F T s $ t $ s tvu q w a t SzY OÆ- Q September 13, 2001, 6:27pm (13.60) (13.61) D R A F T 291 Shannon-Type Inequalities and 7 $ s txu qw aT& Then it is easy to verify that P qw s $ 6 u a $ +3W u qw a $ u qw aT& S zY OÆ- Q 976;<@ (13.62) 1* (13.63) (13.64) (13.65) Thus the requirements in (13.56) are satisfied. It is also readily verified that the requirements in (13.57) are satisfied. Finally, all , and distribute uniformly on . Therefore, " s 6 $%a s & (13.66) This proves that the maximum constant which satisfies (13.59) is 3. Using the approach in this example, almost all information-theoretic bounds reported in the literature for this class of problems can be obtained when a definite number of random variables are involved. 13.5 TACKLING THE IMPLICATION PROBLEM We have already mentioned in Section 12.5 that the implication problem of conditional independence is extremely difficult except for the special case that only full conditional mutual independencies are involved. In this section, we employ the tools we have developed in this chapter to tackle this problem. In Bayesian network (see [151]), the following four axioms are often used for proving implications of conditional independencies: òÎdzy ØòÎ7 Symmetry: ò¼;?V|{ ò5EH~} )ò=<H (13.67) Decomposition: Weak Union: D R A F T )òÄ;8V{ )ò5E September 13, 2001, 6:27pm (13.68) (13.69) D R A F T 292 A FIRST COURSE IN INFORMATION THEORY )ò5ER} ò=A ?H{ )òÄ;8V Contraction: & (13.70) These axioms form a system called semi-graphoid and were first proposed by Dawid [58] as heuristic properties of conditional independence. The axiom of symmetry is trivial in the context of probability3 . The other three axioms can be summarized by )ò¼;?Vzy ò5ER}Ø ò=A ?H (13.71) & This can easily be proved as follows. Consider the identity ?A D3A ?H $ & (13.72) mutual ?Ainformations are always nonnegative A ?by the basic Since conditional inequalities, if vanishes, and also vanish, and vice versa. This proves (13.71). In other words, (13.71) is the result of a specific application of the basic inequalities. Therefore, any implication which can be proved by invoking these four axioms can also be proved by ITIP. In fact, ITIP is considerably more powerful than the above four axioms. This will be shown in the next example in which we give an implication which can be proved by ITIP but not by these four axioms4 . We will see some implications which cannot be proved by ITIP when we discuss non-Shannon-type inequalities in the next chapter. é ê = ë¸z7¡RºD£` 3 A 3 A A 3A We will show that $5- $5- { $5- $5$5- $5can be proved by invoking the basic inequalities. 
First, we write 3 E dA & $ 3 E Since $5- and 3 E #- , we let $ÎÚ (13.73) (13.74) (13.75) 3 These 4 This four axioms can be used beyond the context of probability. example is due to Zhen Zhang, private communication. D R A F T September 13, 2001, 6:27pm D R A F T 293 Shannon-Type Inequalities Y + _ _ HX _ + GZ + T Figure 13.5. The information diagram for I ,J ,K L , and . GÚ A $ Ú (13.76) (13.74). In the information diagram indFigure <H 13.5, we mark the atom from bya “+” <Hand theatom 3 | <H:3 byA 8a “ .” Then we write (13.77) & A $ G A Since $5- and A ?$ Ú , we get (13.78) $5Ú& A ? for some nonnegative real number , so that In the information diagram, we mark the atom with a “+.” Continue in this fashion, the five CI’s on the left hand side of (13.73) imply that all the atoms marked with a “+” in the information diagram take the value , while all the atoms marked with a “ ” take the value . From the information diagram, we see that Ú 3 3 dA D3 k D $ $ Ú Ú $5- Ú (13.79) which proves our claim. Since we base our proof on the basic inequalities, this implication can also be proved by ITIP. Due to the form of the five given CI’s in (13.73), none of the axioms in (13.68) to (13.70) can be applied. Thus we conclude that the implication in (13.73) cannot be proved by the four axioms in (13.67) to (13.70). 13.6 MINIMALITY OF THE ELEMENTAL INEQUALITIES We have already seen in Section 13.1 that the set of basic inequalities is not minimal in the sense that in the set, some inequalities are implied by the others. D R A F T September 13, 2001, 6:27pm D R A F T 294 A FIRST COURSE IN INFORMATION THEORY We then showed that the set of basic inequalities is equivalent to the smaller set of elemental inequalities. Again, we can ask whether the set of elemental inequalities is minimal. In this section, we prove that the set of elemental inequalities is minimal. This result is important for efficient implementation of ITIP because it says that we cannot consider a smaller set of inequalities. The proof, however, is rather technical. The reader may skip this proof without missing the essence of this chapter. The elemental inequalities in set-theoretic notations have one of the following two forms: <\x orq \ 1. Ý <\ Ý #- ^ N ^ Y 2. Ý Í Ý Ý +- , $! and þ à M O , where denotes a set-additive function defined on M . They will be referred to as i -inequalities and ö -inequalities, respectively. We are to show that all the elemental inequalities are nonredundant, i.e., none of them is implied by the an <others. \x o For \ i -inequality q Ý Ý #<\S or(13.80) \ q 9 since it is the only elemental inequality which involves the atom Ý Ý , it is clearly not implied by the other elemental inequalities. Therefore we only need to show that all -inequalities are nonredundant. To show that a inequality is nonredundant, it suffices to show that there exists a measure on which satisfies all other elemental inequalities except for that -inequality. We will show that the -inequality M ö ö ö ö <\ Ý Í Ý Ý + Y N ^ (13.81) nonredundant. ^ \ our discussion, ^S we denote M <þ \ O by is To facilitate þ , and we let s s à þ be the atoms in Ý Í Ý Ý , where \ <\ M M \ s $ Ý Í Ý Í Ý ^ Í Ý Í Ý s 6 & N ^ (13.82) Y M þ $ , i.e., þ³$ O . We We first consider the case when construct a measure by <\ $ïÝ Í Ý Ý b $ Q Q ifotherwise, (13.83) ¼ _ <\ atom Å with measure . In other words, Ý Í Ý where Ý <\is the only Q ; all other atoms have measure 1. Then^ _ ÝN Í Ý Ý - is trivially M true. 
It is also trivial to check \ that foro?qany \ ß , (13.84) Ý Ý $ØQ# D R A F T September 13, 2001, 6:27pm D R A F T 295 ^ ^ ^ N ^ Y and for any ß ß þ ß $ \ þ such that ß $! ß and þ ß Ã M O ß ß , (13.85) N N ^ Y Ý Í Ý Ý $ØQ +M \ O ß ß . On the other hand, if þ ß is a proper subset of M if^ þ ß Y $ O ß ß , then Ý Í Ý Ý \contains at least two atoms, and therefore (13.86) Ý Í Ý Ý +-4& ^ the proof for the ö -inequality in (13.81) to be nonredundant This completes ^ ^ V when þ $ . þ W $ <, \or þ ½Q . We We now consider the case when the atoms in Ý Í Ý Ý , let construct a measure as follows. For k ^ \ 1 Q s$ ^ þ k Q (13.87) s b $ Q s $ þ & \ <\ For s , if s is odd, it is referred to as an odd < atom \ of Ý Í Ý Ý , \ it is referred and if to as an even atom of Ý Í Ý s _ is even, Ý . For any atom Ý Í Ý Ý , we let $ØQ& (13.88) This completes the construction of . <\ rÅ We first prove that Ý Í Ý Ý -4& (13.89) Consider <\ \ 1 Ý Í Ý Ý $ 6D \ \ s 6 # ^ V k 5 þ % Q $ 5D Q Ü µ $ Q where the last equality follows from the binomial formula X D M # % k 5 Q $5(13.90) 5 Ü µ X ^ _ for (Q . This proves (13.89). \ that o q satisfies <\ We note that for any ß \ all i -inequalities. N Next we prove M , the atom Ý Ý is not in Ý Í Ý Ý . Thus \ \ (13.91) Ý Ý o?q $ØQ +-4& Shannon-Type Inequalities D R A F T September 13, 2001, 6:27pm D R A F T 296 A FIRST COURSE IN INFORMATION THEORY ^ ^ satisfies Y i.e., all ^ ö -inequalities except N for ^ (13.81), M à þ such O ß ß, ß ßþ ß $ \ that ß $! ß and þ ß Ý Í Ý Ý +-4& (13.92) Consider \ Ý Í 1 Ý \ Ý <\ 1 $ 1Ý \Í Ý Ý Í #Ý <Í \ Ý Ý 1 Ý Í Ý Ý Ý Í Ý Ý & (13.93) It remains to prove that for any \ <\ Ý Í Ý Ý Í Ý Í Ý Ý is nonempty if and only if ^ Y ^ Y Í O ß Zß þ $ and O Í þ!ß$& The nonnegativity of the second term above follows from (13.88). For the first term, (13.94) $ (13.95) If this condition is not satisfied, then the first term in (13.93) becomes , and (13.92) follows immediately. Let us assume that the condition in (13.95) is satisfied. Then by simple counting, we see that the number atoms in - \ <\ Ý Í Ý Ý Í Ý Í Ý Ý (13.96) 9 is equal to , where XE# ^S Y¢¡ ^ Y£¡ ¡ (13.97) O O f ß ß þ þ ß & X $ $ ¤ , there For example, for g in <ÉÆ c aref ¥I$ c atoms c f g Ý É M Í Ý§ ¦ Í § ¨ Ý Í Ý \ Ý <\ \ M ^ (13.98) Í Í Í Í Í namely Ý , where $ Ý or Ý for $[© ¤ . We Ý Ý Ý Î Sz¢Y ¡ ªY ¡ ¡ YT check that $ ¤ ORQ [ ORQ a :O ¥ $ & (13.99) We first consider the case when $5- , i.e., ^ ªY ¡ ^ «Y ¡ ¡ N M $PO O ß ß þ þ ß& (13.100) \ <\ Then Ý Í Ý Ý Í Ý Í Ý Ý \ §(13.101) contains exactly one atom. If this atom is an even atom of Ý Í Ý Ý , then the first term in (13.93) is either 0 or 1 (cf., (13.87)), and (13.92) follows D R A F T September 13, 2001, 6:27pm D R A F T APPENDIX 13.A: The Basic Inequalities and the Polymatroidal Axioms 297 ° , ® ¯ ± is an odd atom of ¬ ¬ ¬ , then the first term in immediately. If this°atom (13.93) is equal to . This happens if and¯ only ¬ ¾ one ®¼ ¬ ½ ¼ ° if ²:¬ ³µ¾ ´·¶¼k¿ ¸ ° andº»,¬ ²:® ³¯ ¹´· ¶9¬ ¹k½ ¸ ° have ¿ is common element, which implies that º»¬ nonempty. Therefore the second term in (13.93) is at least 1, and hence (13.92) ± follows. . Using the binomial formula Finally, we consider the case when ÀÂÁ ,®4¯ ¬ ½ ° ¬ in¾ (13.90), we see that the number of odd atoms and even atoms of ¬ in º ¬ ® ¼¯ ¬ ½µ¼ ° ¬ ¾«¼ ¿ ¯ º ,¬ ®Ã¯ ¬ ½ ° ¬ ¾ ¿ (13.102) °± are the same. Therefore the first term in (13.93) is equal to if ° Ä ® ½Å ¾ ºÆ/ºk³µ´·¶9´µÇ ¿È¿ÊÉ ¬ ® ¼U¯ ¬ ½d¼ ¬ ¾«¼ ´ (13.103) ° The ° former ° if and and is equal to 0¼Ãotherwise. 
is true only if Ç ¹»Ë Ç , which ½ ¾ ¯ d ½ ¼ « ¾ ¼ , ® ¯ ® ¿ ¿ ¬ ¬ º ¬ ¬ ¬ is nonempty, or that the implies that º ¬ second term is at least 1. Thus in either case (13.92) is true. This completes the proof that (13.81) is nonredundant. APPENDIX 13.A: THE BASIC INEQUALITIES AND THE POLYMATROIDAL AXIOMS show that the basic inequalities for a collection of Ì random variables ÍÏÎÑIn еthis Ò«ÓÔappendix, ÕÃÖ«×/Ø9Ù isweequivalent ÔÜÛNÝÞ×OØ , to the following polymatroidal axioms: For all Ú æ Î å P1. ߪà§á`âäã . Û ÝéÛ . P2. ß à áçÚ6ãèß à á ã if Ú Û Û ãoê,ß à áçÚOí Û ã . P3. ß à áçÚ6ãêéß à á ã§ëß à áçÚOì We first that the Ý Ú show ÝÞ×/polymatroidal Ø , we have axioms imply the basic inequalities. From P1 and P2, since â for any Ú ßªàáçÚ6ã§ëîߣà§á`âã ÎÏå:Ô (13.A.1) or (13.A.2) ß3á Òðï ã§ë åòñ This shows that entropy is nonnegative. ÎôÛõ Ú , we have In P2, letting ó ߣà§áçÚ6ã§èîߣàiáçÚOíªóã Ô (13.A.3) or ß× 3Ø á Ò«ö9÷ Ò ï ãë åòñ (13.A.4) Here, ó and Ú are disjoint subsets of . ÎôÛõ Ú , ø Î Ú/ì Û , and In P3, letting ó ù Î Ú õµÛ , we have (13.A.5) ß à áçùÊíÊøãê,ß à áó«í¢øäãëß à áçøäãêß à áçù£íOøúíªóã Ô or û Ò«ü9ýÒ«öþ÷ Òðÿ å:ñ á ãë (13.A.6) D R A F T September 13, 2001, 6:27pm D R A F T 298 Again, ù Ô ø , and ó A FIRST COURSE IN INFORMATION THEORY × Ø ø Î â , from P3, we have û «Ò ü9.ýÜWhen á Òðö ã§ë åòñ are disjoint subsets of (13.A.7) Thus P1 to P3 imply that entropy is nonnegative, and that conditional entropy, mutual information, and conditional mutual information are nonnegative provided that they are irreducible. However, it has been shown in Section 13.1 that a reducible Shannon’s information measure can always be written as the sum of irreducible Shannon’s information measures. Therefore, we have shown that the polymatroidal axioms P1 to P3 imply the basic inequalities. The converse is trivial and the proof is omitted. PROBLEMS 1. Prove (13.12) for the total number of elemental forms of Shannon’s information measures for random variables. ´ ´ò´ 2. Shannon-type inequalities for random variables refer to all information inequalities implied by the basic inequalities for these random variables. Show that no new information inequality can be generated by considering the basic inequalities for more than random variables. 3. Show by an example that the decomposition of an information expression into a sum of elemental forms of Shannon’s information measures is not unique. ´ ´ò´ 4. Elemental forms of conditional independencies Consider random vari. A conditional independency is said to be elemenables tal if it corresponds to setting an elemental form of Shannon’s information measure to zero. Show that any conditional independency involving is equivalent to a collection of elemental conditional independencies. ´ ´´ 5. Symmetrical information inequalities ´ ´ò´ ´ ,® "!#6® ¿ ® º a) Show that every symmetrical information expression (cf. Problem 1 in Chapter 12) involving random variable can be written in the form where ±%$'&$ and for D R A F T °=± , @ ,®8AÈ ½ ¾ ¿CB ) (6Ó5®+4 *6879½ 4;(,: : <= º -/.1023 - 2?> September 13, 2001, 6:27pm D R A F T 299 APPENDIX 13.A: The Basic Inequalities and the Polymatroidal Axioms ±D$E&F$ °n± Note that is the sum of all Shannon’s information measures of , is the sum the first elemental form, and for of all Shannon’s information measures of the second elemental form conditioning on random variables. &N° ± b) c) G & Á H always holds if Á'H for all . & Show that I $Kall &L$ . 
Hint: °Â± Construct Show that if ÁJH always holds, then ÁIH for variables ´ ´´ $'& for$ each random ° ± H & & such that G M H and G ¼ H for all H ¹ and ¹ON . (Han [87].) 6. Strictly positive probability distributions It was shown in Proposition 2.12 that Q P R? º ´ S ¿ P S º ´ R¿UTWV QP º S ´ R ¿ S R:¿ M H for all Y , Y , Y S , and Y R . Show by using ITIP if X º+Y )´ Y ´)Y ´)Y that this implication is not implied by the basic inequalities. This strongly suggests that this implication does not hold in general, which was shown to be the case by the construction following Proposition 2.12. 7. a) Verify by ITIP that @ º ´ A[Z\ ´ Z] ¿ $ @ º ^A[Z\ ¿_ Z\ Z] ´ ¿ under the constraint º ´ @ º A[Z` ¿ º Z\G ¿a_ Z]1 ¿ . º This constrained inequality was used in Problem 9 in Chapter 8 to obtain the capacity of two independent parallel channels. b) Verify by ITIP that @ º ´ A[Z ´ Z ¿ Á @ º A[Z ¿_ @ º A[Z ¿ @ ^AÈ ¿ H . This constrained inequality was under the constraint º used in Problem 4 in Chapter 9 to obtain the rate distortion function for a product source. 8. Repeat Problem 9 in Chapter 6 with the help of ITIP. cb 9. Prove the implications in Problem 13 in Chapter 6 by ITIP and show that they cannot be deduced from the semi-graphoidal axioms. (Studen [189].) D R A F T September 13, 2001, 6:27pm D R A F T 300 A FIRST COURSE IN INFORMATION THEORY HISTORICAL NOTES For almost half a century, all information inequalities known in the literature are consequences of the basic inequalities due to Shannon [173]. Fujishige [74] showed that the entropy function is a polymatroid (see Appendix 13.A). Yeung [216] showed that verification of all such inequalities, referred to Shannon-type inequalities, can be formulated as a linear programming problem if the number of random variables involved is fixed. Non-Shannon-type inequalities, which are discovered only recently, will be discussed in the next chapter. The recent interest in the implication problem of conditional independence has been fueled by Bayesian networks. For a number of years, researchers in Bayesian networks generally believed that the semi-graphoidal axioms form a complete set of axioms for conditional independence until it was refuted by Studen [189]. cb D R A F T September 13, 2001, 6:27pm D R A F T Chapter 14 BEYOND SHANNON-TYPE INEQUALITIES dOe d e d d f In Chapter 12, we introduced the regions and in the entropy space for random variables. From , one in principle can determine whether any information inequality always holds. The region , defined by the set of all basic inequalities (equivalently all elemental inequalities) involving random variables, is an outer bound on . From , one can determine whether any information inequality is implied by the basic inequalities. If so, it is called a Shannon-type inequality. Since the basic inequalities always hold, so do all Shannon-type inequalities. In the last chapter, we have shown how machineproving of all Shannon-type inequalities can be made possible by taking advantage of the linear structure of . d e d e d d d If the two regions and are identical, then all information inequalities which always hold are Shannon-type inequalities, and hence all information inequalities can be completely characterized. However, if is a proper subset of , then there exist constraints on an entropy function which are not implied by the basic inequalities. Such a constraint, if in the form of an inequality, is referred to a non-Shannon-type inequality. 
d e d d e N d There is a point here which needs further explanation. The fact that does not necessarily imply the existence of a non-Shannon-type inequality. As an example, suppose contains all but an isolated point in . Then this does not lead to the existence of a non-Shannon-type inequality for random variables. d d d e d e In this chapter, we present characterizations of which are more refined than . These characterizations lead to the existence of non-Shannon-type inequalities for . D R A F T ÁJg September 13, 2001, 6:27pm D R A F T 302 14.1 A FIRST COURSE IN INFORMATION THEORY @ CHARACTERIZATIONS OF h e , h Se , AND h k l n m i &l i e Recall from the proof of Theorem 6.6 that the vector represents the values of the -Measure on the unions in . Moreover, is related to the values on the atoms of , represented as , by of j/e je k &pq& i (14.1) °=± o m D r & where is a unique matrix with (cf. (6.27)). Let s be the -dimensional Euclidean space with the coordinates labeled by the components of l . Note that each coordinate in s corresponds to the value of j e on a nonempty atom of k . Recall from Lemma 12.1 the definition t É "umv É of the region e ²Gl s l d e ¸þ´ (14.2) which is obtained from the region d e via the linear transformation induced by m . Analogously, we define the region t É um É B (14.3) ²Gl s l d ¸ The region d e , as we will see, is extremely difficult to characterize for a general Dr . . Therefore, we start our discussion with the simplest case, namely wox\yz|{y,}~|~ d e d . Dr , the elemental inequalities are Proof For º ¿ j e º»¬ °° ¬ ¿ ÁH (14.4) ¿ ¿ (14.5) @ º AÈ ¿ j e º ¬ ¯ ¬ ¿ ÁH B (14.6) º j e º ¬ ¬ ÁH Note that the quantities on the left hand sides above are precisely the values of j e on the atoms of k . Therefore, t É u ²Gl s lCÁIHq¸þ´ t t t (14.7) t t orthant of s . Since d|t e Ë d t , e Ë . On is the nonnegative i.e., e by Lemma 12.1. Thus e the other hand, , which implies d e d . The proof isË accomplished. D . Next, we prove that Theorem 14.1 cannot even be generalized to wox\yz|{y,}~|5 d Se N d S . D , the elemental inequalities are Proof For ,®[ ½ ´ ¿ j e º ,¬ ® ° ¬ ½ ° ¬ ¿ ÁH º (14.8) ° ¿ @ ,®9AÈ ½ ¿ ½ , f ® ¯ º j e º ¬ ¬ ¬ ÁHU´ (14.9) D R A F T September 13, 2001, 6:27pm D R A F T Beyond Shannon-Type Inequalities 0 X X 303 2 a a a 0 0 a 1 Figure 14.1. The set-theoretic structure of the point X 3 á åòÔ·å:ÔåòÔ9Ô9Ô+þÔ9\ ã in . and @ º , ®9AÈ ½ ¿ j e º»,¬ ®f¯ ¬ ½ ¿ (14.10) j e º ¬ ® ¯ ¬ ½ ¯ ¬ ¿_ j e º ¬ ® ¯ ¬ ½ ° ¬ ¿ (14.11) (14.12) ±$ &$Á H É S ³C¶" s , let . For l for ´) S ´) R ´)` ´)] ´)] ¿ ´ l + º ) ´ (14.13) ±$ $ ® ³ correspond to the values where ´ ° ° Sò¿ ° ° S:¿ S ° ° ¿ j e º» ¬ ¯ ¬ ° ¬ S ¿ ´Oj e º» ¬ ¯ ¬ S ° ¬ ¿ ´j e º ¬ ¯ ¬ S ° ¬ ¿ ´ j e º ¬ ¯ ¬ ¯ ¬ Sò¿ ´Oj e º ¬ ¬ ¬ ´j e º ¬ ¬ ¬ ´ (14.14) j e º»¬ ¬ ¬ ´ S respectively. These are the values of j e on the nonempty atoms of k . Then from (14.8), (14.9), and (14.12), ±%$ we$'see A that½ _ $ $' t S É S u ® ²Gl s ÁIHU´ ³ ]/° Á'HU´|g ¶ ¸ B (14.15) ¿ for any ÁKH is in t S . It is easy to check that the point ºHU´[HU´[HU´ ´ ´ ´ This is illustrated in Figure 14.1, and it is readily seen that the relations ,® ½ ´ ¿ H (14.16) º and @ º ,®8AÈ ½ ¿ H (14.17) ±$ &$ ³C¶" for are satisfied, i.e., each random variable is a function of ± r are pairwise the other two, and the three random ,® , ³ variables É independent. É ¡ , > ´ ´ Y Y Let Ó be the support of . 
For any and are independent, we have since and X º+Y ´)Y ¿ X º+Y ¿ X»º+Y ¿ M H B (14.18) D R A F T September 13, 2001, 6:27pm D R A F T 304 A FIRST COURSE IN INFORMATION THEORY and , there is a unique Y S/É such that S ¿ ¿ ¿ ¿ B (14.19) X»º+Y ´)Y ´)Y X º+Y ´)Y X º+Y X º+Y M H is a function of and S , and and S are independent, Now since we can write X º+Y ´)Y ´)Y S:¿ X º+Y ´)Y S4¿ X º+Y ¿ X º+Y Sò¿CB (14.20) Since S is a function of Equating (14.19) and (14.20), we have X»º+Y ¿ X º+Y S:¿CB (14.21) É ¡ such that Yù N Y . Since and S are indeNow consider any Yf¹ pendent, we have X º+Y ¹ ´)Y Sò¿ X º+Y ¹ ¿ X»º+Y Sò¿ M H B (14.22) S É > such that Since is a function of and , there is a unique Y ¹ (14.23) X»º+Y ¹ ´)Y ¹ ´)Y Sò¿ X º+Y ¹ ´)Y Sò¿ X º+Y ¹ ¿ X º+Y Sò¿ M H B is a function of and S , and and S are independent, Now since we can write (14.24) X º+Y ¹ ´)Y ¹ ´)Y S ¿ X º+Y ¹ ´)Y S ¿ X º+Y ¹ ¿ X º+Y S ¿CB S Similarly, since is a function of and , and and are independent, we can write X º+Y ¹ ´)Y ¹ )´ Y S:¿ X º+Y ¹ ´)Y ¹ ¿ X º+Y ¹ ¿ X º+Y ¹ ¿CB (14.25) Equating (14.24) and (14.25), we have and from (14.21), we have X»º+Y ¹ ¿ X º+Y :S ¿ ´ (14.26) X»º+Y ¹ ¿ X º+Y ¿CB (14.27) S ¿ Sò¿_ @ ^AÈ? òS ¿_ @ º ^AÈ S ¿ º _ @ º ^AÈ´ AÈ Sò¿ º H _ º _ _ º° ¿ ´ and similarly ¿ º Sò¿ B º Therefore must have a uniform distribution on its support. The same can and . Now from Figure 14.1, be proved for D R A F T September 13, 2001, 6:27pm (14.28) (14.29) (14.30) (14.31) D R A F T 305 Beyond Shannon-Type Inequalities a=0 log 2 Figure 14.2. The values of log 3 log 4 (a,a,a,a,2a,2a,2a) for which á 9Ô9Ô9Ô£¢[9Ô£¢ 9Ô£¢[þÔ¢ ã is in ¤¥ . ¨¦ §©Qª S t S t S ª e N ª Then the only values that can take are , where (a positive integer) , and . In other words, if is not is the cardinality of the supports of , for some positive integer , then the point equal to is not in . This proves that , which implies . The theorem is proved. ¦¨t §©QS ª e ° ºS HU´[HUS ´[HU´ ´ ´ ´ ¿ |d e N d i É f S , let i º« ´ « ´ « S ´ « £ ´ « S ´ « S ´ « £ S° ¿CB (14.32) t ¿ in S corresponds From Figure 14.1, wer see that the point ºHU´[HU´[HU´ ´ ´ ´ r r r r r ¿ S to the point º ´ ´ ´ ´ ´ ´ . Evidently, the point º ´ ´ ´ ´ ´ r ´ r±o$ ¿ in d S satisfies &W $ the 6 elemental in d inequalities given in (14.8) and (14.12) S ³¬ ¶ for with equality. Since d is defined by all the elemental inequalities, the set ²oº ´ ´ ´ r ´ r ´ r ´ r ¿/É d S u ÁIHq¸ (14.33) S is in the intersection of 6 hyperplanes in f (i.e., ) defining the boundary of Sd , and hence it defines an extreme direction of d S . Then the proof says that S along this extreme direction of d , only certain discrete points, namely those points with equals ¦5§©®ª for some positive integer ª , are entropic. This is S illustrated in Figure 14.2. As a consequence, the region d e is not convex. S S Having proved that d e N d , it is natural to conjecture that the gap between d|Se and d S hasS Lebesgue measure 0. In other words, d Se d S , where d Se is the closure of d . This conjecture is indeed true and will be proved at the end of the section. More generally, we are interested in characterizing d e , the closure of d e . Although the region d e is not sufficient for characterizing all information inThe proof above has the following interpretation. For equalities, it is actually sufficient for characterizing all unconstrained information inequalities. This can be seen as follows. 
Following the discussion in Section 12.3.1, an unconstrained information inequality involving random variables always hold if and only if ¯ ÁnH Since ²Gi u ¯ ºi ¿ I Á Hq¸ D R A F T d e Ë ²Gi u ¯ ºi ¿ I Á Hq¸ B (14.34) is closed, upon taking closure on both sides, we have d e Ë ²Gi u ¯ ºi ¿ I Á Hq¸ B September 13, 2001, 6:27pm (14.35) D R A F T 306 A FIRST COURSE IN INFORMATION THEORY On the other hand, if ¯ÁIH satisfies (14.35), then d e Ë d e Ë ²Gi u ¯ ºi ¿ ÁIHq¸ B (14.36) d e Therefore, (14.34) and (14.35) are equivalent, and hence is sufficient for characterizing all unconstrained information inequalities. We will prove in the next theorem an important property of the region for . This result will be used in the proof for . Further, this result all will be used in Chapter 15 when we use to characterize the achievable information rate region for multi-source networking coding. It will also be used in Chapter 16 when we establish a fundamental relation between information theory and group theory. to denote the We first prove a simple lemma. In the following, we use set . Á r d e ± r ² ´ ´ ´)»¸ ±®y,} }a²³~|5´ If i d e d Se d S ° Ri ¹ are in d e , then i _ iú¹ is in d e . iR¹ in d e . Let i represents the entropy function for ranProof Consider i and dom variables µ ´)µ ´ò´)µ , and let iú¹ represents the entropy ¿ and º+µ function ¹ ´)µ ¹ ´for ò´ random variables µ ¹ ´)µ ¹ ´:´)µ ¹ . Let º+µ ´)µ ´´)µ µÑ ¹ ¿ be independent, and define random variables Z/ ´ Z] ´ò´ Z` by Z`¶/ º+µ ¶ ´)µ ¶ ¹ ¿ (14.37) É ° . Then for any subset · of ° , for all ³ Z]¸ ¿ º+µ ¸ ¿_ º+µ ¸ ¹ ¿ « ¸ _ « ¹¸ B (14.38) º _ i ¹ , which represents the entropy function for Z/ ´ Z` ´ò´ Z` , is Therefore, i in d e . The lemma is proved. ¹ z|{]z|ºº1²|{»n~|¼ If i É d e , then & i É d e for any positive integer & . and Proof It suffices to write & _ _ _ i ½ i i [¾ ¿ i À (14.39) and apply Lemma 14.3. wox\yz|{y,}~|5Á d e is a convex cone. µ )´ µ ´´)µ · of ° , Proof Consider the entropy function for random variables taking constant values with probability 1. Then for all subset ¸¿ HB º+µ D R A F T September 13, 2001, 6:27pm all (14.40) D R A F T Beyond Shannon-Type Inequalities 307 Od e f i Z\ Z]iú ¹ d Z`e ´ ´:´  ´C  ´ò´C ± de Åà ±Ç à i iú ¹ d e ¿ Äà i & _WÅÃiR¹ d e HÆ'Z\à Z] º ´ ¿ ´:´ Z` ¿ ºÊÉ ´ É ´º+´ÄÈ É ´)¿ È ´& ò´)È ºÂ ´C ´ò´CÂ Ë ÌÎÍ ²Ë Hq¸ ±¬ÇÐÏÑÇ j ´ ÌÍ ²Ë ± ¸ Ï ´ ÌÍ ²Ë Òr ¸ j B by letting Now construct random variables µ ´)µ ´ò´)µ ¶µ ÔÕ ÖÓ ÈH ¶ ifif ËË H ± ¶ if Ë ×r . É ¿ÎØ H as Ï ´)j Ø H . Then for any nonempty subset · of ° , Note that ºÊË $ ¸ ¸ ´ÄË ¿ ¿ + º µ (14.41) º+µ / ¸ \ ¿ _ ¿ (14.42) ºÊºÊËË ¿\_ Ï& º+µ º Z]¸ Ë ¿_ j & ºÂ ¸ ¿CB (14.43) On the other hand, ¸ ¿ Á º+µ ¸/ Ë ¿ Ï& º Z`¸ ¿/_ j & ºÂ ¸ ¿CB + º µ (14.44) Combining the above, we have $ Ç Ï& Z ¸ ¿_ & ¸ ¿È¿ $ ¸ ¿ ¿CB (14.45) H º+µ º º j ºÂ ºÊË Now take Ï &à (14.46) Therefore, contains the origin in . Let and in be the entropy functions for any two sets of random variables and , respectively. In view of Corolis a convex cone, we only need to show that lary 14.4, in order to prove that if and are in , then is in for all , where . be independent copies of and Let be independent copies of . Let be a ternary random variable independent of all other random variables such that Å j &à (14.47) $ Ç $ to obtain ¸ ] Z ¸ ¸ ¿ \ ¿ L _ Å È ¿ ¿ ¿CB º à º à º  (14.48) ºÊË & H º+µ By letting be sufficiently large, the upper bound can be made arbitrarily _ÙÅÃÄi ¹ É d e . The theorem is proved. small. 
This shows that ÃÄi S S t In the next theorem, we prove that d e and d aret almost identical. Analo gous to d e , we will use e to denote the closure of e . and D R A F T September 13, 2001, 6:27pm D R A F T 308 wox\yz|{y,}~|5Ú d Se d S . S dS Proof We first note that d e Since A FIRST COURSE IN INFORMATION THEORY t S t SB e (14.49) d Se Ë d S (14.50) if and only if S t SB on both sides in the above, we obtain d Se Ë and d is closed, by taking Sd B This implies thatt t Se Ë closure to prove the theorem, it S Ë t Se B Therefore, in order Ç ¿ t S suffices to show that e for all M H . We first show that the point is in º U H [ ´ U H [ ´ U H ´ ´ ´ ´ S be defined as in Example 6.10, i.e., µ ± Let random variables µ ´)µ , and µ and µ are two independent binary random variables taking values in ²HU´ ¸ according to the uniform distribution, and µ S µ _ µ rB (14.51) É d Se represents the entropy function for µ ´)µ , and µ S , and let Let i l nm S i ±%B $ $ (14.52) ¶ As in the proof of Theorem 14.2, we let , ³ be the coordinates of s S mod which correspond to the values of the quantities in (14.14), respectively. From Example 6.10, we have ± r ± for H ³ ¶ Õ ÖÓ Ç± for ³ g ´´ Ûq´´ (14.53) . for ³ ± ± ± DZ ¿ t S @ µ & , Thus the point ºHU´[HU´[HU´ ´ ´ ´ is in e , and the -Measure j e & for& µ & ´)ÇÑ S ¿ & t and µ t is shown in Figure 14.3. Then by Corollary 14.4, ºHU´[HU´[HU´ ´ ´ ´ is t S in Se and hence in Se for all positive integer . S Since d Se contains the origin, t S e also contains the origin. By Theorem Ç14.5,¿ d e ist convex. This implies e t S ºHU´[HU´[HU´ ´ ´ ´ is in Se for all M H . is also convex. Therefore, É Consider any l . Referring to (14.15), we have ¶ ÁIH (14.54) ±Ü$ $Ý for ³ . Thus is the only component of l which can possibly be negative. We first consider the case when t ÁIH . Then l is in the nonnegative S S orthant of s , and by Lemma 12.1, l is in e . Next, consider the case when Ç Ç Ç ] IH . Let Þ ºHU´[HU´[HU´ ]9´ ] ´ ò´) ¿CB (14.55) D R A F T September 13, 2001, 6:27pm D R A F T Beyond Shannon-Type Inequalities ß0 X1 û ß0 Figure 14.3. The -Measure âã 1 1 1 for àX 1 309 2 ß0 Ò > ÔÒ ¡ , and Ò á X3 in the proof of Theorem 14.6. ä _ Þ´ l (14.56) where äF º+ )´ ´) S ´) R _ ´) _ ´) _ ´[H ¿CB (14.57) Ç Þ É t S e . From (14.15), we Since M H , we see from the foregoing that have (14.58) ¶ _ ]/ÁIH t S ä S Þ É t S orthant in s and hence in e by for ³ g´ Ûq´ . Thus is in the nonnegative e such that Lemma 12.1. Now for any å M H , let ¹ æ Þ Ç Þ ¹ æ 'å4´ (14.59) Þæ Ç Þ æ Þ Þ where ¹ denotes the Euclidean distance Þ B between and ¹ , and let ä _ ¹ (14.60) Þ t S l¹ t S ä Since both and ¹ are in e , by Lemma 14.3, lú¹ is also in e , and æ l Ç l ¹ æ æ Þ Ç Þ ¹ æ 'å B (14.61) t S t S t S É e . Hence, Ë e , and the theorem is proved. Therefore, l S S Remark 1 Han [88] has found that d is the smallest cone that contains d e . This result together with Theorem 14.5 implies Theorem 14.6. Theorem 14.6 is also a consequence of the theorem in Matç`b éè [135]. Remark 2 We have shown that the region d e completely characterizes all unconstrained information inequalities involving random variables. Since d Se d S , it follows that there exists no unconstrained information inequaliThen ties involving three random variables other than the Shannon-type inequalities. However, whether there exist constrained non-Shannon-type inequalities involving three random variables is still unknown. 
D R A F T September 13, 2001, 6:27pm D R A F T 310 14.2 A FIRST COURSE IN INFORMATION THEORY A NON-SHANNON-TYPE UNCONSTRAINED INEQUALITY d Se d S ÁJg . We have proved in Theorem 14.6 at the end of the last section that It is natural to conjecture that this theorem can be generalized to . If this conjecture is true, then it follows that all unconstrained information inequalities involving a finite number of random variables are Shannon-type inequalities, and they can all be proved by ITIP running on a sufficiently powerful computer. However, it turns out that this is not the case even for . We will prove in the next theorem an unconstrained information inequality involving four random variables. Then we will show that this inequality is a non-Shannon-type inequality, and that . g wox\yz|{y,}~|ëê d Re N d R r@ º+µ SA µ R ¿ $ @ +º µ A µ ¿\_ _ @ º+µ S A µ R µ ¿/_ @ +º µ For any four random variables @ +º µ SAµ µ )´ µ ´)µ S , and µ R , A µ S ´)µ R ¿ R µ ¿CB (14.62) auxiliary random ì Towardì proving this theorem, we introduce ´)µ ´)µ S ,two ía R such that íaì ® Òvariables µíQì î and µ jointly distributed with µ and µ FíQ . To simplify notation, we will use X £ S8RÄï ï º+Y ´)Y ´)Y S ´)Y R ´ Y ì ´ Y ì and ¿ to S R ¿ ì ì ï ï denote X > ð¡9 ñ > ð¡ º+Y ´)Y ´)Y ´)Y ´ Y ´ Y , etc. The joint distribution for ì S R ì the six random variables µ ´)µ ´)µ ´)µ ´ µ , and µ is defined by ï ï º+Y ´)Y ´)Y S ´)Y R ´ Y ì ´ Y ì ¿ X £ S8RÄòF ó > ¡ ñ ¼ô >9õ ô ¡ õ ô ó õ ô ñ ó > ¡ ñ ô ï >)õ ô ï ¡ õ ô õ ô ñ S8R S R¿ M H if X º+Y ´)Y ñ öô õ ô ñ S8R S R¿ H . (14.63) H if X º+Y ´)Y ±®y,} }a²³~|5÷ ì ì º+µ ´)µ ¿ðØ º+µ S ´)µ R ¿ðØ º µ ´ µ ¿ ì ì (14.64) S R ¿ and º µ ´ µ ´)µ S ´)µ R ¿ forms a Markov chain. Moreover, º+µ ´)µ ´µ ´)µ have the same marginal distribution. Proof The Markov chain in (14.64) is readily seen by invoking Proposition 2.5. The second part of the lemma is readily seen to be true by noting in (14.63) is symmetrical in and and in and . that ì ì X £ S8RÄï ï µ µ µ µ variables ì Fromì the above lemma, we see that the pair of auxiliary random ¿ in the º µ ´ µ ì ¿ ì corresponds + º µ ) ´ µ to the pair of random variables S R¿ have the same marginal distribution as º+µ ´)µ ´)µ S sense that º µ ´ µ ´Cµ ´)µ ´)µ R¿ . We need to prove two inequalities regarding these six random variables before we prove Theorem 14.7. D R A F T September 13, 2001, 6:27pm D R A F T Beyond Shannon-Type Inequalities ±®y,} }a²³~|5ø ì µ @ º+µ SA µ R ¿ Ç @ º+µ SA µ ?R µ µ ´)µ ì µ ¿ Ç @ º+µ S A µ R? µ ¿ $ ´)µ S For any four random variables , and and as defined in (14.63), iary random variables µ R 311 and auxil- @ º+µ A µ ì ¿CB @ º+µ S A µ R¿ Ç @ º+µ S A µ R µ ¿ Ç @ º+µ S A µ R µ ¿ ù ú @ º+µ S A µ R¿ Ç @ Ç º+µ S A µ R µ ¿£û Ç @ º+µ S A µ R µ ì ¿ @ º+µ A µ S1A µ R ¿ @ º+µ SA µ R? µ ì ¿ ú @ º+µ GA µ S A µ R A µ ì ¿\Ç _ @ º+µ ^A µ S A µ R µ Ç ì ¿£û Ç @ º+µ @ º+µ A µ S1A µ RA µ ì ¿ Ç ú @ º+µ SA µ R? µ ì ¿ @ º+µ A µ @ º+µ ^A µ S A µ R A µ ì Ç ¿ @ º+µ S A µ R µ ´ µ ì Ç ¿ ú @ º+µ GA µ R A µ Ç ì ¿ @ º+µ A µ R A µ Çì µ Sò¿£û @ º+µ S A µ Çú @ º+µ GA µ ì ¿ @ º+µ ^A µ ì Ç ? µ R¿£û ú @ º+µ A µ ì µ Sò¿ @ º+µ ^A µ ì µ S ´)µ R¿£û @ º+µ S A µ R µ ´ µ ì ¿ ü @Ç º+µ A µ ì ¿ Ç @ º+µ A µ ì µ R ¿ Ç @ º+µ A µ ì µ S ¿ ì $ @ @ º+µ ^A S ìA µ ¿ R µ ´ µ ¿ º+µ µ ´ (14.65) Proof Consider ì º µ ´)µ S )´ µ R¿ SAµ SA µ (14.66) (14.67) (14.68) (14.69) (14.70) (14.71) R µ ì ¿ R? 
µ ì ¿£û R µ ´ µ ì ¿ (14.72) (14.73) (14.74) º+µ ´)µ S ´)µ R¿ where a) follows because we see from Lemma 14.8 that and have the same marginal distribution, and b) follows because @ º+µ ^ A µ ì µ S ´)µ R¿ H (14.75) from the Markov chain in (14.64). The lemma is proved. ±®y,} }a²³~|~1ý µ ´)µ ì ì µ µ Ç @ º+µ S A µ R¿ r @ +º µ S A µ R µ ¿ $ @ +º µ ´)µ S , and µ R For any four random variables iliary random variables and as defined in (14.63), ì ì µ and aux- ^A µ ì ¿CB (14.76) µ µ Proof Notice that (14.76) can be obtained from (14.65) by replacing by and by in (14.65). The inequality (14.76) can be proved by replacing by and by in (14.66) through (14.74) in the proof of the last lemma. The details are omitted. µ µ µ D R A F T ì µ ì µ September 13, 2001, 6:27pm D R A F T 312 A FIRST COURSE IN INFORMATION THEORY r@ $ º+µ S A µ R ¿ Ç @ º+µ S A µ @ º+µ GA µ ì ¿_ @ º+µ @ º+µ GA µ ì ¿_ ú @ º+µ ú @ º+µ A µ ì ¿\_ @ º+µ @ º+µ GA µ ì ´ µ ì ¿\_ $ @ º+µ GA µ ì ´ µ ì ¿\_ @ GA ì ì ¿\_ ù$ º+µ µ ´ µ @ GA S R¿\_ ü @ º+µ GA µ S ´)µ R¿\_ º+µ µ ´)µ R µ ¿ Ç @ º+µ S A µ R µ ¿ A µ ì ¿ (14.77) A µ ì µ ì ¿_ @ º+µ A µ ì ^A µ ì ¿£û (14.78) A µ ì µ ì ¿£ûþ_ @ º+µ A µ ì A µ ì ¿ (14.79) @ º+µ A µ ì ^A µ ì Ç ¿ (14.80) ú @ º µ ì A µ ì ¿ @ º µ ì ^A µ ì µ ¿£û (14.81) @ º µ ì A µ ì ¿ (14.82) @ º µ ì A µ ì ¿ (14.83) @ º+µ A µ ¿ ´ (14.84) where a) follows from the Markov and b) follows because ì chain ì ¿ in (14.64), ¿ and º+µ ´)µ have we see from Lemma 14.8 that º µ ´ µ ì theì same marginal distribution. Note that the auxiliary random variables µ and µ disappear in Proof of Theorem 14.7 By adding (14.65) and (14.76), we have (14.84) after the sequence of manipulations. The theorem is proved. wox\yz|{y,}~|~,~ R R and d e N d . The inequality (14.62) is a non-Shannon-type inequality, M H the point i«ì º ¿«É f R , where ì ì ì ì ¿ Dr «ì £ º ¿ ¿ « º ¿ ì S « S ¿ º ¿ ì R « R º ¿ D ´ «ì S º ¿ g ì R ´ « ¿ º ì S8R « ¿ Dº ´ (14.85) «ì £ S º ¿ « ì £ º R ¿ « ì º S8 R ¿ ´ ì S8R ¿ ì £ S8R ¿ B « º « º « º « º « º g ì ¿ The set-theoretic structure of i«º is illustrated by the information diagram in Figure 14.4.ì The reader should check that this information diagram correctly ì ¿ ¿ represents iðº as defined. It is also easy to check from this diagram that i£º satisfies ì ¿É allR the elemental inequalities for four random variables, and therefore iðº ì ¿ d . However, upon substituting the corresponding values in (14.62) for i«º with the help of Figure 14.4, we have r $ H _ _ H _ H ´ (14.86) ì ¿ which is a contradiction because M H . In other words, i«º does not satisfy (14.62). Equivalently, ì i«º ¿ É N ²Gi É f R u i satisfies (14.62) ¸ B (14.87) Proof Consider for any D R A F T September 13, 2001, 6:27pm D R A F T 313 Beyond Shannon-Type Inequalities X2 0 a a 0 a a a 0 X1 a 0 a a 0 X3 0 ÿ 0 X4 Figure 14.4. The set-theoretic structure of Since ì i£º «¿ É d R , we conclude that d R Ë N ²Gi É f R u i á ã. satisfies (14.62) ¸þ´ (14.88) i.e., (14.62) is not implied by the basic inequalities for four random variables. Hence, (14.62) is a non-Shannon-type inequality. Since (14.62) is satisfied by all entropy functions for four random variables, we have satisfies (14.62) (14.89) d Re Ë ²Gi É f %R u i ¸þ´ and upon taking closure on both sides, we have (14.90) d Re Ë ²Gi É f R u i satisfies (14.62)¸ B ì ¿ É Re ì ¿ÞÉ R ì ¿ É Re Then (14.87) implies i£º R R N d . Since i«º d and iðº N d , we conclude that d e N d . The theorem is proved. 
Remark We have shown in the proof of Theorem 14.11 that the inequality (14.62) cannot be proved by invoking the basic inequalities for four random variables. However, (14.62) can be proved by invoking the basic inequalities for the six random variables , and with the joint probability distribution as constructed in (14.63). ì )´ µ ´)µ S ´)µ R ´ µ ì µ µ X £ S8R ï ï The inequality (14.62) remains valid when the indices 1, 2, 3, ± and 4 are S and µ R , g r r distinct permuted. Since (14.62) is symmetrical in µ versions of (14.62) can be obtained by permuting the indices, and all these twelve inequalities are simultaneously satisfied by the entropy function of any set of random variables , and . We will denote these twelve inequalities collectively by <14.62>. Now define the region µ ´)µ )´ µ S ì d R ²Gi É d R u i D R A F T µ R satisfies <14.62> ¸B September 13, 2001, 6:27pm (14.91) D R A F T 314 A FIRST COURSE IN INFORMATION THEORY Evidently, ìR R Since both d and d ì dR R ì d Re Ë d R Ë d RB (14.92) are closed, upon taking closure, we also have ì d Re Ë d R Ë d RB (14.93) dR d Re Since <14.62> are non-Shannon-type inequalities as we have proved in the last is a proper subset of and hence a tighter outer bound on and theorem, than . In the course of proving that (14.62) is of non-Shannon-type, it was shown in the proof of Theorem 14.11 that there exists as defined in (14.85) which does not satisfy (14.62). By investigating the geometrical relation between and , we prove in the next theorem that (14.62) in fact induces non-Shannon-type constrained inequalities. Applications of a class of some of these inequalities will be discussed in Section 14.4. d Re d ì Ç ±d R i«º r ¿ R F ì i«º «¿ É d R wox\yz|{y,}~|~ The inequality (14.62) is a non-Shannon-type inequality conditioning on setting any nonempty subset of the following 14 Shannon’s information measures to zero: @ +º µ A µ ¿ ´ @ º+µ A µ µ S ¿ ´ @ º+µ A µ µ R ¿ ´ @ º+µ A µ S µ R ¿ ´ @ º+µ ^ A µ R µ Sò¿ ´ @ º+µ A µ S µ R¿ ´ @ º+µ A µ R µ Sò¿ ´ @ º+µ S A µ R µ ¿ ´ (14.94) @ º+µ S A µ R µ ¿ ´ @ º+µ S A µ R µ ´)µ ¿ ´ º+µ µ ´)µ S ´)µ R¿ ´ S R¿ S R¿ R S¿CB º+µ µ ´)µ )´ µ ´ º+µ µ ´)µ ´)µ ´ º+µ µ ´)µ ´)µ ì ¿ Proof It is easy to verify from Figure 14.4 that in exactly 14 hyð i R (i.e., ) defining the boundary ofº d R lies perplanes in f which correspond ì ¿ to setting the 14 Shannon’s measures in (14.94) to zero. Therefore, iðº for Á IH define an extreme direction of d R . R ì ¿ Now for any linear subspace of f containing iðº , where M H , we have ì ¿«É R (14.95) i«º d ì ¿ and i«º does not satisfy (14.62). Therefore, ºd R ¿ Ë N ²Gi É f R u i satisfies (14.62)¸ B (14.96) This means that (14.62) is a non-Shannon-type inequality under the constraint . From the above, we see that can be taken to be the intersection of any nonempty subset of the 14 hyperplanes containing . Thus (14.62) is a non-Shannon-type inequality conditioning on any nonempty subset of the 14 Shannon’s measures in (14.94) being equal to zero. Hence, (14.62) induces a ì i£º ¿ D R A F T September 13, 2001, 6:27pm D R A F T R Ç ± r class of 315 Beyond Shannon-Type Inequalities non-Shannon-type constrained inequalities. The theorem is proved. Remark It is not true that the inequality (14.62) is of non-Shannon-type under any constraint. 
Suppose we impose the constraint @ º+µ S A µ R¿ H B (14.97) Then the left hand side of (14.62) becomes zero, and the inequality is trivially implied by the basic inequalities because only mutual informations with positive coefficients appear on the right hand side. Then (14.62) becomes a Shannon-type inequality under the constraint in (14.97). 14.3 A NON-SHANNON-TYPE CONSTRAINED INEQUALITY d Re N d R R d e r R Ç[d ± Re In the last section, we proved a non-Shannon-type unconstrained inequality . This inequality induces for four random variables which implies which is a tighter outer bound on and then . We fura region ther showed that this inequality induces a class of non-Shannon-type constrained inequalities for four random variables. In this section, we prove a non-Shannon-type constrained inequality for four random variables. Unlike the non-Shannon-type unconstrained inequality we proved in the last section, this constrained inequality is not strong enough to . However, the latter is not implied by the former. imply that d ìR dR d Re N d R ±®y,} }a²³~|~´ Let X º+Y ´)Y ´)Y S ´)Y R¿ be any probability distribution. Then òFó öô >8õ ô ó õ ô ñ ó öô ¡ õ ô õ ô ñ S R:¿ M H ìX º+Y ´)Y ´)Y S ´)Y R¿ if X º+Y ´)Y ¼ô õ ô ñ (14.98) S R:¿ H if X º+Y ´)Y H is also a probability distribution. Moreover, and for all X ì º+Y ´)Y S ´)Y Rò¿ X º+Y )´ Y S ´)Y R¿ (14.99) X ì º+Y ´)Y S ´)Y Rò¿ X º+Y )´ Y S ´)Y R¿ (14.100) Y ´)Y ´)Y S , and Y R . Proof The proof for the first part of the lemma is straightforward (see Problem 4 in Chapter 2). The details are omitted here. To prove the second part of the lemma, it suffices to prove (14.99) for all , and because is symmetrical in and . We first Y ´)Y S YR D R A F T »X ì º+Y ´)Y ´)Y S ´)Y R¿ Y September 13, 2001, 6:27pm Y D R A F T 316 A FIRST COURSE IN INFORMATION THEORY Y , Y S , and Y R such that X º+Y S ´)Y R¿ M H . From (14.98), we have ì S R¿ X»ì º+Y )´ Y S ´)Y R¿ X º+Y ´)Y ´)Y ´)Y (14.101) ô ¡ S Rò¿ S R¿ X º+Y ´)Y ´)Y S X º+Y R¿ ´)Y ´)Y (14.102) ô ¡ S RX»¿ º+Y ´)Y X º+Y ´)S Y ´)RòY ¿ X»º+Y ´)Y S ´)Y R ¿ (14.103) X º+Y ´)Y ¡ ô R¿ S R¿ X º+Y ´)S Y S ´)RY (14.104) ¿ X º+Y ´)Y X + º Y ) ´ Y X º+Y ´)Y S ´)Y R ¿CB (14.105) S R S R ¿ For Y ´)Y , and Y such$ that X º+Y ´)Y $ H , weS have S R ¿ (14.106) H X º+Y ´)Y ´)Y X º+Y ´)Y R ¿ HU´ which implies (14.107) X º+Y ´)Y S ´)Y R ¿ H B consider ì S R¿ X ì º+Y ´)Y S ´)Y Rò¿ ~X º+Y ´)Y ´)Y ´)Y (14.108) ¡ ô H (14.109) ¡ ô H (14.110) X º+Y ´)Y S ´)Y R¿CB (14.111) S R Thus we have proved (14.99) for all Y ´)Y , and Y , and the lemma is proved. wox\yz|{y,}~|~ For any four random variables µ ´)µ ´)µ S , and µ R , if @ º+µ A µ ¿ ×@ º+µ ^A µ ? µ Sò¿ HU´ (14.112) then @ º+µ S A µ R¿ $ @ º+µ S A µ R µ ¿\_ @ º+µ S A µ R µ ¿CB (14.113) Therefore, from (14.98), we have @ º+µ 1S A µ R ¿ Ç @ º+µ SA µ R¥ µ ¿ Ç @ º+µ SA µ R? µ ¿ 4 ñ > 4 > 4 ñ 4 4 ¡ 4 4 ¡ 4 4 ñ ó > õ õ õ ¡ ñ ñ > ¡ > ñ ¡ ñ > 4 4 ¡ 4 4 4 4 ñ ¼ô ô ô ô > ¡ ñ ó ¦¨§© X º+µ S S¿ ´)µ R R ¿ X ¿ º+µ ´)µ¿ Sò¿ X» º+µ¿ ´)µ R¿ X S º+µ R ´)µ¿ Sò¿ X º+µ S ´)µ RR¿ ¿ ´ X»º+µ X»º+µ X º+µ X º+µ X º+µ ´)µ ´)µ X º+µ ´)µ ´)µ Proof Consider (14.114) D R A F T September 13, 2001, 6:27pm D R A F T 317 Beyond Shannon-Type Inequalities ó X º+Y ´)Y ´)Y S ´)Y R¿ . ó ï ¨¦ §© X»º+µ SòS¿ ´)µ RR¿ X ¿ º+µ ´)µ¿ Sò¿ X º+µ¿ )´ µ R ¿ X S º+µ R´)µ¿ Sò¿ X» º+µ S ´)µ RR¿ ¿ ´ X º+µ X º+µ X º+µ X»º+µ X º+µ )´ µ ´)µ X º+µ ´)µ ´)µ (14.115) S ò R ¿ ì is defined in (14.98). where X~º+Y ´)Y ´)Y ´)Y Toward proving that the claim is correct, we note that (14.115) is the sum ì of a number of expectations with respect to X . 
Let us consider one of these to denote expectation with respect to where we have used We claim that the above expectation is equal to expectations, say 4 > 4 ¡ 4 4 4 4 ñ X ì +º Y ´)Y ´)Y S ´)Y > ¡ ñ ì S Note that in the above summation, if X»º+Y ´)Y ´)Y ´)Y we see that X º+Y ´)Y S ´)Y R¿ M HU´ and hence X»º+Y ´)Y Sò¿ M H B ó ï ¨¦ §©/X º+µ ´)µ Sò¿ R¿ ¦5§©\X º+Y )´ Y S:¿CB (14.116) R¿ M H , then from (14.98), (14.117) (14.118) Therefore, the summation in (14.116) is always well-defined. Further, it can be written as ´)Y Sò¿ ì º+Y ´)Y ´)Y S ´)Y R¿ ¦ ¨ § \ © X + º Y X " ! ï ó ¡ > õ ¡ õ õ ñ ô )> õ ô õ ô ñ ìX»º+Y ´)Y ô S ´)Y ¼Rô ¿ ¦¨ô §©\ô X»º+ô Y ´)Y Sò¿CB (14.119) >ô 8õ ô õ ô ñ S ¿ depends on X ì º+Y ´)Y ´)Y S ´)Y R ¿ only through X ì º+Y ´)Y S ´)Y R ¿ , ï Thus ó ¦¨§©X º+µ ´)µ S R¿ which by Lemma 14.13 is equal to X º+Y ´)Y ´)Y . It then follows that ó ï ¦¨§©\X º+µ ´)µ S¿ X ì º+Y ´)Y S ´)Y R¿ ¦¨§©/X º+Y ´)Y Sò¿ (14.120) >9õ ô õ ô ñ ô X º+Y ´)Y S ´)Y R¿ ¦¨§©/X º+Y ´)Y Sò¿ (14.121) >9õ ô õ ô ñ ô ó ¦¨§©X»º+µ ´)µ Sò¿CB (14.122) ò S ¿ In other words, the expectation on ¦¨§©/X º+µ ´)µ can be taken with respect S R ¿ S R ¿ ì or X º+Y ´)Y ´)Y ´)Y without affecting its value. By to either X»º+Y ´)Y ´)Y ´)Y observing that all the marginals of X in the logarithm in (14.115) involve only S R S R subsets of either ²^µ ´)µ ´)µ ¸ or ²^µ ´)µ ´)µ ¸ , we see that similar conclusions can be drawn for all the other expectations in (14.115), and hence the claim is proved. D R A F T September 13, 2001, 6:27pm D R A F T 318 A FIRST COURSE IN INFORMATION THEORY @ º+µ S A µ R ¿ Ç @ º+µ S A µ R µ ¿ Ç @ º+µ S A µ R µ ¿ ó ï ¨¦ §© X º+µ S4S¿ ´)µ R R ¿ X ¿ º+µ ´)µ¿ Sò¿ X» º+µ¿ ´)µ R¿ X S º+µ R ´)µ¿ Sò¿ X º+µ S ´)µ RR¿ ¿ »X º+µ X»ó ï º+µ >)õ X ¡ õ º+µ õ ñ X º+µ X º+4 µ ñ ´)µ > 4 ´) µ > X 4 º+ñµ 4 4 ´)¡ µ 4 ´)µ 4 ¡ 4 4 ñ 4 4 4 öô ô ô ô ñ > ¡ > ñ ¡ ñ > > 4 ¡ ¡ 4 4 ññ Ç 4 4 4 X~ì º+Y ´)Y ´)Y S ´)Y R ¿ ¦¨§© X~ì$ º+Y ´)Y ´)Y SS ´)Y RòRò¿¿q´ (14.123) > X~º+Y ´)Y ´)Y ´)Y > 4 ¡ ¡ 4 4 ñ ñ# Thus the claim implies that where ~X $ º+Y ò)´ Y ó ´)Y S ´)Y R:ó ¿ ó ö ô >8õ ô ó öô >öô ó >)¼õ ôô ñ¡ ó ¼¼ôô ¡ õ ôó ¼ô ó ñ ö ô ¡ õ ô ñ H X º+Y ¿ ´ X»º+Y ¿ ´ X º+Y òS ¿ ´ X º+Y R ¿ M H if otherwise. (14.124) X ì º+Y ´)Y X º+Y ´)Y Sò¿ ´X Y ´)Y ´)Y S , and Y R are ´)Y S )´ Y R ¿ M H º+Y ´)Y R¿ ´X º+Y ´)Y òS ¿ ´ X º+Y )´ Y R ¿ ´ X º+Y ¿ ´ X º+Y ¿ ´ X»º+Y Sò¿ ´X º+Y R ¿ $ S R(14.125) ¿ M H. are all strictly positive, and we see from (14.124) that$ X~º+ Y ´) Y ´)Y ´)Y S R ¿ To complete the proof, we only need to show that X º+Y ´)Y ´)Y ´)Y is a prob- The equality in (14.123) is justified by observing that if such that , then ability distribution. Once this is proven, the conclusion of the theorem follows immediately because the summation in (14.123), which is identified as the $ and , is always nonnegdivergence between ative by the divergence inequality (Theorem 2.30). 
Toward this end, we notice that for , and such that , X ì º+Y ´)Y )´ Y S ´)Y R ¿ X º+Y ´)Y ´)Y S ´)Y R ¿ YS X º+Y S:¿ M H ´)Y Sò¿ X º+Y ´)Y Sò¿ ò S ¿ X + º Y X»º+Y ´)Y ´)Y X º+Y S ¿ Y ´)Y by the assumption and for all Y Y , and by the assumption D R A F T (14.126) @ º+µ ^ A µ µ òS ¿ HU´ (14.127) X º+Y )´ Y ¿ X º+Y ¿ X º+Y ¿ (14.128) @ º+µ ^ A µ ¿ H B (14.129) September 13, 2001, 6:27pm D R A F T 319 Beyond Shannon-Type Inequalities Then $ S R¿ X » >ô 8õ ô ¡ õ ô õ ô ñ º+Y ´)Y ´)Y ´)Y 4 > 4 ¡ 4 4 4 4 ñ% X $ º+Y ´)Y ´)Y S ´)Y R ¿ & > ¡ ñ# X º+Y ´)Y S ¿ X º+ Y ¿ ´)Y R ¿¿ X º+Y Sò¿´)Y S ¿ X R¿ º+Y ´)Y R ¿ > 4 4 4 X º+Y X º+Y X º+Y X»º+Y > 4 ¡ ¡ 4 4ñ% ñ# ù X º+Y ´)Y ´)Y S: ¿¿ X º+Y ¿´)Y R¿ X R¿ º+Y ´)Y R¿ > 4 4 4 X º+Y X º+Y X º+Y > 4 ¡ ¡ 4 4%ñ ñ # ü X º+Y ´)Y ´)Y S ¿ X º+Y ¿ ´)Y R R¿ X ¿ º+Y ´)Y R ¿ > 4 4 4 X»º+Y ´)Y X º+Y > 4 ¡ ¡ 4 4ñ% ñ # S Y ´)Y ¿ »º+Y ´)Y Rò¿ X Ròº+¿ Y ´)Y R¿ X X + º Y ( ! 4 > 4 ¡ 4 4 ñ ó X º+Y > ¡ ñ ' ô ¼ô R R ¿ ¿ X»º+Y ´)Y X Ròº+¿ Y ´)Y 4 > 4 ¡ 4 4 ñ X º+Y > ¡ ñ ' G 4 ¡ 4 ñ X º+Y ´)Y R¿ > ó > ( ! X º+Y Y R¿ ¡ ñ ô öô ) X º+Y ´)Y R¿ ¡ 4 4 ñ ¡ ñ * ± ´ Y X º+Y ¿ H ¿ X º+Y R Y ¿ B G R ¿ X + º Y X º+Y Y H X º+Y R ¿ Therefore G Y R¿ X»º+Y G Y R¿ ± B X + º Y ( ! ô> ô > ó ¼ô > R Finally, the equality in d) is justified as follows. For Y and Y R ¿ : R ¿ or X º+Y vanishes, X º+Y ´)Y must vanish because $ $ H X º+Y ´)Y R¿ X»º+Y ¿ (14.130) (14.131) (14.132) (14.133) (14.134) (14.135) (14.136) (14.137) (14.138) where a) and b) follows from (14.126) and (14.128), respectively. The equality in c) is justified as follows. For such that , D R A F T September 13, 2001, 6:27pm (14.139) (14.140) such that X º+Y ¿ (14.141) D R A F T 320 A FIRST COURSE IN INFORMATION THEORY and H Therefore, $ $ R:¿CB R ¿ X º+Y ´)Y X º+Y (14.142) )´ Y R¿ X º+Y )´ Y Rò¿ ± B X + º Y 4 ô ¡ õô ñ ¡ ¡ 4 ñ%ñ (14.143) The theorem is proved. wox\yz|{y,}~|~Á The constrained inequality in Theorem 14.14 is a nonShannon-type inequality. ì i«º ¿ôÉ f R Proof The theorem can be proved by considering the point for as in the proof of Theorem 14.11. The details are left as an exercise. M H The constrained inequality in Theorem 14.14 has the following geometrical interpretation. The constraints in (14.112) correspond to the intersection of which define the boundary of . Then the inequaltwo hyperplanes in ity (14.62) says that a certain region on the boundary of is not in . It can further be proved by computation1 that the constrained inequality in Theorem 14.14 is not implied by the twelve distinct versions of the unconstrained inequality in Theorem 14.7 (i.e., <14.62>) together with the basic inequalities. We have proved in the last section that the non-Shannon-type inequality constrained non-Shannon-type inequalities. (14.62) implies a class of We end this section by proving a similar result for the non-Shannon-type constrained inequality in Theorem 14.14. f R dR R d d Re r R Ç[± wox\yz|{y,}~|~Ú The inequality @ º+µ S A µ R ¿ $ @ º+µ S A µ R µ ¿_ @ º+µ S A µ R µ ¿ @ º+µ A µ µ S ¿ (14.144) @ º+µ ^ A µ ¿ is a non-Shannon-type inequality conditioning on setting both and and any subset of the following 12 Shannon’s information measures to zero: @ +º µ ^A µ µ R¿ ´ @ @ º+µ A µ S µ R¿ ´ @ @ º+µ SA µ ?R µ ¿ ´ @ S º+µ µ ´)µ )´ µ 1 Ying-On º+µ AA µ RS µ +º µ SA µ R? µ º+Rµ ¿ µ µ S ´ º+µ R¿ ´ @ º+µ A µ R µ Sò¿ ´ Sò¿ ´ @ º+µ S A µ R µ ¿ ´ ´)µ ¿ ´ º+µ µ ´)µ µ ´)µ ´)µ R¿ ´ º+µ R S ´)µ R ¿ ´ µ ´)µ ´)µ S¿CB (14.145) Yan, private communication. 
D R A F T September 13, 2001, 6:27pm D R A F T 321 Beyond Shannon-Type Inequalities @ º+µ A µ ¿ @ º+µ A µ µ S ¿ Proof The proof of this theorem is very similar to the proof of Theorem 14.12. and together with the 12 Shannon’s We first note that information measures in (14.145) are exactly the 14 Shannon’s information measures in (14.94). We have already shown in the proof of Theorem 14.12 that (cf. Figure 14.4) lies in exactly 14 hyperplanes defining the boundary which correspond to setting these 14 Shannon’s information measures to of zero. We also have shown that for define an extreme direction of . Denote by the intersection of the two hyperplanes in which correand to zero. Since for any spond to setting satisfies ,+ (14.146) d ì iR𺠿 ì i𺠿 Á H @ ^A ¿ @ GA Sò¿ f R ì ¿ º+µ µ º+µ µ µ i«º M H @ º+µ ^A µ ¿ ×@ º+µ ^A µ µ òS ¿ H ì ì i𺠿 is in . Now for any linear subspace of f R containing i𺠿 such that Ë , we have ì i𺠿«É d R B ì ¿ (14.147) Upon substituting the corresponding values in (14.113) for i«º with the help $ _ of Figure 14.4, we have (14.148) H M H ,H + ì ¿ which is a contradiction because H . Therefore, iðº does not satisfy (14.113). Therefore, ºd R ¿ . (14.149) Ë N - i É f R u i satisfies (14.113) / B dR This means that (14.113) is a non-Shannon-type inequality under the constraint . From the above, we see that can be taken to be the intersection of and any subset of the 12 hyperplanes which correspond to setting the 12 Shannon’s information measures in (14.145) to zero. Hence, (14.113) is a non-Shannontype inequality conditioning on , , and any subset of the 12 Shannon’s information measures in (14.145) being equal to zero. In other words, the constrained inequality in Theorem 14.14 in fact induces a class constrained non-Shannon-type inequalities. The theorem is proved. of @ º+µ ^ A µ ¿ @ º+µ ^ A µ µ Sò¿ r £ 14.4 APPLICATIONS As we have mentioned in Chapter 12, information inequalities are the laws of information theory. In this section, we give several applications of the nonShannon-type inequalities we have proved in this chapter in probability theory and information theory. An application of the unconstrained inequality proved in Section 14.2 in group theory will be discussed in Chapter 16. 021\²O}23,º?y ~|~?ê For the constrained inequality in Theorem 14.14, if we further impose the constraints @ º+µ S A µ R µ ¿ × @ +º µ S A µ R µ ¿ H,+ D R A F T September 13, 2001, 6:27pm (14.150) D R A F T 322 A FIRST COURSE IN INFORMATION THEORY then the right hand side of (14.113) becomes zero. This implies because @ º+µ S A µ R¿ @ º+µ S A µ R¿ H (14.151) is nonnegative. This means that µ PP µ S µ S P µ R µ µ S P µ R1 µ µ µ µ 4 556 7 55 V µ S P µ RB (14.152) We leave it as an exercise for the reader to show that this implication cannot be deduced from the basic inequalities. 021\²O}23,º?y ~|~÷ If we impose the constraints @ +º µ ^ A µ ¿ ×@ +º µ ^A µ S +)µ R¿ ×@ º+µ S A µ R µ ¿ × @ +º µ S A µ R µ ¿ H,+ (14.153) then the right hand side of (14.62) becomes zero, which implies µ Q PP µ S P µ S P µ This means that @ º+µ S A µ R ¿ H B µ S R¿ 4 556 S P RB º+µ +)µ µ RR µ 7 55 V µ µ µ µ (14.154) (14.155) Note that (14.152) and (14.155) differ only in the second constraint. Again, we leave it as an exercise for the reader to show that this implication cannot be deduced from the basic inequalities. 
021\²O}23,º?y ~|~ø µ )µ )µ S )µ R $ ½ ¶ ; :¿ H,+=< º+µ µ +98 N Consider a fault-tolerant data storage system consisting of random variables such that any three random variables can + + + recover the remaining one, i.e., : +98 $ B g (14.156) We are interested in the set of all entropy functions subject to these constraints, denoted by > , which characterizes the amount of joint information which can possibly be stored in such a data storage system. Let -i É f R ui satisfies (14.156) / d Re B d Re (14.157) Then the set > is equal to the intersection between and , i.e., . Since each constraint in (14.156) is one of the 14 constraints specified in Theorem 14.12, we see that (14.62) is a non-Shannon-type inequality under D R A F T September 13, 2001, 6:27pm D R A F T 323 Beyond Shannon-Type Inequalities ì d R (cf. (14.91)) is a tighter outer bound Rd S 021\²O}23,º?y ~|5þý R Consider four random variables µ +)µ +)µ , and µ such S Ø Q ¿ Ø R º+µ +)µ µ forms a Markov chain. This Markov condition is that µ equivalent to @ º+µ S A µ R µ +)µ ¿ H B (14.158) the constraints in (14.156). Then on > than . It can be proved by invoking the basic inequalities (using ITIP) that @ º+µ S A µ R ¿ $ @ º+µ S A µ R µ ¿\_ @ º+µ S A µ R µ ¿/_ H B Û @ º+µ A µ ¿ _ @ º+µ ^A µ S +)µ R¿\_ º< Ç ¿ @ º+µ A µ S +)µ R¿ + (14.159) $ $ B r H B Û , and this is the best possible. where H Û Now observe that the Markov condition (14.158) is one of the 14 constraints specified in Theorem 14.12. Therefore, (14.62) is a non-Shannon-type inequality under this Markov condition. By replacing and by each other in (14.62), we obtain r@ º+µ S A µ R¿ $ _ @ º+µ S A µ µ @ +º µ ^ A µ ¿\_ R µ ¿/_ @ +º µ µ @ º+µ A µ S +)µ R¿ S A µ R µ ¿CB (14.160) Upon adding (14.62) and (14.160) and dividing by 4, we obtain @ º+µ S A µ R ¿ $ _ H Br Û @ @ º+µ S A µ R µ ¿\_ º+µ A µ S +)µ R\¿ _ H @ º+µ S A µ R µ ¿/_ H B Û @ º+µ A µ ¿ B r Û @ º+µ A µ S +)µ R ¿CB (14.161) Comparing the last two terms in (14.159) and the last two terms in (14.161), we see that (14.161) is a sharper upper bound than (14.159). The Markov chain arises in many communication + situations. As an example, consider a person listening to an audio source. Then being the sound the situation can be modeled by this Markov chain with wave generated at the source, and being the sound waves received at the two ear drums, and being the nerve impulses which eventually arrive at the brain. The inequality (14.161) gives an upper bound on which is tighter than what can be implied by the basic inequalities. There is some resemblance between the constrained inequality (14.161) and the data processing theorem, but there does not seem to be any direct relation between them. µ S Ø º+µ )µ ¿%Ø µ R µ R D R A F T µ µ µ S September 13, 2001, 6:27pm @ º+µ S A µ R ¿ D R A F T 324 A FIRST COURSE IN INFORMATION THEORY PROBLEMS 1. Verify by ITIP that the unconstrained information inequality in Theorem 14.7 is of non-Shannon-type. 2. Verify by ITIP and prove analytically that the constrained information inequality in Theorem 14.14 is of non-Shannon-type. 3. Use ITIP to verify the unconstrained information inequality in Theorem 14.7. Hint: Create two auxiliary random variables as in the proof of Theorem 14.7 and impose appropriate constraints on the random variables. 4. Verify by ITIP that the implications in Examples 14.17 and 14.18 cannot be deduced from the basic inequalities. 5. Can you show that the sets of constraints in Examples 14.17 and 14.18 are in fact different? 6. 
Let µ ¶ + : r <'+ +?+) ,  , and @ be discrete random variables. Ç @ ºÂ A @ ¿ @ º  A @ µ ½ ¿ Ç @ ºÂ A @ µ ¶ ¿ ½ $ @ ¶8A ¿/_ Ç º+µ ÂA+@ ½ +º µ ½ ¿ º+µ +)µ +B+)µ ¿CB Ýr , this inequality reduces to the unconstrained nonHint: When a) Prove that Shannon-type inequality in Theorem 14.7. b) Prove that @ ºÂ A @ $ < ¿Ç r @ ¶ @ A ½¿ ½ º  @ µ Ç ½ 8 ¶ A ¿CB / ¿ _ ¿ º+µ ÂC+@ ½ º+µ º+µ )+ µ + B+)µ ( Zhang and Yeung [223].) XS º+Y )Y R )Y S )Y R¿ @ A S ¿ @ A ?R S ¿ µ µ º+µ µ µ º+µ µ µ H µ ìµ X 7. Let + + + be the joint distribution for random variables , , , and such that , and let be defined in (14.98). a) Show that X è º+Y +)Y ò +)Y S +)ó Y R¿ ó ó ¡ õ ñ > > õ õ õ ¡ ñ ¼ ô ô ô ö ô ô ó ó >)õ ô ¡ öô ñ öô ô ö ô H D R A F T X»º+Y )Y ¿ X º+Y R ¿ M H + + if otherwise September 13, 2001, 6:27pm D R A F T 325 Beyond Shannon-Type Inequalities D Á <. S ò S ¿ S ò ¿ ì X º+Y +)Y +)Y for all Y , Y , and Y . Prove that X º+Y +)Y +)Y æ ¿ ì è By considering Eº X X ÁIH , prove that Sò¿/_ R¿_ S¿_ R¿\_ S8R¿ º+µ Sò¿_ º+µ R¿\_ º+µ £ ¿\_ º+µ S8R¿\_ º+µ S8R¿ Á º+µ º+µ º+µ º+µ º+µ + S8R¿ denotes º+µ +)µ S +)µ R¿ , etc. where º+µ defines a probability distribution for an appropriate b) c) d) Prove that under the constraints in (14.112), the inequality in (14.113) is equivalent to the inequality in c). The inequality in c) is referred to as the Ingleton inequality for entropy in the literature. For the origin of the Ingleton inequality, see Problem 9 in Chapter 16. (Mat [137].) ç`b éè HISTORICAL NOTES In 1986, Pippenger [156] asked whether there exist constraints on the entropy function other than the polymatroidal axioms, which are equivalent to the basic inequalities. He called the constraints on the entropy function the laws of information theory. The problem had been open since then until Zhang and Yeung discovered for four random variables first the constrained nonShannon-type inequality in Theorem 14.14 [222] and then the unconstrained non-Shannon-type inequality in Theorem 14.7 [223]. Yeung and Zhang [221] have subsequently shown that each of the inequalities reported in [222] and [223] implies a class of non-Shannon-type inequalities, and they have applied some of these inequalities in information theory problems. The existence of these inequalities implies that there are laws in information theory beyond those laid down by Shannon [173]. Meanwhile, Mat and Studen [135][138][136] had been studying the structure of conditional independence (which subsumes the implication problem) of random variables. Mat [137] finally settled the problem for four random variables by means of a constrained non-Shannon-type inequality which is a variation of the inequality reported in [222]. The non-Shannon-type inequalities that have been discovered induce outer bounds on the region which are tighter than . Mat and Studen [138] showed that an entropy function in is entropic if it satisfies the Ingleton inequality (see Problem 9 in Chapter 16). This gives an inner bound on . A more explicit proof of this inner bound can be found in Zhang and Yeung [223], where they showed that this bound is not tight. Along a related direction, Hammer et al. [84] have shown that all linear inequalities which always hold for Kolmogorov complexity also always hold for entropy, and vice versa. 
ç]b éè ç]b éè d Re D R A F T cb dR dR ç`b éè September 13, 2001, 6:27pm cb d|Re D R A F T Chapter 15 MULTI-SOURCE NETWORK CODING In Chapter 11, we have discussed the single-source network coding problem in which an information source is multicast in a point-to-point communication network. The maximum rate at which information can be multicast has a simple characterization in terms of the maximum flows in the graph representing the network. In this chapter, we consider the more general multi-source network coding problem in which more than one mutually independent information sources are generated at possibly different nodes, and each of the information sources is multicast to a specific set of nodes. We continue to assume that the point-to-point communication channels in the network are free of error. The achievable information rate region for a multi-source network coding problem, which will be formally defined in Section 15.3, refers to the set of all possible rates at which multiple information sources can be multicast simultaneously on a network. In a single-source network coding problem, we are interested in characterizing the maximum rate at which information can be multicast from the source node to all the sink nodes. In a multi-source network coding problem, we are interested in characterizing the achievable information rate region. Multi-source network coding turns out not to be a simple extension of singlesource network coding. This will become clear after we have discussed two characteristics of multi-source network coding in the next section. Unlike the single-source network coding problem which has an explicit solution, the multi-source network coding problem has not been completely solved. The best characterizations of the achievable information rate region of the latter problem (for acyclic networks) which have been obtained so far make use of the tools we have developed for information inequalities in Chapter 12 to Chapter 14. This will be explained in the subsequent sections. D R A F T September 13, 2001, 6:27pm D R A F T 328 A FIRST COURSE IN INFORMATION THEORY H H X 1 XI 2 J 1 [ X 1] F 2 F [ XI 2] 3 2 G 1 JK b3 4 H H [ X 1 X 2] 15.1 3 G b4 4 (b) (a) Figure 15.1. b2 b1 A network which achieves the max-flow bound. TWO CHARACTERISTICS In this section, we discuss two characteristics of multi-source networking coding which differentiate it from single-source network coding. In the following discussion, the unit of information is the bit. 15.1.1 THE MAX-FLOW BOUNDS The max-flow bound, which fully characterizes the maximum rate at which information can be multicast, plays a central role in single-source network coding. We now revisit this bound in the context of multi-source network coding. Consider the graph in Figure 15.1(a). The capacity of each edge is equal and with rates L and L , to 1. Two independent information sources respectively are generated at node 1. Suppose we want to multicast to nodes 2 and 4 and multicast to nodes 3 and 4. In the figure, an information source in square brackets is one which is to be received at that node. It is easy to see that the values of a max-flow from node 1 to node 2, from node 1 to node 3, and from node 1 to node 4 are respectively 1, 1, and 2. At node 2 and node 3, information is received at rates L and L , respectively. At node 4, information is received at rate L L because and are independent. Applying the max-flow bound at nodes 2, 3, and 4, we have µ µ _ L L and º ¿ L $$ µ µ µ µ < (15.1) (15.2) < _ L $ r + (15.3) respectively. 
We refer to (15.1) to (15.3) as the max-flow bounds. Figure 15.2 is an illustration of all ML +L which satisfy these bounds, where L and L are obviously nonnegative. D R A F T September 13, 2001, 6:27pm D R A F T 329 Multi-Source Network Coding 2 (0,2) (1,1) (2,0) 1 Figure 15.2. The max-flow bounds for the network in Figure 15.1. µ à º ¿ µ à à We now show that the rate pair <'+N< is achievable. Let be a bit generated and be a bit generated by . In the scheme in Figure 15.1(b), by is received at node 2, is received at node 3, and both and are received at node 4. Thus the multicast requirements are satisfied, and the information which satisfy the rate pair <'+N< is achievable. This implies that all ML +L max-flow bounds are achievable because they are all inferior to <'+N< (see Figure. 15.2). In this sense, we say that the max-flow bounds are achievable. Suppose we now want to multicast to nodes 2, 3, and 4 and multicast to node 4 as illustrated in Figure 15.3. Applying the max-flow bound at either node 2 or node 3 gives (15.4) L <'+ º ¿ à à º ¿ µ $ à º _ L $ rB ¿ µ and applying the max-flow bound at node 4 gives L (15.5) X1X2 1 [X1] 2 3 O [X1] 4 [X1X2] Figure 15.3. A network which does not achieve the max-flow bounds. D R A F T September 13, 2001, 6:27pm D R A F T 330 A FIRST COURSE IN INFORMATION THEORY º ¿ º µ à à ¿ which satisfy these bounds. Figure 15.4 is an illustration of all ML +L We now show that the information rate pair <'+N< is not achievable. Suppose we need to send a bit generated by to nodes 2, 3, and 4 and send a bit generated by to node 4. Since has to be recovered at node 2, the bit sent to node 2 must be an invertible transformation of . This implies that the bit sent to node 2 cannot not depend on . Similarly, the bit sent to node 3 also cannot depend on . Therefore, it is impossible for node 4 to recover because both the bits received at nodes 2 and 3 do not depend on . Thus the information rate pair <'+N< is not achievable, which implies that the max-flow bounds (15.4) and (15.5) are not achievable. From the last example, we see that the max-flow bounds do not always fully characterize the achievable information rate region. Nevertheless, the maxflow bounds always give an outer bound on the achievable information rate region. à à µ à º 15.1.2 à à à ¿ SUPERPOSITION CODING Consider a point-to-point communication system represented by the graph in Figure 15.5(a), where node 1 is the transmitting point and node 2 is the receiving point. The capacity of channel (1,2) is equal to 1. Let two independent and , whose rates are L and L , respectively, be information sources and are received at node 2. generated at node 1. It is required that both We first consider coding the information sources and individually. We will refer to such a coding method as superposition coding. To do so, we decompose the graph in Figure 15.5(a) into the two graphs in Figures 15.5(b) and (c). In Figure 15.5(b), is generated at node 1 and received at node 2. In Figure 15.5(c), is generated at node 1 and received at node 2. Let P RQ be µ µ µ µ µ µ µ µ £ 2 (0,2) (1,1) (2,0) 1 Figure 15.4. The max-flow bounds for the network in Figure 15.3. D R A F T September 13, 2001, 6:27pm D R A F T 331 Multi-Source Network Coding T T T X1X2 X1 XV 2 1 1 1 W W (a) S W (b) (c) S 2 2 T T U T U [X1X2] 2 T U [X1] [X2] Figure 15.5. A network for which superposition coding is optimal. µ the bit rate on channel (1,2) for transmitting £P ÊZ Y L and Y P\`[^]_ ] which imply L ] Q,X r <'+ . 
Then (15.6) + (15.7) ` Y P ` [ _ P `[^]_ L ` L ]ba ] a ]dc (15.8) ` Pe`[ _ \P `[^]_ < ]ba ] f c g (15.9) On the other hand, from the rate constraint for edge (1,2), we have From (15.8) and (15.9), we obtain L ` a L ] f < (15.10) c µ µ Next, we consider coding the information sources ` and jointly. To are independent, the total rate at which] information this end, since ` and ] L . From the rate constraint for channel is received at node 2 is equal to L ` a ] (1,2), we again obtain (15.10). Therefore, we conclude that when ` and are independent, superposition coding is always optimal. The achievable ] information rate region is illustrated in Figure 15.6. The above example is a degenerate example of multi-source networking coding because there are only two nodes in the network. Nevertheless, one may hope that superposition coding is optimal for all multi-source networking coding problems, so that any multi-source network coding problem can be decomposed into single-source network coding problems which can be solved individually. Unfortunately, this is not the case, as we now explain. Consider the graph in Figure 15.7. The capacities of all the edges are equal µ µ µ µ D R A F T September 13, 2001, 6:27pm D R A F T 332 A FIRST COURSE IN INFORMATION THEORY 2 (0,1) (1,0) 1 Figure 15.6. The achievable information rate region for the network in Figure 15.5(a). to 1. We want to multicast h ` to nodes 2, 5, 6, and 7, and multicast h to ] nodes 5, 6, and 7. We first consider coding h ` and h individually. Let ] i M`^[ jk _ Yml (15.11) be the bit rate on edge no'p9qer for the transmission of h , where qtsvu\p%w\px and Xyszo'p%u . Then the rate constraints on edges no'p%u{r , no'p%w{j r , and no'pxer imply ` i ` [ _ i ` R[ ]_ ` a ] i `9[ ]| } _ i `9[R]} _ ` i `[ ~ _ a i `[R]~ _ a o (15.12) o (15.13) f f f o (15.14) c X1X2 1 [X1] 2 5 [X1X2] 3 4 6 7 [X1X2] [X1X2] Figure 15.7. A network for which superposition coding is not optimal. D R A F T September 13, 2001, 6:27pm D R A F T 333 Multi-Source Network Coding X X2 1 1 r 12(1) r 13(1) 5 [X1] r 12(2) 3 6 [X1] [X1] [X1] 2 1 r 14(1) 3 5 6 [X2] [X2] [X 2] r 14(2) 2 4 7 r 13(2) (a) 4 7 (b) Figure 15.8. The individual multicast requirements for the network in Figure 15.7. Together with (15.11), we see that l f i M` R[ jRk _ f o (15.15) for all qs.u\p%w\px and Xyszo'p%u . We now decompose the graph in Figure 15.7 into the two graphs in Figures 15.8(a) and (b) which show the individual multicast requirements for h ` and h , respectively. In these two graphs, the edges (1,2), (1,3), and (1,4) are labeled] by the bit rates for transmitting the corresponding information source. We now show by contradiction that the information rate pair (1,1) is not achievable by coding h ` and h individually. Assume that the contrary is true, ] is achievable. Referring to Figure 15.8(a), i.e., the information rate pair (1,1) ` since node 2 receives h at rate 1, ` i `[ _ Y o c ] (15.16) At node 7, since h ` is received via nodes 3 and 4, ` i 9`[ } _ a ` i `[ ~ _ Y o c (15.17) Similar considerations at nodes 5 and 6 in Figure 15.8(b) give and i ` [R]_ i `9R[ ]} _ Y o ] a (15.18) i `[^]_ i `^[ ]~ _ Y o c ]ba (15.19) ` i `[ _ so c ] (15.20) Now (15.16) and (15.15) implies D R A F T September 13, 2001, 6:27pm D R A F T 334 A FIRST COURSE IN INFORMATION THEORY 1 b2 b1 2 b1 b1 b2 5 Figure 15.9. b1+b2 b1+b2 3 4 b1+b2 b2 6 7 A coding scheme for the network in Figure 15.7. 
With (15.12) and (15.15), this implies i `[^]_ s l c ] (15.21) From (15.21), (15.18), and (15.15), we have i `9[^]} _ so c (15.22) c (15.23) With (15.13) and (15.15), this implies ` i `9[ } _ s l It then follows from (15.23), (15.17), and (15.15) that ` i `[ ~ _ so (15.24) c Together with (15.14) and (15.15), this implies i `^[ ]~ _ s l c (15.25) However, upon adding (15.21) and (15.25), we obtain i `[^]_ i `^[ ]~ _ s l p ]ba (15.26) which is a contradiction to (15.19). Thus we have shown that the information rate pair (1,1) cannot be achieved by coding h ` and h individually. ] However, the information rate pair (1,1) can actually be achieved by coding h ` and h jointly. Figure 15.9 shows such a scheme, where ` is a bit gen] erated by h ` and is a bit generated by h . Note that at node 4, the bits ` ] ] and , which are generated by two different information sources, are jointly ] coded. Hence, we conclude that superposition coding is not necessarily optimal in multi-source network coding. In other words, in certain problems, optimality can be achieved only by coding the information sources jointly. D R A F T September 13, 2001, 6:27pm D R A F T 335 Multi-Source Network Coding X X X 1 2 3 [X1] [X1] Level 1 [X1] [X1X2] [X1X2] [X1X2] [X1X2X3] Level 2 Level 3 Figure 15.10. A 3-level diversity coding system. 15.2 EXAMPLES OF APPLICATION Multi-source network coding is a very rich model which encompasses many communication situations arising from fault-tolerant network communication, disk array, satellite communication, etc. In this section, we discuss some applications of the model. 15.2.1 Let h ` p h MULTILEVEL DIVERSITY CODING p?ph be information sources in decreasing order of im] portance. These information sources are encoded into pieces of information. There are a number of users, each of them having access to a certain subset of the information pieces. Each user belongs to a level between 1 and , where a Level user can decode h ` ph pBpht . This model, called multilevel diversity coding, finds applications ] in fault-tolerant network communication, disk array, and distributed data retrieval. Figure 15.10 shows a graph which represents a 3-level diversity coding system. The graph consists of three layers of nodes. The top layer consists of a node at which information sources h ` , h , and h } are generated. These ] information sources are encoded into three pieces, each of which is stored in a distinct node in the middle layer. The nodes in the bottom layer represent the users, each of them belonging to one of the three levels. Each of the three Level 1 users has access to a distinct node in the middle layer and decodes h ` . Each of the three Level 2 users has access to a distinct set of two nodes in the middle layer and decodes h ` and h . There is only one Level 3 user, who has ] access to all the three nodes in the middle layer and decodes h ` , h , and h } . ] D R A F T September 13, 2001, 6:27pm D R A F T 336 A FIRST COURSE IN INFORMATION THEORY The model represented by the graph in Figure 15.10 is called symmetrical 3level diversity coding because the model is unchanged by permuting the nodes in the middle layer. By degenerating information sources h ` and h } , the model is reduced to the diversity coding model discussed in Section 11.2. 
In the following, we describe two applications of symmetrical multilevel diversity coding: Fault-Tolerant Network Communication In a computer network, a data packet can be lost due to buffer overflow, false routing, breakdown of communication links, etc. Suppose the packet carries messages, h ` ph pNph , ] in decreasing order of importance. For improved reliability, the packet is encoded into sub-packets, each of which is sent over a different channel. If any sub-packets are received, then the messages h ` ph pBpht can be re] covered. Disk Array Consider a disk array which consists of disks. The data to be stored in the disk array are segmented into pieces, h ` ph pNph , in ] decreasing order of importance. Then h ` ph pBph are encoded into ] pieces, each of which is stored on a separate disk. When any out of the disks are functioning, the data h ` ph pNpht can be recovered. ] 15.2.2 SATELLITE COMMUNICATION NETWORK In a satellite communication network, a user is at any time covered by one or more satellites. A user can be a transmitter, a receiver, or both. Through the satellite network, each information source generated at a transmitter is multicast to a certain set of receivers. A transmitter can transmit to all the satellites within the line of sight, while a receiver can receive from all the satellites within the line of sight. Neighboring satellites may also communicate with each other. Figure 15.11 is an illustration of a satellite communication network. The satellite communication network in Figure 15.11 can be represented by the graph in Figure 15.12 which consists of three layers of nodes. The top layer represents the transmitters, the middle layer represents the satellites, and the bottom layer represents the receivers. If a satellite is within the line-of-sight of a transmitter, then the corresponding pair of nodes are connected by a directed edge. Likewise, if a receiver is within the line-of-sight of a satellite, then the corresponding pair nodes are connected by a directed edge. The edges between two nodes in the middle layer represent the communication links between the two neighboring satellites corresponding to the two nodes. Each information source is multicast to a specified set of receiving nodes as shown. D R A F T September 13, 2001, 6:27pm D R A F T 337 Multi-Source Network Coding transmitter ¡ receiver Figure 15.11. A satellite communication network. 15.3 A NETWORK CODE FOR ACYCLIC NETWORKS Consider a network represented by an acyclic directed graph ¢£s¤n¥¦p§r , where ¨¥©¨\ª¬« . Without loss of generality, assume that ¥vs®¯o'p%u\p?p?¨¥°¨²±{p (15.27) and the nodes are indexed such that if n´³p9q\r¶µ·§ , then ³¸ª¹q . Such an indexing of the nodes is possible by Theorem 11.5. Let º2» k be the rate constraint on channel n´³p9qer , and let ¼ sz½ º¾» kÀ¿ n´³p9q\r¶µ·§ÂÁ (15.28) be the rate constraints for graph ¢ . A set of multicast requirements on ¢ consists of the following elements: Ë X 1 XÉ2 Å X Å XÆ Å XÇ XÈ3 4 5 6 XÃ7 X Ä8 Å Å Å Å Å Å Å transmitters satellites receivers Å Å Å Å [X 1 XÄ8 ] [XÉ2 X Ã7 ] Å [X Ê4 ] [XÆ5 XÇ6 ] [X È3 XÆ5 ] [X È3 X Ã7 X Ä8 ] Figure 15.12. A graph representing a satellite communication network. D R A F T September 13, 2001, 6:27pm D R A F T 338 A FIRST COURSE IN INFORMATION THEORY X1 Ð ÍÐ X2X3 1 Ì [X 1 ] 2 3 Î Ï 4 5 Ð Ð Ì [X 1 X 3 ] 6 Ì [XÍ 2 ] Figure 15.13. The acyclic graph for Example 15.1. 
1) Ñ , the set of information sources; 2) Ò ¿ Ñ®Ó ¥ , which specifies the node at which an information source is generated; 3) Ô ¿ ¥ÕÓ uÖ , which specifies the set of information sources received at each node. The information source × is generated at node Òn"×dr . For all ³¸µØ¥ , let Ù n´³Úr¸sv?×ÛµØÑ ¿ Òtn"×'rÜs;³± (15.29) be the set of information sources generated at node ³ . The set of information sources Ôn´³Úr is received at node ³ . Ý2Þàßâá2ãåäeæèç¯éyê"ç The nodes in the graph in Figure 15.13 are indexed such that if n´³p9qer©µ¬§ , then ³ªëq . The set of multicast requirements as illustrated is specified by (15.30) Ñìsv¯o'p%u\p%we±{p and ÒtnoBrÜso'pÒn"u{r¸s.u\pÒn"w{r¸s.u\p (15.31) Ô noBrs¬Ôn"u{rÜs¬Ô©n´xer¸s.íîp Ôn"w{r¸s®¯od±{pÔn"ï{rs®¯o'p%we±{pÔn"ð{r¶sv?ue±{ñ (15.32) (15.33) We consider a block code with block length ò which is similar to the ó -code we defined in Section 11.5.1 for acyclic single-source networks. An information source × is represented by a random variable hô which takes values in the D R A F T September 13, 2001, 6:27pm D R A F T 339 Multi-Source Network Coding õ set ôs®¯o'p%u\p?peö(ud÷dø9ùú'± (15.34) according to the uniform distribution. The rate of information source × is ûNô . It is assumed that hôNp%×µbÑ are mutually independently. Unlike the ó -code we defined for acyclic single-source networks which is zero-error, the code we now define allows an arbitrarily small probability of ü error. Let (15.35) svN³µb¥ ¿ Ôn´³ÚrÀs¬ ý þe± be the set of nodes which receive at least one information source. An n´ò¦pBn´ÿ» k¿ n´³%p9qerCµ·§ rpBn(ûNô ¿ ×µ ÑÜrr (15.36) code on graph ¢ (with respect to a set of multicast requirements) is defined by 1) for all n´³p9q\r¶µ·§ , an encoding function » k ¿ õ ô l pNo'p'pÿ »R» ±¾Ó l pNo'p?pÿ» k ± (15.37) ô» »» » Ù (if we adopt the convention that k both n´³Úr and N³ ¿ n´³(p³ÚrAµØ§ ± are empty, l k » is an arbitrary constant taken from Np o'pdpÿ» ± ); ü 2) for all ³Üµ , a decoding function õ l » ¿ pNo 'pÿ » R» ±¾Ó ôñ » » » ô » (15.38) ü In the above, » k is the encoding function for edge n´³%p9qer . For ³¾µ , » is the k decoding function for node ³ . In a coding session, » is applied before » k if ³ ª¬³ , and » k is applied before » k if q°ªgq . This defines the order in which the encoding functions are applied. Since ³ yªg³ if n´³ (p³ÚrAµ·§ , a node does not ü the necessary information is received on the input channels. encode until all For all ³Üµ , define # »ys "!y$ # »n´hô ¿ × µbÑÜr2sz ý n´hô ¿ × µ Ôn´³Úrr%±p (15.39) where »n´hô ¿ ×Àµ ÑÜr denotes the value of » as a function of n´hô ¿ ×µbѸr . » is the probability that the set of information sources Ôn´³Úr is decoded incorrectly at node ³ . Throughout this chapter, all the logarithms are in the base 2 unless otherwise specified. % '&)(*+(-,+(/.0*.ç¯éyê1 æ rate tuple D R A F T For a graph ¢ 2 ¼ with rate constraints 3 szn ¦ô ¿ × µbÑÜrp September 13, 2001, 6:27pm , an information (15.40) D R A F T 254 340 A FIRST COURSE IN INFORMATION THEORY l where (componentwise), is asymptotically achievable if for any there exists for sufficiently large ò an n´ò¦pBn´ÿ» k¿ n´³p9qerAµ·§ rpBn(ûNô ¿ ×µ ÑÜrr 687 l , (15.41) :9<;'=?>A@ÿ» kCB º¾» k"D 6 (15.42) k for all n´³%p9qer µ § , where ò 9<; =?>A@¸ÿ» is the average bit rate of the code on channel n´³%p9qer , 4 ûNô 3 ô:EF6 ¦ (15.43) for all × µbÑ , and »B 6 (15.44) ü code on ¢ such that ò for all ³µ . For brevity, an asymptotically achievable information rate tuple will be referred to as an achievable information rate tuple. 
% '&)(*+(-,+(/.0*.ç¯éyêG 2 æ The achievable information rate region, denoted by is the set of all achievable information rate tuples . 2 2 4 2 H , is B 2B 2 Remark It follows from the definition of the achievability of an information l is achievable for all . rate vector that if is achievable, then Also, if , o are achievable, then it can be proved by techniques similar to those in the proof of Theorem 9.12 that 2 is also achievable, i.e., H 2 IM=L8JK N s (15.45) is closed. The details are omitted here. H In the rest of the chapter, we will prove inner and outer bounds on the achievable information rate region . 15.4 AN INNER BOUND H H In this section, we first state an inner bound ©» on in terms of a set of ÷ auxiliary random variables. Subsequently, we will cast this inner bound in the framework of information inequalities developed in Chapter 12. % '&)(*+(-,+(/.0*.ç¯éyêO æ H P 2 Q Let be the set of all information rate tuples such that there exist auxiliary random variables ô p%× µÑ and » k pBn´³p9qer·µ®§ which satisfy the following conditions: R UQ R P R P n ô ¿ × µbÑÜr UQ VQ Ù n â» k ¨^n ô ¿ Â × µ ´n ³ÚrrpBn » » ¿ ´n ³ p³Úr µ·§rr n ô ¿ × µ· Ô n´³Úr¨ » R» ¿ ´n ³ p³Úr¶µ §r º »k n ôr D R A F T P R R TS s ô l Ö s s 7 R l UQ P n ôr P W7 3 n »k r ¦ô ñ September 13, 2001, 6:27pm (15.46) (15.47) (15.48) (15.49) (15.50) D R A F T 341 Multi-Source Network Coding ü Note that for ³Âµ ý , the constraint (15.48) in the above definition is degenerate because Ôn´³Úr is empty. % '&)(*+(-,+(/.0*.ç¯éyê#é Let H» s XY>Zîn/H r , the convex closure of H . []\àæ^.0y_ æåáç¯éyê` H» H ÷ . ÷ba In the definition of H , P ô is an auxiliary random variable associated with Q » k is an auxiliary random variable associated with information source × , and î the codeword sent on channel n´³p9q\r . The interpretations of (15.46) to (15.50) are as follows. The equality in (15.46) says that P ôNp%×µ Ñ are mutually independent, which corresponds to the assumption that all the information sources are mutually independent. The equality in (15.47) says that Q » k is a function of Ù nP ô ¿ ×µ n´³Úrr and nUQ » » ¿ n´³´ p³Úr µ|§ r , which corresponds to the requirement æ that the codeword received from channel n´³%p9qer depends only on the information sources generated at node ³ and the codewords received from the input channels at node ³ . The equality in (15.48) says that n ô ¿ ×µØÔn´³Úrr is a function of n » » ¿ n´³ (p³ÚrAµb§ r , which corresponds to the requirement that the information sources to be received at node ³ can be decoded from the codewords received from the input channels at node ³ . The inequality in (15.49) says that the entropy of î» k is strictly less than º2» k , the rate constraint for channel n´³p9q\r . The inequality in (15.50) says that the entropy of ô is strictly greater than ô , the rate of information source × . The proof of Theorem 15.6 is very tedious and will be postponed to Section 15.7. We note that the same inner bound has been proved in [187] for variable length zero-error network codes. In the rest of the section, we cast the region ©» in the framework of infor÷ mation inequalities we developed in Chapter 12. The notation we will use is a slight modification of the notation in Chapter 12. 
Let P UQ Q P 3 H c dP feYQâ» k¿ n´³p9q\r¶µ·§t± sv ô ¿ × µbÑ (15.51) be a collection of discrete random variables whose joint distribution is unspecified, and let (15.52) Òtn rÜs.u ¯?þe±{ñ c Agih Then c ¨Òn r¨¯s.ukj g jEZo'ñ (15.53) c Let l be the ¨Òtn c r¨ -dimensional Euclidean space with the coordinates lag beled by m'nÜpToëµ|Òn r . We will refer to l c g as the entropy space for the set of random variables . A vector p sznqm^n ¿ o®µbÒn c rr (15.54) D R A F T September 13, 2001, 6:27pm D R A F T 342 A FIRST COURSE IN INFORMATION THEORY R r ¿ r µ c is said to be entropic if there exists a joint distribution for n that c for all o®µbÒtn so tm'n r . We then define uv sv p µwl ¿ p is entropic±{ñ g g n´h ¿ hÕµ Àr¦s r such (15.55) o Tp ox To simplify notation in the sequel, for any nonempty define m n j ^n stm nyn EFm n p c µ®Òtn c (15.56) r , we (15.57) where we use juxtaposition to denote the union of two sets. In using the above notation, we do not distinguish elements and singletons of , i.e., for a random variable Dµ , is the same as . To describe ©» in terms of , we observe that the constraints (15.46) to ÷ correspond to the following constraints in , (15.50) in the definition of respectively: r c ^m z H % u v ym { zy| H g l g m } ù ôT Ö s m~ j } ù ô» - q ~ »- »? q s m } ù ô» - j ~ »- »? q s º »k 7 m'} ù 7 TS m^} ù ô l Ö 3 l m ~ ôNñ H . (15.58) (15.59) (15.60) (15.61) (15.62) '&)(*+(-,+(/.0*.ç¯éyê- Let H be the set of all information rate tuples 2 such p u v which satisfies (15.58) to (15.62) for all × µmÑ and that there exists µ g n´³p9qerAµ·§ . Although the original definition of H as given in Definition 15.4 is more intuitive, the region so defined appears to be totally different from case to case. On the other hand, the alternative definition of H above enables to u # istheanregion be described on u the same footing foru all cases. Moreover, if explicit v , upon replacing v by u # in the above definition g inner bound on H , g an explicit innerg boundg on H » ÷ for all cases. Weof will we immediately obtain see further advantage of this alternative definition when we discuss an explicit outer bound on H in Section 15.6. Then we have the following alternative definition of æ 15.5 AN OUTER BOUND uv uv In this section, we prove an outer bound terms of , the closure of . g D R A F T g H on H . This outer bound is in September 13, 2001, 6:27pm D R A F T % 2 Multi-Source Network Coding 343 '&)(*+(-,+(/.0*.ç¯éyê Letv H be the set of all information rate tuples such p u which satisfies the following constraints for all × µbÑ that there exists µ g and n´³%p9qer¶µØ§ : m } ù ôT Ö s ôTS m^} ù (15.63) (15.64) m~ j } ù ô » - q ~ »- »? q s ll Ö m } ù ô» - j ~ »- »? q s4 (15.65) k º¾» 4 m ~ (15.66) (15.67) m}ù 3 ôñ The definition of H is the same as the alternative definition of Hb (Definition 15.7) except that u v is replaced by u v . 1. g g æ 2. The inequalities in (15.61) and (15.62) are strict, while the inequalities in (15.66) and (15.67) are nonstrict. H and H , it is clear that H a H ñ (15.68) v u (Theorem 14.5) implies the conIt is easy to verify that the convexity of g vexity of H . Also, it is readily seen that H is closed. Then upon taking convex closure in (15.68), we see that H» ÷ s XY>Zîn/H r a XY>Z n/H rÜs H ñ (15.69) From the definition of However, it is not apparent that the two regions coincide in general. This will be further discussed in the next section. []\àæ^.0_yæåáç¯éyê H H . a 2 Proof Let be an achievable information rate tuple and ò l large integer. 
Then for any 67 , there exists an n´ò¦pBn´ÿ» k¿ n´³%p9qerCµ·§ rpBn(ûNô ¿ ×µ ÑÜrr code on ¢ such that for all n´³p9q\r¶µ·§ , for all × µbÑ , and D R A F T ò 9<; =?>A@ÿ» kiB º¾» kD 6 4 ûNô 3 ô:EF6 ¦ »B 6 September 13, 2001, 6:27pm be a sufficiently (15.70) (15.71) (15.72) (15.73) D R A F T 344 A FIRST COURSE IN INFORMATION THEORY ü for all ³¸µ . We consider such a code for a fixed and a sufficiently large ò . For ´n ³p9q\rAµ § , let â» k s » k n´hô ¿ ×ÀµbѸr (15.74) # Q 6 Q be the codeword sent on channel n´³p9q\r . Since î» k is a function of the information sources generated at node ³ and the codewords received on the input channels at node ³ , R For ³Üµ ü UQ UQ Ù l n î» k ¨^n´hô ¿ ×µ n´³ÚrrpBn » » ¿ n´³ p³Úr µØ§ rrÜs ñ R (15.75) , by Fano’s inequality (Corollary 2.48), we have VQ D =>A@ R T D R D 6 n´hô ¿ × µ·Ôn´³Úr¨ » R» ¿ n´³ p³ÚrAµ·§ r õ o » ¨ ôd¨ ô R» s o » n´hô ¿ ×µ·Ô©n´³Úrr o n´hô ¿ × µ·Ôn´³Úrrp B B (15.76) (15.77) (15.78) õ where (15.77) follows because hô distributes uniformly on ô and hô , × µ|Ñ are mutually independent, and (15.78) follows from (15.73). Then R n´hô ¿ Â × µØÔn´³Úrr s nn´hô ¿ × µ·Ôn´³Úrr Bn » » ¿ n´³ p ³Úr¶µ·§ rr n´hô ¿ × µ·Ôn´³Úr¨ » R» ¿ ´n ³ pÚ³ r¶µ·§ r (15.79) B (15.80) (15.81) B B B R D e UQ VQ nn´hô ¿R × µ·Ôn´³ÚrrBe nUQ » » ¿ n´³ p³Úr¶µ·§ rr RD o D 6 n´h ô ¿ × µ·Ôn´³Úrr R nUQ »²» ¿ n´³ p³ÚrAµ § r D o D R 6 n´hô ¿ × µ·Ôn´³rr SR»/ »? =?>A@Üÿ »^» D o D 6 n´hô ¿ × µ·Ôn´³Úrr R SR»/ »? ò¸n(º »^» D 6 r D o D 6 n´hô ¿ × µ·Ôn´³Úrrp (15.82) (15.83) where a) follows from (15.78); b) follows from Theorem 2.43; c) follows from (15.71). D R A F T September 13, 2001, 6:27pm D R A F T 345 Multi-Source Network Coding R Rearranging the terms in (15.83), we obtain n´h ô ¿ ×µ·Ô©n´³Úrr B oòEF6 S n(º »^» D 6r D ò o (15.84) »/ » q ª udò S n(º »^» D 6r (15.85) »/ » q for sufficiently small 6 and sufficiently large ò . Substituting (15.85) into (15.78), we have R n´hô ¿ ×µ Ô©n´³Úr¨VQ » » ¿ n´³ p³Úr¶µ §r o D uA6S n(º »^» D 6 rU (15.86) ª ò ò »/ » q (15.87) s ò:? »n´ò¦p6 rp where o D l ? »n´ò¦p6 r¦s (15.88) uA6S n(º »^» D 6 rUë Ó ò »/ » 6 Ó 4 l . From (15.71), for all n´³4 p9q\R rAµØ§ , as òØÓ « and then ò¸n(º¾» kD uA6 r =?>A@ n´ÿ» kfD oBrÜs =?>AÀ@ ¨VîQ » k ¨ nUâQ » k rñ (15.89) For all ×ÂRµ Ñ , from (15.72), õ 4 4 (15.90) n´hô r¦s =?>AÀ @ ¨ ô?¨s =?>A@ ö(u ÷?ø ùú òàûNô ò¸n3 ôEF6 rñ Thus for this code, we have R R ¿ n´hô × µbÑÜr s S n´hô r (15.91) R ôT Ö nUî Q » k ¨^R n´hô ¿ × µ Ù n´³ÚrrpBnUQ » » ¿ n´³ p³Úr µ·§rr s l (15.92) ¿ ¿ B R (15.93) n´h ô ×µØÔn´³Úr¨VQ » » n´³ p³Úr µ·§rr 4 ò: » n´ò¦p6 r " k D k R ò¸n(º¾» uA6 r 4 nUî Q»r (15.94) n´hô r ò¸n¦ 3 ô:EF6 rñ (15.95) We note the one-to-one correspondence between (15.91) to (15.95) andp (15.63) uv to (15.67). By letting P ôCsDhô for all ×µ Ñ , we see that there exists µ g such that (15.96) m } ù ôT Ö s ôTS m } ù Ö D R A F T September 13, 2001, 6:27pm D R A F T 346 A FIRST COURSE IN INFORMATION THEORY m ~ j } ù ô» - q ~ » » 0 s l (15.97) B (15.98) m } ù ô» - j ~ » » 0 4 ò: » n´ò¦p6r k D ò¸n(º¾» uA6 r 4 m ~ (15.99) (15.100) m'} ù ò¸n3 ôEF6rñ u v is a convex cone. Therefore, if p µ u v , then ò 9<; p µ By 14.5, vu .Theorem g through gp p Dividing (15.96) (15.100) by ò and replacing ò 9<; by , we see v p u g that there exists µ g such that (15.101) m } ù ô Ö s ôS m } ù m ~ j } ù ôTAR »? ~ R »- » s l Ö (15.102) B (15.103) m } ù ôTR »? j ~ R »- » 4 » n´ò¦p6 r " k D º2» uA6 4 m ~ (15.104) (15.105) m } ù 3 ô EFN6 ñ 6 Ó l to conclude that there exists p µ u vg which We then let òØÓ « and then satisfies (15.63) to (15.67). 
Hence, H a H , and the theorem is proved. 15.6 THE LP BOUND AND ITS TIGHTNESS u v (the In Section 15.4, we stated the inner bound H » on H in terms of ÷ proof is deferred to Section 15.7), and in Section 15.5, we proved theg outer v u bound H Uu d on H u v in terms of far, there exists no full characterization v or . Therefore, these g . Sobounds cannot be evaluated explicitly. In on either g this section,g we give a geometrical interpretation of these bounds which leads to an explicit outer bound on H called the LP bound (LP for linear programming). p c Let o be a subset of Òn r . For a vector µl , let g p n snqmz ¿ r µsÀo rñ (15.106) For a subset of l g , let¡ (15.107) !T>¢ n n¶ r¸sv p n ¿ p µsC ± z pD r µ£o . For a subset of be the projection of the set on the coordinates m^Ü l g , define¤ p ¿ l B p ª p for some p µs ± (15.108) n ArÜsv µwl g ¤¥ and p p p p l n Ar¦sv µwl (15.109) g ¿ B B for some µs A±{ñ D R A F T September 13, 2001, 6:27pm D R A F T 347 4p l ¤ ¤ p A vector is in n Ar if and only if it is strictly inferior top some vector in , and is in n¶ r if and only if it is inferior to some vector in . Define the following subsets of l g: ¦ s § p µsl ¿ m } ôT s¨S m^} (15.110) ; g ù Ö ôT Ö ù© ¦$ª s « p µwl ¿ m ~ } ôTAR » - ~ » » s l for all n´³p9qerCµ·§¬ g j ù ¦ } s « p µwl ¿ m } ôT » ~ R »- » 0 s l for all ³µ ü ¬ (15.111) (15.112) g j ù ¦ ~ s « p µwl ¿ º¾» k 7m ~ for all n´³p9q\r¶µ·§®¾¬ ñ (15.113) g ¦ is a hyperplane in l . Each of the sets ¦ ª and ¦ } is the intersecThe set ¦ ; g tion of a collection of hyperplanes in l . The set ~ is the intersection of a g . Then from the alternative definition of collection of open half-spaces in l ¤ ¡ we see that g H (Definition 15.7), H s n !>¢ } ù ôT Ö n u vg°¯ ¦ ; ¯ ¦ ª ¯ ¦ } ¯ ¦ ~ rrñ (15.114) ¤ ¡ and H » ÷ s XY>Z n n !T>¢ } ù ô Ö n u vg ¯ ¦ ; ¯ ¦ ª ¯ ¦ } ¯ ¦ ~ rrrñ (15.115) Similarly, we see ¤that¡ ¥ H s n !>¢ } ù ô Ö n u vg±¯ ¦ ; ¯ ¦ ª ¯ ¦ } ¯ ¦ ~ rrñ (15.116) u v n ¦ ¦ ª ¦ } r is dense in u v n ¦ ¦ ª ¦ } r , It can be shown that if ; ; ¯ i.e., u v n ¦ ¦ ª g ¯¦ } r¦s ¯ u v ¯ n ¦ ¦ ª ¦ } rp g²¯ ¯ (15.117) g ¯ ;¯ ¯ g ¯ ;¯ ¯ then H s H a XY>Z n/H r¦s³©H » ÷ p (15.118) which implies (15.119) H » ÷ s³H ñ ¦ ¦ ¦ ª } Note that n subset of l . However, while ; ¯ ¯ r is a closed g uv ¦ uv ¦ (15.120) a ¯ F g ¯ g ¦ for any closed subset of l true that g u ,v it is ¦ not inu v general ¦ (15.121) g ¯ s g ¯ ñ Multi-Source Network Coding D R A F T September 13, 2001, 6:27pm D R A F T 348 A FIRST COURSE IN INFORMATION THEORY u v ¦# ¦ # sv¯ p µ u0}v u v ¦# ¯ As a counterexample, as we have shown in the proof of Theorem 14.2, } is a proper subset of } , where ª s;³ ª } s;³ } ;´ ´ ;´ k k and we have used ³ to denote m Eµm k . ´ j To facilitate our discussion, we further define ³ n n stm^nsE°m n n^ ´ j ¿ ³ s l ±¸p (15.122) (15.123) and ³ n n n stm n n EFm n n^n^ (15.124) j p j ´uj p c for o pTo pTo µ Òtn r . Let be the set of µ¶l such that satisfies all c g g the basic inequalities involving some or all of the random variables in , i.e., c for all oÛpTo pTo µØÒn r , 4 m'n 4 ll (15.125) (15.126) m n j n 4 l ³ n n 4 (15.127) ´ l (15.128) ³n n n ñ ´ j v u u . Then upon replacing u v by u We know from Section 13.2 that g a g g g in the definition of H , we immediately obtain an outer bound on H . This is called the LP bound, which is denoted by H ·)¸ . 
In other words, H·¹¸ is obtained by replacing ¤ ¥ ¡ u vg by u g on the right hand side of (15.116), i.e., (15.129) H ·)¸ s n !T>¢ } ù ô Ö n u g ¯ ¦ ; ¯ ¦$ª ¯ ¦ } ¯ ¦ ~ rrñ Since all the constraints defining H·¹¸ are linear, H·¹¸ can be evaluated explicitly. We now show that the outer bound H ·)¸ is tight for the special case of single-source network coding. Without loss of generality, let Ñìs®¯od±{ñ ü Then the information source represented by h for all ³¸µ , 3 ; ºH (15.130) ; is generated at node ÒtnoBr , and Ôn´³ÚrÜsDÑìsv¯od±{ñ H Ud (15.131) Consider any µ . Following the proof that is an outer bound l on in the last section (Theorem 15.9), we see that for any and a sufficiently large ò , there exist random variables î» k pBn´³p9qer¾µ|§ which satisfy H D R A F T Q September 13, 2001, 6:27pm 6F7 D R A F T 349 Multi-Source Network Coding the constraints in (15.92) to (15.94). Upon specializing for a single source, these constraints become R R for q ¿ nÒtnoBrp9qerCµ·§ (15.132) UQ» ; q k ¨ h ; r s ll k R nUQî» ¨VQ »² » n´³ p³Úr µØ§ r s for n´³p9q\ü r¶µ·§ (15.133) ¿ B (15.134) n´h ¨VQ » n´³p¼ r¶µØ§ r 4 ò: ¼µ ; ò¸n(º¾» kD uA6r R nUQîn´ò¦» k pr 6r for for n´³p9qerCµ § , (15.135) l 6 Ó l and ò Ó « . Moreover, upon specializing where n´ò¦p6 r·Ó as Ø (15.95) for a single source, we R also have 4 n´h r (15.136) ; ò¸n3 ; EF6rñ ü Now fix a sink node ¼ µ and consider any cut ½ between node ÒtnoBr and ¼ µ¾ý ½ , and let node ¼ , i.e., ÒnoBr µ¾½ and ¾ (15.137) §x ¿ sven´³p9qerCµ·§ ¿ ³¸µ£½ and q·µ£ý ½ë± be the set of edges across the cut ½ . Then S »/ À ò¸n(º¾» kfD uA6 r k R 4 SR»/ qÀ nUîQ » k r (15.138) k R 4 k¿ (15.139) RR nUîQ » k¿ n´³p9qerCµ·§ ¿ r s4 nUî Q » n´³p9qerCµ·§ ¿ or both ³p9q µ²ý ® ½ r (15.140) ¿ R nUîQ » n´³%p¼ r µØ§ r (15.141) ¿ D ¿ s nUî Q » n´³%p¼ r µØ§ ¨ h ; r nUîQ » n´³p¼ r µ·§be h ; r (15.142) R nUîQ » ¿ n´³pR¼ r µ·§be h ; r s (15.143) ¿ (15.144) s 4 R n´h ; r0E n´h ; ¨VQ » n´³p¼ r¶µ §r (15.145) Á4 n´h ; r0ìE ò: n´ò¦p6 r (15.146) ò¸n3 EF6 r0ì E ò: n´ò¦p6 r ; s ò¸n3 EF6E n´ò¦p6 rrp (15.147) ; n where a) follows from (15.135); Q t½ b) follows because » k , where both ³p9q;µ ý , are functions of § by virtue of (15.133) and the acyclicity of the graph ¢ ; ¿ D R A F T September 13, 2001, 6:27pm Q »/ k pBn´³p9qer©µ D R A F T 350 A FIRST COURSE IN INFORMATION THEORY c) follows from (15.134); d) follows from (15.136). 6 Dividing by ò and letting Ó l and òØÓ 4 « in (15.147), we obtain (15.148) S/ yÀ º¾» k 3 ; ñ ü In other words, if 3 inequality for ev; µÃH , then 3 ; satisfies the above ery cut ½ between node ÒtnoBr and any sink node ¼ µ , which is precisely the max-flow bound for single-source network coding. Since we know from Chapter 11 that this bound is both necessary and sufficient, we conclude that H Ud » k is tight for single-source network coding. However, we note that the max-flow bound was discussed in Chapter 11 in the more general context that the graph ¢ may be cyclic. In fact, it has been proved in [220] that is tight for all other special cases of multi-source network coding for which the achievable information region is known. In addition to single-source network coding, these also include enthe models described in [92], [215], [166], [219], and [220]. Since compasses all Shannon-type information inequalities and the converse proofs of the achievable information rate region for all these special cases do not infor all these cases volve non-Shannon-type inequalities, the tightness of is expected. 
H·¹¸ H ·)¸ H ·)¸ 15.7 ACHIEVABILITY OF Ä » ÷ H In this section, we prove the achievability of » , namely Theorem 15.6. ÷ To facilitate our discussion, we first introduce a few functions regarding the in Definition 15.4. auxiliary random variables in the definition of Ù From (15.47), since â» k is a function of n ô ¿ × µ n´³Úrr and n » » ¿ n´³ p³rCµ § rr , we write Q Qî» k s Å » k nnP ô P UQ Hb UQ ¿ × µ Ù n´³ÚrrpBn » » ¿ n´³ p³Úr µØ§ rrñ (15.149) Since the graph ¢ is acyclic, we see inductively that all the auxiliary random variables » k pBn´³p9qerCµ·§ are functions of the auxiliary random variables ô p%× µ Ñ . Thus we also write (15.150) â» k s » k n ô ¿ ×ÀµbѸrñ Q P Q ÆÅ # P Equating (15.149) and (15.150), we obtain Å » k nnP ô UQ ÇÅ # » k nP ô ¿ × For ×·µZÔn´³Úr , since P ô is a function of nUQ »R» ¿ n´³ p³Úrµm§ write P ôs È ôR»? nUQ »R» ¿ n´³ p³r µ·§ rñ ¿ × µ Ù n´³ÚrrpBn » ^» ¿ n´³ p³ÚrAµ·§ rrÜs D R A F T µbÑÜrñ (15.151) r from (15.48), we September 13, 2001, 6:27pm (15.152) D R A F T 351 Multi-Source Network Coding Substituting (15.150) into (15.152), we obtain P ô s³È ô » nYÅ # »» nP ôÉ ¿ × µ ÑÜr ¿ n´³ p³Úr µØ§ rñ (15.153) These relations will be useful in the proof of Theorem 15.6. Before we prove Theorem 15.6, we first prove in the following lemma that strong typicality is preserved when a function is applied to a vector. Êæåá¾á Ë ßDç¯éyê"ç P Let s n´h|r . If Ì sën/Í ; pÍ ª pBpÍ ÷ rµÎ Ï ÷ ÐÑVÒ p then n Ì r¦szn/È pÈ ª pBpÈ rµÎ Ï ÷ } ÑVÒ p ; ÷ B B n/Í »r for o ³ ò . where È»ys Proof Consider Ì µÎ Ï ÷ Ð"ÑVÒ , i.e., SÓÇÔÔ ò o c n/Í0e Ì r:EÕ n/Í rÖÔÔ ª×Nñ ÔÔ ÔÔ P s n´h|r , Since ë Õ n/È r¦s Ó AØS Ù)ÚÜÛÝ âÕ n/yÍ r for all È µsÞ . On the other hand, c n/Èe n Ì rrs Ó S c n/Íe Ì r Ø Ù¹Ú ÜÛY for all È µsÞ . Then S Û ÔÔ ò o c n/Èe n Ì rr:EâÕ n/åÈ rÖÔÔ ÔÔ ÔÔ s S Û ÔÔÔ Ó ØS Ù)ÚÜÛY ß ò o c n/Íe Ì r:EâÕ n/Í rUà ÔÔÔ ÔÔ ÔÔ Ô B S Ó S ÔÔ ò o c n/Íe Ì r:EwâÕ n/Í r ÔÔ Ô Û ØÙ¹ÚÛÝ ÔÔ ÔÔ S Ó ÔÔ ò o c n/Í0e Ì rEâÕ n/yÍ r ÔÔ s Ô ÔÔ ª B× ñ Ô Therefore, n Ì rµwÎ Ï ÷ } ÑVÒ , proving the lemma. D R A F T September 13, 2001, 6:27pm (15.154) (15.155) (15.156) (15.157) (15.158) (15.159) (15.160) (15.161) (15.162) D R A F T 352 A FIRST COURSE IN INFORMATION THEORY 2 Proof of Theorem 15.6 Consider an information rate tuple H 3 szn ô ¿ × µØÑÜr P Q (15.163) in , i.e., there exist auxiliary random variables ô p%×ÛµØÑ and â» k pBn´³p9qerCµ·§ which satisfy (15.46) to (15.50). We first impose the constraint that all the auxiliary random variables have finite alphabets. The purpose of imposing this additional constraint is to enable the use of strong typicality which applies only to random variables with finite alphabets. This constraint will be removed later. l . We now construct an Consider any fixed 67 3 "EF6 n´ò¦pBn´ÿ» k ¿ n´³p9q\r¶µ·§rpBn ô ¿ × µbÑÜrr (15.164) random code on graph ¢ for sufficiently large ò such that ò for all n´³p9q\rAµ § , and ü (15.165) (15.166) × be small positive quantities such that (15.167) × ªB× ñ The actual values of × and × will be specified later. The coding scheme is for all ³¸µ × 9<; =?>A@ÿ» kCB º¾» k"D 6 »B 6 . Let and described in the following steps, where ò is assumed to be a sufficiently large integer. 1. For × µbÑ , by the strong AEP (Theorem 5.2), á ôâÝs ãä ¨ Î Ï ÷ } ÑVÒ ¨ 4 noåEµ× rud÷ ?æå} ù 9$ç ù p (15.168) ù l l where è ô Ó as × Ó . 
By letting × be sufficiently small, by virtue of R (15.50), we have nP ôr0E¶,è ô7Â3 ôNp (15.169) and æ} ù 9$ç ù 7ê'é ôp (15.170) noåEF× rud÷ where õ (15.171) é ô s ö(ud÷ Vë ù 9^ì ú¾sz¨ ô ¨ñ Then it follows from (15.168) and (15.170) that á ô7êé'ôñ (15.172) á Denote the vectors in Î Ï ÷ } ÑVÒ by í ô n"årpNo B B ô . ù D R A F T September 13, 2001, 6:27pm D R A F T Multi-Source Network Coding o ; To ª Î Ï } VÑ Ò Toxî 353 ï/é 9<; á ñð n"×'rp n"×drp?p "n ×'r almost into subsets 2. Arbitrarily partition ÷ ù ù ô or uniformly, such that the size of each subset is either equal to ô ö ô ôú . Define a random mapping /é 9<; á ò ó òô Té ¿ ¯o'p%u\pdp 'ô ±¾Ó á ¯o'p%u\pdp ô ± B ó B é by letting ôBn "r be an index randomly chosen in ô ô. uniform distribution for o o jù (15.173) n"×dr according to the Î Ï ~ ÑVÒ by ô » k n"år , o B B á » k , where á » k së¨ Î Ï ÷ ~ ÑVÒ ¨ñ (15.174) 3. For n´³p9q\r¶µ·§ , denote the vectors in ÷ By the strong AEP, á » k së¨ Î Ï ÷ ~ ÑVÒ ¨ B ud÷ ?æå ~ õ ç p (15.175) l l where èe» k Ó as × Ó . Then by letting × be sufficiently small, by virtue R of (15.49), we have (15.176) nUQ » k r D è » k B º » k D 6 and hence æ ~ -õ ç B u ÷ ö õ ì ñ (15.177) u ÷ Then we can choose an integer ÿ» k such that ud÷ æ ~ õ ç B ÿ» 8k B ud÷ ? ö õ ì ñ (15.178) The upper bound above ensures that ÿ» k satisfies (15.165). 4. Define the encoding function » k ¿ õ ô l pNo'pdpÿ »R» ± Ó l pNo'pdpÿ » k ± (15.179) ôTA» »» » as follows. Let Í ô be the outcome of hô . If ÷ ò Ù nqí ô n ôBn/Í ôrr ¿ ×Àµ n´³ÚrrpBn/ô »R » n # » » n/Í ô ¿ × µbÑÜr ¿ n´³ p³Úr µ·§rrø (15.180) µÎ Ï } ô » - q ~ » »? q ÑVÒ p ù where # »² » n/Í ô ¿ שµ¹ÑÜr denotes the value of »² » as a function of n/Í ô ¿ ×µ Ѹr , then by Lemma 15.10, ÷ Å » k nqí ô n ò ôBn/Í ôrr ¿ × µ Ù n´³ÚrrpBn/ô » » n # »» n/Í ô ¿ × µbÑÜr ¿ n´³ p³r µ·§ rr ø (15.181) D R A F T September 13, 2001, 6:27pm D R A F T 354 A FIRST COURSE IN INFORMATION THEORY Î Ï ~ ÑVÒ (cf. (15.149) for the definition of Å » k ). Then let » k ÷ n ò ô?n/Í ôr ¿ ×Àµ Ù n´³ÚrrpBn # » » n/Í ô ¿ ×ÀµbѸr ¿ n´³ p³Úr¶µ §rø is in ÷ (15.182) ÷ Å » k nqí ô n ò ô n/Í ô rr ¿ ×µ Ù n´³ÚrrpBn/ô » k n # »²» n/Í ô ¿ ×µ ÑÜr ¿ n´³ p³Úr µØ§ rrø (15.183) Ï V Ñ Ò in Î ÷ ~ (which is greater than or equal to 1), i.e., ô » k n # » k n/Í ô ¿ × µbÑÜrr¸s ÷ Å » k nqí ô n ò ô?n/Í ôrr ¿ ×µ Ù n´³ÚrrpBn/ô » k n # »² » n/Í ô ¿ ×µ ÑÜr ¿ n´³ p³Úr µØ§ rrøÀñ be the index of (15.184) » k ÷ n ò ôBn/Í ôr If (15.180) is not satisfied, then let # /Í ø ¿ ×Àµ Ù n´³ÚrrpBn » » n ô ¿ ×ÀµbÑÜr ¿ n´³ p³Úr¶µ·§r s l ñ (15.185) Note that the lower bound on ÿ» k in (15.178) ensures that » k is properly defined because from (15.175), we have 5. For ³¸µ ÿ »k ü 4 æ ~ -õ 4 Ï ÑVÒ á k ç ¨ Î ÷ ~ ¨s » ñ u ÷ (15.186) , define the decoding function as follows. If » ¿ l pNo dpÿ »R» ±¾Ó õ »- »- » q ôT » # » » n/Í ô ¿ × µbÑÜr2s ý l ô (15.187) (15.188) B óMô B é'ô such that ÷ È ô » ô »²» n # »» n/Í ô ¿ × µØÑÜrr ¿ n´³ p³Úr¶µ·§]østí ô n ò ô nó ô rr (15.189) R»? R»? (cf. the definition of È ô in (15.152) and note that È ô is applied to a vector as in Lemma 15.10), then let » ÷ # » » n/Í ô ¿ × µbÑÜr ¿ n´³ p³Úr¶µ § ø sznóMô ¿ ×ÀµØÔn´³Úrrñ (15.190) Note that if an Mó ô as specified above exists, then it must be unique because í ô n ò ô?n"ó rr stý í ô n ò ôBnó rr (15.191) for all n´³ p³Úr µØ§ , and if for all × µ·Ôn´³Úr there exists o D R A F T September 13, 2001, 6:27pm D R A F T 355 Multi-Source Network Coding ó ó/ . For all other cases, let » ÷ # » » n/Í ô ¿ × µbÑÜr ¿ n´³ p³r µ·§]øsënù o'pNo'púû ?pNüo rp j R » j õ a constant vector in ý ôTR » ô . 
if s ý (15.192) ü random code. Specifically, we will We now analyze the performance of this l show that (15.166) is satisfied for all ³Üµ . We claim that for any , if qí ò /Í Î Ï } ù ô Ö ÑVÒ p ×þ7 n ô n ôBn ôrr ¿ × µbÑÜr µ 2÷ (15.193) then the following are true: i) (15.180) holds for all ³Üµb¥ ; # » k n/Í ô ¿ ×µbѸr¾s ý l for all n´³p9q\r¶µ·§ ; iii) ô » k n # » k n/Í ô ¿ ×µ ÑÜrrÜsÆÅ # » k nqí ô n ò ôBn/Í ô rr ii) ¿ × µbÑÜr for all n´³p9q\r¶µ·§ . We will prove the claim by induction on ³ , the index of a node. We first consider node 1, which is the first node to encode. It is evident that N³ Ø µ ¥ ¿ n´³ pNoBrAµ·§ ± Ù noBrr is a function of n ô ¿ ×mµ is empty. Since n ô ¿ ×mµ P P (15.194) ÑÜr , (15.180) follows from (15.193) via an application of Lemma 15.10. This proves i), and ii) follows by the definition of the encoding function » k . For all no'p9qerµg§ , consider ô ; k n # ; k n/Í ô ¿ × µbÑÜrr s s Å ; k nqí ô n ò ô?n/Í ôrr Å # ; k nqí ô n ò ô n/Í ô rr ¿ ×Àµ Ù noBrr (15.195) ¿ ×ÀµbѸrp (15.196) P where the first and the second equality follow from (15.184) and (15.151), respectively. The latter can be seen by replacing the random variable ô by the vector ô n ô?n ôrr in (15.151). This proves iii). Now assume that the claim is ³ ¬o for some u ³ ¨¥©¨ , and we will prove that the true for all node q claim is true for node ³ . For all n´³ p³r µ·§ , í ò /Í B :E B B ô »R» n # »» n/Í ô ¿ × µbÑÜrrÜsÇÅ # »²» nqí ô n ò ôBn/Í ôrr ¿ × µbÑÜr (15.197) Ù by iii) of the induction hypothesis. Since nnP ô ¿ × µ n´³rrpBnUQ » » ¿ n´³ p³ÚrAµ·§ rr ¿ is a function of nP ô × µDÑÜr (cf. (15.150)), (15.180) follows from (15.193) via an application of Lemma 15.10, proving i). Again, ii) follows immediately D R A F T September 13, 2001, 6:27pm D R A F T 356 A FIRST COURSE IN INFORMATION THEORY from i). For all n´³p9q\r¶µ·§ , it follows from (15.184) and (15.151) that ô » k n # » k n/Í ô ÷ ¿ ×µ ÑÜrr s Å » k nqí ô n ò ôBn/Í ôrr ¿ × µ Ù n´³ÚrrpBn/ô »²» n # »²» n/Í ô ¿ × µbÑÜr ¿ n´³ p³Úr¶µ·§ rrø (15.198) ò k ¿ (15.199) s Å # » nqí ô n ôBn/Í ôrr × µbÑÜrp where the last equality is obtained by replacing the random variable P ô by the vector í ô n ò ô?n/Í ôrr and the random variable Q »» by the vector ô »²» n # »²» n/Í ô ¿ ×µ ÑÜr in (15.151). This proves iii). Thus the claim is proved. If (15.193) holds, for שµ¹Ô©n´³Úr , it follows from iii) of the above claim and È ô » n/ô »» n # »²» n/Í ô ¿ × µbÑÜrr ¿ n´³ p³r¶µ § r (15.200) s È ô » ò nYÅ # »» nqí ôÉ n ò ôÉ n/Í ô rr ¿ × µbѸr ¿ n´³ p³Úr µ·§r s í ô n ô?n/Í ôrrp (15.201) where the last equality is obtained by replacing P ôÉ by í ôÉ n ò ôÉ n/Í ô rr in (15.153). Then by the definition of the decoding function » , we have » ÷ # » » n/Í ô ¿ ×µØÑÜr ¿ n´³ p³Úr µ·§ ø szn/Í ô ¿ × µ Ôn´³Úrrñ (15.202) Hence, Í ô p%×Ûµ Ôn´³Úr can be decoded correctly. Therefore, it suffices to consider the probability that (15.193) does not hold, since this is theá only case for a decoding error to occur. For × µgÑ , consider any o B ô B ô , where ô µso n"×'r for some o B ó ô B é ô . Then jù N! ò ôBn´hôrÜò s.eô ± (15.203) s "! ò ô n´h ô rÜs ô ph ô s ó ô ± s "! ôdn´hôrÜs ¯ô?¨ hô s Mó ô±d! Nhô¶s ´ó ô ± (15.204) (15.205) s ¨ o n"×dr¨V9<;é$ô 9<; ñ jù á á Since ¨ o n"×'r¨ is equal to either ï/é ô 9<; ôTð or ö/é ô 9<; ôú , we have ùj 4 á 4 á (15.206) ¨ o n"×'r¨ ï/é ô9<; ô ð é ô 9<; ô ZE o'ñ jù (15.153) that Therefore, ! 
ò ôBn´hô r¦s.¯ô± B né ô9<; á ôo EgoBré'ô s á ô µE o é ô ñ ë á æ} ùR , where Since é ô"ÿ u ÷ ù and ô"ÿ u ÷ nP ôr7 3 ô D R A F T September 13, 2001, 6:27pm (15.207) (15.208) D R A F T Multi-Source Network Coding 357 á é from (15.50), dô is negligible compared with ô when ò is large. Then by the l strong AEP, there exists ô n r such that è q× f7 ?æå} ù 9$ç ù Ò - B á ôEµé'ôAª á ô B u ÷ æ} ù õ ç ù Ò p noåEF× ru ÷ l l where èô nq×dMrÜÓ as ×d Ó . Therefore, ! ò ôBn´hô r¦s.¯ô± B noåE°× r 9<; u 9å÷ æ} ù 9$ç ù Ò ñ (15.209) (15.210) Note that this lower bound does not depend on the value of eô . It then follows that ! ò ô?n´hôrÜs.¯ô ¿ ×Àµ ѱ æ} Ò B noåE°× r9 j Ö j u¹9å÷ ù 9$ç ù ô ?æå} ù 9$ç ù Ò - s noåE°× r 9 j Ö j u 9å÷ ù 9 ÷ ?n æ}ù æ ôT } ù 9 ù Ò ç -ù Ò r s noåE°× r9 j Ö j u å s noåE°× r 9 j Ö j u 9å÷ ù Ö 9$ç p (15.211) (15.212) (15.213) (15.214) where the last equality follows from (15.46) with which tends to 0 as !Öí ô n ò ô n´h ×d Ó l è nq× r¦s S ô è ô nq× r (15.215) . In other words, for every n ô ¿ × µbÑÜrµ ô rrÜs ô ¿ × µbѱ ý ô Î Ï ÷ } ù ÑVÒ , B no)E]× r9 j Ö j u¹9å÷ æå} ù ôT Ö 9$ç Ò ñ (15.216) denote ò i.i.d. copies of the random variable P ô . For every ô µ ÏÎ ÷ } Let Ñù VÒ , byô the strong AEP, 4 "! ôs ô± u 9å÷ Ï æå} ù -õ ù Ò Ñ p (15.217) l l where ôBnq×M rÜÓ as × Ó . Then for every n ô ¿ ×µbÑÜr µ ý ô Î Ï ÷ } ÑVÒ , ù 4 Ï Ò Ñ å æ } õ (15.218) ! ô¶s ô ¿ × µbѱ ô u¹å9 ÷ ù ù 9 ÷ ?n æå} ù æå ô } ù - õ õ Ò ù ù Ò r (15.219) s u å (15.220) s u¹9å÷ p ù Ö where D R A F T q× S ô åôBnq× r¦Ó ân r¦s l September 13, 2001, 6:27pm (15.221) D R A F T 358 as A FIRST COURSE IN INFORMATION THEORY ×d Ó l £ý ô Î Ï ÷ } ù ÑVÒ , from (15.216) and (15.220), we have "! Öí ô n ò ô n´h ô rrÜs ô ¿ ×ÀÒ µb -õ Ñ ± Ò æ} ôT -õ Ò - s noåEF× r 9 j Ö j u ÷ ç u 9å÷ (15.222) ù Ö Ò Ò õ B noåEF× r9 j Ö j ud÷ ç Ò N! ô s ô ¿ ×µbÑ ± (15.223) ¿ s noåEF× r 9 j Ö j u ÷ "! ô s ô ×Àµ ѱ{p (15.224) . For every n ô ¿ ×µ ÑÜr µ where q× è nq× r D ânq× rñ n r¦s (15.225) (15.226) "! Öí ô n ò ô n´h ô rrÜs ô ¿ ×ÀµbѱÀs l for all n ô ¿ ×Ûµ Ѹr µ ý ý ô Î Ï ÷ } ÑVÒ , (15.224) is in fact true for all n ô ¿ × µ|ÑÜr¶µ ù nÞ ô ÷ ¿ ×µ ÑÜr . By Theorem 5.3, (15.227) !Nen ô ¿ × µbÑÜr2µý Î Ï ÷ } ù ô Ö ÑVÒ ± B u¹9å÷ Ò p l l l where nq×'r7 and ¶nq'× r¸Ó as × Ó . Then by summing over all n ô ¿ ×µ ÑÜr¾µ® ý Î Ï ÷ } ôT ÑVÒ in (15.224), we have ù Ö !enqí ô n ò ô?n´hôrr ¿ ×µbÑ¸Ò r2µý Î Ï ÷ } ù ôT Ö ÑVÒ ± B noåE°× r9 j Ö j ud÷ ! en ô ¿ ×µ ÑÜr2µý Î Ï ÷ } ù ôT Ö ÑVÒ ± (15.228) B noåE°× r 9 j Ö j u 9å÷ Ò 9 Ò - ñ (15.229) Since × ª× (cf. (15.167)), we can let ×d be sufficiently small so that l (15.230) ¶nqd× rE nq× rf7 Since ü and the upper bound in (15.229) tends to 0 as ò sufficiently large, for all ³¸µ , » B !« nqí ô n ò ô?n´hôrr Ó « . Thus, when ò is wÎ Ï ÷ } ù ôT Ö ÑVÒ ¬ B 6ñ ¿ × µØÑÜr¾µ ý (15.231) Hence, we have constructed a desired random code with the extraneous constraint that the all the auxiliary random variables have finite alphabets. We now show that when some of the auxiliary random variables do not have finite supports1 , they can be approximated by ones with finite supports while all the independence and functional dependence constraints required 1 These random variables nevertheless are assumed to have finite entropies. D R A F T September 13, 2001, 6:27pm D R A F T Multi-Source Network Coding P 359 continue to hold. Suppose some of the auxiliary random variables ôp%× µmÑ and » k pBn´³p9q\r¾µ|§ do not have finite supports. Without loss of generality, assume that ô take values in the set of positive integers. 
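Before going through the formal truncation argument below, the effect of truncating a random variable with countably infinite support can be seen numerically: restrict the distribution to its first m mass points, renormalize, and the entropy of the truncated variable approaches that of the original. The following is a minimal Python sketch of this behavior; the geometric distribution, the horizon N, and all variable names are our own choices for illustration and are not part of the construction in the text.

```python
from math import log2

def entropy(p):
    """Entropy in bits of a list of probabilities (zero masses contribute zero)."""
    return -sum(x * log2(x) for x in p if x > 0)

# A distribution on the positive integers with finite entropy; a geometric
# distribution with parameter q is used here purely for illustration.
q = 0.3
N = 10000                                  # numerical horizon standing in for infinity
p = [q * (1 - q) ** (k - 1) for k in range(1, N + 1)]
h_full = entropy(p)                        # approximately H(X); the tail beyond N is negligible

for m in (2, 5, 10, 20, 50, 100):
    head = p[:m]
    total = sum(head)
    trunc = [x / total for x in head]      # distribution of X'(m): X conditioned on X <= m
    print(f"m = {m:4d}   H(X'(m)) = {entropy(trunc):.6f}   H(X) = {h_full:.6f}")
```

The printed values approach H(X), which is what Appendix 15.A establishes in general.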
For all × µ Ñ , define a random variable ô nØr which takes values in Q P P sv¯o'p%u\pdp ± (15.232) !dP ô n Ør¦s. ± s ""!d! dP P ôCô µ s. ± ± (15.233) R R for all °µ , i.e., Pþô n Ør is a truncation of P ô up to . It is intuitively clear that nPþô n ØrrÓ nP ô r as Ó « . The proof is given in Appendix 15.A. Let Pþô n Ørp%×µbÑ be mutually independent so that they satisfy (15.46) with all the random variables in (15.46) primed. Now construct random variables Qx» k n Ør inductively by letting Q » k n Ør¦s³Å » k nnP ô n Ør ¿ × µ Ù n´³ÚrrpBnUQ » » n Ør ¿ n´³ p³Úr¶µ·§ rr (15.234) for all n´³p9q\rbµë§ (cf. (15.149)). Then P ô n Ørp%×èµ Ñ and Q » k n ØrpBn´³%p9qer µ § satisfy (15.47) and (15.48) with all the random variables in (15.47) and R(15.48) primed.R Using the exactly the same argument for P ô n Ør , we see that nUQ » k n ØrrÜÓ nUQ » k r as Ó « . Q » k pBn´³p9qerCµ·§ satisfy If P ôp%×µbÑ and â l R (15.49) and (15.50), then there exists 7 such that (15.235) º¾» k 7 nUâ Q »k r D R and nP ôr:E 7¦3 ôñ (15.236) Then for sufficiently large R, we have R k k D º2» 7 nUâ Q » r 7 nUQ » k n Ørr (15.237) R R and nP ô n Ørrf7 nP ôr0E 7 3 ôñ (15.238) In other words, P ô n Ørp%×gµ Ñ and Q » k n ØrpBn´³p9q\r µz§ , whose supports are finite, satisfy (15.49) and (15.50) with all the random variables in (15.49) and (15.50) primed. 2 Therefore, we have proved that all information rate tuples in H are achievable. By means of a time-sharing argument 2 similar 2 ª to the one we used in the proof of Theorem 9.12, we see that if ; and are achievable, then for any rational number between 0 and 1, 2 2 D 2 ª s n å o E r (15.239) ; such that D R A F T September 13, 2001, 6:27pm D R A F T 360 A FIRST COURSE IN INFORMATION THEORY 2 4 is also achievable. As the remark immediately after Definition 15.3 asserts that , o are achievable, then if 2 2 IM=L8JK N 2 is also achievable, we see that every in XY>Zîn/H r words, H» ÷ s XY>îZ n/H r a ·H p s (15.240) is achievable. In other (15.241) and the theorem is proved. APPENDIX 15.A: APPROXIMATION OF RANDOM VARIABLES WITH INFINITE ALPHABETS In this appendix, we prove that "!#!%$& ! as '$)( , where we assume that ù ù !*+( . For every ,.- , define the binary random variable ù 1 / 3 "!0 ù%2 if if - ù%4 (15.A.1) . Consider ù S !50 6 S 6 ù !*+( , 0 ! Ú / ùGX / ù ù Ú e "!#!ZY\[] ù^ / 0.?A ù <L=H> ù <L=;> ù ù 0@?ACBMDNF 0.?ARBMDNF <=H> <L=;> U<L=H> 7KJ 0@?ACBEDGF ù 0@?AS$' 0@?ALBEDNF <L=;> ù (15.A.2) 0@?APO 0@?AS$ (15.A.3) !TO 3 (15.A.4) / "!#! / 3 "!L0_-`! "!0a-bAKYc d!0 A X /X / 3 L < ; = > d!0 AKY\[] "!#! ^ / / 3 0 < =;> "!#! d!f0a-`AgY\ "!0 ! bX / / 3 e L < T = > d!0 AKY\[] "!#!TO ^ <=;> / / 3 , "!#!S$ since "!j0k-`Al$m- . This implies ] [ 0 As h$i( because S 9H: I <=H> S as ?V$W( . Now consider ù Ú 9H: I 6 9H7 : ù <=;> 7KJ As Q$'( , Since Ú 689;7 : ù ù D R A F T ù ù <L=H> / ùb^ "!#! (15.A.6) ù ù [] (15.A.5) 2 / "!#!TO September 13, 2001, 6:27pm ù (15.A.7) ^ / "!#!n$ 3 (15.A.8) D R A F T APPENDIX 15.A: Approximation of Random Variables with Infinite Alphabets 361 In (15.A.7), we further consider ù / S 6 S 6 Ú <=;> 7KJ S ZoK0.?APBMDNF Ú S 7KJ 6 3 (15.A.10) Ab! ZoK0@?ARBEDGF Zog0.?AN! <=;> Ú / ZoK0@?AqBEDGF U<=H> 7KJ r 9H: I 3 "!0 3 Zog0.?AN! <=;> / ARBEDGF <L=;> 3 d!f0 (15.A.12) APO <=;> , the summation above tends to 0 by (15.A.4). Since / 3 3 3 ARBEDGF d!f0 AS$ . 
Therefore, <L=;> <=;> Zo and we see from (15.A.7) that (15.A.11) A <L=H> ZoK0@?ARBEDGF / 7KJLs "!0 Y t$u( / "!0 Zog0@?A <L=;> <L=> <L=> As (15.A.9) <L=;> Yp 9H: I 0 ù 0@?A / 3 A <=;> d!f0 <=;> Ú 9H: I 6 A 0.?ACBEDGF U<=;> / 6lBEDGF 7KJ "!0 0 3 d!0 ù 9H: I 0 ! <=;> 9H: I 0 / 3 d!L0 X / X 3 d!f0 / ! 3 "!0 < =H> zo y d!#!f${ o ! / 3 Av$ 3 , <L=;> 3xw AS$ "!d0 (15.A.13) as Q$h( . PROBLEMS 1. Consider the following network. | ~ ~ | X1X}2 2 2 1 ~ 2 1 [X1X2] 2 [X1X2] 1 2 1 1 1 | | [X1X2] [X1] a) Let n be the rate of information source the max-flow bounds. [X1] . Determine and illustrate b) Are the max-flow bounds achievable? D R A F T September 13, 2001, 6:27pm D R A F T 362 A FIRST COURSE IN INFORMATION THEORY c) Is superposition coding always optimal? 2. Repeat Problem 1 for the following network in which the capacities of all the edges are equal to 1. X 1 X2 [X1X2] [X1X2] [X1] [X1] 3. Consider a disk array with 3 disks. Let , \ , and \ be 3 mutually independent pieces of information to be retrieved from the disk array, and let n , , and be the data to be stored separately in the 3 disks. It is required that can be retrieved from , bCb , \ can be retrieved from NL , dQ\ , and can be retrieved from N N . R R a) Prove that for ShbCb , #z R b) Prove that for R ck ` GK c) Prove that R R GK b GN R , R G \xg R % PNNZ \xz R d) Prove that for ShbCb . R R e) Prove that #g R V k GN R Gg R \xg R f) Prove that Z¢¤£% ¥¢¤£¦ ¢§ D R A F T \xN R ¡ ;N \xN ;N \xN R UKQ R \K September 13, 2001, 6:27pm \xN D R A F T APPENDIX 15.A: Approximation of Random Variables with Infinite Alphabets 363 where ShbCb and for "Qb¡c R % Ng R R K©¬ . R g) Prove that if ©ck if ©ck K© K¨©ª« R R R g R V R R Gg R R \xg \xN Parts d) to g) give constraints on % G , , and in terms of , , and . It was shown in Roche et al. [166] that these constraints are the tightest possible. 4. Generalize the setup in Problem 3 to ® ¯ r R R disks and show that r ¯ V ¢® ± °% ¤°% ± ² Hint: Use the inequalities in Problem 13 in Chapter 2 to prove that for ³ µ´RP¶P¶P¶`®t¬¢ , r ¯ R R r #· ¸g® ¤°% ½ ² ¹ ± °% R r ¾g¿MÀ ¾%À ± ¯ ¼ ¾ ° ® º ¹; » \P¶P¶P¶ ³ Á ¹ ¹;» by induction on ³ , where  is a subset of ÃÄbCP¶P¶P¶`®Å . 5. Determine the regions ÆvÈÇ , Æ+ÉTÊxË , and ÆvÌCÍ for the network in Figure 15.13. 6. Show that if there exists an ³ ¸jxÎ ÐÏÑ`¡CÓÒÕÔNx×Ö Ï (15.A.14) Òaz ¹ code which satisfies (15.71) and (15.73), then there always exists an ¸jxÎ ÐÏÑ`¡CÓÒÕÔNx×ÖRØ Ï ³ (15.A.15) Òaz ¹ code which satisfies (15.71) and (15.73), where Ö use a random coding argument. D R A F T ¹ Ø for all ³ ¢Ö Òa . Hint: ¹ September 13, 2001, 6:27pm D R A F T 364 A FIRST COURSE IN INFORMATION THEORY HISTORICAL NOTES Multilevel diversity coding was introduced by Yeung [215], where it was shown that superposition coding is not always optimal. Roche et al. [166] showed that superposition coding is optimal for symmetrical three-level diversity coding. This result was extended to ® & levels by Yeung and Zhang [219] with a painstaking proof. Hau [92] studied all the one hundred configurations of a three-encoder diversity coding systems and found that superposition is optimal for eighty-six configurations. Yeung and Zhang [220] introduced the distributed source coding model discussed in Section 15.2.2 which subsumes multilevel diversity coding. The regions Ù%ÇÚ and ÙgÇ previously introduced by Yeung [216] for studying information inequalities enabled them to obtain inner and outer bounds on the coding rate region for a variety of networks. 
Distributed source coding is equivalent to multi-source network coding in a two-tier acyclic network. Recently, the results in [220] have been generalized to an arbitrary acyclic network by Song et al. [187], on which the discussion in this chapter is based. Koetter and Médard [112] have developed an algebraic approach to multi-source network coding.

Chapter 16

ENTROPY AND GROUPS

The group is the first major mathematical structure in abstract algebra, while entropy is the most basic measure of information. Group theory and information theory are two seemingly unrelated subjects which turn out to be intimately related to each other. This chapter explains this intriguing relation between these two fundamental subjects. Those readers who have no knowledge in group theory may skip this introduction and go directly to the next section.

Let X_1 and X_2 be any two random variables. Then

H(X_1) + H(X_2) ≥ H(X_1, X_2),    (16.1)

which is equivalent to the basic inequality

I(X_1; X_2) ≥ 0.    (16.2)

Let G be any finite group and G_1 and G_2 be subgroups of G. We will show in Section 16.4 that

|G| |G_1 ∩ G_2| ≥ |G_1| |G_2|,    (16.3)

where |G| denotes the order of G and G_1 ∩ G_2 denotes the intersection of G_1 and G_2 (G_1 ∩ G_2 is also a subgroup of G, see Proposition 16.13). By rearranging the terms, the above inequality can be written as

log (|G| / |G_1|) + log (|G| / |G_2|) ≥ log (|G| / |G_1 ∩ G_2|).    (16.4)

By comparing (16.1) and (16.4), one can easily identify the one-to-one correspondence between these two inequalities, namely that H(X_i) corresponds to log(|G| / |G_i|), i = 1, 2, and H(X_1, X_2) corresponds to log(|G| / |G_1 ∩ G_2|). While (16.1) is true for any pair of random variables X_1 and X_2, (16.4) is true for any finite group G and subgroups G_1 and G_2.

Recall from Chapter 12 that the region Γ*_n characterizes all information inequalities (involving n random variables). In particular, we have shown in Section 14.1 that the region Γ̄*_n is sufficient for characterizing all unconstrained information inequalities, i.e., by knowing Γ̄*_n, one can determine whether any unconstrained information inequality always holds. The main purpose of this chapter is to obtain a characterization of Γ̄*_n in terms of finite groups. An important consequence of this result is a one-to-one correspondence between unconstrained information inequalities and group inequalities. Specifically, for every unconstrained information inequality, there is a corresponding group inequality, and vice versa. A special case of this correspondence has been given in (16.1) and (16.4). By means of this result, unconstrained information inequalities can be proved by techniques in group theory, and a certain form of inequalities in group theory can be proved by techniques in information theory. In particular, the unconstrained non-Shannon-type inequality in Theorem 14.7 corresponds to the group inequality

|G_1 ∩ G_3|^3 |G_1 ∩ G_4|^3 |G_3 ∩ G_4|^3 |G_2 ∩ G_3| |G_2 ∩ G_4|
    ≤ |G_1| |G_3|^2 |G_4|^2 |G_1 ∩ G_2| |G_1 ∩ G_3 ∩ G_4|^4 |G_2 ∩ G_3 ∩ G_4|,    (16.5)

where G_i are subgroups of a finite group G, i = 1, 2, 3, 4. The meaning of this inequality and its implications in group theory are yet to be understood.

16.1 GROUP PRELIMINARIES

In this section, we present the definition and some basic properties of a group which are essential for subsequent discussions.

Definition 16.1 A group is a set of objects G together with a binary operation on the elements of G, denoted by "∘" unless otherwise specified, which satisfy the following four axioms:

1.
Closure For every ÷ , ø in ß , ÷öÓø is also in ß . 2. Associativity For every ÷ , ø , ù in ß , ÷"ö"#øjöÓùzé×÷öÓøPgöVù . 3. Existence of Identity There exists an element ú in ß such that ÷¦öúdÁúzö÷ Á÷ for every ÷ in ß . 4. Existence of Inverse For every that ÷"öãøãÁøjöÓ÷Áú . ûdüò¥ýÑò%þxí¤ðgí×ò¥îµóÄôKõ¤ÿ Proof Let ú and identity element, ú Ø ÷ ß , there exists an element ø in such ß For any group ß , the identity element is unique. be both identity elements in a group ú Ø ö D R A F T in ß ú¦ÁúZ September 13, 2001, 6:27pm . Since ú is an (16.6) D R A F T 367 Entropy and Groups and since ú Ø is also an identity element, (16.7) ú Ø öãúdÁú Ø It follows by equating the right hand sides of (16.6) and (16.7) that which implies the uniqueness of the identity element of a group. ûdüò¥ýÑò%þxí¤ðgí×ò¥îµóÄôKõ ú ú Ø , For every element ÷ in a group ß , its inverse is unique. Proof Let ø and ø Ø be inverses of an element ÷ , so that ÷"öãøãÁøjöÓ÷Áú (16.8) ÷"öãøNØfÁøNØ]öÓ÷ÁúZ (16.9) and Then ø øzöÓú øzöd×÷öãø Ø #øzöÓ÷LKöãøNØ úVö øNØ ø Ø (16.10) (16.11) (16.12) (16.13) (16.14) where (16.11) and (16.13) follow from (16.9) and (16.8), respectively, and (16.12) is by associativity. Therefore, the inverse of ÷ is unique. Thus the inverse of a group element denoted by ÷ . ÷ is a function of ÷ , and it will be ê\ëìCíïîgíñðgíò%îÁóÄôKõ The number of elements of a group ß is called the order of ß , denoted by àßc . If àßc% , ß is called a finite group, otherwise it is called an infinite group. There is an unlimited supply of examples of groups. Some familiar examples are: the integers under addition, the rationals excluding zero under multiplication, and the set of real-valued ½ matrices under addition, where addition and multiplication refer to the usual addition and multiplication for real numbers and matrices. In each of these examples, the operation (addition or multiplication) plays the role of the binary operation “ ö ” in Definition 16.2. All the above are examples of infinite groups. In this chapter, however, we are concerned with finite groups. In the following, we discuss two examples of finite groups in details. D R A F T September 13, 2001, 6:27pm D R A F T 368 A FIRST COURSE IN INFORMATION THEORY ¦ý ò ëóÄôKõ %í¤ðgí×ò¥î òÿ The trivial group consists of only the identity element. The simplest nontrivial group is the group of modulo 2 addition. The order of this group is 2, and the elements are Ãx´RÅ . The binary operation, denoted by “ ,” is defined by following table: + 0 1 0 0 1 1 1 0 The four axioms of a group simply say that certain constraints must hold in the above table. We now check that all these axioms are satisfied. First, the closure axiom requires that all the entries in the table are elements in the group, which is easily seen to be the case. Second, it is required that associativity holds. To this end, it can be checked in the above table that for all ÷ , ø , and ù , ÷"Á#øz ùzé×÷"øPg ù (16.15) For example, ´¦ÁHVµxzµ´¦´ªµ´R (16.16) ×´¦µx¥µ¦hlµ´R (16.17) while which is the same as ´ Hdéx . Third, the element 0 is readily identified as the unique identity. Fourth, it is readily seen that an inverse exists for each element in the group. For example, the inverse of 1 is 1, because (16.18) µlµ´R Thus the above table defines a group of order 2. It happens in this example that the inverse of each element is the element itself, which is not true for a group in general. We remark that in the context of a group, the elements in the group should be regarded strictly as symbols only. 
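Checking the four axioms against a small operation table is a purely mechanical task. The following Python sketch automates the check for the modulo 2 addition table above, keeping the two elements as the bare symbols '0' and '1'; the dictionary encoding of the table is our own and is only one of several possible representations.

```python
from itertools import product

# The modulo 2 addition table of Example 16.5, with the elements
# treated purely as symbols, as remarked above.
G = ['0', '1']
op = {('0', '0'): '0', ('0', '1'): '1',
      ('1', '0'): '1', ('1', '1'): '0'}

# 1. Closure: every entry of the table is again an element of G.
assert all(op[(a, b)] in G for a, b in product(G, G))

# 2. Associativity: a * (b * c) == (a * b) * c for all a, b, c.
assert all(op[(a, op[(b, c)])] == op[(op[(a, b)], c)]
           for a, b, c in product(G, G, G))

# 3. Existence of identity: some e satisfies e * a == a * e == a for all a.
identities = [e for e in G if all(op[(e, a)] == a == op[(a, e)] for a in G)]
assert identities == ['0']

# 4. Existence of inverse: every a has some b with a * b == b * a == identity.
e = identities[0]
assert all(any(op[(a, b)] == e == op[(b, a)] for b in G) for a in G)

print("All four group axioms hold for the modulo 2 addition table.")
```

The same brute-force check applies verbatim to any finite operation table, which is convenient for the small groups used later in this chapter.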
In particular, one should not associate group elements with magnitudes as we do for real numbers. For instance, in the above example, one should not think of 0 as being less than 1. The element 0, however, is a special symbol which plays the role of the identity of the group. We also notice that for the group in the above example, ÷\ø is equal to øÓÁ÷ for all group elements ÷ and ø . A group with this property is called a commutative group or an Abelian group1 . ¦ý ;ûdëfü ð Lðgíò%î"!cüògý# ëóÄôKõ¤ô components of a vector $ 1 The &% '% Consider a permutation of the '%( P¶P¶P¶ (16.19) Abelian group is named after the Norwegian mathematician Niels Henrik Abel (1802-1829). D R A F T September 13, 2001, 6:27pm D R A F T 369 Entropy and Groups )+* $-, given by &%/.10 32 ' %/.10 2 '%/.10 ( 2 é ) where N (16.20) ÃÄbCP¶P¶P¶ CÅ (16.21) P¶P¶P¶x '4 6 Å 5 ') 4 ÏRÃÄbCP¶P¶P¶ is a one-to-one mapping. The one-to-one mapping on ÃÄbCP¶P¶P¶'4CÅ , which is represented by ) ) ) é ) HxN ) ) &4 #ZNP¶P¶P¶] ) For two ) permutations and , define Lö and . For example, for 4µ§ , suppose is called a permutation (16.22) ÄN ) ) as the composite function of ) (16.23) Vé#C§LbZ ) and ) Then ) nö is given by ) ) ) ) ) ) ö ) nö ) nö nö ) ) #Z #Z § ) Hx ) ) ) ) ) x ) x x ) nö Z#Z Z§ ) ) Hx § #Z #Z §L (16.25) ) #CbC§ N (16.26) §LbCbZN (16.27) The reader can easily check that ) Z#Z ) Hx ) or ) (16.24) éH§LbCbZN ö ) which is different from nö . Therefore, the operation “ ö ” is not commutative. We now show that the set of all permutations on ÃÄbCP¶P¶P¶'4 Å and the operation ) “ ö ” form group. First, for two permuta) a group, called ) the permutation ) ) ) tions and , since both and are one-to-one mappings, so) is ) Sö . Therefore, the closure axiom is satisfied. Second, for permutations , , and ) , ) ) nö ) ö ) ) xUH ) ) ¥ö ) ) nö ö ) ) ]; )Z )Z; ) ] Z) ; xKö H (16.28) (16.29) (16.30) (16.31) ) for +é"74 . Therefore, associativity is satisfied. Third, it is clear that the identity map is the identity element. Fourth, for a permutation , it is clear D R A F T September 13, 2001, 6:27pm D R A F T 370 A FIRST COURSE IN INFORMATION THEORY ) ) ) that its inverse is , the inverse mapping of which is defined because is one-to-one. Therefore, the set of all permutations on ÃÄbCP¶P¶P¶'4CÅ and the operation “ ö ” form a group. The order of this group is evidently equal to &498â . ;: ê\ëìCíïîgíñðgíò%îÁóÄôKõ Let ß be a group with operation “ ö ,” and be a subset of ß . If is a group with respect to the operation “ ö ,” then is called a subgroup of ß . < ê\ëìCíïîgíñðgíò%îÁóÄôKõ Let be a subgroup of a group ß and ÷ be an element of . The left coset of with respect to ÷ is the set ÷ªö"'tÃx÷ö ³ Ï ³ ҵŠ. Similarly, the right coset of with respect to ÷ is the set ö ÷Wà ³ ö ÷Ï ³ Ò_Å . ß In the sequel, only the left coset will be used. However, any result which applies to the left coset also applies to the right coset, and vice versa. For simplicity, ÷"ö will be denoted by ÷L . = ûdüò¥ýÑò%þxí¤ðgí×ò¥îµóÄôKõ For ÷ and ÷C in ß , ÷f G and ÷CP are either identical or disjoint. Further, ÷f G and ÷ x are identical if and only if ÷f and ÷C belong to the same left coset of . Proof Suppose ÷f b and in ÷ á ÷ such that are not disjoint. Then there exists an element ÷C ³ øãµ÷ nö ³ Vµ÷Czö ø (16.32) for some ³ in , nhb . Then ³ ÷f Vé×÷ ö ³ gö µ÷ zö" ³ ö ³ where d > ³ ö ³ is in . We now show that ³ ÷ nö in ÷f G , where ³ Ò_ , ÷ ö ³ é×÷ B>Kö ö ³ µ÷ &> öd ¥ö ?>G (16.33) ÷ . For an element µ÷ ö A@ ÷ ³ jµ÷ DC% ö (16.34) where CÕE>¥ö ³ is in . 
This implies that ÷f Sö ³ is in ÷C . Thus, ÷f GF@÷Cx . By symmetry, ÷ xG@&÷ N . Therefore, ÷ GW ÷C . Hence, if ÷f N and ÷ are not disjoint, then they are identical. Equivalently, ÷f b and ÷ are either identical or disjoint. This proves the first part of the proposition. We now prove the second part of the proposition. Since is a group, it contains ú , the identity element. Then for any group element ÷ , ÷a÷ö¦ú is in ÷L because ú is in . If ÷f b and ÷ x are identical, then ÷f Òh÷f G and ÷ dÒÕ÷C µ÷f G . Therefore, ÷ and ÷ belong to the same left coset of . To prove the converse, assume ÷ and ÷ belong to the same left coset of . From the first part of the proposition, we see that a group element belongs D R A F T September 13, 2001, 6:27pm D R A F T 371 Entropy and Groups to one and only one left coset of . Since ÷f is in ÷ G and ÷ ÷ and ÷ belong to the same left coset of , we see that ÷ identical. The proposition is proved. is in ÷C , and and ÷ are IH ûdüò¥ýÑò%þxí¤ðgí×ò¥îµóÄôKõ#ó Let be a subgroup of a group ß and ÷ be an element of ß . Then ÷L R)à , i.e., the numbers of elements in all the left cosets of are the same, and they are equal to the order of . ³ Proof Consider two elements ÷Ðö in such that ³ ÷dö ³ and ÷Ðö ³ µ÷ö in ÷Ðö , where ³ and ³ are (16.35) Then ÷ ×÷ öd×÷ö ³ N öã÷RKö ³ ³ úö ³ ÷ ×÷ úVö ³ ö"×÷ö ³ öV÷R¥ö ³ (16.36) (16.37) (16.38) (16.39) ³ Thus each element in corresponds to a unique element in ÷R Ä à for all ÷vÒaß . ÷R . Therefore, We are just one step away from obtaining the celebrated Lagrange’s Theorem stated below. JLKgëfò%üKëióÄôKõ#óóM'N+O%ü/%îO%ëP¤þJQKgëfò%üKëR à divides àßv If is a subgroup of ß , then . Proof Since ÷ Ò÷R for every ÷ÒÁß , every element of ß belongs to a left coset of . Then from Proposition 16.9, we see that the distinct left cosets of partition ß . Therefore àßc , the total number of elements in ß , is equal to the number of distinct cosets of multiplied by the number of elements in each left coset, which is equal to à by Proposition 16.10. This implies that à divides àßv , proving the theorem. The following corollary is immediate from the proof of Lagrange’s Theorem. S IT%üIU'óÄôKõ#óÄÿ ò%üò tinct left cosets of D R A F T Let be aÀ VSsubgroup of a group À is equal to À WÀ . ß . The number of dis- September 13, 2001, 6:27pm D R A F T 372 A FIRST COURSE IN INFORMATION THEORY 16.2 GROUP-CHARACTERIZABLE ENTROPY FUNCTIONS Recall from Chapter 12 that the region Ù%ÇÚ consists of all the entropy functions in the entropy space XcÇ for ¸ random variables. As a first step toward establishing the relation between entropy and groups, we discuss in this section entropy functions in Ù ÇÚ which can be described by a finite group ß and subgroups ß Nß P¶P¶P¶Nß Ç . Such entropy functions are said to be groupcharacterizable. The significance of this class of entropy functions will become clear in the next section. In the sequel, we will make use of the intersections of subgroups extensively. We first prove that the intersection of two subgroups is also a subgroup. ûdüò¥ýÑò%þxí¤ðgí×ò¥îµóÄôKõ#ó ß" Let ß is also a subgroup of ß . and ß be subgroups of a group ß . Then ß á Proof It suffices to show that ߪ á ß together with the operation “ ö ” satisfy all the axioms of a group. First, consider two elements ÷ and ø of ß in ߪ á ß" . Since both ÷ and ø are in ß , ×÷vöÐøU is in ß . Likewise, ×÷vöøP is in ß" . Therefore, ÷\ödø is in ß á ß . Thus the closure axiom holds for ß á ß . Second, associativity for ߪ á ß" inherits from ß . 
Third, ߪ and ß" both contain the identity element because they are groups. Therefore, the identity element is in ߪ á ß" . Fourth, for an element ÷+Òaßd , since ß" is a group, ÷ is in ßd , b . Thus for an element ÷ÕÒ ßª á ß , ÷ is also in ߪ á ß" . Therefore, ß á ß" is a group and hence a subgroup of ß . S IT%üIU'óÄôKõ#óY ò%üò á Ç ¤°% ß" Let ߪ PNßP¶P¶P¶NßdÇ be subgroups of a group is also a subgroup of ß . ß In the rest of the chapter, we let ZÇhÃÄbCP¶P¶P¶¸SÅ and denote á \ [ ± , where ² is a nonempty subset of ZÇ . ß Në6]{óÄôKõ#óI Ò ² Let ß be subgroups of a group ß and ÷ . Then ± ßd be elements of by ß , . Then á \[ ± ÷Ä¡ßdZ « if ^ \[ ± ÷ÄßdDE _ ` otherwise. ± àß ´ Proof For the special case when ÒaZÇ , (16.40) reduces to ² is a singleton, i.e., ÷ ¡ßdHÄ (16.40) ² àßd`â ÃÅ for some (16.41) which has already been proved in Proposition 16.10. D R A F T September 13, 2001, 6:27pm D R A F T 373 Entropy and Groups Gb 2 cG cG 1,2 1 Figure 16.1. The membership table for a finite group d and subgroups d and dfe . s Let ² be any nonempty subset of ZÇ . If ^ &[ ± ÷ ¡ßddg` , then (16.40) is _ ` , then there exists %QÒj^ \[ ± ÷Äßd such that obviously true. If ^ &[ ± ÷ ¡ßdhi for all Ò ² , %@µ÷ Lö ³ ; (16.42) where ³ nÒ ßd . For any Ò ² %ö?k\é×÷ and for any kvÒ_ß ö ³ ± , consider Dk\µ÷Äö" ³ LöBkLN Kö (16.43) Since both ³ and k are in ß" , ³ göRk is in ßd . Thus %@öRk is in ÷Äßd for all ² ¦Ò , or %völk is in ^ \[ ± ÷Äßd . Moreover, for k'k Ø Òß ± , if %cö]kÕ7%vömk Ø , then kÕAk Ø . Therefore, each element in ß ± corresponds to a unique element in ^ &[ ± ÷ ß . Hence, (16.44) án&[ ± ÷ ß ]8àß ± â proving the lemma. The relation between a finite group ß and subgroups ߪ and ß is illustrated by the membership table in Figure 16.1. In this table, an element of ß is represented by a dot. The first column represents the subgroup ߪ , with the dots in the first column being the elements in ߪ . The other columns represent the left cosets of ß . By Proposition 16.10, all the columns have the same number of dots. Similarly, the first row represents the subgroup ß and the other rows represent the left cosets of ß . Again, all the rows have the same number of dots. The upper left entry in the table represents the subgroup ß á ß . There are àߪ á ß dots in this entry, with one of them representing the identity element. Any other entry represents the intersection between a left coset of ߪ and a left coset of ß , and by Lemma 16.15, the number of dots in each of these entries is either equal to àߪ á ß] or zero. D R A F T September 13, 2001, 6:27pm D R A F T 374 A FIRST COURSE IN INFORMATION THEORY s on n o nH (Y)p ty 2 . . n o nH (Xq p ) 2 rx sS o n . [X] S [Y] . . . . q p) n o nH (X,Y 2 t ) s S o nq (x,y . . . . . . . . [XY] Figure 16.2. A two-dimensional strong typicality array. Since all the column have the same numbers of dots and all the rows have the same number of dots, we say that the table in Figure 16.1 exhibits a quasiuniform structure. We have already seen a similar structure in Figure 5.1 for the two-dimensional strong joint typicality array, which we reproduce in Figure 16.2. In this array, when ¸ is large, all the columns have approximately the same number of dots and all the rows have approximately the same number of dots. For this reason, we say that the two-dimensional strong typicality array exhibits an asymptotic quasi-uniform structure. In a strong typicality array, however, each entry can contain only one dot, while in a membership table, each entry can contain multiple dots. 
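The quasi-uniform structure just described can be verified directly on a small example. The following Python sketch takes G to be the group of all permutations of three objects and G1, G2 to be two subgroups of order 2 (an illustrative choice of ours, not an example from the text); it tabulates the intersections of the left cosets of G1 with those of G2 and confirms that every cell contains either |G1 ∩ G2| elements or none, while every column contains |G1| elements and every row contains |G2| elements.

```python
from itertools import permutations

def compose(p, q):
    """(p ∘ q)[i] = p[q[i]]: apply q first, then p."""
    return tuple(p[q[i]] for i in range(len(q)))

# G is the group of all permutations of {0, 1, 2} (order 3! = 6).
G = list(permutations(range(3)))

# Two subgroups of order 2, each generated by a transposition.
G1 = [(0, 1, 2), (1, 0, 2)]          # {identity, swap of 0 and 1}
G2 = [(0, 1, 2), (2, 1, 0)]          # {identity, swap of 0 and 2}
G12 = [g for g in G1 if g in G2]     # G1 ∩ G2 = {identity}

def left_cosets(G, H):
    """Distinct left cosets aH of the subgroup H in G."""
    cosets = []
    for a in G:
        aH = frozenset(compose(a, h) for h in H)
        if aH not in cosets:
            cosets.append(aH)
    return cosets

cols = left_cosets(G, G1)            # columns of the membership table
rows = left_cosets(G, G2)            # rows of the membership table

# Each cell holds either |G1 ∩ G2| elements or none; each column holds |G1|
# elements in total, and each row holds |G2| elements in total.
for r in rows:
    print([len(r & c) for c in cols])
assert all(len(r & c) in (0, len(G12)) for r in rows for c in cols)
assert all(sum(len(r & c) for r in rows) == len(G1) for c in cols)
assert all(sum(len(r & c) for c in cols) == len(G2) for r in rows)
print("Table size:", len(rows), "rows x", len(cols), "columns")
```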
One can make a similar comparison between a strong joint typicality array for any ¸ random variables and the membership table for a finite group with ¸ subgroups. The details are omitted here. JLKgëfò%üKëióÄôKõ#óÄô Let uZÇ be subgroups of a group ß . Then vÒXcÇ w ßd;ãÒ defined by ± ä¤å]æ àßc àß (16.45) ± for all nonempty subset ² of ZÇ is entropic, i.e., v Ò Ù ÇÚ . Proof It suffices to show that there exists a collection of random variables \P¶P¶P¶Ç such that x ± z äå]æ àßc ± àß (16.46) ² for y all nonempty subset of ZÇ . We first introduce a uniform random variable defined on the sample space ß with probability mass function z|{ y à D R A F T Á÷Åd àßc September 13, 2001, 6:27pm (16.47) D R A F T 375 Entropy and Groups y for all ÷+Òaß . Fory any zÒ}ZÇ , let random variable be a function of such that µ÷Lß if µ÷ . Let ² y be a nonempty subset of ZÇ . Since Ó÷ #ß" for all "Ò ² if and only if is equal to some ølÒ áS\[ ± ÷ ß , z+{ ² Ã¥µ÷ ¡ßdnÏ]Ò '^ & [ ± Å ÷ ¡ßd àßc À LÀ À SÀ ~ V/ V (16.48) _ ` if ^ &[ ± ÷ ß otherwise ´ (16.49) ² ;ÕÒ distributes uniformly on its by Lemma 16.15. In other words, À VSÀ support whose cardinality is À V/RÀ . Then (16.46) follows and the theorem is proved. ê\ëìCíïîgíñðgíò%îÁóÄôKõ#ó : w Let ß be a finite group and ß Nß P¶P¶P¶xNß Ç be subgroups À VSÀ äå]æ À V À for all nonempty subsets ² of of ß . Let v be a vector in XvÇ . If ± ZÇ , then ßNß UP¶P¶P¶NßdÇL is a group characterization of v . Theorem 16.16 asserts that certain entropy functions in ÙnÇÚ have a group characterization. These are called group-characterizable entropy functions, which will be used in the next section to obtain a group characterization of the region Ù ÇÚ . We end this section by giving a few examples of such entropy functions. ¦ý I< ëóÄôKõ#ó v Ò X+ by Fix any subset of Z@ w ± äå]æ « ÃÄbCb Å and define a vector _ ` if ² á FE otherwise. ´ (16.50) We now show that v has a group characterization. Let ßu Ãx´RÅ be the group of modulo 2 addition in Example 16.5, and for ShbCb , let if zÒ otherwise. Ãx´CÅ ß"K« ß (16.51) _ ` , there exists an in ² such Then for a nonempty subset ² of Z@ , if ² á EA that is also in , and hence by definition ß WÃx´CÅ . Thus, ± ß &[ ± ß (16.52) 'Ãx´CÅZ Therefore, äå]æ àßv àß D R A F T ± äå]æ àßc MÃx´CÅC äå]æ äå]æ C September 13, 2001, 6:27pm (16.53) D R A F T 376 If ² A FIRST COURSE IN INFORMATION THEORY á a` , then ßdgß for all Ò ± ß ² &[ , and (16.54) ß"Kß ± Therefore, ä¤å]æ àßc ± àß ä¤å]æ àßc äå]æ àßc (16.55) µ´R Then we see from (16.50), (16.53), and (16.55) that w ± ¦ý ä¤å]æ àßc àß of Z@ . Hence, ² for all nonempty subset terization of v . (16.56) ± ßNߪ NßNß" is a group charac- I= ëóÄôKõ#ó This is a generalization of the last example. Fix any nonempty subset of ZÇ and define a vector vÒXvÇ by w äå]æ ± « if ² á "E _ ` otherwise. ´ (16.57) Then ßNߪ NßP¶P¶P¶xNßdÇR is a group characterization of v , where group of modulo 2 addition, and ß"K Ãx´CÅ « ß if zÒ otherwise. ß is the (16.58) By letting 7` , vW´ . Thus we see that ßNߪ NßP¶P¶P¶NßdÇL is a group characterization of the origin of XcÇ , with ß{ß {ß" W¶P¶P¶Cß"Ç . ¦ý H ëóÄôKõ¤ÿ v Ò X+ as follows: Define a vector w ± Let ßNߪ NßNß D R A F T (16.59) âbZN be the group of modulo 2 addition, ß{ ß Ã ×´R`´ÄNxH`´ÄbÅ ß Ã ×´R`´ÄNx×´RxbÅ Ã ×´R`´ÄNxHxbÅZ ß Then Eag` ² ½ , and (16.60) (16.61) (16.62) is a group characterization of v . 
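The construction used in the proof of Theorem 16.16 can be checked mechanically for the group characterization in the last example: take a random variable distributed uniformly on G, let X_i be the left coset of G_i containing it, and compare the resulting joint entropies with log(|G| / |G_α|). The following Python sketch performs this check by exact enumeration; the variable names are ours, and entropies are computed in bits.

```python
from itertools import combinations, chain
from math import log2
from collections import Counter

# G = {0,1}^2 under componentwise modulo 2 addition, with the three
# subgroups of the last example.
G  = [(0, 0), (0, 1), (1, 0), (1, 1)]
G1 = [(0, 0), (1, 0)]
G2 = [(0, 0), (0, 1)]
G3 = [(0, 0), (1, 1)]
subgroups = {1: G1, 2: G2, 3: G3}

def add(a, b):
    return tuple((x + y) % 2 for x, y in zip(a, b))

def coset(a, H):
    """The left coset aH, as a frozenset; this is the value of X_i when the
    uniform group element equals a."""
    return frozenset(add(a, h) for h in H)

def entropy(counts, total):
    return -sum((c / total) * log2(c / total) for c in counts.values())

n = len(subgroups)
alphas = chain.from_iterable(combinations(range(1, n + 1), r) for r in range(1, n + 1))
for alpha in alphas:
    # The group element is uniform on G; X_alpha collects the cosets (aG_i : i in alpha).
    joint = Counter(tuple(coset(a, subgroups[i]) for i in alpha) for a in G)
    h_emp = entropy(joint, len(G))
    # Theorem 16.16 predicts log(|G| / |G_alpha|), where G_alpha is the
    # intersection of G_i over i in alpha.
    G_alpha = [g for g in G if all(g in subgroups[i] for i in alpha)]
    h_grp = log2(len(G) / len(G_alpha))
    print(alpha, round(h_emp, 6), round(h_grp, 6))
    assert abs(h_emp - h_grp) < 1e-9
```

For this example the check confirms h_1 = h_2 = h_3 = 1 and h_12 = h_13 = h_23 = h_123 = 2 bits, in agreement with |G| = 4, |G_i| = 2, and |G_i ∩ G_j| = 1.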
September 13, 2001, 6:27pm D R A F T 377 Entropy and Groups 16.3 A GROUP CHARACTERIZATION OF ÇÚ We have introduced in the last section the class of entropy functions in Ù ÇÚ which have a group characterization. However, an entropy function vWÒµÙ ÇÚ may not have a group characterization due to the following observation. Suppose vQÒ Ù ÇÚ . Then there exists a collection of random variables \P¶P¶P¶x w Ç such that x ± ± (16.63) for all nonempty subset ² of ZÇ . If tion of v , then x ßNß UP¶P¶P¶NßdÇL ± j äå]æ is a group characteriza- àßc (16.64) ± àß x for all nonempty subset of Z Ç . Since both àßc and àß ± are integers, ± must be the logarithm of a rational number. However, the joint entropy of a set of random variables in general is not necessarily the logarithm of a rational number (see Corollary 2.44). Therefore, it is possible to construct an entropy function v Ò Ù ÇÚ which has no group characterization. Although vÒ.Ù ÇÚ does not imply v has a group characterization, it turns out that the set of all v¢Ò Ù%ÇÚ which have a group characterization is almost good enough to characterize the region Ù ÇÚ , as we will see next. ê\ëìCíïîgíñðgíò%îÁóÄôKõ¤ÿgó Define the following region in XvÇ : v XcÇvÏv has a group characterizationÅZ (16.65) Ç'Ã Ò By Theorem 16.16, if vµÒXvÇ has a group characterization, then vµÒÙnÇÚ . Therefore, Ç@)Ù ÇÚ . We will prove as a corollary of the next theorem that ù¸ ÇL , the convex closure of Ç , is in fact equal to Ù ÇÚ , the closure of Ù%ÇÚ . JLKgëfò%üKëióÄôKõ¤ÿÿ For any v ä 0 ( 2 Ev that ('L ( . ÒÕÙ ÇÚ , there exists a sequence à 0( 2 Å in Ç such We need the following lemma to prove this theorem. The proof of this lemma resembles the proof of Theorem 5.9. Nevertheless, we give a sketch of the proof for the sake of completeness. Në6]{óÄôKõ¤ÿ Let be a random variable such that and the distribution Ã'S&%bÅ is rational, i.e., n&% is a rational number for all %Ò . Without loss of generality, assume n&% is a rational number with denominator for all % ÒM . Then for 4Ð b b P¶P¶P¶ , ä (h 4 D R A F T äå]æ 4 8 &4n&%8 x ©N September 13, 2001, 6:27pm (16.66) D R A F T 378 A FIRST COURSE IN INFORMATION THEORY Proof Applying Lemma 5.10, we can obtain ä 94 8 &4n&%K8 a 4 r xa ä 4 a 4 ¬ x ä 4 ä B4 ¡ g&4 4¢ ä xa B4 (16.67) (16.68) 4-¢ as 4L5£ . On the other hand, we can obtain _ ¤S&%g ä µxn¬ 94 8 &4n&%K8 r ä ©¥ This upper bound tends to 4 ä n&%K n&%Kg ¬ 4 ¢ ¤n&%Kg 4 ¢ ä ¬ B4 4 (16.69) © as 4L5£ . Then the proof is completed This lower bound also tends to by changing the base of the logarithm if necessary. Proof of Theorem 16.22 For any vÒÕÙnÇÚ , there exists a collection of random variables P\P¶P¶P¶Ç such that w ± x ± (16.70) for all nonempty subset ² of ZÇ . We first consider the special case that for all ÒZ Ç and the joint distribution of P¶P ¶P¶ Ç is ra j¡ 0( 2 Å tional. We want to show that there exists a sequence à in Ç such that ä 0( 2 . (h ( Ev Denote &[ ± by ± . For any nonempty subset ² of ZÇ , let ¥ ± be the marginal distribution of ± . Assume without loss of generality that for any nonempty subset ² of Z Ç and for all ÷vÒM ± , ¥ ± ×÷L is a rational number with denominator . For each 4Ð b b P¶P¶P¶ , fix a sequence $¦D§ ¦D§T¨ ¦?§I¨ '% P¶P¶P¶% ( ¦?§T¨ ¨ ¦D§ ¦ § bCP¶P¶P¶'4F% ¦ § where for$ all &f % ÏÒ©ZÇR¢$Òª , such ¦ ×§ ÷K ¦ § of occurrences of ÷ in sequence that « , is equal , the number ×÷L for all ÷ÁÒ¬ § to 4I¥ . The existence of such a¦Dsequence is guaran &% ¦D§1¨ P are rational numteed by that all the values of the joint distribution of . 
Also,¨ we denote ¨ ¨ the sequence$ of 4 elements of ± , bers¨ with¨ denominator &% ± '% ± P¶P¶P¶% ± ( , where $ % ± &%Ñ ÏÒ ² , by ± . Let ÷ Òj ± . It is $ easy to check that « ×÷ ± , the number of occurrences of ÷ in the sequence ± , is equal to 4I¥ ± ×÷R for all ÷vÒ ± . D R A F T September 13, 2001, 6:27pm D R A F T 379 Entropy and Groups Let ß be the group of permutations on ÃÄbCP¶P¶P¶'4CÅ . The group G depends on 4 , but for simplicity, we do not state this dependency explicitly. For any zÒ®ZÇ , define )+* $ , ) ß"KWà ÒaßhÏ )+* $ , where $ ¨ ¨ &% T. 0 32 ' % 1. 0 2 é ;ÅZ ¨ '% 1. 0 ( 2 P¶P¶P¶ (16.71) N It is easy to check that ß" is a subgroup of ß . Let ² be a nonempty subset of ZÇ . Then ß ß &[ ± ) ) *$ , $ + TÅ Ã Ò ßéÏ & [ ) ± )+* $ , $ à ) ÒaßéÏ )+* $ , $ for all Ò ± (16.72) à )+* $ where , ± For any ÷cÒM ± ± ÒaßéÏ &% é ¨ ± ¨ ± .10 2 ¨ '% ± P¶P¶P¶ define the set ¯+° ² (16.74) (16.75) Å ± ÅZ .T0 32 ' % (16.73) '4 ¯+° ± )+* $ Á÷ÅZ , ± . Then ± ×÷L contains the “locations” of ) ÷ in ¯+° ¯+° for all ÷vÒM ± , \Ò ×÷L implies ÓÒ ×÷L . Since ¯+° E«×÷K $ ×÷RPZ and therefore àßc ± àß ±4I¥ ± j ² & 4T¥ ³ 1[ ´ ± Z àß (h 4 ä¤å]æ àßc ± àß D R A F T if and only if (16.78) 8 (16.79) 8 ± ×÷L (16.80) w x ± z ± Recall that ß and hence all its subgroups depend on 4 . Define µ 0( 2 ± ± ± ×÷LN By Lemma 16.23, ä (16.77) $ ± ×÷R 498 ³ [1´ &4I¥ (16.76) N ¨ T% ×÷Lz'ÃNcÒ©ÃÄbCP¶P¶P¶ CÅÏ $ .T0 ( 2 äå]æ àßv àß ± September 13, 2001, 6:27pm (16.81) 0( 2 by (16.82) D R A F T 380 A FIRST COURSE IN INFORMATION THEORY for all nonempty subset ² of ZÇ . Then 0( 2 Ò and Ç ä 0 ( 2 Ev (h 4 (16.83) We have already proved the theorem for the special case that v is the entropy function of a collection of random variables P\]P¶P¶P¶Ç with finite alphabets and a rational joint distribution. To complete the proof, we only have to 0¶ 2 Å note that for any ä vQÒ Ù%ÇÚ , it is always possible to construct a sequence Ãv 0 ¶ 0 ¶ 2 2 in Ù ÇÚ such that ¶ L v ·v , where v is the entropy function of a 0¸¶ 2 0¸¶ 2 0¶ 2 collection of random variables P¶P¶P¶x Ç with finite alphabets and a rational joint distribution. This can be proved by techniques similar to those used in Appendix 15.A together with the continuity of the entropy function. The details are omitted here. S IT%üIU'óÄôKõ¤ÿ ò%üò ¤ ù ¸ Ç j Ù ÇÚ . Proof First of all, ǹ@{Ù ÇÚ . By taking convex closure, we have ù¤¸ ÇR6@ ù¸×ÙnÇÚ . By Theorem 14.5, Ù ÇÚ is convex. Therefore, ù¤¸×Ù%ÇÚ Ó Ù ÇÚ , and we Ú have ù¤¸ ÇR}@ Ù Ç . On the other hand, we have shown in Example 16.19 that the origin of X Ç has a group characterization and therefore is in Ç . It Ú then follows from Theorem 16.22 that Ù Ç @ ù¸ ÇL . Hence, we conclude that Ù ÇÚ ù¤¸ ÇL , completing the proof. 16.4 INFORMATION INEQUALITIES AND GROUP INEQUALITIES We have proved in Section 14.1 that an unconstrained information inequality º¼» v always holds if and only if @ÁÃvÒXvÇcÏ ÇÚ Ù (16.84) k´ º » v (16.85) ¢´CÅZ In other words, all unconstrained information inequalities are fully character ized by Ù ÇÚ . We also have proved at the end of the last section that ù¤¸ ÇRz Ù ÇÚ . Since Ç®@¢Ù ÇÚ @ Ù ÇÚ , if (16.85) holds, then }@ÁÃv ¹ Ò XcÇcÏ Ç º» v ¢´CÅZ º » v On the other hand, if (16.86) holds, since Ãv ÒMXcÇÏ convex, by taking convex closure in (16.86), we obtain Ù ÇÚ ù ¸ D@ÁÃv Ò XcÇvÏ ÇL º » v (16.86) µ´CÅ ¢´CÅZ is closed and (16.87) Therefore, (16.85) and (16.86) are equivalent. D R A F T September 13, 2001, 6:27pm D R A F T 381 Entropy and Groups Now (16.86) is equivalent to º » Since vÒ Ç if and only if v ¢´ for all v Ò . 
Ç (16.88) w ± ä¤å]æ àßc àß (16.89) ± for all nonempty subset ² of ZÇ for some finite group ß and subgroups ß U ß"L¶P¶P¶NßdÇ , we see that the inequality (16.84) holds for all random variables w à\¶P¶P¶àÇ if and only if the inequality obtained from (16.84) by replacing À VSÀ äå]æ ± by À V À for all nonempty subset ² of ZÇ holds for all finite group ß and subgroups ߪ PNß¶P¶P¶xNßdÇ . In other words, for every unconstrained information inequality, there is a corresponding group inequality, and vice versa. Therefore, inequalities in information theory can be proved by methods in group theory, and inequalities in group theory can be proved by methods in information theory. In the rest of the section, we explore this one-to-one correspondence between information theory and group theory. We first give a group-theoretic proof of the basic inequalities in information theory. At the end of the section, we will give an information-theoretic proof for the group inequality in (16.5). ê\ëìCíïîgíñðgíò%îÁóÄôKõ¤ÿ Let ߪ and ß be subgroups of a finite group ߪ %ö ßãWÃx÷"öÓølÏZ÷vÒaß and ø¦Òaß"ÅZ ß . Define (16.90) is in general not a subgroup of ß . However, it can be shown that is a subgroup of ß if ß is Abelian (see Problem 1). ß jödß" ß ö ß ûdüò¥ýÑò%þxí¤ðgí×ò¥îµóÄôKõ¤ÿfô Let ߪ and ß be subgroups of a finite group ß . Then àߪ àß"Ä àߪ nöãß"ZZ Proof Fix ×÷f U`÷CxvÒ{ߪ ½ #ø bøGVÒaß ß" ½ ß àߪ , Then á ß"Z is in ÷ Vö"÷ (16.91) ß Vöß" . Consider any such that (16.92) øx nöãøGãÁ÷ nöÓ÷C We will determine the number of tion. From (16.92), we have ø #ø D R A F T öd#øx nöãøGP in ß #øx bøGx ø öãø göÓø ø øG ø ß" which satisfies this rela- ö"×÷ nöV÷Cx ½ öÓ÷ öÓ÷ öÓ÷ %öÓ÷C September 13, 2001, 6:27pm (16.93) (16.94) (16.95) D R A F T 382 A FIRST COURSE IN INFORMATION THEORY Then ø öÓ÷ Áø öÓ÷ ö×÷ öÓ÷ zÁø öÓ÷ (16.96) Let ½ be this common element in ß , i.e., ½\ÁøGöÓ÷ ø (16.97) öV÷f Since ø öã÷ Ò ß and ø öÓ÷ Ò©ß , ½ is in ß ná ß . In other words, for given ×÷f `÷CxVÒ ßª ½ ß" , if #øx bøNxÓÒaߪ ½ ß satisfies (16.92), then #øx bøNx satisfies (16.97) for some ½+Òaߪ á ß" . On the other hand, if #øx PbøNxÓÒaߪ ½ ß satisfies (16.97) for some ½+Òaß á ß" , then (16.96) is satisfied, which implies (16.92). Therefore, for given ×÷ `÷CxÒaߪ ½ ß , #øx bøGVÒaß ½ ß" satisfies (16.92) if and only if #ø bø satisfies (16.97) for some ½Òaß %á ß . Now from (16.97), we obtain ¾½ ¾½ ø x z ÐöÓ÷ (16.98) and ¾½ ½ÐöÓ÷C (16.99) øG] where we have written øx and øG as øx ¾½ and øGZ¾½ to emphasize their dependence on ½ . Now consider ½¿½ Ø Ò ßª á ß such that ¾½ ¾½ ¾½ ¾½ #øx NbøGÄ zé#øx Since ø x¾½zø ¾½ Ø Ø NbøGZ (16.100) Ø N , from (16.98), we have ¾½ ÐöÓ÷ which implies ¾½ é CØÄöÓ÷ (16.101) ½\E½CØï (16.102) Therefore, each ½+Òaߪ á ß" corresponds to a unique pair #øx UbøGxVÒ_ߪ ½ ß which satisfies (16.92). Therefore, we see that the number of distinct elements in ß nöãß" is given by àß öãß Z ½ àß àß á ß"Z ß"Z àß xàßZ àߪ á ßZ (16.103) completing the proof. JLKgëfò%üKëióÄôKõ¤ÿ: Let ß Nß , and ß be subgroups of a finite group ß . Then àß]àß ¡ÄC Wàß ¡]àßÄâ D R A F T September 13, 2001, 6:27pm (16.104) D R A F T 383 Entropy and Groups Proof First of all, á ߪ ¡ ß á hߪ ßP á á ß" á ß"z{ߪ á ß" ß ßª ¡ (16.105) By Proposition 16.26, we have àߪ ¡öãß"Ä] It is readily seen that ߪ ¡zö ß" àß ¡zö àߪ ¡Äàß"Z (16.106) àߪ ¡Ä is a subset of ß" , Therefore, ¡ àß àß ß"ZZ (16.107) WàßÄâ àß ¡Ä The theorem is proved. S IT%üIU'óÄôKõ¤ÿ< ò%üò For random variables P\ , and \ , Ý (16.108) Þ \Ä \xV ¢´R Proof Let ߪ UNß" , and ß" be subgroups of a finite group ß . 
Then (16.109) àß"Äàߪ ¡Ä hàߪ ¡Zàß"Ä by Theorem 16.27, or àßv àß ¡ àß àßv àß àß ¡ (16.110) This is equivalent to äå]æ àßc àß ¡Z äå]æ àßc àßÄ äå]æ àßv àß"Z äå]æ àßc àߪ ¡Z (16.111) This group inequality corresponds to the information inequality x \xK x which is equivalent to \\ Ý x \xg x \\xN Þ \Ä \xV ¢´R (16.112) (16.113) The above corollary shows that all the basic inequalities in information theory has a group-theoretic proof. Of course, Theorem 16.27 is also implied by the basic inequalities. As a remark, the inequality in (16.3) is seen to be a special case of Theorem 16.27 by letting ß ß . D R A F T September 13, 2001, 6:27pm D R A F T 384 A FIRST COURSE IN INFORMATION THEORY We are now ready to prove the group inequality in (16.5). The non-Shannontype inequality we have proved in Theorem 14.7 can be expressed in canonical form as x x GK x l§ x x è g x\xg x x \x¥ x \\xg x x \x¥ è è x xèg \è (16.114) \èN which corresponds to the group inequality äå]æ àßv àß x l§ äå]æ äå]æ àß ¡] àßv àß ¡Hè äå]æ äå]æ àß äå]æ àßv äå]æ àßZ àßv àß"è àßc àß"HèÄ àßv äå]æ àßc ¡ àßv äå]æ àßv àß ä¤å]æ è àßv Hè àß äå]æ àßc àß (16.115) àß"Hè Upon rearranging the terms, we obtain àߪ á 'àߪ àß ßZ á àߪ á ß"Zàß"Z ßdèÄ àßdèÄ àß àß á á ßdèÄ ß" á àß" á ßdèÄ è ß]àß á àß á ß" ßdèÄ á ßdèÄâ (16.116) which is the group inequality in (16.5). The meaning of this inequality and its implications in group theory are yet to be understood. PROBLEMS 1. Let ߪ and ß be subgroups of a finite group subgroup if ß is Abelian. ß . Show that ߪ zölß is a 2. Let Àg and Àf be group characterizable entropy functions. a) Prove that Á_ ¿Àg z"Á.¤Àf is group characterizable, where Á_ and Á. are any positive integers. b) For any positive real numbers ÷f and ÷C , construct a sequence of group 0¸¶ 2 characterizable entropy functions for ½cébCP¶P¶P¶ such that ä ; ¶ h where v_µ÷f ¿ÀK % D R A F T ¸0 ¶ 2 0¸¶ 2 v v Ó ÂÀ . ÷C f September 13, 2001, 6:27pm D R A F T 385 Entropy and Groups 3. Let ßNߪ PNßP¶P¶P¶NßdÇL be a group characterization of À&Ò ÙnÇÚ , where À is the entropy function for random variables \P¶P¶P¶xÇ . Fix any nonempty subset ² of ZÇ , and wà define v à by jÄ T± Å ÆÄ à w à ± ¬ for subsets of Z Ç . It can easily be checked that x all nonempty ± . Show that ×®Õ`®. x`®cP¶P¶P¶x`®\Ç is a group characterization of v , where ® ß ± and ® {ß á ß ± . 4. Let ßNߪ PNß"P¶P¶P¶xNß"ÇR be a group characterization of ÀkÒQÙ%ÇÚ , where À is the entropy function for random variables P¶P¶P¶x Ç . Show that if ² is a function of ÏcÒ , then ß ± is a subgroup of ßd . 5. Let ß UNßNß" be subgroups of a finite group ß . Prove that á àßcàߪ ß á ß"Z Wàß á á ß"Zàß" ßàß á ß"Zâ Hint: Use the information-theoretic approach. w w w 6. Let vQÒÙ%Ú be the entropy function for random variables and \ such l ¡ , i.e. and \ are independent. Let ßNß PNß" be a that n ¯ group characterization of v , and define a mapping ÏRß ½ ß 5uß by ¯ ×÷ÑbøPzµ÷"öãø ¯ a) Prove that the mapping is onto, i.e., for any element exists ×÷ÑbøUãÒ ßª ½ ß such that ÷döÓø Áù . b) Prove that ߪ %ö ß" is a group. w w ùaÒ'ß , there w 7. Denote an entropy function v ÒWÙ Ú by ¡ . Construct a group characterization for each of the following entropy functions: a) v b) v% c) v% äå]æ `´R C Z äå]æ äå]æ é×´R C Z äå]æ äå]æ ä¤] å æ é C C é äå]æ Verify that Ù functions. Z . is the minimal convex set containing the above three entropy w w w w w w w 8. Denote an entropy function v5ÒiÙ Ú by ¡ ¡ ¡ . 
Construct a group characterization for each of the following entropy functions: a) vz é b) v% é äå]æ äå]æ D R A F T äå]æ äå]æ äå]æ C`´R`´R `´R C C Z äå]æ ä¤] å æ äå]æ äå]æ äå]æ C C`´R C C C Z September 13, 2001, 6:27pm D R A F T 386 A FIRST COURSE IN INFORMATION THEORY v c) % d) v è äå]æ é é äå]æ äå]æ C äå]æ C ä¤å]æ C ä¤å]æ C äå]æ C äå]æ C äå]æ C äå]æ §L ä¤å]æ C ä¤å]æ §L äå]æ C Z äå]æ §L . § 9. Ingleton inequality Let ß be a finite Abelian group and ß Nß Nß , and ß è be subgroups of ß . Let ßNߪ PNß"NßNßdèx be a group characterization of À , where À is the entropy function for random variables U\Z\ , and è . Prove the following statements: a) ná ß Hint: Show that gö"ß ß ßª á öãß %á è PCWàß ß D@µßª á ßPKöߪ %á ßdèx ß á ß"zö ö è P ß . ß"èx b) àß ö ß àߪ àß"zö è C á ß"èZàß á àß á ß" á ß"Zàߪ ß"èZ ß"èÄ c) àß ¥ö ß"zöãß"zö àߪ nö ß"èZR ßjöãß"èÄàß"öãß"öãß"èÄ àß è öãß d) àß %ö ß"zö ßjö ßdèÄ á àß àß"Äàß"Zàß"è àߪ á ß" á ß"èÄàß" á ß" ß"èÄ àß á ß"Zàߪ á ßdèZàß" á ß]àß á ßdèÄàß" á ßdèZ àß á ß"Zàߪ á ßdèZàß" á ß]àß á ßdèÄàß" á ßdèZ á ßdèZâ e) àßZàßdè àߪ f) x x x ¡ ¥ \x¥ x x á ßZàߪ è ¥ èg x x x á ß" á K ¡xg x ß"èÄàß" x á ß Hè K ¡Hèg x x Hè \HèN where ¡Hè denotes \è , etc. g) Is the inequality in f) implied by the basic inequalities? And does it always hold? Explain. The Ingleton inequality [99] (see also [149]) was originally obtained as a constraint on the rank functions of vector spaces. The inequality in e) was obtained in the same spirit by Chan [38] for subgroups of a finite group. The inequality in f) is referred to as the Ingleton inequality for entropy in the literature. (See also Problem 7 in Chapter 14.) D R A F T September 13, 2001, 6:27pm D R A F T Entropy and Groups 387 HISTORICAL NOTES The results in this chapter are due to Chan and Yeung [40], whose work was inspired by a one-to-one correspondence between entropy and quasi-uniform arrays previously established by Chan [38] (also Chan [39]). Romashchenko et al. [168] have developed an interpretation of Kolmogorov complexity similar to the combinatorial interpretation of entropy in Chan [38]. D R A F T September 13, 2001, 6:27pm D R A F T Bibliography [1] J. Abrahams, “Code and parse trees for lossless source encoding,” Comm. Inform. & Syst., 1: 113-146, 2001. [2] N. Abramson, Information Theory and Coding, McGraw-Hill, New York, 1963. [3] Y. S. Abu-Mostafa, Ed., Complexity in Information Theory, Springer-Verlag, New York, 1988. [4] J. AczÈ Ç l and Z. DarÉ Ç czy, On Measures of Information and Their Characterizations, Academic Press, New York, 1975. [5] R. Ahlswede, B. Balkenhol and L. Khachatrian, “Some properties of fix-free codes,” preprint 97-039, Sonderforschungsbereich 343, UniversitË Ê t Bielefeld, 1997. [6] R. Ahlswede, N. Cai, S.-Y. R. Li, and R. W. Yeung, “Network information flow,” IEEE Trans. Inform. Theory, IT-46: 1204-1216, 2000. [7] R. Ahlswede and J. KÉ Ê rner, “Source coding with side information and a converse for degraded broadcast channels,” IEEE Trans. Inform. Theory, IT-21: 629-637, 1975. [8] R. Ahlswede and I. Wegener, Suchprobleme, Teubner Studienbcher. B. G. Teubner, Stuttgart, 1979 (in German). English translation: Search Problems, Wiley, New York, 1987. [9] R. Ahlswede and J. Wolfowitz, “The capacity of a channel with arbitrarily varying cpf’s and binary output alphabet,” Zeitschrift fÌ Ê r Wahrscheinlichkeitstheorie und verwandte Gebiete, 15: 186-194, 1970. [10] P. Algoet and T. M. Cover, “A sandwich proof of the Shannon-McMillan-Breiman theorem,” Ann. Prob., 16: 899-909, 1988. [11] S. 
Amari, Differential-Geometrical Methods in Statistics, Springer-Verlag, New York, 1985. [12] J. B. Anderson and S. Mohan, Source and Channel Coding: An Algorithmic Approach, Kluwer Academic Publishers, Boston, 1991. D R A F T September 13, 2001, 6:27pm D R A F T 390 A FIRST COURSE IN INFORMATION THEORY [13] S. Arimoto, “Encoding and decoding of Í -ary group codes and the correction system,” Information Processing in Japan, 2: 321-325, 1961 (in Japanese). [14] S. Arimoto, “An algorithm for calculating the capacity of arbitrary discrete memoryless channels,” IEEE Trans. Inform. Theory, IT-18: 14-20, 1972. [15] S. Arimoto, “On the converse to the coding theorem for discrete memoryless channels,” IEEE Trans. Inform. Theory, IT-19: 357-359, 1973. [16] R. B. Ash, Information Theory, Interscience, New York, 1965. [17] E. Ayanoglu, R. D. Gitlin, C.-L. I, and J. Mazo, “Diversity coding for transparent selfhealing and fault-tolerant communication networks,” 1990 IEEE International Symposium on Information Theory, San Diego, CA, Jan. 1990. [18] A. R. Barron, “The strong ergodic theorem for densities: Generalized ShannonMcMillan-Breiman theorem,” Ann. Prob., 13: 1292-1303, 1985. [19] L. A. Bassalygo, R. L. Dobrushin, and M. S. Pinsker, “Kolmogorov remembered,” IEEE Trans. Inform. Theory, IT-34: 174-175, 1988. [20] T. Berger, Rate Distortion Theory: A Mathematical Basis for Data Compression, Prentice-Hall, Englewood Cliffs, New Jersey, 1971. [21] T. Berger, “Multiterminal source coding,” in The Information Theory Approach to Communications, G. Longo, Ed., CISM Courses and Lectures #229, Springer-Verlag, New York, 1978. [22] T. Berger and R. W. Yeung, “Multiterminal source coding with encoder breakdown,” IEEE Trans. Inform. Theory, IT-35: 237-244, 1989. [23] E. R. Berlekamp, “Block coding for the binary symmetric channel with noiseless, delayless feedback,” in H. B. Mann, Error Correcting Codes, Wiley, New York, 1968. [24] E. R. Berlekamp, Ed., Key Papers in the Development of Coding Theory, IEEE Press, New York, 1974. [25] C. Berrou, A. Glavieux, and P. Thitimajshima, “Near Shannon limit error-correcting coding and decoding: Turbo codes,” Proceedings of the 1993 International Conferences on Communications, 1064-1070, 1993. [26] D. Blackwell, L. Breiman, and A. J. Thomasian, “The capacities of certain channel classes under random coding,” Ann. Math. Stat., 31: 558-567, 1960. [27] R. E. Blahut, “Computation of channel capacity and rate distortion functions,” IEEE Trans. Inform. Theory, IT-18: 460-473, 1972. [28] R. E. Blahut, “Information bounds of the Fano-Kullback type,” IEEE Trans. Inform. Theory, IT-22: 410-421, 1976. [29] R. E. Blahut, Theory and Practice of Error Control Codes, Addison-Wesley, Reading, Massachusetts, 1983. [30] R. E. Blahut, Principles and Practice of Information Theory, Addison-Wesley, Reading, Massachusetts, 1987. D R A F T September 13, 2001, 6:27pm D R A F T 391 BIBLIOGRAPHY [31] R. E. Blahut, D. J. Costello, Jr., U. Maurer, and T. Mittelholzer, Ed., Communications and Cryptography: Two Sides of One Tapestry, Kluwer Academic Publishers, Boston, 1994. [32] C. Blundo, A. De Santis, R. De Simone, and U. Vaccaro, “Tight bounds on the information rate of secret sharing schemes,” Designs, Codes and Cryptography, 11: 107-110, 1997. [33] R. C. Bose and D. K. Ray-Chaudhuri, “On a class of error correcting binary group codes,” Inform. Contr., 3: 68-79, Mar. 1960. [34] L. Breiman, “The individual ergodic theorems of information theory,” Ann. Math. Stat., 28: 809-811, 1957. [35] M. 