Probability and Random Processes
for Electrical Engineers
John A. Gubner
University of Wisconsin{Madison
Discrete Random Variables
Bernoulli(p)
}(X = 1) = p; }(X = 0) = 1 , p.
E[X] = p; var(X) = p(1 , p); GX (z) = (1 , p) + pz.
binomial(n;p)
}(X = k) = n pk (1 , p)n,k ; k = 0; : : :; n.
k
E[X] = np; var(X) = np(1 , p); GX (z) = [(1 , p) + pz]n.
geometric0(p)
}(X = k) = (1 , p)pk ; k = 0; 1; 2; : :::
p ; var(X) = p ; G (z) = 1 , p .
X
1,p
(1 , p)2
1 , pz
geometric1(p)
}(X = k) = (1 , p)pk,1; k = 1; 2; 3; : : ::
1 ; var(X) = p ; G (z) = (1 , p)z .
E[X] =
X
1,p
(1 , p)2
1 , pz
negative binomial or Pascal(m;p)
}(X = k) = k , 1 (1 , p)m pk,m; k = m; m + 1; : : ::
m,1
m ; var(X) = mp ; G (z) = (1 , p)z m .
E[X] =
X
1,p
(1 , p)2
1 , pz
Note that Pascal(1; p) is the same as geometric1(p).
E[X] =
Poisson()
k ,
}(X = k) = e ; k = 0; 1; : : ::
k!
E[X] = ; var(X) = ; GX (z) = e(z,1) .
Fourier Transforms
Fourier Transform
H(f) =
Inversion Formula
h(t) =
h(t)
I[,T;T ] (t)
2W sin(2Wt)
2Wt
Z1
,1
Z1
,1
h(t)e,j 2ft dt
H(f)ej 2ft df
H(f)
Tf)
2T sin(2
2 Tf
I[,W;W ] (f)
Tf) 2
(1 , jtj=T )I[,T;T ](t) T sin(
Tf
sin(Wt) 2
W Wt
(1 , jf j=W)I[,W;W ] (f)
1
e,t u(t)
+ j2f
2
e,jtj
2
+ (2f)2
e,2jf j
2 + t2
p
e,(t=)2 =2
2 e,2 (2f )2 =2
Preface
Intended Audience
This book contains enough material to serve as a text for a two-course
sequence in probability and random processes for electrical engineers. It is also
useful as a reference by practicing engineers.
For students with no background in probability and random processes, a
rst course can be oered either at the undergraduate level to talented juniors
and seniors, or at the graduate level. The prerequisite is the usual undergraduate electrical engineering course on signals and systems, e.g., Haykin and Van
Veen [18] or Oppenheim and Willsky [30] (see the Bibliography at the end of
the book).
A second course can be oered at the graduate level. The additional prerequisite is some familiarity with linear algebra; e.g., matrix-vector multiplication,
determinants, and matrix inverses. Because of the special attention paid to
complex-valued Gaussian random vectors and related random variables, the
text will be of particular interest to students in wireless communications. Additionally, the last chapter, which focuses on self-similar processes, long-range
dependence, and aggregation, will be very useful for students in communication
networks who are interested modeling Internet trac.
Material for a First Course
In a rst course, Chapters 1{5 would make up the core of any oering. These
chapters cover the basics of probability and discrete and continuous random
variables. Following Chapter 5, additional topics such as wide-sense stationary
processes (Sections 6.1{6.5), the Poisson process (Section 8.1), discrete-time
Markov chains (Section 9.1), and condence intervals (Sections 12.1{12.4) can
also be included. These topics can be covered independently of each other, in
any order, except that Problem 15 in Chapter 12 refers to the Poisson process.
Material for a Second Course
In a second course, Chapters 7{8 and 10{11 would make up the core, with
additional material from Chapters 6, 9, 12, and 13 depending on student preparation and course objectives. For review purposes, it may be helpful at the beginning of the course to assign the more advanced problems from Chapters 1{5
that are marked with a ? .
Features
Those parts of the book mentioned above as being suitable for a rst course
are written at a level appropriate for undergraduates. More advanced problems
and sections in these parts of the book are indicated by a ? . Those parts of the
book not indicated as suitable for a rst course are written at a level suitable
iii
iv
Preface
for graduate students. Throughout the text, there are numerical superscripts
that refer to notes at the end of each chapter. These notes are usually rather
technical and address subtleties of the theory.
The last section of each chapter is devoted to problems. The problems are
separated by section so that all problems relating to a particular section are
clearly indicated. This enables the student to refer to the appropriate part
of the text for background relating to particular problems, and it enables the
instructor to make up assignments more quickly.
Tables of discrete random variables and of Fourier transform pairs are found
inside the front cover. A table of continuous random variables is found inside
the back cover.
When cdfs or other functions are encountered that do not have a closed form,
Matlab commands are given for computing them. For a list, see \Matlab" in
the index.
The index was compiled as the book was being written. Hence, there are
many pointers to specic information. For example, see \noncentral chi-squared
random variable."
With the importance of wireless communications, it is vital for students to
be comfortable with complex random variables, especially complex Gaussian
random variables. Hence, Section 7.5 is devoted to this topic, with special
attention to the circularly symmetric complex Gaussian. Also important in this
regard are the central and noncentral chi-squared random variables and their
square roots, the Rayleigh and Rice random variables. These random variables,
as well as related ones such as the beta, F, Nakagami, and student's t, appear
in numerous problems in Chapters 3{5 and 7.
Chapter 10 gives an extensive treatment of convergence in mean of order
p. Special attention is given to mean-square convergence and the Hilbert space
of square-integrable random variables. This allows us to prove the projection
theorem, which is important for establishing the existence of certain estimators,
and conditional expectation in particular. The Hilbert space setting allows us
to dene the Wiener integral, which is used in Chapter 13 on advanced topics
to construct fractional Brownian motion.
We give an extensive treatment of condence intervals in Chapter 12, not
only for normal data, but also for more arbitrary data via the central limit
theorem. The latter is important for any situation in which the data is not
normal; e.g., sampling for quality control, election polls, etc. Much of this
material can be covered in a rst course. However, the last subsection on
derivations is more advanced, and uses results from Chapter 7 on Gaussian
random vectors.
With the increasing use of self-similar and long-range dependent processes
for modeling Internet trac, we provide in Chapter 13 an introduction to these
developments. In particular, fractional Brownian motion is a standard example
of a continuous-time self-similar process with stationary increments. Fractional
autoregressive integrated moving average (FARIMA) processes are developed
as examples of important discrete-time processes in this context.
July 19, 2002
Table of Contents
1 Introduction to Probability
1.1 Review of Set Notation : : : : : : : : : : : : : :
1.2 Probability Models : : : : : : : : : : : : : : : : :
1.3 Axioms and Properties of Probability : : : : : :
Consequences of the Axioms : : : : : : : : : :
1.4 Independence : : : : : : : : : : : : : : : : : : : :
Independence for More Than Two Events : : :
1.5 Conditional Probability : : : : : : : : : : : : : :
The Law of Total Probability and Bayes' Rule
1.6 Notes : : : : : : : : : : : : : : : : : : : : : : : :
1.7 Problems : : : : : : : : : : : : : : : : : : : : : :
2 Discrete Random Variables
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
2.1 Probabilities Involving Random Variables : : : : : : : : : : : : :
Discrete Random Variables : : : : : : : : : : : : : : : : : : : :
Integer-Valued Random Variables : : : : : : : : : : : : : : : :
Pairs of Random Variables : : : : : : : : : : : : : : : : : : : :
Multiple Independent Random Variables : : : : : : : : : : : :
Probability Mass Functions : : : : : : : : : : : : : : : : : : : :
2.2 Expectation : : : : : : : : : : : : : : : : : : : : : : : : : : : : : :
Expectation of Functions of Random Variables, or the Law of
the Unconscious Statistician (LOTUS) : : : : : : : : : : :
? Derivation of LOTUS : : : : : : : : : : : : : : : : : : : : : :
Linearity of Expectation : : : : : : : : : : : : : : : : : : : : :
Moments : : : : : : : : : : : : : : : : : : : : : : : : : : : : : :
Probability Generating Functions : : : : : : : : : : : : : : : :
Expectations of Products of Functions of Independent Random Variables : : : : : : : : : : : : : : : : : : : : : : : :
Binomial Random Variables and Combinations : : : : : : : : :
Poisson Approximation of Binomial Probabilities : : : : : : :
2.3 The Weak Law of Large Numbers : : : : : : : : : : : : : : : : : :
Uncorrelated Random Variables : : : : : : : : : : : : : : : : :
Markov's Inequality : : : : : : : : : : : : : : : : : : : : : : : :
Chebyshev's Inequality : : : : : : : : : : : : : : : : : : : : : :
Conditions for the Weak Law : : : : : : : : : : : : : : : : : :
2.4 Conditional Probability : : : : : : : : : : : : : : : : : : : : : : :
The Law of Total Probability : : : : : : : : : : : : : : : : : :
The Substitution Law : : : : : : : : : : : : : : : : : : : : : : :
2.5 Conditional Expectation : : : : : : : : : : : : : : : : : : : : : : :
Substitution Law for Conditional Expectation : : : : : : : : :
Law of Total Probability for Expectation : : : : : : : : : : : :
v
1
2
5
12
12
16
16
19
19
22
24
33
33
36
36
38
39
42
44
47
47
48
49
50
52
54
57
57
58
60
60
61
62
63
66
69
70
70
vi
Table of Contents
2.6 Notes : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 72
2.7 Problems : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 74
3 Continuous Random Variables
85
3.1 Denition and Notation : : : : : : : : : : : : : : : : : : : : : : : 85
The Paradox of Continuous Random Variables : : : : : : : : : 90
3.2 Expectation of a Single Random Variable : : : : : : : : : : : : : 90
Moment Generating Functions : : : : : : : : : : : : : : : : : : 94
Characteristic Functions : : : : : : : : : : : : : : : : : : : : : 96
3.3 Expectation of Multiple Random Variables : : : : : : : : : : : : 98
Linearity of Expectation : : : : : : : : : : : : : : : : : : : : : 99
Expectations of Products of Functions of Independent Random Variables : : : : : : : : : : : : : : : : : : : : : : : : 99
3.4 ? Probability Bounds : : : : : : : : : : : : : : : : : : : : : : : : : 100
3.5 Notes : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 103
3.6 Problems : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 104
4 Analyzing Systems with Random Inputs
4.1 Continuous Random Variables : : : : : : : : : :
? The Normal CDF and the Error Function :
4.2 Reliability : : : : : : : : : : : : : : : : : : : : :
4.3 Cdfs for Discrete Random Variables : : : : : :
4.4 Mixed Random Variables : : : : : : : : : : : :
4.5 Functions of Random Variables and Their Cdfs
4.6 Properties of Cdfs : : : : : : : : : : : : : : : :
4.7 The Central Limit Theorem : : : : : : : : : : :
Derivation of the Central Limit Theorem : :
4.8 Problems : : : : : : : : : : : : : : : : : : : : :
5 Multiple Random Variables
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
117
: 118
: 123
: 123
: 126
: 127
: 130
: 134
: 137
: 139
: 142
155
5.1 Joint and Marginal Probabilities : : : : : : : : : : : : : : : : : : 155
Product Sets and Marginal Probabilities : : : : : : : : : : : : 155
Joint and Marginal Cumulative Distributions : : : : : : : : : : 157
5.2 Jointly Continuous Random Variables : : : : : : : : : : : : : : : 158
Marginal Densities : : : : : : : : : : : : : : : : : : : : : : : : 159
Specifying Joint Densities : : : : : : : : : : : : : : : : : : : : 162
Independence : : : : : : : : : : : : : : : : : : : : : : : : : : : 163
Expectation : : : : : : : : : : : : : : : : : : : : : : : : : : : : 163
? Continuous Random Variables That Are not Jointly Continuous : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 164
5.3 Conditional Probability and Expectation : : : : : : : : : : : : : : 164
5.4 The Bivariate Normal : : : : : : : : : : : : : : : : : : : : : : : : 168
5.5 ? Multivariate Random Variables : : : : : : : : : : : : : : : : : : 172
The Law of Total Probability : : : : : : : : : : : : : : : : : : 175
5.6 Notes : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 177
July 19, 2002
Table of Contents
vii
5.7 Problems : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 178
6 Introduction to Random Processes
185
7 Random Vectors
223
6.1 Mean, Correlation, and Covariance : : : : : : : : : : : : : : : : : 188
6.2 Wide-Sense Stationary Processes : : : : : : : : : : : : : : : : : : 189
Strict-Sense Stationarity : : : : : : : : : : : : : : : : : : : : : 189
Wide-Sense Stationarity : : : : : : : : : : : : : : : : : : : : : 191
Properties of Correlation Functions and Power Spectral Densities : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 193
6.3 WSS Processes through Linear Time-Invariant Systems : : : : : 195
6.4 The Matched Filter : : : : : : : : : : : : : : : : : : : : : : : : : : 199
6.5 The Wiener Filter : : : : : : : : : : : : : : : : : : : : : : : : : : 201
? Causal Wiener Filters : : : : : : : : : : : : : : : : : : : : : : 203
?
6.6 Expected Time-Average Power and the Wiener{
Khinchin Theorem : : : : : : : : : : : : : : : : : : : : : : : : : : 206
Mean-Square Law of Large Numbers for WSS Processes : : : 208
6.7 ? Power Spectral Densities for non-WSS Processes : : : : : : : : : 210
Derivation of (6.22) : : : : : : : : : : : : : : : : : : : : : : : : 211
6.8 Notes : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 212
6.9 Problems : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 214
7.1 Mean Vector, Covariance Matrix, and Characteristic Function : : 223
7.2 The Multivariate Gaussian : : : : : : : : : : : : : : : : : : : : : : 226
The Characteristic Function of a Gaussian Random Vector : : 227
For Gaussian Random Vectors, Uncorrelated Implies Independent : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 228
The Density Function of a Gaussian Random Vector : : : : : 229
7.3 Estimation of Random Vectors : : : : : : : : : : : : : : : : : : : 230
Linear Minimum Mean Squared Error Estimation : : : : : : : 230
Minimum Mean Squared Error Estimation : : : : : : : : : : : 233
7.4 Transformations of Random Vectors : : : : : : : : : : : : : : : : 234
7.5 Complex Random Variables and Vectors : : : : : : : : : : : : : : 236
Complex Gaussian Random Vectors : : : : : : : : : : : : : : : 238
7.6 Notes : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 239
7.7 Problems : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 240
8 Advanced Concepts in Random Processes
8.1 The Poisson Process : : : : : : : : : : : : : : : : : : : : : : : :
? Derivation of the Poisson Probabilities : : : : : : : : : : : :
Marked Poisson Processes : : : : : : : : : : : : : : : : : : :
Shot Noise : : : : : : : : : : : : : : : : : : : : : : : : : : : :
8.2 Renewal Processes : : : : : : : : : : : : : : : : : : : : : : : : :
8.3 The Wiener Process : : : : : : : : : : : : : : : : : : : : : : : :
Integrated White-Noise Interpretation of the Wiener Process
July 19, 2002
249
: 249
: 253
: 255
: 256
: 256
: 257
: 258
viii
Table of Contents
The Problem with White Noise : : : : : : : : : : : :
The Wiener Integral : : : : : : : : : : : : : : : : : : :
Random Walk Approximation of the Wiener Process
8.4 Specication of Random Processes : : : : : : : : : : : :
Finitely Many Random Variables : : : : : : : : : : :
Innite Sequences (Discrete Time) : : : : : : : : : : :
Continuous-Time Random Processes : : : : : : : : :
8.5 Notes : : : : : : : : : : : : : : : : : : : : : : : : : : : :
8.6 Problems : : : : : : : : : : : : : : : : : : : : : : : : : :
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
: 260
: 260
: 261
: 263
: 263
: 266
: 269
: 270
: 270
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
: 279
: 281
: 282
: 284
: 287
: 288
: 289
: 291
: 294
: 294
Convergence in Mean of Order p : : : : : : : : : : : : :
Normed Vector Spaces of Random Variables : : : : : : :
The Wiener Integral (Again) : : : : : : : : : : : : : : :
Projections, Orthonality Principle, Projection Theorem
Conditional Expectation : : : : : : : : : : : : : : : : : :
Notation : : : : : : : : : : : : : : : : : : : : : : : : :
10.6 The Spectral Representation : : : : : : : : : : : : : : : :
10.7 Notes : : : : : : : : : : : : : : : : : : : : : : : : : : : :
10.8 Problems : : : : : : : : : : : : : : : : : : : : : : : : : :
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
: 299
: 303
: 307
: 308
: 311
: 313
: 313
: 316
: 316
9 Introduction to Markov Chains
9.1 Discrete-Time Markov Chains : : : : : : : : : : : : : :
State Space and Transition Probabilities : : : : : :
Examples : : : : : : : : : : : : : : : : : : : : : : : :
Stationary Distributions : : : : : : : : : : : : : : :
Derivation of the Chapman{Kolmogorov Equation :
Stationarity of the n-step Transition Probabilities :
9.2 Continuous-Time Markov Chains : : : : : : : : : : : :
Kolmogorov's Dierential Equations : : : : : : : : :
Stationary Distributions : : : : : : : : : : : : : : :
9.3 Problems : : : : : : : : : : : : : : : : : : : : : : : : :
10 Mean Convergence and Applications
10.1
10.2
10.3
10.4
10.5
11 Other Modes of Convergence
11.1 Convergence in Probability : : : : : : : :
11.2 Convergence in Distribution : : : : : : : :
11.3 Almost Sure Convergence : : : : : : : : :
The Skorohod Representation Theorem
11.4 Notes : : : : : : : : : : : : : : : : : : : :
11.5 Problems : : : : : : : : : : : : : : : : : :
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
12.1
12.2
12.3
12.4
299
323
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
: 324
: 325
: 330
: 335
: 337
: 337
The Sample Mean : : : : : : : : : : : : : : : : : : :
Condence Intervals When the Variance Is Known :
The Sample Variance : : : : : : : : : : : : : : : : : :
Condence Intervals When the Variance Is Unknown
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
: 345
: 347
: 349
: 350
12 Parameter Estimation and Condence Intervals
July 19, 2002
:
:
:
:
:
:
279
345
Table of Contents
ix
Applications : : : : : : : : : : : : : : : : : :
Sampling with and without Replacement : :
12.5 Condence Intervals for Normal Data : : : : :
Estimating the Mean : : : : : : : : : : : : :
Limiting t Distribution : : : : : : : : : : : :
Estimating the Variance | Known Mean : :
Estimating the Variance | Unknown Mean
? Derivations : : : : : : : : : : : : : : : : : :
12.6 Notes : : : : : : : : : : : : : : : : : : : : : : :
12.7 Problems : : : : : : : : : : : : : : : : : : : : :
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
: 351
: 352
: 353
: 353
: 355
: 355
: 357
: 358
: 360
: 362
13 Advanced Topics
365
Bibliography
Index
393
397
13.1 Self Similarity in Continuous Time : : : : : : : : : : : : : : : : : 365
Implications of Self Similarity : : : : : : : : : : : : : : : : : : 366
Stationary Increments : : : : : : : : : : : : : : : : : : : : : : : 367
Fractional Brownian Motion : : : : : : : : : : : : : : : : : : : 368
13.2 Self Similarity in Discrete Time : : : : : : : : : : : : : : : : : : : 369
Convergence Rates for the Mean-Square Law of Large Numbers : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 370
Aggregation : : : : : : : : : : : : : : : : : : : : : : : : : : : : 371
The Power Spectral Density : : : : : : : : : : : : : : : : : : : 373
Notation : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 374
13.3 Asymptotic Second-Order Self Similarity : : : : : : : : : : : : : : 375
13.4 Long-Range Dependence : : : : : : : : : : : : : : : : : : : : : : : 380
13.5 ARMA Processes : : : : : : : : : : : : : : : : : : : : : : : : : : : 383
13.6 ARIMA Processes : : : : : : : : : : : : : : : : : : : : : : : : : : 384
13.7 Problems : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 387
July 19, 2002
x
Table of Contents
July 19, 2002
CHAPTER 1
Introduction to Probability
If we toss a fair coin many times, then we expect that the fraction of heads
should be close to 1=2, and we say that 1=2 is the \probability of heads." In
fact, we might try to dene the probability of heads to be the limiting value
of the fraction of heads as the total number of tosses tends to innity. The
diculty with this approach is that there is no simple way to guarantee that
the desired limit exists.
An alternative approach was developed by A. N. Kolmogorov in 1933.
Kolmogorov's idea was to start with an axiomatic denition of probability and
then deduce results logically as consequences of the axioms. One of the successes
of Kolmogorov's approach is that under suitable assumptions, it can be proved
that the fraction of heads in a sequence of tosses of a fair coin converges to
1=2 as the number of tosses increases. You will meet a simple version of this
result in the weak law of large numbers in Chapter 2. Generally speaking,
a law of large numbers is a theorem that gives conditions under which the
numerical average of a large number of measurements converges in some sense
to a probability or other parameter specied by the underlying mathematical
model. Other laws of large numbers are discussed in Chapters 10 and 11.
Another celebrated limit result of probability theory is the central limit
theorem, which you will meet in Chapter 4. The central limit theorem says
that when you add up a large number of random perturbations, the overall
eect has a Gaussian or normal distribution. For example, the central limit
theorem explains why thermal noise in ampliers has a Gaussian distribution.
The central limit theorem also accounts for the fact that the speed of a particle
in an ideal gas has the Maxwell distribution.
In addition to the more famous results mentioned above, in this book you
will learn the \tools of the trade" of probability and random processes. These
include probability mass functions and densities, expectation, transform methods, random processes, ltering of processes by linear time-invariant systems,
and more.
Since probability theory relies heavily on the use of set notation and set
theory, a brief review of these topics is given in Section 1.1. In Section 1.2, we
consider a number of simple physical experiments, and we construct mathematical probability models for them. These models are used to solve several sample
problems. Motivated by our specic probability models, in Section 1.3, we introduce the general axioms of probability and several of their consequences.
The concepts of statistical independence and conditional probability are introduced in Sections 1.4 and 1.5, respectively. Section 1.6 contains the notes that
are referenced in the text by numerical superscripts. These notes are usually
rather technical and can be skipped by the beginning student. However, the
1
2
Chap. 1 Introduction to Probability
notes provide a more in-depth discussion of certain topics that may be of interest to more advanced readers. The chapter concludes with problems for the
reader to solve. Problems and sections marked by a ? are intended for more
advanced readers.
1.1. Review of Set Notation
Let be a set of points. If ! is a point in , we write ! 2 . Let A and B
be two collections of points in . If every point in A also belongs to B, we say
that A is a subset of B, and we denote this by writing A B. If A B and
B A, then we write A = B; i.e., two sets are equal if they contain exactly
the same points.
Set relationships can be represented graphically in Venn diagrams. In
these pictures, the whole space is represented by a rectangular region, and
subsets of are represented by disks or oval-shaped regions. For example, in
Figure 1.1(a), the disk A is completely contained in the oval-shaped region B,
thus depicting the relation A B.
If A , and ! 2 does not belong to A, we write ! 2= A. The set of all
such ! is called the complement of A in ; i.e.,
Ac := f! 2 : ! 2= Ag:
This is illustrated in Figure 1.1(b), in which the shaded region is the complement
of the disk A.
The empty set or null set of is denoted by 6 ; it contains no points of
. Note that for any A , 6 A. Also, c = 6 .
A
B
A
Ac
(b)
(a)
Figure 1.1. (a) Venn diagram of A B . (b) The complement of the disk A, denoted by
Ac , is the shaded part of the diagram.
The union of two subsets A and B is
A [ B := f! 2 : ! 2 A or ! 2 B g:
Here \or" is inclusive; i.e., if ! 2 A [ B, we permit ! to belong to either A or
B or both. This is illustrated in Figure 1.2(a), in which the shaded region is
the union of the disk A and the oval-shaped region B.
The intersection of two subsets A and B is
A \ B := f! 2 : ! 2 A and ! 2 B g;
July 18, 2002
1.1 Review of Set Notation
3
hence, ! 2 A \ B if and only if ! belongs to both A and B. This is illustrated
in Figure 1.2(b), in which the shaded area is the intersection of the disk A and
the oval-shaped region B. The reader should also note the following special
case. If A B (recall Figure 1.1(a)), then A \ B = A. In particular, we always
have A \ = A.
A
B
A
B
(b)
(a)
Figure 1.2. (a) The shaded region is A [ B . (b) The shaded region is A \ B .
The set dierence operation is dened by
B n A := B \ Ac ;
i.e., B n A is the set of ! 2 B that do not belong to A. In Figure 1.3(a), B n A
is the shaded part of the oval-shaped region B.
Two subsets A and B are disjoint or mutually exclusive if A \ B = 6 ;
i.e., there is no point in that belongs to both A and B. This condition is
depicted in Figure 1.3(b).
A
B
A
B
(b)
(a)
Figure 1.3. (a) The shaded region is B n A. (b) Venn diagram of disjoint sets A and B .
Using the preceding denitions, it is easy to see that the following properties
hold for subsets A; B, and C of .
The commutative laws are
A [ B = B [ A and A \ B = B \ A:
The associative laws are
A \ (B \ C) = (A \ B) \ C and A [ (B [ C) = (A [ B) [ C:
July 18, 2002
4
Chap. 1 Introduction to Probability
The distributive laws are
A \ (B [ C) = (A \ B) [ (A \ C)
and
A [ (B \ C) = (A [ B) \ (A [ C):
DeMorgan's laws are
(A \ B)c = Ac [ B c and (A [ B)c = Ac \ B c :
We next consider innite collections of subsets of . Suppose An ,
n = 1; 2; : : :: Then
1
[
n=1
An := f! 2 : ! 2 An for some n 1g:
In other words, ! 2 1
n=1 An if and only if for at least one integer n 1,
! 2 An . This denition admits the possibility that ! 2 An for more than one
value of n. Next, we dene
1
\
n=1
S
An := f! 2 : ! 2 An for all n 1g:
T
In other words, ! 2 1
n=1 An if and only if ! 2 An for every positive integer n.
Example 1.1. Let denote the real numbers, = (,1; 1). Then the
following innite intersections and unions can be simplied. Consider the intersection
1
\
(,1; 1=n) = f! : ! < 1=n; for all n 1g:
n=1
Now, if ! < 1=n for all n 1, then ! cannot be positive; i.e., we must have
! 0. Conversely, if ! 0, then for all n 1, ! 0 < 1=n. It follows that
1
\
n=1
(,1; 1=n) = (,1; 0]:
Consider the innite union,
1
[
(,1; ,1=n] = f! : ! ,1=n; for some n 1g:
n=1
Now, if ! ,1=n for some n 1, then we must have ! < 0. Conversely, if
! < 0, then for large enough n, ! ,1=n. Thus,
1
[
n=1
(,1; ,1=n] = (,1; 0):
July 18, 2002
1.2 Probability Models
5
In a similar way, one can show that
1
\
n=1
as well as
1
[
[0; 1=n) = f0g;
1
\
(,1; n] = (,1; 1) and
n=1
(,1; ,n] = 6 :
n=1
The following generalized distributive laws also hold,
B\
and
B[
[
1
n=1
\
1
n=1
An =
An =
1
[
(B \ An );
n=1
1
\
(B [ An ):
n=1
We also have the generalized DeMorgan's laws,
\
1
n=1
and
[
1
n=1
An
An
c
c
=
=
1
[
n=1
1
\
n=1
Acn ;
Acn :
Finally, we will need the following denition. We say that subsets An ; n =
1; 2; : : :; are pairwise disjoint if An \ Am = 6 for all n 6= m.
1.2. Probability Models
Consider the experiment of tossing a fair die and measuring, i.e., noting,
the face turned up. Our intuition tells us that the \probability" of the ith face
turning up is 1=6, and that the \probability" of a face with an even number of
dots turning up is 1=2.
Here is a mathematical model for this experiment and measurement. Let be any set containing six points. We call the sample space. Each point
in corresponds to, or models, a possible outcome of the experiment. For
simplicity, let
:= f1; 2; 3; 4;5; 6g:
Now put
Fi := fig; i = 1; 2; 3; 4;5; 6;
July 18, 2002
6
Chap. 1 Introduction to Probability
and
E := f2; 4; 6g:
We call the sets Fi and E events. The event Fi corresponds to, or models, the
die's turning up showing the ith face. Similarly, the event E models the die's
showing a face with an even number of dots. Next, for every subset A of , we
denote the number of points in A by jAj. We call jAj the cardinality of A. We
dene the probability of any event A by
}(A) := jAj=j
j:
It then follows that }(Fi) = 1=6 and }(E) = 3=6 = 1=2, which agrees with our
intuition.
We now make four observations about our model:
(i) }(
6 ) = j6j=j
j = 0=j
j = 0.
(ii) }(A) 0 for every event A.
(iii) If A and B are mutually exclusive events, i.e., A \ B = 6 , then }(A [
}
}
B) = (A) + (B); for example, F3 \ E = 6 , and it is easy to check
that }(F3 [ E) = }(f2; 3; 4; 6g) = }(F3) + }(E).
(iv) When the die is tossed, something happens; this is modeled mathematically by the easily veried fact that }(
) = 1.
As we shall see, these four properties hold for all the models discussed in this
section.
We next modify our model to accommodate an unfair die as follows. Observe
that for a fair die,
X 1
X
}(A) = jAj =
=
j
j !2A j
j !2A p(!);
where p(!) := 1=j
j. For an unfair die, we redene } by taking
}(A) := X p(!);
!2A
where now p(!) is not constant, but is chosen to reect the likelihood of occurrence of the various faces. This new denition of } still satises (i) and (iii);
however, to guarantee that (ii) and (iv) still hold, we mustPrequire that p be
nonnegative and sum to one, or, in symbols, p(!) 0 and !2
p(!) = 1.
Example 1.2. Consider a die for which face i is twice as likely as face i , 1.
Find the probability of the die's showing a face with an even number of dots.
Solution. To solve the problem, we use the above modied probability
model withy p(!) = 2!,1=63. We then need to realize that we are being asked
to compute }(E), where E = f2; 4; 6g. Hence,
}(E) = [21 + 23 + 25 ]=63 = 42=63 = 2=3:
If A = 6 , the summation is always taken to be zero.
y The derivation of this formula appears in Problem 8.
July 18, 2002
1.2 Probability Models
7
This problem is typical of the kinds of \word problems" to which probability
theory is applied to analyze well-dened physical experiments. The application
of probability theory requires the modeler to take the following steps:
1. Select a suitable sample space .
2. Dene }(A) for all events A.
3. Translate the given \word problem" into a problem requiring the calculation of }(E) for some specic event E.
The following example gives a family of constructions that can be used to
model experiments having a nite number of possible outcomes.
Example 1.3. Let M be a positive integer, and put :=PfM1; 2; : : :; M g.
Next, let p(1); : : :; p(M) be nonnegative real numbers such that !=1 p(!) = 1.
For any subset A , put
}(A) := X p(!):
! 2A
In particular, to model equally likely outcomes, or equivalently, outcomes that
occur \at random," we take p(!) = 1=M. In this case, }(A) reduces to jAj=j
j.
Example 1.4. A single card is drawn at random from a well-shued deck
of playing cards. Find the probability of drawing an ace. Also nd the probability of drawing a face card.
Solution. The rst step in the solution is to specify the sample space
and the probability }. Since there are 52 possible outcomes, we take :=
f1; : : :; 52g. Each integer corresponds to one of the cards in the deck. To specify
}, we must dene }(E) for all events E . Since all cards are equally likely
to be drawn, we put }(E) := jE j=j
j.
To nd the desired probabilities, let 1; 2; 3; 4 correspond to the four aces, and
let 41; : : :; 52 correspond to the 12 face cards. We identify the drawing of an ace
with the event A := f1; 2; 3; 4g, and we identify the drawing of a face card with
the event F := f41; : : :; 52g. It then follows that }(A) = jAj=52 = 4=52 = 1=13
and }(F ) = jF j=52 = 12=52 = 3=13.
While the sample spaces in Example 1.3 can model any experiment with
a nite number of outcomes, it is often convenient to use alternative sample
spaces.
Example 1.5. Suppose that we have two well-shued decks of cards, and
we draw one card at random from each deck. What is the probability of drawing
the ace of spades followed by the jack of hearts? What is the probability of
drawing an ace and a jack (in either order)?
July 18, 2002
8
Chap. 1 Introduction to Probability
Solution. The rst step in the solution is to specify the sample space and the probability }. Since there are 52 possibilities for each draw, there are
522 = 2,704 possible outcomes when drawing two cards. Let D := f1; : : :; 52g,
and put
:= f(i; j) : i; j 2 Dg:
Then j
j = jDj2 = 522 = 2,704 as required. Since all pairs are equally likely,
we put }(E) := jE j=j
j for arbitrary events E .
As in the preceding example, we denote the aces by 1; 2; 3; 4. We let 1 denote
the ace of spades. We also denote the jacks by 41; 42; 43; 44, and the jack of
hearts by 42. The drawing of the ace of spades followed by the jack of hearts
is identied with the event
A := f(1; 42)g;
and so }(A) = 1=2,704 0:000370. The drawing of an ace and a jack is
identied with B := Baj [ Bja , where
Baj := (i; j) : i 2 f1; 2; 3; 4g and j 2 f41; 42; 43;44g
corresponds to the drawing of an ace followed by a jack, and
Bja := (i; j) : i 2 f41; 42; 43; 44g and j 2 f1; 2; 3; 4g
corresponds to the drawing of a jack followed by an ace. Since Baj and Bja are
disjoint, }(B) = }(Baj )+}(Bja ) = (jBajj + jBja j)=j
j. Since jBajj = jBja j = 16,
}(B) = 2 16=2,704 = 2=169 0:0118.
Example 1.6. Two cards are drawn at random from a single well-shued
deck of playing cards. What is the probability of drawing the ace of spades
followed by the jack of hearts? What is the probability of drawing an ace and
a jack (in either order)?
Solution. The rst step in the solution is to specify the sample space and
the probability }. There are 52 possibilities for the rst draw and 51 possibilities
for the second. Hence, the sample space should contain 52 51 = 2,652 elements.
Using the notation of the preceding example, we take
:= f(i; j) : i; j 2 D with i 6= j g;
Note that j
j = 522 , 52 = 2,652 as required. Again, all such pairs are equally
likely, and so we take }(E) := jE j=j
j for arbitrary events E . The events
A and B are dened as before, and the calculation is the same except that
j
j = 2,652 instead of 2,704. Hence, }(A) = 1=2,652 0:000377, and }(B) =
2 16=2,652 = 8=663 0:012.
July 18, 2002
1.2 Probability Models
9
Example 1.7 (The Birthday Problem). In a group of n people, what is
the probability that two or more people have the same birthday?
Solution. The rst step in the solution is to specify the sample space and the probability }. Let D := f1; : : :; 365g denote the days of the year, and
let
:= f(d1 ; : : :; dn) : di 2 Dg
denote the set of all possible sequences of n birthdays. Then j
j = jDjn . Since
all sequences are equally likely, we take }(E) := jE j=j
j for arbitrary events
E .
Let Q denote the set of sequences (d1 ; : : :; dn) that have at least one pair of
repeated entries. For example, if n = 9, one of the sequences in Q would be
(364; 17; 201;17;51; 171;51; 33;51):
Notice that 17 appears twice and 51 appears 3 times. The set Q is very complicated. On the other hand, consider Qc , which is the set of sequences (d1; : : :; dn)
that have no repeated entries. A typical element of Qc can be constructed as
follows. There are jDj possible choices for d1. For d2 there are jDj , 1 possible
choices because repetitions are not allowed for elements of Qc . For d3 there are
jDj , 2 possible choices. Continuing in this way, we see that
jQc j = jDj (jDj , 1) (jDj , [n , 1]) = (jDjjD,j!n)! :
It follows that }(Qc ) = jQcj=j
j, and }(Q) = 1 , }(Qc ). A plot of }(Q) as a
function of n is shown in Figure 1.4. As the dashed line indicates, for n 23,
the probability of two more more people having the same birthday is greater
than 1=2.
Example 1.8. A certain memory location in an old personal computer
is faulty and returns 8-bit bytes at random. What is the probability that a
returned byte has seven 0s and one 1? Six 0s and two 1s?
Solution. Let := f(b1; : : :; b8) : bi = 0 or 1g. Since all bytes are equally
likely, put }(E) := jE j=j
j for arbitrary E . Let
A1 := (b1 ; : : :; b8) :
and
A2 := (b1; : : :; b8) :
8
P
i=1
8
P
bi = 1
bi = 2 :
i=1
Then }(A1 ) = jA1 j=j
j = 8=256 = 1=32, and }(A2 ) = jA2j=j
j = 28=256 =
7=64.
July 18, 2002
10
Chap. 1 Introduction to Probability
1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
0
5
10
15
20
25
30
35
40
45
50
55
Figure 1.4. A plot of }(Q) as a function of n. For n 23, the probability of two or more
people having the same birthday is greater than 1=2.
In some experiments, the number of possible outcomes is countably innite.
For example, consider the tossing of a coin until the rst heads appears. Here is
a model for such situations. Let denote the set of all positive integers, :=
f1; 2; : : :g. For ! 2 , let p(!) be nonnegative, and suppose that P1
!=1 p(!) =
1. For any subset A , put
}(A) := X p(!):
!2A
This construction can be used to model the coin tossing experiment by identifying ! = i with the outcome that the rst heads appears on the ith toss. If
the probability of tails on a single toss is (0 < 1), it can be shown that
we should take p(!) = !,1(1 , ) (cf. Example 2.8. To nd the probability that the rst head occurs before the fourth toss, we compute }(A), where
A = f1; 2; 3g. Then
}(A) = p(1) + p(2) + p(3) = (1 + + 2)(1 , ):
If = 1=2, }(A) = (1 + 1=2 + 1=4)=2 = 7=8.
For some experiments, the number of possible outcomes is more than countably innite. Examples include the lifetime of a lightbulb or a transistor, a
noise voltage in a radio receiver, and the arrival time of a city bus. In these
cases, } is usually dened as an integral,
}(A) :=
Z
A
f(!) d!; A ;
July 18, 2002
1.2 Probability Models
11
R
for some nonnegative function f. Note that f must also satisfy f(!) d! = 1.
Example 1.9. Consider the following model for the lifetime of a lightbulb.
For the sample space we take the nonnegative half line, := [0; 1), and we
put
Z
}(A) := f(!) d!;
A
where, for example, f(!) := e,! . Then the probability that the lightbulb's
lifetime is between 5 and 7 time units is
}([5; 7]) =
Z 7
5
e,! d! = e,5 , e,7 :
Example 1.10. A certain bus is scheduled to pick up riders at 9:15. However, it is known that the bus arrives randomly in the 20-minute interval between
9:05 and 9:25, and departs immediately after boarding waiting passengers. Find
the probability that the bus arrives at or after its scheduled pick-up time.
Solution. Let := [5; 25], and put
}(A) :=
Z
A
f(!) d!:
Now, the term \randomly" in the problem statement is usually taken to mean
that f(!) constant. In order that }(
) = 1, we must choose the constant to
be 1=length(
) = 1=20. We represent the bus arriving at or after 9:15 with the
event L := [15; 25]. Then
}(L) =
Z
1 d! =
20
[15;25]
Z 25
15
1
25 , 15
1
20 d! = 20 = 2 :
Example 1.11. A dart is thrown at random toward a circular dartboard of
radius 10 cm. Assume the thrower never misses the board. Find the probability
that the dart lands within 2 cm of the center.
Solution. Let := f(x; y) : x2 + y2 100g, and for any A , put
}(A) := area(A) = area(A) :
area(
)
100
We then identify the event A := f(x; y) : x2 + y2 4g with the dart's landing
within 2 cm of the center. Hence,
}(A) = 4 = 0:04:
100
July 18, 2002
12
Chap. 1 Introduction to Probability
1.3. Axioms and Properties of Probability
The probability models of the preceding section suggest the following axiomatic denition of probability. Given a nonempty set , called the sample
space, and a function } dened on the subsets of , we say } is a probability
measure if the following four axioms are satised:1
(i) The empty set 6 is called the impossible event. The probability of
the impossible event is zero; i.e., }(
6 ) = 0.
(ii) Probabilities are nonnegative; i.e., for any event A, }(A) 0.
(iii) If A1; A2; : : : are events that are mutually exclusive or pairwise disjoint,
i.e., An \ Am = 6 for n 6= m, thenz
}
[
1
n=1
An =
1
X
n=1
}(An ):
This property is summarized by saying that the probability of the union
of disjoint events is the sum of the probabilities of the individual events,
or more briey, \the probabilities of disjoint events add."
(iv) The entire sample space is called the sure event or the certain
event. Its probability is always one; i.e., }(
) = 1.
We now give an interpretation of how and } model randomness. We
view the sample space as being the set of all possible \states of nature."
First, Mother Nature chooses a state !0 2 . We do not know which state
has been chosen. We then conduct an experiment, and based on some physical
measurement, we are able to determine that !0 2 A for some event A . In
some cases, A = f!0g, that is, our measurement reveals exactly which state !0
was chosen by Mother Nature. (This is the case for the events Fi dened at
the beginning of Section 1.2). In other cases, the set A contains !0 as well as
other points of the sample space. (This is the case for the event E dened at
the beginning of Section 1.2). In either case, we do not know before making
the measurement what measurement value we will get, and so we do not know
what event A Mother Nature's !0 will belong to. Hence, in many applications,
e.g., gambling, weather prediction, computer message trac, etc., it is useful to
compute }(A) for various events to determine which ones are most probable.
In many situations, an event A under consideration depends on some system
parameter, say , that we can select. For example, suppose A occurs if we
correctly receive a transmitted radio message; here could represent a voltage
threshold. Hence, we can choose to maximize }(A ).
Consequences of the Axioms
Axioms (i){(iv) that dene a probability measure have several important
implications as discussed below.
z See the paragraph Finite Disjoint Unions below and Problem 9 for further discussion
regarding this axiom.
July 18, 2002
1.3 Axioms and Properties of Probability
13
Finite Disjoint Unions. Let N be a positive integer. By taking An = 6
for n > N in axiom (iii), we obtain the special case (still for pairwise disjoint
events)
N
N
} [ An = X }(An):
n=1
n=1
Remark. It is not possible to go backwards and use this special case to
derive axiom (iii).
Example 1.12. If A is an event consistingPof a nite number of sample
points, say A = f!1 ; : : :; !N g, then2 }(A) = Nn=1 }(f!n g). Similarly, if A
consists of a countably many
P sample points, say A = f!1; !2; : : :g, then directly
from axiom (iii), }(A) = 1
n=1 }(f!ng).
Probability of a Complement. Given an event A, we can always write =
A [ Ac , which is a nite disjoint union. Hence, }(
) = }(A) + }(Ac ). Since
}(
) = 1, we nd that
}(Ac ) = 1 , }(A):
Monotonicity. If A and B are events, then
A B implies }(A) }(B):
To see this, rst note that A B implies
B = A [ (B \ Ac ):
This relation is depicted in Figure 1.5, in which the disk A is a subset of the
A
B
Figure 1.5. In this diagram, the disk A is a subset of the oval-shaped region B ; the shaded
region is B \ Ac , and B = A [ (B \ Ac ).
oval-shaped region B; the shaded region is B \ Ac . The gure shows that B is
the disjoint union of the disk A together with the shaded region B \ Ac . Since
B = A [ (B \ Ac ) is a disjoint union, and since probabilities are nonnegative,
}(B) = }(A) + }(B \ Ac )
}(A):
July 18, 2002
14
Chap. 1 Introduction to Probability
Note that the special case B = results in }(A) 1 for every event A. In
other words, probabilities are always less than or equal to one.
Inclusion-Exclusion. Given any two events A and B, we always have
}(A [ B) = }(A) + }(B) , }(A \ B):
(1.1)
To derive (1.1), rst note that (see Figure 1.6)
A
B
B
A
(a)
(b)
Figure 1.6. (a) Decomposition A = (A \ B c ) [ (A \ B ). (b) Decomposition B = (A \ B ) [
(Ac \ B ).
A = (A \ B c ) [ (A \ B)
and
B = (A \ B) [ (Ac \ B):
Hence,
A [ B = (A \ B c ) [ (A \ B) [ (A \ B) [ (Ac \ B) :
The two copies of A \ B can be reduced to one using the identity F [ F = F
for any set F . Thus,
A [ B = (A \ B c ) [ (A \ B) [ (Ac \ B):
A Venn diagram depicting this last decomposition is shown in Figure 1.7. Taking probabilities of the preceding equations, which involve disjoint unions, we
A
B
Figure 1.7. Decomposition A [ B = (A \ B c) [ (A \ B ) [ (Ac \ B ).
July 18, 2002
1.3 Axioms and Properties of Probability
15
nd that
}(A) = }(A \ B c ) + }(A \ B);
}(B) = }(A \ B) + }(Ac \ B);
}(A [ B) = }(A \ B c ) + }(A \ B) + }(Ac \ B):
Using the rst two equations, solve for }(A \ B c ) and }(Ac \ B), respectively,
and then substitute into the rst and third terms on the right-hand side of the
last equation. This results in
}(A [ B) = }(A) , }(A \ B) + }(A \ B)
+ }(B) , }(A \ B)
= }(A) + }(B) , }(A \ B):
Limit Properties. Using axioms (i){(iv), the following formulas can be derived (see Problems 10{12). For any sequence of events An ,
[
1
\
1
}
and
}
[
N
\
N
}
An = Nlim
A ;
!1 n=1 n
n=1
n=1
An
}
= Nlim
!1
n=1
An :
(1.2)
(1.3)
In particular, notice that if An An+1 for all n, then the nite union in (1.2)
reduces to AN . Thus, (1.2) becomes
[
1
}
}(AN ); if An An+1 :
An = Nlim
!1
n=1
(1.4)
Similarly, if An+1 An for all n, then the nite intersection in (1.3) reduces to
AN . Thus, (1.3) becomes
}
1
\
}(AN ); if An+1 An :
An = Nlim
!1
n=1
(1.5)
Formulas (1.1) and (1.2) together imply that for any sequence of events An ,
[
1
}
n=1
An 1
X
n=1
}(An):
This formula is known as the union bound in engineering and as countable
subadditivity in mathematics. It is derived in Problems 13 and 14 at the end
of the chapter.
July 18, 2002
16
Chap. 1 Introduction to Probability
1.4. Independence
Consider two tosses of an unfair coin whose \probability" of heads is 1=4.
Assuming that the rst toss has no inuence on the second, our intuition tells us
that the \probability" of heads on both tosses is (1=4)(1=4) = 1=16. This motivates the following denition. Two events A and B are said to be statistically
independent, or just independent, if
}(A \ B) = }(A) }(B):
(1.6)
The notation A ?? B is sometimes used to mean A and B are independent.
We now make some important observations about independence. First, it is a
simple exercise to show that if A and B are independent events, then so are A
and B c , Ac and B, and Ac and B c . For example, using the identity
A = (A \ B) [ (A \ B c );
we have
}(A) = }(A \ B) + }(A \ B c )
= }(A) }(B) + }(A \ B c );
and so
}(A \ B c ) = }(A) , }(A) }(B)
= }(A)[1 , }(B)]
= }(A) }(B c ):
By interchanging the roles of A and Ac and/or B and B c , it follows that if any
one of the four pairs is independent, then so are the other three.
Now suppose that A and B are any two events. If }(B) = 0, then we
claim that A and B are independent. To see this, rst note that the righthand side of (1.6) is zero; as for the left-hand side, since A \ B B, we have
0 }(A \ B) }(B) = 0, i.e., the left-hand side is also zero. Hence, (1.6)
always holds if }(B) = 0.
We now show that if }(B) = 1, then A and B are independent. This follows
from the two preceding paragraphs. If }(B) = 1, then }(B c ) = 0, and we see
that A and B c are independent; but then A and B must also be independent.
Independence for More Than Two Events
Suppose that for j = 1; 2; : : :; Aj is an event. When we say that the Aj are
independent, we certainly want that for any i 6= j,
}(Ai \ Aj ) = }(Ai ) }(Aj ):
And for any distinct i; j; k, we want
}(Ai \ Aj \ Ak ) = }(Ai ) }(Aj ) }(Ak ):
July 18, 2002
1.4 Independence
17
We want analogous equations to hold for any four events, ve events, and so
on. In general, we want that for every nite subset J containing two or more
positive integers,
} \ Aj = Y }(Aj ):
j 2J
j 2J
In other words, we want the probability of every intersection involving nitely
many of the Aj to be equal to the product of the probabilities of the individual
events. If the above equation holds for all nite subsets of two or more positive
integers, then we say that the Aj are mutually independent, or just independent. If the above equation holds for all subsets J containing exactly two
positive integers but not necessarily for all nite subsets of 3 or more positive
integers, we say that the Aj are pairwise independent.
Example 1.13. Given three events, say A, B, and C, they are mutually
independent if and only if the following equations all hold,
}(A \ B \ C) = }(A) }(B) }(C)
}(A \ B) = }(A) }(B)
}(A \ C) = }(A) }(C)
}(B \ C) = }(B) }(C):
It is possible to construct events A, B, and C such that the last three equations
hold (pairwise independence), but the rst one does not.3 It is also possible
for the rst equation to hold while the last three fail.4
Example 1.14. A coin is tossed three times and the number of heads is
noted. Find the probability that the number of heads is two, assuming the
tosses are mutually independent and that on each toss the probability of heads
is for some xed 0 1.
Solution. To solve the problem, let Hi denote the event that the ith toss
is heads (so }(Hi ) = ), and let S2 denote the event that the number of heads
in three tosses is 2. Then
S2 = (H1 \ H2 \ H3c) [ (H1 \ H2c \ H3) [ (H1c \ H2 \ H3):
This is a disjoint union, and so }(S2 ) is equal to
}(H1 \ H2 \ H3c ) + }(H1 \ H2c \ H3) + }(H1c \ H2 \ H3):
(1.7)
Next, since H1, H2, and H3 are mutually independent, so are (H1 \ H2) and
H3. Hence, (H1 \ H2 ) and H3c are also independent. Thus,
}(H1 \ H2 \ H3c ) = }(H1 \ H2 ) }(H3c )
= }(H1) }(H2 ) }(H3c )
= 2 (1 , ):
July 18, 2002
18
Chap. 1 Introduction to Probability
Treating the last two terms in (1.7) similarly, we have }(S2 ) = 32(1 , ). If
the coin is fair, i.e., = 1=2, then }(S2 ) = 3=8.
In working the preceding example, we did not explicitly specify the sample
space or the probability measure }. This is common practice. However, the
interested reader can nd one possible choice for and } in the Notes.5
Example 1.15. If A1; A2; : : : are mutually independent, show that
}
Solution. Write
}
\
1
n=1
An
1
\
n=1
1
Y
An =
n=1
\
N
}(An):
}
= Nlim
!1 n=1 An ; by (1.3);
N
Y
}(An ); by independence;
= Nlim
!1
=
n=1
1
Y
n=1
}(An);
where the last step is just the denition of the innite product.
Example 1.16. Consider an innite sequence of independent coin tosses.
Assume that the probability of heads is 0 < p < 1. What is the probability of
seeing all heads? What is the probability of ever seeing heads?
Solution. We use the result of the preceding example as follows. Let be a sample space equipped with a probability measure } and events An ; n =
1; 2; : : :; with }(An ) = p, where the An are mutually independent.6 The event
An corresponds to, or models, the outcome that the nth toss results
in a heads.
T
The outcome of seeing all heads corresponds to the event 1
n=1 An , and its
probability is
}
1
\
N
Y
}(An ) = lim pN = 0:
An = Nlim
!1
N !1
n=1
n=1
S
The outcome of ever seeing heads corresponds to the event A := T1
n=1 Anc .
Since }(A) = 1 ,}(Ac ), it suces to compute the probability of Ac = 1
n=1 An.
Arguing exactly as above, we have
\
1
}
n=1
Acn
= Nlim
!1
Thus, }(A) = 1 , 0 = 1.
N
Y
}(Acn) = lim (1 , p)N = 0:
N !1
n=1
July 18, 2002
1.5 Conditional Probability
19
1.5. Conditional Probability
Conditional probability gives us a mathematically precise way of handling
questions of the form, \Given that an event B has occurred, what is the conditional probability that A has also occurred?" If }(B) > 0, we put
\ B)
}(AjB) := }(A
(1.8)
}(B) :
We call }(AjB) the conditional probability of A given B. In order to justify
using the word \probability" in reference to }(jB), note that it is an easy
exercise to show that if we put Q(A) = }(AjB) for xed B with }(B) > 0,
then Q satises axioms (i){(iv) in Section 1.3 that dene a probability measure.
Note, however, that although Q is a probability measure, it has the special
property that if A B c , then Q(A) = 0. In other words, if B implies A has
not occurred, and if B occurs, then }(AjB) = 0.
The denition of conditional probability satises the following intuitive
property, \If A and B are independent, then }(AjB) does not depend on B."
In fact, if A and B are independent,
\ B) }(A) }(B)
}(AjB) = }(A
}(B) = }(B) = }(A):
Observe that }(AjB) as dened in (1.8) is the unique solution of
}(A \ B) = }(AjB)}(B)
(1.9)
when }(B) > 0. If }(B) = 0, then no matter what real number we assign to
}(AjB), both sides of (1.9) are zero; clearly the right-hand side is zero, and, for
the left-hand side, note that since A\B B, we have 0 }(A\B) }(B) = 0.
Hence, in probability theory, when }(B) = 0, the value of }(AjB) is permitted
to be arbitrary, and it is understood that (1.9) always holds.
The Law of Total Probability and Bayes' Rule
From the identity
A = (A \ B) [ (A \ B c );
it follows that
}(A) = }(A \ B) + }(A \ B c )
= }(AjB)}(B) + }(AjB c )}(B c ):
(1.10)
This formula is the simplest version of the law of total probability. In many
applications, the quantities on the right of (1.10) are known, and it is required
to nd }(B jA). This can be accomplished by writing
\ A) }(A \ B) }(AjB)}(B) :
}(B jA) := }(B
}(A) = }(A) =
}(A)
July 18, 2002
20
Chap. 1 Introduction to Probability
Substituting (1.10) into the denominator yields
}(AjB)}(B)
}(B jA) = }
}
(AjB) (B) + }(AjB c )}(B c ) :
This formula is the simplest version of Bayes' rule.
Example 1.17. Polychlorinated biphenyls (PCBs) are toxic chemicals.
Given that you are exposed to PCBs, suppose that your conditional probability of developing cancer is 1/3. Given that you are not exposed to PCBs,
suppose that your conditional probability of developing cancer is 1/4. Suppose
that the probability of being exposed to PCBs is 3/4. Find the probability that
you were exposed to PCBs given that you do not develop cancer.
Solution. To solve this problem, we use the notation
E = fexposed to PCBsg and C = fdevelop cancer g:
With this notation, it is easy to interpret the problem as telling us that
}(C jE) = 1=3; }(C jE c) = 1=4; and }(E) = 3=4;
(1.11)
and asking us to nd }(E jC c ). Before solving the problem, note that the
above data implies three additional equations as follows. First, recall that
}(E c ) = 1 , }(E). Similarly, since conditional probability is a probability
as a function of its rst argument, we can write }(C c jE) = 1 , }(C jE) and
}(C c jE c) = 1 , }(C jE c). Hence,
}(C c jE) = 2=3; }(C cjE c ) = 3=4; and }(E c ) = 1=4:
(1.12)
To nd the desired conditional probability, we write
\ C c)
}(E jC c ) = }(E
}(C c)
}(C c jE)}(E)
=
}(C c )
= (2=3)(3=4)
}(C c )
= }1=2
(C c ) :
To nd the denominator, we use the law of total probability to write
}(C c) = }(C c jE)}(E) + }(C c jE c)}(E c )
= (2=3)(3=4) + (3=4)(1=4) = 11=16:
Hence,
}(E jC c ) = 1=2 = 8 :
11=16
11
July 18, 2002
1.5 Conditional Probability
21
In working the preceding example, we did not explicitly specify the sample
space or the probability measure }. As mentioned earlier, this is common
practice. However, one possible choice for and } is given in the Notes.7
We now generalize the law of P
total probability. Let Bn be a sequence of
pairwise disjoint events such that n }(Bn ) = 1. Then for any event A,
}(A) =
X
n
}(AjBn )}(Bn ):
S
To derive this result, put B := n Bn , and observe that
}(B) = X }(Bn ) = 1:
n
It follows that }(B c ) = 0. Next, for any event A, A \ B c B c , and so
0 }(A \ B c ) }(B c ) = 0:
Hence, }(A \ B c ) = 0. Writing
A = (A \ B c ) [ (A \ B);
it follows that
}(A) = }(A \ B c ) + }(A \ B)
= }(A
\ B)
[
= } A \ Bn
[
= }
=
X
n
n
n
[A \ Bn ]
}(A \ Bn ):
(1.13)
To compute the posterior probabilities }(Bk jA), write
}(Bk jA) = }(A} \ Bk ) = }(Aj}Bk )}(Bk ) :
(A)
(A)
Applying the law of total probability to }(A) in the denominator yields the
general form of Bayes' rule,
}(Bk jA) = X}(AjBk )}(Bk ) :
}(AjBn )}(Bn )
n
July 18, 2002
22
Chap. 1 Introduction to Probability
1.6. Notes
Notes x1.3: Axioms and Properties of Probability
Note 1. When the sample space is nite or countably innite, }(A) is
usually dened for all subsets of by taking
X
}(A) :=
p(!)
!2A
P
for some nonnegative function p that sums to one; i.e., p(!) 0 and !2
p(!)
= 1. (It is easy to check that if } is dened in this way, then it satises the
axioms of a probability measure.) However, for larger sample spaces, such as
when is an interval of the real line, e.g., Example 1.10, it is not possible to
dene }(A) for all subsets and still have } satisfy all four axioms. (A proof
of this fact can be found in advanced texts, e.g., [4, p. 45].) The way around
this diculty is to dene }(A) only for some subsets of , but not all subsets
of . It is indeed fortunate that this can be done in such a way that }(A) is
dened for all subsets of interest that occur in practice. A set A for which }(A)
is dened is called an event, and the collection of all events is denoted by A.
The triple (
; A; }) is called a probability space.
Given that }(A) is dened only for A 2 A, in order for the probability axioms to make sense, A must have certain properties. First, axiom (i)
requires that 6 2 A. Second, axiom (iv) requires that 2 A. Third, axiom (iii)S1requires that if A1 ; A2; : : : are mutually exclusive events, then their
union, n=1 An , is also an event. Additionally, we need inTSection 1.4 that if
A1 ; A2; : : : are arbitrary events, then so is their intersection, 1
n=1 An. We show
below that these four requirements are satised if we assume only that A has
the following three properties,
(i) The empty set 6 belongs to A, i.e., 6 2 A.
(ii) If A 2 A, then so does its complement, Ac , i.e., A 2 SA implies Ac 2 A.
(iii) If A1 ; A2; : : : belong to A, then so does their union, 1
n=1 An .
A collection of subsets with these properties is called a -eld or a -algebra.
Of the four requirements above, the rst and third obviously hold for any -eld
A. The second requirement holds because (i) and (ii) imply that 6 c = must
be in A. The fourth requirement is a consequence of (iii) and DeMorgan's law.
Note 2. In light of the preceding note, we see that to guarantee that
}(f!ng) is dened in Example 1.12, it is necessary to assume that the singleton sets f!ng are events, i.e., f!ng 2 A.
Notes x1.4: Independence
Note 3. Here is an example of three events that are pairwise independent,
but not mutually independent. Let
:= f1; 2; 3; 4; 5;6;7g;
July 18, 2002
1.6 Notes
23
and put }(f!g) := 1=8 for ! 6= 7, and }(f7g) := 1=4. Take A := f1; 2; 7g,
B := f3; 4; 7g, and C := f5; 6; 7g. Then }(A) = }(B) = }(C) = 1=2. and
}(A \ B) = }(A \ C) = }(B \ C) = }(f7g) = 1=4. Hence, A and B, A and
C, and B and C are pairwise independent. However, since }(A \ B \ C) =
}(f7g) = 1=4, and since }(A) }(B) }(C) = 1=8, A, B, and C are not mutually
independent.
Note 4. Here is an example of three events for which }(A \ B \ C) =
}(A) }(B) }(C) but no pair is independent. Let := f1; 2; 3; 4g. Put }(f1g) =
}(f2g) = }(f3g) = p and }(f4g) = q, where 3p + q = 1 and 0 p; q 1.
Put A := f1; 4g, B := f2; 4g, and C := f3; 4g. Then the intersection of any
pair is f4g, as is the intersection of all three sets. Also, }(f4g) = q. Since
}(A) = }(B) = }(C) = p+q, we require (p+q)3 = q and (p+q)2 6= q. Solving
3p + q = 1 and (p + q)3 = q for q reduces to solving 8q3 + 12q2 , 21q + 1 = 0.
Now, q = 1 is obviously a root, but this results in p = 0, which implies mutual
independence. However, since q = 1 is a root, it is easy to verify that
8q3 + 12q2 , 21q + 1 = (q , 1)(8q2 + 20q , 1):
,
p 3 =4. It then follows
By the quadratic
formula,
the
desired
root
is
q
=
,
5+3
p p ,
,
that p = 3 , 3 =4 and that p + q = ,1 + 3 =2. Now just observe that
(p + q)2 6= q.
Note 5. Here is a choice for and } for Example 1.14. Let
:= f(i; j; k) : i; j; k = 0 or 1g;
with 1 corresponding to heads and 0 to tails. Now put
H1 := f(i; j; k) : i = 1g;
H2 := f(i; j; k) : j = 1g;
H3 := f(i; j; k) : k = 1g;
and observe that
H1 = f (1; 0; 0) ; (1; 0; 1) ; (1; 1; 0) ; (1; 1; 1) g;
H2 = f (0; 1; 0) ; (0; 1; 1) ; (1; 1; 0) ; (1; 1; 1) g;
H3 = f (0; 0; 1) ; (0; 1; 1) ; (1; 0; 1) ; (1; 1; 1) g:
Next, let }(f(i; j; k)g) := i+j +k (1 , )3,(i+j +k). Since
H3c = f (0; 0; 0) ; (1; 0; 0) ; (0; 1; 0) ; (1; 1; 0) g;
H1 \ H2 \ H3c = f(1; 1; 0)g. Similarly, H1 \ H2c \ H3 = f(1; 0; 1)g, and H1c \
H2 \ H3 = f(0; 1; 1)g. Hence,
S2 = f (1; 1; 0) ; (1; 0; 1) ; (0; 1; 1) g
= f(1; 1; 0)g [ f(1; 0; 1)g [ f(0; 1; 1)g;
and thus, }(S2 ) = 32 (1 , ).
July 18, 2002
24
Chap. 1 Introduction to Probability
Note 6. To show the existence of a sample space and probability measure
with such independent events is beyond the scope of this book. Such constructions can be found in more advanced texts such as [4, Section 36].
Notes x1.5: Conditional Probability
Note 7. Here is a choice for and } for Example 1.17. Let
:= f(e; c) : e; c = 0 or 1g;
where e = 1 corresponds to exposure to PCBs, and c = 1 corresponds to
developing cancer. We then take
E := f(e; c) : e = 1g = f (1; 0) ; (1; 1) g;
and
C := f(e; c) : c = 1g = f (0; 1) ; (1; 1) g:
It follows that
E c = f (0; 1) ; (0; 0) g and C c = f (1; 0) ; (0; 0) g:
Hence, E \ C = f(1; 1)g, E \ C c = f(1; 0)g, E c \ C = f(0; 1)g, and E c \ C c =
f(0; 0)g.
In order to specify a suitable probability measure on , we work backwards.
First, if a measure } on exists such that (1.11) and (1.12) hold, then
}(f(1; 1)g) = }(E \ C) = }(C jE)}(E) = 1=4;
}(f(0; 1)g) = }(E c \ C) = }(C jE c )}(E c ) = 1=16;
}(f(1; 0)g) = }(E \ C c ) = }(C c jE)}(E) = 1=2;
}(f(0; 0)g) = }(E c \ C c ) = }(C c jE c)}(E c ) = 3=16:
This suggests that we dene } by
}(A) := X p(!);
!2A
where p(!) = p(e; c) is given by p(1; 1) := 1=4, p(0; 1) := 1=16, p(1; 0) := 1=2,
and p(0; 0) := 3=16. Starting from this denition of }, it is not hard to check
that (1.11) and (1.12) hold.
1.7. Problems
Problems x1.1: Review of Set Notation
1. For real numbers ,1 < a < b < 1, we use the following notation.
(a; b] := fx : a < x bg
(a; b) := fx : a < x < bg
[a; b) := fx : a x < bg
[a; b] := fx : a x bg:
July 18, 2002
1.7 Problems
25
We also use
(,1; b]
(,1; b)
(a; 1)
[a; 1)
fx : x bg
fx : x < bg
fx : x > ag
fx : x ag:
For example, with this notation, (0; 1]c = (,1; 0] [ (1; 1) and (0; 2] [
:=
:=
:=
:=
[1; 3) = (0; 3). Now analyze
(a) [2; 3]c,
(b) (1; 3) [ (2; 4),
(c) (1; 3) \ [2; 4),
(d) (3; 6] n (5; 7).
2. Sketch the following subsets of the x-y plane.
(a) Bz := f(x; y) : x + y z g for z = 0; ,1; +1.
(b) Cz := f(x; y) : x > 0; y > 0; and xy z g for z = 1.
(c) Hz := f(x; y) : x z g for z = 3.
(d) Jz := f(x; y) : y z g for z = 3.
(e) Hz \ Jz for z = 3.
(f) Hz [ Jz for z = 3.
(g) Mz := f(x; y) : max(x; y) z g for z = 3, where max(x; y) is
the larger of x and y. For example, max(7; 9) = 9. Of course,
max(9; 7) = 9 too.
(h) Nz := f(x; y) : min(x; y) z g for z = 3, where min(x; y) is the
smaller of x and y. For example, min(7; 9) = 7 = min(9; 7).
(i) M2 \ N3 .
(j) M4 \ N3 .
3. Let denote the set of real numbers, = (,1; 1).
(a) Use the distributive law to simplify
[1; 4] \ [0; 2] [ [3; 5] :
c
(b) Use DeMorgan's law to simplify [0; 1] [ [2; 3] .
(c) Simplify
1
\
(,1=n; 1=n).
n=1
July 18, 2002
26
Chap. 1 Introduction to Probability
(d) Simplify
(e) Simplify
(f) Simplify
1
\
[0; 3 + 1=(2n)).
n=1
1
[
[5; 7 , (3n),1].
n=1
1
[
[0; n].
n=1
Problems x1.2: Probability Models
4. A letter of the alphabet (a{z) is generated at random. Specify a sample
space and a probability measure }. Compute the probability that a
vowel (a, e, i, o, u) is generated.
5. A collection of plastic letters, a{z, is mixed in a jar. Two letters drawn at
random, one after the other. What is the probability of drawing a vowel
(a, e, i, o, u) and a consonant in either order? Two vowels in any order?
Specify your sample space and probability }.
6. A new baby wakes up exactly once every night. The time at which the
baby wakes up occurs at random between 9 pm and 7 am. If the parents
go to sleep at 11 pm, what is the probability that the parents are not
awakened by the baby before they would normally get up at 7 am? Specify
your sample space and probability }.
7. For any real or complex number z 6= 1 and any positive integer N, derive
the geometric series formula
NX
,1
N
z k = 11,,zz ; z 6= 1:
k=0
Hint: Let SN := 1 + z + + z N ,1 , and show that SN , zSN = 1 , z N .
Then solve for SN .
Remark. If jzj < 1, jzjN ! 0 as N ! 1. Hence,
1
X
z k = 1 ,1 z ; for jz j < 1:
k=0
8. Let
:= f1; : : :; 6g. If p(!) = 2 p(! , 1) for ! = 2; : : :; 6, and if
P6
!,1
!=1 p(!) = 1, show that p(!) = 2 =63. Hint: Use Problem 7.
July 18, 2002
1.7 Problems
27
Problems x1.3: Axioms and Properties of Probability
9. Suppose that instead of axiom (iii) of Section 1.3, we assume only that
for any two disjoint events A and B, }(A [ B) = }(A) + }(B). Use this
assumption and inductionx on N to show that for any nite sequence of
pairwise disjoint events A1 ; : : :; AN ,
}
N
[
n=1
N
X
An =
n=1
}(An ):
Using this result for nite N, it is not possible to derive axiom (iii), which
is the assumption needed to derive the limit results of Section 1.3.
? 10. The purpose of this problem is to show that any countable union can be
written as a union of pairwise disjoint sets. Given any sequence of sets
Fn, dene a new sequence by A1 := F1, and
An := Fn \ Fnc,1 \ \ F1c; n 2:
Note that the An are pairwise disjoint. For nite N 1, show that
N
[
n=1
Also show that
? 11.
1
[
n=1
Fn =
Fn =
N
[
n=1
1
[
n=1
An :
An :
Use the preceding problem to show that for any sequence of events Fn ,
[
1
[
N
1
\
N
\
}
}
Fn = Nlim
F :
!1 n=1 n
n=1
? 12. Use the preceding problem to show that for any sequence of events Gn,
}
G :
Gn = Nlim
!1 n=1 n
n=1
13. The Finite Union Bound. Show that for any nite sequence of events
F1; : : :; FN ,
N
N
} [ Fn X }(Fn ):
}
n=1
n=1
Hint: Use the inclusion-exclusion formula (1.1) and induction on N. See
the last footnote for information on induction.
x In this case, using induction on N means that you rst verify the desired result for N = 2.
Second, you assume the result is true for some arbitrary N 2 and then prove the desired
result is true for N + 1.
July 18, 2002
28
Chap. 1 Introduction to Probability
? 14. The Innite Union Bound. Show that for any innite
Fn,
1
1
} [ Fn X }(Fn ):
n=1
n=1
sequence of events
Hint: Combine Problems 11 and 13.
? 15. Borel{Cantelli Lemma. Show that if Bn is a sequence of events for which
1
X
n=1
then
}
}(Bn ) < 1;
1 [
1
\
n=1 k=n
(1.14)
Bk = 0:
T
S1
Hint: Let G := 1
n=1 Gn, where Gn := k=n Bk . Now use Problem 12,
the union bound of the preceding problem, and the fact that (1.14) implies
lim
N !1
Problems x1.4: Independence
1
X
n=N
}(Bn ) = 0:
16. A certain binary communication system has a bit-error rate of 0:1; i.e.,
in transmitting a single bit, the probability of receiving the bit in error is
0:1. To transmit messages, a three-bit repetition code is used. In other
words, to send the message 1, 111 is transmitted, and to send the message
0, 000 is transmitted. At the receiver, if two or more 1s are received, the
decoder decides that message 1 was sent; otherwise, i.e., if two or more
zeros are received, it decides that message 0 was sent. Assuming bit errors
occur independently, nd the probability that the decoder puts out the
wrong message. Answer: 0:028.
17. You and your neighbor attempt to use your cordless phones at the same
time. Your phones independently select one of ten channels at random to
connect to the base unit. What is the probability that both phones pick
the same channel?
18. A new car is equipped with dual airbags. Suppose that they fail independently with probability p. What is the probability that at least one
airbag functions properly?
19. A discrete-time FIR lter is to be found satisfying certain constraints,
such as energy, phase, sign changes of the coecients, etc. FIR lters can
be thought of as vectors in some nite-dimensional space. The energy
constraint implies that all suitable vectors lie in some hypercube; i.e., a
July 18, 2002
1.7 Problems
29
square in IR2 , a cube in IR3, etc. It is easy to generate random vectors
uniformly in a hypercube. This suggests the following Monte{Carlo procedure for nding a lter that satises all the desired properties. Suppose
we generate vectors independently in the hypercube until we nd one that
has all the desired properties. What is the probability of ever nding such
a lter?
20. A dart is thrown at random toward a circular dartboard of radius 10 cm.
Assume the thrower never misses the board. Let An denote the event that
the dart lands within 2 cm of the center on the nth throw. Suppose that
the An are mutually independent and that }(An ) = p for some 0 < p < 1.
Find the probability that the dart never lands within 2 cm of the center.
21. Each time you play the lottery, your probability of winning is p. You play
the lottery n times, and plays are independent. How large should n be
to make the probability of winning at least once more than 1=2? Answer:
For p = 1=106, n 693147.
22. Consider the sample space = [0; 1] equipped with the probability measure
Z
}(A) := 1 d!; A :
A
For A = [0; 1=2], B = [0; 1=4] [ [1=2; 3=4], and C = [0; 1=8] [ [1=4; 3=8] [
[1=2; 5=8] [ [3=4;7=8], determine whether or not A, B, and C are mutually
independent.
Problems x1.5: Conditional Probability
23. The university buys workstations from two dierent suppliers, Moon Mi-
crosystems (MM) and Hyped{Technology (HT). On delivery, 10% of MM's
workstations are defective, while 20% of HT's workstations are defective.
The university buys 140 MM workstations and 60 HT workstations.
(a) What is the probability that a workstation is from MM? From HT?
(b) What is the probability of a defective workstation? Answer: 0:13.
(c) Given that a workstation is defective, what is the probability that it
came from Moon Microsystems? Answer: 7=13.
24. The probability that a cell in a wireless system is overloaded is 1=3. Given
that it is overloaded, the probability of a blocked call is 0:3. Given that
it is not overloaded, the probability of a blocked call is 0:1. Find the
conditional probability that the system is overloaded given that your call
is blocked. Answer: 0:6.
25. A certain binary communication system operates as follows. Given that
a 0 is transmitted, the conditional probability that a 1 is received is ".
Given that a 1 is transmitted, the conditional probability that a 0 is
July 18, 2002
30
Chap. 1 Introduction to Probability
received is . Assume that the probability of transmitting a 0 is the same
as the probability of transmitting a 1. Given that a 1 is received, nd the
conditional probability that a 1 was transmitted. Hint: Use the notation
Ti := fi is transmittedg; i = 0; 1;
and
Rj := fj is receivedg; j = 0; 1:
26. Professor Random has taught probability for many years. She has found
that 80% of students who do the homework pass the exam, while 10%
of students who don't do the homework pass the exam. If 60% of the
students do the homework, what percent of students pass the exam? Of
students who pass the exam, what percent did the homework? Answer:
12=13.
27. The Bingy 007 jet aircraft's autopilot has conditional probability 1=3 of
failure given that it employs a faulty Hexium 4 microprocessor chip. The
autopilot has conditional probability 1=10 of failure given that it employs
nonfaulty chip. According to the chip manufacturer, unre l, the probability of a customer's receiving a faulty Hexium 4 chip is 1=4. Given that
an autopilot failure has occurred, nd the conditional probability that a
faulty chip was used. Use the following notation:
AF = fautopilot failsg
CF = fchip is faultyg:
Answer: 10=19.
? 28.
You have ve computer chips, two of which are known to be defective.
(a) You test one of the chips; what is the probability that it is defective?
(b) Your friend tests two chips at random and reports that one is defective and one is not. Given this information, you test one of the three
remaining chips at random; what is the conditional probability that
the chip you test is defective?
(c) Consider the following modication of the preceding scenario. Your
friend takes away two chips at random for testing; before your friend
tells you the results, you test one of the three remaining chips at
random; given this (lack of) information, what is the conditional
probability that the chip you test is defective? Since you have not
yet learned the results of your friend's tests, intuition suggests that
your conditional probability should be the same as your answer to
part (a). Is your intuition correct?
July 18, 2002
1.7 Problems
31
29. Given events A, B, and C, show that
}(A \ B \ C) = }(AjB \ C) }(B jC) }(C):
Also show that
}(A \ C jB) = }(AjB) }(C jB)
if and only if
}(AjB \ C) = }(AjB):
In this case, A and C are conditionally independent given B.
July 18, 2002
32
Chap. 1 Introduction to Probability
July 18, 2002
CHAPTER 2
Discrete Random Variables
In most scientic and technological applications, measurements and observations are expressed as numerical quantities, e.g., thermal noise voltages, number of web site hits, round-trip times for Internet packets, engine temperatures,
wind speeds, loads on an electric power grid, etc. Traditionally, numerical measurements or observations that have uncertain variability each time they are
repeated are called random variables. However, in order to exploit the axioms and properties of probability that we studied in Chapter 1, we need to
dene random variables in terms of an underlying sample space . Fortunately,
once some basic operations on random variables are derived, we can think of
random variables in the traditional manner.
The purpose of this chapter is to show how to use random variables to model
events and to express probabilities and averages. Section 2.1 introduces the notions of random variable and of independence of random variables. Several examples illustrate how to express events and probabilities using multiple random
variables. For discrete random variables, some of the more common probability
mass functions are introduced as they arise naturally in the examples. (A summary of the more common ones can be found on the inside of the front cover.)
Expectation for discrete random variables is dened in Section 2.2, and then
moments and probability generating functions are introduced. Probability generating functions are an important tool for computing both probabilities and
expectations. In particular, the binomial random variable arises in a natural
way using generating functions, avoiding the usual combinatorial development.
The weak law of large numbers is derived in Section 2.3 using Chebyshev's inequality. Conditional probability and conditional expectation are developed in
Sections 2.4 and 2.5, respectively. Several examples illustrate how probabilities and expectations of interest can be easily computed using the law of total
probability.
2.1. Probabilities Involving Random Variables
A real-valued function X(!) dened for points ! in a sample space is
called a random variable.
Example 2.1. Let us construct a model for counting the number of heads
in a sequence of three coin tosses. For the underlying sample space, we take
:= fTTT, TTH, THT, HTT, THH, HTH, HHT, HHHg;
which contains the eight possible sequences of tosses. However, since we are
only interested in the number of heads in each sequence, we dene the random
33
34
variable (function) X by
Chap. 2 Discrete Random Variables
8
>
>
<
0;
X(!) := > 1;
2;
>
:
3;
This is illustrated in Figure 2.1.
! = TTT;
! 2 fTTH, THT, HTTg;
! 2 fTHH, HTH, HHTg;
! = HHH:
IR
HHH HHT HTH THH
TTT TTH THT HTT
3
2
1
0
Figure 2.1. Illustration of a random variable X that counts the number of heads in a
sequence of three coin tosses.
With the setup of the previous example, let us assume for specicity that
the sequences are equally likely. Now let us nd the probability that the number
of heads X is less than 2. In other words, we want to nd }(X < 2). But what
does this mean? Let us agree that }(X < 2) is shorthand for
}(f! 2 : X(!) < 2g):
Then the rst step is to identify the event f! 2 : X(!) < 2g. In Figure 2.1,
the only lines pointing to numbers less than 2 are the lines pointing to 0 and 1.
Tracing these lines backwards from IR into , we see that
f! 2 : X(!) < 2g = fTTT, TTH, THT, HTTg:
Since the sequences are equally likely,
}(fTTT, TTH, THT, HTTg) = jfTTT, TTH, THT, HTTgj
j
j
1
4
= 8 = 2:
The shorthand introduced above is standard in probability theory. More
generally, if B IR, we use the shorthand
fX 2 B g := f! 2 : X(!) 2 B g
July 18, 2002
2.1 Probabilities Involving Random Variables
35
and1
}(X 2 B) := }(fX 2 B g) = }(f! 2 : X(!) 2 B g):
If B is an interval such as B = (a; b],
fX 2 (a; b]g := fa < X bg := f! 2 : a < X(!) bg
and
}(a < X b) = }(f! 2 : a < X(!) bg):
Analogous notation applies to intervals such as [a; b], [a; b), (a; b), (,1; b),
(,1; b], (a; 1), and [a; 1).
Example 2.2. Show that
}(a < X b) = }(X b) , }(X a):
Solution. It is convenient to rst rewrite the desired equation as
}(X b) = }(X a) + }(a < X b):
Now observe that
f! 2 : X(!) bg = f! 2 : X(!) ag [ f! 2 : a < X(!) bg:
Since we cannot have an ! with X(!) a and X(!) > a at the same time, the
events in the union are disjoint. Taking probabilities of both sides yields the
desired result.
IfB is a singleton set, say B = fx0g, we write fX = x0g instead of X 2
fx0g .
Example 2.3. Show that
}(X = 0 or X = 1) = }(X = 0) + }(X = 1):
Solution. First note that the word \or" means \union." Hence, we are
trying to nd the probability of fX = 0g[fX = 1g. If we expand our shorthand,
this union becomes
f! 2 : X(!) = 0g [ f! 2 : X(!) = 1g:
Since we cannot have an ! with X(!) = 0 and X(!) = 1 at the same time,
these events are disjoint. Hence, their probabilities add, and we obtain
}(fX = 0g [ fX = 1g) = }(X = 0) + }(X = 1):
July 18, 2002
36
Chap. 2 Discrete Random Variables
The following notation is also useful in computing probabilities. For any
subset B, we dene the indicator function of B by
IB (x) :=
1; x 2 B;
0; x 2= B:
Readers familiar with the unit-step function,
u(x) :=
1; x 0;
0; x < 0;
will note that u(x) = I[0;1) (x). However, the indicator notation is often more
compact. For example, if a < b, it is easier to write I[a;b) (x) than u(x , a) ,
u(x , b). How would you write I(a;b] (x) in terms of the unit step?
Discrete Random Variables
We say X is a discrete random variable if there exist distinct real numbers xi such that
X
For discrete random variables,
i
}(X = xi ) = 1:
}(X 2 B) =
X
i
IB (xi )}(X = xi):
To derive this equation, we apply the law of total probability as given in (1.13)
with A = fX 2 B g and Bi = fX = xig. Since the xi are distinct, the Bi are
disjoint, and (1.13) says that
}(X 2 B) =
X
i
}(fX 2 B g \ fX = xi g):
Now observe that
fX 2 B g \ fX = xi g =
Hence,
fX = xi g; xi 2 B;
6 ;
xi 2= B:
}(fX 2 B g \ fX = xi g) = IB (xi )}(X = xi ):
Integer-Valued Random Variables
An integer-valued random variable is a discrete random variable whose
distinct values are xi = i. For integer-valued random variables,
}(X 2 B) =
X
i
IB (i)}(X = i):
July 18, 2002
2.1 Probabilities Involving Random Variables
37
Here are some simple probability calculations involving integer-valued random
variables.
}(X 7) =
X
I(,1;7] (i)}(X = i) =
i
Similarly,
}(X 7) =
X
}(X > 7) =
X
i
However,
i
I[7;1) (i)}(X = i) =
I(7;1)(i)}(X = i) =
which is equal to }(X 8). Similarly
}(X < 7) =
X
i
I(,1;7) (i)}(X = i) =
7
X
i=,1
1
X
i=7
1
X
i=8
}(X = i):
}(X = i):
}(X = i);
6
X
i=,1
}(X = i);
which is equal to }(X 6).
When an experiment results in a nite number of \equally likely" or \totally
random" outcomes, we model it with a uniform random variable. We say that
X is uniformly distributed on 1; : : :; n if
}(X = k) = 1 ; k = 1; : : :; n:
n
For example, to model the toss of a fair die we would use }(X = k) = 1=6
for k = 1; : : :; 6. To model the selection of one card for a well-shued deck of
playing cards we would use }(X = k) = 1=52 for k = 1; : : :; 52. More generally,
we can let k vary over any subset of n integers. A common alternative to
1; : : :; n is 0; : : :; n , 1. For k not in the range of experimental outcomes, we
put }(X = k) = 0.
Example 2.4. Ten neighbors each have a cordless phone. The number of
people using their cordless phones at the same time is totally random. Find the
probability that more than half of the phones are in use at the same time.
Solution. We model the number of phones in use at the same time as
a uniformly distributed random variable X taking values 0; : : :; 10. Zero is
included because we allow for the possibility that no phones are in use. We
must compute
10
10
}(X > 5) = X }(X = i) = X 1 = 5 :
11
i=6
i=6 11
If the preceding example had asked for the probability that at least half the
phones are in use, then the answer would have been }(X 5) = 6=11.
July 18, 2002
38
Chap. 2 Discrete Random Variables
Pairs of Random Variables
If X and Y are random variables, we use the shorthand
fX 2 B; Y 2 C g := f! 2 : X(!) 2 B and Y (!) 2 C g;
which is equal to
f! 2 : X(!) 2 B g \ f! 2 : Y (!) 2 C g:
Putting all of our shorthand together, we can write
fX 2 B; Y 2 C g = fX 2 B g \ fY 2 C g:
We also have
}(X 2 B; Y 2 C) := }(fX 2 B; Y 2 C g)
= }(fX 2 B g \ fY 2 C g):
In particular, if the events fX 2 B g and fY 2 C g are independent for all sets B
and C, we say that X and Y are independent random variables. In light of
this denition and the above shorthand, we see that X and Y are independent
random variables if and only if
}(X 2 B; Y 2 C) = }(X 2 B) }(Y 2 C)
(2.1)
for all sets2 B and C.
Example 2.5. On a certain aircraft, the main control circuit on an autopilot fails with probability p. A redundant backup circuit fails independently with
probability q. The airplane can y if at least one of the circuits is functioning.
Find the probability that the airplane cannot y.
Solution. We introduce two random variables, X and Y . We set X = 1 if
the main circuit fails, and X = 0 otherwise. We set Y = 1 if the backup circuit
fails, and Y = 0 otherwise. Then }(X = 1) = p and }(Y = 1) = q. We assume
X and Y are independent random variables. Then the event that the airplane
cannot y is modeled by
fX = 1g \ fY = 1g:
Using the independence of X and Y , }(X = 1; Y = 1) = }(X = 1)}(Y = 1) =
pq.
The random variables X and Y of the preceding example are said to be
Bernoulli. To indicate the relevant parameters, we write X Bernoulli(p)
and Y Bernoulli(q). Bernoulli random variables are good for modeling the
result of an experiment having two possible outcomes (numerically represented
by 0 and 1), e.g., a coin toss, testing whether a certain block on a computer
disk is bad, whether a new radar system detects a stealth aircraft, whether a
certain internet packet is dropped due to congestion at a router, etc.
July 18, 2002
2.1 Probabilities Involving Random Variables
39
Multiple Independent Random Variables
We say that X1 ; X2; : : : are independent random variables, if for every nite
subset J containing two or more positive integers,
\
}
j 2J
fXj 2 Bj g =
Y
j 2J
}(Xj 2 Bj );
for all sets Bj . If for every B, }(Xj 2 B) does not depend on j for all j, then
we say the Xj are identically distributed. If the Xj are both independent
and identically distributed, we say they are i.i.d.
Example 2.6. Ten neighbors each have a cordless phone. The number of
people using their cordless phones at the same time is totally random. For one
week (ve days), each morning at 9:00 am we count the number of people using
their cordless phone. Assume that the numbers of phones in use on dierent
days are independent. Find the probability that on each day fewer than three
phones are in use at 9:00 am. Also nd the probability that on at least one day,
more than two phones are in use at 9:00 am.
Solution. For i = 1; : : :; 5, let Xi denote the number of phones in use at
9:00 am on the ith day. We assume that the Xi are independent, uniformly
distributed random variables taking values 0; : : :; 10. For the rst part of the
problem, we must compute
5
\
}
Using independence,
\
5
}
fXi < 3g :
i=1
fXi < 3g =
i=1
5
Y
i=1
}(Xi < 3):
Now observe that since the Xi are nonnegative, integer-valued random variables,
}(Xi < 3) = }(Xi 2)
= }(Xi = 0 or Xi = 1 or Xi = 2)
= }(Xi = 0) + }(Xi = 1) + }(Xi = 2)
= 3=11:
Hence,
5
} \ fXi < 3g = (3=11)5 0:00151:
i=1
For the second part of the problem, we must compute
[
5
}
i=1
fXi > 2g
= 1,}
\
5
July 18, 2002
i=1
fXi 2g
40
Chap. 2 Discrete Random Variables
= 1,
5
Y
i=1
}(Xi 2)
= 1 , (3=11)5 0:99849:
Calculations similar to those in the preceding example can be used to nd
probabilities involving the maximum or minimum of several independent random variables.
Example 2.7. Let X1; : : :; Xn be independent random variables. Evaluate
}(max(X1 ; : : :; Xn) z) and }(min(X1 ; : : :; Xn) z):
Solution. Observe that max(X1; : : :; Xn) z if and only if all of the Xk
are less than or equal to z; i.e.,
fmax(X1 ; : : :; Xn) z g =
n
\
It then follows that
n
\
}(max(X1 ; : : :; Xn ) z) = }
fXk z g
k=1
n
Y
=
fXk z g:
k=1
k=1
}(Xk z);
where the second equation follows by independence.
For the min problem, observe that min(X1 ; : : :; Xn ) z if and only if at
least one of the Xi is less than or equal to z; i.e.,
n
[
fmin(X1 ; : : :; Xn) z g =
fXk z g:
k=1
Hence,
}(min(X1 ; : : :; Xn) z) = }
n
[
k=1
= 1,}
= 1,
July 18, 2002
fXk z g
n
Y
k=1
n
\
fXk > z g
k=1
}(Xk > z):
2.1 Probabilities Involving Random Variables
41
Example 2.8. A drug company has developed a new vaccine, and would
like to analyze how eective it is. Suppose the vaccine is tested in several
volunteers until one is found in whom the vaccine fails. Let T = k if the rst
time the vaccine fails is on the kth volunteer. Find }(T = k).
Solution. For i = 1; 2; : : :; let Xi = 1 if the vaccine works in the ith
volunteer, and let Xi = 0 if it fails in the ith volunteer. Then the rst failure
occurs on the kth volunteer if and only if the vaccine works for the rst k , 1
volunteers and then fails for the kth volunteer. In terms of events,
fT = kg = fX1 = 1g \ \ fXk,1 = 1g \ fXk = 0g:
Assuming the Xi are independent, and taking probabilities of both sides yields
}(T = k) = }(fX1 = 1g \ \ fXk,1 = 1g \ fXk = 0g)
= }(X1 = 1) }(Xk,1 = 1) }(Xk = 0):
If we further assume that the Xi are identically distributed with p := }(Xi = 1)
for all i, then
}(T = k) = pk,1(1 , p):
The preceding random variable T is an example of a geometric random
variable. Since we consider two types of geometric random variables, both
depending on 0 p < 1, it is convenient to write X geometric1 (p) if
}(X = k) = (1 , p)pk,1; k = 1; 2; : : :;
and X geometric0 (p) if
}(X = k) = (1 , p)pk ; k = 0; 1; : : ::
As the above exmaple shows, the geometric1(p) random variable represents the
rst time something happens. We will see in Chapter 9 that the geometric0(p)
random variable represents the steady-state number of customers in a queue
with an innite buer.
Note that by the geometric series formula (Problem 7 in Chapter 1), for any
real or complex number z with jz j < 1,
1
X
z k = 1 ,1 z :
k=0
Taking z = p shows that the probabilities sum to one in both cases. Note that
if we put q = 1 , p, then 0 < q 1, and we can write }(X = k) = q(1 , q)k,1
in the geometric1(p) case and }(X = k) = q(1 , q)k in the geometric0(p) case.
July 18, 2002
42
Chap. 2 Discrete Random Variables
Example 2.9. If X geometric0(p), evaluate }(X > 2).
Solution. We could write
}(X > 2) =
1
X
k=3
}(X = k):
However, since }(X = k) = 0 for negative k, a nite series is obtained by
writing
}(X > 2) = 1 , }(X 2)
2
X
= 1 , }(X = k)
k=0
= 1 , (1 , p)[1 + p + p2]:
Probability Mass Functions
The probability mass function (pmf) of a discrete random variable X
taking distinct values xi is dened by
pX (xi ) := }(X = xi ):
With this notation
}(X 2 B) = X IB (xi )}(X = xi ) = X IB (xi )pX (xi ):
i
i
Example 2.10. If X geometric0(p), write down its pmf.
Solution. Write
pX (k) := }(X = k) = (1 , p)pk ; k = 0; 1; 2; : :::
A graph of pX (k) is shown in Figure 2.2.
The Poisson random variable is used to model many dierent physical phenomena ranging from the photoelectric eect and radioactive decay to computer
message trac arriving at a queue for transmission. A random variable X is
said to have a Poisson probability mass function with parameter > 0, denoted
by X Poisson(), if
k ,
pX (k) = k!e ; k = 0; 1; 2; : :::
A graph of pX (k) is shown in Figure 2.3. To see that these probabilities sum
to one, recall that for any real or complex number z, the power series for ez is
1 zk
X
ez =
:
k=0 k!
July 18, 2002
2.1 Probabilities Involving Random Variables
43
0.35
0.3
0.25
0.2
0.15
0.1
0.05
0
0
1
2
3
4
5
p
X
6
7
8
9
k
( k )
Figure 2.2. The geometric0(p) pmf pX (k) = (1 , p)pk with p = 0:7.
The joint probability mass function of X and Y is dened by
pXY (xi; yj ) := }(X = xi ; Y = yj ) = }(fX = xi g \ fY = yj g): (2.2)
Two applications of the law of total probability as in (1.13) can be used to show
that3
XX
}(X 2 B; Y 2 C) =
IB (xi )IC (yj )pXY (xi ; yj ):
(2.3)
i
j
If B = fxk g and C = IR, the left-hand side of (2.3) is
}(X = xk ; Y 2 IR) = }(fX = xk g \ ) = }(X = xk ):
P
The right-hand side of (2.3) reduces to j pXY (xk ; yj ). Hence,
pX (xk ) =
X
j
pXY (xk ; yj ):
Thus, the pmf of X can be recovered from the joint pmf of X and Y by summing
over all values of Y . Similarly,
pY (y` ) =
X
pXY (xi ; y` ):
i
Because we have dened random variables to be real-valued functions of !, it is easy to
see, after expanding our shorthand notation, that
fY 2 IRg := f! 2 : Y (!) 2 IRg = :
July 18, 2002
44
Chap. 2 Discrete Random Variables
0.18
0.16
0.14
0.12
0.1
0.08
0.06
0.04
0.02
0
0
2
4
6
8
p
X
10
12
14
k
( k )
Figure 2.3. The Poisson() pmf pX (k) = k e, =k! with = 5.
In this context, we call pX and pY marginal probability mass functions.
From (2.2) it is clear that if X and Y are independent, then pXY (xi; yj ) =
pX (xi )pY (yj ). The converse is also true, since if pXY (xi ; yj ) = pX (xi )pY (yj ),
then the double sum in (2.3) factors, and we have
}(X 2 B; Y 2 C) =
X
i
IB (xi)pX (xi )
X
j
IC (yj )pY (yj )
= }(X 2 B) }(Y 2 C):
Hence, for a pair of discrete random variables X and Y , they are independent
if and only if their joint pmf factors into the product of their marginal pmfs.
2.2. Expectation
The notion of expectation is motivated by the conventional denition of
numerical average. Recall that the numerical average of n numbers, x1; : : :; xn,
is
n
1X
n xi :
i=1
We use the average to summarize or characterize the entire collection of numbers
x1; : : :; xn with a single \typical" value.
If X is a discrete random variable taking n distinct, equally likely, real
values, x1; : : :; xn, then we dene its expectation by
n
X
E[X] := 1 xi :
n i=1
July 18, 2002
2.2 Expectation
45
Since }(X = xi) = 1=n for each i = 1; : : :; n, we can also write
E[X] =
n
X
i=1
xi }(X = xi):
Observe that this last formula makes sense even if the values xi do not occur
with the same probability. We therefore dene the expectation, mean, or
average of a discrete random variable X by
X
E[X] :=
xi }(X = xi);
i
or, using the pmf notation pX (xi ) = }(X = xi ),
X
E[X] =
xi pX (xi ):
i
In complicated problems, it is sometimes easier to compute means than probabilities. Hence, the single number E[X] often serves as a conveniently computable way to summarize or partially characterize an entire collection of pairs
f(xi; pX (xi ))g.
Example 2.11. Find the mean of a Bernoulli(p) random variable X.
Solution. Since X takes only the values x0 = 0 and x1 = 1, we can write
E[X] =
1
X
i=0
i }(X = i) = 0 (1 , p) + 1 p = p:
Note that while X takes only the values 0 and 1, its \typical" value p is never
seen (unless p = 0 or p = 1).
Example 2.12 (Every Probability is an Expectation). There is a close connection between expectation and probability. If X is any random variable and
B is any set, then IB (X) is a discrete random variable taking the values zero
and one. Thus, IB (X) is a Bernoulli random variable, and
,
E[IB (X)] = } IB (X) = 1 = }(X 2 B):
Example 2.13. When light of intensity is incident on a photodetector,
the number of photoelectrons generated is Poisson with parameter . Find the
mean number of photoelectrons generated.
Solution. Let X denote the number of photoelectrons generated. We
need to calculate E[X]. Since a Poisson random variable takes only nonnegative
integer values with positive probability,
E[X] =
1
X
n=0
n }(X = n):
July 18, 2002
46
Chap. 2 Discrete Random Variables
Since the term with n = 0 is zero, it can be dropped. Hence,
E[X] =
1
X
n=1
1
X
n ,
n n!e
n e,
n=1 (n , 1)!
1 n,1
X
= e, (n, 1)! :
n=1
=
Now change the index of summation from n to k = n , 1. This results in
E[X] = e,
1
X
k = e, e = :
k=0 k!
Example 2.14 (Innite Expectation). Here is random variable for which
E[X] = 1. Suppose that }(X = k) = C ,1=k2, k = 1; 2; : : :; wherey
C :=
Then
E[X] =
1
X
k=1
k }(X = k) =
1
X
1:
2
k
k=1
1
X
k=1
1
,1
X
k Ck2 = C ,1 k1 = 1
k=1
as shown in Problem 34.
Some care is necessary when computing expectations of signed random variables that take more than nitely many values. It is the convention in probability theory that E[X] should be evaluated as
E[X] =
X
i:xi 0
xi }(X = xi) +
X
i:xi <0
xi }(X = xi );
assuming that at least one of these sums is nite. If the rst sum is +1 and
the second one is ,1, then no value is assigned to E[X], and we say that E[X]
is undened.
y Note that C is nite by Problem 34. This is important since if C = 1, then C ,1 = 0
and the probabilities would sum to zero instead of one.
July 18, 2002
2.2 Expectation
47
Example 2.15 (Undened Expectation). With C as in the previous example, suppose that for k = 1; 2; : : :; }(X = k) = }(X = ,k) = 21 C ,1=k2.
Then
E[X] =
1
X
k=1
k }(X = k) +
,1
X
k=,1
k }(X = k)
1 1 1 X
,1 1
1 X
= 2C
+
k=1 k 2C k=,1 k
1 + ,1
= 2C
2C
= undened:
Expectation of Functions of Random Variables, or the Law of the Unconscious Statistician (LOTUS)
Given a random variable X, we will often have occasion to dene a new
random variable by Z := g(X), where g(x) is a real-valued function of the
real variable x. More precisely, recall that a random variable X is actually a
function taking points of the sample space, ! 2 , into real numbers X(!).
Hence, the notation Z = g(X) is actually shorthand for Z(!) := g(X(!)). We
now show that if X has pmf pX , then we can compute E[Z] = E[g(X)] without
actually nding the pmf of Z. The precise formula is
X
E[g(X)] =
g(xi ) pX (xi):
(2.4)
i
Equation (2.4) is sometimes called the law of the unconscious statistician
(LOTUS). As a simple example of its use, we can write, for any constant a,
X
X
E[aX] =
axi pX (xi ) = a xi pX (xi ) = aE[X]:
i
i
In other words, constant factors can be pulled out of the expectation.
? Derivation
of LOTUS
To derive (2.4), we proceed as follows. Let X take distinct values xi. Then Z
takes values g(xi ). However, the values g(xi ) may not be distinct. For example,
if g(x) = x2 , and X takes the four distinct values 1 and 2, then Z takes only
the two distinct values 1 and 4. In any case, let zk denote the distinct values
of Z, and put
Bk := fx : g(x) = zk g:
Then Z = zk if and only if g(X) = zk , and this happens if and only if X 2 Bk .
Thus, the pmf of Z is
pZ (zk ) := }(Z = zk ) = }(X 2 Bk );
July 18, 2002
48
Chap. 2 Discrete Random Variables
where
X
}(X 2 Bk ) =
We now compute
E[Z] =
=
=
X
k
X
k
i
IBk (xi) pX (xi ):
zk pZ (zk )
zk
X
X X
i
k
i
IBk (xi) pX (xi )
zk IBk (xi ) pX (xi ):
From the denition of the Bk in terms of the distinct zk , the Bk are disjoint.
Hence, for xed i in the outer sum, in the inner sum, only one of the Bk
can contain xi . In other words, all terms but one in the inner sum are zero.
Furthermore, for that one value of k with IBk (xi ) = 1, we have xi 2 Bk , and
zk = g(xi ); i.e., for this value of k,
zk IBk (xi) = g(xi )IBk (xi ) = g(xi ):
Hence, the inner sum reduces to g(xi ), and
X
E[Z] = E[g(X)] =
g(xi ) pX (xi):
i
Linearity of Expectation
The derivation of the law of the unconscious statistician can be generalized
to show that if g(x; y) is a real-valued function of two variables x and y, then
XX
E[g(X; Y )] =
g(xi ; yj ) pXY (xi ; yj ):
i
j
In particular, taking g(x; y) = x + y, it is a simple exercise to show that E[X +
Y ] = E[X] + E[Y ]. Using this fact, we can write
E[aX + bY ] = E[aX] + E[bY ] = aE[X] + bE[Y ]:
In other words, expectation is linear.
Example 2.16. A binary communication link has bit-error probability p.
What is the expected number of bit errors in a transmission of n bits?
Solution. For i = 1; : : :; n, let Xi = 1 if the ith bit is received incorrectly,
and let Xi = 0 otherwise. Then Xi Bernoulli(p), and Y := X1 + + Xn
is the total number of errors in the transmission of n bits. We know from
Example 2.11 that E[Xi] = p. Hence,
X
n
E[Y ] = E
i=1
Xi =
n
X
i=1
E[Xi ] =
July 18, 2002
n
X
i=1
p = np:
2.2 Expectation
49
Moments
The nth moment, n 1, of a real-valued random variable X is dened
to be E[X n]. In particular, the rst moment of X is its mean E[X]. Letting
m = E[X], we dene the variance of X by
var(X) := E[(X , m)2 ]
= E[X 2 , 2mX + m2 ]
= E[X 2] , 2mE[X] + m2 ; by linearity;
= E[X 2] , m2
= E[X 2] , (E[X])2 :
The variance is a measure of the spread of a random variable about its mean
value. The larger the variance, the more likely we are to see values of X that
are far from the mean m. The standard deviation of X is dened to be
the positive square root of the variance. The nth central moment of X is
E[(X , m)n ]. Hence, the second central moment is the variance.
Example 2.17. Find the second moment and the variance of X if X Bernoulli(p).
Solution. Since X takes only the values 0 and 1, it has the unusual
property that X 2 = X. Hence, E[X 2] = E[X] = p. It now follows that
var(X) = E[X 2] , (E[X])2 = p , p2 = p(1 , p):
Example 2.18. Find the second moment and the variance of X if X Poisson().
Solution. Observe that E[X(X , 1)] + E[X] = E[X 2]. Since we know that
E[X] = from Example 2.13, it suces to compute
E[X(X , 1)] =
=
1
X
n=0
1
X
n ,
n(n , 1) n!e
n e,
n=2 (n , 2)!
= 2e,
1
X
n,2 :
n=2 (n , 2)!
Making the change of summation k = n , 2, we have
1 k
X
E[X(X , 1)] = 2 e, k=0 k!
= 2 :
July 18, 2002
50
Chap. 2 Discrete Random Variables
It follows that E[X 2 ] = 2 + , and
var(X) = E[X 2 ] , (E[X])2 = (2 + ) , 2 = :
Thus, the Poisson() random variable is unusual in that its mean and variance
are the same.
Probability Generating Functions
Let X be a discrete random variable taking only nonnegative integer values.
The probability generating function (pgf) of X is4
GX (z) := E[z X ] =
Since for jz j 1,
1
X n
z
n=0
}(X = n) =
1
X
n=0
1
X
jz n}(X = n)j
n=0
1
X
jz jn}(X = n)
n=0
1
X
n=0
z n}(X = n):
}(X = n) = 1;
GX has a power series expansion with radius of convergence at least one. The
reason that GX is called the probability generating function is that the probabilities }(X = k) can be obtained from GX as follows. Writing
GX (z) = }(X = 0) + z }(X = 1) + z 2 }(X = 2) + ;
we immediately see that GX (0) = }(X = 0). We also see that
G0X (z) = }(X = 1) + 2z }(X = 2) + 3z 2 }(X = 3) + :
Hence, G0X (0) = }(X = 1). Continuing in this way shows that
G(Xk)(z)jz=0 = k! }(X = k);
or equivalently,
G(Xk)(z)jz=0 = }(X = k):
(2.5)
k!
The probability generating function can also be used to nd moments. Since
G0X (z) =
1
X
n=1
nz n,1}(X = n);
July 18, 2002
2.2 Expectation
51
setting z = 1 yields
G0X (z)jz=1 =
Similarly, since
G00X (z) =
setting z = 1 yields
G00X (z)jz=1 =
In general, since
1
X
n=2
1
X
n=1
1
X
n=2
n}(X = n) = E[X]:
n(n , 1)z n,2}(X = n);
n(n , 1)}(X = n) = E[X(X , 1)] = E[X 2] , E[X]:
G(Xk)(z) =
1
X
n=k
n(n , 1) (n , [k , 1])z n,k}(X = n);
setting z = 1 yields
G(Xk)(z)jz=1 = E X(X , 1)(X , 2) (X , [k , 1]) :
The right-hand side is called the kth factorial moment of X.
Example 2.19. Find the probability generating function of X Poisson(),
and use the result to nd E[X] and var(X).
Solution. Write
GX (z) = E[z X ]
1
X
=
z n}(X = n)
n=0
=
1
X
n=0
n ,
z n n!e
1 (z)n
X
,
= e
n=0
,
e ez
n!
=
= exp[(z , 1)]:
Now observe that G0X (z) = exp[(z , 1)], and so E[X] = G0X (1) = . Since
G00X (z) = exp[(z , 1)]2, E[X 2 ] , E[X] = 2 , and so E[X 2] = 2 + . For the
variance, write
var(X) = E[X 2] , (E[X])2 = (2 + ) , 2 = :
July 18, 2002
52
Chap. 2 Discrete Random Variables
Expectations of Products of Functions of Independent Random Variables
If X and Y are independent, then
E[h(X) k(Y )] = E[h(X)] E[k(Y )]
for all functions h(x) and k(y). In other words, if X and Y are independent,
then for any functions h(x) and k(y), the expectation of the product h(X)k(Y )
is equal to the product of the individual expectations. To derive this result, we
use the fact that
pXY (xi ; yj ) := }(X = xi; Y = yj )
= }(X = xi) }(Y = yj ); by independence;
= pX (xi ) pY (yj ):
Now write
E[h(X) k(Y )] =
=
=
XX
i
j
i
j
XX
X
i
h(xi ) k(yj ) pXY (xi ; yj )
h(xi ) k(yj ) pX (xi ) pY (yj )
h(xi ) pX (xi)
X
= E[h(X)] E[k(Y )]:
j
k(yj ) pY (yj )
Example 2.20. A certain communication network consists of n links. Suppose that each link goes down with probability p independently of the other
links. For k = 0; : : :; n, nd the probability that exactly k out of the n links
are down.
Solution. Let Xi = 1 if the ith link is down and Xi = 0 otherwise. Then
the Xi are independent Bernoulli(p). Put Y := X1 + + Xn . Then Y counts
the number of links that are down. We rst nd the probability generating
function of Y . Write
GY (z) =
=
=
=
E[z Y ]
E[z X1 ++Xn ]
E[z X1 z Xn ]
E[z X1 ] E[z Xn ]; by independence:
Now, the probability generating function of a Bernoulli(p) random variable is
easily seen to be
E[z Xi ] = z 0 (1 , p) + z 1 p = (1 , p) + pz:
July 18, 2002
2.2 Expectation
53
Thus,
GY (z) = [(1 , p) + pz]n:
To apply (2.5), we need the derivatives of GY (z). The rst derivative is
G0Y (z) = n[(1 , p) + pz]n,1p;
and in general, the kth derivative is
G(Yk)(z) = n(n , 1) (n , [k , 1]) [(1 , p) + pz]n,k pk :
It now follows that
(k)
}(Y = k) = GY (0)
k!
n(n
, 1) (n , [k , 1]) (1 , p)n,k pk
=
k!
n!
k
= k!(n , k)! p (1 , p)n,k :
Since the formula for GY (z) is a polynomial of degree n, G(Yk)(z) = 0 for all
k > n. Thus, }(Y = k) = 0 for k > n.
The preceding random variable Y is an example of a binomial(n;p) random
variable. (An alternative derivation using techniques from Section 2.4 is given
in the Notes.5 ) The binomial random variable counts how many times an event
has occurred. Its probability mass function is usually written using the notation
pY (k) = nk pk (1 , p)n,k ; k = 0; : : :; n;
, where the symbol nk is read \n choose k," and is dened by
n :=
n!
k!(n , k)! :
k
, In Matlab, nk = nchoosek(n; k).
For any discrete random variable Y taking only nonnegative integer values,
we always have
GY (1) =
X
1
k=0
z k pY (k)
z=1
=
1
X
k=0
pY (k) = 1:
For the binomial random variable Y this says
n n
X
pk (1 , p)n,k = 1;
k
k=0
July 18, 2002
54
Chap. 2 Discrete Random Variables
when p is any number satisfying 0 p 1. More generally, suppose a and b are
any nonnegative numbers with a + b > 0. Put p = a=(a + b). Then 0 p 1,
and 1 , p = b=(a + b). It follows that
n
X
k=0
n
k
a k b n,k = 1:
a+b
a+b
Multiplying both sides by (a + b)k (a + b)n,k = (a + b)n yields the binomial
theorem,
n n
X
k n,k = (a + b)n:
k ab
k=0
Remark. The above derivation is for nonnegative a and b. The binomial theorem actually holds for complex a and b. This is easy to derive using
induction on n along with the easily veried identity
n + n = n + 1:
k,1
k
k
, On account of the binomial theorem, the quantity nk is sometimes called
the binomial coecient. We also point out that the binomial coecients can
be read o from the nth row of Pascal's triangle in Figure 2.4. Noting that
the top row is row 0, it is immediate, for example, that
(a + b)5 = a5 + 5a4 b + 10a3b2 + 10a2b3 + 5ab4 + b5 :
To generate the triangle, observe that except for the entries that are ones, each
entry is equal to the sum of the two numbers above it to the left and right.
1
1
1
5
1
4
1
3
10
1
2
6
..
.
1
3
10
1
4
1
5
1
1
Figure 2.4. Pascal's triangle.
Binomial Random Variables and Combinations
We showed in Example 2.20 that if X1 ; : : :; Xn are i.i.d. Bernoulli(p), then
Y := X1 + + Xn is binomial(n; p). We can use this result to show that the
July 18, 2002
2.2 Expectation
55
, binomial coecient nk is equal to the number of n-bit
words with exactly k
, ones and n , k zeros. On the one hand, }(Y = k) = nk pk (1 , p)n,k . On the
other hand, observe that Y = k if and only if
X1 = i1 ; : : :; Xn = in
for some n-tuple (i1 ; : : :; in ) of ones and zeros with exactly k ones and n , k
zeros. Denoting this set of n-tuples by Bn;k , we can write
}(Y = k) = }
=
=
[
(i1 ;:::;in )2Bn;k
X
(i1 ;:::;in )2Bn;k
X
fX1 = i1 ; : : :; Xn = in g
}(X1 = i1 ; : : :; Xn = in )
n
Y
(i1 ;:::;in )2Bn;k =1
}(X = i ):
Now, each factor is p if i = 1 and 1 , p if i = 0. Since each n-tuple belongs
to Bn;k , we must have k factors being p and n , k factors being 1 , p. Thus,
}(Y = k) =
X
(i1 ;:::;in )2Bn;k
pk (1 , p)n,k :
Since all terms in the above sum are the same,
}(Y = k) = jBn;k j pk (1 , p)n,k :
, It follows that jBn;k j = nk .
Consider n objects O1; : : :; On. How many ways can we select exactly k of
the n objects if the order of selection is not important? Each such selection is
O1
O2
/
/
On − 1
On
Figure 2.5. There are n objects arranged over trap doors. If k doors are opened (and n , k
shut), this device allows the corresponding combination of k of the n objects to fall into the
bowl below.
July 18, 2002
56
Chap. 2 Discrete Random Variables
called a combination. One way to answer this question is to think of arranging
the n objects over trap doors as shown in Figure 2.5. If k doors are opened
(and n , k shut), this device allows the corresponding combination of k of the n
objects to fall into the bowl below. The number of ways to select the open and
shut doors is exactly the number,of n-bit words with k ones and n , k zeros.
As argued above, this number is nk .
Example 2.21. A 12-person jury is to be selected from a group of 20
potential jurors. How many dierent juries are possible?
Solution. There are
20 = 20! = 125970
12! 8!
12
dierent juries.
Example 2.22. A 12-person jury is to be selected from a group of 20
potential jurors of which 11 are men and 9 are women. How many 12-person
juries are there with 5 men and 7 women?
Solution. There are ,115 ways to choose the 5 men, and there are ,97 ways
to choose the 7 women. Hence, there are
11 9 = 11! 9! = 16632
5 7
5! 6! 7! 2!
possible juries with 5 men and 7 women.
Example 2.23. An urn contains 11 green balls and 9 red balls. If 12 balls
are chosen at random, what is the probability of choosing exactly 5 green balls
and 7 red balls?
Solution. Since there are, ,2012 ways to choose any 12 balls, we envision
a sample space with j
j = 20
12 . We are interested in a subset of , call it
Bg;r (5; 7), corresponding to those selections of 12 balls such that 5 are green
and 7 are red. Thus,
jBg;r (5; 7)j = 115 97 :
Since all selections of 12 balls are equally likely, the probability we seek is
11 9
jBg;r (5; 7)j = 5 7 = 16632 0:132:
20
j
j
125970
12
July 18, 2002
2.3 The Weak Law of Large Numbers
57
Poisson Approximation of Binomial Probabilities
If we let := np, then the probability generating function of a binomial(n; p)
random variable can be written as
[(1 , p) + pz]n = [1 + p(z , 1)]n
n
= 1 + (z n, 1) :
From calculus, recall the formula
x n = ex :
lim
1
+
n!1
n
So, for large n,
n
1 + (z n, 1) exp[(z , 1)];
which is the probability generating function of a Poisson() random variable
(Example 2.19). In making this approximation, n should be large compared to
(z , 1). Since := np, as n becomes large, so does (z , 1). To keep the size
of small enough to be useful, we should keep p small. Under this assumption,
the binomial(n;p) probability generating function is close to the Poisson(np)
probability generating function. This suggests the approximationz
n pk (1 , p)n,k (np)k e,np ; n large, p small:
k!
k
2.3. The Weak Law of Large Numbers
Let X1 ; X2; : : : be a sequence of random variables with a common mean
E[Xi ] = m for all i. In practice, since we do not know m, we use the numerical
average, or sample mean,
n
X
Mn := n1 Xi ;
i=1
in place of the true, but unknown value, m. Can this procedure of using Mn as
an estimate of m be justied in some sense?
Example 2.24. You are given a coin which may or may not be fair, and
you want to determine the probability of heads, p. If you toss the coin n times
and use the fraction of times that heads appears as an estimate of p, how does
this t into the above framework?
Solution. Let Xi = 1 if the ith toss results in heads, and let Xi = 0
otherwise. Then }(Xi = 1) = p and m := E[Xi] = p as well. Note that
X1 + + Xn is the number of heads, and Mn is the fraction of heads. Are we
justied in using Mn as an estimate of p?
z This approximation is justied rigorously in Problems 15 and 16(a) in Chapter 11. It is
also derived directly without probability generating functions in Problem 17 in Chapter 11.
July 18, 2002
58
Chap. 2 Discrete Random Variables
One way to answer these questions is with a weak law of large numbers
(WLLN). A weak law of large numbers gives conditions under which
lim }(jMn , mj ") = 0
n!1
for every " > 0. This is a complicated formula. However, it can be interpreted
as follows. Suppose that based on physical considerations, m is between 30 and
70. Let us agree that if Mn is within " = 1=2 of m, we are \close enough" to
the unknown value m. For example, if Mn = 45:7, and if we know that Mn is
within 1=2 of m, then m is between 45:2 and 46:2. Knowing this would be an
improvement over the starting point 30 m 70. So, if jMn , mj < ", we are
\close enough," while if jMn , mj " we are not \close enough." A weak law
says that by making n large (averaging lots of measurements), the probability
of not being close enough can be made as small as we like; equivalently, the
probability of being close enough can be made as close to one as we like. For
example, if }(jMn , mj ") 0:1, then
}(jMn , mj < ") = 1 , }(jMn , mj ") 0:9;
and we would be 90% sure that Mn is \close enough" to the true, but unknown,
value of m.
Before giving conditions under which the weak law holds, we need to introduce the notion of uncorrelated random variables. Also, in order to prove the
weak law, we need to rst derive Markov's inequality and Chebyshev's inequality.
Uncorrelated Random Variables
Before giving conditions under which the weak law holds, we need to introduce the notion of uncorrelated random variables. A pair of random
variables, say X and Y , is said to be uncorrelated if
E[XY ] = E[X] E[Y ]:
If X and Y are independent, then they are uncorrelated. Intuitively, the property of being uncorrelated is weaker than the property of independence. Recall
that for independent random variables,
E[h(X) k(Y )] = E[h(X)] E[k(Y )]
for all functions h(x) and k(y), while for uncorrelated random variables, we
only require that that this hold for h(x) = x and k(y) = y. For an example of
uncorrelated random variables that are not independent, see Problem 41.
Example 2.25. If mX := E[X] and mY := E[Y ], show that X and Y are
uncorrelated if and only if
E[(X , mX )(Y , mY )] = 0:
July 18, 2002
2.3 The Weak Law of Large Numbers
59
Solution. The result is obvious if we expand
E[(X , mX )(Y , mY )] = E[XY , mX Y , XmY + mX mY ]
= E[XY ] , mX E[Y ] , E[X]mY + mX mY
= E[XY ] , mX mY , mX mY + mX mY
= E[XY ] , mX mY :
Example 2.26. Let X1; X2; : : : be a sequence of uncorrelated random variables; more precisely, for i 6= j, Xi and Xj are uncorrelated. Show that
var
X
n
i=1
n
X
Xi =
i=1
var(Xi ):
In other words, for uncorrelated random variables, the variance of the sum is
the sum of the variances.
Solution. Let mi := E[Xi] and mj := E[Xj ]. Then uncorrelated means
that
E[(Xi , mi )(Xj , mj )] = 0 for all i 6= j:
Put
n
X
X :=
Xi :
Then
and
i=1
E[X] = E
n
X
i=1
n
X
X , E[X] =
Now write
i=1
Xi =
Xi ,
n
X
n
X
i=1
var(X) = E[(X , E[X])2]
= E
=
n
X
E[Xi ] =
i=1
mi =
(Xi , mi )
i=1
n
n
X X
i=1 j =1
n
X
i=1
mi ;
n
X
i=1
(Xi , mi ):
n
X
j =1
(Xj , mj )
E[(Xi , mi )(Xj , mj )] :
For xed i, consider the sum over j. There is only one term in this sum for
which j 6= i, and that is the term j = i. All the other terms are zero because
July 18, 2002
60
Chap. 2 Discrete Random Variables
Xi and Xj are uncorrelated. Hence,
var(X) =
=
=
n
X
i=1
n
X
i=1
n
X
i=1
E[(Xi , mi )(Xi , mi )]
E[(Xi , mi )2]
var(Xi ):
Markov's Inequality
If X is a nonnegative random variable, we show that for any a > 0,
}(X a) E[X] :
a
This relation is known as Markov's inequality. Since we always have }(X a) 1, Markov's inequality is useful only when the right-hand side is less than
one.
Consider the random variable aI[a;1) (X). This random variable takes the
value zero if X < a, and it takes the value a if X a. Hence,
E[aI[a;1) (X)] = 0 }(X < a) + a }(X a) = a }(X a):
Now observe that
aI[a;1) (X) X:
To see this, note that the left-hand side is either zero or a. If it is zero, there is
no problem because X is a nonnegative random variable. If the left-hand side
is a, then we must have I[a;1) (X) = 1; but this means that X a. Next take
expectations of both sides to obtain
a }(X a) E[X]:
Dividing both sides by a yields Markov's inequality.
Chebyshev's Inequality
For any random variable Y and any a > 0, we show that
2
}(jY j a) E[Y2 ] :
a
This result, known as Chebyshev's inequality, is an easy consequence of
Markov's inequality. As in the case of Markov's inequality, it is useful only
when the right-hand side is less than one.
July 18, 2002
2.3 The Weak Law of Large Numbers
61
To prove Chebyshev's inequality, just note that
fjY j ag = fjY j2 a2g:
Since the above two events are equal, they have the same probability. Hence,
2
}(jY j a) = }(jY j2 a2 ) E[jY2j ] ;
a
where the last step follows by Markov's inequality.
The following special cases of Chebyshev's inequality are sometimes of interest. If m := E[X] is nite, then taking Y = jX , mj yields
}(jX , mj a) var(X)
a2 :
If 2 := var(X) is also nite, taking a = k yields
}(jX , mj k) 12 :
k
Conditions for the Weak Law
We now give sucient conditions for a version of the weak law of large
numbers (WLLN). Suppose that the Xi all have the same mean m and the
same variance 2 . Assume also that the Xi are uncorrelated random variables.
Then for every " > 0,
lim }(jMn , mj ") = 0:
n!1
This is an immediate consequence of the following two facts. First, by Chebyshev's inequality,
n)
}(jMn , mj ") var(M
"2 :
Second,
n
X
var(Mn ) = var 1 Xi
n n=1
n
2 X
2
2
var(Xi ) = n2 = :
= n1
n
n
i=1
(2.6)
Thus,
2
}(jMn , mj ") 2 ;
(2.7)
n"
which goes to zero as n ! 1. Note that the bound 2 =n"2 can be used to
select a suitable value of n.
Example 2.27. Given " and 2, determine how large n should be so that
the probability that Mn is within " of m is at least 0:9.
July 18, 2002
62
Chap. 2 Discrete Random Variables
Solution. We want to have
}(jMn , mj < ") 0:9:
Rewrite this as
or
1 , }(jMn , mj ") 0:9;
}(jMn , mj ") 0:1:
By (2.7), it suces to take
2 0:1;
n"2
or n 102="2 .
2.4. Conditional Probability
For conditional probabilities involving random variables, we use the notation
}(X 2 B jY 2 C) := }(fX 2 B gjfY 2 C g)
}
= (fX }2(fBYg 2\ fCYg)2 C g)
}
= (X}2(YB;2YC)2 C) :
For discrete random variables, we dene the conditional probability
mass functions,
pX jY (xijyj ) := }(X = xijY = yj )
}
= (X}=(Yxi=; Yy =) yj )
j
= pXYp (x(yi; )yj ) ;
Y j
and
pY jX (yj jxi) := }(Y = yj jX = xi)
}
= (X}=(Xxi=; Yx =) yj )
i
p
(x
;
y
)
= XYp (xi ) j :
X i
We call pX jY the conditional probability mass function of X given Y . Similarly,
pY jX is called the conditional pmf of Y given X.
July 18, 2002
2.4 Conditional Probability
63
Conditional pmfs are important because we can use them to compute conditional probabilities just as we use marginal pmfs to compute ordinary probabilities. For example,
}(Y 2 C jX = xk ) = X IC (yj )pY jX (yj jxk):
j
This formula is derived by taking B = fxk g in (2.3), and then dividing the
result by }(X = xk ) = pX (xk ).
Example 2.28. To transmit message i using an optical communication
system, light of intensity i is directed at a photodetector. When light of
intensity i strikes the photodetector, the number of photoelectrons generated
is a Poisson(i ) random variable. Find the conditional probability that the
number of photoelectrons observed at the photodetector is less than 2 given
that message i was sent.
Solution. Let X denote the message to be sent, and let Y denote the number of photoelectrons generated by the photodetector. The problem statement
is telling us that
n ,
}(Y = njX = i) = i e i ; n = 0; 1; 2; : : ::
n!
The conditional probability to be calculated is
}(Y < 2jX = i) = }(Y = 0 or Y = 1jX = i)
= }(Y = 0jX = i) + }(Y = 1jX = i)
= e,i + i e,i :
Example 2.29. For the random variables X and Y used in the solution of
the previous example, write down their joint pmf if X geometric0 (p).
Solution. The joint pmf is
n e,i
;
pXY (i; n) = pX (i) pY jX (nji) = (1 , p)pi i n!
for i; n 0, and pXY (i; n) = 0 otherwise.
The Law of Total Probability
Let A be any event, and let X be any discrete random variable taking
distinct values xi. Then the events
Bi := fX = xig = f! 2 : X(!) = xi g:
July 18, 2002
64
Chap. 2 Discrete Random Variables
are pairwise disjoint, and i }(Bi ) =
probability as in (1.13) yields
P
}(A) =
X
i
}(A \ Bi ) =
i }(X
P
X
i
= xi ) = 1. The law of total
}(AjX = xi )}(X = xi ):
(2.8)
Hence, we call (2.8) the law of total probability as well.
Now suppose that Y is an arbitrary random variable. Taking A = fY 2 C g,
where C IR, yields
}(Y 2 C) = X }(Y 2 C jX = xi )}(X = xi):
i
If Y is a discrete random variable taking distinct values yj , then setting C =
fyj g yields
}(Y = yj ) =
X
=
X
i
i
}(Y = yj jX = xi)}(X = xi)
pY jX (yj jxi)pX (xi ):
Example 2.30. Radioactive samples give o alpha-particles at a rate based
on the size of the sample. For a sample of size k, suppose that the number of particles observed is a Poisson random variable Y with parameter k. If the sample
size is a geometric1 (p) random variable X, nd }(Y = 0) and }(X = 1jY = 0).
Solution. The rst step is to realize that the problem statement is telling
us that as a function of n, }(Y = njX = k) is the Poisson pmf with parameter
k. In other words,
n ,k
}(Y = njX = k) = k e ; n = 0; 1; : : ::
n!
In particular, note that }(Y = 0jX = k) = e,k . Now use the law of total
probability to write
}(Y = 0) =
=
1
X
k=1
1
X
k=1
}(Y = 0jX = k) }(X = k)
e,k (1 , p)pk,1
1
X
= 1 ,p p (p=e)k
k=1
, p:
1
,
p
= p 1 ,p=ep=e = 1e ,
p
July 18, 2002
2.4 Conditional Probability
Next,
65
}(X = 1jY = 0) = }(X}= 1; Y = 0)
(Y = 0)
}(Y = 0jX = 1)}(X = 1)
=
}(Y = 0)
p
= e,1 (1 , p) 1e ,
,p
= 1 , p=e:
Example 2.31. A certain electric eye employs a photodetector whose efciency occasionally drops in half. When operating properly, the detector outputs photoelectrons according to a Poisson() pmf. When the detector malfunctions, it outputs photoelectrons according to a Poisson(=2) pmf. Let p < 1
denote the probability that the detector is operating properly. Find the pmf
of the observed number of photoelectrons. Also nd the conditional probability that the circuit is malfunctioning given that n output photoelectrons are
observed.
Solution. Let Y denote the detector output, and let X = 1 indicate that
the detector is operating properly. Let X = 0 indicate that it is malfunctioning.
Then the problem statement is telling us that }(X = 1) = p and
n ,=2
n ,
}(Y = njX = 0) = (=2) e :
}(Y = njX = 1) = e
and
n!
n!
Now, using the law of total probability,
}(Y = n) = }(Y = njX = 1)}(X = 1) + }(Y = njX = 0)}(X = 0)
n ,=2
n ,
= n!e p + (=2)n!e (1 , p):
This is the pmf of the observed number of photoelectrons.
The above formulas can be used to nd }(X = 0jY = n). Write
}(X = 0jY = n) = }(X}= 0; Y = n)
(Y = n)
}(Y = njX = 0)}(X = 0)
=
}(Y = n)
(=2)n e,=2 (1 , p)
= n , n! n ,=2
e p + (=2) e (1 , p)
n!
n!
1
;
= n ,=2
2 e p +1
(1 , p)
July 18, 2002
66
Chap. 2 Discrete Random Variables
which is clearly a number between zero and one as a probability should be.
Notice that as we observe a greater output Y = n, the conditional probability
that the detector is malfunctioning decreases.
The Substitution Law
It is often the case that Z is a function of X and some other discrete random
variable Y , say Z = X + Y , and we are interested in }(Z = z). In this case,
the law of total probability becomes
}(Z = z) = X }(Z = z jX = xi )}(X = xi):
=
i
X
i
}(X + Y = z jX = xi)}(X = xi ):
We now claim that
}(X + Y = z jX = xi) = }(xi + Y = z jX = xi ):
This property is known as the substitution law of conditional probability. To
derive it, we need the observation
fX + Y = z g \ fX = xi g = fxi + Y = z g \ fX = xig:
From this we see that
}(X + Y = z jX = xi ) = }(fX +}Y = z g \ fX = xi g)
(fX = xi g)
}(fxi + Y = z g \ fX = xi g)
=
}(fX = xi g)
}
= (xi + Y = z jX = xi ):
Writing
}(xi + Y = z jX = xi) = }(Y = z , xijX = xi );
we can make further simplications if X and Y are independent. In this case,
}(Y = z , xi jX = xi ) = }(Y =}z , xi; X = xi)
(X = xi)
}(Y = z , xi)}(X = xi)
=
}(X = xi )
}
= (Y = z , xi ):
Thus, when X and Y are independent, we can write
}(Y = z , xijX = xi) = }(Y = z , xi);
and we say that we \drop the conditioning."
July 18, 2002
2.4 Conditional Probability
67
Example 2.32 (Signal in Additive Noise). A random, integer-valued signal X is transmitted over a channel subject to independent, additive, integervalued noise Y . The received signal is Z = X +Y . To estimate X based on the
received value Z, the system designer wants to use the conditional pmf pX jZ .
Our problem is to nd this conditional pmf.
Solution. Let X and Y be independent, discrete, integer-valued random
variables with pmfs pX and pY , respectively. Put Z := X + Y . We begin by
writing out the formula for the desired pmf
pX jZ (ijj) = }(X = ijZ = j)
} = i; Z = j)
= (X}(Z
= j)
}(Z = j jX = i)}(X = i)
=
}(Z = j)
}
jX = i)pX (i) :
(2.9)
= (Z =}j(Z
= j)
To continue the analysis, we use the substitution law followed by independence
to write
}(Z = j jX = i) = }(X + Y = j jX = i)
= }(i + Y = j jX = i)
= }(Y = j , ijX = i)
= }(Y = j , i)
= pY (j , i):
(2.10)
This result can also be combined with the law of total probability to compute
the denominator in (2.9). Just write
X
X
pZ (j) = }(Z = j jX = i)}(X = i) =
pY (j , i)pX (i):
(2.11)
i
i
In other words, if X and Y are independent, discrete, integer-valued random
variables, the pmf of Z = X + Y is the discrete convolution of pX and pY .
It now follows that
pX jZ (ijj) = XpY (j , i)pX (i) ;
pY (j , k)pX (k)
k
where in the denominator we have changed the dummy index of summation to
k to avoid confusion with the i in the numerator.
The Poisson() random variable is a good model for the number of photoelectrons generated in a photodetector when the incident light intensity is .
July 18, 2002
68
Chap. 2 Discrete Random Variables
Now suppose that an additional light source of intensity is also directed at the
photodetector. Then we expect that the number of photoelectrons generated
should be related to the total light intensity +. The next example illustrates
the corresponding probabilistic model.
Example 2.33. If X and Y are independent Poisson random variables
with respective parameters and , use the results of the preceding example
to show that Z := X + Y is Poisson( + ). Also show that as a function of i,
pX jZ (ijj) is a binomial(j;=( + )) pmf.
Solution. To nd pZ (j), we apply (2.11) as follows. Since pX (i) = 0 for
i < 0 and since pY (j , i) = 0 for j < i, (2.11) becomes
pZ (j) =
j i ,
X
e
j ,i ,
e
i! (j , i)!
i=0
j
,(+) X
e
j! i j ,i
=
j! i=0 i!(j , i)! j
,(+) X
j i j ,i
e
=
j! i=0 i ,(+)
= e j! ( + )j ; j = 0; 1; : : :;
where the last step follows by the binomial theorem.
Our second task is to compute
}
}
pX jZ (ijj) = (Z = j}jX(Z==i)j) (X = i)
}
= (Z = jpjX(j)= i)pX (i) :
Z
Since we have already found pZ (j), all we need is }(Z = j jX = i), which, using
(2.10), is simply pY (j , i). Thus,
j ,i , i ,
pX jZ (ijj) = (j ,ei)! ei!
i j ,i = ( + )j ji
=
j
i
e,(+) ( + )j
j!
i j , i ;
( + ) ( + )
for i = 0; : : :; j.
July 18, 2002
2.5 Conditional Expectation
69
2.5. Conditional Expectation
Just as we developed expectation for discrete random variables in Section 2.2, including the law of the unconscious statistician, we can develop conditional expectation in the same way. This leads to the formula
X
E[g(Y )jX = xi] =
g(yj ) pY jX (yj jxi):
(2.12)
j
Example 2.34. The random number Y of alpha-particles given o by a
radioactive sample is Poisson(k) given that the sample size X = k. Find
E[Y jX = k].
Solution. We must compute
X
E[Y jX = k] =
n }(Y = njX = k);
n
where (cf. Example 2.30)
Hence,
n ,k
}(Y = njX = k) = k e ; n = 0; 1; : : ::
n!
E[Y jX = k] =
1
X
n=0
n ,k
n k n!e :
Now observe that the right-hand side is exactly ordinary expectation of a Poisson random variable with parameter k (cf. the calculation in Example 2.13).
Therefore, E[Y jX = k] = k.
Example 2.35. Suppose that given Y = n, X binomial(n; p). Find
E[X jY = n].
Solution. We must compute
E[X jY = n] =
where
Hence,
n
X
k=0
k}(X = kjY = n);
}(X = kjY = n) = n pk (1 , p)n,k :
k
n
X
k nk pk (1 , p)n,k :
k=0
Now observe that the right-hand side is exactly the ordinary expectation of a
binomial(n;p) random variable. It is shown in Problem 20 that the mean of
such a random variable is np. Therefore, E[X jY = n] = np.
E[X jY = n] =
July 18, 2002
70
Chap. 2 Discrete Random Variables
Substitution Law for Conditional Expectation
For functions of two variables, we have the following conditional law of the
unconscious statistician,
XX
E[g(X; Y )jX = xi ] =
g(xk ; yj ) pXY jX (xk ; yj jxi):
However,
k
j
pXY jX (xk ; yj jxi) = }(X = xk ; Y = yj jX = xi )
}
; Y = yj ; X = xi ) :
= (X = x}k(X
=x)
i
Now, when k 6= i, the intersection
fX = xk g \ fY = yj g \ fX = xig
is empty, and has zero probability. Hence, the numerator above is zero for
k 6= i. When k = i, the above intersections reduce to fX = xig \ fY = yj g,
and so
pXY jX (xk ; yj jxi) = pY jX (yj jxi); for k = i:
It now follows that
X
E[g(X; Y )jX = xi ] =
g(xi ; yj ) pY jX (yj jxi)
j
= E[g(xi; Y )jX = xi]:
(2.13)
which is the substitution law for conditional expectation. Note that if g in
(2.13) is a function of Y only, then (2.13) reduces to (2.12). Also, if g is of
product form, say g(x; y) = h(x)k(y), then
E[h(X)k(Y )jX = xi ] = h(xi )E[k(Y )jX = xi ]:
Law of Total Probability for Expectation
In Section 2.4 we discussed the law of total probability, which shows how to
compute probabilities in terms of conditional probabilities. We now derive the
analogous formula for expectation. Write
X
i
E[g(X; Y )jX = xi ] pX (xi) =
=
XX
i
j
XX
i
j
g(xi ; yj ) pXY (xi ; yj )
= E[g(X; Y )]:
In particular, if g is a function of Y only, then
X
E[g(Y )jX = xi] pX (xi ):
E[g(Y )] =
i
July 18, 2002
g(xi ; yj ) pY jX (yj jxi) pX (xi )
2.5 Conditional Expectation
71
Example 2.36. Light of intensity is directed at a photomultiplier that
generates X Poisson() primaries. The photomultiplier then
generates
Y
,
secondaries, where given X = n, Y is conditionally geometric1 (n+2),1 . Find
the expected number of secondaries and the correlation between the primaries
and the secondaries.
Solution. The law of total probability for expectations says that
E[Y ] =
1
X
n=0
E[Y jX = n] pX (n);
where the range of summation follows because X is Poisson(). The next
step is to compute the conditional expectation. The conditional pmf of Y
is geometric1(p), where p = (n + 2),1 , and the mean of such a pmf is, by
Problem 19, 1=(1 , p). Hence,
E[Y ] =
1
X
n=0
1 + n +1 1 pX (n) = E 1 + X 1+ 1 :
An easy calculation (Problem 14) shows that for X Poisson(),
1
,
E
X + 1 = [1 , e ]=;
and so E[Y ] = 1 + [1 , e, ]=.
The correlation between X and Y is
E[XY ] =
=
1
X
n=0
1
X
n=0
E[XY jX = n] pX (n)
n E[Y jX = n] pX (n)
1 X
1
=
n 1 + n + 1 pX (n)
n=0
= E X 1 + X 1+ 1 :
Now observe that
X 1 + X 1+ 1
It follows that
= X + 1 , X 1+ 1 :
E[XY ] = + 1 , [1 , e, ]=:
July 18, 2002
72
Chap. 2 Discrete Random Variables
2.6. Notes
Notes x2.1: Probabilities Involving Random Variables
Note 1. According to Note 1 in Chapter 1, }(A) is only dened for certain
subsets A 2 A. Hence, in order that the probability
}(f! 2 : X(!) 2 B g)
be dened, it is necessary that
f! 2 : X(!) 2 B g 2 A:
(2.14)
To guarantee that this is true, it is convenient to consider }(X 2 B) only for
sets B in some -eld B of subsets of IR. The technical denition of a random
variable is then as follows. A function X from into IR is a random variable
if and only if (2.14) holds for every B 2 B. Usually B is usually taken to be the
Borel -eld; i.e., B is the smallest -eld containing all the open subsets of
IR. If B 2 B, then B is called a Borel set. It can be shown [4, pp. 182{183]
that a real-valued function X satises (2.14) for all Borel sets B if and only if
f! 2 : X(!) xg 2 A; for all x 2 IR:
Note 2. In light of Note 1 above, we do not require that (2.1) hold for all
sets B and C, but only for all Borel sets B and C.
Note 3. We show that
}(X 2 B; Y 2 C) = X X IB (xi )IC (yj )pXY (xi ; yj ):
i
j
P
Consider the disjoint events fY = yj g. Since j }(Y = yj ) = 1, we can use
the law of total probability as in (1.13) with A = fX = xi ; Y 2 C g to write
}(X = xi; Y 2 C) = X }(X = xi ; Y 2 C; Y = yj ):
j
Now observe that
fY 2 C g \ fY = yj g =
Hence,
}(X = xi ; Y 2 C) =
X
j
fY = yj g; yj 2 C;
6 ;
yj 2= C:
IC (yj )}(X = xi ; Y = yj ) =
X
j
IC (yj )pXY (xi ; yj ):
The next step is to use (1.13) again, but this time with the disjoint events
fX = xig and A = fX 2 B; Y 2 C g. Then,
}(X 2 B; Y 2 C) = X }(X 2 B; Y 2 C; X = xi):
i
July 18, 2002
2.6 Notes
73
Now observe that
fX 2 B g \ fX = xi g =
fX = xi g; xi 2 B;
6 ;
xi 2= B:
Hence,
}(X 2 B; Y 2 C) =
X
=
X
i
i
IB (xi )}(X = xi; Y 2 C)
IB (xi )
X
j
IC (yj )pXY (xi ; yj ):
Notes x2.2: Expectation
Note 4. When z is complex,
E[z X ] := E[Re(z X )] + j E[Im(z X )]:
By writing
z n = rnejn = rn [cos(n) + j sin(n)];
it is easy to check that
E[z X ] =
1
X
n=0
z n}(X = n):
Notes x2.4: Conditional Probability
Note 5. Here is an alternative derivation of the fact that the sum of independent Bernoulli random variables is a binomial random variable. Let
X1 ; X2; : : : be independent Bernoulli(p) random variables. Put
Yn :=
n
X
i=1
Xi :
We need to show that Yn binomial(n; p). The case n = 1 is trivial. Suppose
the result is true for some n 1. We show that it must be true for n + 1. Use
the law of total probability to write
}(Yn+1 = k) =
n
X
i=0
}(Yn+1 = kjYn = i)}(Yn = i):
(2.15)
To compute the conditional probability, we rst observe that Yn+1 = Yn +Xn+1 .
Also, since the Xi are independent, and since Yn depends only on X1 ; : : :; Xn,
July 18, 2002
74
Chap. 2 Discrete Random Variables
we see that Yn and Xn+1 are independent. Keeping this in mind, we apply the
substitution law and write
}(Yn+1 = kjYn = i) = }(Yn + Xn+1 = kjYn = i)
= }(i + Xn+1 = kjYn = i)
= }(Xn+1 = k , ijYn = i)
= }(Xn+1 = k , i):
Since Xn+1 takes only the values zero and one, this last probability is zero
unless i = k or i = k , 1. Returning to (2.15), we can writex
}(Yn+1 = k) =
k
X
i=k,1
}(Xn+1 = k , i)}(Yn = i):
Assuming that Yn binomial(n; p), this becomes
}(Yn+1 = k) = p n pk,1(1 , p)n,(k,1) + (1 , p) n pk (1 , p)n,k :
k,1
k
Using the easily veried identity,
n + n = n+1 ;
k,1
k
k
we see that Yn+1 binomial(n + 1; p).
2.7. Problems
Problems x2.1: Probabilities Involving Random Variables
1. Let Y be an integer-valued random variable. Show that
}(Y = n) = }(Y > n , 1) , }(Y > n):
2. Let X Poisson(). Evaluate }(X > 1); your answer should be in
terms of . Then compute the numerical value of }(X > 1) when = 1.
Answer: 0:264.
3. A certain photo-sensor fails to activate if it receives fewer than four photons in a certain time interval. If the number of photons is modeled
by a Poisson(2) random variable X, nd the probability that the sensor
activates. Answer: 0:143.
4. Let X geometric1(p).
x When k = 0 or k = n + 1, this sum actually has only one term, since }(Yn = ,1) =
}(Yn = n + 1) = 0.
July 18, 2002
2.7 Problems
75
(a) Show that }(X > n) = pn .
(b) Compute }(fX > n+kgjfX > ng). Hint: If A B, then A \ B = A.
Remark. Your answer to (b) should not depend on n. For this reason,
the geometric random variable is said to have the memoryless property. For example, let X model the number of the toss on which the rst
heads occurs in a sequence of coin tosses. Then given a heads has not
occurred up to and including time n, the conditional probability that a
heads does not occur in the next k tosses does not depend on n. In other
words, given that no heads occurs on tosses 1; : : :; n has no eect on the
conditional probability of heads occurring in the future. Future tosses do
not remember the past.
? 5. From your solution of Problem 4(b), you can see that if X geometric (p),
1
then }(fX > n + kgjfX > ng) = }(X > k). Now prove the converse;
i.e., show that if Y is a positive integer-valued random variable such that
}(fY > n + kgjfY > ng) = }(Y > k), then Y geometric1(p), where
p = }(Y > 1). Hint: First show that }(Y > n) = }(Y > 1)n; then apply
Problem 1.
6. At the Chicago IRS oce, there are m independent auditors. The kth
auditor processes Xk tax returns per day, where Xk is Poisson distributed
with parameter > 0. The oce's performance is unsatisfactory if any
auditor processes fewer than 2 tax returns per day. Find the probability
that the oce performance is unsatisfactory.
7. An astronomer has recently discovered n similar galaxies. For i = 1; : : :; n,
let Xi denote the number of black holes in the ith galaxy, and assume the
Xi are independent Poisson() random variables.
(a) Find the probability that at least one of the galaxies contains two or
more black holes.
(b) Find the probability that all n galaxies have at least one black hole.
(c) Find the probability that all n galaxies have exactly one black hole.
Your answers should be in terms of n and .
8. There are 29 stocks on the Get Rich Quick Stock Exchange. The price
of each stock (in whole dollars) is geometric0(p) (same p for all stocks).
Prices of dierent stocks are independent. If p = 0:7, nd the probability
that at least one stock costs more than 10 dollars. Answer: 0:44.
9. Let X1 ; : : :; Xn be independent, geometric1 (p) random variables. Evaluate }(min(X1 ; : : :; Xn ) > `) and }(max(X1 ; : : :; Xn ) `).
10. A class consists of 15 students. Each student has probability p = 0:1
of getting an \A" in the course. Find the probability that exactly one
July 18, 2002
76
Chap. 2 Discrete Random Variables
student receives an \A." Assume the students' grades are independent.
Answer: 0:343.
11. In a certain lottery game, the player chooses three digits. The player
wins if at least two out of three digits match the random drawing for
that day in both position and value. Find the probability that the player
wins. Assume that the digits of the random drawing are independent and
equally likely. Answer: 0:028.
12. Blocks on a computer disk are good with probability p and faulty with
probability 1 , p. Blocks are good or bad independently of each other.
Let Y denote the location (starting from 1) of the rst bad block. Find
the pmf of Y .
13. Let X and Y be jointly discrete, integer-valued random variables with
joint pmf
8
3j ,1e,3 ; i = 1; j 0;
>
>
>
>
j!
<
pXY (i; j) = > 4 6j ,1e,6 ; i = 2; j 0;
>
>
j!
>
:
0;
otherwise:
Find the marginal pmfs pX (i) and pY (j), and determine whether or not
X and Y are independent.
Problems x2.2: Expectation
14. If X is Poisson(), compute E[1=(X + 1)].
15. A random variable X has mean 2 and variance 7. Find E[X 2 ].
16. Let X be a random variable with mean m and variance 2 . Find the
constant c that best approximates the random variable X in the sense
that c minimizes the mean squared error E[(X , c)2 ].
17. Find var(X) if X has probability generating function
GX (z) = 61 + 16 z + 23 z 2 :
18. Find var(X) if X has probability generating function
GX (z) =
2 + z 5 :
3
19. Evaluate GX (z) for the cases X geometric0 (p) and X geometric1(p).
Use your results to nd the mean and variance of X in each case.
20. The probability generating function of Y binomial(n; p) was found in
Example 2.20. Use it to nd the mean and variance of Y .
July 18, 2002
2.7 Problems
77
21. Compute E[(X + Y )3 ] if X and Y are independent Bernoulli(p) random
variables.
22. Let X1 ; : : :; Xn be independent Poisson() random variables. Find the
probability generating function of Y := X1 + + Xn .
23. For i = 1; : : :; n, let Xi Poisson(i ). Put
Y :=
24.
25.
26.
27.
28.
29.
? 30.
n
X
i=1
Xi :
Find }(Y = 2) if the Xi are independent.
A certain digital communication link has bit-error probability p. In a
transmission of n bits, nd the probability that k bits are received incorrectly, assuming bit errors occur independently.
A new school has M classrooms. For i = 1; : : :; M, let ni denote the
number of seats in the ith classroom. Suppose that the number of students
in the ith classroom is binomial(ni; p) and independent. Let Y denote the
total number of students in the school. Find }(Y = k).
Let X1 ; : : :; Xn be i.i.d. with }(Xi = 1) = 1 , p and }(Xi = 2) = p. If
Y := X1 + + Xn , nd }(Y = k) for all k.
Ten-bit codewords are transmitted over a noisy channel. Bits are ipped
independently with probability p. If no more than two bits of a codeword
are ipped, the codeword can be correctly decoded. Find the probability
that a codeword cannot be correctly decoded.
From a well-shued deck of 52 playing cards you are dealt 14 cards. What
is the probability that 2 cards are spades, 3 are hearts, 4 are diamonds,
and 5 are clubs? Answer: 0.0116.
From a well-shued deck of 52 playing cards you are dealt 5 cards. What
is the probability that all 5 cards are of the same suit? Answer: 0.00198.
A generalization of the binomial coecient is the multinomial coecient. Let m1; : : :; mr be nonnegative integers that sum to n. Then
n :=
n! :
m1 ! mr !
m1 ; : : :; mr
If words of length n are composed of r-ary symbols, then the multinomial
coecient is the number of words with exactly m1 symbols of type 1, m2
symbols of type 2, : : :, and mr symbols of type r. The multinomial
theorem says that
X
J kJ
aj
j =1
July 18, 2002
78
Chap. 2 Discrete Random Variables
is equal to
kJ
X
kJ ,1 =0
k2
X
k1=0
kJ
kJ ,kJ ,1 :
k1 k2 ,k1
k1; k2 , k1; : : :; kJ , kJ ,1 a1 a2 aJ
Derive the multinomial theorem by induction on J.
Remark. The multinomial theorem can also be expressed in the form
X
J
j =1
aj
n
n
=
am1 1 am2 2 amJ J :
m
;
m
1 2 ; : : :; mJ
m1 ++mJ =n
X
31. Make a table comparing both sides of the Poisson approximation of bino-
mial probabilities,
n pk (1 , p)n,k (np)k e,np ; n large, p small;
k!
k
for k = 0; 1; 2; 3; 4;5 if n = 100 and p = 1=100. Hint: If Matlab is
available, the binomial probability can be written
nchoosek(n; k) p^k (1 , p)^(n , k)
and the Poisson probability can be written
(n p)^k exp(,n p)=factorial(k):
32. Betting on Fair Games. Let X Bernoulli(p) random variable. For exam-
ple, we could let X = 1 model the result of a coin toss being heads. Or
we could let X = 1 model your winning the lottery. In general, a bettor
wagers a stake of s dollars that X = 1 with a bookmaker who agrees to
pay d dollars to the bettor if X = 1 occurs; if X = 0, the stake s is kept
by the bookmaker. Thus, the net income of the bettor is
Y := dX , s(1 , X);
since if X = 1, the bettor receives Y = d dollars, and if X = 0, the bettor
receives Y = ,s dollars; i.e., loses s dollars. Of course the net income
to the bookmaker is ,Y . If the wager is fair to both the bettor and the
bookmaker, then we should have E[Y ] = 0. In other words, on average,
the net income to either party is zero. Show that a fair wager requires
that d=s = (1 , p)=p.
33. Odds. Let X Bernoulli(p) random variable. We say that the (fair) odds
against X = 1 are n2 to n1 (written n2 : n1) if n2 and n1 are positive
integers satisfying n2 =n1 = (1 , p)=p. Typically, n2 and n1 are chosen to
July 18, 2002
2.7 Problems
79
have no common factors. Conversely, we say that the odds for X = 1 are
n1 to n2 if n1 =n2 = p=(1 , p). Consider a state lottery game in which
players wager one dollar that they can correctly guess a randomly selected
three-digit number in the range 000{999. The state oers a payo of $500
for a correct guess.
(a) What is the probability of correctly guessing the number?
(b) What are the (fair) odds against guessing correctly?
(c) The odds against actually oered by the state are determined by the
ratio of the payo divided by the stake, in this case, 500 : 1. Is the
game fair to the bettor? If not, what should the payo be to make
it fair? (See the preceding problem for the notion of \fair.")
? 34. These results are needed for Examples 2.14 and 2.15. Show that
1 1
X
= 1
k=1 k
and that
1
X
1 < 2:
2
k=1 k
(The exact value of the second sum is 2 =6 [38, p. 198].) Hints: For the
rst formula, use the inequality
Z k+1
1 dt Z k+1 1 dt = 1 :
t
k
k
k
k
For the second formula, use the inequality
Z k+1
1 dt Z k+1 1 dt =
1 :
2
2
t
(k
+
1)
(k
+
1)2
k
k
35. Let X be a discrete random variable taking nitely many distinct values
x1; : : :; xn. Let pi := }(X = xi) be the corresponding probability mass
function. Consider the function
g(x) := , log }(X = x):
Observe that g(xi ) = , log pi . The entropy of X is dened by
H(X) := E[g(X)] =
n
X
i=1
n
X
}
g(xi ) (X = xi) = pi log p1 :
i=1
i
If all outcomes are equally likely, i.e., pi = 1=n, nd H(X). If X is a
constant random variable, i.e., pj = 1 for some j and pi = 0 for i 6= j,
nd H(X).
July 18, 2002
80
Chap. 2 Discrete Random Variables
? 36. Jensen's inequality.
Recall that a real-valued function g dened on an
interval I is convex if for all x; y 2 I and all 0 1,
,
g x + (1 , )y g(x) + (1 , )g(y):
Let g be a convex function, and let X be a discrete random variable taking
nitely many values, say n values, all in I. Derive Jensen's inequality,
E[g(X)] g(E[X]):
Hint: Use induction on n.
? 37.
Derive Lyapunov's inequality,
E[jZ j]1= E[jZ j ]1= ; 1 < < 1:
Hint: Apply Jensen's inequality to the convex function g(x) = x= and
the random variable X = jZ j.
? 38. A discrete random variable is said to be nonnegative, denoted by X 0,
if }(X 0) = 1; i.e., if
X
i
I[0;1) (xi )}(X = xi ) = 1:
(a) Show that for a nonnegative random variable, if xk < 0 for some k,
then }(X = xk ) = 0.
(b) Show that for a nonnegative random variable, E[X] 0.
(c) If X and Y are discrete random variables, we write X Y if X , Y 0. Show that if X Y , then E[X] E[Y ]; i.e., expectation is
monotone.
Problems x2.3: The Weak Law of Large Numbers
39. Show that E[Mn] = m. Also show that for any constant c, var(cX) =
c2 var(X).
40. Let X1 ; X2 ; : : :; Xn be i.i.d. geometric1(p) random variables, and put Y :=
X1 + + Xn . Find E[Y ], var(Y ), and E[Y 2]. Also nd the moment
generating function of Y .
Remark. We say that Y is a negative binomial or Pascal random
variable with parameters n and p.
41. Show by counterexample that being uncorrelated does not imply independence. Hint: Let }(X = 1) = }(X = 2) = 1=4, and put Y := jX j.
Show that E[XY ] = E[X]E[Y ], but }(X = 1; Y = 1) 6= }(X = 1) }(Y =
1).
July 18, 2002
2.7 Problems
81
42. Let X geometric0(1=4). Compute both sides of Markov's inequality,
}(X 2) E[X] :
2
43. Let X geometric0 (1=4). Compute both sides of Chebyshev's inequality,
2
}(X 2) E[X ] :
4
44. Student heights range from 120 cm to 220 cm. To estimate the average height, determine how many students' heights should be measured
to make the sample mean within 0:25 cm of the true mean height with
probability at least 0:9. Assume measurements are uncorrelated and have
variance 2 = 1. What if you only want to be within 1 cm of the true
mean height with probability at least 0:9?
45. Show that for an arbitrary sequence of random variables, it is not always
true that for every " > 0,
lim }(jMn , mj ") = 0:
n!1
Hint: Let Z be a nonconstant random variable and put Xi := Z for
i = 1; 2; : : :: To be specic, try Z Bernoulli(1=2) and " = 1=4.
46. Let X and Y be two random variables with means mX and mY and
variances X2 and Y2 . The correlation coecient of X and Y is
, mY )] :
:= E[(X , mX )(Y
X Y
Note that = 0 if and only if X and Y are uncorrelated. Suppose X
cannot be observed, but we are able to measure Y . We wish to estimate
X by using the quantity aY , where a is a suitable constant. Assuming
mX = mY = 0, nd the constant a that minimizes the mean squared
error E[(X , aY )2]. Your answer should depend on X , Y and .
Problems x2.4: Conditional Probability
47. Let X and Y be integer-valued random variables. Suppose that condi-
tioned on X = i, Y binomial(n; pi), where 0 < pi < 1. Evaluate
}(Y < 2jX = i).
48. Let X and Y be integer-valued random variables. Suppose that conditioned on Y = j, X Poisson(j ). Evaluate }(X > 2jY = j).
49. Let X and Y be independent random variables. Show that pX jY (xi jyj ) =
pX (xi ) and pY jX (yj jxi) = pY (yj ).
July 18, 2002
82
Chap. 2 Discrete Random Variables
50. Let X and Y be independent with X geometric0 (p) and Y geometric0 (q).
51.
52.
53.
54.
55.
56.
57.
58.
Put T := X , Y , and nd }(T = n) for all n.
When a binary optical communication system transmits a 1, the receiver
output is a Poisson() random variable. When a 2 is transmitted, the
receiver output is a Poisson() random variable. Given that the receiver
output is equal to 2, nd the conditional probability that a 1 was sent.
Assume messages are equally likely.
In a binary communication system, when a 0 is sent, the receiver outputs
a random variable Y that is geometric0 (p). When a 1 is sent, the receiver
output Y geometric0(q), where q 6= p. Given that the receiver output
Y = k, nd the conditional probability that the message sent was a 1.
Assume messages are equally likely.
Apple crates are supposed to contain only red apples, but occasionally a
few green apples are found. Assume that the number of red apples and the
number of green apples are independent Poisson random variables with
parameters and , respectively. Given that a crate contains a total of k
apples, nd the conditional probability that none of the apples is green.
Let X Poisson(), and suppose that given X = n, Y Bernoulli(1=(n+
1)). Find }(X = njY = 1).
Let X Poisson(), and suppose that given X = n, Y binomial(n; p).
Find }(X = njY = k) for n k.
Let X and Y be independent binomial(n;p) random variables. Find the
conditional probability that X > k given that max(X; Y ) > k if n = 100,
p = 0:01, and k = 1. Answer: 0.284.
Let X geometric0(p) and Y geometric0 (q), and assume X and Y are
independent.
(a) Find }(XY = 4).
(b) Put Z := X+Y and nd pZ (j) for all j using the discrete convolution
formula (2.11). Treat the cases p 6= q and p = q separately.
Let X and Y be independent random variables, each taking the values
0; 1; 2; 3 with equal probability. Put Z := X + Y and sketch pZ (j) for all
j. Hint: Use the discrete convolution formula (2.11), and use graphical
convolution techniques.
Problems x2.5: Conditional Expectation
59. Let X and Y be as in Problem 13. Compute E[Y jX = i], E[Y ], and
E[X jY = j].
60. Let X and Y be as in Example 2.30. Find E[Y ], E[XY ], E[Y 2 ], and var(Y ).
July 18, 2002
2.7 Problems
83
61. Let X and Y be as in Example 2.31. Find E[Y jX = 1], E[Y jX = 0], E[Y ],
E[Y 2 ], and var(Y ).
,
62. Let X Bernoulli(2=3), and suppose that given X = i, Y Poisson 3(i+
1) . Find E[(X + 1)Y 2 ].
63. Let X Poisson(), and suppose that given X = n, Y Bernoulli(1=(n+
1)). Find E[XY ].
64. Let X geometric1 (p), and suppose that given X = n, Y Pascal(n; q).
Find E[XY ].
65. Let X and Y be integer-valued random variables. Suppose that given
Y = k, X is conditionally Poisson with parameter k. If Y has mean m
and variance r, nd E[X 2 ].
66. Let X and Y be independent random variables, with X binomial(n; p),
and let Y binomial(m; p). Put V := X + Y . Find the pmf of V . Find
}(V = 10jX = 4) (assume n 4 and m 6).
67. Let X and Y be as in Problem 13. Find the probability generating function of Y , and then nd }(Y = 2).
68. Let X and Y be as in Example 2.30. Find GY (z).
July 18, 2002
84
Chap. 2 Discrete Random Variables
July 18, 2002
CHAPTER 3
Continuous Random Variables
In Chapter 2, the only specic random variables we considered were discrete
such as the Bernoulli, binomial, Poisson, and geometric. In this chapter we consider a class of random variables that are allowed to take on a continuum of
values. These random variables are called continuous random variables and are
introduced in Section 3.1. Their expectation is dened in Section 3.2 and used
to develop the concepts of moment generating function (Laplace transform)
and characteristic function (Fourier transform). In Section 3.3 expectation of
multiple random variables is considered. Applications of characteristic functions to sums of independent random variables are illustrated. In Section 3.4
Markov's inequality, Chebyshev's inequality, and the Cherno bound illustrate
simple techniques for bounding probabilities in terms of expectations.
3.1. Denition and Notation
Recall that a real-valued function X(!) dened on a sample space is
called a random variable. We say that X is a continuous random variable
if }(X 2 B) has the form
}(X 2 B) =
Z
B
f(t) dt :=
1
Z
,1
IB (t)f(t) dt
(3.1)
}
for some
R 1 integrable function f. Since (X 2 IR) = 1, f must integrate to one;
}
i.e., ,1 f(t) dt = 1. Further, since (X 2 B) 0 for all B, it can be shown
that f must be nonnegative.1 A nonnegative function that integrates to one is
called a probability density function (pdf).
Here are some examples of continuous random variables. A summary of the
more common ones can be found on the inside of the back cover.
The simplest continuous random variable is the uniform. It is used to model
experiments in which the outcome is constrained to lie in a known interval,
say [a; b], and all outcomes are equally likely. We used this model in the bus
Example 1.10 in Chapter 1. We write f uniform[a; b] if a < b and
8
1 ; a x b;
<
f(x) = : b , a
0; otherwise:
To verify that f integrates to one, rst note that since f(x) = 0 for x < a and
x > b, we can write
Z b
Z 1
f(x) dx:
f(x) dx =
,1
a
85
86
Chap. 3 Continuous Random Variables
Next, for a x b, f(x) = 1=(b , a), and so
Z
b
Z
b
1 dx = 1:
a
a b,a
This calculation illustrates a very important technique that is often incorrectly
carried out by beginning students: First modify the limits of integration, then
substitute the appropriate formula for f(x). For example, it is quite common
to see the incorrect calculation,
Z 1
Z 1
1 dx = 1:
f(x) dx =
,1
,1 b , a
Example 3.1. If X is a continuous random variable with density f uniform[,1; 3], nd }(X 0), }(X 1), and }(X > 1jX > 0).
Solution. To begin, write
f(x) dx =
}(X 0) = }(X 2 (,1; 0]) =
Z 0
,1
f(x) dx:
Since f(x) = 0 for x < ,1, we can modify the limits of integration and obtain
}(X 0) =
Z 0
,1
f(x) dx =
Z 0
1 dx = 1=4:
,1 4
R1
Similarly, }(X 1) = ,1 1=4 dx = 1=2. To calculate
}(X > 1jX > 0) = }(fX >} 1g \ fX > 0g) ;
(X > 0)
observe that the denominator is simply }(X > 0) = 1 , }(X 0) = 1 , 1=4 =
3=4. As for the numerator,
}(fX > 1g \ fX > 0g) = }(X > 1) = 1 , }(X 1) = 1=2:
Thus, }(X > 1jX > 0) = (1=2)=(3=4) = 2=3.
Another simple continuous random variable is the exponential with parameter > 0. We write f exp() if
(
e,x ; x 0;
f(x) =
0; x < 0:
It is easy to check that f integrates to one. The exponential random variable
is often used to model lifetimes, such as how long a lightbulb burns or how
long it takes for a computer to transmit a message from one node to another.
July 18, 2002
3.1 Denition and Notation
87
The exponential random variable also arises as a function of other random
variables. For example, in Problem 35 you will show that if U uniform(0; 1),
then X = ln(1=U) is exp(1). We also point out that if U and V arepindependent
Gaussian random variables, then U 2 + V 2 is exponential2 and U 2 + V 2 is
Rayleigh (dened in Problem 30).
Related to the exponential is the Laplace, sometimes called the doublesided exponential. For > 0, we write f Laplace() if
f(x) = 2 e,jxj:
You will show in Problem 45 that the dierence of two independent exponential
random variables is a Laplace random variable.
Example 3.2. If X is a continuous random variable with density f Laplace(), nd
}(,3 X ,2 or 0 X 3):
Solution. The desired probability can be written as
}(f,3 X ,2g [ f0 X 3g):
Since these are disjoint events, the probability of the union is the sum of the
individual probabilities. We therefore need to compute
}(,3 X ,2) =
,2
Z
,3
e,jxj dx
2
=
which is equal to (e,2 , e,3 )=2, and
}(0 X 3) =
Z 3
0
e,jxj dx
2
= 2
Z
2
Z 3
0
,2
,3
ex dx;
e,x dx;
which is equal to (1 , e,3 )=2. The desired probability is then
1 , 2e,3 + e,2 :
2
The Cauchy random variable with parameter > 0 is also easy to work
with. We write f Cauchy() if
f(x) = 2=
+ x2 :
Since (1=)(d=dx) tan,1 (x=) = f(x), and since tan,1 (1) = =2, it is easy
to check that f integrates to one. The Cauchy random variable arises as the
quotient of independent Gaussian random variables (Problem 18 in Chapter 5).
Gaussian random variables are dened later in this section.
July 18, 2002
88
Chap. 3 Continuous Random Variables
Example 3.3. In the -lottery you choose a number with 1 10.
Then a random variable X is chosen according to the Cauchy density with
parameter . If jX j 1, then you win the lottery. Which value of should you
choose to maximize your probability of winning?
Solution. Your probability of winning is
}(jX j 1) = }(X 1 or X ,1)
=
1
Z
1
f(x) dx +
Z 1
,1
f(x) dx;
where f(x) = (=)=(2 + x2) is the Cauchy density. Since the Cauchy density
is an even function,
Z 1
= dx:
}(jX j 1) = 2
1 2 + x2
Now make the change of variable y = x=, dy = dx=, to get
Z 1
1= dy:
}(jX j 1) = 2
1
1= + y2
Since the integrand is nonnegative, the integral is maximized by minimizing
1= or by maximizing . Hence, choosing = 10 maximizes your probability
of winning.
The most important density is the Gaussian or normal. As a consequence
of the central limit theorem, whose discussion is taken up in Chapter 4, Gaussian random variables are good models for sums of many independent random
variables. Hence, Gaussian random variables often arise as noise models in
communication and control systems. For 2 > 0, we write f N(m; 2 ) if
2 1
1
x
,
m
f(x) = p
exp , 2 ;
2 where is the positive square root of 2 . If m = 0 and 2 = 1, we say that
f is a standard normal density. A graph of the standard normal density is
shown in Figure 3.1.
To verify that an arbitrary normal density integrates to one, we proceed
as follows. (For an alternative derivation, see Problem 14.) First, making the
change of variable t = (x , m)= shows that
Z 1
Z 1
f(x) dx = p1
e,t2 =2 dt:
2
,1
,1
So, without loss of generality, we may assume f is a standard
density
p
R 1 ,normal
with m = 0 and = 1. We then need to show that I := ,1
e x2 =2 dx = 2.
July 18, 2002
3.1 Denition and Notation
89
0.4
0.35
0.3
0.25
0.2
0.15
0.1
0.05
0
−4
−3
−2
−1
0
1
2
3
4
Figure 3.1. Standard normal density.
The trick is to show instead that I 2 = 2. First write
I2 =
Z
1
,1
2
e,x =2 dx
Z
1
,1
2
e,y =2 dy :
Now write the product of integrals as the iterated integral
I2 =
Z
1
Z
1
e,(x2 +y2 )=2 dx dy:
,1 ,1
Next, we interpret this as a double integral over the whole plane and change to
polar coordinates. This yields
I2
=
1
Z 2 Z
0
0
e,r2 =2 r dr d
2 =2 1
,
r
,e
d
Z 2 =
0
0
= 2:
Example 3.4. If f denotes the standard normal density, show that
Z 0
,1
f(x) dx =
1
Z
0
f(x) dx = 1=2:
Solution. Since f(x) = e,x2=2=p2 is an even function
of x, the two
R1
integrals are equal. Since the sum of the two integrals is ,1 f(x) dx = 1, each
individual integral must be 1=2.
July 18, 2002
90
Chap. 3 Continuous Random Variables
The Paradox of Continuous Random Variables
Let X uniform(0; 1), and x any point x0 2 (0; 1). What is }(X = x0)?
Consider the interval [x0 , 1=n; x0 + 1=n]. Write
}(X 2 [x0 , 1=n; x0 + 1=n]) =
Z x0 +1=n
x0 ,1=n
1 dx = 2=n ! 0
as n ! 1. Since the interval [x0 , 1=n; x0 + 1=n] converges to the singleton
set fx0g, we concludey that }(X = x0) = 0 for every x0 2 (0; 1). We are
thus confronted with the fact that continuous random variables take no xed
value with positive probability! The way to understand this apparent paradox
is to realize that continuous random variables are an idealized model of what we
normally think of as continuous-valued measurements. For example, a voltmeter
only shows a certain number of digits after the decimal point, say 5:127 volts
because physical devices have limited precision. Hence, the measurement X =
5:127 should be understood as saying that
5:1265 X < 5:1275;
since all numbers in this range round to 5:127. Now there is no paradox because
}(5:1265 X < 5:1275) has positive probability.
You may still ask, \Why not just use a discrete random variable taking the
distinct values k=1000, where k is any integer?" After all, this would model
the voltmeter in question. The answer is that if you get a better voltmeter,
you need to redene the random variable, while with the idealized, continuousrandom-variable model, even if the voltmeter changes, the random variable does
not.
Remark. If B is any set with nitely many points, or even countably many
points, then }(X 2 B) = 0 when X is a continuous random variable. To see
this, suppose B = fx1; x2; : : :g where the xi are distinct real numbers. Then
}(X 2 B) = }
1
[
fxi g =
1
X
i=1
i=1
}(X = xi ) = 0;
since, as argued above, each term is zero.
3.2. Expectation of a Single Random Variable
Let X be a continuous random variable with density f, and let g be a
real-valued function taking nitely many distinct values yj 2 IR. We claim that
E[g(X)] =
Z
1
g(x)f(x) dx:
(3.2)
,1
y This argumentis made rigorouslyfor arbitrarycontinuousrandom variablesin Section 4.6.
July 18, 2002
3.2 Expectation of a Single Random Variable
91
First note that the assumptions on g imply Y := g(X) is a discrete random
variable whose expectation is dened as
X
E[Y ] =
yj }(Y = yj ):
j
Let Bj := fx : g(x) = yj g. Observe that X 2 Bj if and only if g(X) = yj , or
equivalently, if and only if Y = yj . Hence,
X
E[Y ] =
yj }(X 2 Bj ):
j
Since X is a continuous random variable,
}(X 2 Bj ) =
Z
1
,1
IBj (x)f(x) dx:
Substituting this into the preceding equation, and writing g(X) instead of Y ,
we have
E[g(X)] =
X
j
yj
Z
1
,1
IBj (x)f(x) dx =
1 X
Z
,1 j
yj IBj (x) f(x) dx:
It suces to show that the sum in brackets is exactly g(x). Fix any x 2 IR.
Then g(x) = yk for some k. In other words, x 2 Bk . Now the Bj are disjoint
because the yj are distinct. Hence, for x 2 Bk , the sum in brackets reduces to
yk , which is exactly g(x).
We would like to apply (3.2) for more arbitrary functions g. However, if
g(X) is not a discrete random variable, its expectation has not yet been dened!
This raises the question of how to dene the expectation of an arbitrary random
variable. The approach is to approximate X by a sequence of discrete random
variables (for which expectation was dened in Chapter 2) and then dene E[X]
to be the limit of the expectations of the approximations. To be more precise
about this, consider the sequence of functions qn sketched in Figure 3.2. Since
each qn takes only nitely many distinct values, qn(X) is a discrete random
variable for which E[qn(X)] is dened. Since qn(X) ! X, we then dene 3
E[X] := limn!1 E[qn(X)].
Now suppose X is a continuous random variable. Since qn(X) is a discrete
random variable, (3.2) applies, and we can write
Z
1
E[X] := nlim
!1 E[qn(X)] = nlim
!1 ,1 qn(x)f(x) dx:
Now bring the limit inside the integral,4 and then use the fact that qn(x) ! x.
This yields
E[X] =
Z
1
,1
lim q (x)f(x) dx =
n!1 n
July 18, 2002
1
Z
,1
xf(x) dx:
92
Chap. 3 Continuous Random Variables
qn ( x )
n
x
n
1
___
2n
Figure 3.2. Finite-step quantizer qn (x) for approximating
arbitrary random variables by
n
discrete random variables. The number of steps is n2 . In the gure, n = 2 and so there are
8 steps.
The same technique can be used to show that (3.2) holds even if g takes
more than nitely many values. Write
E[g(X)] := nlim
!1 E[qn(g(X))]
Z
1
= nlim
!1 ,1 qn(g(x))f(x) dx
Z 1
=
lim q (g(x))f(x) dx
n!1 n
=
Z
,1
1
,1
g(x)f(x) dx:
Example 3.5. If X uniform[a; b] random variable, nd E[X] and E[cos(X)].
Solution. To nd E[X], write
E[X] =
which simplies to
Z
1
,1
xf(x) dx =
Z
2
b
x b ,1 a dx = 2(bx, a) ;
a
a
b
b2 , a2 = (b + a)(b , a) = a + b ;
2(b , a)
2(b , a)
2
which is simply the numerical average of a and b. To nd E[cos(X)], write
Z b
Z 1
cos(x) b ,1 a dx:
cos(x)f(x) dx =
E[cos(X)] =
a
,1
July 18, 2002
3.2 Expectation of a Single Random Variable
93
Since the antiderivative of cos is sin, we have
E[cos(X)] = sin(b) , sin(a) :
b,a
In particular, if b = a + 2k for some positive integer k, then E[cos(X)] = 0.
Example 3.6. Let X be a continuous random variable with standard Gaussian density f N(0; 1). Compute E[X n] for all n 1.
Solution. Write
Z 1
E[X n ] =
xnf(x) dx;
p
exp(,x2 =2)= 2.
,1
where f(x) =
Since f is an even function of x, the above
integrand is odd for n odd. Hence, all the odd moments are zero. For n 2,
apply integration by parts to
Z 1
E[X n ] = p1
xn,1 xe,x2 =2 dx
2 ,1
to obtain E[X n] = (n , 1)E[X n,2], where E[X 0 ] := E[1] = 1. When n = 2 this
yields E[X 2 ] = 1, and when n = 4 this yields E[X 4 ] = 3. The general result for
even n is
E[X n] = 1 3 (n , 3)(n , 1):
Example 3.7. Determine E[X] if X has a Cauchy density with parameter
= 1.
Solution. This is a trick question. Recall that as noted following Example 2.14, for signed discrete random variables,
E[X] =
X
i:xi 0
xi }(X = xi) +
X
i:xi <0
xi }(X = xi );
if at least one of the sums is nite. The analogous formula for continuous
random variables is
E[X] =
1
Z
0
xf(x) dx +
Z 0
,1
xf(x) dx;
assuming at least one of the integrals is nite. Otherwise we say that E[X] is
undened. Write
E[X] =
Z
1
,1
xf(x) dx
July 18, 2002
94
Chap. 3 Continuous Random Variables
=
Z
0
1
xf(x) dx +
Z 0
,1
xf(x) dx
1 ln(1 + x2) 1 + 1 ln(1 + x2 ) 0
= 2
2
0
,1
= (1 , 0) + (0 , 1)
= 1,1
= undened:
Moment Generating Functions
The probability generating function was dened in Chapter 2 for discrete
random variables taking nonnegative integer values. For more general random
variables, we introduce the moment generating function (mgf)
MX (s) := E[esX ]:
This generalizes the concept of probability generating function because if X is
discrete taking only nonnegative integer values, then
MX (s) = E[esX ] = E[(es )X ] = GX (es ):
To see why MX is called the moment generating function, write
d E[esX ]
MX0 (s) = ds
d
sX
= E ds e
= E[XesX ]:
Taking s = 0, we have
MX0 (s)js=0 = E[X]:
By dierentiating k times and then setting s = 0 yields
MX(k)(s)js=0 = E[X k ]:
(3.3)
This derivation makes sense only if MX (s) is nite in a neighborhood of s = 0
[4, p. 278].
In the preceding equations, it is convenient to allow s to be complex. This
means that we need to dene the expectation of complex-valued functions of x.
If g(x) = u(x) + jv(x), where u and v are real-valued functions of x, then
E[g(X)] := E[u(X)] + j E[v(X)]:
July 18, 2002
3.2 Expectation of a Single Random Variable
95
If X has density f, then
E[g(X)] := E[u(X)] + j E[v(X)]
=
=
=
1
Z
,1
Z 1
,1
Z 1
,1
u(x)f(x) dx + j
Z
1
,1
v(x)f(x) dx
[u(x) + jv(x)]f(x) dx
g(x)f(x) dx:
It now follows that if X is a continuous random variable with density f, then
MX (s) =
E[esX ]
1
Z
=
which is just the Laplace transform of f.
,1
esx f(x) dx;
Example 3.8. If X is an exponential random variable with parameter >
0, then for Re s < ,
MX (s) =
Z
1
0Z
= 0
esx e,x dx
1
ex(s,) dx
:
= ,
s
If MX (s) is nite for all real s in a neighborhood of the origin, say for
,r < s < r for some 0 < r 1, then X has nite moments
P1 of all orders, and
the following calculation using the power series e =
complex s with jsj < r [4, p. 278]:
E[esX ]
n=0 n =n!
is valid for
X
1
n
(sX)
= E
n=0 n!
1 sn
X
=
E[X n]; jsj < r:
n!
n=0
(3.4)
Example 3.9. For the exponential random variable of the previous example, we can obtain the power series as follows. Recalling the geometric series
formula (Problem 7 in Chapter 1), write, this time for jsj < ,
1
1 = X
n
=
,s
1 , s= n=0(s=) :
July 18, 2002
96
Chap. 3 Continuous Random Variables
Comparing this expression with (3.4) and equating the coecients of the powers
of s, we see by inspection that E[X n ] = n!=n. In particular, we have E[X] =
1= and E[X 2 ] = 2=2 . Since var(X) = E[X 2],(E[X])2, it follows that var(X) =
1=2.
Example 3.10. We nd the moment generating function of X N(0; 1).
To begin, write
Z 1
2
esx e,x =2 dx:
MX (s) = p1
2 ,1
We rst show that MX (s) is nite for all real s. Combine the exponents in the
above integral and complete the square. This yields
Z 1 ,(x,s)2=2
e p
MX (s) = es2 =2
dx:
2
,1
The above integrand is a normal density with mean s and unit variance. Since
densities integrate to one, MX (s) = es2 =2 for all real s. Note that since the
mean of a real-valued random variable must be real, our argument required
that s be real.5
We next show that MX (s) = es2 =2 holds for all complex s. Since the moment
generating function is nite for all real s, we can use the power series expansion
(3.4) to nd MX . The moments of X were determined in Example 3.6. Recalling
that the odd moments are zero,
1
X
s2m E[X 2m ]
m=0 (2m)!
1 s2m
X
1 3 (2m , 3)(2m , 1)
=
m=0 (2m)!
1
X
s2m :
=
m=0 2 4 2m
MX (s) =
Combining this result with the power series
es2 =2 =
1
X
1
(s2 =2)m = X
s2m
m=0 m!
m=0 2 4 2m
shows that MX (s) = es2 =2 for all complex s.
Characteristic Functions
In Example 3.8, the moment generating function was guaranteed nite only
for Re s < . It is possible to have random variables for which MX (s) is dened
July 18, 2002
3.2 Expectation of a Single Random Variable
97
only for Re s = 0; i.e., MX (s) is only dened for imaginary s. For example, if X
is a Cauchy random variable, then it is easy to see that MX (s) = 1 for all real
s 6= 0. In order to develop transform methods that always work for any random
variable X, we introduce the the characteristic function of X, dened by
'X () := E[ejX ]:
(3.5)
Note that 'X () = MX (j). Also, since jejX j = 1, j'X ()j E[jejX j] = 1.
Hence, the characteristic function always exists and is bounded in magnitude
by one.
If X is a continuous random variable with density f, then
'X () =
Z
1
,1
ejxf(x) dx;
which is just the Fourier transform of f. Using the Fourier inversion formula,
1 Z 1 e,jx' () d:
f(x) = 2
X
,1
Example 3.11. If X is an N(0; 1) random variable, then by Example 3.10,
MX (s) = es2 =2 . Thus, 'X () = MX (j) = e(j )2 =2 = e, 2 =2. In terms of
Fourier transforms,
Z 1
p1
ejx e,x2 =2 dx = e, 2 =2 :
2 ,1
In signal processing terms, the Fourier transform of a Gaussian pulse is a
Gaussian pulse.
Example 3.12. Let X be a gamma random variable with parameter p > 0
(dened in Problem 11). It is shown in Problem 39 that MX (s) = 1=(1 , s)p
for complex s with jsj < 1. It follows that 'X () = 1=(1 , j)p for j j < 1. It
can be shown6 that 'X () = 1=(1 , j)p for all .
Example 3.13. As noted above, the characteristic function of an N(0; 1)
random variable is e, 2 =2. What is the characteristic function of an N(m; 2 )
random variable?
Solution. Let f0 denote the N(0; 1) density. If X N(m; 2), then
fX (x) = f0 ((x , m)=)=. Now write
'X () = EZ [ejX ]
1
=
ejxfX (x) dx
=
Z
,1
1
,1
ejx 1 f x , m dx:
0 July 18, 2002
98
Chap. 3 Continuous Random Variables
Now apply the change of variable y = (x , m)=, dy = dx= and obtain
'X () =
Z
1
ej (y+m) f0 (y) dy
,1 Z
1
= ejm
ej ()y f0 (y) dy
=
=
,1
ejm e,()2 =2
ejm,2 2=2 :
If X is a discrete integer-valued random variable, then
'X () = E[ejX ] =
X
n
ejn}(X = n)
is a 2-periodic Fourier series. Given 'X , the coecients can be recovered
by the formula for Fourier series coecients,
Z ,jn
}(X = n) = 1
2 , e 'X () d:
When the moment generating function is not nite in a neighborhood of the
origin, the moments of X cannot be obtained from (3.3). However, the moments
can sometimes be obtained from the characteristic function. For example, if we
dierentiate (3.5) with respect to , we obtain
d E[ejX ] = E[jXejX ]:
'0X () = d
Taking = 0 yields '0X (0) = j E[X]. The general result is
'(Xk)()j =0 = j k E[X k ]; assuming E[jX jk] < 1:
The assumption that E[jX jk ] < 1 justies dierentiating inside the expectation
[4, pp. 344{345].
3.3. Expectation of Multiple Random Variables
In Chapter 2 we showed that for discrete random variables, expectation is
linear and monotone. We also showed that the expectation of a product of
independent discrete random variables is the product of the individual expectations. We now derive these properties for arbitrary random variables. Recall
that for an arbitrary random variable X, E[X] := limn!1 E[qn(X)], where
qn(x) is sketched in Figure 3.2, qn(x) ! x, and for each n, qn(X) is a discrete
random variable taking nitely many values.
July 18, 2002
3.3 Expectation of Multiple Random Variables
99
Linearity of Expectation
To establish linearity, write
aE[X] + bE[Y ] :=
=
=
=
a nlim
!1 E[qn(X)] + b nlim
!1 E[qn(Y )]
lim E[aqn(X) + bqn(Y )]
n!1
E[nlim
!1 aqn(X) + bqn(Y )]
E[aX + bY ]:
From our new denition of expectation, it is clear that if X 0, then so is
E[X]. Combining this with linearity shows that monotonicity holds for general
random variables; i.e., X Y implies E[X] E[Y ].
Expectations of Products of Functions of Independent Random Variables
Suppose X and Y are independent random variables. For any functions
h(x) and k(y) write
E[h(X)]E[k(Y )] := nlim
!1 E[qn(h(X))] nlim
!1 E[qn(k(Y ))]
= nlim
!1 E[qn(h(X))]E[qn (k(Y ))]
= nlim
!1 E[qn(h(X))qn (k(Y ))]
= E[nlim
!1 qn(h(X))qn (k(Y ))]
= E[h(X)k(Y )]:
Example 3.14. Let Z := X +Y , where X and Y are independent random
variables. Show that the characteristic function of Z is the product of the
characteristic functions of X and Y .
Solution. The characteristic function of Z is
'Z () := E[ejZ ] = E[ej (X +Y ) ] = E[ejX ejY ]:
Now use independence to write
'Z () = E[ejX ]E[ejY ] = 'X ()'Y ():
In the preceding example, suppose that X and Y are continuous random
variables with densities fX and fY . Then we can inverse Fourier transform the
last equation and show that fZ is the convolution of fX and fY . In symbols,
fZ (z) =
Z
1
,1
fX (z , ) fY () d:
July 18, 2002
100
Chap. 3 Continuous Random Variables
Example 3.15. In the preceding example, suppose that X and Y are
Cauchy with parameters and , respectively. Find the density of Z.
Solution. The characteristic of functions of X and Y are, by Problem 44,
'X () = e,j j and 'Y () = e,j j . Hence,
'Z () = 'X ()'Y () = e,j je,j j = e,(+)j j ;
which is the characteristic function of a Cauchy random variable with parameter
+ . In terms of convolution,
=
= d:
( + )= = Z 1
2
2
2
2
2
( + ) + z
,1 + (z , ) + 2
3.4. ?Probability Bounds
In many applications, it is dicult to compute the probability of an event
exactly. However, bounds on the probability can often be obtained in terms
of various expectations. For example, the Markov and Chebyshev inequalities
were derived in Chapter 2. Below we use Markov's inequality to derive a much
stronger result known as the Cherno bound.z
Example 3.16 (Using Markov's Inequality). Let X be a Poisson random
variable with parameter = 1=2. Use Markov's inequality to bound }(X > 2).
Compare your bound with the exact result.
Solution. First note that since X takes only integer values, }(X > 2) =
}(X 3). Hence, by Markov's inequality and the fact that E[X] = = 1=2
from Example 2.13,
}(X 3) E[X] = 1=2 = 0:167:
3
3
The exact answer can be obtained by noting that }(X 3) = 1 , }(X < 3) =
1 , }(X = 0) , }(X = 1) , }(X = 2). For a Poisson() random variable with
= 1=2, }(X 3) = 0:0144. So Markov's inequality gives quite a loose bound.
Example 3.17 (Using Chebyshev's Inequality). Let X be a Poisson random variable with parameter = 1=2. Use Chebyshev's inequality to bound
}(X > 2). Compare your bound with the result of using Markov's inequality
in Example 3.16.
z This bound, often attributed to Cherno (1952) [7], was used earlier by Cramer (1938)
[11].
July 18, 2002
3.4 ? Probability Bounds
101
Solution. Since X is nonnegative, we don't have to worry about the
absolute value signs. Using Chebyshev's inequality and the fact that E[X 2] =
2 + = 0:75 from Example 2.18,
2
}(X 3) E[X2 ] = 3=4 0:0833:
3
9
From Example 3.16, the exact probability is 0:0144 and the Markov bound is
0:167.
We now derive the Cherno bound. Let X be a random variable whose
moment generating function MX (s) is nite for s 0. For s > 0,
fX ag = fsX sag = fesX esa g:
Using this identity along with Markov's inequality, we have
sX
}(X a) = }(esX esa ) E[esa ] = e,sa MX (s):
e
Now observe that the inequality holds for all s > 0, and the left-hand side does
not depend on s. Hence, we can minimize the right-hand side to get as tight a
bound as possible. The Cherno bound is given by
}(X a) inf e,sa MX (s);
s>0
where the inmum is over all s > 0 for which MX (s) is nite.
Example 3.18. Let X be a Poisson random variable with parameter =
1=2. Bound }(X > 2) using the Cherno bound. Compare your result with
the exact probability and with the bound obtained via Chebyshev's inequality in Example 3.17 and with the bound obtained via Markov's inequality in
Example 3.16.
Solution. First recall that MX (s) = GX (es ), where GX (z) = exp[(z , 1)]
was derived in Example 2.19. Hence,
e,sa MX (s) = e,sa exp[(es , 1)] = exp[(es , 1) , as]:
The desired Cherno bound when a = 3 is
}(X 3) inf exp[(es , 1) , 3s]:
s>0
To evaluate the inmum, we must minimize the exponential. Since exp is an
increasing function, it suces to minimize its argument. Taking the derivative
of the argument and setting it equal to zero requires us to solve es , 3 = 0.
July 18, 2002
102
Chap. 3 Continuous Random Variables
The solution is s = ln(3=). Substituting this value of s into exp[(es , 1) , 3s]
and simplifying the exponent yields
}(X 3) exp[3 , , 3 ln(3=)]:
Since = 1=2,
}(X 3) exp[2:5 , 3 ln6] = 0:0564:
Recall that from Example 3.16, the exact probability is 0:0144 and Markov's
inequality yielded the bound 0:167. From Example 3.17, Chebyshev's inequality
yielded the bound 0:0833.
Example 3.19. Let X be a continuous random variable having exponential
density with parameter = 1. Compute }(X 7) and the corresponding
Markov, Chebyshev, and Cherno bounds.
Solution. The exact probability is }(X 7) = R71 e,x dx = e,7 =
0:00091. For the Markov and Chebyshev inequalities, recall that from Example 3.9, E[X] = 1= and E[X 2 ] = 2=2. For the Cherno bound, we need
MX (s) = =( , s) for s < , which was derived in Example 3.8. Armed with
these formulas, we nd that Markov's inequality yields }(X 7) E[X]=7 =
1=7 = 0:143 and Chebyshev's inequality yields }(X 7) E[X 2]=72 = 2=49 =
0:041. For the Cherno bound, write
}(X 7) inf e,7s=(1 , s);
s
where the minimization is over 0 < s < 1. The derivative of e,7s=(1 , s) with
respect to s is
e,7s (7s , 6) :
(1 , s)2
Setting this equal to zero requires that s = 6=7. Hence, the Cherno bound is
,7s ,6
}(X 7) e
(1 , s) s=6=7 = 7e = 0:017:
For suciently large a, the Cherno bound on }(X a) is always smaller
than the bound obtained by Chebyshev's inequality, and this is smaller than
the one obtained by Markov's inequality. However, for small a, this may not
be the case. See Problem 56 for an example.
July 18, 2002
3.5 Notes
103
3.5. Notes
Notes x3.1: Denition and Notation
Note 1. Strictly speaking, it can only be shown that f in (3.1) is nonnegative almost everywhere; that is
Z
ft2IR:f (t)<0g
1 dt = 0:
For example, f could be negative at a nite or countably innite set of points.
Note 2. If U and V are N(0; 1), then by Problem 41, U 2 and V 2 are each
chi-squared with one degree of freedom (dened in Problem 12). By the Remark
following Problem 46(c), U 2 +V 2 is chi-squared with 2 degrees of freedom, which
is the same as exp(1=2).
Notes x3.2: Expectation of a Single Random Variable
Note 3. Since qn in Figure 3.2 is dened only for x 0, the denition of
expectation in the text applies only arbitrary nonnegative random variables.
However, for signed random variables,
X = jX j 2+ X , jX j 2, X ;
and we dene
j
X
j
,
X
j
X
j
+
X
,E 2 ;
E[X] := E
2
assuming the dierence is not of the form 1 , 1. Otherwise, we say E[X] is
undened.
We also point out that for x 0, qn(x) ! x. To see this, x any x 0,
and let n > x. Then x will lie under one of the steps in Figure 3.2. If x lies
under the kth step, then
k,1 x < k ;
2n
2n
For x in this range, the value of qn(x) is (k ,1)=2n. Hence, 0 x,qn (x) < 1=2n.
Another important fact to note is that for each x 0, qn(x) qn+1(x).
Hence, qn(X) qn+1 (X), and so E[qn(X)] E[qn+1(X)] as well. In other
words, the sequence of real numbers E[qn(X)] is nondecreasing. This implies
that limn!1 E[qn(X)] exists either as a nite real number or the extended real
number 1 [38, p. 55].
Note 4. In light of the preceding note, we are using Lebesgue's monotone
convergence theorem [4, p. 208], which applies to nonnegative functions.
July 18, 2002
104
Chap. 3 Continuous Random Variables
1 e,(x,s)2 =2 dx as a conNote 5. If s were complex, we could interpret R,1
tour integral in the complex plane. By appealing to the Cauchy{Goursat
Theorem
[9, pp.p115{121], one could then show that this integral is equal to
R 1 ,t2 =2
e
dt = 2. Alternatively, one can use a permanence of form ar,1
gument [9, pp. 286{287]. In this approach, one shows that MX (s) is analytic in
some region, in this case the whole complex plane. One then obtains a formula
for MX (s) on a contour in this region, in this case, the contour is the entire real
axis. The permanence of form theorem then states that the formula is valid in
the entire region.
Note 6. The permanence of form argument mentioned above in Note 5 is
the easiest approach. Since MX (s) is analytic for complex s with Re s < 1, and
since MX (s) = 1=(1 , s)p for real s with s < 1, MX (s) = 1=(1 , s)p holds for
all complex s with Re s < 1. In particular, the formula holds for s = j, since
in this case, Re s = 0 < 1.
3.6. Problems
Problems x3.1: Denition and Notation
1. A certain burglar alarm goes o if its input voltage exceeds ve volts at
three consecutive sampling times. If the voltage samples are independent
and uniformly distributed on [0; 7], nd the probability that the alarm
sounds.
2. Let X have density
(
2=x3; x > 1;
0; otherwise;
The median of X is the number t satisfying }(X > t) = 1=2. Find the
median of X.
3. Let X have an exp() density.
(a) Show that }(X > t) = e,t .
(b) Compute }(X > t + tjX > t). Hint: If A B, then A \ B = A.
Remark. Observe that X has a memoryless property similar to
that of the geometric1(p) random variable. See the remark following
Problem 4 in Chapter 2.
4. A company produces independent voltage regulators whose outputs are
exp() random variables. In a batch of 10 voltage regulators, nd the
probability that exactly 3 of them produce outputs greater than v volts.
5. Let X1 ; : : :; Xn be i.i.d. exp() random variables.
(a) Find the probability that min(X1 ; : : :; Xn) > 2.
f(x) =
July 18, 2002
3.6 Problems
105
(b) Find the probability that max(X1 ; : : :; Xn) > 2.
Hint: Example 2.7 may be helpful.
6. A certain computer is equipped with a hard drive whose lifetime (mea-
sured in months) is X exp(). The lifetime of the monitor (also measured in months) is Y exp(). Assume the lifetimes are independent.
(a) Find the probability that the monitor fails during the rst 2 months.
(b) Find the probability that both the hard drive and the monitor fail
during the rst year.
(c) Find the probability that either the hard drive or the monitor fails
during the rst year.
7. A random variable X has the Weibull density with parameters p > 0
and > 0, denotedp by X Weibull(p; ), if its density is given by
f(x) := pxp,1 e,x for x > 0, and f(x) := 0 for x 0.
(a) Show that this density integrates to one.
(b) If X Weibull(p; ), evaluate }(X > t) for t > 0.
(c) Let X1 ; : : :; Xn be i.i.d. Weibull(p; ) random variables. Find the
probability that none of them exceeds 3. Find the probability that
at least one of them exceeds 3.
Remark. The Weibull density arises in the study of reliability in Chapter 4. Note that Weibull(1; ) is the same as exp().
? 8.
p
The standard normal density f N(0; 1) is given by f(x) := e,x2 =2= 2.
The following steps provide a mathematical proof that the normal density
is indeed \bell-shaped" as shown in Figure 3.1.
(a) Use the derivative of f to show that f is decreasing for x > 0 and
increasing for x < 0. (It then follows that f has a global maximum
at x = 0.)
(b) Show that f is concave for jxj < 1 and convex for jxj > 1. Hint:
Show that the second derivative of f is negative for jxj < 1 and
positive for jxj > 1.
P
n
z z. Hence, e+x2 =2 x2=2.
(c) Since ez = 1
z,
e
n=0 z =n!, for positive
Use this fact to show that e,x2 =2 ! 0 as jxj ! 1.
? 9.
Use the results of the preceding problem to obtain the corresponding
three results2 forp the general normal density f N(m; 2 ). Hint: Let
'(t) := e,t =2 = 2 denote the N(0; 1) density, and observe that f(x) =
'((x , m)=)=.
July 18, 2002
106
Chap. 3 Continuous Random Variables
As in the preceding problem, let f N(m; 2 ). Keeping in mind that
f(x) depends on > 0, show that lim!1 f(x) = 0. Show also that for
x 6= m, lim!0 f(x) = 0, whereas for x = m, lim!0 f(x) = 1. With
m = 0, sketch f(x) for = 0:5; 1; and 2.
11. The gamma density with parameter p > 0 is given by
8 p,1 ,x
e ; x > 0;
< x
gp (x) := : ,(p)
0;
x 0;
? 10.
where ,(p) is the gamma function,
,(p) :=
1
Z
0
xp,1 e,x dx; p > 0:
In other words, the gamma function is dened exactly so that the gamma
density integrates to one. Note that the gamma density is a generalization
of the exponential since g1 is the exp(1) density.
Remark. In Matlab, ,(p) = gamma(p).
(a) Use integration by parts to show that
,(p) = (p , 1) ,(p , 1); p > 1:
Since ,(1) can be directly evaluated and is equal to one, it follows
that
,(n) = (n , 1)!; n = 1; 2; : : ::
Thus , is sometimes called the factorial function.
(b) Show that ,(1=2) = p aspfollows. In the dening integral, use
the change of variable y = 2x. Write the result in terms of the
standard normal density, which integrates to one, in order to obtain
the answer by inspection.
(c) Show that
p
2n
+
1
, 2
= (2n , 1)2n 5 3 1 ; n 1:
Remark. The integral denition of ,(p) makes sense only for p > 0.
However, the recursion ,(p) = (p , 1),(p , 1) suggests a simple way to
dene , for negative, noninteger arguments. For 0 < " < 1, the righthand side of ,(") = (" , 1),(" , 1) is undened. However, we rearrange
this equation and make the denition,
,(" , 1) := , 1,(")
, ":
July 18, 2002
3.6 Problems
107
Similarly writing ,(" , 1) = (" , 2),(" , 2), and so on, leads to
n
,(" , n) = (n , ")(, 1)(2 ,(")
, ")(1 , ") :
12. Important generalizations of the gamma density gp of the preceding prob-
lem arise if we include a scale parameter. For > 0, put
p,1 ,x
gp;(x) := gp (x) = (x),(p)e ; x > 0:
We write X gamma(p; ) if X has density gp; , which is called the
gamma density with parameters p and .
(a) Let f be any probability density. For > 0, show that
f (x) := f(x)
is also a probability density.
(b) For p = m a positive integer, gm; is called the Erlang density with
parameters m and . We write X Erlang(m; ) if X has density
m,1 e,x
gm; (x) = (x)
(m , 1)! ; x > 0:
Remark. As you will see in Problem 46(c), the sum of m i.i.d.
exp() random variables is Erlang(m; ).
(c) If X Erlang(m; ), show that
}(X > t) =
mX
,1 (t)k
e,t :
k!
k=0
In other words, if Y Poisson(t), then }(X > t) = }(Y < m).
(d) For p = k=2 and = 1=2, gk=2;1=2 is called the chi-squared density
with k degrees of freedom. It is not required that k be an integer.
Of course, the chi-squared density with an even number of degrees
of freedom, say k = 2m, is the same as the Erlang(m; 1=2) density.
Using Problem 11(b), it is also clear that for k = 1,
e,x=2 ; x > 0:
g1=2;1=2(x) = p
2x
For an odd number of degrees of freedom, say k = 2m + 1, where
m 1, show that
xm,1=2 e,x=2 p
g 2m2+1 ; 21 (x) =
(2m , 1) 5 3 1 2
July 18, 2002
108
Chap. 3 Continuous Random Variables
for x > 0. Hint: Use Problem 11(c).
Remark. As you will see in Problem 41, the chi-squared random
variable arises as the square of a normal random variable. In communication systems employing noncoherent receivers, the incoming
signal is squared before further processing. Since the thermal noise in
these receivers is Gaussian, chi-squared random variables naturally
apppear.
? 13.
The beta density with parameters r > 0 and s > 0 is dened by
br;s (x) :=
8
<
:
,(r + s) xr,1(1 , x)s,1; 0 < x < 1;
,(r) ,(s)
0;
otherwise;
where , is the gamma function dened in Problem 11. We note that if
X gamma(p; ) and Y gamma(q; ) are independent random variables, then X=(X + Y ) has the beta density with parameters p and q
(Problem 23 in Chapter 5).
(a) Find simplied formulas and sketch the beta density for the following
sets of parameter values: (a) r = 1, s = 1. (b) r = 2, s = 2.
(c) r = 1=2, s = 1.
(b) Use the followingapproach to show that the density integrates to one.
Write ,(r) ,(s) as a product of integrals using dierent variables of
integration, say x and y, respectively. Rewrite this as an iterated
integral, with the inner variable of integration being x. In the inner
integral, make the change of variable u = x=(x+y). Change the order
of integration, and then make the change of variable v = y=(1 , u)
in the inner integral. Recognize the resulting inner integral, which
does not depend on u, as ,(r + s). Thus,
,(r) ,(s) = ,(r + s)
Z 1
0
ur,1 (1 , u)s,1 du;
(3.6)
showing that indeed, the density integrates to one.
Remark. The above integral, which is a function of r and s, is usually
called the beta function, and is denoted by B(r; s). Thus,
,(s)
B(r; s) = ,(r)
,(r + s) ;
and
r,1 (1 , x)s,1
br;s (x) = x B(r;
s) ; 0 < x < 1:
July 18, 2002
(3.7)
3.6 Problems
? 14.
109
Use equation (3.6) in the preceding problem to show that ,(1=2) = p.
Hint: Make the change of variable u = sin2 . Then take r = s = 1=2.
Remark. In Problem 11, you used thep fact that the normal density
integrates to one to show that ,(1=2) = . Since your derivation there
is reversible, it follows
p that the normal density integrates to one if and
only if ,(1=2) = . In this problem, you usedpthe fact that the beta
density integrates to one to show that ,(1=2) = . Thus, you have an
alternative derivation of the fact that the normal density integrates to
one.
? 15. Show that
p , n + 1 Z =2
2
sinn d =
n + 2 :
0
2, 2
Hint: Use equation (3.6) in Problem 13 with r = (n + 1)=2 and s = 1=2,
and make the substitution u = sin2 .
? 16. The beta function B(r; s) is dened as the integral in (3.6) in Problem 13.
Show that
Z 1
B(r; s) =
(1 , e, )r,1 e,s d:
? 17.
0
The student's t density with degrees of freedom is given by
,( +1)=2
1 + x2
f (x) := pB( 1 ; ) ; ,1 < x < 1;
2 2
where B is the beta function. Show that f integrates to one. Hint: The
change of variable e = 1 + x2= may be useful. Also, the result of the
preceding problem may be useful.
Remarks. (i) Note that f1 Cauchy(1).
(ii) It is shown in Problem 24 in Chapter 5 that if X and Y are independentpwith X N(0; 1) and Y chi-squared with k degrees of freedom, then
X= Y=k has the student's t density with k degrees of freedom, a result
of crucial importance in the study of condence intervals in Chapter 12.
? 18. Show that the student's t density f (x) dened in Problem 17 converges
to the standard normal density as ! 1. Hints: Recall that
x = ex :
lim
1
+
!1
Combine (3.7) in Problem 13 with Problem 11 for the cases = 2n and
= 2n + 1 separately. Then appeal to Wallis's formula [3, p. 206],
= 2 2 4 4 6 6 :
2
133557
July 18, 2002
110
? 19.
Chap. 3 Continuous Random Variables
For p and q positive, let B(p; q) denote the beta function dened by the
integral in (3.6) in Problem 13. Show that
p,1
fZ (z) := B(p;1 q) (1 +z z)p+q ; z > 0;
is a valid density (i.e., integrates to one) on (0; 1). Hint: Make the change
of variable t = 1=(1 + z).
Problems x3.2: Expectation of a Single Random Variable
20. Let X have density f(x) = 2=x3 for x 1 and f(x) = 0 otherwise.
Compute E[X].
21. Let X have the exp() density, fX (x) = e,x , for x 0. Use integration
by parts to show that E[X] = 1=.
22. High-Mileage Cars has just begun producing its new Lambda Series, which
averages miles per gallon. Al's Excellent Autos has a limited supply of
n cars on its lot. Actual mileage of the ith car is given by an exponential
random variable Xi with E[Xi] = . Assume actual mileages of dierent
cars are independent. Find the probability that at least one car on Al's
lot gets less than =2 miles per gallon.
23. The dierential entropy of a continuous random variable X with den-
sity f is
h(X) := E[, logf(X)] =
1
Z
,1
1 dx:
f(x) log f(x)
If X uniform[0; 2], nd h(X). Repeat for X uniform[0; 21 ] and for
X N(m; 2 ).
24. Let X be a continuous random variable with density f, and suppose that
E[X] = 0. If Z is another random variable with density fZ (z) := f(z , m),
nd E[Z].
? 25.
For the random variable X in Problem 20, nd E[X 2 ].
? 26.
Let X have the student's t density with degrees of freedom, as dened
in Problem 17. Show that E[jX jk] is nite if and only if k < .
27. Let X have moment generating function MX (s) = e2 s2 =2. Use formula (3.3) to nd E[X 4 ].
28. Let Z N(0; 1), and put Y = Z + n for some constant n. Show that
E[Y 4 ] = n4 + 6n2 + 3.
July 18, 2002
3.6 Problems
111
29. Let X gamma(p; 1) as in Problem 12. Show that
+ p)
E[X n ] = ,(n
,(p) = p(p + 1)(p + 2) (p + [n , 1]):
30. Let X have the standard Rayleigh density, f(x) := xe,x2 =2 for x 0
and f(x) := 0 for x < 0.
p
(a) Show that E[X] = =2.
(b) For n 2, show that E[X n] = 2n=2,(1 + n=2).
(c) An Internet router has n input links. The ows in the links are independent standard Rayleigh random variables. The router's buer
memory overows if more than two links have ows greater than .
Find the probability of buer memory overow.
31. Let X Weibull(p; ) as in Problem 7. Show that E[X n ] = ,(1 +
n=p)=n=p.
32. A certain nonlinear circuit has random input X exp(1), and output
Y = X 1=4 . Find the second moment of the output.
? 33. Let X have the student's t density with degrees of freedom, as dened
in Problem 17. For n a positive integer less than =2, show that
, ,2n ,
,
, 2n2+1
2
n
n
, 2
,
:
E[X ] = , 12 , 2
? 34.
35.
36.
? 37.
38.
Recall that the moment generating function of an N(0; 1) random variable
es2 =2 . Use this fact to nd the moment generating function of an N(m; 2 )
random variable.
If X uniform(0; 1), show that Y = ln(1=X) exp(1) by nding its
moment generating function for s < 1.
Find a closed-form expression for MX (s) if X Laplace(). Use your
result to nd var(X).
Let X be the random variable of Problem 20. For what real values of s is
MX (s) nite? Hint: It is not necessary to evaluate MX (s) to answer the
question.
Let Mp (s) denote the moment generating function of the gamma density
gp dened in Problem 11. Show that
Mp (s) = 1 ,1 s Mp,1 (s); p > 1:
Remark. Since g1(x) is the exp(1) density, and M1(s) = 1=(1 , s) by
direct calculation, it now follows that the moment generating function of
an Erlang(m; 1) random variable is 1=(1 , s)m .
July 18, 2002
112
? 39.
Chap. 3 Continuous Random Variables
To do this problem, use the methods of Example 3.10. Let X have the
gamma density gp given in Problem 11.
(a) For real s < 1, show that MX (s) = 1=(1 , s)p .
(b) The moments of X are given in Problem 29. Hence, from (3.4), we
have for complex s,
MX (s) =
1
X
sn ,(n + p) ; jsj < 1:
,(p)
n=0 n!
For complex s with jsj < 1, derive the Taylor series for 1=(1 , s)p and
show that it is equal to the above series. Thus, MX (s) = 1=(1 , s)p
for all complex s with jsj < 1.
40. As shown in the preceding problem, the basic gamma density with parameter p, gp (x), has moment generating function 1=(1 , s)p . The more
general gamma density dened by gp; (x) := gp (x) is given in Problem 12.
(a) Find the moment generating function and then the characteristic
function of gp;(x).
(b) Use the answer to (a) to nd the moment generating function and
the characteristic function of the Erlang density with parameters m
and , gm; (x).
(c) Use the answer to (a) to nd the moment generating function and
the characteristic function of the chi-squared density with k degrees
of freedom, gk=2;1=2(x).
41. Let X N(0; 1), and put Y = X 2 . For real values of s < 1=2, show that
MY (s) =
1
1=2
1 , 2s :
By Problem 40(c), it follows that Y is chi-squared with one degree of
freedom.
42. Let X N(m; 1), and put Y = X 2 . For real values of s < 1=2, show that
sm2 =(1,2s)
e
MY (s) = p
:
1 , 2s
Remark. For m 6= 0, Y is said to be noncentral chi-squared
with
2
one degree of freedom and noncentrality parameter m . For m = 0,
this reduces to the result of the previous problem.
43. Let X have characteristic function 'X (). If Y := aX + b for constants a
and b, express the characteristic function of Y in terms of a; b, and 'X .
July 18, 2002
3.6 Problems
113
44. Apply the Fourier inversion formula to 'X () = e,j j to verify that this
is the characteristic function of a Cauchy() random variable.
Problems x3.3: Expectation of Multiple Random Variables
45. Let X and Y be independent random variables with moment generat-
ing functions MX (s) and MY (s). If Z := X , Y , show that MZ (s) =
MX (s)MY (,s). Show that if both X and Y are exp(), then Z Laplace().
46. Let X1 ; : : :; Xn be independent, and put Yn := X1 + + Xn .
(a) If Xi N(mi ; i2), show that Yn N(m; 2 ), and identify m and
2 . Hint: Example 3.13 may be helpful.
(b) If Xi Cauchy(i ), show that Yn Cauchy(), and identify .
Hint: Problem 44 may be helpful.
(c) If Xi is a gamma random variable with parameters pi and (same
for all i), show that Yn is gamma with parameters p and , and
identify p. Hint: The results of Problem 40 may be helpful.
Remark. Note the following special cases of this result. If all the
pi = 1, then the Xi are exponential with parameter , and Yn is
Erlang with parameters n and . If p = 1=2 and = 1=2, then Xi is
chi-squared with one degree of freedom, and Yn is chi-squared with
n degrees of freedom.
47. Let X1 ; : : :; Xr be i.i.d. gamma random variables with parameters p and
. Let Y = X1 + + Xr . Find E[Y n ].
48. Packet transmission times on a certain network link are i.i.d. with an
exponential density of parameter . Suppose n packets are transmitted.
Find the density of the time to transmit n packets.
49. The random number generator on a computer produces i.i.d. uniform(0; 1)
random variables X1 ; : : :; Xn . Find the probability density of
Y = ln
Y
n
1 :
X
i=1 i
50. Let X1 ; : : :; Xn be i.i.d. Cauchy(). Find the density of Y := 1 X1 + +
n Xn , where the i are given positive constants.
51. Three independent pressure sensors produce output voltages U, V , and
W, each exp() random variables. The three voltages are summed and
fed into an alarm that sounds if the sum is greater than x volts. Find the
probability that the alarm sounds.
July 18, 2002
114
Chap. 3 Continuous Random Variables
52. A certain electric power substation has n power lines. The line loads are
independent Cauchy() random variables. The substation automatically
shuts down if the total load is greater than `. Find the probability of
automatic shutdown.
53. The new outpost on Mars extracts water from the surrounding soil. There
are 13 extractors. Each extractor produces water with a random eciency
that is uniformly distributed on [0; 1]. The outpost operates normally if
fewer than 3 extractors produce water with eciency less than 0:25. If the
eciencies are independent, nd the probability that the outpost operates
normally.
? 54. In this problem we generalize the noncentral chi-squared density of
Problem 42. To distinguish these new densities from the original chisquared densities dened in Problem 12, we refer to the original ones as
central chi-squared densities. The noncentral chi-squared density with k
degrees of freedom and noncentrality parameter 2 is dened byx
1 (2 =2)ne,2 =2
X
c2n+k (x); x > 0;
ck;2 (x) :=
n!
n=0
where c2n+k denotes the central chi-squared density with 2n + k degrees
of freedom.
R
(a) Show that 01 ck;2 (x) dx = 1.
(b) If X is a noncentral chi-squared random variable with k degrees of
freedom and noncentrality parameter 2 , show that X has moment
generating function
2=(1 , 2s)]
Mk;2 (s) = exp[s
(1 , 2s)k=2 :
Hint: Problem 40 may be helpful.
Remark. When k = 1, this agrees with Problem 42.
(c) Let X1 ; : : :; Xn be independent random variables with Xi cki ;2i .
Show that Y := X1 + + Xn has the ck;2 density, and identitify
k and 2.
Remark. By part (b), if each ki = 1, we could assume that each
Xi is the square of an N(i ; 1) random variable.
(d) Show that
(x+2 )=2 epx + e,px
e,p
= c1;2 (x):
2
2x
(Note that if = 0, the left-hand side reduces to the central chisquared density
one degree of freedom.) Hint: Use the power
P with
n=n! for the two exponentials involving px.
series e = 1
n=0
x A closed-form expression is derived in Problem 17 of Chapter 4.
July 18, 2002
3.6 Problems
115
Problems x3.4: Probability Bounds
55. Let X be the random variable of Problem 20. For a 1, compare }(X a) and the bound obtained via Markov's inequality.
? 56. Let X be an exponential random variable with parameter = 1. Use
Markov's inequality, Chebyshev's inequality, and the Cherno bound to
obtain bounds on }(X a) as a function of a .
(a) For what values of a does Markov's inequality give the best bound?
(b) For what values of a does Chebyshev's inequality give the best bound?
(c) For what values of a does the Cherno bound work best?
July 18, 2002
116
Chap. 3 Continuous Random Variables
July 18, 2002
CHAPTER 4
Analyzing Systems with Random Inputs
In this chapter we consider systems with random inputs, and we try to
compute probabilities involving the system outputs. The key tool we use to
solve these problems is the cumulative distribution function (cdf). If X is
a real-valued random variable, its cdf is dened by
FX (x) := }(X x):
Cumulative distribution functions of continuous random variables are introduced in Section 4.1. The emphasis is on the fact that if Y = g(X), where
X is a continuous random variable and g is a fairly simple function, then the
density of Y can be found by rst determining the cdf of Y via simple algebraic
manipulations and then dierentiating.
The material in Section 4.2 is a brief diversion into reliability theory. The
topic is placed here because it makes simple use of the cdf. With the exception
of the formula
Z 1
}(T > t) dt
E[T ] =
0
for nonnegative random variables, which is derived at the beginning of Section 4.2, the remaining material on reliability is not used in the rest of the book.
In trying to compute probabilities involving Y = g(X), the method of Section 4.1 only works for very simple functions g. To handle more complicated
functions, we need more background on cdfs. Section 4.3 provides a brief introduction to cdfs of discrete random variables. This section is not of great
interest in itself, but serves as preparation for Section 4.4 on mixed random
variables. Mixed random variables frequently appear in the form Y = g(X)
when X is continuous, but g has \at spots." More precisely, there is an interval where g(x) is constant. For example, most ampliers have a linear region,
say ,v x v, where g(x) = x. But if x > v, g(x) = v, and if x < ,v,
g(x) = ,v. In this case, if a continuous random variable is applied to such
a device, the output will be a mixed random variable, which can be thought
of has a random variable whose \generalized density" contains Dirac impulses.
The problem of nding the cdf and generalized density of Y = g(X) is studied
in Section 4.5.
At this point, having seen several generalized densities and their corresponding cdfs, Section 4.6 summarizes and derives the general properties that characterize arbitrary cdfs.
Section 4.7 introduces the central limit theorem. Although we have seen
many examples for which we can explicitly write down probabilities involving a
The material in Sections 4.3{4.5 is not used in the rest of the book, though it does provide
examples of cdfs with jump discontinuities.
117
118
Chap. 4 Analyzing Systems with Random Inputs
sum of i.i.d. random variables, in general, the problem is very hard. The central
limit theorem provides an approximation for computing probabilities involving
the sum of i.i.d. random variables.
4.1. Continuous Random Variables
If X is a continuous random variable with density f, then
FX (x) = }(X x) =
Z
x
,1
f(t) dt:
Example 4.1. Find the cdf of a Cauchy random variable X with parame-
ter = 1.
Solution. Write
Z
x
1= dt
,1 1 + t2 x
= 1 tan,1(t) ,1 1
,
1
= tan (x) , ,2
= 1 tan,1(x) + 21 :
FX (x) =
A graph of FX is shown in Figure 4.1.
1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
−15
−10
−5
0
5
10
15
Figure 4.1. Cumulative distribution function of a Cauchy random variable.
July 18, 2002
4.1 Continuous Random Variables
119
Example 4.2. Find the cdf of a uniform[a; b] random variable X.
we see that FX (x) =
R x Solution. Since f(t) = 0 for t < a and t >R b,
1
,1 f(t) dt is equal to 0 for x < a, and is equal to ,1 f(t) dt = 1 for x > b.
For a x b, we have
Z
x
Z
x
1 dt = x , a :
b
,
a
b,a
,1
a
Hence, for a x b, FX (x) is an aney function of x. A graph of FX when
X uniform[0; 1] is shown if Figure 4.2.
FX (x) =
f(t) dt =
1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
−0.4
−0.2
0
0.2
0.4
0.6
0.8
1
1.2
1.4
Figure 4.2. Cumulative distribution function of a uniform random variable.
We now consider the cdf of a Gaussian random variable. If X N(m; 2),
then
2 Z x
p 1 exp , 12 t , m dt:
FX (x) =
,1 2 Unfortunately, there is no closed-form expression for this integral. However,
it can be computed numerically, and there are many subroutines available for
doing it. For example, in Matlab, the above integral can be computed with
normcdf(x,m,sigma). If you do not have access to such a routine, you may
have software available for computing some related functions that can be used
to calculate the normal cdf. To see how, we need the following analysis. Making
y A function is ane if it is equal to a linear function plus a constant.
July 18, 2002
120
Chap. 4 Analyzing Systems with Random Inputs
the change of variable = (t , m)= yields
Z (x,m)=
2
e, =2 d:
FX (x) = p1
2 ,1
In other words, if we put
Z y
2
(y) := p1
e, =2 d;
2 ,1
then
x
,
m
FX (x) = :
(4.1)
Note that is the cdf of a standard normal random variable (a graph is shown
in Figure 4.3). Thus, the cdf of an N(m; 2) random variable can always be
expressed in terms of the cdf of a standard N(0; 1) random variable. If your
computer has a routine that can compute , then you can compute an arbitrary
normal cdf.
1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
−3
−2
−1
0
1
2
3
Figure 4.3. Cumulative distribution function of a standard normal random variable.
For continuous random variables, the density can be recovered from the cdf
by dierentiation. Since
FX (x) =
dierentiation yields
Z
x
,1
f(t) dt;
F 0(x) = f(x):
July 18, 2002
4.1 Continuous Random Variables
121
The observation that the density of a continuous random variable can be recovered from its cdf is of tremendous importance, as the following examples
illustrate.
Example 4.3. Consider an electrical circuit whose random input voltage
X is rst amplied by a gain > 0 and then added to a constant oset voltage
. Then the output voltage is Y = X +. If the input voltage is a continuous
random variable X, nd the density of the output voltage Y .
Solution. Although the question asks for the density of Y , it is more
advantageous to nd the cdf rst and then dierentiate to obtain the density.
Write
FY (y) = }(Y y)
= }(X + y)
= }(X (y , )=); since > 0;
= FX ((y , )=):
If X has density fX , then
d F y , = 1F0 y , = 1f y , :
fY (y) = dy
X
X
X Example 4.4. Amplitude modulation in certain communication systems
can be accomplished using various nonlinear devices such as a semiconductor
diode. Suppose we model the nonlinear device by the function Y = X 2 . If
the input X is a continuous random variable, nd the density of the output
Y = X 2.
Solution. Although the question asks for the density of Y , it is more
advantageous to nd the cdf rst and then dierentiate to obtain the density. To
begin, note that since Y = X 2 is nonnegative, for y < 0, FY (y) = }(Y y) = 0.
For nonnegative y, write
FY (y) = }(Y y)
= }(X 2 y)
= }(,py X py )
Z
=
The density is
py
,py
fX (t) dt:
p
d Z y f (t) dt
fY (y) = dy
X
,py
= 2p1 y fX (py ) + fX (,py ) ; y > 0:
July 18, 2002
122
Chap. 4 Analyzing Systems with Random Inputs
Since }(Y y) = 0 for y < 0, fY (y) = 0 for y < 0.
When the diode input voltage X of the preceding example is N(0; 1), it
turns out that Y is chi-squared with one degree of freedom (Problem 7). If
X is N(m; 1) with m 6= 0, then Y is noncentral chi-squared with one degree
of freedom (Problem 8). These results are frequently used in the analysis of
digital communication systems.
The two preceding examples illustrate the problem of nding the density
of Y = g(X) when X is a continuous random variable. The next example
illustrates the problem of nding the density of Z = g(X; Y ) when X is discrete
and Y is continuous.
Example 4.5 (Signal in Additive Noise). Let X and Y be independent random variables, with X being discrete with pmf pX and Y being continuous with
density fY . Put Z := X + Y and nd the density of Z.
Solution. Although the question asks for the density of Z, it is more
advantageous to nd the cdf rst and then dierentiate to obtain the density.
This time we use the law of total probability, substitution, and independence.
Write
FZ (z) = }(Z z)
X
}(Z z jX = xi)}(X = xi)
=
=
i
X
i
=
X
=
X
=
=
i
i
X
i
X
i
}(X + Y z jX = xi )}(X = xi )
}(xi + Y z jX = xi )}(X = xi)
}(Y z , xi jX = xi )}(X = xi)
}(Y z , xi )}(X = xi)
FY (z , xi) pX (xi ):
Dierentiating this expression yields
fZ (z) =
X
i
fY (z , xi ) pX (xi ):
We should also note that FZ jX (z jxi) := }(Z z jX = xi ) is called the
conditional cdf of Z given X.
July 18, 2002
4.2 Reliability
? The
123
Normal CDF and the Error Function
If your computer does not have a routine that computes the normal cdf, it
may have a routine to compute a related function called the error function,
which is dened by
Z z
2
2
erf(z) := p e, d:
0
Note that erf(z) is negative for z < 0. In fact, erf is an odd function. To express
in terms of erf, write
Z 0
Z y
2 =2
2 =2
1
,
,
(y) = p
e
d + e
d
2 ,1Z
0
y 2
= 21 + p1
e, =2 d; by Example 3.4;
2 0 p
Z y= 2
= 12 + 12 p2
e,2 d;
0
p
where the last step follows by making the change of variable = = 2. We
now see that
, p (y) = 12 1 + erf y= 2 :
From the derivation, it is easy to see that this holds for all y, both positive and
negative. The Matlab command for erf(z) is erf(z).
In many applications, instead of (y), we need
Q(y) := 1 , (y):
Using the complementary error function,
Z 1
erfc(z) := 1 , erf(z) = p2
e,2 d;
z
it is easy to see that
, p Q(y) = 12 erfc y= 2 :
The Matlab command for erfc(z) is erfc(z).
4.2. Reliability
Let T be the lifetime of a device or system. The reliability function of
the device or system is dened by
R(t) := }(T > t) = 1 , FT (t):
The reliability at time t is the probability that the lifetime is greater than t.
The mean time to failure (MTTF) is simply the expected lifetime, E[T].
Since lifetimes are nonnegative random variables, we claim that
E[T ] =
Z
0
1
}(T > t) dt:
July 18, 2002
(4.2)
124
Chap. 4 Analyzing Systems with Random Inputs
It then follows that
E[T ] =
1
Z
0
R(t) dt;
i.e., the MTTF is the integral of the reliability. To derive (4.2), write
1
Z
0
}(T > t) dt =
1
Z
E[I(t;1)(T)] dt
0 Z
= E
1
0
Z
= E
1
0
I(t;1) (T ) dt
I(,1;T ) (t) dt :
To evaluate the integral, observe that since T is nonnegative, the intersection
of [0; 1) and (,1; T) is [0; T). Hence,
1
Z
0
Z
}(T > t) dt = E
T
0
dt = E[T ]:
The failure rate of a device or system with lifetime T is
}(T t + tjT > t)
:
r(t) := lim
t#0
t
This can be rewritten as
}(T t + tjT > t) r(t)t:
In other words, given that the device or system has operated for more than
t units of time, the conditional probability of failure before time t + t is
approximately r(t)t. Intuitively, the form of a failure rate function should
be as shown in Figure 4.4. For small values of t, r(t) is relatively large when
r(t )
t
Figure 4.4. Typical form of a failure rate function.
pre-existing defects are likely to appear. Then for intermediate values of t, r(t)
is at indicating a constant failure rate. For large t, as the device gets older,
r(t) increases indicating that failure is more likely.
July 18, 2002
4.2 Reliability
125
To say more about the failure rate, write
}(T t + tjT > t) = }(fT t}+ tg \ fT > tg)
(T > t)
}(t < T t + t)
=
}(T > t)
, FT (t) :
= FT (t + t)
R(t)
Since FT (t) = 1 , R(t), we can rewrite this as
}(T t + tjT > t) = , R(t + t) , R(t) :
R(t)
Dividing both sides by t and letting t # 0 yields the dierential equation
0 (t)
:
r(t) = , RR(t)
Now suppose T is a continuous random variable with density fT . Then
Z 1
}
R(t) = (T > t) =
fT () d;
t
and
We can now write
R0(t) = ,fT (t):
0(t)
r(t) = , RR(t)
=
fT (t) :
fT () d
1
Z
t
In this case, the failure rate r(t) is completely determined by the density fT (t).
The converse is also true; i.e., given the failure rate r(t), we can recover the
density fT (t). To see this, rewrite the above dierential equation as
0 (t)
= dtd ln R(t):
,r(t) = RR(t)
Integrating the left and right-hand formulas from zero to t yields
,
Then
Z
0
t
r() d = ln R(t) , ln R(0):
Rt
R(t) = R(t);
e, 0 r( ) d = R(0)
where we have used the fact that for a nonnegative, continuous random variable,
R(0) = }(T > 0) = }(T 0) = 1. Since R0(t) = ,fT (t),
Rt
fT (t) = r(t)e, 0 r( ) d :
For example, if the failure rate is constant, r(t) = , then fT (t) = e,t, and
we see that T has an exponential density with parameter .
July 18, 2002
126
Chap. 4 Analyzing Systems with Random Inputs
4.3. Cdfs for Discrete Random Variables
In this section, we show that for discrete random variables, the pmf can be
recovered from the cdf. If X is a discrete random variable, then
FX (x) = }(X x) =
X
i
I(,1;x] (xi ) pX (xi ) =
X
i:xi x
pX (xi ):
Example 4.6. Let X be a discrete random variable with pX (0) = 0:1,
pX (1) = 0:4, pX (2:5) = 0:2, and pX (3) = 0:3. Find the cdf FX (x) for all x.
Solution. Put x0 = 0, x1 = 1, x2 = 2:5, and x3 = 3. Let us rst evaluate
FX (0:5). Observe that
FX (0:5) =
X
i:xi 0:5
pX (xi) = pX (0);
since only x0 = 0 is less than or equal to 0:5. Similarly, for any 0 x < 1,
FX (x) = pX (0) = 0:1.
Next observe that
FX (2:1) =
X
i:xi 2:1
pX (xi ) = pX (0) + pX (1);
since only x0 = 0 and x1 = 1 are less than or equal to 2.1. Similarly, for any
1 x < 2:5, FX (x) = 0:1 + 0:4 = 0:5.
The same kind of argument shows that for 2:5 x < 3, FX (x) = 0:1+0:4+
0:2 = 0:7.
Since all the xi are less than or equal to 3, for x 3, FX (x) = 0:1 + 0:4 +
0:2 + 0:3 = 1:0.
Finally, since none of the xi is less than zero, for x < 0, FX (x) = 0. The
graph of FX is given in Figure 4.5.
We now show that the arguments used in the preceding example hold for
the cdf of any discrete random variable X taking distinct values xi . Fix two
adjacent points, say xj ,1 < xj as shown in Figure 4.6. For xj ,1 x < xj ,
observe that
FX (x) =
X
i:xi x
pX (xi ) =
X
i:xi xj,1
pX (xi) = FX (xj ,1):
In other words, the cdf is constant on the interval [xj ,1; xj ), with the constant
equal to the value of the cdf at the left-hand endpoint, FX (xj ,1). Next, since
FX (xj ) =
X
i:xi xj
pX (xi) =
X
i:xi xj,1
July 18, 2002
pX (xi) + pX (xj );
4.4 Mixed Random Variables
127
1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
−1
−0.5
0
0.5
1
1.5
2
2.5
3
3.5
4
Figure 4.5. Cumulative distribution function of a discrete random variable.
... p
p
...
pj − 1
... x
1
x2
...
x j− 1
1
2
pj . . .
x
xj . . .
Figure 4.6. Analysis for the cdf of a discrete random variable.
we have
or
FX (xj ) = FX (xj ,1) + pX (xj );
FX (xj ) , FX (xj ,1) = pX (xj ):
As noted above, for xj ,1 x < xj , FX (x) = FX (xj ,1). Hence, FX (xj ,) :=
limx"xj FX (x) = FX (xj ,1). It then follows that
FX (xj ) , FX (xj ,) = FX (xj ) , FX (xj ,1) = pX (xj ):
In other words, the size of the jump in the cdf at xj is pX (xj ), and the cdf is
constant between jumps.
4.4. Mixed Random Variables
We say that X is a mixed random variable if }(X 2 B) has the form
}(X 2 B) =
Z
B
~ dt + X IB (xi)~pi
f(t)
July 18, 2002
i
(4.3)
128
Chap. 4 Analyzing Systems with Random Inputs
for some nonnegativeR function
f,~ some
nonnegative sequence p~i , and distinct
P
1
~
points xi such that ,1 f(t) dt + i p~i = 1. Note that the mixed random
variables include the continuous and discrete real-valued random variables as
special cases.
Example 4.7. Let X be a mixed random variable with
Z
}(X 2 B) := 1 e,jtj dt + 1 IB (0) + 1 IB (7):
(4.4)
4 B
3
6
Compute }(X < 0), }(X 0), }(X > 2), and }(X 2).
Solution. We need to compute }(X 2 B) for B = (,1; 0), (,1; 0],
(2; 1), and [2; 1), respectively. Since the set B = (,1; 0) does not contain 0
or 7, the two indicator functions in the denition of }(X 2 B) are zero in this
case. Hence,
Z 0
Z 0
1
1
,j
t
j
t
}(X < 0) = 1
4 ,1 e dt = 4 ,1 e dt = 4 :
Next, the set B = (,1; 0] contains 0 but not 7. Thus,
Z 0
1
1 1
7
,jtj
}(X 0) = 1
4 ,1 e dt + 3 = 4 + 3 = 12 :
For the last two probabilities, we need to consider (2; 1) and [2; 1). Both
intervals contain 7 but not 0. Hence, the probabilities are the same, and both
equal
1 Z 1 e,t dt + 1 = e,2 + 1 :
4
6
4 6
2
Example 4.8. With X as in the last example, compute }(X = 0) and
}(X = 7). Also, use your results to verify that }(X = 0)+ }(X < 0) = }(X 0).
Solution. To compute }(X = 0) = }(X 2 f0g), take B = f0g in (4.4).
Then
Z
1
1
,jtj
}(X = 0) = 1
4 f0g e dt + 3 If0g(0) + 6 If0g (7):
The integral of an ordinary function over a set consisting of a single point is
zero, and since 7 2= f0g, the last term is zero also. Hence, }(X = 0) = 1=3.
Similarly, }(X = 7) = 1=6. From the previous example, }(X < 0) = 1=4.
Hence, }(X = 0) + }(X < 0) = 1=3 + 1=4 = 7=12, which is indeed the value of
}(X 0) computed earlier.
July 18, 2002
4.4 Mixed Random Variables
129
In order to perform calculations with mixed random variables more easily, it
is sometimes convenient to generalize the notion of density to permit impulses.
Recall that the unit impulse or Dirac delta function, denoted by , is
dened by the two properties
Putting
1
Z
(t) = 0 for t 6= 0 and,
,1
(t) dt = 1:
~ + X p~i (t , xi);
f(t) := f(t)
it is easy to verify that
Z
f(t) dt = }(X 2 B) and
i
B
Z
1
,1
(4.5)
g(t)f(t) dt = E[g(X)]:
~ =0
We call f in (4.5) a generalized probability density function. If f(t)
for all t, we say that f is purely impulsive, and if p~i = 0 for all i, we say f is
nonimpulsive.
Example 4.9. The preceding examples can be worked using delta functions. In these examples, the generalized density is
f(t) = 14 e,jtj + 13 (t) + 16 (t , 7);
and
Z
}(X 2 B) = f(t) dt:
B
Since the impulses are located at 0 and at 7, they will contribute to the integral
if and only if 0 and/or 7 lie in the set B. For example, in computing
}(0 < X 7) =
Z 7
0+
f(t) dt;
the impulse at the origin makes no contribution, but the impulse at 7 does.
Thus,
Z 7
1 e,jtj + 1 (t) + 1 (t , 7)dt
}(0 < X 7) =
3
6
0+ 4
Z 7
= 41 e,t dt + 16
0
1
,
e,7 + 1 = 5 , e,7 :
=
4
6
12 4
Similarly, in computing }(X = 0) = }(X 2 f0g), only the impulse at zero
makes a contribution. Thus,
Z
Z
1 (t) dt = 1 :
}(X = 0) =
f(t) dt =
3
f0g
f0g 3
July 18, 2002
130
Chap. 4 Analyzing Systems with Random Inputs
~
We now show that if X is a mixed random variable as in (4.3), then f(x)
and p~i can be recovered from the cdf of X. From (4.3),
Z x
~ dt + X p~i ;
FX (x) =
f(t)
,1
and so
i:xi x
~
FX0 (x) = f(x);
x 6= xi;
while
FX (xi) , FX (xi ,) = p~i :
~ = (0:7=)=(1+t2),
Figure 4.7 shows the cdf of a mixed random variable with f(t)
x1 = 0; x2 = 3, p~1 = 0:21, and p~2 = 0:09.
1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
−15
−10
−5
0
5
10
15
Figure 4.7. Cumulative distribution function of a mixed random variable.
4.5. Functions of Random Variables and Their Cdfs
Most modern systems today are composed of many subsystems in which the
output of one system serves as the input to another. When the input to a system
or a subsystem is random, so is the output. To evaluate system performance, it
is necessary to take into account this randomness. The rst step in this process
is to nd the cdf of the system output if we know the pmf or density of the
random input. In many cases, the output will be a mixed random variable with
a generalized impulsive density.
We consider systems modeled by real-valued functions g(x). The random
system input is a random variable X, and the system output is the random
variable Y = g(X). To nd FY (y), observe that
FY (y) := }(Y y) = }(g(X) y) = }(X 2 By );
July 18, 2002
4.5 Functions of Random Variables and Their Cdfs
131
where
By := fx 2 IR : g(x) yg:
If X has density fX , then
Z
FY (y) = }(X 2 By ) =
By
fX (x) dx:
The diculty is to identify the set By . However, if we rst sketch the function
g(x), the problem becomes manageable.
Example 4.10. Suppose g is given by
8
1; ,2 x < ,1;
>
>
>
>
< x2; ,1 x < 0;
g(x) := > x; 0 x < 2;
>
2; 2 x < 3;
>
>
:
0; otherwise:
Find the inverse image B = fx 2 IR : g(x) yg for y 2, 2 > y 1, 1 > y 0,
and y < 0.
Solution. To begin, we sketch g in Figure 4.8. Fix any y 2. Since g is
upper bounded by 2, g(x) 2 y for all x 2 IR. Thus, for y 2, B = IR.
2
1.5
1
0.5
0
−2
−1
0
1
2
3
Figure 4.8. The function g from Example 4.10.
Next, x any y with 2 > y 1. On the graph of g, draw a horizontal
line at level y. In Figure 4.9, the line is drawn at level y = 1:5. Next, at the
intersection of the horizontal line and the curve g, drop a vertical line to the
x-axis. In the gure, this vertical line hits the x-axis at the point marked .
July 18, 2002
132
Chap. 4 Analyzing Systems with Random Inputs
2
1.5
1
0.5
0
−2
−1
0
1
2
3
Figure 4.9. Determining B for 1 y < 2.
Clearly, for all x to the left of this point, and for all x 3, g(x) y. To nd
the x-coordinate of , we solve g(x) = y for x. For the y-value in question, the
formula for g(x) is g(x) = x. Hence, the x-coordinate of is simply y. Thus,
g(x) y , x y or x 3, and so,
B = (,1; y] [ [3; 1); 1 y < 2:
Now x any y with 1 > y 0, and draw a horizontal line at level y as
before. In Figure 4.10 we have used y = 1=2. This time the horizontal line
intersects the curve g in two places, and there are two points marked on the
x-axis. Call the x-coordinate of the left one x1 and that of the right one x2. We
must solve g(x1 ) = y, where x1 is negative and g(x1 ) = x21. We must alsopsolve
g(x2 ) = y, where g(x2 ) = x2. We conclude that g(x) y , x < ,2 or , y x y or x 3. Thus,
B = (,1; ,2) [ [,py; y] [ [3; 1); 0 y < 1:
Note that when y = 0, the interval [0; 0] is just the singleton set f0g.
Finally, since g(x) 0 for all x 2 IR, for y < 0, B = 6 .
Example 4.11. Let g be as in Example 4.10, and suppose that X uniform[,4; 4]. Find and sketch the cumulative distribution of Y = g(X).
Also nd and sketch the density.
Solution. Proceeding as in the two examples above, write FY (y) = }(Y y) = }(X 2 B) for the appropriate inverse image B = fx 2 IR : g(x) yg.
July 18, 2002
4.5 Functions of Random Variables and Their Cdfs
133
2
1.5
1
0.5
0
−2
−1
0
1
2
3
Figure 4.10. Determining B for 0 y < 1.
From Example 4.10,
8
>
>
<
6 ;
py; y] [ [3; 1); 0y < y0;< 1;
(
,1
;
,
2)
[
[
,
B = > (,1; y] [ [3; 1);
1 y < 2;
>
:
IR;
y 2:
For y < 0, FY (y) = 0. For 0 y < 1,
FY (y) = }(X < ,2) + }(,py X y) + }(X 3)
y , (,py ) 4 , 3
= ,2 ,8(,4) +
+ 8
8
p
= y + 8y + 3 :
For 1 y < 2,
FY (y) = }(X y) + }(X 3)
= y , 8(,4) + 4 ,8 3
= y +8 5 :
For y > 2, FY (y) = }(X 2 IR) = 1. Putting all this together,
8
0; p
y < 0;
>
>
<
(y
+
y
+
3)=8;
0 y < 1;
FY (y) = > (y + 5)=8;
1 y < 2;
>
:
1;
y 2:
July 18, 2002
134
Chap. 4 Analyzing Systems with Random Inputs
1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
0
0.5
1
1.5
2
Figure 4.11. Cumulative distribution function FY (y) from Example 4.11.
In sketching FY (y), we note from the formula that it is 0 for y < 0 and 1 for
y 2. Also from the formula, note that there is a jump discontinuity of 3=8 at
y = 0 and a jump of 1=8 at y = 1 and at y = 2. See Figure 4.11.
From the observations used in graphing FY , we can easily obtain the generalized density,
hY (y) = 38 (y) + 81 (y , 1) + 81 (y , 2) + f~Y (y);
where
8
p
< [1 + 1=(2 y )]=8; 0 < y < 1;
1 < y < 2;
f~Y (y) = : 1=8;
0;
otherwise;
is obtained by dierentiating FY (y) at non-jump points y. A sketch of hY is
shown in Figure 4.12.
4.6. Properties of Cdfs
Given an arbitrary real-valued random variable X, its cumulative distribution function is dened by
F(x) := }(X x); ,1 < x < 1:
We show below that F satises the following eight properties:
(i) 0 F (x) 1.
(ii) For a < b, }(a < X b) = F (b) , F(a).
July 18, 2002
4.6 Properties of Cdfs
135
1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
0
0.5
1
1.5
2
Figure 4.12. Density hY (y) from Example 4.11.
F is nondecreasing, i.e., a b implies F (a) F(b).
lim F(x) = 1.
x"1
lim F (x) = 0.
x#,1
F (x0+) := xlim
F(x) = }(X x0) = F (x0). In other words, F is
#x0
right-continuous.
(vii) F (x0,) := xlim
F (x) = }(X < x0 ).
"x0
(viii) }(X = x0) = F(x0) , F(x0,).
We also point out that
}(X > x0 ) = 1 , }(X x0 ) = 1 , F(x0);
(iii)
(iv)
(v)
(vi)
and
}(X x0) = 1 , }(X < x0 ) = 1 , F (x0,):
If F(x) is continuous at x = x0, i.e., F(x0,) = F(x0), then this last equation
becomes
}(X x0) = 1 , F (x0):
Another consequence of the continuity of F (x) at x = x0 is that }(X = x0) = 0.
Note that if a random variable has a nonimpulsive density, then its cumulative
distribution is continuous everywhere.
We now derive the eight properties of cumulative distribution functions.
(i) The properties of } imply that F(x) = }(X x) satises 0 F(x) 1.
July 18, 2002
136
Chap. 4 Analyzing Systems with Random Inputs
(ii) First consider the disjoint union (,1; b] = (,1; a] [ (a; b]. It then
follows that
fX bg = fX ag [ fa < X bg
is a disjoint union of events in . Now write
F(b) = }(X b)
= }(fX ag [ fa < X bg)
= }(X a) + }(a < X b)
= F(a) + }(a < X b):
Now subtract F(a) from both sides.
(iii) This follows from (ii) since }(a < X b) 0.
(iv) We prove the simpler result limN !1 F(N) = 1. Starting with
IR = (,1; 1) =
we can write
1
[
(,1; n];
n=1
1 = }(X 2 IR)
[
1
}
=
fX ng
n=1
}(X N); by limit property (1.4);
= Nlim
!1
= Nlim
!1 F(N):
(v) We prove the simpler result, limN !1 F(,N) = 0. Starting with
6 =
we can write
1
\
(,1; ,n];
n=1
0 = }(X 2 6 )
1
\
}
=
fX ,ng
n=1
}
= Nlim
!1 (X ,N); by limit property (1.5);
= Nlim
F(,N):
!1
(vi) We prove the simpler result, }(X x0) = Nlim
F(x0 + N1 ). Starting
!1
with
1
\
(,1; x0] = (,1; x0 + n1 ];
n=1
July 18, 2002
4.7 The Central Limit Theorem
we can write
137
\
1
}(X x0 ) = }
1g
fX x0 + n
n=1
}(X x0 + N1 ); by (1.5);
= Nlim
!1
F (x0 + N1 ):
= Nlim
!1
(vii) We prove the simpler result, }(X < x0) = Nlim
F(x0 , N1 ). Starting
!1
with
1
[
(,1; x0) = (,1; x0 , n1 ];
we can write
n=1
}(X < x0 ) = }
1
[
1g
fX x0 , n
n=1
}(X x0 , N1 ); by (1.4);
= Nlim
!1
= Nlim
F (x0 , N1 ):
!1
(viii) First consider the disjoint union (,1; x0] = (,1; x0) [fx0 g. It then
follows that
fX x0 g = fX < x0 g [ fX = x0 g
is a disjoint union of events in . Using Property (vii), it follows that
F(x0) = F(x0,) + }(X = x0):
4.7. The Central Limit Theorem
Let X1 ; X2; : : : be i.i.d. with common mean m and common variance 2.
There are many cases for which we know the probability mass function or
density of
n
X
Xi :
i=1
For example, if the Xi are Bernoulli, binomial, Poisson, gamma, or Gaussian,
we know the cdf of the sum (see Example 2.20 or Problem 22 in Chapter 2
and Problem 46 in Chapter 3). Note that the exponential and chi-squared are
special cases of the gamma (see Problem 12 in Chapter 3). In general, however,
nding the cdf of a sum of i.i.d. random variables is not possible. Fortunately,
we have the following approximation.
Central Limit Theorem (CLT). Let X1 ; X2; : : : be independent, identically
distributed random variables with nite mean m and nite variance 2 . If
n
, m and M := 1 X
Yn := M=n p
n
n
n i=1 Xi ;
July 18, 2002
138
Chap. 4 Analyzing Systems with Random Inputs
then
lim F (y)
n!1 Yn
= (y):
To get some idea of how large n should be, would like to compare FYn (y)
and (y) in cases where FYn is known. To do this, we need the following result.
Example 4.12. Show that if Gn is the cdf of Pni=1 Xi, then
p
(4.6)
FYn (y) = Gn(y n + nm):
Solution. Write
,m y
FYn (y) = } M=n p
n
p
= }(Mn , m y= n )
p
= }(Mn y= n + m)
= sP
X
n
i=1
p
p
Xi y n + nm
= Gn(y n + nm):
When the Xi are exp(1), Gn is the Erlang(30; 1) cdf given in Problem 12(c)
in Chapter 3. In Figure 4.13 we plot FY30 (y) (dashed line) and the N(0; 1) cdf
(y) (solid line).
A typical calculation using the central limit theorem is as follows. To approximate
n
} X Xi > t ;
i=1
write
}
n
X
i=1
Xi > t
X
n
t
1
}
=
n i=1 Xi > n
= }(Mn > t=n)
= }(M
n , m > t=n , m) M
, m > t=n p, m
= } =n p
n
= n
t=n
,
m
= } Yn > =pn
t=n
,
m
= 1 , FYn =pn
t=n
,
m
1 , =pn :
July 18, 2002
4.7 The Central Limit Theorem
139
1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
−3
−2
−1
0
1
2
3
Figure 4.13. Illustration of the central limit theorem when the Xi are exponential with
parameter 1. The dashed line is FY30 , and the solid line is the standard normal cumulative
distribution, .
Example 4.13. A certain digital communication link has bit-error probability p. Use the central limit theorem to approximate the probability that in
transmitting a word of n bits, more than k bits are received incorrectly.
Solution. Let Xi = 1 if bit i is received in error, and Xi = 0 otherwise.
We assume the Xi arePindependent
Bernoulli(p) random variables. The number
of errors in n bits is ni=1 Xi . We must compute
}
X
n
i=1
Xi > k :
Using the approximation above in which t = k, m = p, and 2 = p(1 , p), we
have
n
} X Xi > k 1 , p k=n , p :
p(1 , p)=n
i=1
Derivation of the Central Limit Theorem
It is instructive to consider rst the following special case, which illustrates
the key steps of the p
general derivation. Suppose that the Xi are i.i.d. Laplace
with parameter = 2. Then m = 0, 2 = 1, and
pn X
n
n
X
p
M
,
m
n
Xi = p1n Xi :
Yn = =pn = nMn = n
i=1
i=1
July 18, 2002
140
Chap. 4 Analyzing Systems with Random Inputs
The characteristic function of Yn is
n
'Yn () =
E[ejYn ]
p
E ej (= n )Xi
Of course,
variable Xi ,
Thus,
and
j (=pn ) P Xi = Ee
=
i=1
n Y
p
E ej (= n )Xi :
i=1
,p p = 'Xi = n , where, for the Laplace 2 random
,
'Xi () = 2 +2 2 = 1 + 1 2=2 :
p
E[ej (= n )Xi ] = 'Xi p
1 ;
=
n
1 + 2n=2
'Yn () =
1
1 + 2n=2
n
=
1
1 + 2n=2
n :
We now use the fact that for any number ,
n
1 + n ! e :
It follows that
'Yn () = 1 2 =2 n ! e12 =2 = e, 2 =2 ;
1+ n
which is the characteristic function of an N(0; 1) random variable.
We now turn to the derivation in the general case. Write
,m
Yn = M=n p
n
pn 1 X
n
= n Xi , m
i=1
pn 1 X
n
= n (Xi , m)
i=1
n
X
1
X
,
m
i
p
= n
:
i=1
Now let Zi := (Xi , m)=. Then
n
X
Yn = p1n Zi ;
i=1
July 18, 2002
4.7 The Central Limit Theorem
141
where the Zi are zero mean and unit variance. Since the Xi are i.i.d., so are
the Zi . Let 'Z () := E[ejZi ] denote their common characteristic function. We
can write the characteristic function of Yn as
'Yn () := E[ejYn ]
n X
= E exp j pn Zi
i=1
n
Y
= E exp j pn Zi
i=1
n Y
=
E exp j p Zi
n
i=1
n
Y
=
'Z pn
i=1
n
= 'Z pn :
Now recall that for any complex ,
e = 1 + + 12 2 + R():
Thus,
p
'Z pn = E[ej (= n )Zi ]
2
1
= E 1 + j pn Zi + 2 j pn Zi + R j pn Zi :
Since Zi is zero mean and unit variance,
2
'Z pn = 1 , 21 n + E R j pn Zi :
It can be shown that the last term on the right is asymptotically negligible [4,
pp. 357{358], and so
2
'Z pn 1 , n=2 :
We now have
n
2 n
'Yn () = 'Z pn 1 , n=2 ! e, 2 =2 ;
which is the N(0; 1) characteristic function. Since the characteristic function
of Yn converges to the N(0; 1) characteristic function, it follows that FYn (y) !
(y) [4, p. 349, Theorem 26.3].
July 18, 2002
142
Chap. 4 Analyzing Systems with Random Inputs
Example 4.14. In the derivation of the central limit theorem, we found
that 'Yn () ! e, 2 =2 as n ! 1. We can use this result to give a simple
derivation of Stirling's formula,
p
n! 2 nn+1=2e,n :
Solution. Since n! = n(n , 1)!, it suces to show that
p
(n , 1)! 2 nn,1=2e,n :
To thisPend,
let X1 ; : : :; Xn be i.i.d. exp(1). Then by Problem 46(c) in Chapter 3, ni=1 Xi has the n-Erlang density,
n,1 ,x
gn(x) = x(n ,e1)! ; x 0:
Also, the mean and variance of Xi are both one. Dierentiating (4.6) (with
m = = 1) yields
p
p
fYn (y) = gn(y n + n) n:
Note that fYn (0) = nn,1=2e,n =(n , 1)!. On the other hand, by inverting the
characteristic function of Yn , we can also write
Z 1
1
fYn (0) = 2
'Yn () d
,1
Z 1
e, 2 =2 d; by the CLT;
21
p ,1
= 1= 2:
p
It follows that (n , 1)! 2 nn,1=2e,n .
Remark. A more precise version of Stirling's formula is [14, pp. 50{53]
p
p
2 nn+1=2e,n+1=(12n+1) < n! < 2 nn+1=2e,n+1=(12n):
4.8. Problems
Problems x4.1: Continuous Random Variables
1. Find the cumulative distribution function FX (x) of an exponential ran-
dom variable X with parameter . Sketch the graph of FX when = 1.
2. The Rayleigh density with parameter is dened by
8
< x e,(x=)2 =2 ; x 0;
f(x) := : 2
0;
x < 0:
Find the cumulative distribution function.
July 18, 2002
4.8 Problems
? 3.
4.
5.
6.
7.
143
The Maxwell density with parameter is dened by
8 r
>
2
<
x2 e,(x=)2 =2; x 0;
f(x) := > 3
:
0;
x < 0:
Show that the cdf F (x) can be expressed in terms of the standard normal
cdf (x) dened in (4.1).
If Z has density fZ (z) and Y = eZ , nd fY (y).
If X uniform(0; 1), nd the density of Y = ln(1=X).
Let X Weibull(p; ). Find the density of Y = X p .
The input to a squaring circuit is a Gaussian random variable X with
mean zero and variance one. Show that the output Y = X 2 has the
chi-squared density with one degree of freedom,
e,y=2 ; y > 0:
fY (y) = p
2y
8. If the input to the squaring circuit of Problem 7 includes a xed bias, say
9.
10.
11.
12.
m, then the output is given by Y = (X + m)2 , where again X N(0; 1).
Show that Y has the noncentral chi-squared density with one degree of
freedom and noncentrality parameter m2 ,
,(y+m2 )=2 mpy ,mpy
fY (y) = e p2y e +2 e
; y > 0:
Note that if m = 0, we recover the result of Problem 7.
Let X1 ; : : :; Xn be independent with common cumulative distribution
function F (x). Let us dene Xmax := max(X1 ; : : :; Xn ) and Xmin :=
min(X1 ; : : :; Xn). Express the cumulative distributions of Xmax and Xmin
in terms of F(x). Hint: Example 2.7 may be helpful.
If X and Y are independent exp() random variables, nd E[max(X; Y )].
Digital Communication System. The received voltage in a digital communication system is Z = X + Y , where X Bernoulli(p) is a random
message, and Y N(0; 1) is a Gaussian noise voltage. Assuming X and
Y are independent, nd the conditional cdf FZ jX (z ji) for i = 0; 1, the cdf
FZ (z), and the density fZ (z).
Fading Channel. Let X and Y be as in the preceding problem, but now
suppose Z = X=A + Y , where A, X, and Y are independent, and A
takes the values 1 and 2 with equal probability. Find the conditional cdf
FZ jA;X (z ja; i) for a = 1; 2 and i = 0; 1.
July 18, 2002
144
Chap. 4 Analyzing Systems with Random Inputs
13. Generalized Rayleigh Densities. Let Yn be chi-squared with pn degrees of
freedom as dened in Problem 12 of Chapter 3. Put Zn := Yn.
(a) Express the cdf of Zn in terms of the cdf of Yn .
(b) Find the density of Z1 .
(c) Show that Z2 has a Rayleigh density, as dened in Problem 2, with
= 1.
(d) Show that Z3 has a Maxwell density, as dened in Problem 3, with
= 1.
(e) Show that Z2m has a Nakagami-m density
8
2 z 2m,1 e,(z=)2 =2 ; z 0;
<
m
f(z) := : 2 ,(m) 2m
0;
z < 0;
with = 1.
Remark. For the general chi-squared random variable Yn, it is not
necessary that n be an integer. However, if n is a positive integer, and
if X1 ; : : :; Xn are i.i.d. N(0; 1), then the Xi2 are chi-squared with one
degree of freedom by Problem 7, and Yn := X12 + + Xn2 is chi-squared
with n degrees of freedom by Problem 46(c) in Chapter 3. Hence, the
above densities usually arise from taking the square root of the sum of
squares of standard normal random variables. For example, (X1 ; X2 ) can
be regarded as a random point in the plane whose horizontal and vertical
coordinates are
p independent N(0; 1). The distance of this point from
the origin is X12 + X22 = Z2 , which is a Rayleigh random variable. As
another example, consider an ideal gas. The velocity of a given particle is
obtained by adding up the results of many collisions with other particles.
By the central limit theorem (Section 4.7), each component of the given
particle's velocity vector,psay (X1 ; X2; X3 ) should be i.i.d. N(0; 1). The
speed of the particle is X12 + X22 + X32 = Z3 , which has the Maxwell
density. When the Nakagami-m density is used as a model for fading in
wireless communication channels, m is often not an integer.
14. Generalized Gamma Densities.
(a) Let X gamma(p; 1), and put Y := X 1=q . Show that
pq,1 ,yq
fY (y) = qy ,(p)e ; y > 0:
(b) If in part (a) we replace p with p=q, we nd that
p,1 e,yq
qy
fY (y) = ,(p=q) ; y > 0:
July 18, 2002
4.8 Problems
145
If we introduce a scale parameter > 0, we have the generalized
gamma density [46]. More precisely, we say that Y g-gamma(p; ; q)
if Y has density
p,1 e,(y)q
q(y)
; y > 0:
fY (y) =
,(p=q)
Clearly, g-gamma(p; ;1) = gamma(p; ), which includes the Erlang
and the chi-squared as special cases. Show that
(i) g-gamma(p;1=pp; p) = Weibull(p; ).
(ii) g-gamma(2; 1=( 2); 2) is the Rayleigh density dened in Problem 2.
p
(iii) g-gamma(3; 1=( 2); 2) is the Maxwell density dened in Problem 3.
(c) If Y g-gamma(p; ; q), show that
,
, (n + p)=q)
n
E[Y ] = ,(p=q)n ;
and conclude that
1 sn ,,(n + p)=q)
X
,(p=q)n :
MY (s) =
n=0 n!
? 15. In the analysis of communicationsystems, one is often interested in }(X >
x) = 1 , FX (x) for some voltage threshold x. We call FXc (x) := 1 , FX (x)
the complementary cumulative distribution function (ccdf) of X.
Of particular interest is the ccdf of the standard normal, which is often
denoted by
Z 1
e,t2 =2 dt:
Q(x) := 1 , (x) = p1
2 x
Using the hints below, show that for x > 0,
,x2 =2 1 1 e,px2 =2 :
ep
,
<
Q(x)
<
2 x x3
x 2
Hints: To derive the upper bound, apply integration by parts to
Z 1
1 te,t2 =2 dt;
x t
and then drop the new integral term (which is positive),
Z 1
1 e,t2 =2 dt:
x t2
If you do not drop the above term, you can derive the lower bound by applying integration by parts one more time (after dividing and multiplying
by t again) and then dropping the nal integral.
July 18, 2002
146
? 16.
Chap. 4 Analyzing Systems with Random Inputs
Let Ck (x) denote the chi-squared cdf with k degrees of freedom. Show that
the noncentral chi-squared cdf with k degrees of freedom and noncentrality
parameter 2 is given by (recall Problem 54 in Chapter 3)
Ck;2 (x) =
1
X
(2 =2)ne,2 =2 C (x):
2n+k
n!
n=0
Remark. Note that in Matlab, Ck(x) = chi2cdf(x; k), and Ck;2 (x) =
(; ;
).
ncx2cdf x k lambda^2
? 17. Generalized Rice or Noncentral Rayleigh Densities.
Let Yn be noncentral
chi-squared with n degrees of freedom and noncentrality parameter m2
as dened in Problem 54 in Chapter 3. (In general, n need not be an
integer, but if it is, and if X1 ; : : :; Xn are independent normal random
variables with Xi N(mi ; 1), then by Problem 8, Xi2 is noncentral chisquared with one degree of freedom and noncentrality parameter m2i , and
by Problem 54 in Chapter 3, X12 + +Xn2 is noncentral chi-squared with
n degrees of freedom and noncentrality parameter m2 = m21 + + m2n .)
p
(a) Show that Zn := Yn has the generalized Rice density,
n=2
fZn (z) = mzn=2,1 e,(m2 +z2 )=2In=2,1(mz); z > 0;
where I is the modied Bessel function of the rst kind, order ,
I (x) :=
1
X
(x=2)2`+ :
`=0 `! ,(` + + 1)
Remark. In Matlab, I (x) = besseli(nu; x).
(b) Show that Z2 has the original Rice density,
fZ2 (z) = ze,(m2 +z2 )=2I0 (mz); z > 0:
(c) Show that
py n=2,1
fYn (y) = 21 m
e,(m2 +y)=2 In=2,1(mpy ); y > 0;
giving a closed-form expression for the noncentral chi-squared density. Recall that you already have a closed-form expression for the
moment generating function and characteristic function of a noncentral chi-squared random variable (see Problem 54(b) in Chapter 3).
Remark. Note that in Matlab, the cdf of Yn is given by FYn (y) =
ncx2cdf(y; n; m^2).
July 18, 2002
4.8 Problems
147
(d) Denote the complementary cumulative distribution of Zn by
FZc n (z) := }(Zn > z) =
1
Z
z
Show that
fZn (t) dt:
n=2,1
e,(m2 +z2 )=2 In=2,1(mz) + FZc n,2 (z):
FZc n (z) = mz
Hint: Use integration by parts; you will need the easily-veried fact
that
d x I (x) = x I (x):
,1
dx
(e) The complementary cdf of Z2 , FZc 2 (z), is known as the Marcum Q
function,
Q(m; z) :=
Z
z
1
te,(m2 +t2 )=2 I0 (mt) dt:
Show that if n 4 is an even integer, then
2
2
FZc n (z) = Q(m; z) + e,(m +z )=2
n=X
2,1
k=1
z k I (mz):
m k
e
(f) Show that Q(m; z) = Q(m;
z), where
e
Q(m;
z)
1
2 +z2 )=2 X
,
(
m
:= e
(m=z)k Ik (mz):
k=0
e z) = e,z2 =2 . It then
Hint: [19, p. 450] Show that Q(0; z) = Q(0;
suces to prove that
@
@ e
@m Q(m; z) = @m Q(m; z):
To this end, use the derivative formula in the hint of part (d) to show
that
@ Q(m;
,(m2 +z2 )=2 I1(mz):
e
z)
=
ze
@m
Now take the same partial derivative of Q(m; z) as dened in part (e),
and then use integration by parts on the term involving I1 .
? 18. Properties of Modied Bessel Functions.
(a) Use the power-series denition of I (x) in the preceding problem to
show that
I0 (x) = 12 [I ,1(x) + I +1 (x)]
July 18, 2002
148
Chap. 4 Analyzing Systems with Random Inputs
and that
I ,1(x) , I +1 (x) = 2(=x) I (x):
Note that the second identity implies the recursion,
I +1 (x) = I ,1(x) , 2(=x) I (x):
Hence, once I0 (x) and I1(x) are known, In (x) can be computed for
n = 2; 3; : : ::
(b) Parts (b) and (c) of this problem are devoted to showing that for
integers n 0,
1 Z ex cos cos(n) d:
In(x) = 2
,
To this end, denote the above integral by I~n (x). Use integration by
parts and a trigonometric identity to show that
x [I~ (x) , I~ (x)]:
I~n(x) = 2n
n,1
n+1
Hence, in Part (c) it will be enough to show that I~0 (x) = I0 (x) and
I~1 (x) = I1 (x).
(c) From the integral denition of I~n(x), it is clear that I~00 (x) = I~1 (x). It
is also easy to see from the power series for I0 (x) that I00 (x) = I1(x).
Hence, it is enough to show that I~0 (x) = I0(x). Since the integrand
dening I~0 (x) is even,
Z I~0 (x) = 2 ex cos d:
0
Show that
Z =2
I~0 (x) = 1
e,x sin t dt:
e
,=2
P1
Then use the power series = k=0 k =k! in the above integrand
and integrate term by term. Then use the results of Problems 15
and 11 of Chapter 3 to show that I~0 (x) = I0 (x).
(d) Use the integral formula for In (x) to show that
In0 (x) = 12 [In,1(x) + In+1 (x)]:
Problems x4.2: Reliability
19. The lifetime T of a Model n Internet router has an Erlang(n; 1) density,
fT (t) = tn,1e,t =(n , 1)!.
(a) What is the router's mean time to failure?
July 18, 2002
4.8 Problems
149
(b) Show that the reliability of the router after t time units of operation
is
nX
,1 tk
,t
R(t) =
k! e :
k=0
(c) Find the failure rate (known as the Erlang failure rate).
20. A certain device has the Weibull failure rate
r(t) = ptp,1 ; t > 0:
(a) Sketch the failure rate for = 1 and the cases p = 1=2, p = 1,
p = 3=2, p = 2, and p = 3.
(b) Find the reliability R(t).
(c) Find the mean time to failure.
(d) Find the density fT (t).
21. A certain device has the Pareto failure rate
(
p=t; t t0 ;
r(t) =
0; t < t0 :
(a) Find the reliability R(t) for t 0.
(b) Sketch R(t) if t0 = 1 and p = 2.
(c) Find the mean time to failure if p > 1.
(d) Find the Pareto density fT (t).
22. A certain device has failure rate r(t) = t2 , 2t + 2 for t 0.
(a) Sketch r(t) for t 0.
(b) Find the corresponding density fT (t) in closed form (no integrals).
23. Suppose that the lifetime T of a device is uniformly distributed on the
interval [1; 2].
(a) Find and sketch the reliability R(t) for t 0.
(b) Find the failure rate r(t) for 0 t < 2.
(c) Find the mean time to failure.
24. Consider a system composed of two devices with respective lifetimes T1
and T2 . Let T denote the lifetime of the composite system. Suppose that
the system operates properly if and only if both devices are functioning.
In other words, T > t if and only if T1 > t and T2 > t. Express the
reliability of the overall system R(t) in terms of R1(t) and R2(t), where
R1(t) and R2(t) are the reliabilities of the individual devices. Assume T1
and T2 are independent.
July 18, 2002
150
Chap. 4 Analyzing Systems with Random Inputs
25. Consider a system composed of two devices with respective lifetimes T1
and T2 . Let T denote the lifetime of the composite system. Suppose that
the system operates properly if and only if at least one of the devices
is functioning. In other words, T > t if and only if T1 > t or T2 > t.
Express the reliability of the overall system R(t) in terms of R1(t) and
R2(t), where R1(t) and R2 (t) are the reliabilities of the individual devices.
Assume T1 and T2 are independent.
26. Let Y be a nonnegative random variable. Show that
E[Y n ]
=
1
Z
0
nyn,1}(Y > y) dy:
Hint: Put T = Y n in (4.2).
Problems x4.3: Discrete Random Variables
27. Let X binomial(n; p) with n = 4 and p = 3=4. Sketch the graph of the
cumulative distribution function of X, FX (x).
Problems x4.4: Mixed Random Variables
28. A random variable X has generalized density
f(t) = 13 e,t u(t) + 21 (t) + 16 (t , 1);
where u is the unit step function dened in Section 2.1, and is the Dirac
delta function dened in Section 4.4.
(a) Sketch f(t).
(b) Compute }(X = 0) and }(X = 1).
(c) Compute }(0 < X < 1) and }(X > 1).
(d) Use your above results to compute }(0 X 1) and }(X 1).
(e) Compute E[X].
29. If X has generalized density f(t) = 12 [(t) + I(0;1](t)], evaluate E[X] and
}(X = 0jX 1=2).
30. Let X have cdf
8
0; x < 0;
>
>
< 2
x
x < 1=2;
FX (x) = > x;; 01=2
x < 1;
>
:
1; x 1:
Show that E[X] = 7=12.
July 18, 2002
4.8 Problems
151
31. Sketch the graph of the cumulative distribution function of the random
variable in Example 4.9. Also sketch corresponding impulsive density
function.
? 32. A certain computer monitor contains a loose connection. The connection
is loose with probability 1=2. When the connection is loose, the monitor
displays a blank screen (brightness=0). When the connection is not loose,
the brightness is uniformly distributed on (0; 1]. Let X denote the observed brightness. Find formulas and plot the cdf and generalized density
of X.
Problems x4.5: Functions of Random Variables and Their Cdfs
33. Let uniform[,; ], and put X := cos and Y := sin .
(a) Show that FX (x) = 1 , 1 cos,1 x for ,1 x 1.
(b) Show that FY (y) = 21 + 1 sin,1 y for ,1 y 1.
(c) Use the identity sin(=2 , ) = cos to show that FX (x) = 21 +
1 ,1
sin x. Thus, X and Y have the same cdf and are called arcsine random
variables. The corresponding density is fX (x) =
p
(1=)= 1 , x2 for ,1 < x < 1.
34. Find the cdf and density of Y = X(X + 2) if X is uniformly distributed
on [,3; 1].
35. Let g be as in Example 4.10 and 4.11. Find the cdf and density of Y =
g(X) if
(a) X uniform[,1; 1];
(b) X uniform[,1; 2];
(c) X uniform[,2; 3];
(d) X exp().
36. Let
8
0; jxj < 1;
<
g(x) := : jxj , 1; 1 jxj 2;
1;
jxj > 2:
Find the cdf and density of Y = g(X) if
(a) X uniform[,1; 1];
(b) X uniform[,2; 2];
(c) X uniform[,3; 3];
(d) X Laplace().
July 18, 2002
152
Chap. 4 Analyzing Systems with Random Inputs
37. Let
8
>
>
<
g(x) := >
>
:
38.
39.
40.
41.
? 42.
,x , 2; x < ,1;
,x2; ,1 x < 0;
x3; 0 x < 1;
1;
x 1:
Find the cdf and density of Y = g(X) if
(a) X uniform[,3; 2];
(b) X uniform[,3; 1];
(c) X uniform[,1; 1].
Consider the function g given by
8
< x2 , 1; x < 0;
g(x) = : x , 1; 0 x < 2;
1; x 2:
If X is uniform[,3;3], nd the cdf and density of Y = g(X).
Let X be a uniformly distributed random variable on the interval [,3; 1].
Let Y = g(X), where
8
0; x < ,2;
>
>
<
x
+
2 x < ,1;
g(x) = > x2;2; ,
,
1 x < 0;
>
: p
x; x 0:
Find the cdf and density of Y .
Let X uniform[,6; 0], and suppose that Y = g(X), where
8
jxj , 1; 1 x < 2;
>
<
p
g(x) = > 1 , jxj , 2; jxj 2;
:
0;
otherwise:
Find the cdf and density of Y .
Let X uniform[,2; 1], and suppose that Y = g(X), where
8
x + 2; ,2 x < ,1;
>
>
>
<
2
g(x) = > 1 2x
+ x2 ; ,1 x < 0;
>
>
:
0;
otherwise:
Find the cdf and density of Y .
For x 0, let g(x) denote the fractional part of x. For example, g(5:649) =
0:649, and g(0:123) = 0:123. Find the cdf and density of Y = g(X) if
July 18, 2002
4.8 Problems
153
(a) X exp(1);
(b) X uniform[0; 1);
(c) X uniform[v; v + 1), where v = m + for some integer m 0 and
some 0 < < 1.
Problems x4.6: Properties of Cdfs
? 43.
From your solution of Problem 3(b) in Chapter 3, you can see that if
X exp(), then }(X > t + tjX > t) = }(X > t). Now prove
the converse; i.e., show that if Y is a nonnegative random variable such
that }(Y > t + tjY > t) = }(Y > t), then Y exp(), where
= , ln[1 , FY (1)], assuming that }(Y > t) > 0 for all t 0. Hints:
Put h(t) := ln }(Y > t), which is a right-continuous function of t (Why?).
Show that h(t + t) = h(t) + h(t) for all t; t 0.
Problems x4.7: The Central Limit Theorem
44. Packet transmission times on a certain Internet link are i.i.d. with mean
m and variance 2 . Suppose n packets are transmitted. Then the total
expected transmission time for n packets is nm. Use the central limit
theorem to approximate the probability that the total transmission time
for the n packets exceeds twice the expected transmission time.
45. To combat noise in a digital communication channel with bit-error probability p, the use of an error-correcting code is proposed. Suppose that the
code allows correct decoding of a received binary codeword if the fraction
of bits in error is less than or equal to t. Use the central limit theorem
to approximate the probability that a received word cannot be reliably
decoded.
46. Let Xi = 1 with equal probability. Then the Xi are zero mean and have
unit variance. Put
n X
X
pni :
Yn =
i=1
Derive
the central limit theorem for this case; i.e., show that 'Yn () !
2 =2
,
e
. Hint: Use the Taylor series approximation cos() 1 , 2 =2.
July 18, 2002
154
Chap. 4 Analyzing Systems with Random Inputs
July 18, 2002
CHAPTER 5
Multiple Random Variables
The main focus of this chapter is the study of multiple continuous random
variables that are not independent. In particular, conditional probability and
conditional expectation along with corresponding laws of total probability and
substitution are studied. These tools are used to compute probabilities involving
the output of systems with multiple random inputs.
In Section 5.1 we introduce the concept of joint cumulative distribution
function for a pair of random variables. Marginal cdfs are also dened. In
Section 5.2 we introduce pairs of jointly continuous random variables, joint
densities, and marginal densities. Conditional densities, independence and expectation are also discussed in the context of jointly continuous random variables. In Section 5.3 conditional probability and expectation are dened for
jointly continuous random variables. In Section 5.4 the bivariate normal density is introduced, and some of its properties are illustrated via examples. In
Section 5.5 most of the bivariate concepts are extended to the case of n random
variables.
5.1. Joint and Marginal Probabilities
Suppose we have a pair of random variables, say X and Y , for which we
are able to compute }((X; Y ) 2 A) for arbitrary1 sets A IR2 . For example,
suppose X and Y are the horizontal and vertical coordinates of a dart striking
a target. We might be interested in the probability that the dart lands within
two units of the center. Then we would take A to be the disk of radius two
centered at the origin, i.e., A = f(x; y) : x2 + y2 4g. Now, even though we
can write down a formula for }((X; Y ) 2 A), we may be interested in nding
potentially simpler expressions for things like }(X 2 B; Y 2 C), }(X 2 B),
and }(Y 2 C), etc. To this end, the notion of a product set is helpful.
Product Sets and Marginal Probabilities
The Cartesian product of two univariate sets B and C is dened by
B C := f(x; y) : x 2 B and y 2 C g:
In other words,
(x; y) 2 B C , x 2 B and y 2 C:
For example, if B = [1; 3] and C = [0:5; 3:5], then B C is the rectangle
[1; 3] [0:5; 3:5] = f(x; y) : 1 x 3 and 0:5 y 3:5g;
which is illustrated in Figure 5.1(a). In general, if B and C are intervals, then
B C is a rectangle or square. If one of the sets is an interval and the other
155
156
Chap. 5 Multiple Random Variables
4
3
2
1
0
4
3
2
1
0
0 1 2 3 4
(a)
0 1 2 3 4
(b)
,
Figure 5.1. The Cartesian products (a) [1; 3] [0:5; 3:5] and (b) [1; 2] [ [3; 4] [1; 4].
is a singleton, then the product set degenerates to a line segment in the plane.
A more complicated
example
is shown in Figure 5.1(b), which illustrates the
,
product [1; 2] [ [3; 4] [1; 4]. Figure 5.1(b) also illustrates the general result
that distributes over [; i.e., (B1 [ B2 ) C = (B1 C) [ (B2 C).
Using the notion of product set,
fX 2 B; Y 2 C g = f! 2 : X(!) 2 B and Y (!) 2 C g
= f! 2 : (X(!); Y (!)) 2 B C g;
for which we use the shorthand
f(X; Y ) 2 B C g:
We can therefore write
}(X 2 B; Y 2 C) = }((X; Y ) 2 B C):
The preceding expression allows us to obtain the marginal probability
}(X 2 B) as follows. First, for any event F, we have F , and therefore,
F = F \ . Second, Y is assumed to be a real-valued random variable, i.e.,
Y (!) 2 IR for all !. Thus, fY 2 IRg = . Now write
}(X 2 B) = }(fX 2 B g \ )
= }(fX 2 B g \ fY 2 IRg)
= }(X 2 B; Y 2 IR)
= }((X; Y ) 2 B IR):
Similarly,
}(Y 2 C) = }((X; Y ) 2 IR C):
If X and Y are independent random variables, then
}((X; Y ) 2 B C) = }(X 2 B; Y 2 C)
= }(X 2 B) }(Y 2 C):
July 18, 2002
5.1 Joint and Marginal Probabilities
157
Joint and Marginal Cumulative Distributions
The joint cumulative distribution function of X and Y is dened by
FXY (x; y) := }(X x; Y y)
= }((X; Y ) 2 (,1; x] (,1; y]):
In Chapter 4, we saw that }(X 2 B) could be computed if we knew the
cumulative distribution function FX (x) for all x. Similarly, it turns out that
we can compute }((X; Y ) 2 A) if we know FXY (x; y) for all x; y. For example,
it is shown in Problems 1 and 2 that
}((X; Y ) 2 (a; b] (c; d]) = }(a < X b; c < Y d)
is given by
FXY (b; d) , FXY (a; d) , FXY (b; c) + FXY (a; c):
In other words, the probability that (X; Y ) lies in a rectangle can be computed
in terms of the cdf values at the corners.
If X and Y are independent random variables, we have the simplications
FXY (x; y) = }(X x; Y y)
= }(X x) }(Y y)
= FX (x)FY (y);
and
}(a < X b; c < Y d) = }(a < X b) }(c < Y d);
which is equal to [FX (b) , FX (a)][FY (d) , FY (c)].
We return now to the general case (X and Y not necessarily independent).
It is possible to obtain the marginal cumulative distributions FX and FY
directly from FXY by setting the unwanted variable to 1. More precisely, it
can be shown that2
FX (x) = ylim
(5.1)
!1 FXY (x; y) =: FXY (x; 1);
and
FY (y) = xlim
!1 FXY (x; y) =: FXY (1; y):
Example 5.1. If
FXY (x; y) =
8
<
:
y + e,x(y+1) , e,x ; x; y > 0;
y+1
0;
otherwise;
nd both of the marginal cumulative distribution functions, FX (x) and FY (y).
July 18, 2002
158
Chap. 5 Multiple Random Variables
Solution. For x; y > 0,
FXY (x; y) = y +y 1 + y +1 1 e,x(y+1) , e,x :
Hence, for x > 0,
lim F (x; y)
y!1 XY
= 1 + 0 0 , e,x = 1 , e,x :
For x 0, FXY (x; y) = 0 for all y. So, for x 0, ylim
!1 FXY (x; y) = 0. The
complete formula for the marginal cdf of X is
,x
0;
FX (x) = 1 ,0;e ; xx >
(5.2)
0:
Next, for y > 0,
lim F (x; y) = y +y 1 + y +1 1 0 , 0 = y +y 1 :
x!1 XY
We then see that the marginal cdf of Y is
0;
FY (y) = y=(y0;+ 1); yy >
(5.3)
0:
Note that since FXY (x; y) 6= FX (x) FY (y) for all x; y, X and Y are not independent,
Example 5.2. Express }(X x or Y y) using cdfs.
Solution. The desired probability is that of the union, }(A [ B), where
A := fX xg and B := fY yg. Applying the inclusion-exclusion formula
(1.1) yields
}(A [ B) = }(A) + }(B) , }(A \ B)
= }(X x) + }(Y y) , }(X x; Y y)
= FX (x) + FY (y) , FXY (x; y):
5.2. Jointly Continuous Random Variables
In analogy with the univariate case, we say that two random variables X
and Y are jointly continuous with joint density fXY (x; y) if
}((X; Y ) 2 A) =
ZZ
A
fXY (x; y) dx dy
July 18, 2002
5.2 Jointly Continuous Random Variables
=
=
=
1
Z
1
Z
159
IA (x; y)fXY (x; y) dx dy
,1 ,1
1 Z1
Z
,1 Z,1
1
1
Z
,1
,1
IA (x; y)fXY (x; y) dy dx
IA (x; y)fXY (x; y) dx dy
for some nonnegative function fXY that integrates to one; i.e.,
Z
1Z 1
,1 ,1
fXY (x; y) dx dy = 1:
Example 5.3. Show that
2
2
fXY (x; y) = 21 e,(2x ,2xy+y )=2
is a valid joint probability density.
Solution. Since fXY (x; y) is nonnegative, all we have to do is show that
it integrates to one. By completing the square in the exponent, we obtain
,(y,x)2 =2 e,x2 =2
fXY (x; y) = e p
p :
2
2
This factorization allows us to write the double integral
Z
1Z 1
fXY (x; y) dx dy =
,1 ,1
as the iterated integral
Z
1 Z 1 e,(2x2 ,2xy+y2 )=2
dx dy
2
,1 ,1
Z
1 e,x2 =2 Z 1 e,(y,x)2 =2 p
p
dy dx:
2
,1 2
,1
The inner integral, as a function of y, is a normal density with mean x and
variance one. Hence, the inner integral is one. But this leaves only the outer
integral, whose integrand is an N(0; 1) density, which also integrates to one.
Marginal Densities
If A is a product set, say A = B C, then IA (x; y) = IB (x) IC (y), and so
}(X 2 B; Y 2 C) = }((X; Y ) 2 B C)
Z Z
=
fXY (x; y) dy dx
(5.4)
=
B C
Z Z
C
B
July 18, 2002
fXY (x; y) dx dy:
160
Chap. 5 Multiple Random Variables
At this point we would like to substitute B = (,1; x] and C = (,1; y] in order
to obtain expressions for FXY (x; y). However, the preceding integrals already
use x and y for the variables of integration. To avoid confusion, we must rst
replace the variables of integration. We change x to t and y to . We then nd
that
Z x Z y
FXY (x; y) =
fXY (t; ) d dt;
,1
or, equivalently,
FXY (x; y) =
Z
,1
Z
y
,1
x
,1
fXY (t; ) dt d:
It then follows that
@ 2 F (x; y) = f (x; y) and @ 2 F (x; y) = f (x; y):
XY
XY
@y@x XY
@x@y XY
Example 5.4. Let
8
< y + e,x(y+1)
,x
FXY (x; y) = :
y + 1 , e ; x; y > 0;
0;
otherwise;
as in Example 5.1. Find the joint density fXY .
Solution. For x; y > 0,
@ F (x; y) = e,x , e,x(y+1) ;
@x XY
and
@ 2 F (x; y) = xe,x(y+1) :
@y@x XY
Thus,
fXY (x; y) =
xe,x(y+1) ; x; y > 0;
0;
otherwise:
We now show that if X and Y are jointly continuous, then X and Y are
individually continuous with marginal densities obtained as follows. Taking
C = IR in (5.4), we obtain
}(X 2 B) = }((X; Y ) 2 B IR) =
Z Z
which implies the marginal density of X is
fX (x) =
Z
1
,1
B
1
,1
fXY (x; y) dy:
July 18, 2002
fXY (x; y) dy dx;
(5.5)
5.2 Jointly Continuous Random Variables
Similarly,
}(Y 2 C) = }((X; Y ) 2 IR C) =
and
fY (y) =
Z
1
,1
161
1
Z Z
,1
C
fXY (x; y) dx dy;
fXY (x; y) dx:
Thus, to obtain the marginal densities, integrate out the unwanted variable.
Example 5.5. Using the joint density fXY obtained in Example 5.4, nd
the marginal densities fX and fY by integrating out the unneeded variable. To
check your answer, also compute the marginal densities by dierentiating the
marginal cdfs obtained in Example 5.1.
Solution. We rst compute fX (x). To begin, observe that for x 0,
fXY (x; y) = 0. Hence, for x 0, the integral in (5.5) is zero. Now suppose
x > 0. Since fXY (x; y) = 0 whenever y 0, the lower limit of integration in
(5.5) can be changed to zero. For x > 0, it remains to compute
Z
0
1
fXY (x; y) dy = xe,x
= e,x :
Hence,
1
Z
0
e,xy dy
,x
> 0;
fX (x) = e 0; ; xx 0;
and we see that X is exponentially distributed with parameter = 1. Note
that the same answer can be obtained by dierentiating the formula for FX (x)
in (5.2).
We now turn to the calculation
of fY (y). Arguing as above, we have fY (y) =
R
0 for y 0, and fY (y) = 01 fXY (x; y) dx for y > 0. Write this integral as
Z 1
Z 1
1
x (y + 1)e,(y+1)x dx:
(5.6)
fXY (x; y) dx = y + 1
0
0
If we put = y + 1, then the integral on the right has the form
1
Z
0
x e,x dx;
which is the mean of an exponential random variable with parameter . This
integral is equal to 1= = 1=(y + 1), and so the right-hand side of (5.6) is equal
to 1=(y + 1)2. We conclude that
2
> 0;
fY (y) = 1=(y 0;+ 1) ; yy 0:
Note that the same answer can be obtained by dierentiating the formula for
FY (y) in (5.3).
July 18, 2002
162
Chap. 5 Multiple Random Variables
Specifying Joint Densities
Suppose X and Y are jointly continuous with joint density fXY . Dene
(x; y) :
(5.7)
fY jX (yjx) := fXY
fX (x)
For reasons that will become clear later, we call fY jX the conditional density
of Y given X. For the moment, simply note that as a function of y, fY jX (yjx)
is nonnegative and integrates to one, since
Z
1
fXY (x; y) dy
fY jX (yjx) dy = ,1 f (x)
= ffX (x)
= 1:
X
X (x)
,1
Now rewrite (5.7) as
fXY (x; y) = fY jX (yjx) fX (x):
This suggests an easy way to specify a joint density. First, pick any onedimensional density, say fX (x). Then fX (x) is nonnegative and integrates to
one. Next, for each x, let fY jX (yjx) be any density in the variable y. For
dierent values of x, fY jX (yjx) could be very dierent densities as a function
of y; the only important thing is that for each x, fY jX (yjx) as a function of
y should be nonnegative and integrate to one with respect to y. Now if we
dene fXY (x; y) := fY jX (yjx) fX (x), we will have a nonnegative function of
(x; y) whose double integral is one. In practice, and in the problems at the end
of the chapter, this is how we usually specify joint densities.
Example 5.6. Let X N(0; 1), and suppose that the conditional density
of Y given X = x is N(x; 1). Find the joint density fXY (x; y).
Solution. First note that
,(y,x)2 =2
,x2 =2
fX (x) = ep
and fY jX (yjx) = e p
:
2
2
Then write
,(y,x)2 =2 e,x2 =2
fXY (x; y) = fY jX (yjx) fX (x) = e p
p2 ;
2
which simplies to
2
2
fXY (x; y) = exp[,(2x ,22xy + y )=2] :
Z
1
By interchanging the roles of X and Y , we have
(x; y)
fX jY (xjy) := fXY
fY (y) ;
and we can dene fXY (x; y) = fX jY (xjy) fY (y) if we specify a density fY (y)
and a function fX jY (xjy) that is a density in x for each xed y.
July 18, 2002
5.2 Jointly Continuous Random Variables
163
Independence
We now consider the joint density of jointly continuous independent random variables. As noted in Section 5.1, if X and Y are independent, then
FXY (x; y) = FX (x) FY (y) for all x and y. If X and Y are also jointly continuous, then by taking second-order mixed partial derivatives, we nd
@ 2 F (x) F (y) = f (x) f (y):
Y
X
Y
@y@x X
In other words, if X and Y are jointly continuous and independent, then the
joint density is the product of the marginal densities. Using (5.4), it is easy to
see that the converse is also true. If fXY (x; y) = fX (x) fY (y), (5.4) implies
Z Z
}(X 2 B; Y 2 C) =
B
Z Z
=
B
Z
=
C
C
fXY (x; y) dy dx
fX (x)
BZ
fX (x) fY (y) dy dx
Z
fY (y) dy dx
C
=
fX (x) dx }(Y 2 C)
B
= }(X 2 B) }(Y 2 C):
Expectation
If X and Y are jointly continuous with joint density fXY , then the methods
of Section 3.2 can easily be used to show that
Z
E[g(X; Y )] =
1
Z
1
,1 ,1
g(x; y)fXY (x; y) dx dy:
For arbitrary random variables X and Y , their bivariate characteristic function is dened by
'XY (1; 2) := E[ej (1X +2 Y ) ]:
If X and Y have joint density fXY , then
'XY (1 ; 2) =
1
Z
Z
1
,1 ,1
fXY (x; y)ej (1 x+2 y) dx dy;
which is simply the bivariate Fourier transform of fXY . By the inversion
formula,
Z 1Z 1
1
fXY (x; y) = (2)2
'XY (1; 2)e,j (1 x+2 y) d1 d2:
,1 ,1
Now suppose that X and Y are independent. Then
'XY (1; 2) = E[ej (1X +2 Y ) ] = E[ej1X ] E[ej2Y ] = 'X (1) 'Y (2):
July 18, 2002
164
Chap. 5 Multiple Random Variables
In other words, if X and Y are independent, then their joint characteristic function factors. The converse is also true; i.e., if the joint characteristic function
factors, then X and Y are independent. The general proof is complicated, but
if X and Y are jointly continuous, it is easy using the Fourier inversion formula.
? Continuous
Random Variables That Are not Jointly Continuous
Let uniform[,; ], and put X := cos and Y := sin. As shown
in Problem 33 in Chapter
p 4, X and Y are both arcsine random variables, each
having density (1=)= 1 , x2 for ,1 < x < 1.
Next, since X 2 +Y 2 = 1, the pair (X; Y ) takes values only on the unit circle
C := f(x; y) : x2 + y2 = 1g:
,
Thus, } (X; Y ) 2 C = 1. On the other hand, if X and Y have a joint density
fXY , then
ZZ
},(X; Y ) 2 C =
fXY (x; y) dx dy = 0
C
because a double integral over a set of zero area must be zero. So, if X and Y
had a joint density, this would imply that 1 = 0. Since this is not true, there
can be no joint density.
5.3. Conditional Probability and Expectation
If X is a continuous random variable, then its cdf FX (x) := }(X x) =
,1 fX (t) dt is a continuous function of x. It follows from the properties of
cdfs in Section 4.6 that }(X = x) = 0 for all x. Hence, we cannot dene
}(Y 2 C jX = x) by }(X = x; Y 2 C)=}(X = x) since this requires division by
zero. Similar problems arise with conditional expectation. How should we dene
conditional probability and expectation in this case? Recall from Section 2.5
that if X and Y are both discrete random variables, then the following law of
total probability holds,
Rx
X
E[g(X; Y )] =
E[g(X; Y )jX = xi] pX (xi ):
i
If X and Y are jointly continuous, however we dene E[g(X; Y )jX = x], we
certainly want the analogous formula
E[g(X; Y )] =
Z
1
E[g(X; Y )jX = x]fX (x) dx
,1
to hold. To discover how E[g(X; Y )jX = x] should be dened, write
E[g(X; Y )] =
=
1Z 1
Z
g(x; y)fXY (x; y) dx dy
,1 ,1
1 Z1
Z
,1
,1
g(x; y)fXY (x; y) dy dx
July 18, 2002
(5.8)
5.3 Conditional Probability and Expectation
=
=
Z
1Z
1
,1 Z,1
1
1
Z
,1
,1
It follows that if
(x; y)
g(x; y) fXY
f (x) dy fX (x) dx
X
g(x; y)fY jX (yjx) dy fX (x) dx:
Z
E[g(X; Y )jX = x] :=
165
1
,1
g(x; y) fY jX (yjx) dy;
(5.9)
then (5.8) holds.
Now observe that if g in (5.9) is a function of y alone, then
E[g(Y )jX = x] =
1
Z
,1
g(y) fY jX (yjx) dy:
(5.10)
Returning now to the case g(x; y), we see that (5.10) implies that for any x~,
E[g(~x; Y )jX = x] =
Z
1
,1
g(~x; y) fY jX (yjx) dy:
Taking x~ = x, we obtain
E[g(x; Y )jX = x] =
Z
1
,1
g(x; y) fY jX (yjx) dy:
Comparing this with (5.9) shows that
E[g(X; Y )jX = x] = E[g(x; Y )jX = x]:
In other words, the substitution law holds. Another important point to note is
that if X and Y are independent, then
fX (x) fY (y)
(x; y)
fY jX (yjx) = fXY
fX (x) = fX (x) = fY (y):
In this case, (5.10) becomes E[g(Y )jX = x] = E[g(Y )]. In other words, we can
\drop the conditioning."
Example 5.7. Let X exp(1), and suppose that given X = x, Y is
conditionally normal with fY jX (jx) N(0; x2). Evaluate E[Y 2 ] and E[Y 2 X 3 ].
Solution. We use the law of total probability for expectation. We begin
with
E[Y 2 ]
=
Z
1
,1
E[Y 2 jX = x]fX (x) dx:
July 18, 2002
166
Chap. 5 Multiple Random Variables
p
Since fY jX (yjx) = e,(y=x)2 =2=( 2 x), we see that
Z
,(y=x)2 =2
1
y2 e p
dy
2 x
,1
is the second moment of a zero-mean Gaussian density with variance x2. Hence,
E[Y 2jX = x] = x2 , and
E[Y 2jX = x] =
E[Y 2] =
1
Z
,1
x2fX (x) dx = E[X 2 ]:
Since X exp(1), E[X 2 ] = 2 by Example 3.9.
To compute E[Y 2 X 3 ], we proceed similarly. Write
E[Y 2X 3 ] =
=
=
=
1
Z
,1
Z 1
,1
Z 1
,1
Z 1
,1
E[Y 2 X 3 jX = x]fX (x) dx
E[Y 2 x3jX = x]fX (x) dx
x3E[Y 2 jX = x]fX (x) dx
x3x2fX (x) dx
= E[X 5 ]
= 5!; by Example 3.9:
For conditional probability, if we put
}(Y 2 C jX = x) :=
Z
C
fY jX (yjx) dy;
then it is easy to show that the law of total probability
}(Y 2 C) =
1
Z
,1
}(Y 2 C jX = x) fX (x) dx
holds. It can also be shown3 that the substitution law holds for conditional
probabilities involving X and Y conditioned on X. Also, if X and Y are
independent, then }(Y 2 C jX = x) = }(Y 2 C); i.e., we can \drop the
conditioning."
Example 5.8 (Signal in Additive Noise). A random, continuous-valued signal X is transmitted over a channel subject to additive, continuous-valued noise
Y . The received signal is Z = X + Y . Find the cdf and density of Z if X and
Y are jointly continuous random variables with joint density fXY .
July 18, 2002
5.3 Conditional Probability and Expectation
167
Solution. Since we are not assuming that X and Y are independent, the
characteristic-function method of Example 3.14 does not work here. Instead,
we use the laws of total probability and substitution. Write
FZ (z) = }Z (Z z)
1
}(Z z jY = y) fY (y) dy
=
=
=
=
=
Z
,1
1
,1
Z 1
,1
Z 1
,1
Z 1
,1
}(X + Y z jY = y) fY (y) dy
}(X + y z jY = y) fY (y) dy
}(X z , yjY = y) fY (y) dy
FX jY (z , yjy) fY (y) dy:
By dierentiating with respect to z,
fZ (z) =
1
Z
,1
fX jY (z , yjy) fY (y) dy:
In particular, note that if X and Y are independent, we can drop the conditioning and recover the convolution result stated following Example 3.14,
fZ (z) =
1
Z
,1
fX (z , y) fY (y) dy:
Example 5.9 (Signal in Multiplicative Noise). A random, continuous-valued signal X is transmitted over a channel subject to multiplicative, continuousvalued noise Y . The received signal is Z = XY . Find the cdf and density of Z
if X and Y are jointly continuous random variables with joint density fXY .
Solution. We proceed as in the previous example. Write
FZ (z) = }Z (Z z)
1
}(Z z jY = y) fY (y) dy
=
=
=
Z
,1
1
,1
Z 1
,1
}(XY z jY = y) fY (y) dy
}(Xy z jY = y) fY (y) dy:
At this point we have a problem when we attempt to divide through by y. If y
is negative, we have to reverse the inequality sign. Otherwise, we do not have
July 18, 2002
168
Chap. 5 Multiple Random Variables
to reverse the inequality. The solution to this diculty is to break up the range
of integration. Write
FZ (z) =
Z 0
}(Xy z jY = y) fY (y) dy
,1
Z
+
1
0
}(Xy z jY = y) fY (y) dy:
Now we can divide by y separately in each integral. Thus,
FZ (z) =
Z 0
,1
Z
+
=
}(X z=yjY = y) fY (y) dy
1
0
Z 0
,1
Z
}(X z=yjY = y) fY (y) dy
, 1 , FX jY zy y fY (y) dy
1
, FX jY yz y fY (y) dy:
0
Dierentiating with respect to z yields
+
1
, , fX jY zy y y1 fY (y) dy +
fX jY yz y 1y fY (y) dy:
,1
0
Now observe that in the rst integral, the range of integration implies that y
is always negative. For such y, ,y = jyj. In the second integral, y is always
positive, and so y = jyj. Thus,
fZ (z) = ,
fZ (z) =
=
Z 0
Z
Z 0
, fX jY yz y j1yj fY (y) dy +
,1
1
Z
,1
, fX jY z y 1
y
Z
0
1
, fX jY zy y jy1j fY (y) dy
jyj fY (y) dy:
5.4. The Bivariate Normal
The bivariate Gaussian or bivariate normal density is a generalization
of the univariate N(m; 2) density.
p Recall that the standard N(0; 1) density
is given by '(x) := exp(,x2 =2)= 2. The general N(m; 2) density can be
written in terms of ' as
2
1
1
x
,
m
1
x
,
m
p exp , 2 = ' :
2 In order to dene the general bivariate Gaussian density, it is convenient to
dene a standard bivariate density rst. So, for jj < 1, put
exp 2(1,,12 ) [u2 , 2uv + v2 ]
p
' (u; v) :=
:
(5.11)
2 1 , 2
July 18, 2002
5.4 The Bivariate Normal
169
0.15
0.1
0.05
3
0
2
−3
1
−2
0
−1
0
−1
1
−2
2
3
−3
v−axis
u−axis
Figure 5.2. The Gaussian surface ' (u; v) of (5.11) with = 0.
For xed , this function of the two variables u and v denes a surface. The
surface corresponding to = 0 is shown in Figure 5.2. From the gure and from
the formula (5.11), we see that '0 is circularly symmetric; i.e., for2 all (u; v) on a
circle of radius r, in other words, for u2 +v2 = r2, '0 (u; v) = e,r =2 =2 does not
depend on the particular values of u and v, but only on the radius of the circle
on which they lie. We also point out that for = 0, the formula (5.11) factors
into the product of two univariate N(0; 1) densities, i.e., '0 (u; v) = '(u) '(v).
For 6= 0, ' does not factor. In other words, if U and V have joint density ' ,
then U and V are independent if and only if = 0. A plot of ' for = ,0:85
is shown in Figure 5.3. It turns out that now ' is constant on ellipses instead
of circles. The axes of the ellipses are not parallel to the coordinate axes.
Notice how the density is concentrated along the line v = ,u. As ! ,1, this
concentration becomes more extreme. As ! +1, the density concentrates
around the line v = u. We now show that the density ' integrates to one. To
do this, rst observe that for all jj < 1,
u2 , 2uv + v2 = u2 (1 , 2 ) + (v , u)2 :
It follows that
,u2 =2 exp 2(1,,12) [v , u]2
e
' (u; v) = p p p 2
2
2 1 , = '(u) p 1 2 ' pv , u2 :
(5.12)
1,
1,
Observe that the right-hand factor as a function of v has the form of a univariate
July 18, 2002
170
Chap. 5 Multiple Random Variables
0.3
0.25
0.2
0.15
0.1
3
2
0.05
1
0
0
−1
−3
−2
−1
0
−2
1
2
3
−3
v−axis
u−axis
Figure 5.3. The Gaussian surface ' (u; v) of (5.11) with = ,0:85.
2
normal densityR 1with
R 1mean u and variance 1 , . With ' factored as in (5.12),
we can write ,1 ,1 ' (u; v) du dv as the iterated integral
Z
1
Z
1
p
'(u)
' pv , u2 dv du:
2
1,
,1
,1 1 , 1
As noted above, the inner integrand, as a function of v, is simply an N(u; 1,2)
density, and
R 1 therefore integrates to one. Hence, the above iterated integral
becomes ,1
'(u) du = 1.
We can now easily dene the general bivariate Gaussian density with parameters mX , mY , X2 , Y2 , and by
fXY (x; y) := 1 ' x , mX ; y , mY :
X Y
X
Y
More explicitly, this density is
exp
y,mY 2
x,mX y,mY
,1 x,mX 2
2(1,2 ) [( X ) , 2( X )( Y ) + ( Y ) ]
p
2X Y 1 , 2
:
(5.13)
It can be shown that the marginals are fX N(mX ; X2 ) and fY N(mY ; Y2 )
(see Problems 25 and 28). The parameter is called the correlation coecient. From (5.13), we observe that X and Y are independent if and only if
= 0. A plot of fXY with mX = mY = 0, X = 1:5, Y = 0:6, and = 0
is shown in Figure 5.4. Notice how the density is concentrated around the line
July 18, 2002
5.4 The Bivariate Normal
171
0.16
0.14
0.12
0.1
0.08
0.06
0.04
0.02
3
0
2
−3
1
−2
0
−1
0
−1
1
−2
2
3
−3
y−axis
x−axis
Figure 5.4. The bivariate normal density fXY (x; y) of (5.13) with mX = mY = 0, X = 1:5,
Y = 0:6, and = 0.
y = 0. Also, fXY is constant on ellipses of the form
x 2 + y 2 = r2:
X
Y
R1 R1
To show that ,1 ,1 fXY (x; y) dx dy = 1 as well, use formula (5.12) for
' and proceed as above, integrating with respect to y rst and then x. For
the inner integral, make the change of variable v = (y , mY )=Y , and in the
remaining outer integral make the change of variable u = (x , mX )=X .
Example 5.10. Let random variables U and V have the standard bivariate
normal density ' in (5.11). Show that E[UV ] = .
Solution. Using the factored form of ' in (5.12), write
E[UV ] =
Z
1
Z
1
uv ' (u; v) du dv
Z 1
v
v
,
u
p
=
u '(u)
' p
dv du:
1 , 2
,1
,1 1 , 2
The quantity in brackets has the form E[V^ ], where V^ is a univariate normal
random variable with mean u and variance 1 , 2 . Thus,
,1 ,1
Z 1
E[UV ] =
Z
1
u '(u)[u] du
,1
Z 1
= u2 '(u) du
,1
= ;
July 18, 2002
172
Chap. 5 Multiple Random Variables
since ' is the N(0; 1) density.
Example 5.11. Let U and V have the standard bivariate normal density
fUV (u; v) = ' (u; v) given in (5.11). Find the conditional densities fV jU and
fU jV .
Solution. It is shown in Problem 25 that fU and fV are both N(0; 1).
Hence,
' (u; v)
(u; v)
fV jU (vju) = fUV
f (u) = '(u) ;
U
where ' is the N(0; 1) density. If we now substitute the factored form of ' (u; v)
given in (5.12), we obtain
1
v
,
u
fV jU (vju) = p
' p
;
1 , 2
1 , 2
i.e., fV jU ( ju) N(u; 1 , 2 ). To compute fU jV we need the following alternative factorization of ' ,
u
,
v
1
' (u; v) = p
' p
'(v):
1 , 2
1 , 2
It then follows that
1
u
,
v
fU jV (ujv) = p
' p
;
1 , 2
1 , 2
i.e., fU jV ( jv) N(v; 1 , 2 ).
Example 5.12. If U and V have standard joint normal density '(u; v),
nd E[V jU = u].
Solution. Recall that from Example 5.11, fV jU (ju) N(u; 1 , 2).
Hence,
Z 1
v fV jU (vju) dv = u:
E[V jU = u] =
,1
5.5. ?Multivariate Random Variables
Let X := [X1; : : :; Xn ]0 (the prime denotes transpose) be a length-n column
vector of random variables. We call X a random vector. We say that the variables X1 ; : : :; Xn are jointly continuous with joint density fX = fX1 Xn if
for A IRn,
}(X 2 A) =
Z
A
fX (x) dx =
Z
July 18, 2002
IRn
IA (x) fX (x) dx;
5.5 ? Multivariate Random Variables
173
which is shorthand for
Z
1
1
Z
,1
,1
IA (x1; : : :; xn) fX1 Xn (x1; : : :; xn) dx1 dxn:
Usually, we need to compute probabilities of the form
}(X1 2 B1 ; : : :; Xn 2 Bn ) := }
n
\
fXk 2 Bk g ;
k=1
where B1 ; : : :; Bn are one-dimensional sets. The key to this calculation is to
observe that
n
\
n
o
fXk 2 Bk g = [X1 ; : : :; Xn ]0 2 B1 Bn :
k=1
(Recall that a vector [x1; : : :; xn]0 lies in the Cartesian product set B1 Bn
if and only if xi 2 Bi for i = 1; : : :; n.) It now follows that
}(X1 2 B1 ; : : :; Xn 2 Bn ) =
Z
B1
Z
Bn
fX1 Xn (x1 ; : : :; xn) dx1 dxn:
For example, to compute the joint cumulative distribution function, take Bk =
(,1; xk ] to get
FX (x) := FX1 Xn (x1; : : :; xn)
:= }Z (X1 Zx1; : : :; Xn xn)
x1
xn
=
fX1 Xn (t1 ; : : :; tn) dt1 dtn;
,1
,1
where we have changed the dummy variables of integration to t to avoid confusion with the upper limits of integration. Note also that
@ n FX @xn @x1 x1 ;:::;xn = fX1 Xn (x1; : : :; xn):
We can use these equations to characterize the joint cdf and density of independent random variables. First, suppose X1 ; : : :; Xn are independent. Then
FX (x) = }(X1 x1; : : :; Xn xn)
= }(X1 x1) }(Xn xn):
By taking partial derivatives, we learn that
fX1 Xn (x1 ; : : :; xn) = fX1 (x1) fXn (xn ):
Thus, if the Xi are independent, then the joint density factors. On the other
hand, if the joint density factors, then }(X1 2 B1 ; : : :; Xn 2 Bn ) has the form
Z
B1
Z
Bn
fX1 (x1 ) fXn (xn ) dx1 dxn;
July 18, 2002
174
Chap. 5 Multiple Random Variables
which factors into the product
Z
B1
fX1 (x1 ) dx1 Z
Bn
fXn (xn) dxn :
Thus, if the joint density factors,
}(X1 2 B1 ; : : :; Xn 2 Bn ) = }(X1 2 B1 ) }(Xn 2 Bn );
and we see that the Xi are independent.
Sometimes we need to compute probabilities involving only some of the Xi .
In this case, we can obtain the joint density of the ones we need by integrating
out the ones we do not need. For example,
fXi (xi ) =
and
Z
fX (x1; : : :; xn) dx1 dxi,1dxi+1 dxn;
IRn,1
fX1 X2 (x1 ; x2) =
Z
1
Z
,1
1
fX1 Xn (x1 ; : : :; xn) dx3 dxn:
,1
We can use these results to compute conditional densities such as
fX3 Xn jX1 X2 (x3 ; : : :; xnjx1; x2) = fX1fXn (x(x1; :; :x:;)xn) :
X1 X2 1 2
Example 5.13. Let
3z 2 e,zy exph, 1 x,y 2 i;
fXY Z (x; y; z) = p
2 z
7 2
for y 0 and 1 z 2, and fXY Z (x; y; z) = 0 otherwise. Find fY Z (y; z) and
fX jY Z (xjy; z). Then nd fZ (z), fY jZ (yjz), and fXY jZ (x; yjz).
Solution. Observe that the joint density can be written as
h
2i
exp , 12 x,z y
p
fXY Z (x; y; z) =
ze,zy 37 z 2 :
2 z
The rst factor as a function of x is an N(y; z 2 ) density. Hence,
fY Z (y; z) =
and
Z
1
,1
fXY Z (x; y; z) dx = ze,zy 37 z 2 ;
h
exp , 21 x,z y
f
(x;
y;
z)
XY
Z
p
fX jY Z (xjy; z) = f (y; z) =
2 z
YZ
July 18, 2002
2 i
:
5.5 ? Multivariate Random Variables
175
Thus, fX jY Z (jy; z) N(y; z 2 ). Next, in the above formula for fY Z (y; z), observe that ze,zy as a function of y is an exponential density with parameter z.
Thus,
Z 1
fZ (z) =
fY Z (y; z) dy = 37 z 2 ; 1 z 2:
0
It follows that fY jZ (yjz) = fY Z (y; z)=fZ (z) = ze,zy ; i.e., fY jZ (jz) exp(z).
Finally,
h
1 x,y
y; z) = exp ,p2 z
fXY jZ (x; yjz) = fXYfZ (x;
2 z
Z (z)
2i
ze,zy :
The preceding example shows that
fXY Z (x; y; z) = fX jY Z (xjy; z) fY jZ (yjz) fZ (z):
More generally,
fX1 Xn (x1; : : :; xn) = fX1 (x1 ) fX2 jX1 (x2jx1) fX3 jX1 X2 (x3jx1; x2) fXn jX1 Xn,1 (xnjx1; : : :; xn,1):
Contemplating this equation for a moment reveals that
fX1 Xn (x1; : : :; xn) = fX1 Xn,1 (x1; : : :; xn,1) fXn jX1 Xn,1 (xn jx1; : : :; xn,1);
which is an important recursion that is discussed later.
If g(x) = g(x1 ; : : :; xn) is a real-valued function of the vector x = [x1; : : :; xn]0,
then we can compute
E[g(X)] =
Z
IRn
g(x) fX (x) dx;
which is shorthand for
Z
1
Z
,1
1
,1
g(x1 ; : : :; xn) fX1 Xn (x1; : : :; xn) dx1 dxn:
The Law of Total Probability
We also have the law of total probability and the substitution law for multiple random variables. Let U = [U1 ; : : :; Um ]0 and V = [V1; : : :; Vn]0 be any two
vectors of random variables. The law of total probability tells us that
E[g(U; V )] =
1
Z
|
Z
1
,1 {z ,1}
m times
E[g(U; V )jU = u] fU (u) du;
July 18, 2002
176
Chap. 5 Multiple Random Variables
where, by the substitution law, E[g(U; V )jU = u] = E[g(u; V )jU = u], and
E[g(u; V )jU = u] =
and
Z
|
1
Z
1
,1 {z ,1}
n times
g(u; v) fV jU (vju) dv;
fV jU (vju) = fUV (uf1 ; :(u: :;;u: m: :;; vu1; :): :; vn) :
U 1
m
When g has a product form, say g(u; v) = h(u)k(v), it is easy to see that
E[h(u)k(V )jU = u] = h(u) E[k(V )jU = u]:
Example 5.14. Let X, Y , and Z be as in Example 5.13. Find E[X] and
E[XZ].
Solution. Rather than use the marginal density of X to compute E[X],
we use the law of total probability. Write
E[X] =
1
Z
1
Z
E[X jY = y; Z = z] fY Z (y; z) dy dz:
,1 ,1
Next, from Example 5.13, fX jY Z (jy; z)
z] = y. Thus,
E[X] =
=
=
Z
1 Z 1
Z
,1 Z,1
1
1
Z
,1
1
,1
,1
N(y; z 2 ), and so E[X jY = y; Z =
y fY Z (y; z) dy dz
y fY jZ (yjz) dy fZ (z) dz
E[Y jZ = z] fZ (z) dz:
From Example 5.13, fY jZ (jz) exp(z) and fZ (z) = 3z 2=7; hence, E[Y jZ =
z] = 1=z, and we have
E[X] =
Now, to nd E[XZ], write
E[XZ] =
1
Z
Z
1
,1 ,1
Z 2
1
3 z dz
7
=
9
14 :
E[Xz jY = y; Z = z] fY Z (y; z) dy dz:
We then note that E[Xz jY = y; Z = z] = E[X jY = y; Z = z] z = yz. Thus,
E[XZ] =
1
Z
Z
1
,1 ,1
yz fY Z (y; z) dy dz = E[Y Z]:
In Problem 34 the reader is asked to show that E[Y Z] = 1. Thus, E[XZ] = 1
as well.
July 18, 2002
5.6 Notes
177
Example 5.15. Let N be a positive, integer-valued random variable, and
let X1 ; X2 ; : : : be i.i.d. Further assume that N is independent of this sequence.
Consider the random sum,
N
X
Xi :
i=1
Note that the number of terms in the sum is a random variable. Find the mean
value of the random sum.
Solution. Use the law of total probability to write
X
N
E
i=1
Xi =
1 X
n
X
n=1
E
i=1
Xi N = n }(N = n):
By independence of N and the Xi sequence,
E
n
X
i=1
n
X
Xi N = n = E
i=1
Xi =
n
X
i=1
E[Xi]:
Since the Xi are i.i.d., they all have the same mean. In particular, for all i,
E[Xi ] = E[X1 ]. Thus,
X
n
E
i=1
Xi N
= n = n E[X1]:
Now we can write
E
N
X
i=1
Xi
=
1
X
n=1
n E[X1] }(N = n)
= E[N] E[X1]:
5.6. Notes
Notes x5.1: Joint and Marginal Probabilities
Note 1. Comments analogous to Note 1 in Chapter 2 apply here. Specifically, the set A must be restricted to a suitable -eld B of subsets of IR2.
Typically, B is taken to be the collection of Borel sets of IR2 ; i.e., B is the
2
smallest -eld containing all the open sets of IR .
Note 2. We now derive the limit formula for FX (x) in (5.1); the formula
for FY (y) can be derived similarly. To begin, write
FX (x) := }(X x) = }((X; Y ) 2 (,1; x] IR):
July 18, 2002
178
Chap. 5 Multiple Random Variables
S
Next, observe that IR = 1
n=1 (,1; n], and write
(,1; x] IR = (,1; x] =
1
[
n=1
1
[
n=1
(,1; n]
(,1; x] (,1; n]:
Since the union is increasing, we can use the limit property (1.4) to show that
1
[
FX (x) = } (X; Y ) 2 (,1; x] (,1; n]
n=1
}((X; Y ) 2 (,1; x] (,1; N])
= Nlim
!1
= Nlim
F (x; N):
!1 XY
Notes x5.3: Conditional Probability and Expectation
Note 3. To show that the law of substitution holds for conditional probability, write
}(g(X; Y ) 2 C) = E[IC (g(X; Y ))] =
Z
1
,1
E[IC (g(X; Y ))jX = x]fX (x) dx
and reduce the problem to one involving conditional expectation, for which the
law of substitution has already been established.
5.7. Problems
Problems x5.1: Joint and Marginal Distributions
1. For a < b and c < d, sketch the following sets.
(a) R := (a; b] (c; d].
(b) A := (,1; a] (,1; d].
(c) B := (,1; b] (,1; c].
(d) C := (a; b] (,1; c].
(e) D := (,1; a] (c; d].
(f) A \ B.
2. Show that }(a < X b; c < Y d) is given by
FXY (b; d) , FXY (a; d) , FXY (b; c) + FXY (a; c):
Hint: Using the notation of the preceding problem, observe that
(,1; b] (,1; d] = R [ (A [ B);
and solve for }((X; Y ) 2 R).
July 18, 2002
5.7 Problems
179
3. The joint density in Example 5.4 was obtained by dierentiating FXY (x; y)
rst with respect to x and then with respect to y. In this problem, nd
the joint density by dierentiating rst with respect to y and then with
respect to x.
4. Find the marginals FX (x) and FY (y) if
FXY (x; y) =
8
>
>
>
>
<
>
>
>
>
:
,y
,xy
x , 1 , e ,y e ; 1 x 2; y > 0;
,y ,2y
1 , e ,y e ; x > 2; y > 0;
0;
otherwise:
5. Find the marginals FX (x) and FY (y) if
FXY (x; y) =
8
>
>
>
<
>
>
>
:
2 (1 , e,2y );
7
(7 , 2e,2y , 5e,3y )
7
0;
2 x < 3; y 0;
; x 3; y 0;
otherwise:
Problems x5.2: Jointly Continuous Random Variables
6. Find the marginal density fX (x) if
2
fXY (x; y) = exp[,jy ,pxj , x =2] :
2 2
7. Find the marginal density fY (y) if
,(x,y)2 =2
fXY (x; y) = 4e 5 p
; y 1:
y 2
8. Let X and Y have joint density fXY (x; y). Find the marginal cdf and
density of max(X; Y ) and of min(X; Y ). How do your results simplify if
X and Y are independent? What if you further assume that the densities
of X and Y are the same?
9. Let X and Y be independent gamma random variables with positive parameters p and q, respectively. Find the density of Z := X + Y . Then
compute }(Z > 1) if p = q = 1=2.
10. Find the density of Z := X +Y , where X and Y are independent Cauchy
random variables with parameters and , respectively. Then compute
}(Z 1) if = = 1=2.
July 18, 2002
180
? 11.
Chap. 5 Multiple Random Variables
If X N(0; 1), then the complementary cumulative distribution function
(ccdf) of X is
Z 1 ,x2 =2
ep dx:
Q(x0) := }(X > x0) =
2
x0
(a) Show that
Z =2
x20 d; x 0:
Q(x0 ) = 1
exp 2 ,
0
cos2 0
Hint: For any random variables X and Y , we can always write
}(X > x0 ) = }(X > x0 ; Y 2 IR) = }((X; Y ) 2 D);
where D is the half plane D := f(x; y) : x > x0g. Now specialize to
the case where X and Y are independent and both N(0; 1). Then
the probability on the right is a double integral that can be evaluated
using polar coordinates.
Remark. The procedure outlined in the hint is a generalization of
that used in Section 3.1 to show that the standard normal density
integrates to one. To see this, note that if x0 = ,1, then D = IR2.
(b) Use the result of (a) to derive Craig's formula [10, p. 572, eq. (9)],
Z =2
2 ,
x
1
0 dt; x 0:
exp
Q(x0 ) = 0
2 sin2 t
0
Remark. Simon [41] has derived a similar result for the Marcum Q
function (dened in Problem 17 in Chapter 4) and its higher-order
generalizations. See also [43, pp. 1865{1867].
Problems x5.3: Conditional Probability and Expectation
12. Let fXY (x; y) be as derived in Example 5.4, and note that fX (x) and
fY (y) were found in Example 5.5. Find fY jX (yjx) and fX jY (xjy) for
x; y > 0.
13. Let fXY (x; y) be as derived in Example 5.4, and note that fX (x) and
fY (y) were found in Example 5.5. Compute E[Y jX = x] for x > 0 and
E[X jY = y] for y > 0.
14. Let X and Y be jointly continuous. Show that if
Z
}(Y 2 C jX = x) := fY jX (yjx) dy;
then
C
}(Y 2 C) =
Z
1
,1
}(Y 2 C jX = x)fX (x) dx:
July 18, 2002
5.7 Problems
181
15. Find }(X Y ) if X and Y are independent with X exp() and
16.
17.
18.
19.
20.
21.
22.
? 23.
Y exp().
Let X and Y be independent random variables with Y being exponential
with parameter 1 and X being uniform on [1; 2]. Find }(Y= ln(1 + X 2 ) >
1).
Let X and Y be jointly continuous random variables with joint density
fXY . Find fZ (z) if
(a) Z = eX Y .
(b) Z = jX + Y j.
Let X and Y be independent continuous random variables with respective
densities fX and fY . Put Z = Y=X.
(a) Find the density of Z. Hint: Review Example 5.9.
(b) If X and Y are both N(0; 2 ), show that Z has a Cauchy(1) density
that does not depend on 2 .
(c) If X and Y are both Laplace(), nd a closed-form expression for
fZ (z) that does not depend on .
(d) Find a closed-form expression for the density of Z if Y is uniform on
[,1; 1] and X N(0; 1).
(e) If X and Y are both Rayleigh random variables with parameter ,
nd a closed-form expression for the density of Z. Your answer
should not depend on .
Let X and Y be independent with densities fX (x) and fY (y). If X is a
positive random variable, and if Z = Y= ln(X), nd the density of Z.
Let Y exp(), and suppose that given Y = y, X gamma(p; y).
Assuming r > n, evaluate E[X nY r ].
Use the law of total probability to to solve the following problems.
(a) Evaluate E[cos(X + Y )] if given X = x, Y is conditionally uniform
on [x , ; x + ].
(b) Evaluate }(Y > y) if X uniform[1; 2], and given X = x, Y is
exponential with parameter x.
(c) Evaluate E[XeY ] if X uniform[3; 7], and given X = x, Y N(0; x2).
(d) Let X uniform[1; 2], and suppose that given X = x, Y N(0; 1=x).
Evaluate E[cos(XY )].
Find E[X n Y m ] if Y exp(), and given Y = y, X Rayleigh(y).
Let X gamma(p; ) and Y gamma(q; ) be independent.
July 18, 2002
182
Chap. 5 Multiple Random Variables
(a) If Z := X=Y , show that the density of Z is
p,1
fZ (z) = B(p;1 q) (1 +z z)p+q ; z > 0:
Observe that fZ (z) depends on p and q, but not on . It was shown
in Problem 19 in Chapter 3 that fZ (z) integrates to one. Hint: You
will need the fact that B(p; q) = ,(p),(q)=,(p+q), which was shown
in Problem 13 in Chapter 3.
(b) Show that V := X=(X + Y ) has a beta density with parameters p
and q. Hint: Observe that V = Z=(1 + Z), where Z = X=Y as
above.
Remark. If W := (X=p)=(Y=q), then fW (w) = (p=q) fZ (w(p=q)). If
further p = k1=2 and q = k2=2, then W is said to be an F random
variable with k1 and k2 degrees of freedom. If further = 1=2, then the
X and Y are chi-squared with k1 and k2 degrees of freedom, respectively.
? 24. Let X and Y be independent with X N(0; 1) and Y being chi-squared
p
with k degrees of freedom. Show that the density of Z := X= Y=k has
a student's t density with k degrees of freedom. Hint: For this problem, it may be helpful to review the results of Problems 11{13 and 17 in
Chapter 3.
Problems x5.4: The Bivariate Normal
25. Let U and V have the joint Gaussian density in (5.11). Show that for
26.
27.
28.
? 29.
all with ,1 < < 1, U and V both have standard univariate N(0; 1)
marginal densities that do not involve .
Let X and Y be jointly Gaussian with density fXY (x; y) given by (5.13).
Find fX (x), fY (y), fX jY (xjy), and fY jX (yjx).
Let X and Y be jointly Gaussian with density fXY (x; y) given by (5.13).
Find E[Y jX = x] and E[X jY = y].
If X and Y are jointly normal with parameters, mX , mY , X2 , Y2 , and .
Compute E[X], E[X 2], and E[XY ].
Let ' be the standard bivariate normal density dened in (5.11). Put
fUV (u; v) := 12 ['1 (u; v) + '2 (u; v)];
where ,1 < 1 6= 2 < 1.
(a) Show that the marginals fU and fV are both N(0; 1). (You may use
the results of Problem 25.)
July 18, 2002
5.7 Problems
183
(b) Show that := E[UV ] = (1 + 2 )=2. (You may use the result of
Example 5.10.)
(c) Show that U and V cannot be jointly normal. Hints: (i) To obtain
a contradiction, suppose that fUV is a jointly normal density with
parameters given by parts (a) and (b). (ii) Consider fUV (u; u). (iii)
Use the following fact: If 1 ; : : :; n are distinct real numbers, and if
n
X
k=1
k ek t = 0; for all t 0;
then 1 = = n = 0.
? 30. Let U and V be jointly normal with joint density ' (u; v) dened in
(5.11). Put
Q (u0; v0) := }(U > u0 ; V > v0 ):
Show that for u0; v0 0,
Z
Q (u0 ; v0) =
=2,tan,1 (u0=v0 )
0
+
where
Z tan,1 (u0 =v0 )
0
h (u20 ; ) d
h (v02 ; ) d;
p
2
1
,
,
z(1
,
sin
2)
:
h (z; ) := 2(1 , sin 2) exp
2(1 , 2 ) sin2 This formula for Q (u0; v0) is Simon's bivariate generalization [43, pp. 1864{
1865] of Craig's univariate formula given in Problem 11. Hint: Write
}(U > u0; V > v0) as a double integral and convert to polar coordinates.
It may be helpful to review your solution of Problem 11 rst.
? 31. Use Simon's formula (Problem 30) to show that
Z =4
2 1
,
x
0 dt:
Q(x0 = exp
2 sin2 t
0
In other words, to compute Q(x0)2 , we integrate Craig's integrand (Problem 11) only half as far [42, p. 210]!
)2
Problems x5.5: ? Multivariate Random Variables
? 32.
If
2
fXY Z (x; y; z) = 2 exp[,jx ,5ypj , (y , z) =2] ; z 1;
z 2
and fXY Z (x; y; z) = 0 otherwise, nd fY Z (y; z), fX jY Z (xjy; z), fZ (z), and
fY jZ (yjz).
July 18, 2002
184
? 33.
? 34.
? 35.
? 36.
? 37.
Chap. 5 Multiple Random Variables
Let
,(x,y)2 =2e,(y,z)2 =2e,z2 =2
:
fXY Z (x; y; z) = e
(2)3=2
Find fXY (x; y). Then nd the means and variances of X and Y . Also
nd the correlation, E[XY ].
Let X, Y , and Z be as in Example 5.13. Evaluate E[XY ] and E[Y Z].
Let X, Y , and Z be as in Problem 32. Evaluate E[XY Z].
Let X, Y , and Z be jointly continuous. Assume that X uniform[1; 2];
that given X = x, Y exp(1=x); and that given X = x and Y = y, Z is
N(x; 1). Find E[XY Z].
Let N denote the number of primaries in a photomultiplier, and let Xi be
the number of secondaries due to the ith primary. Then the total number
of secondaries is
N
X
Y =
Xi :
i=1
Express the characteristic function of Y in terms of the probability generating function of N, GN (z), and the characteristic function of the Xi , assuming that the Xi are i.i.d. with common characteristic function 'X ().
Assume that N is independent of the Xi sequence. Find the density of Y
if N geometric1 (p) and Xi exp().
July 18, 2002
CHAPTER 6
Introduction to Random Processes
A random process or stochastic process is any family of random variables. For example, consider the experiment consisting of three coin tosses. A
suitable sample space would be
:= fTTT, TTH, THT, HTT, THH, HTH, HHT, HHHg:
On this sample space, we can dene several random variables. For n = 1; 2; 3;
put
if the nth component of ! is T;
Xn (!) := 0;
1; if the nth component of ! is H:
Then, for example, X1 (THH) = 0, X2 (THH) = 1, and X3 (THH) = 1. For
xed !, we can plot the sequence X1 (!), X2 (!), X3 (!), called the sample
path corresponding to !. This is done for ! = THH and for ! = HTH in
Figure 6.1. These two graphs illustrate the important fact that every time the
sample point ! changes, the sample path also changes.
X n ( THH )
X n ( HTH )
2
2
1
1
n
n
−1
1
2
3
−1
1
2
3
−2
−2
Figure 6.1. Sample paths Xn (!) for ! = THH at the left and for ! = HTH at the right.
In general, a random process Xn may be dened over any range of integers
n, which we usually think of as discrete time. In this case, we can usually
plot only a portion of a sample path as in Figure 6.2. Notice that there is no
requirement that Xn be integer valued. For example, in Figure 6.2, X0 (!) = 0:5
and X1 (!) = 2:5.
It is also useful to allow continuous-time random processes Xt where t can
be any real number, not just an integer. Thus, for each t, Xt (!) is a random
variable, or function of !. However, as in the discrete-time case, we can x !
and allow the time parameter t to vary. In this case, for xed !, if we plot the
sample path Xt (!) as a function of t, we get a curve instead of a sequence as
shown in Figure 6.3.
185
186
Chap. 6 Introduction to Random Processes
Xn( )
3
2
1
−6 −5 −4 −3 −2 −1
−1
1
2
4
3
n
5
6
−2
−3
Figure 6.2. A portion of a sample path Xn (!) of a discrete-time random process. Because
there are more dots than in the previous gure, the dashed line has been added to more
clearly show the ordering of the dots.
0.6
0.4
0.2
0
−0.2
−0.4
−0.6
−0.8
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Figure 6.3. Sample path Xt (!) of a continuous-time random process with continuous but
very wiggly sample paths.
July 18, 2002
187
5
4
3
2
1
0
−1
−2
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Figure 6.4. Sample path Xt (!) of a continuous-time random process with very wiggly
sample paths and jump discontinuities.
5
Nt
4
3
2
1
t
T1
T2
T3 T 4
T5
Figure 6.5. Sample path Xt (!) of a continuous-time random process with sample paths
that are piecewise constant with jumps at random times.
July 18, 2002
188
Chap. 6 Introduction to Random Processes
Figure 6.3 shows a very wiggly but continuous waveform. However, wiggly
waveforms with jump discontinuities are also possible as shown in Figure 6.4.
Figures 6.3 and 6.4 are what we usually think of when we think of a random
process as a model for noise waveforms. However, random waveforms do not
have to be wiggly. They can be very regular as in Figure 6.5.
The main point of the foregoing pictures and discussion is to emphasize
that there are two ways to think about a random process. The rst way is to
think of Xn or Xt as a family of random variables, which by denition, is just
a family of functions of !. The second way is to think of Xn or Xt as a random
waveform in discrete or continuous time, respectively.
Section 6.1 introduces the mean function, the correlation function, and the
covariance function of a random process. Section 6.2 introduces the concept
of a stationary process and of a wide-sense stationary process. The correlation
function and the power spectral density of wide-sense stationary processes are
introduced and their properties are derived. Section 6.3 is the heart of the chapter, and covers the analysis of wide-sense stationary processes through linear,
time-invariant systems. These results are then applied in Sections 6.4 and 6.5 to
derive the matched lter and the Wiener lter. Section 6.6 contains a discussion
of the expected time-average power in a wide-sense stationary random process.
The Wiener{Khinchin Theorem is derived and used to provide an alternative
expression for the power spectral density. As an easy corollary of our derivation
of the Wiener{Khinchin Theorem, we obtain the mean-square ergodic theorem
for wide-sense stationary processes. Section 6.7 briey extends the notion of
power spectral density to processes that are not wide-sense stationary.
6.1. Mean, Correlation, and Covariance
If Xt is a random process, its mean function is
mX (t) := E[Xt ]:
Its correlation function is
RX (t1 ; t2) := E[Xt1 Xt2 ]:
Example 6.1. In modeling a communication system, the carrier signal at
the receiver is modeled by Xt = cos(2ft + ), where uniform[,; ].
The reason for the random phase is that the receiver does not know the exact
time when the transmitter was turned on. Find the mean function and the
correlation function of Xt .
Solution. For the mean, write
E[Xt] = E[cos(2ft + )]
Z 1
=
cos(2ft + ) f () d
,1
Z d :
cos(2ft + ) 2
=
,
July 18, 2002
6.2 Wide-Sense Stationary Processes
189
Be careful to observe that this last integral is with respect to , not t. Hence,
this integral evaluates to zero.
For the correlation, write
RX (t1 ; t2) = E[X
t Xt ]
1 2
= E cos(2ft1 + ) cos(2ft2 + )
= 12 E cos(2f[t1 + t2 ] + 2) + cos(2f[t1 , t2 ]) :
The rst cosine has expected value zero just as the mean did. The second
cosine is nonrandom, and equal to its expected value. Thus, RX (t1 ; t2) =
cos(2f[t1 , t2 ])=2.
Correlation functions have special properties. First,
RX (t1 ; t2) = E[Xt1 Xt2 ] = E[Xt2 Xt1 ] = RX (t2; t1):
In other words, the correlation function is a symmetric function of t1 and t2.
Next, observe that RX (t; t) = E[Xt2 ] 0, and for any t1 and t2 ,
RX (t1 ; t2)
q
E[Xt21 ] E[Xt22 ]:
(6.1)
This is an easy consequence of the Cauchy{Schwarz inequality (see Prob-
lem 1), which says that for any random variables U and V ,
E[UV ]2 E[U 2] E[V 2]:
Taking U = Xt1 and V = Xt2 yields the desired result.
The covariance function is
,
,
CX (t1 ; t2) := E Xt1 , E[Xt1 ] Xt2 , E[Xt2 ] ;
An easy calculation shows that
CX (t1 ; t2) = RX (t1 ; t2) , mX (t1 )mX (t2):
Note that the covariance function is also symmetric; i.e., CX (t1 ; t2) = CX (t2 ; t1).
Since we usually assume that our processes are zero mean; i.e., mX (t) 0,
we focus on the correlation function and its properties.
6.2. Wide-Sense Stationary Processes
Strict-Sense Stationarity
A random process is strictly stationary if for any nite collection of times
t1 ; : : :; tn, all joint probabilities involving Xt1 ; : : :; Xtn are the same as those
involving Xt1 +t ; : : :; Xtn +t for any positive or negative time shift t. For
discrete-time processes, this is equivalent to requiring that joint probabilities
involving X1 ; : : :; Xn are the same as those involving X1+m ; : : :; Xn+m for any
integer m.
July 18, 2002
190
Chap. 6 Introduction to Random Processes
Example 6.2. Consider a discrete-time process of integer-valued random
variables Xn such that
}(X1 = i1; : : :; Xn = in ) = q(i1 ) r(i2 ji1 ) r(i3 ji2) r(in jin,1);
where q(i) is any pmf, and for each i, r(j ji) is a pmf in the variable j; i.e., r is
any conditional pmf. Show that if q has the property
X
k
q(k)r(j jk) = q(j);
(6.2)
then Xn is strictly stationary for positive time shifts m.
Solution. Observe that
}(X1 = j1; : : :; Xm = jm ; X1+m = i1 ; : : :; Xn+m = in )
is equal to
q(j1 )r(j2 jj1)r(j3 jj2) r(i1 jjm )r(i2 ji1) r(injin,1):
Summing both expressions over j1 and using (6.2) shows that
}(X2 = j2; : : :; Xm = jm ; X1+m = i1 ; : : :; Xn+m = in )
is equal to
q(j2)r(j3 jj2 ) r(i1 jjm )r(i2 ji1 ) r(in jin,1):
Continuing in this way, summing over j2 ; : : :; jm , shows that
}(X1+m = i1 ; : : :; Xn+m = in ) = q(i1 )r(i2 ji1) r(in jin,1);
which was the denition of }(X1 = i1 ; : : :; Xn = in ).
Strict stationarity is a very strong property with many implications. Even
taking n = 1 in the denition tells us that for any t1 and t1 + t, Xt1 and
Xt1 +t have the same pmf or density. It then follows that for any function g(x),
E[g(Xt1 )] = E[g(Xt1 +t )]. Taking t = ,t1 shows that E[g(Xt1 )] = E[g(X0 )],
which does not depend on t1. Stationarity also implies that for any function
g(x1 ; x2), we have
E[g(Xt1 ; Xt2 )] = E[g(Xt1 +t ; Xt2 +t )]
for every time shift t. Since t is arbitrary, let t = ,t2 . Then
E[g(Xt1 ; Xt2 )] = E[g(Xt1 ,t2 ; X0)]:
In Chapter 9, we will see that Xn is a time-homogeneous Markov chain. The condition
on q is that it be the equilibrium distribution of the chain.
July 18, 2002
6.2 Wide-Sense Stationary Processes
191
It follows that E[g(Xt1 ; Xt2 )] depends on t1 and t2 only through the time difference t1 , t2 .
Asking that all the joint probabilities involving any Xt1 ; : : :; Xtn be the same
as those involving Xt1 +t ; : : :; Xtn +t for every time shift t is a very strong
requirement. In practice, e.g., analyzing receiver noise in a communication
system, it is often enough to require only that E[Xt] not depend on t and that
the correlation RX (t1 ; t2) = E[Xt1 Xt2 ] depend on t1 and t2 only through the
time dierence, t1 , t2. This is a much weaker requirement than stationarity
for several reasons. First, we are only concerned with one or two variables at a
time rather than arbitrary nite collections. Second, we are not concerned with
probabilities, only expectations. Third, we are only concerned with E[Xt] and
E[Xt1 Xt2 ] rather than E[g(Xt)] and E[g(Xt1 ; Xt2 )] for arbitrary functions g.
Wide-Sense Stationarity
We say that a process is wide-sense stationary (WSS) if the following
two properties both hold:
(i) The mean function mX (t) = E[Xt] does not depend on t.
(ii) The correlation function RX (t1 ; t2) = E[Xt1 Xt2 ] depends on t1 and t2
only through the time dierence t1 , t2 .
The process of Example 6.1 was WSS since we showed that E[Xt ] = 0 and
RX (t1 ; t2) = cos(2f[t1 , t2])=2.
Once we know a process is WSS, we write RX (t1 , t2 ) instead of RX (t1; t2).
We can also write
RX () = E[Xt+ Xt ]; for any choice of t:
In particular, note that
RX (0) = E[Xt2 ] 0; for all t:
At time t, the instantaneous power in a WSS process is Xt2 .y The expected
instantaneous power is then
PX := E[Xt2 ] = RX (0);
which does not depend on t, since the process is WSS.
It is now convenient to introduce the Fourier transform of RX (),
SX (f) :=
1
Z
,1
RX ()e,j 2f d:
We call SX (f) the power spectral density of the process. Observe that by
the Fourier inversion formula,
RX () =
Z
1
SX (f)ej 2f df:
,1
y If Xt is the voltage across a one-ohm resistor or the current passing through a one-ohm
resistor, then Xt2 is the instantaneous power dissipated.
July 18, 2002
192
Chap. 6 Introduction to Random Processes
In particular, taking = 0, shows that
Z
1
,1
SX (f) df = RX (0) = PX :
To explain the terminology, \power spectral density," recall that probability
densities are nonnegative functions that are integrated to obtain probabilities.
Similarly, power spectral densities are nonnegative functionsz that are integrated to obtain powers. The adjective \spectral" refers to the fact that SX (f)
is a function of frequency.
Example 6.3. Let Xt be a WSS random process with power spectral density SX (f) = e,f 2 =2 . Find the power in the process.
Solution. The power is
PX =
=
=
=
1
Z
,1
Z 1
,1
p
2
p
SX (f) df
e,f 2 =2 df
Z
1 e,f 2 =2
,1
p
2
df
2:
Example 6.4. Let Xt be a WSS process with correlation function RX () =
5 sin()=. Find the power in the process.
Solution. The power is
sin()
PX = RX (0) = lim
!0 RX () = lim
!0 5 = 5:
Example 6.5. Let Xt be a WSS random process with ideal lowpass power
spectral density SX (f) = I[,W;W ] (f) shown in Figure 6.6. Find its correlation
function RX ().
Solution. We must nd the inverse Fourier transform of SX (f). Write
RX () =
=
Z
1
,1
Z
W
SX (f)ej 2f df
ej 2f df
,W
z The nonnegativity of power spectral densities is derived in Example 6.9.
July 18, 2002
6.2 Wide-Sense Stationary Processes
193
SX ( f )
1
f
−W
W
0
Figure 6.6. Power spectral density of bandlimited noise in Example 6.5.
j 2f f =W
= ej2 f =,W
j
2
W
, e,j 2W
= e
j2
j
2
W
, e,j 2W
= 2W e 2j(2W)
= 2W sin(2W)
2W :
If the bandwidth W of SX (f) in the preceding example is allowed to go to
innity, then all frequencies will be present in equal amounts. A WSS process
whose power spectral density is constant for all f is call white noise. The
value of the constant is usually denoted by N0 =2. Since the inverse transform
of a constant function is a delta function, the correlation function of white noise
with SX (f) = N0 =2 is RX () = N20 ().
Remark. Just as the delta function is not an ordinary function, white
noise is not an ordinary random process. For example, since (0) is not dened,
and since E[Xt2 ] = RX (0) = N20 (0), we cannot speak of the second moment of
Xt when Xt is white noise. On the other hand, since SX (f) = N0 =2 for white
noise, and since
Z 1
N0 df = 1;
2
,1
we often say that white noise has innite average power.
Properties of Correlation Functions and Power Spectral Densities
The correlation function RX () is completely characterized by the following
three properties.1
(i) RX (,) = RX (); i.e., RX is an even function of .
(ii) jRX ()j RX (0). In other words, the maximum value of jRX ()j
occurs at = 0. In particular, taking = 0 rearms that RX (0) 0.
(iii) The power spectral density, SX (f), is a real, even, nonnegative function
of f.
July 18, 2002
194
Chap. 6 Introduction to Random Processes
Property (i) is a restatement of the fact any correlation function, as a
function of two variables, is symmetric; for a WSS process this means that
RX (t1 , t2) = RX (t2 , t1 ). Taking t1 = and t2 = 0 yields the result.
Property (ii) is a restatement of (6.1) for a WSS process. However, we can
also derive it directly by again using the Cauchy{Schwarz inequality. Write
RX () = E[Xt+ Xt ]
q
E[Xt2+ ] E[Xt2]
p
= RX (0) RX (0)
= RX (0):
Property (iii): To see that SX (f) is real and even, write
SX (f) =
=
Z
1
,1
Z 1
,1
RX ()e,j 2f dt
RX () cos(2f) d , j
Z
1
,1
RX () sin(2f) d:
Since correlation functions are real and even, and since the sine is odd, the
second integrand is odd, and therefore integrates to zero. Hence, we can always
write
Z 1
SX (f) =
RX () cos(2f) d:
,1
Thus, SX (f) is real. Furthermore, since the cosine is even, SX (f) is an even
function of f as well.
The fact that SX (f) is nonnegative is derived later in Example 6.9.
Remark. Property (iii) actually implies Properties (i) and (ii). (Problem 9.) Hence, a real-valued function of is a correlation function if and only
if its Fourier transform is real, even, and nonnegative. However, if one is trying
to show that a function of is not a correlation function, it is easier to check if
either (i) or (ii) fails. Of course, if (i) and (ii) both hold, it is then necessary
to go ahead and nd the power spectral density and see if it is nonnegative.
Example 6.6. Determine whether or not R() := e,j j is a valid correlation function.
Solution. Property (i) requires that R() be even. However, R() is not
even (in fact, it is odd). Hence, it cannot be a correlation function.
Example 6.7. Determine whether or not R() := 1=(1 + 2) is a valid
correlation function.
Solution. We must check the three properties
that
characterize correlation
functions. It is obvious that R is even and that R() R(0) = 1. The Fourier
July 18, 2002
6.3 WSS Processes through Linear Time-Invariant Systems
195
transform of R() is S(f) = exp(,2jf j), as can be veried by applying the
inversion formula to S(f).x Since S(f) is nonnegative, we see that R() satises
all three properties, and is therefore a valid correlation function.
Example 6.8 (Power in a Frequency Band). Let Xt have power spectral
density SX (f). Find the power in the frequency band W1 f W2.
Solution. Since power spectral densities are even, we include the corresponding negative frequencies too. Thus, we compute
,W1
Z
,W2
SX (f) df +
Z
W2
W1
SX (f) df = 2
Z
W2
W1
SX (f) df:
6.3. WSS Processes through Linear Time-Invariant Systems
In this section, we consider passing a WSS random process through a linear
time-invariant (LTI) system. Our goal is to nd the correlation and power
spectral density of the output. Before doing so, it is convenient to introduce
the following concepts.
Let Xt and Yt be random processes. Their cross-correlation function is
RXY (t1 ; t2) := E[Xt1 Yt2 ]:
To distinguish between the terms cross-correlation function and correlation
function, the latter is sometimes referred to as the auto-correlation function. The cross-covariance function is
CXY (t1 ; t2) := E[fXt1 ,mX (t1 )gfYt2 ,mY (t2 )g] = RXY (t1 ; t2),mX (t1) mY (t2):
A pair of processes Xt and Yt is called jointly wide-sense stationary
(J-WSS) if all three of the following properties hold:
(i) Xt is WSS.
(ii) Yt is WSS.
(iii) the cross-correlation RXY (t1 ; t2) = E[Xt1 Yt2 ] depends on t1 and t2 only
through the dierence t1 , t2 . If this is the case, we write RXY (t1 , t2 )
instead of RXY (t1; t2).
Remark. In checking to see if a pair of processes is J-WSS, it is usually
easier to check property (iii) before (ii).
The most common situation in which we have a pair of J-WSS processes
occurs when Xt is the input to an LTI system, and Yt is the output. Recall
x There is a short table of transform pairs in the Problems section and inside the front
cover.
July 18, 2002
196
Chap. 6 Introduction to Random Processes
that an LTI system is expressed by the convolution
1
Z
Yt =
h(t , )X d;
,1
where h is the impulse response of the system. An equivalent formula is obtained
by making the change of variable = t , , d = ,d. This allows us to write
Yt =
1
Z
,1
h()Xt, d;
which we use in most of the calculations below. We now show that if the input
Xt is WSS, then the output Yt is WSS and Xt and Yt are J-WSS.
To begin, we show that E[Yt] does not depend on t. To do this, write
1
Z
E[Yt] = E
,1
Z
h()Xt, d =
1
,1
E[h()Xt, ] d:
To justify bringing the expectation inside the integral, write the integral as a
Riemann sum and use the linearity of expectation. More explicitly,
E
Z
1
,1
h()Xt, d
X
E
=
X
=
X
i
i
Z
i
h(i )Xt,i i
E[h(i )Xt,i i ]
E[h(i )Xt,i ] i
1
,1
E[h()Xt, ] d:
In computing the expectation inside the integral, observe that Xt, is random,
but h() is not. Hence, h() is a constant and can be pulled out of the expectation; i.e., E[h()Xt, ] = h()E[Xt, ]. Next, since Xt is WSS, E[Xt, ] does
not depend on t , , and is therefore equal to some constant, say m. Hence,
E[Yt] =
Z
1
,1
h() m d = m
Z
1
,1
h() d;
which does not depend on t.
Logically, the next step would be to show that E[Yt1 Yt2 ] depends on t1 and
t2 only through the dierence t1 , t2 . However, it is more ecient if we rst
obtain a formula for the cross-correlation, E[Xt1 Yt2 ], and show that it depends
only on t1 , t2 . Write
E[Xt1 Yt2 ] = E Xt1
1
Z
,1
h()Xt2 , d
July 18, 2002
6.3 WSS Processes through Linear Time-Invariant Systems
=
=
=
Z
1
Z
,1
1
Z
,1
1
,1
197
h()E[Xt1 Xt2 , ] d
h()RX (t1 , [t2 , ]) d
h()RX ([t1 , t2 ] + ) d;
which depends only on t1 , t2 as required. We can now write
1
Z
RXY () =
,1
h()RX ( + ) d:
(6.3)
If we make the change of variable = ,, d = ,d, then
Z
RXY () =
1
,1
h(,)RX ( , ) d;
(6.4)
and we see that RXY is the convolution of h(,) with RX (). This suggests
that we dene the cross power spectral density by
1
Z
SXY (f) :=
,1
RXY ()e,j 2f d:
Since we are considering real-valued random variables here, we must assume
h() is real valued. Letting H(f) denote the Fourier transform of h(), it
follows that the Fourier transform of h(,) is H(f) , where the superscript denotes the complex conjugate. Thus, the Fourier transform of (6.4) is
SXY (f) = H(f) SX (f):
(6.5)
We now calculate the auto-correlation of Yt and show that it depends only
on t1 , t2 . Write
Z
E[Yt1 Yt2 ] = E
=
=
=
Thus,
RY () =
Z
1
Z
,1
1
Z
,1
1
,1
Z
1
,1
h()Xt1 , d Yt2
h()E[Xt1 , Yt2 ] d
h()RXY ([t1 , ] , t2 ) d
h()RXY ([t1 , t2 ] , ) d:
1
,1
h()RXY ( , ) d:
July 18, 2002
198
Chap. 6 Introduction to Random Processes
SX ( f )
f
− W2 − W1
0
W1
W2
H( f )
1
f
− W2 − W1
0
W1
W2
Figure 6.7. Setup of Example 6.9 to show that a power spectral density must be nonnegative.
In other words, RY is the convolution of h and RXY . Taking Fourier transforms,
we obtain SY (f) = H(f)SXY (f) = H(f)H(f) SX (f), and
SY (f) = H(f)2SX (f):
(6.6)
In other words, the power spectral density of the output of an LTI system is
the product of the magnitude squared of the transfer function and the power
spectral density of the input.
Example 6.9. We can use the above formula to show that power spectral
densities are nonnegative. Suppose Xt is a WSS process with SX (f) < 0 on
some interval [W1; W2 ], where 0 < W1 W2 . Since SX is even, it is also
negative on [,W2 ; ,W1] as shown in Figure 6.7. Let H(f) := I[,W2 ;,W1 ] (f) +
I[W1 ;W2 ] (f) as shown in Figure 6.7. Note that since H(f) takes only the values
zero and one, H(f)2 = H(f). Let h(t) denote the inverse Fourier Rtransform
1 h(t ,
of H. Since H is real and even, h is also real and even. Then Yt := ,1
)X d is a real WSS random process, and
SY (f) = H(f)2 SX (f) = H(f)SX (f) 0;
1 S (f) df < 0. It then follows
with SY (f) < 0 for W1 < jf j < W2 . Hence, ,1
Y
that
Z 1
0 E[Yt2 ] = RY (0) =
SY (f) df < 0;
R
which is a contradiction. Thus, SX (f) 0.
,1
Example 6.10. A certain communication receiver employs a bandpass lter to reduce white noise generated in the amplier. Suppose that the white
July 18, 2002
6.4 The Matched Filter
199
noise Xt has power spectral density N0 =2 and that the lter transfer function
H(f) is given in Figure 6.7. Find the expected output power from the lter.
Solution. The expected output power is obtained by integrating the power
spectral density of the lter output. Denoting the lter output by Yt ,
PY =
H(f)2 SX (f)
1
Z
,1
SY (f) df =
Z
1
,1
H(f)2SX (f) df:
Since
= N0 =2 for W1 jf j W2 , and is zero otherwise, PY =
2(N0 =2)(W2 , W1 ) = N0 (W2 , W1 ). In other words, the expected output power
is N0 times the bandwidth of the lter.
6.4. The Matched Filter
Consider an air-trac control system which sends out a known radar pulse.
If there are no objects in range of the radar, the system returns only a noise
waveform Xt , which we model as a zero-mean, WSS random process with power
spectral density SX (f). If there is an object in range, the system returns the
reected radar pulse, say v(t), plus the noise Xt . We wish to design a system
that decides whether received waveform is noise only, Xt , or signal plus noise,
v(t) + Xt . As an aid to achieving this goal, we propose to take the received
waveform and pass it through an LTI system with impulse response h(t). If the
received signal is in fact v(t) + Xt , then the output of the linear system is
Z
where
1
,1
h(t , )[v() + X ] d = vo (t) + Yt;
vo (t) :=
is the output signal, and
Yt :=
1
Z
,1
1
Z
,1
h(t , )v() d
h(t , )X d
is the output noise process. We will now try to nd the impulse response h that
maximizes the output signal-to-noise ratio (SNR),
(t0 )2 ;
SNR := vEo[Y
2
t0 ]
where vo (t0 )2 is the instantaneous output signal power at time t0 , and E[Yt20 ]
is the expected instantaneous output noise power at time t0 . Note that since
E[Yt20 ] = RY (0) = PY , we can also write
2
SNR = voP(t0) :
Y
July 18, 2002
200
Chap. 6 Introduction to Random Processes
Our approach will be to obtain an upper bound on the numerator of the
form vo (t0 )2 PY B, where B does not depend on the impulse response h. It
will then follow that
2
SNR = voP(t0 ) PYP B = B:
Y
Y
We then show how to choose the impulse response so that in fact vo (t0 )2 =
PY B. For this choice of impulse response, we then have SNR = B, the
maximum possible value.
We begin by analyzing the denominator in the SNR. Observe that
PY =
1
Z
,1
SY (f) df =
To analyze the numerator, write
vo (t0 ) =
Z
1
,1
Vo (f)ej 2ft0 df
1
Z
,1
=
Z
jH(f)j2 SX (f) df:
1
,1
H(f)V (f)ej 2ft0 df;
where Vo (f) is the Fourier transform of vo (t), and V (f) is the Fourier transform
of v(t). Next, write
Z 1
p
(f)ej 2ft0 df
vo (t0 ) =
H(f) SX (f) V p
SX (f)
,1
Z 1
e,j 2ft0 p
p
=
H(f) SX (f) V (f)
df;
SX (f)
,1
where the asterisk denotes complex conjugation. Applying the Cauchy{Schwarz
inequality (see Problem 1 and the Remark following it), we obtain the upper
bound,
Z 1
Z 1
jV (f)j2 df :
jvo (t)j2 jH(f)j2 SX (f) df S (f)
|
Thus,
,1
{z
= PY
} |
X
,1 {z
=: B
}
2
SNR = jvoP(t0 )j PYP B = B:
Y
Y
p
Now, the Cauchy{Schwarz inequality
holds with equality if and only if H(f) SX (f)
p
is a multiple of V (f) e,j 2ft0 = SX (f). Thus, the upper bound on the SNR
will be achieved if we take H(f) to solve
e,j 2ft0
p
p
H(f) SX (f) = V (f)
SX (f)
for any complex multiplier ; i.e., we should take
,j 2ft0
H(f) = V (f)S e(f) :
X
July 18, 2002
6.5 The Wiener Filter
201
v (t )
t
−T
0
T
h(t ) = v ( t0 − t )
t
−T
0
t0
T
Figure 6.8. Known signal v(t) and corresponding matched lter impulse response h(t) in
the case of white noise.
Thus, the optimal lter is \matched" to the known signal.
Consider the special case in which Xt is white noise with power spectral density SX (f) = N0 =2. Taking = N0 =2 as well, we have H(f) = V (f) e,j 2ft0 ,
which inverse transforms to h(t) = v(t0 , t), assuming v(t) is real. Thus, the
matched lter has an impulse response which is a time-reversed and translated
copy of the known signal v(t). An example of v(t) and the corresponding h(t) is
shown in Figure 6.8. As the gure illustrates, if v(t) is a nite-duration, causal
waveform, as any radar \pulse" would be, then the sampling time t0 can always
be chosen so that h(t) corresponds to a causal system.
6.5. The Wiener Filter
In the preceding section, the available data was of the form v(t)+Xt , where
v(t) was a known, nonrandom signal, and Xt was a zero-mean, WSS noise
process. In this section, we suppose that Vt is an unknown random process
that we would like to estimate based on observing a related random process
Ut . For example, we might have Ut = Vt + Xt , where Xt is a noise process.
However, to begin, we only assume that Ut and Vt are zero-mean, J-WSS with
known power spectral densities and known cross power spectral density.
We restrict attention to linear estimators of the form
Vbt =
1
Z
,1
h(t , )U d =
Z
1
,1
h()Ut, d:
(6.7)
Note that to estimate Vt at a single time t, we use the entire observed waveform
U for ,1 < < 1. Our goal is to nd an impulse response h that minimizes
the mean squared error, E[jVt , Vbt j2]. In other words, we are looking for
an impulse response h such that if Vbt is given by (6.7), and if ~h is any other
July 18, 2002
202
Chap. 6 Introduction to Random Processes
impulse response, and we put
Z 1
Vet =
h~ (t , )U d =
,1
then
Z
1
,1
~h()Ut, d;
(6.8)
E[jVt , Vbt j2] E[jVt , Vetj2 ]:
To nd the optimal lter h, we apply the orthogonality principle, which says
that if
Z 1
~h()Ut, d = 0
E (Vt , Vbt)
(6.9)
,1
for every lter h~ , then h is the optimal lter.
Before proceeding any further, we need the following observation. Suppose
(6.9) holds for every choice of h~ . Then in particular, it holds if we replace h~ by
h , ~h. Making this substitution in (6.9) yields
E (Vt , Vbt )
1
Z
,1
[h() , h~ ()]Ut, d = 0:
Since the integral in this expression is simply Vbt , Vet, we have that
E[(Vt , Vbt )(Vbt , Vet )] = 0:
(6.10)
To establish the orthogonality principle, assume (6.9) holds for every choice
of h~ . Then (6.10) holds as well. Now write
E[jVt , Vetj2 ] = E[j(Vt , Vbt ) + (Vbt , Vet )j2 ]
= E[jVt , Vbt j2 + 2(Vt , Vbt )(Vbt , Vet ) + jVbt , Vetj2 ]
= E[jVt , Vbt j2] + 2E[(Vt , Vbt )(Vbt , Vet)] + E[jVt , Vet j2]
= E[jVt , Vbt j2] + E[jVt , Vetj2 ]
E[jVt , Vbt j2];
and thus, h is the lter that minimizes the mean squared error.
The next task is to characterize the lter h such that (6.9) holds for every
choice of ~h. Recalling (6.9), we have
0 = E (Vt ,
Z
= E
=
=
=
Z
1
1
,1
,1
Z 1
,1
Z 1
,1
Vbt )
1
Z
,1
~h()Ut, d
~h()(Vt , Vbt)Ut, d
E[~h()(Vt , Vbt )Ut, ] d
h~ ()E[(Vt , Vbt )Ut, ] d
h~ ()[RV U () , RVbU ()] d:
July 18, 2002
6.5 The Wiener Filter
203
Since h~ is arbitrary, take ~h() = RV U () , RVbU () to get
1
Z
,1
RV U ()
, RVbU ()2 d = 0:
(6.11)
Thus, (6.9) holds for all ~h if and only if RV U = RVbU .
The next task is to analyze RVbU . Recall that Vbt in (6.7) is the response
of an LTI system to input Ut . Applying (6.3) with X replaced by U and Y
replaced by Vb , we have, also using the fact that RU is even,
Z
1
Z
1
RVbU () = RU Vb (,) =
h()RU ( , ) d:
,1
Taking Fourier transforms of
yields
RV U () = RVbU () =
h()RU ( , ) d
,1
(6.12)
SV U (f) = H(f)SU (f):
Thus,
(f)
H(f) = SSV U(f)
U
is the optimal lter. This choice of H(f) is called the Wiener lter.
? Causal
Wiener Filters
Typically, the Wiener lter as found above is not causal; i.e., we do not have
h(t) = 0 for t < 0. To nd such an h, we need to reformulate the problem by
replacing (6.7) with
Vbt =
Z
t
,1
h(t , )U d =
1
Z
0
and replacing (6.8) with
Vet =
Z
t
,1
h~ (t , )U d =
1
Z
0
h()Ut, d;
h~ ()Ut, d:
Everything proceeds as before from (6.9) through (6.11) except that lower limits
of integration are changed from ,1 to 0. Thus, instead of concluding RV U () =
RVbU () for all , we only have RV U () = RVbU () for 0. Instead of (6.12),
we have
Z 1
RV U () =
h()RU ( , ) d; 0:
(6.13)
0
This is known as the Wiener{Hopf equation. Because the equation only
holds for 0, we run into a problem if we try to take Fourier transforms. To
July 18, 2002
204
Chap. 6 Introduction to Random Processes
H( f )
Wt
Ut
H0 ( f )
K( f )
^
Vt
Figure 6.9. Decomposition of the causal Wiener lter using the whitening lter K (f ).
compute SV U (f), we need to integrate RV U ()e,j 2f from = ,1 to = 1.
But we can only use the Wiener{Hopf equation for 0.
In general, the Wiener{Hopf equation is very dicult to solve. However, if
U is white noise, say RU () = (), then (6.13) reduces to
RV U () = h(); 0:
Since h is causal, h() = 0 for < 0.
The preceding observation suggests the construction of H(f) using a whitening lter as shown in Figure 6.9. If Ut is not white noise, suppose we can nd a
causal lter K(f) such that when Ut is passed through this system, the output
is white noise Wt , by which we mean SW (f) = 1. Letting k denote the impulse
response corresponding to K, we can write Wt mathematically as
Wt =
Z
0
1
k()Ut, d:
(6.14)
Then
1 = SW (f) = jK(f)j2 SU (f):
(6.15)
Consider the problem of causally estimating Vt based on Wt instead of Ut . The
solution is again given by the Wiener{Hopf equation,
RV W () =
1
Z
0
h0 ()RW ( , ) d; 0:
Since K was chosen so that SW (f) = 1, RW () = (). Therefore, the Wiener{
Hopf equation tells us that h0() = RV W () for 0. Using (6.14), it is easy
to see that
Z 1
RV W () =
k()RV U ( + ) d;
(6.16)
and then{
0
SV W (f) = K(f) SV U (f):
We now summarize the procedure.
(6.17)
{ If k() is complexvalued, so is Wt in (6.14). In this case, as in Problem 26, it is understood
that RV W ( ) = E[Vt+ Wt ].
July 18, 2002
6.5 The Wiener Filter
205
1. According to (6.15), we must rst write SU (f) in the form
1 1 ;
SU (f) = K(f)
K(f)
where K(f) is a causal lter (this is known as spectral factorization).k
2. The optimum lter is H(f) = H0(f)K(f), where
H0(f) =
Z
0
1
RV W ()e,j 2f d;
and RV W () is given by (6.16) or by the inverse transform of (6.17).
Example 6.11. Let Ut = Vt + Xt, where Vt and Xt are zero-mean, WSS
processes with E[Vt X ] = 0 for all t and . Assume that the signal Vt has power
spectral density SV (f) = 2=[2 + (2f)2 ] and that the noise Xt is white with
power spectral density SX (f) = 1. Find the causal Wiener lter.
Solution. From your solution of Problem 32, SU (f) = SV (f) + SX (f).
Thus,
A2 + (2f)2 ;
SU (f) = 2 +2
+
1
=
(2f)2
2 + (2f)2
where A2 := 2 + 2. This factors into
j2f A , j2f :
SU (f) = A ++ j2f
, j2f
Then
K(f) = A ++ j2f
j2f
is the required causal (by Problem 35) whitening lter. Next, from your solution
of Problem 32, SV U (f) = SV (f). So, by (6.17),
2
SV W (f) = A ,, j2f
j2f 2 + (2f)2
2
= (A , j2f)(
+ j2f)
B
= A , j2f + +Bj2f ;
where B := 2=( + A). It follows that
RV W () = BeA u(,) + Be, u();
k If SU (f ) satises the Paley{Wiener condition,
Z
1 j ln S (f )j
U
1
+
f 2 df < 1;
,1
then SU (f ) can always be factored in this way.
July 18, 2002
206
Chap. 6 Introduction to Random Processes
where u is the unit-step function. Since h0() = RV W () for 0, h0() =
Be, u() and H0(f) = B=( + j2f). Next,
B ;
=
H(f) = H0 (f)K(f) = +Bj2f A ++ j2f
j2f
A + j2f
and h() = Be,A u().
6.6. ?Expected Time-Average Power and the Wiener{
Khinchin Theorem
In this section we develop the notion of the expected time-average power in
a WSS process. We then derive the Wiener{Khinchin Theorem, which implies
that the expected time-average power is the same as the expected instantaneous
power. As an easy corollary of the derivation of the Wiener{Khinchin Theorem,
we derive the mean-square ergodic theorem for WSS processes. This result
shows that E[Xt] can often be computed by averaging a single sample path over
time.
Recall that the average power in a deterministic waveform x(t) is given by
Z T
1
x(t)2 dt:
lim
T !1 2T ,T
The analogous formula for a random process is
1 Z T X 2 dt:
lim
T !1 2T ,T t
The problem with the above quantity is that it is a random variable because
the integrand, Xt2 , is random. We would like to characterize the process by a
nonrandom quantity. This suggests that we dene the expected time-average
power in a random process by
1 Z T X 2 dt :
PeX := E Tlim
!1 2T ,T t
We now carry out some analysis of PeX . To begin, we want to apply Parseval's
equation to the integral. To do this, we use the following notation. Let
(
Xt ; jtj T;
T
Xt :=
0; jtj > T;
R
R 1 T 2
X dt. Now, the Fourier transform of X T is
so that ,TT Xt2 dt = ,1
t
t
Xe T
f
:=
Z
1
,1
XtT e,j 2ft dt
=
July 18, 2002
Z
T
,T
Xt e,j 2ft dt;
(6.18)
6.6 ? Expected Time-Average Power and the Wiener{
and by Parseval's equation,
Z
1
,1
X T 2 dt
=
t
Z
1
,1
Khinchin Theorem207
X
e T 2 df:
f
Returning now to PeX , we have
1
= E Tlim
!1 2T
1
= E Tlim
!1 2T
1
= E Tlim
!1 2T
PeX
=
Z
X 2 dt
,T
1
Z
t
X T 2 dt
t
,1
1
Z
,1
X
e T 2 df
f
E XefT 2 df:
lim
T !1 2T
1
Z
T
,1
The Wiener{Khinchin Theorem says that the above integrand is exactly
the power spectral density
SX (f). From our earlier work, we know that PX :=
R1
E[Xt2 ] = RX (0) = ,1
SX (f) df. Thus, PeX = PX ; i.e., the expected timeaverage power is equal to the expected instantaneous power at any point in
time. Note also that the above integrand is obviously nonnegative, and so if we
can show that it is equal to SX (f), then this will provide another proof that
SX (f) must be nonnegative.
To derive the Wiener{Khinchin Theorem, we begin with the numerator,
,
,
E XefT 2 = E XefT XefT ;
where the asterisk denotes complex conjugation. To evaluate the right-hand
side, use (6.18) to obtain
Z
E
T
,T
Xt e,j 2ft dt
Z
T
,T
X e,j 2f d
:
We can now write
Z
T
Z
T
Z
T
Z
T
E XefT 2 =
E[XtX ]e,j 2f (t,) dt d
,T ,T
=
=
=
,T ,T
Z
T
Z
T Z 1
,T ,T
1
Z
,1
RX (t , )e,j 2f (t,) dt d
SX ()
,1
Z
(6.19)
SX ()ej 2 (t,) d e,j 2f (t,) dt d
T
,T
ej 2(f , )
July 18, 2002
Z
T
,T
ej 2t( ,f ) dt
d d:
208
Chap. 6 Introduction to Random Processes
Notice that the inner two integrals decouple so that
E Xe T 2
f
=
=
1
Z
,1
Z 1
SX ()
Z
T
,T
ej 2(f , ) d
,
Z
2T(f , )
SX () 2T sin2T(f
, )
,1
T
,T
2
ej 2t( ,f ) dt
d
d:
We can then write
,
Z 1
E XefT 2
sin 2T(f , ) 2 d:
=
S
()
2T
(6.20)
X
2T
2T(f , )
,1
This is a convolution integral. Furthermore, the quantity multiplying SX ()
converges to the delta function (f , ) as T ! 1.2 Thus,
E XefT 2
Z
1
SX () (f , ) d = SX (f);
2T
,1
which is exactly the Wiener{Khinchin Theorem.
Remark. The preceding derivation shows that SX (f) is equal to the limit
of (6.19) divided by 2T. Thus,
lim
T !1
=
1 Z T Z T R (t , )e,j 2f (t,) dt d:
S(f) = Tlim
!1 2T ,T ,T X
As noted in Problem 37, the properties of the correlation function directly imply
that this double integral is nonnegative. This is the direct way to prove that
power spectral densities are nonnegative.
Mean-Square Law of Large Numbers for WSS Processes
In the process of deriving the weak law of large numbers in Chapter 2, we
showed that for an uncorrelated sequence Xn with common mean m = E[Xn]
and common variance 2 = var(Xn ), the sample mean (or time average)
n
X
1
Mn := n Xi
i=1
converges to m in the sense that E[jMn , mj2] = var(Mn ) ! 0 as n ! 1 by
(2.6). In this case, we say that Mn converges in mean square to m, and we call
this a mean-square law of large numbers.
We now show that for a WSS process Yt with mean m = E[Yt], the sample
mean (or time average)
Z T
1
Yt dt ! m
MT := 2T
,T
July 18, 2002
6.6 ? Expected Time-Average Power and the Wiener{
Khinchin Theorem209
in the sense that E[jMT , mj2 ] ! 0 as T ! 1 if and only if
1 Z 2T C ()1 , j j d = 0;
lim
T !1 2T ,2T Y
2T
where CY is the covariance (not correlation) function of Yt. We can view this
result as a mean-square law of large numbers for WSS processes. Laws of large
numbers for sequences or processes that are not uncorrelated are often called
ergodic theorems. Hence, the above result is also called the mean-square
ergodic theorem for WSS processes. The point in all theorems of this type
is that the expectation E[Yt] can be computed by averaging a single sample path
over time.
We now derive this result. To begin, put Xt := Yt , m so that Xt is zero
mean and has correlation function RX () = CY (). Also,
Z T
XefT f =0
1
Xt dt = 2T ;
MT , m = 2T
,T
where XefT was dened in (6.18). Using (6.20) with f = 0 we can write
E[jMT
, mj2 ]
=
E Xe0T 2
4T 2 E Xe0T 2
1
= 2T 2T
2
Z 1
1
sin(2T)
= 2T
SX () 2T 2T d:
,1
(6.21)
Applying Parseval's equation yields
1 Z 2T R ()1 , j j d
E[jMT , mj2] = 2T
X
2T
,2T
Z 2T
j j d:
= 2T1
CY () 1 , 2T
,2T
To give a sucient condition for when this goes to zero, recall that by the
argument following (6.20), as T ! 1 the integral in (6.21) is approximately
SX (0) if SX (f) is continuous at f = 0.3 Thus, if SX (f) is continuous at f = 0,
E[jMT , mj2 ] SX (0) ! 0
2T
as T ! 1.
Remark. If RX () = CY () is absolutely integrable, then SX (f) is uniformly continuous. To see this write
jSX (f) , SX (f0 )j =
Z
1
,1
RX ()e,j 2f d ,
July 18, 2002
1
Z
,1
RX ()e,j 2f0 d 210
Chap. 6 Introduction to Random Processes
=
=
Z
1
,1
Z 1
,1
Z 1
,1
Now observe that je,j 2(f ,f0 )
jRX ()j je,j 2f , e,j 2f0 j d
jRX ()j je,j 2f0 [e,j 2(f ,f0 ) , 1]j d
jRX ()j je,j 2(f ,f0) , 1j d:
, 1j ! 0 as f ! f0. Since RX is absolutely
integrable, Lebesgue's dominated convergence theorem [4, p. 209] implies that
the integral goes to zero as well.
6.7. ?Power Spectral Densities for non-WSS Processes
If Xt is not WSS, then the instantaneous power E[Xt2 ] will depend on t. In
this case, it is more appropriate to look at the expected time-average power PeX
dened in the previous section. At the end of this section, we show that
where
lim
T !1
E XefT 2
2T
=
Z
1
,1
RX ()e,j 2f d;
(6.22)
1 Z T R ( + ; ) d:
RX () := Tlim
!1 2T ,T X
Hence, for a non-WSS process, we dene its power spectral density to be the
Fourier transform of RX (),
SX (f) :=
Z
1
,1
RX ()e,j 2f d:
An important application the foregoing is to cyclostationary processes. A
process Yt is (wide-sense) cyclostationary if its mean function is periodic in t,
and if its correlation function has the property that for xed , RX ( + ; )
is periodic in . For a cyclostationary process with period T0 , it is not hard to
show that
Z T0 =2
RX () = T1
RX ( + ; ) d:
(6.23)
0 ,T0 =2
Example 6.12. Let Xt be WSS, and put Yt := Xt cos(2f0t). Show that
Yt is cyclostationary and that
SY (f) = 14 [SX (f , f0 ) + SX (f + f0 )]:
Solution. The mean of Yt is
E[Yt] = E[Xt cos(2f0 t)] = E[Xt] cos(2f0 t):
Note that if Xt is WSS, then RX ( ) = RX ( ).
July 18, 2002
6.7 ? Power Spectral Densities for non-WSS Processes
211
Because Xt is WSS, E[Xt] does not depend on t, and it is then clear that E[Yt]
has period 1=f0 . Next consider
RY (t + ; ) = E[Yt+ Y ]
= E[Xt+ cos(2f0 ft + g)X cos(2f0 )]
= RX (t) cos(2f0 ft + g) cos(2f0 );
which is periodic in with period 1=f0 . To compute SY (f), rst use a trigonometric identity to write
RY (t + ; ) = RX2(t) [cos(2f0 t) + cos(2f0 ft + 2g)]:
Applying (6.23) to RY with T0 = 1=f0 yields
RY (t) = RX2(t) cos(2f0 t):
Taking Fourier transforms yields the claimed formula for SY (f).
Derivation of (6.22)
We begin as in the derivation of the Wiener{Khinchin Theorem, except that
instead of (6.19) we have
Z
T
Z
T
Z
T
Z
1
E XefT 2 =
RX (t; )e,j 2f (t,) dt d
,T ,T
=
,T ,1
I[,T;T ] (t)RX (t; )e,j 2f (t,) dt d:
Now make the change of variable = t , in the inner integral. This results in
Z
T
Z
1
E XefT 2 =
I[,T;T ] ( + )RX ( + ; )e,j 2f d d:
,T ,1
Change the order of integration to get
E Xe T 2
f
=
Z
1
,1
Z
e,j 2f
T
,T
I[,T;T ] ( + )RX ( + ; ) d d:
To simplify the inner integral, observe that I[,T;T ] ( + ) = I[,T ,;T , ] ().
Now T , is to the left of ,T if 2T < , and ,T , is to the right of T if
,2T > . Thus,
Z 1
E XefT 2
e,j 2f gT () d;
=
2T
,1
July 18, 2002
212
Chap. 6 Introduction to Random Processes
where
gT () :=
Z T ,
8
>
1
>
RX ( + ; ) d;
>
2T
>
>
,T
>
<
Z T
1
>
RX ( + ; ) d;
> 2T
>
>
,
T
,
>
>
:
0 2T;
,2T < 0;
j j > 2T:
If T much greater than j j, then T , T and ,T , ,T in the above
0;
limits of integration. Hence, if RX is a reasonably-behaved correlation function,
Z T
1
R ( + ; ) d = RX ();
lim g () = Tlim
T !1 T
!1 2T ,T X
and we nd that
E XefT 2
lim
=
T !1
2T
1
Z
,1
e,j 2f Tlim
g () d =
!1 T
Z
1
,1
e,j 2f RX () d:
Remark. If Xt is actually WSS, then RX ( + ; ) = RX (), and
j j I
gT () = RX () 1 , 2T
[,2T;2T ]():
In this case, for each xed , gT () ! RX (). We thus have an alternative
derivation of the Wiener{Khinchin Theorem.
6.8. Notes
Notes x6.2: Wide-Sense Stationary Processes
Note 1. We mean that if RX is the correlation function of a WSS process,
then RX it must satisfy Properties (i){(iii). Furthermore, it can be proved [50,
pp. 94{96] that if R is a real-valued function satisfying Properties (i){(iii), then
there is a WSS process having R as its correlation function.
Notes x6.6: ?Expected Time-Average Power and the Wiener{Khinchin Theorem
Note 2. To give a rigorous derivation of the fact that
,
Z 1
sin 2T (f , ) 2
d = SX (f);
lim
S () 2T 2T (f , )
T !1 ,1 X
it is convenient to assume SX (f) is continuous at f. Letting
2
T (f) := 2T sin(2Tf)
;
2Tf
July 18, 2002
6.8 Notes
213
we must show that
Z
1
,1
SX () T (f , ) d ,
SX (f)
! 0:
To proceed, we need the following properties of T . First,
1
Z
T (f) df = 1:
,1
This can be seen by using the Fourier transform table to evaluate the inverse
transform of T (f) at t = 0. Second, for xed f > 0, as T ! 1,
Z
ff :jf j>f g
T (f) df ! 0:
This can be seen by using the fact that T (f) is even and writing
Z 1
Z 1
1 df =
1 ;
2T
T (f) df (2T)2
2
2 f
f
2T
f
f
which goes to zero as T ! 1. Third, for jf j f > 0,
1 :
jT (f)j 2T(f)
2
Now, using the rst property of T , write
SX (f) = SX (f)
Then
SX (f) ,
1
Z
,1
Z
1
,1
T (f , ) d =
Z
SX () T (f , ) d =
1
,1
Z
1
,1
SX (f) T (f , ) d:
[SX (f) , SX ()] T (f , ) d:
For the next step, let " > 0 be given, and use the continuity of SX at f to get
the existence of a f > 0 such that for jf , j < f, jSX (f) , SX ()j < ".
Now break up the range of integration into such that jf , j < f and such that jf , j f. For the rst range, we need the calculation
Z
f +f
f ,f
[SX (f) , SX ()] T (f , ) d Z f +f
"
"
jSX (f) , SX ()j T (f , ) d
f ,f
Z f +f
f ,f
Z
1
,1
T (f , ) d
T (f , ) d = ":
July 18, 2002
214
Chap. 6 Introduction to Random Processes
For the second range of integration, consider the integral
Z
1
f +f
) d [SX (f) , SX ()] T (f ,
1
Z
f +f
Z
1
f +f
jSX (f) , SX ()j T (f , ) d
(jSX (f)j + jSX ()j) T (f , ) d
= jSX (f)j
+
Observe that
Z
1
f +f
f +f
1
Z
1
Z
f +f
T (f , ) d
jSX ()j T (f , ) d:
T (f , ) d =
,f
Z
,1
T () d;
which goes to zero by the second property of T . Using the third property, we
have
1
Z
f +f
Z
jSX ()j T (f , ) d =
,f
,1
jSX (f , )j T () d
1
2T (f)
2
Z
,f
,1
jSX (f , )j d;
which also goes to zero as T ! 1.
Note 3. In applying the derivation in Note 2 to the special case f = 0 in
(6.21) we do not need SX (f) to be continuous for all f, we only need continuity
at f = 0.
6.9. Problems
Recall that if
H(f) =
1
Z
,1
h(t)e,j 2ft dt;
then the inverse Fourier transform is
h(t) =
Z
1
,1
H(f)ej 2ft df:
With these formulas in mind, the following table of Fourier transform pairs may
be useful.
July 18, 2002
6.9 Problems
215
H(f)
Tf)
2T sin(2
2 Tf
h(t)
I[,T;T ] (t)
2W sin(2Wt)
2Wt
I[,W;W ] (f)
Tf) 2
(1 , jtj=T )I[,T;T ](t) T sin(
Tf
2
W sin(Wt)
(1 , jf j=W)I[,W;W ] (f)
Wt
1
e,t u(t)
+ j2f
2
e,jtj
2
+ (2f)2
e,2jf j
2 + t2
p
e,(t=)2 =2
2 e,2 (2f )2 =2
Problems x6.1: Mean, Correlation, and Covariance
1. The Cauchy{Schwarz inequality says that for any random variables U and
V,
E[UV ]2 E[U 2] E[V 2 ]:
(6.24)
Derive this formula as follows. For any constant , we can always write
0 E[jU , V j2 ]
= E[U 2] , 2E[UV ] + 2 E[V 2]:
(6.25)
Now set = E[UV ]=E[V 2 ] and rearrange the inequality. Note that since
equality holds in (6.25) if and only if U = V , equality holds in (6.24) if
and only if U is a multiple of V .
Remark. The same technique can be used to show that for complexvalued waveforms,
Z
1
,1
2
g() h() d
Z
1
,1
jg()j2 d Z
1
,1
jh()j2 d;
where the asterisk denotes complex conjugation, and for any complex
number z, jz j2 = z z .
2. Show that RX (t1 ; t2) = E[Xt1 Xt2 ] is a positive semidenite function
in the sense that for any real or complex constants c1 ; : : :; cn and any
July 18, 2002
216
Chap. 6 Introduction to Random Processes
times t1; : : :; tn,
Hint: Observe that
n X
n
X
i=1 k=1
ci RX (ti ; tk)ck 0:
2 n
X
E ci Xti i=1
0:
3. Let Xt for t > 0 be a random process with zero mean and correlation
function RX (t1 ; t2) = min(t1 ; t2). If Xt is Gaussian for each t, write down
the density of Xt .
Problems x6.2: Wide-Sense Stationary Processes
4. Find the correlation function corresponding to each of the following
power
2
5.
6.
7.
8.
? 9.
spectral densities. (a) (f). (b) (f , f0 )+(f +f0 ). (c) e,f =2 . (d) e,jf j .
Let Xt be a WSS random process with power spectral density SX (f) =
I[,W;W ] (f). Find E[Xt2 ].
Explain why each of the following frequency functions cannot be a power
spectral density. (a) e,f u(f), where u is the unit step function. (b) e,f 2 cos(f).
(c) (1 , f 2 )=(1 + f 4 ). (d) 1=(1 + jf 2 ).
For each of the following functions, determine whether or not it is a valid
correlation function.
(a) sin(). (b) cos(). (c) e, 2 =2 . (d) e,j j . (e) 2 e,j j . (f) I[,T;T ] ().
Let R0() be a correlation function, and put R() := R0() cos(2f0 )
for some f0 > 0. Determine whether or not R() is a valid correlation
function.
Let S(f) be a real-valued, even, nonnegative function, and put
R() :=
Z
1
,1
S(f)ej 2f d:
Show that R satises properties (i) and (ii) that characterize a correlation
function.
? 10. Let R0 () be a real-valued, even function, but not necessarily a correlation
function. Let R() denote the convolution of R0 with itself, i.e.,
R() :=
1
Z
,1
R0() R0 ( , ) d:
(a) Show that R() is a valid correlation function. Hint: You will need
the remark made in Problem 1.
July 18, 2002
6.9 Problems
217
(b) Now suppose that R0() = I[,T;T ] (). In this case, what is R(),
and what is its Fourier transform?
11. A discrete-time random process is WSS if E[Xn] does not depend on n
and if the correlation E[Xn+k Xk ] does not depend on k. In this case we
write RX (n) = E[Xn+k Xk ]. For discrete-time WSS processes, the power
spectral density is dened by
SX (f) :=
1
X
n=,1
RX (n)e,j 2fn ;
which is a periodic function of f with period one. By the formula for
Fourier series coecients
RX (n) =
Z 1=2
,1=2
SX (f)ej 2fn df:
(a) Show that RX is an even function of n.
(b) Show that SX is a real and even function of f.
Problems x6.3: WSS Processes through LTI Systems
12. White noise with power spectral density SX (f) = N0 =2 is applied to a
lowpass lter with transfer function
H(f) =
1 , f 2 ; jf j 1;
0; jf j > 1:
Find the output power of the lter.
13. White noise with power spectral density N0 =2 is applied to a lowpass
lter with transfer function H(f) = sin(f)=(f). Find the output noise
power from the lter.
14. A WSS input signal Xt with correlation function RX () = e, 2 =2 2is passed
through an LTI system with transfer function H(f) = e,(2f ) =2 . Denote the system output by Yt . Find (a) the cross power spectral density,
SXY (f); (b) the cross-correlation, RXY (); (c) E[Xt1 Yt2 ]; (d) the output
power spectral density, SY (f); (e) the output auto-correlation, RY ();
(f) the output power PY .
15. White noise with power spectral density SX (f) = N0 =2 is applied to a
1 e,t=(RC ) u(t). Find
lowpass RC lter with impulse response h(t) = RC
(a) the cross power spectral density, SXY (f); (b) the cross-correlation,
RXY (); (c) E[Xt1 Yt2 ]; (d) the output power spectral density, SY (f);
(e) the output auto-correlation, RY (); (f) the output power PY .
July 18, 2002
218
Chap. 6 Introduction to Random Processes
16. White noise with power spectral density N0 =2 is passed through a linear,
17.
18.
19.
20.
time-invariant system with impulse response h(t) = 1=(1 + t2 ). If Yt
denotes the lter output, nd E[Yt+1=2Yt ].
A WSS process Xt with correlation function RX () = 1=(1+ 2) is passed
through an LTI system with impulse response h(t) = 3 sin(t)=(t). Let
Yt denote the system output. Find the output power PY .
White noise with power spectral density SX (f) = N0 =2 is passed though
a lter with impulse response h(t) = I[,T=2;T=2](t). Find and sketch the
correlation function of the lter output.
R
Let Xt be a WSS random process, and put Yt := tt,3 X d. Determine
whether or not Yt is WSS.
Consider the system
Yt = e,t
21.
22.
23.
24.
Z
t
,1
e X d:
Assume that Xt is zero mean white noise with power spectral density
SX (f) = N0 =2. Show that Xt and Yt are J-WSS, and nd RXY (),
SXY (f), SY (f), and RY ().
A zero-mean, WSS process Xt with correlation function (1 ,j j)I[,1;1]()
is to be processed by a lter with transfer function H(f) designed so that
the system output Yt has correlation function
RY () = sin()
:
Find a formula for, and sketch the required lter H(f).
R1
Let XtRbe a WSS random process. Put Yt := ,1
h(t , )X d, and
1 g(t , )X d. Determine whether or not Y and Z are JZt := ,1
t
t
WSS.
Let Xt be a zero-mean WSS random process with power spectral density
SX (f) = 2=[1 + (2f)2 ]. Put Yt := Xt , Xt,1 .
(a) Show that Xt and Yt are J-WSS.
(b) Find the power spectral density SY (f).
(c) Find the power in Yt .
Let fXt g be a zero-mean wide-sense stationary random process with
power spectral density SX (f). Consider the process
Yt :=
with hn real valued.
1
X
n=,1
hn Xt,n;
July 18, 2002
6.9 Problems
219
(a) Show that fXtg and fYtg are jointly wide-sense stationary.
(b) Show that SY (f) has the form SY (f) = P(f)SX (f) where P is a
real-valued, nonnegative, periodic function of f with period 1. Give
a formula for P(f).
25. System Identication. When white noise fWtg with power spectral density
SW (f) = 3 is applied to a certain2 linear time-invariant system, the output
has power spectral density e,f . Now let fXt g be a zero-mean, widesense
stationary random process with power spectral density SX (f) =
ef 2 I[,1;1] (f). If fYt g is the response of the system to fXt g, nd RY ()
for all .
? 26. Extension to Complex Random Processes. If Xt is a complex-valued random process, then its auto-correlation function is dened by RX (t1; t2 ) :=
E[Xt1 Xt2 ]. Similarly, if Yt is another complex-valued random process,
their cross-correlation is dened by RXY (t1; t2) := E[Xt1 Yt2 ]. The concepts of WSS, J-WSS, the power spectral density, and the cross power
spectral density are dened as in the realR 1
case. Now suppose that Xt
is a complex WSS process and that Yt = ,1 h(t , )X d, where the
impulse response h is now possibly complex valued.
(a) Show that RX (,) = RX () .
(b) Show that SX (f) must be real valued. Hint: What does part (a) say
about the real and imaginary parts of RX ()?
(c) Show that
E[Xt1 Yt2 ] =
Z
1
h(,) RX ([t1 , t2] , ) d:
,1
(d) Even though the above result is a little dierent from (6.4), show
that (6.5) and (6.6) still hold for complex random processes.
27. Let Xn be a discrete-time WSS process as dened in Problem 11. Put
Yn =
1
X
k=,1
hk Xn,k :
(a) Suggest a denition of jointly WSS discrete-time processes and show
that Xn and Yn are J-WSS.
(b) Suggest a denition for the cross power spectral density of two discretetime J-WSS processes and show that (6.5) holds if
H(f) :=
1
X
n=,1
Also show that (6.6) holds.
July 18, 2002
hne,j 2fn :
220
Chap. 6 Introduction to Random Processes
(c) Write out the discrete-time analog of Example 6.9. What condition
do you need to impose on W2 ?
Problems x6.4: The Matched Filter
28. Determine the matched lter if the known radar pulse is v(t) = sin(t)I[0;] (t),
and Xt is white noise with power spectral density SX (f) = N0 =2. For
what values of t0 is the optimal system causal?
p
29. Determine the matched lter if v(t) = e,(t= 2)2 =2 , and SX (f) = e,(2f )2 =2 .
30. Derive the matched lter for a discrete-time received signal v(n) + Xn .
Hint: Problems 11 and 27 may be helpful.
Problems x6.5: The Wiener Filter
31. Suppose Vt and Xt are J-WSS. Let Ut := Vt + Xt . Show that Ut and Vt
32.
33.
34.
35.
36.
are J-WSS.
Suppose Ut = Vt + Xt , where Vt and Xt are each zero mean and WSS.
Also assume that E[VtX ] = 0 for all t and . Express the Wiener lter
H(f) in terms of SV (f) and SX (f).
Using the setup of the previous problem, nd the Wiener lter H(f) and
the corresponding impulse response h(t) if SV (f) = 2=[2 + (2f)2 ] and
SX (f) = 1.
Remark. You may want to compare your answer with the causal Wiener
lter found in Example 6.11.
Using the setup of Problem
32, suppose that the signal has correlation
2
sin
and that the noise has power spectral density
function RV () = SX (f) = 1 , I[,1;1] (f). Find the Wiener lter H(f) and the corresponding
impulse response h(t).
Find the impulse response of the whitening lter K(f) of Example 6.11.
Is it causal?
Derive the Wiener lter for discrete-time J-WSS signals Un and Vn with
zero means. Hints: (i) First derive the analogous orthogonality principle.
(ii) Problems 11 and 27 may be helpful.
Problems x6.6: ? Expected Time-Average Power and the Wiener{Khinchin Theorem
37. Recall that by Problem 2, correlation functions are positive semidenite.
Use this fact to prove that the double integral in (6.19) is nonnegative,
assuming that RX is continuous. Hint: Since RX is continuous, the double
integral in (6.19) is a limit of Riemann sums of the form
XX
RX (ti , tk )e,j 2f (ti ,tk ) titk :
i
k
July 18, 2002
6.9 Problems
221
38. Let Yt be Ra WSS process. In each of the cases below, determine whether
or not 21T ,TT Yt dt ! E[Yt] in mean square.
(a) The covariance CY () = e,j j .
(b) The covariance CY () = sin()=().
39. Let Yt = cos(2t + ), where uniform[,; ]. As in Example 6.1,
E[Yt] = 0. Determine whether or not
Z T
1
lim
Y dt ! 0:
T !1 2T ,T t
40. Let Xt be a zero-mean, WSS process. For xed , you might expect
1 Z T X X dt
2T ,T t+ t
to converge in mean square to E[Xt+ Xt ] = RX (). Give conditions on
the process Xt under which this will be true. Hint: Dene Yt := Xt+ Xt .
Remark. When = 0 this says that 21T R,TT Xt2 dt converges in mean
square to RX (0) = PX = PeX .
41. Let Xt be a zero-mean, WSS process. For a xed set B IR, you might
expectyy
1 Z T I (X ) dt
B t
2T
,T
to converge in mean square to E[IB (Xt )] = }(Xt 2 B). Give conditions on
the process Xt under which this will be true. Hint: Dene Yt := IB (Xt ).
yy This is the fraction of time during [,T; T ] that Xt 2 B . For example, we might have
B = [vmin; vmax] being the acceptable operating range of the voltage of some device. Then
we would be interested in the fraction of time during [,T; T ] that the device is operating
normally.
July 18, 2002
222
Chap. 6 Introduction to Random Processes
July 18, 2002
CHAPTER 7
Random Vectors
This chapter covers concepts that require some familiarity with linear algebra, e.g., matrix-vector multiplication, determinants, and matrix inverses. Section 7.1 denes the mean vector, covariance matrix, and characteristic function
of random vectors. Section 7.2 introduces the multivariate Gaussian random
vector and illustrates several of its properties. Section 7.3 discusses both linear and nonlinear minimum mean squared error estimation of random variables
and random vectors. Estimators are obtained using appropriate orthogonality
principles. Section 7.4 covers the Jacobian formula for nding the density of
Y = G(X) when the density of X is given. Section 7.5 introduces complexvalued random variables and random vectors. Emphasis is on circularly symmetric Gaussian random vectors due to their importance in the design of digital
communication systems.
7.1. Mean Vector, Covariance Matrix, and Characteristic
Function
If X = [X1; : : :; Xn ]0 is a random vector, we put mi := E[Xi], and we dene
Rij to be the covariance between Xi and Xj , i.e.,
Rij := cov(Xi ; Xj ) := E[(Xi , mi )(Xj , mj )]:
Note that Rii = cov(Xi ; Xi) = var(Xi ). We call the vector m := [m1 ; : : :; mn ]0
the mean vector of X, and we call the n n matrix R := [Rij ] the covariance
matrix of X. Note that since Rij = Rji, the matrix R is symmetric. In other
words, R0 = R. Also note that for i 6= j, Rij = 0 if and only if Xi and Xj
are uncorrelated. Thus, R is a diagonal matrix if and only if Xi and Xj are
uncorrelated for all i 6= j.
If we dene the expectation of a random vector X = [X1; : : :; Xn ]0 to be the
vector of expectations, i.e.,
2
E[X] :=
6
4
E[X1]
3
.. 75 ;
.
E[Xn]
then E[X] = m.
We dene the expectation of a matrix of random variables to be the matrix
of expectations. This leads to the following characterization of the covariance
matrix R. To begin, observe that if we multiply an n 1 matrix (a column
vector) by a 1 n matrix (a row vector), we get an n n matrix. Hence,
223
224
Chap. 7 Random Vectors
(X , m)(X , m)0 is equal to
2
3
(X1 , m1 )(X1 , m1 ) (X1 , m1 )(Xn , mn )
6
7
..
..
4
5:
.
.
(Xn , mn )(X1 , m1 ) (Xn , mn )(Xn , mn )
The expectation of the ijth entry is cov(Xi ; Xj ) = Rij . Thus,
R = E[(X , m)(X , m)0 ] =: cov(X):
Note the distinction between the covariance of a pair of random variables, which
is a scalar, and the covariance of a column vector, which is a matrix.
Example 7.1. Write out the covariance matrix of the three-dimensional
random vector U := [X; Y; Z]0 if U has zero mean.
Solution. The covariance matrix
of U is
2
3
E[X 2] E[XY ] E[XZ]
cov(U) = E[UU 0] = 4 E[Y X] E[Y 2] E[Y Z] 5 :
E[ZX] E[ZY ] E[Z 2 ]
Example 7.2. Let X = [X1;0 : : :; Xn]0 be a random vector with covariance
matrix R, and let c = [c1 ; : : :; cn] be a given constant vector. Dene the scalar
Z := c0(X , m) =
Show that
n
X
i=1
ci (Xi , mi );
var(Z) = c0Rc:
Solution. First note that since E[X] = m, Z has zero mean. Hence,
var(Z) = E[Z 2]. Write
E[Z 2]
X
n
X
n
= E
ci (Xi , mi )
=
ci cj E[(Xi , mi )(Xj , mj )]
=
=
i=1
n X
n
X
i=1 j =1
n X
n
X
j =1
cj (Xj , mj )
ci cj Rij
i=1 j =1
n X
n
X
ci
Rij cj :
i=1
j =1
The inner sum is the ith component of the column vector Rc. Hence, E[Z 2] =
c0 (Rc) = c0 Rc.
July 18, 2002
7.1 Mean Vector, Covariance Matrix, and Characteristic Function
225
The preceding example shows that a covariance matrix must satisfy
c0Rc = E[Z 2] 0
for all vectors c = [c1; : : :; cn]0. A symmetric matrix R with the property
c0 Rc 0 for all vectors c is said to be positive semidenite. If c0Rc > 0 for
all nonzero c, then R is called positive denite.
Recall that the norm of a column vector x is dened by kxk := (x0x)1=2. It
is shown in Problem 5 that
E[kX , E[X]k2] = tr(R) =
n
X
i=1
var(Xi );
where the trace of an n n matrix R is dened by
tr(R) :=
n
X
i=1
Rii:
The joint characteristic function of X = [X1 ; : : :; Xn ]0 is dened by
0
'X () := E[ej X ] = E[ej (1X1 ++n Xn ) ];
where = [1; : : :; n]0.
Note that when X has a joint density, 'X () = E[ej 0X ] is just the ndimensional Fourier transform,
'X () =
Z
j 0 x fX (x) dx;
e
n
IR
(7.1)
and the joint density can be recovered using the multivariate inverse Fourier
transform:
1 Z e,j 0x ' () d:
fX (x) = (2)
X
n IRn
Whether X has a joint density or not, the joint characteristic function can
be used to obtain its various moments.
Example 7.3. The components of the mean vector and covariance matrix
can be obtained from the characteristic function as follows. Write
@ j 0X
j 0X
@k E[e ] = E[e jXk ];
and
@ 2 E[ej 0X ] = E[ej 0X (jX )(jX )]:
`
k
@ @
Then
` k
@ E[ej 0X ] = j E[X ];
k
@k
=0
July 18, 2002
226
Chap. 7 Random Vectors
and
@ 2 E[ej 0X ] = ,E[X X ]:
` k
@` @k
=0
Higher-order moments can be obtained in a similar fashion.
If the components of X = [X1 ; : : :; Xn ]0 are independent, then
0
'X () = E[ej X ]
= E[ej (1X1 ++n Xn ) ]
n
Y
= E
=
=
k=1
n
Y
k=1
n
Y
k=1
ejk Xk
E[ejkXk ]
'Xk (k ):
In other words, the joint characteristic function is the product of the marginal
characteristic functions.
7.2. The Multivariate Gaussian
A random vector X = [X1 ; : : :; Xn]0 is said to be Gaussian or normal if
every linear combination of the components of X, e.g.,
n
X
i=1
ci Xi ;
(7.2)
is a scalar Gaussian random variable.
Example 7.4. If X is a Gaussian random vector, then the numerical average of its components,
n
1X
n Xi ;
i=1
is a scalar Gaussian random variable.
An easy consequence of our denition of Gaussian random vector is that
any subvector is also Gaussian. To see this, suppose X = [X1; : : :; Xn]0 is a
Gaussian random vector. Then every linear combination of the components of
the subvector [X1; X3 ; X5]0 is of the form (7.2) if we take ci = 0 for i not equal
to 1; 3; 5.
July 18, 2002
7.2 The Multivariate Gaussian
227
Sometimes it is more convenient to express linear combinations as the product of a row vector times the column vector X. For example, if we put
c = [c1; : : :; cn]0, then
n
X
ci Xi = c0 X:
i=1
Now suppose that Y = AX for some pn matrix A. Letting c = [c1; : : :; cp]0,
every linear combination of the p components of Y has the form
p
X
i=1
ci Yi = c0Y = c0(AX) = (A0 c)0 X;
which is a linear combination of the components of X, and therefore normal.
We can even add a constant vector. If Y = AX + b, where A is again p n,
and b is p 1, then
c0 Y = c0 (AX + b) = (A0 c)0X + c0b:
Adding the constant c0b to the normal random variable (A0 c)0 X results in another normal random variable (with a dierent mean).
In summary, if X is a Gaussian random vector, then so is AX + b for any
p n matrix A and any p-vector b.
The Characteristic Function of a Gaussian Random Vector
We now nd the joint characteristic function of the Gaussian random vector
X, 'X () = E[ej 0X ]. Since we are assuming that X is a normal vector, the
quantity Y := 0X is a scalar Gaussian random variable. Hence,
'X () = E[ej 0X ] = E[ejY ] = E[ejY ]=1 = 'Y ()=1 :
In other words, all we have to do is nd the characteristic function of Y and
evaluate it at = 1. Since Y is normal, we know its characteristic function is
(recall Example 3.13)
'Y () := E[ejY ] = ej,2 2 =2;
where := E[Y ], and 2 := var(Y ). Assuming X has mean vector m, =
E[ 0X] = 0m. Suppose X has covariance matrix R. Next, write var(Y ) =
E[(Y , )2]. Since
Y , = 0(X , m);
Example 7.2 tells us that var(Y ) = 0R. It now follows that
'X () = ej 0m, 0 R=2:
If X is a Gaussian random vector with mean vector m and covariance matrix
R, we write X N(m; R).
July 18, 2002
228
Chap. 7 Random Vectors
For Gaussian Random Vectors, Uncorrelated Implies Independent
If the components of a random vector are uncorrelated, then the covariance
matrix is diagonal. In general, this is not enough to prove that the components
of the random vector are independent. However, if X is a Gaussian random
vector, then the components are independent. To see this, suppose that X is
Gaussian with uncorrelated components. Then R is diagonal, say
2
R =
6
6
4
12
0
07
3
...
n2
7;
5
where i2 = Rii = var(Xi ). The diagonal form of R implies that
0R =
and so
n
X
i=1
i2 i2;
0
0
'X () = ej m, R=2 =
In other words,
'X () =
n
Y
i=1
n
Y
i=1
2 2
ejimi ,i i =2 :
'Xi (i );
where 'Xi (i) is the characteristic function of the N(mi ; i2) density. Multivariate inverse Fourier transformation then yields
fX (x) =
n
Y
i=1
fXi (xi );
where fXi N(mi ; i2 ). This establishes the independence of the Xi .
Example
7.5. Let X1; : : :; Xn be i.i.d. N(m; 2 ) random variables. Let
1 Pn
X := n i=1 Xi denote the average of the Xi . Furthermore, for j = 1; : : :; n,
put Yj := Xj , X. Show that X and Y := [Y1 ; : : :; Yn ]0 are jointly normal and
independent.
Solution. Let X := [X1; : : :; Xn]0, and put a := [ n1 n1 ]. Then X = aX.
Next, observe that 2 3 2
3 2
3
Y1
X1
X
6 . 7
6 . 7 6 . 7
4 .. 5 = 4 .. 5 , 4 .. 5 :
Yn
Xn
X
July 18, 2002
7.2 The Multivariate Gaussian
229
Let M denote the n n matrix with each row equal to a; i.e., Mij = 1=n for
all i; j. Then Y = X , MX = (I , M)X, and Y is a jointly normal random
vector. Next consider the vector
a
=
Z := X
I , M X:
Y
Since Z is a linear transformation of the Gaussian random vector X, Z is also
a Gaussian random vector. Furthermore, its covariance matrix has the blockdiagonal form (see Problem 11)
var( X )
0
0
E[Y Y 0 ] :
This implies, by Problem 12, that X and Y are independent.
The Density Function of a Gaussian Random Vector
We now derive the general multivariate Gaussian density function under the
assumption that R is positive denite. Using the multivariate Fourier inversion
formula,
1 Z e,jx0 ' () d
fX (x) = (2)
X
n IRn
Z
1
,jx0 ej 0m, 0 R=2 d
= (2)
n IRn e
Z
1
= (2)n n e,j (x,m)0 e, 0 R=2 d:
IR
Now make the multivariate change of variable = Rp1=2, d = det R1=2 d.
The existence of R1=2 and the fact that det(R1=2) = det R > 0 are shown in
the Notes.1 We have
1 Z e,j (x,m)0 R,1=2 e, 0 =2 d
fX (x) = (2)
n IRn
det R1=2
Z
1
,j fR,1=2 (x,m)g0 e, 0 =2 p d :
= (2)
n IRn e
det R
Put t = R,1=2(x , m). Then
1 Z e,jt0 e, 0 =2 p d
fX (x) = (2)
n IRn
det R
Z
n
1
Y 1
2 =2
1
,
,
jt
i
i
i
di :
e
= p
e
det R i=1 2 ,1
July 18, 2002
230
Chap. 7 Random Vectors
Observe that e,i2 =2 is the characteristic function of a scalar N(0; 1) random
variable. Hence,
n 1
Y
p e,t2i =2
fX (x) = p 1
det R i=1 2
n 1p
1X
2
=
exp
,
2 i=1 ti
(2)n=2 det R
1p
=
exp[,t0t=2]
n=
2
(2)
det R
exp[, 21 (x , m)p0 R,1(x , m)]
:
=
(2)n=2 det R
With the norm notation, t0t = ktk2 = kR,1=2(x , m)k2 , and so
,1=2 , m)k2 =2]
p
fX (x) = exp[,kR n=2(x
(2)
det R
as well.
(7.3)
7.3. Estimation of Random Vectors
Consider a pair of random vectors X and Y , where X is not observed, but
Y is observed. We wish to estimate X based on our knowledge of Y . By an
estimator of X based on Y , we mean a function g(y) such that X^ := g(Y )
is our estimate or \guess" of the value of X. What is the best function g to
use? What do we mean by best? In this section, we dene g to be best if it
minimizes the mean squared error (MSE) E[kX , g(Y )k2 ] for all functions
g in some class of functions. The optimal function g is called the minimum
mean squared error (MMSE) estimator. In the next subsection, we restrict
attention to the class of linear estimators (actually ane). Later we allow g
to be arbitrary.
Linear Minimum Mean Squared Error Estimation
We now restrict attention to estimators g of the form g(y) = Ay + b, where
A is a matrix and b is a column vector. Such a function of y is said to be ane.
If b = 0, then g is linear. It is common to say g is linear even if b 6= 0 since
this only is a slight abuse of terminology, and the meaning is understood. We
shall follow this convention. To nd the best linear estimator is simply to nd
the matrix A and the column vector b that minimize the MSE, which for linear
estimators has the form
E X , (AY + b)2 :
Letting mX = E[X] and mY = E[Y ], the MSE is equal to
2
:
E (X , mX ) , A(Y , mY ) + mX , AmY , b
July 18, 2002
7.3 Estimation of Random Vectors
231
p
l
^p
Figure 7.1. Geometric interpretation of the orthogonality principle.
Since the left-hand quantity in braces is zero mean, and since the right-hand
quantity in braces is a constant (nonrandom), the MSE simplies to
E (X , mX ) , A(Y , mY )2 + kmX , AmY , bk2:
No matter what matrix A is used, the optimal choice of b is
b = mX , AmY ;
and the estimate is g(Y ) = AY + b = A(Y , mY ) + mX . The estimate is truly
linear in Y if and only if AmY = mX .
We now turn to the problem of minimizing
E (X , mX ) , A(Y , mY )2 :
The matrix A is optimal if and only if for all matrices B,
E (X , mX ) , A(Y , mY )2 E (X , mX ) , B(Y , mY )2 :
(7.4)
The following condition is equivalent, and easier to use. This equivalence is
known as the orthogonality principle. It says that (7.4) holds for all B if
and only if
E B(Y , mY ) 0 (X , mX ) , A(Y , mY ) = 0; for all B:
(7.5)
Below we prove that (7.5) implies (7.4). The converse is also true, but we shall
not use it in this book.
We rst explain the terminology and show geometrically why it is true.
Recall that given a straight line l and a point p not on the line, the shortest
path between p and the line is obtained by dropping a perpendicular segment
from the point to the line as shown in Figure 7.1. The point p^ on the line where
the segment touches is closer to p than any other point on the line. Notice also
that the vertical segment, p , p^, is orthogonal to the line l. In our situation,
the role of p is played by the random variable X , mX , the role of p^ is played
by the random variable A(Y , mY ), and the role of the line is played by the
set of all random variables of the form B(Y , mY ) as B runs over all matrices.
Since the inner product between two random vectors U and V can be dened
as E[V 0U], (7.5) says that
(X , mX ) , A(Y , mY )
July 18, 2002
232
Chap. 7 Random Vectors
is orthogonal to all B(Y , mY ).
To use (7.5), rst note that since it is a scalar equation, the left-hand side
is equal to its trace. Bringing the trace inside the expectation and using the
fact that tr() = tr(), we see that the left-hand side of (7.5) is equal to
,
E tr (X , mX ) , A(Y , mY ) (Y , mY )0 B 0 :
Taking the trace back out of the expectation shows that (7.5) is equivalent to
tr([RXY , ARY ]B 0 ) = 0; for all B;
(7.6)
where RY = cov(Y ) is the covariance matrix of Y , and
RXY = E[(X , mX )(Y , mY )0 ] =: cov(X; Y ):
(This is our third usage of cov. We have already seen the covariance of a pair of
random variables and of a random vector. This is the third type of covariance,
that of a pair of random vectors. When X and Y are both of dimension one,
this reduces to the rst usage of cov.) By Problem 5, it follows that (7.6) holds
if and only if A solves the equation
ARY = RXY :
If RY is invertible, the unique solution of this equation is
A = RXY R,Y 1 :
In this case, the complete estimate of X is
RXY R,Y 1 (Y , mY ) + mX :
Even if RY is not invertible, ARY = RXY always has a solution, as shown in
Problem 16.
Example 7.6 (Signal in Additive Noise). Let X denote a random signal of
zero mean and known covariance matrix RX . Suppose that in order to estimate
X, all we have available is the noisy measurement
Y = X + W;
where W is a noise vector with zero mean and known covariance matrix RW .
Further assume that the covariance between the signal and noise, RXW , is zero.
Find the linear MMSE estimate of X based on Y .
Solution. Since E[Y ] = E[X + W ] = 0, mY = mX = 0. Next,
RXY = E[(X , mX )(Y , mY )0]
= E[X(X + W )0]
= RX :
July 18, 2002
7.3 Estimation of Random Vectors
Similarly,
It follows that
233
RY = E[(Y , mY )(Y , mY )0 ]
= E[(X + W)(X + W )0 ]
= RX + RW :
X^ = RX (RX + RW ),1Y:
We now show that (7.5) implies (7.4). To simplify the notation, we assume
zero means.
E[kX , BY k2] = E[k(X , AY ) + (AY , BY )k2]
= E[k(X , AY ) + (A , B)Y )k2]
= E[kX , AY k2] + E[k(A , B)Y k2 ];
where the cross terms 2E[f(A , B)Y g0(X , AY )] vanish by (7.5). If we drop
the right-hand term in the above display, we obtain
E[kX , BY k2] E[kX , AY k2 ]:
Minimum Mean Squared Error Estimation
We begin with an observed vector Y and a scalar X to be estimated. (The
generalization to vector-valued X is left to the problems.) We seek a real-valued
function g(y) with the property that
E[jX , g(Y )j2] E[jX , h(Y )j2]; for all h:
(7.7)
Here we are not requiring g and/or h to be linear. We again have an orthogonality principle (see Problem 17) that says that the above condition is
equivalent to
E[fX , g(Y )gh(Y )] = 0; for all h:
(7.8)
We claim that g(y) = E[X jY = y] is the optimal function g if X and Y are
jointly continuous. In this case, we can use the law of total probability to write
E[fX , g(Y )gh(Y )] =
Z
E fX , g(Y )gh(Y )Y = y fY (y) dy
Applying the substitution law to the conditional expectation yields
E fX , g(Y )gh(Y )Y = y = E fX , g(y)gh(y) Y = y
= E[X jY = y]h(y) , g(y)h(y)
= g(y)h(y) , g(y)h(y)
= 0:
In order that the expectations in (7.7) be nite, we assume E[X 2 ] < 1, and we restrict
g and h to be such that E[g(Y )2 ] < 1 and E[h(Y )2 ] < 1.
July 18, 2002
234
Chap. 7 Random Vectors
It is shown in Problem 18 that the solution g of (7.8) is unique. We can use
this fact to nd conditional expectations if X and Y are jointly normal. For
simplicity, assume zero means. We claim that g(y) = RXY R,Y 1 y solves (7.8).
We rst observe that
X , RXY R,Y 1Y
Y
=
I ,RXY R,Y 1
0
I
X
Y
is a linear transformation of [X; Y 0]0 and so the left-hand side is a Gaussian
random vector whose top and bottom entries are easily seen to be uncorrelated:
E[(X , RXY R,Y 1Y )Y 0] = RXY , RXY R,Y 1 RY
= 0:
Being jointly Gaussian and uncorrelated, they are independent. Hence, for any
function h(y),
E[(X , RXY R,Y 1Y )h(Y )] = E[X , RXY R,Y 1 Y ] E[h(Y )]
= 0 E[h(Y )]:
We have now learned two things about jointly normal random variables. First,
we have learned how to nd conditional expectations. Second, in looking for
the best estimator function g, and not requiring g to be linear, we found that
the best g is linear if X and Y are jointly normal.
7.4. Transformations of Random Vectors
If G(x) is a vector-valued function of x 2 IRn , and X is an IRn -valued
random vector, we can dene a new random vector by Y = G(X). If X has
joint density fX , and G is a suitable invertible mapping, then we can nd a
relatively explicit formula for the joint density of Y . Suppose that the entries
of the vector equation y = G(x) are given by
2
6
4
y1
..
.
yn
3
7
5
2
=
6
4
3
g1(x1 ; : : :; xn)
7
..
5:
.
gn(x1 ; : : :; xn)
If G is invertible, we can apply G,1 to both sides of y = G(x) to obtain
G,1(y) = x. Using the notation H(y) := G,1(y), we can write the entries of
the vector equation x = H(y) as
2
6
4
x1
..
.
xn
3
7
5
2
=
6
4
3
h1(y1 ; : : :; yn )
7
..
5:
.
hn(y1 ; : : :; yn )
July 18, 2002
7.4 Transformations of Random Vectors
235
Assuming that H is continuous and has continuous partial derivatives, let
2
6
6
6
6
6
6
6
4
H 0(y) :=
@h1
@y1
@hi
@y1
@hn
@y1
..
.
..
.
@h1
@yn
3
..
.
7
7
7
@hi 7 :
@yn 7
.. 77
. 5
@hn
@yn
To compute }(Y 2 C) = }(G(X) 2 C), it is convenient to put B = fx : G(x) 2
C g so that
}(Y 2 C) = }(G(X) 2 C)
= }Z (X 2 B)
=
IB (x) fX (x) dx:
n
IR
Now apply the multivariate change of variable x = H(y). Keeping in mind that
dx = j det H 0(y) j dy,
}(Y 2 C) =
Z
IRn
IB (H(y)) fX (H(y)) j det H 0 (y) j dy:
Observe that IB (H(y)) = 1 if and only if H(y) 2 B, which happens if and
only if G(H(y)) 2 C. However, since H = G,1, G(H(y)) = y, and we see that
IB (H(y)) = IC (y). Thus,
}(Y 2 C) =
Z
C
fX (H(y)) j det H 0 (y) j dy:
It follows that Y has density
fY (y) = fX (H(y)) j det H 0(y) j:
Since det H 0 (y) is called the Jacobian of H, the preceding equations are sometimes called Jacobian formulas.
Example 7.7. pLet X and Y be independent univariate N(0; 1) random
variables. Let R = X 2 + Y 2 and = tan,1(Y=X). Find the joint density of
R and .
Solution. The transformation G is given by
p
r = x2 + y2 ;
= tan,1 (y=x):
The rst thing we must do is nd the inverse transformation H. To begin,
observe that x tan = y. Also, r2 = x2 + y2 . Write
x2 tan2 = y2 = r2 , x2:
July 18, 2002
236
Chap. 7 Random Vectors
Then x2 (1 + tan2 ) = r2 . Since 1 + tan2 = sec2 ,
x2 = r2 cos2 :
It then follows that y2 = r2 , x2 = r2 (1 , cos2 ) = r2 sin2 . Hence, the inverse
transformation H is given by
x = r cos ;
y = r sin :
The matrix H 0 (r; ) is given by
H 0 (r; ) =
"
@x
@r
@y
@r
@x
@
@y
@
#
=
cos ,r sin ;
sin r cos and det H 0 (r; ) = r cos2 + r sin2 = r. Then
fR; (r; ) = fXY (x; y) x=r cos j det H 0 (r; ) j
y=r sin = fXY (r cos ; r sin ) r:
Now,2 since
X and Y are independent N(0; 1), fXY (x; y) = fX (x) fY (y) =
e,(x +y2 )=2=(2), and
1 ; r 0; , < :
fR; (r; ) = re,r2 =2 2
Thus, R and are independent, with R having a Rayleigh density and having
a uniform(,; ] density.
7.5. Complex Random Variables and Vectors
A complex random variable is a pair of real random variables, say X
and Y , written in the form Z = X + jY , where j denotes the square root of
,1. The advantage of the complex notation is that it becomes easy to write
down certain functions of (X; Y ). For example, it is easier to talk about
Z 2 = (X + jY )(X + jY ) = (X 2 , Y 2) + j(2XY )
than the vector-valued mapping
2,Y2 X
g(X; Y ) =
:
2XY
Recall that the absolute value of a complex number z = x + jy is
jz j :=
p
x2 + y2 :
July 18, 2002
7.5 Complex Random Variables and Vectors
237
The complex conjugate of z is
z := x , jy;
and so
zz = (x + jy)(x , jy) = x2 + y2 = jz j2:
The expected value of Z is simply
E[Z] := E[X] + j E[Y ]:
The covariance of Z is
cov(Z) := E[(Z , E[Z])(Z , E[Z]) ] = E Z , E[Z]2 :
Note that cov(Z) = var(X) + var(Y ), while
E[(Z , E[Z])2] = [var(X) , var(Y )] + j[2 cov(X; Y )];
which is zero if and only if X and Y are uncorrelated and have the same variance.
If X and Y are jointly continuous real random variables, then we say that
Z = X + jY is a continuous complex random variable with density
fZ (z) = fZ (x + jy) := fXY (x; y):
Sometimes the formula for fXY (x; y) is more easily expressed in terms of the
complex variable z. For example, if X and Y are independent N(0; 1=2), then
,y 2
,jzj2
, x2
= e
fXY (x; y) = p e p p e p
2 1=2 2 1=2
Note that E[Z] = 0 and cov(Z) = 1. Also, the density is circularly symmetric
since jz j2 = x2 + y2 depends only on the distance from the origin of the point
(x; y) 2 IR2.
A complex random vector of dimension n, say
Z = [Z1 ; : : :; Zn]0;
is a vector whose ith component is a complex random variable Zi = Xi + jYi ,
where Xi and Yj are real random variables. If we put
X := [X1 ; : : :; Xn]0 and Y := [Y1; : : :; Yn]0;
then Z = X + jY , and the mean vector of Z is E[Z] = E[X] + j E[Y ]. The
covariance matrix of Z is
C := E[(Z , E[Z])(Z , E[Z])H ];
July 18, 2002
238
Chap. 7 Random Vectors
where the superscript H denotes the complex conjugate transpose. In other
words, the i k entry of C is
Cik = E[(Zi , E[Zi])(Zk , E[Zk ])] =: cov(Zi ; Zk ):
It is also easy to show that
C = (RX + RY ) + j(RY X , RXY ):
(7.9)
For joint distribution purposes, we identify the n-dimensional complex vector Z with the 2n-dimensional real random vector
[X1 ; : : :; Xn; Y1; : : :; Yn]0:
(7.10)
If this 2n-dimensional real random vector has a joint density fXY , then we
write
fZ (z) := fXY (x1; : : :; xn; y1; : : :; yn):
Sometimes the formula for the right-hand side can be written simply in terms
of the complex vector z.
Complex Gaussian Random Vectors
An n-dimensional complex random vector Z = X+jY is said to be Gaussian
if the 2n-dimensional real random vector 0in (7.10)
is jointly Gaussian; i.e., its
characteristic function 'XY (; ) = E[ej ( X +0 Y ) ] has the form
0
R
R
X
XY
0
0
1
0
exp j( mX + mY ) , 2 RY X RY
: (7.11)
Now observe that
0
RX RXY
0
(7.12)
RY X RY
is equal to
0RX + 0 RY + 20 RY X :
On the other hand, if we put w := + j, and use (7.9), then (see Problem 26)
wH Cw = 0(RX + RY ) + 0 (RX + RY ) + 20 (RY X , RXY ):
Clearly, if
RX = RY and RXY = ,RY X ;
(7.13)
then (7.12) is equal to wH Cw=2. Conversely, if (7.12) is equal to wH Cw=2 for
all w = +j, then (7.13) holds (Problem 33). We say that a complex Gaussian
random vector Z = X + jY is circularly symmetric if (7.13) holds. If Z is
circularly symmetric and zero mean, then its characteristic function is
0
0
H
E[ej ( X + Y ) ] = e,w Cw=4 ; w = + j:
July 18, 2002
7.6 Notes
239
The density corresponding to (7.11) is (assuming zero means)
fXY (x; y) =
RX RXY ,1 x
RY X RY
y
p
n
(2) det K
exp , 12 x0 y0
where
;
(7.14)
K := RRX RRXY :
YX
Y
It is shown in Problem 34 that under the assumption of circular symmetry
(7.13),
,z0 C ,1 z
e
(7.15)
fXY (x; y) = n det C ; z = x + jy;
and that C is invertible if and only if K is invertible.
7.6. Notes
Notes x7.2: The Multivariate Gaussian Density
Note 1. Recall that an n n matrix R is symmetric if it is equal to its
transpose; i.e., R = R0. It is positive denite if c0 Rc > 0 for all c =
6 0. We show
that the determinant of a positive-denite matrix is positive. A trivial modication of the derivation shows that the determinant of a positive semidenite
matrix is nonnegative. At the end of the note, we also dene the square root
of a positive-denite matrix.
We start with the well-known fact that a symmetric matrix can be diagonalized [22]; i.e., there is an n n matrix P such that P 0P = PP 0 = I and such
that P 0RP is a diagonal matrix, say
2
P 0RP = =
6
6
4
1
0
07
3
...
n
7:
5
Next, from P 0RP = , we can easily obtain R = P P 0. Since the determinant of a product of matrices is the product of their determinants, det R =
det P det det P 0. Since the determinants are numbers, they can be multiplied
in any order. Thus,
det R =
=
=
=
=
det det P 0 det P
det det(P 0P)
det det I
det 1 n :
July 18, 2002
240
Chap. 7 Random Vectors
Rewrite P 0RP = as RP = P . Then it is easy to see that the columns
of P are eigenvectors of R; i.e., if P has columns p1 ; : : :; pn, then Rpi = i pi.
Next, since P 0P = I, each pi satises p0i pi = 1. Since R is positive denite,
0 < p0i Rpi = p0i (i pi ) = i p0i pi = i :
Thus, each eigenvalue i > 0, and it follows that det R = 1 n > 0.
Because positive-denite matrices are diagonalizable with positive eigenvalues, it is easy to dene their square root by
p
p
R := P P 0;
where
2 p
3
0
1
p
7
6
...
7:
:= 64
5
p
0
n
p
p
p
p
Thus,
p det R = 1 n = det R. Furthermore, from
p pthe denition
of R, it is clear that it is positive denite and satises R R = R. We
also point out that since R = P P 0, Rp,1 = P p,1P 0, where ,1 is diagonal withpdiagonalpentries 1=i ; hence,p R,1 = ( R ),1 . Finally, note that
p
RR,1 R = (P P 0)(P ,1 P 0)(P P 0) = I.
7.7. Problems
Problems x7.1: Mean Vector, Covariance Matrix, and Characteristic
Function
1. The input U to a certain amplier is N(0; 1), and the output is X =
ZU + Y , where the amplier's random gain Z has density
fZ (z) = 37 z 2 ; 1 z 2;
and given Z = z, the amplier's random bias Y is conditionally exponential with parameter z. Assuming that the input U is independent of the
amplier parameters Z and Y , nd the mean vector and the covariance
matrix of [X; Y; Z]0.
2. Find the mean vector and covariance matrix of [X; Y; Z]0 if
2
fXY Z (x; y; z) = 2 exp[,jx ,5ypj , (y , z) =2] ; z 1;
z 2
and fXY Z (x; y; z) = 0 otherwise.
3. Let X, Y , and Z be jointly continuous. Assume that X uniform[1; 2];
that given X = x, Y exp(1=x); and that given X = x and Y = y, Z is
N(x; 1). Find the mean vector and covariance matrix of [X; Y; Z]0.
July 18, 2002
7.7 Problems
241
4. Find the mean vector and covariance matrix of [X; Y; Z]0 if
,(x,y)2 =2e,(y,z)2 =2e,z2 =2
:
fXY Z (x; y; z) = e
(2)3=2
Also nd the joint characteristic function of [X; Y; Z]0.
5. Traces and Norms.
(a) If M is a random n n matrix, show that tr(E[M]) = E[tr(M)].
(b) Let A and B be matrices of dimensions m n and n m, respectively.
Derive the formula tr(AB) = tr(BA). Hint: Recall that if C = AB,
then
n
X
Cij :=
Aik Bkj :
k=1
(c) Show that if X is an n-dimensional random vector with covariance
matrix R, then
E[kX , E[X]k2] = tr(R) =
n
X
i=1
var(Xi ):
(d) For m n matrices U and V , show that
tr(UV 0) =
m X
n
X
i=1 k=1
Uik Vik :
Show that if U is xed and tr(UV 0 ) = 0 for all matrices V , then
U = 0.
Remark. What is going on here is that tr(UV 0) denes an inner product on the space of m n matrices.
Problems x7.2: The Multivariate Gaussian
6. If X is a zero-mean, multivariate Gaussian random variable, show that
E[( 0XX 0 )k ] = (2k , 1)(2k , 3) 5 3 1 ( 0E[XX 0 ])k:
Hint: Example 3.6.
7. Wick's Theorem. Let X N(0; R) be n-dimensional. Let (i1 ; : : :; i2k ) be
a vector of indices chosen from f1; : : :; ng. Repetitions are allowed; e.g.,
(1; 3; 3; 4). Derive Wick's Theorem,
E[Xi1 Xi2k ] =
X
j1 ;:::;j2k
July 18, 2002
Rj1j2 Rj2k,1 j2k ;
242
Chap. 7 Random Vectors
where the sum is over all j1; : : :; j2k that are permutations of i1; : : :; i2k
and such that the product Rj1j2 Rj2k,1 j2k is distinct. Hint: The idea
is to view both sides of the equation derived in the previous problem as a
multivariate polynomial in the n variables 1; : : :; n. After collecting all
terms on each side that involve i1 i2k , the corresponding coecients
must be equal. In the expression
X
n
E[( 0X)2k ] = E
=
n
X
j1 =1
j1 =1
j1 Xj1 n
X
j2k =1
X
n
j2k =1
j2k Xj2k
j1 j2k E[Xj1 Xj2k ];
we are only interested in those terms for which j1 ; : : :; j2k is a permutation
of i1 ; : : :; i2k. There are (2k)! such terms, each equal to
i1 i2k E[Xi1 Xi2k ]:
Similarly, from
( 0R)k =
X
n X
n
i=1 j =1
iRij j
k
we are only interested in terms of the form
j1 j2 j2k,1 j2k Rj1 j2 Rj2k,1 j2k ;
where j1 ; : : :; j2k is a permutation of i1 ; : : :; i2k . Now many of these permutations involve the same value of the product Rj1j2 Rj2k,1 j2k . First,
because R is symmetric, each factor Rij also occurs as Rji. This happens
in 2k dierent ways. Second, the order in which the Rij are multiplied
together occurs in k! dierent ways.
8. Let X be a multivariate normal random vector with covariance matrix R.
Use Wick's theorem of the previous problem to evaluate E[X1X2 X3 X4 ],
E[X1X32 X4 ], and E[X12 X22 ].
9. Evaluate (7.3) if m = 0 and
R =
12 1 2 :
12 22
Show that your result has the same form as the bivariate normal density
in (5.13).
10. Let X = [X1; : : :; Xn]0 N(m; R), and suppose that Y = AX + b, where
A is a p n matrix, and b 2 IRp . Find the mean and variance of Y .
July 18, 2002
7.7 Problems
243
P
11. Let X1 ; : : :; Xn be i.i.d. N(m; 2 ) random variables, and let X := n1 ni=1 Xi
denote the average of the Xi . For j = 1; : : :; n, put Yj := Xj , X. Show
that E[Yj ] = 0 and that E[ XYj ] = 0 for j = 1; : : :; n.
12. Let X = [X1 ; : : :; Xn ]0 N(m; R), and suppose R is block diagonal, say
R =
S 0 ;
0 T
where S and T are square submatrices with S being s s and T being
t t with s + t = n. Put U := [X1 ; : : :; Xs]0 and V := [Xs+1 ; : : :; Xn]0.
Show that U and V are independent. Hint: Show that
'X () = 'U (1 ; : : :; s) 'V (s+1 ; : : :; n);
where 'U is an s-variate normal characteristic function, and 'V is a tvariate normal characteristic function. Use the notation := [1; : : :; s]0
and := [s+1; : : :; n]0.
13. The digital signal processing chip in a wireless communication receiver
generates the n-dimensionalGaussian vector X with mean zero and positivedenite covariance matrix R. It then computes the vector Y = R,1=2X.
(Since R,1=2 is invertible, there is no loss of information in applying
P such
a transformation.) Finally, the decision statistic V = kY k2 := nk=1 Yk2
is computed.
(a) Find the multivariate density of Y .
(b) Find the density of Yk2 for k = 1; : : :; n.
(c) Find the density of V .
Problems x7.3: Estimation of Random Vectors
14. Let X denote a random signal of known mean mX and known covariance
matrix RX . Suppose that in order to estimate X, all we have available is
the noisy measurement
Y = GX + W;
where G is a known gain matrix, and W is a noise vector with zero mean
and known covariance matrix RW . Further assume that the covariance
between the signal and noise, RXW , is zero. Find the linear MMSE
estimate of X based on Y .
15. Let X and Y be random vectors with known means and covariance matrices. Do not assume zero means. Find the best purely linear estimate
of X based on Y ; i.e., nd the matrix A that minimizes E[kX , AY k2].
Similarly, nd the best constant estimate of X; i.e., nd the vector b that
minimizes E[kX , bk2].
July 18, 2002
244
Chap. 7 Random Vectors
16. In this problem, you will show that ARY = RXY has a solution even if
RY is singular. Recall that since RY is symmetric, it can be diagonalized
[22]; i.e., there is an n n matrix P such that P 0P = PP 0 = I and such
that P 0RY P = is a diagonal matrix, say = diag(1 ; : : :; n). Dene
a new random variable Y~ := P 0Y . Consider the problem of nding the
best linear estimate of X based on Y~ . This leads to nding a matrix A~
that solves
~ Y~ = RX Y~ :
AR
(7.16)
~ 0 solves ARY = RXY .
(a) Show that if A~ solves (7.16), then A := AP
(b) Show that (7.16) has a solution. Hint: Use the fact that if var(Y~k ) =
0, then cov(Xi ; Y~k ) = 0 as well. (This fact is an easy consequence
of the Cauchy{Schwarz inequality, which is derived in Problem 1 of
Chapter 6.)
17. Show that (7.8) implies (7.7).
18. Show that if g1 (y) and g2 (y) both solve (7.8), then g1 = g2 in the sense
that
E[jg2(Y ) , g1 (Y )j] = 0:
Hint: Observe that
jg2(y) , g1(y)j = [g2(y) , g1(y)]IB+ (y) , [g2(y) , g1 (y)]IB, (y);
where
B+ := fy : g2 (y) > g1 (y)g and B, := fy : g2(y) < g1 (y)g:
Now write down the four versions of (7.8) obtained with g = g2 or g = g1
and h(y) = IB+ (y) or h(y) = IB, (y).
19. Write down the analogs of (7.7) and (7.8) when X is vector valued. Find
E[X jY = y] if [X 0; Y 0 ]0 is a Gaussian random vector such that X has
mean mX and covariance RX , Y has mean mY and covariance RY , and
RXY = cov(X; Y ).
20. Let X and Y be jointly normal random vectors as in the previous problem,
and let the matrix A solve ARY = RXY . Show that given Y = y, X is
conditionally N(mX + A(y , mY ); RX , ARY X ). Hints: First note that
(X , mX ) , A(Y , mY ) and Y are uncorrelated
and therefore independent
by Problem 12. Next, observe that E[ej 0X jY = y] is equal to
0
0
E ej [(X ,mX ),A(Y ,mY )] ej [mX +A(Y ,mY )] Y = y :
July 18, 2002
7.7 Problems
245
Problems x7.4: Transformations of Random Vectors
21. Let X and Y have joint density fXY (x; y). Let U := X + Y and V :=
X , Y . Find fUV (u; v).
22. p
Let X and Y be independent uniform(0;
1] random variables. Let U :=
,2 ln X cos(2Y ), and V := p,2 ln Y sin(2Y ). Show that U and V
are independent N(0; 1) random variables.
23. Let X and Y have joint density fXY (x; y). Let U := X + Y and V :=
X=(X+Y ). Find fUV (u; v). Apply your result to the case where X and Y
are independent gamma random variables X gp; and Y gq; . What
type of density is fU ? What type of density is fV ? Hint: Results from
Problem 21(d) in Chapter 5 may be helpful.
Problems x7.5: Complex Random Variables and Vectors
24. Show that for a complex random variable Z = X+jY , cov(Z) = var(X)+
var(Y ).
25. Consider the complex random vector Z = X +jY with covariance matrix
C.
(a) Show that C = (RX + RY ) + j(RY X , RXY ).
(b) If the circular symmetry conditions RX = RY and RXY = ,RY X
hold, show that the diagonal elements of RXY are zero; i.e., the
components Xi and Yi are uncorrelated.
(c) If the circular symmetry conditions hold, and if C is a real matrix,
show that X and Y are uncorrelated.
26. Let Z be a complex random vector with covariance matrix C = R + jQ
for real matrices R and Q.
(a) Show that R = R0 and that Q0 = ,Q.
(b) If Q0 = ,Q, show that 0Q = 0.
(c) If w = + j, show that
wH Cw = 0R + 0 R + 20 Q:
27. Let Z = X + jY be a complex random vector, and let A = + j be a
complex matrix. Show that the transformation Z 7! AZ is equivalent to
X 7! ,
X :
Y
Y
Hence, multiplying an n-dimensional complex random vector by an n n
complex matrix is a linear transformation of the 2n-dimensional vector
[X 0 ; Y 0 ]0. Now show that such a transformation preserves circular symmetry; i.e., if Z is circularly symmetric, then so is AZ.
July 18, 2002
246
Chap. 7 Random Vectors
28. Consider the complex random vector partitioned as
Z
X + jY
W = U + jV ;
where X, Y , U, and V are appropriately sized real random vectors. Since
every complex random vector is identied with a real random vector of
f := [U 0; V 0]0.
twice the length, it is convenient to put Ze := [X 0 ; Y 0 ]0 and W
0
Since the real and imaginary parts of are R := [X ; U 0]0 and I :=
[Y 0; V 0 ]0, we put
2
3
X
6
7
e := RI = 64 UY 75 :
V
Assume that is Gaussian and circularly symmetric.
(a) Show that CZW = 0 if and only if RZeWe = 0.
(b) Show that the complex matrix A = + j solves ACW = CZW if
and only if
Ae := ,
=
e
solves AR
e = RZ
eW
e.
W
(c) If A solves ACW = CZW , show that given W = w, Z is conditionally Gaussian and circularly symmetric N(mZ + A(w , mW ); CZ ,
ACWZ ). Hint: Problem 20.
29. Let Z = X + jY have density fZ (z) = e,jzj2 = as discussed in the text.
(a) Find cov(Z).
(b) Show that 2jZ j2 has a chi-squared density with 2 degrees of freedom.
30. Let X N(mr ; 1) and Y N(mi ; 1) be independent, and dene the
complex random variable Z := X + jY . Use the result of Problem 17 in
Chapter 4 to show that jZ j has the Rice density.
31. The base station of a wireless communication system generates an ndimensional, complex, circularly symmetric, Gaussian random vector Z
with mean zero and covariance matrix C. Let W = C ,1=2Z.
(a) Find the density of W.
(b) Let Wk = Uk +jVk . Find the joint density of the pair of real random
variables (Uk ; Vk ).
(c) If
n
n
X
X
kW k2 := jWk j2 = Uk2 + Vk2 ;
k=1
July 18, 2002
k=1
7.7 Problems
247
show that 2kW k2 has a chi-squared density with 2n degrees of freedom.
Remark. (i) The chi-squared density with 2n degrees of freedom
is the same as the n-Erlang density, whose cdf has a closed-form
expression givenpin Problem 12(c) in Chapter 3. (ii) By Problem 13
in Chapter 4, 2 kW k has a Nakagami-n density with parameter
= 1.
32. Let M be a real symmetric matrix such that u0 Mu = 0 for all real vectors
u.
(a) Show that v0 Mu = 0 for all real vectors u and v. Hint: Consider the
quantity (u + v)0 M(u + v).
(b) Show that M = 0. Hint: Note that M = 0 if and only if Mu = 0 for
all u, and Mu = 0 if and only if kMuk = 0.
33. Show that if (7.12) is equal to wH Cw for all w = +j, then (7.13) holds.
Hint: Use the result of the preceding problem.
34. Assume that circular symmetry (7.13) holds. In this problem you will
show that (7.14) reduces to (7.15).
(a) Show that det K = (det C)2 =22n. Hint:
2RX ,2RY X
det(2K) = det 2R
2RX
YX
X + j2RY X ,2RY X
= det 2R
2RY X , j2RX 2RX
2RY X
= det ,CjC ,2R
X
YX
= det C0 ,2R
= (det C)2:
CH
Remark. Thus, K is invertible if and only if C is invertible.
(b) Matrix Inverse Formula. For any matrices A, B, C, and D, let V =
A + BCD. If A and C are invertible, show that
V ,1 = A,1 , A,1B(C ,1 + DA,1 B),1 DA,1
by verifying that the formula for V ,1 satises V V ,1 = I.
(c) Show that
,1RY X ,1 ,1
R
,
1
K = ,,1R R,1 X ,1
;
YX X
where := RX +RY X R,X1RY X , by verifying that KK ,1 = I. Hint:
Note that ,1 satises
,1 = R,X1 , R,X1 RY X ,1 RY X R,X1 :
July 18, 2002
248
Chap. 7 Random Vectors
(d) Show that C ,1 = (,1,jR,X1 RY X ,1)=2 by verifying that CC ,1 =
I.
(e) Show that (7.14) reduces to (7.15). Hint: Before using the formula
in part (d), rst use the fact that (Im C ,1)0 = , ImC ,1; this fact
can be seen using the equation for ,1 given in part (c).
July 18, 2002
CHAPTER 8
Advanced Concepts in Random Processes
The two most important continuous-time random processes are the Poisson
process and the Wiener process, which are introduced in Sections 8.1 and 8.3,
respectively. The construction of arbitrary random processes in discrete and
continuous time using Kolmogorov's Theorem is discussed in Section 8.4.
In addition to the Poisson process, marked Poisson processes and shot noise
are introduced in Section 8.1. The extension of the Poisson process to renewal
processes is presented briey in Section 8.2. In Section 8.3, the Wiener process
is dened and then interpreted as integrated white noise. The Wiener integral
is introduced. The approximation of the Wiener process via a random walk is
also outlined. For random walks without nite second moments, it is shown by
a simulation example that the limiting process is no longer a Wiener process.
8.1. The Poisson Process
A counting process fNt ; t 0g is a random process that counts how many
times something happens from time zero up to and including time t. A sample
path of such a process is shown in Figure 8.1. Such processes always have a
5
Nt
4
3
2
1
t
T1
T2
T3 T 4
T5
Figure 8.1. Sample path Nt of a counting process.
staircase form with jumps of height one. The randomness is in the times Ti
at which whatever we are counting happens. Note that counting processes are
right continuous.
Here are some examples of things that we might count:
Nt = the number of radioactive particles emitted from a sample of radioactive material up to and including time t.
249
250
Chap. 8 Advanced Concepts in Random Processes
Nt = the number of photoelectrons emitted from a photodetector up to
and including time t.
Nt = the number of hits of a web site up to and including time t.
Nt = the number of customers passing through a checkout line at a grocery
store up to and including time t.
Nt = the number of vehicles passing through a toll booth on a highway
up to and including time t.
Suppose that 0 t1 < t2 < 1 are given times, and we want to know
how many things have happened between t1 and t2. Now Nt2 is the number
of occurrences up to and including time t2. If we subtract Nt1 , the number of
occurrences up to and including time t1 , then the dierence Nt2 , Nt1 is simply
the number of occurrences that happen after t1 up to and including t2. We call
dierences of the form Nt2 , Nt1 increments of the process.
A counting process fNt ; t 0g is called a Poisson process if the following
three conditions hold:
N0 0; i.e., N0 is a constant random variable whose value is always zero.
For any 0 s < t < 1, the increment Nt , Ns is a Poisson random
variable with parameter (t , s). The constant is called the rate or the
intensity of the process.
If the time intervals
(t1 ; t2]; (t2; t3]; : : :; (tn; tn+1]
are disjoint, then the increments
Nt2 , Nt1 ; Nt3 , Nt2 ; : : :; Ntn+1 , Ntn
are independent; i.e., the process has independent increments. In
other words, the numbers of occurrences in disjoint time intervals are
independent.
Example 8.1. Photoelectrons are emitted from a photodetector at a rate
of per minute. Find the probability that during each of 2 consecutive minutes,
more than 5 photoelectrons are emitted.
Solution. Let Ni denote the number of photoelectrons emitted from time
zero up through the ith minute. The probability that during the rst minute
and during the second minute more than 5 photoelectrons are emitted is
}(fN1 , N0 6g \ fN2 , N1 6g):
By the independent increments property, this is equal to
}(N1 , N0 6) }(N2 , N1 6):
July 18, 2002
8.1 The Poisson Process
251
Each of these factors is equal to
1,
5
X
k e, ;
k=0 k!
where we have used the fact that the length of the time increments is one.
Hence,
5 k , 2
}(fN1 , N0 6g \ fN2 , N1 6g) = 1 , X e
:
k!
k=0
We now compute the correlation and covariance functions of a Poisson process. Since N0 0, Nt = Nt , N0 is a Poisson random variable with parameter
(t , 0) = t. Hence,
E[Nt] = t and var(Nt ) = t:
This further implies that E[Nt2] = t + (t)2 . For 0 s < t, we can compute
the correlation
E[NtNs ] = E[(Nt , Ns )Ns ] + E[Ns2]
= E[(Nt , Ns )(Ns , N0 )] + (s)2 + s:
Since (0; s] and (s; t] are disjoint, the above increments are independent, and so
E[(Nt , Ns )(Ns , N0 )] = E[Nt , Ns ] E[Ns , N0 ] = (t , s) s:
It follows that
E[NtNs ] = (t)(s) + s:
We can also compute the covariance,
cov(Nt ; Ns) = E[(Nt , t)(Ns , s)]
= E[NtNs ] , (t)(s)
= s:
More generally, given any two times t1 and t2 ,
cov(Nt1 ; Nt2 ) = min(t1 ; t2):
So far, we have focused on the number of occurrences between two xed
times. Now we focus on the jump times, which are dened by (see Figure 8.1)
Tn := minft > 0 : Nt ng:
In other words, Tn is the time of the nth jump in Figure 8.1. In particular,
if Tn > t, then the nth jump happens after time t; hence, at time t we must
July 18, 2002
252
Chap. 8 Advanced Concepts in Random Processes
have Nt < n. Conversely, if at time t, Nt < n, then the nth occurrence has not
happened yet; it must happen after time t, i.e., Tn > t. We can now write
nX
,1 (t)k
}(Tn > t) = }(Nt < n) =
,t
k! e :
k=0
From Problem 12 in Chapter 3, we see that Tn has an Erlang density with parameters n and . In particular, T1 has an exponential density with parameter
. Depending on the context, the jump times are sometimes called arrival
times or occurrence times.
In the previous paragraph, we dened the occurrence times in terms of
counting process fNt ; t 0g. Observe that we can express Nt in terms of the
occurrence times since
1
X
Nt =
I(0;t](Tk ):
k=1
We now dene the interarrival times,
X1 = T1 ;
Xn = Tn , Tn,1 ; n = 2; 3; : : ::
The occurrence times can be recovered from the interarrival times by writing
Tn = X1 + + Xn :
We noted above that Tn is Erlang with parameters n and . Recalling Problem 46 in Chapter 3, which shows that a sum of i.i.d. exp() random variables
is Erlang with parameters n and , we wonder if the Xi are i.i.d. exponential
with parameter . This is indeed the case, as shown in [4, p. 301].
Example 8.2. Micrometeors strike the space shuttle according to a Poisson process. The expected time between strikes is 30 minutes. Find the probability that during at least one hour out of 5 consecutive hours, 3 or more
micrometeors strike the shuttle.
Solution. The problem statement is telling us that the expected interarrival time is 30 minutes. Since the interarrival times are exp() random
variables, their mean is 1=. Thus, 1= = 30 minutes, or 0:5 hours, and so
= 2 strikes per hour. The number of strikes during the ith hour is Ni , Ni,1.
The probability that during at least one hour out of 5 consecutive hours, 3 or
more micrometeors strike the shuttle is
5
[
}
fNi , Ni,1 3g
i=1
5
\
= 1,}
= 1,
5
Y
i=1
July 18, 2002
fNi , Ni,1 < 3g
i=1
}(Ni , Ni,1 2);
8.1 The Poisson Process
253
where the last step follows by the independent increments property of the Poisson process. Since Ni , Ni,1 Poisson([i , (i , 1)]), or simply Poisson(),
}(Ni , Ni,1 2) = e, (1 + + 2 =2) = 5e,2 ;
we have
? Derivation
}
5
[
fNi , Ni,1 3g = 1 , (5e,2 )5 0:86:
i=1
of the Poisson Probabilities
In the denition of the Poisson process, we required that Nt , Ns be a
Poisson random variable with parameter (t , s). In particular, this implies
that
}(Nt+t , Nt = 1)
te,t ! ;
=
t
t
as t ! 0. The Poisson assumption also implies that
1 , }(Nt+t , Nt = 0) = }(Nt+t , Nt 1)
t
t
1
X
1
(t)k
= t
k!
k=1
1
X
k ,1
! ;
= 1 + (t)
k!
k=2
as t ! 0.
As we now show, the converse is also true as long as we continue to assume
that N0 0 and that the process has independent increments. So, instead of
assuming that Nt , Ns is a Poisson random variable with parameter (t , s),
we assume that
During a suciently short time interval, t > 0, }(Nt+t , Nt = 1) t. By this we mean that
}(Nt+t , Nt = 1)
lim
= :
t#0
t
This property can be interpreted as saying that the probability of having exactly one occurrence during a short time interval of length t is
approximately t.
For suciently small t > 0, }(Nt+t , Nt = 0) 1 , t. More
precisely,
1 , }(Nt+t , Nt = 0) = :
lim
t#0
t
July 18, 2002
254
Chap. 8 Advanced Concepts in Random Processes
By combining this property with the preceding one, we see that during a
short time interval of length t, we have either exactly one occurrence or
no occurrences. In other words, during a short time interval, at most one
occurrence is observed.
For n = 0; 1; : : :; let
pn(t) := }(Nt , Ns = n); t s:
(8.1)
Note that p0 (s) = }(Ns , Ns = 0) = }(0 = 0) = }(
) = 1. Now,
pn(t + t) = }(Nt+t , Ns = n)
!
= }
=
=
n
X
k=0
n
X
k=0
n
[
k=0
fNt , Ns = n , kg \ fNt+t , Nt = kg
}(Nt+t , Nt = k; Nt , Ns = n , k)
}(Nt+t , Nt = k)pn,k (t);
using independent increments and (8.1). Break the preceding sum into three
terms as follows.
pn (t + t) = }(Nt+t , Nt = 0)pn(t)
+ }(Nt+t , Nt = 1)pn,1(t)
n
X
+ }(Nt+t , Nt = k)pn,k (t):
k=2
This enables us to write
pn(t + t) , pn (t) = ,[1 , }(Nt+t , Nt = 0)]pn(t)
+ }(Nt+t , Nt = 1)pn,1(t)
n
X
+ }(Nt+t , Nt = k)pn,k (t):
k=2
(8.2)
For n = 0, only the rst term on the right in (8.2) is present, and we can write
p0 (t + t) , p0 (t) = ,[1 , }(Nt+t , Nt = 0)]p0(t):
(8.3)
It then follows that
, p0 (t) = ,p (t):
lim p0 (t + t)
0
t#0
t
In other words, we are left with the rst-order dierential equation,
p00(t) = ,p0 (t); p0 (s) = 1;
July 18, 2002
8.1 The Poisson Process
255
whose solution is simply
p0(t) = e,(t,s) ; t s:
To handle the case n 2, note that since
n
X
k=2
}(Nt+t , Nt = k)pn,k (t)
n
X
k=2
}(Nt+t , Nt = k)
= }(Nt+t , Nt 2)
= 1 , [ }(Nt+t , Nt = 0) + }(Nt+t , Nt = 1) ];
it follows that
k=2 }(Nt+t , Nt = k)pn,k(t) = , = 0:
lim
t#0
t
Pn
Returning to (8.2), we see that for n = 1 and for n 2,
, pn (t) = ,p (t) + p (t):
lim pn (t + t)
n
n,1
t#0
t
This results in the dierential-dierence equation,
p0n (t) = ,pn (t) + pn,1 (t); p0(t) = e,(t,s):
It is easily veried that for n = 1; 2; : : :;
n ,(t,s)
;
pn(t) = [(t , s)]n! e
which are the claimed Poisson probabilities, solve (8.4).
(8.4)
Marked Poisson Processes
It is frequently the case that in counting arrivals, each arrival is associated
with a mark. For example, suppose packets arrive at a router according to a
Poisson process of rate , and that the size of the ith packet is Bi bytes. The
size Bi is the mark. Thus, the ith packet, whose size is Bi , arrives at time Ti ,
where Ti is the ith occurrence time of the Poisson process. The total number
of bytes processed up to time t is
Mt :=
Nt
X
i=1
Bi :
We usually assume that the mark sequence is independent of the Poisson process. In this case, the mean of Mt can be computed as in Example 5.15. The
characteristic function of Mt can be computed as in Problem 37 in Chapter 5.
July 18, 2002
256
Chap. 8 Advanced Concepts in Random Processes
Shot Noise
Light striking a photodetector generates photoelectrons according to a Poisson process. The rate of the process is proportional to the intensity of the light
and the eciency of the detector. The detector output is then passed through
an amplier of impulse response h(t). We model the input to the amplier as
a train of impulses
X
Xt :=
(t , Ti );
i
where the Ti are the occurrence times of the Poisson process. The amplier
output is
Yt =
=
=
Z
1
h(t , )X d
,1
1Z1
X
i=1 ,1
1
X
i=1
h(t , )( , Ti ) d
h(t , Ti ):
For any realizable system, h(t) is a causal function; i.e., h(t) = 0 for t < 0.
Then
Nt
X
X
Yt =
h(t , Ti ) =
h(t , Ti ):
(8.5)
i:Ti t
i=1
A process of the form of Yt is called a shot-noise process or a ltered Poisson
process. Note that if the impulse response h(t) is continuous, then so is Yt . If
h(t) contains jump discontinuities, then so does Yt . If h(t) is nonnegative, then
so is Yt.
8.2. Renewal Processes
Recall that a Poisson process of rate can be constructed by writing
Nt :=
where the arrival times
1
X
k=1
I[0;t] (Tk );
Tk := X1 + + Xk ;
and the Xk are i.i.d. exp() interarrival times. If we drop the requirement that
the interarrival times be exponential and let them have arbitrary density f,
then Nt is called a renewal process.
If we let F denote the cdf corresponding to the interarrival density f, it is
easy to see that the mean of the process is
E[Nt] =
1
X
k=1
Fk (t);
July 18, 2002
(8.6)
8.3 The Wiener Process
257
where Fk is the cdf of Tk . The corresponding density, denoted by fk , is the
k-fold convolution of f with itself. Hence, in general this formula is dicult
to work with. However, there is another way to characterize E[Nt ]. In the
problems you are asked to derive renewal equation
E[Nt] = F (t) +
Z
t
0
E[Nt,x]f(x) dx:
The mean function m(t) := E[Nt] of a renewal process is called the renewal
function. Note that m(0) = E[N0] = 0, and that the renewal equation can be
written in terms of the renewal function as
m(t) = F (t) +
Z
8.3. The Wiener Process
0
t
m(t , x)f(x) dx:
The Wiener process or Brownian motion is a random process that
models integrated white noise. Such a model is needed because, as indicated
below, white noise itself does not exist as an ordinary random process! In
fact, the careful reader will note that in Chapter 6, we never worked directly
with white noise, but rather with the output of an LTI system whose input
was white noise. The Wiener process provides a mathematically well-dened
object that can be used in modeling the output of an LTI system in response
to (approximately) white noise.
We say that fWt; t 0g is a Wiener process if the following four conditions
hold:
W0 0; i.e., W0 is a constant random variable whose value is always zero.
For any 0 s t < 1, the increment Wt , Ws is a Gaussian random
variable with zero mean and variance 2 (t , s).
If the time intervals
(t1 ; t2]; (t2; t3]; : : :; (tn; tn+1]
are disjoint, then the increments
Wt2 , Wt1 ; Wt3 , Wt2 ; : : :; Wtn+1 , Wtn
are independent; i.e., the process has independent increments.
For each sample point ! 2 , Wt (!) as a function of t is continuous. More
briey, we just say that Wt has continuous sample paths.
Remarks. (i) If the parameter 2 = 1, then the process is called a standard Wiener process. A sample path of a standard Wiener process is shown
in Figure 8.2.
July 18, 2002
258
Chap. 8 Advanced Concepts in Random Processes
0.6
0.4
0.2
0
−0.2
−0.4
−0.6
−0.8
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Figure 8.2. Sample path Wt of a standard Wiener process.
(ii) Since the rst and third properties of the Wiener process are the same
as those of the Poisson process, it is easy to show that
cov(Wt1 ; Wt2 ) = 2 min(t1 ; t2):
(iii) The fourth condition, that as a function of t, Wt should be continuous,
is always assumed in practice, and can always be arranged by construction [4,
Section 37]. Actually, the precise statement of the fourth property is
}(f! 2 : Wt(!) is a continuous function of tg) = 1;
i.e., the realizations of Wt are continuous with probability one. (iv) As indicated
in Figure 8.2, the Wiener process is very \wiggly." In fact, it wiggles so much
that it is nowhere dierentiable with probability one [4, p. 505, Theorem 37.3].
Integrated White-Noise Interpretation of the Wiener Process
We now justify the interpretation of the Wiener process as integrated white
noise. Let Xt be a zero-mean wide-sense stationary white noise process with
correlation function RX () = 2() and power spectral density SX (f) = 2.
For t 0, put
Z t
Wt = X d:
0
Then Wt is clearly zero mean. For 0 s < t < 1, consider the increment
Wt , Ws =
Z
t
0
X d ,
Z
0
s
X d =
July 18, 2002
Z
t
s
X d:
8.3 The Wiener Process
259
This increment also has zero mean. To compute its variance, write
var(Wt , Ws ) = E[(Wt , Ws )2 ]
Z
= E
t
X d
s
Z t Z
=
s
Z t Z
=
Z
=
s
s
t
t
s
Z
t
s
X d
E[X X ] d d
t
2 (
s
, ) d d
2 d
= 2 (t , s):
(8.7)
A random process Xt is said to be Gaussian (cf. Section 7.2), if for every function
c(),
Z 1
c()X d
,1
is a Gaussian random variable. For example, if our white noise Xt is Gaussian,
and if we take c() = I(0;t] (), then
1
Z
,1
c()X d =
1
Z
,1
I(0;t]()X d =
is a Gaussian random variable. Similarly,
Wt , Ws =
Z
t
s
1
Z
X d =
,1
Z
t
0
X d = Wt
I(s;t] ()X d
is a Gaussian random variable with zero mean and variance 2 (t , s). More
generally, for time intervals
(t1 ; t2]; (t2; t3]; : : :; (tn; tn+1 ];
consider the random vector
[Wt2 , Wt1 ; Wt3 , Wt2 ; : : :; Wtn+1 , Wtn ]0:
Recall from Section 7.2 that this vector is normal, if every linear combination
of its components is a Gaussian random variable. To see that this is indeed the
case, write
n
X
i=1
ci(Wti+1 , Wti ) =
=
=
n
X
i=1
n
X
i=1
Z
ci
ci
Z ti+1
ti
Z
1
,1
n
1X
,1 i=1
July 18, 2002
X d
I(ti ;ti+1 ] ()X d
ciI(ti ;ti+1 ] () X d:
260
Chap. 8 Advanced Concepts in Random Processes
This is a Gaussian random variable if we assume Xt is a Gaussian random
process. Finally, if the time intervals are disjoint, we claim that the increments
are uncorrelated (and therefore independent under the Gaussian assumption).
Without loss of generality, suppose 0 s < t < < 1. Write
Z
E[(Wt , Ws )(W , W )] = E
t
s
X d
Z Z
=
t
s
Z
X d
2 ( , ) d
d:
In the inside integral, we never have = because s < t < .
Hence, the inner integral evaluates to zero, and the increments Wt , Ws and
W , W are uncorrelated as claimed.
The Problem with White Noise
If white noise really existed and
Wt =
Z
t
0
X d;
then the derivative of Wt would be W_ t = Xt . Now consider the following
calculations. First, write
d E[W 2] = E d W 2
dt t
dt t
= E[2W
tW_ t]
Wt+t , Wt = E 2Wt lim
t#0
t
E
[W
(W
t t+t , Wt )] :
= 2 lim
t#0
t
It is a simple calculation using the independent increments property to show
that E[Wt(Wt+t , Wt )] = 0. Hence
d
2
dt E[Wt ] = 0:
On the other hand, from (8.7), E[Wt2] = 2 t, and so
d
d 2
2
2
dt E[Wt ] = dt t = > 0:
The Wiener Integral
The Wiener processR is a well-dened mathematical object. We argued above
that Wt behaves like 0t X d, where Xt is white Gaussian noise. If such noise
July 18, 2002
8.3 The Wiener Process
261
is applied to an LTI system starting at time zero, and if the system has impulse
response h, then the output is
1
Z
h(t , )X d:
0
We now suppress t and write g() instead of h(t , ). Then we need a welldened mathematical object to play the role of
1
Z
0
g()X d:
The required object is the Wiener integral,
1
Z
0
g() dW ;
which is dened as follows. For piecewise constant functions g of the form
g() =
n
X
i=1
giI(ti ;ti+1 ] ();
where 0 t1 < t2 < < tn+1 < 1, we dene
Z
0
1
g() dW :=
n
X
i=1
gi(Wti+1 , Wti ):
Note that the right-hand side is a weighted sum of independent, zero-mean,
Gaussian random variables. The sum is therefore Gaussian with zero mean and
variance
n
X
i=1
gi2 var(Wti+1 , Wti ) =
n
X
i=1
gi2 2 (ti+1 , ti) = 2
1
Z
0
g()2 d:
Because of the zero mean, the variance and second moment are the same. Hence,
we also have
2 Z 1
Z 1
2
g()2 d:
(8.8)
g() dW
= E
0
0
For functions g that are not piecewise constant, but do satisfy 01 g()2 d < 1,
the Wiener integral can be dened by a limiting process, which is discussed in
more detail in Chapter 10. Basic properties of the Wiener integral are explored
in the problems.
Random Walk Approximation of the Wiener Process
R
We now show how to approximate the Wiener process with a piecewiseconstant, continuous-time process that is obtained from a discrete-time random
walk.
July 18, 2002
262
Chap. 8 Advanced Concepts in Random Processes
6
4
2
0
−2
−4
−6
0
10
20
30
40
50
60
70
80
Figure 8.3. Sample path Sk ; k = 0;: : :; 74.
Let X1 ; X2; : : : be i.i.d. 1-valued random variables with }(Xi = 1) = 1=2.
Then each Xi has zero mean and variance one. Let
Sn :=
n
X
i=1
Xi :
Then Sn has zero mean and variance n. The process fSn ; n 0g is called a
symmetric random walk. A sample path of Sn is shown
in Figure 8.3.
We next consider the scaled random walk Sn =pn. Note that Sn =pn has
zero mean and variance one. By the pcentral limit theorem, which is discussed
in detail in Chapter 4, the cdf of Sn = n converges to the standard normal cdf.
To approximate the continuous-time Wiener process, we use the piecewiseconstant, continuous-time process
Wt(n) := p1n Sbntc ;
where b c denotes the greatest integer that is less than or equal to . For
example, if n = 100 and t = 3:1476, then
1
1
1
W3(100)
:1476 = p100 Sb1003:1476c = 10 Sb314:76c = 10 S314:
Figure 8.4 shows a sample path of Wt(75) plotted as a function of t. As
the continuous variable t ranges over [0; 1], the values of b75tc range over the
It is understood that S0 0.
July 18, 2002
8.4 Specication of Random Processes
263
0.8
0.6
0.4
0.2
0
−0.2
−0.4
−0.6
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Figure 8.4. Sample path Wt(75) .
integers 0; 1; : : :; 75. Thus, the constant levels seen in Figure 8.4 are
p1 Sb75tc = p1 Sk ; k = 0; : : :; 75:
75
75
p
In other words, the levels in Figure 8.4 are 1= 75 times those in Figure 8.3.
Figures 8.5 and 8.6 show sample paths of Wt(150) and Wt(10;000), respectively.
As n increases, the sample paths look more and more like those of the Wiener
process shown in Figure 8.2.
Since the central limit theorem applies to any i.i.d. sequence with nite
variance, the preceding convergence to the Wiener process holds if we replace
the 1-valued Xi by any i.i.d. sequence with nite variance.y However, if the
Xi only have nite mean but innite variance, other limit processes can be
obtained. For example, suppose the Xi are i.i.d. having a student's t density
with = 3=2 degrees of freedom. Then the Xi have zero mean and innite
variance. As can be seen in Figure 8.7, the limiting process has jumps, which
is inconsistent with the Wiener process, which has continuous sample paths.
8.4. Specication of Random Processes
Finitely Many Random Variables
In this text we have often seen statements of the form, \Let X, Y , and Z
be random variables with }((X; Y; Z) 2 B) = (B)," where B IR3, and (B)
y If the mean of the Xi is m and the variance is 2 , then we must replace Sn by (Sn ,nm)=.
July 18, 2002
264
Chap. 8 Advanced Concepts in Random Processes
0.6
0.4
0.2
0
−0.2
−0.4
−0.6
−0.8
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0.8
0.9
1
Figure 8.5. Sample path Wt(150) .
0.8
0.6
0.4
0.2
0
−0.2
−0.4
−0.6
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
Figure 8.6. Sample path Wt(10;000) .
July 18, 2002
8.4 Specication of Random Processes
265
5
4
3
2
1
0
−1
−2
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Figure 8.7. Sample path Sbntc =n2=3 for n = 10; 000 when the Xi have a student's t density
with 3=2 degrees of freedom.
is given by some formula. For example, if X, Y , and Z are discrete, we would
have
XXX
(B) =
IB (xi; yj ; zk )pi;j;k;
(8.9)
i
j
k
where the xi , yj , and zk are the values taken by the random variables, and the
pi;j;k are nonnegative numbers that sum to one. If X, Y , and Z are jointly
continuous, we would have
Z
(B) =
1
Z
1
Z
1
,1 ,1 ,1
IB (x; y; z)f(x; y; z) dx dy dz;
(8.10)
where f is nonnegative and integrates to one. In fact, if X is discrete and Y
and Z are jointly continuous, we would have
XZ
(B) =
Z
1
,1 ,1
i
where f is nonnegative and
XZ
i
1
1
Z
1
,1 ,1
IB (xi ; y; z)f(xi ; y; z) dy dz;
(8.11)
f(xi ; y; z) dy dz = 1:
The big question is, given a formula for computing (B), how do we know that
a sample space , a probability measure }, and functions X(!), Y (!), and
Z(!) exist such that we indeed have
},(X; Y; Z) 2 B = (B); B IR3 :
July 18, 2002
266
Chap. 8 Advanced Concepts in Random Processes
As we show in the next paragraph, the answer turns out to be rather simple.
If is dened by expressions such as (8.9){(8.11), it can be shown that is
a probability measurez on IR3; the case of (8.9) is easy; the other two require
some background in measure theory, e.g., [4]. More generally, if we are given
any probability measure on IR3 , we take = IR3 and put }(A) := (A) for
A = IR3 . For ! = (!1 ; !2; !3), we dene
X(!) := !1 ; Y (!) := !2 ; and Z(!) := !3:
It then follows that for B IR3,
f! 2 : (X(!); Y (!); Z(!)) 2 B g
reduces to
f! 2 : (!1 ; !2; !3) 2 B g = B:
}
Hence, (f(X; Y; Z) 2 B g) = }(B) = (B).
For xed n 1, the foregoing ideas generalize in the obvious way to show
the existence of a sample space , probability measure }, and random variables
X1 ; : : :; Xn with
},(X1 ; X2 ; : : :; Xn ) 2 B = (B); B IRn;
where is any given probability measure dened on IRn.
Innite Sequences (Discrete Time)
Consider an innite sequence of random variables such as X1 ; X2; : : :: While
(X1 ; : : :; Xn) takes values in IRn, the innite sequence (X1 ; X2 ; : : :) takes values
in IR1 . If such an innite sequence of random variables exists on some sample
space equipped with some probability measure }, then1
},(X1 ; X2; : : :) 2 B ; B IR1 ;
is a probability measure on IR1 . We denote this probability measure by (B).
Similarly, } induces on IRn the measure
,
n(Bn ) = } (X1 ; : : :; Xn) 2 Bn ; Bn IRn ;
Of course, } induces on IRn+1 the measure
,
n+1 (Bn+1 ) = } (X1 ; : : :; Xn; Xn+1 ) 2 Bn+1 ; Bn+1 IRn+1 :
Now, if we take Bn+1 = Bn IR for any Bn IRn , then
,
n+1 (Bn IR) = } (X1 ; : : :; Xn; Xn+1 ) 2 Bn IR
,
= } (X1 ; : : :; Xn) 2 Bn ; Xn+1 2 IR
,
= } f(X1 ; : : :; Xn) 2 Bn g \ fXn+1 2 IRg
,
= } f(X1 ; : : :; Xn) 2 Bn g \ ,
= } (X1 ; : : :; Xn) 2 Bn
= n(Bn ):
z See Section 1.3 to review the denition of probability measure.
July 18, 2002
8.4 Specication of Random Processes
267
Thus, we have the consistency condition
n+1 (Bn IR) = n(Bn ); Bn IRn; n = 1; 2; : : ::
Next, observe that since (X1 ; : : :; Xn) 2 Bn if and only if
(X1 ; X2 ; : : :; Xn ; Xn+1; : : :) 2 Bn IR ;
it follows that n(Bn ) is equal to
},(X1 ; X2; : : :; Xn; Xn+1 ; : : :) 2 Bn IR ;
,
(8.12)
which is simply Bn IR . Thus,
(Bn IR ) = n(Bn ); Bn 2 IRn ; n = 1; 2; : : ::
(8.13)
The big question here is, if we are given a sequence of probability measures n
on IRn for n = 1; 2; : : :; does there exist a probability measure on IR1 such
that (8.13) holds? The answer to this question was given by Kolmogorov, and
is known as Kolmogorov's Consistency Theorem or as Kolmogorov's
Extension Theorem. It says that if the1consistency condition (8.12) holds,x
then a probability measure exists on IR such that (8.13) holds [8, p. 188].
We now specialize the foregoing discussion to the case of integer-valued
random variables X1 ; X2 ; : : :: For each n = 1; 2; : : :; let pn (i1 ; : : :; in ) denote
a proposed joint probability mass function of X1 ; : : :; Xn. In other words, we
want a random process for which
},(X1 ; : : :; Xn) 2 Bn =
1
X
i1 =,1
1
X
in =,1
IBn (i1 ; : : :; in )pn(i1 ; : : :; in):
More precisely, with n (Bn ) given by the above right-hand side, does there
exist a measure on IR1 such that (8.13) holds? By Kolmogorov's theorem,
we just need to show that (8.12) holds.
We now show that (8.12) is equivalent to
1
X
j =,1
pn+1 (i1 ; : : :; in; j) = pn (i1 ; : : :; in ):
x Knowing the measure n , we can always write the corresponding cdf as
,
(8.14)
Fn (x1 ; :: : ;xn ) = n (,1; x1 ] (,1; xn ] :
Conversely, if we know the Fn , there is a unique measure n on IRn such that the above
formula holds [4, Section 12]. Hence, the consistency condition has the equivalent formulation
in terms of cdfs [8, p. 189],
lim F (x ;: : :; xn ; xn+1 ) = Fn (x1 ; : :: ; xn ):
x !1 n+1 1
n+1
July 18, 2002
268
Chap. 8 Advanced Concepts in Random Processes
The left-hand side of (8.12) takes the form
1
X
i1 =,1
1
X
1
X
IBn IR (i1 ; : : :; in; j)pn+1 (i1 ; : : :; in ; j):
in =,1 j =,1
(8.15)
Observe that IBn IR = IBn IIR = IBn . Hence, the above sum becomes
1
X
i1 =,1
1
X
1
X
in =,1 j =,1
IBn (i1 ; : : :; in)pn+1 (i1 ; : : :; in; j);
which, using (8.14), simplies to
1
X
i1 =,1
1
X
in =,1
IBn (i1 ; : : :; in)pn(i1 ; : : :; in );
(8.16)
which is our denition of n(Bn ). Conversely, if in (8.12), or equivalently in
(8.15) and (8.16), we take Bn to be the singleton set
Bn = f(j1; : : :; jn)g;
then we obtain (8.14).
The next question is how to construct a sequence of probability mass functions satisfying (8.14). Observe that (8.14) can be rewritten as
1
X
pn+1 (i1 ; : : :; in; j) = 1:
j =,1 pn (i1 ; : : :; in)
In other words, if pn (i1 ; : : :; in ) is a valid joint pmf, and if we dene
pn+1 (i1 ; : : :; in ; j) := pn+1j1;:::;n(j ji1 ; : : :; in) pn(i1 ; : : :; in);
where pn+1jn;:::;1(j ji1 ; : : :; in ) is a valid pmf in the variable j (i.e., is nonnegative
and the sum over j is one), then (8.14) will automatically hold!
Example 8.3. Let q(i) be any pmf. Take p1(i) := q(i), and take pn+1j1;:::;n(jji1; : : :; in) :=
q(j). Then, for example,
p2(i; j) = p2j1(j ji) p1 (i) = q(j) q(i);
and
p3 (i; j; k) = p3j1;2(kji; j) p2(i; j) = q(i) q(j) q(k):
More generally,
pn(i1 ; : : :; in ) = q(i1 ) q(in ):
Thus, the Xn are i.i.d. with common pmf q.
July 18, 2002
8.4 Specication of Random Processes
269
Example 8.4. Again let q(i) be any pmf. Suppose that for each i, r(jji)
is a pmf in the variable j; i.e., r is any conditional pmf. Put p1 (i) := q(i), and
put{
pn+1j1;:::;n (j ji1 ; : : :; in ) := r(j jin):
Then
p2(i; j) = p2j1(j ji) p1 (i) = r(j ji) q(i);
and
p3(i; j; k) = p3j1;2(kji; j) p1;2(i; j) = q(i) r(j ji) r(kjj):
More generally,
pn (i1 ; : : :; in ) = q(i1 ) r(i2 ji1 ) r(i3 ji2) r(in jin,1):
Continuous-Time Random Processes
The consistency condition for a continuous-time random process is a little
more complicated. The reason is that in discrete time, between any two consecutive integers, there are no other integers, while in continuous time, for any
t1 < t2 , there are innitely many times between t1 and t2 .
Now suppose that for any t1 < < tn+1, we are given a probability
measure t1 ;:::;tn+1 on IRn+1. Fix any Bn IRn . For k = 1; : : :; n+1, we dene
Bn;k IRn+1 by
Bn;k := f(x1; : : :; xn+1) : (x1 ; : : :; xk,1; xk+1; : : :; xn+1) 2 Bn and xk 2 IRg:
Note the special casesk
Bn;1 = IR Bn and Bn;n+1 = Bn IR:
The continuous-time consistency condition is that [40, p. 244] for k = 1; : : :; n+
1,
t1 ;:::;tn+1 (Bn;k ) = t1 ;:::;tk,1 ;tk+1 ;:::;tn+1 (Bn ):
(8.17)
If this condition holds, then there is a sample space , a probability measure
}, and random variables Xt such that
},(Xt1 ; : : :; Xtn ) 2 Bn = t1 ;:::;tn (Bn ); Bn IRn;
for any n 1 and any times t1 < < tn.
{ In Chapter 9, we will see that Xn is a Markov chain with stationary transition probabilities.
k In the previous subsection, we only needed the case k = n + 1. If we had wanted to allow
two-sided discrete-time processes Xn for n any positive or negative integer, then both k = 1
and k = n + 1 would have been needed (Problem 28).
July 18, 2002
270
Chap. 8 Advanced Concepts in Random Processes
8.5. Notes
Notes x8.4: Specication of Random Processes
Note 1. Comments analogous to Note 1 in Chapter 5 apply here. Specifically, the set B must be restricted to a suitable -eld B1 of subsets of IR1 .
Typically, B1 is taken to be the smallest -eld containing all sets of the form
f! = (!1; !2; : : :) 2 IR1 : (!1 ; : : :; !n) 2 Bn g;
where Bn is a Borel subset of IRn , and n ranges over the positive integers [4,
p. 485].
8.6. Problems
Problems x8.1: The Poisson Process
1. Hits to a certain web site occur according to a Poisson process of rate
2.
3.
4.
5.
6.
= 3 per minute. What is the probability that there are no hits in a 10minute period? Give a formula and then evaluate it to obtain a numerical
answer.
Cell-phone calls processed by a certain wireless base station arrive according to a Poisson process of rate = 12 per minute. What is the
probability that more than 3 calls arrive in a 20-second interval? Give a
formula and then evaluate it to obtain a numerical answer.
Let Nt be a Poisson process with rate = 2, and consider a xed observation interval (0; 5].
(a) What is the probability that N5 = 10?
(b) What is the probability that Ni , Ni,1 = 2 for all i = 1; : : :; 5?
A sports clothing store sells football jerseys with a certain very popular
number on them according to a Poisson process of rate 3 crates per day.
Find the probability that on 5 days in a row, the store sells at least 3
crates each day.
A sporting goods store sells a certain shing rod according to a Poisson
process of rate 2 per day. Find the probability that on at least one day
during the week, the store sells at least 3 rods. (Note: week=ve days.)
Let Nt be a Poisson process with rate , and let t > 0.
(a) Show that Nt+t , Nt and Nt are independent.
(b) Show that }(Nt+t = k + `jNt = k) = }(Nt+t , Nt = `).
(c) Evaluate }(Nt = kjNt+t = k + `).
(d) Show that as a function of k = 0; : : :; n, }(Nt = kjNt+t = n) has
the binomial(n; p) probability mass function and identify p.
July 18, 2002
8.6 Problems
271
7. Customers arrive at a store according to a Poisson process of rate . What
is the expected time until the nth customer arrives? What is the expected
time between customers?
8. During the winter, snowstorms occur according to a Poisson process of
intensity = 2 per week.
(a) What is the average time between snowstorms?
(b) What is the probability that no storms occur during a given two-week
period?
(c) If winter lasts 12 weeks, what is the expected number of snowstorms?
(d) Find the probability that during at least one of the 12 weeks of
winter, there are at least 5 snowstorms.
9. Space shuttles are launched according to a Poisson process. The average
time between launches is two months.
(a) Find the probability that there are no launches during a four-month
period.
(b) Find the probability that during at least one month out of four consecutive months, there are at least two launches.
10. Diners arrive at popular restaurant according to a Poisson process Nt of
rate . A confused maitre d' seats the ith diner with probability p, and
turns the diner away with probability 1 , p. Let Yi = 1 if the ith diner is
seated, and Yi = 0 otherwise. The number diners seated up to time t is
Mt :=
Nt
X
i=1
Yi :
Show that Mt is a Poisson random variable and nd its parameter. Assume the Yi are independent of each other and of the Poisson process.
Remark. Mt is an example of a thinned Poisson process.
11. Lightning strikes occur according to a Poisson process of rate per
minute. The energy of the ith strike is Vi . Assume the energy is independent of the occurrence times. What is the expected energy of a storm that
lasts for t minutes? What is the average time between lightning strikes?
? 12. Find the mean and characteristic function of the shot-noise random variable Yt in equation (8.5).
Problems x8.2: Renewal Processes
13. In the case of a Poisson process, show that the right-hand side of (8.6)
reduces to t. Hint: Formula (4.2) in Section 4.2 may be helpful.
July 18, 2002
272
Chap. 8 Advanced Concepts in Random Processes
14. Derive the renewal equation
E[Nt ] = F (t) +
Z
t
0
E[Nt,x]f(x) dx
as follows.
(a) Show that E[NtjX1 = x] = 0 for x > t.
(b) Show that E[NtjX1 = x] = 1 + E[Nt,x] for x t.
(c) Use parts (a) and (b) and the laws of total probability and substitution to derive the renewal equation.
15. Solve the renewal equation for the renewal function m(t) := E[Nt ] if the
interarrival density is f exp(). Hint: Dierentiate the renewal equation and show that m0 (t) = . It then follows that m(t) = t, which is
what we expect since f exp() implies Nt is a Poisson process of rate .
July 18, 2002
8.6 Problems
273
Problems x8.3: The Wiener Process
16. For 0 s < t < 1, use the denition of the Wiener process to show that
E[WtWs ] = 2 s.
17. Let the random vector X = [Wt1 ; : : :; Wtn ]0, 0 < t1 < < tn < 1,
consist of samples of a Wiener process. Find the covariance matrix of X.
18. For piecewise constant g and h, show that
1
Z
0
g() dW +
1
Z
0
1
Z
h() dW =
g() + h() dW :
0
Hint: The problem is easy if g and h are constant over the same intervals.
19. Use (8.8) to derive the formula
1
Z
E
0
g() dW
1
Z
h() dW
0
Hint: Consider the expectation
E
1
Z
g() dW ,
0
1
Z
0
=
2
Z
h() dW
0
1
g()h() d:
2 ;
which can be evaluated in two dierent ways. The rst way is to expand
the square and take expectations term by term, applying (8.8) where
possible. The second way is to observe that since
1
Z
0
g() dW ,
1
Z
0
h() dW =
Z
0
1
[g() , h()] dW ;
the above second moment can be computed directly using (8.8).
20. Let
Z t
Yt = g() dW ; t 0:
0
(a) Use (8.8) to show that
E[Y 2 ]
t
=
Hint: Observe that
Z
0
t
g() dW =
2
Z
0
1
Z
t
0
jg()j2 d:
g()I(0;t] () dW :
(b) Show that Yt has correlation function
RY (t1; t2 ) =
2
Z min(t1 ;t2)
0
July 18, 2002
jg()j2 d; t1 ; t2 0:
274
Chap. 8 Advanced Concepts in Random Processes
21. Consider the process
Yt = e,t V +
Z
0
t
e,(t, ) dW ; t 0;
where Wt is a Wiener process independent of V , and V has zero mean
and variance q2 . Show that Yt has correlation function
2 + 2 e,jt1,t2 j :
RY (t1 ; t2) = e,(t1 +t2 ) q2 , 2
2
Remark. If V is normal, then the process Yt is Gaussian and is known
as an Ornstein{Uhlenbeck process.
22. Let Wt be a Wiener process, and put
,t
Yt := ep We2t :
2
Show that
2
RY (t1; t2 ) = 2 e,jt1,t2 j :
In light of the remark above, this is another way to dene an Ornstein{
Uhlenbeck process.
23. So far we have dened the Wiener process Wt only for t 0. When
dening Wt for all t, we continue to assume that W0 0; that for s < t,
the increment Wt , Ws is a Gaussian random variable with zero mean
and variance 2 (t , s); that Wt has independent increments; and that Wt
has continuous sample paths. The only dierence is that s or both s and
t can be negative, and that increments can be located anywhere in time,
not just over intervals of positive time. In the following take 2 = 1.
(a) For t > 0, show that E[Wt2] = t.
(b) For s < 0, show that E[Ws2] = ,s.
(c) Show that
jtj + jsj , jt , sj :
E[WtWs ] =
2
Problems x8.4: Specication of Random Processes
24. Suppose X and Y are random variables with
},(X; Y ) 2 A =
XZ
i
1
,1
IA (xi ; y)fXY (xi ; y) dy;
where the xi are distinct real numbers, and fXY is a nonnegative function
satisfying
XZ 1
fXY (xi; y) dy = 1:
i
,1
July 18, 2002
8.6 Problems
275
(a) Show that
(b) Show that for C IR,
}(Y 2 C) =
1
Z
}(X = xk ) =
fXY (xk ; y) dy:
,1
Z X
C
i
fXY (xi; y) dy:
In other words, Y has marginal density
fY (y) =
X
i
fXY (xi ; y):
(c) Show that
Z fXY (xi; y) dy:
C pX (xi )
}(Y 2 C jX = xi) =
In other words,
(xi; y) :
fY jX (yjxi ) = fXY
pX (xi )
(d) For B IR, show that if we dene
}(X 2 B jY = y) :=
where
then
X
i
IB (xi )pX jY (xi jy);
pX jY (xijy) := fXYf (x(y)i ; y) ;
Y
Z
1
,1
}(X 2 B jY = y)fY (y) dy = }(X 2 B):
In other words, we have the law of total probability.
25. Let F be the standard normal cdf. Then F is a one-to-one mapping from
(,1; 1) onto (0; 1). Therefore, F has an inverse, F ,1: (0; 1) ! (,1; 1).
If U uniform(0; 1), show that X := F ,1(U) has F for its cdf.
26. Consider the cdf
8
0; x < 0;
>
>
>
>
< x2; 0 x < 1=2;
F(x) := > 1=4; 1=2 x < 1;
>
x=2; 1 x < 2;
>
>
:
1; x 2:
July 18, 2002
276
Chap. 8 Advanced Concepts in Random Processes
(a) Sketch F(x).
(b) For 0 < u < 1, sketch
G(u) := inf fx 2 IR : F(x) ug:
Hint: First identify the set
Bu := fx 2 IR : F(x) ug:
Then nd its inmum.
27. As illustrated in the previous problem, an arbitrary cdf F is usually not
invertible, either because the equation F (x) = u has more than one solution, e.g., F (x) = 1=4, or because it has no solution, e.g., F(x) = 3=8.
However, for any cdf F, we can always introduce the function
G(u) := inf fx 2 IR : F (x) ug; 0 < u < 1;
which, you will now show, can play the role of F ,1 in Problem 25.
(a) Show that if 0 < u < 1 and x 2 IR, then G(u) x if and only if
u F (x).
(b) Let U uniform(0; 1), and put X := G(U). Show that X has cdf
F.
28. In the text we considered discrete-time processes Xn for n = 1; 2; : : :: The
consistency condition (8.12) arose from the requirement that
},(X1 ; : : :; Xn; Xn+1 ) 2 B IR = },(X1 ; : : :; Xn) 2 B ;
where B IRn . For processes Xn with n = 0; 1; 2; : : :; we require not
only
},(Xm ; : : :; Xn; Xn+1 ) 2 B IR = },(Xm ; : : :; Xn ) 2 B ;
but also
},(Xm,1 ; Xm ; : : :; Xn) 2 IR B = },(Xm ; : : :; Xn ) 2 B ;
where now B IRn,m+1 . Let m;n(B) be a proposed formula for the
above right-hand side. Then the two consistency conditions are
m;n+1 (B IR) = m;n (B) and m,1;n (IR B) = m;n (B):
For integer-valued random processes, show that these are equivalent to
1
X
and
j =,1
1
X
j =,1
pm;n+1 (im ; : : :; in; j) = pm;n (im ; : : :; in)
pm,1;n (j; im ; : : :; in) = pm;n(im ; : : :; in );
where pm;n is the proposed joint probability mass function of Xm ; : : :; Xn.
July 18, 2002
8.6 Problems
277
29. Let q be anyPpmf, and let r(j ji) be any conditional pmf. In addition,
assume that k q(k)r(j jk) = q(j). Put
pm;n (im ; : : :; in ) := q(im ) r(im+1 jim ) r(im+2 jim+1 ) r(in jin,1):
Show that both consistency conditions for pmfs in the preceding problem
are satised.
Remark. This process is strictly stationary as dened in Section 6.2
since the upon writing out the formula for pm+k;n+k (im ; : : :; in), we see
that it does not depend on k.
30. Let n be a probability measure on IRn , and suppose that it is given in
terms of a joint density fn , i.e.,
n (Bn ) =
1
Z
,1
1
Z
,1
IBn (x1; : : :; xn)fn (x1; : : :; xn) dxn dx1:
Show that the consistency condition (8.12) holds if and only if
Z
1
,1
fn+1 (x1; : : :; xn; xn+1) dxn+1 = fn(x1 ; : : :; xn):
31. Generalize the preceding problem for the continuous-time consistency con-
dition (8.17).
32. Let Wt be a Wiener process. Let ft1 ;:::;tn denote the joint density of
Wt1 ; : : :; Wtn . Find ft1 ;:::;tn and show that it satises the density version
of (8.17) that you derived in the preceding problem. Hint: The joint
density should be Gaussian.
July 18, 2002
278
Chap. 8 Advanced Concepts in Random Processes
July 18, 2002
CHAPTER 9
Introduction to Markov Chains
A Markov chain is a random process with the property that given the values
of the process from time zero up through the current time, the conditional
probability of the value of the process at any future time depends only on its
value at the current time. This is equivalent to saying that the future and
the past are conditionally independent given the present (cf. Problem 29 in
Chapter 1).
Markov chains often have intuitively pleasing interpretations. Some examples discussed in this chapter are random walks (without barriers and with
barriers, which may be reecting, absorbing, or neither), queuing systems (with
nite or innite buers), birth{death processes (with or without spontaneous
generation), life (with states being \healthy," \sick," and \death"), and the
gambler's ruin problem.
Discrete-time Markov chains are introduced in Section 9.1 via the random
walk and its variations. The emphasis is on the Chapman{Kolmogorov equation
and the nding of stationary distributions.
Continuous-time Markov chains are introduced in Section 9.2 via the Poisson
process. The emphasis is on Kolmogorov's forward and backward dierential
equations, and their use to nd stationary distributions.
9.1. Discrete-Time Markov Chains
A sequence of discrete random variables, X0 ; X1 ; : : : is called a Markov
chain if for n 1,
}(Xn+1 = xn+1 jXn = xn; Xn,1 = xn,1; : : :; X0 = x0)
is equal to
}(Xn+1 = xn+1jXn = xn ):
In other words, given the sequence of values x0; : : :; xn, the conditional probability of what Xn+1 will be depends only on the value of xn .
Example 9.1 (Random Walk). Let X0 be an integer-valued random variable that is independent of the i.i.d. sequence Z1 ; Z2; : : :; where }(Zn = 1) = a,
}(Zn = ,1) = b, and }(Zn = 0) = 1 , (a + b). Show that if
Xn := Xn,1 + Zn ; n = 1; 2; : : :;
then Xn is a Markov chain.
Solution. It helps to write out
X1 = X0 + Z1
279
280
Chap. 9 Introduction to Markov Chains
X2 = X1 + Z2 = X0 + Z1 + Z2
..
.
Xn = Xn,1 + Zn = X0 + Z1 + + Zn :
The point here is that (X0 ; : : :; Xn ) is a function of (X0 ; Z1; : : :; Zn ), and hence,
Zn+1 and (X0 ; : : :; Xn ) are independent. Now observe that
}(Xn+1 = xn+1 jXn = xn; : : :; X0 = x0)
is equal to
}(Xn + Zn+1 = xn+1 jXn = xn; : : :; X0 = x0):
Using the substitution law, this becomes
}(Zn+1 = xn+1 , xn jXn = xn ; : : :; X0 = x0 ):
On account of the independence of Zn+1 and (X0 ; : : :; Xn), the above conditional probability is equal to
}(Zn+1 = xn+1 , xn):
Since this depends on xn but not on xn,1; : : :; x0, it follows by the next example
that }(Zn+1 = xn+1 , xn ) = }(Xn+1 = xn+1jXn = xn).
The Markov chain of the preceding example is called a random walk on
the integers. The random walk is said to be symmetric if a = b = 1=2. A
realization of a symmetric random walk is shown in Figure 9.1. Notice that
each point diers from the preceding one by 1.
To restrict the random walk to the nonnegative integers, we can take Xn =
max(0; Xn,1 + Zn ) (Problem 1).
Example 9.2. Show that if
}(Xn+1 = xn+1 jXn = xn; : : :; X0 = x0)
is a function of xn but not xn,1; : : :; x0, then the above conditional probability
is equal to
}(Xn+1 = xn+1jXn = xn ):
Solution. We use the identity }(A \ B) = }(AjB) }(B), where A =
fXn+1 = xn+1g and B = fXn = xn; : : :; X0 = x0g. Hence, if
}(Xn+1 = xn+1jXn = xn; : : :; X0 = x0) = h(xn);
then the joint probability mass function
}(Xn+1 = xn+1; Xn = xn; Xn,1 = xn,1; : : :; X0 = x0)
July 18, 2002
9.1 Discrete-Time Markov Chains
281
6
4
2
0
−2
−4
−6
0
10
20
30
40
50
60
70
80
Figure 9.1. Realization of a symmetric random walk Xn .
is equal to
h(xn) }(Xn = xn ; Xn,1 = xn,1; : : :; X0 = x0 ):
Summing both of these formulas over all values of xn,1; : : :; x0 shows that
}(Xn+1 = xn+1 ; Xn = xn) = h(xn) }(Xn = xn):
Hence, h(xn) = }(Xn+1 = xn+1 jXn = xn ) as claimed.
State Space and Transition Probabilities
The set of possible values that the random variables Xn can take is called
the state space of the chain. In the rest of this chapter, we take the state
space to be the set of integers or some specied subset of the integers. The
conditional probabilities
}(Xn+1 = j jXn = i)
are called transition probabilities. In this chapter, we assume that the
transition probabilities do not depend on time n. Such a Markov chain is
said to have stationary transition probabilities or to be time homogeneous.
For a time-homogeneous Markov chain, we use the notation
pij := }(Xn+1 = j jXn = i)
for the transition probabilities. The pij are also called the one-step transition probabilities because they are the probabilities of going from state i to
state j in one time step. One of the most common ways to specify the transition probabilities is with a state transition diagram as in Figure 9.2. This
July 18, 2002
282
Chap. 9 Introduction to Markov Chains
a
1− a
0
1− b
1
b
Figure 9.2. A state transition diagram. The diagram says that the state space is the nite
set f0; 1g, and that p01 = a, p10 = b, p00 = 1 , a, and p11 = 1 , b.
particular diagram says that the state space is the nite set f0; 1g, and that
p01 = a, p10 = b, p00 = 1 , a, and p11 = 1 , b. Note that the sum of all the
probabilities leaving a state must be one. This is because for each state i,
X
X
}(Xn+1 = j jXn = i) = 1:
pij =
j
j
The transition probabilities pij can be arranged in a matrix P, called the transition matrix, whose i j entry is pij . For the chain in Figure 9.2,
P = 1 ,b a 1 ,a b :
The top row of P contains the probabilities p0j , which is obtained by noting
the probabilities written next to all the arrows leaving state 0. Similarly, the
probabilities written next to all the arrows leaving state 1 are found in the
bottom row of P .
Examples
The general random walk on the integers has the state transition diagram
shown in Figure 9.3. Note that the Markov chain constructed in Example 9.1 is
1− ( a i + b i )
a i −1
ai
...
...
i
bi
b i +1
Figure 9.3. State transition diagram for a random walk on the integers.
a special case in which ai = a and bi = b for all i. The state transition diagram
is telling us that
8
bi ;
j = i , 1;
>
>
<
1
,
(a
+
b
);
i i j = i;
(9.1)
pij = >
a
j = i + 1;
i;
>
:
0;
otherwise:
July 18, 2002
9.1 Discrete-Time Markov Chains
283
1− ( a i + b i )
1− ( a 1 + b 1 )
a0
1− a 0
a i −1
a1
0
...
1
b1
ai
b2
...
i
bi
b i +1
Figure 9.4. State transition diagram for a random walk with barrier at the origin (also
called a birth{death process).
Hence, the transition matrix P is innite, tridiagonal, and its ith row is
0 bi 1 , (ai + bi) ai 0 :
Frequently, it is convenient to introduce a barrier at zero, leading to the
state transition diagram in Figure 9.4. In this case, we speak of the random
walk with barrier. If a0 = 1, the barrier is said to be reecting. If a0 = 0, the
barrier is said to be absorbing. Once a chain hits an absorbing state, the chain
stays in that state from that time onward. If ai = a and bi = b for all i, the
Markov chain is viewed as a model for a queue with an innite buer. A random
walk with barrier at the origin is also known as a birth{death process. With
this terminology, the state i is taken to be a population, say of bacteria. In this
case, if a0 > 0, there is spontaneous generation. If bi = 0 for all i, we have
a pure birth process. For i 1, the formula for pij is given by (9.1) above,
while for i = 0,
8
< 1 , a0; j = 0;
p0j = : a0; j = 1;
(9.2)
0; otherwise:
The transition matrix P is the tridiagonal, semi-innite matrix
3
2
1 , a0
a0
0
0
0 6
b1 1 , (a1 + b1)
a1
0
0 77
6
6
0
b
1
,
(a
+
b
)
a
0
77 :
2
2
2
2
P = 66
7
0
0
b
1
,
(a
+
b
)
a
3
3
3
3
5
4
..
..
..
.
.
.
.
.
.
Sometimes it is useful to consider a random walk with barriers at the origin
and at N, as shown in Figure 9.5. The formula for pij is given by (9.1) above
for 1 i N , 1, by (9.2) above for i = 0, and, for i = N, by
8
j = N , 1;
< bN ;
pNj = : 1 , bN ; j = N;
(9.3)
0; otherwise:
This chain is viewed as a model for a queue with a nite buer, especially if
ai = a and bi = b for all i. When a0 = 0 and bN = 0, the barriers at 0 and N
July 18, 2002
284
Chap. 9 Introduction to Markov Chains
1− ( a N − 1 + b N − 1 )
1− ( a 1 + b 1 )
a0
1− a 0
0
...
1
b1
aN − 1
aN − 2
a1
N −1
bN − 1
b2
N
1− b N
bN
Figure 9.5. State transition diagram for a queue with a nite buer.
are absorbing, and the chain is a model for the gambler's ruin problem. In
this problem, a gambler starts at time zero with 1 i N , 1 dollars, and
plays until he either runs out of money, that is, absorption into state zero, or
he acquires a fortune of N dollars and stops playing (absorption into state N).
If N = 2 and b2 = 0, the chain can be interpreted as the story of life if we view
state i = 0 as being the \healthy" state, i = 1 as being the \sick" state, and
i = 2 as being the \death" state. In this model, if you are healthy (in state 0),
you remain healthy with probability 1 , a0 and become sick (move to state 1)
with probability a0 . If you are sick (in state 1), you become healthy (move
to state 0) with probability b1, remain sick (stay in state 1) with probability
1 , (a1 + b1), or die (move to state 2) with probability b1. Since state 2 is
absorbing (b2 = 0), once you enter this state, you never leave.
Stationary Distributions
The n-step transition probabilities are dened by
p(ijn) := }(Xn = j jX0 = i):
This is the probability of going from state i (at time zero) to state j in n steps.
For a chain with stationary transition probabilities, it will be shown later that
the n-step transition probabilities are also stationary; i.e., for m = 1; 2; : : :;
}(Xn+m = j jXm = i) = }(Xn = j jX0 = i) = p(ijn) :
We also note that
p(0)
ij = }(X0 = j jX0 = i) = ij ;
where ij denotes the Kronecker delta, which is one if i = j and is zero
otherwise.
A Markov chain with stationary transition probabilities also satises the
Chapman{Kolmogorov equation,
X (n) (m)
p(ijn+m) =
pik pkj :
(9.4)
k
This equation, which is derived later, says that to go from state i to state j in
n + m steps, rst you go to some intermediate state k in n steps, and then you
July 18, 2002
9.1 Discrete-Time Markov Chains
285
go from state k to state j in m steps. Taking m = 1, we have the special case
p(ijn+1) =
If n = 1 as well,
p(2)
ij =
X (n)
pik pkj :
k
X
k
pik pkj :
This equation says that the matrix with entries p(2)
ij is equal to the product
(
n
)
PP. More generally, the matrix with entries pij , called the n-step transition
matrix , is equal to P n. The Chapman{Kolmogorov equation thus says that
P n+m = P nP m :
For many Markov chains, it can be shown [17, Section 6.4] that
lim }(Xn = j jX0 = i)
n!1
exists and does not depend on i. In other words, if the chain runs for a long
time, it reaches a \steady state" in which the probability of being in state j
does not depend on the initial state of the chain. If the above limit exists and
does not depend on i, we put
}
j := nlim
!1 (Xn = j jX0 = i):
We call fj g the stationary distribution or the equilibrium distribution
of the chain. Note that if we sum both sides over j, we get
X
X
j =
lim }(Xn = j jX0 = i)
j
j n!1
= nlim
!1
X
j
}(Xn = j jX0 = i)
= nlim
!1 1 = 1:
(9.5)
To nd the j , we can use the Chapman{Kolmogorov equation. If p(ijn) ! j ,
then taking limits in
X (n)
p(ijn+1) =
pik pkj
shows that
k
j =
X
k
k pkj :
If we think of as a row vector with entries j , and if we think of P as the
matrix with entries pij , then the above equation says that
= P:
July 18, 2002
286
Chap. 9 Introduction to Markov Chains
1− ( a+b )
a
a
1− a
1− ( a+b )
0
a
a
...
1
b
b
...
i
b
b
Figure 9.6. State transition diagram for a queuing system with an innite buer.
On account of (9.5), cannot be the zero vector. Hence, is a left eigenvector
of P with eigenvalue 1. To say this another way, I , P is singular; i.e., there
are many solutions of (I , P) = 0. The solution we want must satisfy not
only = P, but also the normalization condition,
X
j
j = 1;
derived in (9.5).
Example 9.3. The state transition diagram for a queuing system with an
innite buer is shown in Figure 9.6. Find the stationary distribution of the
chain.
Solution. We begin by writing out
j =
X
k
k pkj
(9.6)
for j = 0; 1; 2; : ::: For each j, the coecients pkj are obtained by inspection of
the state transition diagram. For j = 0, we must consider
0 =
X
k
k pk0:
We need the values of pk0. From the diagram, the only way to get to state
0 is from state 0 itself (with probability p00 = 1 , a) or from state 1 (with
probability p10 = b). The other pk0 = 0. Hence,
0 = 0 (1 , a) + 1 b:
We can rearrange this to get
1 = ab 0:
Now put j = 1 in (9.6). The state transition diagram tells us that the only way
to enter state 1 is from states 0, 1, and 2, with probabilities a, 1 , (a + b), and
b, respectively. Hence,
1 = 0 a + 1 [1 , (a + b)] + 2 b:
July 18, 2002
9.1 Discrete-Time Markov Chains
287
Substituting 1 = (a=b)0 yields 2 = (a=b)20 . In general, if we substitute
j = (a=b)j 0 and j ,1 = (a=b)j ,10 into
j = j ,1 a + j [1 , (a + b)] + j +1 b;
then we obtain j +1 = (a=b)j +1 0. We conclude that
j
j = ab 0 ; j = 0; 1; 2; : : ::
To solve for 0 , we use the fact that
1
X
j =0
or
j = 1;
1 a j
X
b = 1:
The geometric series formula shows that
0 = 1 , a=b;
and
j
j = ab (1 , a=b):
In other words, the stationary distribution is a geometric0(a=b) probability mass
function.
0
j =0
Derivation of the Chapman{Kolmogorov Equation
We begin by showing that
p(ijn+1) =
X (n)
l
pil plj :
To derive this, we need the following law of total conditional probability.
We claim that
}(X = xjZ = z) = X }(X = xjY = y; Z = z) }(Y = yjZ = z):
y
The right-hand side of this equation is simply
X }(X = x; Y = y; Z = z) }(Y = y; Z = z)
}(Y = y; Z = z) }(Z = z) :
y
Canceling common factors and then summing over y yields
}(X = x; Z = z) }
= (X = xjZ = z):
}(Z = z)
July 18, 2002
288
Chap. 9 Introduction to Markov Chains
We can now write
p(ijn+1) = }(Xn+1 = j jX0 = i)
X
X
=
}(Xn+1 = j jXn = ln ; : : :; X1 = l1 ; X0 = i)
=
=
=
ln
l1
X
X
ln
X
ln
l1
}(Xn = ln ; : : :; X1 = l1 jX0 = i)
}(Xn+1 = j jXn = ln ) }(Xn = ln ; : : :; X1 = l1 jX0 = i)
}(Xn+1 = j jXn = ln ) }(Xn = ln jX0 = i)
X (n)
l
pil plj :
This establishes that for m = 1,
p(ijn+m) =
X (n) (m)
k
pik pkj :
We now assume it is true for m and show it is true for m + 1. Write
p(ijn+[m+1]) = p([ijn+m]+1)
X (n+m)
=
pil plj
l
=
=
=
XX (n) (m) pik pkl plj
k
X (n) X (m) pik
pkl plj
k
l
X (n) (m+1)
pik pkj :
k
l
Stationarity of the n-step Transition Probabilities
We would now like to use the Chapman{Kolmogorov equation to show that
the n-step transition probabilities are stationary; i.e.,
}(Xn+m = j jXm = i) = }(Xn = j jX0 = i) = p(ijn) :
To do this, we need the fact that for any 0 < n,
}(Xn+1 = j jXn = i; X = l) = }(Xn+1 = j jXn = i):
(9.7)
To establish this fact, we use the law of total conditional probability to write
the left-hand side as
X
X
}(Xn+1 = j jXn = i; Xn,1 = ln,1; : : :; X = l; : : :; X0 = l0 )
ln,1
l0
}(Xn,1 = ln,1 ; : : :; X +1 = l +1 ; X ,1 = l ,1; : : :; X0 = l0 jXn = i; X = l);
July 18, 2002
9.2 Continuous-Time Markov Chains
289
where the sum is (n , 1)-fold, the sum over l being omitted. By the Markov
property, this simplies to
X
ln,1
X
l0
}(Xn+1 = j jXn = i)
}(Xn,1 = ln,1 ; : : :; X +1 = l +1 ; X ,1 = l ,1; : : :; X0 = l0 jXn = i; X = l);
which further simplies to }(Xn+1 = j jXn = i).
We can now show that the n-step transition probabilities are stationary. For
a time-homogeneous Markov chain, the case n = 1 holds by denition. Assume
it holds for m, and use the law of total conditional probability to write
}(Xn+m+1 = j jXm = i) = X }(Xn+m+1 = j jXn+m = k; Xm = i)
k
=
=
=
X
k
}(Xn+m = kjXm = i)
}(Xn+m+1 = j jXn+m = k)
}(Xn+m = kjXm = i)
X (n)
pik pkj
k
p(ijn+1) ;
by the Chapman{Kolmogorov equation.
9.2. Continuous-Time Markov Chains
A family of integer-valued random variables, fXt ; t 0g, is called a Markov
chain if for all n 1, and for all 0 s0 < < sn,1 < s < t,
}(Xt = j jXs = i; Xsn,1 = in,1 ; : : :; Xs0 = i0 ) = }(Xt = j jXs = i):
In other words, given the sequence of values i0 ; : : :; in,1; i, the conditional probability of what Xt will be depends only on the condition Xs = i. The quantity
}(Xt = j jXs = i) is called the transition probability.
Example 9.4. Show that the Poisson process of rate is a Markov chain.
Solution. To begin, observe that
}(Nt = j jNs = i; Nsn,1 = in,1; : : :; Ns0 = i0 );
is equal to
}(Nt , i = j , ijNs = i; Nsn,1 = in,1 ; : : :; Ns0 = i0 ):
July 18, 2002
290
Chap. 9 Introduction to Markov Chains
By the substitution law, this is equal to
}(Nt , Ns = j , ijNs = i; Nsn,1 = in,1; : : :; Ns0 = i0 ):
Since
(Ns ; Nsn,1 ; : : :; Ns0 )
is a function of
(9.8)
(9.9)
(Ns , Nsn,1 ; : : :; Ns1 , Ns0 ; Ns0 , N0 );
and since this is independent of Nt , Ns by the independent increments property
of the Poisson process, it follows that (9.9) and Nt , Ns are also independent.
Thus, (9.8) is equal to }(Nt , Ns = j , i), which depends on i but not on
in,1; : : :; i0 . It then follows that
}(Nt = j jNs = i; Nsn,1 = in,1 ; : : :; Ns0 = i0 ) = }(Nt = j jNs = i);
and we see that the Poisson process is a Markov chain.
As shown in the above example,
j ,i ,(t,s)
}(Nt = j jNs = i) = }(Nt , Ns = j , i) = [(t , s)] e
(j , i)!
depends on t and s only through t , s. In general, if a Markov chain has the
property that the transition probability }(Xt = j jXs = i) depends on t and s
only through t , s, we say that the chain is time-homogeneous or that it has
stationary transition probabilities. In this case, if we put
pij (t) := }(Xt = j jX0 = i);
then }(Xt = j jXs = i) = pij (t , s). Note that pij (0) = ij , the Kronecker
delta.
In the remainder of the chapter, we assume that Xt is a time-homogeneous
Markov chain with transition probability function pij (t). For such a chain, we
can derive the continuous-time Chapman{Kolmogorov equation,
pij (t + s) =
X
k
pik (t)pkj (s):
To derive this, we rst use the law of total conditional probability to write
pij (t + s) = }(Xt+s = j jX0 = i)
X
}(Xt+s = j jXt = k; X0 = i)}(Xt = kjX0 = i):
=
k
July 18, 2002
9.2 Continuous-Time Markov Chains
291
Now use the Markov property and time homogeneity to obtain
X
}(Xt+s = j jXt = k)}(Xt = kjX0 = i)
pij (t + s) =
=
k
X
k
pkj (s)pik (t):
The reader may wonder why the derivation of the continuous-time Chapman{
Kolmogorov equation is so much simpler than the derivation of the discrete-time
version. The reason is that in discrete time, the Markov property and time homogeneity are dened in a one-step manner. Hence, induction arguments are
rst needed to derive the discrete-time analogs of the continuous-time denitions !
Kolmogorov's Dierential Equations
In the remainder of the chapter, we assume that for small t > 0,
pij (t) gij t; i 6= j; and pii (t) 1 + giit:
These approximations tell us the conditional probability of being in state j at
time t in the very near future given that we are in state i at time zero. These
assumptions are more precisely written as
pii(t) , 1 = g :
(9.10)
lim pij (t) = gij and lim
ii
t#0
t#0 t
t
Note that gij 0, while gii 0.
Using the Chapman{Kolmogorov equation, write
pij (t + t) =
=
X
pik (t)pkj (t)
k
X
pij (t)pjj (t) + pik (t)pkj (t):
k6=j
Now subtract pij (t) from both sides to get
pij (t + t) , pij (t) = pij (t)[pjj (t) , 1] +
X
k6=j
pik (t)pkj (t):
(9.11)
Dividing by t and applying the limit assumptions (9.10), we obtain
p0ij (t) = pij (t)gjj +
X
k6=j
pik (t)gkj :
This is Kolmogorov's forward dierential equation, which can be written
more compactly as
X
p0ij (t) =
pik (t)gkj :
(9.12)
k
July 18, 2002
292
Chap. 9 Introduction to Markov Chains
To derive the backward equation, observe that since pij (t+t) = pij (t+t),
we can write
X
pij (t + t) =
pik (t)pkj (t)
=
k
X
pii (t)pij (t) + pik (t)pkj (t):
k6=i
Now subtract pij (t) from both sides to get
pij (t + t) , pij (t) = [pii(t) , 1]pij (t) +
X
k6=i
pik (t)pkj (t):
Dividing by t and applying the limit assumptions (9.10), we obtain
p0ij (t) = giipij (t) +
X
k6=i
gik pkj (t):
This is Kolmogorov's backward dierential equation, which can be written more compactly as
X
p0ij (t) =
gik pkj (t):
(9.13)
k
The derivations of both the forward and backward dierential equations
require taking a limit in t inside the sum over k. For example, in deriving the
backward equation, we tacitly assumed that
X
X pik (t)
pik (t) p (t):
p
(t)
=
lim
(9.14)
lim
kj
t
#
0
t#0 k6=i t
t kj
k6=i
If the state space of the chain is nite, the above sum is nite and there is no
problem. Otherwise, additional technical assumptions are required to justify
this step. We now show that a sucient assumption for deriving the backward
equation is that
X
gij = ,gii < 1:
j 6=i
A chain that satises this condition is said to be conservative. For any nite
N, observe that
X pik (t)
X pik (t)
p
(t)
pkj (t):
kj
k6=i t
jkjN t
k6=i
Since the right-hand side is a nite sum,
X pik (t)
X
lim
p
(t)
gik pkj (t):
kj
t#0
k6=i t
jkjN
k6=i
July 18, 2002
9.2 Continuous-Time Markov Chains
293
Letting N ! 1 shows that
X pik (t)
X
pkj (t) gik pkj (t):
(9.15)
k6=i
k6=i t
To get an upper bound on the limit, take N i and write
X pik (t)
X pik (t)
X pik (t)
p
(t)
=
p
(t)
+
pkj (t):
kj
kj
k6=i t
jkjN t
jkj>N t
lim
t#0
k6=i
Since pkj (t) 1,
X pik (t)
X pik (t)
X pik (t)
p
(t)
p
(t)
+
kj
kj
k6=i t
jkjN t
jkj>N t
k6=i
pik (t) p (t) + 1 1 , X p (t)
kj
ik
t
jkjN t
jkjN
k6=i
X pik (t)
1 , pii(t) , X pik (t) :
=
p
(t)
+
kj
t
jkjN t
jkjN t
=
X
k6=i
k6=i
Since these sums are nite,
X
(t) p (t) X g p (t) , g , X g :
lim pikt
kj
ik kj
ii
ik
t#0
k6=i
jkjN
jkjN
k6=i
k6=i
Letting N ! 1 shows that
X pik (t)
X
X
lim
p
(t)
g
p
(t)
,
g
,
gik :
kj
ik
kj
ii
t#0
k6=i t
k6=i
k6=i
If the chain is conservative, this simplies to
X
X pik (t)
p
(t)
gik pkj (t):
lim
kj
t#0 k6=i t
k6=i
Combining this with (9.15) yields (9.14), thus justifying the backward equation.
Readers familiar with linear system theory may nd it insightful to write the
forward and backward equations in matrix form. Let P(t) denote the matrix
whose i j entry is pij (t), and let G denote the matrix whose i j entry is gij (G
is called the generator matrix, or rate matrix). Then the forward equation
(9.12) becomes
P 0(t) = P (t)G;
July 18, 2002
294
Chap. 9 Introduction to Markov Chains
and the backward equation (9.13) becomes
P 0(t) = GP(t);
The initial condition in both cases is P (0) = I. Under suitable assumptions,
the solution of both equations is given by the matrix exponential,
1 (Gt)n
X
Gt
:
P (t) = e :=
n=0 n!
When the state space is nite, G is a nite-dimensional matrix, and the theory
is straightforward. Otherwise, more careful analysis is required.
Stationary Distributions
Let us suppose that pij (t) ! j as t ! 1, independently of the initial state
i. Then for large t, pij (t) is approximately constant, and we therefore hope that
its derivative is converging to zero. Assuming this to be true, the limiting form
of the forward equation (9.12) is
X
0 =
k gkj :
k
P
Combining this with the normalization condition k k = 1 allows us to solve
for k much as in the discrete case.
9.3. Problems
Problems x9.1: Introduction to Discrete-Time Markov Chains
1. Let X0 ; Z1; Z2 ; : : : be a sequence of independent discrete random variables.
Put
Xn = g(Xn,1 ; Zn ); n = 1; 2; : : ::
Show that Xn is a Markov chain. For example, if Xn = max(0; Xn,1 +
Zn ), where X0 and the Zn are as in Example 9.1, then Xn is a random
walk restricted to the nonnegative integers.
2. Let Xn be a time-homogeneous Markov chain with transition probabilities
pij . Put i := }(X0 = i). Express
}(X0 = i; X1 = j; X2 = k; X3 = l)
in terms of i and entries from the transition probability matrix.
3. Find the stationary distribution of the Markov chain in Figure 9.2.
4. Draw the state transition diagram and nd the stationary distribution of
the Markov chain whose transition
matrix is 3
2
1=2 1=2 0
P = 4 1=4 0 3=4 5 :
1=2 1=2 0
Answer: 0 = 5=12, 1 = 1=3, 2 = 1=4.
July 18, 2002
9.3 Problems
295
5. Draw the state transition diagram and nd the stationary distribution of
the Markov chain whose transition matrix is
2
3
0 1=2 1=2
P = 4 1=4 3=4 0 5 :
1=4 3=4 0
Answer: 0 = 1=5, 1 = 7=10, 2 = 1=10.
6. Draw the state transition diagram and nd the stationary distribution of
the Markov chain whose transition matrix is
2
3
1=2 1=2 0
0
6
0 1=10 0 77 :
P = 64 9=10
0 1=10 0 9=10 5
0
0 1=2 1=2
Answer: 0 = 9=28, 1 = 5=28, 2 = 5=28, 3 = 9=28.
7. Find the stationary distribution of the queuing system with nite buer
of size N, whose state transition diagram is shown in Figure 9.7.
1− ( a + b )
a
1− a
1− ( a + b )
0
a
a
a
N −1
...
1
b
b
b
N
1− b
b
Figure 9.7. State transition diagram for a queue with a nite buer.
8. Show that if i := }(X0 = i), then
}(Xn = j) =
X
i
i p(ijn):
If fj g is the stationary distribution, and if i = i , show that }(Xn =
j) = j .
Problems x9.2: Continuous-Time Markov Chains
9. The general continuous-time random walk is dened by
gij =
8
>
>
<
i ;
j = i , 1;
,(i + i ); j = i;
i ;
j = i + 1;
0;
otherwise:
Write out the forward and backward equations. Is the chain conservative?
>
>
:
July 18, 2002
296
Chap. 9 Introduction to Markov Chains
10. The continuous-time queue with innite buer can be obtained by mod-
ifying the general random walk in the preceding problem to include a
barrier at the origin. Put
8
< ,0 ; j = 0;
g0j = : 0; j = 1;
0; otherwise:
Find the stationary distribution assuming
1 X
0
j =1
11.
12.
13.
14.
j ,1 < 1:
1 j
If i = and i = for all i, simplify the above condition to one involving
only the relative values of and .
Modify Problem 10 to include a barrier at some nite N. Find the stationary distribution.
For the chain in Problem 10, let
j = j + and j = j;
where , , and are positive. Put mi (t) := E[XtjX0 = i]. Derive a
dierential equation for mi (t) and solve it. Treat the cases = and
6= separately. Hint: Use the forward equation (9.12).
For the chain in Problem 10, let i = 0 and i = . Write down and solve
the forward equation (9.12). Hint: Equation (8.4).
Let T denote the rst time a chain leaves state i,
T := minft 0 : Xt 6= ig:
Show that given X0 = i, T is conditionally exp(,gii ). In other words,
the time the chain spends in state i, known as the sojourn time, has
an exponential density with parameter ,gii . Hints: By Problem 43 in
Chapter 4, it suces to prove that
}(T > t + tjT > t; X0 = i) = }(T > tjX0 = i):
To derive this equation, use the fact that if Xt is right-continuous,
T > t if and only if Xs = i for 0 s t:
Use the Markov property in the form
}(Xs = i; t s t + tjXs = i; 0 s t)
= }(Xs = i; t s t + tjXt = i);
July 18, 2002
9.3 Problems
297
and use time homogeneity in the form
}(Xs = i; t s t + tjXt = i) = }(Xs = i; 0 s tjX0 = i):
To identify the parameter of the exponential density, you may use the
approximation }(Xs = i; 0 s tjX0 = i) pii (t).
15. The notion a Markov chain can be generalized to include random variables
that are not necessarily discrete. We say that Xt is a continuous-time
Markov process if
}(Xt 2 B jXs = x; Xsn,1 = xn,1; : : :; Xs0 = x0) = }(Xt 2 B jXs = x):
Such a process is time homogeneous if }(Xt 2 B jXs = x) depends on t
and s only through t,s. Show that the Wiener process is a Markov process
that is time homogeneous. Hint: It is enough to look at conditional cdfs;
i.e., show that
}(Xt yjXs = x; Xsn,1 = xn,1; : : :; Xs0 = x0) = }(Xt yjXs = x):
16. Let Xt be a time-homogeneous Markov process as dened in the previous
problem. Put
Pt(x; B) := }(Xt 2 B jX0 = x);
and assume that there is a corresponding conditional density, denoted by
ft (x; y), such that
Z
Pt (x; B) = ft (x; y) dy:
B
Derive the Chapman{Kolmogorov equation for conditional densities,
ft+s (x; y) =
Hint: It suces to show that
Pt+s (x; B) =
Z
1
,1
Z
1
,1
fs (x; z)ft(z; y) dz:
fs (x; z)Pt(z; B) dz:
To derive this, you may assume that a law of total conditional probability
holds for random variables with appropriate conditional densities.
July 18, 2002
298
Chap. 9 Introduction to Markov Chains
July 18, 2002
CHAPTER 10
Mean Convergence and Applications
As mentioned at the beginning of Chapter 1, limit theorems have been the
foundation of the success of Kolmogorov's axiomatic theory of probability. In
this chapter and the next, we focus on dierent notions of convergence and their
implications.
Section 10.1 introduces the notion of convergence in mean of order p. Section 10.2 introduces the normed Lp spaces. Norms provide a compact notation
for establishing results about convergence in mean of order p. We also point out
that the Lp spaces are complete. Completeness is used to show that convolution
sums like
1
X
hk Xn,k
k=0
are well dened. This is an important result because sums like this represent
the response of a causal, linear, time-invariant system to a random input Xk .
Section 10.3 uses completeness to develop the Wiener integral. Section 10.4
introduces the notion of projections. The L2 setting allows us to introduce a
general orthogonality principle that unies results from earlier chapters on the
Wiener lter, linear estimators of random vectors, and minimum mean squared
error estimation. The completeness of L2 is also used to prove the projection
theorem. In Section 10.5, the projection theorem is used to to establish the
existence conditional expectation for random variables that may not be discrete
or jointly continuous. In Section 10.6, completeness is used to establish the
spectral representation of wide-sense stationary random sequences.
10.1. Convergence in Mean of Order p
We say that Xn converges in mean of order p to X if
lim E[jXn , X jp ] = 0;
n!1
where 1 p < 1. Mostly we focus on the cases p = 1 and p = 2. The case
p = 1 is called convergence in mean or mean convergence. The case p = 2
is called mean square convergence or quadratic mean convergence.
Example 10.1. Let Xn N(0; 1=n2). Show that pnXn converges in
mean square to zero.
Solution. Write
p
E[j nXn j2] = nE[Xn2 ] = n n12 = n1 ! 0:
299
300
Chap. 10 Mean Convergence and Applications
In the following example, Xn converges in mean square to zero, but not in
mean of order 4.
Example 10.2. Let Xn have density
fn (x) = gn (x)(1 , 1=n3) + hn (x)=n3;
where gn N(0; 1=n2) and hn N(n; 1). Show that Xn converges to zero in
mean square, but not in mean of order 4.
Solution. For convergence in mean square, write
E[jXnj2] = n12 (1 , 1=n3) + (1 + n2 )=n3 ! 0:
However, using Problem 28 in Chapter 3, we have
3
E[jXnj4] = n4 (1 , 1=n3) + (n4 + 6n2 + 3)=n3 ! 1:
The preceding example raises the question of whether Xn might converge
in mean of order 4 to something other than zero. However, by Problem 8 at
the end of the chapter, if Xn converged in mean of order 4 to some X, then it
would also converge in mean square to X. Hence, the only possible limit for
Xn in mean of order 4 is zero, and as we saw, Xn does not converge in mean
of order 4 to zero.
Example 10.3. Let X1; X2; : : : be uncorrelated random variables with common mean m and common variance 2 . Show that the sample mean
n
X
Mn := n1 Xi
i=1
converges in mean square to m. We call this the mean-square law of large
numbers for uncorrelated random variables.
Solution. Since
n
X
Mn , m = n1 (Xi , m);
i=1
we can write
E[jMn
, mj2 ]
n
n
X
X
= n12 E
(Xi , m)
(Xj , m)
i=1
j =1
n X
n
X
1
= n2
E[(Xi , m)(Xj , m)]:
i=1 j =1
July 18, 2002
(10.1)
10.1 Convergence in Mean of Order p
301
Since Xi and Xj are uncorrelated, the preceding expectations are zero when
i 6= j. Hence,
n
2
X
2 ;
1
2
=
E[jMn , mj ] = n2 E[(Xi , m)2 ] = n
2
n
n
i=1
which goes to zero as n ! 1.
Example 10.4. Here is a generalization of the preceding example. Let
X1 ; X2; : : : be wide-sense stationary; i.e., the Xi have common mean m = E[Xi ],
and the covariance E[(Xi , m)(Xj , m)] depends only on the dierence i , j.
Put
R(i) := E[(Xj +i , m)(Xj , m)]:
Show that
n
X
Mn := n1 Xi
i=1
converges in mean square to m if and only if
,1
1 nX
lim
R(k) = 0:
n!1 n
(10.2)
k=0
We call this result the mean-square law of large numbers for wide-sense
stationary sequences or the mean-square ergodic theorem for wide-sense
stationary sequences Note that a sucient condition for (10.2) to hold is
that limk!1 R(k) = 0 (Problem 3).
Solution. We show that (10.2) implies Mn converges in mean square to
m. The converse is left to the reader in Problem 4. From (10.1), we see that
n2 E[jMn , mj2 ] =
=
n X
n
X
i=1 j =1
X
i=j
R(i , j)
R(0) + 2
= nR(0) + 2
= nR(0) + 2
X
R(i , j)
j<i
n
i,1
XX
i=2 j =1
n X
i, 1
X
i=2 k=1
R(i , j)
R(k):
On account of (10.2), given " > 0, there is an N such that for all i N,
i
i,1
1 X
R(k)
, 1 k=1 < ":
July 18, 2002
302
Chap. 10 Mean Convergence and Applications
For n N, the above double sum can be written as
i,1
X
1
R(k) =
R(k) + (i , 1) i , 1 R(k) :
i=2 k=1
i=2 k=1
k=1
i=N
The magnitude of the right-most double sum is upper bounded by
n X
i,1
X
NX
,1 X
i,1
n
X
i,1
n
X
X
1
, 1) i , 1 R(k) < " (i , 1)
i=N
i=N
k=1
n
X
< " (i , 1)
n
X
(i
i=1
= "n(n2, 1)
2
< "n2 :
It now follows that limn!1 E[jMn , mj2 ] can be no larger than ". Since " is
arbitrary, the limit must be zero.
Example 10.5. Let Wt be a Wiener process with E[Wt2] = 2t. Show that
Wt =t converges in mean square to zero as t ! 1.
Solution. Write
2
2
2
E Wt = 2t = ! 0:
t
t
t
Example 10.6. Let X be a nonnegative random variable with nite mean;
i.e., E[X] < 1. Put
(
X; X n;
n; X > n:
The idea here is that Xn is a bounded random variable that can be used to
approximate X. Show that Xn converges in mean to X.
Solution. Since X Xn, E[jXn , X j] = E[X , Xn]. Since X , Xn is
nonnegative, we can write
Xn := min(X; n) =
E[X , Xn ] =
1
Z
0
}(X , Xn > t) dt;
where we have appealed to (4.2) in Section 4.2. Next, for t 0, a little thought
shows that
fX , Xn > tg = fX > t + ng:
July 18, 2002
10.2 Normed Vector Spaces of Random Variables
Hence,
E[X , Xn ] =
1
Z
0
}(X > t + n) dt =
303
Z
1
n
}(X > ) d;
which goes to zero as n ! 1 on account of the fact that
1 > E[X] =
1
Z
0
}(X > ) d:
10.2. Normed Vector Spaces of Random Variables
We denote by Lp the set of all random variables X with the property that
E[jX jp] < 1. We claim that Lp is a vector space. To prove this, we need
to show that if E[jX jp] < 1 and E[jY jp] < 1, then E[jaX + bY jp ] < 1 for
all scalars a and b. To begin, recall that the triangle inequality applied to
numbers x and y says that
jx + yj jxj + jyj:
If jyj jxj, then
jx + yj 2jxj;
and so
jx + yjp 2p jxjp:
A looser bound that has the advantage of being symmetric is
jx + yjp 2p (jxjp + jyjp ):
It is easy to see that this bound also holds if jyj > jxj. We can now write
E[jaX + bY jp ] E[2p(jaX jp + jbY jp)]
= 2p (jajpE[jX jp] + jbjpE[jY jp ]):
Hence, if E[jX jp] and E[jY jp ] are both nite, then so is E[jaX + bY jp ].
For X 2 Lp , we put
kX kp := E[jX jp]1=p:
We claim that k kp is a norm on Lp , by which we mean the following three
properties hold:
(i) kX kp 0, and kX kp = 0 if and only if X is the zero random variable.
(ii) For scalars a, kaX kp = jaj kX kp.
(iii) For X; Y 2 Lp ,
kX + Y kp kX kp + kY kp :
As in the numerical case, this is also known as the triangle inequality.
July 18, 2002
304
Chap. 10 Mean Convergence and Applications
The rst two properties are obvious, while the third one is known as Minkowski's
inequality, which is derived in Problem 9.
Observe now that Xn converges in mean of order p to X if and only if
lim kXn , X kp = 0:
n!1
Hence, the three norm properties above can be used to derive results about
convergence in mean of order p, as will be seen in the problems.
Recall that a sequence of real numbers xn is Cauchy if for every " > 0, for
all suciently large n and m, jxn , xm j < ". A basic fact that can be proved
about the set of real numbers is that it is complete; i.e., given any Cauchy
sequence of real numbers xn, there is a real number x such that xn converges to
x [38, p. 53, Theorem 3.11]. Similarly, a sequence of random variables Xn 2 Lp
is said to be Cauchy if for every " > 0, for all suciently large n and m,
kXn , Xm kp < ":
It can be shown that the Lp spaces are complete; i.e., if Xn is a Cauchy sequence
of Lp random variables, then there exists an Lp random variable X such that
Xn converges in mean of order p to X. This is known as the Riesz{Fischer
Theorem [37, p. 244]. A normed vector space that is complete is called a
Banach space.
Of special interest is the case p = 2 because the norm kk2 can be expressed
in terms of the inner product
hX; Y i := E[XY ]; X; Y 2 L2:
It is easily seen that
hX; X i1=2 = kX k2 :
Because the norm k k2 can be obtained using the inner product, L2 is called
an inner-product space. Since the Lp spaces are complete, L2 in particular
is a complete inner-product space. A complete inner-product space is called a
Hilbert space.
The space L2 has several important properties. First, for xed Y , it is easy
to see that hX; Y i is linear in X. Second, the simple relationship between the
norm and the inner product imply the parallelogram law (Problem 25),
kX + Y k22 + kX , Y k22 = 2(kX k22 + kY k22 ):
Third, there is the Cauchy{Schwarz inequality (Problem 1 in Chapter 6),
hX; Y i kX k2 kY k2:
For complex-valued random variables (dened in Section 7.5), we put hX;Y i := E[XY ].
July 18, 2002
10.2 Normed Vector Spaces of Random Variables
Example 10.7. Show that
1
X
k=1
305
hk X k
is well dened as an element of L2 assuming that
1
X
k=1
jhk j < 1 and E[jXkj2 ] B; for all k;
where B is a nite constant.
Solution. Consider the partial sums,
Yn :=
n
X
k=1
hk Xk :
Each Yn is an element of L2 , which is complete. If we can show that Yn is a
Cauchy sequence, then therePwill
exist a Y 2 L2 with kYn , Y k2 ! 0. Thus,
1
the innite-sum expression
k=1 hk Xk is understood to be shorthand for \the
P
mean-square limit of nk=1 hk Xk as n ! 1." Next, for n > m, write
Yn , Ym =
Then
n
X
k=m+1
hk Xk :
kYn , Ym k22 = hYn , Ym ; Yn , Ym i
=
X
n
n
X
hk Xk ;
hl Xl
k=m+1
l=m+1
n
n
X
X
jhk j jhl j hXk ; Xl i
k=m+1 l=m+1
n
n
X
X
jhk j jhl j kXk k2 kXl k2;
k=m+1 l=m+1
p
by the Cauchy{Schwarz inequality. Next, since kXk k2 = E[jXk j2]1=2 B,
kYn , Ym k22 B
P1
Since k=1 jhkj < 1 implies1
n
X
k=m+1
n
X
k=m+1
2
jhk j :
jhk j ! 0 as n and m ! 1 with n > m;
it follows that Yn is Cauchy.
July 18, 2002
306
Chap. 10 Mean Convergence and Applications
Under the conditions of the previous example, since kYn , Y k2 ! 0, we can
write, for any Z 2 L2 ,
hYn ; Z i , hY; Z i = hYn , Y; Z i
kYn , Y k2 kZ k2 ! 0:
In other words,
lim hY ; Z i
n!1 n
= hY; Z i:
Taking Z = 1 and writing the inner products as expectations yields
lim E[Yn] = E[Y ];
n!1
or, using the denitions of Yn and Y ,
lim E
n!1
X
n
k=1
X
1
hk Xk = E
k=1
hk Xk :
Since the sum on the left is nite, we can bring the expectation inside and write
lim
n!1
In other words,
n
X
k=1
1
X
X
1
hk E[Xk ] = E
X
1
hk E[Xk ] = E
k=1
k=1
k=1
hk Xk :
hk Xk :
The foregoing example and discussion have the following application. Consider a discrete-time, causal, stable, linear, time-invariant system with impulse
response hk . Now suppose that the random sequence Xk is applied to the input
of this system. If E[jXk j2] is bounded as in the example, theny the output of
the system at time n is
1
X
k=0
hk Xn,k ;
which is a well-dened element of L2 . Furthermore,
X
1
E
k=0
hk Xn,k =
1
X
k=0
hk E[Xn,k]:
y The assumption in the example, P jhk j < 1, is equivalent to the assumption that the
k
system is stable.
July 18, 2002
10.3 The Wiener Integral (Again)
307
10.3. The Wiener Integral (Again)
In Section 8.3, we dened the Wiener integral
Z
0
1
g() dW :=
X
i
gi(Wti+1 , Wti );
for piecewise constant g, say g() = gi for ti < t ti+1 for a nite number of
intervals, and g() = 0 otherwise. In this case, since the integral is the sum of
scaled, independent, zero mean, Gaussian increments of variance 2 (ti+1 , ti),
Z
E
0
1
g() dW
2 = 2
X
i
gi2 (ti+1 , ti ) =
1
Z
0
g()2 d:
We now dene the Wiener integral for arbitrary g satisfying
1
Z
g()2 d < 1:
0
(10.3)
To do this, we use the fact [13, p. 86, Prop. 3.4.2] that for g satisfying (10.3),
there always exists a sequence of piecewise constant functions gn converging to
g in the mean-square sense
lim
n!1
1
Z
0
jgn() , g()j2 d = 0:
(10.4)
The set Rof1g satisfying (10.3) is an inner product space with inner product
h g; hi = 0 g()h() d and corresponding norm kgk = h g; gi1=2 . Thus, (10.4)
says that kgn ,gk ! 0. In particular, this implies gn is Cauchy; i.e., kgn ,gm k !
0 as n; m ! 1 (cf. Problem 11). Consider the random variables
Yn :=
Z
0
1
gn () dW :
Since each gn is piecewise constant, Yn is well dened and is Gaussian with zero
mean and variance
Z 1
gn()2 d:
0
Now observe that
kYn , Ym k22 = E[jYn , Ym j2]
2 Z 1
Z 1
g
()
dW
g
()
dW
n
m
0
0
2 Z 1
E
gn () gm () dW 0
Z 1
,
= E
=
=
,
0
jgn() , gm ()j2 d;
July 18, 2002
308
Chap. 10 Mean Convergence and Applications
since gn , gm is piecewise constant. Thus,
kYn , Ym k22 = kgn , gm k2:
Since gn is Cauchy, we see that Yn is too. Since L2 is complete, there exists a
random variable Y 2 L2 with kYn , Y k2 ! 0. We denote this random variable
by
Z 1
g() dW ;
0
and call it the Wiener integral of g.
10.4. Projections, Orthonality Principle, Projection Theorem
Let C be a subset of Lp . Given X 2 Lp , suppose we want to approximate
X by some Xb 2 C. We call Xb a projection of X onto C if Xb 2 C and if
kX , Xb kp kX , Y kp ; for all Y 2 C:
Note that if X 2 C, the we can take Xb = X.
Example 10.8. Let C be the unit ball,
C := fY 2 Lp : kY kp 1g:
For X 2= C, i.e., kX kp > 1, show that
Xb = kXXk :
p
Solution. First note that the proposed formula for Xb satises kXb kp = 1
so that Xb 2 C as required. Now observe that
X
b
kX , X kp = X , kX k p p
= 1 , kX1k kX kp
p
= kX kp , 1:
Next, for any Y 2 C,
kX , Y kp kX kp , kY kp ; by Problem 15;
= kX kp , kY kp
kX kp , 1
= kX , Xb kp:
b
Thus, no Y 2 C is closer to X than X.
July 18, 2002
10.4 Projections, Orthonality Principle, Projection Theorem
309
Much more can be said about projections when p = 2 and when the set we
are projecting onto is a subspace rather than an arbitrary subset.
We now present two fundamental results about projections onto subspaces
of L2 . The rst result is the orthogonality principle. Let M be a subspace
of L2 . If X 2 L2 , then Xb 2 M satises
kX , Xb k2 kX , Y k2; for all Y 2 M;
(10.5)
if and only if
b Y i = 0; for all Y 2 M:
hX , X;
(10.6)
Furthermore, if such an Xb 2 M exists, it is unique. Observe that there is no
claim that an Xb 2 M exists that satises either (10.5) or (10.6). In practice,
we try to nd an Xb 2 M satisfying (10.6), since such an Xb then automatically
satises (10.5). This was the approach used to derive the Wiener lter in
Section 6.5, where we implicitly took (for xed t)
M = Vbt : Vbt is given by (6.7) and E[Vbt2 ] < 1 :
In Section 7.3, when we discussed linear estimation of random vectors, we implicitly took
M = fAY + b : A is a matrix and b is a column vector g:
When we discussed minimummean squared error estimation, also in Section 7.3,
we implicitly took
M = g(Y ) : g is any function such that E[g(Y )2 ] < 1 :
Thus, several estimation problems discussed in earlier chapters are seen to be
special cases of nding the projection onto a suitable subspace of L2 . Each of
these special cases had its version of the orthogonality principle, and so it should
be no trouble for the reader to show that (10.6) implies (10.5). The converse is
also true, as we now show. Suppose (10.5) holds, but for some Y 2 M,
b Y i = c 6= 0:
hX , X;
Because we can divide this equation by kY k2, there is no loss of generality in
assuming kY k2 = 1. Now, since M is a subspace containing both Xb and Y ,
Xb + cY also belongs to M. We show that this new vector is strictly closer to
b contradicting (10.5). Write
X than X,
b , cY k2
kX , (Xb + cY )k22 = k(X , X)
2
= kX , Xb k22 , jcj2 , jcj2 + jcj2
= kX , Xb k22 , jcj2
> kX , Xb k22:
July 18, 2002
310
Chap. 10 Mean Convergence and Applications
The second fundamental result to be presented is the Projection Theorem. Recall that the orthogonality principle does not guarantee the existence
of an Xb 2 M satisfying (10.5). If we are not smart enough to solve (10.6), what
can we do? This is where the projection theorem comes in. To state and prove
this result, we need the concept of a closed set. We say that M is closed if
whenever Xn 2 M and kXn , X k2 ! 0 for some X 2 L2 , the limit X must
actually be in M. In other words, a set is closed if it contains all the limits of
all converging sequences from the set.
Example 10.9. Show that the set of Wiener integrals
M :=
1
Z
0
is closed.
g() dW :
1
Z
0
g()2 d
<1
Solution. A sequence Xn from M has the form
Xn =
1
Z
0
gn () dW
for square-integrable g. Suppose Xn converges in mean square to some X. We
must show that there exists a square-integrable function g for which
X =
1
Z
0
g() dW :
Since Xn converges, it is Cauchy. Now observe that
kXn , Xm k22 = E[jXn , Xm j2]
2 Z 1
Z 1
E gn() dW
gm () dW 0
0
2 Z 1
E gn () gm () dW 0
Z 1
,
=
,
=
jgn() , gm ()j2 d
= kgn , gm k2:
=
0
Thus, gn is Cauchy. Since the set of square-integrable time functions is complete
(the Riesz{Fischer Theorem again [37, p. 244]), there is a square-integrable g
with kgn , gk ! 0. For this g, write
Xn
,
1
Z
0
2
g() dW 2
2 Z 1
Z 1
g
()
dW
g()
dW
n
0
0
2 Z 1
E
gn() g() dW 0
Z 1
,
= E
=
,
=
jgn() , g()j2 d
0
= kgn , gk2 ! 0:
July 18, 2002
10.5 Conditional Expectation
311
R
Since mean-square limits are unique (Problem 14), X = 01 g() dW .
Projection Theorem. If M is a closed subspace of L2 , and X 2 L2, then there
exists a unique Xb 2 M such that (10.5) holds.
To prove this result, rst put h := inf Y 2M kX , Y k2 . From the denition
of the inmum, there is a sequence Yn 2 M with kX , Ynk2 ! h. We will
show that Yn is a Cauchy sequence. Since L2 is a Hilbert space, Yn converges
b must be in M.
to some limit in L2 . Since M is closed, the limit, say X,
To show Yn is Cauchy, we proceed as follows. By the parallelogram law,
2(kX , Yn k22 + kX , Ym k22) = k2X , (Yn + Ym )k22 + kYm , Yn k22
2
= 4X , Yn +2 Ym + kYm , Yn k22:
2
Note that the vector (Yn + Ym )=2 2 M since M is a subspace. Hence,
2(kX , Yn k22 + kX , Ym k22) 4h2 + kYm , Yn k22:
Since kX , Yn k2 ! h, given " > 0, there exists an N such that for all n N,
kX , Yn k2 < h + ". Hence, for m; n N,
kYm , Ynk22 < 2((h + ")2 + (h + ")2 ) , 4h2
= 4"(2h + ");
and we see that Yn is Cauchy.
Since L2 is a Hilbert space, and since M is closed, Yn ! Xb for some Xb 2 M.
We now have to show that kX , Xb k2 kX , Y k2 for all Y 2 M. Write
kX , Xb k2 = kX , Yn + Yn , Xb k2
kX , Yn k2 + kYn , Xb k2:
Since kX , Ynk2 ! h and since kYn , Xb k2 ! 0,
kX , Xb k2 h:
Since h kX , Y k2 for all Y 2 M, kX , Xb k2 kX , Y k2 for all Y 2 M.
The uniqueness of Xb is left to Problem 22.
10.5. Conditional Expectation
In earlier chapters, we dened conditional expectation separately for discrete
and jointly continuous random variables. We are now in a position to introduce
a unied denition. The new denition is closely related to the orthogonality
principle and the projection theorem. It also reduces to the earlier denitions
in the discrete and jointly continuous cases.
July 18, 2002
312
Chap. 10 Mean Convergence and Applications
We say that gb(Y ) is the conditional expectation of X given Y if
E[Xg(Y )] = E[bg(Y )g(Y )]; for all bounded functions g:
(10.7)
When X and Y are discrete or jointly continuous it is easy to check that bg(y) =
E[X jY = y] solves this equation.z However, the importance of this denition
is that we can prove the existence and uniqueness of such a function bg even if
X and Y are not discrete or jointly continuous, as long as X 2 L1 . Recall that
uniqueness was proved in Problem 18 in Chapter 7.
We rst consider the case X 2 L2 . Put
M := fg(Y ) : E[g(Y )2 ] < 1g:
(10.8)
It is a consequence of the Riesz{Fischer Theorem [37, p. 244] that M is closed.
By the projection theorem combined with the orthogonality principle, there
exists a bg(Y ) 2 M such that
hX , gb(Y ); g(Y )i = 0; for all g(Y ) 2 M:
Since the above inner product is dened as an expectation, it is equivalent to
E[Xg(Y )] = E[bg(Y )g(Y )]; for all g(Y ) 2 M:
Since boundedness of g implies g(Y ) 2 M, (10.7) holds.
When X 2 L2 , we have shown that bg(Y ) is the projection of X onto M.
For X 2 L1 , we proceed as follows. First consider the case of nonnegative
X with E[X] < 1. We can approximate X by the bounded function Xn of
Example 10.6. Being bounded, Xn 2 L2 , and the corresponding gbn (Y ) exists
and satises
E[Xng(Y )] = E[bg(Y )g(Y )]; for all g(Y ) 2 M:
(10.9)
Since Xn Xn+1 , bgn(Y ) bgn+1 (Y ) by Problem 29. Hence, bg(Y ) := limn!1 bgn(Y )
exists. To verify that gb(Y ) satises (10.7), write2
E[Xg(Y )] = E nlim
X
g(Y
)
n
!1
= nlim
!1 E[Xng(Y )]
= nlim
E[bgn(Y )g(Y )]; by (10:9);
!1
= E nlim
g
b
(Y
)g(Y
)
n
!1
= E[bg(Y )g(Y )]:
For signed X with E[jX j] < 1, consider the nonnegative random variables
X;
X
0;
< 0;
+
,
X := 0; X < 0; and X := ,0;X; X
X 0:
z See, for example, the derivation following (7.8) in Section 7.3.
July 18, 2002
10.6 The Spectral Representation
313
Since X + + X , = jX j, it is clear that X + and X , are L1 random variables.
Since they are nonnegative, their conditional expectations exist. Denote them
by bg+ (Y ) and gb, (Y ). Since X + , X , = X, it is easy to verify that bg(Y ) :=
g+ (Y ) , bg, (Y ) satises (10.7).
b
Notation
We have shown that to every X 2 L1 , there corresponds a unique function
g(y) such that (10.7) holds. The standard notation for this function of y is
b
E[X jY = y], which, as noted above, is given by the usual formulas when X and
Y are discrete or jointly continuous. It is conventional in probability theory to
write E[X jY ] instead of bg(Y ). We emphasize that E[X jY = y] is a deterministic
function of y, while E[X jY ] is a function of Y and is therefore a random variable.
We also point out that with the conventional notation, (10.7) becomes
E[Xg(Y )] = E E[X jY ]g(Y ) ; for all bounded functions g:
(10.10)
10.6. The Spectral Representation
Let Xn be a discrete-time, zero-mean, wide-sense stationary process with
correlation function
R(n) := E[Xn+m Xm ]
and corresponding power spectral density
S(f) =
so thatx
1
X
(10.11)
S(f)ej 2fn df:
(10.12)
n=,1
Z 1=2
R(n) =
R(n)ej 2fn
,1=2
Consider the space of complex-valued frequency functions,
S := G :
Z 1=2
equipped with the inner product
,1=2
jG(f)j2S(f) df < 1
Z 1=2
G(f)H(f) S(f) df
,1=2
x For some correlation functions, the sum in (10.11) may not converge. However, by Herglotz's Theorem, we can always replace (10.12) by
hG; H i :=
Z 1=2
ej2fn dS0 (f );
,1=2
where S0 (f ) is the spectral (cumulative) distribution function, and the rest of the section
goes through with the necessary changes. Note that when the power spectral density exists,
S (f ) = S00 (f ).
R(n) =
July 18, 2002
314
Chap. 10 Mean Convergence and Applications
and corresponding norm kGk := hG; Gi1=2. Then S is a Hilbert space. Furthermore, every G 2 S can be approximated by a trigonometric polynomial of the
form
N
X
G0(f) =
gn ej 2fn :
n=,N
The approximation is in norm; i.e., given any " > 0, there is a trigonometric
polynomial G0 such that kG , G0 k < " [5, p. 139].
To each frequency function G 2 S , we now associate an L2 random variable
as follows. For trigonometric polynomials like G0 , put
T(G0) :=
N
X
n=,N
gn Xn :
Note that T is well dened (Problem 35). A critical property of these trigonometric polynomials is that (Problem 36)
E[T(G0)T(H0 ) ] =
Z 1=2
,1=2
G0(f)H0 (f) S(f) df = hG0; H0i;
where H0 is dened similarly to G0 . In particular, T is norm preserving on
the trigonometric polynomials; i.e.,
kT (G0)k22 = kG0k2 :
To dene T for arbitrary G 2 S , let Gn be a sequence of trigonometric polynomials converging to G in norm; i.e., kGn , Gk ! 0. Then Gn is Cauchy
(cf. Problem 11). Furthermore, the linearity of and norm-preservation properties of T on the trigonometric polynomials tells us that
kGn , Gm k = kT (Gn , Gm )k = kT (Gn) , T (Gm )k2 ;
and we see that T(Gn) is Cauchy in L2 . Since L2 is complete, there is a
limit random variable, denoted by T(G) and such that kT (Gn) , T (G)k2 ! 0.
Note that T(G) is well dened, norm preserving, linear, and continuous on S
(Problem 37).
There is another way approximate elements of S . Every G 2 S can also
be approximated by a piecewise constant function of the following form. For
,1=2 f0 < < fn 1=2, let
G0 (f) =
n
X
i=1
giI(fi,1 ;fi ] (f):
Given any " > 0, there is a piecewise constant function G0 such that kG , G0k <
". This is exactly what we did for the Wiener process, but with a dierent norm.
For piecewise constant G0,
T(G0 ) =
n
X
i=1
giT (I(fi,1 ;fi ] ):
July 18, 2002
10.6 The Spectral Representation
Since
315
I(fi,1 ;fi ] = I[,1=2;fi ] , I[,1=2;fi,1 ] ;
if we put
Zf := T(I[,1=2;f ] ); ,1=2 f 1=2;
then Zf is well dened by Problem 38, and
n
X
gi (Zfi , Zfi,1 ):
i=1
The family of random variables fZf ; ,1=2 f 1=2g has many similarities to
T(G0) =
the Wiener process (see Problem 39). We write
Z 1=2
,1=2
G0 (f) dZf :=
n
X
i=1
gi (Zfi , Zfi,1 ):
For arbitrary G 2 S , we approximate G with a sequence of piecewise constant
functions. Then Gn will be Cauchy. Since
Z 1=2
,1=2
Gn(f) dZf ,
Z 1=2
,1=2
Gm (f) dZf 2
= kT(Gn) , T(Gm )k2 = kGn , Gm k;
there is a limit random variable in L2 , denoted by
Z 1=2
and
,1=2
Z 1=2
,1=2
Gn (f) dZf ,
G(f) dZf ;
Z 1=2
,1=2
G(f) dZf ! 0:
On the other hand, since Gn is piecewise constant,
Z 1=2
,1=2
2
Gn (f) dZf = T (Gn);
and since kGn , Gk ! 0, and since T is continuous, kT (Gn) , T(G)k2 ! 0.
Thus,
Z 1=2
T (G) =
G(f) dZf :
In particular, taking G(f) = ej 2fn
Z 1=2
,1=2
,1=2
shows that
ej 2fn dZf = T(G) = Xn ;
where the last step follows from the original denition of T. The formula
Xn =
Z 1=2
,1=2
ej 2fn dZf
is called the spectral representation of Xn .
July 18, 2002
316
Chap. 10 Mean Convergence and Applications
10.7. Notes
Notes x10.1: Convergence in Mean of Order p
Note 1. The assumption
1
X
k=1
jhk j < 1
should be understood more precisely as saying that the partial sum
An :=
n
X
k=1
jhkj
is bounded above by some nite constant. Since An is also nondecreasing, the
monotonic sequence property of real numbers [38, p. 55, Theorem 3.14]
says that An converges to a real number, say A. Now, by the same argument
used to solve Problem 11, since An converges, it is Cauchy. Hence, given any
" > 0, for large enough n and m, jAn , Am j < ". If n > m, we have
An , Am =
n
X
k=m+1
jhk j < ":
Notes x10.5: Conditional Expectation
Note 2. The interchange of limit and expectation is justied by Lebesgue's
monotone convergence theorem [4, p. 208].
10.8. Problems
Problems x10.1: Convergence in Mean of Order p
1. Let U uniform[0; 1], and put
p
(U); n = 1; 2; : : ::
X := nI
n
[0;1=n]
For what values of p 1 does Xn converge in mean of order p to zero?
2. Let Nt be a Poisson process of rate . Show that Nt =t converges in mean
square to as t ! 1.
3. Show that limk!1 R(k) = 0 implies
,1
1 nX
R(k) = 0;
lim
n!1 n
k=0
and thus Mn converges in mean square to m by Example 10.4.
July 18, 2002
10.8 Problems
317
4. Show that if Mn converges in mean square m, then
,1
1 nX
R(k) = 0;
lim
n!1 n
k=0
where R(k) is dened in Example 10.4. Hint: Observe that
nX
,1
k=0
R(k) = E (X1 , m)
n
X
k=1
(Xk , m)
= E[(X1 , m) n(Mn , m)]:
5. Let Z be a nonnegative random variable with E[Z] < 1. Given any " > 0,
show that there is a > 0 such that for any event A,
}(A) < implies E[ZIA ] < ":
Hint: Recalling Example 10.6, put Zn := min(Z; n) and write
E[ZIA ] = E[(Z , Zn )IA ] + E[ZnIA ]:
6. Derive Holder's inequality,
E[jXY j] E[jX jp]1=pE[jY jq ]1=q ;
if 1 < p; q < 1 with 1p + q1 = 1. Hint: Let and denote the factors
on the right-hand side; i.e., := E[jX jp ]1=p and := E[jY jq ]1=q . Then it
suces to show that
E[jXY j] 1:
To this end, observe that by the convexity of the exponential function,
for any real numbers u and v, we can always write
h
i
exp p1 u + 1q v 1p eu + q1 ev :
Now take u = ln(jX j=)p and v = ln(jY j=)q .
7. Derive Lyapunov's inequality,
E[jZ j]1= E[jZ j ]1= ; 1 < < 1:
Hint: Apply Holder's inequality to X = jZ j and Y = 1 with p = =.
8. Show that if Xn converges in mean of order > 1 to X, then Xn converges
in mean of order to X for all 1 < .
July 18, 2002
318
Chap. 10 Mean Convergence and Applications
9. Derive Minkowski's inequality,
E[jX + Y jp]1=p E[jX jp]1=p + E[jY jp ]1=p;
where 1 p < 1. Hint: Observe that
E[jX + Y jp ] = E[jX + Y j jX + Y jp,1 ]
E[jX j jX + Y jp,1] + E[jY j jX + Y jp,1];
and apply Holder's inequality.
Problems x10.2: Normed Vector Spaces of Random Variables
10. Show that if Xn converges in mean of order p to X, and if Yn converges
11.
12.
13.
14.
15.
in mean of order p to Y , then Xn + Yn converges in mean of order p to
X +Y.
Let Xn converge in mean of order p to X 2 Lp . Show that Xn is Cauchy.
Let Xn be a Cauchy sequence in Lp . Show that Xn is bounded in the
sense that there is a nite constant K such that kXn kp K for all n.
Remark. Since by the previous problem, a convergent sequence is
Cauchy, it follows that a convergent sequence is bounded.
Let Xn converge in mean of order p to X 2 Lp , and let Yn converge in
mean of order p to Y 2 Lp . Show that Xn Yn converges in mean of order
p to XY .
Show that limits in mean of order p are unique; i.e., if Xn converges in
mean of order p to both X and Y , show that E[jX , Y jp ] = 0.
Show that
kX kp , kY kp kX , Y kp kX kp + kY kp :
16. If Xn converges in mean of order p to X, show that
lim E[jXnjp ] = E[jX jp]:
n!1
17. Show that
1
X
k=1
hk Xk
is well dened as an element of L1 assuming that
1
X
k=1
jhk j < 1 and E[jXk j] B; for all k;
July 18, 2002
10.8 Problems
319
where B is a nite constant.
Remark. By the Cauchy{Schwarz
inequality, a sucient condition for
E[jXkj] B is that E[Xk2 ] B 2 .
Problems x10.3: The Wiener Integral (Again)
18. Recall that the Wiener integral of g was dened to be the mean-square
limit of
Yn :=
1
Z
0
gn() dW ;
where each gn is piecewise constant, and kgn , gk ! 0. Show that
1
Z
E
and
1
Z
E
0
0
g() dW = 0;
g() dW
2 =
2
1
Z
0
g()2 d:
Hint: Let Y denote the mean-square limit of Yn. Show that E[Yn] ! E[Y ]
and that E[Yn2 ] ! E[Y 2 ].
19. Use the following approach to show that the limit denition of the Wiener
integral is unique. Let Yn and Y be as in the text. If egn is another sequence
of piecewise constant functions with kegn , gk2 ! 0, put
Yen
:=
1
Z
0
gn() dW :
e
By the argument that Yn has a limit Y , Yen has a limit, say Ye . Show that
kY , Ye k2 = 0.
20. Show that the Wiener integral is linear even for functions that are not
piecewise constant.
Problems x10.4: Projections, Orthogonality Principle, Projection Theorem
21. Let C = fY 2 Lp : kY kp rg. For kX kp > r, nd a projection of X onto
C.
22. Show that if Xb 2 M satises (10.6), it is unique.
23. Let N M be subspaces of L2 . Assume that the projection of X 2 L2
onto M exists and denote it by XbM . Similarly, assume that the projection
of X onto N exists and denote it by XbN . Show that XbN is the projection
of XbM onto N.
July 18, 2002
320
Chap. 10 Mean Convergence and Applications
24. The preceding problem shows that when N M are subspaces, the pro-
jection of X onto N can be computed in two stages: First project X onto
M and then project the result onto N. Show that this is not true in
general if N and M are not both subspaces. Hint: Draw a disk of radius
one. Then draw a straight line through the origin. Identify M with the
disk and identify N with the line segment obtained by intersecting the
line with the disk. If the point X is outside the disk and not on the line,
then projecting rst onto the ball and then onto the segment does not
give the projection.
Remark. Interestingly, projecting rst onto the line (not the line seg-
ment) and then onto the disk does give the correct answer.
25. Derive the parallelogram law,
kX + Y k22 + kX , Y k22 = 2(kX k22 + kY k22 ):
P1
26. Show that the result of Example 10.7 holds
k=1 jhk j <
P1if the assumption
1 is replaced by the two assumptions k=1 jhkj2 < 1 and E[Xk Xl ] = 0
for k 6= l.
27. If kXn , X k2 ! 0 and kYn , Y k2 ! 0, show that
lim hX ; Y i = hX; Y i:
n!1 n n
28. If Y has density fY , show that M in (10.8) is closed. Hints: (i) Observe
that
E[g(Y )2 ]
=
Z
1
g(y)2 fY (y) dy:
,1
(ii) Use the fact that the set of functions
G := g :
Z
1
,1
g(y)2 fY (y) dy < 1
is complete if G is equipped with the norm
kgk2Y :=
Z
1
,1
g(y)2 fY (y) dy:
(iii) Follow the method of Example 10.9.
Problems x10.5: Conditional Expectation
29. Fix X 2 L1 , and suppose X 0. Show that E[X jY ] 0 in the sense that
}(E[X jY ] < 0) = 0. Hints: To begin, note that
fE[X jY ] < 0g =
1
[
fE[X jY ] < ,1=ng:
n=1
July 18, 2002
10.8 Problems
321
By limit property (1.4),
}(E[X jY ] < 0) = lim }(E[X jY ] < ,1=n):
n!1
To obtain a proof by contradiction, suppose that }(E[X jY ] < 0) > 0.
Then for suciently large n, }(E[X jY ] < ,1=n) > 0 too. Take g(Y ) =
IB (E[X jY ]) where B = (,1; ,1=n) and consider the dening relationship
E[Xg(Y )] = E E[X jY ]g(Y ) :
30. For X 2 L1 , let X + and X , be as in the text, and denote their corresponding conditional expectations by E[X + jY ] and E[X , jY ], respectively.
Show that
,
E[Xg(Y )] = E E[X + jY ] , E[X , jY ] g(Y ) :
31. For X 2 L1 , derive the smoothing property of conditional expectation,
E[X jY1] = E E[X jY2; Y1]Y1 :
Remark. If X 2 L2, this is an instance of Problem 23.
32. If X 2 L1 and h is a bounded function, show that
E[h(Y )X jY ] = h(Y )E[X jY ]:
33. If X 2 L2 and h(Y ) 2 L2 , use the orthogonality principle to show that
E[h(Y )X jY ] = h(Y )E[X jY ]:
34. If X 2 L1 and if h(Y )X 2 L1 , show that
E[h(Y )X jY ] = h(Y )E[X jY ]:
Hint: Use a limit argument as in the text to approximate h(Y ) by a
bounded function of Y . Then apply either of the preceding two problems.
Problems x10.6: The Spectral Representation
35. Show that the mapping T dened by
G0(f) =
N
X
n=,N
gn ej 2fn 7!
N
X
n=,N
gnXn
is well dened; i.e., if G0(f) has another representation, say
G0(f) =
N
X
n=,N
July 18, 2002
gen ej 2fn ;
322
Chap. 10 Mean Convergence and Applications
show that
N
X
n=,N
gnXn
e
=
N
X
n=,N
gn Xn :
P
Hint: With dn := gn , egn, it suces to show that if Nn=,N dn ej 2fn = 0,
PN
then Y := n=,N dn Xn = 0, and for this it is enough to show that
E[jY j2] = 0. Formula (10.12) may be helpful.
36. Show that if H0(f) is dened similarly to G0(f) in the preceding problem,
then
Z 1=2
E[T(G0)T (H0) ] =
G0(f)H0 (f) S(f) df:
,1=2
37. Show that if Gn and Ge n are trigonometric polynomials with kGn , Gk ! 0
and kGe n , Gk ! 0, then
e
lim T(Gn) = nlim
!1 T (Gn):
Also show that T is norm preserving, linear, and continuous on S . Hint:
Put T (G) := limn!1 T(Gn) and Y := limn!1 T(Ge n). Show that kT(G),
Y k2 = 0. Use the fact that T is norm preserving on trigonometric polynomials.
38. Show that Zf := T(I[,1=2;f ] ) is well dened. Hint: It suces to show
that I[,1=2;f ] 2 S .
39. Since
Z 1=2
G(f) dZf = T(G);
n!1
,1=2
use results about T (G) to prove the following.
(a) Show that
Z 1=2
E
G(f) dZf = 0:
,1=2
(b) Show that
Z 1=2
E
,1=2
G(f) dZf
Z 1=2
,1=2
H() dZ
=
Z 1=2
,1=2
G(f)H(f) S(f) df:
(c) Show that Zf has orthogonal (uncorrelated) increments.
July 18, 2002
CHAPTER 11
Other Modes of Convergence
This chapter introduces the three types of convergence,
convergence in probability,
convergence in distribution,
almost sure convergence,
which are related to each other (and to convergence in mean of order p) as
shown in Figure 11.1.
mean of order p
in probability
in distribution
almost sure
Figure 11.1. Implications of various types of convergence.
Section 11.1 introduces the notion of convergence in probability. Convergence in probability will be important in Chapter 12 on parameter estimation
and condence intervals, where it justies various statistical procedures that
are used to estimate unknown parameters.
Section 11.2 introduces the notion of convergence in distribution. Convergence in distribution is often used to approximate probabilities that are hard
to calculate exactly. Suppose that Xn is a random variable whose cumulative
distribution function FXn (x) is very hard to compute. But suppose that for
large n, FXn (x) FX (x), where FX (x) is a cdf that is very easy to compute.
When FXn (x) ! FX (x), we say that Xn converges in distribution to X. In
this case, we can approximate FXn (x) by FX (x) if n is large enough. When
the central limit theorem applies, FX is the normal cdf with mean zero and
variance one.
Section 11.3 introduces the notion of almost-sure convergence. This kind
of convergence is more technically demanding to analyze, but it allows us to
discuss important results such as the strong law of large numbers, for which we
give the usual fourth-moment proof. Almost-sure convergence also allows us
the derive the Skorohod representation, which is a powerful tool for studying
convergence in distribution.
323
324
Chap. 11 Other Modes of Convergence
11.1. Convergence in Probability
We say that Xn converges in probability to X if
lim }(jXn , X j ") = 0 for all " > 0:
n!1
The rst thing to note about convergence in probability is that it is implied
by convergence in mean of order p. Using Markov's inequality, we have
p
}(jXn , X j ") = }(jXn , X jp "p ) E[jXn ,p X j ] :
"
Hence, if Xn converges in mean of order p to X, then Xn converges in probability
to X as well.
Example 11.1 (Weak Law of Large Numbers). Since the sample mean Mn
dened in Example 10.3 converges in mean square to m, it follows that Mn converges in probability to m. This is known as the weak law of large numbers.
The second thing to note about convergence in probability is that it is
possible to have Xn converging in probability to X, while Xn does not converge
to X in mean of order p for any p 1. For example, let U uniform[0; 1], and
consider the sequence
Xn := nI[0;1=n](U); n = 1; 2; : : ::
Observe that for " > 0,
fjXnj "g =
(
fU 1=ng; n ";
6 ;
n < ":
It follows that for all n,
}(jXn j ") }(U 1=n) = 1=n ! 0:
Thus, Xn converges in probability to zero. However,
E[jXnjp] = np}(U 1=n) = np,1 ;
which does not go to zero as n ! 1.
The third thing to note about convergence in probability is that if g(x; y)
is continuous, and if Xn converges in probability to X, and Yn converges in
probability to Y , then g(Xn ; Yn) converges in probability to g(X; Y ). This
result is derived in Problem 6. In particular, note that since the functions
g(x; y) = x + y and g(x; y) = xy are continuous, Xn + Yn and Xn Yn converge
in probability to X + Y and XY , respectively, whenever Xn and Yn converge
in probability to X and Y , respectively.
July 18, 2002
11.2 Convergence in Distribution
325
Example 11.2. Since Problem 6 is somewhat involved, it is helpful to
rst see the derivation for the special case in which Xn and Yn both converge
in probability to constant random variables, say u and v, respectively.
Solution. Let " > 0 be given. We show that
}(jg(Xn ; Yn) , g(u; v)j ") ! 0:
By the continuity of g at the point (u; v), there is a > 0 such that whenever
(u0 ; v0) is within of (u; v); i.e., whenever
p
ju0 , uj2 + jv0 , vj2 < 2;
then
jg(u0; v0 ) , g(u; v)j < ":
Since
p
we see that
ju0 , uj2 + jv0 , vj2 ju0 , uj + jv0 , vj;
jXn , uj < and jYn , vj < ) jg(Xn ; Yn) , g(u; v)j < ":
Conversely,
jg(Xn ; Yn) , g(u; v)j " ) jXn , uj or jYn , vj :
It follows that
}(jg(Xn ; Yn ) , g(u; v)j ") }(jXn , uj ) + }(jYn , vj ):
These last two terms go to zero by hypothesis.
Example 11.3. Let Xn converge in probability to x, and let cn be a converging sequence of real numbers with limit c. Show that cn Xn converges in
probability to cx.
Solution. Dene the sequence of constant random variables Yn := cn.
It is a simple exercise to show that Yn converges in probability to c. Thus,
Xn Yn = cn Xn converges in probability to cx.
11.2. Convergence in Distribution
We say that Xn converges in distribution to X, or converges weakly
to X if
lim F (x) = FX (x); for all x 2 C(FX );
n!1 Xn
July 18, 2002
326
Chap. 11 Other Modes of Convergence
where C(FX ) is the set of points x at which FX is continuous. If FX has a
jump at a point x0, then we do not say anything about the existence of
lim F (x ):
n!1 Xn 0
The rst thing to note about convergence in distribution is that it is implied
by convergence in probability. To derive this result, x any x at which FX is
continuous. For any " > 0, write
FXn (x) = }(Xn x; X x + ") + }(Xn x; X > x + ")
FX (x + ") + }(Xn , X < ,")
FX (x + ") + }(jXn , X j "):
It follows that
lim F (x) FX (x + "):
n!1 Xn
Similarly,
FX (x , ") = }(X x , "; Xn x) + }(X x , "; Xn > x)
FXn (x) + }(X , Xn < ,")
FXn (x) + }(jXn , X j ");
and we obtain
FX (x , ") lim FXn (x):
n!1
Since the liminf is always less than or equal to the limsup, we have
FX (x , ") lim FXn (x) nlim
!1 FXn (x) FX (x + "):
n!1
Since FX is continuous at x, letting " go to zero shows that the liminf and the
limsup are equal to each other and to FX (x). Hence limn!1 FXn (x) exists and
equals FX (x).
The need to restrict attention to continuity points can also be seen in the
following example.
Example 11.4. Let Xn exp(n); i.e.,
FXn (x) =
1 , e,nx ; x 0;
0;
x < 0:
For x > 0, FXn (x) ! 1 as n ! 1. However, since FXn (0) = 0, we see that
FXn (x) ! I(0;1) (x), which, being left continuous at zero, is not the cumulative
distribution function of any random variable. Fortunately, the constant random
variable X 0 has I[0;1) (x) for its cdf. Since the only point of discontinuity is
zero, FXn (x) does indeed converge to FX (x) at all continuity points of FX .
July 18, 2002
11.2 Convergence in Distribution
327
The second thing to note about convergence in distribution is that if Xn
converges in distribution to X, and if X is a constant random variable, say
X c, then Xn converges in probability to c. To derive this result, rst
observe that
fjXn , cj < "g = f," < Xn , c < "g
= fc , " < Xn g \ fXn < c + "g:
It is then easy to see that
}(jXn , cj ") }(Xn c , ") + }(Xn c + ")
FXn (c , ") + }(Xn > c + "=2)
= FXn (c , ") + 1 , FXn (c + "=2):
Since X c, FX (x) = I[c;1) (x). Therefore, FXn (c , ") ! 0 and FXn (c+"=2) !
1.
The third thing to note about convergence in distribution is that it is equivalent to the condition
lim E[g(Xn )] = E[g(X)] for every bounded continuous function g: (11.1)
n!1
A proof that convergence in distribution implies (11.1) can be found in [17,
p. 316]. A proof that (11.1) implies convergence in distribution is sketched in
Problem 19.
Example 11.5. Show that if Xn converges in distribution to X, then the
characteristic function of Xn converges to the characteristic function of X.
Solution. Fix any , and take g(x) = ejx. Then
'Xn () = E[ejXn ] = E[g(Xn )] ! E[g(X)] = E[ejX ] = 'X ():
Remark. It is also true that if 'Xn () converges to 'X () for all , then
Xn converges in distribution to X [4, p. 349, Theorem 26.3].
Example2 11.6. Let Xn N(mn; n2 ), where mn ! m and n2 ! 2. If
X N(m; ), show that Xn converges in distribution to X.
Solution. It suces to show that the characteristic function of Xn converges to the characteristic function of X. Write
2 2
2 2
'Xn () = ejmn ,n =2 ! ejm , =2 = 'X ():
July 18, 2002
328
Chap. 11 Other Modes of Convergence
A very important instance of convergence in distribution is the central
limit theorem. This result says that if X1 ; X2; : : : are independent, identically
distributed random variables with nite mean m and nite variance 2 , and if
n
X
, m;
1
Mn := n Xi and Yn := M=n p
n
i=1
then
Z y
2
1
lim F (y) = (y) := p
e,t =2 dt:
n!1 Yn
2 ,1
In other words, for all y, FYn (y) converges to the standard normal cdf (y)
(which is continuous at all points y). To better see the dierence between the
weak law of large numbers and the central limit theorem, we specialize to the
case m = 0 and 2 = 1. In this case,
n
X
Yn = p1n Xi :
i=1
In other words, the sample mean Mn divides the sum X1 + +Xn by n while Yn
divides the sum only by pn. The dierence is that Mn converges in probability
(and in distribution) to the constant zero, while Yn converges in distribution
to an N(0; 1) random variable. Notice also that the weak law requires only
uncorrelated random variables, while the central limit theorem requires i.i.d.
random variables. The central limit theorem was derived in Section 4.7, where
examples and problems can also be found. The central limit theorem is also
used to determine condence intervals in Chapter 12.
Example 11.7. Let Nt be a Poisson process of rate . Show that
Nn , n
Yn := p
=n
converges in distribution to an N(0; 1) random variable.
Solution. Since N0 0,
n
Nn = 1 X
n
n k=1(Nk , Nk,1 ):
By the independent increments property of the Poisson process, the terms of the
sum are i.i.d. Poisson() random variables, with mean and variance . Hence,
Yn has the structure to apply the central limit theorem, and so FYn (y) ! (y)
for all y.
The next example is a version of Slutsky's Theorem. We need it in our
analysis of condence intervals in Chapter 12.
July 18, 2002
11.2 Convergence in Distribution
329
Example 11.8. Let Yn be a sequence of random variables with corresponding cdfs Fn. Suppose that Fn converges to a continuous cdf F . Suppose also
that Un converges in probability to 1. Show that
}
nlim
!1 (Yn yUn ) = F (y):
Solution. The result is very intuitive. For large n, Un 1 and Fn(y) F(y) suggest that
}(Yn yUn ) }(Yn y) = Fn(y) F (y):
The precise details are more involved. Fix any y > 0 and 0 < < 1. Then
}(Yn yUn ) is equal to
}(Yn yUn ; jUn , 1j < ) + }(Yn yUn ; jUn , 1j ):
The second term is upper bounded by }(jUn , 1j ), which goes to zero.
Rewrite the rst term as
}(Yn yUn ; 1 , < Un < 1 + );
which is equivalent to
}(Yn yUn ; y(1 , ) < yUn < y(1 + )):
(11.2)
Now this is upper bounded by
}(Yn y(1 + )) = Fn(y(1 + )):
Thus,
lim }(Yn yUn ) F(y(1 + )):
n!1
Next, (11.2) is lower bounded by
}(Y y(1 , ); jUn , 1j < );
which is equal to
}(Yn y(1 , )) , }(Yn y(1 , ); jUn , 1j ):
Now the second term satises
}(Yn y(1 , ); jUn , 1j ) }(jUn , 1j ) ! 0:
In light of these observations,
lim }(Yn yUn ) F(y(1 , )):
n!1
Since was arbitrary, and since F is continuous,
lim }(Yn yUn ) = F (y):
n!1
The case y < 0 is similar.
July 18, 2002
330
Chap. 11 Other Modes of Convergence
11.3. Almost Sure Convergence
Let Xn be any sequence of random variables, and let X be any other random
variable. Put
G := ! 2 : nlim
!1 Xn (!) = X(!) :
In other words, G is the set of sample points ! 2 for which the sequence
of real numbers Xn (!) converges to the real number X(!). We think of G as
the set of \good" !'s for which Xn (!) ! X(!). Similarly, we think of the
complement of G, Gc , as the set of \bad" !'s for which Xn (!) 6! X(!).
We say that Xn converges almost surely to X if1
},f! 2 : lim Xn (!) 6= X(!)g = 0:
(11.3)
n!1
In other words, Xn converges almost surely to X if the \bad" set Gc has probability zero. We write Xn ! X a.s. to indicate that Xn converges almost surely
to X.
If it should happen that the bad set Gc = 6 , then Xn (!) ! X(!) for every
! 2 . This is called sure convergence, and is a special case of almost sure
convergence.
Because almost sure convergence is so closely linked to the convergence of
ordinary sequences of real numbers, many results are easy to derive.
Example 11.9. Show that if Xn ! X a.s. and Yn ! Y a.s., then Xn +
Yn ! X + Y a.s.
Solution. Let GX := fXn ! X g and GY := fYn ! Y g. In other words,
GX and GY are the \good" sets for the sequences Xn and Yn respectively.
Now consider any ! 2 GX \ GY . For such !, the sequence of real numbers
Xn (!) converges to the real number X(!), and the sequence of real numbers
Yn (!) converges to the real number Y (!). Hence, from convergence theory for
sequences of real numbers,
Xn (!) + Yn(!) ! X(!) + Y (!):
(11.4)
At this point, we have shown that
GX \ GY G;
(11.5)
where G denotes the set of all ! for which (11.4) holds. To prove that Xn +Yn !
X+Y a.s., we must show that }(Gc) = 0. On account of (11.5), Gc GcX [ GcY .
Hence,
}(Gc ) }(GcX ) + }(GcY );
and the two terms on the right are zero because Xn and Yn both converge
almost surely.
July 18, 2002
11.3 Almost Sure Convergence
331
In order to discuss almost sure convergence in more detail, it is helpful to
characterize when (11.3) holds. Recall that a sequence of real numbers xn
converges to the real number x if given any " > 0, there is a positive integer N
such that for all n N, jxn , xj < ". Hence,
1 \
1
\ [
fXn ! X g =
">0 N =1 n=N
Equivalently,
1 [
1
[ \
fXn 6! X g =
">0 N =1 n=N
It is convenient to put
fjXn , X j < "g:
fjXn , X j "g:
An (") := fjXn , X j "g; BN (") :=
and
A(") :=
Then
1
\
N =1
[
[
0 = }(fXn 6! X g) = }
n=N
An (");
BN ("):
}(fXn 6! X g) = }
If Xn ! X a.s., then
1
[
">0
">0
(11.6)
A(") :
,
A(") } A("0 )
,
for any choice of "0 > 0. Conversely, suppose } A(") = 0 for every positive
". We claim that Xn ! X a.s. To see this, observe that in the earlier characterization of the convergence of a sequence of real numbers, we could have
restricted attention to values of " of the form " = 1=k for positive integers k.
In other words, a sequence of real numbers xn converges to a real number x if
and only if for every positive integer k, there is a positive integer N such that
for all n N, jxn , xj < 1=k. Hence,
}(fXn 6! X g) = }
,
[
1
k=1
A(1=k) 1 ,
X
k=1
} A(1=k):
From this we see that if } A(") = 0 for all " > 0, then }(fXn 6! X g) = 0.
To say more about almost sure convergence, we need to examine (11.6) more
closely. Observe that
BN (") =
1
[
n=N
An (") 1
[
n=N +1
July 18, 2002
An(") = BN +1 ("):
332
Chap. 11 Other Modes of Convergence
By limit property (1.5),
},A(") = }
\
1
,
} BN (") :
BN (") = Nlim
!1
N =1
The next two examples use this equation to derive important results about
almost sure convergence.
Example 11.10. Show that if Xn ! X a.s., then Xn converges in probability to X.
Solution., Recall that, by denition, Xn converges in probability to X if
and only if } AN (") ! 0 for every " > 0. If Xn ! X a.s., then for every
" > 0,
,
},BN ("):
0 = } A(") = Nlim
!1
Next, since
1
},BN (") = } [ An (") },AN (");
,
it follows that } AN (") ! 0 too.
n=N
Example 11.11. Show that if
1 ,
X
n=1
} An (") < 1;
holds for all " > 0, then Xn ! X a.s.
Solution. For any " > 0,
},A(") = lim },BN (")
N !1
(11.7)
[
1
n=N
Nlim
!1
},An(")
}
= Nlim
!1
1
X
n=N
An (")
= 0; on account of (11.7):
What we have done here is derive a particular instance of the Borel{Cantelli
Lemma (cf. Problem 15 in Chapter 1).
Example 11.12. Let X1; X2; : : : be i.i.d. zero-mean random variables with
nite fourth moment. Show that
n
X
Mn := n1 Xi ! 0 a.s.
i=1
July 18, 2002
11.3 Almost Sure Convergence
333
Solution. We already know from Example 10.3 that Mn converges in
mean square to zero, and hence, Mn converges to zero in probability and in
distribution as well. Unfortunately, as shown in the next example, convergence
in mean does not imply almost sure convergence. However, by the previous
example, the almost sure convergence of Mn to zero will be established if we
can show that for every " > 0,
1
X
n=1
}(jMnj ") < 1:
(11.8)
By Markov's inequality,
4
}(jMn j ") = }(jMnj4 "4 ) E[jM4nj ] :
"
By Problem 34, there are nite, nonnegative constants (depending on E[Xi4])
and (depending on E[Xi2]) such that
E[jMnj4] 3 + 2 :
n n
Hence, (11.8) holds by Problem 33.
The preceding example is an instance of the strong law of large numbers.
The derivation in the example is quite simple because of the assumption of nite
fourth moments (which implies niteness of the third, second, and rst moments
by Lyapunov's inequality). A derivation assuming only nite second moments
can be found in [17, pp. 326{327], and assuming only nite rst moments in
[17, pp. 329{331].
Strong Law of Large Numbers (SLLN). Let X1 ; X2 ; : : : be independent, identically distributed random variables with nite mean m. Then
n
X
Mn := n1 Xi ! m a.s.
i=1
Since almost sure convergence implies convergence in probability, the following form of the weak law of large numbers holds.
Weak Law of Large Numbers (WLLN). Let X1 ; X2; : : : be independent, identically distributed random variables with nite mean m. Then
n
X
Mn := n1 Xi
i=1
converges in probability to m.
July 18, 2002
334
Chap. 11 Other Modes of Convergence
The weak law of large numbers in Example 11.1, which relied on the meansquare version in Example 10.3, required nite second moments and uncorrelated random variables. The above form does not require nite second moments,
but does require independent random variables.
Example 11.13. Let Wt be a Wiener process with E[Wt2] = 2t. Use the
strong law to show that Wn=n converges almost surely to zero.
Solution. Since W0 0, we can write
n
Wn = 1 X
n
n k=1(Wk , Wk,1):
By the independent increments property of the Wiener process, the terms of the
sum are i.i.d. N(0; 2) random variables. By the strong law, this sum converges
almost surely to zero.
Example 11.14. We construct a sequence of random variables that converges in mean to zero, but does not converge almost surely to zero. Fix any
positive integer n. Then n can be uniquely represented as n = 2m + k, where
m and k are integers satisfying m 0 and 0 k 2m , 1. Dene
(x):
gn(x) = g2m +k (x) = I km ; k+1
m
2
2
For example, taking m = 2 and k = 0; 1; 2; 3; which corresponds to n = 4; 5; 6; 7,
we nd
g4(x) = I0; 1 (x); g5(x) = I 1 ; 1 (x);
4
4 2
and
g6(x) = I 1 ; 3 (x); g7(x) = I 3 ; 1(x):
2 4
4
For xed m, as k goes from 0 to 2m , 1, g2m +k is a sequence of pulses moving
from left to right. This is repeated for m + 1 with twice as many pulses that
are half as wide. The two key ideas are that the pulses get narrower and that
for any xed x 2 [0; 1), gn(x) = 1 for innitely many n.
Now let U uniform[0; 1). Then
E[gn(U)] = } 2km X < k2+m 1 = 21m :
Since m ! 1 as n ! 1, we see that gn(U) converges in mean to zero. It
then follows that gn(U) converges in probability to zero. Since almost sure
convergence also implies convergence in probability, the only possible almost
sure limit is zero. However, we now show that gn(U) does not converge almost
surely to zero. Fix any x 2 [0; 1). Then for each m = 0; 1; 2; : : :;
k x < k+1
2m
2m
Limits in probability are unique by Problem 4.
July 18, 2002
11.3 Almost Sure Convergence
335
for some k satisfying 0 k 2m , 1. For these values of m and k, g2m +k (x) =
1. In other words, there are innitely many values of n = 2m + k for which
gn(x) = 1. Hence, for 0 x < 1, gn (x) does not converge to zero. Therefore,
fU 2 [0; 1)g fgn (U) 6! 0g;
and it follows that
}(fgn (U) 6! 0g) }(fU 2 [0; 1)g) = 1:
The Skorohod Representation Theorem
The Skorohod Representation Theorem says that if Xn converges in
distribution to X, then we can construct random variables Yn and Y with
FYn = FXn , FY = FX , and such that Yn converges almost surely to Y . This
can often simplify proofs concerning convergence in distribution.
Example 11.15. Let Xn converge in distribution to X, and let c(x) be a
continuous function. Show that c(Xn ) converges in distribution to c(X).
Solution. Let Yn and Y be as given by the Skorohod representation theorem. Since Yn converges almost surely to Y , the set
G := f! 2 : Yn(!) ! Y (!)g
has the property that }(Gc ) = 0. Fix any ! 2 G. Then Yn(!) ! Y (!). Since
c is continuous,
,
,
c Yn (!) ! c Y (!) :
Thus, c(Yn) converges almost surely to c(Y ). Now recall that almost-sure convergence implies convergence in probability, which implies convergence in distribution. Hence, c(Yn ) converges in distribution to c(Y ). To conclude, observe
that since Yn and Xn have the same cumulative distribution function, so do
c(Yn ) and c(Xn ). Similarly, c(Y ) and c(X) have the same cumulative distribution function. Thus, c(Xn ) converges in distribution to c(X).
We now derive the Skorohod representation theorem. For 0 < u < 1, let
Gn(u) := inf fx 2 IR : FXn (x) ug;
and
G(u) := inf fx 2 IR : FX (x) ug:
By Problem 27(a) in Chapter 8,
G(u) x , u FX (x);
July 18, 2002
336
Chap. 11 Other Modes of Convergence
or, equivalently,
G(u) > x , u > FX (x);
and similarly for Gn and FXn . Now let U uniform(0; 1). By Problem 27(b)
in Chapter 8, Yn := Gn(U) and Y := G(U) satisfy FYn = FXn and FY = FX ,
respectively.
From the denition of G, it is easy to see that G is nondecreasing. Hence,
its set of discontinuities, call it D, is at most countable (Problem 39), and so
}(U 2 D) = 0 by Problem 40. We show below that for u 2= D, Gn(u) ! G(u).
It then follows that
,
,
Yn(!) := Gn U(!) ! G U(!) =: Y (!);
except for ! 2 f! : U(!) 2 Dg, which has probability zero.
Fix any u 2= D, and let " > 0 be given. Then between G(u) , " and G(u)
we can select a point x,
G(u) , " < x < G(u);
that is a continuity point of FX . Since x < G(u),
FX (x) < u:
Since x is a continuity point of FX , FXn (x) must be close to FX (x) for large n.
Thus, for large n, FXn (x) < u. But this implies Gn(u) > x. Thus,
G(u) , " < x < Gn (u);
and it follows that
G(u) lim Gn(u):
n!1
To obtain the reverse inequality involving the lim, x any u0 with u < u0 < 1.
Fix any " > 0, and select another continuity point of FX , again called x, such
that
G(u0 ) < x < G(u0) + ":
Then G(u0 ) x, and so u0 FX (x). But then u < FX (x). Since FXn (x) !
FX (x), for large n, u < FXn (x), which implies Gn(u) x. It then follows that
lim G (u)
n!1 n
G(u0):
Since u is a continuity point of G, we can let u0 ! u to get
lim G (u)
n!1 n
G(u):
It now follows that Gn(u) ! G(u) as claimed.
July 18, 2002
11.4 Notes
337
11.4. Notes
Notes x11.3: Almost Sure Convergence
Note 1. In order that (11.3) be well dened, it is necessary that the set
f! 2 : nlim
6 X(!)g
!1 Xn (!) =
be an event in the technical sense of Note 1 in Chapter 1. This is assured
by the assumption that each Xn is a random variable (the term \random variable" is used in the technical sense of Note 1 in Chapter 2). The fact that
this assumption is sucient is demonstrated in more advanced texts, e.g., [4,
pp. 183{184].
11.5. Problems
Problems x11.1: Convergence in Probability
1. Let cn be a converging sequence of real numbers with limit c. Dene the
constant random variables Yn cn and Y c. Show that Yn converges
in probability to Y .
2. Let U uniform[0; 1], and put
p
Xn := nI[0;1=n](U); n = 1; 2; : : ::
Does Xn converge in probability to zero?
3. Let V be any continuous random variable with an even density, and let cn
be any positive sequence with cn ! 1. Show that Xn := V=cn converges
in probability to zero.
4. Show that limits in probability are unique; i.e., show that if Xn converges
in probability to X, and Xn converges in probability to Y , then }(X 6=
Y ) = 0. Hint: Write
fX 6= Y g =
1
[
fjX , Y j 1=kg;
k=1
and use limit property (1.4).
5. Suppose you have shown that given any " > 0, for suciently large n,
}(jXn , X j ") < ":
Show that
lim }(jXn , X j ") = 0 for every " > 0:
n!1
July 18, 2002
338
Chap. 11 Other Modes of Convergence
6. Let g(x; y) be continuous, and suppose that Xn converges in probability
to X, and that Yn converges in probability to Y . In this problem you will
show that g(Xn ; Yn) converges in probability to g(X; Y ).
(a) Fix any " > 0. Show that for suciently large and ,
}(jX j > ) < "=4 and }(jY j > ) < "=4:
(b) Once and have been xed, we can use the fact that g(x; y) is
uniformly continuous on the rectangle jxj 2 and jyj 2. In
other words, there is a > 0 such that for all (x0; y0 ) and (x; y) in
the rectangle and satisfying
jx0 , xj and jy0 , yj ;
we have
jg(x0; y0 ) , g(x; y)j < ":
There is no loss of generality if we assume that and .
Show that if the four conditions
jXn , X j < ; jYn , Y j < ; jX j ; and jY j hold, then
jg(Xn ; Yn) , g(X; Y )j < ":
(c) Show that if n is large enough that
}(jXn , X j ) < "=4 and }(jYn , Y j ) < "=4;
then
}(jg(Xn ; Yn) , g(X; Y )j ") < ":
7. Let X1 ; X2 ; : : : be i.i.d. with common nite mean m and common nite
variance 2 . Also assume that E[Xi4 ] < 1. Put
n
n
X
X
Mn := n1 Xi and Vn := n1 Xi2 :
i=1
i=1
(a) Explain (briey) why Vn converges in probability to 2 + m2 .
(b) Explain (briey) why
S2
n
n
:= n , 1
n
1X
2
2
n i=1 Xi , Mn
converges in probability to 2.
8. Let Xn converge in probability to X. Assume that there is a nonnegative
random variable Y with E[Y ] < 1 and such that jXnj Y for all n.
July 18, 2002
11.5 Problems
339
(a) Show that }(jX j Y + 1) = 0. (In other words, jX j < Y + 1 with
probability one, from which it follows that E[jX j] E[Y ] + 1 < 1.)
(b) Show that Xn converges in mean to X. Hints: Write
E[jXn , X j] = E[jXn , X jIAn ] + E[jXn , X jIAcn ];
where An := fjXn , X j "g. Then use Problem 5 in Chapter 10
with Z = Y + jX j.
9. (a) Let g be a bounded, nonnegative function satisfying limx!0 g(x) = 0.
Show that limn!1 E[g(Xn )] = 0 if Xn converges in probability to
zero.
(b) Show that
j
X
j
n
= 0
lim E
n!1 1 + jX j
n
if and only if Xn converges in probability to zero.
Problems x11.2: Convergence in Distribution
10. Let cn be a converging sequence of real numbers with limit c. Dene the
constant random variables Yn cn and Y c. Show that Yn converges
in distribution to Y .
11. Let X be a random variable, and let cn be a positive sequence converging
to limit c. Show that cn X converges in distribution to cX. Consider
separately the cases c = 0 and 0 < c < 1.
12. For t 0, let Xt Yt Zt be three continuous-time random processes
such that
lim F (y) = tlim
t!1 Xt
!1 FZt (y) = F(y)
for some continuous cdf F. Show that FYt (y) ! F(y) for all y.
R1
13. Show that the Wiener
integral
Y
:=
0 g() dW is Gaussian with zero
R1
2
mean and variance 0 g() d. Hints: The desired integral Y is the
mean-square limit of the sequence Yn dened in Problem 18 in Chapter 10.
Use Example 11.5.
R
14. Let g(t; ) be such that for each t, 01 g(t; )2 d < 1. Dene the process
Xt =
Z
0
1
g(t; ) dW :
Use the result of the preceding problem to show that for any 0 t1 <
< tn < 1, the random vector of samples X := [Xt1 ; : : :; Xtn ]0 is
Gaussian. Hint: Read the rst paragraph of Section 7.2.
July 18, 2002
340
Chap. 11 Other Modes of Convergence
15. If the moment generating functions MXn (s) converge to the moment gen-
erating function MX (s), show that Xn converges in distribution to X.
Also show that for nonnegative, integer-valued random variables, if the
probability generating functions GXn (z) converge to the probability generating function GX (z), then Xn converges in distribution to X. Hint:
The Remark following Example 11.5 may be useful.
16. Let Xn and X be integer-valued random variables with probability mass
functions pn(k) := }(Xn = k) and p(k) := }(X = k), respectively.
(a) If Xn converges in distribution to X, show that for each k, pn (k) !
p(k).
(b) If Xn and X are nonnegative, and if for each k 0, pn(k) ! p(k),
show that Xn converges in distribution to X.
17. Let pn be a sequence of numbers lying between 0 and 1 and such that
npn ! > 0 as n ! 1. Let Xn binomial(n; pn), and let X Poisson(). Show that Xn converges in distribution to X. Hints: By the
previous problem, it suces to prove that the probability mass functions
converge. Stirling's formula,
p
n! 2 nn+1=2e,n ;
by which we mean
= 1;
lim p n!
n!1 2 nn+1=2e,n
and the formula
qn n = e,q ; if q ! q;
lim
1
,
n
n!1
n
may be helpful.
18. Let Xn binomial(n; pn) and X Poisson(), where pn and are as
in the previous problem. Show that the probability generating function
GXn (z) converges to GX (z).
19. Suppose that Xn and X are such that for every bounded continuous
function g(x),
lim E[g(Xn )] = E[g(X)]:
n!1
Show that Xn converges in distribution to X as follows:
(a) For a < b, sketch the three functions I(,1;a] (t), I(,1;b] (t), and
8
1; t < a;
>
>
<
b
,
t ; a t b;
ga;b(t) := >
>
: b,a
0; t > b:
July 18, 2002
11.5 Problems
341
(b) Your sketch in part (a) shows that
I(,1;a] (t) ga;b (t) I(,1;b] (t):
Use these inequalities to show that for any random variable Y ,
FY (a) E[ga;b(Y )] FY (b):
(c) For x > 0, use part (b) with a = x and b = x + x to show that
lim F (x) FX (x + x):
n!1 Xn
(d) For x > 0, use part (b) with a = x , x and b = x to show that
FX (x , x) lim FXn (x):
n!1
(e) If x is a continuity point of FX , show that
lim F (x) = FX (x):
n!1 Xn
20. Show that Xn converges in distribution to zero if and only if
j
X
j
n
lim E
= 0:
n!1 1 + jXnj
21. For t 0, let Zt be a continuous-time random process. Suppose that as
t ! 1, FZt (z) converges to a continuous cdf F(z). Let u(t) be a positive
function of t such that u(t) ! 1 as t ! 1. Show that
,
lim } Zt z u(t) = F (z):
t!1
Hint: Modify the derivation in Example 11.8.
22. Let Zt be as in the preceding problem. Show that if c(t) ! c > 0, then
,
lim } c(t) Zt z = F (c=z):
t!1
23. Let Zt be as in Problem 21. Let s(t) ! 0 as t ! 1. Show that if
Xt = Zt + s(t), then FXt (x) ! F (x).
24. Let Nt be a Poisson process of rate . Show that
Nt , t
Yt := p
=t
converges in distribution to an N(0; 1) random variable. Hint: By Example 11.7, Yn converges in distribution to an N(0; 1) random variable.
Next, since Nt is a nondecreasing function of t, observe that
Nbtc Nt Ndte ;
July 18, 2002
342
Chap. 11 Other Modes of Convergence
where btc denotes the greatest integer less than or equal to t, and dte denotes the smallest integer greater than or equal to t. Hint: The preceding
two problems and Problem 12 may be useful.
Problems x11.3: Almost Sure Convergence
25. Let Xn ! X a.s. and let Yn ! Y a.s. If g(x; y) is a continuous function,
show that g(Xn ; Yn) ! g(X; Y ) a.s.
26. Let Xn ! X a.s., and suppose that X = Y a.s. Show that Xn ! Y a.s.
(The statement X = Y a.s. means }(X =
6 Y ) = 0.)
27. Show that almost sure limits are unique; i.e., if Xn ! X a.s. and Xn !
Y a.s., then X = Y a.s. (The statement X = Y a.s. means }(X =
6 Y)=
28.
29.
30.
31.
0.)
Suppose Xn ! X a.s. and Yn ! Y a.s. Show that if Xn Yn a.s. for all
n, then X Y a.s. (The statement Xn Yn a.s. means }(Xn > Yn ) =
0.)
If Xn converges almost surely and in mean, show that the two limits are
equal almost surely. Hint: Problem 4 may be helpful.
In Problem 11, suppose that the limit is c = 1. Specify the limit random
variable cX(!) as a function of !. Note that cX(!) is a discrete random
variable. What is its probability mass function?
Let S be a nonnegative random variable with E[S] < 1. Show that
S < 1 a.s. Hints: It is enough to show that }(S = 1) = 0. Observe
that
1
\
fS = 1g = fS > ng:
n=1
Now appeal to the limit property (1.5) and use Markov's inequality.
32. Under the assumptions of Problem 17 in Chapter 10, show that
1
X
k=1
hk Xk
is well dened as an almost-sure limit. Hints: It is enough to prove that
S :=
1
X
k=1
jhk Xk j < 1 a.s.
Hence, the result of the preceding problem can be applied if it can be
shown that E[S] < 1. To this end, put
Sn :=
n
X
k=1
jhk j jXk j:
July 18, 2002
11.5 Problems
343
By Problem 17 in Chapter 10, Sn converges in mean to S 2 L1 . Use the
nonnegativity of Sn and S along with Problem 16 in Chapter 10 to show
that
n
X
E[S] = nlim
!1 jhkjE[jXk j] < 1:
33. For p > 1, show that
P1
n=1
k=1
1=np < 1.
34. Let X1 ; X2; : : : be i.i.d. with := E[Xi4], 2 := E[Xi2], and E[Xi ] = 0.
Show that
n
X
Mn := n1 Xi
i=1
4
4
satises E[Mn] = [n + 3n(n , 1) ]=n4.
35. In Problem 7, explain why the assumption E[Xi4 ] < 1 can be omitted.
36. Let Nt be a Poisson process of rate . Show that Nt =t converges almost
surely to . Hint: First show that Nn =n converges almost surely to .
Second, since Nt is a nondecreasing function of t, observe that
Nbtc Nt Ndte ;
where btc denotes the greatest integer less than or equal to t, and dte
denotes the smallest integer greater than or equal to t.
37. Let Nt be a renewal process as dened in Section 8.2. Let X1 ; X2; : : :
denote the i.i.d. interarrival times. Assume that the interarrival times
have nite, positive mean .
(a) Show that for any " > 0, for all suciently large n,
n
X
k=1
Xk < n( + ") a.s.
(b) Show that Nt ! 1 as t ! 1 a.s.; i.e., show that for any M, Nt M
for all suciently large t.
(c) Show that n=Tn ! 1= a.s. Here Tn := X1 + + Xn is the nth
occurrence time.
(d) Show that Nt=t ! 1= a.s. Hints: On account of (c), if we put
Yn := n=Tn, then YNt ! 1= a.s. since Nt ! 1. Also note that
TNt t < TNt +1 :
38. Give an example of a sequence of random variables that converges almost
surely to zero but not in mean.
July 18, 2002
344
Chap. 11 Other Modes of Convergence
39. Let G be a nondecreasing function dened on the closed interval [a; b].
Let D" denote the set of discontinuities of size greater than " on [a; b],
D" := fu 2 [a; b] : G(u+) , G(u,) > "g;
with the understanding that G(b+) means G(b) and G(a,) means G(a).
Show that if there are n points in D" , then
n < [G(b) , G(a)]=" + 2:
Remark.SThe
set of all discontinuities of G on [a; b], denoted by D[a; b],
1
is simply k=1 D1=k . Since this is a countable union of nite sets, D[a; b]
is at most countably innite. If G is dened on the open interval (0; 1),
we can write
1
[
D(0; 1) =
D[1=n; 1 , 1=n]:
n=3
Since this is a countable union of countably innite sets, D(0; 1) is also
countably innite [37, p. 21, Proposition 7].
40. Let D be a countably innite subset of (0; 1). Let U uniform(0; 1).
Show that }(U 2 D) = 0. Hint: Since D is countably innite, we can
enumerate its elements as a sequence un. Fix any " > 0 and put Kn :=
(un , "=2n; un + "=2n). Observe that
D Now bound
} U2
1
[
n=1
1
[
n=1
July 18, 2002
Kn :
Kn :
CHAPTER 12
Parameter Estimation and Condence Intervals
Consider observed measurements X1 ; X2 ; : : :; each having unknown mean
m. Then natural thing to do is to use the sample mean
n
X
1
Mn := n Xi
i=1
as an estimate of m. However, since Mn is a random variable and m is a
constant, we do not expect to have Mn always equal to m (or even ever equal
if the Xi are continuous random variables). In this chapter we investigate how
close the sample mean Mn is to the true mean m (also known as the ensemble
mean or the population mean).
Section 12.1 introduces the notions of sample mean, unbiased estimator, and
consistent estimator. It also recalls the central limit theorem approximation
from Section 4.7. Section 12.2 introduces the notion of a condence interval for
the mean when the variance is known. In Section 12.3, the sample variance is
introduced, and some of its properties presented. In Section 12.4, the sample
mean and sample variance are combined to obtain condence intervals for the
mean when both the mean and the variance are unknown. Section 12.5 considers
the special case of normal data. While the preceding sections assume that the
number of samples is large, in the normal case, this assumption is not necessary.
To obtain condence intervals for the mean when the variance is unknown, we
introduce the student's t density. By introducing the chi-squared density, we
obtain condence intervals for the variance when the mean is known and when
the mean is unknown. Section 12.5 concludes with a subsection containing
derivations of the more complicated results concerning the density of the sample
variance and of the T random variable. These derivations can be skipped by
the beginning student.
12.1. The Sample Mean
Although we do not expect to have Mn = m, we can say that
n
n
n
X
X
1
1X
E[Mn] = E
X
=
E[Xi] = 1 m = m:
i
n i=1
n i=1
n i=1
In other words, the expected value of the estimator is equal to the quantity we
are trying to estimate. An estimator with this property is said to be unbiased.
Next, if the Xi have common nite variance 2 , as we assume from now
on, and if they are uncorrelated, then for any error bound > 0, Chebyshev's
345
346
Chap. 12 Parameter Estimation and Condence Intervals
inequality tells us that
mj2 ] = var(Mn ) = 2 :
}(jMn , mj > ) E[jMn ,
(12.1)
2
2
n 2
This says that the probability that the estimate Mn diers from the true mean
m by more than is upper bounded by 2 =(n 2). For large enough n, this
bound is very small, and so the probability of being o by more than is
negligible. In particular, (12.1) implies that
nlim
!1
}(jMn , mj > ) = 0; for every > 0:
The terminology for describing the above expression is to say that Mn converges in probability to m. An estimator that converges in probability to
the parameter to be estimated is said to be consistent. Thus, the sample mean
is a consistent estimator of the population mean.
While it is reassuring to know that the sample mean is a consistent estimator
of the ensemble mean, this is an asymptotic result. In practice, we work with a
nite value of n, and so it is unfortunate that, as the next example shows, the
bound in (12.1) is not very good.
Example 12.1. Let X1; X2; : : : be i.i.d. N(m; 2). Then Mn N(m; 2=n)
and so
,m
Yn := M=n p
n
is N(0; 1). Thus,
}(jMn , mj > ) = }(jYnj > pn) = 2[1 , (pn )];
where we have used the fact that Yn has an even density. For n = 3, we nd
}(jM3 , mj > ) = 0:0833. For = in (12.1), we have the bound
}(jMn , mj > ) 1 :
n
To make this bound less than or equal to 0:0833 would require n 12. Thus,
(12.1) would lead us to use a lot more data than is really necessary.
Since the bound in (12.1) is not very good, we would like to know the
distribution of Yn itself in the general case, not just the Gaussian case as in the
example. Fortunately, the central limit theorem (Section 4.7) tells us that if
the Xi are i.i.d. with common mean m and common variance 2 , then for large
n, FYn can be approximated by the normal cdf .
July 18, 2002
12.2 Condence Intervals When the Variance Is Known
347
12.2. Condence Intervals When the Variance Is Known
as
The notion of a condence interval is obtained by rearranging
} jMn , mj py
n
} , py Mn , m py ;
n
n
or
} py m , Mn , py ;
n
n
and then
} Mn + py m Mn , py :
n
n
The above quantity is the probability that the random interval
h
i
Mn , pyn ; Mn + pyn
contains the true mean m. This random interval is called a condence interval. If y is chosen so that
i
h
} m 2 Mn , py ; Mn + py = 1 , (12.2)
n
n
for some 0 < < 1, then the condence interval is said to be a 100(1 , )%
condence interval. The shorthand for the above expression is
m = Mn pyn with 100(1 , )% probability:
In practice, we are given a condence level 1 , , and we would like to choose
y so that (12.2) holds. Now, (12.2) is equivalent to
1 , = } jMn , mj pyn
, mj y
= } jM=n p
n
}
= (jYnj y)
= FYn (y) , FYn (,y)
(y) , (,y);
where the last step follows by the central limit theorem (Section 4.7) if the Xi
are i.i.d. with common mean m and common variance 2. Since the N(0; 1)
density is even, the approximation becomes
2(y) , 1 1 , :
July 18, 2002
348
Chap. 12 Parameter Estimation and Condence Intervals
1,
0.90
0.91
0.92
0.93
0.94
0.95
0.96
0.97
0.98
0.99
y=2
1.645
1.695
1.751
1.812
1.881
1.960
2.054
2.170
2.326
2.576
Table 12.1. Condence levels 1 , and corresponding y=2 such that 2(y=2 ) , 1 = 1 , .
Ignoring the approximation, we solve
(y) = 1 , =2
for y. We denote the solution of this equation by y=2 . It can be found from
tables, e.g., Table 12.1, or numerically by nding the unique root of the equation
(y) + =2 , 1 = 0, or in Matlab by y=2 = norminv(1 , alpha=2). The width of a condence interval is
y
2 p=n 2 :
p
Notice that as n increases, the width gets smaller as 1= n. If we increase the
requested condence level 1 , , we see from Table 12.1 that y=2 gets larger.
Hence, by using a higher condence level, the condence interval gets larger.
Example 12.2. Let X1; X2; : : : be i.i.d. random variables with variance
2 = 2. If M100 = 7:129, nd the 93% and 97% condence intervals for the
population mean.
Solution. In Table 12.1 we scan the 1 , column
p until we nd 0:93.
The corresponding value of y=2 is 1:812. Since = 2, we obtain the 93%
condence interval
p i
p
h
2
1:812
1:812
; 7:129 + p 2 = [6:873; 7:385]:
7:129 , p
100
100
p p
Since 1:812 2= 100 = 0:256, we can also express this condence interval as
m = 7:129 0:256 with 93% probability:
Since can be related to the error function erf (see Section 4.1), y=2 can also be found
p
using the inverse of the error function. Hence, y=2 = 2 erf ,1 (1 , ). The Matlab
,
1
command for erf (z) is erfinv(z).
July 18, 2002
12.3 The Sample Variance
349
For the 97% condence interval, we use y=2 = 2:170 and obtain
or
p
p
i
p 2 ; 7:129 + 2:170
p 2 = [6:822; 7:436];
7:129 , 2:170
100
100
h
m = 7:129 0:307 with 97% probability:
12.3. The Sample Variance
If the population variance 2 is unknown, then the condence interval
i
h
Mn , pyn ; Mn + pyn
cannot be used. Instead, we replace by the sample standard deviation
Sn , where
n
X
Sn2 := n ,1 1 (Xi , Mn )2
i=1
is the sample variance. This leads to the new condence interval
h
i
Mn , Spnny ; Mn + Spnny :
Before we can do any analysis of this, we need some properties of Sn .
To begin, we need the formula,
Sn2 = n ,1 1
n
X
i=1
Xi2 , nMn2 :
It is derived as follows. Write
n
X
Sn2 := n ,1 1 (Xi , Mn )2
i=1
n
X
1
= n , 1 (Xi2 , 2Xi Mn + Mn2)
i=1
X
X
n
n
= n ,1 1
Xi2 , 2
Xi Mn + nMn2
i=1
i=1
n
X
1
2
2
= n,1
Xi , 2(n Mn )Mn + nMn
i=1
X
n
1
2
2
Xi , nMn :
= n,1
i=1
July 18, 2002
(12.3)
350
Chap. 12 Parameter Estimation and Condence Intervals
Using (12.3), it is shown in the problems that Sn2 is an unbiased estimator of
Furthermore, if the Xi are i.i.d. with nite fourth moment,
n
then it is shown in the problems that
2 ; i.e., E[S 2 ] = 2 .
n
1X
2
n i=1 Xi
is a consistent estimator of the second moment E[Xi2 ] = 2 +m2 ; i.e., the above
formula converges in probability to 2 +m2 . Since Mn converges in probability
to m, it follows that
S2
n
n
= n,1
n
1X
2
2
n i=1 Xi , Mn
converges in probability1 to (2 + m2 ) , m2 = 2 . In other words, Sn2 is a
consistent estimator of 2. Furthermore, it now follows that Sn is a consistent
estimator of ; i.e., Sn converges in probability to .
12.4. Condence Intervals When the Variance Is Unknown
As noted in the previous section, if the population variance 2 is unknown,
then the condence interval
i
h
Mn , pyn ; Mn + pyn
cannot be used. Instead, we replace by the sample standard deviation Sn .
This leads to the new condence interval
h
i
Mn , Spnny ; Mn + Spnny :
Given a condence level 1 , , we would like to choose y so that
h
i
} m 2 Mn , Spn y ; Mn + Spn y = 1 , :
(12.4)
n
n
To this end, rewrite the left-hand side as
} jMn , mj Spn y ;
n
or
} Mn p, m Sn y :
= n
p
Again using Yn := (Mn , m)=(= n ), we have
} jYnj Sn y :
July 18, 2002
12.4 Condence Intervals When the Variance Is Unknown
As n ! 1, Sn is converging in probability to . So,2 for large n,
} jYnj Sn y }(jYnj y)
= FYn (y) , FYn (,y):
2(y) , 1;
351
(12.5)
where the last step follows by the central limit theorem approximation. Thus, if
we want to solve (12.4), we can solve the approximate equation 2(y) , 1 = 1 , instead, exactly as in Section 12.2. In other words, to nd condence intervals,
we still get y=2 from Table 12.1, but we use Sn in place of .
Remark. In this section, we are using not only the central limit theorem
approximation, for which it is suggested n should be at least 30, but we are
also using the approximation (12.5). For the methods of this section to apply,
n should be at least 100. This choice is motivated by considerations in the next
section.
Example 12.3. Let X1; X2; : : : be i.i.d. Bernoulli(p) random variables. Find
the 95% condence interval for p if M100 = 0:28 and S100 = 0:451.
Solution. Observe that since m := E[Xip] = p, we can use Mn to estimate
p. From Table 12.1, y=2 = 1:960, S100 y=2 = 100 = 0:088, and
p = 0:28 0:088 with 95% probability:
The actual condence interval is [0:192; 0:368].
Applications
Estimating the number of defective products in a lot. Consider a production
run of N cellular phones, of which, say d are defective. The only way to determine d exactly and for certain is to test every phone. This is not practical if N
is very large. So we consider the following procedure to estimate the fraction
of defectives, p := d=N, based on testing only n phones, where n is large, but
smaller than N.
for i = 1 to n
Select a phone at random from the lot of N phones;
if the ith phone selected is defective
let Xi = 1;
else
let Xi = 0;
end if
Return the phone to the lot;
end for
July 18, 2002
352
Chap. 12 Parameter Estimation and Condence Intervals
Because phones are returned to the lot (sampling with replacement), it is possible to test the same phone more than once. However, because the phones
are always chosen from the same set of N phones, the Xi are i.i.d. with
}(Xi = 1) = d=N = p. Hence, the central limit theorem applies, and we
can use the method of Example 12.3 to estimate p and d = Np. For example,
if N = 1; 000 and we use the numbers from Example 12.3, we would estimate
that
d = 280 88 with 95% probability:
In other words, we are 95% sure that the number of defectives is between 192
and 368 for this particular lot of 1; 000 phones.
If the phones were not returned to the lot after testing (sampling without
replacement), the Xi would not be i.i.d. as required by the central limit theorem.
However, in sampling with replacement when n is much smaller than N, the
chances of testing the same phone twice are negligible. Hence, we can actually
sample without replacement and proceed as above.
Predicting the Outcome of an Election. In order to predict the outcome of
a presidential election, 4; 000 registered voters are surveyed at random. 2; 104
(more than half) say they will vote for candidate A, and the rest say they will
vote for candidate B. To predict the outcome of the election, let p be the fraction
of votes actually received by candidate A out of the total number of voters N
(millions). Our poll samples n = 4; 000, and M4000 = 2; 104=4; 000 = 0:526.
Suppose that
p S4000 = 0:499. For a 95% condence interval for p, y=2 = 1:960,
S4000 y=2= 4000 = 0:015, and
p = 0:526 0:015 with 95% probability:
Rounding o, we would predict that candidate A will receive 53% of the vote,
with a margin of error of 2%. Thus, we are 95% sure that candidate A will win
the election.
Sampling with and without Replacement
Consider sampling n items from a batch of N items, d of which are defective. If we sample with replacement, then the theory above worked out rather
simply. We also argued briey that if n is much smaller than N, then sampling
without replacement would give essentially the same results. We now make this
statement more precise.
To begin, recall that the central limit theorem says that
Pfor large n, FYn (y) (y), where Yn = (Mn , m)=(=pn), and Mn = (1=n) ni=1 Xi . If we sample
with replacement and set Xi = 1 if the ith item is defective, then the Xi are
i.i.d. Bernoulli(p)
with p = d=N. When X1 ; X2; : : : are i.i.d. Bernoulli(p), we
P
know that ni=1 Xi is binomial(n;p) (e.g., by Example 2.20). Putting this all
together, we obtain the DeMoivre{Laplace Theorem, which
p says that if
V binomial(n; p) and n is large, then the cdf of (V=n , p)= p(1 , p)=n is
approximately standard normal.
July 18, 2002
12.5 Condence Intervals for Normal Data
353
Now suppose we sample n items without replacement. Let U denote the
number of defectives out of the n samples. It is shown in the Notes3 that U
has the hypergeometric(N; d;n) pmf,
d N ,d
}(U = k) = k n, k ; k = 0; : : :; n:
N
n
}
As we show below, if n is much smaller than d, N , d, and
p N, then (U =
k) }(V = k). It then p
follows that the cdf of (U=n , p)= p(1 , p)=n is close
to the cdf of (V=n , p)= p(1 , p)=n, which is close to the standard normal cdf
if n is large. (Thus, to make it all work we need n large, but still much smaller
than d, N , d, and N.)
To show that }(U = k) }(V = k), write out }(U = k) as
d! (N , d)!
n!(N , n)! :
k!(d , k)! (n , k)![(N , d) , (n , k)]!
N!
,n
We can easily identify the factor k . Next, since 0 k n d,
d!
k
(d , k)! = d(d , 1) (d , k + 1) d :
Similarly, since 0 k n (N , d),
(N , d)!
n,k
[(N , d) , (n , k)]! = (N , d) [(N , d) , (n , k) + 1] (N , d) :
Finally, since n N,
1
1:
(N , n)! =
N!
N(N , 1) (N , n + 1)
Nn
Writing p = d=N, we have
}(U = k) n pk (1 , p)n,k :
k
12.5. Condence Intervals for Normal Data
In this section we assume that X1 ; X2 ; : : : are i.i.d. N(m; 2 ).
Estimating the Mean
Recall from Example 12.1 that Yn N(0; 1). Hence, the analysis in Sections 12.1 and 12.2 shows that
h
i
} m 2 Mn , py ; Mn + py
= }(jYn j y)
n
n
= 2(y) , 1:
July 18, 2002
354
Chap. 12 Parameter Estimation and Condence Intervals
The point is that for normal data there is no central limit theorem approximation. Hence, we can determine condence intervals as in Section 12.2 even if
n < 30.
Example 12.4. Let X1; X2; : : : be i.i.d. N(m; 2). If M10 = 5:287, nd the
90% condence interval for m.
Solution. From Table 12.1 for 1 , = 0:90, y=2 is 1:645. Since = p2,
we obtain the 90% condence interval
p i
p
h
p 2 ; 5:287 + 1:645
p 2 = [4:551; 6:023]:
5:287 , 1:645
10
10
p p
Since 1:645 2= 10 = 0:736, we can also express this condence interval as
m = 5:287 0:736 with 90% probability:
Unfortunately, the results of Section 12.4 still involve approximation in
(12.5) even if the Xi are normal. However, we now show how to compute
the left-hand side of (12.4) exactly when the Xi are normal. The left-hand side
of (12.4) is
} jMn , mj Spn y = } Mn ,pm y :
n
Sn = n
It is convenient to rewrite the above quotient as
T := MS n=,pnm :
n
It is shown later that if X1 ; X2; : : : are i.i.d. N(m; 2), then T has a student's t
density with = n , 1 degrees of freedom (dened in Problem 17 in Chapter 3).
To compute 100(1 , )% condence intervals, we must solve
}(jT j y) = 1 , :
Since the density fT is even, this is equivalent to
2FT (y) , 1 = 1 , ;
or FT (y) = 1 , =2. This can be solved using tables, e.g., Table 12.2, or
numerically by nding the unique root of the equation FT (y) + =2 , 1 = 0, or
in Matlab by y=2 = tinv(1 , alpha=2; n , 1).
Example 12.5. Let X1; X2; : : : be i.i.d. N(m; 2) random variables, and
suppose M10 = 5:287. Further suppose that S10 = 1:564. Find the 90% condence interval for m.
Solution. In Tablep 12.2 with n = 10, we see that for 1 , = 0:90, y=2 is
1:833. Since S10 y=2 = 10 = 0:907,
m = 5:287 0:907 with 90% probability:
The corresponding condence interval is [4:380; 6:194].
July 18, 2002
12.5 Condence Intervals for Normal Data
355
1 , y=2(n = 100)
1 , y=2(n = 10)
0.90
1.833
0.90
1.660
0.91
1.899
0.91
1.712
0.92
1.973
0.92
1.769
0.93
2.055
0.93
1.832
0.94
2.150
0.94
1.903
0.95
2.262
0.95
1.984
0.96
2.398
0.96
2.081
0.97
2.574
0.97
2.202
0.98
2.821
0.98
2.365
0.99
3.250
0.99
2.626
Table 12.2. Condence levels 1 , and corresponding y=2 such that }(jT j y=2 ) = 1 , .
The left-hand table is for n = 10 observations with T having n , 1 = 9 degrees of freedom,
and the right-hand table is for n = 100 observations with T having n , 1 = 99 degrees of
freedom.
Limiting t Distribution
If we compare the n = 100 table in Table 12.2 with Table 12.1, we see they
are almost the same. This is a consequence of the fact that as n increases, the
t cdf converges to the standard normal cdf. We can see this by writing
}(T t) = } Mn ,pm t
Sn = n
S
M
,
m
n
n
}
p
=
= n t
= } Yn Sn t
}(Yn t);
since2 Sn converges in probability to . Finally, since the Xi are independent
and normal, FYn (t) = (t).
We also recall from Problem 18 in Chapter 3 that the t density converges
to the standard normal density.
Estimating the Variance | Known Mean
Suppose that X1 ; X2 ; : : : are i.i.d. N(m; 2) with m known but 2 unknown.
We use
n
X
Vn2 := n1 (Xi , m)2
i=1
as our estimator of the variance 2. It is shown in the problems that Vn2 is a
consistent estimator of 2 .
July 18, 2002
356
Chap. 12 Parameter Estimation and Condence Intervals
For determining condence intervals, it is easier to work with
n X ,m 2
n V2 = X
i
:
2 n
i=1
Since (Xi , m)= is N(0; 1), its square is chi-squared with one degree of freedom
(Problem 41 in Chapter 3 or Problem 7 in Chapter 4). It then follows that n2 Vn2
is chi-squared with n degrees of freedom (see Problem 46(c) and its Remark in
Chapter 3).
Choose 0 < ` < u, and consider the equation
} ` n2 Vn2 u = 1 , :
We can rewrite this as
2
2
} nVn 2 nVn = 1 , :
`
u
This suggests the condence interval
h nV 2
2i
n ; nVn
u ` :
Then the probability that 2 lies in this interval is
F (u) , F (`) = 1 , ;
where F is the chi-squared cdf with n degrees of freedom. We usually choose `
and u to solve
F (`) = =2 and F (u) = 1 , =2:
These equations can be solved using tables, e.g., Table 12.3, or numerically by
root nding, or in Matlab with the commands ` = chi2inv(alpha=2; n) and
u = chi2inv(1 , alpha=2; n).
Example2 12.6. Let X1; X2; : : : be i.i.d. N(5; 2) random 2variables. Sup-
pose that V100 = 1:645. Find the 90% condence interval for .
Solution. From Table 12.3 we see that for 1 , = 0:90, ` = 77:929 and
u = 124:342. The 90% condence interval is
h 100(1:645) 100(1:645) i
124:342 ; 77:929 = [1:323; 2:111]:
July 18, 2002
12.5 Condence Intervals for Normal Data
1,
0.90
0.91
0.92
0.93
0.94
0.95
0.96
0.97
0.98
0.99
`
77.929
77.326
76.671
75.949
75.142
74.222
73.142
71.818
70.065
67.328
357
u
124.342
125.170
126.079
127.092
128.237
129.561
131.142
133.120
135.807
140.169
Table
12.3. Condence levels 1 , and corresponding values of ` and u such that }(` n V 2 u) = 1 , and such that }( n V 2 `) = }( n V 2 u) = =2 for n = 100
2 n
2 n
2 n
observations.
Estimating the Variance | Unknown Mean
Let X1 ; X2; : : : be i.i.d. N(m; 2 ), where both the mean and the variance
are unknown, but we are interested only in estimating the variance. Since we
do not know m, we cannot use the estimator Vn2 above. Instead we use Sn2 .
However, for determining condence intervals, it is easier to work with n,21 Sn2 .
As argued below, n,21 Sn2 is a chi-squared random variable with n , 1 degrees
of freedom.
Choose 0 < ` < u, and consider the equation
} ` n ,2 1 Sn2 u = 1 , :
We can rewrite this as
2
2
} (n , 1)Sn 2 (n , 1)Sn = 1 , :
`
u
This suggests the condence interval
h (n , 1)S 2 (n , 1)S 2 i
n
n :
u ;
`
Then the probability that 2 lies in this interval is
F(u) , F (`) = 1 , ;
where now F is the chi-squared cdf with n , 1 degrees of freedom. We usually
choose ` and u to solve
F (`) = =2 and F (u) = 1 , =2:
These equations can be solved using tables, e.g., Table 12.4, or numerically by
root nding, or in Matlab with the commands ` = chi2inv(alpha=2; n , 1)
and u = chi2inv(1 , alpha=2; n , 1).
July 18, 2002
358
Chap. 12 Parameter Estimation and Condence Intervals
1,
0.90
0.91
0.92
0.93
0.94
0.95
0.96
0.97
0.98
0.99
`
77.046
76.447
75.795
75.077
74.275
73.361
72.288
70.972
69.230
66.510
u
123.225
124.049
124.955
125.963
127.103
128.422
129.996
131.966
134.642
138.987
Table
12.4. Condence levels 1 , and corresponding values of ` and u such that }(` n,1 S 2 u) = 1 , and such that }( n,1 S 2 `) = }( n,1 S 2 u) = =2 for n = 100
2 n
2 n
2 n
observations (n , 1 = 99 degrees of freedom).
Example 12.7. Let X1; X2; : : : be i.i.d. N(m;22) random variables. If
2 = 1:608, nd
S100
the 90% condence interval for .
Solution. From Table 12.4 we see that for 1 , = 0:90, ` = 77:046 and
u = 123:225. The 90% condence interval is
h 99(1:608) 99(1:608) i
123:225 ; 77:046 = [1:292; 2:067]:
? Derivations
The remainder of this section is devoted to deriving the distributions of Sn2
and
p
(M
n , m)=(= n)
q
T := MS n=,pm
=
n,21 S 2 =(n , 1)
n n
n
under the assumption that the Xi are i.i.d. N(m; 2). The analysis is rather
complicated, and may be omitted by the beginning student. p
We begin with the numerator in T . Recall that (Mn , m)=(= n ) is simply
Yn as dened in Section 12.1. It was shown in Example 12.1 that Yn N(0; 1).
For the denominator, we show that the density of n,21 Sn2 is chi-squared with
n , 1 degrees of freedom. We begin by recalling the derivation of (12.3). If we
replace the rst line of the derivation with
n
X
Sn2 = n ,1 1 ([Xi , m] , [Mn , m])2 ;
i=1
then we end up with
X
n
1
2
2
2
[Xi , m] , n[Mn , m] :
Sn = n , 1
i=1
July 18, 2002
12.5 Condence Intervals for Normal Data
359
Using the notation Zi := (Xi , m)=, we have
n
2
n , 1 S2 = X
2 , n Mn , m ;
Z
2 n i=1 i
or
n
n , 1 S 2 + Mn p
, m 2 = X
Zi2 :
2 n
= n
i=1
As we argue below, P
the two terms on the left are independent. It then follows
that the density of ni=1 Zi2 is equal to the convolution of the densities of the
other two terms. To nd the density of n,21 Sn2 , we use moment generating
functions. Now, the second term on the left is the square of an N(0; 1) random variable, and by Problem 7 in Chapter 4 has a chi-squared density with
one degree of freedom. Its moment generating function is 1=(1 , 2s)1=2 (see
Problem 40(c) in Chapter 3). For the same reason, each Zi2 has aPchi-squared
density with one degree of freedom. Since the Zi are independent, ni=1 Zi2 has
a chi-squared density with n degrees of freedom, and its moment generating
function is 1=(1 , 2s)n=2 (see Problem 46(c) in Chapter 3). It now follows that
the moment generating function of n,21 Sn2 is the quotient
1=(1 , 2s)n=2 =
1
1=(1 , 2s)1=2
(1 , 2s)(n,1)=2 ;
which is the moment generating function of a chi-squared density with n , 1
degrees of freedom.
It remains to show that Sn2 and Mn are independent. Observe that Sn2 is a
function of the vector
W := [(X1 , Mn); : : :; (Xn , Mn )]0:
In fact, Sn2 = W 0W=(n , 1). By Example 7.5, the vector W and the sample
mean are independent. It then follows that any function of W and any function
of Mn are independent.
We can now nd the density of
p
n , m)=(= n) :
q
T = (M
n,21 S 2 =(n , 1)
n
If the Xi are i.i.d. N(m; 2 ), then the numerator and the denominator are
independent; the numerator is N(0; 1), and the denominator is chi-squared
with n , 1 degrees of freedom divided by n , 1. By Problem 24 in Chapter 5,
T has a student's t density with = n , 1 degrees of freedom.
July 18, 2002
360
Chap. 12 Parameter Estimation and Condence Intervals
12.6. Notes
Notes x12.3: The Sample Variance
Note 1. We are appealing to the fact that if g(u; v) is a continuous function
of two variables, and if Un converges in probability to u, and if Vn converges in
probability to v, then g(Un ; Vn) converges in probability to g(u; v). This result
is proved in Example 11.2.
Notes x12.4: Condence Intervals When the Variance Is Unknown
Note 2. We are appealing to the fact that if the cdf of Yn, say Fn, converges
to a continuous cdf F, and if Un converges in probability to 1, then
}(Yn yUn ) ! F(y):
This result, which is proved in Example 11.8, is a version of Slutsky's Theorem.
Note 3. The hypergeometric random variable arises in the following situation. We have a collection of N items, d of which are defective. Rather than
test all N items, we select at random a small number of items, say n < N. Let
Yn denote the number of defectives out the n items tested. We show that
d N ,d
}(Yn = k) = k n, k ; k = 0; : : :; n:
N
n
We denote this by Yn hypergeometric(N; d; n).
Remark. In the typical case, d n and N , d n; however,,dif these
conditions do not
hold
in the above formula, it is understood that k = 0 if
,
d < k n, and Nn,,kd = 0 if n , k > N , d, i.e., if 0 k < n , (N , d).
For i = 1; : : :; n, draw at random an item from the collection and test it.
If the ith item is defective, let Xi = 1, and put Xi = 0 otherwise. In either
case, do not put the tested item back into the collection (sampling without
replacement). Then the total number of defectives among the rst n items
tested is
n
X
Yn :=
Xi :
i=1
We show that Yn hypergeometric(N; d; n).
Consider the case n = 1. Then Y1 = X1 , and the chance of drawing a
defective item at random is simply the ratio of the number of defectives to the
total number of items in the collection; i.e., }(Y1 = 1) = }(X1 = 1) = d=N.
July 18, 2002
12.6 Notes
361
Now in general, suppose the result is true for some n 1. We show it is true
for n + 1. Use the law of total probability to write
}(Yn+1 = k) =
n
X
i=0
}(Yn+1 = kjYn = i)}(Yn = i):
(12.6)
Since Yn+1 = Yn + Xn+1 , we can use the substitution law to write
}(Yn+1 = kjYn = i) = }(Yn + Xn+1 = kjYn = i)
= }(i + Xn+1 = kjYn = i)
= }(Xn+1 = k , ijYn = i):
Since Xn+1 takes only the values zero and one, this last expression is zero unless
i = k or i = k , 1. Returning to (12.6), we can write
}(Yn+1 = k) =
k
X
i=k,1
}(Xn+1 = k , ijYn = i)}(Yn = i):
(12.7)
When i = k , 1, the above conditional probability is
}(Xn+1 = 1jYn = k , 1) = d , (k , 1) ;
N ,n
since given Yn = k , 1, there are N , n items left in the collection, and of those,
the number of defectives remaining is d , (k , 1). When i = k, the needed
conditional probability is
}(Xn+1 = 0jYn = k) = (N , d) , (n , k) ;
N ,n
since given Yn = k, there are N , n items left in the collection, and of those,
the number of non defectives remaining is (N , d) , (n , k). If we now assume
that Yn hypergeometric(N; d; n), we can expand (12.7) to get
d
N ,d
}(Yn+1 = k) = d , (k , 1) k , 1 n , (k , 1)
N
N ,n
n
d N ,d
, (n , k) k n, k :
+ (N , d)
N ,n
N
n
It is a simple calculation to see that the rst term on the right is equal to
d
N ,d
1 , n +k 1 k [nN+ 1], k ;
n+1
July 18, 2002
362
Chap. 12 Parameter Estimation and Condence Intervals
and the second term is equal to
d
k k
n+1
N ,d
[n
+ 1], k :
N
n+1
Thus, Yn+1 hypergeometric(N; d; n + 1).
12.7. Problems
Problems x12.1: The Sample Mean
1. Show that the random variable Yn dened in Example 12.1 is N(0; 1).
2. Show that when the Xi are uncorrelated, Yn has zero mean and unit
variance.
Problems x12.2: Condence Intervals When the Variance Is Known
3. If 2 = 4 and n = 100, how wide is the 99% condence interval? How
large would n have to be to have a 99% condence interval of width less
than or equal to 1=4?
4. Let W1 ; W2; : : : be i.i.d. with zero mean and variance 4. Let Xi = m +
Wi , where m is an unknown constant. If M100 = 14:846, nd the 95%
condence interval.
5. Let Xi = m + Wi , where m is an unknown constant, and the Wi are
i.i.d. Cauchy with parameter 1. Find > 0 such that the probability is
2=3 that the condence interval [Mn , ; Mn + ] contains m; i.e., nd
> 0 such that
}(jMn , mj ) = 2=3:
Hints: Since E[Wi2] = 1, the central limit theorem does not apply. However, you can solve for exactly if you can nd the cdf of Mn , m. The
cdf of Wi is F(w) = 1 tan,1(w) + 1=2, and the characteristic function of
Wi is E[ejWi ] = e,j j .
Problems x12.3: The Sample Variance
6. If X1 ; X2 ; : : : are uncorrelated, use the formula
S2
n
1
= n,1
n
X
i=1
X2
i
, nM 2
n
to show that Sn2 is an unbiased estimator of 2; i.e., show that E[Sn2 ] = 2.
July 18, 2002
12.7 Problems
363
7. Let X1 ; X2 ; : : : be i.i.d. with nite fourth moment. Let m2 := E[Xi2 ] be
the second moment. Show that
n
1X
2
n i=1 Xi
is a consistent estimator of m2 . Hint: Let X~ i := Xi2 .
Problems x12.4: Condence Intervals When the Variance is Unknown
8. Let X1 ; X2; : : : be i.i.d. random variables with unknown, nite mean m
9.
10.
11.
12.
13.
14.
and variance 2 . If M100 = 10:083 and S100 = 0:568, nd the 95%
condence interval for the population mean.
Suppose that 100 engineering freshmen are selected at random and X1 ; : : :; X100
are their times (in years) to graduation. If M100 = 4:422 and S100 = 0:957,
nd the 93% condence interval for their expected time to graduate.
From a batch of N = 10; 000 computers, n = 100 are sampled, and 10
are found defective. Estimate the number of defective computers in the
total batch of 10; 000, and give the margin of error for 90% probability if
S100 = 0:302.
You conduct a presidential preference poll by surveying 3; 000 voters. You
nd that 1; 559 (more than half) say they plan to vote for candidate A,
and the others say they plan to vote for candidate B. If S3000 = 0:500,
are you 90% sure that candidate A will win the election? Are you 99%
sure?
From a batch of 100; 000 airbags, 500 are sampled, and 48 are found
defective. Estimate the number of defective airbags in the total batch of
100; 000, and give the margin of error for 94% probability if S100 = 0:295.
A new vaccine has just been developed at your company. You need to be
97% sure that side eects do not occur more than 10% of the time.
(a) In order to estimate the probability p of side eects, the vaccine is
tested on 100 volunteers. Side eects are experienced by 6 of the
volunteers. Using the value S100 = 0:239, nd the 97% condence
interval for p if S100 = 0:239. Are you 97% sure that p 0:1?
(b) Another study is performed, this time with 1000 volunteers. Side
eects occur in 71 volunteers. Find the 97% condence interval for
the probability p of side eects if S1000 = 0:257. Are you 97% sure
that p 0:1?
Packet transmission times on a certain Internet link are independent and
identically distributed. Assume that the times have an exponential density with mean .
July 18, 2002
364
Chap. 12 Parameter Estimation and Condence Intervals
(a) Find the probability that in transmitting n packets, at least one of
them takes more than t seconds to transmit.
(b) Let T denote the total time to transmit n packets. Find a closedform expression for the density of T .
(c) Your answers to parts (a) and (b) depend on , which in practice is
unknown and must be estimated. To estimate the expected transmission time, n = 100 packets are sent, and the transmission times
T1 ; : : :; Tn recorded. It is found that the sample mean M100 = 1:994,
and sample standard deviation S100 = 1:798, where
n
n
X
X
Mn := n1 Ti and Sn2 := n ,1 1 (Ti , Mn )2 :
i=1
i=1
Find the 95% condence interval for the expected transmission time.
15. Let Nt be a Poisson process with unknown intensity . For i = 1; : : :; 100,
put Xi = (Ni , Ni,1). Then the Xi are i.i.d. Poisson(), and E[Xi] = .
If M100 = 5:170, and S100 = 2:132, nd the 95% condence interval for .
Problems x12.5: Condence Intervals for Normal Data
16. Let W1; W2 ; : : : be i.i.d. N(0; 2) with 2 unknown. Let Xi = m + Wi ,
17.
18.
19.
20.
where m is an unknown constant. Suppose M10 = 14:832 and S10 = 1:904.
Find the 95% condence interval for m.
Let X1 ; X2; : : : be i.i.d. N(0; 2) with 2 unknown. Find the 95% con2 = 4:413.
dence interval for 2 if V100
Let Wt be a zero-mean, wide-sense stationary white noise process with
power spectral density SW (f) = N0 =2. Suppose that Wt is applied to the
ideal lowpass lter of bandwidth B = 1 MHz and power gain 120 dB; i.e.,
H(f) = GI[,B;B] (f), where G = 106. Denote the lter output by Yt , and
for i = 1; : : :; 100, put Xi := Yit , where t = (2B),1 . Show that the
Xi are zero mean, uncorrelated, with variance 2 = G2B N0 . Assuming
2 = 4:659 mW, nd the 95% condence
the Xi are normal, and that V100
interval for N0 ; your answer should have units of Watts/Hz.
Let X1 ; X2; : : : be i.i.d. with known, nite mean m, unknown, nite variance 2 , and nite, fourth central moment 4 = E[(Xi , m)4 ]. Show
that Vn2 is a consistent estimator of 2 . Hint: Apply Problem 7 with Xi
replaced by Xi , m.
Let W1; W2 ; : : : be i.i.d. N(0; 2) with 2 unknown. Let Xi = m + Wi ,
where m is an unknown constant. Find the 95% condence interval for
2 = 4:736.
2 if S100
July 18, 2002
CHAPTER 13
Advanced Topics
Self similarity has been noted in the trac of local-area networks [26], widearea networks [32], and in World Wide Web trac [12]. The purpose of this
chapter is to introduce the notion of self similarity and related concepts so that
students can be conversant with the kinds of stochastic processes being used to
model network trac.
Section 13.1 introduces the Hurst parameter and the notion of distributional
self similarity for continuous-time processes. The concept of stationary increments is also presented. As an example of such processes, fractional Brownian
motion is developed using the Wiener integral. In Section 13.2, we show that if
one samples the increments of a continuous-time self-similar process with stationary increments, then the samples have a covariance function with a very
specic formula. It is shown that this formula is equivalent to specifying the
variance of the sample mean for all values of n. Also, the power spectral density
is found up to a multiplicative constant. Section 13.3 introduces the concept
of asymptotic second-order self similarity and shows that it is equivalent to
specifying the limiting form of the variance of the sample mean. The main
result here is a sucient condition on the power spectral density that guarantees asymptotic second-order self similarity. Section 13.4 denes long-range
dependence. It is shown that every long-range-dependent process is asymptotically second-order self similar. Section 13.5 introduces ARMA processes, and
Section 13.6 extends this to fractional ARIMA processes. Fractional ARIMA
process provide a large class of models that are asymptotically second-order self
similar.
13.1. Self Similarity in Continuous Time
Loosely speaking, a continuous-time random process Wt is said to be self
similar with Hurst parameter H > 0 if the process Wt \looks like" the
process H Wt. If > 1, then time is speeded up for Wt compared to the
original process Wt . If < 1, then time is slowed down. The factor H in
H Wt either increases or decreases the magnitude (but not the time scale)
compared with the original process. Thus, for a self-similar process, when time
is speeded up, the apparent eect is the same as changing the magnitude of the
original process, rather than its time scale.
The precise denition of \looks like" will be in the sense of nite-dimensional
distributions. That is, for Wt to be self similar, we require that for every
> 0, for every nite collection of times t1; : : :; tn, all joint probabilities involving Wt1 ; : : :; Wtn are the same as those involving H Wt1 ; : : :; H Wtn .
The best example of a self-similar process is the Wiener process. This is easy
to verify by comparing the joint characteristic functions of Wt1 ; : : :; Wtn and
365
366
Chap. 13 Advanced Topics
H Wt1 ; : : :; H Wtn for the correct value of H.
Implications of Self Similarity
Let us focus rst on a single time point t. For a self-similar process, we must
have
}(Wt x) = }(H Wt x):
Taking t = 1, results in
}(W x) = }(H W1 x):
Since > 0 is a dummy variable, we can call it t instead. Thus,
}(Wt x) = }(tH W1 x); t > 0:
Now rewrite this as
}(Wt x) = }(W1 t,H x); t > 0;
or, in terms of cumulative distribution functions,
FWt (x) = FW1 (t,H x); t > 0:
It can now be shown (Problem 1) that Wt converges in distribution to the zero
random variable as t ! 0. Similarly, as t ! 1, Wt converges in distribution to
a discrete random variable taking the values 0 and 1.
We next look at expectations of self-similar processes. We can write
E[Wt] = E[H Wt ] = H E[Wt]:
Setting t = 1 and replacing by t results in
E[Wt] = tH E[W1]; t > 0:
(13.1)
Hence, for a self-similar process, its mean function has the form of a constant
times tH for t > 0.
As another example, consider
E[Wt2 ] = E (H Wt)2 = 2H E[Wt2]:
(13.2)
Arguing as above, we nd that
E[Wt2] = t2H E[W12]; t > 0:
(13.3)
We can also take t = 0 in (13.2) to get
E[W02] = 2H E[W02]:
Since the left-hand side does not depend on , E[W02] = 0, which implies W0 =
0 a.s. Hence, (13.1) and (13.3) both continue to hold even when t = 0.
Using the formula ab = eb ln a, 0H = eH ln 0 = e,1 , since H > 0. Thus, 0H = 0.
July 18, 2002
13.1 Self Similarity in Continuous Time
367
Example 13.1. Assuming that the Wiener process is self similar, show
that the Hurst parameter must be H = 1=2.
Solution. Recall that for the Wiener process, we have E[Wt2] = 2t. Thus,
(13.2) implies that
2 (t) = 2H 2t:
Hence, H = 1=2.
Stationary Increments
A process Wt is said to have stationary increments if for every increment
> 0, the increment process
Zt := Wt , Wt,
is a stationary process in t. If Wt is self similar with Hurst parameter H, and
has stationary increments, we say that Wt is H -sssi.
If Zt is H-sssi, then E[Zt] cannot depend on t; but by (13.1),
E[Zt] = E[Wt , Wt, ] = tH , (t , )H E[W1]:
If H 6= 1, then we must have E[W1] = 0, which by (13.1), implies E[Wt] = 0.
As we will see later, the case H = 1 is not of interest if Wt has nite second
moments, and so we will always take E[Wt] = 0.
If Zt is H-sssi, then the stationarity of the increments and (13.3) imply
E[Zt2 ] = E[Z2] = E[(W , W0 )2 ] = E[W2] = 2H E[W12]:
Similarly, for t > s,
E[(Wt , Ws )2 ] = E[(Wt,s , W0 )2 ] = E[Wt2,s] = (t , s)2H E[W12]:
For t < s,
E[(Wt , Ws )2 ] = E[(Ws , Wt )2 ] = (s , t)2H E[W12]:
Thus, for arbitrary t and s,
E[(Wt , Ws )2 ] = jt , sj2H E[W12]:
Note in particular that
E[Wt2] = jtj2H E[W12]:
Now, we also have
E[(Wt , Ws )2 ] = E[Wt2] , 2E[WtWs ] + E[Ws2];
and it follows that
2
E[Wt Ws ] = E[W1 ] [jtj2H , jt , sj2H + jsj2H ]:
(13.4)
2
July 18, 2002
368
Chap. 13 Advanced Topics
Fractional Brownian Motion
Let Wt denote the standard Wiener process on ,1 < t < 1 as dened in
Problem 23 in Chapter 8. The standard fractional Brownian motion is the
process BH (t) dened by the Wiener integral
Z
BH (t) :=
1
,1
gH;t () dW ;
where gH;t is dened below. Then E[BH (t)] = 0, and
E[BH
(t)2 ]
=
Z
1
,1
gH;t ()2 d:
To evaluate this expression as well as the correlation E[BH (t)BH (s)], we must
now dene gH;t (). To this end, let
H ,1=2
> 0;
qH () := 0; ; 0;
and put
gH;t() := C1 qH (t , ) , qH (,) ;
H
where
Z 1
1:
CH2 :=
(1 + )H ,1=2 , H ,1=2 2 d + 2H
0
First note that since gH;0 () = 0, BH (0) = 0. Next,
BH (t) , BH (s) =
Z
1
[gH;t() , gH;s ()] dW
,1 Z
1
= CH,1
[qH (t , ) , qH (s , )] dW ;
,1
and so
E[jBH (t) , BH
(s)j2 ]
= CH,2
Z
1
,1
jqH (t , ) , qH (s , )j2 d:
If we now assume s < t, then this integral is equal to the sum of
Z
and
s
,1
[(t , )H ,1=2 , (s , )H ,1=2 ]2 d
Z t,s
2H
(t , )2H ,1 d =
2H ,1 d = (t ,2Hs) :
s
0
To evaluate the integral from ,1 to s, let = s , to get
Z
t
1
Z
0
[(t , s + )H ,1=2 , H ,1=2 ]2 d;
July 18, 2002
13.2 Self Similarity in Discrete Time
which is equal to
(t , s)2H ,1
369
1 ,
Z
0
,
1 + =(t , s) H ,1=2 , =(t , s) H ,1=2 2 d:
Making the change of variable = =(t , s) yields
(t , s)2H
It is now clear that
1
Z
0
(1 + )H ,1=2 , H ,1=2 2 d:
E[jBH (t) , BH (s)j2 ] = (t , s)2H ; t > s:
Since interchanging the positions of t and s on the left-hand side has no eect,
we can write for arbitrary t and s,
E[jBH (t) , BH (s)j2 ] = jt , sj2H :
Taking s = 0 yields
E[BH (t)2 ] = jtj2H :
Furthermore, expanding E[jBH (t) , BH (s)j2], we nd that
jt , sj2H = jtj2H , 2E[BH (t)BH (s)] + jsj2H ;
or,
2H
2H
2H
(13.5)
E[BH (t)BH (s)] = jtj , jt , sj + jsj :
2
Observe that BH (t) is a Gaussian process in the sense that if we select any
sampling times, t1 < < tn , then the random vector [BH (t1 ); : : :; BH (tn )]0 is
Gaussian; this is a consequence of the fact that BH (t) is dened as a Wiener
integral (Problem 14 in Chapter 11). Furthermore, the covariance matrix of
the random vector is completely determined by (13.5). On the other hand, by
(13.4), we see that any H-sssi process has the same covariance function (up
to a scale factor). If that H-sssi process is Gaussian, then as far as the joint
probabilities involving any nite number of sampling times, we may as well
assume that the H-sssi process is fractional Brownian motion. In this sense,
there is only one H-sssi process with nite second moments that is Gaussian :
fractional Brownian motion.
13.2. Self Similarity in Discrete Time
Let Wt be an H-sssi process. By choosing an appropriate time scale for Wt,
we can focus on the unit increment = 1. Furthermore, the advent of digital
signal processing suggests that we sample the increment process. This leads us
to consider the discrete-time increment process
Xn := Wn , Wn,1:
July 18, 2002
370
Chap. 13 Advanced Topics
Since Wt is assumed to have zero mean, the covariance of Xn is easily found
using (13.4). For n > m,
2
1 ] [(n , m + 1)2H , 2(n , m)2H + (n , m , 1)2H ]:
E[Xn Xm ] = E[W
2
Since this depends only on the time dierence, the covariance function of Xn is
2
(13.6)
C(n) = 2 [jn + 1j2H , 2jnj2H + jn , 1j2H ];
where 2 := E[W12].
The foregoing analysis assumed that Xn was obtained by sampling the increments of an H-sssi process. More generally, a discrete-time, wide-sense stationary (WSS) process is said to be second-order self similar if its covariance
function has the form in (13.6). In this context it is not assumed that Xn
is obtained from an underlying continuous-time process or that Xn has zero
mean. A second-order self-similar process that is Gaussian is called fractional
Gaussian noise, since one way of obtaining it is by sampling the increments
of fractional Brownian motion.
Convergence Rates for the Mean-Square Law of Large Numbers
Suppose that Xn is a discrete-time, WSS process with mean := E[Xn].
It is shown later in Section 13.4 that if Xn is second-order self similar; i.e., if
(13.6) holds, then C(n) ! 0. On account of Example 10.4, the sample mean
n
1X
n i=1 Xi
converges in mean square to . But how fast does the sample mean converge?
We show that (13.6) holds if and only if
2 n
2
1 X
2 n2H ,2 = n2H ,1 :
E
X
,
=
(13.7)
i
n i=1
n
In other words, Xn is second-order self similar if and only if (13.7) holds. To put
(13.7) into perspective, rst consider the case H = 1=2. Then (13.6) reduces
to zero for n 6= 0. In other words, the Xn are uncorrelated. Also, when
H = 1=2, the factor n2H ,1 in (13.7) is not present. Thus, (13.7) reduces to the
mean-square law of large numbers for uncorrelated random variables derived in
Example 10.3. If 2H , 1 < 0, or equivalently H < 1=2, then the convergence
is faster than in the uncorrelated case. If 1=2 < H < 1, then the convergence
is slower than in the uncorrelated case. This has important consequences for
determining condence intervals, as shown in Problem 6.
To show the equivalence of (13.6) and (13.7), the rst step is to dene
Yn :=
n
X
i=1
(Xi , );
July 18, 2002
13.2 Self Similarity in Discrete Time
371
and observe that (13.7) is equivalent to E[Yn2] = 2 n2H .
The second step is to express E[Yn2 ] in terms of C(n). Write
E[Y 2]
n
X
n
X
n
i=1
n X
n
X
k=1
= E
(Xi , )
=
E[(Xi , )(Xk , )]
=
i=1 k=1
n X
n
X
i=1 k=1
(Xk , )
C(i , k):
(13.8)
The above sum amounts to summing all the entries of the n n matrix with i k
entry C(i , k). This matrix is symmetric and is constant along each diagonal.
Thus,
nX
,1
2
E[Yn ] = nC(0) + 2 C()(n , ):
(13.9)
=1
2
Now that we have a formula for E[Y ] in terms of C(n), we follow Likhanov
n
[28, p. 195] and write
E[Yn2+1] , E[Yn2] = C(0) + 2
and then
,
,
n
X
=1
C();
E[Yn2+1] , E[Yn2 ] , E[Yn2] , E[Yn2,1] = 2C(n):
Applying the formula E[Yn2 ] = 2 n2H shows that for n 1,
(13.10)
2
C(n) = 2 [(n + 1)2H , 2n2H + (n , 1)2H ]:
Finally, it is a simple exercise (Problem 7) using induction on n to show
that (13.6) implies E[Yn2 ] = 2 n2H for n 1.
Aggregation
Consider the partitioning of the sequence Xn into blocks of size m:
X
: :; Xm} X
: : :; X2m} X(n,1)m+1 ; : : :; Xnm :
1 ; : {z
m+1 ; {z
|
|
1st block
2nd block
|
P
{z
nth block
}
The average
of the rst block is m1 mk=1 Xk . The average of the second block
P
2
m
is m1 k=m+1 Xk . The average of the nth block is
nm
X
Xn(m) := m1
Xk :
k=(n,1)m+1
July 18, 2002
(13.11)
372
Chap. 13 Advanced Topics
The superscript (m) indicates the block size, which is the number of terms used
to compute the average. The subscript n indicates the block number. We call
fXn(m) g1n=,1 the aggregated process. We now show that if Xn is secondorder self similar, then the covariance function of Xn(m) , denoted by C (m) (n),
satises
C (m) (n) = m2H ,2 C(n):
(13.12)
In other words, if the original sequence is replaced by the sequence of averages
of blocks of size m, then the new sequence has a covariance function that is the
same as the original one except that the magnitude is scaled by m2H ,2 .
The derivation of (13.12) is very similar to the derivation of (13.7). Put
Xe(m) :=
m
X
(Xk , ):
(13.13)
k=( ,1)m+1
Since Xk is WSS, so is Xe(m) . Let its covariance function be denoted by Ce (m) ().
Next dene
n
X
Yen :=
Xe(m) :
=1
Just as in (13.10),
,
,
2Ce (m) (n) = E[Yen2+1] , E[Yen2] , E[Yen2 ] , E[Yen2,1] :
Now observe that
Yen =
=
=
n
X
=1
n
X
Xe(m)
m
X
(Xk , )
=1 k=( ,1)m+1
nm
X
(X , )
=1
Ynm ;
=
where this Y is the same as the one dened in the preceding subsection. Hence,
,
2 ] , ,E[Y 2 ] , E[Y 2
2Ce (m) (n) = E[Y(2n+1)m] , E[Ynm
(13.14)
nm
(n,1)m] :
Now we use the fact that since Xn is second-order self similar, E[Yn2 ] = 2 n2H .
Thus,
2 ,
,
Ce (m) (n) = 2 (n + 1)m 2H , 2(nm)2H + (n , 1)m 2H :
Since C (m) (n) = Ce (m) (n)=m2 , (13.12) follows.
July 18, 2002
13.2 Self Similarity in Discrete Time
373
The Power Spectral Density
We show that the power spectral density of a second-order self-similar process is proportional toy
1
X
1
2
(13.15)
sin (f)
j
i
+
f
j2H +1 :
i=,1
The proof rests on the fact (derived below) that for any wide-sense stationary
process,
,1
2 C (m) (n) = Z 1 ej 2fn mX
S([f + k]=m)
sin(f)
df
m2H ,2
m2H +1
sin([f + k]=m)
0
k=0
for m = 1; 2; : : :: Now we also have
C(n) =
Z 1=2
,1=2
S(f)ej 2fn df
=
Z 1
0
S(f)ej 2fn df:
Hence, if S(f) satises
mX
,1 S([f + k]=m) sin(f) 2 = S(f); m = 1; 2; : : :; (13.16)
m2H +1
sin([f + k]=m)
k=0
then (13.12) holds. In particular it holds for n = 0. Thus,
E[Ym2 ] = C (m) (0) = C(0) = 2 ; m = 1; 2; : : ::
m2H
m2H ,2
2
As noted earlier, E[Yn ] = 2 n2H is just (13.7), which is equivalent to (13.6).
Now observe that if S(f) is proportional to (13.15), then (13.16) holds.
The integral formula for C (m) (n)=m2H ,2 is derived following Sinai [44,
p. 66]. Write
1
C (m) (n) =
(m)
( m)
m2H ,2
m2H ,2 E[(Xn+1 , )(X1 , )]
(nX
+1)m X
m
E[(Xi , )(Xk , )]
= m12H
i=nm+1 k=1
(nX
+1)m X
m
= m12H
C(i , k)
i=nm+1 k=1
(nX
+1)m X
m Z 1=2
1
= m2H
S(f)ej 2f (i,k) df
,
1
=
2
i=nm+1 k=1
Z 1=2
(nX
+1)m X
m
1
= m2H
S(f)
ej 2f (i,k) df:
,1=2
i=nm+1 k=1
y The constant of proportionality can be found using results from Section 13.3; see Problem 15.
July 18, 2002
374
Chap. 13 Advanced Topics
Now write
(nX
+1)m X
m
i=nm+1 k=1
ej 2f (i,k) =
m X
m
X
=1 k=1
= ej 2nm
=
ej 2f (nm+ ,k)
m X
m
X
=1 k=1
ej 2f ( ,k)
m
2
X
ej 2nm e,j 2fk :
k=1
Using the nite geometric series,
m
X
k=1
,j 2fm
e,j 2fk = e,j 2f 11,,ee,j 2f = e,jf (m+1) sin(mf)
sin(f) :
Thus,
C (m) (n) = 1 Z 1=2 S(f)ej 2fnm sin(mf) 2df:
m2H ,2
m2H ,1=2
sin(f)
Since the integrand has period one, we can shift the range of integration to [0; 1]
and then make the change-of-variable = mf. Thus,
C (m) (n) = 1 Z 1 S(f)ej 2fnm sin(mf) 2 df
m2H ,2
m2H 0
sin(f)
Z m
sin() 2d
S(=m)ej 2n sin(=m)
= m21H +1
0
mX
,1 Z k+1
1
sin() 2d:
= m2H +1
S(=m)ej 2n sin(=m)
k=0 k
Now make the change-of-variable f = , k to get
2
,1 Z 1
C (m) (n) =
1 mX
sin([f
+
k])
j
2
[
f
+
k
]
n
m2H ,2
m2H +1 k=0 0 S([f + k]=m)e
sin([f + k]=m) df
2
mX
,1 Z 1
sin(f)
= m21H +1
S([f + k]=m)ej 2fn sin([f
+ k]=m) df
k=0 0
m,1
Z 1
X S([f + k]=m)
sin(f) 2 df:
=
ej 2fn
m2H +1
sin([f + k]=m)
0
k=0
Notation
We have been using the term correlation function to refer to the quantity E[Xn Xm ]. This is the usual practice in engineering. However, engineers
July 18, 2002
13.3 Asymptotic Second-Order Self Similarity
375
studying network trac follow the practice of statisticians and use the term
correlation function to refer to
cov(Xn ; Xm ) :
p
var(Xn ) var(Xm )
In other words, in networking, the term correlation function refers to our covariance function C(n) divided by C(0). We use the notation
(n) := C(n)
C(0) :
Now assume that Xn is second-order self similar. We have by (13.6) that
C(0) = 2 , and so
(n) = 12 [jn + 1j2H , 2jnj2H + jn , 1j2H ]:
Let (m) denote the correlation function of Xn(m) . Then (13.12) tells us that
(m)
m2H ,2 C(n) = (n):
=
(m) (n) := CC (m)(n)
m2H ,2 C(0)
(0)
13.3. Asymptotic Second-Order Self Similarity
We showed in the previous section that second-order self similarity(eq. (13.6))
is equivalent to (13.7), which species
2 X
1 n
(13.17)
E n Xi , i=1
exactly for all n. While this is a very nice result, it applies only when the
covariance function has exactly the form in (13.6). However, if we only need to
know the behavior of (13.17) for large n, say
2 n
1 X
E n Xi , i=1
2
lim
= 1
(13.18)
n!1
n2H ,2
2 , then we can allow more freedom in the behavior
for some nite, positive 1
of the covariance function. The key to obtaining such a result is suggested by
(13.12), which says that for a second-order self similar process,
C (m) (n) = 2 [jn + 1j2H , 2jnj2H + jn , 1j2H ]:
m2H ,2
2
You are asked to show in Problems 9 and 10 that (13.18) holds if and only if
2
C (m) (n) = 1
2H
2H
2H
(13.19)
lim
2
H
,
2
m!1 m
2 [jn + 1j , 2jnj + jn , 1j ]:
July 18, 2002
376
Chap. 13 Advanced Topics
A wide-sense stationary process that satises (13.19) is said to be asymptotically second-order self similar. In the literature, (13.19) is usually written in
terms of the correlation function (m) (n); see Problem 11.
If a wide-sense stationary process has a covariance function C(n), how can
we check if (13.18) or (13.19) holds? Below we answer this question in the
frequency domain with a sucient condition on the power spectral density.
In Section 13.4, we answer this question in the time domain with a sucient
condition on C(n) known as long-range dependence.
Let us look into the frequency domain. Suppose that C(n) has a power
spectral density S(f) so thatz
C(n) =
Z 1=2
,1=2
S(f)ej 2fn df:
The following result is proved later in this section.
Theorem. If
lim S(f) = s
f !0 jf j,1
(13.20)
for some nite, positive s, and if for every 0 < < 1=2, S(f) is bounded
on [; 1=2], then the process is asymptotically second-order self similar with
H = 1 , =2 and
2 = s 4 cos(=2),() :
(13.21)
1
(2) (1 , )(2 , )
Notice that 0 < < 1 implies H = 1 , =2 2 (1=2; 1).
Below we give a specic power spectral density that satises the above
conditions.
Example 13.2 (Hosking [23, Theorem 1(c)]). Fix 0 < d < 1=2, and letx
S(f) = j1 , e,j 2f j,2d :
Since
jf
,jf
1 , e,j 2f = 2je,jf e ,2je ;
we can write
S(f) = [4 sin2 (f)],d :
z The power spectral density of a discrete-time process is periodic with period 1, and is
real, even, and nonnegative. It is integrable since
Z 1=2
,1=2
S (f ) df = C (0) < 1:
x As shown in Section 13.6, a process with this power spectral density is an ARIMA(0; d; 0)
process. The covariance function that corresponds to S (f ) is derived in Problem 12.
July 18, 2002
13.3 Asymptotic Second-Order Self Similarity
Since
if we put = 1 , 2d, then
377
[4 sin2 (f)],d = 1;
lim
f !0 [4(f)2 ],d
S(f) = (2),2d :
lim
f !0 jf j,1
Notice that to keep 0 < < 1, we needed 0 < d < 1=2.
The power spectral density S(f) in the above example factors into
S(f) = [1 , e,j 2f ],d [1 , ej 2f ],d :
More generally, let S(f) be any power spectral density satisfying (13.20), boundedness away from the origin, and having a factorization of the form S(f) =
G(f)G(f) . Then pass any wide-sense stationary, uncorrelated sequence though
the discrete-time lter G(f). By the discrete-time analog of (6.6), the output
power spectral density is proportional to S(f), and therefore asymptotically
second-order self similar.
Example 13.3 (Hosking [23, Theorem 1(a)]). Find the impulse response
of the lter G(f) = [1 , e,j 2f ],d .
Solution. Observe that G(f) can be obtained by evaluating the z transform (1 , z ,1 ),d on the unit circle, z = ej 2f . Hence, the desired impulse
response can be found by inspection of the series for (1 , z ,1 ),d . To this end,
it is easy to show that the Taylor series for (1 + z)d is{
(1 + z)d = 1 +
Hence,
1
X
d(d , 1) (d , [n , 1]) z n:
n!
n=1
1
X
(,d)(,d , 1) (,d , [n , 1]) (,z ,1 )n
n!
n=1
1
X
= 1 + d(d + 1) n!(d + [n , 1]) z ,n:
n=1
(1 , z ,1 ),d = 1 +
Note that the impulse response is causal.
{ Notice that if d 0 is an integer, the product
d(d , 1) (d , [n , 1])
contains zero as a factor for n d + 1; in this case, the sum contains only d + 1 terms and
converges for all complex z. In fact, the formula reduces to the binomial theorem.
July 18, 2002
378
Chap. 13 Advanced Topics
Once we have a process whose power spectral density satises (13.20) and
boundedness away from the origin, itP
remains so after further ltering by stable
linear time-invariant systems. For if n jhnj < 1, then
H(f) =
1
X
n=,1
hn e,j 2fn
is an absolutely convergent series and therefore continuous. If S(f) satises
(13.20), then
jH(f)j2 S(f) = jH(0)j2s:
lim
f !0 jf j,1
A wide class of stable lters is provided by autoregressive moving average
(ARMA) systems discussed in Section 13.5.
Proof of Theorem. We now establish that (13.20) and boundedness of
S(f) away from the origin imply asymptotic second-order self similarity. Since
(13.19) and (13.18) are equivalent, it suces to establish (13.18). As noted in
2 . From (13.8),
the Hint in Problem 10, (13.18) is equivalent to E[Yn2]=n2H ! 1
E[Y 2 ]
n
=
=
=
n X
n
X
C(i , k)
i=1 k=1
n X
n Z 1=2
X
i=1 k=1 ,1=2
Z 1=2
,1=2
S(f)ej 2f (i,k) df
n
X
S(f)
k=1
2
e,j 2fk df:
Using the nite geometric series,
n
X
k=1
,j 2fn
e,j 2fk = e,j 2f 11,,ee,j 2f = e,jf (n+1) sin(nf)
sin(f) :
Thus,
2
sin(nf)
S(f) sin(f) df
n =
,1=2
2 2
Z 1=2
sin(nf)
f
2
= n
S(f) nf
sin(f) df:
,1=2
E[Y 2 ]
Z 1=2
We now show that
4 cos(=2),()
E[Yn2]
lim
n!1 n2, = s (2) (1 , )(2 , ) :
July 18, 2002
13.3 Asymptotic Second-Order Self Similarity
The rst step is to put
Kn() :=
and show that
n
Z 1=2
sin(nf)
1
,
j
f
j
nf
,1=2
1
379
2 E[Yn2 ] , s K () ! 0:
n
2,
f 2 df;
sin(f)
(13.22)
n
The second step, which is left to Problem 13, is to show that Kn () ! K(),
where
Z 1
1 sin 2d:
K() := ,
,1 jj1, The third step, which is left to Problem 14, is to show that
4 cos(=2),() :
K() = (2)
(1 , )(2 , )
To begin, observe that
E[Yn2] , s K ()
n
n2,
is equal to
2
2
Z 1=2 s
sin(nf)
f
n
S(f) , jf j1,
(13.23)
nf
sin(f) df:
,1=2
Let " > 0 be given, and let 0 < < 1=2 be so small that
S(f)
,1 , s < ";
jf j
or
s < " :
S(f) ,
jf j1, jf j1,
In analyzing (13.23), it is rst convenient to focus on the range of integration
[0; ]. Then the absolute value of
2 2
Z s
sin(nf)
f
n
S(f) , jf j1,
nf
sin(f) df
0
is upper bounded by
2 Z 1 sin(nf) 2
df;
"n 2
nf
0 f 1,
where we have used the fact that for jf j 1=2,
2:
1 sin(f)
f
July 18, 2002
380
Chap. 13 Advanced Topics
Now make the change-of-variable = nf in the above integral to get the
expression
2 , Z n 1 sin 2
"n 2 n
1, d;
0
or
2
Z n
2
2 K()
1
sin
,
" 2 d
"
1, 2
2 :
0
We return to (13.23) and focus now on the range of integration [; 1=2]. The
absolute value of
2 Z 1=2 f 2df
n
S(f) , jf js1, sin(nf)
nf
sin(f)
is upper bounded by
n
where
2
Z 1=2 sin(nf) 2
2 B B := sup
nf
jf j2[;1=2]
S(f)
df;
(13.24)
, jf js1, :
In (13.24) make the change-of-variable = nf to get
Z 1=2
sin(nf) 2
nf
df =
n=2 sin 2 d
Z
Z
n
n
1 sin 2 d
0
n :
Thus, (13.24) is upper bounded by
n,1
Z
1 sin 2
4 B 0
d:
Since this integral is nite, and since < 1, this bound goes to zero as n ! 1.
Thus, (13.22) is established.
13.4. Long-Range Dependence
Loosely speaking, a wide-sense stationary process is said to be long-range
dependent (LRD) if its covariance function C(n) decays slowly as n ! 1.
The precise denition of slow decay is the requirement that for some 0 < < 1,
(13.25)
lim C(n) = c;
n!1 n,
for some nite, positive constant c. In other words, for large n, C(n) looks like
c=n.
July 18, 2002
13.4 Long-Range Dependence
381
In this section, we prove two important results. The rst result is that a
second-order self-similar process is long-range dependent. The second result is
that long-range dependence implies asymptotic second-order self similarity.
To prove that second-order self similarity implies long-range dependence,
we proceed as follows. Write (13.6) for n 1 as
2
2
C(n) = 2 n2H [(1 + 1=n)2H , 2 + (1 , 1=n)2H ] = 2 n2H q(1=n);
where
q(t) := (1 + t)2H , 2 + (1 , t)2H :
For n large, 1=n is small. This suggests that we examine the Taylor expansion
of q(t) for t near zero. Since q(0) = q0 (0) = 0, we expand to second order to get
00
q(t) q 2(0) t2 = 2H(2H , 1)t2 :
So, for large n,
2
C(n) = 2 n2H q(1=n) 2H(2H , 1)n2H ,2:
(13.26)
It appears that = 2 , 2H and c = 2 H(2H , 1). Note that 0 < < 1
corresponds to 1=2 < H < 1. Also, H > 1=2 corresponds to 2 H(2H , 1) > 0.
To prove that these values of and c work, write
C(n) = 2 lim q(1=n) = 2 lim q(t) ;
lim
n!1 n2H ,2
2 n!1 n,2
2 t#0 t2
and apply l'H^opital's rule twice to obtain
(13.27)
lim C(n) = 2 H(2H , 1):
n!1 n2H ,2
This formula implies the following two facts. First, if H > 1, then C(n) ! 1
as n ! 1. This contradicts the fact that covariance functions are bounded
(recall that jC(n)j C(0) by the Cauchy{Schwarz inequality; cf. Section 6.2).
Thus, a second-order self-similar process cannot have H > 1. Second, if H = 1,
then C(n) ! 2 . In other words, the covariance does not decay to zero as n
increases. Since this situation does not arise in applications, we do not consider
the case H = 1.
We now show that long-range dependence (13.25) impliesk asymptotic second2 = 2c=[(1 , )(2 , )]. From
order self similarity with H = 1 , =2 and 1
k Actually, the weaker condition,
lim `(Cn)(nn,) = c;
where ` is a slowly varying function, is enough to imply asymptotic second-order self
similarity [48]. The derivation we present results from taking the proof in [48, Appendix A]
and setting `(n) 1 so that no theory of slowly varying functions is required.
n!1
July 18, 2002
382
Chap. 13 Advanced Topics
(13.9),
nX
,1
C()
nX
,1
C()
E[Yn2 ] = C(0) + 2 =1
=1
2,
1,
1, , 2
2, :
n
n
n
We claim that if (13.25) holds, then
nX
,1
=1
n
C()
= 1 ,c ;
lim n1,
n!1
and
nX
,1
nlim
!1
=1
C()
n2,
(13.28)
= 2 ,c :
(13.29)
Since n1, ! 1, it follows that
2c
E[Yn2 ] = 2c , 2c =
lim
2
,
n!1 n
1, 2,
(1 , )(2 , ) :
Since n1, ! 1, to prove (13.28), it is enough to show that for some k,
nX
,1
C()
lim n1, = 1 ,c :
Fix any 0 < " < c. By (13.25), there is a k such that for all k,
C()
, , c < ":
Then
nX
,1
nX
,1
nX
,1
(c , ") , C() (c + ") ,:
n!1
=k
=k
Hence, we only need to prove that
=k
nX
,1
=k
,
1 :
=k
lim
=
1
,
n!1 n
1,
This is done in Problem 17 by exploiting the inequality
nX
,1
=k
Note that
Z
( + 1),
Z
k
n
t, dt n
nX
,1
=k
,:
1,
1,
! 1 as n ! 1:
t, dt = n 1 ,, k
k
A similar approach is used in Problem 18 to derive (13.29).
In :=
July 18, 2002
(13.30)
13.5 ARMA Processes
383
13.5. ARMA Processes
We say that Xn is an autoregressive moving average (ARMA) process
if Xn satises the equation
Xn + a1 Xn,1 + + ap Xn,p = Zn + b1 Zn,1 + + bq Zn,q ; (13.31)
where Zn is an uncorrelated sequence of zero-mean random variables with common variance 2 = E[Zn]. In this case, we say that Xn is ARMA(p; q). If
a1 = = ap = 0, then
Xn = Zn + b1Zn,1 + + bq Zn,q ;
and we say that Xn is a moving average process, denoted by MA(q). If
instead b1 = = bq = 0, then
Xn = ,(a1 Xn,1 + + ap Xn,p );
and we say that Xn is an autoregressive process, denoted by AR(p).
To gain some insight into (13.31), rewrite it using convolution sums as
1
X
where
and
k=,1
ak Xn,k =
1
X
k=,1
bk Zn,k ;
(13.32)
a0 := 1; ak := 0; k < 0 and k > p;
b0 := 1; bk := 0; k < 0 and k > q:
Taking z transforms of (13.32) yields
A(z)X(z) = B(z)Z(z);
or
X(z) = B(z)
A(z) Z(z);
where
A(z) := 1 + a1 z ,1 + + ap z ,p and B(z) := 1 + b1 z ,1 + + bq z ,q :
This suggests that if hn has z transform H(z) := B(z)=A(z), and if
X
X
Xn :=
hk Zn,k =
hn,k Zk ;
(13.33)
k
k
then (13.32) holds. This is indeed the case, as can be seen by writing
X
X X
aiXn,i =
ai hn,i,kZk
i
i
=
=
k
XX
k
X
k
i
bn,k Zk ;
July 18, 2002
aihn,k,i Zk
384
Chap. 13 Advanced Topics
since A(z)H(z) = B(z).
The \catch" in the preceding argument is to make sure that the innite
sums inP
(13.33) are well dened. If hn is causal (hn = 0 for n < 0), and if hn is
stable ( n jhnj < 1), then (13.33) holds in L2 , L1 , and almost surely (recall
Example 10.7 and Problems 17 and 32 in Chapter 11). Hence, it remains to
prove the key result of this section, that if A(z) has all roots strictly inside the
unit circle, then hn is causal and stable.
To begin the proof, observe that since A(z) has all its roots inside the unit
circle, the polynomial (z) := A(1=z) has all its roots strictly outside the unit
circle. Hence, for small enough > 0, 1=(z) has the power series expansion
1
1 = X
n
(z) n=0 nz ; jz j < 1 + ;
for unique coecients n. In particular, this series converges for z = 1 + =2.
Since the terms of a convergent series go to zero, we must have n(1+=2)n !
0. Since a convergent sequence is bounded, there is some nite M for which
jn(1 + =2)nj M, orP jnj M(1 + =2),n, which is summable by the
geometric series. Thus, n jnj < 1. Now write
B(z)
1
H(z) = B(z)
A(z) = (1=z) = B(z) (1=z) ;
or
H(z) = B(z)
1
X
n=0
nz ,n =
where hn is given by the convolution
hn =
1
X
k=,1
1
X
n=,1
hn z ,n ;
k bn,k:
Since n and bn are causal, so is their convolution hn . Furthermore, for n 0,
hn =
n
X
k=max(0;n,q)
k bn,k :
(13.34)
P
In Problem 19, you are asked to show that n jhnj < 1. In Problem 20, you
are asked to show that (13.33) is the unique solution of (13.32).
13.6. ARIMA Processes
Before dening ARIMA processes, we introduce the dierencing lter,
whose z transform is 1 , z ,1. If the input to this lter is Xn , then the output
is Xn , Xn,1 .
July 18, 2002
13.6 ARIMA Processes
385
A process Xn is said to be an autoregressive integrated moving average (ARIMA) process if instead of A(z)X(z) = B(z)Z(z), we have
A(z)(1 , z ,1 )d X(z) = B(z)Z(z);
(13.35)
where A(z) and B(z) are dened as in the previous section. In this case, we
e
say that Xn is an ARIMA(p; d; q) process. If we let A(z)
= A(z)(1 , z ,1 )d ,
it would seem that ARIMA(p; d; q) is just a fancy name for ARMA(p + d; q).
While this is true when d is a nonnegative integer, there are two problems.
First, recall that the results of the previous section assume A(z) has all roots
e
strictly inside the unit circle, while A(z)
has a root at z = 1 repeated d times.
The second problem is that we will be focusing on fractional values of d, in
e
which case A(1=z)
is no longer a polynomial, but an innite power series in z.
Let us rewrite (13.35) as
X(z) = (1 , z ,1 ),d B(z)
A(z) Z(z) = H(z)Gd (z)Z(z);
where H(z) := B(z)=A(z) as in the previous section, and
Gd (z) := (1 , z ,1 ),d :
From the calculations following Example 13.2,
Gd (z) =
where g0 = 1, and for n 1,
The plan then is to set
n=0
gnz ,n ;
gn = d(d + 1) n!(d + [n , 1]) :
Yn :=
and thenyy
1
X
Xn :=
1
X
k=0
1
X
k=0
gk Zn,k
hk Yn,k :
Note that the power spectral density of Y iszz
SY (f) = jGd(ej 2f )j22 = j1 , e,j 2f j,2d2 = [4 sin2 (f)],d 2 ;
Since 1 , z,1 is a dierencing lter, (1 , z,1 ),1 is a summing or integrating lter. For
noninteger values of d, (1 , z,1 ),d is called a fractional integrating lter. The corresponding
process is sometimes called a fractional ARIMA process (FARIMA).
yy
Recall that hn is given by (13.34).
zz We are appealing to the discrete-time version of (6.6).
July 18, 2002
386
Chap. 13 Advanced Topics
using the result of Example 13.2. If p = q = 0, then A(z) = B(z) = H(z) = 1,
Xn = Yn , and we see that the process of Example 13.2 is ARIMA(0; d; 0).
Now, the problem with the above plan is that we have to make sure that
Yn is well dened. To analyze the situation, we need to know how fast the gn
decay. To this end, observe that
,(d + n) = (d + [n , 1]),(d + [n , 1])
..
.
= (d + [n , 1]) (d + 1),(d + 1):
Hence,
+ n) :
gn = ,(dd+ ,(d
1),(n + 1)
Now apply Stirling's formula,
,(x) p
2xx,1=2e,x ;
to the gamma functions that involve n. This yields
de1,d 1 + d , 1 n+1=2(n + d)d,1 :
gn ,(d
+ 1)
n+1
Since
1 n+1=2 = 1 + d , 1 n+1 1 + d , 1 ,1=2 ! ed,1 ;
1 + nd ,
+1
n+1
n+1
and since (n + d)d,1 nd,1 , we see that
gn ,(dd+ 1) nd,1;
as in Hosking [23, Theorem 1(a)]. For 0 < d < 1=2, ,1 < d , 1 < ,1=2, and we
see that the gn are not absolutely summable. However, since ,2 < 2d , 2 < ,1,
they are square summable. Hence, Yn is well dened as a limit in mean square
by Problem 26 in Chapter 11. The sum dening Xn is well dened in L2 , L1,
and almost surely by Example 10.7 and Problems 17 and 32 in Chapter 11.
Since Xn is the result of ltering the long-range dependent process Yn with the
stable impulse response hn, Xn is still long-range dependent as pointed out in
Section 13.4.
We derived Stirling's formula for ,(n) = (n , 1)! in Example 4.14. A proof for noninteger
x can be found in [6, pp. 300{301].
July 18, 2002
13.7 Problems
387
13.7. Problems
Problems x13.1: Self Similarity in Continuous Time
1. Show that for a self-similar process, Wt converges in distribution to the
zero random variable as t ! 0. Next, identify limt!1 tH W1 (!) as a
function of !, and nd the probability mass function of the limit in terms
of FW1 (x).
2. Use joint characteristic functions to show that the Wiener process is self
similar with Hurst parameter H = 1=2.
3. Use joint characteristic functions to show that the Wiener process has
stationary increments.
4. Show that for H = 1=2,
BH (t) , BH (s) =
Z
s
t
dW = Wt , Ws :
Taking t > s = 0 shows that BH (t) = Wt , while taking s < t = 0 shows
that BH (s) = Ws . Thus, B1=2 (t) = Wt for all t.
5. Show that for 0 < H < 1,
Z
0
1
(1 + )H ,1=2 , H ,1=2 2 d < 1:
Problems x13.2: Self Similarity in Discrete Time
6. Let Xn be a second-order self-similar process with mean = E[Xn], variance 2 = E[(Xn , )2 ], and Hurst parameter H. Then the sample mean
n
X
Mn := n1 Xi
i=1
has expectation and, by (13.7), variance 2 =n2,2H . If Xn is a Gaussian
sequence,
Mn , =n1,H N(0; 1);
and so given a condence level 1 , , we can choose y (e.g., by Table 12.1)
such that
} Mn 1,,H y = 1 , :
=n
For 1=2 < H < 1, show that the width of the corresponding condence interval is wider by a factor of nH ,1=2 than the condence interval obtained
if the Xn had been independent as in Section 12.2.
July 18, 2002
388
Chap. 13 Advanced Topics
7. Use (13.10) and induction on n to show that (13.6) implies E[Yn2 ] = 2 n2H .
8. Suppose that Xk is wide-sense stationary.
(a) Show that the process Xe(m) dened in (13.13) is also wide-sense
stationary.
(b) If Xk is second-order self similar, prove (13.12) for the case n = 0.
Problems x13.3: Asymptotic Second-Order Self Similarity
9. Show that asymptotic second-order self similarity (13.19) implies (13.18).
Hint: Observe that C (n)(0) = E[(X1(n) , )2 ].
10. Show that (13.18) implies asymptotic second-order self similarity (13.19).
Hint: Use (13.14), and note that (13.18) is equivalent to E[Yn2 ]=n2H !
2.
1
11. Show that a process is asymptotically second-order self similar; i.e., (13.19)
holds, if and only if the conditions
1
2H
2H
2H
(m)
mlim
!1 (n) = 2 [jn + 1j , 2jnj + jn , 1j ];
and
C (m) (0) = 2
lim
1
m!1 m2H ,2
both hold.
12. Show that the covariance function corresponding to the power spectral
density of Example 13.2 is
n ,(1 , 2d)
C(n) = ,(n +(,11), d),(1
, d , n) :
This result is due to Hosking [23, Theorem 1(d)]. Hints: First show that
Z C(n) = 1 [4 sin2(=2)],d cos(n) d:
0
Second, use the change-of-variable = 2 , to show that
1 Z 2 [4 sin2 (=2)],d cos(n) d = C(n):
Third, use the formula [16, p. 372]
Z + 1)21,p :
sinp,1 (t) cos(at) dt = cos(a=2),(p
0
p, p + a2 + 1 , p , a2 + 1
July 18, 2002
13.7 Problems
389
13. Prove that Kn () ! K() as follows. Put
n
Jn () :=
Z 1=2
sin(nf) 2 df:
,1=2 jf j1, nf
First show that
1
1
1
f 2
sin(f)
1
2
sin
d:
Jn () !
,1 jj1, Then show that Kn () , Jn() ! 0; take care to write this dierence
as an integral, and to break up the range of integration into [0; ] and
[; 1=2], where 0 < < 1=2 is chosen small enough that when jf j < ,
,
Z
,
< ":
14. Show that for 0 < < 1,
1
Z
0
1,
,3 sin2 d = 2 (1 cos(=2),()
, )(2 , ) :
Remark. The formula actually holds for complex with 0 < Re < 2
[16, p. 447].
Hints: (i) Fix 0 < " < r < 1, and apply integration by parts to
Z
r
"
,3 sin2 d
with u = sin2 and dv = ,3 d.
(ii) Apply integration by parts to the integral
Z
t,2 sin t dt
with u = sin t and dv = t,2 dt.
(iii) Use the fact that for 0 < < 1,y
Z
r
t,1e,jt dt
"
"!0
y For s > 0, a change of variable shows that
rlim
!1
= e,j=2,():
Z
r
lim
t,1 e,st dt = ,(s) :
r!1
"!0 "
As in Notes 5 and 6 in Chapter 3, a permanence of form argument allows us to set s = j =
ej=2 .
July 18, 2002
390
Chap. 13 Advanced Topics
15. Let S(f) be given by (13.15). For 1=2 < H < 1, put = 2 , 2H.
(a) Evaluate the limit in (13.20). Hint: You may use the fact that
1
X
1
2H +1
i=1 ji + f j
converges uniformly for jf j 1=2 and is therefore a continuous and
bounded function.
(b) Evaluate
Z 1=2
S(f) df:
,1=2
Hint: The above integral is equal to C(0) = 2 . Since (13.15) corresponds to a second-order self-similar process, not just an asymptoti2 . Now apply (13.21).
cally second-order self-similar process, 2 = 1
Problems x13.4: Long-Range Dependence
16. Show directly that if a wide-sense stationary sequence has the covari-
ance function C(n) given in Problem 12, then the process is long-range
dependent; i.e., (13.25) holds with appropriate values of and c [23, Theorem 1(d)]. Hints: Use the Remark following Problem 11 in Chapter 3,
Stirling's formula,
p
,(x) 2xx,1=2e,x ;
and the formula (1 + d=n)n ! ed .
17. For 0 < < 1, show that
nX
,1
,
lim =nk1, = 1 ,1 :
Hints: Rewrite (13.30) in the form
Bn + n, , k, In Bn :
Then
, ,
1 BI n 1 + k I, n :
n!1
n
n
Show that In =n1, ! 1=(1 , ), and note that this implies In =n, ! 1.
18. For 0 < < 1, show that
nX
,1
nlim
!1
=k
1,
n2,
July 18, 2002
= 2 ,1 :
13.7 Problems
391
Problems x13.5: ARMA Processes
P
19. Use the bound jnj M(1 + =2),n to show that n jhnj < 1, where
hn =
n
X
k=max(0;n,q)
k bn,k :
20. Assume (13.32) holds and that A(z) has all roots strictly inside the unit
circle. Show that (13.33) must hold. Hint: Compute the convolution
X
n
m,n Yn
rst for Yn replaced by the left-hand side of (13.32) and again for Yn
replaced by the right-hand side of (13.32).
P
P1
21. Let 1
k=0 jhk j < 1. Show that if Xn is WSS, then Yn = k=0 hk Xn,k
and Xn are J-WSS.
July 18, 2002
392
Chap. 13 Advanced Topics
July 18, 2002
Bibliography
[1] M. Abramowitz and I. A. Stegun, eds. Handbook of Mathematical Functions, with
Formulas, Graphs, and Mathematical Tables. New York: Dover, 1970.
[2] J. Beran, Statistics for Long-Memory Processes. New York: Chapman & Hall, 1994.
[3] P. Billingsley, Probability and Measure. New York: Wiley, 1979.
[4] P. Billingsley, Probability and Measure, 3rd ed. New York: Wiley, 1995.
[5] P. J. Brockwell and R. A. Davis, Time Series: Theory and Methods. New York: SpringerVerlag, 1987.
[6] R. C. Buck, Advanced Calculus, 3rd ed. New York: McGraw-Hill, 1978.
[7] H. Cherno, \A measure of asymptotic eciency for tests of a hypothesis based on the
sum of observations," Ann. Math. Statist., vol. 23, pp. 493{507, 1952.
[8] Y. S. Chow and H. Teicher, Probability Theory: Independence, Interchangeability, Martingales, 2nd ed. New York: Springer, 1988.
[9] R. V. Churchill, J. W. Brown, and R. F. Verhey, Complex Variables and Applications,
3rd ed. New York: McGraw-Hill, 1976.
[10] J. W. Craig, \A new, simple and exact result for calculating the probability of error for
two-dimensional signal constellations," in Proc. IEEE Milit. Commun. Conf. MILCOM
'91, McLean, VA, Oct. 1991, pp. 571{575.
[11] H. Cramer, \Sur un nouveaux theoreme-limite de la theorie des probabilites," Actualites Scientiques et Industrielles 736, pp. 5{23. Colloque consacre a la theorie des
probabilites, vol. 3, Hermann, Paris, Oct. 1937.
[12] M. Crovella and A. Bestavros, \Self-similarity in World Wide Web trac: Evidence and
possible causes," Perf. Eval. Rev., vol. 24, pp. 160{169, 1996.
[13] M. H. A. Davis, Linear Estimation and Stochastic Control. London: Chapman and Hall,
1977.
[14] W. Feller, An Introduction to Probability Theory and Its Applications, Vol. 1, 3rd ed.
New York: Wiley, 1968.
[15] L. Gonick and W. Smith, The Cartoon Guide to Statistics. New York: HarperPerennial,
1993.
[16] I. S. Gradshteyn and I. M. Ryzhik, Table of Integrals, Series, and Products. Orlando,
FL: Academic Press, 1980.
[17] G. R. Grimmettand D. R. Stirzaker, Probability and Random Processes, 3rd ed. Oxford:
Oxford University Press, 2001.
[18] S. Haykin and B. Van Veen, Signals and Systems. New York: Wiley, 1999.
[19] C. W. Helstrom, Statistical Theory of Signal Detection, 2nd ed. Oxford: Pergamon,
1968.
[20] C. W. Helstrom, Probability and Stochastic Processes for Engineers, 2nd ed. New York:
Macmillan, 1991.
[21] P. G. Hoel, S. C. Port, and C. J. Stone, Introduction to Probability Theory. Boston:
Houghton Miin, 1971.
[22] K. Homan and R. Kunze, Linear Algebra, 2nd ed. Englewood Clis, NJ: Prentice-Hall,
1971.
393
394
Bibliography
[23] J. R. M. Hosking, \Fractional dierencing," Biometrika, vol. 68, no. 1, pp. 165{176,
1981.
[24] S. Karlin and H. M. Taylor, A First Course in Stochastic Processes, 2nd ed. New York:
Academic Press, 1975.
[25] S. Karlin and H. M. Taylor, A Second Course in Stochastic Processes. New York: Academic Press, 1981.
[26] W. E. Leland, M. S. Taqqu, W. Willinger, and D. V. Wilson, \On the self-similar nature
of Ethernet trac (extended version)," IEEE/ACM Trans. Networking, vol. 2, no. 1,
pp. 1{15, Feb. 1994.
[27] A. Leon-Garcia, Probability and Random Processes for Electrical Engineering, 2nd ed.
Reading, MA: Addison-Wesley, 1994.
[28] N. Likhanov, \Bounds on the buer occupancy probability with self-similar trac,"
pp. 193{213, in Self-Similar Network Trac and Performance Evaluation, K. Park and
W. Willinger, Eds. New York: Wiley, 2000.
[29] A. O'Hagen, Probability: Methods and Measurement. London: Chapman and Hall,
1988.
[30] A. V. Oppenheim and A. S. Willsky, with S. H. Nawab, Signals & Systems, 2nd ed.
Upper Saddle River, NJ: Prentice Hall, 1997.
[31] A. Papoulis, Probability, Random Variables, and Stochastic Processes, 2nd ed. New
York: McGraw-Hill, 1984.
[32] V. Paxson and S. Floyd, \Wide-area trac: The failure of Poisson modeling,"
IEEE/ACM Trans. Networking, vol. 3, pp. 226-244, 1995.
[33] B. Picinbono, Random Signals and Systems. Englewood Clis, NJ: Prentice Hall, 1993.
[34] S. Ross, A First Course in Probability, 3rd ed. New York: Macmillan, 1988.
[35] R. I. Rothenberg, Probability and Statistics. San Diego: Harcourt Brace Jovanovich,
1991.
[36] G. Roussas, A First Course in Mathematical Statistics. Reading, MA: Addison-Wesley,
1973.
[37] H. L. Royden, Real Analysis, 2nd ed. New York: MacMillan, 1968.
[38] W. Rudin, Principles of Mathematical Analysis, 3rd ed. New York: McGraw-Hill, 1976.
[39] G. Samorodnitskyand M. S. Taqqu, Stable non-GaussianRandom Processes: Stochastic
Models with Innite Variance. New York: Chapman & Hall, 1994.
[40] A. N. Shiryayev, Probability. New York: Springer, 1984.
[41] M. K. Simon, \A new twist on the Marcum Q function and its application," IEEE
Commun. Lett., vol. 2, no. 2, pp. 39{41, Feb. 1998.
[42] M. K. Simon and D. Divsalar, \Some new twists to problems involving the Gaussian
probability integral," IEEE Trans. Commun., vol. 46, no. 2, pp. 200{210, Feb. 1998.
[43] M. K. Simon and M.-S. Alouini, \A unied approach to the performance analysis of
digital communication over generalized fading channels," Proc. IEEE, vol. 86, no. 9,
pp. 1860{1877, Sep. 1998.
[44] Ya. G. Sinai, \Self-similar probability distributions," Theory of Probability and its
Applications, vol. XXI, no. 1, pp. 64{80, 1976.
[45] J. L. Snell, Introduction to Probability. New York: Random House, 1988.
[46] E. W. Stacy, \A generalization of the gamma distribution," Ann. Math. Stat., vol. 33,
no. 3, pp. 1187{1192, Sept. 1962.
[47] H. Stark and J. W. Woods, Probability and Random Processes with Applications to
July 18, 2002
Bibliography
395
Signal Processing, 3rd ed. Upper Saddle River, NJ: Prentice Hall, 2002.
[48] B. Tsybakov and N. D. Georganas, \Self-similar processes in communication networks,"
IEEE Trans. Inform. Theory, vol. 44, no. 5, pp. 1713{1725, Sept. 1998.
[49] Y. Viniotis, Probability and Random Processes for Electrical Engineers. Boston:
WCB/McGraw-Hill, 1998.
[50] E. Wong and B. Hajek, Stochastic Processes in Engineering Systems. New York:
Springer, 1985.
[51] R. D. Yates and D. J. Goodman, Probability and Stochastic Processes: A Friendly
Introduction for Electrical and Computer Engineers. New York: Wiley, 1999.
[52] R. E. Ziemer, Elements of Engineering Probability and Statistics. Upper Saddle River,
NJ: Prentice Hall, 1997.
July 18, 2002
396
July 19, 2002
Index
A
absorbing state, 283
ane estimator, 230
ane function, 119, 230
aggregated process, 372
almost sure convergence, 330
arcsine random variable, 151, 164
ARIMA, 376
fractional, 385
ARMA process, 383
arrival times, 252
associative laws, 3
asymptotic second-order self similarity, 376, 388
auto-correlation function, 195
autoregressive integrated moving average
see ARIMA, 385
autoregressive moving average process
see ARMA, 383
autoregressive process | see AR process, 383
C
cardinality of a set, 6
Cartesian product, 155
Cauchy random variable, 87
cdf, 118
characteristic function, 113
nonexistence of mean, 93
special case of student's t, 109
Cauchy sequence
of Lp random variables, 304
of real numbers, 304
Cauchy{Schwarz inequality, 189, 215, 304
causal Wiener lter, 203
cdf | cumulative distribution function, 117
central chi-squared random variable, 114
central limit theorem, 1, 117, 137, 262, 328
compared with weak law of large numbers, 328
central moment, 49
certain event, 12
change of variable (multivariate), 229, 235
Chapman{Kolmogorov equation
B
continuous time, 290
discrete time, 284
Banach space, 304
discrete-time derivation, 287
Bayes' rule, 20, 21
for Markov processes, 297
Bernoulli random variable, 38
characteristic function
mean, 45
bivariate, 163
probability generating function, 52
multivariate (joint), 225
second moment and variance, 49
univariate, 97
Bessel function, 146
Chebyshev's inequality, 60, 100, 115
properties, 147
used to derive the weak law, 61
beta function, 108, 109, 110
Cherno bound, 100, 101, 115
beta random variable, 108
chi-squared random variable, 107
relation to chi-squared random variable, 182
as squared zero-mean Gaussian, 112, 122, 143
betting on fair games, 78
characteristic function, 112
binomial | approximation by Poisson, 57, 340
moment generating function, 112
binomial coecient, 54
related to generalized gamma, 145
binomial random variable, 53
relation to F random variable, 182
mean, variance, and pgf, 76
relation to beta random variable, 182
binomial theorem, 54, 377
see also noncentral chi-squared, 112
birth process | see Markov chain, 283
square root of = Rayleigh, 144
birth{death process | see Markov chain, 283
circularly symmetric complex Gaussian, 238
birthday problem, 9
closed set, 310
bivariate characteristic function, 163
CLT | central limit theorem, 137
bivariate Gaussian random variables, 168
combination, 56
Borel{Cantelli lemma, 28, 332
commutative laws, 3
Borel set, 72
complement of a set, 2
Borel sets of IR2 , 177
complementary cdf, 145
Borel -eld, 72
complementary error function, 123
Brownian motion
fractional | see fractionalBrownian motion, 368 completeness
of the Lp spaces, 304
ordinary | see Wiener process, 257
397
398
Index
of the real numbers, 304
complex conjugate, 237
complex Gaussian random vector, 238
complex random variable, 236
complex random vector, 237
conditional cdf, 122
conditional density, 162
conditional expectation
abstract denition, 312
for discrete random variables, 69
for jointly continuous random variables, 164
conditional independence, 31, 279
conditional probability, 19
conditional probability mass functions, 62
condence interval, 347
condence level, 347
conservative Markov chain, 292
consistency condition
continuous-time processes, 269
discrete-time processes, 267
consistent estimator, 346
continuous random variable, 85
arcsine, 151, 164
beta, 108
Cauchy, 87
chi-squared, 107
Erlang, 107
exponential, 86
F , 182
gamma, 106
Gaussian = normal, 88
generalized gamma, 145
Laplace, 87
Maxwell, 143
multivariate Gaussian, 226
Nakagami, 144
noncentral chi-squared, 114
noncentral Rayleigh, 146
Pareto, 149
Rayleigh, 111
Rice, 146
student's t, 109
uniform, 85
Weibull, 105
continuous sample paths, 257
convergence
almost sure (a.s.), 330
in distribution, 325
in mean of order p, 299
in mean square, 299
in probability, 324, 346
in quadratic mean, 299
sure, 330
weak, 325
convex function, 80
convolution, 67
convolution of densities, 99
convolution
and LTI systems, 196
of probability mass functions, 67
correlation coecient, 81, 170
correlation function, 188, 195
engineering denition, 374
statistics/networking denition, 375
countable subadditivity, 15
counting process, 249
covariance, 223
distinction between scalar and matrix, 224
function, 189
matrix, 223
Craig's formula, 180
cross power spectral density, 197
cross-correlation function, 195
cross-covariance function, 195
cumulative distribution function (cdf), 117
continuous random variable, 118
discrete random variable, 126
joint, 157
properties, 134
cyclostationary process, 210
D
delta function, 193
Dirac, 129
Kronecker, 284
DeMoivre{Laplace theorem, 352
DeMorgan's laws, 4
generalized, 5
dierence of sets, 3
dierencing lter, 384
dierential entropy, 110
Dirac delta function, 129
discrete random variable, 36
Bernoulli, 38
binomial, 53
geometric, 41
hypergeometric, 353
negative binomial = Pascal, 80
Poisson, 42
uniform, 37
disjoint sets, 3
distributive laws, 4
generalized, 5
double-sided exponential = Laplace, 87
E
eigenvalue, 286
eigenvector, 286
empty set, 2
ensemble mean, 345
entropy, 79
dierential, 110
equilibrium distribution, 285
July 19, 2002
Index
399
ergodic theorems, 209
Erlang random variable, 107
as sum of i.i.d. exponentials, 113
cumulative distribution function, 107
moment generating function, 111, 112
related to generalized gamma, 145
error function, 123
complementary, 123
estimate of the value of a random variable, 230
estimator of a random variable, 230
event, 6, 22, 337
expectation
linearity for arbitrary random variables, 99
linearity for discrete random variables, 48
monotonicity for arbitrary random variables, 99
monotonicity for discrete random variables, 80
of a discrete random variable, 44
of an arbitrary random variable, 91
when it is undened, 46, 93
exponential random variable, 86
double sided = Laplace, 87
memoryless property, 104
moment generating function, 95
moments, 95
F
random variable, 182
factorial function, 106
factorial moment, 51
failure rate, 124
constant, 125
Erlang, 149
Pareto, 149
Weibull, 149
FARIMA | fractional ARIMA, 385
ltered Poisson process, 256
Fourier series
as characteristic function, 98
as power spectral density, 217
Fourier transform
as bivariate characteristic function, 163
as multivariate characteristic function, 225
as univariate characteristic function, 97
inversion formula, 191, 214
multivariate inversion formula, 229
of correlation function, 191
table, 214
fractional ARIMA process, 385
fractional Brownian motion, 368
fractional Gaussian noise, 370
fractional integrating lter, 385
F
G
gambler's ruin, 284
gamma function, 106
gamma random variable, 106
characteristic function, 97, 112
generalized, 145
moment generating function, 112
moments, 111
with scale parameter, 107
Gaussian pulse, 97
Gaussian random process, 259
fractional, 369
Gaussian random variable, 88
ccdf approximation, 145
cdf, 119
cdf related to error function, 123
characteristic function, 97
complex, 238
complex circularly symmetric, 238
Craig's formula for ccdf, 180
moment generating function, 96
moments, 93
Gaussian random vector, 226
characteristic function, 227
complex circularly symmetric, 238
joint density, 229
multivariate moments, Wick's theorem, 241
generalized gamma random variable, 145
generator matrix, 293
geometric random variable, 41
mean, variance, and pgf, 76
memoryless property, 75
geometric series, 26
H
H -sssi, 367
Herglotz's theorem, 313
Hilbert space, 304
Holder's inequality, 317
Hurst parameter, 365
hypergeometric random variable, 353
derivation, 360
I
ideal gas, 1, 144
identically distributed random variables, 39
impossible event, 12
impulse function, 129
inclusion-exclusion formula, 14, 158
increment process, 367
increments of a random process, 250
independent events
more than two events, 16
pairwise, 17, 22
two events, 16
independent identically distributed (i.i.d.), 39
independent increments, 250
independent random variables, 38
example where uncorrelated does not imply independence, 80
July 19, 2002
400
Index
jointly continuous, 163
more than two, 39
indicator function, 36
inner product, 241, 304
inner-product space, 304
integrating lter, 385
intensity of a Poisson process, 250
interarrival times, 252
intersection of sets, 2
inverse Fourier transform, 214
limit properties of }, 15
linear estimators, 230, 309
linear time-invariant system, 195
long-range dependence, 380
LOTUS | law of the unconscious statistician, 47
LRD | long-range dependence, 380
LTI | linear time-invariant (system), 195
Lyapunov's inequality, 333
derived from Holder's inequality, 317
derived from Jensen's inequality, 80
J
M
J-WSS | jointly wide-sense stationary, 195
Jacobian, 235
Jacobian formulas, 235
Jensen's inequality, 80
joint characteristic function, 225
joint cumulative distribution function, 157
joint density, 158, 172
joint probability mass function, 43
jointly continuous random variables
bivariate, 158
multivariate, 172
jointly wide-sense stationary, 195
jump times
of a Poisson process, 251
MA process, 383
Marcum Q function, 147, 180
marginal cumulative distributions, 157
marginal density, 160
marginal probability, 156
marginal probability mass functions, 44
Markov chain
absorbing barrier, 283
birth{death process (discrete time), 283
conservative, 292
construction, 269
continuous time, 289
discrete time, 279
equilibrium distribution, 285
gambler's ruin, 284
K
generator matrix, 293
Kolmogorov's backward equation, 292
Kolmogorov, 1
Kolmogorov's forward equation, 291
Kolmogorov's backward equation, 292
model for queue with nite buer, 283
Kolmogorov's consistency theorem, 267
model for queue with innite buer, 283, 296
Kolmogorov's extension theorem, 267
n-step transition matrix, 285
Kolmogorov's forward equation, 291
n-step transition probabilities, 284
Kronecker delta, 284
pure birth process (discrete time), 283
random walk (construction), 279
L
random walk (continuous-time), 295
random walk (denition), 282
Laplace random variable, 87
rate matrix, 293
variance and moment generating function, 111
reecting barrier, 283
Laplace transform, 95
sojourn time, 296
law of large numbers
mean square, for 2nd-orderself-similarsequences, state space, 281
state transition diagram, 281
370
stationary distribution, 285
mean square, uncorrelated, 300
mean square, wide-sense stationary sequences, symmetric random walk (construction), 280
time homogeneous (continuous time), 290
301
time-homogeneous (discrete time), 281
strong, 333
transition probabilities (continuous time), 289
weak, for independent random variables, 333
weak, for uncorrelated random variables, 61, 324 transition probabilities (discrete time), 281
transition probability matrix, 282
law of the unconscious statistician, 47
Markov process, 297
law of total conditional probability, 287, 297
Markov's inequality, 60, 100, 115
law of total probability, 19, 21, 164, 175, 181
Matlab commands
discrete conditioned on continuous, 275
besseli, 146
for continuous random variables, 166
chi2cdf, 146
for discrete random variables, 64
chi2inv, 356
for expectation (discrete random variables), 70
July 19, 2002
Index
401
erf, 123
erfc, 123
ernv, 348
factorial, 78
gamma, 106
nchoosek, 53, 78
ncx2cdf, 146
normcdf, 119
norminv, 348
tinv, 354
matrix exponential, 294
matrix inverse formula, 247
Maxwell random variable
as square root of chi-squared, 144
cdf, 143
related to generalized gamma, 145
speed of particle in ideal gas, 144
mean, 45
mean function, 188
mean square convergence, 299
mean squared error, 76, 81, 201, 230
mean time to failure, 123
mean vector, 223
mean-square ergodic theorem
for WSS processes, 209
for WSS sequences, 301
mean-square law of large numbers
for uncorrelated random variables, 300
for wide-sense stationary sequences, 301
for WSS processes, 209
median, 104
memoryless property
exponential random variable, 104
geometric random variable, 75
mgf | moment generating function, 94
minimum mean squared error, 230, 309
Minkowski's inequality, 304, 318
mixed random variable, 127
MMSE | minimum mean squared error, 230
modied Bessel function of the rst kind, 146
properties, 147
moment, 49
moment generating function, 101
moment generating function (mgf), 94
moment
central, 49
factorial, 51
monotonic sequence property, 316
monotonicity
of E, 80, 99
of }, 13
Mother Nature, 12
moving average process | see MA process, 383
MSE | mean squared error, 230
MTTF | mean time to failure, 123
multinomial coecient, 77
multinomial theorem, 77
mutually exclusive sets, 3
mutually independent events, 17
N
Nakagami random variable, 144, 247
as square root of chi-squared, 144
negative binomial random variable, 80
noncentral chi-squared random variable
as squared non-zero-mean Gaussian, 112, 122,
143
cdf (series form), 146
density (closed form using Bessel function), 146
density (series form), 114
moment generating function, 112, 114
noncentrality parameter, 112
square root of = Rice, 146
noncentral Rayleigh random variable, 146
square of = noncentral chi-squared, 146
noncentrality parameter, 112, 114
norm preserving, 314
norm
Lp , 303
matrix, 241
vector, 225
normal random variable | see Gaussian, 88
null set, 2
O
occurrence times, 252
odds, 78
Ornstein{Uhlenbeck process, 274
orthogonal increments, 322
orthogonality principle
general statement, 309
in relation to conditional expectation, 233
in the derivation of linear estimators, 231
in the derivation of the Wiener lter, 202
P
pairwise disjoint sets, 5
pairwise independent events, 17
Paley{Wiener condition, 205
paradox of continuous random variables, 90
parallelogram law, 304, 320
Pareto failure rate, 149
Pareto random variable, 149
Parseval's equation, 206
Pascal random variable = negative binomial, 80
Pascal's triangle, 54
pdf | probability density function, 85
permanence of form argument, 104, 389
pgf | probability generating function, 50
pmf | probability mass function, 42
Poisson approximation of binomial, 57, 340
Poisson process, 250
July 19, 2002
402
arrival times, 252
as a Markov chain, 289
ltered, 256
independent increments, 250
intensity, 250
interarrival times, 252
marked, 255
occurrence times, 252
rate, 250
shot noise, 256
thinned, 271
Poisson random variable, 42
mean, 45
mean, variance, and pgf, 51
second moment and variance, 49
population mean, 345
positive denite, 225
positive semidenite, 225
function, 215
posterior probabilities, 21
power spectral density, 191
for discrete-time processes, 217
nonnegativity, 198, 207
power
dissipated in a resistor, 191
expected instantaneous, 191
expected time-average, 206
probability density function (pdf), 85
generalized, 129
nonimpulsive, 129
purely impulsive, 129
probability generating function (pgf), 50
probability mass function (pmf), 42
probability measure, 12, 266
probability space, 22
probability
written as an expectation, 45
projection, 308
onto the unit ball, 308
theorem, 310, 311
Q
Q function
Gaussian, 145
Marcum, 147
quadratic mean convergence, 299
queue | see Markov chain, 283
R
random process, 185
random sum, 177
random variable
complex-valued, 236
continuous, 85
denition, 33
discrete, 36
Index
integer-valued, 36
precise denition, 72
traditional interpretation, 33
random variables
identically distributed, 39
independent, 38, 39
uncorrelated, 58
random vector, 172
random walk
approximation of the Wiener process, 262
construction, 279
denition, 282
symmetric, 280
with barrier at the origin, 283
rate matrix, 293
rate of a Poisson process, 250
Rayleigh random variable
as square root of chi-squared, 144
cdf, 142
distance from origin, 87, 144
generalized, 144
moments, 111
related to generalized gamma, 145
square of = chi-squared, 144
reecting state, 283
reliability function, 123
renewal equation, 257
derivation, 272
renewal function, 257
renewal process, 256, 343
resistor, 191
Rice random variable, 146, 246
square of = noncentral chi-squared, 146
Riemann sum, 196, 220
Riesz{Fischer theorem, 304, 310, 312
S
sample mean, 57, 345
sample path, 185
sample space, 5, 12
sample standard deviation, 349
sample variance, 349
as unbiased estimator, 350
sampling with replacement, 352
sampling without replacement, 352, 360
scale parameter, 107, 145
second-order self similarity, 370
self similarity, 365
set dierence, 3
shot noise, 256
-algebra, 22
-eld, 22, 72, 177, 270
signal-to-noise ratio, 199
Simon's formula, 183
Skorohod representation theorem, 335
derivation, 335
July 19, 2002
Index
403
SLLN | strong law of large numbers, 333
slowly varying function, 381
Slutsky's theorem, 328, 360
smoothing property, 321
SNR | signal-to-noise ratio, 199
sojourn time, 296
spectral distribution, 313
spectral factorization, 205
spectral representation, 315
spontaneous generation, 283
square root of a nonnegative denite matrix, 240
standard deviation, 49
standard normal density, 88
state space of a Markov chain, 281
state transition diagram | see Markov chain, 281
stationary distribution, 285
stationary increments, 367
stationary random process, 189
Markov chain example, 277
statistical independence, 16
Stirling's formula, 142, 340, 386, 390
derivation, 142
stochastic process, 185
strictly stationary process, 189
Markov chain example, 277
strong law of large numbers, 333
student's t, 109, 182
cdf converges to normal cdf, 355
density converges to normal density, 109
generalization of Cauchy, 109
moments, 110, 111
subset, 2
substitution law, 66, 70, 165, 175
sure event, 12
symmetric function, 189
symmetric matrix, 223, 239
T
| see student's t, 109
thermal noise
and the central limit theorem, 1
thinned Poisson process, 271
time-homogeneity | see Markov chain, 281
trace of a matrix, 225, 241
transition matrix | see Markov chain, 282
transition probability | see Markov chain, 281
triangle inequality
for Lp random variables, 303
for numbers, 303
t
uniform random variable (discrete), 37
union bound, 15
derivation, 27, 28
union of sets, 2
unit impulse, 129
unit-step function, 36, 206
V
variance, 49
Venn diagrams, 2
W
Wallis's formula, 109
weak law of large numbers, 1, 58, 61, 208, 324, 333
compared with the central limit theorem, 328
Weibull failure rate, 149
Weibull random variable, 105, 143
moments, 111
related to generalized gamma, 145
white noise, 193
innite power, 193
whitening lter, 204
Wick's theorem, 241
wide-sense stationarity
continuous time, 191
discrete time, 217
Wiener lter, 203, 309
causal, 203
Wiener integral, 261, 307
normality, 339
Wiener process, 257
approximation using random walk, 261
as a Markov process, 297
dened for negative and positive time, 274
independent increments, 257
normality, 277
relation to Ornstein{Uhlenbeck process, 274
self similarity, 365, 387
standard, 257
stationarity of its increments, 387
Wiener{Hopf equation, 203
Wiener{Khinchin theorem, 207
alternative derivation, 212
WLLN | weak law of large numbers, 58, 61
WSS | wide-sense stationary, 191
Z
z
transform, 383
U
unbiased estimator, 345
uncorrelated random variables, 58
example that are not independent, 80
uniform random variable (continuous), 85
cdf, 119
July 19, 2002
Continuous Random Variables
uniform[a;b]
fX (x) = b ,1 a and FX (x) = xb ,, aa ; a x b.
a + b ; var(X) = (b , a)2 ; M (s) = esb , esa .
E[X] =
X
2
12
s(b , a)
exponential, exp()
fX (x) = e,x and FX (x) = 1 , e,x ; x 0.
E[X] = 1=; var(X) = 1=2; E[X n] = n!=n.
MX (s) = =( , s); Re s < .
Laplace()
fX (x) = 2 e,jxj.
E[X] = 0; var(X) = 2=2: MX (s) = 2 =(2 , s2 ); , < Re s < .
Gaussian or normal, N (m; 2)
fX (x) = p 1 exp , 12 x , m .
2 2
E[X] = m; var(X) = 2 ;
MX (s) = esm+s2 2 =2 .
FX (x) = normcdf(x; m; sigma).
E[(X , m)2n ] = 1 3 (2n , 3)(2n , 1)2n,
gamma(p; )
p,1 ,x
fX (x) = (x),(p)e ; x > 0,
R
where ,(p) := 01 xp,1 e,x dx; p > 0.
Recall that ,(p) = (p , 1) ,(p , 1); p > 1.
FX (x) = gamcdf(x; p; 1=lambda).
p
,(n
+
p)
n
E[X ] = n
,(p) . MX (s) = , s ; Re s < .
Note that gamma(1;) is the same as exp().
Continuous Random Variables (continued)
Erlang(m;) := gamma(m; ); m = integer
Since ,(m) = (m , 1)!
mX
,1 (x)k
m,1 e,x
fX (x) = (x)
and
F
(x)
=
1
,
e,x ; x > 0.
X
(m , 1)!
k!
k=0
Note that Erlang(1; ) is the same as exp().
chi-squared(k) := gamma(k=2; 1=2)
If k is an even integer,
then chi-squared(k) is the same as Erlang(k=2; 1=2).
Since ,(1=2) = p,
e,x=2 ; x > 0.
for k = 1, fX (x) = p
2x
, 2m+1 (2m,1)531 p
,
Since , 2 =
2m
xm,1=2e,x=2 p ; x > 0.
for k = 2m + 1, fX (x) =
(2m , 1) 5 3 1 2
FX (x) = chi2cdf(x; k).
Note that chi-squared(2) is the same as exp(1=2).
Rayleigh()
2
2
fX (x) = x2 e,(x=) =2 and FX (x) = 1 , e,(x=) =2 ; x 0.
p
E[X] = =2; E[X 2 ] = 22; var(X) = 2 (2 , =2):
E[X n ] = 2n=2n ,(1 + n=2).
Weibull(p;)
fX (x) = pxp,1 e,xp and FX (x) = 1 , e,xp ; x > 0.
,(1 + n=p) .
E[X n ] =
n=p
, p Note that Weibull(2; ) is the same as Rayleigh 1= 2 and that Weibull(1; )
is the same as exp().
Cauchy()
1 tan,1 x + 1 .
fX (x) = 2=
;
F
(x)
=
X
2
+x
2
2
E[X] = undened; E[X ] = 1; 'X () = e,j j.
Odd moments are not dened; even moments are innite. Since the rst moment is not dened, central moments, including the variance, are not dened.
© Copyright 2026 Paperzz