Expectation and Averages
EE 278, Lecture Notes #4, Winter 2010–2011 (R.M. Gray)

Two types of averages are of interest for a random process {X_n}:

• Expectation, the probabilistic average: E(X_n)

• Time or sample average:
$$\lim_{N\to\infty}\frac{1}{N}\sum_{n=0}^{N-1}X_n$$

Or, if f is a function, compare
$$E[f(X_n)] \quad\text{vs.}\quad \lim_{N\to\infty}\frac{1}{N}\sum_{n=0}^{N-1}f(X_n)$$

E.g., f(X_n) = X_n^2, or f(X_N) = 1_F(X_N) for some 1D event F ⇒ relative frequency.
Intuition and experience suggest that the sample average should converge to the expectation (under suitable assumptions). Results of this type are called laws of large numbers or ergodic theorems.

When is this true, and how is it made precise? The expectation is a number that may depend on n; the sample average is a random variable that depends on ω.

Begin with a discrete-time, discrete-alphabet random process.

If f(X_n) = 1_F(X_n), then E[1_F(X_n)] = P_{X_n}(F) = Pr(X_n ∈ F), to be compared with the limiting relative frequency
$$\lim_{N\to\infty}\underbrace{\frac{1}{N}\sum_{n=0}^{N-1}1_F(X_n)}_{r_F^{(N)}}$$
where the relative frequency of F in X_0, X_1, ..., X_{N-1} is
$$r_F^{(N)} = \frac{\#\{\text{times } X_n \in F\}}{N}.$$

E.g., compare the probability of a head with the relative frequency of heads in N tosses as N → ∞.
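As a quick numerical illustration (not part of the original notes; a minimal Python/numpy sketch of the coin-toss comparison just mentioned):

```python
# Sketch (assumes numpy): relative frequency of heads in N fair-coin tosses
# approaches the probability of a head, p = 0.5, as N grows.
import numpy as np

rng = np.random.default_rng(0)
p = 0.5
tosses = rng.random(100_000) < p                      # True = head
running_rel_freq = np.cumsum(tosses) / np.arange(1, tosses.size + 1)

for N in (10, 100, 1_000, 10_000, 100_000):
    print(f"N = {N:>6d}   r_heads = {running_rel_freq[N - 1]:.4f}")
# The printed relative frequencies settle near p as N grows.
```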
Discrete Sample Average and Expectation

{X_n} with discrete alphabet A. The sample average is
$$S_n = \frac{1}{n}\sum_{i=0}^{n-1}X_i, \quad\text{i.e.,}\quad S_n(\omega) = \frac{1}{n}\sum_{i=0}^{n-1}X_i(\omega).$$
For each n = 1, 2, 3, ..., S_n is a random variable.

E.g., for coin flips: A = {0, 1} and S_n(ω) = r_1^{(n)}(ω), the average number of heads in the first n flips.

Assume {X_n} is identically distributed ⇒ there is a pmf p_X(x), x ∈ A, such that
$$p_{X_n}(x) = p_X(x), \quad \text{all } n.$$
Can consider X as a "generic" random variable with the same distribution as all of the X_n.

Intuition and experience suggest that r_a^{(n)} → p_X(a) as n → ∞, in some sense.

Regroup terms in the sum:
$$S_n(\omega) = \frac{1}{n}\sum_{i=0}^{n-1}X_i(\omega) = \sum_{a\in A} a \times \frac{1}{n}\sum_{i=0}^{n-1} 1_a(X_i(\omega)) = \sum_{a\in A} a\, r_a^{(n)}(\omega),$$
where the relative frequency r_a^{(n)}(ω) is shorthand for r_{\{a\}}^{(n)}(ω).
Expectation

If r_a^{(n)} → p_X(a) were true, it would imply
$$S_n = \sum_{a\in A} a\, r_a^{(n)} \;\xrightarrow[n\to\infty]{}\; \sum_{a\in A} a\, p_X(a) \equiv E(X).$$
This motivates the definition of expectation and suggests that expectation characterizes the long term behavior of sample averages of random processes.

To make this precise, we need to consider limits of random variables, which is different from the usual definition of limits of sequences of real numbers.

The expectation (mean, expected value) of a rv X is defined by
$$EX = \int x\, dF_X(x) \equiv \begin{cases} \sum_{x\in A} x\, p_X(x) & \text{discrete} \\ \int x\, f_X(x)\, dx & \text{continuous} \\ \beta \sum_x x\, p_1(x) + (1-\beta)\int x\, f_2(x)\, dx & \text{mixed} \end{cases}$$
The Stieltjes integral notation is shorthand for the usual special cases: discrete with pmf p_X, continuous with a pdf f_X, and mixed discrete and continuous with a cdf F_X(x) = βF_1(x) + (1 − β)F_2(x), where F_1 corresponds to a pmf p_1, F_2 corresponds to a pdf f_2, and β ∈ [0, 1].

First: develop expectation and its properties in more detail.

Expectation is also denoted by EX, E[X], X̄, m_X, or μ_X.
Examples

Old and new:

• X ∼ Unif([0, 1)):
$$EX = \int_0^1 x\, f_X(x)\, dx = \int_0^1 x\, dx = \frac{1}{2}$$

• X ∼ Exp(λ):
$$EX = \int_0^{\infty} r\,\lambda e^{-\lambda r}\, dr = \frac{1}{\lambda}$$

• X ∼ Bern(p):
$$EX = \sum_{i=0}^{1} i\, p_X(i) = 0(1 - p) + 1\cdot p = p$$

• X ∼ Geom(p):
$$EX = \sum_{k=1}^{\infty} k\, p_X(k) = \sum_{k=1}^{\infty} k\, p(1 - p)^{k-1} = 1/p$$

• If X has an even pdf f_X, i.e., f_X(−x) = f_X(x) for all x ∈ R, and if the integral exists, then EX = 0, since x f_X(x) is an odd function and hence has zero integral. Existence of the integral is necessary: e.g., if f_X(x) = c/x² for all |x| ≥ 1, where c is a normalization constant, it is not true that EX is zero, because
$$\int_{x:\,|x|\ge 1} \frac{x}{x^2}\, dx$$
does not exist, so the expectation is not defined.

• If the pdf f_X is an even function about some nonzero value, i.e., f_X(m + x) = f_X(m − x), where m is some constant, and the expectation exists, then EX = m (e.g., Gaussian).

Functions of random variables

Given a rv X and a function g, we can find E(g(X)) in two ways:

1. First derive the distribution for Y = g(X) and then find the expectation using the distribution for Y, e.g.,
$$E[g(X)] = E[Y] = \begin{cases} \sum_y y\, p_Y(y) & Y \text{ discrete} \\ \int y\, f_Y(y)\, dy & Y \text{ continuous with a pdf} \end{cases}$$

2. Use the original distribution for X and the definition introduced earlier:
$$E[g(X)] = \begin{cases} \sum_x g(x)\, p_X(x) & X \text{ discrete} \\ \int g(x)\, f_X(x)\, dx & X \text{ continuous with a pdf} \end{cases}$$
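As a sanity check of the Examples above (not from the original notes; a minimal Python/numpy sketch with arbitrary parameter values):

```python
# Sketch (assumes numpy): Monte Carlo check of the closed-form means above:
# Unif[0,1) -> 1/2, Exp(lambda) -> 1/lambda, Bern(p) -> p, Geom(p) -> 1/p.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
lam, p = 2.0, 0.3      # illustrative parameter choices

print("Unif[0,1):", rng.random(n).mean(), "vs", 0.5)
print("Exp(lam): ", rng.exponential(1 / lam, n).mean(), "vs", 1 / lam)
print("Bern(p):  ", rng.binomial(1, p, n).mean(), "vs", p)
print("Geom(p):  ", rng.geometric(p, n).mean(), "vs", 1 / p)
```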
The 2nd method avoids explicit derivation of the distribution of g(X). Both methods give the same result: the fundamental theorem of expectation, which is just a change of variables as in calculus.

If X is discrete, the pmf for Y = g(X) is
$$p_Y(y) = \sum_{x:\, g(x)=y} p_X(x), \quad y \in A_Y.$$

Proof for the discrete case:
$$EY = \sum_{y\in A_Y} y\, p_Y(y) = \sum_{y\in A_Y} y \sum_{x:\, g(x)=y} p_X(x)
= \sum_{y\in A_Y} \sum_{x:\, g(x)=y} g(x)\, p_X(x) = \sum_{x\in A_X} g(x)\, p_X(x)$$
Example of the fundamental theorem

Random variable X ∼ U([−1/2, 1/2]). Define the rv Y = X², that is, g(r) = r².

Previously derived distribution:
$$f_Y(y) = y^{-1/2} f_X\!\left(y^{1/2}\right),\ y \ge 0 \;\Rightarrow\; f_Y(y) = y^{-1/2},\ y \in (0, 1/4]$$
$$\Rightarrow\; EY = \int_0^{1/4} y\, f_Y(y)\, dy = \int_0^{1/4} y^{1/2}\, dy = \frac{(1/4)^{3/2}}{3/2} = \frac{1}{12}$$
Alternatively,
$$EY = E(X^2) = \int_{-1/2}^{1/2} x^2\, dx = 2\,\frac{(1/2)^3}{3} = \frac{1}{12}$$

Characteristic Functions

Using the fundamental theorem of expectation with g(x) = e^{jux}, u ∈ R,
$$M_X(ju) \equiv E[e^{juX}] = \int e^{jux}\, dF_X(x) = \begin{cases} \sum_{x\in A} e^{jux}\, p_X(x) & \text{discrete} \\ \int e^{jux} f_X(x)\, dx & \text{continuous} \\ \beta \sum_x e^{jux}\, p_1(x) + (1-\beta)\int e^{jux} f_2(x)\, dx & \text{mixed} \end{cases}$$

Primary use: derived distributions for sums of independent rvs.
Another application: a shortcut to finding moments.
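A numerical check of the worked example above (not from the original notes; a minimal Python/numpy sketch):

```python
# Sketch (assumes numpy): two routes to EY for Y = X^2, X ~ U([-1/2, 1/2]).
# Route 1 integrates y*f_Y(y) on (0, 1/4]; route 2 integrates x^2*f_X(x);
# a Monte Carlo estimate is a third check. All should be near 1/12.
import numpy as np

# Route 1: E[Y] = integral of y * y^(-1/2) over (0, 1/4]  (Riemann sum)
y = np.linspace(1e-9, 0.25, 1_000_000)
route1 = np.sum(y * y**-0.5) * (y[1] - y[0])

# Route 2: E[X^2] = integral of x^2 * 1 over [-1/2, 1/2]
x = np.linspace(-0.5, 0.5, 1_000_000)
route2 = np.sum(x**2) * (x[1] - x[0])

# Route 3: simulate X and average X^2
rng = np.random.default_rng(2)
route3 = ((rng.random(1_000_000) - 0.5) ** 2).mean()
print(route1, route2, route3, 1 / 12)
```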
Discrete case:
$$\frac{d}{du}M_X(ju) = \frac{d}{du}\sum_x p_X(x)e^{jux} = \sum_x p_X(x)(jx)e^{jux}$$
Evaluate the derivative at u = 0:
$$M_X'(0) = \frac{d}{du}M_X(ju)\Big|_{u=0} = jEX.$$
Thus
$$EX = \frac{M_X'(0)}{j}.$$
Repeated differentiation gives the kth moment:
$$E[X^k] = j^{-k}M_X^{(k)}(0) = j^{-k}\frac{d^k}{du^k}M_X(ju)\Big|_{u=0}.$$

Examples of finding moments

Bernoulli: X ∼ Bern(p), M_X(ju) = (1 − p) + pe^{ju}, then
$$\frac{M_X'(0)}{j} = p = E[X], \qquad -M_X''(0) = -M_X^{(2)}(0) = p = E[X^2].$$

Gaussian: X ∼ N(m, σ²), M_X(ju) = e^{jum - u^2\sigma^2/2}, then
$$\frac{M_X'(0)}{j} = m = E[X], \qquad -M_X''(0) = \sigma_X^2 + m^2 = E[X^2].$$

Characteristic functions and distributions

Suppose that the characteristic function has a Taylor series expansion about 0:
$$M_X(ju) = f(u) = \sum_{k=0}^{\infty} u^k\,\frac{f^{(k)}(0)}{k!} = f(0) + u f^{(1)}(0) + u^2\frac{f^{(2)}(0)}{2} + \text{terms in } u^k,\ k \ge 3,$$
where
$$f^{(k)}(0) = \frac{d^k}{du^k} f(u)\Big|_{u=0}.$$
Combine with the moment formula:
$$M_X(ju) = \sum_{k=0}^{\infty} u^k\,\frac{M_X^{(k)}(0)}{k!} = \sum_{k=0}^{\infty} (ju)^k\,\frac{E(X^k)}{k!}$$
If true for all u, knowing all moments ⇒ know the characteristic function ⇒ know the distribution!

For small u,
$$M_X(ju) = 1 + juE(X) - u^2\frac{E(X^2)}{2} + o(u^2),$$
where o(u²) contains higher-order terms that go to zero as u → 0 faster than u².

Key idea in the proof of the central limit theorem (later).
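A small numerical illustration of the moment formulas (not from the original notes; a Python/numpy sketch using finite differences in place of symbolic derivatives):

```python
# Sketch (assumes numpy): recover E[X] and E[X^2] for X ~ Bern(p) from the
# characteristic function M_X(ju) = (1 - p) + p e^{ju} by numerical
# differentiation at u = 0, matching the formulas above.
import numpy as np

p = 0.3
M = lambda u: (1 - p) + p * np.exp(1j * u)

h = 1e-4
first_deriv = (M(h) - M(-h)) / (2 * h)              # M_X'(0)  ~  j E[X]
second_deriv = (M(h) - 2 * M(0.0) + M(-h)) / h**2   # M_X''(0) ~ -E[X^2]

EX = (first_deriv / 1j).real
EX2 = (-second_deriv).real
print(EX, "vs", p)    # E[X]   = p
print(EX2, "vs", p)   # E[X^2] = p (since X^2 = X for a Bernoulli rv)
```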
Functions of several random variables

Expectations of functions of several random variables: e.g., U, V, Y = g(U, V), E(Y) = E[g(U, V)]. The fundamental theorem still holds:

Theorem (Fundamental theorem of expectation for functions of several random variables). Given random variables X_0, X_1, ..., X_{k−1} described by a cdf F_{X_0,X_1,...,X_{k−1}} and given a (measurable) function g: R^k → R,
$$E[g(X_0, \ldots, X_{k-1})] = \int g(x_0, \ldots, x_{k-1})\, dF_{X_0,\ldots,X_{k-1}}(x_0, \ldots, x_{k-1})$$
$$= \sum_{x_0,\ldots,x_{k-1}} g(x_0, \ldots, x_{k-1})\, p_{X_0,\ldots,X_{k-1}}(x_0, \ldots, x_{k-1})$$
or
$$= \int g(x_0, \ldots, x_{k-1})\, f_{X_0,\ldots,X_{k-1}}(x_0, \ldots, x_{k-1})\, dx_0 \cdots dx_{k-1}.$$

Most important examples: correlation, covariance, multidimensional characteristic functions.
Properties of expectation

Properties of probability imply corresponding properties of expectation. Describe, and prove for the discrete case:

Property 1. If X is a random variable such that Pr(X ≥ 0) = 1, then EX ≥ 0.

Proof. Pr(X ≥ 0) = 1 ⇒ p_X(x) = 0 for x < 0 ⇒
$$\sum_x x\, p_X(x) = \sum_{x:\, x\ge 0} x\, p_X(x) \ge 0$$
Property 1 parallels Axiom 1 of probability: nonnegativity of probability measures implies Property 1.

Property 2. If X is a random variable such that for some fixed number r, Pr(X = r) = 1, then EX = r (the expectation of a constant equals the constant).

Proof. Pr(X = r) = 1 ⇒ p_X(r) = 1 ⇒
$$\sum_x x\, p_X(x) = r$$
Property 2 parallels Axiom 2 of probability: normalization of the total probability to 1 leaves the constant unscaled in the result.

Property 3. Expectation is linear; that is, given two random variables X and Y and two real constants a and b,
$$E(aX + bY) = aEX + bEY$$

Proof. Let g(x, y) = ax + by, where a and b are constants. The fundamental theorem of expectation for 2 rvs ⇒
$$E[aX + bY] = \sum_{x,y}(ax + by)\,p_{X,Y}(x, y) = a\sum_x x\sum_y p_{X,Y}(x, y) + b\sum_y y\sum_x p_{X,Y}(x, y) = a\sum_x x\, p_X(x) + b\sum_y y\, p_Y(y) = aE(X) + bE(Y).$$
Property 3 parallels Axiom 3: additivity of probability ⇒ linearity of expectation. There is a countable/limiting version which is stated FYI; we will not use it:

Property 3'. Given an increasing sequence of nonnegative random variables X_n, n = 0, 1, 2, ..., that is, X_n ≥ X_{n−1} for all n (i.e., X_n(ω) ≥ X_{n−1}(ω) for all ω ∈ Ω), which converge to a limiting random variable X = lim_{n→∞} X_n, then
$$E\left[\lim_{n\to\infty} X_n\right] = \lim_{n\to\infty} EX_n.$$
This gives conditions under which one can exchange limit and expectation.

It is called the monotone convergence theorem, a key result in the theory of integration. The above properties can be taken as axioms of integration and lead to the general theory of Lebesgue integration, which can then be used to derive measure theory/probability theory. The two theories are equivalent: we use probability to derive expectation, but one can also do it the other way.
Correlation

Correlation = expectation of products of random variables, e.g., E(XY), E[h(X)g(Y)]. It characterizes a form of dependence among random variables and crops up in many signal processing applications.

Given two random variables X and Y, E[XY] is called the correlation between X and Y. If we have two real-valued functions or measurements on these random variables, say g(X) and h(Y), then E[g(X)h(Y)] is called the correlation between g(X) and h(Y). (We will also consider complex-valued random variables.)

If X and Y are independent and discrete, then
$$E[g(X)h(Y)] = \sum_{x,y} g(x)h(y)\,p_{X,Y}(x, y) = \sum_x\sum_y g(x)h(y)\,p_X(x)p_Y(y) = \left(\sum_x g(x)p_X(x)\right)\left(\sum_y h(y)p_Y(y)\right) = E[g(X)]\,E[h(Y)].$$
So if X and Y are independent, then any functions g(X) and h(Y) are uncorrelated. A similar manipulation with integrals shows this is also true for continuous random variables possessing pdfs (and for mixed). State formally as a lemma with a technical condition:

Lemma. For any two independent random variables X and Y,
$$E[g(X)h(Y)] = E[g(X)]\,E[h(Y)]$$
for all functions g and h with finite expectation.

E.g., if X, Y are independent, then E(XY) = E(X)E(Y) (uncorrelated). Independence ⇒ uncorrelated, which can be thought of as a weak form of independence. The converse is not true in general.

Example:
$$p_{X,Y}(x, y) = \begin{cases} 1/4 & \text{if } (x, y) = (1, 1) \text{ or } (-1, 1) \\ 1/2 & \text{if } (x, y) = (0, 0) \end{cases}$$
Then E(XY) = (1/4)(1 − 1) + (1/2)(0) = 0 and (EX)(EY) = (0)(1/2) = 0 ⇒ uncorrelated, but not independent. E.g., Pr(X = 0 | Y = 0) = 1 whereas Pr(X = 0) = 1/2.

Example: p_X(x) = 1/3 for x = −1, 0, 1 and Y = X². X and Y are uncorrelated but not independent.

Example: U ∼ U([0, 2π)), X = cos U, Y = sin U. X and Y are uncorrelated, but not independent.

One exception: if we have jointly Gaussian random variables and they are uncorrelated, then they are also independent (the covariance and inverse covariance matrices are diagonal).
Uncorrelation does not imply independence, but if h(X) and g(Y) are uncorrelated for all functions h and g, then X and Y are independent.

Proof for the discrete case: consider all possible indicator functions of points, 1_a(x). Let g = 1_a, a ∈ A_X, and h = 1_b, b ∈ A_Y. Then uncorrelation of g(X) and h(Y) implies
$$p_{X,Y}(a, b) = E[1_a(X)1_b(Y)] = E[1_a(X)]\,E[1_b(Y)] = p_X(a)p_Y(b)$$
A similar result holds for continuous random variables.

Characteristic functions revisited

Extending the ideas to complex functions of several random variables leads to another proof that the characteristic function of the sum of independent random variables is the product of the characteristic functions.

Theorem. Given a sequence of mutually independent random variables X_1, X_2, ..., X_n, define
$$Y_n = \sum_{i=1}^{n} X_i.$$
Then
$$M_{Y_n}(ju) = \prod_{i=1}^{n} M_{X_i}(ju).$$
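Before the proof, a quick numerical check of the theorem (not from the original notes; a minimal Python/numpy sketch with arbitrarily chosen distributions):

```python
# Sketch (assumes numpy): for independent X_1 ~ Bern(p) and X_2 ~ Unif[0,1),
# the empirical characteristic function of Y = X_1 + X_2 matches the product
# of the individual empirical characteristic functions.
import numpy as np

rng = np.random.default_rng(3)
n, p = 200_000, 0.3
x1 = rng.binomial(1, p, n)
x2 = rng.random(n)
y = x1 + x2

for u in (0.5, 1.0, 2.0):
    emp = np.mean(np.exp(1j * u * y))                              # estimate of E[e^{juY}]
    prod = np.mean(np.exp(1j * u * x1)) * np.mean(np.exp(1j * u * x2))
    print(u, emp, prod)   # the two complex numbers agree to Monte Carlo error
```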
Proof. Since functions of independent rvs are uncorrelated,
$$E\left[e^{juY_n}\right] = E\left[\exp\left(ju\sum_{i=1}^{n} X_i\right)\right] = E\left[\prod_{i=1}^{n} e^{juX_i}\right] = \prod_{i=1}^{n} E\left[e^{juX_i}\right] = \prod_{i=1}^{n} M_{X_i}(ju).$$

Covariance

Correlation of mean-removed random variables:

Given rvs X and Y, the rvs X − X̄ and Y − Ȳ have mean 0, e.g.,
$$E(X - \bar{X}) = E(X) - E(\bar{X}) = \bar{X} - \bar{X} = 0$$
Define the covariance COV(X, Y) by
$$\operatorname{COV}(X, Y) \equiv E[(X - \bar{X})(Y - \bar{Y})].$$
Expanding,
$$\operatorname{COV}(X, Y) = E[XY - X\,EY - Y\,EX + (EX)(EY)] = E(XY) - (EX)(EY).$$
Two random variables X and Y are uncorrelated iff their covariance is zero, that is, iff COV(X, Y) = 0.

If Y = X, COV(X, X) = σ_X². From the previous formula (also derived earlier),
$$\sigma_X^2 = E[(X - EX)^2] = E(X^2) - (EX)^2.$$

Cauchy–Schwarz inequality

$$0 \le E\left[\left(\frac{X-\bar{X}}{\sigma_X} \pm \frac{Y-\bar{Y}}{\sigma_Y}\right)^2\right]
= E\left[\left(\frac{X-\bar{X}}{\sigma_X}\right)^2\right] + E\left[\left(\frac{Y-\bar{Y}}{\sigma_Y}\right)^2\right] \pm 2E\left[\left(\frac{X-\bar{X}}{\sigma_X}\right)\left(\frac{Y-\bar{Y}}{\sigma_Y}\right)\right]
= 2 \pm 2\,\frac{E\left[(X-\bar{X})(Y-\bar{Y})\right]}{\sigma_X\sigma_Y}$$
so that ∓COV(X, Y) ≤ σ_X σ_Y, or |COV(X, Y)| ≤ σ_X σ_Y.

If we do not remove the means, we get a similar bound on the correlation:
$$0 \le E\left[\left(\frac{X}{\sqrt{E(X^2)}} \pm \frac{Y}{\sqrt{E(Y^2)}}\right)^2\right] \;\Rightarrow\; |E(XY)| \le \left[E(X^2)E(Y^2)\right]^{1/2}$$
If Y = 1, we get |EX| ≤ [E(X²)]^{1/2}, a special case of the Cauchy–Schwarz inequality.
Mean vector, covariance matrix

Suppose that X = (X_0, X_1, ..., X_{n−1})^t is an n-dimensional random vector.

The mean vector m = (m_0, m_1, ..., m_{n−1})^t is the vector of the means, i.e., m_k = E(X_k) for all k = 0, 1, ..., n − 1. In vector notation,
$$m = E(X)$$
If a joint pdf exists,
$$m = \int_{\mathbb{R}^n} x\, f_X(x)\, dx$$

For k, l = 0, 1, ..., n − 1 the covariance is K_X(k, l) = E[(X_k − m_k)(X_l − m_l)], and the covariance matrix is
$$K = \{K_X(k, l);\ k = 0, 1, \ldots, n-1;\ l = 0, 1, \ldots, n-1\}$$
Alternatively, use the outer product notation of linear algebra:
$$K = E[(X - m)(X - m)^t] = \int_{\mathbb{R}^n} (x - m)(x - m)^t f_X(x)\, dx,$$
where the outer product ab^t of a vector a with a vector b has (k, j) entry equal to a_k b_j.

By straightforward but tedious multiple integration, the mean vector and the covariance matrix of a Gaussian random vector are indeed the mean and covariance:
$$m = E(X) = \int_{\mathbb{R}^n} x\, \frac{\exp\left(-\tfrac{1}{2}(x-m)^t K^{-1}(x-m)\right)}{(2\pi)^{n/2}\sqrt{\det K}}\, dx$$
$$K = E[(X - m)(X - m)^t] = \int_{\mathbb{R}^n} (x-m)(x-m)^t\, \frac{\exp\left(-\tfrac{1}{2}(x-m)^t K^{-1}(x-m)\right)}{(2\pi)^{n/2}\sqrt{\det K}}\, dx. \tag{1}$$
Multivariable characteristic functions

Recall that if the random vector X ∼ N(m, Λ),
$$M_X(ju) = \exp\left(ju^t m - \frac{1}{2}u^t \Lambda u\right) = \exp\left(j\sum_{k=0}^{n-1} u_k m_k - \frac{1}{2}\sum_{k=0}^{n-1}\sum_{m=0}^{n-1} u_k \Lambda(k, m)\, u_m\right).$$
This formula yields a simple proof of an extremely important property of Gaussian vectors.

Theorem. Let X be a k-dimensional Gaussian random vector with mean m_X and covariance matrix Λ_X. Let Y be the new random vector formed by an affine (linear plus a constant) operation on X:
$$Y = HX + b,$$
where H is an n × k matrix and b is an n-dimensional vector. Then Y is a Gaussian random vector of dimension n with mean
$$m_Y = Hm_X + b$$
and covariance matrix
$$\Lambda_Y = H\Lambda_X H^t.$$

Proof. Find M_Y(ju):
$$M_Y(ju) = E\left[e^{ju^t Y}\right] = E\left[e^{ju^t(HX+b)}\right] = e^{ju^t b}\, E\left[e^{j(H^t u)^t X}\right] = e^{ju^t b}\, M_X(jH^t u)
= e^{ju^t b}\, e^{ju^t Hm - (1/2)(H^t u)^t \Lambda_X (H^t u)} = e^{ju^t(Hm+b)}\, e^{-(1/2)u^t H\Lambda_X H^t u},$$
the characteristic function of a Gaussian random vector with mean vector Hm + b and covariance matrix HΛH^t.

Singular Gaussian random vectors

In the following examples, assume that X is a k-dimensional Gaussian vector with invertible covariance K_X, so it has a k-dimensional pdf.

1. Suppose that H is an m × k matrix with m < k and a single 1 in each row and column. Then Y = HX is a subvector, and from the theorem it is Gaussian, with mean vector and covariance matrix given by the previous formulas ⇒ any subvector of a Gaussian vector is Gaussian.

2. Suppose Y = HX + b, where H is the all-zero matrix. Then from the theorem the constant vector b is Gaussian. It has a singular distribution: the kth-order pdf does not exist.
3. Suppose that Y = HX with an m × k matrix H such that K_Y is not invertible. E.g., k = 1, m = 2 and H = (a, b)^t. Then Y is Gaussian with covariance
$$K_Y = \sigma_X^2\begin{pmatrix} a^2 & ab \\ ab & b^2 \end{pmatrix},$$
which has determinant 0, hence is not invertible, and Y has no pdf.

More generally, suppose that X is a k-dimensional nonsingular Gaussian random vector, i.e., K_X is an invertible matrix and X has a pdf. Let H be m × k with m > k. Then Y = HX + b is Gaussian, but the covariance K_Y will not be invertible.

It can be shown that all singular Gaussian distributions have one of the preceding forms: either constant or formed by taking a linear transformation of a nonsingular Gaussian random vector of lower dimension.

Conditional expectation

Conditional expectation = expectation with respect to a conditional distribution.

Primary applications:

• Iterated or nested expectation permits simplification of many expectation computations; suitable conditioning simplifies the calculus

• Optimal nonlinear minimum mean squared error estimation
Discrete conditional estimation

Suppose (X, Y) is a random vector described by a joint pmf p_{X,Y}. Suppose we know X = x ⇒ we have the conditional (a posteriori) pmf p_{Y|X}.

Define the conditional expectation of Y given X = x by
$$E(Y|x) = \sum_{y\in A_Y} y\, p_{Y|X}(y|x),$$
the expectation with respect to the pmf p_{Y|X}(·|x). Compare the ordinary expectation of Y:
$$EY = \sum_{y} y\, p_Y(y)$$
Think of E(Y | x) as shorthand for E(Y | X = x), but the more complete notation can cause confusion later on. So stick to the shorthand; it is a function of an argument x.

So far, little new. But this is only an interim definition of conditional expectation; the actual definition differs in a simple, but fundamental, way. The conditional expectation of Y given X = x is a function of the independent variable x, say g(x): g(x) = E(Y|x).

Given any function g(x) of x, we can replace the independent variable x by a random variable X to get a new random variable g(X); thus
$$E(Y \mid X)$$
is a random variable which takes on the sample values E(Y | X)(x) = E(Y | x).

The conditional expectation of Y given X is this random variable.
Iterated Expectation

Can write the definition as
$$E(Y|X) = \sum_{y\in A_Y} y\, p_{Y|X}(y|X).$$
Beware of the dual use of X: it is the name of the random variable in the subscript, and it is the random variable itself in the argument.

Since E(Y|X) is a random variable, we can evaluate its expectation:
$$E[E(Y|X)] = \sum_{x\in A_X} p_X(x)E(Y|x) = \sum_{x\in A_X} p_X(x)\sum_{y\in A_Y} y\, p_{Y|X}(y|x)
= \sum_{y\in A_Y} y \sum_{x\in A_X} p_{X,Y}(x, y) = \sum_{y\in A_Y} y\, p_Y(y) = EY,$$
iterated expectation or nested expectation.

To find the expectation of a random variable Y, first find its conditional expectation with respect to another random variable, E(Y|X), then take the expectation of the resulting random variable to obtain EY = E[E(Y|X)].
Application: Random sums of random variables

Note on notation: we sometimes write EY = E_X[E(Y|X)] to emphasize that the outer expectation is over X (again beware the dual use of X).

Suppose we have a rp {X_k; k = 0, 1, ...} with identically distributed random variables X_k (not necessarily independent), with mean X̄. Suppose we have a rv N that takes on positive integer values, with mean N̄. Suppose the X_k are all independent of N.

Define the rv
$$Y = \sum_{k=0}^{N-1} X_k,$$
a sum of a random number of random variables. Find E(Y). Finding the derived distribution for Y can be hard.

Use iterated expectation: the idea is that conditioned on N = n, linearity of expectation implies an easy computation:
$$EY = E[E(Y|N)] = \sum_n p_N(n)\,E(Y \mid n) = \sum_n p_N(n)\,E\left[\sum_{k=0}^{N-1} X_k \,\Big|\, n\right] = \sum_n p_N(n)\,E\left[\sum_{k=0}^{n-1} X_k \,\Big|\, n\right] = \sum_n p_N(n)\sum_{k=0}^{n-1}E(X_k \mid n)$$
But since X_k and N are independent,
$$E(X_k \mid n) = \sum_x x\, p_{X_k|N}(x \mid n) = \sum_x x\, p_{X_k}(x) = E(X_k) = \bar{X}$$
$$\Rightarrow\; EY = \sum_n p_N(n)\, n\bar{X} = \bar{X}\sum_n p_N(n)\, n = \bar{N}\bar{X}$$

Example: Assume X_k ∼ Bern(p), N ∼ Poisson(λ) ⇒ EY = pλ.
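A simulation of the example above (not from the original notes; a minimal Python/numpy sketch):

```python
# Sketch (assumes numpy): simulate Y = sum_{k=0}^{N-1} X_k with X_k ~ Bern(p)
# independent of N ~ Poisson(lambda), and compare the sample mean of Y with
# the iterated-expectation answer E(Y) = N_bar * X_bar = lambda * p.
import numpy as np

rng = np.random.default_rng(4)
p, lam, trials = 0.3, 5.0, 100_000

N = rng.poisson(lam, trials)
# The sum of N independent Bernoulli(p) variables is Binomial(N, p).
Y = rng.binomial(N, p)
print(Y.mean(), "vs", p * lam)
```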
Iterated Expectation: General Form

Let X, Y be discrete random variables and g(X) and h(X, Y) functions of these random variables. Then E[g(X)h(X, Y)] = E(g(X)E[h(X, Y)|X]).

Proof.
$$E[g(X)h(X, Y)] = \sum_{x,y} g(x)h(x, y)\,p_{X,Y}(x, y) = \sum_x p_X(x)g(x)\sum_y h(x, y)\,p_{Y|X}(y|x)
= \sum_x p_X(x)g(x)\,E[h(X, Y)|x] = E\left(g(X)E[h(X, Y)|X]\right)$$

Continuous conditional expectation

Use pdfs instead of pmfs. For example,
$$E(Y|x) = \int y\, f_{Y|X}(y|x)\, dy,$$
and E(Y|X) is defined by replacing x by X in the above formula. Both iterated expectation and its general form extend to this case by replacing sums by integrals.
Law of Conditional Variance

A variation of iterated expectation, useful in estimation.

• Let X and Y be two rvs. Define the conditional variance of Y given X = x to be the variance of Y using f_{Y|X}(y | x), i.e.,
$$\operatorname{VAR}(Y \mid X = x) = E\left[(Y - E(Y \mid X = x))^2 \mid X = x\right] = E(Y^2 \mid X = x) - [E(Y \mid X = x)]^2$$

• The rv VAR(Y | X) is a function of X that takes on the values VAR(Y | X = x). Its expected value is
$$E_X[\operatorname{VAR}(Y \mid X)] = E_X\left[E(Y^2 \mid X) - (E(Y \mid X))^2\right] = E(Y^2) - E\left[(E(Y \mid X))^2\right]$$

• Since E(Y | X) is a rv, it has a variance:
$$\operatorname{VAR}(E(Y \mid X)) = E_X\left[\left(E(Y \mid X) - E[E(Y \mid X)]\right)^2\right] = E\left[(E(Y \mid X))^2\right] - (E(Y))^2$$

• Law of Conditional Variance: adding the above expressions,
$$\operatorname{VAR}(Y) = E\left(\operatorname{VAR}(Y \mid X)\right) + \operatorname{VAR}\left(E(Y \mid X)\right)$$

Conditional expectation is similarly defined for random vectors X, Y: E(Y | X); iterated expectation extends to vectors: E(Y) = E[E(Y | X)].
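A numerical check of the law (not from the original notes; a minimal Python/numpy sketch, with the model below an arbitrary illustrative choice):

```python
# Sketch (assumes numpy): check VAR(Y) = E(VAR(Y|X)) + VAR(E(Y|X)) where
# X ~ Bern(0.5) and Y | X = x is Gaussian with mean 2x and variance 1 + x.
import numpy as np

rng = np.random.default_rng(5)
n = 500_000
X = rng.binomial(1, 0.5, n)
Y = 2 * X + np.sqrt(1 + X) * rng.standard_normal(n)

# E(VAR(Y|X)) = 0.5*1 + 0.5*2 = 1.5 ; VAR(E(Y|X)) = VAR(2X) = 4*0.25 = 1
print(Y.var(), "vs", 1.5 + 1.0)
```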
Jointly Gaussian vectors

Finding conditional expectations can often be difficult. One exception: Gaussian random vectors.

Consider a random vector
$$U = \begin{pmatrix} X \\ Y \end{pmatrix}$$
formed of two random vectors X and Y of dimensions k and m, respectively. If U is Gaussian, then we say X and Y are jointly Gaussian.

From earlier, X and Y are individually Gaussian (since they are subvectors of a Gaussian vector).

The computation is linear algebra plus calculus. We skip the details, but survey some results on Gaussian vectors and the implied conditional expectations.

$$K_U = E[(U - m_U)(U - m_U)^t]
= E\left[\begin{pmatrix} X - m_X \\ Y - m_Y \end{pmatrix}\left((X - m_X)^t\ \ (Y - m_Y)^t\right)\right]
= \begin{pmatrix} E[(X - m_X)(X - m_X)^t] & E[(X - m_X)(Y - m_Y)^t] \\ E[(Y - m_Y)(X - m_X)^t] & E[(Y - m_Y)(Y - m_Y)^t] \end{pmatrix}
= \begin{pmatrix} K_X & K_{XY} \\ K_{YX} & K_Y \end{pmatrix},$$
where K_{XY} = K_{YX}^t are called cross-covariance matrices.
Also denote K_U by K_{(X,Y)} (do not confuse with K_{XY}).

The key to recognizing the conditional moments and densities is the following unpleasant matrix equation, which can be proved with a fair amount of brute force linear algebra:
$$\begin{pmatrix} K_X & K_{XY} \\ K_{YX} & K_Y \end{pmatrix}^{-1} =
\begin{pmatrix} K_X^{-1} + K_X^{-1}K_{XY}K_{Y|X}^{-1}K_{YX}K_X^{-1} & -K_X^{-1}K_{XY}K_{Y|X}^{-1} \\ -K_{Y|X}^{-1}K_{YX}K_X^{-1} & K_{Y|X}^{-1} \end{pmatrix}$$
where K_{Y|X} ≡ K_Y − K_{YX}K_X^{-1}K_{XY}. See the book for more details if interested.

The conditional pdf for the vector Y given the vector X:
$$f_{Y|X}(y|x) = \frac{f_{XY}(x, y)}{f_X(x)}
= \frac{\exp\left(-\tfrac{1}{2}\big((x-m_X)^t\ (y-m_Y)^t\big)K_U^{-1}\binom{x-m_X}{y-m_Y}\right)}{\sqrt{(2\pi)^{k+m}\det K_U}}
\times \frac{\sqrt{(2\pi)^{k}\det K_X}}{\exp\left(-(x-m_X)^t K_X^{-1}(x-m_X)/2\right)}$$
$$= \frac{1}{\sqrt{(2\pi)^{m}\det K_U/\det K_X}}
\exp\left(-\tfrac{1}{2}\left[\big((x-m_X)^t\ (y-m_Y)^t\big)K_U^{-1}\binom{x-m_X}{y-m_Y} - (x-m_X)^t K_X^{-1}(x-m_X)\right]\right)$$

More brute force linear algebra ⇒ the quadratic terms in the exponential satisfy
$$\big((x-m_X)^t\ (y-m_Y)^t\big)K_U^{-1}\binom{x-m_X}{y-m_Y} - (x-m_X)^t K_X^{-1}(x-m_X)
= \big(y - m_Y - K_{YX}K_X^{-1}(x-m_X)\big)^t K_{Y|X}^{-1}\big(y - m_Y - K_{YX}K_X^{-1}(x-m_X)\big).$$

⇒ conditioned on X = x, Y is Gaussian! It has a Gaussian density, so we can recognize the conditional expectation of Y given X as
$$E(Y|X = x) = m_{Y|x} = m_Y + K_{YX}K_X^{-1}(x - m_X),$$
⇒ the conditional expectation is an affine function of the vector x, linear if m_Y = 0 (the all-zero vector). For most non-Gaussian situations, conditional expectations are very nonlinear.

Can also infer that K_{Y|X} is the (conditional) covariance
$$K_{Y|X} = E[(Y - E(Y|X = x))(Y - E(Y|X = x))^t \mid x],$$
which, unlike the conditional mean, does not depend on the vector x!

Defining
$$m_{Y|x} = m_Y + K_{YX}K_X^{-1}(x - m_X),$$
the conditional density simplifies to
$$f_{Y|X}(y|x) = (2\pi)^{-m/2}\left(\frac{\det K_U}{\det K_X}\right)^{-1/2}\exp\left(-(y - m_{Y|x})^t K_{Y|X}^{-1}(y - m_{Y|x})/2\right).$$
Furthermore,
$$\det(K_{Y|X}) = \frac{\det(K_U)}{\det(K_X)}.$$
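A numerical illustration of these formulas (not from the original notes; a minimal Python/numpy sketch with made-up means and covariances):

```python
# Sketch (assumes numpy): compute E(Y|X=x) and K_{Y|X} from a partitioned
# covariance matrix and confirm det(K_{Y|X}) = det(K_U)/det(K_X).
import numpy as np

m_X, m_Y = np.array([0.0, 1.0]), np.array([2.0])
K_X = np.array([[2.0, 0.5],
                [0.5, 1.0]])
K_XY = np.array([[0.3],
                 [0.4]])
K_Y = np.array([[1.5]])
K_YX = K_XY.T
K_U = np.block([[K_X, K_XY], [K_YX, K_Y]])

x = np.array([1.0, -1.0])
m_Y_given_x = m_Y + K_YX @ np.linalg.solve(K_X, x - m_X)
K_Y_given_X = K_Y - K_YX @ np.linalg.solve(K_X, K_XY)

print("E(Y|X=x) =", m_Y_given_x)
print("K_{Y|X}  =", K_Y_given_X)
print(np.linalg.det(K_Y_given_X), "vs", np.linalg.det(K_U) / np.linalg.det(K_X))
```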
Moral: all conditional pdfs involving components of a Gaussian random vector are also Gaussian, and we can write out explicit formulas for the conditional mean and covariance.

Expectation as estimation

Suppose you are asked to guess the value that a random variable Y will take, knowing only the distribution of Y. What is the best guess or estimate, say Ŷ?

There are many ways to define "best." One of the most popular is the mean squared error (MSE)
$$E[(Y - \hat{Y})^2]$$
Here Ŷ is just a constant, say a. The optimization problem is: find a to minimize E[(Y − a)²].

Common approach: use calculus, differentiate with respect to a and solve, then show the second derivative is ≥ 0 to prove a minimum.

Alternative using simple inequalities:
$$E[(Y - a)^2] = E[(Y - EY + EY - a)^2] = E[(Y - EY)^2] + 2E[(Y - EY)(EY - a)] + (EY - a)^2.$$
The cross-product is evaluated using the linearity of expectation and the fact that EY is a constant as
$$E[(Y - EY)(EY - a)] = (EY)^2 - aEY - (EY)^2 + aEY = 0$$
and hence from Property 1 of expectation,
$$E[(Y - a)^2] = E[(Y - EY)^2] + (EY - a)^2 \ge E[(Y - EY)^2],$$
which is the mean squared error resulting from using the mean of Y as an estimate.

Thus the mean of a random variable is the minimum mean squared error (MMSE) estimate of the value of a random variable in the absence of any a priori information.
MMSE Estimate of Y given X

Now suppose we are told X = x. What then is the best estimate of Y, say Ŷ(X)?

A simple modification of the previous result gives the solution. Just replace the distribution of Y by the conditional distribution of Y given X:
$$E[(Y - \hat{Y}(X))^2] = E\left(E[(Y - \hat{Y}(X))^2 \mid X]\right) = \sum_x p_X(x)\,E[(Y - \hat{Y}(x))^2 \mid x].$$
Each term in the sum is an MSE with respect to p_{Y|X}(·|x). By the same argument as in the previous case, the minimum is achieved if the estimate is the (conditional) expectation, i.e., Ŷ(x) = E(Y | x).

Plugging in the random variable X in place of the dummy variable x, we have the following interpretation of conditional expectation:

• The conditional expectation E(Y|X) of a random variable Y given a random variable X is the MMSE estimate of Y given X.

• The same argument works for continuous rvs; replace pmfs by pdfs and sums by integrals.

• Unfortunately, we usually cannot evaluate it explicitly. Major exception: jointly Gaussian ⇒
$$E[Y|X] = m_Y + \rho\,\frac{\sigma_Y}{\sigma_X}(X - m_X).$$

A more direct proof follows from general iterated expectation. Recall:
$$E[g(X)h(X, Y)] = E\left(g(X)E[h(X, Y)|X]\right).$$
Suppose that g(X) is an estimate of Y given X. The resulting MSE is
$$E[(Y - g(X))^2] = E[(Y - E(Y|X) + E(Y|X) - g(X))^2]
= \underbrace{E[(Y - E(Y|X))^2]}_{\text{MMSE using } E(Y|X)} + 2\,\underbrace{E[(Y - E(Y|X))(E(Y|X) - g(X))]}_{\text{cross term}} + \underbrace{E[(E(Y|X) - g(X))^2]}_{\ge 0}$$
Expand the cross term:
$$E[(Y - E(Y|X))(E(Y|X) - g(X))] = E[Y E(Y|X)] - E[Y g(X)] - E[E(Y|X)^2] + E[E(Y|X) g(X)].$$
With g(X) = E(Y | X) and h(X, Y) = Y ⇒ E[Y E(Y|X)] = E[E(Y|X)²].
With g(X) as is and h(X, Y) = Y ⇒ E[Y g(X)] = E[E(Y|X) g(X)].
⇒ cross term
$$E[Y E(Y|X)] - E[Y g(X)] - E[E(Y|X)^2] + E[E(Y|X) g(X)] = 0$$
Orthogonality

Given an observation X, the MMSE estimate for Y is Ŷ = Ŷ(X) = E(Y | X).

Conditional expectation of the error with the optimal estimate: from linearity,
$$E[(Y - \hat{Y}) \mid X = x] = E(Y \mid X = x) - E(\hat{Y} \mid X = x) = E(Y \mid X = x) - E\big(E(Y \mid x) \mid X = x\big) = E(Y \mid X = x) - E(Y \mid X = x) = 0$$
or, writing it out for the discrete case,
$$E[(Y - \hat{Y}) \mid X = x] = \sum_y p_{Y|X}(y \mid x)\,y - \sum_y p_{Y|X}(y \mid x)\,\hat{Y}(x) = \hat{Y}(x) - \hat{Y}(x) = 0$$
$$E\big[(Y - E(Y \mid X)) \mid X\big] = 0 \qquad (*)$$

For any function g(x), the correlation of g(X) and the error Y − E(Y | X) is 0, since using general iterated expectation,
$$E[(Y - E(Y \mid X))g(X)] = E\big(E[(Y - E(Y \mid X))g(X) \mid X]\big) = E\Big(g(X)\underbrace{E\big((Y - E(Y \mid X)) \mid X\big)}_{=\,0 \text{ using } (*)}\Big) = 0$$
The error is said to be orthogonal to any function of the observations.
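A numerical check of this orthogonality (not from the original notes; a minimal Python/numpy sketch with an arbitrary joint model):

```python
# Sketch (assumes numpy): for a discrete X, estimate E(Y|X) empirically and
# check that the error Y - E(Y|X) is uncorrelated with a function g(X).
import numpy as np

rng = np.random.default_rng(6)
n = 400_000
X = rng.integers(0, 3, n)                  # X in {0, 1, 2}
Y = X**2 + rng.standard_normal(n)          # Y depends nonlinearly on X

# Empirical conditional expectation E(Y | X = x) for each value of x
cond_mean = np.array([Y[X == x].mean() for x in range(3)])
err = Y - cond_mean[X]

g = np.exp(X)                              # any function of the observation
print("E[(Y - E(Y|X)) g(X)] =", np.mean(err * g))   # close to 0
print("E[Y - E(Y|X)]        =", err.mean())          # close to 0
```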
MMSE Estimate

Summarize the discussion as a theorem and collect some properties. Reverse the roles of X and Y to better match the detection treatment: consider Y the observation and estimate X.

Theorem: The MMSE estimate of X given the observation Y is
$$\hat{X} = E(X \mid Y),$$
and the MMSE is
$$\text{MMSE} = E_Y\left(\operatorname{VAR}(X \mid Y)\right) = E(X^2) - E\left[(E(X \mid Y))^2\right]$$

Properties of the minimum MSE estimator:

• If X and Y are independent, then the best MSE estimate is E(X).

• Since E(X̂) = E_Y[E(X | Y)] = E(X), the best MSE estimate is unbiased.

• The conditional expectation of the optimal estimation error, E[(X − X̂) | Y = y], is 0 for all y, i.e., the error is unbiased for every Y = y (since E[(X − X̂) | Y = y] = E[X | Y = y] − E[E(X | Y = y) | Y = y]).

• The optimal estimation error and the estimate are orthogonal:
$$E\big[(X - \hat{X})\hat{X}\big] = E_Y\Big[E\big((X - \hat{X})\hat{X} \mid Y\big)\Big] = E_Y\Big[\hat{X}\,E\big((X - \hat{X}) \mid Y\big)\Big] = E_Y\Big[\hat{X}\big(E(X \mid Y) - \hat{X}\big)\Big] = 0$$
In fact, the estimation error is orthogonal to any function g(Y) of the observation Y!

• Law of conditional variance ⇒ VAR(X) = VAR(X̂) + E(VAR(X | Y)), i.e., the sum of the variance of the estimate and the MMSE equals the variance of the signal ⇒ variance of the estimate ≤ variance of the signal.

Estimation for random vectors

Suppose we have random vectors X = (X_0, ..., X_{k−1})^t and Y = (Y_0, ..., Y_{m−1})^t.

The prediction Ŷ = Ŷ(X) is to be chosen as a function of X which yields the smallest possible MSE, as in the scalar case. For vectors the MSE is defined as a sum of squared errors:
$$\epsilon^2(\hat{Y}) = E(\|Y - \hat{Y}\|^2) \equiv E[(Y - \hat{Y})^t(Y - \hat{Y})] = \sum_{i=0}^{m-1} E[(Y_i - \hat{Y}_i)^2].$$
An estimator is optimal within some class if it minimizes the MSE.
Claim: Given two random vectors Y and X, the minimum mean squared error (MMSE) estimate of Y given X is
$$\hat{Y}(X) = E(Y|X).$$

Proof: We can consider this as m separate MMSE estimators for each component Y_i of Y. The MMSE estimate Ŷ_i(X) is E(Y_i | X) as before.

Alternatively, suppose that Ŷ = E(Y | X) and that Ỹ is some other estimate. Then
$$\epsilon^2(\tilde{Y}) = E(\|Y - \tilde{Y}\|^2) = E(\|Y - \hat{Y} + \hat{Y} - \tilde{Y}\|^2)
= E(\|Y - \hat{Y}\|^2) + E(\|\hat{Y} - \tilde{Y}\|^2) + 2E[(Y - \hat{Y})^t(\hat{Y} - \tilde{Y})]
\ge \epsilon^2(\hat{Y}) + 2E[(Y - \hat{Y})^t(\hat{Y} - \tilde{Y})].$$
We are done if we show the rightmost term is zero:
$$E[(Y - \hat{Y})^t(\hat{Y} - \tilde{Y})] = \sum_{i=0}^{m-1} E\big[(Y_i - \hat{Y}_i)(\hat{Y}_i - \tilde{Y}_i)\big]
= \sum_{i=0}^{m-1} E\Big[E\big((Y_i - \underbrace{\hat{Y}_i}_{E(Y_i|X)})\underbrace{(\hat{Y}_i - \tilde{Y}_i)}_{\text{scalar fn of }X} \,\big|\, X\big)\Big] = 0$$
from orthogonality of the error with any function of the observations.

As in the scalar case, the conditional expectation is usually hard to evaluate. Exception: jointly Gaussian vectors.
MMSE Estimation of Gaussian Vectors

If (X, Y) are jointly Gaussian vectors with
$$K_{(X,Y)} = E[((X^t, Y^t) - (m_X^t, m_Y^t))^t((X^t, Y^t) - (m_X^t, m_Y^t))],$$
$$K_Y = E[(Y - m_Y)(Y - m_Y)^t], \quad K_X = E[(X - m_X)(X - m_X)^t],$$
$$K_{XY} = E[(X - m_X)(Y - m_Y)^t], \quad K_{YX} = E[(Y - m_Y)(X - m_X)^t],$$
then the conditional pdf is
$$f_{Y|X}(y|x) = (2\pi)^{-m/2}\left(\det(K_{Y|X})\right)^{-1/2}\exp\left(-\frac{(y - m_{Y|x})^t K_{Y|X}^{-1}(y - m_{Y|x})}{2}\right),$$
and
$$E(Y|X = x) = m_{Y|x} = m_Y + K_{YX}K_X^{-1}(x - m_X),$$
where
$$K_{Y|X} \equiv K_Y - K_{YX}K_X^{-1}K_{XY} = E[(Y - E(Y|X))(Y - E(Y|X))^t \mid X]$$
and
$$\det(K_{Y|X}) = \frac{\det(K_{(Y,X)})}{\det(K_X)},$$
and hence the minimum mean squared error estimate of Y given X is
$$E(Y|X) = m_Y + K_{YX}K_X^{-1}(X - m_X),$$
which is an affine function of X! The resulting MSE is
$$E[(Y - E(Y|X))^t(Y - E(Y|X))] = E\big(E[(Y - E(Y|X))^t(Y - E(Y|X)) \mid X]\big)
= E\big(E[\operatorname{Tr}[(Y - E(Y|X))(Y - E(Y|X))^t] \mid X]\big) = \operatorname{Tr}(K_{Y|X}).$$

One-step prediction error

In the special case where
$$X = X^n = (X_0, X_1, \ldots, X_{n-1}), \qquad Y = X_n,$$
we have the one-step prediction problem: predict X_n given X^n.

Define the nth-order covariance matrix as the n × n matrix
$$K_X^{(n)} = E[(X^n - E(X^n))(X^n - E(X^n))^t],$$
i.e., the (k, j) entry of K_X^{(n)} is E[(X_k − E(X_k))(X_j − E(X_j))], k, j = 0, 1, ..., n − 1.
If X^{n+1} is Gaussian, the optimal one-step predictor for X_n given X^n is
$$\hat{X}_n(X^n) = E(X_n) + E[(X_n - E(X_n))(X^n - E(X^n))^t]\,(K_X^{(n)})^{-1}(X^n - E(X^n)),$$
which has the affine form
$$\hat{X}_n(X^n) = AX^n + b$$
where
$$A = r^t(K_X^{(n)})^{-1}, \qquad b = E(X_n) - AE(X^n), \qquad
r = \begin{pmatrix} K_X(n, 0) \\ K_X(n, 1) \\ \vdots \\ K_X(n, n-1) \end{pmatrix}.$$

The resulting MSE is
$$\text{MMSE} = E[(X_n - \hat{X}_n(X^n))^2] = \operatorname{Tr}(K_Y - K_{YX}K_X^{-1}K_{XY}) = \sigma_{X_n}^2 - r^t(K_X^{(n)})^{-1}r$$
or
$$\text{MMSE} = E[(X_n - \hat{X}_n(X^n))^2] = \sigma_{X_n|X^n}^2 = \frac{\det(K_X^{(n+1)})}{\det(K_X^{(n)})},$$
a classical result from minimum mean squared error estimation theory.

Special case: suppose the mean of X_n is constant for all n and the covariance function has the property that
$$K_X(k, l) = K_X(k + n, l + n), \quad \text{all } k, l, n,$$
that is, the covariance function evaluated at two sample times is unchanged if both sample times are shifted by an equal amount. Then the covariance of two samples depends only on how far apart in time they are and not on the actual absolute time of each. A process with these properties is called weakly stationary or wide sense stationary.

Assume also for simplicity that E(X_n) = 0 for all n. The estimator simplifies to
$$\hat{X}_n(X^n) = r^t(K_X^{(n)})^{-1}X^n,$$
where r is the n-dimensional vector
$$r = \begin{pmatrix} K_X(n) \\ K_X(n-1) \\ \vdots \\ K_X(1) \end{pmatrix}.$$
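A numerical illustration of the one-step predictor (not from the original notes; a minimal Python/numpy sketch using an AR(1)-type covariance K_X(k) = ρ^{|k|} chosen for illustration):

```python
# Sketch (assumes numpy): build K_X^{(n)} and r for a zero-mean weakly
# stationary process with K_X(k) = rho^{|k|}, form the one-step predictor,
# and check sigma^2 - r^t (K^{(n)})^{-1} r = det(K^{(n+1)}) / det(K^{(n)}).
import numpy as np

rho, n = 0.8, 5
K_full = rho ** np.abs(np.subtract.outer(np.arange(n + 1), np.arange(n + 1)))
K_n = K_full[:n, :n]               # covariance of X^n = (X_0, ..., X_{n-1})
r = K_full[n, :n]                  # (K_X(n), K_X(n-1), ..., K_X(1))

coeffs = np.linalg.solve(K_n, r)   # predictor: X_hat_n = coeffs^t X^n
mmse_quadratic = K_full[n, n] - r @ coeffs
mmse_det_ratio = np.linalg.det(K_full) / np.linalg.det(K_n)
print(coeffs)
print(mmse_quadratic, "vs", mmse_det_ratio)   # both equal 1 - rho^2 here
```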
Optimal Linear Estimation I

Idea: Affine estimators are easy to compute; optimal (usually nonlinear) estimators can be quite difficult.

Have shown: if (X, Y) are jointly Gaussian, then the MMSE estimator for Y given the vector X is affine with A = K_{YX}K_X^{-1} and b = E(Y) − AE(X), with resulting MMSE = Tr(K_Y − K_{YX}K_X^{-1}K_{XY}).

Gaussian MMSE estimation theory provides a shortcut to the basic results for best linear/affine estimators. Show the trick here; later do it in a more traditional way.

Use this fact to describe general optimal affine estimators for random vectors without a Gaussian assumption. Suppose we have two jointly distributed random vectors X and Y which are not assumed to be Gaussian. Instead of finding the best of all estimators for Y given X, the problem is to find the best of all affine estimators, that is, find the best Ŷ(X) of the form
$$\hat{Y}(X) = AX + b.$$
If b = 0 this is a linear estimator, and the affine case is often referred to as linear in the literature.

State as a theorem and then prove:

Theorem. Given random vectors X and Y with
$$K_{(X,Y)} = E[((X^t, Y^t) - (m_X^t, m_Y^t))^t((X^t, Y^t) - (m_X^t, m_Y^t))],$$
$$K_Y = E[(Y - m_Y)(Y - m_Y)^t], \quad K_X = E[(X - m_X)(X - m_X)^t],$$
$$K_{XY} = E[(X - m_X)(Y - m_Y)^t], \quad K_{YX} = E[(Y - m_Y)(X - m_X)^t],$$
assume that K_X is invertible (or that it is positive definite). Then
$$\min_{A,b}\text{MMSE}(A, b) = \min_{A,b}E\left[\operatorname{Tr}\left((Y - AX - b)(Y - AX - b)^t\right)\right] = \operatorname{Tr}(K_Y - K_{YX}K_X^{-1}K_{XY})$$
and the minimum is achieved by A = K_{YX}K_X^{-1} and b = E(Y) − AE(X).

The result does not require that the vectors be jointly Gaussian, but it shows that the optimal affine estimate equals the optimal estimate under the Gaussian assumption.

Expand the MSE and use linear algebra:
$$\text{MMSE}(A, b) = E\left[\operatorname{Tr}\left((Y - AX - b)(Y - AX - b)^t\right)\right]
= E\left[\operatorname{Tr}\left((Y - m_Y - A(X - m_X) - b + m_Y - Am_X)(Y - m_Y - A(X - m_X) - b + m_Y - Am_X)^t\right)\right]$$
$$= \operatorname{Tr}\left(K_Y - AK_{XY} - K_{YX}A^t + AK_XA^t\right) + (b - m_Y + Am_X)^t(b - m_Y + Am_X),$$
where all the remaining cross terms are zero.

Regardless of A, the final term is ≥ 0. Therefore the minimum is achieved by the choice
$$b = m_Y - Am_X$$
Thus the inequality we wish to prove becomes
$$\operatorname{Tr}\left(K_Y - AK_{XY} - K_{YX}A^t + AK_XA^t\right) \ge \operatorname{Tr}(K_Y - K_{YX}K_X^{-1}K_{XY})$$
or
$$\operatorname{Tr}\left(K_{YX}K_X^{-1}K_{XY} + AK_XA^t - AK_{XY} - K_{YX}A^t\right) \ge 0.$$
Since K_X is a covariance matrix it is Hermitian, and since it is invertible it must be positive definite ⇒ it has a square root K_X^{1/2} ⇒ the left-hand side equals
$$\operatorname{Tr}\left((AK_X^{1/2} - K_{YX}K_X^{-1/2})(AK_X^{1/2} - K_{YX}K_X^{-1/2})^t\right).$$
This has the form Tr(BB^t) = Σ_{i,j} b_{i,j}² ≥ 0, proving the inequality. Plugging in A = K_{YX}K_X^{-1} achieves the lower bound with equality.

One-step prediction

Specialize to Y = X_n, X = X^n, with {X_n} a weakly stationary process ⇒ the optimal linear estimator of X_n given (X_0, ..., X_{n−1}) is
$$\hat{X}_n(X^n) = r^t(K_X^{(n)})^{-1}X^n,$$
where r is the n-dimensional vector
$$r = \begin{pmatrix} K_X(n) \\ K_X(n-1) \\ \vdots \\ K_X(1) \end{pmatrix}.$$
The resulting minimum mean squared error is
$$\text{LLSE} = \sigma_X^2 - r^t(K_X^{(n)})^{-1}r = \frac{\det(K_X^{(n+1)})}{\det(K_X^{(n)})}$$
(just expand this expression to verify it is the same as the previous expression), a classical result of linear estimation theory.

There are fast algorithms for computing inverses of covariance matrices and hence for computing linear least squares estimators. This is of particular interest in applications such as speech processing which depend on fast computation of linear estimators. E.g., the Gram–Schmidt orthogonalization procedure of matrix theory leads to the Cholesky decomposition of positive definite matrices. If the process is weakly stationary, then the Levinson–Durbin algorithm can be used.

Optimal Linear Estimation II: Application of Correlation

The traditional approach to linear estimation shows why correlations are important.

First consider random variables X and Y. X is observed; we want a good estimate Ŷ(X) of Y. The quality of the estimate is measured by MSE. Require a linear (actually affine) estimator of the form
$$\hat{Y}(x) = ax + b,$$
Find the best a and b: minimize
$$E[(Y - \hat{Y}(X))^2] = E[(Y - aX - b)^2].$$
Rewrite the formula in terms of the mean-removed random variables:
$$E\left[(Y - (aX + b))^2\right] = E\left[((Y - EY) - a(X - EX) - (b - EY + aEX))^2\right]
= \sigma_Y^2 + a^2\sigma_X^2 + (b - EY + aEX)^2 - 2a\operatorname{COV}(X, Y),$$
since the remaining cross-products are all zero (left as an exercise).

The first term does not depend on a or b, so minimizing the MSE ⇔ minimizing
$$a^2\sigma_X^2 + (b - EY + aEX)^2 - 2a\operatorname{COV}(X, Y).$$
The middle term is nonnegative. Once a is chosen, the middle term is minimized as 0 by choosing b = EY − aEX.

Thus the best a must minimize a²σ_X² − 2a COV(X, Y). A little calculus shows that the minimizing a is
$$a = \frac{\operatorname{COV}(X, Y)}{\sigma_X^2}$$
and hence the best b is
$$b = EY - \frac{\operatorname{COV}(X, Y)}{\sigma_X^2}\,EX.$$
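A short numerical illustration of these formulas (not from the original notes; a minimal Python/numpy sketch with an arbitrary non-Gaussian model):

```python
# Sketch (assumes numpy): the affine rule a = COV(X, Y)/var(X), b = EY - a EX
# computed from simulated data, compared with a deliberately mismatched (a, b).
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
X = rng.exponential(1.0, n)
Y = 3 * X + 0.5 * X**2 + rng.standard_normal(n)      # not linear in X

a = np.mean((X - X.mean()) * (Y - Y.mean())) / X.var()
b = Y.mean() - a * X.mean()
mse_opt = np.mean((Y - (a * X + b)) ** 2)
mse_other = np.mean((Y - (3 * X + 0.0)) ** 2)        # some other affine guess
print(a, b)
print(mse_opt, "<=", mse_other)
```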
Alternatively, the optimum a can be found without calculus by observing that
$$a^2\sigma_X^2 - 2a\operatorname{COV}(X, Y) = \left(a\sigma_X - \frac{\operatorname{COV}(X, Y)}{\sigma_X}\right)^2 - \frac{\operatorname{COV}(X, Y)^2}{\sigma_X^2} \ge -\frac{\operatorname{COV}(X, Y)^2}{\sigma_X^2}$$
with equality if a = COV(X, Y)/σ_X².

Vector Correlation and Optimal Estimation

Similar ideas work for vectors, but the details are messier. As before: given a k-dimensional random vector X, estimate an m-dimensional random vector Y.

Begin with a truly linear estimate constraint, i.e., optimize over estimates of the form
$$\hat{Y} = AX,$$
The m × k matrix A can be considered as a matrix of k-dimensional row vectors a_i^t, i = 0, ..., m − 1:
$$A = [a_0, a_1, \ldots, a_{m-1}]^t$$
so that if Ŷ = (Ŷ_0, ..., Ŷ_{m−1})^t, then Ŷ_i = a_i^t X and
$$\epsilon^2(A) = \sum_{i=0}^{m-1} E[(Y_i - a_i^t X)^2].$$
Find the matrix A that minimizes ε²(A).

ε²(A) is minimized if each term E[(Y_i − a_i^t X)²] is minimized over a_i, since there is no interaction among the terms in the sum. Hence focus on the simpler problem: given a random vector X and a random variable Y, find a that minimizes ε²(a) = E[(Y − a^t X)²].

Common technique: use calculus, setting the derivatives of ε²(a) to zero and verifying that the stationary point so obtained is a global minimum. Alternative: if the solution is known (from calculus or a guess), use elementary inequalities to prove it is true.

We shall show that the optimal a is a solution of a^t R_X = E(Y X^t), so that if the autocorrelation matrix, defined by
$$R_X = E[XX^t] = \{R_X(k, i) = E(X_k X_i);\ k, i = 0, \ldots, k-1\},$$
is invertible, then the optimal a is given by
$$a^t = E(Y X^t)R_X^{-1}.$$
To prove a^t = E(Y X^t)R_X^{-1} optimal, show that any vector b yields a worse MSE: ε²(b) ≥ ε²(a).
$$\epsilon^2(b) = E[(Y - b^tX)^2] = E[(Y - a^tX + a^tX - b^tX)^2]
= \underbrace{E[(Y - a^tX)^2]}_{\epsilon^2(a)} + 2E[(Y - a^tX)(a^tX - b^tX)] + \underbrace{E[(a^tX - b^tX)^2]}_{\ge 0}$$
$$\ge \epsilon^2(a) + 2E[(Y - a^tX)(a^t - b^t)X]
= \epsilon^2(a) + 2E[(Y - a^tX)X^t(a - b)]
= \epsilon^2(a) + 2E[(Y - a^tX)X^t](a - b)$$
$$= \epsilon^2(a) + 2\left(E[YX^t] - a^tE[XX^t]\right)(a - b)
= \epsilon^2(a) + 2\left(E[YX^t] - a^tR_X\right)(a - b)$$
By definition, a^t R_X = E(Y X^t) and hence
$$\epsilon^2(b) \ge \epsilon^2(a) + 2\left(E[YX^t] - a^tR_X\right)(a - b) = \epsilon^2(a)$$
with equality if b = a.

Autocorrelation matrices and their inverses are symmetric ⇒
$$a = (E(YX^t)R_X^{-1})^t = R_X^{-1}E[YX].$$

Application of the previous result to termwise minimization of the overall MSE ⇒
Theorem. The minimum mean squared error linear predictor of the form Ŷ = AX is given by any solution A of the equation
$$AR_X = E(YX^t).$$
If the matrix R_X is invertible, then A is uniquely given by
$$A^t = R_X^{-1}E[XY^t],$$
that is, the matrix A has rows a_i^t, i = 0, 1, ..., m − 1, with
$$a_i = R_X^{-1}E[Y_iX].$$
Alternatively,
$$A = E[YX^t]R_X^{-1}.$$

Optimal Affine Estimation

Having found the best linear estimate, it is easy to modify the development to find the best estimate of the form
$$\hat{Y}(X) = AX + b$$
If X and Y have mean 0, the result reduces to the previous result (and b does not help).

Theorem. The minimum mean squared error estimate of the form Ŷ = AX + b is given by any solution A of the equation
$$AK_X = E[(Y - E(Y))(X - E(X))^t]$$
where the covariance matrix K_X is defined by
$$K_X = E[(X - E(X))(X - E(X))^t] = R_{X-E(X)},$$
and
$$b = E(Y) - AE(X).$$
If K_X is invertible, then
$$A = E[(Y - E(Y))(X - E(X))^t]K_X^{-1}.$$

Proof. Let C be any matrix and d any vector (of suitable dimensions). Y − E(Y) and X − E(X) are 0-mean rvs, so
$$E(\|Y - (CX + d)\|^2) = E(\|(Y - E(Y)) - C(X - E(X)) + E(Y) - CE(X) - d\|^2)$$
$$= E(\|(Y - E(Y)) - C(X - E(X))\|^2) + E(\|E(Y) - CE(X) - d\|^2) + 2\underbrace{E[Y - E(Y) - C(X - E(X))]^t}_{0}[E(Y) - CE(X) - d]$$
$$= E(\|(Y - E(Y)) - C(X - E(X))\|^2) + \|E(Y) - CE(X) - d\|^2$$
From the linear result, the first term is minimized by choosing C = A; the second term is ≥ 0 with equality if C = A and d = b (recall b = E(Y) − AE(X)).
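A numerical illustration of the affine theorem (not from the original notes; a minimal Python/numpy sketch with a made-up non-Gaussian model and scalar Y):

```python
# Sketch (assumes numpy): best affine estimator from sample moments,
# A = E[(Y-EY)(X-EX)^t] K_X^{-1}, b = E(Y) - A E(X).
import numpy as np

rng = np.random.default_rng(8)
n = 300_000
X = np.column_stack([rng.random(n), rng.exponential(1.0, n)])      # shape (n, 2)
Y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + 0.3 * X[:, 0] * X[:, 1] \
    + 0.1 * rng.standard_normal(n)

Xc = X - X.mean(axis=0)
Yc = Y - Y.mean()
K_X = Xc.T @ Xc / n
K_YX = Yc @ Xc / n                      # E[(Y-EY)(X-EX)^t], a row vector here
A = np.linalg.solve(K_X, K_YX)          # = A^t, since K_X is symmetric
b = Y.mean() - A @ X.mean(axis=0)

Y_hat = X @ A + b
print("A =", A, " b =", b)
print("affine MSE:", np.mean((Y - Y_hat) ** 2))
```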
Thus
$$E(\|Y - (CX + d)\|^2) \ge E(\|Y - (AX + b)\|^2).$$

We often simplify problems by assuming 0 mean, so that linear estimation is good enough. This is not always possible, and affine may help.

The orthogonality principle

Another, more general, way to look at optimal linear estimation.

Error vector:
$$e = Y - \hat{Y}.$$
Rewriting as Y = Ŷ + e ⇒ Y = estimate plus an error or "noise" term. The goal of the estimator is to minimize the error energy
$$e^te = \sum_{n} e_n^2.$$
If the estimate is linear, then
$$e = Y - AX.$$
For the time being, simplify to estimating a random variable Y given a random vector X.
With
$$\hat{Y} = a^tX$$
and error e = Y − Ŷ = Y − a^tX, suppose that a is chosen optimally. Using the optimal linear prediction result, consider the cross-correlation between the error and the observable vector:
$$E[(Y - \hat{Y})X] = E[(Y - a^tX)X] = E[YX] - E[X(X^ta)] = E[YX] - R_Xa = 0$$
Thus for the optimal predictor, the error satisfies
$$E[eX] = 0,$$
or, equivalently,
$$E[eX_n] = 0; \quad n = 0, \ldots, k - 1.$$

Can extend to vector Y using the previous ideas: the overall MSE E[e^te] in the vector case is minimized by separately minimizing each component E[e_k²].

If a random variable e and a random vector X have 0 correlation, E(eX) = 0, we say they are orthogonal and write
$$e \perp X$$
We have shown that the optimal linear estimate of a scalar random variable given a vector of observations causes the error to be orthogonal to all of the observables ⇒ orthogonality of the error and observations is a necessary condition for optimality of a linear estimate.

Conversely, suppose we only know that for a given a the prediction error is orthogonal to all of the observations. If a has the orthogonality property Y − a^tX ⊥ X and b is any other linear predictor vector,
$$\epsilon^2(b) = E[(Y - b^tX)^2] = E[(Y - a^tX + a^tX - b^tX)^2]
= \epsilon^2(a) + 2\underbrace{E[(Y - a^tX)(a^tX - b^tX)]}_{0} + E[((a - b)^tX)^2]
\ge \epsilon^2(a) + E[((a - b)^tX)^2] \ge \epsilon^2(a)$$
with equality if b = a.

Putting it all together:
Theorem (The orthogonality principle). A linear estimate Ŷ = AX is optimal (in the mean squared error sense) if and only if the resulting errors are orthogonal to the observations, that is, if e = Y − AX, then
$$E[e_kX_n] = 0;\quad k = 1, \ldots, K;\ n = 1, \ldots, N.$$

Correlation and covariance functions

Correlation and covariance in the context of random processes:

Given a random process {X_t; t ∈ T}, the autocorrelation function R_X(t, s), t, s ∈ T, is defined by
$$R_X(t, s) = E(X_tX_s), \quad \text{all } t, s \in T.$$
If the process is complex-valued instead of real-valued, the definition becomes
$$R_X(t, s) = E(X_tX_s^*), \quad \text{all } t, s \in T.$$
The autocovariance function, or simply the covariance function, K_X(t, s), t, s ∈ T, is defined by
$$K_X(t, s) = \operatorname{COV}(X_t, X_s)$$
Recall that
$$K_X(t, s) = R_X(t, s) - (EX_t)(EX_s^*).$$
The autocorrelation and covariance functions are equal if the process has zero mean, EX_t = 0 for all t.

The covariance function of a process {X_t} = the autocorrelation function of the process {X_t − EX_t}.

A measure of dependence between pairs of samples of a process.

A random process {X_t; t ∈ T} is said to be uncorrelated if
$$R_X(t, s) = \begin{cases} E(X_t^2) & \text{if } t = s \\ EX_t\,EX_s & \text{if } t \ne s \end{cases}$$
or, equivalently, if
$$K_X(t, s) = \begin{cases} \sigma_{X_t}^2 & \text{if } t = s \\ 0 & \text{if } t \ne s \end{cases}$$
A generalization of iid.

Warning: if a process is iid or uncorrelated, the random variables are independent or uncorrelated only when taken at different times, i.e., X_t and X_s need not be independent or uncorrelated when t = s (unless the process is trivial).
Gaussian Processes Revisited

A Gaussian random process {X_t; t ∈ T} is completely described by a mean function {m_t; t ∈ T} and a covariance function {Λ(t, s); t, s ∈ T}.

Recall that mean and covariance are appropriate names since
$$m_t = EX_t, \qquad \Lambda(t, s) = K_X(t, s) = E[(X_t - m_t)(X_s - m_s)]$$

A more important issue: why was it assumed that the covariance function (and the corresponding matrices) are symmetric and positive definite? Now show where these conditions come from.

Symmetry for real-valued random processes:
$$K_X(t, s) = E[(X_t - EX_t)(X_s - EX_s)] = E[(X_s - EX_s)(X_t - EX_t)] = K_X(s, t),$$
i.e., K_X(t, s) = K_X(s, t). If the process is complex-valued, then the covariance function is Hermitian, i.e., K_X(t, s) = K_X^*(s, t), with a similar proof.

For positive definiteness: given any positive integer k, any collection of sample times {t_0, ..., t_{k−1}}, and any k real numbers a_i, i = 0, ..., k − 1 (not all 0), consider
$$\sum_{i=0}^{k-1}\sum_{l=0}^{k-1}a_ia_lK_X(t_i, t_l) = \sum_{i=0}^{k-1}\sum_{l=0}^{k-1}a_ia_l\,E[(X_{t_i} - EX_{t_i})(X_{t_l} - EX_{t_l})]
= E\left[\sum_{i=0}^{k-1}\sum_{l=0}^{k-1}a_ia_l(X_{t_i} - EX_{t_i})(X_{t_l} - EX_{t_l})\right]
= E\left[\Big|\sum_{i=0}^{k-1}a_i(X_{t_i} - EX_{t_i})\Big|^2\right] \ge 0.$$
⇒ any covariance function K_X must be nonnegative definite ⇒ any covariance matrix formed by sampling the covariance function must also be nonnegative definite ⇒ the sufficient condition in the transform definition of a Gaussian random vector is also necessary.
The central limit theorem

Where do Gaussian distributions come from? Use characteristic functions to look at what happens when we sum up many iid random variables with suitable scaling and shifting.

Suppose that {X_n} is iid with common cdf F_X described by a pmf or pdf. Assume a finite mean EX_n = m and finite variance σ²_{X_n} = σ². Assume that the characteristic function M_X(ju) is well behaved for small u (so the Taylor series expansion works).

Consider the "standardized" or "normalized" sum
$$R_n = \frac{1}{n^{1/2}}\sum_{i=0}^{n-1}\frac{X_i - m}{\sigma}$$

$$U_i = \frac{X_i - m}{\sigma}$$
is iid and has mean 0, variance 1, and characteristic function M_U(ju):
$$ER_n = E\left(\frac{1}{n^{1/2}}\sum_{i=0}^{n-1}U_i\right) = \frac{1}{n^{1/2}}\sum_{i=0}^{n-1}E[U_i] = 0$$
$$\sigma_{R_n}^2 = E(R_n^2) = E\left[\left(\frac{1}{n^{1/2}}\sum_{i=0}^{n-1}U_i\right)^2\right] = \frac{1}{n}\sum_{i=0}^{n-1}\sum_{j=0}^{n-1}\underbrace{E[U_iU_j]}_{\delta_{i-j}} = \frac{1}{n}\sum_{i=0}^{n-1}1 = 1$$
Let Sn = ∑_{i=0}^{n−1} Ui so that Rn = Sn/√n.

From characteristic functions:

MSn(ju) = MU(ju)^n

hence

MRn(ju) = E[e^{juRn}] = E[e^{juSn/√n}] = MSn(ju/√n) = MU(ju/√n)^n

Recall from earlier in these notes the approximation for small u

MX(ju) = 1 + ju E(X) − (u²/2) E(X²) + o(u²)

if MX(ju) has a Taylor series about u = 0.

Replace X by U and u by u/√n:

MRn(ju) = MU(ju/√n)^n
        = [ 1 + j(u/√n) E(U) − (u²/n) E(U²)/2 + o(u²/n) ]^n
        = [ 1 − u²/(2n) + o(u²/(2n)) ]^n
        → e^{−u²/2}  as n → ∞,

the characteristic function of N(0, 1)
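The last limit can be sanity-checked numerically by dropping the o(·) term.
A small added sketch (not part of the original notes):

```python
import numpy as np

u = 1.5
for n in (10, 100, 1000, 10000):
    approx = (1.0 - u**2 / (2.0 * n)) ** n   # (1 - u^2/(2n))^n with the o(.) term dropped
    print(n, approx, "->", np.exp(-u**2 / 2.0))
```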
Taking inverse transforms ⇒ cdf of Rn converges to cdf of N(0, 1)
(convergence in distribution)

Theorem Central limit theorem
Let {Xn} be an iid random process with a finite mean m and variance
σ². Then

n^{−1/2} ∑_{i=0}^{n−1} (Xi − m)

converges in distribution to a Gaussian random variable with mean 0
and variance σ².

E.g., a current meter across a resistor will measure the effects of the
sum of millions of electrons randomly moving and colliding with each
other. Regardless of the probabilistic description of these
micro-events, the global current will appear to be Gaussian.
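A quick simulation illustrates the theorem. This is an added sketch (not from
the original notes); the uniform distribution and the sample sizes are
arbitrary choices, and the Gaussian cdf is evaluated with the error function.

```python
import numpy as np
from math import sqrt, erf

rng = np.random.default_rng(1)
n, trials = 500, 20000
m, var = 0.5, 1.0 / 12.0                     # mean and variance of Unif[0, 1)

X = rng.random((trials, n))                  # iid uniform samples
R = (X - m).sum(axis=1) / np.sqrt(n)         # n^{-1/2} * sum_i (X_i - m), approx N(0, var)

for q in (-1.0, 0.0, 1.0):
    empirical = np.mean(R <= q * sqrt(var))      # empirical cdf of R at q standard deviations
    gaussian = 0.5 * (1.0 + erf(q / sqrt(2.0)))  # Phi(q), cdf of N(0, 1)
    print(q, round(empirical, 3), round(gaussian, 3))
```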
Sample averages

Sample or time average:

Given a process {Xn}, the sample average of the first n values of {Xn} is

Sn = n^{−1} ∑_{i=0}^{n−1} Xi

Linearity of expectation ⇒

ESn = E[ n^{−1} ∑_{i=0}^{n−1} Xi ] = n^{−1} ∑_{i=0}^{n−1} EXi

If EXi = X for all i, then ESn = X,

Sn is an unbiased estimate of X

Variance of the sample average:

σ²Sn ≡ E[(Sn − E(Sn))²]
     = E[ ( n^{−1} ∑_{i=0}^{n−1} Xi − n^{−1} ∑_{i=0}^{n−1} EXi )² ]
     = E[ ( n^{−1} ∑_{i=0}^{n−1} (Xi − EXi) )² ]
     = n^{−2} ∑_{i=0}^{n−1} ∑_{j=0}^{n−1} E[(Xi − EXi)(Xj − EXj)]
     = n^{−2} ∑_{i=0}^{n−1} ∑_{j=0}^{n−1} KX(i, j)

If the process is uncorrelated, KX(i, j) = σ²Xi δ_{i−j} ⇒

σ²Sn = n^{−2} ∑_{i=0}^{n−1} σ²Xi

If σ²Xi = σ²X for all i, then

σ²Sn = σ²X / n → 0 as n → ∞

Suggests that Sn converges to something since its variance vanishes.
But random variables are functions, so ordinary convergence does not
apply immediately.

This is the heart of a classic law of large numbers; we only need
some definitions to complete it.

Convergence of random variables

What does it mean to say a sequence of random variables Yn
converges to a limit Y?

Ordinary convergence of real numbers: A sequence {xn} converges
to a limit x if for every ε > 0 there exists an N such that |xn − x| < ε for
every n > N

There are many definitions of convergence for random variables, all
of which involve the usual notion of limits. Most important are:
A sequence of random variables Yn is said to converge to Y

• with probability one or almost everywhere or almost surely if
  P({ω : lim_{n→∞} Yn(ω) = Y(ω)}) = 1

• in probability if for any ε > 0
  lim_{n→∞} P({ω : |Yn(ω) − Y(ω)| > ε}) = 0

• in mean square or in quadratic mean if
  lim_{n→∞} E[|Yn − Y|²] = 0

Convergence in mean square is also denoted by

l.i.m._{n→∞} Yn = Y

A mean ergodic theorem

Definitions + result for variance of Sn ⇒

Theorem Let {Xn} be a discrete time uncorrelated random process
such that EXn = X is finite and σ²Xn = σ²X < ∞ for all n. Then

l.i.m._{n→∞} (1/n) ∑_{i=0}^{n−1} Xi = X,

that is,

(1/n) ∑_{i=0}^{n−1} Xi → X in mean square.

Intuition: Sample average → expectation
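Before the proof, a quick numerical illustration (an added sketch, not part of
the original notes; the Bernoulli choice is an arbitrary example of an
uncorrelated process): the Monte Carlo estimate of E[(Sn − X)²] tracks σ²X/n.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 0.3                                    # Bernoulli(p): mean p, variance p(1 - p)
var_X = p * (1 - p)

for n in (10, 100, 1000):
    # 10000 independent realizations of the first n samples of an iid
    # (hence uncorrelated) Bernoulli process, and their sample averages S_n
    S = (rng.random((10000, n)) < p).mean(axis=1)
    mse = np.mean((S - p) ** 2)            # Monte Carlo estimate of E[(S_n - mean)^2]
    print(n, round(mse, 5), round(var_X / n, 5))
```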
Proof.

Sn = (1/n) ∑_{i=0}^{n−1} Xi,  ESn = X

lim_{n→∞} E[(Sn − X)²] = lim_{n→∞} E[(Sn − ESn)²]
                       = lim_{n→∞} σ²Sn = lim_{n→∞} σ²X/n = 0   □

To relate convergence in mean square to convergence in probability
and the classic weak law of large numbers, need the Tchebychev
inequality.

Develop in two ways for added insight: direct proof and proof using
the Markov inequality

The Tchebychev inequality

rv X, mean mX, variance σ²X. Then

Pr(|X − mX| ≥ ε) ≤ σ²X / ε².

Proof.

Discrete case; follows similarly for pdfs

σ²X = E[(X − mX)²] = ∑_x (x − mX)² pX(x)
    = ∑_{x:|x−mX|<ε} (x − mX)² pX(x) + ∑_{x:|x−mX|≥ε} (x − mX)² pX(x)
    ≥ ∑_{x:|x−mX|≥ε} (x − mX)² pX(x)
    ≥ ε² ∑_{x:|x−mX|≥ε} pX(x) = ε² Pr(|X − mX| ≥ ε).   □
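A numerical check of the bound (added sketch; the exponential distribution and
the values of ε are arbitrary choices): the empirical tail probability never
exceeds σ²X/ε².

```python
import numpy as np

rng = np.random.default_rng(3)
lam = 2.0
X = rng.exponential(scale=1.0 / lam, size=200000)   # Exp(lambda): mean 1/lam, variance 1/lam^2
mX, varX = 1.0 / lam, 1.0 / lam**2

for eps in (0.5, 1.0, 2.0):
    prob = np.mean(np.abs(X - mX) >= eps)           # empirical Pr(|X - m_X| >= eps)
    print(eps, round(prob, 4), "<=", round(varX / eps**2, 4))
```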
The Markov Inequality

Given a nonnegative random variable U with finite expectation E[U],
for any a > 0

Pr(U ≥ a) = PU([a, ∞)) ≤ E[U]/a.

Proof.

Fix a > 0, set F = {u : u ≥ a}.

E[U] = E[U(1F(U) + 1Fc(U))]
     = E[U 1F(U)] + E[U 1Fc(U)]
     ≥ E[U 1F(U)] ≥ a E[1F(U)]
     = a P(F).   □

Application of Tchebychev or Markov inequality:

If Yn converges to Y in mean square, then it also converges in
probability.

Proof.

From the Markov inequality applied to |Yn − Y|², for any ε > 0

Pr(|Yn − Y| ≥ ε) = Pr(|Yn − Y|² ≥ ε²) ≤ E(|Yn − Y|²)/ε² → 0 as n → ∞   □

Converse not true, e.g., let Yn be a discrete random variable with pmf

pYn(y) = 1 − 1/n if y = 0,  1/n if y = n

Converges to 0 in probability since
Pr[|Yn − 0| > ε] = Pr[Yn > 0] = 1/n → 0, but not in mean square since
E[Yn²] = 0·(1 − 1/n) + n²·(1/n) = n does not converge to 0
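The counterexample is easy to simulate (an added sketch, not from the original
notes): the fraction of nonzero draws of Yn shrinks like 1/n while the
empirical second moment stays near n.

```python
import numpy as np

rng = np.random.default_rng(4)
trials = 200000
for n in (10, 100, 1000):
    # Y_n = n with probability 1/n, else 0
    Y = np.where(rng.random(trials) < 1.0 / n, float(n), 0.0)
    print(n, np.mean(Y != 0), np.mean(Y**2))   # Pr(Y_n != 0) -> 0, but E[Y_n^2] ~ n
```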
Weak law of large numbers

Mean ergodic theorem + fact that convergence in mean square implies
convergence in probability yields the weak law of large numbers:

Theorem The weak law of large numbers
Let {Xn} be a discrete time process with finite mean EXn = X and
variance σ²Xn = σ²X < ∞ for all n. If the process is uncorrelated, then
the sample average n^{−1} ∑_{i=0}^{n−1} Xi converges to X in probability.

“weak law” since only convergence in probability; the strong law is for
convergence with probability 1. It is much more difficult to prove.

Stationarity

In the development of the weak law of large numbers we assumed
(1) the mean EXt of the process did not depend on time, and (2) the
covariance function had the form KX(t, s) = σ²X δ_{t−s}

A special case of a stationarity property — the mean does not
change with time, and the covariance of two samples depends only
on how far apart they are.

Rephrase the constant mean property:

EXt = EXt+τ; all t, τ

mean not changed by a delay or shift
A random process is said to be weakly stationary or wide sense
stationary or stationary in the wide sense if the mean is constant and
the covariance depends only on the difference of the sample times,
i.e., if

1. EXt = EXt+τ for all t, τ,

2. covariance KX(t, s) depends on t and s only through the difference
t − s, i.e., for all t, s, τ

KX(t, s) = KX(t + τ, s + τ)

Abbreviate to

KX(t, t + τ) = KX(τ) for all t, τ (Toeplitz).

Stronger notion is required for stronger results (e.g., strong law of
large numbers)

A random process {Xt; t ∈ T} is stationary or strictly stationary if for
any k and any choice of sample times t0, t1, . . . , tk−1

PXt0,Xt1,...,Xtk−1 = PXt0−τ,Xt1−τ,...,Xtk−1−τ, all τ.

Strict stationarity ⇒ weak stationarity

Converse not true in general. It is true for Gaussian processes — a
weakly stationary Gaussian random process is also strictly stationary

Weak stationarity is enough for many convergence in mean square
results, and enough for most engineering applications.
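As a concrete added illustration (not part of the original notes), the sketch
below simulates a Gaussian first-order autoregressive process started from its
stationary distribution — an assumed example with recursion X_{t+1} = ρ X_t + W_t,
not a process defined in these notes — and checks empirically that the mean is
constant and that the covariance of two samples depends only on their lag.

```python
import numpy as np

rng = np.random.default_rng(5)
rho, sigma2 = 0.9, 1.0                      # assumed example: X_{t+1} = rho X_t + W_t
T, trials = 60, 20000

X = np.empty((trials, T))
X[:, 0] = rng.normal(0.0, np.sqrt(sigma2), trials)                   # stationary start
W = rng.normal(0.0, np.sqrt(sigma2 * (1 - rho**2)), (trials, T - 1))
for t in range(T - 1):
    X[:, t + 1] = rho * X[:, t] + W[:, t]

print("means at t = 0, 20, 40:", X[:, [0, 20, 40]].mean(axis=0))     # roughly constant (~0)
for t0 in (5, 30):
    k = 3
    cov = np.mean(X[:, t0] * X[:, t0 + k])  # zero mean, so this estimates K_X(t0, t0 + k)
    print("t0 =", t0, "lag-3 covariance ~", round(cov, 3),
          "theory:", round(sigma2 * rho**k, 3))
```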
Asymptotically uncorrelated processes
Can extend the mean ergodic theorem and WLLN to more general
random processes

A discrete time weakly stationary process {Xn; n ∈ Z} is said to be
asymptotically uncorrelated if its covariance function is absolutely
summable, that is, if

∑_{k=−∞}^{∞} |KX(k)| < ∞.

This condition implies that also

lim_{k→∞} KX(k) = 0

Common examples of asymptotically uncorrelated processes are
processes with exponentially decreasing covariance, i.e., of the form
KX(k) = σ²X ρ^{|k|} for |ρ| < 1.

Theorem A mean ergodic theorem
Let {Xn} be a weakly stationary asymptotically uncorrelated discrete
time random process such that EXn = X is finite and σ²Xn = σ²X < ∞
for all n. Then

l.i.m._{n→∞} (1/n) ∑_{i=0}^{n−1} Xi = X,

that is,

(1/n) ∑_{i=0}^{n−1} Xi → X in mean square.

An advantage of the more general result over the result for
uncorrelated discrete time random processes is that it extends to
continuous time random processes
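A numerical illustration of the theorem (added sketch, reusing the assumed
AR(1) example from the stationarity discussion above, which has exactly the
covariance KX(k) = σ²X ρ^{|k|}): the Monte Carlo estimate of E[(Sn − X)²] still
goes to 0 even though the samples are correlated, just more slowly than σ²X/n.

```python
import numpy as np

rng = np.random.default_rng(6)
rho, sigma2, mean = 0.9, 1.0, 2.0           # assumed AR(1) example, K_X(k) = sigma2 * rho**|k|
trials = 5000

def ar1_sample_averages(n):
    X = np.empty((trials, n))
    X[:, 0] = mean + rng.normal(0.0, np.sqrt(sigma2), trials)
    W = rng.normal(0.0, np.sqrt(sigma2 * (1 - rho**2)), (trials, n - 1))
    for t in range(n - 1):
        X[:, t + 1] = mean + rho * (X[:, t] - mean) + W[:, t]
    return X.mean(axis=1)                   # sample averages S_n of each realization

for n in (10, 100, 1000):
    S = ar1_sample_averages(n)
    print(n, round(np.mean((S - mean) ** 2), 5), round(sigma2 / n, 5))
```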
Proof for the asymptotically uncorrelated case

From earlier results, for Sn = n^{−1} ∑_{i=0}^{n−1} Xi

E[(Sn − X)²] = E[(Sn − ESn)²] = σ²Sn = n^{−2} ∑_{i=0}^{n−1} ∑_{j=0}^{n−1} KX(i − j)

This sum can be rearranged as

σ²Sn = (1/n) ∑_{k=−n+1}^{n−1} (1 − |k|/n) KX(k).

Can show (see Appendix of book)

lim_{n→∞} ∑_{k=−n+1}^{n−1} (1 − |k|/n) KX(k) = ∑_{k=−∞}^{∞} KX(k),

which is finite by assumption, hence dividing by n yields

lim_{n→∞} (1/n) ∑_{k=−n+1}^{n−1} (1 − |k|/n) KX(k) = 0

which completes the proof.
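The rearrangement of the double sum is easy to confirm numerically (added
sketch; the exponential covariance is the example form from the previous
slide): both expressions give the same σ²Sn.

```python
import numpy as np

rho, sigma2, n = 0.8, 1.0, 50
K = lambda k: sigma2 * rho ** np.abs(k)     # K_X(k) = sigma_X^2 * rho**|k|

i = np.arange(n)
double_sum = K(i[:, None] - i[None, :]).sum() / n**2       # n^{-2} sum_i sum_j K_X(i - j)

k = np.arange(-n + 1, n)
rearranged = np.sum((1 - np.abs(k) / n) * K(k)) / n        # n^{-1} sum_k (1 - |k|/n) K_X(k)

print(double_sum, rearranged)               # equal up to roundoff
```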
Continuous time random processes

{X(t); t ∈ R} is asymptotically uncorrelated if its covariance function is
absolutely integrable:

∫_{−∞}^{∞} |KX(τ)| dτ < ∞,

which implies that

lim_{τ→∞} KX(τ) = 0.

A sample or time average can be defined for a continuous time
process by

ST = (1/T) ∫_{0}^{T} X(t) dt.

Theorem A mean ergodic theorem
Let {X(t)} be a weakly stationary asymptotically uncorrelated
continuous time random process such that EX(t) = X is finite and
σ²X(t) = σ²X < ∞ for all t. Then

l.i.m._{T→∞} (1/T) ∫_{0}^{T} X(t) dt = X,

that is,

(1/T) ∫_{0}^{T} X(t) dt → X in mean square.

As in the discrete time case, convergence in mean square
immediately implies convergence in probability

Can show that strong laws of large numbers hold for stationary
processes satisfying a property called ergodicity, which is a form of
asymptotic memorylessness.