Lecture 7

1 Bochner’s Theorem
Before moving on to the proof of the Central Limit Theorem (CLT) using characteristic
functions, we prove another important result, known as Bochner’s Theorem.
If φ is the characteristic function of a probability measure on R, then φ is positive definite in the following sense:
\[
\sum_{i,j=1}^{n} \varphi(t_i - t_j)\,\xi_i\bar{\xi}_j \ge 0 \qquad \text{for all } n \in \mathbb{N},\ \xi_1,\dots,\xi_n \in \mathbb{C},\ t_1,\dots,t_n \in \mathbb{R}. \tag{1.1}
\]
The proof is straightforward:
\[
\sum_{i,j=1}^{n} \varphi(t_i - t_j)\,\xi_i\bar{\xi}_j
= \sum_{i,j=1}^{n} \xi_i\bar{\xi}_j \int e^{it_i x} e^{-it_j x}\,\mu(dx)
= \int \Big|\sum_{i} \xi_i e^{it_i x}\Big|^2 \mu(dx) \ge 0.
\]
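As a quick numerical sanity check of (1.1) (a sketch with arbitrary choices: φ(t) = e^{−t²/2}, the N(0,1) characteristic function, random sample points, and a fixed seed), the Gram matrix (φ(t_i − t_j))_{i,j} should have no negative eigenvalues:

```python
# Sketch: verify (1.1) for phi(t) = exp(-t^2/2), the N(0,1) characteristic
# function. The Gram matrix (phi(t_i - t_j)) should be positive semidefinite.
import numpy as np

def phi(t):
    return np.exp(-t**2 / 2)

rng = np.random.default_rng(0)
t = rng.uniform(-5, 5, size=20)          # arbitrary points t_1, ..., t_n
M = phi(t[:, None] - t[None, :])         # M[i, j] = phi(t_i - t_j)
print(np.linalg.eigvalsh(M).min())       # >= 0 up to floating-point error
```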
The interesting fact is the converse, that every continuous positive definite function can be
written as the characteristic function of a finite measure.
Theorem 1.1 [Bochner’s Theorem] Let φ be a positive definite function which is continuous at 0 with φ(0) = 1. Then there exists a probability measure µ on R with φ(t) = ∫ e^{itx} µ(dx).
Remark 1.2 Positive definite functions appear in many contexts. For example, if (X_s)_{s∈R} is a stationary real-valued stochastic process (i.e., (X_s)_{s∈R} is a family of real-valued random variables indexed by s ∈ R, with (X_s)_{s∈R} equal in distribution to (X_{s+θ})_{s∈R} for all θ ∈ R), then the covariance function C(t) = E[X_0 X_t] − E[X_0]E[X_t] is positive definite, and hence C(t) = Var(X_0) ∫ e^{itx} µ(dx) for a symmetric probability measure µ on R, symmetric because C(t) = C(−t). Indeed, note that
\[
\sum_{i,j=1}^{n} \xi_i\xi_j\,C(t_i - t_j)
= \sum_{i,j=1}^{n} \xi_i\xi_j\,E\big[X^0_{t_i} X^0_{t_j}\big]
= E\Big[\Big(\sum_{i=1}^{n} \xi_i X^0_{t_i}\Big)^2\Big] \ge 0,
\]
where X^0_t = X_t − E[X_t], and we used the stationarity of (X_s)_{s∈R}.
Remark 1.3 The Hamburger moment problem is closely related to Bochner’s Theorem, with a moment sequence (m_k)_{k∈N} playing the role of the characteristic function (φ(t))_{t∈R}. Indeed, if m_k = ∫ x^k µ(dx) for a probability measure µ, then the sequence (m_k)_{k∈N} is Hankel positive definite in the sense that: for all n ∈ N and ξ_1, …, ξ_n ∈ R,
\[
\sum_{i,j=1}^{n} \xi_i\xi_j\,m_{i+j}
= \sum_{i,j=1}^{n} \int x^{i+j}\,\xi_i\xi_j\,\mu(dx)
= \int \Big(\sum_{i=1}^{n} \xi_i x^i\Big)^2 \mu(dx) \ge 0.
\]
The main difference from the definition of positive definiteness of a characteristic function is that m_{i+j} depends on the indices i and j through their sum, while φ(t_i − t_j) depends on the arguments t_i and t_j through their difference; the latter property can be called Toeplitz positive definiteness, in analogy with Toeplitz and Hankel matrices. Hamburger proved the analogue of Bochner’s Theorem: (m_k)_{k∈N} is the moment sequence of a finite measure if and only if it is Hankel positive definite. For an extensive discussion on representations of positive definite functions, see [1].
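A small numerical companion to Remark 1.3 (a sketch; the choice of N(0,1) moments and the size n = 6 are ours): the Hankel matrix (m_{i+j})_{1≤i,j≤n} built from Gaussian moments should have no negative eigenvalues.

```python
# Sketch: the Hankel matrix (m_{i+j})_{1<=i,j<=n} of N(0,1) moments
# (m_k = 0 for odd k, m_k = (k-1)!! for even k) is positive semidefinite.
import numpy as np

def gaussian_moment(k):
    # E[X^k] for X ~ N(0,1), with k >= 1
    return 0.0 if k % 2 == 1 else float(np.prod(np.arange(k - 1, 0, -2)))

n = 6
H = np.array([[gaussian_moment(i + j) for j in range(1, n + 1)]
              for i in range(1, n + 1)])
print(np.linalg.eigvalsh(H).min())       # >= 0 up to floating-point error
```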
Proof of Theorem 1.1. We will construct a sequence of characteristic functions φn such
that φn → φ pointwise, and then by Lévy’s Continuity Theorem, φ must be a characteristic
function as well.
Assuming φ is positive definite, we first establish some of its properties.
(1) For any a ∈ R, ψ(t) = φ(t)e^{ita} is also positive definite, which follows from the definition.

(2) If φ_1, …, φ_m are positive definite, and λ_1, …, λ_m > 0, then ∑_{i=1}^m λ_i φ_i is also positive definite, which also follows easily from the definition.
(3) φ(0) ≥ 0, φ(t) = \overline{φ(−t)}, and |φ(t)| ≤ φ(0) for all t ∈ R. Indeed, if we let t_1 = 0 and t_2 = t in (1.1), then
\[
\big(|\xi_1|^2 + |\xi_2|^2\big)\varphi(0) + \xi_1\bar{\xi}_2\,\varphi(-t) + \bar{\xi}_1\xi_2\,\varphi(t) \ge 0 \qquad \text{for all } \xi_1, \xi_2 \in \mathbb{C}.
\]
Letting ξ_1 = 1 and ξ_2 = 0 shows that φ(0) ≥ 0; letting ξ_1 = ξ_2 = 1 shows that Im[φ(t)] = −Im[φ(−t)]; letting ξ_1 = 1 and ξ_2 = i shows that Re[φ(t)] = Re[φ(−t)]; letting ξ_1 = \bar{ξ}_2 = \sqrt{−φ(t)} shows that |φ(t)| ≤ φ(0).
(4) For any s, t ∈ R, we have
\[
|\varphi(t) - \varphi(s)|^2 \le 4\,\varphi(0)\,|\varphi(0) - \varphi(t-s)|. \tag{1.2}
\]
To prove this, note that (3) and (1.1) imply that (φ(t_i − t_j))_{1≤i,j≤n} is a Hermitian positive definite matrix. In particular, if we choose t_1 = t, t_2 = s and t_3 = 0, then the matrix
\[
\begin{pmatrix}
\varphi(0) & \varphi(t-s) & \varphi(t) \\
\overline{\varphi(t-s)} & \varphi(0) & \varphi(s) \\
\overline{\varphi(t)} & \overline{\varphi(s)} & \varphi(0)
\end{pmatrix}
\]
must have a non-negative determinant, which gives
\begin{align*}
0 &\le \varphi(0)^3 + \varphi(s)\overline{\varphi(t)}\varphi(t-s) + \overline{\varphi(s)}\varphi(t)\overline{\varphi(t-s)} - \varphi(0)\big(|\varphi(s)|^2 + |\varphi(t)|^2 + |\varphi(t-s)|^2\big) \\
&= \varphi(0)^3 - \varphi(0)\big(|\varphi(t) - \varphi(s)|^2 + |\varphi(t-s)|^2\big) - 2\,\mathrm{Re}\big[\varphi(s)\overline{\varphi(t)}\big(\varphi(0) - \varphi(t-s)\big)\big] \\
&\le \varphi(0)^3 - \varphi(0)\big(|\varphi(t) - \varphi(s)|^2 + |\varphi(t-s)|^2\big) + 2\,\varphi(0)^2\,|\varphi(0) - \varphi(t-s)|,
\end{align*}
where in the last line we used that |φ(s)|, |φ(t)| ≤ φ(0). Rearranging terms then gives
\begin{align*}
|\varphi(t) - \varphi(s)|^2 &\le \varphi(0)^2 - |\varphi(t-s)|^2 + 2\,\varphi(0)\,|\varphi(0) - \varphi(t-s)| \\
&= \big(\varphi(0) + |\varphi(t-s)|\big)\big(\varphi(0) - |\varphi(t-s)|\big) + 2\,\varphi(0)\,|\varphi(0) - \varphi(t-s)| \\
&\le 4\,\varphi(0)\,|\varphi(0) - \varphi(t-s)|,
\end{align*}
where we again used that |φ(t − s)| ≤ φ(0). This proves (1.2).
(5) By (4), it follows that if φ is continuous at 0, then in fact φ is uniformly continuous on
R. This is consistent with what we have proved for characteristic functions.
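The estimate (1.2) can also be checked numerically on a concrete positive definite function (a sketch; the choice φ(t) = e^{−|t|}, the Cauchy characteristic function, and the random pairs are ours):

```python
# Sketch: check (1.2) for the positive definite function phi(t) = exp(-|t|)
# (the Cauchy characteristic function) on random pairs (s, t).
import numpy as np

phi = lambda u: np.exp(-np.abs(u))
rng = np.random.default_rng(1)
s, t = rng.uniform(-10, 10, size=(2, 10_000))
lhs = np.abs(phi(t) - phi(s))**2
rhs = 4 * phi(0) * np.abs(phi(0) - phi(t - s))
print(bool(np.all(lhs <= rhs + 1e-12)))   # True
```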
We are now ready to prove the theorem. The basic idea is that if φ is furthermore assumed to be absolutely integrable, i.e., ∫ |φ(t)| dt < ∞, then we can actually take the inverse Fourier transform
\[
f(x) = \frac{1}{2\pi}\int_{\mathbb{R}} e^{-itx}\varphi(t)\,dt. \tag{1.3}
\]
If Bochner’s Theorem holds, then we expect f to in fact be a probability density, which we will try to establish. Then φ is the Fourier transform of f, and hence is a characteristic function. When φ is not assumed to be integrable, we will approximate φ by functions φ_σ which are positive definite and absolutely integrable. Since the φ_σ would be characteristic functions, by Lévy’s Continuity Theorem, we can then conclude that φ is also a characteristic function.
Now we show that f defined in (1.3) from an absolutely integrable φ is indeed a probability density. We first use the positive definiteness of φ to show that f ≥ 0. Note that
\begin{align*}
0 &\le \lim_{T\to\infty} \frac{1}{2\pi T}\int_0^T\!\!\int_0^T e^{-itx}e^{isx}\varphi(t-s)\,dt\,ds \\
&= \lim_{T\to\infty} \frac{1}{2\pi T}\int_{-T}^{T} \varphi(u)e^{-iux}\Big(\int_{s\in[0,T],\,s+u\in[0,T]} ds\Big)\,du \\
&= \lim_{T\to\infty} \frac{1}{2\pi}\int_{-T}^{T} \varphi(u)e^{-iux}\Big(1 - \frac{|u|}{T}\Big)\,du \\
&= \frac{1}{2\pi}\int e^{-itx}\varphi(t)\,dt = f(x),
\end{align*}
where the inequality can be deduced from a Riemann sum approximation of the two-fold integral and the positive definiteness of φ, while the last equality follows from the assumption that φ is integrable and the dominated convergence theorem.
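The inversion (1.3) and the conclusion f ≥ 0 can be illustrated numerically (a sketch; the choice φ(t) = e^{−t²/2} and the truncated grids are ours):

```python
# Sketch: invert phi(t) = exp(-t^2/2) via (1.3) on a truncated grid and check
# that f is a nonnegative density; here f should be the N(0,1) density.
import numpy as np

t = np.linspace(-10, 10, 4001)
phi = np.exp(-t**2 / 2)
x = np.linspace(-8, 8, 321)
f = np.trapz(np.exp(-1j * np.outer(x, t)) * phi, t, axis=1).real / (2 * np.pi)
print(f.min() >= -1e-10)                                       # f >= 0
print(np.isclose(np.trapz(f, x), 1.0))                         # total mass ~ 1
print(np.allclose(f, np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)))  # matches N(0,1)
```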
Next we show that
\[
\varphi(t) = \int_{\mathbb{R}} e^{itx} f(x)\,dx, \tag{1.4}
\]
which will imply that ∫ f(x) dx = φ(0) = 1, and hence φ is a characteristic function. Note that if f is integrable, then ∫ e^{itx} f(x) dx is well-defined and is the inverse Fourier transform of f, and hence must equal φ by the same proof as Theorem 1.5 in Lecture 6. More generally, without assuming the integrability of f, we can approximate f by
\[
f_\sigma(x) = f(x)\,e^{-\frac{\sigma^2x^2}{2}}, \qquad \sigma > 0.
\]
Note that by Fubini,
\begin{align*}
\int e^{itx} f_\sigma(x)\,dx &= \int e^{itx}\,\frac{1}{2\pi}\Big(\int e^{-isx}\varphi(s)\,ds\Big)e^{-\frac{\sigma^2x^2}{2}}\,dx \\
&= \frac{1}{\sqrt{2\pi\sigma^2}}\int \varphi(s)\,\frac{1}{\sqrt{2\pi/\sigma^2}}\Big(\int e^{i(t-s)x - \frac{\sigma^2x^2}{2}}\,dx\Big)\,ds \tag{1.5} \\
&= \frac{1}{\sqrt{2\pi\sigma^2}}\int \varphi(s)\,e^{-\frac{(t-s)^2}{2\sigma^2}}\,ds
= \frac{1}{\sqrt{2\pi}}\int \varphi(t+\sigma u)\,e^{-\frac{u^2}{2}}\,du.
\end{align*}
In particular, letting t = 0 gives
\[
\int f_\sigma(x)\,dx = \frac{1}{\sqrt{2\pi\sigma^2}}\int \varphi(s)\,e^{-\frac{s^2}{2\sigma^2}}\,ds \le \sup_{s\in\mathbb{R}}|\varphi(s)| = \varphi(0) = 1,
\]
which by the monotone convergence theorem implies ∫ f(x) dx ≤ 1 as we send σ ↓ 0. Therefore in (1.5), as we let σ ↓ 0, the left-hand side (LHS) converges to ∫ e^{itx} f(x) dx by the dominated convergence theorem, while the right-hand side (RHS) converges to φ(t) because φ is bounded and continuous. This proves (1.4), and hence the theorem for integrable φ.
If φ is not assumed to be integrable, then we approximate φ by
\[
\varphi_\sigma(t) := \varphi(t)\,e^{-\frac{\sigma^2t^2}{2}} = \frac{1}{\sqrt{2\pi\sigma^2}}\int \varphi(t)\,e^{ity}\,e^{-\frac{y^2}{2\sigma^2}}\,dy,
\]
where we note that φ_σ is continuous and integrable with φ_σ(0) = 1, and φ_σ is a convex combination (with Gaussian weights in y) of the functions φ(t)e^{ity}, y ∈ R, each of which is positive definite by property (1); therefore φ_σ is also positive definite, by the integral analogue of property (2). In particular, by our arguments above, φ_σ is a characteristic function. Since φ_σ → φ pointwise as σ ↓ 0, φ is also a characteristic function by Lévy’s Continuity Theorem.
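The damping trick can be seen concretely (a sketch with our own choices: φ(t) = cos(t), the characteristic function of a fair ±1 sign, which is not integrable, and σ = 0.3): inverting φ_σ numerically recovers the density of ε + σN, a two-Gaussian mixture.

```python
# Sketch: phi(t) = cos(t) is not integrable, but the damped phi_sigma is, and
# its inverse Fourier transform is the density of epsilon + sigma*N:
# a mixture of N(1, sigma^2) and N(-1, sigma^2).
import numpy as np

sigma = 0.3
t = np.linspace(-40, 40, 8001)
phi_sigma = np.cos(t) * np.exp(-sigma**2 * t**2 / 2)
x = np.linspace(-4, 4, 401)
f = np.trapz(np.exp(-1j * np.outer(x, t)) * phi_sigma, t, axis=1).real / (2 * np.pi)
mix = 0.5 * (np.exp(-(x - 1)**2 / (2 * sigma**2))
             + np.exp(-(x + 1)**2 / (2 * sigma**2))) / np.sqrt(2 * np.pi * sigma**2)
print(np.allclose(f, mix, atol=1e-6))    # True
```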
2 Central Limit Theorem
We now use characteristic functions to prove the CLT for sums of i.i.d. random variables.
Theorem 2.1 [Central Limit Theorem] Let (X_n)_{n∈N} be i.i.d. random variables with E[X_1] = 0 and Var(X_1) = σ² ∈ (0, ∞). Let S_n := ∑_{i=1}^n X_i. Then S_n/√n converges in distribution to Z, a Gaussian (or normal) random variable with mean 0 and variance σ², i.e.,
\[
P(Z \le x) = \frac{1}{\sqrt{2\pi\sigma^2}}\int_{-\infty}^{x} e^{-\frac{y^2}{2\sigma^2}}\,dy \qquad \text{for } x \in \mathbb{R}.
\]
Proof. By Lévy’s continuity theorem, it suffices to show that the characteristic function of S_n/√n converges to that of Z. Let φ(t) := E[e^{itX_1}]. Then φ′(0) = iE[X_1] = 0 and φ″(0) = −E[X_1²] = −σ². For each t ∈ R,
\[
E\big[e^{it\frac{S_n}{\sqrt{n}}}\big] = \varphi\Big(\frac{t}{\sqrt{n}}\Big)^{\!n}
= \Big(1 + \frac{\varphi''(0)}{2}\,\frac{t^2}{n} + o\Big(\frac{t^2}{n}\Big)\Big)^{\!n}
= \Big(1 - \frac{\sigma^2t^2}{2n} + o\Big(\frac{t^2}{n}\Big)\Big)^{\!n}
\xrightarrow[n\to\infty]{} e^{-\frac{\sigma^2t^2}{2}},
\]
which is precisely the characteristic function of Z, as can be seen by
\[
E[e^{itZ}] = \frac{1}{\sqrt{2\pi\sigma^2}}\int_{\mathbb{R}} e^{ity - \frac{y^2}{2\sigma^2}}\,dy
= \frac{e^{-\frac{\sigma^2t^2}{2}}}{\sqrt{2\pi\sigma^2}}\int_{\mathbb{R}} e^{-\frac{(y-it\sigma^2)^2}{2\sigma^2}}\,dy
= e^{-\frac{\sigma^2t^2}{2}}.
\]
Therefore S_n/√n converges in distribution to Z.
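A simulation sketch of Theorem 2.1 (all choices ours: uniform summands on [−1, 1], so σ² = 1/3, with n = 1000 and a fixed seed): the empirical quantiles of S_n/√n should match those of N(0, 1/3).

```python
# Simulation sketch of the CLT with uniform summands on [-1, 1] (sigma^2 = 1/3).
import numpy as np

rng = np.random.default_rng(42)
n, trials = 1000, 20_000
X = rng.uniform(-1, 1, size=(trials, n))
Z = X.sum(axis=1) / np.sqrt(n)                     # samples of S_n / sqrt(n)
print(round(Z.mean(), 3), round(Z.var(), 3))       # ~ 0.0 and ~ 0.333
# empirical quantiles vs. N(0, 1/3) quantiles at levels 0.1, 0.25, 0.75, 0.9
print(np.round(np.quantile(Z, [0.1, 0.25, 0.75, 0.9]), 2))
print(np.round(np.sqrt(1/3) * np.array([-1.2816, -0.6745, 0.6745, 1.2816]), 2))
```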
3 Lindeberg’s Theorem
The central limit theorem can be regarded as a result about a triangular array of random variables. Namely, if we let X_{n,i} = X_i/√n, then
\[
\begin{array}{llll}
X_{1,1}; \\
X_{2,1}, & X_{2,2}; \\
\quad\vdots \\
X_{n,1}, & \cdots, & X_{n,n};
\end{array}
\]
forms a triangular array. For S_n/√n = X_{n,1} + X_{n,2} + ⋯ + X_{n,n} to converge to a Gaussian distribution, it is actually not necessary to assume X_{n,1}, …, X_{n,n} to be i.i.d. It suffices to assume that X_{n,1}, …, X_{n,n} are independent and uniformly small in a suitable sense.
Theorem 3.1 [Lindeberg’s Theorem] For each n ∈ N, let (X_{n,j})_{1≤j≤n} be independent random variables with E[X_{n,j}] = 0 and ∑_{j=1}^n E[X_{n,j}²] = 1. If (X_{n,j}) satisfies Lindeberg’s condition, namely that for each ε > 0,
\[
\sum_{j=1}^{n} E\big[X_{n,j}^2\,\mathbb{1}_{\{|X_{n,j}|>\varepsilon\}}\big] \xrightarrow[n\to\infty]{} 0, \tag{3.6}
\]
then Z_n := X_{n,1} + ⋯ + X_{n,n} converges in distribution to a standard normal (or Gaussian) random variable, i.e., a normal random variable with mean 0 and variance 1.
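For the classical array X_{n,j} = Y_j/√n with Y_j i.i.d., mean 0 and variance 1, the Lindeberg sum in (3.6) reduces to E[Y_1² 1_{|Y_1|>ε√n}], which tends to 0 by dominated convergence. A Monte Carlo sketch (Y ~ N(0,1) and ε = 0.1 are arbitrary choices):

```python
# Monte Carlo sketch of Lindeberg's condition for X_{n,j} = Y_j / sqrt(n):
# the sum in (3.6) equals E[Y^2 1_{|Y| > eps*sqrt(n)}] and decays to 0.
import numpy as np

rng = np.random.default_rng(7)
Y = rng.standard_normal(1_000_000)
eps = 0.1
for n in (10, 100, 1000, 10_000):
    print(n, np.mean(Y**2 * (np.abs(Y) > eps * np.sqrt(n))))
```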
We will give two proofs of Theorem 3.1. The first proof is based on the convergence of characteristic functions, while the second is based on Lindeberg’s original proof, which has recently attracted renewed interest because Lindeberg’s method also applies to more complex (non-linear) functionals of independent random variables, where the distribution of the functional need not be asymptotically Gaussian. One recent example in the literature is the study of eigenvalues of a random matrix with independent entries. As the matrix size tends to infinity, it turns out that many statistics of the eigenvalues are asymptotically independent of the distribution of the individual matrix entries, similar in spirit to the central limit theorem.
Proof by characteristic functions. Let φ_{n,j}(t) = E[e^{itX_{n,j}}], and φ_n(t) = E[e^{itZ_n}] = ∏_{j=1}^n φ_{n,j}(t). It suffices to show that
\[
\varphi_n(t) \to e^{-\frac{t^2}{2}} \qquad \text{for each } t \in \mathbb{R}. \tag{3.7}
\]
Let σ_{n,j}² := E[X_{n,j}²]. We need to Taylor expand φ_{n,j}(t) = 1 − σ_{n,j}²t²/2 + R_{n,j}(t) with a good control on the remainder R_{n,j}(t), so that we can approximate φ_{n,j}(t) by e^{−σ_{n,j}²t²/2}, which will then imply (3.7).

Note that by Taylor expansion with remainder, we have
\[
e^{ix} = 1 + ix - \frac{x^2}{2} - \frac{i}{2}\int_0^x e^{is}(x-s)^2\,ds. \tag{3.8}
\]
Therefore
\[
\Big|e^{ix} - 1 - ix + \frac{x^2}{2}\Big| \le \frac{1}{6}|x|^3. \tag{3.9}
\]
On the other hand, using integration by parts,
\[
\int_0^x e^{is}(x-s)^2\,ds = \Big[i^{-1}e^{is}(x-s)^2\Big]_{s=0}^{s=x} + 2i^{-1}\int_0^x e^{is}(x-s)\,ds = -i^{-1}x^2 + 2i^{-1}\int_0^x e^{is}(x-s)\,ds.
\]
Therefore
\[
\Big|e^{ix} - 1 - ix + \frac{x^2}{2}\Big| = \frac{1}{2}\Big|\int_0^x e^{is}(x-s)^2\,ds\Big| \le |x|^2. \tag{3.10}
\]
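Both remainder bounds can be confirmed numerically at once (a sketch; the grid on [−20, 20] is an arbitrary choice): |e^{ix} − 1 − ix + x²/2| ≤ min{|x|³/6, x²}.

```python
# Sketch: check |e^{ix} - 1 - ix + x^2/2| <= min(|x|^3/6, x^2) from (3.9)-(3.10).
import numpy as np

x = np.linspace(-20, 20, 100_001)
lhs = np.abs(np.exp(1j * x) - 1 - 1j * x + x**2 / 2)
print(bool(np.all(lhs <= np.minimum(np.abs(x)**3 / 6, x**2) + 1e-12)))  # True
```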
Combining (3.9) and (3.10) then shows that for any random variable Y with E[Y] = 0 and E[Y²] = σ_Y²,
\[
\Big|E[e^{itY}] - 1 + \frac{\sigma_Y^2t^2}{2}\Big| = \Big|E\Big[e^{itY} - 1 - itY + \frac{t^2Y^2}{2}\Big]\Big| \le E\big[\min\{t^2Y^2,\ |tY|^3/6\}\big]. \tag{3.11}
\]
Applying this bound to X_{n,j} then gives, for any ε > 0,
\[
|R_{n,j}(t)| = \Big|\varphi_{n,j}(t) - 1 + \frac{\sigma_{n,j}^2t^2}{2}\Big|
\le E\big[\min\{t^2X_{n,j}^2,\ |tX_{n,j}|^3/6\}\big]
\le t^2\,E\big[X_{n,j}^2\,\mathbb{1}_{\{|X_{n,j}|>\varepsilon\}}\big] + \frac{\varepsilon|t|^3}{6}\,E[X_{n,j}^2].
\]
Therefore
\[
\sum_{j=1}^{n} |R_{n,j}(t)| \le t^2\sum_{j=1}^{n} E\big[X_{n,j}^2\,\mathbb{1}_{\{|X_{n,j}|>\varepsilon\}}\big] + \frac{\varepsilon|t|^3}{6}.
\]
By Lindeberg’s condition (3.6) and the fact that ε > 0 can be chosen arbitrarily small, we have
\[
\sum_{j=1}^{n} |R_{n,j}(t)| \xrightarrow[n\to\infty]{} 0. \tag{3.12}
\]
Again by Lindeberg’s condition, we note that for any ε > 0,
\[
\max_{1\le j\le n}\sigma_{n,j}^2 = \max_{1\le j\le n}E[X_{n,j}^2]
\le \max_{1\le j\le n}E\big[X_{n,j}^2\,\mathbb{1}_{\{|X_{n,j}|\le\varepsilon\}}\big] + \max_{1\le j\le n}E\big[X_{n,j}^2\,\mathbb{1}_{\{|X_{n,j}|>\varepsilon\}}\big]
\le \varepsilon^2 + \sum_{j=1}^{n}E\big[X_{n,j}^2\,\mathbb{1}_{\{|X_{n,j}|>\varepsilon\}}\big]
\xrightarrow[n\to\infty]{} \varepsilon^2.
\]
Since ε > 0 can be chosen arbitrarily, it follows that
\[
\max_{1\le j\le n}\sigma_{n,j}^2 \xrightarrow[n\to\infty]{} 0. \tag{3.13}
\]
By (3.12) and (3.13), we can Taylor expand
\[
\log\varphi_{n,j}(t) = \log\Big(1 - \frac{\sigma_{n,j}^2t^2}{2} + R_{n,j}(t)\Big)
= -\frac{\sigma_{n,j}^2t^2}{2} + R_{n,j}(t) + O\Big(\Big(\frac{\sigma_{n,j}^2t^2}{2} + |R_{n,j}(t)|\Big)^2\Big),
\]
and hence
\[
\log\varphi_n(t) = \sum_{j=1}^{n}\log\varphi_{n,j}(t)
= -\frac{t^2}{2} + O\Big(\sum_{j=1}^{n}\big(|R_{n,j}(t)| + |R_{n,j}(t)|^2\big) + \max_{1\le j\le n}\sigma_{n,j}^2\,t^4\Big)
\xrightarrow[n\to\infty]{} -\frac{t^2}{2},
\]
which proves (3.7).
Proof by Lindeberg’s method. To prove that Z_n := ∑_{j=1}^n X_{n,j} converges in distribution to a standard normal random variable W, by the definition of weak convergence, it suffices to show that E[f(Z_n)] → E[f(W)] for all bounded continuous f : R → R. In fact, since for any −∞ < a < b < ∞, the indicator function 1_{(a,b]}(x) can be approximated from above and below by f ∈ C_c³(R), the space of three-times continuously differentiable functions with compact support, it suffices to show that
\[
E[f(Z_n)] \xrightarrow[n\to\infty]{} E[f(W)] \qquad \text{for all } f \in C_c^3(\mathbb{R}). \tag{3.14}
\]
Let us denote g_n(X_{n,1}, X_{n,2}, …, X_{n,n}) = f(Z_n) = f(∑_{j=1}^n X_{n,j}). The basic idea is that under Lindeberg’s condition, the influence of each coordinate X_{n,j} on g_n is small. Therefore we will successively exchange X_{n,1}, X_{n,2}, … with Gaussian random variables W_{n,1}, W_{n,2}, … with the same mean and variance, and hope that the aggregate effect of these exchanges on g_n is still small. More precisely, let W_{n,j}, 1 ≤ j ≤ n, be independent Gaussian random variables with mean 0 and variance σ_{n,j}². Note that W := ∑_{j=1}^n W_{n,j} is a standard normal random variable. We can rewrite (3.14) as
\[
E[g_n(X_{n,1},\dots,X_{n,n})] - E[g_n(W_{n,1},\dots,W_{n,n})] \xrightarrow[n\to\infty]{} 0. \tag{3.15}
\]
We rewrite the above difference as a telescoping sum:
\[
E[g_n(X_{n,1},\dots,X_{n,n})] - E[g_n(W_{n,1},\dots,W_{n,n})]
= \sum_{j=1}^{n}\Big(E[g_n(X_{n,1},\dots,X_{n,j},W_{n,j+1},\dots,W_{n,n})] - E[g_n(X_{n,1},\dots,X_{n,j-1},W_{n,j},\dots,W_{n,n})]\Big).
\]
In the j-th term, we effectively switched the j-th argument of g_n from W_{n,j} to X_{n,j}, and we can bound this difference by Taylor expansion as follows. Given x := (x_1, …, x_n) ∈ R^n and y ∈ R, let us denote h^x_{n,j}(y) = g_n(x_1, …, x_{j−1}, y, x_{j+1}, …, x_n). Then
\[
h^x_{n,j}(y) = h^x_{n,j}(0) + \frac{dh^x_{n,j}}{dy}(0)\,y + \frac{1}{2}\frac{d^2h^x_{n,j}}{dy^2}(0)\,y^2 + R^x_{n,j}(y),
\]
where by the same arguments as in (3.8)–(3.10), the remainder
\[
R^x_{n,j}(y) = \frac{1}{2}\int_0^y \frac{d^3h^x_{n,j}}{dy^3}(t)\,(y-t)^2\,dt
\]
satisfies
\[
|R^x_{n,j}(y)| \le \frac{1}{6}\Big|\frac{d^3h^x_{n,j}}{dy^3}\Big|_\infty |y|^3 = \frac{1}{6}|f'''|_\infty|y|^3, \tag{3.16}
\]
\[
|R^x_{n,j}(y)| \le \Big|\frac{d^2h^x_{n,j}}{dy^2}\Big|_\infty y^2 = |f''|_\infty y^2, \tag{3.17}
\]
where |φ|_∞ := sup_x |φ(x)| denotes the supremum norm of a function φ, and we can assume without loss of generality that |f|_∞, |f′|_∞, |f″|_∞ and |f‴|_∞ are all bounded by 1. Similar to (3.11), for any x ∈ R^n and any random variable Y with E[Y] = 0 and E[Y²] = σ_Y², we have
\[
\Big|E[h^x_{n,j}(Y)] - h^x_{n,j}(0) - \frac{\sigma_Y^2}{2}\frac{d^2h^x_{n,j}}{dy^2}(0)\Big| \le E\big[\min\{Y^2,\ |Y|^3/6\}\big].
\]
In particular, for each 1 ≤ j ≤ n, by first conditioning on X_{n,i} and W_{n,i} with i ≠ j, we obtain
\[
\big|E[g_n(X_{n,1},\dots,X_{n,j},W_{n,j+1},\dots,W_{n,n})] - E[g_n(X_{n,1},\dots,X_{n,j-1},W_{n,j},\dots,W_{n,n})]\big|
\le E\big[\min\{X_{n,j}^2,\ |X_{n,j}|^3\}\big] + E\big[\min\{W_{n,j}^2,\ |W_{n,j}|^3\}\big],
\]
and hence
\[
\big|E[g_n(X_{n,1},\dots,X_{n,n})] - E[g_n(W_{n,1},\dots,W_{n,n})]\big|
\le \sum_{j=1}^{n}E\big[\min\{X_{n,j}^2,\ |X_{n,j}|^3\}\big] + \sum_{j=1}^{n}E\big[\min\{W_{n,j}^2,\ |W_{n,j}|^3\}\big]. \tag{3.18}
\]
j=1
As shown in the calculations leading to (3.12), Lindeberg’s condition (3.6) implies that the
first sum tends to 0 as n → ∞. Since Wn,j are Gaussian random variables with the same
mean and variance as Xn,j , using the bound
2
4
4
1{|Wn,j |>} ] ≤ −2 E[Wn,j
] = 3−2 σn,j
,
E[Wn,j
it is easily seen that (Wn,j )1≤j≤n also satisfies Lindeberg’s condition, and hence the second
P
sum in (3.18) also tends to 0 as n → ∞, which concludes the proof that nj=1 Xn,j converges
in distribution to a standard normal random variable.
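The swapping argument itself is easy to simulate (a sketch with our own choices: X_{n,j} = ±1/√n fair signs, the smooth bounded test function f(z) = e^{−z²}, and n = 200): as the first j coordinates are replaced by Gaussians of matching variance, E[f(sum)] barely moves.

```python
# Sketch of the swapping argument: replace the coordinates of Z_n one block at
# a time by Gaussians with matching variance and track E[f(sum)].
import numpy as np

rng = np.random.default_rng(3)
n, trials = 200, 20_000
f = lambda z: np.exp(-z**2)                        # smooth bounded test function
X = rng.choice([-1.0, 1.0], size=(trials, n)) / np.sqrt(n)   # the X_{n,j}
W = rng.standard_normal((trials, n)) / np.sqrt(n)            # the W_{n,j}
for j in range(0, n + 1, 50):
    hybrid = np.concatenate([W[:, :j], X[:, j:]], axis=1)    # first j swapped
    print(j, round(f(hybrid.sum(axis=1)).mean(), 3))  # ~ constant (~0.577 = E[f(W)])
```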
As a last remark, we note that neither characteristic functions nor Lindeberg’s method can be easily adapted to prove the CLT for sums of dependent random variables. There is, however, a method particularly suited for proving the CLT for dependent random variables, known as Stein’s method, which also gives bounds on the rate of convergence. Stein’s method has also been developed for the Poisson distribution, the exponential distribution, etc. See e.g. [2] for a recent survey.
References
[1] P. Lax. Functional Analysis. John Wiley & Sons, 2002.
[2] N. Ross. Fundamentals of Stein’s method. Probab. Surveys 8, 210–293, 2011.