Probability, Measure and Topology
September 7-13, 2011
Not Yet Finalized
Riemann Integrals on [0, 1]
Definition
For an integrable function f : [0, 1] → R, its integral is defined
as the limit of the Riemann sums as n → ∞ and
max{t2 − t1, · · · , tn − tn−1} → 0:

∑_{i=2}^n mi (ti − ti−1) ≤ ∫_0^1 f(x) dx ≤ ∑_{i=2}^n Mi (ti − ti−1),

where 0 = t1 < · · · < tn = 1 is a partition of [0, 1] and mi (Mi) is
the minimum (maximum) of f(x) on the interval [ti−1, ti].
The LHS (RHS) is the lower (upper) Riemann sum. The
integral exists provided that both lower and upper Riemann
sums have the same limit as n → ∞.
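The definition above can be illustrated numerically. The sketch below computes lower and upper Riemann sums for the illustrative choice f(x) = x² on a uniform partition (names and the example function are my own; since f is increasing, the min/max on each subinterval sit at the endpoints):

```python
# Lower and upper Riemann sums for f(x) = x^2 on [0, 1] (illustrative sketch).
def riemann_sums(f, n):
    ts = [i / n for i in range(n + 1)]  # uniform partition 0 = t_0 < ... < t_n = 1
    # f increasing: min on [t_{i-1}, t_i] at left endpoint, max at right endpoint
    lower = sum(f(ts[i - 1]) * (ts[i] - ts[i - 1]) for i in range(1, n + 1))
    upper = sum(f(ts[i]) * (ts[i] - ts[i - 1]) for i in range(1, n + 1))
    return lower, upper

lo, hi = riemann_sums(lambda x: x * x, 1000)
# Both sums squeeze the integral 1/3 as n grows; their gap here is 1/n.
```

As n → ∞ the two sums converge to the common value ∫_0^1 x² dx = 1/3, which is what Riemann integrability asserts.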
Riemann Integrals
Consider the function

f(x) = 1 if x is irrational, and 0 if x is rational.

This function is not integrable according to Riemann's definition
because the upper and lower Riemann sums do not converge
to the same limit.
Note that f(x) is in fact the limit of a sequence of integrable
functions fn(x): Let {q1, q2, q3, ...} denote an enumeration of the
rational numbers in [0, 1], and set

fn(x) = 1 if x ≠ q1, · · · , qn, and 0 otherwise.

Clearly, for each x ∈ [0, 1], fn(x) → f(x) as n → ∞. The integral
of each fn(x) is in fact 1.
Lebesgue Integral
To develop an integration theory on a general space X, that
is, to define the integral

∫_X f(x) dx

for a function f : X → R,
Lebesgue's idea is to partition the range of the function (i.e., R)
instead of its domain.
Suppose f(x) is a bounded function m ≤ f(x) ≤ M, and
m = t1 < t2 < · · · < tn = M is a partition of the interval [m, M]. Then

∑_{i=2}^n ti−1 s(f^{−1}([ti−1, ti))) ≤ ∫_X f(x) dx ≤ ∑_{i=2}^n ti s(f^{−1}([ti−1, ti)))

where s(f^{−1}([ti−1, ti))) is the size (measure) of the set
f^{−1}([ti−1, ti)).
Illustration of Lebesgue Integration
For the function f(x, y) = x² − y² + 2xy − x defined on the unit
disk.
Lebesgue Integral
This approach is necessary if X is some abstract space (e.g.,
just a set).
The main idea is, for an abstract space X , we associate it a
certain structure that will allow the definition of the integrals.
This structure is a collection of subsets whose sizes (or
measures) are known. Furthermore, depending on the
structure of these subsets, we can define a class of functions
that are integrable.
This leads to the notion of σ-algebra and measurable functions
on X .
σ-Algebra (a.k.a. σ-field)
Let X be a set. A collection F of subsets of X is a σ-algebra if
- X, ∅ ∈ F.
- F is closed under complement: if A ∈ F, then A^c ∈ F.
- F is closed under countable union: if A1, A2, · · · ∈ F, then ∪_{i=1}^∞ Ai ∈ F.

1. The pair (X, F) is a measurable space, where F provides
all the measurable sets.
2. F is also closed under countable intersection.
3. (X, F) is called measurable because the σ-algebra is exactly the data
needed to define measures on X.
Measurable Spaces
Examples
- The powerset of X is a σ-algebra.
- The collection {X, ∅} is also a (trivial) σ-algebra.
- If A1, A2, · · · are σ-algebras, then ∩_{i=1}^∞ Ai is also a σ-algebra.
- If A is a σ-algebra of X and Y ⊂ X, the restriction of A to Y,
{U ∩ Y : U ∈ A} (which includes ∅), is a σ-algebra of Y.

Let A = {Ai}_{i∈I} be a family of subsets of X. Define σ(A) to be the
smallest σ-algebra containing all Ai.
The existence of σ(A) follows from the simple fact that the
intersection of σ-algebras is also a σ-algebra.
A Simple Example
Let X be a set with three elements, X = {a, b, c}.
Consider the following three σ-algebras:

{X, ∅},   {X, ∅, {a}, {b, c}},

and the powerset. We will only assign measures to elements in
the σ-algebra.
Question: What functions can be integrated for each σ-algebra?
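On a finite set the three axioms can be checked mechanically, since countable unions reduce to pairwise unions. A minimal sketch (the helper name and the failing example are my own):

```python
# Brute-force σ-algebra check on a finite set, applied to the three
# collections on X = {a, b, c} from the example above.
from itertools import chain, combinations

def is_sigma_algebra(X, F):
    F = {frozenset(A) for A in F}
    if frozenset(X) not in F or frozenset() not in F:
        return False
    # closed under complement
    if any(frozenset(X - set(A)) not in F for A in F):
        return False
    # on a finite set, closure under countable union = closure under pairwise union
    return all(frozenset(set(A) | set(B)) in F for A in F for B in F)

X = {"a", "b", "c"}
trivial = [set(), X]
middle = [set(), X, {"a"}, {"b", "c"}]
powerset = [set(s) for s in chain.from_iterable(combinations(X, r) for r in range(4))]
# All three pass; dropping {b, c} from the middle one breaks the complement axiom.
```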
Borel Algebra B(R) on R
The standard σ-algebra on R is called Borel Algebra (its
elements are the Borel sets).
B(R) is generated by all intervals of the form (−∞, a] for some
a ∈ R.
From this, one can show that it is the same σ-algebra
generated by intervals of the form
1. (−∞, a)
2. [a, ∞)
3. (a, ∞)
4. (a, b)
5. [a, b]
Measurable Functions
For a measurable space (X, F), a function f : X → R is
measurable if for any Borel set B ∈ B(R), its pre-image
f^{−1}(B) ∈ F.
Examples
- If F is the trivial σ-algebra, what are the measurable functions?
- If F is the powerset of X, what are the measurable functions?

In general, given two measurable spaces (X, F), (Y, G), a
mapping f : X → Y is measurable if f^{−1}(B) ∈ F for every
B ∈ G.
Examples
Let X = {1, 2, 3, · · · } be a countably infinite set. Take the
powerset of X as the σ-algebra F.
For U ∈ F, we define the counting measure µ as

µ(U) = |U| if U is a finite subset, and µ(U) = ∞ if U is an infinite subset.

Let X = R, and F the powerset of X. Suppose
P = {p1, · · · , pn} is a finite set of points in X. For U ∈ F, we
define the counting measure µ as

µ(U) = |U ∩ P|.

Note that µ is a measure on R that is not translation invariant
(e.g., µ([0, 1]) ≠ µ([1, 2])).
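The point-mass counting measure µ(U) = |U ∩ P| is easy to sketch in code (the particular points in P are an arbitrary illustrative choice; sets are represented by membership predicates):

```python
# Counting measure concentrated on a finite set of points P.
P = {0.25, 0.5, 1.5}

def mu(U):
    """Measure of a set U, given as a membership predicate on R."""
    return sum(1 for p in P if U(p))

interval_01 = lambda x: 0 <= x <= 1
interval_12 = lambda x: 1 <= x <= 2
# µ is not translation invariant: µ([0, 1]) = 2 while µ([1, 2]) = 1.
```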
Probability Space
A measure space (M, F, µ) is a probability space if
µ(M) = 1.
Elements in F are called the events, and for an event U ∈ F, its
probability is µ(U).
A random variable X is a measurable function on M.
Let X be a real-valued random variable: X : M → R. We can
define a measure µ∗ on R by

µ∗(U) = µ(X^{−1}(U)),   U a Borel set.

µ∗ is the distribution of the random variable X.
A discrete-time stochastic process is a sequence of
(real-valued) random variables X1 , X2 , ... defined on the same
probability space.
A continuous-time stochastic process is a set of (real-valued)
random variables Xt defined on the same probability space
parameterized by t ∈ [0, ∞).
Non-measurable Set
A famous example is the Vitali set that shows not every subset
of [0, 1] can be measurable:
A Vitali set V is a subset of [0, 1] which, for each real number
r ∈ R, contains exactly one v ∈ V such that v − r is rational.
- Partition R into subsets {Si}_{i∈I} such that x, y ∈ Si if and
only if x − y is a rational number.
- Each Si has non-empty intersection with [0, 1]; pick a
vi ∈ [0, 1] ∩ Si.
- Set V = {vi}_{i∈I}.

Clearly V ⊂ [0, 1]. The claim is that V cannot be measurable
with respect to Lebesgue measure:
Suppose V is measurable and has measure 0 ≤ µ(V) < ∞.
Non-Measurability of Vitali Set
Let q1 , q2 , · · · be an enumeration of the rational numbers in
[−1, 1].
The disjoint sets Wi = qi + V cover [0, 1] and lie within
[−1, 2]:

[0, 1] ⊂ ∪_{i=1}^∞ Wi ⊂ [−1, 2].

Furthermore, µ(Wi) = µ(V) for every i, because Wi is a translate
of V and translates have the same measure. By countable additivity,

1 ≤ ∑_{i=1}^∞ µ(V) ≤ 3.

There exists no such number µ(V)!
Non-measurable Set
The point is that [0, 1] contains many such strange and
weird subsets that simply cannot be assigned a length.
Hence, not every subset of [0, 1] can be (Lebesgue)
measurable!
In higher dimensions, the situation is even more strange and
defies one’s intuition. The famous example is the
Banach-Tarski Paradox in R3 . See
http://en.wikipedia.org/wiki/Banach-Tarski_paradox
Lebesgue Measure µ on (0, 1]
This can be done in steps:
- For any half-closed interval (a, b] ⊂ (0, 1],
µ((a, b]) = b − a.
- For any disjoint half-closed intervals I1, · · · , In ⊂ (0, 1],
µ(∪_{i=1}^n Ii) = µ(I1) + · · · + µ(In).

The Caratheodory Extension Theorem then allows one to
extend µ to the entire Borel algebra of (0, 1]. The
extension is essentially unique in this case.
To define Lebesgue measure on R, write R as a disjoint union:

R = ∪_{n∈Z} (n, n + 1].

Each half-closed interval (n, n + 1] carries its Lebesgue measure
µ. For a Borel set U ⊂ R,

µ(U) = ∑_{n=−∞}^∞ µ(U ∩ (n, n + 1]).
More On Measurable Functions
Measurable functions enjoy the following properties:
Consider functions f, f1, f2, ... : X → R and g : R → R.
- If g is continuous, then it is measurable (will be proved later).
- If f, f1, f2 are measurable, then so are αf, f1 + f2 and f1 × f2.
- If f and g are measurable, their composition
g ◦ f(x) = g(f(x)) is again measurable.
- If f1, f2, ... are measurable, then so are
sup_k fk, inf_k fk, lim sup_k fk and lim inf_k fk.

Let us see why inf_k fk and sup_k fk are measurable. Define

s(x) := inf_k fk(x),   S(x) := sup_k fk(x).
More On Measurable Functions
What is the set S^{−1}((−∞, b])? Since S(x) ≤ b iff fk(x) ≤ b for all k,

S^{−1}((−∞, b]) = ∩_{k=1}^∞ fk^{−1}((−∞, b]).

Similarly, since s(x) < b iff fk(x) < b for some k,

s^{−1}((−∞, b)) = ∪_{k=1}^∞ fk^{−1}((−∞, b)).

This is in contrast with continuous functions. Consider

fk(x) = x^k,   k ≥ 1

on [0, 1]. s(x) = inf_k fk(x) is not continuous.
Lebesgue Integrals on (X , F, µ)
- Define integrals for simple functions.
- Define integrals for bounded measurable functions as the
limit of integrals of simple functions.
- Define integrals for arbitrary measurable functions.

Let A ∈ F be a measurable set. Its indicator function 1_A,

1_A(x) = 1 if x ∈ A, and 0 otherwise,

is called an elementary function.
Simple functions are (finite) linear combinations of elementary
functions:

f = ∑_{i=1}^n αi 1_{Ai},   A1, · · · , An ∈ F, α1, · · · , αn ∈ R.
Lebesgue Integrals of Simple Functions
The integral of a simple function f = ∑_{i=1}^n αi 1_{Ai} is easy to define:

∫_X f dµ = ∑_{i=1}^n αi µ(Ai).

For any measurable subset B ∈ F of X,

∫_B f dµ = ∫_X f 1_B dµ.

The product f 1_B is also a simple function:

f 1_B = ∑_{i=1}^n αi 1_{Ai ∩ B}.
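The two formulas above translate directly into code on a finite measure space (a sketch; the space, measure values and the particular simple function are made-up illustrations):

```python
# Integral of a simple function f = Σ α_i 1_{A_i} on a finite measure space.
X = {1, 2, 3, 4}
mu = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}  # measure of each singleton

def measure(A):
    return sum(mu[x] for x in A)

def integrate_simple(terms):
    """terms: list of (alpha_i, A_i) pairs; returns Σ α_i µ(A_i)."""
    return sum(alpha * measure(A) for alpha, A in terms)

# f = 2·1_{1,2} + 5·1_{3}, so ∫_X f dµ = 2·(0.1+0.2) + 5·0.3 = 2.1
f = [(2.0, {1, 2}), (5.0, {3})]

def integrate_on(terms, B):
    # Restriction to B uses f·1_B = Σ α_i 1_{A_i ∩ B}.
    return integrate_simple([(a, A & B) for a, A in terms])
```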
Lebesgue Integrals
The integral for simple functions has the following properties:
- If f ≥ 0, then ∫ f dµ ≥ 0.
- ∫ (af + bg) dµ = a ∫ f dµ + b ∫ g dµ, for all a, b ∈ R and
simple functions f, g.
- If f(x) ≥ g(x), then ∫ f dµ ≥ ∫ g dµ.

Next we extend the integral to bounded measurable functions
using the following result.
Lemma
If f : X → R is bounded and measurable, then we can find
sequences of simple functions {fn^−}, {fn^+}, n = 1, 2, · · · such that,
as n → ∞, fn^− ↑ f, fn^+ ↓ f and

fn^−(x) ≤ f(x) ≤ fn^+(x) ≤ fn^−(x) + 2^{−n}

for every x ∈ X.
Lebesgue Integrals
The Lemma can be easily "proved" as follows
Divide the interval [min f , max f ] into subintervals with lengths
less than 2−n . Here we use the hypothesis that f is a bounded
measurable function. How do we define the sequences
{fn }, {fn }?
Together, we have
Z
Z
Z
fn dµ ≤ fn dµ ≤ fn dµ + 2−n µ(X )
for all n ≥ 1. This allows us to define
Z
Z
Z
fdµ = lim
fn dµ = lim
fn dµ
n→∞
n→∞
Lebesgue Integrals
The integral defined for the bounded measurable functions
clearly satisfies the following:
- If f ≥ 0, then ∫ f dµ ≥ 0.
- ∫ (af + bg) dµ = a ∫ f dµ + b ∫ g dµ, for all a, b ∈ R and
bounded measurable functions f, g.
- If f(x) ≥ g(x), then ∫ f dµ ≥ ∫ g dµ.

Finally, we extend the integral to arbitrary measurable functions
in two steps.
Let f : X → R≥0 be a non-negative measurable function. Define
a sequence of bounded measurable functions

fn(x) = min(f(x), n).

Clearly 0 ≤ fn ≤ f and fn ↑ f as n → ∞. It follows that

∫ fn dµ

is an increasing sequence of non-negative real numbers, hence
has a limit (possibly ∞).
Lebesgue Integrals
Therefore, we define

∫ f dµ = lim_{n→∞} ∫ fn dµ.

Finally, for an arbitrary measurable function f, we can write
f = f⁺ − f⁻, where f⁺, f⁻ are non-negative measurable
functions:

f⁺(x) = max(f(x), 0),   f⁻(x) = − min(f(x), 0).

We define the integral of f as

∫ f dµ = ∫ f⁺ dµ − ∫ f⁻ dµ,

provided both integrals on the right are not ∞ (i.e., ∫ |f| dµ < ∞,
to rule out ∞ − ∞). Again, we have
- If f ≥ 0, then ∫ f dµ ≥ 0.
- ∫ (af + bg) dµ = a ∫ f dµ + b ∫ g dµ, for all a, b ∈ R and
integrable functions f, g.
- If f(x) ≥ g(x), then ∫ f dµ ≥ ∫ g dµ.
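The positive/negative-part decomposition f = f⁺ − f⁻ is a pointwise recipe, sketched below (the example function is an arbitrary illustration):

```python
# Positive and negative parts of a function: f⁺ = max(f, 0), f⁻ = −min(f, 0).
def f_plus(f):
    return lambda x: max(f(x), 0.0)

def f_minus(f):
    return lambda x: -min(f(x), 0.0)

f = lambda x: x - 0.5
fp, fm = f_plus(f), f_minus(f)
# Both parts are non-negative; f⁺ − f⁻ recovers f, and f⁺ + f⁻ = |f|.
```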
Topology
Now we know how to integrate functions defined on an abstract
measure space (X , F, µ).
We next extend the concept of continuity and convergence to
an abstract space X .
Specifically, what data do we need to specify on X in order to
talk about
1. whether a function f : X → R is continuous on X;
2. whether a sequence of points x1, x2, · · · ∈ X converges to a limit
lim_{n→∞} xn = x ∈ X.
Not surprisingly, the necessary data is again a collection T of
subsets of X , called its topology.
Topology
A topology T on a set X is a collection of subsets of X
satisfying the following three axioms:
1. X , ∅ ∈ T .
2. T is closed under arbitrary unions of subsets in T , i.e., for
an index set I,
if Ui ∈ T for all i ∈ I, then ∪_{i∈I} Ui ∈ T .
3. T is closed under finite intersections of subsets in T , i.e.,
if U1, · · · , Un ∈ T , then ∩_{i=1}^n Ui ∈ T .
Subsets in T are called the open sets and the complement
of an open set is a closed set.
Notice the differences between the topology axioms and the
σ-algebra axioms:
- T is not closed under taking complements.
- There is an asymmetry between the union and the intersection.

Examples
- The powerset of X is a topology of X.
- The collection {X, ∅} is also a (trivial) topology.
- Let X have three points .....
Basis of Topology
With the abstract definitions out of the way, let's generate a few
topologies using the notion of a basis.
A basis for a topology T on X is a collection B of elements of T
such that
- For each x ∈ X, ∃B ∈ B such that x ∈ B.
- If x ∈ B1 ∩ B2 with B1, B2 ∈ B, ∃B3 ∈ B such that x ∈ B3 ⊂ B1 ∩ B2.
- For each open set U ∈ T and each x ∈ U, ∃B ∈ B such that x ∈ B ⊂ U.

Elements of B are called the basis open sets (or
neighborhoods). It can be shown quite easily that every open
set U ∈ T is a union of basis open sets.
Topology on R. This is the topology with basis neighborhoods
the open intervals of the form (a, b), for a < b.
1. Any interval of the form [a, b] is a closed set.
2. Any interval of the form (a, b], [a, b) is neither closed nor
open.
Topology
If x ∈ X and x ∈ U ∈ T , U is an (open) neighborhood of x.
A function (map) f : X1 → X2 between two topological spaces
(X1, T1), (X2, T2) is continuous if

f^{−1}(U) ∈ T1 for any U ∈ T2.

Again, notice the similarity with the definition of measurable
functions.
Notice that the continuity is phrased entirely in terms of the
topology T (open sets).
Does the new definition agree with the notion of continuity we
have learned earlier? Yes!
Topology
Subspace Topology
If Y ⊂ X is a subset of X , Y has the subspace topology TY
given by
V ∈ TY if ∃U ∈ T , such that V = U ∩ Y .
Limit Point
Let Y be a subset of X . A point x ∈ X is a limit point of Y if
every neighborhood of x intersects Y in some point other than
x.
If Y = {x1, x2, · · · } ⊂ X, then x is a limit of the sequence x1, x2, · · ·
if x is a limit point of the subset Y.
Topology
We have our first two results
Lemma
A closed set in X contains all its limit points.
Lemma
Let f be a continuous function defined on X, and x1, x2, ... → x
be a sequence of points converging to x in X. Then,

lim_{n→∞} f(xn) = f(x).
Topology
Occasionally, the problem demands that the topology be
generated by a collection of functions. That is, we are given a
set X and a collection F of (real-valued) functions on X . We
can define a topology T such that every function in F is
continuous.
What are the open sets in the topology?
How do we characterize the convergence of a sequence of
points xn → x to its limit?
The Topologies of Z and Q
Topology and Measure
Finally, let’s put topology and measure together:
Starting with an abstract space (i.e., set) X ,
1. Specify a topology T of X .
2. Construct the Borel σ-algebra F generated by open sets in
T.
3. Define a measure µ using the Borel σ-algebra F.
So the structures we put on X are T , F, µ, and they allow us
to integrate functions on X, talk about continuity of functions on
X, and talk about convergence of points in X. In particular, we have
1. Open sets and closed sets are in F.
2. Continuous functions are measurable functions.
Probability
Finally, we have learned enough machinery to start modern
treatment of probability!
The basic ingredient is the sample space Ω, which is a measure
space with σ-algebra F and probability measure P, P(Ω) = 1.
- Subsets A ∈ F are called events and they can be
assigned probability P(A).
- A real-valued random variable X is a measurable function
X : Ω → R.
- The probability of X ∈ B for any Borel set B ⊂ R is given by
P(X^{−1}(B)).
Probability
In classical probability, a central notion associated to a random
variable is its distribution function:
For a real-valued random variable X , its cumulative distribution
function
FX (x) = Probability(X ∈ (−∞, x]).
The modern viewpoint is to interpret FX as a probability measure
on (R, B(R)):
- A probability measure is just a measure P on a measurable space
(Ω, F) such that P(Ω) = 1.
- We can turn FX(x) into a measure PX on (R, B(R)) by
defining (FX(x) is non-decreasing)

PX((a, b]) = FX(b) − FX(a).
Probability
In the previous setup, the probability measure PX and the
values taken by the random variable X are defined on the same
space, and things can get murky.
One main conceptual advance in the modern treatment of
probability is to separate the distribution of a random variable from its
abstract probability space: the distribution function FX(x) is
now considered a measure derived via the usual
"push-forward" construction, and it is the abstract probability
space (Ω, F, P) that is the focus of attention:

PX(B) = P(X^{−1}(B))

for any Borel set B ∈ B(R).
Probability
What are the advantages of such a novel viewpoint?
- The new viewpoint takes out much of the randomness in the
random variable X. X is now a function defined on the space
Ω, a very concrete object.
- All classical quantities such as means, variances, etc., can
be defined. For example, the expectation EX of a random
variable X is

EX = ∫_Ω X dP.

Its variance is Var(X) = E((X − EX)²). Its p-th moment is

∫_Ω X^p dP.

- The new approach allows many concepts, notions, and
proofs to be simplified, i.e., the new language introduced is
more expressive and more concrete.
- The abstract approach gives us a flexible framework that
allows many more interesting examples to be constructed
and studied.
Random Variables
By defining random variables as measurable functions on Ω, we
get a concrete mathematical handle for random variables (not
as random as before!).
Definition
The space Lp(P) is the collection of all random variables
X : Ω → R such that |X|^p is P-integrable:

‖X‖p := (E{|X|^p})^{1/p} = (∫_Ω |X|^p dP)^{1/p} < ∞.

That is, the p-norm of X is finite.
These are also the random variables with finite p-th moment:

∫_Ω |X|^p dP < ∞.
Inequalities for Lp -spaces
1. Hölder's Inequality: Let p > 1 and 1/p + 1/q = 1. Then

‖fg‖1 ≤ ‖f‖p ‖g‖q

for all f ∈ Lp, g ∈ Lq.
2. Minkowski's (Triangle) Inequality: For all f, g ∈ Lp,

‖f + g‖p ≤ ‖f‖p + ‖g‖p.

3. Cauchy-Schwarz Inequality: For all f, g ∈ L²,

|∫ fg dP| ≤ ‖f‖2 ‖g‖2.

4. Jensen's Inequality: If ψ : R → R is a convex function and
X, ψ(X) ∈ L1(P), then

∫_Ω ψ(X) dP ≥ ψ(∫_Ω X dP).
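These inequalities can be sanity-checked numerically on a finite probability space, where the norms are just weighted sums (a sketch; the uniform weights, sample values and conjugate pair p = 3, q = 3/2 are arbitrary choices):

```python
# Numerical check of Hölder, Minkowski and Cauchy-Schwarz on a
# finite probability space with uniform weights.
import random

random.seed(0)
n = 5
w = [1.0 / n] * n                      # uniform probability on n points
f = [random.uniform(-2, 2) for _ in range(n)]
g = [random.uniform(-2, 2) for _ in range(n)]

def norm(h, p):
    # ‖h‖_p = (Σ w_i |h_i|^p)^{1/p}
    return sum(wi * abs(hi) ** p for wi, hi in zip(w, h)) ** (1.0 / p)

p, q = 3.0, 1.5                        # conjugate exponents: 1/p + 1/q = 1
holder_lhs = sum(wi * abs(fi * gi) for wi, fi, gi in zip(w, f, g))
mink_lhs = norm([fi + gi for fi, gi in zip(f, g)], p)
cs_lhs = abs(sum(wi * fi * gi for wi, fi, gi in zip(w, f, g)))
```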
Notion of Almost Surely
Let X , Y be two random variables defined on a probability
space (Ω, F, P).
Note that, in general, the following implication is NOT true:

‖X − Y‖p = 0 =⇒ X = Y.

However, the set S_{X≠Y} = {ω ∈ Ω : X(ω) ≠ Y(ω)} must have
measure zero (why?).
The two random variables agree except on a set of measure
zero, i.e., they are equal almost surely.
We will consider two random variables X, Y the same random
variable if they agree almost surely.
Under this equivalence, ‖ · ‖p is a norm on Lp (i.e., ‖X‖p = 0 iff
X = 0).
Embeddings Between Lp -spaces
Proposition
Let r > p ≥ 1. For any probability space (Ω, F, P), we have
Lr ⊆ Lp and

‖X‖p ≤ ‖X‖r,   ∀X ∈ Lr.

The proof is an exercise using Jensen's inequality:
- φ(x) = |x|^s is convex for all s ≥ 1. Set s = r/p.
- We have

‖X‖p^r = φ(∫ |X|^p dP) ≤ ∫ φ(|X|^p) dP = ‖X‖r^r.

In particular, L² ⊂ L¹. That is, any random variable X with finite
second moment also has finite first moment.
The inclusion Lr ⊆ Lp also holds for any finite measure space (Ω, F, µ)
with µ(Ω) < ∞.
Markov’s and Chebyshev’s Inequalities
We are now ready for the two classical inequalities:
Theorem (Markov’s Inequality)
Let X ∈ L1(P). Then for all λ > 0,

P{|X| ≥ λ} ≤ (1/λ) ∫_{|X|≥λ} |X| dP ≤ ‖X‖1 / λ.

Theorem (Chebyshev's Inequality)
For all p, λ > 0 and X ∈ Lp(P),

P{|X| ≥ λ} ≤ (1/λ^p) ∫_{|X|≥λ} |X|^p dP ≤ ‖X‖p^p / λ^p.
What do the inequalities imply, for example, if X ≥ 0?
Both inequalities are in fact valid for general finite measure
spaces as well (not just probability spaces).
Markov’s and Chebyshev’s Inequalities
Both inequalities can be proved rather quickly by looking at the
figure.
Let Λ := {|f| ≥ λ}. We have

∫_Λ |f| dP ≥ ∫_Λ λ dP = λ P(Λ).

Similarly,

∫_Λ |f|^p dP ≥ ∫_Λ λ^p dP = λ^p P(Λ).
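The bounds above can be checked against an empirical sample (a sketch; the exponential distribution and the threshold λ = 3 are arbitrary illustrative choices):

```python
# Empirical check of Markov's inequality P{|X| ≥ λ} ≤ E|X|/λ and
# Chebyshev's inequality with p = 2: P{|X| ≥ λ} ≤ E|X|²/λ².
import random

random.seed(1)
xs = [random.expovariate(1.0) for _ in range(100_000)]  # X ≥ 0, EX = 1
lam = 3.0

p_tail = sum(1 for x in xs if x >= lam) / len(xs)
mean_abs = sum(abs(x) for x in xs) / len(xs)
second_moment = sum(x * x for x in xs) / len(xs)
# Markov:    p_tail ≤ mean_abs / lam
# Chebyshev: p_tail ≤ second_moment / lam²
```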
Modes of Convergence
Consider a sequence of random variables X1, X2, ... : Ω → R
defined on a probability space (Ω, F, P).
- Convergence in Lp: Xn → X in Lp if

lim_{n→∞} ‖Xn − X‖p = 0.

- Almost sure convergence: Xn converges to X P-almost
surely if

P({ω ∈ Ω : lim sup_{n→∞} |Xn(ω) − X(ω)| > 0}) = 0,

i.e., Xn(ω) → X(ω) for P-almost every ω.
- Convergence in probability: Xn converges to X in probability
if for any ε > 0,

lim_{n→∞} P({ω ∈ Ω : |Xn(ω) − X(ω)| > ε}) = 0.
Modes of Convergence
In general, these three modes of convergence are different. For
example, convergence in Lp and almost sure convergence
can be different (one does not imply the other).
Example
Consider the probability space [0, 1] with the usual Borel
algebra and Lebesgue measure. Define a sequence of random
variables (measurable functions) on [0, 1] as

Xn,k(x) = 1 if x ∈ [(k−1)/n, k/n], and 0 otherwise,

where 1 ≤ k ≤ n. We order the sequence Xn,k as
X1,1, X2,1, X2,2, X3,1, X3,2, X3,3, ... The sequence Xn,k converges
to 0 in Lp. However, lim_{k→∞} Xk(ω) does not exist for any ω. But
the sequence does converge to 0 in probability.
However, almost sure convergence or Lp convergence does
imply convergence in probability. That is, convergence in
probability is the weakest of the three.
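The sliding-window example above can be made concrete. The sketch below builds the ordered sequence X1,1, X2,1, X2,2, ... and tracks both the L¹ norms (which shrink to 0) and the values at a fixed point ω = 1/2 (which keep oscillating between 0 and 1, so the pointwise limit fails):

```python
# The "typewriter" sequence X_{n,k}: indicators of [(k-1)/n, k/n] on [0, 1].
def X(n, k):
    return lambda x: 1.0 if (k - 1) / n <= x <= k / n else 0.0

def l1_norm(n, k):
    # ∫ X_{n,k} dx is just the interval length 1/n.
    return 1.0 / n

# Ordered as X_{1,1}, X_{2,1}, X_{2,2}, X_{3,1}, ... the norms go
# 1, 1/2, 1/2, 1/3, 1/3, 1/3, ... → 0.
norms = [l1_norm(n, k) for n in range(1, 6) for k in range(1, n + 1)]
# Yet at ω = 1/2 the values keep returning to both 0 and 1.
hits_at_half = [X(n, k)(0.5) for n in range(1, 6) for k in range(1, n + 1)]
```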
Independence and Random Variables
Definition
Events {Ei}_{i=1}^n are independent if for all distinct indices
i(1), · · · , i(l) ∈ {1, · · · , n},

P(E_{i(1)} ∩ · · · ∩ E_{i(l)}) = ∏_{j=1}^l P(E_{i(j)}).
Definition
A collection of random variables X1, · · · , Xn : Ω → R is
independent if the events {Xi^{−1}(Ai)}_{i=1}^n are independent for all
A1, · · · , An ∈ B(R).
From the definition, it is immediate that if X , Y are independent,
E(XY ) = E(X )E(Y ).
Why E(XY ) = E(X )E(Y )?
This follows from the construction of the Lebesgue integral.
We approximate X , Y with a sequence of increasingly
complicated random variables Xi , Yi such that
E(Xi ) → E(X ),
E(Yi ) → E(Y ).
and E(Xi Yi ) = E(Xi )E(Yi ) for all i. This will suffice to show that
E(XY ) = E(X )E(Y ).
Suppose X, Y : Ω → (0, 1]. Divide the range (0, 1] into two
subsets (0, 1/2], (1/2, 1]. Let

X1(ω) = (1/2) 1_{(0,1/2]}(X(ω)) + 1_{(1/2,1]}(X(ω)),
Y1(ω) = (1/2) 1_{(0,1/2]}(Y(ω)) + 1_{(1/2,1]}(Y(ω)),

where 1_{(0,1/2]} is the indicator function for the interval (0, 1/2].
Why E(XY ) = E(X )E(Y )?
We need to compute E(X1Y1). Note that X1Y1 only takes on
three different values: 1/4, 1/2, 1. How do we compute the
Lebesgue integral

∫_Ω X1Y1 dP?

Using independence of X, Y, we see that
E(X1Y1) = E(X1)E(Y1).
We get X2, Y2, X3, Y3, .... by dividing (0, 1] into smaller
subintervals. Clearly, for each i, E(XiYi) = E(Xi)E(Yi).
Independent and Identically Distributed Random Variables
A countable collection of random variables X1 , X2 , · · · are
independent and identically distributed if
1. Any finite subcollection of random variables are
independent,
2. The distributions of Xi considered as measures on R are
the same (identical).
A simple example is a coin toss. Say we toss the coin N times.
The random variables are Xi = 1 if the ith toss yields a head;
otherwise Xi = −1.
Depending on the probability space Ω, X1 , X2 , ... are generally
different functions on Ω.
Independence
For two random variables X, Y, their covariance Cov(X, Y) and
correlation ρ(X, Y) are defined as

Cov(X, Y) = E[(X − EX)(Y − EY)],   ρ(X, Y) = Cov(X, Y) / (SD(X) SD(Y)),

where SD(X) = √Var(X), SD(Y) = √Var(Y), and
Var(X) = E((X − EX)²).
We immediately have two simple results:
Lemma
If X, Y ∈ L²(P) are independent, then they are uncorrelated,
i.e., Cov(X, Y) = 0.
If X1, · · · , Xn are uncorrelated and in L²(P), then

Var(X1 + · · · + Xn) = ∑_{j=1}^n Var Xj.
The Law of Large Numbers
Believe it or not, we have gathered enough background to prove
Theorem (The Law of Large Numbers)
If {Xi}_{i=1}^∞ are independent and identically distributed random
variables in L1(P), then, as n → ∞,

(X1 + · · · + Xn)/n → EX1 in L1(P), and hence in probability.

Note that the theorem is formulated entirely in terms of
convergence (in L1(P)) of functions defined on a probability space Ω
to a constant function, EX1. In practice, what does the
theorem mean? The L1(P) assumption requires that the
random variables Xi have finite expectations (means). They need
not have finite variance (since the random variables
may not be in L²(P)).
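A quick simulation shows the theorem in action for the ±1 coin-toss example (a sketch; the sample sizes are arbitrary, and EX1 = 0 here):

```python
# Simulating the law of large numbers for i.i.d. coin tosses X_i = ±1.
import random

random.seed(2)

def sample_mean(n):
    # S_n / n for one run of n tosses
    return sum(random.choice((-1, 1)) for _ in range(n)) / n

# S_n / n concentrates around EX_1 = 0 as n grows.
means = [abs(sample_mean(10_000)) for _ in range(20)]
```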
The Proof of the Law of Large Numbers
We will prove the theorem first under the stronger assumption that
Xi ∈ L²(P), that is, the variances of the Xi are all finite.
Recall that this implies that Xi ∈ L1(P), and let
Sn = X1 + . . . + Xn. Then

‖Sn/n − EX1‖2 = √Var(Sn/n) = √(Var(X1)/n) → 0

as n → ∞.
Note that the convergence is in L² and it still implies
convergence in probability.
The Proof of the Law of Large Numbers
For the general case Xi ∈ L1, we use the truncation
method of Markov:
Choose a large α > 0 and define Xi^α = Xi 1_{|Xi| ≤ α}. Likewise
Sn^α = X1^α + ... + Xn^α. We have

‖Sn − Sn^α‖1 ≤ ∑_{i=1}^n E(|Xi| 1_{|Xi| > α}) = n E(|X1| 1_{|X1| > α}),

and

‖Sn/n − E(X1)‖1 ≤ 2 E(|X1| 1_{|X1| > α}) + ‖Sn^α/n − E(X1^α)‖1.

The second term goes to zero as n → ∞ (by the L² case, since the
Xi^α are bounded), and the first term can be made arbitrarily small
by choosing α large, since X1 ∈ L1(P).
The Strong Law of Large Numbers
There is a stronger and more difficult result due to Kolmogorov
Theorem (The Strong Law of Large Numbers)
If X1 ∈ L1(P), then

lim_{n→∞} (X1 + . . . + Xn)/n = EX1, a.s.
Conversely, if lim supn→∞ |Sn /n| < ∞ with positive probability,
then Xj ’s are in L1 (P) and the above holds.
Difference between the strong and the weak laws:
Weak Law: For large n, Sn/n is likely to be near E(X). But it is
still possible that for a given ε, |Sn/n − E(X)| > ε happens an
infinite number of times.
Strong Law: This will not happen, with probability 1.
The Strong and Weak Laws of Large Numbers
Suppose we are measuring some physical quantity through a
sequence of identical experiments and using Sn/n as the
estimated value of that quantity.
Central Limit Theorem
Consider the set of real numbers R. Convergence of points in R
is well-known. How about convergence of probability
measures?
Let Ω denote the set of probability measures on R, naturally an
infinite-dimensional space.
Recall the central limit theorem
Theorem (Central Limit Theorem)
Suppose {Xi}_{i=1}^∞ are independently and identically distributed
real-valued random variables with two finite moments. Let
Sn = X1 + · · · + Xn and σ(X1) ∈ (0, ∞). Then

(Sn − nEX1)/√n → N(0, σ(X1)).

Here σ(X1) is the variance of X1, and EX is the expectation of the
random variable X. We have used the same notation to denote
both the random variables and their distributions.
Example: CLT
The central limit theorem is a theorem on convergence of
distributions, i.e., convergence on the set Ω of probability
measures on R.
The topology of interest on Ω is called the weak topology of Ω:
- Let C0(R) denote the set of all bounded continuous functions on R.
- For µ ∈ Ω, the integral

∫_R f dµ

defines a function f∗ : Ω → R for every f ∈ C0(R).
- The weak topology T is the smallest topology such that f∗
is a continuous function on Ω for every f ∈ C0(R).

Therefore, a sequence of distributions (probability measures)
µ1, µ2, · · · ∈ Ω converges to a (limit) distribution µ, µn → µ, if

lim_{n→∞} ∫_R f dµn = ∫_R f dµ

for all f ∈ C0(R). We say that µn converges weakly to µ.
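Weak convergence can be probed numerically with a single bounded continuous test function. The sketch below (all choices are illustrative: Xi = ±1 coin tosses, so EX1 = 0 and the variance is 1, and the test function f = cos, whose Gaussian integral is E cos(Z) = e^{−1/2} for Z ∼ N(0, 1)) compares the empirical average of f over normalized sums against the Gaussian target:

```python
# Weak convergence in the CLT, tested against f = cos ∈ C0(R):
# E f((S_n - n·EX_1)/√n) should approach ∫ f dN(0, 1) = e^{-1/2}.
import math
import random

random.seed(3)

def normalized_sum(n):
    s = sum(random.choice((-1, 1)) for _ in range(n))
    return s / math.sqrt(n)            # (S_n - n·EX_1)/√n with EX_1 = 0

n, trials = 500, 10_000
approx = sum(math.cos(normalized_sum(n)) for _ in range(trials)) / trials
target = math.exp(-0.5)                # E cos(Z), Z ~ N(0, 1)
```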
Independent Random Variables
Finally, we have two unresolved questions:
1. How do we know that independent random variables exist?
2. Are there sufficiently many independent random variables
to make the theory interesting?
In particular, how do we know independent and identically
distributed random variables exist?
Theorem
Let {P1 , · · · , Pn , ...} be a countable collection of probability
measures on (R, B(R)) (not necessarily distinct). There exist
independent random variables X1 , X2 , · · · all defined on a
suitable probability space such that the distribution of Xi is Pi
for each i = 1, 2, 3....
Note that the theorem answers the above questions affirmatively,
and it is an easy consequence of the Kolmogorov Extension
(Consistency) Theorem.
Product Measure
The Kolmogorov Extension Theorem requires the notion of product
measure:
Let X, Y be two topological spaces. We know at this point how
to equip X, Y with Borel σ-algebras to make them into
measurable spaces. We further assume that both X, Y are also
measure spaces with measures µX, µY.
The question now is how to define a topology, σ-algebra and
measure on X × Y that are related to the structures on X and Y.
We can be more ambitious and suppose that we have countably
many topological (measure) spaces X1, X2, ...; we want to
define a topology and measure on their cartesian product

X = ∏_{i=1}^∞ Xi.

Recall that the space X is defined to be the set with elements
x = (x1, x2, ...) such that xi ∈ Xi for all i.
Product Topology
The problem can be solved if we can define a topology on X.
How do we accomplish this?
We have the obvious projection maps

πi : X → Xi,   πi(x) = xi.
The right topology (the product topology) is the smallest
topology that makes all these maps continuous.
How do we describe this topology? Well, this topology has to
be generated by the so-called cylinder sets:
U = U1 × U2 × U3 × ....
where Ui are open sets in Xi and only finitely many of them are
different from Xi .
Why?
Finite Products
Note that the previous definition, when applied to a finite product
X × Y, implies that the product topology on X × Y is
generated by (open) sets of the form Ux × Uy for open sets
Ux ⊂ X, Uy ⊂ Y.
The product topology on X × Y gives us its Borel σ-algebra,
which is the smallest σ-algebra generated by sets of the form
Vx × Vy, where Vx ∈ B(X), Vy ∈ B(Y).
It can be shown quite easily that this makes the two projection
maps X × Y → X, X × Y → Y measurable. Furthermore, if
X, Y are measure spaces with measures µX, µY, we can define
the product measure µ on X × Y such that

µ(Vx × Vy) = µX(Vx) × µY(Vy).

We typically write the product measure as µ = µX × µY.
Fubini-Tonelli Theorem
Theorem (Fubini-Tonelli Theorem)
If f : X × Y → R is product measurable and f ∈ L1(µ), then

∫ f(x, y) d(µX × µY) = ∫ (∫ f(x, y) dµX(x)) dµY(y)
= ∫ (∫ f(x, y) dµY(y)) dµX(x).

- The L1 assumption is important. Without it, it is usually not
possible to switch the integrals.
- The functions 1) for every fixed y, x → f(x, y), and 2) for every
fixed x, y → f(x, y), are measurable.
Fubini-Tonelli Theorem
An example where the Fubini-Tonelli Theorem does not apply:
The function f is defined on the first quadrant of R². It is 1 on
the blue squares and −1 on the yellow squares.
- f ∉ L1. Why?
- One iterated integral gives ∫ (∫ f(x, y) dy) dx = 0.
- The other gives ∫ (∫ f(x, y) dx) dy = 1.

Hence

0 = ∫∫ f(x, y) dy dx ≠ ∫∫ f(x, y) dx dy = 1.
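A discrete analogue makes the failure easy to verify by hand (this array is my own stand-in for the blue/yellow picture, not the exact function in the figure): put a(m, n) = 1 on the diagonal, −1 just below it, 0 elsewhere. Each row and column has finite support, yet the two orders of summation disagree, exactly because Σ|a(m, n)| = ∞:

```python
# Discrete Fubini counterexample: swapping the order of summation
# changes the answer when the double "integral" is not absolutely summable.
def a(m, n):
    if m == n:
        return 1
    if m == n + 1:
        return -1
    return 0

N = 50  # large enough: a(m, n) = 0 whenever |m - n| > 1
row_sums = [sum(a(m, n) for n in range(1, N + 1)) for m in range(1, N)]
col_sums = [sum(a(m, n) for m in range(1, N + 1)) for n in range(1, N)]
sum_rows_first = sum(row_sums)   # row 1 contributes 1, every other row sums to 0
sum_cols_first = sum(col_sums)   # every column sums to 1 - 1 = 0
```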
Consistency
At this point, given any finite collection of probability spaces
(Xi, Pi), we know how to define their product probability space
X^n = X1 × ... × Xn and probability measure P = P1 × ... × Pn. In
particular, given a countable collection of probability spaces
X1, X2, ..., we know how to define, for every n, the probability
space

X^n = X1 × ... × Xn

and its probability measure P^n.
Definition (Dimension of a Cylinder Set)
Let U = U1 × U2 × ... denote a cylinder set. Recall that this
requires that Ui ⊂ Xi for each i ≥ 1 and only finitely many Ui
are different from Xi. Its dimension dim U is defined as

dim U := max {i ≥ 1 : Ui ≠ Xi}.

We define dim ∅ := ∞.
Consistency
Definition (Consistency)
A family of probability spaces {Xi, Pi}_{i=1}^∞ is consistent if for
every cylinder set A with dim A = n < ∞,

P^m(πm(A)) = P^n(πn(A)),   ∀m ≥ n.

Recall that P^m is the product probability defined on
X^m = ∏_{i=1}^m Xi.
The question is whether there exists a compatible probability
measure P on the countable product X = ∏_{i=1}^∞ Xi.
Compatibility means

P(U) = P^n(πn(U))

for every cylinder set U with dimension n.
Kolmogorov’s Consistency Theorem
Theorem (Kolmogorov)
Suppose {Xi, Pi}_{i=1}^∞ is a consistent family of probability
measures. Then there exists a unique probability measure P
on X = ∏_{i=1}^∞ Xi such that

P(U) = P^n(πn(U))

for every cylinder set U with dimension n.
The following theorem is an easy consequence of Kolmogorov's
Consistency Theorem:
The following theorem is an easy consequence of Kolmogorov’s
Consistency Theorem:
Theorem
Let {P1 , · · · , Pn , ...} be a countable collection of probability
measures on (R, B(R)) (not necessarily distinct). There exist
independent random variables X1 , X2 , · · · all defined on a
suitable probability space such that the distribution of Xi is Pi
for each i = 1, 2, 3....
Independent Random Variables
The idea is to define the abstract probability space Ω to be the
product

Ω = ∏_{i=1}^∞ R.

Here are the details:
- Define Ω^n = ∏_{i=1}^n R = R^n, F^n = B(R^n), and
P^n = P1 × ... × Pn.
- {P^n}_{n=1}^∞ form a consistent family of probability measures.
- By Kolmogorov's consistency theorem, there is a unique
probability measure P on Ω. Define Xi to be the i-th
projection map; then the Xi are random variables on Ω.
- P{Xi ∈ Ei} = Pi(Ei) for every Borel set Ei ⊂ R.
- P{X1 ∈ E1, ..., Xn ∈ En} = P(X1^{−1}(E1) ∩ ... ∩ Xn^{−1}(En)) = ∏_{i=1}^n Pi(Ei).
Finally, we are done with probability preliminaries. Next time,
we will start with (a survey of) real applications!