ME/MATH 577
Stochastic Systems in Science and Engineering
Chapter #01
This chapter provides a brief introduction to the theory of probability measure
that is essential for understanding stochastic systems. It consists of three parts,
each of which deals with basic concepts:
• Part I – probability measure;
• Part II – integration of random variables; and
• Part III – convergence of random sequences.
Part I - Concepts of Probability Measure
This (first) part of the chapter provides a brief introduction to the notion of probability measure, which is a (possibly shift-variant) finite measure. It also introduces the notion of random variables, which are measurable functions relative to a probability measure space.
1 Sample Space and Event Space
Definition 1.1. (Sample Space) The sample space Ω is a nonempty set (that can
be finite, countably infinite, or uncountably infinite) of sample points ζ (also called
experimental outcomes).
Remark 1.1. In classical probability, each sample point (or experimental outcome) ζ ∈ Ω serves as an atomic element. However, this notion may not hold true
for quantum probability. This course deals with classical probability only.
Definition 1.2. (Event Space) Given a sample space Ω, a collection of subsets
of Ω, called events, is called an event space E provided that the following three
conditions hold:
(i) Ω ∈ E,
(ii) The complement F^c ∈ E ∀F ∈ E [Note: F^c ≜ Ω \ F], and
(iii) Any countable union ∪_{k=1}^∞ Fk ∈ E, where Fk ∈ E ∀k ∈ N, the set of positive integers.
The event space E is also known as a σ-algebra of the sample space Ω. The pair (Ω, E) is called a measurable space and the members of E are called measurable subsets of Ω, or events.
Remark 1.2. Each event F in the event space E is a measurable set which is a
subset of the sample space Ω. However, an arbitrary subset of the sample space
Ω may not be a measurable set and hence may not qualify as an event. Distinct
events F1 , F2 ∈ E may or may not be disjoint.
Remark 1.3. If the property (iii) in Definition 1.2 is restricted to finite unions only, i.e., ∪_{k=1}^K Fk ∈ E, where Fk ∈ E ∀k ∈ {1, 2, · · · , K}, then E is called an algebra of the sample space Ω. Therefore, every σ-algebra is an algebra, but not every algebra is a σ-algebra.
Remark 1.4. The smallest event space of a sample space Ω is {∅, Ω}. The largest event space of a sample space Ω is its power set 2^Ω, which is the collection of all subsets of Ω including ∅ and Ω. Therefore, if Ω is a finite set, an event space E of Ω must be a finite set. On the other hand, if Ω is an infinite set, an event space E of Ω could be either a finite set or an uncountable set, but never a countably infinite set; the proof of this statement will be assigned as homework.
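For a finite sample space, the defining conditions of an event space can be verified by direct enumeration. The following Python sketch (function and variable names are illustrative, not from the text) checks the three conditions of Definition 1.2 for both the smallest event space {∅, Ω} and the largest one, the power set 2^Ω; for finite Ω, countable unions reduce to finite unions.

```python
from itertools import chain, combinations

def power_set(omega):
    """Return the power set 2^Omega of a finite sample space as a set of frozensets."""
    items = list(omega)
    return {frozenset(c) for c in chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))}

def is_event_space(omega, events):
    """Check the three conditions of Definition 1.2 on a finite sample space,
    where countable unions reduce to finite (pairwise) unions."""
    omega = frozenset(omega)
    if omega not in events:                     # (i) Omega is in E
        return False
    for f in events:
        if omega - f not in events:             # (ii) closure under complement
            return False
    for f in events:
        for g in events:
            if f | g not in events:             # (iii) closure under union
                return False
    return True

omega = {1, 2, 3}
smallest = {frozenset(), frozenset(omega)}      # {empty set, Omega}
largest = power_set(omega)                      # 2^Omega, with 2^3 = 8 members
print(is_event_space(omega, smallest), is_event_space(omega, largest), len(largest))
```

Intermediate collections such as {∅, {1}, Ω} fail condition (ii), since the complement {2, 3} is missing.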
Definition 1.3. (σ-algebra generated by G) Given a sample space Ω and a
(nonempty) collection G of subsets of Ω, the smallest σ-algebra containing all sets
in G is called the σ-algebra over G or the σ-algebra generated by G. In other words,
the σ-algebra generated by G is the intersection of all σ-algebras that contain each
member of G.
Definition 1.4. (Borel Set) If the sample space is R, then the Borel set B(R) is
defined to be the minimal σ-algebra generated by all open intervals in the form
(a, b) ⊂ R where a < b.
Remark 1.5. In addition to all open intervals, the Borel set B(R) contains all closed intervals, left semi-open intervals, right semi-open intervals, and singleton subsets of R, their at most countable (i.e., finite or countably infinite) unions and intersections, as well as their complements in R. Similarly, the Borel set B(R^n) is defined over the n-dimensional space R^n for any n ∈ N. We can also define the Borel set B(R̄^n) over R̄^n, where R̄ ≜ [−∞, ∞] is the extended real line.
Example 1.1. If F is the collection of all finite disjoint unions of right semi-closed intervals in R, then F is an algebra but not a σ-algebra because the condition (iii) in Definition 1.2 is not completely satisfied. In this case, F does not contain any open interval, any closed interval, or any singleton subset of R.
2 Random Variables
Definition 2.1. (Measurable Function) Let (Ω1, E1) and (Ω2, E2) be two measurable spaces. Then, g : Ω1 → Ω2 is called a measurable function relative to the σ-algebras E1 and E2 if the inverse image g^{-1}(A) ∈ E1 ∀A ∈ E2.
Remark 2.1. Measurability of the function g : Ω1 → Ω2 relative to the σ-algebras E1 and E2 does not imply g(A) ∈ E2 ∀A ∈ E1. To see this, let Ω1 = Ω2 = Ω but E2 ⊊ E1; and let g be the identity function, i.e., g(x) = x ∀x ∈ Ω. Then, let us choose A ∈ E1 such that A ∉ E2. Obviously, g(A) = A ∉ E2, but g^{-1}(B) = B ∈ E1 ∀B ∈ E2. Therefore, we see that the identity function is E1–E2 measurable but it is not E2–E1 measurable for E2 ⊊ E1.
Definition 2.2. (Random Variable and Random Vector) Let (Ω, E) be a measurable space. Then, X : Ω → R^n, where n ∈ N, is called an E-measurable function on (Ω, E) if X is a measurable function relative to the σ-algebras E and B(R^n), i.e., X^{-1}(A) ∈ E ∀A ∈ B(R^n). If n = 1, then X is called a random variable; if n > 1, then X is called a random vector.
Remark 2.2. The term Borel measurable function is often used for an E–B(R) measurable mapping g : Ω → R, where Ω ⊆ R and the σ-algebra E = B(Ω).
Alternatively, a Borel measurable function is a real-valued function such that the
inverse image of the set of real numbers greater than any given real number is a
Borel set, i.e., ∀α ∈ R, the set {ζ ∈ Ω : g(ζ) > α} is Borel measurable. In this way,
the notions of a random variable and a Borel measurable function can be extended
to the complex field C (for example, see Bartle, p. 13).
The concept of a measurable function is analogous (but not identical) to that of
a continuous function in the context of two topological spaces, as explained below.
Definition 2.3. (Topological Space) A collection ℑ of subsets of a (nonempty) set
Ω is called a topology provided that the following three conditions hold:
(i) ∅ ∈ ℑ and Ω ∈ ℑ,
(ii) An arbitrary union ∪_{α∈I} Fα ∈ ℑ, where Fα ∈ ℑ ∀α ∈ I, which is a (finite, countably infinite, or uncountably infinite) index set, and
(iii) Any finite intersection ∩_{k=1}^K Fk ∈ ℑ, where Fk ∈ ℑ and K ∈ N.
The pair (Ω, ℑ) is called a topological space and the members of ℑ are called ℑ-open subsets of Ω.
Definition 2.4. (Continuity) Let (Ω1, ℑ1) and (Ω2, ℑ2) be two topological spaces. A function g : Ω1 → Ω2 is said to be ℑ1–ℑ2 continuous if the inverse image g^{-1}(A) ∈ ℑ1 ∀A ∈ ℑ2.
Remark 2.3. Continuity of g : Ω1 → Ω2 relative to the topologies ℑ1 and ℑ2 does not imply g(A) ∈ ℑ2 ∀A ∈ ℑ1. For example, let Ω1 = Ω2 = Ω but ℑ2 ⊊ ℑ1, and let g be the identity function, i.e., g(x) = x. Let us choose A ∈ ℑ1 such that A ∉ ℑ2. Then, g(A) = A ∉ ℑ2 but g^{-1}(B) = B ∈ ℑ1 ∀B ∈ ℑ2.
3 Probability Space
Definition 3.1. (σ-finite Measure and Finite Measure) Let (Ω, E) be a measurable space and let µ : E → [0, ∞]. The set function µ is defined to be a σ-finite measure if µ is countably additive on (Ω, E) under the following conditions:
(i) µ[∅] = 0,
(ii) µ[∪_{k=1}^∞ Fk] = ∑_{k=1}^∞ µ[Fk] provided that Fk ∈ E ∀k and Fi ∩ Fj = ∅ ∀i ≠ j,
(iii) There exists a sequence {Fk} such that ∪_{k=1}^∞ Fk = Ω and µ[Fk] < ∞ ∀k.
A σ-finite measure µ is defined to be finite if µ[A] < ∞ ∀A ∈ E.
Definition 3.2. (Complete Measure) A measure µ on a measurable space (Ω, E) is defined to be complete if E contains all subsets of zero-measure sets, i.e., if F ∈ E and µ[F] = 0, then E ∈ E ∀E ⊆ F.
Definition 3.3. (Probability Measure) Let (Ω, E) be a measurable space. Following Definition 3.1, a probability measure P : E → [0, 1] is a countably additive set function under the following two axioms:
Axiom 1: P[Ω] = 1.
Axiom 2: P[∪_{k=1}^∞ Fk] = ∑_{k=1}^∞ P[Fk] if Fk ∈ E ∀k and if Fi ∩ Fj = ∅ ∀i ≠ j.
Remark 3.1. The probability measure is finite and (possibly) translation-variant in contrast to the standard Lebesgue measure that is σ-finite and translation-invariant.
Definition 3.4. (Probability Space) A probability space is the triple (Ω, E, P), where Ω is a (finite, countable, or uncountable) sample space, E is the event space corresponding to the measurable space (Ω, E), and P : E → [0, 1] is a probability measure.
Remark 3.2. Let X : Ω → R^n be a random vector of dimension n ∈ N. The respective probability spaces (Ω, E, P) and (R^n, B(R^n), PX) are equivalent in the following sense:
PX[A] = P[X^{-1}(A)] ∀A ∈ B(R^n)
For example, if n = 1 and a, b ∈ R, then it follows that
PX[(−∞, a]] = P[{ζ ∈ Ω : −∞ < X(ζ) ≤ a}]
PX[(−∞, a)] = P[{ζ ∈ Ω : −∞ < X(ζ) < a}]
PX[(a, b]] = P[{ζ ∈ Ω : a < X(ζ) ≤ b}]
PX[[a, b)] = P[{ζ ∈ Ω : a ≤ X(ζ) < b}]
Definition 3.5. (Distribution Function) In the probability space (R^n, B(R^n), PX), instead of using the probability measure (that is a set function whose domain is the Borel set B(R^n) having the range [0, 1]), we introduce an equivalent function FX : R^n → [0, 1] as:
FX(θ1, · · · , θn) ≜ PX[(−∞, θ1] × (−∞, θ2] × · · · × (−∞, θn]]
for every semi-infinite right-closed cell in R^n. Thus, in this representation, the (scalar-valued) joint distribution function FX is right-continuous. However, if we define
FX(θ1, · · · , θn) ≜ PX[(−∞, θ1) × (−∞, θ2) × · · · × (−∞, θn)]
for every semi-infinite right-open cell in R^n, then FX becomes left-continuous.
Figure 1: Pictorial presentation of the probability distribution function: (a) random variable X : Ω → R; (b) random vector X : Ω → R^n.
Remark 3.3. If the distribution function FX : R^n → [0, 1] is continuous at a point θ ∈ R^n, then
FX(θ) − FX(θ−) = P[{ζ ∈ Ω : X(ζ) = θ}] = 0;
otherwise FX(θ) − FX(θ−) > 0. In this case, we call FX(θ) − FX(θ−) the probability mass function (PMF) at the point θ ∈ R^n.
Definition 3.6. (Singularity of a Measure) Let µ and ν be two measures defined on a measurable space (Ω, E). Then, the measures µ and ν are called mutually singular, denoted as µ ⊥ ν, if there exist disjoint sets E and F in E such that E ∪ F = Ω and ν(E) = µ(F) = 0.
Definition 3.7. (Absolute Continuity of a Measure) Let µ and ν be two measures defined on a measurable space (Ω, E). The measure ν is called absolutely continuous relative to the measure µ, denoted as ν ≪ µ, if (µ[S] = 0) ⇒ (ν[S] = 0) ∀S ∈ E.
Definition 3.8. (Almost everywhere (a.e.) or almost surely (a.s.)) A property is said to hold almost everywhere (a.e.) or almost surely (a.s.) relative to a measure µ if the set of points where this property fails to hold is a set of measure 0. Thus, in particular, we say that f = g µ-a.e. or µ-a.s., also denoted as f ∼ g, if f and g have the same domain and µ[{ζ ∈ Ω : f(ζ) ≠ g(ζ)}] = 0.
Remark 3.4. Let {fn} be a sequence of µ-measurable real-valued functions. Then, fn converges to g µ-a.e. if there exists a set E with µ[E] = 0 such that the sequence of real numbers {fn(ζ)} converges to the real number g(ζ) ∀ζ ∈ Ω \ E. Two consequences of a.e. equality in a measure space (Ω, E, µ) are stated below.
• If f = g µ-a.e., then ∫_E f dµ = ∫_E g dµ ∀E ∈ E.
• If f is a µ-measurable function and f = g µ-a.e., then g is µ-measurable.
Theorem 3.1. (Radon-Nikodym Theorem) Let µ and ν be two σ-finite measures defined on a measurable space (Ω, E) and let ν ≪ µ. Then, there exists a non-negative measurable function g : Ω → R on (Ω, E) such that
ν[E] = ∫_E g dµ ∀E ∈ E.
Furthermore, the function g is uniquely determined µ-almost everywhere (i.e., if there is another such measurable function h on (Ω, E), then h = g µ-almost everywhere).
Proof. See any textbook on real analysis (e.g., Bartle, p. 85 or Royden, p. 277).
Definition 3.9. (Radon-Nikodym Derivative) Let µ and ν be two measures on the measurable space (R^n, B(R^n)), where n ∈ N. If ν ≪ µ, then there exists a non-negative measurable function f : R^n → R such that
ν(E) = ∫_E f dµ ∀E ∈ B(R^n).
The function f is uniquely determined µ-a.e. on B(R^n) and is called the Radon-Nikodym derivative of ν with respect to the measure µ, which is denoted as dν/dµ.
Remark 3.5. Let X : Ω → R^n, where n ∈ N, be a continuous random vector; and let the corresponding probability spaces be (Ω, E, P) and (R^n, B(R^n), PX). Then, the Radon-Nikodym derivative of PX with respect to the Lebesgue measure µ can be expressed at a point θ ∈ R^n as:
(dPX/dµ)(θ) = ∂^n FX(θ1, · · · , θn) / (∂θ1 · · · ∂θn)
and is known as the joint probability density function (pdf) that is denoted as fX(θ) or fX(θ1, · · · , θn).
Remark 3.6. Let two probability measures, P0 : B(R^n) → [0, 1] and P1 : B(R^n) → [0, 1], be defined on a measurable space (R^n, B(R^n)) to represent two hypotheses, namely, nominal and faulty conditions. If P1 ≪ P0 ≪ µ, then the likelihood ratio of these two hypotheses at a point θ ∈ R^n is expressed as:
dP1(θ)/dP0(θ) = [(dP1/dµ)(θ)] / [(dP0/dµ)(θ)] = f1(θ)/f0(θ)
Definition 3.10. (limsup and liminf) Let {Ek} be a sequence of events on a probability space (Ω, E, P). The superior and inferior limits are defined as follows:
lim sup_{n→∞} En ≜ ∩_{n=1}^∞ ∪_{k=n}^∞ Ek and lim inf_{n→∞} En ≜ ∪_{n=1}^∞ ∩_{k=n}^∞ Ek
If the two limits coincide, then {Ek} is defined to be a convergent sequence of events and the limit set is defined as:
E = lim_{n→∞} En = lim inf_{n→∞} En = lim sup_{n→∞} En
Remark 3.7. In the case of real numbers, the notion of a limit can be paraphrased as follows: The limit point of a sequence {xk} is x if ∀ε > 0, all but a finite number of terms in {xk} are within a distance ε from x, i.e., ∀ε > 0 ∃n ∈ N such that ∀k ≥ n, |xk − x| < ε. A weaker condition is to have infinitely many terms of {xk} within a distance ε from x, i.e., ∀ε > 0 and ∀n ∈ N, there exists k ≥ n such that |xk − x| < ε; in this case, x is called a cluster point of the sequence {xk}.
Remark 3.8. The limit superior is based on the rationale that ω ∈ lim sup_{n→∞} En iff, for all n ∈ N, ω ∈ Ek for some k ≥ n; in other words, ω ∈ lim sup_{n→∞} En iff ω ∈ En for infinitely many n. In contrast, the limit inferior is based on the rationale that ω ∈ lim inf_{n→∞} En iff, for some n ∈ N, ω ∈ Ek for all k ≥ n; in other words, ω ∈ lim inf_{n→∞} En iff ω ∈ En for all but finitely many n. This set-theoretic limit concept is analogous to that for a sequence {xk} of real numbers, where
lim inf_{n→∞} xn = sup_n inf_{k≥n} xk and lim sup_{n→∞} xn = inf_n sup_{k≥n} xk
In general, lim inf xn ≤ lim sup xn; and {xk} converges to the limit x if
x = lim inf xn = lim sup xn
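As a numerical illustration of these formulas, the sketch below approximates lim inf and lim sup for the assumed example sequence xk = (−1)^k + 1/k, which has the cluster points −1 and +1 but no limit.

```python
# Approximate lim inf / lim sup of x_k = (-1)^k + 1/k on a long truncation:
# lim inf = sup_n inf_{k >= n} x_k and lim sup = inf_n sup_{k >= n} x_k.
N = 10000
x = [(-1) ** k + 1.0 / k for k in range(1, N + 1)]

# Tail infima and suprema over the truncated sequence, for n = 0, ..., 199.
tail_inf = [min(x[n:]) for n in range(200)]
tail_sup = [max(x[n:]) for n in range(200)]

liminf = max(tail_inf)   # sup over n of the tail infima, close to -1
limsup = min(tail_sup)   # inf over n of the tail suprema, close to +1
print(liminf, limsup)
```

Since liminf ≠ limsup, the sequence has two cluster points and no limit point.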
Remark 3.9. Let {Xk} be a sequence of real random variables and a ∈ R. Both lim inf Xn and lim sup Xn are random variables provided that they are finite at every ζ ∈ Ω, as explained below:
{ζ ∈ Ω : lim inf Xn(ζ) ≤ a} = ∪_{n=1}^∞ ∩_{k=n}^∞ {ζ ∈ Ω : Xk(ζ) ≤ a}
{ζ ∈ Ω : lim sup Xn(ζ) ≤ a} = ∩_{n=1}^∞ ∪_{k=n}^∞ {ζ ∈ Ω : Xk(ζ) ≤ a}
Note that every set in the form {ζ ∈ Ω : lim inf Xn(ζ) ≤ a} or {ζ ∈ Ω : lim sup Xn(ζ) ≤ a} is also a measurable set by the countable union and countable intersection properties of a σ-algebra. Furthermore, if {Xk} is a sequence of random variables converging to the limit X such that |X(ζ)| < ∞ ∀ζ ∈ Ω, then X is a random variable. It is noted that if the range of a random variable is modified to the extended real line, then lim inf Xn and lim sup Xn are always random variables.
Lemma 3.1. (Continuity of Probability Measure) Let (Ω, E, P) be a probability space, where {Bk} is a sequence of events. Then,
(a) If B1 ⊂ B2 ⊂ · · · ⊂ Bk ⊂ · · · , then lim_{k→∞} P(Bk) = P(∪_{j=1}^∞ Bj)
(b) If B1 ⊃ B2 ⊃ · · · ⊃ Bk ⊃ · · · , then lim_{k→∞} P(Bk) = P(∩_{j=1}^∞ Bj)
Proof. Given B1 ⊂ B2 ⊂ · · · , let B0 = ∅ and let Dj = Bj \ B_{j−1} for j ∈ N. Then, it follows that, for each i, j ∈ N,
Di ∩ Dj = ∅ ∀i ≠ j and Bj = ∪_{i=1}^j Di
Then, lim_{k→∞} P(Bk) = lim_{k→∞} ∑_{j=1}^k P(Dj) = P(∪_{j=1}^∞ Dj) = P(∪_{j=1}^∞ Bj). This proves part (a) of the lemma. The second part can be proved by applying the steps in the proof of part (a) to the complements of the Bj.
Lemma 3.2. (Borel-Cantelli Lemma) Let {Ak} be an arbitrary sequence of events on a probability space (Ω, E, P). Then,
∑_{k=1}^∞ P[Ak] < ∞ ⇒ P[lim sup Ak] = 0
Proof. It follows from Definition 3.10 and Lemma 3.1 that
P[lim sup An] = P[∩_{n=1}^∞ ∪_{k>n} Ak] ≤ lim_{n→∞} ∑_{k>n} P[Ak]
Since ∑_{k=1}^∞ P[Ak] < ∞, the tail of the series must converge to zero, i.e., lim_{n→∞} ∑_{k>n} P[Ak] = 0. This proves the lemma.
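The conclusion of the Borel-Cantelli lemma can be observed in simulation. The following Monte Carlo sketch (an assumed setup, not from the text) uses independent events Ak with P[Ak] = 1/k^2, so that ∑_k P[Ak] < ∞; on every simulated sample path only finitely many of the Ak occur.

```python
import random

# Independent events A_k with P[A_k] = 1/k^2, a summable sequence, so the
# Borel-Cantelli lemma predicts P[lim sup A_k] = 0: almost every sample path
# sees only finitely many of the A_k occur.
random.seed(0)
K = 10000      # horizon approximating k -> infinity
trials = 200   # independent sample paths

last_occurrence = []
for _ in range(trials):
    last = 0
    for k in range(1, K + 1):
        if random.random() < 1.0 / k ** 2:   # event A_k occurs on this path
            last = k
    last_occurrence.append(last)

# On every simulated path the events stop occurring well before the horizon.
print(max(last_occurrence))
```

The tail bound P[Ak occurs for some k > m] ≤ ∑_{k>m} 1/k^2 ≈ 1/m explains why the largest occurrence index is small relative to the horizon K.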
Definition 3.11. (Indicator Function) Let A be an event in the probability space (Ω, E, P). Then, the indicator function (also known as the characteristic function) χA : Ω → {0, 1} is defined as:
χA(ζ) ≜ { 1 if ζ ∈ A; 0 if ζ ∉ A }
Definition 3.12. (Simple Random Variable) In the probability space (Ω, E, P), a simple random variable Sn : Ω → R is defined as:
Sn(ζ) = ∑_{k=1}^n αk χ_{Ak}(ζ)
where the events Ak ∈ E, the scalars αk ∈ R, and n ∈ N.
Theorem 3.2. (Limit point of a simple random variable sequence) Every random variable is the limit of a sequence of simple random variables.
Proof. Let X : Ω → R be a random variable in a probability space (Ω, E, P). Let us define a sequence {Xn} of simple random variables as follows:
Xn(ζ) ≜ { −2^n if X(ζ) < −2^n; k 2^{−n} if X(ζ) ∈ [k 2^{−n}, (k + 1) 2^{−n}) for k ∈ [−2^{2n}, 2^{2n} − 1]; 2^n if X(ζ) ≥ 2^n }
For a fixed ζ ∈ Ω and n ≥ log2 |X(ζ)|, we have |X(ζ) − Xn(ζ)| ≤ 2^{−n}, so that {Xn} converges to X at every sample point ζ ∈ Ω.
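The dyadic construction used in the proof can be implemented directly. The sketch below (function name is illustrative) confirms that, once 2^n exceeds |X(ζ)|, the approximation error is bounded by the grid spacing 2^{−n}.

```python
import math

def simple_approx(x, n):
    """Dyadic simple-function approximation X_n of Theorem 3.2: clip to
    [-2^n, 2^n] and round down to the grid of spacing 2^(-n)."""
    if x < -2 ** n:
        return -2.0 ** n
    if x >= 2 ** n:
        return 2.0 ** n
    # k * 2^(-n) with x in [k * 2^(-n), (k + 1) * 2^(-n))
    return math.floor(x * 2 ** n) / 2 ** n

# Once 2^n > |x| (here n >= 2 for x = pi), the error is at most 2^(-n).
x = math.pi
for n in range(2, 8):
    assert abs(x - simple_approx(x, n)) <= 2.0 ** (-n)

print(simple_approx(math.pi, 4))   # 50/16 = 3.125
```

The error bound fails for small n only because of the clipping at ±2^n, which is exactly why the proof fixes ζ and takes n ≥ log2 |X(ζ)|.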
Corollary 3.1. (Corollary to Theorem 3.2) Every non-negative random variable is the limit of an increasing sequence of simple random variables and every non-positive random variable is the limit of a decreasing sequence of simple random variables.
Proof. If the random variable X is non-negative, then {Xk} as defined in Theorem 3.2 is an increasing sequence of non-negative simple random variables. Similarly, if the random variable X is non-positive, then {Xk} as defined in Theorem 3.2 is a decreasing sequence of non-positive simple random variables.
Part II - Integration of Random Variables
This (second) part of the chapter deals with integration of random variables and measurable functions of random variables. Let (Ω, E, P) be a probability space. Given the probability distribution function FX : R → [0, 1] of a random variable X, the expected value of X is defined in terms of the Riemann-Stieltjes integral as: E[X] ≜ ∫_R θ dFX(θ), provided that the integral is well-defined (i.e., at least one of the two integrals ∫_0^∞ θ dFX(θ) and −∫_{−∞}^0 θ dFX(θ) is less than ∞). Furthermore, if the random variable X has a density function fX (i.e., if the probability measure PX is absolutely continuous relative to the Lebesgue measure), then the expected value of X can be defined in terms of the Riemann integral as: E[X] ≜ ∫_R θ fX(θ) dθ. We will provide an alternative definition of expected value in terms of the Lebesgue-Stieltjes integral that is not only more rigorous but also clarifies the relationship between the expectation operator E[X] and the probability measure P.
We first assume that X is a simple random variable (see Definition 3.12), i.e., there exist events A1, · · · , An and scalars α1, · · · , αn ∈ R such that X(ζ) = ∑_{k=1}^n αk χ_{Ak}(ζ), where χA is an indicator function (see Definition 3.11). By this construction, the expected values of any random variables X and Y must satisfy the following three properties.
i) Additivity: E[X + Y] = E[X] + E[Y].
ii) Homogeneity: E[cX] = cE[X] ∀c ∈ R.
iii) Order preservation: If X ≥ Y, i.e., P[{ζ ∈ Ω : X(ζ) ≥ Y(ζ)}] = 1, then E[X] ≥ E[Y].
In addition, E[X] satisfies an important property stated in the following lemma.
Lemma 3.3. (Monotone Property of Expectation) Let {Xn} be a monotone sequence of simple random variables converging to a simple random variable X. Then, lim_{n→∞} E[Xn] = E[X].
Proof. If {Xn} is a decreasing sequence, then {Xn − X} decreases to 0 as n → ∞; and if {Xn} is an increasing sequence, then {X − Xn} decreases to 0 as n → ∞. Since expectation is additive, it suffices to show that if {Xn} decreases to 0, then lim_{n→∞} E[Xn] = 0. We make use of the following fact to prove the lemma:
∀ε > 0, 0 ≤ E[Xn] ≤ (max Xn) P[Xn > ε] + ε → ε as n → ∞.
Next let X be a nonnegative (not necessarily simple) random variable and let {Xk} be an increasing sequence of nonnegative simple random variables converging to X. Since {E[Xk]} is an increasing sequence of nonnegative real numbers, lim_{k→∞} E[Xk] = sup_k E[Xk] always exists (but may be infinite). To unambiguously define E[X] ≜ lim_{k→∞} E[Xk] for nonnegative random variables, we make use of the following proposition.
Proposition 3.1. (Convergence of Expectation) Let {Xn} and {Yn} be two increasing sequences of nonnegative simple random variables converging to the same limit X. Then, lim_{n→∞} E[Xn] = lim_{n→∞} E[Yn].
Proof. Let ℓ ∈ N be fixed and let us define Zn ≜ min(Xn, Yℓ). Since {Zn} is an increasing sequence of simple random variables, lim_{n→∞} Xn = X, and X ≥ Yℓ, it follows that lim_{n→∞} Zn = Yℓ. Usage of Lemma 3.3 yields
lim_{n→∞} E[Xn] ≥ lim_{n→∞} E[Zn] = E[Yℓ]
Hence, lim_{n→∞} E[Xn] ≥ lim_{ℓ→∞} E[Yℓ]. Interchanging the roles of {Xn} and {Yn} yields lim_{n→∞} E[Yn] ≥ lim_{ℓ→∞} E[Xℓ]. This proves the proposition.
Now let us consider the general case of random variables by removing the restriction of nonnegativity. We express the random variable X as X(ζ) ≜ X^+(ζ) − X^-(ζ) ∀ζ ∈ Ω, where both X^+ and X^- are nonnegative. We define E[X] ≜ E[X^+] − E[X^-] provided that the right-hand side is not of the form ∞ − ∞. In this way, E[X] satisfies the three postulated properties of additivity, homogeneity, and order preservation. The uniqueness of the limit lim_{k→∞} E[Xk] = E[X] is established by Proposition 3.1. Now, we introduce the following definition of E[X].
Definition 3.13. (Expectation) Let (Ω, E, P) be a probability space and X : Ω → R be a random variable defined on the probability space (R, B(R), PX). The expectation of X is defined in terms of the following two Lebesgue-Stieltjes integrals:
E[X] ≜ ∫_{ζ∈Ω} X(ζ) dP(ζ) or E[X] = ∫_{θ∈R} θ PX[(θ − dθ, θ]]
which are respectively denoted as ∫_Ω X dP or ∫_R θ dPX.
Remark 3.10. For dθ > 0, we have (−∞, θ − dθ] ⊂ (−∞, θ] and it follows that
PX[(θ − dθ, θ]] = PX[(−∞, θ]] − PX[(−∞, θ − dθ]]
We often denote PX[(θ − dθ, θ]] as dPX(θ). Along this line, following Figure 1(a), the implication of the term dP(ζ) in Definition 3.13 is explained as follows:
ζ ∈ Ω; θ = X(ζ); dθ > 0; and dP(ζ) ≡ P[{ω ∈ Ω : X(ω) ∈ (θ − dθ, θ]}] = P[X^{-1}((θ − dθ, θ])]
Definition 3.14. (Integrability with respect to a measure P) A random variable X is said to be integrable (with respect to a measure P) if E[|X|] < ∞.
Remark 3.11. Let g : R → R be a Borel-measurable function. Then E[g(X)] = ∫_Ω g(X) dP or E[g(X)] = ∫_R g(θ) dPX(θ), or equivalently, E[g(X)] = ∫_R g(θ) dFX(θ).
Next we present the following results on sequences of integrable random variables, which are similar to the standard results on sequences of integrable functions
in real analysis with Lebesgue measure.
Proposition 3.2. (Monotone Convergence Theorem) Let {Xk} be an increasing sequence of nonnegative random variables converging to a (random variable) X. Then,
E[X] = lim_{k→∞} E[Xk]
Proposition 3.3. (Fatou's Lemma) Let {Xk} be a sequence of nonnegative random variables such that there exists an integrable random variable X having the property: Xk(ζ) ≥ X(ζ) ∀k ∀ζ. Then,
lim inf E[Xk] ≥ E[lim inf Xk]
Proposition 3.4. (Dominated Convergence Theorem) Let {Xk} be a sequence of random variables converging to a (random variable) X such that there exists an integrable non-negative random variable Y having the property: E[Y] < ∞ and |Xk(ζ)| ≤ Y(ζ) ∀k ∀ζ. Then,
E[|X|] < ∞; lim_{k→∞} E[|Xk − X|] = 0; and lim_{k→∞} E[Xk] = E[X]
It is relatively straightforward to extend the above concept of expectation from random variables to random vectors. Let (Ω, E, P) be a probability space and let a random vector be defined as X ≜ [X1 · · · Xn]^T : Ω → R^n on the probability space (R^n, B(R^n), PX). Let FX : R^n → [0, 1] be the joint distribution function of the random vector X and let g : R^n → R^m be a Borel-measurable function. Then,
E[X] ≜ ∫_Ω X dP = ∫_{R^n} θ dPX(θ) and E[g(X)] ≜ ∫_Ω g(X) dP = ∫_{R^n} g(θ) dPX(θ)
If we define Y = g(X) as another random vector Y : Ω → R^m on the probability space (R^m, B(R^m), PY), then
E[Y] ≜ ∫_Ω Y dP = ∫_{R^m} φ dPY
The expectations of the individual components of a random vector can be written in terms of its joint distribution function, for k = 1, · · · , n and ℓ = 1, · · · , m, as:
E[Xk] = ∫_{R^n} θk dFX(θ1, · · · , θn) and E[Yℓ] = ∫_{R^m} φℓ dFY(φ1, · · · , φm)
Definition 3.15. (Characteristic Function of a Random Vector) Let a random vector X : Ω → R^n have a distribution function FX. Then, the characteristic function of X is defined as: ΦX(ξ) ≜ E[exp(i2πξ^T X)]. Therefore,
ΦX(ξ) = ∫_{R^n} exp(i2πξ^T θ) dFX(θ) = ∫_{R^n} exp(i2π ∑_{k=1}^n ξk θk) dFX(θ1, · · · , θn)
Remark 3.12. If a random vector X has the density function fX, then the characteristic function ΦX can be expressed as:
ΦX(ξ) ≜ ∫_{R^n} exp(i2πξ^T θ) fX(θ) dθ = ∫_{R^n} exp(i2π ∑_{k=1}^n ξk θk) fX(θ1, · · · , θn) dθ1 · · · dθn
Notice that, in general, the characteristic function ΦX is identical to the Fourier transform of the density function fX for negative frequency, i.e., ΦX(ξ) = f̂X(−ξ). Existence of f̂X is guaranteed by the condition that fX ∈ L^1(R^n), i.e., fX is absolutely integrable over R^n, which is trivially true because ∫_{R^n} fX(θ) dθ = 1. Therefore, the density function can be generated from the characteristic function by the inversion formula as follows:
fX(ξ) = ∫_{R^n} exp(−i2πξ^T θ) ΦX(θ) dθ = ∫_{R^n} exp(−i2π ∑_{k=1}^n ξk θk) ΦX(θ1, · · · , θn) dθ1 · · · dθn
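The relation ΦX(ξ) = f̂X(−ξ) can be spot-checked numerically. The following Monte Carlo sketch (with assumed parameters µ, σ, ξ) compares the sample average of exp(i2πξX) for Gaussian X ∼ N(µ, σ^2) against the known closed form ΦX(ξ) = exp(i2πξµ − 2π^2 σ^2 ξ^2).

```python
import cmath
import math
import random

# Monte Carlo estimate of Phi_X(xi) = E[exp(i 2 pi xi X)] for X ~ N(mu, sigma^2),
# compared with the closed form exp(i 2 pi xi mu - 2 pi^2 sigma^2 xi^2).
random.seed(1)
mu, sigma, xi = 0.5, 1.5, 0.2
samples = [random.gauss(mu, sigma) for _ in range(200_000)]

phi_mc = sum(cmath.exp(1j * 2 * math.pi * xi * x) for x in samples) / len(samples)
phi_exact = cmath.exp(1j * 2 * math.pi * xi * mu
                      - 2 * math.pi ** 2 * sigma ** 2 * xi ** 2)

print(abs(phi_mc - phi_exact))   # small Monte Carlo error
```

The discrepancy shrinks at the usual Monte Carlo rate, on the order of 1/√(number of samples).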
Definition 3.16. (Moment Generating Function of a Random Vector) Let a random vector X : Ω → R^n have a distribution function FX. Then, the moment generating function of X is defined as a function of a complex vector z ∈ C^n as: ΘX(z) ≜ E[exp(2πz^T X)]. Therefore,
ΘX(z) = ∫_{R^n} exp(2πz^T θ) dFX(θ) = ∫_{R^n} exp(2π ∑_{k=1}^n zk θk) dFX(θ1, · · · , θn)
provided that the integral converges for the given value of z.
Remark 3.13. If a random vector X has the density function fX, then the moment generating function ΘX can be expressed as:
ΘX(z) ≜ ∫_{R^n} exp(2πz^T θ) fX(θ) dθ = ∫_{R^n} exp(2π ∑_{k=1}^n zk θk) fX(θ1, · · · , θn) dθ1 · · · dθn
provided that the integral converges for the given value of z. In the statistics literature, it is a common practice to restrict z to be real.
Notice that, in general, the moment generating function is identical to the Laplace transform of the density function fX with the sign of the argument being negative, i.e., ΘX(z) = f̂X(−z), similar to how the characteristic function is related to the Fourier transform. As the real part of z approaches zero, ΘX(z) approaches ΦX(ξ), where ξ is the imaginary part of the complex vector z. Like the Laplace transform, the region of convergence of the moment generating function needs to be specified.
For a random variable X : Ω → R, under the condition of convergence, the moment generating function ΘX(z), where z ∈ C, is given by:
ΘX(z) = E[exp(zX)] = 1 + zE[X] + (z^2/2!) E[X^2] + (z^3/3!) E[X^3] + · · ·
If ΘX(z) is analytic in the complex plane C in a neighborhood of z = 0, then it follows that the k-th moment E[X^k] = (d^k/dz^k) ΘX(z) |_{z=0}.
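The moment formula E[X^k] = (d^k/dz^k) ΘX(z)|_{z=0} can be illustrated numerically. The sketch below uses the scalar convention ΘX(z) = E[exp(zX)] above with the standard normal case ΘX(z) = exp(z^2/2) (an assumed example), recovering the first moment 0 and the second moment 1 by finite differences.

```python
import math

# For X ~ N(0, 1), the moment generating function is Theta_X(z) = exp(z^2 / 2),
# so E[X] = Theta'(0) = 0 and E[X^2] = Theta''(0) = 1.
def mgf(z):
    return math.exp(z * z / 2.0)

h = 1e-4
m1 = (mgf(h) - mgf(-h)) / (2 * h)              # central difference ~ E[X]
m2 = (mgf(h) - 2 * mgf(0.0) + mgf(-h)) / h**2  # second difference ~ E[X^2]
print(m1, m2)   # approximately 0 and 1
```

Higher moments can be recovered the same way, although numerical differentiation degrades quickly with the order k; symbolic or automatic differentiation is preferable in practice.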
Proposition 3.5. (Generalized Markov Inequality) Let (Ω, E, P) be a probability space and let X : Ω → R be a random variable. Then, ∀r ∈ (0, ∞) and ∀δ ∈ (0, ∞),
if X ∈ L^r(P), then P[{ζ ∈ Ω : |X(ζ)| ≥ δ}] ≤ (1/δ^r) ∫_Ω |X|^r dP
Proof. It follows from the fact that
∫_Ω |X|^r dP ≥ ∫_{{ζ∈Ω : |X(ζ)|≥δ}} |X|^r dP ≥ δ^r P[{ζ ∈ Ω : |X(ζ)| ≥ δ}]
that P[{ζ ∈ Ω : |X(ζ)| ≥ δ}] ≤ (1/δ^r) ∫_Ω |X|^r dP.
Theorem 3.3. (Chebyshev Inequality) Let (Ω, E, P) be a probability space and let X : Ω → R be a random variable on (R, B(R)). Let g be a monotonically increasing and non-negative real-valued function on the range of X such that E[g ◦ X] < ∞. Then, for g(θ) > 0, it follows that
P[{ζ ∈ Ω : X(ζ) ≥ θ}] ≤ E[g ◦ X] / g(θ)
Proof. Since E[g ◦ X] = ∫_R g dPX and g is monotonically increasing and non-negative, it follows that
∫_R g dPX ≥ ∫_θ^∞ g dPX ≥ g(θ) ∫_θ^∞ dPX = g(θ) P[{ζ ∈ Ω : X(ζ) ≥ θ}]
Corollary 3.2. By choosing different structures for the function g in the Chebyshev inequality, the following special cases are obtained:
(i) Let g(θ) = |θ|, i.e., let the random variable X have finite absolute first moment, i.e., E[|X|] < ∞. Then,
∀θ ∈ (0, ∞), P[{ζ ∈ Ω : |X(ζ)| ≥ θ}] ≤ E[|X|] / θ
(ii) Let g(θ) = θ^2, i.e., let the second moment of the random variable X be finite, i.e., E[|X|^2] < ∞. Then,
∀θ ∈ (0, ∞), P[{ζ ∈ Ω : |X(ζ) − E[X]| ≥ θ}] ≤ (σX/θ)^2
where σX^2 is the variance of the random variable X.
Proof. Since E[|X|^2] < ∞, it follows from the Hölder inequality that
|E[X]| ≤ E[|X|] ≤ (E[|X|^2])^{1/2} < ∞
The remaining part of the proof follows from Theorem 3.3.
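The bound in case (ii) can be checked empirically. The following Monte Carlo sketch (an assumed Uniform(0, 1) example) compares the empirical tail probability with the Chebyshev bound.

```python
import random

# For X ~ Uniform(0, 1): E[X] = 1/2 and sigma_X^2 = 1/12.  Chebyshev (ii) gives
# P[|X - 1/2| >= theta] <= (1/12) / theta^2; the exact tail at theta = 0.4 is 0.2.
random.seed(2)
n = 100_000
samples = [random.random() for _ in range(n)]
mean, var = 0.5, 1.0 / 12.0

theta = 0.4
tail = sum(1 for x in samples if abs(x - mean) >= theta) / n
bound = var / theta ** 2
print(tail, bound)   # empirical tail probability vs. Chebyshev bound
```

As expected, the empirical tail (about 0.2) sits well below the Chebyshev bound (about 0.52); the bound is valid but far from tight for this distribution.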
3.1 Chernoff Bound
As seen in Theorem 3.3, the Chebyshev inequality provides an upper bound on the tail probability of a random variable; indeed, it yields the weak law of large numbers, which we will come across in a later chapter. In this subsection, we will study the Chernoff bound for a continuous random variable X with the density function fX. We derive the Chernoff bound on the tail probability, i.e., P[X ≥ α], where α is a prescribed real constant; that is,
P[X ≥ α] = ∫_α^∞ fX(x) dx = ∫_R fX(x) U(x − α) dx    (1)
where the standard step function
U(θ) ≜ { 1 if θ ≥ 0; 0 if θ < 0 }
Since exp((x − α)t) ≥ 1 ∀t ≥ 0 ∀x ≥ α, it follows that
P[X ≥ α] ≤ ∫_R fX(x) exp((x − α)t) dx = exp(−αt) ΘX(t)    (2)
where ΘX(t) ≜ ∫_R fX(x) exp(xt) dx is the moment generating function of X within an appropriate range of the real t.
The tightest bound occurs when the right-hand side of Eq. (2) is minimized with respect to the real variable t, and the minimized bound is called the Chernoff bound.
Example 3.1. (Chernoff bound of a Gaussian random variable)
Let X ∼ N(µ, σ^2) and let α > µ. Then,
ΘX(t) = (1/√(2πσ^2)) ∫_R exp(−(x − µ)^2/(2σ^2)) exp(xt) dx = exp(µt + σ^2 t^2/2)
by completing the square in the integrand. Therefore,
P[X ≥ α] ≤ exp(−αt) ΘX(t) = exp((µ − α)t + σ^2 t^2/2)
and setting (d/dt) exp((µ − α)t + σ^2 t^2/2) = 0 yields t = (α − µ)/σ^2. Hence, the Chernoff bound is:
P[X ≥ α] ≤ exp(−αt) ΘX(t) = exp(−(α − µ)^2/(2σ^2)) for α > µ
Verify whether the above bound is tighter than the corresponding Chebyshev bound.
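The suggested comparison can be carried out numerically. The sketch below (with assumed parameters µ = 0, σ = 1) tabulates the Chernoff bound exp(−(α − µ)^2/(2σ^2)) from Example 3.1 against the Chebyshev bound σ^2/(α − µ)^2 obtained from Corollary 3.2(ii).

```python
import math

# Gaussian tail bounds for P[X >= alpha], X ~ N(mu, sigma^2), alpha > mu:
# Chernoff: exp(-(alpha - mu)^2 / (2 sigma^2));  Chebyshev: sigma^2 / (alpha - mu)^2.
mu, sigma = 0.0, 1.0
for alpha in [1.0, 2.0, 3.0, 4.0]:
    chernoff = math.exp(-(alpha - mu) ** 2 / (2.0 * sigma ** 2))
    chebyshev = sigma ** 2 / (alpha - mu) ** 2
    print(alpha, chernoff, chebyshev)
```

Here the Chernoff bound is smaller for every α > µ (since x^2 exp(−x^2/2) ≤ 2/e < 1), and it decays exponentially in (α − µ)^2 while the Chebyshev bound decays only quadratically.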
Example 3.2. (Chernoff bound of a Poisson random variable)
Let the distribution of a discrete random variable N with distribution P[N = k] be denoted as PN(k). Then, P[N ≥ k] = ∑_{n=k}^∞ PN[n] = ∑_{n=0}^∞ PN[n] U(n − k), where the standard step function
U(k) ≜ { 1 if k ≥ 0; 0 if k < 0 }
Then, with t > 0, it follows that
P[N ≥ k] ≤ ∑_{n=0}^∞ PN[n] exp((n − k)t) = exp(−kt) ΘN(t)
where ΘN(t) ≜ ∑_{n=0}^∞ PN[n] exp(nt) is the moment generating function of N within an appropriate range of the real t.
Next we compute the Chernoff bound of the Poisson random variable:
P[N = k] = exp(−λ) λ^k / k! for k = 0, 1, 2, · · · and the parameter λ > 0
with the corresponding moment generating function ΘN(t) = exp((e^t − 1)λ). Then,
(d/dt) [exp(−kt) exp((e^t − 1)λ)] = 0
yields
t_min = ln(k/λ)
Therefore, P[N ≥ k] ≤ exp(−k t_min) ΘN(t_min) = (λ/k)^k exp(k − λ), which is the Chernoff bound for the Poisson random variable.
Verify whether the above bound is tighter than the corresponding Chebyshev bound.
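Again the comparison can be done numerically. The following sketch (with the assumed parameter λ = 4) checks the Poisson Chernoff bound (λ/k)^k exp(k − λ) against the exact tail P[N ≥ k] and against the Chebyshev bound λ/(k − λ)^2 obtained from Corollary 3.2(ii), using the fact that the mean and variance of a Poisson random variable are both λ.

```python
import math

# Exact Poisson tail vs. Chernoff bound vs. Chebyshev bound, lambda = 4, k > lambda.
lam = 4.0
for k in range(6, 13, 2):
    exact = 1.0 - sum(math.exp(-lam) * lam ** n / math.factorial(n)
                      for n in range(k))                # P[N >= k]
    chernoff = (lam / k) ** k * math.exp(k - lam)       # (lambda/k)^k e^(k - lambda)
    chebyshev = lam / (k - lam) ** 2                    # lambda / (k - lambda)^2
    assert exact <= chernoff   # the Chernoff bound is a valid upper bound
    print(k, exact, chernoff, chebyshev)
```

For k well above λ the Chernoff bound decays super-exponentially, whereas the Chebyshev bound decays only quadratically in (k − λ).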
3.2
Minimum-variance Estimation
Let X be a random variable whose expected value E[X] = θ is an unknown parameter. Let xi , i = 1, 2, · · · , n be measurements of X as individual samples. The
objective is to calculate, from these n measurements, an unbiased estimate θ̂n of θ,
i.e., E[θ̂n ] = θ, where it is noted that the estimator θ̂n is itself a random variable.
In that case, the probability of the estimation error is given by:

    P[|θ̂n − θ| ≥ ε] = P[|θ̂n − E[θ̂n]| ≥ ε] for any given ε > 0    (3)

By the Chebyshev inequality (see Theorem 3.3), it follows that

    P[|θ̂n − E[θ̂n]| ≥ ε] ≤ var(θ̂n)/ε² for any given ε > 0    (4)
Equation (4) implies that the variance of the estimator θ̂n determines an upper bound on the probability of estimation error.
Two types of restrictions are commonly imposed in constructing minimum-variance unbiased estimators:
1. Mathematical structure (e.g., linear).
2. Probability distribution (e.g., Gaussian) of the random variables.
Let us start with the sample mean that is a particular linear combination of the measured random samples xi, i.e.,

    θ̂n ≜ Σ_{i=1}^n ai xi    (5)

where the ai's are (as yet) arbitrary (non-negative) real constants.
Lemma 3.4. The coefficients ai's in Eq. (5) satisfy the relation

    Σ_{i=1}^n ai = 1    (6)

if the linear estimator θ̂n is an unbiased estimator of the parameter θ.
∑n
∑n
Proof. E[θ̂n − θ] = 0 ⇒ E[ i=1 ai xi − θ] = i=1 ai E[xi ] − θ = 0. Therefore,
( ∑n
)
since E[xi ] = θ ∀i, we have
i=1 ai − 1 θ = 0. Since the choice of m is arbitrary,
∑n
i=1 ai = 1.
Lemma 3.5. Let the samples xi, i = 1, 2, · · · , n be pairwise uncorrelated. If the coefficients ai's in Eq. (5) are adjusted to minimize the variance of θ̂n, then the coefficients become ai = 1/n, i = 1, 2, · · · , n subject to the constraint Σ_{i=1}^n ai = 1.
Proof. Let us use a Lagrange multiplier λ on the equality constraint Σ_{i=1}^n ai − 1 = 0. Then, the cost functional to be minimized is:

    J = (1/2) E[(θ̂n − θ)²] − λ (Σ_{i=1}^n ai − 1)

Applying Eq. (5) in the above equation yields

    J = (1/2) E[(Σ_{i=1}^n ai xi − θ)²] − λ (Σ_{i=1}^n ai − 1)

A necessary condition for minimizing J is to set ∂J/∂ak = 0 for all k = 1, 2, · · · , n, which implies

    E[Σ_{i=1}^n ai xi xk] − θ² − λ = 0 ∀k
Since xi and xk are uncorrelated for all i ≠ k, we have

    λ = ak E[(xk)²] − θ² ⇒ ak = (λ + θ²)/E[(xk)²] ∀k

which implies that all ai's are equal. Therefore, with the constraint Σ_{i=1}^n ai = 1, it follows that ai = 1/n, i = 1, 2, · · · , n.
It follows from Lemmas 3.4 and 3.5 that if the samples xi, i = 1, 2, · · · , n are pairwise uncorrelated, then the minimum-variance unbiased linear sampling mean estimator θ̂n of the unknown expected-value parameter θ has the form: θ̂n = (1/n) Σ_{i=1}^n xi. Instead of focusing our attention on the minimum-variance estimation problem, we may restrict the class of random variables that we shall observe. In this regard, we derive a lower bound for the variance of any unbiased estimator in the next subsection.
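The conclusion of Lemmas 3.4 and 3.5 can be illustrated by a small Monte Carlo sketch (the weights, sample size, and distribution below are arbitrary illustrative choices, not from the text): any weighting with Σ ai = 1 is unbiased, but unequal weights inflate the variance above σ²/n.

```python
import random

random.seed(0)
n, sigma, theta, trials = 5, 2.0, 1.0, 50_000

def estimator_var(weights):
    # Monte Carlo variance of the linear estimator sum_i a_i x_i,
    # where the x_i are iid samples with mean theta and variance sigma^2.
    vals = []
    for _ in range(trials):
        xs = [random.gauss(theta, sigma) for _ in range(n)]
        vals.append(sum(a * x for a, x in zip(weights, xs)))
    mean = sum(vals) / trials
    return sum((v - mean) ** 2 for v in vals) / trials

equal = [1.0 / n] * n                  # a_i = 1/n, as in Lemma 3.5
skewed = [0.4, 0.3, 0.15, 0.1, 0.05]   # still unbiased: weights sum to 1
v_equal, v_skewed = estimator_var(equal), estimator_var(skewed)
print(f"equal weights : var ~ {v_equal:.3f} (theory sigma^2/n = {sigma**2 / n:.3f})")
print(f"skewed weights: var ~ {v_skewed:.3f} "
      f"(theory sigma^2 * sum(a_i^2) = {sigma**2 * sum(a * a for a in skewed):.3f})")
```

For uncorrelated samples the variance of the estimator is σ² Σ ai², which is minimized over the simplex Σ ai = 1 at ai = 1/n.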
3.3 Cramer-Rao Inequality for Unbiased Estimation
Since only unbiased estimators are of current interest, we use E[θ̂n − θ] = 0. In other words,

    ∫_{Rⁿ} (θ̂n − θ) fX(x; θ) dx = 0    (7)

where fX(x; θ) ≜ fX(x1, x2, · · · , xn; θ) is the (θ-dependent) nth-order joint probability density function of the random variable X, which is assumed to exist. Differentiating Eq. (7) wrt θ and recognizing the fact that θ̂n is not a function of θ (i.e., no a priori knowledge of θ is used to obtain its estimator θ̂n) yield the following:
    0 = ∫_{Rⁿ} ∂/∂θ [(θ̂n − θ) fX(x; θ)] dx = ∫_{Rⁿ} (θ̂n − θ) ∂fX(x; θ)/∂θ dx − ∫_{Rⁿ} fX(x; θ) dx    (8)

which implies that ∫_{Rⁿ} (θ̂n − θ) [∂(ℓn fX(x; θ))/∂θ] fX(x; θ) dx = 1, i.e.,

    E[(θ̂n − θ) ∂(ℓn fX(x; θ))/∂θ] = 1    (9)

because ∫_{Rⁿ} fX(x; θ) dx = 1 and

    ∂(ℓn fX(x; θ))/∂θ = (1/fX(x; θ)) ∂fX(x; θ)/∂θ
An application of the Cauchy-Schwarz inequality in Eq. (9) yields

    1 = (E[(θ̂n − θ) ∂(ℓn fX(x; θ))/∂θ])² ≤ E[(θ̂n − θ)²] E[(∂(ℓn fX(x; θ))/∂θ)²]    (10)

Therefore,

    var(θ̂n) ≥ 1 / E[(∂(ℓn fX(x; θ))/∂θ)²]    (11)

which is one form of the Cramer-Rao lower bound, which states that:

    The variance of every unbiased estimator θ̂n of the mean θ has a lower bound which is determined solely by the properties of the joint probability density fX(x; θ) of the random samples used in the estimation.
The term Iθ ≜ E[(∂(ℓn fX(x; θ))/∂θ)²] in Eq. (11) is known as the Fisher information for estimating θ from X. The higher Iθ is for a given model, the better is the lower bound on estimation accuracy provided by the Cramer-Rao inequality in Eq. (11). Fisher information may also be expressed in an alternative form as:

    Iθ = −E[∂²(ℓn fX(x; θ))/∂θ²]

if the second derivative ∂²fX(x; θ)/∂θ² exists for all x and θ in the support of fX(x; θ). For details, please see the reference: V. Poor, An Introduction to Signal Detection and Estimation, 2nd ed., Springer-Verlag, 1988, pp. 170-171.
Next let us assume that the random samples x1, x2, · · · , xn are independent and identically distributed (iid) and each has mean θ and variance σ². Then,

    ℓn fX(x1, x2, · · · , xn; θ) = Σ_{i=1}^n ℓn f(xi; θ)

so that the Fisher information of the n iid samples is n times that of a single sample. Consequently, θ̂n = (1/n) Σ_{i=1}^n xi ⇒ var(θ̂n) = σ²/n.
Hence it follows from the Cramer-Rao inequality in Eq. (11) that

    var(θ̂n) ≥ 1/(n E[(∂(ℓn f(x; θ))/∂θ)²])  or  var(θ̂n) ≥ 1/(n E[−∂²(ℓn f(x; θ))/∂θ²])  or  var(θ̂n) ≥ 1/(n Iθ)    (12)
Having obtained the lower bound of var(θ̂n) in Eq. (12), we proceed to determine the form of the minimum-variance unbiased estimator for Gaussian random variables, where the density function is:

    f(x) = (1/√(2πσ²)) exp(−(x − θ)²/(2σ²)) ⇒ Iθ ≜ −E[∂²(ℓn f)/∂θ²] = 1/σ² ⇒ var(θ̂n) ≥ σ²/n
2
We have seen that variance of the sample mean θ̂n is dentically equal to σn for the
iid case. Therefore, the sample mean θ̂n is indeed a minimum-variance unbiased
21
estimator of the true mean θ of a Gaussian random variable when the samples are
independent. Hence, no other sampling estimator, whether linear or nonlinear, can
have a smaller variace of the estimation error than this estimator.
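This attainment of the bound can be illustrated numerically. The sketch below (illustrative parameter choices, seeded for reproducibility) estimates the variance of the sample mean from repeated Gaussian batches and compares it against the Cramer-Rao bound σ²/n of Eq. (12):

```python
import random

random.seed(1)
n, sigma, theta, trials = 10, 1.5, 3.0, 50_000
crlb = sigma ** 2 / n  # Cramer-Rao lower bound for the Gaussian case

# Empirical variance of the sample-mean estimator over many batches
means = []
for _ in range(trials):
    xs = [random.gauss(theta, sigma) for _ in range(n)]
    means.append(sum(xs) / n)
avg = sum(means) / trials
var_hat = sum((m - avg) ** 2 for m in means) / trials
print(f"CRLB = {crlb:.4f}, empirical var(theta_hat) = {var_hat:.4f}")
```

The empirical variance matches σ²/n to within Monte Carlo error, consistent with the sample mean being an efficient estimator in the Gaussian iid case.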
Part III - Convergence of Random Sequences
This (third) part of the chapter presents several modes of convergence of random sequences, where the measure is finite. However, for comparison purposes, we also consider the convergence under the σ-finite Lebesgue measure. In general, we consider sequences of measurable functions mapping from the measurable space (Ω, E) to the Borel-measurable space (R, B(R)). With no loss of generality, we assume that Ω = R and E = B(R), implying that the random variables mapping R into R are Borel-measurable. We consider two measure spaces (R, B(R), P) and (R, B(R), µ), where the measures denote the (finite) probability measure and the (σ-finite) Lebesgue measure, respectively. Furthermore, we specify P[R] = 1 for compatibility with the probability measure. As we will see, the measure space (R, B(R), P) is more restrictive than the measure space (R, B(R), µ).
Definition 3.17. (Convergence Modes) Let {gk} be a sequence of measurable functions on (R, B(R), µ) converging to a measurable function g on (R, B(R), µ) in the following modes:
(i) Uniform convergence: ∀ε > 0 ∃ n(ε) > 0 such that
|gk (t) − g(t)| < ε ∀k ≥ n ∀t ∈ R.
(ii) Convergence at a given point t ∈ R: ∀ε > 0 ∃ nt (ε) > 0 such that
|gk (t) − g(t)| < ε ∀k ≥ nt .
(iii) Pointwise convergence (also called sure convergence): If the sequence {gk }
converges to the measurable function g ∀t ∈ R.
(iv) Uniform Cauchy convergence: ∀ε > 0 ∃ n(ε) > 0 such that
|gk (t) − gℓ (t)| < ε ∀k, ℓ ≥ n ∀t ∈ R.
(v) Cauchy Convergence at a given point t ∈ R: ∀ε > 0 ∃nt (ε) > 0 such that
|gk (t) − gℓ (t)| < ε ∀k, ℓ ≥ nt .
(vi) Pointwise Cauchy convergence: If the sequence {gk } converges in the Cauchy
sense ∀t ∈ R.
(vii) Almost everywhere (a.e) or almost sure (a.s.) convergence: If the sequence
{gk } converges to the measurable function g ∀t ∈ S ⊆ R such that
µ[R \ S] = 0.
(viii) Almost everywhere (a.e) or almost sure (a.s.) Cauchy convergence: If the
sequence {gk } converges in the Cauchy sense ∀t ∈ S ⊆ R such that
µ[R \ S] = 0.
Remark 3.14. Definition 3.17 holds for both measure spaces (R, B(R), µ) and (R, B(R), P).
Remark 3.15. Convergence implies Cauchy convergence and if the converse is true
for all Cauchy sequences, then the space to which these Cauchy sequences belong
is called complete.
Remark 3.16. Uniform convergence ⇒ Sure Convergence ⇒ Almost sure convergence, but the converse is not true in general.
Definition 3.18. (The Lr Space) Let r ∈ [1, ∞). Then a measurable function h on (R, B(R), P) belongs to the space Lr(P) if ∫_R |h(t)|^r dP(t) < ∞. The corresponding norm in the Lr(P)-space is defined as: ∥h∥_{Lr} ≜ (∫_R |h(t)|^r dP(t))^{1/r}. Similar definitions hold for (R, B(R), µ).
Definition 3.19. (Convergence in the Lr Space) Let r ∈ [1, ∞). Then a sequence {gk} of measurable functions in Lr(P) converges in Lr(P) to a measurable function g in Lr(P) if ∀ε > 0 ∃ n(ε) > 0 such that ∥gk − g∥_{Lr(P)} < ε ∀k ≥ n. Similarly, a sequence {gk} of measurable functions in Lr(P) converges in the Cauchy sense if ∀ε > 0 ∃ n(ε) > 0 such that ∥gk − gℓ∥_{Lr(P)} < ε ∀k, ℓ ≥ n.
Remark 3.17. Definition 3.19 holds for both measure spaces (R, B(R), µ) and (R, B(R), P).
Remark 3.18. The spaces Lr (P ) and Lr (µ) are complete, i.e., every Cauchy sequence converges in these spaces.
Remark 3.19. In general, uniform convergence does not imply Lr(µ) convergence and vice versa. We cite an example to show that uniform convergence does not imply convergence in Lr(µ). Let gk = k^{−1/r} χ_{[0,k]}. The sequence {gk} converges uniformly to the function 0, but it does not converge to 0 in Lr(µ) because ∫_R |gk|^r dµ = k^{−1} · k = 1 for every k. However, the sequence {gk} does converge to 0 in Lr(P). Indeed, uniform convergence in (R, B(R), P) implies convergence in Lr(P), but the converse is not true.
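The two claims in Remark 3.19 reduce to elementary computations for gk = k^{−1/r} χ_{[0,k]}; the sketch below simply evaluates them (the helper names are illustrative):

```python
def sup_norm(k, r):
    # sup over t of |g_k(t)|, with g_k = k**(-1/r) on [0, k] and 0 elsewhere;
    # this tends to 0, so {g_k} converges uniformly to 0
    return k ** (-1.0 / r)

def lr_norm_mu_to_r(k, r):
    # integral over R of |g_k|^r d(mu) = k**(-1) * mu([0, k]) = 1 for every k,
    # so {g_k} does NOT converge to 0 in L^r(mu)
    return (k ** (-1.0 / r)) ** r * k

def lr_norm_P_bound(k, r):
    # under a probability measure P, integral |g_k|^r dP <= k**(-1) * P[R] = 1/k -> 0
    return (k ** (-1.0 / r)) ** r * 1.0

r = 2.0
for k in (1, 10, 10_000):
    print(k, sup_norm(k, r), lr_norm_mu_to_r(k, r), lr_norm_P_bound(k, r))
```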
Theorem 3.4. (Uniform Convergence and Convergence in Lr(P)) Let P[R] = 1 and {gk} be a sequence in Lr(P) that converges uniformly on R to a measurable function g. Then, g ∈ Lr(P) and {gk} converges in Lr(P) to g.
Proof. Uniform convergence implies that ∀ε > 0 ∃ n(ε) > 0 such that |gk(t) − g(t)| < ε ∀k ≥ n ∀t ∈ R. For any k ≥ n,

    ∫_R |gk(t) − g(t)|^r dP ≤ ε^r ∫_R dP = ε^r < ∞ ⇒ ∥gk − g∥_{Lr(P)} ≤ ε < ∞

Since (gk − g) ∈ Lr(P) and gk ∈ Lr(P), it follows that g ∈ Lr(P) and {gk} converges in Lr(P) to g.
Corollary 3.3. (Almost Everywhere Convergence and Convergence in Lr(P)) Let {gk} be a sequence in Lr(P) that converges almost everywhere on R to a measurable function g. If there exists a constant c ∈ [0, ∞) such that |gk| ≤ c a.e. on R, then g ∈ Lr(P) and {gk} converges in Lr(P) to g.
Proof. Since P[R] = 1, the constant function c belongs to Lr(P), and the result follows from Theorem 3.5 with h = c.
Theorem 3.5. (Almost Everywhere Convergence and Convergence in Lr(µ)) Let {gk} be a sequence in Lr(µ) that converges almost everywhere (a.e.) on R to a measurable function g. If there exists h ∈ Lr(µ) such that |gk| ≤ h ∀k a.e. on R, then g ∈ Lr(µ) and {gk} converges in Lr(µ) to g.
Proof. Given |gk| ≤ h ∀k a.e. on R, it follows that |gk(t) − g(t)|^r ≤ |2h(t)|^r. Since |gk(t) − g(t)|^r → 0 a.e. on R, and |h|^r ∈ L1(µ) (because h ∈ Lr(µ)), we obtain the following result by applying the dominated convergence theorem:

    lim_{k→∞} ∫_R |gk(t) − g(t)|^r dµ = ∫_R lim_{k→∞} |gk(t) − g(t)|^r dµ = 0 ⇒ g ∈ Lr(µ)
Definition 3.20. (Convergence in Measure) In the measure space (R, B(R), P), a sequence {gk} of measurable functions converges in measure P to a measurable function g if

    ∀ε > 0  lim_{k→∞} P[{t ∈ R : |gk(t) − g(t)| ≥ ε}] = 0

A similar definition holds for the measure space (R, B(R), µ).
Remark 3.20. Let us compare the definition of convergence in measure with that of almost everywhere convergence:

• Convergence in measure (or probability):

    ∀ε > 0  lim_{k→∞} P[{t ∈ R : |gk(t) − g(t)| ≥ ε}] = 0

• Convergence almost everywhere (or almost surely):

    ∀ε > 0  P[{t ∈ R : lim_{k→∞} |gk(t) − g(t)| ≥ ε}] = 0

The main difference between almost sure (a.s.) convergence (also called convergence with probability 1) and convergence in probability (also called p-convergence) is that the former deals with the probability of the limit of a sequence of measurable sets (i.e., events), while the latter deals with the limit of a sequence of probabilities (i.e., the limit of a sequence of real numbers). Further insight can be gained from the fact that a.s. convergence addresses convergence of the entire sample sequences, while p-convergence addresses only the convergence of the random variable at an individual k. That is, a.s. convergence is based on the joint events at an infinite number of times, while p-convergence is based on events at individual k's.
Remark 3.21. Uniform convergence implies convergence in measure regardless of whether the measure is finite (e.g., P) or infinite (e.g., µ). But this may not hold for pointwise (and hence almost everywhere) convergence under an infinite measure (e.g., µ). For example, if gk = χ_{[k,k+1]}, then the sequence {gk} converges pointwise to 0, but it does not converge to 0 in (infinite) measure µ, because µ[{t : |gk(t)| ≥ ε}] = 1 for every k and every ε ∈ (0, 1].
Remark 3.22. It follows from Proposition 3.5 that Lr-convergence implies convergence in measure in the probability space (R, B(R), P). In fact, this implication holds for both a finite measure (e.g., P) and an infinite measure (e.g., µ), as shown next.
Theorem 3.6. Convergence in Lr(µ) implies convergence in measure µ.
Proof. Let a sequence {gk} of measurable functions converge in Lr(µ) to a measurable function g in Lr(µ), where r ∈ [1, ∞). For any ε > 0, let us define a measurable set E_k^ε ≜ {t ∈ R : |gk(t) − g(t)| ≥ ε}. Then, ∀ε > 0,

    ∫_R |gk(t) − g(t)|^r dµ ≥ ∫_{E_k^ε} |gk(t) − g(t)|^r dµ ≥ ε^r ∫_{E_k^ε} dµ = ε^r µ[{t ∈ R : |gk(t) − g(t)| ≥ ε}]

Since convergence in Lr(µ) is equivalent to lim_{k→∞} ∫_R |gk(t) − g(t)|^r dµ = 0, it is concluded that ∀ε > 0 lim_{k→∞} µ[{t ∈ R : |gk(t) − g(t)| ≥ ε}] = 0, which implies convergence in measure µ.
Remark 3.23. Convergence in measure µ does not imply convergence in Lr(µ). For example, let gk = k χ_{[1/k, 2/k]}. Then, {gk} converges to 0 in measure µ but does not converge to 0 in Lr(µ).
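The counterexample in Remark 3.23 can also be checked by direct computation (helper names are illustrative):

```python
def measure_exceeding(k, eps):
    # mu({t : |g_k(t)| >= eps}) with g_k = k on [1/k, 2/k] and 0 elsewhere:
    # the set is the full interval [1/k, 2/k] whenever k >= eps, so its
    # Lebesgue measure is 1/k -> 0 (convergence in measure mu)
    return (2.0 / k - 1.0 / k) if k >= eps else 0.0

def lr_norm_mu_to_r(k, r):
    # integral over R of |g_k|^r d(mu) = k**r * (1/k) = k**(r-1), which does
    # not tend to 0 for any r >= 1 (no convergence in L^r(mu))
    return k ** r / k

for k in (2, 10, 100):
    print(k, measure_exceeding(k, 0.5), lr_norm_mu_to_r(k, 2.0))
```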
Definition 3.21. (Almost Uniform Convergence) A sequence {gk} of measurable functions on (R, B(R), µ) converges almost uniformly to a measurable function g if ∀ε > 0 ∃ Eε ∈ B(R) with µ[Eε] < ε such that {gk} converges uniformly to g on R \ Eε. Similarly, the sequence {gk} converges almost uniformly in the Cauchy sense if ∀ε > 0 ∃ Eε ∈ B(R) with µ[Eε] < ε such that {gk} converges uniformly on R \ Eε in the Cauchy sense.
Theorem 3.7. Uniform Convergence ⇒ Almost Uniform Convergence ⇒ Convergence Almost Everywhere.
Proof. The first part "Uniform Convergence ⇒ Almost Uniform Convergence" follows by choosing Eε = ∅ in Definition 3.21. The proof of the second part "Almost Uniform Convergence ⇒ Convergence Almost Everywhere" is presented below.
Let a sequence {gn} of measurable functions on (R, B(R), µ) converge almost uniformly to a measurable function g. For k ∈ N, let Ek ∈ B(R) be such that µ[Ek] < 2^{−k} and {gn} converges uniformly to g on R \ Ek. Let Fk ≜ ∪_{j=k}^∞ Ej, implying that µ[Fk] < 2^{−k+1}. Since (Ek ⊆ Fk) ⇒ ((R \ Fk) ⊆ (R \ Ek)), it follows that {gn} converges uniformly on R \ Fk. Now let us define a sequence {hk} of measurable functions as:

    hk(t) ≜ lim_{n→∞} gn(t) if t ∉ Fk;  0 if t ∈ Fk

It is observed that {Fk} is monotonically decreasing and µ[F] = 0, where F ≜ ∩_{k=1}^∞ Fk. If ℓ ≤ k, then hℓ(t) = hk(t) ∀t ∉ Fℓ. Therefore, {hk} converges on R to a measurable function that is denoted as g. If t ∉ Fk, then g(t) = hk(t) = lim_{n→∞} gn(t). Hence, {gn} converges to g on R \ F, implying that {gn} converges to g almost everywhere on R.
Theorem 3.8. Almost Uniform Convergence ⇒ Convergence in Measure. Conversely, if a sequence {gk} of measurable functions on (R, B(R), µ) converges in measure to a measurable function g, then a subsequence of {gk} converges almost uniformly to g.
Proof. Given that a sequence {gk} on (R, B(R), µ) converges almost uniformly to g, let us choose ε > 0 and α > 0. Then, there exists Eε ∈ B(R) with µ[Eε] < ε such that {gk} converges uniformly to g on R \ Eε. Therefore, if the positive integer k is made sufficiently large, then {t ∈ R : |gk(t) − g(t)| ≥ α} ⊆ Eε, implying convergence in measure. Conversely, let {gk} converge in measure to g. Then, there is a subsequence that converges almost everywhere and in measure to a measurable function h. Since {gk} converges in measure to both g and h, it follows that h = g a.e.
Remark 3.24. It follows from Theorem 3.8 that if a sequence converges in Lr ,
then it has a subsequence that converges almost uniformly; however, the converse
may not be true, in general (see Chapter 7, Bartle).
Theorem 3.9. (Egoroff Theorem) For a finite measure on the real space (e.g., in the measure space (R, B(R), P) with P[R] < ∞),
Convergence Almost Everywhere ⇒ Almost Uniform Convergence.
Proof. Let {gk} converge to g everywhere on E ≜ R \ H, where P[H] = 0. For m, n ∈ N, let us construct a measurable set

    E_n^m ≜ ∪_{k=n}^∞ {t ∈ E : |gk(t) − g(t)| ≥ 1/m}

so that E_{n+1}^m ⊆ E_n^m and, because of pointwise convergence of {gk} to g everywhere on E, we have ∩_{n=1}^∞ E_n^m = ∅. Since P[R] < ∞, we infer that P[E_n^m] → 0 as n → ∞. For any given ε > 0, let us choose ℓ(m, ε) ∈ N such that P[E_ℓ^m] < 2^{−m} ε and let Eε ≜ ∪_{m=1}^∞ E_ℓ^m so that P[Eε] < ε. Note that if t ∉ Eε, then t ∉ E_ℓ^m. Therefore, |gk(t) − g(t)| < 1/m ∀k ≥ ℓ(m, ε), which implies that {gk} converges uniformly to g on E \ Eε.
Corollary 3.4. (Corollary to Egoroff Theorem): For P [R] < ∞,
Convergence Almost Everywhere ⇒ Convergence in Measure P .
Proof. The proof follows by combining the results of Theorems 3.8 and 3.9.
Before concluding this section, we introduce a weak form of convergence, known as convergence in distribution, which is not really a convergence condition for the random variables themselves. A formal definition follows.
Definition 3.22. (Convergence in Distribution) Let {Pk} be a sequence of probability measures on the measurable space (R, B(R)). Then, {Pk} converges in distribution to a probability measure P, denoted as Pk → P in distribution, if

    ∀g ∈ C(R)  lim_{k→∞} ∫ g dPk = ∫ g dP

where C(R) is the space of bounded continuous real-valued functions on R.
Remark 3.25. Definition 3.22 is interpreted in the following way in terms of random variables. Let {Xk} be a sequence of random variables, where Xk has the distribution Fk : R → [0, 1]. Then, {Xk} converges in distribution to a random variable X whose distribution is F : R → [0, 1], i.e., Xk → X in distribution if

    lim_{k→∞} Fk(θ) = F(θ) for all θ at which F is continuous

Definition 3.22 does not directly deal with the convergence of the random variables themselves, but with their probability distributions. Convergence in distribution simply implies that as k becomes larger, the distributions converge and thus tend to become alike. For example, the sequence {Xk} and the limiting random variable X can be mutually independent even though Xk → X in distribution. Convergence in distribution is radically different from convergence in measure and other types of convergence (e.g., a.s. and Lr), where Xk and X become increasingly dependent on each other because a measure of the error |Xk − X| approaches zero as k tends to infinity. Convergence in distribution occurs in the Central Limit Theorem, which provides a good reason for the widespread usage of the Gaussian (also called normal) distribution in scientific and engineering analysis. The Gaussian distribution Pσ (with mean m and variance σ²) is defined as follows:
    Pσ[B] ≜ (1/√(2πσ²)) ∫_B exp(−(t − m)²/(2σ²)) dt ∀B ∈ B(R)
It is empirically well known that, in some sense, the distribution of a sum of independent trials tends to the Gaussian distribution as the number of these trials
approaches infinity. The central limit theorem provides a rigorous statement and
proof for this intuition. Next we provide the simplest version of the Central Limit
Theorem. For details, please see books on Mathematical Statistics (e.g., Durrett
(2005)).
Theorem 3.10. (Central Limit Theorem) Let (Ω, E, P) be a probability space. Let X1, X2, · · · be independent and identically distributed (i.i.d.) random variables on (R, B(R)) with zero mean (i.e., m = 0) and variance σ². Let Sn ≜ Σ_{k=1}^n Xk and let Pn be the probability measure of (1/√n) Sn. Then, as n → ∞, Pn → Pσ in distribution, where Pσ is the Gaussian distribution with zero mean and variance σ².
Proof. Let ψ be the characteristic function of the i.i.d. random variables Xk, and assume that the first two derivatives of ψ exist:

    ψ′(0) = i E[Xk] = 0 and ψ″(0) = i² E[|Xk|²] = −σ²
Noting that the density function f_(Xk+Xℓ) of the random variable (Xk + Xℓ), where k ≠ ℓ, is the convolution product fXk ⋆ fXℓ, the characteristic function

    ψ_(Xk+Xℓ) = ψXk ψXℓ ∀k ≠ ℓ

Therefore, the characteristic function ψn of (1/√n) Sn is

    ψn(ω) = E[exp(iω Sn/√n)] = (ψ(ω/√n))^n
Taylor series expansion of ψ at ω = 0 yields

    ψ(ω) = ψ(0) + ψ′(0)ω + ψ″(0) ω²/2 + o(ω²) = 1 − σ²ω²/2 + o(ω²)

For a fixed ω, as n → ∞, it follows that

    ψn(ω) = (ψ(ω/√n))^n = (1 − σ²ω²/(2n) + o(n^{−1}ω²))^n → exp(−σ²ω²/2) = ψσ(ω)
Now, what remains is to show that Pn → Pσ in distribution. We need two lemmas to establish this last part of the proof.
Lemma 3.6. Let X be a random variable mapping from the probability space (Ω, E, P) to the probability space (R, B(R), PX) and let ψX be the characteristic function. Then, for any a > 0,

    P[{ζ ∈ Ω : X(ζ) ≤ −2/a} ∪ {ζ ∈ Ω : X(ζ) > 2/a}] ≤ (1/a) ∫_{−a}^{a} (1 − ψX(ω)) dω
Proof. It follows from Fubini's theorem and the fact sin(at)/(at) ≤ 1 that

    (1/a) ∫_{−a}^{a} (1 − ψX(ω)) dω = (1/a) ∫_{−a}^{a} ∫_{−∞}^{∞} (1 − e^{iωt}) dFX(t) dω
    = 2 ∫_{−∞}^{∞} (1 − sin(at)/(at)) dFX(t) ≥ 2 [∫_{−∞}^{−2/a} + ∫_{2/a}^{∞}] (1 − sin(at)/(at)) dFX(t)
    ≥ 2 [∫_{−∞}^{−2/a} + ∫_{2/a}^{∞}] (1 − 1/|at|) dFX(t) ≥ FX(−2/a) + (1 − FX(2/a))
    = P[{ζ ∈ Ω : X(ζ) ≤ −2/a} ∪ {ζ ∈ Ω : X(ζ) > 2/a}]
Lemma 3.7. (Convergence of a sequence of characteristic functions) Let Pn be probability measures on (R, B(R)) and let ψn be the respective characteristic function of the measure Pn. It is given that lim_{n→∞} ψn(ω) = ψ(ω) exists for all ω ∈ R and that ψ(ω) is continuous at ω = 0. Then, there exists a probability measure P on (R, B(R)), which has the characteristic function ψ. Furthermore, Pn → P in distribution as n → ∞.
Proof. To prove Lemma 3.7, we use the notion of tightness of a family of measures,
which is defined below.
Definition 3.23. (Tightness and relative compactness) Let M be a family of probability measures on the measurable space (R, B(R)). Then, M is called tight if ∀ε > 0 ∃ a compact set E ∈ B(R) such that P[E] > (1 − ε) ∀P ∈ M. A consequence of tightness is relative compactness (i.e., every sequence in M has a weakly convergent subsequence, although the limiting measure may not belong to M).
We first show that the family of measures {Pn} in Lemma 3.7 is tight. Since ψ(ω) is continuous at ω = 0 and ψ(0) = 1, it follows that

    ∀ε > 0 ∃ a > 0 such that 0 ≤ (1/a) ∫_{−a}^{a} (1 − ψ(ω)) dω ≤ ε/2

where the integral on the right hand side is non-negative by Lemma 3.6. Since ψn → ψ as n → ∞, an application of the dominated convergence theorem yields

    0 ≤ (1/a) ∫_{−a}^{a} (1 − ψn(ω)) dω ≤ ε for a sufficiently large n ∈ N
Now it follows from Lemma 3.6 that, for a sufficiently large n ∈ N,

    Pn[(−2/a, 2/a]] > (1 − ε)

By enlarging the interval (−2/a, 2/a] to an interval [−b, b] such that Pn[[−b, b]] > (1 − ε) for all n ∈ N, it follows that the family of measures {Pn} is tight.
Finally, let us assume that Pn ↛ P in distribution. Then, there exists a continuous function g : R → R and a subsequence along which ∫ g dPn → c ≠ ∫ g dP. By tightness, a further subsequence converges (weakly) to a measure ν such that ∫ g dν = c, which implies that ν ≠ P. This is a contradiction because ψ is the characteristic function of both ν and P.
The proof of Theorem 3.10 is now completed by showing that Pn → Pσ through
usage of Lemma 3.7.
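Theorem 3.10 can be visualized by simulation. The sketch below (illustrative choices: centered uniform summands with σ² = 1/12, seeded for reproducibility) compares the empirical CDF of Sn/√n with the N(0, σ²) CDF through a Kolmogorov-Smirnov-type distance:

```python
import math
import random

random.seed(42)
n, trials = 100, 10_000
sigma = math.sqrt(1.0 / 12.0)  # variance of Uniform(-0.5, 0.5) is 1/12

def std_normal_cdf(x):
    # CDF of N(0, 1) expressed via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Draw many independent realizations of S_n / sqrt(n)
samples = sorted(
    sum(random.uniform(-0.5, 0.5) for _ in range(n)) / math.sqrt(n)
    for _ in range(trials)
)
# Largest gap between the empirical CDF and the limiting Gaussian CDF
ks = max(
    abs((i + 1) / trials - std_normal_cdf(s / sigma))
    for i, s in enumerate(samples)
)
print(f"max |F_n - Phi_sigma| = {ks:.4f}")
```

The discrepancy is small already at n = 100 summands, consistent with Pn → Pσ in distribution.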
Proposition 3.6. Convergence in Measure ⇒ Convergence in Distribution.
Proof. Let the limiting random variable X be continuous on R. Let the conditional distribution be defined as:

    FXk|X(θ|φ) ≜ P[Xk ≤ θ | X = φ]

Let {Xk} converge in measure to X. Then, it follows from Definition 3.20 that

    ∀ε > 0  lim_{k→∞} P[{ζ : |Xk(ζ) − X(ζ)| ≥ ε}] = 0

Then, as k → ∞,

    FXk|X(θ|φ) → 1 if θ ≥ φ;  0 if θ < φ

which implies that FXk|X(θ|φ) → U(θ − φ), where U(•) is the standard step function. Therefore, as k → ∞,

    FXk(θ) = P[Xk ≤ θ] = ∫_R FXk|X(θ|φ) dFX(φ) → ∫_R U(θ − φ) dFX(φ) = ∫_{−∞}^{θ} dFX(φ) = FX(θ)
Example 3.3. Let {Xk} be a sequence of independent random variables having the density functions

    fXk(θ) = (1 − 1/k) (1/(√(2π) σ)) exp(−(1/(2σ²))(θ − (n − 1)/n)²) + (σ/k) e^{−σθ} U(θ)

It is false that Xk → X in measure, but it is true that Xk → X in distribution.
Relationships among different types of convergence are illustrated below.

Figure 2: Relationship among different convergence modes: almost everywhere (ae), almost uniformly (au), in measure (m), and rth mean (Lr); some implications hold for a finite or infinite measure, some hold for a finite measure only, and some hold only along a subsequence for a finite or infinite measure.

Figure 3: Comparison of different convergence modes: surely (s), almost surely (as), in measure (m), in the rth mean (Lr), and in distribution (d).