ME/MATH 577
Stochastic Systems in Science and Engineering
Chapter #02
Stochastic Processes and Their Properties
This chapter provides a brief introduction to the theory of stochastic processes, which is essential for understanding the
principles of stochastic estimation & control.
I. INTRODUCTION TO STOCHASTIC PROCESSES
Definition 1.1: Let (Ω, E) and (R, B(R)) be measurable spaces. A stochastic process z(t, ζ) is defined as z : T × Ω → R,
where T is a nonempty set that could be finite, countable, or uncountable; for any given t ∈ T, zt : Ω → R is E − B(R)
measurable, i.e., zt is a random variable. Similarly, for any given ζ ∈ Ω, zζ : T → R is a deterministic function of t called a
sample path.
Remark 1.1: If the index set T ⊆ R is uncountable, then z in Definition 1.1 is called a continuous-parameter stochastic
process (or a random process). If T is countably infinite, then z is called a discrete-parameter stochastic process (or a random
sequence); if T is finite, then z is called a random vector and, in the special case, if card(T ) = 1, then z is a random variable,
as described earlier in Chapter 01. If z is discrete-valued, i.e., if the range of z is at most countable, then z is called a stochastic
chain. In general, a stochastic process can be classified in the following four categories:
• Continuous-parameter continuous-valued random process, i.e., T is uncountable and the range of z is uncountable.
• Continuous-parameter discrete-valued stochastic chain, i.e., T is uncountable and the range of z is at most countable.
• Discrete-parameter continuous-valued random process, i.e., T is at most countable and the range of z is uncountable.
• Discrete-parameter discrete-valued stochastic chain, i.e., T is at most countable and the range of z is at most countable.
Note that the set T in Definition 1.1 may also be a subset of an n-dimensional space, where n ∈ N. Although we are
restricted to n = 1 in this course on stochastic systems, there are numerous applications, such as multi-dimensional signal
processing, where n > 1. As an example, for multidimensional signal processing, one may consider T ⊆ Rn with n ∈ N \ {1},
which leads to a random field. In this context, we very briefly introduce the concept of random field.
Definition 1.2: (Random field) Let (K, K, P) be a measure space. Let Gn×d be the set of all Rd-valued functions on Rn,
where n, d ∈ N, and let G n×d be the corresponding σ-algebra. Then, a measurable map ρ : (K, K) → (Gn×d, G n×d) is
called an n-dimensional random field.
In the context of Definition 1.2, ρ maps points of K to functions in Gn×d; equivalently, ρ maps sets in K to sets in G n×d. For a
given ζ ∈ K, the corresponding function in Gn×d is called a realization of the random field ρ and is denoted as ρ(•, ζ). At a
given point ~r ∈ Rn, the value of this function is ρ(~r, ζ). The σ-algebra G n×d contains sets of the form
{g ∈ Gn×d : g(~ri) ∈ Bi, i = 1, · · · , m}
where the choice of m ∈ N is arbitrary, ~ri ∈ Rn, and Bi ∈ B(Rd), the Borel σ-algebra of Rd. Note that sets defined by constraints
over uncountably many points (e.g., over an interval in Rn) are not usually present in G n×d. Such cases are mathematically tractable when the random field ρ is separable.
Before leaving the topic of random field, we present the original definition of random field as it was introduced by
Kolmogorov.
Definition 1.3: (Kolmogorov's definition of random field) Let ρ give rise to the family of random variables
R = {ρ(~ri, •) : (Ω, E) → (Rd, B(Rd)), ~ri ∈ Rn}
Then, R is a random field provided that the distribution function F_{~r1,···,~rn}(~x1, · · · , ~xn), where ~xi ∈ Rd, satisfies the following
conditions:
1) Symmetry: F_{~r1,···,~rn}(~x1, · · · , ~xn) is invariant under identical permutations of ~x and ~r.
2) Consistency: F_{~r1,···,~rn+m}(B × Rmd) = F_{~r1,···,~rn}(B) for every n, m ≥ 1 and B ∈ B(Rnd)
Returning to stochastic processes, where n is restricted to 1, T may often represent time moving forward, i.e., T = [t0 , ∞)
where t0 ∈ R represents initial time. In the sequel, we take T = [t0 , ∞) and denote the stochastic process as zt at time t. In
this setting, we introduce several concepts and notations that are commonly used in literature.
Theorem 1.1: (Completeness of the space of Borel measurable functions)
The space of E − B(R) measurable functions is complete in the usual metric.
Proof 1: Let {zk} be a random sequence defined as zk : Ω → R, where each zk is E − B(R) measurable. Thus, the event
{ζ ∈ Ω : zk(ζ) ≤ θ} ∈ E ∀k ∈ N ∀θ ∈ R. Let the random sequence {zk} converge pointwise in the Cauchy sense in the given
metric space. Then, we need to show that {zk} converges pointwise to an E − B(R) measurable function z in the metric space.
For a given sample point ζ ∈ Ω, the Cauchy sequence of real numbers {zk(ζ)} must converge to a real number, which we
assign as z(ζ), because of completeness of the real field R. What remains to be shown is that the function z : Ω → R, to which
{zk(ζ)} converges, is E − B(R) measurable.
Let Eθ ≜ (−∞, θ] ⇒ zk⁻¹(Eθ) = {ζ ∈ Ω : zk(ζ) ≤ θ} ∈ E ∀k ∈ N. By making use of the countable union and countable
intersection properties of the σ-algebra E, it follows that
lim sup_{n→∞} {ζ ∈ Ω : zn(ζ) ≤ θ} ≜ ⋂_{k=1}^{∞} ⋃_{n=k}^{∞} {ζ ∈ Ω : zn(ζ) ≤ θ} ∈ E
lim inf_{n→∞} {ζ ∈ Ω : zn(ζ) ≤ θ} ≜ ⋃_{k=1}^{∞} ⋂_{n=k}^{∞} {ζ ∈ Ω : zn(ζ) ≤ θ} ∈ E
Since {zk(ζ)} converges pointwise to z on Ω, it follows that
lim inf_{n→∞} {ζ ∈ Ω : zn(ζ) ≤ θ} = lim sup_{n→∞} {ζ ∈ Ω : zn(ζ) ≤ θ} = {ζ ∈ Ω : z(ζ) ≤ θ} ∈ E
Therefore, z : Ω → R is an E − B(R) measurable function, implying that the space of E − B(R) measurable functions is
complete in the usual metric.
Next we cite a few examples of stochastic processes.
Example 1.1: Let a stochastic process X have the structure X(t, ζ) = g(t)Θ(ζ), i.e., the time-dependent
non-random part g and the random part Θ are separable. Often stochastic processes do not satisfy this separability property.
Example 1.2: Let a stochastic process X have the structure X(t, ζ) = A(ζ) sin(ω0 t + Θ(ζ)), i.e., the amplitude A and the phase
Θ are random variables.
Example 1.3: Let a stochastic process X have the structure X(t, ζ) = Σ_n X[n] pn(t − T[n]), where
{X[n]} and {T[n]} are random sequences and the functions {pn} are non-random waveforms that may take various shapes.
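As a concrete illustration (an addition, not part of the original notes), the following Python sketch draws a few sample paths of Example 1.2; the choices of A as Rayleigh-distributed and Θ as uniform on [0, 2π) are purely illustrative assumptions. Once the outcome ζ is fixed, each path is a deterministic function of t:

    import numpy as np

    rng = np.random.default_rng(seed=0)
    w0 = 2.0 * np.pi                  # assumed angular frequency
    t = np.linspace(0.0, 3.0, 301)

    for _ in range(3):                # three sample paths, one per outcome zeta
        A = rng.rayleigh(scale=1.0)             # illustrative amplitude law
        Theta = rng.uniform(0.0, 2.0 * np.pi)   # illustrative phase law
        x = A * np.sin(w0 * t + Theta)          # deterministic in t once zeta is fixed
        print(f"A={A:.3f}, Theta={Theta:.3f}, x(0)={x[0]:.3f}")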
Definition 1.4: (Joint Distribution of a Stochastic Process) On the probability space (Ω, E, P), the joint probability distribution
function of a stochastic process zt at any n distinct arbitrary time instants t1, · · · , tn is defined as:
Fz(θ1, · · · , θn; t1, · · · , tn) = P[⋂_{k=1}^{n} {ζ ∈ Ω : ztk(ζ) ≤ θk}], where θk ∈ R.
If the partial derivatives are continuous, then the corresponding probability density functions are
fz(θ; t) = ∂Fz(θ; t)/∂θ and fz(θ1, · · · , θn; t1, · · · , tn) = ∂ⁿFz(θ1, · · · , θn; t1, · · · , tn)/(∂θ1 · · · ∂θn)
Definition 1.5: (Conditional Distribution of a Stochastic Process) On the probability space (Ω, E, P), the conditional probability
distribution function of a stochastic process zt at an arbitrary time t2, conditioned on zt at another arbitrary time t1, is defined
as:
Fz(θ2|θ1; t2, t1) = P[{ζ ∈ Ω : zt2(ζ) ≤ θ2}|{ζ ∈ Ω : zt1(ζ) ≤ θ1}], where θ1, θ2 ∈ R.
Definition 1.6: (Jointly Gaussian Process) A stochastic process {zt : t ∈ T} is called Gaussian if the nth-order joint PDF of
the random variables zt1, zt2, · · · , ztn is Gaussian for arbitrary choices of t1, t2, · · · , tn and ∀n ∈ N.
Definition 1.7: (Independent Stochastic Process) A stochastic process {zt : t ∈ T } is called independent if the random
variables zt1 , zt2 , · · · , ztn are jointly independent, i.e., P [zt1 , zt2 , · · · , ztn ] = P [zt1 ]P [zt2 ] · · · P [ztn ] for arbitrary choices of
t1 , t2 , · · · , tn ∀n ∈ N.
Definition 1.8: (Independent Increment) A stochastic process {zt : t ∈ T } is said to have independent increments if, for
arbitrary parameters t1 < t2 < · · · < tn , the random variables zt1 , (zt2 − zt1 ), · · · , (ztn − ztn−1 ) are jointly independent for
arbitrary choices of t1 , t2 , · · · , tn ∀n ∈ N.
Definition 1.9: (Stationarity) A stochastic process {zt : t ∈ T} is called stationary (or strict-sense stationary) if, for all
orders n ∈ N, all shifts τ ∈ R, and all instants t1 < t2 < · · · < tn, the joint distributions of the random vectors (zt1, zt2, · · · , ztn) and
(zt1+τ, zt2+τ, · · · , ztn+τ) are identical, i.e.,
Fz(θ1, θ2, · · · , θn; t1, t2, · · · , tn) = Fz(θ1, θ2, · · · , θn; t1 + τ, t2 + τ, · · · , tn + τ)
Definition 1.10: (Wide-sense Stationarity) A stochastic process {zt : t ∈ T} is called wide-sense stationary (or weak-sense
stationary) if the following two conditions are satisfied:
• E[zt] = E[zt+τ] = µz, a constant, for all t, τ ∈ R
• Kzz(t1, t2) = Kzz(t1 + τ, t2 + τ) for all t1, t2, τ ∈ R
Definition 1.11: (Wide-sense Periodicity and Cyclostationarity) A stochastic process {zt : t ∈ T} is called wide-sense periodic
with period T > 0 if, for all t ∈ R, the following two conditions are satisfied:
• E[zt] = E[zt+τ] = µz, a constant, for all t, τ ∈ R
• Kzz(t1, t2) = Kzz(t1, t2 + T) for all t1, t2 ∈ R
and is called wide-sense cyclostationary if the second condition is replaced by
Kzz(t1, t2) = Kzz(t1 + T, t2 + T) ∀t1, t2 ∈ R
Remark 1.2: All strict sense stationary processes are wide sense stationary, but the converse is not true, in general. However,
all wide sense stationary Gaussian processes are strict sense stationary Gaussian.
Remark 1.3: Let {zt, t ∈ T} be a wide-sense stationary (possibly vector-valued) stochastic process. Then, with a slight abuse
of notation, Rzz(t1, t2) ≡ Rzz(t1 − t2) and Kzz(t1, t2) ≡ Kzz(t1 − t2), where Rzz(t1, t2) ≜ E[zt1 zt2^T] and Kzz(t1, t2) ≜
E[(zt1 − E[zt1])(zt2 − E[zt2])^T].
Definition 1.12: (Power Spectral Density) Let {zt, t ∈ T} be a wide-sense stationary (possibly vector-valued) stochastic
process. Then, the power spectral density Szz(ξ) of zt is the Fourier transform of the autocorrelation Rzz(τ), i.e., Szz(ξ) =
F[Rzz(τ)]. Hence, Rzz(τ) = F⁻¹[Szz(ξ)].
It follows from Definition 1.12 that
E[|zt|²] = Rzz(0) = [∫_R dξ Szz(ξ) exp(i2πξτ)]_{τ=0} = ∫_R dξ Szz(ξ)
Example 1.4: Let {wk} be a Gaussian random sequence with E[wk] = 0 ∀k and Rk,ℓ = σ²δkℓ ∀k, ℓ. Let zk ≜ wk + wk−1.
Then, zk is a correlated wide-sense stationary sequence (also strict-sense stationary because of the jointly Gaussian property of {wk}).
Then, it follows that
Kzz(k − ℓ) = σ²(2δk,ℓ + δk,(ℓ−1) + δk,(ℓ+1))
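The following Python sketch (an addition, assuming only numpy) empirically checks the autocovariance of Example 1.4 against the theoretical values 2σ², σ², and 0:

    import numpy as np

    rng = np.random.default_rng(seed=1)
    sigma, n = 1.5, 200_000
    w = rng.normal(0.0, sigma, size=n)
    z = w[1:] + w[:-1]                        # z_k = w_k + w_{k-1}

    for m in range(4):                        # empirical K_zz at lags 0..3
        k_hat = np.mean(z[m:] * z[:len(z) - m])
        print("lag", m, ":", round(float(k_hat), 3))
    print("theory: lag 0 ->", 2 * sigma**2, ", lag 1 ->", sigma**2, ", lag >= 2 -> 0")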
II. MARKOV PROCESSES AND CONDITIONAL INDEPENDENCE
Markov processes are naturally associated with the state-space modeling of dynamical systems, where the state of the system
at any given time t contains all information about the system up to and including the time t. The concept of a state is captured
in stochastic processes by using the notions of conditional independence and Markov property.
Let us elucidate the notion of conditional independence first. Recall that, for an event F in a probability space (Ω, E, P),
P[F] = E[IF], where IF is the indicator function of F, i.e., ∀ζ ∈ Ω,
IF(ζ) = 1 if ζ ∈ F, and IF(ζ) = 0 if ζ ∉ F
Furthermore, let Y be a random vector such that:
• P[F|Y] = E[IF|Y] is a measurable function of Y, which has a finite second moment.
• E[g(Y) P[F|Y]] = E[g(Y) IF] for any measurable function g(Y) with a finite second moment.
Definition 2.1: (Conditional Independence) Let A, B, and C be Borel sets. Then, the events X ∈ A and Z ∈ C are said to
be conditionally independent of an event Y ∈ B, which is denoted as X ⊲⊳ Y ⊲⊳ Z, if
P [X, Z|Y ] = P [X|Y ] P [Z|Y ]
Remark 2.1: The statement X ⊲⊳ Y ⊲⊳ Z implies the truth of the following two statements:
• P[Z|X, Y] = P[Z|Y];
• P[X, Y, Z] P[Y] = P[X, Y] P[Z, Y].
Definition 2.2: (Markov Process) A stochastic process {Xt : t ∈ T} is defined to be Markov if any one of the two following
(equivalent) conditions holds for any t1, · · · , tn+m ∈ T with t1 < · · · < tn+m and n, m ∈ N:
• The conditional independence: (Xt1, · · · , Xtn) ⊲⊳ Xtn ⊲⊳ Xtn+1
• The conditional independence: (Xt1, · · · , Xtn) ⊲⊳ Xtn ⊲⊳ (Xtn+1, · · · , Xtn+m)
Remark 2.2: In Definition 2.2, the first condition is easier to check than the second condition, but the second condition is
more appealing because it is symmetric in time. For example, letting tn be the current time, the Markov property implies
that the past and future are conditionally independent given the present state. Equivalent to Definition 2.2, a Markov process
satisfies the following definition:
A stochastic process Xt is called Markov if, for every tk−1 < tk,
P[Xtk | Xt ∀t ≤ tk−1] = P[Xtk | Xtk−1]
In other words, if t1 < t2 < · · · < tk−1 < tk, then
P[Xtk | Xtk−1, Xtk−2, · · · , Xt1] = P[Xtk | Xtk−1]
In essence, the past has no bearing on the future if the present is known.
Remark 2.3: A Markov process Xt has the following properties:
• Retention of the Markov property under time reversal: if tk < tk+1 < · · · < tk+ℓ < · · · , then the Markov process Xt satisfies
the following condition:
P[Xtk | Xtk+1, · · · , Xtk+ℓ] = P[Xtk | Xtk+1]
• Separation of the joint conditional probability: if tk < tℓ < tm, then
P[Xtm, Xtk | Xtℓ] = P[Xtm | Xtℓ] P[Xtk | Xtℓ]
where the first factor on the right-hand side reflects the Markov property and the second factor is the conditional probability.
Proposition 2.1: (Chapman-Kolmogorov Equation) Let Xt be a continuous Markov process and let t1 < t2 < t3. Then,
fX(x3|x1; t3, t1) = ∫_R dx2 fX(x3|x2; t3, t2) fX(x2|x1; t2, t1)
Proof 2: The marginal density fX(x3, x1; t3, t1) = ∫_R dx2 fX(x3, x2, x1; t3, t2, t1) leads to
fX(x3, x1; t3, t1) = ∫_R dx2 fX(x3|x2; t3, t2) fX(x2, x1; t2, t1)
by using the Markov property of Xt. Then, dividing both sides by fX(x1; t1) yields
fX(x3, x1; t3, t1)/fX(x1; t1) = ∫_R dx2 fX(x3|x2; t3, t2) fX(x2, x1; t2, t1)/fX(x1; t1)
⇒ fX(x3|x1; t3, t1) = ∫_R dx2 fX(x3|x2; t3, t2) fX(x2|x1; t2, t1)
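The Chapman-Kolmogorov equation can be checked numerically. The sketch below (an addition, not from the notes) assumes Gaussian transition densities with mean x_a and variance 2α(t_b − t_a), the Brownian-motion model that appears later in Section III, and compares the direct transition density with the integral evaluated on a truncated grid:

    import numpy as np

    def f(xb, xa, dt, alpha=0.5):
        # assumed Gaussian transition density with mean xa and variance 2*alpha*dt
        return np.exp(-(xb - xa)**2 / (4.0 * alpha * dt)) / np.sqrt(4.0 * np.pi * alpha * dt)

    t1, t2, t3 = 0.0, 0.7, 1.5
    x1, x3 = 0.2, -0.4
    x2 = np.linspace(-15.0, 15.0, 20001)     # truncated grid standing in for R
    dx = x2[1] - x2[0]

    lhs = f(x3, x1, t3 - t1)                                    # direct density
    rhs = np.sum(f(x3, x2, t3 - t2) * f(x2, x1, t2 - t1)) * dx  # C-K integral
    print(lhs, rhs)    # the two values should agree to several decimal places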
Definition 2.3: (Increment and Independent Increment) The increment of a stochastic process {Xt : t ∈ T ⊆ [0, ∞)} over
an interval [a, b] ⊆ T is the random variable Xb − Xa. A stochastic process is said to have independent increments if, for any
positive integer n and any t0 < t1 < · · · < tn ∈ T, the increments Xt1 − Xt0, Xt2 − Xt1, · · · , Xtn − Xtn−1 are mutually
independent.
Remark 2.4: (Markov property of independent-increment processes) It follows from Definition 2.3 that an independent-increment
stochastic process {Xt : t ∈ T} is Markov if the following conditions hold:
• X0 is a real constant.
• For any t1 < · · · < tn+1, the vector (Xt1, · · · , Xtn) is a function of the n increments Xt1 − X0, Xt2 − Xt1, · · · , Xtn −
Xtn−1, and is thus independent of the increment Xtn+1 − Xtn. But Xtn+1 is determined by Xtn+1 − Xtn and Xtn only.
Thus, {Xt : t ∈ [0, ∞)} is a Markov process. Later, we shall encounter the random walk, Brownian motion, and the Poisson
process, which are all independent-increment processes.
III. EXAMPLES OF MARKOV PROCESSES
This section presents examples of Markov processes that are commonly encountered in science and engineering, such as the
Wiener process and the Poisson process.
A. Random Walk Sequence
This subsection deals with a discrete-time Markov chain that is the cumulative sum of infinite-length independent Bernoulli
trials. Let us consider independent flips of a fair coin, where the sample space is Ω = {H, T}; the event space is 2^Ω; and
P[{H}] = 1/2 and P[{T}] = 1/2. If the outcome of a coin flip is H, we move from the current position to the right by one step;
and if the outcome is T, we move from the current position to the left by one step. Let the step length be s > 0 after each
flip.
Starting from the zero position, if we are at the position m after n flips, then m = r − (n − r) = 2r − n if there are r
occurrences of H out of n flips. Hence, r = (n + m)/2. Let us denote the k-th flip as a random variable wk, where
wk = s if H occurs (with probability 1/2), and wk = −s if T occurs (with probability 1/2)
Let the cumulative effect of n consecutive flips be realized as a random variable Xn ≜ Σ_{k=1}^{n} wk with X0 = 0 with probability
1. Then,
P[Xn = ms] = [n!/(((n+m)/2)! ((n−m)/2)!)] (1/2)ⁿ if (n + m) is an even integer, and P[Xn = ms] = 0 if (n + m) is an odd integer,
implying (n + m)/2 occurrences of H and (n − m)/2 occurrences of T in a total of n flips.
Remark 3.1: Individual flips of the coin are assumed to be independent events. Therefore, FXn = Fw1···wn = Fw1 · · · Fwn
and the sum of independent Bernoulli trials leads to a binomial process.
Since E[wk] = s/2 + (−s)/2 = 0 and var[wk] = s²/2 + (−s)²/2 = s², and the n trials are pairwise independent events, it follows that
E[Xn] = 0 and var[Xn] = ns²
Let us normalize Xn to have unity standard deviation for each n ∈ N to yield Wn ≜ Xn/(√n s), which yields the following results:
E[Wn] = 0 and var[Wn] = 1 ∀n ∈ N
By the Central Limit Theorem, as n → ∞, the random walk sequence converges in distribution to the zero-mean unity-variance
Gaussian random variable, i.e., Wn → N(0, 1). So, for a sufficiently large n,
P[α < Wn ≤ β] = P[α√n s < Xn ≤ β√n s] ≈ erf(β) − erf(α)
where the error function is defined here as: erf(x) ≜ (1/√(2π)) ∫_0^x dt exp(−t²/2).
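A minimal Monte Carlo sketch (an addition to the notes, assuming numpy) illustrating the convergence of Wn to N(0, 1):

    import numpy as np

    rng = np.random.default_rng(seed=2)
    s, n, trials = 1.0, 1000, 10_000
    steps = rng.choice([s, -s], size=(trials, n))   # fair-coin steps w_k
    W = steps.sum(axis=1) / (np.sqrt(n) * s)        # W_n for each trial

    print("mean ~ 0:", W.mean())
    print("var  ~ 1:", W.var())
    # P[-1 < W_n <= 1] should approach Phi(1) - Phi(-1), about 0.6827
    print("P[-1 < W_n <= 1] ~", np.mean((W > -1.0) & (W <= 1.0)))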
B. Brownian Motion
Brownian motion is the consequence of random events in the transport phenomena of particles suspended in a fluid (i.e., a
liquid or a gas) resulting from their collisions with the fast-moving atoms or molecules in the fluid. The Wiener process refers to the
mathematical model used to describe Brownian motion.
In 1827, while looking through a microscope at particles trapped in cavities inside pollen grains in water, the botanist
Robert Brown noted that the particles moved through the water, but he was not able to determine the mechanisms that caused
this motion. Although atoms and molecules had long been theorized as the constituents of matter, it was Albert Einstein who
published a paper in 1905 that explained in precise detail how the Brownian motion is a result of the pollen being
moved by individual water molecules. This explanation of Brownian motion served as definitive confirmation that atoms and
molecules actually exist. In 1908, this explanation was further verified experimentally by Jean Perrin, who was awarded the
Nobel Prize in Physics in 1926 for his work on the discontinuous structure of matter. The underlying concept of Brownian motion
is as follows:
As the direction of the force of atomic bombardment is constantly changing, the particles are hit more on one side
than another, leading to the seemingly random nature of the motion.
1) Einstein's Theory of Brownian Motion: The first part of Einstein's theory formulates a diffusion equation for Brownian
particles, in which the diffusion coefficient is related to the mean squared displacement of a Brownian particle, while the second
part relates the diffusion coefficient to measurable physical quantities. This observation enabled Einstein to determine the size of
atoms, and how many atoms there are in a mole, or the molecular weight in grams, of a gas. In accordance with Avogadro's
law, this volume is the same for all ideal gases, which is ∼ 22.4 liters at standard temperature and pressure. The number of
atoms contained in this volume is referred to as Avogadro's number, and the determination of this number is equivalent to the
knowledge of the mass of an atom, as the latter is obtained by dividing the mass of a mole of the gas by Avogadro's number.
We present a simplified form of the first part of Einstein's theory of Brownian motion in the next paragraph.
Einstein's argument was to determine how far a Brownian particle travels in a given time interval. Classical mechanics is
unable to determine this distance because of the enormous number of bombardments a Brownian particle undergoes, roughly
of the order of 10²¹ collisions per second. Einstein considered the collective motion of Brownian particles, where the
increment of particle positions in an unrestricted one-dimensional domain was modeled as a random variable with a probability
density function under a coordinate transformation so that the origin lies at the initial position of the particle. Furthermore,
assuming conservation of particle number, he expanded the density (i.e., the number of particles per unit volume) in a Taylor
series. Then, he derived that the mass density function ρ(x, t) of Brownian particles at point x at time t satisfies the diffusion
equation:
∂ρ/∂t = α ∂²ρ/∂x², where α is the mass diffusivity.
With the initial condition of ρ(x, 0) = δ(x) and boundary conditions of ρ(∞, t) = ρ(−∞, t) = 0, the diffusion equation yields
the solution:
ρ(x, t) = (1/√(4παt)) exp(−x²/(4αt))
This expression allowed Einstein to calculate the moments directly. The first moment is seen to vanish, meaning that the
Brownian particle is equally likely to move to the left as it is to move to the right. The second moment is, however, non-vanishing, being given by:
⟨x²⟩ = 2αt
which expresses the mean squared displacement in terms of the time elapsed and the diffusivity. From this expression Einstein
argued that the rms displacement of a Brownian particle is not proportional to the elapsed time, but rather to its square root.
Einstein's argument is based on a conceptual switch from the ensemble of Brownian particles to the single Brownian particle,
where the relative number of particles at a single instant is related to the time that a single Brownian particle takes to reach a given point.
2) Derivation of Brownian Motion as a limit point of the Random Walk Sequence: The Brownian motion process,
also known as the Wiener-Levy process, is a continuous-time Markov process that can be derived from the random walk sequence
described in the previous subsection. In the one-dimensional form, the Brownian motion process is expressed as:
β : [0, ∞) × Ω → R
where, at a given time instant t ∈ [0, ∞), the random variable βt is zero-mean Gaussian and its variance is proportional to the time
parameter t. Let us consider a random walk sequence, where the (uniform) step size s and the (uniform) time interval ∆
between consecutive steps are both infinitesimally small. Let us denote this (random walk) process as Xt∆, characterized
by the time interval ∆ > 0, where Xt∆ is piecewise constant between consecutive steps. Starting from the zero position at
the time t = 0 and evolving to some finite final time t = n∆, it follows that
Xt∆ = Σ_{k=0}^{n−1} wk for t = n∆, i.e., X∆ evaluated at time n∆ is the sum of the first n flips
where, by following Subsection III-A, the Bernoulli trial of the flip of a fair coin is:
wk = s if H occurs (with probability 1/2), and wk = −s if T occurs (with probability 1/2)
Then, E[Xt∆] = 0 and var[Xt∆] = ns² = (t/∆)s². By following Einstein's derivation, we set s² = 2α∆, where α is the diffusion
constant, i.e., the step size is proportional to the square root of the time interval (s = √(2α∆)).
Having s² = 2α∆, as ∆ → 0+, s → 0+. Then, for any finite time t > 0 as ∆ → 0+, the limiting condition of the random
sequence becomes the Brownian motion (or Wiener-Levy) process
βt = lim_{∆→0+} Xt∆ where β0 = 0 with probability 1
i.e., for all finite t (i.e., t ∈ [0, ∞)),
βt = lim_{n→∞} Σ_{k=0}^{n−1} wk
which converges in distribution to a Gaussian process by virtue of the Central Limit Theorem because the wk's are i.i.d. random
variables. Then, for all t ∈ [0, ∞), E[βt] = 0 and var[βt] = lim_{n→∞} ns² = 2αt. In other words, the probability density
function of βt for all finite t is:
fβ(x; t) = (1/√(4παt)) exp(−x²/(4αt))
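The scaling s = √(2α∆) can be checked by simulation. The following sketch (an addition, not part of the notes) verifies that the scaled random walk at time t has variance close to 2αt:

    import numpy as np

    rng = np.random.default_rng(seed=3)
    alpha, t, Delta = 0.5, 2.0, 1e-2
    n = int(t / Delta)                        # number of steps up to time t
    s = np.sqrt(2.0 * alpha * Delta)          # Einstein's scaling s = sqrt(2*alpha*Delta)

    steps = rng.choice([s, -s], size=(10_000, n))   # 10000 independent walks
    beta_t = steps.sum(axis=1)                      # X_t^Delta, approximating beta_t

    print("empirical var:", beta_t.var())     # should be close to 2*alpha*t = 2.0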
C. Birth and Death Process
Let {Nt : t ∈ (0, ∞)} be a continuous-time Markov chain having exactly one of the non-negative integer values 0, 1, 2, 3, · · ·
at any given instant t. Examples of the birth and death process are:
• Population of a city where people enter and leave the city in continuous time.
• Queue length at the buffer of a computer.
• Jobs waiting at a transfer station within a manufacturing environment.
Let us denote the state of the system at time t as i if Nt = i. Let the conditional probability of transition from the state i
to its neighboring states (i + 1) and (i − 1) over the (closed) interval [t, t + δ] (where δ > 0 and δ → 0) to be as follows:
P [Nt+δ = (i + 1)|Nt = i] = λi δ + o(δ) and P [Nt+δ = (i − 1)|Nt = i] = µi δ + o(δ)
where the parameters λi ≥ 0 and µi ≥ 0 are respectively called the birth rate and the death rate at the state i; and o(•) is the
class of functions such that, if lim_{δ→0} |g(δ)|/δ = 0, then the function g(•) belongs to the class o(•).
Remark 3.2: If θ is any finite real number, then g(•) ∼ o(•) ⇒ θ g(•) ∼ o(•).
Next, two major assumptions are made to derive the governing equations for the birth and death process, as stated below.
• Assumption 1: Every change of state occurs instantaneously.
• Assumption 2: The probability of any change of state by more than 1 in an infinitesimally small interval [t, t + δ] is
o(δ), i.e., as δ → 0, it follows that
P[|Nt+δ − Nt| > 1] = o(δ) and hence Σ_{j: |j−i|≥2} P[Nt+δ = j|Nt = i] = o(δ) ∀i
Since Σ_{j=0}^{∞} P[Nt+δ = j|Nt = i] = 1 ∀i, it follows that, as δ → 0,
P[Nt+δ = i|Nt = i] = 1 − (λi + µi)δ + o(δ) ∀i
Then, by using the theorem of total probability, which is stated below as:
P[Nt+δ = j] = Σ_{i=0}^{∞} P[Nt+δ = j|Nt = i] P[Nt = i] ∀j
and denoting Pj(t) ≜ P[Nt = j] for brevity and assigning λ−1 = 0, µ0 = 0, and P−1(t) = 0 ∀t ∈ [0, ∞), it follows that, as
δ → 0,
Pi(t + δ) = λi−1 δ Pi−1(t) + µi+1 δ Pi+1(t) + (1 − (λi + µi)δ) Pi(t) + o(δ)
which leads to:
lim_{δ→0} [Pi(t + δ) − Pi(t)]/δ = λi−1 Pi−1(t) + µi+1 Pi+1(t) − (λi + µi)Pi(t) + lim_{δ→0} o(δ)/δ
To solve the resulting differential-difference equation:
dPi(t)/dt = λi−1 Pi−1(t) + µi+1 Pi+1(t) − (λi + µi)Pi(t)
with λ−1 = 0; µ0 = 0; and P−1(t) = 0 ∀t ∈ [0, ∞), we need the initial conditions for the individual state probabilities:
Pi(0), i = 0, 1, 2, 3, · · · .
Example 3.1: (The Poisson Process) Let us consider a pure birth process with a constant birth rate λ > 0, i.e., λi = λ and
the death rate µi = 0 ∀i, initially occupying the zero state with probability 1, i.e.,
Pi(0) = 1 for i = 0, and Pi(0) = 0 for i > 0
The governing equation for the generic birth and death process is thus simplified as dPi(t)/dt = λ(Pi−1(t) − Pi(t)) with the
above initial conditions. A real-life example is an operator recording incoming telephone calls that arrive at random instants
of time.
With P−1(t) = 0 ∀t, the dynamics at state i = 0 become dP0(t)/dt = −λP0(t) with the initial condition P0(0) = 1, which
yields the solution P0(t) = exp(−λt) ∀t ≥ 0. Next consider i = 1: dP1(t)/dt = −λP1(t) + λ exp(−λt) with the initial condition
P1(0) = 0, which yields the solution P1(t) = (λt) exp(−λt) ∀t ≥ 0. Similarly, for i = 2: dP2(t)/dt = −λP2(t) + (λ²t) exp(−λt)
with the initial condition P2(0) = 0, which yields the solution P2(t) = ((λt)²/2) exp(−λt) ∀t ≥ 0. By applying the method of
induction, the following result is generated:
Pk(t) = ((λt)^k / k!) exp(−λt) ∀t ≥ 0
which describes the Poisson process with a constant birth rate λ > 0. The Poisson process plays a key role in queueing theory.
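As a numerical illustration (an addition to the notes), the sketch below simulates the pure birth process via iid exponential inter-arrival times, which is the standard alternative construction of the Poisson process, and compares the empirical law of Nt with Pk(t) = ((λt)^k/k!) exp(−λt):

    import numpy as np
    from math import exp, factorial

    rng = np.random.default_rng(seed=4)
    lam, t, trials = 2.0, 3.0, 20_000
    counts = np.empty(trials, dtype=int)
    for i in range(trials):
        elapsed, n = 0.0, 0
        while True:
            elapsed += rng.exponential(1.0 / lam)   # inter-arrival ~ Exp(lam)
            if elapsed > t:
                break
            n += 1
        counts[i] = n

    for k in range(5):
        theory = (lam * t) ** k * exp(-lam * t) / factorial(k)
        print(k, round(float(np.mean(counts == k)), 4), round(theory, 4))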
Example 3.2: (The Binomial Process) Let us consider a pure death process with the death rate being proportional to the
state, i.e., µj = jµ, where µ > 0 is a constant, the birth rate λj = 0 ∀j, and the process initially occupying the state n with
probability 1, i.e.,
Pj(0) = 1 for j = n, and Pj(0) = 0 for j ≠ n
The following result is generated by solving the generic governing equation of the birth and death process:
Pj(t) = [n!/(j!(n − j)!)] exp(−jµt) (1 − exp(−µt))^(n−j) ∀t ≥ 0 and j = 0, 1, 2, · · · , n
which describes the binomial process with µ > 0.
Remark 3.3: One may view the Poisson process as a continuous-time version of the Bernoulli trials process. To see this,
let us regard each success in the Bernoulli trials process as a random point in discrete time. Then the Bernoulli trials
process, like the Poisson process, has the strong renewal (memoryless) property, i.e., at the instant of each arrival, the process
starts over independently of the past. With this analogy in mind, the Poisson process has the following connections with the Bernoulli trials and
binomial processes.
• The inter-arrival times have independent geometric distributions in the Bernoulli trials process; they have independent
exponential distributions in the Poisson process.
• The arrival times have Pólya (also called negative binomial) distributions in the Bernoulli trials process; they have gamma
distributions in the Poisson process. Note: the negative binomial distribution is the probability mass function denoted as:
NBin(k; r, p) ∼ [(k + r − 1)!/(k!(r − 1)!)] p^k (1 − p)^r
where the (non-random) non-negative integer r signifies the number of failures; k is the number of successes; and p is
the probability of success. Compare it with the binomial distribution having a total of n Bernoulli trials:
Bin(k; n, p) ∼ [n!/(k!(n − k)!)] p^k (1 − p)^(n−k)
• The number of arrivals in an interval has a binomial distribution in the Bernoulli trials process; it has a Poisson distribution
in the Poisson process.
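The geometric-to-exponential correspondence in the first bullet above can be illustrated numerically. The sketch below (an addition, not from the notes) takes a Bernoulli trials process with success probability p = λ∆ per slot, scales the geometric inter-arrival count by ∆, and compares its tail with exp(−λx):

    import numpy as np

    rng = np.random.default_rng(seed=5)
    lam, dt = 1.0, 1e-3
    p = lam * dt                                   # success probability per slot
    gaps = rng.geometric(p, size=100_000) * dt     # scaled inter-arrival times

    print("mean ~ 1/lam:", gaps.mean())
    for x in (0.5, 1.0, 2.0):                      # tail P[gap > x] vs exp(-lam*x)
        print(x, float(np.mean(gaps > x)), float(np.exp(-lam * x)))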
IV. LAW OF LARGE NUMBERS
This section is devoted to two versions of the law of large numbers, namely,
• Weak law of large numbers;
• Strong law of large numbers.
A. Weak Law of Large Numbers
We present two versions although there are many other versions.
1) Version 1 of Weak Law of Large Numbers (Uniform Variance):
Theorem 4.1: Let {X(k)} be a sequence of independent random variables on a probability space (Ω, E, P) such that
E[X(k)] = µX and var[X(k)] = σX² ∀k ∈ N
Define another random sequence {µ̂X(n)} as
µ̂X(n) = (1/n) Σ_{k=1}^{n} X(k) ∀n ∈ N
Then, µ̂X(n) → µX in probability.
Proof 3: By the Chebyshev inequality, it follows that
P[|µ̂X(n) − µX| ≥ δ] ≤ var(µ̂X(n))/δ² = σX²/(nδ²) → 0 as n → ∞
Therefore, the convergence is established in the L² sense, which implies convergence in probability.
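A small numerical sketch (an addition, with illustrative Gaussian X(k)) showing the deviation probability of Theorem 4.1 shrinking under the Chebyshev bound σX²/(nδ²):

    import numpy as np

    rng = np.random.default_rng(seed=6)
    mu, sigma, delta = 1.0, 2.0, 0.1
    for n in (100, 1000, 5000):
        X = rng.normal(mu, sigma, size=(2_000, n))       # illustrative Gaussian X(k)
        mu_hat = X.mean(axis=1)
        emp = np.mean(np.abs(mu_hat - mu) >= delta)      # empirical deviation probability
        bound = sigma ** 2 / (n * delta ** 2)            # Chebyshev bound
        print(n, float(emp), min(bound, 1.0))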
2) Version 2 of Weak Law of Large Numbers (Non-uniform Variance):
Theorem 4.2: Let {X(k)} be a sequence of independent random variables on a probability space (Ω, E, P) such that
E[X(k)] = µX and var[X(k)] = σX²(k) ∀k ∈ N
Define another random sequence {µ̂X(n)} as
µ̂X(n) = (1/n) Σ_{k=1}^{n} X(k) ∀n ∈ N
If (1/n²) Σ_{k=1}^{n} σX²(k) → 0 as n → ∞, then µ̂X(n) → µX in probability.
Proof 4: By the Chebyshev inequality, it follows that
P[|µ̂X(n) − µX| ≥ δ] ≤ (1/(n²δ²)) Σ_{k=1}^{n} σX²(k)
Therefore, the convergence is established in the L² sense, which implies convergence in probability.
B. Strong Law of Large Numbers
The strong law of large numbers is stated below.
(Strong Law of Large Numbers) Let {Xk} be a sequence of wide-sense stationary independent random variables
on a probability space (Ω, E, P) such that
E[Xk] = µ and var[Xk] = σ² ∀k ∈ N
Define another random sequence {µ̂n} as
µ̂n = (1/n) Σ_{k=1}^{n} Xk ∀n ∈ N
Then, µ̂n → µ almost surely.
As the proof of the strong law of large numbers requires the Martingale Convergence Theorem, we first introduce the notion of
Martingales.
Definition 4.1: (Martingale) A stochastic process {Xt : t ∈ T} is called a martingale if the following conditions hold:
• E[|Xt|] is finite for all t ∈ T,
• E[Xtn+1 |Xtn, · · · , Xt2, Xt1] = Xtn for any n ∈ N and any t1 < t2 < · · · < tn+1 ∈ T.
Equivalently, E[Xtn+1 − Xtn |Xtn, · · · , Xt2, Xt1] = 0.
Remark 4.1: In Definition 4.1, if tn is interpreted as the present time, then tn+1 is a future time and the vector (Xt1 , · · · , Xtn )
represents the past and present. Then, the martingale property implies that the future increments of X have zero conditional
means, given the past and present values of the process.
Remark 4.2: An alternative and more rigorous statement of Martingale in Definition 4.1 is as follows.
Let {Xt : t ∈ T} be a stochastic process on (Ω, E, P) and let {Ft : t ∈ T} be a monotonically increasing family of σ-algebras,
i.e., Fs ⊆ Ft ⊆ E for s ≤ t. Then, a stochastic process {Xt : t ∈ T} is called a Martingale if, for all s < t,
E[Xt|Fs] = Xs (a.s.), where E[Xt|Fs] ≜ E[Xt|Xτ : τ ≤ s]
Furthermore, a stochastic process {Xt : t ∈ T} is called a supermartingale if E[Xt|Fs] ≤ Xs (a.s.) and is called a submartingale
if E[Xt|Fs] ≥ Xs (a.s.).
Example 4.1: Let {Wk} be a sequence of iid Bernoulli trials with E[Wk] = 0 and let Xn = Σ_{k=0}^{n} Wk. Then, {Xn} is a
Martingale sequence as seen below:
E[Xn|Xn−1, · · · , X0] = E[Σ_{k=0}^{n} Wk|Xn−1, · · · , X0] = Σ_{k=0}^{n} E[Wk|Xn−1, · · · , X0] = Xn−1 + E[Wn] = Xn−1
What happens if E[Wk] ≠ 0?
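The martingale property of Example 4.1 can be observed empirically. The sketch below (an addition, not in the notes) conditions on the event {Xn−1 > 0} and checks that the conditional means of Xn and Xn−1 coincide:

    import numpy as np

    rng = np.random.default_rng(seed=7)
    n, trials = 50, 200_000
    W = rng.choice([1.0, -1.0], size=(trials, n))   # iid steps with E[W_k] = 0
    X = W.cumsum(axis=1)                            # partial sums X_n

    # the martingale property predicts E[X_n | X_{n-1} > 0] = E[X_{n-1} | X_{n-1} > 0]
    mask = X[:, -2] > 0
    print("E[X_n     | X_{n-1} > 0] ~", X[mask, -1].mean())
    print("E[X_{n-1} | X_{n-1} > 0] ~", X[mask, -2].mean())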
Example 4.2: Let {Xk} be a sequence of zero-mean random variables with independent increments. Then, {Xk} is a
Martingale sequence as seen below:
E[Xn|Xn−1, · · · , X0] = E[(Xn − Xn−1)|Xn−1, · · · , X0] + E[Xn−1|Xn−1, · · · , X0] = E[Xn − Xn−1] + Xn−1 = Xn−1
What happens if {Xk} is a sequence of non-zero-mean random variables with independent increments?
The next lemma shows the connection between the strong laws, which have to do with the (a.s.) convergence of sample
sequences, and Martingales. It provides an analog of the Chebyshev inequality for the maximum term in an n-point Martingale
sequence.
Lemma 4.1: Let {Xn} be a Martingale sequence for non-negative integers n. Then, for all ε > 0 and any positive integer
n,
P[max_{0≤k≤n} |Xk| ≥ ε] ≤ E[|Xn|²]/ε²
Proof 5: For 0 ≤ j ≤ n, let us define the mutually exclusive events
Aj ≜ {|Xk| ≥ ε for the first time at k = j}
Then, the event {max_{0≤k≤n} |Xk| ≥ ε} is just the union of the events Aj, where 0 ≤ j ≤ n, and the respective indicator
functions are defined as:
Ij ≜ 1 if Aj occurs, and Ij ≜ 0 if Aj does not occur
P
It is noted that at most one of these Ij ’s is equal to 1 and the rest are 0’s. That is, E[|Xn |2 ≥ nj=0 E[|Xn |2 Ij ]. Then, by
2
expanding |Xn |2 as Xj + (Xn − Xj ) and substituting the above inequality, it follows that
E[|Xn |2 ≥
n
X
j=0
E[|Xj |2 Ij + 2
n
X
E Xj (Xn − Xj )Ij
j=0
Denoting Xj Ij as Zj (which depends only on X0, · · · , Xj), it follows that
E[Zj(Xn − Xj)] = E[E[Zj(Xn − Xj)|X0, · · · , Xj]] = E[Zj E[(Xn − Xj)|X0, · · · , Xj]] = E[Zj(Xj − Xj)] = 0
In the above derivation, we have used the identity E[Y] = E[E[Y|X]] in the first equality and the Martingale property in the
third equality. Therefore,
E[|Xn|²] ≥ Σ_{j=0}^{n} E[|Xj|² Ij] ≥ ε² Σ_{j=0}^{n} E[Ij] = ε² P[⋃_{j=0}^{n} Aj] = ε² P[max_{0≤k≤n} |Xk| ≥ ε]
Theorem 4.3: (Martingale Convergence Theorem) Let {Xn } be a martingale sequence for non-negative integers n, satisfying
the condition: E[|Xn |2 ] ≤ C < ∞ for some C ∈ R. Then, Xn → X (a.s.) as n → ∞, where X is the limiting random
variable.
Proof 6: Having the integers m ≥ 0 and n ≥ 0, let Yn ≜ (Xn+m − Xm). Then, {Yn} is a Martingale sequence and it
follows from Lemma 4.1 that
P[max_{0≤k≤n} |Xm+k − Xm| ≥ ε] ≤ E[|Yn|²]/ε² ∀ε > 0
Having E[|Yn|²] = E[|Xn+m|²] − 2E[Xm Xn+m] + E[|Xm|²], the middle term is rewritten as
E[Xm Xn+m] = E[Xm E[Xn+m|Xm, · · · , X0]] = E[|Xm|²]
where the above expression follows from Remark 4.2 because {Xn} is a Martingale sequence and (n + m) ≥ m. Then, it
follows that E[|Xn|²] is a monotonically non-decreasing sequence of real numbers, because
E[|Yn|²] = E[|Xn+m|²] − E[|Xm|²] ≥ 0 ∀m, n ≥ 0
Since E[|Xn|²] is bounded from above by C < ∞, it must converge to a limit. Therefore, E[|Yn|²] → 0 as m, n → ∞.
Therefore,
lim_{m→∞} P[max_{k≥0} |Xm+k − Xm| ≥ ε] = 0
which, by continuity of the probability measure, implies that
P[lim_{m→∞} max_{k≥0} |Xm+k − Xm| ≥ ε] = 0
Finally, due to completeness of the space of Borel measurable functions (Theorem 1.1), it is concluded that there exists a random
variable X such that Xn → X (a.s.) as n → ∞.
Now we present the proof of the strong law of large numbers, which is restated below.
Theorem 4.4: (Strong Law of Large Numbers) Let {Xk} be a sequence of wide-sense stationary independent random
variables on a probability space (Ω, E, P) such that
E[Xk] = µ and var[Xk] = σ² ∀k ∈ N
Define another random sequence {µ̂n} as
µ̂n = (1/n) Σ_{k=1}^{n} Xk ∀n ∈ N
Then, µ̂n → µ almost surely.
Proof 7: Let us define Xk^c ≜ Xk − µ and Yn ≜ Σ_{k=1}^{n} Xk^c/k for n ∈ N with Y0 = 0 (a.s.).
{Yn} is a Martingale sequence in view of the fact that {Xk} is a sequence of independent random variables:
E[Yn|Yn−1, · · · , Y1] = E[Xn^c/n + Yn−1|Yn−1, · · · , Y1] = E[Xn^c/n|Yn−1, · · · , Y1] + Yn−1 = Yn−1
Since E[|Yn|²] = E[(Σ_{k=1}^{n} Xk^c/k)²] = Σ_{k=1}^{n} σ²/k² = σ² Σ_{k=1}^{n} 1/k² < ∞, because the Xk's are pairwise independent, it follows by
applying Theorem 4.3 that the sequence {Yn} → Y (a.s.), where Y is some random variable.
Since Y0 = 0 (a.s.) and Xk^c = k(Yk − Yk−1), it follows that
(1/n) Σ_{k=1}^{n} Xk^c = (1/n)[Σ_{k=1}^{n} kYk − Σ_{k=1}^{n} kYk−1] = (1/n)[Σ_{k=1}^{n} kYk − Σ_{k=1}^{n−1} (k + 1)Yk] = (1/n)[nYn − Σ_{k=1}^{n−1} Yk] = Yn − (1/n) Σ_{k=1}^{n−1} Yk
Therefore,
lim_{n→∞} (1/n) Σ_{k=1}^{n} Xk^c = lim_{n→∞} Yn − lim_{n→∞} (1/n) Σ_{k=1}^{n−1} Yk = Y − Y = 0 (a.s.)
which implies µ̂n → µ (a.s.).
For an alternative proof, please see https://almostsure.wordpress.com/2009/12/06/upcrossings-downcrossings-and-martingale-convergence/
Remark 4.3: The convergence lim_{n→∞} (1/n) Σ_{k=1}^{n} Yk = Y (a.s.) in the proof of Theorem 4.4 is based on the following fact:
if {ak} is a (bounded) real sequence converging to a in the usual metric, then (1/n) Σ_{k=1}^{n} ak → a as n → ∞.
Proof: Since an → a, it follows that ∀ε > 0 ∃M ∈ N such that |an − a| < ε/2 ∀n ≥ M. Let Σ_{k=1}^{M−1} (ak − a) = C < ∞
and let n̄ = max(2C/ε, M). Then, for all n ≥ n̄,
|(1/n) Σ_{k=1}^{n} ak − a| ≤ (1/n)|Σ_{k=1}^{M−1} (ak − a)| + (1/n) Σ_{k=M}^{n} |ak − a| < C/n + ε/2 ≤ ε/2 + ε/2 = ε
i.e., ∀ε > 0 ∃n̄ ∈ N such that |(1/n) Σ_{k=1}^{n} ak − a| < ε ∀n ≥ n̄.
Remark 4.4: (Converse to Strong Law of Large Numbers) (Ref: Feller Vol. 2, p. 241) Let {Xk} be a sequence of pairwise
independent random variables with a common distribution and let E[|Xk|] = ∞. Then, with probability 1,
lim sup_{n→∞} |(1/n) Σ_{k=1}^{n} Xk − cn| = ∞ for any real sequence {cn}.
V. ERGODICITY
The notion of ergodicity of a stationary stochastic process (or a stationary random sequence) is that its time average over a
window converges to the ensemble average in a certain sense, i.e.,
lim_{T→∞} (1/2T) ∫_{−T}^{T} Xτ dτ = E[Xt] ∀t or lim_{N→∞} (1/2N) Σ_{k=−N}^{N} Xk = E[Xk] ∀k
Note that the left-hand side of each of the above two expressions is a random variable, while the corresponding right-hand
sides are deterministic functions of time. For the equality to hold, the left-hand sides must converge to a constant in some
sense of convergence, and the right-hand sides should be constants as T → ∞ or N → ∞. For example, let us consider
the case of mean-square (i.e., L²(P)) sense ergodicity when the random process becomes progressively uncorrelated with
increasing time shift. Then, the time average appears like the average of many nearly uncorrelated random variables. By the
Chebyshev inequality, the time average should converge to the ensemble average in the mean-square sense as the time shift
tends to infinity.
Next let us discuss the implications of ergodicity from the perspectives of dynamical systems. There are two ways to analyze
the stochastic behavior of a dynamical system described by the following map:
xk+1 = f (xk )
One way is to look at the time evolution of the map from a single initial condition x0 at a steady state. The other way is to
consider an ensemble of different random initial conditions at a steady state. Let ρ0 (x) be the probability density of initial
conditions over the entire sample space. Therefore, the probability P0 (A) of having an initial condition in a measurable subset
A of sample space Ω can be expressed as
P0(A) = ∫_A dx ρ0(x)
Let us analyze the time evolution of an ensemble of initial conditions. After n iterations, the probability Pn(A) of finding
an iterate xn in A is
Pn(A) = ∫_A dx ρn(x)
where ρn (x) is the corresponding probability density after n iterations. Now, Pn+1 (A) = Pn (f −1 (A)) implies that the
probability of finding the (n + 1)th iterate in A is the same as the probability of finding the nth iterate in the preimage of
A, i.e., f −1 (A). From the perspectives of ergodicity, we are especially interested in stationary probability distributions, i.e.,
Pn+1 (A) = Pn (A) ∀n. Such a measure is called an invariant measure and the corresponding density function is called an
invariant density. For an invariant measure P and an invariant density ρ, it follows that
P(A) = P(f⁻¹(A)) and ∫_A dx ρ(x) = ∫_{f⁻¹(A)} dx ρ(x)
Now let Q(x) be an arbitrary integrable function with respect to the invariant density ρ. Then the ensemble average (over the
entire sample space Ω) of Q, denoted by ⟨Q⟩, is expressed as
⟨Q⟩ = ∫_Ω dx ρ(x) Q(x)
In an alternative notation, ρ(x)dx is replaced by dP(x). In this notation,
⟨Q⟩ = ∫_Ω dP(x) Q(x) = ∫_Ω dP(f⁻¹(x)) Q(x) = ∫_Ω dP(x) Q(f(x))
Note that in the last step, x is replaced by f (x). The significance of this observation is that the expectation of an observable
remains invariant in an invariant ensemble under the action of the map f .
On the other hand, the time average of Q is defined as
Q̄ = lim_{N→∞} (1/N) Σ_{n=0}^{N−1} Q(xn)
provided that the limit converges.
We re-iterate the concept of ergodicity here by noting that the notions of ensemble average and time average are different, in
general. When these two averages become equal in a given sense (e.g., mean square), then the map together with the invariant
measure is called ergodic in that specific sense. Let us present an equivalent probabilistic setting to clarify the notions.
• ∅ ≠ Ω ⊆ R is the sample space, i.e., the collection of all possible states ζ of a random system.
• B(Ω) is the collection of all measurable sets (events) E such that it is known whether ζ ∈ E or ζ ∉ E.
• P is a probability measure, i.e., Pr[{ζ ∈ E}] := P(E).
• Measurable functions f : Ω → R are random variables.
• The random sequence {Xn}, where Xn ≜ f ∘ Tⁿ (n ≥ 1), has its distribution given by
Pr[Xi1 ∈ Ei1, · · · , Xik ∈ Eik] ≜ P(⋂_{j=1}^{k} {ζ ∈ Ω : f(T^{ij} ζ) ∈ Eij})
A. Probability Preserving Transformation
In this section, we will study the notions of ergodic theory of Markov chains on a probability measure space that is defined
below.
Definition 5.1: (Probability Measure Space) A probability measure space is a triplet (Ω, B(Ω), P) where,
1) Ω ⊆ R is a nonempty set.
2) B(Ω) is a σ-algebra (i.e., a collection of subsets of Ω that contains ∅ and which is closed under complements and
countable unions). Elements of B(Ω) are called measurable sets.
3) P : B(Ω) → [0, 1] is called the probability measure and it satisfies the countable subadditivity property, i.e., if E1, E2, · · · ∈
B(Ω), then P[∪i Ei] ≤ Σi P(Ei), and the equality holds if, in addition, E1, E2, · · · are (pairwise) disjoint.
Definition 5.2: (Probability Preserving Transformation) A probability preserving transformation (PPT) is a quartet (Ω, B(Ω), P, T )
where (Ω, B(Ω), P ) is a probability measure space and
1) T is a measurable function, i.e., E ∈ B(Ω) implies T −1 E ∈ B(Ω).
2) P is T -invariant, i.e., P (T −1 E) = P (E) ∀E ∈ B(Ω).
Remark 5.1: We provide a physical interpretation of a dynamical system that is ergodic. Given that a dynamical system
satisfies the measure-preserving condition, for almost all initial conditions, it will return arbitrarily close to its initial state
infinitely many times in the arbitrarily far future. Note that this does not depend on the specifics of the system states or its laws
of motion. This idea is also at the heart of Statistical Mechanics, which deals with general many-body systems without explicit
knowledge (e.g., dynamical equations) of the system model. In fact, the basic assumption made by Boltzmann in his work on
Statistical Mechanics is nothing but a quantitative version of Poincaré's recurrence theorem, which also marks the birth of
Ergodic theory. The assumption is known as the "Ergodic Hypothesis".
Theorem 5.1: (Poincaré's Recurrence Theorem) Let (Ω, B(Ω), P, T) be a PPT. If E is a measurable set, then for almost every
x ∈ E, there exists a sequence nk → ∞ such that T^{nk}(x) ∈ E.
Remark 5.2: Ergodic theory is the study of measure-preserving actions of measurable maps on a measure space (usually
with a finite measure, e.g., a probability space). The above setting is a way of studying stochastic phenomena that evolve in a
deterministic context; that is, if the system state x is known at time t = 0, then the system state will be Tⁿ(x) at t = n with
full certainty.
B. Ergodicity and Mixing
Another hypothesis, stronger than the ergodic hypothesis, is known as the mixing hypothesis. This refers to the notion
that an initial distribution of points on a trajectory of the dynamical system very quickly permeates the entire surface, while
occupying the same volume as required in a Hamiltonian (i.e., non-dissipative) system. Both ergodic and mixing hypotheses
play important roles in the fundamental arguments of statistical mechanics.
Definition 5.3: (Invariant Set) Let (Ω, B(Ω), P, T ) be a PPT. A measurable set E ∈ B(Ω) is called invariant (more precisely,
T -invariant), if T −1 (E) = E. Therefore, T can be split into two non-interacting PPTs as follows:
T|E : E → E and T|E^c : E^c → E^c
Definition 5.4: (Ergodicity) A PPT (Ω, B(Ω), P, T ) is called ergodic, if every invariant set E satisfies the condition: P (E) = 0
or P (Ω \ E) = 0. In that case, P is called an ergodic measure.
Let us explain the physical consequence of the above definition. A direct consequence of a map being ergodic is that an
invariant measure of the map can be obtained by iterating (i.e., taking time evolution of) a single initial condition and producing
a histogram; this is achieved by tracking the system evolution through different parts of the phase space. For example, let A
be a certain subset of the sample space Ω and, given the ergodic map T : Ω → Ω, the objective is to find an invariant measure
of A by using the ergodic property of the map. The ensemble average is defined as
⟨Q⟩ = ∫_Ω dx ρ(x) Q(x)
where Q is an arbitrary integrable function with respect to the invariant density ρ. On the other hand, the time average is
defined as
Q̄ = lim_{N→∞} (1/N) Σ_{n=0}^{N−1} Q(xn)
Note that, in general, Q̄ could be a random variable and ⟨Q⟩ could be time-dependent. However, if Q̄ = ⟨Q⟩ in a given sense
(e.g., mean-square, or almost sure) and if Q(x) is chosen as the characteristic function χA(x) of the subset A, then it follows
that
lim_{N→∞} (1/N) Σ_{n=0}^{N−1} χA(xn) = ∫_Ω dx ρ(x) χA(x) = ∫_A dx ρ(x) = P(A)
The fact of the matter is that there may exist several invariant measures for an ergodic map. However, the invariant measure
µ identified from the above perspective is special and very useful in most cases due to its physical relevance. This measure is
known as the natural invariant measure and the corresponding density function ρ(x) is known as the natural invariant density.
However, there are only a very few examples for which the natural invariant density can be determined analytically. Here is
an example.
Example 5.1: (Tent Map) Let the unit interval X = [0, 1] be the phase space. Consider the following one-dimensional map
f : X → X defined as
(
2x
if x ∈ [0, 21 )
f (x) =
2(1 − x)
if x ∈ [ 12 , 1]
Fig. 1. Conservation of interval lengths by the tent map, l0 + l1 = l2 (plot of f(x) on [0, 1] omitted)
Referring to Fig. 1, the total length of arbitrary intervals in the phase space remains invariant under the action of f; for example,
the length segments satisfy the condition l0 + l1 = l2. Thus, in this case, ρ(x) = 1 is an invariant probability density function
(in fact, the natural invariant density). However, in general, invariant density functions tend to be far more complicated than
the uniform distribution.
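A quick numerical sketch (an addition to the notes): a long orbit of the tent map should spend roughly equal time in equal-length subintervals, consistent with ρ(x) = 1. Note that in exact binary floating point the tent map collapses to the fixed point 0 after a few dozen iterations (mantissa bits shift out), so a tiny dither is added as a workaround:

    import numpy as np

    rng = np.random.default_rng(seed=8)
    x, n_iter = 0.234567, 200_000
    orbit = np.empty(n_iter)
    for i in range(n_iter):
        orbit[i] = x
        x = 2.0 * x if x < 0.5 else 2.0 * (1.0 - x)              # tent map
        x = min(max(x + rng.uniform(-1e-10, 1e-10), 0.0), 1.0)   # tiny dither

    hist, _ = np.histogram(orbit, bins=10, range=(0.0, 1.0), density=True)
    print(np.round(hist, 2))   # each bin should be close to 1.0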
Definition 5.5: (Mixing) A PPT (Ω, B(Ω), P, T) is called mixing (or strong mixing) if ∀E, F ∈ B(Ω), P(E ∩ T^{−k}F) → P(E)P(F)
as k → ∞.
Remark 5.3: There is no notion of strong mixing for general infinite measure spaces. Probabilistically speaking, T −k (F )
is “asymptotically independent” of E and physically it actually signifies uniform mixing of one substance into another (Rum
and Coke!!)
Proposition 5.1: Strong mixing implies ergodicity.
Proof 8: Let (Ω, B(Ω), P, T) be a PPT and let E ∈ B(Ω) be T-invariant. Then,
P(E) = P(E ∩ T^{−n}E) → P(E)² as n → ∞
and this can only happen if either P(E) = 0 or P(E) = 1 = P(Ω), which implies P(Ω\E) = 0.
Definition 5.6: (Weak Mixing) A PPT (Ω, B(Ω), P, T) is called weak mixing if ∀E, F ∈ B(Ω),
lim_{n→∞} (1/n) Σ_{k=0}^{n−1} |P(E ∩ T^{−k}F) − P(E)P(F)| = 0
C. Discrete-time Markov Chains
Let X : Z × Ω → N be a discrete-time Markov chain.
Definition 5.7: The a priori state probability mass function is defined as:
pi(k) = P[Xk = i] where k ∈ Z and i ∈ N
and the state transition probability is defined as:
πij(k, ℓ) = P[Xℓ = j|Xk = i] and the operator Π(k, ℓ) = [πij(k, ℓ)]
Definition 5.8: A discrete-time Markov chain is called homogeneous (or shift-invariant) if πij(k, ℓ) depends on (ℓ − k)
only. That is, by setting n = (ℓ − k), it follows that
πij(n) = P[Xk+n = j|Xk = i] where k, n ∈ Z and i, j ∈ N
For a homogeneous discrete-time Markov chain, the state transition probability operator is defined as:
P = [πij(1)] where i, j ∈ N
Next, let us consider homogeneous Markov chains with finitely many states, i.e., X : Z × Ω → {1, 2, · · · , N} where N ∈ N.
That is, P ∈ [0, 1]^{N×N} and each row sum of P is identically equal to 1. Let pi(k) denote the probability of occupying the
state i at the discrete time instant k and let p(k) = [p1(k), · · · , pN(k)]. Then,
p(k + 1) = p(k)P, where P = [πij : i, j ∈ {1, · · · , N}]
Since each row sum of P is unity, the induced infinity norm of the square matrix P is ‖P‖∞ = 1. Therefore, lim_{ℓ→∞} P^ℓ
is a bounded operator because ‖P^ℓ‖ ≤ (‖P‖∞)^ℓ = 1. So,
p(∞) = p(0)P^∞, where p(∞) ≜ lim_{k→∞} p(k) and P^∞ ≜ lim_{k→∞} P^k
Note that the choice of the initial probability vector p(0) ∈ [0, 1]^N is arbitrary; then, it follows from the fact p(∞) = p(∞)P^∞ that the left eigenvector of
P^∞ corresponding to the eigenvalue 1 is p(∞).
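The stationary vector p(∞) can be computed numerically. The sketch below (an addition, with an illustrative matrix P not taken from the text) obtains it both by iterating p(k+1) = p(k)P and as the left eigenvector of P for the eigenvalue 1:

    import numpy as np

    # illustrative primitive stochastic matrix (not from the text); rows sum to 1
    P = np.array([[0.9, 0.1, 0.0],
                  [0.2, 0.7, 0.1],
                  [0.1, 0.3, 0.6]])

    p = np.array([1.0, 0.0, 0.0])      # arbitrary initial probability vector p(0)
    for _ in range(200):               # p(k+1) = p(k) P
        p = p @ P
    print("by iteration:  ", np.round(p, 6))

    w, V = np.linalg.eig(P.T)          # eigenvectors of P^T give left eigenvectors of P
    v = np.real(V[:, np.argmin(np.abs(w - 1.0))])
    print("by eigenvector:", np.round(v / v.sum(), 6))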
Definition 5.9: A finite-state Markov chain (P, p) is called irreducible (or ergodic) if, for each pair of indices (i, j), there exists
an n ∈ N such that the (i, j) entry of the n-th power of P is strictly positive. This positive integer n possibly depends on the
indices of the given matrix entry.
Definition 5.10: An irreducible finite-state Markov chain (P, p) is called primitive (or regular) if there exists an n ∈ N such
that, for all k ≥ n, all entries of the k-th power of P are strictly positive. This positive integer n is obviously independent of
the indices of any matrix entry.
Remark 5.4: For an irreducible Markov chain, each element of the state probability vector p is strictly positive. This
ensures that each state is recurrent, and this property is equivalent to ergodicity (i.e., where a sequence of time averages of a
random variable converges to its ensemble average in a specified sense). That is why a Markov chain that has an irreducible
stochastic matrix is called ergodic. (Ergodicity will be studied in more depth in a later chapter in this course.) An irreducible
Markov chain could be periodic or aperiodic, whereas a primitive matrix is always aperiodic.
Fig. 2. Example of reducible Markov chain (state-transition diagram and the associated stochastic matrix P omitted)
Example 5.2: The Markov chain in Fig. 2 is not irreducible (hence not primitive), because ∄ n ∈ N such that the state 2
reaches the state 1 in n transitions, i.e., there is no finite number of transitions that can take the state 2 to the state 1. Now,
let E = {2, 3}; then E is an invariant set, but µ(E) ≠ 0, 1. Hence the Markov chain cannot be ergodic.
At this point, we state the Perron-Frobenius theorem that is very important for analysis of finite-state Markov chains.
Theorem 5.2 (Perron-Frobenius theorem restricted to finite-state irreducible Markov chains): Let an irreducible Markov chain
with N states be represented as (P, p), i.e., P is an (N × N) irreducible stochastic matrix. Then, the following conditions
hold.
• Condition (i): p P = p, i.e., one of the eigenvalues of P is 1, and the corresponding left eigenvector p > 0, i.e., each
entry of the (1 × N) vector p is strictly positive.
• Condition (ii): The eigenvalue 1 has algebraic multiplicity = 1 and consequently its geometric multiplicity = 1. All other
eigenvalues λi of P are bounded as |λi| ≤ 1, i.e., they must be contained in the complex plane on or within the unit circle
with its center at the origin.
• Condition (iii): The eigenvalues λi of P such that |λi| = 1 are solutions of the equation λ^k − 1 = 0, i.e., the unity-magnitude
eigenvalues that are not equal to 1 must appear as −1 and/or as complex conjugate pairs on the unit circle
with its center at the origin.
Corollary 5.1 (Perron-Frobenius theorem restricted to primitive Markov chains): Let a regular Markov chain with N states
be represented as (P, p), i.e., P is an (N × N) primitive (i.e., irreducible and aperiodic) stochastic matrix. Then, the following
conditions hold.
• Condition (i): p P = p, i.e., exactly one of the eigenvalues of P is 1, and the corresponding left eigenvector p > 0, i.e.,
each entry of the (1 × N) vector p is strictly positive.
• Condition (ii): The eigenvalue 1 has algebraic multiplicity = 1 and consequently its geometric multiplicity = 1. All other
eigenvalues λi of P are bounded as |λi| < 1, i.e., these eigenvalues must be contained strictly within the unit circle with
its center at the origin.
D. Important Theorems on Ergodicity
Two theorems that carry important results on ergodic theory are stated below.
Theorem 5.3: (von Neumann's Mean Ergodic Theorem) Let (Ω, B(Ω), P, T) be a PPT. If f ∈ L²(P), then
(1/N) Σ_{k=0}^{N−1} f ∘ T^k → f̄ in L² as N → ∞
where f̄ ∈ L²(P) is invariant. Moreover, if T is ergodic, then f̄ = ∫ f dP.
Remark 5.5: Invariance of f̄ ∈ L²(P) means ‖f̄ − f̄ ∘ T‖₂ = 0. However, if T is ergodic, then f̄ becomes a constant with
value ∫ f dP. As a matter of fact, f̄ is the projection of f onto the space of invariant functions.
Corollary 5.2: A PPT (Ω, B(Ω), P, T) is ergodic if and only if ∀E, F ∈ B(Ω),
(1/n) Σ_{k=0}^{n−1} P(E ∩ T^{−k}F) → P(E)P(F) as n → ∞
Therefore, "Ergodicity is Mixing on average".
Sketch of the proof: If fn → f in L² as n → ∞, then
⟨fn, g⟩ → ⟨f, g⟩ ∀g ∈ L²
Now let fn = (1/n) Σ_{k=0}^{n−1} 1_B ∘ T^k and g = 1_A. Then, the proof of the corollary follows from the Mean Ergodic theorem.
Theorem 5.4: (Birkhoff's Pointwise Ergodic Theorem) Let (Ω, B(Ω), P, T) be a PPT. If f ∈ L¹(P), then lim_{N→∞} (1/N) Σ_{k=0}^{N−1} f ∘ T^k
exists almost surely (i.e., P-a.e.). The limit is a T-invariant function and, if T is ergodic, it is equal to ∫ f dP.
Remark 5.6: Birkhoff's Pointwise Ergodic Theorem is also known as the strong ergodic theorem due to the strong convergence
property. Direct consequences of the ergodic theorems are observed as follows.
1) The time spent in a measurable subset A (i.e., the sojourn time) of the phase space Ω of an ergodic system is proportional
to its measure, i.e.,
lim_{n→∞} (1/n) Σ_{k=0}^{n−1} χA(T^k x) = P(A)/P(Ω) for almost all x ∈ Ω
where χA is the indicator function of A.