Probability on a Riemannian Manifold
Jennifer Pajda-De La O
December 2, 2015
1 Introduction
We discuss how probability theory can be constructed on a Riemannian manifold and compare this with how probability is usually formulated on Rn. The basic definitions for probability on Rn come from Resnick [2005]. Information about probability theory on a Riemannian manifold is taken from Pennec [1999]. Fisher Information on Rn and the Delta Theorem on Rn are taken from Martin [2015].
We organize the paper as follows. Each section covers a different topic in probability/statistics. We summarize the importance of each concept, and then present the concept first in Rn and then on a Riemannian manifold. From this paper, it is hoped that the similarities and differences between probability in Rn and on a Riemannian manifold are made clear.
2 Definitions
We define important terms and abbreviations and recall important definitions here. We combine the definitions required for probability on Rn and on a Riemannian manifold.
• B : a σ-field
Note: A σ-field has the following properties:
1. Ω ∈ B;
2. If B ∈ B then B^C ∈ B, where C denotes the complement;
3. If B_i ∈ B, i ≥ 1, then ∪_{i=1}^∞ B_i ∈ B.
• B(R) : Borel σ-field. This is the σ-field that is generated by open subsets of R, i.e.
B(R) = σ ({(a, b], −∞ ≤ a ≤ b < ∞})
= σ ({[a, b), −∞ < a ≤ b ≤ ∞})
= σ ({[a, b], −∞ < a ≤ b < ∞})
= σ ({(a, b), −∞ ≤ a ≤ b ≤ ∞})
= σ ({open sets of R}) .
• A : A Borel tribe / σ-field
• (Ω, B, P) is a probability space. In particular, P is a probability measure on the
measurable space (Ω, B).
Note: A probability measure has the following properties:
1. P(∅) = 0; P(Ω) = 1;
2. 0 ≤ P(B) ≤ 1,
∀B ∈ B;
3. If {B_n, n ≥ 1} are disjoint events in B, then P(∪_{n=1}^∞ B_n) = Σ_{n=1}^∞ P(B_n).
• E : Expected Value
• E : Mean value (Fréchet Expectation)
• V : Variance
• COV : Covariance
• 1A : Indicator Function of a set A
• M : Riemannian Manifold
• ⊤ : Matrix Transpose
• iid: Independent and Identically Distributed
• →_D : Convergence in Distribution
Note: We can define convergence in distribution as follows. Suppose {F, F_n, n ≥ 1} are probability distributions. Then X_n →_D X means that F_n →_D F, i.e. X_n converges in distribution to X, or F_n converges weakly to F. Convergence in distribution requires the four types of convergence stated below to be equivalent, and this equivalence will only happen if F is proper, i.e. F(−∞) = 0 and F(∞) = 1.
Resnick [2005]: Let {F_n, n ≥ 1} be probability distribution functions and let F be a distribution function which is not necessarily proper.
1. Vague Convergence: The sequence {F_n} converges vaguely to F, written F_n →_v F, if for every finite interval of continuity I of F, we have F_n(I) → F(I).
2. Proper Convergence: The sequence {F_n} converges properly to F if F_n →_v F and F is a proper distribution function; that is, F(R) = 1.
3. Weak Convergence: The sequence {F_n} converges weakly to F, written F_n →_w F, if F_n(x) → F(x) for all x ∈ C(F). Here, C(F) = {x ∈ R : F is continuous at x}.
4. Complete Convergence: The sequence {F_n} converges completely to F, written F_n →_c F, if F_n →_w F and F is proper.
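For example (a standard illustration, not taken from the sources above), let F_n be the distribution function of a point mass at n. Then F_n(x) → 0 for every fixed x, so F_n converges vaguely (and weakly) to the limit F ≡ 0; but this limit is not proper, since F(∞) = 0, so the convergence is neither proper nor complete: the mass escapes to infinity and there is no convergence in distribution to a random variable.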
3 Setup

In this section, we give the basic setup for each scenario.

3.1 In Rn

Let (Ω, B, P) be a probability space. We can define a random variable as follows:

X : (Ω, B, P) → (R, B(R)).

So X is a random variable with the property that

X^{-1}((−∞, λ]) = [X ≤ λ] ∈ B, ∀λ ∈ R.

There is an induced measure on (R, B(R)). This induced measure is P ◦ X^{-1}. We can also define a distribution function of X as

F(A) = P ◦ X^{-1}(A) = P[X ∈ A].

For a point, F(x) = F((−∞, x]) = P[X ≤ x]. The density function is given as f(x), or f_θ(x) when we need to remind ourselves what the parameter of our distribution is.

3.2 Riemannian Manifold

Let M be a connected Riemannian manifold. Take a continuous collection of dot products on the tangent space T_X M. The basis gives

⟨∂_i, ∂_j⟩_X = Q(X).

In particular, Q(X) is a positive definite matrix. The distance between two points on the manifold is defined as the “minimum length among smooth curves joining the points”. The curves that are created from this procedure are called geodesics. There is a unique geodesic “starting at a given point X with a given tangent vector”. We consider an exponential map, such that each vector is mapped “to the corresponding point on the manifold”. In particular, let y = exp_X(→XY).

The measure that we will be considering on the manifold is the volume measure. This volume measure is induced by the basis, Q(X). Specifically,

dM(X) = √|Q(X)| dX.

On M, we can define a probability space (Ω, B, P). Let X : Ω → M be a random primitive. The induced measure is P ◦ X^{-1}. In particular, take

X : (Ω, B, P) → (M, A).

Let p_X(y) be the pdf of X.
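To make the manifold setup concrete, the following sketch (ours, not part of Pennec [1999]) implements the exponential map, its inverse, and the geodesic distance on the unit sphere S² ⊂ R³, a standard example of a connected Riemannian manifold; the names exp_map, log_map, and dist are our own.

import numpy as np

# Minimal sketch of the Section 3.2 setup on the unit sphere S^2 embedded
# in R^3 (an illustration of ours, not from the paper).  Points are unit
# vectors, tangent vectors at x are orthogonal to x, geodesics are great
# circles.

def exp_map(x, v):
    """Exponential map exp_x(v): follow the geodesic starting at x with
    initial tangent vector v for arc length ||v||."""
    t = np.linalg.norm(v)
    if t < 1e-12:
        return x
    return np.cos(t) * x + np.sin(t) * v / t

def log_map(x, y):
    """Inverse of the exponential map: the tangent vector at x whose
    geodesic reaches y; its length equals dist(x, y)."""
    t = np.arccos(np.clip(np.dot(x, y), -1.0, 1.0))   # dist(x, y)
    w = y - np.dot(x, y) * x                          # direction tangent at x
    if t < 1e-12:
        return np.zeros(3)
    return t * w / np.linalg.norm(w)

def dist(x, y):
    """Geodesic distance: minimum length among smooth curves joining x, y."""
    return np.arccos(np.clip(np.dot(x, y), -1.0, 1.0))

x = np.array([0.0, 0.0, 1.0])                         # north pole
y = exp_map(x, np.array([np.pi / 4, 0.0, 0.0]))
print(dist(x, y))                                     # ~ pi/4
print(np.allclose(exp_map(x, log_map(x, y)), y))      # True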
4 Probability of a Set or Event Occurring

We often want to know the probability of a certain event, so it is important to have a definition for this. The probability of an event occurring depends on the distribution of the random variable. This can be given by its distribution function, by its probability mass function (in the case of discrete variables), or by its probability density function (in the case of continuous variables). We denote the probability of an event A occurring by P[X ∈ A], regardless of whether the random variable X is continuous or discrete. For the discrete case, we can change the integral to a sum.
4.1 In Rn

P ◦ X^{-1}(A) = P[X ∈ A]
= ∫_Ω 1_A(X) dP
= ∫_Ω 1_A(X(ω)) P(dω)
= ∫_R 1_A(x) dF(x)
= ∫_A dF(x)
= ∫_A f(x) dx.

Note that the last three equalities are only true for X as a random variable.

Example: A (continuous) Uniform PDF on the interval [a, b]:

f(x) = 1/(b − a) for x ∈ [a, b] and 0 otherwise, i.e. f(x) = 1_{[a,b]}(x)/(b − a).

4.2 Riemannian Manifold

P[X ∈ M] = ∫_M p_X(y) dM(y) = 1,
P[X ∈ A] = ∫_A p_X(y) dM(y).

Example: A Uniform PDF in a bounded set A:

p_X(y) = 1_A(y)/V(A),

where V(A) is the volume of the set A with respect to the measure dM.
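As an illustration of P[X ∈ A] = ∫_A p_X(y) dM(y), the following Monte Carlo sketch (ours, not from the paper) checks the uniform-PDF example on the sphere M = S², where A is a geodesic ball of radius r around the north pole and P[X ∈ A] = V(A)/V(M) = (1 − cos r)/2.

import numpy as np

# Monte Carlo check (illustration of ours) of P[X in A] for the uniform pdf
# on M = S^2, with A the geodesic ball of radius r around the north pole.
# Under the uniform pdf, P[X in A] = V(A)/V(M) = (1 - cos r)/2.

rng = np.random.default_rng(0)
n, r = 200_000, 0.75
center = np.array([0.0, 0.0, 1.0])                         # center of A

samples = rng.normal(size=(n, 3))
samples /= np.linalg.norm(samples, axis=1, keepdims=True)  # uniform on S^2

geodesic_dist = np.arccos(np.clip(samples @ center, -1.0, 1.0))
print(np.mean(geodesic_dist < r))                          # Monte Carlo estimate
print((1 - np.cos(r)) / 2)                                 # exact V(A)/V(M)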
5 Expected Value of a Function

Expected value measures a long-run average of a random variable/primitive or of a function of a random variable/primitive. For example, we may want to know how much we will win or lose if we are gambling with dice. Depending on the roll of the dice, we either win a certain amount of money, lose a certain amount of money, or break even. Assuming the dice are fair, we know the probability of each outcome of rolling the dice, so we can calculate how much, on average, we would gain or lose from playing the game, provided we play it long enough. In our calculations, we first find the expected value of a function, g(X) or ϕ(X). If we take the identity map, i.e. g(X) = X or ϕ(X) = X, then this gives the long-run average of the random variable/primitive X. The expected value is also the first moment of a distribution. Moments are used to distinguish one distribution from another.
5.1 In Rn

Take g(X) to be a measurable function of X. Then

E(g(X)) = ∫_Ω g(X) dP
= ∫_Ω g(X(ω)) P(dω)
= ∫_R g(x) dF(x)
= ∫_R g(x) f(x) dx.

Note that the last two equalities are only true for X as a random variable.

5.2 Riemannian Manifold

Let ϕ be a Borel real-valued function defined on M. Then ϕ(X) is a real random primitive. The expected value is given as

E[ϕ(X)] = ∫_M ϕ(y) p_X(y) dM(y).
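As an illustration of E[ϕ(X)] = ∫_M ϕ(y) p_X(y) dM(y), the following Monte Carlo sketch (ours, not from the paper) takes X uniform on M = S² and ϕ(y) = dist(p, y) for a fixed point p; integrating θ against the density (1/2) sin θ of the geodesic distance gives the exact value E[ϕ(X)] = π/2.

import numpy as np

# Monte Carlo estimate (illustration of ours) of E[phi(X)] for X uniform on
# the sphere S^2 and phi(y) = dist(p, y), where p is the north pole.
# The exact value is pi/2.

rng = np.random.default_rng(1)
n = 200_000
p = np.array([0.0, 0.0, 1.0])

y = rng.normal(size=(n, 3))
y /= np.linalg.norm(y, axis=1, keepdims=True)        # uniform samples on S^2

phi = np.arccos(np.clip(y @ p, -1.0, 1.0))           # phi(y) = dist(p, y)
print(phi.mean())                                    # ~ pi/2
print(np.pi / 2)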
6 Variance

Variance, or the square of the standard deviation, is very important in statistics. It is the second central moment of a distribution. Variance measures the spread of the data: if the variance is large, then the data points are more spread out from the mean, while a small variance indicates that the data are more concentrated around the mean. We can also calculate the variance of a function of the statistic we are interested in; this is done with the Delta Theorem, stated in Section 6.1.
6.1 In Rn

If X² ∈ L1, then X ∈ L2. Define variance as

V(X) = E[(X − E(X))²] = E(X²) − [E(X)]².

Theorem 1. (Delta Theorem) For random variables T_n, assume that

n^{1/2}(T_n − θ) →_D N(0, v(θ)),

where v(θ) is the asymptotic variance. Let g(·) be a function differentiable at θ, with g′(θ) ≠ 0. Then

n^{1/2}{g(T_n) − g(θ)} →_D N(0, v_g(θ)),

where v_g(θ) = [g′(θ)]² v(θ).

6.2 Riemannian Manifold

Suppose dist(X, Y)² = ‖→XY‖². Then

σ²_X(y) = E[dist(y, X)²] = ∫_M dist(y, z)² p_X(z) dM(z).
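The following simulation sketch (ours, not from Martin [2015]) checks Theorem 1 numerically: with X_i iid Exponential with mean θ, the sample mean T_n satisfies n^{1/2}(T_n − θ) →_D N(0, θ²), and for g(t) = log t the Delta Theorem gives v_g(θ) = [1/θ]² θ² = 1.

import numpy as np

# Simulation check (illustration of ours) of the Delta Theorem.  X_i iid
# Exponential with mean theta, T_n = sample mean, g(t) = log(t).  Then
# v(theta) = theta^2, g'(theta) = 1/theta, so v_g(theta) = 1.

rng = np.random.default_rng(2)
theta, n, reps = 2.0, 500, 20_000

x = rng.exponential(scale=theta, size=(reps, n))
t_n = x.mean(axis=1)
z = np.sqrt(n) * (np.log(t_n) - np.log(theta))       # n^{1/2}{g(T_n) - g(theta)}

print(z.mean())                                      # ~ 0
print(z.var())                                       # ~ v_g(theta) = 1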
7 Covariance

Covariance is important when you have multiple random variables with different distributions. Random variables can be “related” to one another in some way, and the covariance is one metric that displays this relationship (another is the correlation, which is a scaled covariance). For example, a positive covariance between two random variables means that they tend to behave similarly (a positive correlation); a negative covariance means that they tend to act opposite to one another (a negative correlation). When the covariance is 0, the random variables are uncorrelated. If two random variables are independent, then their covariance is always zero; the converse is not necessarily true.
7.1 In Rn

If X, Y ∈ L2, then

COV(X, Y) = E[(X − E(X))(Y − E(Y))] = E(XY) − E(X)E(Y).

If →X and →Y are random vectors, then

COV(→X, →Y) = E[(→X − E→X)(→Y − E→Y)⊤] = E(→X →Y⊤) − E(→X) E(→Y)⊤.

7.2 Riemannian Manifold

Covariance calculations depend on how we view our chart. For example, if we consider our chart as a matrix, then covariance depends on the choice of basis. However, if we do not view our chart as a matrix and we view it as a “bilinear form over the tangent plane,” then covariance calculations do not depend on the basis.

Earlier we defined E as the mean value, or Fréchet expectation. In particular, E is the set of mean primitives. We can also define it as

E[X] = arg min_{y ∈ M} E[dist(y, X)²].

Suppose X̄ ∈ E[X]. We take X̄ to be the unique mean value of X. Then

Σ_XX = COV_X̄(X) = E[→X̄X (→X̄X)⊤].
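To make the Fréchet expectation and the covariance Σ_XX concrete, the following sketch (ours, not from the paper) computes an empirical mean value and covariance for a sample on the unit sphere, using the standard fixed-point iteration X̄ ← exp_X̄(mean of the vectors →X̄x_i); the helper names exp_map, log_map, and frechet_mean are our own.

import numpy as np

# Empirical Frechet mean  argmin_y sum_i dist(y, x_i)^2  and empirical
# covariance  Sigma = (1/n) sum_i ->Xbar x_i (->Xbar x_i)^T  for points on
# the unit sphere S^2, with tangent vectors expressed in R^3 coordinates.
# (Illustration of ours, not from the paper.)

def exp_map(x, v):
    t = np.linalg.norm(v)
    return x if t < 1e-12 else np.cos(t) * x + np.sin(t) * v / t

def log_map(x, y):
    t = np.arccos(np.clip(np.dot(x, y), -1.0, 1.0))
    w = y - np.dot(x, y) * x
    return np.zeros(3) if t < 1e-12 else t * w / np.linalg.norm(w)

def frechet_mean(points, iters=50):
    xbar = points[0]
    for _ in range(iters):
        step = np.mean([log_map(xbar, p) for p in points], axis=0)
        xbar = exp_map(xbar, step)                    # fixed-point update
    return xbar

rng = np.random.default_rng(3)
north = np.array([0.0, 0.0, 1.0])
tangent_noise = 0.3 * rng.normal(size=(500, 3)) * np.array([1.0, 1.0, 0.0])
pts = np.array([exp_map(north, v) for v in tangent_noise])

xbar = frechet_mean(pts)
vecs = np.array([log_map(xbar, p) for p in pts])      # vectors ->Xbar x_i
sigma = vecs.T @ vecs / len(pts)                      # empirical Sigma_XX
print(xbar)                                           # close to the north pole
print(sigma)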
8 Fisher Information
One interpretation of Fisher Information is that “variance is small if the Fisher information
is big.” This can be seen from the Cramér-Rao Lower Bound. We display this theorem in
Section 8.1.
8.1 In Rn

Suppose θ is n-dimensional and f_θ(x) is the density of X with respect to P. Then the following are the FI regularity conditions.

1. ∂f_θ(x)/∂θ_i exists P-a.e. for each i.
2. ∫ f_θ(x) dP(x) can be differentiated under the integral sign.
3. The support of f_θ is the same for all θ.

Fisher Information may be defined as

I_X(θ)_ij = COV_θ(∂ log f_θ(X)/∂θ_i , ∂ log f_θ(X)/∂θ_j) = −E_θ[∂² log f_θ(X)/∂θ_i ∂θ_j],

where the quantities inside the covariance are the components of the score vector, and the last equality is true provided that “we can differentiate twice under the integral sign.”

Theorem 2. (Cramér-Rao Lower Bound) For simplicity, take θ to be a scalar, and assume that f_θ satisfies the FI regularity conditions. Let X_1, . . . , X_n be iid ∼ f_θ and let T = T(X_1, . . . , X_n) be a real-valued statistic with E_θ(T) = g(θ). Then

V_θ(T) ≥ {g′(θ)}² {nI(θ)}^{-1}.

8.2 Riemannian Manifold

Information is given by I(X). Entropy is H(X). Then

I(X) = −H(X) = E[log p_X(X)].
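As a numerical illustration of Theorem 2 (ours, not from the sources), consider the Bernoulli(θ) model, where the score of one observation is X/θ − (1 − X)/(1 − θ) and I(θ) = 1/(θ(1 − θ)); the sample mean T is unbiased for g(θ) = θ and its variance attains the bound θ(1 − θ)/n.

import numpy as np

# Cramer-Rao Lower Bound check (illustration of ours) for the Bernoulli(theta)
# model.  The Fisher information is estimated as the variance of the score of
# one observation, and the variance of the sample mean T is compared with the
# bound {g'(theta)}^2 {n I(theta)}^{-1} = theta(1 - theta)/n for g(theta) = theta.

rng = np.random.default_rng(4)
theta, n, reps = 0.3, 50, 100_000

x = rng.binomial(1, theta, size=(reps, n))
t = x.mean(axis=1)                                     # statistic T

score = x[:, 0] / theta - (1 - x[:, 0]) / (1 - theta)  # score of one observation
fisher = score.var()                                   # ~ I(theta)

print(t.var())                                         # V_theta(T)
print(1.0 / (n * fisher))                              # estimated bound
print(theta * (1 - theta) / n)                         # exact bound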
9 Multivariate Normal Distribution

The normal distribution is widely used in statistics. In many cases, as long as the sample size is large enough, we can approximate the distribution of our data by a normal distribution because of the Central Limit Theorem (CLT) for iid random variables. Note that there is also a CLT for the case where random variables are independent, but not necessarily identically distributed; this is the Lindeberg-Feller CLT. We give the CLT for the Rn case below.

Theorem 3. (CLT) Let {X_n, n ≥ 1} be iid random variables with E(X_n) = µ and V(X_n) = σ². Suppose N is a random variable with N(0, 1) distribution. If S_n = X_1 + · · · + X_n, then

(S_n − nµ)/(σ√n) →_D N.

Moreover, the normal distribution has many “nice” properties, so various calculations are easier to complete.
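The following quick simulation (ours, not from the sources) illustrates Theorem 3 with Uniform(0, 1) summands.

import numpy as np

# CLT check (illustration of ours): standardized sums of iid Uniform(0, 1)
# variables are approximately N(0, 1).

rng = np.random.default_rng(5)
n, reps = 200, 50_000
mu, sigma = 0.5, np.sqrt(1.0 / 12.0)                  # mean, sd of Uniform(0, 1)

s_n = rng.uniform(size=(reps, n)).sum(axis=1)
z = (s_n - n * mu) / (sigma * np.sqrt(n))

print(z.mean(), z.var())                              # ~ 0 and ~ 1
print(np.mean(z < 1.0))                               # ~ Phi(1) = 0.8413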
9.1 In Rn

Let X = [X_1, . . . , X_k] be a vector such that X ∼ N(µ, Σ). Then the pdf of X is given by

f(X) = (2π)^{-k/2} |Σ|^{-1/2} exp(−(1/2)(X − µ)⊤ Σ^{-1} (X − µ)).   (1)

This pdf exists only when Σ is positive definite. The entropy (or −I_X(θ)) is given by

(k/2)(1 + ln(2π)) + (1/2) ln|Σ|.

Suppose X ∼ N(µ, σ²). Then the pdf of X is given by

f(x) = (2πσ²)^{-1/2} exp(−(x − µ)²/(2σ²)).

9.2 Riemannian Manifold

The “pdf [of a normal distribution on a manifold tries to minimize] the information with a fixed mean value and covariance.” Suppose we have a cut locus C(X̄) with no continuity or differentiability constraint, a symmetric domain D(X̄), and a concentration matrix Γ. Suppose k is a normalization constant. Then the Normal Distribution pdf is given by

N_(X̄,Γ)(y) = k exp(−(→X̄y)⊤ Γ (→X̄y)/2),

where

k^{-1} = ∫_M exp(−(→X̄y)⊤ Γ (→X̄y)/2) dM(y),

Σ = k ∫_M →X̄y (→X̄y)⊤ exp(−(→X̄y)⊤ Γ (→X̄y)/2) dM(y).

Note that a high concentration matrix Γ occurs if and only if there is a small covariance matrix Σ. The equation in Section 9.2 will give the Gaussian PDF shown in Equation (1) when working in a vector space.
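The following numerical sketch (ours, not from Pennec [1999]) evaluates the Section 9.2 formulas on the circle M = S¹: with the mean X̄ fixed, the tangent vector →X̄y is the signed angle θ ∈ (−π, π], dM(y) = dθ, and Γ is a scalar concentration γ; for a high concentration the resulting Σ is close to 1/γ, the vector-space (Gaussian) value.

import numpy as np

# Normalization constant k and covariance Sigma of the manifold normal
# distribution on the circle, computed by numerical integration over the
# angle theta in (-pi, pi].  (Illustration of ours.)

gamma = 4.0
theta = np.linspace(-np.pi, np.pi, 200_001)
dtheta = theta[1] - theta[0]

unnormalized = np.exp(-gamma * theta**2 / 2.0)        # exp(-theta * Gamma * theta / 2)

k = 1.0 / np.sum(unnormalized * dtheta)               # k^{-1} = integral over M
sigma = k * np.sum(theta**2 * unnormalized * dtheta)  # Sigma

print(k)
print(sigma, 1.0 / gamma)                             # Sigma ~ 1/gamma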
References
Ryan Martin. Lecture notes on advanced statistical theory. Supplement to the lectures for
Stat 511 at UIC given by the author, January 2015.
Xavier Pennec. Probabilities and statistics on Riemannian manifolds: Basic tools for geometric measurements. In Proc. of Nonlinear Signal and Image Processing (NSIP’99), pages 194–198, 1999.
Sidney Resnick. A Probability Path. Birkhäuser, 2005.