The Multivariate Gaussian
Jian Zhao
Outline
What is the multivariate Gaussian?
Parameterizations
Mathematical Preparation
Joint distributions, marginalization and conditioning
Maximum likelihood estimation
What is the multivariate Gaussian?
$$p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right\}$$
where $x$ is an $n \times 1$ vector and $\Sigma$ is an $n \times n$ symmetric, positive-definite matrix.
For $n = 2$, for instance,
$$\Sigma = \begin{pmatrix} \sigma_{x_1}^2 & \sigma_{x_1 x_2} \\ \sigma_{x_1 x_2} & \sigma_{x_2}^2 \end{pmatrix}$$
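As a concrete check, here is a minimal NumPy sketch (the function name and example values are mine, not from the slides) that evaluates this density directly from the formula:

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    """Evaluate the multivariate Gaussian density at x.

    x, mu: length-n vectors; sigma: n x n symmetric positive-definite matrix.
    """
    n = len(x)
    diff = x - mu
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(sigma))
    quad = diff @ np.linalg.inv(sigma) @ diff  # (x - mu)^T Sigma^{-1} (x - mu)
    return np.exp(-0.5 * quad) / norm

# Example with a 2-D covariance as above (illustrative numbers)
mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
print(gaussian_density(np.array([0.5, -0.5]), mu, sigma))
```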
Geometrical interpretation
Excluding the normalization factor and the exponential function, we look at the quadratic form
$$(x-\mu)^T \Sigma^{-1} (x-\mu)$$
If we set it to a constant value $c$ and consider the case $n = 2$, writing $a_{ij}$ for the entries of $\Sigma^{-1}$ and centering the coordinates at $\mu$, we obtain
$$a_{11}x_1^2 + 2a_{12}x_1x_2 + a_{22}x_2^2 = c$$
Geometrical interpretation (continued)
This is an ellipse in the coordinates $x_1$ and $x_2$. Thus we can easily imagine that as $n$ increases, the ellipses become higher-dimensional ellipsoids.
Geometrical interpretation (continued)
Thus we get a Gaussian bump in $n+1$ dimensions, where the level surfaces of the bump are ellipsoids oriented along the eigenvectors of $\Sigma$.
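This claim is easy to verify numerically. In the sketch below (the example $\Sigma$ and $c$ are arbitrary), the eigenvectors of $\Sigma$ give the axis directions of the level set $(x-\mu)^T\Sigma^{-1}(x-\mu) = c$, whose semi-axis lengths are $\sqrt{c\,\lambda_i}$:

```python
import numpy as np

# Arbitrary example covariance and level-set constant
sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
c = 1.0

# Columns of vecs are the ellipse axis directions (eigenvectors of Sigma)
vals, vecs = np.linalg.eigh(sigma)
semi_axes = np.sqrt(c * vals)  # semi-axis lengths of the level-set ellipse
print("axis directions (columns):\n", vecs)
print("semi-axis lengths:", semi_axes)
```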
Parameterization
The moment parameterization uses the mean and covariance:
$$\mu = E(x), \qquad \Sigma = E\left[(x-\mu)(x-\mu)^T\right]$$
Another parameterization, the canonical parameterization, puts the density into exponential-family form:
$$p(x \mid \eta, \Lambda) = \exp\left\{a + \eta^T x - \frac{1}{2}x^T \Lambda x\right\}$$
where $\Lambda = \Sigma^{-1}$, $\eta = \Sigma^{-1}\mu$, and
$$a = -\frac{1}{2}\left(n\log(2\pi) - \log|\Lambda| + \eta^T\Lambda^{-1}\eta\right)$$
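As a sanity check, here is a minimal sketch (the helper name `moment_to_canonical` is mine) confirming that the two parameterizations assign the same density to a point:

```python
import numpy as np

def moment_to_canonical(mu, sigma):
    """Convert (mu, Sigma) to the canonical parameters (a, eta, Lambda)."""
    lam = np.linalg.inv(sigma)   # Lambda = Sigma^{-1}
    eta = lam @ mu               # eta = Sigma^{-1} mu
    n = len(mu)
    a = -0.5 * (n * np.log(2 * np.pi)
                - np.log(np.linalg.det(lam))
                + eta @ sigma @ eta)  # eta^T Lambda^{-1} eta, with Lambda^{-1} = Sigma
    return a, eta, lam

mu = np.array([1.0, -1.0])
sigma = np.array([[1.5, 0.3],
                  [0.3, 0.8]])
x = np.array([0.2, 0.4])

a, eta, lam = moment_to_canonical(mu, sigma)
canonical = np.exp(a + eta @ x - 0.5 * x @ lam @ x)

# Compare against the moment-form density; the two should agree
diff = x - mu
moment = np.exp(-0.5 * diff @ lam @ diff) / (
    (2 * np.pi) ** (len(x) / 2) * np.sqrt(np.linalg.det(sigma)))
print(canonical, moment)
```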
Mathematical preparation
To derive the marginal and conditional distributions of a partitioned multivariate Gaussian, we need the theory of block-diagonalizing a partitioned matrix.
To carry out maximum likelihood estimation, we need some facts about traces of matrices.
Partitioned matrices
Consider a general partitioned matrix
$$M = \begin{pmatrix} E & F \\ G & H \end{pmatrix}$$
To zero out the upper-right-hand and lower-left-hand corners of $M$, we can premultiply and postmultiply by matrices of the following form:
$$\begin{pmatrix} I & -FH^{-1} \\ 0 & I \end{pmatrix}\begin{pmatrix} E & F \\ G & H \end{pmatrix}\begin{pmatrix} I & 0 \\ -H^{-1}G & I \end{pmatrix} = \begin{pmatrix} E - FH^{-1}G & 0 \\ 0 & H \end{pmatrix}$$
Partitioned matrices (continued)
Define the Schur complement of $M$ with respect to $H$, denoted $M/H$, as the term
$$M/H = E - FH^{-1}G$$
Since $(XYZ)^{-1} = Z^{-1}Y^{-1}X^{-1}$, we have $Y^{-1} = ZW^{-1}X$ for $W = XYZ$. So
$$\begin{pmatrix} E & F \\ G & H \end{pmatrix}^{-1} = \begin{pmatrix} I & 0 \\ -H^{-1}G & I \end{pmatrix}\begin{pmatrix} (M/H)^{-1} & 0 \\ 0 & H^{-1} \end{pmatrix}\begin{pmatrix} I & -FH^{-1} \\ 0 & I \end{pmatrix}$$
$$= \begin{pmatrix} (M/H)^{-1} & -(M/H)^{-1}FH^{-1} \\ -H^{-1}G(M/H)^{-1} & H^{-1} + H^{-1}G(M/H)^{-1}FH^{-1} \end{pmatrix}$$
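This block-inverse formula can be verified numerically; below is a sketch with arbitrary random blocks:

```python
import numpy as np

rng = np.random.default_rng(0)
E, F = rng.normal(size=(2, 2)), rng.normal(size=(2, 3))
G, H = rng.normal(size=(3, 2)), rng.normal(size=(3, 3)) + 3 * np.eye(3)

M = np.block([[E, F], [G, H]])
Hinv = np.linalg.inv(H)
MHinv = np.linalg.inv(E - F @ Hinv @ G)   # (M/H)^{-1}

# Assemble the inverse block by block, as in the formula above
M_inv = np.block([
    [MHinv,              -MHinv @ F @ Hinv],
    [-Hinv @ G @ MHinv,  Hinv + Hinv @ G @ MHinv @ F @ Hinv],
])
print(np.allclose(M_inv, np.linalg.inv(M)))  # True
```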
Partitioned matrices (continued)
Note that we could alternatively have decomposed the matrix $M$ in terms of $E$ and $M/E = H - GE^{-1}F$, yielding the following expression for the inverse:
$$\begin{pmatrix} I & 0 \\ -GE^{-1} & I \end{pmatrix}\begin{pmatrix} E & F \\ G & H \end{pmatrix}\begin{pmatrix} I & -E^{-1}F \\ 0 & I \end{pmatrix} = \begin{pmatrix} E & 0 \\ 0 & M/E \end{pmatrix}$$
$$\begin{pmatrix} E & F \\ G & H \end{pmatrix}^{-1} = \begin{pmatrix} I & -E^{-1}F \\ 0 & I \end{pmatrix}\begin{pmatrix} E^{-1} & 0 \\ 0 & (M/E)^{-1} \end{pmatrix}\begin{pmatrix} I & 0 \\ -GE^{-1} & I \end{pmatrix}$$
$$= \begin{pmatrix} E^{-1} + E^{-1}F(M/E)^{-1}GE^{-1} & -E^{-1}F(M/E)^{-1} \\ -(M/E)^{-1}GE^{-1} & (M/E)^{-1} \end{pmatrix}$$
Partitioned matrices (continued)
Thus, equating the corresponding blocks of the two expressions for $M^{-1}$, we get the matrix inversion lemma
$$(E - FH^{-1}G)^{-1} = E^{-1} + E^{-1}F(H - GE^{-1}F)^{-1}GE^{-1}$$
$$(E - FH^{-1}G)^{-1}FH^{-1} = E^{-1}F(H - GE^{-1}F)^{-1}$$
At the same time, taking determinants in the block diagonalization, we get the conclusion
$$|M| = |M/H|\,|H|$$
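Both conclusions can be checked in a few lines (a sketch; the shifts by $4I$ just keep the random blocks well-conditioned):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
E = rng.normal(size=(n, n)) + 4 * np.eye(n)
F, G = rng.normal(size=(n, n)), rng.normal(size=(n, n))
H = rng.normal(size=(n, n)) + 4 * np.eye(n)
Einv, Hinv = np.linalg.inv(E), np.linalg.inv(H)

# Matrix inversion lemma
lhs = np.linalg.inv(E - F @ Hinv @ G)
rhs = Einv + Einv @ F @ np.linalg.inv(H - G @ Einv @ F) @ G @ Einv
print(np.allclose(lhs, rhs))  # True

# Determinant identity |M| = |M/H| |H|
M = np.block([[E, F], [G, H]])
print(np.isclose(np.linalg.det(M),
                 np.linalg.det(E - F @ Hinv @ G) * np.linalg.det(H)))  # True
```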
Theory of traces
Define
$$\mathrm{tr}[A] = \sum_i a_{ii}$$
It has the following properties:
$$\mathrm{tr}[ABC] = \mathrm{tr}[CAB] = \mathrm{tr}[BCA]$$
$$x^T A x = \mathrm{tr}[x^T A x] = \mathrm{tr}[xx^T A]$$
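These identities are easy to confirm numerically (a small sketch with random matrices):

```python
import numpy as np

rng = np.random.default_rng(2)
A, B, C = (rng.normal(size=(3, 3)) for _ in range(3))
x = rng.normal(size=3)

# Cyclic property of the trace
print(np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B)),
      np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A)))

# A scalar equals its own trace: x^T A x = tr[x x^T A]
print(np.isclose(x @ A @ x, np.trace(np.outer(x, x) @ A)))
```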
Theory of traces (continued)
$$\frac{\partial}{\partial a_{ij}}\mathrm{tr}[AB] = \frac{\partial}{\partial a_{ij}}\sum_k\sum_l a_{kl}b_{lk} = b_{ji}$$
so
$$\frac{\partial}{\partial A}\mathrm{tr}[BA] = B^T$$
$$\frac{\partial}{\partial A}x^T A x = \frac{\partial}{\partial A}\mathrm{tr}[xx^T A] = [xx^T]^T = xx^T$$
Theory of traces (continued)
We want to show that
$$\frac{\partial}{\partial A}\log|A| = (A^{-1})^T$$
Since
$$\frac{\partial}{\partial a_{ij}}\log|A| = \frac{1}{|A|}\frac{\partial|A|}{\partial a_{ij}}$$
and recalling that
$$A^{-1} = \frac{1}{|A|}\tilde{A}$$
where $\tilde{A}$ is the transposed matrix of cofactors, this is equivalent to proving that $\partial|A|/\partial a_{ij}$ is the $(i,j)$ cofactor of $A$. Noting the cofactor expansion
$$|A| = \sum_j (-1)^{i+j} a_{ij} M_{ij}$$
in which the minor $M_{ij}$ does not involve $a_{ij}$, we get $\partial|A|/\partial a_{ij} = (-1)^{i+j}M_{ij}$, which completes the proof.
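Both derivative identities from these two slides can be checked against finite differences (a sketch; `num_grad` is my own helper):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3
A = rng.normal(size=(n, n))   # almost surely invertible
B = rng.normal(size=(n, n))
eps = 1e-6

def num_grad(f, X):
    """Entrywise finite-difference gradient of a scalar function of a matrix."""
    G = np.zeros_like(X)
    for i in range(n):
        for j in range(n):
            Xp = X.copy()
            Xp[i, j] += eps
            G[i, j] = (f(Xp) - f(X)) / eps
    return G

# d/dA tr[BA] = B^T
print(np.allclose(num_grad(lambda X: np.trace(B @ X), A), B.T, atol=1e-4))
# d/dA log|A| = (A^{-1})^T; slogdet returns log|det A|, avoiding sign issues
print(np.allclose(num_grad(lambda X: np.linalg.slogdet(X)[1], A),
                  np.linalg.inv(A).T, atol=1e-4))
```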
Joint distributions, marginalization and conditioning
We partition the $n \times 1$ vector $x$ into a $p \times 1$ block $x_1$ and a $q \times 1$ block $x_2$, where $n = p + q$:
$$x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \qquad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$$
$$p(x \mid \mu,\Sigma) = \frac{1}{(2\pi)^{(p+q)/2}|\Sigma|^{1/2}}\exp\left\{-\frac{1}{2}\begin{pmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{pmatrix}^T\begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}^{-1}\begin{pmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{pmatrix}\right\}$$
Marginalization and conditioning
Applying the block diagonalization of $\Sigma$ to the exponent:
$$\exp\left\{-\frac{1}{2}\begin{pmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{pmatrix}^T\begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}^{-1}\begin{pmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{pmatrix}\right\}$$
$$= \exp\left\{-\frac{1}{2}\begin{pmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{pmatrix}^T\begin{pmatrix} I & 0 \\ -\Sigma_{22}^{-1}\Sigma_{21} & I \end{pmatrix}\begin{pmatrix} (\Sigma/\Sigma_{22})^{-1} & 0 \\ 0 & \Sigma_{22}^{-1} \end{pmatrix}\begin{pmatrix} I & -\Sigma_{12}\Sigma_{22}^{-1} \\ 0 & I \end{pmatrix}\begin{pmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{pmatrix}\right\}$$
$$= \exp\left\{-\frac{1}{2}\left(x_1-\mu_1-\Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2)\right)^T(\Sigma/\Sigma_{22})^{-1}\left(x_1-\mu_1-\Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2)\right)\right\}$$
$$\quad\times\exp\left\{-\frac{1}{2}(x_2-\mu_2)^T\Sigma_{22}^{-1}(x_2-\mu_2)\right\}$$
Normalization factor
Since $|\Sigma| = |\Sigma/\Sigma_{22}|\,|\Sigma_{22}|$, the normalization factor splits the same way:
$$\frac{1}{(2\pi)^{(p+q)/2}|\Sigma|^{1/2}} = \frac{1}{(2\pi)^{(p+q)/2}\left(|\Sigma/\Sigma_{22}|\,|\Sigma_{22}|\right)^{1/2}} = \frac{1}{(2\pi)^{p/2}|\Sigma/\Sigma_{22}|^{1/2}}\cdot\frac{1}{(2\pi)^{q/2}|\Sigma_{22}|^{1/2}}$$
Marginalization and conditioning
Thus
$$p(x_2) = \frac{1}{(2\pi)^{q/2}|\Sigma_{22}|^{1/2}}\exp\left\{-\frac{1}{2}(x_2-\mu_2)^T\Sigma_{22}^{-1}(x_2-\mu_2)\right\}$$
$$p(x_1\mid x_2) = \frac{1}{(2\pi)^{p/2}|\Sigma/\Sigma_{22}|^{1/2}}\exp\left\{-\frac{1}{2}\left(x_1-\mu_1-\Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2)\right)^T(\Sigma/\Sigma_{22})^{-1}\left(x_1-\mu_1-\Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2)\right)\right\}$$
Marginalization & Conditioning
Marginalization:
$$\mu_2^m = \mu_2, \qquad \Sigma_2^m = \Sigma_{22}$$
Conditioning:
$$\mu_{1|2}^c = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2), \qquad \Sigma_{1|2}^c = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$$
In the canonical parameterization:
Marginalization:
$$\eta_2^m = \eta_2 - \Lambda_{21}\Lambda_{11}^{-1}\eta_1, \qquad \Lambda_2^m = \Lambda_{22} - \Lambda_{21}\Lambda_{11}^{-1}\Lambda_{12}$$
Conditioning:
$$\eta_{1|2}^c = \eta_1 - \Lambda_{12}x_2, \qquad \Lambda_{1|2}^c = \Lambda_{11}$$
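A sketch of the moment-form conditioning formulas (the block-slicing helper and example values are mine), together with a cross-check of the canonical-form identity $\Lambda_{1|2}^c = \Lambda_{11}$:

```python
import numpy as np

def condition(mu, sigma, p, x2):
    """Parameters of p(x1 | x2) for x = (x1, x2) with dim(x1) = p."""
    mu1, mu2 = mu[:p], mu[p:]
    s11, s12 = sigma[:p, :p], sigma[:p, p:]
    s21, s22 = sigma[p:, :p], sigma[p:, p:]
    s22_inv = np.linalg.inv(s22)
    mu_c = mu1 + s12 @ s22_inv @ (x2 - mu2)   # conditional mean
    sigma_c = s11 - s12 @ s22_inv @ s21       # Schur complement Sigma/Sigma_22
    return mu_c, sigma_c

mu = np.array([0.0, 1.0, -1.0])
sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
mu_c, sigma_c = condition(mu, sigma, p=1, x2=np.array([0.5, -0.2]))
# The marginal of x2 is simply N(mu[1:], sigma[1:, 1:])

# Cross-check against the canonical form: Lambda_{1|2} = Lambda_11
lam = np.linalg.inv(sigma)
print(np.allclose(np.linalg.inv(sigma_c), lam[:1, :1]))  # True
```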
Maximum likelihood estimation
Calculating $\mu$: up to an additive constant, the log likelihood is
$$l(\mu,\Sigma\mid D) = -\frac{N}{2}\log|\Sigma| - \frac{1}{2}\sum_{i=1}^N (x_i-\mu)^T\Sigma^{-1}(x_i-\mu)$$
Taking the derivative with respect to $\mu$:
$$\frac{\partial l}{\partial\mu} = \sum_{i=1}^N (x_i-\mu)^T\Sigma^{-1}$$
Setting this to zero yields
$$\hat{\mu} = \frac{1}{N}\sum_{i=1}^N x_i$$
Estimate ∑
We need to take the derivative with respect to $\Sigma$:
$$l(\Sigma\mid D) = -\frac{N}{2}\log|\Sigma| - \frac{1}{2}\sum_n (x_n-\mu)^T\Sigma^{-1}(x_n-\mu)$$
$$= \frac{N}{2}\log|\Sigma^{-1}| - \frac{1}{2}\sum_n \mathrm{tr}\left[(x_n-\mu)^T\Sigma^{-1}(x_n-\mu)\right]$$
$$= \frac{N}{2}\log|\Sigma^{-1}| - \frac{1}{2}\sum_n \mathrm{tr}\left[(x_n-\mu)(x_n-\mu)^T\Sigma^{-1}\right]$$
according to the properties of traces. Using the trace and log-determinant derivatives above, differentiating with respect to $\Sigma^{-1}$ gives
$$\frac{\partial l}{\partial\Sigma^{-1}} = \frac{N}{2}\Sigma - \frac{1}{2}\sum_n (x_n-\mu)(x_n-\mu)^T$$
Estimate ∑
Thus the maximum likelihood estimator is
$$\hat{\Sigma}_{ML} = \frac{1}{N}\sum_n (x_n-\hat{\mu})(x_n-\hat{\mu})^T$$
The maximum likelihood estimators of the canonical parameters are
$$\hat{\Lambda}_{ML} = \hat{\Sigma}_{ML}^{-1}, \qquad \hat{\eta}_{ML} = \hat{\Sigma}_{ML}^{-1}\hat{\mu}$$
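A sketch of these estimators on synthetic data (the true parameters below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
# Synthetic data: N draws from an illustrative 2-D Gaussian
true_mu = np.array([1.0, -2.0])
true_sigma = np.array([[1.0, 0.4],
                       [0.4, 0.7]])
X = rng.multivariate_normal(true_mu, true_sigma, size=5000)  # shape (N, n)

N = X.shape[0]
mu_hat = X.mean(axis=0)                 # (1/N) sum_i x_i
centered = X - mu_hat
sigma_hat = centered.T @ centered / N   # (1/N) sum_n (x_n - mu_hat)(x_n - mu_hat)^T

# Canonical-parameter estimates
lam_hat = np.linalg.inv(sigma_hat)
eta_hat = lam_hat @ mu_hat
print(mu_hat)
print(sigma_hat)
```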