Exact Inference on Graphical Models

The Multivariate Gaussian
Jian Zhao
Outline

- What is the multivariate Gaussian?
- Parameterizations
- Mathematical preparation
- Joint distributions, marginalization and conditioning
- Maximum likelihood estimation
What is the multivariate Gaussian?

$$p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right\}$$

where $x$ is an $n \times 1$ vector and $\Sigma$ is an $n \times n$ symmetric, positive-definite matrix. For $n = 2$, for example,

$$\Sigma = \begin{pmatrix} \sigma_{x_1}^2 & \sigma_{x_1 x_2} \\ \sigma_{x_1 x_2} & \sigma_{x_2}^2 \end{pmatrix}$$
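As a quick numerical check, here is a minimal sketch (assuming NumPy and SciPy are available; the helper name gaussian_pdf is illustrative) that evaluates this density directly and compares it with scipy.stats.multivariate_normal:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, Sigma):
    """Evaluate the multivariate Gaussian density at x."""
    n = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (n / 2) * np.linalg.det(Sigma) ** 0.5
    quad = diff @ np.linalg.solve(Sigma, diff)  # (x-mu)^T Sigma^{-1} (x-mu)
    return np.exp(-0.5 * quad) / norm

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])  # symmetric, positive definite
x = np.array([0.3, 0.8])

print(gaussian_pdf(x, mu, Sigma))
print(multivariate_normal.pdf(x, mean=mu, cov=Sigma))  # should agree
```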
Geometric interpretation

Setting aside the normalization factor and the exponential function, we look at the quadratic form

$$(x-\mu)^T \Sigma^{-1} (x-\mu)$$

If we set it to a constant value $c$ and consider the case $n = 2$, writing $a_{ij}$ for the entries of $\Sigma^{-1}$ (and measuring $x$ from $\mu$), we obtain

$$a_{11} x_1^2 + 2 a_{12} x_1 x_2 + a_{22} x_2^2 = c$$
Geometric interpretation (continued)

This is an ellipse in the coordinates $x_1$ and $x_2$. Thus we can easily imagine that as $n$ increases, the ellipse becomes a higher-dimensional ellipsoid.
Geometric interpretation (continued)

Thus we get a Gaussian bump in $n+1$ dimensions, where the level surfaces of the bump are ellipsoids oriented along the eigenvectors of $\Sigma$.
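To make the eigenvector picture concrete, here is a small sketch (NumPy; the numbers are arbitrary) that recovers the axes of a level-set ellipse from the eigendecomposition of $\Sigma$:

```python
import numpy as np

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

# Eigendecomposition of the symmetric covariance matrix.
eigvals, eigvecs = np.linalg.eigh(Sigma)

# For the level set (x - mu)^T Sigma^{-1} (x - mu) = c, the ellipse's
# principal axes point along the eigenvectors of Sigma, with semi-axis
# lengths sqrt(c * lambda_i).
c = 1.0
for lam, v in zip(eigvals, eigvecs.T):
    print(f"axis direction {v}, semi-axis length {np.sqrt(c * lam):.3f}")
```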
Parameterization

$$\mu = E(x), \qquad \Sigma = E\left[(x-\mu)(x-\mu)^T\right]$$

Another parameterization puts the density into exponential-family form:

$$p(x \mid \eta, \Lambda) = \exp\left\{a + \eta^T x - \frac{1}{2}\,x^T \Lambda x\right\}$$

where the canonical parameters are

$$\Lambda = \Sigma^{-1}, \qquad \eta = \Sigma^{-1}\mu, \qquad a = -\frac{1}{2}\left(n \log(2\pi) - \log|\Lambda| + \eta^T \Lambda^{-1} \eta\right)$$
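A minimal sketch of the conversion between the two parameterizations, assuming NumPy (the helper names to_canonical and to_moment are illustrative):

```python
import numpy as np

def to_canonical(mu, Sigma):
    """Moment parameters (mu, Sigma) -> canonical parameters (eta, Lambda)."""
    Lam = np.linalg.inv(Sigma)
    eta = Lam @ mu
    return eta, Lam

def to_moment(eta, Lam):
    """Canonical parameters (eta, Lambda) -> moment parameters (mu, Sigma)."""
    Sigma = np.linalg.inv(Lam)
    mu = Sigma @ eta
    return mu, Sigma

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
eta, Lam = to_canonical(mu, Sigma)
mu2, Sigma2 = to_moment(eta, Lam)
assert np.allclose(mu, mu2) and np.allclose(Sigma, Sigma2)  # round trip
```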
Mathematical preparation

- To derive the marginalization and conditioning of a partitioned multivariate Gaussian distribution, we need the theory of block-diagonalizing a partitioned matrix.
- To carry out maximum likelihood estimation, we need some facts about traces of matrices.
Partitioned matrices

Consider a general partitioned matrix

$$M = \begin{pmatrix} E & F \\ G & H \end{pmatrix}$$

To zero out the upper-right-hand and lower-left-hand corners of $M$, we can premultiply and postmultiply by triangular matrices of the following form:

$$\begin{pmatrix} I & -FH^{-1} \\ 0 & I \end{pmatrix}\begin{pmatrix} E & F \\ G & H \end{pmatrix}\begin{pmatrix} I & 0 \\ -H^{-1}G & I \end{pmatrix} = \begin{pmatrix} E - FH^{-1}G & 0 \\ 0 & H \end{pmatrix}$$
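A quick numerical spot-check of this block diagonalization, with arbitrary random blocks (NumPy; H is shifted to keep it safely invertible):

```python
import numpy as np

rng = np.random.default_rng(0)
E, F, G, H = (rng.standard_normal((2, 2)) for _ in range(4))
H += 3 * np.eye(2)  # keep H safely invertible

M = np.block([[E, F], [G, H]])
I, Z = np.eye(2), np.zeros((2, 2))

left = np.block([[I, -F @ np.linalg.inv(H)], [Z, I]])
right = np.block([[I, Z], [-np.linalg.inv(H) @ G, I]])

block_diag = left @ M @ right
expected = np.block([[E - F @ np.linalg.inv(H) @ G, Z], [Z, H]])
assert np.allclose(block_diag, expected)  # diag(E - F H^{-1} G, H)
```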
Partitioned matrices (continued)

Define the Schur complement of the matrix $M$ with respect to $H$, denoted $M/H$, as the term

$$M/H = E - FH^{-1}G$$

Since $(XYZ)^{-1} = Z^{-1}Y^{-1}X^{-1} = W^{-1}$ gives $Y^{-1} = Z\,W^{-1}X$, we have

$$\begin{pmatrix} E & F \\ G & H \end{pmatrix}^{-1} = \begin{pmatrix} I & 0 \\ -H^{-1}G & I \end{pmatrix}\begin{pmatrix} (M/H)^{-1} & 0 \\ 0 & H^{-1} \end{pmatrix}\begin{pmatrix} I & -FH^{-1} \\ 0 & I \end{pmatrix}$$

$$= \begin{pmatrix} (M/H)^{-1} & -(M/H)^{-1}FH^{-1} \\ -H^{-1}G(M/H)^{-1} & H^{-1} + H^{-1}G(M/H)^{-1}FH^{-1} \end{pmatrix}$$
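Continuing the same kind of numerical sketch, the assembled block inverse can be checked against np.linalg.inv directly:

```python
import numpy as np

rng = np.random.default_rng(1)
E, F, G, H = (rng.standard_normal((2, 2)) for _ in range(4))
E += 3 * np.eye(2)
H += 3 * np.eye(2)  # keep the blocks invertible

Hinv = np.linalg.inv(H)
S = np.linalg.inv(E - F @ Hinv @ G)  # (M/H)^{-1}

Minv = np.block([
    [S,             -S @ F @ Hinv],
    [-Hinv @ G @ S,  Hinv + Hinv @ G @ S @ F @ Hinv],
])
M = np.block([[E, F], [G, H]])
assert np.allclose(Minv, np.linalg.inv(M))
```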
Partitioned matrices (continued)

Note that we could alternatively have decomposed the matrix $M$ in terms of $E$ and $M/E = H - GE^{-1}F$:

$$\begin{pmatrix} I & 0 \\ -GE^{-1} & I \end{pmatrix}\begin{pmatrix} E & F \\ G & H \end{pmatrix}\begin{pmatrix} I & -E^{-1}F \\ 0 & I \end{pmatrix} = \begin{pmatrix} E & 0 \\ 0 & M/E \end{pmatrix}$$

yielding the following expression for the inverse:

$$\begin{pmatrix} E & F \\ G & H \end{pmatrix}^{-1} = \begin{pmatrix} I & -E^{-1}F \\ 0 & I \end{pmatrix}\begin{pmatrix} E^{-1} & 0 \\ 0 & (M/E)^{-1} \end{pmatrix}\begin{pmatrix} I & 0 \\ -GE^{-1} & I \end{pmatrix}$$

$$= \begin{pmatrix} E^{-1} + E^{-1}F(M/E)^{-1}GE^{-1} & -E^{-1}F(M/E)^{-1} \\ -(M/E)^{-1}GE^{-1} & (M/E)^{-1} \end{pmatrix}$$
Partitioned matrices (continued)

Equating the corresponding blocks of the two expressions for $M^{-1}$, we get

$$(E - FH^{-1}G)^{-1} = E^{-1} + E^{-1}F(H - GE^{-1}F)^{-1}GE^{-1}$$

$$(E - FH^{-1}G)^{-1}FH^{-1} = E^{-1}F(H - GE^{-1}F)^{-1}$$

At the same time we get the conclusion

$$|M| = |M/H|\,|H|$$
Theory of traces

Define

$$\mathrm{tr}[A] = \sum_i a_{ii}$$

It has the following properties:

$$\mathrm{tr}[ABC] = \mathrm{tr}[CAB] = \mathrm{tr}[BCA]$$

$$x^T A x = \mathrm{tr}[x^T A x] = \mathrm{tr}[x x^T A]$$

(the first equality holds because $x^T A x$ is a scalar, and the second follows from the cyclic property).
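These properties are easy to confirm numerically (a sketch with arbitrary shapes):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))
C = rng.standard_normal((2, 3))

# Cyclic property of the trace.
assert np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B))
assert np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A))

# A scalar quadratic form equals the trace of the rank-one product.
x = rng.standard_normal(3)
M = rng.standard_normal((3, 3))
assert np.isclose(x @ M @ x, np.trace(np.outer(x, x) @ M))
```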
Theory of traces (continued)

$$\frac{\partial}{\partial a_{ij}}\,\mathrm{tr}[AB] = \frac{\partial}{\partial a_{ij}}\sum_k \sum_l a_{kl}\,b_{lk} = b_{ji}$$

so

$$\frac{\partial}{\partial A}\,\mathrm{tr}[BA] = B^T$$

$$\frac{\partial}{\partial A}\,x^T A x = \frac{\partial}{\partial A}\,\mathrm{tr}[x x^T A] = [x x^T]^T = x x^T$$
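A finite-difference sanity check of the first identity (the step size eps is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))

# Finite-difference gradient of tr[BA] with respect to each entry a_ij.
eps = 1e-6
grad = np.zeros_like(A)
for i in range(3):
    for j in range(3):
        Ap = A.copy()
        Ap[i, j] += eps
        grad[i, j] = (np.trace(B @ Ap) - np.trace(B @ A)) / eps

assert np.allclose(grad, B.T, atol=1e-4)  # d tr[BA] / dA = B^T
```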
Theory of traces (continued)

We want to show that

$$\frac{\partial}{\partial A}\log|A| = (A^{-1})^T = A^{-T}$$

Since

$$\frac{\partial}{\partial a_{ij}}\log|A| = \frac{1}{|A|}\,\frac{\partial |A|}{\partial a_{ij}}$$

and recalling the adjugate formula

$$A^{-1} = \frac{1}{|A|}\,\tilde{A}$$

this is equivalent to proving $\partial|A|/\partial a_{ij} = \tilde{A}_{ji}$. Noting the cofactor expansion

$$|A| = \sum_j (-1)^{i+j} a_{ij} M_{ij}$$

in which the minor $M_{ij}$ does not depend on $a_{ij}$, we get $\partial|A|/\partial a_{ij} = (-1)^{i+j} M_{ij} = \tilde{A}_{ji}$, as required.
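The same finite-difference style of check works for the log-determinant identity (A is made positive definite so that log|A| is well defined):

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((3, 3))
A = B @ B.T + np.eye(3)  # positive definite, so log|A| exists

def logdet(X):
    return np.linalg.slogdet(X)[1]

# Finite-difference gradient of log|A| with respect to each entry a_ij.
eps = 1e-6
grad = np.zeros_like(A)
for i in range(3):
    for j in range(3):
        Ap = A.copy()
        Ap[i, j] += eps
        grad[i, j] = (logdet(Ap) - logdet(A)) / eps

assert np.allclose(grad, np.linalg.inv(A).T, atol=1e-4)  # A^{-T}
```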
Joint distributions, marginalization and conditioning

We partition the $n \times 1$ vector $x$ into a $p \times 1$ block $x_1$ and a $q \times 1$ block $x_2$, where $n = p + q$:

$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$$

$$p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{(p+q)/2}\,|\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}\begin{pmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{pmatrix}^T \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}^{-1} \begin{pmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{pmatrix}\right\}$$
Marginalization and conditioning

Applying the block diagonalization of $\Sigma^{-1}$ from the previous section, the exponent factorizes:

$$\exp\left\{-\frac{1}{2}\begin{pmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{pmatrix}^T \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}^{-1} \begin{pmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{pmatrix}\right\}$$

$$= \exp\left\{-\frac{1}{2}\begin{pmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{pmatrix}^T \begin{pmatrix} I & 0 \\ -\Sigma_{22}^{-1}\Sigma_{21} & I \end{pmatrix}\begin{pmatrix} (\Sigma/\Sigma_{22})^{-1} & 0 \\ 0 & \Sigma_{22}^{-1} \end{pmatrix}\begin{pmatrix} I & -\Sigma_{12}\Sigma_{22}^{-1} \\ 0 & I \end{pmatrix}\begin{pmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{pmatrix}\right\}$$

$$= \exp\left\{-\frac{1}{2}\left(x_1-\mu_1-\Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2)\right)^T (\Sigma/\Sigma_{22})^{-1}\left(x_1-\mu_1-\Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2)\right)\right\}\exp\left\{-\frac{1}{2}(x_2-\mu_2)^T \Sigma_{22}^{-1}(x_2-\mu_2)\right\}$$
Normalization factor

Using $|\Sigma| = |\Sigma/\Sigma_{22}|\,|\Sigma_{22}|$, the normalization factor splits in the same way:

$$\frac{1}{(2\pi)^{(p+q)/2}\,|\Sigma|^{1/2}} = \frac{1}{(2\pi)^{(p+q)/2}\left(|\Sigma/\Sigma_{22}|\,|\Sigma_{22}|\right)^{1/2}} = \frac{1}{(2\pi)^{p/2}\,|\Sigma/\Sigma_{22}|^{1/2}} \cdot \frac{1}{(2\pi)^{q/2}\,|\Sigma_{22}|^{1/2}}$$
Marginalization and conditioning

Thus

$$p(x_2) = \frac{1}{(2\pi)^{q/2}\,|\Sigma_{22}|^{1/2}} \exp\left\{-\frac{1}{2}(x_2-\mu_2)^T \Sigma_{22}^{-1}(x_2-\mu_2)\right\}$$

$$p(x_1 \mid x_2) = \frac{1}{(2\pi)^{p/2}\,|\Sigma/\Sigma_{22}|^{1/2}} \exp\left\{-\frac{1}{2}\left(x_1-\mu_1-\Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2)\right)^T (\Sigma/\Sigma_{22})^{-1}\left(x_1-\mu_1-\Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2)\right)\right\}$$
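A numerical check of the factorization $p(x) = p(x_1 \mid x_2)\,p(x_2)$, using SciPy with $p = q = 1$ (the numbers are arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x1, x2 = 0.3, 1.5

# Marginal of x2 and conditional of x1 given x2, from the formulas above.
p_x2 = mvn.pdf(x2, mean=mu[1], cov=Sigma[1, 1])
cond_mean = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x2 - mu[1])
cond_var = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]  # Schur complement
p_x1_given_x2 = mvn.pdf(x1, mean=cond_mean, cov=cond_var)

# The product should equal the joint density.
assert np.isclose(mvn.pdf([x1, x2], mean=mu, cov=Sigma), p_x1_given_x2 * p_x2)
```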
Marginalization & Conditioning

Marginalization
2 m  2
 2   22
m

Conditioning

c
1|2
 1  12  22 ( x2  2 )
1
c1|2  11  12  22 1  21
In another form

Marginalization
2 m  2   211111
1
11
 2   22   21 12
m

Conditioning

c
1|2

c
1|2
 1  12 x2
 11
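A minimal sketch of the moment-form formulas (NumPy; the helper names marginal and conditional, and the block split, are illustrative):

```python
import numpy as np

def marginal(mu, Sigma, p):
    """Marginal of x2 when x = (x1, x2) with x1 of length p."""
    return mu[p:], Sigma[p:, p:]

def conditional(mu, Sigma, p, x2):
    """Conditional distribution of x1 given an observed x2."""
    S11, S12 = Sigma[:p, :p], Sigma[:p, p:]
    S21, S22 = Sigma[p:, :p], Sigma[p:, p:]
    K = S12 @ np.linalg.inv(S22)
    mu_c = mu[:p] + K @ (x2 - mu[p:])   # mu_1 + S12 S22^{-1} (x2 - mu_2)
    Sigma_c = S11 - K @ S21             # Schur complement Sigma / Sigma_22
    return mu_c, Sigma_c

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(marginal(mu, Sigma, p=1))
print(conditional(mu, Sigma, p=1, x2=np.array([1.5])))
```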
Maximum likelihood estimation

Calculating $\mu$. Up to an additive constant, the log likelihood of a dataset $D = \{x_1, \ldots, x_N\}$ is

$$l(\mu, \Sigma \mid D) = -\frac{N}{2}\log|\Sigma| - \frac{1}{2}\sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1}(x_i-\mu)$$

Taking the derivative with respect to $\mu$:

$$\frac{\partial l}{\partial \mu} = \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1}$$

Setting this to zero yields

$$\hat{\mu} = \frac{1}{N}\sum_{i=1}^N x_i$$
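In code, the estimator is just the sample mean; a sketch with synthetic data (the true parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
true_mu = np.array([1.0, -2.0])
true_Sigma = np.array([[2.0, 0.5],
                       [0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=5000)  # N x n data

mu_hat = X.mean(axis=0)  # (1/N) sum_i x_i
print(mu_hat)            # should be close to true_mu
```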
Estimate Σ

We need to take the derivative with respect to $\Sigma$:

$$l(\Sigma \mid D) = -\frac{N}{2}\log|\Sigma| - \frac{1}{2}\sum_n (x_n-\mu)^T \Sigma^{-1}(x_n-\mu)$$

$$= \frac{N}{2}\log|\Sigma^{-1}| - \frac{1}{2}\sum_n \mathrm{tr}\left[(x_n-\mu)^T \Sigma^{-1}(x_n-\mu)\right]$$

$$= \frac{N}{2}\log|\Sigma^{-1}| - \frac{1}{2}\sum_n \mathrm{tr}\left[(x_n-\mu)(x_n-\mu)^T\,\Sigma^{-1}\right]$$

according to the properties of traces. Differentiating with respect to $\Sigma^{-1}$, using the trace and log-determinant derivatives above:

$$\frac{\partial l}{\partial \Sigma^{-1}} = \frac{N}{2}\Sigma - \frac{1}{2}\sum_n (x_n-\mu)(x_n-\mu)^T$$
Estimate Σ

Setting this derivative to zero, the maximum likelihood estimator is

$$\hat{\Sigma}_{ML} = \frac{1}{N}\sum_n (x_n - \hat{\mu})(x_n - \hat{\mu})^T$$

The maximum likelihood estimators of the canonical parameters are

$$\hat{\Lambda} = \hat{\Sigma}_{ML}^{-1}, \qquad \hat{\eta} = \hat{\Sigma}_{ML}^{-1}\,\hat{\mu}_{ML}$$
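Completing the sketch above, the covariance MLE and the canonical-parameter estimates:

```python
import numpy as np

rng = np.random.default_rng(4)
true_mu = np.array([1.0, -2.0])
true_Sigma = np.array([[2.0, 0.5],
                       [0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=5000)

mu_hat = X.mean(axis=0)
diff = X - mu_hat
Sigma_hat = diff.T @ diff / len(X)  # (1/N) sum_n (x_n - mu_hat)(x_n - mu_hat)^T

Lambda_hat = np.linalg.inv(Sigma_hat)  # canonical precision estimate
eta_hat = Lambda_hat @ mu_hat          # canonical mean estimate
print(Sigma_hat)                       # should be close to true_Sigma
```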