CS B553: ALGORITHMS FOR OPTIMIZATION AND LEARNING
Continuous Probability Distributions and Bayesian Networks with Continuous Variables

AGENDA
- Continuous probability distributions
- Common families: the Gaussian distribution
- Linear Gaussian Bayesian networks

CONTINUOUS PROBABILITY DISTRIBUTIONS
- Let X be a random variable in R, and let P(X) be a probability distribution over X
- $P(x) \ge 0$ for all x, and the distribution "sums to 1"
- Challenge: (most of the time) $P(X = x) = 0$ for any single value x

CDF AND PDF
- Probability density function (pdf) f(x): nonnegative, with $\int_{-\infty}^{\infty} f(x)\,dx = 1$
- Cumulative distribution function (cdf) g(x): $g(x) = P(X \le x)$
- $g(-\infty) = 0$, $g(\infty) = 1$, $g(x) = \int_{-\infty}^{x} f(y)\,dy$, g is monotonic, and $f(x) = g'(x)$
- [Figure: a pdf f(x) and the corresponding cdf g(x), which increases monotonically to 1]
- Both cdfs and pdfs are complete representations of the probability space over X, but pdfs are usually more intuitive to work with

CAVEATS
- pdfs may exceed 1
- Deterministic values, or variables taking on a few discrete values, can be represented in terms of the Dirac delta function $\delta_a(x)$ (an improper function):
  $\delta_a(x) = 0$ if $x \ne a$, $\delta_a(x) = \infty$ if $x = a$, and $\int \delta_a(x)\,dx = 1$

COMMON DISTRIBUTIONS
- [Figure: pdfs of two uniform distributions, U(a1, b1) and U(a2, b2)]
- Uniform distribution U(a, b): $p(x) = \frac{1}{b-a}$ if $x \in [a, b]$, and 0 otherwise
  $P(X \le x) = 0$ if $x < a$, $\frac{x-a}{b-a}$ if $x \in [a, b]$, and 1 if $x > b$
- Gaussian (normal) distribution $N(\mu, \sigma)$, with mean $\mu$ and standard deviation $\sigma$:
  $p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
- $P(X \le x)$ has no closed form; for N(0, 1) it is $\frac{1}{2}\left(1 + \mathrm{erf}(x/\sqrt{2})\right)$

MULTIVARIATE CONTINUOUS DISTRIBUTIONS
- Consider the c.d.f. $g(x, y) = P(X \le x, Y \le y)$
- $g(-\infty, y) = 0$, $g(x, -\infty) = 0$, $g(\infty, \infty) = 1$
- $g(x, \infty) = P(X \le x)$, $g(\infty, y) = P(Y \le y)$
- g is monotonic
- The joint density is given by the p.d.f. f(x, y) iff $g(p, q) = \int_{-\infty}^{p}\int_{-\infty}^{q} f(x, y)\,dy\,dx$,
  i.e. $P(a_x \le X \le b_x,\ a_y \le Y \le b_y) = \int_{a_x}^{b_x}\int_{a_y}^{b_y} f(x, y)\,dy\,dx$

MARGINALIZATION WORKS OVER PDFS
- Marginalizing f(x, y) over y: if $h(x) = \int_{-\infty}^{\infty} f(x, y)\,dy$, then h(x) is a p.d.f. for X
- Proof: $P(X \le a) = P(X \le a,\ Y \le \infty) = g(a, \infty) = \int_{-\infty}^{a}\int_{-\infty}^{\infty} f(x, y)\,dy\,dx$, so
  $h(a) = \frac{d}{da} P(X \le a) = \frac{d}{da}\int_{-\infty}^{a}\int_{-\infty}^{\infty} f(x, y)\,dy\,dx$ (definition)
  $= \int_{-\infty}^{\infty} f(a, y)\,dy$ (fundamental theorem of calculus)
- So the joint density contains all the information needed to reconstruct the density of each individual variable

CONDITIONAL DENSITIES
- We might want to represent the density P(Y | X = x)... but how?
- Naively, $P(a \le Y \le b \mid X = x) = P(a \le Y \le b,\ X = x)/P(X = x)$, but the denominator is 0!
- Instead, given the joint pdf p(x, y), take the limit of $P(a \le Y \le b \mid x - \epsilon \le X \le x + \epsilon)$ as $\epsilon \to 0$:
  $P(a \le Y \le b \mid x - \epsilon \le X \le x + \epsilon) = \frac{\int_a^b \int_{x-\epsilon}^{x+\epsilon} p(x, y)\,dx\,dy}{\int_{x-\epsilon}^{x+\epsilon} p(x)\,dx} \approx \frac{\int_a^b 2\epsilon\, p(x, y)\,dy}{2\epsilon\, p(x)} = \int_a^b \frac{p(x, y)}{p(x)}\,dy$
- So $p(x, y)/p(x)$ is the conditional density
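(Aside, not part of the original slides.) The pdf/cdf relationship and the erf-based N(0, 1) cdf from the CDF AND PDF and COMMON DISTRIBUTIONS slides can be checked numerically. This is a minimal sketch assuming Python with NumPy; the grid resolution and query points are arbitrary choices.

```python
# Numeric sketch: integrating the standard normal pdf reproduces the
# erf-based cdf quoted on the COMMON DISTRIBUTIONS slide.
import numpy as np
from math import erf, sqrt, pi

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian density p(x) = exp(-(x-mu)^2 / (2 sigma^2)) / (sqrt(2 pi) sigma)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

def normal_cdf(x):
    """Closed-form cdf of N(0, 1): (1 + erf(x / sqrt(2))) / 2."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# Approximate g(x) = integral of f from -inf to x with a Riemann sum on a grid.
xs = np.linspace(-8.0, 3.0, 200001)            # effectively (-inf, 3]
dx = xs[1] - xs[0]
running_integral = np.cumsum(normal_pdf(xs)) * dx

for x0 in (-1.0, 0.0, 1.5):
    i = np.searchsorted(xs, x0)
    print(x0, running_integral[i], normal_cdf(x0))
# The two columns agree to several decimal places, illustrating that
# f = g' and g(x) = integral_{-inf}^{x} f(y) dy.
```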
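(Aside, not part of the original slides.) The MARGINALIZATION and CONDITIONAL DENSITIES facts can likewise be checked on a grid for a simple joint density; here f(x, y) = x + y on the unit square, with the density and grid size being arbitrary illustrative choices.

```python
# Numeric sketch: marginalizing a joint density gives a valid density, and
# p(y | x0) = f(x0, y) / h(x0) integrates to 1 over y.
import numpy as np

n = 2001
xs = np.linspace(0.0, 1.0, n)
ys = np.linspace(0.0, 1.0, n)
dx, dy = xs[1] - xs[0], ys[1] - ys[0]
X, Y = np.meshgrid(xs, ys, indexing="ij")   # X[i, j] = xs[i], Y[i, j] = ys[j]

f = X + Y                                   # joint pdf on [0,1]^2 (integrates to 1)

print("total mass of joint   :", f.sum() * dx * dy)    # ~1.0

# Marginal h(x) = integral of f(x, y) dy -- should itself integrate to 1.
h = f.sum(axis=1) * dy
print("total mass of marginal:", h.sum() * dx)          # ~1.0

# Conditional density at a fixed x0 = xs[i0] -- integrates to 1 over y.
i0 = n // 4
cond = f[i0, :] / h[i0]
print("mass of conditional   :", cond.sum() * dy)       # ~1.0
```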
TRANSFORMATIONS OF CONTINUOUS RANDOM VARIABLES
- Suppose we want to compute the distribution of f(X), where X is a random variable distributed w.r.t. the density $p_X(x)$
- Assume f is monotonic and invertible, and consider the random variable Y = f(X)
- $P(Y \le y) = \int I[f(x) \le y]\, p_X(x)\,dx = \int_{-\infty}^{f^{-1}(y)} p_X(x)\,dx = P(X \le f^{-1}(y))$ (for increasing f)
- $p_Y(y) = \frac{d}{dy} P(Y \le y) = \frac{d}{dy} P(X \le f^{-1}(y)) = p_X(f^{-1}(y))\,\frac{d}{dy} f^{-1}(y)$ (chain rule)
  $= p_X(f^{-1}(y)) / f'(f^{-1}(y))$ (derivative of the inverse function; for decreasing f, take the absolute value)

NOTES
- In general, continuous multivariate distributions are hard to handle exactly
- But there are specific classes that lead to efficient exact inference techniques, in particular Gaussians
- Other distributions usually require resorting to Monte Carlo approaches

MULTIVARIATE GAUSSIANS
- The multivariate analog in N-D space: $X \sim N(\mu, \Sigma)$, with mean (vector) $\mu$ and covariance (matrix) $\Sigma$
- $P(X = x) = \frac{1}{Z}\, e^{-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)}$, with normalization factor $Z = (2\pi)^{d/2} |\Sigma|^{1/2}$

INDEPENDENCE IN GAUSSIANS
- If $X \sim N(\mu_X, \Sigma_X)$ and $Y \sim N(\mu_Y, \Sigma_Y)$ are independent, then
  $(X, Y) \sim N\!\left(\begin{bmatrix}\mu_X \\ \mu_Y\end{bmatrix}, \begin{bmatrix}\Sigma_X & 0 \\ 0 & \Sigma_Y\end{bmatrix}\right)$
- Moreover, if $X \sim N(\mu, \Sigma)$, then $\Sigma_{ij} = 0$ iff $X_i$ and $X_j$ are independent

LINEAR TRANSFORMATIONS
- Linear transformations of Gaussians: if $X \sim N(\mu, \Sigma)$ and $Y = AX + b$, then $Y \sim N(A\mu + b,\ A\Sigma A^T)$
- In fact, $(X, Y) \sim N\!\left(\begin{bmatrix}\mu \\ A\mu + b\end{bmatrix}, \begin{bmatrix}\Sigma & \Sigma A^T \\ A\Sigma & A\Sigma A^T\end{bmatrix}\right)$
- Consequence: if $X \sim N(\mu_x, \Sigma_x)$ and $Y \sim N(\mu_y, \Sigma_y)$ are independent and Z = X + Y, then $Z \sim N(\mu_x + \mu_y,\ \Sigma_x + \Sigma_y)$

MARGINALIZATION AND CONDITIONING
- If $(X, Y) \sim N\!\left(\begin{bmatrix}\mu_X \\ \mu_Y\end{bmatrix}, \begin{bmatrix}\Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY}\end{bmatrix}\right)$, then:
- Marginalization: summing out Y gives $X \sim N(\mu_X, \Sigma_{XX})$
- Conditioning: on observing Y = y, we have $X \sim N\!\left(\mu_X + \Sigma_{XY}\Sigma_{YY}^{-1}(y - \mu_Y),\ \Sigma_{XX} - \Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}\right)$
- (A code sketch of these formulas appears after the SUM OPERATION slide below.)

LINEAR GAUSSIAN MODELS
- A conditional linear Gaussian (CLG) model has $P(Y \mid X = x) = N(\mu_0 + Ax, \Sigma_0)$, with parameters $\mu_0$, A, and $\Sigma_0$
- If $X \sim N(\mu_X, \Sigma_X)$, then the joint distribution over (X, Y) is given by
  $(X, Y) \sim N\!\left(\begin{bmatrix}\mu_X \\ \mu_0 + A\mu_X\end{bmatrix}, \begin{bmatrix}\Sigma_X & \Sigma_X A^T \\ A\Sigma_X & A\Sigma_X A^T + \Sigma_0\end{bmatrix}\right)$
- (Recall the linear transformation rule: if $X \sim N(\mu, \Sigma)$ and $y = Ax + b$, then $Y \sim N(A\mu + b,\ A\Sigma A^T)$.)

CLG BAYESIAN NETWORKS
- If all variables in a Bayesian network have Gaussian or CLG CPTs, inference can be done efficiently!
- Example network over X1, X2, Y, Z, with X1 and X2 the parents of Y, and X1 and Y the parents of Z:
  $P(X_1) = N(\mu_1, \sigma_1)$, $P(X_2) = N(\mu_2, \sigma_2)$,
  $P(Y \mid x_1, x_2) = N(a x_1 + b x_2,\ \sigma_y)$, $P(Z \mid x_1, y) = N(c + d x_1 + e y,\ \sigma_z)$

CANONICAL REPRESENTATION
- All factors in a CLG Bayes net can be represented in the canonical form $C(x; K, h, g) = \exp(-\tfrac{1}{2} x^T K x + h^T x + g)$
- Example: if $P(Y \mid x) = N(\mu_0 + Ax, \Sigma_0)$, then
  $P(y \mid x) = \tfrac{1}{Z}\exp\!\left(-\tfrac{1}{2}(y - Ax - \mu_0)^T \Sigma_0^{-1} (y - Ax - \mu_0)\right)$
  $= \tfrac{1}{Z}\exp\!\left(-\tfrac{1}{2}(y, x)^T [I\ {-A}]^T \Sigma_0^{-1} [I\ {-A}](y, x) + \mu_0^T \Sigma_0^{-1} [I\ {-A}](y, x) - \tfrac{1}{2}\mu_0^T \Sigma_0^{-1} \mu_0\right)$
- This is of the form $C((y, x); K, h, g)$ with
  $K = [I\ {-A}]^T \Sigma_0^{-1} [I\ {-A}]$, $h = [I\ {-A}]^T \Sigma_0^{-1} \mu_0$, $g = \log\tfrac{1}{Z} - \tfrac{1}{2}\mu_0^T \Sigma_0^{-1} \mu_0$

PRODUCT OPERATIONS
- $C(x; K_1, h_1, g_1)\,C(x; K_2, h_2, g_2) = C(x; K, h, g)$ with $K = K_1 + K_2$, $h = h_1 + h_2$, $g = g_1 + g_2$
- If the scopes of the two factors are not equivalent, just extend the K's with 0 rows and columns, and the h's with 0 rows, so that each row/column matches

SUM OPERATION
- $\int C((x, y); K, h, g)\,dy = C(x; K', h', g')$ with
  $K' = K_{XX} - K_{XY} K_{YY}^{-1} K_{YX}$, $h' = h_X - K_{XY} K_{YY}^{-1} h_Y$, $g' = g + \tfrac{1}{2}\left(\log\left|2\pi K_{YY}^{-1}\right| + h_Y^T K_{YY}^{-1} h_Y\right)$
- Using these two operations we can implement the inference algorithms developed for discrete Bayes nets: top-down inference and variable elimination (exact), and belief propagation (approximate)
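(Aside, not part of the original slides.) A minimal code sketch of the MARGINALIZATION AND CONDITIONING formulas above, assuming NumPy; the joint covariance, the observed value, and the rejection window are arbitrary illustrative choices. The analytic conditional is checked against samples whose Y value falls near the evidence.

```python
# Sketch: conditioning a bivariate Gaussian (X, Y) on Y = y_obs and
# comparing the closed-form answer to a brute-force Monte Carlo estimate.
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0])                 # [mu_X, mu_Y]
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])             # [[S_XX, S_XY], [S_YX, S_YY]]

def condition_on_y(mu, Sigma, y):
    """X | Y=y ~ N(mu_X + S_XY S_YY^{-1} (y - mu_Y), S_XX - S_XY S_YY^{-1} S_YX)."""
    s_xx, s_xy, s_yy = Sigma[0, 0], Sigma[0, 1], Sigma[1, 1]
    mean = mu[0] + s_xy / s_yy * (y - mu[1])
    var = s_xx - s_xy / s_yy * s_xy
    return mean, var

y_obs = -1.0
mean, var = condition_on_y(mu, Sigma, y_obs)

# Monte Carlo check: draw joint samples and keep those with Y near y_obs.
samples = rng.multivariate_normal(mu, Sigma, size=500_000)
near = samples[np.abs(samples[:, 1] - y_obs) < 0.02]
print("analytic :", mean, var)
print("empirical:", near[:, 0].mean(), near[:, 0].var())
# Marginalization is even simpler: X ~ N(mu[0], Sigma[0, 0]), and the empirical
# mean/variance of samples[:, 0] match mu[0] and Sigma[0, 0].
```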
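(Aside, not part of the original slides.) The canonical-form PRODUCT and SUM operations can be sketched the same way for a one-dimensional prior and CLG CPD: multiplying the two factors and summing out X should reproduce the marginal of Y from the LINEAR GAUSSIAN MODELS slide. The parameter values are arbitrary, and the constant g is dropped since it only affects normalization.

```python
# Sketch: canonical-form factor product and marginalization for the model
#   P(X) = N(mu_x, S_x),  P(Y | x) = N(mu_0 + A x, S_0)   (all scalars).
import numpy as np

mu_x, S_x = 1.0, 2.0            # prior:  P(X)
mu_0, A, S_0 = 3.0, 0.5, 0.25   # CPD:    P(Y | x)

# Canonical form of P(y | x) over the variable ordering (y, x).
IA = np.array([[1.0, -A]])             # the row block [I  -A]
K_cpd = IA.T @ IA / S_0                # [I -A]^T S_0^{-1} [I -A]
h_cpd = (IA.T * mu_0 / S_0).ravel()    # [I -A]^T S_0^{-1} mu_0

# Canonical form of P(x), zero-extended to the scope (y, x).
K_pri = np.array([[0.0, 0.0],
                  [0.0, 1.0 / S_x]])
h_pri = np.array([0.0, mu_x / S_x])

# PRODUCT operation: K's and h's simply add.
K, h = K_cpd + K_pri, h_cpd + h_pri

# SUM operation: integrate out x (the second coordinate).
K_marg = K[0, 0] - K[0, 1] * K[1, 0] / K[1, 1]
h_marg = h[0] - K[0, 1] * h[1] / K[1, 1]

# Back to moment form (mean = K^{-1} h, variance = K^{-1}) and compare with
# the closed-form marginal of Y from the LINEAR GAUSSIAN MODELS slide.
print("canonical  :", h_marg / K_marg, 1.0 / K_marg)
print("closed form:", mu_0 + A * mu_x, A * S_x * A + S_0)
```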
MONTE CARLO WITH GAUSSIANS
- Assume sampling X ~ N(0, 1) is given as a primitive RandN()
- To sample X ~ N(μ, σ²), simply set x ← μ + σ·RandN()
- How do we generate a random multivariate Gaussian variable N(μ, Σ)?
- Take the Cholesky decomposition $\Sigma^{-1} = L L^T$ (L is invertible if Σ is positive definite)
- Let $y = L^T(x - \mu)$; then $P(y) \propto \exp(-\tfrac{1}{2}(y_1^2 + \dots + y_N^2))$ is isotropic, and each $y_i$ is independent
- So sample each component of y at random, and set $x \leftarrow L^{-T} y + \mu$ (a code sketch appears at the end of the section)

MONTE CARLO WITH LIKELIHOOD WEIGHTING
- Monte Carlo with rejection has probability 0 of hitting a continuous value given as evidence, so likelihood weighting must be used
- Example network: X → Y with $P(X) = N(\mu_X, \Sigma_X)$, $P(Y \mid x) = N(Ax + \mu_Y, \Sigma_Y)$, and evidence Y = y
- Step 1: sample $x \sim N(\mu_X, \Sigma_X)$
- Step 2: weight the sample by $P(y \mid x)$ (a code sketch appears at the end of the section)

HYBRID NETWORKS
- Hybrid networks combine both discrete and continuous variables
- Exact inference techniques are hard to apply: they result in Gaussian mixtures, and the problem is NP-hard even in polytree networks
- Monte Carlo techniques apply in a straightforward way
- Belief approximation can be applied (e.g., collapsing Gaussian mixtures to single Gaussians)

ISSUES
- Non-Gaussian distributions
- Nonlinear dependencies
- More in future lectures on particle filtering
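(Aside, not part of the original slides.) A minimal sketch of the multivariate Gaussian sampler from the MONTE CARLO WITH GAUSSIANS slide, assuming NumPy; the mean, covariance, and sample size are arbitrary illustrative choices. It follows the slide's recipe of factoring Σ⁻¹; factoring Σ = M Mᵀ directly and setting x = μ + M y is an equivalent alternative.

```python
# Sketch: factor Sigma^{-1} = L L^T, sample isotropic y, map back with
# x = L^{-T} y + mu, and check the sample moments against (mu, Sigma).
import numpy as np

rng = np.random.default_rng(1)

mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.6, 0.2],
                  [0.6, 1.0, 0.3],
                  [0.2, 0.3, 0.5]])

L = np.linalg.cholesky(np.linalg.inv(Sigma))   # Sigma^{-1} = L L^T, L lower-triangular

def sample_mvn(n):
    y = rng.standard_normal((n, 3))            # each y_i ~ N(0, 1), independent
    # x = L^{-T} y + mu  (solve L^T z = y rather than forming the inverse)
    return np.linalg.solve(L.T, y.T).T + mu

xs = sample_mvn(200_000)
print(xs.mean(axis=0))        # ~ mu
print(np.cov(xs.T))           # ~ Sigma
```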
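(Aside, not part of the original slides.) A minimal sketch of likelihood weighting on the two-node X → Y network from the MONTE CARLO WITH LIKELIHOOD WEIGHTING slide, in the scalar case, compared against the exact Gaussian conditioning formula; all parameter values are arbitrary illustrative choices, and NumPy is assumed.

```python
# Sketch: likelihood weighting for  P(X) = N(mu_X, S_X),
# P(Y | x) = N(A x + mu_Y, S_Y),  with evidence Y = y_obs.
import numpy as np

rng = np.random.default_rng(2)

mu_X, S_X = 0.0, 1.0
A, mu_Y, S_Y = 2.0, 1.0, 0.5
y_obs = 4.0

n = 200_000
x = rng.normal(mu_X, np.sqrt(S_X), size=n)               # Step 1: sample x ~ P(X)
w = np.exp(-(y_obs - (A * x + mu_Y)) ** 2 / (2 * S_Y))   # Step 2: weight by P(y | x)
w /= w.sum()                                             # constant factor cancels

post_mean = np.sum(w * x)
post_var = np.sum(w * (x - post_mean) ** 2)

# Exact answer via the Gaussian conditioning formula, for comparison.
S_XY = S_X * A                       # Cov(X, Y)
S_YY = A * S_X * A + S_Y             # Var(Y)
exact_mean = mu_X + S_XY / S_YY * (y_obs - (A * mu_X + mu_Y))
exact_var = S_X - S_XY ** 2 / S_YY

print("likelihood weighting:", post_mean, post_var)
print("exact conditioning  :", exact_mean, exact_var)
```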