Belief Propagation, EM

FMA901F: Machine Learning
Lecture 7: Belief Propagation, EM
Cristian Sminchisescu
Graphical Models
• In a probabilistic graphical model, each node represents a random variable (or a group of variables), and the links express probabilistic relationships between these variables. The graph captures the way in which the joint distribution over all of the random variables can be decomposed into a product of factors, each depending only on a subset of the variables
• In Bayesian networks, also known as directed graphical models, the links of the graph have a particular directionality indicated by arrows
• The other major class of graphical models are Markov random fields, also known as undirected graphical models, in which the links do not carry arrows and have no directional significance
• Directed graphs are useful for expressing causal relationships between random variables, whereas undirected graphs are better suited to expressing soft constraints between random variables
• For the purposes of solving inference problems, it is often convenient to convert both directed and undirected graphs into a different representation called a factor graph
Factor Graphs
• circles for variables
• squares for factors
• undirected links between factors and their variables
• Both directed and undirected graphs allow functions of several variables to be expressed as a product of factors over those subsets of variables
• Factor graphs make this explicit (nodes for factors and nodes for variables)
• Allow us to be more explicit about the details of the factorization
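To make the factorization concrete, here is a minimal sketch (our own toy representation and numbers, not from the slides) of a factor graph as a bipartite structure: each factor is a table over a named subset of variables, and the unnormalized joint is the product of all factor values.

```python
# A minimal factor-graph sketch: each factor touches a subset of variables,
# and the joint distribution is the product of the factors (up to normalization).
import numpy as np

# Binary variables x1, x2, x3; one factor over {x1, x2}, one over {x2, x3}.
factors = {
    "f_a": (("x1", "x2"), np.array([[1.0, 2.0], [3.0, 1.0]])),
    "f_b": (("x2", "x3"), np.array([[2.0, 1.0], [1.0, 4.0]])),
}

def joint(assignment):
    """Unnormalized joint: product of factor values at this assignment."""
    p = 1.0
    for variables, table in factors.values():
        idx = tuple(assignment[v] for v in variables)
        p *= table[idx]
    return p

print(joint({"x1": 0, "x2": 1, "x3": 1}))  # f_a(0,1) * f_b(1,1) = 2 * 4 = 8.0
```

The bipartite structure is implicit here: variables appear only inside factor scopes, and factors never connect directly to each other.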
Slides adapted from textbook (Bishop)
Factor Graphs from Directed Graphs
• Factors in directed graphs can be the local conditional distributions
• Factor graphs are bipartite: they contain two distinct types of nodes, and links connect only nodes of opposite types
Factor Graphs from Undirected Graphs
• Factors in undirected graphs can be the potential functions over maximal cliques
• Partition function can be seen as a factor over an empty set of variables
• There may be several different factor graphs that correspond to the same undirected (or directed) graph
The Sum‐Product Algorithm (1)
Objective:
i. to obtain an efficient, exact inference algorithm for finding marginals;
ii. in situations where several marginals are required, to allow computations to be shared efficiently.
Key idea (as seen for chains): Distributive Law
The Sum‐Product Algorithm (2)
• Look at a particular variable node x; assume all variables are hidden
• The joint factorizes as p(x) = ∏_{s ∈ ne(x)} F_s(x, X_s), where ne(x) is the set of factor nodes neighbouring x
• X_s is the set of all variables in the sub‐tree connected to variable node x via factor node f_s
• F_s(x, X_s) is the product of all factors in the group associated with factor f_s
The Sum‐Product Algorithm (3)
Message from factor node to variable node:
μ_{f_s→x}(x) = Σ_{x_1} … Σ_{x_M} f_s(x, x_1, …, x_M) ∏_{m ∈ ne(f_s)\x} μ_{x_m→f_s}(x_m)
The Sum‐Product Algorithm (4)
The Sum‐Product Algorithm (5)
Message from variable node to factor node:
μ_{x_m→f_s}(x_m) = ∏_{l ∈ ne(x_m)\f_s} μ_{f_l→x_m}(x_m)
The Sum‐Product Algorithm (6)
The Sum‐Product Algorithm (7)
Initialization: a leaf variable node sends μ_{x→f}(x) = 1; a leaf factor node sends μ_{f→x}(x) = f(x)
Marginals for sets of variables corresponding to factors:
p(x_s) = f_s(x_s) ∏_{i ∈ ne(f_s)} μ_{x_i→f_s}(x_i)
The Sum‐Product Algorithm (8)
To compute local marginals:
• Pick an arbitrary node as root
• Compute and propagate messages from the leaf nodes to the root, storing received messages at every node.
• Compute and propagate messages from the root to the leaf nodes, storing received messages at every node.
• Compute the product of received messages at each node for which the marginal is required, and normalize if necessary.
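The steps above can be sketched on the smallest non-trivial case, a 3-variable chain with two pairwise factors (our own numbers); picking the middle variable as root, one leaf-to-root pass already yields its marginal, which we check against brute-force enumeration.

```python
# Sum-product on a chain x1 -- f_a -- x2 -- f_b -- x3, with x2 as root.
import numpy as np

f_a = np.array([[1.0, 2.0], [3.0, 1.0]])   # factor over (x1, x2)
f_b = np.array([[2.0, 1.0], [1.0, 4.0]])   # factor over (x2, x3)

# Leaf variable nodes send the unit message, so:
# mu_{f_a -> x2}(x2) = sum_{x1} f_a(x1, x2)
msg_fa_x2 = f_a.sum(axis=0)
# mu_{f_b -> x2}(x2) = sum_{x3} f_b(x2, x3)
msg_fb_x2 = f_b.sum(axis=1)

# Marginal at the root: product of incoming messages, then normalize.
marg_x2 = msg_fa_x2 * msg_fb_x2
marg_x2 /= marg_x2.sum()

# Brute-force check against the full joint table.
joint = f_a[:, :, None] * f_b[None, :, :]          # axes: (x1, x2, x3)
brute = joint.sum(axis=(0, 2)) / joint.sum()
assert np.allclose(marg_x2, brute)
print(marg_x2)
```

A root-to-leaf pass with the same message rules would give the marginals of x1 and x3 while reusing the messages already stored, which is exactly the sharing the objective slide refers to.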
Sum‐Product: Example (1)
Take as root
Sum‐Product: Example (2)
Sum‐Product: Example (3)
Sum‐Product: Example (4)
The Max‐Sum Algorithm (1)
Objective: an efficient algorithm for finding
i. the value x^max that maximises p(x);
ii. the value of p(x^max).
In general, maximum of marginals ≠ joint maximum.
The Max‐Sum Algorithm (2)
Maximizing over a chain (max‐product)
The Max‐Sum Algorithm (3)
Generalizes to tree‐structured factor graph
maximizing as close to the leaf nodes as possible
The Max‐Sum Algorithm (4)
Max‐Product → Max‐Sum
For numerical reasons, use log p(x) = Σ_s log f_s(x_s); the maximum is preserved since log is monotonic
Again, use the distributive law: max(a + b, a + c) = a + max(b, c)
The Max‐Sum Algorithm (5)
Initialization (leaf nodes): μ_{x→f}(x) = 0, μ_{f→x}(x) = log f(x)
Recursion:
μ_{f→x}(x) = max_{x_1, …, x_M} [ log f(x, x_1, …, x_M) + Σ_{m ∈ ne(f)\x} μ_{x_m→f}(x_m) ]
μ_{x→f}(x) = Σ_{l ∈ ne(x)\f} μ_{f_l→x}(x)
The Max‐Sum Algorithm (6)
Termination (root node):
p^max = max_x [ Σ_{s ∈ ne(x)} μ_{f_s→x}(x) ],  x^max = arg max_x [ Σ_{s ∈ ne(x)} μ_{f_s→x}(x) ]
Back‐track from the root to the leaf nodes, using the maximizing values stored at each factor node during the forward pass
The Max‐Sum Algorithm (7)
Example: Markov chain
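On a Markov chain, max-sum reduces to the Viterbi algorithm. A minimal sketch with made-up unary and pairwise potentials, keeping the argmax bookkeeping needed for the back-tracking step:

```python
# Max-sum (Viterbi) on a 3-node Markov chain with 2 states per node.
import numpy as np

log_psi = np.log(np.array([[0.7, 0.3], [0.4, 0.6]]))              # pairwise log-potential
log_phi = np.log(np.array([[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]]))  # unary log-potentials

N, K = log_phi.shape
msg = np.zeros((N, K))               # forward max-messages
back = np.zeros((N, K), dtype=int)   # argmax bookkeeping for back-tracking

msg[0] = log_phi[0]
for n in range(1, N):
    scores = msg[n - 1][:, None] + log_psi + log_phi[n][None, :]
    msg[n] = scores.max(axis=0)      # max instead of sum: the distributive law
    back[n] = scores.argmax(axis=0)

# Termination at the root, then back-track to recover the joint maximizer.
x = np.zeros(N, dtype=int)
x[-1] = msg[-1].argmax()
for n in range(N - 1, 0, -1):
    x[n - 1] = back[n, x[n]]
print(x, np.exp(msg[-1].max()))  # maximizing configuration and p(x^max)
```

Note that the back-pointers, not the per-node maxima, are what guarantee a globally consistent configuration; reading off the argmax of each node's max-marginal independently can fail when the maximizer is not unique.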
The Junction Tree Algorithm
• Exact inference on general graphs.
• Works by turning the initial graph into a junction tree and then running a sum‐product‐like algorithm.
• Intractable on graphs with large cliques.
Loopy Belief Propagation
• Sum‐Product on general graphs.
• Initial unit messages passed across all links, after which messages are passed around until convergence (not guaranteed!).
• Approximate but tractable for large graphs.
• Sometimes it works well, sometimes not at all.
Inference and Learning
• Most graphical models (factor graphs) have parameters that influence the values of the potentials (factors). Ideally they have to be learnt from data, not specified by hand
• We have, so far, learnt how to perform inference in a given graphical model, where the parameters were known
• We will see how we can learn the parameters of a model, e.g. using maximum likelihood, in situations where not all variables are observed. During learning, such variables have to be marginalized. This process requires inference (e.g. the sum‐product algorithm we studied)
Partially Unobserved (Missing) Variables
• If certain variables are unobserved they represent missing data
– e.g. undefined inputs, missing class labels, erroneous target values
– ... or introduced for convenience, to model complex dependencies between variables without representing those directly
• In this case, we can still model the joint distribution, but we define a new cost function where we marginalize (sum out) the missing values at training or test time

l ( , D) 
log p ( x c , y c |  ) 
c complete
m
log
p
(
x
| ) 

m  missing
  log p( x , y |  )   log  p ( x , y |  )
c
c
c
m
m
y
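A toy numerical check of this split likelihood (our own numbers): a two-class model p(y) p(x | y) with binary x, where some training cases have observed labels and some have their label marginalized out.

```python
# Log likelihood with complete cases (x, y observed) and missing-label cases
# (only x observed, y summed out), matching the two sums above.
import numpy as np

pi = np.array([0.6, 0.4])            # p(y)
px_given_y = np.array([0.8, 0.3])    # p(x = 1 | y)

def log_p_xy(x, y):
    """Complete-data term: log p(x, y | theta)."""
    p = px_given_y[y] if x == 1 else 1 - px_given_y[y]
    return np.log(pi[y] * p)

def log_p_x(x):
    """Missing-label term: log sum_y p(x, y | theta)."""
    return np.log(sum(np.exp(log_p_xy(x, y)) for y in (0, 1)))

complete = [(1, 0), (0, 1)]   # (x, y) pairs with observed labels
missing = [1, 0]              # x values whose labels are missing

ll = sum(log_p_xy(x, y) for x, y in complete) + sum(log_p_x(x) for x in missing)
print(ll)
```

The second sum is where the difficulty lies: the log of a sum over y does not split into per-parameter terms, which is the coupling discussed next.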
Learning with Incomplete Data is Harder
• In fully observed i.i.d. settings, the probability model is a product. The log likelihood is a sum whose terms often decouple (at least in directed models)
ℓ(θ; D) = log p(x, z | θ) = log p(z | θ_z) + log p(x | z, θ_x)
• With latent variables, the probability already contains a sum, so the log likelihood has parameters coupled inside the log‐sum
ℓ(θ; D) = log Σ_z p(x, z | θ) = log Σ_z p(z | θ_z) p(x | z, θ_x)
Learning with Latent Variables
… Continued (likelihood with coupled parameters)
ℓ(θ; D) = log Σ_z p(x, z | θ) = log Σ_z p(z | θ_z) p(x | z, θ_x)
• If the latent variables were observed, parameters would decouple again and learning would be easy:
ℓ(θ; D) = log p(x, z | θ) = log p(z | θ_z) + log p(x | z, θ_x)
• One possibility: ignore the latent structure, compute ∂ℓ/∂θ, and perform learning with an efficient optimizer
• Another option: use the current parameters to guess the values of the latent variables, and then do fully‐observed learning. This alternation can make optimization easier
Complete and Incomplete Log Likelihoods
• Observed variables x, latent variables z, parameters θ:
ℓ_c(θ; x, z) ≡ log p(x, z | θ)
is the complete log likelihood
• Usually optimizing ℓ_c given both x and z is straightforward (e.g. class‐conditional Gaussian fitting, linear regression)
• With z unobserved, we need the log of a marginal probability:
ℓ(θ; x) ≡ log p(x | θ) = log Σ_z p(x, z | θ)
which is the incomplete log likelihood (no subscript)
Jensen’s inequality
• Jensen’s inequality generalizes the statement that a secant line of a convex function lies above the graph of the function
• Assume X has some distribution and let Y = f(X) with f convex. The mapping stretches the distribution where f changes rapidly and compresses it where f is flat, so the distribution of Y is distorted relative to that of X
• The expectation of Y therefore always shifts upwards relative to f evaluated at the expectation of X: E[f(X)] ≥ f(E[X]), with equality when f is not strictly convex (e.g. when it is a straight line) or when X follows a degenerate distribution (i.e. is a constant)
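A quick numeric illustration of the inequality (toy samples of our choosing), in both the convex direction and the concave-log direction that the EM bound actually uses:

```python
# Jensen's inequality: E[f(X)] >= f(E[X]) for convex f; flipped for concave f.
import numpy as np

rng = np.random.default_rng(0)

# Convex case, f(x) = x^2: squaring stretches the tails upward.
x = rng.normal(size=100_000)
f = lambda t: t ** 2
assert f(x).mean() >= f(x.mean())

# Concave case, f = log (as used to bound the incomplete log likelihood):
y = rng.uniform(0.5, 2.0, size=100_000)
assert np.log(y).mean() <= np.log(y.mean())
print(f(x).mean(), f(x.mean()))
```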
Expected Complete Log Likelihood
• For any distribution q(z | x), define an expected complete log likelihood:
ℓ_q(θ; x) ≡ Σ_z q(z | x) log p(x, z | θ)
• We can show that ℓ ≥ ℓ_q + H(q), where H(q) is the entropy of q, because of the concavity of log:
ℓ(θ; x) = log p(x | θ) = log Σ_z p(x, z | θ)
        = log Σ_z q(z | x) p(x, z | θ) / q(z | x)
        ≥ Σ_z q(z | x) log [ p(x, z | θ) / q(z | x) ]
where we used Jensen’s inequality
(only true for distributions: Σ_z q(z | x) = 1; q(z | x) ≥ 0)
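The bound ℓ ≥ ℓ_q + H(q) can be verified numerically on a toy two-state latent model (our own numbers), including the fact that it is tight when q is the posterior:

```python
# Check: log p(x | theta) >= E_q[log p(x, z | theta)] + H(q) for any q(z | x).
import numpy as np

p_xz = np.array([0.3, 0.1])     # p(x, z | theta) for z in {0, 1}, x fixed
q = np.array([0.6, 0.4])        # an arbitrary distribution over z

l_incomplete = np.log(p_xz.sum())          # log p(x | theta)
l_q = (q * np.log(p_xz)).sum()             # expected complete log likelihood
H = -(q * np.log(q)).sum()                 # entropy of q
assert l_incomplete >= l_q + H

# The bound is saturated when q equals the posterior p(z | x, theta):
post = p_xz / p_xz.sum()
l_q_star = (post * np.log(p_xz)).sum()
H_star = -(post * np.log(post)).sum()
assert abs(l_incomplete - (l_q_star + H_star)) < 1e-12
print(l_incomplete, l_q + H)
```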
Lower Bounds and Free Energy
• For fixed data x, define a functional called the free energy:
F(q, θ) ≡ Σ_z q(z | x) log [ p(x, z | θ) / q(z | x) ] ≤ ℓ(θ)
• The EM algorithm does coordinate ascent on F:
E‐step: q^{t+1} = argmax_q F(q, θ^t)
M‐step: θ^{t+1} = argmax_θ F(q^{t+1}, θ)
• The free energy breaks into two terms:
F(q, θ) = Σ_z q(z | x) log [ p(x, z | θ) / q(z | x) ]
        = Σ_z q(z | x) log p(x, z | θ) − Σ_z q(z | x) log q(z | x)
        = ℓ_q(θ; x) + H(q)
• The first term is the expected complete log likelihood (energy) and the second term, which does not depend on θ, is the entropy
• Thus, in the M‐step (maximization of the expected complete log likelihood), maximizing with respect to θ for fixed q we only need to consider the first term:
θ^{t+1} = argmax_θ ℓ_q(θ; x) = argmax_θ Σ_z q(z | x) log p(x, z | θ)
E‐step: inferring latent posterior
• The optimum setting of q in the E‐step is:
q^{t+1} = p(z | x, θ^t)
• This is the posterior distribution over the latent variables given the data and the current parameters θ^t
• Proof: this setting saturates the bound ℓ(θ; x) ≥ F(q, θ):
F(p(z | x, θ^t), θ^t) = Σ_z p(z | x, θ^t) log [ p(x, z | θ^t) / p(z | x, θ^t) ]
                      = Σ_z p(z | x, θ^t) log p(x | θ^t)
                      = log p(x | θ^t) Σ_z p(z | x, θ^t) = ℓ(θ^t; x) · 1
• Consequently
ℓ(θ) = F(q, θ) + KL[ q ‖ p(z | x, θ) ]
Expectation Maximization Algorithm
• Procedure to maximize the likelihood function for latent variable models. Finds ML parameters when the original incomplete data problem can be broken into two components
– Estimate missing data from observed data and current parameters
– Using complete data, find the maximum likelihood parameter estimates
• Alternate between estimating distribution over latent variables and updating the parameters based on it
E‐step: q^{t+1}(z) = p(z | x, θ^t)
M‐step: θ^{t+1} = argmax_θ Σ_z q^{t+1}(z | x) log p(x, z | θ)
– In the M‐step (no harder than the fully observed case) we optimize a lower bound on the likelihood
– In the E‐step (which requires inference) we make the bound tight
• This procedure monotonically improves ℓ (or leaves it unchanged). Therefore, under mild conditions, it is guaranteed to converge to a local optimum of the likelihood
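The E/M alternation and the monotone ascent can be seen in a minimal sketch for a two-component 1-D Gaussian mixture (fixed unit variances and our own toy data; a full implementation would also update the variances):

```python
# Minimal EM for a two-component Gaussian mixture with unit variances.
import numpy as np

x = np.array([-2.1, -1.9, -2.3, 1.8, 2.2, 2.0])
mu = np.array([-1.0, 1.0])      # component means (parameters to learn)
pi = np.array([0.5, 0.5])       # mixing proportions

def loglik(mu, pi):
    """Incomplete log likelihood: sum_n log sum_z pi_z N(x_n | mu_z, 1)."""
    dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2) / np.sqrt(2 * np.pi)
    return np.log(dens.sum(axis=1)).sum()

lls = []
for _ in range(20):
    # E-step: posterior responsibilities q(z | x) under current parameters.
    dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2) / np.sqrt(2 * np.pi)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: maximize the expected complete log likelihood in closed form.
    mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
    pi = r.mean(axis=0)
    lls.append(loglik(mu, pi))

assert all(b >= a - 1e-10 for a, b in zip(lls, lls[1:]))  # monotone ascent
print(mu)
```

The assertion checks exactly the guarantee stated above: each E/M cycle never decreases the incomplete log likelihood, and the means converge to a local optimum (here, near the two cluster centers).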
Readings
Bishop, Ch. 8 (from 8.4.2), Ch. 9