
Introduction to Machine Learning
CMU-10701
Markov Chain Monte Carlo Methods
Barnabás Póczos
Contents
 Markov Chain Monte Carlo Methods
 Sampling
• Rejection
• Importance
• Hastings-Metropolis
• Gibbs
 Markov Chains
• Properties
 Particle Filtering
• Condensation
Monte Carlo Methods
Markov Chain Monte Carlo History
 A recent survey places the Metropolis algorithm among the 10
algorithms that have had the greatest influence on the development
and practice of science and engineering in the 20th century
(Beichl & Sullivan, 2000).
 The Metropolis algorithm is an instance of a large class of sampling
algorithms, known as Markov chain Monte Carlo (MCMC)
 MCMC plays a significant role in statistics, econometrics, physics and
computing science.
 There are several high-dimensional problems, such as computing
the volume of a convex body in d dimensions and other high-dimensional
integrals, for which MCMC simulation is the only known general
approach for providing a solution within a reasonable time.
Markov Chain Monte Carlo History
 While convalescing from an illness in 1946, Stanislaw Ulam was playing
solitaire. It then occurred to him to try to compute the chances that a
particular solitaire laid out with 52 cards would come out successfully
(Eckhardt, 1987).
 Exhaustive combinatorial calculations are very difficult.
 New, more practical approach: laying out several solitaires at random and
then observing and counting the number of successful plays.
 This idea of selecting a statistical sample to approximate a hard
combinatorial problem by a much simpler problem is at the heart of
modern Monte Carlo simulation.
FERMIAC
 FERMIAC: The beginning of Monte Carlo Methods
 The statistical sampling idea goes back to Fermi's neutron studies in the early 1930s; the FERMIAC trolley itself was built in 1947.
 The Monte Carlo trolley, or FERMIAC, was an analog
computer invented by physicist Enrico Fermi to
aid in his studies of neutron transport.
 At each stage, pseudo-random numbers are used
to make decisions that affect the behavior of
each neutron (a random choice between fast and
slow neutrons).
ENIAC
 ENIAC (pron.: /ˈini.æk/; Electronic Numerical Integrator And Computer)
was the first electronic general-purpose computer.
 February 14, 1946: the completed machine was announced to the public.
 It was digital and capable of being reprogrammed to solve a full range of
computing problems.
 Designed by John Mauchly and J. Presper Eckert of the University of
Pennsylvania.
 Stanislaw Ulam, John von Neumann, and Enrico Fermi ran Monte Carlo simulations on ENIAC.
MANIAC
 MANIAC: Mathematical Analyzer, Numerical Integrator, and Computer (or
Mathematical Analyzer, Numerator, Integrator, and Computer).
 1952: built under the direction of Nicholas Metropolis at the Los Alamos Scientific
Laboratory. It was based on the von Neumann architecture of the IAS, developed
by John von Neumann.
 Metropolis chose the name MANIAC in the hope of stopping the rash of
silly acronyms for machine names:
 Illiac, Johnniac, Silliac…
MCMC Applications
 Sampling from high-dimensional, complicated distributions
 Bayesian inference and learning
• Normalization
• Marginalization
• Expectation
 Global optimization
The Monte Carlo principle
 Our goal is to estimate the following integral:
I(f) = ∫_X f(x) p(x) dx
 The idea of Monte Carlo simulation is to draw an i.i.d. set of samples
{x^(i)}_{i=1}^N from a target density p(x) defined on a high-dimensional space X.
 These samples can be used to approximate the target density
with the following empirical point-mass function:
p_N(x) = (1/N) ∑_{i=1}^N δ_{x^(i)}(x), where δ_{x^(i)} denotes the point mass at x^(i).
The Monte Carlo principle
Estimate: I(f) = ∫_X f(x) p(x) dx
Estimator: I_N(f) = (1/N) ∑_{i=1}^N f(x^(i)), with x^(i) ~ p(x) i.i.d.
The Monte Carlo principle
Theorems
 a.s. consistent: I_N(f) → I(f) almost surely as N → ∞ (strong law of large numbers).
 Unbiased estimation: E[I_N(f)] = I(f).
 The convergence rate is independent of the dimension d!
 Asymptotically normal: √N (I_N(f) − I(f)) converges in distribution to N(0, σ²_f), where σ²_f = Var_p(f(x)) (central limit theorem).
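To make the estimator concrete, here is a minimal sketch (my own illustration, not part of the slides), assuming p = N(0, 1) and f(x) = x²:

```python
import numpy as np

# Monte Carlo estimate of I(f) = E_p[f(x)] with p = N(0, 1) and f(x) = x^2.
# The true value is Var_p(x) = 1.
rng = np.random.default_rng(0)
N = 100_000
x = rng.standard_normal(N)   # i.i.d. samples x^(i) ~ p(x)
I_N = np.mean(x**2)          # I_N(f) = (1/N) * sum_i f(x^(i))
print(I_N)                   # ~1.0; the error shrinks like O(1/sqrt(N)),
                             # independently of the dimension d
```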
The Monte Carlo principle
One “tiny” problem…
 Monte Carlo methods need samples from the distribution p(x).
 When p(x) has standard form, e.g. Uniform or Gaussian, it is
straightforward to sample from it using easily available routines.
 However, when this is not the case, we need to introduce more
sophisticated sampling techniques. ⇒ MCMC sampling
Sampling
Main Goal
Sample from a distribution p(x) that is only known up
to a proportionality constant.
For example,
p(x) ∝ 0.3 exp(−0.2x²) + 0.7 exp(−0.2(x − 10)²)
Rejection Sampling
Rejection Sampling
Suppose that
 p(x) is known up to a proportionality constant (e.g. exp(−x²)).
 It is easy to sample from q(x) that satisfies p(x) ≤ M q(x), M < ∞
 M is known
Rejection sampling:
1. Sample x^(i) ~ q(x) and u ~ U(0, 1).
2. If u < p(x^(i)) / (M q(x^(i))), accept x^(i); otherwise reject and go back to step 1.
Rejection Sampling
Theorem
The accepted x^(i) can be shown to be sampled with probability p(x)
(Robert & Casella, 1999, p. 49).
Severe limitations:
 It is not always possible to bound p(x)/q(x) with a reasonable
constant M over the whole space X.
 If M is too large, the acceptance probability is too small.
 In high-dimensional spaces it can be exponentially slow to sample
points. (The points will usually be rejected.)
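A minimal runnable sketch of this scheme, assuming the slide's example target p(x) ∝ exp(−x²) and a standard normal proposal q (for which M = √(2π) bounds p/q):

```python
import numpy as np

# Rejection sampling sketch: target p(x) ∝ exp(-x^2), proposal q = N(0, 1).
# p(x)/q(x) = sqrt(2*pi) * exp(-x^2/2) <= sqrt(2*pi), so M = sqrt(2*pi).
rng = np.random.default_rng(0)
M = np.sqrt(2 * np.pi)

def p_unnorm(x):
    return np.exp(-x**2)

def q_pdf(x):
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

samples = []
while len(samples) < 10_000:
    x = rng.standard_normal()             # x ~ q
    u = rng.uniform()                     # u ~ U(0, 1)
    if u < p_unnorm(x) / (M * q_pdf(x)):  # accept with prob p(x)/(M q(x))
        samples.append(x)
# Accepted samples follow p(x) ∝ exp(-x^2), i.e. N(0, 1/2).
```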
Importance Sampling
Importance Sampling
 Importance sampling is an alternative “classical” solution that goes
back to the 1940’s.
 Let us introduce, again, an arbitrary importance proposal distribution
q(x) such that its support includes the support of p(x).
 Then we can rewrite I(f) as follows:
I(f) = ∫ f(x) p(x) dx = ∫ f(x) w(x) q(x) dx, where w(x) = p(x)/q(x) is called the importance weight.
Importance Sampling
Consequently,
 if one can simulate N i.i.d. samples {x^(i)}_{i=1}^N according to q(x)
 and evaluate w(x^(i)),
⇒ a possible Monte Carlo estimate of I(f) is
I_N(f) = (1/N) ∑_{i=1}^N f(x^(i)) w(x^(i))
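A minimal sketch of this estimator (my illustration; the target p, proposal q, and f below are assumptions, not the slides'):

```python
import numpy as np

# Importance sampling: estimate I(f) = E_p[f(x)] for p = N(0, 1), f(x) = x^2,
# using a wider proposal q = N(0, 2^2).  The true value is Var_p(x) = 1.
rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

N = 100_000
x = rng.normal(0.0, 2.0, size=N)               # x^(i) ~ q
w = normal_pdf(x, 0, 1) / normal_pdf(x, 0, 2)  # w(x) = p(x)/q(x)
I_N = np.mean(x**2 * w)                        # (1/N) sum_i f(x^(i)) w(x^(i))
print(I_N)                                     # ~1.0
```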
Importance Sampling
Theorem
 This estimator is unbiased.
 Under weak assumptions, the strong law of large numbers applies: I_N(f) → I(f) almost surely as N → ∞.
Some proposal distributions q(x) will obviously be preferable to others.
Find one that minimizes the variance of the estimator: var_q( f(x) w(x) ).
Importance Sampling
Theorem
The variance is minimal when we adopt the following
optimal importance distribution:
q*(x) = |f(x)| p(x) / ∫ |f(x′)| p(x′) dx′
 The optimal proposal is not very useful in the sense that it is not easy to
sample from |f(x)| p(x).
 High sampling efficiency is achieved when we focus on sampling from p(x)
in the important regions where |f(x)| p(x) is relatively large; hence the
name importance sampling.
Importance Sampling
 Importance sampling estimates can be super-efficient:
For a given function f(x), it is possible to find a distribution q(x)
that yields an estimate with a lower variance than when using
q(x) = p(x)!
Problems with importance sampling
In high dimensions it is not efficient either…
Andrey Markov
Markov Chains
Markov Chains
Markov chain: P(X_{t+1} = x | X_1, …, X_t) = P(X_{t+1} = x | X_t); the next state depends only on the current state.
Homogeneous Markov chain: the transition probability T(x′ | x) = P(X_{t+1} = x′ | X_t = x) does not depend on t.
Markov Chains
 Assume that the state space is finite: X_t ∈ {1, …, s}.
 1-step state transition matrix: T(i, j) = P(X_{t+1} = j | X_t = i).
Lemma: The state transition matrix is stochastic: ∑_j T(i, j) = 1 for every row i.
 t-step state transition matrix: T_t(i, j) = P(X_{n+t} = j | X_n = i).
Lemma: T_t = T^t.
Markov Chains Example
Markov chain with three states (s = 3)
Transition matrix
Transition graph
Markov Chains
If the probability vector for the initial state is μ(x_1), it follows that
μ(x_2) = μ(x_1) T
and, after several iterations (multiplications by T), the state distribution converges to the stationary distribution π satisfying π = π T,
no matter what the initial distribution μ(x_1) was.
The chain has forgotten its past.
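A small numerical demo of this forgetting (my own sketch; the 3-state matrix below is an assumed toy example, not the slide's):

```python
import numpy as np

# Repeated multiplication by T drives any initial distribution mu to the
# same stationary distribution pi, with pi = pi T.
T = np.array([[0.5, 0.2, 0.3],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])          # stochastic: each row sums to 1

for mu0 in ([1, 0, 0], [0, 1, 0], [0.2, 0.5, 0.3]):
    mu = np.array(mu0, dtype=float)
    for _ in range(100):
        mu = mu @ T                      # mu(x_{t+1}) = mu(x_t) T
    print(mu)                            # same pi regardless of the start
```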
Markov Chains
Our goal is
to find conditions under which the Markov chain forgets its past and,
independently of its starting distribution, the state distributions
converge to a stationary distribution.
Definition: [stationary distribution, invariant distribution]
π is a stationary (invariant) distribution of T if π = π T, i.e., π(x′) = ∑_x π(x) T(x′ | x).
This is a necessary condition for the limit behavior of the Markov chain.
Markov Chains
Theorem:
For any starting point, the chain will converge to its unique invariant
distribution p(x), as long as
1. T is a stochastic transition matrix
2. T is irreducible
3. T is aperiodic
Limit Theorem of Markov Chains
More formally:
If the Markov chain is irreducible and aperiodic, then:
lim_{t→∞} P(X_t = j | X_1 = i) = π(j) for every pair of states (i, j).
Markov Chains
Definition
Irreducibility:
For each pair of states (i, j), there is a positive probability that, starting in
state i, the process will ever enter state j.
= The matrix T cannot be reduced to separate smaller matrices.
= The transition graph is connected.
It is possible to get to any state from any state.
Markov Chains
Definition
Aperiodicity:
The chain cannot get trapped in cycles.
Definition
A state i has period k if any return to state i must occur in multiples of
k time steps. Formally, the period of a state i is defined as
k = gcd{ n > 0 : P(X_n = i | X_0 = i) > 0 }
(where "gcd" is the greatest common divisor).
For example, suppose it is possible to return to the state in
{6, 8, 10, 12, …} time steps. Then k = 2.
Markov Chains
Definition
If k = 1, then the state is said to be aperiodic:
returns to state i can occur at irregular times.
In other words,
a state i is aperiodic if there exists n such that for all n′ ≥ n,
P(X_{n′} = i | X_0 = i) > 0.
Definition
A Markov chain is aperiodic if every state is aperiodic.
Markov Chains
Theorem:
An irreducible Markov chain only needs one aperiodic state
to imply all states are aperiodic.
Corollary:
An irreducible chain is said to be aperiodic if for some n ≥ 0 and
some state j,
P(X_n = j | X_0 = j) > 0 and P(X_{n+1} = j | X_0 = j) > 0.
Example of a periodic Markov chain: the chain that deterministically
alternates between two states,
T = [[0, 1], [1, 0]].
In this case T² = I, so the powers T^t never converge.
Let the initial distribution be (0.5, 0.5); then the chain is stationary.
However, if we start the chain from (1,0) or (0,1), then the
chain gets trapped in a cycle; it doesn't forget its past.
Periodic Markov chains don't forget their past.
Reversible Markov chains
Detailed Balance Property
Definition: reversibility / detailed balance condition:
π(i) T(i, j) = π(j) T(j, i) for all pairs of states (i, j).
Theorem:
A sufficient, but not necessary, condition to ensure that a particular π is
the desired invariant distribution of the Markov chain is the detailed
balance condition.
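A quick numerical check of this theorem on an assumed toy chain (my sketch, not from the slides):

```python
import numpy as np

# If pi(i) T[i, j] == pi(j) T[j, i] for all i, j, then summing over i gives
# (pi T)(j) = pi(j), i.e. pi is invariant.
T = np.array([[0.9, 0.1],
              [0.2, 0.8]])          # toy transition matrix
pi = np.array([2/3, 1/3])           # candidate invariant distribution

M = pi[:, None] * T                 # M[i, j] = pi(i) T(i, j)
print(np.allclose(M, M.T))          # True: detailed balance holds
print(np.allclose(pi @ T, pi))      # True: pi is invariant
```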
How fast can Markov chains forget the past?
MCMC samplers are
 irreducible and aperiodic Markov chains
 that have the target distribution as their invariant distribution, and
 for which the detailed balance condition is satisfied.
It is also important to design samplers that converge quickly.
Spectral properties
Theorem: If T is irreducible and aperiodic with invariant distribution π, then
 π is the left eigenvector of the matrix T with eigenvalue 1: π T = π.
 The Perron-Frobenius theorem from linear algebra tells us that the
remaining eigenvalues have absolute value less than 1.
 The second largest eigenvalue, therefore, determines the rate of
convergence of the chain, and should be as small as possible.
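A short sketch of these spectral facts on an assumed toy matrix (my illustration):

```python
import numpy as np

# pi is a left eigenvector of T with eigenvalue 1; the magnitude of the
# second-largest eigenvalue controls how fast mu T^t approaches pi.
T = np.array([[0.5, 0.2, 0.3],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])

vals = np.linalg.eigvals(T)
vals = vals[np.argsort(-np.abs(vals))]   # sort by |lambda|, descending
print(vals.real)                         # 1.0 first; the rest |lambda| < 1
print(1 - abs(vals[1]))                  # spectral gap: bigger => faster mixing
```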
The Hastings-Metropolis Algorithm
The Hastings-Metropolis Algorithm
Our goal:
Generate samples from the following discrete distribution:
π_j = b_j / B, j = 1, …, m, where B = ∑_{j=1}^m b_j.
We don't know B!
The main idea is to construct a time-reversible Markov chain
with (π_1, …, π_m) as its limit distribution.
Later we will discuss what to do when the distribution is continuous.
The Hastings-Metropolis Algorithm
Let {1,2,…,m} be the state space of a Markov chain that we
can simulate, and let q(j | i) denote its proposal transition probabilities.
Given the current state X_n = i:
1. Propose a new state j with probability q(j | i).
2. Accept it with probability α(i, j) = min{ 1, [π_j q(i | j)] / [π_i q(j | i)] } = min{ 1, [b_j q(i | j)] / [b_i q(j | i)] }; otherwise stay, X_{n+1} = i.
The acceptance probability depends only on the ratio b_j / b_i, so the unknown normalizer B is never needed.
No rejection: we use all X_1, X_2, …, X_n, …
Example for Large State Space
Let {1,2,…,m} be the state space of a Markov chain that we
can simulate.
d-dimensional grid:
 Max 2d possible movements at each grid point (linear in d)
 Exponentially large state space in dimension d
The Hastings-Metropolis Algorithm
Theorem
The resulting Markov chain is time-reversible and has (π_1, …, π_m) as its invariant distribution.
Proof
Check the detailed balance condition π_i P(i, j) = π_j P(j, i) directly, using the definition of α(i, j).
The Hastings-Metropolis Algorithm
Observation
Proof:
Corollary
Theorem
The Hastings-Metropolis Algorithm
Theorem
Proof:
Note:
The Hastings-Metropolis Algorithm
It is not rejection sampling: we use all the samples!
Continuous Distributions
 The same algorithm can be used for
continuous distributions as well.
 In this case, the state space is continuous and the acceptance probability
becomes α(x, x′) = min{ 1, [p(x′) q(x | x′)] / [p(x) q(x′ | x)] }.
Experiment with HM
An application for continuous distributions
Bimodal target distribution: p(x) ∝ 0.3 exp(−0.2x²) + 0.7 exp(−0.2(x − 10)²)
q(x | x^(i)) = N(x^(i), 100), 5000 iterations
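A runnable sketch of this experiment (my code; the seed and burn-in handling are assumptions, while the target and proposal are the slide's):

```python
import numpy as np

# Hastings-Metropolis with a Gaussian random-walk proposal N(x, 10^2),
# i.e. variance 100 as on the slide.  The target is known only up to a
# proportionality constant, which cancels in the acceptance ratio.
rng = np.random.default_rng(0)

def p_unnorm(x):
    return 0.3 * np.exp(-0.2 * x**2) + 0.7 * np.exp(-0.2 * (x - 10)**2)

n_iter = 5000
x = np.zeros(n_iter)
for i in range(1, n_iter):
    prop = x[i - 1] + 10.0 * rng.standard_normal()   # x' ~ N(x^(i-1), 100)
    # Symmetric proposal => the ratio reduces to p(x')/p(x).
    alpha = min(1.0, p_unnorm(prop) / p_unnorm(x[i - 1]))
    x[i] = prop if rng.uniform() < alpha else x[i - 1]
# A histogram of x (after burn-in) recovers both modes, near 0 and 10.
```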
A good proposal distribution is important.
HM on Combinatorial Sets
Generate uniformly distributed samples from the set of permutations
x = (x_1, …, x_n) of (1, …, n) that satisfy ∑_{j=1}^n j·x_j > a.
Let n = 3, and a = 12:
{1,2,3}: 1+4+9 = 14 > 12 ✓
{1,3,2}: 1+6+6 = 13 > 12 ✓
{2,3,1}: 2+6+3 = 11
{2,1,3}: 2+2+9 = 13 > 12 ✓
{3,1,2}: 3+2+6 = 11
{3,2,1}: 3+4+3 = 10
HM on Combinatorial Sets
To define a simple Markov chain on this set of permutations,
we need the concept of
neighboring elements (permutations):
Definition: Two permutations are neighbors if one results from
the interchange of two of the positions of the other:
(1,2,3,4) and (1,2,4,3) are neighbors.
(1,2,3,4) and (1,3,4,2) are not neighbors.
HM on Combinatorial Sets
With a uniform target distribution and this symmetric neighbor proposal, the
acceptance probability is 1 whenever the proposed permutation is still in the
set, so the chain's limit distribution is uniform on the set.
That is what we wanted!
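A sketch of this permutation sampler for the slide's n = 3, a = 12 example (my code; the iteration count is an assumption):

```python
import numpy as np

# Neighbor proposal: swap two uniformly chosen positions (symmetric), so the
# Hastings-Metropolis acceptance is 1 iff the proposal stays in the set.
rng = np.random.default_rng(0)
n, a = 3, 12

def in_set(x):
    return sum((j + 1) * x[j] for j in range(n)) > a  # sum_j j*x_j > a

x = list(range(1, n + 1))     # start at (1, 2, ..., n), which is in the set
counts = {}
for _ in range(20_000):
    i, j = rng.choice(n, size=2, replace=False)
    prop = x.copy()
    prop[i], prop[j] = prop[j], prop[i]   # swap two positions
    if in_set(prop):                      # accept (alpha = 1); else stay
        x = prop
    counts[tuple(x)] = counts.get(tuple(x), 0) + 1
print(counts)   # roughly equal counts over {(1,2,3), (1,3,2), (2,1,3)}
```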
Gibbs Sampling: The Problem
Suppose that we can generate samples from the full conditional distributions
p(x_j | x_1, …, x_{j−1}, x_{j+1}, …, x_d), j = 1, …, d.
Our goal is to generate samples from the joint distribution p(x_1, …, x_d).
Gibbs Sampling: Pseudo Code
Initialize x^(0) = (x_1^(0), …, x_d^(0)). For i = 1, 2, …:
 sample x_1^(i) ~ p(x_1 | x_2^(i−1), …, x_d^(i−1))
 sample x_2^(i) ~ p(x_2 | x_1^(i), x_3^(i−1), …, x_d^(i−1))
 …
 sample x_d^(i) ~ p(x_d | x_1^(i), …, x_{d−1}^(i))
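A minimal concrete instance (my example, not the slides'): Gibbs sampling for a bivariate Gaussian with correlation ρ, whose full conditionals are the Gaussians given in the comments:

```python
import numpy as np

# Standard bivariate Gaussian with correlation rho; both full conditionals
# are known Gaussians:
#   x1 | x2 ~ N(rho * x2, 1 - rho^2),   x2 | x1 ~ N(rho * x1, 1 - rho^2).
rng = np.random.default_rng(0)
rho, n_iter = 0.8, 10_000
sd = np.sqrt(1 - rho**2)

samples = np.zeros((n_iter, 2))
x1, x2 = 0.0, 0.0
for i in range(n_iter):
    x1 = rho * x2 + sd * rng.standard_normal()   # sample x1 | x2
    x2 = rho * x1 + sd * rng.standard_normal()   # sample x2 | x1
    samples[i] = (x1, x2)
print(np.corrcoef(samples[1000:].T))  # off-diagonal ~ rho after burn-in
```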
Gibbs Sampling: Theory
Let x_{−j} denote all components of x except x_j,
and let the proposal distribution be the full conditional:
q(x′ | x) = p(x′_j | x_{−j}) if x′_{−j} = x_{−j}, and 0 otherwise.
Observation: By construction, this HM sampler would sample from p(x).
Gibbs Sampling is a Special HM
Theorem: The Gibbs sampling is a special case of HM with
acceptance probability α(x, x′) = 1 for every proposed move.
Proof:
By definition:
α(x, x′) = min{ 1, [p(x′) q(x | x′)] / [p(x) q(x′ | x)] } = min{ 1, [p(x′) p(x_j | x′_{−j})] / [p(x) p(x′_j | x_{−j})] }
Gibbs Sampling is a Special HM
Proof (continued):
Since x′_{−j} = x_{−j}, we have p(x′) = p(x′_j | x_{−j}) p(x_{−j}) and
p(x) = p(x_j | x_{−j}) p(x_{−j}), so the ratio equals 1 and every proposal
is accepted. ∎
Gibbs Sampling in Practice
Simulated Annealing
Simulated Annealing
Goal: Find
x* = argmax_x V(x) over a finite state space, by sampling from P_λ(x) ∝ exp(λ V(x)).
Simulated Annealing
Theorem:
As λ → ∞, the distribution P_λ(x) ∝ exp(λ V(x)) concentrates on the set of
global maximizers of V(x).
Proof:
Simulated Annealing
Main idea
 Let λ be big.
 Generate a Markov chain with limit distribution P_λ(x).
 In the long run, the Markov chain will jump among the maximum points of
P_λ(x).
Introduce the relationship of neighboring vectors:
Simulated Annealing
Uniform proposal distribution over the neighbors of the current state (a symmetric proposal).
Use the Hastings-Metropolis sampling: accept the proposed neighbor y with probability
α = min{ 1, P_λ(y) / P_λ(x) } = min{ 1, exp(λ [V(y) − V(x)]) }.
Simulated Annealing: Pseudo Code
With prob. α accept the new state;
with prob. (1 − α) don't accept and stay.
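A minimal simulated annealing sketch (my illustration; the objective V, the state space, and the cooling schedule are assumptions):

```python
import numpy as np

# Maximize V over the integer points of [0, 50] with +-1 neighbor moves
# and acceptance min(1, exp(lambda * (V(y) - V(x)))).
rng = np.random.default_rng(0)

def V(x):
    return -(x - 30)**2            # toy objective with maximum at x = 30

x, lam = 0, 0.01
for t in range(5000):
    y = x + rng.choice([-1, 1])    # uniform proposal over the two neighbors
    y = min(max(y, 0), 50)         # stay inside the state space
    dV = V(y) - V(x)
    alpha = 1.0 if dV >= 0 else np.exp(lam * dV)
    if rng.uniform() < alpha:
        x = y                      # accept
    lam *= 1.002                   # slowly increase lambda (cool the chain)
print(x)                           # ~30
```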
Simulated Annealing: Special case
In this special case (V(y) ≥ V(x)):
With prob. α = 1 accept the new state, since
we increased V.
Simulated Annealing: Problems
Simulated Annealing
Temperature = 1/λ
Monte Carlo EM
E Step:
Q(θ; θ_t) = ∫ log p(X, Z | θ) p(Z | X, θ_t) dZ
Monte Carlo EM:
Sample Z^(1), …, Z^(N) ~ p(Z | X, θ_t), e.g. with an MCMC sampler.
Then the integral can be approximated!
Q(θ; θ_t) ≈ (1/N) ∑_{i=1}^N log p(X, Z^(i) | θ)
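A toy Monte Carlo E-step sketch (my example; the model and the grid-search M-step are assumptions, and here p(Z | X, θ_t) happens to be a Gaussian we can sample directly, where a harder model would need MCMC):

```python
import numpy as np

# Model: Z ~ N(theta, 1) hidden, X | Z ~ N(Z, 1) observed.  Then
# p(Z | X, theta_t) = N((X + theta_t) / 2, 1/2).
rng = np.random.default_rng(0)
X, theta_t, N = 1.7, 0.0, 10_000

Z = rng.normal((X + theta_t) / 2, np.sqrt(0.5), size=N)  # Z^(i) ~ p(Z|X,theta_t)

def log_joint(theta, z):
    # log p(X, z | theta) up to constants: two Gaussian log-densities
    return -0.5 * (z - theta)**2 - 0.5 * (X - z)**2

# Q(theta; theta_t) ~= (1/N) sum_i log p(X, Z^(i) | theta), on a grid
thetas = np.linspace(-2, 4, 601)
Q = np.array([np.mean(log_joint(th, Z)) for th in thetas])
print(thetas[np.argmax(Q)])   # M-step by grid search; maximizer ~= mean(Z)
```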
Particle Filtering
Particle Filtering
 So far there was only one particle, which moved according to a Markov chain.
 In particle filtering there will be lots of particles, and each of them will act as a
Markov chain.
Application: Contour tracking
Object Tracking Using B-splines
B-spline Curves
Stefan Hueeber, Jonas Ballani
Tracking Using B-splines
The Model
The object we want to track has a hidden dynamical model:
X_t = f(X_{t−1}, μ_t)
We can only observe the object through noisy measurements:
Z_t = g(X_t) + v_t
Goal: From Z_1, …, Z_t estimate X_t.
The Model
• The contour lines are B-spline curves.
• It is enough to transform the control points
of the B-spline curves.
• 2D translation + scaling + shear + rotation
= 6 free parameters.
• 6-dimensional particles.
2D affine transformation of the control points
6 Parameters to Fit
Monte Carlo Methods
• Each particle is a 6-dimensional vector and describes one
possible contour line.
(Hence the name particle filtering.)
• Using edge detection, we can measure how well
one particle fits the image.
• The good particles get larger weights in the next
iteration; the weaker particles get smaller weights.
• Using a Markov chain (AR dynamics), move the
particles to predict the next image. (A generic sketch of this
predict-weight-resample loop follows below.)
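A generic bootstrap particle filter sketch on a 1-d toy model (my illustration; the slides' tracker uses 6-dimensional affine particles and B-spline contour likelihoods instead):

```python
import numpy as np

# Toy model: X_t = 0.9 X_{t-1} + noise,  Z_t = X_t + noise.
rng = np.random.default_rng(0)
n_particles, T = 1000, 50

# Simulate a ground-truth trajectory and noisy observations.
x_true, z = np.zeros(T), np.zeros(T)
for t in range(1, T):
    x_true[t] = 0.9 * x_true[t - 1] + rng.standard_normal()
    z[t] = x_true[t] + 0.5 * rng.standard_normal()

particles = rng.standard_normal(n_particles)
estimates = np.zeros(T)
for t in range(1, T):
    # 1. Predict: move each particle through the (AR) dynamics.
    particles = 0.9 * particles + rng.standard_normal(n_particles)
    # 2. Weight: likelihood of the observation under each particle.
    w = np.exp(-0.5 * ((z[t] - particles) / 0.5) ** 2)
    w /= w.sum()
    # 3. Resample: good particles multiply, weak ones die out.
    particles = rng.choice(particles, size=n_particles, replace=True, p=w)
    estimates[t] = particles.mean()
# estimates[t] tracks x_true[t] from the noisy measurements z[t].
```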
The Observation Model
Measuring how good one particle is:
Weighted Particles
Particles to Represent B-splines
Conditional Density Propagation
ConDenSation
Condensation
Tracking Using B-splines
An Improvement
Local Importance Sampling Algorithm
P. Torma and Cs. Szepesvári, ECCV 2004
Condensation
M. Isard and A. Blake, Int. J. Computer Vision, 1998
References
Many sentences, pictures, videos, etc. are copied and pasted from the
sources below:
• Michael Isard:
http://www.robots.ox.ac.uk/~misard/condensation.html
• C. Andrieu et al.: An Introduction to MCMC for Machine Learning
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.13.7133
• David MacKay: Introduction to Monte Carlo methods
http://www.inference.phy.cam.ac.uk/mackay/erice.pdf
• Sheldon M. Ross: Simulation
• Wikipedia
Thanks for the Attention!
Attic
Arriving times
Questions:
 When does a Markov chain arrive at state j if it starts from state i?
 When does it arrive there first?
 What is the probability that it arrives there in the future?
 Will it arrive there in finite time?
Arriving times
Theorem
Arriving times
Theorem