
CS 59000 Statistical Machine Learning
Lecture 15
Yuan (Alan) Qi
Purdue CS
Oct. 21, 2008
Outline
• Review of Gaussian Processes (GPs)
• From linear regression to GP
• GP for regression
• Learning hyperparameters
• Automatic Relevance Determination
• GP for classification
Gaussian Processes
How do kernels arise naturally in a Bayesian setting?
Instead of assigning a prior on the parameters w, we assign a prior on the function values y.
The function space is infinite-dimensional in theory, but finite-dimensional in practice (we only evaluate the function at a finite number of training and test points).
Linear Regression Revisited
Let y(x) = w^T φ(x), with prior p(w) = N(w | 0, α^{-1} I).
We have y = Φ w, where y = (y(x_1), ..., y(x_N))^T and Φ is the design matrix with elements Φ_nk = φ_k(x_n).
From Prior on Parameters to Prior on Functions
The prior on function values: E[y] = 0 and cov[y] = (1/α) Φ Φ^T = K, so p(y) = N(y | 0, K), where K_nm = k(x_n, x_m) = (1/α) φ(x_n)^T φ(x_m).
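As a concrete illustration (a minimal numpy sketch; the Gaussian basis functions, their width, and the value of α below are assumptions, not from the slides), drawing w from N(0, α^{-1} I) and forming y = Φw gives the same distribution as drawing y directly from N(0, K):

import numpy as np

alpha = 2.0                                    # assumed prior precision on w
centers = np.linspace(-1, 1, 12)               # assumed Gaussian basis centers
x = np.linspace(-1, 1, 50)                     # input locations

# Design matrix: Phi[n, k] = phi_k(x_n), Gaussian basis of width 0.2
Phi = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / 0.2) ** 2)

# The prior on w induces a Gaussian prior on y = Phi w:
# K = (1/alpha) Phi Phi^T, i.e. K_nm = (1/alpha) phi(x_n)^T phi(x_m)
K = Phi @ Phi.T / alpha

rng = np.random.default_rng(0)
w = rng.normal(0.0, 1.0 / np.sqrt(alpha), size=centers.size)
y_from_w = Phi @ w                             # sample via the weight-space view
y_direct = rng.multivariate_normal(            # sample via the function-space view
    np.zeros_like(x), K + 1e-10 * np.eye(x.size))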
Stochastic Process
A stochastic process y(x) is specified by giving the joint distribution for any finite set of values y(x_1), ..., y(x_N) in a consistent manner.
(Loosely speaking, consistency means that marginalizing the joint distribution of a larger set of values gives back the joint distribution defined directly on the subset.)
Gaussian Processes
The joint distribution of any finite set of variables y(x_1), ..., y(x_N) is a multivariate Gaussian distribution.
Without any prior knowledge, we often set the mean to be 0. Then the GP is specified entirely by the covariance:
E[y(x_n) y(x_m)] = k(x_n, x_m)
Impact of Kernel Function
The covariance matrix is determined by the kernel function.
Application in economics & finance: e.g., the exponential kernel corresponds to the Ornstein-Uhlenbeck process used to model Brownian motion.
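A sketch of this effect (the two kernels and the length scale below are standard illustrative choices, not taken from the slides): samples from N(0, K) are smooth under a squared-exponential kernel and rough under an exponential (Ornstein-Uhlenbeck) kernel.

import numpy as np

def se_kernel(x1, x2, length=0.3):
    # Squared-exponential kernel: smooth sample paths
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / length ** 2)

def ou_kernel(x1, x2, length=0.3):
    # Exponential (Ornstein-Uhlenbeck) kernel: rough sample paths
    return np.exp(-np.abs(x1[:, None] - x2[None, :]) / length)

x = np.linspace(-1, 1, 200)
jitter = 1e-8 * np.eye(x.size)                # numerical stabilizer for sampling
rng = np.random.default_rng(0)
smooth = rng.multivariate_normal(np.zeros_like(x), se_kernel(x, x) + jitter, size=3)
rough = rng.multivariate_normal(np.zeros_like(x), ou_kernel(x, x) + jitter, size=3)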
Gaussian Process for Regression
Likelihood: p(t_n | y_n) = N(t_n | y_n, β^{-1})
Prior: p(y) = N(y | 0, K)
Marginal distribution: p(t) = ∫ p(t | y) p(y) dy = N(t | 0, C), where C = K + β^{-1} I, i.e. C(x_n, x_m) = k(x_n, x_m) + β^{-1} δ_nm
Samples of Data Points
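A sketch of how such data-point samples can be generated under the model above (the kernel and the value of β are assumptions): draw t from the marginal N(0, C) with C = K + β^{-1} I.

import numpy as np

beta = 25.0                                   # assumed noise precision
x = np.linspace(-1, 1, 30)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.3 ** 2)  # assumed SE kernel
C = K + np.eye(x.size) / beta                 # C = K + (1/beta) I

rng = np.random.default_rng(0)
t = rng.multivariate_normal(np.zeros_like(x), C)  # one draw of noisy targets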
Predictive Distribution
p(t_{N+1} | t_N) is a Gaussian distribution with mean and variance:
m(x_{N+1}) = k^T C_N^{-1} t
σ^2(x_{N+1}) = c - k^T C_N^{-1} k
where k is the vector with elements k(x_n, x_{N+1}) and c = k(x_{N+1}, x_{N+1}) + β^{-1}.
Predictive Mean
m(x_{N+1}) = Σ_n a_n k(x_n, x_{N+1}), where a_n is the nth component of C_N^{-1} t.
We see the same form as in kernel ridge regression and kernel PCA.
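A minimal GP-regression predictor following these formulas (the squared-exponential kernel, β, and the toy data are assumptions; a production version would factor C_N once with a Cholesky decomposition rather than solving repeatedly):

import numpy as np

def se_kernel(a, b, length=0.3):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

def gp_predict(x_train, t_train, x_test, beta=25.0):
    # C_N = K + (1/beta) I on the training points
    C = se_kernel(x_train, x_train) + np.eye(x_train.size) / beta
    k = se_kernel(x_train, x_test)            # k_n = k(x_n, x*)
    c = 1.0 + 1.0 / beta                      # k(x*, x*) + 1/beta (SE kernel: k(x, x) = 1)
    a = np.linalg.solve(C, t_train)           # a = C_N^{-1} t
    mean = k.T @ a                            # m(x*) = sum_n a_n k(x_n, x*)
    var = c - np.sum(k * np.linalg.solve(C, k), axis=0)   # c - k^T C_N^{-1} k
    return mean, var

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 20)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 20)
mean, var = gp_predict(x_train, t_train, np.linspace(-1, 1, 100))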
GP Regression
Discussion: what is the difference between GP regression and Bayesian regression with Gaussian basis functions?
Computational Complexity
GP prediction for a new data point:
GP: O(N^3), where N is the number of data points
Basis function model: O(M^3), where M is the dimension of the feature expansion
When N is large, GP prediction is computationally expensive.
Sparsification: make predictions based on only a few data points (essentially making N small).
Learning Hyperparameters
Empirical Bayes methods: maximize the log marginal likelihood over the hyperparameters θ,
ln p(t | θ) = -(1/2) ln |C_N| - (1/2) t^T C_N^{-1} t - (N/2) ln(2π)
with gradient
∂/∂θ_i ln p(t | θ) = -(1/2) Tr(C_N^{-1} ∂C_N/∂θ_i) + (1/2) t^T C_N^{-1} (∂C_N/∂θ_i) C_N^{-1} t
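A sketch of this procedure for GP regression (the SE kernel family, the log parameterization, and the use of scipy's general-purpose optimizer are assumptions; the gradient expression above could equally be supplied analytically):

import numpy as np
from scipy.optimize import minimize

def neg_log_marginal(log_params, x, t):
    # -ln p(t | theta) for an SE kernel; theta = (length scale, noise std)
    length, noise = np.exp(log_params)        # log parameterization keeps both positive
    C = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / length ** 2)
    C += (noise ** 2 + 1e-8) * np.eye(x.size)
    L = np.linalg.cholesky(C)
    a = np.linalg.solve(L.T, np.linalg.solve(L, t))   # C^{-1} t via Cholesky
    # -ln p(t) = (1/2) ln|C| + (1/2) t^T C^{-1} t + (N/2) ln(2 pi)
    return (np.sum(np.log(np.diag(L))) + 0.5 * t @ a
            + 0.5 * x.size * np.log(2 * np.pi))

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 40)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 40)
res = minimize(neg_log_marginal, x0=np.log([1.0, 1.0]), args=(x, t))
length, noise = np.exp(res.x)                 # learned hyperparameters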
Automatic Relevance Determination
Consider a two-dimensional problem with the kernel
k(x, x') = θ_0 exp( -(1/2) Σ_i η_i (x_i - x'_i)^2 )
Maximizing the marginal likelihood will make certain η_i small, reducing the relevance of the corresponding input dimension to the prediction.
Example
t = sin(2 π x1)
x2 = x1 + n, where n is additive Gaussian noise
x3 = ε, Gaussian noise independent of the target
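A sketch of ARD on synthetic data in the spirit of this example (the noise levels, the fixed noise variance, and the optimizer are assumptions): one precision η_i is learned per input dimension by maximizing the marginal likelihood, and the precisions for the less relevant inputs shrink.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N = 60
x1 = rng.uniform(-1, 1, N)
X = np.stack([x1,
              x1 + rng.normal(0, 0.1, N),    # x2: noisy copy of x1
              rng.normal(0, 1, N)], axis=1)  # x3: noise unrelated to the target
t = np.sin(2 * np.pi * x1) + rng.normal(0, 0.1, N)

def neg_log_marginal(log_eta):
    # ARD kernel: k(x, x') = exp(-(1/2) sum_i eta_i (x_i - x'_i)^2)
    eta = np.exp(log_eta)
    sq = (((X[:, None, :] - X[None, :, :]) ** 2) * eta).sum(-1)
    C = np.exp(-0.5 * sq) + 0.01 * np.eye(N)  # assumed fixed noise variance 0.01
    L = np.linalg.cholesky(C)
    a = np.linalg.solve(L.T, np.linalg.solve(L, t))
    return np.sum(np.log(np.diag(L))) + 0.5 * t @ a   # constant term dropped

res = minimize(neg_log_marginal, x0=np.zeros(3))
print(np.exp(res.x))   # eta_1 stays large; eta_2 and eta_3 shrink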
Gaussian Processes for Classification
Likelihood: p(t | a) = σ(a)^t (1 - σ(a))^{1-t}, where a(x) is a latent function and σ is the logistic sigmoid
GP Prior: p(a_{N+1}) = N(a_{N+1} | 0, C_{N+1})
Covariance function: C(x_n, x_m) = k(x_n, x_m) + ν δ_nm, where ν ensures a positive-definite covariance
Sample from GP Prior
Predictive Distribution
p(t_{N+1} = 1 | t_N) = ∫ σ(a_{N+1}) p(a_{N+1} | t_N) da_{N+1}
No analytical solution.
Approximate this integral with one of:
Laplace’s method
Variational Bayes
Expectation propagation
Laplace’s method for GP Classification (1)
p(a_{N+1} | t_N) = ∫ p(a_{N+1} | a_N) p(a_N | t_N) da_N
We need a Gaussian approximation to the posterior p(a_N | t_N) ∝ p(t_N | a_N) p(a_N).
Laplace’s method for GP Classification (2)
Taylor expansion of Ψ(a_N) = ln p(a_N) + ln p(t_N | a_N):
∇Ψ(a_N) = t_N - σ_N - C_N^{-1} a_N
∇∇Ψ(a_N) = -W_N - C_N^{-1}, where W_N is diagonal with elements σ(a_n)(1 - σ(a_n))
Laplace’s method for GP Classification (3)
Newton-Raphson update:
a_N^new = C_N (I + W_N C_N)^{-1} { t_N - σ_N + W_N a_N }
Iterating to convergence gives the mode a*_N.
Laplace’s method for GP Classification (4)
Gaussian approximation:
q(a_N) = N(a_N | a*_N, H^{-1}), where H = -∇∇Ψ = W_N + C_N^{-1} is evaluated at the mode a*_N.
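A minimal sketch of the mode-finding iteration above, ending in the quantities needed for q(a_N) (the kernel, the value of ν, and the toy labels are assumptions):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_mode(C, t, n_iter=25):
    # Newton-Raphson: a_new = C (I + W C)^{-1} (t - sigma + W a)
    a = np.zeros_like(t)
    for _ in range(n_iter):
        s = sigmoid(a)
        W = np.diag(s * (1.0 - s))
        a = C @ np.linalg.solve(np.eye(t.size) + W @ C, t - s + W @ a)
    return a

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 30)
t = (x > 0).astype(float)                     # toy binary labels
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.3 ** 2)
C = K + 1e-4 * np.eye(x.size)                 # nu * I keeps C positive definite
a_star = laplace_mode(C, t)
# q(a_N) = N(a_N | a_star, H^{-1}) with H = W + C^{-1} evaluated at a_star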
Laplace’s method for GP Classification (5)
Using q(a_N), the latent predictive distribution p(a_{N+1} | t_N) is Gaussian with
mean: k^T (t_N - σ_N)
variance: c - k^T (W_N^{-1} + C_N)^{-1} k
Question: How to get the mean and the variance above?
Predictive Distribution
p(t_{N+1} = 1 | t_N) ≈ ∫ σ(a_{N+1}) q(a_{N+1} | t_N) da_{N+1}, where q(a_{N+1} | t_N) is the Gaussian with the mean and variance above.
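A sketch of this final step (the σ(κ(σ²)μ) squashing is the standard logistic–Gaussian approximation with κ(σ²) = (1 + π σ²/8)^{-1/2}; the helper below is illustrative and consumes the mode a_star from the previous sketch, a vector k of kernel values k(x_n, x*), and c = k(x*, x*) + ν):

import numpy as np

def predict_prob(C, t, a_star, k, c):
    # mean = k^T (t - sigma(a*)),  var = c - k^T (W^{-1} + C)^{-1} k
    s = 1.0 / (1.0 + np.exp(-a_star))
    W_inv = np.diag(1.0 / (s * (1.0 - s)))
    mean = k @ (t - s)
    var = c - k @ np.linalg.solve(W_inv + C, k)
    kappa = 1.0 / np.sqrt(1.0 + np.pi * var / 8.0)    # logistic-Gaussian approximation
    return 1.0 / (1.0 + np.exp(-kappa * mean))        # p(t* = 1 | t_N)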
Example