Embarrassingly parallel inference for Gaussian processes
Michael M. Zhang¹, Sinead A. Williamson¹,²
michael [email protected], [email protected]
¹Department of Statistics and Data Science, The University of Texas at Austin
²Department of Information, Risk and Operations Management, McCombs School of Business, The University of Texas at Austin
February 28, 2017
Abstract
Training Gaussian process (GP) based models typically involves an O(N^3) computational bottleneck. Popular methods for overcoming the matrix inversion problem include sparse approximations of the covariance matrix through inducing variables or dimensionality reduction via “local experts”. However, these types of models cannot account for both long- and short-range correlations in the GP functions. Furthermore, these methods are often ill-suited for cases where the input data are not uniformly distributed. We present an embarrassingly parallel method that takes advantage of the computational ease of inverting block-diagonal matrices, while maintaining much of the expressivity of a full covariance matrix. By using importance sampling to average over different realizations of low-rank approximations of the GP model, we ensure our algorithm is both asymptotically unbiased and embarrassingly parallel.
1 Introduction
Many problems in statistics and machine learning can be framed in terms of learning a latent function f(x). For example, in regression problems, we represent our dependent variables as a (noisy) function of our independent variables. In classification, we learn a function that maps an input to a class (or, more typically, a probability for each class). In parameter optimization, we have some function that maps parameters to their likelihoods, and want to find optima of this function. However, if we assume the relationship between x and f(x) is nonlinear, then learning the latent function is a non-trivial task.
Gaussian processes (GPs) provide a flexible family of distributions over functions that have been widely adopted for problems including regression, classification and optimization (Williams & Rasmussen, 1996; Williams & Barber, 1998; Snoek et al., 2012), due to their ease of use in modeling latent functions. Unfortunately, inference in GP models with N observations involves repeated inversion of an N × N matrix, which typically scales as O(N^3). This computational bottleneck has thus far prevented Gaussian processes from being used in so-called “big data” situations.
Two approaches have been proposed to ameliorate the time taken for inference. The first aims to reduce the computational burden of inverting an N × N matrix. Sparse methods approximate the full function f using a function evaluated at M << N inducing points, so we only need to invert an M × M matrix (Smola & Bartlett, 2001; Csató & Opper, 2002; Lawrence et al., 2003; Snelson & Ghahramani, 2005; Titsias, 2009; Lázaro-Gredilla et al., 2010). However, sparse methods are not well-suited for modeling functions with short lengthscales without requiring more inducing points, thereby increasing the computational burden.
Local methods learn K separate Gaussian processes, each on an N/K-th of the data, and then combine the estimates (Tresp, 2000; Cao & Fleet, 2014; Ng & Deisenroth, 2014; Deisenroth & Ng, 2015). This is equivalent to approximating the N × N covariance matrix with a block-diagonal matrix, which we can invert by inverting the individual N/K-sized blocks. Unlike the sparse methods, local GP inference can model functions with short lengthscales, but will poorly approximate functions with long lengthscales unless the size of the blocks in the block-diagonal approximation is increased, which reintroduces the very problem we want to avoid (see Low et al. (2015) for a discussion of weaknesses in sparse and local GP methods).
The second approach is to use distributed computing to allow computations to be carried out in parallel. Local methods can easily be deployed in this setting, since they are based on learning K independent Gaussian processes, which can be done in parallel (Deisenroth & Ng, 2015). We can also make use of parallel computing when inverting the covariance matrices; while the communication costs of doing this on a distributed system may prove prohibitive, significant speed-ups can be obtained by making use of local GPUs (Dai et al., 2014; Gramacy et al., 2014; Gramacy & Apley, 2015).
In this paper, we propose a novel inference algorithm for Gaussian processes that is both scalable in the number of data points and easily distributed. The Importance Gaussian Process Sampler (IGPS) approximates the covariance matrix with a mixture of block-diagonal matrices. We learn, in parallel, multiple Gaussian processes with block-diagonal covariance matrices sampled from the mixing distribution, taking advantage of the lower cost of inverting a block-diagonal matrix. We then use importance sampling to combine these estimates in a principled manner; this is the only step in our algorithm requiring global communication. The resulting posterior predictive distribution has a dense covariance matrix, avoiding the edge effects common with local methods and allowing for an expressive covariance structure that can model both long- and short-range covariance.
We will start in Section 2 by introducing the Gaussian process and describing existing
methods for scaling inference. We then describe our approach in Section 3, before presenting
detailed experimental evaluation in Section 4 to showcase its efficacy versus existing methods.
Further extensions and applications are discussed in Section 5.
2 Problem set-up and related work
A Gaussian process is a distribution over functions f : R^D → R, parametrized by some mean function m(x), typically taken as zero, and a covariance function Σ(x, x′). For a given m and Σ, a GP is a unique distribution over functions f such that, for any finite set of points x_1, ..., x_N ∈ R^D, the function evaluated at those points is multivariate normally distributed with mean and covariance given by m and Σ(·, ·) evaluated at the inputs.
The behavior of functions sampled from a Gaussian process is largely determined by the covariance function. A covariance function Σ(x, x′) that depends only on |x − x′| will produce stationary, isotropic functions. The most commonly used covariance function of this type is the squared exponential (or RBF) kernel, Σ(x, x′) = v exp{−(x − x′)^2 / (2ℓ^2)}. Here, the lengthscale ℓ determines how quickly f(x) varies with x. If we believe our function to be non-stationary, a covariance function that depends on location is appropriate. While a number of such functions exist (Gibbs, 1998), they are generally harder to work with than stationary kernels and have more parameters to tune.
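For concreteness, a minimal NumPy sketch of this squared exponential kernel for scalar inputs (the arguments v and lengthscale correspond to v and ℓ above; the function name is ours):

import numpy as np

def squared_exponential_kernel(x, x_prime, v=1.0, lengthscale=1.0):
    """Squared exponential (RBF) covariance between two sets of scalar inputs.

    Entry (i, j) is v * exp(-(x_i - x'_j)^2 / (2 * lengthscale^2)).
    """
    diff = np.subtract.outer(np.asarray(x), np.asarray(x_prime))  # pairwise x_i - x'_j
    return v * np.exp(-diff ** 2 / (2.0 * lengthscale ** 2))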
Gaussian processes have been used to provide prior distributions over functions in a variety of applications including regression, classification, optimization and modeling dynamics (see for example Rasmussen & Williams, 2006; Snoek et al., 2012; Wang et al., 2005). In this exposition we focus on regression, but we note that the extension to other settings is straightforward. In the regression setting, our data are a vector Y = (y_i)_{i=1}^N of observations, where each observation is associated with a set of covariates x_i ∈ R^D. If we specify our regression model as

f ∼ GP(m, Σ)
y_i ∼ N(f(x_i), σ^2),

then the posterior predictive distribution over the value of f* at test points X* is normally distributed with mean μ_{f*} = Σ_{X*X}(Σ_{XX} + σ^2 I)^{-1} Y and covariance Σ_{f*} = Σ_{X*X*} − Σ_{X*X}(Σ_{XX} + σ^2 I)^{-1} Σ_{X*X}^T.
In the regression setting, this conjugacy means that inferring f given Σ(·, ·) and the data
is straightforward. The challenge comes in inferring the hyperparameters, Θ, that control
the form of the covariance function. Optimizing or sampling these hyperparameters involves inverting the covariance matrix given by evaluating the covariance function, Σ(·, ·), at the inputs x_1, ..., x_N. In general, the computational cost of inverting this matrix is O(N^3).
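As a point of reference, a hedged NumPy/SciPy sketch of these predictive equations (exact GP regression; kernel is any covariance function, such as the squared exponential sketch above, and the Cholesky factorization is the O(N^3) step the rest of this paper seeks to avoid):

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_predict(X, Y, X_star, kernel, noise_var):
    """Exact GP regression: posterior predictive mean and covariance at test inputs X_star."""
    K = kernel(X, X)                      # Sigma_XX
    K_s = kernel(X_star, X)               # Sigma_X*X
    K_ss = kernel(X_star, X_star)         # Sigma_X*X*
    # The O(N^3) bottleneck: factorizing (Sigma_XX + sigma^2 I).
    L = cho_factor(K + noise_var * np.eye(len(X)))
    mean = K_s @ cho_solve(L, Y)
    cov = K_ss - K_s @ cho_solve(L, K_s.T)
    return mean, cov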
2.1 Reducing the cost of matrix inversion with covariance approximations
A number of approaches have been proposed that use approximations to avoid the need to
invert an N × N matrix. These can loosely be subdivided into two categories: “sparse”
methods based on M << N inducing inputs, and “local” methods that approximate the
N × N covariance matrix with a block-diagonal matrix.
Sparse GP approximations, exemplified by the sparse GP method of Snelson & Ghahramani (2005), parameterize the covariance matrix of the GP model with M pseudo-inputs,
where M << N . The pseudo-input locations are chosen so that the posterior function
evaluated at these points is a good approximation to the true posterior, for example by maximum likelihood optimization. The computational saving comes from replacing an N × N
covariance matrix with an M × M matrix. In the regression case, this reduces the training
cost to O(NM^2). However, the downside of this approach is a decreased ability to model high-frequency fluctuations in the function, since the number of inducing points limits the amount of variation we can capture. Additionally, as with a full GP, the sparse GP cannot
naturally model non-stationary data without resorting to a non-stationary kernel.
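To illustrate where the O(NM^2) saving comes from, here is a minimal Nyström-style sketch of a low-rank covariance built from M inducing inputs; it illustrates the general inducing-point idea rather than reproducing the exact pseudo-input construction of Snelson & Ghahramani (2005), and kernel is any covariance function such as the sketch above:

import numpy as np

def low_rank_covariance(X, Z, kernel, jitter=1e-8):
    """Nystrom-style approximation K_NM K_MM^{-1} K_MN of the full N x N covariance.

    X holds the N training inputs and Z the M inducing inputs, with M << N.
    Only the M x M matrix K_MM is ever factorized, so the cost is O(N M^2), not O(N^3).
    """
    K_nm = kernel(X, Z)                                # N x M cross-covariance
    K_mm = kernel(Z, Z) + jitter * np.eye(len(Z))      # M x M inducing covariance
    return K_nm @ np.linalg.solve(K_mm, K_nm.T)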
Local GP methods partition the data points into K groups, and learn a separate Gaussian
process for each group. In general, partitioning is based on the input location. Park et al.
(2011) partition the input region into disjoint regions. Rather than assign partitions a priori,
the Bayesian treed GP of Gramacy & Lee (2008) uses a distribution over trees to partition
the input. Rasmussen & Ghahramani (2002) assign the data to a Dirichlet process (DP)
mixture of GPs, each of which has an idiosyncratic set of hyperparameters. Conditioned on
the partitioning, inference in the local GP scales approximately as O(N^3/K^2), since we need to invert K matrices of average size N/K × N/K.
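A minimal sketch of this saving, assuming a fixed partition: each of the K blocks is inverted independently at O((N/K)^3) apiece, giving O(N^3/K^2) in total, at the price of discarding all cross-block covariance.

import numpy as np

def block_diagonal_inverse(K_full, blocks):
    """Invert a block-diagonal approximation of K_full.

    `blocks` is a list of K index arrays partitioning the N data points; each block is
    inverted on its own, and covariances between different blocks are simply set to zero.
    """
    inv = np.zeros_like(K_full)
    for idx in blocks:
        inv[np.ix_(idx, idx)] = np.linalg.inv(K_full[np.ix_(idx, idx)])
    return inv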
A naive implementation of this type of local GP method leads to discontinuities at partition boundaries. However, a number of approaches have been developed to mitigate this effect. Gramacy & Lee (2008) average over partitions using reversible jump MCMC, which avoids problems due to poor partition choice but significantly increases computational cost. Park et al. (2011) use a boundary value function to ensure continuity between regions. Mixture of Experts models (Rasmussen & Ghahramani, 2002; Meeds & Osindero, 2006) and Product of Experts models (Ng & Deisenroth, 2014; Cao & Fleet, 2014) combine the predictions of multiple local Gaussian processes, either by averaging or multiplying predictions.
One advantage of local methods is that they offer a straightforward way to capture non-stationarity without the need for complex covariance functions. Indeed, this was one of the
key motivations underlying the Mixture of Experts and tree-based models: Each block in
our partition of the data is modeled using a separate GP, and these GPs can use different
hyperparameters for their covariance function. This allows us to capture behavior which
is locally approximately stationary, but where the lengthscale of variation is different in
different regions of the input space. This is in contrast with sparse methods, which can only
capture non-stationarity if we use an explicitly non-stationary covariance function.
The disadvantage of the local methods, however, is that they risk ignoring important
correlations, since they assume zero correlation between different blocks in the partition. If
the data points are partitioned based on location, this means that long-range correlations
will be ignored. And if the data are partitioned randomly, the model will tend to perform
poorly if the number of observations in some region of R^D is low.
2.2 Distributed inference for Gaussian processes
The sparse and local approximations described above aim to reduce the overall computational
burden by reducing the size of matrices to be inverted. When run on a single machine, this
reduction in computational cost leads directly to faster inference.
If our primary goal is to reduce inference time, as opposed to reducing the computing
budget, we may be interested in distributing computation cost across multiple threads or
machines. Even if the total computational cost is the same, if we can split our computation
onto multiple threads running in parallel, we will obtain faster inference and prediction.
Alternatively, if we are willing to increase the computational budget but not the time budget
then we may be able to improve our posterior estimate by running multiple samplers in
parallel and then combining the results.
Local GP methods are well suited to this sort of parallelism. They split a single GP problem into K independent problems whose parameters can be inferred independently of each other. We only need to communicate between the K subproblems at the end, when we combine their predictions. This type of algorithm, where global communication occurs only once after all the local computation is complete, is known as “embarrassingly parallel”. Ng & Deisenroth (2014) exploit this independence, in a weighted product-of-experts model, to obtain an embarrassingly parallel algorithm appropriate for large datasets.
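A minimal sketch of this pattern (our illustration, not the authors' or Ng & Deisenroth's implementation): K local GPs on scalar inputs are fitted and queried in parallel, and the workers communicate only once, at the end, when their predictions are combined; a naive equal-weight average stands in here for the weighted product-of-experts combination.

import numpy as np
from multiprocessing import Pool

def _rbf(x, x_prime, v=1.0, ell=1.0):
    d = np.subtract.outer(x, x_prime)
    return v * np.exp(-d ** 2 / (2.0 * ell ** 2))

def _local_gp_mean(args):
    """Fit one local GP on its shard and return its predictive mean at the test inputs."""
    x, y, x_test, noise = args
    K = _rbf(x, x) + noise * np.eye(len(x))        # only an (N/K) x (N/K) system is solved
    return _rbf(x_test, x) @ np.linalg.solve(K, y)

def parallel_local_gp(x, y, x_test, K=4, noise=0.1, workers=4):
    """Embarrassingly parallel local GP regression on scalar inputs."""
    shards = np.array_split(np.argsort(x), K)      # partition the inputs by location
    jobs = [(x[idx], y[idx], x_test, noise) for idx in shards]
    with Pool(workers) as pool:                    # the K fits run independently
        means = pool.map(_local_gp_mean, jobs)
    return np.mean(means, axis=0)                  # single, final combination step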
3 Embarrassingly parallel inference with the Importance Gaussian Process Sampler
As described in Section 2, local GP methods reduce the computational burden of inverting
the N × N covariance matrix by approximating it with a block-diagonal matrix with K
blocks. This allows us to invert each block individually, reducing the computational cost due
to matrix inversions to O(N^3/K^2). This has the added advantage of allowing us to easily capture non-stationarity, by using different covariance parameterizations in each block of the covariance matrix (Tresp, 2000; Rasmussen & Ghahramani, 2002). However, such an
approximation inherently ignores some of the correlations between data points. In particular,
if the partitioning is based on input location, we ignore long-range correlations. Further, in
most cases we are reliant on a fixed partition, which may not yield the best approximation
to the covariance matrix.
If we assume a mixture of partitions instead, the implied covariance matrix is given by a
mixture of block-diagonal matrices, and is therefore dense. The posterior distribution over
partitions will assign high probability to partitions that are a good approximation to the full
covariance matrix.
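A small sketch of this implied covariance under a handful of weighted partitions (the partitions and weights here are placeholders for the importance-weighted samples introduced below): each term is block-diagonal, but their weighted average is dense.

import numpy as np

def mixture_covariance(K_full, partitions, weights):
    """Weighted mixture of block-diagonal approximations of K_full.

    Each partition (a list of index arrays) keeps only within-block covariances; averaging
    several such approximations with weights summing to one yields a dense matrix.
    """
    mix = np.zeros_like(K_full)
    for blocks, w in zip(partitions, weights):
        approx = np.zeros_like(K_full)
        for idx in blocks:
            approx[np.ix_(idx, idx)] = K_full[np.ix_(idx, idx)]
        mix += w * approx
    return mix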
As shown with the Bayesian treed GP (BTGP) of Gramacy & Lee (2008), jointly learning
the partition and the local GPs can improve performance. However, iteratively sampling the
partition and the GP parameters via a Markov chain Monte Carlo (MCMC) algorithm, as
is done in the BTGP, is computationally expensive. Further, MCMC methods are difficult
to parallelize, since the distribution for each sample depends on its predecessor.
We wish to retain the ability to explore the posterior distribution over covariance functions while ensuring our algorithm can be distributed; an importance sampling approach allows us this freedom. We independently sample J instantiations of a partition, plus associated partition parameters, from an appropriate distribution, and calculate the corresponding importance weights. This importance sampling approach is both unbiased and embarrassingly parallel to implement, since the samples and importance weights are generated independently.
A number of choices are available for the prior distribution over partitions. Our goal is to obtain a partition that both encourages similarly sized blocks (to reduce the size of the largest block) and ensures that data points close to each other in input space are assigned to the same cluster (to capture local effects). We achieve this by modeling the inputs using a repulsive mixture model, introduced by Petralia et al. (2012), described below:

π ∼ Dirichlet(α)
Z_i ∼ Discrete(π)
Σ_k ∼ Inv-Wishart(Σ_0, ν)
p(μ) ∝ ( ∏_{k=1}^K N(μ_k; μ_0, Ω_0) ) h(μ)          (1)
X_i | Z_i ∼ N(μ_{Z_i}, Σ_{Z_i}),

where h(μ) = ∏_{i<j≤K} exp{−τ ||μ_i − μ_j||^{−η}} is a repulsion-inducing function.
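A minimal sketch of drawing a partition proposal from this prior for scalar inputs (our illustration, with the inverse-Wishart draw replaced by fixed unit variances for brevity): cluster locations are drawn by rejection, which is valid here because h(μ) ≤ 1, and each point is then assigned to a cluster given its input, as in the proposal step of Algorithm 1.

import numpy as np

def sample_partition(X, K, alpha=1.0, mu0=0.0, omega0=1.0, tau=1.0, eta=1.0, rng=None):
    """Draw a partition proposal of scalar inputs X from the repulsive mixture prior (Eq. 1)."""
    rng = rng or np.random.default_rng()
    pi = rng.dirichlet(alpha * np.ones(K))                 # mixing weights
    sigma2 = np.ones(K)                                    # stand-in for the inverse-Wishart draw
    while True:                                            # rejection sampler: accept w.p. h(mu) <= 1
        mu = rng.normal(mu0, np.sqrt(omega0), size=K)
        gaps = np.abs(mu[:, None] - mu[None, :])[np.triu_indices(K, 1)]
        if rng.uniform() < np.exp(-tau * np.sum(gaps ** -eta)):
            break
    # Assign each input to a cluster in proportion to pi_k * N(X_i; mu_k, sigma2_k).
    log_p = np.log(pi) - 0.5 * np.log(sigma2) - 0.5 * (X[:, None] - mu[None, :]) ** 2 / sigma2
    probs = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    Z = np.array([rng.choice(K, p=p) for p in probs])
    return Z, mu, pi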
To generate a single importance-weighted sample, we generate a partition of the data
from the prior of this repulsive mixture model. We then optimize over any parameters
associated with each cluster. In a regression model, these are simply the hyperparameters
of the covariance matrix. The importance weights are given by the ratio of the posterior probability of the partition given the data to its probability under the proposal distribution.
Since we are deterministically selecting the optimal hyperparameters for the partition and
sampling the partitions from their conditional distribution given X, the importance weights
are simply the likelihood of the outputs given the partition and the inputs. This likelihood
is calculated as part of the hyperparameter optimization, so evaluating the weights incurs
no additional cost. The algorithm, which we call the Importance Gaussian Process Sampler
(IGPS), is summarized in Algorithm 1.
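A minimal sketch of the final, globally communicating step, assuming each of the J parallel workers has already returned its log marginal likelihood (the unnormalized importance weight) and its predictive mean and variance at the test points:

import numpy as np

def combine_importance_samples(log_weights, means, variances):
    """Combine J importance-weighted predictions into one predictive mean and variance.

    log_weights: length-J array of per-sample log marginal likelihoods.
    means, variances: (J, N*) arrays of per-sample predictive means and variances.
    """
    w = np.exp(log_weights - np.max(log_weights))          # stabilize, then normalize
    w /= w.sum()
    mean = w @ means                                       # mixture mean
    var = w @ (variances + means ** 2) - mean ** 2         # mixture second moment minus mean^2
    return mean, var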
Our overall computational cost, using J importance-weighted samples, is therefore O(JN^3/K^2). In Table 1, we compare this with the overall computational cost of the full GP, the sparse GP of Snelson & Ghahramani (2005), the Bayesian treed GP (BTGP) of Gramacy & Lee (2008), and the robust Bayesian committee machine (RBCM) of Deisenroth & Ng (2015). While the O(JN^3/K^2) cost is a factor of J higher than sparse methods[1] and local methods based on a fixed partition, we note that the J samples can be performed and weighted in parallel, meaning the time taken is comparable if we are willing to expend additional computational resources. We can also make use of the independence of the K partitions to parallelize further, using JK threads each taking O(N^3/K^3). As shown in Table 1, this leads to a wall-time cost comparable with the distributed robust Bayesian committee machine (RBCM) of Deisenroth & Ng (2015). As we will see in Section 4, the extra computational cost required to ensure a full posterior predictive distribution yields improved performance over methods that are based on a fixed partition.
An additional benefit of this model is that, by representing the covariance structure as
a mixture of block-diagonal matrices, we can use separate covariance hyperparameters in
each block. This allows us to capture non-stationarity in the latent function, while working
locally with stationary kernels. Furthermore, our choice of mixture model allows us to
propose cluster locations that are potentially more meaningful and capture more correlation
within a block than simple Dirichlet process mixture modeling, which often suffers from the drawback of proposing similar clusters.
[1] In general, a sparse model with M inducing points obtains comparable accuracy to a local method with N/M local GPs.
Table 1: Comparison of inference complexity for various methods. Here, N is the number of data points, K is the number of experts or local GPs, and M is the number of inducing points. For the Monte Carlo based methods, J is the number of MCMC iterations or importance samples.

Method      Complexity      Complexity per thread
Full GP     O(N^3)          x
Sparse GP   O(NM^2)         x
RBCM        O(N^3/K^2)      O(N^3/K^3)
IGPS        O(JN^3/K^2)     O(N^3/K^3)
BTGP        O(JN^3/K^2)     x

4 Experimental evaluation
To showcase the performance of our method, we compare it with a number of competing
methods on both synthetic and real data.
4.1 Evaluation on synthetic data
Figure 1: Posterior mean and 95% credible intervals obtained using various methods, on a
stationary function with long-range correlations. Generating function shown bottom right.
We begin by evaluating our method on synthetically generated data, in order to allow us to explore a range of regimes. In our studies, we compare our Importance Gaussian Process Sampler against the full GP model, the sparse GP method of Snelson & Ghahramani (2005), the Bayesian treed GP, and the parallelizable robust Bayesian committee machine (RBCM) of Deisenroth & Ng (2015). Our IGPS code is built on the Gaussian process modules in pyGPs in Python. We ran the full GP and sparse GP implementations through the gpml package in Matlab, the Laplace approximation GP classification in pyGPs, the sparse classification in GPy, the BTGP in tgp, and the RBCM in gptf.

Algorithm 1 Importance Gaussian Process Sampler (IGPS) inference algorithm
  for j = 1, ..., J in parallel do
      Draw μ_k^(j), Σ_k^(j) ∼ P(μ_k, Σ_k)
      for i = 1, ..., N do
          Draw Z_i^(j) ∝ P(π_k^(j)) P(X_i | μ^(j)_{Z_i}, Σ^(j)_{Z_i})
      end for
      Optimize hyper-parameters Θ^(j)
      Set proposal weight w^(j) ∝ ∑_k P(Y^(j)_(k) | −)
      for i = 1, ..., N* do
          Draw Z_i^(j)* ∝ P(π_k^(j) | −) P(X_i* | −)
      end for
      for k = 1, ..., K in parallel do
          Compute P(Y^(j)*_(k) | X^(j)*_(k), Y^(j)_(k), X^(j)_(k), Θ^(j))
      end for
  end for
  Let w^(j) := w^(j) / ∑_j w^(j)
  Let P(Ȳ* | X, Y, X*) := ∑_j w^(j) · P(Y^(j)*_(k) | −)

Figure 2: Posterior distribution and 95% credible intervals obtained using various methods, on a stationary function with short-range correlations. Generating function shown bottom right.
First, we consider how well we are able to replicate the performance of a Gaussian process
with a full, stationary covariance function in a one-dimensional setting. We look at two
functions, both sampled from a Gaussian process with zero mean and a squared exponential
kernel and Gaussian noise. The first function has lengthscale ℓ = 10.0 and noise variance σ^2 = 0.0625, and was chosen to test performance in the presence of long-range correlations between observations. The second function has lengthscale ℓ = 2.23 and noise variance σ^2 = 0.0625, and was chosen to evaluate the methods' ability to identify high-frequency variation in the function. In both cases, we used 1000 training points and evaluated on 500 test points. For all methods except BTGP, we infer hyperparameters through optimization; for BTGP we infer the hyperparameters through MCMC sampling. For the sparse
methods, we used M = 50 inducing points, and for the local methods (including the IGPS)
we used K = 20 partitions to have a comparable level of computational complexity. For the
Bayesian treed GP we ran the MCMC sampler for 1,000 iterations; for the IGPS we used
1,000 independent importance-weighted samples.
Table 2: Log likelihoods obtained on the stationary, non-stationary and large-N regression tasks

Data                           GP        Sparse     IGPS       BTGP      RBCM
Stationary, long lengthscale   -33.79    -35.28     -42.76     -80.15    -255.23
Stationary, short lengthscale  -31.32    -539.55    -43.13     -64.16    -280.08
Non-stationary                 -711.57   -1.39e4    232.83     119.65    -1.17e5
Flight Delay                   x         x          -2.49e4    x         -7.13e7
Table 3: Mean squared error obtained on the stationary, non-stationary and large-N regression tasks

Data                           GP      Sparse    IGPS       BTGP    RBCM
Stationary, long lengthscale   0.06    0.07      0.06       0.07    0.09
Stationary, short lengthscale  0.07    0.65      0.07       0.07    0.13
Non-stationary                 1.00    0.85      0.02       0.02    0.96
Flight Delay                   x       x         1280.65    x       1300.80
Figures 1 and 2 show the posterior distributions obtained using the five methods, on
the two datasets. Tables 2 and 3 show the corresponding test set log likelihood and mean
squared error, respectively. Recall that sparse methods generally perform well
when the covariance structure is dominated by longer-range correlations, and local methods
generally perform well when we have significant local variation in our function. Looking at
the results on the dataset with long-range correlations, we see that the IGPS performs better
than the other local methods, and performs nearly as well as the full GP and the sparse GP.
If we look at the dataset with short-range correlation, the sparse GP performs very poorly
relative to the local methods, as expected. Of the local methods, the IGPS performs closest
to the full GP. The poor performance of BTGP is likely because the complexity inherent in
the method leads to slow mixing of the MCMC chain and therefore lack of convergence.
To explore why the IGPS is able to perform so well compared with the full GP, we consider
the covariance structure found by the IGPS. Recall that, while most other local methods
assume a single partition of the data which results in a block-diagonal covariance matrix,
the IGPS uses a mixture of block-diagonal covariance matrices. Figures 3 and 4 show the
covariance matrices obtained on these two datasets using a full GP, the covariance matrices
obtained by averaging over the importance-weighted samples, and the covariance matrices
associated with the highest-weighted sample. We see that we are able to better approximate
the full covariance matrix by using this weighted mixture, capturing more information than
even the best single proposal.
Next, we model an example of a non-stationary function where the lengthscale of the
function changes across X. Again, we use a squared exponential kernel to learn this
function. In this example, we focus on known failure modes of local and sparse GPs: the
inputs are not uniformly distributed and we have a combination of slowly varying behavior
(which is poorly captured by local methods) and fast-varying behavior (which is poorly
captured by sparse methods). The full GP, sparse GP and RBCM methods, fitted with
stationary kernels, obviously cannot account for the non-stationary components in the data.
In contrast, our importance sampling method and the BTGP can account for the non-stationarity in the data and provide confident predictions in all regions of the function.
Looking at the log likelihood and predictive performance in Tables 2 and 3, we see that the
IGPS obtains the best predictive performance.
4.2 Evaluation on real data
As seen in our experiments on synthetic data, our IGPS method is applicable to many
different data regimes where other approximations may fail. Its inherently parallelizable
nature also makes it an appealing choice for larger, real-world datasets where use of a full GP
is computationally infeasible. To evaluate performance in this “big data” regime, we modeled a dataset of airline on-time performance data, where the goal is to predict flight delay times from a number of covariates.[2] We again model this dataset using squared exponential kernels and a zero mean function. We compare against the RBCM, the only other method in our comparison set that is designed for a distributed setting. We trained our method using 10,000 observations in the training set and 5,000 observations in the test set, with K = 20 clusters and experts.[3]
Our importance sampling method provides a richer predictive model, due to the averaging over importance proposals, and we can see the benefit of this in the results of our “big data” analysis. Tables 2 and 3 show the test set log-likelihood and mean squared error obtained using the two methods. The IGPS obtains better predictive performance on both metrics.
[2] The dataset is available at http://stat-computing.org/dataexpo/2009/
[3] In our analysis we only include the following covariates: Month, DayofMonth, DayOfWeek, DepTime, ArrTime, AirTime, Distance.
Figure 3: Approximated covariance structure, long-range correlations. From top to bottom:
Generating function, covariance found with full GP, covariance found through averaging
importance-weighted samples, covariance from the highest-likelihood single sample. Covariance matrices are plotted on a log-scale.
Finally, to highlight that our method is not limited to a specific GP model, we apply our method to a binary classification task, using the Laplace approximation (Williams & Barber, 1998) with a squared exponential kernel. We compare with the full GP and the sparse GP on a synthetic dataset consisting of two interleaving half circles (shown in Figure 6), as well as on the Pima, Parkinsons and WDBC datasets.[4]
[4] All empirical classification datasets are available in the UCI repository at http://archive.ics.uci.edu/ml/.
Figure 4: Approximated covariance structure, short-range correlations. From top to bottom: Generating function, covariance found with full GP, covariance found through averaging importance-weighted samples, covariance from the highest-likelihood single sample.
Covariance matrices are plotted on a log-scale.
Our importance sampling method approximates the full GP results very well, with similar area under the curve (AUC) scores for the synthetic data and better AUC scores for the Parkinsons and WDBC datasets (see Tables 4 and 5).

Figure 5: Posterior mean and 95% credible intervals obtained using various methods, on a non-stationary function with both long- and short-range correlations. Generating function shown bottom right.

Table 4: Classification log likelihood results

Data          Full GP     IGPS      Sparse GP
Synthetic     -64.81      -70.74    -64.18
Pima          -128.79     -135.32   -128.61
Parkinsons    -17.00      -22.76    -28.42
WDBC          -15.50      -10.85    -18.01

5 Summary and future work

While Gaussian processes provide a flexible framework for a wide variety of modeling scenarios, their use has been limited in the big-data regime, since most implementations scale cubically with the number of data points. As we saw in Section 2, a number of approximations have been proposed to reduce this cost, but these approximations come with notable failure modes: models based on sparse inducing points tend to perform poorly when the inputs are sparsely distributed or the function is dominated by short-range correlations, and are not well suited to a distributed setting; models based on block-diagonal approximations tend not to capture long-range correlations.
We have presented a new approximation and algorithmic framework for scalable, distributed inference in Gaussian process models that avoids the pitfalls of existing approaches.
A mixture of block diagonal matrices provides a good approximation to the full covariance
matrix at a lower computational cost than full GP inference. Unlike the block-diagonal covariance structures found by most local methods, our method gives a dense covariance matrix
that can capture long-range correlations. And unlike sparse methods, our approximation can
easily handle non-stationary and high-frequency functions.
Figure 6: Synthetic binary classification task: label probabilities obtained using the full GP
and the IGPS.
Table 5: Classification AUC results

Data          Full GP    IGPS    Sparse GP
Synthetic     0.98       0.98    0.98
Pima          0.83       0.81    0.83
Parkinsons    0.86       0.93    0.88
WDBC          0.83       0.97    0.81
Our algorithm uses importance sampling to average over the posterior distribution, introducing a factor of J into the computational complexity, where J is the number of samples.
In a single-machine setting, this leads to slower computation than sparse or local methods
that do not average over predictions. However, independence between the J samples and
independence between the K blocks in each sample means our algorithm can be trivially
parallelized onto as many as JK threads or machines, with a per-thread computational cost
of O(N^3/K^3). This allows us to make efficient use of additional computational resources in
a distributed setting, without increasing the total wall-clock time.
Additionally, we note that our algorithm can be used in conjunction with other inference techniques. For example, we could approximate the covariance matrix within a single partition using sparse inducing points, to obtain further speed-ups. Also, while the main focus of this paper is regression, the techniques used can be applied to any Gaussian process-based model. An interesting avenue for future research would be to explore which domains would benefit from such additional approximations, and which non-regression application areas would benefit most from our approach.
6 Acknowledgments
Sinead Williamson and Michael Zhang are supported by NSF grant 1447721.
References
Cao, Yanshuai and Fleet, David J. Generalized product of experts for automatic and principled fusion of Gaussian process predictions. In Modern Nonparametrics 3: Automating
the Learning Pipeline workshop at NIPS, 2014.
Csató, Lehel and Opper, Manfred. Sparse on-line Gaussian processes. Neural Computation,
14(3):641–668, 2002.
Dai, Zhenwen, Damianou, Andreas, Hensman, James, and Lawrence, Neil. Gaussian process
models with parallelization and GPU acceleration. arXiv preprint arXiv:1410.4984, 2014.
Deisenroth, Marc and Ng, Jun Wei. Distributed Gaussian processes. In International Conference on Machine Learning, pp. 1481–1490, 2015.
Gibbs, M.N. Bayesian Gaussian processes for regression and classification. PhD thesis,
University of Cambridge, 1998.
Gramacy, Robert B. and Apley, Daniel W. Local Gaussian process approximation for large
computer experiments. Journal of Computational and Graphical Statistics, 24(2):561–578,
2015.
Gramacy, Robert B. and Lee, Herbert K. H. Bayesian treed Gaussian process models with
an application to computer modeling. Journal of the American Statistical Association,
103(483):1119–1130, 2008.
Gramacy, Robert B, Niemi, Jarad, and Weiss, Robin M. Massively parallel approximate
Gaussian process regression. SIAM/ASA Journal on Uncertainty Quantification, 2(1):
564–584, 2014.
Lawrence, Neil, Seeger, Matthias, and Herbrich, Ralf. Fast sparse Gaussian process methods:
The informative vector machine. In Neural Information Processing Systems, pp. 625–632,
2003.
Lázaro-Gredilla, Miguel, Quiñonero-Candela, Joaquin, Rasmussen, Carl Edward, and
Figueiras-Vidal, Aníbal R. Sparse spectrum Gaussian process regression. Journal of
Machine Learning Research, 11:1865–1881, 2010.
Low, Kian Hsiang, Yu, Jiangbo, Chen, Jie, and Jaillet, Patrick. Parallel Gaussian process
regression for big data: Low-rank representation meets Markov approximation. In AAAI
Conference on Artificial Intelligence, 2015.
Meeds, Edward and Osindero, Simon. An alternative infinite mixture of Gaussian process
experts. In Neural Information Processing Systems, pp. 883–890, 2006.
Ng, Jun Wei and Deisenroth, Marc Peter. Hierarchical mixture-of-experts model for large-scale Gaussian process regression. arXiv preprint arXiv:1412.3078, 2014.
Park, Chiwoo, Huang, Jianhua Z., and Ding, Yu. Domain decomposition approach for
fast Gaussian process regression of large spatial data sets. Journal of Machine Learning
Research, 12(May):1697–1728, 2011.
Petralia, Francesca, Rao, Vinayak, and Dunson, David B. Repulsive mixtures. In Neural
Information Processing Systems, pp. 1889–1897, 2012.
Rasmussen, Carl E. and Ghahramani, Zoubin. Infinite mixtures of Gaussian process experts.
In Neural Information Processing Systems, pp. 881–888, 2002.
Rasmussen, Carl E. and Williams, Christopher K. I. Gaussian Processes for Machine Learning. MIT Press, 2006.
Smola, Alexander J and Bartlett, Peter. Sparse greedy Gaussian process regression. Neural
Information Processing Systems, 13:619–625, 2001.
Snelson, Edward and Ghahramani, Zoubin. Sparse Gaussian processes using pseudo-inputs.
In Neural Information Processing Systems, pp. 1257–1264, 2005.
Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan P. Practical Bayesian optimization of
machine learning algorithms. In Neural Information Processing Systems, pp. 2951–2959,
2012.
Titsias, Michalis K. Variational learning of inducing variables in sparse Gaussian processes.
In Artificial Intelligence and Statistics, volume 5, pp. 567–574, 2009.
Tresp, Volker. A Bayesian committee machine. Neural Computation, 12(11):2719–2741,
2000.
Wang, Jack M, Fleet, David J, and Hertzmann, Aaron. Gaussian process dynamical models.
In Neural Information Processing Systems, 2005.
Williams, Christopher K.I. and Barber, David. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–
1351, 1998.
Williams, Christopher K.I. and Rasmussen, Carl E. Gaussian processes for regression. In
Neural Information Processing Systems, pp. 514–520. MIT Press, 1996.