Semiparametric Regression Analysis via Infer.NET

Journal of Statistical Software
http://www.jstatsoft.org/

Jan Luts (TheSearchParty.com Pty Ltd.)
Shen S.J. Wang (University of Queensland)
John T. Ormerod (University of Sydney)
Matt P. Wand (University of Technology, Sydney)

Abstract
We provide several examples of Bayesian semiparametric regression analysis via the
Infer.NET package for approximate deterministic inference in graphical models. The examples are chosen to encompass a wide range of semiparametric regression situations.
Infer.NET is shown to produce accurate inference in comparison with Markov chain Monte Carlo via the BUGS package, but is considerably faster. Potentially, this contribution represents the start of a new era for semiparametric regression, in which large and complex analyses are performed via fast graphical models methodology and software developed mainly within the Machine Learning community.
Keywords: additive mixed models, expectation propagation, generalized additive models,
measurement error models, mean field variational Bayes, missing data models, penalized
splines, variational message passing.
Date of this draft: 17 JAN 2014
1. Introduction
Infer.NET (Minka et al. 2013) is a relatively new software package for performing approximate inference in large graphical models, using fast deterministic algorithms such as expectation propagation (Minka 2001) and variational message passing (Winn and Bishop 2005).
We demonstrate its application to Bayesian semiparametric regression. A variety of situations are covered: non-Gaussian response, longitudinal data, bivariate functional effects,
robustness, sparse signal penalties, missingness and measurement error.
The Infer.NET project is still in its early years and, at the time of this writing, has not progressed beyond beta releases. We anticipate continual updating and enhancement for many
years to come. In this article we are necessarily restricted to the capabilities of Infer.NET at
the time of preparation. All of our examples use Infer.NET 2.5, Beta 2, which was released
in April 2013.
The graphical models viewpoint of semiparametric regression (Wand 2009), and other advanced statistical techniques, has the attraction that various embellishments of the standard
models can be accommodated by enlarging the underlying directed acyclic graph. For example, nonparametric regression with a missing predictor data model involves the addition
of nodes and edges to the graph, corresponding to the missing data mechanism. Figure 4
of Faes et al. (2011) provides some specific illustrations. Marley and Wand (2010) exploited
the graphical models viewpoint of semiparametric regression to handle a wide range of nonstandard models via the Markov chain Monte Carlo inference engine implemented by BUGS
(Lunn et al. 2000). However, many of their examples take between several minutes and hours
to run. The inference engines provided by Infer.NET allow much faster fitting for many
common semiparametric regression models. On the other hand, BUGS is the more versatile
package and not all models that are treated in Marley and Wand (2010) are supported by
Infer.NET.
Semiparametric regression, summarized by Ruppert et al. (2003, 2009), is a large branch of
Statistics that includes nonparametric regression, generalized additive models, generalized
additive mixed models, curve-by-factor interaction models, wavelet regression and geoadditive models. Parametric models such as generalized linear mixed models are special cases
of semiparametric regression. Antecedent research for special cases such as nonparametric
regression was conducted in the second half of the 20th Century at a time when data sets
were smaller, computing power was lower and the Internet was either non-existent or in its
infancy. Now, as we approach the mid-2010s, semiparametric regression is continually being
challenged by the size, complexity and, in some applications, arrival speed of data sets requiring analysis. Implementation and computational speed are major limiting factors. This contribution represents the potential for a new era for semiparametric regression, tapping into 21st Century Machine Learning research on fast approximate inference in large graphical models (e.g. Minka 2001; Winn and Bishop 2005; Minka 2005; Minka and Winn 2008; Knowles and Minka 2011) and ensuing software development.
Section 2 lays down the definitions and notation needed to describe the models given in
later sections. Each of Sections 3–10 illustrates a different type of semiparametric regression
analysis via Infer.NET. All code is available as a web supplement to this article. For most of
the examples, we also compare the Infer.NET results with those produced by BUGS. Section 11 compares BUGS with Infer.NET in terms of versatility and computational speed. A
summary of our findings on semiparametric analysis via Infer.NET is given in Section 12.
An appendix gives a detailed description of using Infer.NET to fit the simple semiparametric
regression model of Section 3.
2. Preparatory infrastructure
The semiparametric regression examples rely on mathematical infrastructure and notation,
which we describe in this section.
2.1. Distributional notation

The density function of a random vector x is denoted by p(x). The notation for a set of independent random variables y_i (1 ≤ i ≤ n) with distribution D_i is y_i ∼ D_i, independently. Table 1 lists the distributions used in the examples and the parametrization of their density functions.

distribution    density function in x                                                              abbreviation
Normal          (2πσ^2)^{-1/2} exp{-(x - µ)^2/(2σ^2)};  σ > 0                                      N(µ, σ^2)
Laplace         (2σ)^{-1} exp(-|x - µ|/σ);  σ > 0                                                  Laplace(µ, σ^2)
t               Γ((ν+1)/2) / [√(πνσ^2) Γ(ν/2)] {1 + (x - µ)^2/(νσ^2)}^{-(ν+1)/2};  σ, ν > 0        t(µ, σ^2, ν)
Gamma           B^A x^{A-1} e^{-Bx} / Γ(A);  x > 0;  A, B > 0                                      Gamma(A, B)
Half-Cauchy     2σ / {π(x^2 + σ^2)};  x > 0;  σ > 0                                                Half-Cauchy(σ)

Table 1: Distributions used in the examples. The density function argument x and parameters range over R unless otherwise specified.
2.2. Standardization and default priors
In real data examples, the measurements are often recorded on several different scales. Therefore, all continuous variables are standardized prior to analysis using Infer.NET or Markov chain Monte Carlo (MCMC) in BUGS. This transformation makes the analyses scale-invariant and can also lead to better behavior of the MCMC.
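For a simple linear fit, the standardize/fit/back-transform cycle can be sketched as follows (plain NumPy rather than Infer.NET or BUGS; all variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 100, 200)                      # predictor on its raw scale
y = 2.0 + 0.03 * x + rng.normal(0, 0.5, 200)      # response on its raw scale

# Standardize both variables before fitting.
mx, sx = x.mean(), x.std()
my, sy = y.mean(), y.std()
xs, ys = (x - mx) / sx, (y - my) / sy

# Fit on the standardized scale (ordinary least squares through the origin,
# since centering makes the intercept zero).
b1s = np.sum(xs * ys) / np.sum(xs * xs)

# Back-transform the coefficients to the original scale.
b1 = b1s * sy / sx
b0 = my - b1 * mx
```

The back-transformed coefficients agree exactly with a fit performed directly on the raw scale, which is why the standardization is harmless when results are reported in the original units.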
Since we do not have prior knowledge about the model parameters in each of the examples, non-informative priors are used. The prior distributions for a fixed effects parameter vector β and a standard deviation parameter σ are

    β ∼ N(0, τ_β^{-1} I)   and   σ ∼ Half-Cauchy(A),

with default hyperparameters τ_β = 10^{-10} and A = 10^5.
This is consistent with the recommendations given in Gelman (2006) for achieving noninformativeness for variance parameters. Infer.NET and BUGS do not offer direct specification of Half-Cauchy distributions and therefore we use the result:
    if x | a ∼ Gamma(1/2, a) and a ∼ Gamma(1/2, 1/A^2), then x^{-1/2} ∼ Half-Cauchy(A).   (1)
This result allows for the imposition of a Half-Cauchy prior using only Gamma distribution
specifications. Both Infer.NET and BUGS support the Gamma distribution.
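Result (1) is easy to check by simulation (plain NumPy, independent of Infer.NET): the median of a Half-Cauchy(A) random variable is A, so the two-stage gamma construction should reproduce that median.

```python
import numpy as np

rng = np.random.default_rng(0)
A = 2.0
n = 500_000

# a ~ Gamma(1/2, rate = 1/A^2); NumPy's gamma() takes a scale, i.e. A^2.
a = rng.gamma(shape=0.5, scale=A**2, size=n)

# x | a ~ Gamma(1/2, rate = a), i.e. scale = 1/a.
x = rng.gamma(shape=0.5, scale=1.0 / a)

s = x ** (-0.5)          # by result (1), s ~ Half-Cauchy(A)
print(np.median(s))      # should be close to A
```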
2.3. Variational message passing and expectation propagation
Infer.NET has two inference engines, variational message passing (Winn and Bishop 2005) and
expectation propagation (Minka 2001), for performing fast deterministic approximate inference
in graphical models. Succinct summaries of variational message passing and expectation
propagation are provided in Appendices A and B of Minka and Winn (2008).
Generally speaking, variational message passing is more amenable to semiparametric regression than expectation propagation. It is a special case of mean field variational Bayes (e.g.
Wainwright and Jordan 2008). The essential idea of mean field variational Bayes is to approximate joint posterior density functions such as p(θ1 , θ2 , θ3 |D), where D denotes the observed
data, by product density forms such as
qθ1 (θ1 ) qθ2 (θ2 ) qθ3 (θ3 ),
qθ1 ,θ3 (θ1 , θ3 ) qθ2 (θ2 ) or qθ1 (θ1 ) qθ2 ,θ3 (θ2 , θ3 ).
(2)
The choice of the product density form is usually made by trading off tractability against minimal imposition of product structure. Once this choice is made, the optimal q-density functions are chosen to minimize the Kullback-Leibler divergence from the exact joint posterior density function. For example, if the second product form in (2) is chosen then the optimal density functions q*_{θ1,θ3}(θ1, θ3) and q*_{θ2}(θ2) are those that minimize
    ∫∫∫ q_{θ1,θ3}(θ1, θ3) q_{θ2}(θ2) log [ q_{θ1,θ3}(θ1, θ3) q_{θ2}(θ2) / p(θ1, θ2, θ3 | D) ] dθ1 dθ2 dθ3,   (3)
where the integrals range over the parameter spaces of θ1, θ2 and θ3. This minimization problem gives rise to an iterative scheme which, typically, has closed form updates and good convergence properties. A by-product of the iterations is a lower-bound approximation to the marginal likelihood p(D), which we denote by p̲(D).

Further details on, and several examples of, mean field variational Bayes are provided by Section 2.2 of Ormerod and Wand (2010). Each of these examples can also be expressed, equivalently, in terms of variational message passing.
In typical semiparametric regression models the subscripting on the q-density functions is
quite cumbersome. Hence, it is suppressed for the remainder of the article. For example,
q(θ1 , θ3 ) is taken to mean qθ1 ,θ3 (θ1 , θ3 ).
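To make the iterative scheme concrete, the following sketch (plain NumPy, independent of Infer.NET) applies mean field variational Bayes to the toy model y_i ∼ N(µ, τ^{-1}), µ ∼ N(0, s^2), τ ∼ Gamma(a, b), under the product restriction q(µ, τ) = q(µ) q(τ). The closed form updates below are the standard ones for this conjugate model; all variable names are ours, not from the article.

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(3.0, 1.0, size=500)   # synthetic data
n, s2 = y.size, 100.0                # s2: diffuse prior variance for mu
a, b = 0.01, 0.01                    # gamma prior on the precision tau

E_tau = 1.0                          # initial guess for E_q(tau)
for _ in range(50):
    # Update q(mu) = N(mu_q, sig2_q): prior combined with n observations.
    sig2_q = 1.0 / (1.0 / s2 + n * E_tau)
    mu_q = sig2_q * E_tau * y.sum()
    # Update q(tau) = Gamma(a_q, b_q): expected residual sum of squares
    # under q(mu) enters the rate parameter.
    a_q = a + 0.5 * n
    b_q = b + 0.5 * (np.sum((y - mu_q) ** 2) + n * sig2_q)
    E_tau = a_q / b_q
```

Each pass updates q(µ) and q(τ) in turn; with diffuse priors, µ_q settles near the sample mean and E_q(τ) near the reciprocal of the sample variance, and convergence takes only a handful of iterations.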
Expectation propagation is also based on product density restrictions such as (2), but differs
in its method of obtaining the optimal density functions. It works with a different version
of the Kullback-Leibler divergence than that given in (3) and, therefore, leads to different
iterative algorithms and approximating density functions.
2.4. Mixed model-based penalized splines
Mixed model-based penalized splines are a convenient way to model nonparametric functional relationships in semiparametric regression models, and are amenable to the hierarchical Bayesian structures supported by Infer.NET. The penalized spline of a regression function
f, with mixed model representation, takes the generic form

    f(x) = β_0 + β_1 x + Σ_{k=1}^{K} u_k z_k(x),   u_k ∼ N(0, τ_u^{-1}) independently.   (4)
Here z_1(·), ..., z_K(·) is a set of spline basis functions and τ_u controls the amount of penalization of the spline coefficients u_1, ..., u_K. Throughout this article, we use O'Sullivan splines for the z_k(·). Wand and Ormerod (2008) provide details on their construction. O'Sullivan splines lead to (4) being a low-rank version of smoothing splines, which is also used by the R function smooth.spline().
The simplest semiparametric regression model is the Gaussian response nonparametric regression model

    y_i ∼ N(f(x_i), σ_ε^2) independently,   1 ≤ i ≤ n,   (5)

where (x_i, y_i), 1 ≤ i ≤ n, are pairs of measurements on continuous predictor and response variables. Mixed model-based penalized splines give rise to hierarchical Bayesian models for (5) such as
    y_i | β_0, β_1, u_1, ..., u_K, τ_ε ∼ N(β_0 + β_1 x_i + Σ_{k=1}^{K} u_k z_k(x_i), τ_ε^{-1}) independently,

    u_k | τ_u ∼ N(0, τ_u^{-1}) independently,   β_0, β_1 ∼ N(0, τ_β^{-1}) independently,

    τ_u^{-1/2} ∼ Half-Cauchy(A_u),   τ_ε^{-1/2} ∼ Half-Cauchy(A_ε).   (6)
Such models allow nonparametric regression to be performed using Bayesian inference engines such as BUGS and Infer.NET. Our decision to work with precision parameters, rather
than the variance parameters (which are common in the semiparametric regression literature), is driven by the former being the standard parametrization in BUGS and Infer.NET.
It is convenient to express (6) using matrix notation. This entails putting

    y ≡ (y_1, ..., y_n)^T,   X ≡ [1  x_i]_{1 ≤ i ≤ n},   Z ≡ [z_1(x_i) ··· z_K(x_i)]_{1 ≤ i ≤ n},   (7)

and

    β ≡ (β_0, β_1)^T   and   u ≡ (u_1, ..., u_K)^T.   (8)
We then re-write (6) as

    y | β, u, τ_ε ∼ N(Xβ + Zu, τ_ε^{-1} I),   u | τ_u ∼ N(0, τ_u^{-1} I),   β ∼ N(0, τ_β^{-1} I),

    τ_u^{-1/2} ∼ Half-Cauchy(A_u),   τ_ε^{-1/2} ∼ Half-Cauchy(A_ε).
The effective degrees of freedom (edf) corresponding to (6) is defined to be

    edf(τ_u, τ_ε) ≡ tr( [C^T C + blockdiag{τ_β I_2, (τ_u/τ_ε) I}]^{-1} C^T C ),   (9)

where C ≡ [X Z], and is a scale-free measure of the amount of fitting being performed by the penalized splines. Further details on effective degrees of freedom for penalized splines are given in Section 3.13 of Ruppert et al. (2003).
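The trace in (9) can be evaluated directly once C ≡ [X Z] is formed. The sketch below (plain NumPy) uses a truncated-line basis z_k(x) = max(0, x − κ_k) as a simple stand-in for the O'Sullivan splines used in this article; the qualitative behavior, with the edf moving from 2 + K toward 2 as τ_u/τ_ε grows, is the same.

```python
import numpy as np

rng = np.random.default_rng(3)
n, K, tau_beta = 100, 5, 1e-10
x = np.sort(rng.uniform(0, 1, n))
knots = np.linspace(0.1, 0.9, K)

X = np.column_stack([np.ones(n), x])
Z = np.maximum(0.0, x[:, None] - knots[None, :])   # truncated-line stand-in basis
C = np.hstack([X, Z])                              # C = [X Z]

def edf(lam):
    """Effective degrees of freedom as in (9), with lam = tau_u / tau_eps."""
    D = np.diag(np.r_[tau_beta * np.ones(2), lam * np.ones(K)])
    return np.trace(np.linalg.solve(C.T @ C + D, C.T @ C))

print(edf(1e-8), edf(1e8))   # approximately 2 + K and approximately 2
```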
Extension to bivariate predictors
The bivariate predictor extension of (5) is

    y_i ∼ N(f(x_i), σ_ε^2) independently,   1 ≤ i ≤ n,   (10)

where the predictors x_i, 1 ≤ i ≤ n, are each 2 × 1 vectors. In many applications, the x_i correspond to geographical location but they could be measurements on any pair of predictors for which a bivariate mean function might be entertained. Mixed model-based penalized splines can handle this bivariate predictor case by extending (4) to

    f(x) = β_0 + β_1^T x + Σ_{k=1}^{K} u_k z_k(x),   u_k ∼ N(0, τ_u^{-1}) independently,   (11)
and setting the z_k to be appropriate bivariate spline basis functions. There are several options for doing this (e.g. Ruppert et al. 2009, Section 2.2). A relatively simple choice is described here and used in the examples. It is based on thin plate spline theory, and corresponds to Section 13.5 of Ruppert et al. (2003). The first step is to choose the number K of bivariate knots and their locations. We denote these 2 × 1 vectors by κ_1, ..., κ_K. Our default rule for choosing knot locations involves feeding the x_i and K into the clustering algorithm known as CLARA (Kaufman and Rousseeuw 1990) and setting the κ_k to be the cluster centers. Next, form the matrices

    X = [1  x_i^T]_{1 ≤ i ≤ n},

    Z_K = [ ||x_i − κ_k||^2 log ||x_i − κ_k|| ]_{1 ≤ i ≤ n, 1 ≤ k ≤ K}

and

    Ω = [ ||κ_k − κ_{k'}||^2 log ||κ_k − κ_{k'}|| ]_{1 ≤ k, k' ≤ K}.

Based on the singular value decomposition Ω = U diag(d) V^T, compute Ω^{1/2} = U diag(√d) V^T and then set Z = Z_K Ω^{-1/2}. Then

    z_k(x_i) = the (i, k) entry of Z

with u_k ∼ N(0, τ_u^{-1}) independently.
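The default bivariate basis construction translates almost line for line into code. In the sketch below (plain NumPy), the CLARA clustering step is replaced by a random subset of the x_i as knots for brevity; everything else follows the recipe above.

```python
import numpy as np

def tps_kernel(r):
    """r^2 log r, with the r = 0 limit set to 0."""
    out = np.zeros_like(r)
    nz = r > 0
    out[nz] = r[nz] ** 2 * np.log(r[nz])
    return out

rng = np.random.default_rng(4)
n, K = 200, 12
x = rng.uniform(size=(n, 2))                    # bivariate predictors
kappa = x[rng.choice(n, K, replace=False)]      # stand-in for CLARA centers

# Z_K[i, k] = ||x_i - kappa_k||^2 log ||x_i - kappa_k||, and similarly Omega.
ZK = tps_kernel(np.linalg.norm(x[:, None, :] - kappa[None, :, :], axis=2))
Omega = tps_kernel(np.linalg.norm(kappa[:, None, :] - kappa[None, :, :], axis=2))

# Omega^{1/2} via the singular value decomposition, then Z = Z_K Omega^{-1/2}.
U, d, Vt = np.linalg.svd(Omega)
Omega_half = U @ np.diag(np.sqrt(d)) @ Vt
Z = ZK @ np.linalg.inv(Omega_half)
```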
3. Simple semiparametric model
The first example involves the simple semiparametric regression model

    y_i = β_0 + β_1 x_{1i} + β_2 x_{2i} + Σ_{k=1}^{K} u_k z_k(x_{2i}) + ε_i,   ε_i ∼ N(0, τ_ε^{-1}) independently,   1 ≤ i ≤ n,   (12)
where z_1(·), ..., z_K(·) is a set of spline basis functions as described in Section 2.4. The corresponding Bayesian mixed model can then be represented by

    y | β, u, τ_ε ∼ N(Xβ + Zu, τ_ε^{-1} I),   u | τ_u ∼ N(0, τ_u^{-1} I),   β ∼ N(0, τ_β^{-1} I),

    τ_u^{-1/2} ∼ Half-Cauchy(A_u),   τ_ε^{-1/2} ∼ Half-Cauchy(A_ε),   (13)
where τ_β, A_u and A_ε are user-specified hyperparameters and

    y = (y_1, ..., y_n)^T,   β = (β_0, β_1, β_2)^T,   u = (u_1, ..., u_K)^T,

    X = [1  x_{1i}  x_{2i}]_{1 ≤ i ≤ n},   Z = [z_1(x_{2i}) ··· z_K(x_{2i})]_{1 ≤ i ≤ n}.

Infer.NET 2.5, Beta 2 does not support direct fitting of model (13) under the product restriction

    q(β, u, τ_u, τ_ε) = q(β, u) q(τ_u, τ_ε).   (14)
We get around this by employing the same trick as that described in Section 3.2 of Wang and Wand (2011). It entails the introduction of an auxiliary data vector a, of the same length as (β^T, u^T)^T. By setting the observed data for a equal to 0 and by assuming a very small number for κ, fitting the model in (15) provides essentially the same result as fitting the model in (13). The actual model implemented in Infer.NET is
implemented in Infer.NET is
y|β, u, τε ∼ N (Xβ +
β
u
Zu, τε−1 I),
a| β, u, τu ∼ N
∼ N (0, κ−1 I),
bu ∼ Gamma( 12 , 1/A2u ),
β
u
−1
τβ I
0
,
,
0
τu−1 I
τu | bu ∼ Gamma( 12 , bu ),
τε | bε ∼ Gamma( 21 , bε ),
(15)
bε ∼ Gamma( 21 , 1/A2ε ),
with the inputted a vector containing zeroes. While the full Infer.NET code is included in
the Appendix, we will confine discussion here to the key parts of the code that specify (15).
Specification of the prior distributions for τu and τε can be achieved using the following
Infer.NET code:
Variable<double> tauu = Variable.GammaFromShapeAndRate(0.5,
    Variable.GammaFromShapeAndRate(0.5,Math.Pow(Au,-2))).Named("tauu");
Variable<double> tauEps = Variable.GammaFromShapeAndRate(0.5,
    Variable.GammaFromShapeAndRate(0.5,Math.Pow(Aeps,-2))).Named("tauEps");   (16)
while the likelihood in (15) is specified by the following line:
y[index] = Variable.GaussianFromMeanAndPrecision(
    Variable.InnerProduct(betauWork,cvec[index]),tauEps);   (17)
The variable index represents a range of integers from 1 to n and enables loop-type structures. Finally, the inference engine is specified to be variational message passing via the
code:
InferenceEngine engine = new InferenceEngine();
engine.Algorithm = new VariationalMessagePassing();
engine.NumberOfIterations = nIterVB;   (18)
with nIterVB denoting the number of mean field variational Bayes iterations. Note that this Infer.NET code treats the coefficient vector (β^T, u^T)^T as a single entity, in keeping with product restriction (14).
Figures 1 and 2 summarize the results from Infer.NET fitting of (15) to a data set on the yields (g/plant) of 84 white Spanish onion crops at two locations: Purnong Landing and Virginia, South Australia. The response variable, y, is the logarithm of yield, whilst the predictors are the indicator of the location being Virginia (x_1) and the areal density of the plants (plants/m^2) (x_2). The hyperparameters were set at τ_β = 10^{-10}, A_ε = 10^5, A_u = 10^5 and κ = 10^{-10}, while the number of mean field variational Bayes iterations was fixed at 100.
As a benchmark, Bayesian inference via MCMC was performed. To this end, the following
BUGS program was used:
for(i in 1:n)
{
   mu[i] <- beta0 + beta1*x1[i] + beta2*x2[i] + inprod(u[],Z[i,])
   y[i] ~ dnorm(mu[i],tauEps)
}
for (k in 1:K)
{
   u[k] ~ dnorm(0,tauu)
}
beta0 ~ dnorm(0,1.0E-10) ; beta1 ~ dnorm(0,1.0E-10)
beta2 ~ dnorm(0,1.0E-10)
bu ~ dgamma(0.5,1.0E-10) ; tauu ~ dgamma(0.5,bu)
bEps ~ dgamma(0.5,1.0E-10) ; tauEps ~ dgamma(0.5,bEps)   (19)
A burn-in length of 50,000 was used, while 1 million samples were obtained from the posterior distributions. Finally, a thinning factor of 5 was used. The posterior densities for MCMC were produced based on kernel density estimation with plug-in bandwidth selection via the R package KernSmooth (Wand and Ripley 2010). Figure 1 displays the fitted regression lines and pointwise 95% credible intervals. The estimated regression lines and credible intervals from Infer.NET and MCMC fitting are highly similar. Figure 2 visualizes the approximate posterior density functions for β_1, τ_ε^{-1} and the effective degrees of freedom of the fit. The variational Bayes approximations for β_1 and τ_ε^{-1} are rather accurate. The approximate posterior density function of the effective degrees of freedom for variational Bayesian inference was obtained from Monte Carlo samples of size 1 million from the approximate posterior distributions of τ_ε^{-1} and τ_u^{-1}.
Figure 1: Onions data. Fitted regression line and pointwise 95% credible intervals for variational Bayesian inference by Infer.NET and MCMC.
4. Generalized additive model
The next example illustrates binary response variable regression through the model

    y_i | β, u ∼ Bernoulli(F({Xβ + Zu}_i)) independently,   1 ≤ i ≤ n,

    u | τ_u ∼ N(0, τ_u^{-1} I),   β ∼ N(0, τ_β^{-1} I),   τ_u^{-1/2} ∼ Half-Cauchy(A_u),   (20)
where F (·) denotes an inverse link function and X, Z, τβ and Au are defined as in the previous section. Typical choices of F (·) correspond to logistic regression and probit regression.
Figure 2: Variational Bayes approximate posterior density functions produced by Infer.NET
and MCMC posterior density functions of key model parameters for fitting a simple semiparametric model to the onions data set.
As before, introduction of the auxiliary variables a ≡ 0 enables Infer.NET fitting of the Bayesian mixed model

    y_i | β, u ∼ Bernoulli(F({Xβ + Zu}_i)) independently,   1 ≤ i ≤ n,

    a | β, u, τ_u ∼ N( (β^T, u^T)^T, blockdiag(τ_β^{-1} I, τ_u^{-1} I) ),   (β^T, u^T)^T ∼ N(0, κ^{-1} I),

    τ_u | b_u ∼ Gamma(1/2, b_u),   b_u ∼ Gamma(1/2, 1/A_u^2).   (21)
The likelihood for the logistic regression case is specified as follows:

VariableArray<bool> y = Variable.Array<bool>(index).Named("y");
y[index] = Variable.BernoulliFromLogOdds(
    Variable.InnerProduct(betauWork, cvec[index]));   (22)
while variational message passing is used for fitting purposes. Setting up the prior for τ_u and specifying the inference engine is done as in code chunks (16) and (18), respectively. For probit regression, the last line of (22) is replaced by
y[index] = Variable.IsPositive(Variable.GaussianFromMeanAndVariance(
    Variable.InnerProduct(betauWork, cvec[index]), 1));   (23)
and expectation propagation is used:

engine.Algorithm = new ExpectationPropagation();   (24)
Instead of the half-Cauchy prior in (20), a gamma prior with shape and rate equal to 2 is used for τ_u in the probit regression model:

Variable<double> tauU = Variable.GammaFromShapeAndRate(2,2);   (25)
Figure 3 visualizes the results for Infer.NET and MCMC fitting of a straightforward extension of model (21), with logit and probit link functions, to a breast cancer data set (Haberman 1976). This data set contains 306 cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer. The binary response variable represents whether the patient died within 5 years after the operation (died), while 3 predictor variables are used: age of the patient at the time of operation (age), year of operation (year) and number of positive axillary nodes detected (nodes). The actual model is

    died_i | β, u ∼ Bernoulli(F(β_0 + f_1(age_i) + f_2(year_i) + f_3(nodes_i))) independently,   1 ≤ i ≤ 306.

Note that all predictor variables were standardized prior to model fitting and the following hyperparameters were used: τ_β = 10^{-10}, A_u = 10^5, κ = 10^{-10}, and the number of variational Bayes iterations was 100. Figure 3 illustrates that there is good agreement between the results from Infer.NET and MCMC.
Figure 3: Breast cancer data. Fitted regression lines and pointwise 95% credible intervals for logit (a) and probit (b) regression via variational Bayesian inference by Infer.NET and MCMC.
5. Robust nonparametric regression based on the t-distribution
A popular model-based approach to robust regression is to model the response variable as having a t-distribution. Outliers occur with moderate probability for low values of the t-distribution's degrees of freedom parameter (Lange et al. 1989). More recently, Staudenmayer et al. (2009) proposed a penalized spline mixed model approach to nonparametric regression using the t-distribution.
The robust nonparametric regression model that we consider here is a Bayesian variant of
that treated by Staudenmayer et al. (2009):
    y_i | β_0, β_1, u, τ_ε, ν ∼ t(β_0 + β_1 x_i + Σ_{k=1}^{K} u_k z_k(x_i), τ_ε^{-1}, ν) independently,   1 ≤ i ≤ n,

    u | τ_u ∼ N(0, τ_u^{-1} I),   β ∼ N(0, τ_β^{-1} I),

    τ_u^{-1/2} ∼ Half-Cauchy(A_u),   τ_ε^{-1/2} ∼ Half-Cauchy(A_ε),   (26)

    p(ν) discrete on a finite set N.
Restricting the prior distribution of ν to be that of a discrete random variable allows Infer.NET fitting of the model using structured mean field variational Bayes (Saul and Jordan
1996). This extension of ordinary mean field variational Bayes is described in Section 3.1 of
Wand et al. (2011). Since variational message passing is a special case of mean field variational Bayes this extension also applies. Model (26) can be fitted through calls to Infer.NET
with ν fixed at each value in N . The results of each of these fits are combined afterwards.
Details are given below.
Another challenge concerning (26) is that Infer.NET does not support t-distribution specifications. We get around this by appealing to the result:

    if x | g ∼ N(µ, (gτ)^{-1}) and g ∼ Gamma(ν/2, ν/2), then x ∼ t(µ, τ^{-1}, ν).   (27)
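Result (27) is straightforward to verify by simulation (plain NumPy): for ν > 2 the t(µ, τ^{-1}, ν) distribution has variance τ^{-1} ν/(ν − 2), and the normal-over-gamma mixture below reproduces it.

```python
import numpy as np

rng = np.random.default_rng(5)
nu, mu, tau = 10.0, 0.0, 1.0
n = 1_000_000

# g ~ Gamma(nu/2, rate = nu/2), i.e. scale = 2/nu in NumPy's convention.
g = rng.gamma(shape=nu / 2, scale=2.0 / nu, size=n)

# x | g ~ N(mu, (g * tau)^{-1}); by (27), x ~ t(mu, 1/tau, nu).
x = rng.normal(mu, 1.0 / np.sqrt(g * tau))

print(x.var())   # should be close to nu / (nu - 2) for these settings
```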
As before, we use the a ≡ 0 auxiliary data trick described in Section 3 and the half-Cauchy representation (1). These lead to the following suite of models, for each fixed ν ∈ N, needing to be run in Infer.NET:

    y | β, u, τ_ε ∼ N(Xβ + Zu, τ_ε^{-1} diag(1/g)),

    a | β, u, τ_u ∼ N( (β^T, u^T)^T, blockdiag(τ_β^{-1} I, τ_u^{-1} I) ),   (β^T, u^T)^T ∼ N(0, κ^{-1} I),

    g_i | ν ∼ Gamma(ν/2, ν/2) independently,

    τ_u | b_u ∼ Gamma(1/2, b_u),   b_u ∼ Gamma(1/2, 1/A_u^2),

    τ_ε | b_ε ∼ Gamma(1/2, b_ε),   b_ε ∼ Gamma(1/2, 1/A_ε^2),   (28)

where the matrix notation of (7) and (8) is being used, g ≡ (g_1, ..., g_n)^T, and τ_β, A_u and A_ε are user-specified hyperparameters with default values τ_β = 10^{-10}, A_u = A_ε = 10^5. The prior for ν was set to be a uniform distribution over the atom set N, chosen to be 30 equally-spaced numbers between 0.05 and 10 inclusive.
After obtaining fits of (28) for each ν ∈ N, the approximate posterior densities are obtained from

    q(β, u) = Σ_{ν∈N} q(ν) q(β, u | ν),   q(τ_u) = Σ_{ν∈N} q(ν) q(τ_u | ν),   q(τ_ε) = Σ_{ν∈N} q(ν) q(τ_ε | ν),

    with q(ν) = p̲(ν | y) = p(ν) p̲(y | ν) / Σ_{ν'∈N} p(ν') p̲(y | ν'),

where p̲(y | ν) denotes the variational lower bound on the conditional likelihood p(y | ν).
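Given the per-ν log lower bounds log p(y|ν), forming q(ν) is a softmax over log p(ν) + log p(y|ν); subtracting the maximum before exponentiating keeps the computation numerically stable. A minimal sketch (plain NumPy), with made-up lower-bound values standing in for the Infer.NET output:

```python
import numpy as np

# Hypothetical log lower bounds log p(y | nu) over the grid of nu values.
nu_grid = np.linspace(0.05, 10, 30)
log_ml = -0.5 * (nu_grid - 3.0) ** 2 - 100.0   # made-up inputs for illustration

log_prior = np.full_like(nu_grid, -np.log(nu_grid.size))  # uniform p(nu)

# q(nu) proportional to p(nu) p(y | nu), computed stably on the log scale.
log_w = log_prior + log_ml
q_nu = np.exp(log_w - log_w.max())
q_nu /= q_nu.sum()

# Posterior quantities then follow as mixtures over nu, e.g. E(nu | y):
E_nu = np.sum(q_nu * nu_grid)
```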
Specification of the prior distribution for g is achieved using the following Infer.NET code:

VariableArray<double> g = Variable.Array<double>(index);
g[index] = Variable.GammaFromShapeAndRate(nu/2,nu/2).ForEach(index);   (29)
while the likelihood in (28) is specified by:

y[index] = Variable.GaussianFromMeanAndPrecision(
    Variable.InnerProduct(betauWork,cvec[index]),g[index]*tauEps);   (30)
The lower bound log p̲(y|ν) can be obtained by creating a mixture of the current model with an empty model in Infer.NET. The learned mixing weight is then equal to the marginal log-likelihood. Therefore, an auxiliary Bernoulli variable is set up:

Variable<bool> auxML = Variable.Bernoulli(0.5).Named("auxML");   (31)

The normal code for fitting the model in (28) is then enclosed with

IfBlock model = Variable.If(auxML);   (32)

and

model.CloseBlock();   (33)

Finally, the lower bound log p̲(y|ν) is obtained from:

double marginalLogLikelihood = engine.Infer<Bernoulli>(auxML).LogOdds;   (34)
Figure 4 presents the results of the structured mean field variational Bayes analysis using Infer.NET fitting of model (28) to a data set from a respiratory experiment conducted by Professor Russ Hauser at Harvard School of Public Health, Boston, USA. The data correspond to 60 measurements on one subject during two separate respiratory experiments. The response variable y_i represents the log of the adjusted time of exhalation for x_i equal to the time in seconds since exposure to air containing particulate matter. The adjusted time of exhalation is obtained by subtracting the average time of exhalation at baseline, prior to exposure to filtered air. Interest centers upon the mean response as a function of time. The predictor and response variable were both standardized prior to Infer.NET analysis. The following hyperparameters were chosen: τ_β = 10^{-10}, A_ε = 10^5, A_u = 10^5 and κ = 10^{-10}, while the number of variational Bayes iterations equaled 100. Bayesian inference via MCMC was performed as a benchmark.
Figure 4: Structured mean field variational Bayesian, based on Infer.NET, and MCMC fitting of the robust nonparametric regression model (26) to the respiratory experiment data. Left: Posterior mean and pointwise 95% credible sets for the regression function. Right: Approximate posterior probability function for the degrees of freedom parameter ν.

Figure 4 shows that the variational Bayes fit and pointwise 95% credible sets are close to the ones obtained using MCMC. Finally, the approximate posterior probability function and the posterior from MCMC for the degrees of freedom ν are compared. The Infer.NET and MCMC results coincide quite closely.
6. Semiparametric mixed model
Since semiparametric regression models based on penalized splines fit in the mixed model
framework, semiparametric longitudinal data analysis can be performed by fusion with classical mixed models (Ruppert et al. 2003). In this section we illustrate the use of Infer.NET for
fitting the class of semiparametric mixed models having the form:
ind.
yij |β, u, Ui , τε ∼ N
β0 + β T xij + Ui +
K
X
!
uk zk (sij ), τε−1
k=1
Ui | τU ∼ N (0, τU−1 ),
ind.
1 ≤ j ≤ ni ,
,
(35)
1 ≤ i ≤ m,
for analysis of the longitudinal data-sets such as that described in Bachrach et al. (1999). Here
xij is a vector of predictors that enter the model linearly and sij is another predictor that enters the model non-linearly via penalized splines. For each 1 ≤ i ≤ m, Ui denotes the random
intercept for the ith subject. The corresponding Bayesian mixed model is represented by

    y | β, u, τ_ε ∼ N(Xβ + Zu, τ_ε^{-1} I),   β ∼ N(0, τ_β^{-1} I),

    u | τ_U, τ_u ∼ N(0, blockdiag(τ_U^{-1} I, τ_u^{-1} I)),

    τ_u^{-1/2} ∼ Half-Cauchy(A_u),   τ_ε^{-1/2} ∼ Half-Cauchy(A_ε),   τ_U^{-1/2} ∼ Half-Cauchy(A_U),   (36)
where y, β and X are defined in a similar manner as in the previous sections and τ_β, A_u, A_ε and A_U are user-specified hyperparameters. Introduction of the random intercepts results in

    u = (U_1, ..., U_m, u_1, ..., u_K)^T

and

            [ 1 ··· 0   z_1(s_11)     ···   z_K(s_11)    ]
            [ ⋮      ⋮       ⋮                  ⋮        ]
            [ 1 ··· 0   z_1(s_1n_1)   ···   z_K(s_1n_1)  ]
    Z  =    [ ⋮      ⋮       ⋮                  ⋮        ]
            [ 0 ··· 1   z_1(s_m1)     ···   z_K(s_m1)    ]
            [ ⋮      ⋮       ⋮                  ⋮        ]
            [ 0 ··· 1   z_1(s_mn_m)   ···   z_K(s_mn_m)  ]
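The Z matrix above can be assembled mechanically from a vector of subject identifiers. A sketch (plain NumPy, with a truncated-line stand-in for the spline basis; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
m, K = 5, 4                                   # subjects and spline basis size
ni = rng.integers(2, 5, size=m)               # observations per subject
subject = np.repeat(np.arange(m), ni)         # subject id for each row
s = rng.uniform(0, 1, size=subject.size)      # the nonlinear predictor s_ij

# Left block: random-intercept indicators; right block: spline basis.
E = np.eye(m)[subject]                        # one-hot subject indicators
knots = np.linspace(0.2, 0.8, K)
Zspl = np.maximum(0.0, s[:, None] - knots[None, :])
Z = np.hstack([E, Zspl])                      # same layout as the display
```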
We again use the auxiliary data vector a ≡ 0 to allow direct Infer.NET fitting of the Bayesian longitudinal penalized spline model

    y | β, u, τ_ε ∼ N(Xβ + Zu, τ_ε^{-1} I),

    a | β, u, τ_U, τ_u ∼ N( (β^T, u^T)^T, blockdiag(τ_β^{-1} I, τ_U^{-1} I, τ_u^{-1} I) ),   (β^T, u^T)^T ∼ N(0, κ^{-1} I),

    τ_u | b_u ∼ Gamma(1/2, b_u),   b_u ∼ Gamma(1/2, 1/A_u^2),

    τ_ε | b_ε ∼ Gamma(1/2, b_ε),   b_ε ∼ Gamma(1/2, 1/A_ε^2),

    τ_U | b_U ∼ Gamma(1/2, b_U),   b_U ∼ Gamma(1/2, 1/A_U^2).   (37)
Specification of the prior distribution for τU can be achieved in Infer.NET in a similar manner
as prior specification in code chunk (16).
Figure 5 shows the Infer.NET fits of (37) to the spinal bone mineral density data (Bachrach et al. 1999). A population of 230 female subjects aged between 8 and 27 was followed over time and each subject contributed either one, two, three, or four spinal bone mineral density measurements. Age enters the model non-linearly and corresponds to s_ij in (35). Data on ethnicity are available and the entries of x_ij correspond to the indicator variables for Black (x_1ij), Hispanic (x_2ij) and White (x_3ij), with Asian ethnicity corresponding to the baseline. The following hyperparameters were chosen: τ_β = 10^{-10}, A_ε = 10^5, A_u = 10^5, A_U = 10^5 and κ = 10^{-10}, while the number of mean field variational Bayes iterations was set to 100.

Figure 5: Spinal bone mineral density data. Fitted regression line and pointwise 95% credible intervals based on mean field variational Bayesian inference by Infer.NET and MCMC.

Figure 5 suggests that there exists a statistically significant difference in mean spinal bone mineral density between Asian and Black subjects. This difference is confirmed by the approximate posterior density functions in Figure 6. No statistically significant difference is found between Asian and Hispanic subjects or between Asian and White subjects.
Figure 6: Variational Bayes approximate posterior density functions produced by Infer.NET
of ethnic group parameters for fitting a simple semiparametric mixed model to the spinal
bone mineral density data set.
7. Geoadditive model
We turn our attention to geostatistical data, for which the response variable is geographically
referenced and the extension of generalized additive models, known as geoadditive models
(Kammann and Wand 2003), applies. Such models allow for a pair of continuous predictors,
typically geographical location, to have a bivariate functional impact on the mean response.
An example geoadditive model is
yi ∼ N( f1(x1i) + f2(x2i) + fgeo(x3i, x4i), τε^{-1} ),   1 ≤ i ≤ n.   (38)
It can be handled using the spline basis described in Section 2.4 as follows:
yi | β, u1, u2, ugeo, τε ~ind. N( β0 + β1 x1i + β2 x2i + β3 x3i + β4 x4i + Σ_{k=1}^{K1} u1k z1k(x1i) + Σ_{k=1}^{K2} u2k z2k(x2i) + Σ_{k=1}^{Kgeo} ugeo_k zgeo_k(x3i, x4i), τε^{-1} ),   1 ≤ i ≤ n,   (39)
where the z1k and z2k are univariate spline basis functions as used in each of this article’s
previous examples and the zkgeo are the bivariate spline basis functions, described in Section
2.4. The corresponding Bayesian mixed model is represented by
y | β, u, τε ∼ N(Xβ + Zu, τε^{-1} I),   β ∼ N(0, τβ^{-1} I),

u ≡ [u1; u2; ugeo] | τu1, τu2, τgeo ∼ N( 0, blockdiag(τu1^{-1} I, τu2^{-1} I, τgeo^{-1} I) ),

τu1^{-1/2}, τu2^{-1/2} ∼ Half-Cauchy(Au),   τε^{-1/2} ∼ Half-Cauchy(Aε),   τgeo^{-1/2} ∼ Half-Cauchy(Ageo),   (40)
where y, β, u and X are defined in a similar manner as in the previous sections and τβ , Au ,
Aε and Ageo are user-specified hyperparameters. The Z matrix is

Z = [ z11(x11) ⋯ z1K1(x11)   z21(x21) ⋯ z2K2(x21)   zgeo_1(x31, x41) ⋯ zgeo_Kgeo(x31, x41)
        ⋮            ⋮            ⋮            ⋮              ⋮                ⋮
      z11(x1n) ⋯ z1K1(x1n)   z21(x2n) ⋯ z2K2(x2n)   zgeo_1(x3n, x4n) ⋯ zgeo_Kgeo(x3n, x4n) ].
Infer.NET can be used to fit the Bayesian geoadditive model
y | β, u, τε ∼ N(Xβ + Zu, τε^{-1} I),   [β; u] ∼ N(0, κ^{-1} I),

a | β, u, τu1, τu2, τgeo ∼ N( [β; u], blockdiag(τβ^{-1} I, τu1^{-1} I, τu2^{-1} I, τgeo^{-1} I) ),

τu1 | bu1 ∼ Gamma(½, bu1),   bu1 ∼ Gamma(½, 1/Au1²),
τu2 | bu2 ∼ Gamma(½, bu2),   bu2 ∼ Gamma(½, 1/Au2²),
τε | bε ∼ Gamma(½, bε),   bε ∼ Gamma(½, 1/Aε²),
τgeo | bgeo ∼ Gamma(½, bgeo),   bgeo ∼ Gamma(½, 1/Ageo²),
where a ≡ 0 as in all previous examples.
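The Gamma–Gamma hierarchies above encode the Half-Cauchy priors of (40) via the auxiliary-variable result used throughout the paper. The following seeded Monte Carlo sketch (not part of the paper's code) checks this numerically for a generic precision τ with τ | b ∼ Gamma(½, b) and b ∼ Gamma(½, 1/A²): the draws of τ^{-1/2} should follow a Half-Cauchy(A) distribution.

```python
import numpy as np

rng = np.random.default_rng(1)
A = 1.0
n = 200_000
# b ~ Gamma(1/2, rate 1/A^2), i.e., scale A^2
b = rng.gamma(shape=0.5, scale=A**2, size=n)
# tau | b ~ Gamma(1/2, rate b), i.e., scale 1/b
tau = rng.gamma(shape=0.5, scale=1.0 / b)
sigma = tau ** -0.5                       # should be Half-Cauchy(A) distributed

# Kolmogorov-Smirnov distance to the Half-Cauchy(A) CDF, 2/pi * arctan(s/A)
s = np.sort(sigma)
F = 2.0 / np.pi * np.arctan(s / A)
emp = np.arange(1, n + 1) / n
ks = np.max(np.abs(F - emp))
print(round(ks, 4))
```

A small KS distance (of order n^{-1/2}) indicates agreement with the Half-Cauchy(A) law.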
We illustrate geoadditive model fitting in Infer.NET using data on the prices of 37,676 residential properties sold in Sydney, Australia, during 2001. The data were assembled as part of an unpublished study by A. Chernih and M. Sherris at the University of New South Wales, Australia. The response variable is the logarithm of sale price in Australian dollars. Apart from geographical location, several predictor variables are available. For this example we selected weekly income in Australian dollars, distance from the coastline in kilometres, particulate matter 10 level and nitrogen dioxide level. The model is a straightforward extension of the geoadditive model conveyed by (38)–(40). All predictor variables and the dependent variable were standardized before model fitting. The hyperparameters were set to τβ = 10−10, κ = 10−10 and Aε = Au1 = · · · = Au4 = Ageo = 105, while the number of variational message passing iterations was 100.
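As an aside for readers wishing to experiment outside Infer.NET, the assembly of the design matrix Z from its three blocks can be sketched in a few lines of numpy. The basis functions below (truncated lines and radial distances) are simple stand-ins, not the O'Sullivan spline bases of Section 2.4, and all dimensions are illustrative.

```python
import numpy as np

def make_Z(x1, x2, x34, knots1, knots2, knots_geo):
    # Z = [Z1 | Z2 | Zgeo] as in (39); truncated-line and radial-distance
    # stand-ins replace the O'Sullivan spline bases of Section 2.4.
    Z1 = np.maximum(0.0, x1[:, None] - knots1[None, :])            # n x K1
    Z2 = np.maximum(0.0, x2[:, None] - knots2[None, :])            # n x K2
    Zgeo = np.linalg.norm(x34[:, None, :] - knots_geo[None, :, :], axis=2)
    return np.hstack([Z1, Z2, Zgeo])                               # n x (K1 + K2 + Kgeo)

rng = np.random.default_rng(0)
n = 5
Z = make_Z(rng.uniform(size=n), rng.uniform(size=n), rng.uniform(size=(n, 2)),
           np.array([0.25, 0.5, 0.75]), np.array([0.5]), rng.uniform(size=(4, 2)))
print(Z.shape)
```

With K1 = 3, K2 = 1 and Kgeo = 4 knots, the resulting Z has 8 columns, mirroring the block layout displayed above.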
Figure 7 summarizes the housing prices for the Sydney metropolitan area based on the fitted
Infer.NET model, while fixing the four predictor variables at their mean levels. This result
clearly shows that sale prices are higher in Sydney's eastern and northern suburbs and markedly lower in Sydney's western suburbs.
Figure 8 visualizes the fitted regression line and pointwise 95% credible intervals for income,
distance from the coastline, particulate matter 10 level and nitrogen dioxide level at a fixed
geographical location (longitude = 151.10◦ , latitude = −33.91◦ ). As expected, a higher income is associated with a higher sale price and houses close to the coastline are more expensive. The sale price is lower for a higher particulate matter 10 level, while a lower nitrogen
dioxide level is generally associated with a lower sale price.
Figure 7: Sydney real estate data. Estimated housing prices for the Sydney metropolitan area
for variational Bayesian inference by Infer.NET.
8. Bayesian lasso regression
A Bayesian approach to lasso (least absolute shrinkage and selection operator) regression was proposed by Park and Casella (2008). For linear regression, the lasso approach essentially involves replacing

βj | τβ ~ind. N(0, τβ^{-1})   by   βj | τβ ~ind. Laplace(0, τβ^{-1}),   (41)

where βj is the coefficient of the jth predictor variable. Replacement (41) corresponds to the form of the penalty changing from

λ Σ_{j=1}^{p} βj²   to   λ Σ_{j=1}^{p} |βj|.
Figure 8: Sydney real estate data. Fitted regression line and pointwise 95% credible intervals
for variational Bayesian inference by Infer.NET.
For frequentist lasso regression (Tibshirani 1996) the latter penalty produces sparse regression fits, in that some of the estimated coefficients become exactly zero. In the Bayesian case the Bayes estimates do not provide exact annihilation of coefficients, but produce approximate sparseness (Park and Casella 2008).
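The effect of this change of penalty can be seen in the closed-form estimates for a single coefficient under an orthonormal design, a standard textbook illustration (not from the paper): the squared penalty shrinks proportionally and never yields exact zeros, while the absolute-value penalty soft-thresholds and sets small coefficients exactly to zero.

```python
import numpy as np

def ridge(z, lam):
    # argmin_b 0.5*(z - b)^2 + lam*b^2 : proportional shrinkage, never exactly 0
    return z / (1.0 + 2.0 * lam)

def lasso(z, lam):
    # argmin_b 0.5*(z - b)^2 + lam*|b| : soft-thresholding, exact zeros
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([-2.0, -0.3, 0.1, 0.8, 3.0])   # least-squares coefficients
print(ridge(z, 0.5))    # all entries remain nonzero
print(lasso(z, 0.5))    # small entries are set exactly to zero
```

This is the frequentist picture; as noted above, the Bayesian posterior means only approximate this sparseness.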
At this point we note that, during the years following the appearance of Park and Casella
(2008), several other proposals for sparse shrinkage of the βj appeared in the literature (e.g.
Armagan et al. 2013; Carvalho et al. 2010; Griffin and Brown 2011). Their accommodation in
Infer.NET could also be entertained. Here we restrict attention to Laplace-based shrinkage.
We commence with the goal of fitting the following model in Infer.NET:
y | β0, β, τε ∼ N(1β0 + Xβ, τε^{-1} I),   β0 ∼ N(0, τβ0^{-1}),

βj | τβ ~ind. Laplace(0, τβ^{-1}),   τβ^{-1/2} ∼ Half-Cauchy(Aβ),   τε^{-1/2} ∼ Half-Cauchy(Aε).   (42)
The current release of Infer.NET does not support the Laplace distribution, so we require an
auxiliary variable representation in a similar vein to what was used for the t-distribution in
Section 5. We first note the distributional statement
x | τ ∼ Laplace(0, τ^{-1})   if and only if   x | g ∼ N(0, g),   g | τ ∼ Gamma(1, ½τ),   (43)
which is exploited by Park and Casella (2008). However, auxiliary variable representation
(43) is also not supported by Infer.NET, due to violation of its conjugacy rules. We get around
this by working with the alternative auxiliary variable set-up:
x | c ∼ N(0, 1/c),   c | d ∼ Gamma(M, M d),   d | τ ∼ Gamma(1, ½τ),   (44)
which is supported by Infer.NET, and leads to good approximation to the Laplace(0, τ −1 )
distribution when M is large. For any given M > 0, the density function of x satisfying
auxiliary variable representation (44) is
p(x | τ; M) = ∫₀^∞ ∫₀^∞ (2π/c)^{-1/2} exp(−½cx²) · [(Md)^M / Γ(M)] c^{M−1} exp(−Mdc) · ½τ exp(−½dτ) dc dd

            = [τ^{1/2} Γ(M + ½) / (2Γ(M) M^{1/2})] 1F1(M + ½; ½; τx²/(4M)) − ½ τ |x| 1F1(M + 1; 3/2; τx²/(4M)),
where 1 F1 denotes the degenerate hypergeometric function (e.g. Gradshteyn and Ryzhik
1994). Figure 9 shows p(x|τ ; M ) for τ = 1 and M = 1, 10, 30. The approximation to the
Laplace(0, 1) density function is excellent for M as low as 10.
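As an independent numerical check (not part of the paper's code), note that integrating c out of (44) leaves x | d with a t distribution having 2M degrees of freedom and scale √d, the standard normal–Gamma mixture fact also used in the convergence argument; integrating d out numerically then reproduces the near-Laplace behaviour shown in Figure 9.

```python
import numpy as np
from scipy import integrate, stats

def p_approx(x, tau=1.0, M=30):
    """Density of x under (44): x|c ~ N(0, 1/c), c|d ~ Gamma(M, M d),
    d|tau ~ Gamma(1, tau/2). Integrating c out gives x|d ~ t with 2M
    degrees of freedom and scale sqrt(d); d is integrated out numerically."""
    f = lambda d: stats.t.pdf(x, df=2 * M, scale=np.sqrt(d)) * \
                  0.5 * tau * np.exp(-0.5 * tau * d)
    val, _ = integrate.quad(f, 0.0, np.inf)
    return val

for x in (0.0, 1.0, 2.0):
    print(x, round(p_approx(x), 4), round(0.5 * np.exp(-abs(x)), 4))
```

For M = 30 and τ = 1 the agreement with the exact Laplace(0, 1) density ½ exp(−|x|) is already very close, and the error shrinks as M grows.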
Figure 9: The exact Laplace(0, 1) density function p(x) = ½ exp(−|x|) and three approximations to it based on p(x | τ; M) for τ = 1 and M = 1, 10, 30.
The convergence in distribution of random variables having the density function p(x | τ; M) to a Laplace(0, τ^{-1}) random variable as M → ∞ may be established using (27) and the fact that the family of t distributions tends to a Normal distribution as the degrees of freedom parameter increases, and then mixing this limiting distribution with a Gamma(1, ½τ) variable.
In lieu of (42), the actual model fitted in Infer.NET is:
y | β0, β, τε ∼ N(1β0 + Xβ, τε^{-1} I),

a | β0, β, c1, …, cp ∼ N( [β0; β], blockdiag(τβ0^{-1}, diag_{1≤j≤p}(1/cj)) ),   [β0; β] ∼ N(0, κ^{-1} I),

cj | dj ~ind. Gamma(M, M dj),   dj | τβ ~ind. Gamma(1, ½τβ),

τβ | bβ ∼ Gamma(½, bβ),   bβ ∼ Gamma(½, 1/Aβ²),

τε | bε ∼ Gamma(½, bε),   bε ∼ Gamma(½, 1/Aε²).   (45)
We use κ = 10−10 , M = 100, τβ0 = 10−10 , Aβ = Aε = 105 .
Our illustration of (45) involves data from a diabetes study that was also used in Park and
Casella (2008). The sample size is n = 442 and the number of predictor variables is p = 10.
The response variable is a continuous index of disease progression one year after baseline.
A description of the predictor variables is given in Efron et al. (2004). Their abbreviations, used in Park and Casella (2008), are: (1) age, (2) sex, (3) bmi, (4) map, (5) tc, (6) ldl, (7) hdl, (8) tch, (9) ltg and (10) glu.
Figure 10 shows the fits obtained from both Infer.NET and MCMC. We see that the correspondence is very good, especially for the regression coefficients.
9. Measurement error model
In some situations variables are measured with error, i.e., the observed data are not the true values of those variables themselves, but contaminated versions of them. Examples of such situations include AIDS studies where CD4 counts are known to
be measured with errors (Wu 2002) or air pollution data (Bachrach et al. 1999). Carroll et al.
(2006) offers a comprehensive treatment of measurement error models. As described there,
failure to account for measurement error can result in biased and misleading conclusions.
Let (xi , yi ), 1 ≤ i ≤ n, be a set of predictor/response pairs that are modeled according to
yi | xi, β0, β1, τε ~ind. N(β0 + β1 xi, τε^{-1}),   1 ≤ i ≤ n.   (46)
However, instead of observing the xi s we observe
wi | xi ~ind. N(xi, τz^{-1}),   1 ≤ i ≤ n.   (47)
In other words, the predictors are measured with error with τz controlling the extent of the
contamination. In general τz can be estimated through validation data, where some of the xi
values are observed, or replication data, where replicates of the wi values are available.

Figure 10: Approximate posterior density functions of model parameters in fitting (45) to data from a diabetes study.

To simplify exposition we will assume that τz is known. Furthermore, we will assume that the xi's are Normally distributed:
xi | µx, τx ~ind. N(µx, τx^{-1}),   1 ≤ i ≤ n,   (48)
where µx and τx are additional parameters to be estimated.
Combining (46), (47) and (48), and including the aforementioned priors for variances, a Bayesian
measurement error model can be represented by
y | x, β, τε ∼ N(Xβ, τε^{-1} I),   w | x ∼ N(x, τz^{-1} I),   x | µx, τx ∼ N(µx1, τx^{-1} I),

β ∼ N(0, τβ^{-1} I),   µx ∼ N(0, τµ^{-1}),   τε^{-1/2} ∼ Half-Cauchy(Aε)   and   τx^{-1/2} ∼ Half-Cauchy(Ax),   (49)
where β = [β0 , β1 ], X = [1, x], τβ , τµ , Aε and Ax are user-specified constants and the vectors
y and w are observed. We set τβ = τµ = 10−10 and Aε = Ax = 105 .
Invoking (1), we arrive at the model
y | x, β, τε ∼ N(Xβ, τε^{-1} I),   w | x ∼ N(x, τz^{-1} I),   x | µx, τx ∼ N(µx1, τx^{-1} I),

β ∼ N(0, τβ^{-1} I),   µx ∼ N(0, τµ^{-1}),

τε | bε ∼ Gamma(½, 1/bε),   bε ∼ Gamma(½, 1/Aε²),   τx | bx ∼ Gamma(½, 1/bx)   and   bx ∼ Gamma(½, 1/Ax²).   (50)
Infer.NET 2.5, Beta 2 does not appear to be able to fit the model (50) under the product restriction

q(β, τε, bε, x, µx, τx, bx) = q(β0, β1) q(µx) q(x) q(τε, τx) q(bε, bx).

However, Infer.NET was able to fit this model with

q(β, τε, bε, x, µx, τx, bx) = q(β0) q(β1) q(µx) q(x) q(τε, τx) q(bε, bx).   (51)
Unfortunately, fitting under restriction (51) leads to poor accuracy. One possible remedy
involves the centering transformation:
w̃i = wi − w̄.
Under this transformation, and with diffuse priors, the marginal posterior distributions of
β0 and β1 are nearly independent.
Let q̃(β0), q̃(β1), q̃(µx), q̃(x), q̃(τε, τx) and q̃(bε, bx) be the optimal values of q(β0), q(β1), q(µx), q(x), q(τε, τx) and q(bε, bx) when the transformed w̃i's are used in place of the wi's. This corresponds to the transformation on the X matrix

X̃ = XR   where   R = [ 1  −w̄ ; 0  1 ].
Suppose that

q̃(β0) ∼ N(µ̃q(β0), σ̃²q(β0)),   q̃(β1) ∼ N(µ̃q(β1), σ̃²q(β1)),
q̃(µx) ∼ N(µ̃q(µx), σ̃²q(µx))   and   q̃(xi) ∼ N(µ̃q(xi), σ̃²q(xi)),   1 ≤ i ≤ n.

Then we can back-transform approximately using

q(β0, β1) ∼ N(µq(β), Σq(β)),   q(µx) ∼ N(µq(µx), σ²q(µx))   and   q(xi) ∼ N(µq(xi), σ²q(xi)),   1 ≤ i ≤ n,

where

µq(β) = R [ µ̃q(β0) ; µ̃q(β1) ],   Σq(β) = R [ σ̃²q(β0)  0 ; 0  σ̃²q(β1) ] Rᵀ,

µq(µx) = µ̃q(µx) + w̄,   σ²q(µx) = σ̃²q(µx),

µq(xi) = µ̃q(xi) + w̄,   σ²q(xi) = σ̃²q(xi),   1 ≤ i ≤ n,

q(τε, τx) = q̃(τε, τx)   and   q(bε, bx) = q̃(bε, bx).
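The back-transformation is a routine linear-algebra step; the following sketch (not from the paper) applies it with hypothetical values of w̄ and of the variational parameters from the centred fit, using the transformation matrix R defined above. Note that the slope is unchanged while the intercept is shifted by −w̄ times the slope.

```python
import numpy as np

wbar = 3.2                                   # hypothetical sample mean of the w_i's
R = np.array([[1.0, -wbar], [0.0, 1.0]])

mu_tilde = np.array([0.8, 0.7])              # hypothetical (mu~_q(beta0), mu~_q(beta1))
sig2_tilde = np.array([0.04, 0.01])          # hypothetical variances from the centred fit

mu_q_beta = R @ mu_tilde                     # posterior mean of (beta0, beta1)
Sigma_q_beta = R @ np.diag(sig2_tilde) @ R.T # off-diagonals restore the correlation
print(mu_q_beta)
print(Sigma_q_beta)
```

The diagonal covariance of the centred fit becomes a full 2 × 2 covariance on the original scale, which is exactly the dependence that restriction (51) could not capture directly.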
To illustrate the use of Infer.NET we simulated data according to
xi ~ind. N(1/2, 1/36),   wi | xi ~ind. N(xi, 1/100)   and   yi | xi ~ind. N(0.3 + 0.7xi, 1/16)   (52)
for 1 ≤ i ≤ n with n = 50. Results from a single simulation are illustrated in Figure 11.
Similar results were obtained from other simulated data sets. From Figure 11 we see reasonable agreement of posterior density estimates produced by Infer.NET and MCMC for all
parameters. Extension to the case where the mean of the yi's is modeled nonparametrically
is covered in Pham et al. (2013). However, the current version of Infer.NET does not support
models of this type. Such versatility limitations of Infer.NET are discussed in Section 11.1.
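To see why the measurement error model matters, the following seeded sketch (an illustration, not the paper's fit) simulates data according to (52) with a large n and compares least-squares slopes: regressing y on the contaminated w attenuates the slope by the reliability ratio (1/36)/(1/36 + 1/100) ≈ 0.74, while regressing on the true x recovers β1 ≈ 0.7.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000                                   # large n makes the bias obvious
x = rng.normal(0.5, np.sqrt(1 / 36), n)      # x_i ~ N(1/2, 1/36)
w = x + rng.normal(0.0, np.sqrt(1 / 100), n) # w_i | x_i ~ N(x_i, 1/100)
y = 0.3 + 0.7 * x + rng.normal(0.0, np.sqrt(1 / 16), n)

slope = lambda a, b: np.polyfit(a, b, 1)[0]
print(round(slope(x, y), 3))   # close to the true 0.7
print(round(slope(w, y), 3))   # attenuated towards 0.7 * (1/36)/(1/36 + 1/100)
```

This is the bias that naive regression on w incurs and that the Bayesian measurement error model (49)–(50) corrects.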
Figure 11: Variational Bayes approximate posterior density functions produced by Infer.NET
for simulated data (52).
10. Missing predictor model
Missing data is a commonly occurring issue in statistical problems. Naïve methods for dealing with this issue, such as ignoring samples containing missing values, can be demonstrably optimistic (i.e., standard errors are deflated), biased or simply an inefficient use of the data. Little and Rubin (2002) contains a detailed discussion of the various issues at stake.
Consider a simple linear regression model with response yi and predictor xi:

yi | xi, β0, β1, τε ~ind. N(β0 + β1 xi, τε^{-1}),   1 ≤ i ≤ n.   (53)
Suppose that some of the xi's are missing and let ri be an indicator of xi being observed. Specifically,

ri = 1 if xi is observed,   ri = 0 if xi is missing,   for 1 ≤ i ≤ n.
A missing data model contains at least two components:

• A model for the underlying distribution of the variable that is subject to missingness. We will use

xi | µx, τx ~ind. N(µx, τx^{-1}),   1 ≤ i ≤ n.   (54)
• A model for the missing data mechanism, which models the probability distribution
of ri .
As detailed in Faes et al. (2011), there are several possible models, with varying degrees of
sophistication, for the missing data mechanism. Here the simplest such model, known as
missing completely at random, is used:
y | x, β, τε ∼ N(Xβ, τε^{-1} I),   x | µx, τx ∼ N(µx1, τx^{-1} I),   ri ~ind. Bernoulli(p),

β ∼ N(0, τβ^{-1} I),   µx ∼ N(0, τµ^{-1}),

τε | bε ∼ Gamma(½, 1/bε),   bε ∼ Gamma(½, 1/Aε²),   τx | bx ∼ Gamma(½, 1/bx)   and   bx ∼ Gamma(½, 1/Ax²),   (55)
where β = [β0 , β1 ], X = [1, x], τβ , τµ , Aε and Ax are user-specified hyperparameters. Again,
we will use τβ = τµ = 10−10 and Aε = Ax = 105 .
The simulated data that we use in our illustration of Infer.NET involves the following sample
size and ‘true values’:
n = 50,   β0 = 0.3,   β1 = 0.7,   µx = 0.5,   τx = 36   and   p = 0.75.   (56)
Putting p = 0.75 implies that, on average, 25% of the predictor values are missing completely
at random.
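The set-up in (53)–(56) can be sketched as follows (an illustration, not the paper's fit); the error precision τε = 16 is borrowed from the simulation in (52), since (56) does not specify it. Under missingness completely at random, a complete-case least-squares fit remains unbiased but discards about a quarter of the data.

```python
import numpy as np

rng = np.random.default_rng(4)
n, beta0, beta1, mu_x, tau_x, p = 50, 0.3, 0.7, 0.5, 36.0, 0.75
x = rng.normal(mu_x, tau_x ** -0.5, n)
y = beta0 + beta1 * x + rng.normal(0.0, 0.25, n)  # tau_eps = 16 assumed, as in (52)

r = rng.binomial(1, p, n)              # r_i = 1 when x_i is observed (MCAR)
x_obs, y_obs = x[r == 1], y[r == 1]
print(n - r.sum(), "predictors missing")
b1, b0 = np.polyfit(x_obs, y_obs, 1)   # complete-case least squares
print(round(b0, 2), round(b1, 2))
```

The Bayesian fit of (55) instead treats the missing xi's as unknowns, which is what the factorization below has to accommodate.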
Infer.NET 2.5, Beta 2 is not able to fit the model (55) under the product restriction

q(β, τε, bε, xmis, µx, τx, bx) = q(β0, β1) q(µx) q(xmis) q(τε, τx) q(bε, bx),   (57)

where xmis denotes the vector of missing predictors, but is able to handle

q(β, τε, bε, xmis, µx, τx, bx) = q(β0) q(β1) q(µx) q(xmis) q(τε, τx) q(bε, bx),   (58)

which was used in our illustration of Infer.NET for such models.
We also used BUGS to perform Bayesian inference via MCMC as a benchmark for comparison of results. For similar reasons as in the previous section, fitting under restriction (58) leads to poor results. Instead we perform the centering transformation

x̃i = xi − x̄obs,

where x̄obs is the mean of the observed xi values. This transformation makes the marginal posterior distributions of β0 and β1 nearly independent.
Let q̃(β0), q̃(β1), q̃(µx), q̃(xmis), q̃(τε, τx) and q̃(bε, bx) be the optimal values of q(β0), q(β1), q(µx), q(xmis), q(τε, τx) and q(bε, bx) when the transformed x̃i's are used in place of the xi's. This corresponds to the transformation on the X matrix

X̃ = XR   where   R = [ 1  −x̄obs ; 0  1 ].
Suppose that

q̃(β0) ∼ N(µ̃q(β0), σ̃²q(β0)),   q̃(β1) ∼ N(µ̃q(β1), σ̃²q(β1)),
q̃(µx) ∼ N(µ̃q(µx), σ̃²q(µx))   and   q̃(xmis,i) ∼ N(µ̃q(xmis,i), σ̃²q(xmis,i)),   1 ≤ i ≤ nmis.

Then we back-transform approximately using

q(β0, β1) ∼ N(µq(β), Σq(β)),   q(µx) ∼ N(µq(µx), σ²q(µx))   and   q(xmis,i) ∼ N(µq(xmis,i), σ²q(xmis,i)),   1 ≤ i ≤ nmis,

where

µq(β) = R [ µ̃q(β0) ; µ̃q(β1) ],   Σq(β) = R [ σ̃²q(β0)  0 ; 0  σ̃²q(β1) ] Rᵀ,

µq(µx) = µ̃q(µx) + x̄obs,   σ²q(µx) = σ̃²q(µx),

µq(xmis,i) = µ̃q(xmis,i) + x̄obs,   σ²q(xmis,i) = σ̃²q(xmis,i),   1 ≤ i ≤ nmis,

q(τε, τx) = q̃(τε, τx)   and   q(bε, bx) = q̃(bε, bx).
Results from a single simulation are illustrated in Figure 12. Similar results were obtained
from other simulated data sets. From Figure 12 we see reasonable agreement of posterior
density estimates produced by Infer.NET and MCMC for all parameters. Extension to the
case where the mean of y is modeled nonparametrically is covered in Faes et al. (2011). However, the current version of Infer.NET does not support models of this type.
Figure 12: Variational Bayes approximate posterior density functions produced by Infer.NET for data simulated according to (56).
11. Comparison with BUGS
We now make some comparisons between Infer.NET and BUGS in the context of Bayesian
semiparametric regression.
11.1. Versatility comparison
These comments mainly echo those given in Section 5 of Wang and Wand (2011), so are quite
brief. BUGS is much more versatile than Infer.NET, with the latter subject to restrictions such
as standard distributional forms, conjugacy rules and not being able to handle models such
as (13) without the trick manifest in (15). The measurement error and missing data regression examples given in Sections 9 and 10 are feasible in Infer.NET for parametric regression,
but not for semiparametric regression such as the examples in Sections 4, 5 and 8 of Marley
and Wand (2010). Nevertheless, as demonstrated in this article, there is a wide variety of
semiparametric regression models that can be handled by Infer.NET.
11.2. Accuracy comparison
In general, BUGS can always be more accurate than Infer.NET since MCMC suffers only from Monte Carlo error, which can be made arbitrarily small via larger Monte Carlo sample sizes. The
Infer.NET inference engines, expectation propagation and variational message passing, have
inherent approximation errors that cannot be completely eliminated. However, our examples show that the accuracy of Infer.NET is quite good for a wide variety of semiparametric
regression models.
11.3. Timing comparison
Table 2 gives some indication of the relative computing times for the examples in the paper. In this table we report the average elapsed computing times (with standard errors)
over 100 runs with the number of variational Bayes or expectation propagation iterations
set to 100 and the MCMC sample sizes set at 10000. This was sufficient for convergence in
these particular examples. All examples were run on the third author’s laptop computer (64
bit Windows 8 Intel i7-4930MX central processing unit at 3GHz with 32GB of random access memory). We concede that comparison of deterministic and Monte Carlo algorithms is
fraught with difficulties. However, convergence criteria aside, these times give an indication
of the performance in terms of the implementations of each of the algorithms for particular
models on a fast 2014 computer. Note that we did not fit the geoadditive model in Section 7
via BUGS due to the excessive time required.
section   semiparametric                            time in seconds   time in seconds
number    regression model                          for Infer.NET     for BUGS
-------------------------------------------------------------------------------------
   3      Simple semiparametric regression             1.55 (0.01)       8.60 (0.01)
   4      Generalized additive model (logistic)        3.09 (0.01)      81.30 (0.13)
   4      Generalized additive model (probit)         25.80 (0.07)      79.85 (0.04)
   5      Robust nonparametric regression             80.38 (0.03)      22.56 (0.06)
   6      Semiparametric mixed model                 666.73 (0.56)     477.59 (0.06)
   7      Geoadditive model                         1231.12 (3.05)         ——
   8      Bayesian lasso regression                    1.80 (0.01)     199.59 (0.30)
   9      Measurement error model                      1.50 (0.01)       7.05 (0.02)
  10      Missing predictor model                      1.53 (0.01)       5.51 (0.03)

Table 2: Average run times (standard errors) in seconds over 100 runs of the methods for each of the examples in the paper.
Table 2 reveals that Infer.NET offers considerable speed-ups compared with BUGS for most
of the examples. The exceptions are the robust nonparametric regression model of Section
5 and the semiparametric mixed model of Section 6. The slowness of the robust nonparametric regression fit is mainly explained by the multiple calls to Infer.NET, corresponding to
the degrees of freedom grid. The semiparametric mixed model is quite slow in the current
release of Infer.NET due to the full sparse design matrices being carried around in the calculations. Recently developed streamlined approaches to variational inference for longitudinal
and multilevel data analysis (Lee and Wand 2014) offer significant speed-ups. The geoadditive model of Section 7 is also slow to fit, with the large design matrices being a likely
reason.
12. Summary
Through several examples, the efficacy of Infer.NET for semiparametric regression analysis
has been demonstrated. Generally speaking, the fitting and inference are shown to be quite
fast and accurate. Models with very large design matrices are an exception and can take
significant amounts of time to fit using the current release of Infer.NET.
Our survey of Infer.NET in the context of semiparametric regression will aid future analyses
of this type via fast graphical models software, by building upon the examples presented
here. It may also influence future directions for the development of fast graphical models
methodology and software.
Appendix: Details of Infer.NET fitting of a simple semiparametric model
This section provides an extensive description of the Infer.NET program for semiparametric
regression analysis of the Onions data set based on model (15) in Section 3. The data are
first transformed as explained in Section 3 and these transformed versions are represented
by y and C = [X Z]. Thereafter, the text files K.txt, y.txt, Cmat.txt, sigsqBeta.txt,
Au.txt, Aeps.txt and nIterVB.txt are generated. The first three files contain the number of spline basis functions, y and C, respectively. The following three text files each contain a single number and represent the values for the hyperparameters: τβ = 10−10 , Aε = 105
and Au = 105 . The last file, nIterVB.txt, contains a single positive number (i.e., 100) that
specifies the number of mean field variational Bayes iterations. All these files were set up in
R.
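For readers without R, the scalar input files can equally be written from Python (the paper itself does this step in R). The file names follow the appendix; the value of K is illustrative, as the paper's value comes from the spline basis construction, and a temporary directory stands in for the working directory.

```python
import os
import tempfile

# Scalar inputs for the Infer.NET script; hyperparameter values as in Section 3.
params = {"tauBeta.txt": 1e-10, "Au.txt": 1e5, "Aeps.txt": 1e5,
          "nIterVB.txt": 100, "K.txt": 25}   # K = 25 is an assumed, illustrative value

outdir = tempfile.mkdtemp()
for fname, value in params.items():
    with open(os.path.join(outdir, fname), "w") as fh:
        fh.write(str(value) + "\n")
print(sorted(os.listdir(outdir)))
```

Each file contains a single number, matching what the C# script below expects to read.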
The actual Infer.NET code is a C# script, which can be run from Visual Studio 2010 or DOS within the Microsoft Windows operating system. The following paragraphs explain the different commands in the C# script. Firstly, all required C# and Infer.NET libraries are loaded:
using System;
using System.Collections.Generic;
using System.Text;
using System.IO;
using System.Reflection;
using MicrosoftResearch.Infer;
using MicrosoftResearch.Infer.Distributions;
using MicrosoftResearch.Infer.Maths;
using MicrosoftResearch.Infer.Models;
Next, the full path name of the directory containing the above-mentioned input files needs
to be specified:
string dir = "C:\\research\\inferNet\\onions\\";
The values for the hyperparameters, number of variational Bayes iterations and number of
spline basis functions are imported using the commands:
double tauBeta = new DataTable(dir+"tauBeta.txt").DoubleNumeric;
double Au = new DataTable(dir+"Au.txt").DoubleNumeric;
double Aeps = new DataTable(dir+"Aeps.txt").DoubleNumeric;
int nIterVB = new DataTable(dir+"nIterVB.txt").IntNumeric;
int K = new DataTable(dir+"K.txt").IntNumeric;
Reading in the data y and C proceeds as follows:
DataTable yTable = new DataTable(dir+"y.txt");
DataTable C = new DataTable(dir+"Cmat.txt");
These commands make use of the C# object class DataTable. This class, which was written by the authors, facilitates the input of general rectangular arrays of numbers stored in a text file. The following line extracts the number of observations from the data matrix C:
int n = C.numRow;
The observed values of C and y, which are stored in objects C and yTable, are assigned to
the objects cvec and y via the next set of commands:
Vector[] cvecTable = new Vector[n];
for (int i = 0; i < n; i++)
{
cvecTable[i] = Vector.Zero(3+K);
for (int j = 0; j < 3+K; j++)
cvecTable[i][j] = C.DataMatrix[i,j];
}
Range index = new Range(n).Named("index");
VariableArray<Vector> cvec = Variable.Array<Vector>(index).Named("cvec");
cvec.ObservedValue = cvecTable;
VariableArray<double> y = Variable.Array<double>(index).Named("y");
y.ObservedValue = yTable.arrayForIN;
The function arrayForIN is a member of the class DataTable and stores the data as an
array to match the type of y.ObservedValue. The resulting objects are now directly used
to specify the prior distributions. First, the commands in code chunk (16) in Section 3 are
used to set up priors for τu and τε , while the trick involving the auxiliary variable a ≡ 0 in
(15) is coded as:
Vector zeroVec = Vector.Zero(3+K);
PositiveDefiniteMatrix betauPrecDummy =
new PositiveDefiniteMatrix(3+K,3+K);
for (int j = 0; j < 3+K; j++)
betauPrecDummy[j,j] = kappa;
Variable<Vector> betauWork =
Variable.VectorGaussianFromMeanAndPrecision(zeroVec,
betauPrecDummy).Named("betauWork");
Variable<double>[] betauDummy = new Variable<double>[3+K];
for (int j = 0; j < 3+K; j++)
betauDummy[j] = Variable.GetItem(betauWork,j);
Variable<double>[] a = new Variable<double>[3+K];
for (int j = 0; j < 3; j++)
{
a[j] = Variable.GaussianFromMeanAndVariance(betauDummy[j],
tauBeta);
a[j].ObservedValue = 0.0;
}
for (int k = 0 ; k < K; k++)
{
a[3+k] = Variable.GaussianFromMeanAndPrecision(betauDummy[3+k],
tauU);
a[3+k].ObservedValue = 0.0;
}
The command to specify the likelihood for model (15) is listed as code chunk (17) in Section 3. The inference engine and the number of mean field variational Bayes iterations are
specified by means of code chunk (18). Finally, the following commands write the estimated
values for the parameters of the approximate posteriors to files named mu.q.betau.txt,
Sigma.q.betau.txt, parms.q.tauEps.txt and parms.q.tauu.txt:
SaveData.SaveTable(engine.Infer<VectorGaussian>(betauWork).GetMean(),
    dir+"mu.q.betau.txt");
SaveData.SaveTable(engine.Infer<VectorGaussian>(betauWork).GetVariance(),
    dir+"Sigma.q.betau.txt");
SaveData.SaveTable(engine.Infer<Gamma>(tauEps),dir+"parms.q.tauEps.txt");
SaveData.SaveTable(engine.Infer<Gamma>(tauu),dir+"parms.q.tauu.txt");
These commands involve the C# function named SaveTable, which is a method in the class SaveData. We wrote SaveTable to facilitate writing the output from Infer.NET to a text file. Summary plots, such as Figure 1, can be made in R after back-transforming the approximate posterior density parameters.
Acknowledgments
This research was partially funded by the Australian Research Council Discovery Project
DP110100061. We are grateful to Andrew Chernih for his provision of the Sydney property
data.
References
Armagan A, Dunson D, Lee J (2013). “Generalized Double Pareto Shrinkage.” Statistica
Sinica, 23, 119–143.
Bachrach LK, Hastie T, Wang MC, Narasimhan B, Marcus R (1999). “Bone Mineral Acquisition in Healthy Asian, Hispanic, Black, and Caucasian Youth: A Longitudinal Study.”
Journal of Clinical Endocrinology & Metabolism, 84(12), 4702–4712.
Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu C (2006). Measurement Error in Nonlinear Models: A Modern Perspective. 2nd edition. Chapman and Hall, London.
Carvalho CM, Polson NG, Scott J (2010). “The Horseshoe Estimator for Sparse Signals.”
Biometrika, 97, 465–480.
Efron B, Hastie T, Johnstone I, Tibshirani R (2004). “Least Angle Regression.” The Annals of
Statistics, 32, 407–451.
Faes C, Ormerod JT, Wand MP (2011). “Variational Bayesian Inference for Parametric and
Nonparametric Regression with Missing Data.” Journal of the American Statistical Association, 106, 959–971.
Gelman A (2006). “Prior Distributions for Variance Parameters in Hierarchical Models.”
Bayesian Analysis, 1, 515–533.
Gradshteyn I, Ryzhik I (1994). Table of Integrals, Series, and Products. Academic Press, Boston.
Griffin J, Brown P (2011). “Bayesian Hyper-Lassos with Non-Convex Penalization.” Australian and New Zealand Journal of Statistics, 53, 423–442.
Haberman S (1976). “Generalized Residuals for Log-Linear Models.” In Proceedings of the 9th
International Biometrics Conference, pp. 104–122. Boston, USA.
Kammann EE, Wand MP (2003). “Geoadditive Models.” Journal of the Royal Statistical Society
C, 52, 1–18.
Kaufman L, Rousseeuw PJ (1990). Finding Groups in Data: An Introduction to Cluster Analysis.
Wiley, New York.
Knowles DA, Minka TP (2011). “Non-Conjugate Message Passing for Multinomial and Binary Regression.” Advances in Neural Information Processing Systems 24, pp. 1701–1709.
Lange KL, Little RJA, Taylor JMG (1989). “Robust Statistical Modeling Using the t-Distribution.” The Journal of the American Statistical Association, 84, 881–896.
Lee Y, Wand M (2014). “Streamlined Mean Field Variational Bayes for Longitudinal and
Multilevel Data Analysis.” In progress.
Little RJA, Rubin DB (2002). Statistical Analysis with Missing Data. 2nd edition. Wiley, New York.
Lunn D, Thomas A, Best NG, Spiegelhalter DJ (2000). “WinBUGS – A Bayesian Modelling
Framework: Concepts, Structure, and Extensibility.” Statistics and Computing, 10, 325–337.
Marley JK, Wand MP (2010). “Non-Standard Semiparametric Regression Via BRugs.” Journal
of Statistical Software, 37, 1–30.
Minka T (2001). “Expectation Propagation for Approximate Bayesian Inference.” Proceedings
of Conference on Uncertainty in Artificial Intelligence, pp. 51–76.
Minka T (2005). “Divergence Measures And Message Passing.” Microsoft Research Technical
Report Series, MSR-TR-2008-173, 1–17.
Minka T, Winn J (2008). “Gates: A Graphical Notation for Mixture Models.” Microsoft Research Technical Report Series, MSR-TR-2008-185, 1–16.
Minka T, Winn J, Guiver J, Knowles D (2013). Infer.NET 2.5. URL http://research.
microsoft.com/infernet.
Ormerod JT, Wand MP (2010). “Explaining Variational Approximations.” The American
Statistician, 64, 140–153.
Park T, Casella G (2008). “The Bayesian Lasso.” Journal of the American Statistical Association,
103, 681–686.
Pham T, Ormerod JT, Wand MP (2013). “Mean Field Variational Bayesian Inference for Nonparametric Regression with Measurement Error.” Computational Statistics and Data Analysis, 68, 375–387.
Ruppert D, Wand MP, Carroll RJ (2003). Semiparametric Regression. Cambridge University
Press, New York.
Ruppert D, Wand MP, Carroll RJ (2009). “Semiparametric Regression During 2003–2007.”
Bayesian Analysis, 3, 1193–1265.
Saul LK, Jordan MI (1996). “Exploiting Tractable Substructures in Intractable Networks.”
In Advances in Neural Information Processing Systems, pp. 435–442. MIT Press, Cambridge,
Massachusetts.
Staudenmayer J, Lake EE, Wand MP (2009). “Robustness for General Design Mixed Models
Using the t-Distribution.” Statistical Modelling, 9, 235–255.
Tibshirani R (1996). “Regression Shrinkage and Selection Via the Lasso.” Journal of the Royal
Statistical Society, Series B, 58, 267–288.
Wainwright M, Jordan MI (2008). “Graphical Models, Exponential Families, and Variational
Inference.” Foundations and Trends in Machine Learning, 1, 1–305.
Wand MP (2009). “Semiparametric Regression and Graphical Models.” Australian and New
Zealand Journal of Statistics, 51, 9–41.
Wand MP, Ormerod JT (2008). “On O’Sullivan Penalised Splines and Semiparametric Regression.” Australian and New Zealand Journal of Statistics, 50, 179–198.
Wand MP, Ormerod JT, Padoan SA, Frühwirth R (2011). “Mean Field Variational Bayes for
Elaborate Distributions.” Bayesian Analysis, 6(4), 847–900.
Wand MP, Ripley BD (2010). KernSmooth 2.23. Functions for Kernel Smoothing Corresponding
to the Book: Wand, M.P. and Jones, M.C. (1995) Kernel Smoothing. URL http://cran.
r-project.org.
Wang SSJ, Wand MP (2011). “Using Infer.NET for Statistical Analyses.” The American Statistician, 65, 115–126.
Winn J, Bishop CM (2005). “Variational Message Passing.” Journal of Machine Learning Research, 6, 661–694.
Wu L (2002). “A Joint Model for Nonlinear Mixed-Effects Models with Censored Response
and Covariates Measured with Error, with Application to AIDS Studies.” Journal of the
American Statistical Association, 97, 700–709.
Affiliation:
Jan Luts
TheSearchParty.com Pty Ltd.
Level 1, 79 Commonwealth Street,
Surry Hills 2011, Australia
E-mail: [email protected]
Shen S.J. Wang
Department of Mathematics
The University of Queensland
Brisbane 4072, Australia
E-mail: [email protected]
John T. Ormerod
School of Mathematics and Statistics
University of Sydney
Sydney 2006, Australia
E-mail: [email protected]
URL: http://www.maths.usyd.edu.au/u/jormerod/
Matt P. Wand
School of Mathematical Sciences
University of Technology, Sydney
Broadway 2007, Australia
E-mail: [email protected]
URL: http://matt-wand.utsacademics.info