Journal of Statistical Software
http://www.jstatsoft.org/

Semiparametric Regression Analysis via Infer.NET

Jan Luts (TheSearchParty.com Pty Ltd.), Shen S.J. Wang (University of Queensland), John T. Ormerod (University of Sydney), Matt P. Wand (University of Technology, Sydney)

Abstract

We provide several examples of Bayesian semiparametric regression analysis via the Infer.NET package for approximate deterministic inference in graphical models. The examples are chosen to encompass a wide range of semiparametric regression situations. Infer.NET is shown to produce accurate inference in comparison with Markov chain Monte Carlo via the BUGS package, but is considerably faster. Potentially, this contribution represents the start of a new era for semiparametric regression, where large and complex analyses are performed via fast graphical models methodology and software, mainly being developed within Machine Learning.

Keywords: additive mixed models, expectation propagation, generalized additive models, measurement error models, mean field variational Bayes, missing data models, penalized splines, variational message passing.

Date of this draft: 17 JAN 2014

1. Introduction

Infer.NET (Minka et al. 2013) is a relatively new software package for performing approximate inference in large graphical models, using fast deterministic algorithms such as expectation propagation (Minka 2001) and variational message passing (Winn and Bishop 2005). We demonstrate its application to Bayesian semiparametric regression. A variety of situations are covered: non-Gaussian response, longitudinal data, bivariate functional effects, robustness, sparse signal penalties, missingness and measurement error.

The Infer.NET project is still in its early years and, at the time of this writing, has not progressed beyond beta releases. We anticipate continual updating and enhancement for many years to come. In this article we are necessarily restricted to the capabilities of Infer.NET at the time of preparation. All of our examples use Infer.NET 2.5, Beta 2, which was released in April 2013.

The graphical models viewpoint of semiparametric regression (Wand 2009), and other advanced statistical techniques, has the attraction that various embellishments of the standard models can be accommodated by enlarging the underlying directed acyclic graph. For example, nonparametric regression with a missing predictor data model involves the addition of nodes and edges to the graph, corresponding to the missing data mechanism. Figure 4 of Faes et al. (2011) provides some specific illustrations. Marley and Wand (2010) exploited the graphical models viewpoint of semiparametric regression to handle a wide range of nonstandard models via the Markov chain Monte Carlo inference engine implemented by BUGS (Lunn et al. 2000). However, many of their examples take between several minutes and hours to run. The inference engines provided by Infer.NET allow much faster fitting for many common semiparametric regression models. On the other hand, BUGS is the more versatile package and not all models that are treated in Marley and Wand (2010) are supported by Infer.NET.

Semiparametric regression, summarized by Ruppert et al. (2003, 2009), is a large branch of Statistics that includes nonparametric regression, generalized additive models, generalized additive mixed models, curve-by-factor interaction models, wavelet regression and geoadditive models.
Parametric models such as generalized linear mixed models are special cases of semiparametric regression. Antecedent research for special cases such as nonparametric regression was conducted in the second half of the 20th Century, at a time when data sets were smaller, computing power was lower and the Internet was either non-existent or in its infancy. Now, as we approach the mid-2010s, semiparametric regression is continually being challenged by the size, complexity and, in some applications, arrival speed of data sets requiring analysis. Implementation and computational speed are major limiting factors. This contribution represents the potential for a new era for semiparametric regression, tapping into 21st Century Machine Learning research on fast approximate inference in large graphical models (e.g., Minka 2001; Winn and Bishop 2005; Minka 2005; Minka and Winn 2008; Knowles and Minka 2011) and ensuing software development.

Section 2 lays down the definitions and notation needed to describe the models given in later sections. Each of Sections 3-10 illustrates a different type of semiparametric regression analysis via Infer.NET. All code is available as a web supplement to this article. For most of the examples, we also compare the Infer.NET results with those produced by BUGS. Section 11 compares BUGS with Infer.NET in terms of versatility and computational speed. A summary of our findings on semiparametric analysis via Infer.NET is given in Section 12. An appendix gives a detailed description of using Infer.NET to fit the simple semiparametric regression model of Section 3.

2. Preparatory infrastructure

The semiparametric regression examples rely on mathematical infrastructure and notation, which we describe in this section.

2.1. Distributional notation

The density function of a random vector x is denoted by p(x). The notation for a set of independent random variables y_i (1 \le i \le n) with distribution D_i is y_i \stackrel{\text{ind.}}{\sim} D_i. Table 1 lists the distributions used in the examples and the parametrization of their density functions.

distribution  | abbreviation           | density function in x
Normal        | N(\mu, \sigma^2)       | (2\pi\sigma^2)^{-1/2} \exp\{-(x-\mu)^2/(2\sigma^2)\}
Laplace       | Laplace(\mu, \sigma^2) | (2\sigma)^{-1} \exp(-|x-\mu|/\sigma);  \sigma > 0
t             | t(\mu, \sigma^2, \nu)  | \Gamma(\frac{\nu+1}{2}) \{1 + (x-\mu)^2/(\nu\sigma^2)\}^{-\frac{\nu+1}{2}} / \{\sqrt{\pi\nu\sigma^2}\,\Gamma(\nu/2)\};  \sigma, \nu > 0
Gamma         | Gamma(A, B)            | B^A x^{A-1} e^{-Bx}/\Gamma(A);  x > 0;  A, B > 0
Half-Cauchy   | Half-Cauchy(\sigma)    | 2\sigma/\{\pi(x^2+\sigma^2)\};  x > 0;  \sigma > 0

Table 1: Distributions used in the examples. The density function argument x and parameters range over R unless otherwise specified.

2.2. Standardization and default priors

In real data examples, the measurements are often recorded on several different scales. Therefore, all continuous variables are standardized prior to analysis using Infer.NET or Markov chain Monte Carlo (MCMC) in BUGS. This transformation makes the analyses scale-invariant and can also lead to better behavior of the MCMC.

Since we do not have prior knowledge about the model parameters in each of the examples, non-informative priors are used. The prior distributions for a fixed effects parameter vector \beta and a standard deviation parameter \sigma are

\beta \sim N(0, \tau_\beta^{-1} I)  and  \sigma \sim \mbox{Half-Cauchy}(A),

with default hyperparameters \tau_\beta = 10^{-10} and A = 10^5. This is consistent with the recommendations given in Gelman (2006) for achieving noninformativeness for variance parameters. Infer.NET and BUGS do not offer direct specification of Half-Cauchy distributions and therefore we use the result:

if x \,|\, a \sim \mbox{Gamma}(\tfrac12, a) and a \sim \mbox{Gamma}(\tfrac12, 1/A^2) then x^{-1/2} \sim \mbox{Half-Cauchy}(A).  (1)

This result allows for the imposition of a Half-Cauchy prior using only Gamma distribution specifications. Both Infer.NET and BUGS support the Gamma distribution.
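For completeness, result (1) can be verified directly (a standard calculation; the display below is ours rather than the paper's). Marginalizing over a gives

p(x) = \int_0^\infty \frac{a^{1/2} x^{-1/2} e^{-ax}}{\Gamma(\frac12)} \cdot \frac{A^{-1} a^{-1/2} e^{-a/A^2}}{\Gamma(\frac12)}\, da = \frac{x^{-1/2}}{\pi A (x + A^{-2})}, \qquad x > 0,

and the change of variable \sigma = x^{-1/2} (so that x = \sigma^{-2} and |dx/d\sigma| = 2\sigma^{-3}) then yields p(\sigma) = 2A/\{\pi(A^2 + \sigma^2)\}, which is the Half-Cauchy(A) density of Table 1.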
2.3. Variational message passing and expectation propagation

Infer.NET has two inference engines, variational message passing (Winn and Bishop 2005) and expectation propagation (Minka 2001), for performing fast deterministic approximate inference in graphical models. Succinct summaries of variational message passing and expectation propagation are provided in Appendices A and B of Minka and Winn (2008).

Generally speaking, variational message passing is more amenable to semiparametric regression than expectation propagation. It is a special case of mean field variational Bayes (e.g., Wainwright and Jordan 2008). The essential idea of mean field variational Bayes is to approximate joint posterior density functions such as p(\theta_1, \theta_2, \theta_3 \,|\, D), where D denotes the observed data, by product density forms such as

q_{\theta_1}(\theta_1)\, q_{\theta_2}(\theta_2)\, q_{\theta_3}(\theta_3), \qquad q_{\theta_1,\theta_3}(\theta_1, \theta_3)\, q_{\theta_2}(\theta_2) \qquad \mbox{or} \qquad q_{\theta_1}(\theta_1)\, q_{\theta_2,\theta_3}(\theta_2, \theta_3).  (2)

The choice of the product density form is usually made by trading off tractability against minimal imposition of product structure. Once this choice is made, the optimal q-density functions are chosen to minimize the Kullback-Leibler divergence from the exact joint posterior density function. For example, if the second product form in (2) is chosen then the optimal density functions q^*_{\theta_1,\theta_3}(\theta_1, \theta_3) and q^*_{\theta_2}(\theta_2) are those that minimize

\int\!\!\int\!\!\int q_{\theta_1,\theta_3}(\theta_1, \theta_3)\, q_{\theta_2}(\theta_2) \log\left\{ \frac{q_{\theta_1,\theta_3}(\theta_1, \theta_3)\, q_{\theta_2}(\theta_2)}{p(\theta_1, \theta_2, \theta_3 \,|\, D)} \right\} d\theta_1\, d\theta_2\, d\theta_3,  (3)

where the integrals range over the parameter spaces of \theta_1, \theta_2 and \theta_3. This minimization problem gives rise to an iterative scheme which, typically, has closed form updates and good convergence properties. A by-product of the iterations is a lower-bound approximation to the marginal likelihood p(D), which we denote by \underline{p}(D). Further details on, and several examples of, mean field variational Bayes are provided by Section 2.2 of Ormerod and Wand (2010). Each of these examples can also be expressed, equivalently, in terms of variational message passing.
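The minimization of (3) admits a well-known coordinate-wise solution (see, e.g., Ormerod and Wand 2010; the display below is our summary rather than the paper's). For the second product form in (2),

q^*_{\theta_1,\theta_3}(\theta_1, \theta_3) \propto \exp\left\{ E_{q(\theta_2)} \log p(\theta_1, \theta_2, \theta_3, D) \right\} \quad \mbox{and} \quad q^*_{\theta_2}(\theta_2) \propto \exp\left\{ E_{q(\theta_1,\theta_3)} \log p(\theta_1, \theta_2, \theta_3, D) \right\},

and cycling between these two updates until convergence constitutes the iterative scheme referred to above.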
In typical semiparametric regression models the subscripting on the q-density functions is quite cumbersome. Hence, it is suppressed for the remainder of the article. For example, q(\theta_1, \theta_3) is taken to mean q_{\theta_1,\theta_3}(\theta_1, \theta_3).

Expectation propagation is also based on product density restrictions such as (2), but differs in its method of obtaining the optimal density functions. It works with a different version of the Kullback-Leibler divergence than that given in (3) and, therefore, leads to different iterative algorithms and approximating density functions.

2.4. Mixed model-based penalized splines

Mixed model-based penalized splines are a convenient way to model nonparametric functional relationships in semiparametric regression models, and are amenable to the hierarchical Bayesian structures supported by Infer.NET. The penalized spline of a regression function f, with mixed model representation, takes the generic form

f(x) = \beta_0 + \beta_1 x + \sum_{k=1}^{K} u_k z_k(x), \qquad u_k \stackrel{\text{ind.}}{\sim} N(0, \tau_u^{-1}).  (4)

Here z_1(\cdot), \ldots, z_K(\cdot) is a set of spline basis functions and \tau_u controls the amount of penalization of the spline coefficients u_1, \ldots, u_K. Throughout this article, we use O'Sullivan splines for the z_k(\cdot). Wand and Ormerod (2008) provide details on their construction. O'Sullivan splines lead to (4) being a low-rank version of smoothing splines, which is also used by the R function smooth.spline().

The simplest semiparametric regression model is the Gaussian response nonparametric regression model

y_i \stackrel{\text{ind.}}{\sim} N(f(x_i), \sigma_\varepsilon^2),  (5)

where (x_i, y_i), 1 \le i \le n, are pairs of measurements on continuous predictor and response variables. Mixed model-based penalized splines give rise to hierarchical Bayesian models for (5) such as

y_i \,|\, \beta_0, \beta_1, u_1, \ldots, u_K, \tau_\varepsilon \stackrel{\text{ind.}}{\sim} N\Big( \beta_0 + \beta_1 x_i + \sum_{k=1}^{K} u_k z_k(x_i),\ \tau_\varepsilon^{-1} \Big),
u_k \,|\, \tau_u \stackrel{\text{ind.}}{\sim} N(0, \tau_u^{-1}), \qquad \beta_0, \beta_1 \stackrel{\text{ind.}}{\sim} N(0, \tau_\beta^{-1}),  (6)
\tau_u^{-1/2} \sim \mbox{Half-Cauchy}(A_u), \qquad \tau_\varepsilon^{-1/2} \sim \mbox{Half-Cauchy}(A_\varepsilon).

Such models allow nonparametric regression to be performed using Bayesian inference engines such as BUGS and Infer.NET. Our decision to work with precision parameters, rather than the variance parameters that are common in the semiparametric regression literature, is driven by the former being the standard parametrization in BUGS and Infer.NET.

It is convenient to express (6) using matrix notation. This entails putting

y \equiv \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \quad X \equiv \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \quad Z \equiv \begin{bmatrix} z_1(x_1) & \cdots & z_K(x_1) \\ \vdots & \ddots & \vdots \\ z_1(x_n) & \cdots & z_K(x_n) \end{bmatrix}  (7)

and

\beta \equiv \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} \quad \mbox{and} \quad u \equiv \begin{bmatrix} u_1 \\ \vdots \\ u_K \end{bmatrix}.  (8)

We then re-write (6) as

y \,|\, \beta, u, \tau_\varepsilon \sim N(X\beta + Zu, \tau_\varepsilon^{-1} I), \quad u \,|\, \tau_u \sim N(0, \tau_u^{-1} I), \quad \beta \sim N(0, \tau_\beta^{-1} I),
\tau_u^{-1/2} \sim \mbox{Half-Cauchy}(A_u), \quad \tau_\varepsilon^{-1/2} \sim \mbox{Half-Cauchy}(A_\varepsilon).

The effective degrees of freedom (edf) corresponding to (6), where C \equiv [X\ Z], is defined to be

\mbox{edf}(\tau_u, \tau_\varepsilon) \equiv \mbox{tr}\big( [C^T C + \mbox{blockdiag}\{\tau_\beta^2 I_2,\ (\tau_u/\tau_\varepsilon) I\}]^{-1} C^T C \big)  (9)

and is a scale-free measure of the amount of fitting being performed by the penalized splines. Further details on effective degrees of freedom for penalized splines are given in Section 3.13 of Ruppert et al. (2003).

Extension to bivariate predictors

The bivariate predictor extension of (5) is

y_i \stackrel{\text{ind.}}{\sim} N(f(x_i), \sigma_\varepsilon^2),  (10)

where the predictors x_i, 1 \le i \le n, are each 2 \times 1 vectors. In many applications, the x_i correspond to geographical location, but they could be measurements on any pair of predictors for which a bivariate mean function might be entertained. Mixed model-based penalized splines can handle this bivariate predictor case by extending (4) to

f(x) = \beta_0 + \beta_1^T x + \sum_{k=1}^{K} u_k z_k(x), \qquad u_k \stackrel{\text{ind.}}{\sim} N(0, \tau_u^{-1})  (11)

and setting the z_k to be appropriate bivariate spline functions. There are several options for doing this (e.g., Ruppert et al. 2009, Section 2.2). A relatively simple choice is described here and used in the examples. It is based on thin plate spline theory, and corresponds to Section 13.5 of Ruppert et al. (2003).

The first step is to choose the number K of bivariate knots and their locations. We denote these 2 \times 1 vectors by \kappa_1, \ldots, \kappa_K. Our default rule for choosing knot locations involves feeding the x_i and K into the clustering algorithm known as CLARA (Kaufman and Rousseeuw 1990) and setting the \kappa_k to be the cluster centers. Next, form the matrices

X = \big[\, 1 \ \ x_i^T \,\big]_{1 \le i \le n}, \qquad Z_K = \big[\, \|x_i - \kappa_k\|^2 \log\|x_i - \kappa_k\| \,\big]_{1 \le i \le n,\, 1 \le k \le K}

and

\Omega = \big[\, \|\kappa_k - \kappa_{k'}\|^2 \log\|\kappa_k - \kappa_{k'}\| \,\big]_{1 \le k, k' \le K}.
Based on the singular value decomposition \Omega = U \mbox{diag}(d) V^T, compute \Omega^{1/2} = U \mbox{diag}(\sqrt{d}) V^T and then set Z = Z_K \Omega^{-1/2}. Then z_k(x_i) = the (i, k) entry of Z, with u_k \stackrel{\text{ind.}}{\sim} N(0, \tau_u^{-1}).

3. Simple semiparametric model

The first example involves the simple semiparametric regression model

y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \sum_{k=1}^{K} u_k z_k(x_{2i}) + \varepsilon_i, \qquad \varepsilon_i \stackrel{\text{ind.}}{\sim} N(0, \tau_\varepsilon^{-1}), \qquad 1 \le i \le n,  (12)

where z_k(\cdot) is a set of spline basis functions as described in Section 2.4. The corresponding Bayesian mixed model can then be represented by

y \,|\, \beta, u, \tau_\varepsilon \sim N(X\beta + Zu, \tau_\varepsilon^{-1} I), \quad u \,|\, \tau_u \sim N(0, \tau_u^{-1} I), \quad \beta \sim N(0, \tau_\beta^{-1} I),
\tau_u^{-1/2} \sim \mbox{Half-Cauchy}(A_u), \quad \tau_\varepsilon^{-1/2} \sim \mbox{Half-Cauchy}(A_\varepsilon),  (13)

where \tau_\beta, A_u and A_\varepsilon are user-specified hyperparameters and

y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix}, \quad u = \begin{bmatrix} u_1 \\ \vdots \\ u_K \end{bmatrix}, \quad X = \begin{bmatrix} 1 & x_{11} & x_{21} \\ \vdots & \vdots & \vdots \\ 1 & x_{1n} & x_{2n} \end{bmatrix}, \quad Z = \begin{bmatrix} z_1(x_{21}) & \cdots & z_K(x_{21}) \\ \vdots & \ddots & \vdots \\ z_1(x_{2n}) & \cdots & z_K(x_{2n}) \end{bmatrix}.

Infer.NET 2.5, Beta 2 does not support direct fitting of model (13) under the product restriction

q(\beta, u, \tau_u, \tau_\varepsilon) = q(\beta, u)\, q(\tau_u, \tau_\varepsilon).  (14)

We get around this by employing the same trick as that described in Section 3.2 of Wang and Wand (2011). It entails the introduction of an auxiliary data vector a, of the same length as (\beta^T, u^T)^T. By setting the observed data for a equal to 0 and by assuming a very small number for \kappa, fitting the model in (15) provides essentially the same result as fitting the model in (13). The actual model implemented in Infer.NET is

y \,|\, \beta, u, \tau_\varepsilon \sim N(X\beta + Zu, \tau_\varepsilon^{-1} I),
a \,|\, \beta, u, \tau_u \sim N\left( \begin{bmatrix} \beta \\ u \end{bmatrix}, \begin{bmatrix} \tau_\beta^{-1} I & 0 \\ 0 & \tau_u^{-1} I \end{bmatrix} \right), \qquad \begin{bmatrix} \beta \\ u \end{bmatrix} \sim N(0, \kappa^{-1} I),
\tau_u \,|\, b_u \sim \mbox{Gamma}(\tfrac12, b_u), \qquad b_u \sim \mbox{Gamma}(\tfrac12, 1/A_u^2),  (15)
\tau_\varepsilon \,|\, b_\varepsilon \sim \mbox{Gamma}(\tfrac12, b_\varepsilon), \qquad b_\varepsilon \sim \mbox{Gamma}(\tfrac12, 1/A_\varepsilon^2),

with the inputted a vector containing zeroes. While the full Infer.NET code is included in the Appendix, we will confine discussion here to the key parts of the code that specify (15).

Specification of the prior distributions for \tau_u and \tau_\varepsilon can be achieved using the following Infer.NET code:

Variable<double> tauu = Variable.GammaFromShapeAndRate(0.5,
   Variable.GammaFromShapeAndRate(0.5,Math.Pow(Au,-2))).Named("tauu");
Variable<double> tauEps = Variable.GammaFromShapeAndRate(0.5,                  (16)
   Variable.GammaFromShapeAndRate(0.5,Math.Pow(Aeps,-2))).Named("tauEps");

while the likelihood in (15) is specified by the following line:

y[index] = Variable.GaussianFromMeanAndPrecision(
   Variable.InnerProduct(betauWork,cvec[index]),tauEps);                       (17)

The variable index represents a range of integers from 1 to n and enables loop-type structures. Finally, the inference engine is specified to be variational message passing via the code:

InferenceEngine engine = new InferenceEngine();
engine.Algorithm = new VariationalMessagePassing();                            (18)
engine.NumberOfIterations = nIterVB;

with nIterVB denoting the number of mean field variational Bayes iterations. Note that this Infer.NET code treats the coefficient vector (\beta^T, u^T)^T as a single entity, in keeping with product restriction (14).
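Once the inference engine has run, the fitted q-density parameters can be queried directly; for instance (these particular calls appear in the Appendix code):

Vector muqBetau = engine.Infer<VectorGaussian>(betauWork).GetMean();
PositiveDefiniteMatrix Sigmaq = engine.Infer<VectorGaussian>(betauWork).GetVariance();
Gamma qTauEps = engine.Infer<Gamma>(tauEps);
Gamma qTauu = engine.Infer<Gamma>(tauu);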
Figures 1 and 2 summarize the results from Infer.NET fitting of (15) to a data set on the yields (g/plant) of 84 white Spanish onions crops in two locations: Purnong Landing and Virginia, South Australia. The response variable y is the logarithm of yield, whilst the predictors are the indicator of the location being Virginia (x_1) and the areal density of the plants (plants/m^2) (x_2). The hyperparameters were set at \tau_\beta = 10^{-10}, A_\varepsilon = 10^5, A_u = 10^5 and \kappa = 10^{-10}, while the number of mean field variational Bayes iterations was fixed at 100.

As a benchmark, Bayesian inference via MCMC was performed. To this end, the following BUGS program was used:

for(i in 1:n)
{
   mu[i] <- (beta0 + beta1*x1[i] + beta2*x2[i] + inprod(u[],Z[i,]))
   y[i] ~ dnorm(mu[i],tauEps)
}
for (k in 1:K)
{
   u[k] ~ dnorm(0,tauu)
}                                                                              (19)
beta0 ~ dnorm(0,1.0E-10) ; beta1 ~ dnorm(0,1.0E-10)
beta2 ~ dnorm(0,1.0E-10) ; bu ~ dgamma(0.5,1.0E-10) ; tauu ~ dgamma(0.5,bu)
bEps ~ dgamma(0.5,1.0E-10) ; tauEps ~ dgamma(0.5,bEps)

A burn-in length of 50000 was used, while 1 million samples were obtained from the posterior distributions. Finally, a thinning factor of 5 was used. The posterior densities for MCMC were produced based on kernel density estimation with plug-in bandwidth selection via the R package KernSmooth (Wand and Ripley 2010).

Figure 1 displays the fitted regression lines and pointwise 95% credible intervals. The estimated regression lines and credible intervals from Infer.NET and MCMC fitting are highly similar.

Figure 1: Onions data. Fitted regression line and pointwise 95% credible intervals for variational Bayesian inference by Infer.NET and MCMC.

Figure 2 visualizes the approximate posterior density functions for \beta_1, \tau_\varepsilon^{-1} and the effective degrees of freedom of the fit. The variational Bayes approximations for \beta_1 and \tau_\varepsilon^{-1} are rather accurate. The approximate posterior density function for the effective degrees of freedom for variational Bayesian inference was obtained based on Monte Carlo samples of size 1 million from the approximate posterior distributions of \tau_\varepsilon^{-1} and \tau_u^{-1}.

Figure 2: Variational Bayes approximate posterior density functions produced by Infer.NET and MCMC posterior density functions of key model parameters for fitting a simple semiparametric model to the onions data set.

4. Generalized additive model

The next example illustrates binary response variable regression through the model

y_i \,|\, \beta, u \stackrel{\text{ind.}}{\sim} \mbox{Bernoulli}(F(\{X\beta + Zu\}_i)), \qquad 1 \le i \le n,
u \,|\, \tau_u \sim N(0, \tau_u^{-1} I), \quad \beta \sim N(0, \tau_\beta^{-1} I), \quad \tau_u^{-1/2} \sim \mbox{Half-Cauchy}(A_u),  (20)

where F(\cdot) denotes an inverse link function and X, Z, \tau_\beta and A_u are defined as in the previous section. Typical choices of F(\cdot) correspond to logistic regression and probit regression.
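For concreteness (these are standard definitions rather than anything specific to this paper), the two choices of F used below are

F(x) = e^x/(1 + e^x) \quad \mbox{for logistic regression} \qquad \mbox{and} \qquad F(x) = \Phi(x) \quad \mbox{for probit regression},

where \Phi denotes the N(0, 1) cumulative distribution function.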
As before, introduction of the auxiliary variables a \equiv 0 enables Infer.NET fitting of the Bayesian mixed model

y_i \,|\, \beta, u \stackrel{\text{ind.}}{\sim} \mbox{Bernoulli}(F(\{X\beta + Zu\}_i)), \qquad 1 \le i \le n,
a \,|\, \beta, u, \tau_u \sim N\left( \begin{bmatrix} \beta \\ u \end{bmatrix}, \begin{bmatrix} \tau_\beta^{-1} I & 0 \\ 0 & \tau_u^{-1} I \end{bmatrix} \right), \qquad \begin{bmatrix} \beta \\ u \end{bmatrix} \sim N(0, \kappa^{-1} I),  (21)
\tau_u \,|\, b_u \sim \mbox{Gamma}(\tfrac12, b_u), \qquad b_u \sim \mbox{Gamma}(\tfrac12, 1/A_u^2).

The likelihood for the logistic regression case is specified as follows:

VariableArray<bool> y = Variable.Array<bool>(index).Named("y");
y[index] = Variable.BernoulliFromLogOdds(                                      (22)
   Variable.InnerProduct(betauWork, cvec[index]));

while variational message passing is used for fitting purposes. Setting up the prior for \tau_u and specifying the inference engine is done as in code chunks (16) and (18), respectively.

For probit regression, the last line of (22) is replaced by

y[index] = Variable.IsPositive(Variable.GaussianFromMeanAndVariance(
   Variable.InnerProduct(betauWork, cvec[index]), 1));                         (23)

and expectation propagation is used:

engine.Algorithm = new ExpectationPropagation();                               (24)

Instead of the half-Cauchy prior in (20), a gamma prior with shape and rate equal to 2 is used for \tau_u in the probit regression model:

Variable<double> tauU = Variable.GammaFromShapeAndRate(2,2);                   (25)

Figure 3 visualizes the results for Infer.NET and MCMC fitting of a straightforward extension of model (21), with logit and probit link functions, to a breast cancer data set (Haberman 1976). This data set contains 306 cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer. The binary response variable represents whether the patient died within 5 years after operation (died), while 3 predictor variables are used: age of the patient at the time of operation (age), year of operation (year) and number of positive axillary nodes detected (nodes). The actual model is

died_i \,|\, \beta, u \stackrel{\text{ind.}}{\sim} \mbox{Bernoulli}(F(\beta_0 + f_1(age_i) + f_2(year_i) + f_3(nodes_i))), \qquad 1 \le i \le 306.

Note that all predictor variables were standardized prior to model fitting and the following hyperparameters were used: \tau_\beta = 10^{-10}, A_u = 10^5, \kappa = 10^{-10}, and the number of variational Bayes iterations was 100. Figure 3 illustrates that there is good agreement between the results from Infer.NET and MCMC.

Figure 3: Breast cancer data. Fitted regression lines and pointwise 95% credible intervals for logit (a) and probit (b) regression via variational Bayesian inference by Infer.NET and MCMC.

5. Robust nonparametric regression based on the t-distribution

A popular model-based approach for robust regression is to model the response variable to have a t-distribution. Outliers occur with moderate probability for low values of the t-distribution's degrees of freedom parameter (Lange et al. 1989). More recently, Staudenmayer et al. (2009) proposed a penalized spline mixed model approach to nonparametric regression using the t-distribution. The robust nonparametric regression model that we consider here is a Bayesian variant of that treated by Staudenmayer et al. (2009):

y_i \,|\, \beta_0, \beta_1, u, \tau_\varepsilon, \nu \stackrel{\text{ind.}}{\sim} t\Big( \beta_0 + \beta_1 x_i + \sum_{k=1}^{K} u_k z_k(x_i),\ \tau_\varepsilon^{-1},\ \nu \Big), \qquad 1 \le i \le n,
u \,|\, \tau_u \sim N(0, \tau_u^{-1} I), \quad \beta \sim N(0, \tau_\beta^{-1} I),  (26)
\tau_u^{-1/2} \sim \mbox{Half-Cauchy}(A_u), \quad \tau_\varepsilon^{-1/2} \sim \mbox{Half-Cauchy}(A_\varepsilon),
p(\nu) \ \mbox{discrete on a finite set} \ N.

Restricting the prior distribution of \nu to be that of a discrete random variable allows Infer.NET fitting of the model using structured mean field variational Bayes (Saul and Jordan 1996). This extension of ordinary mean field variational Bayes is described in Section 3.1 of Wand et al. (2011). Since variational message passing is a special case of mean field variational Bayes, this extension also applies. Model (26) can be fitted through calls to Infer.NET with \nu fixed at each value in N. The results of each of these fits are combined afterwards. Details are given below.

Another challenge concerning (26) is that Infer.NET does not support t-distribution specifications. We get around this by appealing to the result:

if x \,|\, g \sim N(\mu, (g\tau)^{-1}) and g \sim \mbox{Gamma}(\tfrac{\nu}{2}, \tfrac{\nu}{2}) then x \sim t(\mu, \tau^{-1}, \nu).  (27)
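Result (27) is the classical scale-mixture-of-normals representation of the t-distribution; for completeness (our calculation, not the paper's), marginalizing over g gives

p(x) \propto \int_0^\infty g^{\frac{\nu+1}{2} - 1} \exp\big[ -g\{\nu + \tau(x-\mu)^2\}/2 \big]\, dg \propto \{1 + \tau(x-\mu)^2/\nu\}^{-\frac{\nu+1}{2}},

which is the t(\mu, \tau^{-1}, \nu) density kernel in the parametrization of Table 1.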
As before, we use the a \equiv 0 auxiliary data trick described in Section 3 and the Half-Cauchy representation (1). These lead to the following suite of models, for each fixed \nu \in N, needing to be run in Infer.NET:

y \,|\, \beta, u, \tau_\varepsilon, g \sim N(X\beta + Zu, \tau_\varepsilon^{-1} \mbox{diag}(1/g)),
a \,|\, \beta, u, \tau_u \sim N\left( \begin{bmatrix} \beta \\ u \end{bmatrix}, \begin{bmatrix} \tau_\beta^{-1} I & 0 \\ 0 & \tau_u^{-1} I \end{bmatrix} \right), \qquad \begin{bmatrix} \beta \\ u \end{bmatrix} \sim N(0, \kappa^{-1} I),
g_i \,|\, \nu \stackrel{\text{ind.}}{\sim} \mbox{Gamma}(\tfrac{\nu}{2}, \tfrac{\nu}{2}),  (28)
\tau_u \,|\, b_u \sim \mbox{Gamma}(\tfrac12, b_u), \qquad b_u \sim \mbox{Gamma}(\tfrac12, 1/A_u^2),
\tau_\varepsilon \,|\, b_\varepsilon \sim \mbox{Gamma}(\tfrac12, b_\varepsilon), \qquad b_\varepsilon \sim \mbox{Gamma}(\tfrac12, 1/A_\varepsilon^2),

where the matrix notation of (7), (8) is being used, g \equiv [g_1, \ldots, g_n]^T and \tau_\beta, A_u and A_\varepsilon are user-specified hyperparameters with default values \tau_\beta = 10^{-10}, A_u = A_\varepsilon = 10^5. The prior for \nu was set to be a uniform distribution over the atom set N, set to be 30 equally-spaced numbers between 0.05 and 10 inclusive.

After obtaining fits of (28) for each \nu \in N, the approximate posterior densities are obtained from

q(\beta, u) = \sum_{\nu \in N} q(\nu)\, q(\beta, u \,|\, \nu), \qquad q(\tau_u) = \sum_{\nu \in N} q(\nu)\, q(\tau_u \,|\, \nu), \qquad q(\tau_\varepsilon) = \sum_{\nu \in N} q(\nu)\, q(\tau_\varepsilon \,|\, \nu),

with

q(\nu) = p(\nu \,|\, y) = \frac{p(\nu)\, \underline{p}(y \,|\, \nu)}{\sum_{\nu' \in N} p(\nu')\, \underline{p}(y \,|\, \nu')},

where \underline{p}(y \,|\, \nu) denotes the variational lower bound on the conditional likelihood p(y \,|\, \nu).

Specification of the prior distribution for g is achieved using the following Infer.NET code:

VariableArray<double> g = Variable.Array<double>(index);
g[index] = Variable.GammaFromShapeAndRate(nu/2,nu/2).ForEach(index);           (29)

while the likelihood in (28) is specified by:

y[index] = Variable.GaussianFromMeanAndPrecision(
   Variable.InnerProduct(betauWork,cvec[index]),g[index]*tauEps);              (30)

The lower bound log \underline{p}(y \,|\, \nu) can be obtained by creating a mixture of the current model with an empty model in Infer.NET. The learned mixing weight is then equal to the marginal log-likelihood. Therefore, an auxiliary Bernoulli variable is set up:

Variable<bool> auxML = Variable.Bernoulli(0.5).Named("auxML");                 (31)

The normal code for fitting the model in (28) is then enclosed between

IfBlock model = Variable.If(auxML);                                            (32)

and

model.CloseBlock();                                                            (33)

Finally, the lower bound log \underline{p}(y \,|\, \nu) is obtained from:

double marginalLogLikelihood = engine.Infer<Bernoulli>(auxML).LogOdds;         (34)
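The outer loop over N can then be organized along the following lines (a hypothetical sketch rather than the paper's code; FitModelForFixedNu stands for a routine that builds and fits (28) for a fixed \nu and returns the quantity computed in (34); System.Linq is assumed):

// Structured MFVB: one Infer.NET fit per value of nu, combined via q(nu).
double[] nuGrid = Enumerable.Range(0, 30)
                            .Select(j => 0.05 + j*(10.0 - 0.05)/29.0).ToArray();
double[] logML = nuGrid.Select(nu => FitModelForFixedNu(nu)).ToArray();

// Uniform prior on N, so q(nu) is proportional to exp(log ML); subtract the
// maximum before exponentiating for numerical stability, then normalize.
double maxLogML = logML.Max();
double[] qNu = logML.Select(l => Math.Exp(l - maxLogML)).ToArray();
double total = qNu.Sum();
for (int j = 0; j < qNu.Length; j++) qNu[j] /= total;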
Figure 4 presents the results of the structured mean field variational Bayes analysis using Infer.NET fitting of model (28) to a data set on a respiratory experiment conducted by Professor Russ Hauser at Harvard School of Public Health, Boston, USA. The data correspond to 60 measurements on one subject during two separate respiratory experiments. The response variable y_i represents the log of the adjusted time of exhalation for x_i equal to the time in seconds since exposure to air containing particulate matter. The adjusted time of exhalation is obtained by subtracting the average time of exhalation at baseline, prior to exposure to filtered air. Interest centers upon the mean response as a function of time. The predictor and response variable were both standardized prior to Infer.NET analysis. The following hyperparameters were chosen: \tau_\beta = 10^{-10}, A_\varepsilon = 10^5, A_u = 10^5 and \kappa = 10^{-10}, while the number of variational Bayes iterations equaled 100. Bayesian inference via MCMC was performed as a benchmark.

Figure 4: Structured mean field variational Bayesian, based on Infer.NET, and MCMC fitting of the robust nonparametric regression model (26) to the respiratory experiment data. Left: posterior mean and pointwise 95% credible sets for the regression function. Right: approximate posterior function for the degrees of freedom parameter \nu.

Figure 4 shows that the variational Bayes fit and pointwise 95% credible sets are close to the ones obtained using MCMC. Finally, the approximate posterior probability function and the posterior from MCMC for the degrees of freedom \nu are compared. The Infer.NET and MCMC results coincide quite closely.

6. Semiparametric mixed model

Since semiparametric regression models based on penalized splines fit in the mixed model framework, semiparametric longitudinal data analysis can be performed by fusion with classical mixed models (Ruppert et al. 2003). In this section we illustrate the use of Infer.NET for fitting the class of semiparametric mixed models having the form

y_{ij} \,|\, \beta, u, U_i, \tau_\varepsilon \stackrel{\text{ind.}}{\sim} N\Big( \beta_0 + \beta^T x_{ij} + U_i + \sum_{k=1}^{K} u_k z_k(s_{ij}),\ \tau_\varepsilon^{-1} \Big), \qquad 1 \le j \le n_i, \quad 1 \le i \le m,  (35)
U_i \,|\, \tau_U \stackrel{\text{ind.}}{\sim} N(0, \tau_U^{-1}),

for analysis of longitudinal data sets such as that described in Bachrach et al. (1999). Here x_{ij} is a vector of predictors that enter the model linearly and s_{ij} is another predictor that enters the model non-linearly via penalized splines. For each 1 \le i \le m, U_i denotes the random intercept for the ith subject. The corresponding Bayesian mixed model is represented by

y \,|\, \beta, u, \tau_\varepsilon \sim N(X\beta + Zu, \tau_\varepsilon^{-1} I), \quad u \,|\, \tau_U, \tau_u \sim N\left( 0, \begin{bmatrix} \tau_U^{-1} I & 0 \\ 0 & \tau_u^{-1} I \end{bmatrix} \right), \quad \beta \sim N(0, \tau_\beta^{-1} I),
\tau_u^{-1/2} \sim \mbox{Half-Cauchy}(A_u), \quad \tau_\varepsilon^{-1/2} \sim \mbox{Half-Cauchy}(A_\varepsilon), \quad \tau_U^{-1/2} \sim \mbox{Half-Cauchy}(A_U),  (36)

where y, \beta and X are defined in a similar manner as in the previous sections and \tau_\beta, A_u, A_\varepsilon and A_U are user-specified hyperparameters. Introduction of the random intercepts results in

u = [U_1, \ldots, U_m, u_1, \ldots, u_K]^T \qquad \mbox{and} \qquad Z = [\, E \ \ Z_{\mbox{\scriptsize spline}} \,],

where E is the (\sum_{i=1}^m n_i) \times m matrix of 0/1 subject indicators, whose rows for subject i have a 1 in column i and 0 elsewhere, and Z_{\mbox{\scriptsize spline}} has rows [z_1(s_{ij}) \ \cdots \ z_K(s_{ij})].

We again use the auxiliary data vector a \equiv 0 to allow direct Infer.NET fitting of the Bayesian longitudinal penalized spline model

y \,|\, \beta, u, \tau_\varepsilon \sim N(X\beta + Zu, \tau_\varepsilon^{-1} I),
a \,|\, \beta, u, \tau_U, \tau_u \sim N\left( \begin{bmatrix} \beta \\ u \end{bmatrix}, \begin{bmatrix} \tau_\beta^{-1} I & 0 & 0 \\ 0 & \tau_U^{-1} I & 0 \\ 0 & 0 & \tau_u^{-1} I \end{bmatrix} \right), \qquad \begin{bmatrix} \beta \\ u \end{bmatrix} \sim N(0, \kappa^{-1} I),
\tau_u \,|\, b_u \sim \mbox{Gamma}(\tfrac12, b_u), \qquad b_u \sim \mbox{Gamma}(\tfrac12, 1/A_u^2),
\tau_\varepsilon \,|\, b_\varepsilon \sim \mbox{Gamma}(\tfrac12, b_\varepsilon), \qquad b_\varepsilon \sim \mbox{Gamma}(\tfrac12, 1/A_\varepsilon^2),  (37)
\tau_U \,|\, b_U \sim \mbox{Gamma}(\tfrac12, b_U), \qquad b_U \sim \mbox{Gamma}(\tfrac12, 1/A_U^2).

Specification of the prior distribution for \tau_U can be achieved in Infer.NET in a similar manner as the prior specifications in code chunk (16); a minimal sketch is given below.
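A minimal sketch, by analogy with code chunk (16) (the names AU and tauU are our assumptions for illustration, not taken from the paper's code):

// Half-Cauchy(AU) prior on 1/sqrt(tauU) via the Gamma-Gamma construction (1).
Variable<double> tauU = Variable.GammaFromShapeAndRate(0.5,
   Variable.GammaFromShapeAndRate(0.5,Math.Pow(AU,-2))).Named("tauU");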
Figure 5 shows the Infer.NET fits of (37) to the spinal bone mineral density data (Bachrach et al. 1999). A population of 230 female subjects aged between 8 and 27 was followed over time and each subject contributed either one, two, three, or four spinal bone mineral density measurements. Age enters the model non-linearly and corresponds to s_{ij} in (35). Data on ethnicity are available and the entries of x_{ij} correspond to the indicator variables for Black (x_{1ij}), Hispanic (x_{2ij}) and White (x_{3ij}), with Asian ethnicity corresponding to the baseline. The following hyperparameters were chosen: \tau_\beta = 10^{-10}, A_\varepsilon = 10^5, A_u = 10^5, A_U = 10^5 and \kappa = 10^{-10}, while the number of mean field variational Bayes iterations was set to 100.

Figure 5: Spinal bone mineral density data. Fitted regression line and pointwise 95% credible intervals based on mean field variational Bayesian inference by Infer.NET and MCMC.

Figure 5 suggests that there exists a statistically significant difference in mean spinal bone mineral density between Asian and Black subjects. This difference is confirmed by the approximate posterior density functions in Figure 6. No statistically significant difference is found between Asian and Hispanic subjects or between Asian and White subjects.

Figure 6: Variational Bayes approximate posterior density functions produced by Infer.NET of ethnic group parameters for fitting a simple semiparametric mixed model to the spinal bone mineral density data set.

7. Geoadditive model

We turn our attention to geostatistical data, for which the response variable is geographically referenced and the extension of generalized additive models, known as geoadditive models (Kammann and Wand 2003), applies. Such models allow for a pair of continuous predictors, typically geographical location, to have a bivariate functional impact on the mean response. An example geoadditive model is

y_i \stackrel{\text{ind.}}{\sim} N\big( f_1(x_{1i}) + f_2(x_{2i}) + f_{geo}(x_{3i}, x_{4i}),\ \tau_\varepsilon^{-1} \big), \qquad 1 \le i \le n.  (38)

It can be handled using the spline bases described in Section 2.4 as follows:

y_i \,|\, \beta, u_1, u_2, u^{geo}, \tau_\varepsilon \stackrel{\text{ind.}}{\sim} N\Big( \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \beta_4 x_{4i}
\quad + \sum_{k=1}^{K_1} u_{1k} z_{1k}(x_{1i}) + \sum_{k=1}^{K_2} u_{2k} z_{2k}(x_{2i}) + \sum_{k=1}^{K_{geo}} u_k^{geo} z_k^{geo}(x_{3i}, x_{4i}),\ \tau_\varepsilon^{-1} \Big), \qquad 1 \le i \le n,  (39)

where the z_{1k} and z_{2k} are univariate spline basis functions, as used in each of this article's previous examples, and the z_k^{geo} are the bivariate spline basis functions described in Section 2.4. The corresponding Bayesian mixed model is represented by

y \,|\, \beta, u, \tau_\varepsilon \sim N(X\beta + Zu, \tau_\varepsilon^{-1} I), \qquad \beta \sim N(0, \tau_\beta^{-1} I),
u \equiv \begin{bmatrix} u_1 \\ u_2 \\ u^{geo} \end{bmatrix} \Big|\, \tau_{u1}, \tau_{u2}, \tau_{geo} \sim N\left( 0, \begin{bmatrix} \tau_{u1}^{-1} I & 0 & 0 \\ 0 & \tau_{u2}^{-1} I & 0 \\ 0 & 0 & \tau_{geo}^{-1} I \end{bmatrix} \right),  (40)
\tau_{u1}^{-1/2} \sim \mbox{Half-Cauchy}(A_{u1}), \quad \tau_{u2}^{-1/2} \sim \mbox{Half-Cauchy}(A_{u2}),
\tau_\varepsilon^{-1/2} \sim \mbox{Half-Cauchy}(A_\varepsilon), \quad \tau_{geo}^{-1/2} \sim \mbox{Half-Cauchy}(A_{geo}),

where y, \beta, u and X are defined in a similar manner as in the previous sections and \tau_\beta, A_{u1}, A_{u2}, A_\varepsilon and A_{geo} are user-specified hyperparameters. The Z matrix is

Z = \big[\, z_{11}(x_{1i}) \cdots z_{1K_1}(x_{1i}) \ \big|\ z_{21}(x_{2i}) \cdots z_{2K_2}(x_{2i}) \ \big|\ z_1^{geo}(x_{3i}, x_{4i}) \cdots z_{K_{geo}}^{geo}(x_{3i}, x_{4i}) \,\big]_{1 \le i \le n}.

Infer.NET can be used to fit the Bayesian geoadditive model

y \,|\, \beta, u, \tau_\varepsilon \sim N(X\beta + Zu, \tau_\varepsilon^{-1} I), \qquad \begin{bmatrix} \beta \\ u \end{bmatrix} \sim N(0, \kappa^{-1} I),
a \,|\, \beta, u, \tau_{u1}, \tau_{u2}, \tau_{geo} \sim N\left( \begin{bmatrix} \beta \\ u \end{bmatrix}, \begin{bmatrix} \tau_\beta^{-1} I & 0 & 0 & 0 \\ 0 & \tau_{u1}^{-1} I & 0 & 0 \\ 0 & 0 & \tau_{u2}^{-1} I & 0 \\ 0 & 0 & 0 & \tau_{geo}^{-1} I \end{bmatrix} \right),
\tau_{u1} \,|\, b_{u1} \sim \mbox{Gamma}(\tfrac12, b_{u1}), \qquad b_{u1} \sim \mbox{Gamma}(\tfrac12, 1/A_{u1}^2),
\tau_{u2} \,|\, b_{u2} \sim \mbox{Gamma}(\tfrac12, b_{u2}), \qquad b_{u2} \sim \mbox{Gamma}(\tfrac12, 1/A_{u2}^2),
\tau_\varepsilon \,|\, b_\varepsilon \sim \mbox{Gamma}(\tfrac12, b_\varepsilon), \qquad b_\varepsilon \sim \mbox{Gamma}(\tfrac12, 1/A_\varepsilon^2),
\tau_{geo} \,|\, b_{geo} \sim \mbox{Gamma}(\tfrac12, b_{geo}), \qquad b_{geo} \sim \mbox{Gamma}(\tfrac12, 1/A_{geo}^2),

where a \equiv 0 as in all previous examples.

We illustrate geoadditive model fitting in Infer.NET using data on the prices of 37,676 residential properties that were sold in Sydney, Australia, during 2001. The data were assembled as part of an unpublished study by A. Chernih and M. Sherris at the University of New South Wales, Australia. The response variable is the logarithm of sale price in Australian dollars. Apart from geographical location, several predictor variables are available. For this example we selected weekly income in Australian dollars, the distance from the coastline in kilometres, the particulate matter 10 level and the nitrogen dioxide level. The model is a straightforward extension of the geoadditive model conveyed by (38)-(40). In addition, all predictor variables and the dependent variable were standardized before model fitting.
The hyperparameters were set to \tau_\beta = 10^{-10}, A_\varepsilon = A_{u1} = \cdots = A_{u4} = A_{geo} = 10^5 and \kappa = 10^{-10}, while the number of variational message passing iterations was equal to 100.

Figure 7 summarizes the housing prices for the Sydney metropolitan area based on the fitted Infer.NET model, while fixing the four predictor variables at their mean levels. This result clearly shows that the sale price is higher for houses in Sydney's eastern and northern suburbs, whereas it is strongly decreased for houses in Sydney's western suburbs.

Figure 7: Sydney real estate data. Estimated housing prices for the Sydney metropolitan area for variational Bayesian inference by Infer.NET.

Figure 8 visualizes the fitted regression lines and pointwise 95% credible intervals for income, distance from the coastline, particulate matter 10 level and nitrogen dioxide level at a fixed geographical location (longitude = 151.10°, latitude = -33.91°). As expected, a higher income is associated with a higher sale price and houses close to the coastline are more expensive. The sale price is lower for a higher particulate matter 10 level, while a lower nitrogen dioxide level is generally associated with a lower sale price.

Figure 8: Sydney real estate data. Fitted regression lines and pointwise 95% credible intervals for variational Bayesian inference by Infer.NET.

8. Bayesian lasso regression

A Bayesian approach to lasso (least absolute shrinkage and selection operator) regression was proposed by Park and Casella (2008). For linear regression, the lasso approach essentially involves the replacement of

\beta_j \,|\, \tau_\beta \stackrel{\text{ind.}}{\sim} N(0, \tau_\beta^{-1}) \qquad \mbox{by} \qquad \beta_j \,|\, \tau_\beta \stackrel{\text{ind.}}{\sim} \mbox{Laplace}(0, \tau_\beta^{-1}),  (41)

where \beta_j is the coefficient of the jth predictor variable. Replacement (41) corresponds to the form of the penalty changing from

\lambda \sum_{j=1}^p \beta_j^2 \qquad \mbox{to} \qquad \lambda \sum_{j=1}^p |\beta_j|.

For frequentist lasso regression (Tibshirani 1996) the latter penalty produces sparse regression fits, in that the estimated coefficients can become exactly zero. In the Bayesian case the Bayes estimates do not provide exact annihilation of coefficients, but produce approximate sparseness (Park and Casella 2008). At this point we note that, during the years following the appearance of Park and Casella (2008), several other proposals for sparse shrinkage of the \beta_j appeared in the literature (e.g., Armagan et al. 2013; Carvalho et al. 2010; Griffin and Brown 2011). Their accommodation in Infer.NET could also be entertained. Here we restrict attention to Laplace-based shrinkage.

We commence with the goal of fitting the following model in Infer.NET:

y \,|\, \beta_0, \beta, \tau_\varepsilon \sim N(1\beta_0 + X\beta, \tau_\varepsilon^{-1} I), \qquad \beta_0 \sim N(0, \tau_{\beta_0}^{-1}),
\beta_j \,|\, \tau_\beta \stackrel{\text{ind.}}{\sim} \mbox{Laplace}(0, \tau_\beta^{-1}),  (42)
\tau_\beta^{-1/2} \sim \mbox{Half-Cauchy}(A_\beta), \qquad \tau_\varepsilon^{-1/2} \sim \mbox{Half-Cauchy}(A_\varepsilon).

The current release of Infer.NET does not support the Laplace distribution, so we require an auxiliary variable representation in a similar vein to that used for the t-distribution in Section 5. We first note the distributional statement

x \,|\, \tau \sim \mbox{Laplace}(0, \tau^{-1}) \quad \mbox{if and only if} \quad x \,|\, g \sim N(0, g), \quad g \,|\, \tau \sim \mbox{Gamma}(1, \tfrac{\tau}{2}),  (43)

which is exploited by Park and Casella (2008). However, auxiliary variable representation (43) is also not supported by Infer.NET, due to violation of its conjugacy rules.
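For completeness, the mixture direction of (43) is a standard calculation (ours, not the paper's):

\int_0^\infty (2\pi g)^{-1/2} e^{-x^2/(2g)} \cdot \tfrac{\tau}{2}\, e^{-\tau g/2}\, dg = \frac{\sqrt{\tau}}{2}\, e^{-\sqrt{\tau}\,|x|},

which is the Laplace(0, \tau^{-1}) density in the parametrization of Table 1 (i.e., with \sigma = \tau^{-1/2}).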
We get around this by working with the alternative auxiliary variable set-up:

x \,|\, c \sim N(0, 1/c), \qquad c \,|\, d \sim \mbox{Gamma}(M, M d), \qquad d \,|\, \tau \sim \mbox{Gamma}(1, \tfrac{\tau}{2}),  (44)

which is supported by Infer.NET, and leads to a good approximation to the Laplace(0, \tau^{-1}) distribution when M is large. For any given M > 0, the density function of x satisfying auxiliary variable representation (44) is

p(x \,|\, \tau; M) = \int_0^\infty \int_0^\infty \frac{(Md)^M c^{M-1} e^{-Mdc}}{\Gamma(M)}\, (2\pi/c)^{-1/2} e^{-c x^2/2} \cdot \tfrac{\tau}{2}\, e^{-d\tau/2}\, dc\, dd
= \frac{\tau^{1/2}\, \Gamma(M + \tfrac12)}{2\, \Gamma(M)\, M^{1/2}}\; {}_1F_1\!\left( M + \tfrac12;\ \tfrac12;\ \frac{\tau x^2}{4M} \right) - \frac{\tau |x|}{2}\; {}_1F_1\!\left( M + 1;\ \tfrac32;\ \frac{\tau x^2}{4M} \right),

where {}_1F_1 denotes the degenerate hypergeometric function (e.g., Gradshteyn and Ryzhik 1994). Figure 9 shows p(x \,|\, \tau; M) for \tau = 1 and M = 1, 10, 30. The approximation to the Laplace(0, 1) density function is excellent for M as low as 10.

Figure 9: The exact Laplace(0, 1) density function p(x) = (1/2) exp(-|x|) and three approximations to it based on p(x \,|\, \tau; M) for \tau = 1 and M = 1, 10, 30.

The convergence in distribution, as M \to \infty, of random variables having the density function p(x \,|\, \tau; M) to a Laplace(0, \tau^{-1}) random variable may be established using (27), together with the fact that the family of t distributions tends to a Normal distribution as the degrees of freedom parameter increases, and then mixing this limiting distribution with a Gamma(1, \tfrac{\tau}{2}) variable.

In lieu of (42), the actual model fitted in Infer.NET is:

y \,|\, \beta_0, \beta, \tau_\varepsilon \sim N(1\beta_0 + X\beta, \tau_\varepsilon^{-1} I),
a \,|\, \beta_0, \beta, c_1, \ldots, c_p \sim N\left( \begin{bmatrix} \beta_0 \\ \beta \end{bmatrix}, \begin{bmatrix} \tau_{\beta_0}^{-1} & 0 \\ 0 & \mbox{diag}_{1 \le j \le p}(1/c_j) \end{bmatrix} \right), \qquad \begin{bmatrix} \beta_0 \\ \beta \end{bmatrix} \sim N(0, \kappa^{-1} I),
c_j \,|\, d_j \stackrel{\text{ind.}}{\sim} \mbox{Gamma}(M, M d_j), \qquad d_j \,|\, \tau_\beta \stackrel{\text{ind.}}{\sim} \mbox{Gamma}(1, \tfrac{\tau_\beta}{2}),  (45)
\tau_\beta \,|\, b_\beta \sim \mbox{Gamma}(\tfrac12, b_\beta), \qquad b_\beta \sim \mbox{Gamma}(\tfrac12, 1/A_\beta^2),
\tau_\varepsilon \,|\, b_\varepsilon \sim \mbox{Gamma}(\tfrac12, b_\varepsilon), \qquad b_\varepsilon \sim \mbox{Gamma}(\tfrac12, 1/A_\varepsilon^2).

We use \kappa = 10^{-10}, M = 100, \tau_{\beta_0} = 10^{-10} and A_\beta = A_\varepsilon = 10^5. A sketch of how representation (44) can be coded in Infer.NET is given below.
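A hypothetical sketch for a single coefficient (the names tauBeta, M, d, c and betaj are our assumptions for illustration, not taken from the paper's code):

// By (44), betaj is approximately Laplace(0, 1/tauBeta) for large M.
Variable<double> d = Variable.GammaFromShapeAndRate(1.0, 0.5*tauBeta);
Variable<double> c = Variable.GammaFromShapeAndRate(M, M*d);
Variable<double> betaj = Variable.GaussianFromMeanAndPrecision(0.0, c);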
Our illustration of (45) involves data from a diabetes study that was also used in Park and Casella (2008). The sample size is n = 442 and the number of predictor variables is p = 10. The response variable is a continuous index of disease progression one year after baseline. A description of the predictor variables is given in Efron et al. (2004). Their abbreviations, used in Park and Casella (2008), are: (1) age, (2) sex, (3) bmi, (4) map, (5) tc, (6) ldl, (7) hdl, (8) tch, (9) ltg and (10) glu. Figure 10 shows the fits obtained from both Infer.NET and MCMC. We see that the correspondence is very good, especially for the regression coefficients.

Figure 10: Approximate posterior density functions of model parameters in fitting (45) to data from a diabetes study.

9. Measurement error model

In some situations, some variables may be measured with error: the observed data are not the true values of those variables themselves, but a contaminated version. Examples of such situations include AIDS studies where CD4 counts are known to be measured with error (Wu 2002) or air pollution data (Bachrach et al. 1999). Carroll et al. (2006) offer a comprehensive treatment of measurement error models. As described there, failure to account for measurement error can result in biased and misleading conclusions.

Let (x_i, y_i), 1 \le i \le n, be a set of predictor/response pairs that are modeled according to

y_i \,|\, x_i, \beta_0, \beta_1, \tau_\varepsilon \stackrel{\text{ind.}}{\sim} N(\beta_0 + \beta_1 x_i, \tau_\varepsilon^{-1}), \qquad 1 \le i \le n.  (46)

However, instead of observing the x_i we observe

w_i \,|\, x_i \stackrel{\text{ind.}}{\sim} N(x_i, \tau_z^{-1}), \qquad 1 \le i \le n.  (47)

In other words, the predictors are measured with error, with \tau_z controlling the extent of the contamination. In general, \tau_z can be estimated through validation data, where some of the x_i values are observed, or replication data, where replicates of the w_i values are available. To simplify exposition we will assume that \tau_z is known. Furthermore, we will assume that the x_i are Normally distributed:

x_i \,|\, \mu_x, \tau_x \stackrel{\text{ind.}}{\sim} N(\mu_x, \tau_x^{-1}), \qquad 1 \le i \le n,  (48)

where \mu_x and \tau_x are additional parameters to be estimated. Combining (46), (47) and (48), and including the aforementioned priors for variances, a Bayesian measurement error model can be represented by

y \,|\, x, \beta, \tau_\varepsilon \sim N(X\beta, \tau_\varepsilon^{-1} I), \quad w \,|\, x \sim N(x, \tau_z^{-1} I), \quad x \,|\, \mu_x, \tau_x \sim N(\mu_x 1, \tau_x^{-1} I),
\beta \sim N(0, \tau_\beta^{-1} I), \quad \mu_x \sim N(0, \tau_\mu^{-1}),  (49)
\tau_\varepsilon^{-1/2} \sim \mbox{Half-Cauchy}(A_\varepsilon), \quad \tau_x^{-1/2} \sim \mbox{Half-Cauchy}(A_x),

where \beta = [\beta_0, \beta_1]^T, X = [1, x], \tau_\beta, \tau_\mu, A_\varepsilon and A_x are user-specified constants, and the vectors y and w are observed. We set \tau_\beta = \tau_\mu = 10^{-10} and A_\varepsilon = A_x = 10^5. Invoking (1), we arrive at the model

y \,|\, x, \beta, \tau_\varepsilon \sim N(X\beta, \tau_\varepsilon^{-1} I), \quad w \,|\, x \sim N(x, \tau_z^{-1} I), \quad x \,|\, \mu_x, \tau_x \sim N(\mu_x 1, \tau_x^{-1} I),
\beta \sim N(0, \tau_\beta^{-1} I), \quad \mu_x \sim N(0, \tau_\mu^{-1}),
\tau_\varepsilon \,|\, b_\varepsilon \sim \mbox{Gamma}(\tfrac12, b_\varepsilon), \qquad b_\varepsilon \sim \mbox{Gamma}(\tfrac12, 1/A_\varepsilon^2),  (50)
\tau_x \,|\, b_x \sim \mbox{Gamma}(\tfrac12, b_x), \qquad b_x \sim \mbox{Gamma}(\tfrac12, 1/A_x^2).

Infer.NET 2.5, Beta 2 does not appear to be able to fit the model (50) under the product restriction

q(\beta, \tau_\varepsilon, b_\varepsilon, x, \mu_x, \tau_x, b_x) = q(\beta_0, \beta_1)\, q(\mu_x)\, q(x)\, q(\tau_\varepsilon, \tau_x)\, q(b_\varepsilon, b_x).

However, Infer.NET was able to fit this model with

q(\beta, \tau_\varepsilon, b_\varepsilon, x, \mu_x, \tau_x, b_x) = q(\beta_0)\, q(\beta_1)\, q(\mu_x)\, q(x)\, q(\tau_\varepsilon, \tau_x)\, q(b_\varepsilon, b_x).  (51)

Unfortunately, fitting under restriction (51) leads to poor accuracy. One possible remedy involves the centering transformation

\tilde{w}_i = w_i - \bar{w}.

Under this transformation, and with diffuse priors, the marginal posterior distributions of \beta_0 and \beta_1 are nearly independent. Let \tilde{q}(\beta_0), \tilde{q}(\beta_1), \tilde{q}(\mu_x), \tilde{q}(x), \tilde{q}(\tau_\varepsilon, \tau_x) and \tilde{q}(b_\varepsilon, b_x) be the optimal values of q(\beta_0), q(\beta_1), q(\mu_x), q(x), q(\tau_\varepsilon, \tau_x) and q(b_\varepsilon, b_x) when the transformed \tilde{w}_i are used in place of the w_i. This corresponds to the transformation on the X matrix

\tilde{X} = X R, \qquad \mbox{where} \qquad R = \begin{bmatrix} 1 & 0 \\ -\bar{w} & 1 \end{bmatrix}.

Suppose that

\tilde{q}(\beta_0) = N(\tilde{\mu}_{q(\beta_0)}, \tilde{\sigma}^2_{q(\beta_0)}), \qquad \tilde{q}(\beta_1) = N(\tilde{\mu}_{q(\beta_1)}, \tilde{\sigma}^2_{q(\beta_1)}),
\tilde{q}(\mu_x) = N(\tilde{\mu}_{q(\mu_x)}, \tilde{\sigma}^2_{q(\mu_x)}), \qquad \tilde{q}(x_i) = N(\tilde{\mu}_{q(x_i)}, \tilde{\sigma}^2_{q(x_i)}), \qquad 1 \le i \le n.

Then we can back-transform approximately using

q(\beta_0, \beta_1) = N(\mu_{q(\beta)}, \Sigma_{q(\beta)}), \qquad q(\mu_x) = N(\mu_{q(\mu_x)}, \sigma^2_{q(\mu_x)}), \qquad q(x_i) = N(\mu_{q(x_i)}, \sigma^2_{q(x_i)}), \qquad 1 \le i \le n,

where

\mu_{q(\beta)} = R \begin{bmatrix} \tilde{\mu}_{q(\beta_0)} \\ \tilde{\mu}_{q(\beta_1)} \end{bmatrix}, \qquad \Sigma_{q(\beta)} = R \begin{bmatrix} \tilde{\sigma}^2_{q(\beta_0)} & 0 \\ 0 & \tilde{\sigma}^2_{q(\beta_1)} \end{bmatrix} R^T,
\mu_{q(\mu_x)} = \tilde{\mu}_{q(\mu_x)} + \bar{w}, \qquad \sigma^2_{q(\mu_x)} = \tilde{\sigma}^2_{q(\mu_x)},
\mu_{q(x_i)} = \tilde{\mu}_{q(x_i)} + \bar{w}, \qquad \sigma^2_{q(x_i)} = \tilde{\sigma}^2_{q(x_i)}, \qquad 1 \le i \le n,
q(\tau_\varepsilon, \tau_x) = \tilde{q}(\tau_\varepsilon, \tau_x) \qquad \mbox{and} \qquad q(b_\varepsilon, b_x) = \tilde{q}(b_\varepsilon, b_x).

To illustrate the use of Infer.NET we simulated data according to

x_i \stackrel{\text{ind.}}{\sim} N(1/2, 1/36), \qquad w_i \,|\, x_i \stackrel{\text{ind.}}{\sim} N(x_i, 1/100), \qquad y_i \,|\, x_i \stackrel{\text{ind.}}{\sim} N(0.3 + 0.7 x_i, 1/16)  (52)

for 1 \le i \le n with n = 50.
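Data of the form (52) are straightforward to generate; for instance (a self-contained sketch, not the paper's code, using a Box-Muller draw):

var rng = new Random(1);
Func<double,double,double> rnorm = (mu, sd) =>
{
    double u1 = 1.0 - rng.NextDouble(), u2 = rng.NextDouble();
    return mu + sd*Math.Sqrt(-2.0*Math.Log(u1))*Math.Cos(2.0*Math.PI*u2);
};
int n = 50;
double[] x = new double[n], w = new double[n], y = new double[n];
for (int i = 0; i < n; i++)
{
    x[i] = rnorm(0.5, 1.0/6.0);            // tau_x = 36, so sd = 1/6
    w[i] = rnorm(x[i], 0.1);               // tau_z = 100, so sd = 0.1
    y[i] = rnorm(0.3 + 0.7*x[i], 0.25);    // tau_eps = 16, so sd = 0.25
}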
Results from a single simulation are illustrated in Figure 11. Similar results were obtained from other simulated data sets. From Figure 11 we see reasonable agreement of the posterior density estimates produced by Infer.NET and MCMC for all parameters.

Figure 11: Variational Bayes approximate posterior density functions produced by Infer.NET for simulated data (52).

Extension to the case where the mean of the y_i is modeled nonparametrically is covered in Pham et al. (2013). However, the current version of Infer.NET does not support models of this type. Such versatility limitations of Infer.NET are discussed in Section 11.1.

10. Missing predictor model

Missing data is a commonly occurring issue in statistical problems. Naive methods for dealing with this issue, such as ignoring samples containing missing values, can be demonstrably optimistic (i.e., standard errors are deflated), biased or simply an inefficient use of the data. Little and Rubin (2002) contains a detailed discussion of the various issues at stake.

Consider a simple linear regression model with response y_i and predictor x_i:

y_i \,|\, x_i, \beta_0, \beta_1, \tau_\varepsilon \stackrel{\text{ind.}}{\sim} N(\beta_0 + \beta_1 x_i, \tau_\varepsilon^{-1}), \qquad 1 \le i \le n.  (53)

Suppose that some of the x_i are missing and let r_i be an indicator of x_i being observed. Specifically,

r_i = 1 if x_i is observed, and r_i = 0 if x_i is missing, for 1 \le i \le n.

A missing data model contains at least two components:

• A model for the underlying distribution of the variable that is subject to missingness. We will use

x_i \,|\, \mu_x, \tau_x \stackrel{\text{ind.}}{\sim} N(\mu_x, \tau_x^{-1}), \qquad 1 \le i \le n.  (54)

• A model for the missing data mechanism, which models the probability distribution of r_i.

As detailed in Faes et al. (2011), there are several possible models, with varying degrees of sophistication, for the missing data mechanism. Here the simplest such model, known as missing completely at random, is used:

y \,|\, x, \beta, \tau_\varepsilon \sim N(X\beta, \tau_\varepsilon^{-1} I), \quad x \,|\, \mu_x, \tau_x \sim N(\mu_x 1, \tau_x^{-1} I), \quad r_i \stackrel{\text{ind.}}{\sim} \mbox{Bernoulli}(p),
\beta \sim N(0, \tau_\beta^{-1} I), \quad \mu_x \sim N(0, \tau_\mu^{-1}),
\tau_\varepsilon \,|\, b_\varepsilon \sim \mbox{Gamma}(\tfrac12, b_\varepsilon), \qquad b_\varepsilon \sim \mbox{Gamma}(\tfrac12, 1/A_\varepsilon^2),  (55)
\tau_x \,|\, b_x \sim \mbox{Gamma}(\tfrac12, b_x), \qquad b_x \sim \mbox{Gamma}(\tfrac12, 1/A_x^2),

where \beta = [\beta_0, \beta_1]^T, X = [1, x] and \tau_\beta, \tau_\mu, A_\varepsilon and A_x are user-specified hyperparameters. Again, we use \tau_\beta = \tau_\mu = 10^{-10} and A_\varepsilon = A_x = 10^5.

The simulated data that we use in our illustration of Infer.NET involve the following sample size and 'true values':

n = 50, \quad \beta_0 = 0.3, \quad \beta_1 = 0.7, \quad \mu_x = 0.5, \quad \tau_x = 36 \quad \mbox{and} \quad p = 0.75.  (56)

Putting p = 0.75 implies that, on average, 25% of the predictor values are missing completely at random.

Infer.NET 2.5, Beta 2 is not able to fit model (55) under the product restriction

q(\beta, \tau_\varepsilon, b_\varepsilon, x_{mis}, \mu_x, \tau_x, b_x) = q(\beta_0, \beta_1)\, q(\mu_x)\, q(x_{mis})\, q(\tau_\varepsilon, \tau_x)\, q(b_\varepsilon, b_x),  (57)

where x_{mis} denotes the vector of missing predictors, but is able to handle

q(\beta, \tau_\varepsilon, b_\varepsilon, x_{mis}, \mu_x, \tau_x, b_x) = q(\beta_0)\, q(\beta_1)\, q(\mu_x)\, q(x_{mis})\, q(\tau_\varepsilon, \tau_x)\, q(b_\varepsilon, b_x),  (58)

which was used in our illustration of Infer.NET for such models. We also used BUGS to perform Bayesian inference via MCMC as a benchmark for comparison of results.

For similar reasons as in the previous section, fitting under restriction (58) leads to poor results. Instead we perform the centering transformation

\tilde{x}_i = x_i - \bar{x}_{obs},

where \bar{x}_{obs} is the mean of the observed x values. This transformation makes the marginal posterior distributions of \beta_0 and \beta_1 nearly independent. Let \tilde{q}(\beta_0), \tilde{q}(\beta_1), \tilde{q}(\mu_x), \tilde{q}(x_{mis}), \tilde{q}(\tau_\varepsilon, \tau_x) and \tilde{q}(b_\varepsilon, b_x) be the optimal values of q(\beta_0), q(\beta_1), q(\mu_x), q(x_{mis}), q(\tau_\varepsilon, \tau_x) and q(b_\varepsilon, b_x) when the transformed \tilde{x}_i are used in place of the x_i. This corresponds to the transformation on the X matrix

\tilde{X} = X R, \qquad \mbox{where} \qquad R = \begin{bmatrix} 1 & 0 \\ -\bar{x}_{obs} & 1 \end{bmatrix}.
Suppose that

\tilde{q}(\beta_0) = N(\tilde{\mu}_{q(\beta_0)}, \tilde{\sigma}^2_{q(\beta_0)}), \qquad \tilde{q}(\beta_1) = N(\tilde{\mu}_{q(\beta_1)}, \tilde{\sigma}^2_{q(\beta_1)}),
\tilde{q}(\mu_x) = N(\tilde{\mu}_{q(\mu_x)}, \tilde{\sigma}^2_{q(\mu_x)}), \qquad \tilde{q}(x_{mis,i}) = N(\tilde{\mu}_{q(x_{mis,i})}, \tilde{\sigma}^2_{q(x_{mis,i})}), \qquad 1 \le i \le n_{mis}.

Then we back-transform approximately using

q(\beta_0, \beta_1) = N(\mu_{q(\beta)}, \Sigma_{q(\beta)}), \qquad q(\mu_x) = N(\mu_{q(\mu_x)}, \sigma^2_{q(\mu_x)}), \qquad q(x_{mis,i}) = N(\mu_{q(x_{mis,i})}, \sigma^2_{q(x_{mis,i})}), \qquad 1 \le i \le n_{mis},

where

\mu_{q(\beta)} = R \begin{bmatrix} \tilde{\mu}_{q(\beta_0)} \\ \tilde{\mu}_{q(\beta_1)} \end{bmatrix}, \qquad \Sigma_{q(\beta)} = R \begin{bmatrix} \tilde{\sigma}^2_{q(\beta_0)} & 0 \\ 0 & \tilde{\sigma}^2_{q(\beta_1)} \end{bmatrix} R^T,
\mu_{q(\mu_x)} = \tilde{\mu}_{q(\mu_x)} + \bar{x}_{obs}, \qquad \sigma^2_{q(\mu_x)} = \tilde{\sigma}^2_{q(\mu_x)},
\mu_{q(x_{mis,i})} = \tilde{\mu}_{q(x_{mis,i})} + \bar{x}_{obs}, \qquad \sigma^2_{q(x_{mis,i})} = \tilde{\sigma}^2_{q(x_{mis,i})}, \qquad 1 \le i \le n_{mis},
q(\tau_\varepsilon, \tau_x) = \tilde{q}(\tau_\varepsilon, \tau_x) \qquad \mbox{and} \qquad q(b_\varepsilon, b_x) = \tilde{q}(b_\varepsilon, b_x).

Results from a single simulation are illustrated in Figure 12. Similar results were obtained from other simulated data sets. From Figure 12 we see reasonable agreement of the posterior density estimates produced by Infer.NET and MCMC for all parameters. Extension to the case where the mean of y is modeled nonparametrically is covered in Faes et al. (2011). However, the current version of Infer.NET does not support models of this type.

Figure 12: Variational Bayes approximate posterior density functions produced by Infer.NET for data simulated according to (56).

11. Comparison with BUGS

We now make some comparisons between Infer.NET and BUGS in the context of Bayesian semiparametric regression.

11.1. Versatility comparison

These comments mainly echo those given in Section 5 of Wang and Wand (2011), so are quite brief. BUGS is much more versatile than Infer.NET, with the latter subject to restrictions such as standard distributional forms, conjugacy rules and not being able to handle models such as (13) without the trick manifest in (15). The measurement error and missing data regression examples given in Sections 9 and 10 are feasible in Infer.NET for parametric regression, but not for semiparametric regression such as the examples in Sections 4, 5 and 8 of Marley and Wand (2010). Nevertheless, as demonstrated in this article, there is a wide variety of semiparametric regression models that can be handled by Infer.NET.

11.2. Accuracy comparison

In general, BUGS can always be more accurate than Infer.NET, since MCMC suffers only from Monte Carlo error, which can be made arbitrarily small via larger sample sizes. The Infer.NET inference engines, expectation propagation and variational message passing, have inherent approximation errors that cannot be completely eliminated. However, our examples show that the accuracy of Infer.NET is quite good for a wide variety of semiparametric regression models.

11.3. Timing comparison

Table 2 gives some indication of the relative computing times for the examples in the paper. In this table we report the average elapsed time (and standard error) of the computing times over 100 runs, with the number of variational Bayes or expectation propagation iterations set to 100 and the MCMC sample sizes set at 10000. This was sufficient for convergence in these particular examples. All examples were run on the third author's laptop computer (64 bit Windows 8, Intel i7-4930MX central processing unit at 3GHz, 32GB of random access memory). We concede that comparison of deterministic and Monte Carlo algorithms is fraught with difficulties. However, convergence criteria aside, these times give an indication of the performance in terms of the implementations of each of the algorithms for particular models on a fast 2014 computer.
Note that we did not fit the geoadditive model in Section 7 via BUGS due to the excessive time required.

section | semiparametric regression model          | Infer.NET time (s) | BUGS time (s)
3       | Simple semiparametric regression         | 1.55 (0.01)        | 8.60 (0.01)
4       | Generalized additive model (logistic)    | 3.09 (0.01)        | 81.30 (0.13)
4       | Generalized additive model (probit)      | 25.80 (0.07)       | 79.85 (0.04)
5       | Robust nonparametric regression          | 80.38 (0.03)       | 22.56 (0.06)
6       | Semiparametric mixed model               | 666.73 (0.56)      | 477.59 (0.06)
7       | Geoadditive model                        | 1231.12 (3.05)     | not fitted
8       | Bayesian lasso regression                | 1.80 (0.01)        | 199.59 (0.30)
9       | Measurement error model                  | 1.50 (0.01)        | 7.05 (0.02)
10      | Missing predictor model                  | 1.53 (0.01)        | 5.51 (0.03)

Table 2: Average run times (standard errors) in seconds over 100 runs of the methods for each of the examples in the paper.

Table 2 reveals that Infer.NET offers considerable speed-ups compared with BUGS for most of the examples. The exceptions are the robust nonparametric regression model of Section 5 and the semiparametric mixed model of Section 6. The slowness of the robust nonparametric regression fit is mainly explained by the multiple calls to Infer.NET, corresponding to the degrees of freedom grid. The semiparametric mixed model is quite slow in the current release of Infer.NET due to the full sparse design matrices being carried around in the calculations. Recently developed streamlined approaches to variational inference for longitudinal and multilevel data analysis (Lee and Wand 2014) offer significant speed-ups. The geoadditive model of Section 7 is also slow to fit, with the large design matrices being a likely reason.

12. Summary

Through several examples, the efficacy of Infer.NET for semiparametric regression analysis has been demonstrated. Generally speaking, the fitting and inference is shown to be quite fast and accurate. Models with very large design matrices are an exception and can take significant amounts of time to fit using the current release of Infer.NET. Our survey of Infer.NET in the context of semiparametric regression will aid future analyses of this type via fast graphical models software, by building upon the examples presented here. It may also influence future directions for the development of fast graphical models methodology and software.

Appendix: Details of Infer.NET fitting of a simple semiparametric model

This section provides an extensive description of the Infer.NET program for semiparametric regression analysis of the onions data set based on model (15) in Section 3. The data are first transformed as explained in Section 3 and these transformed versions are represented by y and C = [X Z]. Thereafter, the text files K.txt, y.txt, Cmat.txt, sigsqBeta.txt, Au.txt, Aeps.txt and nIterVB.txt are generated. The first three files contain the number of spline basis functions, y and C, respectively. The following three text files each contain a single number and represent the values for the hyperparameters: \sigma_\beta^2 = \tau_\beta^{-1} = 10^{10}, A_u = 10^5 and A_\varepsilon = 10^5. The last file, nIterVB.txt, contains a single positive integer (here, 100) that specifies the number of mean field variational Bayes iterations. All these files were set up in R.

The actual Infer.NET code is a C# script, which can be run from Visual Studio 2010 or DOS within the Microsoft Windows operating system. The following paragraphs explain the different commands in the C# script.
Firstly, all required C# and Infer.NET libraries are loaded:

using System;
using System.Collections.Generic;
using System.Text;
using System.IO;
using System.Reflection;
using MicrosoftResearch.Infer;
using MicrosoftResearch.Infer.Distributions;
using MicrosoftResearch.Infer.Maths;
using MicrosoftResearch.Infer.Models;

Next, the full path name of the directory containing the above-mentioned input files needs to be specified:

string dir = "C:\\research\\inferNet\\onions\\";

The values for the hyperparameters, number of variational Bayes iterations and number of spline basis functions are imported using the commands:

double sigsqBeta = new DataTable(dir+"sigsqBeta.txt").DoubleNumeric;
double Au = new DataTable(dir+"Au.txt").DoubleNumeric;
double Aeps = new DataTable(dir+"Aeps.txt").DoubleNumeric;
int nIterVB = new DataTable(dir+"nIterVB.txt").IntNumeric;
int K = new DataTable(dir+"K.txt").IntNumeric;

Reading in the data y and C proceeds as follows:

DataTable yTable = new DataTable(dir+"y.txt");
DataTable C = new DataTable(dir+"Cmat.txt");

These commands make use of the C# object class DataTable. This class, which was written by the authors, facilitates the input of general rectangular arrays of numbers when in a text file. The following line extracts the number of observations from the data matrix C:

int n = C.numRow;

The observed values of C and y, which are stored in the objects C and yTable, are assigned to the objects cvec and y via the next set of commands:

Vector[] cvecTable = new Vector[n];
for (int i = 0; i < n; i++)
{
   cvecTable[i] = Vector.Zero(3+K);
   for (int j = 0; j < 3+K; j++)
      cvecTable[i][j] = C.DataMatrix[i,j];
}
Range index = new Range(n).Named("index");
VariableArray<Vector> cvec = Variable.Array<Vector>(index).Named("cvec");
cvec.ObservedValue = cvecTable;
VariableArray<double> y = Variable.Array<double>(index).Named("y");
y.ObservedValue = yTable.arrayForIN;

The function arrayForIN is a member of the class DataTable and stores the data as an array to match the type of y.ObservedValue. The resulting objects are now directly used to specify the prior distributions. First, the commands in code chunk (16) in Section 3 are used to set up the priors for \tau_u and \tau_\varepsilon, while the trick involving the auxiliary variable a \equiv 0 in (15) is coded as:

double kappa = 1.0E-10;   // the small prior precision of Section 3
Vector zeroVec = Vector.Zero(3+K);
PositiveDefiniteMatrix betauPrecDummy = new PositiveDefiniteMatrix(3+K,3+K);
for (int j = 0; j < 3+K; j++)
   betauPrecDummy[j,j] = kappa;
Variable<Vector> betauWork = Variable.VectorGaussianFromMeanAndPrecision(
   zeroVec,betauPrecDummy).Named("betauWork");
Variable<double>[] betauDummy = new Variable<double>[3+K];
for (int j = 0; j < 3+K; j++)
   betauDummy[j] = Variable.GetItem(betauWork,j);
Variable<double>[] a = new Variable<double>[3+K];
for (int j = 0; j < 3; j++)
{
   // fixed effects block: a[j] | beta ~ N(beta_j, sigsqBeta)
   a[j] = Variable.GaussianFromMeanAndVariance(betauDummy[j], sigsqBeta);
   a[j].ObservedValue = 0.0;
}
for (int k = 0; k < K; k++)
{
   // spline coefficient block: a[3+k] | u ~ N(u_k, 1/tauu)
   a[3+k] = Variable.GaussianFromMeanAndPrecision(betauDummy[3+k], tauu);
   a[3+k].ObservedValue = 0.0;
}

The command to specify the likelihood for model (15) is listed as code chunk (17) in Section 3. The inference engine and the number of mean field variational Bayes iterations are specified by means of code chunk (18).
Finally, the following commands write the estimated values of the parameters of the approximate posteriors to the files mu.q.betau.txt, Sigma.q.betau.txt, parms.q.tauEps.txt and parms.q.tauu.txt:

SaveData.SaveTable(engine.Infer<VectorGaussian>(betauWork).GetMean(),
    dir+"mu.q.betau.txt");
SaveData.SaveTable(engine.Infer<VectorGaussian>(betauWork).GetVariance(),
    dir+"Sigma.q.betau.txt");
SaveData.SaveTable(engine.Infer<Gamma>(tauEps), dir+"parms.q.tauEps.txt");
SaveData.SaveTable(engine.Infer<Gamma>(tauu), dir+"parms.q.tauu.txt");

These commands involve the C# function SaveTable, a method in the class SaveData that we wrote to facilitate writing the output from Infer.NET to a text file. Summary plots, such as Figure 1, can be made in R after back-transforming the approximate posterior density parameters.

Acknowledgments

This research was partially funded by the Australian Research Council Discovery Project DP110100061. We are grateful to Andrew Chernih for his provision of the Sydney property data.

References

Armagan A, Dunson D, Lee J (2013). “Generalized Double Pareto Shrinkage.” Statistica Sinica, 23, 119–143.

Bachrach LK, Hastie T, Wang MC, Narasimhan B, Marcus R (1999). “Bone Mineral Acquisition in Healthy Asian, Hispanic, Black, and Caucasian Youth: A Longitudinal Study.” Journal of Clinical Endocrinology & Metabolism, 84(12), 4702–4712.

Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu C (2006). Measurement Error in Nonlinear Models: A Modern Perspective. 2nd edition. Chapman and Hall, London.

Carvalho CM, Polson NG, Scott J (2010). “The Horseshoe Estimator for Sparse Signals.” Biometrika, 97, 465–480.

Efron B, Hastie T, Johnstone I, Tibshirani R (2004). “Least Angle Regression.” The Annals of Statistics, 32, 407–451.

Faes C, Ormerod JT, Wand MP (2011). “Variational Bayesian Inference for Parametric and Nonparametric Regression with Missing Data.” Journal of the American Statistical Association, 106, 959–971.

Gelman A (2006). “Prior Distributions for Variance Parameters in Hierarchical Models.” Bayesian Analysis, 1, 515–533.

Gradshteyn I, Ryzhik I (1994). Table of Integrals, Series, and Products. Academic Press, Boston.

Griffin J, Brown P (2011). “Bayesian Hyper-Lassos with Non-Convex Penalization.” Australian and New Zealand Journal of Statistics, 53, 423–442.

Haberman S (1976). “Generalized Residuals for Log-Linear Models.” In Proceedings of the 9th International Biometrics Conference, pp. 104–122. Boston, USA.

Kammann EE, Wand MP (2003). “Geoadditive Models.” Journal of the Royal Statistical Society C, 52, 1–18.

Kaufman L, Rousseeuw PJ (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.

Knowles DA, Minka TP (2011). “Non-Conjugate Message Passing for Multinomial and Binary Regression.” In Advances in Neural Information Processing Systems 24, pp. 1701–1709.

Lange KL, Little RJA, Taylor JMG (1989). “Robust Statistical Modeling Using the t-Distribution.” Journal of the American Statistical Association, 84, 881–896.

Lee Y, Wand M (2014). “Streamlined Mean Field Variational Bayes for Longitudinal and Multilevel Data Analysis.” In progress.

Little RJA, Rubin DB (2002). Statistical Analysis with Missing Data. 2nd edition. Wiley, New York.

Lunn D, Thomas A, Best NG, Spiegelhalter DJ (2000). “WinBUGS – A Bayesian Modelling Framework: Concepts, Structure, and Extensibility.” Statistics and Computing, 10, 325–337.

Marley JK, Wand MP (2010). “Non-Standard Semiparametric Regression via BRugs.” Journal of Statistical Software, 37, 1–30.

Minka T (2001). “Expectation Propagation for Approximate Bayesian Inference.” In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pp. 362–369.

Minka T (2005). “Divergence Measures and Message Passing.” Microsoft Research Technical Report Series, MSR-TR-2005-173, 1–17.

Minka T, Winn J (2008). “Gates: A Graphical Notation for Mixture Models.” Microsoft Research Technical Report Series, MSR-TR-2008-185, 1–16.

Minka T, Winn J, Guiver J, Knowles D (2013). Infer.NET 2.5. URL http://research.microsoft.com/infernet.

Ormerod JT, Wand MP (2010). “Explaining Variational Approximations.” The American Statistician, 64, 140–153.

Park T, Casella G (2008). “The Bayesian Lasso.” Journal of the American Statistical Association, 103, 681–686.

Pham T, Ormerod JT, Wand MP (2013). “Mean Field Variational Bayesian Inference for Nonparametric Regression with Measurement Error.” Computational Statistics and Data Analysis, 68, 375–387.

Ruppert D, Wand MP, Carroll RJ (2003). Semiparametric Regression. Cambridge University Press, New York.

Ruppert D, Wand MP, Carroll RJ (2009). “Semiparametric Regression During 2003–2007.” Electronic Journal of Statistics, 3, 1193–1256.

Saul LK, Jordan MI (1996). “Exploiting Tractable Substructures in Intractable Networks.” In Advances in Neural Information Processing Systems, pp. 435–442. MIT Press, Cambridge, Massachusetts.

Staudenmayer J, Lake EE, Wand MP (2009). “Robustness for General Design Mixed Models Using the t-Distribution.” Statistical Modelling, 9, 235–255.

Tibshirani R (1996). “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society, Series B, 58, 267–288.

Wainwright M, Jordan MI (2008). “Graphical Models, Exponential Families, and Variational Inference.” Foundations and Trends in Machine Learning, 1, 1–305.

Wand MP (2009). “Semiparametric Regression and Graphical Models.” Australian and New Zealand Journal of Statistics, 51, 9–41.

Wand MP, Ormerod JT (2008). “On O'Sullivan Penalised Splines and Semiparametric Regression.” Australian and New Zealand Journal of Statistics, 50, 179–198.

Wand MP, Ormerod JT, Padoan SA, Frühwirth R (2011). “Mean Field Variational Bayes for Elaborate Distributions.” Bayesian Analysis, 6(4), 847–900.

Wand MP, Ripley BD (2010). KernSmooth 2.23. Functions for Kernel Smoothing Corresponding to the Book: Wand, M.P. and Jones, M.C. (1995) Kernel Smoothing. URL http://cran.r-project.org.

Wang SSJ, Wand MP (2011). “Using Infer.NET for Statistical Analyses.” The American Statistician, 65, 115–126.

Winn J, Bishop CM (2005). “Variational Message Passing.” Journal of Machine Learning Research, 6, 661–694.

Wu L (2002). “A Joint Model for Nonlinear Mixed-Effects Models with Censored Response and Covariates Measured with Error, with Application to AIDS Studies.” Journal of the American Statistical Association, 97, 700–709.

Affiliation:

Jan Luts
TheSearchParty.com Pty Ltd.
Level 1, 79 Commonwealth Street, Surry Hills 2011, Australia
E-mail: [email protected]

Shen S.J. Wang
Department of Mathematics
The University of Queensland
Brisbane 4072, Australia
E-mail: [email protected]

John T. Ormerod
School of Mathematics and Statistics
University of Sydney
Sydney 2006, Australia
E-mail: [email protected]
URL: http://www.maths.usyd.edu.au/u/jormerod/

Matt P. Wand
School of Mathematical Sciences
University of Technology, Sydney
Broadway 2007, Australia
E-mail: [email protected]
URL: http://matt-wand.utsacademics.info

Journal of Statistical Software
published by the American Statistical Association
Volume VV, Issue II
MMMMMM YYYY
http://www.jstatsoft.org/
http://www.amstat.org/
Submitted: yyyy-mm-dd
Accepted: yyyy-mm-dd