University of Wollongong Research Online, University of Wollongong Thesis Collection, 2013

Recommended Citation: Neville, Sarah Elizabeth, Elaborate distribution semiparametric regression via mean field variational Bayes, Doctor of Philosophy thesis, School of Mathematics and Applied Statistics, University of Wollongong, 2013. http://ro.uow.edu.au/theses/3958

Elaborate Distribution Semiparametric Regression via Mean Field Variational Bayes

A thesis submitted in fulfilment of the requirements for the award of the degree Doctor of Philosophy from University of Wollongong by Sarah Elizabeth Neville, B Math (Advanced, Honours Class I), University of Wollongong. School of Mathematics and Applied Statistics, 2013.

CERTIFICATION
I, Sarah Elizabeth Neville, declare that this thesis, submitted in fulfilment of the requirements for the award of Doctor of Philosophy, in the School of Mathematics and Applied Statistics, University of Wollongong, is wholly my own work unless otherwise referenced or acknowledged. The document has not been submitted for qualifications at any other academic institution.
Sarah Elizabeth Neville, 12 October, 2013

Abstract
Mean field variational Bayes (MFVB) is a fast, deterministic inference tool for use in Bayesian hierarchical models. We develop and examine the performance of MFVB algorithms in semiparametric regression applications involving elaborate distributions. We assess the accuracy of MFVB in these settings via comparison with a Markov chain Monte Carlo (MCMC) baseline. MFVB methodology for Generalized Extreme Value additive models performs well, culminating in fast, accurate analysis of the Sydney hinterland maximum rainfall data. Quantile regression based on the Asymmetric Laplace distribution provides another area for successful application of MFVB. Examination of MFVB algorithms for continuous sparse signal shrinkage in univariate models illustrates the danger of naïve application of MFVB. This leads to development of a new tool to add to the MFVB armoury: continued fraction approximation of special functions using Lentz's Algorithm. MFVB performs well in both simple and more complex penalized wavelet regression models, illustrated by analysis of the radiation pneumonitis data. Overall, MFVB is a viable inference tool for semiparametric regression involving elaborate distributions. Generally, MFVB is good at retrieving trend estimates, but underestimates variability. MFVB is best used in applications where analysis is constrained by computational time and/or storage.

Dedication
This thesis is dedicated to my parents, Robert F. and Susan G. Neville. Your support, generosity and love over the past three years in particular have been phenomenal. I'm so lucky to have you!

Acknowledgements
Many thanks go to my supervisor Matt for being both a professional and personal mentor. Your enthusiasm, encouragement and humility have helped me navigate the treacherous journey of writing a thesis, and made me the researcher I am on the other side. To Maureen, my unofficial co-supervisor and great friend, you have been a fantastic support at my home base of the University of Wollongong.
Our coffees and office chats have regularly been the highlight of my day. To my best friend Lauren, your presence has made this journey one of laughter, and has helped me get through the tough stages. Our conversations over countless dinners, dog walks and trips helped me keep it all in perspective (well, most of the time). Amy, my sister, has been an encouraging force, always having faith in me even when my own would falter. Thanks for being there for me Boops! To my brother Bobby, wherever you are, I hope you’re proud of your little sister and that I can show this to you one of these days. Finally to Marley, Possum, Henry and Bronte. You have been the calm, warm and fun presence I needed over the course of my thesis writing. Contents 1 Introduction 1 1.1 Literature review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1.1 Variational approximations meet Bayesian models . . . . . . . . . . 1 1.1.2 Variational approximations in computer science . . . . . . . . . . . 2 1.1.3 Variational approximations emerging into the statistical literature . 2 1.1.4 Mean field variational Bayes . . . . . . . . . . . . . . . . . . . . . . 3 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2.1 Vector and matrix notation . . . . . . . . . . . . . . . . . . . . . . . 4 1.2.2 Distributional notation . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.3 Basics of mean field variational Bayes . . . . . . . . . . . . . . . . . . . . . 6 1.4 Graph theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.4.1 Directed acyclic graphs . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.4.2 Moral graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Definitions and results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 1.5.1 Non-analytic integral families . . . . . . . . . . . . . . . . . . . . . . 11 1.5.2 Special function definitions . . . . . . . . . . . . . . . . . . . . . . . 11 1.5.3 Additional function definitions and continued fraction representa- 1.2 1.5 2 tions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.5.4 Distributional definitions and results . . . . . . . . . . . . . . . . . 14 1.5.5 Matrix results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 1.6 Accuracy measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 1.7 Overview of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 Mean field variational Bayes inference for Generalised Extreme Value regression models 22 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.2 Direct mean field variational Bayes . . . . . . . . . . . . . . . . . . . . . . . 23 iii iv CONTENTS 2.3 Auxiliary mixture sampling approach . . . . . . . . . . . . . . . . . . . . . 25 2.4 Structured mean field variational Bayes . . . . . . . . . . . . . . . . . . . . 26 2.5 Finite normal mixture response regression . . . . . . . . . . . . . . . . . . . 27 2.5.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.5.2 Mean field variational Bayes . . . . . . . . . . . . . . . . . . . . . . 29 Generalized Extreme Value additive model . . . . . . . . . . . . . . . . . . 30 2.6.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.6.2 Mean field variational Bayes for the finite normal mixture response 2.6 additive model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. 34 Structured mean field variational Bayes . . . . . . . . . . . . . . . . 37 2.7 Displaying additive model fits . . . . . . . . . . . . . . . . . . . . . . . . . 37 2.8 Geoadditive extension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 2.9 New South Wales maximum rainfall data analysis . . . . . . . . . . . . . . 41 2.10 Comparisons with Markov chain Monte Carlo . . . . . . . . . . . . . . . . 47 2.11 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 2.A Derivation of Algorithm 1 and lower bound (2.11) . . . . . . . . . . . . . . 49 2.A.1 Full conditionals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 2.A.2 Optimal q ∗ densities . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 2.A.3 Derivation of lower bound (2.11) . . . . . . . . . . . . . . . . . . . . 58 2.B Derivation of Algorithm 2 and lower bound (2.17) . . . . . . . . . . . . . . 62 2.6.3 3 2.B.1 Full conditionals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 2.B.2 Optimal q ∗ densities . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 2.B.3 Derivation of lower bound (2.17) . . . . . . . . . . . . . . . . . . . . 70 Mean field variational Bayes for quantile regression 77 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 3.2 Parametric regression case . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 3.2.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 3.2.2 Mean field variational Bayes . . . . . . . . . . . . . . . . . . . . . . 79 Semiparametric regression case . . . . . . . . . . . . . . . . . . . . . . . . . 81 3.3.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 3.3.2 Mean field variational Bayes . . . . . . . . . . . . . . . . . . . . . . 82 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 3.4.1 Comparisons with Markov chain Monte Carlo . . . . . . . . . . . . 84 3.4.2 Accuracy study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 3.3 3.4 v CONTENTS 3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 3.A Derivation of Algorithm 4 and lower bound (3.4) . . . . . . . . . . . . . . . 90 3.A.1 Full conditionals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 3.A.2 Optimal q ∗ densities . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 3.A.3 Derivation of lower bound (3.4) . . . . . . . . . . . . . . . . . . . . . 96 3.B Derivation of Algorithm 5 and lower bound (3.9) . . . . . . . . . . . . . . . 103 4 3.B.1 Full conditionals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 3.B.2 Optimal q ∗ densities . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 3.B.3 Derivation of lower bound (3.9) . . . . . . . . . . . . . . . . . . . . . 106 Mean field variational Bayes for continuous sparse signal shrinkage 110 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 4.2 Horseshoe distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 4.3 4.4 4.5 4.2.1 Mean field variational Bayes . . . . . . . . . . . . . . . . . . . . . . 113 4.2.2 Simplicity comparison of Models II and III . . . . . . . . . . . . . . 117 4.2.3 Simulation comparison of Models II and III . . . . . . . . . . . . . . 117 4.2.4 Theoretical comparison of Models II and III . . . . . . . . . . . . . . 119 Normal-Exponential-Gamma distribution . . . . . . . . . . . . . . . 
. . . . 120 4.3.1 Mean field variational Bayes . . . . . . . . . . . . . . . . . . . . . . 122 4.3.2 Simulation comparison of Models II and III . . . . . . . . . . . . . . 123 4.3.3 Theoretical comparison of Models II and III . . . . . . . . . . . . . . 126 Generalized-Double-Pareto distribution . . . . . . . . . . . . . . . . . . . . 128 4.4.1 Mean field variational Bayes . . . . . . . . . . . . . . . . . . . . . . 129 4.4.2 Simulation study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 4.A Derivation of Algorithm 6 and lower bound (4.7) . . . . . . . . . . . . . . . 133 4.A.1 Full conditionals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 4.A.2 Optimal q ∗ Densities . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 4.A.3 Derivation of lower bound (4.7) . . . . . . . . . . . . . . . . . . . . . 136 4.B Proof of Theorem 4.2.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 4.C Derivation of Algorithm 8 and lower bound (4.12) . . . . . . . . . . . . . . 147 4.C.1 Full conditionals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 4.C.2 Optimal q ∗ Densities . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 4.C.3 Derivation of lower bound (4.12) . . . . . . . . . . . . . . . . . . . . 149 4.D Normal-Exponential-Gamma correlation . . . . . . . . . . . . . . . . . . . 152 vi CONTENTS 4.E Derivation of Algorithm 10 and lower bound (4.16) . . . . . . . . . . . . . 154 5 4.E.1 Full conditionals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154 4.E.2 Optimal q ∗ Densities . . . . . . . . . . . . . . . . . . . . . . . . . . . 154 4.E.3 Derivation of lower bound (4.16) . . . . . . . . . . . . . . . . . . . . 157 Mean field variational Bayes for penalised wavelet regression 160 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160 5.2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 5.3 5.2.1 Horseshoe prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 5.2.2 Normal-Exponential-Gamma prior . . . . . . . . . . . . . . . . . . . 162 5.2.3 Laplace-Zero prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 Mean field variational Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 5.3.1 Horseshoe prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164 5.3.2 Normal-Exponential-Gamma prior . . . . . . . . . . . . . . . . . . . 165 5.3.3 Laplace-Zero prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167 5.4 Displaying Laplace-Zero model fits . . . . . . . . . . . . . . . . . . . . . . . 167 5.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167 5.6 5.5.1 Horseshoe prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170 5.5.2 Normal-Exponential-Gamma prior . . . . . . . . . . . . . . . . . . . 171 5.5.3 Laplace-Zero prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171 5.5.4 Simulation study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174 5.A Derivation of Algorithm 11 and lower bound (5.8) . . . . . . . . . . . . . . 176 5.A.1 Full conditionals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176 5.A.2 Optimal q ∗ densities . . . . . . . . . . . . . . . . . . . . . . . . . . . 
176 5.A.3 Derivation of lower bound (5.8) . . . . . . . . . . . . . . . . . . . . . 179 6 Mean field variational Bayes for wavelet-based longitudinal data analysis 6.1 6.2 6.3 186 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186 6.1.1 Radiation pneumonitis study . . . . . . . . . . . . . . . . . . . . . . 187 6.1.2 Chapter outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189 6.2.1 Horseshoe prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190 6.2.2 Laplace-Zero prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192 Mean field variational Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . 193 CONTENTS vii 6.3.1 Horseshoe prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193 6.3.2 Laplace-Zero prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197 6.4 6.5 6.6 Radiation pneumonitis study results . . . . . . . . . . . . . . . . . . . . . . 200 6.4.1 Horseshoe prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200 6.4.2 Laplace-Zero prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201 Accuracy study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206 6.5.1 Horseshoe prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206 6.5.2 Laplace-Zero prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208 6.A Derivation of Algorithm 12 and lower bound (6.6) . . . . . . . . . . . . . . 209 6.A.1 Full conditionals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209 6.A.2 Optimal q ∗ densities . . . . . . . . . . . . . . . . . . . . . . . . . . . 210 6.A.3 Derivation of lower bound (6.6) . . . . . . . . . . . . . . . . . . . . . 214 6.B Derivation of Algorithm 13 and lower bound (6.7) . . . . . . . . . . . . . . 221 7 6.B.1 Full conditionals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221 6.B.2 Optimal q ∗ densities . . . . . . . . . . . . . . . . . . . . . . . . . . . 223 6.B.3 Derivation of lower bound (6.7) . . . . . . . . . . . . . . . . . . . . . 229 Conclusion 236 List of Figures 1.1 A simple example of a directed acyclic graph, containing nodes a, b and c. 1.2 Illustration of the impact of product restriction (1.7) on the directed acyclic 8 graph for Model (1.6). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.1 A normal mixture approximation to the GEV density with ξ = 1. . . . . . 26 2.2 Directed acyclic graph for Model (2.8). . . . . . . . . . . . . . . . . . . . . . 29 2.3 Directed acyclic graph representation of Model (2.13). . . . . . . . . . . . . 33 2.4 Directed acyclic graph representation of Model (2.15). . . . . . . . . . . . . 35 2.5 Annual winter maximum rainfall at 50 weather stations in the Sydney, Australia, hinterland. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.6 MFVB univariate functional fits in the GEV additive model (2.20) for the Sydney hinterland rainfall data. . . . . . . . . . . . . . . . . . . . . . . . . . 2.7 43 MFVB bivariate functional fit for geographical location in the GEV additive model (2.20) for the Sydney hinterland rainfall data. . . . . . . . . . . 
2.8 42 44 The prior and MFVB approximate posterior probability mass functions for the GEV shape parameter ξ in the GEV additive model (2.20) for the Sydney hinterland maximum rainfall data. . . . . . . . . . . . . . . . . . . . . . 2.9 45 Accuracy comparison between MFVB and MCMC for a single predictor model (d = 1). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.1 Directed acyclic graph for Model (3.2). . . . . . . . . . . . . . . . . . . . . . 78 3.2 Directed acyclic graph for Model (3.6). . . . . . . . . . . . . . . . . . . . . . 84 3.3 Successive values of lower bound (3.9) to monitor convergence of MFVB Algorithm 5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.4 85 Quantile estimates (solid) and pointwise 95% credible intervals (dotted) for MFVB fitting of (3.10) via Algorithm 5. . . . . . . . . . . . . . . . . . . . viii 86 LIST OF FIGURES 3.5 Median (τ = 0.5) estimates and pointwise 95% credible intervals for MFVB (red) and MCMC (blue) fitting of (3.10). . . . . . . . . . . . . . . . . . . . . 3.6 87 MFVB (blue) and MCMC (orange) approximate posterior densities for the estimated median ŷ at the quartiles of the xi ’s under Model (3.6). . . . . . 3.7 ix 88 Boxplots of accuracy measurements for ŷ(Q2 ) for the accuracy study described in the text. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 4.1 Standard (µ = 0, σ = 1) continuous sparseness inducing density functions. 111 4.2 Directed acyclic graphs corresponding to the three models listed in Table 4.1.113 4.3 The number of iterations required for Lentz’s Algorithm to converge when 4.4 used to approximate Q(x). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 Comparison of pMCMC (σ 2 |x) and two q ∗ (σ 2 ) densities based on Model II and Model III MFVB for four replications from the simulation study corre- sponding to Table 4.2 with n = 1000. . . . . . . . . . . . . . . . . . . . . . . 118 4.5 Plot of g III (x)/g II (x) for the functions g III and g II defined by (4.8) and (4.9) respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 4.6 MCMC samples (n = 1000) from the distribution {log(1/b), log(c)|x = x0 } for x0 = (1, 0.1, 0.01, 0.001) where the data is generated according to (4.10). 121 4.7 Side-by-side boxplots of accuracy values for the NEG simulation study described in Section 4.3.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 4.8 Comparison of pMCMC (σ 2 |x) and two q ∗ (σ 2 ) densities based on Model II and Model III MFVB for four replications from the NEG simulation study with n = 1000. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 4.9 Plot of g III (x)/g II (x) for the functions g III and g II defined by (4.13). . . . . . . 126 4.10 MCMC samples (n = 1000) from the distribution {log(b), log(c)|x = x0 } for λ = (0.05, 0.1, 0.2, 0.4) and x0 = (1, 2, 3, 4) where the data is generated according to (4.14). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 4.11 Illustration of the behaviour of Corr{log(b), log(c)|x = x0 } under NEG Model III for varying values of λ and x0 . . . . . . . . . . . . . . . . . . . . . 128 4.12 Side-by-side boxplots of accuracy values for the GDP simulation study described in Section 4.4.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 5.1 Directed acyclic graph for Models (5.2) and (5.4). . . . . . . . . . . . . . . . 
162 5.2 Fitted function estimates (solid) and pointwise 95% credible sets (dotted) for both MFVB (red) and MCMC (blue) approaches under Model (5.2). . . 168 LIST OF FIGURES 5.3 x MFVB (blue) and MCMC (orange) approximate posterior densities for (a) σε2 and (b) ŷ(Q2 ) under Model (5.2). . . . . . . . . . . . . . . . . . . . . . . . 168 5.4 Fitted function estimates (solid) and pointwise 95% credible sets (dotted) for both MFVB (red) and MCMC (blue) approaches under Model (5.4). . . 169 5.5 MFVB (blue) and MCMC (orange) approximate posterior densities for (a) σε2 and (b) ŷ(Q2 ) under Model (5.4). . . . . . . . . . . . . . . . . . . . . . . . 170 5.6 Fitted function estimates (solid) and pointwise 95% credible sets (dotted) for both MFVB (red) and MCMC (blue) approaches under Model (5.6). . . 171 5.7 MFVB (blue) and MCMC (orange) approximate posterior densities for (a) σε2 and (b) ŷ(Q2 ) under Model (5.6). . . . . . . . . . . . . . . . . . . . . . . . 172 5.8 Fitted MFVB function estimates (solid) and pointwise 95% credible sets (dotted) for the Horseshoe (red), NEG (blue) and Laplace-Zero (green) models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173 5.9 Boxplots of root mean squared error of fˆMFVB for the simulation study described in the text. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174 6.1 Raw data from the radiation pneumonitis study. Each panel corresponds to a subject in the study, with radiation dose (J/kg) plotted against the logarithm of fluorodeoxyglucose (FDG) uptake. . . . . . . . . . . . . . . . 188 6.2 Directed acyclic graph for Model (6.2). . . . . . . . . . . . . . . . . . . . . . 192 6.3 Directed acyclic graph for Model (6.4). . . . . . . . . . . . . . . . . . . . . . 194 6.4 MFVB fit (blue) with pointwise 95% credible sets with raw data (red) for all 21 subjects under Model (6.2). . . . . . . . . . . . . . . . . . . . . . . . . 200 6.5 Additional plots for the MFVB fit of Model (6.2). . . . . . . . . . . . . . . . 201 6.6 MFVB fit (blue) with pointwise 95% credible sets with raw data (red) for all 21 subjects under Model (6.4). . . . . . . . . . . . . . . . . . . . . . . . . 202 6.7 Additional plots for the MFVB fit of Model (6.4). . . . . . . . . . . . . . . . 203 6.8 MFVB (blue) and MCMC (orange) approximate posterior densities for the fit under Model (6.2) at the quartiles for subject 8. . . . . . . . . . . . . . . 206 6.9 MFVB (blue) and MCMC (orange) approximate posterior densities for the fit under Model (6.4) at the quartiles for subject 8. . . . . . . . . . . . . . . 207 Chapter 1 Introduction 1.1 Literature review Mean field variational Bayes (MFVB) is a fast, deterministic alternative to Markov chain Monte Carlo (MCMC) for inference in a hierarchical Bayesian model setting. This literature review will firstly explain the origins of variational approximation, and the natural connection with statistical inference. Secondly, the use of variational approximations in the computer science will be discussed. Thirdly, the emergence of variational approximations into the statistical literature will be presented. Finally, the use of MFVB in specific statistical models will be summarized, identifying further work to be done in the field. 1.1.1 Variational approximations meet Bayesian models Variational calculus has been part of the mathematical consciousness since the 18th century. Essentially, the calculus of variations involves optimizing a functional over a given class of functions. 
Variational approximations arise when the class of functions we are optimizing over is restricted in some way (Ormerod and Wand, 2010). Bayesian inference is centred around the posterior distribution of the parameters in a model given the observed data. The mathematics behind derivation of this posterior distribution is often intractable, even given fairly simple models. As hierarchical Bayesian models become increasingly complex, and data sets become larger, traditional Bayesian inference tools such as MCMC are becoming infeasible due to time constraints. A faster inference tool is required. One solution is to bring the concepts of variational approximation and Bayesian inference together to provide a deterministic alternative to the stochastic MCMC. This involves the approximation of intractable integrals that arise when deriving the posterior 1 1.1. LITERATURE REVIEW 2 distribution of model parameters. 1.1.2 Variational approximations in computer science Computer scientists have been exploiting variational approximations in areas such as machine learning for some time. Bishop (2006) includes an entire chapter on approximate inference. Variational inference is presented as a deterministic alternative to the stochastic MCMC, arising from analytical approximations to the posterior distribution. The various types of variational approximation are presented, including a section on factorised approximations, which we refer to as MFVB. Specific applications explored so far in the computer science literature include neural networks, hidden Markov models and information retrieval (Jordan, Ghahramani, Jaakkola and Saul, 1999; Jordan, 2004). The fact that Bishop (2006) includes such detailed descriptions of a scheme such as MFVB illustrates that variational approximations are widely accepted in the computer science field. However, lacking in the computer science literature is an analysis of the accuracy of variational approximations. Bishop (2006) states that variational inference is approximate, and proceeds to describe the methods. There is minimal time given to the accuracy of variational inference. This shortcoming provides the statistical community with an opportunity to further contribute to the investigation of variational methods in ways that the computer science literature has not. We have the ability to quantitatively assess the performance of variational inference against the stochastic alternative (MCMC). Assessment of the performance of variational approximations has begun in the statistical literature. For example, Wang and Titterington (2005) look into the properties of covariance matrices arising from variational approximations. In addition, Wang and Titterington (2006) investigate the convergence properties of a variational Bayesian algorithm for computing estimates for a normal mixture model. 1.1.3 Variational approximations emerging into the statistical literature The connection between variational approximations and statistical inference has only started to be explored extensively in the past decade. Investigation of variational approximations in the statistical literature has two advantages: 1. to make variational approximations a faster, real alternative to more prominent inference tools such as MCMC and Laplace approximations; and 1.1. LITERATURE REVIEW 3 2. to assess the accuracy of variational approximations in different statistical models. There are many types of variational approximations that lend themselves to different statistical models. 
In an attempt to bring variational approximations into the broader statistical consciousness, Ormerod and Wand (2010) outlined the different varieties of variational approximations available, and how they may be used in familiar statistical settings. The major focus of this thesis is the product density restriction approach, which we refer to as MFVB; it is also known as variational message passing and mean field approximation.

1.1.4 Mean field variational Bayes
In 1999, Attias coined the term variational Bayes for the specific case of variational approximation where the posterior is approximated by a product density restriction. This form of variational approximation originated in statistical physics, where it is known as mean field approximation (Parisi, 1988). Borrowing terms from both areas, we term variational inference through product density restrictions mean field variational Bayes.

The past decade has seen the concept of variational approximations, and in particular MFVB, appear in statistics journals. Titterington (2004) highlighted the role of MFVB as a possible inference tool for large scale data analysis problems in the context of neural networks. Teschendorff, Wang, Barbosa-Morais, Brenton and Caldas (2005) looked at MFVB in the context of cluster analysis for gene expression data. In 2008 Infer.net (Minka, Winn, Guiver and Kannan, 2009), a software package for MFVB, was released. This made MFVB a more accessible inference tool for hierarchical Bayesian models.

The past decade has seen MFVB explored as an inference tool in many statistical settings. These include, but are not limited to: hidden Markov models (McGrory and Titterington, 2009); model selection in finite mixture distributions (McGrory and Titterington, 2007); principal component analysis (Smidl and Quinn, 2007); and political science research (Grimmer, 2011). 2010/11 saw the first appearance of MFVB papers in the Journal of the American Statistical Association (Braun and McAuliffe, 2010; Faes, Ormerod and Wand, 2011), illustrating the quality and exposure of current research in the field.

There remains a plethora of statistical settings that have not been explored in the context of MFVB. This forms the basis of the thesis. We have identified both areas of the literature that would benefit from extension of the current MFVB methodology, and areas that have not yet been explored in the context of variational approximations. The former include modelling of sample extremes and continuous sparse signal shrinkage. The latter include Bayesian quantile and wavelet regression.

1.2 Notation
1.2.1 Vector and matrix notation
For the vectors $v, w \in \mathbb{R}^p$, defined as
\[
v = \begin{bmatrix} v_1 \\ \vdots \\ v_p \end{bmatrix}
\quad\text{and}\quad
w = \begin{bmatrix} w_1 \\ \vdots \\ w_p \end{bmatrix},
\]
the following notation is used throughout the thesis.
The norm of a vector $v$ is denoted by $\|v\| = \sqrt{v^T v}$. The vector $v_{-i}$ represents the vector $v$ with the $i$th component removed. The component-wise product of two vectors is defined by
\[
v \odot w = \begin{bmatrix} v_1 w_1 \\ \vdots \\ v_p w_p \end{bmatrix}.
\]
If $g:\mathbb{R}\to\mathbb{R}$ is a scalar function, it acts upon a vector $v$ according to
\[
g(v) = \begin{bmatrix} g(v_1) \\ \vdots \\ g(v_p) \end{bmatrix}.
\]
To create a $p\times p$ diagonal matrix from a vector, using components of the vector as diagonal elements of the matrix, we use the notation
\[
\mathrm{diag}(v) = \begin{bmatrix} v_1 & 0 & \cdots & 0 \\ 0 & v_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & v_p \end{bmatrix}.
\]
For the symmetric matrix $M$ ($n\times n$), we denote the trace and determinant by $\mathrm{tr}(M)$ and $|M|$ respectively, and adopt the usual definitions.
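To make the vector notation concrete, the following base R snippet (our own illustration, not part of the thesis) shows the built-in analogues of the operations just defined; the example vectors are arbitrary:

# Base R analogues of the vector notation defined above (illustrative only).
v <- c(3, 1, 4) ; w <- c(2, 0, 5)
sqrt(sum(v^2))          # ||v|| = sqrt(v^T v)
v[-2]                   # v_{-2}: v with the 2nd component removed
v * w                   # component-wise product
log(v)                  # a scalar function applied componentwise
diag(v)                 # p x p diagonal matrix with v on the diagonal
M <- matrix(c(2, 1, 0, 1, 3, 1, 0, 1, 4), 3, 3)
sum(diag(M)) ; det(M)   # tr(M) and |M| for a symmetric matrix M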
For the matrices $A$ ($m\times n$) and $B$ ($p\times q$), denoted by
\[
A = \begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{m1} & \cdots & a_{mn} \end{bmatrix}
\quad\text{and}\quad
B = \begin{bmatrix} b_{11} & \cdots & b_{1q} \\ \vdots & \ddots & \vdots \\ b_{p1} & \cdots & b_{pq} \end{bmatrix},
\]
we define the following notation. To create a $mn\times 1$ vector from the components of a matrix $A$, we define
\[
\mathrm{vec}(A) = \begin{bmatrix} a_{11} \\ \vdots \\ a_{m1} \\ \vdots \\ a_{1n} \\ \vdots \\ a_{mn} \end{bmatrix}.
\]
The Kronecker product of two matrices is defined as
\[
A\otimes B = \begin{bmatrix} a_{11}B & \cdots & a_{1n}B \\ \vdots & \ddots & \vdots \\ a_{m1}B & \cdots & a_{mn}B \end{bmatrix}.
\]

1.2.2 Distributional notation
We adopt the usual notation and definitions for the density function, expected value and variance of a random variable x, that is p(x), E(x) and Var(x) respectively. The conditional density of x given y is denoted by p(x|y). The covariance of random variables x and y is denoted by Cov(x, y). The correlation between random variables x and y is denoted by Corr(x, y) and is given by
\[
\mathrm{Corr}(x,y) = \frac{\mathrm{Cov}(x,y)}{\sqrt{\mathrm{Var}(x)}\sqrt{\mathrm{Var}(y)}}.
\]
The density function of a random vector v is denoted by p(v). The conditional density of v given w is denoted by p(v|w). In a Bayesian model setting, the full conditional of v is the conditional distribution of v conditioned on all the remaining parameters in the model, and is denoted by p(v|rest). The expected value and covariance matrix of v are denoted by E(v) and Cov(v) respectively.
If $x_1,\ldots,x_n$ are independent and identically distributed as $D$, we write $x_i \stackrel{\text{ind.}}{\sim} D$, for $1\le i\le n$.
We use q to denote density functions that arise from MFVB approximation. Expectation and covariance under the MFVB paradigm are denoted by Eq(·) and Covq(·) respectively. For a generic random scalar variable v and density function q we define
\[
\mu_{q(v)} \equiv E_q(v) \quad\text{and}\quad \sigma^2_{q(v)} \equiv \mathrm{Var}_q(v).
\]
For a generic random vector v and density function q we define
\[
\mu_{q(v)} \equiv E_q(v) \quad\text{and}\quad \Sigma_{q(v)} \equiv \mathrm{Cov}_q(v).
\]

1.3 Basics of mean field variational Bayes
Consider a generic Bayesian model, with observed data vector y and parameter vector θ. We also suppose that θ is continuous over the parameter space Θ. As discussed in the literature review, the posterior density function p(θ|y) is often intractable, even given simple models. MFVB overcomes this intractability by postulating that p(θ|y) can be well-approximated by product density forms, for example
\[
p(\theta|y) \approx q_1(\theta_1)\,q_2(\theta_2)\,q_3(\theta_3)   \tag{1.1}
\]
where {θ1, θ2, θ3} is a partition of θ. The choice of partition is usually made on grounds of tractability. Each qi is a density function in θi (i = 1, 2, 3), and they are chosen to minimise the Kullback-Leibler distance between the left and right hand sides of (1.1):
\[
\int q_1(\theta_1)\,q_2(\theta_2)\,q_3(\theta_3)\,
\log\left\{\frac{q_1(\theta_1)\,q_2(\theta_2)\,q_3(\theta_3)}{p(\theta|y)}\right\} d\theta.   \tag{1.2}
\]
Minimisation of (1.2) is equivalent to maximisation of
\[
p(y;q) \equiv \int q_1(\theta_1)\,q_2(\theta_2)\,q_3(\theta_3)\,
\log\left\{\frac{p(\theta,y)}{q_1(\theta_1)\,q_2(\theta_2)\,q_3(\theta_3)}\right\} d\theta
\]
for which an iterative convex optimisation algorithm (e.g. Luenberger and Ye, 2008) exists for obtaining the solution. Algorithm updates can be derived from the expression
\[
q^*(\theta_i) \propto \exp\left[E_{q(\theta_{-i})}\{\log p(\theta_i|y,\theta_{-i})\}\right]   \tag{1.3}
\]
for 1 ≤ i ≤ 3 (Bishop, 2006). Each iteration results in an increase in p(y; q), and this quantity can be used to assess convergence. Upon convergence, the q*(θi) densities can be used for approximate Bayesian inference.
We have illustrated the basics of MFVB using a product restriction consisting of three densities. These concepts can be extended to n densities, in which case the factorisation would take the form
\[
p(\theta|y) \approx \prod_{i=1}^n q_i(\theta_i)
\]
where {θ1, . . . , θn} is a partition of θ.
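To make the coordinate-ascent scheme implied by (1.3) concrete, the following R sketch (our own illustration, not an algorithm from the thesis) applies the product restriction q(µ) q(σ²) to the simple model y_i ~ N(µ, σ²), with priors µ ~ N(µ_µ, σ_µ²) and σ² ~ Inverse-Gamma(A, B); the updates are the standard ones for this two-parameter example and all variable names and hyperparameter values are ours:

# Minimal MFVB sketch (illustrative only; not from the thesis).
# Model: y_i ~ N(mu, sigsq), mu ~ N(mu.mu, sigsq.mu), sigsq ~ Inverse-Gamma(A, B),
# with product restriction q(mu, sigsq) = q(mu) q(sigsq).
set.seed(1)
y <- rnorm(200, mean = 2, sd = 3)
n <- length(y) ; mu.mu <- 0 ; sigsq.mu <- 1e8 ; A <- 0.01 ; B <- 0.01
E.recip.sigsq <- 1                                   # initialise E_q(1/sigsq)
for (iter in 1:100) {
   # q*(mu) is N(mu.q, sigsq.q):
   sigsq.q <- 1/(n*E.recip.sigsq + 1/sigsq.mu)
   mu.q    <- sigsq.q*(E.recip.sigsq*sum(y) + mu.mu/sigsq.mu)
   # q*(sigsq) is Inverse-Gamma(A + n/2, B.q):
   B.q <- B + 0.5*(sum((y - mu.q)^2) + n*sigsq.q)
   E.recip.sigsq <- (A + n/2)/B.q
}
c(mu.q, sqrt(sigsq.q))   # approximate posterior mean and standard deviation of mu

In practice each sweep of the updates would be accompanied by evaluation of the lower bound p(y; q), whose monotone increase provides the convergence check mentioned above.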
1.4 Graph theory
As mentioned in the literature review and Section 1.3 above, the central idea of MFVB involves imposing a product restriction, or factorisation, on the posterior p(θ|y). This imposed factorisation, once carried out, can lead to a further breakdown of the dependence structure of a model. This is known as induced factorisation (Bishop, 2006). The extent of induced factorisation depends on both the underlying structure of the model and the nature of the imposed factorisation. Graph theory allows us to better understand the nature of both imposed and induced factorisation. As stated in Ormerod and Wand (2010), a very useful tool in assessing the conditional dependence structure of a hierarchical Bayesian model is its directed acyclic graph (DAG). These are defined in the following section.

1.4.1 Directed acyclic graphs
Definition 1.4.1 An undirected graph consists of a set of nodes connected by edges.
Definition 1.4.2 A directed graph consists of a set of nodes connected by directed edges (Bishop, 2006).
Definition 1.4.3 A directed acyclic graph (DAG) is a directed graph containing no directed cycles.
Definition 1.4.3 means that there are no closed paths within the graph such that we can move from node to node along the directed edges and end up back at the starting node (Bishop, 2006). Figure 1.1 illustrates a simple DAG with nodes a, b and c.

Figure 1.1: A simple example of a directed acyclic graph, containing nodes a, b and c.

The link between DAGs and hierarchical Bayesian models is made when we treat the nodes as representing random variables within a model, and the directed edges as conveying the conditional dependence structure of the model (e.g. Ormerod and Wand, 2010).
Definition 1.4.4 Two nodes are co-parents if they share a common child node.
Definition 1.4.5 The Markov blanket of a node is the set of parents, children and co-parents of the node.
We now define the full conditional distribution in the context of a DAG.
Definition 1.4.6 The full conditional of a node θi is the conditional distribution of θi given all the remaining variables in the graph, and is denoted by p(θi|rest).
The full conditional of a node is dependent only on the variables in its Markov blanket (Bishop, 2006). We summarise this property as
\[
p(\theta_i|\text{rest}) = p(\theta_i|\text{Markov blanket of } \theta_i).   \tag{1.4}
\]
Combining (1.3) and (1.4), we have
\[
q^*(\theta_i) \propto \exp\left[E_{q(\theta_{-i})}\{\log p(\theta_i|\text{Markov blanket of }\theta_i)\}\right].   \tag{1.5}
\]
This has significant implications for MFVB. Equation (1.5) tells us that the optimal q* density for θi will depend only on the nodes in its Markov blanket. This is called the locality property of MFVB. This locality property is important when developing MFVB algorithms for increasingly complex hierarchical models.

1.4.2 Moral graphs
The role of moral graphs in MFVB is twofold, coming into play in both induced and imposed factorisations of MFVB. The following definitions and theorem are taken from unpublished notes by M.P. Wand entitled "Graphical Models" (2012).
Definition 1.4.7 A set of nodes in a DAG is called an ancestral set if, for each node in the set, all of its parents are also in the set.
Definition 1.4.8 Let S be a subset of nodes in a DAG. Then the smallest ancestral set containing S is the ancestral set, containing all nodes in S, with the fewest number of nodes.
Definition 1.4.9 The moral graph of a DAG is the undirected graph formed by (1) adding an edge between all pairs of parents of each node, and (2) removing all arrows.
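As a small, concrete illustration of Definitions 1.4.5 and 1.4.9 (our own sketch, not code from the thesis), a DAG can be stored as a 0/1 adjacency matrix E, with E[i, j] = 1 indicating a directed edge from node i to node j; moralisation and Markov blanket extraction are then a few lines of base R:

# Illustrative base R sketch (not from the thesis).
# E[i, j] = 1 means a directed edge from node i to node j.
moralise <- function(E) {
  M <- E + t(E)                        # drop edge directions
  for (child in seq_len(ncol(E))) {    # "marry" all pairs of parents of each node
    pa <- which(E[, child] == 1)
    if (length(pa) > 1) M[pa, pa] <- 1
  }
  diag(M) <- 0
  (M > 0) + 0                          # adjacency matrix of the moral graph
}
markov.blanket <- function(E, i) {     # parents, children and co-parents of node i
  pa <- which(E[, i] == 1) ; ch <- which(E[i, ] == 1)
  co.pa <- which(rowSums(E[, ch, drop = FALSE]) > 0)
  setdiff(union(union(pa, ch), co.pa), i)
}

Applying moralise() to the DAG of Model (1.6) in the next subsection should reproduce the moral graph shown in Figure 1.2(b).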
We are now in a position to determine whether two subsets of nodes within a DAG are conditionally independent.
Theorem 1.4.1 Let A, B and C be disjoint subsets of nodes in a DAG. Then A ⊥ B | C if C separates A from B in the moral graph of the smallest ancestral set containing A ∪ B ∪ C.
Combining moralisation with the notion of ancestral sets provides an attractive alternative to d-separation in establishing conditional independence between subsets of nodes in a DAG. This is because the components of Theorem 1.4.1, namely (1) smallest ancestral set determination, (2) moralisation and (3) separation on an undirected graph, are all easy to carry out, compared with d-separation. The ability to recognise conditional independence between nodes is vital in identifying induced factorisations.
Moralisation can also be used to visualise the effect MFVB has on the structure of a hierarchical Bayesian model. MFVB involves placing a product restriction, or imposed factorisation, on the posterior p(θ|y). Visually, this corresponds to removing edges between relevant nodes on the moralised DAG. Consider, for example, the model:
\[
\begin{array}{c}
y|\beta,u,\sigma_\varepsilon^2 \sim N(X\beta + Zu,\ \sigma_\varepsilon^2 I), \quad
\beta \sim N(0,\sigma_\beta^2 I), \quad
u \sim N(0,\sigma_u^2 I), \\[4pt]
\sigma_\varepsilon \sim \text{Half-Cauchy}(A_\varepsilon), \quad
\sigma_u \sim \text{Half-Cauchy}(A_u)
\end{array}
\tag{1.6}
\]
where y is a vector of responses, β and u are vectors of fixed and random effects, X and Z are design matrices and σε² and σu² are variance parameters. Definition of the Half-Cauchy distribution is given in Section 1.5. Figure 1.2(a) shows the DAG for (1.6), and illustrates the conditional dependence structure of the model. Note that the node for the data, y, is shaded. Nodes representing model parameters are unshaded.
MFVB involves placing a product restriction on the posterior density. Say we impose the factorisation
\[
p(\beta,u,\sigma_\varepsilon^2,\sigma_u^2|y) \approx q(\beta,u,\sigma_\varepsilon^2,\sigma_u^2) = q(\beta,u)\,q(\sigma_\varepsilon^2,\sigma_u^2).   \tag{1.7}
\]
In order to visualise this factorisation, we must first moralise the graph, as shown in Figure 1.2(b).

Figure 1.2: Illustration of the impact of product restriction (1.7) on the directed acyclic graph for Model (1.6). Panel (a): directed acyclic graph of Model (1.6); panel (b): moralised graph.

After moralisation, all paths between σu² and σε² on the undirected graph must pass through at least one of {y, β, u}. In other words, σu² is now separated from σε² by the set {y, β, u}. Applying Theorem 1.4.1 gives σu² ⊥ σε² | {y, β, u}. Hence (1.7) reduces to
\[
q(\beta,u,\sigma_\varepsilon^2,\sigma_u^2) = q(\beta,u)\,q(\sigma_\varepsilon^2)\,q(\sigma_u^2).   \tag{1.8}
\]
The ability to recognise induced factorisations helps to streamline derivation of MFVB methodology, especially as models increase in complexity.

1.5 Definitions and results
1.5.1 Non-analytic integral families
Definition 1.5.1 We define the integral J+(·, ·, ·) by
\[
J^+(p,q,r) = \int_0^\infty x^p \exp(qx - rx^2)\,dx, \quad p \ge 0,\ -\infty < q < \infty,\ r > 0.
\]

1.5.2 Special function definitions
Definition 1.5.2 The logit(·) function is defined by
\[
\mathrm{logit}(p) = \log\left(\frac{p}{1-p}\right) = \log(p) - \log(1-p), \quad 0 < p < 1.
\]
Definition 1.5.3 The digamma function is denoted by ψ(·) and is the logarithmic derivative of the gamma function
\[
\psi(x) = \frac{d}{dx}\ln\Gamma(x) = \frac{\Gamma'(x)}{\Gamma(x)}, \quad x > 0.
\]
Definition 1.5.4 The trigamma function is denoted by ψ'(·) and is the second logarithmic derivative of the gamma function
\[
\psi'(x) = \frac{d^2}{dx^2}\ln\Gamma(x) = \frac{d}{dx}\{\psi(x)\}, \quad x > 0.
\]
Definition 1.5.5 The Dirac delta function is denoted by δ0(·) and is defined by
\[
\delta_0(x) = \begin{cases} 1 & \text{if } x = 0, \\ 0 & \text{if } x \ne 0. \end{cases}
\]
Distributional definitions of the required continuous sparse signal shrinkage density functions require the introduction of special functions. We follow the notation of Gradshteyn and Ryzhik (1994).
Definition 1.5.6 The exponential integral function of order 1 is denoted by E1(·) and is defined by
\[
E_1(x) = \int_x^\infty \frac{e^{-t}}{t}\,dt, \quad x \in \mathbb{R},\ x \ne 0.
\]
Evaluation of E1(·) is supported by the function expint_E1() in the R package gsl (Hankin, 2007).
Definition 1.5.7 The parabolic cylinder function of order ν ∈ R is denoted by Dν(·). Parabolic cylinder functions of negative order can be expressed as:
\[
D_\nu(x) = \Gamma(-\nu)^{-1}\exp(-x^2/4)\int_0^\infty t^{-\nu-1}\exp\left(-xt - \tfrac{1}{2}t^2\right)dt, \quad \nu < 0,\ x \in \mathbb{R}.
\]
Result 1.1 Combining Definitions 1.5.1 and 1.5.7, we have
\[
J^+(p,q,r) = (2r)^{-(p+1)/2}\,\Gamma(p+1)\exp\{q^2/(8r)\}\,D_{-p-1}(-q/\sqrt{2r}), \quad p > -1,\ q \in \mathbb{R},\ r > 0.
\]
For computational purposes, we note that
\[
D_\nu(x) = 2^{\nu/2+1/4}\,W_{\nu/2+1/4,\,-1/4}(\tfrac{1}{2}x^2)/\sqrt{x}, \quad x > 0,   \tag{1.9}
\]
where W_{k,m} is a confluent hypergeometric function as defined in Whittaker and Watson (1990). Direct computation of the parabolic cylinder function Dν(·) is unavailable in R. Computation of W_{k,m}(·), however, is available via the R function whittakerW() within the package fAsianOptions (Wuertz et al., 2009). Hence, using (1.9) we compute Dν(·) using the following code:

library(fAsianOptions)
# D_nu(x) for x > 0, via identity (1.9):
2^(nu/2 + 1/4) * Re(whittakerW(x^2/2, nu/2 + 1/4, -1/4)) / sqrt(x)

Definition 1.5.8 Gauss' Hypergeometric function of order (α, β, γ) is denoted by 2F1(α, β, γ; ·) and has the integral representation
\[
{}_2F_1(\alpha,\beta,\gamma;x) = \frac{\Gamma(\gamma)}{\Gamma(\beta)\Gamma(\gamma-\beta)}\int_0^1 (1-tx)^{-\alpha}\,t^{\beta-1}(1-t)^{\gamma-\beta-1}\,dt
\]
for γ > β > 0. Evaluation of 2F1(α, β, γ; ·) is supported by the function hyperg_2F1() in the R package gsl (Hankin, 2007).

1.5.3 Additional function definitions and continued fraction representations
As illustrated in Wand and Ormerod (2012), there are simple continued fraction representations for e and π, given by:
\[
e = 2 + \cfrac{1}{1+\cfrac{1}{2+\cfrac{2}{3+\cfrac{3}{4+\cdots}}}}
\quad\text{and}\quad
\pi = \cfrac{4}{1+\cfrac{1^2}{3+\cfrac{2^2}{5+\cfrac{3^2}{7+\cdots}}}}.
\]
An algorithm for accurate approximation of a real number given its continued fraction expansion is presented in Section 5.2 of Press, Teukolsky, Vetterling and Flannery (1992). The authors refer to this algorithm as the modified Lentz's algorithm, after Lentz, who developed the algorithm for a specific family of continued fractions in his 1976 paper. In this thesis, we refer to it simply as Lentz's Algorithm, which works for general continued fractions of the form
\[
b_0 + \cfrac{a_1}{b_1+\cfrac{a_2}{b_2+\cfrac{a_3}{b_3+\cdots}}}.
\]
Next, we define two functions that arise during derivation of MFVB algorithms for continuous sparse signal shrinkage density functions. We also state two important results that lead to streamlined computation.
Definition 1.5.9 The functions Q(·) and Rν(·) are defined as
\[
Q(x) \equiv e^x E_1(x), \quad x > 0, \quad\text{and}\quad
R_\nu(x) \equiv \frac{D_{-\nu-2}(x)}{D_{-\nu-1}(x)}, \quad \nu > 0,\ x > 0.
\]
Both of the functions defined immediately above lead to underflow problems for large x. We call upon their continued fraction representations, then computation via Lentz's Algorithm (Lentz, 1976), to facilitate stable computation.
Result 1.2 The function Q(x) admits the continued fraction expansion
\[
Q(x) = \cfrac{1}{x+1-\cfrac{1^2}{x+3-\cfrac{2^2}{x+5-\cfrac{3^2}{x+7-\cdots}}}}.
\]
Result 1.3 The function Rν(x) admits the continued fraction expansion
\[
R_\nu(x) = \cfrac{1}{x+\cfrac{\nu+2}{x+\cfrac{\nu+3}{x+\cfrac{\nu+4}{x+\cdots}}}}.
\]
Results 1.2 and 1.3 are given in Cuyt, Petersen, Verdonk, Waadeland and Jones (2008).
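To illustrate how Lentz's Algorithm is used with Result 1.2, the following R sketch (our own illustration, not the thesis's implementation) codes the modified Lentz recursion of Press et al. (1992, Section 5.2) for Q(x) and checks it against exp(x) * expint_E1(x) from the gsl package; the tolerance, iteration limit and "tiny" safeguard are arbitrary choices:

# Illustrative modified Lentz evaluation of Q(x) via Result 1.2 (not thesis code).
# Partial numerators: a_1 = 1, a_j = -(j - 1)^2 for j >= 2;
# partial denominators: b_j = x + 2j - 1; leading term b_0 = 0.
lentzQ <- function(x, tol = 1e-12, maxit = 1000) {
  tiny <- 1e-30
  f <- tiny ; C <- f ; D <- 0
  for (j in 1:maxit) {
    a <- if (j == 1) 1 else -(j - 1)^2
    b <- x + 2*j - 1
    D <- b + a*D ; if (D == 0) D <- tiny
    C <- b + a/C ; if (C == 0) C <- tiny
    D <- 1/D
    delta <- C*D
    f <- f*delta
    if (abs(delta - 1) < tol) break
  }
  f
}
library(gsl)
c(lentzQ(5), exp(5)*expint_E1(5))   # the two values agree

Unlike direct evaluation of exp(x) E1(x), which breaks down numerically for large x, the continued fraction remains stable, which is the point of Results 1.2 and 1.3.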
1.5.4 Distributional definitions and results
Here we define the distributions and important results that arise throughout the thesis. For the more complex distributions, standard densities with location parameter µ = 0 and scale parameter σ = 1 are given. Density functions for general location and scale parameters for these distributions are then found by mapping
\[
p(v) \mapsto \frac{1}{\sigma}\,p\!\left(\frac{v-\mu}{\sigma}\right).
\]
Definition 1.5.10 We use the notation v ∼ Bernoulli(ρ) to denote v following a Bernoulli distribution with parameter 0 ≤ ρ ≤ 1. The corresponding probability function is
\[
p(v) = \rho^v(1-\rho)^{1-v}, \quad v \in \{0,1\}.
\]
Definition 1.5.11 We use the notation v ∼ N(µ, σ²) to denote v following a Normal distribution with mean µ ∈ R and variance σ² > 0. The corresponding density function is
\[
p(v) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(v-\mu)^2}{2\sigma^2}\right\}, \quad v \in \mathbb{R}.
\]
Definition 1.5.12 We use the notation φ(·) to denote the standard normal density function, that is if v ∼ N(0, 1) then
\[
p(v) = \phi(v) = \frac{1}{\sqrt{2\pi}}e^{-v^2/2}, \quad v \in \mathbb{R}.
\]
Hence if v ∼ N(µ, σ²), then p(v) = (1/σ) φ{(v − µ)/σ}.
Definition 1.5.13 The notation v ∼ N(µ, Σ) represents the Multivariate Normal distribution with mean vector µ ∈ R^k and positive-semidefinite, symmetric covariance matrix Σ ∈ R^{k×k}. The corresponding density function is
\[
p(v) = (2\pi)^{-k/2}|\Sigma|^{-1/2}\exp\left\{-\tfrac{1}{2}(v-\mu)^T\Sigma^{-1}(v-\mu)\right\}, \quad v \in \mathbb{R}^k.
\]
Definition 1.5.14 We use the notation v ∼ Beta(A, B) to denote v following a Beta distribution with shape parameters A > 0 and B > 0. The corresponding density function is
\[
p(v) = \frac{v^{A-1}(1-v)^{B-1}}{B(A,B)}, \quad 0 \le v \le 1,
\]
where B(·, ·) is the Beta function.
Definition 1.5.15 The notation v ∼ Gamma(A, B) means that v has a Gamma distribution with shape parameter A > 0 and rate parameter B > 0. The corresponding density function is
\[
p(v) = B^A\,\Gamma(A)^{-1}v^{A-1}\exp(-Bv), \quad v > 0.
\]
Result 1.4 Suppose that v ∼ Gamma(A, B). Then
\[
E(v) = A/B, \quad E\{\log(v)\} = \psi(A) - \log(B) \quad\text{and}\quad \mathrm{Var}\{\log(v)\} = \psi'(A),
\]
where ψ(·) is the digamma function set out by Definition 1.5.3, and ψ'(·) is the trigamma function set out by Definition 1.5.4.
Definition 1.5.16 The notation v ∼ Inverse-Gamma(A, B) means that v has an Inverse-Gamma distribution with shape parameter A > 0 and rate parameter B > 0. The corresponding density function is
\[
p(v) = B^A\,\Gamma(A)^{-1}v^{-A-1}\exp(-B/v), \quad v > 0.
\]
Result 1.5 Suppose that v ∼ Inverse-Gamma(A, B). Then
\[
E(1/v) = A/B \quad\text{and}\quad E\{\log(v)\} = \log(B) - \psi(A),
\]
where ψ(·) is the digamma function set out by Definition 1.5.3.
Definition 1.5.17 The notation v ∼ Half-Cauchy(A) means that v has a Half-Cauchy distribution with scale parameter A > 0. The corresponding density function is
\[
p(v) = \frac{2}{\pi A\{1 + (v/A)^2\}}, \quad v > 0.
\]
Result 1.6 v ∼ Half-Cauchy(A) if and only if
\[
v|a \sim \text{Inverse-Gamma}(\tfrac{1}{2}, 1/a) \quad\text{and}\quad a \sim \text{Inverse-Gamma}(\tfrac{1}{2}, 1/A^2).
\]
This follows from Result 5 of Wand, Ormerod, Padoan and Frühwirth (2011).
Definition 1.5.18 The notation v ∼ Inverse-Gaussian(µ, λ) means that v has an Inverse-Gaussian distribution with mean µ > 0 and rate parameter λ > 0. The corresponding density function is
\[
p(v) = \sqrt{\frac{\lambda}{2\pi v^3}}\exp\left\{-\frac{\lambda(v-\mu)^2}{2\mu^2 v}\right\}, \quad v > 0,
\]
with E(v) = µ and E(1/v) = 1/µ + 1/λ.
Result 1.7 Suppose the density function of v takes the form
\[
p(v) \propto v^{-3/2}\exp(-Sv - T/v), \quad v > 0.
\]
Then v ∼ Inverse-Gaussian(√(T/S), 2T).
Definition 1.5.19 The notation v ∼ GEV(0, 1, ξ) means that v follows the standard Generalized Extreme Value distribution with shape parameter ξ ∈ R. The density function is given by
\[
p_{\text{GEV}}(v;\xi) = \begin{cases}
(1+\xi v)^{-1/\xi-1}\exp\{-(1+\xi v)^{-1/\xi}\}, & \xi \ne 0, \\
\exp(-v - e^{-v}), & \xi = 0,
\end{cases}
\qquad 1 + \xi v > 0.
\]
If the random variable v has density function σ⁻¹ pGEV{(v − µ)/σ; ξ} then we write v ∼ GEV(µ, σ, ξ).
Definition 1.5.20 The notation v ∼ Normal-Mixture(0, 1, w, m, s) means that v follows the standard Finite Normal Mixture distribution with: weight vector w = (w1, . . . , wK), wk > 0 for 1 ≤ k ≤ K and Σ_{k=1}^K wk = 1; mean vector m = (m1, . . . , mK), mk ∈ R for 1 ≤ k ≤ K; and standard deviation vector s = (s1, . . . , sK), sk > 0 for 1 ≤ k ≤ K. The density function is given by
\[
p_{\text{NM}}(v; w, m, s) = \sum_{k=1}^K \frac{w_k}{s_k}\,\phi\!\left(\frac{v-m_k}{s_k}\right), \quad v \in \mathbb{R}.
\]
If the random variable v has density function σ⁻¹ pNM{(v − µ)/σ}, then we write v ∼ Normal-Mixture(µ, σ, w, m, s).
Definition 1.5.21 The notation v ∼ Asymmetric-Laplace(0, 1, τ) means that v follows the standard (µ = 0, σ = 1) Asymmetric-Laplace distribution. The density function is given by
\[
p(v;\tau) = \tau(1-\tau)\exp\left\{-\tfrac{1}{2}|v| + \left(\tau - \tfrac{1}{2}\right)v\right\}, \quad v \in \mathbb{R}.
\]
Result 1.8 Let y and a be random variables such that
\[
y|a \sim N\!\left(\mu + \frac{(\tau-\tfrac{1}{2})\,\sigma}{a\,\tau(1-\tau)},\ \frac{\sigma^2}{a\,\tau(1-\tau)}\right)
\quad\text{and}\quad a \sim \text{Inverse-Gamma}(1, \tfrac{1}{2}).
\]
Then y ∼ Asymmetric-Laplace(µ, σ, τ). Result 1.8 follows from Proposition 3.2.1 of Kotz, Kozubowski and Podgórski (2001).
Definition 1.5.22 The standard Horseshoe density function is defined by
\[
p_{\text{HS}}(v) = (2\pi^3)^{-1/2}\exp(v^2/2)\,E_1(v^2/2), \quad v \ne 0,
\]
where E1(·) is the exponential integral function of order 1. If the random variable v has density function σ⁻¹ pHS{(v − µ)/σ} then we write v ∼ Horseshoe(µ, σ).
Result 1.9 Let v, b and c be random variables such that
\[
v|b \sim N(\mu, \sigma^2/b), \quad b|c \sim \text{Gamma}(\tfrac{1}{2}, c) \quad\text{and}\quad c \sim \text{Gamma}(\tfrac{1}{2}, 1).
\]
Then v ∼ Horseshoe(µ, σ).
Result 1.10 Let v and b be random variables such that
\[
v|b \sim N(\mu, \sigma^2/b) \quad\text{and}\quad p(b) = \pi^{-1}b^{-1/2}(b+1)^{-1}, \quad b > 0.
\]
Then v ∼ Horseshoe(µ, σ). Results 1.9 and 1.10 are related to results given in Carvalho, Polson and Scott (2010).
Definition 1.5.23 The standard Normal-Exponential-Gamma density function, with shape parameter λ > 0, is defined by
\[
p_{\text{NEG}}(v;\lambda) = \pi^{-1/2}\lambda\,2^\lambda\,\Gamma(\lambda + \tfrac{1}{2})\exp(v^2/4)\,D_{-2\lambda-1}(|v|), \quad v \in \mathbb{R},
\]
where Dν(·) is the parabolic cylinder function of order ν. If the random variable v has density function σ⁻¹ pNEG{(v − µ)/σ; λ} then we write v ∼ NEG(µ, σ, λ).
Result 1.11 Let v, b and c be random variables such that
\[
v|b \sim N(\mu, \sigma^2/b), \quad b|c \sim \text{Inverse-Gamma}(1, c) \quad\text{and}\quad c \sim \text{Gamma}(\lambda, 1).
\]
Then v ∼ NEG(µ, σ, λ).
Result 1.12 Let v and b be random variables such that
\[
v|b \sim N(\mu, \sigma^2/b) \quad\text{and}\quad p(b) = \lambda b^{\lambda-1}(b+1)^{-\lambda-1}, \quad b > 0.
\]
Then v ∼ NEG(µ, σ, λ). Results 1.11 and 1.12 are related to results given in Griffin and Brown (2011).
Definition 1.5.24 The standard Generalized Double Pareto density function, with shape parameter λ > 0, is defined by
\[
p_{\text{GDP}}(v;\lambda) = \frac{1}{2(1 + |v|/\lambda)^{\lambda+1}}, \quad v \in \mathbb{R}.
\]
If the random variable v has density function σ⁻¹ pGDP{(v − µ)/σ; λ} then we write v ∼ GDP(µ, σ, λ).
Result 1.13 Let v, b and c be random variables such that
\[
v|b \sim N(\mu, \sigma^2/b), \quad b|c \sim \text{Inverse-Gamma}(1, c^2/2) \quad\text{and}\quad c \sim \text{Gamma}(\lambda, \lambda).
\]
Then v ∼ GDP(µ, σ, λ).
Result 1.14 Let v and b be random variables such that
\[
v|b \sim N(\mu, \sigma^2/b) \quad\text{and}\quad
p(b) = \tfrac{1}{2}(\lambda+1)\lambda^{\lambda+1}b^{(\lambda-2)/2}e^{\lambda^2 b/4}D_{-\lambda-2}(\lambda\sqrt{b}), \quad b > 0.
\]
Then v ∼ GDP(µ, σ, λ). Results 1.13 and 1.14 are related to results given in Armagan, Dunson and Lee (2012).
Definition 1.5.25 The Laplace-Zero density function is defined by
\[
p(u|\sigma,\rho) = \rho(2\sigma)^{-1}\exp(-|u|/\sigma) + (1-\rho)\,\delta_0(u), \quad u \in \mathbb{R},
\]
where ρ is a random variable over [0, 1].
Result 1.15 Suppose that u = γv. Then u|σ, ρ ∼ Laplace-Zero(σ, ρ) if and only if
\[
v|b \sim N(0, \sigma^2/b), \quad \gamma|\rho \sim \text{Bernoulli}(\rho) \quad\text{and}\quad b \sim \text{Inverse-Gamma}(1, \tfrac{1}{2}).
\]
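Results such as 1.9 are easy to check numerically. The following R sketch (our own illustration, not part of the thesis) simulates from the Normal-Gamma-Gamma hierarchy of Result 1.9 and overlays the closed-form Horseshoe density of Definition 1.5.22, computed with expint_E1() from the gsl package; the sample size and plotting range are arbitrary:

# Illustrative check of Result 1.9 (not thesis code).
library(gsl)
set.seed(1)
N <- 1e5
c.samp <- rgamma(N, shape = 0.5, rate = 1)          # c ~ Gamma(1/2, 1)
b.samp <- rgamma(N, shape = 0.5, rate = c.samp)     # b | c ~ Gamma(1/2, c)
v.samp <- rnorm(N, mean = 0, sd = 1/sqrt(b.samp))   # v | b ~ N(0, 1/b)
pHS <- function(v) (2*pi^3)^(-1/2)*exp(v^2/2)*expint_E1(v^2/2)
plot(density(v.samp, from = -4, to = 4), main = "Check of Result 1.9")
curve(pHS, from = -4, to = 4, n = 100, add = TRUE, lty = 2)
# The two curves agree closely, apart from kernel smoothing of the
# logarithmic spike that the Horseshoe density has at v = 0.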
1.5.5 Matrix results
Result 1.16 If a is a scalar, then a^T = a.
Result 1.17 If A is a square matrix, then tr(A^T) = tr(A).
Result 1.18 If A is an m × n matrix and B is an n × m matrix, then tr(AB) = tr(BA).
Result 1.19 If a and b are vectors of the same length, then
\[
\|a - b\|^2 = \|a\|^2 - 2a^T b + \|b\|^2.
\]
Result 1.20 If a, b and c are vectors of the same length, then
\[
\|a - b - c\|^2 = \|a\|^2 + \|b\|^2 + \|c\|^2 - 2a^T b - 2a^T c + 2b^T c.
\]
Result 1.21 Let A be a symmetric invertible matrix and x and b be column vectors with the same number of rows as A. Then
\[
-\tfrac{1}{2}x^T A x + b^T x = -\tfrac{1}{2}(x - A^{-1}b)^T A (x - A^{-1}b) + \tfrac{1}{2}b^T A^{-1} b.
\]
Result 1.22 Let v be a random vector. Then
\[
E(vv^T) = \mathrm{Cov}(v) + E(v)E(v)^T \quad\text{and}\quad E(\|v\|^2) = \mathrm{tr}\{\mathrm{Cov}(v)\} + \|E(v)\|^2.
\]
Result 1.23 Let v be a random vector and let A be a constant matrix with the same number of rows as v. Then
\[
E(v^T A v) = E(v)^T A\,E(v) + \mathrm{tr}\{A\,\mathrm{Cov}(v)\}.
\]

1.6 Accuracy measure
This section describes the mathematics behind the accuracy measure used to assess the quality of an approximate MFVB posterior via comparison to a baseline MCMC posterior. Details are taken from Section 8 of Wand et al. (2011).
Let p(θ|x) denote the posterior distribution of a parameter θ given data x. Our aim is to assess the quality of a MFVB approximation to the posterior, which we denote by q*(θ). Wand et al. (2011) describe measuring the accuracy of q*(θ) via the L1 distance. The L1 error, also known as the integrated absolute error (IAE), of q*(θ) is defined as
\[
\mathrm{IAE}\{q^*(\theta)\} = \int_{-\infty}^{\infty} |q^*(\theta) - p(\theta|x)|\,d\theta.
\]
Since IAE ∈ (0, 2), we then define the accuracy of q*(θ) as
\[
\text{accuracy}\{q^*(\theta)\} = 1 - \tfrac{1}{2}\,\mathrm{IAE}\{q^*(\theta)\},
\]
and so 0 ≤ accuracy{q*(θ)} ≤ 1. We then express accuracy as a percentage. In practice, the posterior p(θ|x) is approximated by an extremely accurate MCMC analogue, denoted by pMCMC(θ|x).
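In practice the IAE integral is computed numerically. The sketch below (ours, not the thesis's code) estimates the accuracy of a q* density against a kernel density estimate of the MCMC draws on a common grid; the grid size and the example densities are arbitrary choices:

# Illustrative numerical version of the accuracy measure (not thesis code).
accuracy <- function(theta.mcmc, q.star, grid.size = 1001) {
  g <- seq(min(theta.mcmc), max(theta.mcmc), length = grid.size)
  dens <- density(theta.mcmc)                        # stands in for p(theta | x)
  p.mcmc <- approx(dens$x, dens$y, xout = g, rule = 2)$y
  iae <- sum(abs(q.star(g) - p.mcmc))*diff(g[1:2])   # Riemann sum for the IAE
  100*(1 - 0.5*iae)                                  # accuracy, as a percentage
}
# Example: a slightly mis-located and over-dispersed normal q* density
# compared against draws whose true posterior is N(0, 1):
set.seed(1)
accuracy(rnorm(10000), function(x) dnorm(x, mean = 0.1, sd = 1.1))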
1.7 Overview of the thesis
The aim of this thesis is to address the research question: "Is mean field variational Bayes a viable inference tool for semiparametric regression models involving elaborate distributions?"
This involves identifying models and/or data sets for which traditional inference is either too slow or impossible. We are attempting to make analysis of these models temporally viable by implementing faster methodology. In some cases, this involves extending current research on MFVB to more complex models. In other areas, we are using MFVB for the very first time.
Each chapter essentially involves development of a MFVB algorithm to perform fast, approximate inference for the given model and data. The efficiency of the algorithm is then tested using either simulation studies or real data through comparison with MCMC.
Chapter 2 involves development of MFVB methodology for Generalized Extreme Value additive models. Structured MFVB and auxiliary mixture sampling are the key methodologies used. Chapter 2 culminates in the fast, deterministic analysis of the Sydney hinterland maximum rainfall data. In Chapter 3, we present MFVB inference for quantile semiparametric regression. This centres around the Asymmetric Laplace distribution. Chapter 4 presents MFVB inference for sparse signal shrinkage, focusing on continuous shrinkage priors. A new approach is developed to add to the MFVB armoury: continued fraction approximations via Lentz's Algorithm. The insights gained in Chapter 4 flow on to inform the prior distributions used in Chapters 5 and 6. In Chapter 5 we develop MFVB algorithms for inference in penalised wavelet regression. Three types of penalisation are considered: Horseshoe, Normal-Exponential-Gamma and Laplace-Zero. Chapter 6 develops MFVB inference for the most complex model considered in the thesis. This model is motivated by the radiation pneumonitis data.
All computations were carried out in R (R Development Core Team, 2011) using a MacBook Pro with a 2.7 GHz Intel Core i7 processor and 8 GB of memory.

Chapter 2
Mean field variational Bayes inference for Generalised Extreme Value regression models

2.1 Introduction
Analysis of sample extremes is becoming more prominent, largely driven by increasing interest in climate change research. The generalised extreme value (GEV) distribution can be used to model the behaviour of such sample extremes. Over the past decade, many papers have addressed the idea of using GEV additive models to model sample extremes. These papers differ in their approaches, ranging from Bayesian to frequentist, and employing various methods of fitting. For example, Chavez-Demoulin and Davison (2005) use data on minimum daily temperatures at 21 Swiss weather stations during winters in the seventies to motivate their methodology. They found that the North Atlantic oscillation index has a nonlinear effect on sample extremes. Davison and Ramesh (2000) outline a semiparametric approach to smoothing sample extremes, using local polynomial fitting of the GEV distribution. The aim of Yee and Stephenson (2007) was to provide a unifying framework for flexible smoothing in models involving extreme value distributions. Laurini and Pauli (2009) carry out nonparametric regression for sample extremes using penalized splines and the Poisson point process model. Markov chain Monte Carlo methods were used for inference.
As explored by Laurini and Pauli (2009), a popular way of fitting hierarchical Bayesian models is to use MCMC methods. These methods, although highly accurate, are very computationally intensive and inference can take hours. This is where mean field variational Bayes (MFVB) comes to the fore as an exciting alternative. MFVB provides an approximate alternative to MCMC, with massively reduced computation time. This chapter brings together the methodology of the GEV additive model and MFVB, resulting in fast, approximate inference for analysing sample extremes.
We begin by explaining the need to model the GEV distribution using a highly accurate mixture of Normal distributions. The particular mixture of Normal distributions used to approximate the GEV density is obtained by minimising the χ² distance between the two, as outlined in Wand et al. (2011). We then set up a regression model for this normal mixture response. MFVB methodology is then presented for the regression case, which extends methodology presented in Wand et al. (2011). Derivations are deferred to Appendix 2.A. We then extend from the regression case to the semiparametric regression case via construction of an additive model. Then, using relevant theory we derive MFVB algorithms for inference. Methods for construction of variability bands for the estimated parameters are then presented. This is followed by an explanation of how to devise an overall lower bound approximation for the GEV normal mixture model based on the MFVB approach. Finally we analyse maximum rainfall data from throughout New South Wales (NSW), Australia, using the full GEV normal mixture based MFVB approach.
The majority of the work in this chapter has been peer reviewed and published in Neville, Palmer and Wand (2011). Work on the geoadditive extension in Section 2.8 culminated in the paper Neville and Wand (2011) and a presentation at the conference Spatial Statistics: Mapping Global Change, Enschede, the Netherlands, 2011. 2.2 Direct mean field variational Bayes Constructing MFVB algorithms for any model involves taking expectations of full conditionals with respect to many parameters. The complex form of the GEV likelihood ultimately leads to these expectations becoming too complex, and thus the resulting MFVB approximate posteriors become intractable. First, we step through the direct MFVB methodology for the GEV model, and show why it breaks down. We then present the idea of approximating the GEV density as a highly accurate mixture of normals, first illustrated in Wand et al. (2011). Let x follow a GEV(µ, σ, ξ) distribution with parameters µ, σ > 0 and ξ ∈ R. This implies that 1 p(x) = pGEV σ x−µ ;ξ σ 2.2. DIRECT MEAN FIELD VARIATIONAL BAYES 24 where pGEV is the GEV(0, 1, ξ) density function defined by (2.2). The density of x is therefore: 1 p(x|µ, σ, ξ) = σ 1+ξ x−µ σ −1/ξ−1 " −1/ξ # x−µ exp − 1 + ξ . σ Now assume we impose the continuous priors µ ∼ N (µµ , σµ2 ) and σ 2 ∼ Inverse-Gamma(A, B) and the discrete prior ξ ∼ p(ξ), ξ ∈ Ξ. Take, for example, the process of deriving an approximation, q ∗ (µ), for the posterior of µ. Following the process set out in Section 1.3, we first find the logarithm of the full conditional of µ, and then take expectations with respect to the remaining parameters. The full conditional for µ is of the form: p(µ|rest) ∝ p(x|rest)p(µ) " −1/ξ−1 −1/ξ # x−µ x−µ 1 1+ξ exp − 1 + ξ ∝ σ σ σ (µ − µµ )2 × exp − 2σµ2 −1/ξ−1 1 x−µ ∝ 1+ξ σ σ " # −1/ξ (µ − µµ )2 x−µ × exp − 1 + ξ − . σ 2σµ2 Therefore, (µ − µµ )2 1 x−µ log p(µ|rest) = const. − + − − 1 log 1 + ξ 2σµ2 ξ σ −1/ξ x−µ − 1+ξ , σ hence ∗ log q (µ) = Eσ,ξ (µ − µµ )2 1 x−µ const. − + − − 1 log 1 + ξ 2σµ2 ξ σ −1/ξ # x−µ − 1+ξ . σ (2.1) We can see the complex dependence on the parameters emerging as a substantial obsta- 2.3. AUXILIARY MIXTURE SAMPLING APPROACH 25 cle, predominantly in the final term of (2.1). Thus, taking expectations in order to derive optimal q ∗ densities is highly intractable. It is for this reason that Wand et al. (2011) developed the auxiliary mixture approach to handling GEV responses in a MFVB framework. 2.3 Auxiliary mixture sampling approach Auxiliary mixture sampling involves using a combination of simple distributions to approximate a more complex distribution. The use of auxiliary mixture sampling in Bayesian analysis dates back to at least 1994, when Shephard (1994) used mixture distributions to allow for outliers in analysing non-Gaussian time series models. Since then, many others have used auxiliary mixture sampling to overcome problems associated with elaborate distributions. For example, Kim, Shephard and Chib (1998) and Chib, Nardari and Shephard (2002) used normal mixtures to approximate the density of a log χ2 distribution in the context of stochastic volatility models. More recently, and closer to the context of our area of interest, Frühwirth-Schnatter and Wagner (2006) used auxiliary mixture sampling to deal with Gumbel random variables, that follow a GEV distribution with ξ=0. Frühwirth-Schnatter and Frühwirth (2007) also used a finite mixture of normals to approximate the extreme value distribution. Wand et al. 
(2011) introduce the idea of auxiliary mixture sampling as a fundamental step in facilitating variational Bayesian inference for simple univariate GEV models. The authors explain how finite normal mixture approximations to the GEV density are generated. Firstly, a general normal mixture approximation to the GEV density of size K is defined, in the form of definition (1.5.20). Then, the weight, mean and standard deviation vectors (w, m and s respectively) of the Normal mixture are chosen to minimise the χ2 distance between the GEV density and the mixture approximation. Firstly, we introduce the complex GEV density. If x ∼ GEV(0, 1, ξ), then (1 + ξx)−1/ξ−1 exp −(1 + ξx)−1/ξ , ξ 6= 0 pGEV (x; ξ) = exp (−x − e−x ) , ξ=0 (2.2) The finite normal mixture approximation to the GEV density is assumed to be of the form set out in Definition 1.5.20 K X wk x − mk pNM (x; w, m, s) ≡ φ , sk sk k=1 where w = (w1 , . . . , wK ), m = (m1 , . . . , mK ) and s = (s1 , . . . , sK ) are assumed to be fixed 26 2.4. STRUCTURED MEAN FIELD VARIATIONAL BAYES vectors not requiring Bayesian inference. Essentially, after fixing K, both the L1 and χ2 distance between the true density and the finite mixture approximation are investigated as measures to minimise in order to produce the final mixture approximation. The χ2 distance is identified as the better of the two. A chi-squared normal mixture approximations for ξ = 1 is illustrated in Figure2.1. 0.3 exact approx 0.0 0.1 0.2 density 0.4 0.5 Wand et al. (2011) then explain how the derivation of the finite normal mixture approxi- 0 2 4 6 8 x Figure 2.1: A normal mixture approximation to the GEV density with ξ = 1. mation of the GEV density forms a vital step in MFVB inference. We highlight/identify where this vital step fits within the broader process of structured MFVB in Section 2.4. 2.4 Structured mean field variational Bayes In order to facilitate MFVB inference for the GEV shape parameter ξ, we must introduce a MFVB extension known as structured MFVB. MFVB inference is difficult due to the presence of the complex dependence of the GEV density on the shape parameter ξ. However, when we fix ξ, MFVB inference becomes tractable. Saul and Jordan (1996) proposed structured MFVB for this very situation. Structured MFVB was first used in the context of Bayesian hierarchical models in Wand et al. (2011). We summarise the major ideas here in the context of the GEV regression model, borrowing heavily from Section 3.1 of Wand 2.5. FINITE NORMAL MIXTURE RESPONSE REGRESSION 27 et al. (2011). Let θ denote all parameters in the GEV additive model, except ξ. Hence we are considering a Bayesian model of the form y|θ, ξ ∼ p(y|θ, ξ) (2.3) where θ ∈ Θ and ξ ∈ Ξ. To reiterate, MFVB inference is tractable for fixed ξ, but not tractable when ξ is included as a model parameter, as seen in Section 2.2. In practice, we limit the parameter space Ξ of ξ to a finite number of atoms. We denote the the prior density function of θ by p(θ), and the probability mass function of ξ by p(ξ). Wand et al. (2011) states the following results regarding the MFVB inference in a structured context. The optimal q ∗ densities are given by: q ∗ (θ i ) = X q ∗ (ξ)q ∗ (θ i |ξ) (2.4) p(ξ)p(y|ξ) 0 0 ξ 0 ∈Ξ p(ξ )p(y|ξ ) (2.5) ξ∈Ξ and q ∗ (ξ) = P The lower bound on the marginal likelihood is given by p(y; q) = X q ∗ (ξ)p(y|ξ). (2.6) ξ∈Ξ Summarising the previous and current sections, MFVB inference for GEV models, be they simple regression models or additive models, involves 4 stages. 
Firstly, we limit the continuous shape parameter ξ to a discrete set Ξ. Secondly, finite normal mixture approximation of the GEV density function is carried out for each ξ ∈ Ξ. Thirdly, MFVB inference is carried out for these normal mixture models over each ξ ∈ Ξ. For fixed ξ this results in approximate posteriors for the remaining model parameters. The final stage involves combining the results across all ξ ∈ Ξ to make approximate inference for all parameters in the model, including the shape parameter ξ. 2.5 Finite normal mixture response regression Section 2.3 illustrated the need for GEV responses to be approximated by a finite mixture of normal densities in order to facilitate approximate Bayesian inference. Wand et al. (2011) carried out MFVB inference for simple univariate GEV models via normal-mixture 28 2.5. FINITE NORMAL MIXTURE RESPONSE REGRESSION approximations. The remainder of this chapter extends the work of Wand et al. (2011) to allow approximate Bayesian inference for GEV regression models. The first step in achieving MFVB inference for GEV regression models is to develop an algorithm to carry out MFVB for normal-mixture response regression. 2.5.1 Model We consider the model ind. yi |β, σε2 , ξ ∼ Normal-Mixture{(Xβ)i , σε , w, m, s}, β ∼ N (µβ , Σβ ), 1 ≤ i ≤ n, (2.7) σε2 ∼ Inverse-Gamma(Aε , Bε ), where β= β0 β1 1 x1 . . X = .. .. , 1 xn , Aε , Bε > 0 are scalar constants, µβ is a constant vector, Σβ is a constant matrix, and w = (w1 , . . . , wK ), m = (m1 , . . . , mK ) and s = (s1 , . . . , sK ) are Normal-Mixture weight, mean and standard deviation vectors pre-determined to accurately approximate the GEV density for a fixed value of the shape parameter ξ. Using (2.7) and Definition 1.5.20, we have p(yi |β, σε2 , w, m, s) = = yi − (Xβ)i σε n y −(Xβ) o i i K − m X k σε 1 wk φ σε sk sk 1 pNM σε k=1 K X wk yi − {(Xβ)i + σε mk } = φ . σε sk σε sk k=1 We now introduce auxiliary variables (a1 , . . . , an ) that allow the the distribution of yi to vary as i varies from 1 to n. Specifically, a1 , . . . , an ∼ Multinomial(1; w1 , ..., wK ) where ai = (ai1 , ..., aiK ), aik ∈ {0, 1} and K X k=1 aik = 1. 29 2.5. FINITE NORMAL MIXTURE RESPONSE REGRESSION Using auxiliary variables (a1 , . . . , an ), we can re-express Model (2.7) as p(yi |β, ai , σε2 ) = QK h k=1 β ∼ N (µβ , Σβ ), 1 σε sk φ σε2 n (y−Xβ)i −σε mk σε sk oiaik , 1 ≤ i ≤ n, (2.8) ∼ Inverse-Gamma(Aε , Bε ), which explicitly gives N {(Xβ)i + σε m1 , σε2 s21 }, ai1 = 1, .. yi |β, ai , σε2 ∼ . N {(Xβ) + σ m , σ 2 s2 }, a = 1. i ε K ε K iK (2.9) The conditional dependence structure of Model (2.8) is illustrated in the directed acyclic graph (DAG) in Figure 2.2. The DAG allows us to observe relationships between the a β y σε2 Figure 2.2: Directed acyclic graph for Model (2.8). observed data y and the model parameters β, a = (a1 , . . . , aK ) and σε2 in a simplified, visual manner. The fixed quantities w, m and s are omitted from the DAG as they do not require inference. The importance of the DAG as a tool in application of MFVB inference to more complex models will become clear as the chapter develops. 2.5.2 Mean field variational Bayes Here we present a MFVB algorithm for Model (2.8). We impose the product restriction q(β, a, σε2 ) = q(β) q(a) q(σε2 ) (2.10) 2.6. GENERALIZED EXTREME VALUE ADDITIVE MODEL 30 on the posterior p(β, a, σε2 |y). 
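Before stating the optimal q* densities, a small numerical check can make the normal-mixture response and its auxiliary-variable representation concrete. The R sketch below uses made-up mixture vectors w, m, s of length K = 3 (rather than the χ²-optimal values of Wand et al., 2011) and arbitrary values of (Xβ)_i, σε and y_i. It evaluates the density of a single response directly from the mixture form in Model (2.7), and again by summing the K component normal densities in (2.9) over the possible values of the indicator a_i; the two computations agree, which is the equivalence underlying Model (2.8).

# Numerical check of the normal-mixture response density; the mixture vectors
# (w, m, s) below are made-up values of length K = 3, used for illustration only.
w <- c(0.2, 0.5, 0.3)         # mixture weights, summing to 1
m <- c(-1.0, 0.0, 1.5)        # mixture means
s <- c(0.8, 1.0, 1.3)         # mixture standard deviations

Xbeta.i   <- 2.0              # (X beta)_i for a single observation, arbitrary
sigma.eps <- 0.7              # sigma_epsilon, arbitrary
y.i       <- 2.4              # a single response value, arbitrary

# Direct evaluation of the Normal-Mixture density, as in Model (2.7):
dens.direct <- sum((w/(sigma.eps*s)) *
                   dnorm((y.i - Xbeta.i - sigma.eps*m)/(sigma.eps*s)))

# Evaluation via the auxiliary indicator a_i ~ Multinomial(1; w), as in (2.8)-(2.9):
# component k contributes w_k times a N{(X beta)_i + sigma_eps*m_k, sigma_eps^2 s_k^2} density.
dens.aux <- sum(w * dnorm(y.i, mean = Xbeta.i + sigma.eps*m, sd = sigma.eps*s))

c(dens.direct, dens.aux)      # the two evaluations coincide

The agreement of the two evaluations is exactly what licenses the introduction of the auxiliary indicators, which in turn gives the tractable full conditionals used throughout this section.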
The optimal q ∗ densities for the parameters in Model (2.8) under product restriction (2.10) take the form: q ∗ (β) ∼ N (µq(β) , Σq(β) ), −Aε − q ∗ (σε2 ) = (σε2 ) n C1 C 2 −1 exp σ − 22 ε σε 2J + (2Aε +n−1,C1 ,C2 ) ind. q ∗ (ai ) ∼ Multinomial(1; µq(ai ) ), , σε2 > 0, 1 ≤ i ≤ n. Algorithm 1 gives details of an iterative scheme for finding optimal q ∗ moments for all key model parameters under product restriction (2.10). Algorithm 1 uses lower bound (2.11) to monitor convergence. Derivations for the optimal densities, updates in Algorithm 1 and lower bound (2.11) are deferred to Appendix 2.A. The lower bound corresponding to Algorithm 1 is given by log p(y; q) = d n − log(2π) + log(2) + A log(B) − log{Γ(A)} 2 2 + log J + (2A + n − 1, C1 , C2 ) K X |Σq(β) | m2k 1 + µq(a•k ) log(wk /sk ) − 2 + log 2 |Σβ | 2sk (2.11) k=1 1 T −1 − {tr(Σ−1 β Σq(β) ) + (µq(β) − µβ ) Σβ (µq(β) − µβ )} 2 n X K X − µq(aik ) log(µq(aik ) ). i=1 k=1 where µq(a•k ) = 2.6 Pn i=1 µq(aik ) . Generalized Extreme Value additive model Now we have derived an algorithm for approximate Bayesian inference for a NormalMixture response regression model, we proceed to extend our methodology to do the same for a GEV additive model. This extension involves three major changes: (1) use of multiple predictors (x1 , . . . , xd ); (2) incorporation of spline basis functions into our model to capture non-linear trends in the data; and (3) use of structured MFVB to bring together estimates from Normal-Mixture approximate inference at each ξ ∈ Ξ. 31 2.6. GENERALIZED EXTREME VALUE ADDITIVE MODEL Initialize: µq(β,u) , Σq(β,u) , and µq(1/σ) , µq(1/σ2 ) > 0. Cycle: Update q ∗ (a) parameters: For i = 1, . . . , n, k = 1, . . . , K: νik ← log(wk /sk ) − n o 1 h 2 T µ (y − Xµ ) + (XΣ X ) 2 ii q(β) q(β) i 2s2k q(1/σε ) i −2mk µq(1/σε ) (y − Xµq(β) )i + m2k eνik µq(aik ) ← PK ; νik k=1 e D µq(a k) ← diag{µq(a1k ) , . . . , µq(ank ) } Update q ∗ (β) parameters: K X 1 D µq(a ) k s2k ( X T µq(1/σε2 ) Σq(β) ← ! X + Σ−1 β )−1 k=1 K X 1 D µq(a ) y k s2k k=1 ! ) K X mk −µq(1/σε ) D µq(a ) 1 + Σ−1 β µβ k s2k ( µq(β) ← Σq(β) XT µq(1/σε2 ) k=1 Update q ∗ (σ 2 ) parameters: C1 ← C2 K X mk k=1 s2k 1T D µq(a ) (y − Xµq(β) ) k K 1X 1 n ← B+ tr(X T D µq(a ) XΣq(β) ) k 2 s2k o k=1 +(y − Xµq(β) )T D µq(a ) (y − Xµq(β) ) . k µq(1/σε2 ) = J + (2A + n + 1, C1 , C2 ) ; J + (2A + n − 1, C1 , C2 ) µq(1/σε ) = J + (2A + n, C1 , C2 ) . J + (2A + n − 1, C1 , C2 ) until the increase in p(y; q) is negligible. Algorithm 1: Mean field variational Bayes algorithm for Model (2.8) under product restriction (2.10). 32 2.6. GENERALIZED EXTREME VALUE ADDITIVE MODEL 2.6.1 Model Let yi , 1 ≤ i ≤ n be a set of response variables for which a GEV(µi , σε , ξ) distribution is appropriate. We assume that the means, µi , are of the form: (2.12) µi = f1 (x1i ) + . . . + fd (xdi ), where, for each 1 ≤ i ≤ n, (x1i , . . . , xdi ) is a vector of continuous predictor variables and the f1 , . . . , fd are smooth functions. Padoan and Wand (2008) describe the mixed model based penalized spline approach to estimating parameters in additive models for sample extremes. The mixed model based approach facilitates the use of Bayesian inference methods such as MCMC and MFVB. Explicitly, the right hand side of (2.12) is modelled as: d X f` (x` ) = β0 + `=1 d X ( β` x` + `=1 K X̀ ) u`,k z`,k (x` ) k=1 with ind. u`,1 , . . . , u`,K` |σ`2 ∼ N (0, σ`2 ) for each 1 ≤ ` ≤ d. The {z`,1 (·), . . . , z`,K` (·)} are a set of spline basis functions that allow estimation of f` . 
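To see how the spline basis functions enter the model in practice, the following R sketch builds the design matrices for a single predictor using a simple truncated-line basis z_k(x) = (x − κ_k)_+ with equally spaced interior knots. This basis is only a stand-in for illustration; the basis actually used in this chapter is the O'Sullivan penalized spline basis discussed next, and the coefficient values below are arbitrary.

# Illustration of the mixed model spline representation for a single predictor,
# using a truncated-line basis as a stand-in for the O'Sullivan basis.
set.seed(1)
n <- 200
x <- sort(runif(n))
K <- 15
knots <- seq(min(x), max(x), length = K + 2)[-c(1, K + 2)]     # K interior knots

X <- cbind(1, x)                                        # fixed effects: intercept and linear term
Z <- outer(x, knots, function(xv, kv) pmax(xv - kv, 0)) # spline basis evaluations z_k(x_i)

beta <- c(0.5, 1.0)            # arbitrary fixed effect coefficients, illustration only
u    <- rnorm(K, sd = 0.1)     # spline coefficients; N(0, sigma_u^2) in the model
f.hat <- as.vector(X %*% beta + Z %*% u)                # f(x_i) = (X beta + Z u)_i
plot(x, f.hat, type = "l")

Whatever basis is chosen, the fitted function is always a linear combination of the columns of [X Z], which is why the MFVB updates below only ever involve these two design matrices.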
Specific choices of the spline basis functions vary. O’Sullivan penalized splines are described by Wand and Ormerod in their 2008 paper. Welham, Cullis, Kenward and Thompson (2007) described the close connection between three types of polynomial mixed model splines: smoothing splines, P-splines and penalized splines. We choose to work with O’Sullivan penalized splines here. We define the matrices: β= β0 β1 .. . , βd u`1 . u` = .. , u`K` u1 . u = .. ud 1 x11 . . . x1d .. .. .. .. X= . . . . 1 xn1 . . . xnd , z (x ) . . . z`K` (x1` ) `,1 1` .. .. .. and Z ` = . . . z`,1 (xn` ) . . . z`K` (xn` ) , 33 2.6. GENERALIZED EXTREME VALUE ADDITIVE MODEL and further set Z = [Z 1 . . . Z d ]. We then define the Bayesian GEV additive model as: ind. yi |β, u, σε , ξ ∼ GEV{(Xβ + Zu)i , σε , ξ}, 1 ≤ i ≤ n, 2 , . . . , σ 2 ∼ N {0, blockdiag(σ 2 I, . . . , σ 2 I)}, u|σu1 u1 ud ud β ∼ N (0, Σβ ), 2 σu` ind. ∼ Inverse-Gamma(Au` , Bu` ), σε2 ∼ Inverse-Gamma(Aε , Bε ), (2.13) 1 ≤ ` ≤ d, ξ ∼ p(ξ), ξ ∈ Ξ, where Σβ is a symmetric and positive definite (d+1)×(d+1) matrix and Aε , Bε , Au` , Bu` > 0 are hyperparameters for the variance component prior distributions. The set Ξ of prior atoms for ξ is assumed to be finite. The DAG for (2.13) is shown in Figure 2.3. As in 2 σu1 2 σu2 ... 2 σud u1 u2 ... ud σε2 β y ξ Figure 2.3: Directed acyclic graph representation of Model (2.13). the univariate and simple regression case, for fixed ξ, we use an accurate finite mixture of normal densities to approximate the GEV response. Recall that fGEV (·; ξ) denotes the GEV (0, 1, ξ) family of density functions, given by (2.2). In practice, we replace each fGEV (·; ξ), ξ ∈ Ξ by an extremely accurate normal mixture approximation: K X x − mk wk φ . fGEV (x; ξ) ≈ sk sk k=1 34 2.6. GENERALIZED EXTREME VALUE ADDITIVE MODEL Model (2.13) becomes ind. yi |β, u, σε , ξ ∼ Normal-Mixture{(Xβ + Zu)i , σε , w, m, s}, 1 ≤ i ≤ n, 2 , . . . , σ 2 ∼ N {0, blockdiag(σ 2 I, . . . , σ 2 I)}, u|σu1 u1 ud ud β ∼ N (0, Σβ ), 2 σu` ind. ∼ Inverse-Gamma(Au` , Bu` ), (2.14) 1 ≤ ` ≤ d, σε2 ∼ Inverse-Gamma(Aε , Bε ). and forms the first step towards MFVB inference for the GEV additive model. Again, as in the regression case, we can rewrite (2.14) as p(yi |β, u, ai , σε , ξ) = QK ind. k=1 n 1 σε sk φ ai ∼ Multinomial(1, w) (y−Xβ−Zu)i −σε mk σε sk oaik , 1 ≤ i ≤ n, 2 , . . . , σ 2 ∼ N {0, blockdiag(σ 2 I, . . . , σ 2 I)}, u|σu1 u1 ud ud β ∼ N (0, Σβ ), ind. 2 ∼ Inverse-Gamma(A , B ), σu` u` u` (2.15) 1 ≤ ` ≤ d, σε2 ∼ Inverse-Gamma(Aε , Bε ). 2.6.2 Mean field variational Bayes for the finite normal mixture response additive model In this section we present the derivation of a MFVB algorithm for Model (2.15). We begin by imposing the product restriction 2 2 2 2 )q(a). , a) = q(β, u)q(σε2 , σu1 , . . . , σud q(β, u, σε2 , σu1 , . . . , σud (2.16) A good starting point is to look at the DAG for Model (2.15), illustrated in Figure 2.4. It shares many similarities with the DAG for Model (2.8) illustrated in Figure 2.2. Essentially the structure of each model is the same, with the addition of random effects u` and 2 in the additive model. This is where the locality property their underlying variances σu` of mean variational Bayes comes to the fore through Markov blanket theory. As set out in Section 1.4, the distribution of a node within a DAG depends only on the nodes within its Markov blanket. But what impact does this locality property have on the derivation of the optimal q ∗ densities for Model (2.15)? 
Comparing Figures 2.2 and 2.4, similarities can be seen between the DAGs for the simple and more complex model. The parameter β in the simple regression model plays the same role as (β, u) in the additive model. So, we expect to see similarities between the optimal q ∗ densities of β in 35 2.6. GENERALIZED EXTREME VALUE ADDITIVE MODEL 2 σu1 2 σu2 ... 2 σud u1 u2 ... ud a β σε2 y Figure 2.4: Directed acyclic graph representation of Model (2.15). the simple regression case and (β, u) in the additive model case. Figure 2.4 also illus2 , namely (β, u) and the other variance trates the parameters in the Markov blanket of σu` 2 2 2 2 , . . . , σ2 components σu1 u,`−1 , σu,`+1 , . . . , σud . Hence the optimal density of σu` will have no dependence on the form of the response y, nor the parameters σε2 and a. The optimal q ∗ densities for the parameters in Model (2.15) take the form: q ∗ (β, u) ∼ N (µq(β,u) , Σq(β,u) ), 2 ) ind. q ∗ (σu` ∼ Inverse-Gamma Au` + −Aε − q ∗ (σε2 ) = ind. (σε2 ) n 2 −1 exp C C3 − 42 σε σε K` 2 ) 2 , Bq(σu` , 1 ≤ ` ≤ d, 2J + (2Aε +n−1,C3 ,C4 ) q ∗ (ai ) ∼ Multinomial(1; µq(ai ) ), , σε2 > 0, 1 ≤ i ≤ n. Algorithm 2 gives an iterative scheme for finding the moments of the optimal q ∗ densities stated above. The expression for the corresponding lower bound on the marginal loglikelihood is given by (2.17). Full derivations of the optimal q ∗ densities, Algorithm 2 and lower bound (2.17) are presented in Appendix 2.A. The lower bound corresponding 1 36 2.6. GENERALIZED EXTREME VALUE ADDITIVE MODEL Initialize: µq(β) , Σq(β) , and µq(1/σε ) , µq(1/σε2 ) > 0. Cycle: Update q ∗ (a) parameters for i = 1, . . . , n, k = 1, . . . , K: 1 µq(1/σε2 ) (y − Cµq(β,u) )2i + (CΣq(β,u) C T )ii 2 2sk −2mk µq(1/σε ) (y − Cµq(β,u) )i + m2k νik ← log(wk /sk ) − eνik µq(aik ) ← PK ; νik k=1 e D µq(a k) ← diag{µq(a1k ) , . . . , µq(ank ) } Update q ∗ (β, u) parameters: ! K X 1 C T µq(1/σε2 ) D µq(a ) C k s2k k=1 o−1 , µ I +blockdiag Σ−1 , . . . , µ I 2 2 K K 1 q(1/σu1 ) q(1/σ ) d β ( Σq(β,u) ← ud µq(β,u) ← Σq(β,u) C T ! K K X X 1 mk µq(1/σε2 ) D µq(a ) y − µq(1/σε ) D µq(a ) 1 k k s2k s2k k=1 k=1 Update q ∗ (σε2 ) parameters: C3 ← C4 K X mk k=1 s2k 1T D µq(a ) (y − Cµq(β,u) ) k K 1X 1 n ← Bε + tr(C T D µq(a ) CΣq(β,u) ) k 2 s2k o k=1 +(y − Cµq(β,u) )T D µq(a ) (y − Cµq(β,u) ) . k µq(1/σε2 ) ← J + (2Aε + n + 1, C3 , C4 ) ; J + (2Aε + n − 1, C3 , C4 ) µq(1/σε ) ← J + (2Aε + n, C3 , C4 ) . J + (2Aε + n − 1, C3 , C4 ) 2 ) parameters for ` = 1, . . . , d: Update q ∗ (σu` Bq(σ2 ) ← Bu` + u` o 1n tr(Σq(u` ) ) + ||µq(u` ) ||2 , 2 µq(1/σ2 ) ← u` Au` + K` /2 Bq(σ2 ) u` until the increase in p(y; q) is negligible. Algorithm 2: Mean field variational Bayes algorithm for Model (2.8) under product restriction (2.10). 37 2.7. DISPLAYING ADDITIVE MODEL FITS to Algorithm 2 is given by logp(y; q) = 1 2 1+d+ d X `=1 ! K` − n log(2π) + log(2) + Aε log(Bε ) − log Γ(Aε ) 2 + + log J (2Aε + n − 1, C3 , C4 ) + 21 log |Σq(β,u) | −1 1 T − 12 log |Σβ | − 12 tr(Σ−1 β Σq(β) ) − 2 µq(β,u) Σβ µq(β,u) d X {Au` log Bu` − log Γ(Au` ) (2.17) o − Au` + K2` log Bq(σ2 ) + log Γ Au` + K2` u` X K n X K X m2k + µq(a•k ) log(wk /sk ) − 2 − µq(aik ) log(µq(aik ) ). 2sk i=1 + `=1 k=1 where µq(a•k ) = 2.6.3 k=1 Pn i=1 µq(aik ) . Structured mean field variational Bayes Now that a MFVB algorithm has been developed to carry out approximate Bayesian inference for fixed ξ, we need to complete the final step to allow complete MFVB inference for Model (2.15). For each fixed ξ ∈ Ξ, we use Algorithm 2 to get approximations to 2 |y, ξ), 1 ≤ ` ≤ d. 
We denote the conditional posteriors p(β, u|y, ξ), p(σε2 |y, ξ) and p(σu` 2 |ξ), 1 ≤ ` ≤ d these variational Bayes approximations by q ∗ (β, u|ξ), q ∗ (σε2 |ξ) and q ∗ (σu` respectively. Using results (2.4), (2.5) and (2.6) set out in Section 2.4, we combine the approximate posteriors across all ξ ∈ Ξ via Algorithm 3. 2.7 Displaying additive model fits The majority of this chapter has described the process for obtaining approximate posterior distributions for the model parameters. The next step is to translate these into meaningful graphical displays consisting of fitted curves and corresponding variability bands. Essentially, to produce graphical summaries, each explanatory variable is plotted against the response, with all other predictor variables held at their mean values. In the case of the bivariate function of the explanatory variables latitude and longitude, the graphical summary plots the fit of the random variable maximum annual rainfall as a contour plot against geographical location. In practice, we set up a grid over the domain of each explanatory variable. For exam- 38 2.7. DISPLAYING ADDITIVE MODEL FITS For each ξ ∈ Ξ: 1. Retrieve the normal mixture approximation vectors (wk,ξ , mk,ξ , sk,ξ ), 1 ≤ k ≤ K for approximation of the GEV(0, 1, ξ) density function. 2. Apply Algorithm 2 with (wk , mk , sk ) set to (wk,ξ , mk,ξ , sk,ξ ) for 1 ≤ k ≤ K. 2 |ξ), 1 ≤ 3. Store the parameters needed to define q ∗ (β, u|ξ), q ∗ (σε2 |ξ) and q ∗ (σu` ` ≤ d. 4. Store the converged marginal likelihood lower bound p(y|ξ). Form the approximations to the posteriors p(ξ|y), p(β, u|y, ξ), p(σε2 |y, ξ) and 2 |y, ξ), 1 ≤ ` ≤ d via: p(σu` p(ξ)p(y|ξ) , 0 0 ξ 0 ∈Ξ p(ξ )p(y|ξ ) q ∗ (ξ) = P q ∗ (σε2 ) = X ξ∈Ξ q ∗ (ξ)q ∗ (σε2 |ξ), q ∗ (β, u) = X q ∗ (ξ)q ∗ (β, u|ξ), ξ∈Ξ 2 q ∗ (σu` )= X ξ∈Ξ 2 q ∗ (ξ)q ∗ (σu` |ξ), Form the approximate marginal likelihood p(y; q) = P ξ∈Ξ q 1 ≤ ` ≤ d. ∗ (ξ)p(y|ξ). Algorithm 3: Summary of the finite normal mixture approach to structured MFVB inference for the GEV additive model (2.15). ple, for the first predictor, say we set up a grid of size M , g 1 = (g11 , . . . , g1M ). In order to facilitate the alignment of the vertical axis with the response data we let the remaining girds for the other predictor variables be defined by g ` = x̄` 1M , where 1M is the M × 1 vector of ones. We then define the matrices: X (1) g ≡ [1 g 1 . . . g d ], (1) Z `g ≡ [z`,1 (g ` ) . . . z`,K` (g ` )] 1 ≤ ` ≤ d, and (1) (1) (1) C (1) g ≡ [X g |Z 1g . . . Z dg ]. The approximate posterior mean of f1 (g 1 ), the function approximating the contribution of the first predictor variable to the response, is given by f 1 = C (1) g µq(β,u) = X ξ∈Ξ q ∗ (ξ)C (1) g µq(β,u|ξ) , 39 2.7. DISPLAYING ADDITIVE MODEL FITS and the display of this fit is achieved by plotting f 1 against g 1 . Now to the idea of fitting point-wise 95% credible intervals in order to produce variability bands for our fitted curves. In order to produce these credible intervals, we need to obtain 0.025 and 0.975 quantiles of our MFVB approximations to the quantity " C (1) g β u # . " # β The MFVB approximation of , i.e. µq(β,u) as presented in Algorithm 3, has a finite u normal mixture form. Thus, our problem reduces to finding the 0.025 and 0.975 quantiles of a finite normal mixture at each point on our grid. We require the following result. 
Result 2.1 Suppose that the r × 1 vector x has the finite normal mixture density function p(x) = L X `=1 where PL `=1 ω` ω` (2π)−r/2 |Σ` |−1/2 exp − 12 (x − µ` )T Σ−1 ` (x − µ` ) = 1 and, for 1 ≤ ` ≤ L, ω` > 0, the µ` are unrestricted r × 1 vectors and the Σ` are r × r symmetric positive definite matrices. Write this as x ∼ ω1 N (µ1 , Σ1 ) + . . . + ωL N (µL , ΣL ). Then, for any constant r × 1 vector α, αT x ∼ ω1 N (αT µ1 , αT Σ1 α) + . . . + ωL N (αT µL , αT ΣL α). For 1 ≤ j ≤ M , let ej be the M × 1 vector having j th entry equal to one and all other entries equal to zero. Using Result 2.1, the 95% credible interval limits for our fitted curve are the 0.025 and 0.975 quantiles of X T (1) (1) T q ∗ (ξ)N (eTj C (1) g µq(β,u|ξ) , ej C g Σq(β,u|ξ) (C g ) ej . ξ∈Ξ To be clear, the ej vector picks out the distinct univariate finite normal mixture at the j th gridpoint. So, 95% credible intervals are found by simply computing the 0.025 and 0.975 quantiles of a univariate finite normal mixture at each gridpoint. 40 2.8. GEOADDITIVE EXTENSION 2.8 Geoadditive extension In order to extend Model (2.13) to include a geographical term, we model the mean µi as (2.18) 1 ≤ i ≤ n, µi = f1 (x1i ) + . . . + fd (xdi ) + g(xi ), where (x1i , . . . , xdi ) is a vector of continuous predictor variables and the f1 , . . . , fd are smooth, but otherwise arbitrary, functions. The xi is a 2×1 vector containing latitude and longitude measurements. Equation (2.18) is modelled further via spline basis functions as d X f` (x` ) + g(x) = β0 + `=1 d X ( `=1 β` x` + K X̀ ) u`,k z`,k (x` ) +β geo T x+ k=1 geo K X geo ugeo k zk (x) k=1 with ind. 2 2 u`,1 , . . . , u`,K` |σu` ∼ N (0, σu` ), 1 ≤ ` ≤ d, and ind. geo 2 2 ugeo 1 , . . . , uK geo |σu,geo ∼ N (0, σu,geo ). Here {z`,1 (·), . . . , z`,K` (·), zkgeo (·)} is a set of spline basis functions for estimation of both f` and g. We define the matrices β0 β1 . β = .. βd β geo , 1 x11 · · · x1d xT1 .. .. .. . .. X = .. . . . . 1 xn1 · · · xnd xTn , u`,1 . u` = .. , u`,K` ugeo ugeo 1 . = .. , ugeo K geo u1 .. . , u= ud geo u z (x ) · · · z`,K` (x1` ) `,1 1` .. .. .. Z` = . . . z`,1 (xn` ) · · · z`,K` (xn` ) . The number of and position of knots κk are generally chosen using a space filling algorithm as described in Ruppert, Wand & Carroll (2003). Form the matrices Z K = [kxi − κk k2 log kxi − κk k]1≤i≤n 1≤k≤K geo 41 2.9. NEW SOUTH WALES MAXIMUM RAINFALL DATA ANALYSIS and Ω = [kκk − κk0 k2 log kκk − κk0 k], 1≤k,k0 ≤K geo and then find the singular value decomposition of Ω: Ω = U diag(d)V T and use this to obtain the matrix square root of Ω: √ Ω1/2 = U diag( d)V T . We then compute Z geo = Z K Ω−1/2 , and define Z = [Z 1 · · · Z d Z geo ]. Then a Bayesian GEV geoadditive model is ind. yi |β, u, σε , ξ ∼ GEV{(Xβ + Zu)i , σε , ξ}, 1 ≤ i ≤ n, 2 2 2 2 , . . . , σ2 , σ2 u|σu1 ud u,geo ∼ N {0, blockdiag(σu1 I, . . . , σud I, σu,geo I)}, ind. 2 ∼ IG(A , B ), 1 ≤ ` ≤ d, σu` u` u` β ∼ N (0, Σβ ), (2.19) 2 σu, geo ∼ IG(Au,geo , Bu,geo ) σε2 ∼ IG(Aε , Bε ), ξ ∼ p(ξ), ξ ∈ Ξ, where Σβ is a symmetric and positive definite (d+1)×(d+1) matrix and Aε , Bε , Au` , Bu` > 0 are hyperparameters for the variance component prior distributions. The set Ξ of prior atoms for ξ is assumed to be finite. The MFVB algorithm for carrying out approximate inference for Model (2.19) is identical to that for Model (2.13), with an added variance component for geographical location. 
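As a concrete illustration of the geographical design matrix construction just described, the R sketch below forms Z_K, Ω and Z_geo = Z_K Ω^(-1/2), with the matrix square root of Ω obtained from its singular value decomposition. The locations and knots here are simulated purely for illustration; in the rainfall analysis of Section 2.9 the knots are placed at the weather stations, and in general they would be chosen by a space filling algorithm.

# Sketch of the low-rank thin plate spline construction described above.
set.seed(2)
n <- 100; K.geo <- 20
loc   <- cbind(runif(n, 149, 153), runif(n, -36, -31))  # simulated (longitude, latitude) pairs
knots <- loc[sample(n, K.geo), ]                        # stand-in for a space-filling design

tps <- function(r) { out <- r^2 * log(r); out[r == 0] <- 0; out }  # r^2 log(r), set to 0 at r = 0

dist.mat <- function(A, B)     # Euclidean distances between rows of A and rows of B
  sqrt(pmax(outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B), 0))

Z.K   <- tps(dist.mat(loc, knots))     # Z_K[i, k]  = ||x_i - kappa_k||^2 log ||x_i - kappa_k||
Omega <- tps(dist.mat(knots, knots))   # Omega[k,k'] = ||kappa_k - kappa_k'||^2 log ||kappa_k - kappa_k'||

sv <- svd(Omega)                                      # singular value decomposition of Omega
Omega.sqrt <- sv$u %*% diag(sqrt(sv$d)) %*% t(sv$v)   # matrix square root of Omega
Z.geo <- Z.K %*% solve(Omega.sqrt)                    # Z_geo = Z_K Omega^{-1/2}

Once Z.geo is appended to the other spline blocks to form Z, the geoadditive model is handled by exactly the same MFVB updates as Model (2.13), with one extra variance component.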
2.9 New South Wales maximum rainfall data analysis

In this section we apply our MFVB methodology for the GEV additive model to the Sydney hinterland maximum rainfall data. Fifty weather stations were selected from throughout Sydney and surrounds. Between 5 and 49 years of data was available from each of these stations. The response variable, maximum rainfall, is presented in Figure 2.5.

Figure 2.5: Annual winter maximum rainfall at 50 weather stations in the Sydney, Australia, hinterland. Each panel plots maximum rainfall (mm) against the number of years since 1955 for one station.

The following variables comprise the Sydney hinterland maximum rainfall data:

winter max. rainfall: maximum rainfall (mm) for the annual winter period, defined as April to September inclusive,
year: year (1955-2003),
day in season: day of winter period (i.e. number of days since 31st March within the current year),
OHA: Ocean Heat content Anomaly (10²² joules),
SOI: Southern Oscillation Index,
PDO: Pacific Decadal Oscillation,
longitude: degrees longitude of weather station,
latitude: degrees latitude of weather station.

We now provide some explanation of the predictor variables OHA, SOI and PDO, and why they are considered possible climate drivers for rainfall.

Ocean heat content anomaly (OHA) provides a measure of the amount of heat held by the ocean in a particular region and depth. We used the time series of quarterly OHA for the 0 - 700m layer in the Southern Pacific Ocean Basin. Data for OHA was obtained from the US National Oceanographic and Data Center website at www.nodc.noaa.gov/OC5/3M HEAT CONTENT. Levitus, Antonov, Wang, Delworth, Dixon and Broccoli (2001) and Willis, Roemmich and Cornuelle (2004) both state that over the past 40 years, the world's oceans have been the dominant source of changes in global heat content. Hence OHA has the potential to play a role in affecting rainfall patterns.

Figure 2.6: MFVB univariate functional fits in the GEV additive model (2.20) for the Sydney hinterland rainfall data; the panels show mean maximum rainfall against year, day in season, ocean heat anomaly, the Southern Oscillation Index and the Pacific Decadal Oscillation. The vertical axis in each panel is such that all additive functions are included, with the horizontal axis predictor varying and the other predictors set at their average values, as described in Section 2.7. The grey region corresponds to approximate pointwise 95% credible sets.

The Southern Oscillation Index (SOI), a unitless quantity, measures the air pressure difference between Tahiti and Darwin. We used monthly values of SOI, obtained from the Australian Bureau of Meteorology website www.bom.gov.au/climate/current/soi htm1.shtml. In general, sustained positive values of SOI indicate above average rainfall over northern and eastern Australia.
Conversely, negative values of SOI indicate below average rainfall over the north and east of the continent.

Pacific decadal oscillation (PDO) is a monthly measure of sea surface temperature anomalies in the North Pacific Ocean. Like SOI, PDO is a unitless quantity, and it provides a measure of Pacific climate variability. We obtained data from the website jisao.washington.edu/pdo/PDO.latest.

Figure 2.7: MFVB bivariate functional fit for geographical location in the GEV additive model (2.20) for the Sydney hinterland rainfall data, shown as a contour plot of mean maximum rainfall over degrees longitude and degrees latitude. The weather station locations are shown as grey dots. The black dots show the locations of six cities and towns (Tamworth, Taree, Orange, Sydney, Goulburn and Batemans Bay) with names as labelled.

We imposed the following GEV geoadditive model on our data:

winter max. rainfall_i ~ GEV{ f1(year_i) + f2(day in season_i) + f3(OHA_i) + f4(SOI_i) + f5(PDO_i) + g(longitude_i, latitude_i), σε, ξ },  independently for 1 ≤ i ≤ n,    (2.20)

where n = 1874 is the total number of winter maximum rainfall measurements from the 50 weather stations illustrated in Figure 2.5 between the years 1955 and 2003 (not all stations had this full set of years). Model (2.20) was then fitted via Algorithms 2 and 3. Univariate function estimates f̂1, . . . , f̂5 were each constructed using 37 O'Sullivan spline basis functions (described in Wand and Ormerod, 2008). The bivariate function, estimated by ĝ, used 50 bivariate thin plate spline basis functions as set out in Section 13.5 of Ruppert, Wand and Carroll (2003). Knots were set at the weather stations. Hyperparameters were set to Σβ = 10⁸ I, Aε = Bε = Au = Bu = 0.01 and p(ξ) uniform on Ξ = {0.00, 0.01, . . . , 0.50}.

Figure 2.8: The prior and MFVB approximate posterior probability mass functions for the GEV shape parameter ξ in the GEV additive model (2.20) for the Sydney hinterland maximum rainfall data. The upper panel shows the uniform prior probabilities over Ξ; the lower panel shows the approximate posterior probabilities.

We anticipate both spatial and temporal correlation to be present within our data. Including a smooth function of year is frequently used in additive model analysis of environmental time series data for the purpose of temporal de-correlation. Some examples are Wand and Schwartz (2002) and Dominici, McDermott and Hastie (2004). We also include a smooth function of geographical location to handle the anticipated spatial correlation.

Figure 2.9: Accuracy comparison between MFVB and MCMC for a single predictor model (d = 1). Top left panel: the simulated data together with the MFVB and MCMC estimates of f and pointwise 95% credible sets. Top right panel: the same as the top left panel but without the data and with the frame modified to zoom in on the function estimates. Bottom left panel: the same as the top right panel, but with the frame modified to zoom in on the region surrounding the peaks of the two function estimates. Bottom right panel: posterior probability mass functions for ξ based on MFVB and MCMC fitting. The accuracy shown indicates that there is 83% commonality between the true posterior and the MFVB approximation, measured using the L1 error.

Figures 2.6 and 2.7 illustrate the estimated univariate and bivariate functions resulting from fitting Model (2.20) to the Sydney hinterland maximum rainfall data via Algorithms 2 and 3. Figure 2.6 was constructed using the plotting scheme set out in Section 2.7. Starting with the univariate fits, the smooth function of year oscillates, corresponding to the dry and wet periods in the Sydney hinterland over the past 50 years. The effect of OHA is weakly nonlinear up to approximately OHA = 1.7, then promptly converts to an upward effect for OHA > 1.7. SOI shows an approximately piecewise linear effect. For SOI ≤ 10, there is a positive effect on the response. This is in agreement with the explanation of SOI, with positive values associated with higher rainfall. In contrast, for SOI > 10, there is a negative effect on maximum rainfall. The estimate of the effect of PDO shows an interesting oscillatory relationship with maximum rainfall.

Figure 2.7 illustrates well known rainfall patterns throughout the Sydney hinterland. We see higher rainfall along the NSW coastal plain and orographic effects due to the Great Dividing Range.

Figure 2.8 illustrates the posterior probability mass function for the shape parameter ξ. The uniform prior for ξ is also shown for comparative purposes. The majority of the probability mass lies between 0.15 and 0.27, with the mode at ξ = 0.21.
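The approximate posterior for ξ in Figure 2.8 comes from the final combining step of Algorithm 3, in which the converged marginal likelihood lower bounds p(y|ξ) are weighted by the prior p(ξ) and renormalised. A minimal R sketch of this step follows; the log lower bound values are invented purely for illustration (real values are returned by Algorithm 2 at each ξ ∈ Ξ), and the computation is done on the log scale to avoid underflow.

# Minimal sketch of the combining step in Algorithm 3 that yields q*(xi).
Xi <- seq(0, 0.5, by = 0.01)                             # grid of prior atoms for xi
log.p.y.given.xi <- -0.5*((Xi - 0.21)/0.04)^2 - 2500     # hypothetical converged log lower bounds
p.xi <- rep(1/length(Xi), length(Xi))                    # uniform prior p(xi) on Xi

# q*(xi) is proportional to p(xi) p(y|xi); subtract the maximum log value before
# exponentiating, since the lower bounds themselves are typically extremely small.
log.post <- log(p.xi) + log.p.y.given.xi
q.xi <- exp(log.post - max(log.post))
q.xi <- q.xi/sum(q.xi)

plot(Xi, q.xi, type = "h", xlab = expression(xi),
     ylab = "approximate posterior probability")

The same weights q*(ξ) are then used to mix the conditional approximations q*(β, u|ξ), q*(σε²|ξ) and q*(σ²_uℓ|ξ), as set out in Algorithm 3.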
2.10 Comparisons with Markov chain Monte Carlo The major reason for using approximate Bayesian inference is to overcome either computational storage or time constraints. In this section, we compare our MFVB inference with MCMC inference for a simplified single predictor GEV regression model. Although MCMC inference itself is theoretically approximate, given ample time it can be made incredibly accurate. Hence MCMC serves as our baseline. The accuracy measure used throughout this thesis is described in detail in Section 1.6. Figure 2.9 provides an illustrative summary of the accuracy comparisons between MFVB and MCMC for a single predictor (d = 1) model. The sample size is n = 500, and the data were simulated according to ind. yi ∼ GEV{f (xi ), 0.5, 0.3} with f (x) = sin(2πx2 ). The xi values were generated to follow the uniform distribution on (0, 1). Hyperparameters were set to Σβ = 108 I, Aε = Bε = Au = Bu = 0.01 and p(ξ) uniform on Ξ = {0.00, 0.08, . . . , 0.40}. Iterations of the MFVB algorithm were terminated when the change in the lower bound reached less than 10−10 . Ten thousand MCMC iterations were carried out in the R package BRugs (Ligges, Thomas, Spiegelhalter, Best, Lunn, Rice and Sturtz, 2011). A burn-in of 5000 was used, with the subsequent 5000 iterations then thinned by a factor of 5. MFVB inference took 4.5 minutes. In contrast, MCMC inference using the same model took just over 21 hours. The first three panels in Figure 2.9 illustrate MFVB-based and MCMC-based estimates and corresponding 95% credible sets for f . The top right panel shows the estimates without the data. The bottom left panel zooms in on the estimates and credible sets for x ∈ (0.2, 0.7). From this panel, we can see that the MFVB-based and MCMC-based estimates of f are very similar. The credible sets, however, are narrower in the MFVB case. The narrow nature of the credible sets is in-keeping with behaviour observed by 2.11. DISCUSSION 48 Wand et al. (2011) for simpler GEV models. The bottom right panel of Figure 2.9 shows that accuracy of MFVB inference for the shape parameter ξ is quite good at 83%. 2.11 Discussion This chapter has seen us successfully develop MFVB inference for GEV regression models, culminating in the analysis of the Sydney Hinterland maximum rainfall data. Comparison between MFVB inference and the MCMC baseline showed that estimation of the additive model components and the shape parameter is highly accurate. Therefore, in practical applications where the focus is on retrieval of the mean curve, MFVB provides a promising alternative to MCMC. However, MFVB estimation of credible sets is overly narrow. Therefore, MCMC remains the methodology of choice when variability of curve estimates is a priority. Even with the shortcoming of poor estimation of credible sets, the gains in computational speed offered by MFVB make it a viable choice for larger models and/or sample sizes. 49 2.A. DERIVATION OF ALGORITHM 1 AND LOWER BOUND (2.11) 2.A Derivation of Algorithm 1 and lower bound (2.11) 2.A.1 Full conditionals Full conditional for β log p(β|rest) = − where Ω=X 1 T β Ωβ − 2β T ω + const. 2 K 1 X 1 D ak σε2 s2k T ! X + Σ−1 β , k=1 ( ω = XT ) K 1 X 1 D ak (y − σε mk 1) + Σ−1 β µβ , σε2 s2k k=1 and D ak = diag(a1k , . . . , ank ). Derivation: p(β|rest) ∝ p(y|β, σε2 , a)p(β) ( n K ) YY 1 yi − {(Xβ)i + σε mk } aik = φ σε sk σε s k i=1 k=1 1 − d2 − 12 T −1 × (2π) |Σβ | exp − (β − µβ ) Σβ (β − µβ ) . 
2 Taking logarithms, we get n X K X log p(β|rest) = i=1 k=1 aik 1 − 2 2 [yi − {(Xβ)i + σε mk }]2 2σε sk 1 − (β − µβ )T Σ−1 β (β − µβ ) + const. 2 n X K X 1 2 = const. + aik − 2 2 (y − Xβ − σε mk 1)i 2σε sk i=1 k=1 1 − (β − µβ )T Σ−1 β (β − µβ ) 2 where 1 is a 1 × n column vector of 1’s. Changing the order of summation n X K X aik i=1 k=1 1 − 2 2 (y − Xβ − σε mk 1)2i 2σε sk K n 1 X 1 X =− 2 aik (y − Xβ − σε mk 1)2i 2σε s2k i=1 =− 1 2σε2 k=1 K X k=1 1 (y − Xβ − σε mk 1)T D ak (y − Xβ − σε mk 1) s2k 2.A. DERIVATION OF ALGORITHM 1 AND LOWER BOUND (2.11) where D ak = diag(a1k , . . . , ank ). Therefore, K 1 X 1 log p(β|rest) = − 2 (y − Xβ − σε mk 1)T D ak (y − Xβ − σε mk 1) 2σε s2k k=1 1 − (β − µβ )T Σ−1 β (β − µβ ) + const. 2 Now, (y − Xβ − σε mk 1)T D ak (y − Xβ − σε mk 1) = {Xβ − (y − σε mk 1)}T D ak {Xβ − (y − σε mk 1)} = (Xβ)T D ak (Xβ) − 2(Xβ)T D ak (y − σε mk 1) +(y − σε mk 1)T D ak (y − σε mk 1) = β T X T D ak Xβ − 2β T X T D ak (y − σε mk 1) + const. Hence log p(β|rest) K 1 X 1 T T =− 2 [β X D ak Xβ − 2β T X T D ak (y − σε mk 1)] 2σε s2k k=1 1 T −1 − (β T Σ−1 β β − 2β Σβ µβ ) + const. 2 ) ! " ( K X 1 1 1 β = − βT X T D ak X + Σ−1 β 2 σε2 s2k k=1 ( ! )# K 1 X 1 −1 T T X −2β D ak (y − σε mk 1) + Σβ µβ + const. σε2 s2k k=1 1 T = − β Ωβ − 2β T ω + const. 2 where Ω=X T K 1 X 1 D ak σε2 s2k ! X + Σ−1 β k=1 and ( ω = XT ) K 1 X 1 D ak (y − σε mk 1) + Σ−1 β µβ . σε2 s2k k=1 50 51 2.A. DERIVATION OF ALGORITHM 1 AND LOWER BOUND (2.11) Full conditional for σε2 K n 1 X 1 (y − Xβ)T D ak (y − Xβ) log p(σε2 |rest) = − A + + 1 log σε2 − 2 2 σε s2k k=1 + K X 1 (σε2 )1/2 k=1 mk T 1 D ak (y − Xβ) + const. s2k Derivation: p(σε2 |rest) ∝ p(y|β, σε2 , a)p(σε2 ) ( n K ) YY 1 yi − {(Xβ)i + σε mk } aik = φ σε sk σε sk i=1 k=1 BA × Γ(A) σε2 B (−A−1) − σ2 e ε . Taking logarithms, we have log p(σε2 |rest) = n X K X i=1 k=1 aik − log(σε sk √ 1 2π) − 2 2 [yi − {(Xβ)i + σε mk }]2 2σε sk BA B + log − (A + 1) log(σε2 ) − 2 + const. Γ(A) σε n K 1 XX 1 2 2 = − aik log(σε ) + 2 2 [yi − {(Xβ)i + σε mk }] 2 σε sk i=1 k=1 −(A + 1) log(σε2 ) − B + const. σε2 K 1 X 1 (y − Xβ − σε mk 1)T D ak (y − Xβ − σε mk 1) = − 2 2 2σε s k=1 k n B − A + + 1 log(σε2 ) − 2 + const. 2 σε where D ak = diag(a1k , . . . , ank ). Now, (y − Xβ − σε mk 1)T D ak (y − Xβ − σε mk 1) = (y − Xβ)T D ak (y − Xβ) − 2mk σε 1T D ak (y − Xβ) + σε2 m2k 1T D ak 1. Hence K n 1 X 1 log p(σε2 |rest) = − A + + 1 log σε2 − 2 (y − Xβ)T D ak (y − Xβ) 2 σε s2k k=1 K X 1 mk T + 2 1/2 1 D ak (y − Xβ) + const. s2k (σε ) k=1 52 2.A. DERIVATION OF ALGORITHM 1 AND LOWER BOUND (2.11) Full conditional for a log p(a|rest) = n X K X aik νik + const. i=1 k=1 where νik = log(wk /sk ) − 1 [yi − {(Xβ)i + σε mk }]2 . 2σε2 s2k Derivation: p(a|rest) ∝ p(y|β, σε2 , a)p(a) ( n K ) YY yi − {(Xβ)i + σε mk } aik 1 = φ σε s k σε sk i=1 k=1 " n # Y 1 aiK × wai1 ...wK ai1 !...aiK ! 1 i=1 # ( n K ) "Y n Y K YY yi − {(Xβ)i + σε mk } aik 1 aik φ × wk . = σε s k σε sk i=1 k=1 i=1 k=1 Taking logarithms gives log p(a|rest) = n X K X √ aik − log(σε sk 2π) − i=1 k=1 n X K X + = = aik 1 log(wk /sk ) − 2 2 [yi − {(Xβ)i + σε mk }]2 2σε sk aik νik + const. i=1 k=1 where νik = log(wk /sk ) − 2.A.2 1 [y 2σε2 s2k i − {(Xβ)i + σε mk }]2 . Optimal q ∗ densities Expressions for q ∗ (β), µq(β) and Σq(β) q ∗ (β) ∼ N (µq(β) , Σq(β) ) where ( Σq(β) = aik log(wk ) + const. i=1 k=1 n K XX i=1 k=1 n X K X 1 [yi − {(Xβ)i + σε mk }]2 2σε2 s2k T X µq(1/σε2 ) K X 1 D µq(a ) k s2k k=1 ! X+ Σ−1 β )−1 + const. 53 2.A. 
DERIVATION OF ALGORITHM 1 AND LOWER BOUND (2.11) and K X 1 D µq(a ) y k s2k ( µq(β) = Σq(β) XT µq(1/σε2 ) k=1 −µq(1/σε ) where D µq(a k) K X mk k=1 s2k ! ) + D µq(a ) 1 k Σ−1 β µβ = diag(µq(a1k ) , . . . , µq(a1k ) ). Derivation: Equation (1.3) tells us that the optimal densities take the form q ∗ (θ i ) ∝ exp[Eq(θ−i ) log p(θ i |y, θ −i )}]. Hence log q ∗ (β) = Eq {log p(β|rest)} + const. 1 = − Eq (β T Ωβ − 2β T ω) + const. 2 1 T = − {β Eq (Ω)β − 2β T Eq (ω)} + const. 2 Using matrix result (1.21), it follows that 1 log q ∗ (β) = − {β − Eq (Ω)−1 Eq (ω)}T Eq (Ω){β − Eq (Ω)−1 Eq (ω)} + const. 2 Therefore, q ∗ (β) ∼ N{Eq (Ω)−1 Eq (ω), Eq (Ω)−1 }. Now, ( XT Eq (Ω) = Eq K 1 X 1 ak σε2 s2k (2.21) ! ) X + Σ−1 β k=1 and " Eq (ω) = Eq X ( T K 1 X 1 ak (y − σε mk 1) σε2 s2k ) # + Σ−1 β µβ k=1 where D ak = diag(a1k , . . . , ank ). It follows that ( Eq (Ω) = Eq XT K 1 X 1 D ak σε2 s2k k=1 ! ) X + Σ−1 β 2.A. DERIVATION OF ALGORITHM 1 AND LOWER BOUND (2.11) (X K ) 1 Eq (D ak ) X + Σ−1 = X T Eq β s2k k=1 ! X K 1 1 T = X Eq D µq(a ) X + Σ−1 β k σε2 s2k k=1 ! K X 1 = X T µq(1/σε2 ) D µq(a ) X + Σ−1 β , k s2k 1 σε2 k=1 and " ( Eq (ω) = Eq X T = X T Eq = XT K 1 X 1 D ak (y − σε mk 1) σε2 s2k 1 σε2 k) # + k=1 K X K 1 1 X mk y − D ak 1 ak σε s2k s2k k=1 K X µq(1/σε2 ) k=1 where D µq(a ) Σ−1 β µβ ! + Σ−1 β µβ k=1 ! K X mk 1 D µq(a ) y − µq(1/σε ) D µq(a ) 1 + Σ−1 β µβ . k k s2k s2k k=1 = diag(µq(a1k ) , . . . , µq(a1k ) ). Expressions for q ∗ (σε2 ), µq(1/σε ) and µq(1/σε2 ) q ∗ (σε2 ) = µq(1/σε ) = σε2 −(A+ n +1) 2 C1 = C1 (σε2 )1/2 − C2 σε2 2J + (2A + n − 1, C1 , C2 ) J + (2A + n, C1 , C2 ) J + (2A + n − 1, C1 , C2 ) where exp and K X mk k=1 s2k σε2 > 0, , µq(1/σε2 ) = J + (2A + n + 1, C1 , C2 ) , J + (2A + n − 1, C1 , C2 ) 1T D µq(a ) (y − Xµq(β) ) k and C2 = B + K 1X 1 n tr(D µq(a ) XΣq(β) X T ) k 2 s2k k=1 o +(y − Xµq(β) )T D µq(a ) (y − Xµq(β) ) . k Derivation: h n log q ∗ (σε2 ) = Eq − A + + 1 log σε2 2 ( ) K 1 1X 1 T − 2 B+ (y − Xβ) D ak (y − Xβ) σε 2 s2k # K k=1 X 1 mk T + 2 1/2 1 D ak (y − Xβ) + const. s2k (σε ) k=1 54 2.A. DERIVATION OF ALGORITHM 1 AND LOWER BOUND (2.11) n = − A + + 1 log σε2 2 " # K 1X 1 1 Eq (y − Xβ)T D ak (y − Xβ) − 2 B+ σε 2 s2k k=1 + K X 1 (σε2 )1/2 k=1 mk T 1 Eq {D ak (y − Xβ)} + const. s2k Now, using Result 1.23, Eq (y − Xβ)T D ak (y − Xβ) = tr{Eq (D ak ) Covq (y − Xβ)} + Eq (y − Xβ)T Eq (D ak )Eq (y − Xβ) = tr{D µq(a ) XCovq (β)X T } + (y − Xµq(β) )T D µq(a ) (y − Xµq(β) ) k k T T = tr{D µq(a ) XΣq(β) X } + (y − Xµq(β) ) D µq(a ) (y − Xµq(β) ) k k T T = tr{X D µq(a ) XΣq(β) } + (y − Xµq(β) ) D µq(a ) (y − Xµq(β) ). k k Therefore, n log q ∗ (σε2 ) = − A + + 1 log σε2 2 " K 1 1X 1 − 2 B+ {tr(X T D µq(a ) XΣq(β) ) k σε 2 s2k i k=1 T +(y − Xµq(β) ) D µq(a ) (y − Xµq(β) )} k + K X mk 1 (σε2 )1/2 k=1 s2k 1T D µq(a ) (y − Xµq(β) ) + const. k Hence q ∗ (σε2 ) n (σε2 )−(A+ 2 +1) exp ∝ where C1 = K X mk k=1 s2k C1 C2 − 2 1/2 2 σε (σε ) 1T D µq(a ) (y − Xµq(β) ) k and C2 K 1X 1 n = B+ tr(X T D µq(a ) XΣq(β) ) k 2 s2k o k=1 T +(y − Xµq(β) ) D µq(a ) (y − Xµq(β) ) . k Since q ∗ (σε2 ) is a density, it must integrate to 1. Therefore q ∗ (σε2 ) = R ∞ 0 − A+ n +1 σε2 ( 2 ) exp C1 (σε2 )1/2 C2 σε2 σε2 −(A+ 2 +1) exp − C1 (σε2 )1/2 − C2 σε2 n dσε2 . 55 56 2.A. DERIVATION OF ALGORITHM 1 AND LOWER BOUND (2.11) We can simplify the integral by making the substitution x = 1 σε ⇒ σε = dσε2 = −2x−3 . 
This transforms the integral into Z 0 ∞ −(A+ 2 +1) σε2 exp n Z 0 = =2 C1 C2 − 2 1/2 2 σε (σε ) 1 x ⇒ σε2 = 1 x2 dσε2 x2A+n+2 exp C1 x − C2 x2 (−2x−3 )dx −∞ Z ∞ 0 x2A+n−1 exp C1 x − C2 x2 dx = 2J + (2A + n − 1, C1 , C2 ). Therefore q ∗ (σε2 ) = − A+ n +1 σε2 ( 2 ) exp C1 (σε2 )1/2 − C2 σε2 2J + (2A + n − 1, C1 , C2 ) . By making the same substitution as above, we find that µq(1/σε2 ) 1 = + 2J (2A + n − 1, C1 , C2 ) Z 0 ∞ C1 C2 dσε2 − (σε2 )1/2 σε2 x2A+n+4 exp C1 x − C2 x2 (−2x−3 )dx 1 2 −(A+ n2 +1) σ exp σε2 ε Z 0 1 = 2J + (2A + n − 1, C1 , C2 ) −∞ Z ∞ 1 x2A+n+1 exp C1 x − C2 x2 dx = + J (2A + n − 1, C1 , C2 ) 0 J + (2A + n + 1, C1 , C2 ) . = + J (2A + n − 1, C1 , C2 ) Similarly, µq(1/σε ) = J + (2A + n, C1 , C2 ) . J + (2A + n − 1, C1 , C2 ) Expressions for q ∗ (a) and µq(aik ) ∗ q (a) = K n Y Y µq(aik ) aik i=1 k=1 and eνik µq(aik ) = PK . νik k=1 e where 1 µq(1/σε2 ) [{yi − (Xµq(β) )i }2 + (XΣq(β) X T )ii ] 2s2k −2mk µq(1/σε ) {yi − (Xµq(β) )i } + m2k . νik = log(wk /sk ) − ⇒ 2.A. DERIVATION OF ALGORITHM 1 AND LOWER BOUND (2.11) Derivation: ∗ log q (a) = n X K X aik log i=1 k=1 wk sk 1 1 2 − 2 Eq 2 {yi − (Xβ)i − σε mk } + const. σε 2sk Now, 1 {yi − (Xβ)i − σε mk }2 = σε2 1 2mk {yi − (Xβ)i }2 − 2 1/2 {yi − (Xβ)i } + m2k , 2 σε (σε ) therefore Eq 1 2 {yi − (Xβ)i − σε mk } σε2 = µq(1/σε2 ) [{yi − (Xµq(β) )i }2 + (XΣq(β) X T )ii ] −2mk µq(1/σε ) {yi − (Xµq(β) )i } + m2k . Combining the above three steps, we have log q ∗ (a) = n X K X aik νik + const. i=1 k=1 where 1 µq(1/σε2 ) [{yi − (Xµq(β) )i }2 + (XΣq(β) X T )ii ] 2s2k −2mk µq(1/σε ) {yi − (Xµq(β) )i } + m2k . νik = log(wk /sk ) − This leads to ∗ q (a) ∝ n Y K Y (eνik )aik i=1 k=1 which is of the form of a Multinomial distribution. We require that eνik µq(aik ) = PK νik k=1 e to ensure that PK k=1 µq(aik ) = 1. Hence ∗ q (a) = n Y K Y i=1 k=1 µq(aik ) aik . 57 58 2.A. DERIVATION OF ALGORITHM 1 AND LOWER BOUND (2.11) 2.A.3 Derivation of lower bound (2.11) log p(y; q) = d n − log(2π) + log 2 + A log B − log{Γ(A)} 2 2 + log{J + (2A + n − 1, C1 , C2 )} K X |Σq(β) | m2k 1 wk − 2 + log + µq(a•k ) log sk 2 |Σβ | 2sk k=1 1 T −1 − {tr(Σ−1 β Σq(β) ) + (µq(β) − µβ ) Σβ (µq(β) − µβ )} 2 n X K X − µq(aik ) log(µq(aik ) ). i=1 k=1 Derivation: log p(y; q) = Eq {log p(y|β, σε2 , a)} + Eq {log p(β) − log q ∗ (β)} +Eq {log p(σε2 ) − log q ∗ (σε2 )} + Eq {log p(a) − log q ∗ (a)}. Firstly, log p(y|β, σε2 , a) = n X K X aik i=1 k=1 1 1 − log(2πσε2 s2k ) − {yi − (Xβ)i − σε mk }2 2 2 n K n n 1 XX = − log σε2 − log(2π) − aik log s2k 2 2 2 i=1 k=1 − n K 1 X X aik {yi − (Xβ)i − σε mk }2 2 i=1 k=1 s2k σε2 . Now, {yi − (Xβ)i − σε mk }2 σε2 1 2mk {yi − (Xβ)i }2 − 2 1/2 {yi − (Xβ)i } + m2k . 2 σε (σε ) = Taking expectations, Eq {log p(y|β, σε2 , a)} K n n 1X = − log(2π) − Eq (log σε2 ) − µq(a•k ) log s2k 2 2 2 k=1 − n K 1 X X µq(a ik ) 2 i=1 k=1 s2k µq(1/σε2 ) [{yi − (Xµq(β) )i }2 + (XΣq(β) X T )ii ] −2mk µq(1/σε ) {yi − (Xµq(β) )i } + m2k . 59 2.A. 
DERIVATION OF ALGORITHM 1 AND LOWER BOUND (2.11) Working on the second summation, n X K X µq(a ik ) i=1 k=1 s2k µq(1/σε2 ) [{yi − (Xµq(β) )i }2 + (XΣq(β) X T )ii ] −2mk µq(1/σε ) {yi − (Xµq(β) )i } + m2k = µq(1/σε2 ) n X K X µq(a ik ) s2k i=1 k=1 n X K X −2µq(1/σε ) = µq(1/σε2 ) i=1 k=1 [{yi − (Xµq(β) )i }2 + (XΣq(β) X T )ii ] n K XX mk mk µq(aik ) 2 {yi − (Xµq(β) )i } + µq(aik ) 2 sk sk i=1 k=1 K X 1 n tr(D µq(a ) XΣq(β) X T ) k s2k o k=1 +(y − Xµq(β) )T D µq(a ) (y − Xµq(β) ) k −2µq(1/σε ) K X k=1 K X m2k mk T 1 D (y − Xµ ) + µ µ q(a ) q(β) •k q(a ) k s2k s2k = 2(C2 − B)µq(1/σε2 ) − 2C1 µq(1/σε ) + k=1 K X m2 µq(a•k ) 2k . sk k=1 Hence Eq {log p(y|β, σε2 , a)} K n n 1X = − log(2π) − Eq (log σε2 ) − µq(a•k ) log s2k 2 ( 2 2 ) k=1 K X m2k 1 − 2(C2 − B)µq(1/σε2 ) − 2C1 µq(1/σε ) + µq(a•k ) 2 2 sk k=1 n n 1 = − log(2π) − Eq (log σε2 ) − 2 2 2 K X µq(a•k ) log s2k k=1 −(C2 − B)µq(1/σε2 ) + C1 µq(1/σε ) − K m2 1X µq(a•k ) 2k 2 sk k=1 n n = − log(2π) − Eq (log σε2 ) − (C2 − B)µq(1/σε2 ) + C1 µq(1/σε ) 2 2 K X m2k 1 2 − µq(a•k ) log sk + 2 . 2 sk k=1 Secondly, log p(β) − log q ∗ (β) d 1 1 = − log 2π − log |Σβ | − (β − µβ )T Σ−1 β (β − µβ ) 2 2 2 d 1 1 T −1 − − log 2π − log |Σq(β) | − (β − µq(β) ) Σq(β) (β − µq(β) ) 2 2 2 2.A. DERIVATION OF ALGORITHM 1 AND LOWER BOUND (2.11) |Σq(β) | 1 1 − (β − µβ )T Σ−1 = log β (β − µβ ) 2 |Σβ | 2 1 + (β − µq(β) )T Σ−1 q(β) (β − µq(β) ). 2 Taking expectations, Eq {log p(β) − log q ∗ (β)} |Σq(β) | 1 1 − Eq {(β − µβ )T Σ−1 = log β (β − µβ )} 2 |Σβ | 2 1 + Eq {(β − µq(β) )T Σ−1 q(β) (β − µq(β) )}. 2 Now, using Result 1.23, Eq {(β − µβ )T Σ−1 β (β − µβ )} T −1 = tr{Σ−1 β Covq (β − µβ )} + Eq (β − µβ ) Σβ Eq (β − µβ ) T −1 = tr(Σ−1 β Σq(β) ) + (µq(β) − µβ ) Σβ (µq(β) − µβ ). Similarly, −1 Eq {(β − µq(β) )T Σ−1 q(β) (β − µq(β) )} = tr(Σq(β) Σq(β) ) = tr(Id×d ) = d. Therefore, ∗ Eq {log p(β) − log q (β)} = |Σq(β) | 1 d log + 2 |Σβ | 2 n 1 − tr(Σ−1 β Σq(β) ) o 2 +(µq(β) − µβ )T Σ−1 (µ − µ ) . q(β) β β Thirdly, log p(a) − log q ∗ (a) = n X K X i=1 k=1 aik (log wk − log µq(aik ) ). Hence Eq {log p(a) − log q ∗ (a)} = K X k=1 µq(a•k ) log(wk ) − K n X X i=1 k=1 µq(aik ) log(µq(aik ) ). 60 2.A. DERIVATION OF ALGORITHM 1 AND LOWER BOUND (2.11) Finally, log p(σε2 ) − log q ∗ (σε2 ) B A 2 −(A+1) −B/σε2 = log σ e Γ(A) ε ( −(A+n/2+1) ) σε2 exp −C2 /σε2 + C1 /σε − log 2J + (2A + n − 1, C1 , C2 ) n = A log(B) − log Γ(A) + log(σε2 ) + (C2 − B)/σε2 2 + −C1 /σε + log J (2A + n − 1, C1 , C2 ) + log(2). Thus, Eq {log p(σε2 ) − log q ∗ (σε2 )} n Eq {log(σε2 )} + (C2 − B)µq(1/σε2 ) 2 −C1 µq(1/σε ) + log{J + (2A + n − 1, C1 , C2 )} + log(2). = A log(B) − log Γ(A) + Combining, we get log p(y; q) = d n − log(2π) + log(2) + A log B − log Γ(A) 2 2 + log{J + (2A + n − 1, C1 , C2 )} K X |Σq(β) | m2 1 wk + µq(a•k ) log − k2 + log sk 2 |Σβ | 2sk k=1 1 T −1 − {tr(Σ−1 β Σq(β) ) + (µq(β) − µβ ) Σβ (µq(β) − µβ )} 2 n X K X − µq(aik ) log(µq(aik ) ). i=1 k=1 61 62 2.B. DERIVATION OF ALGORITHM 2 AND LOWER BOUND (2.17) 2.B Derivation of Algorithm 2 and lower bound (2.17) 2.B.1 Full conditionals Full conditional for (β, u) p(β, u|rest) = p(β, u|Markov blanket of (β, u)) 2 2 = p(β, u|y, a, σε2 , σu1 , . . . , σud ) 2 2 ∝ p(y|β, u, σε2 , a)p(β, u|σu1 , . . . , σud ) aik n Y K Y (y − Xβ − Zu)i − σε mk 1 = φ σε s k σε sk i=1 k=1 Pd ×(2π)−(1+d+ `=1 K` )/2 |Σ(β,u) |−1/2 " # !T 1 β × exp − − µ(β,u) Σ−1 (β,u) 2 u " β u # − µ(β,u) ! . Taking logarithms, we get n X K X 1 2 log{p(β, u|rest)} = aik − 2 2 (y − Xβ − Zu − σε mk 1)i 2σε sk i=1 k=1 " # !T " # ! 1 β β −1 − µ(β,u) − µ(β,u) + const. 
− Σ(β,u) 2 u u # " !2 n X K X 1 = aik − 2 2 y − C β − σε mk 1 2σε sk u i=1 k=1 1 − 2 " i !T # β u − µ(β,u) " Σ−1 (β,u) β u # ! − µ(β,u) + const. Now, from working analagous to that in the regression case (Appendix 2.A): log{p(β, u|rest)} " #T ( 1 β =− CT 2 u " −2 β u K 1 X 1 D ak σε2 s2k k=1 #T " CT ( ! )" C + Σ−1 (β,u) β u # )# K X 1 1 D ak (y − σε mk 1) + const. 2 σε s2k k=1 where D ak = diag(a1k , . . . , ank ). Therefore " #T " # " #T 1 β log{p(β, u|rest)} = − Ω β −2 β ω + const. 2 u u u 63 2.B. DERIVATION OF ALGORITHM 2 AND LOWER BOUND (2.17) where Ω=C T K 1 X 1 D ak σε2 s2k ! k=1 and ω=C T 1 −1 1 C + blockdiag Σβ , 2 I K1 , . . . , 2 I Kd σu1 σud ! K 1 X 1 D ak (y − σε mk 1) . σε2 s2k k=1 Full conditional for σε2 p(σε2 |rest) = p(σε2 |Markov blanket of σε2 ) = p(σε2 |y, β, u, a) ∝ p(y|β, u, σε2 , a)p(σε2 ) aik n Y K Y 1 (y − Xβ − Zu)i − σε mk = φ σε sk σε sk i=1 k=1 BεAε × Γ(Aε ) σε2 B (−Aε −1) − σ2ε e ε . Taking logarithms, we get log{p(σε2 |rest)} = −(Aε + 1) log σε2 − + n X K X i=1 k=1 = − Aε + Bε + const. σε2 aik − log(σε sk ) − 1 {(y − Xβ − Zu)i + σε mk }2 2 2 2σε sk n Bε + 1 log σε2 − 2 + const. 2 σε n K 1 X X aik 2 − 2 2 {(y − Xβ − Zu)i 2σε s i=1 k=1 k −2σε mk (y − Xβ − Zu)i + σε2 m2k } n Bε = − Aε + + 1 log σε2 − 2 2 σε − n K 1 X X aik (y − Xβ − Zu)2i 2σε2 s2k i=1 k=1 n K 1 X X aik mk + (y − Xβ − Zu)i + const. σε s2k i=1 k=1 64 2.B. DERIVATION OF ALGORITHM 2 AND LOWER BOUND (2.17) n = − Aε + + 1 log σε2 2 ( ) K 1X 1 1 (y − Xβ − Zu)T D ak (y − Xβ − Zu) − 2 Bε + σε 2 s2k k=1 + 1 σε K X k=1 mk T 1 D ak (y − Xβ − Zu) + const. s2k Full conditional for a p(a|rest) = p(a|Markov blanket of a) = p(a|y, β, u, σε2 ) ∝ p(y|β, u, σε2 , a)p(a) n n Y K Y (y − Xβ − Zu)i − σε mk aik Y aik 1 φ × = wk . σε sk σε sk i=1 i=1 k=1 Taking logarithms gives log{p(a|rest)} = n X K X aik i=1 k=1 n X K X + i=1 k=1 = n X K X i=1 k=1 log(wk ) + const. h √ aik − log( 2πσε sk ) 1 − 2 {(y − Xβ − Zu)i + σε mk }2 2σε sk h aik log(wk /sk ) 1 2 − 2 2 {(y − Xβ − Zu)i + σε mk } + const. 2σε sk 2 , 1≤`≤d Full conditional for σu` The full conditional is given by 2 2 p(σu` |rest) = p{σu |Markov blanket of σu` } 2 = p(σu` |u` ) 2 2 ∝ p(u` |σu` )p(σu` ) 1 2 = (2π) − uT` (σu` I)−1 u` 2 B Au` 2 (−Au` −1) Bu` exp − 2 × u` σu` Γ(Au` ) σu` −K` /2 2 |σu` I|−1/2 exp 2.B. DERIVATION OF ALGORITHM 2 AND LOWER BOUND (2.17) 1 T − 2 u` u` ∝ 2σu` Bu` 2 (−Au` −1) exp − 2 ×σu` σ u` 1 1 2 −(Au` +K` /2+1) 2 = σu` exp − 2 Bu` + ku` k , 2 σu` 2 −K` /2 exp σu` and taking logarithms gives 2 2 log{p(σu` |rest)} = −(Au` + K` /2 + 1) log(σu` ) 1 1 − 2 Bu` + ku` k2 + const. 2 σu` 2.B.2 Optimal q ∗ densities Expressions for q ∗ (β, u), µq(β,u) and Σq(β,u) q ∗ (β, u) ∼ N (µq(β,u) , Σq(β,u) ) where µq(β,u) ← Σq(β,u) C T ! K K X X 1 mk µq(1/σε2 ) D µq(a ) y − µq(1/σε ) D µq(a ) 1 , k k s2k s2k k=1 k=1 ! K X 1 D µq(a ) C C µq(1/σε2 ) k s2k k=1 o−1 +blockdiag Σ−1 , 2 ) I K1 , . . . , µq(1/σ 2 ) I K d β , µq(1/σu1 ( Σq(β,u) ← T ud and D µq(a k) = diag(µq(a1k ) , . . . , µq(a1k ) ). Derivation: " #T " # " #T 1 β log q ∗ (β, u) = − Eq Ω β −2 β ω + const. 2 u u u " #T " # " #T 1 β = − Eq (Ω) β − 2 β Eq (ω) + const. 2 u u u Application of matrix result (1.21) to the above expression gives 1 log q ∗ (β, u) = − 2 (" β u # )T − Eq (Ω)−1 Eq (ω) (" # ) β − E (Ω)−1 E (ω) + const. ×Eq (Ω) q q u 65 2.B. DERIVATION OF ALGORITHM 2 AND LOWER BOUND (2.17) and therefore q ∗ (β, u) ∼ N {Eq (Ω)−1 Eq (ω), Eq (Ω)−1 }. Now, ! K X 1 1 D ak C Eq (Ω) = Eq C T σε2 s2 k=1 k 1 1 +blockdiag Σ−1 , , . . . , I I K1 Kd β 2 2 σu1 σud ! 
K X 1 T = C µq(1/σε2 ) D µq(a ) C k s2k k=1 +blockdiag Σ−1 , µ I , . . . , µ I , 2 2 K K 1 q(1/σu1 ) q(1/σ ) d β ( ud and ( CT Eq (ω) = Eq = C T K 1 X 1 D ak (y − σε mk 1) σε2 s2k µq(1/σε2 ) k=1 K X k=1 !) ! K X 1 mk D µq(a ) y − µq(1/σε ) D µq(a ) 1 . k k s2k s2k k=1 Expressions for q ∗ (σε2 ), µq(1/σε ) and µq(1/σε2 ) q ∗ (σε2 ) = σε2 −(A+ n +1) 2 exp C3 (σε2 )1/2 − C4 σε2 2J + (2A + n − 1, C1 , C2 ) σε2 > 0, , µq(1/σε2 ) ← J + (2Aε + n + 1, C3 , C4 ) , J + (2Aε + n − 1, C3 , C4 ) µq(1/σε ) ← J + (2Aε + n, C3 , C4 ) , J + (2Aε + n − 1, C3 , C4 ) where C3 ← K X mk k=1 s2k 1T D µq(a ) (y − Cµq(β,u) ) k and C4 ← Bε + K 1X 1 n tr(D µq(a ) CΣq(β,u) C T ) k 2 s2k o k=1 +(y − Cµq(β,u) )T D µq(a ) (y − Cµq(β,u) ) . k 66 2.B. DERIVATION OF ALGORITHM 2 AND LOWER BOUND (2.17) Derivation: log q ∗ (σε2 ) n = − Aε + + 1 log σε2 2 # " K 1X 1 1 T Eq {(y − Xβ − Zu) D ak (y − Xβ − Zu)} − 2 Bε + σε 2 s2k k=1 K 1 X mk T + 1 Eq {D ak (y − Xβ − Zu)} + const. σε s2k k=1 Using working similar to that used in the regression case in Appendix 2.A, Eq {(y − Xβ − Zu)T D ak (y − Xβ − Zu)} " #!T " #! = Eq y−C β D ak y − C β u u = tr(C T D µq(a ) CΣq(β,u) ) + (y − Cµq(β,u) )T D µq(a ) (y − Cµq(β,u) ). k k Therefore n log q ∗ (σε2 ) = − Aε + + 1 log σε2 2 " K 1 1X 1 n tr(C T D µq(a ) CΣq(β,u) ) − 2 Bε + 2 k σε 2 s oi k=1 k +(y − Cµq(β,u) )T D µq(a ) (y − Cµq(β,u) ) k + 1 σε K X k=1 mk T 1 D µq(a ) (y − Cµq(β,u) ) + const. k s2k It follows that q ∗ (σε2 ) where C3 = −(Aε + 2 +1) exp σε2 n ∝ K X mk k=1 s2k C3 C4 − 2 σε σε 1T D µq(a ) (y − Cµq(β,u) ) k and C4 ← Bε + K 1X 1 n tr(C T D µq(a ) CΣq(β,u) ) k 2 s2k o k=1 +(y − Cµq(β,u) )T D µq(a ) (y − Cµq(β,u) ) . k Using working analagous to the regression case, we obtain the stated results. 67 2.B. DERIVATION OF ALGORITHM 2 AND LOWER BOUND (2.17) Expressions for q ∗ (a) and µq(aik ) ∗ q (a) = n Y K Y µq(aik ) aik i=1 k=1 and eνik µq(aik ) = PK . νik k=1 e where νik = log(wk /sk ) − 1 µq(1/σε2 ) (y − Cµq(β,u) )2i + (CΣq(β,u) C T )ii 2 2sk −2mk µq(1/σε ) (y − Cµq(β,u) )i + m2k . Derivation: ∗ log q (a) = n X K X aik log(wk /sk ) i=1 k=1 1 1 2 − 2 Eq 2 {(y − Xβ − Zu)i + σε mk } + const. σε 2sk Now, {(y − Xβ − Zu)i + σε mk }2 = (y − Xβ − Zu)2i −2σε mk (y − Xβ − Zu)i + σε2 m2k , hence ∗ log q (a) = = n X K X i=1 k=1 n X K X 1 1 aik log(wk /sk ) − 2 Eq (y − Xβ − Zu)2i 2 σ 2s ε k i=1 k=1 2mk 2 − (y − Xβ − Zu)i + mk + const. σε 1 µq(1/σε2 ) Eq {(y − Xβ − Zu)2i } 2 2sk i − 2mk µq(1/σε ) (y − Cµq(β,u) )i + m2k + const. aik log(wk /sk ) − Now, Eq {(y − Xβ − Zu)2i } = Eq " y−C β u #!2 i 68 69 2.B. DERIVATION OF ALGORITHM 2 AND LOWER BOUND (2.17) ( " = Varq y−C " ( Covq = C " = β u CCovq β u # #! ) β u #!) " ( + Eq i " " y − CEq + ! ii CT ii β u y−C β u #! )#2 i #!2 i + (y − Cµq(β,u) )2i = (CΣq(β,u) C T )ii + (y − Cµq(β,u) )2i . Bringing it all together, ∗ log q (a) = n X K X aik νik + const. i=1 k=1 where νik = log(wk /sk ) − 1 µq(1/σε2 ) (y − Cµq(β,u) )2i + (CΣq(β,u) C T )ii 2 2sk −2mk µq(1/σε ) (y − Cµq(β,u) )i + m2k . Exponentiating, we get q ∗ (a) ∝ n Y K Y (eνik )aik i=1 k=1 which is of the form of a multinomial distribution. To ensure that PK k=1 µq(aik ) eνik µq(aik ) = PK . νik k=1 e hence q ∗ (a) ∝ n Y K Y ik µaq(a . ik ) i=1 k=1 2 ), B Expressions for q ∗ (σu` q(σ 2 ) and µq(1/σ 2 u` ) u` ∗ q (σu ) ∼ Inverse-Gamma Au` + K` /2, Bq(σ2 u` ) Bq(σ2 ) = Bu` + u` o 1n tr(Σq(u` ) ) + kµq(u` ) k2 , 2 and µq(1/σ2 ) = u` Au` + K` /2 . Bq(σ2 ) u` , = 1, we require 70 2.B. 
DERIVATION OF ALGORITHM 2 AND LOWER BOUND (2.17) Derivation: log q ∗ 2 (σu` ) = −(Au` + K` /2 + 2 1) log(σu` ) 1 − 2 σu` 1 Bu` + Eq (ku` k2 ) 2 +const. Now, using Result 1.22, Eq (ku` k2 ) = tr{Covq (u` )} + kEq (u` )k2 = tr(Σq(u` ) ) + kµq(u` ) k2 . Hence, 2 2 log q ∗ (σu` ) = −(Au` + K` /2 + 1) log(σu` ) o n 1 1 2 + const. tr(Σq(u` ) ) + kµq(u` ) k − 2 Bu` + 2 σu` Exponentiating, we get q ∗ 2 ) (σu` ∝ 2 −(Au` +K` /2+1) exp σu` o 1 1n 2 − 2 Bu` + tr(Σq(u` ) ) + kµq(u` ) k 2 σu` which is in the form of an Inverse-Gamma distribution. The stated results follow immediately. 2.B.3 Derivation of lower bound (2.17) log p(y; q) = 1 2 1+d+ d X `=1 ! K` − n log(2π) + log(2) + Aε log(Bε ) − log Γ(Aε ) 2 + log J + (2Aε + n − 1, C3 , C4 ) −1 1 T + 12 log |Σq(β,u) | − 21 log |Σβ | − 12 tr(Σ−1 β Σq(β) ) − 2 µq(β,u) Σβ µq(β,u) d X {Au` log Bu` − log Γ(Au` ) o `=1 − Au` + K2` log Bq(σ2 ) + log Γ Au` + K2` u` X K n X K 2 X m + µq(a•k ) log(wk /sk ) − k2 − µq(aik ) log(µq(aik ) ). 2sk i=1 + k=1 Derivation: k=1 71 2.B. DERIVATION OF ALGORITHM 2 AND LOWER BOUND (2.17) The logarithm of the lower bound on the marginal likelihood is given by 2 2 , . . . , σud ) log p(y; q) = Eq {log p(y, β, u, a, σε2 , σu1 2 2 − log q ∗ (β, u, a, σε2 , σu1 , . . . , σud )}. Now, 2 2 p(y, β, u, a, σε2 , σu1 , . . . , σud ) = p(y|β, u, a, σε2 ) 2 2 ×p(β, u, a, σε2 , σu1 , . . . , σud ) (2.22) 2 , 1 ≤ ` ≤ d. Also, as the distribution of y does not depend explicitly upon σu` p(β, u, a, σε2 , σu ) = p(β, u|σu ) ×p(a)p(σε2 )p(σu ) (2.23) as the distribution of (β, u) doesn’t depend on σε2 or a, and σε2 , a and σu are independent. In addition, our imposed factorisation of the optimal density q (2.16) breaks down further into ( 2 2 ) = q(β, u)q(a)q(σε2 ) q(β, u, a, σε2 , σu1 , . . . , σud d Y ) 2 ) q(σu` (2.24) `=1 2 , 1 ≤ ` ≤ d, imposed by our model. due to the conditional independence of σε2 and σu` Using (2.22), (2.23) and (2.24), our expression for the lower bound of the log-likelihood becomes: log p(y; q) = Eq {log p(y|β, u, a, σε2 )} + Eq {log p(β, u|σu ) − log q ∗ (β, u)} +Eq {log p(a) − log q ∗ (a)} + Eq {log p(σε2 ) − log q ∗ (σε2 )} " ( d ) ( d )# Y Y 2 2 +Eq log p(σu` ) − log q(σu` ) . `=1 `=1 Firstly: log p(y|β, u, a, σε2 ) n X K X √ = aik − log( 2πσε sk ) − i=1 k=1 1 {(y − Xβ − Zu)i + σε mk }2 2σε2 sk 72 2.B. DERIVATION OF ALGORITHM 2 AND LOWER BOUND (2.17) n K n 1 XX n 2 = − log σε − log(2π) − aik log s2k 2 2 2 i=1 k=1 ( n K 1 X X aik 2 − 12 2 (y − Xβ − Zu)i σε2 s k i=1 2 − σε k=1 n X K X i=1 k=1 n K X X aik m2 aik mk k (y − Xβ − Zu) + i 2 s2k s k i=1 ) . k=1 Taking expectations, Eq {log p(y|β, u, a, σε2 )} K n n 1X = − log σε2 − log(2π) − µq(a•k ) log s2k 2 2 2 k=1 − 12 µq(1/σε2 ) +µq(1/σε ) n X K X i=1 k=1 n K XX i=1 k=1 µq(aik ) (CΣq(β,u) C T )ii + (y − Cµq(β,u) )2i 2 sk n X K X µq(aik ) m2k µq(aik ) mk 1 (y − Cµ ) − q(β,u) i 2 s2k s2k i=1 k=1 K 1X n n = − log σε2 − log(2π) − 2 2 2 − 12 µq(1/σε2 ) µq(a•k ) log s2k k=1 K X 1 n tr(D µq(a ) CΣq(β,u) C T ) k s2k o k=1 +(y − Cµq(β,u) )T D µq(a ) (y − Cµq(β,u) ) k +µq(1/σε ) K X k=1 K X m2k mk T 1 1 D (y − Cµ ) − µ µ q(β,u) q(a ) •k q(ak ) 2 s2k s2k k=1 n n = − log σε2 − log(2π) − (C4 − Bε )µq(1/σε2 ) + C3 µq(1/σε ) 2 2 2 K X mk 2 1 −2 µq(a•k ) + log sk . s2k k=1 Secondly, log p(β, u|σu ) − log q ∗ (β, u) ! " #T " # d X β β −1 1 1 Σ(β,u) =− 1+d+ K` log(2π) − 2 log |Σ(β,u) | − 2 u u `=1 ( ! d X − − 1+d+ K` log(2π) − 12 log |Σq(β,u) | `=1 " − 12 β u !T # − µq(β,u) " Σ−1 q(β,u) β u # − µq(β,u) ! . 73 2.B. 
DERIVATION OF ALGORITHM 2 AND LOWER BOUND (2.17) Taking expectations, Eq {log p(β, u|σu ) − log q ∗ (β, u)} = − 21 Eq (log |Σ(β,u) |) + 12 log |Σq(β,u) | " #T " # β Σ−1 − 21 Eq β q(β,u) u u " # !T β −µ + 21 Eq Σ−1 q(β,u) q(β,u) u " β u # − µq(β,u) ! Now, " Eq β u #T " Σ−1 (β,u) # β u = tr{Eq (Σ−1 )Σq(β,u) } (β,u) +µTq(β,u) Eq (Σ−1 (β,u) )µq(β,u) . Similarly, Eq " !T # β −µ Σ−1 q(β,u) q(β,u) u Σ = tr Σ−1 q(β,u) q(β,u) = tr I 1+d+Pd K` " β u # − µq(β,u) ! `=1 =1+d+ d X K` . `=1 Furthermore, 2 2 Eq (log |Σ(β,u) |) = Eq {log |blockdiag(Σβ , σu1 I K1 , . . . , σud I Kd )|} 2 = Eq {log(|Σβ | × σu1 K1 2 × . . . × σud Kd )} 2 2 = Eq (log |Σβ | + K1 log σu1 + . . . + Kd log σud ) = log |Σβ | + d X 2 K` Eq (log σu` ). `=1 Also, Eq (Σ−1 (β,u) ) = Eq 1 −1 1 blockdiag Σβ , 2 I K1 , . . . , 2 I Kd σu1 σud = blockdiag(Σ−1 2 ) I K1 , . . . , µq(1/σ 2 ) I K ). d β , µq(1/σu1 ud . 74 2.B. DERIVATION OF ALGORITHM 2 AND LOWER BOUND (2.17) Therefore, Eq {log p(β, u|σu ) − log q ∗ (β, u)} = 1 2 1 2 log |Σq(β,u) | − log |Σβ | − 1 2 d X 2 K` Eq (log σu` ) + 1 2 1+d+ `=1 d X ! K` `=1 h − 12 tr{blockdiag(Σ−1 2 ) I K1 , . . . , µq(1/σ 2 ) I K )Σq(β,u) } d β , µq(1/σu1 ud +µTq(β,u) blockdiag(Σ−1 2 ) I K1 , . . . , µq(1/σ 2 ) I K )µq(β,u) d β , µq(1/σu1 ud = 1 2 log |Σq(β,u) | − 21 log |Σβ | − " 2 K` Eq (log σu` )+ 1 2 1+d+ `=1 d X ! K` `=1 d X tr(µq(1/σ2 ) I K` Σq(u`) ) u` # `=1 d X +µTq(β,u) Σ−1 µq(1/σ2 ) kµq(u`) k2 β µq(β,u) + u` `=1 ! d d X X 2 1 1 1 1 + K ) 1 + d + K` Eq (log σu` log |Σ | − log |Σ | − ` β q(β,u) 2 2 2 2 `=1 `=1 −1 −1 1 1 T − 2 tr(Σβ Σq(β) ) − 2 µq(β,u) Σβ µq(β,u) − 12 = tr(Σ−1 β Σq(β) ) 1 2 d X i − 12 d X + µq(1/σ2 ) {kµq(u`) k2 + tr(Σq(u`) )}. u` `=1 Thirdly, log p(σε2 ) − log q ∗ (σε2 ) BεAε 2 −(Aε +1) −Bε /σε2 = log σ e Γ(Aε ) ε ( −(A +n/2+1) ) 2 ε σε2 e−C4 /σε +C3 /σε − log 2J + (2Aε + n − 1, C3 , C4 ) n = A log(B) − log Γ(Aε ) + log(σε2 ) + (C4 − Bε )/σε2 2 + −C3 /σε + log J (2Aε + n − 1, C3 , C4 ) + log(2). Thus, n Eq {log(σε2 )} 2 +(C4 − Bε )µq(1/σε2 ) − C3 µq(1/σε ) Eq {log p(σε2 ) − log q ∗ (σε2 )} = Aε log(Bε ) − log Γ(Aε ) + + log J + (2Aε + n − 1, C3 , C4 ) + log(2). Fourthly, log p(a) − log q ∗ (a) = n X K X i=1 k=1 aik {log(wk ) − logµq(aik ) }. 75 2.B. DERIVATION OF ALGORITHM 2 AND LOWER BOUND (2.17) Hence Eq {log p(a) − log q ∗ (a)} = where µq(a•k ) = Pn i=1 µq(aik ) . ( log d Y k=1 ) 2 p(σu` ) d X `=1 µq(a•k ) log(wk ) − n X K X µq(aik ) log(µq(aik ) ). i=1 k=1 Finally, ( − log `=1 = K X d Y ) 2 q(σu` ) `=1 2 Au` log Bu` − log Γ(Au` ) − (Au` + 1) log σu` − Bu` 2 σu` n − Aq(σ2 ) log Bq(σ2 ) − log Γ(Aq(σ2 ) ) u` u` u` −(Aq(σ2 ) + u` 2 1) log σu` − Bq(σ2 u` ) 2 σu` )# . Taking expectations " ( Eq log d Y ) 2 ) p(σu` − log `=1 = d X `=1 ( d Y )# 2 ) q(σu` `=1 K` Au` log Bu` − log Γ(Au` ) − Au` + log Bq(σ2 ) u` 2 K` 2 + log Γ(Aq(σ2 ) ) + Eq (log σu` ) + (Bq(σ2 ) − Bu` )µq(1/σ2 ) . u` u` u` 2 Putting it all together, we get 2 K X mk n n 2 2 1 log p(y; q) = − log(2π) − log σε − 2 + log sk µq(a•k ) 2 2 s2k k=1 −(C4 − Bε )µq(1/σε2 ) + C3 µq(1/σε ) ! d X + 12 1 + d + K` + 12 log |Σq(β,u) | `=1 − 21 log |Σβ | − 1 2 d X 2 K` Eq (log σu` ) `=1 −1 1 − 2 tr(Σβ Σq(β) ) − 12 µTq(β,u) Σ−1 β µq(β,u) − 21 + d X u` `=1 K X k=1 µq(1/σ2 ) {kµq(u`) k2 + tr(Σq(u`) )} µq(a•k ) log(wk ) − n X K X i=1 k=1 µq(aik ) log(µq(aik ) ) 76 2.B. 
DERIVATION OF ALGORITHM 2 AND LOWER BOUND (2.17) n Eq {log(σε2 )} + (C4 − Bε )µq(1/σε2 ) 2 −C3 µq(1/σε ) + log J + (2Aε + n − 1, C3 , C4 ) + log(2) d n X + Au` log Bu` − log Γ(Au` ) − Au` + K2` log Bq(σ2 ) +Aε log(Bε ) − log Γ(Aε ) + u` `=1 + log Γ(Aq(σ2 ) ) + u` K` 2 2 Eq (log σu` ) + (Bq(σ2 ) − Bu` )µq(1/σ2 u` ) u` o . Recognising that kµq(u`) k2 + tr(Σq(u`) ) = Bq(σ2 ) − Bu` , we have the final expression for u` the lower bound log p(y; q) = 1 2 1+d+ d X `=1 ! K` − n log(2π) + log(2) + Aε log(Bε ) − log Γ(Aε ) 2 + log J + (2Aε + n − 1, C3 , C4 ) −1 1 T + 21 log |Σq(β,u) | − 21 log |Σβ | − 12 tr(Σ−1 β Σq(β) ) − 2 µq(β,u) Σβ µq(β,u) d X {Au` log Bu` − log Γ(Au` ) o `=1 K` K` − Au` + 2 log Bq(σ2 ) + log Γ Au` + 2 u` X n X K K 2 X mk + µq(a•k ) log(wk /sk ) − 2 − µq(aik ) log(µq(aik ) ). 2sk i=1 + k=1 k=1 Chapter 3 Mean field variational Bayes for quantile regression 3.1 Introduction Estimation of the quantiles of a response variable given an explanatory variable is a common statistical problem. Yu and Jones (1998) use kernel weighted local linear fitting for estimation of quantiles in nonparametric regression models. A goodness of fit process for quantile regression is explored in Koenker and Machado (1999). In a Bayesian context, inference for quantile regression models is somewhat limited to sampling based methods such as MCMC. Koenker (2005) sets out options for inference in quantile regression, highlighting both frequentist and Bayesian methods, the latter restricted to resampling methods. We address this shortcoming in this chapter by exploring MFVB for semiparametric quantile regression models. The idea of carrying out Bayesian quantile regression using an Asymmetric Laplace likelihood was introduced by Yu and Moyeed (2001). Their paper showed that the Asymmetric Laplace distribution provides a very natural option for modelling Bayesian quantile regression, regardless of the original distribution of the data. Yu and Moyeed (2001) carried out inference using a Markov chain Monte Carlo (MCMC) algorithm. More recently, Kozumi and Kobayashi (2011) developed an efficient Gibbs sampling algorithm to carry out inference in Bayesian regression models, also using the Asymmetric Laplace distribution. They used a location-scale mixture representation of the Asymmetric Laplace distribution. Extensions to the Asymmetric Laplace approach to quantile regression have been explored. For example, Yuan and Guosheng (2010) presented Bayesian quantile regression 77 78 3.2. PARAMETRIC REGRESSION CASE a β y σε2 Figure 3.1: Directed acyclic graph for Model (3.2). for longitudinal studies with non-ignorable missing data. In contrast to the parametric approach presented by Yu and Moyeed (2001), Thompson, Cai, Moyeed, Reeve and Stander (2010) adopt a nonparametric approach, carrying out quantile regression using cubic splines. In this chapter, we develop mean field variational Bayes (MFVB) methodology for inference in Bayesian quantile regression models using the Asymmetric Laplace distribution, as introduced by Yu and Moyeed (2001). MFVB inerence for a univariate Asymmetric Laplace model has been carried out in Wand et al. (2011). We extend the methodology from this paper to the regression case in this chapter. Firstly, we present the model for the parametric regression case. We develop the MFVB algorithm for the parametric regression case through derivation of optimal q densities and the lower bound on the marginal log-likelihood. 
Secondly, we extend the MFVB methodology to the semiparametric regression case, introducing splines into the model. We then present the MFVB fit to a simulated data set. Finally, we compare our MFVB results with MCMC to evaluate the performance of MFVB inference in Bayesian quantile regression models. 1 3.2 Parametric regression case Here we present a simple parametric linear regression model for the case of one predictor. 3.2.1 Model For data (xi , yi ), we impose the model yi = β0 + β1 xi + εi , 1 ≤ i ≤ n. 79 3.2. PARAMETRIC REGRESSION CASE We express this as the Bayesian hierarchical model ind. yi |β, σε ∼ Asymmetric-Laplace{(Xβ)i , σε , τ }, σε2 β ∼ N (0, Σβ ), 1 ≤ i ≤ n, (3.1) ∼ Inverse-Gamma(Aε , Bε ) where 1 x1 . . X = .. .. , 1 xn β= β0 β1 , Aε , Bε > 0 are scalar constants, Σβ is a constant matrix and τ ∈ (0, 1) determines the quantile level. We introduce auxiliary variables a = (a1 , . . . , an ). This allows the Asymmetric Laplace distribution to be represented in terms of Normal and Inverse-Gamma distributions, which are more amenable to MFVB. Application of Result 1.8 allows us to rewrite Model (3.1) as ind. yi |β, σε , ai ∼ N (Xβ)i + (τ − 21 )σε σε2 ai τ (1−τ ) , ai τ (1−τ ) β ∼ N (0, Σβ ), σε2 , ind. ai ∼ Inverse-Gamma 1, 12 , (3.2) ∼ Inverse-Gamma(Aε , Bε ). The dependence structure of the parameters in Model (3.2) is illustrated in Figure 3.1. 3.2.2 Mean field variational Bayes In this section we present an algorithm to carry out MFVB inference for Model (3.2) under the imposed product restriction q(β, a, σε2 ) ≈ q(β)q(a)q(σε2 ). (3.3) Derivations of the full conditionals, optimal q densities and lower bound on the marginal log-likelihood are deferred to Appendix 3.A. The lower bound on the marginal log-likelihood for Model (3.2) is given by log p(y; q) = Aε log(Bε ) − log Γ(Aε ) + log(2) + 1 + n log{τ (1 − τ )} |Σq(β) | 1 + 2 log + log{J + (2Aε + n − 1, C5 , C6 )} |Σβ | −1 T − 12 {tr(Σ−1 β Σq(β) ) + µq(β) Σβ µq(β) } n X 1 1 − . 8τ (1 − τ ) µq(ai ) i=1 (3.4) 80 3.2. PARAMETRIC REGRESSION CASE Initialize: µq(1/σε2 ) , µq(1/σε ) and µq(ai ) for 1 ≤ i ≤ n. Cycle: Update q ∗ (β) parameters: µq(β) −1 Σq(β) ← {τ (1 − τ )µq(1/σε2 ) X T M µq(a) X + Σ−1 β } o n ← Σq(β) gτ µq(1/σε2 ) X T M µq(a) y + τ − 12 µq(1/σε ) X T 1 . Update q ∗ (ai ) parameters: For i = 1, . . . , n: µq(ai ) ← −1 q T 2 2τ (1 − τ ) µq(1/σε2 ) [(XΣq(β) X )ii + {yi − (Xµq(β) )i } ] M µq(a) ← diag(µq(a1 ) , . . . , µq(an ) ) Update q ∗ (σε2 ) parameters: C5 ← τ − C6 ← Bε + 1 2 (y − Xµq(β) )T 1 τ (1 − τ ) n tr(X T M µq(a) XΣq(β) ) o 2 +(y − Xµq(β) )T M µq(a) (y − Xµq(β) ) µq(1/σε2 ) = µq(1/σε ) = J + (2Aε + n + 1, C5 , C6 ) J + (2Aε + n − 1, C5 , C6 ) J + (2Aε + n, C5 , C6 ) J + (2Aε + n − 1, C5 , C6 ) until the increase in p(y; q) is negligible. Algorithm 4: Mean field variational Bayes algorithm for Model (3.2) under product restriction (3.3). 81 3.3. SEMIPARAMETRIC REGRESSION CASE 3.3 Semiparametric regression case Here we present a semiparametric regression model, again for the one predictor case. This model allows more flexibility in the shape of the fitted curve via inclusion of spline basis functions. 3.3.1 Model We impose the model yi = β0 + β1 xi + K X uk zk (xi ) + εi , k=1 1 ≤ i ≤ n, where {z1 (·), . . . , zK (·)} are a set of spline basis functions and (u1 , . . . , uK ) are spline coefficients. In Bayesian hierarchical form we have ind. yi |β, u, σε ∼ Asymmetric-Laplace{(Xβ + Zu)i , σε , τ }, 1 ≤ i ≤ n, (3.5) σu2 ∼ Inverse-Gamma(Au , Bu ), u|σu2 ∼ N (0, σu2 I), σε2 ∼ Inverse-Gamma(Aε , Bε ). 
β ∼ N (0, Σβ ), where β and X are as defined in Section 3.2.1, u1 . u = .. , uK and z (x ) . . . zK (x1 ) 1 1 .. .. .. Z= . . . z1 (xn ) . . . zK (xn ) . Introduction of auxiliary variables a = (a1 , . . . , an ) and application of Result 1.8 yields the model ind. yi |β, u, σε , ai ∼ N (Xβ + Zu)i + (τ − 21 )σε σε2 ai τ (1−τ ) , ai τ (1−τ ) ind. ai ∼ Inverse-Gamma 1, 12 , u|σu2 ∼ N (0, σu2 I), β ∼ N (0, Σβ ), , (3.6) σu2 ∼ Inverse-Gamma(Au , Bu ), σε2 ∼ Inverse-Gamma(Aε , Bε ). The relationship between the parameters in Model (3.6) is illutsrtaed by the DAG in Figure 3.2. 82 3.3. SEMIPARAMETRIC REGRESSION CASE 3.3.2 Mean field variational Bayes Using the locality property of MFVB, many of the optimal densities of key parameters in Model (3.6) remain unchanged from those in Model (3.2). This is similar to the case in Chapter 2, where the progression from a basic GEV regression model to a more complex additive model caused minimal changes to the MFVB optimal densities of many parameters. Infact, the structure of the distributions of a and σε2 are the same as in Model (3.2). The only changes are that β is replaced by (β, u), and X is replaced by C = [X Z]. We need only derive optimal q densities for the new parameters, namely (β, u) and σu2 . We impose the product restriction q(β, u, a, σε2 , σu2 ) ≈ q(β, u)q(a)q(σε2 , σu2 ). (3.7) When we moralise (see Definition 1.4.9) the DAG in Figure 3.2, we find that all paths between σu2 and σε2 must pass through at least two of the nodes {y, β, u, a}. Another way of stating this structure is that the set {y, β, u, a} separates σu2 from σε2 . Applying Theorem 1.4.1 then gives the result σu2 ⊥ σε2 {y, β, u}. Hence product restriction (3.7) reduces further via induced factorisations to q(β, u, a, σε2 , σu2 ) ≈ q(β, u)q(a)q(σε2 )(σu2 ). (3.8) Derivations of the necessary optimal q densities leading to the updates in Algorithm 5 are deferred to Appendix 3.A. Derivation of the lower bound on the marginal log-likelihood are also given in Appendix 3.A. The lower bound on the marginal log-likelihood for Model (3.6) is given by (K + 2) log p(y; q) = Aε log Bε − log Γ(Aε ) + log(2) + + n log{τ (1 − τ )} 2 |Σq(β,u) | + 12 log + log{J + (2Aε + n − 1, C7 , C8 )} |Σβ | n o −1 T − 21 tr(Σ−1 Σ ) + µ Σ µ q(β) q(β) q(β) β β n − 12 µq(1/σu2 ) {tr(Σq(u) ) + kµq(u) k2 } − +Au log(Bu ) − log{Γ(Au )} − (Au + + log{Γ(Au + K 2 )} X 1 1 8τ (1 − τ ) µq(ai ) i=1 K 2 )) 2 ) log(Bq(σu − µq(1/σu2 ) (Bq(σu2 ) − Bu ). (3.9) 83 3.3. SEMIPARAMETRIC REGRESSION CASE Initialize: µq(1/σε2 ) , µq(1/σε ) and µq(ai ) for 1 ≤ i ≤ n. Cycle: Update q ∗ (β, u) parameters: T Σq(β,u) ← Σ−1 0 β 0 µq(1/σu2 ) I K τ (1 − τ )µq(1/σε2 ) C M µq(a) C + n µq(β,u) ← Σq(β,u) gτ µq(1/σε2 ) C T M µq(a) y + τ − 12 µq(1/σε ) C T 1 . −1 , Update q ∗ (ai ) parameters: For i = 1, . . . , n: µq(ai ) ← {2τ (1 − τ ) −1 q T 2 × µq(1/σε2 ) [(CΣq(β,u) C )ii + {yi − (Cµq(β,u) )i } ] M µq(a) ← diag(µq(a1 ) , . . . , µq(an ) ) Update q ∗ (σu2 ) parameters: Bq(σu2 ) ← Bu + 21 {kµq(u) k2 + tr(Σq(u) )}; µq(1/σu2 ) ← (Au + K 2) 2 )/Bq(σu Update q ∗ (σε2 ) parameters: C7 ← τ − C8 ← Bε + µq(1/σε2 ) = 1 2 (y − Cµq(β,u) )T 1 τ (1 − τ ) n tr(C T M µq(a) CΣq(β,u) ) o 2 +(y − Cµq(β,u) )T M µq(a) (y − Cµq(β,u) ) J + (2Aε + n + 1, C7 , C8 ) ; J + (2Aε + n − 1, C7 , C8 ) µq(1/σε ) = J + (2Aε + n, C7 , C8 ) J + (2Aε + n − 1, C7 , C8 ) until the increase in p(y; q) is negligible. Algorithm 5: Mean field variational Bayes algorithm for Model (3.6) under product restriction (3.8). 84 3.4. 
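The σε² updates at the end of Algorithms 4 and 5 are ratios of integrals of the form J⁺(p, q, r) = ∫₀^∞ x^p exp(qx − rx²) dx, with (q, r) = (C5, C6) or (C7, C8); this is the form produced by the substitution x = 1/σε in Appendices 2.A and 3.A. For illustration only, the following Python sketch evaluates such ratios by numerical quadrature on the log scale. The function names, the bounded mode search and the choice of SciPy routines are assumptions made here for the sketch, not the implementation used for the results reported below.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

def log_integrand(x, p, q, r):
    # log of x^p * exp(q*x - r*x^2), the integrand defining J+(p, q, r)
    return p * np.log(x) + q * x - r * x * x

def j_plus(p, q, r, log_shift=0.0, split=1.0):
    """J+(p, q, r) = int_0^inf x^p exp(q*x - r*x^2) dx, scaled by exp(-log_shift)."""
    f = lambda x: np.exp(log_integrand(x, p, q, r) - log_shift) if x > 0 else 0.0
    left, _ = quad(f, 0.0, split, limit=200)      # integrate either side of the peak
    right, _ = quad(f, split, np.inf, limit=200)
    return left + right

def sigma_eps_moments(A_eps, n, Cq, Cr):
    """mu_q(1/sigma_eps^2) and mu_q(1/sigma_eps) as ratios of J+ integrals,
    mirroring the final updates of Algorithms 4 and 5."""
    p0 = 2 * A_eps + n - 1
    # Use one common log-scale shift so the ratios stay finite for large n;
    # the shift cancels exactly in each ratio.
    mode = minimize_scalar(lambda x: -log_integrand(x, p0, Cq, Cr),
                           bounds=(1e-8, 1e4), method="bounded").x
    shift = log_integrand(mode, p0, Cq, Cr)
    denom = j_plus(p0, Cq, Cr, shift, mode)
    return (j_plus(p0 + 2, Cq, Cr, shift, mode) / denom,   # mu_q(1/sigma_eps^2)
            j_plus(p0 + 1, Cq, Cr, shift, mode) / denom)   # mu_q(1/sigma_eps)

# Illustrative constants only (not values from the analyses in this chapter):
print(sigma_eps_moments(A_eps=0.01, n=500, Cq=3.0, Cr=260.0))
```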
Figure 3.2: Directed acyclic graph for Model (3.6). (Nodes: y, β, u, a, σε², σu².)

3.4 Results

Here we present the results of MFVB estimation under Algorithm 5. Data were generated using

    yi = sin(2πxi²) + εi,   εi ∼ N(0, 1) independently,   (3.10)

where xi = (i − 1)/n and n = 1000. We limit our investigation to the median, the first and third quartiles, and the tenth and ninetieth percentiles. Hence

    τ ∈ {0.1, 0.25, 0.5, 0.75, 0.9}.   (3.11)

Recall that the quantile level is given by τ, so, for example, the first quartile corresponds to τ = 0.25. The length of u was set to K = 22. Hyperparameters were set to σβ = 10⁸ and Aε = Bε = Au = Bu = 0.01. Iterations of Algorithm 5 were terminated when the change in lower bound (3.9) was less than 10⁻¹⁰.

Monotonic lower bounds were achieved for all quantiles. Figure 3.3 illustrates successive values of lower bound (3.9) for the median, τ = 0.5. Convergence of Algorithm 5 was achieved after 80 iterations. Figure 3.4 illustrates the MFVB fit and corresponding pointwise 95% credible intervals for all quantiles considered. For all values of τ, the MFVB estimates successfully capture the underlying trend of the data. We investigate this behaviour further in the following section via comparison of MFVB with MCMC inference.

Figure 3.3: Successive values of lower bound (3.9) to monitor convergence of MFVB Algorithm 5.

Figure 3.4: Quantile estimates (solid) and pointwise 95% credible intervals (dotted) for MFVB fitting of (3.10) via Algorithm 5. Estimates are shown for the values of τ stated in (3.11).

3.4.1 Comparisons with Markov chain Monte Carlo

MCMC samples of size 10000 were generated, with the first 5000 discarded and the remaining 5000 thinned by a factor of 5. MCMC inference took 22 minutes and 6 seconds to run. In contrast, MFVB took only 45 seconds to run.

Figure 3.5 illustrates the adequacy of MFVB inference in estimating the median. Both the MFVB and MCMC fits capture the underlying trend of the data. In this instance, MFVB produces credible intervals comparable with those produced by MCMC inference.

Delving more deeply into the quality of the MFVB inference, we now turn our focus to the accuracy of the fit achieved by Algorithm 5 for data modelled by (3.10). Figure 3.6 quantifies the accuracy achieved by the MFVB fit for τ = 0.5 at the quartiles of the xi's, denoted by ŷ(Qj) for j = 1, 2, 3. It is pleasing to see that a high accuracy of 94% was achieved for the fit at the median in a fraction of the time it took to run MCMC. Accuracy of the fit at the first quartile was 84%, and 59% accuracy was achieved for the fit at the third quartile. The accuracy measure is explained in Section 1.6.
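Section 1.6 is not reproduced in this excerpt. For concreteness, the sketch below assumes the L1-based accuracy that is standard in the MFVB literature, accuracy = 100{1 − ½∫|q*(θ) − p(θ|y)| dθ}%, with the MCMC posterior approximated by a kernel density estimate of the retained draws. The function name, the Gaussian KDE, and the use of a Gaussian MFVB marginal for ŷ are illustrative assumptions for this sketch rather than the thesis's implementation.

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

def accuracy_percent(mcmc_draws, mfvb_mean, mfvb_sd, grid_size=2001):
    """Approximate accuracy = 100 * (1 - 0.5 * integral |q - p_MCMC|) %,
    assuming a Gaussian MFVB marginal and a KDE of the MCMC draws."""
    kde = gaussian_kde(mcmc_draws)
    lo = min(mcmc_draws.min(), mfvb_mean - 6 * mfvb_sd)
    hi = max(mcmc_draws.max(), mfvb_mean + 6 * mfvb_sd)
    grid = np.linspace(lo, hi, grid_size)
    dx = grid[1] - grid[0]
    q = norm.pdf(grid, loc=mfvb_mean, scale=mfvb_sd)   # MFVB approximate posterior
    p = kde(grid)                                      # MCMC-based posterior estimate
    l1 = np.sum(np.abs(q - p)) * dx                    # Riemann approximation of the L1 distance
    return 100.0 * (1.0 - 0.5 * l1)

# Behaviour check: perfect agreement is near 100%, gross mismatch is near 0%.
rng = np.random.default_rng(1)
draws = rng.normal(0.0, 1.0, size=5000)
print(accuracy_percent(draws, 0.0, 1.0))   # close to 100
print(accuracy_percent(draws, 5.0, 0.2))   # close to 0
```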
Figure 3.5: Median (τ = 0.5) estimates and pointwise 95% credible intervals for MFVB (red) and MCMC (blue) fitting of (3.10).

3.4.2 Accuracy study

In this section we systematically examine how well MFVB Algorithm 5 performs against its MCMC counterpart via a simulation study. We focus on the fit achieved by MFVB inference at the median of the xi's, denoted by ŷ(Q2). We consider τ ∈ {0.1, 0.25, 0.5, 0.75, 0.9}. Samples of size n = 500 were generated according to (3.10). Fifty simulations were carried out for each value of τ, with the accuracy summarised in Figure 3.7.

Figure 3.6: MFVB (blue) and MCMC (orange) approximate posterior densities for the estimated median ŷ at the quartiles of the xi's under Model (3.6). The accuracy figures measure the accuracy of the variational fit compared with the MCMC fit.

It is evident that the best quality MFVB fit is achieved for the median, or τ = 0.5, with a mean accuracy of 80.58%. The accuracy of the MFVB fit for τ = 0.5 also had the lowest standard deviation. The upper and lower quartiles also had reasonable mean accuracies, at 72.30% and 66.84% respectively. The mean accuracies for τ = 0.1 and τ = 0.9, the tenth and ninetieth percentiles, were the lowest at 49.70% and 33.36%. Overall, the accuracy of the MFVB fit is excellent considering the massive speed gains achieved.
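For reference, the data for the accuracy study can be generated directly from (3.10). The sketch below is a minimal stand-in for the study design just described; the fitting routines `fit_mfvb_quantile` and `fit_mcmc_quantile` are hypothetical placeholders for Algorithm 5 and its MCMC counterpart, which are not implemented here.

```python
import numpy as np

def generate_data(n, rng):
    """Simulate (x_i, y_i) from (3.10): y_i = sin(2*pi*x_i^2) + eps_i, eps_i ~ N(0, 1)."""
    x = np.arange(n) / n                      # x_i = (i - 1)/n for i = 1, ..., n
    y = np.sin(2.0 * np.pi * x**2) + rng.normal(size=n)
    return x, y

rng = np.random.default_rng(0)
taus = [0.1, 0.25, 0.5, 0.75, 0.9]            # quantile levels considered in (3.11)
n_sims, n = 50, 500                           # settings used in the accuracy study above

for tau in taus:
    for sim in range(n_sims):
        x, y = generate_data(n, rng)
        # fit_mfvb_quantile(x, y, tau)        # placeholder for Algorithm 5
        # fit_mcmc_quantile(x, y, tau)        # placeholder for the MCMC benchmark
        pass
```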
3.5 Discussion Throughout this chapter we developed fast, deterministic inference for quantile semiparametric regression via use of the Asymmetric Laplace distribution. The work culminated in the development Algorithm 5. Monotonicity of lower bound (3.9) was achieved, and comparisons with MCMC ultimately showed excellent performance of MFVB for quantile regression given the speed gained. The most widely used quantile, the median, pleasingly achieved a high 80% accuracy for MFVB inference. The significance of this chapter lies in its ability to facilitate fast, deterministic in- 89 3.5. DISCUSSION ● ● 80 accuracy ● ● 60 ● 40 ● 20 0.1 0.25 0.5 0.75 0.9 λ Figure 3.7: Boxplots of accuracy measurements for ŷ(Q2 ) for the accuracy study described in the text. ference in Bayesian quantile regression. MFVB provides an alternative inference tool to MCMC, to be used when computing time and/or storage are lacking. 90 3.A. DERIVATION OF ALGORITHM 4 AND LOWER BOUND (3.4) 3.A Derivation of Algorithm 4 and lower bound (3.4) 3.A.1 Full conditionals Full conditional for β log p(β|rest) = − 12 (β T Ωβ − 2β T ω) + const. where Ω = ω = τ (1 − τ ) T X M a X + Σ−1 β , σε2 τ − 12 τ (1 − τ ) T X T 1, X M ay − σε2 σε and M a = diag(a1 , . . . , an ). Derivation: p(β|rest) ∝ p(y|rest)p(β) ( n Y a g i τ exp − ∝ yi − 2σε2 i=1 β , × exp − 12 β T Σ−1 β !)2 τ − 12 σε (Xβ)i + ai gτ where gτ = τ (1 − τ ). Taking the logarithm, we have n gτ X log p(β|rest) = − 2 ai 2σε i=1 ( yi − (Xβ)i + 1 2 !)2 − τ σε ai gτ − 12 β T Σ−1 β β + const. Now, !)2 τ − 21 σε ai yi − (Xβ)i + ai gτ i=1 n n X 2 τ − 12 X 2 = ai {yi − (Xβ)i } + (Xβ)i + const. gτ i=1 i=1 1 2 τ − 2 = (y − Xβ)T M a (y − Xβ) + (Xβ)T 1 + const. gτ n X ( where M a = diag(a1 , . . . , an ). Hence gτ log p(β|rest) = − 2 2σε ( 2 τ− (y − Xβ)T M a (y − Xβ) + gτ − 21 β T Σ−1 β β + const. 1 2 ) (Xβ)T 1 91 3.A. DERIVATION OF ALGORITHM 4 AND LOWER BOUND (3.4) τ (1 − τ ) T −1 β X M a X + Σβ β σε2 )# ( 1 τ − τ (1 − τ ) 2 XT 1 + const. X T M ay − −2β T σε2 σε − 12 = T The form of the full conditional for β follows directly. Full conditional for σε2 log p(σε2 |rest) = − Aε + + n 2 +1 log(σε2 ) 1 − 2 σε τ (1 − τ ) Bε + (y − Xβ)T M a (y − Xβ) 2 τ − 12 (y − Xβ)T 1 + const. (σε2 )1/2 Derivation: p(σε2 |rest) ∝ p(y|rest)p(σε2 ) ( n Y 1 a g i τ exp − ∝ yi − σε 2σε2 (Xβ)i + i=1 ×σε2 −(Aε +1) 1 2 τ − σε ai gτ !)2 exp −Bε /σε2 . Taking logarithms, Bε + 1 log(σε2 ) − 2 σε " #2 n τ − 21 σε gτ X − 2 ai − {yi − (Xβ)i } + const. 2σε ai gτ i=1 n τ − 12 X n Bε 2 {yi − (Xβ)i } = − Aε + + 1 log(σε ) − 2 + 2 σε σε log p(σε2 |rest) = − Aε + n 2 i=1 − gτ 2σε2 n X i=1 ai {yi − (Xβ)i }2 + const. τ − 21 n = − Aε + + 1 log(σε2 ) + 2 1/2 (y − Xβ)T 1 2 (σε ) 1 τ (1 − τ ) T − 2 Bε + (y − Xβ) M a (y − Xβ) + const. σε 2 Full conditional for ai 3 log p(ai |rest) = − log(ai ) − 2 1 2 τ (1 − τ ) 1 ai {yi − (Xβ)i }2 + 2 σε 4τ (1 − τ )ai + const. 3.A. DERIVATION OF ALGORITHM 4 AND LOWER BOUND (3.4) 92 Derivation: p(ai |rest) ∝ p(yi |rest)p(ai ) " #2 r 1 τ − 2 σε ai gτ ai gτ − {yi − (Xβ)i } exp − 2 ∝ 2 σε 2σε ai gτ 1 −2 ×ai exp . 2ai Taking logarithms, we have τ (1 − τ ) 1 3 + const. ai {yi − (Xβ)i }2 − log p(ai |rest) = − log(ai ) − 2 2σε2 8τ (1 − τ )ai 3.A.2 Optimal q ∗ densities Expression for q ∗ (β) q ∗ (β) ∼ N (µq(β) , Σq(β) ) where −1 Σq(β) = {τ (1 − τ )µq(1/σε2 ) X T M µq(a) X + Σ−1 β } , n o µq(β) = Σq(β) τ (1 − τ )µq(1/σε2 ) X T M µq(a) y + τ − 21 µq(1/σε ) X T 1 and M µq(a) = diag(µq(a1 ) , . . . , µq(an ) ). Derivation: log q ∗ (β) = − 12 Eq (β T Ωβ − 2β T ω) + const. 
= − 12 {β T Eq (Ω)β − 2β T Eq (ω)} + const. Application of Result 1.21 gives log q ∗ (β) = − 12 {β − Eq (Ω)−1 Eq (ω)}T Eq (Ω){β − Eq (Ω)−1 Eq (ω)} + const. Therefore, q ∗ (β) ∼ N{Eq (Ω)−1 Eq (ω), Eq (Ω)−1 } (3.12) Now, Eq (Ω) = Eq τ (1 − τ ) T X M a X + Σ−1 β σε2 93 3.A. DERIVATION OF ALGORITHM 4 AND LOWER BOUND (3.4) and ( Eq (ω) = Eq ) τ − 12 τ (1 − τ ) T T X 1 X M ay − σε2 σε where M a = diag(a1 , . . . , an ). It follows that Eq (Ω) = τ (1 − τ )µq(1/σε2 ) X T M µq(a) X + Σ−1 β and Eq (ω) = τ (1 − τ )µq(1/σε2 ) X T M µq(a) y − τ − 1 2 µq(1/σε ) X T 1 where M µq(a) = diag(µq(a1 ) , . . . , µq(an ) . Expressions for q ∗ (σε2 ), µq(1/σε ) and µq(1/σε2 ) − A + +1 σε2 ( ε 2 ) exp n q ∗ (σε2 ) = 2J + (2A C5 (σε2 )1/2 − C6 σε2 ε + n − 1, C5 , C6 ) , µq(1/σε ) = J + (2Aε + n + 1, C5 , C6 ) J + (2Aε + n − 1, C5 , C6 ) µq(1/σε ) = J + (2Aε + n, C5 , C6 ) , J + (2Aε + n − 1, C5 , C6 ) and σε2 > 0 where C5 = τ − 1 2 (y − Xµq(β) )T 1 and o τ (1 − τ ) n T T C6 = Bε + tr(X M µq(a) XΣq(β) ) + (y − Xµq(β) ) M µq(a) (y − Xµq(β) ) . 2 Derivation: log q ∗ (σε2 ) = Eq − Aε + n 2 +1 log(σε2 ) 1 − 2 σ # ε τ (1 − τ ) Bε + (y − Xβ)T M a (y − Xβ) 2 τ − 21 (y − Xβ)T 1 + const. (σε2 )1/2 1 τ (1 − τ ) 2 T n = − Aε + 2 + 1 log(σε ) − 2 Bε + Eq (y − Xβ) M a (y − Xβ) σε 2 τ − 12 Eq (y − Xβ)T 1 + const. + 2 1/2 (σε ) + 94 3.A. DERIVATION OF ALGORITHM 4 AND LOWER BOUND (3.4) Now, using Result 1.23, Eq (y − Xβ)T M a (y − Xβ) = tr{Eq (M a ) Covq (y − Xβ)} + Eq (y − Xβ)T Eq (M a )Eq (y − Xβ) = tr{M µq(a) XΣq(β) X T } + (y − Xµq(β) )T M µq(a) (y − Xµq(β) ) = tr{X T M µq(a) XΣq(β) } + (y − Xµq(β) )T M µq(a) (y − Xµq(β) ). Therefore, log q ∗ τ − 12 + 2 1/2 (y − Xµq(β) )T 1 = − Aε + + 1 (σε ) 1 τ (1 − τ ) n − 2 Bε + tr(M µq(a) XΣq(β) X T ) σε 2 oi +(y − Xµq(β) )T M µq(a) (y − Xµq(β) ) + const. (σε2 ) n 2 log(σε2 ) and hence q ∗ (σε2 ) ∝ n −(Aε + 2 +1) σε2 exp C6 C5 − 2 1/2 2 σε (σε ) where C5 = τ − 1 2 (y − Xµq(β) )T 1 and C6 = Bε + o τ (1 − τ ) n tr(X T M µq(a) XΣq(β) ) + (y − Xµq(β) )T M µq(a) (y − Xµq(β) ) . 2 Since q ∗ (σε2 ) is a density, it must integrate to 1. Therefore q ∗ (σε2 ) = R ∞ 0 − A + n +1 σε2 ( ε 2 ) exp C5 (σε2 )1/2 C6 σε2 σε2 −(Aε + 2 +1) exp − C5 (σε2 )1/2 − C6 σε2 n dσε2 . We can simplify the integral on the denominator by making the substitution x = σε = 1 x ⇒ σε2 = 1 x2 ⇒ dσε2 = −2x−3 dx. This transforms the integral into Z 0 ∞ −(Aε + 2 +1) σε2 exp n Z 0 = =2 C5 C6 − 2 1/2 2 σ (σε ) ε dσε2 x2Aε +n+2 exp C5 x − C6 x2 (−2x−3 )dx −∞ Z ∞ 0 + x2Aε +n−1 exp C5 x − C6 x2 dx = 2J (2Aε + n − 1, C5 , C6 ). 1 σε ⇒ 95 3.A. DERIVATION OF ALGORITHM 4 AND LOWER BOUND (3.4) Therefore q ∗ (σε2 ) = − A + n +1 σε2 ( ε 2 ) exp C5 (σε2 )1/2 − C6 σε2 2J + (2Aε + n − 1, C5 , C6 ) . By making the same substitution as above, we find that µq(1/σε2 ) = = = = 1 + 2J (2Aε + n − 1, C5 , C6 ) ∞ Z 0 C6 C5 − dσε2 (σε2 )1/2 σε2 x2Aε +n+4 exp C5 x − C6 x2 (−2x−3 )dx 1 2 −(Aε + n2 +1) σ exp σε2 ε 0 1 2J + (2Aε + n − 1, C5 , C6 ) −∞ Z ∞ 1 x2Aε +n+1 exp C5 x − C6 x2 dx + J (2Aε + n − 1, C5 , C6 ) 0 J + (2Aε + n + 1, C5 , C6 ) . J + (2Aε + n − 1, C5 , C6 ) Similarly, µq(1/σε ) = Z J + (2Aε + n, C5 , C6 ) . J + (2Aε + n − 1, C5 , C6 ) Expressions for q ∗ (ai ) and µq(ai ) q ∗ (ai ) ∼ Inverse-Gaussian(µq(ai ) , λq(a) ) where ( µq(ai ) = r 2τ (1 − τ ) h T µq(1/σε2 ) (XΣq(β) X )ii + {yi − (Xµq(β) )i }2 i )−1 and λq(a) = 1 . 4τ (1 − τ ) Derivation: log q ∗ (ai ) 3 τ (1 − τ ) 1 2 ai {yi − (Xβ)i } − + const. = Eq − log(ai ) − 2 2σε2 8τ (1 − τ )ai 3 1 = − log(ai ) − 2 8τ (1 − τ )ai τ (1 − τ ) ai µq(1/σε2 ) Eq [{yi − (Xβ)i }2 ] + const. 
− 2 3 1 = − log(ai ) − 2 8τ (1 − τ )ai τ (1 − τ ) − ai µq(1/σε2 ) [(XΣq(β) X T )ii + {yi − (Xµq(β) )i }2 ] + const. 2 3.A. DERIVATION OF ALGORITHM 4 AND LOWER BOUND (3.4) 96 Therefore, ∗ q (ai ) ∝ −3/2 ai exp 1 −2 1 4τ (1 − τ )ai h io +τ (1 − τ )ai µq(1/σε2 ) (XΣq(β) X T )ii + {yi − (Xµq(β) )i }2 which, by Result 1.7, is in the form of an Inverse-Gaussian distribution with parameters given by µq(ai ) = −1 q T 2 2τ (1 − τ ) µq(1/σε2 ) [(XΣq(β) X )ii + {yi − (Xµq(β) )i } ] and λq(a) = 1 . 4τ (1 − τ ) The expression for q ∗ (ai ) follows directly. 3.A.3 Derivation of lower bound (3.4) log p(y; q) = Aε log(Bε ) − log Γ(Aε ) + log(2) + 1 + n log{τ (1 − τ )} |Σq(β) | 1 + log{J + (2Aε + n − 1, C5 , C6 )} + 2 log |Σβ | −1 T − 12 {tr(Σ−1 β Σq(β) ) + µq(β) Σβ µq(β) } n X 1 1 − . 8τ (1 − τ ) µq(ai ) i=1 Derivation: log p(y; q) = Eq {log p(y|β, σε2 , a)} + Eq {log p(β) − log q ∗ (β)} +Eq {log p(σε2 ) − log q ∗ (σε2 )} + Eq {log p(a) − log q ∗ (a)}. Firstly, log p(y|β, σε2 , a) " ( )#2 n 1 2 X (τ − )σ 2 ε − 1 log 2πσε − ai gτ yi − (Xβ)i + = 2 ai gτ 2σε2 ai gτ i=1 97 3.A. DERIVATION OF ALGORITHM 4 AND LOWER BOUND (3.4) n X n n n = − log(2π) − log(σε2 ) + log gτ + 12 log(ai ) 2 2 2 i=1 ( )2 n 1 1 X (τ − (τ − )σ )σ gτ 2 ε 2 ε ai −2 − 2 {yi − (Xβ)i } + {yi − (Xβ)i }2 2σε ai gτ ai gτ i=1 n X n n n = − log(2π) − log(σε2 ) + log gτ + 12 log(ai ) 2 2 2 i=1 − 1 2 2) (τ − 2gτ n X i=1 1 + ai n (τ − 12 ) X {yi (σε2 )1/2 i=1 n n n = − log(2π) − log(σε2 ) + log gτ + 2 2 2 − (Xβ)i } − 1 2 n X i=1 n gτ X ai {yi − (Xβ)i }2 2σε2 i=1 n (τ − 12 )2 X 1 log(ai ) − 2gτ ai i=1 1 gτ + 2 1/2 (τ − 21 )1T (y − Xβ) − 2 (y − Xβ)T M a (y − Xβ). 2σε (σε ) Taking expectations: Eq {log p(y|β, σε2 , a)} ( n n X (τ − 12 )2 X 1 n n n 2 1 log(ai ) − = Eq − log(2π) − log(σε ) + log gτ + 2 2 2 2 2gτ a i=1 i=1 i gτ 1 + 2 1/2 (τ − 12 )1T (y − Xβ) − 2 (y − Xβ)T M a (y − Xβ) 2σε (σε ) n X n n n n Eq {log(ai )} = − log(2π) − log(σε2 ) + log gτ + log gτ + 12 2 2 2 2 i=1 n (τ − 21 )2 X µq(1/ai ) + µq(1/σε ) (τ − 12 )1T (y − Xµq(β) ) − 2gτ i=1 gτ − µq(1/σε2 ) Eq {(y − Xβ)T M a (y − Xβ)}. 2 From previous work we know that Eq {(y − Xβ)T M a (y − Xβ)} = tr(M µq(a) XΣq(β) X T ) + (y − Xµq(β) )T M µq(a) (y − Xµq(β) ), so Eq {log p(y|β, σε2 , a)} n X n n n n = − log(2π) − Eq {log(σε2 )} + log gτ + log gτ + 12 Eq {log(ai )} 2 2 2 2 i=1 1 2 2) n X (τ − µq(1/ai ) + µq(1/σε ) (τ − 12 )1T (y − Xµq(β) ) 2gτ i=1 h gτ − µq(1/σε2 ) tr(M µq(a) XΣq(β) X T ) 2 i +(y − Xµq(β) )T M µq(a) (y − Xµq(β) ) − 3.A. DERIVATION OF ALGORITHM 4 AND LOWER BOUND (3.4) n X n n n n = − log(2π) − log(σε2 ) + log gτ + log gτ + 12 Eq {log(ai )} 2 2 2 2 i=1 n (τ − 12 )2 X − µq(1/ai ) + µq(1/σε ) C5 − µq(1/σε2 ) (C6 − Bε ). 2gτ i=1 Secondly, log p(β) − log q ∗ (β) d = − log(2π) − 21 log |Σβ | − 12 β T Σ−1 β β 2 d T −1 1 1 − − log(2π) − 2 log |Σq(β) | − 2 (β − µq(β) ) Σq(β) (β − µq(β) ) 2 |Σq(β) | T −1 1 1 − 12 β T Σ−1 = 2 log β β + 2 (β − µq(β) ) Σq(β) (β − µq(β) ). |Σβ | Taking expectations, ∗ Eq {log p(β) − log q (β)} = 1 2 log |Σq(β) | |Σβ | − 12 Eq (β T Σ−1 β β) + 12 Eq {(β − µq(β) )T Σ−1 q(β) (β − µq(β) )}. Using Result 1.23, −1 T −1 Eq (β T Σ−1 β β) = tr{Σβ Covq (β)} + Eq (β) Σβ Eq (β) −1 T = tr(Σ−1 β Σq(β) ) + µq(β) Σβ µq(β) . Similarly, −1 Eq {(β − µq(β) )T Σ−1 q(β) (β − µq(β) )} = tr(Σq(β) Σq(β) ) = tr{I2 } = 2. Therefore, ∗ Eq {log p(β) − log q (β)} = 1 2 |Σq(β) | |Σβ | log +1 n o −1 T − 21 tr(Σ−1 Σ ) + µ Σ µ q(β) q(β) . q(β) β β Thirdly, log p(a) = n X i=1 log p(ai ) 98 99 3.A. 
DERIVATION OF ALGORITHM 4 AND LOWER BOUND (3.4) n X = log 1 −2 −1/2ai 2 ai e i=1 = −n log(2) − 2 n X i=1 log(ai ) − 1 2 n X 1 , ai i=1 and log q ∗ (a) = = = n X i=1 n X log q ∗ (ai ) "s log i=1 " n X 1 2 i=1 = )# ( λq(a) λq(a) (ai − µq(ai ) )2 exp − 2πa3i 2µ2q(ai ) ai λq(a) 3 log(λq(a) ) − log(2π) − log(ai ) − 2 2 ( 1 2 a2i − 2ai µq(ai ) + µ2q(ai ) µ2q(ai ) ai n n λq(a) X n n 3X log(λq(a) ) − log(2π) − log(ai ) − 2 2 2 2 i=1 i=1 ai µ2q(ai ) + 2 µ2q(ai ) Therefore log p(a) − log q ∗ (a) = −n log(2) − 2 − 3 2 n X i=1 n X i=1 log(ai ) − = −n log(2) + n nn X n 1 − log(λq(a) ) − log(2π) a 2 2 i=1 i !) n X ai 2 1 + 2 + 2 µq(ai ) µq(ai ) ai log(ai ) − λq(a) 2 1 2 i=1 n X n n log(2π) − log(λq(a) ) − 12 log(ai ) 2 2 i=1 n n n X (1 − λq(a) ) X λq(a) X 1 ai 1 − − λ + . q(a) 2 2 ai 2 µ µ q(a ) i q(a ) i i=1 i=1 i=1 Taking expectations Eq {log p(a) − log q ∗ (a)} = −n log(2) + − 2 = −n log(2) + − n X n n log(2π) − log(λq(a) ) − 12 Eq {log(ai )} 2 2 i=1 n (1 − λq(a) ) X µq(1/ai ) + i=1 n n log(2π) − 2 2 n X µq(ai ) − λq(a) n X 1 µ µ2 i=1 q(ai ) i=1 q(ai ) n X Eq {log(ai )} log(λq(a) ) − 12 i=1 n λq(a) X 1 n (1 − λq(a) ) X µq(1/ai ) − 2 i=1 λq(a) 2 2 i=1 )# µq(ai ) . 1 + ai ! . 100 3.A. DERIVATION OF ALGORITHM 4 AND LOWER BOUND (3.4) We know that µq(1/ai ) = 1 1 + , µq(ai ) λq(a) so n n (1 − λq(a) ) X (1 − λq(a) ) X 1 1 µq(1/ai ) = − + − 2 2 µq(ai ) λq(a) i=1 i=1 ( n ! ) X 1 (1 − λq(a) ) n = − + 2 µ λq(a) i=1 q(ai ) ! n n(1 − λq(a) ) (1 − λq(a) ) X 1 − = − 2 µ 2λq(a) i=1 q(ai ) ! n (1 − λq(a) ) X 1 n = − + − 2ngτ . 2 µq(ai ) 2 i=1 Hence n n log(2π) − log(λq(a) ) 2 2 n n X X 1 n 1 1 −2 Eq {log(ai )} − 2 + − 2ngτ . µq(ai ) 2 Eq {log p(a) − log q ∗ (a)} = −n log(2) + i=1 i=1 Finally, log p(σε2 ) − log q ∗ (σε2 ) = Aε log(Bε ) − log Γ(Aε ) − (Aε + 1) log(σε2 ) − Bε σε2 − [− log{2J (2Aε + n − 1, C5 , C6 )} n C5 C6 − Aε + + 1 log(σε2 ) + 2 1/2 − 2 2 σε (σε ) n = Aε log(Bε ) − log Γ(Aε ) + log(σε2 ) 2 C5 (C6 − Bε ) + log{2J (2Aε + n − 1, C5 , C6 )} − 2 1/2 + . σε2 (σε ) Taking expectations n Eq {log(σε2 )} 2 + log{2J (2Aε + n − 1, C5 , C6 )} Eq {log p(σε2 ) − log q ∗ (σε2 )} = Aε log(Bε ) − log Γ(Aε ) + −µq(1/σε ) C5 + µq(1/σε2 ) (C6 − Bε ). Combining, we get n X n n n 2 1 log p(y; q) = − log(2π) − Eq {log(σε )} + log gτ + 2 Eq {log(ai )} 2 2 2 i=1 3.A. DERIVATION OF ALGORITHM 4 AND LOWER BOUND (3.4) n (τ − 12 )2 X µq(1/ai ) + µq(1/σε ) C5 − µq(1/σε2 ) (C6 − Bε ) 2gτ i=1 |Σ q(β) | 1 + 2 log +1 |Σβ | − −1 T − 12 {tr(Σ−1 β Σq(β) ) + µq(β) Σβ µq(β) } −n log(2) + − 12 n X n n log(2π) − log(λq(a) ) − 12 Eq {log(ai )} 2 2 i=1 n X 1 i=1 µq(ai ) + n − 2ngτ 2 n Eq {log(σε2 )} 2 + log{2J + (2Aε + n − 1, C5 , C6 )} +Aε log(Bε ) − log Γ(Aε ) + −µq(1/σε ) C5 + µq(1/σε2 ) (C6 − Bε ). Simplifying and using µq(1/ai ) = 1 µq(ai ) + 1 λq(a) = 1 µq(ai ) + 4gτ , log p(y; q) |Σq(β) | −1 T + 1 − 12 {tr(Σ−1 = log β Σq(β) ) + µq(β) Σβ µq(β) } |Σβ | +Aε log(Bε ) − log Γ(Aε ) + log(2) + log{J + (2Aε + n − 1, C5 , C6 )} n n (1/2 − τ )2 X 1 1 n + log gτ − + 4gτ − n log(2) − log 2 2gτ µq(ai ) 2 4gτ 1 2 i=1 − 21 n X 1 i=1 µq(ai ) + n − 2ngτ . 2 Focusing on the final two lines of the expression immediately above, n (τ − 12 )2 X n 1 log gτ − + 4gτ − n log(2) 2 2gτ µq(ai ) i=1 n X n 1 1 n 1 − log −2 + − 2ngτ 2 4gτ µ 2 i=1 q(ai ) ( ) n 1 2 X 1 (τ − ) n 2 = log gτ − 12 +1 − 2n(τ − 21 )2 − n log(2) 2 gτ µq(ai ) i=1 n n + log(4gτ ) + − 2ngτ 2 2 n X 1 1 n =− + log gτ − 2n(τ − 12 )2 − n log(2) 8gτ µ 2 i=1 q(ai ) n n + − 2ngτ + (log 4 + log gτ ) 2 2 101 3.A. 
DERIVATION OF ALGORITHM 4 AND LOWER BOUND (3.4) n n 1 X 1 + log gτ − 2n(τ − 12 )2 − n log(2) 8gτ µ 2 i=1 q(ai ) n n + − 2ngτ + n log(2) + log gτ 2 2 n 1 X 1 n = − + n log gτ − 2n( 14 − τ + τ 2 ) + − 2ngτ 8gτ µq(ai ) 2 = − i=1 n = − X 1 1 + n log{τ (1 − τ )}. 8τ (1 − τ ) µq(ai ) i=1 Combining, we get: log p(y; q) = Aε log(Bε ) − log Γ(Aε ) + log(2) + 1 + n log{τ (1 − τ )} |Σq(β) | 1 + 2 log + log{J + (2Aε + n − 1, C5 , C6 )} |Σβ | −1 T − 12 {tr(Σ−1 β Σq(β) ) + µq(β) Σβ µq(β) } n X 1 1 − . 8τ (1 − τ ) µq(ai ) i=1 102 103 3.B. DERIVATION OF ALGORITHM 5 AND LOWER BOUND (3.9) 3.B Derivation of Algorithm 5 and lower bound (3.9) 3.B.1 Full conditionals Full conditional for (β, u) " log p(β, u|rest) = − 12 #T β u " Ω β u # " −2 β u #T ω + const. where Ω = ω = τ (1 − τ ) T C M a C + Σ−1 (β,u) , σε2 τ − 12 τ (1 − τ ) T C T 1, C M ay − σε2 σε M a = diag(a1 , . . . , an ) and C = [X Z]. Derivation: p(β, u|rest) ∝ p(y|rest)p(β, u) #! " ( " )#2 n 1 Y τ − 2 σε ai gτ ∝ exp − 2 yi − + C β 2σε ai gτ u i=1 i # #T " " β , × exp − 21 β Σ−1 (β,u) u u where gτ = τ (1 − τ ) and C = [X Z]. This is in the exact same form as the full conditional for β under Model (3.2). The stated result follows immediately. Full conditional for σε2 log p(σε2 |rest) = − Aε + n2 + 1 log(σε2 ) τ (1 − τ ) 1 − 2 Bε + σε 2 τ − 21 + 2 1/2 (σε ) " y−C β u " y−C β u #!T #!T 1 + const. Derivation: The derivation is identical to that for σ 2 in Model (3.2). " Ma y−C β u #! 104 3.B. DERIVATION OF ALGORITHM 5 AND LOWER BOUND (3.9) Full conditional for σu2 K 1 = − Au + + 1 log(σu2 ) − 2 Bu + 12 kuk2 + const. 2 σu log p(σu2 |rest) Derivation: p(σu2 |rest) ∝ p(u|σu2 )p(σu2 ) ∝ (σu2 )−K/2 exp 1 Bu 2 −Au −1 2 − 2 kuk × (σu ) exp − 2 . 2σu σu Taking logarithms gives the stated result. Full conditional for ai 3 log p(ai |rest) = − log(ai ) + const. 2 ( τ (1 − τ ) ai yi − − 12 σ2 " C β u #! )2 + i 1 . 4τ (1 − τ )ai Derivation: The derivation is identical to that for ai in Model (3.2). 3.B.2 Optimal q ∗ densities Expression for q ∗ (β, u) q ∗ (β, u) ∼ N (µq(β,u) , Σq(β,u) ) where Σq(β,u) −1 −1 Σ 0 = τ (1 − τ )µq(1/σε2 ) C T M µq(a) C + β , 0 µq(1/σu2 ) I K n o µq(β,u) = Σq(β,u) τ (1 − τ )µq(1/σε2 ) C T M µq(a) y + τ − 12 µq(1/σε ) C T 1 and M µq(a) = diag(µq(a1 ) , . . . , µq(an ) ). Derivation: The derivation is identical to that for β in Model (3.2), replacing Xβ with C " β u # . 3.B. DERIVATION OF ALGORITHM 5 AND LOWER BOUND (3.9) Expressions for q ∗ (σε2 ), µq(1/σε ) and µq(1/σε2 ) q ∗ (σε2 ) = − A + n +1 σε2 ( ε 2 ) exp C7 (σε2 )1/2 − C8 σε2 2J + (2Aε + n − 1, C7 , C8 ) , µq(1/σε ) = J + (2Aε + n + 1, C7 , C8 ) J + (2Aε + n − 1, C7 , C8 ) µq(1/σε ) = J + (2Aε + n, C7 , C8 ) , J + (2Aε + n − 1, C7 , C8 ) and σε2 > 0 where C7 = τ − 1 2 (y − Cµq(β,u) )T 1 and C8 = Bε + τ (1 − τ ) n tr(M µq(a) CΣq(β,u) C T ) 2 o +(y − Cµq(β,u) )T M µq(a) (y − Cµq(β,u) ) . Derivation: The derivation is identical to that for σ 2 in Model (3.2). Expressions for q ∗ (σu2 ) and µq(1/σu2 ) q ∗ (σu2 ) ∼ Inverse-Gamma(Au + K 2 )) 2 , Bq(σu where Bq(σu2 ) = Bu + 12 {kµq(u) k2 + tr(Σq(u) )} Derivation: log q(σu2 ) K 1 2 2 1 + 1 log(σu ) − 2 Bu + 2 kuk = Eq − Au + + const. 2 σu K 1 = − Au + + 1 log(σu2 ) − 2 Bu + 12 Eq kuk2 + const. 2 σu Now, observing that Eq kuk2 = kµq(u) k2 + tr(Σq(u) ), we have log q(σu2 ) K + 1 log(σu2 ) = − Au + 2 i 1 h − 2 Bu + 12 {kµq(u) k2 + tr(Σq(u) )} + const. σu 105 106 3.B. 
DERIVATION OF ALGORITHM 5 AND LOWER BOUND (3.9) Hence q(σu2 ) ∝ K (σu2 )−(Au + 2 )−1 exp i 1 h 2 1 − 2 Bu + 2 {kµq(u) k + tr(Σq(u) )} σu which is in the form of an Inverse-Gamma distribution (see Definition 1.5.16) with parameters stated in the result. Expressions for q ∗ (ai ) and µq(ai ) q ∗ (ai ) ∼ Inverse-Gaussian(µq(ai ) , λq(a) ) where ( µq(ai ) = and r 2τ (1 − τ ) h T µq(1/σ2 ) (CΣq(β,u) C )ii + {yi − (Cµq(β,u) )i }2 λq(a) = i )−1 1 . 4τ (1 − τ ) Derivation: The derivation is identical to that for ai in Model (3.2). 3.B.3 Derivation of lower bound (3.9) (K + 2) + n log{τ (1 − τ )} log p(y; q) = Aε log Bε − log Γ(Aε ) + log(2) + 2 |Σq(β,u) | + 12 log + log{J + (2Aε + n − 1, C7 , C8 )} |Σβ | n − 21 tr(Σ−1 (β,u) Σq(β,u) ) o +(µq(β,u) − µ(β,u) )T Σ−1 (µ − µ ) q(β,u) (β,u) (β,u) n X 1 1 − 8τ (1 − τ ) µq(ai ) i=1 +Au log(Bu ) − log{Γ(Au )} − (Au + + log{Γ(Au + K 2 )} K 2 )) 2 ) log(Bq(σu − µq(1/σu2 ) (Bq(σu2 ) − Bu ). Derivation: log p(y; q) = Eq {log p(y|β, u, σε2 , a)} + Eq {log p(β, u) − log q ∗ (β, u)} +Eq {log p(σε2 ) − log q ∗ (σε2 )} + Eq {log p(σu2 ) − log q ∗ (σu2 )} +Eq {log p(a) − log q ∗ (a)}. 3.B. DERIVATION OF ALGORITHM 5 AND LOWER BOUND (3.9) 107 The forms of the contributions from p(y|β, u, σε2 , a) and the parameters (β, u), σε2 and a are identical to that for the corresponding quantities in Model (3.2). The only small changes are that • β is replaced by (β, u), • X is replaced by C. From similar working for regression case in Appendix 3.A we have: log p(β, u|σu2 ) − log q ∗ (β, u) " #T " # |Σ | q(β,u) β = 12 log − 21 β Σ−1 (β,u) |Σ(β,u) | u u " # !T " # ! β β + 21 − µq(β,u) Σ−1 − µq(β) q(β,u) u u |Σq(β,u) | 1 2 T −1 2 1 K 1 = 2 log − 2 log(σu ) − 2 β Σβ β + 2 kuk |Σβ | σu # !T # ! " " β β −1 1 − µq(β,u) − µq(β) Σq(β,u) +2 u u using the fact that |Σ(β,u) | = (σu2 )K |Σβ |. Taking expectations, Eq {log p(β) − log q ∗ (β)} |Σq(β) | 2 2 1 1 = 2 log − K2 Eq {log(σu2 )} − 21 Eq (β T Σ−1 β β) − 2 Eq (1/σu )Eq kuk |Σβ | " # !T " # ! β −µ β −µ + 21 Eq . Σ−1 q(β,u) q(β) q(β,u) u u Now, Eq " β u !T # − µq(β,u) " Σ−1 q(β,u) β u # − µq(β) ! = tr(Σ−1 q(β,u) Σq(β,u) ) = tr{IK+2 } = K + 2. Also, using Result 1.22, Eq kuk2 = tr{Covq (u)} + kEq (u)k2 = tr(Σq(u) ) + kµq(u) k2 . 108 3.B. DERIVATION OF ALGORITHM 5 AND LOWER BOUND (3.9) Therefore, ∗ 1 2 Eq {log p(β) − log q (β)} = |Σq(β) | (K + 2) K + − 2 Eq {log(σu2 )} log |Σβ | 2 o n −1 −1 T 1 − 2 tr(Σβ Σq(β) ) + µq(β) Σβ µq(β) − 12 µq(1/σu2 ) {tr(Σq(u) ) + kµq(u) k2 }. We derive the contribution from σu2 afresh: Bu log p(σu2 ) − log q ∗ (σu2 ) = Au log(Bu ) − log{Γ(Au )} − (Au + 1) log(σu2 ) − 2 σu h K K − (Au + 2 ) log(Bq(σu2 ) ) − log{Γ(Au + 2 )} Bq(σu2 ) 2 K −(Au + 2 + 1) log(σu ) − σu2 = Au log(Bu ) − log{Γ(Au )} − (Au + K2 ) log(Bq(σu2 ) ) 1 + log{Γ(Au + K2 )} + K2 log(σu2 ) − 2 (Bq(σu2 ) − Bu ). σu Taking expectations Eq {log p(σu2 ) − log q ∗ (σu2 )} = Au log(Bu ) − log{Γ(Au )} − (Au + + log{Γ(Au + K 2 )} + K 2 )) 2 ) log(Bq(σu 2 K 2 Eq {log(σu )} −µq(1/σu2 ) (Bq(σu2 ) − Bu ). Combining the contributions from all nodes, we get n X n n n 2 1 log p(y; q) = − log(2π) − Eq {log(σε )} + log gτ + 2 Eq {log(ai )} 2 2 2 i=1 ( 21 τ )2 n X − µq(1/ai ) + µq(1/σε ) C7 − µq(1/σε2 ) (C8 − Bε ) 2gτ i=1 |Σq(β) | (K + 2) K 1 + 2 log + − 2 Eq {log(σu2 )} |Σβ | 2 n o −1 T − 21 tr(Σ−1 Σ ) + µ Σ µ q(β) q(β) q(β) β β − − 21 µq(1/σu2 ) {tr(Σq(u) ) + kµq(u) k2 } −n log(2) + − 12 n X n n Eq {log(ai )} log(2π) − log(λq(a) ) − 12 2 2 n X 1 i=1 µq(ai ) i=1 + n − 2ngτ 2 +Aε log Bε − log Γ(Aε ) + n Eq {log(σε2 )} 2 3.B. 
DERIVATION OF ALGORITHM 5 AND LOWER BOUND (3.9) + log{2J + (2Aε + n − 1, C7 , C8 )} −µq(1/σε ) C7 + µq(1/σε2 ) (C8 − Bε ) +Au log(Bu ) − log{Γ(Au )} − (Au + + log{Γ(Au + K 2 )} + 2 K 2 Eq {log(σu )} K 2 )) 2 ) log(Bq(σu − µq(1/σu2 ) (Bq(σu2 ) − Bu ). Simplifying gives lower bound (3.9): (K + 2) log p(y; q) = Aε log Bε − log Γ(Aε ) + log(2) + + n log{τ (1 − τ )} 2 |Σq(β,u) | + log{J + (2Aε + n − 1, C7 , C8 )} + 12 log |Σβ | o n −1 T Σ µ − 21 tr(Σ−1 Σ ) + µ q(β) q(β) q(β) β β − 12 µq(1/σu2 ) {tr(Σq(u) ) + kµq(u) k2 } n X 1 1 − 8τ (1 − τ ) µq(ai ) i=1 +Au log(Bu ) − log{Γ(Au )} − (Au + + log{Γ(Au + K 2 )} K 2 )) 2 ) log(Bq(σu − µq(1/σu2 ) (Bq(σu2 ) − Bu ). 109 Chapter 4 Mean field variational Bayes for continuous sparse signal shrinkage 4.1 Introduction There are many areas of statistics that benefit from the imposition of a sparse distribution on a set of parameters. In many applications, the aim is to choose which of potentially thousands or millions of covariates have the greatest effect on the response. These applications are widely referred to as “wide data” or “n p” problems, defined by the number of variables (p) being considerably larger than the number of observations (n). In these cases, imposing a sparse prior on the parameters of interest allows us to perform model selection and inference simultaneously (Tibshirani, 1996). One application where sparse estimators are in huge demand is genome wide association studies (GWAS). GWAS can essentially be considered as a penalized regression problem: yi = β0 + β1 xi1 + β2 xi2 + . . . + βp xip + εi (4.1) where yi is the phenotype of the ith individual (may be discrete or continuous), xij is genotype of the jth marker of the ith individual, and βj is the effect of the jth marker. The aim is to estimate the genetic effect associated with each marker (xij ), and hence identify which are significant markers for the phenotype (yi ). With so many parameters in the model (as p may be in the order of 100,000’s or millions), the key problems here are: overfitting and computational constraints. The former issue can be addressed by employing penalised regression. We address the latter issue by using mean field variational Bayes (MFVB). 110 111 4.1. INTRODUCTION In a Bayesian framework, two major choices are evident in approaching n p prob- lems. These are (1) method of inference, and (2) type of prior to induce sparseness. In keeping with the theme of this thesis, we choose to overcome computational time constraints by utilising MFVB as the inference tool. In this chapter we explore the efficacy of continuous priors in inducing sparsity, under the framework of MFVB. These continuous priors correspond to non-convex penalization, which has been suggested for wide data applications (Griffin and Brown, 2011). These continuous priors are in contrast to the existing literature which has explored “slab and spike” distributions such as LaplaceZero mixtures (for example, Johnstone and Silverman, 2004). In particular, we focus on the Horseshoe (Carvalho et al., 2010), Normal-Exponential-Gamma (NEG) (Griffin and Brown, 2011) and Generalized-Double-Pareto (GDP) (Armagan et al., 2012) distributions. These three distributions are defined in Chapter 1 by Definitions 1.5.22, 1.5.23 and 1.5.24 respectively. Standard densities (location parameter µ = 0 and scale parameter σ = 1) of the Horseshoe, NEG and GDP distributions are shown in Figure 4.1. Both the NEG and GDP distributions have a third shape parameter λ. 
Hence Figure 4.1 shows several versions of the "standard" NEG and GDP densities, with the kurtosis increasing as the shape parameter decreases.

Figure 4.1: Standard (µ = 0, σ = 1) continuous sparseness inducing density functions. (Panels: (a) Horseshoe; (b) Normal-Exponential-Gamma and (c) Generalized Double Pareto, each plotted for three values of the shape parameter λ.)

The locality property of MFVB allows us to concentrate on univariate scale models of our sparsity inducing distributions. Adopting this strategy allows us to avoid unnecessarily complex calculations, and focus on the core of the problem. Our findings can then be incorporated into more complex regression models, using the conditional independence structure of the model, most clearly evident in the directed acyclic graph (DAG).
Our most significant finding, which reveals itself throughout this chapter, is that apNoteNote that vNote that ∼ IG(A, vthat ∼ IG(A, vB)∼ifIG(A, B) andif only and B) ifonly ifand 1/vif only ∼1/v Gamma(A, if∼1/v Gamma(A, ∼ Gamma(A, B). B). B). plying MFVB using the most natural auxiliary variable representations of the Horseshoe, Result Result 1a. Result Let 1a.x,Let b1a. and x,Let bcand be x,random bc and be random c variables be random variables such variables that such that such that NEG and GDP models leads to poor inference. We remedy this through the incorporation x| b ∼x|Nb (µ, ∼x|N σb2(µ, /b), ∼N σ 2(µ, /b), b |σc2 /b), ∼b |Gamma( c ∼b Gamma( | c ∼21 Gamma( , c) 21 ,and c) 21 and ,cc)∼ Gamma( and c ∼ Gamma( c ∼12 Gamma( , 1). 12 , 1). 12 , 1). of special functions into our MFVB algorithms. Continued fraction approximations are Then Then x ∼ Horseshoe(µ, x Then ∼ Horseshoe(µ, x ∼ Horseshoe(µ, σ). σ). σ). Result Result 1b.Result Let 1b.x Let and 1b.xbLet and be random xb and be random b variables be random variables such variables that such that such that −1 −1/2 −1 x| b ∼x|Nb (µ, ∼x|N σb2(µ, /b) ∼N σ 2(µ, /b) andσ 2and /b) p(b) = and p(b) π −1=p(b) b−1/2 π −1=b(b−1/2 π+ 1) (b b −1 +,1) (b−1 b+> , 1)0. b> , 0.b > 0. 112 4.2. HORSESHOE DISTRIBUTION used to facilitate stable computation of these special functions, made practical through the use of Lentz’s Algorithm. This adds yet another tool to the rapidly growing armoury of MFVB. The chapter is set out as follows: we consider the Horseshoe, NEG and GDP priors sequentially. For each prior, we explore the impact of varying auxiliary variable representations on the quality of MFVB inference. This involves derivation and presentation of MFVB algorithms and their corresponding lower bounds for each prior. The merits of each representation is explored in terms of the simplicity of the resulting MFVB algorithm and the accuracy of MFVB inference. We give the most detail in the Horseshoe case, and present necessary information in the NEG and GDP cases. Derivations of MFVB algorithms and lower bounds are deferred to Appendices 4.A, 4.C and 4.E. The work in this chapter is presented in the manuscript Neville, Ormerod and Wand (2012) and is in the process of being peer reviewed. The novel results in this chapter culminated in an academic visit to the University of Oxford in late 2012. Key findings were presented at the Wellcome Trust Centre for Human Genetics seminar series. 4.2 Horseshoe distribution The first case we consider is a univariate random sample drawn from the Horseshoe distribution as follows: ind. xi |σ ∼ Horseshoe(0, σ), σ ∼ Half-Cauchy(A). (4.2) Introduction of the auxiliary variables a, b = (b1 , . . . , bn ) and c = (c1 , . . . , cn ), and application of Results 1.6, 1.9 and 1.10 respectively, we can represent (4.2) as the following three heirarchical models: Table 4.1: Three auxiliary variable models that are each equivalent to Horseshoe Model (4.2). The abbreviations IG and HS represent the Inverse-Gamma and Horseshoe distributions respectively. Model I Model II Model III ind. ind. ind. xi |σ ∼ HS(0, σ), σ 2 |a ∼ IG( 12 , a−1 ), a ∼ IG( 21 , A−2 ). xi |σ, bi ∼ xi |σ, bi ∼ N (0, σ 2 /bi ), N (0, σ 2 /bi ), σ 2 |a ∼ IG( 12 , a−1 ), σ 2 |a ∼ IG( 12 , a−1 ), a ∼ IG( 21 , A−2 ), a ∼ IG( 21 , A−2 ), p(bi ) = −1/2 π −1 bi (bi ind. + 1)−1 , bi > 0. bi |ci ∼ Gamma( 12 , ci ), ind. ci ∼ Gamma( 12 , 1). 113 4.2. 
HORSESHOE DISTRIBUTION Each of the three models presented in Table 4.1 and illustrated in Figure 4.2 have both highlights and drawbacks. For example, Model III is attractive due to the simple form of the conditional distributions that make up the hierarchy. The advantages and disadvantages of each model will be elucidated in sections immediately following, identifying the best model under the over-arching MFVB framework. a aa a aa σ σσ σ σσ x xx b x (a) Model I bb a aa c cc σ σσ b bb xx x (b) Model II xx (c) Model III Figure 4.2: Directed acyclic graphs corresponding to the three models listed in Table 4.1. 4.2.1 Mean field variational Bayes We impose the following three product restrictions on the joint posterior density function: p(σ, a|x) ≈ q(σ)q(a) p(σ, a, b|x) ≈ q(σ)q(a, b) for Model I, (4.3) for Model II, p(σ, a, b, c|x) ≈ q(σ, c)q(a, b) for Model III. Derivations of the MFVB algorithms resulting from (4.3) are deferred to Appendix 4.A. The optimal q ∗ density for σ under Model I has the form: ( ∗ 2 2 −(n+3)/3 q (σ ) ∝ (σ ) 2 exp µq(1/a) /σ + n X ) log pHS (xi /σ) . (4.4) i=1 The evaluation of the normalizing factor and hence moments for q ∗ (σ 2 ) under Model I require the use of numerical integration and multiple (n) evaluations of the exponential integral function, due to its presence in the standard Horseshoe density. This makes MFVB for Model I very computationally intensive, which is at odds with the underly11 ing purpose of MFVB as a fast, 1deterministic alternative to Markov chain Monte Carlo (MCMC). We therefore shift our focus to Models II and III. Under the product restrictions specified in (4.3), the optimal MFVB density for σ 2 has 114 4.2. HORSESHOE DISTRIBUTION an identical closed form for both Models II and III. This is in stark contrast to Model I, where the form of q ∗ (σ 2 ) was intractable, hence requiring numerical integration. A closed form optimal density for σ 2 is aligned with the MFVB framework, as it leads to a relatively simple MFVB algorithm. Hence the appeal of Models II and III lies in the closed form of the optimal q density for σ 2 , specifically: q ∗ (σ 2 ) ∼ Inverse-Gamma 1 2 (n + 1), µq(1/a) + 1 2 n X ! x2i µq(bi ) . (4.5) i=1 Algorithm 6 determines the optimal moments of q ∗ (σ 2 ), q ∗ (a), q ∗ (b) and q ∗ (c) under product restriction (4.3). In other words, the algorithm performs fast, deterministic inference for data following (4.2), under the auxiliary variable representations described by Models II and III. We defer derivation of Algorithm 6, i.e. derivation of optimal q ∗ densities, parameter updates and lower bound, to Appendix 4.A. We now present the Initialize: µq(1/σ2 ) > 0. If Model III, initialize: µq(ci ) > 0, 1 ≤ i ≤ n. Cycle: µq(1/a) ← A2 /{A2 µq(1/σ2 ) + 1}. For i = 1, . . . , n: Gi ← 12 µq(1/σ2 ) x2i if Model II: µq(bi ) ← {Gi Q(Gi )}−1 − 1 if Model III: µq(bi ) ← 1/(Gi + µq(ci ) ) ; µq(ci ) ← 1/(µq(bi ) + 1) P µq(1/σ2 ) ← (n + 1)/ 2µq(1/a) + ni=1 x2i µq(bi ) until the increase in p(x; q) is negligible. Algorithm 6: Mean field variational Bayes algorithm for determination of q ∗ (σ 2 ) from data modelled according to (4.2). The steps differ depending on which auxiliary variable representation, Model II or Model III (set out in Table 4.1), is used. lower bound for Algorithm 6. 
We first note that for all of the sparseness inducing priors (namely the Horseshoe, NEG and GDP) we have log p(x; q, BASE) + log p(x; q, II) for Model II cases log p(x; q) = log p(x; q, BASE) + log p(x; q, III) for Model III cases (4.6) where log p(x; q, BASE) is the contribution from the nodes common to both models, namely x, σ 2 and a; and log p(x; q, II) and log p(x; q, III) are the relative contributions from the 115 4.2. HORSESHOE DISTRIBUTION nodes specific to Models II and III. For the Horseshoe prior, the lower bound takes the form of (4.6), with specific form log p(x; q, BASE) = + log Γ{ 12 (n + 1)} − n 2 log(2π) − log(π) − log(A) − log(µq(1/σ2 ) + A−2 ) + µq(1/a) µq(1/σ2 ) P − 12 (n + 1) log µq(1/a) + 12 ni=1 x2i µq(bi ) , P log p(x; q, II) = −n log(π) + ni=1 Gi µq(bi ) + log{Q(Gi )} , and P log p(x; q, III) = −n log(π) + ni=1 µq(bi ) (Gi + µq(ci ) ) − log(Gi + µq(ci ) ) − log(µq(bi ) + 1) . (4.7) Full derivations are presented in Appendix 4.A. Under Model III, Algorithm 6 describes a set of algebraic operations, performed until a certain tolerance is reached. This is far simpler than the case for the Model II algorithm, where the ratio Q(x) = ex E1 (x) must be repeatedly evaluated, involving the evaluation of the exponential integral function (see Definition 1.5.6). To complicate things further, the ratio Q(x) provides numerical issues if computed naively. If we firstly express the problematic ratio as Q(x) = E1 (x) , e−x it is easy to observe that as x becomes large, both the numerator and the denominator rapidly approach zero. A novel way to get around this is to first express Q(x) in contin- ued fraction form (Cuyt et al., 2008). This facilitates computation via Lentz’s Algorithm (Lentz, 1976; Press et al., 1992) to a prescribed accuracy. Wand and Ormerod (2012) investigate in detail the approximation of ratios of special functions using the continued fraction approach. As set out in Result 1.2, Q(x) admits the continued fraction expansion: 1 Q(x) = . 12 x+1+ 22 x+3+ x+5+ 32 x + 7 + ··· Algorithm 7 describes the steps required to compute Q(x) for varying arguments. For arguments less than or equal to 1, direct computation is used. For arguments greater than 1, Lentz’s algorithm is called into play. Figure 4.3 illustrates the number of iterations required for Lentz’s algorithm to converge when used to approximate the ratio Q(x) to the accuracy described in Algorithm 7. The larger the value of the argument, the more 116 4.2. HORSESHOE DISTRIBUTION Inputs (with defaults): x > 0, ε1 (10−30 ), ε2 (10−7 ), If x > 1 then (use Lentz’s Algorithm) fprev ← ε1 ; Cprev ← ε2 ; Dprev ← 0; ∆ = 2 + ε2 ; j ← 1 cycle while |∆ − 1| ≥ ε2 : j ←j+1 Dcurr ← x + 2j − 1 − (j − 1)2 Dprev Ccurr ← x + 2j − 1 − (j − 1)2 /Cprev Dcurr ← 1/Dcurr ∆ ← Ccurr Dcurr fcurr ← fprev ∆ fprev ← fcurr ; Cprev ← Ccurr ; Dprev ← Dcurr return 1/(x + 1 + fcurr ) Otherwise (use direct computation) return ex E1 (x). Algorithm 7: Algorithm for stable and efficient computation of Q(x). quickly Lentz’s Algorithm converges. The number of iterations required for convergence increases as the argument approaches zero. This is not an issue due to the structure of Algorithm 7: Q(x) for small arguments (0 < x ≤ 1) are evaluated using direct computa- 20 15 10 number of iterations 25 tion. 2 4 6 8 10 x Figure 4.3: The number of iterations required for Lentz’s Algorithm to converge when used to approximate Q(x). Convergence criteria are specified as default settings in Algorithm 7. 
Now we have presented MFVB inference tools for Models II and III in the form of Algorithms 6 and 7, we proceed to compare the the two models in terms of their simplicity, their performance in a simulation study and their theoretical underpinnings. 117 4.2. HORSESHOE DISTRIBUTION 4.2.2 Simplicity comparison of Models II and III As mentioned in the previous section, MFVB inference for Model III is far simpler than that for Model II. This simplicity arises from the elegant distributional forms that result from including the extra set of auxiliary variables c = (c1 , . . . , cn ) in Model III. In contrast to the simple algebraic steps in Model III’s algorithm, Model II must introduce special functions, continued fractions and Lentz’s Algorithm to carry out MFVB inference. Model III clearly wins the simplicity round. It should be noted here that under Model II MFVB, the evaluation of Q(x) is rather cheap and very stable thanks to (1) the low number of iterations required for computation via Lentz’s Algorithm for x > 1, and (2) the availability of direct computation via the R package gsl (Hankin, 2007) for 0 < x ≤ 1. Over the following two sections, the superior model of the two reveals itself rather convincingly. 4.2.3 Simulation comparison of Models II and III We carried out a simulation study comparing the quality of MFVB inference resulting from Models II and III. One thousand data sets were generated according to xi ∼ Horseshoe(0, 1), 1 ≤ i ≤ n, with sample sizes of both n = 100 and n = 1000 considered. This corresponds to σ 2 having a true value of 1. We assessed the quality of MFVB inference by comparing the approximate MFVB optimal density, q ∗ (σ 2 ), with with a highly accurate MCMC-based posterior approximation denoted by pMCMC (σ 2 ). The accuracy is defined as accuracy ≡ 1 − Z 0 ∞ q ∗ (σ 2 ) − pMCMC (σ 2 |x) d(σ 2 ). An accuracy measure of 1 implies perfect alignment between the MFVB and MCMC approximate posteriors. More detail about calculation of the accuracy measure is explained in Section 1.6. The MCMC posterior approximation, pMCMC (σ 2 ), was obtained using WinBugs (Lunn, Thomas, Best and Spiegelhalter, 2000) via the BRugs package (Ligges et al., 2011) in R. MCMC samples of size 10000 were created, with the first 5000 discarded and the remaining sample thinned by a factor of 5. 118 4.2. HORSESHOE DISTRIBUTION Table 4.2: Average (standard deviation) accuracy based on MFVB for a simulation size of 1000 from (4.2). n = 100 n = 1000 Model II 54.3(1.4) 56.8(0.9) Model III 6.3(0.9) 0.0(0.0) A summary of the accuracy of MFVB inference for Models II and III is presented in Table 4.2. It is evident that the average accuracy of MFVB inference resulting from Model II is much higher than that of Model III, over both sample sizes. In particular, the MFVB approximate posteriors under Model 3 had 0% accuracy for samples of size n = 1000. 20 15 10 5 approx. posterior density 0 5 10 15 20 MCMC MFVB Model II MFVB Model III 0 approx. posterior density The poor performance of Model III is further examined in the following sections. 0.4 0.6 0.8 1.0 1.2 1.4 0.4 0.6 0.8 1.0 1.2 1.4 1.0 1.2 1.4 20 15 10 5 0 5 10 15 20 approx. posterior density σ2 0 approx. posterior density σ2 0.4 0.6 0.8 1.0 1.2 1.4 σ2 0.4 0.6 0.8 σ2 Figure 4.4: Comparison of pMCMC (σ 2 |x) and two q ∗ (σ 2 ) densities based on Model II and Model III MFVB for four replications from the simulation study corresponding to Table 4.2 with n = 1000. 
Figure 4.4 illustrates both MFVB approximate posteriors for σ 2 versus the accurate MCMC posterior for four replications within the simulation study. The purple Model II MFVB densities have similar centres to the orange MCMC densities, although MFVB results in a lower spread. The Model II densities also cover the true parameter vale of one with reasonable mass. In contrast, the blue Model III density is shifted to the left, centred around 0.4, and lies nowhere near the accurate MCMC density nor the true parameter value of 1. This supports the poor performance of MFVB inference resulting from Model III and identifies Model II as the superior choice. 119 4.2. HORSESHOE DISTRIBUTION 4.2.4 Theoretical comparison of Models II and III In this section we endeavour to explain why Model III, with its elegant q ∗ densities and simple MFVB Algorithm, performs so poorly in practice. We begin by examining the core differences between Models II and III. We then examine the simulated data and identify the underlying reason for Model III’s poor performance. Finally, we present a new theorem that explains the inapropriateness of MFVB for Model III. To allow direct comparison between Models II and III, we now find an alternative expression for µq(bi ) under Model III by eliminating terms involving ci . We denote the form of the expression for µq(bi ) as g III (x) under Model III, and g II (x) under Model II. Under Model III: 1 Gi + µq(ci ) 1 Gi + µ 1 +1 µq(bi ) = = q(bi ) µq(bi ) + 1 . (µq(bi ) + 1)Gi + 1 = It follows that Gi µ2q(bi ) + Gi µq(bi ) − 1 = 0, and using the quadratic formula gives µq(bi ) = −Gi ± q G2i + 4Gi 2Gi Hence III g (x) = −x ± . √ x2 + 4x 2x (4.8) So, Model III uses the above g III (x) as an approximation to g II (x) = 1 − 1. ex E1 (x) (4.9) Figure 4.5 illustrates the adequacy of g III (x) as an approximation of g II (x) for varying values of the argument x. We can see that as the value of x increases, the ratio g III (x)/g II (x) approaches 1. Smaller arguments present the biggest discrepancy between g II (x) and its approximation under Model III. As x approaches zero, the difference between the two functions is marked. Hence g III (x) provides poor approximation of g II (x) for low positive arguments. 120 0.8 0.7 0.6 0.4 0.5 gIII(x)/gII(x) 0.9 1.0 4.3. NORMAL-EXPONENTIAL-GAMMA DISTRIBUTION 0.0 0.5 1.0 1.5 x Figure 4.5: Plot of g III (x)/g II (x) for the functions g III and g II defined by (4.8) and (4.9) respectively. We now examine the behaviour of random variables x, b and c created under the simplified model: x|b ∼ N (0, 1/b), b|c ∼ Gamma( 21 , c), c ∼ Gamma( 12 , 1). (4.10) Figure 4.6 shows MCMC samples of {log(1/b), log(c)|x = x0 } for x0 = (1, 0.1, 0.01, 0.001). The sample correlations are also shown. It can be seen that as x0 approaches zero, the correlation between log(1/b) and log(c) approaches 1. This is directly at odds with the MFVB assumption of posterior independence between b and c. This is the underlying reason why MFVB inference for Model III is so poor. Model II, with only b, does not encounter this problem, and hence MFVB inference is of higher quality. The reason behind the behaviour evident in Figure 4.6 is explained by Theorem 4.2.1. Theorem 4.2.1 Consider random variables x, b and c such that x|b ∼ N (0, 1/b), b|c ∼ Gamma( 12 , c), c ∼ Gamma( 12 , 1). Then lim [Corr{log(1/b), log(c)|x = x0 }] = 1. x0 →0 Full details of the proof of Theroem 4.2.1 are given in Appendix 4.B. 
4.3 Normal-Exponential-Gamma distribution The second case we consider is a univariate random sample drawn from the NEG distribution, i.e. ind. xi |σ ∼ NEG(0, σ, λ), σ ∼ Half-Cauchy(A). (4.11) 121 4.3. NORMAL-EXPONENTIAL-GAMMA DISTRIBUTION x0=0.1 corr.= 0.718 ● ● ● ● ● ● ● ● ● ● ● ● ● 10 6 8 x0=1 corr.= 0.29 ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ●● ● ● ●● ● ● ● ● ●● ●● ●● ● ● ● ● ●● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ●● ● ● ● ● ● ● ●● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ●● ● ●● ● ● ● ●● ● ● ● ● ● ● ●●● ● ● ● ●● ●● ● ●● ● ● ●● ●● ● ● ● ● ● ● ● ●● ● ●●● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ●● ● ●● ● ● ●● ●● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ●● ● ●● ●● ● ● ● ● ●● ● ●● ● ●●● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ●● ●● ●● ●● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ●●●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ●● ● ● ● ●●● ● ● ● ● ● ● ●● ●●● ● ● ● ● ● ●● ● ● ● ● ●● ● ●● ● ● ● ●● ●● ● ● ●● ● ● ● ● ● ● ●● ● ● ●●● ●● ●● ● ● ●● ●● ● ● ● ● ●● ● ●●● ●●● ●●● ●●● ● ● ●● ● ●● ●● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ●●● ●●●● ● ●● ● ● ● ● ●● ● ● ● ●● ● ● ●●● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ●● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ●● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ●●● ● ●● ● ● ● ● ● ● ● ● ● ●●●● ●●●●● ● ● ● ● ●● ● ● ●● ● ●● ● ●● ●● ● ●● ●● ●● ● ● ● ● ●● ● ● ● ● ● ●●● ● ● ● ● ●●● ●● ● ● ●● ● ●● ● ● ●● ● ● ● ●●● ● ● ●● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ●● ● ● ● ●● ● ●● ●●● ● ●● ● ●● ● ● ● ●● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ●● ●● ● ● ● ●● ● ● ● ● ●● ● ● ● ●● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ●● ● ● ● ● ● ●● ● ●● ● ●● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●●● ● ● ● ● ● ● ● ● ● 4 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● −2 −6 −4 0 log(1 b) 15 x0=0.01 corr.= 0.895 ● ● ● ● ● ● ● ● −6 ● ● ● ● 15 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●●● ● ● ● ●● ● ● ● ● ● ●● ●● ● ● ● ●● ● ●● ● ● ● ● ● ●● ● ●● ● ● ●● ● ● ●● ● ● ● ● ● ● ●● ● ● ●● ● ●● ● ● ●● ● ●● ● ● ● ● ● ●●● ● ●● ● ● ● ● ●● ● ● ●● ● ● ●●● ● ● ●● ● ●● ●● ● ● ●● ● ● ● ● ● ●● ● ● ●● ● ● ●●● ● ●● ● ● ● ●●●●● ● ● ● ●● ● ● ● ●● ● ●●● ● ● ● ● ●● ●●● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ●●● ● ●●●● ● ●●●● ● ● ● ●● ● ● ● ● ● ● ●●● ● ● ●● ● ● ● ●● ●●● ● ● ● ●● ●●● ● ● ● ●● ● ● ● ●● ● ● ● ● ●● ● ●● ● ● ● ●● ● ●● ●● ● ●● ● ● ● ● ● ● ● ●●● ● ● ● ● ●● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ●● ● ● ● ●●● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ●●● ● ●● ● ●● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● ● ● ●● ● ● ●● ● ●● ● ● ● ●● ●● ● ● ● ●● ● ● ● ● ●● ●●● ● ●●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ●●●● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ●● ● ● ● ● ● ● ● ● ●● ● ● ●● ●●● ● ● ●● ● ●● ● ● ● ●● ●●● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ●● ● ●● ●●● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ●● ● ● ●● ●● ● ● ●● ● ● ● ●● ● ● ● ●● ● ●● ● ● ●● ●●● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ●●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ●● ●● ● ●● ●● ●● ● ● ●● ● ● ●●●● ● ● ●● ● ● ● ● ● ●●● ● ● ●● ●● ● ● ● ● ● ● ●●● ● ● ● ●● ● ● ●●● ● ● ● ● ● ●● ●● ● ●● ● ● ● ● ●● ●●● ●● ●● ● ● ● ●● ● ●● ● ●● ● ●● ● ●● ●● ●●● ●● ● ●● ● ●● ● ●● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ●● ●● ●● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● 
● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●●● ● ● ● ● ●● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ●● ● ●● ●● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 ● ● ● ● −5 ● 0 5 ● 10 log(1 b) ● ● −2 0 2 4 log(1 b) 6 ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ●●● ●● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ●●●●● ● ● ● ● ● ● ● ● ●●●● ●●● ● ● ● ● ● ● ● ● ● ●● ● ●●● ●● ● ● ●● ● ●●● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ●●● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ●● ●● ● ● ● ● ● ● ●● ●● ●● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ●● ● ●● ● ● ● ● ●●●● ●● ●● ● ● ● ● ● ● ● ●●● ● ● ●●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●●● ● ● ● ● ●● ● ●● ● ●● ● ● ● ● ●●● ● ●● ● ● ● ● ● ●● ●● ●●● ● ●● ● ● ● ● ●● ● ●● ● ● ● ● ● ●● ● ● ●● ● ● ●●● ●● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ●● ●● ● ●● ●● ● ●●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●●● ● ● ●●● ● ● ● ●●●● ● ● ●● ● ● ● ● ●● ●● ●● ● ● ●● ● ● ● ●●● ●● ● ● ● ●● ● ●● ● ● ● ●● ● ● ● ● ● ●● ● ● ●● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ●●●● ● ●● ●●●●●● ● ● ● ● ● ● ●● ●● ● ●● ● ●● ● ●● ● ● ● ●● ●● ●● ● ● ●● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●●●● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ●● ● ●● ● ●● ●●●● ● ●● ● ● ● ● ●● ● ●● ● ●● ● ● ● ● ● ●● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●●● ● ● ●● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ●● ●● ● ● ● ● ●● ●● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ●●●●● ●●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ●●● ●● ● ●●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●●● ● ● ●●●●● ● ● ●● ● ● ● ● ● ●●● ●●● ● ● ● ● ●● ● ● ●● ● ●● ● ● ● ● ●● ● ● ● ●● ● ●●● ● ●● ● ● ●●● ● ●● ● ● ● ● ● ● ● ●● ●● ● ● ●● ● ●● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ●● ●● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●●● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● 5 ● 0 10 ● ● 5 log(c) ● ● log(c) ● ● ● ● ●● ● ● ● ● ● ● ● −4 10 ● ● ● ● ● x0=0.001 corr.= 0.951 ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● 2 ● ● ● ● ● ● ● ● ● −2 ● ● ● ● ● −8 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ●● ● ● ●● ●● ● ● ● ● ● ●● ● ●● ●● ● ●●● ● ● ●●●● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ●●●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ●● ●● ●● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ●● ● ● ● ●●● ● ●●● ●●● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ●● ● ● ● ● ● ● ● ●● ●● ● ●● ● ● ● ● ● ●● ● ● ● ●●● ●●●● ● ● ●● ● ● ● ●● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ●● ● ●● ● ● ●● ●● ●●● ● ● ●●● ● ● ●● ● ● ●● ● ● ● ●● ● ● ● ●●●●● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ●●● ● ● ● ● ● ● ● ●● ● ●● ●● ● ●● ● ● ● ● ●●● ●●● ● ● ●● ● ●● ● ● ● ● ● ●● ● ●● ● ● ● ● ●● ●●● ● ● ● ●● ●● ● ● ● ● ●● ● ●● ●● ● ● ●● ● ● ● ● ● ● ● ● ●●●●● ●● ●●● ● ● ● ● ● ●● ● ● ● ●● ● ●●● ● ● ●● ● ● ● ● ● ●● ●●●● ● ● ● ●● ● ●● ● ● ●● ●● ● ● ● ● ●● ● ●● ●●● ●● ● ●● ● ● ● ●● ● ● ● ●●● ● ●● ●●● ●●● ● ●● ●● ●● ●● ● ● ● ● ● ●● ●● ● ● ●● ●● ●● ● ●● ●● ● ● ● ● ●●●●● ●● ● ●● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ●● ● ● ● ● ●● ● ● ●●● ●● ●● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ●●● ● ● ● ●● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ●● ●●● ●● ● ● ● ● ●● ● ● ●● ● ● ● ●● ●● ●● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ●●● ● ● ● ● ● ● ● ● ●● ●● ● ●● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ●● ● ● ●● ● ● ● ●●● ● ● ●● ● ● ●●●● ● ● ●● ●● ● ●● ● ● ●● ● ● ● ● ●● ●● ● ● ● ●● ● ●● ● ● ●● ●● ● ● ● ●● ● ● ● ● ● ●● ● ● ●● 
● ● ● ● ● ●●●● ● ● ● ●● ● ●● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ●● ● ●● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●●●●● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● 20 0 ● ● ● ● ●● ● ●● 0 ● ● ● ● ● 5 ● ● ● log(c) ● ● 2 log(c) ● ● ● ● ● ● ● ● ● ● −5 0 5 10 log(1 b) 15 Figure 4.6: MCMC samples (n = 1000) from the distribution {log(1/b), log(c)|x = x0 } for x0 = (1, 0.1, 0.01, 0.001) where the data is generated according to (4.10). Sample correlations are also shown. Again, through introduction of the auxiliary variables a, b = (b1 , . . . , bn ) and c = (c1 , . . . , cn ), we are able to represent (4.11) as the three equivalent heirarchical models presented in Table 4.3. Similarly to the Horseshoe case, each of the three models presented in Table 4.3 are illustrated in Figure 4.2. Table 4.3: Three auxiliary variable models that are each equivalent to NEG Model (4.11). The abbreviation IG represents the Inverse-Gamma distribution. Model I Model II Model III ind. ind. ind. xi |σ ∼ NEG(0, σ, λ), σ 2 |a ∼ IG( 12 , a−1 ), a ∼ IG( 21 , A−2 ). xi |σ, bi ∼ σ 2 |a a∼ ∼ xi |σ, bi ∼ N (0, σ 2 /bi ), N (0, σ 2 /bi ), σ 2 |a ∼ IG( 12 , a−1 ), IG( 12 , a−1 ), a ∼ IG( 21 , A−2 ), IG( 21 , A−2 ), p(bi ) = λbλ−1 (1 i ind. + bi )−λ−1 , bi > 0. bi |ci ∼ IG(1, ci ), ind. ci ∼ Gamma(λ, 1). 4.3. NORMAL-EXPONENTIAL-GAMMA DISTRIBUTION 4.3.1 122 Mean field variational Bayes We impose the same set of product restrictions (4.3) on the joint posterior density function for Models I, II and III. The same pitfall causes computational problems for Model I as it did in the Horseshoe case. That is, the form of q ∗ (σ 2 ) requires both numerical integration for evaluation of the normalising constant; and repeated evaluations of special functions (in this case, the parabolic cylinder function). Hence we limit our investigation to Models II and III. Again, derivations of the MFVB algorithms for Models II and III are deferred to Appendix 4.C. Initialize: µq(1/σ2 ) > 0. If Model III, initialize: µq(ci ) > 0, 1 ≤ i ≤ n. Cycle: µq(1/a) ← A2 /{A2 µq(1/σ2 ) + 1}. For i = 1, . . . , n: Gi ← 12 µq(1/σ2 ) x2i √ √ if Model II: µq(bi ) ← (2λ + 1)R ( 2G ) 2Gi i 2λ q if Model III: µq(bi ) ← µq(ci ) /Gi ; µq(1/bi ) ← 1/µq(bi ) + 1/{2µq(ci ) } µq(1/σ2 ) µq(ci ) ← (λ + 1)/(µq(1/bi ) + 1) P ← (n + 1)/ 2µq(1/a) + ni=1 x2i µq(bi ) until the increase in p(x; q) is negligible. Algorithm 8: Mean field variational Bayes algorithm for determination of q ∗ (σ 2 ) from data modelled according to (4.11). The steps differ depending on which auxiliary variable representation, Model II or Model III (set out in Table 4.3), is used. Algorithm 8 determines the optimal moments of q ∗ (σ 2 ), q ∗ (a), q ∗ (b) and q ∗ (c) under Models II and III set out in Table 4.3 and product restriction (4.3). The lower bound for Algorithm 8 is given by (4.12). log p(x; q, BASE) + n log(λ) +n(λ + 21 ) log(2) + n log{Γ(λ + 21 )} for Model II √ P n 1 + i=1 [Gi (µq(bi ) + 2 ) + log{D−2λ−1 ( 2Gi )}]. log p(x; q) = log p(x; q, BASE) + n log(π) 2 +n log(λ) − Pn { 1 log(µ for Model III q(ci ) ) i=1 2 +(λ + 1) log(µq(1/bi ) + 1)}. (4.12) where log p(x; q, BASE) is identical to that specified by (4.7) in the previous Horseshoe 4.3. NORMAL-EXPONENTIAL-GAMMA DISTRIBUTION 123 section. It should be stated that the final term in the lower bound (4.12) for Model II in the NEG case is numerically unstable. 
This instability was not an issue for the Horseshoe prior as we were able to incorporate the stable quantity Q(x) into the expression for the lower bound (see (4.7)). We are unable to rearrange the lower bound to achieve the same stability for the NEG model. In practice, we use a high fixed number of iterations to ensure convergence. As in the Horseshoe case, MFVB inference is far simpler for Model III as Algorithm 8 requires only algebraic steps for each iteration. For the NEG algorithm, Model II requires repeated evaluation of the ratio Rν (x) = D−ν−2 (x) , D−ν−1 (x) ν > 0, x > 0. This ratio must be computed with care as again underflow problems persist for large arguments. We first express Rν (x) in continued fraction form, using Cuyt et al. (2008). This allows use of Lentz’s Algorithm as was the case for Q(x) in Horseshoe Model II. As set out in Result 1.3, Rν (x) can be written as: 1 Rν (x) = . ν+2 x+ x+ ν+3 x+ ν+4 x + ... The procedure for stable computation of Rν (x) is presented in Algorithm 9. Specifically, Algorithm 9 describes the steps required to compute Rν (x) for varying arguments. For x ≤ 0.2 and λ ≤ 40, direct computation is carried out using R code explained in Definition 1.5.7. Otherwise, Lentz’s algorithm is used. Now we have presented the relevant algorithms to carry out MFVB inference for the NEG prior, we proceed to firstly compare Models II and III via a simulation study, and secondly look into the theory behind the relationship between the two models. 4.3.2 Simulation comparison of Models II and III The NEG simulation study to compare Models II and III was set up similarly to the Horseshoe study set out in Section 4.2.3. We generated 500 data sets of size n = 100 and n = 1000 for xi ∼ NEG(0, 1, λ), 1 ≤ i ≤ n. 4.3. NORMAL-EXPONENTIAL-GAMMA DISTRIBUTION 124 Inputs (with defaults): x ≥ 0, λ > 0, ε1 (10−30 ), ε2 (10−7 ), If (ν > 20) or (x > 0.2) then (use Lentz’s Algorithm) fprev ← ε1 ; Cprev ← ε2 ; Dprev ← 0; ∆ = 2 + ε2 ; j ← 1 cycle while |∆ − 1| ≥ ε2 : j ←j+1 Dcurr ← x + (ν + j)Dprev Ccurr ← x + (ν + j)/Cprev Dcurr ← 1/Dcurr ∆ ← Ccurr Dcurr fcurr ← fprev ∆ fprev ← fcurr ; Cprev ← Ccurr ; Dprev ← Dcurr return 1/(x + fcurr ) Otherwise (use direct computation) return D−ν−2 (x)/D−ν−1 (x). Algorithm 9: Algorithm for stable and efficient computation of Rν (x). An extra level of complexity is present in the NEG case: the existence of the shape parameter λ. The simulation study was carried out for the values λ ∈ {0.1, 0.2, 0.4, 0.8, 1.6}. A summary of the accuracy of MFVB inference for Models II and III is presented in Figure 4.7. Figure 4.7 illustrates that the accuracy of MFVB inference resulting from Model II is much higher than that of Model III for both n = 100 and n = 1000. Figure 4.8 shows MFVB approximate posteriors for σ 2 versus the accurate MCMC posterior for four replications within the simulation study. As was the case for the Horseshoe, the purple Model II approximate posteriors have similar centres to the orange MCMC ones. Again MFVB for Model II results in a lower spread than the MCMC analogue. The Model II densities also cover the true parameter vale of one with reasonable mass. In contrast, the blue Model III densities are shifted far to the right. This again supports the poor performance of MFVB inference resulting from Model III and identifies Model II as the superior choice in both the Horseshoe and NEG cases. 125 4.3. 
NORMAL-EXPONENTIAL-GAMMA DISTRIBUTION n=100 Model III n=100 Model II 80 ● ● ● ● ● ● 60 ● ● ● ● ● ● ● ● accuracy ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 20 ● ● ● ● ● ● ● ● ● ● ● ● 40 ● ● ● ● ● ● n=1000 Model III n=1000 Model II 80 ● ● ● ● ● 60 ● ● ● ● 40 ● ● ● ● ● 20 ● ● ● ● ● ● ● ● ● ● ● ● ● 0.1 0.2 0.4 0.8 1.6 0.1 0.2 0.4 0.8 1.6 λ 8 6 4 2 0 approx. posterior density 8 2 4 6 MCMC MFVB Model II MFVB Model III 0 approx. posterior density Figure 4.7: Side-by-side boxplots of accuracy values for the NEG simulation study described in Section 4.3.2. 1 2 3 4 5 6 7 1 2 3 4 5 6 7 4 5 6 7 8 6 4 2 0 2 4 6 8 approx. posterior density σ2 0 approx. posterior density σ2 1 2 3 4 σ2 5 6 7 1 2 3 σ2 Figure 4.8: Comparison of pMCMC (σ 2 |x) and two q ∗ (σ 2 ) densities based on Model II and Model III MFVB for four replications from the NEG simulation study with n = 1000. 126 1.25 4.3. NORMAL-EXPONENTIAL-GAMMA DISTRIBUTION 1.15 1.10 1.00 1.05 II gIII λ (x)/gλ(x) 1.20 λ=0.1 λ=0.2 λ=0.4 λ=0.8 λ=1.6 0.0 0.2 0.4 0.6 0.8 1.0 x Figure 4.9: Plot of g III (x)/g II (x) for the functions g III and g II defined by (4.13). 4.3.3 Theoretical comparison of Models II and III The major difference between NEG Models II and III is the form of µq(bi ) . As in the Horseshoe case, to allow comparison between Models II and III, we find an alternative expression for µq(bi ) under Model III. µq(bi ) where g II (Gi ) for Model II = g III (G ) for Model III i √ (2λ + 1)R2λ ( 2x) √ g (x) = 2x II r and g (x) = III 2λ + 1 1 1 + − . 2x 4 2 (4.13) We obtained the expression for g III (x) via reduction of the updates for µq(ci ) , µq(bi ) and µq(1/bi ) under Models II and III in Algorithm 8. The idea is that Model III uses the above g III (x) as an approximation to √ (2λ + 1)R2λ ( 2x) √ . g (x) = 2x II Figure 4.9 illustrates the adequacy of g III (x) as an approximation of g II (x) for varying values of the argument x and the shape parameter λ. As the value of x increases, the ratio g III (x)/g II (x) gets closer to 1 for all values of λ in the grid. Small values of x present the most marked difference between g II (x) and g III (x). This translates to poor performance of Model III in practice. 127 4.3. NORMAL-EXPONENTIAL-GAMMA DISTRIBUTION Next, we address the underlying reason behind the poor performance of Model III for the NEG case. We again examine the behaviour of random variables x, b and c created under a simplified NEG model: b|c ∼ Inverse-Gamma(1, c), x|b ∼ N (0, 1/b), c ∼ Gamma(λ, 1). (4.14) The correlation Corr{log(b), log(c)|x = x0 } is more complex for the NEG case compared with the Horseshoe due to the added shape parameter λ. 
Figure 4.10 shows MCMC λ=0.4, x0=1 corr.= 0.699 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 ● ● ● ● ● −6 ● ● ● ● ● ● −4 ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● −8 ● ● ● ● ● ● ● ● ● ● −8 ● ● ● ● ● ● −4 −2 0 log(b) 2 λ=0.1, x0=3 corr.= 0.86 ● −8 ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ●●● ● ● ● ● ● ●● ● ●● ● ● ● ● ●●●●● ● ● ● ● ●●● ● ● ●●● ● ● ● ● ●● ● ●● ● ● ● ●● ● ●● ● ● ● ● ● ● ●● ●● ● ●●●● ● ● ●● ● ● ●● ● ●● ● ● ● ● ●● ●● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ●● ● ●●●●● ● ● ● ●● ● ●●● ● ● ●● ●● ● ● ●● ● ●● ●● ●● ● ● ●●●● ● ●● ● ●●● ● ● ● ● ● ● ● ●● ●● ● ● ● ●● ●●●● ●● ●● ● ● ●● ● ● ●●● ● ● ● ● ●● ● ● ●●● ● ● ●● ●●● ● ●●● ●● ● ● ●● ● ● ● ● ●● ●● ●● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ●● ● ● ●●● ●● ● ● ●● ● ● ● ●●● ● ●● ● ● ●●● ●● ● ●● ●● ●● ● ●● ● ● ● ● ●●●● ●●● ● ● ●● ● ● ● ● ●● ●●● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ●● ● ● ●● ● ●● ● ● ● ● ●●● ● ●● ● ● ●● ● ●● ● ● ● ● ●●●● ●●● ● ● ● ● ●● ● ● ● ● ●● ● ● ●● ●● ● ●●● ● ● ● ● ● ● ●●● ● ● ●●● ● ● ● ● ●●● ●●●● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● ●●● ● ●●●● ●●●●● ● ●● ●●● ● ●●● ●●●●●● ●● ●● ● ●●● ●● ● ●● ● ● ● ●●● ● ● ●● ● ● ●● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●●● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ●● ● ●●● ● ● ● ● ● ● ● ●● ● ●● ● ● ●●●● ● ●● ● ●● ● ● ● ●● ● ● ●●● ●●● ● ● ● ●● ● ●● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ●●● ● ● ● ● ● ●● ● ●● ● ● ●● ●●● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ●●●● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ●●● ● ●● ● ●● ● ● ● ●●● ●● ● ● ● ● ●● ●● ●● ● ● ●● ● ●● ●● ● ●● ● ●●● ● ●● ● ● ●● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● −10 ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −2 −5 ● ● ●● ● ● ● ● ● 0 ● ● ● ● ● ●● ●● ● ● ●●● ● ● ● ●● ●● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ●● ● ●● ● ●● ● ● ● ● ● ●●●● ●●● ●● ●●● ●● ● ●● ●● ● ● ●● ●● ●● ● ●● ●●● ● ●●● ●● ●●● ● ● ●● ●● ● ●● ● ● ● ● ● ●●●● ● ●● ● ● ●●● ● ●●●● ● ●● ● ●● ● ● ●● ●● ● ● ●●●●●● ● ●●● ●● ●●●● ●● ● ●● ● ● ●● ● ● ● ● ●●●● ●●● ● ●● ● ●● ● ●● ● ● ●●● ●● ●● ● ●● ●●● ● ●●● ● ● ● ● ● ● ● ●●● ● ●● ●● ● ● ● ●● ● ●● ● ● ●●● ● ●● ● ● ●● ● ● ● ● ●● ●●●● ●● ●● ● ●●● ● ● ●● ●● ●● ● ● ●● ● ●●● ●● ●● ● ●● ●● ● ● ●● ●●● ●● ●●● ●● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● ●● ● ●● ● ● ●● ●● ●● ●●●● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ●● ●●● ●● ● ● ●● ● ● ● ● ● ● ●● ● ●● ●● ● ● ●● ● ●● ●● ● ●● ● ●● ● ●● ●● ●●● ● ● ● ● ●●● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ●● ● ● ●●●● ● ●● ● ● ●● ● ●●● ● ● ●● ● ● ● ● ● ●● ●● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ●● ●● ●● ●● ●● ●● ● ● ●● ●● ●● ● ●●●●●● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●●● ● ●● ● ●●● ● ● ●●●● ● ● ●●● ● ● ● ● ●●●●● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ●●● ●● ●● ● ● ●● ● ● ● ● ●●●●●●●● ● ●●●●●● ●●● ● ● ● ● ● ●● ●● ●● ●●● ● ●●● ● ●●● ●● ●● ●●● ●● ● ●●● ● ● ●● ●● ● ● ●● ● ● ●● ● ● ● ●●● ● ●● ● ●●●●● ● ● ● ● ● ●● ●● ● ●●● ●● ● ● ● ●● ● ●●●●● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ●● ● ●● ● ●● ● ● ● ● ● ●● ● ●●●● ●● ●●● ● ●● ● ●● ● ● ●● ●●●● ●● ●●● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ●●● ● ● ●● ●● ● ● ● ● ●● ● ● ●● ●● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ●●● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ●●● ●● ● ●● ● ● ● ● ● ● ● ●●● ● ● ● ●● ● ● ● ● ● ●● ●● ● ●● ●● ● ● ● ●● ● ● ●● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −4 log(b) −10 ● ● −15 ● ● ● ● ●● ● ●● ● ●● ● ● ● ● ● ● ● log(c) −5 ● ●● −6 
λ=0.05, x0=4 corr.= 0.877 0 −6 0 ● ●● ● ● ● ● ● log(c) ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −2 ● ● ● ●●● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ●● ● ●● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●●● ●● ●● ● ● ●● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ●● ● ●●● ● ●● ● ● ●● ●● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ●●● ● ●●●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ●●● ●● ● ● ● ●●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●●● ● ● ● ●● ●● ● ● ● ●●●● ● ● ●● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ●● ● ●●● ●● ● ● ● ● ●● ● ●●●● ● ● ● ● ●●● ●● ● ● ● ●● ●● ● ● ● ●● ● ●● ● ●●● ● ●● ●●●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●●● ●● ● ● ●● ● ● ● ●● ● ● ● ●●● ● ● ●● ● ● ●●●● ● ● ●● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ●● ● ● ●● ● ● ● ●● ● ● ● ● ●● ●● ● ● ● ● ● ●●● ● ● ●● ● ●● ● ●● ●● ●● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●●●●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ●●● ● ● ● ● ●●● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ●● ● ●● ●● ● ● ● ● ● ●● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ●● ● ●● ● ● ● ● ●● ●● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ●●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ●● ● −6 ●● ● ● ● ● ● ● log(c) 0 −2 −4 log(c) ● ● ● ● ● ● λ=0.2, x0=2 corr.= 0.747 ● ● ●● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ●● ●●●● ● ● ● ●● ● ●● ● ● ● ● ●● ● ● ● ● ● ●●●●● ● ● ●● ● ● ● ● ●● ●● ● ● ●● ●● ●● ●● ●● ● ● ●● ● ● ● ● ● ● ●● ● ● ●● ●● ●● ●● ●●●● ● ● ● ●● ● ● ●●● ● ● ● ●● ● ● ● ● ●● ● ●● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ●●●●● ● ● ● ●● ●● ● ● ● ● ● ●● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ●● ● ● ●● ● ● ●● ●● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●●● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ●● ●● ● ● ● ●● ● ● ● ● ● ●● ● ●● ● ● ●● ● ● ● ●●● ●● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ●●● ● ●● ● ●●● ●● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ●●●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ●● ● ● ●● ● ● ●● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ●● ● ● ●● ● ●● ● ● ●●●● ●● ● ● ● ●●● ●● ●● ● ●●● ● ●●●●●● ●● ●●● ● ● ● ● ● ●● ● ● ●● ● ● ● ●● ● ● ● ● ●●● ●●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ●● ●● ● ● ● ●● ●● ● ● ● ●● ● ●● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ●● ● ● ●● ● ●● ● ●● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ●● ●● ● ●● ● ● ● ● ● ● ●● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ●●● ● ● ● ●● ●● ● ●● ● ● ●● ● ● ● ● ●● ● ●● ●● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●●● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● −20 ● ● ● −12 −10 −8 −6 log(b) −4 −2 0 ● ● −20 −15 −10 
log(b) −5 0 Figure 4.10: MCMC samples (n = 1000) from the distribution {log(b), log(c)|x = x0 } for λ = (0.05, 0.1, 0.2, 0.4) and x0 = (1, 2, 3, 4) where the data is generated according to (4.14). Sample correlations are also shown. samples of {log(b), log(c)|x = x0 } for λ = (0.05, 0.1, 0.2, 0.4) and x0 = (1, 2, 3, 4). The figure shows that high sample correlations correspond to low values of λ. In Appendix 4.D, we derive integral expressions for the expectations that make up the correlation. We then compute these integrals in R in order to examine the relationship between the correlation, the shape parameter λ and the data x0 . Figure 4.11 illustrates the impact of both λ and x on the correlation. We can see that the correlation gets closer to 128 4.4. GENERALIZED-DOUBLE-PARETO DISTRIBUTION 1.0 1 for smaller values of λ, no matter what the value of x is. Although not as elegant as 0.7 0.6 0.3 0.4 0.5 correlation 0.8 0.9 x0 = 2 x0 = 1 x0 = 0.5 x0 = 0.2 x0 = 0.1 0.0 0.5 1.0 1.5 2.0 λ Figure 4.11: Illustration of the behaviour of Corr{log(b), log(c)|x = x0 } under NEG Model III for varying values of λ and x0 ∈ {0.1, 0.2, 0.5, 1, 2} correspoding to the colours in the legend. Theorem 4.2.1 in the Horseshoe section, we have shown numerically that in the NEG case, high correlation exists between b and c. The existence of an analogue for Thoerem 4.2.1 in the NEG case remains an interesting open question. The correlation between b and c is directly at odds with the MFVB product assumption that q(b, c) = q(b)q(c). The disparity between the existing posterior dependence and the assumption of independence explains the poor performance of Model III MFVB inference in the NEG case. 4.4 Generalized-Double-Pareto distribution The third and final case we consider is a univariate random sample drawn from the GDP distribution, i.e. ind. xi |σ ∼ GDP(0, σ, λ), σ ∼ Half-Cauchy(A). (4.15) Via introductio
© Copyright 2026 Paperzz