Elaborate Distribution Semiparametric
Regression via Mean Field Variational Bayes
A thesis submitted in fulfilment of the
requirements for the award of the degree
Doctor of Philosophy
from
University of Wollongong
by
Sarah Elizabeth Neville
B Math (Advanced, Honours Class I), University of Wollongong
School of Mathematics and Applied Statistics
2013
CERTIFICATION
I, Sarah Elizabeth Neville, declare that this thesis, submitted in fulfilment of the requirements for the award of Doctor of Philosophy, in the School of Mathematics and Applied
Statistics, University of Wollongong, is wholly my own work unless otherwise referenced
or acknowledged. The document has not been submitted for qualifications at any other
academic institution.
Sarah Elizabeth Neville
12 October, 2013
Abstract
Mean field variational Bayes (MFVB) is a fast, deterministic inference tool for use in
Bayesian hierarchical models. We develop and examine the performance of MFVB algorithms in semiparametric regression applications involving elaborate distributions. We
assess the accuracy of MFVB in these settings via comparison with a Markov chain Monte
Carlo (MCMC) baseline. MFVB methodology for Generalized Extreme Value additive
models performs well, culminating in fast, accurate analysis of the Sydney hinterland
maximum rainfall data. Quantile regression based on the Asymmetric Laplace distribution provides another area for successful application of MFVB. Examination of MFVB algorithms for continuous sparse signal shrinkage in univariate models illustrates the danger of naïve application of MFVB. This leads to development of a new tool to add to the MFVB armoury: continued fraction approximation of special functions using Lentz's
Algorithm. MFVB performs well in both simple and more complex penalized wavelet
regression models, illustrated by analysis of the radiation pneumonitis data. Overall,
MFVB is a viable inference tool for semiparametric regression involving elaborate distributions. Generally, MFVB is good at retrieving trend estimates, but underestimates
variability. MFVB is best used in applications where analysis is constrained by computational time and/or storage.
Dedication
This thesis is dedicated to my parents, Robert F. and Susan G. Neville. Your support, generosity and love over the past three years in particular have been phenomenal. I'm so
lucky to have you!
Acknowledgements
Many thanks go to my supervisor Matt for being both a professional and personal mentor. Your enthusiasm, encouragement and humility have helped me navigate the treacherous journey of writing a thesis, and made me the researcher I am on the other side. To
Maureen, my unofficial co-supervisor and great friend, you have been a fantastic support at my home base of the University of Wollongong. Our coffees and office chats have
regularly been the highlight of my day. To my best friend Lauren, your presence has
made this journey one of laughter, and has helped me get through the tough stages. Our
conversations over countless dinners, dog walks and trips helped me keep it all in perspective (well, most of the time). Amy, my sister, has been an encouraging force, always
having faith in me even when my own would falter. Thanks for being there for me Boops!
To my brother Bobby, wherever you are, I hope you’re proud of your little sister and that
I can show this to you one of these days. Finally to Marley, Possum, Henry and Bronte.
You have been the calm, warm and fun presence I needed over the course of my thesis
writing.
Contents

1 Introduction
  1.1 Literature review
    1.1.1 Variational approximations meet Bayesian models
    1.1.2 Variational approximations in computer science
    1.1.3 Variational approximations emerging into the statistical literature
    1.1.4 Mean field variational Bayes
  1.2 Notation
    1.2.1 Vector and matrix notation
    1.2.2 Distributional notation
  1.3 Basics of mean field variational Bayes
  1.4 Graph theory
    1.4.1 Directed acyclic graphs
    1.4.2 Moral graphs
  1.5 Definitions and results
    1.5.1 Non-analytic integral families
    1.5.2 Special function definitions
    1.5.3 Additional function definitions and continued fraction representations
    1.5.4 Distributional definitions and results
    1.5.5 Matrix results
  1.6 Accuracy measure
  1.7 Overview of the thesis

2 Mean field variational Bayes inference for Generalised Extreme Value regression models
  2.1 Introduction
  2.2 Direct mean field variational Bayes
  2.3 Auxiliary mixture sampling approach
  2.4 Structured mean field variational Bayes
  2.5 Finite normal mixture response regression
    2.5.1 Model
    2.5.2 Mean field variational Bayes
  2.6 Generalized Extreme Value additive model
    2.6.1 Model
    2.6.2 Mean field variational Bayes for the finite normal mixture response additive model
    2.6.3 Structured mean field variational Bayes
  2.7 Displaying additive model fits
  2.8 Geoadditive extension
  2.9 New South Wales maximum rainfall data analysis
  2.10 Comparisons with Markov chain Monte Carlo
  2.11 Discussion
  2.A Derivation of Algorithm 1 and lower bound (2.11)
    2.A.1 Full conditionals
    2.A.2 Optimal q* densities
    2.A.3 Derivation of lower bound (2.11)
  2.B Derivation of Algorithm 2 and lower bound (2.17)
    2.B.1 Full conditionals
    2.B.2 Optimal q* densities
    2.B.3 Derivation of lower bound (2.17)

3 Mean field variational Bayes for quantile regression
  3.1 Introduction
  3.2 Parametric regression case
    3.2.1 Model
    3.2.2 Mean field variational Bayes
  3.3 Semiparametric regression case
    3.3.1 Model
    3.3.2 Mean field variational Bayes
  3.4 Results
    3.4.1 Comparisons with Markov chain Monte Carlo
    3.4.2 Accuracy study
  3.5 Discussion
  3.A Derivation of Algorithm 4 and lower bound (3.4)
    3.A.1 Full conditionals
    3.A.2 Optimal q* densities
    3.A.3 Derivation of lower bound (3.4)
  3.B Derivation of Algorithm 5 and lower bound (3.9)
    3.B.1 Full conditionals
    3.B.2 Optimal q* densities
    3.B.3 Derivation of lower bound (3.9)

4 Mean field variational Bayes for continuous sparse signal shrinkage
  4.1 Introduction
  4.2 Horseshoe distribution
    4.2.1 Mean field variational Bayes
    4.2.2 Simplicity comparison of Models II and III
    4.2.3 Simulation comparison of Models II and III
    4.2.4 Theoretical comparison of Models II and III
  4.3 Normal-Exponential-Gamma distribution
    4.3.1 Mean field variational Bayes
    4.3.2 Simulation comparison of Models II and III
    4.3.3 Theoretical comparison of Models II and III
  4.4 Generalized-Double-Pareto distribution
    4.4.1 Mean field variational Bayes
    4.4.2 Simulation study
  4.5 Discussion
  4.A Derivation of Algorithm 6 and lower bound (4.7)
    4.A.1 Full conditionals
    4.A.2 Optimal q* Densities
    4.A.3 Derivation of lower bound (4.7)
  4.B Proof of Theorem 4.2.1
  4.C Derivation of Algorithm 8 and lower bound (4.12)
    4.C.1 Full conditionals
    4.C.2 Optimal q* Densities
    4.C.3 Derivation of lower bound (4.12)
  4.D Normal-Exponential-Gamma correlation
  4.E Derivation of Algorithm 10 and lower bound (4.16)
    4.E.1 Full conditionals
    4.E.2 Optimal q* Densities
    4.E.3 Derivation of lower bound (4.16)

5 Mean field variational Bayes for penalised wavelet regression
  5.1 Introduction
  5.2 Model
    5.2.1 Horseshoe prior
    5.2.2 Normal-Exponential-Gamma prior
    5.2.3 Laplace-Zero prior
  5.3 Mean field variational Bayes
    5.3.1 Horseshoe prior
    5.3.2 Normal-Exponential-Gamma prior
    5.3.3 Laplace-Zero prior
  5.4 Displaying Laplace-Zero model fits
  5.5 Results
    5.5.1 Horseshoe prior
    5.5.2 Normal-Exponential-Gamma prior
    5.5.3 Laplace-Zero prior
    5.5.4 Simulation study
  5.6 Discussion
  5.A Derivation of Algorithm 11 and lower bound (5.8)
    5.A.1 Full conditionals
    5.A.2 Optimal q* densities
    5.A.3 Derivation of lower bound (5.8)

6 Mean field variational Bayes for wavelet-based longitudinal data analysis
  6.1 Introduction
    6.1.1 Radiation pneumonitis study
    6.1.2 Chapter outline
  6.2 Model
    6.2.1 Horseshoe prior
    6.2.2 Laplace-Zero prior
  6.3 Mean field variational Bayes
    6.3.1 Horseshoe prior
    6.3.2 Laplace-Zero prior
  6.4 Radiation pneumonitis study results
    6.4.1 Horseshoe prior
    6.4.2 Laplace-Zero prior
  6.5 Accuracy study
    6.5.1 Horseshoe prior
    6.5.2 Laplace-Zero prior
  6.6 Discussion
  6.A Derivation of Algorithm 12 and lower bound (6.6)
    6.A.1 Full conditionals
    6.A.2 Optimal q* densities
    6.A.3 Derivation of lower bound (6.6)
  6.B Derivation of Algorithm 13 and lower bound (6.7)
    6.B.1 Full conditionals
    6.B.2 Optimal q* densities
    6.B.3 Derivation of lower bound (6.7)

7 Conclusion
List of Figures

1.1 A simple example of a directed acyclic graph, containing nodes a, b and c.
1.2 Illustration of the impact of product restriction (1.7) on the directed acyclic graph for Model (1.6).
2.1 A normal mixture approximation to the GEV density with ξ = 1.
2.2 Directed acyclic graph for Model (2.8).
2.3 Directed acyclic graph representation of Model (2.13).
2.4 Directed acyclic graph representation of Model (2.15).
2.5 Annual winter maximum rainfall at 50 weather stations in the Sydney, Australia, hinterland.
2.6 MFVB univariate functional fits in the GEV additive model (2.20) for the Sydney hinterland rainfall data.
2.7 MFVB bivariate functional fit for geographical location in the GEV additive model (2.20) for the Sydney hinterland rainfall data.
2.8 The prior and MFVB approximate posterior probability mass functions for the GEV shape parameter ξ in the GEV additive model (2.20) for the Sydney hinterland maximum rainfall data.
2.9 Accuracy comparison between MFVB and MCMC for a single predictor model (d = 1).
3.1 Directed acyclic graph for Model (3.2).
3.2 Directed acyclic graph for Model (3.6).
3.3 Successive values of lower bound (3.9) to monitor convergence of MFVB Algorithm 5.
3.4 Quantile estimates (solid) and pointwise 95% credible intervals (dotted) for MFVB fitting of (3.10) via Algorithm 5.
3.5 Median (τ = 0.5) estimates and pointwise 95% credible intervals for MFVB (red) and MCMC (blue) fitting of (3.10).
3.6 MFVB (blue) and MCMC (orange) approximate posterior densities for the estimated median ŷ at the quartiles of the x_i's under Model (3.6).
3.7 Boxplots of accuracy measurements for ŷ(Q2) for the accuracy study described in the text.
4.1 Standard (µ = 0, σ = 1) continuous sparseness inducing density functions.
4.2 Directed acyclic graphs corresponding to the three models listed in Table 4.1.
4.3 The number of iterations required for Lentz's Algorithm to converge when used to approximate Q(x).
4.4 Comparison of p_MCMC(σ²|x) and two q*(σ²) densities based on Model II and Model III MFVB for four replications from the simulation study corresponding to Table 4.2 with n = 1000.
4.5 Plot of g_III(x)/g_II(x) for the functions g_III and g_II defined by (4.8) and (4.9) respectively.
4.6 MCMC samples (n = 1000) from the distribution {log(1/b), log(c)|x = x0} for x0 = (1, 0.1, 0.01, 0.001) where the data is generated according to (4.10).
4.7 Side-by-side boxplots of accuracy values for the NEG simulation study described in Section 4.3.2.
4.8 Comparison of p_MCMC(σ²|x) and two q*(σ²) densities based on Model II and Model III MFVB for four replications from the NEG simulation study with n = 1000.
4.9 Plot of g_III(x)/g_II(x) for the functions g_III and g_II defined by (4.13).
4.10 MCMC samples (n = 1000) from the distribution {log(b), log(c)|x = x0} for λ = (0.05, 0.1, 0.2, 0.4) and x0 = (1, 2, 3, 4) where the data is generated according to (4.14).
4.11 Illustration of the behaviour of Corr{log(b), log(c)|x = x0} under NEG Model III for varying values of λ and x0.
4.12 Side-by-side boxplots of accuracy values for the GDP simulation study described in Section 4.4.2.
5.1 Directed acyclic graph for Models (5.2) and (5.4).
5.2 Fitted function estimates (solid) and pointwise 95% credible sets (dotted) for both MFVB (red) and MCMC (blue) approaches under Model (5.2).
5.3 MFVB (blue) and MCMC (orange) approximate posterior densities for (a) σ²ε and (b) ŷ(Q2) under Model (5.2).
5.4 Fitted function estimates (solid) and pointwise 95% credible sets (dotted) for both MFVB (red) and MCMC (blue) approaches under Model (5.4).
5.5 MFVB (blue) and MCMC (orange) approximate posterior densities for (a) σ²ε and (b) ŷ(Q2) under Model (5.4).
5.6 Fitted function estimates (solid) and pointwise 95% credible sets (dotted) for both MFVB (red) and MCMC (blue) approaches under Model (5.6).
5.7 MFVB (blue) and MCMC (orange) approximate posterior densities for (a) σ²ε and (b) ŷ(Q2) under Model (5.6).
5.8 Fitted MFVB function estimates (solid) and pointwise 95% credible sets (dotted) for the Horseshoe (red), NEG (blue) and Laplace-Zero (green) models.
5.9 Boxplots of root mean squared error of f̂_MFVB for the simulation study described in the text.
6.1 Raw data from the radiation pneumonitis study. Each panel corresponds to a subject in the study, with radiation dose (J/kg) plotted against the logarithm of fluorodeoxyglucose (FDG) uptake.
6.2 Directed acyclic graph for Model (6.2).
6.3 Directed acyclic graph for Model (6.4).
6.4 MFVB fit (blue) with pointwise 95% credible sets with raw data (red) for all 21 subjects under Model (6.2).
6.5 Additional plots for the MFVB fit of Model (6.2).
6.6 MFVB fit (blue) with pointwise 95% credible sets with raw data (red) for all 21 subjects under Model (6.4).
6.7 Additional plots for the MFVB fit of Model (6.4).
6.8 MFVB (blue) and MCMC (orange) approximate posterior densities for the fit under Model (6.2) at the quartiles for subject 8.
6.9 MFVB (blue) and MCMC (orange) approximate posterior densities for the fit under Model (6.4) at the quartiles for subject 8.
Chapter 1
Introduction
1.1 Literature review
Mean field variational Bayes (MFVB) is a fast, deterministic alternative to Markov chain
Monte Carlo (MCMC) for inference in a hierarchical Bayesian model setting. This literature review will firstly explain the origins of variational approximation, and the natural
connection with statistical inference. Secondly, the use of variational approximations in
computer science will be discussed. Thirdly, the emergence of variational approximations into the statistical literature will be presented. Finally, the use of MFVB in specific
statistical models will be summarized, identifying further work to be done in the field.
1.1.1 Variational approximations meet Bayesian models
Variational calculus has been part of the mathematical consciousness since the 18th century. Essentially, the calculus of variations involves optimizing a functional over a given
class of functions. Variational approximations arise when the class of functions we are
optimizing over is restricted in some way (Ormerod and Wand, 2010).
Bayesian inference is centred around the posterior distribution of the parameters in a
model given the observed data. The mathematics behind derivation of this posterior distribution is often intractable, even given fairly simple models. As hierarchical Bayesian
models become increasingly complex, and data sets become larger, traditional Bayesian
inference tools such as MCMC are becoming infeasible due to time constraints. A faster
inference tool is required.
One solution is to bring the concepts of variational approximation and Bayesian inference together to provide a deterministic alternative to the stochastic MCMC. This involves the approximation of intractable integrals that arise when deriving the posterior distribution of model parameters.
1.1.2 Variational approximations in computer science
Computer scientists have been exploiting variational approximations in areas such as
machine learning for some time. Bishop (2006) includes an entire chapter on approximate inference. Variational inference is presented as a deterministic alternative to the
stochastic MCMC, arising from analytical approximations to the posterior distribution.
The various types of variational approximation are presented, including a section on factorised approximations, which we refer to as MFVB.
Specific applications explored so far in the computer science literature include neural networks, hidden Markov models and information retrieval (Jordan, Ghahramani,
Jaakkola and Saul, 1999; Jordan, 2004).
The fact that Bishop (2006) includes such detailed descriptions of a scheme such as
MFVB illustrates that variational approximations are widely accepted in the computer
science field. However, lacking in the computer science literature is an analysis of the
accuracy of variational approximations. Bishop (2006) states that variational inference
is approximate, and proceeds to describe the methods. Minimal attention is given to
the accuracy of variational inference. This shortcoming provides the statistical community with an opportunity to further contribute to the investigation of variational methods
in ways that the computer science literature has not. We have the ability to quantitatively assess the performance of variational inference against the stochastic alternative
(MCMC). Assessment of the performance of variational approximations has begun in the
statistical literature. For example, Wang and Titterington (2005) look into the properties of covariance matrices arising from variational approximations. In addition, Wang
and Titterington (2006) investigate the convergence properties of a variational Bayesian
algorithm for computing estimates for a normal mixture model.
1.1.3 Variational approximations emerging into the statistical literature
The connection between variational approximations and statistical inference has only
started to be explored extensively in the past decade. Investigation of variational approximations in the statistical literature has two advantages:
1. to make variational approximations a faster, real alternative to more prominent inference tools such as MCMC and Laplace approximations; and
2. to assess the accuracy of variational approximations in different statistical models.
There are many types of variational approximations that lend themselves to different
statistical models. In an attempt to bring variational approximations into the broader
statistical consciousness, Ormerod and Wand (2010) outlined the different varieties of
variational approximations available, and how they may be used in familiar statistical
settings. The major focus of this thesis is product density restrictions, which we refer to
as MFVB, but are also known as variational message passing, mean field approximation
and product density restriction.
1.1.4 Mean field variational Bayes
In 1999, Attias coined the term variational Bayes for the specific case of variational approximation where the posterior is approximated by a product density restriction. This form of
variational approximation originated in statistical physics, where it is known as mean field
approximation (Parisi, 1988). Borrowing terms from both areas, we term variational inference through product density restrictions mean field variational Bayes. The past decade has
seen the concept of variational approximations, and in particular MFVB, appear in statistics journals. Titterington (2004) highlighted the role of MFVB as a possible inference
tool for large scale data analysis problems in the context of neural networks. Teschendorff, Wang, Barbosa-Morais, Brenton and Caldas (2005) looked at MFVB in the context of cluster analysis for gene expression data. In 2008 Infer.NET (Minka, Winn, Guiver and
Kannan, 2009), a software package for MFVB, was released. This made MFVB a more
accessible inference tool for hierarchical Bayesian models.
The past decade has seen MFVB explored as an inference tool in many statistical settings. These include, but are not limited to: hidden Markov models (McGrory and Titterington, 2009); model selection in finite mixture distributions (McGrory and Titterington, 2007); principal component analysis (Smidl and Quinn, 2007); and political science
research (Grimmer, 2011). 2010/11 saw the first appearance of MFVB papers in the Journal of the American Statistical Association (Braun and McAuliffe, 2010; Faes, Ormerod and
Wand, 2011), illustrating the quality and exposure of current research in the field.
There remains a plethora of statistical settings that have not been explored in the
context of MFVB. This forms the basis of the thesis. We have identified both areas of the
literature that would benefit from extension of the current MFVB methodology, and areas
that have not yet been explored in the context of variational approximations. The former
include modelling of sample extremes and continuous sparse signal shrinkage. The latter include Bayesian quantile and wavelet regression.
1.2 Notation

1.2.1 Vector and matrix notation
For the vectors $v, w \in \mathbb{R}^p$, defined as
$$v = \begin{bmatrix} v_1 \\ \vdots \\ v_p \end{bmatrix} \quad \text{and} \quad w = \begin{bmatrix} w_1 \\ \vdots \\ w_p \end{bmatrix},$$
the following notation is used throughout the thesis.
The norm of a vector $v$ is denoted by $\|v\| = \sqrt{v^T v}$. The vector $v_{-i}$ represents the vector $v$ with the $i$th component removed. The componentwise product of two vectors is defined by
$$v \odot w = \begin{bmatrix} v_1 w_1 \\ \vdots \\ v_p w_p \end{bmatrix}.$$
If $g: \mathbb{R} \to \mathbb{R}$ is a scalar function, it acts upon a vector $v$ according to
$$g(v) = \begin{bmatrix} g(v_1) \\ \vdots \\ g(v_p) \end{bmatrix}.$$
To create a $p \times p$ diagonal matrix from a vector, using components of the vector as diagonal elements of the matrix, we use the notation
$$\mathrm{diag}(v) = \begin{bmatrix} v_1 & 0 & \cdots & 0 \\ 0 & v_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & v_p \end{bmatrix}.$$
For the symmetric matrix $M_{n \times n}$, we denote the trace and determinant by $\mathrm{tr}(M)$ and $|M|$ respectively, and adopt the usual definitions.
For the matrices $A_{m \times n}$ and $B_{p \times q}$, denoted by
$$A = \begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{m1} & \cdots & a_{mn} \end{bmatrix} \quad \text{and} \quad B = \begin{bmatrix} b_{11} & \cdots & b_{1q} \\ \vdots & \ddots & \vdots \\ b_{p1} & \cdots & b_{pq} \end{bmatrix},$$
we define the following notation. To create an $mn \times 1$ vector from the components of a matrix $A$, we define
$$\mathrm{vec}(A) = \begin{bmatrix} a_{11} \\ \vdots \\ a_{m1} \\ \vdots \\ a_{1n} \\ \vdots \\ a_{mn} \end{bmatrix}.$$
The Kronecker product of two matrices is defined as
$$A \otimes B = \begin{bmatrix} a_{11}B & \cdots & a_{1n}B \\ \vdots & \ddots & \vdots \\ a_{m1}B & \cdots & a_{mn}B \end{bmatrix}.$$

1.2.2 Distributional notation
We adopt the usual notation and definitions for the density function, expected value
and variance of a random variable x, that is p(x), E(x) and Var(x) respectively. The
conditional density of x given y is denoted by p(x|y).
The covariance of random variables $x$ and $y$ is denoted by $\mathrm{Cov}(x, y)$. The correlation between random variables $x$ and $y$ is denoted by $\mathrm{Corr}(x, y)$ and is given by
$$\mathrm{Corr}(x, y) = \frac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Var}(x)}\sqrt{\mathrm{Var}(y)}}.$$
The density function of a random vector v is denoted by p(v). The conditional density
of v given w is denoted by p(v|w).
In a Bayesian model setting, the full conditional of v is the conditional distribution of
v conditioned on all the remaining parameters in the model, and is denoted by p(v|rest).
The expected value and covariance matrix of v are denoted by E(v) and Cov(v) respectively.
If $x_1, \ldots, x_n$ are independent and identically distributed as $D$, we write $x_i \overset{\text{ind.}}{\sim} D$, for $1 \le i \le n$.

We use q to denote density functions that arise from MFVB approximation. Expectation and covariance under the MFVB paradigm are denoted by $E_q(\cdot)$ and $\mathrm{Cov}_q(\cdot)$ respectively. For a generic random scalar variable v and density function q we define
$$\mu_{q(v)} \equiv E_q(v) \quad \text{and} \quad \sigma^2_{q(v)} \equiv \mathrm{Var}_q(v).$$
For a generic random vector v and density function q we define
$$\mu_{q(v)} \equiv E_q(v) \quad \text{and} \quad \Sigma_{q(v)} \equiv \mathrm{Cov}_q(v).$$

1.3 Basics of mean field variational Bayes
Consider a generic Bayesian model, with observed data vector y and parameter vector
θ. We also suppose that θ is continuous over the parameter space Θ. As discussed in
the literature review, the posterior density function p(θ|y) is often intractable, even given
simple models. MFVB overcomes this intractability by postulating that p(θ|y) can be
well-approximated by product density forms, for example
$$p(\theta|y) \approx q_1(\theta_1)\, q_2(\theta_2)\, q_3(\theta_3) \qquad (1.1)$$
where $\{\theta_1, \theta_2, \theta_3\}$ is a partition of $\theta$. The choice of partition is usually made on grounds of tractability. Each $q_i$ is a density function in $\theta_i$ ($i = 1, 2, 3$), and they are chosen to minimise the Kullback-Leibler distance between the left and right hand sides of (1.1):
$$\int q_1(\theta_1)\, q_2(\theta_2)\, q_3(\theta_3) \log\left\{\frac{q_1(\theta_1)\, q_2(\theta_2)\, q_3(\theta_3)}{p(\theta|y)}\right\} d\theta. \qquad (1.2)$$
Minimisation of (1.2) is equivalent to maximisation of
$$p(y; q) \equiv \int q_1(\theta_1)\, q_2(\theta_2)\, q_3(\theta_3) \log\left\{\frac{p(\theta, y)}{q_1(\theta_1)\, q_2(\theta_2)\, q_3(\theta_3)}\right\} d\theta$$
for which an iterative convex optimisation algorithm (e.g. Luenberger and Ye, 2008) exists for obtaining the solution. Algorithm updates can be derived from the expression
$$q^*(\theta_i) \propto \exp\{E_{q(\theta_{-i})} \log p(\theta_i | y, \theta_{-i})\} \qquad (1.3)$$
for $1 \le i \le 3$ (Bishop, 2006). Each iteration results in an increase in $p(y; q)$, and this quantity can be used to assess convergence. Upon convergence, the $q^*(\theta_i)$ densities can be used for approximate Bayesian inference.
We have illustrated the basics of MFVB using a product restriction consisting of three
densities. These concepts can be extended to n densities, in which case the factorisation
would take the form
$$p(\theta|y) \approx \prod_{i=1}^{n} q_i(\theta_i)$$
where $\{\theta_1, \ldots, \theta_n\}$ is a partition of $\theta$.
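To make the iterative scheme concrete, the following R sketch applies updates of the form (1.3) to a simple conjugate model with y_i | µ, σ² ~ N(µ, σ²), µ ~ N(0, σ²_µ) and σ² ~ Inverse-Gamma(A, B), under the restriction q(µ, σ²) = q(µ) q(σ²). This toy model, and the data and hyperparameter values used, are illustrative assumptions added for this overview rather than one of the models analysed in the thesis.

# Illustrative MFVB coordinate ascent (added sketch, not one of the thesis's models):
# y_i | mu, sigsq ~ N(mu, sigsq), mu ~ N(0, sigsq.mu), sigsq ~ Inverse-Gamma(A, B),
# with product restriction q(mu, sigsq) = q(mu) q(sigsq).
set.seed(1)
y <- rnorm(100, mean = 2, sd = 3)
n <- length(y); sigsq.mu <- 1e8; A <- 0.01; B <- 0.01

E.recip.sigsq <- 1                                     # initialise E_q(1/sigsq)
for (iter in 1:100) {
  # q*(mu) is N(mu.q, sigsq.q):
  sigsq.q <- 1/(n*E.recip.sigsq + 1/sigsq.mu)
  mu.q <- sigsq.q*E.recip.sigsq*sum(y)
  # q*(sigsq) is Inverse-Gamma(A + n/2, B.q):
  B.q <- B + 0.5*(sum((y - mu.q)^2) + n*sigsq.q)
  E.recip.sigsq.new <- (A + n/2)/B.q
  if (abs(E.recip.sigsq.new - E.recip.sigsq) < 1e-10) break
  E.recip.sigsq <- E.recip.sigsq.new
}
c(mu.q = mu.q, sigsq.q = sigsq.q)                      # q*(mu) mean and variance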
1.4 Graph theory
As mentioned in the literature review and Section 1.3 above, the central idea of MFVB
involves imposing a product restriction, or factorisation, on the posterior p(θ|y). This
imposed factorisation, once carried out, can lead to a further breakdown of the dependence structure of a model. This is known as induced factorisation (Bishop, 2006). The
extent of induced factorisation depends on both the underlying structure of the model
and the nature of the imposed factorisation. Graph theory allows us to better understand
the nature of both imposed and induced factorisation.
As stated in Ormerod and Wand (2010), a very useful tool in assessing the conditional dependence structure of a hierarchical Bayesian model is its directed acyclic graph
(DAG). These are defined in the following section.
1.4.1 Directed acyclic graphs
Definition 1.4.1 An undirected graph consists of a set of nodes connected by edges.
Definition 1.4.2 A directed graph consists of a set of nodes connected by directed edges (Bishop,
2006).
Definition 1.4.3 A directed acyclic graph (DAG) is a directed graph containing no directed cycles.
Definition 1.4.3 means that there are no closed paths within the graph such that we can
move from node to node along the directed edges and end up back at the starting node
(Bishop, 2006). Figure 1.1 illustrates a simple DAG with nodes a, b and c.

[Figure 1.1: A simple example of a directed acyclic graph, containing nodes a, b and c.]

The link between DAGs and hierarchical Bayesian models is made when we treat the nodes as representing random variables within a model, and the directed edges as conveying the conditional dependence structure of the model (e.g. Ormerod and Wand, 2010).
Definition 1.4.4 Two nodes are co-parents if they share a common child node.
Definition 1.4.5 The Markov blanket of a node is the set of parents, children and co-parents of
the node.
We now define the full conditional distribution in the context of a DAG.
Definition 1.4.6 The full conditional of a node θi is the conditional distribution of θi given all
the remaining variables in the graph, and is denoted by p(θi |rest).
The full conditional of a node is dependent only on the variables in its Markov blanket
(Bishop, 2006). We summarise this property as
$$p(\theta_i | \text{rest}) = p(\theta_i | \text{Markov blanket of } \theta_i). \qquad (1.4)$$
Combining (1.3) and (1.4), we have
$$q^*(\theta_i) \propto \exp\{E_{q(\theta_{-i})} \log p(\theta_i | \text{Markov blanket of } \theta_i)\}. \qquad (1.5)$$
This has significant implications for MFVB. Equation (1.5) tells us that the optimal $q^*$ density for $\theta_i$ will depend only on the nodes in its Markov blanket. This is called the locality property of MFVB. This locality property is important when developing MFVB algorithms for increasingly complex hierarchical models.
1.4.2 Moral graphs
The role of moral graphs in MFVB is twofold, coming into play in both induced and
imposed factorisations of MFVB. The following definitions and theorem are taken from
unpublished notes by M.P. Wand entitled “Graphical Models” (2012).
Definition 1.4.7 A set of nodes in a DAG is called an ancestral set if, for each node in the set, all
of its parents are also in the set.
Definition 1.4.8 Let S be a subset of nodes in a DAG. Then the smallest ancestral set containing
S is the ancestral set, containing all nodes in S, with the fewest number of nodes.
Definition 1.4.9 The moral graph of a DAG is the undirected graph formed by (1) adding an
edge between all pairs of parents of each node, (2) removing all arrows.
We are now in a position to determine whether two subsets of nodes within a DAG are
conditionally independent.
Theorem 1.4.1 Let A, B and C be disjoint subsets of nodes in a DAG. Then A ⊥ B | C if C
separates A from B in the moral graph of the smallest ancestral set containing A ∪ B ∪ C.
Combining moralisation with the notion of ancestral sets provides an attractive alternative to d-separation in establishing conditional independence between subsets of nodes
in a DAG. This is because each part of Theorem 1.4.1, namely (1) smallest ancestral set
determination, (2) moralisation and (3) separation on an undirected graph are all easy
to carry out, compared with d-separation. The ability to recognise conditional independence between nodes is vital in identifying induced factorisations.
Moralisation can also be used to visualise the effect MFVB has on the structure of a
hierarchical Bayesian model. MFVB involves placing a product restriction, or imposed
factorisation on the posterior p(θ|y). Visually, this corresponds to removing edges between relevant nodes on the moralised DAG.
Consider, for example, the model
$$y | \beta, u, \sigma_\varepsilon^2 \sim N(X\beta + Zu, \sigma_\varepsilon^2 I), \quad \beta \sim N(0, \sigma_\beta^2 I), \quad u \sim N(0, \sigma_u^2 I),$$
$$\sigma_\varepsilon \sim \text{Half-Cauchy}(A_\varepsilon), \quad \sigma_u \sim \text{Half-Cauchy}(A_u) \qquad (1.6)$$
where y is a vector of responses, β and u are vectors of fixed and random effects, X and
Z are design matrices and σε2 and σu2 are variance parameters. Definition of the Half-Cauchy distribution is given in Section 1.5. Figure 1.2(a) shows the DAG for (1.6), and
illustrates the conditional dependence structure of the model. Note that the node for the
data, y, is shaded. Nodes representing model parameters are unshaded. MFVB involves
placing a product restriction on the posterior density. Say we impose the factorisation
$$p(\beta, u, \sigma_\varepsilon^2, \sigma_u^2 | y) \approx q(\beta, u, \sigma_\varepsilon^2, \sigma_u^2) = q(\beta, u)\, q(\sigma_\varepsilon^2, \sigma_u^2). \qquad (1.7)$$
In order to visualise this factorisation, we must first moralise the graph, as shown in Figure 1.2(b).

[Figure 1.2: Illustration of the impact of product restriction (1.7) on the directed acyclic graph for Model (1.6). Panel (a): directed acyclic graph of Model (1.6); panel (b): moralised graph.]

After moralisation, all paths between $\sigma_u^2$ and $\sigma_\varepsilon^2$ on the undirected graph
must pass through at least one of {y, β, u}. In other words, σu2 is now separated from σε2
by the set {y, β, u}. Applying Theorem 1.4.1 gives
$$\sigma_u^2 \perp \sigma_\varepsilon^2 \mid \{y, \beta, u\}.$$
Hence (1.7) reduces to
$$q(\beta, u, \sigma_\varepsilon^2, \sigma_u^2) = q(\beta, u)\, q(\sigma_\varepsilon^2)\, q(\sigma_u^2). \qquad (1.8)$$
The ability to recognise induced factorisations helps to streamline derivation of MFVB
methodology, especially as models increase in complexity.
1.5 Definitions and results

1.5.1 Non-analytic integral families
Definition 1.5.1 We define the integral $J^+(\cdot, \cdot, \cdot)$ by
$$J^+(p, q, r) = \int_0^\infty x^p \exp(qx - rx^2)\, dx, \qquad p \ge 0,\ -\infty < q < \infty,\ r > 0.$$

1.5.2 Special function definitions
Definition 1.5.2 The logit(·) function is defined by
$$\mathrm{logit}(p) = \log\left(\frac{p}{1 - p}\right) = \log(p) - \log(1 - p), \qquad 0 < p < 1.$$
Definition 1.5.3 The digamma function is denoted by ψ(·) and is the logarithmic derivative of
the gamma function
$$\psi(x) = \frac{d}{dx}\ln\Gamma(x) = \frac{\Gamma'(x)}{\Gamma(x)}, \qquad x > 0.$$
Definition 1.5.4 The trigamma function is denoted by $\psi'(\cdot)$ and is the second logarithmic derivative of the gamma function
$$\psi'(x) = \frac{d^2}{dx^2}\ln\Gamma(x) = \frac{d}{dx}\psi(x), \qquad x > 0.$$
Definition 1.5.5 The Dirac delta function is denoted by $\delta_0(\cdot)$ and is defined by
$$\delta_0(x) = \begin{cases} 1 & \text{if } x = 0, \\ 0 & \text{if } x \neq 0. \end{cases}$$
Distributional definitions of the required continuous sparse signal shrinkage density functions require the introduction of special functions. We follow the notation of Gradshteyn
and Ryzhik (1994).
Definition 1.5.6 The exponential integral function of order 1 is denoted by E1 (·) and is defined
by
$$E_1(x) = \int_x^\infty \frac{e^{-t}}{t}\, dt, \qquad x \in \mathbb{R},\ x \neq 0.$$
Evaluation of E1 (·) is supported by the function expint_E1() in the R package gsl
(Hankin, 2007).
Definition 1.5.7 The parabolic cylinder function of order ν ∈ R is denoted by Dν (·). Parabolic
cylinder functions of negative order can be expressed as:
$$D_\nu(x) = \Gamma(-\nu)^{-1} \exp(-x^2/4) \int_0^\infty t^{-\nu-1} \exp\left(-xt - \tfrac{1}{2}t^2\right) dt, \qquad \nu < 0,\ x \in \mathbb{R}.$$
Result 1.1 Combining Definitions 1.5.1 and 1.5.7, we have
$$J^+(p, q, r) = (2r)^{-(p+1)/2}\, \Gamma(p + 1)\, \exp\{q^2/(8r)\}\, D_{-p-1}(-q/\sqrt{2r}), \qquad p > -1,\ q \in \mathbb{R},\ r > 0.$$
For computational purposes, we note that
$$D_\nu(x) = 2^{\nu/2 + 1/4}\, W_{\nu/2 + 1/4,\, -1/4}\!\left(\tfrac{1}{2}x^2\right)/\sqrt{x}, \qquad x > 0, \qquad (1.9)$$
where Wk,m is a confluent hypergeometric function as defined in Whittaker and Watson
(1990). Direct computation of the parabolic cylinder function Dν (·) is unavailable in R.
Computation of Wk,m (·) however is available via the R function whittakerW() within
the package fAsianOptions (Wuertz et al., 2009). Hence, using (1.9) we compute Dν (·)
using the following code:
library(fAsianOptions)
2^(nu/2 + 1/4) * Re(whittakerW(x^2/2, nu/2 + 1/4, -1/4)) / sqrt(x)
Definition 1.5.8 Gauss’ Hypergeometric function of order (α, β, γ) is denoted by 2 F1 (α, β, γ; ·)
and has the integral representation
$${}_2F_1(\alpha, \beta, \gamma; x) = \frac{\Gamma(\gamma)}{\Gamma(\beta)\Gamma(\gamma - \beta)} \int_0^1 (1 - tx)^{-\alpha}\, t^{\beta-1} (1 - t)^{\gamma-\beta-1}\, dt \quad \text{for } \gamma > \beta > 0.$$
Evaluation of 2 F1 (α, β, γ; ·) is supported by the function hyperg_2F1() in the R package
gsl (Hankin, 2007).
1.5.3 Additional function definitions and continued fraction representations
As illustrated in Wand and Ormerod (2012), there are simple continued fraction representations for e and π, given by:
$$e = 2 + \cfrac{1}{1 + \cfrac{1}{2 + \cfrac{2}{3 + \cfrac{3}{4 + \cdots}}}} \qquad \text{and} \qquad \pi = \cfrac{4}{1 + \cfrac{1^2}{3 + \cfrac{2^2}{5 + \cfrac{3^2}{7 + \cdots}}}}.$$
An algorithm for accurate approximation of a real number given its continued fraction
expansion is presented in Section 5.2 of Press, Teukolsky, Vetterling and Flannery (1992). The authors refer to this algorithm as the modified Lentz's algorithm, after Lentz, who developed the algorithm for a specific family of continued fractions in his 1976 paper. In this thesis, we refer to it simply as Lentz's Algorithm, which works for general continued fractions of the form
$$b_0 + \cfrac{a_1}{b_1 + \cfrac{a_2}{b_2 + \cfrac{a_3}{b_3 + \cdots}}}.$$
Next, we define two functions that arise during derivation of MFVB algorithms for continuous sparse signal shrinkage density functions. We also state two important results
that lead to streamlined computation.
Definition 1.5.9 The functions Q(·) and Rν (·) are defined as
$$Q(x) \equiv e^x E_1(x), \quad x > 0, \qquad \text{and} \qquad R_\nu(x) \equiv \frac{D_{-\nu-2}(x)}{D_{-\nu-1}(x)}, \quad \nu > 0,\ x > 0.$$
Both of the functions defined immediately above lead to underflow problems for large
x. We call upon their continued fraction representations, then computation via Lentz’s
algorithm (Lentz, 1976), to facilitate stable computation.
Result 1.2 The function $Q(x)$ admits the continued fraction expansion
$$Q(x) = \cfrac{1}{x + 1 + \cfrac{1^2}{x + 3 + \cfrac{2^2}{x + 5 + \cfrac{3^2}{x + 7 + \cdots}}}}.$$
Result 1.3 The function Rν (x) admits the continued fraction expansion
$$R_\nu(x) = \cfrac{1}{x + \cfrac{\nu + 2}{x + \cfrac{\nu + 3}{x + \cfrac{\nu + 4}{x + \cdots}}}}.$$
Results 1.2 and 1.3 are given in Cuyt, Petersen, Verdonk, Waadeland and Jones (2008).
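As an illustration, the following R sketch implements the modified Lentz scheme of Press et al. (1992, Section 5.2) for a general continued fraction b_0 + a_1/(b_1 + a_2/(b_2 + ···)) and applies it to Q(x) via Result 1.2. The function names lentzCF() and Qcf(), and the tolerance settings, are assumptions of this sketch rather than code from the thesis.

# Modified Lentz's Algorithm for b0 + a1/(b1 + a2/(b2 + ...)) (added sketch);
# a() and b() are user-supplied coefficient functions.
lentzCF <- function(a, b, tol = 1e-10, maxit = 500, tiny = 1e-30) {
  f <- b(0); if (f == 0) f <- tiny
  C <- f; D <- 0
  for (j in 1:maxit) {
    D <- b(j) + a(j)*D; if (D == 0) D <- tiny
    C <- b(j) + a(j)/C; if (C == 0) C <- tiny
    D <- 1/D
    delta <- C*D
    f <- f*delta
    if (abs(delta - 1) < tol) return(f)
  }
  warning("Lentz's Algorithm did not converge")
  f
}

# Q(x) = exp(x) E_1(x) via the continued fraction in Result 1.2:
# a_1 = 1, a_j = (j - 1)^2 for j >= 2; b_0 = 0, b_j = x + 2j - 1.
Qcf <- function(x)
  lentzCF(a = function(j) if (j == 1) 1 else (j - 1)^2,
          b = function(j) if (j == 0) 0 else x + 2*j - 1)

# Check for moderate x (requires the gsl package): Qcf(10); exp(10)*gsl::expint_E1(10)

For large x the continued fraction remains numerically stable, whereas computing exp(x) and E_1(x) separately can overflow and underflow.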
1.5.4 Distributional definitions and results
Here we define the distributions and important results that arise throughout the thesis.
For the more complex distributions, standard densities with location parameter µ = 0
and scale parameter σ = 1 are given. Density functions for general location and scale
parameters for these distributions are then found by mapping
$$p(v) \mapsto \frac{1}{\sigma}\, p\!\left(\frac{v - \mu}{\sigma}\right).$$
Definition 1.5.10 We use the notation v ∼ Bernoulli(ρ) to denote v following a Bernoulli distribution with parameter 0 ≤ ρ ≤ 1. The corresponding probability function is
$$p(v) = \rho^v (1 - \rho)^{1-v}, \qquad v \in \{0, 1\}.$$
Definition 1.5.11 We use the notation v ∼ N (µ, σ 2 ) to denote v following a Normal distribution with mean µ ∈ R and variance σ 2 > 0. The corresponding density function is
$$p(v) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(v - \mu)^2}{2\sigma^2}\right\}, \qquad v \in \mathbb{R}.$$
Definition 1.5.12 We use the notation $\phi(\cdot)$ to denote the standard normal density function, that is if $v \sim N(0, 1)$ then
$$p(v) = \phi(v) = \frac{1}{\sqrt{2\pi}}\, e^{-v^2/2}, \qquad v \in \mathbb{R}.$$
Hence if $v \sim N(\mu, \sigma^2)$, then
$$p(v) = \frac{1}{\sigma}\, \phi\!\left(\frac{v - \mu}{\sigma}\right).$$
Definition 1.5.13 The notation v ∼ N (µ, Σ) represents the Multivariate Normal distribution
with mean vector µ ∈ Rk and positive-semidefinite, symmetric Covariance matrix Σ ∈ Rk×k .
The corresponding density function is
$$p(v) = (2\pi)^{-k/2} |\Sigma|^{-1/2} \exp\left\{-\tfrac{1}{2}(v - \mu)^T \Sigma^{-1} (v - \mu)\right\}, \qquad v \in \mathbb{R}^k.$$
Definition 1.5.14 We use the notation v ∼ Beta(A, B) to denote v following a Beta distribution
with shape parameters A > 0 and B > 0. The corresponding density function is
$$p(v) = \frac{v^{A-1}(1 - v)^{B-1}}{B(A, B)}, \qquad 0 \le v \le 1,$$
where $B(\cdot, \cdot)$ is the Beta function.
Definition 1.5.15 The notation v ∼ Gamma(A, B) means that v has a Gamma distribution with
shape parameter A > 0 and rate parameter B > 0. The corresponding density function is
$$p(v) = B^A \Gamma(A)^{-1} v^{A-1} \exp(-Bv), \qquad v > 0.$$
Result 1.4 Suppose that v ∼ Gamma(A, B). Then
$$E(v) = A/B, \qquad E\{\log(v)\} = \psi(A) - \log(B) \qquad \text{and} \qquad \mathrm{Var}\{\log(v)\} = \psi'(A),$$
where $\psi(\cdot)$ is the digamma function set out by Definition 1.5.3, and $\psi'(\cdot)$ is the trigamma function
set out by Definition 1.5.4.
Definition 1.5.16 The notation v ∼ Inverse-Gamma(A, B) means that v has an Inverse-Gamma
distribution with shape parameter A > 0 and rate parameter B > 0. The corresponding density
function is
$$p(v) = B^A \Gamma(A)^{-1} v^{-A-1} \exp(-B/v), \qquad v > 0.$$
Result 1.5 Suppose that v ∼ Inverse-Gamma(A, B). Then
$$E(1/v) = A/B \qquad \text{and} \qquad E\{\log(v)\} = \log(B) - \psi(A),$$
where ψ(·) is the digamma function set out by Definition 1.5.3.
Definition 1.5.17 The notation v ∼ Half-Cauchy(A) means that v has a Half-Cauchy distribution with scale parameter A > 0. The corresponding density function is
$$p(v) = \frac{2}{\pi A\{1 + (v/A)^2\}}, \qquad v > 0.$$
Result 1.6 $v \sim$ Half-Cauchy$(A)$ if and only if
$$v^2 | a \sim \text{Inverse-Gamma}(\tfrac{1}{2}, 1/a) \quad \text{and} \quad a \sim \text{Inverse-Gamma}(\tfrac{1}{2}, 1/A^2).$$
This follows from Result 5 of Wand, Ormerod, Padoan and Frühwirth (2011).
Definition 1.5.18 The notation v ∼ Inverse-Gaussian(µ, λ) means that v has an Inverse-Gaussian
distribution with mean µ > 0 and rate parameter λ > 0. The corresponding density function is
$$p(v) = \sqrt{\frac{\lambda}{2\pi v^3}}\, \exp\left\{-\frac{\lambda(v - \mu)^2}{2\mu^2 v}\right\}, \qquad v > 0,$$
with
$$E(v) = \mu \qquad \text{and} \qquad E(1/v) = \frac{1}{\mu} + \frac{1}{\lambda}.$$
Result 1.7 Suppose the density function of v takes the form
$$p(v) \propto v^{-3/2} \exp(-Sv - T/v), \qquad v > 0.$$
Then $v \sim$ Inverse-Gaussian$(\sqrt{T/S}, 2T)$.
Definition 1.5.19 The notation v ∼ GEV(0, 1, ξ) means that v follows the standard Generalized
Extreme Value distribution with shape parameter ξ ∈ R. The density function is given by
$$p_{\mathrm{GEV}}(v; \xi) = \begin{cases} (1 + \xi v)^{-1/\xi - 1} \exp\left\{-(1 + \xi v)^{-1/\xi}\right\}, & \xi \neq 0 \\ \exp\left(-v - e^{-v}\right), & \xi = 0 \end{cases}, \qquad 1 + \xi v > 0.$$
If the random variable v has density function σ −1 pGEV {(v − µ)/σ} then we write
v ∼ GEV(µ, σ, ξ).
Definition 1.5.20 The notation v ∼ Normal-Mixture(0, 1, w, m, s) means that v follows the
standard Finite Normal Mixture distribution with: weight vector $w = (w_1, \ldots, w_K)$, $w_k > 0$ for $1 \le k \le K$ and $\sum_{k=1}^K w_k = 1$; mean vector $m = (m_1, \ldots, m_K)$, $m_k \in \mathbb{R}$ for $1 \le k \le K$; and standard deviation vector $s = (s_1, \ldots, s_K)$, $s_k > 0$ for $1 \le k \le K$. The density function is given by
$$p_{\mathrm{NM}}(v; w, m, s) = \sum_{k=1}^{K} \frac{w_k}{s_k}\, \phi\!\left(\frac{v - m_k}{s_k}\right), \qquad v \in \mathbb{R}.$$
If the random variable v has density function σ −1 pNM {(v − µ)/σ}, then we write
v ∼ Normal-Mixture(µ, σ, w, m, s).
Definition 1.5.21 The notation v ∼ Asymmetric-Laplace(0, 1, τ ) means that v follows the stan-
dard (µ = 0, σ = 1) Asymmetric-Laplace distribution. The density function is given by
$$p(v; \tau) = \tau(1 - \tau) \exp\left[-\left\{\tfrac{1}{2}|v| + \left(\tau - \tfrac{1}{2}\right)v\right\}\right], \qquad v \in \mathbb{R}.$$
Result 1.8 Let $y$ and $a$ be random variables such that
$$y | a \sim N\!\left(\mu + \frac{(\tau - \tfrac{1}{2})\sigma}{a\tau(1 - \tau)},\ \frac{\sigma^2}{a\tau(1 - \tau)}\right) \quad \text{and} \quad a \sim \text{Inverse-Gamma}(1, \tfrac{1}{2}).$$
Then $y \sim$ Asymmetric-Laplace$(\mu, \sigma, \tau)$.
Result 1.8 follows from Proposition 3.2.1 of Kotz, Kozubowski and Podgórski
(2001).
Definition 1.5.22 The standard Horseshoe density function is defined by
$$p_{\mathrm{HS}}(v) = (2\pi^3)^{-1/2} \exp(v^2/2)\, E_1(v^2/2), \qquad v \neq 0,$$
where E1 (·) is the exponential integral function of order 1.
If the random variable v has density function σ −1 pHS {(v − µ)/σ} then we write
v ∼ Horseshoe(µ, σ).
Result 1.9 Let v, b and c be random variables such that
$$v | b \sim N(\mu, \sigma^2/b), \qquad b | c \sim \text{Gamma}(\tfrac{1}{2}, c) \qquad \text{and} \qquad c \sim \text{Gamma}(\tfrac{1}{2}, 1).$$
Then v ∼ Horseshoe(µ, σ).
Result 1.10 Let v and b be random variables such that
$$v | b \sim N(\mu, \sigma^2/b) \qquad \text{and} \qquad p(b) = \pi^{-1} b^{-1/2}(b + 1)^{-1}, \qquad b > 0.$$
Then v ∼ Horseshoe(µ, σ).
Results 1.9 and 1.10 are related to results given in Carvalho, Polson and Scott (2010).
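The scale-mixture representation in Result 1.9 can be checked quickly by simulation. The following R sketch is an added illustration, not code from the thesis; it uses expint_E1() from the gsl package mentioned in Section 1.5.2 and compares an interval probability under the hierarchical representation with the same probability computed from the Horseshoe density of Definition 1.5.22, in the standard case µ = 0, σ = 1.

# Monte Carlo check of Result 1.9 in the standard case mu = 0, sigma = 1 (added sketch)
library(gsl)                                     # for expint_E1()
set.seed(1)
nSim <- 1e6
cSim <- rgamma(nSim, shape = 0.5, rate = 1)      # c ~ Gamma(1/2, 1)
bSim <- rgamma(nSim, shape = 0.5, rate = cSim)   # b|c ~ Gamma(1/2, c)
vSim <- rnorm(nSim, mean = 0, sd = 1/sqrt(bSim)) # v|b ~ N(0, 1/b)
pHS <- function(v) (2*pi^3)^(-1/2)*exp(v^2/2)*expint_E1(v^2/2)
mean(vSim >= 0.5 & vSim <= 2)                    # simulated P(0.5 <= v <= 2)
integrate(pHS, 0.5, 2)$value                     # same probability from Definition 1.5.22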
Definition 1.5.23 The standard Normal-Exponential-Gamma density function, with shape parameter λ > 0 is defined by
$$p_{\mathrm{NEG}}(v; \lambda) = \pi^{-1/2}\, \lambda\, 2^{\lambda}\, \Gamma(\lambda + \tfrac{1}{2})\, \exp(v^2/4)\, D_{-2\lambda-1}(|v|), \qquad v \in \mathbb{R},$$
where Dν (·) is the parabolic cylinder function of order ν.
If the random variable v has density function σ −1 pNEG {(v − µ)/σ; λ} then we write
v ∼ NEG(µ, σ, λ).
Result 1.11 Let v, b and c be random variables such that
$$v | b \sim N(\mu, \sigma^2/b), \qquad b | c \sim \text{Inverse-Gamma}(1, c) \qquad \text{and} \qquad c \sim \text{Gamma}(\lambda, 1).$$
Then v ∼ NEG(µ, σ, λ).
Result 1.12 Let v and b be random variables such that
$$v | b \sim N(\mu, \sigma^2/b) \qquad \text{and} \qquad p(b) = \lambda b^{\lambda-1}(b + 1)^{-\lambda-1}, \qquad b > 0.$$
Then v ∼ NEG(µ, σ, λ).
Results 1.11 and 1.12 are related to results given in Griffin and Brown (2011).
Definition 1.5.24 The standard Generalized Double Pareto density function, with shape parameter λ > 0 is defined by
$$p_{\mathrm{GDP}}(v; \lambda) = \frac{1}{2(1 + |v|/\lambda)^{\lambda+1}}, \qquad v \in \mathbb{R}.$$
If the random variable v has density function σ −1 pGDP {(v − µ)/σ; λ} then we write
v ∼ GDP(µ, σ, λ).
Result 1.13 Let v, b and c be random variables such that
$$v | b \sim N(\mu, \sigma^2/b), \qquad b | c \sim \text{Inverse-Gamma}(1, c^2/2) \qquad \text{and} \qquad c \sim \text{Gamma}(\lambda, \lambda).$$
Then $v \sim$ GDP$(\mu, \sigma, \lambda)$.
Result 1.14 Let v and b be random variables such that
$$v | b \sim N(\mu, \sigma^2/b) \qquad \text{and} \qquad p(b) = \tfrac{1}{2}(\lambda + 1)\lambda^{\lambda+1}\, b^{(\lambda-2)/2}\, e^{\lambda^2 b/4}\, D_{-\lambda-2}(\lambda\sqrt{b}), \qquad b > 0.$$
Then v ∼ GDP(µ, σ, λ).
Results 1.13 and 1.14 are related to results given in Armagan, Dunson and Lee (2012).
Definition 1.5.25 The Laplace-Zero density function is defined by
$$p(u | \sigma, \rho) = \rho(2\sigma)^{-1} \exp(-|u|/\sigma) + (1 - \rho)\delta_0(u), \qquad u \in \mathbb{R},$$
where ρ is a random variable over [0, 1].
Result 1.15 Suppose that u = γv. Then u|σ, ρ ∼ Laplace-Zero(σ, ρ) if and only if
$$v | b \sim N(0, \sigma^2/b), \qquad \gamma | \rho \sim \text{Bernoulli}(\rho) \qquad \text{and} \qquad b \sim \text{Inverse-Gamma}(1, \tfrac{1}{2}).$$

1.5.5 Matrix results
Result 1.16 If $a$ is a scalar, then $a^T = a$.

Result 1.17 If $A$ is a square matrix, then $\mathrm{tr}(A^T) = \mathrm{tr}(A)$.

Result 1.18 If $A$ is an $m \times n$ matrix and $B$ is an $n \times m$ matrix, then $\mathrm{tr}(AB) = \mathrm{tr}(BA)$.

Result 1.19 If $a$ and $b$ are vectors of the same length, then
$$\|a - b\|^2 = \|a\|^2 - 2a^T b + \|b\|^2.$$

Result 1.20 If $a$, $b$ and $c$ are vectors of the same length, then
$$\|a - b - c\|^2 = \|a\|^2 + \|b\|^2 + \|c\|^2 - 2a^T b - 2a^T c + 2b^T c.$$

Result 1.21 Let $A$ be a symmetric invertible matrix and $x$ and $b$ be column vectors with the same number of rows as $A$. Then
$$-\tfrac{1}{2}x^T A x + b^T x = -\tfrac{1}{2}(x - A^{-1}b)^T A (x - A^{-1}b) + \tfrac{1}{2}\, b^T A^{-1} b.$$

Result 1.22 Let $v$ be a random vector. Then
$$E(vv^T) = \mathrm{Cov}(v) + E(v)E(v)^T \qquad \text{and} \qquad E(\|v\|^2) = \mathrm{tr}\{\mathrm{Cov}(v)\} + \|E(v)\|^2.$$

Result 1.23 Let $v$ be a random vector and let $A$ be a constant matrix with the same number of rows as $v$. Then
$$E(v^T A v) = E(v)^T A E(v) + \mathrm{tr}\{A\,\mathrm{Cov}(v)\}.$$
1.6 Accuracy measure
This section describes the mathematics behind the accuracy measure used to assess the
quality of an approximate MFVB posterior via comparison to a baseline MCMC posterior.
Details are taken from Wand et al. (2011) Section 8.
Let p(θ|x) denote the posterior distribution of a parameter θ given data x. Our aim
is to assess the quality of a MFVB approximation to the posterior, which we denote by
q ∗ (θ). Wand et al. (2011) describes measuring the accuracy of q ∗ (θ) via the L1 distance.
The L1 error, also known as the integrated absolute error (IAE) of q ∗ (θ) is defined as
$$\mathrm{IAE}\{q^*(\theta)\} = \int_{-\infty}^{\infty} |q^*(\theta) - p(\theta|x)|\, d\theta.$$
Since $\mathrm{IAE} \in (0, 2)$, we then define the accuracy of $q^*(\theta)$ as
$$\mathrm{accuracy}\{q^*(\theta)\} = 1 - \tfrac{1}{2}\mathrm{IAE}\{q^*(\theta)\},$$
and so 0 ≤ accuracy{q ∗ (θ)} ≤ 1. We then express accuracy as a percentage. In practice,
the posterior p(θ|x) is approximated by an extremely accurate MCMC analogue, denoted
by pMCMC (θ|x).
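In practice this accuracy measure can be computed by numerical integration over a grid, with p_MCMC(θ|x) represented by a kernel density estimate of the MCMC draws. The following R sketch is an illustration added here rather than the thesis's own implementation; the names qStar and mcmcDraws are hypothetical placeholders for the MFVB density function and the MCMC sample.

# Accuracy of q*(theta) against MCMC (added sketch; qStar and mcmcDraws are placeholders)
accuracyMFVB <- function(qStar, mcmcDraws, gridSize = 1001) {
  dens <- density(mcmcDraws, n = gridSize)             # kernel estimate of p_MCMC(theta|x)
  grid <- dens$x
  IAE <- sum(abs(qStar(grid) - dens$y))*diff(grid)[1]  # approximate L1 (integrated absolute) error
  100*(1 - 0.5*IAE)                                    # accuracy expressed as a percentage
}

# Example: a N(mu.q, sigsq.q) approximation to the posterior of theta
# accuracyMFVB(function(x) dnorm(x, mu.q, sqrt(sigsq.q)), mcmcDraws)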
1.7 Overview of the thesis

The aim of this thesis is to address the research question:
“Is mean field variational Bayes a viable inference tool for semiparametric regression models
involving elaborate distributions?”
This involves identifying models and/or data sets for which traditional inference is either
too slow or impossible. We are attempting to make analysis of these models computationally
viable by implementing faster methodology. In some cases, this involves extending current research on MFVB to more complex models. In other areas, we are using MFVB for
the very first time.
Each chapter essentially involves development of a MFVB algorithm to perform fast,
approximate inference for the given model and data. The efficiency of the algorithm is
then tested using either simulation studies or real data through comparison with MCMC.
Chapter 2 involves development of MFVB methodology for Generalized Extreme
Value additive models. Structured MFVB and auxiliary mixture sampling are the key
methodologies used. Chapter 2 culminates in the fast, deterministic analysis of the Sydney Hinterland maximum rainfall data. In Chapter 3, we present MFVB inference for
quantile semiparametric regression. This centres around the Asymmetric Laplace distribution. Chapter 4 presents MFVB inference for sparse signal shrinkage, focusing on continuous shrinkage priors. A new approach is developed to add to the MFVB armoury:
continued fraction approximations via Lentz’s Algorithm. The insights gained in Chapter 4 flow on to inform the prior distributions used in Chapters 5 and 6. In Chapter 5 we
develop MFVB algorithms for inference in penalised wavelet regression. Three types of
penalisation are considered: Horseshoe, Normal-Exponential-Gamma and Laplace-Zero.
Chapter 6 develops MFVB inference for the most complex model considered in the thesis.
This model is motivated by the Radiation Pneumonitis data.
All computations were carried out in R (R Development Core Team, 2011) using a
MacBook Pro with a 2.7 GHz Intel Core i7 processor and 8 GB of memory.
Chapter 2
Mean field variational Bayes inference for Generalised Extreme Value regression models
2.1 Introduction
Analysis of sample extremes is becoming more prominent, largely driven by increasing
interest in climate change research. The generalised extreme value (GEV) distribution
can be used to model the behaviour of such sample extremes. Over the past decade,
many papers have addressed the idea of using GEV additive models to model sample
extremes. These papers differ in their approaches, ranging from Bayesian to frequentist,
and employing various methods of fitting. For example, Chavez-Demoulin and Davison
(2005) use data on minimum daily temperatures at 21 Swiss weather stations during winters in the seventies to motivate their methodology. They found that the North Atlantic
oscillation index has a nonlinear effect on sample extremes. Davison and Ramesh (2000)
outline a semiparametric approach to smoothing sample extremes, using local polynomial fitting of the GEV distribution. The aim of Yee and Stephenson (2007) was to provide
a unifying framework for flexible smoothing in models involving extreme value distributions. Laurini and Pauli (2009) carry out nonparametric regression for sample extremes
using penalized splines and the Poisson point process model. Markov chain Monte Carlo
methods were used for inference.
As explored by Laurini and Pauli (2009), a popular way of fitting hierarchical Bayesian
models is to use MCMC methods. These methods, although highly accurate, are very
computationally intensive and inference can take hours. This is where mean field variational Bayes (MFVB) comes to the fore as an exciting alternative. MFVB provides an
approximate alternative to MCMC, with massively reduced computation time.
This chapter brings together the methodology of the GEV additive model and MFVB,
resulting in fast, approximate inference for analysing sample extremes. We begin by
explaining the need to model the GEV distribution using a highly accurate mixture of
Normal distributions. The particular mixture of Normal distributions used to approximate the GEV density is obtained by minimising the χ2 distance between the two, as
outlined in Wand et al. (2011). We then set up a regression model for this normal mixture
response. MFVB methodology is then presented for the regression case, which extends
methodology presented in Wand et al. (2011). Derivations are deferred to Appendix 2.A.
We then extend from the regression case to the semiparametric regression case via construction of an additive model. Then, using relevant theory we derive MFVB algorithms
for inference. Methods for construction of variability bands for the estimated parameters
are then presented. This is followed by an explanation of how to devise an overall lower
bound approximation for the GEV normal mixture model based on the MFVB approach.
Finally we analyse maximum rainfall data from throughout New South Wales (NSW),
Australia, using the full GEV normal mixture based MFVB approach.
The majority of the work in this chapter has been peer reviewed and published in
Neville, Palmer and Wand (2011). Work on the geoadditive extension in Section 2.8 culminated in the paper Neville and Wand (2011) and a presentation at the conference Spatial
Statistics: Mapping Global Change, Enschede, the Netherlands, 2011.
2.2 Direct mean field variational Bayes
Constructing MFVB algorithms for any model involves taking expectations of full conditionals with respect to many parameters. The complex form of the GEV likelihood ultimately leads to these expectations becoming too complex, and thus the resulting MFVB
approximate posteriors become intractable. First, we step through the direct MFVB methodology for the GEV model, and show why it breaks down. We then present the idea of
approximating the GEV density as a highly accurate mixture of normals, first illustrated
in Wand et al. (2011).
Let x follow a GEV(µ, σ, ξ) distribution with parameters µ, σ > 0 and ξ ∈ R. This
implies that
p(x) = (1/σ) pGEV{(x − µ)/σ; ξ},
where pGEV is the GEV(0, 1, ξ) density function defined by (2.2). The density of x is therefore:
p(x|µ, σ, ξ) = (1/σ) {1 + ξ(x − µ)/σ}^(−1/ξ−1) exp[−{1 + ξ(x − µ)/σ}^(−1/ξ)].
Now assume we impose the continuous priors
µ ∼ N(µ_µ, σ_µ²)   and   σ² ∼ Inverse-Gamma(A, B),
and the discrete prior ξ ∼ p(ξ), ξ ∈ Ξ.
Take, for example, the process of deriving an approximation, q ∗ (µ), for the posterior
of µ. Following the process set out in Section 1.3, we first find the logarithm of the full
conditional of µ, and then take expectations with respect to the remaining parameters.
The full conditional for µ is of the form:
p(µ|rest) ∝ p(x|rest) p(µ)
∝ (1/σ) {1 + ξ(x − µ)/σ}^(−1/ξ−1) exp[−{1 + ξ(x − µ)/σ}^(−1/ξ)] × exp{−(µ − µ_µ)²/(2σ_µ²)}
∝ {1 + ξ(x − µ)/σ}^(−1/ξ−1) exp[−{1 + ξ(x − µ)/σ}^(−1/ξ) − (µ − µ_µ)²/(2σ_µ²)].
Therefore,
log p(µ|rest) = const. − (µ − µ_µ)²/(2σ_µ²) + (−1/ξ − 1) log{1 + ξ(x − µ)/σ} − {1 + ξ(x − µ)/σ}^(−1/ξ),
hence
log q*(µ) = E_{σ,ξ}[ const. − (µ − µ_µ)²/(2σ_µ²) + (−1/ξ − 1) log{1 + ξ(x − µ)/σ} − {1 + ξ(x − µ)/σ}^(−1/ξ) ].    (2.1)
We can see the complex dependence on the parameters emerging as a substantial obstacle, predominantly in the final term of (2.1). Thus, taking expectations in order to derive
optimal q ∗ densities is highly intractable. It is for this reason that Wand et al. (2011) developed the auxiliary mixture approach to handling GEV responses in a MFVB framework.
2.3 Auxiliary mixture sampling approach
Auxiliary mixture sampling involves using a combination of simple distributions to approximate a more complex distribution. The use of auxiliary mixture sampling in Bayesian
analysis dates back to at least 1994, when Shephard (1994) used mixture distributions to
allow for outliers in analysing non-Gaussian time series models. Since then, many others
have used auxiliary mixture sampling to overcome problems associated with elaborate
distributions. For example, Kim, Shephard and Chib (1998) and Chib, Nardari and Shephard (2002) used normal mixtures to approximate the density of a log χ2 distribution in
the context of stochastic volatility models.
More recently, and closer to the context of our area of interest, Frühwirth-Schnatter
and Wagner (2006) used auxiliary mixture sampling to deal with Gumbel random variables, which follow a GEV distribution with ξ = 0. Frühwirth-Schnatter and Frühwirth (2007)
also used a finite mixture of normals to approximate the extreme value distribution.
Wand et al. (2011) introduce the idea of auxiliary mixture sampling as a fundamental
step in facilitating variational Bayesian inference for simple univariate GEV models. The
authors explain how finite normal mixture approximations to the GEV density are generated. Firstly, a general normal mixture approximation to the GEV density of size K is
defined, in the form of definition (1.5.20). Then, the weight, mean and standard deviation
vectors (w, m and s respectively) of the Normal mixture are chosen to minimise the χ2
distance between the GEV density and the mixture approximation.
Firstly, we introduce the complex GEV density. If x ∼ GEV(0, 1, ξ), then

pGEV(x; ξ) = (1 + ξx)^(−1/ξ−1) exp{−(1 + ξx)^(−1/ξ)},   ξ ≠ 0,
pGEV(x; ξ) = exp(−x − e^(−x)),   ξ = 0.    (2.2)
The finite normal mixture approximation to the GEV density is assumed to be of the form
set out in Definition 1.5.20
pNM(x; w, m, s) ≡ Σ_{k=1}^{K} (w_k/s_k) φ{(x − m_k)/s_k},
where w = (w1 , . . . , wK ), m = (m1 , . . . , mK ) and s = (s1 , . . . , sK ) are assumed to be fixed
vectors not requiring Bayesian inference. Essentially, after fixing K, both the L1 and χ2
distance between the true density and the finite mixture approximation are investigated
as measures to minimise in order to produce the final mixture approximation. The χ2
distance is identified as the better of the two.
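To make the comparison concrete, the following R sketch evaluates a candidate normal mixture against the exact GEV(0, 1, ξ) density and computes the L1 distance between the two numerically. The mixture vectors (w, m, s) shown are arbitrary placeholders, not the chi-squared-optimal values of Wand et al. (2011):

## Illustrative comparison of a candidate K-component normal mixture with the
## exact GEV(0, 1, xi) density; (w, m, s) are placeholders only.
pGEV <- function(x, xi) {
  t <- 1 + xi * x
  out <- numeric(length(x))
  out[t > 0] <- t[t > 0]^(-1/xi - 1) * exp(-t[t > 0]^(-1/xi))
  out
}
pNM <- function(x, w, m, s)
  rowSums(sapply(seq_along(w), function(k) (w[k] / s[k]) * dnorm((x - m[k]) / s[k])))
xi <- 1
w <- c(0.45, 0.35, 0.20); m <- c(0.2, 1.5, 4); s <- c(0.7, 1.3, 2.5)   # placeholders
xg <- seq(-1/xi + 1e-3, 25, length.out = 5000)    # effective support of GEV(0, 1, 1)
dx <- xg[2] - xg[1]
L1 <- sum(abs(pNM(xg, w, m, s) - pGEV(xg, xi))) * dx   # integrated absolute error
L1

Minimising the corresponding χ² distance over (w, m, s), for each ξ ∈ Ξ, yields the fixed mixture vectors used throughout the remainder of the chapter.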
A chi-squared normal mixture approximation for ξ = 1 is illustrated in Figure 2.1.
Figure 2.1: A normal mixture approximation to the GEV density with ξ = 1 (exact and approximate densities overlaid).
Wand et al. (2011) then explain how the derivation of the finite normal mixture approximation of the GEV density forms a vital step in MFVB inference. We identify where this vital step fits within the broader process of structured MFVB in Section 2.4.
2.4 Structured mean field variational Bayes
In order to facilitate MFVB inference for the GEV shape parameter ξ, we must introduce
a MFVB extension known as structured MFVB. MFVB inference is difficult due to the complex dependence of the GEV density on the shape parameter ξ. However, when we fix ξ, MFVB inference becomes tractable. Saul and Jordan (1996) proposed
structured MFVB for this very situation. Structured MFVB was first used in the context
of Bayesian hierarchical models in Wand et al. (2011). We summarise the major ideas here
in the context of the GEV regression model, borrowing heavily from Section 3.1 of Wand et al. (2011).
Let θ denote all parameters in the GEV additive model, except ξ. Hence we are considering a Bayesian model of the form
y | θ, ξ ∼ p(y | θ, ξ),    (2.3)
where θ ∈ Θ and ξ ∈ Ξ. To reiterate, MFVB inference is tractable for fixed ξ, but not
tractable when ξ is included as a model parameter, as seen in Section 2.2. In practice, we
limit the parameter space Ξ of ξ to a finite number of atoms. We denote the prior
density function of θ by p(θ), and the probability mass function of ξ by p(ξ).
Wand et al. (2011) state the following results regarding MFVB inference in a structured context. The optimal q* densities are given by:
q*(θ_i) = Σ_{ξ∈Ξ} q*(ξ) q*(θ_i | ξ)    (2.4)
and
q*(ξ) = p(ξ) p(y|ξ) / Σ_{ξ′∈Ξ} p(ξ′) p(y|ξ′).    (2.5)
The lower bound on the marginal likelihood is given by
p(y; q) = Σ_{ξ∈Ξ} q*(ξ) p(y|ξ).    (2.6)
Summarising the previous and current sections, MFVB inference for GEV models, be
they simple regression models or additive models, involves 4 stages. Firstly, we limit
the continuous shape parameter ξ to a discrete set Ξ. Secondly, finite normal mixture
approximation of the GEV density function is carried out for each ξ ∈ Ξ. Thirdly, MFVB
inference is carried out for these normal mixture models over each ξ ∈ Ξ. For fixed ξ this
results in approximate posteriors for the remaining model parameters. The final stage
involves combining the results across all ξ ∈ Ξ to make approximate inference for all
parameters in the model, including the shape parameter ξ.
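The final combination stage reduces to forming a discrete distribution over Ξ from the stored per-ξ lower bounds. A minimal R sketch (not the thesis code; the log lower bound values are placeholders) using a log-sum-exp style subtraction for numerical stability:

## Sketch of the structured MFVB combination step (2.5), assuming the per-xi
## converged log lower bounds log p(y | xi) have already been stored.
Xi         <- seq(0, 0.5, by = 0.05)                 # discrete atoms for xi
log_p_y_xi <- -0.5 * (Xi - 0.2)^2 / 0.01 - 5000      # placeholder log lower bounds
log_prior  <- rep(log(1 / length(Xi)), length(Xi))   # uniform prior p(xi)
log_w <- log_prior + log_p_y_xi
q_xi  <- exp(log_w - max(log_w))                     # subtract the maximum to avoid underflow
q_xi  <- q_xi / sum(q_xi)                            # q*(xi), equation (2.5)
rbind(xi = Xi, q_xi = round(q_xi, 3))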
2.5 Finite normal mixture response regression
Section 2.3 illustrated the need for GEV responses to be approximated by a finite mixture
of normal densities in order to facilitate approximate Bayesian inference. Wand et al.
(2011) carried out MFVB inference for simple univariate GEV models via normal-mixture
approximations. The remainder of this chapter extends the work of Wand et al. (2011)
to allow approximate Bayesian inference for GEV regression models. The first step in
achieving MFVB inference for GEV regression models is to develop an algorithm to carry
out MFVB for normal-mixture response regression.
2.5.1 Model
We consider the model
yi | β, σε², ξ ∼ Normal-Mixture{(Xβ)_i, σε, w, m, s} independently, 1 ≤ i ≤ n,
β ∼ N(µβ, Σβ),   σε² ∼ Inverse-Gamma(Aε, Bε),    (2.7)
where β = (β0, β1)ᵀ, X is the n × 2 design matrix with ith row (1, x_i),
Aε , Bε > 0 are scalar constants, µβ is a constant vector, Σβ is a constant matrix, and
w = (w1 , . . . , wK ), m = (m1 , . . . , mK ) and s = (s1 , . . . , sK ) are Normal-Mixture weight,
mean and standard deviation vectors pre-determined to accurately approximate the GEV
density for a fixed value of the shape parameter ξ. Using (2.7) and Definition 1.5.20, we
have
p(yi | β, σε², w, m, s) = (1/σε) pNM{(yi − (Xβ)_i)/σε; w, m, s}
= Σ_{k=1}^{K} (w_k/σε) (1/s_k) φ[{(yi − (Xβ)_i)/σε − m_k}/s_k]
= Σ_{k=1}^{K} {w_k/(σε s_k)} φ([yi − {(Xβ)_i + σε m_k}]/(σε s_k)).
We now introduce auxiliary variables (a_1, . . . , a_n) that allow the distribution of yi to vary as i varies from 1 to n. Specifically,
a_1, . . . , a_n ∼ Multinomial(1; w_1, . . . , w_K),
where a_i = (a_i1, . . . , a_iK), a_ik ∈ {0, 1} and Σ_{k=1}^{K} a_ik = 1.
Using auxiliary variables (a1 , . . . , an ), we can re-express Model (2.7) as
p(yi | β, a_i, σε²) = Π_{k=1}^{K} [ {1/(σε s_k)} φ{((y − Xβ)_i − σε m_k)/(σε s_k)} ]^{a_ik},   1 ≤ i ≤ n,
β ∼ N(µβ, Σβ),   σε² ∼ Inverse-Gamma(Aε, Bε),    (2.8)
which explicitly gives
yi | β, a_i, σε² ∼ N{(Xβ)_i + σε m_k, σε² s_k²}   when a_ik = 1,   for k = 1, . . . , K.    (2.9)
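The auxiliary representation also gives a direct recipe for simulating normal mixture responses: draw the indicator a_i, then draw yi from the corresponding normal component. A minimal R sketch with placeholder values of (w, m, s), β and σε:

## Simulate n responses via (2.9): draw a_i ~ Multinomial(1; w), then
## y_i | a_ik = 1 ~ N{(X beta)_i + sig * m_k, sig^2 s_k^2}.  Values are placeholders.
set.seed(2)
n <- 200
w <- c(0.45, 0.35, 0.20); m <- c(0.2, 1.5, 4); s <- c(0.7, 1.3, 2.5)
beta <- c(1, 2); sig <- 0.5
x <- runif(n); X <- cbind(1, x)
k <- sample(seq_along(w), n, replace = TRUE, prob = w)   # auxiliary indicators a_i
y <- drop(X %*% beta) + sig * m[k] + sig * s[k] * rnorm(n)
head(cbind(x, k, y))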
The conditional dependence structure of Model (2.8) is illustrated in the directed acyclic graph (DAG) in Figure 2.2.
Figure 2.2: Directed acyclic graph for Model (2.8), with nodes a, β, σε² and y.
The DAG allows us to observe relationships between the observed data y and the model parameters β, a = (a_1, . . . , a_n) and σε² in a simplified,
visual manner. The fixed quantities w, m and s are omitted from the DAG as they do not
require inference. The importance of the DAG as a tool in application of MFVB inference
to more complex models will become clear as the chapter develops.
2.5.2 Mean field variational Bayes
Here we present a MFVB algorithm for Model (2.8). We impose the product restriction
q(β, a, σε²) = q(β) q(a) q(σε²)    (2.10)
on the posterior p(β, a, σε2 |y). The optimal q ∗ densities for the parameters in Model (2.8)
under product restriction (2.10) take the form:
q*(β) ∼ N(µ_q(β), Σ_q(β)),
q*(σε²) = (σε²)^{−Aε−n/2−1} exp{C1/(σε²)^{1/2} − C2/σε²} / {2 J⁺(2Aε + n − 1, C1, C2)},   σε² > 0,
q*(a_i) ∼ Multinomial(1; µ_q(a_i)) independently, 1 ≤ i ≤ n.
Algorithm 1 gives details of an iterative scheme for finding optimal q ∗ moments for all
key model parameters under product restriction (2.10). Algorithm 1 uses lower bound
(2.11) to monitor convergence. Derivations for the optimal densities, updates in Algorithm 1 and lower bound (2.11) are deferred to Appendix 2.A. The lower bound corresponding to Algorithm 1 is given by
log p(y; q) = d/2 − (n/2) log(2π) + log(2) + A log(B) − log{Γ(A)}
+ log J⁺(2A + n − 1, C1, C2)
+ Σ_{k=1}^{K} µ_q(a•k) {log(w_k/s_k) − m_k²/(2s_k²)} + ½ log(|Σ_q(β)| / |Σβ|)    (2.11)
− ½ {tr(Σβ⁻¹ Σ_q(β)) + (µ_q(β) − µβ)ᵀ Σβ⁻¹ (µ_q(β) − µβ)}
− Σ_{i=1}^{n} Σ_{k=1}^{K} µ_q(a_ik) log(µ_q(a_ik)),
where µ_q(a•k) = Σ_{i=1}^{n} µ_q(a_ik).
2.6 Generalized Extreme Value additive model
Now that we have derived an algorithm for approximate Bayesian inference for a Normal-Mixture response regression model, we proceed to extend our methodology to do the
same for a GEV additive model. This extension involves three major changes: (1) use
of multiple predictors (x1 , . . . , xd ); (2) incorporation of spline basis functions into our
model to capture non-linear trends in the data; and (3) use of structured MFVB to bring
together estimates from Normal-Mixture approximate inference at each ξ ∈ Ξ.
Initialize: µ_q(β), Σ_q(β), and µ_q(1/σε), µ_q(1/σε²) > 0.
Cycle:
Update q*(a) parameters: for i = 1, . . . , n, k = 1, . . . , K:
ν_ik ← log(w_k/s_k) − (1/(2s_k²)) [ µ_q(1/σε²) {(y − Xµ_q(β))_i² + (X Σ_q(β) Xᵀ)_ii} − 2 m_k µ_q(1/σε) (y − Xµ_q(β))_i + m_k² ]
µ_q(a_ik) ← e^{ν_ik} / Σ_{k′=1}^{K} e^{ν_ik′} ;   D_{µq(a_k)} ← diag{µ_q(a_1k), . . . , µ_q(a_nk)}
Update q*(β) parameters:
Σ_q(β) ← { Xᵀ µ_q(1/σε²) ( Σ_{k=1}^{K} (1/s_k²) D_{µq(a_k)} ) X + Σβ⁻¹ }⁻¹
µ_q(β) ← Σ_q(β) [ Xᵀ { µ_q(1/σε²) Σ_{k=1}^{K} (1/s_k²) D_{µq(a_k)} y − µ_q(1/σε) Σ_{k=1}^{K} (m_k/s_k²) D_{µq(a_k)} 1 } + Σβ⁻¹ µβ ]
Update q*(σε²) parameters:
C1 ← Σ_{k=1}^{K} (m_k/s_k²) 1ᵀ D_{µq(a_k)} (y − Xµ_q(β))
C2 ← B + ½ Σ_{k=1}^{K} (1/s_k²) { tr(Xᵀ D_{µq(a_k)} X Σ_q(β)) + (y − Xµ_q(β))ᵀ D_{µq(a_k)} (y − Xµ_q(β)) }
µ_q(1/σε²) ← J⁺(2A + n + 1, C1, C2) / J⁺(2A + n − 1, C1, C2) ;   µ_q(1/σε) ← J⁺(2A + n, C1, C2) / J⁺(2A + n − 1, C1, C2)
until the increase in p(y; q) is negligible.
Algorithm 1: Mean field variational Bayes algorithm for Model (2.8) under product restriction (2.10).
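One practical note on the q*(a) update in Algorithm 1: the weights µ_q(a_ik) form a softmax of the ν_ik, which is best computed after subtracting each row's maximum so that the exponentials cannot overflow. A minimal R sketch (not the thesis code):

## Numerically stable computation of mu_{q(a_ik)} <- exp(nu_ik) / sum_k exp(nu_ik),
## applied row-wise to an n x K matrix of nu values (placeholder values shown).
softmax_rows <- function(nu) {
  nu <- nu - apply(nu, 1, max)     # subtract each row's maximum before exponentiating
  e  <- exp(nu)
  e / rowSums(e)
}
nu <- matrix(rnorm(5 * 3, sd = 50), 5, 3)   # illustrative nu_ik values
mu_q_a <- softmax_rows(nu)
rowSums(mu_q_a)                              # each row sums to 1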
2.6.1 Model
Let yi , 1 ≤ i ≤ n be a set of response variables for which a GEV(µi , σε , ξ) distribution is
appropriate. We assume that the means, µi , are of the form:
µi = f1(x1i) + · · · + fd(xdi),    (2.12)
where, for each 1 ≤ i ≤ n, (x1i , . . . , xdi ) is a vector of continuous predictor variables and
the f1 , . . . , fd are smooth functions.
Padoan and Wand (2008) describe the mixed model based penalized spline approach
to estimating parameters in additive models for sample extremes. The mixed model
based approach facilitates the use of Bayesian inference methods such as MCMC and
MFVB. Explicitly, the right hand side of (2.12) is modelled as:
Σ_{ℓ=1}^{d} f_ℓ(x_ℓ) = β0 + Σ_{ℓ=1}^{d} { β_ℓ x_ℓ + Σ_{k=1}^{K_ℓ} u_{ℓ,k} z_{ℓ,k}(x_ℓ) },
with
u_{ℓ,1}, . . . , u_{ℓ,K_ℓ} | σ_ℓ² ∼ N(0, σ_ℓ²) independently,
for each 1 ≤ ` ≤ d. The {z`,1 (·), . . . , z`,K` (·)} are a set of spline basis functions that allow
estimation of f_ℓ. Specific choices of the spline basis functions vary. O'Sullivan penalized splines are described in Wand and Ormerod (2008). Welham, Cullis,
Kenward and Thompson (2007) described the close connection between three types of
polynomial mixed model splines: smoothing splines, P-splines and penalized splines.
We choose to work with O’Sullivan penalized splines here. We define the matrices:




β = (β0, β1, . . . , βd)ᵀ,   u_ℓ = (u_{ℓ,1}, . . . , u_{ℓ,K_ℓ})ᵀ,   u = (u_1ᵀ, . . . , u_dᵀ)ᵀ,
X = [1  x_{i1}  · · ·  x_{id}]_{1≤i≤n}   and   Z_ℓ = [z_{ℓ,1}(x_{iℓ})  · · ·  z_{ℓ,K_ℓ}(x_{iℓ})]_{1≤i≤n},
and further set Z = [Z 1 . . . Z d ]. We then define the Bayesian GEV additive model as:
yi | β, u, σε, ξ ∼ GEV{(Xβ + Zu)_i, σε, ξ} independently, 1 ≤ i ≤ n,
u | σ_u1², . . . , σ_ud² ∼ N{0, blockdiag(σ_u1² I, . . . , σ_ud² I)},
β ∼ N(0, Σβ),   σ_uℓ² ∼ Inverse-Gamma(A_uℓ, B_uℓ) independently, 1 ≤ ℓ ≤ d,    (2.13)
σε² ∼ Inverse-Gamma(Aε, Bε),   ξ ∼ p(ξ), ξ ∈ Ξ,
where Σβ is a symmetric and positive definite (d+1)×(d+1) matrix and Aε , Bε , Au` , Bu` >
0 are hyperparameters for the variance component prior distributions. The set Ξ of prior
atoms for ξ is assumed to be finite. The DAG for (2.13) is shown in Figure 2.3.
Figure 2.3: Directed acyclic graph representation of Model (2.13), with nodes σ_u1², . . . , σ_ud², u_1, . . . , u_d, β, σε², ξ and y.
As in the univariate and simple regression case, for fixed ξ, we use an accurate finite mixture
of normal densities to approximate the GEV response. Recall that fGEV (·; ξ) denotes the
GEV (0, 1, ξ) family of density functions, given by (2.2). In practice, we replace each
fGEV (·; ξ), ξ ∈ Ξ by an extremely accurate normal mixture approximation:
fGEV(x; ξ) ≈ Σ_{k=1}^{K} (w_k/s_k) φ{(x − m_k)/s_k}.
Model (2.13) becomes
yi | β, u, σε, ξ ∼ Normal-Mixture{(Xβ + Zu)_i, σε, w, m, s} independently, 1 ≤ i ≤ n,
u | σ_u1², . . . , σ_ud² ∼ N{0, blockdiag(σ_u1² I, . . . , σ_ud² I)},
β ∼ N(0, Σβ),   σ_uℓ² ∼ Inverse-Gamma(A_uℓ, B_uℓ) independently, 1 ≤ ℓ ≤ d,    (2.14)
σε² ∼ Inverse-Gamma(Aε, Bε),
and forms the first step towards MFVB inference for the GEV additive model. Again, as
in the regression case, we can rewrite (2.14) as
p(yi | β, u, a_i, σε, ξ) = Π_{k=1}^{K} [ {1/(σε s_k)} φ{((y − Xβ − Zu)_i − σε m_k)/(σε s_k)} ]^{a_ik},   1 ≤ i ≤ n,
a_i ∼ Multinomial(1, w) independently,
u | σ_u1², . . . , σ_ud² ∼ N{0, blockdiag(σ_u1² I, . . . , σ_ud² I)},
β ∼ N(0, Σβ),   σ_uℓ² ∼ Inverse-Gamma(A_uℓ, B_uℓ) independently, 1 ≤ ℓ ≤ d,    (2.15)
σε² ∼ Inverse-Gamma(Aε, Bε).
2.6.2 Mean field variational Bayes for the finite normal mixture response additive model
In this section we present the derivation of a MFVB algorithm for Model (2.15). We begin
by imposing the product restriction
q(β, u, σε², σ_u1², . . . , σ_ud², a) = q(β, u) q(σε², σ_u1², . . . , σ_ud²) q(a).    (2.16)
A good starting point is to look at the DAG for Model (2.15), illustrated in Figure 2.4. It
shares many similarities with the DAG for Model (2.8) illustrated in Figure 2.2. Essentially the structure of each model is the same, with the addition of the random effects u_ℓ and their underlying variances σ_uℓ² in the additive model. This is where the locality property of mean field variational Bayes comes to the fore through Markov blanket theory.
As set out in Section 1.4, the distribution of a node within a DAG depends only on
the nodes within its Markov blanket. But what impact does this locality property have on
the derivation of the optimal q ∗ densities for Model (2.15)? Comparing Figures 2.2 and
2.4, similarities can be seen between the DAGs for the simple and more complex model.
The parameter β in the simple regression model plays the same role as (β, u) in the additive model. So, we expect to see similarities between the optimal q ∗ densities of β in
the simple regression case and (β, u) in the additive model case.
Figure 2.4: Directed acyclic graph representation of Model (2.15), with nodes σ_u1², . . . , σ_ud², u_1, . . . , u_d, a, β, σε² and y.
Figure 2.4 also illustrates the parameters in the Markov blanket of σ_uℓ², namely (β, u) and the other variance components σ_u1², . . . , σ_u,ℓ−1², σ_u,ℓ+1², . . . , σ_ud². Hence the optimal density of σ_uℓ² will have no dependence on the form of the response y, nor on the parameters σε² and a.
The optimal q ∗ densities for the parameters in Model (2.15) take the form:
q*(β, u) ∼ N(µ_q(β,u), Σ_q(β,u)),
q*(σ_uℓ²) ∼ Inverse-Gamma(A_uℓ + K_ℓ/2, B_q(σ_uℓ²)) independently, 1 ≤ ℓ ≤ d,
q*(σε²) = (σε²)^{−Aε−n/2−1} exp{C3/(σε²)^{1/2} − C4/σε²} / {2 J⁺(2Aε + n − 1, C3, C4)},   σε² > 0,
q*(a_i) ∼ Multinomial(1; µ_q(a_i)) independently, 1 ≤ i ≤ n.
Algorithm 2 gives an iterative scheme for finding the moments of the optimal q ∗ densities
stated above. The expression for the corresponding lower bound on the marginal loglikelihood is given by (2.17). Full derivations of the optimal q ∗ densities, Algorithm 2
and lower bound (2.17) are presented in Appendix 2.A. The lower bound corresponding
Initialize: µ_q(β,u), Σ_q(β,u), and µ_q(1/σε), µ_q(1/σε²), µ_q(1/σ_uℓ²) > 0 for 1 ≤ ℓ ≤ d. Here C ≡ [X Z].
Cycle:
Update q*(a) parameters for i = 1, . . . , n, k = 1, . . . , K:
ν_ik ← log(w_k/s_k) − (1/(2s_k²)) [ µ_q(1/σε²) {(y − Cµ_q(β,u))_i² + (C Σ_q(β,u) Cᵀ)_ii} − 2 m_k µ_q(1/σε) (y − Cµ_q(β,u))_i + m_k² ]
µ_q(a_ik) ← e^{ν_ik} / Σ_{k′=1}^{K} e^{ν_ik′} ;   D_{µq(a_k)} ← diag{µ_q(a_1k), . . . , µ_q(a_nk)}
Update q*(β, u) parameters:
Σ_q(β,u) ← { Cᵀ µ_q(1/σε²) ( Σ_{k=1}^{K} (1/s_k²) D_{µq(a_k)} ) C + blockdiag( Σβ⁻¹, µ_q(1/σ_u1²) I_{K_1}, . . . , µ_q(1/σ_ud²) I_{K_d} ) }⁻¹
µ_q(β,u) ← Σ_q(β,u) Cᵀ { µ_q(1/σε²) Σ_{k=1}^{K} (1/s_k²) D_{µq(a_k)} y − µ_q(1/σε) Σ_{k=1}^{K} (m_k/s_k²) D_{µq(a_k)} 1 }
Update q*(σε²) parameters:
C3 ← Σ_{k=1}^{K} (m_k/s_k²) 1ᵀ D_{µq(a_k)} (y − Cµ_q(β,u))
C4 ← Bε + ½ Σ_{k=1}^{K} (1/s_k²) { tr(Cᵀ D_{µq(a_k)} C Σ_q(β,u)) + (y − Cµ_q(β,u))ᵀ D_{µq(a_k)} (y − Cµ_q(β,u)) }
µ_q(1/σε²) ← J⁺(2Aε + n + 1, C3, C4) / J⁺(2Aε + n − 1, C3, C4) ;   µ_q(1/σε) ← J⁺(2Aε + n, C3, C4) / J⁺(2Aε + n − 1, C3, C4)
Update q*(σ_uℓ²) parameters for ℓ = 1, . . . , d:
B_q(σ_uℓ²) ← B_uℓ + ½ { tr(Σ_q(u_ℓ)) + ‖µ_q(u_ℓ)‖² } ;   µ_q(1/σ_uℓ²) ← (A_uℓ + K_ℓ/2) / B_q(σ_uℓ²)
until the increase in p(y; q) is negligible.
Algorithm 2: Mean field variational Bayes algorithm for Model (2.15) under product restriction (2.16).
to Algorithm 2 is given by
log p(y; q) = ½ (1 + d + Σ_{ℓ=1}^{d} K_ℓ) − (n/2) log(2π) + log(2) + Aε log(Bε) − log Γ(Aε)
+ log J⁺(2Aε + n − 1, C3, C4) + ½ log |Σ_q(β,u)| − ½ log |Σβ| − ½ tr(Σβ⁻¹ Σ_q(β)) − ½ µ_q(β,u)ᵀ Σβ⁻¹ µ_q(β,u)
+ Σ_{ℓ=1}^{d} { A_uℓ log B_uℓ − log Γ(A_uℓ) − (A_uℓ + K_ℓ/2) log B_q(σ_uℓ²) + log Γ(A_uℓ + K_ℓ/2) }    (2.17)
+ Σ_{k=1}^{K} µ_q(a•k) { log(w_k/s_k) − m_k²/(2s_k²) } − Σ_{i=1}^{n} Σ_{k=1}^{K} µ_q(a_ik) log(µ_q(a_ik)),
where µ_q(a•k) = Σ_{i=1}^{n} µ_q(a_ik).
2.6.3 Structured mean field variational Bayes
Now that a MFVB algorithm has been developed to carry out approximate Bayesian inference for fixed ξ, we need to complete the final step to allow complete MFVB inference
for Model (2.15). For each fixed ξ ∈ Ξ, we use Algorithm 2 to get approximations to
the conditional posteriors p(β, u | y, ξ), p(σε² | y, ξ) and p(σ_uℓ² | y, ξ), 1 ≤ ℓ ≤ d. We denote these variational Bayes approximations by q*(β, u | ξ), q*(σε² | ξ) and q*(σ_uℓ² | ξ), 1 ≤ ℓ ≤ d, respectively.
Using results (2.4), (2.5) and (2.6) set out in Section 2.4, we combine the approximate
posteriors across all ξ ∈ Ξ via Algorithm 3.
2.7 Displaying additive model fits
The majority of this chapter has described the process for obtaining approximate posterior distributions for the model parameters. The next step is to translate these into
meaningful graphical displays consisting of fitted curves and corresponding variability
bands.
Essentially, to produce graphical summaries, each explanatory variable is plotted
against the response, with all other predictor variables held at their mean values. In
the case of the bivariate function of the explanatory variables latitude and longitude, the
graphical summary plots the fit of the random variable maximum annual rainfall as a
contour plot against geographical location.
In practice, we set up a grid over the domain of each explanatory variable. For exam-
For each ξ ∈ Ξ:
1. Retrieve the normal mixture approximation vectors (w_{k,ξ}, m_{k,ξ}, s_{k,ξ}), 1 ≤ k ≤ K, for approximation of the GEV(0, 1, ξ) density function.
2. Apply Algorithm 2 with (w_k, m_k, s_k) set to (w_{k,ξ}, m_{k,ξ}, s_{k,ξ}) for 1 ≤ k ≤ K.
3. Store the parameters needed to define q*(β, u | ξ), q*(σε² | ξ) and q*(σ_uℓ² | ξ), 1 ≤ ℓ ≤ d.
4. Store the converged marginal likelihood lower bound p(y | ξ).
Form the approximations to the posteriors p(ξ | y), p(β, u | y, ξ), p(σε² | y, ξ) and p(σ_uℓ² | y, ξ), 1 ≤ ℓ ≤ d, via:
q*(ξ) = p(ξ) p(y|ξ) / Σ_{ξ′∈Ξ} p(ξ′) p(y|ξ′),
q*(β, u) = Σ_{ξ∈Ξ} q*(ξ) q*(β, u | ξ),   q*(σε²) = Σ_{ξ∈Ξ} q*(ξ) q*(σε² | ξ),
q*(σ_uℓ²) = Σ_{ξ∈Ξ} q*(ξ) q*(σ_uℓ² | ξ),   1 ≤ ℓ ≤ d.
Form the approximate marginal likelihood p(y; q) = Σ_{ξ∈Ξ} q*(ξ) p(y|ξ).
Algorithm 3: Summary of the finite normal mixture approach to structured MFVB inference for the GEV additive model (2.15).
ple, for the first predictor, say we set up a grid of size M , g 1 = (g11 , . . . , g1M ). In order
to facilitate the alignment of the vertical axis with the response data we let the remaining
grids for the other predictor variables be defined by g_ℓ = x̄_ℓ 1_M, where 1_M is the M × 1
vector of ones.
We then define the matrices:
X_g^(1) ≡ [1  g_1  · · ·  g_d],   Z_{ℓg}^(1) ≡ [z_{ℓ,1}(g_ℓ)  · · ·  z_{ℓ,K_ℓ}(g_ℓ)],   1 ≤ ℓ ≤ d,
and
C_g^(1) ≡ [X_g^(1) | Z_{1g}^(1)  · · ·  Z_{dg}^(1)].
The approximate posterior mean of f1 (g 1 ), the function approximating the contribution
of the first predictor variable to the response, is given by
f̂_1 = C_g^(1) µ_q(β,u) = Σ_{ξ∈Ξ} q*(ξ) C_g^(1) µ_q(β,u|ξ),
and the display of this fit is achieved by plotting f̂_1 against g_1.
Now to the idea of fitting point-wise 95% credible intervals in order to produce variability bands for our fitted curves. In order to produce these credible intervals, we need to obtain 0.025 and 0.975 quantiles of our MFVB approximations to the quantity
C_g^(1) (βᵀ, uᵀ)ᵀ.
The MFVB approximation to the posterior of (βᵀ, uᵀ)ᵀ, i.e. q*(β, u) as presented in Algorithm 3, has a finite normal mixture form. Thus, our problem reduces to finding the 0.025 and 0.975 quantiles of a finite normal mixture at each point on our grid. We require the following result.
Result 2.1 Suppose that the r × 1 vector x has the finite normal mixture density function
p(x) = Σ_{ℓ=1}^{L} ω_ℓ (2π)^{−r/2} |Σ_ℓ|^{−1/2} exp{−½ (x − µ_ℓ)ᵀ Σ_ℓ⁻¹ (x − µ_ℓ)},
where Σ_{ℓ=1}^{L} ω_ℓ = 1 and, for 1 ≤ ℓ ≤ L, ω_ℓ > 0, the µ_ℓ are unrestricted r × 1 vectors and the Σ_ℓ are r × r symmetric positive definite matrices. Write this as
x ∼ ω1 N (µ1 , Σ1 ) + . . . + ωL N (µL , ΣL ).
Then, for any constant r × 1 vector α,
αᵀx ∼ ω_1 N(αᵀµ_1, αᵀΣ_1α) + · · · + ω_L N(αᵀµ_L, αᵀΣ_Lα).
For 1 ≤ j ≤ M , let ej be the M × 1 vector having j th entry equal to one and all other
entries equal to zero. Using Result 2.1, the 95% credible interval limits for our fitted curve
are the 0.025 and 0.975 quantiles of
Σ_{ξ∈Ξ} q*(ξ) N{ e_jᵀ C_g^(1) µ_q(β,u|ξ),  e_jᵀ C_g^(1) Σ_q(β,u|ξ) (C_g^(1))ᵀ e_j }.
To be clear, the ej vector picks out the distinct univariate finite normal mixture at the j th
gridpoint. So, 95% credible intervals are found by simply computing the 0.025 and 0.975
quantiles of a univariate finite normal mixture at each gridpoint.
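A minimal R sketch of this final step (not the thesis code; the weights, means and standard deviations are placeholders for the mixture arising at one gridpoint): the 0.025 and 0.975 quantiles are obtained by root-finding on the mixture cumulative distribution function.

## Quantiles of a univariate finite normal mixture sum_l omega_l N(mu_l, sd_l^2),
## obtained by inverting the mixture CDF numerically.  Values are placeholders.
mix_cdf <- function(x, omega, mu, sd) sum(omega * pnorm(x, mu, sd))
mix_quantile <- function(p, omega, mu, sd) {
  lo <- min(mu - 10 * sd); hi <- max(mu + 10 * sd)      # bracket the root
  uniroot(function(x) mix_cdf(x, omega, mu, sd) - p, c(lo, hi))$root
}
omega <- c(0.6, 0.3, 0.1); mu <- c(0.0, 0.5, 1.2); sd <- c(0.2, 0.3, 0.4)
c(lower = mix_quantile(0.025, omega, mu, sd),
  upper = mix_quantile(0.975, omega, mu, sd))           # pointwise 95% credible limits

Repeating this at each gridpoint j, with the mixture taken from the display above, traces out the variability band.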
2.8 Geoadditive extension
In order to extend Model (2.13) to include a geographical term, we model the mean µi as
µi = f1(x1i) + · · · + fd(xdi) + g(x_i),   1 ≤ i ≤ n,    (2.18)
where (x1i , . . . , xdi ) is a vector of continuous predictor variables and the f1 , . . . , fd are
smooth, but otherwise arbitrary, functions. The xi is a 2×1 vector containing latitude and
longitude measurements. Equation (2.18) is modelled further via spline basis functions
as
Σ_{ℓ=1}^{d} f_ℓ(x_ℓ) + g(x) = β0 + Σ_{ℓ=1}^{d} { β_ℓ x_ℓ + Σ_{k=1}^{K_ℓ} u_{ℓ,k} z_{ℓ,k}(x_ℓ) } + β_geoᵀ x + Σ_{k=1}^{K_geo} u_k^geo z_k^geo(x),
with
u_{ℓ,1}, . . . , u_{ℓ,K_ℓ} | σ_uℓ² ∼ N(0, σ_uℓ²) independently, 1 ≤ ℓ ≤ d,
and
u_1^geo, . . . , u_{K_geo}^geo | σ_u,geo² ∼ N(0, σ_u,geo²) independently.
Here {z`,1 (·), . . . , z`,K` (·), zkgeo (·)} is a set of spline basis functions for estimation of both f`
and g. We define the matrices

β = (β0, β1, . . . , βd, β_geoᵀ)ᵀ,   X = [1  x_{i1}  · · ·  x_{id}  x_iᵀ]_{1≤i≤n},
u_ℓ = (u_{ℓ,1}, . . . , u_{ℓ,K_ℓ})ᵀ,   u^geo = (u_1^geo, . . . , u_{K_geo}^geo)ᵀ,   u = (u_1ᵀ, . . . , u_dᵀ, (u^geo)ᵀ)ᵀ,
Z_ℓ = [z_{ℓ,1}(x_{iℓ})  · · ·  z_{ℓ,K_ℓ}(x_{iℓ})]_{1≤i≤n}.
The number and position of the knots κ_k are generally chosen using a space filling algorithm as described in Ruppert, Wand & Carroll (2003). Form the matrices
Z_K = [ ‖x_i − κ_k‖² log ‖x_i − κ_k‖ ]_{1≤i≤n, 1≤k≤K_geo}
and
Ω = [ ‖κ_k − κ_{k′}‖² log ‖κ_k − κ_{k′}‖ ]_{1≤k,k′≤K_geo},
and then find the singular value decomposition of Ω, Ω = U diag(d) Vᵀ, and use this to obtain the matrix square root of Ω:
Ω^{1/2} = U diag(√d) Vᵀ.
We then compute
Z_geo = Z_K Ω^{−1/2},
and define Z = [Z 1 · · · Z d Z geo ]. Then a Bayesian GEV geoadditive model is
yi | β, u, σε, ξ ∼ GEV{(Xβ + Zu)_i, σε, ξ} independently, 1 ≤ i ≤ n,
u | σ_u1², . . . , σ_ud², σ_u,geo² ∼ N{0, blockdiag(σ_u1² I, . . . , σ_ud² I, σ_u,geo² I)},
β ∼ N(0, Σβ),   σ_uℓ² ∼ IG(A_uℓ, B_uℓ) independently, 1 ≤ ℓ ≤ d,    (2.19)
σ_u,geo² ∼ IG(A_u,geo, B_u,geo),   σε² ∼ IG(Aε, Bε),   ξ ∼ p(ξ), ξ ∈ Ξ,
ξ ∼ p(ξ), ξ ∈ Ξ,
where Σβ is a symmetric and positive definite (d+1)×(d+1) matrix and Aε , Bε , Au` , Bu` >
0 are hyperparameters for the variance component prior distributions. The set Ξ of prior
atoms for ξ is assumed to be finite. The MFVB algorithm for carrying out approximate
inference for Model (2.19) is identical to that for Model (2.13), with an added variance
component for geographical location.
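A minimal R sketch of the Z_geo construction described above (illustrative only: the locations and knots are random placeholders rather than weather station coordinates or output of a space filling algorithm):

## Low-rank thin plate spline design Z_geo = Z_K Omega^{-1/2} from 2 x 1 location
## vectors x_i and knots kappa_k.  Locations and knots below are placeholders.
tps <- function(r) ifelse(r > 0, r^2 * log(r), 0)            # ||.||^2 log ||.||, with value 0 at r = 0
xy    <- cbind(runif(100, 149, 153), runif(100, -36, -31))   # placeholder (longitude, latitude)
kappa <- xy[sample(nrow(xy), 25), ]                          # placeholder knots
D     <- as.matrix(dist(rbind(xy, kappa)))                   # all pairwise distances
ZK    <- tps(D[1:nrow(xy), nrow(xy) + 1:nrow(kappa)])        # ||x_i - kappa_k|| entries
Omega <- tps(as.matrix(dist(kappa)))
sv    <- svd(Omega)                                          # Omega = U diag(d) V'
Omega_half <- sv$u %*% diag(sqrt(sv$d)) %*% t(sv$v)          # Omega^{1/2} as in the text
Zgeo  <- ZK %*% solve(Omega_half)                            # Z_geo = Z_K Omega^{-1/2}
dim(Zgeo)                                                    # n x K_geo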
2.9 New South Wales maximum rainfall data analysis
In this section we apply our MFVB methodology for the GEV additive model to the
Sydney hinterland maximum rainfall data. Fifty weather stations were selected from
throughout Sydney and surrounds. Between 5 and 49 years of data was available from
each of these stations. The response variable, maximum rainfall, is presented in Figure
2.5.
Figure 2.5: Annual winter maximum rainfall (mm) at 50 weather stations in the Sydney, Australia, hinterland, plotted against number of years since 1955.
The following variables comprise the Sydney hinterland maximum rainfall data:
winter max. rainfall: maximum rainfall (mm) for the annual winter period, defined as April to September inclusive;
year: year (1955-2003);
day in season: day of winter period (i.e. number of days since 31st March within the current year);
OHA: Ocean Heat content Anomaly (10²² joules);
SOI: Southern Oscillation Index;
PDO: Pacific Decadal Oscillation;
longitude: degrees longitude of weather station;
latitude: degrees latitude of weather station.
We now provide some explanation of the predictor variables OHA, SOI and PDO, and
why they are considered possible climate drivers for rainfall.
Figure 2.6: MFVB univariate functional fits in the GEV additive model (2.20) for the Sydney hinterland rainfall data (panels: year, day in season, ocean heat anomaly, Southern Oscillation Index and Pacific Decadal Oscillation, each plotted against mean maximum rainfall). The vertical axis in each panel is such that all additive functions are included, with the horizontal axis predictor varying and the other predictors set at their average values, as described in Section 2.7. The grey region corresponds to approximate pointwise 95% credible sets.
Ocean heat content anomaly (OHA) provides a measure of the amount of heat held
by the ocean in a particular region and depth. We used the time series of quarterly OHA
for the 0 - 700m layer in the Southern Pacific Ocean Basin. Data for OHA was obtained
from the US National Oceanographic and Data Center website at www.nodc.noaa.gov/
OC5/3M HEAT CONTENT. Levitus, Antonov, Wang, Delworth, Dixon and Broccoli (2001)
and Willis, Roemmich and Cornuelle (2004) both state that over the past 40 years, the
world’s oceans have been the dominant source of changes in global heat content. Hence
OHA has the potential to play a role in affecting rainfall patterns.
The Southern Oscillation Index (SOI), a unitless quantity, measures the air pressure difference between Tahiti and Darwin. We used monthly values of SOI, obtained from the Australian Bureau of Meteorology website www.bom.gov.au/climate/current/soihtm1.shtml. In general, sustained positive values of SOI indicate above average rainfall over northern and eastern Australia. Conversely, negative values of SOI indicate below average rainfall over the north and east of the continent.
Figure 2.7: MFVB bivariate functional fit for geographical location (degrees latitude versus degrees longitude) in the GEV additive model (2.20) for the Sydney hinterland rainfall data. The weather station locations are shown as grey dots. The black dots show the locations of six cities and towns (Tamworth, Taree, Orange, Sydney, Goulburn and Batemans Bay) with names as labelled.
Pacific decadal oscillation (PDO) is a monthly measure of sea surface temperature
anomalies in the North Pacific Ocean. Like SOI, PDO is also a unitless quantity. Pacific
Decadal Oscillation provides a measure of Pacific climate variability. We obtained data
from the website jisao.washington.edu/pdo/PDO.latest.
We imposed the following GEV geoadditive model on our data:
winter max. rainfall_i ∼ GEV{f1(year_i) + f2(day in season_i) + f3(OHA_i) + f4(SOI_i) + f5(PDO_i) + g(longitude_i, latitude_i), σε, ξ} independently,    (2.20)
for 1 ≤ i ≤ n, where n = 1874 is the total number of winter maximum rainfall measurements from the 50 weather stations illustrated in Figure 2.5 between the years 1955 and 2003 (not all stations had this full set of years). Model (2.20) was then fitted via Algorithms 2 and 3.
Figure 2.8: The prior and MFVB approximate posterior probability mass functions for the GEV shape parameter ξ in the GEV additive model (2.20) for the Sydney hinterland maximum rainfall data.
Univariate function estimates fˆ1 , . . . , fˆ5 were each constructed using 37 O’Sullivan
spline basis functions (described in Wand and Ormerod, 2008). The bivariate function,
estimated by ĝ, used 50 bivariate thin plate spline basis functions as set out in section
13.5 of Ruppert, Wand and Carroll (2003). Knots were set at the weather stations. Hyperparameters were set to Σβ = 10⁸ I, Aε = Bε = Au = Bu = 0.01 and p(ξ) uniform on
Ξ = {0.00, 0.01, . . . , 0.50}.
We anticipate both spatial and temporal correlation to be present within our data.
Including a smooth function of year is frequently used in additive model analysis of environmental time series data for the purpose of temporal de-correlation. Some examples
are Wand and Schwartz (2002) and Dominici, McDermott and Hastie (2004). We also include a smooth function of geographical location to handle the anticipated spatial correlation.
Figure 2.9: Accuracy comparison between MFVB and MCMC for a single predictor model (d = 1). Top left panel: the simulated data together with the MFVB and MCMC estimates of f and pointwise 95% credible sets. Top right panel: the same as the top left panel but without the data and with the frame modified to zoom in on the function estimates. Bottom left panel: the same as the top right panel, but with the frame modified to zoom in on the region surrounding the peaks of the two function estimates. Bottom right panel: posterior probability mass functions for ξ based on MFVB and MCMC fitting. The accuracy shown indicates that there is 83% commonality between the true posterior and the MFVB approximation, measured using the L1 error.
Figures 2.6 and 2.7 illustrate the estimated univariate and bivariate functions resulting
from fitting Model (2.20) to the Sydney hinterland maximum rainfall data via Algorithms
2 and 3. Figure 2.6 was constructed using the plotting scheme set out in Section 2.7.
Starting with the univariate fits, the smooth function of year oscillates corresponding
to the dry and wet periods in the Sydney hinterland over the past 50 years. The effect
of OHA is weakly nonlinear up to approximately OHA = 1.7, then turns sharply upward for OHA > 1.7. SOI shows an approximately piecewise linear effect.
For SOI ≤ 10, there is a positive effect on the response. This is in agreement with the
explanation of SOI, with positive values associated with higher rainfall. In contrast, for
SOI > 10, there is a negative effect on maximum rainfall. The estimate of the effect of
PDO shows an interesting oscillatory relationship with maximum rainfall.
Figure 2.7 illustrates well known rainfall patterns throughout the Sydney Hinterland.
We see higher rainfall along the NSW coastal plain and orographic effects due to the
Great Dividing Range.
Figure 2.8 illustrates the posterior probability mass function for the shape parameter
ξ. The uniform prior for ξ is also shown for comparative purposes. The majority of the
probability mass lies between 0.15 and 0.27, with the mode at ξ = 0.21.
2.10 Comparisons with Markov chain Monte Carlo
The major reason for using approximate Bayesian inference is to overcome either computational storage or time constraints. In this section, we compare our MFVB inference
with MCMC inference for a simplified single predictor GEV regression model. Although
MCMC inference is itself approximate, given ample computing time it can be made extremely accurate. Hence MCMC serves as our baseline. The accuracy measure used
throughout this thesis is described in detail in Section 1.6.
Figure 2.9 provides an illustrative summary of the accuracy comparisons between
MFVB and MCMC for a single predictor (d = 1) model. The sample size is n = 500, and
the data were simulated according to
yi ∼ GEV{f(xi), 0.5, 0.3} independently,
with f(x) = sin(2πx²). The xi values were generated to follow the uniform distribution
on (0, 1). Hyperparameters were set to Σβ = 10⁸ I, Aε = Bε = Au = Bu = 0.01 and p(ξ)
uniform on Ξ = {0.00, 0.08, . . . , 0.40}. Iterations of the MFVB algorithm were terminated
when the change in the lower bound reached less than 10⁻¹⁰. Ten thousand MCMC
iterations were carried out in the R package BRugs (Ligges, Thomas, Spiegelhalter, Best,
Lunn, Rice and Sturtz, 2011). A burn-in of 5000 was used, with the subsequent 5000
iterations then thinned by a factor of 5. MFVB inference took 4.5 minutes. In contrast,
MCMC inference using the same model took just over 21 hours.
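For reference, data of this form can be generated directly from the GEV quantile function, Q(p) = µ + σ{(−log p)^(−ξ) − 1}/ξ for ξ ≠ 0. A minimal R sketch (not the thesis simulation code):

## Simulate y_i ~ GEV{f(x_i), sigma, xi} with f(x) = sin(2*pi*x^2) via the inverse CDF.
set.seed(3)
n <- 500; sigma <- 0.5; xi <- 0.3
x <- runif(n)
f <- sin(2 * pi * x^2)
u <- runif(n)
y <- f + sigma * ((-log(u))^(-xi) - 1) / xi
plot(x, y, cex = 0.4)   # scatter resembling the top left panel of Figure 2.9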
The first three panels in Figure 2.9 illustrate MFVB-based and MCMC-based estimates
and corresponding 95% credible sets for f . The top right panel shows the estimates
without the data. The bottom left panel zooms in on the estimates and credible sets
for x ∈ (0.2, 0.7). From this panel, we can see that the MFVB-based and MCMC-based
estimates of f are very similar. The credible sets, however, are narrower in the MFVB
case. The narrow nature of the credible sets is in-keeping with behaviour observed by
2.11. DISCUSSION
48
Wand et al. (2011) for simpler GEV models. The bottom right panel of Figure 2.9 shows
that the accuracy of MFVB inference for the shape parameter ξ is quite good, at 83%.
2.11 Discussion
This chapter has seen us successfully develop MFVB inference for GEV regression models, culminating in the analysis of the Sydney Hinterland maximum rainfall data. Comparison between MFVB inference and the MCMC baseline showed that estimation of the
additive model components and the shape parameter is highly accurate. Therefore, in
practical applications where the focus is on retrieval of the mean curve, MFVB provides
a promising alternative to MCMC. However, MFVB estimation of credible sets is overly
narrow. Therefore, MCMC remains the methodology of choice when variability of curve
estimates is a priority. Even with the shortcoming of poor estimation of credible sets, the
gains in computational speed offered by MFVB make it a viable choice for larger models
and/or sample sizes.
2.A Derivation of Algorithm 1 and lower bound (2.11)
2.A.1 Full conditionals
Full conditional for β
log p(β|rest) = −½ (βᵀ Ω β − 2βᵀ ω) + const.,
where
Ω = Xᵀ { (1/σε²) Σ_{k=1}^{K} (1/s_k²) D_{a_k} } X + Σβ⁻¹,
ω = Xᵀ { (1/σε²) Σ_{k=1}^{K} (1/s_k²) D_{a_k} (y − σε m_k 1) } + Σβ⁻¹ µβ,
and D_{a_k} = diag(a_1k, . . . , a_nk).
Derivation:
p(β|rest) ∝ p(y|β, σε², a) p(β)
= [ Π_{i=1}^{n} Π_{k=1}^{K} { (1/(σε s_k)) φ( [y_i − {(Xβ)_i + σε m_k}] / (σε s_k) ) }^{a_ik} ] × (2π)^{−d/2} |Σβ|^{−1/2} exp{−½ (β − µβ)ᵀ Σβ⁻¹ (β − µβ)}.
Taking logarithms, we get
log p(β|rest) = const. − Σ_{i=1}^{n} Σ_{k=1}^{K} a_ik (1/(2σε² s_k²)) (y − Xβ − σε m_k 1)_i² − ½ (β − µβ)ᵀ Σβ⁻¹ (β − µβ),
where 1 is an n × 1 column vector of ones. Changing the order of summation,
Σ_{i=1}^{n} Σ_{k=1}^{K} a_ik (1/(2σε² s_k²)) (y − Xβ − σε m_k 1)_i² = (1/(2σε²)) Σ_{k=1}^{K} (1/s_k²) (y − Xβ − σε m_k 1)ᵀ D_{a_k} (y − Xβ − σε m_k 1),
where D_{a_k} = diag(a_1k, . . . , a_nk). Now,
(y − Xβ − σε m_k 1)ᵀ D_{a_k} (y − Xβ − σε m_k 1) = βᵀ Xᵀ D_{a_k} X β − 2βᵀ Xᵀ D_{a_k} (y − σε m_k 1) + const.
Hence
log p(β|rest) = −½ [ βᵀ { Xᵀ ( (1/σε²) Σ_{k=1}^{K} (1/s_k²) D_{a_k} ) X + Σβ⁻¹ } β − 2βᵀ { Xᵀ ( (1/σε²) Σ_{k=1}^{K} (1/s_k²) D_{a_k} (y − σε m_k 1) ) + Σβ⁻¹ µβ } ] + const.
= −½ (βᵀ Ω β − 2βᵀ ω) + const.,
with Ω and ω as stated above.
Full conditional for σε²
log p(σε²|rest) = −(A + n/2 + 1) log σε² − (1/σε²) { B + ½ Σ_{k=1}^{K} (1/s_k²) (y − Xβ)ᵀ D_{a_k} (y − Xβ) } + (1/(σε²)^{1/2}) Σ_{k=1}^{K} (m_k/s_k²) 1ᵀ D_{a_k} (y − Xβ) + const.
Derivation:
p(σε²|rest) ∝ p(y|β, σε², a) p(σε²)
= [ Π_{i=1}^{n} Π_{k=1}^{K} { (1/(σε s_k)) φ( [y_i − {(Xβ)_i + σε m_k}] / (σε s_k) ) }^{a_ik} ] × {B^A/Γ(A)} (σε²)^{−A−1} e^{−B/σε²}.
Taking logarithms, and using Σ_{k=1}^{K} a_ik = 1,
log p(σε²|rest) = −(A + n/2 + 1) log σε² − B/σε² − (1/(2σε²)) Σ_{k=1}^{K} (1/s_k²) (y − Xβ − σε m_k 1)ᵀ D_{a_k} (y − Xβ − σε m_k 1) + const.,
where D_{a_k} = diag(a_1k, . . . , a_nk). Now,
(y − Xβ − σε m_k 1)ᵀ D_{a_k} (y − Xβ − σε m_k 1) = (y − Xβ)ᵀ D_{a_k} (y − Xβ) − 2 m_k σε 1ᵀ D_{a_k} (y − Xβ) + σε² m_k² 1ᵀ D_{a_k} 1,
and the last term does not depend on σε² after division by σε². The stated expression follows.
Full conditional for a
log p(a|rest) = Σ_{i=1}^{n} Σ_{k=1}^{K} a_ik ν_ik + const.,   where   ν_ik = log(w_k/s_k) − (1/(2σε² s_k²)) [y_i − {(Xβ)_i + σε m_k}]².
Derivation:
p(a|rest) ∝ p(y|β, σε², a) p(a)
= [ Π_{i=1}^{n} Π_{k=1}^{K} { (1/(σε s_k)) φ( [y_i − {(Xβ)_i + σε m_k}] / (σε s_k) ) }^{a_ik} ] × Π_{i=1}^{n} { 1/(a_i1! · · · a_iK!) } w_1^{a_i1} · · · w_K^{a_iK}.
Taking logarithms gives
log p(a|rest) = Σ_{i=1}^{n} Σ_{k=1}^{K} a_ik { log w_k − log(σε s_k √(2π)) − (1/(2σε² s_k²)) [y_i − {(Xβ)_i + σε m_k}]² } + const.
= Σ_{i=1}^{n} Σ_{k=1}^{K} a_ik ν_ik + const.,
where the −log(σε √(2π)) terms are constant with respect to a because Σ_{k=1}^{K} a_ik = 1.
2.A.2 Optimal q* densities
Expressions for q*(β), µ_q(β) and Σ_q(β)
q*(β) ∼ N(µ_q(β), Σ_q(β)),
where
Σ_q(β) = { Xᵀ µ_q(1/σε²) ( Σ_{k=1}^{K} (1/s_k²) D_{µq(a_k)} ) X + Σβ⁻¹ }⁻¹
and
µ_q(β) = Σ_q(β) [ Xᵀ { µ_q(1/σε²) Σ_{k=1}^{K} (1/s_k²) D_{µq(a_k)} y − µ_q(1/σε) Σ_{k=1}^{K} (m_k/s_k²) D_{µq(a_k)} 1 } + Σβ⁻¹ µβ ],
where D_{µq(a_k)} = diag(µ_q(a_1k), . . . , µ_q(a_nk)).
Derivation:
Equation (1.3) tells us that the optimal densities take the form
q*(θ_i) ∝ exp[ E_{q(θ_{−i})} {log p(θ_i | y, θ_{−i})} ].
Hence
log q*(β) = E_q{log p(β|rest)} + const. = −½ {βᵀ E_q(Ω) β − 2βᵀ E_q(ω)} + const.
Using matrix result (1.21), it follows that
log q*(β) = −½ {β − E_q(Ω)⁻¹ E_q(ω)}ᵀ E_q(Ω) {β − E_q(Ω)⁻¹ E_q(ω)} + const.
Therefore,
q*(β) ∼ N{E_q(Ω)⁻¹ E_q(ω), E_q(Ω)⁻¹}.    (2.21)
Now, with D_{a_k} = diag(a_1k, . . . , a_nk) and using the independence of a and σε² under q,
E_q(Ω) = Xᵀ E_q{ (1/σε²) Σ_{k=1}^{K} (1/s_k²) D_{a_k} } X + Σβ⁻¹ = Xᵀ µ_q(1/σε²) ( Σ_{k=1}^{K} (1/s_k²) D_{µq(a_k)} ) X + Σβ⁻¹
and
E_q(ω) = Xᵀ E_q{ (1/σε²) Σ_{k=1}^{K} (1/s_k²) D_{a_k} (y − σε m_k 1) } + Σβ⁻¹ µβ = Xᵀ { µ_q(1/σε²) Σ_{k=1}^{K} (1/s_k²) D_{µq(a_k)} y − µ_q(1/σε) Σ_{k=1}^{K} (m_k/s_k²) D_{µq(a_k)} 1 } + Σβ⁻¹ µβ.
The stated expressions for Σ_q(β) and µ_q(β) follow.
Expressions for q*(σε²), µ_q(1/σε) and µ_q(1/σε²)
q*(σε²) = (σε²)^{−(A+n/2+1)} exp{ C1/(σε²)^{1/2} − C2/σε² } / { 2 J⁺(2A + n − 1, C1, C2) },   σε² > 0,
µ_q(1/σε) = J⁺(2A + n, C1, C2) / J⁺(2A + n − 1, C1, C2)   and   µ_q(1/σε²) = J⁺(2A + n + 1, C1, C2) / J⁺(2A + n − 1, C1, C2),
where
C1 = Σ_{k=1}^{K} (m_k/s_k²) 1ᵀ D_{µq(a_k)} (y − Xµ_q(β))
and
C2 = B + ½ Σ_{k=1}^{K} (1/s_k²) { tr(Xᵀ D_{µq(a_k)} X Σ_q(β)) + (y − Xµ_q(β))ᵀ D_{µq(a_k)} (y − Xµ_q(β)) }.
Derivation:
log q*(σε²) = E_q{log p(σε²|rest)} + const.
= −(A + n/2 + 1) log σε² − (1/σε²) [ B + ½ Σ_{k=1}^{K} (1/s_k²) E_q{(y − Xβ)ᵀ D_{a_k} (y − Xβ)} ] + (1/(σε²)^{1/2}) Σ_{k=1}^{K} (m_k/s_k²) 1ᵀ E_q{D_{a_k} (y − Xβ)} + const.
Using Result 1.23 and the independence of a, β and σε² under q,
E_q{(y − Xβ)ᵀ D_{a_k} (y − Xβ)} = tr(Xᵀ D_{µq(a_k)} X Σ_q(β)) + (y − Xµ_q(β))ᵀ D_{µq(a_k)} (y − Xµ_q(β))
and E_q{D_{a_k} (y − Xβ)} = D_{µq(a_k)} (y − Xµ_q(β)). Hence
log q*(σε²) = −(A + n/2 + 1) log σε² + C1/(σε²)^{1/2} − C2/σε² + const.,
so that
q*(σε²) ∝ (σε²)^{−(A+n/2+1)} exp{ C1/(σε²)^{1/2} − C2/σε² },   σε² > 0.
The normalising constant follows from the substitution x = 1/σε (so that σε² = 1/x² and dσε² = −2x⁻³ dx):
∫₀^∞ (σε²)^{−(A+n/2+1)} exp{ C1/(σε²)^{1/2} − C2/σε² } dσε² = 2 ∫₀^∞ x^{2A+n−1} exp(C1 x − C2 x²) dx = 2 J⁺(2A + n − 1, C1, C2),
where J⁺(p, q, r) ≡ ∫₀^∞ x^p exp(qx − rx²) dx. The same substitution gives
µ_q(1/σε²) = E_q(1/σε²) = J⁺(2A + n + 1, C1, C2) / J⁺(2A + n − 1, C1, C2),
and similarly
µ_q(1/σε) = J⁺(2A + n, C1, C2) / J⁺(2A + n − 1, C1, C2).
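These ratios can be checked numerically. The R sketch below evaluates J⁺(p, q, r) = ∫₀^∞ x^p exp(qx − rx²) dx by quadrature and forms the two moments; the values of (A, n, C1, C2) are small illustrative placeholders, since the large exponents arising in real fits would require evaluation on the log scale:

## Numerical evaluation of J+(p, q, r) and of the two moments of q*(sigma_eps^2).
Jplus <- function(p, q, r)
  integrate(function(x) x^p * exp(q * x - r * x^2), 0, Inf)$value
A <- 0.01; n <- 20; C1 <- 3; C2 <- 15        # illustrative placeholder values
mu_q_inv_sig2 <- Jplus(2*A + n + 1, C1, C2) / Jplus(2*A + n - 1, C1, C2)
mu_q_inv_sig  <- Jplus(2*A + n, C1, C2)     / Jplus(2*A + n - 1, C1, C2)
c(mu_q_inv_sig2, mu_q_inv_sig, mu_q_inv_sig^2 <= mu_q_inv_sig2)   # Jensen's inequality check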
Expressions for q*(a) and µ_q(a_ik)
q*(a) = Π_{i=1}^{n} Π_{k=1}^{K} µ_q(a_ik)^{a_ik}   and   µ_q(a_ik) = e^{ν_ik} / Σ_{k′=1}^{K} e^{ν_ik′},
where
ν_ik = log(w_k/s_k) − (1/(2s_k²)) [ µ_q(1/σε²) {(y_i − (Xµ_q(β))_i)² + (X Σ_q(β) Xᵀ)_ii} − 2 m_k µ_q(1/σε) {y_i − (Xµ_q(β))_i} + m_k² ].
Derivation:
log q*(a) = Σ_{i=1}^{n} Σ_{k=1}^{K} a_ik [ log(w_k/s_k) − (1/(2s_k²)) E_q{ (1/σε²) (y_i − (Xβ)_i − σε m_k)² } ] + const.
Now,
(1/σε²) {y_i − (Xβ)_i − σε m_k}² = (1/σε²) {y_i − (Xβ)_i}² − (2m_k/(σε²)^{1/2}) {y_i − (Xβ)_i} + m_k²,
and therefore
E_q{ (1/σε²) (y_i − (Xβ)_i − σε m_k)² } = µ_q(1/σε²) [ {y_i − (Xµ_q(β))_i}² + (X Σ_q(β) Xᵀ)_ii ] − 2 m_k µ_q(1/σε) {y_i − (Xµ_q(β))_i} + m_k².
Combining the above gives log q*(a) = Σ_{i=1}^{n} Σ_{k=1}^{K} a_ik ν_ik + const., so that
q*(a) ∝ Π_{i=1}^{n} Π_{k=1}^{K} (e^{ν_ik})^{a_ik},
which is of the form of a Multinomial distribution. Setting µ_q(a_ik) = e^{ν_ik} / Σ_{k′=1}^{K} e^{ν_ik′} ensures that Σ_{k=1}^{K} µ_q(a_ik) = 1, and hence q*(a) = Π_{i=1}^{n} Π_{k=1}^{K} µ_q(a_ik)^{a_ik}.
2.A.3 Derivation of lower bound (2.11)
log p(y; q) = d/2 − (n/2) log(2π) + log(2) + A log(B) − log{Γ(A)}
+ log{J⁺(2A + n − 1, C1, C2)}
+ Σ_{k=1}^{K} µ_q(a•k) { log(w_k/s_k) − m_k²/(2s_k²) } + ½ log( |Σ_q(β)| / |Σβ| )
− ½ { tr(Σβ⁻¹ Σ_q(β)) + (µ_q(β) − µβ)ᵀ Σβ⁻¹ (µ_q(β) − µβ) }
− Σ_{i=1}^{n} Σ_{k=1}^{K} µ_q(a_ik) log(µ_q(a_ik)).
Derivation:
log p(y; q) = E_q{log p(y|β, σε², a)} + E_q{log p(β) − log q*(β)} + E_q{log p(σε²) − log q*(σε²)} + E_q{log p(a) − log q*(a)}.
Firstly,
log p(y|β, σε², a) = Σ_{i=1}^{n} Σ_{k=1}^{K} a_ik [ −½ log(2π σε² s_k²) − {y_i − (Xβ)_i − σε m_k}² / (2 σε² s_k²) ],
and, expanding the quadratic and taking expectations as in the q*(a) and q*(σε²) derivations,
E_q{log p(y|β, σε², a)} = −(n/2) log(2π) − (n/2) E_q(log σε²) − (C2 − B) µ_q(1/σε²) + C1 µ_q(1/σε) − ½ Σ_{k=1}^{K} µ_q(a•k) { log s_k² + m_k²/s_k² }.
Secondly, using Result 1.23,
E_q{log p(β) − log q*(β)} = ½ log( |Σ_q(β)| / |Σβ| ) + d/2 − ½ { tr(Σβ⁻¹ Σ_q(β)) + (µ_q(β) − µβ)ᵀ Σβ⁻¹ (µ_q(β) − µβ) }.
Thirdly,
E_q{log p(a) − log q*(a)} = Σ_{k=1}^{K} µ_q(a•k) log(w_k) − Σ_{i=1}^{n} Σ_{k=1}^{K} µ_q(a_ik) log(µ_q(a_ik)).
Finally,
E_q{log p(σε²) − log q*(σε²)} = A log(B) − log Γ(A) + (n/2) E_q(log σε²) + (C2 − B) µ_q(1/σε²) − C1 µ_q(1/σε) + log{J⁺(2A + n − 1, C1, C2)} + log(2).
Adding these four expressions, the E_q(log σε²), (C2 − B) µ_q(1/σε²) and C1 µ_q(1/σε) terms cancel; collecting the remaining terms, with −½ Σ_{k} µ_q(a•k) log s_k² merged with Σ_{k} µ_q(a•k) log w_k, yields lower bound (2.11).
1
T −1
− {tr(Σ−1
β Σq(β) ) + (µq(β) − µβ ) Σβ (µq(β) − µβ )}
2
n X
K
X
−
µq(aik ) log(µq(aik ) ).
i=1 k=1
61
62
2.B. DERIVATION OF ALGORITHM 2 AND LOWER BOUND (2.17)
2.B
Derivation of Algorithm 2 and lower bound (2.17)
2.B.1
Full conditionals
Full conditional for (β, u)
p(β, u|rest) = p(β, u|Markov blanket of (β, u))
2
2
= p(β, u|y, a, σε2 , σu1
, . . . , σud
)
2
2
∝ p(y|β, u, σε2 , a)p(β, u|σu1
, . . . , σud
)
aik
n Y
K
Y
(y − Xβ − Zu)i − σε mk
1
=
φ
σε s k
σε sk
i=1 k=1
Pd
×(2π)−(1+d+ `=1 K` )/2 |Σ(β,u) |−1/2

"
#
!T
 1
β
× exp −
− µ(β,u)
Σ−1
(β,u)
 2
u
"
β
u
#
− µ(β,u)
!

.

Taking logarithms, we get
n X
K
X
1
2
log{p(β, u|rest)} =
aik − 2 2 (y − Xβ − Zu − σε mk 1)i
2σε sk
i=1 k=1
"
#
!T
"
#
!
1
β
β
−1
− µ(β,u)
− µ(β,u) + const.
−
Σ(β,u)
2
u
u

#
"
!2 
n X
K


X
1
=
aik − 2 2 y − C β − σε mk 1
 2σε sk

u
i=1 k=1
1
−
2
"
i
!T
#
β
u
− µ(β,u)
"
Σ−1
(β,u)
β
u
#
!
− µ(β,u)
+ const.
Now, from working analagous to that in the regression case (Appendix 2.A):
log{p(β, u|rest)}
"
#T (
1 β
=−
CT
2
u
"
−2
β
u
K
1 X 1
D ak
σε2
s2k
k=1
#T "
CT
(
!
)"
C + Σ−1
(β,u)
β
u
#
)#
K
X
1
1
D ak (y − σε mk 1)  + const.
2
σε
s2k
k=1
where D ak = diag(a1k , . . . , ank ). Therefore
"
#T "
#
"
#T 
1 β
log{p(β, u|rest)} = −
Ω β −2 β
ω  + const.
2
u
u
u
63
2.B. DERIVATION OF ALGORITHM 2 AND LOWER BOUND (2.17)
where
Ω=C
T
K
1 X 1
D ak
σε2
s2k
!
k=1
and
ω=C
T
1
−1 1
C + blockdiag Σβ , 2 I K1 , . . . , 2 I Kd
σu1
σud
!
K
1 X 1
D ak (y − σε mk 1) .
σε2
s2k
k=1
Full conditional for σε2
p(σε2 |rest) = p(σε2 |Markov blanket of σε2 )
= p(σε2 |y, β, u, a)
∝ p(y|β, u, σε2 , a)p(σε2 )
aik
n Y
K Y
1
(y − Xβ − Zu)i − σε mk
=
φ
σε sk
σε sk
i=1 k=1
BεAε
×
Γ(Aε )
σε2
B
(−Aε −1) − σ2ε
e
ε
.
Taking logarithms, we get
log{p(σε2 |rest)}
= −(Aε + 1) log σε2 −
+
n X
K
X
i=1 k=1
= − Aε +
Bε
+ const.
σε2
aik − log(σε sk ) −
1
{(y − Xβ − Zu)i + σε mk }2
2
2
2σε sk
n
Bε
+ 1 log σε2 − 2 + const.
2
σε
n K
1 X X aik
2
− 2
2 {(y − Xβ − Zu)i
2σε
s
i=1 k=1 k
−2σε mk (y − Xβ − Zu)i + σε2 m2k }
n
Bε
= − Aε + + 1 log σε2 − 2
2
σε
−
n K
1 X X aik
(y − Xβ − Zu)2i
2σε2
s2k
i=1 k=1
n K
1 X X aik mk
+
(y − Xβ − Zu)i + const.
σε
s2k
i=1
k=1
64
2.B. DERIVATION OF ALGORITHM 2 AND LOWER BOUND (2.17)
n
= − Aε + + 1 log σε2
2
(
)
K
1X 1
1
(y − Xβ − Zu)T D ak (y − Xβ − Zu)
− 2 Bε +
σε
2
s2k
k=1
+
1
σε
K
X
k=1
mk T
1 D ak (y − Xβ − Zu) + const.
s2k
Full conditional for a
p(a|rest) = p(a|Markov blanket of a)
= p(a|y, β, u, σε2 )
∝ p(y|β, u, σε2 , a)p(a)
n
n Y
K Y
(y − Xβ − Zu)i − σε mk aik Y aik
1
φ
×
=
wk .
σε sk
σε sk
i=1
i=1 k=1
Taking logarithms gives
log{p(a|rest)} =
n X
K
X
aik
i=1 k=1
n X
K
X
+
i=1 k=1
=
n X
K
X
i=1 k=1
log(wk ) + const.
h
√
aik − log( 2πσε sk )
1
− 2 {(y − Xβ − Zu)i + σε mk }2
2σε sk
h
aik log(wk /sk )
1
2
− 2 2 {(y − Xβ − Zu)i + σε mk } + const.
2σε sk
2 , 1≤`≤d
Full conditional for σu`
The full conditional is given by
2
2
p(σu`
|rest) = p{σu |Markov blanket of σu`
}
2
= p(σu`
|u` )
2
2
∝ p(u` |σu`
)p(σu`
)
1
2
= (2π)
− uT` (σu`
I)−1 u`
2
B Au` 2 (−Au` −1)
Bu`
exp − 2
× u` σu`
Γ(Au` )
σu`
−K` /2
2
|σu`
I|−1/2 exp
2.B. DERIVATION OF ALGORITHM 2 AND LOWER BOUND (2.17)
1 T
− 2 u` u`
∝
2σu`
Bu`
2 (−Au` −1)
exp − 2
×σu`
σ
u` 1
1
2 −(Au` +K` /2+1)
2
= σu`
exp − 2 Bu` + ku` k
,
2
σu`
2 −K` /2
exp
σu`
and taking logarithms gives
2
2
log{p(σu`
|rest)} = −(Au` + K` /2 + 1) log(σu`
)
1
1
− 2 Bu` + ku` k2 + const.
2
σu`
2.B.2
Optimal q ∗ densities
Expressions for q ∗ (β, u), µq(β,u) and Σq(β,u)
q ∗ (β, u) ∼ N (µq(β,u) , Σq(β,u) )
where
µq(β,u) ← Σq(β,u) C T
!
K
K
X
X
1
mk
µq(1/σε2 )
D µq(a ) y − µq(1/σε )
D µq(a ) 1 ,
k
k
s2k
s2k
k=1
k=1
!
K
X
1
D µq(a ) C
C
µq(1/σε2 )
k
s2k
k=1
o−1
+blockdiag Σ−1
,
2 ) I K1 , . . . , µq(1/σ 2 ) I K
d
β , µq(1/σu1
(
Σq(β,u) ←
T
ud
and D µq(a
k)
= diag(µq(a1k ) , . . . , µq(a1k ) ).
Derivation:
"

#T
"
#
"
#T


1
β
log q ∗ (β, u) = − Eq
Ω β −2 β
ω + const.

2  u
u
u
"

#T
"
#
"
#T


1
β
= −
Eq (Ω) β − 2 β
Eq (ω) + const.

2 u
u
u
Application of matrix result (1.21) to the above expression gives
1
log q ∗ (β, u) = −
2
("
β
u
#
)T
− Eq (Ω)−1 Eq (ω)
("
#
)
β − E (Ω)−1 E (ω) + const.
×Eq (Ω)
q
q
u
65
2.B. DERIVATION OF ALGORITHM 2 AND LOWER BOUND (2.17)
and therefore
q ∗ (β, u) ∼ N {Eq (Ω)−1 Eq (ω), Eq (Ω)−1 }.
Now,
!
K
X
1
1
D ak C
Eq (Ω) = Eq C T
σε2
s2
k=1 k
1
1
+blockdiag Σ−1
,
,
.
.
.
,
I
I
K1
Kd
β
2
2
σu1
σud
!
K
X
1
T
= C
µq(1/σε2 )
D µq(a ) C
k
s2k
k=1
+blockdiag Σ−1
,
µ
I
,
.
.
.
,
µ
I
,
2
2
K
K
1
q(1/σu1 )
q(1/σ )
d
β
(
ud
and
(
CT
Eq (ω) = Eq
= C
T
K
1 X 1
D ak (y − σε mk 1)
σε2
s2k
µq(1/σε2 )
k=1
K
X
k=1
!)
!
K
X
1
mk
D µq(a ) y − µq(1/σε )
D µq(a ) 1 .
k
k
s2k
s2k
k=1
Expressions for q ∗ (σε2 ), µq(1/σε ) and µq(1/σε2 )
q ∗ (σε2 ) =
σε2
−(A+ n
+1)
2
exp
C3
(σε2 )1/2
−
C4
σε2
2J + (2A + n − 1, C1 , C2 )
σε2 > 0,
,
µq(1/σε2 ) ←
J + (2Aε + n + 1, C3 , C4 )
,
J + (2Aε + n − 1, C3 , C4 )
µq(1/σε ) ←
J + (2Aε + n, C3 , C4 )
,
J + (2Aε + n − 1, C3 , C4 )
where
C3 ←
K
X
mk
k=1
s2k
1T D µq(a ) (y − Cµq(β,u) )
k
and
C4 ← Bε +
K
1X 1 n
tr(D µq(a ) CΣq(β,u) C T )
k
2
s2k
o
k=1
+(y − Cµq(β,u) )T D µq(a ) (y − Cµq(β,u) ) .
k
66
2.B. DERIVATION OF ALGORITHM 2 AND LOWER BOUND (2.17)
Derivation:
log q ∗ (σε2 )
n
= − Aε + + 1 log σε2
2
#
"
K
1X 1
1
T
Eq {(y − Xβ − Zu) D ak (y − Xβ − Zu)}
− 2 Bε +
σε
2
s2k
k=1
K
1 X mk T
+
1 Eq {D ak (y − Xβ − Zu)} + const.
σε
s2k
k=1
Using working similar to that used in the regression case in Appendix 2.A,
Eq {(y − Xβ − Zu)T D ak (y − Xβ − Zu)}

"
#!T
"
#!


= Eq
y−C β
D ak y − C β

u
u
= tr(C T D µq(a ) CΣq(β,u) ) + (y − Cµq(β,u) )T D µq(a ) (y − Cµq(β,u) ).
k
k
Therefore
n
log q ∗ (σε2 ) = − Aε + + 1 log σε2
2
"
K
1
1X 1 n
tr(C T D µq(a ) CΣq(β,u) )
− 2 Bε +
2
k
σε
2
s
oi
k=1 k
+(y − Cµq(β,u) )T D µq(a ) (y − Cµq(β,u) )
k
+
1
σε
K
X
k=1
mk T
1 D µq(a ) (y − Cµq(β,u) ) + const.
k
s2k
It follows that
q
∗
(σε2 )
where
C3 =
−(Aε + 2 +1)
exp
σε2
n
∝
K
X
mk
k=1
s2k
C3
C4
− 2
σε σε
1T D µq(a ) (y − Cµq(β,u) )
k
and
C4 ← Bε +
K
1X 1 n
tr(C T D µq(a ) CΣq(β,u) )
k
2
s2k
o
k=1
+(y − Cµq(β,u) )T D µq(a ) (y − Cµq(β,u) ) .
k
Using working analagous to the regression case, we obtain the stated results.
67
2.B. DERIVATION OF ALGORITHM 2 AND LOWER BOUND (2.17)
Expressions for q ∗ (a) and µq(aik )
∗
q (a) =
n Y
K
Y
µq(aik ) aik
i=1 k=1
and
eνik
µq(aik ) = PK
.
νik
k=1 e
where
νik = log(wk /sk ) −
1 µq(1/σε2 ) (y − Cµq(β,u) )2i + (CΣq(β,u) C T )ii
2
2sk
−2mk µq(1/σε ) (y − Cµq(β,u) )i + m2k .
Derivation:
∗
log q (a) =
n X
K
X
aik log(wk /sk )
i=1 k=1
1
1
2
− 2 Eq 2 {(y − Xβ − Zu)i + σε mk }
+ const.
σε
2sk
Now,
{(y − Xβ − Zu)i + σε mk }2 = (y − Xβ − Zu)2i
−2σε mk (y − Xβ − Zu)i + σε2 m2k ,
hence
∗
log q (a) =
=
n X
K
X
i=1 k=1
n X
K
X
1
1
aik log(wk /sk ) − 2 Eq
(y − Xβ − Zu)2i
2
σ
2s
ε
k
i=1 k=1
2mk
2
−
(y − Xβ − Zu)i + mk
+ const.
σε
1 µq(1/σε2 ) Eq {(y − Xβ − Zu)2i }
2
2sk
i
− 2mk µq(1/σε ) (y − Cµq(β,u) )i + m2k + const.
aik log(wk /sk ) −
Now,
Eq {(y − Xβ − Zu)2i } = Eq



"
y−C
β
u
#!2 

i

68
69
2.B. DERIVATION OF ALGORITHM 2 AND LOWER BOUND (2.17)
(
"
= Varq
y−C
"
(
Covq
=
C
"
=
β
u
CCovq
β
u
#
#! )
β
u
#!)
"
(
+ Eq
i
"
"
y − CEq
+
! ii
CT
ii
β
u
y−C
β
u
#! )#2
i
#!2
i
+ (y − Cµq(β,u) )2i
= (CΣq(β,u) C T )ii + (y − Cµq(β,u) )2i .
Bringing it all together,
∗
log q (a) =
n X
K
X
aik νik + const.
i=1 k=1
where
νik = log(wk /sk ) −
1 µq(1/σε2 ) (y − Cµq(β,u) )2i + (CΣq(β,u) C T )ii
2
2sk
−2mk µq(1/σε ) (y − Cµq(β,u) )i + m2k .
Exponentiating, we get
q ∗ (a) ∝
n Y
K
Y
(eνik )aik
i=1 k=1
which is of the form of a multinomial distribution. To ensure
that
PK
k=1 µq(aik )
eνik
µq(aik ) = PK
.
νik
k=1 e
hence
q ∗ (a) ∝
n Y
K
Y
ik
µaq(a
.
ik )
i=1 k=1
2 ), B
Expressions for q ∗ (σu`
q(σ 2 ) and µq(1/σ 2
u` )
u`
∗
q (σu ) ∼ Inverse-Gamma Au` + K` /2, Bq(σ2
u` )
Bq(σ2 ) = Bu` +
u`
o
1n
tr(Σq(u` ) ) + kµq(u` ) k2 ,
2
and
µq(1/σ2 ) =
u`
Au` + K` /2
.
Bq(σ2 )
u`
,
= 1, we require
70
2.B. DERIVATION OF ALGORITHM 2 AND LOWER BOUND (2.17)
Derivation:
log q
∗
2
(σu`
)
= −(Au` + K` /2 +
2
1) log(σu`
)
1
− 2
σu`
1
Bu` + Eq (ku` k2 )
2
+const.
Now, using Result 1.22,
Eq (ku` k2 ) = tr{Covq (u` )} + kEq (u` )k2
= tr(Σq(u` ) ) + kµq(u` ) k2 .
Hence,
2
2
log q ∗ (σu`
) = −(Au` + K` /2 + 1) log(σu`
)
o
n
1
1
2
+ const.
tr(Σq(u` ) ) + kµq(u` ) k
− 2 Bu` +
2
σu`
Exponentiating, we get
q
∗
2
)
(σu`
∝
2 −(Au` +K` /2+1)
exp
σu`
o
1
1n
2
− 2 Bu` +
tr(Σq(u` ) ) + kµq(u` ) k
2
σu`
which is in the form of an Inverse-Gamma distribution. The stated results follow immediately.
2.B.3
Derivation of lower bound (2.17)
log p(y; q)
=
1
2
1+d+
d
X
`=1
!
K`
−
n
log(2π) + log(2) + Aε log(Bε ) − log Γ(Aε )
2
+ log J + (2Aε + n − 1, C3 , C4 )
−1
1 T
+ 12 log |Σq(β,u) | − 21 log |Σβ | − 12 tr(Σ−1
β Σq(β) ) − 2 µq(β,u) Σβ µq(β,u)
d
X
{Au` log Bu` − log Γ(Au` )
o
`=1
− Au` + K2` log Bq(σ2 ) + log Γ Au` + K2`
u`
X
K
n X
K
2
X
m
+
µq(a•k ) log(wk /sk ) − k2 −
µq(aik ) log(µq(aik ) ).
2sk
i=1
+
k=1
Derivation:
k=1
71
2.B. DERIVATION OF ALGORITHM 2 AND LOWER BOUND (2.17)
The logarithm of the lower bound on the marginal likelihood is given by
2
2
, . . . , σud
)
log p(y; q) = Eq {log p(y, β, u, a, σε2 , σu1
2
2
− log q ∗ (β, u, a, σε2 , σu1
, . . . , σud
)}.
Now,
2
2
p(y, β, u, a, σε2 , σu1
, . . . , σud
) = p(y|β, u, a, σε2 )
2
2
×p(β, u, a, σε2 , σu1
, . . . , σud
)
(2.22)
2 , 1 ≤ ` ≤ d. Also,
as the distribution of y does not depend explicitly upon σu`
p(β, u, a, σε2 , σu ) = p(β, u|σu )
×p(a)p(σε2 )p(σu )
(2.23)
as the distribution of (β, u) doesn’t depend on σε2 or a, and σε2 , a and σu are independent.
In addition, our imposed factorisation of the optimal density q (2.16) breaks down further
into
(
2
2
) = q(β, u)q(a)q(σε2 )
q(β, u, a, σε2 , σu1
, . . . , σud
d
Y
)
2
)
q(σu`
(2.24)
`=1
2 , 1 ≤ ` ≤ d, imposed by our model.
due to the conditional independence of σε2 and σu`
Using (2.22), (2.23) and (2.24), our expression for the lower bound of the log-likelihood
becomes:
log p(y; q) = Eq {log p(y|β, u, a, σε2 )} + Eq {log p(β, u|σu ) − log q ∗ (β, u)}
+Eq {log p(a) − log q ∗ (a)} + Eq {log p(σε2 ) − log q ∗ (σε2 )}
" ( d
)
( d
)#
Y
Y
2
2
+Eq log
p(σu`
) − log
q(σu`
)
.
`=1
`=1
Firstly:
log p(y|β, u, a, σε2 )
n X
K
X
√
=
aik − log( 2πσε sk ) −
i=1 k=1
1
{(y − Xβ − Zu)i + σε mk }2
2σε2 sk
72
2.B. DERIVATION OF ALGORITHM 2 AND LOWER BOUND (2.17)
n K
n
1 XX
n
2
= − log σε − log(2π) −
aik log s2k
2
2
2
i=1 k=1
(
n
K
1 X X aik
2
− 12
2 (y − Xβ − Zu)i
σε2
s
k
i=1
2
−
σε
k=1
n X
K
X
i=1 k=1
n
K
X X aik m2
aik mk
k
(y
−
Xβ
−
Zu)
+
i
2
s2k
s
k
i=1
)
.
k=1
Taking expectations,
Eq {log p(y|β, u, a, σε2 )}
K
n
n
1X
= − log σε2 − log(2π) −
µq(a•k ) log s2k
2
2
2
k=1
− 12 µq(1/σε2 )
+µq(1/σε )
n X
K
X
i=1 k=1
n
K
XX
i=1 k=1
µq(aik ) (CΣq(β,u) C T )ii + (y − Cµq(β,u) )2i
2
sk
n X
K
X
µq(aik ) m2k
µq(aik ) mk
1
(y
−
Cµ
)
−
q(β,u) i
2
s2k
s2k
i=1
k=1
K
1X
n
n
= − log σε2 − log(2π) −
2
2
2
− 12 µq(1/σε2 )
µq(a•k ) log s2k
k=1
K
X
1 n
tr(D µq(a ) CΣq(β,u) C T )
k
s2k
o
k=1
+(y − Cµq(β,u) )T D µq(a ) (y − Cµq(β,u) )
k
+µq(1/σε )
K
X
k=1
K
X
m2k
mk T
1
1
D
(y
−
Cµ
)
−
µ
µ
q(β,u)
q(a
)
•k
q(ak )
2
s2k
s2k
k=1
n
n
= − log σε2 − log(2π) − (C4 − Bε )µq(1/σε2 ) + C3 µq(1/σε )
2
2
2
K
X
mk
2
1
−2
µq(a•k )
+ log sk .
s2k
k=1
Secondly,
log p(β, u|σu ) − log q ∗ (β, u)
!
"
#T
"
#
d
X
β
β
−1
1
1
Σ(β,u)
=− 1+d+
K` log(2π) − 2 log |Σ(β,u) | − 2
u
u
`=1
(
!
d
X
− − 1+d+
K` log(2π) − 12 log |Σq(β,u) |
`=1
"
− 12
β
u
!T
#
− µq(β,u)
"
Σ−1
q(β,u)
β
u
#
− µq(β,u)
!


.
73
2.B. DERIVATION OF ALGORITHM 2 AND LOWER BOUND (2.17)
Taking expectations,
Eq {log p(β, u|σu ) − log q ∗ (β, u)}
= − 21 Eq (log |Σ(β,u) |) + 12 log |Σq(β,u) |
"
#T
"
#
β 
Σ−1
− 21 Eq  β
q(β,u)
u
u
 "
#
!T

β −µ
+ 21 Eq
Σ−1
q(β,u)
q(β,u)

u
"
β
u
#
− µq(β,u)
!

Now,
"
Eq 
β
u
#T
"
Σ−1
(β,u)
#
β
u
 = tr{Eq (Σ−1 )Σq(β,u) }
(β,u)
+µTq(β,u) Eq (Σ−1
(β,u) )µq(β,u) .
Similarly,
Eq
 "

!T
#
β −µ
Σ−1
q(β,u)
q(β,u)
u
Σ
= tr Σ−1
q(β,u)
q(β,u)
= tr I 1+d+Pd K`

"
β
u
#
− µq(β,u)
!


`=1
=1+d+
d
X
K` .
`=1
Furthermore,
2
2
Eq (log |Σ(β,u) |) = Eq {log |blockdiag(Σβ , σu1
I K1 , . . . , σud
I Kd )|}
2
= Eq {log(|Σβ | × σu1
K1
2
× . . . × σud
Kd
)}
2
2
= Eq (log |Σβ | + K1 log σu1
+ . . . + Kd log σud
)
= log |Σβ | +
d
X
2
K` Eq (log σu`
).
`=1
Also,
Eq (Σ−1
(β,u) )
= Eq
1
−1 1
blockdiag Σβ , 2 I K1 , . . . , 2 I Kd
σu1
σud
= blockdiag(Σ−1
2 ) I K1 , . . . , µq(1/σ 2 ) I K ).
d
β , µq(1/σu1
ud

.
74
2.B. DERIVATION OF ALGORITHM 2 AND LOWER BOUND (2.17)
Therefore,
Eq {log p(β, u|σu ) − log q ∗ (β, u)}
=
1
2
1
2
log |Σq(β,u) | − log |Σβ | −
1
2
d
X
2
K` Eq (log σu`
)
+
1
2
1+d+
`=1
d
X
!
K`
`=1
h
− 12 tr{blockdiag(Σ−1
2 ) I K1 , . . . , µq(1/σ 2 ) I K )Σq(β,u) }
d
β , µq(1/σu1
ud
+µTq(β,u) blockdiag(Σ−1
2 ) I K1 , . . . , µq(1/σ 2 ) I K )µq(β,u)
d
β , µq(1/σu1
ud
=
1
2
log |Σq(β,u) | − 21 log |Σβ | −
"
2
K` Eq (log σu`
)+
1
2
1+d+
`=1
d
X
!
K`
`=1
d
X
tr(µq(1/σ2 ) I K` Σq(u`) )
u`
#
`=1
d
X
+µTq(β,u) Σ−1
µq(1/σ2 ) kµq(u`) k2
β µq(β,u) +
u`
`=1
!
d
d
X
X
2
1
1
1
1
+
K
)
1
+
d
+
K` Eq (log σu`
log
|Σ
|
−
log
|Σ
|
−
`
β
q(β,u)
2
2
2
2
`=1
`=1
−1
−1
1
1 T
− 2 tr(Σβ Σq(β) ) − 2 µq(β,u) Σβ µq(β,u)
− 12
=
tr(Σ−1
β Σq(β) )
1
2
d
X
i
− 12
d
X
+
µq(1/σ2 ) {kµq(u`) k2 + tr(Σq(u`) )}.
u`
`=1
Thirdly,
log p(σε2 )
− log q
∗
(σε2 )
BεAε 2 −(Aε +1) −Bε /σε2
= log
σ
e
Γ(Aε ) ε
( −(A +n/2+1)
)
2
ε
σε2
e−C4 /σε +C3 /σε
− log
2J + (2Aε + n − 1, C3 , C4 )
n
= A log(B) − log Γ(Aε ) + log(σε2 ) + (C4 − Bε )/σε2
2
+
−C3 /σε + log J (2Aε + n − 1, C3 , C4 ) + log(2).
Thus,
n
Eq {log(σε2 )}
2
+(C4 − Bε )µq(1/σε2 ) − C3 µq(1/σε )
Eq {log p(σε2 ) − log q ∗ (σε2 )} = Aε log(Bε ) − log Γ(Aε ) +
+ log J + (2Aε + n − 1, C3 , C4 ) + log(2).
Fourthly,
log p(a) − log q ∗ (a) =
n X
K
X
i=1 k=1
aik {log(wk ) − logµq(aik ) }.
75
2.B. DERIVATION OF ALGORITHM 2 AND LOWER BOUND (2.17)
Hence
Eq {log p(a) − log q ∗ (a)} =
where µq(a•k ) =
Pn
i=1 µq(aik ) .
(
log
d
Y
k=1
)
2
p(σu`
)
d X
`=1
µq(a•k ) log(wk ) −
n X
K
X
µq(aik ) log(µq(aik ) ).
i=1 k=1
Finally,
(
− log
`=1
=
K
X
d
Y
)
2
q(σu`
)
`=1
2
Au` log Bu` − log Γ(Au` ) − (Au` + 1) log σu`
−
Bu`
2
σu`
n
− Aq(σ2 ) log Bq(σ2 ) − log Γ(Aq(σ2 ) )
u`
u`
u`
−(Aq(σ2 ) +
u`
2
1) log σu`
−
Bq(σ2
u` )
2
σu`
)#
.
Taking expectations
"
(
Eq log
d
Y
)
2
)
p(σu`
− log
`=1
=
d X
`=1
(
d
Y
)#
2
)
q(σu`
`=1
K`
Au` log Bu` − log Γ(Au` ) − Au` +
log Bq(σ2 )
u`
2
K`
2
+ log Γ(Aq(σ2 ) ) +
Eq (log σu` ) + (Bq(σ2 ) − Bu` )µq(1/σ2 ) .
u`
u`
u`
2
Putting it all together, we get
2
K
X
mk
n
n
2
2
1
log p(y; q) = − log(2π) − log σε − 2
+ log sk
µq(a•k )
2
2
s2k
k=1
−(C4 − Bε )µq(1/σε2 ) + C3 µq(1/σε )
!
d
X
+ 12 1 + d +
K` + 12 log |Σq(β,u) |
`=1
− 21 log |Σβ | −
1
2
d
X
2
K` Eq (log σu`
)
`=1
−1
1
− 2 tr(Σβ Σq(β) ) − 12 µTq(β,u) Σ−1
β µq(β,u)
− 21
+
d
X
u`
`=1
K
X
k=1
µq(1/σ2 ) {kµq(u`) k2 + tr(Σq(u`) )}
µq(a•k ) log(wk ) −
n X
K
X
i=1 k=1
µq(aik ) log(µq(aik ) )
76
2.B. DERIVATION OF ALGORITHM 2 AND LOWER BOUND (2.17)
n
Eq {log(σε2 )} + (C4 − Bε )µq(1/σε2 )
2
−C3 µq(1/σε ) + log J + (2Aε + n − 1, C3 , C4 ) + log(2)
d n
X
+
Au` log Bu` − log Γ(Au` ) − Au` + K2` log Bq(σ2 )
+Aε log(Bε ) − log Γ(Aε ) +
u`
`=1
+ log Γ(Aq(σ2 ) ) +
u`
K`
2
2 Eq (log σu` )
+ (Bq(σ2 ) − Bu` )µq(1/σ2
u` )
u`
o
.
Recognising that kµq(u`) k2 + tr(Σq(u`) ) = Bq(σ2 ) − Bu` , we have the final expression for
u`
the lower bound
log p(y; q) =
1
2
1+d+
d
X
`=1
!
K`
−
n
log(2π) + log(2) + Aε log(Bε ) − log Γ(Aε )
2
+ log J + (2Aε + n − 1, C3 , C4 )
−1
1 T
+ 21 log |Σq(β,u) | − 21 log |Σβ | − 12 tr(Σ−1
β Σq(β) ) − 2 µq(β,u) Σβ µq(β,u)
d
X
{Au` log Bu` − log Γ(Au` )
o
`=1
K`
K`
− Au` + 2 log Bq(σ2 ) + log Γ Au` + 2
u`
X
n X
K
K
2
X
mk
+
µq(a•k ) log(wk /sk ) − 2 −
µq(aik ) log(µq(aik ) ).
2sk
i=1
+
k=1
k=1
Chapter 3
Mean field variational Bayes for
quantile regression
3.1
Introduction
Estimation of the quantiles of a response variable given an explanatory variable is a common statistical problem. Yu and Jones (1998) use kernel weighted local linear fitting for
estimation of quantiles in nonparametric regression models. A goodness of fit process for
quantile regression is explored in Koenker and Machado (1999). In a Bayesian context,
inference for quantile regression models is somewhat limited to sampling based methods such as MCMC. Koenker (2005) sets out options for inference in quantile regression,
highlighting both frequentist and Bayesian methods, the latter restricted to resampling
methods. We address this shortcoming in this chapter by exploring MFVB for semiparametric quantile regression models.
The idea of carrying out Bayesian quantile regression using an Asymmetric Laplace
likelihood was introduced by Yu and Moyeed (2001). Their paper showed that the Asymmetric Laplace distribution provides a very natural option for modelling Bayesian quantile regression, regardless of the original distribution of the data. Yu and Moyeed (2001)
carried out inference using a Markov chain Monte Carlo (MCMC) algorithm. More recently, Kozumi and Kobayashi (2011) developed an efficient Gibbs sampling algorithm
to carry out inference in Bayesian regression models, also using the Asymmetric Laplace
distribution. They used a location-scale mixture representation of the Asymmetric Laplace
distribution.
Extensions to the Asymmetric Laplace approach to quantile regression have been explored. For example, Yuan and Guosheng (2010) presented Bayesian quantile regression
77
78
3.2. PARAMETRIC REGRESSION CASE
a
β
y
σε2
Figure 3.1: Directed acyclic graph for Model (3.2).
for longitudinal studies with non-ignorable missing data. In contrast to the parametric approach presented by Yu and Moyeed (2001), Thompson, Cai, Moyeed, Reeve and
Stander (2010) adopt a nonparametric approach, carrying out quantile regression using
cubic splines.
In this chapter, we develop mean field variational Bayes (MFVB) methodology for
inference in Bayesian quantile regression models using the Asymmetric Laplace distribution, as introduced by Yu and Moyeed (2001). MFVB inerence for a univariate Asymmetric Laplace model has been carried out in Wand et al. (2011). We extend the methodology from this paper to the regression case in this chapter. Firstly, we present the model
for the parametric regression case. We develop the MFVB algorithm for the parametric
regression case through derivation of optimal q densities and the lower bound on the
marginal log-likelihood. Secondly, we extend the MFVB methodology to the semiparametric regression case, introducing splines into the model. We then present the MFVB fit
to a simulated data set. Finally, we compare our MFVB results with MCMC to evaluate
the performance of MFVB inference in Bayesian quantile
regression models.
1
3.2
Parametric regression case
Here we present a simple parametric linear regression model for the case of one predictor.
3.2.1
Model
For data (xi , yi ), we impose the model
yi = β0 + β1 xi + εi ,
1 ≤ i ≤ n.
79
3.2. PARAMETRIC REGRESSION CASE
We express this as the Bayesian hierarchical model
ind.
yi |β, σε ∼ Asymmetric-Laplace{(Xβ)i , σε , τ },
σε2
β ∼ N (0, Σβ ),
1 ≤ i ≤ n,
(3.1)
∼ Inverse-Gamma(Aε , Bε )
where

1 x1


 . . 
X =  .. ..  ,


1 xn


β=
β0
β1

,
Aε , Bε > 0 are scalar constants, Σβ is a constant matrix and τ ∈ (0, 1) determines the
quantile level. We introduce auxiliary variables a = (a1 , . . . , an ). This allows the Asymmetric Laplace distribution to be represented in terms of Normal and Inverse-Gamma
distributions, which are more amenable to MFVB. Application of Result 1.8 allows us to
rewrite Model (3.1) as
ind.
yi |β, σε , ai ∼ N
(Xβ)i +
(τ − 21 )σε
σε2
ai τ (1−τ ) , ai τ (1−τ )
β ∼ N (0, Σβ ),
σε2
,
ind.
ai ∼ Inverse-Gamma 1, 12 ,
(3.2)
∼ Inverse-Gamma(Aε , Bε ).
The dependence structure of the parameters in Model (3.2) is illustrated in Figure 3.1.
3.2.2
Mean field variational Bayes
In this section we present an algorithm to carry out MFVB inference for Model (3.2) under
the imposed product restriction
q(β, a, σε2 ) ≈ q(β)q(a)q(σε2 ).
(3.3)
Derivations of the full conditionals, optimal q densities and lower bound on the marginal
log-likelihood are deferred to Appendix 3.A.
The lower bound on the marginal log-likelihood for Model (3.2) is given by
log p(y; q) = Aε log(Bε ) − log Γ(Aε ) + log(2) + 1 + n log{τ (1 − τ )}
|Σq(β) |
1
+ 2 log
+ log{J + (2Aε + n − 1, C5 , C6 )}
|Σβ |
−1
T
− 12 {tr(Σ−1
β Σq(β) ) + µq(β) Σβ µq(β) }
n
X 1
1
−
.
8τ (1 − τ )
µq(ai )
i=1
(3.4)
80
3.2. PARAMETRIC REGRESSION CASE
Initialize: µq(1/σε2 ) , µq(1/σε ) and µq(ai ) for 1 ≤ i ≤ n.
Cycle:
Update q ∗ (β) parameters:
µq(β)
−1
Σq(β) ← {τ (1 − τ )µq(1/σε2 ) X T M µq(a) X + Σ−1
β }
o
n
← Σq(β) gτ µq(1/σε2 ) X T M µq(a) y + τ − 12 µq(1/σε ) X T 1 .
Update q ∗ (ai ) parameters:
For i = 1, . . . , n:
µq(ai ) ←
−1
q
T
2
2τ (1 − τ ) µq(1/σε2 ) [(XΣq(β) X )ii + {yi − (Xµq(β) )i } ]
M µq(a) ← diag(µq(a1 ) , . . . , µq(an ) )
Update q ∗ (σε2 ) parameters:
C5 ← τ −
C6 ← Bε +
1
2
(y − Xµq(β) )T 1
τ (1 − τ ) n
tr(X T M µq(a) XΣq(β) )
o
2
+(y − Xµq(β) )T M µq(a) (y − Xµq(β) )
µq(1/σε2 ) =
µq(1/σε ) =
J + (2Aε + n + 1, C5 , C6 )
J + (2Aε + n − 1, C5 , C6 )
J + (2Aε + n, C5 , C6 )
J + (2Aε + n − 1, C5 , C6 )
until the increase in p(y; q) is negligible.
Algorithm 4: Mean field variational Bayes algorithm for Model (3.2) under product restriction (3.3).
81
3.3. SEMIPARAMETRIC REGRESSION CASE
3.3
Semiparametric regression case
Here we present a semiparametric regression model, again for the one predictor case.
This model allows more flexibility in the shape of the fitted curve via inclusion of spline
basis functions.
3.3.1
Model
We impose the model
yi = β0 + β1 xi +
K
X
uk zk (xi ) + εi ,
k=1
1 ≤ i ≤ n,
where {z1 (·), . . . , zK (·)} are a set of spline basis functions and (u1 , . . . , uK ) are spline
coefficients. In Bayesian hierarchical form we have
ind.
yi |β, u, σε ∼ Asymmetric-Laplace{(Xβ + Zu)i , σε , τ },
1 ≤ i ≤ n,
(3.5)
σu2 ∼ Inverse-Gamma(Au , Bu ),
u|σu2 ∼ N (0, σu2 I),
σε2 ∼ Inverse-Gamma(Aε , Bε ).
β ∼ N (0, Σβ ),
where β and X are as defined in Section 3.2.1,


u1


 . 
u =  ..  ,


uK

and
z (x ) . . . zK (x1 )
 1 1
..
..

..
Z=
.
.
.

z1 (xn ) . . . zK (xn )



.

Introduction of auxiliary variables a = (a1 , . . . , an ) and application of Result 1.8 yields
the model
ind.
yi |β, u, σε , ai ∼ N
(Xβ + Zu)i +
(τ − 21 )σε
σε2
ai τ (1−τ ) , ai τ (1−τ )
ind.
ai ∼ Inverse-Gamma 1, 12 ,
u|σu2 ∼ N (0, σu2 I),
β ∼ N (0, Σβ ),
,
(3.6)
σu2 ∼ Inverse-Gamma(Au , Bu ),
σε2 ∼ Inverse-Gamma(Aε , Bε ).
The relationship between the parameters in Model (3.6) is illutsrtaed by the DAG in Figure 3.2.
82
3.3. SEMIPARAMETRIC REGRESSION CASE
3.3.2
Mean field variational Bayes
Using the locality property of MFVB, many of the optimal densities of key parameters in
Model (3.6) remain unchanged from those in Model (3.2). This is similar to the case in
Chapter 2, where the progression from a basic GEV regression model to a more complex
additive model caused minimal changes to the MFVB optimal densities of many parameters. Infact, the structure of the distributions of a and σε2 are the same as in Model (3.2).
The only changes are that β is replaced by (β, u), and X is replaced by C = [X Z]. We
need only derive optimal q densities for the new parameters, namely (β, u) and σu2 .
We impose the product restriction
q(β, u, a, σε2 , σu2 ) ≈ q(β, u)q(a)q(σε2 , σu2 ).
(3.7)
When we moralise (see Definition 1.4.9) the DAG in Figure 3.2, we find that all paths
between σu2 and σε2 must pass through at least two of the nodes {y, β, u, a}. Another
way of stating this structure is that the set {y, β, u, a} separates σu2 from σε2 . Applying
Theorem 1.4.1 then gives the result
σu2 ⊥ σε2 {y, β, u}.
Hence product restriction (3.7) reduces further via induced factorisations to
q(β, u, a, σε2 , σu2 ) ≈ q(β, u)q(a)q(σε2 )(σu2 ).
(3.8)
Derivations of the necessary optimal q densities leading to the updates in Algorithm 5 are
deferred to Appendix 3.A. Derivation of the lower bound on the marginal log-likelihood
are also given in Appendix 3.A.
The lower bound on the marginal log-likelihood for
Model (3.6) is given by
(K + 2)
log p(y; q) = Aε log Bε − log Γ(Aε ) + log(2) +
+ n log{τ (1 − τ )}
2
|Σq(β,u) |
+ 12 log
+ log{J + (2Aε + n − 1, C7 , C8 )}
|Σβ |
n
o
−1
T
− 21 tr(Σ−1
Σ
)
+
µ
Σ
µ
q(β)
q(β)
q(β) β
β
n
− 12 µq(1/σu2 ) {tr(Σq(u) ) + kµq(u) k2 } −
+Au log(Bu ) − log{Γ(Au )} − (Au +
+ log{Γ(Au +
K
2 )}
X 1
1
8τ (1 − τ )
µq(ai )
i=1
K
2 ))
2 ) log(Bq(σu
− µq(1/σu2 ) (Bq(σu2 ) − Bu ).
(3.9)
83
3.3. SEMIPARAMETRIC REGRESSION CASE
Initialize: µq(1/σε2 ) , µq(1/σε ) and µq(ai ) for 1 ≤ i ≤ n.
Cycle:
Update q ∗ (β, u) parameters:
T
Σq(β,u) ←
Σ−1
0
β
0
µq(1/σu2 ) I K
τ (1 − τ )µq(1/σε2 ) C M µq(a) C +
n
µq(β,u) ← Σq(β,u) gτ µq(1/σε2 ) C T M µq(a) y
+ τ − 12 µq(1/σε ) C T 1 .
−1
,
Update q ∗ (ai ) parameters:
For i = 1, . . . , n:
µq(ai ) ← {2τ (1 − τ )
−1
q
T
2
× µq(1/σε2 ) [(CΣq(β,u) C )ii + {yi − (Cµq(β,u) )i } ]
M µq(a) ← diag(µq(a1 ) , . . . , µq(an ) )
Update q ∗ (σu2 ) parameters:
Bq(σu2 ) ← Bu + 21 {kµq(u) k2 + tr(Σq(u) )};
µq(1/σu2 ) ← (Au +
K
2)
2 )/Bq(σu
Update q ∗ (σε2 ) parameters:
C7 ← τ −
C8 ← Bε +
µq(1/σε2 ) =
1
2
(y − Cµq(β,u) )T 1
τ (1 − τ ) n
tr(C T M µq(a) CΣq(β,u) )
o
2
+(y − Cµq(β,u) )T M µq(a) (y − Cµq(β,u) )
J + (2Aε + n + 1, C7 , C8 )
;
J + (2Aε + n − 1, C7 , C8 )
µq(1/σε ) =
J + (2Aε + n, C7 , C8 )
J + (2Aε + n − 1, C7 , C8 )
until the increase in p(y; q) is negligible.
Algorithm 5: Mean field variational Bayes algorithm for Model (3.6) under product restriction (3.8).
84
3.4. RESULTS
σu2
β
a
u
y
σε2
Figure 3.2: Directed acyclic graph for Model (3.6).
3.4
Results
Here we present the results of MFVB estimation under Algorithm 5. Data was generated
using
yi = sin(2πx2i ) + εi ,
ind.
εi ∼ N (0, 1),
(3.10)
where xi = (i − 1)/n and n = 1000. We limit our investigation to the median, the first
and third quartiles, and the tenth and nintieth percentiles. Hence
τ ∈ {0.1, 0.25, 0.5, 0.75, 0.9}.
(3.11)
Recall that the quantile level is given by τ , so for example, the first quartile is given by
τ = 0.25. The length of u was set to K = 22. Hyperparameters were set to σβ = 108 and
Aε = Bε = Au = Bu = 0.01. Iterations of Algorithm 5 were terminated when the change
in lower bound (3.9) reached less than 10−10 .
Monotonic lower bounds were achieved for all quantiles. Figure 3.3 illustrates iterations of lower bound (3.9) for the median, or τ = 0.5.1 Convergence of Algorithm 5 was
achieved after 80 iterations. Figure 3.4 illustrates the MFVB fit and corresponding pointwise 95% credible intervals for all quantiles considered. For all values of τ , the MFVB
estimates successfully capture the underlying trend of the data. We investigate this behaviour further in the following section via comparison of MFVB with MCMC inference.
3.4.1
Comparisons with Markov chain Monte Carlo
MCMC samples of size 10000 were generated, with the first 5000 discarded and the remaining 5000 thinned by a factor of 5. MCMC inference took 22 minutes and 6 seconds
85
3.4. RESULTS
●
●
−1480
●
●
−1500
●
●
−1520
●
●
−1540
lower bound on marginal log−likelihood
−1460
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●●●●●●
●●
●
●
●
●
●
0
20
40
60
80
iterations
Figure 3.3: Successive values of lower bound (3.9) to monitor convergence of MFVB Algorithm 5.
86
-1
0
y
1
2
3
3.4. RESULTS
-2
τ = 0.9
τ = 0.75
τ = 0.5
τ = 0.25
τ = 0.1
0.0
0.2
0.4
0.6
0.8
1.0
x
Figure 3.4: Quantile estimates (solid) and pointwise 95% credible intervals (dotted) for
MFVB fitting of (3.10) via Algorithm 5. Estimates are shown for values of τ stated in
(3.11).
to run. In contrast, MFVB took only 45 seconds to run.
Figure 3.5 illustrates the adequacy of MFVB inference in estimating the median. Both
the MFVB and MCMC fits capture the underlying trend of the data. In this instance,
MFVB produces credible intervals comparable with those produced by MCMC inference.
Delving more deeply into the quality of the MFVB inference, we now turn our focus
to the accuracy the MFVB fit achieved by Algorithm 5 for data modelled by (3.10). Figure
3.6 quantifies the accuracy achieved for the MFVB fit for τ = 0.5 at the quartiles of the
xi ’s, denoted by ŷ(Qj ) for j = 1, 2, 3. It is pleasing to see that a high accuracy of 94% was
achieved for the fit at the median in a fraction of the time it took to run MCMC. Accuracy
of the fit at the first quartile was 84%, and 59% accuracy was achieved for the fit at the
third quartile. The accuracy measure is explained in Section 1.6.
87
3
3.4. RESULTS
●
●
●
●
●
●
2
●
●
●
●
●
●
1
0
−1
−2
●
●
●
●
●●
●
●
●
●
●
●
●
●
● ●
●
● ●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●● ●
●
●
●●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●● ● ●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
● ●
●
●
●
● ●●
●
●
●● ●
● ●
●
●
●
●●
●
● ●●
●
●●
●
●●
● ●
● ● ● ●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
● ●●
●
●
●
●
●
●●●
● ●● ●
●
●
●
●
●● ●
●
●
●●
●●● ● ●●●●
● ●
●
●●
● ●
●
●
●
●
● ●●
●●
● ● ● ●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●●
●●
●
● ●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
● ●
●● ●
●●
● ●
●●
●
●
●●
●
●●
● ●
● ● ●
●
●●
●
● ●
●
●●
●●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●● ●
●●
●
● ●
●
●
●● ●
●
●
●
●
●
●
●
●
●●
●
● ● ●
● ●●● ●
● ● ●●
●
● ●
●
●
● ●
●●
●
●
●
●
●
●
●
●
●
●
● ●●●●
●●●
●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
● ●
●
●
●
●●
●
●
●
●
●
● ●●●
● ●●
● ●
●
●
●
●●
●
●●
●
●
● ● ●
●
●
●
● ●
●
● ●
● ●
●
●
●
●
●
●
● ●
●
●
●
●
● ●●
●
●
● ●
●
●
●
●
●
●
●●
● ●
●
●
●
●
●
●
●
●●● ● ●
●
●
●●
●●
●● ●
●
●●
●
● ●
●
● ● ● ●
● ●●● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ● ●● ●●
●
● ●
●
●
●
●
●● ● ●
●
● ●
●●
● ●
●
●
●
● ●●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
● ● ●
●
●
● ●
●
● ●
●
●●●
●
●
●
●
● ● ● ●●
● ●● ●
●
● ●
●
● ●
●
●
● ●
●
●
●
●
● ● ● ●
●
●
●●● ● ● ●
●
●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●● ●●●●
● ●
●
●●
●
●
●
●
●
●
●●● ● ● ●●●●● ● ●
● ●
●● ●
●
●
●● ● ● ●
●●
●●●
●
●
●
●
●●●
● ●
●
●● ● ● ●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
● ●
●●
●●
●
●
●
●
● ●●
●●
●
● ●●
●
● ●
● ●
●
●
● ●
●
●
●
●
●
●
● ● ● ● ●
● ●● ●●
●
●
●
● ●
● ●
● ●
●●
●
●●
●
●
●●
●●
●
●
●
●
●●
●
●●
●
●
●
●
●●●
●●● ●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●●
●
●
●
●● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ● ●
●
● ●
●●
●●
●●
●
●
●
● ● ●
●
●
●
● ● ●●
●●
●
●
●
●
●
●
●
●
●
●
●●
●● ●
●
●
●●
●
● ●
●
●
●●
●
● ●●
●
●
●
●
●
●
●
●
●
● ●
●
●
● ●
●
●
●
● ●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
y
●
●
●
●●
●
●
●
●
●
●
●
0.0
0.2
0.4
0.6
0.8
1.0
x
Figure 3.5: Median (τ = 0.5) estimates and pointwise 95% credible intervals for MFVB
(red) and MCMC (blue) fitting of (3.10).
3.4.2
Accuracy study
In this section we systematically examine how well MFVB Algorithm 5 performs against
its MCMC counterpart via a simulation study. We focus on the fit achieved by MFVB
inference at the median of the xi ’s, denoted by ŷ(Q2 ). We consider
τ ∈ {0.1, 0.25, 0.5, 0.75, 0.9}.
Samples of size n = 500 were generated according to Model (3.10). Fifty simulations
were carried out for each value of τ , with the accuracy summarised in Figure 3.7.
It is evident that the best quality MFVB fit is achieved for the median, or τ = 0, 5, with
a mean accuracy of 80.58%. The accuracy for the MFVB fit for τ = 0.5 also achieved the
lowest standard deviation. The upper quartile and lower quartiles also had reasonable
mean accuracy, with 72.30% and 66.84% respectively. The mean accuracy for τ = 0.1
88
6
7
3.5. DISCUSSION
94%
4
3
approx. posterior
4
3
2
accuracy
0
0
1
2
accuracy
1
approx. posterior
5
5
6
84%
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.8
0.9
yHat
1.0
1.1
1.2
1.3
yHat
(b) Median
6
(a) First quartile
2
3
4
accuracy
0
1
approx. posterior
5
59%
−0.7
−0.6
−0.5
−0.4
−0.3
−0.2
yHat
(c) Third quartile
Figure 3.6: MFVB (blue) and MCMC (orange) approximate posterior densities for the
estimated median ŷ at the quartiles of the xi ’s under Model (3.6). The accuracy figures
measure the accuracy of the variational fit compared with the MCMC fit.
and τ = 0.9, the tenth and ninetieth percentiles, were the lowest at 49.70% and 33.36%.
Overall the accuracy of the MFVB fit is excellent considering the massive speed gains
achieved.
3.5
Discussion
Throughout this chapter we developed fast, deterministic inference for quantile semiparametric regression via use of the Asymmetric Laplace distribution. The work culminated in the development Algorithm 5. Monotonicity of lower bound (3.9) was achieved,
and comparisons with MCMC ultimately showed excellent performance of MFVB for
quantile regression given the speed gained. The most widely used quantile, the median,
pleasingly achieved a high 80% accuracy for MFVB inference.
The significance of this chapter lies in its ability to facilitate fast, deterministic in-
89
3.5. DISCUSSION
●
●
80
accuracy
●
●
60
●
40
●
20
0.1
0.25
0.5
0.75
0.9
λ
Figure 3.7: Boxplots of accuracy measurements for ŷ(Q2 ) for the accuracy study described
in the text.
ference in Bayesian quantile regression. MFVB provides an alternative inference tool to
MCMC, to be used when computing time and/or storage are lacking.
90
3.A. DERIVATION OF ALGORITHM 4 AND LOWER BOUND (3.4)
3.A
Derivation of Algorithm 4 and lower bound (3.4)
3.A.1
Full conditionals
Full conditional for β
log p(β|rest) = − 12 (β T Ωβ − 2β T ω) + const.
where
Ω =
ω =
τ (1 − τ ) T
X M a X + Σ−1
β ,
σε2
τ − 12
τ (1 − τ ) T
X T 1,
X M ay −
σε2
σε
and M a = diag(a1 , . . . , an ).
Derivation:
p(β|rest) ∝ p(y|rest)p(β)


(
n
Y
a
g
i
τ
exp −
∝
yi −
2σε2
i=1
β
,
× exp − 12 β T Σ−1
β
!)2 
τ − 12 σε

(Xβ)i +
ai gτ
where gτ = τ (1 − τ ). Taking the logarithm, we have
n
gτ X
log p(β|rest) = − 2
ai
2σε
i=1
(
yi −
(Xβ)i +
1
2
!)2
− τ σε
ai gτ
− 12 β T Σ−1
β β + const.
Now,
!)2
τ − 21 σε
ai yi − (Xβ)i +
ai gτ
i=1
n
n
X
2 τ − 12 X
2
=
ai {yi − (Xβ)i } +
(Xβ)i + const.
gτ
i=1
i=1
1
2
τ
−
2
= (y − Xβ)T M a (y − Xβ) +
(Xβ)T 1 + const.
gτ
n
X
(
where M a = diag(a1 , . . . , an ). Hence
gτ
log p(β|rest) = − 2
2σε
(
2 τ−
(y − Xβ)T M a (y − Xβ) +
gτ
− 21 β T Σ−1
β β + const.
1
2
)
(Xβ)T 1
91
3.A. DERIVATION OF ALGORITHM 4 AND LOWER BOUND (3.4)
τ (1 − τ ) T
−1
β
X M a X + Σβ β
σε2
)#
(
1
τ
−
τ
(1
−
τ
)
2
XT 1
+ const.
X T M ay −
−2β T
σε2
σε
− 12
=
T
The form of the full conditional for β follows directly.
Full conditional for σε2
log p(σε2 |rest)
= − Aε +
+
n
2
+1
log(σε2 )
1
− 2
σε
τ (1 − τ )
Bε +
(y − Xβ)T M a (y − Xβ)
2
τ − 12
(y − Xβ)T 1 + const.
(σε2 )1/2
Derivation:
p(σε2 |rest) ∝ p(y|rest)p(σε2 )


(
n
Y
1
a
g
i
τ
 exp −
∝
yi −
σε
2σε2
(Xβ)i +
i=1
×σε2
−(Aε +1)
1
2
τ − σε
ai gτ
!)2 

exp −Bε /σε2 .
Taking logarithms,
Bε
+ 1 log(σε2 ) − 2
σε
"
#2
n
τ − 21 σε
gτ X
− 2
ai
− {yi − (Xβ)i } + const.
2σε
ai gτ
i=1
n
τ − 12 X
n
Bε
2
{yi − (Xβ)i }
= − Aε + + 1 log(σε ) − 2 +
2
σε
σε
log p(σε2 |rest) = − Aε +
n
2
i=1
−
gτ
2σε2
n
X
i=1
ai {yi − (Xβ)i }2 + const.
τ − 21
n
= − Aε + + 1 log(σε2 ) + 2 1/2
(y − Xβ)T 1
2
(σε )
1
τ (1 − τ )
T
− 2 Bε +
(y − Xβ) M a (y − Xβ) + const.
σε
2
Full conditional for ai
3
log p(ai |rest) = − log(ai ) −
2
1
2
τ (1 − τ )
1
ai {yi − (Xβ)i }2 +
2
σε
4τ (1 − τ )ai
+ const.
3.A. DERIVATION OF ALGORITHM 4 AND LOWER BOUND (3.4)
92
Derivation:
p(ai |rest) ∝ p(yi |rest)p(ai )

"
#2 
r
1
τ − 2 σε
ai gτ
ai gτ
− {yi − (Xβ)i } 
exp − 2
∝
2
σε
2σε
ai gτ
1
−2
×ai exp
.
2ai
Taking logarithms, we have
τ (1 − τ )
1
3
+ const.
ai {yi − (Xβ)i }2 −
log p(ai |rest) = − log(ai ) −
2
2σε2
8τ (1 − τ )ai
3.A.2
Optimal q ∗ densities
Expression for q ∗ (β)
q ∗ (β) ∼ N (µq(β) , Σq(β) )
where
−1
Σq(β) = {τ (1 − τ )µq(1/σε2 ) X T M µq(a) X + Σ−1
β } ,
n
o
µq(β) = Σq(β) τ (1 − τ )µq(1/σε2 ) X T M µq(a) y + τ − 21 µq(1/σε ) X T 1
and
M µq(a) = diag(µq(a1 ) , . . . , µq(an ) ).
Derivation:
log q ∗ (β) = − 12 Eq (β T Ωβ − 2β T ω) + const.
= − 12 {β T Eq (Ω)β − 2β T Eq (ω)} + const.
Application of Result 1.21 gives
log q ∗ (β) = − 12 {β − Eq (Ω)−1 Eq (ω)}T Eq (Ω){β − Eq (Ω)−1 Eq (ω)} + const.
Therefore,
q ∗ (β) ∼ N{Eq (Ω)−1 Eq (ω), Eq (Ω)−1 }
(3.12)
Now,
Eq (Ω) = Eq
τ (1 − τ ) T
X M a X + Σ−1
β
σε2
93
3.A. DERIVATION OF ALGORITHM 4 AND LOWER BOUND (3.4)
and
(
Eq (ω) = Eq
)
τ − 12
τ (1 − τ ) T
T
X 1
X M ay −
σε2
σε
where M a = diag(a1 , . . . , an ). It follows that
Eq (Ω) = τ (1 − τ )µq(1/σε2 ) X T M µq(a) X + Σ−1
β
and
Eq (ω) = τ (1 − τ )µq(1/σε2 ) X T M µq(a) y − τ −
1
2
µq(1/σε ) X T 1
where M µq(a) = diag(µq(a1 ) , . . . , µq(an ) .
Expressions for q ∗ (σε2 ), µq(1/σε ) and µq(1/σε2 )
− A + +1
σε2 ( ε 2 ) exp
n
q ∗ (σε2 ) =
2J + (2A
C5
(σε2 )1/2
−
C6
σε2
ε + n − 1, C5 , C6 )
,
µq(1/σε ) =
J + (2Aε + n + 1, C5 , C6 )
J + (2Aε + n − 1, C5 , C6 )
µq(1/σε ) =
J + (2Aε + n, C5 , C6 )
,
J + (2Aε + n − 1, C5 , C6 )
and
σε2 > 0
where
C5 = τ −
1
2
(y − Xµq(β) )T 1
and
o
τ (1 − τ ) n
T
T
C6 = Bε +
tr(X M µq(a) XΣq(β) ) + (y − Xµq(β) ) M µq(a) (y − Xµq(β) ) .
2
Derivation:
log q ∗ (σε2 )
= Eq − Aε +
n
2
+1
log(σε2 )
1
− 2
σ
# ε
τ (1 − τ )
Bε +
(y − Xβ)T M a (y − Xβ)
2
τ − 21
(y − Xβ)T 1 + const.
(σε2 )1/2
1
τ (1 − τ ) 2
T
n
= − Aε + 2 + 1 log(σε ) − 2 Bε +
Eq (y − Xβ) M a (y − Xβ)
σε
2
τ − 12
Eq (y − Xβ)T 1 + const.
+ 2 1/2
(σε )
+
94
3.A. DERIVATION OF ALGORITHM 4 AND LOWER BOUND (3.4)
Now, using Result 1.23,
Eq (y − Xβ)T M a (y − Xβ)
= tr{Eq (M a ) Covq (y − Xβ)} + Eq (y − Xβ)T Eq (M a )Eq (y − Xβ)
= tr{M µq(a) XΣq(β) X T } + (y − Xµq(β) )T M µq(a) (y − Xµq(β) )
= tr{X T M µq(a) XΣq(β) } + (y − Xµq(β) )T M µq(a) (y − Xµq(β) ).
Therefore,
log q
∗
τ − 12
+ 2 1/2 (y − Xµq(β) )T 1
= − Aε + + 1
(σε )
1
τ (1 − τ ) n
− 2 Bε +
tr(M µq(a) XΣq(β) X T )
σε
2
oi
+(y − Xµq(β) )T M µq(a) (y − Xµq(β) ) + const.
(σε2 )
n
2
log(σε2 )
and hence
q
∗
(σε2 )
∝
n
−(Aε + 2 +1)
σε2
exp
C6
C5
− 2
1/2
2
σε
(σε )
where
C5 = τ −
1
2
(y − Xµq(β) )T 1
and
C6 = Bε +
o
τ (1 − τ ) n
tr(X T M µq(a) XΣq(β) ) + (y − Xµq(β) )T M µq(a) (y − Xµq(β) ) .
2
Since q ∗ (σε2 ) is a density, it must integrate to 1. Therefore
q ∗ (σε2 ) = R
∞
0
− A + n +1
σε2 ( ε 2 ) exp
C5
(σε2 )1/2
C6
σε2
σε2 −(Aε + 2 +1) exp
−
C5
(σε2 )1/2
−
C6
σε2
n
dσε2
.
We can simplify the integral on the denominator by making the substitution x =
σε =
1
x
⇒ σε2 =
1
x2
⇒ dσε2 = −2x−3 dx. This transforms the integral into
Z
0
∞
−(Aε + 2 +1)
σε2
exp
n
Z
0
=
=2
C5
C6
− 2
1/2
2
σ
(σε )
ε
dσε2
x2Aε +n+2 exp C5 x − C6 x2 (−2x−3 )dx
−∞
Z ∞
0
+
x2Aε +n−1 exp C5 x − C6 x2 dx
= 2J (2Aε + n − 1, C5 , C6 ).
1
σε
⇒
95
3.A. DERIVATION OF ALGORITHM 4 AND LOWER BOUND (3.4)
Therefore
q ∗ (σε2 ) =
− A + n +1
σε2 ( ε 2 ) exp
C5
(σε2 )1/2
−
C6
σε2
2J + (2Aε + n − 1, C5 , C6 )
.
By making the same substitution as above, we find that
µq(1/σε2 ) =
=
=
=
1
+
2J (2Aε + n − 1, C5 , C6 )
∞
Z
0
C6
C5
−
dσε2
(σε2 )1/2 σε2
x2Aε +n+4 exp C5 x − C6 x2 (−2x−3 )dx
1 2 −(Aε + n2 +1)
σ
exp
σε2 ε
0
1
2J + (2Aε + n − 1, C5 , C6 ) −∞
Z ∞
1
x2Aε +n+1 exp C5 x − C6 x2 dx
+
J (2Aε + n − 1, C5 , C6 ) 0
J + (2Aε + n + 1, C5 , C6 )
.
J + (2Aε + n − 1, C5 , C6 )
Similarly,
µq(1/σε ) =
Z
J + (2Aε + n, C5 , C6 )
.
J + (2Aε + n − 1, C5 , C6 )
Expressions for q ∗ (ai ) and µq(ai )
q ∗ (ai ) ∼ Inverse-Gaussian(µq(ai ) , λq(a) )
where
(
µq(ai ) =
r
2τ (1 − τ )
h
T
µq(1/σε2 ) (XΣq(β) X )ii + {yi − (Xµq(β) )i
}2
i
)−1
and
λq(a) =
1
.
4τ (1 − τ )
Derivation:
log q ∗ (ai )
3
τ (1 − τ )
1
2
ai {yi − (Xβ)i } −
+ const.
= Eq − log(ai ) −
2
2σε2
8τ (1 − τ )ai
3
1
= − log(ai ) −
2
8τ (1 − τ )ai
τ (1 − τ )
ai µq(1/σε2 ) Eq [{yi − (Xβ)i }2 ] + const.
−
2
3
1
= − log(ai ) −
2
8τ (1 − τ )ai
τ (1 − τ )
−
ai µq(1/σε2 ) [(XΣq(β) X T )ii + {yi − (Xµq(β) )i }2 ] + const.
2
3.A. DERIVATION OF ALGORITHM 4 AND LOWER BOUND (3.4)
96
Therefore,
∗
q (ai ) ∝
−3/2
ai
exp
1
−2
1
4τ (1 − τ )ai
h
io
+τ (1 − τ )ai µq(1/σε2 ) (XΣq(β) X T )ii + {yi − (Xµq(β) )i }2
which, by Result 1.7, is in the form of an Inverse-Gaussian distribution with parameters
given by
µq(ai ) =
−1
q
T
2
2τ (1 − τ ) µq(1/σε2 ) [(XΣq(β) X )ii + {yi − (Xµq(β) )i } ]
and
λq(a) =
1
.
4τ (1 − τ )
The expression for q ∗ (ai ) follows directly.
3.A.3
Derivation of lower bound (3.4)
log p(y; q) = Aε log(Bε ) − log Γ(Aε ) + log(2) + 1 + n log{τ (1 − τ )}
|Σq(β) |
1
+ log{J + (2Aε + n − 1, C5 , C6 )}
+ 2 log
|Σβ |
−1
T
− 12 {tr(Σ−1
β Σq(β) ) + µq(β) Σβ µq(β) }
n
X 1
1
−
.
8τ (1 − τ )
µq(ai )
i=1
Derivation:
log p(y; q) = Eq {log p(y|β, σε2 , a)} + Eq {log p(β) − log q ∗ (β)}
+Eq {log p(σε2 ) − log q ∗ (σε2 )} + Eq {log p(a) − log q ∗ (a)}.
Firstly,
log p(y|β, σε2 , a)

"
(
)#2 
n
1
2
X
(τ
−
)σ
2 ε

− 1 log 2πσε − ai gτ yi − (Xβ)i +
=
2
ai gτ
2σε2
ai gτ
i=1
97
3.A. DERIVATION OF ALGORITHM 4 AND LOWER BOUND (3.4)
n
X
n
n
n
= − log(2π) − log(σε2 ) + log gτ + 12
log(ai )
2
2
2
i=1
(

)2
n
1
1
X
(τ
−
(τ
−
)σ
)σ
gτ
2 ε
2 ε
ai 
−2
− 2
{yi − (Xβ)i } + {yi − (Xβ)i }2 
2σε
ai gτ
ai gτ
i=1
n
X
n
n
n
= − log(2π) − log(σε2 ) + log gτ + 12
log(ai )
2
2
2
i=1
−
1 2
2)
(τ −
2gτ
n
X
i=1
1
+
ai
n
(τ − 12 ) X
{yi
(σε2 )1/2 i=1
n
n
n
= − log(2π) − log(σε2 ) + log gτ +
2
2
2
− (Xβ)i } −
1
2
n
X
i=1
n
gτ X
ai {yi − (Xβ)i }2
2σε2
i=1
n
(τ − 12 )2 X 1
log(ai ) −
2gτ
ai
i=1
1
gτ
+ 2 1/2 (τ − 21 )1T (y − Xβ) − 2 (y − Xβ)T M a (y − Xβ).
2σε
(σε )
Taking expectations:
Eq {log p(y|β, σε2 , a)}
(
n
n
X
(τ − 12 )2 X 1
n
n
n
2
1
log(ai ) −
= Eq − log(2π) − log(σε ) + log gτ + 2
2
2
2
2gτ
a
i=1
i=1 i
gτ
1
+ 2 1/2 (τ − 12 )1T (y − Xβ) − 2 (y − Xβ)T M a (y − Xβ)
2σε
(σε )
n
X
n
n
n
n
Eq {log(ai )}
= − log(2π) − log(σε2 ) + log gτ + log gτ + 12
2
2
2
2
i=1
n
(τ − 21 )2 X
µq(1/ai ) + µq(1/σε ) (τ − 12 )1T (y − Xµq(β) )
−
2gτ
i=1
gτ
− µq(1/σε2 ) Eq {(y − Xβ)T M a (y − Xβ)}.
2
From previous work we know that
Eq {(y − Xβ)T M a (y − Xβ)}
= tr(M µq(a) XΣq(β) X T ) + (y − Xµq(β) )T M µq(a) (y − Xµq(β) ),
so
Eq {log p(y|β, σε2 , a)}
n
X
n
n
n
n
= − log(2π) − Eq {log(σε2 )} + log gτ + log gτ + 12
Eq {log(ai )}
2
2
2
2
i=1
1 2
2)
n
X
(τ −
µq(1/ai ) + µq(1/σε ) (τ − 12 )1T (y − Xµq(β) )
2gτ
i=1
h
gτ
− µq(1/σε2 ) tr(M µq(a) XΣq(β) X T )
2
i
+(y − Xµq(β) )T M µq(a) (y − Xµq(β) )
−
3.A. DERIVATION OF ALGORITHM 4 AND LOWER BOUND (3.4)
n
X
n
n
n
n
= − log(2π) − log(σε2 ) + log gτ + log gτ + 12
Eq {log(ai )}
2
2
2
2
i=1
n
(τ − 12 )2 X
−
µq(1/ai ) + µq(1/σε ) C5 − µq(1/σε2 ) (C6 − Bε ).
2gτ
i=1
Secondly,
log p(β) − log q ∗ (β)
d
= − log(2π) − 21 log |Σβ | − 12 β T Σ−1
β β
2 d
T −1
1
1
− − log(2π) − 2 log |Σq(β) | − 2 (β − µq(β) ) Σq(β) (β − µq(β) )
2
|Σq(β) |
T −1
1
1
− 12 β T Σ−1
= 2 log
β β + 2 (β − µq(β) ) Σq(β) (β − µq(β) ).
|Σβ |
Taking expectations,
∗
Eq {log p(β) − log q (β)} =
1
2
log
|Σq(β) |
|Σβ |
− 12 Eq (β T Σ−1
β β)
+ 12 Eq {(β − µq(β) )T Σ−1
q(β) (β − µq(β) )}.
Using Result 1.23,
−1
T −1
Eq (β T Σ−1
β β) = tr{Σβ Covq (β)} + Eq (β) Σβ Eq (β)
−1
T
= tr(Σ−1
β Σq(β) ) + µq(β) Σβ µq(β) .
Similarly,
−1
Eq {(β − µq(β) )T Σ−1
q(β) (β − µq(β) )} = tr(Σq(β) Σq(β) )
= tr{I2 }
= 2.
Therefore,
∗
Eq {log p(β) − log q (β)} =
1
2
|Σq(β) |
|Σβ |
log
+1
n
o
−1
T
− 21 tr(Σ−1
Σ
)
+
µ
Σ
µ
q(β)
q(β) .
q(β) β
β
Thirdly,
log p(a) =
n
X
i=1
log p(ai )
98
99
3.A. DERIVATION OF ALGORITHM 4 AND LOWER BOUND (3.4)
n
X
=
log
1 −2 −1/2ai
2 ai e
i=1
= −n log(2) − 2
n
X
i=1
log(ai ) −
1
2
n
X
1
,
ai
i=1
and
log q ∗ (a) =
=
=
n
X
i=1
n
X
log q ∗ (ai )
"s
log
i=1
"
n
X
1
2
i=1
=
)#
(
λq(a)
λq(a) (ai − µq(ai ) )2
exp −
2πa3i
2µ2q(ai ) ai
λq(a)
3
log(λq(a) ) − log(2π) − log(ai ) −
2
2
(
1
2
a2i − 2ai µq(ai ) + µ2q(ai )
µ2q(ai ) ai
n
n
λq(a) X
n
n
3X
log(λq(a) ) − log(2π) −
log(ai ) −
2
2
2
2
i=1
i=1
ai
µ2q(ai )
+
2
µ2q(ai )
Therefore
log p(a) − log q ∗ (a)
= −n log(2) − 2
−
3
2
n
X
i=1
n
X
i=1
log(ai ) −
= −n log(2) +
n
nn
X
n
1
−
log(λq(a) ) − log(2π)
a
2
2
i=1 i
!)
n
X
ai
2
1
+ 2
+
2
µq(ai ) µq(ai ) ai
log(ai ) −
λq(a)
2
1
2
i=1
n
X
n
n
log(2π) − log(λq(a) ) − 12
log(ai )
2
2
i=1
n
n
n
X
(1 − λq(a) ) X
λq(a) X
1
ai
1
−
−
λ
+
.
q(a)
2
2
ai
2
µ
µ
q(a
)
i
q(a
)
i
i=1
i=1
i=1
Taking expectations
Eq {log p(a) − log q ∗ (a)}
= −n log(2) +
−
2
= −n log(2) +
−
n
X
n
n
log(2π) − log(λq(a) ) − 12
Eq {log(ai )}
2
2
i=1
n
(1 − λq(a) ) X
µq(1/ai ) +
i=1
n
n
log(2π) −
2
2
n
X
µq(ai )
− λq(a)
n
X
1
µ
µ2
i=1 q(ai )
i=1 q(ai )
n
X
Eq {log(ai )}
log(λq(a) ) − 12
i=1
n
λq(a) X
1
n
(1 − λq(a) ) X
µq(1/ai ) −
2
i=1
λq(a)
2
2
i=1
)#
µq(ai )
.
1
+
ai
!
.
100
3.A. DERIVATION OF ALGORITHM 4 AND LOWER BOUND (3.4)
We know that
µq(1/ai ) =
1
1
+
,
µq(ai ) λq(a)
so
n
n (1 − λq(a) ) X
(1 − λq(a) ) X
1
1
µq(1/ai ) = −
+
−
2
2
µq(ai ) λq(a)
i=1
i=1
( n
!
)
X 1
(1 − λq(a) )
n
= −
+
2
µ
λq(a)
i=1 q(ai )
!
n
n(1 − λq(a) )
(1 − λq(a) ) X
1
−
= −
2
µ
2λq(a)
i=1 q(ai )
!
n
(1 − λq(a) ) X
1
n
= −
+ − 2ngτ .
2
µq(ai )
2
i=1
Hence
n
n
log(2π) − log(λq(a) )
2
2
n
n
X
X
1
n
1
1
−2
Eq {log(ai )} − 2
+ − 2ngτ .
µq(ai )
2
Eq {log p(a) − log q ∗ (a)} = −n log(2) +
i=1
i=1
Finally,
log p(σε2 ) − log q ∗ (σε2 ) = Aε log(Bε ) − log Γ(Aε ) − (Aε + 1) log(σε2 ) −
Bε
σε2
− [− log{2J (2Aε + n − 1, C5 , C6 )}
n
C5
C6
− Aε + + 1 log(σε2 ) + 2 1/2 − 2
2
σε
(σε )
n
= Aε log(Bε ) − log Γ(Aε ) + log(σε2 )
2
C5
(C6 − Bε )
+ log{2J (2Aε + n − 1, C5 , C6 )} − 2 1/2 +
.
σε2
(σε )
Taking expectations
n
Eq {log(σε2 )}
2
+ log{2J (2Aε + n − 1, C5 , C6 )}
Eq {log p(σε2 ) − log q ∗ (σε2 )} = Aε log(Bε ) − log Γ(Aε ) +
−µq(1/σε ) C5 + µq(1/σε2 ) (C6 − Bε ).
Combining, we get
n
X
n
n
n
2
1
log p(y; q) = − log(2π) − Eq {log(σε )} + log gτ + 2
Eq {log(ai )}
2
2
2
i=1
3.A. DERIVATION OF ALGORITHM 4 AND LOWER BOUND (3.4)
n
(τ − 12 )2 X
µq(1/ai ) + µq(1/σε ) C5 − µq(1/σε2 ) (C6 − Bε )
2gτ
i=1
|Σ
q(β) |
1
+ 2 log
+1
|Σβ |
−
−1
T
− 12 {tr(Σ−1
β Σq(β) ) + µq(β) Σβ µq(β) }
−n log(2) +
− 12
n
X
n
n
log(2π) − log(λq(a) ) − 12
Eq {log(ai )}
2
2
i=1
n
X
1
i=1
µq(ai )
+
n
− 2ngτ
2
n
Eq {log(σε2 )}
2
+ log{2J + (2Aε + n − 1, C5 , C6 )}
+Aε log(Bε ) − log Γ(Aε ) +
−µq(1/σε ) C5 + µq(1/σε2 ) (C6 − Bε ).
Simplifying and using µq(1/ai ) =
1
µq(ai )
+
1
λq(a)
=
1
µq(ai )
+ 4gτ ,
log p(y; q)
|Σq(β) |
−1
T
+ 1 − 12 {tr(Σ−1
= log
β Σq(β) ) + µq(β) Σβ µq(β) }
|Σβ |
+Aε log(Bε ) − log Γ(Aε ) + log(2) + log{J + (2Aε + n − 1, C5 , C6 )}
n n
(1/2 − τ )2 X
1
1
n
+ log gτ −
+ 4gτ − n log(2) − log
2
2gτ
µq(ai )
2
4gτ
1
2
i=1
− 21
n
X
1
i=1
µq(ai )
+
n
− 2ngτ .
2
Focusing on the final two lines of the expression immediately above,
n (τ − 12 )2 X
n
1
log gτ −
+ 4gτ − n log(2)
2
2gτ
µq(ai )
i=1
n
X
n
1
1
n
1
− log
−2
+ − 2ngτ
2
4gτ
µ
2
i=1 q(ai )
(
) n
1 2
X 1
(τ
−
)
n
2
= log gτ − 12
+1
− 2n(τ − 21 )2 − n log(2)
2
gτ
µq(ai )
i=1
n
n
+ log(4gτ ) + − 2ngτ
2
2
n
X
1
1
n
=−
+ log gτ − 2n(τ − 12 )2 − n log(2)
8gτ
µ
2
i=1 q(ai )
n
n
+ − 2ngτ + (log 4 + log gτ )
2
2
101
3.A. DERIVATION OF ALGORITHM 4 AND LOWER BOUND (3.4)
n
n
1 X 1
+ log gτ − 2n(τ − 12 )2 − n log(2)
8gτ
µ
2
i=1 q(ai )
n
n
+ − 2ngτ + n log(2) + log gτ
2
2
n
1 X 1
n
= −
+ n log gτ − 2n( 14 − τ + τ 2 ) + − 2ngτ
8gτ
µq(ai )
2
= −
i=1
n
= −
X 1
1
+ n log{τ (1 − τ )}.
8τ (1 − τ )
µq(ai )
i=1
Combining, we get:
log p(y; q) = Aε log(Bε ) − log Γ(Aε ) + log(2) + 1 + n log{τ (1 − τ )}
|Σq(β) |
1
+ 2 log
+ log{J + (2Aε + n − 1, C5 , C6 )}
|Σβ |
−1
T
− 12 {tr(Σ−1
β Σq(β) ) + µq(β) Σβ µq(β) }
n
X 1
1
−
.
8τ (1 − τ )
µq(ai )
i=1
102
103
3.B. DERIVATION OF ALGORITHM 5 AND LOWER BOUND (3.9)
3.B
Derivation of Algorithm 5 and lower bound (3.9)
3.B.1
Full conditionals
Full conditional for (β, u)
"
log p(β, u|rest) = − 12 
#T
β
u
"
Ω
β
u
#
"
−2
β
u

#T
ω  + const.
where
Ω =
ω =
τ (1 − τ ) T
C M a C + Σ−1
(β,u) ,
σε2
τ − 12
τ (1 − τ ) T
C T 1,
C M ay −
σε2
σε
M a = diag(a1 , . . . , an ) and C = [X Z].
Derivation:
p(β, u|rest) ∝ p(y|rest)p(β, u)


#!
"
(
"
)#2 
n 

1
Y
τ − 2 σε
ai gτ

∝
exp − 2 yi −
+
C β


2σε
ai gτ
u
i=1
i


#
#T
"
"


β
,
× exp − 21 β
Σ−1
(β,u)

u 
u
where gτ = τ (1 − τ ) and C = [X Z]. This is in the exact same form as the full conditional
for β under Model (3.2). The stated result follows immediately.
Full conditional for σε2
log p(σε2 |rest)
= − Aε + n2 + 1 log(σε2 )

τ (1 − τ )
1 
− 2 Bε +
σε 
2
τ − 21
+ 2 1/2
(σε )
"
y−C
β
u
"
y−C
β
u
#!T
#!T
1 + const.
Derivation:
The derivation is identical to that for σ 2 in Model (3.2).
"
Ma
y−C
β
u
#!


104
3.B. DERIVATION OF ALGORITHM 5 AND LOWER BOUND (3.9)
Full conditional for σu2
K
1
= − Au +
+ 1 log(σu2 ) − 2 Bu + 12 kuk2 + const.
2
σu
log p(σu2 |rest)
Derivation:
p(σu2 |rest) ∝ p(u|σu2 )p(σu2 )
∝
(σu2 )−K/2 exp
1
Bu
2 −Au −1
2
− 2 kuk × (σu )
exp − 2 .
2σu
σu
Taking logarithms gives the stated result.
Full conditional for ai
3
log p(ai |rest) = − log(ai ) + const.
2 
(
τ
(1
−
τ
)
ai yi −
− 12 
σ2
"
C
β
u

#! )2
+
i
1
.
4τ (1 − τ )ai
Derivation:
The derivation is identical to that for ai in Model (3.2).
3.B.2
Optimal q ∗ densities
Expression for q ∗ (β, u)
q ∗ (β, u) ∼ N (µq(β,u) , Σq(β,u) )
where
Σq(β,u)


−1
−1


Σ
0

= τ (1 − τ )µq(1/σε2 ) C T M µq(a) C +  β
,

0
µq(1/σu2 ) I K 
n
o
µq(β,u) = Σq(β,u) τ (1 − τ )µq(1/σε2 ) C T M µq(a) y + τ − 12 µq(1/σε ) C T 1
and
M µq(a) = diag(µq(a1 ) , . . . , µq(an ) ).
Derivation:
The derivation is identical to that for β in Model (3.2), replacing Xβ with C
"
β
u
#
.
3.B. DERIVATION OF ALGORITHM 5 AND LOWER BOUND (3.9)
Expressions for q ∗ (σε2 ), µq(1/σε ) and µq(1/σε2 )
q ∗ (σε2 ) =
− A + n +1
σε2 ( ε 2 ) exp
C7
(σε2 )1/2
−
C8
σε2
2J + (2Aε + n − 1, C7 , C8 )
,
µq(1/σε ) =
J + (2Aε + n + 1, C7 , C8 )
J + (2Aε + n − 1, C7 , C8 )
µq(1/σε ) =
J + (2Aε + n, C7 , C8 )
,
J + (2Aε + n − 1, C7 , C8 )
and
σε2 > 0
where
C7 = τ −
1
2
(y − Cµq(β,u) )T 1
and
C8 = Bε +
τ (1 − τ ) n
tr(M µq(a) CΣq(β,u) C T )
2
o
+(y − Cµq(β,u) )T M µq(a) (y − Cµq(β,u) ) .
Derivation:
The derivation is identical to that for σ 2 in Model (3.2).
Expressions for q ∗ (σu2 ) and µq(1/σu2 )
q ∗ (σu2 ) ∼ Inverse-Gamma(Au +
K
2 ))
2 , Bq(σu
where
Bq(σu2 ) = Bu + 12 {kµq(u) k2 + tr(Σq(u) )}
Derivation:
log q(σu2 )
K
1
2
2
1
+ 1 log(σu ) − 2 Bu + 2 kuk
= Eq − Au +
+ const.
2
σu
K
1
= − Au +
+ 1 log(σu2 ) − 2 Bu + 12 Eq kuk2 + const.
2
σu
Now, observing that Eq kuk2 = kµq(u) k2 + tr(Σq(u) ), we have
log q(σu2 )
K
+ 1 log(σu2 )
= − Au +
2
i
1 h
− 2 Bu + 12 {kµq(u) k2 + tr(Σq(u) )} + const.
σu
105
106
3.B. DERIVATION OF ALGORITHM 5 AND LOWER BOUND (3.9)
Hence
q(σu2 )
∝
K
(σu2 )−(Au + 2 )−1 exp
i
1 h
2
1
− 2 Bu + 2 {kµq(u) k + tr(Σq(u) )}
σu
which is in the form of an Inverse-Gamma distribution (see Definition 1.5.16) with parameters stated in the result.
Expressions for q ∗ (ai ) and µq(ai )
q ∗ (ai ) ∼ Inverse-Gaussian(µq(ai ) , λq(a) )
where
(
µq(ai ) =
and
r
2τ (1 − τ )
h
T
µq(1/σ2 ) (CΣq(β,u) C )ii + {yi − (Cµq(β,u) )i }2
λq(a) =
i
)−1
1
.
4τ (1 − τ )
Derivation:
The derivation is identical to that for ai in Model (3.2).
3.B.3
Derivation of lower bound (3.9)
(K + 2)
+ n log{τ (1 − τ )}
log p(y; q) = Aε log Bε − log Γ(Aε ) + log(2) +
2
|Σq(β,u) |
+ 12 log
+ log{J + (2Aε + n − 1, C7 , C8 )}
|Σβ |
n
− 21 tr(Σ−1
(β,u) Σq(β,u) )
o
+(µq(β,u) − µ(β,u) )T Σ−1
(µ
−
µ
)
q(β,u)
(β,u)
(β,u)
n
X 1
1
−
8τ (1 − τ )
µq(ai )
i=1
+Au log(Bu ) − log{Γ(Au )} − (Au +
+ log{Γ(Au +
K
2 )}
K
2 ))
2 ) log(Bq(σu
− µq(1/σu2 ) (Bq(σu2 ) − Bu ).
Derivation:
log p(y; q) = Eq {log p(y|β, u, σε2 , a)} + Eq {log p(β, u) − log q ∗ (β, u)}
+Eq {log p(σε2 ) − log q ∗ (σε2 )} + Eq {log p(σu2 ) − log q ∗ (σu2 )}
+Eq {log p(a) − log q ∗ (a)}.
3.B. DERIVATION OF ALGORITHM 5 AND LOWER BOUND (3.9)
107
The forms of the contributions from p(y|β, u, σε2 , a) and the parameters (β, u), σε2 and
a are identical to that for the corresponding quantities in Model (3.2). The only small
changes are that
• β is replaced by (β, u),
• X is replaced by C.
From similar working for regression case in Appendix 3.A we have:
log p(β, u|σu2 ) − log q ∗ (β, u)
"
#T
"
#
|Σ
|
q(β,u)
β
= 12 log
− 21 β
Σ−1
(β,u)
|Σ(β,u) |
u
u
"
#
!T
"
#
!
β
β
+ 21
− µq(β,u)
Σ−1
− µq(β)
q(β,u)
u
u
|Σq(β,u) |
1
2
T −1
2
1
K
1
= 2 log
− 2 log(σu ) − 2 β Σβ β + 2 kuk
|Σβ |
σu
#
!T
#
!
"
"
β
β
−1
1
− µq(β,u)
− µq(β)
Σq(β,u)
+2
u
u
using the fact that |Σ(β,u) | = (σu2 )K |Σβ |. Taking expectations,
Eq {log p(β) − log q ∗ (β)}
|Σq(β) |
2
2
1
1
= 2 log
− K2 Eq {log(σu2 )} − 21 Eq (β T Σ−1
β β) − 2 Eq (1/σu )Eq kuk
|Σβ |
 "
#
!T
"
#
!


β −µ
β −µ
+ 21 Eq
.
Σ−1
q(β,u)
q(β)
q(β,u)


u
u
Now,
Eq
 "


β
u
!T
#
− µq(β,u)
"
Σ−1
q(β,u)
β
u
#
− µq(β)
!

= tr(Σ−1
q(β,u) Σq(β,u) )

= tr{IK+2 }
= K + 2.
Also, using Result 1.22,
Eq kuk2
= tr{Covq (u)} + kEq (u)k2
= tr(Σq(u) ) + kµq(u) k2 .
108
3.B. DERIVATION OF ALGORITHM 5 AND LOWER BOUND (3.9)
Therefore,
∗
1
2
Eq {log p(β) − log q (β)} =
|Σq(β) |
(K + 2) K
+
− 2 Eq {log(σu2 )}
log
|Σβ |
2
o
n
−1
−1
T
1
− 2 tr(Σβ Σq(β) ) + µq(β) Σβ µq(β)
− 12 µq(1/σu2 ) {tr(Σq(u) ) + kµq(u) k2 }.
We derive the contribution from σu2 afresh:
Bu
log p(σu2 ) − log q ∗ (σu2 ) = Au log(Bu ) − log{Γ(Au )} − (Au + 1) log(σu2 ) − 2
σu
h
K
K
− (Au + 2 ) log(Bq(σu2 ) ) − log{Γ(Au + 2 )}
Bq(σu2 )
2
K
−(Au + 2 + 1) log(σu ) −
σu2
= Au log(Bu ) − log{Γ(Au )} − (Au + K2 ) log(Bq(σu2 ) )
1
+ log{Γ(Au + K2 )} + K2 log(σu2 ) − 2 (Bq(σu2 ) − Bu ).
σu
Taking expectations
Eq {log p(σu2 ) − log q ∗ (σu2 )} = Au log(Bu ) − log{Γ(Au )} − (Au +
+ log{Γ(Au +
K
2 )}
+
K
2 ))
2 ) log(Bq(σu
2
K
2 Eq {log(σu )}
−µq(1/σu2 ) (Bq(σu2 ) − Bu ).
Combining the contributions from all nodes, we get
n
X
n
n
n
2
1
log p(y; q) = − log(2π) − Eq {log(σε )} + log gτ + 2
Eq {log(ai )}
2
2
2
i=1
( 21
τ )2
n
X
−
µq(1/ai ) + µq(1/σε ) C7 − µq(1/σε2 ) (C8 − Bε )
2gτ
i=1
|Σq(β) |
(K + 2) K
1
+ 2 log
+
− 2 Eq {log(σu2 )}
|Σβ |
2
n
o
−1
T
− 21 tr(Σ−1
Σ
)
+
µ
Σ
µ
q(β)
q(β)
q(β) β
β
−
− 21 µq(1/σu2 ) {tr(Σq(u) ) + kµq(u) k2 }
−n log(2) +
− 12
n
X
n
n
Eq {log(ai )}
log(2π) − log(λq(a) ) − 12
2
2
n
X
1
i=1
µq(ai )
i=1
+
n
− 2ngτ
2
+Aε log Bε − log Γ(Aε ) +
n
Eq {log(σε2 )}
2
3.B. DERIVATION OF ALGORITHM 5 AND LOWER BOUND (3.9)
+ log{2J + (2Aε + n − 1, C7 , C8 )}
−µq(1/σε ) C7 + µq(1/σε2 ) (C8 − Bε )
+Au log(Bu ) − log{Γ(Au )} − (Au +
+ log{Γ(Au +
K
2 )}
+
2
K
2 Eq {log(σu )}
K
2 ))
2 ) log(Bq(σu
− µq(1/σu2 ) (Bq(σu2 ) − Bu ).
Simplifying gives lower bound (3.9):
(K + 2)
log p(y; q) = Aε log Bε − log Γ(Aε ) + log(2) +
+ n log{τ (1 − τ )}
2
|Σq(β,u) |
+ log{J + (2Aε + n − 1, C7 , C8 )}
+ 12 log
|Σβ |
o
n
−1
T
Σ
µ
− 21 tr(Σ−1
Σ
)
+
µ
q(β)
q(β)
q(β) β
β
− 12 µq(1/σu2 ) {tr(Σq(u) ) + kµq(u) k2 }
n
X
1
1
−
8τ (1 − τ )
µq(ai )
i=1
+Au log(Bu ) − log{Γ(Au )} − (Au +
+ log{Γ(Au +
K
2 )}
K
2 ))
2 ) log(Bq(σu
− µq(1/σu2 ) (Bq(σu2 ) − Bu ).
109
Chapter 4

Mean field variational Bayes for continuous sparse signal shrinkage

4.1 Introduction
There are many areas of statistics that benefit from the imposition of a sparse distribution
on a set of parameters. In many applications, the aim is to choose which of potentially
thousands or millions of covariates have the greatest effect on the response. These applications are widely referred to as “wide data” or “n ≪ p” problems, defined by the
number of variables (p) being considerably larger than the number of observations (n).
In these cases, imposing a sparse prior on the parameters of interest allows us to perform
model selection and inference simultaneously (Tibshirani, 1996).
One application where sparse estimators are in huge demand is genome wide association studies (GWAS). GWAS can essentially be considered as a penalized regression
problem:
yi = β0 + β1 xi1 + β2 xi2 + . . . + βp xip + εi
(4.1)
where yi is the phenotype of the ith individual (which may be discrete or continuous), xij is the genotype of the jth marker of the ith individual, and βj is the effect of the jth marker. The aim is to estimate the genetic effect associated with each marker (xij), and hence identify which markers are significant for the phenotype (yi). With so many parameters in the model (as p may be of the order of hundreds of thousands or millions), the key problems here are overfitting and computational constraints. The former issue can be addressed by employing penalised regression. We address the latter issue by using mean field variational
Bayes (MFVB).
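As a point of reference, the penalised-regression view of (4.1) can be illustrated with the lasso of Tibshirani (1996). The short R sketch below is purely illustrative and is not part of the thesis analyses: the simulated data, the use of the glmnet package and all variable names are assumptions introduced here; the remainder of the chapter pursues Bayesian continuous shrinkage priors fitted by MFVB instead.

    # Lasso fit to simulated wide data (n = 200, p = 2000), as a frequentist
    # analogue of imposing a sparse prior on the coefficients of (4.1).
    library(glmnet)
    set.seed(1)
    n <- 200 ; p <- 2000
    X <- matrix(rnorm(n * p), n, p)
    beta.true <- c(rep(2, 5), rep(0, p - 5))      # only 5 markers carry signal
    y <- drop(X %*% beta.true + rnorm(n))
    cv.fit <- cv.glmnet(X, y, alpha = 1)          # L1 (lasso) penalty
    sum(as.vector(coef(cv.fit, s = "lambda.min")) != 0)   # number of selected coefficients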
In a Bayesian framework, two major choices are evident in approaching n ≪ p prob-
lems. These are (1) method of inference, and (2) type of prior to induce sparseness. In
keeping with the theme of this thesis, we choose to overcome computational time constraints by utilising MFVB as the inference tool. In this chapter we explore the efficacy
of continuous priors in inducing sparsity, under the framework of MFVB. These continuous priors correspond to non-convex penalization, which has been suggested for wide
data applications (Griffin and Brown, 2011). These continuous priors are in contrast to
the existing literature which has explored “slab and spike” distributions such as Laplace-Zero mixtures (for example, Johnstone and Silverman, 2004). In particular, we focus on
the Horseshoe (Carvalho et al., 2010), Normal-Exponential-Gamma (NEG) (Griffin and
Brown, 2011) and Generalized-Double-Pareto (GDP) (Armagan et al., 2012) distributions.
These three distributions are defined in Chapter 1 by Definitions 1.5.22, 1.5.23 and 1.5.24
respectively. Standard densities (location parameter µ = 0 and scale parameter σ = 1)
of the Horseshoe, NEG and GDP distributions are shown in Figure 4.1. Both the NEG
and GDP distributions have a third shape parameter λ. Hence Figure 4.1 shows several
versions of the “standard” NEG and GDP densities, with the kurtosis increasing as the
shape parameter decreases.
[Figure 4.1 appears here: three panels showing (a) the standard Horseshoe density function, (b) standard Normal-Exponential-Gamma density functions for several values of the shape parameter λ, and (c) standard Generalized Double Pareto density functions for several values of the shape parameter λ.]

Figure 4.1: Standard (µ = 0, σ = 1) continuous sparseness inducing density functions.

The locality property of MFVB allows us to concentrate on univariate scale models of our sparsity inducing distributions. Adopting this strategy allows us to avoid unnecessarily complex calculations, and focus on the core of the problem. Our findings can then be incorporated into more complex regression models, using the conditional independence structure of the model, most clearly evident in the directed acyclic graph (DAG).

Our most significant finding, which reveals itself throughout this chapter, is that applying MFVB using the most natural auxiliary variable representations of the Horseshoe, NEG and GDP models leads to poor inference. We remedy this through the incorporation of special functions into our MFVB algorithms. Continued fraction approximations are
used to facilitate stable computation of these special functions, made practical through
the use of Lentz’s Algorithm. This adds yet another tool to the rapidly growing armoury
of MFVB.
The chapter is set out as follows: we consider the Horseshoe, NEG and GDP priors
sequentially. For each prior, we explore the impact of varying auxiliary variable representations on the quality of MFVB inference. This involves derivation and presentation
of MFVB algorithms and their corresponding lower bounds for each prior. The merits
of each representation are explored in terms of the simplicity of the resulting MFVB algorithm and the accuracy of MFVB inference. We give the most detail in the Horseshoe
case, and present necessary information in the NEG and GDP cases. Derivations of MFVB
algorithms and lower bounds are deferred to Appendices 4.A, 4.C and 4.E.
The work in this chapter is presented in the manuscript Neville, Ormerod and Wand
(2012) and is in the process of being peer reviewed. The novel results in this chapter
culminated in an academic visit to the University of Oxford in late 2012. Key findings
were presented at the Wellcome Trust Centre for Human Genetics seminar series.
4.2 Horseshoe distribution
The first case we consider is a univariate random sample drawn from the Horseshoe
distribution as follows:
\[
x_i\,|\,\sigma\stackrel{\rm ind.}{\sim}\mbox{Horseshoe}(0,\sigma),\qquad
\sigma\sim\mbox{Half-Cauchy}(A).
\qquad(4.2)
\]
Through introduction of the auxiliary variables a, b = (b1, . . . , bn) and c = (c1, . . . , cn), and application of Results 1.6, 1.9 and 1.10 respectively, we can represent (4.2) as the following three hierarchical models:
Table 4.1: Three auxiliary variable models that are each equivalent to Horseshoe Model (4.2). The abbreviations IG and HS represent the Inverse-Gamma and Horseshoe distributions respectively.

Model I:
    x_i | σ ~ind. HS(0, σ),  σ² | a ~ IG(1/2, a⁻¹),  a ~ IG(1/2, A⁻²).
Model II:
    x_i | σ, b_i ~ind. N(0, σ²/b_i),  σ² | a ~ IG(1/2, a⁻¹),  a ~ IG(1/2, A⁻²),
    p(b_i) = π⁻¹ b_i^{−1/2}(b_i + 1)⁻¹,  b_i > 0.
Model III:
    x_i | σ, b_i ~ind. N(0, σ²/b_i),  σ² | a ~ IG(1/2, a⁻¹),  a ~ IG(1/2, A⁻²),
    b_i | c_i ~ind. Gamma(1/2, c_i),  c_i ~ind. Gamma(1/2, 1).
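To make the equivalence concrete, the following R sketch (illustrative only; the function name rhorseshoe() is an assumption introduced here) draws Horseshoe(0, σ) variates by simulating the c_i and b_i layers of Model III for a fixed σ.

    # Draw from Horseshoe(0, sigma) via the Model III hierarchy of Table 4.1.
    rhorseshoe <- function(n, sigma = 1) {
      c <- rgamma(n, shape = 0.5, rate = 1)     # c_i ~ Gamma(1/2, 1)
      b <- rgamma(n, shape = 0.5, rate = c)     # b_i | c_i ~ Gamma(1/2, c_i)
      rnorm(n, mean = 0, sd = sigma / sqrt(b))  # x_i | sigma, b_i ~ N(0, sigma^2/b_i)
    }
    x <- rhorseshoe(1000)   # data of the kind used in the Section 4.2.3 simulation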
Each of the three models presented in Table 4.1 and illustrated in Figure 4.2 has both highlights and drawbacks. For example, Model III is attractive due to the simple form of the conditional distributions that make up the hierarchy. The advantages and disadvantages of each model will be elucidated in the sections immediately following, identifying the best model under the over-arching MFVB framework.
[Figure 4.2 appears here: directed acyclic graphs for the three models, with nodes a, σ and x (Model I); a, σ, b and x (Model II); and a, σ, b, c and x (Model III).]
Figure 4.2: Directed acyclic graphs corresponding to the three models listed in Table 4.1.
4.2.1 Mean field variational Bayes
We impose the following three product restrictions on the joint posterior density function:
\[
\begin{array}{ll}
p(\sigma,a\,|\,x)\approx q(\sigma)\,q(a) & \mbox{for Model I},\\
p(\sigma,a,b\,|\,x)\approx q(\sigma)\,q(a,b) & \mbox{for Model II},\\
p(\sigma,a,b,c\,|\,x)\approx q(\sigma,c)\,q(a,b) & \mbox{for Model III}.
\end{array}
\qquad(4.3)
\]
Derivations of the MFVB algorithms resulting from (4.3) are deferred to Appendix 4.A.
The optimal q∗ density for σ² under Model I has the form
\[
q^*(\sigma^2)\propto(\sigma^2)^{-(n+3)/2}
\exp\left\{-\mu_{q(1/a)}/\sigma^2+\sum_{i=1}^n \log p_{\rm HS}(x_i/\sigma)\right\}.
\qquad(4.4)
\]
The evaluation of the normalizing factor, and hence moments, for q∗(σ²) under Model I requires the use of numerical integration and multiple (n) evaluations of the exponential integral function, due to its presence in the standard Horseshoe density. This makes MFVB for Model I very computationally intensive, which is at odds with the underlying purpose of MFVB as a fast, deterministic alternative to Markov chain Monte Carlo (MCMC). We therefore shift our focus to Models II and III.
Under the product restrictions specified in (4.3), the optimal MFVB density for σ 2 has
an identical closed form for both Models II and III. This is in stark contrast to Model
I, where the form of q ∗ (σ 2 ) was intractable, hence requiring numerical integration. A
closed form optimal density for σ 2 is aligned with the MFVB framework, as it leads to
a relatively simple MFVB algorithm. Hence the appeal of Models II and III lies in the
closed form of the optimal q density for σ 2 , specifically:
\[
q^*(\sigma^2)\sim\mbox{Inverse-Gamma}\left(\tfrac12(n+1),\;
\mu_{q(1/a)}+\tfrac12\sum_{i=1}^n x_i^2\,\mu_{q(b_i)}\right).
\qquad(4.5)
\]
Algorithm 6 determines the optimal moments of q ∗ (σ 2 ), q ∗ (a), q ∗ (b) and q ∗ (c) under
product restriction (4.3). In other words, the algorithm performs fast, deterministic inference for data following (4.2), under the auxiliary variable representations described
by Models II and III. We defer derivation of Algorithm 6, i.e. derivation of optimal q ∗
densities, parameter updates and lower bound, to Appendix 4.A.
Initialize: µ_q(1/σ²) > 0. If Model III, initialize: µ_q(c_i) > 0, 1 ≤ i ≤ n.
Cycle:
    µ_q(1/a) ← A²/{A² µ_q(1/σ²) + 1}
    For i = 1, . . . , n:
        G_i ← ½ µ_q(1/σ²) x_i²
        if Model II:  µ_q(b_i) ← {G_i Q(G_i)}⁻¹ − 1
        if Model III: µ_q(b_i) ← 1/(G_i + µ_q(c_i)) ; µ_q(c_i) ← 1/(µ_q(b_i) + 1)
    µ_q(1/σ²) ← (n + 1)/{2µ_q(1/a) + Σ_{i=1}^n x_i² µ_q(b_i)}
until the increase in p(x; q) is negligible.

Algorithm 6: Mean field variational Bayes algorithm for determination of q∗(σ²) from data modelled according to (4.2). The steps differ depending on which auxiliary variable representation, Model II or Model III (set out in Table 4.1), is used.
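For illustration, a minimal R transcription of the Model III branch of Algorithm 6 is given below. The function name, the default value A = 25 and the use of a fixed number of iterations in place of monitoring the lower bound are assumptions made here; the Model II branch additionally requires the ratio Q(x), whose stable computation is the subject of Algorithm 7 below.

    # MFVB cycle of Algorithm 6, Model III branch, for Horseshoe data x.
    mfvb_horseshoe_m3 <- function(x, A = 25, n.iter = 100) {
      n <- length(x)
      mu.recip.sigsq <- 1                 # mu_{q(1/sigma^2)}
      mu.c <- rep(1, n)                   # mu_{q(c_i)}
      for (iter in 1:n.iter) {
        mu.recip.a <- A^2 / (A^2 * mu.recip.sigsq + 1)
        G <- 0.5 * mu.recip.sigsq * x^2
        mu.b <- 1 / (G + mu.c)            # mu_{q(b_i)}
        mu.c <- 1 / (mu.b + 1)            # mu_{q(c_i)}
        mu.recip.sigsq <- (n + 1) / (2 * mu.recip.a + sum(x^2 * mu.b))
      }
      # parameters of q*(sigma^2), the Inverse-Gamma density in (4.5)
      list(A.q = 0.5 * (n + 1), B.q = mu.recip.a + 0.5 * sum(x^2 * mu.b))
    }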
We now present the lower bound for Algorithm 6. We first note that for all of the sparseness inducing priors (namely the Horseshoe, NEG and GDP) we have
\[
\log p(x;q)=\left\{
\begin{array}{ll}
\log p(x;q,\mbox{BASE})+\log p(x;q,\mbox{II}) & \mbox{for Model II cases}\\
\log p(x;q,\mbox{BASE})+\log p(x;q,\mbox{III}) & \mbox{for Model III cases}
\end{array}\right.
\qquad(4.6)
\]
where log p(x; q, BASE) is the contribution from the nodes common to both models, namely
x, σ 2 and a; and log p(x; q, II) and log p(x; q, III) are the relative contributions from the
nodes specific to Models II and III.
For the Horseshoe prior, the lower bound takes the form of (4.6), with specific form
\[
\begin{aligned}
\log p(x;q,\mbox{BASE})&=\log\Gamma\{\tfrac12(n+1)\}-\tfrac{n}{2}\log(2\pi)-\log(\pi)-\log(A)\\
&\quad-\log(\mu_{q(1/\sigma^2)}+A^{-2})+\mu_{q(1/a)}\,\mu_{q(1/\sigma^2)}\\
&\quad-\tfrac12(n+1)\log\Big\{\mu_{q(1/a)}+\tfrac12\sum_{i=1}^n x_i^2\,\mu_{q(b_i)}\Big\},\\
\log p(x;q,\mbox{II})&=-n\log(\pi)+\sum_{i=1}^n\big[G_i\,\mu_{q(b_i)}+\log\{Q(G_i)\}\big],
\quad\mbox{and}\\
\log p(x;q,\mbox{III})&=-n\log(\pi)+\sum_{i=1}^n\big[\mu_{q(b_i)}(G_i+\mu_{q(c_i)})
-\log(G_i+\mu_{q(c_i)})-\log(\mu_{q(b_i)}+1)\big].
\end{aligned}
\qquad(4.7)
\]
Full derivations are presented in Appendix 4.A.
Under Model III, Algorithm 6 describes a set of algebraic operations, performed until
a certain tolerance is reached. This is far simpler than the case for the Model II algorithm,
where the ratio Q(x) = e^x E_1(x) must be repeatedly evaluated, involving the evaluation
of the exponential integral function (see Definition 1.5.6). To complicate things further,
the ratio Q(x) presents numerical issues if computed naively. If we firstly express the problematic ratio as
\[
Q(x)=\frac{E_1(x)}{e^{-x}},
\]
it is easy to observe that as x becomes large, both the numerator and the denominator rapidly approach zero. A novel way to get around this is to first express Q(x) in continued fraction form (Cuyt et al., 2008). This facilitates computation via Lentz's Algorithm (Lentz, 1976; Press et al., 1992) to a prescribed accuracy. Wand and Ormerod (2012) investigate in detail the approximation of ratios of special functions using the continued fraction approach. As set out in Result 1.2, Q(x) admits the continued fraction expansion
\[
Q(x)=\cfrac{1}{x+1-\cfrac{1^2}{x+3-\cfrac{2^2}{x+5-\cfrac{3^2}{x+7-\cdots}}}}.
\]
Algorithm 7 describes the steps required to compute Q(x) for varying arguments. For arguments less than or equal to 1, direct computation is used. For arguments greater than
1, Lentz’s algorithm is called into play.
Figure 4.3 illustrates the number of iterations
required for Lentz’s algorithm to converge when used to approximate the ratio Q(x) to
the accuracy described in Algorithm 7.
Inputs (with defaults): x > 0, ε₁ (10⁻³⁰), ε₂ (10⁻⁷).
If x > 1 then (use Lentz's Algorithm)
    f_prev ← ε₁ ; C_prev ← ε₂ ; D_prev ← 0 ; Δ ← 2 + ε₂ ; j ← 1
    cycle while |Δ − 1| ≥ ε₂:
        j ← j + 1
        D_curr ← x + 2j − 1 − (j − 1)² D_prev
        C_curr ← x + 2j − 1 − (j − 1)²/C_prev
        D_curr ← 1/D_curr
        Δ ← C_curr D_curr
        f_curr ← f_prev Δ
        f_prev ← f_curr ; C_prev ← C_curr ; D_prev ← D_curr
    return 1/(x + 1 + f_curr)
Otherwise (use direct computation)
    return e^x E₁(x).

Algorithm 7: Algorithm for stable and efficient computation of Q(x).
The larger the value of the argument, the more quickly Lentz's Algorithm converges. Figure 4.3 illustrates the number of iterations required for Lentz's Algorithm to converge when used to approximate the ratio Q(x) to the accuracy described in Algorithm 7. The number of iterations required for convergence increases as the argument approaches zero. This is not an issue due to the structure of Algorithm 7: Q(x) for small arguments (0 < x ≤ 1) is evaluated using direct computation.

[Figure 4.3 appears here: the number of iterations required for Lentz's Algorithm to converge, plotted against the argument x.]
Figure 4.3: The number of iterations required for Lentz’s Algorithm to converge when
used to approximate Q(x). Convergence criteria are specified as default settings in Algorithm 7.
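For completeness, the following R sketch is one way to implement the idea of Algorithm 7; it is not the thesis code. It evaluates the continued fraction above with the standard modified Lentz recursions (Press et al., 1992) and falls back to direct computation for small arguments. The function name Qratio() and the use of gsl::expint_E1() from the gsl package mentioned in Section 4.2.2 are assumptions made here.

    # Evaluate Q(x) = exp(x) * E1(x).  For x > 1 the continued fraction
    #   Q(x) = 1/(x + 1 - 1^2/(x + 3 - 2^2/(x + 5 - ...)))
    # is evaluated by the modified Lentz algorithm; for 0 < x <= 1 direct
    # computation via gsl::expint_E1() is stable.
    Qratio <- function(x, tiny = 1e-30, tol = 1e-7, max.iter = 100) {
      stopifnot(x > 0)
      if (x <= 1) return(exp(x) * gsl::expint_E1(x))
      f <- tiny ; C <- tiny ; D <- 0
      for (j in 1:max.iter) {
        a <- if (j == 1) 1 else -(j - 1)^2   # partial numerators
        b <- x + 2 * j - 1                   # partial denominators
        D <- b + a * D ; if (abs(D) < tiny) D <- tiny
        C <- b + a / C ; if (abs(C) < tiny) C <- tiny
        D <- 1 / D
        delta <- C * D
        f <- f * delta
        if (abs(delta - 1) < tol) break
      }
      f
    }
    Qratio(2)   # approximately 0.3613, i.e. exp(2) * E1(2)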
Now that we have presented MFVB inference tools for Models II and III in the form of Algorithms 6 and 7, we proceed to compare the two models in terms of their simplicity, their performance in a simulation study and their theoretical underpinnings.
4.2.2 Simplicity comparison of Models II and III
As mentioned in the previous section, MFVB inference for Model III is far simpler than
that for Model II. This simplicity arises from the elegant distributional forms that result
from including the extra set of auxiliary variables c = (c1 , . . . , cn ) in Model III. In contrast to the simple algebraic steps in Model III’s algorithm, Model II must introduce special functions, continued fractions and Lentz’s Algorithm to carry out MFVB inference.
Model III clearly wins the simplicity round.
It should be noted here that under Model II MFVB, the evaluation of Q(x) is rather
cheap and very stable thanks to (1) the low number of iterations required for computation
via Lentz’s Algorithm for x > 1, and (2) the availability of direct computation via the R
package gsl (Hankin, 2007) for 0 < x ≤ 1. Over the following two sections, the superior
model of the two reveals itself rather convincingly.
4.2.3 Simulation comparison of Models II and III
We carried out a simulation study comparing the quality of MFVB inference resulting
from Models II and III. One thousand data sets were generated according to
xi ∼ Horseshoe(0, 1),
1 ≤ i ≤ n,
with sample sizes of both n = 100 and n = 1000 considered. This corresponds to σ 2
having a true value of 1. We assessed the quality of MFVB inference by comparing the
approximate MFVB optimal density, q∗(σ²), with a highly accurate MCMC-based posterior approximation denoted by pMCMC(σ²|x). The accuracy is defined as
\[
\mbox{accuracy}\equiv 1-\tfrac12\int_0^\infty
\big|q^*(\sigma^2)-p_{\rm MCMC}(\sigma^2\,|\,x)\big|\,d\sigma^2.
\]
An accuracy measure of 1 implies perfect alignment between the MFVB and MCMC
approximate posteriors. More detail about calculation of the accuracy measure is explained in Section 1.6. The MCMC posterior approximation, pMCMC (σ 2 ), was obtained
using WinBugs (Lunn, Thomas, Best and Spiegelhalter, 2000) via the BRugs package
(Ligges et al., 2011) in R. MCMC samples of size 10000 were created, with the first 5000
discarded and the remaining sample thinned by a factor of 5.
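For concreteness, one way to compute this accuracy in R is sketched below; it compares the Inverse-Gamma q∗(σ²) density with a kernel density estimate of the MCMC draws on a fine grid. The inputs mcmc.sigsq, A.q and B.q are hypothetical, and the kernel and grid choices may differ in detail from the Section 1.6 implementation.

    # Accuracy of q*(sigma^2) = Inverse-Gamma(A.q, B.q) against MCMC draws.
    accuracy <- function(mcmc.sigsq, A.q, B.q) {
      dens <- density(mcmc.sigsq, from = 1e-8, to = 2 * max(mcmc.sigsq), n = 2048)
      dinvgamma <- function(v, A, B)           # Inverse-Gamma(A, B) density
        exp(A * log(B) - lgamma(A) - (A + 1) * log(v) - B / v)
      dx <- dens$x[2] - dens$x[1]              # grid spacing for the Riemann sum
      1 - 0.5 * sum(abs(dinvgamma(dens$x, A.q, B.q) - dens$y)) * dx
    }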
Table 4.2: Average (standard deviation) accuracy based on MFVB for a simulation size of 1000 from (4.2).

                n = 100       n = 1000
    Model II    54.3 (1.4)    56.8 (0.9)
    Model III    6.3 (0.9)     0.0 (0.0)
A summary of the accuracy of MFVB inference for Models II and III is presented in
Table 4.2. It is evident that the average accuracy of MFVB inference resulting from Model
II is much higher than that of Model III, over both sample sizes. In particular, the MFVB
approximate posteriors under Model III had 0% accuracy for samples of size n = 1000.
The poor performance of Model III is further examined in the following sections.

[Figure 4.4 appears here: approximate posterior densities of σ² from MCMC, Model II MFVB and Model III MFVB, for four simulation replications.]

Figure 4.4: Comparison of pMCMC(σ²|x) and two q∗(σ²) densities based on Model II and Model III MFVB for four replications from the simulation study corresponding to Table 4.2 with n = 1000.
Figure 4.4 illustrates both MFVB approximate posteriors for σ² versus the accurate MCMC posterior for four replications within the simulation study. The purple Model II MFVB densities have similar centres to the orange MCMC densities, although MFVB results in a lower spread. The Model II densities also cover the true parameter value of one with reasonable mass. In contrast, the blue Model III density is shifted to the left, centred around 0.4, and lies nowhere near the accurate MCMC density nor the true parameter value of 1. This confirms the poor performance of MFVB inference resulting from Model III and identifies Model II as the superior choice.
4.2.4 Theoretical comparison of Models II and III
In this section we endeavour to explain why Model III, with its elegant q∗ densities and simple MFVB algorithm, performs so poorly in practice. We begin by examining the core differences between Models II and III. We then examine the simulated data and identify the underlying reason for Model III's poor performance. Finally, we present a new theorem that explains the inappropriateness of MFVB for Model III.
To allow direct comparison between Models II and III, we now find an alternative expression for µ_q(bi) under Model III by eliminating terms involving c_i. We denote the form of the expression for µ_q(bi) as g^III(x) under Model III, and g^II(x) under Model II. Under Model III:
\[
\mu_{q(b_i)}=\frac{1}{G_i+\mu_{q(c_i)}}
=\frac{1}{G_i+\dfrac{1}{\mu_{q(b_i)}+1}}
=\frac{\mu_{q(b_i)}+1}{(\mu_{q(b_i)}+1)G_i+1}.
\]
It follows that
\[
G_i\,\mu_{q(b_i)}^2+G_i\,\mu_{q(b_i)}-1=0,
\]
and using the quadratic formula gives
\[
\mu_{q(b_i)}=\frac{-G_i\pm\sqrt{G_i^2+4G_i}}{2G_i}.
\]
Hence, taking the positive root,
\[
g^{\rm III}(x)=\frac{-x+\sqrt{x^2+4x}}{2x}.
\qquad(4.8)
\]
So, Model III uses the above g^III(x) as an approximation to the Model II expression implied by Algorithm 6,
\[
g^{\rm II}(x)=\frac{1}{x\,e^xE_1(x)}-1.
\qquad(4.9)
\]
Figure 4.5 illustrates the adequacy of g III (x) as an approximation of g II (x) for varying
values of the argument x. We can see that as the value of x increases, the ratio g III (x)/g II (x)
approaches 1. Smaller arguments present the biggest discrepancy between g II (x) and its
approximation under Model III. As x approaches zero, the difference between the two
functions is marked. Hence g^III(x) provides a poor approximation of g^II(x) for low positive arguments.
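This discrepancy is easy to check numerically. Assuming the Qratio() sketch given after Algorithm 7, the ratio g^III(x)/g^II(x) can be evaluated directly for a small argument:

    gII  <- function(x) 1 / (x * Qratio(x)) - 1              # equation (4.9)
    gIII <- function(x) (-x + sqrt(x^2 + 4 * x)) / (2 * x)    # equation (4.8)
    gIII(0.5) / gII(0.5)   # roughly 0.86, noticeably below 1, as in Figure 4.5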
[Figure 4.5 appears here: plot of g^III(x)/g^II(x) against x.]
Figure 4.5: Plot of g III (x)/g II (x) for the functions g III and g II defined by (4.8) and (4.9)
respectively.
We now examine the behaviour of random variables x, b and c created under the simplified model:
\[
x\,|\,b\sim N(0,1/b),\qquad b\,|\,c\sim\mbox{Gamma}(\tfrac12,c),\qquad
c\sim\mbox{Gamma}(\tfrac12,1).
\qquad(4.10)
\]
Figure 4.6 shows MCMC samples of {log(1/b), log(c)|x = x0 } for x0 = (1, 0.1, 0.01, 0.001).
The sample correlations are also shown. It can be seen that as x0 approaches zero, the
correlation between log(1/b) and log(c) approaches 1. This is directly at odds with the
MFVB assumption of posterior independence between b and c. This is the underlying
reason why MFVB inference for Model III is so poor. Model II, with only b, does not encounter this problem, and hence MFVB inference is of higher quality. The reason behind
the behaviour evident in Figure 4.6 is explained by Theorem 4.2.1.
Theorem 4.2.1 Consider random variables x, b and c such that
\[
x\,|\,b\sim N(0,1/b),\qquad b\,|\,c\sim\mbox{Gamma}(\tfrac12,c),\qquad
c\sim\mbox{Gamma}(\tfrac12,1).
\]
Then
\[
\lim_{x_0\to0}\,\big[\mbox{Corr}\{\log(1/b),\log(c)\,|\,x=x_0\}\big]=1.
\]
Full details of the proof of Theorem 4.2.1 are given in Appendix 4.B.
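The posterior dependence underlying Theorem 4.2.1 is easy to visualise with a small Gibbs sampler, since under (4.10) the full conditionals of b and c are Exponential. The R sketch below is illustrative only; the function name and the number of draws are arbitrary choices made here.

    # Gibbs sampler for (b, c) given x = x0 under the simplified model (4.10).
    gibbs_bc <- function(x0, n.samp = 1000, burnin = 500) {
      b <- 1 ; c <- 1
      out <- matrix(NA, n.samp, 2, dimnames = list(NULL, c("log1onb", "logc")))
      for (s in 1:(burnin + n.samp)) {
        b <- rexp(1, rate = c + x0^2 / 2)   # b | c, x0 ~ Exponential(c + x0^2/2)
        c <- rexp(1, rate = b + 1)          # c | b     ~ Exponential(b + 1)
        if (s > burnin) out[s - burnin, ] <- c(log(1 / b), log(c))
      }
      out
    }
    cor(gibbs_bc(0.01))[1, 2]   # large, in line with the x0 = 0.01 panel of Figure 4.6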
4.3 Normal-Exponential-Gamma distribution
The second case we consider is a univariate random sample drawn from the NEG distribution, i.e.
\[
x_i\,|\,\sigma\stackrel{\rm ind.}{\sim}\mbox{NEG}(0,\sigma,\lambda),\qquad
\sigma\sim\mbox{Half-Cauchy}(A).
\qquad(4.11)
\]
[Figure 4.6 appears here: four scatterplots of MCMC samples of {log(1/b), log(c)} given x = x0, for x0 = 1, 0.1, 0.01 and 0.001, with sample correlations 0.29, 0.718, 0.895 and 0.951 respectively.]
Figure 4.6: MCMC samples (n = 1000) from the distribution {log(1/b), log(c)|x = x0 } for
x0 = (1, 0.1, 0.01, 0.001) where the data is generated according to (4.10). Sample correlations are also shown.
Again, through introduction of the auxiliary variables a, b = (b1, . . . , bn) and c = (c1, . . . , cn), we are able to represent (4.11) as the three equivalent hierarchical models presented in Table 4.3. Similarly to the Horseshoe case, each of the three models presented in Table 4.3 is illustrated in Figure 4.2.
Table 4.3: Three auxiliary variable models that are each equivalent to NEG Model (4.11). The abbreviation IG represents the Inverse-Gamma distribution.

Model I:
    x_i | σ ~ind. NEG(0, σ, λ),  σ² | a ~ IG(1/2, a⁻¹),  a ~ IG(1/2, A⁻²).
Model II:
    x_i | σ, b_i ~ind. N(0, σ²/b_i),  σ² | a ~ IG(1/2, a⁻¹),  a ~ IG(1/2, A⁻²),
    p(b_i) = λ b_i^{λ−1}(1 + b_i)^{−λ−1},  b_i > 0.
Model III:
    x_i | σ, b_i ~ind. N(0, σ²/b_i),  σ² | a ~ IG(1/2, a⁻¹),  a ~ IG(1/2, A⁻²),
    b_i | c_i ~ind. IG(1, c_i),  c_i ~ind. Gamma(λ, 1).
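Analogously to the Horseshoe case, the Model III hierarchy gives a direct way to simulate NEG(0, σ, λ) data, as in the following sketch (the function name rneg() is an assumption introduced here).

    # Draw from NEG(0, sigma, lambda) via the Model III hierarchy of Table 4.3.
    rneg <- function(n, sigma = 1, lambda = 1) {
      c <- rgamma(n, shape = lambda, rate = 1)   # c_i ~ Gamma(lambda, 1)
      b <- 1 / rgamma(n, shape = 1, rate = c)    # b_i | c_i ~ Inverse-Gamma(1, c_i)
      rnorm(n, mean = 0, sd = sigma / sqrt(b))   # x_i | sigma, b_i ~ N(0, sigma^2/b_i)
    }
    x <- rneg(1000, lambda = 0.4)   # data of the kind used in the Section 4.3.2 simulation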
4.3.1 Mean field variational Bayes
We impose the same set of product restrictions (4.3) on the joint posterior density function
for Models I, II and III. The same pitfall causes computational problems for Model I as it
did in the Horseshoe case. That is, the form of q ∗ (σ 2 ) requires both numerical integration
for evaluation of the normalising constant, and repeated evaluations of special functions
(in this case, the parabolic cylinder function). Hence we limit our investigation to Models
II and III. Again, derivations of the MFVB algorithms for Models II and III are deferred
to Appendix 4.C.
Initialize: µ_q(1/σ²) > 0. If Model III, initialize: µ_q(c_i) > 0, 1 ≤ i ≤ n.
Cycle:
    µ_q(1/a) ← A²/{A² µ_q(1/σ²) + 1}
    For i = 1, . . . , n:
        G_i ← ½ µ_q(1/σ²) x_i²
        if Model II:  µ_q(b_i) ← (2λ + 1) R_{2λ}(√(2G_i)) / √(2G_i)
        if Model III: µ_q(b_i) ← √(µ_q(c_i)/G_i) ; µ_q(1/b_i) ← 1/µ_q(b_i) + 1/{2µ_q(c_i)} ;
                      µ_q(c_i) ← (λ + 1)/(µ_q(1/b_i) + 1)
    µ_q(1/σ²) ← (n + 1)/{2µ_q(1/a) + Σ_{i=1}^n x_i² µ_q(b_i)}
until the increase in p(x; q) is negligible.

Algorithm 8: Mean field variational Bayes algorithm for determination of q∗(σ²) from data modelled according to (4.11). The steps differ depending on which auxiliary variable representation, Model II or Model III (set out in Table 4.3), is used.
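As for the Horseshoe case, the Model III branch of Algorithm 8 involves only elementary updates. A hypothetical R transcription (following the updates as reconstructed above, with a fixed number of iterations and an arbitrary default A) is:

    # MFVB cycle of Algorithm 8, Model III branch, for NEG data x with shape lambda.
    mfvb_neg_m3 <- function(x, lambda, A = 25, n.iter = 100) {
      n <- length(x)
      mu.recip.sigsq <- 1            # mu_{q(1/sigma^2)}
      mu.c <- rep(1, n)              # mu_{q(c_i)}
      for (iter in 1:n.iter) {
        mu.recip.a <- A^2 / (A^2 * mu.recip.sigsq + 1)
        G <- 0.5 * mu.recip.sigsq * x^2
        mu.b <- sqrt(mu.c / G)                      # mu_{q(b_i)}
        mu.recip.b <- 1 / mu.b + 1 / (2 * mu.c)     # mu_{q(1/b_i)}
        mu.c <- (lambda + 1) / (mu.recip.b + 1)     # mu_{q(c_i)}
        mu.recip.sigsq <- (n + 1) / (2 * mu.recip.a + sum(x^2 * mu.b))
      }
      list(mu.recip.sigsq = mu.recip.sigsq, mu.b = mu.b)
    }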
Algorithm 8 determines the optimal moments of q ∗ (σ 2 ), q ∗ (a), q ∗ (b) and q ∗ (c) under
Models II and III set out in Table 4.3 and product restriction (4.3). The lower bound for
Algorithm 8 is given by (4.12):
\[
\log p(x;q)=\left\{
\begin{array}{l}
\log p(x;q,\mbox{BASE})+n\log(\lambda)+n(\lambda+\tfrac12)\log(2)+n\log\{\Gamma(\lambda+\tfrac12)\}\\[0.5ex]
\quad+\displaystyle\sum_{i=1}^n\big[G_i\big(\mu_{q(b_i)}+\tfrac12\big)
+\log\{D_{-2\lambda-1}(\sqrt{2G_i})\}\big]
\qquad\mbox{for Model II,}\\[2ex]
\log p(x;q,\mbox{BASE})+\tfrac{n}{2}\log(\pi)+n\log(\lambda)\\[0.5ex]
\quad-\displaystyle\sum_{i=1}^n\big\{\tfrac12\log(\mu_{q(c_i)})
+(\lambda+1)\log(\mu_{q(1/b_i)}+1)\big\}
\qquad\mbox{for Model III,}
\end{array}\right.
\qquad(4.12)
\]
where log p(x; q, BASE) is identical to that specified by (4.7) in the previous Horseshoe
section. It should be stated that the final term in the lower bound (4.12) for Model II in
the NEG case is numerically unstable. This instability was not an issue for the Horseshoe
prior as we were able to incorporate the stable quantity Q(x) into the expression for the
lower bound (see (4.7)). We are unable to rearrange the lower bound to achieve the same
stability for the NEG model. In practice, we use a high fixed number of iterations to
ensure convergence. As in the Horseshoe case, MFVB inference is far simpler for Model
III as Algorithm 8 requires only algebraic steps for each iteration. For the NEG algorithm,
Model II requires repeated evaluation of the ratio
\[
R_\nu(x)=\frac{D_{-\nu-2}(x)}{D_{-\nu-1}(x)},\qquad \nu>0,\ x>0.
\]
This ratio must be computed with care as again underflow problems persist for large
arguments. We first express Rν (x) in continued fraction form, using Cuyt et al. (2008).
This allows use of Lentz’s Algorithm as was the case for Q(x) in Horseshoe Model II. As
set out in Result 1.3, Rν (x) can be written as:
\[
R_\nu(x)=\cfrac{1}{x+\cfrac{\nu+2}{x+\cfrac{\nu+3}{x+\cfrac{\nu+4}{x+\cdots}}}}.
\]
The procedure for stable computation of Rν (x) is presented in Algorithm 9. Specifically,
Algorithm 9 describes the steps required to compute Rν (x) for varying arguments. For
x ≤ 0.2 and λ ≤ 40, direct computation is carried out using R code explained in Definition
1.5.7. Otherwise, Lentz’s algorithm is used.
Now we have presented the relevant algorithms to carry out MFVB inference for the
NEG prior, we proceed to firstly compare Models II and III via a simulation study, and
secondly look into the theory behind the relationship between the two models.
4.3.2 Simulation comparison of Models II and III
The NEG simulation study to compare Models II and III was set up similarly to the
Horseshoe study set out in Section 4.2.3. We generated 500 data sets of size n = 100
and n = 1000 for
xi ∼ NEG(0, 1, λ),
1 ≤ i ≤ n.
Inputs (with defaults): x ≥ 0, ν > 0, ε₁ (10⁻³⁰), ε₂ (10⁻⁷).
If (ν > 20) or (x > 0.2) then (use Lentz's Algorithm)
    f_prev ← ε₁ ; C_prev ← ε₂ ; D_prev ← 0 ; Δ ← 2 + ε₂ ; j ← 1
    cycle while |Δ − 1| ≥ ε₂:
        j ← j + 1
        D_curr ← x + (ν + j) D_prev
        C_curr ← x + (ν + j)/C_prev
        D_curr ← 1/D_curr
        Δ ← C_curr D_curr
        f_curr ← f_prev Δ
        f_prev ← f_curr ; C_prev ← C_curr ; D_prev ← D_curr
    return 1/(x + f_curr)
Otherwise (use direct computation)
    return D_{−ν−2}(x)/D_{−ν−1}(x).

Algorithm 9: Algorithm for stable and efficient computation of Rν(x).
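A minimal R sketch of the Lentz branch of Algorithm 9 is given below; the direct-computation branch for small arguments (which requires a parabolic cylinder function routine, Definition 1.5.7) is omitted here, and the function name Rratio() is an assumption.

    # Lentz evaluation of R_nu(x) = D_{-nu-2}(x)/D_{-nu-1}(x) via the
    # continued fraction R_nu(x) = 1/(x + (nu+2)/(x + (nu+3)/(x + ...))).
    Rratio <- function(nu, x, tiny = 1e-30, tol = 1e-7, max.iter = 200) {
      f <- tiny ; C <- tiny ; D <- 0
      for (j in 1:max.iter) {
        a <- if (j == 1) 1 else nu + j   # partial numerators: 1, nu+2, nu+3, ...
        b <- x                           # all partial denominators equal x
        D <- b + a * D ; if (abs(D) < tiny) D <- tiny
        C <- b + a / C ; if (abs(C) < tiny) C <- tiny
        D <- 1 / D
        delta <- C * D
        f <- f * delta
        if (abs(delta - 1) < tol) break
      }
      f
    }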
An extra level of complexity is present in the NEG case: the existence of the shape parameter λ. The simulation study was carried out for the values
λ ∈ {0.1, 0.2, 0.4, 0.8, 1.6}.
A summary of the accuracy of MFVB inference for Models II and III is presented in Figure
4.7. Figure 4.7 illustrates that the accuracy of MFVB inference resulting from Model II is
much higher than that of Model III for both n = 100 and n = 1000.
Figure 4.8 shows MFVB approximate posteriors for σ 2 versus the accurate MCMC
posterior for four replications within the simulation study. As was the case for the
Horseshoe, the purple Model II approximate posteriors have similar centres to the orange MCMC ones. Again MFVB for Model II results in a lower spread than the MCMC
analogue. The Model II densities also cover the true parameter value of one with reasonable mass. In contrast, the blue Model III densities are shifted far to the right. This again
supports the poor performance of MFVB inference resulting from Model III and identifies
Model II as the superior choice in both the Horseshoe and NEG cases.
[Figure 4.7 appears here: side-by-side boxplots of accuracy values for Model II and Model III MFVB, for n = 100 and n = 1000 and λ ∈ {0.1, 0.2, 0.4, 0.8, 1.6}.]
Figure 4.7: Side-by-side boxplots of accuracy values for the NEG simulation study described in Section 4.3.2.
[Figure 4.8 appears here: approximate posterior densities of σ² from MCMC, Model II MFVB and Model III MFVB, for four simulation replications.]
Figure 4.8: Comparison of pMCMC (σ 2 |x) and two q ∗ (σ 2 ) densities based on Model II and
Model III MFVB for four replications from the NEG simulation study with n = 1000.
[Figure 4.9 appears here: plot of g^III_λ(x)/g^II_λ(x) against x ∈ [0, 1] for λ ∈ {0.1, 0.2, 0.4, 0.8, 1.6}.]
Figure 4.9: Plot of g III (x)/g II (x) for the functions g III and g II defined by (4.13).
4.3.3 Theoretical comparison of Models II and III
The major difference between NEG Models II and III is the form of µ_q(bi). As in the Horseshoe case, to allow comparison between Models II and III, we find an alternative expression for µ_q(bi) under Model III:
\[
\mu_{q(b_i)}=\left\{
\begin{array}{ll}
g^{\rm II}(G_i) & \mbox{for Model II}\\
g^{\rm III}(G_i) & \mbox{for Model III}
\end{array}\right.
\]
where
\[
g^{\rm II}(x)=\frac{(2\lambda+1)\,R_{2\lambda}(\sqrt{2x})}{\sqrt{2x}}
\quad\mbox{and}\quad
g^{\rm III}(x)=\sqrt{\frac{2\lambda+1}{2x}+\frac14}-\frac12.
\qquad(4.13)
\]
We obtained the expression for g^III(x) via reduction of the updates for µ_q(ci), µ_q(bi) and µ_q(1/bi) under Model III in Algorithm 8. The idea is that Model III uses the above g^III(x) as an approximation to
\[
g^{\rm II}(x)=\frac{(2\lambda+1)\,R_{2\lambda}(\sqrt{2x})}{\sqrt{2x}}.
\]
Figure 4.9 illustrates the adequacy of g III (x) as an approximation of g II (x) for varying
values of the argument x and the shape parameter λ. As the value of x increases, the
ratio g III (x)/g II (x) gets closer to 1 for all values of λ in the grid. Small values of x present
the most marked difference between g II (x) and g III (x). This translates to poor performance
of Model III in practice.
Next, we address the underlying reason behind the poor performance of Model III for the NEG case. We again examine the behaviour of random variables x, b and c created under a simplified NEG model:
\[
x\,|\,b\sim N(0,1/b),\qquad b\,|\,c\sim\mbox{Inverse-Gamma}(1,c),\qquad
c\sim\mbox{Gamma}(\lambda,1).
\qquad(4.14)
\]
The correlation Corr{log(b), log(c)|x = x0 } is more complex for the NEG case compared
with the Horseshoe due to the added shape parameter λ. Figure 4.10 shows MCMC samples of {log(b), log(c) | x = x0} for λ = (0.05, 0.1, 0.2, 0.4) and x0 = (1, 2, 3, 4). The figure shows that high sample correlations correspond to low values of λ.

[Figure 4.10 appears here: four scatterplots of MCMC samples of {log(b), log(c)} given x = x0, with sample correlations 0.699 (λ = 0.4, x0 = 1), 0.747 (λ = 0.2, x0 = 2), 0.86 (λ = 0.1, x0 = 3) and 0.877 (λ = 0.05, x0 = 4).]

Figure 4.10: MCMC samples (n = 1000) from the distribution {log(b), log(c) | x = x0} for λ = (0.05, 0.1, 0.2, 0.4) and x0 = (1, 2, 3, 4), where the data is generated according to (4.14). Sample correlations are also shown.
In Appendix 4.D, we derive integral expressions for the expectations that make up
the correlation. We then compute these integrals in R in order to examine the relationship
between the correlation, the shape parameter λ and the data x0 . Figure 4.11 illustrates the
impact of both λ and x on the correlation. We can see that the correlation gets closer to
1 for smaller values of λ, no matter what the value of x0 is.

[Figure 4.11 appears here: Corr{log(b), log(c) | x = x0} plotted against λ ∈ [0, 2] for x0 ∈ {0.1, 0.2, 0.5, 1, 2}.]

Figure 4.11: Illustration of the behaviour of Corr{log(b), log(c) | x = x0} under NEG Model III for varying values of λ and x0 ∈ {0.1, 0.2, 0.5, 1, 2}, corresponding to the colours in the legend.

Although not as elegant as
Theorem 4.2.1 in the Horseshoe section, we have shown numerically that in the NEG case,
high correlation exists between b and c. The existence of an analogue for Theorem 4.2.1
in the NEG case remains an interesting open question. The correlation between b and c is
directly at odds with the MFVB product assumption that q(b, c) = q(b)q(c). The disparity
between the existing posterior dependence and the assumption of independence explains
the poor performance of Model III MFVB inference in the NEG case.
4.4 Generalized-Double-Pareto distribution
The third and final case we consider is a univariate random sample drawn from the GDP
distribution, i.e.
\[
x_i\,|\,\sigma\stackrel{\rm ind.}{\sim}\mbox{GDP}(0,\sigma,\lambda),\qquad
\sigma\sim\mbox{Half-Cauchy}(A).
\qquad(4.15)
\]
Via introductio