Approximation of likelihoods for non-linear mixed effects models

Rasmus Waagepetersen
Department of Mathematics
Aalborg University
Denmark

February 28, 2013

Outline for today
◮ integrals involving Gaussian densities: likelihoods for non-linear mixed models
◮ Laplace approximation
◮ Gauss-Hermite quadrature
◮ Monte Carlo methods and importance sampling
General problem: approximation of integrals of the form

∫_{R^k} h(u) f(u) du

where f(·) is a Gaussian (normal) density.

Motivation: computation of likelihoods and conditional expectations in non-linear mixed effects models.

Poisson GLM
η = Xβ is the linear predictor. The data Yi are Poisson distributed,

Yi ∼ Poisson(λi),

and independent, with mean EYi = λi = exp(ηi).

Note that the mean equals the variance:

VarYi = EYi = λi

In practice we often encounter overdispersion and correlation:

VarYi > EYi

and Yi and Yj not independent.
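As a small aside (an R sketch, not from the slides; the data are simulated), a Poisson GLM can be fitted with glm(), and a residual deviance far exceeding the residual degrees of freedom is the usual informal symptom of the overdispersion just described.

```r
## Sketch: simulate counts with extra (lognormal) noise, fit a Poisson GLM,
## and compare residual deviance to residual degrees of freedom.
set.seed(1)
x <- runif(200)
y <- rpois(200, exp(0.5 + x + rnorm(200, sd = 0.8)))  # overdispersed counts
fit <- glm(y ~ x, family = poisson)
fit$deviance / fit$df.residual  # substantially above 1: overdispersion
```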
Poisson log-normal model
Introduce a vector of zero-mean normal random effects U ∼ N(0, Σ). Conditional on U = u, the Yi are Poisson,

Yi | U = u ∼ Poisson(exp(ηi)),

and independent, with η = Xβ + Zu.

Joint conditional density of Y:

f(y|u) = ∏_i f(yi|u)

The marginal joint density of y (the likelihood function) is

f(y) = ∫_{R^k} f(y|u) f(u) du

where

f(u) = 1/((2π)^{k/2} |Σ|^{1/2}) exp(−½ uᵀ Σ⁻¹ u)

A closer look at the integrand in the case of Poisson:

f(y|u) f(u) = [∏_i 1/yi!] exp(Σ_i [yi ηi − exp(ηi)]) f(u)

If Yi|U = u is Bernoulli with

pi(u) = P(Yi = 1|u) = exp(ηi)/(1 + exp(ηi))

(logistic regression) then

f(y|u) f(u) = ∏_i pi(u)^{yi} (1 − pi(u))^{1−yi} f(u) = exp(Σ_i [yi ηi − log(1 + exp(ηi))]) f(u)

In both cases complicated integrands - no analytical solution of the integral.
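A quick simulation check (an R sketch with made-up parameters; the closed-form moments in the comment are standard properties of the Poisson log-normal mixture, not stated on the slides):

```r
## Simulate from the Poisson log-normal model with a single random effect
## and verify that the marginal variance exceeds the marginal mean.
set.seed(1)
tau <- 1; eta <- 0.5
u <- rnorm(1e5, 0, tau)
y <- rpois(1e5, exp(eta + u))
c(mean = mean(y), var = var(y))
## Theory: EY = exp(eta + tau^2/2); VarY = EY + (exp(tau^2) - 1) * EY^2.
```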
Simplest case: one-dimensional random effect
U ∼ N(0, τ²). Compute

f(y) = ∫_R f(y|u) f(u) du

Possibilities:
◮ Laplace approximation.
◮ Numerical integration/quadrature.

Laplace approximation
Let g(u) = log(f(y|u)f(u)) and choose û so that g′(û) = 0 (û = arg max g(u)).

Taylor expansion around û:

g(u) ≈ g̃(u) = g(û) + (u − û) g′(û) + ½ (u − û)² g″(û) = g(û) − ½ (u − û)² (−g″(û))

I.e. exp(g̃(u)) is proportional to the normal density N(µ_LP, σ²_LP) with µ_LP = û and σ²_LP = −1/g″(û). Hence

f(y) = ∫_R exp(g(u)) du ≈ ∫_R exp(g̃(u)) du = exp(g(û)) ∫_R exp(−(u − µ_LP)²/(2σ²_LP)) du = exp(g(û)) √(2πσ²_LP)
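A minimal numerical sketch in R of the one-dimensional Laplace approximation above, for a Poisson log-normal model with a single shared random effect; the data y and the parameter values (beta0, tau) are made up for illustration.

```r
## Laplace approximation of f(y) for y_i | u ~ Poisson(exp(beta0 + u)),
## u ~ N(0, tau^2): locate the mode of g(u) = log f(y|u) + log f(u),
## then use f(y) ~ exp(g(u_hat)) * sqrt(2*pi*sigma2_LP).
laplace_lik <- function(y, beta0, tau) {
  g <- function(u) sum(dpois(y, exp(beta0 + u), log = TRUE)) +
    dnorm(u, 0, tau, log = TRUE)
  opt <- optim(0, function(u) -g(u), method = "BFGS", hessian = TRUE)
  mu_LP <- opt$par                     # mode u_hat
  sigma2_LP <- 1 / opt$hessian[1, 1]   # -1/g''(u_hat)
  exp(g(mu_LP)) * sqrt(2 * pi * sigma2_LP)
}
y <- c(3, 5, 2, 4)                     # made-up counts
laplace_lik(y, beta0 = 1, tau = 0.5)
```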
Note: the Laplace approximation is also possible in higher dimensions (multivariate Taylor expansion - later).

Prediction
The minimum mean square prediction of U given Y = y is the conditional expectation E[U|Y = y]. Note

f(u|y) = f(y|u) f(u)/f(y) ∝ exp(g(u)) ≈ const · exp(−(u − µ_LP)²/(2σ²_LP))

where µ_LP = û and σ²_LP = −1/g″(û). Note: µ_LP is the mode of the conditional distribution. It follows that

U|Y = y ≈ N(µ_LP, σ²_LP)

and µ_LP is an approximation of E[U|Y = y].
Gaussian quadrature
Gauss-Hermite quadrature (numerical integration) is

∫_R h(x) φ(x) dx ≈ Σ_{i=1}^n w_i h(x_i)

where φ is the standard normal density and (x_i, w_i), i = 1, …, n, are certain arguments and weights which can be looked up in a table. We can replace ≈ with = whenever h is a polynomial of degree 2n − 1 or less (Arne's part of the course).

In principle we can use Gauss-Hermite directly, but we might need a large number n to get accurate results (if the integrand is far from a low-order polynomial).

Adaptive Gaussian Quadrature
A large gain in precision can be obtained by adaptation using the Laplace approximation:

∫ f(y|u) f(u) du = ∫ [f(y|u) f(u)/f(u; µ_LP, σ²_LP)] f(u; µ_LP, σ²_LP) du
  = ∫ [f(y|σ_LP x + µ_LP) f(σ_LP x + µ_LP)/φ(x)] σ_LP φ(x) dx

(change of variable: x = (u − µ_LP)/σ_LP)
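An R sketch of plain versus adaptive Gauss-Hermite quadrature for the running one-dimensional example; it uses gaussHermiteData() from the fastGHQuad package mentioned on a later slide, whose raw nodes and weights target the weight function exp(−x²) and are therefore rescaled here to match the standard normal density φ. Data and parameters are made up.

```r
## Plain vs adaptive Gauss-Hermite quadrature for f(y), for the same
## one-dimensional Poisson log-normal example as in the Laplace sketch.
library(fastGHQuad)
gh <- gaussHermiteData(5)
x <- sqrt(2) * gh$x   # rescale raw nodes (weight exp(-x^2)) to match phi
w <- gh$w / sqrt(pi)  # rescaled weights sum to 1

y <- c(3, 5, 2, 4)    # made-up counts
g <- function(u) sum(dpois(y, exp(1 + u), log = TRUE)) +  # beta0 = 1
  dnorm(u, 0, 0.5, log = TRUE)                            # tau = 0.5

## Plain GH: integrate [f(y|u)f(u)/phi(u)] against phi
sum(w * exp(sapply(x, g)) / dnorm(x))

## Adaptive GH: recenter/rescale the nodes using the Laplace approximation
opt <- optim(0, function(u) -g(u), method = "BFGS", hessian = TRUE)
mu_LP <- opt$par; sigma_LP <- sqrt(1 / opt$hessian[1, 1])
u <- mu_LP + sigma_LP * x
sum(w * exp(sapply(u, g)) * sigma_LP / dnorm(x))
```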
Advantage

f(y|σ_LP x + µ_LP) f(σ_LP x + µ_LP)/φ(x) = f(y|u) f(u)/φ(u; µ_LP, σ²_LP),  x = (u − µ_LP)/σ_LP,

is close to constant (namely f(y)) - hence adaptive G-H quadrature is very accurate.

GH scheme with n = 5:

x  2.020  0.959  0.000  −0.959  −2.020
w  0.011  0.222  0.533   0.222   0.011

(the x's are roots of a Hermite polynomial, computed using e.g. ghq in library glmmML or gaussHermiteData() in fastGHquad)

(GH schemes for n = 5 and n = 10 available on web page)

Prediction of random effects for GLMM
The conditional mean

E[U|Y = y] = ∫ u f(u|y) du

is the minimum mean square error predictor, i.e.

E(U − Û)²

is minimal with Û = H(Y) where H(y) = E[U|Y = y].

Difficult to evaluate analytically:

E[U|Y = y] = ∫ u f(y|u) f(u)/f(y) du
Computation of conditional expectations
E[U|Y = y] = ∫ u f(y|u) f(u)/f(y) du
  = (1/f(y)) ∫ (σ_LP x + µ_LP) [f(y|σ_LP x + µ_LP) f(σ_LP x + µ_LP)/φ(x)] σ_LP φ(x) dx

Note:

(σ_LP x + µ_LP) σ_LP f(y|σ_LP x + µ_LP) f(σ_LP x + µ_LP)/φ(x)

behaves like a first-order polynomial in x - hence GH is still accurate.

Hierarchical/nested models
Suppose U = (U1, …, Uk) consists of independent normals and Yi = (Yi1, …, Yin_i) depends only on Ui. Then we can factorize the integral into one-dimensional integrals:

f(y) = ∏_i ∫_R f(yi|ui) f(ui) dui

Hence we are back in the one-dimensional case.
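Since the likelihood factorizes, the one-dimensional machinery applies cluster by cluster; a short sketch (reusing laplace_lik() from the Laplace sketch above; cluster labels and counts are made up):

```r
## Nested case: log f(y) is a sum of cluster-wise one-dimensional integrals.
cl <- factor(c(1, 1, 2, 2, 3, 3))   # made-up cluster labels
yy <- c(3, 5, 2, 4, 1, 2)           # made-up counts
sum(tapply(seq_along(yy), cl, function(idx)
  log(laplace_lik(yy[idx], beta0 = 1, tau = 0.5))))
```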
Suppose two levels of nested random effects (Yij = (Yij1, …, Yijn_ij) depends on Uij and Ui):

f(y) = ∏_{i=1}^{m1} ∫_R [ ∏_{j=1}^{m2} ∫_R f(yij|uij, ui) φ(uij) duij ] φ(ui) dui

Hence m1 one-dimensional integrals, each involving m2 one-dimensional integrals. Computational cost roughly O(m1 m2 q²), where q is the number of quadrature points for a one-dimensional integral (ignoring the cost of a possible Laplace approximation for adaptation). Still manageable.

Difficult cases for numerical integration - dimension k ≫ 1
◮ correlated random effects: cost q^k
◮ crossed random effects (e.g. ηij = µ + ui + vj): cost m1 q q^{m2}, where m1 and m2 are the numbers of ui's and vj's

Not possible to factorize the likelihood into low-dimensional integrals - hence numerical quadrature may not be feasible.

Alternatives: Laplace approximation or Monte Carlo computation.
Monte Carlo computation of likelihood for GLMM
The likelihood function is an expectation:

f(y) = ∫_{R^k} f(y|u) f(u) du = E f(y|U)

Use Monte Carlo simulation to approximate the expectation.

Simple Monte Carlo

f(y) = ∫_{R^k} f(y|u) f(u) du = E f(y|U) ≈ f̂(y) = (1/M) Σ_{l=1}^M f(y|U^l; β)

where the U^l ∼ N(0, Σ) are independent.

Monte Carlo variance:

Var(L_SMC(θ)) = (1/M) Var f(y|U¹; β)
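A sketch of the simple Monte Carlo estimator for the one-dimensional Poisson log-normal example used earlier (made-up data and parameters); the empirical standard error computed at the end corresponds to the variance estimate discussed next.

```r
## Simple Monte Carlo: f(y) ~ (1/M) * sum_l f(y | U^l), U^l ~ N(0, tau^2).
set.seed(1)
y <- c(3, 5, 2, 4)                 # made-up counts
M <- 1e5
U <- rnorm(M, 0, 0.5)              # tau = 0.5
terms <- sapply(U, function(u) prod(dpois(y, exp(1 + u))))  # f(y|U^l), beta0 = 1
mean(terms)                        # estimate of f(y)
sd(terms) / sqrt(M)                # empirical Monte Carlo standard error
```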
Estimate Var f(y|U¹; β) using the empirical variance estimate based on f(y|U^l; β), l = 1, …, M:

(1/(M − 1)) Σ_{l=1}^M (f(y|U^l; β) − f̂(y))²

Often Var f(y|U¹; β) is large, so a large M is needed.

Increasing dimension k leads to worse performance (useless in high dimensions - curse of dimensionality).

Importance sampling
Consider Z ∼ f and suppose we wish to evaluate E h(Z) where h(Z) has large variance.

Suppose we can find a density g so that

h(z) f(z)/g(z) ≈ const and h(z) f(z) > 0 ⇒ g(z) > 0

Then

E h(Z) = ∫ [h(z) f(z)/g(z)] g(z) dz = E[h(Y) f(Y)/g(Y)]

where Y ∼ g. The variance of h(Y) f(Y)/g(Y) is then small, so the estimate

E h(Z) ≈ (1/M) Σ_{i=1}^M h(Yi) f(Yi)/g(Yi)

has small Monte Carlo error.
Importance sampling for GLMM
Let g(·) be a probability density on R^k. Then

f(y) = ∫_{R^k} f(y|u) f(u) du = ∫_{R^k} [f(y|u) f(u)/g(u)] g(u) du = E[f(y|V) f(V)/g(V)]

where V ∼ g(·), so

f(y) ≈ f̂_{IS,g}(θ) = (1/M) Σ_{l=1}^M f(y|V^l) f(V^l)/g(V^l) where V^l ∼ g(·), l = 1, …, M

Find g so that Var[f(y|V) f(V)/g(V)] is small and simulation is straightforward.

Possibility: note that

f(y|u) f(u)/f(u|y) = f(y) = const

and, by the Laplace approximation, U|Y = y ≈ N(µ_LP, Σ_LP).

Use for g(·) the density of the N(µ_LP, Σ_LP) or the tν(µ_LP, Σ_LP) distribution: then

f(y|u) f(u)/g(u) ≈ const

so small variance.

Var f̂_{IS,g}(θ) < ∞ if f(y|v) f(v)/g(v) is bounded (i.e. use g(·) with heavy tails).

Note: a 'Monte Carlo version' of adaptive Gaussian quadrature.
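A sketch of this proposal in R for the running one-dimensional example: the proposal g is a scaled, shifted t5 density centered at the Laplace approximation, and the weights are computed on the log scale for numerical stability (data and parameters made up as before).

```r
## Importance sampling for f(y) with a heavy-tailed t proposal centered at
## the Laplace approximation of f(u|y); same 1-D example as above.
set.seed(1)
y <- c(3, 5, 2, 4)
log_joint <- function(u) sum(dpois(y, exp(1 + u), log = TRUE)) +
  dnorm(u, 0, 0.5, log = TRUE)                  # log f(y|u) + log f(u)
opt <- optim(0, function(u) -log_joint(u), method = "BFGS", hessian = TRUE)
mu_LP <- opt$par; sigma_LP <- sqrt(1 / opt$hessian[1, 1])

M <- 1e4; nu <- 5
V <- mu_LP + sigma_LP * rt(M, df = nu)          # V^l ~ t_nu(mu_LP, sigma_LP^2)
log_g <- dt((V - mu_LP) / sigma_LP, df = nu, log = TRUE) - log(sigma_LP)
log_w <- sapply(V, log_joint) - log_g           # log of f(y|V)f(V)/g(V)
mean(exp(log_w))                                # estimate of f(y)
```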
Suppose a parametric model f(y; θ) = ∫ f(y|u; θ) f(u; θ) du = L(θ).

Consider fixed θ0 and take

g(u) = f(u|y; θ0) = f(y|u; θ0) f(u; θ0)/L(θ0)

Then

L(θ) = ∫ [f(y|u; θ) f(u; θ)/g(u)] g(u) du
  = L(θ0) ∫ [f(y|u; θ) f(u; θ)/(f(y|u; θ0) f(u; θ0))] f(u|y; θ0) du
  = L(θ0) E_{θ0}[ f(y|U; θ) f(U; θ)/(f(y|U; θ0) f(U; θ0)) | Y = y ]

⇔

L(θ)/L(θ0) = E_{θ0}[ f(y|U; θ) f(U; θ)/(f(y|U; θ0) f(U; θ0)) | Y = y ]

This suffices for finding the MLE:

arg max_θ L(θ) = arg max_θ L(θ)/L(θ0)

and

L(θ)/L(θ0) ≈ (1/M) Σ_{l=1}^M f(y|U^l; θ) f(U^l; θ)/(f(y|U^l; θ0) f(U^l; θ0)) where U^l ∼ f(u|y; θ0)

So we can estimate the ratio L(θ)/L(θ0) even though L(θ0) is an unknown constant.

Need θ close to θ0 (so that f(y|u; θ) f(u; θ)/g(u) is close to constant)!
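A sketch of a closely related variant (not exactly the slide's estimator: instead of drawing U^l from f(u|y; θ0), it reuses the fixed t proposal and the draws V, log_g, y from the importance-sampling sketch above, estimating L(θ)/L(θ0) as a ratio of two importance-sampling estimates with shared draws; here θ is just beta0 with tau held fixed):

```r
## Estimate L(beta) / L(beta0) with common random draws V and proposal
## log-density log_g from the sketch above (beta0 = 1, tau = 0.5 fixed).
lik_ratio <- function(beta, beta0 = 1) {
  lj <- function(u, b) sum(dpois(y, exp(b + u), log = TRUE)) +
    dnorm(u, 0, 0.5, log = TRUE)
  mean(exp(sapply(V, lj, b = beta) - log_g)) /
    mean(exp(sapply(V, lj, b = beta0) - log_g))
}
sapply(c(0.8, 1.0, 1.2), lik_ratio)   # profile the ratio near theta0
```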
Maximization of likelihood using Newton-Raphson
Let

V_θ(y, u) = (d/dθ) log f(y, u|θ)

Then

u(θ) = (d/dθ) log L(θ) = E_θ[V_θ(y, U)|Y = y]

and

j(θ) = −(d²/(dθᵀ dθ)) log L(θ) = −E_θ[dV_θ(y, U)/dθᵀ | Y = y] − Var_θ[V_θ(y, U)|Y = y]

Newton-Raphson:

θ^{l+1} = θ^l + j(θ^l)⁻¹ u(θ^l)

All unknown expectations and variances can be estimated using the previous numerical integration or Monte Carlo methods!

Next time: simulation methods and integrated nested Laplace approximation (INLA).
Exercises
1. R exercises on the exercise sheets exercises6.pdf and exercises7.pdf (skip the last exercise on sheet exercises7.pdf).
2. Check the formulas for u(θ) and j(θ). How can these expressions be computed using importance sampling?