Maximum likelihood
Applications and examples
REML and residual likelihood
Peter McCullagh
Department of Statistics
University of Chicago
Nelder Lecture
Imperial College, March 8 2012
JAN: Some personal remarks...
IC 1974–1977:
The MS/PhD program in Statistics
Computing strategies: GLIM,...
Ordinal data and log-linear models...
Chicago 1977-79: consulting work
IC 1979–1984:
Plans for the GLM book I: London 1980–81
Writing the GLM book II: Vancouver 1982
Writing the GLM book III: London/Rothamsted 1982/83
Toronto ASA Mtg 1984
Chicago 1985–1987:
The second edition...
Random effects models: the salamander data
Outline
1 Maximum likelihood
    REML and residual likelihood
    Likelihood ratios
2 Applications and examples
    Example I: fumigants for eelworm control
    Example II: kernel smoothing
    Box-Cox and REML
Symmetric functions
Estimation of moments/cumulants:
Thiele 1891; Fisher 1929; Dressel 1940; Tukey 1950
$Y_1, \ldots, Y_n$ iid with mean $\kappa_1$, variance $\kappa_2$, ...
Polynomial symmetric functions...
$k_1 = (Y_1 + \cdots + Y_n)/n$ for $\kappa_1$
$k_2 = \sum (Y_i - k_1)^2/(n-1)$ for $\kappa_2$
$k_{11} = \sum^{\ne}_{ij} Y_i Y_j / n^{\downarrow 2} = k_1^2 - k_2/n$ for ??
$k_3 = \sum (Y_i - k_1)^3 \, n^2/n^{\downarrow 3}$ for $\kappa_3$
$k_{21} = \ldots$
$k_{111} = \sum^{\ne}_{ijk} Y_i Y_j Y_k / n^{\downarrow 3}$
$k_4 = \bigl[(n+1) \sum (Y_i - k_1)^4/n - 3(n-1) k_2^2\bigr] \, n^3/n^{\downarrow 4}$
$k_{31}, k_{22}, k_{211}, k_{1111}$
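The first few symmetric functions can be checked numerically. The following is a minimal sketch in base R, assuming the falling factorial $n^{\downarrow r} = n(n-1)\cdots(n-r+1)$ and reading the sum in $k_{11}$ as running over ordered pairs of distinct indices; the helper name `kstats` and the data vector are hypothetical.

```r
# Hypothetical helper: k1, k2, k3 and k11 for a numeric vector y.
kstats <- function(y) {
  n  <- length(y)
  k1 <- sum(y) / n
  k2 <- sum((y - k1)^2) / (n - 1)
  # n^2 / n^{down 3}, with n^{down 3} = n (n - 1) (n - 2)
  k3 <- sum((y - k1)^3) * n^2 / (n * (n - 1) * (n - 2))
  # k11: average of Y_i Y_j over ordered pairs with i != j
  yy  <- outer(y, y)
  k11 <- (sum(yy) - sum(diag(yy))) / (n * (n - 1))
  c(k1 = k1, k2 = k2, k3 = k3, k11 = k11)
}

y  <- c(2.1, 3.4, 1.9, 5.0, 4.2, 2.8)   # illustrative data only
ks <- kstats(y)

# The identity k11 = k1^2 - k2/n from the slide
all.equal(unname(ks["k11"]), unname(ks["k1"]^2 - ks["k2"] / length(y)))
```

The last line compares the two expressions for $k_{11}$ given above; `k2` coincides with R's built-in `var(y)`.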
Maximum likelihood estimation
Design with n units/plots/subjects $i = 1, \ldots, n$
covariate $x(i) \equiv x_i$ in $R^p$ given
Response $Y(i) = Y_i$ a real number
Observation space $Y \in S = R^n$
Covariate space $\mathcal{X} = \mathrm{span}(X) \subset S$
Linear model: For some $\beta \in R^p$ and $\sigma^2 > 0$
$Y \sim N(X\beta, \sigma^2 I_n)$
Log likelihood function: $l(\beta, \sigma; y) = -\tfrac12 \|y - X\beta\|^2/\sigma^2 - n \log\sigma$
$\hat\beta = (X'X)^{-1}X'y$; $\hat\mu = X\hat\beta$; $\hat\sigma^2 = \|y - \hat\mu\|^2/n$
$E(\hat\sigma^2) = (n-p)\sigma^2/n$: too small!
Conventional estimate $s^2 = \|y - \hat\mu\|^2/(n-p)$
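The gap between the two variance estimates is easy to see on a small data set. This is a sketch in base R with made-up numbers (the vectors `x` and `y` are illustrative, not from the lecture): it computes the maximum likelihood estimate with divisor $n$ and the conventional estimate with divisor $n - p$.

```r
# Illustrative data (not from the lecture)
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(1.2, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8, 8.4)

X <- cbind(1, x)                  # model matrix: intercept and slope
n <- nrow(X)
p <- ncol(X)

beta_hat <- solve(crossprod(X), crossprod(X, y))   # (X'X)^{-1} X'y
mu_hat   <- X %*% beta_hat
rss      <- sum((y - mu_hat)^2)                    # ||y - mu_hat||^2

sigma2_ml <- rss / n          # maximum likelihood estimate
s2        <- rss / (n - p)    # conventional estimate

sigma2_ml / s2                # equals (n - p) / n
```

The ratio of the two estimates is exactly $(n-p)/n$, which is the shrinkage factor in $E(\hat\sigma^2)$.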
Residuals
One definition: $R = Y - X(X'X)^{-1}X'Y = QY$
Another definition: $R' = AY$ where $\ker(A) = \mathcal{X}$
But $R' = AR$... so all definitions are equivalent
... for likelihood computations
Distributions:
$R \sim N(0, \sigma^2 Q)$
$R' \sim N(0, \sigma^2 AA')$
Likelihoods? No density function for $R$
Variance-components estimation
Design with n units/plots/subjects $i = 1, \ldots, n$
Block factor relationship: $B(i,j) = 1$ if $i \sim_B j$ (given)
covariate $x(i) \equiv x_i$ in $R^p$ (given treatment level)
Response $Y(i) = Y_i$ a real number
Linear model: For some $\beta \in R^p$ and $\sigma_0^2, \sigma_1^2 > 0$
$Y \sim N(X\beta, \sigma_0^2 I_n + \sigma_1^2 B)$
mean $\mu = X\beta$; variance $\Sigma = \sigma_0^2 I_n + \sigma_1^2 B$; $W = \Sigma^{-1}$
Log likelihood function: $l(\beta, \sigma; y) = -\tfrac12 \|y - \mu\|^2 - \tfrac12 \log|\Sigma|$, with $\|v\|^2 = v'Wv$
Sufficient statistics (balance and $\mu = 0$)
$E(Y'Y) = \mathrm{tr}(\Sigma) = n\sigma_0^2 + n\sigma_1^2$
$E(Y'BY) = \mathrm{tr}(\Sigma B) = n\sigma_0^2 + \sigma_1^2 \sum n_j^2$
ML estimates of the variance components: typically too small
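The two expectations follow from the trace identity for quadratic forms. A short derivation, assuming $\mu = 0$ and writing $n_j$ for the size of block $j$ (so that $B$ is block diagonal with blocks of ones):

```latex
\begin{align*}
E(Y'Y)  &= \operatorname{tr}(\Sigma)
         = \sigma_0^2 \operatorname{tr}(I_n) + \sigma_1^2 \operatorname{tr}(B)
         = n\sigma_0^2 + n\sigma_1^2, \\
E(Y'BY) &= \operatorname{tr}(\Sigma B)
         = \sigma_0^2 \operatorname{tr}(B) + \sigma_1^2 \operatorname{tr}(B^2)
         = n\sigma_0^2 + \sigma_1^2 \sum_j n_j^2 .
\end{align*}
```

Here $\operatorname{tr}(B) = n$ because every unit is in the same block as itself, and $\operatorname{tr}(B^2) = \sum_j n_j^2$ because each diagonal block of $B^2$ is $n_j$ times a block of ones.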
Gaussian likelihoods
Density of the Gaussian $N(\mu, \Sigma)$ distn at $y \in R^n$:
$|W|^{1/2} \exp(-\tfrac12 \|y - \mu\|^2)\, dy$
$(R^n, W)$ regarded as an inner product space
$W = \Sigma^{-1}$, $\langle x, y \rangle = x'Wy$
$\mathcal{K} \subset R^n$ a subspace of dimension $k$ spanned by cols of $K$
Orthogonal projections: $P = K(K'WK)^{-1}K'W$, $Q = I - P$
$A$: a linear transformation with kernel $\mathcal{K}$
Marginal likelihood based on $AY \sim N(A\mu, A\Sigma A')$ is
$|A\Sigma A'|^{-1/2} \exp(-\tfrac12 (y - \mu)'A'(A\Sigma A')^{-1}A(y - \mu))$
Gaussian likelihoods contd.
Marginal likelihood based on $AY \sim N(A\mu, A\Sigma A')$ is
$|A\Sigma A'|^{-1/2} \exp(-\tfrac12 (y - \mu)'A'(A\Sigma A')^{-1}A(y - \mu))$
Equivalent expressions:
$\dfrac{|W|^{1/2}}{|K'WK|^{1/2}} \exp(-\tfrac12 (y - \mu)'WQ(y - \mu))$
$\dfrac{|W|^{1/2}\, |K'K|^{1/2}}{|K'WK|^{1/2}} \exp(-\tfrac12 (y - \mu)'WQ(y - \mu))$
$\mathrm{Det}^{1/2}(WQ) \exp(-\tfrac12 (y - \mu)'WQ(y - \mu))$
$\mathrm{Det}(WQ)$ is the product of $n - k$ non-zero eigenvalues
$R^n/\mathcal{K}$ regarded as an inner product space $\langle x, y \rangle = x'WQy$
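The equivalence of these expressions can be checked numerically. The sketch below uses base R with an arbitrary positive definite $\Sigma$ and an arbitrary $K$ (both made up for illustration). It builds one particular $A$ with kernel $\mathcal{K}$ by taking an orthonormal basis of the orthogonal complement of $\mathrm{span}(K)$, so that $AA' = I$; other choices of $A$ change the determinant only by a constant factor.

```r
set.seed(1)
n <- 6
k <- 2
K <- cbind(1, 1:n)                                  # illustrative kernel basis
S <- crossprod(matrix(rnorm(n * n), n)) + diag(n)   # a positive definite Sigma
W <- solve(S)

# W-orthogonal projection onto span(K), and its complement
P <- K %*% solve(t(K) %*% W %*% K) %*% t(K) %*% W
Q <- diag(n) - P
WQ <- W %*% Q

# Rows of A: orthonormal basis of the complement of span(K), so A K = 0
A <- t(qr.Q(qr(K), complete = TRUE)[, (k + 1):n])

# Quadratic-form matrix of the marginal likelihood based on AY
M <- t(A) %*% solve(A %*% S %*% t(A)) %*% A

# Det(WQ): product of the n - k non-zero eigenvalues (WQ is symmetric)
ev     <- eigen((WQ + t(WQ)) / 2, symmetric = TRUE)$values
Det_WQ <- prod(ev[1:(n - k)])

all.equal(M, WQ, check.attributes = FALSE)
all.equal(Det_WQ, 1 / det(A %*% S %*% t(A)))
all.equal(Det_WQ, det(W) * det(t(K) %*% K) / det(t(K) %*% W %*% K))
```

All three comparisons should return `TRUE`, matching the third and fourth displayed expressions when $AA' = I$.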
REML and residual likelihood
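Family of distributions: $N(X\beta, \Sigma(\theta))$: $\beta \in R^p$, $\theta \in \Theta$
Full log likelihood ($W = \Sigma^{-1}$):
$l(\beta, \Sigma; y) = -\tfrac12 \log\det(\Sigma) - \tfrac12 (y - X\beta)'W(y - X\beta)$
Profile log likelihood: $\hat\beta_\theta = (X'WX)^{-1}X'Wy$:
$l(\hat\beta_\theta, \Sigma; y) = -\tfrac12 \log\det(\Sigma) - \tfrac12 y'WQy$
Residual: $Y \mapsto AY$ where $\ker(A) = \mathcal{X}$
Residual log likelihood
$l(\Sigma; Qy) = -\tfrac12 \log\det(\Sigma) - \tfrac12 \log\det(X'WX) - \tfrac12 y'WQy$
$\qquad = \tfrac12 \log\mathrm{Det}(WQ) - \tfrac12 y'WQy + \mathrm{const}(X)$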
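As a sanity check on the residual log likelihood, the simplest case $\Sigma = \sigma^2 I_n$ should return the conventional estimate $s^2$ with divisor $n - p$. The following base R sketch evaluates the formula above directly (the function name `reml_loglik` and the data are hypothetical) and maximizes over $\sigma^2$ with `optimize`.

```r
# Residual log likelihood, up to const(X):
#   -1/2 log det(Sigma) - 1/2 log det(X'WX) - 1/2 y'WQy
reml_loglik <- function(Sigma, X, y) {
  W   <- solve(Sigma)
  XWX <- t(X) %*% W %*% X
  Q   <- diag(nrow(X)) - X %*% solve(XWX, t(X) %*% W)
  ld_Sigma <- as.numeric(determinant(Sigma, logarithm = TRUE)$modulus)
  ld_XWX   <- as.numeric(determinant(XWX, logarithm = TRUE)$modulus)
  as.numeric(-0.5 * ld_Sigma - 0.5 * ld_XWX - 0.5 * t(y) %*% W %*% Q %*% y)
}

# Illustrative data (not from the lecture)
x <- 1:8
y <- c(1.2, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8, 8.4)
X <- cbind(1, x)
n <- nrow(X)
p <- ncol(X)

# Maximize over sigma^2 with Sigma = sigma^2 I_n
opt <- optimize(function(s2) reml_loglik(s2 * diag(n), X, y),
                interval = c(1e-4, 10), maximum = TRUE, tol = 1e-10)
s2_reml <- opt$maximum

s2 <- sum(resid(lm(y ~ x))^2) / (n - p)   # conventional estimate
c(s2_reml = s2_reml, s2 = s2)
```

The two numbers agree to optimizer tolerance, which is the degrees-of-freedom adjustment that the ordinary profile likelihood misses.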
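Summary: Marginal likelihood: $\mathcal{K} \ne \mathcal{X}$
Model subspace $\mathcal{X} = \{\mu(\beta) : \beta \in R^p\}$; $\mathcal{X} = \mathrm{span}(X)$
Kernel subspace $\mathcal{K} = \mathrm{span}(K)$
Covariance matrix $\Sigma_\theta$: $W = \Sigma^{-1}$
Log likelihood based on observation $y + \mathcal{K}$
Log likelihood based on $Ay$ where $\ker(A) = \mathcal{K}$
$l(\beta, \theta; y + \mathcal{K}) = \tfrac12 \log\mathrm{Det}(WQ) - \tfrac12 (y - \mu)'WQ(y - \mu)$
where $WQ = W(I - K(K'WK)^{-1}K'W)$ is the inner product in $R^n/\mathcal{K}$
Special cases:
$\mathcal{K} = 0$: ordinary likelihood
$\mathcal{K} = 1_n = \mathrm{span}(e_1 + \cdots + e_n)$: likelihood based on contrasts
$\mathcal{K} = \mathcal{X}$: standard REML
$\mathcal{K} = \mathrm{span}(e_n)$: likelihood with $y_n$ unobserved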
Likelihood ratio tests
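Simple likelihood ratio: $P_\theta(\text{event}) / P_{\theta_0}(\text{event})$
Maximized likelihood ratio: $\sup_{\theta \in H_A} P_\theta(\text{event}) / \sup_{\theta \in H_0} P_\theta(\text{event})$
Event in numerator = event in denominator, usually $dy$
For marginal likelihood, event = $dy + \mathcal{K}$
Marginal likelihood ratio statistic: $\sup_{\Theta} P_\theta(dy + \mathcal{K}) / \sup_{\Theta_0} P_\theta(dy + \mathcal{K})$
Same $\mathcal{K}$ in numerator and denominator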
Example: Eelworm control using fumigants
[Figure: field plan showing the four blocks, labelled I, IV, II, III]
Actual field layout of 48 plots in four blocks. Experiment using fumigants to control eelworms in oat field (Bailey, 2008, p. 73).
Data (eelworm counts) from Cochran and Cox (1950, Table 3.1)
Blk 1 (I): 269 283 252 212 95 138 100 197 263 107 282 230 216 145 88
Blk 2 (II): 124 211 194 222 193 102 193 128 42 29 162 191 107 67 23
Blk 2 (IV): 127 80 134 89 41 74 25 42 62
Blk 4 (III): 209 109 153 9 17 19 19 44 48
Variance models: taking off from JAN (1965)
Block-structured effects: $\eta \overset{iid}{\sim} N(0, \sigma_1^2)$, const on each block
$Y_i = \text{trt effects} + \eta_{b(i)} + \epsilon_i$
$\mathrm{cov}(Y_i, Y_j) = \sigma_1^2 B(i,j) + \sigma_0^2 \delta_{ij}$
Stationary isotropic spatial effect: $\eta \sim GP(0, \sigma_1^2 K)$
$\mathrm{cov}(\eta(x), \eta(x')) = \sigma_1^2 K(|x - x'|)$
$Y(A) = \text{trt effects} + \int_A \eta(x)\, dx + \epsilon_i$
$\mathrm{cov}(Y_i, Y_j) = \sigma_1^2 \bar K(x_i, x_j) + \sigma_0^2 \delta_{ij}$
$K(x, x') = \exp(-|x - x'|/\rho)$ with range $\rho > 0$ for illustration
In practice, $\hat\rho = \infty$: $K(x, x') = -|x - x'|$
Comparison of variance models for eelworm expt
$Y(i)$ response for plot $i$ (log ratio of eelworm counts)
Block relation: $B(i,j) = 1$ if $i \sim j$ in same block
Distance relation: $D_{ij} = d(i,j)$; $V_{ij} = \exp(-D_{ij}/\rho)$
Take $\mathcal{K}$ = fumigant ∗ dose as kernel
Maximal model: $\mathrm{cov}(Y(i), Y(j)) = \sigma_0^2 \delta_{ij} + \sigma_1^2 B_{ij} + \sigma_2^2 V_{ij}$
$H_0$: $\sigma_1^2 = \sigma_2^2 = 0$
$H_1$: $\sigma_2^2 = 0$ (no spatial effect beyond blocks)
$H_2$: $\sigma_1^2 = 0$ (no block effect)
Log likelihood values: 6.47, 12.28, 20.53, 20.53 (both)
(Max ‘always’ occurs at $\rho \to \infty$;
$V = -D$ is pos def on contrasts: $\mathcal{K} \supset 1$)
R syntax: regress(y~1, ~blk+V, kernel=K)
Treatment comparisons via likelihood
Fix covariance model at $\mathrm{cov}(Y) = \sigma_0^2 I_n + \sigma_2^2 V$ ($V = -D$)
Treatments: Four fumigants and three dose levels including zero
Nullnull model: Nothing has any effect ($X = 1$), dim 1
Null model: all fumigants equally effective: 1+dose, dim 3
Alternative: fumigant*dose, dim 9
regress(y~1, ~V, kernel=~1)                    llik=14.4
regress(y~dose, ~V, kernel=~1)                 llik=17.3
regress(y~dose, ~V, kernel=~dose)              llik=16.3
regress(y~fumigant:dose, ~V, kernel=~dose)     llik=26.7
Comparisons involving models having the same kernel
Default kernel is $\mathcal{K} = \mathcal{X}$ (REML)
Marginal likelihood and kernel smoothing
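$\begin{pmatrix} Y_0 \\ Y_1 \end{pmatrix} \sim N\left(0, \begin{pmatrix} \Sigma_{00} & \Sigma_{01} \\ \Sigma_{10} & \Sigma_{11} \end{pmatrix}\right)$
$Y_0 \mid Y_1 = y_1 \sim N(\Sigma_{01}\Sigma_{11}^{-1} y_1,\ \Sigma_{00} - \Sigma_{01}\Sigma_{11}^{-1}\Sigma_{10})$
Implications: Observe $Y_1 = y_1$ only (n-component vector)
Predictive distn: mean $= \Sigma_{01}\Sigma_{11}^{-1} y_1$; cov $= W_{00}^{-1}$
Typical application: observe $(y_1, \ldots, y_n)$ at $(x_1, \ldots, x_n)$
$\Sigma_{ij} = \sigma_0^2 \delta_{ij} + \sigma_1^2 K(x_i, x_j)$, $K(x, x') = e^{-|x - x'|}$ or ...
Predictions: $E(Y(x^*) \mid \text{data}) = \sigma_1^2 \sum_{ij} K(x^*, x_i)\, \Sigma^{-1}_{ij}\, y_j$
‘smooth’ fn of $x^*$ called kernel spline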
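The conditional-mean formula translates directly into a few lines of base R. This is a sketch only: the sites, the responses, and the variance components `s0` and `s1` are made up and treated as known, whereas in practice they would be estimated. It also checks that the conditional covariance equals the inverse of the leading block of $W = \Sigma^{-1}$.

```r
x  <- seq(0, 10, length.out = 11)   # observation sites (illustrative)
y  <- sin(x)                        # stand-in responses, deterministic
xs <- c(2.5, 7.25)                  # prediction sites x*
s0 <- 0.1                           # assumed value of sigma_0^2
s1 <- 1                             # assumed value of sigma_1^2

# Exponential covariance function K(x, x') = exp(-|x - x'|)
Kfun <- function(a, b) exp(-abs(outer(a, b, "-")))

S11 <- s0 * diag(length(x))  + s1 * Kfun(x, x)     # Sigma_11
S01 <- s1 * Kfun(xs, x)                            # Sigma_01
S00 <- s0 * diag(length(xs)) + s1 * Kfun(xs, xs)   # Sigma_00

pred_mean <- S01 %*% solve(S11, y)                 # Sigma_01 Sigma_11^{-1} y1
pred_cov  <- S00 - S01 %*% solve(S11, t(S01))      # Schur complement

# Same covariance via the joint precision matrix: cov = W_00^{-1}
S   <- rbind(cbind(S00, S01), cbind(t(S01), S11))
W00 <- solve(S)[1:2, 1:2]
all.equal(pred_cov, solve(W00), check.attributes = FALSE)
```

Evaluating `pred_mean` over a fine grid of `xs` traces out the smooth function of $x^*$ referred to above.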
[Figure: four scatterplot panels of y[w] against x[w], titled "C_1 spline: const mean model", "C_1 spline: linear mean model", "C_2 spline: linear mean model" and "C_2 spline: quadratic mean model"]
R code:
library(regress)   # provides regress(); x and y are the data vectors
d <- abs(outer(x, x, "-")); rho <- 100
K <- (1 + d/rho)*exp(-d/rho)
fit1 <- regress(y~1, ~K, kernel=1)
# best linear predictor: fitted mean plus smoothed residual
blp <- fit1$fitted + fit1$sigma[2] * K %*% fit1$W %*% (y-fit1$fitted)
plot(x, y, cex=0.5); lines(x, blp)
Example of an improper covariance function:
K3 <- d^3
xsq <- x^2   # assumed: xsq is the square of x
fit3 <- regress(y~1+x+xsq, ~K3, kernel=~1+x)
blp <- fit3$fitted + fit3$sigma[2] * K3 %*% fit3$W %*% (y-fit3$fitted)
plot(x, y, cex=0.5); lines(x, blp)
The Box-Cox technique for transformation
Family of transformations $y \mapsto g_\lambda(y) = (y^\lambda - 1)/\lambda$
indexed by $\lambda$ and applied component-wise
Model: for some $\lambda$, $g_\lambda(Y) \sim N(X\beta, \Sigma)$
Density at $y \in R^n$ is
$\det(W)^{1/2} \exp(-\tfrac12 \|g_\lambda(y) - X\beta\|^2_W) \times J_\lambda(y)\, dy$
$W = \Sigma^{-1}$, $J_\lambda(y) = \prod |g_\lambda'(y_i)|$
Log likelihood is
$\tfrac12 \log\det(W) - \tfrac12 \|g_\lambda(y) - X\beta\|^2_W + \sum \log|g_\lambda'(y_i)|$
Profile log likelihood for $\lambda$ is
$\tfrac12 \log\det(W) - \tfrac12 \|g_\lambda(y)\|^2_{WQ} + \sum \log|g_\lambda'(y_i)|$
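The profile log likelihood for $\lambda$ can be written as a small function. The sketch below is base R with $W$ treated as known; the function name `boxcox_profile`, the data, and the choice $W = I_n/0.25$ are all hypothetical. The usual convention $g_0(y) = \log y$ is assumed for $\lambda$ near zero, and $y$ must be positive.

```r
# Profile log likelihood for lambda with W known:
#   1/2 log det(W) - 1/2 ||g_lambda(y)||^2_{WQ} + (lambda - 1) sum(log y)
boxcox_profile <- function(lambda, y, X, W) {
  # Box-Cox transform; the lambda -> 0 limit is log(y) (assumed convention)
  g <- if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
  Q <- diag(length(y)) - X %*% solve(t(X) %*% W %*% X, t(X) %*% W)
  ldW  <- as.numeric(determinant(W, logarithm = TRUE)$modulus)
  quad <- as.numeric(t(g) %*% W %*% Q %*% g)
  # Jacobian term: g'_lambda(y) = y^(lambda - 1)
  0.5 * ldW - 0.5 * quad + (lambda - 1) * sum(log(y))
}

# Illustrative positive responses and a model with an intercept
x <- 1:8
y <- c(1.2, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8, 8.4)
X <- cbind(1, x)
W <- diag(length(y)) / 0.25

lambdas <- seq(-1, 2, by = 0.25)
prof    <- sapply(lambdas, boxcox_profile, y = y, X = X, W = W)
lambdas[which.max(prof)]      # grid value with the largest profile likelihood
```

At $\lambda = 1$ the Jacobian term vanishes and, because $X$ contains an intercept, the quadratic form reduces to $y'WQy$, so the function returns the ordinary Gaussian profile log likelihood.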
Box-Cox and REML
Profile log likelihood for $\lambda$
$\tfrac12 \log\det(W) - \tfrac12 \|g_\lambda(y)\|^2_{WQ} + \sum \log|g_\lambda'(y_i)|$
$g_\lambda(y) = (y^\lambda - 1)/\lambda$, $g_\lambda'(y) = y^{\lambda - 1}$
REML likelihood: (Shi-Tsai, JRSSB, 2002)
$l(\lambda, W; y, X) = \tfrac12 \log\mathrm{Det}(WQ) - \tfrac12 \|g_\lambda(y)\|^2_{WQ} + \sum \log|g_\lambda'(y_i)|$
...by adopting the results of Verbyla (1990) or Diggle (1994)...
Is this right/reasonable/OK?
(i) seems reasonable by analogy with REML to adjust for d.f.
(ii) but not a function of the residuals $Qy$
(iii) Put $X = I_n$: resid = 0 but $l(\lambda \ldots) = (\lambda - 1) \sum \log(y_i)$
... so it cannot be right!
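Point (iii) can be spelled out in a few lines. A sketch of the calculation, assuming $y_i > 0$ and taking the determinant of the zero inner product on the trivial quotient space to be the empty product, 1:

```latex
\begin{align*}
X = I_n \;&\Rightarrow\; P = X(X'WX)^{-1}X'W = W^{-1}W = I_n, \qquad Q = I_n - P = 0, \\
          &\Rightarrow\; WQ = 0, \qquad \|g_\lambda(y)\|^2_{WQ} = 0, \qquad \log \operatorname{Det}(WQ) = 0, \\
l(\lambda, W; y, X) &= \sum_i \log|g_\lambda'(y_i)| = \sum_i \log y_i^{\lambda - 1} = (\lambda - 1) \sum_i \log y_i .
\end{align*}
```

The residual $Qy$ is identically zero and so carries no information, yet the proposed criterion still varies with $\lambda$.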
Box-Cox and REML, contd
Is there a right way to combine Box-Cox with REML?
No!
Why not?
Ans I:
Because the transformation $y \mapsto y^\lambda$, $R^n \to R^n$,
is not measurable with respect to $\mathcal{B}(R^n/\mathcal{K})$
The transformation does not preserve cosets
Ans II:
Model says $Y^\lambda \sim N(\mu \in \mathcal{X}, \Sigma)$ or $Y \sim N(\mu, \Sigma, \lambda)$
Then $E(Y) \notin \mathcal{X}$ implies distn of $QY$ depends on $\mu$
References
Bailey, R. (2008) Design of Comparative Experiments. Cambridge.
Box, G.E.P. and Cox, D.R. (1964) Analysis of transformations. JRSSB 26, 211–252.
Harville, D.A. (1974) Bayesian variance components. Bka 61, 383–385.
Harville, D.A. (1977) Variance component estimation. JASA 72, 320–340.
Nelder, J.A. (1965) Orthogonal block structure. Proc Roy Soc A 283.
Patterson, H. and Thompson, R. (1971) Biometrika 58, 545–554.
Shi, P. and Tsai, C-L. (2002) Regression model selection: A residual likelihood approach. JRSSB.