Likelihood inference in the presence of nuisance parameters
Nancy Reid, University of Toronto
www.utstat.utoronto.ca/reid/research
1. Notation, Fisher information, orthogonal parameters
2. Likelihood inference with no nuisance parameters; first and third order
3. Profile log-likelihood
4. Adjustments to profile log-likelihood
5. Third order p-values
6. Model classes
1. Notation...
- model Y ∼ f(y; θ), θ ∈ R^d, θ = (ψ, λ)
- likelihood L(θ) = L(θ; y) = f(y; θ); log-likelihood ℓ(θ) = log L(θ)
- i.i.d. sampling y = (y1, . . . , yn): L(θ; y) = Π_{i=1}^n f(yi; θ)
- m.l.e. θ̂: sup_θ L(θ) = L(θ̂)
- observed information j(θ̂) = −ℓ″(θ̂)
- expected information i(θ) = E{−ℓ″(θ)}
- partitioned information and its inverse:

      i(θ) = ( iψψ  iψλ )       i⁻¹(θ) = ( i^ψψ  i^ψλ )
             ( iλψ  iλλ ) ,               ( i^λψ  i^λλ )
ψ is orthogonal to λ if iψλ(θ) = 0; this implies in particular that ψ̂ and λ̂ are asymptotically independent
Example: ratio of Poisson means
y1 ∼ Po(λ), y2 ∼ Po(ψλ):
L(ψ, λ; y1, y2) = e^{−λ(1+ψ)} ψ^{y2} λ^{y1+y2}
in fact L = L1(ψ; y2) L2(λ; y+) with y+ = y1 + y2, stronger than orthogonality
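As a quick numerical check (a sketch, not part of the slides): conditioning on the total y+ = y1 + y2 leaves a Binomial(y+, ψ/(1+ψ)) distribution for y2 that is free of λ, which is the elimination of the nuisance parameter behind the factorization above. The values of psi, lam and yplus below are illustrative.

```python
# Verify that y2 | y1 + y2 = yplus is Binomial(yplus, psi/(1+psi)),
# with no dependence on the nuisance parameter lambda.
import math

psi, lam = 2.0, 3.0          # illustrative parameter values (assumption)
yplus = 7                    # condition on this total

def pois(y, mu):
    return math.exp(-mu) * mu**y / math.factorial(y)

# joint P(y1 = yplus - k, y2 = k), normalized over k = 0..yplus
joint = [pois(yplus - k, lam) * pois(k, psi * lam) for k in range(yplus + 1)]
cond = [p / sum(joint) for p in joint]

# Binomial(yplus, psi/(1+psi)) pmf -- lambda appears nowhere
p = psi / (1 + psi)
binom = [math.comb(yplus, k) * p**k * (1 - p)**(yplus - k)
         for k in range(yplus + 1)]

assert max(abs(a - b) for a, b in zip(cond, binom)) < 1e-12
```

Repeating the check with a different lam leaves cond unchanged, which is the sense in which the conditional likelihood for ψ is exactly free of λ.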
Example: exponential regression
yi follows an exponential distribution,
E(yi) = λ exp(−ψxi), Σxi = 0
ℓ(ψ, λ; y) = −n log λ − λ⁻¹ Σ yi exp(ψxi)
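A small sketch of this model (my code, assuming the mean parametrization E(yi) = λ exp(−ψxi) above): for fixed ψ the restricted m.l.e. of λ is available in closed form, λ̂ψ = n⁻¹ Σ yi exp(ψxi), obtained by setting ∂ℓ/∂λ = 0.

```python
# Exponential regression: check numerically that the closed-form
# lambda_hat(psi) maximizes l(psi, .) for each fixed psi.
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = np.linspace(-1.0, 1.0, n)        # centred covariate, sum(x) = 0
psi0, lam0 = 0.5, 2.0                # illustrative true values (assumption)
y = rng.exponential(lam0 * np.exp(-psi0 * x))   # E(y_i) = lam0 exp(-psi0 x_i)

def loglik(psi, lam):
    return -n * np.log(lam) - np.sum(y * np.exp(psi * x)) / lam

def lam_hat(psi):                    # restricted m.l.e. of lambda
    return np.mean(y * np.exp(psi * x))

for psi in (-1.0, 0.0, 0.7):
    lh = lam_hat(psi)
    assert loglik(psi, lh) >= loglik(psi, 0.9 * lh)
    assert loglik(psi, lh) >= loglik(psi, 1.1 * lh)
```

Substituting λ̂ψ back into ℓ gives the profile log-likelihood ℓp(ψ) discussed later.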
Likelihood inference with no nuisance parameters
- plot the likelihood
- θ̂ is asymptotically normal, mean θ, variance i⁻¹(θ)
- r(θ) = ±[2{ℓ(θ̂) − ℓ(θ)}]^{1/2} is asymptotically N(0, 1) (better)
- r∗(θ) = r(θ) + {1/r(θ)} log{q(θ)/r(θ)} is asymptotically N(0, 1) (even better)
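A sketch of the first-order quantities for a single Poisson observation, using the numbers from the bounded-parameter example that follows (y = 17, testing θ = 6.7): the likelihood-root p-value is close to the exact tail probability, while the Wald (normal approximation to θ̂) p-value is far off.

```python
# Likelihood root r and Wald statistic for Y ~ Po(theta), y = 17, theta = 6.7.
import math

def norm_sf(z):                       # upper-tail standard normal probability
    return 0.5 * math.erfc(z / math.sqrt(2.0))

y, theta = 17, 6.7
theta_hat = y                         # m.l.e. for a Poisson mean
ell = lambda t: y * math.log(t) - t   # log-likelihood, constants dropped

r = math.copysign(math.sqrt(2.0 * (ell(theta_hat) - ell(theta))),
                  theta_hat - theta)
wald = (theta_hat - theta) / math.sqrt(theta_hat)   # var(theta_hat) ~ theta_hat

p_r, p_wald = norm_sf(r), norm_sf(wald)
assert 0.0004 < p_r < 0.0005      # close to the exact upper tail probability
assert 0.006 < p_wald < 0.0065    # an order of magnitude too large
```

These two p-values match the r and θ̂ rows of the table on the next slide.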
Example: Y ∼ Po(θ), θ > b, b known; b = 6.7, y = 17:
[Figure: likelihood function of µ]
Fraser, Reid, Wong 2003
[Figure: p-value functions of µ]
p-values for testing θ = 6.7:
  upper                      0.0005993
  lower                      0.0002170
  mid                        0.0004081
  r∗                         0.0003779
  r                          0.0004416
  θ̂ (normal approximation)   0.0062427
[Figure: likelihood and p-value functions of µ]
Nuisance parameters: profile likelihood
θ = (ψ, λ)
restricted m.l.e. λ̂ψ: sup_λ L(ψ, λ) = L(ψ, λ̂ψ)
Lp(ψ) = L(ψ, λ̂ψ) (concentrated likelihood)
for λ of fixed dimension, i.i.d. sampling y:
- sup_ψ Lp(ψ) = L(ψ̂, λ̂)
- rp(ψ) = ±[2{ℓp(ψ̂) − ℓp(ψ)}]^{1/2} →_d N(0, 1)
- ψ̂ asymptotically normal with mean ψ and variance consistently estimated by {−ℓp″(ψ̂)}⁻¹ = j^{ψψ}(ψ̂, λ̂)
But profile likelihood can be too concentrated, and maximized at the 'wrong' point:
Example: linear regression
yi = xi′β + εi,  εi ∼ N(0, ψ),  xi = (xi1, . . . , xip)
ψ̂ = n⁻¹ Σ(yi − xi′β̂)²
[Figure: profile likelihood of σ]
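A simulation sketch of the point just made (illustrative sizes n = 10, p = 3 are my choice): the profile m.l.e. ψ̂ = RSS/n systematically underestimates ψ = σ², with E(ψ̂) = ψ(n − p)/n.

```python
# Repeatedly fit a linear regression and record the profile m.l.e. of
# psi = sigma^2; its average is near psi*(n-p)/n, not psi.
import numpy as np

rng = np.random.default_rng(0)
n, p, psi = 10, 3, 1.0
X = rng.standard_normal((n, p))      # fixed design across replications
reps = 20000
est = np.empty(reps)
for r in range(reps):
    y = X @ np.ones(p) + rng.standard_normal(n) * np.sqrt(psi)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    est[r] = np.sum((y - X @ beta_hat) ** 2) / n   # RSS/n

assert abs(est.mean() - psi * (n - p) / n) < 0.02  # near 0.7, not 1.0
```

With p a non-trivial fraction of n the downward bias is severe, which is what the adjustments on the next slide are designed to repair.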
Adjustments to profile log-likelihood
If ψ is orthogonal to λ:
ℓa(ψ) = ℓp(ψ) − (1/2) log |jλλ(ψ, λ̂ψ)|

j(θ) = −ℓ″(θ) = ( jψψ  jψλ )
                 ( jλψ  jλλ )

ℓp is Op(n), log |j| is Op(1)
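A worked sketch for the normal-theory regression case (ψ = σ² is orthogonal to β): there ℓp(ψ) = −(n/2) log ψ − RSS/(2ψ) and log|jλλ(ψ, λ̂ψ)| = const − p log ψ, so the adjustment shifts the maximizer from RSS/n to RSS/(n − p). The values of n, p and RSS below are illustrative.

```python
# Compare the maximizers of the profile and adjusted profile
# log-likelihoods for psi = sigma^2 in a normal linear regression.
import numpy as np

n, p, rss = 20, 4, 30.0
grid = np.linspace(0.5, 5.0, 20001)            # fine grid of psi values

l_p = -(n / 2) * np.log(grid) - rss / (2 * grid)
l_a = l_p - 0.5 * (-p * np.log(grid))          # drop the psi-free constant

assert abs(grid[np.argmax(l_p)] - rss / n) < 1e-3        # 1.5
assert abs(grid[np.argmax(l_a)] - rss / (n - p)) < 1e-3  # 1.875
```

The adjusted maximizer RSS/(n − p) is the usual unbiased variance estimate, consistent with ℓa approximating an exact marginal likelihood in this model.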
Example: product of exponential means
y1i ∼ Exp(ψλi), y2i ∼ Exp(ψ/λi), i = 1, . . . , n
ψ̂ →_p (π/4)ψ,  ψ̂a →_p (π/3)ψ
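A simulation sketch of these two limits (my derivation from the model above, with Exp(·) read as the mean): the λi cancel in the log-likelihood except through y1i/λi + y2iλi, giving λ̂i = (y1i/y2i)^{1/2} free of ψ, hence ψ̂ = n⁻¹ Σ (y1i y2i)^{1/2}, and the orthogonal adjustment works out to ψ̂a = (4/3)ψ̂. Neither is consistent, but π/3 is much closer to 1 than π/4.

```python
# Cox-Reid exponential-pairs example: closed-form profile and adjusted
# estimates of psi, compared with their probability limits pi/4 and pi/3.
import numpy as np

rng = np.random.default_rng(2)
n, psi = 200000, 1.0
lam = rng.uniform(0.5, 2.0, n)            # arbitrary nuisance values
y1 = rng.exponential(psi * lam)           # mean psi * lam_i
y2 = rng.exponential(psi / lam)           # mean psi / lam_i

psi_hat = np.mean(np.sqrt(y1 * y2))       # profile m.l.e.
psi_hat_a = (4.0 / 3.0) * psi_hat         # maximizer of the adjusted profile

assert abs(psi_hat - np.pi / 4) < 0.01    # ~0.785, not 1
assert abs(psi_hat_a - np.pi / 3) < 0.015 # ~1.047, much closer to 1
```

The π/4 limit follows from E{(y1i y2i)^{1/2}} = ψ Γ(3/2)² = ψπ/4, with the λi cancelling exactly.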
not invariant to (one-one) reparametrizations of λ; better to use
ℓa(ψ) = ℓp(ψ) − (1/2) log |jλλ(ψ, λ̂ψ)| + B(ψ)
with B(ψ) = Op(1);
this can make ℓa invariant and remove the need for an orthogonal parametrization
B(ψ) = −(1/2) log |ϕλ′(ψ, λ̂ψ) j^{ϕϕ}(ψ̂, λ̂) ϕλ′(ψ, λ̂ψ)|
with ϕ = ϕ(θ) = ℓ;V(θ; y⁰)
comes from an approximating location model: Fraser 2003, Biometrika
p-values from profile likelihood
First order: p(ψ) ≐ Φ(rp),
  rp(ψ) = ±[2{ℓp(ψ̂) − ℓp(ψ)}]^{1/2} ∼ N(0, 1)
Third order: p(ψ) ≐ Φ(r∗),
  r∗(ψ) = rp(ψ) + {1/rp(ψ)} log{Q/rp(ψ)}
where
  Q = (ν̂ − ν̂ψ) σ̂ν⁻¹ ,
  ν(θ) = eψ^T ϕ(θ) ,
  eψ = ψϕ′(θ̂ψ)/|ψϕ′(θ̂ψ)| ,
  σ̂ν² = |j(λλ)(θ̂ψ)| / |j(θθ)(θ̂)| ,
  |j(θθ)(θ̂)| = |jθθ(θ̂)| |ϕθ′(θ̂)|⁻² ,
  |j(λλ)(θ̂ψ)| = |jλλ(θ̂ψ)| |ϕλ′(θ̂ψ)|⁻² .
Fraser, Reid, Wu (1999), Biometrika
Example: log y ∼ N(µ, σ²): inference for ψ = log(EY)
[Figure: p-value functions of ψ; solid: 3rd order, dotted: 1st order]
Example: comparing two binomials
Employment of men and women at the Space Telescope Science Institute, 1998–2002 (from Science magazine, Volume 299, page 993, 14 February 2003).

        Left  Stayed  Total
Men        1      18     19
Women      5       2      7
Total      6      20     26

Y1 ∼ Bin(19, p1), Y2 ∼ Bin(7, p2)
ψ = log [p1(1 − p2) / {p2(1 − p1)}]
p-value for testing ψ = 0:
  0.00203  using normal approx to maximum likelihood estimate
  0.00028  using normal approx to rp (1st order)
  0.00048  using normal approx to r∗ (3rd order)
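The two first-order p-values in this table can be reproduced directly (a sketch; the third-order value needs the Q construction of the previous slides): maximize the binomial log-likelihood with and without the constraint ψ = 0, and compare the likelihood root with the Wald statistic for the log odds ratio.

```python
# First-order p-values for the 2x2 table: y1 = 1 of 19 men left,
# y2 = 5 of 7 women left; test psi = log odds ratio = 0.
import math

def norm_sf(z):
    return 0.5 * math.erfc(z / math.sqrt(2.0))

y1, n1, y2, n2 = 1, 19, 5, 7

def loglik(p1, p2):
    return (y1 * math.log(p1) + (n1 - y1) * math.log(1 - p1)
            + y2 * math.log(p2) + (n2 - y2) * math.log(1 - p2))

l_hat = loglik(y1 / n1, y2 / n2)          # unrestricted m.l.e.
p0 = (y1 + y2) / (n1 + n2)                # m.l.e. under psi = 0: common p
l_0 = loglik(p0, p0)

psi_hat = math.log((y1 / (n1 - y1)) / (y2 / (n2 - y2)))   # sample log odds ratio
r_p = math.copysign(math.sqrt(2.0 * (l_hat - l_0)), psi_hat)
se = math.sqrt(1 / y1 + 1 / (n1 - y1) + 1 / y2 + 1 / (n2 - y2))
wald = psi_hat / se

assert abs(norm_sf(-r_p) - 0.00028) < 3e-5   # rp (1st order)
assert abs(norm_sf(-wald) - 0.00203) < 3e-5  # normal approx to m.l.e.
```

Both statistics are negative here (men left less often), so the p-values are lower-tail normal probabilities.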
Some more technical points
In special model classes, it is possible to
eliminate nuisance parameters by either
conditioning or marginalizing. The conditional
or marginal likelihood then gives essentially
exact inference for the parameter of interest, if
this likelihood can itself be computed exactly.
The main example is the canonical parameter
of an exponential family:
f(y; ψ, λ) = exp{ψs + λ′t − c(ψ, λ) − d(y)};
f(s | t; ψ) = exp{ψs − Ct(ψ) − Dt(s)}
ℓcond(ψ) = ψs − Ct(ψ)
The adjusted log-likelihood ℓa(ψ) = ℓp(ψ) − (1/2) log |jλλ| approximates ℓcond.
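The two-binomial example is a case of this exponential-family structure: conditioning on the total number who left removes the nuisance parameter exactly, and under ψ = 0 the conditional distribution is hypergeometric. A sketch of the resulting exact conditional p-value for that table:

```python
# Exact conditional (Fisher-type) p-value for the employment table:
# under psi = 0, y2 | y1 + y2 = t is hypergeometric.
from math import comb

n1, n2 = 19, 7          # men, women
t, y2_obs = 6, 5        # total who left, women who left

def hyper_pmf(k):       # P(y2 = k | y1 + y2 = t) when psi = 0
    return comb(n2, k) * comb(n1, t - k) / comb(n1 + n2, t)

p_exact = sum(hyper_pmf(k) for k in range(y2_obs, min(n2, t) + 1))
assert 0.0017 < p_exact < 0.0018
```

For ψ ≠ 0 the same conditioning gives the noncentral hypergeometric conditional likelihood ℓcond(ψ) = ψs − Ct(ψ) of the display above.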
The 3rd order p-value approximation is particularly simple:
r∗ = ra + (1/ra) log(Q/ra)
r = ra = ±[2{ℓa(ψ̂a) − ℓa(ψ)}]^{1/2}
Q = (ψ̂a − ψ){ja(ψ̂)}^{1/2}
A similar discussion applies to the class of transformation models, using marginal approximations. Both classes are reviewed in Reid (1996).
The approximations given earlier reduce to these expressions in these special model classes.
Some References
Cox, D.R. and Reid, N. (1987). Parameter orthogonality and approximate conditional inference (with discussion). J. R. Statist. Soc. B 49, 1–39.
Fraser, D.A.S. (2003). Likelihood for component parameters. Biometrika 90, 327–339.
Fraser, D.A.S., Reid, N. and Wong, A. (2003). Inference for bounded parameters. xxx.lanl.gov/0303111.
Fraser, D.A.S., Reid, N. and Wu, J. (1999). A simple general formula for tail probabilities for frequentist and Bayesian inference. Biometrika 86, 246–264.
Reid, N. (1992). Aspects of modified profile likelihood. In Nonparametric Statistics and Related Topics, A.K.Md.E. Saleh, ed. North-Holland, Amsterdam.
Reid, N. (1996). Likelihood and higher-order approximations to tail areas: a review and annotated bibliography. Canad. J. Statist. 24, 141–166.
Reid, N. (2003). Asymptotics and the theory of inference. Ann. Statist., to appear.