The likelihood approach to statistics
Marco Cattaneo
Department of Statistics, LMU Munich
[email protected]
December 9, 2008
my research

▶ PhD with Frank Hampel at ETH Zurich (November 2002 – March 2007):
  Statistical Decisions Based Directly on the Likelihood Function

▶ Postdoc with Thomas Augustin at LMU Munich (SNSF Research Fellowship, October 2007 – March 2009):
  Decision making on the basis of a probabilistic-possibilistic hierarchical description of uncertain knowledge
the likelihood function

▶ P = {P_θ : θ ∈ Θ} is a set of probabilistic models:
  each P_θ ∈ P is a probability measure on (Ω, A)

▶ when data A ∈ A are observed, the likelihood function

      lik(θ) ∝ P_θ(A)

  describes the relative ability of the models in P to forecast the observed data (see the sketch below)

▶ the likelihood function is a central concept in statistics (both frequentist and Bayesian)
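
As a concrete illustration (a hypothetical binomial example, not one from the talk), the following Python sketch evaluates the likelihood function when 7 successes are observed in 10 trials; the printed ratio compares how well two models forecast the observed data.

```python
# Minimal sketch, assuming a binomial model with success probability theta
# and observed data A = "7 successes in 10 trials" (hypothetical numbers).
from scipy.stats import binom

n, k = 10, 7  # observed data

def lik(theta):
    # lik(theta) is proportional to P_theta(A)
    return binom.pmf(k, n, theta)

# relative ability of two models in P to forecast the observed data
print(lik(0.7) / lik(0.5))  # ~2.28: theta = 0.7 forecasts the data better
```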
maximum likelihood

▶ the maximum likelihood estimate of θ is

      θ̂_ML = arg max_{θ ∈ Θ} lik(θ)

▶ a statistical decision problem is described by a loss function
  L : Θ × D → [0, ∞],
  where L(θ, d) is the loss incurred by making decision d, according to model P_θ

▶ for instance, estimation of θ ∈ R^n with squared error:
  D = Θ ⊆ R^n and L(θ, d) = ‖θ − d‖²

▶ MLD criterion: make decision

      d_MLD = arg min_{d ∈ D} L(θ̂_ML, d)

  (see the sketch below)

▶ the MLD criterion is very often implicitly used, but disregards the model uncertainty
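
Continuing the hypothetical binomial example, a minimal grid sketch of the MLD criterion: maximize the likelihood over Θ, then minimize the loss as if θ̂_ML were the true parameter (for squared error this returns d_MLD = θ̂_ML).

```python
# Minimal sketch of the MLD criterion on a parameter grid (hypothetical
# binomial example with squared error loss).
import numpy as np
from scipy.stats import binom

n, k = 10, 7
thetas = np.linspace(0.001, 0.999, 999)      # grid approximating Theta
lik = binom.pmf(k, n, thetas)

theta_ml = thetas[np.argmax(lik)]            # maximum likelihood estimate

ds = thetas                                  # decisions D = Theta
d_mld = ds[np.argmin((theta_ml - ds) ** 2)]  # minimize L(theta_ml, d)
print(theta_ml, d_mld)                       # both ~0.7
```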
the hierarchical model

▶ Bayesian criterion with prior probability measure π: make decision

      d_π = arg min_{d ∈ D} ∫ L(θ, d) lik(θ) dπ(θ)

  (see the sketch below)

▶ the set P = {P_θ : θ ∈ Θ} of probabilistic models and the likelihood function lik on Θ are the two levels of a hierarchical model

▶ when data B ∈ A are observed, P_θ is updated to P_θ(·|B), and lik(θ) is updated to lik(θ) P_θ(B)

▶ before observing data, prior information can be described by a (subjective) prior likelihood function

▶ prior ignorance is described by a constant likelihood function
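
A minimal sketch of the Bayesian criterion under a uniform prior π on (0, 1), again for the hypothetical binomial example; the integral is approximated on the grid.

```python
# Minimal sketch of the Bayesian criterion with a uniform prior pi
# (hypothetical binomial example, squared error loss).
import numpy as np
from scipy.stats import binom

n, k = 10, 7
thetas = np.linspace(0.001, 0.999, 999)
lik = binom.pmf(k, n, thetas)

ds = thetas
# integral of L(theta, d) lik(theta) dpi(theta), approximated on the grid
risk = [np.mean((thetas - d) ** 2 * lik) for d in ds]
d_pi = ds[np.argmin(risk)]
print(d_pi)  # ~0.667, the posterior mean (k + 1) / (n + 2) of Beta(k+1, n-k+1)
```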
nonadditive measures and integrals

▶ the nonadditive measure on 2^Θ defined by

      λ(H) = sup_{θ ∈ H} lik(θ)   for all H ⊆ Θ

  is used in particular in the likelihood ratio test (see the sketch below)

▶ likelihood-based decision criterion: make decision

      d′ = arg min_{d ∈ D} ∫ L(θ, d) dλ(θ)

▶ the resulting decision functions satisfy in particular asymptotic optimality (consistency), parametrization invariance, equivariance, and asymptotic efficiency
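
A minimal sketch of the nonadditive measure λ on subsets of a parameter grid, with the likelihood normalized so that λ(Θ) = 1; λ(H₀) is then the likelihood ratio statistic for a hypothesis H₀ (hypothetical binomial example).

```python
# Minimal sketch of lambda(H) = sup_{theta in H} lik(theta) on a grid,
# normalized so that lambda(Theta) = 1 (hypothetical binomial example).
import numpy as np
from scipy.stats import binom

n, k = 10, 7
thetas = np.linspace(0.001, 0.999, 999)
lik = binom.pmf(k, n, thetas)
lik = lik / lik.max()        # relative likelihood: lambda(Theta) = 1

def lam(H):
    # H: boolean mask selecting a subset of the parameter grid
    return lik[H].max()

print(lam(thetas <= 0.5))    # likelihood ratio statistic for H0: theta <= 0.5
```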
MPL decision criterion

▶ λ is completely maxitive:

      λ(⋃_{H ∈ S} H) = sup_{H ∈ S} λ(H)   for all S ⊆ 2^Θ

▶ hence the Shilkret integral is

      ∫^S L(θ, d) dλ(θ) = sup_{θ ∈ Θ} lik(θ) L(θ, d)

▶ MPL criterion: make decision

      d_MPL = arg min_{d ∈ D} sup_{θ ∈ Θ} lik(θ) L(θ, d)

  (see the sketch below)

▶ the MPL criterion is the only likelihood-based decision criterion satisfying the sure-thing principle: if d is optimal with respect to the set P of probabilistic models, and d is optimal also with respect to P′, then d is optimal with respect to P ∪ P′
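
A minimal grid sketch of the MPL criterion for the hypothetical binomial example: for each candidate decision d, compute the Shilkret integral sup_θ lik(θ) L(θ, d) and pick the minimizing d.

```python
# Minimal sketch of the MPL criterion on a grid (hypothetical binomial
# example, squared error loss).
import numpy as np
from scipy.stats import binom

n, k = 10, 7
thetas = np.linspace(0.001, 0.999, 999)
lik = binom.pmf(k, n, thetas)
lik = lik / lik.max()        # relative likelihood

ds = thetas
# Shilkret integral of the loss for each candidate decision d
shilkret = [np.max(lik * (thetas - d) ** 2) for d in ds]
d_mpl = ds[np.argmin(shilkret)]
print(d_mpl)
```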
example: estimation of variance components

▶ estimation of the variance components in the 3 × 3 random effect one-way layout, under normality assumptions and weighted squared error loss:

      X_ij = μ + α_i + ε_ij   for all i, j ∈ {1, 2, 3}

▶ normality assumptions (simulated in the sketch below):
  α_i ∼ N(0, v_a), ε_ij ∼ N(0, v_e), all independent
  ⇒ X_ij ∼ N(μ, v_a + v_e) dependent, with μ ∈ (−∞, ∞) and v_a, v_e ∈ (0, ∞)
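
A minimal simulation sketch of this model; the values of μ, v_a, v_e are arbitrary choices for illustration.

```python
# Minimal simulation of the 3 x 3 random effects one-way layout
# (mu, va, ve are arbitrary illustrative values).
import numpy as np

rng = np.random.default_rng(0)
mu, va, ve = 1.0, 2.0, 0.5

alpha = rng.normal(0.0, np.sqrt(va), size=3)     # row effects alpha_i ~ N(0, va)
eps = rng.normal(0.0, np.sqrt(ve), size=(3, 3))  # errors eps_ij ~ N(0, ve)
X = mu + alpha[:, None] + eps                    # X_ij = mu + alpha_i + eps_ij
print(X)
```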
example: estimation of variance components

▶ estimates v̂_e and v̂_a of the variance components v_e and v_a are functions of

      SS_e = Σ_{i=1}^{3} Σ_{j=1}^{3} (x_ij − x̄_i·)²   and   SS_a = 3 Σ_{i=1}^{3} (x̄_i· − x̄_··)²,

  where

      x̄_i· = (1/3) Σ_{j=1}^{3} x_ij,   x̄_·· = (1/9) Σ_{i=1}^{3} Σ_{j=1}^{3} x_ij,

      SS_e / v_e ∼ χ²_6   and   (1/3) SS_a / (v_a + (1/3) v_e) ∼ χ²_2

  (see the sketch below)

▶ invariant loss functions:

      L(v_e, v̂_e) = 3 (v_e − v̂_e)² / v_e²   and   L(v_a, v̂_a) = (v_a − v̂_a)² / (v_a + (1/3) v_e)²
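
A minimal sketch computing SS_e and SS_a from a simulated 3 × 3 data matrix (same simulation as above), together with the classical unbiased ANOVA estimates of the variance components as a point of reference; the MPL, ML, and ReML estimates compared in the plots below require numerical optimization and are omitted here.

```python
# Minimal sketch computing SSe and SSa, plus the classical unbiased ANOVA
# estimates of ve and va (simulated data; same setup as the sketch above).
import numpy as np

rng = np.random.default_rng(0)
x = (1.0 + rng.normal(0.0, np.sqrt(2.0), size=3)[:, None]
         + rng.normal(0.0, np.sqrt(0.5), size=(3, 3)))

row_means = x.mean(axis=1)                 # x_bar_{i.}
grand_mean = x.mean()                      # x_bar_{..}

SSe = ((x - row_means[:, None]) ** 2).sum()
SSa = 3.0 * ((row_means - grand_mean) ** 2).sum()

ve_hat = SSe / 6.0                         # unbiased, since SSe / ve ~ chi^2_6
va_hat = SSa / 6.0 - ve_hat / 3.0          # unbiased, since E[SSa] = 6 (va + ve/3)
print(SSe, SSa, ve_hat, va_hat)
```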
example: estimation of variance components

[Figure: two panels plotting the estimates v̂_e/(SS_a + SS_e) and v̂_a/(SS_a + SS_e) against SS_a/(SS_a + SS_e) ∈ [0, 1], comparing the MPL estimator with the ANOVA (= ANOVA+ = MINQUE), ML, ReML (= ANOVA+), and nonnegative minimum-bias MINQUE estimators.]
example: estimation of variance components

[Figure: two panels plotting the risks 3 E[(v̂_e − v_e)²] / v_e² and E[(v̂_a − v_a)²] / (v_a + (1/3) v_e)² against v_a/(v_a + v_e) ∈ [0, 1], for the same estimators (MPL, ANOVA, ANOVA+, MINQUE, ML, ReML, nonnegative minimum-bias MINQUE).]
present and future research

▶ comparison/combination with other approaches in various applications

▶ application to robustness problems
  (relaxation of the i.i.d. assumption, robust likelihood)

▶ application to imprecise probability theory
  (better updating, possibilistic previsions, convex sets of non-normalized measures)

▶ application to graphical models
  (probabilistic and non-probabilistic aspects of uncertainty)

▶ application to financial risk measures
  (derivation and interpretation of convex risk measures)