Almost the best of three worlds - The switch model selection criterion

Almost the best of three worlds
The switch model selection criterion for single-parameter
exponential families
S.L. van der Pas
Joint work with prof. dr. P.D. Grünwald
Axioma symposium, June 4th 2014
Outline
The model selection problem
Example
Desirable properties
The AIC-BIC dilemma
The switch distribution
Optional stopping
Conclusions
The model selection problem
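Single-parameter exponential families
- A single-parameter exponential family $\{p_\theta \mid \theta \in \Theta\}$ on $\mathcal{X}$, where $\Theta \subset \mathbb{R}$, is a model such that any member $p_\theta$ can be written as
  \[ p_\theta(x) = \frac{1}{z(\theta)} e^{\theta \phi(x)} r(x). \]
- We only consider regular exponential families.
- Example (normal density with mean $\theta$ and unit variance, so that $\phi(x) = x$ and $z(\theta) = e^{\theta^2/2}$):
  \[ f(x \mid \theta) = \frac{1}{\sqrt{2\pi}} e^{-(x-\theta)^2/2} = \frac{1}{e^{\theta^2/2}} \, e^{\theta x} \, \frac{1}{\sqrt{2\pi}} e^{-x^2/2}. \]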
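The factorization in the example can be checked numerically. The sketch below is only an illustration: the function names are hypothetical, and it assumes $\phi(x) = x$, $z(\theta) = e^{\theta^2/2}$ and $r(x)$ equal to the standard normal density.

```python
import math


def normal_density(x, theta):
    """Density of N(theta, 1) at x, written in the usual form."""
    return math.exp(-((x - theta) ** 2) / 2) / math.sqrt(2 * math.pi)


def exp_family_density(x, theta):
    """Same density in exponential-family form (1 / z(theta)) * exp(theta * phi(x)) * r(x)."""
    z = math.exp(theta ** 2 / 2)                          # normalizer z(theta)
    r = math.exp(-(x ** 2) / 2) / math.sqrt(2 * math.pi)  # carrier r(x)
    return math.exp(theta * x) * r / z                    # phi(x) = x


# Compare the two forms on a small grid of (x, theta) values.
for x in (-1.5, 0.0, 0.7, 2.0):
    for theta in (-1.0, 0.0, 0.3):
        assert math.isclose(normal_density(x, theta), exp_family_density(x, theta))
```

Both functions return the same value at every grid point, which is what the identity on the slide claims.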
The model selection problem
Suppose $\{p_\mu \mid \mu \in M\}$ is a regular single-parameter exponential family in its mean-value parameterization and we have data $X_i \sim p_{\mu^*}$, $\mu^* \in M$, $i = 1, \dots, n$. Let $\mu_0 \in M$ be a constant. We wish to select one of the following two models:
\[ \mathcal{M}_0 = \{p_{\mu_0}\} \quad \text{and} \quad \mathcal{M}_1 = \{p_\mu, \mu \in M \setminus \{\mu_0\}\}. \]
Example
- Soal & Bateman, Modern Experiments in Telepathy (1954)
- 130 experiments with Mrs Stewart between 1945 and 1950
Results:
- 37,100 trials
- 9,410 'direct hits'
- Estimated probability of guessing the correct card: $\hat\theta = \frac{9{,}410}{37{,}100} \approx 0.25$.
$H_0$: Mrs Stewart does not possess telepathic powers.
- p-value: $5.8 \cdot 10^{-139}$.
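The arithmetic behind these numbers can be sketched as follows. The slides do not state the hit probability under $H_0$ or the test that was used, so the chance level of 1/5 and the one-sided exact binomial tail below are assumptions, and the function names are hypothetical. The printed value can be compared with the p-value reported above.

```python
import math

N_TRIALS = 37_100   # from the slide
N_HITS = 9_410      # from the slide
P_CHANCE = 1 / 5    # ASSUMPTION: chance level under H0, not stated on the slide


def log_binom_pmf(k, n, p):
    """Natural log of the Binomial(n, p) probability of exactly k successes."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log(1 - p)


def log10_upper_tail(k, n, p):
    """log10 of P(X >= k) for X ~ Binomial(n, p), summed in log space to avoid underflow."""
    logs = [log_binom_pmf(j, n, p) for j in range(k, n + 1)]
    top = max(logs)
    total = sum(math.exp(v - top) for v in logs)
    return (top + math.log(total)) / math.log(10)


theta_hat = N_HITS / N_TRIALS
log10_p = log10_upper_tail(N_HITS, N_TRIALS, P_CHANCE)
print(f"theta_hat = {theta_hat:.4f}")
print(f"log10(one-sided p-value) = {log10_p:.1f}")
```

Working in log space matters here: the tail probability is far below the smallest positive double-precision number, so a direct sum of probabilities would underflow to zero.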
Desirable properties
- Consistency
- Minimax-rate optimality
- Insensitivity to optional stopping
Consistency
Collection of parametric models $\mathcal{M}_k = \{p_\mu \mid \mu \in M_k\}$.
- A criterion $\delta$ is consistent if for any $\mu$ in any $M_k$:
  \[ \lim_{n \to \infty} P_\mu(\delta(X^n) = k) = 1. \]
- 'Essentially, all models are wrong, but some are useful.' (Box and Draper, 1987)
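Minimax-rate optimality
- $\hat\mu(x^n)$: estimator of $\mu$.
- Quadratic loss: $L(\mu, \hat\mu(x^n)) = (\mu - \hat\mu(x^n))^2$.
- Quadratic risk:
  \[ R(\mu, \hat\mu, n) = E_\mu\left[ (\mu - \hat\mu(X^n))^2 \right]. \]
- Quadratic risk of $\delta$, where $\hat k = \delta(X^n)$ is the selected model:
  \[ R(\mu, \delta, n) = E_\mu\left[ (\mu - \hat\mu_{\hat k}(X^n))^2 \right]. \]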
The AIC-BIC dilemma
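AIC and BIC
Collection of parametric models $\mathcal{M}_k = \{p_\mu \mid \mu \in M_k\}$.
Selected model:
\[ \arg\max_k \; 2 L(x^n \mid \hat\mu_k) - 2 \cdot (\text{number of parameters}) \cdot c_n, \]
where
\[ c_n = \begin{cases} 1 & \text{AIC} \\ \frac{1}{2} \log n & \text{BIC.} \end{cases} \]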
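AIC and BIC - example
$\mathcal{M}_0 = \{N(0, 1)\}$, $\mathcal{M}_1 = \{N(\mu, 1), \mu \in \mathbb{R}\}$. $\mathcal{M}_1$ is selected if
\[ \left| \sum_{i=1}^n x_i \right| > \sqrt{2n} \quad \text{(AIC)}, \qquad \left| \sum_{i=1}^n x_i \right| > \sqrt{n \log n} \quad \text{(BIC)}. \]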
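The two thresholds follow from the penalized criterion: for these Gaussian models, twice the log-likelihood gain of $\mathcal{M}_1$ over $\mathcal{M}_0$ equals $(\sum_i x_i)^2 / n$, and $\mathcal{M}_1$ wins when this exceeds $2 c_n$. A minimal sketch, with a hypothetical function name and a made-up data set:

```python
import math


def select_model(xs, criterion):
    """Return 0 (N(0,1)) or 1 (N(mu,1), mu free) for the given criterion.

    criterion is "AIC" (c_n = 1) or "BIC" (c_n = log(n) / 2).
    """
    n = len(xs)
    c_n = {"AIC": 1.0, "BIC": 0.5 * math.log(n)}[criterion]
    s = sum(xs)
    # 2 * (max log-likelihood under M1 - log-likelihood under M0) = s^2 / n
    gain = s * s / n
    # M1 has one free parameter, so its penalty is 2 * 1 * c_n.
    return 1 if gain > 2 * c_n else 0


# Illustrative data: n = 100 observations with sum 16 (made-up values).
xs = [0.16] * 100
print(select_model(xs, "AIC"), select_model(xs, "BIC"))
```

With these values the sum lies between $\sqrt{2n} \approx 14.1$ and $\sqrt{n \log n} \approx 21.5$, so AIC selects the complex model while BIC keeps the simple one.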
Complexity penalty
[Figure: scatter plots of data points illustrating the complexity penalty; horizontal axis from 0.0 to 1.0, vertical axis from -2 to 2.]
Desirable properties - AIC and BIC
       Consistent   Worst-case risk   Insensitive to optional stopping
AIC    no           1/n               no
BIC    yes          (log n)/n         yes
Theorem (analogue of Yang (2005))
Suppose $\{p_\mu \mid \mu \in M\}$ is a regular single-parameter exponential family in its mean-value parameterization and we have data $X_i \sim p_{\mu^*}$, $\mu^* \in M$, $i = 1, \dots, n$. Let $\mu_0 \in M$ be a constant. Consider the following models:
\[ \mathcal{M}_0 = \{p_{\mu_0}\} \quad \text{and} \quad \mathcal{M}_1 = \{p_\mu, \mu \in M \setminus \{\mu_0\}\}. \]
Let $\delta$ be a consistent model selection criterion. Then:
\[ n \sup_\mu R(\mu, \delta, n) \to \infty, \]
where $R$ is the standardized squared error risk.
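The most important observation
Let $\hat\mu^n_{\hat k}$ be the estimator of $\mu$ after selection of model $\mathcal{M}_{\hat k}$. Then
\[ R(\mu, \delta, n) = E_\mu\left[ I(\mu) (\mu - \hat\mu^n_{\hat k})^2 \right] \]
is equal to
\[ E_\mu\left[ I(\mu) (\mu - \hat\mu^n_1)^2 \, 1_{\{\delta(X^n) = 1\}} \right] + I(\mu) (\mu - \mu_0)^2 \, P_\mu(\delta(X^n) = 0). \]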
The switch distribution
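Van Erven, Grünwald and De Rooij (2012).
Example
Data from a $N(\mu^*, 1)$-distribution, $(\mu^*)^2 = \frac{1}{100}$. Candidate models:
\[ \mathcal{M}_0 = \{N(0, 1)\} \quad \text{vs.} \quad \mathcal{M}_1 = \{N(\mu, 1), \mu \in \mathbb{R}\} \]
Simple model:
\[ E_{\mu^*}\left[ (\mu^* - 0)^2 \right] = \frac{1}{100} \]
Complex model:
\[ E_{\mu^*}\left[ (\mu^* - \hat\mu_n)^2 \right] = \frac{1}{n}. \]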
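The point of this example is that the better model depends on the sample size: the simple model has the smaller risk while $n < 100$, and the complex model takes over afterwards. A minimal sketch of that comparison, using exact fractions; the function names are hypothetical and the risks are taken directly from the two displays above.

```python
from fractions import Fraction

MU_STAR_SQ = Fraction(1, 100)  # (mu*)^2 from the example


def risk_simple():
    """Risk of always estimating mu by 0: (mu* - 0)^2."""
    return MU_STAR_SQ


def risk_complex(n):
    """Risk of the estimator within the complex model after n observations: 1/n."""
    return Fraction(1, n)


def better_model(n):
    """Index of the model with the smaller risk at sample size n (ties go to model 1)."""
    return 0 if risk_simple() < risk_complex(n) else 1


for n in (10, 99, 100, 101, 1000):
    print(n, better_model(n))
```

A procedure that predicts with the simple model first and switches to the complex one at a suitable sample size can therefore do better than either model used throughout.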
Frequentist Bayes
prior $\pi(\theta)$ -> Bayes' formula -> posterior $\pi(\theta \mid \text{data})$
Example: coin toss, $\theta$ = probability of heads.
- Prior: $\theta \sim \text{Unif}[0, 1]$.
- Data: $n$ coin tosses, $h$ heads, $t$ tails.
- Posterior: $\theta \mid \text{data} \sim \text{Beta}(h + 1, t + 1)$.
- Estimator: $E[\theta \mid \text{data}] = \frac{h + 1}{n + 2}$.
Frequentist Bayes: assume $\theta_0$ generates the data.
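The coin-toss estimator doubles as a one-step prediction: under the uniform prior, $(h + 1)/(n + 2)$ is also the posterior predictive probability that the next toss is heads. Multiplying these one-step predictions along a sequence gives the Bayesian marginal likelihood of that sequence, which for a uniform prior has the closed form $h! \, t! / (n + 1)!$. The sketch below checks this with exact fractions; the function names and the toss sequence are illustrative assumptions.

```python
import math
from fractions import Fraction


def posterior_mean(h, t):
    """E[theta | data] under a Unif[0,1] prior after h heads and t tails."""
    return Fraction(h + 1, h + t + 2)


def sequential_marginal(tosses):
    """Probability of a toss sequence ('H'/'T') as a product of one-step predictions."""
    h = t = 0
    prob = Fraction(1)
    for x in tosses:
        p_heads = posterior_mean(h, t)  # predictive probability of heads
        prob *= p_heads if x == "H" else 1 - p_heads
        if x == "H":
            h += 1
        else:
            t += 1
    return prob


def closed_form_marginal(h, t):
    """Integral of theta^h (1 - theta)^t over [0, 1]: h! t! / (h + t + 1)!."""
    return Fraction(math.factorial(h) * math.factorial(t), math.factorial(h + t + 1))


print(posterior_mean(7, 3))
print(sequential_marginal("HHTH"), closed_form_marginal(3, 1))
```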
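The switch distribution
Candidate models:
\[ \mathcal{M}_0 = \{p_{\mu_0}\} \quad \text{vs.} \quad \mathcal{M}_1 = \{p_\mu \mid \mu \in M \setminus \{\mu_0\}\} \]
Associated prediction strategies:
\[ p_0(x_n \mid x^{n-1}) = p_{\mu_0}(x_n \mid x^{n-1}), \qquad p_1(x_n \mid x^{n-1}) = \frac{p_B(x^n)}{p_B(x^{n-1})}, \]
where
\[ p_B(x^n) = \int_M p_\mu(x^n) \, \omega(\mu) \, d\mu. \]
Prediction strategy that switches after $j$ observations:
\[ \bar p_j(x_n \mid x^{n-1}) = \begin{cases} p_{\mu_0}(x_n \mid x^{n-1}) & n \le j \\ p_1(x_n \mid x^{n-1}) & n > j. \end{cases} \]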
Switching allowed after $2^0, 2^1, \dots, 2^{\lfloor \log_2 (n-1) \rfloor}$ observations.
Prior on strategies that switch after $2^i$ observations, for example:
\[ \pi(i) = \frac{1}{(i + 2)(i + 3)}. \]
Prior on strategies predicting exclusively with simple or complex model: $\frac{1}{4}$ each.
Yields posterior on the strategies. Let $K_{n+1}(s)$ be the model used by strategy $s$ after $n$ observations. Then:
\[ \delta_{sw}(x^n) = \arg\max_k \, p_{sw}(K_{n+1} = k \mid x^n). \]
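The example prior is a proper one: the terms $1/((i+2)(i+3))$ telescope, so the switching strategies receive total mass $1/2$, which together with $1/4$ for each of the two non-switching strategies adds up to one. A small sketch of this bookkeeping, with hypothetical function names:

```python
from fractions import Fraction


def prior_switch(i):
    """Example prior mass on the strategy that switches after 2**i observations."""
    return Fraction(1, (i + 2) * (i + 3))


def partial_mass(m):
    """Total prior mass on the strategies with switch index 0, ..., m."""
    return sum((prior_switch(i) for i in range(m + 1)), Fraction(0))


def switch_points(n):
    """Allowed switch times 2**0, ..., 2**floor(log2(n - 1)) for n >= 2."""
    top = (n - 1).bit_length() - 1  # exact floor(log2(n - 1))
    return [2 ** i for i in range(top + 1)]


# Telescoping sum: partial_mass(m) = 1/2 - 1/(m + 3), which tends to 1/2.
print(partial_mass(10), Fraction(1, 2) - Fraction(1, 13))
print(switch_points(20))
```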
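The criterion $\delta_{sw}$
\[ \delta_{sw}(x^n) = \begin{cases} 1 & \text{if } \dfrac{p_{sw,1}(x^n)}{p_{sw,0}(x^n)} > 1 \\[1ex] 0 & \text{if } \dfrac{p_{sw,1}(x^n)}{p_{sw,0}(x^n)} \le 1, \end{cases} \]
where
\[ p_{sw,1}(x^n) = \sum_{i=0}^{\lfloor \log_2 n \rfloor} \pi(i) \, \bar p_{2^i}(x^n) + \frac{1}{4} \, p_B(x^n), \]
\[ p_{sw,0}(x^n) = \left( \frac{3}{4} - \sum_{i=0}^{\lfloor \log_2 n \rfloor} \pi(i) \right) p_{\mu_0}(x^n). \]
$\delta_{sw}$ is consistent.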
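To make the formulas concrete, here is a minimal sketch of the criterion for the Gaussian example $\mathcal{M}_0 = \{N(0, 1)\}$ versus $\mathcal{M}_1 = \{N(\mu, 1)\}$. Several choices are assumptions made for illustration and are not taken from the slides: the prior $\omega$ on $\mu$ is $N(0, \tau^2)$ (chosen because the marginal $p_B$ then has a closed form), the switch prior is the example $\pi(i) = 1/((i+2)(i+3))$, and all function names are hypothetical. It uses $\bar p_j(x^n) = p_{\mu_0}(x^j) \, p_B(x^n) / p_B(x^j)$ for $j \le n$, which follows from multiplying the one-step predictions.

```python
import math


def log_p0(xs):
    """Log-likelihood of xs under N(0, 1)."""
    return -0.5 * len(xs) * math.log(2 * math.pi) - 0.5 * sum(x * x for x in xs)


def log_pb(xs, tau2=1.0):
    """Log Bayesian marginal likelihood under N(mu, 1) with mu ~ N(0, tau2).

    ASSUMPTION: the normal prior is an illustrative choice of omega.
    """
    n, s = len(xs), sum(xs)
    return (log_p0(xs) - 0.5 * math.log(1 + n * tau2)
            + tau2 * s * s / (2 * (1 + n * tau2)))


def prior_pi(i):
    """Example prior on the strategy that switches after 2**i observations."""
    return 1.0 / ((i + 2) * (i + 3))


def logsumexp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def delta_sw(xs, tau2=1.0):
    """Return 1 if p_sw,1(x^n) > p_sw,0(x^n), else 0."""
    n = len(xs)
    top = n.bit_length() - 1  # exact floor(log2(n)), requires n >= 1
    log_pb_n = log_pb(xs, tau2)
    # Strategy that always predicts with the complex model: weight 1/4.
    terms = [math.log(0.25) + log_pb_n]
    for i in range(top + 1):
        head = xs[:2 ** i]
        # p-bar_j(x^n) = p_0(x^j) * p_B(x^n) / p_B(x^j) with j = 2**i
        log_pbar = log_p0(head) + log_pb_n - log_pb(head, tau2)
        terms.append(math.log(prior_pi(i)) + log_pbar)
    log_sw1 = logsumexp(terms)
    # Strategies still using the simple model: weight 3/4 minus the switched mass.
    mass0 = 0.75 - sum(prior_pi(i) for i in range(top + 1))
    log_sw0 = math.log(mass0) + log_p0(xs)
    return 1 if log_sw1 > log_sw0 else 0


print(delta_sw([0.0] * 64), delta_sw([1.0] * 64))
```

All quantities are kept on the log scale, since the likelihoods themselves become extremely small for moderate $n$.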
Main theorem
Let $\{p_\mu \mid \mu \in M\}$ be a regular single-parameter exponential family in its mean-value parameterization. Let $\mu_0 \in M$ be a constant. Consider the following models:
\[ \mathcal{M}_0 = \{p_{\mu_0}\} \quad \text{and} \quad \mathcal{M}_1 = \{p_\mu, \mu \in M \setminus \{\mu_0\}\}. \]
When $\delta_{sw}$ is used with
(i) a prior $\pi(i)$ on the strategies that switch after $2^i$ observations such that $\pi(i) \propto i^{-k}$ for some $k \ge 2$;
(ii) a predictive distribution based on the Bayesian marginal likelihood with a prior $\omega$ that admits a strictly positive, continuous density;
and the parameter $\mu$ is estimated within $\mathcal{M}_1$ by an efficient estimator $\hat\mu_1$, then the worst-case standardized quadratic risk is of order $\log \log n / n$.
Simulations - quadratic loss
$X_i \sim N(1/10, 1)$, $i = 1, \dots, n$.
[Figure: average instantaneous quadratic loss as a function of $n$ (0 to 4500) for BIC, Switch and AIC, together with the minimum.]
Simulations - consistency
Simple model $\mathcal{M}_0$ true: $X_i \sim N(0, 1)$, $i = 1, \dots, n$.
[Figure: average selected model index (0.0 to 1.0) as a function of $n$ (0 to 2500) for AIC, Switch and BIC.]
Simulations - consistency
Complex model $\mathcal{M}_1$ true: $X_i \sim N(\mu, 1)$, $i = 1, \dots, n$.
[Figure: average selected model index (0.0 to 1.0) as a function of $\mu$ (0.00 to 0.30) for AIC, Switch and BIC.]
Optional stopping
“A psychologist who received a special grant for the purpose might perform a
limited number of experiments with impunity so long as the investigation
appeared to prove that telepathy did not happen. But if an academic man
showed any enthusiasm and a tendency to go on in the face of
discouragement, he would soon be frowned upon and accused of wasting his
time. His sanity might even be doubted.” (Soal and Bateman, p.23)
A more conservative version of $\delta_{sw}$ is insensitive to optional stopping:
\[ \delta_{sw}(x^n) = \begin{cases} 1 & \text{if } \dfrac{p_{sw,1}(x^n)}{p_{sw,0}(x^n)} > 4k - 1 \\[1ex] 0 & \text{if } \dfrac{p_{sw,1}(x^n)}{p_{sw,0}(x^n)} \le 4k - 1, \end{cases} \]
where $k > 1$.
For any $n$ and any $k > 1$, by a version of Doob's martingale inequality (see Shafer et al. (2011)):
\[ P_{\mu_0}\left( \exists n : \delta_{sw}(X^n) = 1 \right) \le \frac{1}{k}. \]
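One way to see where the threshold $4k - 1$ comes from is sketched below. This is a reconstruction under the prior weights given earlier (mass $1/4$ on each non-switching strategy, so that the weight $w_n$ on the strategies still using the simple model stays above $1/4$), and not necessarily the authors' own argument. It writes $p_{sw} = p_{sw,1} + p_{sw,0}$ for the full switch marginal and uses that $p_{sw}(X^n) / p_{\mu_0}(X^n)$ is a nonnegative martingale under $P_{\mu_0}$ with starting value 1.

```latex
% Sketch: why the threshold 4k - 1 yields the bound 1/k (assumes the prior weights from the slides).
\begin{align*}
p_{sw,0}(x^n) &= w_n \, p_{\mu_0}(x^n), \qquad
w_n = \tfrac{3}{4} - \sum_{i=0}^{\lfloor \log_2 n \rfloor} \pi(i) > \tfrac{1}{4}, \\
\frac{p_{sw,1}(x^n)}{p_{sw,0}(x^n)} > 4k - 1
&\;\Longrightarrow\; \frac{p_{sw}(x^n)}{p_{sw,0}(x^n)} > 4k
\;\Longrightarrow\; \frac{p_{sw}(x^n)}{p_{\mu_0}(x^n)} > 4k \, w_n > k, \\
P_{\mu_0}\bigl( \exists n : \delta_{sw}(X^n) = 1 \bigr)
&\le P_{\mu_0}\Bigl( \exists n : \frac{p_{sw}(X^n)}{p_{\mu_0}(X^n)} > k \Bigr) \le \frac{1}{k}.
\end{align*}
```

The last step is the martingale inequality cited above, applied to the likelihood ratio of the switch marginal against $p_{\mu_0}$.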
Overview
          Consistent   Worst-case risk    Insensitive to optional stopping
AIC       no           1/n                no
BIC       yes          (log n)/n          yes
Switch    yes          (log log n)/n      yes
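To get a feeling for how close $\log \log n / n$ is to the optimal $1/n$, the three rates in the table can be evaluated for a few sample sizes. The sketch below ignores constants, which the table does not specify, so only the growth of the factors $\log n$ and $\log \log n$ is meaningful; the function name is hypothetical.

```python
import math


def worst_case_rates(n):
    """Orders of the worst-case risk from the overview table, constants ignored."""
    return {
        "AIC": 1 / n,
        "BIC": math.log(n) / n,
        "Switch": math.log(math.log(n)) / n,
    }


for n in (10 ** 2, 10 ** 4, 10 ** 6):
    rates = worst_case_rates(n)
    # Multiply by n to show the extra factor relative to the 1/n rate.
    print(n, {name: round(n * r, 2) for name, r in rates.items()})
```

The extra factor for the switch criterion grows far more slowly than the one for BIC, which is the sense in which its risk is near optimal.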
Conclusions
- Three desirability criteria: consistency, minimax-rate optimality, insensitivity to the stopping rule.
- A choice is inevitable.
- $\delta_{sw}$ combines consistency and insensitivity to optional stopping with near-optimal worst-case instantaneous risk.