CS540 Machine learning Lecture 11 Decision theory, model selection

Outline
• Summary so far
• Loss functions
• Bayesian decision theory
• ROC curves
• Bayesian model selection
• Frequentist decision theory
• Frequentist model selection
Models vs algorithms
P(x|θ), scalar x
| Likelihood | Prior | Posterior | Algorithm |
| Bernoulli | None | MLE | Exact §?? |
| Bernoulli | Beta | Beta | Exact §?? |
| Gauss | None | MLE | Exact §?? |
| Gauss | Gauss | Gauss | Exact §?? |
| Gauss | NIG | NIG | Exact |
| Student T | None | MLE | EM §?? |
| Beta | NA | NA | NA §?? |
| Gamma | NA | NA | NA §?? |

P(x|θ), vector x
| Likelihood | Prior | Posterior | Algorithm |
| MVN | None | MLE | Exact §?? |
| MVN | MVN | MVN | Exact |
| MVN | MVNIW | MVNIW | Exact |
| Multinomial | None | MLE | Exact §?? |
| Multinomial | Dirichlet | Dirichlet | Exact §?? |
| Dirichlet | NA | NA | NA §?? |
| Wishart | NA | NA | NA §?? |

P(x,y|θ)
| Likelihood | Prior | Posterior | Algorithm |
| GaussClassif | None | MLE | Exact §?? |
| GaussClassif | MVNIW | MVNIW | Exact |
| NB binary | None | MLE | Exact §?? |
| NB binary | Beta | Beta | Exact §?? |
| NB Gauss | None | MLE | Exact §?? |
| NB Gauss | NIG | NIG | Exact §?? |

P(y|x,θ)
| Likelihood | Prior | Posterior | Algorithm |
| Linear regression | None | MLE | QR §??, SVD §??, LMS |
| Linear regression | L2 | MAP | QR §??, SVD §?? |
| Linear regression | L1 | MAP | QP §??, CoordDesc §?? |
| Linear regression | MVN | MVN | QR/Cholesky §?? |
| Linear regression | MVNIG | MVNIG | QR/Cholesky §?? |
| Logistic regression | None | MLE | IRLS §??, perceptron §?? |
| Logistic regression | L2 | MAP | Newton §??, BoundOpt §?? |
| Logistic regression | L1 | MAP | BoundOpt §?? |
| Logistic regression | MVN | LaplaceApprox | Newton §?? |
| GP regression | MVN | MVN | Exact |
| GP classification | MVN | LaplaceApprox | - |
From beliefs to actions
• We have discussed how to compute p(y|x), where y represents the unknown state of nature (e.g. does the patient have lung cancer, breast cancer, or no cancer), and x are some observable features (e.g. symptoms)
• We now discuss: what action a should we take (e.g. surgery or no surgery) given our beliefs?
Loss functions
• Define a loss function L(θ,a), where θ = true (unknown) state of nature and a = action

|            | No cancer | Lung cancer | Breast cancer |
| Surgery    | 20        | 10          | 10            |
| No surgery | 0         | 50          | 60            |

0-1 loss
|       | y=1 | y=0 |
| ŷ = 1 | 0   | 1   |
| ŷ = 0 | 1   | 0   |

Asymmetric costs
|       | y=1  | y=0  |
| ŷ = 1 | 0    | L_FP |
| ŷ = 0 | L_FN | 0    |

Hypothesis tests (utility = negative loss)
|        | H0 true | H1 true |
| Accept | 0       | L_II    |
| Reject | L_I     | 0       |
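Given a loss matrix like the cancer table above, the optimal action minimizes the posterior expected loss. A minimal sketch (the posterior vector is made up for illustration):

```python
# Sketch: pick the action that minimizes posterior expected loss,
# using the cancer loss matrix from the slide.
import numpy as np

# Rows = actions (surgery, no surgery); columns = states
# (no cancer, lung cancer, breast cancer).
L = np.array([[20.0, 10.0, 10.0],
              [ 0.0, 50.0, 60.0]])

def bayes_action(posterior, loss):
    """Return the index of the action minimizing expected loss, plus the losses."""
    rho = loss @ posterior          # expected loss of each action
    return int(np.argmin(rho)), rho

# A hypothetical patient whose symptoms make cancer fairly likely:
a, rho = bayes_action(np.array([0.5, 0.3, 0.2]), L)
# rho = [15.0, 27.0], so surgery (action 0) is chosen.
```

Even though "no cancer" is the single most probable state here, the asymmetric losses make surgery the optimal action.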
More loss functions
• Regression: L(y, ŷ) = (y − ŷ)²
• Parameter estimation: L(θ, θ̂) = (θ − θ̂)²
• Density estimation:
  L_KL(p, q) = Σ_j p(j) log [p(j) / q(j)]
  L(θ, θ̂) = KL(p(·|θ) || p(·|θ̂)) = ∫ p(y|θ) log [p(y|θ) / p(y|θ̂)] dy
Robust loss functions
• Squared error (L2) is sensitive to outliers
• It is common to use L1 instead.
• In general, Lp loss is defined as
L_p(y, ŷ) = |y − ŷ|^p
Outline
• Loss functions
• Bayesian decision theory
• Bayesian model selection
• Frequentist decision theory
• Frequentist model selection
Optimal policy
• Minimize posterior expected loss:
  ρ(a|x, π) ≜ E_{θ|π,x}[L(θ, a)] = ∫_Θ L(θ, a) p(θ|x) dθ
• Bayes estimator:
  δ_π(x) = argmin_{a∈A} ρ(a|x, π)
L2 loss
• Optimal action is the posterior mean:
  L(θ, a) = (θ − a)²
  ρ(a|x) = E_{θ|x}[(θ − a)²] = E[θ²|x] − 2a E[θ|x] + a²
  ∂ρ(a|x)/∂a = −2 E[θ|x] + 2a = 0
  a = E[θ|x] = ∫ θ p(θ|x) dθ
  ŷ(x, D) = E[y|x, D]
Minimizing robust loss functions
• For L2 loss, use the mean of p(y|x)
• For L1 loss, use the median of p(y|x)
• For L0 (0-1) loss, use the mode of p(y|x)
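These facts can be checked numerically: with samples from a skewed distribution standing in for p(y|x), the action minimizing average squared error is the mean and the one minimizing average absolute error is the median (a sketch with made-up data):

```python
# Sketch: verify that the mean minimizes L2 loss and the median
# minimizes L1 loss, on samples from a skewed "posterior" p(y|x).
import numpy as np

rng = np.random.default_rng(0)
y = rng.gamma(shape=2.0, scale=1.0, size=100_000)  # right-skewed samples

grid = np.linspace(0.0, 10.0, 2001)
l2_risk = [np.mean((y - a) ** 2) for a in grid]   # expected L2 loss of action a
l1_risk = [np.mean(np.abs(y - a)) for a in grid]  # expected L1 loss of action a

a_l2 = grid[np.argmin(l2_risk)]
a_l1 = grid[np.argmin(l1_risk)]

# L2 minimizer ≈ mean, L1 minimizer ≈ median; for a right-skewed
# distribution the median sits below the mean.
assert abs(a_l2 - y.mean()) < 0.05
assert abs(a_l1 - np.median(y)) < 0.05
```

For a Gamma(2,1) "posterior" the mean is 2 while the median is about 1.68, so the two loss functions genuinely recommend different actions.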
0-1 loss
• Optimal action is the most probable class:
  L(θ, a) = 1 − δ_θ(a)
  ρ(a|x) = ∫ p(θ|x) dθ − ∫ p(θ|x) δ_θ(a) dθ = 1 − p(a|x)
  a*(x) = argmax_{a∈A} p(a|x)
  ŷ(x, D) = argmax_{y∈1:C} p(y|x, D)
Binary classification problems
• Let Y=1 be ‘positive’ (e.g. cancer present) and Y=2 be ‘negative’ (e.g. cancer absent).
• The loss/cost matrix has 4 numbers:

| action \ state | y=1                 | y=2                 |
| ŷ = 1          | True positive, λ11  | False positive, λ12 |
| ŷ = 2          | False negative, λ21 | True negative, λ22  |
Optimal strategy for binary classification
• We should pick class/label/action 1 if ρ(α2|x) > ρ(α1|x):
  λ21 p(Y=1|x) + λ22 p(Y=2|x) > λ11 p(Y=1|x) + λ12 p(Y=2|x)
  (λ21 − λ11) p(Y=1|x) > (λ12 − λ22) p(Y=2|x)
  p(Y=1|x) / p(Y=2|x) > (λ12 − λ22) / (λ21 − λ11)
  where we have assumed λ21 (FN) > λ11 (TP)
• As we vary our loss function, we simply change the optimal threshold θ on the decision rule:
  δ(x) = 1 iff p(Y=1|x) / p(Y=2|x) > θ
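The threshold rule above is easy to sketch in code; the λ values below are illustrative (a false negative costed 10x a false positive), not from the slides:

```python
# Sketch: the loss matrix determines the threshold on the posterior odds.
lam11, lam12 = 0.0, 1.0   # TP, FP costs
lam21, lam22 = 10.0, 0.0  # FN, TN costs (missing a positive is 10x worse)

def decide(p1):
    """Return 1 iff p(Y=1|x)/p(Y=2|x) exceeds (λ12-λ22)/(λ21-λ11), else 2."""
    p2 = 1.0 - p1
    threshold = (lam12 - lam22) / (lam21 - lam11)   # = 0.1 here
    return 1 if p1 / p2 > threshold else 2

# With FN ten times costlier than FP, even p(Y=1|x)=0.2 triggers action 1:
assert decide(0.2) == 1    # odds 0.25 > threshold 0.1
assert decide(0.05) == 2   # odds ~0.053 < 0.1
```

Making the FN cost larger pushes the odds threshold toward zero, i.e. we declare "positive" more readily.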
Definitions
• Declare x_n to be a positive if p(y=1|x_n) > θ, otherwise declare it to be a negative (y=2):
  ŷ_n = 1 ⟺ p(y=1|x_n) > θ
• Define the number of true positives as
  TP = Σ_n I(ŷ_n = 1 ∧ y_n = 1)
• Similarly for FP, TN, FN – all functions of θ
Performance measures
Truth along the rows, estimate along the columns:

|     | ŷ = 1        | ŷ = 0        | Σ                     |
| y=1 | TP           | FN           | P = TP + FN           |
| y=0 | FP           | TN           | N = FP + TN           |
| Σ   | P̂ = TP + FP | N̂ = FN + TN | n = TP + FP + FN + TN |

Normalize along rows to get p(ŷ|y):
|     | ŷ = 1                             | ŷ = 0                    |
| y=1 | TP/P = TPR = sensitivity = recall | FN/P = FNR               |
| y=0 | FP/N = FPR                        | TN/N = TNR = specificity |

Normalize along columns to get p(y|ŷ):
|     | ŷ = 1                   | ŷ = 0         |
| y=1 | TP/P̂ = precision = PPV | FN/N̂         |
| y=0 | FP/P̂ = FDP             | TN/N̂ = NPV   |
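The counts and rates above follow mechanically from the labels and a threshold; a sketch with made-up posterior probabilities:

```python
# Sketch: computing the confusion-table quantities for a threshold θ
# on p(y=1|x_n); the probabilities and labels are illustrative.
import numpy as np

p1 = np.array([0.9, 0.8, 0.6, 0.4, 0.3, 0.1])  # p(y=1|x_n)
y  = np.array([1,   1,   0,   1,   0,   0  ])  # true labels (1 = positive)
theta = 0.5

yhat = (p1 > theta).astype(int)
TP = np.sum((yhat == 1) & (y == 1))
FP = np.sum((yhat == 1) & (y == 0))
FN = np.sum((yhat == 0) & (y == 1))
TN = np.sum((yhat == 0) & (y == 0))

TPR = TP / (TP + FN)          # sensitivity = recall
FPR = FP / (FP + TN)
precision = TP / (TP + FP)    # PPV
```

Here TP=2, FP=1, FN=1, TN=2, so TPR = precision = 2/3 and FPR = 1/3; all of these change as θ moves.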
ROC curves
• The optimal threshold for a binary detection problem depends on the loss function:
  δ(x) = 1 ⟺ p(Y=1|x) / p(Y=2|x) > (λ12 − λ22) / (λ21 − λ11)
• A low threshold gives rise to many false positives, and a high threshold to many false negatives.
• A receiver operating characteristic (ROC) curve plots the true positive rate vs the false positive rate as we vary θ.
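Tracing an ROC curve only requires sweeping the threshold and recording (FPR, TPR) pairs; a self-contained sketch on the same kind of toy scores (no plotting):

```python
# Sketch: trace an ROC curve by sweeping the threshold θ over the
# posterior probabilities; data are illustrative.
import numpy as np

p1 = np.array([0.9, 0.8, 0.6, 0.4, 0.3, 0.1])  # p(y=1|x_n)
y  = np.array([1,   1,   0,   1,   0,   0  ])

def roc_points(p1, y):
    pts = []
    for theta in np.unique(np.concatenate(([0.0], p1, [1.0]))):
        yhat = p1 > theta
        tpr = np.sum(yhat & (y == 1)) / np.sum(y == 1)
        fpr = np.sum(yhat & (y == 0)) / np.sum(y == 0)
        pts.append((fpr, tpr))
    return sorted(pts)

pts = roc_points(p1, y)
# Sweeping θ from 1 down to 0 moves the operating point from (0,0)
# (declare everything negative) to (1,1) (declare everything positive).
```

The AUC mentioned on the next slide is just the area under this set of points.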
Reducing ROC curve to 1 number
• EER – equal error rate (the operating point where FPR = FNR)
• AUC - Area under curve
Precision-recall curves
• Useful when notion of “negative” (and hence FPR)
is not defined
• Used to evaluate retrieval engines
• Recall = of those that exist, how many did you find?
• Precision = of those that you found, how many are correct?
• The F-score is the harmonic mean of precision and recall:
  F = 2 / (1/P + 1/R) = 2PR / (P + R)
Reject option
• Suppose we can choose between incurring loss λs if we make a misclassification (label substitution) error, and loss λr if we declare the action “don’t know”:

  λ(α_i | Y = j) = 0    if i = j and i, j ∈ {1, …, C}
                   λr   if i = C + 1
                   λs   otherwise

• In HW5, you will show that the optimal action is to pick “don’t know” if the probability of the most probable class is below the threshold 1 − λr/λs
Bishop 1.26
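The resulting rule is only a few lines; a sketch (the function name and the example λ values are mine):

```python
# Sketch: reject-option rule — answer only when the most probable class
# exceeds 1 - λr/λs, otherwise say "don't know".
import numpy as np

def classify_with_reject(posterior, lam_r, lam_s):
    """posterior: p(Y=j|x) over C classes. Returns a class index or 'reject'."""
    j = int(np.argmax(posterior))
    if posterior[j] > 1.0 - lam_r / lam_s:
        return j
    return "reject"

# With λr/λs = 0.3 we answer only when the max posterior exceeds 0.7:
assert classify_with_reject(np.array([0.9, 0.1]), 3.0, 10.0) == 0
assert classify_with_reject(np.array([0.6, 0.4]), 3.0, 10.0) == "reject"
```

As λr approaches λs the threshold drops to 0 and we never reject; as λr approaches 0 we reject everything that is not certain.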
Discriminant functions
• The optimal strategy π(x) partitions X into decision
regions Ri, defined by discriminant functions gi(x)
  π(x) = argmax_i g_i(x)
  R_i = {x : g_i(x) = max_k g_k(x)}
In general,
  g_i(x) = −R(a = i|x)
But for 0-1 loss we have
  g_i(x) = p(Y = i|x)
or, equivalently,
  g_i(x) = log p(Y = i|x) = log p(x|Y = i) + log p(Y = i)
Class prior merely shifts decision boundary by a constant
Binary discriminant functions
• In the 2 class case, we define the discriminant in
terms of the log-odds ratio
  g(x) = g1(x) − g2(x)
       = log p(Y = 1|x) − log p(Y = 2|x)
       = log [p(Y = 1|x) / p(Y = 2|x)]
Outline
• Loss functions
• Bayesian decision theory
• Bayesian model selection
• Frequentist decision theory
• Frequentist model selection
Bayesian model selection
• 0-1 loss:
  L(m, m̂) = I(m ≠ m̂)
  m* = argmax_{m∈M} p(m|D)
• KL loss:
  L(p*, m) = KL(p*(y|x) || p(y|m, x, D))
  ρ(m|x) = E[KL(p*, p_m)] = E[p* log p* − p* log p_m]
  Taking the expectation under the model posterior and defining the Bayesian model average p̄ = Σ_{m∈M} p_m p(m|D), the optimal single model is
  m* = argmin_{m∈M} KL(p̄, p_m)
Posterior over models
• Key quantity
  p(m|D) = p(D|m) p(m) / Σ_{m′∈M} p(D|m′) p(m′)
• Marginal / integrated likelihood:
  p(D|m) = ∫ p(D|m, θ) p(θ|m) dθ
Example: is the coin biased?
• Model M0: θ = 0.5
  p(D|m0) = (1/2)^n
• Model M1: θ could be any value in [0,1] (includes 0.5, but with negligible probability):
  p(D|m1) = ∫ p(D|θ) p(θ) dθ = ∫ [∏_{i=1}^n Ber(x_i|θ)] Beta(θ|α1, α0) dθ
Computing the marginal likelihood
• For the Beta-Bernoulli model, we know the posterior is Beta(θ|α1′, α0′), so
  p(θ|D) = p(θ) p(D|θ) / p(D)
         = (1/p(D)) (1/B(α1, α0)) θ^{α1−1} (1−θ)^{α0−1} θ^{N1} (1−θ)^{N0}
         = (1/p(D)) (1/B(α1, α0)) θ^{α1+N1−1} (1−θ)^{α0+N0−1}
         = (1/B(α1′, α0′)) θ^{α1′−1} (1−θ)^{α0′−1}
• Matching the normalization constants, (1/p(D)) (1/B(α1, α0)) = 1/B(α1′, α0′), hence
  p(D) = B(α1′, α0′) / B(α1, α0)
  where α1′ = α1 + N1 and α0′ = α0 + N0
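This closed form is easy to evaluate in log space with `math.lgamma` (the helper names are mine):

```python
# Sketch: p(D) = B(α1+N1, α0+N0)/B(α1, α0) for the Beta-Bernoulli model,
# computed in log space to avoid overflow for large counts.
from math import lgamma, exp

def log_beta(a, b):
    """log B(a, b) = log Γ(a) + log Γ(b) - log Γ(a+b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marglik(N1, N0, a1=1.0, a0=1.0):
    """log p(D) for N1 heads, N0 tails under a Beta(a1, a0) prior."""
    return log_beta(a1 + N1, a0 + N0) - log_beta(a1, a0)

# Sanity check with a uniform prior: a single flip has p(D) = 1/2
# whichever way it lands.
assert abs(exp(log_marglik(1, 0)) - 0.5) < 1e-12
assert abs(exp(log_marglik(0, 1)) - 0.5) < 1e-12
```

For a uniform prior, p(D) = N1! N0! / (N+1)!, which this reproduces (e.g. 2 heads and 1 tail gives 1/12).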
ML for Dirichlet-multinomial model
• The normalization constant is
  Z_Dir(α) = [∏_{i=1}^K Γ(α_i)] / Γ(Σ_{i=1}^K α_i)
• Hence the marginal likelihood is
  p(D) = Z_Dir(N + α) / Z_Dir(α) = [Γ(Σ_k α_k) / Γ(N + Σ_k α_k)] ∏_k [Γ(N_k + α_k) / Γ(α_k)]
ML for biased coin
• (Figure: p(D|M1) for α0 = α1 = 1, compared with p(D|M0) for θ = 0.5.)
• If nheads = 2 or 3, M1 is less likely than M0
Bayes factors
  BF(Mi, Mj) = p(D|Mi) / p(D|Mj) = [p(Mi|D) / p(Mj|D)] / [p(Mi) / p(Mj)]
  BF(M1, M0) = B(α1 + N1, α0 + N0) / [B(α1, α0) 0.5^N]
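A sketch of BF(1,0) for the biased-coin example with α0 = α1 = 1, reusing the log-space Beta function (helper names are mine):

```python
# Sketch: Bayes factor BF(1,0) = p(D|M1)/p(D|M0) for the coin example,
# with a uniform Beta(1,1) prior under M1 and θ = 0.5 under M0.
from math import lgamma, exp, log

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def bayes_factor_10(N1, N0, a1=1.0, a0=1.0):
    n = N1 + N0
    log_m1 = log_beta(a1 + N1, a0 + N0) - log_beta(a1, a0)
    log_m0 = n * log(0.5)
    return exp(log_m1 - log_m0)

# 2 or 3 heads out of 5 favors the fair-coin model M0 (BF < 1),
# while 5 heads out of 5 favors the biased-coin model M1:
assert bayes_factor_10(2, 3) < 1.0
assert bayes_factor_10(3, 2) < 1.0
assert bayes_factor_10(5, 0) > 1.0
```

Concretely, 2 heads out of 5 gives BF ≈ 0.53 and 5 out of 5 gives BF ≈ 5.33, matching the slide's claim about nheads = 2 or 3.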
| Bayes factor BF(1,0) | Interpretation           |
| B < 1/10             | Strong evidence for H0   |
| 1/10 < B < 1/3       | Moderate evidence for H0 |
| 1/3 < B < 1          | Weak evidence for H0     |
| 1 < B < 3            | Weak evidence for H1     |
| 3 < B < 10           | Moderate evidence for H1 |
| B > 10               | Strong evidence for H1   |
Polynomial regression
Bayesian Ockham’s razor
• The marginal likelihood automatically penalizes complex models: p(D|m) must sum to one over all possible datasets, so a model that spreads its probability over many datasets assigns less to any particular one
BIC
• Computing the marginal likelihood is hard unless we have conjugate priors.
• One popular approach is to make a Laplace approximation to the posterior and then approximate the log normalizer:
  p(D) ≈ p(D|θ̂_MAP) p(θ̂_MAP) (2π)^{d/2} |C|^{1/2},   C = −H^{−1}
• Approximating |H| ≈ n^{dof} gives the BIC:
  log p(D) ≈ log p(D|θ̂_MLE) − (dof/2) log n
BIC vs CV for ridge
• Define the dof in terms of the singular values d_j:
  df(λ) = Σ_{j=1}^d d_j² / (d_j² + λ)
Outline
• Loss functions
• Bayesian decision theory
• Bayesian model selection
• Frequentist decision theory
• Frequentist model selection
Frequentist decision theory
• Risk function:
  R(θ, δ) = E_{x|θ}[L(θ, δ(x))] = ∫_X L(θ, δ(x)) p(x|θ) dx
• Example, L2 loss:
  MSE = E_{D|θ0}[(θ̂(D) − θ0)²]
• Assumes that the true parameter θ0 is known, and averages over data
Bias/variance tradeoff
Let θ̄ = E[θ̂(D)] denote the expected value of the estimator. Then
  MSE = E[(θ̂(D) − θ0)²]
      = E[(θ̂(D) − θ̄ + θ̄ − θ0)²]
      = E[(θ̂(D) − θ̄)²] + 2(θ̄ − θ0) E[θ̂(D) − θ̄] + (θ̄ − θ0)²
      = E[(θ̂(D) − θ̄)²] + (θ̄ − θ0)²
      = Var(θ̂) + bias²(θ̂)
Empirically, averaging over S training sets drawn from the true distribution, with ȳ(x_i) = (1/S) Σ_s y^s(x_i):
  bias² ≈ (1/n) Σ_{i=1}^n (ȳ(x_i) − f_true(x_i))²
  var ≈ (1/n) Σ_{i=1}^n (1/S) Σ_{s=1}^S (y^s(x_i) − ȳ(x_i))²
Bias/variance tradeoff
λ = e^6: low variance, high bias
λ = e^{−2.5}: high variance, low bias
Empirical risk minimization
• Risk for function approximation:
  R(π, f̂) = E_{(x,y)∼π}[L(y, f̂(x))] = ∫∫ L(y, f̂(x)) p(x, y|π) dx dy
• Empirical risk:
  R̂(f̂, D) = (1/n) Σ_{i=1}^n L(y_i, f̂(x_i))
• To avoid an overly optimistic estimate, can use bootstrap resampling or cross validation
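The optimism of the empirical risk is easy to demonstrate with a memorizing model: 1-nearest-neighbour regression has zero training risk, while leave-one-out cross validation exposes the true noise level (a sketch with synthetic data):

```python
# Sketch: empirical risk vs a leave-one-out CV estimate for
# 1-nearest-neighbour regression on noisy synthetic data.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 40)

def predict_1nn(xtr, ytr, xq):
    """Predict each query point by its nearest training neighbour."""
    return np.array([ytr[np.argmin(np.abs(xtr - q))] for q in xq])

# Empirical (training) risk: each point is its own nearest neighbour.
train_risk = np.mean((y - predict_1nn(x, y, x)) ** 2)

# Leave-one-out CV: predict each point from the other n-1 points.
loo = np.array([predict_1nn(np.delete(x, i), np.delete(y, i), [x[i]])[0]
                for i in range(len(x))])
cv_risk = np.mean((y - loo) ** 2)

assert train_risk == 0.0      # 1-NN memorizes its training set
assert cv_risk > train_risk   # CV reveals the actual noisy risk
```

The same mechanism is why model selection must never be done on the training risk alone.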
Risk functions for parameter estimation
• The risk function depends on the unknown θ:
  R(θ, δ) = E_{x|θ}[L(θ, δ(x))] = ∫_X L(θ, δ(x)) p(x|θ) dx
• Example: X_i ∼ Ber(θ), L(θ, θ̂) = (θ − θ̂)²
• (Figure: risk curves of the MLE, the minimax estimator, and the posterior mean under a Beta(1,1) prior, for N=10 and N=50.)
Summarizing risk functions
• Risk function:
  R(θ, δ) = E_{x|θ}[L(θ, δ(x))] = ∫_X L(θ, δ(x)) p(x|θ) dx
• Minimax risk – very pessimistic:
  R_max(δ) = max_{θ∈Θ} R(θ, δ)
• Bayes risk – requires a prior over θ:
  R_π(δ) = E_{θ∼π}[R(θ, δ)] = ∫_Θ R(θ, δ) π(θ) dθ
Bayes risk vs n
X_i ∼ Ber(θ), L(θ, θ̂) = (θ − θ̂)², π(θ) = Unif(0, 1)
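The Bayes risk in this Bernoulli setup can be estimated by Monte Carlo: draw θ from the prior, simulate data, and average the squared error of each estimator (constants below are illustrative):

```python
# Sketch: Monte Carlo Bayes risk (uniform prior, L2 loss) of the MLE
# vs the posterior mean for a Bernoulli parameter with n observations.
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10, 200_000

theta = rng.uniform(0, 1, trials)      # θ ~ π
N1 = rng.binomial(n, theta)            # sufficient statistic per trial
mle = N1 / n
post_mean = (N1 + 1) / (n + 2)         # posterior mean under Beta(1,1)

risk_mle = np.mean((mle - theta) ** 2)        # ≈ E[θ(1-θ)]/n = 1/60
risk_pm = np.mean((post_mean - theta) ** 2)

assert risk_pm < risk_mle   # the Bayes estimator minimizes Bayes risk
```

This matches the next slide: for every prior, the estimator minimizing the Bayes risk is the one minimizing posterior expected loss.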
Bayes meets frequentist
• To minimize the Bayes risk, minimize the posterior expected loss:
  R_π(δ) = ∫_Θ [∫_X L(θ, δ(x)) p(x|θ) dx] π(θ) dθ
         = ∫_X ∫_Θ L(θ, δ(x)) p(x|θ) π(θ) dθ dx
         = ∫_X [∫_Θ L(θ, δ(x)) p(θ|x) dθ] p(x) dx
         = ∫_X ρ(δ(x)|x) p(x) dx
To minimize the integral, minimize ρ(δ(x)|x) for each x.
Bayesian estimators have good frequentist properties.
Outline
• Loss functions
• Bayesian decision theory
• Bayesian model selection
• Frequentist decision theory
• Frequentist model selection
Frequentist model selection
• 0-1 loss: classical hypothesis testing, not covered in this class (similar to, but more complex than, the Bayesian case)
• Predictive loss: minimize the empirical risk, or a CV/bootstrap approximation thereof:
  R̂(m) = (1/n) Σ_{i=1}^n L(y_i, m(x_i))