Advanced Topics in Machine Learning
2. Hyperparameter Learning
Lars Schmidt-Thieme
Information Systems and Machine Learning Lab (ISMLL)
University of Hildesheim, Germany
Outline
1. Grid Search
1. Grid Search
Alternating Parameter / Hyperparameter Learning
Let f be an objective function depending on both parameters Θ and hyperparameters H, e.g.,

    f(Θ, H) := R(D_train; Θ, H) + r(Θ, H)

with a risk R and a regularization r, where usually

    R(D; Θ, H) := (1/|D|) Σ_{(x,y)∈D} ℓ(y, ŷ(x; Θ, H))

with a loss ℓ and a prediction function ŷ.
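As a small illustration, a Python sketch of this empirical risk; the function and argument names are illustrative assumptions, not part of the lecture:

    # A minimal sketch (not from the lecture) of the empirical risk R(D; Theta, H):
    # the average loss over a sample D, with `loss` and `predict` as placeholders
    # for a concrete loss l and prediction function y_hat(x; Theta, H).
    def empirical_risk(D, theta, H, loss, predict):
        return sum(loss(y, predict(x, theta, H)) for x, y in D) / len(D)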
For such an objective function f one could learn parameters and
hyperparameters in an alternating manner (“EM style”):
    initialize Θ, H
    while not yet converged do
        Θ := arg min_Θ f_H(Θ) := f(Θ, H)
        H := arg min_H f_Θ(H) := f(Θ, H)
    end
    return (Θ, H)
E.g., for a linear SVM (with Θ := (β, β_0), H := (C)) minimize:

    f(β, β_0, C) := (1/|D_train|) Σ_{(x,y)∈D_train} [1 − y(β_0 + β^T x)]_+ + C β^T β
This will not work, as f is linear in C, i.e., minimal for C = 0.
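A quick numeric illustration of why the alternating scheme fails here (synthetic data and illustrative names, not from the lecture): for any fixed (β, β_0), the joint objective is affine in C with nonnegative slope β^T β, so the hyperparameter step always returns C = 0.

    import numpy as np

    # Joint SVM objective from above: hinge risk on the training data plus C * beta^T beta.
    def f_joint(beta, beta0, C, X, y):
        hinge = np.maximum(0.0, 1.0 - y * (beta0 + X @ beta)).mean()
        return hinge + C * beta @ beta

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = np.sign(X[:, 0] + 0.1 * rng.normal(size=100))
    beta, beta0 = rng.normal(size=5), 0.0

    # f decreases monotonically as C decreases, hence the arg min over C >= 0 is C = 0.
    for C in [1.0, 0.1, 0.0]:
        print(C, f_joint(beta, beta0, C, X, y))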
The Hyperparameter Learning Problem
Let f_calib(Θ, H) be a calibration function, usually just

    f_calib(Θ, H) := R(D_calib; Θ, H),

the risk on a calibration sample.
Hyperparameter Learning Problem: find Θ, H s.t.

    (i)  H := arg min_H f_calib,Θopt(H) := f_calib(arg min_Θ f(Θ, H), H)
    (ii) Θ := arg min_Θ f_H(Θ) := f(Θ, H)
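A sketch of this bilevel structure in Python; `fit` (solving the inner problem arg min_Θ f(Θ, H)) and `calib_risk` (evaluating f_calib) are hypothetical callables supplied by the user:

    # Outer problem: pick the H whose inner solution Theta(H) has the smallest calibration risk.
    def calib_objective(H, fit, calib_risk):
        theta = fit(H)                       # inner problem: Theta(H) := arg min_Theta f(Theta, H)
        return calib_risk(theta, H), theta   # f_calib(Theta(H), H)

    def select_hyperparameters(candidates, fit, calib_risk):
        scored = [(calib_objective(H, fit, calib_risk), H) for H in candidates]
        (best_score, best_theta), best_H = min(scored, key=lambda s: s[0][0])
        return best_theta, best_H            # (ii) Theta and (i) H from the problem above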
Example: Grid Search
If H consists of K different hyperparameters η_k and for each hyperparameter η_k there are some candidate values

    H_k := {η_{k,1}, η_{k,2}, . . . , η_{k,K_k}},

plausibly spaced over a suitable range, then

    H := arg min_{H ∈ Π_{k=1}^K H_k} f_calib,Θopt(H) := f_calib(arg min_Θ f(Θ, H), H)

are called optimal hyperparameters for the grid Π_k H_k (grid search).
- grid search is trivially parallelizable.
- if an optimal hyperparameter is at the border of the grid, its range should be expanded.
- often a second, narrower grid is searched, centered around the optimal hyperparameters of the first, coarse grid (see the sketch below).
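A hand-rolled grid-search sketch (illustrative; `fit` and `calib_risk` are the same hypothetical callables as above), searching the Cartesian product of the candidate sets H_k:

    import itertools

    def grid_search(grids, fit, calib_risk):
        # grids: dict mapping each hyperparameter name eta_k to its candidate values H_k.
        best = None
        for values in itertools.product(*grids.values()):   # every point of the grid (parallelizable)
            H = dict(zip(grids.keys(), values))
            score = calib_risk(fit(H), H)
            if best is None or score < best[0]:
                best = (score, H)
        return best

    # Usage sketch: a coarse logarithmic grid for an SVM's C; a second, narrower grid
    # could then be searched around the best value found here.
    # grid_search({"C": [0.01, 0.1, 1.0, 10.0, 100.0]}, fit, calib_risk)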
Example: Early Stopping
    early-stopping(iterate, f, f_calib, t_lookahead):
        t := 0, t* := 0
        initialize Θ^(0)
        while t − t* < t_lookahead do
            Θ^(t+1) := iterate(f, Θ^(t))        // with f(Θ^(t+1)) < f(Θ^(t))
            if f_calib(Θ^(t+1)) < f_calib(Θ^(t*)) then
                t* := t + 1
            end
            t := t + 1
        end
        return Θ^(t*)
Can early stopping also be understood as hyperparameter learning?
- Yes, we learn the hyperparameter t*, the number of iterations.
- It is a sequential search with lookahead (see the sketch below).
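A runnable Python sketch of the early-stopping procedure above; `iterate` and `f_calib` are illustrative callables (the training objective f is assumed to be captured inside `iterate`):

    def early_stopping(iterate, f_calib, theta0, t_lookahead):
        # Keep the parameters with the best calibration risk seen so far and stop
        # once they have not improved for t_lookahead consecutive iterations.
        t, t_best = 0, 0
        theta, best_theta, best_calib = theta0, theta0, f_calib(theta0)
        while t - t_best < t_lookahead:
            theta = iterate(theta)             # one step on the training objective f
            t += 1
            calib = f_calib(theta)
            if calib < best_calib:
                t_best, best_theta, best_calib = t, theta, calib
        return best_theta                      # Theta^(t*): t* is the learned number of iterations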
Example: Ridge Regression (w/o intercept)
Minimize:

    f_calib(λ) := (1/|D_calib|) Σ_{(x,y)∈D_calib} (y − β(λ)^T x)^2

with

    β(λ) := arg min_β f_λ(β)

    f_λ(β) := (1/|D_train|) Σ_{(x,y)∈D_train} (y − β^T x)^2 + λ β^T β
Idea: can we replace the search over hyperparameter λ by proper minimization using gradients (e.g., gradient descent)?
[Chapelle et al. 2002; Keerthi, Sindhwani, and Chapelle 2007]
    ∂f_calib/∂λ (λ) = (1/|D_calib|) Σ_{(x,y)∈D_calib} −2 (y − β(λ)^T x) · (∂β/∂λ)(λ)^T x
Example: Ridge Regression (w/o intercept)
In matrix notation (and with rescaled λ): minimize

    f_calib(λ) := (y_calib − X_calib β(λ))^T (y_calib − X_calib β(λ))

with

    β(λ) := arg min_β f_λ(β) := (y_train − X_train β)^T (y_train − X_train β) + λ β^T β,

i.e., (X_train^T X_train + λ I) β(λ) = X_train^T y_train.

Thus

    (X_train^T X_train + λ I) (∂β/∂λ)(λ) + β(λ) = 0

    (∂β/∂λ)(λ) = −(X_train^T X_train + λ I)^{-1} β(λ)

    ∂f_calib/∂λ (λ) = −2 (y_calib − X_calib β(λ))^T X_calib (∂β/∂λ)(λ)
Example: Ridge Regression (w/o intercept)
    ∂f_calib/∂λ (λ) = −2 (y_calib − X_calib β(λ))^T X_calib (∂β/∂λ)(λ)
                    = −2 (X_calib^T (y_calib − X_calib β(λ)))^T (∂β/∂λ)(λ)
                    = 2 (X_calib^T (y_calib − X_calib β(λ)))^T (X_train^T X_train + λ I)^{-1} β(λ)
                    = 2 ((X_train^T X_train + λ I)^{-1} X_calib^T (y_calib − X_calib β(λ)))^T β(λ)
                    = 2 d^T β(λ)

with

    (X_train^T X_train + λ I) d = X_calib^T (y_calib − X_calib β(λ))
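A NumPy sanity check of this closed form on synthetic data (all data and names are illustrative assumptions): the analytic gradient 2 d^T β(λ) should agree with a central finite difference of f_calib.

    import numpy as np

    rng = np.random.default_rng(0)
    Xtr, Xcal = rng.normal(size=(80, 5)), rng.normal(size=(40, 5))
    w = rng.normal(size=5)
    ytr = Xtr @ w + 0.1 * rng.normal(size=80)
    ycal = Xcal @ w + 0.1 * rng.normal(size=40)

    def beta_of(lam):
        # ridge solution: (Xtr^T Xtr + lam I) beta = Xtr^T ytr
        return np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(5), Xtr.T @ ytr)

    def f_calib(lam):
        r = ycal - Xcal @ beta_of(lam)
        return r @ r                       # unnormalized, as in the matrix notation above

    lam = 1.0
    beta = beta_of(lam)
    d = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(5), Xcal.T @ (ycal - Xcal @ beta))
    analytic = 2 * d @ beta
    numeric = (f_calib(lam + 1e-5) - f_calib(lam - 1e-5)) / 2e-5
    print(analytic, numeric)               # should agree to several digits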
Example: Ridge Regression (w/o intercept)
    ridge-regression-auto-hyper(X_train, y_train, X_calib, y_calib, λ_0, η, ε):
        λ := λ_0, g := ∞
        while |g| ≥ ε do
            compute β:  (X_train^T X_train + λ I) β = X_train^T y_train
            compute d:  (X_train^T X_train + λ I) d = X_calib^T (y_calib − X_calib β)
            g := 2 d^T β
            λ := [λ − η g]_+
        end
        return (β, λ)
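A direct NumPy translation of this procedure (illustrative defaults for λ_0, η, ε; the max_iter safeguard is an addition, not on the slide):

    import numpy as np

    def ridge_regression_auto_hyper(Xtr, ytr, Xcal, ycal, lam0=1.0, eta=1e-3, eps=1e-6, max_iter=10000):
        # Gradient descent on the calibration risk with respect to lambda,
        # projected onto lambda >= 0 by [.]_+ .
        p = Xtr.shape[1]
        lam, beta = lam0, None
        for _ in range(max_iter):
            A = Xtr.T @ Xtr + lam * np.eye(p)
            beta = np.linalg.solve(A, Xtr.T @ ytr)          # (Xtr^T Xtr + lam I) beta = Xtr^T ytr
            d = np.linalg.solve(A, Xcal.T @ (ycal - Xcal @ beta))
            g = 2 * d @ beta                                # d f_calib / d lambda
            if abs(g) < eps:
                break
            lam = max(lam - eta * g, 0.0)                   # [lam - eta * g]_+
        return beta, lam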
Example: SVM
Lemma
Let α, β_0 be the solution of an SVM with hyperparameters H (i.e., C and possibly kernel parameters). Then for H̃ close to H, the solution of the SVM with hyperparameters H̃ is α̃, β̃_0 with

    [ (K_{u,u} y_u y_u^T)   y_u ] [ α̃_u ]   [ e_u − C̃ (K_{u,C} y_u y_C^T) e_C ]
    [ y_u^T                 0   ] [ β̃_0 ] = [ −C̃ e_C^T y_C                    ]

    α̃_C := C̃,    α̃_0 := 0

with

    I_0 := {i | α_i = 0}
    I_C := {i | α_i = C}
    I_u := {i | 0 < α_i < C}
References
Chapelle, O. et al. (2002): Choosing multiple parameters for support vector machines. In: Machine Learning 46.1, 131–159.
Keerthi, S. S., V. Sindhwani, and O. Chapelle (2007): An efficient method for gradient-based adaptation of hyperparameters in SVM models. In: Advances in Neural Information Processing Systems 19, p. 673.