Online acceleration of exponential weights

Pierre Gaillard
February 10, 2017
INRIA Paris – SIERRA
This is joint work with Olivier Wintenberger
setting
the framework of this talk

Step-by-step minimization of i.i.d. convex loss functions¹ ℓ_1, …, ℓ_n : R^d → R.

Assumption 1 (strongly convex risk)
∃α > 0, θ* ∈ R^d, ∀θ ∈ R^d :   α‖θ − θ*‖₂² ⩽ E[ℓ_t(θ) − ℓ_t(θ*)] ,
with θ* d0-sparse and ‖θ*‖₁ ⩽ U.

Setting: for each t = 1, …, n
- the learner provides θ̂_{t−1} ∈ R^d based on the past gradients ∇ℓ_s(θ̂_{s−1}) for s ⩽ t − 1
- the environment reveals ∇ℓ_t(θ̂_{t−1})

Goal: minimize the average risk

  Risk_n(θ̂_{0:(n−1)}) = (1/n) Σ_{t=1}^n E[ℓ_t](θ̂_{t−1}) − E[ℓ_t](θ*) .

Remark 1
- the gradients need not be Lipschitz
- only the risk needs to be strongly convex (e.g., the pinball loss)

Remark 2
By convexity of the risk, the average θ̄_{n−1} := (1/n) Σ_{t=1}^n θ̂_{t−1} has an instantaneous risk upper-bounded by the cumulative one (Jensen):

  Risk(θ̄_{n−1}) := E[ℓ_n](θ̄_{n−1}) − E[ℓ_n](θ*) ⩽ Risk_n(θ̂_{0:(n−1)}) .

Result

  Risk_n(θ̂_{0:(n−1)}) ≲ min{ B² d0 log(d) log(n) / (αn) ,  UB √(log(d)/n) } ,

where B ⩾ max_{‖θ‖₁ ⩽ 2U} ‖∇ℓ_t(θ)‖_∞.
- Fast rate: better for large n, α
- Slow rate: better for small n, α

¹ Cesa-Bianchi and Lugosi, Prediction, Learning, and Games, 2006.
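To make the protocol concrete, here is a minimal sketch of the interaction loop in Python. The `Learner` interface, the plain gradient-descent update and the quadratic losses in the usage example are illustrative assumptions of mine, not part of the talk.

```python
import numpy as np

class Learner:
    """Minimal interface: propose a point, then receive only the gradient at it."""

    def __init__(self, d):
        self.theta = np.zeros(d)              # current iterate theta_hat_{t-1}

    def predict(self):
        return self.theta.copy()

    def update(self, grad, t):
        # placeholder rule: online gradient descent with step 1/sqrt(t)
        self.theta -= grad / np.sqrt(t)

def run(learner, grad_oracle, n):
    """grad_oracle(t, theta) returns the gradient of ell_t at theta."""
    iterates = []
    for t in range(1, n + 1):
        theta = learner.predict()             # learner provides theta_hat_{t-1}
        grad = grad_oracle(t, theta)          # environment reveals grad ell_t(theta_hat_{t-1})
        learner.update(grad, t)
        iterates.append(theta)
    return np.mean(iterates, axis=0)          # averaged iterate theta_bar_{n-1} (Remark 2)

# Usage example with illustrative quadratic losses ell_t(theta) = 0.5 * ||theta - x_t||^2
rng = np.random.default_rng(0)
xs = rng.normal(size=(100, 5))
theta_bar = run(Learner(5), lambda t, th: th - xs[t - 1], 100)
```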
comparison with other sparse algorithms

| Procedure               | Rate            | Polynomial | Sequential |
|-------------------------|-----------------|------------|------------|
| Lasso¹                  | d0 log d / (αn) | ✓          | ✗          |
| EWA + sparsity pattern² | d0 log d / n    | ✗          | ✗          |
| SeqSEW³                 | d0 log d / n    | ✗          | ✓          |
| ℓ1-RDA method⁴          | d / n           | ✓          | ✓          |
| SAEW                    | d0 log d / (αn) | ✓          | ✓          |

¹ Bunea, Tsybakov, and Wegkamp, “Aggregation for Gaussian regression”, 2007.
² Rigollet and Tsybakov, “Exponential screening and optimal rates of sparse estimation”, 2011.
³ Gerchinovitz, “Sparsity regret bounds for individual sequences in online linear regression”, 2013.
⁴ Xiao, “Dual averaging methods for regularized stochastic learning and online optimization”, 2010.
convex optimization in the ℓ1-ball with slow rate

Goal: perform online optimization in

  B1(θ_center, ε) := { θ ∈ R^d : ‖θ − θ_center‖₁ ⩽ ε } .

We define the 2d corners of the ℓ1-ball: e_k = θ_center ± ε (0, …, 0, 1, 0, …, 0).

The exponentiated gradient forecaster (EG±)⁵
At each forecasting instance t ⩾ 1,
- assign to each corner e_k the weight

    p̂_{k,t−1} = exp( −η Σ_{s=1}^{t−1} ∇ℓ_s(θ̂_{s−1})ᵀ e_k ) / Σ_{j=1}^{2d} exp( −η Σ_{s=1}^{t−1} ∇ℓ_s(θ̂_{s−1})ᵀ e_j )

- form the parameter θ̂_{t−1} = Σ_{k=1}^{2d} p̂_{k,t−1} e_k .

Performance: bound on the average regret, if θ* ∈ B1(θ_center, ε) and η is well tuned,

  (1/n) Σ_{t=1}^n ℓ_t(θ̂_{t−1}) − ℓ_t(θ*) ≲ εB √(log(d)/n) .

The learning rate η can be tuned online (doubling trick, η_t).

⁵ Kivinen and Warmuth, “Exponentiated Gradient Versus Gradient Descent for Linear Predictors”, 1997.
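As an illustration, here is a possible Python sketch of the EG± forecaster described above; the class name and the log-space stabilization are my own choices, and η is assumed given (it can also be tuned online as noted).

```python
import numpy as np

class EGPlusMinus:
    """Sketch of EG+- on the l1-ball B1(center, eps): exponential weights over the
    2d corners center +- eps * e_j, updated with the revealed gradients."""

    def __init__(self, center, eps, eta):
        self.center = np.asarray(center, dtype=float)
        self.eps = float(eps)
        self.eta = float(eta)
        self.cum_grad = np.zeros(self.center.size)   # sum_s grad ell_s(theta_hat_{s-1})

    def predict(self):
        d = self.center.size
        # weight of corner center + eps*e_j is prop. to exp(-eta * eps * cum_grad[j]),
        # weight of corner center - eps*e_j to exp(+eta * eps * cum_grad[j]); the common
        # center term cancels in the normalization.
        logits = -self.eta * self.eps * np.concatenate([self.cum_grad, -self.cum_grad])
        logits -= logits.max()                       # numerical stabilization
        p = np.exp(logits)
        p /= p.sum()
        # theta_hat_{t-1} = sum_k p_k e_k = center + eps * (p_plus - p_minus)
        return self.center + self.eps * (p[:d] - p[d:])

    def update(self, grad):
        self.cum_grad += np.asarray(grad, dtype=float)
```

On each round one calls `predict()`, observes the gradient at the returned point, and passes it to `update()`.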
proof

Lemma (Hoeffding)
If X is a random variable with |X| ⩽ B, then

  ∀η > 0,   E[X] ⩽ −(1/η) log E[e^{−ηX}] + ηB²/2 .

1. Upper bound the instantaneous linearized loss. For every corner e_k,

  ∇ℓ_t(θ̂_{t−1})ᵀ θ̂_{t−1}
    = Σ_{k=1}^{2d} p̂_{k,t−1} ∇ℓ_t(θ̂_{t−1})ᵀ e_k                                  (def. of θ̂_{t−1})
    ⩽ −(1/η) log( Σ_{k=1}^{2d} p̂_{k,t−1} e^{−η ∇ℓ_t(θ̂_{t−1})ᵀ e_k} ) + ηB²/2      (Hoeffding)
    = ∇ℓ_t(θ̂_{t−1})ᵀ e_k + (1/η) log( p̂_{k,t} / p̂_{k,t−1} ) + ηB²/2 .             (def. of p̂_{k,t})

2. Sum over all t; the log terms telescope. Since θ* ∈ B1(θ_center, ε) is a convex combination of the corners,

  Σ_{t=1}^n ℓ_t(θ̂_{t−1}) − ℓ_t(θ*)
    ⩽ Σ_{t=1}^n ∇ℓ_t(θ̂_{t−1})ᵀ (θ̂_{t−1} − θ*)                                    (convexity)
    ⩽ max_{k=1,…,2d} Σ_{t=1}^n ∇ℓ_t(θ̂_{t−1})ᵀ (θ̂_{t−1} − e_k)
    ⩽ max_{k=1,…,2d} { (1/η) log( p̂_{k,n} / p̂_{k,0} ) + ηB²n/2 }
    ⩽ log(2d)/η + ηB²n/2 .
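For completeness, here is the η-tuning step that turns the last display into the slow rate, written out with the constant used above. Here B stands for the bound on the linearized losses ∇ℓ_t(θ̂_{t−1})ᵀ(θ̂_{t−1} − e_k), which is of order ε · max‖∇ℓ_t‖_∞ when the corners lie in an ℓ1-ball of radius ε, so one recovers the εB√(log d/n) rate of the previous slide.

```latex
% Optimizing  a/eta + c*eta  with  a = log(2d) and c = B^2 n / 2  gives  2*sqrt(a*c):
\[
  \min_{\eta>0}\left\{\frac{\log(2d)}{\eta}+\frac{\eta B^2 n}{2}\right\}
  \;=\; B\sqrt{2\,n\log(2d)}
  \qquad\text{attained at}\qquad
  \eta^\star=\frac{1}{B}\sqrt{\frac{2\log(2d)}{n}} ,
\]
\[
  \text{so that}\qquad
  \frac{1}{n}\sum_{t=1}^{n}\big(\ell_t(\hat\theta_{t-1})-\ell_t(\theta^\star)\big)
  \;\lesssim\; B\sqrt{\frac{\log d}{n}} .
\]
```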
from regret bound to high probability bound

Theorem
Let 0 < δ < 1. If θ* ∈ B1(θ_center, ε), then with probability at least 1 − δ, EG± satisfies

  α ‖θ̄_{n−1} − θ*‖₂²  ⩽  E[ℓ_n](θ̄_{n−1}) − E[ℓ_n](θ*)  ≲  εB √(log(d/δ)/n) ,
  (strong convexity)

where θ̄_{n−1} = (1/n) Σ_{t=1}^n θ̂_{t−1}.

We thus observe the slow rate on the risk, of order UB √(log(d)/n).

Proof:
- Hoeffding's inequality for martingales: with high probability,

    Σ_{t=1}^n E[ℓ_t](θ̂_{t−1}) − E[ℓ_t](θ*)  ≲  Σ_{t=1}^n ℓ_t(θ̂_{t−1}) − ℓ_t(θ*) + √(n log(1/δ)) .

- Jensen's inequality (convex i.i.d. losses):

    E[ℓ_n](θ̄_{n−1}) − E[ℓ_n](θ*)  ⩽  (1/n) Σ_{t=1}^n E[ℓ_t](θ̂_{t−1}) − E[ℓ_t](θ*) .
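Chaining the two bullets with the EG± regret bound of the previous slides makes the theorem explicit (constants omitted; the deviation term is absorbed into the log(d/δ) factor):

```latex
\[
  \mathbb{E}[\ell_n](\bar\theta_{n-1})-\mathbb{E}[\ell_n](\theta^\star)
  \;\le\;\frac{1}{n}\sum_{t=1}^{n}\Big(\mathbb{E}[\ell_t](\hat\theta_{t-1})-\mathbb{E}[\ell_t](\theta^\star)\Big)
  \;\lesssim\;\frac{1}{n}\sum_{t=1}^{n}\Big(\ell_t(\hat\theta_{t-1})-\ell_t(\theta^\star)\Big)
      +\sqrt{\frac{\log(1/\delta)}{n}}
  \;\lesssim\;\varepsilon B\sqrt{\frac{\log(d/\delta)}{n}} .
\]
```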
acceleration: from slow rate to fast rate
acceleration: regularly restart the algorithm

[Figure: restart illustration — nested ℓ1-balls of shrinking radius (initial radius U) centered at the successive estimates θ0, θ1, θ2, …, closing in on θ*.]
acceleration: from slow rate 1/√n to fast rate 1/n

Algorithm
Parameters: U, B, α, δ > 0
Initialization: θ_center = 0 ∈ R^d, t_1 = 1
For sessions i ⩾ 1,
- Start a new EG± in B1(θ_center, U2^{−i}) for t ⩾ t_i
- Get the high-probability ℓ1-ball for θ*:

    ‖θ̄_{t−1} − θ*‖₁²  ⩽  d ‖θ̄_{t−1} − θ*‖₂²  ≲  d U2^{−i} B √(log(d/δ)) / (α √(t − t_i))  =:  C(U, t)²

- Define t_{i+1} ⩾ t_i as the first time such that C(U, t_{i+1}) ⩽ U2^{−(i+1)}, i.e., for

    √(t_{i+1} − t_i)  ≈  4dB √(log(d/δ)) / (U2^{−i} α)

- θ_center ← θ̄_{t_{i+1}−1}

After n time steps, we will have the slow-rate high-probability bound

  E[ℓ_n](θ̄_{n−1}) − E[ℓ_n](θ*)  ≲  U2^{−i} B √(log(d/δ)) / √n  ≲  d B² log(d/δ) / (αn) ,

but with U2^{−i} ≈ dB √(log(d/δ)) / (α√n).
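The session length above comes from solving C(U, t_{i+1}) ⩽ U2^{−(i+1)}; spelling out this intermediate computation:

```latex
\[
  C(U,t_{i+1})^2
  \;=\;\frac{d\,U2^{-i}\,B\sqrt{\log(d/\delta)}}{\alpha\sqrt{t_{i+1}-t_i}}
  \;\le\;\big(U2^{-(i+1)}\big)^2
  \quad\Longleftrightarrow\quad
  \sqrt{t_{i+1}-t_i}\;\ge\;\frac{4\,d\,B\sqrt{\log(d/\delta)}}{U2^{-i}\,\alpha}.
\]
```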
acceleration: from slow rate 1/√n to fast rate 1/n

Algorithm (SAEW)
Parameters: U, B, α, δ > 0, d0 ⩾ 1
Initialization: θ_center = 0 ∈ R^d, t_1 = 1
For sessions i ⩾ 1,
- Start a new EG± in B1(θ_center, U2^{−i}) for t ⩾ t_i
- Get the high-probability ℓ1-ball for θ*:

    ‖[θ̄_{t−1}]_{d0} − θ*‖₁²  ≲  d0 ‖[θ̄_{t−1}]_{d0} − θ*‖₂²  ≲  d0 U2^{−i} B √(log(d/δ)) / (α √(t − t_i))  =:  C(U, t)² ,

  where [θ̄_{t−1}]_{d0} is the d0-truncation of θ̄_{t−1}
- Define t_{i+1} ⩾ t_i as the first time such that C(U, t_{i+1}) ⩽ U2^{−(i+1)}, i.e., for

    √(t_{i+1} − t_i)  ≈  4 d0 B √(log(d/δ)) / (U2^{−i} α)

- θ_center ← θ̄_{t_{i+1}−1}

After n time steps, we will have the slow-rate high-probability bound

  E[ℓ_n](θ̄_{n−1}) − E[ℓ_n](θ*)  ≲  U2^{−i} B √(log(d/δ)) / √n  ≲  d0 B² log(d/δ) / (αn) ,

but with U2^{−i} ≈ d0 B √(log(d/δ)) / (α√n).
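A rough Python sketch of the restart scheme, reusing the EGPlusMinus class sketched above. The stopping test mirrors the C(U, t) criterion of the slide; the learning rate, the running average, and the hard d0-truncation by largest magnitude are illustrative choices of mine.

```python
import numpy as np

def saew(grad_oracle, n, d, d0, U, B, alpha, delta, eta=0.1):
    """Sketch of SAEW: run EG+- on l1-balls of halving radius, restarting once the
    high-probability localization of theta* fits in a ball of half the radius.
    grad_oracle(t, theta) returns grad ell_t(theta)."""
    center, radius = np.zeros(d), U / 2.0          # session i = 1 uses radius U * 2^{-1}
    eg = EGPlusMinus(center, radius, eta)
    avg, t_i = np.zeros(d), 1
    for t in range(1, n + 1):
        theta = eg.predict()
        eg.update(grad_oracle(t, theta))
        k = t - t_i + 1                            # rounds elapsed in the current session
        avg += (theta - avg) / k                   # running average theta_bar over the session
        # localization radius: C(U, t)^2 ~ d0 * radius * B * sqrt(log(d/delta)) / (alpha * sqrt(k))
        C2 = d0 * radius * B * np.sqrt(np.log(d / delta)) / (alpha * np.sqrt(k))
        if C2 <= (radius / 2.0) ** 2:              # theta* localized in a ball of half radius
            trunc = np.zeros(d)                    # keep the d0 largest coordinates of the average
            keep = np.argsort(np.abs(avg))[-d0:]
            trunc[keep] = avg[keep]
            center, radius = trunc, radius / 2.0   # recenter and halve the radius
            eg = EGPlusMinus(center, radius, eta)
            avg, t_i = np.zeros(d), t + 1
    return avg
```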
from slow rate bound to fast rate bound: illustration

[Figure 1: Logarithm of the ℓ2-error of the averaged estimator, log(‖θ̃_t − θ*‖₂²), plotted against log(t); the restart times t1, t2, …, t8 are marked on the time axis.]
result: high probability bound on the risk

Theorem
The average risk of SAEW is upper-bounded as

  Risk_{1:n}(θ̂_{0:(n−1)})  ≲  min{ UB √(log(d/δ)/n) ,  (d0 B²/(αn)) log(d/δ) log n + αU²/(d0 n) } .

Remarks:
- Both rates are optimal (in some sense)
- From the strong convexity assumption, this also ensures

    ‖θ̃_{n−1} − θ*‖₂²  ≲  min{ UB √(log(d/δ)) / (α√n) ,  (d0 B²/(α²n)) log(d/δ) log n + U²/(d0 n) } ,

  where θ̃_{n−1} = (t_i − t_{i−1})^{−1} Σ_{t=t_{i−1}}^{t_i−1} θ̂_{t−1} for t_i ⩽ n ⩽ t_{i+1}
- the boundedness of the gradients, B ⩾ max_{θ∈B1(0,2U)} ‖∇ℓ_t(θ)‖_∞, can be weakened to unknown B under a sub-Gaussian condition
simulations

We compare three online optimization procedures:
- RDA⁶: ℓ1-regularized dual averaging method,

    θ̂_t = argmin_{θ∈R^d} { (1/t) Σ_{s=1}^t ∇ℓ_s(θ̂_{s−1})ᵀ θ  (linearized loss)  +  λ‖θ‖₁  (ℓ1 regularization)  +  (γ/√t) ‖θ‖₂²  (forces strong convexity) } .

  Good performance for hand-written digit classification. Produces sparse estimators, but: slow rate, or fast rate (with the forced strong convexity) with no sparsity guarantees.
- BOA⁷: exponential weights with second-order regularization (≈ EG± with good tuning and high-probability properties).
  Achieves the fast rate for expert selection, but no fast rate in the ℓ2-ball.
- SAEW: our acceleration of BOA.

All methods are tuned in hindsight with the best parameters on a grid.

⁶ Xiao, “Dual averaging methods for regularized stochastic learning and online optimization”, 2010.
⁷ Wintenberger, “Optimal learning with Bernstein Online Aggregation”, 2014.
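The RDA minimization displayed above can be solved coordinate-wise in closed form (soft-thresholding of the averaged gradient). A short sketch, with λ and γ taken as given:

```python
import numpy as np

def rda_l1_step(avg_grad, t, lam, gamma):
    """argmin_theta  avg_grad.theta + lam*||theta||_1 + (gamma/sqrt(t))*||theta||_2^2,
    solved coordinate-wise: soft-threshold the averaged gradient, then rescale."""
    shrunk = np.sign(avg_grad) * np.maximum(np.abs(avg_grad) - lam, 0.0)
    return -np.sqrt(t) / (2.0 * gamma) * shrunk
```

Coordinates whose averaged gradient stays below λ in absolute value are set exactly to zero, which is why RDA produces sparse iterates.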
least square linear regression

Let (X_t, Y_t) ∈ [−X, X]^d × [−Y, Y] be i.i.d. random pairs (X, Y > 0).
Goal: estimate Y_t linearly by approaching

  θ* ∈ argmin_{θ∈R^d} E[(Y_t − X_tᵀθ)²] .

The strong convexity assumption is achieved with α ⩽ λ_min(E[X_t X_tᵀ]).

Experiment: X_t ∼ N(0, 1) for d = 500, n = 2 000, with

  Y_t = X_tᵀθ* + 0.1 ε_t ,   ε_t ∼ N(0, 1) i.i.d.,

where d0 = ‖θ*‖₀ = 5, U = ‖θ*‖₁ = 1 with non-zero coordinates i.i.d. ∝ N(0, 1), and α = 1.

[Figure 2: Boxplot of log(‖θ̃ − θ*‖₂²) over 30 simulations for RDA, BOA, and SAEW.]
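A sketch of how the synthetic data of this experiment can be generated; the seed, the support selection, and the exact rescaling enforcing ‖θ*‖₁ = 1 are my own choices.

```python
import numpy as np

def make_sparse_regression(n=2000, d=500, d0=5, sigma=0.1, seed=0):
    """Gaussian design, d0-sparse theta* with Gaussian non-zeros rescaled so that
    ||theta*||_1 = 1, and observations Y_t = X_t.theta* + sigma * eps_t."""
    rng = np.random.default_rng(seed)
    theta_star = np.zeros(d)
    support = rng.choice(d, size=d0, replace=False)
    coeffs = rng.normal(size=d0)
    theta_star[support] = coeffs / np.abs(coeffs).sum()   # U = ||theta*||_1 = 1
    X = rng.normal(size=(n, d))
    Y = X @ theta_star + sigma * rng.normal(size=n)
    return X, Y, theta_star
```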
acceleration: in practice

Simulation: least square regression with d0 = 5, d = 500, σ = 0.1.

[Figure 3: Log of the ℓ2-error, log(‖θ̃_t − θ*‖₂²), versus log(t) for RDA, BOA, SAEW.]
[Figure 4: Cumulative risk versus t (up to t = 20 000) for RDA, BOA, SAEW.]

Remarks: the cumulative risks are at most of order:
- RDA: σ² d log n
- BOA: σ² √(n log d) + log d
- SAEW: σ² d0 log d log n + log d

In practice: much better performance if we allow multiple passes over the training set.
can still be useful even with d0 = d

Simulation: least square regression with d0 = 2, d = 2, σ = 0.3.

[Figure 5: Cumulative risk (up to t = 20 000) for square linear regression with d = d0 = 2, comparing BOA and SAEW.]
least square linear regression: improvement of the bound

Let X, Y > 0. Let (X_t, Y_t) ∈ [−X, X]^d × [−Y, Y] be i.i.d. random pairs.

Theorem
SAEW applied with B = 2X(Y + 2XU) satisfies

  Risk_{1:n}(θ̂_{0:(n−1)})  ≲  min{ UX(Y + XU) √(log(d/δ)/n) ,  (d0 X²(Y² + X²U²)/(αn)) log(d/δ) log n + αU²/(d0 n) } .

A better tuning of EG±, with η_t ≈ 1/√(Σ_{s=1}^t ‖∇ℓ_s(θ̂_{s−1})‖²_∞), allows B² to be replaced by X²σ², where

  σ² = E[(Y_t − X_tᵀθ*)²] ,

in the instantaneous risk of θ̃_{n−1}:

  Risk(θ̃_{n−1})  ≲  min{ σ √(log(d/δ)/n) ,  (d0 X²σ²/(αn)) log(d/δ) + αU²/(d0 n) } .

This is the optimal rate for sparse least square regression.
calibration of the parameters

The algorithm needs to know: d0, U, B, α, δ.
Solution: run a meta-algorithm (BOA⁸) with parameters in a growing grid.

We leave the initial setting since we need
- to observe the gradients of all sub-algorithms;
- to clip the predictions, θ̂_{t−1}ᵀX_t → [θ̂_{t−1}ᵀX_t]_{[−Y,Y]}, otherwise we pay the maximal value of U considered in the final bound;
- strongly convex ℓ_t (instead of E[ℓ_t] only).

This works to build an estimator for least square regression.

Theorem (Calibrated SAEW for Least Square Linear Regression)
The excess risk of the estimator produced by the meta-algorithm is of order

  O_n( (Y²/n) log( (log d)(log n + log Y) / δ )  +  (d0 X²σ²/(α* n)) log(d/δ) )

(the first term is the price of calibration; the second is the fast rate with the best α*, d0), where α* is the largest strong-convexity parameter.

Can we substitute α* with local strong convexity (cf. Lasso)?

⁸ Wintenberger, “Optimal learning with Bernstein Online Aggregation”, 2014.
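Two small helpers illustrating the calibration step: clipping the linear predictions to [−Y, Y] and building a geometric grid of candidate parameters for the meta-aggregation. The grid ranges and spacing are illustrative assumptions of mine, not the talk's.

```python
import numpy as np

def clip_prediction(theta, x, Y):
    """Clip the linear prediction theta.x to [-Y, Y] before it enters the loss."""
    return float(np.clip(theta @ x, -Y, Y))

def parameter_grid(n, d, U_max):
    """Geometric grid of candidate (U, alpha, d0) triples fed to the meta-algorithm."""
    Us = [U_max * 2.0 ** (-k) for k in range(int(np.log2(n)) + 1)]
    alphas = [2.0 ** (-k) for k in range(int(np.log2(n)) + 1)]
    d0s = [2 ** k for k in range(int(np.log2(d)) + 1)]
    return [(U, a, d0) for U in Us for a in alphas for d0 in d0s]
```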
quantile linear regression

Let τ ∈ (0, 1). Let (X_t, Y_t) ∈ R^d × R be i.i.d. random pairs.
Goal: estimate the conditional τ-quantile of Y_t given X_t.

[Figure: an observation, its forecast, and the error (forecast − observation).]

Popular solution: linear regression with the pinball loss ρ_τ : u ∈ R ↦ u(τ − 1_{u<0}).
The conditional quantile q_τ(Y_t | X_t) is the solution of

  q_τ(Y_t | X_t) ∈ argmin_g E[ ρ_τ(Y_t − g(X_t)) | X_t ] .

→ minimizing the pinball loss gives the τ-quantile prediction
− the loss is not strongly convex
+ the risk is strongly convex⁹ → ok for a fixed parameter

⁹ Steinwart and Christmann, “Estimating conditional quantiles with the help of the pinball loss”, 2011.
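As a small sketch, here are the pinball loss and a subgradient of the associated linear loss ρ_τ(Y_t − X_tᵀθ); at the kink u = 0 the code picks the subgradient value τ, one valid choice among others.

```python
import numpy as np

def pinball(u, tau):
    """rho_tau(u) = u * (tau - 1_{u < 0})."""
    u = np.asarray(u, dtype=float)
    return u * (tau - (u < 0))

def pinball_linreg_grad(theta, x, y, tau):
    """A subgradient in theta of rho_tau(y - x.theta)."""
    u = y - x @ theta
    return -(tau - (u < 0)) * x
```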
quantile regression

Experiment: d0 = 5, d = 100.

[Figure 6: Log of the ℓ2-error of θ̄_t, log(‖θ̃_t − θ*‖₂²), versus log(t) for RDA, BOA, SAEW.]
[Figure 7: Cumulative risk versus t (up to t = 10⁵) for RDA, BOA, SAEW.]

All the methods empirically get the fast rate 1/n for the ℓ2-error of the estimator...
But only SAEW
- has the theoretical guarantee
- has O(log n) cumulative risk
ongoing questions

Some future work:
- Is averaging an efficient acceleration procedure for EG±?
- Calibration of the parameters in the original online optimization setting
- Produce sparse estimators θ̂_{t−1}
  → improve the dependency on the strong convexity parameter (only local)
- Oracle bound: no assumption on the sparsity of θ*

Thank you!
references
Bunea, F., A. Tsybakov, and M. Wegkamp. “Aggregation for Gaussian regression”. In: The
Annals of Statistics 35.4 (2007), pp. 1674–1697.
Cesa-Bianchi, N. and G. Lugosi. Prediction, Learning, and Games. Cambridge University
Press, 2006.
Gaillard, P. and O. Wintenberger. “Sparse Accelerated Exponential Weights”. Accepted
at AISTATS 2017. 2017.
Gerchinovitz, S. “Sparsity regret bounds for individual sequences in online linear
regression”. In: The Journal of Machine Learning Research 14.1 (2013), pp. 729–769.
Kivinen, J. and M. K. Warmuth. “Exponentiated Gradient Versus Gradient Descent for
Linear Predictors”. In: Information and Computation 132.1 (1997), pp. 1–63.
Rigollet, P. and A. Tsybakov. “Exponential screening and optimal rates of sparse
estimation”. In: The Annals of Statistics (2011), pp. 731–771.
Steinwart, I. and A. Christmann. “Estimating conditional quantiles with the help of the
pinball loss”. In: Bernoulli 17.1 (2011), pp. 211–225.
Wintenberger, O. “Optimal learning with Bernstein Online Aggregation”. Extended
version available at arXiv:1404.1356 [stat.ML] (2014).
Xiao, L. “Dual averaging methods for regularized stochastic learning and online
optimization”. In: Journal of Machine Learning Research 11 (2010), pp. 2543–2596.