DISI, Genova, December 2006
Online Gradient Descent Learning Algorithms
Yiming Ying
(joint work with Massimiliano Pontil)
Department of Computer Science, University College London
Outline
• Introduction
–General learning setting
–Online gradient descent algorithm
• Main results
–Generalization error
–Implications: consistency
–Error rates
• Discussions and comparisons
• Conclusions and questions
Introduction
• Learning theory model
– Input sample space X: a subset of the Euclidean space R^d
– Output (label) space Y: a subset of R
– Distribution ρ on X × Y: ρ(x, y) = ρX(x) ρ(y|x)
– Loss function: L(f(x), y) = (y − f(x))²
– Statistical assumption: the labeled sample sequence
S = {zj = (xj, yj) : j = 1, 2, · · · }
is drawn independently and identically distributed (i.i.d.) according to ρ.
• Goal of learning
– Given the sample S, find a function f in a suitable hypothesis space such that the true error

E(f) := ∫_{X×Y} (f(x) − y)² dρ(x, y)

is close to the smallest true error E(fρ) = inf{E(f) : f : X → R}, where fρ is the regression function

fρ(x) := ∫_Y y dρ(y|x).
Note that

‖f − fρ‖²ρX = ∫_X (f(x) − fρ(x))² dρX(x) = E(f) − E(fρ).
– Equivalent approximation problem:
find an approximator f in a hypothesis space such that ‖f − fρ‖²ρ is small
• Hypothesis space assumption
– Hypothesis space: a reproducing kernel Hilbert space HK (RKHS)
– Gaussian kernel: K(x, x′) = e^{−σ‖x − x′‖²}
– Polynomial kernel: K(x, x′) = (1 + ⟨x, x′⟩)^n
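For concreteness, a minimal Python sketch of these two kernels (not from the slides; the bandwidth σ and degree n defaults are illustrative choices):

    import numpy as np

    def gaussian_kernel(x, xp, sigma=1.0):
        # K(x, x') = exp(-sigma * ||x - x'||^2)
        x, xp = np.asarray(x, dtype=float), np.asarray(xp, dtype=float)
        return np.exp(-sigma * np.sum((x - xp) ** 2))

    def polynomial_kernel(x, xp, n=3):
        # K(x, x') = (1 + <x, x'>)^n
        return (1.0 + np.dot(x, xp)) ** n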
• Batch learning algorithm
Uses the whole data set St = {z1, · · · , zt} at once
–Tikhonov regularization:
Cucker and Smale; Evgeniou, Pontil and Poggio; Smale and Zhou; De Vito, Verri et al., from
different perspectives: regularization networks, approximation theory, inverse problems, etc.
fSt,λ = arg min_{f∈HK} { (1/t) ∑_{j=1}^t (yj − f(xj))² + λ‖f‖²K },  λ > 0
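A minimal sketch of this batch estimator (illustrative names, not the authors' code), using the representer-theorem form fSt,λ = ∑_j cj Kxj, whose coefficients solve the linear system (K + λtI)c = y with Kij = K(xi, xj):

    import numpy as np

    def tikhonov_regression(X, y, kernel, lam):
        # Minimizes (1/t) * sum_j (y_j - f(x_j))^2 + lam * ||f||_K^2 over f in H_K.
        t = len(y)
        K = np.array([[kernel(xi, xj) for xj in X] for xi in X])   # kernel matrix
        c = np.linalg.solve(K + lam * t * np.eye(t), np.asarray(y, dtype=float))
        return lambda x: sum(ci * kernel(xi, x) for ci, xi in zip(c, X))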
–Gradient descent boosting:
fk+1 = fk − (ηk/t) ∑_{j=1}^t (fk(xj) − yj) Kxj.
Yao, Rosasco and Caponnetto: an early stopping rule in HK replaces the regularization term.
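A minimal sketch of this batch iteration (illustrative, not the authors' code); the number of gradient steps, i.e. the length of step_sizes, plays the role of the early stopping rule:

    import numpy as np

    def batch_gradient_descent(X, y, kernel, step_sizes):
        # f_{k+1} = f_k - (eta_k / t) * sum_j (f_k(x_j) - y_j) K_{x_j},  f_1 = 0,
        # with the iterate stored as coefficients c of f = sum_j c_j K_{x_j}.
        t = len(y)
        K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
        y = np.asarray(y, dtype=float)
        c = np.zeros(t)
        for eta_k in step_sizes:              # len(step_sizes) = early stopping time
            c -= (eta_k / t) * (K @ c - y)    # K @ c gives (f_k(x_1), ..., f_k(x_t))
        return lambda x: sum(ci * kernel(xi, x) for ci, xi in zip(c, X))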
Stochastic Online learning in RKHS
Uses the data one by one
–Online regularized learning algorithm
fj+1 = fj − ηj ((fj(xj) − yj)Kxj + λfj),  ∀j ∈ N, e.g. f1 = 0,
where λ > 0 is a regularization parameter and {ηj} are the step sizes (learning rates).
Kivinen, Smola and Williamson; Smale and Yao; Ying-Zhou et al.
– The online algorithm studied here (see the sketch below):

fj+1 = fj − ηj (fj(xj) − yj) Kxj,  ∀j ∈ N, e.g. f1 = 0.    (1)
– Type I: {ηj, j ∈ N} a universal sequence (independent of the sample number)
– Type II: {ηj = η(t) : j = 1, · · · , t} depending on the sample number t
– Aims of our analysis for the above algorithm:
– stochastic generalization error bounds for E‖ft+1 − fρ‖²ρ in terms of the step sizes
and the approximation property of HK;
– the choice of step sizes that guarantees (weak) consistency: E‖ft+1 − fρ‖²ρ →
inf_{f∈HK} ‖f − fρ‖²ρ as t → ∞.
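A minimal Python sketch of the online update (1) (illustrative names, not the authors' code); the iterate is kept as the kernel expansion ft+1 = ∑_{j≤t} cj Kxj, and each sample is used exactly once:

    def online_gradient_descent(samples, step_sizes, kernel):
        # f_{j+1} = f_j - eta_j (f_j(x_j) - y_j) K_{x_j},  f_1 = 0.
        centers, coeffs = [], []

        def f(x):                                   # evaluate the current iterate
            return sum(c * kernel(xc, x) for c, xc in zip(coeffs, centers))

        for eta_j, (x_j, y_j) in zip(step_sizes, samples):
            residual = f(x_j) - y_j                 # f_j(x_j) - y_j
            centers.append(x_j)
            coeffs.append(-eta_j * residual)        # adds -eta_j (f_j(x_j) - y_j) K_{x_j}
        return f

    # Type I:  step_sizes = [j ** (-theta) / mu for j in range(1, t + 1)]  (theta, mu illustrative)
    # Type II: step_sizes = [eta_of_t] * t  (a constant depending on the sample number t)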
Main results
Type I: Step sizes are a universal sequence
Generalization error
Define the K-functional: K(s, fρ) := inf_{f∈HK} {‖f − fρ‖ρ + s‖f‖K}, s > 0.
Theorem 1. Let θ ∈ (0, 1) and {ηj = j^{−θ}/μ : j ∈ N} with some constant μ ≥ μ(θ). Then,
for any t ∈ N,

E‖ft+1 − fρ‖²ρ ≤ [K(b_{θ,μ} t^{−(1−θ)/2}, fρ)]² + O(t^{− min{θ,1−θ}} ln t).
– Implication for consistency: the K-functional K(·, fρ) is non-decreasing, concave,
and lim_{s→0+} K(s, fρ) = inf_{f∈HK} ‖f − fρ‖ρ
=⇒ consistency: lim_{t→∞} E‖ft+1 − fρ‖²ρ = inf_{f∈HK} ‖f − fρ‖²ρ.
–Error rates
We assume that fρ has some smoothness. Define LK : L²ρX → L²ρX by

LK f(x) = ∫_X K(x, y) f(y) dρX(y),  x ∈ X, f ∈ L²ρX.
The fractional range space LK^β(L²ρX): the range space of LK^β.
Theorem 2. Let θ ∈ (0, 1) and μ(θ) be absolute constants depending on θ. If fρ ∈ LK^β(L²ρX)
with some 0 < β ≤ 1/2, then, by selecting ηj = j^{−2β/(2β+1)} / μ(2β/(2β+1)) for j ∈ N, for any
t ∈ N there holds

EZ^t ‖ft+1 − fρ‖²ρ = O(t^{−2β/(2β+1)} ln t).    (2)
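A sketch of how Theorem 2 follows from Theorem 1, assuming the standard approximation-theoretic fact (not stated on the slides) that fρ ∈ LK^β(L²ρX) with 0 < β ≤ 1/2 implies K(s, fρ) = O(s^{2β}):

    % Balancing the two terms of Theorem 1 under K(s, f_rho) = O(s^{2 beta}):
    \[
      \bigl[K\bigl(b_{\theta,\mu}\,t^{-(1-\theta)/2}, f_\rho\bigr)\bigr]^2
        = O\bigl(t^{-2\beta(1-\theta)}\bigr),
      \qquad
      \text{sample term} = O\bigl(t^{-\min\{\theta,1-\theta\}}\ln t\bigr).
    \]
    % For theta <= 1/2 the two exponents match when 2 beta (1 - theta) = theta, i.e.
    \[
      \theta = \frac{2\beta}{2\beta+1} \;\Bigl(\le \tfrac12 \text{ since } \beta \le \tfrac12\Bigr),
      \qquad
      \text{giving the rate } O\bigl(t^{-2\beta/(2\beta+1)}\ln t\bigr).
    \]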
Type II: Step sizes depending on sample number
–Generalization error
Theorem 3. Let {ηj = η : j ∈ N}. Then, we have that

E‖ft+1 − fρ‖²ρ ≤ [K((ηt)^{−1/2}, fρ)]² + O(η ln t).
– Rule of early stopping: trade off K((ηt)^{−1/2}, fρ) against O(η ln t) =⇒ a stopping rule
t = t(η) ensuring that the bound tends to zero as η → 0+.
Equivalently, from the perspective of choosing the step size, η = η(t).
– Implication for (weak) consistency: choose step sizes (depending on the sample number t)
with lim_{t→∞} η(t) ln t = 0 and lim_{t→∞} tη(t) = ∞ =⇒ consistency.
– Error rates
Theorem 4. Let {ηj = η : j = 1, 2, · · · , t}. If fρ ∈ LK^β(L²ρX) for some β > 0 then, by
choosing

η := β / (64(1 + κ)^4 (2β + 1)) · t^{−2β/(2β+1)},

we have that

E‖ft+1 − fρ‖²ρ = O(t^{−2β/(2β+1)} ln t).
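The choice of η in Theorem 4 can be motivated by a similar balance. Since the K-functional bound of Theorem 3 saturates at β = 1/2, the sketch below instead assumes the standard spectral estimate ‖∏_{j=1}^t (I − ηLK) LK^β‖ = O((ηt)^{−β}) (an assumption here, not stated on the slides):

    % Approximation error under f_rho = L_K^beta g, beta > 0, constant step size eta:
    \[
      \|\omega_1^t(L_K) f_\rho\|_\rho
        \le \Bigl\|\prod_{j=1}^t (I - \eta L_K)\, L_K^{\beta}\Bigr\| \, \|L_K^{-\beta} f_\rho\|_\rho
        = O\bigl((\eta t)^{-\beta}\bigr).
    \]
    % Balancing (eta t)^{-2 beta} against eta ln t with eta ~ t^{-a}:
    \[
      2\beta(1-a) = a \iff a = \frac{2\beta}{2\beta+1},
      \qquad\text{so } \eta \propto t^{-2\beta/(2\beta+1)}
      \text{ and the rate is } O\bigl(t^{-2\beta/(2\beta+1)}\ln t\bigr).
    \]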
Discussions and Comparisons
Comparisons are based on the same assumption fρ ∈ LK^β(L²ρX).
– Our error rates for online gradient descent algorithm (1):
(I) O(t^{−2β/(2β+1)} ln t) with β ∈ (0, 1/2] for a universal sequence {ηj, j ∈ N}
(II) O(t^{−2β/(2β+1)} ln t) with β > 0 for {ηj = η(t) : j = 1, · · · , t} depending on the sample number
– Batch Tikhonov regularization (Zhang; Smale and Zhou): O(t^{−2β/(2β+1)}) with β ∈ (0, 1]
Discussions continued
– Online regularized algorithm: choosing λ = λ(t) > 0 appropriately,
Yao and Smale; Ying-Zhou: O(t^{−2β/(2β+2)} ln t) with β ∈ (0, 1]
Pontil and Ying: O(t^{−2β/(2β+1)} ln t) with β ∈ (0, 1]
– The rate O(t^{−2β/(2β+1)}) is capacity-independent (eigenvalue-independent) optimal (only an
assumption on fρ, no assumption on the decay of the eigenvalues of LK), as implied by
Caponnetto and De Vito.
Ideas of Proof
Three main steps.
• Error decomposition
–Rewrite the online algorithm (1):
fj+1 − fρ = (I − ηj LK)(fj − fρ) + ηj [LK(fj − fρ) + (yj − fj(xj))Kxj]
I : the identity operator
– Define B(fj , zj ) := LK (fj − fρ ) + (yj − fj (xj ))Kxj . Then Ezj B(fj , zj ) = 0.
Set ω_k^t(LK) := ∏_{j=k}^t (I − ηj LK) and ω_{t+1}^t(LK) := I.
– ft+1 − fρ = −ω_1^t(LK) fρ + ∑_{j=1}^t ηj ω_{j+1}^t(LK) B(fj, zj).
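Unrolling the recursion step by step gives this decomposition; a short check using only the definitions above:

    \[
      f_{j+1} - f_\rho = (I - \eta_j L_K)(f_j - f_\rho) + \eta_j B(f_j, z_j),
    \]
    % so, iterating from f_1 = 0 with omega_k^t(L_K) = prod_{j=k}^t (I - eta_j L_K):
    \[
      f_{t+1} - f_\rho
        = \omega_1^t(L_K)(f_1 - f_\rho) + \sum_{j=1}^t \eta_j\,\omega_{j+1}^t(L_K) B(f_j, z_j)
        = -\omega_1^t(L_K) f_\rho + \sum_{j=1}^t \eta_j\,\omega_{j+1}^t(L_K) B(f_j, z_j).
    \]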
Proof Continued
Proposition 1. For any t ∈ N, E‖ft+1 − fρ‖²ρ is bounded by

‖ω_1^t(LK) fρ‖²ρ + 2(1 + κ)^4 ∑_{k=1}^t E(fk) ηk² / (∑_{j=k+1}^t ηj + 1),

where the first term is the approximation error and the second term the cumulative sample error.

Remark 1. The standard cumulative loss ∑_{k=1}^t (yk − fk(xk))² has been extensively studied in
the online learning community: Cesa-Bianchi, Warmuth, Smola et al.
Here a weighted cumulative loss appears instead: ∑_{k=1}^t (yk − fk(xk))² ηk² / (∑_{j=k+1}^t ηj + 1).
Proof Continued
Sketch of Proof for Proposition 1
E‖ft+1 − fρ‖²ρ = ‖ω_1^t(LK) fρ‖²ρ + E‖∑_{j=1}^t ηj ω_{j+1}^t(LK) B(fj, zj)‖²ρ
− 2 E⟨ω_1^t(LK) fρ, ∑_{j=1}^t ηj ω_{j+1}^t(LK) B(fj, zj)⟩ρ,

and the cross term is zero since Ezj B(fj, zj) = 0. For the middle term,

EZ^t ‖∑_{k=1}^t ηk ω_{k+1}^t(LK) B(fk, zk)‖²ρ = ∑_{k=1}^t ηk² EZ^k ‖ω_{k+1}^t(LK) B(fk, zk)‖²ρ
≤ ∑_{k=1}^t ηk² ‖ω_{k+1}^t(LK) LK^{1/2}‖² EZ^k ‖B(fk, zk)‖²K,

together with
– Ezk ‖B(fk, zk)‖²K ≤ c · E(fk),
– ‖ω_{k+1}^t(LK) LK^{1/2}‖² ≤ 2(1 + κ)² / (∑_{j=k+1}^t ηj + 1).
Proof Continued
• Approximation error:
For any f ∈ HK,

‖ω_1^t(LK) fρ‖ρ ≤ ‖ω_1^t(LK)(f − fρ)‖ρ + ‖ω_1^t(LK) f‖ρ
≤ ‖f − fρ‖ρ + ‖ω_1^t(LK) LK^{1/2}‖ ‖LK^{−1/2} f‖ρ
≤ ‖f − fρ‖ρ + 2(1 + κ) (∑_{k=1}^t ηk + 1)^{−1/2} ‖f‖K.

Taking the infimum over f ∈ HK yields the K-functional bound

‖ω_1^t(LK) fρ‖ρ ≤ K(2(1 + κ) (∑_{k=1}^t ηk + 1)^{−1/2}, fρ).
Proof Continued
• Cumulative sample error:

∑_{k=1}^t E(fk) ηk² / (∑_{j=k+1}^t ηj + 1) ≤ ( sup_{k=1,··· ,t} E(fk) ) ∑_{k=1}^t ηk² / (∑_{j=k+1}^t ηj + 1).
– Uniform bound on E(fk): using ‖fρ‖ρ and E(fρ).
– Estimate of ∑_{k=1}^t ηk² / (∑_{j=k+1}^t ηj + 1): for instance, with ηj = O(j^{−θ}) and θ ∈ (0, 1),

∑_{k=1}^t ηk² / (∑_{j=k+1}^t ηj + 1) = O(t^{− min{θ,1−θ}} ln t).
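A quick numerical illustration of this estimate (not part of the proof; θ = 0.3 and the values of t are arbitrary choices):

    import numpy as np

    def weighted_sum(t, theta):
        # sum_{k=1}^t eta_k^2 / (sum_{j=k+1}^t eta_j + 1), with eta_j = j^{-theta}
        eta = np.arange(1, t + 1, dtype=float) ** (-theta)
        tails = np.concatenate([np.cumsum(eta[::-1])[::-1][1:], [0.0]])  # sum_{j=k+1}^t eta_j
        return np.sum(eta ** 2 / (tails + 1.0))

    theta = 0.3
    for t in (10 ** 3, 10 ** 4, 10 ** 5):
        rate = t ** (-min(theta, 1.0 - theta)) * np.log(t)
        print(t, weighted_sum(t, theta) / rate)   # the ratio should stay bounded in t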
Conclusions
The online gradient descent algorithm is a simple yet competitive algorithm:
– It is statistically consistent (for two types of step sizes)
– Error rates are essentially the same as classical batch regularized learning and online
regularized learning algorithms
– Optimal capacity independent error rates
Questions
– HK-norm error ‖ft+1 − fρ‖K with the universal polynomially decaying step sizes
– Probability inequality estimates, almost sure convergence (strong consistency)
– Generalization error analysis with non-i.i.d. data
– Can we directly use standard cumulative error bounds to bound generalization error?
Thank you!
Merry Christmas and Happy New Year!