Online Learning of Noisy Data with Kernels

Nicolò Cesa-Bianchi (Università degli Studi di Milano)
Shai Shalev-Shwartz and Ohad Shamir (The Hebrew University)

COLT, June 2010
Online Learning with Partial Information

Standard online learning: after choosing a predictor, the learner sees an example chosen by an adversary
Harder setting: the learner only receives partial information on each example
Example (Bandit Learning): the learner only gets to see the loss value
This talk: the learner has a noisy view of each example
Main Results

Online learning of linear predictors based on noisy views
O(√T) regret
Noise distribution unknown: it can be chosen adversarially and change for each example
Including kernels
General technique for unbiased estimators of nonlinear functions
Online Learning of Linear Predictors

On each round t:
  Learner picks a predictor w_t ∈ W
  Nature picks (x_t, y_t)
  Learner suffers loss ℓ(⟨w_t, x_t⟩, y_t)
  Learner gets y_t and a noisy view of x_t
Learner's goal: minimize the regret

  Σ_{t=1}^T ℓ(⟨w_t, x_t⟩, y_t) − min_{w ∈ W} Σ_{t=1}^T ℓ(⟨w, x_t⟩, y_t)
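To fix ideas, here is a minimal Python sketch of one pass through this protocol. The data, loss, and noise model are placeholder choices for illustration only; the learner update is deliberately omitted (see the gradient descent sketch later in the talk).

# Minimal sketch of the online protocol: the learner commits to w_t, suffers
# loss on the clean example, but only observes a noisy view of x_t.
import numpy as np

rng = np.random.default_rng(0)
d, T = 5, 100
w = np.zeros(d)                          # learner's predictor w_t in W
cum_loss = 0.0
clean_examples = []                      # kept only to evaluate regret afterwards

for t in range(T):
    x = rng.normal(size=d)               # nature picks (x_t, y_t)
    y = 1.0 if x[0] > 0 else -1.0
    cum_loss += (w @ x - y) ** 2         # learner suffers l(<w_t, x_t>, y_t)
    x_noisy = x + rng.normal(size=d)     # learner only observes y_t and a noisy x_t
    clean_examples.append((x, y))

# Regret compares cum_loss against min over w in W of the total loss on the
# clean examples; that benchmark uses clean data the learner never sees.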
First Try

Suppose the learner gets x̃_t = x_t + n_t, where n_t is a random zero-mean noise vector

Unfortunately, too hard!

Theorem
If an adversary can choose the noise distribution, and ℓ(·, 1) is
  1. bounded from below
  2. differentiable at 0 with ℓ′(0, 1) < 0 (a.k.a. classification calibrated),
then sublinear regret is impossible

Holds even in a stochastic setting
First Try

Suppose the data looks like this: [figure: a single observed density, with label values Y = +3 and Y = −2]

Option A: Data comes from [figure: densities for Y = +3 and Y = −2] + no noise
Option B: Data comes from [figure: a different pair of densities for Y = +3 and Y = −2] + noise

Information-theoretically impossible to distinguish!
First Try

Must provide more information to the learner
Suppose the learner can get more than one independent copy of x̃_t = x_t + n_t
Trivial (and unrealistic) setting if an unlimited number of copies is available
Goal: a small number of views, independent of problem scale
Important Technique

Stochastic Online Gradient Descent
  Initialize w_1 = 0
  For t = 1, ..., T:
    w_{t+1} = w_t − η_t ∇̃_t
    Project w_{t+1} onto the ball {w : ‖w‖² ≤ W}

When ∇̃_t = ∇ℓ(⟨w_t, x_t⟩, y_t), this is standard online gradient descent (Zinkevich 2003)

Theorem
If E[∇̃_t] = ∇ℓ(⟨w_t, x_t⟩, y_t) and E[‖∇̃_t‖²] ≤ B, the expected regret is at most O(√(BWT))
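Below is a minimal Python sketch of this stochastic online gradient descent scheme. The step-size schedule and the `grad_estimate` oracle name are illustrative assumptions, not part of the talk; grad_estimate(t, w) is expected to return an unbiased estimate of the gradient of ℓ(⟨w, x_t⟩, y_t), e.g. built from noisy copies of x_t.

# A minimal sketch of stochastic online gradient descent with projection.
import numpy as np

def project(w, W):
    """Project w onto the ball {w : ||w||^2 <= W}."""
    norm_sq = float(w @ w)
    return w if norm_sq <= W else w * np.sqrt(W / norm_sq)

def stochastic_ogd(grad_estimate, d, T, W, eta0=0.1):
    """Run T rounds of w_{t+1} = Project(w_t - eta_t * g_t)."""
    w = np.zeros(d)
    iterates = []
    for t in range(T):
        iterates.append(w.copy())
        g = grad_estimate(t, w)                        # E[g] = exact gradient
        w = project(w - eta0 / np.sqrt(t + 1) * g, W)  # decaying step size eta_t
    return iterates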
Unknown Noise

Example (Linear predictors, squared loss)
Gradient is 2(⟨w_t, x_t⟩ − y_t) x_t
Unbiased estimate with 2 independent noisy copies x̃_t, x̃′_t of x_t:
  2(⟨w_t, x̃_t⟩ − y_t) x̃′_t
⇒ Can learn in the face of unknown noise
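As a quick illustration of this two-copy trick, here is a small Python sanity check; the Gaussian noise and the specific numbers are made up for illustration (any zero-mean noise on the two independent copies gives an unbiased estimate).

# Sanity check of the two-copy unbiased gradient estimate for squared loss.
import numpy as np

def sq_loss_grad_estimate(w, x_tilde, x_tilde_prime, y):
    """Unbiased estimate of 2(<w, x> - y) x from two independent noisy copies."""
    return 2.0 * (w @ x_tilde - y) * x_tilde_prime

rng = np.random.default_rng(0)
x = np.array([1.0, -2.0, 0.5])
y, w = 1.0, np.array([0.3, 0.1, -0.2])
true_grad = 2.0 * (w @ x - y) * x
est = np.mean([sq_loss_grad_estimate(w, x + rng.normal(size=3),
                                     x + rng.normal(size=3), y)
               for _ in range(100000)], axis=0)
print(est, true_grad)   # the two vectors should be close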
What if we want other loss functions? Non-linear predictors?

Note: the technique depended on the loss gradient being quadratic in x. It won't work otherwise!

Next: how we can learn with unknown noise using:
  General 'smooth' loss functions
  Non-linear predictors using kernels
Kernels

Allow us to learn highly non-linear predictors
Idea: instances x are mapped to Ψ(x) in a high-dimensional Hilbert space, and a linear predictor is learned in that space
Problematic for our setting: Ψ may be complex and non-linear. In particular, E[Ψ(x̃)] ≠ Ψ(x)
Idea: Unbiased Estimate of Non-Linear Functions

Suppose X is a real random variable with unknown distribution and unknown mean µ. We can sample x_1, x_2, ...
Want an unbiased estimate of f(µ)
If f is linear: return f(x_1), since E[f(x_1)] = f(E[x_1]) = f(µ)
When f is nonlinear, E[f(x_1)] ≠ f(µ). In many cases, an unbiased estimate of f(µ) based on x_1, x_2, ..., x_n is provably impossible
However: what if n is random?
Idea: Unbiased Estimate of Non-Linear Functions

Suppose f is a continuous function on a bounded interval
There exist polynomials A_0(·), A_1(·), ..., where A_n(x) = Σ_{k=0}^n a_{n,k} x^k, such that A_n(·) → f(·) as n → ∞
Let p_0, p_1, ... be a distribution over all nonnegative integers

Estimator
  1. Pick n randomly according to Pr(n) = p_n
  2. Sample x_1, ..., x_n independently
  3. Return

     θ = (1/p_n) Σ_{k=0}^n a_{n,k} Π_{i=1}^k x_i − (1/p_n) Σ_{k=0}^{n−1} a_{n−1,k} Π_{i=1}^k x_i

Theorem
E[θ] = f(µ)
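For concreteness, here is a small Python sketch of this estimator, instantiated (as an assumption for illustration, not the talk's exact construction) with f(µ) = exp(µ), the Taylor partial sums A_n(x) = Σ_{k≤n} x^k/k!, and a geometric distribution p_n = (1 − 1/q) q^{−n}.

# Minimal sketch of the randomized unbiased estimator of f(mean), for f = exp.
import math
import numpy as np

def exp_of_mean_estimate(sample, q=2.0, rng=None):
    """One unbiased estimate of exp(E[X]); `sample()` draws one copy of X."""
    rng = rng or np.random.default_rng()
    n = int(rng.geometric(1.0 - 1.0 / q)) - 1       # Pr(n) = (1 - 1/q) q^{-n}, n >= 0
    p_n = (1.0 - 1.0 / q) * (1.0 / q) ** n
    xs = [sample() for _ in range(n)]               # n independent samples
    # For exp, A_n - A_{n-1} keeps only the degree-n Taylor term x^n / n!,
    # whose unbiased estimate is the product of the n independent samples.
    return float(np.prod(xs)) / (math.factorial(n) * p_n)

# Sanity check with a hypothetical X ~ N(mu, 1):
rng = np.random.default_rng(0)
mu = 0.5
ests = [exp_of_mean_estimate(lambda: rng.normal(mu, 1.0), q=2.0, rng=rng)
        for _ in range(200000)]
print(np.mean(ests), math.exp(mu))   # should be close: the estimator is unbiased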
Idea: Unbiased Estimate of Non-Linear Functions

Proof

  θ = (1/p_n) Σ_{k=0}^n a_{n,k} Π_{i=1}^k x_i − (1/p_n) Σ_{k=0}^{n−1} a_{n−1,k} Π_{i=1}^k x_i

where the first term equals A_n(µ) in expectation and the second equals A_{n−1}(µ) in expectation.
Therefore,

  E[θ] = E_n[(1/p_n)(A_n(µ) − A_{n−1}(µ))] = Σ_{n=1}^∞ (A_n(µ) − A_{n−1}(µ)) = f(µ) − A_0(µ) = f(µ),

where the series telescopes and the last step uses A_0 ≡ 0.
Idea: Unbiased Estimate of Non-Linear Functions

Technique used in a 1960s paper on sequential estimation (R. Singh, 1964)
Crucial observation: if p_n decays rapidly, then with overwhelming probability only a small number of samples is needed
When f is analytic, can take p_n ∝ 1/q^n for arbitrary q > 1
We use this technique to learn with noise, using large families of kernels and analytic loss functions
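As a back-of-the-envelope check (our own calculation, not from the slides) of why a rapidly decaying p_n keeps the number of samples small, take the geometric choice p_n = (1 − 1/q) q^{−n} over n = 0, 1, 2, ...:

\[
  \mathbb{E}[n] \;=\; \sum_{n=0}^{\infty} n\,(1 - 1/q)\,q^{-n}
  \;=\; \frac{1/q}{1 - 1/q}
  \;=\; \frac{1}{q - 1},
\]

so only an expected 1/(q − 1) extra samples are drawn per estimate, consistent with the 1 + O_p(1/q) queries per example in the formal result below.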
Formal Result - Example

Consider any dot-product kernel k(x, x′) = G(⟨x, x′⟩), e.g. k(x, x′) = (⟨x, x′⟩ + 1)^n

Example (Dot-product kernel, squared loss)
Suppose E[‖x̃‖²] ≤ B. For any q > 1.1, we can construct an efficient algorithm which:
  Queries each example 1 + O_p(1/q) times
  Has regret O(√(W G(qB) q T)) w.r.t. {w : ‖w‖² ≤ W}

Tradeoff: a large q means fewer queries per example, but larger regret
Smoothed Losses

[figure: the absolute loss and hinge loss, together with their smoothed variants (s² = 1)]
Summary

Online learning with noise
Noise distribution may be chosen adversarially
Quantity makes quality: more examples make up for the poor quality of each individual example seen
General technique to construct unbiased estimators of nonlinear functions
Can it be improved?
Upcoming work: yes, if we know more about the noise distribution