STATISTICAL BEHAVIOR AND CONSISTENCY
OF CLASSIFICATION METHODS BASED ON
CONVEX RISK MINIMIZATION
Tong Zhang
The Annals of Statistics, 2004
Outline
◮ Motivation
◮ Approximation error under convex risk minimization
◮ Examples of approximation error analysis
◮ Universal approximation and consistency
Motivation
◮ Machine learning: predict y for a given x.
◮ Relationship: y ∼ f(x), where f is called the classification function.
◮ Criterion: minimize a problem-dependent loss ℓ(f(x), y).
◮ Usual assumption: the data (X, Y) are drawn i.i.d. from a common but unknown distribution D(X, Y).
◮ Explicit formula for the expected loss:
    L(f(·)) = E_{X,Y} ℓ(f(X), Y).
Motivation
◮ This paper deals with the binary problem, namely y ∈ {±1}.
◮ Prediction rule: y = 1 if f(x) ≥ 0 and y = −1 if f(x) < 0.
◮ The classification error of f is given by:
    I(f(x), y) = 1 if f(x)y < 0,
                 1 if f(x) = 0 and y = −1,
                 0 otherwise.
The empirical error is given by:
    (1/n) Σ_{i=1}^{n} I(f(X_i), Y_i).    (1)
Motivation
◮ However, minimizing the previous formula can be NP-hard, because it is not convex.
◮ We need a surrogate loss φ(·, ·) that makes the computation easier.
◮ When φ(f, y) depends only on the margin yf, i.e. φ(f, y) = φ(yf), the method is called a large-margin classifier.
◮ For example, AdaBoost employs the exponential loss φ = exp(−yf) and the SVM employs the hinge loss φ = [1 − yf]_+.
◮ Now, instead of (1), we minimize the empirical risk:
    (1/n) Σ_{i=1}^{n} φ(f(X_i) Y_i).    (2)
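As a quick illustration (the toy data and the fixed linear classifier w are assumptions, not from the paper), the quantities (1) and (2) can be computed side by side; note that the hinge surrogate upper-bounds the 0-1 error pointwise:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data and a fixed linear classifier f(x) = w.x (illustrative assumptions).
X = rng.normal(size=(200, 2))
y = np.where(X @ np.array([1.0, 0.5]) + 0.3 * rng.normal(size=200) > 0, 1, -1)
w = np.array([1.0, 0.5])
f = X @ w
margins = y * f                                  # Y_i * f(X_i)

# (1): empirical classification error; f(x) = 0 with y = -1 counts as an error.
zero_one = np.mean((margins < 0) | ((f == 0) & (y == -1)))

# (2): empirical surrogate risk with the hinge loss phi(v) = [1 - v]_+.
hinge_risk = np.mean(np.maximum(1.0 - margins, 0.0))

print(zero_one, hinge_risk)
```

Since the indicator of a mistake never exceeds the hinge value at the same margin, (2) always dominates (1) on the same sample.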
Motivation
◮ The minimization of (1) can be regarded as an approximation to the true classification error:
    L(f(·)) = E_{X,Y} I(f(X), Y).    (3)
◮ The minimization of (2) can be regarded as an approximation to the true risk:
    Q(f(·)) = E_{X,Y} φ(f(X)Y).    (4)
◮ In this paper the author studies the impact of using φ.
Motivation
◮ The paper considers the following five loss functions:
    ◮ Least squares: φ(v) = (1 − v)².
    ◮ Modified least squares: φ(v) = max(1 − v, 0)².
    ◮ SVM (hinge): φ(v) = [1 − v]_+.
    ◮ Exponential: φ(v) = exp(−v).
    ◮ Logistic regression: φ(v) = ln(1 + exp(−v)).
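A minimal sketch of these five losses (the dictionary names are mine; only the formulas come from the slide), with a numerical check that each is convex and penalizes negative margins at least as much as positive ones:

```python
import numpy as np

# The five surrogate losses phi(v) from the slide, with v = y * f(x).
losses = {
    "least_squares":     lambda v: (1 - v) ** 2,
    "mod_least_squares": lambda v: np.maximum(1 - v, 0) ** 2,
    "svm_hinge":         lambda v: np.maximum(1 - v, 0),
    "exponential":       lambda v: np.exp(-v),
    "logistic":          lambda v: np.log1p(np.exp(-v)),  # stable ln(1 + e^{-v})
}

v = np.linspace(-3, 3, 601)
for name, phi in losses.items():
    p = phi(v)
    # Convexity via the midpoint inequality on a grid.
    assert np.all(phi((v[:-1] + v[1:]) / 2) <= (p[:-1] + p[1:]) / 2 + 1e-12), name
    # A wrong-side margin costs at least as much as the mirrored correct one.
    assert phi(-1.0) >= phi(1.0), name
```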
Approximation error under convex risk minimization
◮ In this section the relationship between L(f(·)) and Q(f(·)) is studied.
◮ Rewrite (4) by conditioning on X:
    Q(f(·)) = E_X [η(X)φ(f(X)) + (1 − η(X))φ(−f(X))],    (5)
  where η(x) is the conditional probability P(Y = 1|X = x).
Approximation error under convex risk minimization
◮ L(f(·)) can be written as:
    L(f(·)) = E_X [1{f(X) ≥ 0}(1 − η(X)) + 1{f(X) < 0} η(X)].    (6)
◮ The following notation is very useful in this section:
    Q(η, f) = ηφ(f) + (1 − η)φ(−f).    (7)
◮ Define f*_φ(η) : [0, 1] → R* (where R* is the extended real line) as:
    f*_φ(η) = argmin_{f ∈ R*} Q(η, f),
  and
    Q*(η) = inf_{f ∈ R*} Q(η, f) = Q(η, f*_φ(η)).
Approximation error under convex risk minimization
◮ Define the "excess risk" as:
    ∆Q(η, f) = Q(η, f) − Q(η, f*_φ(η)) = Q(η, f) − Q*(η),
  and
    ∆Q(f(·)) = Q(f(·)) − Q(f*_φ(η(·))) = E_X ∆Q(η(X), f(X)).
Approximation error under convex risk minimization
◮ The above quantities are easy to calculate:
    ◮ Least squares: f*_φ(η) = 2η − 1; Q*(η) = 4η(1 − η).
    ◮ Modified least squares: f*_φ(η) = 2η − 1; Q*(η) = 4η(1 − η).
    ◮ SVM: f*_φ(η) = sign(2η − 1); Q*(η) = 1 − |2η − 1|.
    ◮ Exponential: f*_φ(η) = (1/2) ln(η/(1 − η)); Q*(η) = 2√(η(1 − η)).
    ◮ Logistic regression: f*_φ(η) = ln(η/(1 − η)); Q*(η) = −η ln η − (1 − η) ln(1 − η).
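These closed forms can be checked numerically by minimizing Q(η, f) = ηφ(f) + (1 − η)φ(−f) over a grid of f; a sketch for the three losses with finite, smooth minimizers (grid range and tolerances are my choices):

```python
import numpy as np

phi = {
    "least_squares": lambda v: (1 - v) ** 2,
    "exponential":   lambda v: np.exp(-v),
    "logistic":      lambda v: np.log1p(np.exp(-v)),
}
# Closed-form minimizers f*_phi(eta) and minimal risks Q*(eta) from the slide.
f_star = {
    "least_squares": lambda e: 2 * e - 1,
    "exponential":   lambda e: 0.5 * np.log(e / (1 - e)),
    "logistic":      lambda e: np.log(e / (1 - e)),
}
q_star = {
    "least_squares": lambda e: 4 * e * (1 - e),
    "exponential":   lambda e: 2 * np.sqrt(e * (1 - e)),
    "logistic":      lambda e: -e * np.log(e) - (1 - e) * np.log(1 - e),
}

f = np.linspace(-5, 5, 20001)            # grid over which we minimize Q(eta, f)
for eta in (0.2, 0.5, 0.8):
    for name in phi:
        q = eta * phi[name](f) + (1 - eta) * phi[name](-f)   # Q(eta, f), eq. (7)
        i = int(np.argmin(q))
        assert abs(f[i] - f_star[name](eta)) < 1e-2, (name, eta)
        assert abs(q[i] - q_star[name](eta)) < 1e-4, (name, eta)
```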
Approximation error under convex risk minimization
Theorem
Assume f*_φ(η) > 0 when η > 0.5. Assume there exist c > 0 and s ≥ 1 such that for all η ∈ [0, 1],
    |0.5 − η|^s ≤ c^s ∆Q(η, 0);
then for any measurable function f(x),
    L(f(·)) − L* ≤ 2c ∆Q(f(·))^{1/s},
where L* is the optimal Bayes error L* = L(2η(·) − 1).
Approximation error under convex risk minimization
◮ Proof omitted.
◮ A corollary is omitted in this presentation.
◮ The above theorem shows that if there is a functional relationship between |0.5 − η| and ∆Q(η, 0), we can bound the excess classification error by a transformation of the excess risk.
◮ More specifically, if ∆Q(f(·)) → 0, then L(f(·)) − L* → 0.
◮ The quantity ∆Q(η, f) is the key to the proof of the theorem. A question arises: how do we compute it?
Approximation error under convex risk minimization
◮ Introduction to the Bregman divergence.
◮ For a convex function φ, its Bregman divergence is defined as:
    d_φ(f1, f2) = φ(f2) − φ(f1) − φ′(f1)(f2 − f1).
◮ Here φ′ denotes a subgradient of φ.
◮ For a concave function g, the Bregman divergence is defined as d_g(η1, η2) = d_{−g}(η1, η2).
◮ The Bregman divergence is always non-negative.
◮ A plot on the next slide illustrates the idea.
Approximation error under convex risk minimization
Figure: Bregman divergence (the red line segment).
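A numerical version of the definition (using the exponential loss as an example; that choice is mine):

```python
import numpy as np

def bregman(phi, dphi, f1, f2):
    # d_phi(f1, f2) = phi(f2) - phi(f1) - phi'(f1) * (f2 - f1)
    return phi(f2) - phi(f1) - dphi(f1) * (f2 - f1)

# Example: the exponential loss phi(v) = e^{-v}, with phi'(v) = -e^{-v}.
phi  = lambda v: np.exp(-v)
dphi = lambda v: -np.exp(-v)

grid = np.linspace(-2, 2, 41)
F1, F2 = np.meshgrid(grid, grid)
D = bregman(phi, dphi, F1, F2)
assert np.all(D >= -1e-12)                                # non-negative: phi is convex
assert np.allclose(bregman(phi, dphi, grid, grid), 0.0)   # zero when f1 = f2
```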
Approximation error under convex risk minimization
Lemma
Q*(η) is a concave function of η.
Theorem
If φ is differentiable, then the Bregman divergence is uniquely defined. Furthermore,
    ∆Q(η, p) = η d_φ(f*_φ(η), p) + (1 − η) d_φ(−f*_φ(η), −p).
If f*_φ is differentiable then Q* is also differentiable. Assume p = f*_φ(η̄); then ∆Q(η, p) = d_{Q*}(η̄, η).
Approximation error under convex risk minimization
◮ If f*_φ is invertible, then the inverse function f*_φ^{−1}(f(x)) can serve as a conditional probability estimate.
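For instance (a sketch built from the closed-form f*_φ of the earlier slide; the function names are mine), the inverses are sigmoid-like links from a score f to an estimate of η:

```python
import numpy as np

# Inverse links f*_phi^{-1}: map a classifier output f to an estimate of eta.
def eta_from_logistic(f):
    return 1.0 / (1.0 + np.exp(-f))            # inverse of ln(eta/(1-eta))

def eta_from_exponential(f):
    return 1.0 / (1.0 + np.exp(-2.0 * f))      # inverse of 0.5 ln(eta/(1-eta))

def eta_from_least_squares(f):
    return np.clip((f + 1.0) / 2.0, 0.0, 1.0)  # inverse of 2 eta - 1, clipped

# Round-trip check: eta -> f*_phi(eta) -> eta.
eta = np.linspace(0.01, 0.99, 99)
assert np.allclose(eta_from_logistic(np.log(eta / (1 - eta))), eta)
assert np.allclose(eta_from_exponential(0.5 * np.log(eta / (1 - eta))), eta)
assert np.allclose(eta_from_least_squares(2 * eta - 1), eta)
```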
Examples of approximation error analysis
◮ Least squares.
◮ The Bregman divergence is d_φ(p1, p2) = (p2 − p1)².
◮ ∆Q(η, p) = (2η − 1 − p)².
◮ |η − 0.5|² = 0.5² ∆Q(η, 0).
◮ Thus we can choose c = 0.5 and s = 2.
Examples of approximation error analysis
◮ Modified least squares.
◮ The Bregman divergence is d_φ(p1, p2) = (p2 − p1)² − max(0, p2 − 1)².
◮ ∆Q(η, p) = (2η − 1 − p)² − η max(0, p − 1)² − (1 − η) min(0, p + 1)².
◮ |η − 0.5|² ≤ 0.5² ∆Q(η, 0).
◮ Thus we can choose c = 0.5 and s = 2.
Examples of approximation error analysis
◮ SVM.
◮ Here the Bregman divergence is awkward (the hinge is non-differentiable); it is easier to compute ∆Q(η, p) directly.
◮ ∆Q(η, p) = η max(0, 1 − p) + (1 − η) max(0, 1 + p) − 1 + |2η − 1|.
◮ |η − 0.5| ≤ 0.5 ∆Q(η, 0).
◮ Thus we can choose c = 0.5 and s = 1.
Examples of approximation error analysis
◮ Exponential loss.
◮ The relevant Bregman divergence is that of the concave function Q*(η) = 2√(η(1 − η)).
◮ ∆Q(η, p) = (η − η̄)(e^{−p} − e^{p}) + 2√(η̄(1 − η̄)) − 2√(η(1 − η)), where η̄ = 1/(1 + e^{−2p}).
◮ |η − 0.5|² ≤ (2^{−0.5})² ∆Q(η, 0).
◮ Thus we can choose c = 2^{−0.5} and s = 2.
Examples of approximation error analysis
◮ Logistic regression.
◮ The relevant Bregman divergence is that of the concave function Q*(η) = −η ln η − (1 − η) ln(1 − η).
◮ ∆Q(η, p) = KL(η ‖ 1/(1 + e^{−p})), where KL is the KL-divergence.
◮ |η − 0.5|² ≤ (2^{−0.5})² ∆Q(η, 0).
◮ Thus we can choose c = 2^{−0.5} and s = 2.
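The claimed (c, s) pairs can be verified numerically from the ∆Q(η, 0) formulas in this section (dictionary keys and the grid are my choices):

```python
import numpy as np

eta = np.linspace(1e-6, 1 - 1e-6, 9999)

# Delta Q(eta, 0) = Q(eta, 0) - Q*(eta) for each loss, from the slides.
dq0 = {
    "least_squares": (2 * eta - 1) ** 2,        # also modified least squares
    "svm":           np.abs(2 * eta - 1),
    "exponential":   1 - 2 * np.sqrt(eta * (1 - eta)),
    "logistic":      np.log(2) + eta * np.log(eta) + (1 - eta) * np.log(1 - eta),
}
# The (c, s) pairs claimed in this section.
cs = {"least_squares": (0.5, 2), "svm": (0.5, 1),
      "exponential": (2 ** -0.5, 2), "logistic": (2 ** -0.5, 2)}

for name, (c, s) in cs.items():
    # Check |0.5 - eta|^s <= c^s * Delta Q(eta, 0) over the whole grid.
    assert np.all(np.abs(0.5 - eta) ** s <= c ** s * dq0[name] + 1e-12), name
```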
Examples of approximation error analysis
◮ There is much further discussion in this section, on comparing the probability estimates obtained from different loss functions, and on finding the best c and s.
◮ Please refer to the paper; this is the relatively easy part to read.
Universal approximation and consistency
◮ Now we consider a function class C to which the function f belongs. If inf_{f∈C} ∆Q(f) is small, then any f(x) ∈ C that (approximately) minimizes (4) achieves a classification error close to the optimal Bayes error.
◮ We call a function class C universal with respect to a convex loss function φ if, for any measurable conditional probability function η(x), inf_{f∈C} ∆Q(f(·)) = 0.
Universal approximation and consistency
◮ Next we introduce a universal approximation theorem. First, the following definitions are needed.
◮ Let U ⊂ R^d. Denote by C(U) the Banach space of continuous functions U → R under the uniform-norm topology.
◮ Call a probability measure µ on R^d regular if it is defined on the Borel sets of R^d.
◮ Say that a convex function φ has property A if:
    ◮ φ is continuous and Q* is continuous;
    ◮ φ(p) < φ(−p) for all p > 0;
    ◮ f*_φ(η) ∈ (−∞, +∞) and is piecewise continuous on (0, 1).
Universal approximation and consistency
Lemma
Assume 0 ≤ δ < 0.5. Let η ∈ [0, 1] and ηδ = min(max(η, δ), 1 − δ). If φ has property A, then Q(η, f*_φ(ηδ)) ≤ Q*(ηδ).
◮ A plot of ηδ is shown on the next slide.
◮ The lemma provides a way to avoid the difficult part of the proof of the next theorem, where f*_φ tends to infinity.
Universal approximation and consistency
Figure: The function ηδ, which equals δ for η < δ, η for η ∈ [δ, 1 − δ], and 1 − δ for η > 1 − δ.
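The lemma can be checked numerically, e.g. for the least-squares loss (my choice of example, using its closed-form f*_φ and Q* from earlier slides):

```python
import numpy as np

# Least squares: phi(v) = (1 - v)^2, f*_phi(eta) = 2 eta - 1, Q*(eta) = 4 eta (1 - eta).
def Q(eta, f):
    return eta * (1 - f) ** 2 + (1 - eta) * (1 + f) ** 2

eta = np.linspace(0.0, 1.0, 1001)
for delta in (0.0, 0.1, 0.25, 0.4):
    eta_d = np.minimum(np.maximum(eta, delta), 1 - delta)   # the clipped eta_delta
    lhs = Q(eta, 2 * eta_d - 1)                             # Q(eta, f*_phi(eta_delta))
    rhs = 4 * eta_d * (1 - eta_d)                           # Q*(eta_delta)
    assert np.all(lhs <= rhs + 1e-12), delta                # the lemma's inequality
```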
Universal approximation and consistency
Theorem
Let φ be a convex function with property A. Consider a function class C ⊂ C(U) defined on a Borel set U ⊂ R^d. If C is dense in C(U), then for any regular probability measure µ of x ∈ R^d such that µ(U) = 1, and any conditional probability P(Y = 1|X = x) = η(x),
    inf_{f ∈ C} ∆Q(f(·)) = 0.
Universal approximation and consistency
◮ Consider the class of functions R^d → R consisting of linear combinations of functions of the form h(ω^T x + b), where ω ∈ R^d, b ∈ R and h is a fixed function:
    Ch = { Σ_{i=1}^{k} αi h(ωi^T x + bi) : αi ∈ R, ωi ∈ R^d, bi ∈ R, k ∈ N }.
◮ In the neural-network literature, this function class is well studied and has been proved universal when the function h is sigmoidal.
◮ The next theorem is more general.
Universal approximation and consistency
Theorem
If h is a non-polynomial continuous function, then Ch is dense in C(U) for all compact subsets U of R^d.
◮ A brief introduction to reproducing kernel Hilbert spaces (RKHS).
◮ Consider kernel functions of the form
    Kh([x1, b1], [x2, b2]) = h(x1^T x2 + b1 b2),
  where h can be expressed as a Taylor expansion with non-negative coefficients. Then Kh is a positive definite kernel.
◮ Denote by Hh the RKHS induced by Kh. Also denote f̄(x) = f([x, 1]).
Universal approximation and consistency
◮ Consider the following estimation problem:
    f̂n = arg inf_{f ∈ Hh} [ (1/n) Σ_{i=1}^{n} φ(Yi f̄(Xi)) + (λn/2) ‖f‖² ].
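A sketch of this estimation problem on toy data (h = exp, λn, the data, and the use of scipy's L-BFGS solver are all assumptions for illustration). By the representer theorem, f̄(x) = Σ_j α_j Kh([X_j, 1], [x, 1]) and ‖f‖² = αᵀKα, so the problem becomes a finite-dimensional convex minimization:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(0)

# Toy data with labels in {-1, +1}.
n = 100
X = rng.normal(size=(n, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)

# Kernel K_h with h = exp (entire, nonnegative Taylor coefficients),
# on augmented inputs [x, 1]:  K([x,1],[x',1]) = exp(x.x' + 1).
Xb = np.hstack([X, np.ones((n, 1))])
K = np.exp(Xb @ Xb.T)

# Minimize (1/n) sum_i phi(Y_i f_bar(X_i)) + (lam/2) alpha' K alpha
# with phi the logistic loss phi(v) = ln(1 + e^{-v}).
lam = 0.1

def objective(alpha):
    m = y * (K @ alpha)                       # margins Y_i f_bar(X_i)
    return np.mean(np.logaddexp(0.0, -m)) + 0.5 * lam * alpha @ K @ alpha

def grad(alpha):
    m = y * (K @ alpha)
    dphi = -expit(-m)                         # phi'(m) for the logistic loss
    return K @ (y * dphi) / n + lam * (K @ alpha)

res = minimize(objective, np.zeros(n), jac=grad, method="L-BFGS-B")
alpha = res.x
train_err = np.mean(np.sign(K @ alpha) != y)
print("training error:", train_err)
```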
◮ We have the following theorem, which says we can show consistency by estimating leave-one-out bounds.
Theorem
Let f̂n^{[k]} be the solution of the above formulation with the kth datum removed from the training set; then
    ‖f̂n(·) − f̂n^{[k]}(·)‖ ≤ (2/(λn n)) |φ′(f̂n(Xk)Yk)| h(Xk^T Xk + 1)^{1/2},
where φ′ denotes a subgradient of φ.
Universal approximation and consistency
◮ For simplicity, from now on we assume that P(h(X^T X + 1) ≤ M²) = 1 for a constant M.
Theorem
Under the assumptions of the previous theorem, and assuming that φ is non-negative and P(h(X^T X + 1) ≤ M²) = 1, for all k the expected leave-one-out error can be bounded as
    E Q(f̂^{[k]}) ≤ inf_{f ∈ Hh} [ Q(f̄(·)) + (λn/2) ‖f‖² ] + (2 M² ∆φ²)/(λn n),
where the expectation is with respect to the training samples (X1, Y1), . . . , (Xn, Yn) and
    ∆φ = sup{ |φ′(z)| : |z| ≤ √(2φ(0)/λn) M }.
Universal approximation and consistency
◮ Here is a list of ∆φ bounds for the loss functions considered in this paper:
    ◮ Least squares: ∆φ ≤ √(8/λn) M + 2.
    ◮ Modified least squares: ∆φ ≤ √(8/λn) M + 2.
    ◮ SVM: ∆φ ≤ 1.
    ◮ Exponential: ∆φ ≤ exp(√(2/λn) M).
    ◮ Logistic regression: ∆φ ≤ 1.
◮ We are now in a position to introduce the last theorem of the paper.
Universal approximation and consistency
Theorem
Let h be an entire function with nonnegative Taylor coefficients. For least squares, modified least squares, SVM and logistic regression, choose λn such that λn → 0 and λn n → ∞; for the exponential loss, choose λn such that λn → 0 and λn log² n → ∞. Then for any distribution D whose input probability measure is regular and whose input X is bounded almost everywhere in R^d, we have
    lim_{n→∞} E Q(f̂n(·)) = inf_{f ∈ H̄h} Q(f(·)).
Moreover, if h is not a polynomial, this infimum equals the minimal risk over all measurable functions, so
    lim_{n→∞} E ∆Q(f̂n(·)) = 0,
and the classification error converges to the optimal Bayes error.
End of the presentation
Thank you.
Questions?