Machine Learning and Applications
Christoph Lampert
Spring Semester 2014/2015
Lecture 3
Learning from Data
(At least) three approaches to learn from training data D.
Definition
Given a training set D, we call the learning approach
• a generative probabilistic approach:
  if we use D to build a model p̂(x, y) of p(x, y), and then define
      c(x) := argmax_{y∈Y} p̂(x, y)   or   c_ℓ(x) := argmin_{y∈Y} E_{ȳ∼p̂(x,ȳ)} ℓ(ȳ, y);
• a discriminative probabilistic approach:
  if we use D to build a model p̂(y|x) of p(y|x) and define
      c(x) := argmax_{y∈Y} p̂(y|x)   or   c_ℓ(x) := argmin_{y∈Y} E_{ȳ∼p̂(ȳ|x)} ℓ(ȳ, y);
• a risk minimization approach: if we use D to directly search for a classifier c in a hypothesis class H.
Discriminative Probabilistic Models
Observation
Task: spam classification, X = {all possible emails}, Y = {spam, ham}.
What is, e.g., p(x|ham)?
For every possible email, a value for how likely it is to see that email, including:
• all possible languages,
• all possible topics,
• an arbitrary length,
• all possible spelling mistakes, etc.
This sounds much harder than just deciding whether an email is spam or not!

"When solving a problem, do not solve a more difficult (general) problem as an intermediate step."
(Vladimir Vapnik, 1998)
Observation
The optimal classifier is
    c(x) := argmax_{y∈Y} p(y|x)   for any x ∈ X.
We do not have to model the complete p(x, y); modeling p(y|x) is enough.
Let's use D to estimate p(y|x).
Example (Spam Classification)
Is p(y|x) really easier than, e.g., p(x|y)?
• p("buy v1agra!"|spam) is some (small) positive value
• p(spam|"buy v1agra!") is almost surely 1.
For p(y|x) we treat x as given; we do not need to determine its probability.
Nonparametric Discriminative Model
Idea: split X into regions, for each region store an estimate p̂(y|x).
(Figure: the input space X split into regions, each annotated with estimates such as p̂(1|x) = 0.9, p̂(2|x) = 0.0, p̂(3|x) = 0.1 in one region and p̂(1|x) = 0.1, p̂(2|x) = 0.8, p̂(3|x) = 0.1 in another.)
Nonparametric Discriminative Model
Idea: split X into regions; for each region store an estimate p̂(y|x).
For example, using a decision tree:
• training: build a tree
• prediction: for a new example x, find its leaf
• output p̂(y|x) = n_y / n, where
  – n is the number of training examples in the leaf,
  – n_y is the number of examples with label y in the leaf.
Note: the prediction rule
    c(x) = argmax_y p̂(y|x)
predicts the most frequent label in each leaf (same as in the first lecture).
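As a concrete illustration, here is a minimal sketch of this leaf-frequency rule, assuming scikit-learn (which the lecture does not use); DecisionTreeClassifier.predict_proba returns exactly the label frequencies n_y / n of the leaf that x falls into.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                 # toy inputs, X ⊂ R^2
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # toy labels, Y = {0, 1}

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)   # training: build a tree
x_new = np.array([[0.5, -0.2]])
print(tree.predict_proba(x_new))   # p̂(y|x): label frequencies of the leaf of x_new
print(tree.predict(x_new))         # c(x) = argmax_y p̂(y|x)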
Parametric Discriminative Model: Logistic Regression
Setting.
We assume X ⊆ R^d and Y = {−1, +1}.

Definition (Logistic Regression (LogReg) Model)
Modeling
    p̂(y|x; w) = 1 / (1 + exp(−y⟨w, x⟩)),
with parameter vector w ∈ R^d, is called a logistic regression (model).

Lemma
p̂(y|x; w) is a well-defined probability distribution over y, for any x ∈ X and any w ∈ R^d.
Proof. Exercise.
How to set the weight vector w (based on D)?

Logistic Regression Training
Given a training set D = {(x^1, y^1), . . . , (x^n, y^n)}, logistic regression training sets the free parameter vector to
    w_LR = argmin_{w∈R^d} Σ_{i=1}^n log(1 + exp(−y^i ⟨w, x^i⟩)).

Lemma (Conditional Likelihood Maximization)
w_LR from logistic regression training maximizes the conditional data likelihood w.r.t. the LogReg model,
    w_LR = argmax_{w∈R^d} p̂(y^1, . . . , y^n | x^1, . . . , x^n, w).
Proof: exercise.
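For concreteness, a small numpy sketch of the training objective (the function name is a hypothetical helper, not from the lecture):

import numpy as np

def logreg_objective(w, X, y):
    """L(w) = sum_i log(1 + exp(-y_i <w, x_i>)); X: (n, d), y: (n,) in {-1, +1}, w: (d,)."""
    margins = y * (X @ w)                       # y_i <w, x_i> for all i
    # np.logaddexp(0, -m) = log(1 + exp(-m)), numerically stable
    return np.sum(np.logaddexp(0.0, -margins))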
Solving Logistic Regression Numerically
Theorem
Logistic regression training,
    w_LR = argmin_{w∈R^d} L(w)   with   L(w) = Σ_{i=1}^n log(1 + exp(−y^i ⟨w, x^i⟩)),
is a C^∞-smooth, unconstrained, convex optimization problem.

Proof.
1. It is an optimization problem,
2. there are no constraints,
3. it is smooth (the objective function is C^∞-differentiable),
4. it remains to show that the objective function is convex.
Since L is smooth, it is enough to show that its Hessian matrix
(the matrix of second partial derivatives) is everywhere positive semidefinite.
We compute the gradient and Hessian of the objective. With
    p̂(y|x; w) = 1 / (1 + exp(−y⟨w, x⟩))   and   L(w) = Σ_{i=1}^n log(1 + exp(−y^i ⟨w, x^i⟩)),
we obtain
    ∇_w L(w) = − Σ_{i=1}^n p̂(−y^i | x^i, w) y^i x^i,
    H_L(w) = Σ_{i=1}^n p̂(−y^i | x^i, w) p̂(y^i | x^i, w) x^i (x^i)^⊤,
where each factor p̂(−y^i | x^i, w) p̂(y^i | x^i, w) is > 0 and each matrix x^i (x^i)^⊤ is symmetric positive semidefinite.
→ H_L(w) is positive semidefinite, hence L is convex.
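The derived formulas translate directly into code; a numpy sketch (hypothetical helpers, matching logreg_objective above):

import numpy as np

def logreg_prob(y, X, w):
    """p̂(y|x; w) = 1 / (1 + exp(-y <w, x>)), evaluated row-wise."""
    return 1.0 / (1.0 + np.exp(-y * (X @ w)))

def logreg_gradient(w, X, y):
    # grad L(w) = - sum_i p̂(-y_i | x_i, w) y_i x_i
    p_wrong = logreg_prob(-y, X, w)            # p̂(-y_i | x_i, w)
    return -(p_wrong * y) @ X

def logreg_hessian(w, X, y):
    # H_L(w) = sum_i p̂(-y_i|x_i, w) p̂(y_i|x_i, w) x_i x_i^T
    coeff = logreg_prob(-y, X, w) * logreg_prob(y, X, w)
    return (X * coeff[:, None]).T @ X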
(Figure: example plot of the LogReg objective L(w) for three training examples in R², shown as contour lines of the convex objective.)
Numeric Optimization
Convex optimization is a well-understood field. We can use, e.g., gradient descent, which will converge to the globally optimal solution.

Steepest Descent Minimization with Line Search
input: ε > 0, tolerance (for the stopping criterion)
1: w ← 0
2: repeat
3:   v ← −∇_w L(w)                {descent direction}
4:   η ← argmin_{η>0} L(w + ηv)   {1D line search}
5:   w ← w + ηv
6: until ‖v‖ < ε
output: w ∈ R^d, the learned weight vector

Faster convergence from methods that use second-order information, e.g., conjugate gradients or (L-)BFGS → convex optimization lecture.
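A runnable sketch of the algorithm (not the lecture's code), with a crude backtracking search in place of the exact 1D line search, reusing the helpers sketched above:

import numpy as np

def train_logreg(X, y, tol=1e-6, max_iter=1000):
    w = np.zeros(X.shape[1])                       # 1: w <- 0
    for _ in range(max_iter):                      # 2: repeat
        v = -logreg_gradient(w, X, y)              # 3: descent direction
        if np.linalg.norm(v) < tol:                # 6: stopping criterion
            break
        eta, L_old = 1.0, logreg_objective(w, X, y)
        while logreg_objective(w + eta * v, X, y) > L_old and eta > 1e-12:
            eta *= 0.5                             # 4: crude 1D line search
        w = w + eta * v                            # 5: update
    return w

# usage: w_lr = train_logreg(X, y) with X of shape (n, d) and y in {-1, +1}^n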
Binary Classification with a LogReg Model
A discriminative probability model, p̂(y|x), is enough to make decisions:
    c(x) = argmax_{y∈Y} p̂(y|x)   or   c_ℓ(x) = argmin_{y∈Y} E_{ȳ∼p̂(ȳ|x)} ℓ(ȳ, y).
For logistic regression, this is particularly simple:

Lemma
The LogReg classification rule for the 0/1-loss is
    c(x) = sign ⟨w, x⟩.
For a loss function ℓ = ( a  b ; c  d ), the rule is
    c_ℓ(x) = sign[ ⟨w, x⟩ + log((c−d)/(b−a)) ].
In particular, the decision boundary is linear (or affine).
Proof. Elementary, since log( p̂(+1|x; w) / p̂(−1|x; w) ) = ⟨w, x⟩.
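The same decision can also be computed without the closed-form threshold, by minimizing the expected loss under p̂(y|x; w) directly. A small sketch; the 2×2 loss-matrix convention here is an assumption (rows indexed by the sampled label ȳ, columns by the prediction, both in the order (+1, −1)):

import numpy as np

def logreg_predict(w, x, loss=np.array([[0.0, 1.0], [1.0, 0.0]])):
    """Return the prediction in {+1, -1} that minimizes the expected loss."""
    p_pos = 1.0 / (1.0 + np.exp(-(w @ x)))     # p̂(+1|x; w)
    p = np.array([p_pos, 1.0 - p_pos])         # distribution over (+1, -1)
    expected = p @ loss                        # expected loss of predicting +1 / -1
    return +1 if expected[0] <= expected[1] else -1

With the default 0/1-loss this reduces to c(x) = sign⟨w, x⟩, matching the lemma.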
Multiclass Logistic Regression
For Y = {1, . . . , M}, we can do two things:
• Parametrize p̂(y|x; w⃗) using M−1 vectors w_1, . . . , w_{M−1} ∈ R^d, as
      p̂(y|x, w⃗) = exp(⟨w_y, x⟩) / (1 + Σ_{j=1}^{M−1} exp(⟨w_j, x⟩))   for y = 1, . . . , M−1,
      p̂(M|x, w⃗) = 1 / (1 + Σ_{j=1}^{M−1} exp(⟨w_j, x⟩)).
• Parametrize p̂(y|x; w⃗) using M vectors w_1, . . . , w_M ∈ R^d, as
      p̂(y|x, w⃗) = exp(⟨w_y, x⟩) / Σ_{j=1}^{M} exp(⟨w_j, x⟩)   for y = 1, . . . , M.
The second is more popular, since it is easier to implement and analyze.
Decision boundaries are still piecewise linear: c(x) = argmax_y ⟨w_y, x⟩.
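A numpy sketch (hypothetical helpers) of the second, softmax parametrization and the resulting prediction rule:

import numpy as np

def softmax_probs(W, x):
    """W: (M, d) with one row w_y per class, x: (d,). Returns p̂(y|x, W) for all y."""
    scores = W @ x                              # <w_y, x> for y = 1..M
    scores = scores - scores.max()              # shift for numerical stability
    e = np.exp(scores)
    return e / e.sum()

def predict_multiclass(W, x):
    return int(np.argmax(W @ x))                # c(x) = argmax_y <w_y, x>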
Summary: Discriminative Models
Discriminative models treat the input data, x, as fixed and only model the distribution of the output labels, p(y|x).
Discriminative models, in particular LogReg, are popular because they
• often need less training data than generative models,
• provide an estimate of the uncertainty of a decision, p̂(c(x)|x),
• often can be trained efficiently;
  e.g., Yahoo routinely trains LogReg models from billions of examples.
But they also have drawbacks:
• usually p̂_LR(y|x) does not converge to p(y|x), even for n → ∞,
• they are usually good for prediction, but they do not reflect the actual mechanism of coming to a decision.
Note: discriminative models can be much more complex than LogReg, e.g., Conditional Random Fields (later lecture).
Learning from Data (recap)
Of the three approaches to learn from training data D, we now turn to the risk minimization approach: use D to directly search for a classifier c in a hypothesis class H.
Maximum Margin Classifiers
Let’s use D to estimate a classifier c : X → Y directly.
For a start, we fix
• D = {(x^1, y^1), . . . , (x^n, y^n)},
• Y = {+1, −1},
• we look for classifiers with a linear decision boundary.
• Perceptron
• Generative classifiers for Gaussian class-conditional densities with
shared covariance matrix
• Logistic Regression
What’s the best linear classifier?
Linear Classifiers
Definition
Let
    F = { f : R^d → R with f(x) = b + a_1 x_1 + · · · + a_d x_d = b + ⟨w, x⟩, where w = (a_1, . . . , a_d) }
be the set of linear (affine) functions from R^d to R.
A classifier g : X → Y is called linear if it can be written as
    g(x) = sign f(x)
for some f ∈ F.
We write G for the set of all linear classifiers.
(Figure: a linear classifier, g(x) = sign⟨w, x⟩, with b = 0.)
(Figure: a linear classifier, g(x) = sign(⟨w, x⟩ + b), with b > 0.)
Linear Classifiers
Definition
We call a classifier g correct (for a training set D) if it predicts the correct labels for all training examples:
    g(x^i) = y^i   for i = 1, . . . , n.

Example (Perceptron)
• If the Perceptron converges, the result is a correct classifier.
• Any classifier with zero training error is correct.

Definition (Linear Separability)
A training set D is called linearly separable if it admits a correct linear classifier (i.e., the classes can be separated by a hyperplane).
(Figures: a linearly separable dataset together with a correct linear classifier, shown for several choices of w, and an incorrect classifier.)
Linear Classifiers
Definition
The robustness of a classifier g (with respect to D) is the largest amount, ρ, by which we can perturb the training samples without changing the predictions of g:
    g(x^i + ε) = g(x^i)   for all i = 1, . . . , n,
for any ε ∈ R^d with ‖ε‖ < ρ.

Example
• The constant classifier, e.g. c(x) ≡ 1, is very robust (ρ = ∞), but it is not correct in the sense of the previous definition.
• The robustness of the Perceptron solution can be arbitrarily small.
(Figures: robustness, ρ, of a linear classifier, illustrated as the largest perturbation radius around the training points that leaves all predictions unchanged.)
Definition (Margin)
Let f(x) = ⟨w, x⟩ + b define a correct linear classifier.
Then the smallest (Euclidean) distance of any training example from the decision hyperplane is called the margin of f (with respect to D).

Lemma
We can compute the margin of a linear classifier f(x) = ⟨w, x⟩ + b as
    ρ = min_{i=1,...,n} | ⟨w, x^i⟩ + b | / ‖w‖.

Proof.
High-school maths: the distance between a point and a hyperplane in Hessian normal form.
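A one-line numpy sketch (hypothetical helper) of the formula in the lemma:

import numpy as np

def margin(w, b, X):
    """Margin of f(x) = <w, x> + b with respect to training inputs X of shape (n, d)."""
    return np.min(np.abs(X @ w + b)) / np.linalg.norm(w)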
(Figure: margin, ρ, of a linear classifier; the margin region of width ρ around the decision hyperplane contains no training points.)
Theorem
The robustness of a linear classifier g(x) = sign f(x) with f(x) = ⟨w, x⟩ + b and ‖w‖ = 1 is identical to its margin.

Proof by picture:
(Figure: side by side, the margin region of width ρ around the hyperplane and the perturbation balls of radius ρ around the training points.)
Real proof: a few lines of linear algebra.
Maximum-Margin Classifier
Theorem
Let D be a linearly separable training set. Then the most robust correct classifier is given by g(x) = sign(⟨w*, x⟩ + b*), where (w*, b*) is the solution to
    min_{w∈R^d, b∈R}  (1/2)‖w‖²
    subject to   y^i(⟨w, x^i⟩ + b) ≥ 1,   for i = 1, . . . , n.
Proof. Exercise (formulate the objective + linear algebra).

Remark
• The classifier defined above is called
  – Maximum (Hard) Margin Classifier, or
  – Hard-Margin Support Vector Machine (SVM).
• It is unique (this follows from the strict convexity of the optimization problem).
Observation (Not all training sets are linearly separable.)
(Figure: a dataset that no hyperplane separates without error.)
(Figure: a linear classifier with margin region of width ρ and a margin violation ξ_i for a training point on the wrong side of its margin.)
Definition (Maximum Soft-Margin Classifier)
Let D be a training set, not necessarily linearly separable, and let C > 0.
Then the classifier g(x) = sign(⟨w*, x⟩ + b*), where (w*, b*) is the solution to
    min_{w∈R^d, b∈R, ξ∈R^n}  (1/2)‖w‖² + C Σ_{i=1}^n ξ^i
    subject to   y^i(⟨w, x^i⟩ + b) ≥ 1 − ξ^i,   for i = 1, . . . , n,
                 ξ^i ≥ 0,                        for i = 1, . . . , n,
is called
• Maximum (Soft-)Margin Classifier, or
• (Linear) Support Vector Machine.
Maximum Soft-Margin Classifier
Theorem
The maximum soft-margin classifier exists and is unique for any C > 0.
Proof. The optimization problem is strictly convex.

Remark
The constant C > 0 is called the regularization parameter.
It balances the wishes for robustness and for correctness:
• C → 0: mistakes do not matter much, emphasis on a short w,
• C → ∞: as few errors as possible, but the result might not be robust.
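A minimal sketch of how C is used in practice, assuming scikit-learn (not part of the lecture); SVC with a linear kernel solves the soft-margin problem above:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + 0.3 * rng.normal(size=100) > 0, 1, -1)   # noisy labels

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)   # soft-margin SVM for this C
    print(C, clf.coef_, clf.intercept_)         # w* and b*; small C favors a short w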
Remark
Sometimes, a soft margin is better even for linearly separable datasets!
(Figure: two classifiers on a linearly separable dataset; left, a small margin with low robustness; right, a large margin but one training error, with margin violation ξ_i.)
Nonlinear Classifiers
What if a linear classifier is really not a good choice?
Change the data representation, e.g. Cartesian → polar coordinates.
Nonlinear Classifiers
Definition (Max-Margin Generalized Linear Classifier)
Let C > 0, and let D = {(x^1, y^1), . . . , (x^n, y^n)} ⊂ X × Y be a training set (not necessarily linearly separable).
Let φ : X → H be a feature map from X into a Hilbert space H.
Then we can form a new training set
    D_φ = { (φ(x^1), y^1), . . . , (φ(x^n), y^n) } ⊂ H × Y.
The maximum-(soft-)margin linear classifier in H,
    g(x) = sign(⟨w, φ(x)⟩_H + b),
for w ∈ H and b ∈ R, is called a max-margin generalized linear classifier.
It is still linear w.r.t. w, but (in general) nonlinear with respect to x.
Example (Polar coordinates)
Left: a dataset D for which no good linear classifier exists.
Right: the dataset D_φ for φ : X → H with X = R², H = R², and
    φ(x, y) = ( √(x² + y²), arctan(y/x) )   (with φ(0, 0) := (0, 0)).
Any classifier in H induces a classifier in X.
(Figure: the dataset before and after the transformation φ.)
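A sketch (hypothetical helper) of this feature map; np.arctan2 is used so that x = 0 is handled and quadrants are distinguished, a slight variation on arctan(y/x):

import numpy as np

def polar_feature_map(X):
    """X: (n, 2) Cartesian points -> (n, 2) array of (radius, angle) features."""
    r = np.sqrt(X[:, 0] ** 2 + X[:, 1] ** 2)
    theta = np.arctan2(X[:, 1], X[:, 0])   # angle; arctan2(0, 0) = 0, so (0, 0) maps to (0, 0)
    return np.stack([r, theta], axis=1)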
Other Popular Feature Mappings φ
Example (d-th degree polynomials)
    φ : (x_1, . . . , x_n) ↦ (1, x_1, . . . , x_n, x_1², . . . , x_n², . . . , x_1^d, . . . , x_n^d)
Resulting classifier: a d-th degree polynomial in x, i.e. g(x) = sign f(x) with
    f(x) = ⟨w, φ(x)⟩ = Σ_j w_j φ(x)_j = Σ_i a_i x_i + Σ_{ij} b_{ij} x_i x_j + . . .

Example (Distance map)
For a set of prototypes p⃗_1, . . . , p⃗_N ∈ R^d:
    φ : x⃗ ↦ ( e^{−‖x⃗ − p⃗_1‖²}, . . . , e^{−‖x⃗ − p⃗_N‖²} )
Classifier: combines weights from close-enough prototypes,
    g(x) = sign ⟨w, φ(x)⟩ = sign Σ_{i=1}^N a_i e^{−‖x⃗ − p⃗_i‖²}.
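Sketches (hypothetical helpers) of the two feature maps; note that the polynomial map as written lists only per-coordinate powers, while the expanded f(x) also contains cross terms, which would require the full multinomial expansion:

import numpy as np

def polynomial_features(x, degree):
    """(x_1, ..., x_n) -> (1, x_1, ..., x_n, x_1^2, ..., x_n^2, ..., x_1^d, ..., x_n^d)."""
    return np.concatenate([[1.0]] + [x ** k for k in range(1, degree + 1)])

def distance_map_features(x, prototypes):
    """prototypes: (N, d). Returns (exp(-||x - p_1||^2), ..., exp(-||x - p_N||^2))."""
    sq_dists = np.sum((prototypes - x) ** 2, axis=1)
    return np.exp(-sq_dists)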
Solving the SVM Optimization Problem
Generalized maximum-margin classification:
    min_{w∈H, b∈R, ξ∈R^n}  (1/2)‖w‖² + C Σ_{i=1}^n ξ^i
    subject to, for i = 1, . . . , n,
        y^i(⟨w, φ(x^i)⟩ + b) ≥ 1 − ξ^i   and   ξ^i ≥ 0.
This is a convex optimization problem, so we can study its dual problem.
→ details: convex optimization course (Spring II)
SVM Dual Optimization Problem
    max_{α∈R^n, α≥0}  − (1/2) Σ_{i,j=1}^n α_i α_j y^i y^j ⟨φ(x^i), φ(x^j)⟩ + Σ_{i=1}^n α_i
    subject to   Σ_{i=1}^n α_i y^i = 0   and   0 ≤ α_i ≤ C,   for i = 1, . . . , n.

Why solve the dual optimization problem?
• fewer unknowns: α ∈ R^n instead of (w, b, ξ) ∈ R^{d+1+n}
• less storage when d ≫ n:
  (⟨φ(x^i), φ(x^j)⟩)_{i,j} ∈ R^{n×n} instead of (x^1, . . . , x^n) ∈ R^{n×d}
• kernelization
Kernelization
Definition (Positive Definite Kernel Function)
Let X be a non-empty set. A function k : X × X → R is called a positive definite kernel function if the following conditions hold:
• k is symmetric, i.e. k(x, x′) = k(x′, x) for all x, x′ ∈ X.
• For any finite set of points x^1, . . . , x^n ∈ X, the kernel matrix
      K_{ij} = k(x^i, x^j)                                          (1)
  is positive semidefinite, i.e. for all vectors t ∈ R^n,
      Σ_{i,j=1}^n t_i K_{ij} t_j ≥ 0.                               (2)
Kernelization
Lemma (Kernel Function)
Let φ : X → H be a feature map into a Hilbert space H. Then the function
    k(x, x̄) = ⟨φ(x), φ(x̄)⟩_H
is a positive definite kernel function.

Theorem (Mercer's Condition)
Let X be a non-empty set. For any positive definite kernel function k : X × X → R, there exists a Hilbert space H with inner product ⟨·, ·⟩_H and a feature map φ : X → H such that
    k(x, x̄) = ⟨φ(x), φ(x̄)⟩_H.

Let
• D = {(x^1, y^1), . . . , (x^n, y^n)} ⊂ X × {±1} be a training set,
• k : X × X → R be a positive definite kernel with feature map φ : X → H.
Support Vector Machine in Kernelized Form
For any C > 0, the max-margin classifier for the feature map φ is
    g(x) = sign f(x)   with   f(x) = Σ_i α_i y^i k(x^i, x) + b,
for coefficients α_1, . . . , α_n obtained by solving
    max_{α_1,...,α_n ∈ R}  − (1/2) Σ_{i,j=1}^n α_i α_j y^i y^j k(x^i, x^j) + Σ_{i=1}^n α_i
    subject to   Σ_i α_i y^i = 0   and   0 ≤ α_i ≤ C,   for i = 1, . . . , n.
Note: we do not need to know φ or H explicitly; knowing k is enough.
Useful and Popular Kernel Functions
For x, x̄ ∈ R^d:
• k(x, x̄) = (1 + ⟨x, x̄⟩)^p for p ∈ N   (polynomial kernel)
  f(x) = Σ_i α_i y^i k(x^i, x) = polynomial of degree p in x
• k(x, x̄) = exp(−λ‖x − x̄‖²) for λ > 0   (Gaussian or RBF kernel)
  f(x) = Σ_i α_i y^i exp(−λ‖x^i − x‖²) = weighted/soft nearest neighbor
For x, x̄ histograms with d bins:
• k(x, x̄) = Σ_{j=1}^d min(x_j, x̄_j)   (histogram intersection kernel)
• k(x, x̄) = Σ_{j=1}^d (x_j x̄_j) / (x_j + x̄_j)   (χ² kernel)
For x, x̄ labeled graphs:
• k(x, x̄) = expected number of matches between random walks on x and x̄
Generally: interpret the kernel function as a similarity measure.
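Sketches (hypothetical helpers) of the kernels listed above for numpy vectors; the small eps in the χ² kernel is an added safeguard against empty bins, not part of the slide's formula:

import numpy as np

def polynomial_kernel(x, x_bar, p=3):
    return (1.0 + x @ x_bar) ** p

def gaussian_kernel(x, x_bar, lam=1.0):
    return np.exp(-lam * np.sum((x - x_bar) ** 2))

def histogram_intersection_kernel(x, x_bar):
    return np.sum(np.minimum(x, x_bar))

def chi2_kernel(x, x_bar, eps=1e-12):
    # eps avoids division by zero when both histograms have an empty bin
    return np.sum((x * x_bar) / (x + x_bar + eps))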