Max-margin Classification of Data with Absent Features

Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, and Daphne Koller
Journal of Machine Learning Research, 2008
Outline
• Introduction
• Background knowledge
• Problem Description
• Algorithms
• Experiments
• Conclusions
2
Introduction
• In traditional supervised learning, data instances are viewed as feature vectors in a high-dimensional space.
• Why are features missing?
– noise
– undefined parts of objects
– structural absence
– etc.
3
Introduction
• How can we handle classification when features are missing?
– Fundamental methods: expectation maximization (EM), Markov-chain Monte Carlo (MCMC)
• However, features are sometimes non-existing, rather than merely having an unknown value.
• Goal: classify without filling in the missing values.
4
Background
• Support Vector Machines (SVMs)
• Second-Order Cone Programming (SOCP)
5
Support Vector Machines
• Support Vector Machines (SVMs): a supervised learning method used for classification and regression.
• They simultaneously minimize the empirical classification error and maximize the geometric margin, and are therefore also called maximum-margin classifiers.
6
Support Vector Machines
• Given a set of $n$ labeled samples $x_1, \dots, x_n$ in a feature space $\mathcal{F}$ of size $d$, each sample $x_i$ has a binary class label $y_i \in \{-1, +1\}$.
• We want to find the maximum-margin hyperplane that divides the samples with $y_i = +1$ from those with $y_i = -1$.
7
Support Vector Machines
• Any hyperplane can be written as the set of points $x$ satisfying $w \cdot x + b = 0$, where $w$ is a normal vector and $b$ determines the offset of the hyperplane from the origin along $w$.
• The hyperplane separates the samples into two classes, so we require:
$w \cdot x_i + b \ge 1$ for $x_i$ in the first class,
$w \cdot x_i + b \le -1$ for $x_i$ in the second class,
i.e., $y_i (w \cdot x_i + b) \ge 1, \quad i = 1 \dots n$.
8
Support Vector Machines
• Geometric margin: we define the margin as
$\rho = \min_i \frac{y_i (w \cdot x_i + b)}{\|w\|}$,
and learn a classifier $w$ by maximizing $\rho$.
• This leads to the optimization problem:
$\min_{w,b} \|w\| \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1, \; i = 1 \dots n$
• or, equivalently, to the quadratic programming (QP) problem:
$\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1, \; i = 1 \dots n$
9
Support Vector Machines
yi wx i  b 
ρ  min i
w
The hyperplane H3 doesn't
separate the 2 classes. H1
does, with a small margin and
H2 with the maximum margin.
10
Support Vector Machines
• Soft-margin SVMs: when the training samples are not linearly separable, we introduce slack variables $\xi_i$ into the SVM.
• The problem becomes:
$\min_{w,b} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1 \dots n$
where $C$ controls the trade-off between accuracy and model complexity.
11
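As a concrete illustration of the C trade-off described above (not part of the original slides), the sketch below fits a standard soft-margin linear SVM on toy data with scikit-learn; the data set and the C values are arbitrary.

from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Toy data; C controls the trade-off between a wide margin and training errors.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
for C in (0.01, 1.0, 100.0):
    clf = LinearSVC(C=C, max_iter=10000).fit(X, y)
    print(f"C={C}: training accuracy {clf.score(X, y):.3f}")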
Support Vector Machines
• The dual problem of SVMs: we use the Lagrangian to solve the primal SVM problem:
$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right]$
• Setting the first derivatives of $L$ to 0:
$\frac{\partial L(w, b, \alpha)}{\partial w} = w - \sum_{i=1}^{n} y_i \alpha_i x_i = 0$
$\frac{\partial L(w, b, \alpha)}{\partial b} = -\sum_{i=1}^{n} y_i \alpha_i = 0$
12
Support Vector Machines
• Substituting these conditions back into the Lagrangian
$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right]$
• yields the dual problem of SVMs:
$\max_{\alpha} \; L(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$
$\text{s.t.} \quad \alpha_i \ge 0, \; i = 1 \dots n, \qquad \sum_{i=1}^{n} y_i \alpha_i = 0$
13
Second Order Cone Programming
• Second-order cone programming (SOCP): a convex optimization problem of the form
$\min_{x} \; f^{T} x \quad \text{s.t.} \quad \|A_i x + b_i\|_2 \le c_i^{T} x + d_i, \; i = 1 \dots m$
where $x \in \mathbb{R}^{d}$ is the optimization variable.
• SOCPs can be solved with off-the-shelf solvers such as MOSEK.
14
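As a minimal illustration (not from the slides), a toy SOCP can be posed and solved in Python with cvxpy, which dispatches to conic solvers (ECOS/SCS by default, MOSEK if installed); the numbers below are arbitrary.

import numpy as np
import cvxpy as cp

# Toy SOCP: minimize a linear objective over a second-order cone (a unit ball).
x0 = np.array([1.0, 2.0, 3.0])    # arbitrary ball center
c = np.array([1.0, -1.0, 0.5])    # arbitrary linear objective

x = cp.Variable(3)
constraints = [cp.norm(x - x0, 2) <= 1]             # second-order cone constraint
prob = cp.Problem(cp.Minimize(c @ x), constraints)
prob.solve()                                        # pass solver=cp.MOSEK if it is installed
print(prob.value, x.value)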
Problem Description
• Given a set of $n$ labeled samples $x_1, \dots, x_n$ in a feature space $\mathcal{F}$ of size $d$, each sample $x_i$ has a binary class label $y_i \in \{-1, +1\}$.
• Let $F_i$ denote the set of features of the $i$-th sample. Each sample $x_i$ can be viewed as embedded in its relevant subspace $\mathbb{R}^{F_i} \subseteq \mathbb{R}^{d}$.
15
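A small illustration of how the relevant subspaces F_i can be represented in code (my own sketch, not from the slides): absent features are stored as NaN and a boolean mask records which entries exist.

import numpy as np

# Hypothetical toy data: NaN marks a structurally absent feature.
X = np.array([[0.5, np.nan, 1.2],
              [1.0, -0.3, np.nan]])
mask = ~np.isnan(X)               # mask[i, k] is True iff feature k exists for sample i
F_0 = np.flatnonzero(mask[0])     # indices of the relevant subspace of the first sample
print(F_0)                        # -> [0 2]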
Problem Description
• In traditional SVMs, we try to maximize the geometric margin $\rho$ over all instances.
• In the case of missing features, the margin, which measures the distance to a hyperplane, is not well defined.
• Therefore we can't use traditional SVMs directly in this case.
16
Problem Description
• We should instead measure the margin of each instance in its own relevant subspace.
• We define the instance margin $\rho_i(w)$ of the $i$-th instance as
$\rho_i(w) = \frac{y_i \, w^{(i)} \cdot x_i}{\|w^{(i)}\|}$
where $w^{(i)}$ is the vector obtained by taking the entries of $w$ that are relevant for $x_i$.
17
Problem Description
• We take the new geometric margin to be the minimum over all instance margins, $\rho(w) = \min_i \rho_i(w)$, and arrive at a new optimization problem for this case (see the sketch after this slide):
18
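The formula for this new problem did not survive extraction; reconstructed from the two definitions above, it reads:

$\max_{w} \; \rho(w) \;=\; \max_{w} \; \min_{i = 1 \dots n} \; \frac{y_i \, w^{(i)} \cdot x_i}{\|w^{(i)}\|}$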
Problem Description
• However, since the different margin terms are normalized by different norms $\|w^{(i)}\|$, we can't take a single norm out of the minimization.
• Moreover, each of the terms $\frac{y_i \, w^{(i)} \cdot x_i}{\|w^{(i)}\|}$ is non-convex in $w$, so the problem is difficult to solve directly.
19
Algorithms
• How to solve this optimization problem?
• Three approaches:
– Linearly separable case: A Convex Formulation
– The general case:
• Average Norm
• Instance-specific Margins
20
A Convex Formulation
• In the linearly separable case, we can transform the optimization problem into a series of convex optimization problems.
• First step: by maximizing a lower bound $\gamma$, we move the minimization term out of the objective function and into the constraints. The resulting problem is equivalent to the original one, since the bound $\gamma$ can always be increased until it is tight.
21
A Convex Formulation
• Second step: replace the single constraint $\min_i \rho_i(w) \ge \gamma$ with one constraint per instance, $\rho_i(w) \ge \gamma$ for $i = 1 \dots n$.
22
A Convex Formulation
• Finally, because $\|w^{(i)}\| \ge 0$ for all instances, we can multiply each constraint through by $\|w^{(i)}\|$ and write the problem as follows (see the sketch after this slide):
23
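The constraint set itself did not survive extraction; a reconstruction that matches the three steps just described (introduce the lower bound $\gamma$, use one constraint per instance, multiply through by the non-negative $\|w^{(i)}\|$) is:

$\max_{w, \gamma} \; \gamma \quad \text{s.t.} \quad y_i \, w^{(i)} \cdot x_i \;\ge\; \gamma \, \|w^{(i)}\|, \qquad i = 1 \dots n$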
A Convex Formulation
• Assume first that $\gamma$ is given. For any fixed value of $\gamma$, the problem obeys the general structure of an SOCP: each constraint bounds the norm $\|w^{(i)}\|$ by a linear function of $w$.
24
A Convex Formulation
• We can therefore solve it by a bisection search over $\gamma$, solving one SOCP in each iteration.
• However, one problem is that any scaled version of a solution is also a solution: each constraint is invariant to a rescaling of $w$, and the null case $w = 0$ is always a solution.
25
A Convex Formulation
• How can we avoid these degenerate solutions? We can add constraints:
– a non-vanishing norm for $w$ — but then the problem is no longer convex!
– a constraint on a single entry of $w$, for each entry.
• We solve the SOCP twice for each entry of $w$: once with that entry fixed to $+1$ and once with it fixed to $-1$.
• This gives a total of $2d$ problems.
26
A Convex Formulation
• The convex formulation is problematic in the non-separable case: the slack variables are not normalized by the per-instance norms $\|w^{(i)}\|$, and we can't be sure that the resulting problem is jointly convex in $w$ and the slacks.
27
A Convex Formulation
• In this case, the vanishing solutions $w = 0$ are also encountered.
• We can no longer guarantee that the modified approach discussed above coincides with the original problem.
• The non-separable convex formulation is therefore not likely to be of practical use.
28
Average Norm
• We therefore consider an alternative solution based on an approximation of the margin $\rho_i(w)$.
• We approximate the different norms $\|w^{(i)}\|$ by a common term that does not depend on the instance.
29
Average Norm
• Replace each low-dimensional norm $\|w^{(i)}\|$ by the root-mean-square norm over all instances (see the sketch after this slide).
• When all samples have all of their features, this reduces to the original SVM.
30
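Written out, the root-mean-square norm described above is (my reconstruction):

$\|w^{(i)}\| \;\approx\; \sqrt{\frac{1}{n} \sum_{j=1}^{n} \|w^{(j)}\|^{2}}$

When every sample has all of its features, each $\|w^{(j)}\|$ equals $\|w\|$, the approximation is exact, and the problem reduces to the original SVM.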
Average Norm
• In the case of missing features, the approximation of $\rho_i(w)$ will also be good if all the norms $\|w^{(i)}\|$ are equal.
• When the norms $\|w^{(i)}\|$ are nearly equal, we expect to find nearly optimal solutions.
31
Average Norm
• We can derive:
32
Average Norm
• The linearly separable case: the resulting problems can be solved using the same techniques as standard SVMs.
• The non-separable case: they become quadratic programming problems.
33
Average Norm
• However, the average norm is not expected to perform well if the norms $\|w^{(i)}\|$ vary considerably.
• How can we solve the problem in that case?
• The instance-specific margins approach.
34
Instance-specific Margins
• We can represent each of the norms $\|w^{(i)}\|$ as a scaling of the full norm $\|w\|$.
• By defining scaling coefficients $s_i$, we can rewrite the margin (see the sketch after this slide):
35
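The missing definition and the rewritten margin can be reconstructed as:

$s_i = \frac{\|w^{(i)}\|}{\|w\|}, \qquad \rho_i(w) = \frac{y_i \, w^{(i)} \cdot x_i}{s_i \, \|w\|}$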
Instance-specific Margins
• Following the same steps as above, we derive the corresponding optimization problems for the separable and the non-separable case.
36
Instance-specific Margins
• How can we solve it? It is not a quadratic programming problem; it is not even convex in $w$.
• One option: a projected gradient approach.
37
Instance-specific Margins
• Projected gradient approach: iterate between steps in the direction of the gradient of the Lagrangian and projections onto the constrained space.
• With the right choice of step sizes, it converges to a local minimum.
• Are there other solutions?
38
Instance-specific Margins
• If we are given a set of $s_i$'s, the problem becomes a quadratic programming problem: for any fixed values of the $s_i$, the problem is a QP!
• We can use this fact to devise an iterative algorithm.
39
Instance-specific Margins
• For a given tuple of $s_i$'s, we solve a QP for $w$, and then use the resulting $w$ to calculate new $s_i$'s.
• To solve the QP, we derive its dual for the given $s_i$'s; it has the form of the dual problem of SVMs.
40
Instance-specific Margins
• The inner product $\langle \cdot, \cdot \rangle$ is taken only over features that are valid for both $x_i$ and $x_j$.
• We discuss kernels for the modified SVM later.
41
Instance-specific Margins
• Iterative optimization/projection algorithm: alternate between solving the SVM-like dual problem and updating the scaling coefficients (see the sketch after this slide).
• Convergence is not always guaranteed.
• The dual solution is used to recover the optimal classifier, setting $w$ from the dual variables as in the standard SVM.
42
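The following is a minimal sketch of the iterative scheme described on the preceding slides (my own code, not the authors'). It assumes the fixed-$s_i$ step is the soft-margin QP with constraints $y_i \, w^{(i)} \cdot x_i / s_i \ge 1 - \xi_i$ and that the scalings are then updated as $s_i = \|w^{(i)}\| / \|w\|$; for simplicity it solves the primal QP with cvxpy rather than the dual, and all parameter choices are illustrative.

import numpy as np
import cvxpy as cp

def iterative_missing_feature_svm(X, y, mask, C=1.0, n_iter=10):
    """Alternate a QP for w (with the scalings s_i held fixed) and the update
    s_i = ||w^(i)|| / ||w||.  mask[i, k] is True iff feature k is present for
    sample i; absent entries of X are zero-filled so they drop out of every
    inner product."""
    n, d = X.shape
    Xz = np.where(mask, X, 0.0)                 # zero-fill absent entries
    s = np.sqrt(mask.sum(axis=1) / d)           # heuristic initialization of s_i
    w_val, b_val = np.zeros(d), 0.0
    for _ in range(n_iter):
        w, b = cp.Variable(d), cp.Variable()
        xi = cp.Variable(n, nonneg=True)
        margins = cp.multiply(y, Xz @ w + b)    # y_i (w^(i) . x_i + b)
        constraints = [cp.multiply(margins, 1.0 / s) >= 1 - xi]
        objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
        cp.Problem(objective, constraints).solve()
        w_val, b_val = w.value, b.value
        # recompute the per-instance scalings from the new w
        s = np.sqrt(mask.astype(float) @ (w_val ** 2)) / (np.linalg.norm(w_val) + 1e-12)
        s = np.maximum(s, 1e-6)                 # guard against empty subspaces
    return w_val, b_val

With X containing NaNs for absent features, one would call it as iterative_missing_feature_svm(np.nan_to_num(X), y, ~np.isnan(X)).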
Instance-specific Margins
• Two other approaches for optimizing this:
43
Instance-specific Margins
• Updating approach: update $s$ by directly minimizing the objective subject to its constraints.
• Hybrid approach: combine gradient ascent over $s$ with a QP for $w$.
• These approaches did not perform as well as the iterative approach above.
44
Kernels for missing features
• Why use kernels with SVMs?
• Some common kernels:
– Polynomial: $k(x, x') = (x \cdot x')^{p}$
– Polynomial (inhomogeneous): $k(x, x') = (x \cdot x' + 1)^{p}$
– Radial basis function: $k(x, x') = \exp(-\gamma \|x - x'\|^{2})$
– Gaussian radial basis function: $k(x, x') = \exp\!\left(-\frac{\|x - x'\|^{2}}{2\sigma^{2}}\right)$
– Sigmoid: $k(x, x') = \tanh(\kappa \, x \cdot x' + c)$
[Figure: RBF kernel for SVM]
45
Kernels for missing features
• In the dual formulation above, the dependence on the instances is only through their inner products.
• We focus on kernels with the same kind of dependence, e.g. the polynomial and sigmoid kernels.
46
Kernels for missing features
• For a polynomial kernel, we define the modified kernel by computing the inner product only over the valid features shared by both instances, $F_i \cap F_j$.
• We also define a zero-filled version of each instance, which replaces invalid entries (missing features) with zeros.
47
Kernels for missing features
• The modified kernel equals the original kernel applied to the zero-filled instances, simply because multiplying by zero is equivalent to skipping the missing values.
• This gives us kernels for missing features.
48
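A small sketch of the zero-filling trick described on the two preceding slides (my own code): zero-filling absent entries makes every product x_ik * x_jk vanish unless feature k is present in both instances, so the standard kernel applied to the zero-filled vectors equals the modified kernel.

import numpy as np

def masked_polynomial_kernel(X, mask, degree=2, coef0=1.0):
    """Gram matrix K[i, j] = (<x_i, x_j>_{F_i ∩ F_j} + coef0) ** degree, where the
    inner product runs only over features present in BOTH samples."""
    Xz = np.where(mask, X, 0.0)    # zero-filled copy of the data
    return (Xz @ Xz.T + coef0) ** degree

# Hypothetical usage, with NaN marking absent features:
X = np.array([[0.5, np.nan, 1.2],
              [1.0, -0.3, np.nan],
              [0.2, 0.7, -1.0]])
K = masked_polynomial_kernel(X, ~np.isnan(X))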
Experiments
• Three experiments:
– Features are missing at random.
– Visual object recognition: features are missing
because they can’t be located in the image.
– Biological network completion (metabolic pathway reconstruction): the missingness pattern of the features is determined by the known structure of the network.
49
Experiments
• Five common approaches for filling in missing features (a sketch of these baselines follows this slide):
– Zero
– Mean
– Flag
– kNN
– EM
• These are compared against the proposed Average Norm and Geometric Margins approaches.
50
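For reference (my own sketch, not the authors' code), the zero, mean, flag, and kNN baselines can be reproduced with scikit-learn's imputers; an EM-style imputation needs a separate model, so it is omitted here.

import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[0.5, np.nan, 1.2],
              [1.0, -0.3, np.nan],
              [0.2, 0.7, -1.0]])    # NaN = missing feature

X_zero = SimpleImputer(strategy="constant", fill_value=0.0).fit_transform(X)
X_mean = SimpleImputer(strategy="mean").fit_transform(X)
# "Flag": impute and append a binary missingness indicator for each feature
# that contains missing values
X_flag = SimpleImputer(strategy="mean", add_indicator=True).fit_transform(X)
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)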
Missing at Random
• Features are missing at random:
– Data sets from the UCI repository.
– MNIST images.
[Figure: examples of missing features]
51
Missing at Random
• Experimental results: the Geometric Margins approach performs well.
[Figure: results on UCI data sets and on MNIST images]
52
Visual Object Recognition
• Visual object recognition: determine whether an object from a certain class is present in a given input image.
• For example, the trunk of a car may not be found in a picture of a hatchback car.
• Such features are structurally missing.
53
Visual Object Recognition
• The object model contains a set of "landmarks" defining the outline of an object.
• We find several candidate matches for each landmark in a given image.
[Figure: five matches for the front-windshield landmark]
54
Visual Object Recognition
• In the car model, we locate up to 10 matches (candidates) for each of the 19 landmarks.
• For each candidate, we compute the first 10 principal component (PCA) coefficients of the image patch.
55
Visual Object Recognition
• We concatenate these descriptors to form 1900 (19 × 10 × 10) features per image.
• If the number of candidates for a given landmark is less than 10, we consider the remaining features to be structurally absent.
56
Visual Object Recognition
• Experiment results:
57
Visual Object Recognition
• Examples:
58
Metabolic Pathway Reconstruction
• Metabolic Pathway Reconstruction: predicting
missing enzymes in metabolic pathways.
• Instances in this task have missing features
due to the structure of the biochemical
network.
59
Metabolic Pathway Reconstruction
• Cells use a complex network of chemical reactions to produce their building blocks.
[Figure: an enzyme catalyzes a reaction that converts molecular compounds into other molecular compounds]
60
Metabolic Pathway Reconstruction
• For many reactions, the enzyme responsible
for their catalysis is unknown, making it an
important computational task to predict the
identity of such missing enzymes.
• How to predict?
• The enzymes in local network neighborhoods
usually participate in related functions.
61
Metabolic Pathway Reconstruction
• Different types of network-neighborhood relations between enzyme pairs lead to different relations between their properties.
• Three types:
– forks(same inputs, different outputs)
– funnels(same outputs, different inputs)
– linear chains
62
Metabolic Pathway Reconstruction
[Figure: three neighborhood types — linear chains, forks (same inputs, different outputs), and funnels (same outputs, different inputs)]
63
Metabolic Pathway Reconstruction
• Each enzyme is represented using a vector of
features that measure its relatedness to each
of its different neighbors, across different data
types.
• A feature vector will have structurally missing
entries if the enzyme does not have all types
of neighbors.
64
Metabolic Pathway Reconstruction
• Three types of data for enzyme attributes:
– A compendium of gene expression assays
– The protein domain content of enzymes
– The cellular localization of proteins
• We use those data to measure the similarity
between enzymes.
65
Metabolic Pathway Reconstruction
• Similarity Measures for Enzyme Predictions:
66
Metabolic Pathway Reconstruction
• Positive examples: taken from reactions with known enzymes.
• Negative examples: created by plugging a random impostor gene into each neighborhood.
67
Metabolic Pathway Reconstruction
• Experiment results:
68
Conclusions
• A novel method for max-margin training of classifiers in the presence of missing features.
• It classifies instances by skipping the non-existing features, rather than filling them in with hypothetical values.
69