Fast Query-Optimized Kernel Machine Classification Via
Incremental Approximate Nearest Support Vectors
by Dennis DeCoste and Dominic Mazzoni
International Conference on Machine Learning (ICML-03), August 2003
Presented by Despina Kontos
CIS 525 Neural Computation
Spring 2004
Instructor: S. Vucetic
Overview

Introduction

  Motivation and the main idea.

Background and related work

  A little bit about Kernel Machines (KMs) and previous work.

Methodology

  The Nearest Support Vectors (NSVs).

  Some enhancements.

Experiments and results

Discussion
Introduction

Why Kernel Machines??

They overcome the “curse of dimensionality” by using kernel functions while
exploring large nonlinear feature spaces.

What is the problem??

The tradeoff for this power is that a KM's query-time complexity scales
linearly with the number of support vectors, often making KMs orders of
magnitude more expensive at query time than other popular machine
learning alternatives.

KM costs are identical for each query, even for “easy” ones that
alternatives (e.g. decision trees) can classify much faster than harder ones.
Introduction

So, what would be an ideal approach?




Use a simple linear classifier for the (majority of) queries it is likely
to classify correctly.
Incur the query-time cost of the exact KM only for those queries for
which such precision likely matters.
For the rest of the cases, use something in between, with complexity
proportional to the difficulty of the query.
A new idea!!


One can often achieve the same classification as the exact KM by
using only a small fraction of the nearest support vectors (SVs) of a
query.
Approximate the exact KM with a k nearest-neighbor (k-NN) KM,
whose output sums only over the (weighted) kernel values involving
the k nearest (according to some distance) selected SVs.
Background

Kernel Machines Summary

A binary SVM classifier is trained by optimizing an n-by-1 weighting vector to satisfy the
Quadratic Programming (QP) dual form:

The kernel avoids the curse of dimensionality by implicitly projecting any two d-dimensional
example vectors into feature space and returning their dot product:

Popular kernels include:

The exact KM output f(x) is computed via:
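
For reference, standard forms of the above (using the usual soft-margin SVM notation, with labels y_i in {-1, +1}, dual weights α_i, and β_i = y_i α_i matching the β used later in this talk):

$$\max_{\alpha}\;\sum_{i=1}^{n}\alpha_i-\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j\,y_i y_j\,K(X_i,X_j)\quad\text{s.t.}\quad 0\le\alpha_i\le C,\;\;\sum_{i=1}^{n}\alpha_i y_i=0$$

$$K(X_i,X_j)=\Phi(X_i)\cdot\Phi(X_j)$$

Popular kernels include the linear kernel $K(u,v)=u\cdot v$, the polynomial kernel $K(u,v)=(u\cdot v+1)^p$, and the RBF kernel $K(u,v)=\exp\!\big(-\|u-v\|^2/2\sigma^2\big)$.

$$f(x)=\sum_{i=1}^{n}\beta_i\,K(X_i,x)+b,\qquad \beta_i=y_i\alpha_i$$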
Some related work

Early methods compressed a KM's SVs into a reduced set, in order to reduce
the query time costs.
When a small residual (ρ ≈ 0) can be achieved with nz « n reduced-set vectors, speedups with
little loss of classification accuracy have been reported.
Problem: such reduced-set approaches do not provide any guarantees or
control over how much classification error the approximation might
introduce.
Methodology

The intuition behind the NEW idea:

Order the SVs for each query using a distance metric and use the k
nearest-neighboring (w.r.t the query sample) SVs. The largest terms tend
to get added first.

During incremental computation of the KM, once the partial KM output
leans strongly enough either positively or negatively, it will not be able to
completely change sign as the remaining βi K(Xi,x) terms are added (the
partial sums are written out at the end of this slide).

Small-k nearest-neighbor classifiers can often classify well, but the
best k will vary from query to query.
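
Concretely, if π_1(x), …, π_n(x) denotes the SVs sorted nearest-first for query x, the partial output after k terms (the g_k(x) used on the following slides; including the bias b from the start is an assumption of this sketch) is

$$g_k(x)=b+\sum_{i=1}^{k}\beta_{\pi_i(x)}\,K\!\left(X_{\pi_i(x)},x\right),\qquad g_n(x)=f(x),$$

so once $g_k(x)$ leans strongly enough, the remaining contribution $\sum_{i=k+1}^{n}\beta_{\pi_i(x)}K(X_{\pi_i(x)},x)$ is too small to flip the sign of the final output (this is what the thresholds below estimate).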
Methodology

Nearest Support Vectors (NSV)

Let the NSVs' distance-like scoring NNscorei(x) be defined per query (an illustrative form is sketched at the end of this slide).
The βi K(Xi,x) terms corresponding to the NNscore-ordered SVs tend to follow a
steady progression, such that the remaining terms soon become too small to
overcome any strong leaning.
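
As an illustration only (a plausible RBF-based form, not necessarily the paper's exact definition): for the RBF kernel, $|\beta_i K(X_i,x)| = \exp\!\big(\log|\beta_i| - \|X_i-x\|^2/2\sigma^2\big)$, so sorting by the increasing distance-like score

$$\text{NNscore}_i(x)=\|X_i-x\|^2-2\sigma^2\log|\beta_i|$$

orders the terms βi K(Xi,x) exactly by decreasing magnitude, so the largest contributions are added first.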
Methodology

The main algorithm:
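
A minimal sketch of the incremental classification loop described on these slides (function and variable names are illustrative, not the paper's; the thresholds L[k], H[k] come from the pre-query procedure on the next slide):

```python
import numpy as np

def classify_incremental(x, svs, betas, b, kernel, nnscore, L, H):
    """Approximate KM classification with early stopping.

    svs: (n, d) support vectors; betas: (n,) signed SV weights; b: bias;
    kernel(u, v): kernel function; nnscore(x, svs, betas): (n,) query-specific
    distance-like scores; L[k], H[k]: thresholds after k+1 kernel terms.
    """
    order = np.argsort(nnscore(x, svs, betas))  # nearest SVs first
    g = b                                       # partial KM output
    for k, i in enumerate(order):
        g += betas[i] * kernel(svs[i], x)
        if g > H[k]:   # no exact-KM-negative sample ever leaned this positively
            return +1
        if g < L[k]:   # no exact-KM-positive sample ever leaned this negatively
            return -1
    return +1 if g > 0 else -1                  # fell through: exact KM decision
```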
Methodology

Statistical thresholds for NSV

Derive thresholds Lk and Hk by running the algorithm over a large
representative sample of pre-query data.

Compute Lk as the minimum value of gk(x) over all x such that gk(x) < 0 and
f(x) > 0. This identifies Lk as the worst-case wrong-way leaning of any
sample that the exact KM classifies as positive. Similarly, Hk is assigned
the maximum gk(x) such that gk(x) > 0 and f(x) < 0.

In practice, the test and training data distributions will not be identical, so we
can replace each Hk (Lk) with the maximum (minimum) of all threshold
values over the adjacent steps k-w through k+w (a variation using a window w;
a sketch of the threshold computation follows).
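
A minimal sketch of that threshold computation over a representative pre-query sample (array names and the handling of empty cases are assumptions of this sketch):

```python
import numpy as np

def compute_thresholds(G, f_out, w=0):
    """G[:, k]: partial outputs g_k(x) after k+1 kernel terms, for m samples.
    f_out: (m,) exact KM outputs f(x).  Returns per-step thresholds L, H,
    optionally widened over a window of adjacent steps k-w .. k+w."""
    m, n = G.shape
    L = np.full(n, -np.inf)  # default: never stop early at this step
    H = np.full(n, np.inf)
    for k in range(n):
        gk = G[:, k]
        wrong_way_pos = gk[(gk < 0) & (f_out > 0)]  # positives leaning negative
        wrong_way_neg = gk[(gk > 0) & (f_out < 0)]  # negatives leaning positive
        if wrong_way_pos.size:
            L[k] = wrong_way_pos.min()  # worst wrong-way leaning of a positive
        if wrong_way_neg.size:
            H[k] = wrong_way_neg.max()  # worst wrong-way leaning of a negative
    if w > 0:  # widen thresholds over adjacent steps, as described above
        L = np.array([L[max(0, k - w):k + w + 1].min() for k in range(n)])
        H = np.array([H[max(0, k - w):k + w + 1].max() for k in range(n)])
    return L, H
```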
Methodology

Sorting NSVs by NNscorei(x) leads to relatively wide and skewed
thresholds whenever there is an imbalance in the number of positive SVs
versus negative SVs.

Remedy: adjust the NNscore-based ordering so that the cumulative sums of the
positive β and the negative β at each step k are as equal as possible.

A full linear scan to find the k nearest neighbors can be computationally
expensive, even when using indexing techniques.

Remedy: perform pre-query principal component analysis (PCA) on the matrix of
SVs, and use the resulting low-dimensional projections to approximate kernels and to
order the NSVs for each query as needed (see the sketch below).
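
A minimal sketch of the PCA-based ordering (scikit-learn's PCA and the number of components are assumptions; the β weighting from NNscore could be folded into the score as well):

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_sv_projection(svs, n_components=20):
    """Pre-query: project the support vectors into a low-dimensional space once."""
    pca = PCA(n_components=n_components).fit(svs)
    return pca, pca.transform(svs)

def approx_nsv_order(x, pca, svs_proj):
    """Per query: order SVs by approximate distance computed in the reduced space."""
    x_proj = pca.transform(x.reshape(1, -1))[0]
    approx_sq_dist = np.sum((svs_proj - x_proj) ** 2, axis=1)
    return np.argsort(approx_sq_dist)  # approximately nearest SVs first
```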
Methodology

Some enhancements

Use a linear SVM as an initial filter. Compute the threshold bounds
as before, except using the linear SVM's output for the first step of
the computation (see the sketch at the end of this slide).

Generate additional “difficult” data in order to obtain better threshold
levels from the representative sample.
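
A sketch of how the linear-SVM filter could slot in front of the incremental loop above (assuming the linear output simply plays the role of the first step, with its own thresholds L0, H0):

```python
import numpy as np

def classify_with_linear_filter(x, w_lin, b_lin, L0, H0, fallback):
    """Cheap linear-SVM check first; fall back to the (incremental) KM only if needed."""
    g0 = np.dot(w_lin, x) + b_lin   # linear SVM output as the first "partial" step
    if g0 > H0:
        return +1
    if g0 < L0:
        return -1
    return fallback(x)              # e.g. classify_incremental(...) from above
```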
Experiments and results
Data: MNIST dataset (digit recognition)
  large input dimensionality
  large number of SVs
Experiments and results
Speedup advantage compared to accuracy loss
Conclusions

A new query-time approximation of Kernel Machines that implements a
k-nearest-neighbor approach to improve performance.

The approach is applicable to any form of Kernel Machine classifier,
regardless of the way it is trained.

Some exciting speedup results are reported without significant loss in
accuracy.

Future work: combining the machine learning methods of
kernels, nearest neighbors, and decision trees.
Any questions???
.....THANK YOU!!!!