Instance Based Approach

KNN Classifier

[Figure: "Two Classes" scatter plot of labeled points in the X-Y plane]
 Handed an instance you wish to classify
 Look around the nearby region to see what other classes are around
 Whichever is most common, make that the prediction

 Assign the most common class among the K nearest neighbors (like a vote)
 Train
    Load the training data
 Classify
    Read in the instance
    Find the K nearest neighbors in the training data
    Assign the most common class among the K nearest neighbors (like a vote)

Euclidean distance, where a_r is an attribute (dimension):

d(x_i, x_j) \equiv \sqrt{\sum_{r=1}^{n} (a_r(x_i) - a_r(x_j))^2}
Voting Formula

Naïve approach: exhaustive
 Visit every training sample and calculate the distance
 Sort
 First K in the list

For the instance x_q to be classified:

\hat{f}(x_q) \leftarrow \arg\max_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i))

where f(x_i) is x_i's class, and \delta(a, b) = 1 if a = b; 0 otherwise.
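As a concrete illustration of the exhaustive approach and the voting formula, here is a minimal Python sketch (function names such as euclidean_distance and knn_classify are illustrative, not from the slides):

import math
from collections import Counter

def euclidean_distance(xi, xj):
    # d(x_i, x_j): square root of the sum of squared attribute differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

def knn_classify(query, training_data, k=3):
    # training_data: list of (attribute_vector, class_label) pairs
    # 1. Visit every training sample and calculate its distance to the query
    distances = [(euclidean_distance(query, x), label) for x, label in training_data]
    # 2. Sort
    distances.sort(key=lambda pair: pair[0])
    # 3. Take the first K and vote: the most common class wins
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]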
The Work that Must be Performed
 Visit every training sample and calculate the distance
 Sort
 Lots of floating point calculations
 The classifier puts off the work until it is time to classify
Where the work happens
 This is known as a "lazy" learning method
 If most of the work is done during the training stage, the method is known as "eager"
    Our next classifier, Naïve Bayes, will be eager
    Training takes a while, but it can classify fast
 Which do you think is better?
From Wikipedia: a kd-tree is a space-partitioning data structure for organizing points in a k-dimensional space. kd-trees are a useful data structure for several applications, such as searches involving a multidimensional search key (e.g. range searches and nearest neighbor searches). kd-trees are a special case of BSP trees.
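As a sketch of how such a structure can replace the exhaustive scan, the example below uses scipy.spatial.KDTree (the library choice is an assumption; the slides do not name one). Building the tree adds work up front, but each query then avoids visiting every training sample:

from collections import Counter
from scipy.spatial import KDTree

def build_knn(training_points, training_labels):
    # "Training" now includes building the tree over the attribute vectors
    return KDTree(training_points), list(training_labels)

def classify(tree, labels, query, k=3):
    # assumes k > 1 so that idx is an array of indices
    _, idx = tree.query(query, k=k)          # indices of the k nearest training points
    votes = Counter(labels[i] for i in idx)
    return votes.most_common(1)[0][0]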
 Speeds up classification
 Probably slows "training"
Weighted Voting Formula
 Choosing K can be a bit of an art
 What if you could include all data points (K = n)?
 How might you do such a thing?
 What if we weighted the vote of each training sample by its distance from the point being classified?

\hat{f}(x_q) \leftarrow \arg\max_{v \in V} \sum_{i=1}^{k} w_i \, \delta(v, f(x_i))

where w_i = \frac{1}{d(x_q, x_i)^2}, and \delta(v, f(x_i)) is 1 if x_i is a member of class v (i.e. v = f(x_i), where f(x_i) returns the class of x_i).
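A minimal Python sketch of this distance-weighted vote, assuming the w_i = 1 / d(x_q, x_i)^2 weighting above and including every training point (K = n); weighted_vote is an illustrative name:

from collections import defaultdict

def weighted_vote(query, training_data):
    # training_data: list of (attribute_vector, class_label) pairs
    scores = defaultdict(float)
    for x, label in training_data:
        d_squared = sum((a - b) ** 2 for a, b in zip(query, x))
        if d_squared == 0:
            return label                     # exact match: use its class directly
        scores[label] += 1.0 / d_squared     # w_i = 1/d^2 added to that class's total
    return max(scores, key=scores.get)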
[Figure: weight as a function of distance, for 1-over-distance-squared weighting and for linear weighting]

 Could get less fancy and go linear
 But then training data very far away would still have a strong influence
Other Radial Basis Functions
 Sometimes known as a Kernel Function
 One of the more common:

K(d(x, x_t)) = \frac{1}{\sqrt{2\pi}} e^{-(x - \mu)^2 / 2\sigma^2}

[Figure: Gaussian weight as a function of distance]
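A small sketch of a Gaussian weighting function matching the kernel above, assuming mu = 0 so the weight depends only on the distance; sigma is an illustrative bandwidth parameter:

import math

def gaussian_weight(d, sigma=1.0):
    # Weight falls off smoothly with distance and, unlike 1/d^2, stays finite at d = 0
    return (1.0 / math.sqrt(2.0 * math.pi)) * math.exp(-(d ** 2) / (2.0 * sigma ** 2))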
Other Issues?
 Work is back-loaded
    Worse the bigger the training data
    Can alleviate with data structures
 What else?
 What if only some dimensions contribute to the ability to classify? Differences in the other dimensions would put distance between that point and the target.
 More is not always better
 Points might be identical in the important dimensions, while the other dimensions are simply random, making them seem distant

From Wikipedia:
In applied mathematics, curse of dimensionality (a term coined by Richard E. Bellman),[1][2] also known as the Hughes effect[3] or Hughes phenomenon[4] (named after Gordon F. Hughes),[5][6] refers to the problem caused by the exponential increase in volume associated with adding extra dimensions to a mathematical space.
For example, 100 evenly-spaced sample points suffice to sample a unit interval with no more than 0.01 distance between points; an equivalent sampling of a 10-dimensional unit hypercube with a lattice with a spacing of 0.01 between adjacent points would require 10^20 sample points: thus, in some sense, the 10-dimensional hypercube can be said to be a factor of 10^18 "larger" than the unit interval. (Adapted from an example by R. E. Bellman.)
 Thousands of genes
 Relatively few patients
 Is there a curse?

            gene:   g1     g2     g3     ...   gn     disease
patient p1          x1,1   x1,2   x1,3   ...   x1,n   Y
        p2          x2,1   x2,2   x2,3   ...   x2,n   N
        ...         ...    ...    ...    ...   ...    ...
        pm          xm,1   xm,2   xm,3   ...   xm,n   ?
 Representation becomes all important
 Think of discrete data as being pre-binned
 If you could arrange it appropriately, you could use techniques like Hamming distance
 Remember RNA classification
    Data in each dimension was A, C, U, or G
    How might you measure distance?
    A might be closer to G than to C or U (A and G are both purines while C and U are pyrimidines). Dimensional distance becomes domain specific.
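A sketch of a domain-specific per-dimension distance for the RNA example; the numeric costs (0 for a match, 0.5 within purines or within pyrimidines, 1 otherwise) are illustrative assumptions, not values from the slides:

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "U"}

def base_distance(b1, b2):
    # Per-dimension, domain-specific distance between two bases
    if b1 == b2:
        return 0.0
    if (b1 in PURINES and b2 in PURINES) or (b1 in PYRIMIDINES and b2 in PYRIMIDINES):
        return 0.5          # same chemical family: "closer"
    return 1.0              # purine vs pyrimidine

def sequence_distance(seq1, seq2):
    # Hamming-style distance: sum the per-position costs
    return sum(base_distance(a, b) for a, b in zip(seq1, seq2))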
 First few records in the training data
 See any issues? Hint: think of how Euclidean distance is calculated
 Should really normalize the data
 For each entry in a dimension:

\frac{x_i - \min}{\max - \min}

Redness     Yellowness   Mass       Volume     Class
4.327248    3.322961     118.4266   19.07535   peach
2.96197     4.124945     159.2573   29.00904   orange
5.655719    1.706671     147.0695   39.30565   apple
...         ...          ...        ...        ...
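A sketch of min-max normalization, (x_i - min) / (max - min) applied per dimension, so that large-scale attributes such as Mass do not dominate the Euclidean distance; normalize_columns is an illustrative helper name:

def normalize_columns(rows):
    # rows: list of equal-length numeric attribute vectors
    cols = list(zip(*rows))
    mins = [min(c) for c in cols]
    maxs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0   # guard against a constant column
         for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]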
Function approximation
 Real-valued prediction: take the average of the nearest k neighbors
 Why average?

\hat{f}(x_q) \leftarrow \frac{\sum_{i=1}^{k} f(x_i)}{k}

 If we don't know the function, and/or it is too complex to "learn", we can just plug in a new value and the KNN classifier can "learn" the predicted value on the fly by averaging the nearest neighbors

[Figure: real-valued data plotted in the X-Y plane]
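A minimal sketch of this real-valued prediction: average the targets of the k nearest neighbors (knn_regress is an illustrative name; math.dist is the standard-library Euclidean distance):

import math

def knn_regress(query, training_data, k=3):
    # training_data: list of (attribute_vector, target_value) pairs
    by_distance = sorted(training_data, key=lambda pair: math.dist(query, pair[0]))
    nearest = by_distance[:k]
    return sum(target for _, target in nearest) / len(nearest)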
 Choose an m and b that minimizes the squared error
 But again, computationally, how?

That is, the m and b that minimize

\sum_{i=1}^{|training\ set|} \left( y_{training\ data} - y_{calculated\ with\ current\ m\ and\ b} \right)^2

[Figure: training data plotted in the X-Y plane]
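A one-function sketch of the objective above, assuming the training set is a list of (x, y) pairs:

def squared_error(m, b, training_set):
    # Sum of squared differences between the training y and the line's prediction
    return sum((y - (m * x + b)) ** 2 for x, y in training_set)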
 If we want to learn an instantaneous slope, we can do local regression
 Get the slope of a line that fits just the local data

[Figure: local linear fit to the data near a query point]
 KNN is highly effective for many practical problems
    With sufficient training data
    Robust to noisy training data
 Work is back-loaded
 Susceptible to the curse of dimensionality
 For each training datum we know what Y should be
 If we have a randomly generated m and b, these, along with X, will tell us a predicted Y
 So we know whether the m and b yield too large or too small a prediction
 Can nudge m and b in an appropriate direction (+ or -): a ∆m and a ∆b
 Sum these proposed nudges across all training data

[Figure: the line represents the output (predicted Y); one training point is marked "Target Y too low"]
 Which way should m go to reduce the error?

The guessed line gives y_pred = m_guess x + b_guess, and the error at a point is y_pred - y_act.

Since m = \frac{rise}{run}:

\Delta m = \frac{\Delta rise}{run} = \frac{y_{pred} - b}{x} - \frac{y_{act} - b}{x} = \frac{y_{pred} - y_{act}}{x}

Could average across the training data:

\Delta m = \frac{1}{n} \sum_{i=1}^{n} \frac{y_{pred} - y_{act}}{x_i}

 Then do the same for b
 Then do it again

[Figure: the guessed line y_pred = m_guess x + b_guess against the actual y values, showing the rise and the intercept b]
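A sketch of the nudge procedure described on the last two slides: average the per-point corrections, adjust m, do the same for b, and repeat. The learning rate and iteration count are illustrative assumptions, and the ∆m rule above assumes x is nonzero:

def fit_line(training_set, steps=1000, rate=0.01):
    # training_set: list of (x, y) pairs, with x != 0 for the delta-m rule above
    m, b = 0.0, 0.0                      # could also start from random guesses
    n = len(training_set)
    for _ in range(steps):
        dm = sum(((m * x + b) - y) / x for x, y in training_set) / n
        db = sum(((m * x + b) - y) for x, y in training_set) / n
        m -= rate * dm                   # nudge m against the averaged error
        b -= rate * db                   # then do the same for b
    return m, b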
 Locally weighted linear regression
    Would still perform gradient descent
    Becomes a global function approximation

\hat{f}(x) = w_0 + w_1 a_1(x) + \dots + w_n a_n(x)

[Figure: data in the X-Y plane fit with locally weighted linear regression]
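A sketch of locally weighted linear regression for a single query point. The slides suggest gradient descent; this version instead solves the weighted least-squares normal equations directly with NumPy, and the Gaussian kernel bandwidth is an illustrative assumption:

import numpy as np

def lwlr_predict(query, X, y, bandwidth=1.0):
    # X: (m, n) array of training attributes, y: (m,) targets, query: (n,) point
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])          # prepend a column for w0
    qb = np.hstack([1.0, query])
    dists = np.linalg.norm(X - query, axis=1)
    w = np.exp(-(dists ** 2) / (2.0 * bandwidth ** 2))     # nearby points weigh more
    W = np.diag(w)
    # Weighted least squares: solve (Xb^T W Xb) theta = Xb^T W y
    theta = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)
    return qb @ theta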