Instance Based Classification

KNN Classifier

Handed an instance you wish to classify, look around the nearby region to see what other classes are around. Whichever class is most common, make that the prediction.

[Figure: scatter plot of two classes in the X-Y plane, with a neighborhood drawn around the query point.]

Assign the most common class among the K nearest neighbors (like a vote).

Train: load the training data.
Classify: read in the instance, find its K nearest neighbors in the training data, and assign the most common class among those neighbors (like a vote).

Distances are measured with the Euclidean distance, where a_r is an attribute (dimension):

    d(x_i, x_j) \equiv \sqrt{\sum_{r=1}^{n} \big(a_r(x_i) - a_r(x_j)\big)^2}

Naïve approach: exhaustive. For the instance x_q to be classified, visit every training sample and calculate its distance, sort the distances, and take the first K in the list.

Voting formula:

    \hat{f}(x_q) \leftarrow \operatorname*{argmax}_{v \in V} \sum_{i=1}^{k} \delta\big(v, f(x_i)\big)

where f(x_i) is x_i's class and \delta(a, b) = 1 if a = b, 0 otherwise.

The Work that Must be Performed

Visiting every training sample, calculating its distance, and sorting means lots of floating-point calculations. The classifier puts off the work until it is time to classify.

Where the work happens

This is known as a "lazy" learning method. If most of the work is done during the training stage, the method is known as "eager". Our next classifier, Naïve Bayes, will be eager: training takes a while, but it can classify fast. Which do you think is better?

From Wikipedia: a k-d tree is a space-partitioning data structure for organizing points in a k-dimensional space. k-d trees are a useful data structure for several applications, such as searches involving a multidimensional search key (e.g. range searches and nearest-neighbor searches). k-d trees are a special case of BSP trees.

A k-d tree speeds up classification but probably slows "training".

Weighted Voting Formula

Choosing K can be a bit of an art. What if you could include all data points (K = n)? How might you do such a thing? What if you weighted the vote of each training sample by its distance from the point being classified?

    \hat{f}(x_q) \leftarrow \operatorname*{argmax}_{v \in V} \sum_{i=1}^{k} w_i \, \delta\big(v, f(x_i)\big), \qquad w_i = \frac{1}{d(x_q, x_i)^2}

where \delta(v, f(x_i)) is 1 if x_i is a member of class v (i.e. v = f(x_i), where f(x_i) returns the class of x_i).

[Figure: weight versus distance for the 1-over-distance-squared weighting and for a linear weighting.]

You could get less fancy and go linear, but then training data that are very far away would still have a strong influence.

Other Radial Basis Functions

These weighting functions are sometimes known as kernel functions. One of the more common is the Gaussian:

    K\big(d(x, x_t)\big) = \frac{1}{\sqrt{2\pi}} e^{-(x - \mu)^2 / 2\sigma^2}

[Figure: bell-shaped weight curve as a function of distance.]

Other Issues?

The work is back-loaded, and it gets worse the bigger the training data; data structures can alleviate this. What else? What if only some dimensions contribute to the ability to classify? Differences in the other dimensions would put distance between a point and the target.
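As a concrete illustration of the exhaustive procedure and the two voting formulas above, here is a minimal sketch (not code from the lecture). It assumes a NumPy feature matrix; the function name knn_predict, the weighted flag, and the small epsilon guarding against zero distance are my own choices.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, weighted=False):
    """Exhaustive KNN: compute every distance, sort, then vote among the first k."""
    # Euclidean distance from the query to every training sample
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]              # indices of the k closest samples

    if not weighted:
        # Plain majority vote: count delta(v, f(x_i)) over the k neighbors
        return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

    # Distance-weighted vote: w_i = 1 / d(x_q, x_i)^2
    votes = {}
    for i in nearest:
        w = 1.0 / (dists[i] ** 2 + 1e-12)        # epsilon avoids division by zero
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + w
    return max(votes, key=votes.get)

# Toy usage: two clusters in the plane, like the two-class scatter plot above
X = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
              [5.0, 5.2], [5.1, 4.8], [4.9, 5.0]])
y = np.array(["red", "red", "red", "blue", "blue", "blue"])
print(knn_predict(X, y, np.array([1.0, 1.0]), k=3))                 # "red"
print(knn_predict(X, y, np.array([4.0, 4.0]), k=5, weighted=True))  # "blue"
```

With weighted=True and k set to the full training-set size, this behaves like the "include all data points" variant discussed above, since distant samples contribute almost nothing to the vote.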
More is not always better

Two instances might be identical in the important dimensions while the other dimensions are simply random, making them seem far apart. From Wikipedia: In applied mathematics, the curse of dimensionality (a term coined by Richard E. Bellman), also known as the Hughes effect or Hughes phenomenon (named after Gordon F. Hughes), refers to the problem caused by the exponential increase in volume associated with adding extra dimensions to a mathematical space. For example, 100 evenly spaced sample points suffice to sample a unit interval with no more than 0.01 distance between points; an equivalent sampling of a 10-dimensional unit hypercube with a lattice spacing of 0.01 between adjacent points would require 10^20 sample points. Thus, in some sense, the 10-dimensional hypercube can be said to be a factor of 10^18 "larger" than the unit interval. (Adapted from an example by R. E. Bellman.)

Is there a curse?

Thousands of genes, relatively few patients:

             gene
             g1     g2     g3     ...   gn     disease
patient p1   x1,1   x1,2   x1,3   ...   x1,n   Y
        p2   x2,1   x2,2   x2,3   ...   x2,n   N
        ...
        pm   xm,1   xm,2   xm,3   ...   xm,n   ?

Representation becomes important

Think of discrete data as all being pre-binned. If the values can be arranged appropriately, you can use techniques like Hamming distance. Remember RNA classification: the data in each dimension was A, C, U, or G. How do you measure distance there? A might be closer to G than to C or U (A and G are both purines, while C and U are pyrimidines). Dimensional distance becomes domain specific.

Here are the first few records in the training data. See any issues? (Hint: think of how the Euclidean distance is really calculated.)

Redness    Yellowness   Mass       Volume     Class
4.816472   2.347954     125.5082   25.01441   apple
2.036318   4.879481     125.8775   18.2101    lemon
2.767383   3.353061     109.9687   33.53737   orange
4.327248   3.322961     118.4266   19.07535   peach
2.96197    4.124945     159.2573   29.00904   orange
5.655719   1.706671     147.0695   39.30565   apple

You should normalize the data. For each entry x_i in a dimension, replace it with

    \frac{x_i - \min}{\max - \min}

Function approximation

Why average? For a real-valued prediction, take the average of the nearest k neighbors:

    \hat{f}(x_q) \leftarrow \frac{1}{k} \sum_{i=1}^{k} f(x_i)

If you don't know the function, and/or it is too complex to "learn", just plug in a new value: the KNN "classifier" can learn the predicted value on the fly by averaging the nearest neighbors.

[Figure: training points in the X-Y plane with the KNN-averaged prediction.]
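The normalization step and the k-neighbor average above can be sketched as follows. This is an illustrative sketch only; min_max_normalize and knn_regress are names I made up, and the sample values are loosely rounded from the fruit table above.

```python
import numpy as np

def min_max_normalize(X):
    """Rescale each column (dimension) to [0, 1] via (x - min) / (max - min)."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins)

def knn_regress(X_train, y_train, x_query, k=3):
    """Real-valued prediction: average the values of the k nearest neighbors."""
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()

# Mass (~100s) dwarfs redness and yellowness (~1-5), so without normalization
# the Euclidean distance is dominated by the mass column.
fruit = np.array([[4.82, 2.35, 125.5, 25.0],
                  [2.04, 4.88, 125.9, 18.2],
                  [2.77, 3.35, 110.0, 33.5],
                  [4.33, 3.32, 118.4, 19.1]])
print(min_max_normalize(fruit))

# KNN regression on noisy samples of y = 2x
x_train = np.linspace(0, 10, 50).reshape(-1, 1)
y_train = 2 * x_train.ravel() + np.random.normal(0, 0.3, 50)
print(knn_regress(x_train, y_train, np.array([4.2]), k=5))  # approximately 8.4
```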
Choose an m and b that minimize the squared error. But again, computationally, how? We want the m and b that minimize

    \sum_{i=1}^{|\text{training set}|} \big(y_{\text{training data}} - y_{\text{calculated with current } m \text{ and } b}\big)^2

[Figure: training points in the X-Y plane with a fitted line.]

If you want to learn an instantaneous slope, you can do local regression: get the slope of a line that fits just the local data.

[Figure: nonlinear data in the X-Y plane with a line fitted only to the points near the query.]

KNN is highly effective for many practical problems, given sufficient training data, and it is robust to noisy training data. Its work is back-loaded, and it is susceptible to the curse of dimensionality.

How do we find such an m and b? For each of the training data we know what Y should be. Given a randomly generated m and b, these, together with X, tell us a predicted Y, so we know whether the current m and b yield too large or too small a prediction. We can therefore nudge m and b in an appropriate direction (+ or -) and sum these proposed nudges, \Delta m and \Delta b, across all the training data. In the accompanying figure, the line represents the output (the predicted Y) and the target Y lies below it: the prediction is too high.

Which way should m go to reduce the error? With

    y_{\text{pred}} = m_{\text{guess}} \, x + b_{\text{guess}}, \qquad m = \frac{\text{rise}}{\text{run}},

a change in the rise gives

    \Delta m = \frac{\Delta \text{rise}}{\text{run}} = \frac{y_{\text{pred}} - b}{x} - \frac{y_{\text{act}} - b}{x} = \frac{y_{\text{pred}} - y_{\text{act}}}{x}.

We could average this over the training data,

    \Delta m = \frac{1}{n} \sum_{i=1}^{n} \frac{y_{\text{pred},i} - y_{\text{act},i}}{x_i},

then do the same for b, and then do it all again (iterate).

Locally weighted linear regression

    f(x) = w_0 + w_1 a_1(x) + \cdots + w_n a_n(x)

We would still perform gradient descent; this becomes a global function approximation.

[Figure: nonlinear data in the X-Y plane approximated by locally weighted regression.]
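Here is a short sketch of the "nudge m and b" idea described above, assuming nothing beyond NumPy. It uses the standard gradient of the mean squared error (error times x) rather than the slides' rise-over-run form (error divided by x); both push m and b in the direction that reduces the error. The names fit_line, lr, and steps are my own.

```python
import numpy as np

def fit_line(x, y, lr=0.01, steps=1000):
    """Find m and b for y = m*x + b by repeatedly nudging them to reduce squared error."""
    m, b = 0.0, 0.0                              # could also start from random guesses
    n = len(x)
    for _ in range(steps):
        y_pred = m * x + b
        err = y_pred - y                         # positive -> prediction too high
        dm = (2.0 / n) * np.sum(err * x)         # averaged nudge for m (MSE gradient)
        db = (2.0 / n) * np.sum(err)             # averaged nudge for b
        m -= lr * dm                             # step against the error
        b -= lr * db
    return m, b

# Noisy samples of y = 3x - 2
rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, 200)
y = 3 * x - 2 + rng.normal(0, 0.5, 200)
print(fit_line(x, y))    # roughly (3, -2)
```

A locally weighted variant would perform the same updates but multiply each sample's contribution by a kernel weight (for example the Gaussian above) centered on the query point.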