Slides - UC Davis CS

Th 04-22 Lesson Plan
● Kernels, corrections, and some demos (buried in book section 12.3)
● One-class SVMs revisited and SS SVMs (my notes)
● Learning Kernels
● SMO (not in book; see website for introductory notes)
● Kernels, Kernels, and more Kernels
● Kernel PCA (section 14.5)
● Kernel K-Means
● Kernel KNN (section 18.5.2)
Kernels – Big Picture
● A kernel is a ________ matrix where the entry $K_{i,j}$ is a measure of ________ between points $i$ and $j$.
● The kernel “trick” is that the entry $K_{i,j}$ is a measure of distance in a ______-dimensional space, but using a computation in the _______-dimensional space.
We Didn't Quite Get the Kernel $\langle x, y \rangle^2$
● For 2D space:
  – $(x_1 y_1 + x_2 y_2)^2 = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2$
● Equivalent basis function:
  – $\phi(x) = \langle x_1^2, \sqrt{2}\, x_1 x_2, x_2^2 \rangle$
● We only got the transformation two-thirds right.
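As a quick numerical check of the identity above, here is a minimal Python/numpy sketch (example points chosen arbitrarily) showing that the explicit map reproduces the kernel, i.e. $\langle \phi(x), \phi(y) \rangle = \langle x, y \rangle^2$:

```python
import numpy as np

def phi(v):
    """Explicit feature map for the 2D kernel <x, y>^2."""
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

kernel_value = np.dot(x, y) ** 2        # computed in the original 2D space
lifted_value = phi(x) @ phi(y)          # computed in the 3D feature space
print(kernel_value, lifted_value)       # both are 1.0
```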
There are Three “Operations”
● “Stretching” in the original x1 and x2 dimensions:
  – $x_1^2,\; x_2^2$
● “Lifting” and “Dropping” in the new x3 dimension:
  – $\sqrt{2}\, x_1 x_2$
● “Folding”:
  – $(-1, 1) \to (1, 1)$
  – $(1, 1) \to (1, 1)$
  – $(-1, -1) \to (1, 1)$
  – $(1, -1) \to (1, 1)$
Kernel: $\langle x, y \rangle^2$ (figures)
Decision Boundary in Lower-Dimensional Space
● The decision boundary in the higher-dimensional space is ___________.
● In the lower-dimensional space it is _________.
Mercer's Theorem
● Any ________ matrix (the kernel) can be calculated as an ________ product in a high-dimensional space.
● The kernel can be calculated from properties in the ________ ________ ________.
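To illustrate the matrix property Mercer's theorem trades on, here is a small sketch (Python/numpy; points and sigma are illustrative) that builds a Gaussian-kernel Gram matrix on random data and confirms its eigenvalues are nonnegative up to round-off:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))      # 20 random 2D points
sigma = 1.0

# Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / sigma^2)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-sq_dists / sigma**2)

print(np.linalg.eigvalsh(K).min())   # >= 0 up to round-off: K is PSD
```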
Types of Kernels
● Polynomial:
  – $K(x, y) = (\langle x, y \rangle + c)^d$
● Gaussian Kernel:
  – $K(x, y) = \exp(-\|x - y\|^2 / \sigma^2)$
● Sigmoid Kernel:
  – $K(x, y) = \tanh(a \langle x, y \rangle + c)$, or $1 / (1 + e^{-\text{net}})$
● Let's draw each of these: the x-axis is the resultant x, y combination.
● I can tell you how to combine Kernels (see the sketch below).
● I can tell you a bad Kernel.
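A minimal sketch of the three kernels above plus one legal combination (sums and positive scalings of valid kernels are valid kernels); parameter defaults here are illustrative, not from the slides:

```python
import numpy as np

def polynomial(x, y, c=1.0, d=2):
    """Polynomial kernel K(x, y) = (<x, y> + c)^d."""
    return (np.dot(x, y) + c) ** d

def gaussian(x, y, sigma=1.0):
    """Gaussian kernel K(x, y) = exp(-||x - y||^2 / sigma^2)."""
    return np.exp(-np.sum((x - y) ** 2) / sigma**2)

def sigmoid(x, y, a=1.0, c=0.0):
    """Sigmoid kernel K(x, y) = tanh(a <x, y> + c).
    (Not PSD for all a, c -- one reason it can be a 'bad' kernel.)"""
    return np.tanh(a * np.dot(x, y) + c)

x = np.array([1.0, 0.5])
y = np.array([0.0, 2.0])

# Sums, positive scalings, and products of valid kernels stay valid:
combined = polynomial(x, y) + 0.5 * gaussian(x, y)
print(polynomial(x, y), gaussian(x, y), sigmoid(x, y), combined)
```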
Say Something Useful on Trying To Select Kernels!
● People tend to steer away from saying "Kernel K guarantees linear separation."
● Instead: "Kernel K works well on data."
● Create a Kernel empirically. How? Consider strings (see the sketch below).
● Use a Kernel which matches your modeling assumptions about the data.
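The slides don't fix a construction for string kernels, but one common empirical choice is a k-spectrum kernel, which counts shared length-k substrings; a hypothetical sketch:

```python
from collections import Counter

def spectrum_kernel(s, t, k=2):
    """K(s, t) = number of matching length-k substring pairs.
    This is an inner product in k-mer count space, hence a valid kernel."""
    cs = Counter(s[i:i+k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i+k] for i in range(len(t) - k + 1))
    return sum(cs[sub] * ct[sub] for sub in cs)

print(spectrum_kernel("kernel", "colonel"))   # shared bigrams "ne", "el" -> 2
```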
Learning the Kernel
● Why? Under what conditions?
● We've already covered an approach to learn a kernel.
● What properties must the Kernel have?
● Four main ways to learn Kernels so far:
  – Semi-definite programming methods [Lanckriet et al., JMLR, 2004]
  – Projection-based methods [Tsuda et al., JMLR, 2005]
  – Hyperkernels and other methods [Ong et al., JMLR, 2005]
  – Low-rank approximation with Bregman divergences [Dhillon]
Kernelizing – Why?
Kernel KNN
● KNN with Kernels
● Key NN calculation for point x':
  – $\mathrm{argmin}_x \|x' - x\|^2$, equivalent to:
  – $(x' - x)^T (x' - x) = x' \cdot x' - 2\, x' \cdot x + x \cdot x$
● Wow, all are inner products! (see the sketch below)
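Because every term in the expansion is an inner product, each can be replaced by a kernel evaluation, giving NN search in the feature space without ever computing $\phi$. A minimal sketch (Gaussian kernel and toy points chosen only for illustration):

```python
import numpy as np

def gaussian(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / sigma**2)

def kernel_nn(x_query, X, K=gaussian):
    """Index of the nearest neighbor of x_query under the feature-space
    distance ||phi(x') - phi(x)||^2 = K(x',x') - 2 K(x',x) + K(x,x)."""
    dists = [K(x_query, x_query) - 2 * K(x_query, x) + K(x, x) for x in X]
    return int(np.argmin(dists))

X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])
print(kernel_nn(np.array([0.9, 1.2]), X))   # -> 1
```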
Kernel K-Means
● Why do we need to kernelize k-means?
Next Iteration Would Produce What Centroids?
Kernel K-Means
Kernel K-Means Algorithm (see the sketch below)
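The slides leave the algorithm to a figure, so the following is only a sketch of the assignment step: the distance from $\phi(x_i)$ to an implicit centroid expands entirely into Gram-matrix entries, so no point in feature space is ever materialized.

```python
import numpy as np

def assign_clusters(K, labels, n_clusters):
    """One kernel k-means assignment pass over a precomputed Gram matrix K.

    ||phi(x_i) - m_c||^2 = K[i,i] - (2/|c|) * sum_{j in c} K[i,j]
                                  + (1/|c|^2) * sum_{j,l in c} K[j,l]
    Assumes every cluster is currently non-empty.
    """
    n = K.shape[0]
    dists = np.zeros((n, n_clusters))
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        m = len(members)
        within = K[np.ix_(members, members)].sum() / m**2   # centroid self-term
        cross = K[:, members].sum(axis=1) / m               # point-to-centroid term
        dists[:, c] = np.diag(K) - 2 * cross + within
    return dists.argmin(axis=1)

# Demo with a linear kernel, where this reduces to ordinary k-means:
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
K = X @ X.T
labels = np.array([0, 1, 0, 1])      # deliberately bad initial assignment
for _ in range(5):                   # iterate to convergence
    labels = assign_clusters(K, labels, 2)
print(labels)                        # -> [0 0 1 1]
```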
SMO
Kernel PCA
● Next Tuesday.
The Kernel $\|(x+y)^2\|^2$
● Gabe correctly pointed out that my derivation of the corresponding basis function was incorrect.
● It turns out this “Kernel” has no basis function.
● Consider three points: A(0,0), B(1,0), and C(0,1).
● Let's calculate their distances in the higher-dimensional space, i.e., let's construct the Kernel (or Gram) matrix.
● What important property is violated?
● Hence we can show this Kernel is not _________.
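Reading the slide's kernel as $K(x, y) = \|x + y\|^2$ (the notation is ambiguous, so this reading is an assumption), a sketch of the Gram-matrix check on A, B, C, which turns up a negative eigenvalue:

```python
import numpy as np

# Hypothetical reading of the slide's kernel: K(x, y) = ||x + y||^2
def K(x, y):
    return np.sum((x + y) ** 2)

points = [np.array([0.0, 0.0]),   # A
          np.array([1.0, 0.0]),   # B
          np.array([0.0, 1.0])]   # C

gram = np.array([[K(p, q) for q in points] for p in points])
print(gram)
# [[0. 1. 1.]
#  [1. 4. 2.]
#  [1. 2. 4.]]
print(np.linalg.eigvalsh(gram))   # smallest eigenvalue < 0, so not PSD
```

Under this reading, the 2x2 principal minor on A and B already has determinant $0 \cdot 4 - 1 \cdot 1 = -1 < 0$, so no feature map $\phi$ can produce this Gram matrix as inner products.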