10-701/15-781 Recitation: Kernels
Manojit Nandi
February 27, 2014
Outline
Mathematical Theory
Banach Space and Hilbert Spaces
Kernels
Commonly Used Kernels
Kernel Theory
One Weird Kernel Trick
Representer Theorem
Do You want to Build a ~~Snowman~~ Kernel? It doesn't have to be a ~~Snowman~~ Kernel.
Operations that Preserve/Create Kernels
Current Research in Kernel Theory
Flaxman Test
Fast Food
References
IMPORTANT: The kernels used in Support Vector Machines are different from the kernels used in Kernel Regression.
To avoid confusion, I'll refer to the Support Vector Machine kernels as Mercer Kernels and the Kernel Regression kernels as Smoothing Kernels.
Recall: A square matrix A ∈ R^{n×n} is positive semi-definite if for all vectors u ∈ R^n, u^T A u ≥ 0.
For some kernel function K : X × X → R defined over a set X with |X| = n, the Gram Matrix G is the n × n matrix with entries G_ij = K(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩.
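A minimal NumPy sketch of this definition: build the Gram matrix for a kernel and check positive semi-definiteness through its eigenvalues (the function names, the RBF choice, and the tolerance are illustrative choices).

```python
import numpy as np

def gram_matrix(kernel, X):
    """Gram matrix G with G[i, j] = K(x_i, x_j) for the rows of X."""
    n = len(X)
    G = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            G[i, j] = kernel(X[i], X[j])
    return G

def is_psd(G, tol=1e-10):
    """A symmetric matrix is PSD iff all of its eigenvalues are >= 0."""
    return bool(np.all(np.linalg.eigvalsh(G) >= -tol))

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))                           # 5 points in R^3
rbf = lambda x, z: np.exp(-0.5 * np.sum((x - z) ** 2))    # a valid kernel
print(is_psd(gram_matrix(rbf, X)))                        # True
```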
Mathematical Theory
Banach Space and Hilbert Spaces
A Banach Space is a complete Vector Space V equipped with a norm.
Examples:
• ℓ_p^m spaces: R^m equipped with the norm ‖x‖_p = ( Σ_{i=1}^m |x_i|^p )^{1/p}
• Function spaces: ‖f‖_p := ( ∫_X |f(x)|^p dx )^{1/p}
A Hilbert Space is a complete Vector Space V equipped with an inner product ⟨·, ·⟩.
The inner product defines a norm (‖x‖ = √⟨x, x⟩), so Hilbert Spaces are a special case of Banach Spaces.
Example:
Euclidean Space with the standard inner product: for x, y ∈ R^n, ⟨x, y⟩ = Σ_{i=1}^n x_i y_i = x^T y
Because Kernel functions are the inner product in some higher-dimensional
space, each kernel function corresponds to some Hilbert Space H.
Kernels
A Mercer Kernel is a function K defined over a set X such that K : X × X → R. The Mercer Kernel represents an inner product in some higher-dimensional space.
A Mercer Kernel is symmetric and positive semi-definite. By Mercer's theorem, we know that any symmetric positive semi-definite kernel represents the inner product in some higher-dimensional Hilbert Space H.
Symmetric and Positive Semi-Definite ⇔ Kernel Function ⇔ K(x, x′) = ⟨φ(x), φ(x′)⟩ for some φ(·).
Why Symmetric?
Inner products are symmetric by definition, so if the kernel function represents an inner product in some Hilbert Space, then the kernel function must be symmetric as well:
K(x, z) = ⟨φ(x), φ(z)⟩ = ⟨φ(z), φ(x)⟩ = K(z, x).
Why Positive Semi-definite?
In order to be positive semi-definite, the Gram Matrix G ∈ R^{n×n} must satisfy u^T G u ≥ 0 for all u ∈ R^n. Using functional analysis, one can show that u^T G u corresponds to ⟨h_u, h_u⟩ for some h_u in some Hilbert Space H. Because ⟨h_u, h_u⟩ ≥ 0 by the definition of an inner product and ⟨h_u, h_u⟩ = u^T G u, it follows that u^T G u ≥ 0.
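Concretely, one choice that makes this step explicit is h_u = Σ_{i=1}^n u_i φ(x_i). Then
u^T G u = Σ_i Σ_j u_i u_j K(x_i, x_j) = Σ_i Σ_j u_i u_j ⟨φ(x_i), φ(x_j)⟩ = ⟨ Σ_i u_i φ(x_i), Σ_j u_j φ(x_j) ⟩ = ⟨h_u, h_u⟩ = ‖h_u‖^2 ≥ 0.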
Why are Kernels useful?
Let x = (x_1, x_2) and z = (z_1, z_2). Then
K(x, z) = ⟨x, z⟩^2 = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2
= ⟨(x_1^2, √2 x_1 x_2, x_2^2), (z_1^2, √2 z_1 z_2, z_2^2)⟩ = ⟨φ(x), φ(z)⟩,
where φ(x) = (x_1^2, √2 x_1 x_2, x_2^2).
What if x = (x_1, x_2, x_3, x_4), z = (z_1, z_2, z_3, z_4), and K(x, z) = ⟨x, z⟩^50?
In this case, φ(x) and φ(z) are 23426* dimensional vectors.
*: From combinatorics, 50 unlabeled balls into 4 labeled boxes = C(53, 3) = 23426.
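A small numerical check of the degree-2 identity above (the particular vectors are just illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

phi = lambda v: np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

lhs = np.dot(x, z) ** 2        # kernel evaluation: <x, z>^2
rhs = np.dot(phi(x), phi(z))   # explicit inner product in the feature space
print(np.isclose(lhs, rhs))    # True
```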
We can either:
1. Calculate ⟨x, z⟩ (the inner product of two four-dimensional vectors) and raise this value (a float) to the 50th power.
2. OR map x and z to φ(x) and φ(z), and calculate the inner product of two 23426-dimensional vectors.
Which is faster?
Commonly Used Kernels
Some commonly used Kernels are:
• Linear Kernel: K(x, z) = ⟨x, z⟩
• Polynomial Kernel: K(x, z) = ⟨x, z⟩^d, where d is the degree of the polynomial
• Gaussian RBF: K(x, z) = exp(−‖x − z‖^2 / (2σ^2)) = exp(−γ ‖x − z‖^2)
• Sigmoid Kernel: K(x, z) = tanh(α x^T z + β)
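Minimal NumPy versions of the four kernels above (the parameter defaults d = 3, γ = 0.5, α = 1, β = 0 are illustrative choices, not prescribed values):

```python
import numpy as np

def linear(x, z):
    return np.dot(x, z)

def polynomial(x, z, d=3):
    return np.dot(x, z) ** d

def gaussian_rbf(x, z, gamma=0.5):        # gamma = 1 / (2 * sigma ** 2)
    return np.exp(-gamma * np.sum((x - z) ** 2))

def sigmoid(x, z, alpha=1.0, beta=0.0):
    return np.tanh(alpha * np.dot(x, z) + beta)
```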
Other Kernels:
1. Laplacian Kernel
2. ANOVA Kernel
3. Circular Kernel
4. Spherical Kernel
5. Wave Kernel
6. Power Kernel
7. Log Kernel
8. B-Spline Kernel
9. Bessel Kernel
10. Cauchy Kernel
11. χ² Kernel
12. Wavelet Kernel
13. Bayesian Kernel
14. Histogram Kernel
As you can see, there are a lot of kernels.
The Gaussian RBF Kernel is infinite-dimensional
In class, Barnabas mentioned that the Gaussian RBF Kernel corresponds to an infinite-dimensional vector space. From the Moore-Aronszajn theorem, we know there is a unique Hilbert space of functions for which the Gaussian RBF is a reproducing kernel.
One can show that this Hilbert Space has an infinite basis, and so the Gaussian RBF Kernel corresponds to an infinite-dimensional vector space. A YouTube video of Abu-Mostafa explaining this in more detail is included in the references at the end of this presentation.
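One quick sketch of why (a standard power-series argument, assuming the form K(x, z) = exp(−γ ‖x − z‖^2)): write
exp(−γ‖x − z‖^2) = exp(−γ‖x‖^2) · exp(−γ‖z‖^2) · exp(2γ⟨x, z⟩),
and expand the last factor as exp(2γ⟨x, z⟩) = Σ_{k=0}^∞ (2γ)^k ⟨x, z⟩^k / k!. This is a non-negative combination of polynomial kernels of every degree k, so the corresponding feature map stacks rescaled polynomial features of all degrees and is therefore infinite-dimensional.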
Kernel Theory
One Weird Kernel Trick
The kernel trick is one of the reasons kernel methods are so popular in machine learning. With the kernel trick, we can substitute any dot product with a kernel function. This means any linear dot product in the formulation of a machine learning algorithm can be replaced with a non-linear kernel function. The non-linear kernel function may allow us to find separation or structure in the data that was not present in the linear dot product.
Kernel Trick: ⟨x_i, x_j⟩ can be replaced with K(x_i, x_j) for any valid kernel function K. Furthermore, we can swap any kernel function with any other kernel function.
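A small illustration of the trick, using the fact that squared distances in feature space can be computed from kernel evaluations alone: ‖φ(x) − φ(z)‖^2 = K(x, x) − 2K(x, z) + K(z, z). A sketch (function name and example points are illustrative):

```python
import numpy as np

def feature_space_sq_dist(x, z, K):
    """||phi(x) - phi(z)||^2 computed from kernel evaluations only."""
    return K(x, x) - 2.0 * K(x, z) + K(z, z)

rbf = lambda a, b: np.exp(-0.5 * np.sum((a - b) ** 2))
x, z = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(feature_space_sq_dist(x, z, rbf))   # distance in the RBF feature space, no phi needed
```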
Some examples of algorithms that take advantage of the kernel trick:
• Support Vector Machines (Poster Child for Kernel Methods).
• Kernel Principal Component Analysis (a short sketch follows this list)
• Kernel Independent Component Analysis
• Kernel K-Means algorithm
• Kernel Gaussian Process Regression
• Kernel Deep Learning
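As one example from this list, here is a compact sketch of Kernel PCA starting from a precomputed Gram matrix; the centering and normalization follow the usual formulation, and the function name, the eps floor, and the default number of components are illustrative choices.

```python
import numpy as np

def kernel_pca(G, n_components=2):
    """Project training points onto the top feature-space principal components,
    given a precomputed Gram matrix G."""
    n = G.shape[0]
    one_n = np.ones((n, n)) / n
    Gc = G - one_n @ G - G @ one_n + one_n @ G @ one_n     # center the data in feature space
    eigvals, eigvecs = np.linalg.eigh(Gc)                  # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:n_components]         # keep the largest components
    alphas = eigvecs[:, idx] / np.sqrt(np.maximum(eigvals[idx], 1e-12))  # normalize coefficients
    return Gc @ alphas                                     # (n, n_components) projections
```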
Representer Theorem
Theorem
Let X be a non-empty set and k a positive-definite real-valued kernel on X × X, with corresponding reproducing kernel Hilbert Space H_k. Given a training sample (x_1, y_1), ..., (x_n, y_n) ∈ X × R, a strictly monotonically increasing real-valued function g : [0, ∞) → R, and an arbitrary empirical loss function E : (X × R^2)^n → R ∪ {∞}, any f* satisfying
f* = argmin_{f ∈ H_k} { E[(x_1, y_1, f(x_1)), ..., (x_n, y_n, f(x_n))] + g(‖f‖) }
can be represented in the form
f*(·) = Σ_{i=1}^n α_i k(·, x_i),   where α_i ∈ R for all i.
Example: Let H_k be some RKHS function space with a kernel k(·, ·). Let (x_1, y_1), ..., (x_n, y_n) be a training input-output set, and suppose our task is to find the f* that minimizes the following regularized objective:
f* = argmin_{f ∈ H_k} ( Π_{i=1}^n |f(x_i)|^6 )^25 · Σ_{i=1}^n [ sin(‖x_i‖^(y_i − f(x_i))) + y_i |f(x_i)|^249 ] + exp(‖f‖_{H_k})
f* = argmin_{f ∈ H_k} ( Π_{i=1}^n |f(x_i)|^6 )^25 · Σ_{i=1}^n [ sin(‖x_i‖^(y_i − f(x_i))) + y_i |f(x_i)|^249 ] + exp(‖f‖_{H_k})
Well, exp(·) is a strictly monotonically increasing function, so exp(‖f‖_{H_k}) fits the g(‖f‖) part.
( Π_{i=1}^n |f(x_i)|^6 )^25 · Σ_{i=1}^n [ sin(‖x_i‖^(y_i − f(x_i))) + y_i |f(x_i)|^249 ] is some arbitrary empirical loss function of the (x_i, y_i, f(x_i)). So this optimization can be expressed as
f* = argmin_{f ∈ H_k} { E[(x_1, y_1, f(x_1)), ..., (x_n, y_n, f(x_n))] + g(‖f‖) }
Therefore, by the Representer Theorem, f*(·) = Σ_{i=1}^n α_i k(x_i, ·).
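A much tamer instance of the same theorem is kernel ridge regression: with squared loss E = Σ_i (y_i − f(x_i))^2 and regularizer g(‖f‖) = λ‖f‖^2, the coefficients in f*(·) = Σ_i α_i k(x_i, ·) have the closed form α = (K + λI)^{−1} y. A minimal sketch (kernel ridge regression is a standard example, not the contrived objective above; function names and the default λ are illustrative):

```python
import numpy as np

def kernel_ridge_fit(K_train, y, lam=1e-2):
    """Solve (K + lam * I) alpha = y, giving f(x) = sum_i alpha_i k(x_i, x)."""
    n = K_train.shape[0]
    return np.linalg.solve(K_train + lam * np.eye(n), y)

def kernel_ridge_predict(alpha, K_test_train):
    """K_test_train[j, i] = k(x_test_j, x_train_i)."""
    return K_test_train @ alpha
```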
Do You want to Build a ~~Snowman~~ Kernel? It doesn't have to be a ~~Snowman~~ Kernel.
Operations that Preserve/Create Kernels
Let k_1, ..., k_m be valid kernel functions defined over some set X such that |X| = n, and let α_1, ..., α_m be non-negative coefficients. The following are valid kernels.
• K(x, z) = Σ_{i=1}^m α_i k_i(x, z)   (closed under non-negative linear combination)
• K(x, z) = Π_{i=1}^m k_i(x, z)   (closed under multiplication)
• K(x, z) = k_1(f(x), f(z)) for any function f : X → X
• K(x, z) = g(x) g(z) for any function g : X → R
• K(x, z) = x^T A^T A z for any matrix A ∈ R^{m×n}
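A quick numerical sanity check of the first two closure properties (the base kernels, coefficients, and random data are illustrative): build a composite kernel from a non-negative combination plus a product of valid kernels and confirm its Gram matrix has no negative eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2))

lin = lambda x, z: np.dot(x, z)
rbf = lambda x, z: np.exp(-np.sum((x - z) ** 2))

# Non-negative combination plus a product of valid kernels (first two rules above).
combo = lambda x, z: 2.0 * lin(x, z) + 0.5 * rbf(x, z) + lin(x, z) * rbf(x, z)

G = np.array([[combo(xi, xj) for xj in X] for xi in X])
print(np.all(np.linalg.eigvalsh(G) >= -1e-10))   # True: the Gram matrix stays PSD
```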
Proof: K(x, z) = Σ_{i=1}^m α_i k_i(x, z) is a valid kernel.
Symmetry: K(z, x) = Σ_{i=1}^m α_i k_i(z, x). Because each k_i is a kernel function, it is symmetric, so k_i(z, x) = k_i(x, z). Therefore,
K(z, x) = Σ_{i=1}^m α_i k_i(z, x) = Σ_{i=1}^m α_i k_i(x, z) = K(x, z) ⇒ K(z, x) = K(x, z)
Positive Semi-definite: Let u ∈ R^n be arbitrary. The Gram matrix of K, denoted by G, has entries G_ij = K(x_i, x_j) = Σ_{l=1}^m α_l k_l(x_i, x_j), so G = α_1 G_1 + ... + α_m G_m, where G_l is the Gram matrix of k_l. Now
u^T G u = u^T (α_1 G_1 + ... + α_m G_m) u = α_1 u^T G_1 u + ... + α_m u^T G_m u = Σ_{i=1}^m α_i u^T G_i u.
Each u^T G_i u ≥ 0 (each k_i is a valid kernel) and each α_i ≥ 0, so each α_i u^T G_i u ≥ 0, and therefore u^T G u ≥ 0.
Proof: K(x, z) = x^T A^T A z for any matrix A ∈ R^{m×n} is a valid kernel.
For this proof, we are going to show that K(x, z) is an inner product in some Hilbert Space. Let φ(x) = Ax. Then ⟨φ(x), φ(z)⟩ = φ(x)^T φ(z) = (Ax)^T (Az) = x^T A^T A z = K(x, z).
Therefore, K(x, z) is an inner product in some Hilbert Space.
Just because it is a valid kernel does not mean it is a good kernel
Some Exercises
Prove the following are not valid kernels:
1. K(x, z) = k_1(x, z) − k_2(x, z) for k_1, k_2 valid kernels
2. K(x, z) = exp(γ ‖x − z‖^2) for some γ > 0
Current Research in Kernel Theory
Flaxman Test
Correlates of homicide: New space/time interaction tests for spatiotemporal point
processes
Goal: Develop a statistical test for the strength of interaction between time and space for different types of homicides.
Result: Using Reproducing Kernel Hilbert Spaces, one can project the space-time data into a higher-dimensional space and test for a significant interaction between the kernelized distances in space and time using the Hilbert-Schmidt Independence Criterion.
Fast Food
Fastfood - Approximating Kernel Expansions in Loglinear time
Problem: Kernel methods do not scale well to large datasets, especially during
prediction time, because the algorithm has to compute the kernel distance between
the new vector and all of the data in the training set.
Solution: Using advanced mathematical black magic, dense random Gaussian
matrices can be approximated using Hadamard matrices and diagonal Gaussian
matrices.
V = (1 / (σ√d)) · S H G Π H B
where Π ∈ {0, 1}^{n×n} is a permutation matrix, H is a Hadamard matrix, S is a random scaling matrix with elements along the diagonal, B has random ±1 entries along the diagonal, and G has random Gaussian entries.
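A rough NumPy sketch of assembling one d × d block of V (illustrative only: the scaling in S here uses a simplified chi-based rescaling rather than the paper's exact distribution, and the matrices are materialized explicitly, whereas the log-linear speed comes from applying H with the fast Walsh-Hadamard transform instead of ever forming V):

```python
import numpy as np
from scipy.linalg import hadamard

def fastfood_block(d, sigma, rng):
    """Assemble one d x d block V = (1 / (sigma * sqrt(d))) * S H G P H B.
    d must be a power of two (size of the Walsh-Hadamard matrix)."""
    H = hadamard(d).astype(float)                     # Walsh-Hadamard matrix
    B = np.diag(rng.choice([-1.0, 1.0], size=d))      # random +/-1 diagonal
    g = rng.standard_normal(d)
    G = np.diag(g)                                    # random Gaussian diagonal
    P = np.eye(d)[rng.permutation(d)]                 # random permutation matrix
    S = np.diag(np.sqrt(rng.chisquare(df=d, size=d)) / np.linalg.norm(g))  # scaling (simplified)
    return (1.0 / (sigma * np.sqrt(d))) * S @ H @ G @ P @ H @ B

rng = np.random.default_rng(0)
V = fastfood_block(8, sigma=1.0, rng=rng)
x = rng.standard_normal(8)
features = np.concatenate([np.cos(V @ x), np.sin(V @ x)])  # random Fourier features for the RBF kernel
```

The rows of V play the role of the random Gaussian directions in Random Kitchen Sinks, so [cos(Vx), sin(Vx)] gives approximate random Fourier features for the Gaussian RBF kernel.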
This algorithm performs 100x faster than the previous leading algorithm (Random
Kitchen Sinks) and uses 1000x less memory.
References
Representer Theorem and Kernel Methods
Predicting Structured Data by Alex Smola
Kernel Machines
String and Tree Kernels
Why Gaussian Kernel is Infinite
Questions?