CR12: Statistical Learning & Applications
Examples of Kernels and Unsupervised Learning
Lecturer: Julien Mairal
Scribes: Rémi De Joannis de Verclos & Karthik Srikanta

1 Kernel Inventory

Linear kernel
The linear kernel is defined as K(x, y) = x^T y, where K : R^p × R^p → R. Let us show that the corresponding
reproducing kernel Hilbert space is H = {f_w : y ↦ w^T y | w ∈ R^p}, where f_w is the function that maps y to
w^T y, equipped with the inner product <f_w1, f_w2>_H = w1^T w2. We check the following conditions:
(1) ∀x ∈ R^p, K_x : y ↦ K(x, y) belongs to H (indeed K_x = f_x).
(2) ∀f_w ∈ H, ∀x ∈ R^p, f_w(x) = <f_w, K_x>_H = w^T x.
(3) K is positive semidefinite, i.e. ∀m ∈ N, ∀(x_1, ..., x_m) ∈ (R^p)^m, the Gram matrix [K]_{i,j} = K(x_i, x_j)
satisfies α^T K α ≥ 0 for all α ∈ R^m.
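As a quick numerical illustration of condition (3) (a minimal NumPy sketch, for illustration only), one can check that a Gram matrix of the linear kernel has no negative eigenvalue:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))                     # 10 points of R^3, one per row
K = X @ X.T                                      # Gram matrix, K_ij = x_i^T x_j
alpha = rng.normal(size=10)
print(alpha @ K @ alpha >= -1e-10)               # alpha^T K alpha >= 0 up to rounding
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))   # equivalently, no negative eigenvalue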
Polynomial kernel of degree 2
The polynomial kernel is given by K(x, y) = (x^T y)^2. We now see that it is positive semidefinite, as

∑_{i,j} α_i α_j K(x_i, x_j) = ∑_{i,j} α_i α_j (x_i^T x_j)^2
                            = ∑_{i,j} α_i α_j Trace(x_i x_i^T x_j x_j^T)
                            = ∑_{i,j} α_i α_j <x_i x_i^T, x_j x_j^T>_F
                            = ∑_{i,j} <α_i x_i x_i^T, α_j x_j x_j^T>_F
                            = Trace( (∑_i α_i x_i x_i^T) (∑_j α_j x_j x_j^T) )
                            = Trace(Z^T Z)   where Z = ∑_i α_i x_i x_i^T
                            = ||Z||_F^2
                            ≥ 0.
qP
It is important here to dene the Frobenius norm. || Z ||F =
i,j
2 . Also note that < Z , Z >=
Zi,j
1
2
Trace(Z1T Z2 ).
Now we would like to find H such that conditions (1) and (2) mentioned above are satisfied. To satisfy (1), we need
K_x : y ↦ K(x, y) to be in H for every x ∈ R^p, and

K_x(y) = (x^T y)^2 = <xx^T, yy^T>_F = y^T (xx^T) y.

Because H is a Hilbert space, we also have that for all β ∈ R^n and (x_1, ..., x_n) ∈ (R^p)^n, the function
y ↦ y^T (∑_{i=1}^n β_i x_i x_i^T) y is in H. A good candidate for H is thus H = {y ↦ y^T A y | A ∈ R^{p×p} symmetric},
with <f_A, f_B>_H = Trace(A^T B). This is easy to check: f_A(x) = x^T A x = <A, xx^T>_F and K_x = f_{xx^T},
which implies f_A(x) = <f_A, K_x>_H.
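The identity K(x, y) = (x^T y)^2 = <xx^T, yy^T>_F can also be checked numerically; a minimal NumPy sketch, for illustration:

import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=3)
lhs = (x @ y) ** 2                               # kernel value (x^T y)^2
rhs = np.sum(np.outer(x, x) * np.outer(y, y))    # Frobenius inner product <xx^T, yy^T>_F
print(np.isclose(lhs, rhs))                      # True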
Gaussian kernel
The Gaussian kernel is given by K(x, y) = exp(−||y − x||^2 / ω^2) = k(x − y).
The corresponding RKHS is more involved. We will admit that if we define

<f, g>_H = (1 / (2π)^p) ∫ f̂(w) ĝ(w)^* / k̂(w) dw,

then

H = {f | f is integrable, continuous and <f, f>_H < +∞}.
The min kernel
Let K(x, y) = min(x, y) for all x, y ∈ [0, 1].
Let us show that K is positive semidefinite: ∀α ∈ R^n, ∀(x_1, ..., x_n) ∈ [0, 1]^n we have

∑_{i,j} α_i α_j min(x_i, x_j) = ∑_{i,j} α_i α_j ∫_0^1 1_{t≤x_i}(t) 1_{t≤x_j}(t) dt
                              = ∫_0^1 (∑_{i=1}^n α_i 1_{t≤x_i}) (∑_{j=1}^n α_j 1_{t≤x_j}) dt
                              = ∫_0^1 Z(t)^2 dt   where Z(t) = ∑_{j=1}^n α_j 1_{t≤x_j}
                              ≥ 0.

It is thus appealing to believe that H = {f | f ∈ L^2[0, 1]} with <f, g> = ∫_0^1 f(t) g(t) dt. However, this is a
mistake, since <f, f> = 0 does not imply that f = 0. In fact,

H = {f : [0, 1] → R | f continuous, differentiable almost everywhere and f(0) = 0}.

Now we have <f, g>_H = ∫_0^1 f'(t) g'(t) dt, and thus <f, f>_H = 0 ⇔ f' = 0 ⇔ f = 0.
We can now check the remaining conditions:
• K_x : y ↦ min(x, y) ∈ H.
• For f ∈ H and x ∈ [0, 1],
  <f, K_x>_H = ∫_0^1 f'(t) 1_{t≤x} dt = ∫_0^x f'(t) dt = f(x) − f(0) = f(x).
[Figure 1: Graph of K_x, the function y ↦ min(x, y).]
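Positive semidefiniteness of the min kernel can likewise be checked numerically on random points of [0, 1] (a minimal NumPy sketch, for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(size=8)                          # points in [0, 1]
K = np.minimum.outer(x, x)                       # K_ij = min(x_i, x_j)
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))   # the Gram matrix is positive semidefinite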
Histogram kernel
The histogram kernel is given by K(h_1, h_2) = ∑_{j=1}^k min(h_1(j), h_2(j)), where h_1, h_2 ∈ [0, 1]^k. It is
notably used in computer vision for comparing histograms of visual words.
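A minimal NumPy sketch of this histogram intersection kernel (illustrative, with a hypothetical function name histogram_kernel):

import numpy as np

def histogram_kernel(h1, h2):
    # K(h1, h2) = sum_j min(h1[j], h2[j]) for two histograms of the same length
    return np.minimum(h1, h2).sum()

h1 = np.array([0.2, 0.5, 0.3])
h2 = np.array([0.4, 0.4, 0.2])
print(histogram_kernel(h1, h2))                  # 0.2 + 0.4 + 0.2 = 0.8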
Spectrum kernel
We now motivate the spectrum kernel through biology. There are about 25,000 human protein-coding genes, and
the genome contains about 3,000,000,000 bases built from the four building blocks A, T, C, G. Consider a
sequence u ∈ A^k of size k (where in this case A = {A, T, C, G}), and let φ_u(x) be the number of occurrences
of u in x. We define the spectrum kernel as K(x, x') = ∑_{u ∈ A^k} φ_u(x) φ_u(x') = <φ(x), φ(x')>, where
φ(x) ∈ R^{|A|^k}.
We are interested in computing K(x, x') efficiently. We have

K(x, x') = ∑_{i=1}^{|x|−k+1} ∑_{j=1}^{|x'|−k+1} 1_{x[i, i+k−1] = x'[j, j+k−1]}.

The naive computation of this double sum takes O(|x| |x'| k); it can be done more efficiently by using a trie.
We demonstrate this with an example.
Let x = ACGTTTACGA and x' = AGTTTACG, with k = 3.

[Figure: the tries of all 3-mers of x and of x', with the number of occurrences stored at each leaf. The 3-mers
of x are ACG (2), CGT, CGA, GTT, TTT, TTA, TAC (1 each); those of x' are AGT, ACG, GTT, TTT, TTA, TAC (1 each).]

The common 3-mers are ACG, GTT, TTT, TTA and TAC, so K(x, x') = 2·1 + 1 + 1 + 1 + 1 = 6.
Algorithm 1 ProcessNode(v)
Input: Trie representations of the sequences x and x'.
Output: The value K(x, x').
1: if v is a leaf then
2:   K(x, x') ← K(x, x') + i_x(v) · i_{x'}(v)
3: else
4:   for all children v' common to the tries of x and x' do
5:     ProcessNode(v')
6:   end for
7: end if
8: The kernel is computed by calling ProcessNode(root).
Here i_x(v) and i_{x'}(v) denote the occurrence counts stored at the leaf v in the tries of x and x' respectively.
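The same value can be obtained without an explicit trie by counting k-mers with a dictionary; the sketch below (an equivalent illustration in Python, not the lecture's trie procedure) recovers K(x, x') = 6 for the example above:

from collections import Counter

def spectrum_kernel(x, xp, k):
    # K(x, x') = sum_u phi_u(x) * phi_u(x'), with phi_u the number of occurrences of the k-mer u
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cxp = Counter(xp[i:i + k] for i in range(len(xp) - k + 1))
    return sum(cx[u] * cxp[u] for u in cx)       # only common k-mers contribute

print(spectrum_kernel("ACGTTTACGA", "AGTTTACG", 3))   # 6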
Exercises
1. Show that if K_1 and K_2 are positive semidefinite, then
   (a) K_1 + K_2 is positive semidefinite;
   (b) αK_1 + βK_2 is positive semidefinite, where α, β > 0.
2. Show that if (K_n)_{n>0} → K pointwise, then K is positive semidefinite.
3. Show that K(x, y) = 1 / (1 − min(x, y)) is positive semidefinite for all x, y ∈ [0, 1[.
Walk kernel
We say that G' = (V', E') is a subgraph of G = (V, E) if V' ⊆ V, E' ⊆ E, and the vertices of V' keep the labels
they have in V. Let φ_U(G) be the number of times U appears as a subgraph of G. The subgraph kernel is defined as

K(G, G') = ∑_U φ_U(G) φ_U(G').

Unfortunately, computing this kernel is NP-complete.
We now define the walk kernel, and first recall some definitions. A walk is an alternating sequence of vertices
and connecting edges; less formally, a walk is any route through a graph from vertex to vertex along edges.
A walk can end on the same vertex on which it began or on a different vertex, and can traverse any edge and
any vertex any number of times. A path is a walk that does not include any vertex twice, except that its first
vertex might be the same as its last. Now let W_k(G) = {walks in G of length k} and S_k = {sequences of labels
of length k}, so that |S_k| = |A|^k. For every s ∈ S_k, let φ_s(G) = ∑_{w ∈ W_k(G)} 1_{labels(w) = s}. This gives us

K(G_1, G_2) = ∑_{s ∈ S_k} φ_s(G_1) φ_s(G_2).

Computing the walk kernel seems hard at first sight, but it can be done efficiently by using the product graph,
defined as follows. Given graphs G_1 = (V_1, E_1) and G_2 = (V_2, E_2), we define

G_1 × G_2 = ( {(v_1, v_2) ∈ V_1 × V_2 with the same label}, {((v_1, v_2), (v_1', v_2')) | (v_1, v_1') ∈ E_1 and (v_2, v_2') ∈ E_2} ).

There is a bijection between walks in G_1 × G_2 and pairs of walks in G_1 and G_2 with the same labels. More formally,
[Figure 2: Example of a graph product: two labeled graphs G_1 and G_2 and the resulting product graph G_1 × G_2.]
K(G_1, G_2) = ∑_{w_1 ∈ W_k(G_1)} ∑_{w_2 ∈ W_k(G_2)} 1_{l(w_1) = l(w_2)} = ∑_{w ∈ W_k(G_1 × G_2)} 1 = |W_k(G_1 × G_2)|.
Exercise
Let A ∈ R^{|V|×|V|} be the adjacency matrix of G.
1. Prove that the number of walks of length k starting at i and ending at j is [A^k]_{i,j}.
2. Show that K(G_1, G_2) can be computed in polynomial time (see the sketch below).
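As a hint for exercise 2, here is a minimal NumPy sketch (an illustration with a hypothetical helper walk_kernel, assuming dense adjacency matrices and one label per vertex) that builds the adjacency matrix of G_1 × G_2 and counts its walks of length k:

import numpy as np

def walk_kernel(A1, labels1, A2, labels2, k):
    # vertices of the product graph: pairs of vertices sharing the same label
    pairs = [(i, j) for i in range(len(labels1)) for j in range(len(labels2))
             if labels1[i] == labels2[j]]
    n = len(pairs)
    Ax = np.zeros((n, n))
    for a, (i, j) in enumerate(pairs):
        for b, (ip, jp) in enumerate(pairs):
            Ax[a, b] = A1[i, ip] * A2[j, jp]     # edge of G1 x G2
    # [Ax^k]_{uv} counts walks with k edges from u to v, so summing all entries
    # counts the label-matching walk pairs, i.e. K(G1, G2)
    return np.linalg.matrix_power(Ax, k).sum()

A = np.array([[0, 1], [1, 0]])                   # a single labeled edge a - b
print(walk_kernel(A, ["a", "b"], A, ["a", "b"], 1))   # 2.0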
2 Unsupervised Learning

Unsupervised learning consists of discovering the underlying structure of data without labels. It is useful
for many tasks, such as removing noise from data (preprocessing), interpreting and visualizing the data, or
compressing it.
Principal component analysis
We have a centered data set X = [x_1, ..., x_n] ∈ R^{p×n}.
Consider the projection of x onto the direction w ∈ R^p: x ↦ (w^T x / ||w||_2) · (w / ||w||_2). We observe that
the empirical variance captured by w is Var(w) = (1/n) ∑_{i=1}^n (w^T x_i)^2 / ||w||_2^2.
We provide below the PCA (Principal Component Analysis) algorithm:
Algorithm 2 PCA
1: Suppose we have already computed (w_1, ..., w_{i−1}).
2: Construct w_i ∈ arg max_w Var(w) under the condition w_i ⊥ (w_1, ..., w_{i−1}).
Lemma: The directions w_i / ||w_i||_2 are successive eigenvectors of XX^T, ordered by decreasing eigenvalue.
Proof: We have w_1 = arg max_{||w||_2 = 1} (1/n) ∑_{i=1}^n w^T x_i x_i^T w = arg max_{||w||_2 = 1} w^T XX^T w.
XX^T is symmetric, so it can be diagonalized as XX^T = U S U^T, where U is orthogonal and S is diagonal. Then

w_1 = arg max_{||w||_2 = 1} w^T U S U^T w
    = arg max_{||w||_2 = 1} (U^T w)^T S (U^T w)
    = arg max_{||z||_2 = 1} z^T S z   where z = U^T w.

For w_i, i > 1, we have

w_i = arg max_{||w||_2 = 1, w ⊥ u_1, ..., u_{i−1}} w^T U S U^T w
    = arg max_{||w||_2 = 1, w ⊥ u_1, ..., u_{i−1}} (U^T w)^T S (U^T w)
    = arg max_{||z||_2 = 1, z ⊥ e_1, ..., e_{i−1}} z^T S z.
As a consequence, PCA can be computed with an SVD.
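Concretely, a minimal NumPy sketch (for illustration, with the data points stored as columns of X as above) showing that the eigenvectors of XX^T and the left singular vectors of X give the same principal directions:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 100))                    # p = 5 features, n = 100 points (one per column)
X -= X.mean(axis=1, keepdims=True)               # center the data

eigval, eigvec = np.linalg.eigh(X @ X.T)         # eigendecomposition of XX^T (ascending order)
W = eigvec[:, ::-1]                              # reorder by decreasing eigenvalue

U, S, Vt = np.linalg.svd(X, full_matrices=False) # left singular vectors of X
print(np.allclose(np.abs(U), np.abs(W)))         # same directions, up to sign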
Theorem (Eckart–Young theorem)
PCA (SVD) provides the best low-rank approximation, i.e. the truncated SVD solves

min_{X' ∈ R^{p×n}, rank(X') ≤ k} ||X − X'||_F^2.
Proof: Any matrix X' of rank k can be written X' = ∑_{i=1}^k s_i u_i v_i^T = U_k S_k V_k^T, where U_k ∈ R^{p×k},
V_k ∈ R^{n×k}, U_k^T U_k = I, V_k^T V_k = I and S_k ∈ R^{k×k} is diagonal. Thus

min_{rank(X') ≤ k} ||X − X'||_F^2 = min_{U^T U = I, V^T V = I, S diagonal, rank(U S V^T) ≤ k} ||X − U S V^T||_F^2.

Lemma: Let U ∈ R^{p×m} and V ∈ R^{n×m} with m = min(n, p). If U^T U = I and V^T V = I, then the matrices
Z_{ij} = u_i v_j^T form an orthonormal family of R^{p×n} for the Frobenius inner product.
Proof: It suffices to check the scalar products:

<Z_{ij}, Z_{kl}> = Trace(Z_{ij}^T Z_{kl}) = Trace(v_j u_i^T u_k v_l^T) = (u_i^T u_k)(v_j^T v_l) = 1_{i=k and j=l}.

Taking for U and V the singular vectors of X, we have X = ∑_{i=1}^m s_i Z_{ii}, and because of the lemma we can
decompose X' on the Z_{ij}:

X' = ∑_{i,j} s'_{ij} Z_{ij}.

Under the rank constraint, ||X − X'||_F^2 = ∑_{i=1}^m (s'_{ii} − s_i)^2 + ∑_{i≠j} (s'_{ij})^2 is minimal when the
second term is set to 0 and the first term is minimized by taking s'_{ii} = s_i for the k largest values of s_i
(and s'_{ii} = 0 otherwise), which respects the rank constraint. This is exactly the truncated SVD of X.
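A short numerical illustration of the theorem (a sketch, for illustration): keeping the k largest singular values gives a rank-k matrix whose squared Frobenius error equals the sum of the discarded squared singular values:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 3
Xk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]       # truncated SVD: best rank-k approximation
err = np.linalg.norm(X - Xk, "fro") ** 2
print(np.isclose(err, np.sum(s[k:] ** 2)))       # error = sum of discarded squared singular values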
Kernel PCA
Suppose we have a feature map φ : x ↦ φ(x) ∈ H. Assuming the φ(x_i) are centered in the feature space, the
variance captured by a direction f ∈ H is Var(f) = (1/n) ∑_{i=1}^n f(x_i)^2 / ||f||_H^2, and we look for

f_i ∈ arg max_{f ∈ H, f ⊥ f_1, ..., f_{i−1}} Var(f).    (1)

There are a few remaining questions:
1. What does it mean to have the φ(x_i) centered?
2. How do we solve (1)?
1. Having the φ(x_i) centered means that we implicitly replace φ(x_i) by φ(x_i) − m, where m = (1/n) ∑_{i=1}^n φ(x_i)
is the average of the φ(x_i). If our kernel is K(x_i, x_j) = <φ(x_i), φ(x_j)>_H, the new kernel becomes

K_c(x_i, x_j) = <φ(x_i) − m, φ(x_j) − m>
              = <φ(x_i), φ(x_j)> − (1/n) ∑_{l=1}^n <φ(x_l), φ(x_i)> − (1/n) ∑_{l=1}^n <φ(x_l), φ(x_j)> + (1/n^2) ∑_{l,k} <φ(x_l), φ(x_k)>.

In terms of matrices, this is written K_c = (I − (1/n) 1) K (I − (1/n) 1), where 1 is the n × n matrix with all
coefficients equal to 1.
2. By the representer theorem, any solution of (1) has the form f : x ↦ ∑_{i=1}^n α_i K(x_i, x). What we need are
the α's. We have ||f||_H^2 = α^T K α, f(x_i) = [Kα]_i and <f, f_l>_H = α^T K α_l = 0.
We want to find max_{α ∈ R^n} (1/n) ||Kα||_2^2 / (α^T K α) subject to α^T K α_l = 0 for all l < i, i.e.
max_α (α^T K^2 α) / (α^T K α) subject to α^T K α_l = 0, ∀l < i. K is symmetric, so K = U S^2 U^T, where U is
orthogonal and S is diagonal. We want max_α (α^T U S^4 U^T α) / (α^T U S^2 U^T α) = max_{β ∈ R^n} (β^T S^2 β) / ||β||_2^2,
where β = S U^T α and β_l = S U^T α_l. If we assume that the eigenvalues are ordered decreasingly in S, the
solutions correspond to β_l = e_l, i.e. α_l = u_l / s_l.
Recipe:
1. Center the kernel matrix: K_c = (I − (1/n) 1) K (I − (1/n) 1).
2. Compute the eigendecomposition K_c = U S^2 U^T.
3. Set α_l = u_l / s_l.
4. The projections are f_l(x_i) = [K_c α_l]_i.
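A minimal NumPy sketch of this recipe (illustrative; applied here to the linear kernel, for which kernel PCA reduces to ordinary PCA):

import numpy as np

def kernel_pca(K, n_components):
    n = K.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n
    Kc = C @ K @ C                                        # 1. centered kernel matrix
    eigval, eigvec = np.linalg.eigh(Kc)                   # 2. Kc = U S^2 U^T (ascending order)
    eigval, eigvec = eigval[::-1], eigvec[:, ::-1]        #    reorder by decreasing eigenvalue
    alphas = eigvec[:, :n_components] / np.sqrt(eigval[:n_components])   # 3. alpha_l = u_l / s_l
    return Kc @ alphas                                    # 4. projections f_l(x_i) = [Kc alpha_l]_i

X = np.random.default_rng(0).normal(size=(20, 3))
K = X @ X.T                                               # linear kernel on 20 points
print(kernel_pca(K, 2).shape)                             # (20, 2) projections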
K-means clustering
Let X = [x_1, ..., x_n] ∈ R^{p×n} be the data points. We would like to form k clusters, represented by their
centroids C = [c_1, ..., c_k] ∈ R^{p×k}.
Our goals are to:
1. Learn C.
2. Assign each data point to a centroid c_j.
[Figure: scatter plot of data points in the (x, y) plane.]
A popular way to do this is the K-means algorithm:
Algorithm 3 K-means
1: for i = 1, ..., n do
2:   l_i ← arg min_{j=1,...,k} ||x_i − c_j||_2^2
3: end for
4: for j = 1, ..., k do
5:   Re-estimate the centroid: c_j ← (1/n_j) ∑_{i : l_i = j} x_i, where n_j is the number of points assigned to cluster j
6: end for
The interpretation of the algorithm is the following: we are trying to find C and labels (l_i) that solve
min_{C, l} ∑_{i=1}^n ||x_i − c_{l_i}||_2^2.
Given some fixed labels, setting the gradient with respect to c_1 to zero gives
∇_{c_1} f(C) = ∑_{i : l_i = 1} 2 (c_1 − x_i) = 0,
which yields the centroid update of step 5. Note that K-means is not guaranteed to find the global optimum,
because min_{C, l} ∑_{i=1}^n ||x_i − c_{l_i}||_2^2 is a non-convex optimization problem with several local minima.
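For concreteness, a minimal NumPy sketch of the two alternating steps (illustrative, with random initialization on data points stored as columns of X as in the notes):

import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    # X in R^{p x n}, one data point per column
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    C = X[:, rng.choice(n, size=k, replace=False)].copy()         # initialize centroids on data points
    for _ in range(n_iter):
        # assignment step: l_i = argmin_j ||x_i - c_j||^2
        d2 = ((X[:, :, None] - C[:, None, :]) ** 2).sum(axis=0)   # n x k squared distances
        labels = d2.argmin(axis=1)
        # re-estimation step: c_j = mean of the points currently assigned to cluster j
        for j in range(k):
            if np.any(labels == j):
                C[:, j] = X[:, labels == j].mean(axis=1)
    return C, labels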
Kernel K-means
In kernel K-means, the objective becomes min_{C, l} ∑_{i=1}^n ||φ(x_i) − c_{l_i}||_H^2, where the centroids now
live in H. Given some labels, the re-estimation step is c_j ← (1/n_j) ∑_{i : l_i = j} φ(x_i).
The assignment step becomes l_i ← arg min_j ||φ(x_i) − (1/n_j) ∑_{l ∈ S_j} φ(x_l)||_H^2, where S_j = {i : l_i = j}, and

||φ(x_i) − c_j||_H^2 = K(x_i, x_i) − (2/n_j) ∑_{l ∈ S_j} K(x_i, x_l) + (1/n_j^2) ∑_{l, l' ∈ S_j} K(x_l, x_{l'}),

so that everything can be computed from kernel evaluations only.
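The assignment step can thus be carried out from the Gram matrix alone; a minimal NumPy sketch of this distance computation (illustrative, assuming no cluster becomes empty):

import numpy as np

def kernel_kmeans_assign(K, labels, k):
    # ||phi(x_i) - c_j||^2 = K_ii - (2/n_j) sum_{l in S_j} K_il + (1/n_j^2) sum_{l,l' in S_j} K_ll'
    n = K.shape[0]
    d2 = np.zeros((n, k))
    for j in range(k):
        S = np.flatnonzero(labels == j)          # S_j, assumed non-empty
        nj = len(S)
        d2[:, j] = (np.diag(K)
                    - 2.0 / nj * K[:, S].sum(axis=1)
                    + K[np.ix_(S, S)].sum() / nj ** 2)
    return d2.argmin(axis=1)                     # new labels l_i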
Mixture of Gaussians
The Gaussian density is given by

N(x | μ, Σ) = 1 / ((2π)^{p/2} |Σ|^{1/2}) exp( −(1/2) (x − μ)^T Σ^{−1} (x − μ) ).

Assume that the data is generated by the following procedure:
1. Random class assignment: P(C = C_j) = π_j, where ∑_{j=1}^k π_j = 1 and π_j ≥ 0.
2. Given the class j, draw x ∼ N(x | μ_j, Σ_j).
The parameters are:
• π_1, ..., π_k
• (μ_1, Σ_1), ..., (μ_k, Σ_k)
If we observe X_1, ..., X_n, can we infer the parameters θ = (π, μ, Σ)? The maximum likelihood estimator is
θ̂ = arg max_θ P_θ(X_1, ..., X_n) = arg max_θ ∏_{i=1}^n P_θ(X_i), assuming the observations are independent.
One algorithm to get an approximate solution is called EM, for "Expectation-Maximization"; it iteratively
increases the likelihood.

Algorithm 4 Expectation-Maximization
1: Define L(θ) = − log P_θ(X_1, ..., X_n) = ∑_{i=1}^n − log P_θ(X_i).
2: repeat
3:   E-step: find q given θ fixed (auxiliary soft assignments).
4:   M-step: maximize an auxiliary function l(θ, q) over θ with q fixed.
5: until convergence
Explanations:
At the E-step, q_{ij} = π_j N(x_i | μ_j, Σ_j) / ∑_{j'=1}^k π_{j'} N(x_i | μ_{j'}, Σ_{j'}). Notice that ∑_{j=1}^k q_{ij} = 1
for every i, so the q_{ij} can be interpreted as a soft assignment of every data point to the classes.
At the M-step we have:

π_j ← (1/n) ∑_{i=1}^n q_{ij}
μ_j ← ∑_{i=1}^n q_{ij} x_i / ∑_{i=1}^n q_{ij}
Σ_j ← ∑_{i=1}^n q_{ij} (x_i − μ_j)(x_i − μ_j)^T / ∑_{i=1}^n q_{ij}
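For concreteness, a minimal NumPy sketch of one EM iteration with these update formulas (illustrative; here the data points are stored as rows of X):

import numpy as np

def gaussian_density(X, mu, Sigma):
    # N(x | mu, Sigma) evaluated at every row of X
    p = X.shape[1]
    diff = X - mu
    quad = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(Sigma), diff)
    norm = (2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def em_step(X, pi, mus, Sigmas):
    # one E-step and one M-step for a mixture of k Gaussians
    n, k = X.shape[0], len(pi)
    # E-step: q_ij proportional to pi_j * N(x_i | mu_j, Sigma_j), normalized over j
    q = np.column_stack([pi[j] * gaussian_density(X, mus[j], Sigmas[j]) for j in range(k)])
    q /= q.sum(axis=1, keepdims=True)
    # M-step: re-estimate the parameters with the soft counts
    nj = q.sum(axis=0)
    pi = nj / n
    mus = [q[:, j] @ X / nj[j] for j in range(k)]
    Sigmas = [((X - mus[j]).T * q[:, j]) @ (X - mus[j]) / nj[j] for j in range(k)]
    return pi, mus, Sigmas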
The key ingredient here is the use of Jensen's inequality, but we will not have time to present the details in
this course.