Set 1

Questions and answers, kernel part
February 24, 2015
1 Questions

1.1 Question 1
Denote by $\phi(x)$ the feature map for the RKHS $\mathcal{H}$ with positive definite kernel
\[
k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}} = \langle k(x, \cdot), k(x', \cdot) \rangle_{\mathcal{H}}.
\]
Recall the reproducing property: for all $f(\cdot) \in \mathcal{H}$,
\[
\langle f(\cdot), \phi(x) \rangle_{\mathcal{H}} = \langle f(\cdot), k(x, \cdot) \rangle_{\mathcal{H}} = f(x) \tag{1}
\]
(we will equivalently use the shorthand $f \in \mathcal{H}$). Define $\mu_P$ to be the mean embedding of the distribution $P$, i.e. the RKHS function for which
\[
\langle f, \mu_P \rangle = \mathbb{E}_{X \sim P} f(X) = \mathbb{E}_{X \sim P} \langle f, \phi(X) \rangle \tag{2}
\]
for any $f \in \mathcal{H}$, where $X \sim P$ denotes a random variable $X$ drawn from the probability distribution $P$.
1. (3 points) Show by the reproducing property and the definition in Eq. (2) that
\[
\mu_P(x) = \mathbb{E}_{X \sim P}\, k(X, x).
\]
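As an aside (not part of the question), the result above suggests a simple Monte Carlo estimate of the mean embedding: $\mu_P(x) \approx \frac{1}{n}\sum_{i=1}^n k(X_i, x)$ for a sample $X_1, \ldots, X_n \sim P$. A minimal sketch, where the Gaussian kernel, its bandwidth and the choice of $P$ are all arbitrary:

\begin{verbatim}
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)); bandwidth chosen arbitrarily
    return np.exp(-np.sum((x - y) ** 2, axis=-1) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))            # sample from P (here a standard Gaussian)

def mean_embedding(x, X, sigma=1.0):
    # Monte Carlo estimate of mu_P(x) = E_{X~P} k(X, x)
    return gaussian_kernel(X, x, sigma).mean()

print(mean_embedding(np.zeros(2), X))     # estimate of mu_P at the origin
\end{verbatim}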
2. (4 points) A linear operator $A : \mathcal{H} \to \mathbb{R}$ is said to be bounded when
\[
|Af| \le \lambda_A \|f\|_{\mathcal{H}}
\]
for all $f \in \mathcal{H}$. When an operator is bounded, then by the Riesz representation theorem there exists an element $\mu_A \in \mathcal{H}$ such that
\[
Af = \langle \mu_A, f \rangle.
\]
Show that if the kernel is bounded, $|k(x, x')| < K$ for some strictly positive $K < \infty$ and all $x, x'$, then the operator
\[
A_P f := \mathbb{E}_{X \sim P} f(X)
\]
is bounded (hence there exists some $\mu_P \in \mathcal{H}$ such that Eq. (2) holds). Give the corresponding $\lambda_{A_P}$ as a function of the upper bound $K$. You may need the Cauchy-Schwarz inequality,
\[
|\langle a, b \rangle| \le \|a\| \|b\|, \tag{3}
\]
and a result following from Jensen's inequality,
\[
|\mathbb{E}_{X \sim P} f(X)| \le \mathbb{E}_{X \sim P} |f(X)|.
\]
3. (5 points) An algorithm known as Kernel Herding implements the following iteration: from time $T$ to $T+1$, make the two updates
\[
x_{T+1} = \operatorname*{argmax}_{x \in \mathcal{X}} \langle w_T, \phi(x) \rangle \tag{4}
\]
\[
w_{T+1} = w_T + \mu_P - \phi(x_{T+1}). \tag{5}
\]
We run herding for $T$ steps to obtain the points $x_1, \ldots, x_T$. Assume $k(x, x) = K$ is constant across all values of $x$. Initialise the herding algorithm with $w_0 = \mu_P$. Show that herding greedily chooses $x_{T+1}$ to minimize the error in constructing the mean embedding,
\[
E_{T+1} := \left\| \mu_P - \frac{1}{T+1} \sum_{t=1}^{T+1} \phi(x_t) \right\|^2. \tag{6}
\]
Hints: first, expand out Eq. (6), and note that the optimization applies only to terms containing $x_{T+1}$. Then substitute the current expression for $w_T$ into Eq. (4) (this expression should be written as a function of all the previous points $x_1, \ldots, x_T$). You will also need the solution to part 1 of the question.
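For illustration only (not part of the question), a minimal sketch of the iteration in Eqs. (4)-(5) follows, using a Gaussian kernel, a Monte Carlo estimate of $\mu_P$, and a finite candidate grid in place of the continuous argmax; the target distribution, bandwidth and grid are arbitrary choices:

\begin{verbatim}
import numpy as np

def k(x, y, sigma=0.5):
    # Gaussian kernel; bandwidth is an arbitrary choice for this demo
    return np.exp(-np.sum((x - y) ** 2, axis=-1) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
P_sample = rng.normal(size=(2000, 2))        # large sample standing in for P
grid = rng.uniform(-3, 3, size=(5000, 2))    # finite candidate set for the argmax

# Monte Carlo estimate of mu_P(x) = E_{X~P} k(X, x) on the grid
mu_on_grid = np.array([k(P_sample, x).mean() for x in grid])

herd = []                                    # pseudo-samples x_1, ..., x_T
for T in range(50):
    # with w_0 = mu_P:  <w_T, phi(x)> = (T+1) mu_P(x) - sum_t k(x_t, x)
    w_T = (T + 1) * mu_on_grid
    for x_t in herd:
        w_T -= k(grid, x_t)
    herd.append(grid[np.argmax(w_T)])        # Eq. (4): greedy argmax over the grid
\end{verbatim}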
4. (5 points) Define
\[
\hat P_T := \frac{1}{T} \sum_{t=1}^{T} \delta_{x_t},
\]
where $\delta_{x_t}$ denotes the Dirac measure at $x_t$, meaning that the expectation of a function $f$ under $\hat P_T$ is
\[
\mathbb{E}_{X \sim \hat P_T} f(X) = \frac{1}{T} \sum_{t=1}^{T} f(x_t).
\]
We begin the herding process with some $w_0 \in \mathcal{H}$, which need not be $\mu_P$. Assume that $\|w_T\|$ is bounded, i.e. $\|w_T\| \le C$. Given this assumption, show that for all $f \in \mathcal{H}$,
\[
\mathbb{E}_{X \sim P} f(X) - \mathbb{E}_{X \sim \hat P_T} f(X) = O(T^{-1}).
\]
Figure 1: Samples from the “ring density” (scatter plot of $X_1$ against $X_2$).
In other words, if the assumption $\|w_T\| \le C$ holds, then using the “pseudo-samples” $x_t$ obtained by herding gives a fast decrease in the error when estimating expectations of RKHS functions. Hints: use the expression for $w_T$ (again, written as a function of all the previous points $x_1, \ldots, x_T$). You may need the reproducing property in Eq. (1), the Cauchy-Schwarz inequality in Eq. (3), and the reverse triangle inequality,
\[
\|a\| - \|b\| \le \|a + b\|.
\]
5. (3 points) Our goal is to find “pseudo-samples” $x_t$ that are representative of the distribution which generated the samples in Figure 1. We run the herding algorithm using the feature map
\[
\phi(x) = \begin{bmatrix} x_1 \\ x_1^2 \\ x_2 \\ x_2^2 \end{bmatrix}.
\]
Assume $\|x\| \le R$ for $R < \infty$. Explain qualitatively what you think will happen (a diagram might be helpful). How could you do better?
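As an aside (not part of the question), note that with this four-dimensional feature map the mean embedding is simply the vector of first and second marginal moments, so herding can only ever match those moments. A small sketch, with a stand-in for the ring samples generated here purely for illustration:

\begin{verbatim}
import numpy as np

def phi(x):
    # explicit feature map from the question: (x1, x1^2, x2, x2^2)
    x1, x2 = x[..., 0], x[..., 1]
    return np.stack([x1, x1 ** 2, x2, x2 ** 2], axis=-1)

# stand-in for the ring samples of Figure 1 (hypothetical data, illustration only)
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=1000)
ring = np.stack([np.cos(theta), np.sin(theta)], axis=-1)
ring += 0.1 * rng.normal(size=ring.shape)

# the mean embedding in this feature space is just the vector of moments
mu_P = phi(ring).mean(axis=0)
print(mu_P)   # roughly (0, 0.5, 0, 0.5) for a unit ring centred at the origin
\end{verbatim}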
1.2 Question 2
1. (3 points) A symmetric function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is positive definite if for all $n \ge 1$, all $(a_1, \ldots, a_n) \in \mathbb{R}^n$ and all $(x_1, \ldots, x_n) \in \mathcal{X}^n$,
\[
\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j\, k(x_i, x_j) \ge 0. \tag{7}
\]
Define $\mathcal{F}_1$ to be a reproducing kernel Hilbert space on the domain $\mathcal{X}$ with kernel $k_1(x, x')$, and $\mathcal{F}_2$ to be a reproducing kernel Hilbert space on the domain $\mathcal{X}$ with kernel $k_2(x, x')$. Show using Eq. (7) that the sum of kernels $k_1(x, x') + k_2(x, x')$ is a kernel. What might be the problem if you took the difference of two kernels: would it still be a kernel? Hint: what happens when you use a single $x_1$ in Eq. (7)?
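A quick numerical illustration (not part of the question): the Gram matrix of a sum of two kernels has non-negative eigenvalues, while a difference of kernels can fail this. The kernels and points below are arbitrary choices:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))

# two kernel Gram matrices on the same points
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K1 = np.exp(-sq)              # Gaussian kernel
K2 = (X @ X.T + 1.0) ** 2     # polynomial kernel

print(np.linalg.eigvalsh(K1 + K2).min())  # >= 0 up to numerical error: the sum is a kernel
print(np.linalg.eigvalsh(K1 - K2).min())  # typically < 0: the difference need not be a kernel
\end{verbatim}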
2. (3 points) Define $\mathcal{F}$ to be a reproducing kernel Hilbert space on the domain $\mathcal{X}$ with feature map $\phi(x)$ and kernel $k(x, x') := \langle \phi(x), \phi(x') \rangle_{\mathcal{F}}$, and $\mathcal{G}$ to be a reproducing kernel Hilbert space on the domain $\mathcal{Y}$ with feature map $\psi(y)$ and kernel $l(y, y') := \langle \psi(y), \psi(y') \rangle_{\mathcal{G}}$. Show that the product of the two kernels $k(x, x')$ and $l(y, y')$ is a kernel. Hint: define the tensor product $(a \otimes b)$ of the RKHS elements $a \in \mathcal{F}$ and $b \in \mathcal{G}$ such that for all $g \in \mathcal{G}$,
\[
(a \otimes b)\, g := \langle b, g \rangle_{\mathcal{G}}\, a.
\]
The tensor product of feature maps, $\phi(x) \otimes \psi(y)$, is an element of a Hilbert space $\mathrm{HS}(\mathcal{G}, \mathcal{F})$, with inner product
\[
\langle L, M \rangle_{\mathrm{HS}} = \sum_{i \in I} \langle L e_i, M e_i \rangle_{\mathcal{F}}, \tag{8}
\]
where $\{e_i\}_{i \in I}$ forms an orthonormal basis for $\mathcal{G}$. Given $L = \phi(x) \otimes \psi(y)$ and $M = \phi(x') \otimes \psi(y')$, can you write the above inner product in terms of the kernels $k(x, x')$ and $l(y, y')$? Hint: you may use
\[
\langle \psi(y), \psi(y') \rangle_{\mathcal{G}} = \sum_{i} \langle \psi(y), e_i \rangle_{\mathcal{G}}\, \langle \psi(y'), e_i \rangle_{\mathcal{G}}.
\]
3. (3 points) Using the previous result, prove that the Gaussian is a kernel,
\[
k(x, x') = \exp\left( \frac{-\|x - x'\|^2}{\sigma} \right),
\]
using the knowledge that the following function is a kernel,
\[
\kappa(x, x') = \exp\left( \frac{\langle x, x' \rangle}{\sigma} \right).
\]
4. (4 points) We define a domain $\mathcal{X} := [-\pi, \pi]$ with periodic boundary conditions. The Fourier series representation of a function $f(x)$ on $\mathcal{X}$ is written $\hat f_l$, where
\[
f(x) = \sum_{l=-\infty}^{\infty} \hat f_l \exp(\imath l x) = \sum_{l=-\infty}^{\infty} \hat f_l \left( \cos(lx) + \imath \sin(lx) \right),
\]
where $\imath = \sqrt{-1}$. Given that a complex number $z$ has the representation $z = A \exp(\imath \theta)$, where $\theta$ and $A$ are both real-valued, we define the complex conjugate as $\bar z = A \exp(-\imath \theta)$. Assume the kernel takes a single argument, which is the difference of its inputs,
\[
k(x, y) = k(x - y),
\]
and define the Fourier series representation of $k$ as
\[
k(u) = \sum_{l=-\infty}^{\infty} \hat k_l \exp(\imath l u),
\]
where we specify $\hat k_{-l} = \hat k_l$, $\hat k_l \ge 0$, and $\hat k_l = \overline{\hat k_l}$ (the coefficients are real). We represent an RKHS function as
\[
f(\cdot) = \begin{bmatrix} \tilde f_1 & \ldots & \tilde f_\ell & \ldots \end{bmatrix}^{\top},
\]
where $\tilde f_\ell := \hat f_\ell / \sqrt{\hat k_\ell}$, and the feature map of a point as
\[
k(\cdot, x) = \phi(x) = \begin{bmatrix} \tilde k_{1,x} & \ldots & \tilde k_{\ell,x} & \ldots \end{bmatrix}^{\top},
\]
where $\tilde k_{\ell,x} := \sqrt{\hat k_\ell} \exp(-\imath \ell x)$. Define the RKHS inner product as
\[
\langle f, g \rangle_{\mathcal{F}} = \sum_{\ell=-\infty}^{\infty} \tilde f_\ell\, \overline{\tilde g_\ell}. \tag{9}
\]
Show that the reproducing property holds,
\[
\langle f, k(\cdot, x) \rangle = f(x),
\]
and that
\[
\langle k(\cdot, x), k(\cdot, x') \rangle = k(x, x').
\]
5. (3 points) Write the Fourier series coefficients of a probability distribution $P$ as $\gamma_{P,\ell}$, and the Fourier coefficients of a distribution $Q$ as $\gamma_{Q,\ell}$ (assume both sets of Fourier coefficients are real-valued). A Fourier series representation of the (squared) maximum mean discrepancy is
\[
\mathrm{MMD}^2 = \left\| \sum_{\ell=-\infty}^{\infty} \left( \gamma_{P,\ell} - \gamma_{Q,\ell} \right) \hat k_\ell \exp(\imath \ell x) \right\|_{\mathcal{F}}^2.
\]
Simplify this expression by using the definition of the RKHS norm arising from Eq. (9).
6. (4 points) Consider the densities
\[
p(x) = \frac{1}{2\pi}, \qquad q(x) = \frac{1}{2\pi} \left( 1 + 0.5 \cos(rx) \right),
\]
where $r$ is integer valued. Assume we use a Jacobi theta kernel, which has a Gaussian Fourier series,
\[
\hat k_\ell = \frac{1}{2\pi} \exp\left( \frac{-\sigma^2 \ell^2}{2} \right).
\]
What will happen to the MMD as $r$ increases? For full marks, make reference to the answer in the previous part.
1.3 Question 3

Denote by $\phi(x)$ the feature map for the RKHS $\mathcal{H}$ with positive definite kernel $k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}$. Recall the reproducing property: for all $f \in \mathcal{H}$,
\[
\langle f, \phi(x) \rangle_{\mathcal{H}} = f(x). \tag{10}
\]
1. (2 points) Prove that if the kernel is bounded in absolute value,
\[
|k(x, x')| \le R, \tag{11}
\]
then the feature map is bounded, $\|\phi(x)\|_{\mathcal{H}}^2 \le R$.
2. (4 points) We are given a sample
\[
S := \{X_1, \ldots, X_m\} \tag{12}
\]
drawn i.i.d. from a distribution $P$. Denote by $X$ a random variable with distribution $P$. The empirical feature mean as a function of $S$ is
\[
\hat\mu_P := \frac{1}{m} \sum_{i=1}^{m} \phi(X_i), \tag{13}
\]
and by the reproducing property, for all $f \in \mathcal{H}$,
\[
\mathbb{E}_{X \sim \hat P} f(X) := \left\langle f, \frac{1}{m} \sum_{i=1}^{m} \phi(X_i) \right\rangle_{\mathcal{H}} = \frac{1}{m} \sum_{i=1}^{m} f(X_i).
\]
Denote by $\mu_P$ the population feature mean, such that
\[
\langle f, \mu_P \rangle = \mathbb{E}_{X \sim P} f(X). \tag{14}
\]
You may use without proof the result
\[
\mu_P(x) = \mathbb{E}_{X \sim P}\, k(X, x). \tag{15}
\]
Show via the above two formulae that
\[
\langle \mu_P, \mu_P \rangle_{\mathcal{H}} = \mathbb{E}_{X, X'}\, k(X, X'),
\]
where $X'$ is a random variable independent of $X$, but with the same distribution $P$. Show via the reproducing property and Eq. (14) that
\[
\mathbb{E}_{X^m} \langle \mu_P, \hat\mu_P \rangle_{\mathcal{H}} = \mathbb{E}_{X, X'}\, k(X, X'),
\]
where $\mathbb{E}_{X^m}$ indicates that we take the expectation over the $m$ random variables in Eq. (13).
3. (5 points) We define a function of the sample in Eq. (12),
\[
g(S) := \left\| \frac{1}{m} \sum_{i=1}^{m} \phi(X_i) - \mu_P \right\|_{\mathcal{H}},
\]
which is the RKHS distance between the sample feature mean and the population feature mean. Show that its expectation satisfies
\[
\mathbb{E}_{X^m}\, g(S) \le m^{-1/2} \sqrt{ \mathbb{E}_X k(X, X) - \mathbb{E}_{X, X'} k(X, X') }.
\]
Recall Jensen's inequality: for concave functions $g$,
\[
\mathbb{E}_{X \sim P}\, g(X) \le g\!\left( \mathbb{E}_{X \sim P} X \right).
\]
4. (4 points) Prove that the previous result can be simplified under the assumption in Eq. (11) to obtain
\[
\sqrt{ \mathbb{E}_X k(X, X) - \mathbb{E}_{X, X'} k(X, X') } \le \sqrt{2R}.
\]
5. (5 points) A variance operator $C_{XX} : \mathcal{H} \to \mathcal{H}$ is defined in terms of a probability distribution $P_X$ as
\[
\langle f_1, C_{XX} f_2 \rangle_{\mathcal{H}} = \mathbb{E}_X f_1(X) f_2(X) - \mathbb{E}_X f_1(X)\, \mathbb{E}_X f_2(X).
\]
The trace of this operator is written
\[
\operatorname{tr}(C_{XX}) := \sum_{i=1}^{\infty} \langle e_i, C_{XX} e_i \rangle_{\mathcal{H}},
\]
where $\{e_i\}_{i=1}^{\infty}$ is an orthonormal basis for $\mathcal{H}$. Give an expression for the trace in terms of expectations of the kernel. Hint: you will need Parseval's identity,
\[
\|h\|_{\mathcal{H}}^2 = \sum_{i=1}^{\infty} \langle h, e_i \rangle_{\mathcal{H}}^2,
\]
and a result from earlier in this question (which result to use should be clear from the proof).
2 Answers

2.1 Question 1
1. In the following, we use the reproducing property in (a), and the definition of the mean embedding in (b):
\[
\mu_P(x) \overset{(a)}{=} \langle \mu_P, k(\cdot, x) \rangle \overset{(b)}{=} \mathbb{E}_{X \sim P}\, k(X, x).
\]
2. In the reasoning below, we use Jensen's inequality in (a), and Cauchy-Schwarz in (b):
\begin{align*}
|\mathbb{E}_{X \sim P} f(X)| &\overset{(a)}{\le} \mathbb{E}_{X \sim P} |f(X)| \\
&= \mathbb{E}_{X \sim P} \left| \langle f, \phi(X) \rangle \right| \\
&\overset{(b)}{\le} \mathbb{E}_{X \sim P} \left( \|f\|_{\mathcal{H}} \|\phi(X)\|_{\mathcal{H}} \right) \\
&= \|f\|_{\mathcal{H}}\, \mathbb{E}_{X \sim P} \langle \phi(X), \phi(X) \rangle_{\mathcal{H}}^{1/2} \\
&= \|f\|_{\mathcal{H}}\, \mathbb{E}_{X \sim P} \sqrt{k(X, X)} \\
&\le \|f\|_{\mathcal{H}} \sqrt{K},
\end{align*}
so we may take $\lambda_{A_P} = \sqrt{K}$.
3. We want to choose $x_{T+1}$ greedily to minimize
\begin{align*}
E_{T+1} &:= \left\| \mu_P - \frac{1}{T+1} \sum_{t=1}^{T+1} \phi(x_t) \right\|^2 \\
&= \mathbb{E}_{X, X'} k(X, X') - \frac{2}{T+1} \sum_{t=1}^{T+1} \mathbb{E}_X k(X, x_t) + \frac{1}{(T+1)^2} \sum_{t, t'} k(x_t, x_{t'}).
\end{align*}
Keeping only the terms that depend on $x_{T+1}$:
\[
\frac{-2}{T+1}\, \mathbb{E}_X k(X, x_{T+1}) + \frac{2}{(T+1)^2} \sum_{t=1}^{T} k(x_t, x_{T+1}) + \underbrace{\frac{1}{(T+1)^2}\, k(x_{T+1}, x_{T+1})}_{\text{constant, since } k(x,x) = K}.
\]
Next, substituting $w_T$ written in terms of $x_1, \ldots, x_T$, and setting $w_0 := \mu_P$, we choose $x_{T+1}$ using
\begin{align*}
x_{T+1} &= \operatorname*{argmax}_{x \in \mathcal{X}} \langle w_T, \phi(x) \rangle \\
&= \operatorname*{argmax}_{x \in \mathcal{X}} \left\langle w_0 + T \mu_P - \sum_{t=1}^{T} \phi(x_t),\; \phi(x) \right\rangle \\
&= \operatorname*{argmax}_{x \in \mathcal{X}} \left\langle (T+1) \mu_P - \sum_{t=1}^{T} \phi(x_t),\; \phi(x) \right\rangle \\
&= \operatorname*{argmax}_{x \in \mathcal{X}} \left( (T+1)\, \mathbb{E}_{X'} k(x, X') - \sum_{t=1}^{T} k(x_t, x) \right).
\end{align*}
It is now straightforward to see that the argmax greedily minimizes $E_{T+1}$: multiplying the last expression by $-2/(T+1)^2$ recovers exactly the $x_{T+1}$-dependent (non-constant) terms of $E_{T+1}$, so maximizing $\langle w_T, \phi(x) \rangle$ over $x$ is equivalent to minimizing $E_{T+1}$ over $x_{T+1}$.
4. We have assumed
\[
\|w_T\| = \left\| w_0 + T \mu_P - \sum_{t=1}^{T} \phi(x_t) \right\| \le C.
\]
So, after applying the reverse triangle inequality and dividing by $T$:
\[
\left\| \mu_P - \frac{1}{T} \sum_{t=1}^{T} \phi(x_t) \right\| \le T^{-1} \left( \|w_0\| + C \right).
\]
Next,
\begin{align*}
\mathbb{E}_{X \sim P} f(X) - \mathbb{E}_{X \sim \hat P_T} f(X) &= \left\langle f,\; \mu_P - \frac{1}{T} \sum_{t=1}^{T} \phi(x_t) \right\rangle \\
&\le \|f\| \left\| \mu_P - \frac{1}{T} \sum_{t=1}^{T} \phi(x_t) \right\| \le \|f\|\, T^{-1} \left( \|w_0\| + C \right) = O(T^{-1}).
\end{align*}
5. Samples would only be greedily chosen to match the first two moments of the density, and would not lie in the high density regions of the ring. A better feature space would be the one induced by a characteristic kernel, which matches all moments in the infinite sample limit, and which should therefore result in herding samples landing in the high density regions of the distribution. See Figure 2 for an illustration. Note from this illustration that the red points (due to herding) are placed almost entirely in a manner due to computational artifacts: you can see, for instance, the limits on the range of the variables, odd clustering behaviour, and a lack of symmetry. Different runs result in very different (and similarly unrepresentative) distributions of herding points, despite the first two moments matching well. Another good answer: the first two moments can be matched with just two points, given that the population means are zero. The points are
\[
x = \begin{bmatrix} a \\ b \end{bmatrix}, \qquad y = \begin{bmatrix} -a \\ -b \end{bmatrix},
\]
where $x = -y$ since the points have zero mean. The second moments are then
\[
\mathbb{E} \begin{bmatrix} X_1^2 \\ X_2^2 \end{bmatrix} = \begin{bmatrix} a^2 \\ b^2 \end{bmatrix},
\]
so any values of the second moments of the ring density can be matched. Note that herding is greedy, so it won't necessarily converge to this two-point solution, but it can converge to this type of solution. A characteristic kernel will avoid this simple behaviour.
Figure 2: Samples from the “ring density” (blue) and samples from Herding (red), plotted as $X_1$ against $X_2$.
2.2 Question 2

In the following, $\phi$ denotes the feature map associated with the kernel $k$, and $\psi$ the feature map associated with the kernel $l$.
1. For any set of points, if the Gram matrices satisfy $\alpha^{\top} K \alpha \ge 0$ and $\alpha^{\top} L \alpha \ge 0$, then $\alpha^{\top} (K + L) \alpha \ge 0$. Taking a difference of kernels, and a single point, Eq. (7) requires
\[
\alpha^2 \left( k(x_1, x_1) - l(x_1, x_1) \right) \ge 0,
\]
which is not guaranteed unless we further enforce $k(x, x) \ge l(x, x)$ everywhere (and even that is only a necessary condition, not a sufficient one).
2. We write
\begin{align*}
\langle \phi(x) \otimes \psi(y),\; \phi(x') \otimes \psi(y') \rangle_{\mathrm{HS}} &= \sum_{i \in I} \left\langle \left( \phi(x) \otimes \psi(y) \right) e_i,\; \left( \phi(x') \otimes \psi(y') \right) e_i \right\rangle_{\mathcal{F}} \\
&= \sum_{i \in I} \langle \psi(y), e_i \rangle_{\mathcal{G}}\, \langle \psi(y'), e_i \rangle_{\mathcal{G}}\, \langle \phi(x), \phi(x') \rangle_{\mathcal{F}} \\
&= \langle \psi(y), \psi(y') \rangle_{\mathcal{G}}\, k(x, x') \\
&= k(x, x')\, l(y, y').
\end{align*}
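A numerical aside (not part of the answer): on a finite set of points the product kernel corresponds to the elementwise (Hadamard) product of the two Gram matrices, which remains positive semi-definite. A quick check, with the kernels and points chosen arbitrarily:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))   # points x_i; here the y_i are taken equal to the x_i
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)

K = np.exp(-sq)                # Gram matrix of k (Gaussian)
L = (X @ X.T + 1.0) ** 3       # Gram matrix of l (polynomial)

# the product kernel's Gram matrix is the elementwise product K * L
print(np.linalg.eigvalsh(K * L).min())   # >= 0 up to numerical error
\end{verbatim}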
3. The Gaussian kernel can be decomposed as
\[
k(x, x') = \exp\left( \frac{-\|x - x'\|^2}{\sigma} \right) = \exp\left( \frac{-\|x\|^2}{\sigma} \right) \underbrace{\exp\left( \frac{2 \langle x, x' \rangle}{\sigma} \right)}_{\text{kernel}} \exp\left( \frac{-\|x'\|^2}{\sigma} \right).
\]
The middle factor is $\kappa$ with parameter $\sigma/2$, hence a kernel, and the outer factors have the form $g(x)\, g(x')$ with $g(x) = \exp(-\|x\|^2/\sigma)$, which is a kernel with the one-dimensional feature map $g$. The product of kernels is a kernel, which completes the proof.
4. The reproducing property holds:
\begin{align*}
\langle f(\cdot), k(\cdot, x) \rangle_{\mathcal{H}} &= \sum_{\ell=-\infty}^{\infty} \tilde f_\ell\, \overline{\tilde k_{\ell,x}} \\
&= \sum_{\ell=-\infty}^{\infty} \frac{\hat f_\ell}{\sqrt{\hat k_\ell}} \sqrt{\hat k_\ell} \exp(\imath \ell x) \\
&= \sum_{\ell=-\infty}^{\infty} \hat f_\ell \exp(\imath \ell x) = f(x),
\end{align*}
including for the kernel itself:
\begin{align*}
\langle k(\cdot, x), k(\cdot, y) \rangle_{\mathcal{H}} &= \sum_{\ell=-\infty}^{\infty} \tilde k_{\ell,x}\, \overline{\tilde k_{\ell,y}} \\
&= \sum_{\ell=-\infty}^{\infty} \left( \sqrt{\hat k_\ell} \exp(-\imath \ell x) \right) \left( \sqrt{\hat k_\ell} \exp(\imath \ell y) \right) \\
&= \sum_{\ell=-\infty}^{\infty} \hat k_\ell \exp\left( \imath \ell (y - x) \right) = k(x - y).
\end{align*}
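As a numerical sanity check (not part of the answer), these identities can be verified with a truncated Fourier series; the Gaussian coefficients from the next part are used here as an arbitrary example:

\begin{verbatim}
import numpy as np

sigma = 0.5
ells = np.arange(-50, 51)                  # truncated set of frequencies
k_hat = np.exp(-sigma ** 2 * ells ** 2 / 2) / (2 * np.pi)

def k_direct(u):
    # k(u) = sum_l k_hat_l exp(i l u), evaluated directly
    return np.sum(k_hat * np.exp(1j * ells * u))

def k_inner(x, y):
    # <k(.,x), k(.,y)> = sum_l k_tilde_{l,x} conj(k_tilde_{l,y})
    k_tilde_x = np.sqrt(k_hat) * np.exp(-1j * ells * x)
    k_tilde_y = np.sqrt(k_hat) * np.exp(-1j * ells * y)
    return np.sum(k_tilde_x * np.conj(k_tilde_y))

x, y = 0.3, -1.2
print(np.allclose(k_inner(x, y), k_direct(x - y)))   # True: inner product recovers k(x - y)
\end{verbatim}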
5. The squared MMD is
\begin{align*}
\mathrm{MMD}^2 &= \left\| \sum_{\ell=-\infty}^{\infty} \left( \gamma_{P,\ell} - \gamma_{Q,\ell} \right) \hat k_\ell \exp(\imath \ell x) \right\|_{\mathcal{F}}^2 \\
&= \sum_{\ell=-\infty}^{\infty} \frac{ \left( \gamma_{P,\ell} - \gamma_{Q,\ell} \right)^2 \hat k_\ell^2 }{ \hat k_\ell } \\
&= \sum_{\ell=-\infty}^{\infty} \left( \gamma_{P,\ell} - \gamma_{Q,\ell} \right)^2 \hat k_\ell.
\end{align*}
6. We have
\[
\left| \gamma_{P,r} - \gamma_{Q,r} \right| = \frac{1}{4}, \qquad \gamma_{P,\ell} - \gamma_{Q,\ell} = 0 \quad \text{for } \ell \neq \pm r,
\]
so by the previous part the only surviving terms in the MMD are those at $\ell = \pm r$. Since $\hat k_\ell$ decays as a Gaussian in $\ell$, the MMD will therefore decrease as $r$ increases, i.e. as $P$ and $Q$ differ only at ever higher frequencies.
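A numerical illustration (an aside, with $\sigma$ chosen arbitrarily): using the simplified expression from part 5, only the terms at $\ell = \pm r$ contribute, so the squared MMD is $2 \cdot (1/4)^2\, \hat k_r$, which decays like a Gaussian in $r$:

\begin{verbatim}
import numpy as np

sigma = 0.5

def k_hat(ell):
    # Gaussian Fourier coefficients of the Jacobi theta kernel
    return np.exp(-sigma ** 2 * ell ** 2 / 2) / (2 * np.pi)

def mmd_sq(r):
    # only l = +r and l = -r survive, each contributing (1/4)^2 * k_hat(r)
    return 2 * (0.25 ** 2) * k_hat(r)

for r in [1, 2, 4, 8]:
    print(r, mmd_sq(r))   # decreases rapidly (Gaussian decay in r)
\end{verbatim}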
2.3 Question 3

1. Setting $x = x'$ to obtain a norm, we get
\[
|k(x, x)| = \|\phi(x)\|^2 \le R.
\]
2. We have
\begin{align*}
\langle \mu_P, \mu_P \rangle &= \mathbb{E}_X\, \mu_P(X) \\
&= \mathbb{E}_{X \sim P} \mathbb{E}_{X' \sim P}\, k(X, X'),
\end{align*}
where we have used Eq. (14) in the first line and Eq. (15) in the second. Next,
\begin{align*}
\mathbb{E}_{X^m} \langle \mu_P, \hat\mu_P \rangle &= \mathbb{E}_{X^m} \left\langle \frac{1}{m} \sum_{i=1}^{m} \phi(X_i),\; \mu_P \right\rangle \\
&= \mathbb{E}_{X^m} \left( \frac{1}{m} \sum_{i=1}^{m} \mu_P(X_i) \right) \\
&= \mathbb{E}_{X^m} \left( \frac{1}{m} \sum_{i=1}^{m} \mathbb{E}_{X \sim P}\, k(X, X_i) \right) \\
&= \mathbb{E}_{X \sim P} \mathbb{E}_{X' \sim P}\, k(X, X').
\end{align*}
3. The expectation is bounded according to
\begin{align*}
\mathbb{E}\, g(S) &= \mathbb{E}_{X^m} \left\| \frac{1}{m} \sum_{i=1}^{m} \phi(X_i) - \mu_P \right\| \\
&\le \left( \mathbb{E}_{X^m} \left\| \frac{1}{m} \sum_{i=1}^{m} \phi(X_i) - \mu_P \right\|^2 \right)^{1/2} \\
&= \left( \frac{m}{m^2}\, \mathbb{E}_X k(X, X) + \frac{m(m-1)}{m^2}\, \mathbb{E}_{X, X'} k(X, X') - 2\, \mathbb{E}_{X, X'} k(X, X') + \mathbb{E}_{X, X'} k(X, X') \right)^{1/2} \\
&= m^{-1/2} \sqrt{ \mathbb{E}_X k(X, X) - \mathbb{E}_{X, X'} k(X, X') },
\end{align*}
where the first inequality is Jensen's inequality applied to the (concave) square root.
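As an aside, the $m^{-1/2}$ rate can be observed numerically by estimating $\mathbb{E}\, g(S)$ with a Gaussian kernel; the distribution, bandwidth, reference sample size and sample sizes below are arbitrary choices:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0

def gram(A, B):
    # Gaussian kernel Gram matrix between the rows of A and B
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

ref = rng.normal(size=(3000, 1))      # large reference sample standing in for P
ref_term = gram(ref, ref).mean()      # approximates E_{X,X'} k(X, X') = ||mu_P||^2

def emp_error(m, trials=50):
    # Monte Carlo estimate of E || (1/m) sum_i phi(X_i) - mu_P ||_H
    errs = []
    for _ in range(trials):
        X = rng.normal(size=(m, 1))
        sq_dist = gram(X, X).mean() - 2 * gram(X, ref).mean() + ref_term
        errs.append(np.sqrt(max(sq_dist, 0.0)))
    return np.mean(errs)

for m in [10, 40, 160]:
    print(m, emp_error(m))            # roughly halves each time m is quadrupled
\end{verbatim}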
4. We have
\[
0 \le \underbrace{k(X, X)}_{\|\phi(X)\|_{\mathcal{H}}^2} \le R \qquad \text{and} \qquad -R \le k(X, X') \le R,
\]
hence
\[
\sqrt{ \mathbb{E}_X k(X, X) - \mathbb{E}_{X, X'} k(X, X') } \le \sqrt{2R}.
\]
5. Break the covariance operator into two operators,
\[
C_{XX} = \tilde C - M,
\]
where
\begin{align*}
\operatorname{tr}(\tilde C) &:= \sum_{i=1}^{\infty} \left\langle e_i, \tilde C e_i \right\rangle \\
&= \mathbb{E}_X \sum_{i=1}^{\infty} \langle \phi(X), e_i \rangle^2 \\
&= \mathbb{E}_X \|\phi(X)\|^2 \\
&= \mathbb{E}_X k(X, X).
\end{align*}
Similarly,
\begin{align*}
\operatorname{tr}(M) &= \sum_{i=1}^{\infty} \mathbb{E}_X e_i(X)\, \mathbb{E}_X e_i(X) \\
&= \sum_{i=1}^{\infty} \langle e_i, \mu_P \rangle^2 \\
&= \|\mu_P\|^2 \\
&= \mathbb{E}_{X, X'} k(X, X').
\end{align*}
Hence $\operatorname{tr}(C_{XX}) = \mathbb{E}_X k(X, X) - \mathbb{E}_{X, X'} k(X, X')$.
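As a finite-dimensional sanity check (an aside), the identity $\operatorname{tr}(C_{XX}) = \mathbb{E}_X k(X, X) - \mathbb{E}_{X, X'} k(X, X')$ can be verified directly with an explicit feature map; here we reuse the small polynomial feature map from Question 1, part 5:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 2))

def phi(x):
    # explicit feature map (x1, x1^2, x2, x2^2), so k(x, y) = phi(x) . phi(y)
    x1, x2 = x[..., 0], x[..., 1]
    return np.stack([x1, x1 ** 2, x2, x2 ** 2], axis=-1)

F = phi(X)                               # feature matrix, one row per sample
C = np.cov(F, rowvar=False, bias=True)   # empirical covariance of phi(X)

lhs = np.trace(C)                        # tr(C_XX)
# E k(X, X) - E_{X,X'} k(X, X'), with E k(X, X') = <mu_P, mu_P> = ||mean of phi||^2
rhs = np.mean(np.sum(F * F, axis=1)) - np.sum(F.mean(axis=0) ** 2)

print(lhs, rhs)                          # the two estimates agree up to floating point
\end{verbatim}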