Hilbert Space Representations of Probability Distributions
Arthur Gretton
joint work with Karsten Borgwardt, Kenji Fukumizu, Malte Rasch, Bernhard
Schölkopf, Alex Smola, Le Song, Choon Hui Teo
Max Planck Institute for Biological Cybernetics,
Tübingen, Germany
Overview
• The two sample problem: are samples {x1 , . . . , xm } and {y1 , . . . , yn }
generated from the same distribution?
• Kernel independence testing: given a sample of m pairs
{(x1 , y1 ), . . . , (xm , ym )}, are the random variables x and y independent?
Kernels, feature maps
A very short introduction to kernels
• Hilbert space F of functions f : X → R
• RKHS: the evaluation operator δ_x : F → R, δ_x(f) = f(x), is continuous
• Riesz: unique representer of evaluation k(x, ·) ∈ F:
  f(x) = ⟨f, k(x, ·)⟩_F
  – k(x, ·) is the feature map
  – k : X × X → R is the kernel function
• Inner product between two feature maps:
  ⟨k(x1, ·), k(x2, ·)⟩_F = k(x1, x2)
(A small numerical sketch of one such kernel follows.)
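As a concrete aside (not from the original slides), the Gaussian kernel k(x, x′) = exp(−‖x − x′‖²/(2σ²)) is one of the universal kernels referred to later; a minimal numpy sketch of its Gram-matrix computation, with an illustrative bandwidth sigma, is reused by the later sketches:

    import numpy as np

    def gaussian_kernel(X, Y, sigma=1.0):
        # Gram matrix K[i, j] = exp(-||X[i] - Y[j]||^2 / (2 sigma^2)).
        # 1-D inputs are treated as samples of scalars.
        X = np.asarray(X, dtype=float).reshape(len(X), -1)
        Y = np.asarray(Y, dtype=float).reshape(len(Y), -1)
        d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
        return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))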
A Kernel Method for the Two Sample Problem
The two-sample problem
• Given:
  – m samples x := {x1, . . . , xm} drawn i.i.d. from P
  – n samples y := {y1, . . . , yn} drawn i.i.d. from Q
• Determine: Are P and Q different?
• Applications:
  – Microarray data aggregation
  – Speaker/author identification
  – Schema matching
• Where is our test useful?
  – High dimensionality
  – Low sample size
  – Structured data (strings and graphs): currently the only available method
Overview (two-sample problem)
• How to detect P ≠ Q?
  – Distance between means in space of features
  – Function revealing differences in distributions
  – Same thing: the MMD [Gretton et al., 2007, Borgwardt et al., 2006]
• Hypothesis test using MMD
  – Asymptotic distribution of MMD
  – Large deviation bounds
• Experiments
Mean discrepancy (1)
• Simple example: 2 Gaussians with different means
• Answer: t-test
(Figure: densities of two Gaussians P_X and Q_X with different means.)
Mean discrepancy (2)
• Two Gaussians with same means, different variance
• Idea: look at difference in means of features of the RVs
• In the Gaussian case: second-order features of the form x²
(Figures: densities of two Gaussians P_X and Q_X with the same mean and different variances, and the densities of the feature X² under each.)
Mean discrepancy (3)
• Gaussian and Laplace distributions
• Same mean and same variance
• Difference in means using higher order features
(Figure: Gaussian and Laplace densities P_X and Q_X with the same mean and variance.)
Function revealing difference in distributions (1)
• Idea: avoid density estimation when testing P ≠ Q [Fortet and Mourier, 1953]
  D(P, Q; F) := sup_{f ∈ F} [E_P f(x) − E_Q f(y)]
• D(P, Q; F) = 0 iff P = Q, when F = bounded continuous functions [Dudley, 2002]
• D(P, Q; F) = 0 iff P = Q, when F = the unit ball in a universal RKHS F [via Steinwart, 2001]
  – Examples: Gaussian, Laplace [see also Fukumizu et al., 2004]
Function revealing difference in distributions (2)
• Gauss vs Laplace revisited
(Figure: Gauss and Laplace densities together with the witness function f revealing the difference between them.)
Function revealing difference in distributions (3)
• The (kernel) MMD:
  MMD(P, Q; F) = ( sup_{f ∈ F} [E_P f(x) − E_Q f(y)] )²
              = ( sup_{f ∈ F} ⟨f, μ_x − μ_y⟩_F )²
              = ‖μ_x − μ_y‖²_F
              = ⟨μ_x − μ_y, μ_x − μ_y⟩_F
              = E_{P,P′} k(x, x′) + E_{Q,Q′} k(y, y′) − 2 E_{P,Q} k(x, y)
  using E_P(f(x)) = E_P[⟨φ(x), f⟩_F] =: ⟨μ_x, f⟩_F and ‖μ‖_F = sup_{f ∈ F} ⟨f, μ⟩_F
• x′ is a R.V. independent of x with distribution P
• y′ is a R.V. independent of y with distribution Q
• Kernel between measures [Hein and Bousquet, 2005]: K(P, Q) = E_{P,Q} k(x, y)
(A plug-in estimate of the last line is sketched below.)
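One way to estimate the final kernel expression above is to plug in empirical averages (a biased estimate, shown only as a sketch; the unbiased estimator actually used for the test appears on the later slides). It reuses the gaussian_kernel helper sketched earlier:

    import numpy as np

    def mmd2_biased(x, y, sigma=1.0):
        # Plug-in estimate: mean k(x, x') + mean k(y, y') - 2 mean k(x, y).
        Kxx = gaussian_kernel(x, x, sigma)
        Kyy = gaussian_kernel(y, y, sigma)
        Kxy = gaussian_kernel(x, y, sigma)
        return Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean()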
Statistical test using MMD (1)
• Two hypotheses:
  – H0: null hypothesis (P = Q)
  – H1: alternative hypothesis (P ≠ Q)
• Observe samples x := {x1, . . . , xm} from P and y from Q
• If the empirical MMD(x, y; F) is
  – “far from zero”: reject H0
  – “close to zero”: accept H0
• How good is a test?
  – Type I error: we reject H0 although it is true
  – Type II error: we accept H0 although it is false
• A good test has low Type II error for a user-defined Type I error
Statistical test using MMD (2)
• “far from zero” vs “close to zero” - threshold?
• One answer: asymptotic distribution of MMD(x, y; F)
• An unbiased empirical estimate (quadratic cost):
  MMD(x, y; F) = (1 / (m(m−1))) Σ_{i≠j} [ k(x_i, x_j) − k(x_i, y_j) − k(y_i, x_j) + k(y_i, y_j) ]
  where the bracketed term is h((x_i, y_i), (x_j, y_j))
• When P ≠ Q, asymptotically normal [Hoeffding, 1948, Serfling, 1980]
• Expression for the variance, with z_i := (x_i, y_i):
  σ_u² = (4/m) ( E_z[(E_{z′} h(z, z′))²] − [E_{z,z′} h(z, z′)]² ) + O(m^{−2})
(A code sketch of the estimator follows.)
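A sketch of the unbiased quadratic-cost estimator above, assuming equal sample sizes m = len(x) = len(y) and reusing the gaussian_kernel helper:

    import numpy as np

    def mmd2_unbiased(x, y, sigma=1.0):
        # (1 / (m (m - 1))) * sum_{i != j} h((x_i, y_i), (x_j, y_j)), where
        # h = k(x_i, x_j) + k(y_i, y_j) - k(x_i, y_j) - k(y_i, x_j).
        m = len(x)
        Kxx = gaussian_kernel(x, x, sigma)
        Kyy = gaussian_kernel(y, y, sigma)
        Kxy = gaussian_kernel(x, y, sigma)
        H = Kxx + Kyy - Kxy - Kxy.T
        np.fill_diagonal(H, 0.0)          # drop the i = j terms
        return H.sum() / (m * (m - 1))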
Statistical test using MMD (3)
• Example: Laplace distributions with different variances
(Figures: two Laplace densities with different variances; empirical distribution of the MMD under H1 with a Gaussian fit.)
Statistical test using MMD (4)
• When P = Q, the U-statistic is degenerate: E_{z′}[h(z, z′)] = 0
• Distribution is
  m · MMD(x, y; F) ~ Σ_{l=1}^∞ λ_l [z_l² − 2]   [Anderson et al., 1994]
• where
  – z_l ~ N(0, 2) i.i.d.
  – the λ_i solve ∫_X k̃(x, x′) ψ_i(x) dP_x(x) = λ_i ψ_i(x′), with k̃ the centred kernel
(Figure: empirical density of the MMD under H0.)
Statistical test using MMD (5)
• Given P = Q, want a threshold T such that P(MMD > T) ≤ 0.05
• Bootstrap for the empirical CDF [Arcones and Giné, 1992]
• Pearson curves by matching the first four moments [Johnson et al., 1994]
• Large deviation bounds [Hoeffding, 1963, McDiarmid, 1989]
• Other...
(A resampling sketch follows the figure below.)
(Figure: empirical CDF of the MMD under H0, with the Pearson curve fit.)
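As an illustration of the resampling route, here is a minimal sketch that simulates the null by repeatedly re-splitting the pooled sample (a permutation-style stand-in for the bootstrap of Arcones and Giné cited above, which resamples with replacement instead); it assumes equal sample sizes and reuses mmd2_unbiased:

    import numpy as np

    def mmd_test_threshold(x, y, sigma=1.0, n_resample=500, alpha=0.05, rng=None):
        # Quantile of the simulated null distribution of the MMD estimate.
        rng = np.random.default_rng() if rng is None else rng
        pooled = np.concatenate([x, y])
        m = len(x)
        null = []
        for _ in range(n_resample):
            idx = rng.permutation(len(pooled))
            null.append(mmd2_unbiased(pooled[idx[:m]], pooled[idx[m:]], sigma))
        return np.quantile(null, 1.0 - alpha)

Reject H0 at level alpha when mmd2_unbiased(x, y, sigma) exceeds the returned threshold.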
Experiments
• Small sample size: Pearson more accurate than bootstrap
• Large sample size: bootstrap faster
• Cancer subtype (m = 25, 2118 dimensions):
– For Pearson, Type I 3.5%, Type II 0%
– For bootstrap, Type I 0.9%, Type II 0%
• Neural spikes (m = 1000, 100 dimensions):
– For Pearson, Type I 4.8%, Type II 3.4%
– For bootstrap, Type I 5.4%, Type II 3.3%
• Further experiments: comparison with t-test, Friedman-Rafsky
tests [Friedman and Rafsky, 1979], Biau-Györfi test [Biau and Gyorfi, 2005], and
Hall-Tajvidi test [Hall and Tajvidi, 2002].
Conclusions (two-sample problem)
• The MMD: distance between means in feature spaces
• When the feature space is a universal RKHS, MMD = 0 iff P = Q
• Statistical test of whether P ≠ Q using the asymptotic distribution:
– Pearson approximation for low sample size
– Bootstrap for large sample size
• Useful in high dimensions and for structured data
Dependence Detection with Kernels
Kernel dependence measures
• Independence testing
– Given: m samples z := {(x1 , y1 ), . . . , (xm , ym )} from P
– Determine: Does P = Px Py ?
• Kernel dependence measures
– Zero only at independence
– Take into account high order moments
– Make “sensible” assumptions about smoothness
• Covariance operators in spaces of features
– Spectral norm (COCO)
[Gretton et al., 2005c,d]
– Hilbert-Schmidt norm (HSIC)
[Gretton et al., 2005a]
Function revealing dependence (1)
• Idea: avoid density estimation when testing P = Px Py [Rényi, 1959]
  COCO(P; F, G) := sup_{f ∈ F, g ∈ G} ( E_{x,y}[f(x) g(y)] − E_x[f(x)] E_y[g(y)] )
• COCO(P; F, G) = 0 iff x, y independent, when F and G are the respective unit balls in universal RKHSs F and G [via Steinwart, 2001]
  – Examples: Gaussian, Laplace [see also Bach and Jordan, 2002]
In geometric terms:
• Covariance operator: C_xy : G → F such that
  ⟨f, C_xy g⟩_F = E_{x,y}[f(x) g(y)] − E_x[f(x)] E_y[g(y)]
• COCO is the spectral norm of C_xy [Gretton et al., 2005c,d]:
  COCO(P; F, G) := ‖C_xy‖_S
Function revealing dependence (2)
• Ring-shaped density, correlation approx. zero [example from Fukumizu, Bach, and Gretton, 2005]
(Figures: sample from the ring-shaped density, correlation −0.00; dependence witness functions f(x) and g(y); scatter of (f(X), g(Y)) with correlation −0.90 and COCO = 0.14.)
Function revealing dependence (3)
• Empirical COCO(z; F, G) is the largest eigenvalue γ of the generalised eigenvalue problem
  [ 0            (1/m) K̃ L̃ ] [c]       [ K̃   0 ] [c]
  [ (1/m) L̃ K̃   0          ] [d]  =  γ [ 0   L̃ ] [d]
• K̃ and L̃ are matrices of inner products between centred observations in the respective feature spaces:
  K̃ = HKH (and likewise L̃ = HLH), where H = I − (1/m) 1 1ᵀ,
  k(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩_F and l(y_i, y_j) = ⟨ψ(y_i), ψ(y_j)⟩_G
• Witness function for x:
  f(x) = Σ_{i=1}^m c_i [ k(x_i, x) − (1/m) Σ_{j=1}^m k(x_j, x) ]
(A numerical sketch of this eigenvalue problem follows.)
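A minimal numerical sketch of the eigenvalue problem above, using Gaussian kernels with an illustrative bandwidth; the small ridge added to the right-hand-side matrix is purely for numerical stability and is not part of the slide's formulation:

    import numpy as np
    from scipy.linalg import eigh

    def empirical_coco(x, y, sigma=1.0, ridge=1e-8):
        # Largest generalised eigenvalue of the block problem on the slide,
        # built from centred Gram matrices K~ = HKH and L~ = HLH.
        m = len(x)
        H = np.eye(m) - np.ones((m, m)) / m
        Kt = H @ gaussian_kernel(x, x, sigma) @ H
        Lt = H @ gaussian_kernel(y, y, sigma) @ H
        Z = np.zeros((m, m))
        A = np.block([[Z, Kt @ Lt / m], [Lt @ Kt / m, Z]])
        B = np.block([[Kt, Z], [Z, Lt]]) + ridge * np.eye(2 * m)
        vals, vecs = eigh(A, B)           # generalised symmetric eigenproblem
        return vals[-1], vecs[:m, -1]     # gamma, witness coefficients c

    def coco_witness(x_train, c, x_eval, sigma=1.0):
        # f(x) = sum_i c_i (k(x_i, x) - (1/m) sum_j k(x_j, x))
        K = gaussian_kernel(x_train, x_eval, sigma)
        return c @ (K - K.mean(axis=0, keepdims=True))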
Function revealing dependence (4)
• Can we do better?
• A second example with zero correlation
(Figures: sample with correlation 0; the first pair of dependence witnesses gives correlation −0.80 between f(X) and g(Y) with COCO = 0.11; a second pair of witnesses f₂, g₂ gives correlation −0.37 between f₂(X) and g₂(Y) with COCO₂ = 0.06.)
Hilbert-Schmidt Independence Criterion
• Given γ_i := COCO_i(z; F, G), define the Hilbert-Schmidt Independence Criterion (HSIC) [Gretton et al., 2005b]:
  HSIC(z; F, G) := Σ_{i=1}^m γ_i²
• In the limit of infinite samples:
  HSIC(P; F, G) := ‖C_xy‖²_HS = ⟨C_xy, C_xy⟩_HS
    = E_{x,x′,y,y′}[k(x, x′) l(y, y′)] + E_{x,x′}[k(x, x′)] E_{y,y′}[l(y, y′)] − 2 E_{x,y}[ E_{x′}[k(x, x′)] E_{y′}[l(y, y′)] ]
• x′ an independent copy of x, y′ a copy of y
(A common empirical form is sketched below.)
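For reference, a common biased empirical form of HSIC is the trace of the product of centred Gram matrices; the 1/m² normalisation below is one convention (some papers use 1/(m−1)² instead), and the bandwidth is again illustrative:

    import numpy as np

    def hsic_biased(x, y, sigma=1.0):
        # tr(K~ L~) / m^2 with K~ = HKH, L~ = HLH (biased estimate).
        m = len(x)
        H = np.eye(m) - np.ones((m, m)) / m
        K = gaussian_kernel(x, x, sigma)
        L = gaussian_kernel(y, y, sigma)
        return np.trace(H @ K @ H @ L) / m**2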
Link between HSIC and MMD (1)
• Define the product space F × G with kernel
  ⟨Φ(x, y), Φ(x′, y′)⟩ = K((x, y), (x′, y′)) = k(x, x′) l(y, y′)
• Define the mean elements
  ⟨μ_xy, Φ(x, y)⟩ := E_{x′,y′} ⟨Φ(x′, y′), Φ(x, y)⟩ = E_{x′,y′} [k(x, x′) l(y, y′)]
  and
  ⟨μ_{x⊥⊥y}, Φ(x, y)⟩ := E_{x′} E_{y′} ⟨Φ(x′, y′), Φ(x, y)⟩ = E_{x′}[k(x, x′)] E_{y′}[l(y, y′)]
• The MMD between these two mean elements is
  MMD(P, Px Py; F × G) = ‖μ_xy − μ_{x⊥⊥y}‖²_{F×G}
                       = ⟨μ_xy − μ_{x⊥⊥y}, μ_xy − μ_{x⊥⊥y}⟩
                       = HSIC(P; F, G)
(A numerical check of this identity is sketched below.)
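To see the identity numerically (a small sketch, not in the slides), one can compare the plug-in MMD between the empirical joint and the product of empirical marginals, computed with the product kernel, against the biased HSIC estimate from earlier; for these plug-in estimators the two numbers coincide:

    import numpy as np

    def mmd2_vs_hsic(x, y, sigma=1.0):
        # Plug-in MMD(P, Px Py; F x G) with kernel k(x, x') l(y, y').
        m = len(x)
        K = gaussian_kernel(x, x, sigma)
        L = gaussian_kernel(y, y, sigma)
        joint = (K * L).mean()                                  # joint vs joint
        cross = (K.sum(axis=1) * L.sum(axis=1)).sum() / m**3    # joint vs product
        prod = K.sum() * L.sum() / m**4                         # product vs product
        mmd2 = joint - 2.0 * cross + prod
        return mmd2, hsic_biased(x, y, sigma)   # equal up to rounding error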
Link between HSIC and MMD (2)
• Witness function for HSIC
(Figure: HSIC dependence witness function on the (X, Y) plane, shown together with the sample.)
Independence test: verifying ICA and ISA
• HSICp: null distribution via sampling (a permutation sketch follows below)
• HSICg: null distribution via moment matching
• Compare with contingency table test (PD) [Read and Cressie, 1988]
(Figures: two-dimensional samples rotated by θ = π/8 and θ = π/4; percentage acceptance of H0 vs rotation angle (×π/4) for PD, HSICp, and HSICg, at sample size / dimension settings 128/1, 512/2, 1024/4, and 2048/4.)
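A minimal sketch of the HSICp idea, simulating the null by permuting the y sample so that any dependence with x is destroyed; the bandwidth, number of permutations, and test level are illustrative, and it reuses the hsic_biased helper sketched earlier:

    import numpy as np

    def hsic_permutation_test(x, y, sigma=1.0, n_perm=1000, alpha=0.05, rng=None):
        # Reject independence when the observed HSIC exceeds the
        # (1 - alpha) quantile of the permutation null distribution.
        rng = np.random.default_rng() if rng is None else rng
        stat = hsic_biased(x, y, sigma)
        null = np.array([hsic_biased(x, y[rng.permutation(len(y))], sigma)
                         for _ in range(n_perm)])
        thresh = np.quantile(null, 1.0 - alpha)
        return stat, thresh, stat > thresh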
Other applications of HSIC
• Feature selection [Song et al., 2007c,a]
• Clustering [Song et al., 2007b]
Conclusions (dependence measures)
• COCO and HSIC: norms of the covariance operator between feature spaces
• When the feature spaces are universal RKHSs, COCO = HSIC = 0 iff P = Px Py
• Statistical test possible using asymptotic distribution
• Independent component analysis
– high accuracy
– less sensitive to initialisation
Questions?
Bibliography
References
N. Anderson, P. Hall, and D. Titterington. Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. Journal of Multivariate Analysis, 50:41–54, 1994.
M. Arcones and E. Giné. On the bootstrap of U and V statistics. The Annals of Statistics, 20(2):655–674, 1992.
F. R. Bach and M. I. Jordan. Kernel independent component analysis. J. Mach. Learn. Res., 3:1–48, 2002.
G. Biau and L. Gyorfi. On the asymptotic properties of a nonparametric L1-test statistic of homogeneity. IEEE Transactions on Information Theory, 51(11):3965–3973, 2005.
K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49–e57, 2006.
R. M. Dudley. Real analysis and probability. Cambridge University Press, Cambridge, UK, 2002.
R. Fortet and E. Mourier. Convergence de la répartition empirique vers la répartition théorique. Ann. Scient. École Norm. Sup., 70:266–285, 1953.
J. Friedman and L. Rafsky. Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. The Annals of Statistics, 7(4):697–717, 1979.
K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. J. Mach. Learn. Res., 5:73–99, 2004.
K. Fukumizu, F. Bach, and A. Gretton. Consistency of kernel canonical correlation analysis. Technical Report
Hard-to-detect dependence (1)
• COCO can be ≈ 0 for dependent RVs with highly non-smooth densities
• Reason: norms in the denominator
  COCO(P; F, G) := sup_{f ∈ F, g ∈ G} cov(f(x), g(y)) / (‖f‖_F ‖g‖_G)
• RESULT: not detectable with finite sample size
• More formally: see Ingster [1989]
Hard-to-detect dependence (2)
(Figures: a smooth and a rough version of the density, with 500 samples drawn from each; a sampler sketch follows below.)
Density takes the form: Px,y ∝ 1 + sin(ωx) sin(ωy)
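For experimentation with this family of densities, here is a small rejection sampler; the support [−π, π]² is an assumption made for illustration (the unnormalised density is bounded by 2, which gives the acceptance probability below):

    import numpy as np

    def sample_sine_density(n, omega, rng=None):
        # Draw n samples from p(x, y) proportional to 1 + sin(omega x) sin(omega y)
        # on [-pi, pi]^2 (assumed support), by rejection from a uniform proposal.
        rng = np.random.default_rng() if rng is None else rng
        out = []
        while len(out) < n:
            x, y = rng.uniform(-np.pi, np.pi, size=2)
            if rng.uniform() < (1.0 + np.sin(omega * x) * np.sin(omega * y)) / 2.0:
                out.append((x, y))
        return np.array(out)

Feeding samples drawn at increasing omega into the COCO or HSIC sketches above illustrates the loss of detectable dependence described on the following slide.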
Hard-to-detect dependence (3)
• Example: sinusoids of increasing frequency
(Figure: empirical COCO, averaged over 1500 samples, as a function of the frequency ω = 1, . . . , 6 of the non-constant density component; COCO falls towards zero as ω increases.)
Choosing kernel size (1)
• The RKHS norm of f is ‖f‖²_{H_X} := Σ_{i=1}^∞ f̃_i² k̃_i^{−1}.
• If kernel decays quickly, its spectrum decays slowly:
– then non-smooth functions have smaller RKHS norm
• Example: spectrum of two Gaussian kernels
(Figure: two Gaussian kernels with σ² = 1/2 and σ² = 2, and their spectra; the narrower kernel has the more slowly decaying spectrum.)
Choosing kernel size (2)
• Could we just decrease kernel size?
• Yes, but only up to a point
(Figure: empirical COCO, averaged over 1000 samples, as a function of log kernel size; the dependence is best revealed at an intermediate kernel size.)