LEARNING FROM DISTRIBUTIONS VIA SUPPORT MEASURE MACHINES

Krikamol Muandet†, Kenji Fukumizu‡, Francesco Dinuzzo†, Bernhard Schölkopf†

† Department of Empirical Inference, MPI for Intelligent Systems, Tübingen, Germany
‡ Department of Mathematical Analysis and Statistical Inference, The Institute of Statistical Mathematics, Tokyo, Japan

{krikamol,fdinuzzo,bs}@tuebingen.mpg.de, [email protected]
From Data Points to Probability Measures

Data Points → Dirac Measures → Probability Measures

Risk Deviation Bound & Flexible SVMs
Risk Deviation Bound: Given an arbitrary distribution P with finite variance σ², a Lipschitz continuous function f : R → R with constant C_f, and an arbitrary loss function ℓ : R × R → R that is Lipschitz continuous in the second argument with constant C_ℓ, it follows, for any y ∈ R, that

    |E_{x∼P}[ℓ(y, f(x))] − ℓ(y, E_{x∼P}[f(x)])| ≤ 2 C_ℓ C_f σ.
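As a quick sanity check of the bound, the following sketch estimates both sides by Monte Carlo for a made-up example (f(x) = sin x with C_f = 1, the absolute loss with C_ℓ = 1, and P = N(1, σ²)); these specific choices are illustrative and not taken from the poster:

    import numpy as np

    rng = np.random.default_rng(0)
    sigma, y = 0.5, 0.3
    x = rng.normal(1.0, sigma, size=1_000_000)   # x ~ P = N(1, sigma^2)

    f = np.sin                                   # Lipschitz with C_f = 1
    loss = lambda y, t: np.abs(y - t)            # Lipschitz in t with C_l = 1

    lhs = abs(np.mean(loss(y, f(x))) - loss(y, np.mean(f(x))))
    rhs = 2 * 1 * 1 * sigma                      # 2 * C_l * C_f * sigma
    print(lhs, "<=", rhs)                        # the estimated deviation stays below the bound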
Representer Theorem

Given training examples (P_i, y_i) ∈ P × R, i = 1, …, m, a strictly monotonically increasing function Ω : [0, +∞) → R, and a loss function ℓ : (P × R²)^m → R ∪ {+∞}, any f ∈ H minimizing the regularized risk functional

    ℓ(P_1, y_1, E_{P_1}[f], …, P_m, y_m, E_{P_m}[f]) + Ω(‖f‖_H)

admits a representation of the form

    f = Σ_{i=1}^m α_i µ_{P_i} = Σ_{i=1}^m α_i E_{x∼P_i}[k(x, ·)]   for some α_i ∈ R, i = 1, …, m.
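To make the theorem concrete, here is a minimal sketch that fits an expansion of this form with a squared loss (i.e., kernel ridge regression over distributions rather than the SVM loss); the toy Gaussian inputs, the Gaussian RBF embedding kernel, γ, and λ are assumptions for illustration, and K[i, j] = ⟨µ_{P_i}, µ_{P_j}⟩_H is estimated by averaging k over samples:

    import numpy as np

    rng = np.random.default_rng(0)
    gamma, lam, m = 0.5, 1e-2, 40

    def k_gauss(Xi, Xj):
        # <mu_Pi, mu_Pj>_H estimated from samples: average of k(x, z) = exp(-gamma/2 ||x - z||^2)
        d2 = ((Xi[:, None, :] - Xj[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * gamma * d2).mean()

    # Toy input distributions P_i = N(m_i, 0.25 I) represented by samples; targets depend on the means.
    means = rng.uniform(-3, 3, size=(m, 2))
    samples = [mu + 0.5 * rng.normal(size=(50, 2)) for mu in means]
    y = np.sin(means[:, 0]) + 0.1 * rng.normal(size=m)

    # Distribution-level Gram matrix and the coefficients alpha of f = sum_i alpha_i mu_{P_i}
    # (squared loss + RKHS regularizer gives alpha = (K + lam*m*I)^{-1} y).
    K = np.array([[k_gauss(samples[i], samples[j]) for j in range(m)] for i in range(m)])
    alpha = np.linalg.solve(K + lam * m * np.eye(m), y)

    # Prediction for a new distribution Q: E_Q[f] = sum_i alpha_i <mu_{P_i}, mu_Q>_H.
    Q = np.array([1.0, 0.0]) + 0.5 * rng.normal(size=(50, 2))
    print(alpha @ np.array([k_gauss(Xi, Q) for Xi in samples]), np.sin(1.0))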
Key Observations

The standard representer theorem is recovered as a special case when P_i = δ_{x_i}. Thus, our framework generalizes the machine learning framework on data points. Moreover, our framework is different from minimizing the functional

    E_{P_1} ⋯ E_{P_m} ℓ({x_i, y_i, f(x_i)}_{i=1}^m) + Ω(‖f‖_H)        (1)

for the special case of the additive loss ℓ (intractable). It is also different from minimizing the functional

    ℓ({M_i, y_i, f(M_i)}_{i=1}^m) + Ω(‖f‖_H)        (2)

where M_i = E_{x∼P_i}[x] (loss of information). The proposed framework does not lose information, but optimizes a less expensive problem than (1).
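The loss of information in (2) can be seen in a small sketch: two toy distributions with the same mean are indistinguishable by their mean vectors, yet their kernel mean embeddings under a Gaussian RBF kernel are clearly different. The distributions and γ below are made up for illustration:

    import numpy as np

    rng = np.random.default_rng(1)

    def rbf(X, Z, gamma=0.5):
        # Gaussian RBF Gram matrix k(x, z) = exp(-gamma/2 * ||x - z||^2)
        d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * gamma * d2)

    n = 2000
    P = rng.normal(0.0, 1.0, size=(n, 1))        # N(0, 1)
    Q = rng.choice([-2.0, 2.0], size=(n, 1))     # two-point mixture with the same mean (0)

    print(P.mean(), Q.mean())                    # means are ~equal: representation (2) cannot tell them apart
    mmd2 = rbf(P, P).mean() - 2 * rbf(P, Q).mean() + rbf(Q, Q).mean()
    print(mmd2)                                  # ||mu_P - mu_Q||_H^2 is clearly > 0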
For some distributions and kernels k, the kernel K(P, Q) has an analytic form. Assume that P_i = N(m_i, Σ_i):

Linear k(x, y) = ⟨x, y⟩:

    K(P_i, P_j) = m_iᵀ m_j + δ_ij tr Σ_i

Gaussian RBF k(x, y) = exp(−(γ/2)‖x − y‖²):

    K(P_i, P_j) = exp(−½ (m_i − m_j)ᵀ (Σ_i + Σ_j + γ⁻¹ I)⁻¹ (m_i − m_j)) / |γΣ_i + γΣ_j + I|^{1/2}

Polynomial degree 2 k(x, y) = (⟨x, y⟩ + 1)²:

    K(P_i, P_j) = (⟨m_i, m_j⟩ + 1)² + tr Σ_i Σ_j + m_iᵀ Σ_j m_i + m_jᵀ Σ_i m_j

Polynomial degree 3 k(x, y) = (⟨x, y⟩ + 1)³:

    K(P_i, P_j) = (⟨m_i, m_j⟩ + 1)³ + 6 m_iᵀ Σ_i Σ_j m_j + 3(⟨m_i, m_j⟩ + 1)(tr Σ_i Σ_j + m_iᵀ Σ_j m_i + m_jᵀ Σ_i m_j)
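As a sketch, the closed form for the Gaussian RBF case can be checked against a Monte Carlo estimate of E_{x∼P_i, z∼P_j}[k(x, z)]; the dimension, means, covariances, and γ below are arbitrary test values:

    import numpy as np

    rng = np.random.default_rng(0)
    gamma, d = 0.5, 2
    m1, m2 = rng.normal(size=d), rng.normal(size=d)
    A1, A2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
    S1, S2 = A1 @ A1.T + 0.1 * np.eye(d), A2 @ A2.T + 0.1 * np.eye(d)

    def rbf_kernel_gaussians(m1, S1, m2, S2, gamma):
        # Closed form of K(P_i, P_j) for Gaussian measures under a Gaussian RBF embedding kernel
        diff = m1 - m2
        C = S1 + S2 + np.eye(len(m1)) / gamma
        quad = diff @ np.linalg.solve(C, diff)
        denom = np.sqrt(np.linalg.det(gamma * S1 + gamma * S2 + np.eye(len(m1))))
        return np.exp(-0.5 * quad) / denom

    def rbf_kernel_mc(m1, S1, m2, S2, gamma, n=200_000):
        # Monte Carlo estimate of E_{x~P_i, z~P_j} exp(-gamma/2 * ||x - z||^2)
        x = rng.multivariate_normal(m1, S1, size=n)
        z = rng.multivariate_normal(m2, S2, size=n)
        return np.mean(np.exp(-0.5 * gamma * np.sum((x - z) ** 2, axis=1)))

    print(rbf_kernel_gaussians(m1, S1, m2, S2, gamma))   # analytic value
    print(rbf_kernel_mc(m1, S1, m2, S2, gamma))          # should agree to a few decimals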
A nonlinear kernel can be defined as

    K(P, Q) = κ(µ_P, µ_Q) = ⟨Φ(µ_P), Φ(µ_Q)⟩_F

where κ is a positive definite kernel function on H. For example, K(P, Q) = exp(−‖µ_P − µ_Q‖²_H / (2σ²)) and K(P, Q) = (⟨µ_P, µ_Q⟩_H + c)^d. The embedding kernel k defines the vectorial representation of the distributions, whereas the level-2 kernel κ allows for nonlinear learning algorithms on probability distributions.
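Given a Gram matrix K of the linear kernel ⟨µ_P, µ_Q⟩_H between training distributions, the level-2 kernels above only need the entries of K. A minimal sketch, where σ, c, and the degree are placeholder hyperparameters:

    import numpy as np

    def level2_kernels(K, sigma=1.0, c=1.0, degree=2):
        # K[i, j] = <mu_Pi, mu_Pj>_H: Gram matrix of the embedding (linear) kernel on distributions
        diag = np.diag(K)
        sqdist = diag[:, None] - 2.0 * K + diag[None, :]     # ||mu_Pi - mu_Pj||_H^2
        K_rbf = np.exp(-sqdist / (2.0 * sigma ** 2))         # level-2 Gaussian RBF kernel
        K_poly = (K + c) ** degree                           # level-2 polynomial kernel
        return K_rbf, K_poly

    A = np.random.default_rng(0).normal(size=(5, 5))
    K_rbf, K_poly = level2_kernels(A @ A.T)                  # any PSD Gram matrix works here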
Experimental Results

In the experiments, we primarily consider three different learning algorithms: i) an SVM trained on the means of the distributions, which serves as a baseline (cf. (2)); ii) the Augmented SVM (ASVM), an SVM trained on augmented samples drawn according to the distributions {P_i}_{i=1}^m (cf. (1)); iii) the SMM, our distribution-based method applied directly to the distributions.
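The following is a minimal sketch of the three baselines on made-up Gaussian toy data using scikit-learn's SVC; the data, γ, and the use of a precomputed linear kernel on empirical embeddings for the SMM are illustrative assumptions, not the exact experimental setup of the poster:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    gamma, m, n = 0.5, 60, 30

    # Toy distributions: Gaussians whose class depends on the sign of the first mean coordinate.
    means = rng.uniform(-2, 2, size=(m, 2))
    y = (means[:, 0] > 0).astype(int)
    samples = [mu + 0.3 * rng.normal(size=(n, 2)) for mu in means]

    def emb_kernel(Xi, Xj):
        # <mu_Pi, mu_Pj>_H estimated from samples with k(x, z) = exp(-gamma/2 ||x - z||^2)
        d2 = ((Xi[:, None, :] - Xj[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * gamma * d2).mean()

    # (i) SVM on the means of the distributions (cf. (2)).
    # Note: scikit-learn's rbf kernel is exp(-gamma * ||x - z||^2), i.e. without the factor 1/2.
    svm_mean = SVC(kernel="rbf", gamma=gamma).fit(means, y)

    # (ii) ASVM: an ordinary SVM on the pooled samples, each labelled with its group's label (cf. (1)).
    X_aug, y_aug = np.vstack(samples), np.repeat(y, n)
    asvm = SVC(kernel="rbf", gamma=gamma).fit(X_aug, y_aug)

    # (iii) SMM: an SVM on the distributions themselves via a precomputed kernel K[i, j] = <mu_Pi, mu_Pj>_H.
    K = np.array([[emb_kernel(samples[i], samples[j]) for j in range(m)] for i in range(m)])
    smm = SVC(kernel="precomputed").fit(K, y)
    print(svm_mean.score(means, y), asvm.score(X_aug, y_aug), smm.score(K, y))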
Figure 1: The decision boundaries of SVM, ASVM, and SMM on the synthetic dataset of distributions.

Figure 2: The parameter sensitivity of embedding kernels and level-2 kernels. The heatmaps depict the accuracy at different parameter values (panels: Embedding RBF 1, Embedding RBF 2, Level-2 RBF, Level-2 Poly).

Figure 3: The comparison of SVM, ASVM, and SMM on the USPS handwritten digits dataset (1 vs 8, 3 vs 4, 3 vs 8, 6 vs 9). The virtual examples are generated according to three basic operations, namely scaling, translation, and rotation. The pseudo-samples are drawn from the distributions associated with the parameters of each operation. We consider the linear SMM with Gaussian RBF kernel. Accuracy (%) is plotted against the number of virtual examples.
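For intuition, here is a sketch of how pseudo-samples for one of these operations (rotation) might be drawn: angles are sampled from an assumed distribution over the rotation parameter and applied to a digit image. The angle distribution and the stand-in image are illustrative, not the poster's actual settings:

    import numpy as np
    from scipy.ndimage import rotate

    rng = np.random.default_rng(0)

    def rotation_pseudo_samples(image, n=10, angle_std=10.0):
        # Draw rotation angles from a distribution over the operation's parameter
        # and return the corresponding pseudo-samples (rotated copies of the digit).
        angles = rng.normal(0.0, angle_std, size=n)
        return [rotate(image, angle, reshape=False, mode="nearest") for angle in angles]

    digit = rng.random((16, 16))                 # stand-in for a 16x16 USPS digit
    virtual = rotation_pseudo_samples(digit, n=30)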
Hilbert Space Embedding

The kernel mean map from a space of distributions P into a reproducing kernel Hilbert space (RKHS) H:

    µ : P → H,   P ↦ ∫ k(x, ·) dP(x).

The kernel k is said to be characteristic if and only if the map µ is injective, i.e., there is no loss of information.
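A minimal sketch of the empirical version of this map, µ̂_P(·) = (1/n) Σ_i k(x_i, ·), with a Gaussian RBF embedding kernel; the sample distribution and γ are placeholders:

    import numpy as np

    def kernel_mean_embedding(X, gamma=0.5):
        # Empirical mean map: mu_hat_P(t) = (1/n) * sum_i k(x_i, t), with k a Gaussian RBF kernel
        def mu(t):
            d2 = ((X - t) ** 2).sum(axis=-1)
            return np.exp(-0.5 * gamma * d2).mean()
        return mu

    rng = np.random.default_rng(0)
    X = rng.normal(0.0, 1.0, size=(1000, 2))   # samples x_i ~ P
    mu_P = kernel_mean_embedding(X)
    print(mu_P(np.zeros(2)))                   # the embedding evaluated at t, an estimate of E_P[k(x, t)]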
Kernels on Distributions

For distributions P, Q ∈ P, a linear kernel on P is

    K(P, Q) = ⟨µ_P, µ_Q⟩_H = ∫∫ k(x, z) dP(x) dQ(z),

which can be approximated as

    K̂(P̂, Q̂) = (1/(nm)) Σ_{i=1}^n Σ_{j=1}^m k(x_i, z_j),   x_i ∼ P, z_j ∼ Q.
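A direct implementation of this estimate, assuming a Gaussian RBF embedding kernel and toy samples X from P and Z from Q:

    import numpy as np

    def k_hat(X, Z, gamma=0.5):
        # Empirical linear kernel between distributions:
        # K_hat(P, Q) = (1/(n*m)) * sum_{i,j} k(x_i, z_j), with x_i ~ P, z_j ~ Q
        d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * gamma * d2).mean()

    rng = np.random.default_rng(0)
    X = rng.normal(0.0, 1.0, size=(500, 2))    # x_i ~ P
    Z = rng.normal(0.5, 1.0, size=(400, 2))    # z_j ~ Q
    print(k_hat(X, Z))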
Flexible SVMs: Assume that the densities of distributions P and Q are g_x(·) and g_z(·), where x and z are the parameters in the density family. Hence, we have

    K(P, Q) = ⟨∫ k(x̃, ·) g_x(x̃) dx̃, ∫ k(z̃, ·) g_z(z̃) dz̃⟩_H = k_g(x, z),

where k_g is a data-dependent p.d. kernel. That is, k_g depends not only on x, z ∈ X, but also on other parameters of P, Q ∈ P. For example, if P and Q are Gaussian distributions and the kernel k is a Gaussian RBF kernel, then we have different Gaussian RBF kernels at each data point, i.e., the means of the distributions (see the figure contrasting the standard support vector machine with the flexible support vector machine). The flexible SVM thus allows for data-dependent kernel functions, for example, pointwise uncertainties.
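As a sketch for the Gaussian case mentioned above, the data-dependent kernel k_g(x, z) can reuse the closed form from the table of analytic kernels, with each point carrying its own covariance; the points and covariances below are made up:

    import numpy as np

    def k_g(x, z, Sx, Sz, gamma=0.5):
        # Data-dependent kernel k_g(x, z) for Gaussian uncertainties P = N(x, Sx), Q = N(z, Sz)
        # under a Gaussian RBF embedding kernel: the closed form from the analytic-kernel table.
        d = np.asarray(x) - np.asarray(z)
        I = np.eye(len(d))
        C = Sx + Sz + I / gamma
        quad = d @ np.linalg.solve(C, d)
        return np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(gamma * Sx + gamma * Sz + I))

    # Each training point carries its own covariance, so each pair (x, z) effectively
    # gets its own Gaussian RBF kernel width.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 2))
    covs = [np.diag(rng.uniform(0.05, 0.5, size=2)) for _ in range(5)]
    G = np.array([[k_g(X[i], X[j], covs[i], covs[j]) for j in range(5)] for i in range(5)])
    print(G.shape)   # Gram matrix that could be passed to an SVM with a precomputed kernel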
Potential applications: Learning with noisy/uncertain examples (astronomical/biological data). Learning from groups of samples (population genetics, group anomaly detection, and preference learning). Learning under changing environments (domain adaptation/generalization). Large-scale machine learning (data squashing).
Figure 4: The computational cost of ASVM and SMM on the USPS dataset (top; relative computational cost on a log scale against the number of virtual examples). The results of different approaches (pLSA, SVM, LSMM, NLSMM) on natural scene categorization using the bag-of-word representation (bottom; accuracy in %).
The results demonstrate the benefits of the distribution-based approach over the sample-based approach.