Regression on Probability Measures: A Simple and
Consistent Algorithm
Zoltán Szabó (Gatsby Unit, UCL)
Joint work with
◦ Bharath K. Sriperumbudur (Department of Statistics, PSU),
◦ Barnabás Póczos (ML Department, CMU),
◦ Arthur Gretton (Gatsby Unit, UCL)
CRiSM Seminars
Department of Statistics,
University of Warwick
May 29, 2015
The task
Samples: {(x_i, y_i)}_{i=1}^ℓ. Goal: find f ∈ H with f(x_i) ≈ y_i.
Distribution regression:
the x_i's are distributions,
available only through samples: {x_{i,n}}_{n=1}^{N_i}.
⇒ Training examples: labelled bags.
Example: aerosol prediction from satellite images
Bag := pixels of a multispectral satellite image over an area.
Label of a bag := aerosol value.
Relevance: climate research.
Engineered methods [Wang et al., 2012]: 100 × RMSE = 7.5–8.5.
Using distribution regression?
Wider context
Context:
machine learning: multi-instance learning,
statistics: point estimation tasks (without analytical formula).
Applications:
computer vision: image = collection of patch vectors,
network analysis: group of people = bag of friendship graphs,
natural language processing: corpus = bag of documents,
time-series modelling: user = set of trial time-series.
Several algorithmic approaches
1. Parametric fit: Gaussian, MOG, exponential family [Jebara et al., 2004, Wang et al., 2009, Nielsen and Nock, 2012].
2. Kernelized Gaussian measures [Jebara et al., 2004, Zhou and Chellappa, 2006].
3. (Positive definite) kernels [Cuturi et al., 2005, Martins et al., 2009, Hein and Bousquet, 2005].
4. Divergence measures (KL, Rényi, Tsallis) [Póczos et al., 2011].
5. Set metrics: Hausdorff metric [Edgar, 1995]; variants [Wang and Zucker, 2000, Wu et al., 2010, Zhang and Zhou, 2009, Chen and Wu, 2012].
Theoretical guarantee?
MIL dates back to [Haussler, 1999, Gärtner et al., 2002].
Sensible methods in regression require density estimation [Póczos et al., 2013, Oliva et al., 2014, Reddi and Póczos, 2014] + assumptions:
1. compact Euclidean domain,
2. output = R ([Oliva et al., 2013] allows distributions).
Kernel, RKHS
k : D × D → R is a kernel on D if
∃ϕ : D → H(ilbert space) feature map,
k(a, b) = ⟨ϕ(a), ϕ(b)⟩_H (∀a, b ∈ D).
Kernel examples on D = R^d (p > 0, θ > 0):
k(a, b) = (⟨a, b⟩ + θ)^p: polynomial,
k(a, b) = e^{−‖a−b‖₂²/(2θ²)}: Gaussian,
k(a, b) = e^{−θ‖a−b‖₁}: Laplacian.
In the H = H(k) RKHS (∃!): ϕ(u) = k(·, u).
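To make the examples concrete, a minimal Python sketch of the three kernels above; the parameter defaults are arbitrary illustrative choices, not values from the talk.

```python
import numpy as np

def k_poly(a, b, p=2, theta=1.0):
    # polynomial kernel: (<a, b> + theta)^p
    return (np.dot(a, b) + theta) ** p

def k_gauss(a, b, theta=1.0):
    # Gaussian kernel: exp(-||a - b||_2^2 / (2 theta^2))
    return np.exp(-np.sum((a - b) ** 2) / (2 * theta ** 2))

def k_laplace(a, b, theta=1.0):
    # Laplacian kernel: exp(-theta * ||a - b||_1)
    return np.exp(-theta * np.sum(np.abs(a - b)))

a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(k_poly(a, b), k_gauss(a, b), k_laplace(a, b))
```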
Kernel: example domains (D)
Euclidean space: D = R^d.
Graphs, texts, time series, dynamical systems.
Distributions!
Problem formulation (Y = R)
Given:
labelled bags ẑ = {(x̂_i, y_i)}_{i=1}^ℓ,
i-th bag: x̂_i = {x_{i,1}, ..., x_{i,N}} ~ i.i.d. x_i ∈ P(D), y_i ∈ R.
Task: find a P(D) → R mapping based on ẑ.
Construction: distribution embedding (µ_x) + ridge regression:
P(D) --[µ = µ(k)]--> X ⊆ H = H(k) --[f ∈ H = H(K)]--> R.
Our goal: risk bound compared to the regression function
f_ρ(µ_x) = ∫_R y dρ(y|µ_x).
Goal in details
Expected risk:
R[f] = E_{(x,y)} |f(µ_x) − y|².
Contribution: analysis of the excess risk
E(f_ẑ^λ, f_ρ) = R[f_ẑ^λ] − R[f_ρ] ≤ g(ℓ, N, λ) → 0, and rates, where
f_ẑ^λ = arg min_{f∈H} (1/ℓ) Σ_{i=1}^ℓ |f(µ_{x̂_i}) − y_i|² + λ‖f‖²_H (λ > 0).
We consider two settings:
1. well-specified case: f_ρ ∈ H,
2. misspecified case: f_ρ ∈ L²_{ρ_X} \ H.
Step-1: mean embedding
k : D × D → R kernel; canonical feature map: ϕ(u) = k(·, u).
Mean embedding of a distribution x, x̂_i ∈ P(D):
µ_x = ∫_D k(·, u) dx(u) ∈ H(k),
µ_{x̂_i} = ∫_D k(·, u) dx̂_i(u) = (1/N) Σ_{n=1}^N k(·, x_{i,n}).
Linear K ⇒ set kernel:
K(µ_{x̂_i}, µ_{x̂_j}) = ⟨µ_{x̂_i}, µ_{x̂_j}⟩_H = (1/N²) Σ_{n,m=1}^N k(x_{i,n}, x_{j,m}).
Nonlinear K example:
K(µ_{x̂_i}, µ_{x̂_j}) = e^{−‖µ_{x̂_i} − µ_{x̂_j}‖²_H / (2σ²)}.
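A minimal Python sketch (not the authors' code) of how both K choices reduce to double sums over bag samples; the Gaussian base kernel and all bandwidth values are illustrative assumptions.

```python
import numpy as np

def set_kernel(Xa, Xb, theta=1.0):
    # Linear K: <mu_a, mu_b>_H = (1 / (N_a N_b)) sum_{n,m} k(x_{a,n}, x_{b,m}),
    # here with a Gaussian base kernel k of bandwidth theta.
    sq = ((Xa[:, None, :] - Xb[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    return np.exp(-sq / (2 * theta ** 2)).mean()

def nonlinear_K(Xa, Xb, theta=1.0, sigma=1.0):
    # Nonlinear K: exp(-||mu_a - mu_b||_H^2 / (2 sigma^2)); the squared RKHS
    # distance expands into three set-kernel evaluations.
    d2 = set_kernel(Xa, Xa, theta) + set_kernel(Xb, Xb, theta) - 2 * set_kernel(Xa, Xb, theta)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
Xa = rng.normal(0.0, 1.0, size=(50, 2))  # bag sampled from distribution a
Xb = rng.normal(0.5, 1.0, size=(60, 2))  # bag sampled from distribution b
print(set_kernel(Xa, Xb), nonlinear_K(Xa, Xb))
```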
Step-2: ridge regression (analytical solution)
Given:
training sample: ẑ,
test distribution: t.
Prediction on t:
(f_ẑ^λ ∘ µ)(t) = k(K + ℓλI_ℓ)^{−1}[y₁; ...; y_ℓ], (1)
K = [K(µ_{x̂_i}, µ_{x̂_j})] ∈ R^{ℓ×ℓ}, (2)
k = [K(µ_{x̂_1}, µ_t), ..., K(µ_{x̂_ℓ}, µ_t)] ∈ R^{1×ℓ}. (3)
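A self-contained Python sketch of prediction (1)-(3), assuming the Gaussian-on-embeddings K from Step-1; the toy task (labels = means of the bag-generating Gaussians) and all parameter values are illustrative, not from the talk.

```python
import numpy as np

def emb_kernel(Xa, Xb, theta=1.0, sigma=1.0):
    # Gaussian K on mean embeddings: exp(-||mu_a - mu_b||_H^2 / (2 sigma^2)),
    # with the squared RKHS distance expanded into set-kernel double sums.
    def sk(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * theta ** 2)).mean()
    d2 = sk(Xa, Xa) + sk(Xb, Xb) - 2 * sk(Xa, Xb)
    return np.exp(-d2 / (2 * sigma ** 2))

def merr_predict(bags, y, test_bag, lam=1e-3):
    # (1)-(3): f(mu_t) = k (K + l*lam*I_l)^{-1} [y_1; ...; y_l]
    l = len(bags)
    K = np.array([[emb_kernel(a, b) for b in bags] for a in bags])  # (2): l x l
    k = np.array([emb_kernel(a, test_bag) for a in bags])           # (3): 1 x l
    return k @ np.linalg.solve(K + l * lam * np.eye(l), np.asarray(y, dtype=float))

# Toy usage: the label of a bag is the mean of its generating Gaussian.
rng = np.random.default_rng(0)
means = rng.uniform(-1.0, 1.0, size=10)
bags = [rng.normal(m, 1.0, size=(40, 1)) for m in means]
print(merr_predict(bags, means, rng.normal(0.3, 1.0, size=(40, 1))))
```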
Blanket assumptions: both settings
D: separable, topological domain.
k:
bounded: sup_{u∈D} k(u, u) ≤ B_k ∈ (0, ∞),
continuous.
K: bounded; Hölder continuous: ∃L > 0, h ∈ (0, 1] such that
‖K(·, µ_a) − K(·, µ_b)‖_H ≤ L‖µ_a − µ_b‖^h_H.
y: bounded.
X = µ(P(D)) ∈ B(H).
Well-specified case: performance guarantee
Difficulty of the task:
f_ρ is 'c-smooth',
'b-decaying covariance operator'.
Contribution: If ℓ ≥ λ^{−1/b−1}, then with high probability
E(f_ẑ^λ, f_ρ) ≤ log^h(ℓ)/(N^h λ³) + λ^c + 1/(ℓ²λ) + 1/(ℓλ^{1/b}) =: g(ℓ, N, λ), (4)
where the first term reflects the bag size (N samples per x̂_i) and λ^c the c-smoothness of f_ρ.
Well-specified case: example
Assume
b is 'large' (1/b ≈ 0, 'small' effective input dimension),
h = 1 (K: Lipschitz),
term 1 = term 2 in (4) ⇒ λ; ℓ = N^a (a > 0),
t = ℓN: total number of samples processed.
Then
1. c = 2 ('smooth' f_ρ): E(f_ẑ^λ, f_ρ) ≈ t^{−2/7} – faster convergence,
2. c = 1 ('non-smooth' f_ρ): E(f_ẑ^λ, f_ρ) ≈ t^{−1/5} – slower.
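A sketch of the balancing arithmetic behind these rates, under the stated simplifications (h = 1, 1/b ≈ 0, log factors dropped); this reconstructs the exponents only, not the proof.

```latex
% Balance terms 1 and 2 of (4) with h = 1 and 1/b ~ 0 (log factors dropped):
\[
  \frac{1}{N\lambda^{3}} = \lambda^{c}
  \;\Rightarrow\;
  \lambda = N^{-\frac{1}{c+3}},
  \qquad
  \mathcal{E}\bigl(f_{\hat z}^{\lambda}, f_{\rho}\bigr)
  \approx \lambda^{c} = N^{-\frac{c}{c+3}}.
\]
% With \ell = N^a, i.e. t = \ell N = N^{a+1}:
% c = 2: \lambda = N^{-1/5}; a = 2/5 makes 1/\ell = N^{-2/5} match \lambda^c,
%        so the risk is N^{-2/5} = t^{-2/7}.
% c = 1: \lambda = N^{-1/4}; a = 1/4 gives risk N^{-1/4} = t^{-1/5}.
```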
Misspecified case: performance guarantee
Difficulty of the task:
f_ρ is 's-smooth' (s > 0).
Contribution: If L²_{ρ_X} is separable and 1/λ² ≤ ℓ, then with high probability
E(f_ẑ^λ, f_ρ) ≤ log^{h/2}(ℓ)/(N^{h/2} λ^{3/2}) + 1/√(ℓλ) + λ^{min(1,s)}/(√λ √ℓ) + λ^{min(1,s)} =: g(ℓ, N, λ), (5)
where the first term reflects the bag size (N samples per x̂_i) and λ^{min(1,s)} the s-smoothness of f_ρ.
Misspecified case: example
Assume
s ≥ 1, h = 1 (K: Lipschitz),
term 1 = term 3 in (5) ⇒ λ; ℓ = N^a (a > 0),
t = ℓN: total number of samples processed.
Then
1. s = 1 ('non-smooth' f_ρ): E(f_ẑ^λ, f_ρ) ≈ t^{−0.25} – slower,
2. s → ∞ ('smooth' f_ρ): E(f_ẑ^λ, f_ρ) ≈ t^{−0.5} – faster convergence.
Notes on the assumptions: ∃ρ, X ∈ B(H)
k bounded, continuous ⇒ µ : (P(D), B(τ_w)) → (H, B(H)) is measurable.
µ measurable, X ∈ B(H) ⇒ ρ on X × Y is well-defined.
If (*) := [D is compact metric, k is universal], then
µ is continuous, and
X ∈ B(H).
Notes on the assumptions: Hölder K examples
In case of (*):
K_G = e^{−‖µ_a−µ_b‖²_H/(2θ²)}: h = 1,
K_e = e^{−‖µ_a−µ_b‖_H/(2θ²)}: h = 1/2,
K_C = (1 + ‖µ_a − µ_b‖²_H/θ²)^{−1}: h = 1,
K_t = (1 + ‖µ_a − µ_b‖^θ_H)^{−1}: h = θ/2 (θ ≤ 2),
K_i = (‖µ_a − µ_b‖²_H + θ²)^{−1/2}: h = 1.
Functions of ‖µ_a − µ_b‖_H ⇒ computation: similar to the set kernel.
Notes on the assumptions: misspecified case
L2ρX : separable ⇔ measure space with d(A, B) = ρX (A △ B) is so
[Thomson et al., 2008].
Vector-valued output: Y = separable Hilbert space
Objective function:
f_ẑ^λ = arg min_{f∈H} (1/l) Σ_{i=1}^l ‖f(µ_{x̂_i}) − y_i‖²_Y + λ‖f‖²_H (λ > 0).
K(µ_a, µ_b) ∈ L(Y):
operator-valued kernel,
vector-valued RKHS.
Vector-valued output: analytical solution
Prediction on a new test distribution (t):
(f_ẑ^λ ∘ µ)(t) = k(K + lλI_l)^{−1}[y₁; ...; y_l], (6)
K = [K(µ_{x̂_i}, µ_{x̂_j})] ∈ L(Y)^{l×l}, (7)
k = [K(µ_{x̂_1}, µ_t), ..., K(µ_{x̂_l}, µ_t)] ∈ L(Y)^{1×l}. (8)
Specifically: Y = R ⇒ L(Y) = R; Y = R^d ⇒ L(Y) = R^{d×d}.
Vector-valued output: K assumptions
Boundedness and Hölder continuity of K:
1. Boundedness: ‖K_{µ_a}‖²_{HS} = Tr(K*_{µ_a} K_{µ_a}) ≤ B_K ∈ (0, ∞) (∀µ_a ∈ X).
2. Hölder continuity: ∃L > 0, h ∈ (0, 1] such that
‖K_{µ_a} − K_{µ_b}‖_{L(Y,H)} ≤ L‖µ_a − µ_b‖^h_H, ∀(µ_a, µ_b) ∈ X × X.
Demo
Supervised entropy learning:
[Figure: true entropy vs. MERR and DFDR estimates as a function of the rotation angle (β); RMSE boxplots: MERR = 0.75, DFDR = 2.02.]
Aerosol prediction from satellite images:
State-of-the-art baseline: 7.5–8.5 (±0.1–0.6).
MERR: 7.81 (±1.64).
Summary
Problem: distribution regression.
Literature: large number of heuristics.
Contribution:
a simple ridge solution is consistent,
specifically, the set kernel is so (15-year-old open question).
Simplified version [Y = R, fρ ∈ H]:
AISTATS-2015 (oral).
Summary – continued
Code in ITE, extended analysis (submitted to JMLR):
https://bitbucket.org/szzoli/ite/
http://arxiv.org/abs/1411.2066.
Closely related research directions (Bayesian world):
∞-dimensional exp. family fitting,
just-in-time kernel EP: accepted at UAI-2015.
Thank you for your attention!
Acknowledgments: This work was supported by the Gatsby Charitable
Foundation, and by NSF grants IIS1247658 and IIS1250350. The work
was carried out while Bharath K. Sriperumbudur was a research fellow in
the Statistical Laboratory, Department of Pure Mathematics and
Mathematical Statistics at the University of Cambridge, UK.
Appendix: contents
Topological definitions, separability.
Prior definitions (ρ).
Universal kernel: definition, examples.
Vector-valued RKHS.
Demos: further details.
Hausdorff metric.
Weak topology on P(D).
Topological space, open sets
Given: a set D ≠ ∅.
τ ⊆ 2^D is called a topology on D if:
1. ∅ ∈ τ, D ∈ τ.
2. Finite intersection: O₁ ∈ τ, O₂ ∈ τ ⇒ O₁ ∩ O₂ ∈ τ.
3. Arbitrary union: O_i ∈ τ (i ∈ I) ⇒ ∪_{i∈I} O_i ∈ τ.
Then (D, τ) is called a topological space; the O ∈ τ are the open sets.
Closed-, compact set, closure, dense subset, separability
Given: (D, τ). A ⊆ D is
closed if D\A ∈ τ (i.e., its complement is open),
compact if for any family (O_i)_{i∈I} of open sets with A ⊆ ∪_{i∈I} O_i, ∃i₁, ..., i_n ∈ I with A ⊆ ∪_{j=1}^n O_{i_j}.
Closure of A ⊆ D:
Ā := ∩_{A ⊆ C closed in D} C. (9)
A ⊆ D is dense if Ā = D.
(D, τ) is separable if ∃ a countable, dense subset of D.
Counterexample: ℓ^∞/L^∞.
Prior (well-specified case): ρ ∈ P(b, c)
Let the covariance operator T : H → H be
T = ∫_X K(·, µ_a) K*(·, µ_a) dρ_X(µ_a)
with eigenvalues t_n (n = 1, 2, ...).
Assumption: ρ ∈ P(b, c) = set of distributions on X × Y such that
α ≤ n^b t_n ≤ β (∀n ≥ 1; α > 0, β > 0),
∃g ∈ H such that f_ρ = T^{(c−1)/2} g with ‖g‖_H ≤ R (R > 0),
where b ∈ (1, ∞), c ∈ [1, 2].
Intuition: 1/b – effective input dimension, c – smoothness of f_ρ.
Prior: misspecified case
Let T̃ be defined via
S_K* : H ↪ L²_{ρ_X},
S_K : L²_{ρ_X} → H, (S_K g)(µ_u) = ∫_X K(µ_u, µ_t) g(µ_t) dρ_X(µ_t),
T̃ = S_K* S_K : L²_{ρ_X} → L²_{ρ_X}.
Our range space assumption on ρ: f_ρ ∈ Im(T̃^s) for some s ≥ 0.
Universal kernel: definition
Assume
D: compact, metric,
k : D × D → R is a continuous kernel.
Then
Def-1: k is universal if H(k) is dense in (C(D), ‖·‖_∞).
Def-2: k is
characteristic if µ : P(D) → H(k) is injective,
universal if µ is injective on the finite signed measures of D.
Universal kernel: examples
On compact subsets of R^d:
k(a, b) = e^{−‖a−b‖₂²/(2σ²)} (σ > 0),
k(a, b) = e^{−σ‖a−b‖₁} (σ > 0),
k(a, b) = e^{β⟨a,b⟩} (β > 0), or more generally
k(a, b) = f(⟨a, b⟩), f(x) = Σ_{n=0}^∞ a_n x^n (∀a_n > 0).
Vector-valued RKHS: H = H(K)
Definition: a Hilbert space H ⊆ Y^X of functions is an RKHS if
A_{µ_x, y} : f ∈ H ↦ ⟨y, f(µ_x)⟩_Y ∈ R (10)
is continuous for ∀µ_x ∈ X, y ∈ Y,
i.e., the evaluation functional is continuous in every direction.
Vector-valued RKHS: H = H(K) – continued
Riesz representation theorem ⇒ ∃K(µ_x|y) ∈ H:
⟨y, f(µ_x)⟩_Y = ⟨K(µ_x|y), f⟩_H (∀f ∈ H). (11)
K(µ_x|y): linear, bounded in y ⇒ K(µ_x|y) = K_{µ_x}(y) with K_{µ_x} ∈ L(Y, H).
K construction (∀µ_x, µ_t ∈ X):
K(µ_x, µ_t)(y) = (K_{µ_t} y)(µ_x), (12)
K(·, µ_t)(y) = K_{µ_t} y, (13)
i.e., H(K) = span{K_{µ_t} y : µ_t ∈ X, y ∈ Y} (closure of the span).
Shortly: K(µ_x, µ_t) ∈ L(Y) generalizes k(u, v) ∈ R.
Vector-valued RKHS – examples: Y = R^d
1. K_i : X × X → R kernels (i = 1, ..., d). Diagonal kernel (see the sketch below):
K(µ_a, µ_b) = diag(K₁(µ_a, µ_b), ..., K_d(µ_a, µ_b)). (14)
2. Combination of m diagonal kernels [D_j(µ_a, µ_b) ∈ R^{r×r}, A_j ∈ R^{r×d}]:
K(µ_a, µ_b) = Σ_{j=1}^m A_j* D_j(µ_a, µ_b) A_j. (15)
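A small Python sketch of the diagonal construction (14) for Y = R^d; the scalar kernels and the finite-dimensional stand-ins for embeddings are illustrative assumptions.

```python
import numpy as np

def diagonal_K(mu_a, mu_b, scalar_kernels):
    # (14): K(mu_a, mu_b) = diag(K_1(mu_a, mu_b), ..., K_d(mu_a, mu_b)),
    # a d x d matrix acting on Y = R^d.
    return np.diag([Ki(mu_a, mu_b) for Ki in scalar_kernels])

# Stand-in scalar kernels (Gaussians with different bandwidths):
Ks = [lambda a, b, s=s: float(np.exp(-np.sum((a - b) ** 2) / (2 * s ** 2)))
      for s in (0.5, 1.0, 2.0)]
print(diagonal_K(np.zeros(3), np.ones(3), Ks))  # 3 x 3 diagonal operator
```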
Demo-1: supervised entropy learning
Problem: learn the entropy of the 1st coordinate of (rotated) Gaussians.
Baseline: kernel smoothing based distribution regression (applying density estimation) =: DFDR.
Performance: RMSE boxplot over 25 random experiments.
Experience: MERR is more precise than the only theoretically justified alternative, by avoiding density estimation.
Demo-2: aerosol prediction – selected kernels
Kernel definitions (p = 2, 3):
k_G(a, b) = e^{−‖a−b‖₂²/(2θ²)}, k_e(a, b) = e^{−‖a−b‖₂/(2θ²)}, (16)
k_C(a, b) = 1/(1 + ‖a−b‖₂²/θ²), k_t(a, b) = 1/(1 + ‖a−b‖₂^θ), (17)
k_p(a, b) = (⟨a, b⟩ + θ)^p, k_r(a, b) = 1 − ‖a−b‖₂²/(‖a−b‖₂² + θ), (18)
k_i(a, b) = 1/√(‖a−b‖₂² + θ²), (19)
k_{M,3/2}(a, b) = (1 + √3‖a−b‖₂/θ) e^{−√3‖a−b‖₂/θ}, (20)
k_{M,5/2}(a, b) = (1 + √5‖a−b‖₂/θ + 5‖a−b‖₂²/(3θ²)) e^{−√5‖a−b‖₂/θ}. (21)
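For concreteness, a Python sketch of the two Matérn kernels (20)-(21); the default bandwidth θ is an illustrative choice.

```python
import numpy as np

def k_matern_3_2(a, b, theta=1.0):
    # (20): (1 + sqrt(3) r / theta) * exp(-sqrt(3) r / theta), r = ||a - b||_2
    s = np.sqrt(3.0) * np.linalg.norm(a - b) / theta
    return (1.0 + s) * np.exp(-s)

def k_matern_5_2(a, b, theta=1.0):
    # (21): (1 + sqrt(5) r / theta + 5 r^2 / (3 theta^2)) * exp(-sqrt(5) r / theta);
    # note 5 r^2 / (3 theta^2) = s^2 / 3 with s = sqrt(5) r / theta.
    s = np.sqrt(5.0) * np.linalg.norm(a - b) / theta
    return (1.0 + s + s ** 2 / 3.0) * np.exp(-s)
```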
Existing methods: set metric based algorithms
Hausdorff metric [Edgar, 1995]:
d_H(X, Y) = max{ sup_{x∈X} inf_{y∈Y} d(x, y), sup_{y∈Y} inf_{x∈X} d(x, y) }. (22)
Metric on compact sets of metric spaces [(M, d); X, Y ⊆ M].
'Slight' problem: highly sensitive to outliers.
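A direct Python sketch of (22) for finite point sets with Euclidean d; a real application would use spatial indexing rather than the full distance matrix.

```python
import numpy as np

def hausdorff(X, Y):
    # (22): max of the two directed distances,
    # max_x min_y ||x - y||_2 and max_y min_x ||x - y||_2.
    D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    return max(D.min(axis=1).max(), D.min(axis=0).max())

rng = np.random.default_rng(0)
X, Y = rng.normal(0, 1, (30, 2)), rng.normal(0, 1, (40, 2))
print(hausdorff(X, Y))
# A single far-away point appended to X would dominate the value,
# illustrating the outlier sensitivity noted above.
```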
Weak topology on P(D)
Def.: the weakest topology on P(D) such that the mapping
L_h : (P(D), τ_w) → R, L_h(x) = ∫_D h(u) dx(u)
is continuous for all h ∈ C_b(D), where
C_b(D) = {(D, τ) → R bounded, continuous functions}.
Chen, Y. and Wu, O. (2012).
Contextual Hausdorff dissimilarity for multi-instance clustering.
In International Conference on Fuzzy Systems and Knowledge
Discovery (FSKD), pages 870–873.
Cuturi, M., Fukumizu, K., and Vert, J.-P. (2005).
Semigroup kernels on measures.
Journal of Machine Learning Research, 6:1169–1198.
Edgar, G. (1995).
Measure, Topology and Fractal Geometry.
Springer-Verlag.
Gärtner, T., Flach, P. A., Kowalczyk, A., and Smola, A.
(2002).
Multi-instance kernels.
In International Conference on Machine Learning (ICML),
pages 179–186.
Haussler, D. (1999).
Convolution kernels on discrete structures.
Technical report, Department of Computer Science, University
of California at Santa Cruz.
(http://cbse.soe.ucsc.edu/sites/default/files/
convolutions.pdf).
Hein, M. and Bousquet, O. (2005).
Hilbertian metrics and positive definite kernels on probability
measures.
In International Conference on Artificial Intelligence and
Statistics (AISTATS), pages 136–143.
Jebara, T., Kondor, R., and Howard, A. (2004).
Probability product kernels.
Journal of Machine Learning Research, 5:819–844.
Martins, A. F. T., Smith, N. A., Xing, E. P., Aguiar, P. M. Q.,
and Figueiredo, M. A. T. (2009).
Nonextensive information theoretical kernels on measures.
Journal of Machine Learning Research, 10:935–975.
Nielsen, F. and Nock, R. (2012).
A closed-form expression for the Sharma-Mittal entropy of
exponential families.
Journal of Physics A: Mathematical and Theoretical,
45:032003.
Oliva, J., Póczos, B., and Schneider, J. (2013).
Distribution to distribution regression.
International Conference on Machine Learning (ICML; JMLR
W&CP), 28:1049–1057.
Oliva, J. B., Neiswanger, W., Póczos, B., Schneider, J., and
Xing, E. (2014).
Fast distribution to real regression.
International Conference on Artificial Intelligence and
Statistics (AISTATS; JMLR W&CP), 33:706–714.
Póczos, B., Rinaldo, A., Singh, A., and Wasserman, L. (2013).
Distribution-free distribution regression.
International Conference on Artificial Intelligence and
Statistics (AISTATS; JMLR W&CP), 31:507–515.
Póczos, B., Xiong, L., and Schneider, J. (2011).
Nonparametric divergence estimation with applications to
machine learning on distributions.
In Uncertainty in Artificial Intelligence (UAI), pages 599–608.
Reddi, S. J. and Póczos, B. (2014).
k-NN regression on functional data with incomplete
observations.
In Conference on Uncertainty in Artificial Intelligence (UAI).
Thomson, B. S., Bruckner, J. B., and Bruckner, A. M. (2008).
Real Analysis.
Prentice-Hall.
Wang, F., Syeda-Mahmood, T., Vemuri, B. C., Beymer, D.,
and Rangarajan, A. (2009).
Closed-form Jensen-Rényi divergence for mixture of Gaussians
and applications to group-wise shape registration.
Medical Image Computing and Computer-Assisted
Intervention, 12:648–655.
Wang, J. and Zucker, J.-D. (2000).
Solving the multiple-instance problem: A lazy learning
approach.
In International Conference on Machine Learning (ICML),
pages 1119–1126.
Wang, Z., Lan, L., and Vucetic, S. (2012).
Mixture model for multiple instance regression and
applications in remote sensing.
IEEE Transactions on Geoscience and Remote Sensing,
50:2226–2237.
Wu, O., Gao, J., Hu, W., Li, B., and Zhu, M. (2010).
Identifying multi-instance outliers.
In SIAM International Conference on Data Mining (SDM),
pages 430–441.
Zhang, M.-L. and Zhou, Z.-H. (2009).
Multi-instance clustering with applications to multi-instance
prediction.
Applied Intelligence, 31:47–68.
Zhou, S. K. and Chellappa, R. (2006).
From sample similarity to ensemble similarity: Probabilistic
distance measures in reproducing kernel Hilbert space.
IEEE Transactions on Pattern Analysis and Machine
Intelligence, 28:917–929.