ICML 2014
Fast Computation of Wasserstein Barycenters
M. Cuturi (Kyoto University), A. Doucet (University of Oxford)
22.06.14
Problem: Average N Probability Measures

[Figure: three empirical measures ν1 = Σi ai δxi, ν2 = Σi bi δyi, ν3 = Σi ci δzi on a metric space (Ω, D)]

• (Ω, D) a metric space
• {ν1, · · · , νN } a family of empirical probability measures.
Problem: Average N Probability Measures

[Figure: the same three empirical measures on (Ω, D)]

Can we summarize the {νi} as an “average” or a “barycentric” single empirical probability measure?

Interest in ML: empirical measure = dataset, histogram / bag-of-features, single observation with uncertainty.
Euclidean Means for Vectors

• For vectors {x1, · · · , xN } in a Hilbert space, their average is

    x̄ = (1/N) Σ_{i=1}^N xi

• Behind this formula lies a variational problem:

    x̄ = argmin_{u ∈ R^d} (1/N) Σ_{i=1}^N ||u − xi||_2^2
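As a quick numerical check (not part of the original slides; the data and array sizes are arbitrary), the closed-form average can be verified to minimize the variational objective above:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))      # N = 10 vectors in R^3 (arbitrary data)

# closed-form average
xbar = X.mean(axis=0)

# the variational objective u -> (1/N) sum_i ||u - x_i||_2^2
def objective(u):
    return np.mean(np.sum((X - u) ** 2, axis=1))

# random perturbations of xbar never improve the objective
for _ in range(100):
    u = xbar + 0.1 * rng.normal(size=3)
    assert objective(u) >= objective(xbar)
```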
Euclidean Means for Measures

• For probability measures {νi}i=1..N , we can also use

    µ = (1/N) Σ_{i=1}^N νi,

• as well as, using a smoothing kernel k = e^{−D²/σ},

    µ = (1/N) Σ_{i=1}^N (k ∗ νi)

  (a.k.a. the RKHS mean map [Gretton’07])
Other Means for Probabilities

• Other means can be defined using other metrics or divergences:

    argmin_{µ ∈ P(Ω)} Σ_{i=1}^N ∆(µ, νi).

◦ KL, symmetrized KL [Nielsen’12]
◦ Bregman divergence [Banerjee’05]
◦ Wasserstein distance (a.k.a. EMD) [Agueh’11]
Wasserstein Barycenter Problem

• [Agueh’11] defined

    argmin_{µ ∈ P(Ω)} Σ_{i=1}^N W_p^p(µ, νi),

  and provided a theoretical analysis, including uniqueness of the solution.
• Simple cases (N = 2, multivariate Gaussians) are covered there.
• A very challenging computational problem.
Our Contribution

• The first computational approach to solve variational Wasserstein problems efficiently,
• including the Wasserstein barycenter problem,

    argmin_{µ ∈ P(Ω)} Σ_{i=1}^N W_p^p(µ, νi),

• applicable for arbitrary (Ω, D) and p > 0, using entropy-smoothed optimal transport [Cuturi’13].

([Rabin’12, Bonneel’14] studied the case Ω = R²)
Motivating Examples
2 Points on the Real Line

[Figure: two points x = −5 and y = 3 on the real line]
Their Average

[Figure: their average z = (x + y)/2 = −1 on the real line]
2 Points as Diracs

[Figure: the two Dirac masses δx and δy]
Euclidean Mean of Diracs

[Figure: δx, δy, and their Euclidean mean (δx + δy)/2, which puts mass 1/2 at each point]
Wasserstein Mean of Diracs

[Figure: the Wasserstein mean δ_{(x+y)/2}, a single Dirac at the midpoint, compared with the Euclidean mean (δx + δy)/2]
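The midpoint property can be checked numerically. In one dimension, the 2-Wasserstein barycenter is obtained by averaging quantile functions (a standard fact, not spelled out on the slide). A minimal sketch with the slide's values x = −5, y = 3:

```python
import numpy as np

x, y = -5.0, 3.0                      # the two points from the slides

# Every quantile of delta_x is x itself, so averaging the quantile
# functions of delta_x and delta_y gives the constant (x + y) / 2:
# a single Dirac at the midpoint.
q = np.linspace(0.01, 0.99, 99)       # quantile levels
quantiles_bar = 0.5 * (np.full_like(q, x) + np.full_like(q, y))

assert np.allclose(quantiles_bar, (x + y) / 2)   # Dirac at z = -1

# The Euclidean mean (delta_x + delta_y) / 2 instead keeps both
# spikes, with mass 1/2 at x = -5 and mass 1/2 at y = 3.
```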
Smoothed Measures (RKHS mean map)

[Figure: the smoothed measures k(x, ·) and k(y, ·)]
Euclidean Mean of 2 Gaussians

[Figure: k(x, ·), k(y, ·), and their Euclidean mean (kx + ky)/2]
Wasserstein Mean of 2 Gaussians

[Figure: the Wasserstein mean W of k(x, ·) and k(y, ·), a single bump at the midpoint, compared with the Euclidean mean (kx + ky)/2]
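The same quantile-averaging view explains the Gaussian picture: averaging the quantile functions of two equal-variance Gaussians gives a single Gaussian centred at the midpoint. A small sketch, where the values of x, y and σ are illustrative, not taken from the slides:

```python
import numpy as np
from scipy.stats import norm

x, y, sigma = -5.0, 3.0, 1.0          # illustrative means and std-dev

# Average the quantile functions of N(x, sigma^2) and N(y, sigma^2):
q = np.linspace(0.01, 0.99, 99)
quant_bar = 0.5 * (norm.ppf(q, loc=x, scale=sigma)
                   + norm.ppf(q, loc=y, scale=sigma))

# The result is exactly the quantile function of N((x+y)/2, sigma^2):
# a single Gaussian at the midpoint, not the bimodal Euclidean mean.
assert np.allclose(quant_bar, norm.ppf(q, loc=(x + y) / 2, scale=sigma))
```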
6 Gaussians

[Figure: six Gaussian bumps on the real line]
Euclidean Mean

[Figure: the Euclidean mean (1/N) Σi kxi of the six Gaussians]
Wasserstein Mean

[Figure: the Wasserstein mean W of the six Gaussians]
Motivation in 2D
{ν1, · · · , ν30} ⊂ P([0, 1]²).
Euclidean / Centered / Jeffrey / RKHS

[Figure: means computed with the Euclidean distance / recentered Euclidean / symmetrized Kullback–Leibler / RKHS mean map]
Wasserstein Barycenter

[Figure: the 2-Wasserstein barycenter, computed with our method]
Variational Perspective on
the Wasserstein Distance
Wasserstein for Empirical Measures

[Figure: µ = Σ_{i=1}^n ai δxi and ν = Σ_{j=1}^m bj δyj on (Ω, D)]

• (Ω, D) a metric space, p ≥ 1.
• Two empirical measures µ, ν.

What is the p-Wasserstein distance Wp(µ, ν)?
Computing p-Wasserstein Distances

    µ = Σ_{i=1}^n ai δxi ,    ν = Σ_{j=1}^m bj δyj

Wp(µ, ν) is the solution of a linear program involving:

1. the cost matrix M_XY := [D(xi, yj)^p]_ij ∈ R^{n×m},
2. the transport polytope U(a, b) := {T ∈ R_+^{n×m} | T 1m = a, T^T 1n = b}.
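To make the linear program concrete, here is a minimal sketch that assembles M_XY and the marginal constraints of U(a, b), then solves the LP with an off-the-shelf solver; the points and weights below are illustrative, not from the slides:

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative 1-D instance
x = np.array([0.0, 1.0, 3.0])          # support of mu (n = 3)
y = np.array([0.5, 2.0])               # support of nu (m = 2)
a = np.array([0.5, 0.3, 0.2])          # weights of mu
b = np.array([0.6, 0.4])               # weights of nu

p = 2
M = np.abs(x[:, None] - y[None, :]) ** p   # M_XY = [D(x_i, y_j)^p]_ij
n, m = M.shape

# Equality constraints encoding T 1_m = a and T^T 1_n = b for the
# flattened variable t = vec(T) with t >= 0.
A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0       # row sums equal a
for j in range(m):
    A_eq[n + j, j::m] = 1.0                # column sums equal b

res = linprog(M.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
              bounds=(0, None))
W_p_p = res.fun                            # optimal value = W_p^p(mu, nu)
```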
Computing the OT Distance

• W_p^p is the optimal value of either the primal or the dual LP:

    W_p^p(µ, ν) = primal(a, b, M_XY) := min_{T ∈ U(a,b)} ⟨T, M_XY⟩

or

    W_p^p(µ, ν) = dual(a, b, M_XY) := max_{(α,β) ∈ C_{M_XY}} α^T a + β^T b,

where C_M = {(α, β) ∈ R^{n+m} | αi + βj ≤ Mij}.

How does f(a, X) := W_p^p(µ, ν) change as a and X change?
Wasserstein (Sub)differentiability

    f(a, X) = max_{(α,β) ∈ C_{M_XY}} α^T a + β^T b

• ∂f|_a = α⋆: the dual optimum α⋆ is a subgradient.

    f(a, X) = min_{T ∈ U(a,b)} ⟨T, M_XY⟩

• ∂f|_X = Y T⋆^T diag(a^{−1}): the primal optimum T⋆ yields a subgradient (when D = Euclidean, p = 2).
Average of Wasserstein Distances

    g(a, X) := (1/N) Σ_{i=1}^N W_p^p(µ, νi)
             = (1/N) Σ_{i=1}^N primal(a, bi, M_XYi)

• a ↦ g(a, X) is convex, non-smooth
• X ↦ g(a, X) is non-convex, non-smooth
Wasserstein Barycenter Problem

    min_a g(a, X) = (1/N) Σ_{i=1}^N primal(a, bi, M_XYi)

• a ↦ g(a, X) is convex
◦ the subgradient method works (in theory).
◦ Great if X is fixed (bag-of-words or a discretized Ω)!
◦ Need to solve for {αi⋆} at each subgradient step.
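For intuition, here is a small sketch of the fixed-support case: at each step, the dual optimum αi⋆ of each primal(a, bi, M) is obtained (here by solving the dual LP directly) and averaged into a subgradient in a. A multiplicative update then keeps a on the simplex, a mirror-descent variant of the plain subgradient step; the grid, the two Dirac measures and the iteration count are illustrative, not from the slides:

```python
import numpy as np
from scipy.optimize import linprog

def dual_alpha(a, b, M):
    """Solve the OT dual LP  max alpha^T a + beta^T b  over
    C_M = {alpha_i + beta_j <= M_ij}; the optimal alpha is a
    subgradient of primal(., b, M) at the weight vector a."""
    n, m = M.shape
    A_ub = np.zeros((n * m, n + m))
    for i in range(n):
        for j in range(m):
            A_ub[i * m + j, i] = 1.0        # coefficient of alpha_i
            A_ub[i * m + j, n + j] = 1.0    # coefficient of beta_j
    c = -np.concatenate([a, b])             # maximize = minimize the negative
    res = linprog(c, A_ub=A_ub, b_ub=M.ravel(), bounds=(None, None))
    return res.x[:n]

# Fixed support: 5 grid points on [0, 1]; two Dirac "datasets".
grid = np.linspace(0.0, 1.0, 5)
M = (grid[:, None] - grid[None, :]) ** 2    # squared Euclidean cost
nus = [np.array([1.0, 0, 0, 0, 0]), np.array([0, 0, 0, 0, 1.0])]

a = np.full(5, 0.2)                         # uniform initial weights
for _ in range(40):
    g = np.mean([dual_alpha(a, b, M) for b in nus], axis=0)
    a = a * np.exp(-g)                      # multiplicative subgradient step
    a /= a.sum()                            # renormalize onto the simplex
# the mass concentrates on the midpoint 0.5 of the two Diracs
```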
Wasserstein Barycenter Problem

    min_X g(a, X) = (1/N) Σ_{i=1}^N primal(a, bi, M_XYi)

• X ↦ g(a, X) is non-convex
◦ (and so far only applicable when Ω is R^d.)
◦ the subgradient method only reaches a local minimum
◦ Need to compute {Ti⋆} at each subgradient step
To recapitulate...

    min_{a,X} g(a, X)

• convex w.r.t. the weights a, but not the locations X.
• only subgradients are available (g is usually very degenerate).
• computationally intractable (cost of one OT solve ≈ n³ log n)
• computationally inefficient (hard to parallelize)
Solution: Entropic Smoothing

The original primal problem gives us T⋆:

    primal(a, b, M_XY) = min_{T ∈ U(a,b)} ⟨T, M_XY⟩

The original dual problem gives us α⋆:

    dual(a, b, M_XY) = max_{(α,β): αi+βj ≤ Mij} α^T a + β^T b
Solution: Entropic Smoothing

The smoothed (λ > 0) primal problem gives us Tλ⋆:

    primalλ(a, b, M_XY) = min_{T ∈ U(a,b)} ⟨T, M_XY⟩ − (1/λ) h(T)

The smoothed dual problem gives us αλ⋆:

    dualλ(a, b, M_XY) = max_{(α,β)} α^T a + β^T b − Σ_{i≤n, j≤m} e^{−λ(mij − αi − βj)} / λ
Benefits of Smoothing [Cuturi’13]

• The objective is now strongly convex instead of piecewise linear: far more efficient in practice [Nesterov’05].
• The primal/dual smoothed optima αλ⋆, Tλ⋆ can be computed
◦ in O(n²) per iteration with Sinkhorn’s algorithm,
◦ in parallel on GPGPUs, for any metric on a finite Ω,
◦ millions of times faster than the simplex method,
◦ and for large dimensions (≈ 20,000).
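The bullet points above can be sketched compactly: Sinkhorn's algorithm is a matrix-scaling loop whose per-iteration cost is two matrix–vector products. The toy data below is illustrative, and production code should iterate in the log domain for numerical stability:

```python
import numpy as np

def sinkhorn(a, b, M, lam, n_iter=1000):
    """Matrix scaling on K = exp(-lam * M) [Cuturi'13]. Each iteration
    is two matrix-vector products, i.e. O(n^2) work. Returns the
    smoothed optimal plan T_lambda^*."""
    K = np.exp(-lam * M)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)       # rescale to match the column marginals
        u = a / (K @ v)         # rescale to match the row marginals
    return u[:, None] * K * v[None, :]

# Toy instance (illustrative, not from the slides)
M = np.array([[0.0, 1.0],
              [1.0, 0.0]])
a = np.array([0.5, 0.5])
b = np.array([0.5, 0.5])
T = sinkhorn(a, b, M, lam=50.0)
cost = np.sum(T * M)            # tends to W_p^p(mu, nu) as lam grows
```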
To conclude...

• Our approach also generalizes k-means
◦ can handle weight constraints (see paper),
◦ can quantize several different datasets simultaneously
• A versatile and scalable approach for other variational Wasserstein problems (e.g. Wasserstein propagation [Solomon’14])
• Future applications: visualization of measures on Riemannian manifolds, data fusion, inference...