EARTH MOVER DISTANCE ON SUPERPIXELS
Sylvain Boltz (1,2), Frank Nielsen (1), Stefano Soatto (2)
(1) École Polytechnique, France; (2) UCLA Vision Lab
{boltz,nielsen}@lix.polytechnique.fr , [email protected]
ABSTRACT
The Earth Mover Distance (EMD) is a popular distance between Probability
Density Functions (PDFs). It has been successfully applied to a wide range
of image processing problems. This success has two reasons: a physical one,
since it computes the physical cost of transporting an element of mass
between two images or two histograms, and a statistical one, since it is a
cross-bin metric (as opposed to a bin-wise metric). In computer vision, these
features are useful since small variations of illumination can shift the
histogram. However, histograms are not a sufficient statistic to discriminate
images since they ignore all geometric correlations. In addition, the
transport (also called flow) between histograms loses the geometric flow
that warps one image onto another. This paper proposes a new construction
of the EMD between images. This construction approximates the EMD between
two images by computing a pixel-wise transport at the complexity cost of
computing an EMD between 1-D histograms, and it preserves the geometrical
and topological structure of the image. It simply relies on a segmentation
of the image (also called a superpixelization of the image). Matching
results on images show the stability of the method even when the
superpixelizations are highly inconsistent across images.
Index Terms— Earth Mover Distance, Wasserstein metric, Matching, Superpixel, Sparsity, Segmentation,
1. INTRODUCTION
Image matching is at the heart of many image processing and computer vision
problems. It is difficult to build an efficient matching cost between two
images that is robust to changes of illumination and viewpoint. Several
matching strategies exist in the literature. Some methods extract
key-points, like the popular SIFT, in the two images and match them. Such
strategies are highly sensitive to noisy key-point detections and to
outliers of the matching model. Other methods try to build an appearance
template of the image, but these methods are either too strict, for
instance the ℓ2 distance, or not descriptive enough, for instance image
histogram comparisons. In this paper, we explore an approach in between: it
builds on the segmentation of the images at multiple scales, also called a
superpixelization tree. The superpixels act as local descriptors, and each
one carries part of the physical mass of the image, encoded in its size.
The difficulty is then to match those segmentations across images of a
video sequence.
This problem looks simple since the image is now reduced to a subset of
elements, but it is a hard combinatorial problem to solve. Moreover, since
segmentations are highly inconsistent from one image to another, there is
ambiguity: superpixels can merge or split from one image to the other. The
matching between both segmentations is thus not one-to-one but a continuous
flow, so using the EMD to compute this flow seems natural. The Earth Mover
Distance has already been used on image histograms [1, 2] or directly on
the image pixels [3]. The first method is not discriminative enough since
the geometric information of the image is lost. The second one is too
complex since it solves a Partial Differential Equation (PDE) with an
unconstrained flow on every pixel. This paper proposes an in-between
approach. It approximates the pixel-wise flow between two images at the
cost of comparing histograms. This is done by building a weighted subset of
pixels, obtained from a segmentation tree. This subset of pixels looks like
a histogram, with geometrical and topological information contained in the
affinity distance matrix. Since 256 is a usual number of bins in a 1-D
histogram as well as a typical number of superpixels in an image, the
complexities are similar.
The paper is organized as follows. In Section 2, we
present the earth mover distance. Then, we show how the
earth mover distance can be defined on segmentation trees
and how to introduce topological consistency in Section 3.
In the experimental Section 4, we show some experiments
on consecutive images of video sequences. Finally we give
some conclusions and perspectives in Section 5.
2. EARTH MOVER DISTANCE METRIC
The Earth Mover Distance is the discrete formulation of the famous optimal
transport problem, also known as the Wasserstein metric or the
Monge-Kantorovich problem. It is a distance between probability density
functions or, on discrete data, histograms.
Two histograms P and Q are given, as well as a distance affinity matrix
D(i, j). This matrix gives the cost of transporting one element of mass
(i.e. one pixel) from the i-th bin of P to the j-th bin of Q. The EMD
computes a flow matrix F where F(i, j) is the amount of mass in the i-th
bin of histogram P transported to the j-th bin of histogram Q. The goal of
optimal transport is then to find the F that minimizes the total cost of
the transports, each weighted by D(i, j), needed to warp histogram P into
histogram Q.
\[ \mathrm{EMD}(P, Q) = \min_{F} \sum_{i,j} F(i,j)\, D(i,j) \tag{1} \]
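The minimization in Eq. (1) runs over feasible flows only, a point the text leaves implicit. As a hedged reminder, assuming the balanced case where P and Q have the same total mass (an assumption made here; the solver of [8] may treat unequal masses differently), the standard transport constraints would read:

```latex
F(i,j) \ge 0, \qquad \sum_{j} F(i,j) = P(i), \qquad \sum_{i} F(i,j) = Q(j)
```

In words, each bin of P ships exactly its mass and each bin of Q receives exactly its mass.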
The EMD gives two interesting outputs. The first one is the distance value,
which gives a matching score between histograms. It has a physical meaning,
namely the amount of mass displaced weighted by the displacement cost. The
second one is the flow F itself. In the statistics community, compared to
other well-known scores between histograms such as the Kullback-Leibler
divergence or the Hellinger distance, the EMD is one of the few cross-bin
distances. This means that it does not assume that the bin values are
correctly aligned, as bin-wise comparisons do. This is a particularly
desirable feature in computer vision since changes of illumination or
viewpoint can shift the values of the histogram. However, as opposed to
other distances between histograms, the complexity of the EMD is higher
since it has to solve a combinatorial problem of matching N bins to N other
bins. In addition, it is designed to work efficiently on histograms, which
are not a very discriminative feature of the image. Some works have tried
to solve the optimal transport directly on the image pixels, but this
results in a complex PDE and brings new problems since there are no
regularity constraints on the flow F. The contribution of this paper is a
way of computing the transport on image pixels with the complexity cost of
matching 1-D histograms and without losing the geometry and the
topological structure.
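To make the cross-bin property concrete, the following sketch compares a bin-wise L1 score with the EMD on 1-D histograms. It assumes equal total mass and the ground distance D(i, j) = |i - j|, in which case the EMD reduces to the sum of absolute differences between the cumulative histograms (a standard closed form for the 1-D case, not specific to this paper). The function names are illustrative.

```python
from itertools import accumulate


def emd_1d(p, q):
    """EMD between two 1-D histograms of equal total mass, with |i - j| as ground distance."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    if abs(sum(p) - sum(q)) > 1e-9:
        raise ValueError("this closed form assumes equal total mass")
    # In 1-D, the optimal transport cost is the L1 distance between cumulative sums.
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


def l1_binwise(p, q):
    """Bin-wise comparison: only looks at bins with the same index."""
    return sum(abs(a - b) for a, b in zip(p, q))


p = [0, 10, 0, 0, 0]
q_near = [0, 0, 10, 0, 0]  # same peak shifted by one bin
q_far = [0, 0, 0, 0, 10]   # same peak shifted by three bins

print(l1_binwise(p, q_near), l1_binwise(p, q_far))  # the bin-wise score cannot tell them apart
print(emd_1d(p, q_near), emd_1d(p, q_far))          # the EMD grows with the size of the shift
```

The bin-wise score is identical for the small and the large shift, while the EMD is three times larger for the three-bin shift.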
3. SUPERPIXEL-BASED HISTOGRAMS
3.1. Definition
Based on the idea of coresets [4], we look for a subset of pixels with
different weightings that represents the image while keeping two transports
comparable: the EMD on the weighted subset of pixels and the EMD on the
original problem, i.e. the transport of all pixels individually. In our
approach, pixels are grouped together into small regions of different sizes
called superpixels. Several algorithms exist to build a superpixelization
of the image. Among them are the Quickshift algorithm [5], a variant of the
famous mean-shift algorithm, and Statistical Region Merging (SRM), a region
merging technique [6]. The goal is now to compute the optimal transport of
these superpixels from one image to another. We formulate this problem as a
histogram matching problem, without losing the geometric structure. One can
define a histogram of an image with as many bins as there are superpixels,
and then define the mass inside each bin of this histogram as the
superpixel size.
\[ P(i) = |S_1(i)| \tag{2} \]
where S_1(i) is the i-th superpixel of image 1 and |S_1(i)| is its size in
pixels. The cost D(i, j) of moving one element of mass (i.e. one pixel)
from the i-th bin of one histogram to the j-th bin of the other histogram
is the average cost of moving a pixel from one superpixel to another, i.e.
the cost of moving the mean pixel of the first superpixel to the mean pixel
of the other superpixel. Since the EMD transports pixels in the geometric
and radiometric space, the cost of moving one pixel is computed in a 5-D
space: 3-D for the colors and 2-D for the geometric position.
\[ D(i,j) = \| \bar{S}_1(i) - \bar{S}_2(j) \| \tag{3} \]
where \bar{S}_1(i) is the 5-D mean inside superpixel S_1(i). By defining
such a cost, we directly follow the coreset idea of reducing the number of
points while still approximating the optimal transport of the original
problem (transport between individual pixels). In addition, we gain an
implicit regularization since the transport flow of all the pixels inside
one superpixel is constrained to be equal.
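A minimal sketch of how the masses of Eq. (2) and the costs of Eq. (3) could be computed, assuming the superpixelization is given as an integer label map (for instance from Quickshift or SRM; the segmentation itself is not shown). The function names are hypothetical, and the unweighted concatenation of color and position is an assumption, since the text does not say how the two are scaled relative to each other.

```python
import math
from collections import defaultdict


def superpixel_stats(image, labels):
    """Return (masses, means): pixel count and 5-D mean (r, g, b, x, y) per superpixel.

    image:  list of rows, each a list of (r, g, b) tuples
    labels: list of rows, each a list of integer superpixel ids
    Superpixels are ordered by increasing label id.
    """
    sums = defaultdict(lambda: [0.0] * 5)
    sizes = defaultdict(int)
    for y, (img_row, lab_row) in enumerate(zip(image, labels)):
        for x, (rgb, lab) in enumerate(zip(img_row, lab_row)):
            feature = (*rgb, x, y)  # assumption: color and position are not rescaled
            acc = sums[lab]
            for k in range(5):
                acc[k] += feature[k]
            sizes[lab] += 1
    ids = sorted(sizes)
    masses = [sizes[lab] for lab in ids]                             # Eq. (2)
    means = [[v / sizes[lab] for v in sums[lab]] for lab in ids]
    return masses, means


def cost_matrix(means1, means2):
    """Euclidean distance between 5-D superpixel means, as in Eq. (3)."""
    return [[math.dist(a, b) for b in means2] for a in means1]


# Toy 1x4 image with two superpixels of two pixels each.
image = [[(0, 0, 0), (0, 0, 0), (10, 0, 0), (10, 0, 0)]]
labels = [[0, 0, 1, 1]]
masses, means = superpixel_stats(image, labels)
D = cost_matrix(means, means)
print(masses, D)
```

In practice the two images would each get their own label map, and `cost_matrix` would be called with the means of image 1 and image 2.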
Finally, if one is not interested in approximating the EMD of the original
problem, one could use any distance between superpixels as the transport
cost D(i, j). For instance, one could estimate a unimodal 5-D Gaussian
inside each superpixel. The transport cost between two superpixels would
then be the transport cost between unimodal Gaussians, which is known in
closed form [7]. In this case, if N(µ_i, Σ_i) is the Gaussian approximation
of superpixel S_1(i) and N(µ_j, Σ_j) is the Gaussian approximation of
superpixel S_2(j) in a 5-D space, then the transport cost between those two
superpixels is defined as:
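\[ D(i,j)^2 = \| \mu_i - \mu_j \|^2 + \mathrm{tr}(\Sigma_i) + \mathrm{tr}(\Sigma_j) - 2\, \mathrm{tr}\!\left( \left( \Sigma_i^{1/2} \Sigma_j \Sigma_i^{1/2} \right)^{1/2} \right) \tag{4} \]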
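As a worked special case of the Gaussian transport cost (an illustration added here, not a claim about the setting used in the experiments), assume both covariances are diagonal, with standard deviations \sigma_{i,k} and \sigma_{j,k} along each of the five axes. The matrix square roots then act entry by entry, and the cost splits into a term on the means and a term on the standard deviations:

```latex
\Sigma_i = \mathrm{diag}(\sigma_{i,1}^2, \dots, \sigma_{i,5}^2), \quad
\Sigma_j = \mathrm{diag}(\sigma_{j,1}^2, \dots, \sigma_{j,5}^2)
\;\Longrightarrow\;
\mathrm{tr}\!\left( \left( \Sigma_i^{1/2} \Sigma_j \Sigma_i^{1/2} \right)^{1/2} \right)
  = \sum_{k=1}^{5} \sigma_{i,k}\, \sigma_{j,k}

D(i,j)^2 = \| \mu_i - \mu_j \|^2
  + \sum_{k=1}^{5} \left( \sigma_{i,k}^2 + \sigma_{j,k}^2 - 2\, \sigma_{i,k}\, \sigma_{j,k} \right)
  = \| \mu_i - \mu_j \|^2 + \sum_{k=1}^{5} \left( \sigma_{i,k} - \sigma_{j,k} \right)^2
```

Two superpixels with the same mean and the same spread therefore have zero cost, and the cost grows with both the displacement of the mean and the change of spread.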
3.2. Including topological constraints
Building histograms on superpixels enforces some geometric structure in the
histograms. However, this constraint can be strengthened. In particular,
when a segmentation tree is available, one would want to keep the
topological structure of the tree in the matching. For instance, suppose a
good segmentation tree is provided, meaning that superpixels belonging to
the same object are grouped together at some scale. Before matching
superpixels at small scales, which is a risky procedure, one could force
the ancestors at larger scales to match by defining the following cost
matrix D(i, j):
\[ D(i,j) = \sum_{s} \| \bar{S}_{1,s}(i) - \bar{S}_{2,s}(j) \| \tag{5} \]
where S_{1,s}(i) is the parent superpixel of S_1(i) at scale s and
\bar{S}_{1,s}(i) is its 5-D mean.

Fig. 1. Topological constraints for robust matching. From left to right,
top to bottom: first image, second image, superpixelization of the first
image at two different scales, superpixelization of the second image at two
different scales. Topological constraints in the EMD add the cost of
matching superpixelizations at a coarse scale to the cost of matching
superpixelizations at a fine scale.

Fig. 2. Matching superpixelizations. From left to right, top to bottom:
first image, second image, superpixelization of the first image (false
colors), superpixelization of the second image (false colors). Even between
two images with small differences, the superpixelizations, here in false
colors, can be quite inconsistent. The matching of these two superpixel
maps is given by the color code: superpixel i in image 2 has the color of
the superpixel in image 1 with label j = arg max_j F(j, i).

The advantage of plugging the topological constraint into the matrix D is
that it does not increase the complexity of the matching. Alternatively,
one could design a more accurate EMD by solving the EMD at different scales
and propagating the flow F from one scale to another.
Fig. 1 illustrates the topological constraints. In order to match two
superpixels at different scales, one sums up the costs of matching their
parent superpixels at the coarser scales. In this way the topology of the
matching is enforced.
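The summation over scales can be sketched as follows. This is only an illustration under assumptions: the ancestor means are taken as precomputed from the segmentation tree, the data layout (`anc_means[s][i]`) and the function name are hypothetical, and the toy example uses 2-D features instead of 5-D ones to keep the numbers readable.

```python
import math


def topological_cost(anc_means1, anc_means2):
    """Cost matrix summing, over scales, the distance between ancestor means.

    anc_means1[s][i] is the feature mean of the ancestor, at scale s, of the
    i-th fine superpixel of image 1 (scale 0 being the superpixel itself).
    anc_means2 follows the same layout for image 2, with the same number of scales.
    """
    if len(anc_means1) != len(anc_means2):
        raise ValueError("both trees must provide the same number of scales")
    n1, n2 = len(anc_means1[0]), len(anc_means2[0])
    return [
        [
            sum(math.dist(m1[i], m2[j]) for m1, m2 in zip(anc_means1, anc_means2))
            for j in range(n2)
        ]
        for i in range(n1)
    ]


# Image 1: three fine superpixels; the first two share a coarse parent.
anc1 = [
    [(0, 0), (1, 0), (10, 0)],      # fine scale
    [(0.5, 0), (0.5, 0), (10, 0)],  # coarse scale (mean of each parent)
]
# Image 2: two fine superpixels, each being its own parent.
anc2 = [
    [(0, 0), (9, 0)],
    [(0, 0), (9, 0)],
]
D_topo = topological_cost(anc1, anc2)
print(D_topo)
```

Fine superpixel 1 of image 1 is penalized twice for matching superpixel 1 of image 2: once at the fine scale and once through its parent, which is how the tree structure enters the cost.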
3.3. Solving the EMD
Once the superpixel histograms P and Q are built with Eq. (2), and their
pairwise distances D(i, j) are defined with Eq. (3), Eq. (4) or Eq. (5),
one needs to compute the EMD of Eq. (1) by minimizing over the flow
F(i, j). For this, we use the code available from [8]. It provides both the
EMD value and the flow F. It runs in less than a second for the usual
number of superpixels we deal with in this paper. This algorithm uses a
thresholding of the matrix D to speed up a max flow algorithm. This
thresholding can be easily interpreted in our framework since there is no
need to compute D(i, j) between far-apart superpixels or superpixels with
different ancestors. In this setting, topological constraints speed up the
matching since more distances are thresholded in the matrix D.
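For readers without access to the code of [8], the following sketch solves the same minimization on small problems. It is not the algorithm of [8]: it is a generic successive-shortest-path min-cost flow with Bellman-Ford, written for clarity rather than speed, and it assumes integer masses (pixel counts) with equal totals. The function name is hypothetical.

```python
import math


def emd_min_cost_flow(P, Q, D):
    """Return (cost, F) minimizing sum F[i][j] * D[i][j] for integer masses P, Q."""
    if sum(P) != sum(Q):
        raise ValueError("this sketch handles the balanced case only")
    n, m = len(P), len(Q)
    src, snk = n + m, n + m + 1
    # Each edge is [target, residual capacity, cost, index of reverse edge].
    graph = [[] for _ in range(n + m + 2)]

    def add_edge(u, v, cap, cost):
        graph[u].append([v, cap, cost, len(graph[v])])
        graph[v].append([u, 0, -cost, len(graph[u]) - 1])

    for i in range(n):
        add_edge(src, i, P[i], 0.0)
        for j in range(m):
            add_edge(i, n + j, P[i], D[i][j])
    for j in range(m):
        add_edge(n + j, snk, Q[j], 0.0)

    total_cost, remaining = 0.0, sum(P)
    while remaining > 0:
        # Bellman-Ford: cheapest path from source to sink in the residual graph.
        dist = [math.inf] * len(graph)
        prev = [None] * len(graph)
        dist[src] = 0.0
        for _ in range(len(graph) - 1):
            changed = False
            for u in range(len(graph)):
                if dist[u] == math.inf:
                    continue
                for k, (v, cap, cost, _) in enumerate(graph[u]):
                    if cap > 0 and dist[u] + cost < dist[v] - 1e-12:
                        dist[v], prev[v], changed = dist[u] + cost, (u, k), True
            if not changed:
                break
        if dist[snk] == math.inf:
            raise RuntimeError("no augmenting path found")
        # Find the bottleneck along the path, then push that much flow.
        push, v = remaining, snk
        while v != src:
            u, k = prev[v]
            push = min(push, graph[u][k][1])
            v = u
        v = snk
        while v != src:
            u, k = prev[v]
            graph[u][k][1] -= push
            graph[v][graph[u][k][3]][1] += push
            v = u
        total_cost += push * dist[snk]
        remaining -= push

    # The flow on edge i -> j equals the residual capacity of its reverse edge.
    F = [[0] * m for _ in range(n)]
    for i in range(n):
        for v, _, _, rev in graph[i]:
            if n <= v < n + m:
                F[i][v - n] = graph[v][rev][1]
    return total_cost, F


cost, F = emd_min_cost_flow([3, 2], [2, 3], [[0.0, 1.0], [1.0, 0.0]])
print(cost, F)
```

In this toy example one unit of mass has to leave the first bin, so the flow is not one-to-one: the first superpixel of P splits between the two bins of Q.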
4. EXPERIMENTAL RESULTS
In Fig. 2, we took two consecutive frames of a sequence from an optical
flow benchmark. On both images, we perform a single-scale
superpixelization. As one can see, even between images with small
deformations, there is no consistency between the two superpixel maps. The
EMD solution is given by the color code. The color of each superpixel in
image 1 is chosen randomly. A superpixel i in image 2 has the color of the
superpixel j = arg max_j F(j, i) in image 1. This visualization is a
partial representation of the flow, since the flow is continuous and we
show only the best one-to-one match.
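The reduction of the flow to a single best match per superpixel can be sketched as below, together with the looser reading in which any non-zero flow counts as a match. The flow matrix is assumed to be stored as a list of rows with F[j][i] the mass sent from superpixel j of image 1 to superpixel i of image 2; the function names and the scoring helper are hypothetical.

```python
def best_matches(F):
    """For each superpixel i of image 2, the label j of image 1 maximizing F[j][i]."""
    n_rows, n_cols = len(F), len(F[0])
    return [max(range(n_rows), key=lambda j: F[j][i]) for i in range(n_cols)]


def nonzero_matches(F, tol=0.0):
    """For each superpixel i of image 2, every label j sending it some mass."""
    n_rows, n_cols = len(F), len(F[0])
    return [{j for j in range(n_rows) if F[j][i] > tol} for i in range(n_cols)]


def best_match_accuracy(F, truth):
    """Fraction of superpixels of image 2 whose best match equals the reference label."""
    predicted = best_matches(F)
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


F = [[2, 1],
     [0, 2]]
print(best_matches(F))     # one label per superpixel of image 2
print(nonzero_matches(F))  # superpixel 1 of image 2 receives mass from both labels
print(best_match_accuracy(F, truth=[0, 1]))
```

Ties are broken in favor of the smallest label, which is an arbitrary choice of this sketch.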
We perform a similar experiment on another video sequence in Fig. 3. This
video sequence, “Football”, is a difficult one for image matching since the
players have colors similar to those of the spectators. There is also
motion blur due to fast motions.

Fig. 3. Matching superpixelizations. From left to right, top to bottom:
first image, second image, superpixelization of the first image (false
colors), superpixelization of the second image (false colors). Even between
two images with small differences, the superpixelizations, here in false
colors, can be quite inconsistent. The matching of these two superpixel
maps is given by the color code: superpixel i in image 2 has the color of
the superpixel in image 1 with label j = arg max_j F(j, i).

Finally, we show some quantitative experiments in Table 1. We manually
labeled 250 superpixels from two different images. A one-to-one manual
matching between the superpixels is provided, and we evaluate how well some
algorithms match the superpixels. The Nearest Neighbor (NN) method is the
simplest matching: an NN search of the superpixel in the 5-D space, i.e.
the minimum of Eq. (3). EMD is the algorithm built on superpixels with
Eq. (3), and EMD-T is the EMD built on superpixels with topological
constraints, Eq. (5). Since the Earth Mover Distance does not give a
one-to-one flow, the first row selects the max of the flow for each
superpixel (as in the previous color coding), and the second row assumes a
match as long as the flow between two superpixels is different from zero.

                             NN    EMD   EMD-T
Correct best match (in %)    71    91    94
Non zero flow (in %)         -     93    97

Table 1. Manual matching of superpixels as reference: comparison of Nearest
Neighbor (NN), our method (EMD) and our method with topological consistency
(EMD-T).

5. CONCLUSION
In this paper, we have proposed a new way of matching images with the EMD.
It is expressed as an EMD built on superpixels. The geometric and
topological structures of the superpixels are taken into account to build
the affinity distance matrix. This approach does not assign a one-to-one
match but computes a transport flow, so that superpixels can split and
merge. This representation has a physical justification since it computes
the mass transport between different images. Future work will use this
representation to track several regions in video sequences and to
incorporate the stability of the segmentations, as recently studied in [9].
Finally, being able to track superpixels leads to many applications, from
video segmentation [10] to action recognition.
6. REFERENCES
[1] T. Chan, S. Esedoglu, and K. Ni, “Histogram based segmentation using
Wasserstein distances,” in International Conference on Scale Space Methods
and Variational Methods in Computer Vision, 2007, vol. 4485, p. 697.
[2] Y. Rubner, C. Tomasi, and L.J. Guibas, “The earth mover’s distance as a
metric for image retrieval,” International Journal of Computer Vision,
vol. 40, no. 2, pp. 99–121, 2000.
[3] S. Haker, L. Zhu, A. Tannenbaum, and S. Angenent, “Optimal mass
transport for registration and warping,” International Journal of Computer
Vision, vol. 60, no. 3, pp. 225–240, 2004.
[4] M.R. Ackermann and J. Blomer, “Coresets and approximate clustering for
Bregman divergences,” in Proceedings of the Nineteenth Annual ACM-SIAM
Symposium on Discrete Algorithms, 2009, pp. 1088–1097.
[5] A. Vedaldi and S. Soatto, “Quick shift and kernel methods for mode
seeking,” in European Conference on Computer Vision, 2008, vol. IV,
pp. 705–718.
[6] R. Nock and F. Nielsen, “Statistical region merging,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, vol. 26, no. 11,
pp. 1452–1458, 2004.
[7] H. Greenspan, G. Dvir, and Y. Rubner, “Region correspondence for image
matching via EMD flow,” in Proceedings of the IEEE Workshop on
Content-based Access of Image and Video Libraries, 2000, p. 27.
[8] O. Pele and M. Werman, “Fast and robust earth mover’s distances,” in
IEEE International Conference on Computer Vision, 2009.
[9] F. Chazal, L.J. Guibas, S.Y. Oudot, and P. Skraba, “Persistence-based
clustering in Riemannian manifolds,” Research Report 6968, INRIA,
June 2009.
[10] W. Brendel and S. Todorovic, “Video object segmentation by tracking
regions,” in IEEE International Conference on Computer Vision, 2009.