Incorporating Spatial Similarity into Ensemble Clustering

M. Hidayath Ansari
Dept. of Computer Sciences
University of Wisconsin-Madison
Madison, WI, USA
[email protected]

Nathanael Fillmore
Dept. of Computer Sciences
University of Wisconsin-Madison
Madison, WI, USA
[email protected]

Michael H. Coen
Dept. of Computer Sciences
Dept. of Biostatistics and Medical Informatics
University of Wisconsin-Madison
Madison, WI, USA
[email protected]
ABSTRACT
This paper addresses a fundamental problem in ensemble
clustering – namely, how should one compare the similarity
of two clusterings? The vast majority of prior techniques
for comparing clusterings are entirely partitional, i.e., they
examine assignments of points in set theoretic terms after
they have been partitioned. In doing so, these methods ignore the spatial layout of the data, disregarding the fact
that this information is responsible for generating the clusterings to begin with. In this paper, we demonstrate the
importance of incorporating spatial information into forming ensemble clusterings. We investigate the use of a recently proposed measure, called CDistance, which uses both
spatial and partitional information to compare clusterings.
We demonstrate that CDistance can be applied in a well-motivated way to four areas fundamental to existing ensemble techniques: the correspondence problem, subsampling, stability analysis, and diversity detection.
Categories and Subject Descriptors
I.5.3 [Clustering]: Similarity Measures
Keywords
Ensemble methods, Comparing clusterings, Clustering stability, Clustering robustness
1. INTRODUCTION
Many of the most exciting recent advances in clustering
seek to take advantage of “ensembles” of outputs of clustering algorithms.
Ensemble methods have the potential to provide several
important benefits over single clustering algorithms. As discussed in [1], advantages of ensemble techniques in clustering
problems include improved robustness across datasets and
domains, increased stability of clustering solutions, and perhaps most importantly, the ability to construct clusterings
not individually attainable by any single algorithm.
Ensemble methods generally rely, at least implicitly and often explicitly, on a notion of distance or similarity between clusterings. This is not surprising; for example, if the
stability of a clustering algorithm is taken to mean that the
algorithm does not produce highly different outputs on very
similar inputs, then in order to measure stability one must
be able to determine exactly how similar such outputs are
to each other. These inputs can be perturbed versions of each other, or one may be a subsample of the other. Likewise, in order to create an integrated ensemble clustering from several individual clusterings, it is frequently useful to know which clusterings are more or less similar, and which parts of which clusterings are compatible with each other and can be combined.
Previous approaches to ensemble clustering have generally used “partitional” techniques to measure similarity between clusterings, i.e., they do not take into account spatial
information about the points in each clustering, but only
their identities. Such techniques include the Hubert-Arabie
index [14], the Rand index [19], van Dongen’s metric [22],
Variation of Information [17], and nearly all other previous
work on comparing clusterings [5]. As Table 1 shows, it is in
many contexts a major weakness to ignore spatial information in clustering comparisons, because ignoring the location
of data points for which labels change across clusterings can
lead to a misleading quantification of the difference between
the clusterings.
The present paper proposes the use of a very recently
introduced method for comparing clusterings, called CDistance [5], which incorporates spatial information about
points in each clustering. Specifically, as we show later in
Section 3 and in more detail in [5], CDistance gives us a
measure of the extent to which the constituent clusters of
each clustering overlap with the other. We have found that
CDistance is an effective tool to measure the similarity of
multiple clusterings.
It should be stressed that the goal of this paper is not
to introduce an end-to-end ensemble clustering framework
but rather to introduce more robust alternatives for steps
in the generation, evaluation and combination of clusterings
commonly used in ensemble methods. We provide a brief
theoretical treatment of selected issues arising in popular ensemble clustering techniques. Specifically, we identify four
critical areas in which CDistance can be used in existing ensemble clustering techniques: the correspondence problem,
subsampling, stability analysis, and detecting diversity.
In Section 2 we establish preliminary definitions and notation, and in Section 3 we construct CDistance. We give
a short introduction to ensemble clustering in Section 4. In
Section 5 we discuss how CDistance can be applied in the
setting of ensemble clustering. We discuss related work in
Section 6, and indicate directions for future work in Section 7.
2. PRELIMINARIES
As a brief introduction (further detail is provided in Section 3), CDistance works as follows. Given two clusterings, we first compute a matrix of pairwise distances between the clusters of one and the clusters of the other. The distance function used takes into account the interpoint distances between
elements of the two clusters being compared and is called
Optimal Transportation Distance. The clusterings can then
be thought of as two subsets (consisting of their clusters
as elements) of a metric space with Optimal Transportation Distance as the metric, where we know the value of the
metric at each point of interest (the entries of the matrix).
We then compute the so-called Similarity Distance between
these two collections; Similarity Distance between two point
sets in a metric space measures the degree to which they
overlap in that space, in a sense we will make precise.
In the following subsections, we construct the components
required to compute CDistance. CDistance is defined in
terms of optimal transportation distance and naive transportation distance, and we begin by briefly describing these.
2.1 Clustering

A (hard) clustering A is a partition of a dataset D into sets A_1, A_2, ..., A_K called clusters such that

A_i \cap A_j = \emptyset \;\text{ for all } i \ne j, \qquad \text{and} \qquad \bigcup_{k=1}^{K} A_k = D.

We assume D, and therefore A_1, ..., A_K, are finite subsets of a metric space (Ω, d_Ω).

2.2 Optimal Transportation Distance

The optimal transportation problem [13, 18] asks for the cheapest way to move a set of masses from sources to sinks that lie some distance away. Here cost is defined as the total mass × distance moved. To make the problem concrete, one can think of the sources as factories and the sinks as warehouses. We assume that the sources are shipping exactly as much mass as the sinks are expecting.

Formally, in the optimal transportation problem, we are given weighted point sets (A, p) and (B, q), where A = {a_1, ..., a_|A|} is a finite subset of the metric space (Ω, d_Ω), p = (p_1, ..., p_|A|) is a vector of associated nonnegative weights summing to one, and similar definitions hold for B and q.

The optimal transportation distance between (A, p) and (B, q) is defined as

d_{OT}(A, B; p, q, d_\Omega) = \sum_{i=1}^{|A|} \sum_{j=1}^{|B|} f_{ij}^* \, d_\Omega(a_i, b_j),

where the optimal flow F^* = (f_{ij}^*) between (A, p) and (B, q) is the solution of the linear program

\min_{F = (f_{ij})} \; \sum_{i=1}^{|A|} \sum_{j=1}^{|B|} f_{ij} \, d_\Omega(a_i, b_j) \quad \text{subject to}
    f_{ij} \ge 0, \quad 1 \le i \le |A|, \; 1 \le j \le |B|,
    \sum_{j=1}^{|B|} f_{ij} = p_i, \quad 1 \le i \le |A|,
    \sum_{i=1}^{|A|} f_{ij} = q_j, \quad 1 \le j \le |B|,
    \sum_{i=1}^{|A|} \sum_{j=1}^{|B|} f_{ij} = 1.

It is useful to view the optimal flow as a representation of the maximally cooperative way to transport masses between sources and sinks. Here, cooperative means that the sources are allowed to exchange delivery obligations in order to transport their masses with a globally minimal cost.

2.3 Naive Transportation Distance

In contrast with the optimal formulation above, we define here a naive solution to the transportation problem. Here, the sources are all responsible for individually distributing their masses proportionally to the sinks. In this case, none of the sources cooperate, leading to inefficiency in shipping the overall mass to the sinks.

Formally, in the naive transportation problem, we are given weighted point sets (A, p) and (B, q) as above. We define the naive transportation distance between (A, p) and (B, q) as

d_{NT}(A, B; p, q, d_\Omega) = \sum_{i=1}^{|A|} \sum_{j=1}^{|B|} p_i q_j \, d_\Omega(a_i, b_j).
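The sketch below is our own minimal illustration of the two distances just defined (it is not the authors' implementation); it uses scipy's linear programming solver and assumes a Euclidean ground metric d_Ω. The function names are ours.

import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

def ot_from_cost(D, p, q):
    """Solve the transportation linear program for a ground-cost matrix D (|A| x |B|)."""
    n_a, n_b = D.shape
    A_eq = np.zeros((n_a + n_b, n_a * n_b))
    for i in range(n_a):
        A_eq[i, i * n_b:(i + 1) * n_b] = 1.0      # row sums: sum_j f_ij = p_i
    for j in range(n_b):
        A_eq[n_a + j, j::n_b] = 1.0               # column sums: sum_i f_ij = q_j
    res = linprog(D.ravel(), A_eq=A_eq, b_eq=np.concatenate([p, q]),
                  bounds=(0, None), method="highs")
    return res.fun                                 # sum_ij f*_ij d(a_i, b_j)

def optimal_transportation_distance(A, B, p, q):
    """d_OT between weighted point sets (A, p) and (B, q) under the Euclidean metric."""
    return ot_from_cost(cdist(A, B), p, q)

def naive_transportation_distance(A, B, p, q):
    """d_NT: each source ships its mass to every sink in proportion to the sink's weight."""
    return float(np.asarray(p) @ cdist(A, B) @ np.asarray(q))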
2.4 Similarity Distance
Our next definition concerns the relationship between the
optimal transportation distance and the naive transportation distance. Here we are interested in the degree to which
cooperation reduces the cost of moving the source A onto the
sink B.
Formally, we are given weighted point sets (A, p) and
(B, q) as above. We define the similarity distance between
(A, p) and (B, q) as the ratio
d_S(A, B; p, q, d_\Omega) = \frac{d_{OT}(A, B; p, q, d_\Omega)}{d_{NT}(A, B; p, q, d_\Omega)}.
When A and B perfectly overlap, there is no cost to moving A onto B, so both the optimal transportation distance
and the similarity distance between A and B are 0. On the
other hand, when A and B are very distant from each other,
each point in A is much closer to all other points in A than
to any points in B, and vice-versa. Thus, in this case, cooperation does not yield any significant benefit, so dOT is
nearly as large as dNT , and dS is close to 1. (This behavior
is illustrated in Figure 1.)
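Continuing the sketch above (and reusing its two helper functions), the following lines illustrate the ratio d_S = d_OT / d_NT and the limiting behavior just described; the Gaussian sample and the separation offset are arbitrary choices of ours.

import numpy as np

def similarity_distance(A, B, p, q):
    """d_S = d_OT / d_NT: the fraction of the naive cost that cooperation cannot save."""
    return (optimal_transportation_distance(A, B, p, q)
            / naive_transportation_distance(A, B, p, q))

rng = np.random.default_rng(0)
A = rng.normal(size=(40, 2))
w = np.full(40, 1.0 / 40)
print(similarity_distance(A, A.copy(), w, w))          # perfectly overlapping sets: ~0
print(similarity_distance(A, A + [50.0, 0.0], w, w))   # widely separated sets: close to 1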
In this sense, the similarity distance between two point sets measures the degree to which they spatially overlap, or are "similar" to each other. (Note that the relationship is actually inverse: similarity distance is the degree of spatial overlap subtracted from one.) Another illustration of this behavior is given in Figure 2. Each panel in this figure shows two overlapping point sets to which we assign uniform weight distributions. The degree of spatial overlap of the two point sets in Example A is far higher than the degree of spatial overlap of the point sets in Example B, even though in absolute terms the amount of work required to optimally move one point set onto the other is much higher in Example A than in Example B.
Table 1: Distances to Modified Clusterings [5]. Each row in Table 1(a) depicts a dataset with points colored
according to a reference clustering R (left column) and two different modifications of this clustering (center
and right columns). For each example, Table 1(b) presents the distance between the reference clustering
and each modification for the indicated clustering comparison techniques. The column labeled “?” indicates
whether the technique provides sensible output. In Examples 1–3, we only modify cluster assignments; the
points are stationary. Since the pairwise relationships among the points change in the same way in each
modification, only ADCO and CDistance detect that the modifications are not identical with respect to the
reference. In Example 4, the reference clustering is modified by moving three points of the bottom right
cluster by small amounts. However, in Modification 1, the points do not move across a bin boundary, whereas
in Modification 2, the points do move across a bin boundary. As a result, ADCO detects no change between
Modification 1 and the reference clustering but detects a large change between Modification 2 and the
reference clustering, even though the two modifications differ by only a small amount. CDistance correctly
reports a similar change between Modification 1 and the reference and Modification 2 and the reference.
(Since the points actually move, other clustering comparison techniques are not applicable to this example.)
Table 1(a): [Scatter plots omitted in this text version. Each of Examples 1-4 is drawn three times: the reference clustering R (left), Modification 1 (center), and Modification 2 (right).]

Table 1(b):

Technique     |  Example 1         |  Example 2          |  Example 3         |  Example 4
              |  d(R,1) d(R,2)  ?  |  d(R,1)  d(R,2)  ?  |  d(R,1) d(R,2)  ?  |  d(R,1) d(R,2)  ?
Hubert        |  0.38   0.38    ✗  |  0.04    0.04    ✗  |  0.25   0.25    ✗  |  N/A    N/A     ✗
1 − Rand      |  0.00   0.00    ✗  |  0.05    0.05    ✗  |  0.00   0.00    ✗  |  N/A    N/A     ✗
van Dongen    |  0.18   0.18    ✗  |  0.05    0.05    ✗  |  0.13   0.13    ✗  |  N/A    N/A     ✗
VI            |  5.22   5.22    ✗  |  11.31   11.31   ✗  |  5.89   5.89    ✗  |  N/A    N/A     ✗
Zhou          |  8.24   8.24    ✗  |  0.90    0.90    ✗  |  10.0   10.0    ✗  |  N/A    N/A     ✗
ADCO          |  0.11   0.17    ✓  |  0.02    0.04    ✓  |  0.07   0.09    ✓  |  0.00   0.07    ✗
CDistance     |  0.61   0.73    ✓  |  0.06    0.18    ✓  |  0.41   0.56    ✓  |  0.08   0.09    ✓

("✓" indicates that the technique provides sensible output for that example; "✗" indicates that it does not.)
Figure 1: The plot on the right shows similarity distance as a function of the separating distance ∆ between the two point sets displayed on the left. Similarity distance is zero when the point sets perfectly overlap and approaches one as the distance increases.

Figure 2: Similarity distance measures the degree of spatial overlap. We consider the two point sets in Example A to be far more similar to one another than those in Example B. This is the case even though they occupy far more area in absolute terms and would be deemed further apart by many distance metrics. Here, dS(Example A) = 0.15 and dS(Example B) = 0.78.
3. CDISTANCE
In the clustering distance problem, we are given two clusterings A and B of points in a metric space (Ω, dΩ ). Our
goal is to determine a distance between them that captures
both spatial and partitional information. We do this using
the framework defined above.
Conceptually, our approach is as follows. Given a pair of
clusterings A and B, we first construct a new metric space—
distinct from the metric space in which the original data
lie—which contains one distinct element for each cluster in
A and B. We define the distance between any two elements
of this new space to be the optimal transportation distance
between the corresponding clusters. The clusterings A and
B can now be thought of as weighted point sets in this new
space, and the degree of similarity between A and B can
be thought of as the degree of spatial overlap between the
corresponding weighted point sets in this new space.
Formally, we are given clusterings A and B of datasets D and E in the metric space (Ω, d_Ω). We define the clustering distance (CDistance) between A and B as the quantity

d(A, B) = d_S(A, B; \pi, \rho, d'_{OT}),

where the weights \pi = (|A|/|D| : A \in A) and \rho = (|B|/|E| : B \in B) are proportional to the number of points in the clusters, and where the distance d'_{OT} between clusters A \in A and B \in B is the optimal transportation distance

d'_{OT}(A, B) = d_{OT}(A, B; p, q, d_\Omega)

with uniform weights p = (1/|A|, ..., 1/|A|) and q = (1/|B|, ..., 1/|B|).

The Clustering Distance Algorithm proceeds in two steps:

Step 1. For every pair of clusters (A, B) ∈ A × B, compute the optimal transportation distance d'_{OT}(A, B) = d_{OT}(A, B; p, q, d_Ω) between A and B, based on the distances according to d_Ω between points in A and B.

Step 2. Compute the similarity distance d_S(A, B; π, ρ, d'_{OT}) between the clusterings A and B, based on the optimal transportation distances d'_{OT} between clusters in A and B that were computed in Step 1. Call this quantity the clustering distance d(A, B) between the clusterings A and B.
The reader will notice that we use optimal transportation
distance to measure the distance between individual clusters
(Step 1), while we use similarity distance to measure the
distance between the clusterings as a whole (Step 2). The
reason for this difference is as follows. In Step 1, we are
interested in the absolute distances between clusters: we
want to know how much work is needed to move all the
points in one cluster onto the other cluster. We will use this
as a measure of how “far” one cluster is from another.
In contrast, in Step 2, we want to know the degree to
which the clusters in one clustering spatially overlap with
those in another clustering using the distances derived in
Step 1.
Our choice to use uniform weights in Step 1 but proportional weights in Step 2 has a similar motivation. In a (hard)
clustering, each point in a given cluster contributes as much
to that cluster as any other point in the cluster contributes,
so we weight points uniformly when comparing individual
clusters. In contrast, Step 2 proportionally distributes the
influence of each cluster in the overall computation of CDistance according to its relative weight, as determined by the
number of data points it contains. Thus CDistance captures both spatial and categorical information in a single measure.
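To make the two-step construction concrete, here is a sketch of CDistance assembled from the helpers sketched in Section 2 (again our own illustration, not the authors' code). A clustering is represented as a list of numpy arrays, one (n_k × d) array of points per cluster; that representation is an assumption of the sketch.

import numpy as np

def cdistance(clustering_a, clustering_b):
    """Clustering distance between two clusterings given as lists of point arrays."""
    # Step 1: optimal transportation distance between every pair of clusters,
    # with uniform weights on the points of each cluster.
    d_prime = np.zeros((len(clustering_a), len(clustering_b)))
    for i, A in enumerate(clustering_a):
        for j, B in enumerate(clustering_b):
            p = np.full(len(A), 1.0 / len(A))
            q = np.full(len(B), 1.0 / len(B))
            d_prime[i, j] = optimal_transportation_distance(A, B, p, q)

    # Step 2: similarity distance between the clusterings, with cluster weights
    # proportional to cluster sizes and d_prime as the ground metric of the
    # new cluster-level space.
    pi = np.array([len(A) for A in clustering_a], dtype=float)
    rho = np.array([len(B) for B in clustering_b], dtype=float)
    pi, rho = pi / pi.sum(), rho / rho.sum()
    naive = float(pi @ d_prime @ rho)
    return 0.0 if naive == 0.0 else ot_from_cost(d_prime, pi, rho) / naive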
Figure 3 contains some comparisons between example
clusterings and the associated CDistance value.
3.1 Computational Complexity
The complexity of the algorithm presented above for computing CDistance is dominated by the complexity of computing optimal transportation distance. Optimal transportation distance between two point sets of cardinality at
most n can be computed in worst-case O(n^3 log n) time [20].
Recently a number of linear or sublinear time approximation
algorithms have been developed for several variants of the
optimal transportation problem, e.g., [20, 2].
To verify that our method is efficient in practice, we tested
our implementation for computing optimal transportation
distance, which uses the transportation simplex algorithm,
on several hundred thousand pairs of large point sets sampled from a variety of Gaussian distributions. The average
running times fit the expression (1.38 × 10^-7) n^2.6 seconds with an R^2 value of 1, where n is the cardinality of the larger of the two point sets being compared. For enormous point sets, one can employ standard binning techniques to further reduce the runtime [15].
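As an arithmetic illustration of this fitted expression (our own calculation, not a reported measurement): for n = 100 the predicted running time is (1.38 × 10^-7) × 100^2.6 ≈ 0.02 seconds, and for n = 1000 it is (1.38 × 10^-7) × 1000^2.6 ≈ 8.7 seconds.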
In contrast, the only other spatially aware clustering comparison method that we know of requires explicitly enumerating the exponential space of all possible matchings between clusters in A and clusters in B [3].

Figure 3: (a) Identical clusterings and identical datasets. CDistance is 0. (b) Similar clusterings, but over slightly perturbed datasets. CDistance is 0.09. (c) Two different algorithms were used to cluster the same dataset. CDistance is 0.4, indicating a moderate mismatch of clusterings. (d) Two very different clusterings generated by spectral clustering over almost-identical datasets. CDistance is 0.90, suggesting instability in the clustering algorithm's output paired with this dataset. Note: In each subfigure, we are comparing the two clusterings, one of which has been translated for visualization purposes. For example, in (a), the two clusterings spatially overlap perfectly, making their CDistance value 0.
4. ENSEMBLE CLUSTERING
An ensemble clustering is a consensus clustering of a dataset D based on a collection of R clusterings. The goal is to combine aspects of each of these clusterings in a meaningful way so as to obtain a more robust and "superior" clustering. It
is frequently the case that one set of algorithms can find a
certain kind of structure or patterns within a dataset and
another set of algorithms can find a different structure (for
an example see Figure 7). An ensemble clustering framework tries to constructively combine inputs from different
clusterings to generate a consensus clustering.
There are two main steps involved in any ensemble framework: 1) generation of a number of clusterings, and 2) their
combination. An intermediate step may be to evaluate the
diversity within the generated set of clusterings. Clusterings
may be generated in a number of ways: by using different
algorithms, by supplying different parameters to algorithms,
by clustering subsets of the data and by clustering the data
with different feature sets [8]. The consensus clustering can
also be obtained in a number of ways, such as linkage clustering on pairwise co-association matrices [21], graph-based methods [9], and maximum likelihood methods [1].

5. USING CDISTANCE

In this section, we show how CDistance can be used to replace parts of the combination step in ensemble clustering.

5.1 The Correspondence Problem
One of the major problems in the combination step of ensemble clustering, especially in algorithms that use co-association matrices, has been the correspondence problem:
given a pair of clusterings A and B, for each cluster A in A,
which cluster B in B corresponds to A? The reason this
problem is difficult is that unlike in classification, classes
(clusters) in a clustering are unlabeled and unordered, so
relationships between the clusters of one clustering and the
clusters of another clustering are not immediately obvious
- nor is it even obvious whether a meaningful relationship
exists at all.
Current solutions to this problem rely on the degree of
set overlap to determine matching clusters [10]. Between
any two clusterings A and B, the pairwise association matrix
contains at the (i, j)th entry the number of elements common
to cluster i of clustering A and cluster j of clustering B. An
algorithm such as the Hungarian algorithm may be employed
to determine the highest weight matching (correspondence)
between clusters.
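As an illustration of this baseline, label-overlap approach (not of CDistance), the sketch below builds the count matrix and extracts a maximum-weight matching with scipy's implementation of the Hungarian algorithm; it assumes the two clusterings assign labels 0..K-1 to the same underlying points.

import numpy as np
from scipy.optimize import linear_sum_assignment

def overlap_correspondences(labels_a, labels_b):
    """Match clusters by shared points: counts[i, j] = |cluster i of A ∩ cluster j of B|."""
    counts = np.zeros((labels_a.max() + 1, labels_b.max() + 1), dtype=int)
    np.add.at(counts, (labels_a, labels_b), 1)
    rows, cols = linear_sum_assignment(counts, maximize=True)   # highest-weight matching
    return list(zip(rows, cols))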
The correspondence problem becomes even more difficult
when clustering A has a different number of clusters than
clustering B. What are the cluster correspondences in that
case? Some approaches (e.g., [6]) restrict all members of
an ensemble to the same number of clusters. Others use
the Hungarian algorithm on the association matrix followed
by greedy matching to determine correspondences between
clusters.
A further complication to the correspondence problem
arises when clusterings A and B are clusterings of different underlying data points. This frequently happens, for
example, when members of an ensemble are generated by
clustering subsamples of the original dataset. In this case
also, the traditional method of solving the correspondence
problem fails.
We show how both problems can be approached neatly using CDistance, in this section and in Section 5.2. We formulate an elegant solution to the correspondence problem, one that also addresses these complications, using a variant of the first step of CDistance, as follows.
Given clusterings A = {A1 , . . . , An } and B =
{B1 , . . . , Bm }, we construct a pairwise association matrix
M using optimal transportation distance with equal, “unnormalized” weights on each point of both clusters. That is,
we set the (i, j) entry
Mij = dOT (Ai , Bj ; 1, 1, dΩ ),
where 1 = (1, . . . , 1), for i = 1, . . . , n and j = 1, . . . , m.
This is exactly what is needed to determine whether two
clusters should share the same label. Figure 6 has demonstrative examples on the behavior of this function. We can
then proceed with the application of the Hungarian algorithm as with other techniques to determine the correspondences between clusters, or simply assign the entry with the
lowest value in each row and column as being the corresponding cluster. Since this can be done pair-wise for a set
of clusterings in an ensemble, a consensus function of choice
may then be applied in order to generate the integrated clustering.
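A sketch of the CDistance-based alternative follows, reusing optimal_transportation_distance from Section 2. For simplicity it fills the association matrix with the uniform-weight d_OT of Step 1 rather than the unnormalized variant d_OT(A_i, B_j; 1, 1, d_Ω) described above; the matching step is the same, and nothing requires the two clusterings to have equal numbers of clusters or identical underlying points.

import numpy as np
from scipy.optimize import linear_sum_assignment

def spatial_correspondences(clustering_a, clustering_b):
    """Match clusters by spatial proximity of the point sets themselves."""
    M = np.zeros((len(clustering_a), len(clustering_b)))
    for i, A in enumerate(clustering_a):
        for j, B in enumerate(clustering_b):
            p = np.full(len(A), 1.0 / len(A))
            q = np.full(len(B), 1.0 / len(B))
            M[i, j] = optimal_transportation_distance(A, B, p, q)
    # Minimum-cost matching; with unequal numbers of clusters the surplus
    # clusters are simply left unmatched.
    rows, cols = linear_sum_assignment(M)
    return list(zip(rows, cols))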
The idea is that we can conjecture with more confidence that two clusters correspond to each other if we know they occupy similar regions in space, rather than merely because they share more points with each other than with other clusters. Looking at the actual data and the space it occupies is more informative than looking at the labels alone.
Another advantage of this method is that it places no requirement on the point sets being the same in the two clusterings. For label-based correspondence solutions, set arithmetic is performed on point labels, so the underlying points must stay the same in both clusterings. The generality of dOT allows any two point sets to be given to it as input.
Finally, yet another advantage of this technique is that it can determine matches between a whole cluster on one side and several smaller clusters on the other side that, when taken together, correspond to the first cluster. An illustration of this can be seen in Figures 4 and 5, where the
criss-crossing lines indicate matches between clusters with
zero or almost-zero dOT value.
5.2 Subsampling
Many ensemble clustering methods use subsampling to
generate multiple clusterings (e.g., [9]). A fraction of the
data is sampled at random, with or without replacement,
and clustered in each run. Comparison, if needed, is typically performed via one of a number of partitional methods.
In this subsection, we illustrate how, instead of relying on
a partitional method, CDistance can be used to compare
clusterings from subsampled datasets.
Figure 4 depicts an example in which we repeatedly cluster subsamples of a dataset with both Self-Tuning Spectral
Clustering [23] and Affinity Propagation [7]. Although these
algorithms rely on mathematically distinct properties, we
see their resultant clusterings on subsampled data agree to
a surprising extent according to CDistance. This agreement
suggests a phenomenon that a practitioner or algorithm can
use to inform further investigation or inference.
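As a sketch of this kind of experiment, with scikit-learn's SpectralClustering and AffinityPropagation standing in for the implementations used in the paper, a synthetic dataset in place of a real one (both substitutions are ours), and the cdistance helper from Section 3:

import numpy as np
from sklearn.cluster import AffinityPropagation, SpectralClustering
from sklearn.datasets import make_blobs

def as_clusters(X, labels):
    """Convert a label vector into the list-of-point-arrays form used by cdistance."""
    return [X[labels == k] for k in np.unique(labels)]

X, _ = make_blobs(n_samples=400, centers=5, random_state=0)
rng = np.random.default_rng(0)
for trial in range(5):
    sub = X[rng.choice(len(X), size=200, replace=False)]      # random subsample
    spectral = SpectralClustering(n_clusters=5, random_state=0).fit_predict(sub)
    affinity = AffinityPropagation(random_state=0).fit_predict(sub)
    d = cdistance(as_clusters(sub, spectral), as_clusters(sub, affinity))
    print(f"trial {trial}: CDistance = {d:.2f}")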
Figure 4: Comparing outputs of different clustering algorithms over randomly sampled subsets of a dataset. The low values of CDistance indicate relative similarity and stability between the clusterings. The numbers of clusters involved in each comparison are displayed beside each CDistance value. (Values shown: 0.21 for 7 vs 9 clusters, 0.09 for 7 vs 7, 0.21 for 6 vs 10, 0.14 for 7 vs 9, 0.13 for 7 vs 8, 0.24 for 6 vs 7, 0.13 for 7 vs 9, 0.12 for 7 vs 8, and 0.16 for 7 vs 9.)

5.3 Algorithm-Dataset Match and Stability Analysis
CDistance can provide guidance on the suitability of a
particular algorithm family for a given dataset. Depending
on the kind of structure one wishes to detect in the data,
some clustering algorithms may not be suitable.
In [4], the authors propose that the stability of a clustering
can be determined by repeatedly clustering subsamples of its
dataset. Finding consistent high similarity across clusterings
indicates consistency in locating similar substructures in the
dataset. This can increase confidence in the applicability of
an algorithm to a particular distribution of data. In other
words, by comparing the resultant clusterings, one can obtain a goodness-of-fit between a dataset and a clustering
algorithm. The clustering comparisons in [4] were all via
partitional methods.
Figure 3(d) shows a striking example of an algorithm-dataset mismatch. In the two panels shown, one dataset is a slightly modified version of the other, with a small amount of noise added to it. A spectral clustering algorithm was used to cluster both, resulting in wildly variant clusterings (CDistance = 0.71). This is an indication that spectral clustering is not an appropriate algorithm to use on this dataset.

Figure 5: Comparing outputs of different clustering algorithms over randomly sampled subsets of a dataset. The low values of CDistance indicate relative similarity and stability between the clusterings. The numbers of clusters involved in each comparison are displayed beside each CDistance value. (Values shown: 0.11 for 5 vs 17 clusters, 0.08 for 5 vs 17, 0.08 for 5 vs 16, 0.11 for 5 vs 16, 0.10 for 5 vs 17, 0.04 for 5 vs 16, 0.16 for 5 vs 18, and 0.08 for 5 vs 16.)
In Figures 4 and 5 we see multiple clusterings generated by clustering subsamples and by clustering data with small amounts of noise added, respectively. In both cases CDistance maintains a relatively low value across instances, and we can conclude that the clusterings remain stable with this dataset and algorithm combination.
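Below is a sketch of such a subsample-based stability check, reusing cdistance and as_clusters from the earlier sketches; the dataset, the clustering algorithm, and the mean/standard-deviation summary are our own illustrative choices.

import itertools
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

def stability(X, cluster_fn, n_runs=6, frac=0.5, seed=0):
    """Mean and spread of pairwise CDistance over clusterings of random subsamples."""
    rng = np.random.default_rng(seed)
    clusterings = []
    for _ in range(n_runs):
        sub = X[rng.choice(len(X), size=int(frac * len(X)), replace=False)]
        clusterings.append(as_clusters(sub, cluster_fn(sub)))
    dists = [cdistance(a, b) for a, b in itertools.combinations(clusterings, 2)]
    return float(np.mean(dists)), float(np.std(dists))

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
cluster_fn = lambda S: SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                                          random_state=0).fit_predict(S)
print(stability(X, cluster_fn))   # a consistently low mean suggests a good algorithm-dataset match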
5.4 Detecting Diversity
Some amount of diversity among the clusterings constituting the ensemble is crucial to the improvement of the final generated clustering [12]. In the previous subsection we noted that
CDistance was able to detect similarity between clusterings
of subsets of a dataset. A corollary of that is that CDistance
is also able to detect diversity in members of an ensemble. Previous methods have used the Adjusted Rand Index
or other such metrics to quantify diversity. As we demonstrate in Section 6, these measures are surprisingly insensitive to large differences in clusterings. CDistance provides
a smoother, spatially-sensitive measure of the similarity or
dissimilarity of two clusterings, leading to a more effective
characterization of diversity within an ensemble. Measurement of diversity using CDistance can guide the generation
step by indicating whether the members of an ensemble are
sufficiently diverse or differ from each other only minutely.
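A minimal sketch of quantifying ensemble diversity with CDistance, again reusing the cdistance function from Section 3; treating the mean pairwise CDistance as a diversity index is our own illustrative choice rather than a prescription from the paper.

import itertools
import numpy as np

def ensemble_diversity(members):
    """Mean pairwise CDistance over an ensemble of clusterings (list-of-point-arrays form)."""
    return float(np.mean([cdistance(a, b)
                          for a, b in itertools.combinations(members, 2)]))

Values near zero indicate that the ensemble members are nearly redundant, while larger values indicate the kind of diversity that the combination step can exploit.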
5.5 Graph-based methods

Several ensemble clustering frameworks employ graph-based algorithms for the combination step [11, 8]. One popular idea is to construct a suitable graph that takes into account the similarity between clusterings and to derive a graph
cut or partition that optimizes some objective function. [9]
describes three main graph-based methods: instance-based,
cluster-based and hypergraph partitioning, also discussed in
[21]. The cluster-based formulation in particular requires
a measure of similarity between clusters, originally the Jaccard index. The unsuitability of this index and other related indices is discussed in Section 6.
CDistance can be easily substituted here as a more informative way of measuring similarity between clusterings
specifically because it exploits both spatial and partitional
information about the data.
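As a sketch of how such a substitution might look (our own formulation, not the construction of [9] or [21]): pool the clusters of every ensemble member as graph nodes, weight edges by a spatial affinity derived from the pairwise d_OT, and partition the resulting graph. Both the exp(−d/scale) transform and the use of scikit-learn's SpectralClustering for the partitioning step are assumptions of the sketch.

import numpy as np
from sklearn.cluster import SpectralClustering

def partition_cluster_graph(all_clusters, n_metaclusters):
    """all_clusters: flat list of point arrays pooled from every ensemble member."""
    n = len(all_clusters)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            p = np.full(len(all_clusters[i]), 1.0 / len(all_clusters[i]))
            q = np.full(len(all_clusters[j]), 1.0 / len(all_clusters[j]))
            D[i, j] = D[j, i] = optimal_transportation_distance(
                all_clusters[i], all_clusters[j], p, q)
    affinity = np.exp(-D / (D.mean() + 1e-12))    # spatial similarity between clusters
    return SpectralClustering(n_clusters=n_metaclusters, affinity="precomputed",
                              random_state=0).fit_predict(affinity)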
5.6 Example
As an example of CDistance being used in an ensemble clustering framework, we applied it to an image segmentation problem. Segmentations of images can be viewed as
clusterings of their constituent pixels.
We examined images from the Berkeley Segmentation
Database [16]. Each image is accompanied by segmentations of unknown origin collected by the database aggregators. Examining these image segmentations reveals both
certain common features, e.g., sensitivity to edges, and extreme differences in the detail found in various image
regions. We used CDistance to combine independent image
segmentations, as shown in Figure 7. We computed correspondences between clusters in the two segmentations and
identified clusters which have near-zero dOT to more than
one cluster from the other segmentation. The combined segmentation was generated by replacing larger clusters in one
clustering with smaller ones that matched it from the other
clustering.
Figure 7: Combining multiple image segmentations.
The original image is shown in (a). Two segmentations, each providing differing regions of detail, are shown in (b) and (c). We note that the CDistance between the clusterings in (b) and (c) is 0.09. We use
CDistance to combine them in (d), rendering the
false-color image in (e), displaying partitioned regions by color. Note that we are able to combine
image segments that are not spatially adjacent, as
indicated by the colors in (e).
This image and the two segmentations were chosen among
many for their relative compatibility (CDistance between
the two segmentations is 0.09) as well as the fact that each
of them provides detail in parts of the image where the other
doesn’t.
Using CDistance enables us to ensure that only segmentations covering the same regions of image space are aggregated to yield a more informative segmentation. In other
words, each segmentation is able to fill in gaps in another.
6. RELATED WORK
In this section, we present simple examples that demonstrate the insensitivity of popular comparison techniques
to spatial differences between clusterings. Our approach
is straightforward: for each technique, we present a baseline reference clustering. We then compare it to two other
clusterings that it cannot distinguish, i.e., the comparison
method assigns the same distance from the baseline to each
of these clusterings. However, in each case, the clusterings
compared to the baseline are significantly different from one
another, both from a spatial, intuitive perspective and according to our measure CDistance.
Our intent is to demonstrate these methods are unable to
detect change even when it appears clear that a non-trivial
change has occurred. These examples are intended to be
didactic, and as such, are composed of only enough points
to be illustrative. Because of limitations in most of these
approaches, we can only consider clusterings over identical
sets of points; they are unable to tolerate noise in the data.

Figure 6: The behavior of dOT with weight vectors 1. In the four figures to the left, the cluster of red points is the same in all figures. In the first and third figures from the left, where one cluster almost perfectly overlaps with the other, dOT is very small in both cases (0.3 and 0.18 respectively) when compared, for example, to dOT between the red cluster and the blue cluster in the fourth figure from the left (dOT = 3.09). This is because we would like our distance measure between clusters to focus heavily on spatial locality and to be close to zero when one cluster occupies partly or wholly the same region in space. In the fourth figure from the left, the two smaller clusters (red and blue) have near-zero dOT values to the larger green cluster (0.26 and 0.32 respectively). In the second figure from the left, the green cluster is a slightly translated version of the red one but is still very close to it spatially; dOT in this case is still very small (0.2). The fifth figure from the left shows two point sets shaped like arcs that overlap to a large degree. In this case, dOT is 0.5, indicating that the point sets are very close to each other. The significance of dOT is apparent when dealing with data in high dimensions where complete visualization is not possible.
Among the earliest known methods for comparing clusterings is the Jaccard index [4]. It measures the fraction
of assignments on which different partitionings agree. The
Rand index [19], among the best known of these techniques,
is based on changes in point assignments. It is calculated as the fraction of points, taken pairwise, whose assignments are consistent between two clusterings. This approach has been built upon by many others. For example, [14] addresses the Rand index's well-known problem of overestimating similarity on randomly clustered datasets. These three
measures are part of a general class of clustering comparison
techniques based on tuple-counting or set-based membership. Other measures in this category include the Mirkin, van Dongen, Fowlkes-Mallows, and Wallace indices,
a discussion of which can be found in [4, 17].
The Variation of Information approach was introduced
by [17], who defined an information theoretic metric between
clusterings. It measures the information lost and gained in
moving from one clustering to another, over the same set of
points. Perhaps its most important property is that of convex additivity, which ensures locality in the metric; namely,
changes within a cluster (e.g., refinement) cannot affect the
similarity of the rest of the clustering. Normalized Mutual
Information, used in [21], shares similar properties, also being based on information theory.
The work presented in [24] shares our motivation of incorporating some notion of distance into comparing clusterings.
However, the distance is computed over a space of indicator vectors for each cluster, representing whether each data
point is a member of that cluster. Thus, similarity over clusters is measured by their sharing points and does not take
into account the locations of these points in a metric space.
Finally, we examine [3], who present the only other
method that directly employs spatial information about the
points. Their approach first bins all points being clustered
along each attribute. They then determine the density of
each cluster over the bins. Finally, the distance between two
clusterings is defined as the minimal sum of pairwise cluster-density dot products (derived from the binning), taken over
all possible permutations matching the clusters. We note
that this is in general not a feasible computation. The number of bins grows exponentially with the dimensionality of
the space, and more importantly, examining all matchings
between clusters requires O(n!) time, where n is the number
of clusters.
7. DISCUSSION AND FUTURE WORK
We have shown how the clustering distance (CDistance)
measure introduced in [5] can be used in the ensemble methods framework for several applications. In future work we
will incorporate CDistance into existing ensemble clustering
algorithms and evaluate its ability to improve upon current
methodologies. We are also continuing to study the theoretical properties of CDistance, with applications in ensemble
clustering and unsupervised data mining.
8. ACKNOWLEDGEMENTS
This work was supported by the School of Medicine and
Public Health, the Wisconsin Alumni Research Foundation,
the Department of Biostatistics and Medical Informatics,
and the Department of Computer Sciences at the University
of Wisconsin-Madison.
9. REFERENCES

[1] A. Topchy, A. K. Jain, and W. Punch. A mixture model for clustering ensembles. In Proceedings of the SIAM International Conference on Data Mining (SDM), pages 379-390, 2004.
[2] K. D. Ba, H. L. Nguyen, H. N. Nguyen, and R. Rubinfeld. Sublinear time algorithms for earth mover's distance. arXiv abs/0904.0292, 2009.
[3] E. Bae, J. Bailey, and G. Dong. Clustering similarity comparison using density profiles. In Proceedings of the 19th Australian Joint Conference on Artificial Intelligence, pages 342-351. Springer LNCS, 2006.
[4] A. Ben-Hur, A. Elisseeff, and I. Guyon. A stability based method for discovering structure in clustered data. In Pacific Symposium on Biocomputing, pages 6-17, 2002.
[5] M. H. Coen, M. H. Ansari, and N. Fillmore. Comparing clusterings in space. In ICML 2010: Proceedings of the 27th International Conference on Machine Learning (to appear), 2010.
[6] S. Dudoit and J. Fridlyand. Bagging to improve the accuracy of a clustering procedure. Bioinformatics, 19(9):1090-1099, 2003.
[7] D. Dueck and B. J. Frey. Non-metric affinity propagation for unsupervised image categorization. In IEEE International Conference on Computer Vision, pages 1-8, 2007.
[8] X. Z. Fern and C. E. Brodley. Random projection for high dimensional data clustering: A cluster ensemble approach. In Proceedings of the 20th International Conference on Machine Learning, pages 186-193, 2003.
[9] X. Z. Fern and C. E. Brodley. Solving cluster ensemble problems by bipartite graph partitioning. In ICML '04: Proceedings of the Twenty-First International Conference on Machine Learning, page 36, New York, NY, USA, 2004. ACM.
[10] A. Fred. Finding consistent clusters in data partitions. In Proc. 3rd Int. Workshop on Multiple Classifier Systems, pages 309-318. Springer, 2001.
[11] A. L. N. Fred and A. K. Jain. Data clustering using evidence accumulation. In ICPR, pages 276-280, 2002.
[12] S. T. Hadjitodorov, L. I. Kuncheva, and L. P. Todorova. Moderate diversity for better cluster ensembles. Inf. Fusion, 7(3):264-275, 2006.
[13] F. Hillier and G. Lieberman. Introduction to Mathematical Programming. McGraw-Hill, 1995.
[14] L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2:193-218, 1985.
[15] E. Levina and P. Bickel. The earth mover's distance is the Mallows distance: Some insights from statistics. In ICCV, 2001.
[16] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. 8th Int'l Conf. Computer Vision, volume 2, pages 416-423, July 2001.
[17] M. Meilă. Comparing clusterings: an axiomatic view. In ICML, 2005.
[18] S. T. Rachev and L. Rüschendorf. Mass Transportation Problems: Volume I: Theory (Probability and its Applications). Springer, 1998.
[19] W. M. Rand. Objective criteria for the evaluation of clustering methods. J. Amer. Statist. Assoc., pages 846-850, 1971.
[20] S. Shirdhonkar and D. Jacobs. Approximate earth mover's distance in linear time. In CVPR, 2008.
[21] A. Strehl and J. Ghosh. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res., 3:583-617, 2003.
[22] S. van Dongen. Performance criteria for graph clustering and Markov cluster experiments. Technical report, National Research Institute for Mathematics and Computer Science in the Netherlands, 2000.
[23] L. Zelnik-Manor and P. Perona. Self-tuning spectral clustering. In NIPS, 2004.
[24] D. Zhou, J. Li, and H. Zha. A new Mallows distance based metric for comparing clusterings. In Proceedings of the 22nd International Conference on Machine Learning, pages 1028-1035, 2005.