Hierarchical Agglomerative Clustering with Ordering Constraints
Haifeng Zhao and ZiJie Qi
Department of Computer Science
University of California, Davis
Davis, CA, USA 95616
Email: [email protected]
Abstract—Many previous researchers have encoded background knowledge as constraints to obtain more accurate clustering; such methods are usually called constrained clustering. Previous constraints are instance-level, non-hierarchical constraints, such as must-link and cannot-link, which provide no hierarchical information. To incorporate hierarchical background knowledge into agglomerative clustering, we extend instance-level constraints to a hierarchical constraint in this paper, which we call the ordering constraint. Ordering constraints capture hierarchical side information and allow the user to encode hierarchical knowledge, such as ontologies, into agglomerative algorithms. We experimented with ordering constraints on labeled newsgroup data. The experiments show that the dendrogram generated under ordering constraints is more similar to the pre-known hierarchy than the dendrogram generated by previous agglomerative clustering algorithms. We believe this work will have a significant impact on the agglomerative clustering field.
Keywords—hierarchical agglomerative clustering; constrained clustering; ordering constraint
I. INTRODUCTION
The basic Hierarchical Agglomerative Clustering (HAC) method begins with each instance in a separate group and combines groups until only one remains. Constrained clustering methods incorporate side information to improve clustering results. Typical constraints used in previous research are Must-Link (ML) and Cannot-Link (CL) [1], [2], which were first applied in k-means clustering. Davidson and Ravi [3], [4], [5] investigated how to use ML and CL in hierarchical agglomerative clustering and proved the problem is tractable. However, using ML and CL in HAC provides no improvement over classic HAC when the number of clusters is small compared to the number of instances. Bade and Nürnberger [6] introduced must-link-before (MLB) constraints, a type of hierarchical constraint. To resolve conflicts produced by MLB, they enlarge the distance between the two most similar clusters to prevent their combination. MLB thus incorporates merging preferences into HAC, but its defect is that it modifies the underlying distance between clusters; too many modifications may lead to inaccurate results.
In this paper, we present a new type of constraint for hierarchical agglomerative clustering, the ordering constraint (OC),
which is more expressive than ML, CL and MLB. An OC of an instance is a merging preference for that instance. For example, if side information tells us that instance A is more similar to instance C than to instance B, a corresponding OC specifies that A must merge with C before merging with B during clustering. As a concrete case, consider newsgroup instances. Assume that before clustering we have some labeled instances: x1 belongs to rec.sport.baseball, x2 to comp.sys.mac.hardware, x3 to rec.sport.hockey, and x4 to rec.sport.baseball. Then x1 prefers to merge with x4 first (same class), then x3 (same rec.sport category), and finally x2. So the OC of x1 is {x1 → x4 → x3 → x2}. Similarly, we can define an OC for every other labeled instance. After generating OCs for all labeled instances, we run HAC to generate the dendrogram. During clustering, each OC must be obeyed: if x1 and x4 have not been merged, or x4 and x3 have not been merged, then x1 and x3 cannot be merged.
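As a minimal illustration (a hypothetical Python sketch, not the authors' implementation), an OC can be stored as an ordered list, and a merge involving x1 is legal only if its partner is the first not-yet-merged element of x1's list:

    # Hypothetical sketch: an OC is a list of merge targets that must be
    # consumed in order.
    oc = {"x1": ["x4", "x3", "x2"]}  # x1 must absorb x4, then x3, then x2

    def next_allowed_target(instance, merged_with, constraints):
        """Return the first element of the instance's OC that it has not
        merged with yet; a merge with y is allowed only if y equals it."""
        for target in constraints.get(instance, []):
            if target not in merged_with:
                return target
        return None  # OC exhausted: any further merge is allowed

    merged = set()  # instances already merged with x1
    assert next_allowed_target("x1", merged, oc) == "x4"  # x3, x2 blocked
    merged.add("x4")
    assert next_allowed_target("x1", merged, oc) == "x3"  # x3 now allowed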
The objective of ordering constraints is to construct a dendrogram with all merging preferences satisfied. Traditional constraints such as ML and CL cannot provide the merging-preference information that OC does. Unlike MLB, OC does not change cluster similarities; it merely delays the combination of two clusters until conflicts are resolved. OC can therefore yield more accurate results than ML, CL and MLB. We call HAC with OC embedded Hierarchical Agglomerative Clustering with Ordering Constraints (HACOC). In Section II, we define OC in detail and discuss related issues. In Section III, we present the hierarchical agglomerative clustering algorithm with ordering constraints. In Section IV, our experimental results show that HACOC generates a markedly better dendrogram than classic HAC algorithms.
II. DEFINITIONS
A. Ordering Constraint
Ordering constraints (OC) come in two types: the Instance-level Ordering Constraint (IOC) and the Cluster-level Ordering Constraint (COC). An IOC defines a preference for an instance xi: it contains a sequence of instances that guides xi's merging operations during clustering. Each instance is initialized with one IOC (which may be empty if the instance is unlabeled). Since instances merge into clusters during clustering, their IOCs must also be combined; a COC is the resulting cluster-level ordering constraint. The COC of a cluster is the set of all its instances' IOCs: if a cluster c has m instances that possess IOCs, the COC of c contains those m IOCs. Note that, since a COC is a set, the IOCs within a COC are unordered.
To explain our method, we now formally introduce the notation used throughout the paper and the definitions relevant to ordering constraints.

(1) IOC[xi]: the instance-level ordering constraint of xi. When initialized, IOC[xi] = {xi1 → ... → xim}, where {xi1, ..., xim} are m instances. During clustering, each xik ∈ IOC[xi] is replaced by cj if xik merges into cj.

(2) IOCi,j: the jth instance (or cluster) in IOC[xi]. For example, if x1's ordering constraint is IOC[x1] = {x1 → x3 → x5}, we can also write IOC[x1] = {IOC1,1 → IOC1,2 → IOC1,3}, where IOC1,1 stands for x1, IOC1,2 for x3, and IOC1,3 for x5.

(3) COCi,j: the jth IOC in COCci. For example, if cluster c2 has three instances x1, x4 and x5, then c2's ordering constraint is COCc2 = {IOC[x1], IOC[x4], IOC[x5]}. We can also write COCc2 = {COC2,1, COC2,2, COC2,3}, where COC2,1 stands for IOC[x1], COC2,2 for IOC[x4], and COC2,3 for IOC[x5].

(4) COCi,j,k: the kth instance (or cluster) in COCi,j. For example, if x3 and x4 combine into c3, then c3's COC contains both IOC[x3] and IOC[x4]: COCc3 = {IOC[x3] = {c3, x5, x6}, IOC[x4] = {c3, x7, x8}}. We can also write COCc3 = {COC3,1 = {COC3,1,1, COC3,1,2, COC3,1,3}, COC3,2 = {COC3,2,1, COC3,2,2, COC3,2,3}}. In IOC[x3], COC3,1,1 stands for c3, COC3,1,2 for x5, and COC3,1,3 for x6.

(5) "<": COCi,j,u < COCi,j,v denotes that COCi,j,v appears immediately after COCi,j,u in COCi,j; the two are adjacent in COCi,j.

(6) "≺": COCi,j,u ≺ COCi,j,v denotes that COCi,j,u appears before COCi,j,v in COCi,j; the two need not be adjacent.
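To make this notation concrete, a minimal hypothetical Python encoding (ours, not the authors') is:

    # An IOC is an ordered list of merge targets; a COC maps a cluster id
    # to the unordered collection of IOCs of its member instances.
    ioc = {"x1": ["x1", "x3", "x5"]}   # IOC[x1] = {x1 -> x3 -> x5}

    coc = {
        "c2": [                        # COCc2 = {IOC[x1], IOC[x4], IOC[x5]}
            ["x1", "x3", "x5"],        # COC2,1 = IOC[x1]
            ["x4", "x2", "x5"],        # COC2,2 = IOC[x4] (illustrative)
            ["x5", "x3", "x2"],        # COC2,3 = IOC[x5] (illustrative)
        ],
    }
    assert coc["c2"][0][1] == "x3"     # COC2,1,2, the 2nd item of COC2,1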
To generate ordering constraints, we only need a standard dendrogram and a number of labeled instances. The standard dendrogram contains the desired class information, and each IOC is generated according to it. For an instance xi, IOC[xi] lists representatives of the nodes along the path of the standard dendrogram from the leaf node (xi) to the root. Figure 1 shows an example of generating IOC[x5]: x1 to x5 are labeled instances and the standard dendrogram has three levels.

Figure 1. Example of an IOC: IOC[x5] = {x5, x3, x2}

The path of x5 is the highlighted line in the figure. Picking one representative from each node on the path, we get IOC[x5] = {x5, x3, x2}, which specifies that x5 should merge with x3 before merging with x2. Each IOC[xi] can have more than one valid choice, because each node along the path may contain more than one instance; for example, IOC[x5] could instead be {x5, x4, x1}.
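A minimal sketch of this generation step, assuming (our assumptions, not the paper's interface) that the standard dendrogram is given as child-to-parent links and each internal node exposes one representative instance:

    # Build IOC[x] by walking leaf -> root in the standard dendrogram,
    # taking one representative instance per internal node.
    parent = {"x5": "n1", "n1": "n2", "n2": None}  # toy tree from Figure 1
    representative = {"n1": "x3", "n2": "x2"}      # internal node -> member

    def generate_ioc(leaf):
        ioc, node = [leaf], parent[leaf]
        while node is not None:
            ioc.append(representative[node])  # any member of the node works
            node = parent[node]
        return ioc

    print(generate_ioc("x5"))  # ['x5', 'x3', 'x2'], matching Figure 1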
As mentioned, if two clusters ci and cj are merged into a new cluster c{ij}, COCci and COCcj are also combined into the new cluster's COC. For example, suppose cluster c1's COC is COCc1 = {COC1,1, COC1,2, COC1,3, COC1,4} and cluster c2's COC is COCc2 = {COC2,1, COC2,2, COC2,3}. When merging c1 and c2, the two COCs are incorporated into the new cluster c{12}, so the new cluster's COC is COCc{12} = COCc1 ∪ COCc2. More formally, when merging two clusters ci and cj, the COC of the new cluster c{ij} is given by Equation (1):

COCc{ij} = {∀COCi,∗ ∈ COCci} ∪ {∀COCj,∗ ∈ COCcj}    (1)

After combining ci and cj, a second operation substitutes c{ij} for ci and cj in all clusters' COCs, because ci and cj no longer exist. Some merging-preference sequence COC∗,∗ may contain both ci and cj; in that case we delete both and keep a single c{ij}.
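Under the list encoding above, Equation (1) plus the substitution step could be sketched as follows (our rendering; merge_coc is a hypothetical name, not the authors' code):

    def merge_coc(coc, ci, cj, cij):
        """Union the two COCs (Equation 1), then rewrite every IOC so that
        occurrences of ci/cj become a single occurrence of cij."""
        coc[cij] = coc.pop(ci, []) + coc.pop(cj, [])
        for iocs in coc.values():
            for k, seq in enumerate(iocs):
                out = []
                for t in seq:
                    t = cij if t in (ci, cj) else t
                    if t != cij or cij not in out:  # keep only one c{ij}
                        out.append(t)
                iocs[k] = out
        return coc

    coc = {"c1": [["c1", "c2", "x7"]], "c2": [["c2", "x8"]]}
    print(merge_coc(coc, "c1", "c2", "c12"))
    # {'c12': [['c12', 'x7'], ['c12', 'x8']]}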
B. Obstacle and Dependency
Let M(ci, cj) denote merging the two clusters ci and cj. M(ci, cj) is not as simple as just combining their COCs: COCci and COCcj may conflict, because ordering constraints can hinder the merging of two clusters. The prerequisite of M(ci, cj) is that no cluster's COC∗,∗ contains an obstacle for it; otherwise, merging ci and cj would break the ordering constraint defined in some COC∗,∗. So ci and cj cannot be merged until the obstacles are removed.

Definition 2.1: Obstacle

∃ ci, ck, cj ∈ COCu,v with ci ≺ ck ≺ cj ⇒ ck is an obstacle of M(ci, cj)    (2)
For example, if ci < ck < cj belongs to COCu,v, then ci and cj cannot merge, because merging ci and cj into c{ij} while excluding ck violates the merging preference of COCu,v; here ck is an obstacle of M(ci, cj). To remove the obstacle, ck must first be merged with either ci or cj, which implies M(ci, ck) and M(ck, cj) must be ready before M(ci, cj): M(ci, cj) depends on M(ci, ck) and M(ck, cj). More generally, if ci and cj have m obstacles within COCu,v, all m obstacles must be able to merge with ci or cj, and the obstacles must also be able to merge with each other. We call M(ci, ck) and M(ck, cj) dependencies of M(ci, cj). Definition 2.2 presents dependencies formally.
Definition 2.2: Dependencies of merging two clusters

If ci and cj appear but are not adjacent in COCu,v, formally ci < COCu,v,o1 < COCu,v,o2 < ... < COCu,v,om < cj, then M(cp, cq) is a dependency of M(ci, cj) for every pair cp, cq ∈ {ci, COCu,v,o1, COCu,v,o2, ..., COCu,v,om, cj}.

According to Definition 2.2, there are C(m+2, 2) − 1 dependencies for M(ci, cj), i.e., (m+2)(m+1)/2 − 1. M(ci, cj) is executable if and only if every dependency M(cp, cq) is executable.
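The obstacle test of Definition 2.1, on which the findObstacle routine in Figure 2 relies, can be sketched as follows (a hypothetical rendering under the same list encoding as above):

    def find_obstacle(ci, cj, coc):
        """Return a cluster sitting strictly between ci and cj in some IOC,
        or None if the merge M(ci, cj) is unobstructed."""
        for iocs in coc.values():
            for seq in iocs:
                if ci in seq and cj in seq:
                    lo, hi = sorted((seq.index(ci), seq.index(cj)))
                    if hi - lo > 1:
                        return seq[lo + 1]  # first obstacle found
        return None

    coc = {"c9": [["c1", "c5", "c2"]]}
    print(find_obstacle("c1", "c2", coc))  # 'c5' blocks M(c1, c2)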
C. Interlock

If a group of merging operations M(c∗, c∗) depend on, and only on, each other, none of them can execute. We call this situation an interlock.

Definition 2.3: Interlock

Let C′ be a subset of C containing m clusters c1, c2, ..., cm. If every M(ci, cj) (ci, cj ∈ C′) has at least one obstacle ck with ck ∈ C′, then c1, c2, ..., cm form an interlock.
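Detecting an interlock for a candidate subset is then a direct check over all of its pairs (again a hypothetical sketch under the same encoding):

    from itertools import combinations

    def obstacles(ci, cj, coc):
        """All clusters lying strictly between ci and cj in some IOC."""
        found = set()
        for iocs in coc.values():
            for seq in iocs:
                if ci in seq and cj in seq:
                    lo, hi = sorted((seq.index(ci), seq.index(cj)))
                    found.update(seq[lo + 1:hi])
        return found

    def is_interlocked(subset, coc):
        """Definition 2.3: every pair in the subset is blocked by at least
        one obstacle that itself belongs to the subset."""
        return all(obstacles(ci, cj, coc) & set(subset)
                   for ci, cj in combinations(subset, 2))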
A good analogy for interlock is "making friends": pick m students from a school, where two students can become friends only if no one among the m students objects. If every pair has at least one objector among the m, then no one can make friends with anyone.
Generating ordering constraints that avoid interlocks is NP-complete (this can be proved via the deadlock avoidance problem [7]), but detecting an interlock is tractable. In our algorithm, when an interlock is detected, we randomly remove one cluster's OC to break it, so that HAC can resume and all instances are eventually merged into the dendrogram root. This slightly violates the purpose of OC, but since most OCs are still obeyed, the HAC result remains satisfactory.

III. ALGORITHM
In this section, we present the algorithm for hierarchical agglomerative clustering with ordering constraints (HACOC). HACOC generally follows the HAC framework, but it must handle obstacles and interlocks. To overcome obstacles, we check the mergeability of dependencies. To break interlocks and still produce a complete dendrogram, we introduce a loose version of HACOC (LHACOC), which randomly abandons one constraint to resume clustering. Figure 2 presents the LHACOC algorithm.
IV. EXPERIMENTS

In this section, we compare the HACOC algorithm with the classic HAC (CHAC) algorithm. We apply both algorithms to the 20 newsgroups dataset [8]. First, we trim the dendrogram of the 20 newsgroups dataset into several sub-dendrograms, which are treated as the standard dendrograms. In our experiments, we construct five standard dendrograms from the 20 newsgroups data, as shown in Table I.

Table I. Five standard dendrograms (Den) for the experiments:
- Den1 (size 400): comp.sys.ibm.pc.hw, comp.sys.mac.hardware, rec.sport.baseball, rec.sport.hockey
- Den2 (size 400): comp.sys.ibm.pc.hw, comp.sys.mac.hardware, talk.politics.guns, talk.politics.mideast
- Den3 (size 400): rec.sport.baseball, rec.sport.hockey, talk.politics.guns, talk.politics.mideast
- Den4 (size 700): alt.atheism, comp.graphics, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, rec.autos, rec.sport.baseball, rec.sport.hockey
- Den5 (size 2000): all 20 groups
Take Standard Dendrogram 1 (Den1) as an example: it is a four-level dendrogram containing four classes, "comp.sys.ibm.pc.hw", "comp.sys.mac.hardware", "rec.sport.baseball", and "rec.sport.hockey", each with 100 labeled instances. Second, we apply the CHAC algorithm and HACOC to the input data. For HACOC, we randomly select 5% or 10% of the instances to generate constraints. After clustering, we use the Jaccard Index [9] to evaluate the quality of the generated dendrogram. Because the Jaccard Index compares two partitions of the same dataset rather than two dendrograms, we can only compare a level of the generated dendrogram with a level of the standard dendrogram, where the two levels have the same number of clusters. In our experiments, we calculate the Jaccard Index when the cluster number (CN) reaches the real number of classes in a standard dendrogram level.
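For reference, one standard pair-counting form of the Jaccard Index for comparing two flat partitions can be sketched as follows (our rendering, not the authors' evaluation code):

    from itertools import combinations

    def jaccard_index(labels_a, labels_b):
        """J = a / (a + b + c), where a counts pairs grouped together in
        both partitions and b, c count pairs grouped in only one."""
        a = b = c = 0
        for i, j in combinations(range(len(labels_a)), 2):
            same_a = labels_a[i] == labels_a[j]
            same_b = labels_b[i] == labels_b[j]
            a += same_a and same_b
            b += same_a and not same_b
            c += same_b and not same_a
        return a / (a + b + c)

    print(jaccard_index([0, 0, 1, 1], [0, 0, 1, 2]))  # 0.5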
Figures 3 and 4 show the experimental results. "Den i" denotes the ith standard dendrogram, "CN" denotes the final cluster number of HAC, and "constraint ratio" is the percentage of instances initialized with ordering constraints. The solid line shows the Jaccard Index of CHAC and the dashed line that of HACOC. The results show that the Jaccard Index of HACOC is clearly higher than that of CHAC, demonstrating that HACOC produces more accurate clustering results than CHAC.
V. CONCLUSIONS

In this paper, we introduced a new hierarchical constraint, the ordering constraint. The difference between the ordering constraint and other constraints, such as ML, CL and MLB, is that it sets merging preferences for instances without changing similarities during clustering. Compared with
the other constraints, the ordering constraint is closer to the nature of agglomerative clustering.
After introducing ordering constraints, we embedded them into the HAC algorithm; we call the new algorithm HACOC. The first advantage of HACOC is that it brings prior knowledge into agglomerative clustering, providing guidance that prevents many inaccurate merging operations. The second advantage is that the generated dendrogram remains stable even when the distance metric or the instances' attributes change slightly. The third advantage is that we can obtain different dendrograms on the same dataset by defining different ordering constraints. Experiments show that HACOC yields a dendrogram closer to the pre-known taxonomy than traditional HAC.

In future work, we will continue to explore this topic. In particular, we plan to investigate how to generate good ordering constraints that lead to fewer conflicts and interlocks, so that the algorithm can be more time-efficient. We believe our work will have a significant influence on the field of constrained clustering and is meaningful to many data mining applications.
REFERENCES

[1] K. Wagstaff and C. Cardie, "Clustering with Instance-Level Constraints", Proceedings of the 17th International Conference on Machine Learning (ICML 2000), Stanford, CA, pp. 1103-1110, 2000.
[2] I. Davidson and S. S. Ravi, "Clustering with Constraints and the k-Means Algorithm", Proceedings of the 5th SIAM International Conference on Data Mining, 2005.
[3] I. Davidson and S. S. Ravi, "Hierarchical Clustering with Constraints: Theory and Practice", Proceedings of the 9th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2005), 2005.
[4] I. Davidson and S. S. Ravi, "Intractability and Clustering with Constraints", Proceedings of the 24th International Conference on Machine Learning (ICML 2007), 2007.
[5] I. Davidson and S. S. Ravi, "Using instance-level constraints in agglomerative hierarchical clustering: theoretical and empirical results", Data Mining and Knowledge Discovery, 18(2):257-282, 2009.
[6] K. Bade and A. Nürnberger, "Personalized Hierarchical Clustering", Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence, 2006.
[7] T. Araki, Y. Sugiyama, and T. Kasami, "Complexity of the deadlock avoidance problem", 2nd IBM Symposium on Mathematical Foundations of Computer Science, Tokyo, Japan, pp. 229-257, 1977.
[8] 20 Newsgroups dataset, http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html
[9] P. Jaccard, "The distribution of the flora in the alpine zone", The New Phytologist, 11(2):37-50, 1912.
Figure 3. The performance of the two agglomerative clustering algorithms when 5% of instances have an OC (Jaccard Index of CHAC vs. HACOC for each dendrogram and cluster number, Den1(CN=4) through Den5(CN=20)).
Figure 2. Loose Hierarchical Agglomerative Clustering with Ordering Constraints (LHACOC)

Input: n instances {x1, x2, ..., xn} with ordering constraints generated from a pre-known dendrogram D. Each instance xi is treated as a cluster ci, and its COC may be empty if there is no pre-known knowledge for it.
Output: an agglomerative dendrogram, or failure.

Global variables:
(1) bool isInterlockDetected: whether an interlock has been detected.
(2) Matrix mergable[m, m]: a boolean matrix, where m is the number of remaining clusters; mergable[u, v] marks whether clusters cu and cv can be merged. It is initialized to TRUE and set to FALSE if cu and cv cannot be merged.

Initialization:
(1) isInterlockDetected = TRUE. Assume there is an interlock at the beginning; set to FALSE upon finding two combinable clusters.
(2) resetMergable(TRUE). Assume all pairs of clusters are mergable.

LHACOC(C, n):   // C is the set of n clusters; each cluster has a (possibly empty) ordering constraint
{
  for i = 0; i < n;
    isInterlockDetected = TRUE;
    for all entries mergable[u, v] equal to TRUE
      select the nearest pair cu, cv;
      if findObstacle(cu, cv) == FALSE
        isInterlockDetected = FALSE;
        break;
      else
        mergable[u, v] = FALSE;
        continue;
      endif
    endfor
    if isInterlockDetected == TRUE
      randomly remove a cluster's COC;
    else
      merge(cu, cv);
      i++;
    endif
    resetMergable(TRUE);
  endfor
}
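For concreteness, the loop above can be rendered compactly in Python (our sketch, reusing the hypothetical find_obstacle and merge_coc from Section II; dist is an assumed caller-supplied linkage that must also accept merged cluster ids):

    import random

    def lhacoc(clusters, coc, dist):
        while len(clusters) > 1:
            # Candidate pairs, nearest first.
            pairs = sorted((dist(a, b), a, b)
                           for i, a in enumerate(clusters)
                           for b in clusters[i + 1:])
            for _, a, b in pairs:
                if find_obstacle(a, b, coc) is None:  # unobstructed: merge
                    new_id = "(" + a + "+" + b + ")"
                    merge_coc(coc, a, b, new_id)
                    clusters = [c for c in clusters
                                if c not in (a, b)] + [new_id]
                    break
            else:  # interlock: every remaining pair is obstructed
                victim = random.choice([c for c in coc if coc[c]])
                coc[victim] = []  # drop one cluster's COC to resume
        return clusters[0]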
Figure 4. The performance of the two agglomerative clustering algorithms when 10% of instances have an OC (Jaccard Index of CHAC vs. HACOC for each dendrogram and cluster number, Den1(CN=4) through Den5(CN=20)).