Hierarchical Agglomerative Clustering with Ordering Constraints

Haifeng Zhao
Department of Computer Science, University of California, Davis, Davis, CA, USA 95616
Email: [email protected]

ZiJie Qi
Department of Computer Science, University of California, Davis, Davis, CA, USA 95616
Email: [email protected]

Abstract—Many previous researchers have encoded background knowledge as constraints to obtain more accurate clusterings; such methods are usually called constrained clustering. Previously used constraints are instance-level, non-hierarchical constraints, such as must-link and cannot-link, which provide no hierarchical information. To incorporate hierarchical background knowledge into agglomerative clustering, this paper extends instance-level constraints to a hierarchical constraint that we name the ordering constraint. Ordering constraints capture hierarchical side information and allow the user to encode hierarchical knowledge, such as ontologies, into agglomerative algorithms. We experimented with ordering constraints on labeled newsgroup data. The experiments show that the dendrogram generated under ordering constraints is more similar to the pre-known hierarchy than the dendrogram generated by previous agglomerative clustering algorithms. We believe this work will have a significant impact on the agglomerative clustering field.

Keywords—hierarchical agglomerative clustering; constrained clustering; ordering constraint

I. INTRODUCTION

The basic Hierarchical Agglomerative Clustering (HAC) method begins with each instance as a separate group; these groups are combined until only one group remains. Constrained clustering methods incorporate side information to improve clustering results. Typical constraints used in previous research are Must-Link (ML) and Cannot-Link (CL) [1], [2]. ML and CL were first applied in k-means clustering. Davidson and Ravi [3], [4], [5] investigated how to use ML and CL in hierarchical agglomerative clustering and proved the problem is tractable. However, using ML and CL in HAC provides no improvement over classic HAC when the number of clusters is small compared to the number of instances. Bade and Nurnberger [6] introduced must-link-before (MLB) constraints, a type of hierarchical constraint. To resolve conflicts produced by MLB, they enlarge the distance between the two most similar clusters to prevent their combination. MLB incorporates merging preferences into HAC, but its defect is that it modifies the underlying distance between clusters, and too many such modifications may lead to inaccurate results.

In this paper, we present a new type of constraint for hierarchical agglomerative clustering, the ordering constraint (OC), which conveys richer information than ML, CL, and MLB. An OC of an instance is a merging preference for that instance. For example, suppose side information tells us that instance A is more similar to instance C than to instance B; the corresponding OC then requires that A merge with C before merging with B during clustering. As a concrete case, consider newsgroup instances, and assume that before clustering we have some labeled instances: x_1 belongs to rec.sports.basketball, x_2 to comp.sys.mac.hardware, x_3 to rec.sports.hockey, and x_4 to rec.sports.basketball. Then x_1 prefers to merge with x_4 first, then x_3, then x_2, so the OC of x_1 is {x_1 → x_4 → x_3 → x_2}. Similarly, we can define an OC for every other labeled instance.
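As an illustration of how such an OC might be derived from labels, here is a minimal Python sketch; it is our own construction, not the authors' code, and the hierarchy_path mapping and helper names are hypothetical. It sorts the other labeled instances by how deeply their label path agrees with the target's path, so that more closely related instances come earlier in the merge sequence.

def ordering_constraint(target, labeled, hierarchy_path):
    """Derive an OC for `target`: other labeled instances, sorted so
    that deeper label-path agreement means earlier merging."""
    def shared_prefix(a, b):
        n = 0
        for u, v in zip(a, b):
            if u != v:
                break
            n += 1
        return n

    t_path = hierarchy_path[labeled[target]]
    others = [x for x in labeled if x != target]
    others.sort(key=lambda x: shared_prefix(t_path, hierarchy_path[labeled[x]]),
                reverse=True)
    return [target] + others

# Example mirroring the text: x_1 should merge with x_4, then x_3, then x_2.
hierarchy_path = {
    "rec.sports.basketball": ("rec", "sports", "basketball"),
    "rec.sports.hockey": ("rec", "sports", "hockey"),
    "comp.sys.mac.hardware": ("comp", "sys", "mac.hardware"),
}
labeled = {"x1": "rec.sports.basketball", "x2": "comp.sys.mac.hardware",
           "x3": "rec.sports.hockey", "x4": "rec.sports.basketball"}
print(ordering_constraint("x1", labeled, hierarchy_path))
# -> ['x1', 'x4', 'x3', 'x2']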
After generating OCs for all labeled instances, we run HAC to generate the dendrogram. During clustering, each OC must be obeyed: if x_1 and x_4 have not been merged, or x_4 and x_3 have not been merged, then x_1 and x_3 cannot be merged. The objective of the ordering constraint is to construct a dendrogram with all merging preferences satisfied. Traditional constraints such as ML and CL cannot express such merging preferences. Unlike MLB, an OC does not change the similarities of clusters; it merely delays the combination of two clusters until conflicts are resolved. OCs can therefore yield more accurate results than ML, CL, and MLB. We call HAC embedded with ordering constraints Hierarchical Agglomerative Clustering with Ordering Constraints (HACOC).

Section II defines OC in detail and discusses related issues. Section III presents the hierarchical agglomerative clustering algorithm with ordering constraints. Section IV reports experimental results showing that HACOC generates a clearly better dendrogram than classic HAC algorithms.

II. DEFINITIONS

A. Ordering Constraint

Ordering constraints (OC) come in two types: Instance-level Ordering Constraints (IOC) and Cluster-level Ordering Constraints (COC). An IOC defines a preference for an instance x_i: it contains a sequence of instances that guides x_i's merging operations during clustering. Each instance is initialized with one IOC (which may be empty if the instance is not labeled). Since instances merge into clusters during clustering, their IOCs must also be combined. A COC is the cluster-level ordering constraint of a cluster: the set of all its instances' IOCs. For example, if a cluster c has m instances that possess IOCs, the COC of c keeps those m IOCs. Note that since a COC is a set, the IOCs within a COC have no order among themselves.

To explain our method, we now introduce the notation used throughout the paper and the definitions relevant to ordering constraints.

(1) IOC[x_i] denotes the instance-level ordering constraint initialized as x_i's ordering constraint. At initialization, IOC[x_i] = {x_{i1} → ... → x_{im}}, where x_{i1}, ..., x_{im} are m instances. During clustering, each x_{ik} ∈ IOC[x_i] is replaced by c_j if x_{ik} merges into c_j.

(2) IOC_{i,j} denotes the j-th instance (or cluster) in IOC[x_i]. For example, if x_1's ordering constraint is IOC[x_1] = {x_1 → x_3 → x_5}, we can also write IOC[x_1] = {IOC_{1,1} → IOC_{1,2} → IOC_{1,3}}, where IOC_{1,1} stands for x_1, IOC_{1,2} for x_3, and IOC_{1,3} for x_5.

(3) COC_{i,j} denotes the j-th IOC in COC_{c_i}. For example, if cluster c_2 has three instances x_1, x_4, and x_5, then c_2's ordering constraint is COC_{c_2} = {IOC[x_1], IOC[x_4], IOC[x_5]}. We can also write COC_{c_2} = {COC_{2,1}, COC_{2,2}, COC_{2,3}}, where COC_{2,1} stands for IOC[x_1], COC_{2,2} for IOC[x_4], and COC_{2,3} for IOC[x_5].

(4) COC_{i,j,k} denotes the k-th instance (or cluster) in COC_{i,j}. For example, if x_3 and x_4 combine into c_3, then c_3's COC holds both IOC[x_3] and IOC[x_4]: COC_{c_3} = {IOC[x_3] = {c_3, x_5, x_6}, IOC[x_4] = {c_3, x_7, x_8}}. We can also write COC_{c_3} = {COC_{3,1} = {COC_{3,1,1}, COC_{3,1,2}, COC_{3,1,3}}, COC_{3,2} = {COC_{3,2,1}, COC_{3,2,2}, COC_{3,2,3}}}. In IOC[x_3], COC_{3,1,1} stands for c_3, COC_{3,1,2} for x_5, and COC_{3,1,3} for x_6.

(5) "<": COC_{i,j,u} < COC_{i,j,v} denotes that COC_{i,j,v} appears immediately after COC_{i,j,u} in COC_{i,j}; the two are adjacent in COC_{i,j}.

(6) "≺": COC_{i,j,u} ≺ COC_{i,j,v} denotes that COC_{i,j,u} appears before COC_{i,j,v} in COC_{i,j}; the two need not be adjacent.
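To make this bookkeeping concrete, the following is a minimal Python sketch under our own naming (not the paper's implementation) of a cluster carrying a COC, together with the merge-time substitution described in notation (1); the COC union is formalized below as Equation 1.

class Cluster:
    """A cluster carrying its cluster-level ordering constraint (COC):
    a collection of instance-level ordering constraints (IOCs), each IOC
    being an ordered list of instance/cluster ids."""
    def __init__(self, name, ioc=None):
        self.name = name
        self.coc = [list(ioc)] if ioc else []

def merge_clusters(c_i, c_j, clusters, new_name):
    """Merge c_i and c_j: take the union of their COCs (formalized below
    as Equation 1) and substitute the new cluster for c_i and c_j in all
    clusters' IOCs, keeping only the first occurrence of the new cluster
    when a sequence contained both (our reading of 'keep one c_ij')."""
    c_new = Cluster(new_name)
    c_new.coc = c_i.coc + c_j.coc
    clusters.remove(c_i)
    clusters.remove(c_j)
    clusters.append(c_new)
    for c in clusters:
        for ioc in c.coc:
            out, seen = [], False
            for item in ioc:
                if item in (c_i.name, c_j.name):
                    if not seen:
                        out.append(new_name)
                        seen = True
                else:
                    out.append(item)
            ioc[:] = out
    return c_new

# Example: c_1's IOC is {c_1 -> c_2 -> c_3}, c_2's IOC is {c_2 -> c_4}.
c1, c2 = Cluster("c1", ["c1", "c2", "c3"]), Cluster("c2", ["c2", "c4"])
clusters = [c1, c2, Cluster("c3"), Cluster("c4")]
c12 = merge_clusters(c1, c2, clusters, "c12")
print(c12.coc)  # [['c12', 'c3'], ['c12', 'c4']]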
To generate ordering constraints, we only need a standard dendrogram and a number of labeled instances. The standard dendrogram contains the desired class information, and each IOC is generated according to it. For an instance x_i, IOC[x_i] records one representative from each node along the path in the standard dendrogram from the leaf node (x_i) to the root. Figure 1 shows an example of generating IOC[x_5].

Figure 1. Example of an IOC: IOC[x_5] = {x_5, x_3, x_2}.

In Figure 1, x_1 to x_5 are labeled instances and the standard dendrogram has three levels. The path of x_5 follows the highlighted line. Picking one representative from each node along the path, we get IOC[x_5] = {x_5, x_3, x_2}, which specifies that x_5 should merge with x_3 before merging with x_2. Clearly, each IOC[x_i] can have more than one valid choice, because each node along the path in the standard dendrogram may contain more than one instance; for example, IOC[x_5] could instead be IOC[x_5] = {x_5, x_4, x_1}.

As mentioned above, when two clusters c_i and c_j merge into a new cluster c_{ij}, COC_{c_i} and COC_{c_j} are combined into the new cluster's COC. For example, suppose cluster c_1's COC is COC_{c_1} = {COC_{1,1}, COC_{1,2}, COC_{1,3}, COC_{1,4}} and cluster c_2's COC is COC_{c_2} = {COC_{2,1}, COC_{2,2}, COC_{2,3}}. When merging c_1 and c_2, the two COCs are incorporated into the new cluster c_{12}, so COC_{c_{12}} = COC_{c_1} ∪ COC_{c_2}. More formally, when merging two clusters c_i and c_j, the COC of the new cluster c_{ij} is given by Equation 1:

COC_{c_{ij}} = {∀ COC_{i,*} ∈ COC_{c_i}} ∪ {∀ COC_{j,*} ∈ COC_{c_j}}    (1)

After combining c_i and c_j, the remaining operation is to substitute c_{ij} for c_i and c_j in all clusters' COCs, because c_i and c_j no longer exist. Some merging-preference sequence COC_{*,*} may contain both c_i and c_j; in that case, delete both and keep a single c_{ij}.

B. Obstacle and Dependency

Let M(c_i, c_j) denote merging the two clusters c_i and c_j. M(c_i, c_j) is not as simple as just combining the two COCs: COC_{c_i} and COC_{c_j} may conflict, because ordering constraints can hinder the merging of two clusters. The prerequisite of M(c_i, c_j) is that the pair has no obstacle in any cluster's COC_{*,*}; otherwise, merging c_i and c_j would break the ordering constraint defined in some COC_{*,*}, so c_i and c_j cannot be merged until the obstacles are removed.

Definition 2.1 (Obstacle): If ∃ c_i, c_k, c_j ∈ COC_{u,v} such that c_i ≺ c_k ≺ c_j, then c_k is an obstacle of M(c_i, c_j).    (2)

For example, if c_i < c_k < c_j in COC_{u,v}, then c_i and c_j cannot merge, because merging c_i and c_j into c_{ij} while excluding c_k violates the merging preference of COC_{u,v}; here c_k is an obstacle for M(c_i, c_j). To remove the obstacle, c_k must first be merged with c_i or c_j, which implies that M(c_i, c_k) or M(c_k, c_j) must be ready before M(c_i, c_j); that is, M(c_i, c_j) depends on M(c_i, c_k) and M(c_k, c_j). More generally, if c_i and c_j have m obstacles within COC_{u,v}, all m obstacles must be able to merge with c_i or c_j, and all of them must also be able to merge with each other. We call M(c_i, c_k) and M(c_k, c_j) dependencies of M(c_i, c_j). Definition 2.2 presents dependency formally.

Definition 2.2 (Dependency): If c_i and c_j appear but are not adjacent in COC_{u,v}, formally c_i < COC_{u,v,o_1} < COC_{u,v,o_2} < ... < COC_{u,v,o_m} < c_j, then M(c_p, c_q) is a dependency of M(c_i, c_j) for every c_p, c_q ∈ {c_i, COC_{u,v,o_1}, COC_{u,v,o_2}, ..., COC_{u,v,o_m}, c_j}.
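Definitions 2.1 and 2.2 translate directly into code. The sketch below is our illustration (class and helper names are our own, not the paper's implementation); for simplicity it aggregates obstacles across all merging-preference sequences rather than per individual COC_{u,v}.

from itertools import combinations

def find_obstacles(c_i, c_j, clusters):
    """Definition 2.1: c_k is an obstacle of M(c_i, c_j) if some IOC
    places c_k strictly between c_i and c_j."""
    obstacles = set()
    for c in clusters:
        for ioc in c.coc:
            if c_i in ioc and c_j in ioc:
                lo, hi = sorted((ioc.index(c_i), ioc.index(c_j)))
                obstacles.update(ioc[lo + 1:hi])
    return obstacles

def dependencies(c_i, c_j, clusters):
    """Definition 2.2: every pair drawn from {c_i, the obstacles, c_j},
    except the pair (c_i, c_j) itself, is a dependency of M(c_i, c_j)."""
    members = {c_i, c_j} | find_obstacles(c_i, c_j, clusters)
    return [p for p in combinations(sorted(members), 2) if set(p) != {c_i, c_j}]

class C:  # tiny stand-in for a cluster holding a COC
    def __init__(self, coc):
        self.coc = coc

demo = [C([["c1", "c3", "c2"]])]
print(find_obstacles("c1", "c2", demo))  # {'c3'}
print(dependencies("c1", "c2", demo))    # [('c1', 'c3'), ('c2', 'c3')]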
According to Definition 2.2, there are C(m+2, 2) − 1 dependencies for M(c_i, c_j). M(c_i, c_j) is executable if and only if every dependency M(c_p, c_q) is executable.

C. Interlock

If a group of merging operations M(c_*, c_*) depend on, and only on, each other, none of the merging operations can execute. We call this situation an interlock.

Definition 2.3 (Interlock): Let C′ be a subset of C with m clusters c_1, c_2, ..., c_m. If every M(c_i, c_j) (c_i, c_j ∈ C′) has at least one obstacle c_k with c_k ∈ C′, then c_1, c_2, ..., c_m form an interlock.

A good analogy for an interlock is making friends. Pick m students from a school, where two students can become friends if and only if none of the m students objects. If every pair of students has at least one objector among them, then none of them can make friends.

Generating ordering constraints that avoid interlocks is an NP-complete problem (this can be shown via the deadlock avoidance problem [7]), but detecting an interlock is tractable. In our algorithm, when an interlock is detected, we randomly remove one cluster's OC to break it, so that HAC can resume and all instances can eventually be merged into the dendrogram root. This slightly violates the purpose of OC, but since most OCs are still obeyed, the HAC result remains satisfactory.

III. ALGORITHM

In this section, we present the algorithm of hierarchical agglomerative clustering with ordering constraints (HACOC). HACOC generally follows the frame of HAC, but it must deal with obstacles and interlocks. To handle obstacles, we check the mergeability of dependencies. To break interlocks and still produce a complete dendrogram, we introduce a loose version of HACOC (LHACOC), which randomly abandons one constraint to resume clustering. Figure 2 presents the LHACOC algorithm.

IV. EXPERIMENTS

In this section, we compare the HACOC algorithm with the classic HAC (CHAC) algorithm on the 20 Newsgroups dataset [8]. First, we trim the dendrogram of the 20 Newsgroups dataset into several sub-dendrograms, which we treat as the standard dendrograms. In our experiments, we construct five standard dendrograms from the 20 Newsgroups data, as shown in Table I. Take Standard Dendrogram 1 (Den1) as an example: it is a four-level dendrogram containing four classes, "comp.sys.ibm.pc.hw", "comp.sys.mac.hardware", "rec.sport.baseball", and "rec.sport.hockey", each with 100 labeled instances. Second, we apply CHAC and HACOC to the input data; for HACOC, we randomly select 5% or 10% of the instances to generate constraints. After clustering, we use the Jaccard index [9] to evaluate the quality of the generated dendrogram. Because the Jaccard index compares two partitions of the same dataset rather than two dendrograms, we can only compare a level of the generated dendrogram with a level of the standard dendrogram, where the two levels have the same number of clusters. In our experiments, we calculate the Jaccard index when the cluster number (CN) reaches the true number of classes at a given level of the standard dendrogram.

Figures 3 and 4 show the experimental results. "Den i" denotes the i-th standard dendrogram, "CN" denotes the final cluster number of HAC, and "constraint ratio" is the percentage of instances initialized with ordering constraints.
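For reference, here is a minimal sketch of the pairwise Jaccard index [9] between two partitions of the same instances, as used in this evaluation: a counts instance pairs co-clustered in both partitions, while b and c count pairs co-clustered in exactly one of them.

from itertools import combinations

def jaccard_index(labels_a, labels_b):
    """Pairwise Jaccard index between two partitions of the same
    instances: J = a / (a + b + c)."""
    assert len(labels_a) == len(labels_b)
    a = b = c = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            a += 1
        elif same_a:
            b += 1
        elif same_b:
            c += 1
    return a / (a + b + c) if (a + b + c) else 0.0

# Cluster assignments from cutting the generated dendrogram at CN clusters,
# compared against the standard-dendrogram labels at the matching level.
print(jaccard_index([0, 0, 1, 1], [0, 0, 1, 2]))  # 0.5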
The solid line shows the Jaccard index of CHAC and the dashed line shows the Jaccard index of HACOC. From the results, we can see that the Jaccard index of HACOC is clearly higher than that of CHAC, demonstrating that HACOC leads to more accurate clustering results than CHAC.

Table I
FIVE STANDARD DENDROGRAMS (DEN) FOR EXPERIMENTS

Den1 (size: 400): comp.sys.ibm.pc.hw, comp.sys.mac.hardware, rec.sport.baseball, rec.sport.hockey
Den2 (size: 400): comp.sys.ibm.pc.hw, comp.sys.mac.hardware, talk.politics.guns, talk.politics.mideast
Den3 (size: 400): rec.sport.baseball, rec.sport.hockey, talk.politics.guns, talk.politics.mideast
Den4 (size: 700): alt.atheism, comp.graphics, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, rec.autos, rec.sport.baseball, rec.sport.hockey
Den5 (size: 2000): all 20 groups

V. CONCLUSIONS

In this paper, we introduced a new hierarchical constraint, the ordering constraint. The difference between the ordering constraint and other constraints, such as ML, CL, and MLB, is that it sets merging preferences for instances rather than changing similarities during clustering. Compared with the other constraints, the ordering constraint is closer to the nature of agglomerative clustering. After introducing ordering constraints, we embedded them into the HAC algorithm; we call the resulting algorithm HACOC. The first advantage of HACOC is that it brings prior knowledge into agglomerative clustering, providing guidance that prevents many inaccurate merging operations. The second advantage is that the generated dendrogram can remain stable even when the distance metric or the instances' attributes change slightly. The third advantage is that we can obtain different dendrograms on the same dataset by defining different ordering constraints. Experiments show that HACOC yields a dendrogram closer to the pre-known taxonomy than traditional HAC.

In future work, we will continue to explore this topic. In particular, we plan to investigate how to generate good ordering constraints that lead to fewer conflicts and interlocks, so that the algorithm can be more time-efficient. We believe our work will have a significant influence on the field of constrained clustering and is meaningful to many data mining applications.

REFERENCES

[1] K. Wagstaff and C. Cardie, "Clustering with Instance-Level Constraints", Proceedings of the 17th International Conference on Machine Learning (ICML 2000), Stanford, CA, pp. 1103-1110, 2000.
[2] I. Davidson and S. S. Ravi, "Clustering with Constraints and the k-Means Algorithm", Proceedings of the 5th SIAM International Conference on Data Mining, 2005.
[3] I. Davidson and S. S. Ravi, "Hierarchical Clustering with Constraints: Theory and Practice", Proceedings of the 9th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), 2005.
[4] I. Davidson and S. S. Ravi, "Intractability and Clustering with Constraints", Proceedings of the 24th International Conference on Machine Learning, 2007.
[5] I. Davidson and S. S. Ravi, "Using instance-level constraints in agglomerative hierarchical clustering: theoretical and empirical results", Data Mining and Knowledge Discovery, 18(2):257-282, 2009.
[6] K. Bade and A. Nurnberger, "Personalized Hierarchical Clustering", Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence, 2006.
[7] T. Araki, Y. Sugiyama, and T. Kasami, "Complexity of the deadlock avoidance problem", 2nd IBM Symposium on Mathematical Foundations of Computer Science, Tokyo, Japan, pp. 229-257, 1977.
[8] http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html
[9] P. Jaccard, "The distribution of flora in the alpine zone", The New Phytologist, 11(2):37-50, 1912.
Figure 3. The performance of the two agglomerative clustering algorithms when 5% of instances have OCs: Jaccard index of CHAC and HACOC over Den1(CN=4), Den1(CN=2), Den2(CN=4), Den2(CN=2), Den3(CN=4), Den3(CN=2), Den4(CN=7), Den4(CN=5), Den4(CN=3), and Den5(CN=20).

Figure 4. The performance of the two agglomerative clustering algorithms when 10% of instances have OCs: Jaccard index of CHAC and HACOC over the same dendrogram/cluster-number settings.

Figure 2. Loose Hierarchical Agglomerative Clustering with Ordering Constraints (LHACOC).

Input: n instances {x_1, x_2, ..., x_n} with ordering constraints generated from a pre-known dendrogram D. Each instance x_i is treated as a cluster c_i, and its COC may be empty if there is no pre-known knowledge for it.
Output: an agglomerative dendrogram, or failure.

1) Global variables
(1) bool isInterlockDetected: denotes whether an interlock has been detected.
(2) Matrix mergable[m, m]: a boolean matrix, where m is the number of remaining clusters; mergable[u, v] marks whether clusters c_u and c_v can be merged. It is initialized to TRUE and set to FALSE if c_u and c_v cannot be merged.

2) Initialization
(1) isInterlockDetected = TRUE: assume there is an interlock at the beginning; set to FALSE upon finding two combinable clusters.
(2) resetMergable(TRUE): assume all pairs of clusters are mergeable.

3) Main procedure
LHACOC(C, n):   // C is the set of n clusters; each cluster has an ordering constraint (possibly NULL)
{
  for i = 0; i < n;
    isInterlockDetected = TRUE;
    for all entries of mergable(u, v) equal to TRUE
      select nearest c_u, c_v;
      if findObstacle(c_u, c_v) == FALSE
        isInterlockDetected = FALSE;
        break;
      else
        mergable[u, v] = FALSE;
        continue;
      endif
    endfor
    if isInterlockDetected == TRUE
      randomly remove a cluster's COC;
    else
      merge(c_u, c_v);
      i++;
    endif
    resetMergable(TRUE);
  endfor
}
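To complement the pseudocode of Figure 2, here is a compact Python rendering of the LHACOC loop. This is our sketch, not the authors' code; distance, find_obstacles, and merge are assumed callables supplied by the surrounding HAC implementation (for example, the earlier sketches in Sections II and III).

import random

def lhacoc(clusters, distance, find_obstacles, merge):
    """LHACOC loop: merge the nearest pair with no obstacle; when every
    pair is blocked, an interlock is detected and one randomly chosen
    cluster's COC is dropped so that clustering can resume."""
    while len(clusters) > 1:
        # All candidate pairs, nearest first (the 'mergable' scan).
        pairs = sorted(
            ((distance(clusters[i], clusters[j]), i, j)
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda t: t[0])
        merged = False
        for _, i, j in pairs:
            a, b = clusters[i], clusters[j]
            if not find_obstacles(a, b, clusters):
                merge(a, b, clusters)  # unions the COCs per Eq. (1)
                merged = True
                break
        if not merged:
            # Interlock: every remaining pair has an obstacle.
            random.choice(clusters).coc = []
    return clusters[0]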