The VLDB Journal (2000) 8: 222–236 The VLDB Journal c Springer-Verlag 2000 Clustering categorical data: an approach based on dynamical systems ? David Gibson1,?? , Jon Kleinberg2,??? , Prabhakar Raghavan3 1 2 3 Department of Computer Science UC Berkeley, Berkeley, CA 94720 USA; e-mail: [email protected] Department of Computer Science, Cornell University, Ithaca, NY 14853; e-mail: [email protected] Almaden Research Center IBM, San Jose, CA 95120 USA; e-mail: [email protected] Edited by J. Widom. Received February 15, 1999 / Accepted August 15, 1999 Abstract. We describe a novel approach for clustering collections of sets, and its application to the analysis and mining of categorical data. By “categorical data,” we mean tables with fields that cannot be naturally ordered by a metric – e.g., the names of producers of automobiles, or the names of products offered by a manufacturer. Our approach is based on an iterative method for assigning and propagating weights on the categorical values in a table; this facilitates a type of similarity measure arising from the co-occurrence of values in the dataset. Our techniques can be studied analytically in terms of certain types of non-linear dynamical systems. Key words: Clustering – Data mining – Categorial data – Dynamical systems – Hypergraphs 1 Introduction Much of the data in databases is categorical: fields in tables whose attributes cannot naturally be ordered as numerical values can. The problem of clustering categorical data involves complexity not encountered in the corresponding problem for numerical data, since one has much less a priori structure to work with. While considerable research has been done on clustering numerical data (exploiting its inherent geometry), there has been much less work on the important problem of clustering and extracting structure from categorical data. As a concrete example, consider a database describing car sales with fields “manufacturer”, “model”, “dealer”, “price”, “color”, “customer” and “sale date”. In our setting, price and sale date are traditional “numerical” values. Color is arguably a categorical attribute, assuming that values such ? A preliminary version of this paper appears in the Proceedings of the 24th International Conference on Very Large Databases, 1998. ?? This work was performed in part while visiting the IBM Almaden Research Center. ??? This work was performed in part while visiting the IBM Almaden Research Center. Author is currently supported in part by an Alfred P. Sloan Research Fellowship, an ONR Young Investigator Award, and NSF Faculty Early Career Development Award CCR-9701399. as “red” and “green” cannot easily be ordered linearly; more on this below. Attributes such as manufacturer, model and dealer are indisputably categorical attributes: it is very hard to reason that one dealer is “like” or “unlike” another in the way one can reason about numbers, and hence new methods are needed to search for similarities in such data. In this paper, we propose a new approach for clustering categorical data, differing from previous techniques in several fundamental respects. First, our approach generalizes the powerful methodology of spectral-graph partitioning , to produce a clustering technique applicable to arbitrary collections of sets. 
For a variety of graph decomposition tasks, spectral partitioning has been the method of choice for avoiding the complexity of more combinatorial approaches (see below); thus, one of our main contributions is to generalize and extend this framework to the analysis of relational data more complex than graphs, obtaining a method that does not suffer from the combinatorial explosion encountered in existing approaches. Second, our development of this technique reveals a novel connection between tables of categorical data and non-linear dynamical systems; this connection allows for the natural development of a range of clustering algorithms that are mathematically clean and involve essentially no arbitrary parameters. In addition to the development of the techniques themselves, we report on their incorporation into an experimental system for mining tables of categorical data. We have found the techniques to be effective in the context of both synthetic and real datasets. We now review some of the key features of our approach, and explain the ways in which it differs from other approaches. (1) No a priori quantization. We wish to extract notions of proximity and separation from the items in a categorical dataset purely through their patterns of co-occurrence, without trying to impose an artificial linear order or numerical structure on them. The approach of casting categorical values into numbers or vectors can be effective in some cases (for one example, see the work on image processing and color perception in , , ; in the simplest form, each color is a 3D vector of RGB intensities). However, it has a significant drawback: in engineering, a suitable numerical representa- D. Gibson et al.: Clustering categorical data tion of a categorical attribute, one may lose the structure one is hoping to mine. Moreover, it can lead to problems in high-dimensional geometric spaces in which the data is distributed very sparsely (as, for example, in the interesting k-modes algorithm for categorical data, due to Huang ). (2) Correlation vs. categorical similarity. Association rules and their generalizations have proved to be effective at mining that involves estimating/comparing moments and conditional probabilities; see, e.g., the work on forming associations from market basket data in [Agrawal et al. 1996, Agrawal et al. 1993, Agrawal and Srikant 1994, Brin et al. 1997a, Brin et al. 1997b, Srikant and Agrawal 1995, Toivonen 1996] . Our analysis of co-occurrence in categorical data addresses a notion that is in several respects distinct from the underlying motivation for association rules. First, we wish to define a notion of similarity among items of a database that will apply even to items that never occur together in a tuple; rather, their similarity will be based on the fact that the sets of items with which they do co-occur have large overlap. Second, we would like this notion of similarity to be transitive in a limited way: if A and B are deemed similar; and B and C are deemed similar, then we should be able to infer a “weak” type of similarity between A and C. In this way, similarity can propagate to uncover more distant correlations. The first of these points also serves as part of the underlying motivation of recent independent work of [Das et al. 1997]. However, the way in which we will use this notion for clustering is quite different. (3) New methods for hypergraph clustering. 
Viewing each tuple in the database as a set of items, we can treat the entire collection of tuples as an abstract set system, or hypergraph , and approach the clustering problem in this context. (By a hypergraph, in this setting, we mean simply an arbitrary collection of sets.) In this way, we use pure co-occurrence information among items to guide the clustering, without the imposition of additional prior structure. At a high level, this is similar to the approach taken by , as well as by the framework of association rules . Clustering a collection of sets is a task that naturally lends itself to combinatorial formulations; unfortunately, the natural combinatorial versions of this problem are NPcomplete and apparently difficult to approximate [Garey and Johnson 1997]. This becomes a major obstacle in the approach of (and to a lesser degree in ), who rely on combinatorial formulations of the clustering problem and heuristics whose behavior is difficult to reason about. We adopt a fundamentally different, less combinatorial, approach to the problem of clustering sets; it is motivated by spectral-graph partitioning, a powerful method for the related problem of clustering undirected graphs that originated in work of and in the 1970s. Spectral methods relate good “partitions” of an undirected graph to the eigenvalues and eigenvectors of certain matrices derived from the graph. These methods have been found to exhibit good performance in many contexts that are difficult to attack by purely combinatorial means (see, e.g., , , , ), and they have been successfully applied in areas such as finiteelement mesh decomposition , protein modeling , and the analysis of Monte Carlo sim- 223 ulation methods [Jerrum and Sinclair 1994]. Recently, the heuristic intuition underlying spectral partitioning has been used in the context of information retrieval and in methods for identifying densely connected hypertextual regions of the World-Wide Web , . Our new method can be viewed as a generalization of spectral partitioning techniques to the more general problem of hypergraph clustering. The extension to this more general setting involves significant changes in the algorithmic techniques required; in particular, the use of eigenvectors and linear transforms is replaced, in our method, by certain types of non-linear dynamical systems. One of our contributions here is the generalization of the notion of non-principal eigenvectors (from linear maps) to our non-linear dynamical systems. In what follows, we will argue that this introduction of dynamical systems is perhaps the most natural way to extend the power of spectral methods to the problem of clustering collections of sets; and we will show that this approach suggests a framework for analyzing co-occurrence in categorical datasets in a way that avoids many of the pitfalls encountered with intractable combinatorial formulations of the problem. We now turn to an extended example that illustrates more concretely the type of problem addressed by our methodology. Example and overview of techniques. Consider again our hypothetical database of car sales. Classical association rules, on the one hand, are effective at mining patterns of the form: of the tuples containing “Honda,” 18% (a large fraction) contain “August.” Our goal, on the other hand, is to elicit statements of the form “Hondas and Toyotas are ‘related’ by the fact that a disproportionate fraction of each are sold in the month of August”. 
Thus, we are grouping “Honda” and “Toyota” by co-occurrence with a common value or set of values (“August” in this case), although clearly, in a database of individual car purchases, no single tuple is likely to contain both “Honda” and “Toyota.” We wish to allow such groupings to grow in complex ways; for example, we may discover that many Toyotas are also sold in September, as are many Nissans; that many dealers sell both Hondas and Acuras; that many other dealers have large volume in August and September. In this way, we relate items in the database that have no direct co-occurrence – different automobile manufacturers, or dealers who may sell different makes of automobile. What emerges is a high-level grouping of database entries; one might intuitively picture it as being centered around the notion of “late-summer sales on Japanese model cars.” Distilling such a notion is likely to be of great value in customer segmentation, commonly viewed as one of the most successful applications of data mining to date. In Sect. 5, we describe some patterns of this type that our system automatically discovers: for example (in a database of computer science papers), “the PODS community late 1980s–early 1990s”. Note that in general, a dataset could contain many such segments; indeed, our method is effective at uncovering this multiplicity (see Theorem 4 and the experiments in Sect. 5). Thus, if we picture a database consisting of rows of tuples, association rules are based on co-occurrence within 224 rows (tuples), while we are seeking a type of similarity based on co-occurrence patterns of different items in the same column (field). Our problem, then, is how to define such a notion of similarity, allowing a “limited” amount of transitivity, in a manner that is efficient, algorithmically clean, and robust in a variety of settings. We propose a weight propagation method which works roughly as follows. We first seed a particular item of interest (e.g., “Honda”) with a small amount of weight. This weight then propagates to items with which Honda co-occurs frequently – dealers, common sale times, common price ranges. These items, having acquired weight, propagate it further – back to other automobile manufacturers, perhaps – and the process iterates. In this way, we can achieve our two-fold objective: items highly related to Honda acquire weight, even without direct co-occurrence; and since the weight “diffuses” as it spreads through the database, we can achieve our “limited” form of transitivity. The weight propagation schemes that we work with can be viewed as a type of non-linear dynamical system derived from the table of categorical data. Thus, our work demonstrates a natural transformation from categorical datasets to such dynamical systems that facilitates a new approach to the clustering and mining of this type of data. Much of the analysis of such dynamical systems is beyond the reach of current mathematics [M. Shub, personal communication]. In a later section, we discuss some preliminary mathematical results that we are able to prove about our method, including convergence properties in some cases and an interpretation of a more complex formulation in terms of the enumeration of certain types of labeled trees. We also indicate certain aspects of the analysis that appear to be quite difficult, given the current state of knowledge concerning non-linear dynamical systems. 
It is important to note that even for dynamical systems that cannot be analyzed mathematically, we find experimentally that they exhibit quite orderly behavior and consistently elicit meaningful structure implicit in tables of categorical data, typically in linear time. That such systems should be effective in this context is further reinforced by the connections we draw with spectral-graph theory – in particular, there are parallels between our weight propagation schemes and the types of iterative eigenvector-based methods that arise in spectral-graph partitioning. (See Sect. 7 in the appendix for a further discussion of these connections.) Organization of the paper. In the following section, we describe in detail the algorithmic components underlying our approach, and the way in which they can be used for clustering and mining categorical data. We also discuss there the mathematical properties of the algorithms that we have been able to prove. In Sect. 3, we describe Stirr (sieving through iterated relational reinforcement), an experimental system for studying the use of this technique for analyzing categorical tables. Because the iterative weight propagation algorithms underlying Stirr converge in a small number of iterations in practice, the total computational effort in analyzing a table by our methods is typically linear in the size of the table. By providing quantitative measures of similarity among items, Stirr also provides a natural framework for effectively vi- D. Gibson et al.: Clustering categorical data W Attribute Tuple a b c 1. A W 1 2. A X 1 3. B W 2 4. B X 2 5. C Y 3 6. C Z 3 1 A X 2 B is represented as Y 3 C Z Fig. 1. The representation of a collection of tuples sualizing the underlying relational data; this visualizer is depicted in Fig. 3. In Sects. 4 and 5, we evaluate the performance of Stirr on data from synthetic as well as from real-world sources. In Sect. 4, we report on experience with the Stirr system on a class of inputs we will call quasi-random inputs. Such inputs consist of carefully “planted” structure among random noise, and are natural for testing mining algorithms. The advantage of such quasi-random data is that we may control in a precise manner the various parameters associated with the dynamical systems we use, studying in the process their efficacy at discovering the appropriate planted structure. In Sect. 5, we report on our experiences with Stirr on “real-world” data drawn from a range of settings. In all these settings, the main clusters computed by Stirr correspond naturally to multiple, highly correlated groups of items within the data. 2 Algorithms and basic analysis We now describe our core algorithm, which produces a dynamical system from a table of categorical data. We begin with a table of relational data: we view such a table as consisting of a set of k fields, each of which can assume one of many possible values. (We will also refer to the fields as columns.) We represent each possible value in each possible field by an abstract node, and we represent the data as a set T of tuples – each tuple τ ∈ T consists of one node from each field. See Fig. 1 for an example of this representation, on a table with three fields. A configuration is an assignment of a weight wv to each node v; we will refer to the entire configuration as simply w (which may be thought of as a vector of values wv ). We draw on the high-level notion of a “weight propagation” scheme developed in the introduction. 
We will only be concerned with the relative values of the weights in a configuration, and so we will use a normalization function N (w) to uniformly rescale the set of weights, preventing their absolute values from growing too large. In our experiments, we have made use of two different normalization functions: a local function that rescales weights so that the squares of the weights within each field add up to 1; and a global function that rescales weights so that the squares of all weights in all fields add up to 1. These two normalization functions yield very similar results qualitatively; unless explicitly stated otherwise, we will focus on the local normalization function in what follows. For our purposes, a dynamical system is the repeated application of a function f on some set of values. A fixed point of a dynamical system is a point u for which f (u) = u. D. Gibson et al.: Clustering categorical data That is, it is a point which remains unchanged under the (repeated) application of f . Fixed points are one of the central objects of study in dynamical systems, and by iterating f , one often reaches a fixed point. In Sect. 7, we provide a basic introduction to some terminology and background on dynamical systems. We also discuss there the connections between such systems and spectral-graph theory, discussed above as a powerful method for attacking discrete partitioning problems. Spectral-graph theory is based on studying linear weight propagation schemes on graph structures; much of the motivation for the weight propagation algorithms underlying Stirr derives from this connection, though the description of our methods here will be entirely self-contained. The dynamical system for a set of tuples will be based on a function f , which maps a current configuration to a new configuration. We define f as follows. We choose a combiner function ⊕, defined below, and update each weight wv as follows. To update the weight wv : For each tuple τ = {v, u1 , . . . , uk−1 } containing v, do xτ ← ⊕(u1 , . . . , uk−1 ). P x . wv ← τ τ Thus, the weight of v is updated by applying ⊕ separately to the members of all tuples that contain v, and adding the results. The function f is then computed by updating the weight of each wv as above, and then normalizing the set of weights using N (). This yields a new configuration f (w). Iterating f defines a dynamical system on the set of configurations. We run the system through a specified number of iterations, returning the final set of configuration that we obtain. It is very difficult to rigorously analyze the limiting properties of the weight sets in the fully general case; below we discuss some analytical results that we have obtained in this direction. However, we have observed experimentally that the weight sets generally converge to fixed points or to cycles through a finite set of values. We call such a final configuration a basin; the term is meant to capture the intuitive notion of a configuration that “attracts” a large number of starting configurations. To complete the description of our algorithms, we discuss the following issues: our choice of combining operators; our method for choosing an initial configuration; and some additional techniques that allow us to “focus” the dynamical system on certain parts of the set of tuples. Choosing a combining operator. Our potential choices for ⊕ are the following; in Sect. 4, we offer some additional insights on the pros and cons among these choices. – The product operator Π: ⊕(w1 , . . . 
, wk ) = w1 w2 · · · wk . – The addition operator: ⊕(w1 , . . . , wk ) = w1 +w2 +· · ·+wk . – A generalization of the addition operator that we call the Sp combining rule, where p is an odd natural number. Sp (w1 , . . . , wk ) = (w1p + · · · + wkp )(1/p) . Note that addition is simply the S1 rule. – A “limiting” version of the Sp rules, which we refer to as S∞ . S∞ (w1 , . . . , wk ) is defined to be equal to wi , where wi has the largest absolute value among the weights in {w1 , . . . , wk }. (Ties can be broken in a number of arbitrary ways.) 225 Thus, the S1 combining rule is linear with respect to the tuples. The Π and Sp rules for p > 1, on the other hand, involve a non-linear term for each individual tuple, and thus have the potential to encode co-occurrence within tuples more strongly. S∞ is an especially appealing rule in this sense, since it is non-linear, particularly fast to compute, and also appears to have certain useful “sum-like” properties. Non-principal basins. Much of the power of spectral-graph partitioning for discrete-clustering problems derives from its use of non-principal eigenvectors. In particular, it provides a means for assigning positive and negative weights to nodes of a graph, in such a way that the nodes with positive weight are typically well separated from the nodes of negative weight. We refer the reader to Sect. 7 for more details. One of our contributions here is the generalization of the notion of non-principal eigenvectors (from linear maps) to our non-linear dynamical systems. To find the eigenvectors of a linear system Ax = λx, the power iteration method maintains a set of orthonormal vectors xhji , on which the following iteration is performed. First each of the {xhji } is multiplied by A; then the Gram-Schmidt algorithm is applied to the sequence of vectors {xh1i , xh2i , . . .}, making them orthonormal. Note that the order of the vectors is crucial in the Gram-Schmidt procedure; in particular, the direction of the first vector xh1i is unaffected. In an analogous way, we can maintain several configurations wh1i , . . . , whmi and perform the following iteration: Update whii ←− f (whii ), for i = 1, 2, . . . , m. Apply the Gram-Schmidt algorithm to the sequence of vectors {wh1i , . . . , whmi }. Since the normalized value of the configuration wh1i will be unaffected by the Gram-Schmidt procedure, it iterates to a basin just as before; we refer to this as the “principal basin.” The updating of the other configurations is interleaved with orthonormalization steps; we refer to these other configurations as “non-principal basins.” In the experiments that follow, we will see that these non-principal basins provide structural information about the data in a fashion analogous to the role played by nonprincipal eigenvectors in the context of spectral partitioning. Specifically, forcing the configurations {whji } to remain orthonormal as vectors introduces negative weights into the configurations. Just as the nodes of large weight in the basin tend to represent natural “groupings” of the data, the positive and negative weights in the non-principal basins tend to represent good partitions of the data. The nodes with large positive weight, and the nodes with large negative weight, tend to represent “dense regions” in the data with few connections between them. Choosing an initial configuration. There are a number of methods for choosing the initial configuration for the iteration. 
If we are not trying to “focus” the weight in a particular portion of the set of tuples, we can choose the uniform initialization (all weights set to 1, then normalized) or the 226 random initialization (each weight set to an independently chosen random value in [0, 1], then normalized). For combining operators which are sensitive to the choice of initial configuration – the product rule is a basic example – one can focus the weight on a particular portion of the set of tuples by initializing a configuration over a particular node. To initialize w over the node v, we set wu = 1 for every node u appearing in a tuple with v, and wu0 = 0 for every other node u0 . (We then normalize w.) In this way, we hope to cause nodes “close” to v to acquire large weight. This notion will be addressed in the experiments of the following sections. Masking and modification. To augment or diminish the influence of certain nodes, we can apply a range of “local modifications” to the function f . In particular, for a particular node v, we can compose f with the operation mask(v), which sets wv = 0 at the end of each iteration. Alternately, we can compose f with augment(v, x), which adds a weight of x > 0 to wv at the end of each iteration. This latter operation “favors” v during the iterations, which can have the effect of increasing the weights of all nodes that have significant co-occurrence in tuples with v. Relation to other iterative techniques The body of mathematical work that most closely motivates our current approach is spectral-graph theory, which studies certain algebraic invariants of graphs. We discuss these connections in some detail in Sect. 7. The iteration of systems of non-linear equations arises also in connection with neural networks [Ritter and Martinez 1992, Rumelhart and McClelland 1986]. In many of these neural network settings, the emphasis is quite different from ours – for example, one is concerned with the approximation of binary decision rules, or with the approximation of connection strengths among nodes on a discrete lattice (e.g., the work on Kohonen maps ). In our work, on the other hand, the analogs of “connections” – the co-occurrences among tuples in a database – typically do not have any geometric structure that can be exploited, and our approach is to treat the connections as fixed, while individual item weights are updated. In a related vein, the methodology of principal curves allows for a non-linear approach to dimension reduction, which can sometimes achieve significant gains over linear methods such as principal-component analysis [Hastie and Stuetzle 1989, Kramer 1991]. This method thus implicitly requires an embedding in a given space in which to perform the dimension reduction; once this dimension reduction is performed, one can then approach the low-dimensional clustering problem on the data in a number of standard ways. Analysis First, we briefly discuss the running time of a single iteration of the Stirr algorithm, maintaining just the principal basin. For a fixed, constant number of columns, the combining rules D. Gibson et al.: Clustering categorical data we use require constant computation time. Thus, we may perform a single iteration via a single pass through the set of tuples; for each tuple, we modify the weights of the nodes it contains using the combining rule. Thus, the amount of computation required in an iteration is linear in the number of tuples. An experimental view of this linear relationship can be seen in the running-time plot of Fig. 4, below. 
The number of iterations required for approximate convergence is, on the other hand, a much more complex issue, and our theoretical results below are designed to partially address the question. There is much that remains unknown about the dynamical systems we use here, and this reflects the lack of existing mathematical techniques for proving properties of general non-linear dynamical systems. This, of course, does not prevent them from being useful in the analysis of categorical tables; and we now report on some of the basic properties that we are able to establish analytically about the dynamical systems we use. In addition, we state some basic conjectures that we hope will motivate further work on the theoretical underpinnings of these models. We first state a basic convergence result for the (linear) S1 combining rule, subject to a standard non-degeneracy assumption that will be exposed in the proof. Theorem 1. (Subject to a non-degeneracy assumption.) Let T be a set of tuples, with k ≥ 3 fields, and suppose that global normalization is used. For every initial configuration w, w converges to a fixed point under the repeated application of the combining rule S1 . Proof. Let n denote the total number of nodes in all fields. We label the nodes v1 , . . . , vn . Let A be the following n × n / j is equal to the matrix: Aii = 0 for each i; and Aij for i = number of tuples in which vi and vj appear together. Let w denote any configuration. We view w as a vector with n coordinates, the ith denoting the value wvi ; because we are considering a global normalization function, we may assume that w is a unit vector. If we apply the S1 rule once to w, the new value of the ith coordinate is easily seen to be the sum X Aij wj . j/ =i In vector notation, the new configuration after global normalization is thus simply the unit vector in the direction of Aw. Hence, after t applications of the S1 rule, the configuration is the unit vector in the direction of At w. Now we state our non-degeneracy condition: we will assume that the matrix A has a unique eigenvalue λ∗ of maximum absolute value, with associated eigenvector z ∗ . In this case, a standard result of linear algebra (see, e.g., ) implies that for any initial vector w not orthogonal to z ∗ , the sequence of unit vectors in the directions w, Aw, A2 w, . . . , At w . . . converges to z ∗ . More generally, this sequence will converge to the eigenvector associated with the eigenvalue of maximum absolute value among those not orthogonal to the initial vector w. Thus, Theorem 1 establishes that almost all initial configurations w converge to the same fixed point; namely, D. Gibson et al.: Clustering categorical data the eigenvector z ∗ , viewed as a configuration. In particular, since the matrix A in the proof of Theorem 1 has only non-negative entries, so does z ∗ . Hence, any non-zero initial configuration w consisting entirely of non-negative values will not be orthogonal to z ∗ , and so the S1 rule will converge to z ∗ starting from w. If we consider iterating a linear system on an undirected graph G (as in the power iteration method discussed above), there is a natural combinatorial interpretation of the weights that are computed: they count the number of walks in G starting from each vertex. By a walk here, we mean a sequence of adjacent vertices in G, with vertices allowed to appear more than once in the sequence. We now show that the product rule Π has an analogous combinatorial interpretation for a set T of tuples. 
This combinatorial interpretation indicates a precise sense in which the product rule discovers “dense” regions in the data. To state this result, we need some additional terminology. If T is a set of tuples over a dataset with d fields, v is a node of T , and j is a natural number, then a (T, v, j)-tree is a complete (d − 1)-ary tree Z of height j whose vertices are labeled with nodes of T as follows. (1) The root is labeled with v. (2) Suppose a vertex α of Z is labeled with a node u ∈ T ; then there is a tuple τ of T so that u ∈ τ and the children of α are labeled with the elements of τ − {u}. Two (T, v, j)-trees are equivalent if there is a one-to-one correspondence between their vertices that respects the tree structure and the labeling; they are distinct otherwise. Note that if G is a graph, then a (G, v, j)-tree is precisely a walk of length j in G starting from v. Theorem 2. Let T be a set of tuples, and consider the dynamical system defined on it by the product rule Π with no normalization. Let w0 denote the initial configuration in which each node is assigned the value 1. For a node v of T , let wvj denote the weight wv assigned to v after j iterations. Then wvj is equal to the number of distinct (T, v, j)-trees. Proof. For a node v and a number j, let q(v, j) denote the number of distinct (T, v, j)-trees. We prove by induction on j that q(v, j) = wvj . Note that when j = 0, wvj = 1; and there is a single (T, v, j) tree (a single node), so q(v, j) = 1. Now consider a value of j larger than 0, and assume the claim holds for all smaller values of j. An arbitrary (T, v, j) tree Z can be produced as follows: we choose an arbitrary tuple (v, u1 , u2 , . . . , uk−1 ) containing v; we choose an arbitrary (T, ui , j − 1) tree Zi for each i = 1, . . . , k − 1; and we define Z to be the tree with root labeled v and with Z1 , . . . , Zk−1 the subtrees of the root. Thus, we have X q(u1 , j − 1)q(u2 , j − 1) · · · q(uk−1 , j − 1) q(v, j) = (v,u1 ,...,uk ) = X (v,u1 ,...,uk ) wuj−1 wuj−1 · · · wuj−1 1 2 k−1 = wvj , where we pass from the first line to the second using the induction hypothesis, and we pass from the second line to the third using the definition of the product rule. Next, we establish a basic theorem that partially underlies our intuition about the partitioning properties of these 227 dynamical systems. By a complete hypergraph of size s, we mean a set of tuples over a dataset with s nodes in each column, so that there is a tuple for every choice of a node from each column. Let Tab denote a set of tuples consisting of the disjoint union of a complete hypergraph A of size a and a complete hypergraph B of size b. We define 1A to be the configuration that assigns a weight of 1 to each node in A, and a weight of 0 to each node in B; we define 1B analogously. Theorem 3. Consider the product rule on the set of tuples T , with local (column-by-column) normalization. (i) Let w be a randomly initialized configuration on the set of tuples Tab , with a = (1 + δ)b for a number δ > 0. Then, there is a constant α > 0 depending on δ, so that with probability at least 1 − e−α , w will converge under the product rule Π to the fixed point N (1A ). (ii) Let w0 be a configuration initialized over a random a , w0 converges to node v ∈ Tab . Then, with probability a+b b N (1A ) under Π, and with probability a+b , it converges to N (1B ). Proof. Consider the dynamical system defined by the product rule on a complete hypergraph of size s, with a configuration of weights w. 
Since we are only concerned with relative values of weights, it is easiest to analyze this first in the absence of any normalization. Let Wi denote the sum of the weights of all nodes in ith column. Then, the updated value of the weight of a node v in column j is X wv1 · · · wvj−1 wvj+1 · · · wvk , (v1 ,...,vj−1 ,v,vj+1 ,...,vk ) where the sum is over all possible choices of a node from each other column. Rewriting this sum, we see that it is equal to Y Wi . (1) W1 W2 · · · Wj−1 Wj+1 · · · Wk = i/ =j First we consider (i). Recall that a random initialization consists of the assignment of a weight chosen uniformly from [0, 1] to each node. Let Ha denote the portion of T consisting of a complete hypergraph of size a, and let Hb denote the portion of T consisting of a complete hypergraph of size b. Let Wi denote the sum of the weight in column i of Ha and let Wi0 denote the sum of the weight in column i of Hb . Each Wi is a sum of a independent random variables, each of mean 1/2; similarly each Wj0 is a sum of b independent random variables, each of mean 1/2. Thus, the mean of each Wi exceeds the mean of each Wj0 by (a − b)/2. By standard bounds on the sum of independent random variables (see, e.g., ), there is an absolute constant α > 0, depending on δ = (a − b)/b, so that with probability at least 1 − e−α , we have mini Wi > maxj Wj0 + (a − b)/4 ≥ maxj Wj0 + 1/4. Assume this event happens, and let W ∗ = maxj Wj0 . Consider the weight of a node v in Hb as we run several iterations, with no normalization. After the first iteration, its weight can be computed from Eq. 1 to be at most (W ∗ )k−1 . Each column sum in Hb is then at most b(W ∗ )k−1 , and so after the second iteration, the weight of v is at most 228 D. Gibson et al.: Clustering categorical data Fig. 2. Textual view of system bk−1 (W ∗ )(k−1) . More generally, after j iterations, an upper bound on the weight of the node v can be determined inductively from Eq. 1; we obtain the upper bound 2 b(k−1)+(k−1) 2 j−1 +···+(k−1) j (W ∗ )(k−1) . By similar reasoning, the weight of a node u in Ha after j iterations is at least 2 a(k−1)+(k−1) +···+(k−1) j−1 j 1 (W ∗ + )(k−1) . 4 Thus, as j increases without bound, we find first of all that the ratio of the minimum weight in Ha to the maximum weight in Hb increases without bound. Second, all the weights in a given column of Ha will be the same after the first iteration, since the expression in Eq. 1 depends not on a node’s identity, but only on its column. It follows from these two facts that the normalized set of weights in each column converges to N (1A ). The proof of statement (ii) is easier. With probability a b a+b , a node in Ha will be chosen, and with probability a+b , a node in Hb will be chosen. In the former case, we see by Eq. 1 that the normalized value of the configuration in each column will converge in one step to N (1A ); in the latter case, it will converge in one step to N (1B ). We conjecture that qualitatively similar behavior occurs when Tab consists of two “dense” sets of tuples that are “sparsely” connected to each other. At an experimental level, this is borne out by the results of subsequent sections. The previous theorem shows that it is possible for the product rule to converge to different basins, depending on the starting point. We now formulate a general statement that addresses a notion of completeness for such systems: all sufficiently “large” basins should be discovered within a reasonable number of trials. 
Let us say that a basin w∗ has measure if at least an fraction of all random starting configurations converge to w∗ on iterating a given combining rule. Let B be the number of such basins. Theorem 4. With probability at least 1 − 1/B , iterating the system from 2−1 ln B random initial configurations is sufficient to discover all basins of measure at least . Proof. Consider a particular basin of measure at least . The probability that it is not discovered in k independent runs from random initial configurations is at most (1−)k . Setting k = 2−1 ln B , we see that this probability is at most p = e−2 ln B = 1 . B2 (2) Since the probability of the union of events is no more than the sum of their probabilities, the probability that any basin of measure at least is not discovered is at most pB = 1/B . This gives a precise sense – via a standard type of sampling result – in which all sufficiently large basins will be discovered rapidly by iterating from random starts. (Note that this discovery is relatively efficient – since B can be as large as −1 , it can take up to −1 starting points to discover all the basins, and our randomized approach uses 2−1 ln B .) 3 Overview of experiments 3.1 System overview The Stirr experimental system consists of a computation kernel written in C, and a control layer and GUI written in Tcl/Tk. We prepared input datasets in straightforward whitespace-delimited text files. The output can be viewed in a text window showing nodes of highest weight in each column, or as a graphical plot of node weights. The GUI allows us to modify parameters quickly and easily. Figure 2 shows the control interface and textual output. The dataset being used was an adult census database, using just four of the categorical attributes. A graphical view of the same data (specifically, the first non-principal basin wh2i ) is shown in Fig. 3. Each node is positioned according to its column and calculated weight. The nodes in each tuple are joined by a line (consisting of three segments in this case). The Density control indicates only 1% of the tuples were plotted. The Jitter value “smears” each line’s position by a small random amount, to separate lines meeting at the same point. There is a simple tuple- coloring mechanism, which changes the color of a selected line to show the weight assignment on that tuple. The value of the visualizer in presenting similar nodes to the user is evident in Fig. 3; this may be used, for instance, to point to a subset of nodes for masking or modification D. Gibson et al.: Clustering categorical data 229 random multidimensional points is a region with a higher probability density than the ambient.) Our hope, then, might be that Stirr discovers such clumps. We first list the aspects of Stirr that we wish to study; we follow this with a description of the quasi-random experiment in each case. +1 -1 Column Fig. 3. Graphical view of system (recall from Sect. 2). It would provide a useful framework for an interactive data-mining tool. 3.2 Performance Our algorithm scales well with increasing data volumes. We measured the running time for ten iterations, for varying numbers of tuples and numbers of columns. The S∞ combiner was used, and the principal basin together with the first four non-principal basins were computed. The plot in Fig. 4 demonstrates that the running time is linear in number of tuples, The time taken to read the database and initialize the iteration is negligible. 
Qualitatively similar plots of running times are obtained for the other main combining rules. The running times were measured on a 128-MB, 200MHz Pentium Pro machine, running Linux, with no other user processes. The dataset was generated randomly. 4 Quasi-random inputs In this section, we report on extensive experience with the Stirr system on the class of quasi-random inputs discussed in the introduction. For the purposes of testing the performance of a mining algorithm, such inputs are a natural setting in which to search for “planted” structures among random noise. This is a technique used in the study of computationally difficult problems. For instance, in combinatorial optimization, the work of analyzes graph bisection heuristics in quasi-random graphs in which a “small” bisection is hidden; adopt the same approach to the analysis of graphcoloring algorithms. The advantage of such quasi-random data is that we may control in a precise manner the various parameters associated with the dynamical systems we use, studying in the process their efficacy at discovering the appropriate planted structure. Consider a randomly generated categorical table in which we plant extra random tuples involving only a small number of nodes in each column. This subset of nodes may be thought of as a “clump” in the data, since these nodes co-occur more often in the table than in a purely random table. (The analog in numerical data with 1. How well does Stirr distill a clump in which the nodes have an above-average rate of co-occurrence? Intuitively, this is of interest because its analog in real data could be the set of people (in a travel database) with similar travel patterns, or a set of failures (in a maintenance database) that stem from certain suppliers, manufacturing plants, etc. Whereas a conventional data analysis technique would be faced with enumerating all crossproducts across columns of all possible subsets in each column, and examining the support of each, the Stirr technique appears to avoid this prohibitive computation. How quickly does Stirr elicit such structure (say, as a function of the number of iterations, the relative density of the clump to the background, the combine operator ⊕, etc.)? Our experiments show that for quasi-random inputs, such structure emerges very quickly, typically in about five iterations. 2. How well does Stirr separate distinct planted clumps using non-principal basins? For instance, if in a random table we were to plant two clumps each involving a distinct set of nodes, will the first non-principal basin place the clumps at opposite ends of the partition? Intuition from spectral-graph partitioning suggests that typically some non-principal partition will separate a pair of clumps; our experiments show that Stirr usually achieves such a separation within the first non-principal basin, for quasi-random inputs. A related study involves two clumps of different densities planted in a random table. For random starts and the product rule, how often does the denser clump appear in the principal basin, as a function of the relative densities? 3. How well does Stirr cope with tables containing clumps in a few columns, with the remaining columns being random? We study this by adding (to a randomly generated table) additional random tuples concentrated on a few nodes in three of the columns; the entries for the remaining columns are randomly distributed over all nodes. 
Thus, these remaining columns correspond to irrelevant attributes that might (by their totally random distribution) mask the clump planted in columns 1–3. This is a standard problem in data mining, where we seek to mask out irrelevant factors (here columns) from a pattern. Our experiments show that Stirr will indeed mask out such “irrelevant” attributes; we know of no analog for this phenomenon in spectral-graph partitioning. In the following plots, each point of each curve is the average of 100 quasi-random experiments, typically using ten quasi-random tables and ten random starting configurations on each. Thus, each of the plots summarizes data from several thousand quasi-random experiments. Wherever we obtain qualitatively similar results while varying table sizes, we have chosen to present the results for a fixed, representative table size while varying other parameters of interest. 230 D. Gibson et al.: Clustering categorical data 35 30 8 25 Time (s) 7 20 6 15 5 4 10 3 columns 5 Fig. 4. Running times. We have truncated the y axis for compactness; all the lines remain straight up to 1000000 tuples 0 0 5000 10000 15000 20000 Number of tuples in table 25000 Clump purity, percentage 100 30 extra 40 extra 50 extra 75 extra 100 extra 75 50 25 2 4 6 8 Number of iterations 10 Fig. 5. Emergence of the principal basin Clump purity, deviation/mean 0.25 30 extra 40 extra 50 extra 75 extra 100 extra 0.2 0.15 0.1 0.05 0 2 3 4 5 6 7 Number of iterations 8 9 10 Fig. 6. Variance of the principal basin experiments Figure 5 captures the ability of Stirr to extract a single planted clump from tables. This plot represents data for a quasi-random table with three columns, and 1000 nodes in each column. In addition to 5000 random triples on these nodes, we select 10 nodes in each column and plant additional random tuples within these (10 × 3 = 30) nodes. We vary the number of triples in this planted clump over the set {30, 40, 50, 75, 100}. On the x-axis we vary the num- 30000 ber of Stirr iterations using the S∞ operation for the ⊕ function, from 2 to 10. On the y-axis we plot the percentage of the nodes from the the planted clump that appear within the top ten positions of the principal basin; thus, this fraction would ideally be high. A random set of 10 nodes of any column would have 100 × 10/1000 = 1% of its nodes from this clump; Stirr obtains a significant boost (to 25% or greater) even when the planted clump only contains only 30 extra triples in addition to the 5000 random “background” triples. As the number of triples in this planted clump increases further, its purity in the principal basin increases towards 100%. As mentioned before, we obtained qualitatively the same plots for other table sizes, and therefore display results for three columns with 1000 nodes each. The major lesson from these and other similar experiments is that, typically, the structure in the principal basin emerges within a small number of iterations (under five). Coupled with the fact that the computation during a single iteration is linear in the number of tuples, this suggests that the structure emerges in time linear in the size of the table. Our experiments involve two sources of randomization: in the construction of the table, and in the random choice of Stirr’s starting configuration. We find that the variance in the experimental results we obtain, over these random choices, is quite small. For the single-clump experiment in the previous paragraph, we summarize this in Fig. 
6, computing the ratio of the standard deviation to the mean for the “purity” parameter defined above. Note additionally that for all but the sparsest clump density, this ratio decreases rapidly as the number of iterations increases. In Fig. 7, we study the ability of Stirr to separate two planted clumps. As before, we only display results for a three-column table with 1000 nodes in each column, with 5000 random “background” triples. To this, we add two clumps at disjoint sets of 10 selected nodes in each column. As in the first experiment, we vary the number of tuples in each of the planted clumps over the set {30, 40, 50, 75, 100}; on the x-axis is the number of iterations, and on the yaxis we plot the following measure of separation of the two clumps. We make the stringent demand in this experiment 231 30 extra 40 extra 50 extra 75 extra 100 extra 0.75 0.5 0.25 5 10 15 20 Number of iterations 25 Fig. 7. Separating two clumps using the first non-principal basin that the separation manifest itself in the first non-principal basin (rather than by any other non-principal basin). Intuition from spectral-graph partitioning suggests that the two clumps will be placed at opposite ends of this non-principal basin. Let a0 and b0 be the numbers of nodes from clumps A and B at one end of this basin, and a1 and b1 be the corresponding numbers at the other end. Clearly, a0 + a1 ≤ 30, and likewise for the bi . Under perfect separation we would have a0 = 30 and b1 = 30, or vice-versa. Our separation measure in this case (this generalizes easily to other clump and table sizes) is s(A, B) = |a0 − b0 | + |a1 − b1 | . 60 Thus, if the nodes of the two clumps are present in roughly equal proportions at the extremities of the partition, the measure is close to zero; as we have closer to 30 nodes from each clump at each end of the partition, the measure gets closer to 1. This measure is fairly demanding, for several reasons: (1) the first non-principal basin may not place many nodes of the clumps at its extremities, so that a0 + a1 and b0 + b1 may both be small compared to 30; (2) the measure penalizes each end of the partition for nodes from the “other” clump. Thus, if each end of the partition has 75% of the nodes from one clump and 25% from the other, we score only 0.5 by our measure. The summary findings from these experiments are that (1) the number of iterations to separation is relatively small (around 15, larger than for extracting the dominant clump using the principal basin), and fairly insensitive to the clump density; (2) the separation achieved is quite marked. What happens when the quasi-random input has two clumps of different densities planted in it, and the product combiner is used? Figure 8 studies the effect (again in three-column tables with 1000 nodes per column) when the table has 5000 random tuples, plus two clumps made up of two disjoint sets (call them A and B) of 10 nodes per column. Each random clump is created by adding nA random tuples containing only nodes from within A and nB tuples within B, where nA > nB . For random starts (over a number of such quasi-random tables), how often does the principal basin focus on the heavier clump, as a function of nA /nB ? The y-axis in this plot is the ratio of the number of nodes from A to the number of nodes from B in the top 1 0.75 0.5 1 1.1 1.2 1.3 1.4 Ratio: denser to lighter clump 1.5 Fig. 8. 
Fraction of random starts ending in the denser clump Clump purity vs number of redundant cols 100 Clump purity, percentage Clump purity, s(A,B) 1 Fraction of nodes from denser clump in principal fixed point D. Gibson et al.: Clustering categorical data 30 extra tuples 40 extra tuples 50 extra tuples 75 50 25 4 5 6 number of redundant cols 7 Fig. 9. Resilience to random additional columns 10 positions of each column, in the principal basin. When nA /nB is 1, this fraction is 0.5 on average. As evident from the plots, Stirr boosts the denser clump. Figure 9 depicts the same setting as in Fig. 5 above, but with additional columns added on to study the effect of irrelevant attributes. Specifically, the clump of additional random tuples that we plant is focused on 10 nodes in columns 1–3, but in the remaining columns is completely random. We thus think of these remaining columns as “irrelevant attributes” that would detract from finding the planted clump. In fact, as seen, Stirr suffers a relatively modest and graceful loss as the number of irrelevant columns is increased. Finally, we use experiments on quasi-random data to understand how the choice of combine operator (⊕) affects speed of convergence. Figure 10 shows experiments with different choices for the combine operator. Max (i.e., S∞ ) is the clear winner. Product did not converge monotonically, so second place falls to addition (S1 ), or to S2 or S5 if the arithmetic speed can be improved. Each data point in each experiment is an average over 100 runs: using ten random tables, with ten random starting 232 D. Gibson et al.: Clustering categorical data 30 10 9 7 Time (s) 25 Max 5-norm 2-norm Sum Geometric Harmonic Product 6 5 4 Max 5-norm 2-norm Sum Geometric Harmonic Product 20 Iterations 8 15 Fig. 10. Convergence for different ⊕. Table has 2100 rows: 1000 using one set of attributes, A, and 1100 with another, B. Convergence is measured as the number of B attributes in the top 20 positions. The experiment measured average running times and iterations until 90% convergence 10 3 2 5 1 0 40 50 60 70 80 % converged 90 100 40 50 points. The the number of entities in the table here is fixed; we obtained similar results for other table sizes. The convergence requires very few iterations, even with the fairly strict separation measure used in the two-cluster experiment. Since the time for each iteration is linear in the number of tuples, we can expect to extract structure in essentially linear time. 5 Real data We now discuss some of our experience with this method on a variety of “real” data sets. Basic data sets. First, we consider two data sets that illustrate co-occurrence patterns. (1) Bibliographic data. We took publicly accessible bibliographic databases of 7000 papers from database research , and of 30,000 papers written on theoretical computer science and related fields , and constructed two data sets each with four columns as follows. For each paper, we recorded the name of the first author, the name of the second author, the conference or journal of publication, and the year of publication. Thus, each paper became a relation of the following form: 60 70 80 % converged 90 100 local cause-effect patterns in the data. We approach this as follows. Given a long string of sequentially occurring events, we label them as e1 , e2 , . . . , en . We then imagine a “sliding window” of length ` passing over the data, and construct `tuples (ei , ei+1 , . . . , ei+`−1 ) for each i between 1 and n−`+1. 
For data exhibiting a strong local cause-effect structure, we hope to see such structure emerge from the co-occurrences among events close in time. This framework bears some similarity to a method of for defining frequent “episodes” in sequential data. Our approach is quite different from theirs, however, since we do not search separately for each discrete type of episode; rather, we use the core algorithm of Stirr to group events that co-occur frequently in temporal proximity, without attempting to impose an episode structure on them. The main example we study here comes from a readily accessible source of sequential data, exhibiting strong elements of causality: chess moves. Typically, the identity of a single move in a chess game allows one to make plausible inferences about the identity of the moves that occurred nearby in time. To be concrete, we took the moves of 2000 games played from a common opening, and used a variant of the method above to create the following sets of 4-tuples: (White’s nth move, Black’s nth move, White’s (n + 1)st move, Black’s (n + 1)st move) (First-Author, Second-Author, Conference/ Journal Name, Year). Note that we deliberately treat the year, although a numerical field, as a categorical attribute for the purposes of our experiments. In addition to the tables of theory papers and database papers, we also constructed a mixed table with both sets of papers. (2) Login data. From four heavily used server hosts within IBM Almaden, we compiled the history of user logins over a several-month period. This represented a total of 47,000 individual logins. For each login, we recorded the user name, the remote host from which the login occurred, the hour of the login, and the hour of the corresponding logout. Thus, we created the tuples of the form: (user, remotehost, login-time, logout-time). Again, the hours were treated as categorical attributes. Sequential data sets. Next, we discuss a natural approach by which this method can be used to analyze sequential datasets. The hope here is to use local co-occurrences in time to elicit Overview. At a very high level, the following phenomena emerge. First, the principal and non-principal basins computed by the method correspond to dense regions of the data with natural interpretations. For the sequential data, the nodes with large weight in the main basins exhibit a natural type of cause-effect relationship in the domains studied. The fact that we have treated numerical attributes as categorical data – without attempting to take advantage of their numerical values – allows us to make some interesting observations about the power of the method at “pulling together” correlated nodes. Specifically, we will see in the examples below that nodes which are close as numbers are typically grouped together in the basin computations, purely through their co-occurrence with other categorical attributes. This suggests a strong sense in which co-occurrence among categorical data can play a role similar to that of proximity in numerical data, for the purpose of clustering. We divide the discussion below into two main portions, based on the combining rule ⊕ that we use. D. 
5.1 The Sp rules

We now discuss our experience with the Sp combining rules. We ran the S∞ dynamical system on a table derived as described above from a mixture of theory and database research papers. The first non-principal basin is quite striking: in Fig. 11, we list the ten items with the most positive and most negative weights in each column (together with their weights) for this first non-principal basin; a double horizontal line separates these groups. (Note that no row in Fig. 11 need be a tuple in the input; the table simply lists, for each column, the ten nodes with the most positive (and the ten with the most negative) values.) Note also that our parser identified all people by their last names so that, for instance, the Chen at the top is actually several Chens taken together.

Fig. 11. Separating a mixture of theory and database papers

First author             Second author            Conference/Journal     Year
 0.1811: Chen             0.1629: Chen             0.317: TCS             0.563: 1995
 0.1185: Chang            0.1289: Wang             0.2264: IPL            0.5596: 1994
 0.1147: Zhang            0.1007: COMPREVS         0.195: LNCS            0.4332: 1996
 0.1119: Agarwal          0.1: Li                  0.1633: INFCTRL        0.151: 1993
 0.1104: Bellare          0.09004: Rozenberg       0.1626: DAMATH         0.07487: 1992
 0.1053: Gu               0.08837: Igarashi        0.1464: JPDC           0.0155: 1976
 0.09467: Dolev           0.08171: Sharir          0.1371: SODA           0.005276: 1972
 0.08571: Hemaspaand      0.0797: Huang            0.1266: STOC           0.004569: 1985
 0.08423: Farach          0.07924: Maurer          0.1074: IEEETC         0.001891: 1973
 0.08232: Ehrig           0.0783: Lee              0.1002: JCSS           0.001679: 1970
=========================================================================================
-0.1195: Stonebrake      -0.1068: Wiederhold      -0.384: IEEEDataEng    -0.2165: 1986
-0.1059: Agrawal         -0.09615: David          -0.256: VLDB           -0.1983: 1987
-0.1023: Wiederhold      -0.08343: DeWitt         -0.2392: SIGMOD        -0.1519: 1989
-0.07735: Abiteboul      -0.07643: Richard        -0.1504: PODS          -0.14: 1988
-0.06783: Yu             -0.07454: Stonebrak      -0.1397: ACMTDS        -0.11: 1984
-0.06722: Navathe        -0.07403: Michael        -0.1176: IEEETransa    -0.06571: 1975
-0.06615: Litwin         -0.07159: Robert         -0.04832: IEEETrans    -0.06431: 1990
-0.06308: Bernstein      -0.06843: Jagadish       -0.04658: IEEETechn    -0.03765: 1980
-0.0623: Jajodia         -0.06722: James          -0.04533: WorkshopI    -0.03238: 1974
-0.0587: Motro           -0.0671: Navathe         -0.03581: IEEEDBEng    -0.02873: 1982

A number of phenomena are apparent from the above summary views of the theory and database communities (the fact that the theory community ended up at the positive end of the partition and the database community at the negative end is not significant – this depends only on the random initial configuration, and could just as well have been reversed). What is significant is that such a clean dissection resulted from the dynamical system, and that too within the first (rather than some lower) non-principal basin. Note that these phenomena have been preserved despite some noise in the database research bibliography: first names on single-authored papers have sometimes been parsed as second authors. In the same run, the ninth non-principal partition gave a fairly clean separation between database researchers of a more theoretical nature and those who are more system-oriented.

5.2 The product rule

The high-level observations above apply to our experience with the combining rule Π; but the sensitivity of Π to the initial configuration leads to some additional recurring phenomena. First, although there is not a unique fixed point that all initial configurations converge to, there is generally a relatively small number of basins. Thus, we can gain additional "similarity" information about initial configurations by looking at those configurations that reach a common basin.
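The idea of grouping initial configurations by the basin they flow to can be illustrated in isolation. The sketch below (ours) uses a toy one-dimensional map as a stand-in for the full iteration, chosen only because it has two attracting fixed points; it groups random starting points by the fixed point they converge to.

import random
from collections import defaultdict

def converge(x, f, iters=200):
    # Iterate f from the starting point x and return the (approximate) limit.
    for _ in range(iters):
        x = f(x)
    return x

# Toy map on [0, 1] with attracting fixed points at 0 and 1; starts below 0.5
# flow to 0, starts above 0.5 flow to 1.
f = lambda x: 3 * x * x - 2 * x ** 3

random.seed(0)
basins = defaultdict(list)
for _ in range(20):
    start = random.random()
    basins[round(converge(start, f), 6)].append(round(start, 3))

# Starting points that reach the same fixed point are grouped together,
# mirroring the "common basin" similarity information described above.
for fixed_point, starts in sorted(basins.items()):
    print(fixed_point, sorted(starts))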
Additionally, the "masking" of frequently occurring nodes can be a useful way to uncover hidden structure in the data. In particular, a frequent node can force nearly all initial configurations to a common basin in which it receives large weight. By masking this node, one can discover a large number of additional basins that otherwise would have been "overwhelmed" in the iterations of the dynamical system.

Bibliographic data. We begin with the bibliographic data on theoretical CS papers. There are a number of co-occurrences that one might hope to analyze in this data, involving authors, conferences, and dates. Thus, if we choose an initial configuration centered around the name of a particular conference and then iterate the product rule, we generally obtain a basin that corresponds naturally to the "community" of individuals who frequently contribute to this conference, together with similar conferences. For example, initializing the configuration over PODS, we obtain the following basin, which by inspection is a remarkable summary view of the PODS community and some of its principal members:

0.4159  Sagiv        0.3043  Shmueli     0.9094  PODS                 0.4170  1986
0.1321  Abiteboul    0.1463  Vardi       0.0380  SICOMP               0.1904  1985
0.0887  Shmueli      0.1043  Sagiv       0.0333  JACM                 0.1433  1987
0.0873  Vardi        0.0934  Yehoshua    0.0112  DataKnowledgeBases   0.1260  1989
0.0484  Ullman       0.0901  Ullman      0.0049  SIGMOD               0.0977  1988
0.0419  Chan         0.0390  Vianu       0.0023  StanfordCSDreport    0.0154  1992

Initializing a configuration over a specific year leads to another very interesting phenomenon, alluded to above: the basin groups "nearby" years together, purely based on co-occurrences in the data. It is, of course, natural to see how this should happen; but it does show a concrete way in which co-occurrence mirrors more traditional notions of "proximity." For example, initializing over the year 1976 leads to the following basin:

0.697  Garey      0.824  Johnson       0.344  JACM     0.461  1976
0.241  Aho        0.059  Hirschberg    0.228  SICOMP   0.152  1978
0.026  Yu         0.039  Hopcroft      0.123  TCS      0.122  1977
0.018  Hassin     0.030  Graham        0.123  TRAUB    0.095  1975
0.004  Chandra    0.009  Tarjan        0.046  IPL      0.082  1985

Observe that the years grouped with 1976 are 1978, 1977, 1975, and 1985. Once again, this is a fairly descriptive summary view of theoretical CS research around 1976. It is important to note that initializing over different years can lead to the same basin; this, as discussed above, is due to the "attracting" nature of strong basins. Thus, for example, we obtain exactly the same basin when we initialize over 1974, although 1974 does not ultimately show up among the top 5 years when the basin is reached.

Login data. We now discuss some examples from the login data introduced above. For this data, there was one user who logged in/out very frequently, gathering large weight in our dynamical system, almost regardless of the initial configuration. Thus, for all the random initializations tried on this data set, a common basin was reached. In order to discover additional structure, it proved very useful to mask this user, via the masking operation introduced above. Users in this data can exhibit similarities based on co-occurrence in login times and machine use. If we first mask the extremely common user(s), and then initialize the configuration over the user root, such further structure emerges: the four highest weight users turn out to be root; a special "help" login used by the system administrators; and the individual login names of two of the system administrators.
Thus, Stirr discerns system administrators based on their co-occurrences across columns with root.

Finally, we illustrate another case in which "similar" numbers are grouped together by co-occurrence, rather than by numerical proximity. When we initialize over the hour 23 (i.e., 11 pm) as the login time, we obtain a set of users who dominate late-night login times, and we obtain the following hours as the highest weight nodes for login and logout times:

Login time     Logout time
0.430  22      0.457  22
0.268  21      0.286  23
0.173  23      0.134  21
0.079  20      0.060  00
0.021  00      0.028  01

Thus, the algorithm concludes that the hours 22, 21, 23, 20, and 00 are all "similar," purely based on the login patterns of users.

Chess moves. Finally, we indicate some of the sequential patterns one can discover via our method, using the example of chess move sequences. We start from long sequential lists of chess moves, sliced into four-ply windows as discussed above. With random initialization and no masking, the basin is concentrated on an extremely small number of nodes:

0.999  e4     1.000  e6    0.997  d4     1.000  d5
0.000  Nc3    0.000  d5    0.002  Nc3    0.000  Bb4

The strong concentration on the sequence e4-e6-d4-d5 corresponds to the fact that we have chosen only games played with the French Defence, which begins with this sequence. In order to discover sequential patterns that occur in the middle of games, we need to mask the strong co-occurrences in the opening. In this way, we can discover common local motifs in the move sequences, and moves that serve similar functions in the temporal sequence. For example, initializing a configuration over the move f5, and masking the most common opening moves to prevent the above basin from being reached, we obtain the following basin:

0.999  f5      0.755  exf5    0.599  Bxf5    0.476  Bxf5
0.000  Bxf5    0.215  gxf5    0.329  gxf5    0.074  Be8
0.000  Kb1     0.029  Bxf5    0.028  Bxd5    0.073  Rxf5
0.000  Nxf5    0.000  Nxf5    0.019  Qxf5    0.068  g6
0.000  gxf5    0.000  Kh6     0.015  Rxf5    0.068  Nc6

We note several features of the above basin. First, and most obviously, the second column indicates Black's main responses to the move f5: these are to capture either with an e-pawn, a g-pawn, a Bishop, or a Knight. But the basin captures additional structure: because the second column is dominated by Black's captures on f5, alternatives to f5 in the first column gain (small amounts of) weight – for example, the capture of a piece on the square f5 by a Bishop or a Knight. Thus, the propagation of weight moves both forward and backward in time, as one would like.

6 Conclusions and further work

We have given a new method for clustering and mining tables of categorical data, by representing them as non-linear dynamical systems. The two principal contributions of this paper are: (1) we generalize ideas from spectral-graph analysis to hypergraphs, bringing the power of these techniques to bear on clustering categorical data; (2) we develop a novel connection between tables of categorical data and non-linear dynamical systems. Experiments with the Stirr system show that such systems can be effective at uncovering similarities and dense "subpopulations" in a variety of types of data; this is illustrated both through explicitly constructed "hidden populations" in our quasi-random data, as well as on real data drawn from a variety of domains. Moreover, these systems converge very rapidly in practice, and hence the overall computational effort is typically linear in the size of the data.
An interesting direction in which to consider extending these methods is to allow for the incorporation of intrinsic similarity information for some attributes. A given attribute may be endowed with an a priori notion of similarity – this may happen if it consists of numerical data – and it would be useful to develop a principled way of integrating this information with the similarity relationships that emerge from Stirr through its analysis of co-occurrence.

The connection between dynamical systems and the mining of categorical data suggests a range of interesting further questions. At the most basic level, we are interested in determining other data-mining domains in which this approach can be effective. We also feel that a more general analysis of the dynamical systems used in these algorithms – a challenging prospect – would be interesting at a theoretical level and would shed light onto further techniques of value in the context of mining categorical data.

Acknowledgements. We thank Rakesh Agrawal for many valuable discussions on all aspects of this research, and for his great help in the presentation of the paper. We also thank Bill Cody and C. Mohan for their valuable comments on drafts of this paper.

7 Appendix

Dynamical systems and mathematical background

We briefly outline some background on dynamical systems, adapted to our purposes here. For purposes of the present discussion, a dynamical system is the repeated iteration of a function f : U → U, where U is a vector space. The mathematical study of dynamical systems (see [Devaney 1989] for a comprehensive introduction) considers the way in which the weights of X evolve starting from various initial weights of X, for different classes of maps f(). For example, we can ask whether the iterations cause X to converge to a fixed point – a point X* for which f(X*) = X* – or cause X to cycle through a small number of weights.

Example. As a very simple example, consider the case in which U = ℝ, the one-dimensional real line, and f(X) = X². The sequence of weights assumed by X when f() is iterated depends on the starting weight of X. X = 0 and X = 1 are fixed points, since f(X) = X. When X > 1 or X < −1, the sequence of weights grows without bound; when −1 < X < 1, the sequence of weights always converges to 0.

Example. As another example, suppose that U = ℝ³, and for a vector X = (x, y, z), we have f(X) = ½(y + z, z + x, x + y). Then, any vector X with positive coordinates adding up to s will converge to ⅓(s, s, s) as f() is repeatedly applied.

The second example captures a general phenomenon. When f() is linear, the process of iterating f() can typically be analyzed using the methods of linear algebra; for instance, if f() represents left multiplication by a matrix A, then X converges (subject to a few technical conditions) to the principal eigenvector of A. The non-principal eigenvectors of A have a meaning in this setting as well, and underlie the intuition behind portions of our technique.

If f is non-linear, the process of iterating f() becomes much more complicated, and much of the underlying mathematics is not understood. Such non-linear maps lead to topics such as chaotic dynamical systems, and have wide-ranging applications in the modeling of continuous phenomena in settings such as climate modeling, population genetics, and financial analysis. (Again, see [Devaney 1989].)
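For concreteness, the second example above can be checked numerically. The short sketch below (ours) iterates the averaging map and watches any positive starting vector approach ⅓(s, s, s), the direction of the principal eigenvector of the underlying matrix.

def f(v):
    # The averaging map f(x, y, z) = ((y + z)/2, (z + x)/2, (x + y)/2).
    x, y, z = v
    return ((y + z) / 2, (z + x) / 2, (x + y) / 2)

v = (6.0, 2.0, 1.0)        # positive coordinates summing to s = 9
for _ in range(30):
    v = f(v)
print(v)                    # approximately (3.0, 3.0, 3.0), i.e., (s/3, s/3, s/3)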
We find it interesting to observe some of the fundamental phenomena of non-linear dynamical systems arising in our application, and we hope that the present work will suggest further questions in this active area.

The body of mathematical work that most closely motivates our current approach is spectral-graph theory, which studies certain algebraic invariants of graphs. We motivate this technical background with the following example.

Example. Consider a triangle: a graph G consisting of three mutually connected nodes. At each node of G, we place a positive real number, and then we define a dynamical system on G by iterating the following map: the number at each node is updated to be the average of the numbers at the two other nodes. This dynamical system is easily seen to be equivalent to that of the previous example: the numbers at the three nodes converge to one another as the map is iterated.

Spectral-graph theory is based on the relation between such iterated linear maps on an undirected graph G and the eigenvectors of the adjacency matrix of G. (If G has n nodes, then its adjacency matrix A(G) is a symmetric n-by-n matrix whose (i, j) entry is 1 if node i is connected to node j in G, and 0 otherwise.) The principal eigenvector of A can be shown to capture the equilibrium state of a linear dynamical system of the form in the previous example; the non-principal eigenvectors provide information about good "partitions" of the graph G into dense pieces with few interconnections.

The most basic application of spectral-graph partitioning, which motivates some of our algorithms, is the following. Each non-principal eigenvector of the adjacency matrix of G can be viewed as a labeling of the nodes of G with real numbers – some positive and some negative. Treating the nodes with positive numbers as one "cluster" in G, and the nodes with negative numbers as another, typically partitions the underlying graph into two relatively dense pieces, with few edges in between.
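A small, self-contained illustration of this partitioning heuristic (ours, using numpy; the graph and the function are toy examples, not part of Stirr): build the adjacency matrix of a graph with two dense pieces, take the eigenvector associated with the second-largest eigenvalue, and split the nodes by sign.

import numpy as np

def spectral_bisect(A: np.ndarray):
    # Split nodes by the sign of their entries in the eigenvector of the
    # adjacency matrix A associated with its second-largest eigenvalue.
    _, vecs = np.linalg.eigh(A)          # eigenvalues in ascending order
    second = vecs[:, -2]                 # eigenvector for the 2nd-largest eigenvalue
    pos = [i for i, x in enumerate(second) if x >= 0]
    neg = [i for i, x in enumerate(second) if x < 0]
    return pos, neg

# Two triangles {0, 1, 2} and {3, 4, 5} joined by the single edge (2, 3).
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

print(spectral_bisect(A))   # the two sides are the two triangles, in some order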
References

Agrawal R, Mannila H, Srikant R, Toivonen H, Inkeri Verkamo A (1996) Fast Discovery of Association Rules. In: Fayyad UM, Piatetsky-Shapiro G, Smyth P, Uthurusamy R (eds) Advances in Knowledge Discovery and Data Mining, chapter 12. MIT Press, Cambridge, Mass., pp. 307–328
Agrawal R, Imielinski T, Swami A (1993) Mining association rules between sets of items in large databases. In: Buneman P, Jajodia S (eds) Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, D.C., May 26–28, 1993. ACM Press, SIGMOD Record 22(2), June 1993, pp. 207–216
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Bocca JB, Jarke M, Zaniolo C (eds) Proceedings of the 20th International Conference on Very Large Data Bases (VLDB'94), September 12–15, 1994, Santiago de Chile, Chile. Morgan Kaufmann, ISBN 1-55860-153-8, pp. 487–499
Agresti A (1990) Categorical Data Analysis. Wiley-Interscience, Strasbourg
Alon N (1986) Eigenvalues and expanders. Combinatorica 6: 83–96
Alon N, Spencer J (1991) The Probabilistic Method. Wiley, Chichester
Berge C (1973) Hypergraphs. North-Holland, Amsterdam
Blum A, Spencer J (1995) Coloring random and semi-random k-colorable graphs. J Algorithms 19: 204–234
Boppana R (1987) Eigenvalues and graph bisection: an average-case analysis. In: Proc. IEEE Symposium on Foundations of Computer Science, 1987, Los Angeles, California. IEEE Computer Society Press, pp. 280–285
Brin S, Motwani R, Ullman JD, Tsur S (1997) Dynamic itemset counting and implication rules for market basket data. In: Peckham J (ed) SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data, May 13–15, Tucson, Arizona, USA. ACM Press, SIGMOD Record 26(2), June 1997, pp. 255–264
Brin S, Motwani R, Silverstein C (1997) Beyond Market Baskets: Generalizing Association Rules to Correlations. In: Peckham J (ed) SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data, May 13–15, Tucson, Arizona, USA. ACM Press, SIGMOD Record 26(2), June 1997, pp. 265–276
Brin S, Page L (1998) Anatomy of a Large-Scale Hypertextual Web Search Engine. In: Proc. 7th International World Wide Web Conference, 1998, Brisbane, Australia. Elsevier, pp. 107–117
Charniak E (1993) Statistical Language Learning. MIT Press, Cambridge, Mass.
Chiueh T (1994) Content-based image indexing. In: Bocca JB, Jarke M, Zaniolo C (eds) Proceedings of the 20th International Conference on Very Large Data Bases (VLDB'94), September 12–15, 1994, Santiago de Chile, Chile. Morgan Kaufmann, ISBN 1-55860-153-8, pp. 582–593
Chung FRK (1997) Spectral Graph Theory. AMS Press, Providence, R.I.
Das G, Mannila H, Ronkainen P (1998) Similarity of attributes by external probes. In: Agrawal R, Stolorz PE, Piatetsky-Shapiro G (eds) Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98), August 27–31, 1998, New York City, New York, USA. AAAI Press, pp. 23–29
Deerwester S, Dumais ST, Landauer TK, Furnas GW, Harshman RA (1990) Indexing by latent semantic analysis. J Soc Inf Sci 41(6): 391–407
Devaney RL (1989) An Introduction to Chaotic Dynamical Systems. Benjamin Cummings, New York
Donath WE, Hoffman AJ (1973) Lower bounds for the partitioning of graphs. IBM J Res Dev 17: 420–425
Duda RO, Hart PE (1973) Pattern Classification and Scene Analysis. Wiley, New York
Faloutsos C (1996) Searching Multimedia Databases by Content. Kluwer Academic, Amsterdam
Fiedler M (1973) Algebraic connectivity of graphs. Czech Math J 23: 298–305
Flickner M, Sawhney H, Niblack W, Ashley J, Huang Q, Dom B, Gorkani M, Hafner J, Lee D, Petkovic D, Steele D, Yanker P (1995) Query by image and video content: the QBIC system. IEEE Comput 28: 23–32
Garey M, Johnson D (1979) Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman, New York
Golub G, Loan CF van (1989) Matrix Computations. Johns Hopkins University Press, Baltimore, Md.
Han E-H, Karypis G, Kumar V, Mobasher B (1997) Clustering Based On Association Rule Hypergraphs. In: Workshop on Research Issues on Data Mining and Knowledge Discovery, 1997, Tucson, Arizona, in conjunction with ACM SIGMOD 1997. ACM Press
Hastie T, Stuetzle W (1989) Principal curves. J Am Stat Assoc 84: 502–516
Holm L, Sander C (1994) Proteins Struct Funct Genet 19: 256–268
Huang Z (1997) A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining. In: Workshop on Research Issues on Data Mining and Knowledge Discovery, 1997
Jerrum M, Sinclair A (1996) The Markov chain Monte Carlo method: an approach to approximate counting and integration. In: Hochbaum DS (ed) Approximation Algorithms for NP-hard Problems. PWS Publishing, Boston, Mass., pp. 482–520
Jolliffe IT (1986) Principal Component Analysis. Springer, Berlin Heidelberg New York
Kleinberg J (1997) Authoritative sources in a hyperlinked environment. IBM Research Report RJ 10076(91892). To appear in Journal of the ACM
Kohonen T (1982) Self-organized formation of topologically correct feature maps. Biol Cybern 43: 59–69
Kramer M (1991) Non-linear principal component analysis using autoassociative neural networks. AIChE J 37: 233–243
Mannila H, Toivonen H, Verkamo AI (1995) Discovering frequent episodes in sequences. In: Fayyad UM, Uthurusamy R (eds) Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD-95), Montreal, Canada, August 20–21, 1995. AAAI Press, ISBN 0-929280-82-2, pp. 210–215
Ritter H, Martinetz T (1992) Neural Computation and Self-Organizing Maps. Addison-Wesley, Reading, Mass.
Rumelhart DE, McClelland JL (eds) (1986) Parallel and Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, Cambridge, Mass.
Seiferas J (1999) A large bibliography on theory/foundations of computer science. Available at http://liinwww.ira.uka.de/bibliography/Theory/Seiferas/Seiferas.html
Spielman D, Teng S (1996) Spectral partitioning works: planar graphs and finite-element meshes. In: Proc. IEEE Symposium on Foundations of Computer Science, 1996, Burlington, Vermont. IEEE Computer Society Press, Piscataway, N.J., pp. 96–105
Srikant R, Agrawal R (1995) Mining generalized association rules. In: Dayal U, Gray PMD, Nishio S (eds) Proceedings of the 21st International Conference on Very Large Data Bases, September 11–15, 1995, Zurich, Switzerland. Morgan Kaufmann, ISBN 1-55860-379-4, pp. 407–419
Toivonen H (1996) Sampling large databases for finding association rules. In: Vijayaraman TM, Buchmann AP, Mohan C, Sarda NL (eds) VLDB'96, Proceedings of the 22nd International Conference on Very Large Data Bases, September 3–6, 1996, Mumbai (Bombay), India. Morgan Kaufmann, ISBN 1-55860-382-4, pp. 134–145
Wiederhold G (1999) Bibliography on database systems. Available at http://www.ira.uka.de/bibliography/Database/Wiederhold/Wiederhold.html
Zhang T, Ramakrishnan R, Livny M (1996) Birch: an efficient data-clustering method for very large databases. In: Jagadish HV, Mumick IS (eds) Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Quebec, Canada, June 4–6, 1996. ACM Press, SIGMOD Record 25(2), June 1996, pp. 103–114