Clustering categorical data: an approach based on dynamical systems

David Gibson (1), Jon Kleinberg (2), Prabhakar Raghavan (3)

(1) Department of Computer Science, UC Berkeley, Berkeley, CA 94720, USA; e-mail: [email protected]
(2) Department of Computer Science, Cornell University, Ithaca, NY 14853, USA; e-mail: [email protected]
(3) IBM Almaden Research Center, San Jose, CA 95120, USA; e-mail: [email protected]

The VLDB Journal (2000) 8: 222–236
© Springer-Verlag 2000
Edited by J. Widom. Received February 15, 1999 / Accepted August 15, 1999

A preliminary version of this paper appears in the Proceedings of the 24th International Conference on Very Large Databases, 1998. Portions of this work were performed while the first two authors were visiting the IBM Almaden Research Center. The second author is supported in part by an Alfred P. Sloan Research Fellowship, an ONR Young Investigator Award, and NSF Faculty Early Career Development Award CCR-9701399.
Abstract. We describe a novel approach for clustering collections of sets, and its application to the analysis and mining
of categorical data. By “categorical data,” we mean tables
with fields that cannot be naturally ordered by a metric –
e.g., the names of producers of automobiles, or the names
of products offered by a manufacturer. Our approach is based
on an iterative method for assigning and propagating weights
on the categorical values in a table; this facilitates a type of
similarity measure arising from the co-occurrence of values
in the dataset. Our techniques can be studied analytically in
terms of certain types of non-linear dynamical systems.
Key words: Clustering – Data mining – Categorical data –
Dynamical systems – Hypergraphs
1 Introduction
Much of the data in databases is categorical: fields in tables whose values cannot naturally be ordered the way numerical values can. The problem of clustering categorical data
involves complexity not encountered in the corresponding
problem for numerical data, since one has much less a priori structure to work with. While considerable research has
been done on clustering numerical data (exploiting its inherent geometry), there has been much less work on the important problem of clustering and extracting structure from
categorical data.
As a concrete example, consider a database describing car sales with fields “manufacturer”, “model”, “dealer”,
“price”, “color”, “customer” and “sale date”. In our setting,
price and sale date are traditional “numerical” values. Color
is arguably a categorical attribute, assuming that values such
as “red” and “green” cannot easily be ordered linearly; more
on this below. Attributes such as manufacturer, model and
dealer are indisputably categorical attributes: it is very hard
to reason that one dealer is “like” or “unlike” another in the
way one can reason about numbers, and hence new methods
are needed to search for similarities in such data.
In this paper, we propose a new approach for clustering categorical data, differing from previous techniques in
several fundamental respects. First, our approach generalizes
the powerful methodology of spectral-graph partitioning,
to produce a clustering technique applicable to arbitrary collections of sets. For a variety of graph decomposition tasks,
spectral partitioning has been the method of choice for avoiding the complexity of more combinatorial approaches (see
below); thus, one of our main contributions is to generalize
and extend this framework to the analysis of relational data
more complex than graphs, obtaining a method that does
not suffer from the combinatorial explosion encountered in
existing approaches. Second, our development of this technique reveals a novel connection between tables of categorical data and non-linear dynamical systems; this connection
allows for the natural development of a range of clustering
algorithms that are mathematically clean and involve essentially no arbitrary parameters.
In addition to the development of the techniques themselves, we report on their incorporation into an experimental
system for mining tables of categorical data. We have found
the techniques to be effective in the context of both synthetic
and real datasets.
We now review some of the key features of our approach, and explain the ways in which it differs from other
approaches.
(1) No a priori quantization. We wish to extract notions
of proximity and separation from the items in a categorical
dataset purely through their patterns of co-occurrence, without trying to impose an artificial linear order or numerical
structure on them. The approach of casting categorical values into numbers or vectors can be effective in some cases
(for one example, see the work on image processing and
color perception; in the simplest form, each color is a
3D vector of RGB intensities). However, it has a significant
drawback: in engineering a suitable numerical representation of a categorical attribute, one may lose the structure one is hoping to mine. Moreover, it can lead to problems in high-dimensional geometric spaces in which the data is distributed very sparsely (as, for example, in the interesting k-modes algorithm for categorical data, due to Huang).
(2) Correlation vs. categorical similarity. Association rules
and their generalizations have proved to be effective at mining that involves estimating/comparing moments and conditional probabilities; see, e.g., the work on forming associations from market basket data in [Agrawal et al. 1996,
Agrawal et al. 1993, Agrawal and Srikant 1994, Brin et al.
1997a, Brin et al. 1997b, Srikant and Agrawal 1995, Toivonen 1996]. Our analysis of co-occurrence in categorical data
addresses a notion that is in several respects distinct from the
underlying motivation for association rules. First, we wish
to define a notion of similarity among items of a database
that will apply even to items that never occur together in a
tuple; rather, their similarity will be based on the fact that
the sets of items with which they do co-occur have large
overlap. Second, we would like this notion of similarity to
be transitive in a limited way: if A and B are deemed similar; and B and C are deemed similar, then we should be
able to infer a “weak” type of similarity between A and C.
In this way, similarity can propagate to uncover more distant
correlations.
The first of these points also serves as part of the underlying motivation of recent independent work of [Das et al.
1997]. However, the way in which we will use this notion
for clustering is quite different.
(3) New methods for hypergraph clustering. Viewing each
tuple in the database as a set of items, we can treat the entire
collection of tuples as an abstract set system, or hypergraph, and approach the clustering problem in this context. (By
a hypergraph, in this setting, we mean simply an arbitrary
collection of sets.) In this way, we use pure co-occurrence
information among items to guide the clustering, without
the imposition of additional prior structure. At a high level,
this is similar to the approach taken in earlier work, as well as by the framework of association rules.
Clustering a collection of sets is a task that naturally
lends itself to combinatorial formulations; unfortunately,
the natural combinatorial versions of this problem are NP-complete and apparently difficult to approximate [Garey and Johnson 1979]. This becomes a major obstacle in earlier approaches that rely on combinatorial formulations of the clustering problem and heuristics whose behavior is difficult to reason about.
We adopt a fundamentally different, less combinatorial, approach to the problem of clustering sets; it is motivated by spectral-graph partitioning, a powerful method for the related problem of clustering undirected graphs that originated in the 1970s. Spectral methods relate good “partitions” of an undirected graph to the eigenvalues and eigenvectors of certain matrices derived from the graph. These methods have been found to exhibit good performance in many contexts that are difficult to attack by purely combinatorial means, and they have been successfully applied in areas such as finite-element mesh decomposition, protein modeling, and the analysis of Monte Carlo simulation methods [Jerrum and Sinclair 1994]. Recently, the heuristic intuition underlying spectral partitioning has been used in the context of information retrieval and in methods for identifying densely connected hypertextual regions of the World-Wide Web.
Our new method can be viewed as a generalization of
spectral partitioning techniques to the more general problem
of hypergraph clustering. The extension to this more general setting involves significant changes in the algorithmic
techniques required; in particular, the use of eigenvectors
and linear transforms is replaced, in our method, by certain
types of non-linear dynamical systems. One of our contributions here is the generalization of the notion of non-principal
eigenvectors (from linear maps) to our non-linear dynamical
systems. In what follows, we will argue that this introduction of dynamical systems is perhaps the most natural way
to extend the power of spectral methods to the problem of
clustering collections of sets; and we will show that this approach suggests a framework for analyzing co-occurrence in
categorical datasets in a way that avoids many of the pitfalls
encountered with intractable combinatorial formulations of
the problem.
We now turn to an extended example that illustrates more
concretely the type of problem addressed by our methodology.
Example and overview of techniques. Consider again our
hypothetical database of car sales. Classical association
rules, on the one hand, are effective at mining patterns of
the form: of the tuples containing “Honda,” 18% (a large
fraction) contain “August.” Our goal, on the other hand, is
to elicit statements of the form “Hondas and Toyotas are
‘related’ by the fact that a disproportionate fraction of each
are sold in the month of August”. Thus, we are grouping
“Honda” and “Toyota” by co-occurrence with a common
value or set of values (“August” in this case), although
clearly, in a database of individual car purchases, no single tuple is likely to contain both “Honda” and “Toyota.”
We wish to allow such groupings to grow in complex ways;
for example, we may discover that many Toyotas are also
sold in September, as are many Nissans; that many dealers
sell both Hondas and Acuras; that many other dealers have
large volume in August and September. In this way, we relate items in the database that have no direct co-occurrence –
different automobile manufacturers, or dealers who may sell
different makes of automobile. What emerges is a high-level
grouping of database entries; one might intuitively picture it
as being centered around the notion of “late-summer sales on
Japanese model cars.” Distilling such a notion is likely to be
of great value in customer segmentation, commonly viewed
as one of the most successful applications of data mining
to date. In Sect. 5, we describe some patterns of this type
that our system automatically discovers: for example (in a
database of computer science papers), “the PODS community late 1980s–early 1990s”. Note that in general, a dataset
could contain many such segments; indeed, our method is
effective at uncovering this multiplicity (see Theorem 4 and
the experiments in Sect. 5).
Thus, if we picture a database consisting of rows of tuples, association rules are based on co-occurrence within
rows (tuples), while we are seeking a type of similarity based
on co-occurrence patterns of different items in the same column (field). Our problem, then, is how to define such a
notion of similarity, allowing a “limited” amount of transitivity, in a manner that is efficient, algorithmically clean, and
robust in a variety of settings. We propose a weight propagation method which works roughly as follows. We first
seed a particular item of interest (e.g., “Honda”) with a small
amount of weight. This weight then propagates to items with
which Honda co-occurs frequently – dealers, common sale
times, common price ranges. These items, having acquired
weight, propagate it further – back to other automobile manufacturers, perhaps – and the process iterates. In this way, we
can achieve our two-fold objective: items highly related to
Honda acquire weight, even without direct co-occurrence;
and since the weight “diffuses” as it spreads through the
database, we can achieve our “limited” form of transitivity.
The weight propagation schemes that we work with can
be viewed as a type of non-linear dynamical system derived
from the table of categorical data. Thus, our work demonstrates a natural transformation from categorical datasets to
such dynamical systems that facilitates a new approach to
the clustering and mining of this type of data. Much of the
analysis of such dynamical systems is beyond the reach of
current mathematics [M. Shub, personal communication]. In
a later section, we discuss some preliminary mathematical
results that we are able to prove about our method, including
convergence properties in some cases and an interpretation
of a more complex formulation in terms of the enumeration
of certain types of labeled trees. We also indicate certain aspects of the analysis that appear to be quite difficult, given
the current state of knowledge concerning non-linear dynamical systems.
It is important to note that even for dynamical systems
that cannot be analyzed mathematically, we find experimentally that they exhibit quite orderly behavior and consistently
elicit meaningful structure implicit in tables of categorical
data, typically in linear time. That such systems should be
effective in this context is further reinforced by the connections we draw with spectral-graph theory – in particular,
there are parallels between our weight propagation schemes
and the types of iterative eigenvector-based methods that
arise in spectral-graph partitioning. (See Sect. 7 in the appendix for a further discussion of these connections.)
Organization of the paper. In the following section, we describe in detail the algorithmic components underlying our
approach, and the way in which they can be used for clustering and mining categorical data. We also discuss there the
mathematical properties of the algorithms that we have been
able to prove.
In Sect. 3, we describe Stirr (sieving through iterated
relational reinforcement), an experimental system for studying the use of this technique for analyzing categorical tables.
Because the iterative weight propagation algorithms underlying Stirr converge in a small number of iterations in
practice, the total computational effort in analyzing a table
by our methods is typically linear in the size of the table. By
providing quantitative measures of similarity among items,
Stirr also provides a natural framework for effectively visualizing the underlying relational data; this visualizer is depicted in Fig. 3.

Fig. 1. The representation of a collection of tuples. A table with three attributes (a, b, c) and six tuples, (A, W, 1), (A, X, 1), (B, W, 2), (B, X, 2), (C, Y, 3), (C, Z, 3), is represented as one node per attribute value (A, B, C; W, X, Y, Z; 1, 2, 3), with each tuple connecting one node from each column
In Sects. 4 and 5, we evaluate the performance of Stirr
on data from synthetic as well as from real-world sources. In
Sect. 4, we report on experience with the Stirr system on a
class of inputs we will call quasi-random inputs. Such inputs
consist of carefully “planted” structure among random noise,
and are natural for testing mining algorithms. The advantage
of such quasi-random data is that we may control in a precise
manner the various parameters associated with the dynamical
systems we use, studying in the process their efficacy at
discovering the appropriate planted structure. In Sect. 5, we
report on our experiences with Stirr on “real-world” data
drawn from a range of settings. In all these settings, the
main clusters computed by Stirr correspond naturally to
multiple, highly correlated groups of items within the data.
2 Algorithms and basic analysis
We now describe our core algorithm, which produces a dynamical system from a table of categorical data. We begin
with a table of relational data: we view such a table as consisting of a set of k fields, each of which can assume one
of many possible values. (We will also refer to the fields as
columns.) We represent each possible value in each possible
field by an abstract node, and we represent the data as a set
T of tuples – each tuple τ ∈ T consists of one node from
each field. See Fig. 1 for an example of this representation,
on a table with three fields. A configuration is an assignment
of a weight wv to each node v; we will refer to the entire
configuration as simply w (which may be thought of as a
vector of values wv ). We draw on the high-level notion of a
“weight propagation” scheme developed in the introduction.
We will only be concerned with the relative values of the
weights in a configuration, and so we will use a normalization function N (w) to uniformly rescale the set of weights,
preventing their absolute values from growing too large. In
our experiments, we have made use of two different normalization functions: a local function that rescales weights
so that the squares of the weights within each field add up
to 1; and a global function that rescales weights so that the
squares of all weights in all fields add up to 1. These two
normalization functions yield very similar results qualitatively; unless explicitly stated otherwise, we will focus on
the local normalization function in what follows.
For our purposes, a dynamical system is the repeated
application of a function f on some set of values. A fixed
point of a dynamical system is a point u for which f (u) = u.
That is, it is a point which remains unchanged under the (repeated) application of f . Fixed points are one of the central
objects of study in dynamical systems, and by iterating f ,
one often reaches a fixed point. In Sect. 7, we provide a
basic introduction to some terminology and background on
dynamical systems. We also discuss there the connections
between such systems and spectral-graph theory, discussed
above as a powerful method for attacking discrete partitioning problems. Spectral-graph theory is based on studying linear weight propagation schemes on graph structures; much
of the motivation for the weight propagation algorithms underlying Stirr derives from this connection, though the description of our methods here will be entirely self-contained.
The dynamical system for a set of tuples will be based
on a function f , which maps a current configuration to a new
configuration. We define f as follows. We choose a combiner
function ⊕, defined below, and update each weight wv as
follows.
To update the weight wv :
For each tuple τ = {v, u1 , . . . , uk−1 } containing v, do
xτ ← ⊕(u1 , . . . , uk−1 ).
wv ← Σ_τ xτ .
Thus, the weight of v is updated by applying ⊕ separately
to the members of all tuples that contain v, and adding the
results. The function f is then computed by updating the
weight of each wv as above, and then normalizing the set of
weights using N (). This yields a new configuration f (w).
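The update can be summarized in a short sketch (our own illustration under stated assumptions, not the authors' C kernel): a node is represented as a (column, value) pair, a tuple is a list of nodes, `combine` is the operator ⊕ chosen below, and the local normalization N is applied at the end.

# Minimal sketch of one application of f (illustrative only).
import math
from collections import defaultdict

def iterate(weights, tuples, combine):
    """Update every node weight from the tuples containing it, then apply
    the local normalization (squared weights in each column sum to 1)."""
    new_w = defaultdict(float)
    for tup in tuples:
        for i, node in enumerate(tup):
            others = [weights[u] for j, u in enumerate(tup) if j != i]
            new_w[node] += combine(others)          # x_tau = combiner of the other weights
    col_norm = defaultdict(float)
    for (col, _val), w in new_w.items():
        col_norm[col] += w * w
    return {node: w / math.sqrt(col_norm[node[0]])
            for node, w in new_w.items() if col_norm[node[0]] > 0}

# e.g., weights = iterate(weights, tuples, combine=sum)   # built-in sum is the S1 rule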
Iterating f defines a dynamical system on the set of configurations. We run the system through a specified number
of iterations, returning the final configuration that we
obtain. It is very difficult to rigorously analyze the limiting
properties of the weight sets in the fully general case; below
we discuss some analytical results that we have obtained in
this direction. However, we have observed experimentally
that the weight sets generally converge to fixed points or to
cycles through a finite set of values. We call such a final
configuration a basin; the term is meant to capture the intuitive notion of a configuration that “attracts” a large number
of starting configurations.
To complete the description of our algorithms, we discuss the following issues: our choice of combining operators;
our method for choosing an initial configuration; and some
additional techniques that allow us to “focus” the dynamical
system on certain parts of the set of tuples.
Choosing a combining operator. Our potential choices for
⊕ are the following; in Sect. 4, we offer some additional
insights on the pros and cons among these choices.
– The product operator Π: ⊕(w1 , . . . , wk ) = w1 w2 · · · wk .
– The addition operator: ⊕(w1 , . . . , wk ) = w1 +w2 +· · ·+wk .
– A generalization of the addition operator that we call the Sp combining rule, where p is an odd natural number: Sp(w1, . . . , wk) = (w1^p + · · · + wk^p)^{1/p}. Note that addition is simply the S1 rule.
– A “limiting” version of the Sp rules, which we refer to as
S∞ . S∞ (w1 , . . . , wk ) is defined to be equal to wi , where
wi has the largest absolute value among the weights in
{w1 , . . . , wk }. (Ties can be broken in a number of arbitrary ways.)
Thus, the S1 combining rule is linear with respect to the
tuples. The Π and Sp rules for p > 1, on the other hand,
involve a non-linear term for each individual tuple, and thus
have the potential to encode co-occurrence within tuples
more strongly. S∞ is an especially appealing rule in this
sense, since it is non-linear, particularly fast to compute, and
also appears to have certain useful “sum-like” properties.
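For concreteness, the combiners can be written as the following small sketches (our own definitions matching the formulas above, not code from the Stirr system); each takes the list of weights of the other nodes in a tuple, and S_p assumes an odd p.

import math

def product(ws):                       # the product operator Π
    out = 1.0
    for w in ws:
        out *= w
    return out

def s_1(ws):                           # the addition operator, i.e., S_1
    return sum(ws)

def s_p(ws, p):                        # S_p(w1,...,wk) = (w1^p + ... + wk^p)^(1/p)
    s = sum(w ** p for w in ws)        # p odd, so each term keeps its sign
    return math.copysign(abs(s) ** (1.0 / p), s)

def s_inf(ws):                         # S_∞: the weight of largest absolute value
    return max(ws, key=abs)            # ties broken arbitrarily (first occurrence)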
Non-principal basins. Much of the power of spectral-graph
partitioning for discrete-clustering problems derives from
its use of non-principal eigenvectors. In particular, it provides a means for assigning positive and negative weights
to nodes of a graph, in such a way that the nodes with
positive weight are typically well separated from the nodes
of negative weight. We refer the reader to Sect. 7 for more
details.
One of our contributions here is the generalization of the
notion of non-principal eigenvectors (from linear maps) to
our non-linear dynamical systems. To find the eigenvectors
of a linear system Ax = λx, the power iteration method
maintains a set of orthonormal vectors x⟨j⟩, on which the following iteration is performed. First each of the {x⟨j⟩} is multiplied by A; then the Gram-Schmidt algorithm is applied to the sequence of vectors {x⟨1⟩, x⟨2⟩, . . .}, making them orthonormal. Note that the order of the vectors is crucial in the Gram-Schmidt procedure; in particular, the direction of the first vector x⟨1⟩ is unaffected.
In an analogous way, we can maintain several configurations w⟨1⟩, . . . , w⟨m⟩ and perform the following iteration:
Update w⟨i⟩ ← f(w⟨i⟩), for i = 1, 2, . . . , m.
Apply the Gram-Schmidt algorithm to the sequence of vectors {w⟨1⟩, . . . , w⟨m⟩}.
Since the normalized value of the configuration w⟨1⟩ will be
unaffected by the Gram-Schmidt procedure, it iterates to a
basin just as before; we refer to this as the “principal basin.”
The updating of the other configurations is interleaved with
orthonormalization steps; we refer to these other configurations as “non-principal basins.”
In the experiments that follow, we will see that these
non-principal basins provide structural information about
the data in a fashion analogous to the role played by nonprincipal eigenvectors in the context of spectral partitioning.
Specifically, forcing the configurations {w⟨j⟩} to remain orthonormal as vectors introduces negative weights into the
configurations. Just as the nodes of large weight in the basin
tend to represent natural “groupings” of the data, the positive and negative weights in the non-principal basins tend to
represent good partitions of the data. The nodes with large
positive weight, and the nodes with large negative weight,
tend to represent “dense regions” in the data with few connections between them.
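In outline, the interleaved iteration looks as follows (a sketch under our own assumptions: configurations are stored as numpy vectors indexed by node, and f denotes one Stirr update such as the iterate() sketch above).

import numpy as np

def gram_schmidt(vectors):
    """Orthonormalize in order; the direction of the first vector is unchanged."""
    basis = []
    for v in vectors:
        for b in basis:
            v = v - np.dot(v, b) * b
        norm = np.linalg.norm(v)
        if norm > 0:                    # drop degenerate directions
            basis.append(v / norm)
    return basis

def iterate_basins(configs, f, steps=20):
    """configs[0] tracks the principal basin; the others the non-principal ones."""
    for _ in range(steps):
        configs = gram_schmidt([f(w) for w in configs])
    return configs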
Choosing an initial configuration. There are a number of
methods for choosing the initial configuration for the iteration. If we are not trying to “focus” the weight in a particular portion of the set of tuples, we can choose the uniform
initialization (all weights set to 1, then normalized) or the
random initialization (each weight set to an independently
chosen random value in [0, 1], then normalized).
For combining operators which are sensitive to the choice
of initial configuration – the product rule is a basic example
– one can focus the weight on a particular portion of the
set of tuples by initializing a configuration over a particular
node. To initialize w over the node v, we set wu = 1 for every node u appearing in a tuple with v, and wu′ = 0 for every other node u′. (We then normalize w.) In this way, we hope
to cause nodes “close” to v to acquire large weight. This
notion will be addressed in the experiments of the following
sections.
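The three initializations can be sketched as follows (our own helper names; as in the text, normalization is applied afterwards).

import random

def uniform_init(nodes):
    return {v: 1.0 for v in nodes}

def random_init(nodes):
    return {v: random.random() for v in nodes}        # independent values in [0, 1]

def focused_init(nodes, tuples, v):
    """Weight 1 on v and on every node co-occurring with v in some tuple."""
    close = {u for tup in tuples if v in tup for u in tup}
    return {u: (1.0 if u in close else 0.0) for u in nodes}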
Masking and modification. To augment or diminish the influence of certain nodes, we can apply a range of “local
modifications” to the function f . In particular, for a particular node v, we can compose f with the operation mask(v),
which sets wv = 0 at the end of each iteration. Alternately,
we can compose f with augment(v, x), which adds a weight
of x > 0 to wv at the end of each iteration. This latter operation “favors” v during the iterations, which can have the
effect of increasing the weights of all nodes that have significant co-occurrence in tuples with v.
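These local modifications amount to a trivial post-processing step applied after each iteration; a sketch with our own function names:

def mask(weights, v):
    weights[v] = 0.0                   # mask(v): suppress v at the end of each iteration
    return weights

def augment(weights, v, x):
    weights[v] += x                    # augment(v, x): favor v by adding x > 0
    return weights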
Relation to other iterative techniques
The body of mathematical work that most closely motivates
our current approach is spectral-graph theory, which studies certain algebraic invariants of graphs. We discuss these
connections in some detail in Sect. 7.
The iteration of systems of non-linear equations arises
also in connection with neural networks [Ritter and Martinez
1992, Rumelhart and McClelland 1986]. In many of these
neural network settings, the emphasis is quite different from
ours – for example, one is concerned with the approximation
of binary decision rules, or with the approximation of connection strengths among nodes on a discrete lattice (e.g., the
work on Kohonen maps). In our work, on the other hand,
the analogs of “connections” – the co-occurrences among
tuples in a database – typically do not have any geometric
structure that can be exploited, and our approach is to treat
the connections as fixed, while individual item weights are
updated.
In a related vein, the methodology of principal curves
allows for a non-linear approach to dimension reduction,
which can sometimes achieve significant gains over linear
methods such as principal-component analysis [Hastie and
Stuetzle 1989, Kramer 1991]. This method thus implicitly
requires an embedding in a given space in which to perform
the dimension reduction; once this dimension reduction is
performed, one can then approach the low-dimensional clustering problem on the data in a number of standard ways.
Analysis
First, we briefly discuss the running time of a single iteration
of the Stirr algorithm, maintaining just the principal basin.
For a fixed, constant number of columns, the combining rules
we use require constant computation time. Thus, we may
perform a single iteration via a single pass through the set
of tuples; for each tuple, we modify the weights of the nodes
it contains using the combining rule. Thus, the amount of
computation required in an iteration is linear in the number
of tuples. An experimental view of this linear relationship
can be seen in the running-time plot of Fig. 4, below.
The number of iterations required for approximate convergence is, on the other hand, a much more complex issue,
and our theoretical results below are designed to partially
address the question.
There is much that remains unknown about the dynamical systems we use here, and this reflects the lack of existing
mathematical techniques for proving properties of general
non-linear dynamical systems. This, of course, does not prevent them from being useful in the analysis of categorical
tables; and we now report on some of the basic properties
that we are able to establish analytically about the dynamical
systems we use. In addition, we state some basic conjectures
that we hope will motivate further work on the theoretical
underpinnings of these models.
We first state a basic convergence result for the (linear)
S1 combining rule, subject to a standard non-degeneracy
assumption that will be exposed in the proof.
Theorem 1. (Subject to a non-degeneracy assumption.) Let
T be a set of tuples, with k ≥ 3 fields, and suppose that global
normalization is used. For every initial configuration w, w
converges to a fixed point under the repeated application of
the combining rule S1 .
Proof. Let n denote the total number of nodes in all fields.
We label the nodes v1, . . . , vn. Let A be the following n × n matrix: Aii = 0 for each i; and, for i ≠ j, Aij is equal to the number of tuples in which vi and vj appear together. Let
w denote any configuration. We view w as a vector with n
coordinates, the ith denoting the value wvi ; because we are
considering a global normalization function, we may assume
that w is a unit vector.
If we apply the S1 rule once to w, the new value of the
ith coordinate is easily seen to be the sum Σ_{j≠i} Aij wj .
In vector notation, the new configuration after global normalization is thus simply the unit vector in the direction of
Aw. Hence, after t applications of the S1 rule, the configuration is the unit vector in the direction of A^t w.
Now we state our non-degeneracy condition: we will
assume that the matrix A has a unique eigenvalue λ∗
of maximum absolute value, with associated eigenvector
z∗. In this case, a standard result of linear algebra implies that for any initial vector w not orthogonal to z∗, the sequence of unit vectors in the directions
w, Aw, A^2 w, . . . , A^t w, . . . converges to z∗. More generally,
this sequence will converge to the eigenvector associated
with the eigenvalue of maximum absolute value among those
not orthogonal to the initial vector w.
Thus, Theorem 1 establishes that almost all initial configurations w converge to the same fixed point; namely,
the eigenvector z ∗ , viewed as a configuration. In particular, since the matrix A in the proof of Theorem 1 has only
non-negative entries, so does z ∗ . Hence, any non-zero initial configuration w consisting entirely of non-negative values will not be orthogonal to z ∗ , and so the S1 rule will
converge to z ∗ starting from w.
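The correspondence used in the proof is easy to check numerically; the following sketch (ours, using numpy) builds the co-occurrence matrix A and runs the power iteration that repeated S1 updates with global normalization amount to.

import numpy as np

def cooccurrence_matrix(tuples, nodes):
    idx = {v: i for i, v in enumerate(nodes)}
    A = np.zeros((len(nodes), len(nodes)))
    for tup in tuples:
        for u in tup:
            for v in tup:
                if u != v:
                    A[idx[u], idx[v]] += 1      # number of tuples containing both u and v
    return A

def s1_global(A, w, steps=100):
    for _ in range(steps):
        w = A @ w
        w = w / np.linalg.norm(w)               # global normalization
    return w                                     # approaches the eigenvector z*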
If we consider iterating a linear system on an undirected
graph G (as in the power iteration method discussed above),
there is a natural combinatorial interpretation of the weights
that are computed: they count the number of walks in G
starting from each vertex. By a walk here, we mean a sequence of adjacent vertices in G, with vertices allowed to
appear more than once in the sequence. We now show that
the product rule Π has an analogous combinatorial interpretation for a set T of tuples. This combinatorial interpretation
indicates a precise sense in which the product rule discovers
“dense” regions in the data.
To state this result, we need some additional terminology.
If T is a set of tuples over a dataset with k fields, v is a node of T, and j is a natural number, then a (T, v, j)-tree is a complete (k − 1)-ary tree Z of height j whose vertices are
labeled with nodes of T as follows. (1) The root is labeled
with v. (2) Suppose a vertex α of Z is labeled with a node
u ∈ T ; then there is a tuple τ of T so that u ∈ τ and
the children of α are labeled with the elements of τ − {u}.
Two (T, v, j)-trees are equivalent if there is a one-to-one
correspondence between their vertices that respects the tree
structure and the labeling; they are distinct otherwise. Note
that if G is a graph, then a (G, v, j)-tree is precisely a walk
of length j in G starting from v.
Theorem 2. Let T be a set of tuples, and consider the dynamical system defined on it by the product rule Π with
no normalization. Let w0 denote the initial configuration in
which each node is assigned the value 1. For a node v of T ,
let wv^j denote the weight wv assigned to v after j iterations. Then wv^j is equal to the number of distinct (T, v, j)-trees.
Proof. For a node v and a number j, let q(v, j) denote the number of distinct (T, v, j)-trees. We prove by induction on j that q(v, j) = wv^j. Note that when j = 0, wv^j = 1; and there is a single (T, v, j)-tree (a single node), so q(v, j) = 1.
Now consider a value of j larger than 0, and assume
the claim holds for all smaller values of j. An arbitrary
(T, v, j) tree Z can be produced as follows: we choose an
arbitrary tuple (v, u1 , u2 , . . . , uk−1 ) containing v; we choose
an arbitrary (T, ui , j − 1) tree Zi for each i = 1, . . . , k − 1;
and we define Z to be the tree with root labeled v and with
Z1, . . . , Zk−1 the subtrees of the root. Thus, we have

q(v, j) = Σ_{(v,u1,...,uk−1)} q(u1, j − 1) q(u2, j − 1) · · · q(uk−1, j − 1)
        = Σ_{(v,u1,...,uk−1)} wu1^{j−1} wu2^{j−1} · · · wuk−1^{j−1}
        = wv^j ,
where we pass from the first line to the second using the
induction hypothesis, and we pass from the second line to
the third using the definition of the product rule.
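Theorem 2 is easy to check on small examples; the sketch below (our own code, not part of the Stirr system) iterates the product rule from the all-ones configuration with no normalization, so the value it returns for a node v after j rounds is the quantity wv^j of the theorem, i.e., the number of distinct (T, v, j)-trees.

from collections import defaultdict

def product_rule_weights(tuples, j):
    w = defaultdict(lambda: 1.0)                 # w_v^0 = 1 for every node
    for _ in range(j):
        new_w = defaultdict(float)
        for tup in tuples:
            for i, v in enumerate(tup):
                prod = 1.0
                for k, u in enumerate(tup):
                    if k != i:
                        prod *= w[u]
                new_w[v] += prod                 # product rule, no normalization
        w = new_w
    return w                                     # w[v] = number of (T, v, j)-trees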
Next, we establish a basic theorem that partially underlies our intuition about the partitioning properties of these
dynamical systems. By a complete hypergraph of size s, we
mean a set of tuples over a dataset with s nodes in each
column, so that there is a tuple for every choice of a node
from each column. Let Tab denote a set of tuples consisting
of the disjoint union of a complete hypergraph A of size a
and a complete hypergraph B of size b. We define 1A to
be the configuration that assigns a weight of 1 to each node
in A, and a weight of 0 to each node in B; we define 1B
analogously.
Theorem 3. Consider the product rule on the set of tuples
T , with local (column-by-column) normalization.
(i) Let w be a randomly initialized configuration on the
set of tuples Tab , with a = (1 + δ)b for a number δ > 0.
Then, there is a constant α > 0 depending on δ, so that
with probability at least 1 − e−α , w will converge under the
product rule Π to the fixed point N (1A ).
(ii) Let w0 be a configuration initialized over a random node v ∈ Tab. Then, with probability a/(a + b), w0 converges to N(1A) under Π, and with probability b/(a + b), it converges to N(1B).
Proof. Consider the dynamical system defined by the product
rule on a complete hypergraph of size s, with a configuration
of weights w. Since we are only concerned with relative
values of weights, it is easiest to analyze this first in the
absence of any normalization. Let Wi denote the sum of the
weights of all nodes in the ith column. Then, the updated value of the weight of a node v in column j is

Σ_{(v1,...,vj−1, v, vj+1,...,vk)} wv1 · · · wvj−1 wvj+1 · · · wvk ,

where the sum is over all possible choices of a node from each other column. Rewriting this sum, we see that it is equal to

W1 W2 · · · Wj−1 Wj+1 · · · Wk = Π_{i≠j} Wi .   (1)
First we consider (i). Recall that a random initialization
consists of the assignment of a weight chosen uniformly
from [0, 1] to each node. Let Ha denote the portion of T
consisting of a complete hypergraph of size a, and let Hb
denote the portion of T consisting of a complete hypergraph
of size b. Let Wi denote the sum of the weight in column i
of Ha and let Wi′ denote the sum of the weight in column i of Hb. Each Wi is a sum of a independent random variables, each of mean 1/2; similarly each Wj′ is a sum of b independent random variables, each of mean 1/2. Thus, the mean of each Wi exceeds the mean of each Wj′ by (a − b)/2. By standard bounds on the sum of independent random variables, there is an absolute constant α > 0, depending on δ = (a − b)/b, so that with probability at least 1 − e−α, we have min_i Wi > max_j Wj′ + (a − b)/4 ≥ max_j Wj′ + 1/4. Assume this event happens, and let W∗ = max_j Wj′.
Consider the weight of a node v in Hb as we run several
iterations, with no normalization. After the first iteration, its
weight can be computed from Eq. 1 to be at most (W ∗ )k−1 .
Each column sum in Hb is then at most b(W ∗ )k−1 , and
so after the second iteration, the weight of v is at most
b^{k−1} (W∗)^{(k−1)^2}. More generally, after j iterations, an upper bound on the weight of the node v can be determined inductively from Eq. 1; we obtain the upper bound

b^{(k−1)+(k−1)^2+···+(k−1)^{j−1}} (W∗)^{(k−1)^j} .

By similar reasoning, the weight of a node u in Ha after j iterations is at least

a^{(k−1)+(k−1)^2+···+(k−1)^{j−1}} (W∗ + 1/4)^{(k−1)^j} .
Thus, as j increases without bound, we find first of all
that the ratio of the minimum weight in Ha to the maximum weight in Hb increases without bound. Second, all the
weights in a given column of Ha will be the same after the
first iteration, since the expression in Eq. 1 depends not on
a node’s identity, but only on its column. It follows from
these two facts that the normalized set of weights in each
column converges to N (1A ).
The proof of statement (ii) is easier. With probability a/(a + b), a node in Ha will be chosen, and with probability b/(a + b),
a node in Hb will be chosen. In the former case, we see by
Eq. 1 that the normalized value of the configuration in each
column will converge in one step to N (1A ); in the latter
case, it will converge in one step to N (1B ).
We conjecture that qualitatively similar behavior occurs
when Tab consists of two “dense” sets of tuples that are
“sparsely” connected to each other. At an experimental level,
this is borne out by the results of subsequent sections.
The previous theorem shows that it is possible for the
product rule to converge to different basins, depending on
the starting point. We now formulate a general statement
that addresses a notion of completeness for such systems:
all sufficiently “large” basins should be discovered within a
reasonable number of trials. Let us say that a basin w∗ has measure ε if at least an ε fraction of all random starting configurations converge to w∗ on iterating a given combining rule. Let Bε be the number of such basins.

Theorem 4. With probability at least 1 − 1/Bε, iterating the system from 2ε^{−1} ln Bε random initial configurations is sufficient to discover all basins of measure at least ε.

Proof. Consider a particular basin of measure at least ε. The probability that it is not discovered in k independent runs from random initial configurations is at most (1 − ε)^k. Setting k = 2ε^{−1} ln Bε, we see that this probability is at most

p = e^{−2 ln Bε} = 1/Bε^2 .   (2)

Since the probability of the union of events is no more than the sum of their probabilities, the probability that any basin of measure at least ε is not discovered is at most pBε = 1/Bε.

This gives a precise sense – via a standard type of sampling result – in which all sufficiently large basins will be discovered rapidly by iterating from random starts. (Note that this discovery is relatively efficient – since Bε can be as large as ε^{−1}, it can take up to ε^{−1} starting points to discover all the basins, and our randomized approach uses 2ε^{−1} ln Bε.)
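To make the bound concrete with illustrative numbers of our own choosing: with ε = 0.05 and Bε = 20, the theorem calls for 2ε^{−1} ln Bε = 2 · 20 · ln 20 ≈ 120 random starting configurations, after which every basin of measure at least 0.05 has been found with probability at least 1 − 1/Bε = 0.95.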
3 Overview of experiments
3.1 System overview
The Stirr experimental system consists of a computation
kernel written in C, and a control layer and GUI written
in Tcl/Tk. We prepared input datasets in straightforward
whitespace-delimited text files. The output can be viewed
in a text window showing nodes of highest weight in each
column, or as a graphical plot of node weights. The GUI
allows us to modify parameters quickly and easily.
Fig. 2. Textual view of system

Figure 2 shows the control interface and textual output.
The dataset being used was an adult census database, using
just four of the categorical attributes. A graphical view of
the same data (specifically, the first non-principal basin w⟨2⟩)
is shown in Fig. 3. Each node is positioned according to its
column and calculated weight. The nodes in each tuple are
joined by a line (consisting of three segments in this case).
The Density control indicates only 1% of the tuples were
plotted. The Jitter value “smears” each line’s position by
a small random amount, to separate lines meeting at the
same point. There is a simple tuple-coloring mechanism,
which changes the color of a selected line to show the weight
assignment on that tuple.
The value of the visualizer in presenting similar nodes to
the user is evident in Fig. 3; this may be used, for instance,
to point to a subset of nodes for masking or modification
(recall from Sect. 2). It would provide a useful framework for an interactive data-mining tool.

Fig. 3. Graphical view of system (nodes are plotted by column on the horizontal axis and by weight, from −1 to +1, on the vertical axis)
3.2 Performance
Our algorithm scales well with increasing data volumes. We
measured the running time for ten iterations, for varying
numbers of tuples and numbers of columns. The S∞ combiner was used, and the principal basin together with the first
four non-principal basins were computed. The plot in Fig. 4
demonstrates that the running time is linear in the number of tuples. The time taken to read the database and initialize the
iteration is negligible. Qualitatively similar plots of running
times are obtained for the other main combining rules.
The running times were measured on a 128-MB, 200-MHz Pentium Pro machine, running Linux, with no other
user processes. The dataset was generated randomly.
4 Quasi-random inputs
In this section, we report on extensive experience with the
Stirr system on the class of quasi-random inputs discussed
in the introduction. For the purposes of testing the performance of a mining algorithm, such inputs are a natural setting in which to search for “planted” structures among random noise. This is a technique used in the study of computationally difficult problems. For instance, in combinatorial
optimization, prior work analyzes graph bisection heuristics in quasi-random graphs in which a “small” bisection is hidden; the same approach has been adopted in the analysis of graph-coloring algorithms. The advantage of such quasi-random
data is that we may control in a precise manner the various
parameters associated with the dynamical systems we use,
studying in the process their efficacy at discovering the appropriate planted structure. Consider a randomly generated
categorical table in which we plant extra random tuples involving only a small number of nodes in each column. This
subset of nodes may be thought of as a “clump” in the data,
since these nodes co-occur more often in the table than in
a purely random table. (The analog in numerical data with random multidimensional points is a region with a higher probability density than the ambient.) Our hope, then, might be that Stirr discovers such clumps.

We first list the aspects of Stirr that we wish to study; we follow this with a description of the quasi-random experiment in each case.
1. How well does Stirr distill a clump in which the nodes
have an above-average rate of co-occurrence? Intuitively,
this is of interest because its analog in real data could
be the set of people (in a travel database) with similar travel patterns, or a set of failures (in a maintenance
database) that stem from certain suppliers, manufacturing plants, etc. Whereas a conventional data analysis
technique would be faced with enumerating all cross-products across columns of all possible subsets in each
column, and examining the support of each, the Stirr
technique appears to avoid this prohibitive computation.
How quickly does Stirr elicit such structure (say, as a
function of the number of iterations, the relative density
of the clump to the background, the combine operator
⊕, etc.)? Our experiments show that for quasi-random
inputs, such structure emerges very quickly, typically in
about five iterations.
2. How well does Stirr separate distinct planted clumps
using non-principal basins? For instance, if in a random
table we were to plant two clumps each involving a
distinct set of nodes, will the first non-principal basin
place the clumps at opposite ends of the partition? Intuition from spectral-graph partitioning suggests that typically some non-principal partition will separate a pair
of clumps; our experiments show that Stirr usually
achieves such a separation within the first non-principal
basin, for quasi-random inputs. A related study involves
two clumps of different densities planted in a random
table. For random starts and the product rule, how often
does the denser clump appear in the principal basin, as
a function of the relative densities?
3. How well does Stirr cope with tables containing clumps
in a few columns, with the remaining columns being random? We study this by adding (to a randomly generated
table) additional random tuples concentrated on a few
nodes in three of the columns; the entries for the remaining columns are randomly distributed over all nodes.
Thus, these remaining columns correspond to irrelevant
attributes that might (by their totally random distribution) mask the clump planted in columns 1–3. This is a
standard problem in data mining, where we seek to mask
out irrelevant factors (here columns) from a pattern. Our
experiments show that Stirr will indeed mask out such
“irrelevant” attributes; we know of no analog for this
phenomenon in spectral-graph partitioning.
In the following plots, each point of each curve is the
average of 100 quasi-random experiments, typically using
ten quasi-random tables and ten random starting configurations on each. Thus, each of the plots summarizes data from
several thousand quasi-random experiments. Wherever we
obtain qualitatively similar results while varying table sizes,
we have chosen to present the results for a fixed, representative table size while varying other parameters of interest.
Fig. 4. Running times. We have truncated the y axis for compactness; all the lines remain straight up to 1000000 tuples (axes: number of tuples in table, up to 30000, vs. time in seconds, with one curve per number of columns, 3 to 8)
Fig. 5. Emergence of the principal basin (clump purity, as a percentage, vs. number of iterations, 2–10, for planted clumps of 30, 40, 50, 75, and 100 extra triples)
Fig. 6. Variance of the principal basin experiments (ratio of standard deviation to mean of the clump purity vs. number of iterations, for the same clump sizes)
Figure 5 captures the ability of Stirr to extract a single
planted clump from tables. This plot represents data for a
quasi-random table with three columns, and 1000 nodes in
each column. In addition to 5000 random triples on these
nodes, we select 10 nodes in each column and plant additional random tuples within these (10 × 3 = 30) nodes. We
vary the number of triples in this planted clump over the
set {30, 40, 50, 75, 100}. On the x-axis we vary the number of Stirr iterations using the S∞ operation for the ⊕
function, from 2 to 10. On the y-axis we plot the percentage
of the nodes from the planted clump that appear within
the top ten positions of the principal basin; thus, this fraction
would ideally be high. A random set of 10 nodes of any column would have 100 × 10/1000 = 1% of its nodes from this
clump; Stirr obtains a significant boost (to 25% or greater)
even when the planted clump contains only 30 extra
triples in addition to the 5000 random “background” triples.
As the number of triples in this planted clump increases
further, its purity in the principal basin increases towards
100%. As mentioned before, we obtained qualitatively the
same plots for other table sizes, and therefore display results
for three columns with 1000 nodes each. The major lesson
from these and other similar experiments is that, typically,
the structure in the principal basin emerges within a small
number of iterations (under five). Coupled with the fact that
the computation during a single iteration is linear in the number of tuples, this suggests that the structure emerges in time
linear in the size of the table.
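The quasi-random tables used in this experiment are straightforward to generate; the following sketch (ours, with parameter defaults and helper names of our own choosing) produces a three-column table with a single planted clump.

import random

def quasi_random_table(n_nodes=1000, n_background=5000, clump_size=10, n_planted=50):
    columns = [[f"c{c}_{i}" for i in range(n_nodes)] for c in range(3)]
    table = [tuple(random.choice(col) for col in columns) for _ in range(n_background)]
    clump_cols = [col[:clump_size] for col in columns]      # the planted nodes in each column
    table += [tuple(random.choice(col) for col in clump_cols) for _ in range(n_planted)]
    return table, clump_cols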
Our experiments involve two sources of randomization:
in the construction of the table, and in the random choice
of Stirr’s starting configuration. We find that the variance
in the experimental results we obtain, over these random
choices, is quite small. For the single-clump experiment in
the previous paragraph, we summarize this in Fig. 6, computing the ratio of the standard deviation to the mean for
the “purity” parameter defined above. Note additionally that
for all but the sparsest clump density, this ratio decreases
rapidly as the number of iterations increases.
In Fig. 7, we study the ability of Stirr to separate two
planted clumps. As before, we only display results for a
three-column table with 1000 nodes in each column, with
5000 random “background” triples. To this, we add two
clumps at disjoint sets of 10 selected nodes in each column.
As in the first experiment, we vary the number of tuples in
each of the planted clumps over the set {30, 40, 50, 75, 100};
on the x-axis is the number of iterations, and on the y-axis we plot the following measure of separation of the two
clumps. We make the stringent demand in this experiment
that the separation manifest itself in the first non-principal
basin (rather than by any other non-principal basin). Intuition from spectral-graph partitioning suggests that the two
clumps will be placed at opposite ends of this non-principal
basin. Let a0 and b0 be the numbers of nodes from clumps
A and B at one end of this basin, and a1 and b1 be the corresponding numbers at the other end. Clearly, a0 + a1 ≤ 30,
and likewise for the bi . Under perfect separation we would
have a0 = 30 and b1 = 30, or vice-versa. Our separation
measure in this case (this generalizes easily to other clump
and table sizes) is
s(A, B) = (|a0 − b0| + |a1 − b1|) / 60 .
Thus, if the nodes of the two clumps are present in roughly
equal proportions at the extremities of the partition, the measure is close to zero; as we have closer to 30 nodes from each
clump at each end of the partition, the measure gets closer
to 1. This measure is fairly demanding, for several reasons:
(1) the first non-principal basin may not place many nodes
of the clumps at its extremities, so that a0 + a1 and b0 + b1
may both be small compared to 30; (2) the measure penalizes each end of the partition for nodes from the “other”
clump. Thus, if each end of the partition has 75% of the
nodes from one clump and 25% from the other, we score
only 0.5 by our measure. The summary findings from these
experiments are that (1) the number of iterations to separation is relatively small (around 15, larger than for extracting
the dominant clump using the principal basin), and fairly
insensitive to the clump density; (2) the separation achieved
is quite marked.

Fig. 7. Separating two clumps using the first non-principal basin (separation measure s(A, B) vs. number of iterations, 5–25, for clumps of 30, 40, 50, 75, and 100 extra tuples)
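As a sketch of how the measure can be computed from a basin (our own code; the exact rule for selecting the nodes "at each end" is our assumption, here simply the 30 most negative and 30 most positive nodes overall):

def separation(basin, A, B, ends=30):
    """s(A, B) from the first non-principal basin. `basin` maps node -> weight
    (weights may be negative); A and B are the sets of planted clump nodes."""
    ranked = sorted(basin, key=basin.get)        # most negative ... most positive
    neg, pos = set(ranked[:ends]), set(ranked[-ends:])
    a0, b0 = len(neg & A), len(neg & B)
    a1, b1 = len(pos & A), len(pos & B)
    return (abs(a0 - b0) + abs(a1 - b1)) / float(len(A) + len(B))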
What happens when the quasi-random input has two
clumps of different densities planted in it, and the product combiner is used? Figure 8 studies the effect (again in
three-column tables with 1000 nodes per column) when the
table has 5000 random tuples, plus two clumps made up
of two disjoint sets (call them A and B) of 10 nodes per
column. Each random clump is created by adding nA random tuples containing only nodes from within A and nB
tuples within B, where nA > nB . For random starts (over
a number of such quasi-random tables), how often does the
principal basin focus on the heavier clump, as a function of
nA /nB ? The y-axis in this plot is the ratio of the number
of nodes from A to the number of nodes from B in the top
10 positions of each column, in the principal basin. When nA/nB is 1, this fraction is 0.5 on average.

Fig. 8. Fraction of random starts ending in the denser clump (fraction of nodes from the denser clump in the principal fixed point vs. the ratio of denser to lighter clump, 1–1.5)

Fig. 9. Resilience to random additional columns (clump purity, as a percentage, vs. number of redundant columns, 4–7, for clumps of 30, 40, and 50 extra tuples)
As evident from the plots, Stirr boosts the denser
clump.
Figure 9 depicts the same setting as in Fig. 5 above, but
with additional columns added on to study the effect of irrelevant attributes. Specifically, the clump of additional random
tuples that we plant is focused on 10 nodes in columns 1–3,
but in the remaining columns is completely random. We thus
think of these remaining columns as “irrelevant attributes”
that would detract from finding the planted clump. In fact,
as seen, Stirr suffers a relatively modest and graceful loss
as the number of irrelevant columns is increased.
Finally, we use experiments on quasi-random data to
understand how the choice of combine operator (⊕) affects
speed of convergence. Figure 10 shows experiments with
different choices for the combine operator. Max (i.e., S∞ )
is the clear winner. Product did not converge monotonically,
so second place falls to addition (S1 ), or to S2 or S5 if the
arithmetic speed can be improved.

Fig. 10. Convergence for different ⊕. Table has 2100 rows: 1000 using one set of attributes, A, and 1100 with another, B. Convergence is measured as the number of B attributes in the top 20 positions. The experiment measured average running times and iterations until 90% convergence (panels: iterations and running time in seconds vs. percentage converged, 40–100%, for the Max, 5-norm, 2-norm, Sum, Geometric, Harmonic, and Product operators)
Each data point in each experiment is an average over
100 runs: using ten random tables, with ten random starting
points. The number of entities in the table here is fixed;
we obtained similar results for other table sizes. The convergence requires very few iterations, even with the fairly
strict separation measure used in the two-cluster experiment.
Since the time for each iteration is linear in the number of
tuples, we can expect to extract structure in essentially linear
time.
5 Real data
We now discuss some of our experience with this method
on a variety of “real” data sets.
Basic data sets. First, we consider two data sets that illustrate co-occurrence patterns.
(1) Bibliographic data. We took publicly accessible bibliographic databases of 7000 papers from database research, and of 30,000 papers written on theoretical computer science and related fields, and constructed two data sets each with
four columns as follows. For each paper, we recorded the
name of the first author, the name of the second author, the
conference or journal of publication, and the year of publication. Thus, each paper became a relation of the following
form:
(First-Author, Second-Author, Conference/Journal Name, Year).

Note that we deliberately treat the year, although a numerical field, as a categorical attribute for the purposes of our experiments. In addition to the tables of theory papers and database papers, we also constructed a mixed table with both sets of papers.

(2) Login data. From four heavily used server hosts within IBM Almaden, we compiled the history of user logins over a several-month period. This represented a total of 47,000 individual logins. For each login, we recorded the user name, the remote host from which the login occurred, the hour of the login, and the hour of the corresponding logout. Thus, we created the tuples of the form: (user, remote-host, login-time, logout-time). Again, the hours were treated as categorical attributes.

Sequential data sets. Next, we discuss a natural approach by which this method can be used to analyze sequential datasets. The hope here is to use local co-occurrences in time to elicit
local cause-effect patterns in the data. We approach this as
follows. Given a long string of sequentially occurring events,
we label them as e1 , e2 , . . . , en . We then imagine a “sliding
window” of length ℓ passing over the data, and construct ℓ-tuples (ei, ei+1, . . . , ei+ℓ−1) for each i between 1 and n − ℓ + 1.
For data exhibiting a strong local cause-effect structure, we
hope to see such structure emerge from the co-occurrences
among events close in time.
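A minimal sketch of this sliding-window construction (our own illustration):

def sliding_window_tuples(events, window):
    """Turn the event sequence e1, ..., en into overlapping tuples of length `window`."""
    return [tuple(events[i:i + window]) for i in range(len(events) - window + 1)]

# e.g., sliding_window_tuples(["e1", "e2", "e3", "e4"], 3)
#   -> [("e1", "e2", "e3"), ("e2", "e3", "e4")]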
This framework bears some similarity to earlier work on defining frequent “episodes” in sequential data. Our
approach is quite different from theirs, however, since we
do not search separately for each discrete type of episode;
rather, we use the core algorithm of Stirr to group events
that co-occur frequently in temporal proximity, without attempting to impose an episode structure on them.
The main example we study here comes from a readily
accessible source of sequential data, exhibiting strong elements of causality: chess moves. Typically, the identity of
a single move in a chess game allows one to make plausible inferences about the identity of the moves that occurred
nearby in time. To be concrete, we took the moves of 2000
games played from a common opening, and used a variant of the method above to create the following sets of
4-tuples: (White’s nth move, Black’s nth move, White’s (n + 1)st move, Black’s (n + 1)st move).
(First-Author, Second-Author, Conference/
Journal Name, Year).
Note that we deliberately treat the year, although a numerical field, as a categorical attribute for the purposes of our
experiments. In addition to the tables of theory papers and
database papers, we also constructed a mixed table with both
sets of papers.
(2) Login data. From four heavily used server hosts
within IBM Almaden, we compiled the history of user logins over a several-month period. This represented a total
of 47,000 individual logins. For each login, we recorded the
user name, the remote host from which the login occurred,
the hour of the login, and the hour of the corresponding logout. Thus, we created the tuples of the form: (user, remotehost, login-time, logout-time). Again, the hours were treated
as categorical attributes.
Sequential data sets. Next, we discuss a natural approach by
which this method can be used to analyze sequential datasets.
The hope here is to use local co-occurrences in time to elicit
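As an illustration of the sliding-window construction, here is a small sketch (the function name and the toy move list are ours, purely for illustration):

```python
def sliding_window_tuples(events, window):
    """Turn an event sequence e_1, ..., e_n into the window-length tuples
    (e_i, ..., e_{i+window-1}) for i = 1, ..., n - window + 1."""
    return [tuple(events[i:i + window])
            for i in range(len(events) - window + 1)]

# e.g. four-ply windows over a short list of half-moves:
moves = ["e4", "e6", "d4", "d5", "Nc3", "Bb4"]
print(sliding_window_tuples(moves, 4))
# [('e4', 'e6', 'd4', 'd5'), ('e6', 'd4', 'd5', 'Nc3'), ('d4', 'd5', 'Nc3', 'Bb4')]
```

The chess experiment uses a variant of this construction, with windows of four plies arranged into the (White's nth move, Black's nth move, White's (n+1)st move, Black's (n+1)st move) tuples above.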
Overview. At a very high level, the following phenomena
emerge. First, the principal and non-principal basins computed by the method correspond to dense regions of the
data with natural interpretations. For the sequential data, the
nodes with large weight in the main basins exhibit a natural
type of cause-effect relationship in the domains studied.
The fact that we have treated numerical attributes as categorical data – without attempting to take advantage of their
numerical values – allows us to make some interesting observations about the power of the method at “pulling together”
correlated nodes. Specifically, we will see in the examples
below that nodes which are close as numbers are typically
grouped together in the basin computations, purely through
their co-occurrence with other categorical attributes. This
suggests a strong sense in which co-occurrence among categorical data can play a role similar to that of proximity
in numerical data, for the purpose of clustering. We divide
the discussion below into two main portions, based on the
combining rule ⊕ that we use.
Fig. 11. Separating a mixture of theory and database papers. For each column we list the ten nodes with the most positive weights and, below the double line, the ten with the most negative weights.

First-Author            Second-Author           Conference/Journal       Year
 0.1811:  Chen           0.1629:  Chen           0.317:   TCS             0.563:     1995
 0.1185:  Chang          0.1289:  Wang           0.2264:  IPL             0.5596:    1994
 0.1147:  Zhang          0.1007:  COMPREVS       0.195:   LNCS            0.4332:    1996
 0.1119:  Agarwal        0.1:     Li             0.1633:  INFCTRL         0.151:     1993
 0.1104:  Bellare        0.09004: Rozenberg      0.1626:  DAMATH          0.07487:   1992
 0.1053:  Gu             0.08837: Igarashi       0.1464:  JPDC            0.0155:    1976
 0.09467: Dolev          0.08171: Sharir         0.1371:  SODA            0.005276:  1972
 0.08571: Hemaspaand     0.0797:  Huang          0.1266:  STOC            0.004569:  1985
 0.08423: Farach         0.07924: Maurer         0.1074:  IEEETC          0.001891:  1973
 0.08232: Ehrig          0.0783:  Lee            0.1002:  JCSS            0.001679:  1970
==========================================================================================
-0.1195:  Stonebrake    -0.1068:  Wiederhold    -0.384:   IEEEDataEng    -0.2165:    1986
-0.1059:  Agrawal       -0.09615: David         -0.256:   VLDB           -0.1983:    1987
-0.1023:  Wiederhold    -0.08343: DeWitt        -0.2392:  SIGMOD         -0.1519:    1989
-0.07735: Abiteboul     -0.07643: Richard       -0.1504:  PODS           -0.14:      1988
-0.06783: Yu            -0.07454: Stonebrak     -0.1397:  ACMTDS         -0.11:      1984
-0.06722: Navathe       -0.07403: Michael       -0.1176:  IEEETransa     -0.06571:   1975
-0.06615: Litwin        -0.07159: Robert        -0.04832: IEEETrans      -0.06431:   1990
-0.06308: Bernstein     -0.06843: Jagadish      -0.04658: IEEETechn      -0.03765:   1980
-0.0623:  Jajodia       -0.06722: James         -0.04533: WorkshopI      -0.03238:   1974
-0.0587:  Motro         -0.0671:  Navathe       -0.03581: IEEEDBEng      -0.02873:   1982
5.1 The Sp rules
We now discuss our experience with the Sp combining rules.
We ran the S∞ dynamical system on a table derived as described above from a mixture of theory and database research papers. The first non-principal basin is quite striking:
in Fig. 11, we list the ten items with the most positive and
most negative weights in each column (together with their
weights) for this first non-principal basin; a double horizontal line separates these groups. (Note that no row in Fig. 11
need be a tuple in the input; the table simply lists, for each
column, the ten nodes with the most positive (and the ten
with the most negative) values.) Note also that our parser
identified all people by their last names so that, for instance,
the Chen at the top is actually several Chens taken together.
A number of phenomena are apparent from the above summary views of the theory and database communities (the fact
that the theory community ended up at the positive end of the
partition and the database community at the negative end is
not significant – this depends only on the random initial configuration, and could just as well have been reversed). What
is significant is that such a clean dissection resulted from the
dynamical system, and that too within the first (rather than
some lower) non-principal basin. Note that these phenomena have been preserved despite some noise in the database
research bibliography: first names on single-authored papers
have sometimes been parsed as second authors. In the same
run, the ninth non-principal partition gave a fairly clean separation between database researchers of a more theoretical
nature and those who are more system-oriented.
5.2 The product rule
The high-level observations above apply to our experience
with the combining rule Π; but the sensitivity of Π to the
initial configuration leads to some additional recurring phenomena. First, although there is not a unique fixed point
that all initial configurations converge to, there is generally a relatively small number of basins. Thus, we can gain
additional “similarity” information about initial configurations by looking at those configurations that reach a common
basin. Additionally, the “masking” of frequently occurring
nodes can be a useful way to uncover hidden structure in
the data. In particular, a frequent node can force nearly all
initial configurations to a common basin in which it receives
large weight. By masking this node, one can discover a large
number of additional basins that otherwise would have been
“overwhelmed” in the iterations of the dynamical system.
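A hedged sketch of how masking can be realized around an iteration of this kind (the names and the update callable below are illustrative placeholders, not the actual Stirr code): after each propagation-and-normalization step, the masked node's weight is reset to zero so that it can no longer attract the configuration.

```python
def iterate_with_mask(weights, update_step, masked_nodes, num_iters=50):
    """weights:      dict mapping each node to its current weight
       update_step:  one full propagation + normalization pass (a callable)
       masked_nodes: nodes whose weight is pinned to zero each iteration"""
    for _ in range(num_iters):
        weights = update_step(weights)
        for node in masked_nodes:
            weights[node] = 0.0   # masking: the node can no longer gather weight
    return weights
```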
Bibliographic data. We begin with the bibliographic data on
theoretical CS papers. There are a number of co-occurrences
that one might hope to analyze in this data, involving authors, conferences, and dates. Thus, if we choose an initial
configuration centered around the name of a particular conference and then iterate the product rule, we generally obtain a basin that corresponds naturally to the “community”
of individuals who frequently contribute to this conference,
together with similar conferences. For example, initializing
the configuration over PODS, we obtain the following basin, which by inspection is a remarkable summary view of the PODS community and some of its principal members:

First-Author       Second-Author      Conference/Journal          Year
0.4159  Sagiv      0.3043  Shmueli    0.9094  PODS                0.4170  1986
0.1321  Abiteboul  0.1463  Vardi      0.0380  SICOMP              0.1904  1985
0.0887  Shmueli    0.1043  Sagiv      0.0333  JACM                0.1433  1987
0.0873  Vardi      0.0934  Yehoshua   0.0112  DataKnowledgeBases  0.1260  1989
0.0484  Ullman     0.0901  Ullman     0.0049  SIGMOD              0.0977  1988
0.0419  Chan       0.0390  Vianu      0.0023  StanfordCSDreport   0.0154  1992
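For readers reproducing such runs, "initializing the configuration over" a node can be realized, in its simplest form, by starting from a configuration concentrated on that node (a hypothetical helper, not code from the paper):

```python
def initialize_over(nodes, seed_node, background=0.0):
    """Start the iteration from a configuration concentrated on seed_node;
    every other node gets a small (here zero) background weight."""
    return {node: (1.0 if node == seed_node else background) for node in nodes}

# e.g. weights = initialize_over(all_nodes, "PODS") before iterating the product rule
```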
Initializing a configuration over a specific year leads to
another very interesting phenomenon, alluded to above: the
basin groups “nearby” years together, purely based on cooccurrences in the data. It is, of course, natural to see how
this should happen; but it does show a concrete way in which
co-occurrence mirrors more traditional notions of “proximity.” For example, initializing over the year 1976 leads to
the following basin:
First-Author     Second-Author      Conference/Journal   Year
0.697  Garey     0.824  Johnson     0.344  JACM          0.461  1976
0.241  Aho       0.059  Hirschberg  0.228  SICOMP        0.152  1978
0.026  Yu        0.039  Hopcroft    0.123  TCS           0.122  1977
0.018  Hassin    0.030  Graham      0.123  TRAUB         0.095  1975
0.004  Chandra   0.009  Tarjan      0.046  IPL           0.082  1985
Observe that the years grouped with 1976 are 1978, 1977,
1975, and 1985. Once again, this is a fairly descriptive summary view of theoretical CS research around 1976. It is important to note that initializing over different years can lead
to the same basin; this, as discussed above, is due to the
“attracting” nature of strong basins. Thus, for example, we
obtain exactly the same basin when we initialize over 1974,
although 1974 does not ultimately show up among the top
5 years when the basin is reached.
Login data. We now discuss some examples from the login
data introduced above. For this data, there was one user
who logged in/out very frequently, gathering large weight
in our dynamical system, almost regardless of the initial
configuration. Thus, for all the random initializations tried
on this data set, a common basin was reached.
In order to discover additional structure, it proved very
useful to mask this user, via the masking operation introduced above. Users in this data can exhibit similarities based
on co-occurrence in login times and machine use. If we
first mask the extremely common user(s), and then initialize
the configuration over the user root, such further structure
emerges: the four highest weight users turn out to be root;
a special “help” login used by the system administrators;
and the individual login names of two of the system administrators. Thus, Stirr discerns system administrators based
on their co-occurrences across columns with root.
Finally, we illustrate another case in which “similar”
numbers are grouped together by co-occurrence, rather than
by numerical proximity. When we initialize over the hour 23
(i.e., 11 pm) as the login time, we obtain a set of users who
dominate late-night login times, and we obtain the following hours as the highest weight nodes for login and logout
times:
Login hour       Logout hour
0.430  22        0.457  22
0.268  21        0.286  23
0.173  23        0.134  21
0.079  20        0.060  00
0.021  00        0.028  01
Thus, the algorithm concludes that the hours 22, 21, 23, 20,
and 00 are all “similar,” purely based on the login patterns
of users.
Chess moves. Finally, we indicate some of the sequential
patterns one can discover via our method, using the example
of chess move sequences. We start from long sequential lists
of chess moves, sliced into four-ply windows as discussed
above. With random initialization and no masking, the basin
is concentrated on an extremely small number of nodes:
White's nth move   Black's nth move   White's (n+1)st move   Black's (n+1)st move
0.999  e4          1.000  e6          0.997  d4              1.000  d5
0.000  Nc3         0.000  d5          0.002  Nc3             0.000  Bb4
The strong concentration on the sequence e4-e6-d4-d5 corresponds to the fact that we have chosen only games played
with the French Defence, which begins with this sequence.
In order to discover sequential patterns that occur in the middle of games, we need to mask the strong co-occurrences in the opening. In this way, we can discover
common local motifs in the move sequences, and moves
that serve similar functions in the temporal sequence. For
example, initializing a configuration over the move f5, and
masking the most common opening moves to prevent the
above basin from being reached, we obtain the following
basin:
White's nth move   Black's nth move   White's (n+1)st move   Black's (n+1)st move
0.999  f5          0.755  exf5        0.599  Bxf5            0.476  Bxf5
0.000  Bxf5        0.215  gxf5        0.329  gxf5            0.074  Be8
0.000  Kb1         0.029  Bxf5        0.028  Bxd5            0.073  Rxf5
0.000  Nxf5        0.000  Nxf5        0.019  Qxf5            0.068  g6
0.000  gxf5        0.000  Kh6         0.015  Rxf5            0.068  Nc6
We note several features of the above basin. First, and most
obviously, the second column indicates Black’s main responses to the move f5: these are to capture either with an
e-pawn, a g-pawn, a Bishop, or a Knight. But the basin
captures additional structure: because the second column is
dominated by Black’s captures on f5, alternatives to f5 in
the first column gain (small amounts of) weight – for example, the capture of a piece on the square f5 by a Bishop
or a Knight. Thus, the propagation of weight moves both
forward and backward in time, as one would like.
6 Conclusions and further work
We have given a new method for clustering and mining tables of categorical data, by representing them as non-linear
dynamical systems. The two principal contributions of this
paper are: (1) we generalize ideas from spectral-graph analysis to hypergraphs, bringing the power of these techniques to bear on clustering categorical data; (2) we develop a novel connection between tables of categorical data and non-linear dynamical systems. Experiments with the Stirr system show
that such systems can be effective at uncovering similarities
and dense “subpopulations” in a variety of types of data;
this is illustrated both through explicitly constructed “hidden populations” in our quasi-random data and on
real data drawn from a variety of domains. Moreover, these
systems converge very rapidly in practice, and hence the overall computational effort is typically linear in the size of
the data.
An interesting direction in which to consider extending
these methods is to allow for the incorporation of intrinsic
similarity information for some attributes. A given attribute
may be endowed with an a priori notion of similarity – this may happen if it consists of numerical data – and it
would be useful to develop a principled way of integrating
this information with the similarity relationships that emerge
from Stirr through its analysis of co-occurrence.
The connection between dynamical systems and the mining of categorical data suggests a range of interesting further
questions. At the most basic level, we are interested in determining other data-mining domains in which this approach
can be effective. We also feel that a more general analysis
of the dynamical systems used in these algorithms – a challenging prospect – would be interesting at a theoretical level
and would shed light onto further techniques of value in the
context of mining categorical data.
Acknowledgements. We thank Rakesh Agrawal for many valuable discussions on all aspects of this research, and for his great help in the presentation of the paper. We also thank Bill Cody and C. Mohan for their valuable
comments on drafts of this paper.
7 Appendix
Dynamical systems and mathematical background
We briefly outline some background on dynamical systems,
adapted to our purposes here. For purposes of the present
discussion, a dynamical system is the repeated iteration of a
function f : U → U, where U is a vector space. The mathematical study of dynamical systems (see Devaney 1989 for a comprehensive introduction) considers the way in which a configuration of weights X ∈ U evolves starting from various initial weights, for different classes of maps f (). For example, we can ask whether
the iterations cause X to converge to a fixed point – a point
X ∗ for which f (X ∗ ) = X ∗ – or cause X to cycle through a
small number of weights.
Example. As a very simple example, consider the case in which U = R, the one-dimensional real line, and f (X) = X^2. The sequence of weights assumed by X when f () is iterated depends on the starting weight of X. X = 0 and X = 1 are fixed points, since f (X) = X. When X > 1 or X < −1, the sequence of weights grows without bound; when −1 < X < 1, the sequence of weights always converges to 0.
Example. As another example, suppose that U = R^3, and for a vector X = (x, y, z), we have f (X) = (1/2)(y + z, z + x, x + y). Then any vector X with positive coordinates adding up to s will converge to (s/3, s/3, s/3) as f () is repeatedly applied.
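Both examples can be checked in a few lines; the following sketch (ours, using ordinary floating-point arithmetic) simply iterates the two maps:

```python
# Example 1: f(X) = X^2 on the real line.
x = 0.9
for _ in range(20):
    x = x * x
print(x)          # effectively 0, since |x| < 1; starting from 1.1 it would blow up

# Example 2: f(x, y, z) = ((y + z)/2, (z + x)/2, (x + y)/2).
v = (6.0, 2.0, 1.0)                       # coordinates sum to s = 9
for _ in range(50):
    x, y, z = v
    v = ((y + z) / 2, (z + x) / 2, (x + y) / 2)
print(v)          # approaches (s/3, s/3, s/3) = (3.0, 3.0, 3.0)
```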
The second example captures a general phenomenon.
When f () is linear, the process of iterating f () can typically be analyzed using the methods of linear algebra; for
instance, if f () represents left multiplication by a matrix A,
then X converges (subject to a few technical conditions) to
the principal eigenvector of A. The non-principal eigenvectors of A have a meaning in this setting as well, and underlie
the intuition behind portions of our technique.
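As a small numerical illustration of the linear case (a toy two-by-two matrix, using numpy; not a matrix arising from our data), repeated multiplication followed by renormalization drives the iterate toward the principal eigenvector:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                # a symmetric toy matrix

x = np.array([1.0, 0.0])                  # an arbitrary starting configuration
for _ in range(50):
    x = A @ x
    x = x / np.linalg.norm(x)             # renormalize so the iterate stays bounded
print(x)   # approximately (0.707, 0.707), the principal eigenvector of A
```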
If f is non-linear, the process of iterating f () becomes
much more complicated, and much of the underlying mathematics is not understood. Such non-linear maps lead to topics
such as chaotic dynamical systems, and have wide-ranging
applications in the modeling of continuous phenomena in
settings such as climate modeling, population genetics, and
financial analysis. (Again, see Devaney 1989.) We find it interesting to
observe some of the fundamental phenomena of non-linear
dynamical systems arising in our application, and we hope
that the present work will suggest further questions in this
active area.
The body of mathematical work that most closely motivates our current approach is spectral-graph theory, which
studies certain algebraic invariants of graphs. We motivate
this technical background with the following example.
Example. Consider a triangle: a graph G consisting of three
mutually connected nodes. At each node of G, we place a
positive real number, and then we define a dynamical system
on G by iterating the following map: the number at each
node is updated to be the average of the numbers at the
two other nodes. This dynamical system is easily seen to
be equivalent to that of the previous example: the numbers
at the three nodes converge to one another as the map is
iterated.
Spectral-graph theory is based on the relation between
such iterated linear maps on an undirected graph G and
the eigenvectors of the adjacency matrix of G. (If G has
n nodes, then its adjacency matrix A(G) is a symmetric n-by-n matrix whose (i, j) entry is 1 if node i is connected
to node j in G, and 0 otherwise.) The principal eigenvector of A can be shown to capture the equilibrium state of
a linear dynamical system of the form in the previous example; the non-principal eigenvectors provide information
about good “partitions” of the graph G into dense pieces
with few interconnections. The most basic application of
spectral-graph partitioning, which motivates some of our algorithms, is the following. Each non-principal eigenvector
of the adjacency matrix of G can be viewed as a labeling
of the nodes of G with real numbers – some positive and
some negative. Treating the nodes with positive numbers as
one “cluster” in G, and the nodes with negative numbers as
another, typically partitions the underlying graph into two
relatively dense pieces, with few edges in between.
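The following sketch (Python with numpy; the six-node graph is a made-up toy example, not data from the paper) carries out exactly this sign-based partitioning:

```python
import numpy as np

# Adjacency matrix of two triangles {0,1,2} and {3,4,5} joined by the edge (2,3).
A = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)

# eigh returns eigenvalues in ascending order for a symmetric matrix.
eigenvalues, eigenvectors = np.linalg.eigh(A)
second = eigenvectors[:, -2]              # eigenvector of the second-largest eigenvalue

positive_cluster = np.where(second > 0)[0]
negative_cluster = np.where(second < 0)[0]
print(positive_cluster, negative_cluster) # the two triangles, up to a global sign flip
```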
References
Agrawal R, Mannila H, Srikant R, Toivonen H, Inkeri Verkamo A (1996)
Fast Discovery of Association Rules. In: Fayyad UM, PiatetskyShapiro G, Smyth P, Uthurusamy R (eds.): Advances in Knowledge
Discovery and Data Mining, chapter 12. MIT Press, Cambridge, Mass.,
pp. 307–328
Agrawal R, Imielinski T, Swami A (1993) Mining association rules between sets of items in large databases. In: Buneman P, Jajodia S (eds.):
Proceedings of the 1993 ACM SIGMOD International Conference on
Management of Data, Washington, D.C., May 26–28, 1993. ACM Press
1993, SIGMOD Record 22(2), June 1993, pages 207–216
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules.
In: Bocca JB, Jarke M, Zaniolo C (eds.): Proceedings of the 20th International Conference on Very Large Data Bases (VLDB'94), September
12–15, 1994, Santiago de Chile, Chile. Morgan Kaufmann, ISBN 1–
55860-153-8, pp. 487-499
Agresti A (1990) Categorical Data Analysis. Wiley-Interscience, Strasbourg
Alon N (1986) Eigenvalues and expanders. Combinatorica 6: 83—96
Alon N, Spencer J (1991) The Probabilistic Method. Wiley, Chichester
Berge C (1973) Hypergraphs. North–Holland, Amsterdam
Blum A, Spencer J (1995) Coloring random and semi-random k-colorable graphs. J Algorithms 19: 204–234
Boppana R (1987) Eigenvalues and graph bisection: An average-case analysis. In: Proc. IEEE Symposium on Foundations of Computer Science, 1987, Los Angeles, California. IEEE Computer Society Press, pp. 280–285
Brin S, Motwani R, Ullman JD, Tsur S (1997) Dynamic itemset counting
and implication rules for market basket data. In: Peckham J (ed.): SIGMOD 1997, Proceedings ACM SIGMOD International Conference on
Management of Data, May 13–15, ACM Press 1997, Tucson, Arizona,
USA. SIGMOD Record 26(2), June 1997, pages 255—264
Brin S, Motwani R, Silverstein C (1997) Beyond Market Baskets: Generalizing Association Rules to Correlations. In: Peckham J (ed.): SIGMOD
1997, Proceedings ACM SIGMOD International Conference on Management of Data, May 13–15, ACM Press 1997, Tucson, Arizona,
USA. SIGMOD Record 26(2), June 1997, pages 265—276
Brin S, Page L (1998) Anatomy of a Large-Scale Hypertextual Web Search Engine. In: Proc. 7th International World Wide Web Conference, 1998, Brisbane, Australia. Elsevier, pp. 107–117
Charniak E (1993) Statistical Language Learning. MIT Press, Cambridge,
Mass.
Chiueh T (1994) Content–based image indexing. In: Bocca JB, Jarke M,
Zaniolo C (eds.): Proceedings of the 20th International Conference on
Very Large Data Bases, (VLDB’94), September 12–15, 1994, Santiago
de Chile, Chile. Morgan Kaufmann, ISBN 1–55860-153-8, pp. 582–
593
Chung FRK (1997) Spectral Graph Theory. AMS Press, Providence, R.I.
Das G, Mannila H, Ronkainen P (1998) Similarity of attributes by external
probes. In: Agrawal R, Stolorz PE, Piatetsky–Shapiro (eds.): Proceedings of the Fourth International Conference on Knowledge Discovery
and Data Mining (KDD–98), August 27-31, 1998, New York City,
New York, USA. AAAI Press, 1998, pp. 23–29
Deerwester S, Dumais ST, Landauer TK, Furnas GW, Harshman RA (1990)
Indexing by latent semantic analysis. J Soc Inf Sci 41(6): 391–407
Devaney RL (1989) An Introduction to Chaotic Dynamical Systems. Benjamin Cummings, New York
Donath WE, Hoffman AJ (1973) Lower bounds for the partitioning of
graphs. IBM J Res Dev 17: 420–425
Duda RO, Hart PE (1973) Pattern Classification and Scene Analysis. Wiley,
New York
Faloutsos C (1996) Searching Multimedia Databases by Content. Kluwer
Academic, Amsterdam
Fiedler M (1973) Algebraic connectivity of graphs. Czech Math J 23: 298–305
Flickner M, Sawhney H, Niblack W, Ashley J, Huang Q, Dom B, Gorkani
M, Hafner J, Lee D, Petkovic D, Steele D, Yanker P (1995) Query by
image and video content: The QBIC system. IEEE Comput 28: 23–32
Garey M, Johnson D (1979) Computers and Intractability: A Guide to the
Theory of NP–Completeness. W.H. Freeman, New York
Golub G, Loan CF van (1989) Matrix Computations. Johns Hopkins University Press, Baltimore, Md.
Han E–H, Karypis G, Kumar V, Mobasher B (1997) Clustering Based On
Association Rule Hypergraphs. In: Workshop on Research Issues on
Data Mining and Knowledge Discovery, 1997, Tucson, Arizona, in
conjunction with ACM SIGMOD 1997. ACM Press, publisher.
Hastie T, Stuetzle W (1989) Principal curves. J Am Stat Assoc 84: 502-516
Holm L, Sander C (1994) Proteins Struct Funct Genet 19: 256–268
Huang Z (1997) A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining. In: Workshop on Research Issues
on Data Mining and Knowledge Discovery, 1997
Jerrum M, Sinclair A (1996) The Markov chain Monte Carlo method: An
approach to approximate counting and integration. In: Hochbaum DS
(ed) Approximation Algorithms for NP–hard Problems. PWS Publishing, Boston, Mass., pp. 482—520
Jolliffe IT (1986) Principal Component Analysis. Springer, Berlin Heidelberg New York
Kleinberg J (1997) Authoritative sources in a hyperlinked environment.
IBM Research Report RJ 10076(91892). To appear in Journal of the
ACM
Kohonen T (1982) Self–organized formation of topologically correct feature
maps. Biol Cybern 43: 59–69
Kramer M (1991) Non–linear principal component analysis using autoassociative neural networks. AIChE J 37: 233–243
Mannila H, Toivonen H, Verkamo AI (1995) Discovering frequent episodes
in sequences. In: Fayyad UM, Uthurusamy R (eds.): Proceedings of the
First International Conference on Knowledge Discovery and Data Mining (KDD–95), Montreal, Canada, August 20–21, 1995. AAAI Press,
ISBN 0-929280-82-2, pp. 210–215
Ritter H, Martinetz T (1992) Neural Computation and Self–Organizing
Maps. Addison–Wesley, Reading, Mass.
Rumelhart DE, McClelland JL (Eds) (1986) Parallel and Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press,
Cambridge, Mass.
Seiferas J (1999) A large bibliography on theory/foundations of computer science. Available at http://liinwww.ira.uka.de/bibliography/Theory/Seiferas/Seiferas.html
Spielman D, Teng S (1996) Spectral partitioning works: Planar graphs and
finite–element meshes. In: Proc. IEEE Symposium on Foundations of
Computer Science, 1996, Burlington, Vermont. IEEE Computer Society
Press, Piscataway, N.J., pp. 96—105
Srikant R, Agrawal R (1995) Mining generalized association rules. In: Dayal
U, Gray PMD, Nishio S (eds.): Proceedings of 21th International Conference on Very Large Data Bases, September 11–15, 1995, Zurich,
Switzerland. Morgan Kaufmann, ISBN 1–55860-379-4, pages 407–419
Toivonen H (1996) Sampling large databases for finding association rules.
In: Vijayaraman TM, Buchmann AP, Mohan C, Sarda NL (eds.):
VLDB’96, Proceedings of 22th International Conference on Very Large
Data Bases, September 3-6, 1996, Mumbai (Bombay), India. Morgan
Kaufmann 1996, ISBN 1–55860-382-4, pages 134—145
Wiederhold G (1999) Bibliography on database systems. Available at http://www.ira.uka.de/bibliography/Database/Wiederhold/Wiederhold.html
Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: An efficient data-clustering method for very large databases. In: Jagadish HV, Mumick IS (eds.): Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Quebec, Canada, June 4–6,
1996. ACM Press 1996, SIGMOD Record 25(2), June 1996, pages
103—114