
“All human beings desire to know.”
Aristotle, Metaphysics, I.1
Game Trees-Clustering
Prof. Sin-Min Lee
Decision Tree
• A decision tree is a predictive model.
• Each interior node corresponds to a variable.
• An arc to a child represents a possible value of that variable.
• A leaf represents the predicted value of the target variable given the
values of the variables on the path from the root.
- A decision tree can be learned by splitting the source set into subsets
based on an attribute-value test.
- This process is repeated on each derived subset in a recursive manner.
- The recursion is complete when splitting is no longer feasible, or when a
single classification can be applied to each element of the derived subset
(see the sketch below).
- Decision trees can also be used to calculate conditional probabilities.
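To make the recursive splitting concrete, here is a minimal sketch in Python. The toy records, attribute names, and the naive choice of splitting attribute are invented for illustration and are not taken from the slides; a real learner would pick each split by a criterion such as information gain.

```python
# A minimal sketch of learning a decision tree by recursive splitting.
# The toy records and attribute names below are invented for illustration.
from collections import Counter

def learn_tree(records, attributes):
    """records: list of (features_dict, label). Returns a nested dict tree,
    or a label once a single classification applies (or no attributes remain)."""
    labels = [label for _, label in records]
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]   # leaf: predicted value
    attr = attributes[0]                              # naive split choice
    tree = {attr: {}}
    for value in {feats[attr] for feats, _ in records}:
        subset = [(f, l) for f, l in records if f[attr] == value]
        tree[attr][value] = learn_tree(subset, attributes[1:])
    return tree

data = [({"outlook": "sunny", "windy": False}, "play"),
        ({"outlook": "sunny", "windy": True}, "stay"),
        ({"outlook": "rainy", "windy": True}, "stay")]
print(learn_tree(data, ["outlook", "windy"]))
```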
Decision trees have three other names:
1. Classification tree analysis, used when the predicted outcome is the class
to which the data belongs.
2. Regression tree analysis, used when the predicted outcome can be
considered a real number.
3. CART (Classification And Regression Tree) analysis, which refers to both
of the above procedures.
Advantages of Decision Trees
• simple to understand and interpret
• require little data preparation
• able to handle both numerical and categorical data
• perform well on large data sets in a short time
• decisions are easily explained with Boolean logic
AprioriTid Algorithm
• The database is not used at all for counting the support of candidate
itemsets after the first pass.
1. Candidate itemsets are generated in the same way as in the Apriori
algorithm.
2. Another set C' is generated; each member holds the TID of a transaction
together with the candidate itemsets present in that transaction. This set
is used to count the support of each candidate itemset (a sketch follows
this list).
• The advantage is that the number of entries in C' may be smaller than the
number of transactions in the database, especially in the later passes.
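A hedged sketch of the C' idea, assuming itemsets are stored as sorted tuples and that the first C' maps each TID to the frequent 1-itemsets it contains; the function name and data layout are illustrative, not the paper's pseudocode. Each pass reads only the previous pass's C' entries, never the original database.

```python
def apriori_tid_pass(cbar_prev, candidates_k):
    """One AprioriTid pass (illustrative sketch).
    cbar_prev: dict mapping TID -> set of (k-1)-itemsets present in that transaction.
    candidates_k: candidate k-itemsets (sorted tuples) of this pass."""
    support = {c: 0 for c in candidates_k}
    cbar_k = {}
    for tid, prev_itemsets in cbar_prev.items():
        present = set()
        for c in candidates_k:
            # c is in the transaction if both (k-1)-itemsets obtained by
            # dropping its last or second-to-last item were present before.
            if c[:-1] in prev_itemsets and c[:-2] + c[-1:] in prev_itemsets:
                present.add(c)
                support[c] += 1
        if present:                 # transactions with no candidates drop out of C'
            cbar_k[tid] = present
    return support, cbar_k
```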
Apriori Algorithm
• Candidate itemsets are generated using only the large itemsets of the
previous pass, without considering the transactions in the database.
1. The set of large itemsets from the previous pass is joined with itself to
generate all itemsets whose size is larger by one.
2. Each generated itemset that has a subset which is not large is deleted.
The remaining itemsets are the candidates. (A join-and-prune sketch follows.)
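A minimal sketch of this join-and-prune candidate generation, assuming large itemsets are represented as sorted tuples; the function name follows the usual apriori_gen convention but the code is illustrative, not the slides' own.

```python
from itertools import combinations

def apriori_gen(prev_large, k):
    """Generate candidate k-itemsets from the large (k-1)-itemsets of the
    previous pass: a join step followed by a prune step."""
    prev = set(prev_large)
    candidates = set()
    # Join: two (k-1)-itemsets sharing their first k-2 items yield a k-itemset.
    for a in prev:
        for b in prev:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                candidates.add(a + (b[-1],))
    # Prune: drop any candidate that has a (k-1)-subset which is not large.
    return {c for c in candidates
            if all(s in prev for s in combinations(c, k - 1))}
```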
Example (* marks large itemsets; a minimum support of 2 is implied by the starred entries)

Database:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

L1:
Itemset   Support
{1}       2
{2}       3
{3}       3
{5}       3

C2:
Itemset   Support
{1 3}*    2
{1 4}     1
{3 4}     1
{2 3}*    2
{2 5}*    3
{3 5}*    2
{1 2}     1
{1 5}     1

C3:
Itemset   Support
{1 3 4}   1
{2 3 5}*  2
{1 3 5}   1
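As a hedged cross-check of the table above, the snippet below runs the passes over the same four transactions, reusing the apriori_gen sketch from earlier (assumed to be in scope); the minimum support of 2 is inferred from the starred entries.

```python
from collections import Counter

db = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}
minsup = 2

# L1: frequent single items, stored as sorted tuples.
counts = Counter(item for t in db.values() for item in t)
large = {(i,) for i, c in counts.items() if c >= minsup}

k = 2
while large:
    print(f"L{k-1}:", sorted(large))
    cands = apriori_gen(large, k)
    support = Counter(c for c in cands for t in db.values() if set(c) <= t)
    large = {c for c in cands if support[c] >= minsup}
    k += 1
# The large itemsets match the starred table entries:
# L2 = {1 3}, {2 3}, {2 5}, {3 5} and L3 = {2 3 5}.
```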
Example (continued: the same database processed by AprioriTid; * marks large itemsets)

Database:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

L1:
Itemset   Support
{1}       2
{2}       3
{3}       3
{5}       3

C2:
Itemset   Support
{1 2}     1
{1 3}*    2
{1 5}     1
{2 3}*    2
{2 5}*    3
{3 5}*    2

C'2 (each entry: a TID and the candidate 2-itemsets present in that transaction):
TID   Itemsets present
100   {1 3}, {1 4}, {3 4}
200   {2 3}, {2 5}, {3 5}
300   {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
400   {2 5}

C3:
Itemset   Support
{2 3 5}*  2

C'3:
TID   Itemsets present
200   {2 3 5}
300   {2 3 5}
• No practicable methodology has been demonstrated for reliable prediction
of large earthquakes on time scales of decades or less.
– Some scientists question whether such predictions will be possible even
with much improved observations.
– The pessimism comes from repeated cycles in which public promises that
reliable predictions are just around the corner are followed by equally
public failures of specific prediction methodologies. Bad for science!
Complex plate boundary zone in Southeast Asia: the northward motion of India
deforms the entire region, producing many small plates (microplates) and
blocks (Molnar & Tapponnier, 1977).
Mission District, San Francisco earthquake, 1906
• Short-term prediction (forecast) indicators:
– Frequency and distribution pattern of foreshocks
– Deformation of the ground surface: tilting, elevation changes
– Emission of radon gas
– Seismic gaps along faults
– Abnormal animal activities
The violent earthquake leveled Tangshan to the ground in moments. The photo
shows the ruins of the Tangshan urban area after the quake.
Freeway damage, 1994 California earthquake
Sand Boils after Loma Prieta Earthquake
California Earthquake Probabilities Map
Clustering
• Group data into clusters
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
– Unsupervised learning: no predefined classes
(Figure: two clusters of points, with a few outliers belonging to neither.)
What Is A Good Clustering?
• High intra-class similarity and low inter-class similarity
– Depending on the similarity measure
• The ability to discover some or all of the
hidden patterns
General Applications of Clustering
• Pattern Recognition
• Spatial Data Analysis
– create thematic maps in GIS by clustering feature spaces
– detect spatial clusters and explain them in spatial data mining
• Image Processing
• Economic Science (especially market research)
• WWW
– Document classification
– Cluster Weblog data to discover groups of similar access
patterns
Examples of Clustering Applications
• Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop targeted
marketing programs
• Land use: Identification of areas of similar land use in an earth
observation database
• Insurance: Identifying groups of motor insurance policy holders
with a high average claim cost
• City-planning: Identifying groups of houses according to their
house type, value, and geographical location
• Earthquake studies: Observed earthquake epicenters should be
clustered along continental faults
What Is Good Clustering?
• A good clustering method will produce high quality
clusters with
– high intra-class similarity
– low inter-class similarity
• The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation.
• The quality of a clustering method is also measured
by its ability to discover some or all of the hidden
patterns.
Data Structures in Clustering
• Data matrix
– (two modes)
• Dissimilarity matrix
– (one mode)
Data matrix (n objects x p variables):

\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]

Dissimilarity matrix (n x n, lower triangular):

\[
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots & \ddots &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]
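A small sketch (not from the slides) of building the two structures with NumPy/SciPy; the four 2-D objects are invented for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Data matrix: n objects (rows) x p variables (columns) -- "two modes".
X = np.array([[1.0, 2.0],
              [1.5, 1.8],
              [8.0, 8.2],
              [7.5, 9.0]])

# Dissimilarity matrix: n x n pairwise distances d(i, j) -- "one mode".
D = squareform(pdist(X, metric="euclidean"))
print(np.round(D, 2))   # symmetric, zeros on the diagonal
```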
Measuring Similarity
• Dissimilarity/Similarity metric: Similarity is expressed in terms
of a distance function, which is typically metric:
d(i, j)
• There is a separate “quality” function that measures the
“goodness” of a cluster.
• The definitions of distance functions are usually very different
for interval-scaled, boolean, categorical, ordinal and ratio
variables.
• Weights should be associated with different variables based on
applications and data semantics.
• It is hard to define “similar enough” or “good enough”
– the answer is typically highly subjective.
Notion of a Cluster Can Be Ambiguous
How many clusters?
(Figure: the same set of points interpreted as two, four, or six clusters.)
• Hierarchical algorithms
– Agglomerative: each object starts as its own cluster; merge clusters to
form larger ones
– Divisive: all objects start in one cluster; split it into smaller clusters
Types of Clusters: Well-Separated
• Well-Separated Clusters:
– A cluster is a set of points such that any point in a cluster is closer
(or more similar) to every other point in the cluster than to any point
not in the cluster.
3 well-separated clusters
Types of Clusters: Center-Based
• Center-based
– A cluster is a set of objects such that an object in a cluster is
closer (more similar) to the “center” of a cluster, than to the
center of any other cluster
– The center of a cluster is often a centroid, the average of all the
points in the cluster, or a medoid, the most “representative”
point of a cluster
4 center-based clusters
Types of Clusters: Contiguity-Based
• Contiguous Cluster (Nearest neighbor or
Transitive)
– A cluster is a set of points such that a point in a cluster is closer
(or more similar) to one or more other points in the cluster than
to any point not in the cluster.
8 contiguous clusters
Types of Clusters: Density-Based
• Density-based
– A cluster is a dense region of points, which is separated by low-density regions from other regions of high density.
– Used when the clusters are irregular or intertwined, and when
noise and outliers are present.
6 density-based clusters
Types of Clusters: Conceptual Clusters
• Shared Property or Conceptual Clusters
– Finds clusters that share some common property or represent a
particular concept.
2 Overlapping Circles
Hierarchical Clustering
(Figure: points p1–p4 shown as a traditional hierarchical clustering with its
dendrogram, and as a non-traditional hierarchical clustering with its
dendrogram.)
Hierarchical Clustering
• Produces a set of nested clusters organized as a
hierarchical tree
• Can be visualized as a dendrogram
– A tree like diagram that records the sequences of merges or
splits
(Figure: six points and the dendrogram recording the order in which they are
merged, with merge heights between 0 and 0.2.)
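A hedged sketch of producing such a dendrogram with SciPy; the six 2-D points are made up, and the linkage method is one of several choices discussed on the following slides.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

points = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32],
                   [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])

# 'single' = MIN, 'complete' = MAX, 'average' = group average (see later slides).
Z = linkage(points, method="single", metric="euclidean")
dendrogram(Z, labels=[str(i + 1) for i in range(len(points))])
plt.ylabel("merge distance")
plt.show()
```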
Starting Situation
• Start with clusters of individual points and a
proximity matrix
(Figure: twelve points p1–p12 and the initial proximity matrix indexed by
p1, p2, p3, ...)
Intermediate Situation
• After some merging steps, we have some clusters.
(Figure: five clusters C1–C5 and the current proximity matrix indexed by
C1–C5.)
• We want to merge the two closest clusters (C2 and C5) and update the
proximity matrix.
(Figure: the same clusters, with the rows and columns of C2 and C5
highlighted in the proximity matrix.)
After Merging
• The question is: how do we update the proximity matrix?
(Figure: the proximity matrix after replacing C2 and C5 with the merged
cluster C2 ∪ C5; its entries against C1, C3 and C4 are marked "?". A sketch
of one way to fill them in follows.)
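One way to answer that question, sketched in Python under assumed representations (clusters as sets of point indices, proximities as a square distance matrix); the helper name and the single-/complete-link update rules shown are illustrative choices, not the slides' code.

```python
import numpy as np

def merge_closest(D, clusters, linkage=min):
    """One agglomerative step.
    D: square proximity (distance) matrix over the current clusters.
    clusters: list of frozensets of original point indices.
    linkage: min -> single link, max -> complete link."""
    n = len(clusters)
    # Find the closest pair (smallest off-diagonal distance).
    i, j = min(((a, b) for a in range(n) for b in range(a + 1, n)),
               key=lambda ab: D[ab[0], ab[1]])
    merged = clusters[i] | clusters[j]
    keep = [k for k in range(n) if k not in (i, j)]
    new_clusters = [clusters[k] for k in keep] + [merged]
    # Rebuild the proximity matrix; the last row/column is the merged cluster.
    new_D = np.zeros((n - 1, n - 1))
    new_D[:n - 2, :n - 2] = D[np.ix_(keep, keep)]
    for idx, k in enumerate(keep):
        d = linkage(D[i, k], D[j, k])      # the "?" entries from the slide
        new_D[idx, -1] = new_D[-1, idx] = d
    return new_D, new_clusters
```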
How to Define Inter-Cluster Similarity
• MIN
• MAX
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
(Figure: two clusters of points p1–p5 and their proximity matrix, illustrating
where the inter-cluster similarity comes from. Code sketches of the first
three choices follow.)
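Hedged sketches of the first three definitions, written against a point-level distance matrix D; the helper names and argument layout are invented for illustration.

```python
def min_link(D, cluster_a, cluster_b):
    """MIN / single link: distance of the closest pair across the two clusters."""
    return min(D[i, j] for i in cluster_a for j in cluster_b)

def max_link(D, cluster_a, cluster_b):
    """MAX / complete link: distance of the farthest pair across the two clusters."""
    return max(D[i, j] for i in cluster_a for j in cluster_b)

def group_average(D, cluster_a, cluster_b):
    """Group average: mean pairwise distance between the two clusters."""
    return sum(D[i, j] for i in cluster_a for j in cluster_b) / (
        len(cluster_a) * len(cluster_b))
```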
Cluster Similarity: MIN or Single Link
• Similarity of two clusters is based on the two most similar (closest)
points in the different clusters.
– Determined by one pair of points, i.e., by one link in the proximity graph.

   I1   I2   I3   I4   I5
I1 1.00 0.90 0.10 0.65 0.20
I2 0.90 1.00 0.70 0.60 0.50
I3 0.10 0.70 1.00 0.40 0.30
I4 0.65 0.60 0.40 1.00 0.80
I5 0.20 0.50 0.30 0.80 1.00
Hierarchical Clustering: MIN
(Figure: nested single-link clusters over points 1–6, and the corresponding
dendrogram with merge heights between 0 and 0.2.)
Cluster Similarity: MAX or Complete Linkage
• Similarity of two clusters is based on the two
least similar (most distant) points in the
different clusters
– Determined by all pairs of points in the two clusters
I1 I2 I3 I4 I5
I1 1.00 0.90 0.10 0.65 0.20
I2 0.90 1.00 0.70 0.60 0.50
I3 0.10 0.70 1.00 0.40 0.30
I4 0.65 0.60 0.40 1.00 0.80
I5 0.20 0.50 0.30 0.80 1.00
Hierarchical Clustering: MAX
(Figure: nested complete-link clusters over points 1–6, and the corresponding
dendrogram with merge heights between 0 and 0.4.)
Cluster Similarity: Group Average
• Proximity of two clusters is the average of pairwise
proximity between points in the two clusters.
\[
\operatorname{proximity}(Cluster_i, Cluster_j) =
\frac{\sum_{p_i \in Cluster_i,\; p_j \in Cluster_j} \operatorname{proximity}(p_i, p_j)}
     {|Cluster_i| \cdot |Cluster_j|}
\]
• Need to use average connectivity for scalability since total
proximity favors large clusters
   I1   I2   I3   I4   I5
I1 1.00 0.90 0.10 0.65 0.20
I2 0.90 1.00 0.70 0.60 0.50
I3 0.10 0.70 1.00 0.40 0.30
I4 0.65 0.60 0.40 1.00 0.80
I5 0.20 0.50 0.30 0.80 1.00
Hierarchical Clustering: Group Average
(Figure: nested group-average clusters over points 1–6, and the corresponding
dendrogram with merge heights between 0 and 0.25.)
Hierarchical Clustering: Time and Space Requirements
• O(N²) space, since it uses the proximity matrix.
– N is the number of points.
• O(N³) time in many cases.
– There are N steps, and at each step the proximity matrix, of size N²,
must be updated and searched.
– Complexity can be reduced to O(N² log N) time for some approaches.
Hierarchical Clustering: Problems and
Limitations
• Once a decision is made to combine two clusters,
it cannot be undone
• No objective function is directly minimized
• Different schemes have problems with one or
more of the following:
– Sensitivity to noise and outliers
– Difficulty handling different sized clusters and convex
shapes
– Breaking large clusters