Assessment

The schedule graph may be of help for selecting the best
solution

The best solution corresponds to a plateau before a high jump
Solutions with very small or even singleton clusters are
rather suspicious


Standardization
K-means
Cluster validation, three approaches
Relative criteria
Validity index
Initial centroids
Validation example
Cluster merit index
Dunn index
Davies-Bouldin (DB) index
Combination of different distances/diameter methods
Standardization Issue



The need for any standardization must be
questioned

If the interesting clusters are based on the
original features, then any standardization
method may distort those clusters

Only when there are grounds to search for
clusters in a transformed space should some
standardization rule be used

There is no methodological way to decide except by
“trial and error”
y_i = (x_i − m_i) / s_i          (standard score: m_i mean, s_i standard deviation)

y_i = (x_i − min(x_i)) / (max(x_i) − min(x_i))

y_i = x_i / (max(x_i) − min(x_i))


An easy standardization method that is often
followed and frequently achieves good results is
simple division (or multiplication) by a
scale factor

y_i = x_i / a

The scale factor a should be properly chosen so that all feature
values occupy a suitable interval
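The standardization rules above can be sketched in a few lines of NumPy; the toy feature matrix and the per-feature scale factors `a` are illustrative choices, not part of the original material:

```python
import numpy as np

# Toy feature matrix: rows are patterns, columns are features (illustrative data).
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0],
              [4.0, 800.0]])

# Min-max standardization: y_i = (x_i - min) / (max - min), per feature.
y_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Range scaling: y_i = x_i / (max - min), per feature.
y_range = X / (X.max(axis=0) - X.min(axis=0))

# Simple scale factor: y_i = x_i / a, with a chosen so that all feature
# values occupy a suitable interval (here [0, 1]).
a = np.array([4.0, 800.0])   # hypothetical per-feature scale factors
y_scaled = X / a
```

After any of these transforms both features occupy comparable ranges, so neither dominates the distance computations used by k-means.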

k-means Clustering
d_2(x, z) = ( Σ_{i=1}^{d} (x_i − z_i)² )^{1/2}

Cluster centers c_1, c_2, …, c_k with clusters C_1, C_2, …, C_k
Initial centroids

Specify which patterns are used as initial
centroids

Random initialization
Tree clustering on a reduced number of patterns may be
performed for this purpose
Choose the first k patterns as initial centroids
Sort the distances between all patterns and choose
patterns at constant intervals of these distances as
initial centroids
Adaptive initialization (according to a chosen radius)
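A minimal k-means sketch that accepts any of the initialization strategies listed above as its starting centroids; the `kmeans` helper and the toy two-blob data are ours, for illustration only:

```python
import numpy as np

def kmeans(X, k, centroids, n_iter=100):
    """Plain k-means: assign each pattern to its nearest centroid
    (squared Euclidean distance), then recompute each centroid as the
    mean of its cluster, until the centroids stop moving."""
    c = np.asarray(centroids, dtype=float).copy()
    for _ in range(n_iter):
        # distance of every pattern to every centroid: shape (n, k)
        d2 = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_c = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else c[j]
                          for j in range(k)])
        if np.allclose(new_c, c):
            break
        c = new_c
    return labels, c

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])

init_first = X[:2]                                      # first k patterns
init_random = X[rng.choice(len(X), 2, replace=False)]   # random initialization
labels, centers = kmeans(X, 2, init_first)
```

Swapping `init_first` for `init_random` (or any other strategy above) changes only the starting point; the update loop is identical.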
k-means example (Sá 2001)
E = Σ_{j=1}^{k} Σ_{x ∈ C_j} d²(x, c_j)

Cluster merit index R_i (n patterns in k clusters)

E_i = Σ_{j=1}^{k} Σ_{x ∈ C_j} (x_i − c_{ij})²

R_i(k) = ( E_i(k) / E_i(k+1) − 1 ) (n − k − 1)

The cluster merit index measures the decrease in
the overall within-cluster distance when passing
from a solution with k clusters to one with k+1
clusters

A high value of the merit index indicates a
substantial decrease in the overall within-cluster
distance
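One way to compute the within-cluster error E and the merit index R(k) just described (a sketch; the helper names and the numbers in the usage example are ours):

```python
import numpy as np

def within_cluster_error(X, labels, centers):
    """E = sum over clusters of squared distances of the cluster's
    patterns to its centroid."""
    return sum(((X[labels == j] - c) ** 2).sum() for j, c in enumerate(centers))

def merit_index(E_k, E_k1, n, k):
    """R(k) = (E(k)/E(k+1) - 1) * (n - k - 1): a large value flags a
    substantial drop in within-cluster error from k to k+1 clusters."""
    return (E_k / E_k1 - 1.0) * (n - k - 1)

# Tiny check: one cluster of two points around centroid (1, 0).
X = np.array([[0.0, 0.0], [2.0, 0.0]])
E = within_cluster_error(X, np.array([0, 0]), np.array([[1.0, 0.0]]))  # 2.0

# Hypothetical errors: halving E from k=2 to k=3 with n=20 patterns.
R = merit_index(100.0, 50.0, 20, 2)  # (2 - 1) * 17 = 17.0
```

Computing R over a range of k values and plotting it yields the schedule graph used above to pick the number of clusters.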
Cluster merit index

[Plot: merit index R (vertical axis, −500 to 2500) versus (k+1) (horizontal axis, 1 to 7)]



Factor 1 has the most important contribution
The values k = 3, 5, 8 are sensible choices
k = 3 is attractive
Cluster validation

The procedure of evaluating the results of
a clustering algorithm is known under the
term cluster validity

In general terms, there are three
approaches to investigating cluster validity

The first is based on external criteria
This implies that we evaluate the results
of a clustering algorithm based on a
pre-specified structure, which is imposed on
a data set and reflects our intuition about
the clustering structure of the data set



Error Classification Rate
(a smaller value indicates a better representation)

Data partition according to known classes L_i,
L = {L_1, L_2, …, L_G}

α_L(C_j) := max_{i=1,…,G} |C_j ∩ L_i|

ECR := (1/k) Σ_{j=1}^{k} (|C_j| − α_L(C_j))
The second approach is based on
internal criteria
 We may evaluate the results of a
clustering algorithm in terms of quantities
that involve the vectors of the data set
themselves (e.g. proximity matrix)

Proximity matrix
Dissimilarity matrix
 0
 d(2,1)
0

 d(3,1) d ( 3,2) 0

:
:
 :
d ( n,1) d ( n,2) ...






... 0
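The dissimilarity matrix can be computed directly from the pattern vectors; a minimal NumPy sketch with an illustrative three-pattern set:

```python
import numpy as np

# Three patterns in 2-D (illustrative data).
X = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]])

# Pairwise Euclidean dissimilarity matrix: D[i, j] = d(i, j).
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
# D is symmetric with a zero diagonal, matching the matrix above.
```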

The basis of the above-described
validation methods is often statistical
testing

A major drawback of techniques based on
internal or external criteria and statistical
testing is their high computational
demands

The third approach to clustering validity is
based on relative criteria

Here the basic idea is the evaluation of a
clustering structure by comparing it to
other clustering schemes, produced by the
same algorithm but with different
parameter values

There are two criteria proposed for clustering
evaluation and selection of an optimal
clustering scheme (Berry and Linoff, 1996)

Compactness: the members of each cluster
should be as close to each other as possible. A
common measure of compactness is the
variance, which should be minimized
Separation: the clusters themselves should be
widely spaced

Distance between two clusters

There are three common approaches to
measuring the distance between two different
clusters

Single linkage: measures the distance
between the closest members of the clusters
Complete linkage: measures the distance
between the most distant members
Comparison of centroids: measures the
distance between the centers of the clusters


Relative criteria
This approach is based on relative criteria and does not
involve statistical tests

The fundamental idea is
to choose the best clustering scheme out of a
set of defined schemes according to a
pre-specified criterion

Among the clustering schemes C_i, i = 1, …,
k, defined by a specific algorithm for
different values of its parameters, choose
the one that best fits the data set

The procedure of identifying the best
clustering scheme is based on a validity
index q


Having selected a suitable performance index q, we
proceed with the following steps

We run the clustering algorithm for all values of k
between a minimum k_min and a maximum k_max
• The minimum and maximum values are defined a priori by the user

For each value of k, we run the algorithm r
times, using different sets of values for the other
parameters of the algorithm (e.g. different initial
conditions)
We plot the best value of the index q obtained for
each k as a function of k
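The selection loop above can be sketched as follows; `random_clustering` and `dummy_index` are hypothetical stand-ins for a real clustering algorithm and validity index, present only to make the sketch runnable:

```python
import numpy as np

def best_index_per_k(X, cluster_fn, index_fn, k_min=2, k_max=6, r=5, seed=0):
    """For each k in [k_min, k_max], run the clustering algorithm r times
    with different initial conditions and keep the best validity-index
    value; plotting `best` against k then reveals the best scheme."""
    rng = np.random.default_rng(seed)
    best = {}
    for k in range(k_min, k_max + 1):
        vals = [index_fn(X, cluster_fn(X, k, rng)) for _ in range(r)]
        best[k] = max(vals)  # assumes larger index values are better
    return best

# Hypothetical stand-ins, just to exercise the loop:
def random_clustering(X, k, rng):
    return rng.integers(0, k, size=len(X))

def dummy_index(X, labels):
    return float(len(np.unique(labels)))

best = best_index_per_k(np.zeros((30, 2)), random_clustering, dummy_index, 2, 4, r=3)
```

For an index where smaller is better (such as Davies-Bouldin below), `max` would be replaced by `min`.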



Based on this plot we may identify the best
clustering scheme
There are two approaches to defining the best
clustering, depending on the behavior of q with
respect to k
If the validity index does not exhibit an
increasing or decreasing trend as k increases,
we seek the maximum (minimum) of the plot

For indices that increase (or decrease) as the
number of clusters increases, we search for the
values of k at which a significant local change in the
value of the index occurs
This change appears as a “knee” in the
plot and is an indication of the number of
clusters underlying the data set
The absence of a knee may be an indication
that the data set possesses no clustering
structure
Validity index

The Dunn index, a cluster validity index for
k-means clustering, was proposed in Dunn
(1974)

It attempts to identify “compact and
well-separated clusters”
Dunn index
d(C_i, C_j) = min_{x ∈ C_i, y ∈ C_j} d(x, y)

diam(C_i) = max_{x, y ∈ C_i} d(x, y)

D_k = min_{1 ≤ i ≤ k} { min_{1 ≤ j ≤ k, j ≠ i} [ d(C_i, C_j) / max_{1 ≤ l ≤ k} diam(C_l) ] }

If the dataset contains compact and
well-separated clusters, the distance between
the clusters is expected to be large and
the diameter of the clusters is expected to
be small

Large values of the index indicate the
presence of compact and well-separated
clusters


The index D_k does not exhibit any trend
with respect to the number of clusters

Thus, the maximum in the plot of D_k
versus the number of clusters k can be an
indication of the number of clusters that
fits the data

The main drawbacks of the Dunn index are:
A considerable amount of time is required for
its computation
It is sensitive to the presence of noise in
datasets, since noise is likely to
increase the values of diam(C)
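A direct implementation of the Dunn index as defined above (closest-member distance between clusters, maximum intra-cluster distance as diameter); the helper names and the toy two-cluster data are ours:

```python
import numpy as np

def pairwise(A, B):
    """Euclidean distances between every row of A and every row of B."""
    return np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2))

def dunn_index(X, labels):
    """Dunn index: smallest inter-cluster distance (closest members)
    divided by the largest cluster diameter; larger is better."""
    clusters = [X[labels == j] for j in np.unique(labels)]
    diam = max(pairwise(C, C).max() for C in clusters)
    sep = min(pairwise(clusters[i], clusters[j]).min()
              for i in range(len(clusters))
              for j in range(len(clusters)) if i != j)
    return sep / diam

# Two compact, well-separated clusters: separation 10, diameter 1.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])
```

Note the quadratic cost of the pairwise distances, which is the computational burden mentioned above.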




The Davies-Bouldin (DB) index (1979)
d(C_i, C_j) = min_{x ∈ C_i, y ∈ C_j} d(x, y)

diam(C_i) = max_{x, y ∈ C_i} d(x, y)

DB_k = (1/k) Σ_{i=1}^{k} max_{j ≠ i} [ (diam(C_i) + diam(C_j)) / d(C_i, C_j) ]

Small index values correspond to good
clusters: the clusters are compact and their
centers are far apart

The DB_k index exhibits no trend with
respect to the number of clusters, and thus
we seek the minimum value of DB_k in its plot
versus the number of clusters
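A sketch of this DB variant (maximum intra-cluster distance as diameter, closest-member distance between clusters, as on the slide; the classic Davies-Bouldin formulation uses centroid-based scatter instead). Helper names and toy data are ours:

```python
import numpy as np

def pairwise(A, B):
    """Euclidean distances between every row of A and every row of B."""
    return np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2))

def db_index(X, labels):
    """DB_k = (1/k) * sum_i max_{j != i} (diam_i + diam_j) / d(C_i, C_j);
    smaller values indicate compact, well-separated clusters."""
    clusters = [X[labels == j] for j in np.unique(labels)]
    k = len(clusters)
    diam = [pairwise(C, C).max() for C in clusters]
    total = 0.0
    for i in range(k):
        total += max((diam[i] + diam[j]) / pairwise(clusters[i], clusters[j]).min()
                     for j in range(k) if j != i)
    return total / k

# Same toy data as for the Dunn index: diameters 1, separation 10.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])
```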


Different methods may be used to calculate distance
between clusters
• Single linkage: d_1(C_i, C_j) = min_{x ∈ C_i, y ∈ C_j} d(x, y)

• Complete linkage: d_2(C_i, C_j) = max_{x ∈ C_i, y ∈ C_j} d(x, y)

• Comparison of centroids: d_3(C_i, C_j) = d(c_i, c_j)

• Average linkage: d_4(C_i, C_j) = (1 / (|C_i| |C_j|)) Σ_{x ∈ C_i} Σ_{y ∈ C_j} d(x, y)

Different methods to calculate the diameter of a cluster

• Max: diam_1(C_i) = max_{x, y ∈ C_i} d(x, y)

• Radius: diam_2(C_i) = max_{x ∈ C_i} d(x, c_i)

• Average distance:
diam_3(C_i) = Σ_{l > m} d(x_l, x_m) / [ (|C_i| − 1) |C_i| / 2 ],
with x_l, x_m ∈ C_i

(A complete graph with s nodes has s(s − 1)/2 edges)
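The inter-cluster distances d_1..d_4 and cluster diameters diam_1..diam_3 are all small variations on one pairwise-distance computation; a sketch with illustrative clusters (names are ours):

```python
import numpy as np

def pairwise(A, B):
    """Euclidean distances between every row of A and every row of B."""
    return np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2))

# Inter-cluster distances
d1 = lambda Ci, Cj: pairwise(Ci, Cj).min()    # single linkage
d2 = lambda Ci, Cj: pairwise(Ci, Cj).max()    # complete linkage
d3 = lambda Ci, Cj: float(np.linalg.norm(Ci.mean(0) - Cj.mean(0)))  # centroids
d4 = lambda Ci, Cj: pairwise(Ci, Cj).mean()   # average linkage

# Cluster diameters
diam1 = lambda Ci: pairwise(Ci, Ci).max()                   # max
diam2 = lambda Ci: pairwise(Ci, Ci.mean(0)[None, :]).max()  # radius

def diam3(Ci):
    """Average pairwise distance: sum over the s(s-1)/2 unordered pairs."""
    s = len(Ci)
    return pairwise(Ci, Ci).sum() / 2 / (s * (s - 1) / 2)

# Two illustrative clusters, 4 units apart, each of height 2.
Ci = np.array([[0.0, 0.0], [0.0, 2.0]])
Cj = np.array([[4.0, 0.0], [4.0, 2.0]])
```

Any (d_i, diam_j) pair can be plugged into the Dunn or DB formulas above, which is exactly the combination scheme discussed next.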

Combination of different
distances/diameter methods

It has been shown that using different
distance/diameter methods may produce
indices of different scale ranges (Azuaje and
Bolshakova 2002)

Normalization
i ∈ {1, 2, 3, 4} selects the distance method
j ∈ {1, 2, 3} selects the diameter method

σ(D^ij) or σ(DB^ij): standard deviation of D_k^ij
or DB_k^ij across different values of k

Normalized indices

D̂_k^ij = ( D_k^ij − D̄^ij ) / σ(D^ij)

D̄^ij = (1/k) Σ_{l=1}^{k} D_l^ij
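The normalization is a z-score of the index values across k, so that indices from different distance/diameter combinations share a common scale; the index values below are hypothetical:

```python
import numpy as np

# Hypothetical Dunn-index values D_k for k = 2..7, for one
# distance/diameter combination (i, j).
D = np.array([0.8, 1.9, 0.7, 0.6, 0.5, 0.4])

# Normalized index: subtract the mean over k and divide by the
# standard deviation over k.
D_hat = (D - D.mean()) / D.std()
```

After normalization, curves for different (i, j) combinations can be compared or averaged directly on one plot.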
Literature
See also

J.P. Marques de Sá, Pattern Recognition: Concepts,
Methods and Applications, Springer, 2001

https://www.cs.tcd.ie/publications/tech-reports/

TCD-CS-2002-34.pdf
TCD-CS-2005-25.pdf



Standardization
K-means
Cluster validation, three approaches
Relative criteria
Validity index
Initial centroids
Validation example
Cluster merit index
Dunn index
Davies-Bouldin (DB) index
Combination of different distances/diameter methods
Next lecture



KNN
LVQ
SOM