
CHAPTER 5
AUTOMATIC GENERATION OF INITIAL VALUE K TO APPLY K-MEANS METHOD FOR TEXT DOCUMENTS CLUSTERING
Retrieving relevant text documents on a topic from a large document collection is a challenging task. Different clustering algorithms have been developed to retrieve relevant documents of interest. Hierarchical clustering shows quadratic time complexity of O(n²) for n text documents. The K-means algorithm has a time complexity of O(n), but it is sensitive to the initial randomly selected cluster centers and may give a local optimum solution. Global K-means employs the K-means algorithm as a local search procedure to produce a global optimum solution, but shows polynomial time complexity of O(nk) to produce k clusters. In this chapter, a new approach is proposed for clustering text documents that overcomes the drawbacks of K-means and Global K-means and gives a global optimum solution with time complexity of O(lk) to obtain k clusters from an initial set of l starting clusters. Experimental evaluation on Reuters newsfeeds (Reuters-21578) shows that the clustering results (entropy, purity, F-measure) obtained by the proposed method are comparable with K-means and Global K-means.
5.1 Introduction
Fast retrieval of relevant information from databases has always been a significant issue. Different techniques have been developed for this purpose; one of them is Data Clustering. Data Clustering is a technique in which information (documents) that is logically similar is physically stored together. There are several different approaches to the computation of clusters. Clustering algorithms may be characterized as:
1) Hierarchical clustering - It groups data objects into a hierarchy of clusters. The hierarchy can be formed top-down (divisive approach) or bottom-up (agglomerative approach).
Given a set of N objects to be clustered, the basic process of hierarchical clustering (defined by S.C. Johnson in 1967 [75]) is:
i. Initially start with N clusters, each containing a single object. The distances (similarities) between the clusters are the same as the distances (similarities) between the objects they contain.
ii. Next, find the closest (most similar) pair of clusters and merge them into a single cluster.
iii. Compute the distances (similarities) between the new cluster and each of the old clusters.
iv. Repeat steps (ii) and (iii) until all objects are clustered into a single cluster of size N.
Step (iii) can be done in different ways, which is what distinguishes single-linkage from
complete-linkage and average-linkage clustering [24]. In single-linkage clustering (also
called the connectedness or minimum method), the distance between one cluster and
another cluster is considered to be equal to the shortest distance from any member of one
cluster to any member of the other cluster. If the data consist of similarities, then the
similarity between one cluster and another cluster is considered to be equal to the greatest
similarity from any member of one cluster to any member of the other cluster. In
complete-linkage clustering (also called the diameter or maximum method), the distance
between one cluster and another cluster is considered to be equal to the greatest distance
from any member of one cluster to any member of the other cluster. In average-linkage
clustering, the distance between one cluster and another cluster is considered to be equal to
the average distance from any member of one cluster to any member of the other cluster.
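To make the three linkage rules concrete, here is a minimal Python sketch (not from the thesis); it assumes objects are coordinate tuples compared with Euclidean distance.

    from itertools import product
    from math import dist  # Python 3.8+

    def single_linkage(a, b):
        """Cluster distance = shortest pairwise distance (minimum method)."""
        return min(dist(x, y) for x, y in product(a, b))

    def complete_linkage(a, b):
        """Cluster distance = greatest pairwise distance (maximum method)."""
        return max(dist(x, y) for x, y in product(a, b))

    def average_linkage(a, b):
        """Cluster distance = mean distance over all cross-cluster pairs."""
        pairs = list(product(a, b))
        return sum(dist(x, y) for x, y in pairs) / len(pairs)

    c1 = [(0.0, 0.0), (1.0, 0.0)]
    c2 = [(3.0, 0.0), (5.0, 0.0)]
    print(single_linkage(c1, c2), complete_linkage(c1, c2), average_linkage(c1, c2))
    # -> 2.0 5.0 3.5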
2) Partitioning clustering - It partitions data objects into a given number of clusters. The
clusters are formed in order to optimize an objective criterion such as distance.
An object is assigned to that cluster whose centre is nearest to it. The center of a cluster
also called centroid [3] is defined as the average of all the objects in the cluster — that is,
its coordinates are the arithmetic mean for each dimension separately over all the objects
in the cluster.
The centroid of the ith cluster is represented as:

Ci = {cij}, j = 1..a

and the jth data point of the ith cluster centroid is computed as the mean weighted term frequency over the member objects:

cij = (1/T) × Σ(k=1..T) wtfkj

where,
T is the total number of objects in the ith cluster
a is the number of unique terms appearing in the member objects of the ith cluster
wtfkj is the weighted term frequency of the jth term in the kth member object
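As a toy example (illustrative values only), the centroid of a cluster holding the two vectors (1, 2) and (3, 6) is ((1+3)/2, (2+6)/2) = (2, 4). A minimal Python sketch of the definition:

    def centroid(members):
        T = len(members)                 # T: total number of objects in the cluster
        dims = range(len(members[0]))    # one coordinate per unique term
        return tuple(sum(m[j] for m in members) / T for j in dims)

    print(centroid([(1.0, 2.0), (3.0, 6.0)]))   # -> (2.0, 4.0)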
5.1.1 Clustering methods
K-means clustering and its various variants are discussed below:
1) K-means clustering
The algorithmic steps of K-means clustering [121] are:
i. Choose the number of clusters, k.
ii. Randomly generate k clusters and determine the cluster centers, or directly generate k random points as cluster centers.
iii. Assign each point to the nearest cluster center, where "nearest" is defined with respect to one of the distance measures discussed above.
iv. Re-compute the new cluster centers.
v. Repeat the two previous steps until some convergence criterion is met (usually until the centroids do not change).
The limitation of K-means is that it needs to know the value of k (the number of clusters) in advance, and it tends to converge to local minima that are sensitive to the starting centroids. A minimal sketch of the procedure is given below.
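The following Python sketch is illustrative, not the thesis code. It assumes points are coordinate tuples, uses Euclidean distance, and accepts optional initial centers so the later Global K-means sketch can reuse it.

    import random
    from math import dist

    def kmeans(points, k, max_iter=100, seed=0, init=None):
        # step ii: k random points as centers, unless initial centers are supplied
        centers = list(init) if init else random.Random(seed).sample(points, k)
        clusters = [[] for _ in range(k)]
        for _ in range(max_iter):
            clusters = [[] for _ in range(k)]
            for p in points:  # step iii: assign each point to the nearest center
                clusters[min(range(k), key=lambda c: dist(p, centers[c]))].append(p)
            new_centers = [tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centers[i]
                           for i, cl in enumerate(clusters)]  # step iv: recompute centers
            if new_centers == centers:  # step v: stop when centroids do not change
                break
            centers = new_centers
        return centers, clusters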
2) Bisecting K-means
The algorithmic steps of Bisecting K-means [121] are:
i. Pick a cluster to split (split the largest).
ii. Find 2 sub-clusters using the basic K-means algorithm.
iii. Repeat step (ii), the bisecting step, ITER times and take the split that produces the clustering with the highest overall similarity.
iv. Repeat steps (i), (ii) and (iii) until the desired number of clusters is reached.
Bisecting K-means produces a deep hierarchy, which is difficult to browse if one makes an incorrect selection while navigating it. A sketch is given below.
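A sketch of bisecting K-means built on the kmeans() helper above; "highest overall similarity" is approximated here by the lowest sum of squared errors, an assumption, since the steps above do not fix the similarity measure.

    from math import dist

    def bisecting_kmeans(points, k, ITER=5):
        clusters = [list(points)]
        while len(clusters) < k:
            big = max(clusters, key=len)            # step i: pick the largest cluster
            clusters.remove(big)
            best_score, best_split = None, None
            for trial in range(ITER):               # step iii: ITER trial bisections
                centers, subs = kmeans(big, 2, seed=trial)
                score = sum(dist(c, p) ** 2 for c, s in zip(centers, subs) for p in s)
                if best_score is None or score < best_score:
                    best_score, best_split = score, subs
            clusters.extend(best_split)             # keep the best 2-way split
        return clusters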
3) Global K-means
The algorithmic steps of Global K-means [88] are:
i. Construct an appropriate set of positions/locations which can act as good candidates for insertion of new clusters.
ii. Initialize the first cluster as the mean of all the points in the dataset.
iii. In the kth iteration, assuming k-1 converged clusters, find an appropriate position for insertion of a new cluster from the set of points created in step (i) that gives minimum distortion.
iv. Run K-means with k clusters till convergence. Go back to step (iii) if the required number of clusters is not yet reached.
Global K-means, unlike K-means, is insensitive to the choice of the initial k cluster centers and thus gives a global optimum solution. But it requires executing the K-means method nk times on a document set of size n to generate k clusters, giving time complexity of O(nk). A sketch of the idea is given below.
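A sketch of the Global K-means idea, reusing the kmeans() helper above. Here every data point is tried as the candidate position for the next center (the exhaustive version of step (i)), which is exactly why the method runs K-means on the order of nk times.

    from math import dist

    def distortion(points, centers):
        return sum(min(dist(p, c) ** 2 for c in centers) for p in points)

    def global_kmeans(points, k):
        # step ii: the first center is the mean of the whole dataset
        centers = [tuple(sum(x) / len(points) for x in zip(*points))]
        while len(centers) < k:
            # step iii: try each point as the new center and keep the best one
            best = min(points, key=lambda p: distortion(points, centers + [p]))
            # step iv: refine all centers with K-means, then repeat step iii
            centers, _ = kmeans(points, len(centers) + 1, init=centers + [best])
        return centers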
In the Hybrid PSO+K-means [36] method, the PSO (Particle Swarm Optimization) module is executed for a short period to search for the optimum cluster centroid locations, and then the K-means module is used to refine and generate the final optimal clustering solution. Although this method also produces the global optimum solution, like Global K-means it requires assuming the initial value of k.
5.1.2 Cluster Quality Evaluation
Clusters can be evaluated with "internal" as well as "external" measures, defined as:
i. Internal measures are related to inter/intra-cluster distance.
Intra-cluster distance is defined as the (sum/average) of the (absolute/squared) distances between
- all pairs of objects in the cluster, or
- the centroid and all objects in the cluster, or
- the "medoid" and all objects in the cluster.
Inter-cluster distance is defined as the sum of the (squared) distances between all pairs of clusters, where the distance between two clusters is defined as:
- the distance between their centroids/medoids (spherical clusters), or
- the distance between the closest pair of points belonging to the clusters (chain-shaped clusters).
A good clustering is one where intra-cluster distances within each cluster are minimized and inter-cluster distances between different clusters are maximized.
ii. External measures are related to how representative the current clusters are of the "true" classes, and are computed in terms of entropy and F-measure as discussed in Chapter 1, section 1.3.5. A sketch of these measures is given below.
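As a concrete reference for the evaluation tables later in this chapter, the following sketch uses the standard definitions of purity, entropy, and F-measure for clusters against labeled classes; the exact formulas used in this thesis are those of Chapter 1, section 1.3.5, so treat this as illustrative. Here clusters is a list of lists of document ids and labels maps each document id to its class.

    from collections import Counter
    from math import log2

    def purity(clusters, labels):
        # fraction of documents assigned to their cluster's majority class
        n = sum(len(c) for c in clusters)
        return sum(max(Counter(labels[d] for d in c).values()) for c in clusters) / n

    def entropy(clusters, labels):
        # size-weighted average of the per-cluster class entropy
        n = sum(len(c) for c in clusters)
        total = 0.0
        for c in clusters:
            for v in Counter(labels[d] for d in c).values():
                total += -(len(c) / n) * (v / len(c)) * log2(v / len(c))
        return total

    def f_measure(clusters, labels):
        # for each class, the best F-score over all clusters, weighted by class size
        n = sum(len(c) for c in clusters)
        class_sizes = Counter(labels.values())
        best = {}
        for c in clusters:
            for cls, v in Counter(labels[d] for d in c).items():
                p, r = v / len(c), v / class_sizes[cls]   # precision and recall
                best[cls] = max(best.get(cls, 0.0), 2 * p * r / (p + r))
        return sum(class_sizes[cls] / n * f for cls, f in best.items())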
Given a corpus of text documents, clusters can be formed based on keyword matching between the documents. For a given search query, instead of matching individual documents in the entire database, only the cluster(s) containing the search query terms in their centroids are selected, and the member documents of the selected cluster are retrieved as the result of the search query. The rank order of the documents in the result depends on the number and frequency of matching terms in the corresponding documents, as sketched below.
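A sketch of this retrieval step under a sparse vector representation ({term: weight} dictionaries); the helper names are illustrative, not from the thesis.

    from math import sqrt

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = sqrt(sum(w * w for w in u.values()))
        nv = sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def search(query_vec, centroids, members, doc_vecs):
        # match the query against cluster centroids only, not the whole database
        best = max(centroids, key=lambda cid: cosine(query_vec, centroids[cid]))
        # rank the selected cluster's member documents against the query
        return sorted(members[best], key=lambda d: cosine(query_vec, doc_vecs[d]),
                      reverse=True)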
5.2 Proposed Method
To overcome the limitations of the existing clustering algorithms, a new methodology is proposed to cluster documents based on the frequency of occurrence of terms within the documents. Initially all the documents are preprocessed to remove stopwords, and Porter's stemming algorithm is applied to each document to reduce the terms within the document to their stems. Each document is represented as a vector <d,w> listing index terms with their normalized weighted frequency of occurrence. The dimension of each document's index term vector is further reduced by selecting only those terms whose normalized weighted frequency of occurrence lies within the upper and lower limits. For each unique term appearing in the index term vectors of all documents, an entry is added to an inverted index file, with a posting list naming all documents containing the corresponding term in their index term vector. Each entry of the inverted index file represents a cluster having the documents of its posting list as member documents. These clusters are refined by computing their centroids and grouping clusters which are similar to each other, based on the inter-cluster distance between them. If the inter-cluster cosine distance between a pair of clusters is greater than 0, the two clusters show similarity, and they are merged if the minimum distance between the centroids of the two clusters and their member documents is greater than the inter-cluster cosine distance between the corresponding clusters. The entire clustering process is repeated till the required number of clusters is obtained from the initial set of starting clusters, or the inter-cluster cosine distance between the centroids of all pairs of clusters becomes zero.
The following steps of the proposed clustering algorithm are applied to a given set of n text documents to build clusters of similar documents:
a) Initialize α ← 0.1
b) for each ith document di, i = 1..n
       for each jth term tij ∈ di, j = 1..M
           find term frequency tfij ← count of no. of times tij appears in di
           compute weighted term frequency wtfij
           compute normalized weighted term frequency wij
       end for
   end for
   where,
       ni is the total no. of documents containing term tij
       n is the total no. of documents in the document pool
       M is the total no. of unique terms appearing in the n documents
c) for each document di, i = 1..n, create the corresponding index term file Idi:
       for each term tij ∈ di, j = 1..M
           if (wij >= α) then
               add an entry <tij, wij> to Idi
           end if
       end for
   end for
d) Compute,
       L ← { tij, tik : <tij, wij>, <tik, wik> ∈ Idi ∀i and tij ≠ tik }, i = 1..n, j = 1..M
       Q ← n(L), Q ≤ M
   where,
       L is the set of unique terms appearing in Idi ∀i
       Q is the count of unique terms in Idi ∀i
e) for each term lu ∈ L, u = 1..Q, compute the average normalized weighted term frequency avg of lu over the documents di, i = 1..n
   end for
   where,
       fu is the count of the no. of times lu appears in {Idi}, i = 1..n
f) initialize iter ← 0
g) repeat steps (h) through (i) until each document is included in the inverted index file Invf
h) compute,
       iter ← iter + 1
       l_avg(iter) ← (1 - avg)/iter
       if (iter == 1) then
           t_docs(iter) ← n
           u_avg(iter) ← 1
       else
           u_avg(iter) ← abs(u_avg(iter-1) - l_avg(iter))
           t_docs(iter) ← n - t_docs(iter-1)
       end if
i) for each term lu ∈ L, u = 1..Q
       for each document di, i = 1..t_docs(iter)
           if (wiu >= l_avg(iter)) then
               add entry <di, wiu> to the posting list of lu in the inverted index file Invf
               t_docs(iter) ← t_docs(iter) - 1
           end if
       end for
   end for
** Total entries in Invf ≤ Q, as only those terms which satisfy the above term selection criteria are included in Invf.
j) linvf ← n(Linvf), n(Linvf) ≤ n(L)
   where,
       Linvf is the set of terms from L included in Invf
       linvf is the count of the no. of terms from L included in Invf
k) for each Invf entry lu ∈ Invf, define cluster Cu, u = 1..linvf
       Cu ← { di : <di, wiu> is an entry in the posting list of lu, i = 1..n }
   end for
l) Iterate steps (m) through (p) until the required no. of clusters is obtained, or until the inter-cluster distance between every pair of clusters becomes zero.
m) for each cluster Cu, u = 1..linvf
       compute centroid Cenu ← {cut}, t = 1..s, s ≤ Q, with
           cut ← (1/N) × Σ(k=1..N) wtfkt
   end for
   where,
       N is the total number of member documents in the uth cluster
       s is the no. of unique terms appearing in the member documents of the cluster
       wtfkt is the weighted term frequency of the tth term in the kth cluster member document
n) for each cluster Ci, i = 1..linvf
       for each cluster Cj, j = 1..linvf and i ≠ j
           compute the cosine distance between their centroids:
               cosine(Ceni, Cenj) ← Σ(t ∈ Hij) cit × cjt / ( √(Σt cit²) × √(Σt cjt²) )
       end for
   end for
   where,
       ai is the dimension of the ith cluster centroid
       aj is the dimension of the jth cluster centroid
       Hij is the set of unique terms in the given two clusters
       hij is the count of unique terms in (Ceni, Cenj), hij ≤ ai + aj
o) for each cluster Ci, i = 1..linvf
       for each member document dj of Ci
           compute the cosine distance cosine(Ceni, dj) in the same way
       end for
       d_mini ← min{ cosine(Ceni, dj) } ∀j
   end for
   where,
       ai is the dimension of the ith cluster centroid
       rj is the no. of terms in the index term vector of the jth member document
       Rij is the set of unique terms in (Ceni, dj)
       rij is the count of unique terms in (Ceni, dj), rij ≤ ai + rj
p) for each cluster Ci, i = 1..linvf
       for each cluster Cj, j = 1..linvf and i ≠ j
           if [ cosine(Ceni, Cenj) > 0 && cosine(Ceni, Cenj) ≤ min(d_mini, d_minj) ] then
               combine clusters Ci and Cj
           end if
       end for
   end for
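The following Python sketch condenses steps (a) through (p). It is illustrative, not the thesis implementation: the weighted term frequency is assumed to be tf-idf style (tfij × log(n/ni), consistent with the definitions of n and ni in step (b)), weights are max-normalized into [0, 1], and the iterative l_avg/u_avg thresholding of steps (f) through (i) is collapsed into the single cutoff α.

    from collections import Counter, defaultdict
    from math import log, sqrt

    ALPHA = 0.1  # step (a): term selection criterion

    def index_vectors(docs):
        # steps (b)-(c): one {term: normalized weight} vector per document
        # (docs maps a document id to its list of stemmed terms)
        n = len(docs)
        df = defaultdict(int)
        for terms in docs.values():
            for t in set(terms):
                df[t] += 1
        vecs = {}
        for d, terms in docs.items():
            tf = Counter(terms)
            wtf = {t: c * log(n / df[t]) for t, c in tf.items()}  # assumed tf-idf weighting
            peak = max(wtf.values(), default=0.0) or 1.0          # assumed max-normalization
            vecs[d] = {t: w / peak for t, w in wtf.items() if w / peak >= ALPHA}
        return vecs

    def initial_clusters(vecs):
        # steps (g)-(k): each inverted index entry seeds one starting cluster,
        # so the initial number of clusters l is discovered, not assumed
        inv = defaultdict(set)
        for d, vec in vecs.items():
            for t in vec:
                inv[t].add(d)
        return list(inv.values())

    def centroid(cluster, vecs):
        # step (m): per-term mean weight over the member documents
        cen = defaultdict(float)
        for d in cluster:
            for t, w in vecs[d].items():
                cen[t] += w / len(cluster)
        return dict(cen)

    def cosine(u, v):
        # steps (n)-(o): cosine measure between sparse vectors
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = sqrt(sum(w * w for w in u.values()))
        nv = sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def merge_once(clusters, vecs):
        # step (p): merge one qualifying pair of clusters, if any
        cens = [centroid(c, vecs) for c in clusters]
        d_min = [min(cosine(cen, vecs[d]) for d in c)
                 for cen, c in zip(cens, clusters)]
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(cens[i], cens[j])
                if 0 < sim <= min(d_min[i], d_min[j]):
                    rest = [c for m, c in enumerate(clusters) if m not in (i, j)]
                    return rest + [clusters[i] | clusters[j]]
        return clusters  # no pair qualifies: the process stops

Repeating merge_once() until the required number of clusters remains, or until no pair of centroids has a positive cosine value, reproduces step (l). Because the merge loop works on the l inverted-index clusters rather than on all n documents, this is the source of the O(lk) behavior claimed for the method.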
To implement the proposed algorithm, the data structure needed is shown below.
[Figure: each document in the document set is parsed (stopword removal, stemming) into an index file (one file per document); the dictionary collects the index terms (e.g., Brain, Nerve, CPU, Read, Neuron, Computer); the inverted index file holds one row per index term with its posting list (e.g., Brain -> d1; Nerve -> d2, d3; CPU -> dn); each inverted index row seeds one cluster of the cluster set (C1: d1; C2: d2, d3; ...; Cn: dn).]
Figure 5.1: Data structure used in the proposed system
** The normalized weighted term frequency of the term nerve in document d2 is greater than that of the term brain in d2, so document d2 is not added to the posting list of Brain.
We evaluate our algorithm over the text document collection of Reuters-21578. The dataset was picked because of the presence of human-labeled hierarchical class labels and the reasonably large number of documents in it. It is described in more detail in the following section.
5.3 Experiments
5.3.1 Experimental Setup
The clustering policy requires that each document be assigned to the most specific possible
subcategory in a classification hierarchy. We use the Reuters-21578 text collection as our experimental dataset.
Reuters-21578 text categorization test collection is a resource for research in Information
Retrieval, machine learning, and other corpus-based research. The documents in the
Reuters-21578 collection appeared on the Reuters newswire in 1987. The documents were
assembled and indexed with categories by personnel from Reuters Ltd. (Sam Dobbins,
Mike Topliss, Steve Weinstein) and Carnegie Group, Inc. (Peggy Andersen, Monica Cellio,
Phil Hayes, Laura Knecht, Irene Nirenburg) in 1987. The Reuters-21578, Distribution 1.0
test collection is available from David D. Lewis' professional home page, currently:
http://www.research.att.com/~lewis
The Reuters-21578 collection is distributed in 22 files (reut2-000.sgm through reut2-021.sgm). The files are in SGML format. The NEW-ID keyword serves to delimit documents within a file.
For the Reuters-21578 collection the documents are Reuters newswire stories, and the
categories are five different sets of content related categories. For each document, a
human indexer decided which categories from which sets that document belonged to. The
category sets are as follows:
Table 5.1: Category Sets

Category Set    Number of Categories
EXCHANGES       39
ORGS            56
PEOPLE          267
PLACES          175
TOPICS          135
The TOPICS categories are economic subject categories. Examples include "coconut",
"gold", "inventories", and "money-supply". The EXCHANGES, ORGS, PEOPLE, and
PLACES categories correspond to named entities of the specified type. Examples include
"nasdaq" (EXCHANGES), "gatt" (ORGS), "perez-de-cuellar" (PEOPLE), and "australia"
(PLACES).
Table 5.1 above shows the number of categories in each set for the 21,578 documents of the collection; many categories appear in no documents. The data is obtained from the Reuters-21578 documentation file.
We evaluate our algorithm over a set of four text document collections of Reuters-21578, i.e., reut2-018.sgm through reut2-021.sgm, containing 1000, 1000, 1000, and 578 text documents respectively. These datasets were picked because they contain a reasonably large total of 3578 articles. The package html2text-1.3.2 is used to convert the given documents into text documents. Each group of text documents was separated into individual text documents using the Amberfish-1.6.4 software. Documents are preprocessed to remove stopwords, and terms are reduced to their stems by applying Porter's stemming algorithm, as sketched below.
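For reference, the equivalent preprocessing in Python, assuming NLTK's stopword list and Porter stemmer stand in for the tooling used in this thesis:

    import re
    from nltk.corpus import stopwords        # requires nltk.download('stopwords')
    from nltk.stem import PorterStemmer

    STOP = set(stopwords.words('english'))
    stem = PorterStemmer().stem

    def preprocess(text):
        # lowercase, tokenize, drop stopwords, reduce each term to its stem
        tokens = re.findall(r"[a-z]+", text.lower())
        return [stem(t) for t in tokens if t not in STOP]

    print(preprocess("The nerves connect the brain to the body."))
    # -> ['nerv', 'connect', 'brain', 'bodi']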
In the next section, we discuss the results obtained by implementing our proposed method on Linux 9.0 using bash scripting, on a P-IV processor with 512 MB RAM and a 120 GB hard disk.
5.3.2 Experimental Results
Number of text documents = 3578 (reut2-018.sgm – reut2-021.sgm)
Number of unique terms/vocabulary (excluding stopwords) = 12336 terms
Average number of lines per document = 15 lines
Average size of text document = 229 words / document
Table 5.2: Total no. of documents contained in each .sgm file

File             Total No. of documents
reut2-018.sgm    1000
reut2-019.sgm    1000
reut2-020.sgm    1000
reut2-021.sgm    578
Table 5.3: Total no. of documents in each category type, per file

Category Set   reut2-018.sgm   reut2-019.sgm   reut2-020.sgm   reut2-021.sgm
EXCHANGES      18              20              70              23
ORGS           65              36              27              5
PEOPLE         63              41              95              27
PLACES         910             860             779             465
TOPICS         435             387             469             329
Table 5.4: Clustering results obtained for different values of α [no. of iterations = 15]

Term selection   No. of text   No. of Clusters   Size of the       Size of the        Avg. size of
criteria (α)     documents     Created (k)       largest cluster   smallest cluster   the cluster
0.01             3578          513               655               10                 16
0.05             3578          437               655               15                 26
0.07             3578          377               655               24                 52
0.1              3578          241               655               35                 76
0.3              3578          17                655               181                326
0.7              3578          8                 655               282                458
0.9              3578          5                 655               462                559
A total of 241 clusters is obtained after 15 iterations when the terms having normalized weighted term frequency (wij) >= 0.1 are selected for inclusion in the document index term vectors. Similarly, a total of 5 clusters is obtained when terms with (wij) >= 0.9 are considered.
Table 5.5: Clustering performance (Reuters dataset of 3578 documents) for varying values of α [no. of iterations = 15]

α      No. of documents   Purity   Entropy   F-measure
0.01   3578               0.931    0.1097    0.2571
0.05   3578               1.650    0.1983    0.3509
0.07   3578               2.853    0.2086    0.4842
0.1    3578               3.654    0.3097    0.5571
0.3    3578               2.654    0.2080    0.3640
0.7    3578               1.548    0.1984    0.2618
0.9    3578               1.117    0.1293    0.1592
From Table 5.5, it is observed that the best clustering results are obtained when the terms with (wij) >= 0.1 are selected for inclusion in their respective document's index term vector.
Table 5.6: No. of initial clusters created for different values of n (document set)

No. of documents   α     M       Q      l_avg             l_avg              Size of initial
(n)                                     (1st iteration)   (last iteration)   cluster set (l)
100                0.1   950     205    0.63467           0.10007            76
500                0.1   4313    670    0.80605           0.10021            293
1000               0.1   8702    984    0.84440           0.10004            521
2000               0.1   17303   1487   0.88621           0.10009            902
3000               0.1   32117   1895   0.89316           0.10000            1305
3578               0.1   40396   1967   0.89912           0.10000            1442

where α is the initial term selection criterion for the index term vector, M is the no. of unique terms in the document set, Q is the no. of unique terms after preprocessing of the n documents, and l_avg is the lower limit of normalized weighted term frequency for selecting terms to be included in the inverted index file.
The clustering performance of the proposed clustering method is evaluated by comparing its results with those of the other clustering methods, K-means and Global K-means, as shown in Tables 5.7 to 5.10.
Table 5.7: Comparison of different clustering methods (K-means, Global K-means, proposed clustering method)

Parameter                              K-means                       Global K-means   Proposed clustering method
Sensitive to initial cluster set       Yes                           No               No
(k centroids)
Nature of clustering solution          Local                         Global           Global
Sensitivity to no. of clusters (k)     Depends on the initially      None             Depends on the criteria for selecting terms
created                                assumed value of k                             (α, upper and lower limits) to represent the
                                                                                      document index term vector
Time complexity                        O(n)                          O(kn)            O(kl), l << n, where l is the no. of initially
                                                                                      created clusters obtained from the inverted
                                                                                      index file of the n documents
Table 5.8: Clustering results for Purity (α = 0.1, no. of iterations = 5)

No. of documents (n)   Size of cluster (l)   K-means   Global K-means   Proposed clustering method
100                    30                    0.8830    0.8912           0.8341
500                    150                   4.1032    4.0615           4.2532
1000                   245                   8.1302    8.3215           8.3426
2000                   467                   15.9123   15.9846          15.8215
3000                   742                   24.1125   24.9026          23.5223
3578                   754                   31.9502   32.5001          30.4310
Table 5.9: Clustering results for Entropy (α = 0.1, no. of iterations = 5)

No. of documents (n)   Size of cluster (l)   K-means   Global K-means   Proposed clustering method
100                    30                    0.8815    0.8712           0.7929
500                    150                   0.7213    0.7315           0.6708
1000                   245                   0.8342    0.7805           0.7230
2000                   467                   0.7965    0.7702           0.7433
3000                   742                   0.8040    0.7312           0.7125
3578                   754                   0.8125    0.8902           0.7510
Table 5.10: Clustering results for F-measure (α = 0.1, no. of iterations = 5)

No. of documents (n)   Size of cluster (l)   K-means   Global K-means   Proposed clustering method
100                    30                    0.4113    0.4416           0.4215
500                    150                   0.5192    0.4995           0.5236
1000                   245                   0.4612    0.4813           0.5126
2000                   467                   0.4860    0.5135           0.4963
3000                   742                   0.4716    0.5002           0.4692
3578                   754                   0.4832    0.5200           0.4715
5.3.3 Discussion
In this section, we analyze the results obtained for the following parameter values: [α = 0.1, n = 3578, k = 241, no. of iterations = 15]
Table 5.11: Degree of overlapping (sharing of documents) among clusters

Total documents   No. of clusters shared
489               1
466               2
453               3
385               4
373               5
264               6
254               7
228               8
202               9
137               10
98                11
71                12
51                13
34                14
33                15
13                16
11                17
13                18
2                 19
[Figure: bar chart of the no. of shared documents against the no. of clusters shared]
Figure 5.2: Document overlapping among clusters
The graph above shows that sharing of documents among the clusters is quite low. Out of a total of 3578 documents, 489 documents are present in a single cluster, 466 documents appear in two clusters, 385 documents occur in 4 clusters, and only 2 documents are present in 19 clusters. Documents covering multiple different categories account for the overlapping among the clusters.
The results show that the maximum cluster size is 655 documents and the minimum cluster size is 35 documents, as shown in Table 5.12 below.
Table 5.12: Size of the different clusters created

Size of cluster (no. of documents)   Total no. of clusters
0-34                                 0
35-40                                63
41-45                                30
46-50                                23
51-55                                30
56-60                                17
61-65                                10
66-75                                14
76-85                                11
86-95                                4
96-125                               13
126-150                              7
151-175                              2
176-200                              3
201-300                              8
301-400                              1
401-500                              1
501-600                              2
601-700                              2
[Figure: bar chart of the no. of clusters against cluster size (no. of documents)]
Figure 5.3: Document distribution trend among clusters
The graph above shows that documents are fairly evenly distributed among the clusters, with an average cluster size of 53 documents.
To verify the similarity of documents within the clusters obtained by the proposed clustering algorithm, we compare the category sets to which documents within each cluster belong. It has been found that most of the documents share the same category.
Table 5.13 below shows the number of different category types occurring in the files of a given cluster.
Table 5.13: No. of sub-categories shared by documents in each cluster, for each category type

Cluster     Cluster size         Topics   Places   People   Orgs   Exchanges
            (no. of documents)
Cluster A   115                  5        21       8        2      0
Cluster B   74                   28       54       5        3      0
Cluster C   44                   14       5        3        0      0
Cluster D   47                   8        13       0        1      0
Cluster E   36                   6        12       8        1      0
Cluster F   85                   14       9        0        0      0
Cluster G   55                   8        22       0        1      0
Table 5.14: Different categories appearing in each cluster; the number of documents belonging to each category is given against the category name

Cluster     Major category set (categories with more than 2 documents are listed here)
Cluster A   TOPICS{earn(6)}, PLACES{usa(81)}, PEOPLE{volcker(2), greespan(2), conable(2)}, ORGS{imf(2)}
Cluster B   TOPICS{trade(19), grain(16), wheat(10), coffee(6)}, PLACES{usa(24), japan(11), uk(6)}, ORGS{ec(9)}
Cluster C   TOPICS{crude(4)}, PLACES{canada(38), usa(7)}
Cluster D   PLACES{usa(0)}, TOPICS{pet-chem(4), acq(44)}
Cluster E   PLACES{usa(35), iran(5), japan(4)}, TOPICS{ship(7), trade(6), crude(4)}, PEOPLE{reagan(25), james-baker(6), greespan(5)}
Cluster F   PLACES{usa(69), canada(4), west-germany(3), uk(3)}, TOPICS{earn(17), income(3), acq(3), trade(1), money-fx(1), crude(1)}
Cluster G   TOPICS{crude(51), gas(3)}, PLACES{usa(26), japan(5), iran(5)}, ORGS{opec(5)}
From the data given in Table 5.14, it is found that Cluster C contains 4 documents belonging to the crude category, 38 documents belonging to the canada category, and 7 documents belonging to the usa category. Documents not sharing the canada category share some other category within the cluster.
Data obtained from Cluster C
Total number of documents = 44
No. of documents from reut2-018.sgm = 17
No. of documents from reut2-019.sgm = 17
No. of documents from reut2-020.sgm = 5
No. of documents from reut2-021.sgm = 5
No. of documents belonging to PLACES Category = 43
No. of documents belonging to TOPICS Category = 4
No. of documents belonging to PEOPLE Category = 0
No. of documents belonging to ORGS Category = 0
No. of documents belonging to EXCHANGES Category = 0
Out of the total 44 documents, 38 documents contain news about Canada. These documents also share the same category from the other category set, TOPICS.
Likewise, there is another cluster, F, containing 85 documents, out of which 19 documents belong to reut2-018.sgm, 29 to reut2-019.sgm, 18 to reut2-020.sgm, and 19 to reut2-021.sgm. None of the documents belongs to the EXCHANGES, ORGS, or PEOPLE category sets. Out of the 85 documents, 69 belong to the PLACES (usa) category and 26 belong to the TOPICS (earn, income, trade, money-fx, crude, acq) category set. All the documents belonging to a cluster share at least one category among themselves.
Data obtained from Cluster F
Total number of documents = 85
No. of documents from reut2-018.sgm = 19
No. of documents from reut2-019.sgm = 29
No. of documents from reut2-020.sgm = 18
No. of documents from reut2-021.sgm = 19
No. of documents belonging to PLACES Category = 69
No. of documents belonging to TOPICS Category = 26
No. of documents belonging to PEOPLE Category = 0
No. of documents belonging to ORGS Category = 0
No. of documents belonging to EXCHANGES Category = 0
Table 5.15: No. of unique documents in each pair of clusters

Cluster (size)    A (115)   B (74)   C (44)   D (47)   E (36)   F (85)   G (55)
Cluster A (115)   -         189      159      162      151      196      170
Cluster B (74)    189       -        114      121      106      155      127
Cluster C (44)    159       114      -        89       80       129      95
Cluster D (47)    162       121      89       -        83       126      102
Cluster E (36)    151       106      80       83       -        121      91
Cluster F (85)    196       155      129      126      121      -        140
Cluster G (55)    170       127      95       102      91       140      -
Consider the data collected for two clusters, Cluster A and Cluster C, as shown in Table 5.15.
Size of Cluster A = 115 documents
Size of Cluster C = 44 documents
Total number of documents in cluster pair (A, C) = (115 + 44) = 159 documents
From Table 5.15, the number of unique documents in the cluster pair (A, C) = 159 documents.
This shows that Cluster A and Cluster C do not contain any document common to both.
Similarly, consider the two clusters Cluster D and Cluster F:
Size of Cluster D = 47 documents
Size of Cluster F = 85 documents
Total number of documents in cluster pair (D, F) = (47 + 85) = 132 documents
From Table 5.15, the total number of unique documents in the two clusters = 126 documents.
This shows that Cluster D and Cluster F share 132 - 126 = 6 documents.
Table 5.16: Total no. of documents common to each pair of clusters

Cluster (size)    A (115)   B (74)   C (44)   D (47)   E (36)   F (85)   G (55)
Cluster A (115)   -         0        0        0        0        4        0
Cluster B (74)    0         -        4        0        4        4        2
Cluster C (44)    0         4        -        2        0        0        4
Cluster D (47)    0         0        2        -        0        6        0
Cluster E (36)    0         4        0        0        -        0        0
Cluster F (85)    4         4        0        7        0        -        0
Cluster G (55)    0         2        4        0        0        0        -
It is observed that few documents (a maximum of 7, as shown in Table 5.16 above) are shared among the different clusters obtained through the proposed clustering method.
Sharing of documents among the clusters can be avoided by initially placing each document only in its best-suited cluster (against the term in the inverted index file having the maximum normalized weighted term frequency in the document's index term vector).
Hence the initial cluster set is determined by the number of clusters obtained through the proposed clustering algorithm, and a hierarchical clustering tree can then be built starting with k clusters instead of n clusters each containing a single document.
5.4 Conclusion of this Chapter
Most of the existing clustering methods, like K-means and Global K-means,
- require assuming the initial cluster centers, and
- require the number of clusters k as an input.
But finding the correct k is not easy, as
- there exists no universally accepted definition of a cluster, and
- the choice of the value for k depends on the characteristics of the data set and the desired resolution of the user.
To speed up the clustering process, a method is proposed which determines the initial number of clusters based on the entries of the inverted index defined for the given document dataset. The proposed clustering method, unlike existing clustering methods, does not require assuming the initial positions of the cluster centers, thus producing a linearly separable, global optimum clustering solution in feature space. It shows polynomial time complexity of O(lk) to produce k clusters from the l initially computed cluster centers, which is less than that of Global K-means but slightly higher than that of K-means. The clustering method is further proposed as a retrieval mechanism: a user query is matched against the cluster centroids, and the member documents are retrieved from the cluster showing the minimum distance score.