Clustering Categorical Data
The Case of Quran Verses
Presented By
Muhammad Al-Watban
IS 598
Outline
Introduction
Preprocessing of Quran Verses
Similarity Measures
Assessing Cluster Similarities
Shortcomings of Traditional clustering methods with categorical data
ROCK - Major definitions
ROCK clustering Algorithm
ROCK example
Conclusion and future work
Introduction
The Holy Quran covers a wide range of topics.
The Quran does not cover each topic in a set of consecutive verses or suras.
A single verse usually deals with many subjects.
Project goal: to cluster the verses of The Holy Quran based on their subjects.
Preprocessing of Quran Verses
It is necessary to perform manual preprocessing of the Quran text to capture the subjects of the verses in a tabular format.
Verses in the Holy Quran can be viewed as records and the related subjects as attributes of the records. This is demonstrated by the following table (reconstructed for illustration from the four example verses used later in the ROCK example):

Verse   judgment  faith  prayer  fair  fasting  pilgrimage
P1         T        T      T      T      F         F
P2         F        T      T      F      T         F
P3         F        T      F      T      T         F
P4         F        F      T      F      T         T

The data in the above table is similar to what is known as market-basket data.
Here, we will call it verses-treasures data.
Similarity Measures
Two types of attributes:
1. Continuous attributes:
the range of attribute values is continuous and ordered
includes attributes with numeric values (e.g., salary)
also includes attributes whose allowed set of values is thought to be part of an ordered set with a meaningful sequence (e.g., professional ranks, disease severity levels)
The similarity (or dissimilarity) between objects is computed based on the distance between them.
The most commonly used distance measures are Euclidean distance and Manhattan distance (see the sketch below).
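As a minimal illustrative sketch (not part of the original slides), the two distance measures can be written in Java as follows:

```java
// Illustrative sketch: the two most common distance measures
// for continuous attributes.
public class Distances {
    // Euclidean distance: square root of the sum of squared differences.
    static double euclidean(double[] x, double[] y) {
        double sum = 0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - y[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Manhattan distance: sum of absolute differences.
    static double manhattan(double[] x, double[] y) {
        double sum = 0;
        for (int i = 0; i < x.length; i++) {
            sum += Math.abs(x[i] - y[i]);
        }
        return sum;
    }
}
```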
Similarity Measures
2. Categorical attributes:
attributes whose underlying domain is not ordered
Examples: colors, blood type.
If the attribute has only two states (namely 0 and 1), then it is called binary; if it has more than two states, it is called nominal.
There is no easy way to measure a distance between such objects.
We can define dissimilarity based on the simple matching approach (see the sketch below):

$$d(i, j) = \frac{p - m}{p}$$

where m is the number of matched attributes, and p is the total number of attributes.
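A minimal sketch of this measure in Java (the array-of-strings record representation is an assumption):

```java
// Illustrative sketch: simple matching dissimilarity d(i, j) = (p - m) / p,
// where m is the number of matched attributes and p the total attributes.
public class SimpleMatching {
    static double dissimilarity(String[] recordI, String[] recordJ) {
        int p = recordI.length;   // total number of attributes
        int m = 0;                // number of matched attributes
        for (int a = 0; a < p; a++) {
            if (recordI[a].equals(recordJ[a])) m++;
        }
        return (double) (p - m) / p;
    }
}
```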
Similarity Measures
Where does the verses-treasures data fit?
Each verse can be represented by a record with Boolean attributes; each attribute corresponds to a single subject.
The attribute corresponding to a subject is T if the verse covers that subject; otherwise, it is F.
As we said, Boolean attributes are a special case of categorical attributes.
Assessing Cluster Similarities
Many clustering algorithms (such as hierarchical clustering) require computing the distance between clusters (rather than between elements).
There are several standard methods:
1. Single linkage:
D(r,s), the distance between clusters r and s, is defined as the distance between the closest pair of objects:

$$D(r, s) = \min \{ d(x, y) : x \in r,\ y \in s \}$$
Assessing Cluster Similarities
2. Complete linkage:
distance is defined as the distance between the farthest pair of objects:

$$D(r, s) = \max \{ d(x, y) : x \in r,\ y \in s \}$$

3. Average linkage:
distance is defined as the average of the distances between all pairs of objects x and y, where x and y belong to different clusters:

$$D(r, s) = \frac{1}{n_r n_s} \sum_{x \in r} \sum_{y \in s} d(x, y)$$
Assessing Cluster Similarities
4. Centroid linkage:
distance between clusters is defined as the distance between the pair of cluster centroids:

$$D(r, s) = d(c_r, c_s)$$

where $c_r$ and $c_s$ are the centroids of clusters r and s.
Shortcomings of Traditional clustering methods with categorical data
Example
Consider the following 4 market-basket transactions:
T1= {1, 2, 3, 4}
T2= {1, 2, 4}
T3= {3}
T4= {4}
converting these transactions to Boolean points, we get:
P1= (1, 1, 1, 1)
P2= (1, 1, 0, 1)
P3= (0, 0, 1, 0)
P4= (0, 0, 0, 1)
Using Euclidean distance to measure the closeness between all pairs of points, we find that d(P1, P2) is the smallest distance:

$$d(p_1, p_2) = \sqrt{|1-1|^2 + |1-1|^2 + |1-0|^2 + |1-1|^2} = 1$$
Shortcomings of Traditional clustering methods with categorical data
If we use the centroid-based hierarchical algorithm, then we merge P1 and P2 and get a new cluster (P12) with (1, 1, 0.5, 1) as its centroid.
Then, using Euclidean distance again, we find:
d(P12, P3) = √3.25 ≈ 1.80
d(P12, P4) = √2.25 = 1.50
d(P3, P4) = √2 ≈ 1.41
So, we should merge P3 and P4 since the distance between them is the shortest.
However, T3 and T4 do not have even a single common item.
So, using distance metrics as a similarity measure for categorical data is not appropriate.
The solution is ROCK.
ROCK - Major definitions
Similarity function
Neighbors
Links
Criterion function
Goodness measure
Similarity function
Let Sim(Pi, Pj) be a similarity function that is used to measure the closeness between points Pi and Pj.
ROCK assumes that the Sim function is normalized to return a value between 0 and 1.
For the Quran treasures data, a possible definition of the Sim function is based on the Jaccard coefficient:

$$sim(P_i, P_j) = \frac{|P_i \cap P_j|}{|P_i \cup P_j|}$$
Example: similarity function
Suppose two verses (P1 and P2) contain the following subjects:
P1 = {judgment, faith, prayer, fair}
P2 = {fasting, faith, prayer}
Sim(P1, P2) = |P1 ∩ P2| / |P1 ∪ P2| = 2/5 = 0.40
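A minimal sketch of this computation in Java (the set-of-subject-names representation is an assumption):

```java
// Illustrative sketch: Jaccard similarity between two verses,
// each represented as a set of subject names.
import java.util.HashSet;
import java.util.Set;

public class Jaccard {
    static double sim(Set<String> pi, Set<String> pj) {
        Set<String> inter = new HashSet<>(pi);
        inter.retainAll(pj);                  // Pi ∩ Pj
        Set<String> union = new HashSet<>(pi);
        union.addAll(pj);                     // Pi ∪ Pj
        return (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        Set<String> p1 = Set.of("judgment", "faith", "prayer", "fair");
        Set<String> p2 = Set.of("fasting", "faith", "prayer");
        System.out.println(sim(p1, p2));      // prints 0.4, as on this slide
    }
}
```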
Major definitions
Similarity for data objects
Neighbors
Links
Criterion function
Goodness measure
Neighbors and Links
One main problem of traditional clustering is that only local properties involving the two points themselves are considered.
Neighbor
If the similarity between two points exceeds a certain similarity threshold (θ), they are neighbors.
Link
The link for a pair of points is the number of their common neighbors.
Obviously, the link incorporates global information about the other points in the neighborhood of the two points. The larger the link, the higher the probability that the pair of points belongs to the same cluster.
Example: neighboring and linking
Assume that we have three distinct points p1, p2 and p3, where
neighbor(p1) = {p1, p2}
neighbor(p2) = {p1, p2, p3}
neighbor(p3) = {p3, p2}
Neighboring graph: an edge between p1 and p2, and an edge between p2 and p3.
To determine the number of links between two points, say p1 and p3, we have to find the number of their common neighbors; hence, we can define the linkage function between p1 and p3 to be:
Link(p1, p3) = |neighbor(p1) ∩ neighbor(p3)| = |{p2}|
or
Link(p1, p3) = 1
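A minimal sketch of this link computation in Java (the point names and the map-of-neighbor-sets representation are assumptions):

```java
// Illustrative sketch: link(p, q) = number of common neighbors,
// computed directly from the neighbor sets as in the example above.
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class Links {
    static int link(Map<String, Set<String>> neighbors, String p, String q) {
        Set<String> common = new HashSet<>(neighbors.get(p));
        common.retainAll(neighbors.get(q));   // neighbor(p) ∩ neighbor(q)
        return common.size();
    }

    public static void main(String[] args) {
        Map<String, Set<String>> n = Map.of(
            "p1", Set.of("p1", "p2"),
            "p2", Set.of("p1", "p2", "p3"),
            "p3", Set.of("p3", "p2"));
        System.out.println(link(n, "p1", "p3")); // 1: the common neighbor p2
    }
}
```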
Example: minimum linkage
If we have four points P1, P2, P3, P4, suppose that the similarity threshold (θ) is equal to 1.
Then, two points are neighbors only if sim(Pi, Pj) >= 1; hence, each point is a neighbor only of identical points (i.e., only of itself).
To find Link(P1, P2):
neighbor(P1) = {P1}
neighbor(P2) = {P2}
Link(P1, P2) = |neighbor(P1) ∩ neighbor(P2)| = 0
The following table shows the number of links (common neighbors) between the four points (all zero, since no point has a neighbor other than itself):

      P1  P2  P3  P4
P1    -   0   0   0
P2    0   -   0   0
P3    0   0   -   0
P4    0   0   0   -

Neighboring graph: four isolated points, with no edges between distinct points.
Example: maximum linkage
If we have four points P1, P2, P3, P4, suppose that the similarity threshold (θ) is equal to 0.
Then, two points are neighbors if sim(Pi, Pj) >= 0; hence, any pair of points are neighbors.
To find Link(P1, P2):
neighbor(P1) = {P1, P2, P3, P4}
neighbor(P2) = {P1, P2, P3, P4}
Link(P1, P2) = |neighbor(P1) ∩ neighbor(P2)| = 4
The following table shows the number of links (common neighbors) between the four points (every pair has all four points as common neighbors):

      P1  P2  P3  P4
P1    -   4   4   4
P2    4   -   4   4
P3    4   4   -   4
P4    4   4   4   -

Neighboring graph: the complete graph on P1, P2, P3, P4, with every pair of points connected.
Example: illustrating links
From the previous example, we have:
neighbor(P1) = {P1, P2, P3, P4}
neighbor(P3) = {P1, P2, P3, P4}
Link(P1, P3) = |neighbor(P1) ∩ neighbor(P3)| = 4 links
We can depict these four different links (or paths) between P1 and P3, one through each of the four common neighbors P1, P2, P3 and P4.
Major definitions
Similarity for data objects
Neighbors
Links
Criterion function
Goodness measure
Criterion function
To get the best clusters, we have to maximize this criterion function:

$$E_l = \sum_{i=1}^{k} n_i \sum_{p_q, p_r \in C_i} \frac{link(p_q, p_r)}{n_i^{1 + 2 f(\theta)}}$$

where:
Ci denotes cluster i
ni is the number of points in Ci
k is the number of clusters
θ is the similarity threshold
Suppose in Ci, each point has roughly $n_i^{f(\theta)}$ neighbors.
A suitable choice for basket data is $f(\theta) = \frac{1 - \theta}{1 + \theta}$.
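For instance, with the threshold θ = 0.3 used in the worked example later in this presentation:

$$f(\theta) = \frac{1 - 0.3}{1 + 0.3} = \frac{0.7}{1.3} \approx 0.538, \qquad 1 + 2 f(\theta) \approx 2.08$$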
Criterion function
By maximizing this criterion function, we are
maximizing the sum of links of intra cluster point
pairs and at the same time minimizing the sum of
links among pairs of points belonging to different
clusters (i.e. among inter cluster point pairs)
Major definitions
Similarity for data objects
Neighbors
Links
Criterion function
Goodness measure
Goodness measure
Goodness function:

$$g(C_i, C_j) = \frac{link[C_i, C_j]}{(n_i + n_j)^{1 + 2 f(\theta)} - n_i^{1 + 2 f(\theta)} - n_j^{1 + 2 f(\theta)}}$$

where link[Ci, Cj] is the number of cross links between clusters Ci and Cj, and ni, nj are their sizes.
During clustering, we use this goodness measure in order to maximize the criterion function.
This goodness measure helps to identify the best pair of clusters to be merged during each step of ROCK.
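A minimal sketch of this measure in Java (method and parameter names are assumptions; the cross-link count is taken as input):

```java
// Illustrative sketch of the ROCK goodness measure
// g(Ci, Cj) = link[Ci, Cj] / ((ni + nj)^(1+2f) - ni^(1+2f) - nj^(1+2f)),
// with f(theta) = (1 - theta) / (1 + theta), as suggested for basket data.
public class Goodness {
    static double f(double theta) {
        return (1 - theta) / (1 + theta);
    }

    static double goodness(int crossLinks, int ni, int nj, double theta) {
        double e = 1 + 2 * f(theta);   // the exponent 1 + 2f(theta)
        double denom = Math.pow(ni + nj, e) - Math.pow(ni, e) - Math.pow(nj, e);
        return crossLinks / denom;
    }

    public static void main(String[] args) {
        // Two singleton clusters with 3 cross links and theta = 0.3,
        // matching the worked example later in this presentation:
        System.out.println(goodness(3, 1, 1, 0.3));   // ≈ 1.35
    }
}
```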
ROCK Clustering algorithm
Input:
A set S of data points
The number k of clusters to be found
The similarity threshold θ
Output:
Groups of clustered data
The ROCK algorithm is divided into three major parts:
1. Draw a random sample from the data set
2. Perform a hierarchical agglomerative clustering algorithm
3. Label data on disk
In our case, we do not deal with a very large data set, so we will consider the whole data in the process of forming clusters, i.e., we skip steps 1 and 3.
ROCK Clustering algorithm
1. Draw a random sample from the data set:
Sampling is used to ensure scalability to very large data sets.
The initial sample is used to form clusters; then the remaining data on disk is assigned to these clusters.
In our case, we will consider the whole data in the process of forming clusters.
ROCK Clustering algorithm
2. Perform a hierarchical agglomerative clustering algorithm:
ROCK performs the following steps, which are common to all hierarchical agglomerative clustering algorithms, but with a different definition of the similarity measure (a sketch of this loop follows below):
a. place each single data point into a separate cluster
b. compute the similarity measure for all pairs of clusters
c. merge the two clusters with the highest similarity (goodness measure)
d. verify a stop condition; if it is not met, go to step b
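A minimal sketch of this loop in Java, under the simplifying assumptions that the link counts between all points are already computed and that the stop condition is reaching k clusters (this is not the heap-based optimized implementation from the ROCK paper):

```java
// Illustrative sketch of the agglomerative merge loop (step 2).
import java.util.ArrayList;
import java.util.List;

public class AgglomerativeLoop {
    static List<List<Integer>> cluster(int[][] links, double theta, int k) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int p = 0; p < links.length; p++) {          // a. one point per cluster
            List<Integer> c = new ArrayList<>();
            c.add(p);
            clusters.add(c);
        }
        while (clusters.size() > k) {                     // d. stop condition
            int bi = -1, bj = -1;
            double best = Double.NEGATIVE_INFINITY;
            for (int i = 0; i < clusters.size(); i++) {   // b. all pairs of clusters
                for (int j = i + 1; j < clusters.size(); j++) {
                    double g = goodness(clusters.get(i), clusters.get(j), links, theta);
                    if (g > best) { best = g; bi = i; bj = j; }
                }
            }
            clusters.get(bi).addAll(clusters.remove(bj)); // c. merge the best pair
        }
        return clusters;
    }

    static double goodness(List<Integer> ci, List<Integer> cj, int[][] links, double theta) {
        int cross = 0;                                    // cross links between clusters
        for (int p : ci) for (int q : cj) cross += links[p][q];
        double e = 1 + 2 * (1 - theta) / (1 + theta);     // exponent 1 + 2f(theta)
        return cross / (Math.pow(ci.size() + cj.size(), e)
                - Math.pow(ci.size(), e) - Math.pow(cj.size(), e));
    }
}
```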
3. Label data on disk:
Finally, the remaining data points on disk are assigned to the generated clusters.
This is done by selecting a random sample Li from each cluster Ci; then we assign each point p to the cluster for which it has the strongest linkage with Li.
As we said, we will consider the whole data in the process of forming clusters.
ROCK Clustering algorithm
Computation of links:
1. Using the similarity threshold θ, we convert the similarity matrix into an adjacency matrix (A).
2. Then we obtain a matrix indicating the number of links by calculating (A x A), i.e., by multiplying the adjacency matrix A with itself. (A sketch follows below.)
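A minimal sketch of both steps in Java (a naive O(n^3) matrix multiplication; names are illustrative):

```java
// Illustrative sketch: threshold the similarity matrix into an adjacency
// matrix A, then obtain the link counts as A x A.
public class LinkMatrix {
    static int[][] links(double[][] sim, double theta) {
        int n = sim.length;
        int[][] a = new int[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                a[i][j] = sim[i][j] >= theta ? 1 : 0;  // adjacency matrix A
                // (the diagonal of sim is assumed to be 1: each point
                //  is a neighbor of itself)
        int[][] link = new int[n][n];
        for (int i = 0; i < n; i++)                     // link = A x A
            for (int j = 0; j < n; j++)
                for (int m = 0; m < n; m++)
                    link[i][j] += a[i][m] * a[m][j];
        return link;
    }
}
```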
ROCK Example
Suppose we have four verses containing some subjects, as follows:
P1 = {judgment, faith, prayer, fair}
P2 = {fasting, faith, prayer}
P3 = {fair, fasting, faith}
P4 = {fasting, prayer, pilgrimage}
The similarity threshold θ = 0.3, and the number of required clusters is 2.
Using the Jaccard coefficient as a similarity measure, we obtain the following similarity table (values reconstructed from the verse sets above):

        P1    P2    P3    P4
P1     1.00  0.40  0.40  0.17
P2     0.40  1.00  0.50  0.50
P3     0.40  0.50  1.00  0.20
P4     0.17  0.50  0.20  1.00
ROCK Example
Since we have a similarity threshold equal to 0.3, we derive the adjacency table (1 where sim >= 0.3):

      P1  P2  P3  P4
P1    1   1   1   0
P2    1   1   1   1
P3    1   1   1   0
P4    0   1   0   1

By multiplying the adjacency table with itself, we derive the following table, which shows the number of links (or common neighbors):

      P1  P2  P3  P4
P1    3   3   3   1
P2    3   4   3   2
P3    3   3   3   1
P4    1   2   1   2
ROCK Example
We compute the goodness measure for all adjacent points, assuming f(θ) = (1 - θ)/(1 + θ):

$$g(P_i, P_j) = \frac{link[P_i, P_j]}{(n + m)^{1 + 2 f(\theta)} - n^{1 + 2 f(\theta)} - m^{1 + 2 f(\theta)}}$$

With θ = 0.3, the exponent 1 + 2f(θ) ≈ 2.08, so for two singleton clusters the denominator is 2^2.08 - 2 ≈ 2.22. We obtain the following table (values reconstructed from the link counts above):

Pair       Links  Goodness
(P1, P2)     3      1.35
(P1, P3)     3      1.35
(P2, P3)     3      1.35
(P2, P4)     2      0.90
(P1, P4)     1      0.45
(P3, P4)     1      0.45

We have an equal (maximum) goodness measure for merging (P1,P2), (P1,P3) and (P2,P3).
ROCK Example
Now, we start the hierarchical algorithm by merging, say, P1 and P2. A new cluster (let us call it C(P1,P2)) is formed.
It should be noted that other hierarchical clustering techniques would not start the clustering process by merging P1 and P2, since Sim(P1,P2) = 0.4, which is not the highest similarity. But ROCK uses the number of links as the similarity measure rather than distance.
ROCK Example
Now, after merging P1 and P2, we have only three clusters. The following table shows the number of common neighbors (links) for these clusters:

            C(P1,P2)  P3  P4
C(P1,P2)       -      6   3
P3             6      -   1
P4             3      1   -

Then we obtain the following goodness measures for all adjacent clusters:

Pair             Links  Goodness
(C(P1,P2), P3)     6      1.31
(C(P1,P2), P4)     3      0.66
(P3, P4)           1      0.45
ROCK Example
Since the number of required clusters is 2, we finish the clustering algorithm by merging C(P1,P2) and P3, obtaining a new cluster C(P1,P2,P3) which contains {P1, P2, P3}, leaving P4 alone in a separate cluster.
Conclusion and future work (1/3)
We aim to apply a clustering technique to the verses of the Holy Quran.
We should first perform manual preprocessing of the Quran text to capture the subjects of the verses in a tabular format.
Then we can apply a clustering algorithm which clusters each set of similar verses into the same group.
Conclusion and future work (2/3)
Most traditional clustering algorithms use distance-based similarity measures, which are not appropriate for clustering our categorical-type dataset.
We will apply the general framework of the ROCK algorithm.
The ROCK (RObust Clustering using linKs) algorithm is an agglomerative hierarchical clustering algorithm for categorical data. It presents a new notion of links to measure the similarity between data objects.
Conclusion and future work (3/3)
We will adopt the Java language to implement the ROCK clustering algorithm.
During testing, we will try to form clusters of verses belonging to a single sura, and of verses belonging to many different suras.
Insha Allah, we will achieve success in performing this mission.
Thank You for your attention
I will be glad to answer your questions