
32nd Annual International Conference of the IEEE EMBS
Buenos Aires, Argentina, August 31 - September 4, 2010
Application of Crisp and Fuzzy Clustering
Algorithms for Identification of Hidden Patterns from
Plethysmographic Observations on the Radial Pulse
Sunil Karamchandani,
S.N.Merchant
Indian Institute of Technology
Bombay, Mumbai, India, 400076
U.B.Desai
Indian Institute of Technology
Hyderabad, India 502205
G.D.Jindal
Bhabha Atomic Research Centre
Mumbai, India, 400085
[email protected]
[email protected]
sunilk,[email protected]
Abstract— Radial Pulse forms the most basic and essential
physical sign in clinical medicine. The paper proposes the
application of crisp and fuzzy clustering algorithms under supervised and unsupervised learning scenarios for identifying
non-trivial regularities and relationships of the radial pulse
patterns obtained by using the Impedance Plethysmographic
technique. The objective of our paper is to unearth the hidden
patterns to capture the physiological variabilities from the arterial
pulse for clinical analysis, thus providing a very useful tool for
disease characterization. A variety of fuzzy algorithms, including Gustafson-Kessel (GK) and Gath-Geva (GG), have been extensively tested over a diverse group of subjects and over 4855 data sets. Exhaustive testing over the data set shows that about 80% of the patterns are successfully classified, providing promising results. A Rand Index of 0.7739 is obtained under supervised learning, which shows excellent conformity of our process with the results of plethysmographic experts. A correlation of the patterns with diseases of the heart, liver and lungs is judiciously performed.
Index Terms—Peripheral Pulse Analyzer, Impedance Plethysmography, fuzzy clustering.
I. INTRODUCTION
The pulse of the radial artery is an important diagnostic
tool for all physicians. The radial artery is not only easily
accessible but is in direct continuation of the heart and close
to it. Examination of the pulse throws light on the gravity of illness and gives a guideline for prognosis. The pulse provides evidence of great value both on the state of the central circulatory system and on the general pathophysiological condition of the subject. Fluctuations in physiological conditions are reflected as changes in the morphology of the
arterial pulse. The pulse identifies the presence and location
of disorders in a patient’s body unlike ECG which mainly
reflects the electrical activity of the heart, and thus it contains
much more useful information than ECG [1]. In traditional
Indian medicine the clinician palpates the area above the radial
artery at the wrist location of the patient and monitors the
rhythm, pulse pressure, pulse propagation and the elasticity
of pulse for arriving at the diagnosis in the patient [2]. The
diagnosis requires a long period of study and practice by the
physician, without the benefit of any recording aids. To extract
information from the radial pulse we use the principle of
Impedance Plethysmography (IP), a convenient, inexpensive,
painless and non-invasive technique. IP provides an indirect assessment of blood volume changes in any part of a body segment as a function of time [3]. Since blood is a good conductor of electricity, the amount of blood in a given body segment is reflected inversely in the electrical impedance of that segment. The pulsatile blood volume changes caused by systemic blood circulation therefore cause a proportional change in the electrical impedance [4].
978-1-4244-4124-2/10/$25.00 ©2010 IEEE
Using the technique of IP, we record the impedance changes
in the radial artery as a measure of the blood flow. After
peak detection, pulse waveforms, each consisting of sixty four
sample points, are recorded, with twenty sample points prior
to the peak and forty four thereafter. Over four thousand such
data sets are collected with the help of more than three hundred
subjects. We apply clustering algorithms to identify the groups
of related data that can further be explored.
II. DATA ACQUISITION
Pulse signals from the radial artery are measured using
the Peripheral Pulse analyzer developed at Bhabha Atomic
Research Centre (B.A.R.C), Mumbai, India. The radial artery
begins about 1 cm below the bend of the elbow and passes
along the radial side of the forearm to the wrist, where its pulsation can be readily felt. With the subject in supine position,
carrier electrodes are applied around the upper arm and the
palm while sensing electrodes are applied on the distal segment
around the wrist. Peripheral Pulse analyzer uses the principle
of IP wherein a sinusoidal current of constant amplitude (2
mA) is allowed to flow across the wrist of the subject using
band electrodes. The amplitude of the signal thus obtained is
directly proportional to the electrical impedance of the body
segment. The waveforms obtained are sampled at 100 Hz as time series data. The data are recorded from normal and diseased subjects at the Biomedical Division, Modular Labs, B.A.R.C. for about four minutes on the LabWindows platform. The subjects' ages range from about 18 to 60 years. Approximately 240 such samples are obtained from a single subject. The observed
impedance signal is shown in figure 1.
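As a concrete sketch of the windowing step described above (peak detection, then sixty-four samples per pulse: twenty before the peak and forty-four after), the following Python fragment is illustrative only; the `find_peaks` settings and the synthetic test signal are assumptions of this sketch, not the Peripheral Pulse analyzer's actual processing.

```python
import numpy as np
from scipy.signal import find_peaks

FS = 100             # sampling rate (Hz), as stated above
PRE, POST = 20, 44   # samples kept before/after each peak; window length = 64

def extract_pulses(signal, fs=FS):
    """Cut a raw impedance waveform into fixed-length windows around detected peaks."""
    # Require peaks to be at least 0.6 s apart (an assumed refractory gap).
    peaks, _ = find_peaks(signal, distance=int(0.6 * fs))
    windows = [signal[p - PRE:p + POST]
               for p in peaks
               if p - PRE >= 0 and p + POST <= len(signal)]
    return np.array(windows)   # shape: (number_of_pulses, 64)

# Illustrative synthetic pulse train: one sharp bump per second plus mild noise.
rng = np.random.default_rng(0)
t = np.arange(0, 10, 1 / FS)
x = np.exp(-((t % 1.0) - 0.3) ** 2 / 0.002) + 0.01 * rng.standard_normal(t.size)
pulses = extract_pulses(x)
print(pulses.shape)   # one 64-sample row per detected pulse
```

Each row of the returned matrix is one pulse pattern and plays the role of one observation in the clustering experiments that follow.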
III. PROPOSED CLUSTERING ALGORITHMS
We propose to unearth suitable techniques for clustering
of plethysmographic data. We consider two scenarios: in the first, the true class labels of the data set are unknown (unsupervised learning), while in the second we pose a pattern classification problem wherein an expert assigns class labels to the data set (supervised learning). In the first technique, we apply hard and fuzzy clustering methods for estimating the number of clusters in the data. The performance of these clustering techniques is validated with the help of standard indices. The crisp clustering methods are the k-means method, validated with the silhouette index, and the k-medoid method, while for fuzzy clustering we use the fuzzy C-means (FCM), GK and GG clustering algorithms. Cluster validity is evaluated by computing the standard indices: Dunn's Index (DI), Alternative Dunn's Index (ADI), Partition Index (SC), Separation Index (S), and Xie and Beni's Index (Xb), and their performance is compared. Based on the performance of the cluster validity indices we decide the number of hidden patterns in our data. In the second technique, each of the 4855 data sets has been individually labeled into one of eight different classes by a plethysmographic expert. Once the true class labels are known, the k-means algorithm is applied by specifying eight clusters. The performance of the algorithm is judged using the Rand Index (RI).

Fig. 1. Acquisition of pulse patterns in the LabWindows environment

A. Unsupervised Clustering

During unsupervised learning we use various cluster validation parameters. DI [5] is usually selected for compact and well-separated clusters and does not work well with overlapping clusters. The ADI [5] simplifies the calculations of DI. The optimal clustering corresponds to low values of DI and ADI. The Xb index [5] quantifies the ratio of the total within-cluster variation to the cluster separation; hence its minimum value decides the optimal number of clusters. SC [6] is the ratio of the sum of compactness and separation of the clusters and should have a lower value for good clustering. SC is useful when comparing different partitions having an equal number of clusters. S [6], being a separation index, should be as high as possible.

1) k-means and Silhouette Index: The k-means algorithm seeks to partition the observations in the n × p data matrix into k mutually exclusive clusters and returns a vector of indices indicating to which of the k clusters it has assigned each of the observations. The technique uses the output of the k-means clustering algorithm, comparing the change in within-cluster dispersion to that expected under a uniform null distribution. Each cluster in the partition is defined by its member objects and by its centroid. The centroid for each cluster is the point to which the sum of distances from all objects in that cluster is minimized. The algorithm computes the cluster centroids so as to minimize the total squared Euclidean distance given by equation 1:

E = \sum_{j=1}^{k} \sum_{i=1}^{n} \| x_i(j) - c_j \|^2          (1)

where E is the sum-of-squares error over all objects in the database and \| x_i(j) - c_j \|^2 is the squared Euclidean distance between a data point x_i(j) of cluster j and the cluster centroid c_j. The k-means
method requires the user to specify the number of clusters as an input parameter. To find a satisfactory clustering result, the algorithm is executed a number of times with different values of k (the number of clusters). The distortion of a cluster is the sum of squared distances between the objects in the cluster; the lower the value of this measure, the better the clustering. However, for our data the distortion continues to decrease as the number of clusters increases, and thus it does not help us decide the number of clusters. The silhouette index defined in [7] is
used to validate the k mean algorithm. The silhouette is used
to determine the degree of the similarity of data with respect
to the data values in its own cluster versus the data in the
other clusters. As seen from figure 2, a silhouette plot has a range from -1, for data assigned to a wrong cluster, to +1
for data points that are very far from neighboring clusters. A
silhouette value of zero indicates that no proper distinction is
made for classification. Subjectively for k = 4, 6 and 10 we
observe that most of the silhouette values are greater than 0.6,
hence they are examples of good clustering. The silhouette plot
for k = 8 shows quite a few clusters having negative values
indicating that eight is not the right number of clusters. We use the largest average silhouette width to find the optimum number of clusters. As seen from Table I, k = 4 has the largest average silhouette width, hence we conclude that four different
patterns exist in the radial pulse data set.

TABLE I
MEAN SILHOUETTE WIDTH FOR DIFFERENT CLUSTERS
No. of Clusters (k)     3       4       5       6       7       8       9       10
Mean Silhouette Width   0.3052  0.6046  0.4743  0.5150  0.4174  0.4614  0.4741  0.4988

However, with k = 4 the major partition of the data set is in cluster three, while
for k = 10 the data sets are equally distributed among all the clusters. Thus the optimum cluster count obtained from the silhouette plot is not clearly defined. In Table II we provide the values of the validation parameters for the various cluster counts. The silhouette index is thus the measure that helps us classify our data set into four patterns. Table II shows the cluster validity indices for the k-means algorithm. The values of the indices SC and ADI suggest the optimal number of clusters as eight. Xb, however, does not provide any information about the clusters as it is monotonically decreasing.
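The selection rule used above, running k-means for a range of k and keeping the k with the largest mean silhouette width, can be sketched with scikit-learn; the synthetic blob data below merely stands in for the 64-sample pulse vectors and is an assumption of this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Stand-in for the pulse data: an n x p matrix with a known 4-cluster structure.
X, _ = make_blobs(n_samples=400, n_features=64, centers=4, random_state=0)

best_k, best_width = None, -1.0
for k in range(3, 11):                      # sweep k = 3..10, as in Table I
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    width = silhouette_score(X, labels)     # mean silhouette width, in [-1, 1]
    if width > best_width:
        best_k, best_width = k, width

print(best_k)  # the k with the largest average silhouette width
```

With well-separated clusters the sweep recovers the generating k; on real pulse data the widths are much closer together, which is exactly why Table I calls for a side-by-side comparison.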
2) k-medoid Clustering: A medoid is the sample of a cluster whose average dissimilarity to all objects in the cluster is minimal [8]. This algorithm is derived from the k-means algorithm, with the centroid replaced by the medoid. Initially we choose k random data points to be the initial cluster medoids. Then we assign each data point to the cluster associated with the closest medoid. By minimizing the cost function, viz. the Minkowski distance metric, we recalculate the positions of the k medoids. This procedure is repeated iteratively till the medoids are fixed. As seen from Table III, the optimal number of clusters is nine; thus there is one pattern which the k-means algorithm is not able to identify. All the validation parameters are in complete agreement on the optimal number of clusters (there is hardly any variation in the values of S, as they are very small). The parameter Xb is infinite in the case of k-medoid clustering and so does not exist.

3) Fuzzy C-means Clustering: The name fuzzy suggests that a data point can belong to two or more clusters. FCM is based on the minimization of the objective function given by equation 2:

J_m = \sum_{i=1}^{n} \sum_{j=1}^{c} u_{ij}^{m} \| x_i - c_j \|^2          (2)

Fuzzy partitioning is carried out through an iterative optimization of the objective function J_m, with updates of the membership function u_{ij} and the cluster centers c_j. We use a termination parameter epsilon = 1e-5 and a fuzziness parameter m = 2. We assume an initial value of the partition matrix U = [u_{ij}] and calculate the cluster centers according to equation 3:

c_j = \frac{\sum_{i=1}^{N} u_{ij}^{m} x_i}{\sum_{i=1}^{N} u_{ij}^{m}}          (3)

After updating the partition matrix using equation 4, we continue the iteration until ||U_{k+1} - U_k|| < epsilon:

u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( \| x_i - c_j \| / \| x_i - c_k \| \right)^{2/(m-1)}}          (4)

The final partition matrix is used to identify the data patterns.

Fig. 2. Silhouette plots for different numbers of clusters: (a) k = 4, (b) k = 6, (c) k = 8, (d) k = 10
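A minimal NumPy sketch of the FCM iteration, alternating the center update of equation 3 and the membership update of equation 4, with m = 2 and epsilon = 1e-5 as stated in the text; the two-group toy data at the bottom is an assumption for demonstration.

```python
import numpy as np

def fcm(X, c, m=2.0, eps=1e-5, max_iter=300, seed=0):
    """Fuzzy C-means: alternate the center update (eq. 3) and membership update (eq. 4)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)        # rows of the partition matrix sum to 1
    for _ in range(max_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]                    # eq. 3
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                # avoid division by zero at a center
        # eq. 4: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
        U_new = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1))).sum(axis=2)
        done = np.linalg.norm(U_new - U) < eps   # termination: ||U_{k+1} - U_k|| < eps
        U = U_new
        if done:
            break
    return U, centers

# Two well-separated toy groups: the memberships should come out near-crisp.
X = np.vstack([np.zeros((20, 2)), 10 + np.zeros((20, 2))]) \
    + 0.1 * np.random.default_rng(1).standard_normal((40, 2))
U, centers = fcm(X, c=2)
print(np.sort(centers[:, 0]).round(1))   # centers near 0 and 10
```

The returned partition matrix plays the role of the final U used in the text to identify the data patterns.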
TABLE IV
VALIDATION OF FUZZY C-MEAN CLUSTERING
Cluster   DI       ADI       SC       S        Xb
k = 3     0.0879   0.0782    1.1059   3.0863   1.5239
k = 4     0.0491   0.0145    0.8313   2.10     1.2543
k = 5     0.0445   0.0145    0.7950   2.2512   0.9572
k = 6     0.0445   0.0052    0.8613   2.49     0.8128
k = 7     0.0438   4.84e-4   0.7198   1.949    0.7629
k = 8     0.0438   5.3e-4    0.7572   2.123    0.6530
k = 9     0.0438   3.5e-5    0.7696   1.9866   0.5973
k = 10    0.0438   1.25e-4   0.7850   2.11     0.5288
The three cluster validation parameters DI, ADI and Xb have no fixed local minimum; they are monotonically decreasing. The separation indices give an optimum cluster count of seven for SC and six for S. The cluster validity indices are shown in Table IV.
4) Fuzzy GK Clustering: The GK clustering algorithm used is a variation of the fuzzy C-means algorithm, in which we employ an adaptive distance norm based on the Mahalanobis distance.

TABLE II
VALIDATION OF K-MEAN CLUSTERING
Cluster   DI       ADI      SC       S        Xb
k = 3     0.0876   0.0024   0.4865   1.5512   2.6734
k = 4     0.0681   0.0029   0.3461   1.0285   2.6529
k = 5     0.525    0.0022   0.3527   1.1027   2.3535
k = 6     0.0438   0.0022   0.3707   1.23     2.3535
k = 7     0.0449   0.0020   0.3770   1.2614   2.2494
k = 8     0.0449   0.001    0.3014   0.968    2.0993
k = 9     0.0460   0.0016   0.3196   0.9639   2.0772
k = 10    0.0505   0.0016   0.3218   1.045    2.0555

TABLE III
VALIDATION OF K-MEDOID CLUSTERING
Cluster   DI       ADI      SC       S (*1e-4)
k = 3     0.0379   0.0021   0.8902   2.277
k = 4     0.0462   0.0021   0.4811   1.49
k = 5     0.0376   0.0021   0.5351   1.9477
k = 6     0.0433   0.0024   0.4337   1.4875
k = 7     0.0350   0.0016   0.4616   1.57
k = 8     0.0376   7.7e-4   0.3560   1.193
k = 9     0.0343   7.6e-4   0.3326   1.0788
k = 10    0.0409   7.6e-4   0.4077   1.379

TABLE V
VALIDATION OF FUZZY GK CLUSTERING
Cluster   DI       ADI       SC       S        Xb
k = 3     0.0217   0.0017    817.25   0.2025   0.6707
k = 4     0.0217   0.0017    472.38   0.1171   0.5034
k = 5     0.0217   0.0091    313.82   0.0763   0.4030
k = 6     0.0217   0.017     221.09   0.0527   0.3362
k = 7     0.0217   8.5e-4    160.4    0.0376   0.2884
k = 8     0.0223   0.0059    122.3    0.0283   0.2526
k = 9     0.0217   8.04e-4   263.9    0.064    0.2237
k = 10    0.0217   8.02e-4   205.6    0.0497   0.2015

As seen from Table V, only the separation indices can provide the optimal number of clusters. The GK algorithm thus provides results consistent with the FCM algorithm and is not of much help in providing better clustering.
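The adaptive norm that distinguishes GK from FCM can be sketched as follows: each cluster j gets a norm-inducing matrix A_j built from its fuzzy covariance, so distances follow the cluster's own shape rather than being spherical. The cluster volumes (rho = 1) and the small regularization term are common defaults assumed here, not values taken from this paper.

```python
import numpy as np

def gk_distances(X, centers, U, m=2.0, rho=1.0, reg=1e-8):
    """Squared GK distances d^2_ij = (x_i - c_j)^T A_j (x_i - c_j),
    where A_j is derived from the fuzzy covariance of cluster j."""
    n, d = X.shape
    c = centers.shape[0]
    D2 = np.empty((n, c))
    Um = U ** m
    for j in range(c):
        diff = X - centers[j]                                   # (n, d)
        # Fuzzy covariance of cluster j, weighted by memberships.
        F = (Um[:, j, None] * diff).T @ diff / Um[:, j].sum()
        F += reg * np.eye(d)                                    # keep F invertible
        # Volume-normalized norm-inducing matrix A_j.
        A = (rho * np.linalg.det(F)) ** (1.0 / d) * np.linalg.inv(F)
        D2[:, j] = np.einsum('nd,de,ne->n', diff, A, diff)      # quadratic forms
    return D2

# Illustrative call: 2 clusters of 2-D points with random memberships.
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 2))
centers = np.array([[-1.0, 0.0], [1.0, 0.0]])
U = rng.random((30, 2)); U /= U.sum(axis=1, keepdims=True)
D2 = gk_distances(X, centers, U)
print(D2.shape)   # (30, 2), one squared distance per point per cluster
```

Substituting these distances for the Euclidean ones inside the FCM membership update of equation 4 yields the GK iteration.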
5) Fuzzy GG Clustering: The GG clustering algorithm, a further extension of the FCM algorithm, uses a distance norm based on fuzzy maximum likelihood estimates. As seen from Table VI, only two validation parameters can be used for GG clustering. Due to the close proximity between the values of DI and the very low values of ADI, no proper decision can be derived from the validation parameters; hence the GG method fails to be a good clustering algorithm. Normally this method of clustering yields good results for high-dimensional data, but in our case it fails.

B. Supervised Learning

Under the guidance of a plethysmographic expert, eight different patterns are observed among the 4855 data sets that we have obtained. Figure 3 shows these patterns, to which we apply a supervised clustering algorithm for classification. This classification serves as an aid to the clinician in correlating the patterns with different diseases.

Fig. 3. Hidden patterns in the radial pulse

For the validation of the supervised clustering algorithm we use the Rand Index (RI), which brings out the similarity between the k-means clustering and the true class labels. If C_k is the clustering obtained with k-means and T_k is the true class assignment, then RI is given by equation 5:

RI(C_k, T_k) = (n_s + n_d) / \binom{n}{2}          (5)

where n_s is the number of pairs of data points placed in the same cluster by both partitions and n_d is the number of pairs placed in different clusters by both, as determined by the k-means algorithm and the expert labels. We execute the k-means algorithm with the cluster parameter set to eight. The resulting Rand Index has a value of 0.7739 for k = 8, which indicates good agreement between the two classifications and that the plethysmographic data has been classified into the above eight patterns. The confusion matrix obtained is shown in Table VII. The rows give the clustering statistics as obtained from the k-means algorithm, while the columns are the clusters as provided by the expert. In all the simulations we have used a fixed seed so that uniform variation exists for all algorithms, both supervised and unsupervised. We evaluate the results of the clustering algorithms based on quality indices and select the clustering scheme that best fits the data.

TABLE VII
CONFUSION MATRIX FOR CALCULATION OF RI
Cluster   k = 1   k = 2   k = 3   k = 4   k = 5   k = 6   k = 7   k = 8
k = 1       504       1      24     364       0       0      91     147
k = 2       229       0       0      10       1       0     121      88
k = 3       241       0       0      17       0       0     112     109
k = 4       206      16      41     386      41      22      34      49
k = 5        27       0      17     145     380      66       0      11
k = 6         1       9      44     121      75     287       0       0
k = 7         2      86       8     170      76     179       0       1
k = 8         0      81       0       9      39     240       0       0
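Equation 5 can be implemented by direct pair counting; the sketch below uses a hypothetical six-point labeling rather than the paper's 4855-point data set.

```python
from itertools import combinations

def rand_index(pred, true):
    """RI = (n_s + n_d) / C(n, 2): the fraction of point pairs on which the
    clustering and the expert labeling agree (same-same or different-different)."""
    pairs = list(combinations(range(len(pred)), 2))
    agree = sum(
        (pred[i] == pred[j]) == (true[i] == true[j])  # both same, or both different
        for i, j in pairs
    )
    return agree / len(pairs)

# Toy example: one point of class 2 mis-clustered into cluster 1.
true = [1, 1, 2, 2, 3, 3]
pred = [1, 1, 1, 2, 3, 3]
print(rand_index(pred, true))  # → 0.8
```

An RI of 1.0 means the two partitions agree on every pair; the paper's value of 0.7739 for k = 8 sits well above the chance level for eight classes.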
IV. DATA ANALYSIS
Data analysis has shown over 80% agreement with the variability analysis of the pulse patterns assessed by a group of
plethysmographic experts. Normal subjects record patterns 1,
2 and 3 predominantly with brief interpositions of patterns 4
to 8. Patients suffering from disorders of lung, liver and heart
record patterns 5 to 8 predominantly with brief interpositions
of patterns 1 to 4. These characteristic pattern interpositions
can be helpful in predictive diagnosis of disease in the subjects.
V. CONCLUSION
IP has an advantage over the traditional methods because it requires no technical skill in reading the pulse. The conventional methods have several listed limitations [9], while our technique requires only that the person is calm and sits
in the upright position. The morphology of the radial pulse
will help us to provide a scientific validation to an abstract
science. The main advantage of the k-means algorithm is its computational simplicity and low memory requirements.
The principal disadvantage of all unsupervised algorithms is
the dependence of results on the selection of an initial set of
centroids, medoids or the partition matrix. In our application,
crisp clustering algorithms provide the best results both in
supervised as well as unsupervised learning.
TABLE VI
VALIDATION OF FUZZY GG CLUSTERING
Cluster   DI       ADI (*1e-45)
k = 3     0.0251   2.5
k = 4     0.0255   2.165
k = 5     0.0242   2.66
k = 6     0.0266   3.209
k = 7     0.0247   3.889
k = 8     0.0251   4.57
k = 9     0.0247   5.12
k = 10    0.0251   5.53
REFERENCES
[1] A. Joshi, S. Chandran, "Arterial Pulse Rate Variability Analysis for Diagnoses," 19th Intl. Conference on Pattern Recognition, pp. 1-4, 2008.
[2] Abhinav, Meghna Sareen, Mahender Kumar, Sneh Anand, "Nadi Yantra: A Robust System Design to Capture the Signals from the Radial Artery for Non-Invasive Diagnosis," The 2nd Intl. Conference on Bioinformatics and Biomedical Engineering, Shanghai, China, May 2008.
[3] G.D. Jindal, T.S. Annanthakrishnan, S.K. Kataria, "Electrical Impedance and Photo Plethysmography for Medical Applications," BARC/2005/E/025.
[4] S. Karamchandani, M. Dixit, R. Jain, M. Bhowmick, "Application of neural networks in the interpretation of impedance cardiovasograms for the diagnoses of peripheral vascular diseases," Conf. Proc. IEEE Eng. Med. Biol. Soc., vol. 7, pp. 7537-7540, 2005.
[5] F. Hahne, W. Huber, R. Gentleman, S. Falcon, Bioconductor Case Studies, Springer, 2008.
[6] J. Oliveira, W. Pedrycz, Advances in Fuzzy Clustering and its Applications, Wiley and Sons, June 2007.
[7] "Separation index and partial membership for clustering," Computational Statistics and Data Analysis, vol. 50, issue 3, pp. 585-603, 2006.
[8] N. Zullkurnain, A.A. Aburas, "Investigation of Time Series Medical Data based on Wavelets and K-means Clustering," Ariser, vol. 3, pp. 112-122, 2007.
[9] Aniruddha Joshi, Anand Kulkarni, Sharat Chandran, V.K. Jayaraman and B.D. Kulkarni, "Nadi Tarangini: A pulse based diagnostic system," Proceedings of the 29th Annual Conference of the IEEE EMBS, 2007.