SEQOPTICS: PROTEIN SEQUENCE CLUSTERING WITH OPTICS
Yonghui Chen (Advisor: Kevin Reilly & Alan Sprague)
Department of Computer and Information Sciences – The University of Alabama at Birmingham
http://www.cis.uab.edu/chenyh
ABSTRACT
METHODS
Protein sequence clustering is widely used in the analysis of protein
structure and function. We demonstrate an approach to protein clustering,
SEQOPTICS, based on OPTICS (Ordering Points To Identify the Clustering
Structure). Its appeal lies in its emphasis on visualizing results and
supporting interactive work, e.g., in choosing parameters. SEQOPTICS results
are presented for four data sets from diverse data sources, and the sequence
clustering structure is visualized for all four. We evaluated the system
against two existing methods, BAG and blastclust; measured by the Jaccard
coefficient, SEQOPTICS performed better on all four data sets.
DATA SETS
Figure 1: Overview of SEQOPTICS. (Diagram labels: Pfam; extracting data set;
computing pairwise distance, with a Smith-Waterman score normalized into a
Smith-Waterman distance measure; clustering, results analysis, visualization.)
First, data sets were extracted from protein databases. Second, the pairwise
distance between every two proteins was computed from a normalized
Smith-Waterman score. The OPTICS algorithm then clustered the sequences, and
the results were analyzed via the Jaccard coefficient.
A pairwise Smith-Waterman local alignment score is computed first and then
normalized to obtain SN(a,b):
SN(a,b) = S(a,b) / Min(S(a,a), S(b,b)),
where S(a,b) is the Smith-Waterman local alignment score between two
sequences a and b.
The distance between two protein sequences is then computed as:
Distance(a,b) = 1 - SN(a,b).
Distances range from 0 to 1, with 0 meaning identical sequences and 1 meaning
totally different.
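The normalization can be illustrated with a minimal sketch. It takes the distance as 1 - SN(a,b), which is the form consistent with the stated range (0 for identical sequences, 1 for totally different ones). The function names and the match/mismatch/gap values are illustrative assumptions; the poster does not state which scoring scheme was used, and the sketch assumes non-empty sequences.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of a and b.

    Assumption: simple match/mismatch scores and a linear gap penalty,
    chosen only for illustration.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Local alignment: a cell never drops below zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def normalized_score(a, b):
    """SN(a,b) = S(a,b) / Min(S(a,a), S(b,b))."""
    s = smith_waterman_score
    return s(a, b) / min(s(a, a), s(b, b))


def distance(a, b):
    """Distance(a,b) = 1 - SN(a,b), so identical sequences are at 0."""
    return 1.0 - normalized_score(a, b)


d_same = distance("MKVLA", "MKVLA")      # identical sequences
d_diff = distance("AAAA", "CCCC")        # no residue in common
d_part = distance("MKVLAT", "MKVLGG")    # shared prefix MKVL
```

With this toy scoring, the shared prefix MKVL scores 8 against a self-score of 12, so the third distance is 1 - 8/12.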
(Diagram labels: distance matrix; reachability plot; cluster extraction.)
OPTICS (Ordering Points To Identify the Clustering Structure) orders the data into a
density-based clustering structure that corresponds to a broad range of parameter settings
(selected by the user). A reachability plot (a bar chart) shows each object's reachability
distance in the order in which the objects were processed, and so displays the data's cluster
structure. Applying OPTICS to protein sequence clustering has two main advantages:
1) OPTICS can find regions of high local density; 2) OPTICS produces an augmented ordering
of the database that represents its density-based clustering structure, and this ordering can
be visualized in the reachability plot.
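The ordering step can be sketched as follows, assuming a precomputed symmetric distance matrix. This is a simplified illustration, not the implementation used in this study: the name `optics_order`, the default `min_pts=2`, and the unbounded generating distance `eps` are assumptions, and a quadratic scan stands in for the usual priority queue, so it suits only small inputs.

```python
import math


def optics_order(dist, min_pts=2, eps=math.inf):
    """Return (order, reachability) for a square distance matrix."""
    n = len(dist)
    reach = [math.inf] * n
    processed = [False] * n
    order = []

    def core_distance(p):
        # Distance to the min_pts-th nearest object within eps,
        # counting p itself (dist[p][p] == 0).
        near = sorted(d for d in dist[p] if d <= eps)
        return near[min_pts - 1] if len(near) >= min_pts else None

    def update(p, seeds):
        core = core_distance(p)
        if core is None:          # p is not a core object
            return
        for o in range(n):
            if processed[o] or dist[p][o] > eps:
                continue
            new_reach = max(core, dist[p][o])
            if new_reach < reach[o]:
                reach[o] = new_reach
                seeds.add(o)

    for start in range(n):
        if processed[start]:
            continue
        processed[start] = True
        order.append(start)
        seeds = set()
        update(start, seeds)
        while seeds:
            # Next object: smallest current reachability distance.
            q = min(seeds, key=lambda o: (reach[o], o))
            seeds.remove(q)
            processed[q] = True
            order.append(q)
            update(q, seeds)
    return order, [reach[i] for i in order]


# Hypothetical toy data: two well-separated groups on a line.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(p - q) for q in points] for p in points]
order, reach = optics_order(dist)
```

In the resulting reachability sequence, the first object has an undefined (infinite) reachability distance and the jump to the second group appears as a single high bar between two valleys.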
The final clusters can be extracted from the plot by employing either a cutoff
value or a steepness criterion. In this study, density-based clusters were
extracted by using a cutoff value. For example, in Figure 2, the cutoff value
is set to 0.860 (see the line at reachability distance 0.860). Under this
cutoff, each valley in Figure 2 that lies between two sequences with
reachability distance higher than the cutoff identifies a cluster. The
sequence that starts a valley, although its own reachability distance is
higher than the cutoff, belongs to the same cluster as the remaining sequences
in the valley. Any other sequence with reachability distance higher than the
cutoff is noise. Therefore, four clusters are identified in Figure 2.
Similarly, with the cutoff values used in the paper, there are four clusters
in Figure 3, six in Figure 4, and four in Figure 5.
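The cutoff rule described above can be sketched in a few lines. The function name `extract_clusters` and the reachability values in the example are hypothetical; only the cutoff of 0.860 comes from the text.

```python
def extract_clusters(reach, cutoff):
    """Split an OPTICS ordering into clusters and noise.

    reach[i] is the reachability distance of the i-th object in the
    ordering. Returns (clusters, noise) as lists of positions.
    """
    clusters, noise = [], []
    current = None
    for i, r in enumerate(reach):
        if r > cutoff:
            # A bar above the cutoff starts a valley only if the
            # next bar is at or below the cutoff.
            starts_valley = i + 1 < len(reach) and reach[i + 1] <= cutoff
            if starts_valley:
                current = [i]
                clusters.append(current)
            else:
                current = None
                noise.append(i)
        elif current is None:
            # Defensive case: the ordering begins below the cutoff.
            current = [i]
            clusters.append(current)
        else:
            current.append(i)
    return clusters, noise


# Hypothetical reachability values, in processing order.
example_reach = [float("inf"), 0.5, 0.6, 0.9, 0.95, 0.4, 0.5, 0.97]
clusters, noise = extract_clusters(example_reach, cutoff=0.860)
```

Here positions 0-2 and 4-6 form two clusters, while positions 3 and 7 exceed the cutoff without starting a valley and are reported as noise.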
RESULTS EVALUATION
SEQOPTICS was applied to cluster the data sets:
1. Visualization of the cluster structure: We made a reachability distance plot for each data set. In each figure, the horizontal axis
represents the ordering of the sequences, the vertical axis represents the reachability distance, and each valley stands for a
cluster. The figures show that each valley contains sequences from exclusively one family.
2. Extraction of the clusters: The final density-based clusters were extracted by using a cutoff value. For example, in Figure 2, the
cutoff value is set to 0.860; other values may be chosen, e.g., 0.745 in Figure 3.
There are five valleys: the first two valleys are composed of sequences from
cytochrome B562; the third valley consists of sequences from glucokinase; the
fourth valley contains sequences from the GABAR family; the fifth valley
consists of sequences from the bacglobin family.
Four data sets were extracted from different publicly available protein repositories, as shown in Table 1:
two from Pfam, one from Swiss-Prot, and the remaining one from NCBI. Each protein sequence is
labeled according to the data set from which it originated; this labeling defines what we later use
as the “true” cluster to which a sequence belongs.
OPTICS CLUSTERING
EXPERIMENTAL RESULTS
Fig. 2 (data set 1)
COMPUTE DISTANCE
(Diagram labels: protein sequences; NCBI; Swiss-Prot; data selection, labeling,
reformatting; labeled data sets.)
Fig. 3 (data set 2)
Fig. 4 (data set 3)
Fig. 5 (data set 4)
There are three valleys: the first is composed of sequences from bac globin;
the second is composed of sequences from the band3 family; the third contains
only sequences from IGA1.
There are six valleys: the first and the last contain only cytoC sequences;
the second contains only sequences from GABAR; the third contains GAPDH
sequences; the fourth contains GPCR sequences; the fifth contains only GFAT.
There are four main valleys in Figure 5: the first valley contains only casein
kappa sequences; the second and third valleys contain only globins; the fourth
valley is composed of GAPDHs.
To judge the biological accuracy of the resulting cluster set, we need to
compare it to a “true” cluster set. However, there is no generally accepted
“true” cluster set. For this study, we assumed the original database clusters
are the “real” clusters, similar to the way most automatic protein clustering
methods are tested. For example, we assume the sequences from the glucokinase
family of Pfam are in the same cluster.
Based on this assumption, several statistics appear adequate to evaluate the
results. As our criterion, we used the Jaccard (similarity) coefficient, S,
defined as:
S = a/(a+b+c)
where a counts the “true positives,” i.e., the number of sequence pairs
clustered together in both sets; b counts the “false negatives,” i.e., the
number of sequence pairs clustered together in the true cluster set but not in
the current clustering solution; and c counts the “false positives,” i.e., the
number of sequence pairs clustered together in the current solution but not in
the true cluster set. The Jaccard similarity value lies between 0 and 1, with
bigger values signifying better clustering.
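The pair counting behind S can be sketched as below. The function name `jaccard_coefficient`, the label lists, and the convention of returning 1.0 when no pair is clustered together in either set are illustrative assumptions; how noise sequences should be labeled (for instance, each with its own unique label) is also a choice the sketch leaves to the caller.

```python
from itertools import combinations


def jaccard_coefficient(true_labels, predicted_labels):
    """S = a / (a + b + c) over all pairs of sequences."""
    a = b = c = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = predicted_labels[i] == predicted_labels[j]
        if same_true and same_pred:
            a += 1          # together in both sets
        elif same_true:
            b += 1          # together only in the true set
        elif same_pred:
            c += 1          # together only in the current solution
    total = a + b + c
    # Assumed convention: two all-singleton clusterings agree fully.
    return a / total if total else 1.0


# Hypothetical labels for five sequences.
true_labels = ["glucokinase"] * 3 + ["GABAR"] * 2
predicted = [0, 0, 1, 2, 2]
s_value = jaccard_coefficient(true_labels, predicted)
```

In this example two pairs are together in both sets and two pairs are split only by the predicted clustering, so a = 2, b = 2, c = 0.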
FUTURE WORK
(Diagram labels: data sets; SEQOPTICS; BAG; blastclust; compare the results by
Jaccard coefficient.)
We also clustered the same data sets with two existing clustering methods,
blastclust and BAG, using their default parameters, and compared our results
with theirs via the Jaccard coefficient. The comparison is shown in Table 2.
The table shows that SEQOPTICS agrees closely with each original cluster set
and, moreover, that it outperforms BAG and blastclust on all the data sets.
Blastclust, meanwhile, exceeds BAG in two cases and falls below it in the other
two. Overall, SEQOPTICS seems a promising method in terms of both clustering
quality and graphical representation.
According to this report, SEQOPTICS has proved its value for small data sets
(<1000 sequences). To apply the method to larger data sets, such as an entire
protein sequence database, several improvements would help make it more
effective. Specifically, we may:
1) use another distance measure for protein sequence distance, e.g., BLAST or FASTA;
2) apply parallel computing tools, for example, the Message Passing Interface (MPI);
3) implement different visualization techniques that accommodate data set size;
4) consider incremental cluster ordering schemes, since protein databases are rapidly growing in size.