A New Index-Based Parallel Algorithm for finding Longest Common
Subsequence in Multiple DNA Sequences
Dr. S.A.M Rizvi
Department of Computer Science, Jamia Millia Islamia, New Delhi
E-mail: [email protected]
Pankaj Agarwal
(Research Scholar, affiliated with Jamia Millia Islamia University, New Delhi)
E-mail: [email protected]
Abstract
This paper presents a new parallel algorithm for computing a Longest Common Subsequence (LCS) in multiple
DNA sequences. It uses a heuristic approach. Although a lot of research has been carried out on finding the LCS
of two or more given sequences of protein, DNA, RNA, etc., not many parallel methods exist for
finding the LCS of multiple sequences. In existing algorithms, the time complexity of finding the
LCS normally increases linearly with the number of sequences. This paper attempts to give an effective parallel
algorithm for finding the LCS of any given number of DNA sequences. The significance of this algorithm is that
its time complexity does not increase linearly with the number of sequences. The
algorithm can also be applied to protein sequences with the same effectiveness, though the number of
processors required will go up.
Keywords: Sequence Comparison, DNA Sequences, Parallel Algorithm, Multiple Longest Common
Subsequence, Space and time complexity
Introduction
Sequence comparison is an important tool for researchers in molecular biology, as it helps to relate
molecular structure and function to the underlying sequence. Sequence comparison can be used to study
the relationships between sequences in sets of more than two sequences. This application is particularly
useful when studying the relationships between similar types of gene product expressed by different
organisms, such as CFTR sequences from several different species, or when studying similar, yet
divergent, sequences within the same organism, such as the variance in troponin I isoforms in Homo sapiens [8].
Biological sequence data mainly consists of DNA and protein sequences, which can be treated as strings
over a fixed alphabet of characters. In this paper, we consider the comparison of two or more DNA
sequences with the aim of finding the Longest Common Subsequence. The Multiple Longest Common
Subsequence (MLCS) problem can be defined as follows:
Given sequences {S1, S2, S3, ..., Sn} of DNA or protein, where the sequences Si may be of the same or
of different lengths (here we consider sequences of the same length), find a subsequence common to
all the sequences that has maximum length. There can be more than one common subsequence of maximum
length; our algorithm returns one such subsequence.
For example, consider three DNA sequences:
S1=T C C A T A G T C
S2=A G C T A A T A G
S3=G G A T T A G C T
One LCS of the above three sequences is ‘ATAG’, of length four. Another LCS is
‘TTAG’.
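The example above can be checked mechanically. The following is a minimal sketch, not part of the original method; the helper names `is_subsequence` and `is_common_subsequence` are hypothetical and introduced only for illustration.

```python
from typing import List


def is_subsequence(candidate: str, sequence: str) -> bool:
    """Return True if candidate can be obtained from sequence by deleting characters."""
    remaining = iter(sequence)
    # `ch in remaining` consumes the iterator up to the first match,
    # so the characters of candidate must appear in the same order.
    return all(ch in remaining for ch in candidate)


def is_common_subsequence(candidate: str, sequences: List[str]) -> bool:
    """Return True if candidate is a subsequence of every given sequence."""
    return all(is_subsequence(candidate, s) for s in sequences)


sequences = ["TCCATAGTC", "AGCTAATAG", "GGATTAGCT"]
for candidate in ("ATAG", "TTAG"):
    print(candidate, is_common_subsequence(candidate, sequences))
```

Such a check only confirms that a string is common to all sequences; it does not by itself show that no longer common subsequence exists.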
Often, a primary focus of a Multiple Sequence Alignment is to identify, within several related sequences,
regions that are highly conserved in identity or similarity, and therefore probably have functional and/or
structural significance. Many factors affect the analysis of conserved regions within related sequences, such
as the number of sequences included in the analysis, and the ratio of the number of very similar (almost
identical) sequences to the number of more distantly related sequences. Divergent sequences can cause
problems in a Multiple Sequence Alignment. It is more difficult to identify the correct alignment when two
sequences that are related throughout part of the sequence also contain large sections that diverge.
Therefore, the error rate of the alignment increases as divergence increases. These alignment errors
can cause the related parts of the sequences to show lower similarity than they actually have, and this sort of
error is often amplified in subsequent steps.
Background and Related work
Several researchers have explored sequence comparison algorithms [12,13], culminating in the solution of
a variety of sequence comparison problems, including subsequence matching, in O(mn) time and space.
Using the technique of Hirschberg [2], developed in the context of the Longest Common Subsequence
problem, Myers [5] and Miller presented a technique that reduces the space requirement of sequence
matching to the optimal O(m+n) while retaining a time complexity of O(mn). Huang [11] extended this
algorithm to subsequence matching. These algorithms are very important because biological
sequences can be long enough to render algorithms that use quadratic space infeasible.
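To illustrate why linear space is attainable, the sketch below computes only the length of an LCS of two strings while keeping two rows of the dynamic programming table. It is an illustrative assumption of how such a computation may look, not the algorithm of any cited paper; recovering the subsequence itself in linear space needs the additional divide-and-conquer step of Hirschberg's technique, which is not shown. The function name `lcs_length` is hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of a longest common subsequence of a and b.

    Time O(len(a) * len(b)); space O(min(len(a), len(b))), because only
    the previous and the current row of the table are kept.
    """
    if len(b) > len(a):
        a, b = b, a  # keep the shorter string along the row
    previous = [0] * (len(b) + 1)
    for ch_a in a:
        current = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # a match extends the best result for both shorter prefixes
                current.append(previous[j - 1] + 1)
            else:
                # otherwise drop the last character of one of the prefixes
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


print(lcs_length("ABCBDAB", "BDCABA"))
```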
While space-optimal algorithms make the comparison of large sequences feasible, the quadratic time
requirement still makes it a time-consuming process. A natural approach is to reduce the time requirement
through parallel processing. Edmiston et al. [9] present parallel algorithms for sequence and
subsequence matching that achieve linear speedup and can use up to O(min(m,n)) processors. Lander et al.
[10] discuss an implementation on a data-parallel computer. These algorithms store the entire dynamic
programming table. Huang [11] presented a space-efficient algorithm for sequence alignment that is
time optimal only when the sequences have equal size.
A lot of work has been done on implementing the dynamic programming approach, mostly for the case
of two sequences.
Year   Author(s)         Time                   Ref.
1974   Wagner, Fischer   O(n^2)                 [1]
1975   Hirschberg        O(n^2)                 [2]
1980   Masek, Paterson   O(n^2 / log n)         [3]
1984   Hsu, Du           O(pn log(n/p) + pm)    [4]
1986   Myers             O(n(n-p))              [5]
1992   Apostolico        O(n(n-p))              [6]
1994   Rick              O(ns + min(ms, pn))    [7]
1999   Claus             O(min(rm, r(n-r)))
Fig 1. Selected results for the LCS problem for two sequences. Here ‘p’ is the length of the LCS, ‘s’ is the
size of the alphabet, and ‘m’ is the number of minimal matches. Note that we are assuming sequences of
equal length.
These methods are not applicable to three or more sequences because of their time and space complexity.
Implementing the dynamic programming method of constructing the score matrix would lead to O(n^d)
time and space complexities for d sequences of length n. A lot of successful work has been done on
improving both time and space complexities for two sequences. However, for multiple sequences (more
than two), the problem is still wide open.
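The growth of the dynamic programming table with the number of sequences can be seen in a direct implementation. The sketch below is a plain memoized recursion over tuples of prefix lengths, given only for illustration and not taken from the cited works; it creates up to one state per tuple of positions, which is where the exponential dependence on the number of sequences comes from. The name `mlcs_exact` is hypothetical.

```python
from functools import lru_cache
from typing import Sequence, Tuple


def mlcs_exact(sequences: Sequence[str]) -> str:
    """Return one longest common subsequence of all given sequences.

    One memoized state per tuple of prefix lengths, so the number of
    states grows as n^d for d sequences of length n.
    """
    seqs = tuple(sequences)

    @lru_cache(maxsize=None)
    def solve(lengths: Tuple[int, ...]) -> str:
        # lengths[i] is the length of the prefix of seqs[i] under consideration
        if any(p == 0 for p in lengths):
            return ""
        last_chars = {s[p - 1] for s, p in zip(seqs, lengths)}
        if len(last_chars) == 1:
            # every prefix ends in the same character, so it extends the LCS
            return solve(tuple(p - 1 for p in lengths)) + last_chars.pop()
        # otherwise shorten one prefix at a time and keep the best result
        best = ""
        for i in range(len(lengths)):
            shorter = lengths[:i] + (lengths[i] - 1,) + lengths[i + 1:]
            candidate = solve(shorter)
            if len(candidate) > len(best):
                best = candidate
        return best

    return solve(tuple(len(s) for s in seqs))


print(mlcs_exact(["TCCATAGTC", "AGCTAATAG", "GGATTAGCT"]))
```

For three sequences of length nine this is still cheap, but each additional sequence multiplies the number of states by the sequence length.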
Some popular Methods/Algorithms for Multiple Alignments are as follows:
Global Multiple Alignment Methods
1. ClustalW : http://npsa-pbil.ibpc.fr/cgi-bin/npsa_automat.pl?page=npsa_clustalW.html
2. MSA(limited to small number of sequences) :
http://www.ibc.wusti.edu/ibc/msa.html
3. PRALIGN : http://mathbio.nimr.mrc.ac.uk/~jhering/pralign
Iterative Methods:
1. DIALIGN: http://www.gsf.de/biodv/dialing.html
2. MULTIALIGN :http://protein.toulouse.inra.fr/multialign.html
Local Alignment methods :
1. BLOCKS: http://blocks.fhcrc.org/blocks
2. HMMER: http://hmmer.wustl.edu
3. MEME: http://meme.sdsc.edu/meme
4. SAM: http://cse.ucsc.edu/research/compbio/sam.html
Other Popular Programs:
1. T-Coffee: http://igs-server.cnrs-mrs.fr/Toffee
2. MUSCLE : http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py
Some Parallel programs :
1. Parallel PRRN : Multiple sequence alignment by the best-first iterative refinement
strategy with tree-dependent partitioning : http://align.genome.jp/prrn/prrn_help.html
2. Multiple Sequence Alignment Using Parallel Genetic Algorithms:
http://portal.acm.org/citation.cfm
3. Multiple Sequence Alignment by Genetic Algorithm :
http://www.icot.or.jp/ARCHIVE/Museum/IFS/abst/094.html
4. PileUp: PileUp is a multiple alignment program from the GCG program suite. It can be
accessed through GCG's Seqweb web interface.
5. T-COFFEE: T-COFFEE is available on web and helix (in /usr/localapps/msa/TCOFFEE).
6. PRRP: The PRRP program uses a randomized iterative strategy for multiple sequence
alignment. To use it: /usr/localapps/msa/prrp/bin/prrp
7. SAM: The SAM program is a collection of flexible software tools for creating,
refining, and using linear hidden Markov models for biological sequence analysis.
cd /usr/localapps/msa/sam/bin/;target99 -seed inputsequencefile -out outputfile
8. SAGA: SAGA uses genetic algorithms for multiple sequence alignment. It is installed in
/usr/localapps/msa/SAGA_V0.95.
Our method
Our method uses a heuristic approach to determine the LCS of multiple DNA sequences. Even if the
algorithm does not return the LCS on the first attempt, the probability of finding the LCS on the second
attempt is quite high. Once the appropriate starting set is found, the algorithm is sure to return the LCS.
We have used a global alignment algorithm. The most significant feature of the algorithm is that its time
complexity does not increase linearly with the number of sequences.
Consider, for example, three DNA sequences of equal length (though the sequences may be of different
lengths), as given below.
     1  2  3  4  5  6  7  8  9  10 11 12
S1 = A  C  T  T  C  A  G  G  C  T  A  A
S2 = T  T  C  T  A  A  A  G  C  A  A  T
S3 = A  A  C  C  G  T  A  T  T  C  G  A
Fig. 2
1) We consider four buckets (each can be a two-dimensional array or some other data structure), one
assigned to each nucleotide A, C, T and G. The index numbers of the nucleotides are collected in their
corresponding buckets, as shown in Fig. 3 for the above three sequences. We assume a four-processor
system corresponding to the four buckets; the processing within a bucket is carried out by its
corresponding processor. Our parallel system uses shared as well as local memory for each processor.
Shared memory maintains three global variables, namely I, J and K, while each processor keeps three
local variables, namely X, Y and Z, in its local memory. In addition, our system has a master processor
responsible for managing interprocessor communication.
Bucket   Sequence   Index numbers
BA       S1         1    6    11   12   0
         S2         5    6    7    10   11
         S3         1    2    7    12   0
BC       S1         2    5    9
         S2         3    9    0
         S3         3    4    10
BT       S1         3    4    10   0
         S2         1    2    4    12
         S3         6    8    9    0
BG       S1         7    8
         S2         8    0
         S3         5    11
Fig. 3 The index numbers of the sequences {S1, S2, S3} in their respective buckets (a 0 entry denotes
an empty slot).
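The bucket construction of step 1 can be sketched as follows. This is a minimal sequential illustration under stated assumptions: the function name `build_buckets` is hypothetical, and ragged Python lists are used instead of a fixed-size array padded with 0.

```python
from typing import Dict, List, Sequence


def build_buckets(sequences: Sequence[str], alphabet: str = "ACTG") -> Dict[str, List[List[int]]]:
    """buckets[x][i] lists the 1-based positions of nucleotide x in sequences[i].

    Positions are appended during a left-to-right scan, so each list is
    already sorted in increasing order.
    """
    buckets: Dict[str, List[List[int]]] = {x: [[] for _ in sequences] for x in alphabet}
    for i, seq in enumerate(sequences):
        for position, nucleotide in enumerate(seq, start=1):
            buckets[nucleotide][i].append(position)
    return buckets


sequences = ["ACTTCAGGCTAA", "TTCTAAAGCAAT", "AACCGTATTCGA"]
buckets = build_buckets(sequences)
for x, rows in buckets.items():
    print("B" + x, rows)
```

Each sequence is scanned exactly once, so the preprocessing cost is proportional to the total length of the input.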
2) To find the starting point of the MLCS, a formula-based approach is used that takes into account the
distance of the nucleotides from the starting index and the differences between the index numbers of the
nucleotides in their corresponding sequences.
X = Bx[r][S1] + Bx[r][S2] + Bx[r][S3]                           (1)
where x = A, C, T or G, r is the row number and Si is the sequence (here S1, S2, S3).
Y = | Bx[r][S1] - Bx[r][S2] | + | Bx[r][S2] - Bx[r][S3] |       (2)
where |t| is the absolute value of t.
Z = X + Y
Note: A similar approach can be used to define the formulas for more than three sequences.
These calculations are done at each processor, and the values X, Y and Z are stored in the local
memory of each processor. Thus we obtain four values of Z, say ZA, ZC, ZT and ZG, one per
processor. We take the minimum of the Z values:
Starting Nucleotide Set = nucleotide corresponding to min{ ZA, ZC, ZT, ZG }
The initial values of the global variables are set to the values of the Starting Nucleotide Set. For the
above example, the processors have the following initial value sets:
A[1,5,1], C[2,3,3], T[3,1,6], G[7,8,5]
The calculation at the first bucket BA for the value set [1,5,1], according to formula (1), is
XA = BA[1][S1] + BA[1][S2] + BA[1][S3]
   = 1 + 5 + 1 = 7
Similarly, the values of XC, XT and XG are
XC = 2 + 3 + 3 = 8
XT = 3 + 1 + 6 = 10
XG = 7 + 8 + 5 = 20
The calculation of YA at BA for the value set [1,5,1], according to formula (2), is
YA = | BA[1][S1] - BA[1][S2] | + | BA[1][S2] - BA[1][S3] |
   = | 1-5 | + | 5-1 |
   = 8
Similarly, the values of YC, YT and YG are
YC = | 2-3 | + | 3-3 | = 1
YT = | 3-1 | + | 1-6 | = 7
YG = | 7-8 | + | 8-5 | = 4
Thus the values of ZA, ZC, ZT and ZG are
ZA = XA + YA = 15
ZC = XC + YC = 9
ZT = XT + YT = 17
ZG = XG + YG = 24
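The arithmetic above can be reproduced with a few lines of code. The sketch below is illustrative only; the function name `score` is hypothetical, and the extension of formula (2) to more than three sequences as a sum of differences between adjacent sequences is an assumption based on the note that a similar approach applies.

```python
from typing import Sequence


def score(index_set: Sequence[int]) -> int:
    """Z = X + Y for one candidate index set.

    X is the sum of the indices (formula 1); Y is the sum of absolute
    differences between indices of adjacent sequences (formula 2).
    """
    x = sum(index_set)
    y = sum(abs(a - b) for a, b in zip(index_set, index_set[1:]))
    return x + y


# Initial value sets of the four buckets in the worked example
initial_sets = {"A": [1, 5, 1], "C": [2, 3, 3], "T": [3, 1, 6], "G": [7, 8, 5]}
z_values = {x: score(s) for x, s in initial_sets.items()}
start = min(z_values, key=z_values.get)
print(z_values, "start nucleotide:", start)
```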
The nucleotide corresponding to the minimum of the Z values is taken as the Start Set.
Here the start nucleotide from which the search begins is ‘C’, which has the minimum value ZC,
and the corresponding initial Start Set is [2, 3, 3], which sets the initial values of I, J, K to I=2, J=3 and
K=3. We record the first element of the LCS as ‘C’.
LCS = C
3) Now in each bucket we look for the next higher values with respect to [2,3,3], resulting in the following
values at the respective buckets.
For bucket BA, the next higher value with respect to 2 in sequence S1 is 6; similarly, the next higher values
with respect to 3 and 3 in sequences S2 and S3 are 5 and 7 respectively.
Therefore the next set for bucket BA becomes [6, 5, 7].
Similarly, the next sets for buckets BC, BT and BG are as follows:
BC = [5, 9, 4]
BT = [3, 4, 6]
BG = [7, 8, 5]
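Because the index numbers in each bucket are sorted, the search for the next higher value can be done with a binary search. The sketch below uses Python's standard `bisect` module; the function name `next_set` is hypothetical, and the bucket contents are copied from Fig. 3 without the 0 padding.

```python
from bisect import bisect_right
from typing import List, Optional, Sequence


def next_set(bucket: Sequence[Sequence[int]], current: Sequence[int]) -> Optional[List[int]]:
    """For each sequence, pick the smallest index strictly greater than the current one.

    Returns None if some sequence has no such index, in which case the
    nucleotide of this bucket can no longer extend the LCS.
    """
    result = []
    for positions, last in zip(bucket, current):
        k = bisect_right(positions, last)  # first position greater than last
        if k == len(positions):
            return None
        result.append(positions[k])
    return result


BA = [[1, 6, 11, 12], [5, 6, 7, 10, 11], [1, 2, 7, 12]]
BC = [[2, 5, 9], [3, 9], [3, 4, 10]]
BT = [[3, 4, 10], [1, 2, 4, 12], [6, 8, 9]]
BG = [[7, 8], [8], [5, 11]]

for name, bucket in (("BA", BA), ("BC", BC), ("BT", BT), ("BG", BG)):
    print(name, next_set(bucket, [2, 3, 3]))
```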
To find the next nucleotide in the LCS we repeat step 2, resulting in the following values of ZA, ZC, ZT
and ZG:
ZA = XA + YA = 18 + 3 = 21
ZC = XC + YC = 18 + 9 = 27
ZT = XT + YT = 13 + 3 = 16
ZG = XG + YG = 20 + 4 = 24
We take the minimum of the Z values, i.e. nucleotide T corresponding to ZT, and record T as the next
element of the solution:
LCS = C T
The corresponding values of I, J, K become [3,4,6], and we repeat steps 2 and 3 until we reach the end of
some sequence or the algorithm is unable to move further. The remaining iterations are shown in the
following table, with the corresponding values of I, J, K.
[I J K]     Bucket   Next set       X     Y          Z
[3 4 6]     A        [6 5 7]        18    1+2=3      21 [min]
            C        [5 9 10]       24    4+1=5      29
            T        [4 12 8]       24    8+4=12     36
            G        [7 8 11]       26    1+3=4      30
[6 5 7]     A        [11 6 12]      29    5+6=11     40
            C        [9 9 10]       28    1          29 [min]
            T        [10 12 8]      30    2+4=6      36
            G        [7 8 11]       26    1+3=4      30
[9 9 10]    A        [11 10 12]     No need to calculate Z, as all other searches are exhausted.
                                    Nucleotide A is added directly to the LCS.
            C        Search ends, as there is no entry in S2 greater than 9.
            T        Search ends, as there is no entry in S3 greater than 10.
            G        Search ends.
Fig. 4 The iterations of the algorithm, with the corresponding values of I, J and K
The final LCS is obtained as
LCS = “C T A C A” with length 5.
Algorithm
Parallel_MSLCS( BA, BC, BT, BG )
// We consider three DNA sequences S1, S2 and S3 for the given algorithm.
// I, J, K are shared global variables, and Xm, Ym, Zm are local variables of the four processors,
// where m = A, C, T or G.
For each bucket Bm, the corresponding processor Pm calculates, with the initial value r = 1:
1) Xm = Bm[r][S1] + Bm[r][S2] + Bm[r][S3]
2) Ym = | Bm[r][S1] - Bm[r][S2] | + | Bm[r][S2] - Bm[r][S3] |
3) Zm = Xm + Ym
4) Set Zmin = MIN(ZA, ZC, ZT, ZG)
5) Append the nucleotide corresponding to Zmin to the solution set LCS, building it from left to
right.
6) Set I, J, K to the index values of S1, S2, S3 in the set that produced Zmin.
7) Determine the next higher values in each bucket with respect to I, J, K for the corresponding
sequences S1, S2, S3. If in some bucket no such value is found for even a single Si,
then neglect the corresponding nucleotide.
8) Repeat steps 1 to 7 until the end of some sequence is reached or the algorithm is unable to move
further. The value of r may vary for each bucket, depending on the next higher values corresponding
to I, J, K in the respective buckets.
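The steps above can be simulated sequentially to check the worked example. The following is a minimal sketch under stated assumptions, not the authors' parallel implementation: the loop over the alphabet stands in for the four processors, ties between equal Z values are broken by alphabet order (the text does not specify a rule), every character of the input is assumed to belong to the alphabet, and the name `heuristic_mlcs` is hypothetical.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Sequence


def heuristic_mlcs(sequences: Sequence[str], alphabet: str = "ACTG") -> str:
    """Sequential simulation of the index-based heuristic for a common subsequence."""
    # Preprocessing: buckets[x][i] = sorted 1-based positions of x in sequences[i]
    buckets: Dict[str, List[List[int]]] = {x: [[] for _ in sequences] for x in alphabet}
    for i, seq in enumerate(sequences):
        for pos, ch in enumerate(seq, start=1):
            buckets[ch][i].append(pos)

    def next_set(x: str, current: Sequence[int]) -> Optional[List[int]]:
        # Step 7: next higher index in every sequence, or None if one is missing
        result = []
        for positions, last in zip(buckets[x], current):
            k = bisect_right(positions, last)
            if k == len(positions):
                return None
            result.append(positions[k])
        return result

    def score(s: Sequence[int]) -> int:
        # Steps 1 to 3: Z = X + Y
        return sum(s) + sum(abs(a - b) for a, b in zip(s, s[1:]))

    current = [0] * len(sequences)  # before the first position of every sequence
    lcs = []
    while True:
        # In the parallel setting each bucket would be handled by its own processor
        candidates = {}
        for x in alphabet:
            s = next_set(x, current)
            if s is not None:
                candidates[x] = s
        if not candidates:
            break
        best = min(candidates, key=lambda x: score(candidates[x]))  # steps 4 and 5
        lcs.append(best)
        current = candidates[best]  # step 6
    return "".join(lcs)


print(heuristic_mlcs(["ACTTCAGGCTAA", "TTCTAAAGCAAT", "AACCGTATTCGA"]))
```

Starting from [0, 0, 0] makes the first iteration select the first entry of every bucket, which corresponds to the initial value r = 1 in the pseudocode.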
Analysis
Filling the entries of the buckets will take a considerable amount of time, depending
on the number of sequences, as each sequence has to be scanned in full by the master processor. Thus the
time taken can be given as O(n_S1 + n_S2 + ... + n_Sk), where n_Si is the length of sequence Si and
i = 1…k. But this is not a serious issue, as this work has to be done only once, at start time. Once
sequence data is accumulated, changes are hardly ever made to it; new sequences are only added, without
disturbing the previous sequences. Thus we have neglected the above time in analyzing the algorithm.
Each processor takes a constant amount of time to calculate the values of X, Y, Z, I, J, K, as these are
formula based. In determining the next set at each iteration of the algorithm, the processors again take
a constant amount of time, as they are only required to search for the next higher values in the sequences
with respect to the previous set. This is not a problem, as the index numbers of the sequences are kept
sorted within the buckets for the respective sequences.
Thus the total time taken can be given as:
Time complexity = MAX(KA, KC, KT, KG) + d = O(K)
where d is the interprocessor communication delay time and Ki is the time taken by the ith processor.
The value of K (a constant) depends on the processor speed and the interprocessor communication delay
time.
Conclusion and Future research
There are not many parallel approaches for determining the LCS of multiple sequences of DNA, protein,
etc. This paper has given a time-efficient method for solving the MLCS problem. It is a novel and very
simple approach. We feel that there is still some scope for improving the time and space complexity by
following our approach. When our algorithm is applied to protein sequences, the required resources in
terms of processors go up significantly. In future we will try to find a solution that allows the same
approach to be applied to protein sequences without increasing the resources significantly, while
maintaining the time complexity.
References
1) Wagner, R. A., and M. J. Fischer, “The string to string correction problem,” J. ACM, Vol. 21, No. 1,
1974, pp. 168-173.
2) Hirschberg, D. S., “A linear space algorithm for computing maximal common subsequences”. C. J. Kaufman,
Rocky Mountain Research Laboratories, Boulder, CO, 1992.
3) Masek, W. J., and M. S. Paterson, “A faster algorithm computing string edit distances,” JCSS, 1980,
pp. 18-31.
4) Hsu, W. J., and M. W. Du, “Computing a longest common subsequence for a set of strings,” BIT,
Vol. 24, 1984, pp. 45-59.
5) E. W. Myers, “An O(nd) Difference algorithm and its variations,” Algorithmica, vol. 1, 1986, pp. 251-266.
6) Apostolico, A., and C. Guerra, “The longest common subsequence problem revisited,” Algorithmica,
vol. 2, 1987, pp. 315-336.
7) Claus Rick, “New Algorithms for the longest common subsequence problem”, Research Report No. 85123CS,
University of Bonn, October 1994.
8) Aluru, S., and N. Futamura, “Parallel biological sequence comparison using prefix computations”,
J. Parallel Distrib. Comput. 63 (2003) 264-272.
9) E.W. Edmiston, N.G. Core, J.H. Saltz, R.M. Smith, Parallel processing of biological sequence comparison
algorithms, Internat. J. Parallel Programming 17(3) (1988) 259-275.
10) E. Lander, J.P. Mesirov, W. Taylor, Protein sequence comparison on a data parallel computer, Proceedings
of the International Conference on Parallel Processing, 1988, pp. 257-263.
11) X. Huang, A space-efficient Parallel Sequence Comparison algorithm for a message-passing multiprocessor,
Internat. J. Parallel Programming 18(3) (1989) 223-239.
12) S.B. Needleman, C.D. Wunsch, A general method applicable to the search of similarities in the amino acid
sequences of two proteins, J. Molec. Biol. 48 (1970) 443-453.
13) T.F. Smith, M.S. Waterman, Identification of common molecular subsequences, J. Molec. Biol. 147 (1981)
195-197.
14) Torres M, Gold A and Barrera J (2003) A parallel algorithm for numerating combinations. In:
The 2003 International Conference on Parallel Processing Proceedings. (IEEE Computer Society
Press). Taiwan, pp 1:581-588
15) Y. Totoki, Y. Akiyama, K. Onizuka, T. Noguchi, M. Saito, and M. Ando :"Employing A*
algorithm in Parallel Multiple Protein Sequence Alignment",IPSJ SIG Notes, 97-MPS-16-4,
pp.19-24 (1997).
16) M. Hirosawa, Y. Totoki, M. Hoshida, and M. Ishikawa :
"Comprehensive Study on Iterative Algorithms of Multiple Sequence Alignment",
Comput. Applic. Biosci., Vol.11, No.1, pp.13-18 (1995).
17) Katoh,K., Misawa,K., Kuma,K. and Miyata,T. (2002) MAFFT: a novel method for rapid multiple
sequence alignment based on fast Fourier transform. Nucleic Acids Res., 30, 3059-3066.
18) Brudno,M., Do,C.B., Cooper,G.M., Kim,M.F., Davydov,E., Green,E.D., Sidow,A. and
Batzoglou,S. (2003) LAGAN and Multi-LAGAN: efficient tools for large-scale multiple
alignment of genomic DNA. Genome Res., 13, 721-731.
19) Van Walle,I., Lasters,I. and Wyns,L. (2004) Align-m–a new algorithm for multiple alignment of
highly divergent sequences. Bioinformatics, DOI: 10.1093/bioinformatics/bth116. Nucleic Acids
Research, 2004, Vol. 32, No. 5 1797.
