A New Index-Based Parallel Algorithm for finding Longest Common Subsequence in Multiple DNA Sequences

Dr. S.A.M Rizvi
Department of Computer Science, Jamia Millia Islamia, New Delhi
E-mail: [email protected]

Pankaj Agarwal
(Research Scholar, affiliated with Jamia Millia Islamia University, New Delhi)
E-mail: [email protected]

Abstract

This paper presents a new parallel algorithm for computing a Longest Common Subsequence (LCS) in multiple DNA sequences. It uses a heuristic approach. Although a lot of research has been carried out on finding the LCS of two or more given sequences of protein, DNA, RNA etc., not many parallel methods exist for finding the LCS of multiple sequences. In existing algorithms, the time complexity of finding the LCS normally increases linearly with the number of sequences. This paper is an attempt to give an effective parallel algorithm for finding the LCS of any given number of DNA sequences. The significance of this algorithm is that its time complexity does not increase linearly with the number of sequences. The algorithm can also be applied to protein sequences with the same effectiveness, though the number of processors required will go up.

Keywords: Sequence Comparison, DNA Sequences, Parallel Algorithm, Multiple Longest Common Subsequence, Space and time complexity

Introduction

Sequence comparison is an important tool for researchers in molecular biology, as it helps to relate molecular structure and function to the underlying sequence. Sequence comparison can be used to study the relationships between sequences in sets of more than two sequences. This application is particularly useful when studying the relationships between similar types of gene products that are expressed by different organisms, such as when analyzing CFTR sequences from several different species, or when studying similar, yet divergent, sequences within the same organism, such as the variance in troponin I isoforms in Homo sapiens [8].
Biological sequence data mainly consists of DNA and protein sequences, which can be treated as strings over a fixed alphabet of characters. In this paper, we consider the comparison of two or more DNA sequences with the aim of finding the Longest Common Subsequence. The Multiple Longest Common Subsequence (MLCS) problem can be defined as follows: given sequences {S1, S2, S3, …, Sn} of DNA or protein, where the sequences Si can be of the same or of variable length (here we consider sequences of the same length), find a subsequence of maximum length that is common to all the sequences. Although there can be more than one subsequence of the largest length, our algorithm returns one such subsequence. For example, consider three DNA sequences:

S1 = T C C A T A G T C
S2 = A G C T A A T A G
S3 = G G A T T A G C T

One LCS of the above three sequences is 'ATAG', of length four. Another LCS is 'TTAG'.

Often, a primary focus of a multiple sequence alignment is to identify, within several related sequences, regions that are highly conserved in identity or similarity, and therefore probably have functional and/or structural significance. Many factors affect the analysis of conserved regions within related sequences, such as the number of sequences included in the analysis, and the ratio of the number of very similar (almost identical) sequences to the number of more distantly related sequences. Divergent sequences can cause problems in a multiple sequence alignment. It is more difficult to identify the correct alignment when two sequences that are related throughout part of the sequence also contain large sections that diverge. Therefore, the error rate of the alignment increases as divergence increases. These errors can cause the related parts of the sequences to show lower similarity than they actually have, and this sort of error is often amplified in subsequent steps.
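The definition of a common subsequence used in the problem statement above can be checked mechanically. The following is a minimal Python sketch, not part of the proposed algorithm; the function names `is_subsequence` and `is_common_subsequence` are illustrative choices. It only verifies that a candidate is common to all sequences and says nothing about whether the candidate is of maximum length.

```python
from typing import Sequence


def is_subsequence(candidate: str, sequence: str) -> bool:
    """Return True if `candidate` can be obtained from `sequence`
    by deleting zero or more characters without reordering."""
    remaining = iter(sequence)
    # `ch in remaining` consumes the iterator up to the first match,
    # so later characters are only searched for after earlier ones.
    return all(ch in remaining for ch in candidate)


def is_common_subsequence(candidate: str, sequences: Sequence[str]) -> bool:
    """Return True if `candidate` is a subsequence of every sequence."""
    return all(is_subsequence(candidate, s) for s in sequences)


# The three example sequences S1, S2, S3 from the text.
sequences = ["TCCATAGTC", "AGCTAATAG", "GGATTAGCT"]
print(is_common_subsequence("ATAG", sequences))
print(is_common_subsequence("TTAG", sequences))
```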
Background and Related work

Several researchers have explored sequence comparison algorithms [12,13], culminating in the solution of a variety of sequence comparison problems, including subsequence matching in O(mn) time and space. Using the technique of Hirschberg [2], developed in the context of the Longest Common Subsequence problem, Myers [5] and Miller presented a technique to reduce the space requirement of sequence matching to the optimal O(m+n), while retaining a time complexity of O(mn). Huang [11] extended this algorithm to subsequence matching. These algorithms are very important because biological sequences can be long enough to render algorithms that use quadratic space infeasible. While space-optimal algorithms make the comparison of large sequences feasible, the quadratic time requirement still makes it a time-consuming process. A natural approach is to reduce the time requirement through parallel processing. Edmiston et al. [9] present parallel algorithms for sequence and subsequence matching that achieve linear speedup and can use up to O(min(m,n)) processors. Lander et al. [10] discuss an implementation on a data parallel computer. These algorithms store the entire dynamic programming table. Huang [11] presented a space-efficient algorithm for sequence alignment that is time optimal only when the sequences have equal size. A lot of work has been done on implementing the dynamic programming approach, mostly for the case of two sequences.

Year   Author(s)         Time                     Ref.
1974   Wagner, Fischer   O(n^2)                   [1]
1975   Hirschberg        O(n^2)                   [2]
1980   Masek, Paterson   O(n^2 / log n)           [3]
1984   Hsu, Du           O(pn log(n/p) + pm)      [4]
1986   Myers             O(n(n-p))                [5]
1992   Apostolico        O(n(n-p))                [6]
1994   Rick              O(ns + min(ms, pn))      [7]
1999   Claus             O(min(rm, r(n-r)))

Fig. 1 Selected results for the LCS problem for two sequences. Here 'p' is the length of the LCS, 's' is the size of the alphabet, and 'm' is the number of minimal matches. Note that we are assuming sequences of equal lengths.
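As a point of reference for the quadratic-time dynamic programming approach discussed above, the following Python sketch computes the LCS length of two sequences in O(mn) time while keeping only two rows of the table. It is a textbook-style illustration written for this section, not code from any of the cited papers, and the name `lcs_length` is an assumption. It returns only the length; recovering the subsequence itself in linear space needs the divide-and-conquer technique attributed to Hirschberg [2].

```python
def lcs_length(a: str, b: str) -> int:
    """Length of a longest common subsequence of `a` and `b`.

    Time is O(len(a) * len(b)); space is O(min(len(a), len(b)))
    because only the previous and current table rows are kept.
    """
    # Keep the shorter string along the row to minimise memory.
    if len(b) > len(a):
        a, b = b, a
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the best subsequence that ends before both characters.
                curr.append(prev[j - 1] + 1)
            else:
                # Drop one character from either string, keep the better result.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


print(lcs_length("ABCBDAB", "BDCABA"))
```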
These methods are not applicable to three or more sequences because of their time and space complexity. Implementing the dynamic programming method for constructing the score matrix would lead to O(n^d) time and space complexities for d sequences. A lot of successful work has been done on improving both time and space complexities for two sequences. However, for multiple sequences (more than two sequences) the research question is still wide open. Some popular methods/algorithms for multiple alignment are as follows:

Global Multiple Alignment Methods:
1. ClustalW: http://npsa-pbil.ibpc.fr/cgi-bin/npsa_automat.pl?page=npsa_clustalW.html
2. MSA (limited to a small number of sequences): http://www.ibc.wusti.edu/ibc/msa.html
3. PRALIGN: http://mathbio.nimr.mrc.ac.uk/~jhering/pralign

Iterative Methods:
1. DIALIGN: http://www.gsf.de/biodv/dialing.html
2. MULTIALIGN: http://protein.toulouse.inra.fr/multialign.html

Local Alignment Methods:
1. BLOCKS: http://blocks.fhcrc.org/blocks
2. HMMER: http://hmmer.wustl.edu
3. MEME: http://meme.sdsc.edu/meme
4. SAM: http://cse.ucsc.edu/research/compbio/sam.html

Other Popular Programs:
1. T-Coffee: http://igs-server.cnrs-mrs.fr/Toffee
2. MUSCLE: http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py

Some Parallel Programs:
1. Parallel PRRN: Multiple sequence alignment by the best-first iterative refinement strategy with tree-dependent partitioning: http://align.genome.jp/prrn/prrn_help.html
2. Multiple Sequence Alignment Using Parallel Genetic Algorithms: http://portal.acm.org/citation.cfm
3. Multiple Sequence Alignment by Genetic Algorithm: http://www.icot.or.jp/ARCHIVE/Museum/IFS/abst/094.html
4. PileUp: PileUp is a multiple alignment program from the GCG program suite. It can be accessed through GCG's Seqweb web interface.
5. T-COFFEE: T-COFFEE is available on web and helix (in /usr/localapps/msa/TCOFFEE).
6. PRRP: The PRRP program uses a randomized iterative strategy for multiple sequence alignment.
To use it: /usr/localapps/msa/prrp/bin/prrp
7. SAM: The SAM program is a collection of flexible software tools for creating, refining, and using linear hidden Markov models for biological sequence analysis. cd /usr/localapps/msa/sam/bin/;target99 -seed inputsequencefile -out outputfile
8. SAGA: SAGA uses genetic algorithms for multiple sequence alignment. It is installed in /usr/localapps/msa/SAGA_V0.95.

Our method

Our method uses a heuristic approach for determining the LCS of multiple DNA sequences. Even if the algorithm does not return the LCS on the first attempt, the probability of finding the LCS on the second attempt is quite high. Once the appropriate starting set is found, the algorithm is sure to return the LCS. We have used a global alignment algorithm. The most significant property of the algorithm is that its time complexity does not increase linearly with the number of sequences. Consider, for example, the three DNA sequences of equal length given below (the sequences may also be of variable length).

      1  2  3  4  5  6  7  8  9  10 11 12
S1 =  A  C  T  T  C  A  G  G  C  T  A  A
S2 =  T  T  C  T  A  A  A  G  C  A  A  T
S3 =  A  A  C  C  G  T  A  T  T  C  G  A

Fig. 2

1) We consider four buckets (each can be a two-dimensional array or some other data structure), assigned to the nucleotides A, C, T and G respectively. The index numbers of the nucleotides are collected in their corresponding buckets, as shown in Fig. 3 for the above three sequences. We assume a four-processor system corresponding to the four buckets; the processing within a bucket is carried out by its corresponding processor. Our parallel system uses shared memory as well as a local memory for each processor. The shared memory maintains three global variables, namely I, J and K, and the local memory of each processor maintains three local variables, namely X, Y and Z. Moreover, our system includes a master processor responsible for managing the interprocessor communication.
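The bucket construction of step 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: a dictionary of lists stands in for the two-dimensional arrays, positions are 1-based as in Fig. 2, the padding zeros of the figure are omitted, and the name `build_buckets` is hypothetical. The sketch runs sequentially and does not model the four processors or the master processor.

```python
from typing import Dict, List

ALPHABET = "ACTG"  # one bucket per nucleotide


def build_buckets(sequences: List[str]) -> Dict[str, List[List[int]]]:
    """For each nucleotide, collect the 1-based positions at which it
    occurs in every sequence. buckets[nt][i] is the sorted list of
    positions of `nt` in sequence i."""
    buckets = {nt: [[] for _ in sequences] for nt in ALPHABET}
    for seq_index, seq in enumerate(sequences):
        # Scanning left to right keeps every position list sorted.
        for position, nt in enumerate(seq, start=1):
            buckets[nt][seq_index].append(position)
    return buckets


# The three sequences of Fig. 2.
sequences = ["ACTTCAGGCTAA", "TTCTAAAGCAAT", "AACCGTATTCGA"]
buckets = build_buckets(sequences)
for nt in ALPHABET:
    print(nt, buckets[nt])
```

Because each sequence is scanned once from left to right, the position lists come out sorted, which is what the later "next higher value" lookups rely on.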
Bucket BA:
S1   1   6   11  12  0
S2   5   6   7   10  11
S3   1   2   7   12  0

Bucket BC:
S1   2   5   9
S2   3   9   0
S3   3   4   10

Bucket BT:
S1   3   4   10  0
S2   1   2   4   12
S3   6   8   9   0

Bucket BG:
S1   7   8
S2   8   0
S3   5   11

Fig. 3 The index numbers of the sequences {S1, S2, S3} in their respective buckets (a 0 marks an unused slot).

2) To find the starting point of the MLCS, a formula-based approach is used. It takes into account the distance of the nucleotides from the starting index and the differences between the index numbers of the nucleotides in their corresponding sequences.

X = Bx[r][S1] + Bx[r][S2] + Bx[r][S3]   (1)

where x = A, C, T or G, r refers to the row number, and Si refers to the sequences (here S1, S2, S3).

Y = | Bx[r][S1] - Bx[r][S2] | + | Bx[r][S2] - Bx[r][S3] |   (2)

where |t| refers to the absolute value of t.

Z = X + Y

Note: a similar approach can be used to define the formulas for more than three sequences.

These calculations are done at each processor, and the values X, Y and Z are stored in the local memory of each processor. Thus we obtain four values of Z, say ZA, ZC, ZT and ZG, one per processor. We take the minimum of the Z values:

Starting Nucleotide Set = nucleotide corresponding to min{ ZA, ZC, ZT, ZG }

The initial values of the global variables are set to the values of the Starting Nucleotide Set. For the above example, the processors hold the following initial value sets:

A[1,5,1], C[2,3,3], T[3,1,6], G[7,8,5]

The calculation at the first bucket BA, for the value set [1,5,1] according to formula (1), is:
XA = BA[1][S1] + BA[1][S2] + BA[1][S3] = 1 + 5 + 1 = 7

Similarly, the values for XC, XT and XG are as follows:

XC = 2 + 3 + 3 = 8
XT = 3 + 1 + 6 = 10
XG = 7 + 8 + 5 = 20

The calculation for YA at BA, for the value set [1,5,1] according to formula (2), is:

YA = | BA[1][S1] - BA[1][S2] | + | BA[1][S2] - BA[1][S3] | = | 1-5 | + | 5-1 | = 8

Similarly, the values for YC, YT and YG are as follows:

YC = | 2-3 | + | 3-3 | = 1
YT = | 3-1 | + | 1-6 | = 7
YG = | 7-8 | + | 8-5 | = 4

Thus the values of ZA, ZC, ZT and ZG are:

ZA = XA + YA = 15
ZC = XC + YC = 9
ZT = XT + YT = 17
ZG = XG + YG = 24

The nucleotide corresponding to the minimum of the Z values is taken as the Start Set. Here the start nucleotide, from where the search process begins, is 'C' with the minimum value ZC, and the corresponding initial Start Set becomes [2, 3, 3], setting the initial values of I, J, K to I=2, J=3 and K=3. We record the first element of the LCS as 'C':

LCS = C

3) Now, in each bucket, we look for the next higher values with respect to [2,3,3]. For bucket BA, the next higher value with respect to the value 2 in sequence S1 is 6; similarly, the next higher values with respect to the values 3 and 3 in sequences S2 and S3 are 5 and 7 respectively. Therefore the next set for bucket BA becomes [6, 5, 7]. Similarly, the next sets for the buckets BC, BT and BG are as follows:

BC = [5, 9, 4]
BT = [3, 4, 6]
BG = [7, 8, 5]

To find the next nucleotide of the LCS we repeat step 2, resulting in the following values of ZA, ZC, ZT and ZG:

ZA = XA + YA = 18 + 3 = 21
ZC = XC + YC = 18 + 9 = 27
ZT = XT + YT = 13 + 3 = 16
ZG = XG + YG = 20 + 4 = 24

We take the minimum of the Z values, i.e. nucleotide T corresponding to ZT, and record T as the next element of the solution:

LCS = C T

The corresponding values of I, J, K become [3,4,6]. We repeat steps 2 and 3 until we reach the end of some sequence or the algorithm is unable to move further.
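The arithmetic of steps 2 and 3 can be reproduced with a short Python sketch. The helper names `z_score` and `next_set` are hypothetical, and the use of binary search (`bisect_right`) to find the next higher index is an implementation assumption that relies on the index lists being sorted; the text itself does not prescribe a lookup method.

```python
from bisect import bisect_right
from typing import List, Optional


def z_score(index_set: List[int]) -> int:
    """Z = X + Y for one candidate set, following formulas (1) and (2):
    X is the sum of the indices, Y the sum of absolute differences
    between indices of consecutive sequences."""
    x = sum(index_set)
    y = sum(abs(a - b) for a, b in zip(index_set, index_set[1:]))
    return x + y


def next_set(bucket: List[List[int]], current: List[int]) -> Optional[List[int]]:
    """Next higher index in each sequence with respect to `current`
    (the values I, J, K). Returns None if some sequence has no higher
    index, i.e. the search in this bucket has ended."""
    result = []
    for positions, cur in zip(bucket, current):
        k = bisect_right(positions, cur)  # first position strictly greater than cur
        if k == len(positions):
            return None
        result.append(positions[k])
    return result


# Buckets of Fig. 3, without the padding zeros.
BA = [[1, 6, 11, 12], [5, 6, 7, 10, 11], [1, 2, 7, 12]]
BT = [[3, 4, 10], [1, 2, 4, 12], [6, 8, 9]]

print(z_score([2, 3, 3]))        # ZC for the initial set of bucket BC
print(next_set(BA, [2, 3, 3]))   # next set of bucket BA after choosing C
print(next_set(BT, [2, 3, 3]))   # next set of bucket BT after choosing C
```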
The rest of the entries are shown in the following table, with the corresponding values of I, J, K.

[I J K] = [3 4 6]
A: [6 5 7]      XA=18   YA=1+2=3     ZA=21 [min]
C: [5 9 10]     XC=24   YC=4+1=5     ZC=29
T: [4 12 8]     XT=24   YT=8+4=12    ZT=36
G: [7 8 11]     XG=26   YG=1+3=4     ZG=30

[I J K] = [6 5 7]
A: [11 6 12]    XA=29   YA=5+6=11    ZA=40
C: [9 9 10]     XC=28   YC=1         ZC=29 [min]
T: [10 12 8]    XT=30   YT=2+4=6     ZT=36
G: [7 8 11]     XG=26   YG=1+3=4     ZG=30

[I J K] = [9 9 10]
A: [11 10 12]   No need to calculate Z, as all other searches are exhausted. Nucleotide A is directly added to the LCS.
C: Search ends, as there is no entry in S2 greater than 9.
T: Search ends, as there is no further entry in S3.
G: Search ends.

Fig. 4 The iterations of the algorithm along with the values of I, J and K.

The final LCS is obtained as LCS = "CTACA", with length 5.

Algorithm Parallel_MSLCS( BA, BC, BT, BG )

// We consider three DNA sequences S1, S2 and S3 for the given algorithm.
// I, J, K are shared global variables, and Xm, Ym, Zm are local variables of the four processors, where m = A, C, T or G.

For each bucket Bm, processor Pm calculates, with the initial value r = 1:

1) Xm = Bm[r][S1] + Bm[r][S2] + Bm[r][S3]
2) Ym = | Bm[r][S1] - Bm[r][S2] | + | Bm[r][S2] - Bm[r][S3] |
3) Zm = Xm + Ym
4) Set Zmin = MIN(ZA, ZC, ZT, ZG)
5) Record the nucleotide corresponding to the minimum of the Zm values in the solution set LCS, in left-to-right fashion.
6) Set I, J, K to the S1, S2 and S3 index values of the set that produced Zmin: I = Zmin[S1], J = Zmin[S2], K = Zmin[S3]
7) Determine the next higher values in each bucket with respect to I, J, K for the corresponding sequences S1, S2, S3. If in some bucket(s) no such value is found for even a single Si, then neglect the corresponding nucleotide.
8) Repeat steps 1 to 7, where the value of r may vary for each bucket depending upon the next higher values corresponding to I, J, K in the respective buckets.
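The steps of Parallel_MSLCS can be traced with the following sequential Python simulation. It is a sketch under several assumptions that the algorithm listing does not spell out: the four processors are modelled by a plain loop over the nucleotides, ties between equal Z values are broken in the order A, C, T, G, the search stops when no bucket can supply a next set, and the formulas are generalised to any number of sequences by summing differences between consecutive sequences. The function name `greedy_mlcs` is hypothetical.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple


def greedy_mlcs(sequences: List[str], alphabet: str = "ACTG") -> str:
    """Sequential simulation of the index-based heuristic."""
    # Step 1 of the method: one bucket of sorted 1-based positions per nucleotide.
    buckets: Dict[str, List[List[int]]] = {
        nt: [[p for p, ch in enumerate(s, start=1) if ch == nt] for s in sequences]
        for nt in alphabet
    }
    current = [0] * len(sequences)  # I, J, K; 0 means "before the first position"
    result: List[str] = []
    while True:
        best: Optional[Tuple[int, str, List[int]]] = None
        for nt in alphabet:  # each pass stands in for one processor
            candidate: Optional[List[int]] = []
            for positions, cur in zip(buckets[nt], current):
                k = bisect_right(positions, cur)  # next index higher than cur
                if k == len(positions):
                    candidate = None  # this nucleotide is neglected (step 7)
                    break
                candidate.append(positions[k])
            if candidate is None:
                continue
            x = sum(candidate)                                            # formula (1)
            y = sum(abs(a - b) for a, b in zip(candidate, candidate[1:]))  # formula (2)
            z = x + y
            if best is None or z < best[0]:  # strict "<" keeps the earlier nucleotide on ties
                best = (z, nt, candidate)
        if best is None:
            break  # no bucket can move further
        result.append(best[1])  # step 5: record the nucleotide
        current = best[2]       # step 6: update I, J, K
    return "".join(result)


print(greedy_mlcs(["ACTTCAGGCTAA", "TTCTAAAGCAAT", "AACCGTATTCGA"]))
```

For the three sequences of Fig. 2 this simulation follows the same choices as Fig. 4 (C, T, A, C, A). Since the method is a heuristic, the returned string is always a common subsequence but is not guaranteed to be of maximum length for every input.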
Analysis

Filling the entries of the buckets takes a considerable amount of time, depending on the number of sequences, as each sequence has to be scanned entirely by the master processor. The time taken can thus be given as O(Σ n_Si), where Si refers to the i-th sequence, n_Si to its length, and i = 1…k. This is not a serious issue, as this work has to be done only once, at start time. Once sequence data has been accumulated, changes are hardly ever made to it; new sequences are only added without disturbing the previous sequences. We have therefore neglected the above time in analyzing the algorithm. Each processor takes a constant amount of time to calculate the values of X, Y, Z, I, J, K, as these are formula based. In determining the next set at each iteration of the algorithm, the processors again take a constant amount of time, as they only need to search for the next higher values in the sequences with respect to the previous set. This is not a problem, as the index numbers of the sequences are kept sorted within the buckets for the respective sequences. Thus the total time taken can be given as:

Time complexity = MAX(KA, KC, KT, KG) + d = O(K)

where d is the interprocessor communication delay time and Ki refers to the time taken by the i-th processor. The value of K (a constant) depends on the processor speed and the interprocessor communication delay time.

Conclusion and Future research

There are not many parallel approaches for determining the LCS of multiple sequences of DNA, protein etc. This paper has given a time-efficient method for solving the MLCS problem. It is a novel and very simple approach. We feel that there is still some scope for improving the time and space complexity by following our approach. In applying our algorithm to protein sequences, the required resources in terms of processors will go up significantly.
In future we will try to find a solution so that the same approach can be applied to protein sequences without increasing the resources significantly, while maintaining the time complexity.

References

1) Wagner, R. A., and M. J. Fischer, "The string to string correction problem," J. ACM, Vol. 21, No. 1, 1974, pp. 168-173.
2) Hirschberg, D. S., "A linear space algorithm for computing maximal common subsequences". C. J. Kaufman, Rocky Mountain Research Laboratories, Boulder, CO, 1992.
3) Masek, W. J., and M. S. Paterson, "A faster algorithm computing string edit distances," JCSS, 1980, pp. 18-31.
4) Hsu, W. J., and M. W. Du, "Computing a longest common subsequence for a set of strings," BIT, Vol. 24, 1984, pp. 45-59.
5) E. W. Myers, "An O(nd) Difference algorithm and its variations," Algorithmica, vol. 2, 1986, pp. 251-226.
6) Apostolico, A., and C. Guerra, "The longest common subsequence problem revisited," Algorithmica, vol. 2, 1987, pp. 315-336.
7) Claus Rick, "New Algorithms for the longest common subsequence problem", Research Report No. 85123CS, University of Bonn, October 1994.
8) Aluru, S., and N. Futamura, "Parallel biological sequence comparison using prefix computations", J. Parallel Distrib. Comput. 63 (2003) 264-272.
9) E.W. Edmiston, N.G. Core, J.H. Saltz, R.M. Smith, Parallel processing of biological sequence comparison algorithms, Internat. J. Parallel Programming 17(3) (1988) 259-275.
10) E. Lander, J.P. Mesirov, W. Taylor, Protein sequence comparison on a data parallel computer, Proceedings of the International Conference on Parallel Processing, 1988, pp. 257-263.
11) X. Huang, A space-efficient Parallel Sequence Comparison algorithm for a message-passing multiprocessor, Internat. J. Parallel Programming 18(3) (1989) 223-239.
12) S.B. Needleman, C.D. Wunsch, A general method applicable to the search of similarities in the amino acid sequences of two proteins, J. Molec. Biol. 48 (1970) 443-453.
13) T.F. Smith, M.S. Waterman, Identification of common molecular subsequences, J. Molec. Biol. 147 (1981) 195-197.
14) Torres M, Gold A and Barrera J (2003) A parallel algorithm for numerating combinations. In: The 2003 International Conference on Parallel Processing Proceedings. (IEEE Computer Society Press). Taiwan, pp 1:581-588.
15) Y. Totoki, Y. Akiyama, K. Onizuka, T. Noguchi, M. Saito, and M. Ando: "Employing A* algorithm in Parallel Multiple Protein Sequence Alignment", IPSJ SIG Notes, 97-MPS-16-4, pp. 19-24 (1997).
16) M. Hirosawa, Y. Totoki, M. Hoshida, and M. Ishikawa: "Comprehensive Study on Iterative Algorithms of Multiple Sequence Alignment", Comput. Applic. Biosci., Vol. 11, No. 1, pp. 13-18 (1995).
17) Katoh, K., Misawa, K., Kuma, K. and Miyata, T. (2002) MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res., 30, 3059-3066.
18) Brudno, M., Do, C.B., Cooper, G.M., Kim, M.F., Davydov, E., Green, E.D., Sidow, A. and Batzoglou, S. (2003) LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res., 13, 721-731.
19) Van Walle, I., Lasters, I. and Wyns, L. (2004) Align-m–a new algorithm for multiple alignment of highly divergent sequences. Bioinformatics, DOI: 10.1093/bioinformatics/bth116. Nucleic Acids Research, 2004, Vol. 32, No. 5 1797.