Tools for Fast Sequence Alignment

Sajia Akhter1, Robert A Edwards1,2,3
1 Computational Science Research Center, San Diego State University
2 Department of Computer Science, San Diego State University
3 Mathematics and Computer Science Division, Argonne National Laboratory, Argonne
Contact Information: [email protected]

Sequence Alignment

Sequence alignment is a way of finding similar regions between two DNA or protein sequences. The degree of similarity is an indicator of the genetic relatedness between different organisms.

[Figure: example alignment of two short DNA sequences (Sequence 1, Sequence 2), with scoring parameters for match, mismatch, gap opening, and gap extension]

Why a Fast Sequence Aligner

[Figures: cost per megabase of DNA sequence; number of metagenomes]

• Sequencing technologies are trying to increase read length.
  - Current DNA sequence aligners are optimized for short reads vs. long references.
  - It is not guaranteed that current aligners will efficiently handle longer sequence lengths.

Current Popular Methods

[Panel content missing in the source]

Our Approach

1. Construct a single suffix array for the reference sequences, the reverse complements of the reference sequences, and the query sequences.
   Time complexity: linear, O(l*n + m*r)

   [Figure: Database and Query sequences merged into one Suffix Array]

2. Find the candidate database sequence for each query sequence.
   Time complexity: O(l*n + m*r) or O(m*r^2/p)

   Hypothesis I: Query q has its best alignment with database sequence d if the suffixes of d are the closest suffixes to the query groups containing the suffixes of q.
   • Find the candidate d for each query sequence.

   Hypothesis II: Query q has its best alignment with database sequence d if it satisfies Hypothesis I and Σ lcp(suffixes of q, suffixes of d) is maximum.
   • Find the longest common exact matches for each query suffix.

   Case I: Database ~= Query
   • Calculate the lcp for each pair of consecutive suffixes while constructing the suffix array.

   Case II: Database >> Query
   • Use dynamic programming to find the lcp.

   DB1    AAACCGGCATGGCG    10
   Qry1   AAACCGGCATTAAA    3   7
   Qry2   AAACCGGGATTAAA    1   6
   Qry3   AAACCGTGATTACA    1   5
   Qry4   AAACCTGCATTGAA    1   4
   Qry5   AAACTTGCATTTGA    4

   Total Comparisons = 32
   Total Comparisons = 10

3. Use the Smith-Waterman-Gotoh algorithm to find the optimal alignment between q and d.
   Time complexity: O(m*r*l)

Prediction of Candidate d

[Figure: comparison of the predicted candidate d with BLAST results]
Legend:
• Candidate d = BLAST predicted best d
• Candidate d != BLAST predicted d
• Candidate d = BLAST predicted d
• For some q BLAST did not report

Summary

• Hypothesis I works slightly better than Hypothesis II.
• Candidate d is predicted correctly when %identity and %coverage are high.
• Real (wall-clock) time has not yet been measured on an idle machine, but the total run time seems reasonable.

References

Delcher A., et al. Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res. 2002;30:2478-2483.
Langmead B., et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10:R25.
Li H., et al. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 2008;18:1851-1858.
Li H., Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009;25:1754-1760.
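
Illustrative sketch

The candidate search based on Σ lcp(suffixes of q, suffixes of d) can be illustrated with a small Python sketch. This is one possible reading of Hypothesis II, not the authors' implementation: it sorts suffixes naively instead of building the suffix array in linear time, and it ignores reverse complements. The function names `lcp` and `candidate_by_lcp_sum`, the dictionary input format, and the extra sequences `DB2` and `QryX` are assumptions made for this example. Only `DB1` and `Qry1` come from the poster.

```python
from collections import defaultdict


def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def candidate_by_lcp_sum(database, queries):
    """Pick, for each query, the database sequence with the largest lcp sum.

    database, queries: dicts mapping a sequence name to its string
    (hypothetical input format chosen for this sketch).
    """
    # One sorted list of all suffixes: (suffix, is_database, name).
    entries = []
    for name, seq in database.items():
        entries.extend((seq[i:], True, name) for i in range(len(seq)))
    for name, seq in queries.items():
        entries.extend((seq[i:], False, name) for i in range(len(seq)))
    entries.sort(key=lambda e: e[0])  # naive sort, not linear time

    # For every position, index of the nearest database suffix on each side.
    n = len(entries)
    prev_db = [None] * n
    last = None
    for i in range(n):
        prev_db[i] = last
        if entries[i][1]:
            last = i
    next_db = [None] * n
    last = None
    for i in range(n - 1, -1, -1):
        next_db[i] = last
        if entries[i][1]:
            last = i

    # Credit each query suffix's lcp to its closest database neighbour.
    scores = {q: defaultdict(int) for q in queries}
    for i, (suffix, is_db, name) in enumerate(entries):
        if is_db:
            continue
        best_len, best_name = 0, None
        for j in (prev_db[i], next_db[i]):
            if j is None:
                continue
            length = lcp(suffix, entries[j][0])
            if length > best_len:
                best_len, best_name = length, entries[j][2]
        if best_name is not None:
            scores[name][best_name] += best_len

    # Candidate d is the database sequence with the largest summed lcp.
    return {q: (max(s, key=s.get) if s else None) for q, s in scores.items()}


database = {"DB1": "AAACCGGCATGGCG", "DB2": "TTTTGGGGTTTTCC"}
queries = {"Qry1": "AAACCGGCATTAAA", "QryX": "TTTGGGGTTTTCCA"}
print(candidate_by_lcp_sum(database, queries))
```

In sorted order, the database suffix sharing the longest prefix with a query suffix is always the nearest database suffix on one of its two sides, which is why only two neighbours are compared per query suffix. The chosen candidate would then be passed to the optimal alignment step.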