
Tools for Fast Sequence Alignment
Contact Information: [email protected]
Sajia Akhter1, Robert A Edwards1,2,3
1Computational Science Research Center, San Diego State University
2Department of Computer Science, San Diego State University
3Mathematics and Computer Science Division, Argonne National Laboratory, Argonne
Sequence Alignment
Sequence alignment is a way of finding the similar regions
between two DNA or protein sequences. This is an indicator of
the genetic relatedness between different organisms.
Sequence 1: AACGCGGA
Sequence 2: AACGA
Alignment:
AACGCGGA
AACG---A
Score: +4 -1 +1 = 4
AACGCGGA
AA--CG-A
Score: +2 -1 +2 -1 +1 = 3
match = 1
gap opening = -1
gap extension = 0
Mismatch = -2
Why a Fast Sequence Aligner?
[Figure: Cost per megabase of DNA sequence]
[Figure: # of Metagenomes]
•  Sequencing technologies are trying to increase read length
- Current DNA sequence aligners are optimized for short reads vs. long references
- It is not guaranteed that current aligners will efficiently handle the longer sequence length
Current Popular Methods
Prediction of Candidate d
•  Candidate d = BLAST predicted best d
•  Candidate d != BLAST predicted d
•  Candidate d = BLAST predicted d
•  For some q BLAST did not report
Our Approach
1 Construct a single suffix array for the reference sequences, the reverse
complements of the reference sequences, and the query sequences. Time
complexity: linear, O(l*n+m*r)
[Figure: Suffix Array]
2 Find the candidate database sequence for each query sequence.
Time complexity: O(l*n+m*r) or O(m*r^2/p)
Hypothesis I: Query q has the best alignment with database
sequence d if the suffixes of d are the closest suffixes to the query groups
containing the suffixes of q.
•  Find the candidate d for each query sequence
Hypothesis II: Query q has the best alignment with database
sequence d if it satisfies Hypothesis I and
Σ lcp(suffixes of q, suffixes of d) is maximum
•  Find the longest common exact matches for each query suffix
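The candidate search in step 2 can be illustrated with a small sketch. It is not the authors' implementation: it sorts the database suffixes naively instead of building one combined suffix array, and the names `lcp`, `candidate_for_query`, and the second database sequence `DB2` are hypothetical. For each query suffix it looks at the neighbouring database suffixes in sorted order (the "closest suffixes"), credits the longest common prefix to the database sequence that owns the closer neighbour, and returns the sequence with the largest sum.

```python
from bisect import bisect_left
from collections import defaultdict


def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def candidate_for_query(query, database):
    """Return the name of the database sequence with the largest summed lcp.

    database maps a sequence name to its sequence (assumed input shape).
    """
    # Naive sorted suffix list; a real tool would use a suffix array instead.
    db_suffixes = sorted(
        (seq[i:], name) for name, seq in database.items() for i in range(len(seq))
    )
    keys = [suffix for suffix, _ in db_suffixes]
    totals = defaultdict(int)
    for i in range(len(query)):
        q_suffix = query[i:]
        pos = bisect_left(keys, q_suffix)
        best_len, best_name = 0, None
        # The closest database suffixes are the two neighbours in sorted order.
        for j in (pos - 1, pos):
            if 0 <= j < len(keys):
                shared = lcp(q_suffix, keys[j])
                if shared > best_len:
                    best_len, best_name = shared, db_suffixes[j][1]
        if best_name is not None:
            totals[best_name] += best_len
    return max(totals, key=totals.get) if totals else None


database = {"DB1": "AAACCGGCATGGCG", "DB2": "TTTTGGGGTTTTCC"}
print(candidate_for_query("AAACCGGCATTAAA", database))
```

The sketch ignores reverse complements and ties between database sequences; both would need explicit handling in practice.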
Case I: Database ~= Query
•  Calculate lcp for each pair of consecutive suffixes while
constructing the suffix array
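As a rough illustration of Case I, the sketch below builds a suffix array and the lcp values of consecutive suffixes. It is an assumption-laden example, not the poster's code: the suffix array is built by plain sorting (not in linear time), and the lcp values are computed in a separate pass with Kasai's algorithm rather than during construction. The function names are hypothetical.

```python
def suffix_array(text):
    """Start positions of the suffixes of text in lexicographic order.

    Plain sorting keeps the sketch short; it is not a linear-time construction.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """lcp[k] = longest common prefix of the suffixes at sa[k-1] and sa[k].

    Kasai's algorithm: linear time once the suffix array is known. lcp[0] is 0.
    """
    n = len(text)
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]  # suffix that precedes suffix i in sorted order
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1  # the next suffix shares at least h - 1 characters
        else:
            h = 0
    return lcp


text = "ACGACG"
sa = suffix_array(text)
for start, shared in zip(sa, lcp_array(text, sa)):
    print(start, shared, text[start:])
```

To index several sequences at once, as step 1 describes, the sequences would be concatenated with separator characters before calling `suffix_array`; that bookkeeping is left out here.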
Case II: Database >> Query
•  Use dynamic programming to find lcp
lcp(Qry_i, DB1) = min( lcp(DB1, Qry_{i-1}), lcp(Qry_{i-1}, Qry_i) )
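One way to read the dynamic-programming idea in Case II is sketched below, using the DB1 and Qry1 to Qry5 sequences from this poster. It relies on a standard property of lexicographically sorted strings: the lcp of two non-adjacent entries is the minimum of the lcp values of the adjacent pairs between them. The function names are hypothetical and the sketch assumes the queries are already sorted after the database sequence.

```python
def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def lcp_with_database(db, queries):
    """lcp of db with each query, reusing the previous result.

    Assumes db followed by queries is in lexicographic order, so that
    lcp(db, q_i) = min(lcp(db, q_{i-1}), lcp(q_{i-1}, q_i)).
    """
    results = []
    previous = db
    running = None
    for q in queries:
        adjacent = lcp(previous, q)  # only adjacent pairs are compared directly
        running = adjacent if running is None else min(running, adjacent)
        results.append(running)
        previous = q
    return results


db1 = "AAACCGGCATGGCG"
queries = [
    "AAACCGGCATTAAA",
    "AAACCGGGATTAAA",
    "AAACCGTGATTACA",
    "AAACCTGCATTGAA",
    "AAACTTGCATTTGA",
]
print(lcp_with_database(db1, queries))
```

Only adjacent pairs are compared character by character; every lcp against the database sequence is then obtained with a single `min`.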
Name   Sequence         Comparisons   Comparisons
DB1    AAACCGGCATGGCG   10
Qry1   AAACCGGCATTAAA   7             3
Qry2   AAACCGGGATTAAA   6             1
Qry3   AAACCGTGATTACA   5             1
Qry4   AAACCTGCATTGAA   4             1
Qry5   AAACTTGCATTTGA                 4
Total Comparisons       32            10
3 Use the Smith-Waterman-Gotoh algorithm to find the optimal alignment
between q and d. Time complexity: O(m*r*l)
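A minimal, score-only sketch of Smith-Waterman local alignment with Gotoh's affine gap handling follows. It is an illustration under stated assumptions, not the poster's implementation: a gap of length k is assumed to cost `gap_open + (k - 1) * gap_extend`, the default scores (match 1, mismatch -2, gap opening -1, gap extension 0) are illustrative, and the function name `swg_score` is hypothetical. Traceback of the alignment itself is omitted.

```python
def swg_score(a, b, match=1, mismatch=-2, gap_open=-1, gap_extend=0):
    """Best local alignment score of a and b with affine gap penalties.

    Assumed convention: the first gap character costs gap_open and each
    further character of the same gap costs gap_extend.
    """
    neg_inf = float("-inf")
    n, m = len(a), len(b)
    # H: best score ending at (i, j); E and F: best score ending in a gap.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    F = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Open a new gap or extend an existing one, in either sequence.
            E[i][j] = max(H[i][j - 1] + gap_open, E[i][j - 1] + gap_extend)
            F[i][j] = max(H[i - 1][j] + gap_open, F[i - 1][j] + gap_extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: scores never drop below zero.
            H[i][j] = max(0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


print(swg_score("ACGTTTTACGT", "ACGTACGT"))
```

The three matrices use O(len(a) * len(b)) memory; keeping only the previous row would be enough when just the score is needed.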
Summary
•  Hypothesis I works slightly better than Hypothesis II.
•  Candidate d is predicted correctly when %identity and %coverage are high.
•  Run time has not yet been measured on an idle machine, but the total run time seems
reasonable.
References
Delcher A., et al. Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res. 2002;30:2478-2483.
Langmead B., et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10:R25.
Li H., et al. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 2008;18:1851-1858.
Li H., Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009;25:1754-1760.