Mapping Algorithms

Outline
1 Introduction
2 Genome mapping
Hashing-based tools
Suffix array-based tools
Mapping Algorithms
3 Transcriptome mapping
Matthias Zytnicki
MIAT INRA
4 Conclusion
2 / 34
Outline
What is mapping
Desired definition
1 Introduction
Map a read: predict the locus from which the read originates.
2 Genome mapping
Data
genome
read
Hashing-based tools
Suffix array-based tools
genomic coordinate(s)
Example
3 Transcriptome mapping
4 Conclusion
mapping
@SEQ ID
GATTT
+
ccedd
ALIMENTATION
AGRICULTURE
ENVIRONNEMENT
BWA
@SEQ ID 113 1...
Assumption
A read is likely to map at a locus iff similarity is high.
Implemented definition
Map a read: list the loci with less than k errors.
3 / 34
4 / 34
First implications
Tools. . .
Ambiguity in the mapping
read
genome
read
genome
read
genome
6 / 34
5 / 34
Outline
Outline
read
1 Introduction
1 Introduction
genome
2 Genome mapping
2 Genome mapping
Hashing-based tools
Suffix array-based tools
Hashing-based tools
Suffix array-based tools
Mapping 6= Alignment
3 Transcriptome mapping
genome
read
4 Conclusion
3 Transcriptome mapping
ACGTACGT
is a valid mapping with 2 mismatches.
ACGTAGT-
4 Conclusion
7 / 34
8 / 34
Seed-and-extend algorithm
Step 2: Extension with errors
Idea
Needleman–Wunsch
1
Get the k-mers of the genome,
2
“Sort” them,
3
Get the k-mers of a read,
4
Compare the two,
5
Finish the alignment.
G
G
Example
T
Genome is AACGTAC, read is CCGT
AACGTA = AAC ACG CGT GTA
AAC
ACG
CGT
GTA
CCGT = CCG CGT
Choice of kAAC— Example
CCGT = CCG CGT
ACG
CGT
Example
GTAread ACACGAGTA.
Consider sequence ACGCGTGTA and
• With k = 3, seed GTA matches.
A A C G
• With k = 4, no seed match.
C C G
T
T
0
↓
1
↓
2
↓
3
→
&
A
1
→
&
1
&
C
2
2
&
2
&
G
3
2
&
2
&
3
→
&
→
&
2
&
3
→
T
4
3
3
&
3
2
Optimizations
Banded Needleman–Wunsch, FSA, parallel Shift-OR, vectorization,
SIMD, . . .
10 / 34
9 / 34
1
Choice of k — Example 2
Example
Consider sequence ACGACGACGACGACGT and read ACGT.
k = 3:
ACG → 0, 3, 6, 9, 12
CGA → 1, 4, 7, 10
CGT → 13
GAC → 2, 5, 8, 11
⇒ 4 matches, 1 extension
succeeds.
A
k = 4:
ACGA → 0, 3, 6, 9
ACGT → 12
CGAC → 1, 4, 7, 10
GACG → 2, 5, 8, 11
⇒ 1 matches, 1 extension
succeeds.
Pigeon hole principle
With r errors, r + 1 seeds.
⇒ small k = more sensitive.
Remark
k-mers with many occurrences usually are discarded.
11 / 34
12 / 34
Conclusion so far
Outline
Trade-off
• Small k: more sensitive, slower.
1 Introduction
• High k: more specific, larger database (size: up to 4k ).
2 Genome mapping
Hashing-based tools
Suffix array-based tools
Tentative complexity
• Pre-process genome (done once).
3 Transcriptome mapping
• Cut a read into k-mers (fast).
4 Conclusion
• Map each k-mer to the table (fast for each k-mer).
• Consider every position in the genome (variable).
• Extend with errors (slow).
Drawbacks
Slow or not sensitive when:
• accepting many errors,
• accepting highly repeated seeds.
13 / 34
14 / 34
Suffix tree
Looking for a read
Idea
A
C
T
G
A
C
T
A
A
A
GATTACA is:
T
T
C
T
C
A
A
A
T
A
A
A
T
A
A
A
Simply follow the right
arrow.
T
G
T
C
A
C
T
T
C
A
A
T
A
C
C
A
A
C
C
A
A
Example
Look for TAC.
Look for TAT.
Problem
Suffix trees do not fit in
memory.
Use suffix arrays instead:
C
A
A
15 / 34
• same algorithms,
• condensed data
structure.
16 / 34
Handling errors
A
Example
Look for
GAC with 1
error.
GAC
C
A
T
C
A
Conclusion so far
T
G
A
A
Tentative complexity
T
T
T
C
A
A
T
A
C
C
A
• Pre-process the genome (done once).
• Look for a read (quite fast if no error).
Slow or not sensitive when:
Problem
The search
space is
very large!
A
• accepting many errors (small seed).
A
C
A
17 / 34
18 / 34
Outline
Introduction
1 Introduction
2 Genome mapping
RNA mapping 6= DNA mapping
Hashing-based tools
Suffix array-based tools
spliced read
3 Transcriptome mapping
genome
4 Conclusion
19 / 34
20 / 34
Difficult cases
Tophat1
• many differences (mutations and/or sequencing errors),
• repeated sequence,
• read on 3+ exons,
• gene or pseudogene?
(Trapnell et al., Bioinformatics,
(Kim et al., Gen. Biol., 2013)
2009)
• uses Bowtie,
• end of read on another exon,
• problems for
pseudogenes.
• read on unknown and poorly expressed junction.
22 / 34
21 / 34
Algorithms
Tophat2
(Kim et al.,
Gen. Biol.,
2013)
(Garber et al., Nat. meth., 2011)
23 / 34
24 / 34
Outline
Other considerations
1 Introduction
2 Genome mapping
Hashing-based tools
Suffix array-based tools
Quality Current tools provide read quality. Better use it.
Pair end reads Most tools have a special “fragment rescue” mode.
3 Transcriptome mapping
Low quality 3’ ends Some tools truncate them.
4 Conclusion
Binary encoding A: 00, C: 01, G: 10, T: 11, other: ???
26 / 34
25 / 34
Tool comparison
DNA mapping tools
name
SSAHA
Blat
MUMmer2
Eland
MAQ
SOAP
PatMaN
RMAP
SeqMap
Mosaik
SOCS
ZOOM
PASS
SOAP2
BWA
SHRiMP
Bowtie
mrFAST
RazerS
CloudBurst
MPScan
PerM
BFAST
Segemehl
GNUMap
mrsFAST
novoalign
GASSST
Stampy
rNA
SOAP3
Bowtie2
Saruman
BarraCUDA
Sensitivity map most reads,
Specificity mappings are correct,
Time
Memory
⇒ Balance between the criteria.
27 / 34
seed-and-extend
X
X
pigeon hole
spaced seed
q-gram
suf. tree
B.–W.
notes
Introns
X
X
X
X
X
X
X
X
Several seeds
X
X
X
X
X
X
X
X
X
Minimal # seeds
No heuristic
SNP detection
Mapping quality
Color space
Trade-off between speed and accuracy
Gives all hits
X
X
MapReduce
No error
X
X
X
2-level hash table
X
X
X
?
?
X
X
X
Cache-oblivious
?
X
?
?
?
X
X
X
X
Uses GPU
SIMD
GPU
GPGPU
28 / 34
BWA vs Bowtie2
Bowtie2 vs BWA
(Li Website: http: // lh3lh3. users. sourceforge. net/ alnROC. shtml )
(Langmead et al., Nat. Meth., 2012)
29 / 34
30 / 34
“Unbiased” comparisons
RGASP
The RNA-seq Genome Annotation Assessment Project.
• Matthew Ruffalo, Thomas LaFramboise, Mehmet Koyutürk,
Comparative analysis of algorithms for next-generation
sequencing read alignment, Bioinfomatics (2011), vol 27, pp.
2790–2796.
• Heng Li and Nils Homer, A survey of sequence alignment
algorithms for next-generation sequencing, Briefings in
Bioinformatics (2010), vol 11, pp. 473–483.
• Sophie Schbath, Véronique Martin, Matthias Zytnicki, Julien
Fayolle, Valentin Loux, Jean-François Gibrat, Mapping Reads
on a Genomic Sequence: an Algorithmic Overwiew and a
Practical Comparative Analysis, Journal of Computational
Biology (2012), vol 19, pp. 796–813.
⇒ Results are already irrelevant!
(Engström et al., Nat. Meth., 2013)
31 / 34
32 / 34
RGASP
Which tool should I use?
Comparison transcript assembly.
Common situations
You can use widely-used tools.
DNA-Seq BWA(-SW/mem),
Bowtie2
RNA-Seq TopHat2, Star
• Cannot refuse your paper!
• Well maintained.
• Large software suite
(Tuxedo).
A niche?
• mrFAST: for high number
of copies
• SHRiMP: for color space
• Stampy: for highly
divergent sequencing
(with BWA)
• CloudBurst: with
Map/Reduce
• SOAP3: for GPU
Want to buy France?
Use CRAC.
(Engström et al., Nat. Meth., 2013)
33 / 34
34 / 34