Mapping Algorithms
Matthias Zytnicki, MIAT, INRA

Outline
1 Introduction
2 Genome mapping
   Hashing-based tools
   Suffix array-based tools
3 Transcriptome mapping
4 Conclusion

1 Introduction

What is mapping
Desired definition
Map a read: predict the locus from which the read originates.
Data
genome + read → mapping → genomic coordinate(s)
Example
Read (FASTQ):
@SEQ ID
GATTT
+
ccedd
Mapped with BWA:
@SEQ ID 113 1...
Assumption
A read is likely to map at a locus iff similarity is high.
Implemented definition
Map a read: list the loci with fewer than k errors.

First implications
Ambiguity in the mapping
[Figure: read and genome]
Mapping ≠ Alignment
genome ACGTACGT
read   ACGTAGT-
is a valid mapping with 2 mismatches.

Tools...
[Figure: read and genome]

2 Genome mapping: Hashing-based tools

Seed-and-extend algorithm
Idea
1 Get the k-mers of the genome,
2 "Sort" them,
3 Get the k-mers of a read,
4 Compare the two,
5 Finish the alignment.
Example
Genome is AACGTAC, read is CCGT (k = 3).
AACGTAC = AAC ACG CGT GTA TAC
CCGT = CCG CGT
The shared k-mer CGT is the seed.

Step 2: Extension with errors
Needleman–Wunsch
[Figure: Needleman–Wunsch dynamic programming matrix aligning the read to the genome, with traceback arrows]
Optimizations
Banded Needleman–Wunsch, FSA, parallel Shift-OR, vectorization, SIMD, ...

Choice of k: Example
Consider sequence ACGCGTGTA and read ACACGAGTA.
• With k = 3, seed GTA matches.
• With k = 4, no seed match.

Choice of k: Example 2
Consider sequence ACGACGACGACGACGT and read ACGT.
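The seed lookup for this example can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of any particular tool: the function names (`build_kmer_index`, `candidate_loci`, `edit_distance`, `extend`) are illustrative, a plain dictionary stands in for the hash table, and the extension step is a unit-cost Needleman–Wunsch-style dynamic program run against a genome window of the same length as the read.

```python
from collections import defaultdict


def build_kmer_index(genome: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer of the genome to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return dict(index)


def candidate_loci(index: dict[str, list[int]], read: str, k: int) -> list[int]:
    """Return the genome positions where the read would start, given its seed hits."""
    loci = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset  # shift the seed hit back to the read start
            if start >= 0:
                loci.add(start)
    return sorted(loci)


def edit_distance(a: str, b: str) -> int:
    """Unit-cost global alignment score (substitutions, insertions, deletions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # gap in b
                            curr[j - 1] + 1,            # gap in a
                            prev[j - 1] + (ca != cb)))  # match or mismatch
        prev = curr
    return prev[-1]


def extend(genome: str, read: str, start: int, max_errors: int = 0) -> bool:
    """Simplified extension: compare the read with the window starting at `start`."""
    window = genome[start:start + len(read)]
    return edit_distance(window, read) <= max_errors


genome, read = "ACGACGACGACGACGT", "ACGT"
for k in (3, 4):
    index = build_kmer_index(genome, k)
    loci = candidate_loci(index, read, k)
    mapped = [s for s in loci if extend(genome, read, s)]
    print(f"k={k}: candidate loci={loci}, extension succeeds at={mapped}")
```

Raising `max_errors` in `extend` lets more candidate loci survive, which is where a small k pays off in sensitivity and costs time.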
k = 3:
  ACG → 0, 3, 6, 9, 12
  CGA → 1, 4, 7, 10
  CGT → 13
  GAC → 2, 5, 8, 11
  ⇒ 5 candidate loci (4 of them spurious), 1 extension succeeds.
k = 4:
  ACGA → 0, 3, 6, 9
  ACGT → 12
  CGAC → 1, 4, 7, 10
  GACG → 2, 5, 8, 11
  ⇒ 1 match, 1 extension succeeds.
Pigeonhole principle
If a read with at most r errors is cut into r + 1 non-overlapping seeds, at least one seed is error-free.
⇒ More errors require more, hence shorter, seeds: small k = more sensitive.
Remark
k-mers with many occurrences are usually discarded.

Conclusion so far
Trade-off
• Small k: more sensitive, slower.
• High k: more specific, larger database (size: up to 4^k).
Tentative complexity
• Pre-process the genome (done once).
• Cut a read into k-mers (fast).
• Map each k-mer to the table (fast for each k-mer).
• Consider every genome position hit by a k-mer (variable).
• Extend with errors (slow).
Drawbacks
Slow or not sensitive when:
• accepting many errors,
• accepting highly repeated seeds.

2 Genome mapping: Suffix array-based tools

Suffix tree
Idea
[Figure: suffix tree of GATTACA]

Looking for a read
Simply follow the arrows that spell the read, starting from the root.
Example
Look for TAC. Look for TAT.
Problem
Suffix trees do not fit in memory. Use suffix arrays instead:
• same algorithms,
• condensed data structure.

Handling errors
Example
Look for GAC with 1 error.
[Figure: search for GAC in the suffix tree, exploring branches that differ by one letter]
Problem
The search space is very large!

Conclusion so far
Tentative complexity
• Pre-process the genome (done once).
• Look for a read (quite fast if no error).
Slow or not sensitive when:
• accepting many errors (small seed).

3 Transcriptome mapping

Introduction
RNA mapping ≠ DNA mapping
[Figure: spliced read aligned to the genome]

Difficult cases
• many differences (mutations and/or sequencing errors),
• repeated sequence,
• read on 3+ exons,
• gene or pseudogene?
• end of read on another exon,
• read on unknown and poorly expressed junction.
(Kim et al., Gen. Biol., 2013)

TopHat1
• uses Bowtie,
• problems for pseudogenes.
(Trapnell et al., Bioinformatics, 2009)

Algorithms
[Figure from Garber et al., Nat. Meth., 2011]

TopHat2
[Figure from Kim et al., Gen. Biol., 2013]

4 Conclusion

Other considerations
Quality
Current tools provide read quality. Better to use it.
Paired-end reads
Most tools have a special "fragment rescue" mode.
Low quality 3' ends
Some tools truncate them.
Binary encoding
A: 00, C: 01, G: 10, T: 11, other: ???

Tool comparison
Sensitivity: map most reads,
Specificity: mappings are correct,
Time,
Memory
⇒ Balance between the criteria.

DNA mapping tools
[Table: one row per tool; columns: name, seed-and-extend, pigeon hole, spaced seed, q-gram, suf. tree, B.–W., notes; cell marks not recoverable]
Tools: SSAHA, Blat, MUMmer2, Eland, MAQ, SOAP, PatMaN, RMAP, SeqMap, Mosaik, SOCS, ZOOM, PASS, SOAP2, BWA, SHRiMP, Bowtie, mrFAST, RazerS, CloudBurst, MPScan, PerM, BFAST, Segemehl, GNUMap, mrsFAST, novoalign, GASSST, Stampy, rNA, SOAP3, Bowtie2, Saruman, BarraCUDA.
Notes used in the table: Introns; Several seeds; Minimal # seeds; No heuristic; SNP detection; Mapping quality; Color space; Trade-off between speed and accuracy; Gives all hits; MapReduce; No error; 2-level hash table; Cache-oblivious; Uses GPU; SIMD; GPU; GPGPU.

BWA vs Bowtie2
[Figure from Li, website: http://lh3lh3.users.sourceforge.net/alnROC.shtml]

Bowtie2 vs BWA
[Figure from Langmead et al., Nat. Meth., 2012]

"Unbiased" comparisons
• Matthew Ruffalo, Thomas LaFramboise, Mehmet Koyutürk, Comparative analysis of algorithms for next-generation sequencing read alignment, Bioinformatics (2011), vol 27, pp. 2790–2796.
• Heng Li and Nils Homer, A survey of sequence alignment algorithms for next-generation sequencing, Briefings in Bioinformatics (2010), vol 11, pp. 473–483.
• Sophie Schbath, Véronique Martin, Matthias Zytnicki, Julien Fayolle, Valentin Loux, Jean-François Gibrat, Mapping Reads on a Genomic Sequence: an Algorithmic Overview and a Practical Comparative Analysis, Journal of Computational Biology (2012), vol 19, pp. 796–813.
⇒ Results are already irrelevant!

RGASP
The RNA-seq Genome Annotation Assessment Project.
[Figure from Engström et al., Nat. Meth., 2013]

RGASP
Comparison of transcript assembly.
[Figure from Engström et al., Nat. Meth., 2013]

Which tool should I use?
Common situations
You can use widely used tools.
DNA-Seq: BWA(-SW/mem), Bowtie2
RNA-Seq: TopHat2, STAR
• Cannot refuse your paper!
• Well maintained.
• Large software suite (Tuxedo).
A niche?
• mrFAST: for high number of copies
• SHRiMP: for color space
• Stampy: for highly divergent sequencing (with BWA)
• CloudBurst: with Map/Reduce
• SOAP3: for GPU
Want to buy France? Use CRAC.