Short (and Long) Read Alignment
Preprocessing and SNP calling
Natasja S. Ehlers, PhD student
Center for Biological Sequence Analysis, Functional Human Variation Group
Simon Rasmussen
27626: Next Generation Sequencing analysis
CBS - DTU
27626 - Next Generation Sequencing Analysis

Generalized NGS analysis
[Figure: workflow diagram with an axis labelled "Data size". Boxes: Question; Raw reads; Pre-processing; Assembly: Alignment / de novo; Application specific: Variant calling, count matrix, ...; Compare samples / methods; Answer?]

Alignment/Mapping
• Assemble your reads by aligning them to a closely related reference genome
• High sequence similarity between individuals makes this possible
[Figure: reads aligned to a genome]

Sounds easy?
• Some pitfalls:
  • Divergence between sample and reference genome
  • Repeats in the genome
  • Recombination and re-arrangements
  • Poor reference genome quality
  • Read errors
  • Regions not in the ref. genome

Alignment approaches
• Short reads: global read alignment (or glocal for slightly longer reads)
• Long reads: local alignments (more likely to have gaps/indels)
• Exact string match
• Creating hash tables (maybe you know these from Perl/Python/Java/C, ...)
• BWT + suffix arrays

Simplest solution
• Exact string matches:
  Reference: ACGTGCGGACGCTGAACGTGACG
  Read: GTG
  (hits drawn under the reference: GTG, GTG, G-TG, GTG)
• We need to allow mismatches/indels (Smith-Waterman, Needleman-Wunsch)
• One of the world's fastest computers (K computer - RIKEN)
• 20 million 100 nt reads vs. human genome ~ 1 month
• We search each read vs. the entire reference
Complexity: O(n*m)

How about BLAST?
• Everybody uses BLAST
• Everybody will believe your BLAST hits (almost)
• However BLAST
  • finds local alignments - not always what we want for short reads
  • and other stuff (alignment scores, output format, speed)
• Not practical for short reads!
What we can learn: Reducing the search space

Smart solution
1. Use algorithm to quickly find possible matches
   (3.2Gb -> drastically reduced search space -> X possible matches)
2. Allow us to perform slow/precise alignment for possible matches (Smith-Waterman)
   (-> 1 best match)

Hash based algorithms
Lookups in hashes are fast!
1. Index the reference using k-mers.
2. Search reads vs. hash k-mers
3. Perform alignment of entire read around seed
4. Report best alignment

Key            Value
...            ...
ACTGCGTGTGA    Chr1_pos1234; Chr2_pos567
ACTGCGTGTGC    Chr7_posX
ACTGCGTGTGT    Chr7_posZ; ...
...            ...

Also known as Seed and extend

Spaced seeds
• Key/k-mer is called a seed
• BLAST uses k=11 and all must be matches
  11111111111          L = 11, 11 matches
• Smarter: Spaced seeds (only care about "1" in seed)
  111010010100110111   L = 18, 11 matches
• Higher sensitivity
• One can use several seeds

Multiple seeds & drawbacks
• One could require multiple short seeds
• Instead of extending around each seed, extend around positions with several seed matches (SHRiMP)
• Drawbacks of hash-based approaches:
  • Poor hashing may lead to slow alignment
  • Lots(!) of RAM to keep index in memory (hg ~48Gb!)

Burrows-Wheeler Transform
• Hash based aligners require lots of memory and are only reasonably fast
• Can we make it better/faster?
• Burrows-Wheeler Transform (BWT), Suffix Arrays and FM-index
• BWT originally created for compression (implemented in bzip2)

The concepts
• Burrows-Wheeler Transform (BWT)
  • A reversible transformation of the genome
• Suffix Array is "array of integers giving the starting positions of suffixes of a string in lexicographical order"
• Ferragina and Manzini (FM) index
  • Allows us to recreate parts of the Suffix Array on the fly
• First create BWT of the genome (=index)

BWT: Create index
Genome T = AGGAGC$
($ marks end-of-string, smallest lexicographically)

1. Create all possible transformations of the string (move first base to end)

0  AGGAGC$
1  GGAGC$A
2  GAGC$AG
3  AGC$AGG
4  GC$AGGA
5  C$AGGAG
6  $AGGAGC

2. Sort the strings lexicographically to create BWT matrix and Suffix Array

SA   BWT matrix   BWT
6    $AGGAGC      C
3    AGC$AGG      G
0    AGGAGC$      $
5    C$AGGAG      G
2    GAGC$AG      G
4    GC$AGGA      A
1    GGAGC$A      A

This one we need later
F = First column of the BWT matrix
L = Last column of the BWT matrix (= the BWT)

BWT: T-rank
T = AGGAGC$
T-ranking: # of times the base occurred previously in T
A0 G0 G1 A1 G2 C0 $

F                       L
$    A0 G0 G1 A1 G2     C0
A1   G2 C0 $  A0 G0     G1
A0   G0 G1 A1 G2 C0     $
C0   $  A0 G0 G1 A1     G2
G1   A1 G2 C0 $  A0     G0
G2   C0 $  A0 G0 G1     A1
G0   G1 A1 G2 C0 $      A0

Notice that the occurrences of each base appear in the same relative order in F and L (A1, A0 and G1, G2, G0 in both columns)
The rank order will always be the same in F and L

BWT: B-rank
B-ranking: Ranked based on occurrence in F/L

F                       L
$    A1 G2 G0 A0 G1     C0
A0   G1 C0 $  A1 G2     G0
A1   G2 G0 A0 G1 C0     $
C0   $  A1 G2 G0 A0     G1
G0   A0 G1 C0 $  A1     G2
G1   C0 $  A1 G2 G0     A0
G2   G0 A0 G1 C0 $      A1

T = AGGAGC$
B-rank  A1 G2 G0 A0 G1 C0 $
T-rank  A0 G0 G1 A1 G2 C0 $

BWT is reversible
LF-mapping: LF can be used to recreate the original genome
(same B-ranked matrix as above)
Following L to F from the $ row: C0 G1 A0 G0 G2 A1 $
Reversed: A1 G2 G0 A0 G1 C0
T = AGGAGC$
F can be computed = 2x A, 1x C, 3x G
We therefore only need to store L

BWT: Lookups
We can look up where a read matches in our genome
(same B-ranked matrix as above)
Read = "AGG"
Start from last base: AGG
• Find all rows that start with "G"
  • Simple because F is sorted
• In this range, match second last base in L
• Use LF mapping to get to F
• Match next base (A) in L
  • Luckily there was a match - else no alignment
• Use LF mapping to get to F
• This is our match!
• But where is this in our genome?
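The index construction and lookup steps above can be sketched in a few lines of Python. This is a minimal illustration for toy strings only, not how BWA implements it: the function names `bwt_index` and `backward_search` are made up for this example, the index is built by naively sorting all rotations, and the search tracks a range of rows (a common formulation of backward search) rather than the single ranked rows followed on the slides.

```python
def bwt_index(genome):
    """Build the suffix array, the BWT (column L) and the start row of each
    base in column F for a '$'-terminated string (naive, toy inputs only)."""
    n = len(genome)
    # Sort rotation start positions by the rotation they produce.
    sa = sorted(range(n), key=lambda i: genome[i:] + genome[:i])
    # L holds the base just before the start of each sorted rotation.
    bwt = "".join(genome[i - 1] for i in sa)
    # F is sorted, so it is fully described by where each base's block starts.
    first, total = {}, 0
    for base in sorted(set(genome)):
        first[base] = total
        total += genome.count(base)
    return sa, bwt, first


def backward_search(read, bwt, first):
    """Return the half-open row range [lo, hi) of the BWT matrix whose rows
    start with `read`, matching the read from its last base to its first."""
    lo, hi = 0, len(bwt)
    for base in reversed(read):
        if base not in first:
            return 0, 0  # base never occurs in the genome
        # LF-mapping: count occurrences of `base` in L above lo and above hi,
        # then jump to the corresponding rows in the block of `base` in F.
        lo = first[base] + bwt[:lo].count(base)
        hi = first[base] + bwt[:hi].count(base)
        if lo >= hi:
            return 0, 0  # no row matches: no alignment
    return lo, hi


sa, bwt, first = bwt_index("AGGAGC$")
lo, hi = backward_search("AGG", bwt, first)
print(bwt, (lo, hi))
```

The returned range only identifies rows of the BWT matrix. Translating a row into a genome coordinate needs the suffix array entry for that row (`sa[lo:hi]` in this sketch). Counting with `bwt[:lo].count(base)` is linear in the genome length; real FM-index implementations store sampled counts instead.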
BWT: Lookups
We can look up where a read matches in our genome

SA   F                       L
6    $    A1 G2 G0 A0 G1     C0
3    A0   G1 C0 $  A1 G2     G0
0    A1   G2 G0 A0 G1 C0     $
5    C0   $  A1 G2 G0 A0     G1
2    G0   A0 G1 C0 $  A1     G2
4    G1   C0 $  A1 G2 G0     A0
1    G2   G0 A0 G1 C0 $      A1

Read = "AGG"
Our read aligns at pos 0
T = AGGAGC$ (match starts at pos 0)

BWT for alignment
• Entire SA is 12Gb for human genome
• FM-index
  • We only store certain parts of the array
  • We can calculate missing parts on the fly
• Human genome can be effectively indexed and searched using 3Gb RAM!

Implementation in BWA
• Burrows Wheeler Aligner (BWA) can use:
  • bwa aln: First ~30nt of read as seed
    • Extend around positions with seed match
  • bwa mem: Multiple short seeds across the read
    • Extend around positions with several seed matches

Exercise
Genome: GTAC$
All possible transformations
Lexicographically sorted
The BWT is: __________

Single vs. Paired alignment
• Always get paired end reads (if possible)
  • Can map across repeats
  • Fewer mapping errors
  • Unmapped read can be "rescued" by a good aligning mate

Coverage
• Coverage/depth is how many times that your data covers the genome (on average)
• $C = \frac{N \times L}{G}$
• Example:
  • N: Number of reads: 5 mill
  • L: Read length: 100
  • G: Genome size: 5 Mbases
  • C = 5*100/5 = 100X
• On average there are 100 reads covering each position in the genome

Actual depth
• We aligned reads at ~90X average depth to the genome - how much do we actually cover?
~50%
[Figure: two plots for mapping.flt.sort.rmdup.bam. Left: histogram of depth. Right: % genome covered at >= X depth]
Observations
• Avg. depth ~ 90X
• Range from 0-250X
• Only 50% of the genome was covered with reads

SAM/BAM format
• Sequence Alignment / Map format
• BAM = Binary SAM and zipped - always convert to BAM
• Two sections
  • Header: All lines start with "@"
  • Alignments: All other lines
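The two-section layout of a SAM file can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: `split_sam` is a hypothetical helper name, the three example lines are invented data, and in practice BAM files are read with dedicated tools rather than by hand.

```python
def split_sam(lines):
    """Split SAM text into header lines (starting with '@') and alignment
    lines (everything else), skipping empty lines."""
    header, alignments = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            header.append(line)
        else:
            alignments.append(line)
    return header, alignments


# Invented example: two header lines and one alignment record.
example = [
    "@HD\tVN:1.6\tSO:coordinate\n",
    "@SQ\tSN:chr1\tLN:6\n",
    "read1\t0\tchr1\t1\t60\t3M\t*\t0\t0\tAGG\tIII\n",
]
header, alignments = split_sam(example)

# The first mandatory alignment fields are tab-separated:
# read name, flag, reference name, 1-based position.
qname, flag, rname, pos = alignments[0].split("\t")[:4]
print(len(header), len(alignments), rname, pos)
```

Testing the first character is enough here because read names in alignment lines are not allowed to start with "@".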