Simon Rasmussen 27626: Next Generation Sequencing

Short (and Long) Read
Alignment
Preprocessing and SNP calling
Natasja S. Ehlers, PhD student
Center for Biological Sequence Analysis
Functional Human Variation Group
Simon Rasmussen
27626: Next Generation Sequencing analysis
CBS - DTU
27626 - Next Generation Sequencing Analysis
Data size
Generalized NGS analysis
Raw
Question
reads
Application
Compare
Assembly:
Prespecific:
samples / Answer?
Alignment /
processing
Variant calling,
methods
de novo
count matrix, ...
27626 - Next Generation Sequencing Analysis
Alignment/Mapping
• Assemble your reads by aligning them to a closely
related reference genome
• High sequence similarity between individuals makes this
possible
Reads
Genome
27626 - Next Generation Sequencing Analysis
Sounds easy?
• Some pitfalls:
Divergence
between
sample
and
reference
genome
•
• Repeats in the genome
• Recombination and re-arrangements
• Poor reference genome quality
Read
errors
•
• Regions not in the ref. genome
27626 - Next Generation Sequencing Analysis
Alignment approaches
• Short reads: global read alignment (or glocal for little longer
reads)
• Long reads: local alignments (more likely to have gap/indels)
• Exact string match
Creating
hash
tables
(maybe
you
know
these
from
Perl/
•
Python/Java/C, ...)
• BWT + suffix arrays
27626 - Next Generation Sequencing Analysis
Simplest solution
• Exact string matches:
Reference: ACGTGCGGACGCTGAACGTGACG
Read: GTG
GTG
G-TG
GTG
• We need to allow mismatches/indels (SmithWaterman, Needleman-Wunsch)
• One of the worlds fastest computer (K
computer - RIKEN)
• 20 mill reads 100 nt reads vs. human genome
~ 1 month
• We search each read vs. the entire reference
27626 - Next Generation Sequencing Analysis
Complexity: O(n*m)
How about BLAST?
• Everybody uses BLAST
• Everybody will believe your BLAST hits (almost)
What we can learn: Reducing the
search space
• However BLAST
• finds local alignments - not always what we want for short reads
• and other stuff (alignment scores, output format, speed)
• Not practical for short reads!
27626 - Next Generation Sequencing Analysis
Smart solution
1. Use algorithm to quickly find possible matches
3.2Gb
Drastically reduced search space
X possible matches
2. Allow us to perform slow/precise alignment
for possible matches (Smith-Waterman)
1 best
match
27626 - Next Generation Sequencing Analysis
Hash based algorithms
Lookups in hashes are fast!
1. Index the reference
using k-mers.
2. Search reads vs. hash kmers
3. Perform alignment of
entire read around seed
4. Report best alignment
27626 - Next Generation Sequencing Analysis
Key
.
.
.
ACTGCGTGTGA
ACTGCGTGTGC
ACTGCGTGTGT
.
.
.
Value
.
.
.
Chr1_pos1234; Chr2_pos567
Chr7_posX
Chr7_posZ; ...
.
.
.
Also known as Seed and extend
Spaced seeds
• Key/k-mer is called a seed
BLAST
uses
k=11
and
all
must
•
L = 11, 11 matches
• Smarter: Spaced seeds (only care
111010010100110111
be matches
about “1” in seed)
•
• One can use several seeds
Higher sensitivity
27626 - Next Generation Sequencing Analysis
11111111111
L = 18, 11 matches
Multiple seeds & drawbacks
• One could require multiple short seeds
• Instead of extending around each seed, extend around
positions with several seed matches (SHRiMP)
•
Drawbacks of hash-based approaches:
•
• Poor hashing may lead to slow alignment
Lots(!) of RAM to keep index in memory (hg ~48Gb!)
27626 - Next Generation Sequencing Analysis
Burrows-Wheeler Transform
•
Hash based aligners require lots of memory and are
only reasonable fast
• Can we make it better/faster?
• Burrows Wheeler Transformation (BWT), Suffix Arrays
and FM-index
• BWT originally created for compression
(implemented in bzip2)
27626 - Next Generation Sequencing Analysis
The concepts
• Burrows-Wheeler Transform (BWT)
• A reversible transformation of the genome
• Suffix Array is “array of integers giving the starting positions
of suffixes of a string in lexicographical order”
• Ferragina and Manzini (FM) index
• Allows us to recreate parts of the Suffix Array on the fly
• First create BWT of the genome (=index)
27626 - Next Generation Sequencing Analysis
BWT: Create index
Genome
T = AGGAGC$
0
1
2
3
4
5
6
AGGAGC$
GGAGC$A
GAGC$AG
AGC$AGG
GC$AGGA
C$AGGAG
$AGGAGC
27626 - Next Generation Sequencing Analysis
1. Create all lexicographically
possible transformations
Marks end-of-string,
smallest of
2.
the strings lexicographically
theSort
string
to
create
(move
firstBWT
basematrix
to end)and Suffix
Array
6
3
0
5
2
4
1
SA
$
A
A
C
G
G
G
A
G
G
$
A
C
G
G
C
G
A
G
$
A
G
$
A
G
C
A
G
A
A
G
G
$
G
C
G
G
C
A
A
G
$
BWT matrix
C
G
$
G
G
A
A
BWT
This one we need later
6
3
0
5
2
4
1
SA
$
A
A
C
G
G
G
A
G
G
$
A
C
G
G
C
G
A
G
$
A
G
$
A
G
C
A
G
A
A
G
G
$
G
C
G
G
C
A
A
G
$
BWT matrix
27626 - Next Generation Sequencing Analysis
C
G
$
G
G
A
A
F
$
A
A
C
G
G
G
A
G
G
$
A
C
G
G
C
G
A
G
$
A
G
$
A
G
C
A
G
A
A
G
G
$
G
C
G
G
C
A
A
G
$
L
C
G
$
G
G
A
A
F = First column
L = Last column
BWT: T-rank
T = AGGAGC$
T-ranking: # of times the base
occurred previously in T
A0 G0 G1 A1 G2 C0 $
27626 - Next Generation Sequencing Analysis
F
$
A
A
C
G
G
G
A
G
G
$
A
C
G
G
C
G
A
G
$
A
G
$
A
G
C
A
G
A
A
G
G
$
G
C
G
G
C
A
A
G
$
L
C
G
$
G
G
A
A
BWT: T-rank
T = AGGAGC$
T-ranking: # of times the base
occurred previously in T
A0 G0 G1 A1 G2 C0 $
F
$
A1
A0
C0
G1
G2
G0
A0
G2
G0
$
A1
C0
G1
G0
C0
G1
A0
G2
$
A1
G1
$
A1
G0
C0
A0
G2
A1
A0
G2
G1
$
G0
C0
G2
G0
C0
A1
A0
G1
$
L
C0
G1
$
G2
G0
A1
A0
Notice that individual base-rank is the same in F and L
Rank will always be the same in F and L
27626 - Next Generation Sequencing Analysis
BWT: B-rank
B-ranking: Ranked based on
occurrence in F/L
F
$
A0
A1
C0
G0
G1
G2
A1
G1
G2
$
A0
C0
G0
G2
C0
G0
A1
G1
$
A0
G0
$
A0
G2
C0
A1
G1
27626 - Next Generation Sequencing Analysis
A0
A1
G1
G0
$
G2
C0
G1
G2
C0
A0
A1
G0
$
L
C0
G0
$
G1
G2
A0
A1
T = AGGAGC$
B-rank A1 G2 G0 A0 G1 C0 $
T-rank A0 G0 G1 A1 G2 C0 $
BWT is reversible
LF-mapping: LF can be used to
recreate the original genome
F
$
A0
A1
C0
G0
G1
G2
A1
G1
G2
$
A0
C0
G0
G2
C0
G0
A1
G1
$
A0
G0
$
A0
G2
C0
A1
G1
27626 - Next Generation Sequencing Analysis
A0
A1
G1
G0
$
G2
C0
G1
G2
C0
A0
A1
G0
$
L
C0
G0
$
G1
G2
A0
A1
C0 G1 A0 G0 G2 A1 $
Reversed:
A1 G2 G0 A0 G1 C0
T = AGGAGC$
F can be computed = 2x A, 1x C, 3x G
We therefore only need to store L
BWT: Lookups
We can look up where a read
matches in our genome
F
$
A0
A1
C0
G0
G1
G2
A1
G1
G2
$
A0
C0
G0
G2
C0
G0
A1
G1
$
A0
G0
$
A0
G2
C0
A1
G1
27626 - Next Generation Sequencing Analysis
A0
A1
G1
G0
$
G2
C0
G1
G2
C0
A0
A1
G0
$
L
C0
G0
$
G1
G2
A0
A1
Read = “AGG”
Start from last base:
AGG
Find all rows that
starts with “G”
BWT: Lookups
We can look up where a read
matches in our genome
F
$
A0
A1
C0
G0
G1
G2
A1
G1
G2
$
A0
C0
G0
G2
C0
G0
A1
G1
$
A0
G0
$
A0
G2
C0
A1
G1
27626 - Next Generation Sequencing Analysis
A0
A1
G1
G0
$
G2
C0
G1
G2
C0
A0
A1
G0
$
L
C0
G0
$
G1
G2
A0
A1
Read = “AGG”
Start from last base:
AGG
Find all rows that
starts with “G”
Simple because F is
sorted
BWT: Lookups
We can look up where a read
matches in our genome
F
$
A0
A1
C0
G0
G1
G2
A1
G1
G2
$
A0
C0
G0
G2
C0
G0
A1
G1
$
A0
G0
$
A0
G2
C0
A1
G1
27626 - Next Generation Sequencing Analysis
A0
A1
G1
G0
$
G2
C0
G1
G2
C0
A0
A1
G0
$
L
C0
G0
$
G1
G2
A0
A1
Read = “AGG”
Start from last base:
AGG
In this range, match
second last base in L
BWT: Lookups
We can look up where a read
matches in our genome
F
$
A0
A1
C0
G0
G1
G2
A1
G1
G2
$
A0
C0
G0
G2
C0
G0
A1
G1
$
A0
G0
$
A0
G2
C0
A1
G1
27626 - Next Generation Sequencing Analysis
A0
A1
G1
G0
$
G2
C0
G1
G2
C0
A0
A1
G0
$
L
C0
G0
$
G1
G2
A0
A1
Read = “AGG”
Start from last base:
AGG
In this range, match
second last base in L
BWT: Lookups
We can look up where a read
matches in our genome
F
$
A0
A1
C0
G0
G1
G2
A1
G1
G2
$
A0
C0
G0
G2
C0
G0
A1
G1
$
A0
G0
$
A0
G2
C0
A1
G1
27626 - Next Generation Sequencing Analysis
A0
A1
G1
G0
$
G2
C0
G1
G2
C0
A0
A1
G0
$
L
C0
G0
$
G1
G2
A0
A1
Read = “AGG”
Start from last base:
AGG
Use LF mapping to
get to F
BWT: Lookups
We can look up where a read
matches in our genome
F
$
A0
A1
C0
G0
G1
G2
A1
G1
G2
$
A0
C0
G0
G2
C0
G0
A1
G1
$
A0
G0
$
A0
G2
C0
A1
G1
27626 - Next Generation Sequencing Analysis
A0
A1
G1
G0
$
G2
C0
G1
G2
C0
A0
A1
G0
$
L
C0
G0
$
G1
G2
A0
A1
Read = “AGG”
Start from last base:
AGG
Match next base
(A) in L
BWT: Lookups
We can look up where a read
matches in our genome
F
$
A0
A1
C0
G0
G1
G2
A1
G1
G2
$
A0
C0
G0
G2
C0
G0
A1
G1
$
A0
G0
$
A0
G2
C0
A1
G1
27626 - Next Generation Sequencing Analysis
A0
A1
G1
G0
$
G2
C0
G1
G2
C0
A0
A1
G0
$
L
C0
G0
$
G1
G2
A0
A1
Read = “AGG”
Start from last base:
AGG
Luckily there was
a match - else no
alignment
BWT: Lookups
We can look up where a read
matches in our genome
F
$
A0
A1
C0
G0
G1
G2
A1
G1
G2
$
A0
C0
G0
G2
C0
G0
A1
G1
$
A0
G0
$
A0
G2
C0
A1
G1
27626 - Next Generation Sequencing Analysis
A0
A1
G1
G0
$
G2
C0
G1
G2
C0
A0
A1
G0
$
L
C0
G0
$
G1
G2
A0
A1
Read = “AGG”
Start from last base:
AGG
Use LF mapping to
get to F
BWT: Lookups
We can look up where a read
matches in our genome
F
$
A0
A1
C0
G0
G1
G2
A1
G1
G2
$
A0
C0
G0
G2
C0
G0
A1
G1
$
A0
G0
$
A0
G2
C0
A1
G1
27626 - Next Generation Sequencing Analysis
A0
A1
G1
G0
$
G2
C0
G1
G2
C0
A0
A1
G0
$
L
C0
G0
$
G1
G2
A0
A1
Read = “AGG”
Start from last base:
AGG
This is our match!
BWT: Lookups
We can look up where a read
matches in our genome
F
$
A0
A1
C0
G0
G1
G2
A1
G1
G2
$
A0
C0
G0
G2
C0
G0
A1
G1
$
A0
G0
$
A0
G2
C0
A1
G1
27626 - Next Generation Sequencing Analysis
A0
A1
G1
G0
$
G2
C0
G1
G2
C0
A0
A1
G0
$
L
C0
G0
$
G1
G2
A0
A1
Read = “AGG”
Start from last base:
AGG
This is our match!
But where is this in
our genome?
BWT: Lookups
We can look up where a read
matches in our genome
SA F
6 $
3 A0
0 A1
5 C0
2 G0
4 G1
1 G2
A1
G1
G2
$
A0
C0
G0
G2
C0
G0
A1
G1
$
A0
G0
$
A0
G2
C0
A1
G1
27626 - Next Generation Sequencing Analysis
A0
A1
G1
G0
$
G2
C0
G1
G2
C0
A0
A1
G0
$
L
C0
G0
$
G1
G2
A0
A1
Read = “AGG”
BWT: Lookups
We can look up where a read
matches in our genome
SA F
6 $
3 A0
0 A1
5 C0
2 G0
4 G1
1 G2
A1
G1
G2
$
A0
C0
G0
G2
C0
G0
A1
G1
$
A0
G0
$
A0
G2
C0
A1
G1
27626 - Next Generation Sequencing Analysis
A0
A1
G1
G0
$
G2
C0
G1
G2
C0
A0
A1
G0
$
L
C0
G0
$
G1
G2
A0
A1
Read = “AGG”
Our read aligns at pos 0
T = AGGAGC$
pos 0
BWT for alignment
• Entire SA is 12Gb for human genome
FM-index
•
• We only store certain parts of the array
• We can calculate missing parts on the fly
•
Human genome can be effectively indexed and searched
using 3Gb RAM!
27626 - Next Generation Sequencing Analysis
Implementation in BWA
• Burrows Wheeler Aligner (BWA) can use:
bwa
aln:
First
~30nt
of
read
as
seed
•
• Extend around positions with seed match
• bwa mem: Multiple short seeds across the read
Extend
around
positions
with
several
seed
matches
•
27626 - Next Generation Sequencing Analysis
Genome: GTAC$
All possible
transformations
The BWT is: __________
Lexicographically
sorted
Single vs. Paired alignment
• Always get paired end reads (if possible)
Can
map
across
repeats
•
• Less mapping errors
?
Unmapped read can be “rescued” by a good aligning mate
27626 - Next Generation Sequencing Analysis
overage
Coverage
• Coverage/depth is how many times that your data covers the genome
(on average)
L
• Example:
C
=
N
⇥
• N: Number of reads: 5 mill
G
• L: Read length: 100
• G: Genome size: 5 Mbases
C = 5*100/5 = 100X
•
G On
: genome
size
• average there are 100 reads covering each position in the genome
N : number of reads
27626 - Next Generation Sequencing Analysis
Actual depth
• We aligned reads to the
~90X
genome - how much do
we actually cover?
~50%
mapping.flt.sort.rmdup.bam
mapping.flt.sort.rmdup.bam
100
1e+05
was covered with reads
% genome covered
8e+04
Observations
• Avg. depth ~ 90X
• Range from 0-250X
• Only 50% of the genome
80
6e+04
4e+04
40
20
2e+04
0e+00
0
20
40
60
Depth
27626 - Next Generation Sequencing Analysis
60
80
100
120
50
100
150
>= X depth
200
250
SAM/BAM format
• Sequence Alignment / Map format
• BAM = Binary SAM and zipped - always convert to BAM
• Two sections
Header:
All
lines
start
with
“@”
•
• Alignments: All other lines
27626 - Next Generation Sequencing Analysis