7. Sequence Assembly

Sequence Assembly
Assembling the data



Problem: the longest single read possible is 1,000 bp,
and most technologies produce reads of 50-500 bp.
Microbial genomes are about 2,000,000 bp.
So how do you sequence a whole genome?
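To put numbers on the problem above, here is a minimal sketch of how many reads are needed to cover a genome. The function name `reads_needed` and the 30x coverage figure are illustrative assumptions, not values from the slides.

```python
import math

def reads_needed(genome_bp: int, read_bp: int, coverage: float = 1.0) -> int:
    """Number of reads whose combined length equals coverage x genome_bp."""
    return math.ceil(genome_bp * coverage / read_bp)

# 2,000,000 bp genome, 500 bp reads, each base covered once on average
print(reads_needed(2_000_000, 500))
# Same genome, 50 bp reads, 30x average coverage (assumed depth)
print(reads_needed(2_000_000, 50, 30))
```

Even at 1x coverage, thousands of reads have to be put back in the right order, which is the job of the assembler.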
Sequencing the genomes

Extract DNA

Shear DNA into small pieces

Ligate adapters on each end

Sequencing using “next generation sequencing”
Sequence assembly

Before we look at the data

Can we make longer pieces?
The assembly




A hierarchical data structure that maps sequence data
to a reconstruction of the target.
The assembly groups

reads into contigs

contigs into scaffolds
Contigs provide

multiple sequence alignment of reads

consensus sequence.
Scaffolds provide

contig order and orientation

sizes of the gaps between contigs.
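The reads / contigs / scaffolds hierarchy described above could be represented roughly as follows. The class and field names are hypothetical, chosen only for illustration; real assemblers store much more (alignments, orientations, quality values).

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Contig:
    consensus: str                                   # consensus of the aligned reads
    reads: List[str] = field(default_factory=list)   # reads grouped into this contig

@dataclass
class Scaffold:
    contigs: List[Contig]   # in order, assumed already oriented
    gaps: List[int]         # estimated gap size between neighbouring contigs

    def sequence(self) -> str:
        """Join the contig consensus sequences, filling each gap with N."""
        if len(self.gaps) != len(self.contigs) - 1:
            raise ValueError("need exactly one gap size between each pair of contigs")
        parts = [self.contigs[0].consensus]
        for gap, contig in zip(self.gaps, self.contigs[1:]):
            parts.append("N" * gap)
            parts.append(contig.consensus)
        return "".join(parts)

scaffold = Scaffold([Contig("AACC"), Contig("GGTT")], gaps=[3])
print(scaffold.sequence())
```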
Sequence assembly
Reads
Contigs
Scaffolds
Four approaches to assembly
• Naïve approach
• Greedy approach
• Overlap / Layout / Consensus
• de Bruijn Graphs
Naïve approach




Compare every sequence to every other sequence
Find stretches that are the same
Need to account for phred scores: what if a base is wrong?
How long does a sequence need to be to be unique?
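The all-against-all comparison can be sketched as below. This is a simplified illustration that only looks for exact suffix/prefix matches and ignores quality scores and reverse complements; the names `overlap`, `all_overlaps` and `min_len` are assumptions made for the example.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when no match of at least min_len bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def all_overlaps(reads, min_len=3):
    """Compare every read to every other read: N * (N - 1) comparisons."""
    return {(i, j): n
            for i, a in enumerate(reads)
            for j, b in enumerate(reads)
            if i != j and (n := overlap(a, b, min_len))}

print(all_overlaps(["AACCGGT", "CCGGTTA"]))
```

The number of comparisons grows with the square of the number of reads, which is why this approach does not scale to real data sets.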
Sequence composition


4 bases
4^n possible sequences of length n, so a 1 in 4^n chance of
finding a given sequence if all bases are used evenly (they are not)

3 bp: 4^3 = 64

8 bp: 4^8 = 65,536

20 bp: 4^20 = 1,099,511,627,776
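The arithmetic above can be checked with a short sketch. It assumes uniformly random bases (which, as the slide notes, real genomes are not) and reuses the 2,000,000 bp genome size from earlier; the function names are illustrative.

```python
GENOME_BP = 2_000_000  # microbial genome size used earlier in these slides

def possible_sequences(n: int) -> int:
    """Number of distinct sequences of length n over 4 bases."""
    return 4 ** n

def expected_hits(n: int, genome_bp: int) -> float:
    """Expected chance occurrences of one given n-mer in a random genome."""
    positions = genome_bp - n + 1
    return positions / possible_sequences(n)

def min_unique_length(genome_bp: int) -> int:
    """Smallest n for which 4^n exceeds the genome length."""
    n = 1
    while possible_sequences(n) <= genome_bp:
        n += 1
    return n

for n in (3, 8, 20):
    print(n, possible_sequences(n), expected_hits(n, GENOME_BP))
print("shortest length with 4^n > genome:", min_unique_length(GENOME_BP))
```

Under these assumptions a sequence has to be longer than about 11 bp before it is expected to occur less than once by chance in a 2 Mbp genome.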
Problems with this approach

Sequences are not random

Most genomes contain biased information

Repeat sequences in the genome
Greedy approaches



Start with a sequence
Keep extending it while another sequence matches the end
When it cannot be extended further, mark it as a contig
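The greedy steps above can be sketched as follows. This is a toy version that only extends to the right, uses exact matches, and always takes the longest overlap; `greedy_contig` and `min_len` are names invented for the example.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def greedy_contig(seed: str, reads, min_len: int = 3) -> str:
    """Extend seed with the best-overlapping unused read until none matches."""
    contig = seed
    unused = list(reads)
    while True:
        best_len, best_read = 0, None
        for read in unused:
            n = overlap(contig, read, min_len)
            if n > best_len:
                best_len, best_read = n, read
        if best_read is None:
            return contig                    # cannot be extended: mark as a contig
        contig += best_read[best_len:]       # append only the non-overlapping tail
        unused.remove(best_read)             # each read is used at most once

print(greedy_contig("AACCGGT", ["GTTACGA", "CCGGTTA"]))
```

Because each choice is made locally, a repeat can send the extension down the wrong path, which is one reason greedy assemblers tend to stop early.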
Improve greedy approaches



Only use high quality sequence
Use reads that are represented more than n times in the sample (SSAKE)

End to end overlap vs. partial overlap

Ignores low coverage regions
… also incorporate quality scores
(SHARCGS)
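The two filtering ideas above (keep only high quality reads, and keep only reads seen more than n times) might look like this in outline. This is an illustration of the idea only, not the actual SSAKE or SHARCGS code; the thresholds and function names are assumptions. The Phred relationship P = 10^(-Q/10) is the standard definition.

```python
from collections import Counter

def phred_to_error(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def filter_reads(reads_with_quals, min_mean_q: float = 20, n: int = 1):
    """Keep sequences with mean Phred >= min_mean_q that occur more than n times."""
    good = [seq for seq, quals in reads_with_quals
            if quals and sum(quals) / len(quals) >= min_mean_q]
    counts = Counter(good)
    return [seq for seq, count in counts.items() if count > n]

reads = [
    ("ACGT", [30, 30, 30, 30]),
    ("ACGT", [40, 40, 40, 40]),
    ("TTTT", [30, 30, 30, 30]),   # good quality but seen only once
    ("ACGT", [5, 5, 5, 5]),       # low quality copy, dropped before counting
]
print(filter_reads(reads))
print(phred_to_error(20))
```

The trade-off is the one noted on the slide: regions with low coverage lose their reads and drop out of the assembly.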
In general, greedy approaches are fast but
not very good.
Make lots of short contigs
Overlap / Layout / Consensus



All versus all comparison (done with K-mers for
speed).
Generate approximate read layout as an
overlap graph.
Use multiple sequence alignments to resolve
layout.
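The first step (all versus all comparison, sped up with k-mers, producing an overlap graph) can be sketched as below. Only the overlap step is shown; layout and consensus are not. The k-mer index is one simple way to avoid comparing reads that share nothing, and the function names and default values are assumptions.

```python
from collections import defaultdict

def overlap(a: str, b: str, min_len: int) -> int:
    """Longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def overlap_graph(reads, k: int = 4, min_len: int = 4):
    """Return {i: {j: overlap_length}} for reads i -> j that overlap."""
    if min_len < k:
        raise ValueError("min_len must be >= k, or the k-mer index can miss overlaps")
    # Index reads by k-mer so that only reads sharing a k-mer are compared.
    index = defaultdict(set)
    for i, read in enumerate(reads):
        for p in range(len(read) - k + 1):
            index[read[p:p + k]].add(i)
    candidates = set()
    for ids in index.values():
        candidates.update((i, j) for i in ids for j in ids if i != j)
    graph = defaultdict(dict)
    for i, j in candidates:
        n = overlap(reads[i], reads[j], min_len)
        if n:
            graph[i][j] = n
    return dict(graph)

print(overlap_graph(["AACCGGT", "CCGGTTA", "TTTTTTT"]))
```

Here reads are the nodes and overlaps are the edges; the third read shares no k-mer with the others, so it is never compared at all.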
Newbler (O/L/C)

Makes unitigs


Merge unitigs into contigs.



Single contigs with no discrepancies
May split unitigs and even reads (could be
chimeras)
Use coverage to compensate for base calls
Works in flow space to calculate homopolymeric
tracts.

More accurate than average of averages
Assembly is a “graph” problem
• Overlap/Layout/Consensus
• de Bruijn Graph
• Greedy graphs
• A graph is nodes + edges
node
edge
Assemble these two sequences!
AACCGGT
CCGGTTA
Consensus: AACCGGTTA
AACCGGT as graphs
Node = K-mers; edges = nodes that overlap by K-1 bases.
Here K = 4, but in reality K = 19 to 31
aacc
accg
ccgg
cggt
CCGGTTA as graphs
ccgg
cggt
ggtt
gtta
Join the two graphs
ccgg
cggt
ggtt
gtta
aacc
accg
ccgg
cggt
AACCGGTTA
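The worked example can be reproduced with a short sketch that follows the convention used on these slides: nodes are K-mers, and an edge joins two K-mers that are adjacent in a read (so they overlap by K-1 bases). The function names are illustrative, and the walk only handles the simple unbranched case shown here.

```python
def kmers(seq: str, k: int):
    """All k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_graph(reads, k: int = 4):
    """Map each k-mer node to the set of k-mers that follow it in some read."""
    graph = {}
    for read in reads:
        ks = kmers(read.lower(), k)
        for node in ks:
            graph.setdefault(node, set())    # shared k-mers collapse into one node
        for a, b in zip(ks, ks[1:]):
            graph[a].add(b)
    return graph

def walk(graph) -> str:
    """Spell the sequence along a single unbranched path through the graph."""
    targets = {b for successors in graph.values() for b in successors}
    starts = [node for node in graph if node not in targets]
    if len(starts) != 1:
        raise ValueError("expected exactly one start node")
    node = starts[0]
    seq, seen = node, {node}
    while graph[node]:
        if len(graph[node]) > 1:
            raise ValueError("branch at " + node)
        node = next(iter(graph[node]))
        if node in seen:
            raise ValueError("cycle at " + node)
        seen.add(node)
        seq += node[-1]                      # each step adds one new base
    return seq.upper()

graph = build_graph(["AACCGGT", "CCGGTTA"], k=4)
print(sorted(graph))
print(walk(graph))
```

Joining the two graphs happens automatically: the shared nodes ccgg and cggt are stored once, so the two reads merge into a single path.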
Differences between overlap graphs and
de Bruijn graphs for assembly.
Schatz M C et al. Genome Res. 2010;20:1165-1173
©2010 by Cold Spring Harbor Laboratory Press
Problems with all assemblies

Sequences are not random

Most genomes contain biased information

Repeat sequences in the genome
Repeats

Exact repeats

How does basecalling cope?

High coverage versus high error rates

Polymorphic repeats

Real SNPs (between non-clonal individuals)

Polymorphic haplotypes (eukaryotes)
Graphs get very complex
“Spurs” from bad base calls
Polymorphisms cause “bubbles”
Repeats have multiple sinks/sources
16S
Salmonella has 7 rrn operons
Salmonella recombines at rrn operons
Helm and Maloy
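The way a repeat creates extra ways into and out of a region of the graph can be shown with a toy example. The sequence below is made up for illustration and contains the repeat TCCGGA twice; with K = 4 the repeat collapses into one path that two different flanks enter and leave.

```python
def kmers(seq: str, k: int):
    """All k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def branch_nodes(seq: str, k: int = 4):
    """Return (nodes with >1 successor, nodes with >1 predecessor)."""
    successors, predecessors = {}, {}
    ks = kmers(seq, k)
    for a, b in zip(ks, ks[1:]):
        successors.setdefault(a, set()).add(b)
        predecessors.setdefault(b, set()).add(a)
    multi_out = {n for n, s in successors.items() if len(s) > 1}
    multi_in = {n for n, p in predecessors.items() if len(p) > 1}
    return multi_out, multi_in

# Toy sequence (an assumption for this example): TCCGGA occurs twice.
multi_out, multi_in = branch_nodes("GATCCGGATTCCGGAAC", k=4)
print("more than one way out:", multi_out)
print("more than one way in:", multi_in)
```

At the node with two ways out, the graph alone cannot say which flank follows which copy of the repeat, which is the ambiguity that paired reads are used to resolve.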
Repeat sequences

What happens if the repeat is longer than the
read length?

Need paired end reads to resolve order

Need pairs that span the repeat

Need pairs with one end in the repeat
Paired end sequencing
Add linkers
Nick
Sequencing
migration
Repeats
Paired end reads or mate pairs (figure labels: A, B, C)
Discussion point

Should you pair sequences before analyzing?

Should you throw away singletons?

What happens if ½ reads pair and ½ not?
N50

Sort contigs from longest to shortest: N50 is the length of
the contig at which the running total reaches 50% of the
assembled bases

Measure of assembly quality

Longer N50 is better
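N50 can be computed directly from a list of contig lengths. This is a minimal sketch of the usual definition; the contig lengths in the example are invented.

```python
def n50(lengths) -> int:
    """Contig length at which the running total, longest first, reaches half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs")

# Made-up contig lengths: total 290, so the halfway point is 145.
print(n50([80, 70, 50, 40, 30, 20]))
```

In this example the two longest contigs (80 + 70 = 150) already hold more than half of the 290 assembled bases, so the N50 is the length of the second one.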
N50 of Vibrio sequence assemblies
Assemblers
Current assemblers
AMOS
Celera WGA Assembler
CLC Genomics Workbench
DNA Dragon
DNAnexus
Euler
Geneious
IDBA (Iterative De Bruijn graph
short read Assembler)
LIGR Assembler (derived from
TIGR Assembler)
MIRA (Mimicking Intelligent
Read Assembly)
Newbler
Phrap
SSAKE
SOAPdenovo
SPAdes
Velvet
Assembly RAST

Pipeline for automatic assembly

Works with fasta, fastq, single end, paired end

Runs multiple assemblers in parallel

Combines contigs
ARAST
Module            Stages                             Description
a5                preprocess,assembler,post-process  A5 microbial assembly pipeline
a6                preprocess,assembler,post-process  Modified A5 microbial assembly pipeline
bhammer           preprocess                         SPAdes component for quality control of sequence data
bowtie2           post-process                       Bowtie2 aligner that maps reads to contigs
bwa               post-process                       BWA aligner that maps reads to contigs
fastqc            preprocess                         FastQC quality control tool for sequence data
filter_by_length  preprocess                         Length-based sequencing reads filter and trimmer based on seqtk
idba              assembler                          IDBA iterative graph-based assembler for single-cell
kiki              assembler                          Kiki overlap-based parallel microbial and metagenomic assembler
quast             post-process                       QUAST assembly quality assessment tool (run by default)
ray               assembler                          Ray graph-based parallel microbial and metagenomic assembler
reapr             post-process                       REAPR assembly error recognizer using paired-end reads
sga_ec            preprocess                         SGA component for error correction
sga_preprocess    preprocess                         SGA component for preprocessing reads
spades            preprocess,assembler               SPAdes based on paired de Bruijn graphs
sspace            post-process                       SSPACE pre-assembled contig scaffolder
swap              assembler                          SWAP Assembler
tagdust           preprocess                         TagDust sequencing artifacts remover
trim_sort         preprocess                         DynamicTrim and LengthSort from SolexaQA
velvet            assembler                          Velvet de Bruijn graph based assembler
ARAST
ar-run --single Ecoli_DH10B_Control_200.fastq -m "E. coli DH10B Control 200" -a velvet spades
Hybrid assembly
Geni Silva
Sequence assembly
Reads
Contigs
Scaffolds
scaffold_builder
http://edwards.sdsu.edu/scaffold_builder
Silva et al. Source Code for Biology and Medicine 2013,
Bandage to view graphs
Bandage: Ryan Wick: http://rrwick.github.io/Bandage
Discussion points

Should we assemble or not?

How does assembly affect ecological analyses?