Assessing Genome Assembly
Dent A. Earl, Benedict Paten, David Haussler
1,2
BIOINFORMATICS
BIOMOLECULAR ENGINEERING DEPT
A good Assembly should...
Sequencing a genome has traditionally been time-consuming, complicated, labor-intensive and expensive. Recent changes in sequencing technology have
reduced the money, time and labor required, but have made genome assembly (the process of putting together small sequences
to make longer sequences) a computationally harder problem.
Cover the target
Sequences were long but few, and very time-consuming and
expensive to produce.
Imagine the genome is a simple phrase, like “She sells sea shells by
the sea shore”. Genome sequencing would involve breaking up the
phrase into smaller pieces at random, making lots of copies, and
then reading the resulting fragments.
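The phrase analogy can be tried out in a few lines of Python. This is only a minimal sketch: the function name `simulate_fragments`, the fragment lengths and the fixed random seed are illustrative assumptions, not part of the poster or of any sequencing tool.

```python
import random

def simulate_fragments(genome, n_fragments, min_len, max_len, seed=0):
    """Cut random substrings ("reads") out of a text "genome".

    Hypothetical helper for illustration; assumes min_len <= len(genome).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    fragments = []
    for _ in range(n_fragments):
        # Pick a fragment length, then a start position where it still fits.
        length = rng.randint(min_len, min(max_len, len(genome)))
        start = rng.randint(0, len(genome) - length)
        fragments.append(genome[start:start + length])
    return fragments

phrase = "She sells sea shells by the sea shore"
reads = simulate_fragments(phrase, n_fragments=20, min_len=5, max_len=12)
for read in reads:
    print(repr(read))
```

Each printed read is a random piece of the phrase; many overlapping reads together play the role of the "lots of copies" described above.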
Can you solve this phrase assembly puzzle?
it is not the c
not the critic
not the critic
e critic who co
it is
Answer:
it is not the critic who counts
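One simple way to solve such a puzzle by computer is a greedy overlap merge: repeatedly join the two fragments that share the longest suffix/prefix overlap. The sketch below applies this to fragments shown on the poster. It is a toy illustration under that assumption (the names `overlap` and `greedy_assemble` are hypothetical), not the method used by any particular assembler, and greedy merging can fail on repetitive text.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(fragments):
    """Greedily merge fragments by largest overlap (toy sketch)."""
    # Drop exact duplicates and fragments wholly contained in another one.
    frags = list(dict.fromkeys(fragments))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more; leave separate pieces
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags

puzzle_reads = [
    "it is not the c", "not the critic", "not the critic", "e critic who co",
    "it is", "it is not t", "is not the crit", "ritic who count", "ic who counts",
]
assembled = greedy_assemble(puzzle_reads)
print(assembled)
```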
Individual sequences from second-generation sequencing machines
are short and numerous, but also quickly generated and less expensive to produce.
Overview
Maintain order and orientation
[Figure: positions of two points, A and B, in Assembly 1, Assembly 2 and Assembly 3]
For any two points in the true
genome, a good assembly should
contain those points and agree with
the truth both in order and orientation.
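The pairwise criterion above can be written as a small check. This is a sketch under stated assumptions: the `Placement` record, its fields and the function name are hypothetical, and each point is assumed to have already been mapped to an assembled sequence, a position and a strand.

```python
from typing import NamedTuple, Optional

class Placement(NamedTuple):
    sequence: str  # name of the assembled sequence containing the point
    position: int  # coordinate of the point within that sequence
    strand: int    # +1 if on the same strand as the truth, -1 if reversed

def pair_is_consistent(true_a: int, true_b: int,
                       a: Optional[Placement], b: Optional[Placement]) -> bool:
    """True if two points of the true genome are both present in the
    assembly, on one sequence, in the same relative order and orientation."""
    if a is None or b is None:
        return False          # at least one point is missing from the assembly
    if a.sequence != b.sequence or a.strand != b.strand:
        return False          # split across sequences, or locally inverted
    true_order = true_b - true_a
    # A reverse-complemented sequence flips coordinates, so undo the flip.
    assembled_order = (b.position - a.position) * a.strand
    return true_order * assembled_order > 0

forward = pair_is_consistent(10, 90, Placement("s1", 5, 1), Placement("s1", 85, 1))
reverse = pair_is_consistent(10, 90, Placement("s1", 90, -1), Placement("s1", 10, -1))
swapped = pair_is_consistent(10, 90, Placement("s1", 85, 1), Placement("s1", 5, 1))
print(forward, reverse, swapped)
```

A sequence assembled as the reverse complement of the truth still counts as consistent here, since order and orientation agree once the whole sequence is flipped.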
Be lengthy
N50 Statistics
Calculated without knowing the answer: Scaffold NG50, Contig NG50
Calculated using the answer: Scaffold Path NG50, Contig Path NG50
[Figure: a scaffold (ACGTACGTACGT) and its contigs (ACG, ACGTACGTA)]
[Figure: text fragments from the phrase puzzle: “ang”, “e”, “a t”, “a t”, “led”, “we”, “web”, “led”, “e w”, “ve”]
Since we know the answer, we can
calculate path values, which are far
more accurate measurements.
An N50 value is a weighted median:
half of the assembly is contained in
sequences whose length is greater than
or equal to that value.
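As a worked example, the weighted median can be computed by sorting sequence lengths from longest to shortest and accumulating them until half of the total is reached. The sketch below assumes the common convention that NG50 uses the genome size, rather than the assembly size, as the total; the function names are illustrative.

```python
def n50(lengths, total=None):
    """Length L such that sequences of length >= L hold at least half of
    `total` bases. `total` defaults to the sum of the given lengths."""
    if total is None:
        total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # the sequences never reach half of `total`

def ng50(lengths, genome_size):
    """Same statistic, but measured against the genome size (assumed known)."""
    return n50(lengths, total=genome_size)

contig_lengths = [2, 3, 4, 5, 6]
print(n50(contig_lengths))        # half of 20 bases is reached at length 5
print(ng50(contig_lengths, 40))   # half of 40 bases is reached at length 2
```

Measuring against the genome size makes values comparable between assemblies of different total length, which plain N50 does not.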
Answer:
[Figure: short ACGT sequence fragments, e.g. TAC GTACG, CGTACGT, ACG TACGTA, T ACGTACGT, ACGT ACGT, ...]
A good assembly should, on the
whole, contain lengthy sequences
(both contigs and scaffolds).
Compare predictions
to the true solution.
oh what a tangled web we weave
Simulate a genome
and sequencing.
[Figure: text fragments from the phrase puzzle: “t a”, “ngl”, “led”, “wh”, “eb”, “t a”, “o”, “a t”, “hat”, “e”, “web”, “at”, “we”, “we”, “d w”, “we”, “t a”, “we”, “web”, “we”]
CpG Islands
Long assemblies aren’t any good if they
have lots of mistakes. A good assembly
should have few substitution errors.
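A substitution error rate can be sketched as the fraction of aligned positions where the assembly and the truth disagree. The example assumes the two sequences have already been aligned to equal length, with "-" marking gaps; the function name and gap convention are illustrative assumptions.

```python
def substitution_rate(truth, assembled):
    """Fraction of aligned, non-gap columns where the bases differ.

    Assumes both strings come from an existing alignment (same length,
    '-' for gaps). Gap columns are skipped: they are indels, not substitutions.
    """
    if len(truth) != len(assembled):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    mismatches = 0
    for t, a in zip(truth, assembled):
        if t == "-" or a == "-":
            continue
        compared += 1
        if t != a:
            mismatches += 1
    return mismatches / compared if compared else 0.0

rate = substitution_rate("ACGTACGT", "ACCTACGA")
print(rate)  # 2 mismatches over 8 compared columns
```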
BUT, the process of assembly is much more challenging! Consider
this puzzle:
Truth
Genes
Have correct bases
“Second generation” sequencing
[Figure: text fragments from the phrase puzzle: “oh”, “we”, “wh”, “h w”, “ed”, “ave”, “eav”, “gle”, “tan”, “eav”]
Knowing the genome (from the
simulation) allows us to see
where the sequences produced by
the assemblers map, and how long
they are.
A good assembly should cover most
of the genome with long sequences.
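Coverage of the target can be illustrated by taking the intervals where assembled sequences map onto the known genome and measuring the fraction of bases covered at least once. This is a minimal sketch: the half-open `(start, end)` interval convention and the function name are assumptions made for the example.

```python
def covered_fraction(intervals, genome_length):
    """Fraction of the genome covered by at least one mapped sequence.

    `intervals` holds half-open (start, end) coordinates on the true genome.
    Overlapping intervals are only counted once.
    """
    covered = 0
    end_of_last = 0  # right edge of the region already counted
    for start, end in sorted(intervals):
        start = max(start, end_of_last)  # skip bases counted earlier
        end = min(end, genome_length)    # ignore anything past the genome end
        if end > start:
            covered += end - start
            end_of_last = end
    return covered / genome_length

mapped = [(0, 10), (5, 20), (30, 40)]
fraction = covered_fraction(mapped, genome_length=50)
print(fraction)  # 30 of 50 bases are covered
```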
Assemblies
Annotations
In the early days of sequencing...
Seq. Length
2
Bases
Sequencing a genome involves taking DNA, breaking it into small
pieces, reading the nucleotides (ACGT) in the sequence, and then assembling the small sequences into longer and longer sequences.
[Figure: text fragments from the phrase puzzle: “ta”, “d w”, “ed”, “a”, “tan”, “b w”, “wh”, “eav”, “ave”, “ta”]
1,2,3
1. Bioinformatics Graduate Program, UCSC; 2. Center for Biomolecular Science and Engineering, UCSC; 3. Howard Hughes Medical Institute
Background
e critic who co
it is not t
is not the crit
ritic who count
ic who counts
2
Get the copy number right
The genome can have many copies of a particular bit of DNA. Producing
an accurate picture of these regions from an assembly is medically
important (cancer, other diseases), but it is an especially challenging problem.
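A toy way to see a copy-number error is to count how often a repeated unit occurs in the true sequence and in the assembly. The sequences and the function name below are made up for illustration; real copy-number evaluation works on aligned genomic regions rather than exact string matches.

```python
def count_occurrences(sequence, unit):
    """Count (possibly overlapping) exact occurrences of `unit` in `sequence`."""
    if not unit:
        raise ValueError("unit must be a non-empty string")
    count = 0
    start = sequence.find(unit)
    while start != -1:
        count += 1
        start = sequence.find(unit, start + 1)
    return count

truth = "TTACGACGACGTT"      # hypothetical genome with three copies of ACG
collapsed = "TTACGACGTT"     # hypothetical assembly that lost one copy
true_copies = count_occurrences(truth, "ACG")
assembled_copies = count_occurrences(collapsed, "ACG")
print(true_copies, assembled_copies)
```

Collapsing repeats like this is a typical failure when reads are shorter than the repeated region, which is part of why the problem is hard.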
E
Let Assemblers puzzle.
(We had a contest called Assemblathon!)
http://assemblathon.org/
http://users.soe.ucsc.edu/~dearl/posters/
6 May 2011 UCSC Graduate Research Symposium