Presented by Samuel Chapman
Pyrosequencing-Intro
 Pyrosequencing reads a sequence by extending a
complementary strand along a single-stranded DNA
template and detecting each extension.
 A single run can produce about 400,000 reads of around
250 bp each.
 The process takes half a day and costs only several
thousand dollars.
Pyrosequencing-Intro
 To sequence bacterial communities, primers that match
known, conserved flanking regions are used to target a
homologous region. PCR then amplifies the number of
copies of that region.
 The variation among the population lies in the middle,
between the primers.
 Amplification increases the number of copies of each
sequence, but the proportions for each species stay the
same.
Pyrosequencing-Intro
These regions are
homologous, but only the
conserved primer regions
are the same. The middle
areas can be different.
These regions will be our
sequencing reads.
Pyrosequencing-Methods
 Each separate DNA fragment is attached to its own bead.
PCR is then performed, so that each bead carries many
copies of one fragment.
 Each bead is put into one of hundreds of thousands of
separate wells, so that each well has a distinct sample
(although two wells may have identical samples).
 The DNA on the beads is single-stranded, and the
primer is attached, allowing for extension.
 Enzymes and chemicals are added so that, every time a
new base is added, light is released.
Pyrosequencing-Methods
 Bases are added by flooding the well plate with one
nucleotide, washing it away, repeating with the other
three nucleotides, then starting over.
 Ex: A..T..C..G | A..T..C..G | A..T..C..G where ‘..’ represents
washing and ‘|’ denotes a new cycle.
 Note: if a sequence has two or more of the same base in
a row, all of them are added in one step.
 If more than one base is added at once, more light is
emitted from that well.
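The flow cycle and the homopolymer rule above can be sketched in a few lines. This is a minimal illustration, not the instrument's software: the function name `ideal_flowgram` and its `flow_order` parameter are made up for this example, and the output is the noise-free signal, with one unit of light per incorporated base.

```python
def ideal_flowgram(sequence, flow_order="ATCG"):
    """Return the noise-free signal for each nucleotide flow.

    Hypothetical helper: one unit of light per incorporated base,
    so a homopolymer of length n gives a value of n in a single flow.
    """
    unknown = set(sequence) - set(flow_order)
    if unknown:
        raise ValueError(f"bases not in flow order: {sorted(unknown)}")

    flows = []
    pos = 0   # next base of the sequence still to be read
    step = 0  # index of the current nucleotide flow
    while pos < len(sequence):
        base = flow_order[step % len(flow_order)]
        run = 0
        # Every consecutive matching base is added in the same flow.
        while pos < len(sequence) and sequence[pos] == base:
            run += 1
            pos += 1
        flows.append(run)
        step += 1
    return flows


print(ideal_flowgram("ACTGGGG"))  # [1, 0, 1, 0, 0, 1, 0, 4]
```

With the A..T..C..G order used in the example above, the four Gs at the end of ACTGGGG show up as a single flow of strength 4.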
Pyrosequencing-Methods
 Each well is monitored for the amount of light it emits
at each nucleotide step, which indicates how long the
“homopolymer” (run of identical bases) is.
 The sequence of emissions is called a flowgram.
 Naively, an intensity of 0 means a homopolymer of
length 0, intensity of 1 a homopolymer of length 1,
intensity of 2 a homopolymer of length 2…
 However, the intensity observed for a given homopolymer
length follows a distribution, so reading lengths
directly from intensities can lead to errors such as
insertions and deletions.
Example from paper
 Consider a known sequence, ACTGGGG. The order of
nucleotide addition is T..A..C..G.
 The ideal intensities would be 0, 1, 1, 0 | 1, 0, 0, 4.
 The observed flowgram was 0.18, 1.03, 1.02, 0.70 | 1.12, 0.07,
0.14, 4.65. Rounding suggests the sequence ACGTGGGGG,
because 0.70 and 4.65 round to 1 and 5.
 Therefore, it is better to use distributions to more
accurately predict the sequence.
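The rounding error in this example can be reproduced with a short sketch. The function name `naive_base_call` is hypothetical, and rounding each intensity to the nearest whole number is the naive reading described above, not the paper's method.

```python
def naive_base_call(flowgram, flow_order="TACG"):
    """Call bases by rounding each intensity to the nearest integer.

    Hypothetical helper illustrating the naive reading of a flowgram.
    """
    bases = []
    for step, intensity in enumerate(flowgram):
        length = int(intensity + 0.5)  # nearest whole homopolymer length
        bases.append(flow_order[step % len(flow_order)] * length)
    return "".join(bases)


ideal = [0, 1, 1, 0, 1, 0, 0, 4]
observed = [0.18, 1.03, 1.02, 0.70, 1.12, 0.07, 0.14, 4.65]

print(naive_base_call(ideal))     # ACTGGGG
print(naive_base_call(observed))  # ACGTGGGGG
```

The two noisy flows (0.70 and 4.65) turn a 7-base sequence into a 9-base one, with one spurious G inserted and one extra G appended to the homopolymer.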
Intensity distribution created using
known sequences (from paper)
Dealing with the noisy data
 Using the intensity data, a “distance” measure was
defined, reflecting the probability that a given
flowgram was generated by a particular sequence.
 The distances were used in a mixture model, and an
iterative expectation maximization algorithm was
employed to gradually bring the flowgrams into
agreement with their “true” sequences.
 Artifacts such as PCR chimeras were dealt with using
the Mallard algorithm.
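To make the idea of a probability-based “distance” concrete, here is a minimal sketch. It assumes a Gaussian noise model with a fixed standard deviation `sd`, which is a stand-in chosen for this example; the paper built its intensity distributions from known sequences. Under that assumption, the value returned is the negative log-likelihood per flow, up to an additive constant. The function name `flowgram_distance` is hypothetical.

```python
def flowgram_distance(observed, ideal, sd=0.2):
    """Distance between an observed flowgram and a candidate's ideal one.

    Assumption (not from the paper): each intensity is Gaussian around
    the true homopolymer length with a fixed standard deviation `sd`.
    Smaller values mean the candidate sequence is more probable.
    """
    if len(observed) != len(ideal):
        raise ValueError("flowgrams must have the same number of flows")
    if not observed:
        raise ValueError("flowgrams must not be empty")
    total = sum((x - n) ** 2 / (2 * sd ** 2) for x, n in zip(observed, ideal))
    return total / len(observed)  # normalize by the number of flows


observed = [1.03, 0.07, 0.14, 2.10]
close = flowgram_distance(observed, [1, 0, 0, 2])
far = flowgram_distance(observed, [0, 1, 0, 2])
print(close < far)  # True
```

Replacing the Gaussian term with a lookup into empirical intensity distributions would keep the same structure.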
Flowgram preclustering
 Assumption: the likelihood of the flowgrams is
described by a mixture model. Each sequence is a
different component of the mixture and has its own
probability.
 σ is the cluster size of flowgrams around a sequence
 f_i is observed flowgram i; the model gives the density of
the observed flowgrams about a sequence
 S_j is a particular sequence
Flowgram preclustering
 The likelihood of the dataset, D, of N flowgrams
indexed i:
 τ_j is each sequence’s relative frequency
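For reference, the likelihood of a finite mixture model generally has the form below. This is the textbook expression written with the symbols from these slides, not a transcription of the paper's equation. Two names are assumptions made here: $F(f_i \mid S_j; \sigma)$ for the density of flowgram $f_i$ about sequence $S_j$, and $M$ for the number of sequences.

```latex
L(D) = \prod_{i=1}^{N} \sum_{j=1}^{M} \tau_j \, F(f_i \mid S_j; \sigma),
\qquad \sum_{j=1}^{M} \tau_j = 1
```

Each flowgram contributes a weighted sum over all candidate sequences, with the weights given by the relative frequencies.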
Preclustering analogy
The flowgrams are clustered, with
the size of each cluster, σ, being 5
flowgrams.
We guess that each cluster
represents one sequence.
This is just an analogy, because the
mixture is not two-dimensional
like this.
Expectation maximization
 Assume a matrix Z, with rows representing flowgrams and
columns representing sequences: z_{i,j} = δ_{j,m(i)}, where
m(i) is the sequence that generated flowgram i, so z_{i,j}
is 1 when sequence j generated flowgram i and 0 otherwise.
 Complete data likelihood is:
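For a mixture model with indicator variables of this kind, the complete-data likelihood usually takes the form below. This is the standard expression rather than a copy of the paper's equation. The density $F(f_i \mid S_j; \sigma)$, the relative frequency $\tau_j$ of sequence $j$, and the counts $N$ (flowgrams) and $M$ (sequences) are names assumed here.

```latex
L_C = \prod_{i=1}^{N} \prod_{j=1}^{M}
      \left[ \tau_j \, F(f_i \mid S_j; \sigma) \right]^{z_{i,j}}
```

Because each $z_{i,j}$ is 0 or 1, only the term for the sequence that actually generated flowgram $i$ survives in the inner product.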
Expectation maximization
 Define z’_{i,j} as the expected value of z_{i,j} given the
current model parameters.
 Expectation step: calculate z’_{i,j} given the model parameters.
 Maximization step: calculate new parameters such
that L_C is maximized given z’_{i,j}.
 Stop when the improvement between steps falls below
a cutoff, c.
Expectation maximization analogy
Choose a beginning sequence (red
square) in each cluster. There are
many such clusters in the model.
The black circles are flowgrams in
the cluster.
Expectation: calculate, for each
flowgram, the probability that it
was generated by the sequence.
Maximization: calculate a new
sequence that is closer to the “real”
sequence based on the flowgrams.
You can see here that the sequence
moves to a position that is more
likely given the flowgrams. In the
paper, the aggregate distance is
calculated for all sequences.
Expectation maximization
 E step (calculating new z’_{i,j})
 M step (calculating new relative frequencies, τ_j, and
then new sequences)
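The two steps can be sketched as a small loop. This is a simplified illustration under stated assumptions, not the paper's implementation: the density is assumed to fall off exponentially with distance, the candidate sequences are held fixed (the sequence update is skipped), and all function names are hypothetical. The input `dist[i][j]` is the distance from flowgram i to sequence j.

```python
import math


def density(distance, sigma):
    # Assumed form: density falls off exponentially with distance.
    return math.exp(-distance / sigma) / sigma


def e_step(dist, tau, sigma):
    """Expected memberships z'[i][j] given the current frequencies."""
    z = []
    for row in dist:
        weights = [t * density(d, sigma) for t, d in zip(tau, row)]
        total = sum(weights)
        z.append([w / total for w in weights])
    return z


def m_step(z):
    """New relative frequencies: the mean membership of each sequence."""
    n = len(z)
    return [sum(row[j] for row in z) / n for j in range(len(z[0]))]


def log_likelihood(dist, tau, sigma):
    return sum(
        math.log(sum(t * density(d, sigma) for t, d in zip(tau, row)))
        for row in dist
    )


def run_em(dist, sigma=1.0, cutoff=1e-6, max_iter=1000):
    k = len(dist[0])
    tau = [1.0 / k] * k  # start from equal frequencies
    previous = log_likelihood(dist, tau, sigma)
    for _ in range(max_iter):
        z = e_step(dist, tau, sigma)
        tau = m_step(z)
        current = log_likelihood(dist, tau, sigma)
        if current - previous < cutoff:  # improvement below cutoff c
            break
        previous = current
    return tau, e_step(dist, tau, sigma)


# Three flowgrams, two candidate sequences (made-up distances).
dist = [[0.1, 5.0], [0.2, 4.0], [6.0, 0.1]]
tau, z = run_em(dist)
```

Two of the three flowgrams lie close to the first sequence, so its estimated relative frequency ends up larger than that of the second.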
A visual example of the process
Testing the algorithm
 The noise-removal algorithm was tested on 16S rRNA
gene sequences from 90 known microbial clones.
 After sequencing, the reads were grouped
phylogenetically into operational taxonomic units
(OTUs) and the result compared with the known
composition of the sample.
 The sequence difference threshold for creating a
separate OTU had to be larger than the noise (see next
slide).
OTU assignment
The assignment of OTUs depends
on the threshold of sequence
difference required for a separate
OTU. A higher threshold results in
fewer OTUs, because species
become clustered together.
A threshold that is below the noise
level could result in the same
species being split into two
different OTUs.
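The effect of the threshold can be illustrated with a small average-linkage clustering sketch. The distances are invented for this example, and the function name `average_linkage_otus` is hypothetical. Reads 0 and 1 stand for the same species separated only by noise, as do reads 2 and 3.

```python
def average_linkage_otus(dist, threshold):
    """Merge clusters until the closest pair is farther than `threshold`.

    Hypothetical helper: `dist` is a symmetric matrix of sequence
    differences; the distance between two clusters is the average of
    all pairwise differences (average linkage).
    """
    clusters = [[i] for i in range(len(dist))]

    def average(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] > threshold:
            break  # remaining clusters are separate OTUs
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Made-up sequence differences between four reads.
dist = [
    [0.00, 0.01, 0.10, 0.11],
    [0.01, 0.00, 0.10, 0.10],
    [0.10, 0.10, 0.00, 0.02],
    [0.11, 0.10, 0.02, 0.00],
]

print(len(average_linkage_otus(dist, 0.005)))  # 4
print(len(average_linkage_otus(dist, 0.03)))   # 2
print(len(average_linkage_otus(dist, 0.20)))   # 1
```

A threshold below the noise (0.005) splits each species into two OTUs, a threshold just above the noise (0.03) recovers the two species, and a very high threshold (0.20) lumps everything into one OTU.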
Results
Take-home message
 The noise reduction algorithm employed by this paper
resulted in more accurate sequence assignment.
 Average linkage clustering is better at handling noise.
Questions?
Acknowledgments
 www.wikipedia.org
 Pyrosequence pic:
http://jeb.biologists.org/content/vol210/issue9/images/large/JEB001370F2.jpeg