
CSCI1820: Sequence Alignment
Spring 2017
Lecture 11: March 16
Lecturer: Sorin Istrail
Scribe: Pranavan Chanthrakumar
Note: LaTeX template courtesy of UC Berkeley EECS dept. Notes are also adapted from notes from previous
years’ offerings of CSCI1810 and CSCI1820.
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications.
They may be distributed outside this class only with the permission of the Instructor.
11.1 Sequence Graph Setup
In today’s lecture we will explore the concept of sequence graphs. Before we explore sequence graphs, let’s
remind ourselves of the variables we are using:
• N , representing the total number of fragments
• L, representing the size of the DNA region we want to sequence or assemble
• l, representing the length of each fragment/read (all reads have the same length)
• c, representing the coverage, which is equal to $\frac{Nl}{L}$
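As a quick arithmetic check of the coverage definition, here is a minimal Python sketch. The function name `coverage` and the numbers used are illustrative assumptions, not values from the lecture.

```python
def coverage(num_fragments: int, read_length: int, region_length: int) -> float:
    """Return c = N * l / L, the average number of reads covering a base."""
    if region_length <= 0:
        raise ValueError("region_length must be positive")
    return num_fragments * read_length / region_length


# Hypothetical numbers chosen only for illustration:
# N = 10,000 reads of length l = 100 over a region of length L = 100,000.
c = coverage(num_fragments=10_000, read_length=100, region_length=100_000)
print(c)  # 10.0
```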
Now, consider a sequence graph G = (V, E) built off of our set of reads like so:
• Vertices in V correspond to (k − 1) tuples in the fragments/reads.
• For every v ∈ V , there is a fragment f and a position i of f such that the (k − 1) tuple for v is
$a^f_i, a^f_{i+1}, \ldots, a^f_{i+k-2}$, where i is zero-indexed. In other words, every vertex’s (k − 1) tuple
occurs as a (k − 1) tuple in one (or more) of the reads.
• The edges in E correspond to distinct k tuples of the entire set of fragments/reads.
• For every edge e ∈ E, there exists a fragment f and a position i in f such that the k tuple corresponding
to e is $a^f_i, a^f_{i+1}, \ldots, a^f_{i+k-1}$. In other words, every edge’s k tuple occurs as a k tuple
in one (or more) of the reads.
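The construction above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `build_sequence_graph` is hypothetical, reads are plain strings, and each edge is taken to join the (k − 1) tuple prefix of its k tuple to the (k − 1) tuple suffix, a de Bruijn-style convention that the notes do not state explicitly.

```python
from typing import Dict, Iterable, Set, Tuple


def build_sequence_graph(
    reads: Iterable[str], k: int
) -> Tuple[Set[str], Dict[str, Tuple[str, str]]]:
    """Collect distinct (k-1)-tuples as vertices and distinct k-tuples as edges.

    Assumption (not from the notes): an edge for k-tuple w joins
    the vertex w[:-1] to the vertex w[1:].
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    vertices: Set[str] = set()
    edges: Dict[str, Tuple[str, str]] = {}
    for read in reads:
        # Every window of length k-1 in the read is a vertex.
        for i in range(len(read) - k + 2):
            vertices.add(read[i:i + k - 1])
        # Every window of length k in the read is an edge.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[kmer] = (kmer[:-1], kmer[1:])
    return vertices, edges


# Toy reads, chosen only for illustration.
V, E = build_sequence_graph(["ACGT", "CGTA"], k=3)
print(sorted(V))  # ['AC', 'CG', 'GT', 'TA']
print(sorted(E))  # ['ACG', 'CGT', 'GTA']
```

Using sets and dictionary keys means a tuple shared by several reads is stored once, which matches the requirement that edges correspond to distinct k tuples.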
11.2 Expected Number of Vertices in Sequence Graph
We now ask the following question: what is the expected number of vertices in the sequence graph
associated with the N fragments $f_1, \ldots, f_N$ obtained by the random process we described?
11.2.1 Variables
In addition to our previously defined variables, we now define the following:
• Let $L_0 = L - (k - 2)$, the number of positions α at which a (k − 1) tuple of the target DNA can start
• Let r be the sequencing error (or substitution) rate
• Let $R = 1 - (1 - r)^{k-1}$ represent the probability that a (k − 1) tuple is incorrect, that is, contains at least one misread base
• Let T be the number of (k − 1) tuples in the fragments
• Let $X_\alpha$ represent the number of fragments that cover the region of the target DNA from position α to
position α + k − 2; note that this spans (k − 1) positions.
11.2.2 Answer and Partial Proof
We find that the expected number of vertices is:
$E(|V|) = RT + [1 - e^{-c(1-R)}] L_0$
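To make the formula concrete, here is a minimal Python sketch that evaluates it. The function names are hypothetical, and the example numbers are illustrative only. In particular, computing T as N(l − k + 2), the number of (k − 1) tuples counted once per read position, is an assumption; the notes do not give a formula for T.

```python
import math


def tuple_error_probability(r: float, k: int) -> float:
    """R = 1 - (1 - r)^(k-1): chance a (k-1)-tuple has at least one error."""
    return 1.0 - (1.0 - r) ** (k - 1)


def expected_vertices(L: int, k: int, r: float, c: float, T: float) -> float:
    """Evaluate E(|V|) = R*T + [1 - exp(-c*(1 - R))] * L0."""
    L0 = L - (k - 2)
    R = tuple_error_probability(r, k)
    expected_false = R * T
    expected_true = (1.0 - math.exp(-c * (1.0 - R))) * L0
    return expected_false + expected_true


# Hypothetical parameters, for illustration only.
N, l, L, k, r = 10_000, 100, 100_000, 21, 0.01
c = N * l / L
T = N * (l - k + 2)  # assumption: tuples counted per read position
print(expected_vertices(L=L, k=k, r=r, c=c, T=T))
```

With r = 0 the first term vanishes and the result reduces to the fraction of the L0 positions covered by at least one read.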
In order to get to this answer, we first assume that our fragments are distributed uniformly on L. The depth
of coverage at a given base position B is a Poisson random variable with mean λ = c:
$P(\text{base } B \text{ is covered by } m \text{ fragments}) = \frac{e^{-c} c^m}{m!}$
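The Poisson coverage model can be evaluated directly. The sketch below uses a hypothetical function name `coverage_pmf` and an illustrative coverage value; it shows in particular that the probability a base is left uncovered is e^{-c}.

```python
import math


def coverage_pmf(m: int, c: float) -> float:
    """P(base is covered by exactly m fragments) under Poisson(c)."""
    if m < 0:
        raise ValueError("m must be non-negative")
    return math.exp(-c) * c ** m / math.factorial(m)


c = 10.0  # illustrative coverage, not a value from the lecture
p_uncovered = coverage_pmf(0, c)      # equals exp(-c)
p_covered = 1.0 - p_uncovered         # at least one fragment covers the base
print(p_uncovered, p_covered)
```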
We classify each (k − 1) tuple as either “true” (read without error) or “false” (containing at least one
error), and then go through the following steps:
• Recall that, by the definition of the error rate, the probability that any base is read incorrectly
is r, and so the probability that a base is read correctly is (1 − r).
• Assume that, because the error rate is small, no two fragments generate the same false tuple; that
is, each false tuple appears only once.
• Consider each of the T tuples as a Bernoulli trial that yields a false tuple with probability R.
• Thus, the expected number of false tuples is RT.
• The expected number of true tuples is equal to the expected number of positions α such that at least one
fragment contains no errors in the region from α to α + k − 2, which is (k − 1) positions. This ends up
being $L_0 [1 - e^{-c(1-R)}]$; we leave out the details of this here.
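One plausible way to fill in the omitted details of the last step is sketched below. It is not taken from the lecture and relies on assumptions: the number of covering fragments at a position is approximated as Poisson with mean c (reasonable when k is much smaller than l and edge effects are ignored), and errors are independent across fragments. The symbol $Y_\alpha$, the number of covering fragments that are error-free on the window starting at α, is introduced here only for illustration.

```latex
\begin{align*}
X_\alpha &\sim \mathrm{Poisson}(c)
  && \text{(approximation, assuming } k \ll l \ll L\text{)} \\
Y_\alpha \mid X_\alpha &\sim \mathrm{Binomial}(X_\alpha,\ 1 - R)
  && \text{(each covering fragment is error-free w.p. } 1 - R\text{)} \\
Y_\alpha &\sim \mathrm{Poisson}\bigl(c(1 - R)\bigr)
  && \text{(thinning of a Poisson variable)} \\
P(Y_\alpha \ge 1) &= 1 - e^{-c(1-R)} \\
E(\text{number of true tuples}) &= \sum_{\alpha = 1}^{L_0} P(Y_\alpha \ge 1)
  = L_0 \bigl[ 1 - e^{-c(1-R)} \bigr]
  && \text{(linearity of expectation)}
\end{align*}
```

Adding the expected number of false tuples, RT, then gives the stated value of E(|V|).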