CSCI1820: Sequence Alignment, Spring 2017
Lecture 11: March 16
Lecturer: Sorin Istrail
Scribe: Pranavan Chanthrakumar

Note: LaTeX template courtesy of UC Berkeley EECS dept. Notes are also adapted from notes from previous years' offerings of CSCI1810 and CSCI1820.

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor.

11.1 Sequence Graph Setup

In today's lecture we will explore the concept of sequence graphs. Before we do, let's remind ourselves of the variables we are using:

• $N$, the total number of fragments
• $L$, the size of the DNA region we want to sequence or assemble
• $l$, the size of the fragments/reads (all of which have the same length)
• $c$, the coverage, which is equal to $c = Nl/L$

Now, consider a sequence graph $G = (V, E)$ built from our set of reads as follows:

• Vertices in $V$ correspond to $(k-1)$-tuples in the fragments/reads.
• For every $v \in V$, there is a fragment $f$ and a position $i$ of $f$ such that the $(k-1)$-tuple for $v$ is $a^f_i, a^f_{i+1}, \ldots, a^f_{i+k-2}$, where $a^f_j$ is the base at position $j$ of $f$ and $i$ is zero-indexed. In other words, every vertex's $(k-1)$-tuple occurs as a $(k-1)$-tuple in one (or more) of the reads.
• The edges in $E$ correspond to the distinct $k$-tuples of the entire set of fragments/reads.
• For every edge $e \in E$, there exists a fragment $f$ and a position $i$ in $f$ such that the $k$-tuple corresponding to $e$ is $a^f_i, a^f_{i+1}, \ldots, a^f_{i+k-1}$. In other words, every edge's $k$-tuple occurs as a $k$-tuple in one (or more) of the reads.

11.2 Expected Number of Vertices in Sequence Graph

We now pose the following question: what is the expected number of vertices in the sequence graph associated with the $N$ fragments $f_1, \ldots, f_N$ obtained by the random process we described?
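To make the object being counted concrete, here is a minimal Python sketch of the sequence graph from Section 11.1. The function name `build_sequence_graph` is hypothetical, and directing each edge from the $(k-1)$-prefix of its $k$-tuple to the $(k-1)$-suffix is an assumption (the usual de Bruijn-style convention); the notes only say that edges correspond to distinct $k$-tuples.

```python
def build_sequence_graph(reads, k):
    """Return (vertices, edges) for the sequence graph of `reads`.

    vertices: set of distinct (k-1)-tuples found in the reads.
    edges:    set of (prefix, suffix) pairs, one per distinct k-tuple.
              The prefix -> suffix direction is an assumption of this sketch.
    """
    vertices = set()
    edges = set()
    for read in reads:
        # Every (k-1)-tuple starting at a zero-indexed position i is a vertex.
        for i in range(len(read) - k + 2):
            vertices.add(read[i:i + k - 1])
        # Every k-tuple starting at position i is an edge; using sets keeps
        # only the distinct tuples across all reads.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges.add((kmer[:-1], kmer[1:]))
    return vertices, edges


if __name__ == "__main__":
    # Two toy reads (made up for illustration), with k = 3.
    V, E = build_sequence_graph(["ACGTA", "CGTAC"], k=3)
    print(sorted(V))
    print(sorted(E))
```

In this sketch $|V|$ is simply `len(vertices)`, which is the quantity whose expectation is studied in this section.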
11.2.1 Variables

In addition to our previously defined variables, we now let the following also hold:

• Let $L' = L - (k - 2)$
• Let $r$ be the sequencing error (or substitution) rate
• Let $R = 1 - (1 - r)^{k-1}$ be the probability that a $(k-1)$-tuple is incorrect
• Let $T$ be the number of $(k-1)$-tuples in the fragments
• Let $X_\alpha$ be the number of fragments that cover the region of the target DNA from position $\alpha$ to position $\alpha + k - 2$; note that this region spans $(k-1)$ positions.

11.2.2 Answer and Partial Proof

We find that the expected number of vertices is:

$$E(|V|) = RT + \left[1 - e^{-c(1-R)}\right] L'$$

To get to this answer, we first assume that our fragments are distributed uniformly on $L$. The depth of coverage at a given base position $B$ is then a Poisson random variable with mean $\lambda = c$:

$$P(\text{base } B \text{ is covered by } m \text{ fragments}) = \frac{e^{-c} c^m}{m!}$$

We proceed by classifying each $(k-1)$-tuple as either "true" or "false", and go through the following steps:

• Recall that, by the definition of the error rate, the probability that any base is read incorrectly is $r$, so the probability that a base is read correctly is $(1 - r)$.
• Assume that, because the error rate is small, no two fragments generate the same false tuple; that is, each false tuple appears only once.
• Consider each false tuple as a Bernoulli trial.
• Thus, the expected number of false tuples is $RT$.
• The expected number of true tuples is equal to the expected number of positions $\alpha$ such that at least one fragment contains no errors in the region from $\alpha$ to $\alpha + k - 2$, which spans $(k-1)$ positions. This ends up being $L'\left[1 - e^{-c(1-R)}\right]$; we leave out the details here.
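The formula can be evaluated numerically with a short Python sketch. The function names are hypothetical. The default $T = N(l - k + 2)$ is an assumption: it counts every $(k-1)$-tuple position in every read, which is one reading of "the number of $(k-1)$-tuples in the fragments". Pass `T` explicitly if a different count is intended. One plausible way to read the omitted step, again an assumption rather than something stated in the notes, is Poisson thinning: if $X_\alpha$ is approximately Poisson with mean $c$ and each covering fragment is error-free on the window with probability $1 - R$, then the number of error-free covering fragments is Poisson with mean $c(1-R)$, and the probability that at least one exists is $1 - e^{-c(1-R)}$.

```python
import math


def poisson_pmf(m, c):
    """P(base B is covered by m fragments) = e^{-c} c^m / m!."""
    return math.exp(-c) * c ** m / math.factorial(m)


def expected_vertices(N, L, l, k, r, T=None):
    """Evaluate E(|V|) = R*T + [1 - e^{-c(1-R)}] * L'.

    N: number of fragments, L: region length, l: read length,
    k: tuple size for edges, r: per-base error rate.
    T: number of (k-1)-tuples in the fragments; the default below
       is an assumption of this sketch, not taken from the notes.
    """
    c = N * l / L                      # coverage
    L_prime = L - (k - 2)              # L' from Section 11.2.1
    R = 1 - (1 - r) ** (k - 1)         # P((k-1)-tuple is incorrect)
    if T is None:
        T = N * (l - k + 2)            # assumed: all tuple positions in all reads
    expected_false = R * T
    expected_true = (1 - math.exp(-c * (1 - R))) * L_prime
    return expected_false + expected_true


if __name__ == "__main__":
    # Illustrative parameters only; they are not from the lecture.
    print(expected_vertices(N=1000, L=10_000, l=100, k=21, r=0.01))
```

With $r = 0$ the false-tuple term vanishes and the result reduces to $L'(1 - e^{-c})$, the expected number of $(k-1)$-tuple positions covered by at least one fragment under the Poisson model.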