Lecture2-Sept4 - Center for Bioinformatics and Computational

Genome sizes (sample)
1
Some genomics history
• 1995: first bacterial genome, Haemophilus influenzae, 1.8 Mbp, sequenced at TIGR
• first use of whole-genome shotgun for a bacterium
• Fleischmann et al. 1995 became most-cited paper of the year
• 2869 citations to date
• 1995-6: 2nd and 3rd microbial genomes published by TIGR: Mycoplasma genitalium
(a bacterium), Methanococcus jannaschii (an archaeon)
• 1996: first eukaryote, S. cerevisiae (yeast), 13 Mbp, sequenced by a consortium of
(mostly European) labs
• 1997: E. coli finished (7th bacterial genome)
• 1998-2001: T. pallidum (syphilis), B. burgdorferi (Lyme disease), M. tuberculosis,
Vibrio cholerae, Neisseria meningitidis, Streptococcus pneumoniae, Chlamydia
pneumoniae [all at TIGR]
• 2000: fruit fly, Drosophila melanogaster
• 2000: first plant genome, Arabidopsis thaliana
• 2001: human genome, first draft
• 2002: malaria genome, Plasmodium falciparum
• 2002: anthrax genome, Bacillus anthracis
• TODAY (Sept 4, 2008):
• 744 complete microbial genomes!
• 1199 microbial genomes in progress!
• 476 eukaryotic genomes in progress!
2
New directions:
sequencing ancient DNA
(some assembly required)
J. P. Noonan et al., Science 309, 597-599 (2005)
5
Fig. 1. Schematic illustration of the ancient DNA extraction and library construction process
J. P. Noonan et al., Science 309, 597-599 (2005)
Published by AAAS
6
Fig. 2. Characterization of two independent cave bear genomic libraries.
Predicted origin of 9035 clones from library CB1 (A) and 4992 clones from
library CB2 (B) are shown, as determined by BLAST comparison to GenBank and
environmental sequence databases. Other refers to viral or plasmid-derived
DNAs. Distribution of sequence annotation features in 6,775 nucleotides of
carnivore sequence from library CB1 (C) and 20,086 nucleotides of carnivore
sequence from library CB2 (D) are shown as determined by alignment to the
July 2004 dog genome assembly.
J. P. Noonan et al., Science 309, 597-599 (2005)
Published by AAAS
7
Fig. 1. Characterization of the mammoth metagenomic library, including percentage of read
distributions to various taxa
H. N. Poinar et al., Science 311, 392-394 (2006)
Published by AAAS
10
Journals
• The very best:
• Science
• www.sciencemag.org
• Nature
• www.nature.com/nature
• PLoS Biology
• www.plosbiology.org
12
Bioinformatics Journals
• Bioinformatics
• bioinformatics.oxfordjournals.org
• BMC Bioinformatics
• www.biomedcentral.com/bioinformatics
• PLoS Computational Biology
• compbiol.plosjournals.org
• Journal of Computational Biology
• www.liebertpub.com/cmb
13
Radically new journals
• PLoS ONE
• www.plosone.org
• Biology Direct
• www.biology-direct.com
• Reviewers’ comments are public
Both journals can be annotated by readers
Papers can be negative results, confirmations of other results, or brand new
14
Genomics Journals
• Genome Biology
• genomebiology.com
• Genome Research
• www.genome.org
• Nucleic Acids Research
• nar.oxfordjournals.org
• BMC Genomics
• www.biomedcentral.com/bmcgenomics
15
Before assembly…
… we need to cover a basic sequence
alignment algorithm
16
Sequence Alignment
When we have very similar sequences:
• Closely related species
• Very little changed sequence
• Small differences can be very important
• Computationally “easy” to align
• Assembly ONLY deals with these
When sequences are not so similar:
• Distantly related species
• Most positions changed
• Sequences that are most highly conserved are under the
strongest selective (evolutionary) pressure.
– E.g., some genes in humans and E. coli clearly have a
common ancestor, and their proteins can be aligned
• Computationally “difficult” to align
17
Sequence Alignment
Algorithms for sequence alignment
• Choose best alignment, subject to some mutation
model.
• A common (but overly simplistic) model for DNA
mutations is called “edits”, which counts the number of
substitutions, insertions and deletions.
• The resulting alignment suggests a possible “history”
for the sequence.
This slide and subsequent alignment slides courtesy of Nathan Edwards, available at
www.umiacs.umd.edu/~nedwards/teaching/CMSC858E_Fall_2005/
18
Example Alignments
ACGTCTAG
||*****^
ACTCTAG-
2 matches, 5 mismatches, 1 not aligned
19
Example Alignments
ACGTCTAG
^**|||||
-ACTCTAG
5 matches, 2 mismatches, 1 not aligned
20
Example Alignments
ACGTCTAG
||^|||||
AC-TCTAG
7 matches, 0 mismatches, 1 not aligned
Edit distance here = 1
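
The counts on the three example slides can be checked mechanically. The following is a minimal Python sketch, assuming an alignment is given as two equal-length rows with "-" marking a gap; the function names score_alignment and alignment_cost are illustrative and not from the lecture.

```python
def score_alignment(top: str, bottom: str) -> dict:
    """Count match, mismatch, and gap columns of a gapped pairwise alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    counts = {"matches": 0, "mismatches": 0, "gaps": 0}
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            counts["gaps"] += 1        # column marked ^ on the slides
        elif a == b:
            counts["matches"] += 1     # column marked |
        else:
            counts["mismatches"] += 1  # column marked *
    return counts


def alignment_cost(top: str, bottom: str) -> int:
    """Number of edits implied by this alignment: substitutions plus gaps."""
    c = score_alignment(top, bottom)
    return c["mismatches"] + c["gaps"]


for bottom in ("ACTCTAG-", "-ACTCTAG", "AC-TCTAG"):
    print(bottom, score_alignment("ACGTCTAG", bottom),
          "cost =", alignment_cost("ACGTCTAG", bottom))
```

The cost of any one alignment is only an upper bound on the edit distance; the edit distance is the minimum cost over all alignments, which is why the third alignment (cost 1) is the one that matters here.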
21
Example Alignments
...AACTGAGTTTACGCGCATAGA...
            |^^^||^|^^|
            T---CG-A--G
Many equally good alignments!
Even exact matching sequence can be
found (at random) in long enough
sequences
22
Global Alignment problem
Given two related sequences, S (length n)
and T (length m), find an alignment of S
and T.
Edit distance: minimum number of
substitutions, insertions and deletions.
23
Dynamic Programming for
pairwise alignment
24
Dynamic Programming
Formulation
Definition: Let D(i,j) be the edit distance of
the alignment of S[1...i] and T[1...j].
Edit distance of S and T, then, is D(n,m).
Dynamic programming solves the global
alignment problem by computing
D(i,j) for all i=0...n and j=0...m.
25
Recurrence Relation for D
Computation of D is a recursive/iterative
process.
• D(i,j) in terms of D(i’,j’) for i’ ≤ i and j’ ≤ j, with (i’,j’) ≠ (i,j).
Base conditions for D(i,j):
• D(i,0) = i, for all i = 0,...,n
• D(0,j) = j, for all j = 0,...,m
26
Recurrence relation for D
For i > 0, j > 0:
D(i,j) = min {
D(i-1,j) + 1,
D(i,j-1) + 1,
D(i-1,j-1) + δ(S(i),T(j))
}
where δ(a,b) = 0 if a = b, and 1 otherwise (the cost of a substitution).
27
Dynamic programming
D(i,j) is computed by optimally solving
sub-problems
The optimal solution to D(i,j) is a simple
combination of optimally solved subproblems: the
minimum over three smaller subproblems, each plus
the cost of one edit step
28
Using the recurrence
We could code this as a recursive
function call...
• ...but that takes an exponential number of function
evaluations
– each position explores 3 alternatives
There are only (n+1)x(m+1) distinct pairs (i,j)
• We must be evaluating D(i,j) multiple times
• Why not cache the results?
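
To make the point about repeated evaluations concrete, here is a small Python sketch that counts how often the recurrence body runs with and without a cache. It assumes unit costs for all three edit types, as in the recurrence; the function names and the call counters are illustrative additions, not part of the lecture.

```python
from functools import lru_cache


def edit_distance_naive(s: str, t: str):
    """Direct recursion on the recurrence; returns (distance, number of calls)."""
    calls = 0

    def d(i: int, j: int) -> int:
        nonlocal calls
        calls += 1
        if i == 0:
            return j                      # base condition D(0,j) = j
        if j == 0:
            return i                      # base condition D(i,0) = i
        delta = 0 if s[i - 1] == t[j - 1] else 1
        return min(d(i - 1, j) + 1, d(i, j - 1) + 1, d(i - 1, j - 1) + delta)

    return d(len(s), len(t)), calls


def edit_distance_memo(s: str, t: str):
    """Same recursion with cached results; counts only uncached evaluations."""
    calls = 0

    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        nonlocal calls
        calls += 1                        # runs only on a cache miss
        if i == 0:
            return j
        if j == 0:
            return i
        delta = 0 if s[i - 1] == t[j - 1] else 1
        return min(d(i - 1, j) + 1, d(i, j - 1) + 1, d(i - 1, j - 1) + delta)

    return d(len(s), len(t)), calls


print(edit_distance_naive("ACGTCTAG", "ACTCTAG"))
print(edit_distance_memo("ACGTCTAG", "ACTCTAG"))
```

With the cache, the body runs at most once per pair (i,j), so at most (n+1)(m+1) times; without it, the count grows exponentially with the sequence lengths.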
29
Using the recurrence
Compute D(i,j) bottom up.
Store the intermediate results in a table (the table we
already saw).
Start with smallest (i,j) = (1,1).
Compute D(i,j) after D(i-1,j), D(i,j-1), and D(i-1,j-1) have been determined.
(n+1)(m+1) cells to fill, so O(nm) time.
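
The bottom-up fill can be sketched in Python as follows. This is an illustrative sketch under the unit-cost model of the recurrence; the name edit_distance_table and the list-of-lists table layout (rows indexed by i, columns by j) are assumptions, not something the slides prescribe.

```python
def edit_distance_table(s: str, t: str):
    """Fill the (n+1) x (m+1) table D row by row and return it."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                       # base condition D(i,0) = i
    for j in range(m + 1):
        D[0][j] = j                       # base condition D(0,j) = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Python strings are 0-indexed, so S(i) is s[i-1] and T(j) is t[j-1].
            delta = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # cell above
                          D[i][j - 1] + 1,        # cell to the left
                          D[i - 1][j - 1] + delta)  # diagonal cell
    return D


D = edit_distance_table("ACGTCTAG", "ACTCTAG")
for row in D:
    print(row)
print("edit distance:", D[-1][-1])
```

Row-by-row order guarantees that the three cells each entry depends on are already filled when it is computed.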
30
Traceback
Our dynamic programming table helps us
compute the edit distance “score”
We need the actual alignment
corresponding to this edit distance
The corresponding alignment can be read
off, by doing a little extra accounting.
31
Traceback
If D(i,j) == D(i-1,j) + 1,
Pointer(i,j) = (i-1,j)
If D(i,j) == D(i,j-1) + 1,
Pointer(i,j) = (i,j-1)
If D(i,j) == D(i-1,j-1) + δ(S(i),T(j)),
Pointer(i,j) = (i-1,j-1)
Break ties arbitrarily, or keep multiple pointers
32
Traceback
Follow the pointers from cell (n,m).
Any path to (0,0) corresponds to the (reverse
of the) edits of the optimal alignment
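• “horizontal” pointers, to (i,j-1): T(j) is aligned to a gap inserted in S
• “vertical” pointers, to (i-1,j): S(i) is aligned to a gap inserted in T
• “diagonal” pointers, to (i-1,j-1): match or substitution
Once the table is filled, an optimal alignment can be read off in O(n+m)
time.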
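
A traceback can be sketched in Python as below. Rather than storing pointers, this sketch recomputes at each cell which predecessor explains D(i,j); that is one of several equivalent options. Ties are broken in favour of the diagonal, which is an arbitrary choice, and the function name align is illustrative.

```python
def align(s: str, t: str):
    """Return (edit distance, aligned S, aligned T) for one optimal alignment."""
    n, m = len(s), len(t)
    # Fill the table bottom up, as in the recurrence.
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            delta = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + delta)

    # Walk back from (n,m) to (0,0), emitting alignment columns in reverse.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        delta = 0 if i > 0 and j > 0 and s[i - 1] == t[j - 1] else 1
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + delta:
            top.append(s[i - 1]); bottom.append(t[j - 1])   # diagonal
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            top.append(s[i - 1]); bottom.append("-")        # vertical: gap in T
            i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1])        # horizontal: gap in S
            j -= 1
    return D[n][m], "".join(reversed(top)), "".join(reversed(bottom))


dist, row_s, row_t = align("ACGTCTAG", "ACTCTAG")
print(dist)
print(row_s)
print(row_t)
```

The loop takes at most n+m steps because every step decreases i, j, or both. Filling the table still costs O(nm), so the O(n+m) bound applies to the traceback alone.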
33
Original references
T.F. Smith and M.S. Waterman, Identification
of common molecular subsequences. J.
Molecular Biology (1981), 147(1):195-7.
Altschul SF, Gish W, Miller W, Myers EW,
and Lipman DJ. Basic local alignment
search tool. J. Molecular Biology (1990),
215(3):403-10.
- 24,113 citations!
34