Discrete Mathematical Biology

Seth Sullivant
January 10, 2012
1 Lecture 1: Introduction
Molecular biology is fundamentally about the analysis of sequences: DNA sequences, RNA
sequences, and protein (amino acid) sequences. This course will be primarily concerned
with developing methods from discrete mathematics towards the analysis of DNA sequences
(though we might also spend some time on RNA or protein, time permitting). At the purely
combinatorial level, a DNA sequence is a finite word on the alphabet with four characters
{A, C, G, T}, which stand for the four nucleotide bases. These DNA strings encode all the
information in the genome of an organism, and are the fundamental building block of life. The
course will be organized around 5 particular problems of central importance to the analysis
of DNA sequences, and their consequences in biology.
1.1 Sequence Assembly
At the most basic level, DNA of a species or organism is a long string of the characters
{A, C, G, T}. This is usually divided into some number of pieces because the genome is divided
between chromosomes (e.g. the human genome is divided into 46 chromosomes). The
length of the genome varies wildly between species, with some viral genomes as small as
1000 characters, and some plant genomes as long as $10^{11}$ characters. The human genome is
approximately $3 \times 10^9$ characters in length.
It is too computationally expensive (or not physically possible, I’m not sure which) to
simply take the sequence from a single chromosome and read it from end to end. Instead, the
current methods break the chromosome into many shorter pieces, and biochemical processes
are used to accurately read the sequence of each piece. Hence, for each chromosome, the
biochemical sequencing procedure produces not the entire contiguous sequence, but many
short segments.
The assembly problem is to determine the entire contiguous DNA sequence from these short
“reads” of the sequence, in particular determining which order the reads go in. The
chemistry alone leaves the order undetermined (the reads could go in any order), so the first
thing to do is get many overlapping reads of the same DNA sequence, and then assemble the
DNA from those.
This leads to a number of mathematical problems:
• How to determine the complete sequence based on overlapping subsequences?
• How to guarantee large enough coverage for good recovery properties?
• What to do in the presence of noisy data (e.g. errors in the reads)?
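As a rough illustration of the first question, the following sketch merges reads greedily by largest suffix/prefix overlap. It is a heuristic under strong assumptions, not a method taken from these notes: the reads are error-free, all come from the same strand, and overlaps shorter than `min_len` are ignored. The function names `overlap` and `greedy_assemble` are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 1) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 if no such overlap of length at least `min_len` exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of reads with the largest overlap.

    Assumes error-free reads from a single strand. Returns the list of
    contigs left when no overlap of length >= min_len remains.
    """
    unique = sorted(set(reads))
    # Drop reads that are contained in another read.
    contigs = [r for r in unique if not any(r != s and r in s for s in unique)]
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs
        # Glue the two reads together, writing the shared part only once.
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


print(greedy_assemble(["TACCGTA", "CGTAGGCT", "ACGTACC"]))
```

The greedy choice is not guaranteed to recover the true sequence (repeated regions can mislead it), which is one reason coverage and read errors matter.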
1.2 Annotation
DNA sequences are important because, in particular, they code for genes and proteins (among
other uses). However, in most higher organisms, much of the DNA does not actually code for
genes. For example, in humans, it is estimated that only a few percent of the DNA (5% or less) is coding DNA. The
annotation problem is concerned with taking the assembled DNA sequence and reading from
it various properties, and in particular, identifying likely candidate regions in the genome
that might be genes.
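One very simple way to flag candidate regions is to scan for open reading frames: a start codon ATG followed, in the same reading frame, by a stop codon (TAA, TAG, or TGA). The sketch below is only an illustration under simplifying assumptions (forward strand only, no introns, an arbitrary minimum length `min_codons`); the name `find_orfs` is hypothetical and this is not presented as the annotation method of the course.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3) -> list[tuple[int, int]]:
    """Return (start, end) positions, end exclusive, of forward-strand ORFs.

    An ORF here runs from the first ATG in a reading frame to the next
    in-frame stop codon, and must span at least `min_codons` codons
    (start and stop codons included).
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


dna = "CCATGAAATTTTAGGG"
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])
```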
1.3 Alignment
Evolutionary theory tells us that DNA of extant species evolved to its present form from
common ancestral DNA sequences. There are a number of changes that might occur over
time:
• Point mutations: (e.g. AGT becomes ACT )
• Deletions of bases: (e.g. AGT becomes AT )
• Insertions of material (often repetitive material)
• Large scale rearrangements (e.g. a long sequence with pieces $S_1 S_2 S_3$ becomes $S_2 S_3 \overline{S_1}$,
where $\overline{S_1}$ denotes the string $S_1$ read in reverse order).
The process starts with the ancestral sequence which, after many of the above steps,
eventually produces the DNA sequences of the extant species. Since we expect the
primary function of the DNA to be preserved over time, there should be a
correspondence between the DNA in two extant species. The alignment problem is to take
two DNA sequences in different species that correspond to one another (i.e. are homologous)
and find the pairing of bases which is the “best” for the given two sequences.
For example, the best alignment of the two sequences
ACGTACCGTA
AGTCCCACGGAC
might be
ACGTA----CCGTA-
A-GT-CCCACGG-AC
where the dashes “-” indicate insertion or deletion events.
Major issues to be dealt with in this section include:
• Finding a good notion of “best” alignment.
• Developing methods for optimizing over the (exponentially many) possible alignments
quickly.
• Understanding how the “best” alignment might change as we change features of the
alignment algorithms.
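To make these issues concrete, here is a minimal sketch of global alignment by dynamic programming (the Needleman-Wunsch recurrence). The scoring scheme is an assumption chosen for illustration: +1 for a match, -1 for a mismatch, and -1 per gap character. With a different scheme, the "best" alignment can change. The name `global_align` is hypothetical.

```python
def global_align(s: str, t: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (score, aligned_s, aligned_t) for one optimal global alignment."""
    n, m = len(s), len(t)
    # score[i][j] = best score for aligning s[:i] with t[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # pair s[i-1] with t[j-1]
                score[i - 1][j] + gap,      # s[i-1] against a gap
                score[i][j - 1] + gap,      # t[j-1] against a gap
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return score[n][m], "".join(reversed(a)), "".join(reversed(b))


print(global_align("AGT", "AT"))  # (1, 'AGT', 'A-T')
```

The table has (n+1)(m+1) entries, so the running time is proportional to the product of the two sequence lengths, even though the number of possible alignments is exponential.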
1.4 Phylogenetic Tree Reconstruction
Evolution is characterized by the emergence of new species, which descend from ancestral
species. To a first approximation, this evolutionary process is generally tree-like in nature:
at some point in time a speciation event occurs which splits one species into two or more
daughter species. Of course, we do not directly observe the underlying tree structure,
only characteristics of the species at the leaves of the tree (the extant species). The goal
of phylogenetics is to reconstruct the underlying evolutionary tree explaining the ancestral
relationships between the species. For instance, with 3 extant species, there are 3 binary
rooted trees that could potentially explain the evolutionary relationships between species:
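With hypothetical species labels A, B, C, these three trees can be written in nested-parenthesis form as (A,(B,C)), ((A,B),C), and ((A,C),B). The sketch below enumerates all rooted binary trees on a set of leaf labels and compares the count with the known formula (2n-3)!! = 1 · 3 · 5 ··· (2n-3) for n leaves. The function names are hypothetical, and trees are represented as nested tuples purely for illustration.

```python
def rooted_binary_trees(leaves):
    """Yield every rooted binary tree on the given leaf labels as nested tuples."""
    leaves = list(leaves)
    if len(leaves) == 1:
        yield leaves[0]
        return
    first, rest = leaves[0], leaves[1:]
    # Split the leaves into the two sides of the root. Keeping `first` on the
    # left side avoids listing each tree twice (once per mirror image).
    # The last mask value is skipped so that the right side is never empty.
    for mask in range(2 ** len(rest) - 1):
        left = [first] + [x for k, x in enumerate(rest) if (mask >> k) & 1]
        right = [x for k, x in enumerate(rest) if not (mask >> k) & 1]
        for left_tree in rooted_binary_trees(left):
            for right_tree in rooted_binary_trees(right):
                yield (left_tree, right_tree)


def newick(tree) -> str:
    """Format a nested-tuple tree in parenthesis notation."""
    if isinstance(tree, tuple):
        return "(" + ",".join(newick(child) for child in tree) + ")"
    return str(tree)


def count_rooted_binary_trees(n: int) -> int:
    """Number of rooted binary trees on n labeled leaves: (2n-3)!!."""
    count = 1
    for k in range(3, 2 * n - 2, 2):
        count *= k
    return count


for tree in rooted_binary_trees("ABC"):
    print(newick(tree))
print(count_rooted_binary_trees(4))
```

The count grows very quickly (15 trees for 4 species, 105 for 5), which is why exhaustive search over trees is only feasible for a small number of species.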
Phylogenetics is concerned with determining which of these trees is the correct tree. We
do this by analyzing the DNA sequences of the extant species. Intuitively, species that are
more closely related in the tree should have DNA sequences that are more similar. So the
“signal” from the tree should be apparent in the (multiply) aligned DNA sequences. We will
talk about a number of different approaches for reconstructing phylogenetic trees, based on
different measures for the similarities between the DNA:
• Distance based methods: for each pair of species compute a number that measures the
“distance” between the two species. Then find a nearby distance metric which comes
from a tree.
• Parsimony: Find the tree which requires the fewest mutations.
• Model based methods: Based on a probabilistic model for mutations, find the tree with
the largest likelihood of observing the given data.
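As one concrete instance of the parsimony idea, the sketch below uses Fitch's algorithm to count the minimum number of mutations a given rooted binary tree requires for a set of aligned sequences. The tree representation (nested tuples of species names), the toy sequences, and the function names are assumptions made for illustration, and the sequences are assumed to be aligned and of equal length.

```python
def fitch_site(tree, chars):
    """Return (possible root states, mutation count) for one alignment column.

    `tree` is a species name (leaf) or a pair (left, right);
    `chars` maps each species name to its character in this column.
    """
    if not isinstance(tree, tuple):
        return {chars[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], chars)
    right_set, right_cost = fitch_site(tree[1], chars)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra mutation needed here.
        return common, left_cost + right_cost
    # The children disagree: one mutation is needed on an edge below this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, seqs) -> int:
    """Sum the Fitch mutation counts over all columns of the alignment."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: s[i] for name, s in seqs.items()})[1]
        for i in range(length)
    )


# Toy data: two columns group A with B, one column groups A with C.
seqs = {"A": "AAG", "B": "AAT", "C": "CCG", "D": "CCT"}
candidates = [
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
    (("A", "D"), ("B", "C")),
]
for tree in candidates:
    print(tree, parsimony_score(tree, seqs))
```

On this toy data the three candidate trees need 4, 5, and 6 mutations respectively, so parsimony prefers the tree that groups A with B and C with D.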
We will likely spend the most time on topics in phylogenetics, as it is the most developed
mathematically (and I know more about it).
1.5 Gene trees to Species trees
Of primary interest in phylogenetics is to recover the tree relating a collection of species.
The methods discussed in the previous section usually only return the tree comparing
individual genes common to all species under analysis. It is an interesting fact that the true
gene tree might not match the true species tree, and that the trees can vary from gene to gene.
These variations have a number of potential causes, one of them being a population based
effect called “incomplete lineage sorting”. We will discuss different approaches for inferring the
species tree from the gene trees, primarily methods based on coalescent theory.
1.6 Mathematical Methods
The course will focus on a number of different methods from discrete mathematics to address
these problems.
• Combinatorics of strings
• Graph theory
• Probabilistic models (e.g. mutation models, Markov processes, hidden Markov models)
• Statistics (e.g. inference procedures for probabilistic models)
• Optimization (e.g. linear programming, polyhedral geometry)