Phylogenetics activities

Crop Species Interrelatedness:
Become a detective for a day
The amount of sequence data and the rate at which it is
being made available mean that we need to come up with
ever more powerful computational tools in order to
analyse it. But do you know what actually goes on behind
the scenes to create a phylogenetic tree? Here is a great
exercise that demonstrates how sequences are compared
and analysed for genetic differences, and how a family
history can be inferred to draw up a phylogenetic tree using
five common crop plants.
Created by Dr Emily Angiolini, based on ‘Bioinformatics with
pen and paper’ by Cleopatra Kozlowski
1 hour lesson
A bit of background
Technological advances in recent years are such that it is
now relatively quick, easy and cheap to determine DNA, RNA
or protein sequences (Figure 1) [1,2,3]. Think back to the
Human Genome Project: this was an international effort with
many labs contributing to the project over the course of
more than a decade. The technology available today means
that it is now possible for a single lab to produce the
same amount of data within a week [2]!

[Figure 1: Sequencing output in kilobases per day per
machine (logarithmic scale, 10 to 1 000 000 000) plotted
against year (1980 to 2010 and future). Technologies shown:
manual slab gel, automated slab gel, first-generation
capillary, second-generation capillary, microwell
pyrosequencers, short-read sequencers and single molecule
sequencers. Adapted from Stratton M, Campbell PJ and
Futreal PA (2009)]

It is all very well being able to churn out the sequences,
but what do they actually mean? Does a particular DNA
sequence code for a protein? What does that protein do
within a cell? What effect does a small change in DNA
sequence have on the protein’s structure and therefore
its function? How can we determine the evolutionary
history of a number of species, or how related they are to
one another? This is where bioinformatics comes into play:
we are able, for instance, to compare newly sequenced
stretches of DNA to those that have been sequenced
previously and for which we already know the function.
If the sequences contain similar patterns or ‘motifs’ then
perhaps the proteins encoded work in a similar way. Of
course, to make life easier (and faster) this sort of work is
usually done with the help of a super-powerful
computer. However, in allowing the computer to do all
the hard work we may begin to lose understanding of
how the comparisons are done. This activity is designed
to help you understand how bioinformatics can analyse
data using a simple pen and this paper - all the tables
and diagrams you need are right here!
Evolution of recipes
How a DNA sequence evolves over generations through
the accumulation of mutations can be considered
analogous to a recipe being passed from one
generation to the next. Each time a new technology is
invented, or when the recipe is passed on through word
of mouth or print, some element of the recipe is
changed. The most modern version of the recipe may
look similar to a relatively recent version although the
flavour may have subtly changed. However, when this
most modern recipe is compared to the original recipe,
the end result may vary quite drastically, such that it
does not even have the same approximate shape.
A good example of this is the pudding [4]. Puddings
started out as meat-based foods either encased in
intestines much like sausages or cooked in a pot like a
broth or porridge. Introduction of more cereals allowed
for sweet puddings, and the household oven together
with pudding cloths led to the baked or steamed
puddings which closely resemble the modern
Christmas pudding or Spotted Dick!

[Figure 2: Putative evolution of the British ‘pudding’.
Timeline from the 5th Century to the 19th Century (15th,
16th, 17th, 18th and 19th Centuries marked). Labelled
puddings: Black pudding (meat sausage), White pudding
(mainly cereal sausage), Pease pottage (cooked in a pot),
Pease pudding (more solid), Savoury (meat-based), Beef
steak and kidney pudding, Beef steak and mutton pudding,
Sweet (flour, nuts, sugar), Sponge puddings, Cake-style
puddings. Labelled innovations: introduction of pudding
cloth, household baking oven (still cool), introduction of
basins and steaming (e.g. plum/Christmas pudding)]
Evolution of sequences
Each time a sequence is copied, for example from one
generation to the next, mutations occur in that
sequence. Providing these mutations are not harmful to
the individual they are perpetuated through
subsequent generations. The accumulation of
mutations over time can be used to estimate the
relationship between different species. Classically
organisms would be compared by their physical
appearance to determine their relationship. Problems
can arise with the accuracy of suggested relationships,
however, when two organisms evolve a similar
appearance but through different routes, for example
birds and insects both developed wings.
Studies comparing DNA sequences have told us that
mutations occur very infrequently and at random
locations, being passed from parents to offspring. By
assuming that all organisms derived from a common
ancestor you can look at comparable sequences, for
instance those which encode the same protein, and
estimate how long ago they diverged from one another by
aligning them and counting the number of
mutations - the longer ago that they separated, the
greater the number of mutations will be. It is important
to understand that different parts of DNA evolve at
different rates. Regions of DNA which encode proteins
(coding regions) accumulate fewer mutations, because a
mutation there could produce a defective protein that is
detrimental to the organism, which is therefore less likely
to survive long enough to reproduce and perpetuate the
mutation.
Sequence Comparison
Table 2 Alignment of a 90bp sequence of the atpB gene from five crop species
Crop          Sequence
Barley        TGCCGATAAGCAAATTAATGTGACTTGTGAGGTACAACAATTATTAGGAAATAATCGAGTTAGAGCTGTAGCTATGAGTGCTACGGACGG
Wheat         TGACGATAAGCAAATTAATGTGACTTGTGAGGTACAACAATTATTAGGAAATAATCGAGTTAGAGCTGTAGCTATGAGTGCTACGGACGG
Oat           TTCCGATAAGCAAATTAATGTGACTTGTGAGGTACAACAATTATTAGGAAATAATCGAGTTAGAGCTGTAGCTATGAGTGCTACAGACGG
Rice          TGACGGTAAGCAAATTAATGTAACTTGTGAGGTACAACAATTATTAGGAAATAATCGAGTTAGAGCTGTAGCTATGAGTGCTACAGATGG
Oilseed Rape  TGACGTCAAGCAAATTAATGTGACTTGTGAAGTACAGCAATTATTAGGAAACAACCGAGTTAGAGCTGTAGCTATGAGCGCGACCGAGGG
Table 2 (above) shows the alignment of part of the atpB
gene from five different crop plant species. atpB
encodes a subunit of ATP synthase, the enzyme responsible
for generating ATP (the energy source for cells), and as
such is a highly conserved gene across many species [3].
Pairwise Comparison
The first step in determining the ancestry of these crop
plants is to make comparisons between all possible
paired combinations of species. Table 3 shows pairwise
comparisons between Barley and the four other species
with differences or mutations highlighted in red.
Continue to complete all pairwise comparisons using
Tables 4 to 6 by highlighting mutations with a coloured
pen, or encircling the nucleotides which are different
from the top-most sequence in the table,
i.e. for Table 4 compare Oat to Wheat, Rice
to Wheat and Oilseed Rape to Wheat.
Table 3 Pairwise comparison of the Barley atpB sequence
Crop          Sequence
Barley        TGCCGATAAGCAAATTAATGTGACTTGTGAGGTACAACAATTATTAGGAAATAATCGAGTTAGAGCTGTAGCTATGAGTGCTACGGACGG
Wheat         TGACGATAAGCAAATTAATGTGACTTGTGAGGTACAACAATTATTAGGAAATAATCGAGTTAGAGCTGTAGCTATGAGTGCTACGGACGG
Oat           TTCCGATAAGCAAATTAATGTGACTTGTGAGGTACAACAATTATTAGGAAATAATCGAGTTAGAGCTGTAGCTATGAGTGCTACAGACGG
Rice          TGACGGTAAGCAAATTAATGTAACTTGTGAGGTACAACAATTATTAGGAAATAATCGAGTTAGAGCTGTAGCTATGAGTGCTACAGATGG
Oilseed Rape  TGACGTCAAGCAAATTAATGTGACTTGTGAAGTACAGCAATTATTAGGAAACAACCGAGTTAGAGCTGTAGCTATGAGCGCGACCGAGGG
Table 4 Pairwise comparison of the Wheat atpB sequence
Crop          Sequence
Wheat         TGACGATAAGCAAATTAATGTGACTTGTGAGGTACAACAATTATTAGGAAATAATCGAGTTAGAGCTGTAGCTATGAGTGCTACGGACGG
Oat           TTCCGATAAGCAAATTAATGTGACTTGTGAGGTACAACAATTATTAGGAAATAATCGAGTTAGAGCTGTAGCTATGAGTGCTACAGACGG
Rice          TGACGGTAAGCAAATTAATGTAACTTGTGAGGTACAACAATTATTAGGAAATAATCGAGTTAGAGCTGTAGCTATGAGTGCTACAGATGG
Oilseed Rape  TGACGTCAAGCAAATTAATGTGACTTGTGAAGTACAGCAATTATTAGGAAACAACCGAGTTAGAGCTGTAGCTATGAGCGCGACCGAGGG
Table 5 Pairwise comparison of the Oat atpB sequence
Crop          Sequence
Oat           TTCCGATAAGCAAATTAATGTGACTTGTGAGGTACAACAATTATTAGGAAATAATCGAGTTAGAGCTGTAGCTATGAGTGCTACAGACGG
Rice          TGACGGTAAGCAAATTAATGTAACTTGTGAGGTACAACAATTATTAGGAAATAATCGAGTTAGAGCTGTAGCTATGAGTGCTACAGATGG
Oilseed Rape  TGACGTCAAGCAAATTAATGTGACTTGTGAAGTACAGCAATTATTAGGAAACAACCGAGTTAGAGCTGTAGCTATGAGCGCGACCGAGGG
Table 6 Pairwise comparison of the Rice atpB sequence
Crop          Sequence
Rice          TGACGGTAAGCAAATTAATGTAACTTGTGAGGTACAACAATTATTAGGAAATAATCGAGTTAGAGCTGTAGCTATGAGTGCTACAGATGG
Oilseed Rape  TGACGTCAAGCAAATTAATGTGACTTGTGAAGTACAGCAATTATTAGGAAACAACCGAGTTAGAGCTGTAGCTATGAGCGCGACCGAGGG
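The pairwise comparison described above amounts to counting the positions at which two aligned sequences differ. The following is a minimal Python sketch of that idea, not part of the original worksheet: it assumes the sequences are already aligned and of equal length, and the names `SEQUENCES`, `count_mutations`, `mutation_positions` and `mutation_table` are illustrative only. The sequences are those of Table 2 with the spacing removed.

```python
from itertools import combinations

# Aligned 90 bp atpB fragments copied from Table 2 (spacing removed).
SEQUENCES = {
    "Barley": "TGCCGATAAGCAAATTAATGTGACTTGTGAGGTACAACAATTATTAGGAAATAATCGAGTTAGAGCTGTAGCTATGAGTGCTACGGACGG",
    "Wheat": "TGACGATAAGCAAATTAATGTGACTTGTGAGGTACAACAATTATTAGGAAATAATCGAGTTAGAGCTGTAGCTATGAGTGCTACGGACGG",
    "Oat": "TTCCGATAAGCAAATTAATGTGACTTGTGAGGTACAACAATTATTAGGAAATAATCGAGTTAGAGCTGTAGCTATGAGTGCTACAGACGG",
    "Rice": "TGACGGTAAGCAAATTAATGTAACTTGTGAGGTACAACAATTATTAGGAAATAATCGAGTTAGAGCTGTAGCTATGAGTGCTACAGATGG",
    "Oilseed Rape": "TGACGTCAAGCAAATTAATGTGACTTGTGAAGTACAGCAATTATTAGGAAACAACCGAGTTAGAGCTGTAGCTATGAGCGCGACCGAGGG",
}


def count_mutations(seq_a, seq_b):
    """Count the positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def mutation_positions(seq_a, seq_b):
    """Return the 1-based positions of the nucleotides you would circle."""
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b), start=1) if a != b]


def mutation_table(sequences):
    """Mutation counts for every unordered pair of species (as in Table 7)."""
    return {
        (a, b): count_mutations(sequences[a], sequences[b])
        for a, b in combinations(sequences, 2)
    }


print(count_mutations(SEQUENCES["Barley"], SEQUENCES["Wheat"]))         # 1
print(count_mutations(SEQUENCES["Barley"], SEQUENCES["Oilseed Rape"]))  # 11
```

Calling `mutation_table(SEQUENCES)` gives one count per species pair, which is a convenient way to check your hand-marked Tables 3 to 6 once you have finished them.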
Complete Table 7 with the number of
mutations between each crop species
pair. For example, there is one (1)
nucleotide which differs between Barley
and Wheat, whereas a comparison
between Barley and Oilseed Rape reveals
11 mutations. This indicates that
Barley and Wheat are the most closely
related of the five species.
Proportional differences
You can now begin to populate Table 8
with the proportional difference in row 1.
For example, for Barley and Wheat divide
the number of different nucleotides (1) by
the length of the sequence which you
have compared (90) i.e. 1/90 = 0.0111 (to 4
decimal places). This gives an indication
of the proportional distance between the
two species.
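The arithmetic above is simply the number of differing nucleotides divided by the number of positions compared. A small Python sketch, with the hypothetical function name `proportional_difference` chosen only for illustration:

```python
def proportional_difference(n_differences, sequence_length):
    """Fraction of the compared positions that differ between two sequences."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    return n_differences / sequence_length


# Barley and Wheat: 1 difference over the 90 positions compared.
print(f"{proportional_difference(1, 90):.4f}")  # 0.0111
```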
Table 7 Number of mutations between crop species
              Barley  Wheat  Oat  Rice  Oilseed Rape
Barley        0       1      2    5     11
Wheat         1       0      __   __    __
Oat           2       __     0    __    __
Rice          5       __     __   0     __
Oilseed Rape  11      __     __   __    0
Table 8 Proportional distances between crop species
                                        No. Differences  Proportional difference
Barley and Wheat                        1                1/90 = 0.0111
Barley/Wheat and Oat                    __               __
Barley/Wheat/Oat and Rice               __               __
Barley/Wheat/Oat/Rice and Oilseed Rape  __               __
You then need to determine the number of mutations between the
common ancestor of Barley and Wheat (the Barley/Wheat ancestor)
and the remaining 3 species. The ancestral sequence is presumed
to be the ‘average sequence’ of the two species, and whilst it is
not physically determined here it is possible to determine the
proportional distance between the theoretical ancestor and each
of the other crop species in turn.
First you must calculate the number of mutations between the
ancestral sequence and the other crop species by taking an average
of the number of mutations for the two species comprising the
theoretical ancestor. For example, a comparison of Oat with
Barley reveals 2 mutations, and with Wheat reveals 3 mutations.
Therefore between Oat and the Barley/Wheat ancestor
there are (2+3)/2 = 2.5 mutations. Complete Table 9, then
Table 10 for the Barley/Wheat/Oat ancestor, not forgetting
that this ancestor now consists of 3 species: to get the
ancestral sequence differences you need to add up the
mutations from each of the individual contributing
sequences and divide by 3. Continue with Table 11
in this way.
Build a phylogenetic tree
Table 9 Number of mutations between Barley/Wheat ancestor and
other species
              Barley/Wheat
Barley/Wheat  0
Oat           (2+3)/2 = 2.5
Rice          __
Oilseed Rape  __
Table 10 Number of mutations between Barley/Wheat/Oat
ancestor and other species
                  Barley/Wheat/Oat
Barley/Wheat/Oat  0
Rice              __
Oilseed Rape      __
Table 11 Number of mutations between Barley/Wheat/Oat/Rice
ancestor and Oilseed Rape
                       Barley/Wheat/Oat/Rice
Barley/Wheat/Oat/Rice  0
Oilseed Rape           __
Now you may convert your values to proportional
distances in the same way as before to complete
Table 8.
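The averaging step used for Tables 9 to 11 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `ancestor_mutations` and the `MUTATIONS` lookup are hypothetical, and only the three pairwise counts quoted in the text are included, so the example reproduces the Oat calculation and nothing more.

```python
# Pairwise mutation counts quoted in the text.
MUTATIONS = {
    frozenset(("Barley", "Wheat")): 1,
    frozenset(("Barley", "Oat")): 2,
    frozenset(("Wheat", "Oat")): 3,
}
SEQUENCE_LENGTH = 90  # number of aligned positions compared


def ancestor_mutations(group, other, mutations):
    """Average the mutation counts between `other` and each member of `group`.

    `group` lists the species that make up the theoretical ancestor.
    """
    counts = [mutations[frozenset((member, other))] for member in group]
    return sum(counts) / len(counts)


average = ancestor_mutations(["Barley", "Wheat"], "Oat", MUTATIONS)
print(average)                             # 2.5
print(f"{average / SEQUENCE_LENGTH:.4f}")  # 0.0278
```

For the three-species and four-species ancestors the same function applies once the remaining counts from your completed Table 7 have been added to the lookup; the list passed as `group` simply grows by one species each time.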
Building the phylogenetic tree
Using the proportional distances that you have
calculated in Table 8 you can now begin to construct the
phylogenetic or evolutionary tree.
First of all you need to connect Barley and Wheat with a
trunk line whose length is dependent upon the time it
has taken for the two species to diverge from their
common ancestor, as indicated by the proportional
distance calculated in row 1 of Table 8. For the purpose
of this exercise we will assume that it would take 1000
million years for all of the nucleotides in the sequence
analysed to mutate. So, for our Barley/Wheat ancestor’s
sequence to diverge into the two separate species we
know today it would have taken 0.0111*1000 million =
11.1 million years, i.e. the split happened 11.1 million
years ago (mya). Draw a line back which
represents 11.1 million years on Figure 3. It may help
later on if you include the proportional distances by
writing them beside the trunk line when drawing your
tree.
The next step is to work out how long ago Oat, Barley
and Wheat diverged from a common ancestor. The way
to calculate this is to add the proportional distance
between Barley and Wheat (row 1 of Table 8) to that
between the Barley/Wheat ancestor and Oat (row 2 of
Table 8), and multiply by 1000 million years, like this:
(0.0111+0.0278)*1000 million years
= 0.0389*1000 million years
= 38.9 mya
Again mark this with a trunk line on Figure 3 along with
the proportional distance. Continue to draw up the
phylogenetic tree in this way until you have estimated
the divergence of all five species and their common
ancestors.
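The running total described above can be sketched in Python. The sketch assumes, as the exercise does, that it would take 1000 million years for every nucleotide to mutate; the names `YEARS_FOR_FULL_DIVERGENCE` and `divergence_times` are illustrative only, and only the first two rows of Table 8 (the values given in the text) are used.

```python
# The exercise's simplifying assumption: all nucleotides in the sequence
# would mutate in 1000 million years.
YEARS_FOR_FULL_DIVERGENCE = 1000  # in millions of years


def divergence_times(proportional_distances, scale=YEARS_FOR_FULL_DIVERGENCE):
    """Convert successive proportional distances into divergence times (mya).

    Each time is the running total of the distances so far, multiplied by
    the assumed number of years for complete divergence.
    """
    times = []
    total = 0.0
    for distance in proportional_distances:
        total += distance
        times.append(round(total * scale, 1))
    return times


# Rows 1 and 2 of Table 8: Barley and Wheat, then Barley/Wheat and Oat.
print(divergence_times([0.0111, 0.0278]))  # [11.1, 38.9]
```

Appending the proportional distances from rows 3 and 4 of your completed Table 8 to the list gives the remaining branch points for Figure 3.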
Questions
There are a few questions that you might like to think
about and which can help you to better understand the
process of drawing a phylogenetic tree. Answers can be
downloaded as a separate file from:
(http://www.tgac.bbsrc.ac.uk).
1. Are your estimates of time since divergence from
ancestors likely to be close to those published or to the
actual times?
2. What could cause your estimates of time since
divergence to be wildly different (hint: think about the
length of sequences that you have compared today and
the assumed rate of mutation)?
3. How would phylogenetic trees compare to one
another if they were built using calculations from
different DNA sequences (think about different lengths
of sequence used for comparison or different regions
such as within or outside of genes)?
4. What would you do if you had gaps in your aligned
sequences for comparison due to insertions or deletions
of nucleotides (as opposed to substitutions)?
References and More reading
1. Stratton M, Campbell PJ and Futreal PA (2009) Nature
458 (7239), 719-724
2. Linnarsson S (2010) Exp Cell Res 316, 1339-1343
3. Pedersen PL and Amzel LM (1993) J Biol Chem 268 (14),
9937-9940
4. ‘The Food Timeline’ Ed. Lynne Olver, accessed on 04/05/11
at: http://www.foodtimeline.org/foodpuddings.html
For an introduction to phylogenetics see:
http://www.ncbi.nlm.nih.gov/About/primer/phylo.html
and http://tinyurl.com/2wqp7nq
To find out more about a group of scientists who have
attempted to pen down a more accurate tree of life
visit:
http://www.embl.de/aboutus/communication_outreach/publications/annual_report/AnnualReport05-06.pdf
(page 166)
[Figure 3: A hypothetical phylogenetic tree showing interrelatedness between 5 common crop species.
Template for drawing your tree: time axis in million years ago, from 200 to 0 in steps of 20;
species listed as Oilseed Rape, Rice, Oat, Wheat and Barley]