Question 1.
Below are four gene sequences. These are taken from four animals that are believed to have
recent shared ancestry. The gene sequences are from a pseudogene, the evolutionary remnant
of a gene, which is now nonfunctional, in a given species or group of related species. In this
case, the gene is GULO (L-gulonolactone oxidase), which codes for the enzyme which catalyzes a
key step in the synthesis of ascorbic acid (vitamin C). Along the way, some animals have lost the
function of this gene (by random mutation) and must consume vitamin C in their diet.
#1 GGAGCTGAAGGCCATGCTGGAGGCCCACCCCGAGGTGGTGTCCCACTACCTGGTGGGGCTACGCTTCACCTGGAGG
#2 GGAGCTGAAGGCCATGCTGGAGGCCCACCCTGAGGTGGTGTCCCACTACCCGGTGGGGGTGCGCTTCACCCAGAGG
#3 GGAGCTGAAGGCCGTGCTGGAGGCCCACCCTGAGGTGGTGTCCCACTACCTGGTGGGGGTACGCTTCACCTGGAGG
#4 GGAGATGAAGGCCATGCTGGAGGCCCACCCTGAGGTGGTGTCCCACTAACCGGTGGGGGTGCGCTTCACCCAAGGG
1. Examine the four gene sequences and mark any differences among the sequences that you
can find.
2. Discuss the following questions: Do you notice any specific pattern? What could this
pattern mean regarding the ancestry/relatedness of the four species?
3. Make a hypothesis about the ancestry of these four species in the form of a phylogenetic
tree.
4. Draw this tree with the relevant synapomorphy and explain why you drew it this way.
5. Why are pseudogenes and other noncoding DNA sequences used commonly by
evolutionary biologists in constructing phylogenies?
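If you want to check your hand-marked differences, the comparison can be sketched in a few lines of Python. This is only an illustrative helper, not part of the lab software: the names `SEQUENCES`, `variable_sites` and `pairwise_differences` are made up for this sketch, and positions are counted from 1 at the left end of each sequence.

```python
from itertools import combinations

# The four GULO pseudogene fragments from Question 1, copied as given.
SEQUENCES = {
    "#1": "GGAGCTGAAGGCCATGCTGGAGGCCCACCCCGAGGTGGTGTCCCACTACCTGGTGGGGCTACGCTTCACCTGGAGG",
    "#2": "GGAGCTGAAGGCCATGCTGGAGGCCCACCCTGAGGTGGTGTCCCACTACCCGGTGGGGGTGCGCTTCACCCAGAGG",
    "#3": "GGAGCTGAAGGCCGTGCTGGAGGCCCACCCTGAGGTGGTGTCCCACTACCTGGTGGGGGTACGCTTCACCTGGAGG",
    "#4": "GGAGATGAAGGCCATGCTGGAGGCCCACCCTGAGGTGGTGTCCCACTAACCGGTGGGGGTGCGCTTCACCCAAGGG",
}


def variable_sites(seqs):
    """Return the 1-based positions at which the sequences are not all identical."""
    columns = zip(*seqs.values())
    return [pos for pos, column in enumerate(columns, start=1) if len(set(column)) > 1]


def pairwise_differences(seqs):
    """Count mismatching positions for every pair of sequences."""
    return {
        (a, b): sum(x != y for x, y in zip(seqs[a], seqs[b]))
        for a, b in combinations(seqs, 2)
    }


if __name__ == "__main__":
    print("Variable positions:", variable_sites(SEQUENCES))
    for pair, count in pairwise_differences(SEQUENCES).items():
        print(pair, count)
```

Pairs with few differences are candidates for close relatives, but the tree in parts 3 and 4 should still be argued from shared derived changes (synapomorphies), not from the raw counts alone.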
Question 2
While noncoding DNA sequences are extremely useful in analyzing the shared ancestry of
different species, protein-coding DNA sequences are also useful. However, the mutation and
evolution of protein-coding DNA sequences are more “constrained.” Why might this be?
Below is the amino acid sequence for a protein called SCML1, an enzyme necessary for male
embryonic development and male fertility in mammals. It is encoded by a gene on the X
chromosome. The amino acid sequence is only slightly different among five mammals, as shown
below (“…/…” represents a stretch of identical amino acids, which is omitted):
#1 MSNS…/…VIKT…/…DDNTI…/…EQLKTVDD…/…DALQN…/…RFHARSLWTNHKRYG…/…KKHSYRLVL…/…YETF
#2 MSDS…/…VVKT…/…DDNTI…/…EQLRTVND…/…DALQN…/…RFYARSLWTNRKRSG…/…KKHSYRPVL…/…YETF…
#3 MSNS…/…VVKT…/…DDDTI…/…EQLKTVND…/…DAMQN…/…RFHARFLWANRKRYG…/…KKHSYRLVL…/…YETF…
#4 MSNS…/…VVKT…/…DDDTI…/…EQLKTVND…/…DAMQN…/…RFHARSLWTNRKRYG…/…KKYSYRLVA…/…YESF…
#5 MSSS…/…VVKT…/…DDDTI…/…EQQKTVND…/…DAMQN…/…RFRARSLWTNRKRYG…/…KKYSYRLVA…/…YESF…
1. Examine the five amino acid sequences above and mark any differences among the
sequences that you can find.
2. Use the differences in amino acid sequence to retrace the ancestry of these five mammals.
Make a hypothesis in the form of a phylogenetic tree. Draw this tree on a separate piece of
paper, along with your notes explaining it.
Question 3 – Hypothesis testing
Distances
Evolutionary distances are fundamental for the study of molecular evolution and are useful
for phylogenetic reconstruction and estimation of divergence times. Distances can be estimated
using DNA sequence differences between individuals, or allele frequency differences between
populations. In this lab, you will examine DNA sequences and estimate distance measures that are
derived from nucleotide polymorphisms between individuals. MEGA, the program we will use to
draw phylogenetic trees, implements a number of distances based on mutation type, each of which
has its own assumptions. In this lab, we will use the Kimura 2-parameter distance as a measure
of divergence between sequences. Kimura’s two-parameter model (1980) takes into account
different transitional and transversional substitution rates. The model also assumes that the four
nucleotide frequencies are the same and that rates of substitution do not vary among sites.
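To make the model concrete, the Kimura 2-parameter distance can be computed by hand from the proportion of transitions (P) and transversions (Q) between two aligned sequences, using d = -(1/2) ln[(1 - 2P - Q) sqrt(1 - 2Q)]. The Python sketch below is an illustration only and is not MEGA's code; the function name `k2p_distance` and the choice to skip any site containing a gap or ambiguous base are assumptions made for this example.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a, seq_b):
    """Kimura 2-parameter distance between two aligned DNA sequences.

    Sites where either sequence has something other than A, C, G or T
    (for example a gap) are skipped; this handling is an assumption.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        # A<->G and C<->T are transitions; every other change is a transversion.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1

    if compared == 0:
        raise ValueError("no comparable sites")

    p = transitions / compared
    q = transversions / compared
    term1 = 1 - 2 * p - q
    term2 = 1 - 2 * q
    if term1 <= 0 or term2 <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(term1 * math.sqrt(term2))


if __name__ == "__main__":
    # One transition in ten sites: slightly more than the raw 0.1 difference.
    print(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"))
```

The corrected distance is always at least as large as the raw proportion of differing sites, because the model allows for substitutions that occurred at the same site more than once.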
Tree-drawing
Neighbor-joining uses distance measures that correct for multiple substitutions at the same
sites, and a topology showing the smallest value of the sum of all branches (S) is chosen as an
estimate of the correct tree. The number of possible topologies (unrooted trees) rapidly increases
with the number of taxa. Therefore, it becomes very difficult to examine all topologies.
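To see how quickly the number of topologies grows, recall that the number of unrooted, fully bifurcating trees for n taxa is 1 x 3 x 5 x ... x (2n - 5). The short Python sketch below simply evaluates this product; the function name `count_unrooted_trees` is chosen for this illustration.

```python
def count_unrooted_trees(n_taxa):
    """Number of unrooted bifurcating topologies: 1 * 3 * 5 * ... * (2n - 5)."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa for an unrooted tree")
    count = 1
    for odd in range(3, 2 * n_taxa - 4, 2):
        count *= odd
    return count


if __name__ == "__main__":
    for n in (4, 5, 10, 20):
        print(n, count_unrooted_trees(n))
```

Four taxa give only 3 possible trees, but 10 taxa already give more than two million, which is why methods such as neighbor-joining build one tree step by step instead of scoring every topology.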
The NJ method produces an unrooted tree because it does not assume a constant rate of
evolution. Therefore, you need to define an outgroup taxon to root the tree. In the absence of
outgroup taxa, the root is sometimes placed at the midpoint of the longest route connecting two
taxa in the tree. This is referred to as midpoint rooting.
Measuring support for the branches
Note that in addition to the distance estimates, MEGA also computes the standard errors of
the estimates using the analytical formulas for the distances and the bootstrap method.
Bootstrapping is a statistical procedure that allows an assessment of the confidence in your tree,
i.e. your hypothesis. In phylogenetic trees, bootstrap values do not correspond to p-values in
statistical tests, that is, there is no particular significance in a bootstrap value of 95%. However,
bootstrap values give a good indication of the support of a particular group by the sequence data.
The principle of bootstrapping is simple: an initial tree is constructed based on the original
data. Subsequently, the data are ‘resampled’, that is, the computer randomly draws base positions
from all sequences and puts them into a new data set. It does so with replacement, that is, after
drawing a base position it puts it back into the data set, and may therefore draw it several times.
This procedure is carried out until a randomized sequence has been created that has the same
length as your original sequence. After creating a specified number of such randomized data sets
(usually 1000 sets), trees are constructed for each (resulting in 1000 trees). Now the computer
checks how often each cluster in your original tree occurs in these resampled data sets. For
example, if the cluster of human and gorilla occurs in 560 trees of the resampled data, the
bootstrap value would be 56 (for 56% of all resampled trees). If a cluster only depends on very
few nucleotides, the bootstrap value will be low, because these few nucleotides will not have been
sampled (by chance) in quite a few of the resampled data sets. If a cluster is well supported by a
large proportion of the sequence, resampling will have little effect and the bootstrap value will be
high.
For example, consider a 10 bp sequence:

Position    1   2   3   4   5   6   7   8   9   10
Sea Lion    A   T   G   C   G   C   C   A   C   T
Hippo       A   C   G   C   G   C   C   A   T   T
Orca        A   C   G   C   A   C   C   A   T   T

Resample 1
Position    1   1   4   10  6   3   8   9   5   5
Sea Lion    A   A   C   T   C   G   A   C   G   G
Hippo       A   A   C   T   C   G   A   T   G   G
Orca        A   A   C   T   C   G   A   T   A   A

Resample 2
Position    6   7   7   4   3   2   1   10  9   3
Sea Lion    C   C   C   C   G   T   A   T   C   G
Hippo       C   C   C   C   G   C   A   T   T   G
Orca        C   C   C   C   G   C   A   T   T   G

Resample 3
Position    4   3   6   8   8   9   10  1   9   3
Sea Lion    C   G   C   A   A   C   T   A   C   G
Hippo       C   G   C   A   A   T   T   A   T   G
Orca        C   G   C   A   A   T   T   A   T   G
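The resampling step can also be sketched in Python using the 10 bp Sea Lion, Hippo and Orca example. This is a toy illustration, not what MEGA does internally: MEGA builds a tree for every resampled data set, whereas this sketch assumes a simplified criterion, counting the Hippo and Orca cluster as recovered when those two sequences differ at fewer positions than either does from Sea Lion. All function names and the fixed random seed are choices made for this example.

```python
import random

# The 10 bp example alignment, one string per taxon.
ALIGNMENT = {
    "Sea Lion": "ATGCGCCACT",
    "Hippo": "ACGCGCCATT",
    "Orca": "ACGCACCATT",
}


def resample_columns(alignment, rng):
    """Draw alignment columns with replacement until the original length is reached."""
    length = len(next(iter(alignment.values())))
    picks = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}


def differences(a, b):
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def supports_cluster(alignment, first, second, other):
    """Toy criterion (an assumption): first and second are closer to each other than to other."""
    inner = differences(alignment[first], alignment[second])
    outer = min(
        differences(alignment[first], alignment[other]),
        differences(alignment[second], alignment[other]),
    )
    return inner < outer


def bootstrap_support(alignment, first, second, other, replicates=1000, seed=0):
    """Percentage of resampled data sets in which the cluster is recovered."""
    rng = random.Random(seed)
    hits = sum(
        supports_cluster(resample_columns(alignment, rng), first, second, other)
        for _ in range(replicates)
    )
    return 100 * hits / replicates


if __name__ == "__main__":
    print(bootstrap_support(ALIGNMENT, "Hippo", "Orca", "Sea Lion"))
```

Because only two of the ten positions separate Sea Lion from the other two taxa, many resampled data sets will miss one or both of them, which is exactly the situation in which bootstrap values drop.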
Objective: Reconstruct a phylogeny, examine the effects of different assumptions on tree topology,
and use the tree to examine an alternate hypothesis on phenotypic trait evolution.
Methodology:
1. Align the sequences using GENEIOUS and trim the sequence to keep only the overlapping area.
2. Estimate distances between the sequences and build trees.
3. Examine likely evolutionary events that explain the evolution of virulence in Vibrio.
Downloading sequences
1. Download the file “GyrB.fasta” from the class website and save it on the desktop.
Aligning sequences
2. Open GENEIOUS
3. Create a folder in the Local directory: Select Local, right click and select New Folder.
4. Open the folder you just created. In the File tab, select Import – From File; Select the file
with the sequences you just saved. Select the “Keep sequences separate” option.
5. Select all the sequences; In the Tools tab, click on Alignment; Click OK
6. Open the newly created alignment. You will notice that the sequences are not the same
length for all your species. Scroll through your alignment until you get complete coverage
(i.e. all sequences are aligned). Select the part of the alignment where all sequences
overlap. Right click on it and select Extract Region. Select Extract Region as an
alignment. The dashes in the middle of the sequences are gaps representing insertions or
deletions and are evolutionarily meaningful.
7. Select the newly extracted alignment. In the File tab select Export, then Selected
Documents. Save the alignment as a Mega (*.meg) file on the desktop.
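The trimming in step 6 is done by hand in GENEIOUS, but the logic behind it can be sketched in Python. The sketch assumes the alignment is held as a dictionary of equal-length strings in which "-" marks a gap. It treats runs of dashes at the two ends of a sequence as missing coverage and keeps dashes inside a sequence, since those represent insertions or deletions. The function names are invented for this illustration.

```python
def covered_span(aligned_seq):
    """Return (start, end) of the region between leading and trailing gap runs."""
    start = len(aligned_seq) - len(aligned_seq.lstrip("-"))
    end = len(aligned_seq.rstrip("-"))
    return start, end


def trim_to_overlap(alignment):
    """Keep only the columns where every sequence has coverage.

    Internal gaps are preserved because they are evolutionarily meaningful.
    """
    spans = [covered_span(seq) for seq in alignment.values()]
    start = max(s for s, _ in spans)
    end = min(e for _, e in spans)
    if start >= end:
        raise ValueError("the sequences share no overlapping region")
    return {name: seq[start:end] for name, seq in alignment.items()}


if __name__ == "__main__":
    toy = {"a": "--ACGT-A", "b": "TTAC-TGA", "c": "-TACGTG-"}
    for name, seq in trim_to_overlap(toy).items():
        print(name, seq)
```

In the toy alignment, the overlap keeps the trailing dash in sequence a, because sequence a has a base further to the right and that dash is therefore an internal gap.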
Creating phylogenetic trees
8. Open the program MEGA.
9. Under File, select Open a File/Session. Select your file.
10. Open the Sequence file by clicking on the TA icon.
11. Under the Data menu select Explore Active Data. Do the following to gain more
knowledge about the sequences.
a. How many nucleotides are there in these sequences? (Look at the bottom of the
screen).
b. What is the transition/transversion ratio? Go to Statistics=>Nucleotide Pair
Frequencies=>Directional. The domains are the codon positions.
12. Using MEGA construct a Neighbor-Joining Tree of the data, and overlay the virulence
phenotype. Explore the effects of different distances and tree visualizing methods.
13. Using the full tree, explain the evolution of virulence in this system. Include images of
relevant trees (you can copy and paste the tree from MEGA).
Figure 1. Temporal variation in the amount of Vibrio bacteria in oyster hemolymph (HL).
Quantification of total Vibrio spp. (open circles and triangles) isolated from oyster hemolymph
stemming from the four transplant groups (origin_site), i.e. DB_DB, DB_OW, OW_DB, OW_OW. Blue
lines represent oysters assayed in site OW and black lines represent oysters assayed in site DB.
Solid lines show oyster origin DB, while dashed lines show oyster origin OW. Water temperature
is shown on the secondary y-axis (full circles, dotted line). The area shaded in grey marks the
spawning period.