Phylogenetic analysis

Selecting sequences

Outgroup sequences

Alignment

Choice of method

Example using one method
Three most important choices

Which sequences to include

Outgroup sequences

Alignment
[Figure: tree diagrams with tips labeled T1-T6 and O1-O4]
“Outgroup” sequences should be included

The best outgroup sequences are sequences clearly outside
the group being studied, but not too far out.

Multiple outgroup sequences should be chosen.

The outgroup sequences are included in the data matrix just
like the other sequences.

They will be used to root the tree.
Methods of phylogenetic analysis

Parsimony (Cladistics)

Maximum likelihood

Bayesian

Genetic distance (Neighbor-joining, etc.)
Parsimony (Cladistics)

Willi Hennig. 1950. Grundzüge einer Theorie der
phylogenetischen Systematik.

1966. Phylogenetic systematics.

Evidence comes from characters

Goal: build most parsimonious tree
Finding the most parsimonious tree

Goal: fewest evolutionary steps (optimality criterion)
• Fewest amino acid changes
• Fewest base changes

Many tree topologies are tested, and the one requiring the
fewest steps is chosen.

The tree found this way is unrooted.

Rooting the tree comes later.
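
The step count behind the parsimony criterion can be sketched in a few lines. This is a minimal illustration, assuming Fitch's counting method for unordered characters; the slides do not say which counting method is used. The four-taxon alignment and the names `fitch_steps` and `tree_length` are made up for the example. Trees are written as nested pairs only for convenience, since the step count does not depend on where the root is placed.

```python
def fitch_steps(tree, states):
    """Minimum number of changes for one site on a binary tree (Fitch counting).

    tree   -- nested 2-tuples of tip names, e.g. (("T1", "T2"), ("T3", "T4"))
    states -- dict mapping each tip name to its state at this site
    """
    def visit(node):
        if isinstance(node, str):                 # tip: only the observed state
            return {states[node]}, 0
        left_set, left_steps = visit(node[0])
        right_set, right_steps = visit(node[1])
        common = left_set & right_set
        if common:                                # children agree: no new change
            return common, left_steps + right_steps
        # children disagree: one change is needed on this part of the tree
        return left_set | right_set, left_steps + right_steps + 1

    return visit(tree)[1]


def tree_length(tree, alignment):
    """Total steps over all sites of an alignment (dict: tip name -> sequence)."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_steps(tree, {tip: seq[i] for tip, seq in alignment.items()})
        for i in range(n_sites)
    )


# Hypothetical aligned sequences, three sites long.
alignment = {"T1": "AAG", "T2": "AAG", "T3": "CAT", "T4": "CGT"}

# The three possible unrooted topologies for four taxa.
candidates = {
    "((T1,T2),(T3,T4))": (("T1", "T2"), ("T3", "T4")),
    "((T1,T3),(T2,T4))": (("T1", "T3"), ("T2", "T4")),
    "((T1,T4),(T2,T3))": (("T1", "T4"), ("T2", "T3")),
}

scores = {name: tree_length(tree, alignment) for name, tree in candidates.items()}
best = min(scores, key=scores.get)
print(scores)   # 3, 5 and 5 steps
print(best)     # ((T1,T2),(T3,T4))
```

With only four taxa every topology can be scored. For realistic numbers of taxa, programs search heuristically, because the number of possible trees grows far too quickly to test them all.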
Rooting the tree


The outgroup taxa are included in the data matrix just like the
other taxa.

Once the best tree is found, it is “rooted” along the branch
connecting the outgroup and ingroup taxa.
What to do in case of a tie: consensus

A “strict” consensus tree is one in which the branches not
present on all trees are collapsed, resulting in polytomies.

A “50% majority rule” consensus tree is one in which the
branches not present on more than 50% of the trees are
collapsed, resulting in polytomies.

Trees with many polytomies are said to be less resolved than
trees with few or no polytomies.
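
The two consensus rules can be sketched as set operations. This is an illustrative sketch, not the implementation in any particular program: each tree is assumed to be already reduced to the set of clades (groups of tips) it contains, and the three tied trees are invented for the example.

```python
from collections import Counter


def clade_counts(trees):
    """Count in how many trees each clade (a frozenset of tip names) appears."""
    return Counter(clade for tree in trees for clade in tree)


def strict_consensus(trees):
    """Keep only the clades present on every tree."""
    counts = clade_counts(trees)
    return {clade for clade, k in counts.items() if k == len(trees)}


def majority_rule_consensus(trees):
    """Keep only the clades present on more than half of the trees."""
    counts = clade_counts(trees)
    return {clade for clade, k in counts.items() if k > len(trees) / 2}


def clade(*tips):
    return frozenset(tips)


# Three hypothetical equally parsimonious trees, rooted with an outgroup
# that is left out of the clade sets.
tree_a = {clade("T1", "T2"), clade("T1", "T2", "T3"), clade("T1", "T2", "T3", "T4")}
tree_b = {clade("T1", "T2"), clade("T1", "T2", "T4"), clade("T1", "T2", "T3", "T4")}
tree_c = {clade("T1", "T2"), clade("T1", "T2", "T3"), clade("T1", "T2", "T3", "T4")}
trees = [tree_a, tree_b, tree_c]

strict = strict_consensus(trees)
majority = majority_rule_consensus(trees)
print(sorted(sorted(c) for c in strict))
print(sorted(sorted(c) for c in majority))
```

Here the clade (T1, T2, T3) appears on two of the three trees, so it survives the majority rule but is collapsed into a polytomy in the strict consensus.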
[Figure: tied trees for taxa T1-T6 and O1-O4, and their strict consensus]
Why are Maximum Likelihood and Bayesian methods
considered an improvement over parsimony?

+ They allow for a model of molecular evolution to be specified.
• Not all changes from one base to another (or from one amino
acid to another) are equally likely.
• Not all positions have the same probability of change.

- They require that the correct model be specified.
What is Maximum Likelihood (ML)?

Just like parsimony, ML examines lots of trees and picks the
best one.

However, the optimality criteria differ.
• Parsimony -- fewest changes.
• ML -- maximizes the probability of observing the data (aligned
sequences), given a model of molecular evolution.
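
To make "probability of observing the data, given a model" concrete, here is a deliberately small sketch. It is not a tree search. It assumes the Jukes-Cantor (JC69) model, in which all base changes are equally likely, and it scores only two sequences separated by a single branch of length `d` (expected substitutions per site). The sequences and function name are made up. A full ML analysis makes the same kind of calculation over every branch of every candidate tree.

```python
import math


def jc69_log_likelihood(seq_a, seq_b, d):
    """Log-likelihood of two aligned DNA sequences separated by distance d under JC69."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay     # probability a site ends in the same base
    p_diff = 0.25 - 0.25 * decay     # probability of ending in one specific other base
    total = 0.0
    for x, y in zip(seq_a, seq_b):
        # 0.25 is the equilibrium frequency of the starting base
        total += math.log(0.25 * (p_same if x == y else p_diff))
    return total


# Hypothetical aligned sequences that differ at 2 of 20 sites.
seq_a = "ACGTACGTACGTACGTACGT"
seq_b = "ACGTACGTACGAACGTTCGT"

# "Examine lots of candidates and pick the best": here the candidates are
# branch lengths on a grid rather than tree topologies.
grid = [i / 1000 for i in range(1, 2001)]
best_d = max(grid, key=lambda d: jc69_log_likelihood(seq_a, seq_b, d))

# For this simple case the maximum is also known in closed form.
p = sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)
analytic_d = -0.75 * math.log(1.0 - 4.0 * p / 3.0)
print(best_d, analytic_d)
```

The grid search and the closed-form estimate agree to within the grid spacing, which is a useful sanity check on the likelihood function.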
Models of molecular evolution

Substitution matrix
• For proteins, this is the (observed) probability of one amino acid
changing to another.
• For DNA, it is the probability of one base changing to another.

Site-to-site variation in rate of change
• Some sites don’t vary.
• Among those that do, they vary at different rates.
Why is using a correct model of molecular evolution
better than using parsimony?

Under some conditions, parsimony chooses the wrong tree
(long branch attraction).

Methods using a model are generally more precise and result in
fewer exact ties.
• For example, a change between two chemically similar amino
acids can be treated as a small difference. Under parsimony all
differences are simply “different”.
• Model-based methods usually choose a single best tree, whereas
parsimony often finds a large set of equally parsimonious trees.

Branch length estimates are more accurate with a model.
What is Bayesian phylogenetic analysis?

Just like ML, we search for the best trees that are consistent
with both the model and the data.

Optimality criterion:
• Bayesian -- maximizes the probability of the tree, given the
data (aligned sequences) and the model of molecular evolution.

Of the methods covered here, Bayesian analysis is the only one
that automatically provides a confidence estimate (a posterior
probability, read much like a bootstrap value) for each node.
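
The difference from ML is that the likelihood is combined with a prior and normalized, giving a probability for each tree. The toy sketch below assumes three candidate topologies with invented log-likelihoods and an equal prior; the function name is hypothetical. Real programs such as MrBayes do not enumerate trees this way. They sample trees, branch lengths, and model parameters with MCMC.

```python
import math


def posterior_probabilities(log_likelihoods, priors):
    """Posterior probability of each tree: likelihood x prior, normalized to sum to 1."""
    log_joint = {t: ll + math.log(priors[t]) for t, ll in log_likelihoods.items()}
    peak = max(log_joint.values())            # subtract the peak to avoid underflow
    weights = {t: math.exp(v - peak) for t, v in log_joint.items()}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}


# Invented log-likelihoods for the three unrooted four-taxon topologies.
log_likelihoods = {
    "((A,B),(C,D))": -1000.0,
    "((A,C),(B,D))": -1002.0,
    "((A,D),(B,C))": -1005.0,
}
priors = {tree: 1.0 / 3.0 for tree in log_likelihoods}   # every topology equally likely a priori

posterior = posterior_probabilities(log_likelihoods, priors)
for tree, prob in sorted(posterior.items(), key=lambda item: -item[1]):
    print(tree, round(prob, 4))
```

The per-node support values reported by a Bayesian analysis are obtained in the same spirit: the posterior probability of a node is the share of the sampled trees that contain it.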
Example - Bayesian analysis of signal transduction
proteins

Using ProtTest to find out how the sequences are evolving

Informing MrBayes of the model of molecular evolution

Using MrBayes to get the phylogeny

Making a figure
MrBayes doesn’t know when it has run long enough -- you decide.
A common stopping point is when the average standard deviation
of split frequencies falls below 0.01.
[Figure: two tree diagrams with tips labeled A-E]
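
MrBayes reports the average standard deviation of split frequencies itself, so the sketch below is only meant to show what the number measures. It assumes two independent runs, each summarized as the fraction of sampled trees containing each split. The split labels, the frequencies, and the 0.10 cutoff for ignoring rare splits are illustrative assumptions, as is the use of the sample standard deviation.

```python
from statistics import stdev


def average_sd_split_frequencies(runs, min_freq=0.10):
    """Average, over splits, of the standard deviation of each split's frequency across runs.

    runs     -- list of dicts mapping a split label to its frequency in that run
    min_freq -- splits rarer than this in every run are ignored (assumed cutoff)
    """
    splits = set().union(*runs)                  # every split seen in any run
    kept = [s for s in splits if max(r.get(s, 0.0) for r in runs) >= min_freq]
    return sum(stdev([r.get(s, 0.0) for r in runs]) for s in kept) / len(kept)


# Hypothetical split frequencies from two runs on taxa A-E.
run1 = {"AB|CDE": 0.98, "DE|ABC": 0.61, "CD|ABE": 0.05}
run2 = {"AB|CDE": 0.97, "DE|ABC": 0.60, "CE|ABD": 0.04}

asdsf = average_sd_split_frequencies([run1, run2])
print(asdsf, asdsf < 0.01)
```

When independent runs started from different random trees report nearly the same split frequencies, the value approaches zero, which is the evidence that the runs have converged on the same set of trees.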
What is Neighbor-joining (NJ)?

NJ is an algorithm for building a tree.

There is no optimality criterion.

First, a matrix of distances between all pairs of sequences
is computed.
• A substitution matrix is needed to do this.

Then the pair of sequences whose joining most reduces the
total length of the tree is combined into a single node, and
the step is repeated until the tree is complete.
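
The pair-selection step can be sketched as follows. The sketch assumes the standard neighbor-joining criterion Q(i, j) = (n - 2) d(i, j) - sum of row i - sum of row j, where the pair with the lowest Q is joined. The five-taxon distance matrix is invented, and the distances are taken as given rather than computed from sequences with a substitution matrix.

```python
def nj_pick_pair(names, dist):
    """Return the pair with the lowest neighbor-joining Q value, and that value."""
    n = len(names)
    row_sum = {a: sum(dist[a][b] for b in names if b != a) for a in names}
    best_pair, best_q = None, None
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            q = (n - 2) * dist[a][b] - row_sum[a] - row_sum[b]
            if best_q is None or q < best_q:
                best_pair, best_q = (a, b), q
    return best_pair, best_q


# Hypothetical pairwise distances between five sequences.
names = ["a", "b", "c", "d", "e"]
pairs = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7,
    ("d", "e"): 3,
}
dist = {x: {} for x in names}
for (x, y), value in pairs.items():
    dist[x][y] = dist[y][x] = value

pair, q = nj_pick_pair(names, dist)
print(pair, q)   # ('a', 'b') -50
```

Note that the pair joined first is (a, b), not (d, e), even though d and e have the smallest distance. The criterion also accounts for how far each sequence is from all the others.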
Neighbor-joining

NJ is very fast.

There is no optimality criterion.
• This means there is no way to assess its success.
• There is also no way to say whether a “best” tree is
significantly better than a set of “next best” trees. (mt Eve)

The tree it chooses is not always the shortest. Distances
are estimated from noisy data and early mistakes in NJ
can’t be revisited.
Large data sets

If you have over 50 sequences, or if you have very long
sequences (hundreds of proteins), ProtTest and MrBayes
may take more than a couple of days to finish.

Parsimony is much faster.
• It allows node support (bootstrap values) to be calculated.
• It doesn’t require a model of molecular evolution.
• PAUP* can read NEXUS files.
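
Bootstrap values come from repeating the analysis on resampled versions of the alignment. The sketch below shows only the resampling step, with a made-up alignment and a hypothetical function name. In practice the phylogenetics program does this internally and then reports, for each node, the percentage of replicates in which that node was recovered.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Build one bootstrap replicate by sampling alignment columns with replacement."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


# Hypothetical alignment of three ingroup sequences and one outgroup sequence.
alignment = {
    "T1": "ACGTTGCA",
    "T2": "ACGTTGCC",
    "T3": "ACCTTGAA",
    "O1": "GCCTAGAA",
}

rng = random.Random(0)      # fixed seed so the example is repeatable
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]
print(replicates[0])
```

Each replicate has the same taxa and the same length as the original alignment, but some columns appear more than once and others not at all.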

NJ is faster still. Sometimes it is the only method that is
fast enough.
• A default model of molecular evolution must be used.
DNA sequences should be used when sequences are
highly similar

The procedure is very similar to the one used for proteins.

Use MrModelTest instead of ProtTest.
Summary

Three most important choices
• Which sequences to include
• Outgroup sequences
• Alignment


Choice of method - Bayesian
Example - Look on Ned’s Computational Corner for more
details.