Phylogenic PPT - Montville.net

Phylogenetic basis of systematics
• Linnaeus:
Ordering principle is God.
• Darwin:
Ordering principle is shared
descent from common ancestors.
• Today, systematics is explicitly
based on phylogeny.
Goals of Phylogenetic Analysis
• Given a multiple sequence
alignment, determine the
ancestral relationships among
the species.
• We assume that residues in a
column are homologous, and
that all columns have the same
history.
[Figure: tree relating the four taxa Hu, Ch, Go and Gi, drawn against a time axis]
Types of Phylogenetic Trees:
• 1. Cladogram:
• shows the relationships between different organisms
• branch lengths are arbitrary
• 2. Phylogram:
• branch lengths represent evolutionary time or amount of change.
Data
• Biomolecular sequences: DNA, RNA, amino acid, in a multiple
alignment
• Molecular markers (e.g., SNPs)
• Morphology
• Gene order and content
These are “character data”: each character is a function mapping the
set of taxa to distinct states (equivalence classes), with evolution
modelled as a process that changes the state of a character
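The definition above can be made concrete with a small sketch. It assumes a toy alignment of four taxa (the four short sequences used later in the parsimony slides) and treats each alignment column as one character; the helper names `character` and `equivalence_classes` are illustrative, not from any library.

```python
# Toy alignment: taxon name -> aligned sequence (assumed example data).
alignment = {"a": "AAG", "b": "GGA", "c": "AAA", "d": "AGA"}


def character(alignment, column):
    """Return one character as a mapping from taxon to state."""
    return {taxon: seq[column] for taxon, seq in alignment.items()}


def equivalence_classes(char):
    """Group the taxa that share the same state for this character."""
    classes = {}
    for taxon, state in char.items():
        classes.setdefault(state, set()).add(taxon)
    return classes


second_column = character(alignment, 1)
print(second_column)
print(equivalence_classes(second_column))
```

Here the second column splits the taxa into two classes, one per nucleotide state.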
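DNA Sequence Evolution
[Figure: a DNA sequence evolving along the branches of a tree, with a time axis marked -3 mil yrs, -2 mil yrs, -1 mil yrs and today. Sequences shown: AAGACTT, AAGGCCT, AGGGCAT, AGGGCAT, TAGCCCT, TAGCCCA, TGGACTT, TAGACTT, AGCACTT, AGCACAA, AGCGCTT]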
Phylogenetic Analyses
• Step 1: Gather sequence data, and estimate the multiple alignment of
the sequences.
• Step 2: Reconstruct trees on the data. (This can result in many trees.)
• Step 3: Apply consensus methods to the set of trees to figure out
what is reliable.
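Phylogeny Problem
[Figure: the input is five sequences labelled U, V, W, X, Y (AGGGCAT, TAGCCCA, TAGACTT, TGCACAA, TGCGCTT); the output is a tree whose leaves are X, U, Y, V, W]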
Types of Phylogenetic Methods
• Character-based
• Parsimony
• Likelihood
Involve optimizing a criterion based
on fit of the residues to the tree.
• Distance-based
• Neighbor joining (NJ)
• UPGMA
Involve optimizing a criterion based
on fit of a matrix of pairwise
distances to the tree
• Parsimony: Select the tree that explains the data with the fewest substitutions.
http://study.com/academy/lesson/maximum-parsimony-likelihood-methods-in-phylogeny.html
• Likelihood: Select the tree that has the highest probability of producing the observed data.
• Distance: Select the tree that best recreates the observed pairwise distances.
https://www.youtube.com/watch?v=NRRErwFsIcw
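The pairwise distances that a distance method starts from can be sketched as follows. This is a minimal illustration, assuming the simplest possible measure (the proportion of differing sites, often called the p-distance) and assuming the five sequences of the Phylogeny Problem slide are paired with the labels U to Y in order of appearance; real analyses usually correct these distances with a substitution model.

```python
from itertools import combinations

# Assumed pairing of labels and sequences from the Phylogeny Problem slide.
seqs = {
    "U": "AGGGCAT",
    "V": "TAGCCCA",
    "W": "TAGACTT",
    "X": "TGCACAA",
    "Y": "TGCGCTT",
}


def p_distance(s, t):
    """Proportion of aligned positions at which two sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(s, t)) / len(s)


def distance_matrix(seqs):
    """Symmetric matrix of pairwise p-distances, keyed by (name, name)."""
    d = {(name, name): 0.0 for name in seqs}
    for a, b in combinations(seqs, 2):
        d[a, b] = d[b, a] = p_distance(seqs[a], seqs[b])
    return d


matrix = distance_matrix(seqs)
for name in seqs:
    print(name, [round(matrix[name, other], 2) for other in seqs])
```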
Phylogenetic Tree Building
Two basic types:
Gene/protein tree: represents evolutionary history of genes/proteins
Species tree: represents the evolutionary history of species based on characters
(like protein sequences)
[Figure: the same eight sequences (ORFP MG01127.1, NC U01640.1, ORFP YDL020C, Scastellii, Skluyeri, orf6.4920.prot, AN0709.2, H.) drawn as a rooted, binary tree and as an unrooted, binary tree]
* Can root a tree using an outgroup: a known distant relative
Branch lengths (“distance”) ~ time
[Figure: rooted tree with leaves ORFP MG01127.1, NC U01640.1, ORFP YDL020C, Scastellii, Skluyeri, orf6.4920.prot, AN0709.2, H.]
• Root (ancestral species)
• Edges
• Nodes (common ancestor)
• Leaves (modern observations)
Why is the structure
of the tree important?
Branching represents
speciation into
two new species
This tree can also be denoted in text (Newick) format, with the leaves numbered 1 to 8 as in the figure:
(((((3,4),(5,6)),7),(1,2)),8);
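To show how such a text description maps back to a tree, here is a minimal sketch of a parser that turns a parenthesised string into nested tuples. It assumes leaf labels only (no branch lengths or internal node names), and it writes the tree from the slide with the enclosing pair of parentheses and trailing semicolon that the Newick convention expects. The function names are illustrative.

```python
def parse_newick(text):
    """Parse a Newick string without branch lengths into nested tuples."""
    text = text.strip().rstrip(";").replace(" ", "")
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                      # consume "("
            children = [node()]
            while text[pos] == ",":
                pos += 1                  # consume ","
                children.append(node())
            if text[pos] != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1                      # consume ")"
            return tuple(children)
        start = pos                       # otherwise read a leaf label
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]

    return node()


def leaves(tree):
    """List the leaf labels of a nested-tuple tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


tree = parse_newick("(((((3,4),(5,6)),7),(1,2)),8);")
print(tree)
print(leaves(tree))
```

Each pair of parentheses becomes one internal node, so the nesting depth of a leaf is the number of ancestors between it and the root.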
Building phylogenetic trees
1. Distance based methods
a. Calculate evolutionary distances between sequences
b. Build a tree based on those distances
2. Maximum Parsimony (character based method)
a. Find the simplest tree that explains the data with the fewest # of substitutions
3. Maximum Likelihood (probabilistic method based on explicit model)
a. Find the tree that is most likely, given an evolutionary model
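Steps 1a and 1b can be sketched with UPGMA, one of the distance methods named earlier. This is a minimal illustration under stated assumptions: distances are raw counts of differing sites, ties between equally close pairs are broken by input order, and only the tree shape is returned (branch lengths are left out). The four toy sequences are the ones used in the parsimony slides.

```python
from itertools import combinations


def hamming(s, t):
    """Number of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(s, t))


def upgma(seqs):
    """Cluster {name: sequence} into a nested-tuple tree by UPGMA."""
    size = {name: 1 for name in seqs}     # cluster -> number of leaves
    # Step a: pairwise distances between all sequences.
    dist = {frozenset(p): float(hamming(seqs[p[0]], seqs[p[1]]))
            for p in combinations(seqs, 2)}
    # Step b: repeatedly merge the two closest clusters.
    while len(size) > 1:
        a, b = min(combinations(size, 2), key=lambda p: dist[frozenset(p)])
        merged = (a, b)
        for c in size:
            if c != a and c != b:
                # Size-weighted average of the distances to a and to b.
                dist[frozenset((merged, c))] = (
                    size[a] * dist[frozenset((a, c))]
                    + size[b] * dist[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[merged] = size.pop(a) + size.pop(b)
    return next(iter(size))


toy = {"a": "AAG", "b": "GGA", "c": "AAA", "d": "AGA"}
print(upgma(toy))
```

With so few sites several pairs are tied at distance 1, so the result depends on the tie-breaking rule; longer alignments make such ties less likely.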
Building phylogenetic trees
1. Distance based methods
2. Maximum Parsimony (character based method)
Search all possible trees and find the one requiring the fewest substitutions
[Figure: four sequences to be placed on a tree: a = AAG, b = GGA, c = AAA, d = AGA]
[Figure: candidate tree pairing a (AAG) with c (AAA), and b (GGA) with d (AGA)]
What are the ancestral sequences at each node?
How many base changes are required for this tree?
[Figure: the tree pairing a with c and b with d, with ancestral sequences filled in: AAA at the ancestor of a and c, AGA at the ancestor of b and d, and AAA or AGA at the root]
3 changes are required: AAA to AAG on the branch to a, AGA to GGA on the branch to b, and one A/G change in the second position between the two ancestors.
The score of a tree is the number of character changes it requires.
MP aims to minimize the score of the tree.
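The score can be computed without guessing the ancestral sequences by hand. The sketch below uses Fitch's algorithm, a standard way to count the minimum number of changes on a fixed binary tree; the slides do not name a specific algorithm, so this is one possible choice. It then scores the three possible groupings of the four example sequences. Trees are written as nested tuples, and the function name is illustrative.

```python
def fitch_score(tree, seqs):
    """Minimum number of substitutions needed on a rooted binary tree."""
    n_sites = len(next(iter(seqs.values())))
    total = 0
    for site in range(n_sites):
        changes = 0

        def states(node):
            nonlocal changes
            if isinstance(node, str):             # leaf: its observed state
                return {seqs[node][site]}
            left, right = (states(child) for child in node)
            common = left & right
            if common:                            # children can agree
                return common
            changes += 1                          # otherwise one change is needed
            return left | right

        states(tree)
        total += changes
    return total


seqs = {"a": "AAG", "b": "GGA", "c": "AAA", "d": "AGA"}
topologies = [
    (("a", "b"), ("c", "d")),
    (("a", "c"), ("b", "d")),
    (("a", "d"), ("b", "c")),
]
scores = {tree: fitch_score(tree, seqs) for tree in topologies}
best = min(scores, key=scores.get)
print(scores)
print("most parsimonious:", best)
```

For four taxa there are only three groupings to compare, so an exhaustive search is trivial; the number of possible trees grows very quickly with more taxa, which is why practical programs rely on heuristic searches.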
How can you tell if your tree is significant?
Bootstrapping: how dependent is the tree on the dataset?
1. Randomly choose n objects (e.g., alignment columns) from your dataset of n, with replacement
2. Rebuild the tree based on the resampled data
3. Repeat 1,000 – 10,000 times
4. How often are the same children joined?
If a given node is represented in <x trials, collapse the node for a ‘consensus’ tree
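The four bootstrap steps can be sketched as below. To stay short, the sketch does not rebuild a full tree in step 2: as a labelled stand-in it only records which two taxa are closest in each replicate, and reports how often each pair is joined. The sequences, their pairing with the labels U to Y, the function names and the threshold value are all assumptions made for illustration.

```python
import random
from collections import Counter
from itertools import combinations

# Assumed example alignment (label-to-sequence pairing is an assumption).
seqs = {"U": "AGGGCAT", "V": "TAGCCCA", "W": "TAGACTT",
        "X": "TGCACAA", "Y": "TGCGCTT"}


def bootstrap_columns(seqs, rng):
    """Step 1: draw n columns from the n alignment columns, with replacement."""
    n = len(next(iter(seqs.values())))
    cols = [rng.randrange(n) for _ in range(n)]
    return {name: "".join(s[i] for i in cols) for name, s in seqs.items()}


def closest_pair(seqs):
    """Stand-in for step 2: the two taxa with the fewest differences."""
    def differences(pair):
        return sum(x != y for x, y in zip(seqs[pair[0]], seqs[pair[1]]))
    return frozenset(min(combinations(sorted(seqs), 2), key=differences))


def bootstrap_support(seqs, replicates=1000, seed=0):
    """Steps 3 and 4: repeat, then count how often each pair is joined."""
    rng = random.Random(seed)
    counts = Counter(closest_pair(bootstrap_columns(seqs, rng))
                     for _ in range(replicates))
    return {pair: count / replicates for pair, count in counts.items()}


support = bootstrap_support(seqs)
threshold = 0.5                       # hypothetical cutoff x
kept = {pair for pair, value in support.items() if value >= threshold}
for pair, value in sorted(support.items(), key=lambda item: -item[1]):
    print(sorted(pair), round(value, 3))
```

Groupings that fall below the cutoff are the ones that would be collapsed in the consensus tree.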
Jackknifing: how dependent is the tree on the dataset?
1. Randomly choose k objects from your dataset of n (k < n), without replacement
2. Rebuild the tree based on the subset of the data
3. Repeat 1,000 – 10,000 times
4. How often are the same children joined?
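The only difference from bootstrapping is the sampling step, which can be sketched as follows. The sketch assumes the objects are alignment columns and keeps the chosen columns in their original order; the function name and the choice k = 4 are illustrative.

```python
import random


def jackknife_columns(seqs, k, rng):
    """Keep k of the n alignment columns, chosen without replacement."""
    n = len(next(iter(seqs.values())))
    if not 0 < k < n:
        raise ValueError("k must be between 1 and n - 1")
    cols = sorted(rng.sample(range(n), k))   # distinct column indices
    return {name: "".join(s[i] for i in cols) for name, s in seqs.items()}


example = {"U": "AGGGCAT", "V": "TAGCCCA", "W": "TAGACTT"}
subset = jackknife_columns(example, k=4, rng=random.Random(0))
print(subset)
```

Because no column is drawn twice, every replicate is a true subset of the data, whereas a bootstrap replicate has the original size but repeats some columns.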
[Figure: Maximum Likelihood tree of ORFP MG01127.1, NC U01640.1, ORFP YDL020C, Scastellii, Skluyeri, orf6.4920.prot, AN0709.2 and H., showing the Bayesian Inference/Maximum Parsimony/Maximum Likelihood support value at each node; values shown include 70, 80, 95 and 100]