Methods for the analysis of mitochondrial DNA data – part 1

SUMMER SCHOOL 2008
PIACENZA, ITALY - 10 September 2008
Licia Colli, U.C.S.C. di Piacenza
licia.colli@unicatt.it
•The mitochondrial genome
•Sequence format and alignment
•Input file formats most frequently used in mtDNA analyses
•Molecular diversity indices
•Analysis of MOlecular VAriance (AMOVA)
•Mismatch distribution and estimates of population expansion
•Admixture analysis
•Trees:
-generalities;
-models of DNA sequence evolution and choice of the best-fitting model
-tree reconstruction strategies
-distance-based methods (NJ)
-character-based methods (MP, ML, Bayesian)
-molecular clock and calculations of divergence times
-bootstrap and jackknife
•Software list
•References
The mitochondrial genome (mtDNA)
• Its length varies among species (15-17 kb)
• Multiple copies in each cell (a mammalian egg cell contains about 100,000 copies)
• Lack of recombination
• Haploid, maternally inherited
• High mutation rate
• 13 protein-coding genes, 2 rRNA sequences (12S and 16S), 22 tRNA sequences and 1 non-coding region (control region or displacement loop)
• The mitochondrial genetic code differs slightly from the nuclear code:

Codon   Nuclear        Mitochondrial
TGA     stop codon     Trp (W)
ATA     Ile (I)        Met (M)
AGA     Arg (R)        stop codon
AGG     Arg (R)        stop codon
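
The four codon reassignments listed above can be checked programmatically. The following is a minimal sketch that scans a coding sequence in frame and reports every codon read differently by the two codes; the function name and the example fragment are hypothetical, and only the four codons from the table are covered (it is not a full translation table).

```python
# Codons whose meaning differs between the two codes (values from the table above).
# "*" marks a stop codon.
NUCLEAR = {"TGA": "*", "ATA": "I", "AGA": "R", "AGG": "R"}
MITOCHONDRIAL = {"TGA": "W", "ATA": "M", "AGA": "*", "AGG": "*"}


def reassigned_codons(seq):
    """Return (position, codon, nuclear_aa, mito_aa) for each in-frame codon
    that the nuclear and mitochondrial codes translate differently."""
    seq = seq.upper()
    hits = []
    # Walk the sequence codon by codon, ignoring an incomplete trailing codon.
    for i in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[i:i + 3]
        if codon in MITOCHONDRIAL:
            hits.append((i, codon, NUCLEAR[codon], MITOCHONDRIAL[codon]))
    return hits


example = "ATGATATGAAGA"  # hypothetical fragment: ATG ATA TGA AGA
result = reassigned_codons(example)
for position, codon, nuclear_aa, mito_aa in result:
    print(position, codon, nuclear_aa, "->", mito_aa)
```

A scan of this kind is mainly useful as a sanity check before translating mtDNA protein-coding genes with a tool that defaults to the nuclear code.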
The mitochondrial genome (mtDNA)
A useful molecule, indeed…
• genealogy
• phylogeny (cytochrome b, 12S, 16S, control region, whole mtDNA)
• phylogeography (cytb, control region, whole mtDNA)
• species identification (cytb, control region)
• population studies (+ other markers)
• detection of “cryptic species” and “barcoding” projects (COXI)
• studies on the domestication process
• studies on male fertility/infertility
• studies on ancient DNA (aDNA)…
…and many other applications.
Sequence format and alignment
EditPlus: a text editor useful to handle sequences and prepare input files.
Freely downloadable 30-day evaluation version:
http://www.editplus.com/download.html

FASTA (filename.txt)

>Seq_1
cccctaatatgtacaataatgaatgttgtaaattagtgttataacacatctatgtataat
>Seq_2
CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
>Seq_3
CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
>Seq_4
CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
>Seq_5
CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
>Seq_6
CCCCTAATAGGTACAATAATTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
>Seq_7
CCCCTAATTTGTACAATAATGAATGTTGTAAATTAATGTTATAACACATCTATGTATAAT
>Seq_8
CCCCTAATATGTCCAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
>Seq_9
CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
>Seq_10
CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT

ClustalX (filename.aln)

CLUSTAL X (1.83) multiple sequence alignment

Seq_1     CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_2     CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_3     CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_4     CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_5     CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_6     CCCCTAATAGGTACAATAATTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_7     CCCCTAATTTGTACAATAATGAATGTTGTAAATTAATGTTATAACACATCTATGTATAAT
Seq_8     CCCCTAATATGTCCAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_9     CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_10    CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
          ********  ** ******  ************** ************************
Input file formats
Phylip (filename.txt; filename.phy)

10 60
Seq_1     cccctaatatgtacaataatgaatgttgtaaattagtgttataacacatctatgtataat
Seq_2     CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_3     CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_4     CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_5     CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_6     CCCCTAATAGGTACAATAATTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_7     CCCCTAATTTGTACAATAATGAATGTTGTAAATTAATGTTATAACACATCTATGTATAAT
Seq_8     CCCCTAATATGTCCAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_9     CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_10    CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT

or otherwise

10 60
Seq_1     cccctaatatgtacaataatgaatgttgta
Seq_2     CCCCTAATATGTACAATAATGAATGTTGTA
Seq_3     CCCCTAATATGTACAATAATGAATGTTGTA
Seq_4     CCCCTAATATGTACAATAATGAATGTTGTA
Seq_5     CCCCTAATAGGTACAATAACTAATGTTGTA
Seq_6     CCCCTAATAGGTACAATAATTAATGTTGTA
Seq_7     CCCCTAATTTGTACAATAATGAATGTTGTA
Seq_8     CCCCTAATATGTCCAATAATGAATGTTGTA
Seq_9     CCCCTAATATGTACAATAATGAATGTTGTA
Seq_10    CCCCTAATATGTACAATAATGAATGTTGTA

aattagtgttataacacatctatgtataat
AATTAGTGTTATAACACATCTATGTATAAT
AATTAGTGTTATAACACATCTATGTATAAT
AATTAGTGTTATAACACATCTATGTATAAT
AATTAGTGTTATAACACATCTATGTATAAT
AATTAGTGTTATAACACATCTATGTATAAT
AATTAATGTTATAACACATCTATGTATAAT
AATTAGTGTTATAACACATCTATGTATAAT
AATTAGTGTTATAACACATCTATGTATAAT
AATTAGTGTTATAACACATCTATGTATAAT

MEGA (filename.meg)

#Mega
title: title_of_your_project
#Seq_1    cccctaatatgtacaataatgaatgttgtaaattagtgttataacacatctatgtataat
#Seq_2    CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
#Seq_3    CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
#Seq_4    CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
#Seq_5    CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
#Seq_6    CCCCTAATAGGTACAATAATTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
#Seq_7    CCCCTAATTTGTACAATAATGAATGTTGTAAATTAATGTTATAACACATCTATGTATAAT
#Seq_8    CCCCTAATATGTCCAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
#Seq_9    CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
#Seq_10   CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT

or otherwise

#Mega
title: title_of_your_project
#Seq_1
cccctaatatgtacaataatgaatgttgtaaattagtgttataacacatctatgtataat
#Seq_2
CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
#Seq_3
CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
#Seq_4
CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
#Seq_5
CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
#Seq_6
CCCCTAATAGGTACAATAATTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
#Seq_7
CCCCTAATTTGTACAATAATGAATGTTGTAAATTAATGTTATAACACATCTATGTATAAT
#Seq_8
CCCCTAATATGTCCAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
#Seq_9
CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
#Seq_10
CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
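
Converting between these formats by hand in a text editor is error-prone, so a small script can help. The sketch below reads FASTA text and writes the sequential Phylip layout (header with number of sequences and alignment length, then one name plus sequence per line). The function names are hypothetical, and padding names to 10 characters is an assumption based on the classic Phylip convention; some programs accept a relaxed variant.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA-formatted text."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:].strip(), ""])  # start a new record
        elif records:
            records[-1][1] += line  # sequences may span several lines
    return [(name, seq) for name, seq in records]


def to_phylip(records, name_width=10):
    """Format aligned (name, sequence) pairs as sequential Phylip text."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned (all of equal length)")
    lines = [f"{len(records)} {lengths.pop()}"]
    for name, seq in records:
        if len(name) >= name_width:
            raise ValueError(f"name too long for Phylip: {name}")
        lines.append(name.ljust(name_width) + seq)
    return "\n".join(lines) + "\n"


fasta = ">Seq_1\nCCCCTAAT\n>Seq_2\nCCCCTAAG\n"  # shortened example sequences
phylip = to_phylip(parse_fasta(fasta))
print(phylip)
```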
Input file formats

NEXUS (filename.nex)

#NEXUS
BEGIN TAXA;
DIMENSIONS NTAX=10;
TAXLABELS
Seq_1
Seq_2
Seq_3
Seq_4
Seq_5
Seq_6
Seq_7
Seq_8
Seq_9
Seq_10;
END;
BEGIN CHARACTERS;
DIMENSIONS NCHAR=60;
FORMAT DATATYPE=DNA MISSING=? GAP=- MATCHCHAR=.;
MATRIX
Seq_1     cccctaatatgtacaataatgaatgttgtaaattagtgttataacacatctatgtataat
Seq_2     CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_3     CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_4     CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_5     CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_6     CCCCTAATAGGTACAATAATTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_7     CCCCTAATTTGTACAATAATGAATGTTGTAAATTAATGTTATAACACATCTATGTATAAT
Seq_8     CCCCTAATATGTCCAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_9     CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_10    CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT;
END;

Arlequin (filename.arp)

[Profile]
Title="An example of DNA sequence data"
NbSamples=3
GenotypicData=0
DataType=DNA
LocusSeparator=NONE
[Data]
[[Samples]]
SampleName="Population 1"
SampleSize=3
SampleData= {
Seq_1 1 cccctaatatgtacaataatgaatgttgtaaattagtgttataacacatctatgtataat
Seq_2 1 CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_3 1 CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
}
SampleName="Population 2"
SampleSize=3
SampleData= {
Seq_4 1 CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_5 1 CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_6 1 CCCCTAATAGGTACAATAATTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
}
SampleName="Population 3"
SampleSize=4
SampleData= {
Seq_7 1 CCCCTAATTTGTACAATAATGAATGTTGTAAATTAATGTTATAACACATCTATGTATAAT
Seq_8 1 CCCCTAATATGTCCAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_9 1 CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_10 1 CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
}
[[Structure]]
StructureName="A group of 3 populations analyzed for DNA"
NbGroups=1
Group= {
"Population 1"
"Population 2"
"Population 3"
}
Sequence alignment
Software of the Clustal family:
• ClustalW
online versions:
http://www.ch.embnet.org/software/ClustalW.html
http://www.ebi.ac.uk/Tools/clustalw2/index.html
download:
http://www.clustal.org/download/
• ClustalX
download:
http://www.clustal.org/download/current/
Higgins & Sharp (1988; 1989); Higgins et al. (1992); Thompson et al. (1994; 1997).
SeaView is a sequence alignment editor which is able to read and write various alignment formats (NEXUS, CLUSTAL, FASTA, PHYLIP…).
Free download from this website:
http://pbil.univ-lyon1.fr/software/seaview.html
Galtier et al. (1996).
Molecular diversity indices
Haplotype diversity (H)
It is defined as the probability that two randomly chosen haplotypes are different in the sample. Haplotype (gene) diversity is estimated as:
\[ \hat{H} = \frac{n}{n-1}\left(1 - \sum_{i=1}^{k} p_i^{2}\right) \]
where n is the number of gene copies in the sample, k is the number of haplotypes, and p_i is the sample frequency of the i-th haplotype.
Nei (1987).
Source: Arlequin ver. 3.1 user manual.
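
As an illustration, the estimator of Nei (1987), H = n/(n-1) · (1 - Σ p_i²), can be computed in a few lines. This is a minimal sketch and not the Arlequin implementation; the function name is hypothetical, and the example labels only mirror the haplotype frequencies of the ten example sequences used in these slides (six identical copies plus four unique haplotypes).

```python
from collections import Counter


def haplotype_diversity(haplotypes):
    """Nei's (1987) unbiased haplotype diversity for a list of haplotypes.

    Each list element is one sampled gene copy (a sequence or any label)."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("at least two gene copies are required")
    counts = Counter(h.upper() for h in haplotypes)
    sum_p2 = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1 - sum_p2)


# Hypothetical labels: haplotype "A" sampled six times, "B" to "E" once each.
sample = ["A"] * 6 + ["B", "C", "D", "E"]
h = haplotype_diversity(sample)
print(round(h, 4))
```

With n = 10 and frequencies 0.6, 0.1, 0.1, 0.1, 0.1 the sum of squared frequencies is 0.40, so H = (10/9) × 0.60.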
Molecular diversity indices
Mean number of pairwise differences (π)
Mean number of differences between all pairs of haplotypes in the sample. It can be estimated as
\[ \hat{\pi} = \frac{n}{n-1}\sum_{i=1}^{k}\sum_{j=1}^{k} p_i\, p_j\, \hat{d}_{ij} \]
where d_ij is an estimate of the number of mutations having occurred since the divergence of haplotypes i and j, k is the number of haplotypes, p_i is the frequency of haplotype i, p_j is the frequency of haplotype j, and n is the sample size.
Tajima (1983); (1993).
Source: Arlequin ver. 3.1 user manual.
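
A short sketch may make the estimator concrete. It computes the mean number of pairwise differences in two ways: directly, by averaging over all pairs of sampled gene copies, and from haplotype frequencies with the n/(n-1) correction. It assumes that d_ij is simply the observed number of differing sites (no correction for multiple hits); function names and example sequences are hypothetical.

```python
from collections import Counter
from itertools import combinations


def n_diff(a, b):
    """Number of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def mean_pairwise_differences(seqs):
    """Average number of differing sites over all pairs of gene copies."""
    pairs = list(combinations(seqs, 2))
    return sum(n_diff(a, b) for a, b in pairs) / len(pairs)


def pi_from_frequencies(seqs):
    """Same quantity as n/(n-1) * sum_i sum_j p_i * p_j * d_ij."""
    n = len(seqs)
    freq = {h: c / n for h, c in Counter(seqs).items()}
    total = sum(freq[a] * freq[b] * n_diff(a, b) for a in freq for b in freq)
    return n / (n - 1) * total


sample = ["AAAA", "AAAA", "AAAT", "AATT"]  # hypothetical aligned sequences
pi_direct = mean_pairwise_differences(sample)
pi_freq = pi_from_frequencies(sample)
pi_per_site = pi_direct / len(sample[0])  # per-site value (nucleotide diversity)
print(pi_direct, pi_freq, pi_per_site)
```

The two routes agree because averaging over pairs of distinct gene copies is algebraically the same as the frequency-weighted double sum multiplied by n/(n-1).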
Molecular diversity indices
Nucleotide diversity (πn)
It is computed as the probability that two randomly chosen homologous nucleotide sites are different. It is equivalent to the haplotype diversity at the nucleotide level.
\[ \hat{\pi}_n = \frac{n}{n-1}\,\frac{\sum_{i=1}^{k}\sum_{j=1}^{k} p_i\, p_j\, \hat{d}_{ij}}{L} \]
where d_ij is an estimate of the number of mutations having occurred since the divergence of haplotypes i and j, k is the number of haplotypes, p_i is the frequency of haplotype i, p_j is the frequency of haplotype j, n is the sample size and L is the number of loci (nucleotide sites).
Tajima (1983); Nei (1987).
Source: Arlequin ver. 3.1 user manual.
Molecular diversity indices
«Genetic loci from a centre of origin are expected to retain more ancestral variation and show higher haplotypic and nucleotide diversity, with lineage pruning through successive colonization events leading to a reduction in derived populations.»
Troy et al. (2001).
383 B. taurus mtDNA sequences (240 bp of the HVRI region):

Region             Mean pairwise differences (± s.d.)
Middle East        3.79 ± 2.03
Anatolia           3.49 ± 1.81
Mainland Europe    1.92 ± 1.10
Britain            2.68 ± 1.45
Northern Europe    1.47 ± 0.91
Africa             2.09 ± 1.18
Analysis of MOlecular VAriance - AMOVA
The Analysis of MOlecular VAriance (AMOVA, Excoffier et al. 1992) is based on analyses of variance of gene frequencies, taking into account the number of mutations between molecular haplotypes.
User-defined groups of populations → particular genetic structure to test.
A hierarchical analysis of variance partitions the total variance into covariance components (Rousset, 2000).
The total molecular variance (σ²) is the sum of the components due to:
• σa² = differences among groups of populations;
• σb² = differences among haplotypes in different populations within a group;
• σc² = differences among haplotypes within a population.
Source: Arlequin ver. 3.1 user manual.
Analysis of MOlecular VAriance - AMOVA
Simple hierarchical genetic structure, e.g. haploid individuals in populations → the algorithm leads to a fixation index FST (Weir & Cockerham, 1984) which can be expressed in terms of inbreeding coefficients as
\[ F_{ST} = \frac{f_0 - f_1}{1 - f_1} \]
Slatkin (1991).
where f0 is the probability of identity by descent of two different genes drawn from the same population, and f1 is the probability of identity by descent of two genes drawn from two different populations.
Source: Arlequin ver. 3.1 user manual.
Mismatch Distribution
It is the distribution of the observed number of differences between pairs of haplotypes. This distribution is usually multimodal in samples drawn from populations at demographic equilibrium, as it reflects the highly stochastic shape of gene trees…
Solid line in the pairwise differences plot = theoretical values referring to a model of neutral evolution in a population of constant size (Rogers, 2004).
Mismatch Distribution
…but it is usually unimodal in populations having passed through a recent demographic expansion.
Rogers & Harpending (1992); Slatkin & Hudson (1991).
Simulations of populations that underwent a sudden 100-fold growth at 7 units of mutational time before present (Rogers, 2004).
Solid line in the pairwise differences plot = theoretical values referring to the expectations under the model of population history used for these simulations.
Mismatch Distribution and estimates of population expansion
In case of a sudden population growth (mismatch distribution = smooth unimodal wave), the
time of the expansion τ0 and the size of the pre-expansion population θ1 can be estimated as
follows
where π is the mean pairwise difference per sequence within the sample, m is the mean of
pairwise differences, and v is the variance.
Source: Rogers (2004) and Arlequin ver. 3.1 user manual.
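
As a hedged illustration, a mismatch distribution and its first two moments can be computed directly from aligned sequences. The sketch below then applies the method-of-moments estimators commonly attributed to Rogers (1995), θ0 = sqrt(v - m) and τ = m - θ0; that these are the estimators intended here is an assumption, as are the function names, the use of the sample variance, and the example values.

```python
from itertools import combinations
from math import sqrt


def mismatch_counts(seqs):
    """Number of differing sites for every pair of aligned sequences."""
    upper = [s.upper() for s in seqs]
    return [sum(a != b for a, b in zip(s, t)) for s, t in combinations(upper, 2)]


def moment_estimates(diffs):
    """Return (m, v, theta0, tau) from a list of pairwise differences.

    m is the mean and v the sample variance of the mismatch distribution.
    Returns None when v <= m, since sqrt(v - m) is then undefined or zero."""
    m = sum(diffs) / len(diffs)
    v = sum((d - m) ** 2 for d in diffs) / (len(diffs) - 1)
    if v <= m:
        return None
    theta0 = sqrt(v - m)
    return m, v, theta0, m - theta0


counts = mismatch_counts(["AAAA", "AAAT", "AATT", "TTTT"])  # hypothetical data
estimates = moment_estimates([3, 4, 5, 6, 7, 8, 2, 9, 5, 1])  # hypothetical data
print(counts, estimates)
```

Note that the estimators have no solution whenever the variance does not exceed the mean, which happens for the first toy data set; the function then returns None rather than a number.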
Estimates of population expansion – an alternative approach
Analysis of Bayesian skyline plots: an alternative approach to mismatch distribution analysis.
Past changes in population size can be inferred from present-day genetic diversity without prior assumptions about population history.
Mitochondrial d-loop sequence data (also aDNA).
Four domestic species:
- Yak (Bos grunniens) n=71
- Water buffalo (Bubalus bubalis) n=110
- Mithan (Bos frontalis) n=24
- Cattle (Bos taurus) n=84
One closely related wild species:
- African buffalo (Syncerus caffer) n=195
Uniform mutation rate: 32% Myr^-1
Domestic species - sudden expansion during the last 10^4 years ~ time since domestication.
African buffalo - gradual population expansion followed by a sharp decline (consistent with documented epidemics and habitat loss since the XIXth century).
Software: BEAST, BEAUTI and TRACER.
Source: Finlay et al. (2007).
Admixture analysis
This analysis evaluates the relative contributions of any number of parental populations to a
derived, hybrid population.
It compares the composition of different gene pools rather than making inference about the
admixture event itself (mY estimator; Dupanloup & Bertorelle, 2001).
Software: ADMIX ver. 1.0
Features: - works with sequences, RFLPs, microsatellites
- needs 2 input files:
DATA file (filename.dat)
MATRIX file (filename.mtx)
The DATA file should contain, for each locus, the sample sizes of the admixed and of the parental populations and the number of copies observed for each haplotype (allele) in each population.
DATA file example:
LocusX
nAD, nP1, nP2
cnH1(AD), cnH1(P1), cnH1(P2)
cnH2(AD), cnH2(P1), cnH2(P2)
cnH3(AD), cnH3(P1), cnH3(P2)
AD=admixed pop; P1=parental pop. 1; P2= parental pop. 2
nAD= sample size of pop. AD; etc.
H1, H2, H3= haplotypes
cnH1(AD)= count number for haplotype 1 in AD pop.; etc.
Admixture analysis
MATRIX file example:
nX
LocusX
nH
0
1 0
3 2 0
nX = number of analyzed loci
nH = number of haplotypes observed at the locus
0 / 1 0 / 3 2 0 = lower triangular matrix of molecular distances (number of substitutions in pairwise comparisons of haplotypes)
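
The lower triangular matrix of pairwise substitutions can be generated from the aligned haplotypes rather than counted by hand. The following sketch only builds and prints the matrix rows; it does not write a complete ADMIX input file, and the function name and the three toy haplotypes are hypothetical (they were chosen so that the output matches the 0 / 1 0 / 3 2 0 example).

```python
def lower_triangular_distances(haplotypes):
    """Rows of the lower triangular matrix (diagonal included) holding the
    number of differing sites between each pair of aligned haplotypes."""
    rows = []
    for i, current in enumerate(haplotypes):
        row = []
        for j in range(i + 1):
            other = haplotypes[j]
            row.append(sum(a != b for a, b in zip(current, other)))
        rows.append(row)
    return rows


toy_haplotypes = ["ACGT", "ACGA", "TTGA"]  # hypothetical haplotypes H1, H2, H3
matrix = lower_triangular_distances(toy_haplotypes)
for row in matrix:
    print(" ".join(str(d) for d in row))
```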
ADMIX ver. 2.0 needs only one input file containing both the data and the matrix.
Pellecchia et al. (2007).
Admixture values ± s.e. calculated on Bos taurus mtDNA data (HVRI region) derived from
autochthonous Italian breeds.
Trees
A tree is a graph which describes the evolutionary relationships between sequences.
•Nodes = Taxonomic Units (TUs);
•Branches = evolutionary relationships between TUs in terms of ancestry/descent.
A branch connects only two nodes. Internal nodes represent ancestral TUs, while terminal nodes represent present TUs (i.e. sequences), also called Operational Taxonomic Units (OTUs).
Cladogram: a tree describing only the relationships between nodes. Branch lengths have no specific meaning.
Phylogram: branch lengths are proportional to the evolutionary distance → calculations of genetic divergence between nodes.
[Figure: the same ten sequences (Seq 1 to Seq 10) drawn as a cladogram and as a phylogram]
Trees
Rooted tree: a particular node, the “root”, represents the common ancestor of all the remaining nodes → all the branches can be oriented as a function of time.
Unrooted tree: describes exclusively the evolutionary relationships between OTUs. No information on the evolutionary process as a function of time → it is not possible to identify older/more recent nodes.
[Figure: a rooted tree (with an outgroup) and an unrooted tree of Seq 1 to Seq 10]
Rooted trees are usually built when the hypothesis of the “molecular clock” is assumed, i.e.
genetic divergence proportional to evolutionary time.
Trees
To root a tree, a particular OTU, called “outgroup”, is included into the dataset. The outgroup is defined as “an OTU which started the process of divergence from its ancestor before all the remaining OTUs started diverging from each other” (information derived from non-genetic evidence, e.g. paleontology, morphology etc.).
Trees can also be represented in the Newick (computer readable) format with nested brackets:
((((Seq_9,(Seq_6,Seq_5)),Seq_10),((Seq_8,Seq_4),Seq_3)),(Seq_7,Seq_2),Seq_1);
Dedicated software reads trees in Newick format (e.g. TreeView; Page, 1996).
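
To show how the nested brackets can be read by a program, the sketch below extracts the tip (OTU) labels from the Newick string above. It is a deliberately minimal reader, not a full Newick parser: it assumes unquoted labels, and the function name is hypothetical.

```python
import re


def newick_leaves(newick):
    """Return the tip labels of a Newick tree in left-to-right order.

    Assumes unquoted labels; branch lengths (":0.1") and internal node
    labels (which follow a closing bracket) are ignored."""
    text = newick.strip()
    if not text.endswith(";"):
        raise ValueError("a Newick string must end with ';'")
    if text.count("(") != text.count(")"):
        raise ValueError("unbalanced brackets")
    # A tip label is a name that directly follows "(" or ",".
    return re.findall(r"[(,]\s*([^(),:;\s]+)", text)


tree = ("((((Seq_9,(Seq_6,Seq_5)),Seq_10),((Seq_8,Seq_4),Seq_3)),"
        "(Seq_7,Seq_2),Seq_1);")
leaves = newick_leaves(tree)
print(len(leaves), leaves)
```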
[Figure: the tree encoded by the Newick string above, with tips Seq 1 to Seq 10]
Trees
Aim of a phylogenetic analysis → determining the “topology” (structure) of the tree.
The number of possible trees grows faster than exponentially with the number of OTUs.
For n OTUs, the numbers of rooted (NR) and unrooted (NU) bifurcating trees are given by
\[ N_R = \frac{(2n-3)!}{2^{\,n-2}\,(n-2)!} \qquad N_U = \frac{(2n-5)!}{2^{\,n-3}\,(n-3)!} \]
NU for n OTUs = NR for (n-1) OTUs.
E.g. if n=10 there are about 34.5 × 10^6 possible rooted trees, only one of which correctly represents the evolutionary relationships between the OTUs!
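
The two formulas are easy to evaluate exactly with integer arithmetic, which makes the growth in the number of trees tangible. This is a small sketch with hypothetical function names; it counts strictly bifurcating trees only.

```python
from math import factorial


def n_rooted(n):
    """Number of rooted bifurcating trees for n OTUs (n >= 2)."""
    if n < 2:
        raise ValueError("need at least 2 OTUs")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def n_unrooted(n):
    """Number of unrooted bifurcating trees for n OTUs (n >= 3)."""
    if n < 3:
        raise ValueError("need at least 3 OTUs")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in (4, 5, 10, 20):
    print(n, n_rooted(n), n_unrooted(n))
```

For n = 10 this gives 34,459,425 rooted and 2,027,025 unrooted trees, which is why exhaustive evaluation of all topologies is feasible only for small datasets.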
Tree reconstruction strategies
Tree-building methods can be classified according to
•the type of data (i.e. distance matrix vs. discrete characters);
•the reconstruction strategy (clustering algorithms vs. optimality criteria).

                        DATA
METHOD                  distance matrix    discrete characters
Clustering algorithm    UPGMA, NJ          -
Optimality criterion    ME, FM             MP, ML, BA

UPGMA: unweighted pair-group method using arithmetic means; NJ: neighbor-joining; ME: minimum evolution; FM: Fitch-Margoliash's least-squares method; MP: maximum parsimony; ML: maximum likelihood; BA: Bayesian inference.
All the aforementioned methods (except MP) require the selection of an explicit model of sequence evolution (“substitution model”).
Substitution models describe in probabilistic terms the process by which a set of characters (nucleotides) changes into another set of homologous character states over time.
i
Models of DNA sequence evolution
Pairwise and Percentage difference
These are very rough estimates of evolutionary divergence between sequences.
They are computed as the number/percentage of loci (nucleotides) for which two sequences are different:
P = nd (number of differences)
P = nd/L (proportion of differing sites; multiply by 100 for a percentage)
where nd is the number of observed substitutions between two DNA sequences and L is the number of loci.
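
Both quantities are simple to compute for a pair of aligned sequences. The sketch below uses Seq_1 and Seq_5 from the example dataset; function names are hypothetical, and gaps or ambiguous bases are not treated specially (every mismatching character counts as a difference).

```python
def n_differences(a, b):
    """Number of aligned sites at which the two sequences differ (nd)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned (equal length)")
    return sum(x != y for x, y in zip(a.upper(), b.upper()))


def p_distance(a, b):
    """Proportion of differing sites, nd / L."""
    return n_differences(a, b) / len(a)


seq_1 = "cccctaatatgtacaataatgaatgttgtaaattagtgttataacacatctatgtataat"
seq_5 = "CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT"

nd = n_differences(seq_1, seq_5)
p = p_distance(seq_1, seq_5)
print(nd, p, f"{100 * p:.1f}%")
```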
Models of DNA sequence evolution
The number of observed differences usually underestimates the real amount of evolutionary change (e.g. occurrence of “multiple hits”).
Substitution models incorporate some “correction” parameters, their number varying according to the a priori assumptions accepted (number of fixed/variable parameters).
A priori assumptions:
•Nucleotide sites evolve independently;
•All sites can mutate with equal probability;
•All types of substitutions are equally probable;
•Substitution rate is constant;
•The base composition is at equilibrium (sequences have the same base composition).
The higher the number of accepted assumptions, the simpler the model.
The lower the number of accepted assumptions, the higher the number of parameters that need to be estimated.
Models of DNA sequence evolution
The most renowned and used nucleotide substitution models are those from the General Time-Reversible (GTR) family (Lanave et al., 1984): 203 possible models differentiated by the number and type of fixed/variable parameters.
The nucleotide substitution models implemented in the most frequently used phylogenetics software packages (MEGA, PAUP*, PHYLIP, PHYML, MrBayes etc.) belong to the GTR family.
Models of DNA sequence evolution
Jukes and Cantor (JC69; 1969)
It is the simplest (parameter-poorest) model, which assumes that:
•Nucleotide frequencies are equal (i.e. πA = πT = πC = πG = 0.25);
•All possible substitutions take place at a single rate → only the parameter α (substitution rate) needs to be estimated.

     A    C    G    T
A    -    α    α    α
C    α    -    α    α
G    α    α    -    α
T    α    α    α    -
Models of DNA sequence evolution
Kimura 2-parameter (K80; 1980)
•Nucleotide frequencies are equal (i.e. πA = πC = πG = πT = 0.25);
•Different substitution rates between transitions (Ts) α and transversions (Tv) β.
The Ts/Tv ratio is estimated from the data.

     A    C    G    T
A    -    β    α    β
C    β    -    β    α
G    α    β    -    β
T    β    α    β    -
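
To illustrate how such models “correct” the observed differences, the sketch below applies the standard distance formulas associated with JC69, d = -(3/4) ln(1 - 4p/3), and with K80, d = -(1/2) ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitional and transversional differences. These formulas are not given in the slides and are added here as an assumption-labelled illustration; function names are hypothetical.

```python
from math import log, sqrt

PURINES = {"A", "G"}


def ts_tv_proportions(a, b):
    """Proportions of sites showing a transition (P) or a transversion (Q)."""
    ts = tv = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == y:
            continue
        # A transition keeps the base class (purine<->purine, pyrimidine<->pyrimidine).
        if (x in PURINES) == (y in PURINES):
            ts += 1
        else:
            tv += 1
    return ts / len(a), tv / len(a)


def jc69_distance(p):
    """Jukes-Cantor corrected distance from the proportion of differences p."""
    return -0.75 * log(1.0 - 4.0 * p / 3.0)


def k80_distance(P, Q):
    """Kimura 2-parameter distance from transition (P) and transversion (Q) proportions."""
    return -0.5 * log((1.0 - 2.0 * P - Q) * sqrt(1.0 - 2.0 * Q))


seq_1 = "cccctaatatgtacaataatgaatgttgtaaattagtgttataacacatctatgtataat"
seq_5 = "CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT"
P, Q = ts_tv_proportions(seq_1, seq_5)
d_jc = jc69_distance(P + Q)
d_k80 = k80_distance(P, Q)
print(P, Q, d_jc, d_k80)
```

For this pair (one transition and two transversions over 60 sites) the two corrections coincide, because the observed 1:2 ratio is exactly what JC69 expects; both corrected distances are slightly larger than the uncorrected 0.05.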
Models of DNA sequence evolution
Tamura (1992)
This model is an extension of the K80 model, allowing for unequal nucleotide frequencies.
•Base composition is not equal (A + T ≠ G + C and G + C = θ);
•Different substitution rates between Ts (α) and Tv (β).
The Ts/Tv ratio, as well as nucleotide frequencies, are computed from the data.

     A         C      G      T
A    -         θβ     θα     (1-θ)β
C    (1-θ)β    -      θβ     (1-θ)α
G    (1-θ)α    θβ     -      (1-θ)β
T    (1-θ)β    θα     θβ     -
Models of DNA sequence evolution
Felsenstein (F81; 1981)
It is an extension of the JC69 model, allowing for unequal nucleotide frequencies (i.e. πA ≠ πC ≠ πG ≠ πT).
The overall nucleotide frequencies are computed from the data.

     A      C      G      T
A    -      πCα    πGα    πTα
C    πAα    -      πGα    πTα
G    πAα    πCα    -      πTα
T    πAα    πCα    πGα    -
Models of DNA sequence evolution
Hasegawa-Kishino-Yano (HKY; 1985)
This model combines the assumptions of K80 and F81:
•Unequal nucleotide frequencies (i.e. πA ≠ πC ≠ πG ≠ πT).
•Different substitution rates between Ts (α) and Tv (β).
Overall nucleotide frequencies and the Ts/Tv ratio are computed from the data.

     A      C      G      T
A    -      πCβ    πGα    πTβ
C    πAβ    -      πGβ    πTα
G    πAα    πCβ    -      πTβ
T    πAβ    πCα    πGβ    -
Models of DNA sequence evolution
General Time Reversible (GTR)
Lanave et al. (1984).
It is the most general and parameter-rich model.
•Unequal nucleotide frequencies (i.e. πA ≠ πC ≠ πG ≠ πT).
•Different substitution rates between the two transitions and the four transversions:
Ts: A → G = α1 ; C → T = α2
Tv: A → C = β1 ; A → T = β2 ; C → G = β3 ; G → T = β4 *.
•Unequal probability for each type of nucleotide substitution.
•Substitutions are reversible (A → G = G → A).

     A       C       G       T
A    -       πCβ1    πGα1    πTβ2
C    πAβ1    -       πGβ3    πTα2
G    πAα1    πCβ3    -       πTβ4
T    πAβ2    πCα2    πGβ4    -

* taking the rate of G <-> T = 1 and making all other 5 possible substitution rates relative to the G-T transversion.
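
The structure of the GTR matrix, and the way simpler models arise as special cases of it, can be sketched in code. The function below builds the instantaneous rate matrix with off-diagonal entries π_j × (rate for the pair i, j) and diagonal entries chosen so that each row sums to zero (the usual convention, assumed here). All names, frequencies and rate values are hypothetical illustrations, not estimates from any dataset.

```python
BASES = "ACGT"


def gtr_matrix(freqs, rates):
    """Build a GTR rate matrix as a nested dict q[from_base][to_base].

    freqs: base -> equilibrium frequency.
    rates: frozenset({x, y}) -> symmetric rate parameter for that pair."""
    q = {x: {} for x in BASES}
    for x in BASES:
        for y in BASES:
            if x != y:
                q[x][y] = freqs[y] * rates[frozenset((x, y))]
        # Diagonal: minus the sum of the off-diagonal entries of the row.
        q[x][x] = -sum(q[x].values())
    return q


def pair_rates(a1, a2, b1, b2, b3, b4):
    """Map the six GTR rate parameters to unordered base pairs."""
    return {
        frozenset("AG"): a1, frozenset("CT"): a2,   # transitions
        frozenset("AC"): b1, frozenset("AT"): b2,   # transversions
        frozenset("CG"): b3, frozenset("GT"): b4,
    }


freqs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}    # hypothetical frequencies
gtr = gtr_matrix(freqs, pair_rates(8.0, 6.0, 1.8, 1.6, 3.5, 1.0))

# Special cases obtained by constraining the GTR parameters:
hky = gtr_matrix(freqs, pair_rates(4.0, 4.0, 1.0, 1.0, 1.0, 1.0))
equal = dict.fromkeys(BASES, 0.25)
jc69 = gtr_matrix(equal, pair_rates(1.0, 1.0, 1.0, 1.0, 1.0, 1.0))
```

Setting the two transition rates equal and the four transversion rates equal gives HKY; additionally fixing equal base frequencies and a single rate gives JC69, which is the sense in which these models are nested within GTR.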
Models of DNA sequence evolution
Γ (gamma) distribution and Invariant sites
An additional parameter is considered when the substitution rates cannot be assumed as uniform for all sites.
Not all the nucleotide positions within a sequence, in fact, are subject to the same evolutionary constraints (e.g. 1st-2nd vs. 3rd codon position in protein-coding genes).
There are two strategies:
1) To analyze separately the sites subject to different evolutionary dynamics;
2) To adopt a model with additional parameters that account for the rate variation.
Γ distributions are used to model continuous variables that are always positive and have skewed distributions.
The shape of the Γ distribution is determined by a single parameter α (“shape parameter”) which specifies the range of rate variation among sites and is inversely proportional to the level of heterogeneity among site rates.
Models of DNA sequence evolution
The lower the value of α, the larger the range of rate variation and the more uneven the substitution rates.
As α → ∞, all sites have the same substitution rate.
[Figure: Γ distributions of rates among sites; x axis: substitution rate (r); y axis: proportion of sites f(r)]
Also the fraction of Invariant sites (i.e. sites showing no variation within the sequence set) can be estimated and taken into account when modeling the evolutionary process of the sequences.
Models of DNA sequence evolution
All models typically used to infer evolutionary relationships between DNA sequences represent special cases of the GTR model.
Imposing constraints (i.e. a priori assumptions) on the parameters of the GTR leads to a different model which can, therefore, be considered as a special case of the GTR.
A model is said to be “nested” within a more complex one if the former can be obtained by constraining the parameters of the latter.
E.g. JC69 is nested within K80, while F81 and K80 are not nested because fixing parameter values of either one does not yield the other model.
Models of DNA sequence evolution
How do we select the best-fitting model?
To select the best-fitting substitution model the Likelihood Ratio Test (LRT) is usually applied. In a maximum likelihood framework, it evaluates the statistical significance of the increase in fit of alternative nested models to the data as the number of their parameters increases.
Δ = 2 (ln L1 - ln L0)
L1 = global ML estimate for the alternative hypothesis (more general, parameter-richer model)
L0 = global ML estimate for the null hypothesis (simpler model).
Δ is approximately χ2 distributed with d.f. = difference in the number of free parameters between the two alternative models.
The Akaike Information Criterion (AIC; Akaike, 1974; Posada & Buckley, 2004) and the Bayesian Information Criterion (BIC; Schwarz, 1978) are alternatives to the LRT; they simultaneously compare the relative fit of all competing models, be they nested or not.
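
A worked example may help to connect the test statistic with the P-values reported by ModelTest. The sketch below computes Δ and its χ2 upper-tail probability using closed-form expressions for integer degrees of freedom (standard library only), and the AIC as 2K - 2 ln L. Function names are hypothetical; the likelihood values are taken from the ModelTest output shown in these slides, and K = 9 free parameters for TVM+I+G is an assumption inferred from the reported AIC. The mixed χ2 distribution that ModelTest applies to the +G and +I tests is not implemented.

```python
from math import erfc, exp, gamma, sqrt


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: finite Poisson-type sum.
        return exp(-half) * sum(half ** j / gamma(j + 1) for j in range(df // 2))
    # Odd df: error-function term plus a finite correction sum.
    tail = sum(half ** (j - 0.5) / gamma(j + 0.5)
               for j in range(1, (df - 1) // 2 + 1))
    return erfc(sqrt(half)) + exp(-half) * tail


def lrt(neg_lnL0, neg_lnL1, df):
    """Return (delta, p_value) given -lnL of the null and alternative models."""
    delta = 2.0 * (neg_lnL0 - neg_lnL1)  # equals 2 (ln L1 - ln L0)
    return delta, chi2_sf(delta, df)


def aic(neg_lnL, k):
    """Akaike Information Criterion, 2K - 2 ln L."""
    return 2.0 * neg_lnL + 2.0 * k


delta_trn, p_trn = lrt(2482.0591, 2482.0227, 1)   # HKY vs TrN
delta_k81, p_k81 = lrt(2482.0591, 2480.2668, 1)   # HKY vs K81uf
delta_f81, p_f81 = lrt(2562.9832, 2543.3635, 3)   # JC vs F81
aic_tvm = aic(2358.6572, 9)                       # K = 9 is an assumption
print(delta_trn, p_trn, delta_k81, p_k81, p_f81, aic_tvm)
```

The first two tests are not significant at the 5% level, so the simpler HKY model is retained in both cases, whereas JC is clearly rejected in favour of F81.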
Models of DNA sequence evolution
ModelTest
Posada & Crandall (1998).
A very popular tool which automatically selects the best-fitting substitution model from among 56 alternatives by performing LRT, AIC (software ver. 3.06) and BIC (software ver. 3.7) calculations. It returns the name and the parameter values of the best-fitting model.
Original software version: both the ModelTest application and the software PAUP* (Swofford, 1998) are needed. Unfortunately, PAUP* software is not free.
Input file format = Nexus (same as PAUP*) + “ModelTest” block.
More information on how to run ModelTest can be found here:
http://darwin.uvigo.es/software/modeltest.html
http://www.rhizobia.co.nz/phylogenetics/modeltest.html (Windows)
http://www.genedrift.org/mtgui.php (Windows and Linux).
A web-based tool to run ModelTest can be found here:
http://darwin.uvigo.es/software/modeltest_server.html
(Posada, 2006).
A free web-based tool to choose among 28 nucleotide models with the AIC:
http://www.hiv.lanl.gov/content/sequence/findmodel/findmodel.html
Models of DNA sequence evolution
Hierarchical structure of 56 models implemented in the ModelTest procedure (Posada and Crandall, 1998). It does not include all of the possible models in the GTR family.
Models of DNA sequences evolution
ModelTest ver. 3.06 output

Testing models of evolution - Modeltest Version 3.06
(c) Copyright, 1998-2000 David Posada ([email protected])
Department of Zoology, Brigham Young University
WIDB 574, Provo, UT 84602, USA
_______________________________________________________________
Wed May 23 16:49:15 2007
Input format: Paup matrix file

** Log Likelihood scores **
                     +I          +G          +I+G
JC    = 2458.8540   2458.8540   2454.2852   2443.3606
F81   = 2440.0264   2440.0264   2434.6941   2424.4517
K80   = 2400.7991   2400.7991   2396.5891   2385.6047
HKY   = 2379.0457   2379.0457   2374.1394   2362.9192
TrNef = 2400.6252   2400.6252   2396.1169   2385.5432
TrN   = 2379.0442   2379.0442   2374.0200   2363.2795
K81   = 2398.1973   2398.1973   2393.5496   2382.9202
K81uf = 2377.6162   2377.6162   2372.7297   2362.2349
TIMef = 2398.0212   2398.0212   2393.4592   2382.8601
TIM   = 2376.7197   2376.7197   2372.6138   2362.1086
TVMef = 2395.8040   2395.8040   2391.3481   2380.6255
TVM   = 2375.0203   2375.0203   2369.3423   2358.6572
SYM   = 2395.6624   2395.6624   2391.2957   2380.6013
GTR   = 2374.8865   2374.8865   2369.2437   2358.5361

** Hierarchical Likelihood Ratio Tests (hLRTs) **

Equal base frequencies
  Null model = JC                -lnL0 = 2562.9832
  Alternative model = F81        -lnL1 = 2543.3635
  2(lnL1-lnL0) = 39.2393         df = 3
  P-value = <0.000001
Ti=Tv
  Null model = F81               -lnL0 = 2543.3635
  Alternative model = HKY        -lnL1 = 2482.0591
  2(lnL1-lnL0) = 122.6089        df = 1
  P-value = <0.000001
Equal Ti rates
  Null model = HKY               -lnL0 = 2482.0591
  Alternative model = TrN        -lnL1 = 2482.0227
  2(lnL1-lnL0) = 0.0728          df = 1
  P-value = 0.787369
Equal Tv rates
  Null model = HKY               -lnL0 = 2482.0591
  Alternative model = K81uf      -lnL1 = 2480.2668
  2(lnL1-lnL0) = 3.5845          df = 1
  P-value = 0.058322
Equal rates among sites
  Null model = HKY               -lnL0 = 2482.0591
  Alternative model = HKY+G      -lnL1 = 2374.1394
  2(lnL1-lnL0) = 215.8394        df = 1
  Using mixed chi-square distribution
  P-value = <0.000001
No Invariable sites
  Null model = HKY+G             -lnL0 = 2374.1394
  Alternative model = HKY+I+G    -lnL1 = 2362.9192
  2(lnL1-lnL0) = 22.4404         df = 1
  Using mixed chi-square distribution
  P-value = 0.000001
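The one-degree-of-freedom tests in the hLRT output can be re-derived by hand. The following is a minimal sketch, not ModelTest's own code: the function names are hypothetical, and halving the p-value for the "mixed chi-square" tests assumes a 50:50 mixture of chi-square distributions with 0 and 1 degrees of freedom.

```python
import math


def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def lrt(neg_lnl_null, neg_lnl_alt, mixed=False):
    """Return (statistic, p-value) for a 1-df likelihood ratio test.

    Inputs are -lnL scores as printed in the output, so the statistic
    2(lnL1 - lnL0) equals 2 * (neg_lnl_null - neg_lnl_alt).
    With mixed=True the p-value is halved (assumed 50:50 mixture of
    chi-square(0) and chi-square(1)).
    """
    stat = 2.0 * (neg_lnl_null - neg_lnl_alt)
    p = chi2_sf_df1(stat)
    if mixed:
        p *= 0.5
    return stat, p


# -lnL values copied from the hLRT output
tests = {
    "HKY vs TrN": lrt(2482.0591, 2482.0227),
    "HKY vs K81uf": lrt(2482.0591, 2480.2668),
    "HKY+G vs HKY+I+G": lrt(2374.1394, 2362.9192, mixed=True),
}
for name, (stat, p) in tests.items():
    print(f"{name}: 2(lnL1-lnL0) = {stat:.4f}, P = {p:.6f}")
```

Neither TrN nor K81uf improves significantly on HKY at the 5% level, whereas adding invariable sites to HKY+G does, which is why the hierarchy ends at HKY+I+G.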
Models of DNA sequences evolution
ModelTest output

** Hierarchical Likelihood Ratio Tests (hLRTs) **
Model selected: HKY+I+G
  -lnL = 2362.9192
Base frequencies:
  freqA = 0.3123
  freqC = 0.2263
  freqG = 0.1618
  freqT = 0.2996
Substitution model:
  Ti/tv ratio = 2.0963
Among-site rate variation
  Proportion of invariable sites (I) = 0.6051
  Variable sites (G)
  Gamma distribution shape parameter = 0.9352
--
PAUP* Commands Block: If you want to implement the previous estimates as likelihood settings in PAUP*, attach the next block of commands after the data in your PAUP file:
[!
Likelihood settings from best-fit model (HKY+I+G) selected by hLRT in Modeltest Version 3.06
]
BEGIN PAUP;
Lset Base=(0.3123 0.2263 0.1618) Nst=2 TRatio=2.0963 Rates=gamma Shape=0.9352 Pinvar=0.6051;
END;
--

** Akaike Information Criterion (AIC) **
Model selected: TVM+I+G
  -lnL = 2358.6572
  AIC = 4735.3145
Base frequencies:
  freqA = 0.3116
  freqC = 0.2168
  freqG = 0.1607
  freqT = 0.3109
Substitution model:
  Rate matrix
  R(a) [A-C] = 1.8176
  R(b) [A-G] = 8.0533
  R(c) [A-T] = 1.6254
  R(d) [C-G] = 3.5609
  R(e) [C-T] = 8.0533
  R(f) [G-T] = 1.0000
Among-site rate variation
  Proportion of invariable sites (I) = 0.6002
  Variable sites (G)
  Gamma distribution shape parameter = 0.9020
--
PAUP* Commands Block: If you want to implement the previous estimates as likelihood settings in PAUP*, attach the next block of commands after the data in your PAUP file:
[!
Likelihood settings from best-fit model (TVM+I+G) selected by AIC in Modeltest Version 3.06
]
BEGIN PAUP;
Lset Base=(0.3116 0.2168 0.1607) Nst=6 Rmat=(1.8176 8.0533 1.6254 3.5609 8.0533) Rates=gamma Shape=0.9020 Pinvar=0.6002;
END;
--
_________________________________________________________________
Time processing: 0.001 seconds
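The AIC value reported for TVM+I+G can be checked with AIC = 2(-lnL) + 2K. The sketch below is an illustration only: the helper name is hypothetical, and the parameter counts K are an assumption of this example (free substitution parameters only, branch lengths not counted).

```python
def aic(neg_lnl, k):
    """Akaike Information Criterion from a -lnL score and k free parameters."""
    return 2.0 * neg_lnl + 2.0 * k


# Assumed parameter counts (branch lengths excluded):
# TVM+I+G: 4 relative rates + 3 base frequencies + I + G = 9
# HKY+I+G: 1 ti/tv ratio    + 3 base frequencies + I + G = 6
aic_tvm = aic(2358.6572, 9)
aic_hky = aic(2362.9192, 6)

print(f"TVM+I+G: AIC = {aic_tvm:.4f}")
print(f"HKY+I+G: AIC = {aic_hky:.4f}")
```

Under these assumed counts TVM+I+G has the lower AIC of the two, in line with its selection by the AIC, while the hLRTs stop at the simpler HKY+I+G.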
Tree reconstruction strategies
METHOD \ DATA           distance matrix    discrete characters
Clustering algorithm    UPGMA, NJ          -
Optimality criterion    ME, FM             MP, ML, BA
Distance methods - aligned sequences are converted into a pairwise distance matrix → information about the contribution of single sites is lost and no inference can be drawn on the ancestral character states.
Discrete methods - each nucleotide site is considered directly → inference on the ancestral character states is possible.
Clustering methods follow an algorithm (a set of steps) to produce a tree (usually a single one) → short computational times, but the results often depend on the order in which sequences are added to the growing tree. Competing hypotheses cannot be tested.
Optimality methods use a specific criterion to assign a score to each possible tree. The ranking is a function of the relationship between tree and data.
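As an illustration of how aligned sequences are reduced to a pairwise distance matrix, here is a minimal sketch using the uncorrected p-distance (proportion of differing sites). The alignment and the function names are hypothetical; real analyses would normally apply a substitution model correction.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return diffs / len(seq_a)


def distance_matrix(alignment):
    """Symmetric matrix of pairwise p-distances for a dict name -> sequence."""
    names = list(alignment)
    matrix = {n: {m: 0.0 for m in names} for n in names}
    for a, b in combinations(names, 2):
        d = p_distance(alignment[a], alignment[b])
        matrix[a][b] = matrix[b][a] = d
    return matrix


# Hypothetical toy alignment (not taken from the lecture data)
toy = {"OTU_A": "ACGTACGTAC", "OTU_B": "ACGTACGTTC", "OTU_C": "ACCTACGATC"}
dist = distance_matrix(toy)
for name, row in dist.items():
    print(name, [round(row[other], 2) for other in dist])
```

Once the matrix is built, the identity of the sites that differ is no longer available, which is the loss of information mentioned above.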
Tree reconstruction strategies
Neighbor-Joining (NJ)
Saitou & Nei (1987).
The clustering algorithm starts from a star topology (completely unresolved tree) and determines the branches between the nearest pair of OTUs (neighbors) and the remaining OTUs through an iterative process.
At each step, the pair of OTUs joined is the one that minimizes the sum of the lengths of all the branches of the tree.
The pair of OTUs chosen at each step forms a “composite OTU”, treated as a single entity afterwards.
Advantages:
-Very fast computations;
-Allows for different evolutionary rates along the branches;
-Usually returns reliable results.
Disadvantages:
-the calculation of a distance matrix causes a loss of information.
Software: CLUSTALX, PHYLIP, PAUP*, MEGA and others.
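One iteration of the neighbor-joining procedure can be sketched as follows. This is an illustrative sketch rather than the code of any of the listed packages: the function names and the five-OTU distance matrix are hypothetical, and the pair is chosen with the usual Q criterion, Q(i,j) = (n-2)d(i,j) - r_i - r_j, where r_i is the sum of row i.

```python
from itertools import combinations


def nj_pick_pair(dist):
    """Return the pair (i, j) minimizing Q and the two new branch lengths."""
    n = len(dist)
    r = [sum(row) for row in dist]
    i, j = min(
        combinations(range(n), 2),
        key=lambda p: (n - 2) * dist[p[0]][p[1]] - r[p[0]] - r[p[1]],
    )
    # Branch lengths from i and j to the new internal node
    li = 0.5 * dist[i][j] + (r[i] - r[j]) / (2.0 * (n - 2))
    lj = dist[i][j] - li
    return (i, j), (li, lj)


def nj_reduce(dist, i, j):
    """Replace OTUs i and j by a composite OTU (placed last) in the matrix."""
    keep = [k for k in range(len(dist)) if k not in (i, j)]
    new_row = [0.5 * (dist[i][k] + dist[j][k] - dist[i][j]) for k in keep]
    reduced = [[dist[a][b] for b in keep] + [new_row[x]] for x, a in enumerate(keep)]
    reduced.append(new_row + [0.0])
    return reduced


# Hypothetical distances between five OTUs
toy = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
pair, lengths = nj_pick_pair(toy)
smaller = nj_reduce(toy, *pair)
print(pair, lengths)
print(smaller)
```

Repeating the two functions on the reduced matrix until only two entries remain would resolve the whole star tree.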
Tree reconstruction strategies
Neighbor-Joining (NJ)
[Figure: “star decomposition” tree search. Starting from a star tree connecting Seq_1 to Seq_6, pairs of neighbors are joined step by step until the tree is resolved.]
Tree reconstruction strategies
Maximum Parsimony
Swofford & Berlocher (1987).
This method identifies the tree that requires the smallest number of substitutions (evolutionary changes) to explain the differences between the sequences considered.
The branch length is proportional to the number of substitutions between the nodes connected by the branch.
“Parsimony informative sites” show at least two different character states, each occurring at least twice.
The minimum number of substitutions is then calculated for each possible unrooted tree.
The MP tree is the one requiring the smallest number of changes.
Advantages:
-No loss of information.
Disadvantages:
-no explicit evolutionary model (all substitutions equally probable, equal base frequencies, no correction for multiple hits);
-it often returns a set of equally parsimonious trees.
Software: PHYLIP, PAUP*, MEGA and others.
Tree reconstruction strategies
Maximum Parsimony (MP)
Seq_1   GTACG
Seq_2   GTCGG
Seq_3   ACAGG
Seq_4   ACCGG

[Figure: the tree ((Seq_1,Seq_2),(Seq_3,Seq_4)) with the character states of each site mapped onto its tips and internal nodes.]
Changes required on this tree:
Site 1 – 1 change
Site 2 – 1 change
Site 3 – 2 changes
Site 4 – 1 change
Site 5 – no changes

Number of changes required at each site by the three possible unrooted trees:

Tree             Site 1   Site 2   Site 3   Site 4   Site 5   total
((1,2),(3,4))      1        1        2        1        0        5
((1,3),(2,4))      2        2        1        1        0        6
((1,4),(2,3))      2        2        2        1        0        7
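The per-site counts of the four-sequence example can be reproduced with a small parsimony routine. The sketch below uses Fitch's set-intersection rule, which is a choice of this illustration (the slides do not name an algorithm); the function names are hypothetical, and the tree is rooted on its internal branch, which does not affect the number of changes.

```python
def fitch_quartet(states, topology):
    """Minimum number of changes at one site on an unrooted four-taxon tree.

    states maps taxon -> nucleotide; topology is ((a, b), (c, d)).
    """
    changes = 0
    node_sets = []
    for left, right in topology:
        a, b = {states[left]}, {states[right]}
        if a & b:
            node_sets.append(a & b)   # states agree: no change needed
        else:
            node_sets.append(a | b)   # states differ: one change in this cherry
            changes += 1
    if not (node_sets[0] & node_sets[1]):
        changes += 1                  # change along the internal branch
    return changes


def site_changes(alignment, topology):
    """List of per-site change counts for one topology."""
    n_sites = len(next(iter(alignment.values())))
    return [
        fitch_quartet({taxon: seq[i] for taxon, seq in alignment.items()}, topology)
        for i in range(n_sites)
    ]


# Alignment from the example (Seq_1 to Seq_4)
alignment = {1: "GTACG", 2: "GTCGG", 3: "ACAGG", 4: "ACCGG"}
topologies = [((1, 2), (3, 4)), ((1, 3), (2, 4)), ((1, 4), (2, 3))]

totals = []
for topo in topologies:
    per_site = site_changes(alignment, topo)
    totals.append(sum(per_site))
    print(topo, per_site, sum(per_site))
```

The tree ((1,2),(3,4)) needs the fewest changes and is therefore the MP tree for this alignment.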
Tree reconstruction strategies
Maximum Likelihood (ML)
Felsenstein (1981).
Often considered the best approach for determining the most consistent tree topology.
Formally, given a data set D (alignment) and a hypothesis H (tree), the likelihood is the probability of observing the data under that hypothesis:
L_D = Pr(D|H)
that is, the conditional probability of D given H.
The tree with the highest value of L is the ML estimate of the evolutionary relationships between the considered OTUs. In other words, the ML tree is the one which best explains the examined dataset.
Advantages:
-It usually returns consistent results;
-It permits the statistical testing of evolutionary hypotheses (Likelihood Ratio Test).
Disadvantages:
-very long computational times (often 100 bootstrap replicates are used instead of 1000).
Software: PHYLIP, PHYML, PAUP* and others.
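To make L = Pr(D|H) concrete, the sketch below evaluates the likelihood for the simplest possible case: two aligned sequences separated by a single branch of length d, under the Jukes-Cantor model. The model choice, the site counts and the function names are assumptions of this example; real ML programs optimize over topologies and many branch lengths at once.

```python
import math


def jc69_log_likelihood(d, n_sites, n_diff):
    """ln Pr(D | d) for two aligned sequences under the Jukes-Cantor model.

    d is the branch length (expected substitutions per site) between them.
    """
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 * (0.25 + 0.75 * e)   # one site with identical bases
    p_diff = 0.25 * (0.25 - 0.25 * e)   # one site with a specific mismatch
    return (n_sites - n_diff) * math.log(p_same) + n_diff * math.log(p_diff)


def maximize(f, lo, hi, tol=1e-9):
    """Golden-section search for the maximum of a unimodal function."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        x1, x2 = b - g * (b - a), a + g * (b - a)
        if f(x1) > f(x2):
            b = x2
        else:
            a = x1
    return 0.5 * (a + b)


n_sites, n_diff = 60, 6   # hypothetical counts
d_ml = maximize(lambda d: jc69_log_likelihood(d, n_sites, n_diff), 1e-6, 2.0)
# Closed-form Jukes-Cantor distance, for comparison
d_closed = -0.75 * math.log(1.0 - 4.0 * (n_diff / n_sites) / 3.0)
print(d_ml, d_closed)
```

The numerical maximum agrees with the closed-form Jukes-Cantor distance, showing that the ML estimate is simply the parameter value under which the observed data are most probable.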
Tree reconstruction strategies
Bayesian approach (BA)
A recent variant of ML. While ML seeks the tree that maximizes the probability of observing the data given the tree and the model, BA searches for the set of trees that have the highest probability given the data and the model (posterior probability).
BA produces a set of trees with approximately equal likelihoods.
Advantages:
-Results are easy to interpret: the frequency of a given clade within the set of trees is taken as the probability of that clade – no need for bootstrapping.
Disadvantages:
-Depending on the settings, it may require long computational times (not as long as for ML).
Software: MrBayes and others.
N. of sequences   Neighbor-joining   Maximum Parsimony   Maximum Likelihood   Bayesian
54                0.20 sec           0.72 sec            7.06 hr              3.8 hr
40                0.18 sec           0.32 sec            1.1 hr               2.4 hr
30                0.22 sec           0.18 sec            17.3 min             1.7 hr
20                0.22 sec           0.10 sec            1.8 min              1.05 hr
10                0.20 sec           0.05 sec            4.1 sec              25 min

Computational times required for analysis by the four different methods. Source: Hall (2001). Thanks to faster present-day processors, the times have shortened proportionally.
Tree reconstruction strategies
Calculation of divergence times
If the assumption of a molecular clock – genetic divergence proportional to evolutionary time – is correct, the reconstruction of the tree topology allows the divergence times between all the OTUs to be estimated.
The divergence time between at least two OTUs must be known from non-genetic evidence (e.g. paleontology). This time value is then used to calibrate the molecular clock for that given tree.
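The calibration step amounts to simple arithmetic, sketched below under a strict clock. All numbers and function names are hypothetical, and the factor 2 reflects the assumption that both lineages accumulate substitutions independently after the split.

```python
def calibrate_rate(distance, known_time):
    """Substitution rate per site per year along a single lineage.

    distance is the genetic distance between the calibration pair and
    known_time is their divergence time from non-genetic evidence.
    """
    return distance / (2.0 * known_time)


def divergence_time(distance, rate):
    """Time since two OTUs split, given their distance and the clock rate."""
    return distance / (2.0 * rate)


# Hypothetical calibration: distance 0.10 substitutions/site,
# split dated at 5 million years from the fossil record.
rate = calibrate_rate(0.10, 5.0e6)

# Another pair of OTUs from the same tree, distance 0.04 substitutions/site
t_other = divergence_time(0.04, rate)
print(rate, t_other)
```

The estimate for the second pair is only as good as the calibration date and the clock assumption, which is why the clock should be tested before dating.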
Tree reconstruction strategies
Calculation of divergence times
A Likelihood Ratio Test can be performed on the ML values calculated with (Lclock) and without (Lnoclock) the assumption of a molecular clock:
Δ = 2 (lnLnoclock – lnLclock)
Δ is χ2 distributed with d.f. = n-2, where n is the number of sequences.
Software: PAUP* and others.
The Time to the Most Recent Common Ancestor (TMRCA) of a set of sequences can also be calculated with a Bayesian approach.
Software: BEAST, BEAUTI and TRACER.
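The clock test can be sketched in a few lines. This is an illustration with hypothetical log-likelihood values and function names; the chi-square tail probability is computed from the closed-form series for integer degrees of freedom so that no external library is needed.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable (integer df >= 1, x >= 0)."""
    half = x / 2.0
    if df % 2 == 0:
        terms = [half ** j / math.factorial(j) for j in range(df // 2)]
        return math.exp(-half) * sum(terms)
    terms = [
        half ** (j - 0.5) / math.gamma(j + 0.5)
        for j in range(1, (df - 1) // 2 + 1)
    ]
    return math.erfc(math.sqrt(half)) + math.exp(-half) * sum(terms)


def clock_lrt(lnl_clock, lnl_noclock, n_sequences):
    """Return (delta, df, p-value) for the molecular clock test."""
    delta = 2.0 * (lnl_noclock - lnl_clock)
    df = n_sequences - 2
    return delta, df, chi2_sf(delta, df)


# Hypothetical log-likelihoods for an alignment of 10 sequences
delta, df, p = clock_lrt(lnl_clock=-2370.25, lnl_noclock=-2365.10, n_sequences=10)
print(f"delta = {delta:.2f}, df = {df}, P = {p:.4f}")
```

A p-value above the chosen threshold (e.g. 0.05) means the clock cannot be rejected, so clock-based dating is defensible for that dataset.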
Tree reconstruction strategies
Bootstrap
Felsenstein (1985).
Non-parametric bootstrap is used to assess the robustness of tree reconstructions.
It estimates sampling error by resampling from the dataset instead of resampling from the population.
This approach can be applied to all the phylogenetic methods, with the exception of BA.
How does it work?
Data: n aligned sequences of length N (n x N matrix).
Objective: estimate confidence in particular features of the obtained tree (robustness of nodes).
Tree reconstruction strategies
Bootstrap
Felsenstein (1985).
Method:
Step 1 - create a large number of pseudo-datasets (100 or 1000) by resampling the columns of the original data matrix with replacement. In each bootstrap replicate, some sites may occur more than once, while others are never sampled.

Original dataset
10 60
Seq_1   cccctaatatgtacaataatgaatgttgtaaattagtgttataacacatctatgtataat
Seq_2   CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_3   CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_4   CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_5   CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_6   CCCCTAATAGGTACAATAATTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_7   CCCCTAATTTGTACAATAATGAATGTTGTAAATTAATGTTATAACACATCTATGTATAAT
Seq_8   CCCCTAATATGTCCAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_9   CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_10  CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT

Bootstrap pseudoreplicate
10 60
Seq_1   CTTTGTTAAAAATATTGAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA
Seq_2   CTTTGTTAAAAATATTGAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA
Seq_3   CTTTGTTAAAAATATTGAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA
Seq_4   CTTTGTTAAAAATATTGAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA
Seq_5   CTTGGTTAAAAATACCTAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA
Seq_6   CTTGGTTAAAAATATTTAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA
Seq_7   CTTTGTTAAAAATATTGAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA
Seq_8   CTTTGTTCCAAATATTGAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA
Seq_9   CTTTGTTAAAAATATTGAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA
Seq_10  CTTTGTTAAAAATATTGAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA

Step 2 - build a tree by applying the method of choice to each pseudo-dataset → disadvantage: it drastically increases the time required for computations.
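Step 1 can be sketched in a few lines of Python. The alignment, the seed and the function name are hypothetical; the point is only that whole columns are drawn with replacement, so every pseudo-dataset has the same length as the original.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {
        name: "".join(seq[c] for c in columns)
        for name, seq in alignment.items()
    }


rng = random.Random(2008)   # fixed seed, chosen arbitrarily for reproducibility

# Hypothetical toy alignment of three sequences and ten sites
toy = {"Seq_A": "CCCCTAATAT", "Seq_B": "CCCCTAATAG", "Seq_C": "CCCCTAATTT"}

replicates = [bootstrap_replicate(toy, rng) for _ in range(100)]
print(replicates[0])
```

Each replicate would then be passed to the tree-building method of choice (Step 2), which is where the additional computing time is spent.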
Tree reconstruction strategies
Bootstrap
Felsenstein (1985).
Step 3 – evaluate the bootstrap support of the nodes by calculating the proportion of replicates in which the feature is present → “consensus tree”.
[Figure: bootstrap consensus tree of Seq_1 to Seq_10; node support values shown on the tree: 17, 100, 7, 14, 8, 10, 19 and 91.]
The results are percentage values that are usually interpreted following a “rule of thumb”:
- value<50% - weakly supported nodes, unlikely to be correct
- 50%<value<70% - nodes to be interpreted with caution
- 70%<value - strongly supported nodes, likely to be correct.
Simulations have shown that bootstrap values greater than 70% correspond to a probability greater than 95% that the node is correct. In BA trees, instead, only the nodes with posterior probability (PP) values of at least 95% are considered strongly supported.
The vast majority of the software packages include the bootstrap option.
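The support values of Step 3 are simply clade frequencies across the replicate trees. The sketch below is an illustration with hypothetical trees and function names; for simplicity it treats trees as rooted nested tuples and counts clades, whereas phylogenetic packages usually work with the bipartitions of unrooted trees.

```python
from collections import Counter


def clades(tree):
    """Return (leaf set of this subtree, list of clades nested inside it).

    A tree is a nested tuple of leaf names, e.g. (("A", "B"), ("C", "D")).
    """
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found += child_clades
    found.append(leaves)
    return leaves, found


def support(trees):
    """Percentage of trees in which each clade of two or more leaves appears."""
    counts = Counter()
    for tree in trees:
        all_leaves, found = clades(tree)
        counts.update(c for c in set(found) if c != all_leaves)
    return {c: 100.0 * k / len(trees) for c, k in counts.items()}


# Hypothetical replicate trees: 8 support (A,B), 2 support (A,C)
replicate_trees = [((("A", "B"), "C"), "D")] * 8 + [((("A", "C"), "B"), "D")] * 2
values = support(replicate_trees)
for clade, pct in sorted(values.items(), key=lambda kv: -kv[1]):
    print(sorted(clade), pct)
```

With these toy trees the clade (A,B) would be read as strongly supported under the 70% rule of thumb, and (A,C) as weakly supported.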
Tree reconstruction strategies
Jackknife
In the jackknife procedure the resampling occurs without replacement.
This is usually done by randomly deleting half of the characters in each replicate → subreplicates are smaller than the original dataset → the statistical properties of the samples may change.
Original dataset
10 60
Seq_1   cccctaatatgtacaataatgaatgttgtaaattagtgttataacacatctatgtataat
Seq_2   CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_3   CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_4   CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_5   CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_6   CCCCTAATAGGTACAATAATTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_7   CCCCTAATTTGTACAATAATGAATGTTGTAAATTAATGTTATAACACATCTATGTATAAT
Seq_8   CCCCTAATATGTCCAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_9   CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_10  CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT

Jackknife subreplicate
10 30
Seq_1   CCTATAGCATTAATTAATTGTTTACATTAA
Seq_2   CCTATAGCATTAATTAATTGTTTACATTAA
Seq_3   CCTATAGCATTAATTAATTGTTTACATTAA
Seq_4   CCTATAGCATTAATTAATTGTTTACATTAA
Seq_5   CCTATAGCATCAATTAATTGTTTACATTAA
Seq_6   CCTATAGCATTAATTAATTGTTTACATTAA
Seq_7   CCTATTGCATTAATTAATTATTTACATTAA
Seq_8   CCTATAGCATTAATTAATTGTTTACATTAA
Seq_9   CCTATAGCATTAATTAATTGTTTACATTAA
Seq_10  CCTATAGCATTAATTAATTGTTTACATTAA
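A jackknife subreplicate like the one shown can be drawn as follows. The toy alignment, the seed and the function name are hypothetical; the retained columns are kept in their original order, which is a choice of this sketch.

```python
import random


def jackknife_replicate(alignment, rng, fraction=0.5):
    """Keep a random subset of columns, sampled without replacement."""
    n_sites = len(next(iter(alignment.values())))
    n_keep = int(n_sites * fraction)
    kept = sorted(rng.sample(range(n_sites), n_keep))
    return {
        name: "".join(seq[c] for c in kept)
        for name, seq in alignment.items()
    }


rng = random.Random(2008)   # fixed seed, chosen arbitrarily

# Hypothetical toy alignment of three sequences and ten sites
toy = {"Seq_A": "CCCCTAATAT", "Seq_B": "CCCCTAATAG", "Seq_C": "CCCCTAATTT"}

sub = jackknife_replicate(toy, rng)
print(sub)
```

Because no column can be drawn twice, each subreplicate is shorter than the original alignment, unlike a bootstrap pseudoreplicate.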
Software
• ADMIX ver. 1.0
Dupanloup & Bertorelle (2001).
http://web.unife.it/progetti/genetica/Giorgio/giorgio_soft.html FREE!!
• ADMIX ver. 2.0
http://cmpg.unibe.ch/software/admix/ FREE!!
• ARLEQUIN ver. 3.1
Excoffier et al. (2005).
http://cmpg.unibe.ch/software/arlequin3/ FREE!!
• MEGA – Molecular Evolutionary Genetics Analysis ver. 4
Tamura et al. (2007).
http://www.megasoftware.net/ FREE!!
• PAUP* - Phylogenetic Analysis Using Parsimony* ver. 4.0β
Swofford (1998).
http://paup.csit.fsu.edu/
• PHYLIP ver. 3.68
Felsenstein (2002).
http://evolution.genetics.washington.edu/phylip.html FREE!!
• PHYML ver. 3.0
Guindon & Gascuel (2003).
http://atgc.lirmm.fr/phyml/ FREE!!
• BEAST ver. 1.4.8…
Drummond & Rambaut (2007).
http://beast.bio.ed.ac.uk/ FREE!!
• …and BEAUTI ver 1.4
Drummond & Rambaut (2007).
http://beast.bio.ed.ac.uk/BEAUti FREE!!
• Mr BAYES ver. 3.1
Huelsenbeck & Ronquist (2001).
http://mrbayes.csit.fsu.edu/ FREE!!
• TRACER ver. 1.4
Rambaut & Drummond (2007).
http://tree.bio.ed.ac.uk/software/tracer/ FREE!!
• TREEVIEW ver. 1.6.6
Page (1996).
http://taxonomy.zoology.gla.ac.uk/rod/treeview.html FREE!!
A miscellany of phylogeny programs and tools is available here
http://evolution.genetics.washington.edu/phylip/software.html
The BioPortal of the University of Oslo allows several applications to be run through a web server:
http://www.bioportal.uio.no//
THANK YOU!!
References:
• Akaike, H. 1974. A new look at the statistical model identification. IEEE Transactions on Automatic Control 19: 716-723.
• Drummond, A.J. and Rambaut, A. 2007. BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evolutionary Biology 7: 214.
• Dupanloup, I. and Bertorelle, G. 2001. Inferring admixture proportions from molecular data: extension to any number of parental populations. Mol.
Biol. Evol. 18: 672–675.
• Excoffier, L., Laval, G., and Schneider, S. 2005. Arlequin ver. 3.0: An integrated software package for population genetics data analysis. Evolutionary
Bioinformatics Online 1: 47-50.
• Excoffier, L., Smouse, P., and Quattro, J. 1992. Analysis of molecular variance inferred from metric distances among DNA haplotypes: Application to human mitochondrial DNA restriction data. Genetics 131: 479-491.
• Felsenstein, J. 1981. Evolutionary Trees from DNA Sequences: a Maximum Likelihood Approach. J. Mol. Evol. 17: 368-376.
• Felsenstein, J. 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39: 783–791.
• Finlay, E.K., Gaillard, C., Vahidi, S.M.F., Mirhoseini, S.Z., Jianlin, H., Qi, X.B., El-Barody, M.A.A., Baird, J.F., Healy, B.C. and Bradley, D.G. 2007.
Bayesian inference of population expansions in domestic bovines. Biology Letters 3: 449-452.
• Galtier, N., Gouy, M. and Gautier, C. 1996. SeaView and Phylo_win, two graphic tools for sequence alignment and molecular phylogeny. Comput.
Applic. Biosci. 12: 543-548.
• Guindon, S., and Gascuel, O. 2003. A simple, fast, and accurate algorithm to estimate large phylogenies by Maximum Likelihood. Syst Biol 52(5): 696-704.
• Hall, B.G. 2001. Phylogenetic trees made easy. A how-to manual for molecular biologists. Sinauer Associates Inc., Publishers, Sunderland, Massachusetts, USA.
• Hasegawa, M., Kishino, H. and Yano, T. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22: 160-174.
• Higgins, D.G., Bleasby, A.J. and Fuchs, R. 1992. CLUSTAL V: improved software for multiple sequence alignment. CABIOS 8: 189-191.
• Higgins, D.G. and Sharp, P.M. 1989. Fast and sensitive multiple sequence alignments on a microcomputer. CABIOS 5: 151-153.
• Higgins, D.G. and Sharp, P.M. 1988. CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73: 237-244.
• Hudson, R.R. 1990. Gene genealogies and the coalescent process, pp. 1-44 in Oxford Surveys in Evolutionary Biology, edited by D. Futuyma and J. Antonovics. Oxford University Press, New York.
• Jukes, T. and Cantor, C. 1969. Evolution of protein molecules. In: Mammalian Protein Metabolism, edited by Munro, H.N., New York: Academic Press, p. 21-132.
• Kimura, M. 1980. A simple method for estimating evolutionary rate of base substitution through comparative studies of nucleotide sequences. J. Mol.
Evol. 16:111-120.
• Lanave, C., Preparata, G., Saccone, C. and Serio, G. 1984. A new method for calculating evolutionary substitution rates. J. Mol. Evol. 20: 86-93.
• Nei, M. 1987. Molecular Evolutionary Genetics. Columbia University Press, New York, NY, USA.
• Page, R.D.M. 1996. TREEVIEW: An application to display phylogenetic trees on personal computers. Computer Applications in the Biosciences 12: 357-358.
• Pellecchia, M., Negrini, R., Colli, L., Patrini, M., Milanesi, E., Achilli, A., Bertorelle, G., Cavalli-Sforza, L.L., Piazza, A., Torroni, A. and Ajmone-Marsan,
P. 2007. The mystery of Etruscan origins: novel clues from Bos taurus mitochondrial DNA. Proc. R. Soc. B . 274: 1175–1179.
• Posada, D. 2006. ModelTest Server: a web-based tool for the statistical selection of models of nucleotide substitution online. Nucleic Acids Research 34: W700-W703.
• Posada, D. and Buckley, T.R. 2004. Model selection and model averaging in phylogenetics: advantages of the AIC and Bayesian approaches over likelihood ratio tests. Systematic Biology 53: 793-808.
• Posada, D. and Crandall, K.A. 1998. Modeltest: testing the model of DNA substitution. Bioinformatics 14(9): 817-818.
• Rambaut, A. and Drummond, A.J. 2007. Tracer v1.4. http://tree.bio.ed.ac.uk/software/tracer/
• Rogers, A.R. 2004. Lecture Notes on Gene Genealogies. www.anthro.utah.edu/~rogers/bio5410/Lectures/a_alu.pdf
• Rogers, A. R. and Harpending, H. 1992. Population growth makes waves in the distribution of pairwise genetic differences. Mol. Biol. Evol. 9: 552-569.
• Rousset, F. 2000. Inferences from spatial population genetics, in Handbook of Statistical Genetics, D. Balding, M. Bishop and C. Cannings (eds.). Wiley & Sons, Ltd.
• Saitou, N. and Nei, M. 1987. The neighbor–joining method: a new method for reconstructing the phylogenetic tree. Mol. Biol. Evol. 4: 406−425.
• Schwarz, G. 1978. Estimating the dimension of a model. The Annals of Statistics 6: 461-464.
• Slatkin, M., 1991 Inbreeding coefficients and coalescence times. Genet. Res. Camb. 58: 167-175.
• Swofford, D.L. 1998. PAUP*. Phylogenetic Analysis Using Parsimony (*and other methods). Version 4. Sinauer Associates, Sunderland, Massachusetts.
• Swofford, D.L. and Berlocher, S.H. 1987. Inferring evolutionary trees from gene frequency data under the principle of maximum parsimony. Systematic
Zoology 36: 293−325.
• Tajima, F. 1983 Evolutionary relationship of DNA sequences in finite populations. Genetics 105: 437-460.
• Tajima, F. 1993. Measurement of DNA polymorphism. In: Mechanisms of Molecular Evolution. Introduction to Molecular Paleopopulation Biology,
edited by Takahata, N. and Clark, A.G., Tokyo, Sunderland, MA:Japan Scientific Societies Press, Sinauer Associates, Inc., p. 37-59.
• Tajima, F. and Nei, M. 1984. Estimation of evolutionary distance between nucleotide sequences. Mol. Biol. Evol. 1: 269-285.
• Tamura, K., 1992 Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G+C content biases. Mol. Biol.
Evol. 9: 678-687.
• Tamura, K., Dudley, J., Nei, M., and Kumar, S. 2007. MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Molecular
Biology and Evolution 24: 1596-1599.
• Tamura, K. and Nei, M. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 10: 512-526.
• Thompson, J.D., Gibson, T.J., Plewniak, F., Jeanmougin, F. and Higgins, D.G. 1997. The ClustalX windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Research 25: 4876-4882.
• Thompson, J.D., Higgins, D.G. and Gibson, T.J. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through
sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Research 22: 4673-4680.
• Troy, C.S., MacHugh, D.E., Bailey, J.F., Magee, D.A., Loftus, R.T., Cunningham, P., Chamberlain, A.T., Sykes, B.C. and Bradley, D.G. 2001. Genetic evidence for Near-Eastern origins of European cattle. Nature 410: 1088-1091.
• Weir, B.S. and Cockerham, C.C. 1984 Estimating F-statistics for the analysis of population structure. Evolution 38:1358-1370.
• Wright, S., 1951 The genetical structure of populations. Ann.Eugen. 15: 323-354.
• Wright, S., 1965 The interpretation of population structure by F-statistics with special regard to systems of mating. Evol 19: 395-420.