SUMMER SCHOOL 2008 - PIACENZA, ITALY - 10 September 2008

Methods for the analysis of mitochondrial DNA data – part 1
Licia Colli, U.C.S.C. di Piacenza
licia.colli@unicatt.it

• The mitochondrial genome
• Sequence format and alignment
• Input file formats most frequently used in mtDNA analyses
• Molecular diversity indices
• Analysis of MOlecular VAriance
• Mismatch distribution and estimates of population expansion
• Admixture analysis
• Trees:
  - generalities;
  - models of DNA sequence evolution and choice of the best-fitting model
  - Tree reconstruction strategies
  - Distance-based methods (NJ)
  - Character-based methods (MP, ML, Bayesian)
  - Molecular clock and calculations of divergence times
  - Bootstrap and Jackknife
• Software list
• References

The mitochondrial genome (mtDNA)

• Its length varies among species (15-17 kb)
• multiple copies in each cell (a mammalian egg cell contains about 100,000 copies)
• lack of recombination
• HAPLOID - maternally inherited
• high mutation rate
• 13 protein-coding genes, 2 rRNA sequences (12S and 16S), 22 tRNA sequences and 1 non-coding region (control region or displacement loop).
• the mitochondrial genetic code differs slightly from the nuclear code:

  Codon   Nuclear code   Mitochondrial code
  TGA     stop codon     Trp (W)
  ATA     Ile (I)        Met (M)
  AGA     Arg (R)        stop codon
  AGG     Arg (R)        stop codon

The mitochondrial genome (mtDNA)

A useful molecule, indeed…
• genealogy
• phylogeny (cytochrome b, 12S, 16S, control region, whole mtDNA)
• phylogeography (cytb, control region, whole mtDNA)
• species identification (cytb, control region)
• population studies (+ other markers)
• detection of “cryptic species” and “barcoding” projects (COXI)
• studies on the domestication process
• studies on male fertility/infertility
• studies on ancient DNA (aDNA)…
…and many other applications.

Sequence format and alignment

EditPlus: a text editor useful to handle sequences and prepare input files.
Freely downloadable 30-day evaluation version: http://www.editplus.com/download.html

FASTA (filename.txt)

  >Seq_1
  cccctaatatgtacaataatgaatgttgtaaattagtgttataacacatctatgtataat
  >Seq_2
  CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
  >Seq_3
  CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
  >Seq_4
  CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
  >Seq_5
  CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
  >Seq_6
  CCCCTAATAGGTACAATAATTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
  >Seq_7
  CCCCTAATTTGTACAATAATGAATGTTGTAAATTAATGTTATAACACATCTATGTATAAT
  >Seq_8
  CCCCTAATATGTCCAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
  >Seq_9
  CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
  >Seq_10
  CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT

ClustalX (filename.aln)

  CLUSTAL X (1.83) multiple sequence alignment

  Seq_1       CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
  Seq_2       CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
  Seq_3       CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
  Seq_4       CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
  Seq_5       CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
  Seq_6       CCCCTAATAGGTACAATAATTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
  Seq_7       CCCCTAATTTGTACAATAATGAATGTTGTAAATTAATGTTATAACACATCTATGTATAAT
  Seq_8       CCCCTAATATGTCCAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
  Seq_9       CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
  Seq_10      CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
              ********  ** ******  ************** ************************

Input file formats

Phylip (filename.txt; filename.phy)

   10 60
  Seq_1     cccctaatatgtacaataatgaatgttgtaaattagtgttataacacatctatgtataat
  Seq_2     CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
  Seq_3     CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
  Seq_4     CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
  Seq_5     CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
  Seq_6     CCCCTAATAGGTACAATAATTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
  Seq_7     CCCCTAATTTGTACAATAATGAATGTTGTAAATTAATGTTATAACACATCTATGTATAAT
  Seq_8     CCCCTAATATGTCCAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
  Seq_9     CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
  Seq_10    CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT

or otherwise

   10 60
  Seq_1     cccctaatatgtacaataatgaatgttgta
  Seq_2     CCCCTAATATGTACAATAATGAATGTTGTA
  Seq_3     CCCCTAATATGTACAATAATGAATGTTGTA
  Seq_4     CCCCTAATATGTACAATAATGAATGTTGTA
  Seq_5     CCCCTAATAGGTACAATAACTAATGTTGTA
  Seq_6     CCCCTAATAGGTACAATAATTAATGTTGTA
  Seq_7     CCCCTAATTTGTACAATAATGAATGTTGTA
  Seq_8     CCCCTAATATGTCCAATAATGAATGTTGTA
  Seq_9     CCCCTAATATGTACAATAATGAATGTTGTA
  Seq_10    CCCCTAATATGTACAATAATGAATGTTGTA

            aattagtgttataacacatctatgtataat
            AATTAGTGTTATAACACATCTATGTATAAT
            AATTAGTGTTATAACACATCTATGTATAAT
            AATTAGTGTTATAACACATCTATGTATAAT
            AATTAGTGTTATAACACATCTATGTATAAT
            AATTAGTGTTATAACACATCTATGTATAAT
            AATTAATGTTATAACACATCTATGTATAAT
            AATTAGTGTTATAACACATCTATGTATAAT
            AATTAGTGTTATAACACATCTATGTATAAT
            AATTAGTGTTATAACACATCTATGTATAAT

MEGA (filename.meg)

  #Mega
  title: title_of_your_project

  #Seq_1    cccctaatatgtacaataatgaatgttgtaaattagtgttataacacatctatgtataat
  #Seq_2    CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
  #Seq_3    CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
  #Seq_4    CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
  #Seq_5    CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
  #Seq_6    CCCCTAATAGGTACAATAATTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
  #Seq_7    CCCCTAATTTGTACAATAATGAATGTTGTAAATTAATGTTATAACACATCTATGTATAAT
  #Seq_8    CCCCTAATATGTCCAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
  #Seq_9    CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
  #Seq_10   CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT

or otherwise

  #Mega
  title: title_of_your_project

  #Seq_1
  cccctaatatgtacaataatgaatgttgtaaattagtgttataacacatctatgtataat
  #Seq_2
  CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
  #Seq_3
  CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
  #Seq_4
  CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
  #Seq_5
  CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
  #Seq_6
  CCCCTAATAGGTACAATAATTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
  #Seq_7
  CCCCTAATTTGTACAATAATGAATGTTGTAAATTAATGTTATAACACATCTATGTATAAT
  #Seq_8
  CCCCTAATATGTCCAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
  #Seq_9
  CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
  #Seq_10
  CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT

NEXUS (filename.nex)

  #NEXUS
  BEGIN TAXA;
    DIMENSIONS NTAX=10;
    TAXLABELS
      Seq_1 Seq_2 Seq_3 Seq_4 Seq_5 Seq_6 Seq_7 Seq_8 Seq_9 Seq_10;
  END;
  BEGIN CHARACTERS;
    DIMENSIONS NCHAR=60;
    FORMAT DATATYPE=DNA MISSING=? GAP=- MATCHCHAR=.;
    MATRIX
    Seq_1   cccctaatatgtacaataatgaatgttgtaaattagtgttataacacatctatgtataat
    Seq_2   CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    Seq_3   CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    Seq_4   CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    Seq_5   CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    Seq_6   CCCCTAATAGGTACAATAATTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    Seq_7   CCCCTAATTTGTACAATAATGAATGTTGTAAATTAATGTTATAACACATCTATGTATAAT
    Seq_8   CCCCTAATATGTCCAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    Seq_9   CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    Seq_10  CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT;
  END;

Arlequin (filename.arp)

  [Profile]
    Title="An example of DNA sequence data"
    NbSamples=3
    GenotypicData=0
    DataType=DNA
    LocusSeparator=NONE
  [Data]
  [[Samples]]
    SampleName="Population 1"
    SampleSize=3
    SampleData= {
      Seq_1  1 cccctaatatgtacaataatgaatgttgtaaattagtgttataacacatctatgtataat
      Seq_2  1 CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
      Seq_3  1 CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    }
    SampleName="Population 2"
    SampleSize=3
    SampleData= {
      Seq_4  1 CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
      Seq_5  1 CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
      Seq_6  1 CCCCTAATAGGTACAATAATTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    }
    SampleName="Population 3"
    SampleSize=4
    SampleData= {
      Seq_7  1 CCCCTAATTTGTACAATAATGAATGTTGTAAATTAATGTTATAACACATCTATGTATAAT
      Seq_8  1 CCCCTAATATGTCCAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
      Seq_9  1 CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
      Seq_10 1 CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    }
  [[Structure]]
    StructureName="A group of 3 populations analyzed for DNA"
    NbGroups=1
    Group= {
      "Population 1"
      "Population 2"
      "Population 3"
    }

Sequence alignment

Software of the Clustal family:
• ClustalW: online versions
and download
• ClustalX: download

http://www.ch.embnet.org/software/ClustalW.html
http://www.ebi.ac.uk/Tools/clustalw2/index.html
http://www.clustal.org/download/
http://www.clustal.org/download/current/

Higgins & Sharp (1988; 1989); Higgins et al. (1992); Thompson et al. (1994; 1997).

SeaView is a sequence alignment editor which is able to read and write various alignment formats (NEXUS, CLUSTAL, FASTA, PHYLIP…).
Free download from this website: http://pbil.univ-lyon1.fr/software/seaview.html
Galtier et al. (1996).

Molecular diversity indices

Haplotype diversity (H)
It is defined as the probability that two randomly chosen haplotypes are different in the sample. Haplotype (gene) diversity is estimated as:

\[ \hat{H} = \frac{n}{n-1}\left(1 - \sum_{i=1}^{k} p_i^2\right) \]

where n is the number of gene copies in the sample, k is the number of haplotypes, and p_i is the sample frequency of the i-th haplotype.
Nei (1987). Source: Arlequin ver. 3.1 user manual.

Mean number of pairwise differences (π)
Mean number of differences between all pairs of haplotypes in the sample. It can be estimated as

\[ \hat{\pi} = \frac{n}{n-1}\sum_{i=1}^{k}\sum_{j=1}^{k} p_i\, p_j\, \hat{d}_{ij} \]

where d_ij is an estimate of the number of mutations having occurred since the divergence of haplotypes i and j, k is the number of haplotypes, p_i is the frequency of haplotype i, p_j is the frequency of haplotype j, and n is the sample size.
Tajima (1983; 1993). Source: Arlequin ver. 3.1 user manual.

Nucleotide diversity (πn)
It is computed as the probability that two randomly chosen homologous nucleotide sites are different. It is equivalent to the haplotype diversity at the nucleotide level.
\[ \hat{\pi}_n = \frac{n}{n-1}\,\frac{\sum_{i=1}^{k}\sum_{j=1}^{k} p_i\, p_j\, \hat{d}_{ij}}{L} \]

where d_ij is an estimate of the number of mutations having occurred since the divergence of haplotypes i and j, k is the number of haplotypes, p_i is the frequency of haplotype i, p_j is the frequency of haplotype j, n is the sample size and L is the number of loci (nucleotide sites).
Tajima (1983); Nei (1987). Source: Arlequin ver. 3.1 user manual.

«Genetic loci from a centre of origin are expected to retain more ancestral variation and show higher haplotypic and nucleotide diversity, with lineage pruning through successive colonization events leading to a reduction in derived populations.»
Troy et al. (2001).

383 B. taurus mtDNA sequences (240 bp of the HVRI region):

  Region            Mean pairwise differences (± s.d.)
  Middle East       3.79 ± 2.03
  Anatolia          3.49 ± 1.81
  Mainland Europe   1.92 ± 1.10
  Britain           2.68 ± 1.45
  Northern Europe   1.47 ± 0.91
  Africa            2.09 ± 1.18

Analysis of MOlecular VAriance - AMOVA

The Analysis of MOlecular VAriance (AMOVA, Excoffier et al. 1992) is based on analyses of variance of gene frequencies, taking into account the number of mutations between molecular haplotypes.
User-defined groups of populations → particular genetic structure to test.
A hierarchical analysis of variance partitions the total variance into covariance components (Rousset, 2000).
The total molecular variance (σ²) is the sum of the components due to:
• σa² = differences among the groups of populations;
• σb² = differences among haplotypes in different populations within a group;
• σc² = differences among haplotypes within a population.
Source: Arlequin ver. 3.1 user manual.

Simple hierarchical genetic structure, e.g.
haploid individuals in populations → the algorithm leads to a fixation index FST (Weir & Cockerham, 1984) which can be expressed in terms of inbreeding coefficients as

\[ F_{ST} = \frac{f_0 - f_1}{1 - f_1} \]

Slatkin (1991).
where f0 is the probability of identity by descent of two different genes drawn from the same population, and f1 is the probability of identity by descent of two genes drawn from two different populations.
Source: Arlequin ver. 3.1 user manual.

Mismatch Distribution

It is the distribution of the observed number of differences between pairs of haplotypes.
This distribution is usually multimodal in samples drawn from populations at demographic equilibrium, as it reflects the highly stochastic shape of gene trees…
[Figure: pairwise differences plot. Solid line = theoretical values referring to a model of neutral evolution in a population of constant size (Rogers, 2004).]

…but it is usually unimodal in populations having passed through a recent demographic expansion.
Rogers & Harpending (1992); Hudson & Slatkin (1991).
[Figure: simulations of populations that underwent a sudden 100-fold growth at 7 units of mutational time before present (Rogers, 2004). Solid line in the pairwise differences plot = theoretical values referring to the expectations under the model of population history used for these simulations.]

Mismatch Distribution and estimates of population expansion

In case of a sudden population growth (mismatch distribution = smooth unimodal wave), the time of the expansion τ and the size of the pre-expansion population θ0, both expressed in mutational units, can be estimated as follows

\[ \hat{\theta}_0 = \sqrt{v - m}, \qquad \hat{\tau} = m - \hat{\theta}_0 \]

where m is the mean of the pairwise differences (i.e. π, the mean pairwise difference per sequence within the sample) and v is their variance.
Source: Rogers (2004) and Arlequin ver. 3.1 user manual.
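As a rough illustration of the mismatch analysis described above, the following Python sketch builds the mismatch distribution of the ten 60-bp example sequences used in the file-format examples and applies moment estimators of the form θ0 = sqrt(v − m), τ = m − θ0. The function names are hypothetical, the use of the population variance for v is an assumption, and a ten-sequence toy data set is far too small for a meaningful demographic inference; dedicated software such as Arlequin should be used for real analyses.

```python
from collections import Counter
from itertools import combinations
from math import sqrt
from statistics import mean, pvariance

# Distinct haplotypes of the example data set and their copy numbers.
HAPLOTYPES = {
    # Seq_1 to Seq_4, Seq_9 and Seq_10 (case is ignored)
    "CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT": 6,
    # Seq_5
    "CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT": 1,
    # Seq_6
    "CCCCTAATAGGTACAATAATTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT": 1,
    # Seq_7
    "CCCCTAATTTGTACAATAATGAATGTTGTAAATTAATGTTATAACACATCTATGTATAAT": 1,
    # Seq_8
    "CCCCTAATATGTCCAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT": 1,
}


def n_differences(seq_a, seq_b):
    """Number of sites at which two aligned sequences differ."""
    return sum(a != b for a, b in zip(seq_a, seq_b))


def pairwise_differences(haplotypes):
    """Differences for every pair of gene copies in the sample."""
    copies = [seq for seq, count in haplotypes.items() for _ in range(count)]
    return [n_differences(a, b) for a, b in combinations(copies, 2)]


def expansion_moments(diffs):
    """Moment estimates (m, v, theta0, tau); assumes v is the population variance."""
    m = mean(diffs)
    v = pvariance(diffs)
    if v <= m:
        raise ValueError("variance must exceed the mean for this estimator")
    theta0 = sqrt(v - m)
    tau = m - theta0
    return m, v, theta0, tau


diffs = pairwise_differences(HAPLOTYPES)
histogram = dict(sorted(Counter(diffs).items()))
m, v, theta0, tau = expansion_moments(diffs)

print("mismatch distribution:", histogram)
print(f"m = {m:.4f}, v = {v:.4f}, theta0 = {theta0:.4f}, tau = {tau:.4f}")
```

The histogram keys are numbers of differences and the values are numbers of sequence pairs, which is the quantity plotted in a mismatch distribution.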
Estimates of population expansion: an alternative approach

Analysis of Bayesian skyline plots: an approach alternative to mismatch distribution analysis.
Past changes in population size can be inferred from present-day genetic diversity without prior assumptions about population history.

Mitochondrial d-loop sequence data (also aDNA).
Four domestic species:
- Yak (Bos grunniens) n=71
- Water buffalo (Bubalus bubalis) n=110
- Mithan (Bos frontalis) n=24
- Cattle (Bos taurus) n=84
One closely related wild species:
- African buffalo (Syncerus caffer) n=195
Uniform mutation rate: 32% Myr⁻¹

Domestic species - sudden expansion during the last 10⁴ years ~ time since domestication.
African buffalo - gradual population expansion followed by a sharp decline (consistent with documented epidemics and habitat loss since the XIXth century).
Software: BEAST, BEAUTI and TRACER.
Source: Finlay et al. (2007).

Admixture analysis

This analysis evaluates the relative contributions of any number of parental populations to a derived, hybrid population. It compares the composition of different gene pools rather than making inference about the admixture event itself (mY estimator; Dupanloup & Bertorelle, 2001).

Software: ADMIX ver. 1.0
Features:
- works with sequences, RFLPs, microsatellites
- needs 2 input files:
  DATA file (filename.dat)
  MATRIX file (filename.mtx)

The DATA file should contain, for each locus, the sample sizes of the admixed and of the parental populations and the number of copies observed for each haplotype (allele) in each population.

DATA file example:

  LocusX
  nAD, nP1, nP2
  cnH1(AD), cnH1(P1), cnH1(P2)
  cnH2(AD), cnH2(P1), cnH2(P2)
  cnH3(AD), cnH3(P1), cnH3(P2)

AD = admixed pop.; P1 = parental pop. 1; P2 = parental pop. 2
nAD = sample size of pop. AD; etc.
H1, H2, H3 = haplotypes
cnH1(AD) = count number for haplotype 1 in AD pop.; etc.

MATRIX file example:

  nX          number of analyzed loci
  LocusX
  nH          number of haplotypes observed at the locus
  0
  1 0
  3 2 0       lower triangular matrix of molecular distances
              (number of substitutions in pairwise comparisons of haplotypes)

ADMIX ver. 2.0 needs only one input file containing both the data and the matrix.

[Figure: admixture values ± s.e. calculated on Bos taurus mtDNA data (HVRI region) derived from autochthonous Italian breeds. Pellecchia et al. (2007).]

Trees

A tree is a graph which describes the evolutionary relationships between sequences.
• Nodes = Taxonomic Units (TUs);
• Branches = evolutionary relationships between TUs in terms of ancestry/descent.
A branch connects only two nodes. Internal nodes represent ancestral TUs, while terminal nodes represent present TUs (i.e. sequences), also defined Operational Taxonomic Units, OTUs.

Cladogram: a tree describing only the relationships between nodes. Branch lengths have no specific meaning.
Phylogram: branch lengths are proportional to the evolutionary distance → calculations of genetic divergence between nodes.
[Figure: the same tree of Seq 1-Seq 10 drawn as a cladogram and as a phylogram.]

Rooted tree: a particular node, the “root”, represents the common ancestor of all the remaining nodes → all the branches can be oriented as a function of time.
Unrooted tree: describes exclusively the evolutionary relationships between OTUs. No information on the evolutionary process as a function of time → it is not possible to identify older/more recent nodes.
[Figure: a rooted tree (with outgroup) and an unrooted tree of Seq 1-Seq 10.]

Rooted trees are usually built when the hypothesis of the “molecular clock” is assumed, i.e. genetic divergence proportional to evolutionary time.

To root a tree, a particular OTU, called “outgroup”, is included in the dataset. The outgroup is defined as “an OTU which started the process of divergence from its ancestor before all the remaining OTUs started diverging from each other” (information derived from non-genetic evidence, e.g. paleontology, morphology, etc.).

Trees can also be represented in the Newick (computer readable) format with nested brackets:

  ((((Seq_9,(Seq_6,Seq_5)),Seq_10),((Seq_8,Seq_4),Seq_3)),(Seq_7,Seq_2),Seq_1);

Dedicated software reads trees in Newick format (e.g. TreeView; Page, 1996).
[Figure: the tree described by the Newick string above.]

Aim of a phylogenetic analysis → determining the “topology” (structure) of the tree.
The number of possible trees grows faster than exponentially with the number of OTUs. For n OTUs, the numbers of rooted (NR) and unrooted (NU) trees are given by

\[ N_R = \frac{(2n-3)!}{2^{\,n-2}\,(n-2)!} \qquad N_U = \frac{(2n-5)!}{2^{\,n-3}\,(n-3)!} \]

NU for n OTUs = NR for (n-1) OTUs.
E.g. if n=10 there are about 34·10^6 possible rooted trees (and about 2·10^6 unrooted ones), only one of which correctly represents the evolutionary relationships between the OTUs!

Tree reconstruction strategies

Tree-building methods can be classified according to
• the type of data (i.e. distance matrix vs. discrete characters);
• the reconstruction strategy (clustering algorithms vs.
optimality criteria);

                          DATA
  METHOD                  distance matrix    discrete characters
  Clustering algorithm    UPGMA, NJ
  Optimality criterion    ME, FM             MP, ML, BA

UPGMA: unweighted pair-group method using arithmetic means; NJ: neighbor-joining; ME: minimum evolution; FM: Fitch-Margoliash's least-squares method; MP: maximum parsimony; ML: maximum likelihood; BA: Bayesian inference.

All the aforementioned methods (except MP) require the selection of an explicit model of sequence evolution (“substitution model”).
Substitution models describe in probabilistic terms the process by which a set of characters (nucleotides) changes into another set of homologous character states over time.

Models of DNA sequences evolution

Pairwise and percentage difference
These are very rough estimates of evolutionary divergence between sequences. They are computed as the number/percentage of loci (nucleotides) for which two sequences are different:

  P = nd        (number of differences)
  P = nd / L    (proportion of differing sites)

where nd is the number of observed substitutions between two DNA sequences and L is the number of loci.

The number of observed differences usually underestimates the real amount of evolutionary change (e.g. occurrence of “multiple hits”).
Substitution models incorporate some “correction” parameters, their number varying according to the a priori assumptions accepted (number of fixed/variable parameters).
A priori assumptions:
• Nucleotide sites evolve independently;
• All sites can mutate with equal probability;
• All types of substitutions are equally probable;
• Substitution rate is constant;
• The base composition is at equilibrium (sequences have the same base composition).
The higher the number of accepted assumptions, the simpler the model.
The lower the number of accepted assumptions, the higher the number of the parameters that need to be estimated.

The most renowned and used nucleotide substitution models are those from the General Time-Reversible (GTR) family (Lanave et al., 1984): 203 possible models differentiated by the number and type of fixed/variable parameters.
The nucleotide substitution models implemented in the most frequently used phylogenetics software packages (MEGA, PAUP*, PHYLIP, PHYML, MrBayes etc.) belong to the GTR family.

Jukes and Cantor (JC69; 1969)
It is the simplest (parameter poorest) model, which assumes that:
• Nucleotide frequencies are equal (i.e. πA = πT = πC = πG = 0.25);
• All possible substitutions take place at a single rate → only the parameter α needs to be estimated (substitution rate).

       A    C    G    T
  A    -    α    α    α
  C    α    -    α    α
  G    α    α    -    α
  T    α    α    α    -

Kimura 2-parameters (K80; 1980)
• Nucleotide frequencies are equal (i.e. πA = πC = πG = πT = 0.25);
• Different substitution rates between transitions (Ts) α and transversions (Tv) β.
The Ts/Tv ratio is estimated from the data.

       A    C    G    T
  A    -    β    α    β
  C    β    -    β    α
  G    α    β    -    β
  T    β    α    β    -

Tamura (1992)
This model is an extension of the K80 method, allowing for unequal nucleotide frequencies.
• Base composition is not equal (A + T ≠ G + C and G + C = θ);
• Different substitution rates between Ts (α) and Tv (β).
The Ts/Tv ratio, as well as nucleotide frequencies, are computed from the data.
       A         C     G     T
  A    -         θβ    θα    (1-θ)β
  C    (1-θ)β    -     θβ    (1-θ)α
  G    (1-θ)α    θβ    -     (1-θ)β
  T    (1-θ)β    θα    θβ    -

Felsenstein (F81; 1981)
It is an extension of the JC69 method, allowing for unequal nucleotide frequencies (i.e. πA ≠ πC ≠ πG ≠ πT).
The overall nucleotide frequencies are computed from the data.

       A      C      G      T
  A    -      πCα    πGα    πTα
  C    πAα    -      πGα    πTα
  G    πAα    πCα    -      πTα
  T    πAα    πCα    πGα    -

Hasegawa-Kishino-Yano (HKY; 1985)
This model combines the assumptions of K80 and F81:
• Unequal nucleotide frequencies (i.e. πA ≠ πC ≠ πG ≠ πT);
• Different substitution rates between Ts (α) and Tv (β).
Overall nucleotide frequencies and the Ts/Tv ratio are computed from the data.

       A      C      G      T
  A    -      πCβ    πGα    πTβ
  C    πAβ    -      πGβ    πTα
  G    πAα    πCβ    -      πTβ
  T    πAβ    πCα    πGβ    -

General Time Reversible (GTR)
Lanave et al. (1984).
It is the most general and parameter-rich model.
• Unequal nucleotide frequencies (i.e. πA ≠ πC ≠ πG ≠ πT).
• Different substitution rates between the two transitions and the four transversions:
  Ts: A ↔ G = α1; C ↔ T = α2
  Tv: A ↔ C = β1; A ↔ T = β2; C ↔ G = β3; G ↔ T = β4 *.
• Unequal probability for each type of nucleotide substitution.
• Substitutions are reversible (A → G = G → A).

       A       C       G       T
  A    -       πCβ1    πGα1    πTβ2
  C    πAβ1    -       πGβ3    πTα2
  G    πAα1    πCβ3    -       πTβ4
  T    πAβ2    πCα2    πGβ4    -

* taking the rate of G <-> T = 1 and making all other 5 possible substitution rates relative to the G-T transversion.

Γ (gamma) distribution and Invariant sites
An additional parameter is considered when the substitution rates cannot be assumed as uniform for all sites.
Not all the nucleotide positions within a sequence, in fact, are subject to the same evolutionary constraints (e.g. 1st-2nd vs. 3rd codon position in protein-coding genes).
There are two strategies:
1) To analyze separately the sites subject to different evolutionary dynamics;
2) To adopt a model with additional parameters that account for the rate variation.
Γ distributions are used to model continuous variables that are always positive and have skewed distributions.
The shape of the Γ distribution is determined by a single parameter α (“shape parameter”) which specifies the range of rate variation among sites and is inversely related to the level of heterogeneity among site rates.

The lower the values of α, the larger the range of rate variation and the more uneven the substitution rates.
As α → ∞, all sites have the same substitution rate.
[Figure: Γ distributions for different values of α; x axis = substitution rate (r), y axis = proportion of sites f(r).]
Also the fraction of Invariant sites (i.e. sites showing no variation within the sequences set) can be estimated and taken into account when modeling the evolutionary process.

All models typically used to infer evolutionary relationships between DNA sequences represent a special case of the GTR model.
Imposing constraints (i.e. a priori assumptions) on the parameters of the GTR leads to a different model which can, therefore, be considered as a special case of the GTR.
A model is said to be “nested” within a more complex one if the former can be obtained by constraining the parameters of the latter.
E.g. JC69 is nested within K80, while F81 and K80 are not nested because fixing parameter values of either one does not yield the other model.
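The nesting relationships described above can be made concrete with a small Python sketch that builds the instantaneous rate matrix of each model by constraining a single GTR-style constructor. All function names are hypothetical. The entries follow the convention of the matrices in this section (rate from i to j = frequency of the target base × exchange rate), so for JC69 and K80 the constant factor 0.25 is kept explicit instead of being absorbed into α and β. The HKY base frequencies are the ones reported in the ModelTest example output of this document; α = 2 and β = 1 are arbitrary illustrative values.

```python
from itertools import combinations

BASES = "ACGT"
TRANSITIONS = {frozenset("AG"), frozenset("CT")}
EQUAL_FREQS = dict.fromkeys(BASES, 0.25)


def gtr(freqs, rates):
    """Rate matrix with q[i][j] = rates[{i, j}] * freqs[j]; rows sum to zero."""
    matrix = []
    for i, row_base in enumerate(BASES):
        row = [
            0.0 if col_base == row_base
            else rates[frozenset((row_base, col_base))] * freqs[col_base]
            for col_base in BASES
        ]
        row[i] = -sum(row)  # diagonal makes the row sum to zero
        matrix.append(row)
    return matrix


def hky(freqs, alpha, beta):
    """GTR constrained to one transition rate and one transversion rate."""
    rates = {}
    for pair in combinations(BASES, 2):
        key = frozenset(pair)
        rates[key] = alpha if key in TRANSITIONS else beta
    return gtr(freqs, rates)


def k80(alpha, beta):
    """HKY constrained to equal base frequencies."""
    return hky(EQUAL_FREQS, alpha, beta)


def f81(freqs, alpha):
    """HKY constrained to a single substitution rate."""
    return hky(freqs, alpha, alpha)


def jc69(alpha):
    """K80 constrained to alpha = beta (equivalently, F81 with equal frequencies)."""
    return k80(alpha, alpha)


HKY_FREQS = {"A": 0.3123, "C": 0.2263, "G": 0.1618, "T": 0.2996}
q_hky = hky(HKY_FREQS, alpha=2.0, beta=1.0)
for base, row in zip(BASES, q_hky):
    print(base, [round(value, 4) for value in row])
```

Because every simpler model is obtained by fixing arguments of a more general one, JC69 can be reached from either K80 or F81, whereas neither K80 nor F81 can be produced from the other.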
Models of DNA sequences evolution

How do we select the best-fitting model?
To select the best-fitting substitution model the Likelihood Ratio Test (LRT) is usually applied. In a maximum likelihood framework, it evaluates the statistical significance of the increase in fit of alternative nested models to the data as their number and types of parameters increase.

  Δ = 2 (ln L1 - ln L0)

L1 = global ML estimate for the alternative hypothesis (more general, parameter richer model)
L0 = global ML estimate for the null hypothesis (simpler model).
Under the null hypothesis, Δ is approximately χ² distributed with d.f. = difference in the number of free parameters between the two alternative models.

The Akaike Information Criterion (AIC; Akaike, 1974; Posada & Buckley, 2004) and the Bayesian Information Criterion (BIC; Schwarz, 1978) are methods alternative to the LRT; they simultaneously compare the relative fit of all competing models, be they nested or not.

ModelTest
Posada & Crandall (1998).
A very popular tool which automatically selects the best-fitting substitution model from among 56 alternatives by performing LRT, AIC (software ver. 3.06) and BIC (software ver. 3.7) calculations. It returns the name and the parameter values of the best-fitting model.
Original software version: both the ModelTest application and the software PAUP* (Swofford, 1998) are needed. Unfortunately, PAUP* software is not free.
Input file format = Nexus (same as PAUP*) + “ModelTest” block.
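The likelihood ratio test and the AIC described in this section can be sketched in a few lines of Python. The numbers below are taken from the ModelTest 3.06 example output reproduced in this document; the function names are hypothetical, the χ² tail probability is computed with a closed-form series valid for integer degrees of freedom, and the parameter count K = 9 for TVM+I+G (4 free rates, 3 free base frequencies, I and Γ shape) is an assumption that happens to reproduce the reported AIC. ModelTest itself uses a mixed χ² distribution for the rate-heterogeneity tests, which this sketch does not attempt.

```python
from math import erfc, exp, gamma, sqrt


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        terms = (half ** j / gamma(j + 1) for j in range(df // 2))
        return exp(-half) * sum(terms)
    terms = (half ** (j - 0.5) / gamma(j + 0.5) for j in range(1, (df - 1) // 2 + 1))
    return erfc(sqrt(half)) + exp(-half) * sum(terms)


def likelihood_ratio_test(neg_lnl_null, neg_lnl_alt, df):
    """Return (delta, p_value) with delta = 2 (lnL1 - lnL0); inputs are -lnL values."""
    delta = 2.0 * (neg_lnl_null - neg_lnl_alt)
    return delta, chi2_sf(delta, df)


def aic(neg_lnl, n_parameters):
    """Akaike Information Criterion: AIC = -2 lnL + 2 K."""
    return 2.0 * neg_lnl + 2.0 * n_parameters


# JC (null) vs F81 (alternative): three extra free base frequencies.
delta_jc_f81, p_jc_f81 = likelihood_ratio_test(2562.9832, 2543.3635, df=3)
# HKY (null) vs TrN (alternative): one extra transition rate.
delta_hky_trn, p_hky_trn = likelihood_ratio_test(2482.0591, 2482.0227, df=1)
# AIC of TVM+I+G, assuming K = 9 free parameters.
aic_tvm = aic(2358.6572, 9)

print(f"JC vs F81:  delta = {delta_jc_f81:.4f}, P = {p_jc_f81:.2e}")
print(f"HKY vs TrN: delta = {delta_hky_trn:.4f}, P = {p_hky_trn:.4f}")
print(f"AIC(TVM+I+G) = {aic_tvm:.4f}")
```

A small P-value favours the more parameter-rich model of the pair, while a lower AIC indicates the preferred model among all candidates.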
More information on how to run ModelTest can be found here:
http://darwin.uvigo.es/software/modeltest.html
http://www.rhizobia.co.nz/phylogenetics/modeltest.html (Windows)
http://www.genedrift.org/mtgui.php (Windows and Linux).
A web-based tool to run ModelTest can be found here:
http://darwin.uvigo.es/software/modeltest_server.html (Posada, 2006).
A free web-based tool to choose among 28 nucleotide models with the AIC:
http://www.hiv.lanl.gov/content/sequence/findmodel/findmodel.html

[Figure: hierarchical structure of the 56 models implemented in the ModelTest procedure (Posada and Crandall, 1998). It does not include all of the possible models in the GTR family.]

ModelTest ver. 3.06 output

  Testing models of evolution - Modeltest Version 3.06
  (c) Copyright, 1998-2000 David Posada
  Department of Zoology, Brigham Young University
  WIDB 574, Provo, UT 84602, USA
  _______________________________________________________________
  Wed May 23 16:49:15 2007

  Input format: Paup matrix file

  ** Log Likelihood scores **
                           +I           +G           +I+G
  JC     =  2458.8540   2458.8540   2454.2852   2443.3606
  F81    =  2440.0264   2440.0264   2434.6941   2424.4517
  K80    =  2400.7991   2400.7991   2396.5891   2385.6047
  HKY    =  2379.0457   2379.0457   2374.1394   2362.9192
  TrNef  =  2400.6252   2400.6252   2396.1169   2385.5432
  TrN    =  2379.0442   2379.0442   2374.0200   2363.2795
  K81    =  2398.1973   2398.1973   2393.5496   2382.9202
  K81uf  =  2377.6162   2377.6162   2372.7297   2362.2349
  TIMef  =  2398.0212   2398.0212   2393.4592   2382.8601
  TIM    =  2376.7197   2376.7197   2372.6138   2362.1086
  TVMef  =  2395.8040   2395.8040   2391.3481   2380.6255
  TVM    =  2375.0203   2375.0203   2369.3423   2358.6572
  SYM    =  2395.6624   2395.6624   2391.2957   2380.6013
  GTR    =  2374.8865   2374.8865   2369.2437   2358.5361

  ** Hierarchical Likelihood Ratio Tests (hLRTs) **

  Equal base frequencies
    Null model = JC                 -lnL0 = 2562.9832
    Alternative model = F81         -lnL1 = 2543.3635
    2(lnL1-lnL0) = 39.2393          df = 3
    P-value = <0.000001
  Ti=Tv
    Null model = F81                -lnL0 = 2543.3635
    Alternative model = HKY         -lnL1 = 2482.0591
    2(lnL1-lnL0) = 122.6089         df = 1
    P-value = <0.000001
  Equal Ti rates
    Null model = HKY                -lnL0 = 2482.0591
    Alternative model = TrN         -lnL1 = 2482.0227
    2(lnL1-lnL0) = 0.0728           df = 1
    P-value = 0.787369
  Equal Tv rates
    Null model = HKY                -lnL0 = 2482.0591
    Alternative model = K81uf       -lnL1 = 2480.2668
    2(lnL1-lnL0) = 3.5845           df = 1
    P-value = 0.058322
  Equal rates among sites
    Null model = HKY                -lnL0 = 2482.0591
    Alternative model = HKY+G       -lnL1 = 2374.1394
    2(lnL1-lnL0) = 215.8394         df = 1
    Using mixed chi-square distribution
    P-value = <0.000001
  No Invariable sites
    Null model = HKY+G              -lnL0 = 2374.1394
    Alternative model = HKY+I+G     -lnL1 = 2362.9192
    2(lnL1-lnL0) = 22.4404          df = 1
    Using mixed chi-square distribution
    P-value = 0.000001

ModelTest output: model selected by the hLRTs

  Model selected: HKY+I+G
    -lnL = 2362.9192

  Base frequencies:
    freqA = 0.3123
    freqC = 0.2263
    freqG = 0.1618
    freqT = 0.2996
  Substitution model:
    Ti/tv ratio = 2.0963
  Among-site rate variation
    Proportion of invariable sites (I) = 0.6051
    Variable sites (G)
    Gamma distribution shape parameter = 0.9352

  PAUP* Commands Block: If you want to implement the previous estimates as likelihod settings in PAUP*, attach the next block of commands after the data in your PAUP file:

  [!
  Likelihood settings from best-fit model (HKY+I+G) selected by hLRT in Modeltest Version 3.06
  ]
  BEGIN PAUP;
  Lset Base=(0.3123 0.2263 0.1618) Nst=2 TRatio=2.0963 Rates=gamma Shape=0.9352 Pinvar=0.6051;
  END;

ModelTest output: model selected by the AIC

  ** Akaike Information Criterion (AIC) **

  Model selected: TVM+I+G
    -lnL = 2358.6572
    AIC = 4735.3145

  Base frequencies:
    freqA = 0.3116
    freqC = 0.2168
    freqG = 0.1607
    freqT = 0.3109
  Substitution model:
    Rate matrix
    R(a) [A-C] = 1.8176
    R(b) [A-G] = 8.0533
    R(c) [A-T] = 1.6254
    R(d) [C-G] = 3.5609
    R(e) [C-T] = 8.0533
    R(f) [G-T] = 1.0000
  Among-site rate variation
    Proportion of invariable sites (I) = 0.6002
    Variable sites (G)
    Gamma distribution shape parameter = 0.9020

  PAUP* Commands Block: If you want to implement the previous estimates as likelihod settings in PAUP*, attach the next block of commands after the data in your PAUP file:

  [!
  Likelihood settings from best-fit model (TVM+I+G) selected by AIC in Modeltest Version 3.06
  ]
  BEGIN PAUP;
  Lset Base=(0.3116 0.2168 0.1607) Nst=6 Rmat=(1.8176 8.0533 1.6254 3.5609 8.0533) Rates=gamma Shape=0.9020 Pinvar=0.6002;
  END;
  _________________________________________________________________
  Time processing: 0.001 seconds

Tree reconstruction strategies

                          DATA
  METHOD                  distance matrix    discrete characters
  Clustering algorithm    UPGMA, NJ
  Optimality criterion    ME, FM             MP, ML, BA

Distance methods - aligned sequences are converted into a pairwise distance matrix → loss of information about the contributions of single sites and no inference on the ancestral character states.
Discrete methods - each nucleotide site is considered directly → they allow inference on the ancestral character states.
Clustering methods follow an algorithm (set of steps) to produce a tree (usually a single one) → short computational times, but the results often depend on the order in which sequences are added to the growing tree. Competing hypotheses cannot be tested.
Optimality methods use a specific criterion to assign a score to each possible tree. The ranking is a function of the relationship between tree and data.

Neighbor-Joining (NJ)
Saitou & Nei (1987).
The clustering algorithm starts from a star topology (completely unresolved tree) and determines the branches between the nearest pair of OTUs (neighbors) and the remaining OTUs through an iterative process. Each step is taken according to the choice that minimizes the sum of the lengths of all the branches of the tree. The pair of OTUs chosen at each step forms a “composite OTU” that is treated as a single entity afterwards.
Advantages:
-Very fast computations;
-Allows for different evolutionary rates along the branches;
-Usually returns reliable results.
Disadvantages:
-the calculation of a distance matrix causes a loss of information.
Software: CLUSTALX, PHYLIP, PAUP*, MEGA and others.

Tree reconstruction strategies
Neighbor-Joining (NJ)
[Figure: “Star decomposition” tree search on Seq_1 to Seq_6.]

Tree reconstruction strategies
Maximum Parsimony
Swofford & Berlocher (1987).
This method identifies the tree that needs the smallest number of substitutions (evolutionary changes) to explain the differences between the considered sequences. The branch length is proportional to the number of substitutions between the nodes connected by the branch itself.
“Parsimony informative sites” show at least two different character states occurring at least two times each. The minimum number of substitutions is then calculated for each possible unrooted tree. The MP tree is the one requiring the smallest number of changes.
Advantages:
-No loss of information.
Disadvantages:
-no explicit evolutionary model (all substitutions equally probable, equal base frequencies, no correction for multiple hits);
-it often returns a set of equally parsimonious trees.
Software: PHYLIP, PAUP*, MEGA and others.
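The counting that Maximum Parsimony performs can be sketched in a few lines of Python. This is a minimal illustration only: it uses Fitch's set-intersection rule, which the slides do not name, the function names `fitch_site` and `parsimony_length` are hypothetical, and the data are the four five-site sequences of the worked example that follows. It is not how PHYLIP, PAUP* or MEGA implement their searches.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, changes) for one site on a binary tree.

    `tree` is a nested tuple of sequence names; `states` maps name -> base.
    """
    if isinstance(tree, str):                  # leaf: its own base, no change
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    shared = left_set & right_set
    if shared:                                 # children agree: no extra change
        return shared, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_length(tree, alignment):
    """Sum the minimum number of changes over all sites of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total


# Sequences of the four-taxon worked example in this deck.
alignment = {
    "Seq_1": "GTACG",
    "Seq_2": "GTCGG",
    "Seq_3": "ACAGG",
    "Seq_4": "ACCGG",
}

# The three possible unrooted topologies for four sequences.
trees = {
    "((1,2),(3,4))": (("Seq_1", "Seq_2"), ("Seq_3", "Seq_4")),
    "((1,3),(2,4))": (("Seq_1", "Seq_3"), ("Seq_2", "Seq_4")),
    "((1,4),(2,3))": (("Seq_1", "Seq_4"), ("Seq_2", "Seq_3")),
}

lengths = {label: parsimony_length(t, alignment) for label, t in trees.items()}
best_tree = min(lengths, key=lengths.get)
```

With only four sequences all three topologies can be scored exhaustively; for larger datasets the number of trees grows too fast for this, which is why real programs rely on heuristic searches.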

Tree reconstruction strategies
Maximum Parsimony (MP)

Seq_1  GTACG
Seq_2  GTCGG
Seq_3  ACAGG
Seq_4  ACCGG

[Figure: site 1 mapped on the tree ((Seq_1,Seq_2),(Seq_3,Seq_4)) with two alternative assignments of ancestral states. Panel labels: "Site 1 - 5 changes" and "Site 1 - 1 change".]

Tree reconstruction strategies
Maximum Parsimony (MP)

[Figure: sites 2 to 5 mapped on the same tree. Panel labels: "Site 2 - 1 change", "Site 3 - 2 changes", "Site 4 - 1 change", "Site 5 - no changes".]

Tree             Sites:  1  2  3  4  5   total
((1,2),(3,4))            1  1  2  1  0   5
((1,3),(2,4))            2  2  1  1  0   6
((1,4),(2,3))            2  2  2  1  0   7

Tree reconstruction strategies
Maximum Likelihood (ML)
Felsenstein (1981).
Often considered the best approach to determine the most consistent tree topology. Formally, given a data set D (alignment) and the hypothesis H (tree), the likelihood is
L_D = Pr(D|H)
that is, the conditional probability of D given H.
The tree that scores the highest value of L represents the ML estimate of the evolutionary relationships between the considered OTUs. In other words, the ML tree is the one that best explains the examined dataset.
Advantages:
-It usually returns consistent results;
-It permits the statistical testing of evolutionary hypotheses (Likelihood Ratio Test).
Disadvantages:
-very long computational times (often 100 bootstrap replicates are used instead of 1000).
Software: PHYLIP, PHYML, PAUP* and others.

Tree reconstruction strategies
Bayesian approach (BA)
A recent variant of ML. While ML seeks the tree that maximizes the probability of observing the data given the tree and the model, BA searches for the set of trees that have the maximum probability of being observed given the data and the model.
BA produces a set of trees with approximately equal likelihoods.
Advantages:
-Results are easy to interpret: the frequency of a given clade within the set of trees is taken as the probability of that clade, so there is no need for bootstrapping.
Disadvantages:
-Depending on the settings, it may require long computational times (not as long as for ML).
Software: MrBayes and others.

N. of sequences   Neighbor-joining   Maximum Parsimony   Maximum Likelihood   Bayesian
54                0.20 sec           0.72 sec            7.06 hr              3.8 hr
40                0.18 sec           0.32 sec            1.1 hr               2.4 hr
30                0.22 sec           0.18 sec            17.3 min             1.7 hr
20                0.22 sec           0.10 sec            1.8 min              1.05 hr
10                0.20 sec           0.05 sec            4.1 sec              25 min

Computational times required for analysis by the four different methods. Source: Hall (2001). With present-day processors all of these times are proportionally shorter.

Tree reconstruction strategies
Calculation of divergence times
If the assumption of a molecular clock (genetic divergence proportional to evolutionary time) is correct, the reconstructed tree topology can be used to estimate the divergence times between all the OTUs.
The divergence time between at least two OTUs must be known from non-genetic evidence (e.g. paleontology). This time value is then used to calibrate the molecular clock for that given tree.

Tree reconstruction strategies
Calculation of divergence times
A Likelihood Ratio Test can be performed on the ML values calculated with (L_clock) and without (L_noclock) the assumption of a molecular clock:
Δ = 2 (lnL_noclock - lnL_clock)
Δ is χ² distributed with d.f. = n-2, where n is the number of sequences.
Software: PAUP* and others.
The Time to the Most Recent Common Ancestor (TMRCA) of a set of sequences can also be calculated with a Bayesian approach.
Software: BEAST, BEAUTI and TRACER.
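The likelihood ratio test for the molecular clock described above can be sketched as follows. This is an illustration under stated assumptions, not the routine used by PAUP*: the function names `chi2_sf` and `clock_lrt` are hypothetical, the χ² tail probability is computed with a plain series expansion so that only the Python standard library is needed, and the two log-likelihood values in the usage lines are invented for the example.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) of a chi-square variable with `df` d.f.

    Uses the series for the regularized lower incomplete gamma function.
    """
    if x <= 0:
        return 1.0
    a, half_x = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    k = 1
    while term > total * 1e-15:        # stop when new terms no longer matter
        term *= half_x / (a + k)
        total += term
        k += 1
    lower = total * math.exp(a * math.log(half_x) - half_x - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def clock_lrt(lnL_clock, lnL_noclock, n_sequences):
    """Return (delta, d.f., P-value) for the molecular clock test."""
    delta = 2.0 * (lnL_noclock - lnL_clock)
    df = n_sequences - 2
    return delta, df, chi2_sf(delta, df)


# Hypothetical log-likelihoods for a 10-sequence dataset (not from the slides).
delta, df, p_value = clock_lrt(lnL_clock=-1010.0, lnL_noclock=-1000.0,
                               n_sequences=10)
clock_rejected = p_value < 0.05
```

A small P-value means that the tree without the clock constraint fits significantly better, so the clock assumption should not be used to date the nodes of that tree.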

Tree reconstruction strategies
Bootstrap
Felsenstein (1985).
The non-parametric bootstrap is used to assess the robustness of tree reconstructions. It estimates sampling error by resampling from the dataset instead of resampling from the population.
This approach can be applied to all the phylogenetic methods, with the exception of BA.
How does it work?
Data: n aligned sequences of length N (n x N matrix).
Objective: estimate confidence in particular features of the obtained tree (robustness of nodes).

Tree reconstruction strategies
Bootstrap
Felsenstein (1985).
Method:
Step 1 - create a large number of pseudo-datasets (100 or 1000) by re-sampling with replacement the columns of the original data matrix. In each of the bootstrapped replicates, some sites may occur more than once, while others are never sampled.

Original dataset (positions 1-60)
Seq_1   cccctaatatgtacaataatgaatgttgtaaattagtgttataacacatctatgtataat
Seq_2   CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_3   CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_4   CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_5   CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_6   CCCCTAATAGGTACAATAATTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_7   CCCCTAATTTGTACAATAATGAATGTTGTAAATTAATGTTATAACACATCTATGTATAAT
Seq_8   CCCCTAATATGTCCAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_9   CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_10  CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT

Bootstrap pseudoreplicate (positions 1-60)
Seq_1   CTTTGTTAAAAATATTGAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA
Seq_2   CTTTGTTAAAAATATTGAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA
Seq_3   CTTTGTTAAAAATATTGAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA
Seq_4   CTTTGTTAAAAATATTGAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA
Seq_5   CTTGGTTAAAAATACCTAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA
Seq_6   CTTGGTTAAAAATATTTAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA
Seq_7   CTTTGTTAAAAATATTGAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA
Seq_8   CTTTGTTCCAAATATTGAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA
Seq_9   CTTTGTTAAAAATATTGAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA
Seq_10  CTTTGTTAAAAATATTGAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA

Step 2 - build a tree by applying the method of choice to each pseudo-dataset → disadvantage: it drastically increases the time required for computations.

Tree reconstruction strategies
Bootstrap
Felsenstein (1985).
Step 3 - evaluate the bootstrap support of the nodes by calculating the proportion of replicates in which the feature is present → “consensus tree”.

[Figure: bootstrap consensus tree of Seq_1 to Seq_10 with the support value (%) shown at each node.]

The results are % values that are usually interpreted following a “rule of thumb”:
- value<50% - weakly supported nodes, unlikely to be correct
- 50%<value<70% - nodes to be interpreted with caution
- 70%<value - strongly supported nodes, likely to be correct.
Simulations have shown that bootstrap values greater than 70% correspond to a probability greater than 95% that the node is correct.
In BA trees, instead, only the nodes with posterior probability (PP) values of at least 95% are considered strongly supported.
The vast majority of the software packages include the bootstrap option.

Tree reconstruction strategies
Jackknife
In the jackknife procedure the resampling occurs without replacement. This is usually done by randomly deleting half of the characters in each replicate → subreplicates are smaller than the original dataset → the statistical properties of the samples may change.
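The two resampling schemes described above can be sketched as follows. This is a minimal illustration, not the code used by any of the packages listed in this deck: the function names `bootstrap_columns` and `jackknife_columns` and the `keep_fraction` parameter are hypothetical, and the toy alignment reuses the four five-site sequences of the Maximum Parsimony example.

```python
import random


def bootstrap_columns(alignment, rng):
    """Resample alignment columns WITH replacement; length stays the same."""
    n_sites = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in picks)
            for name, seq in alignment.items()}


def jackknife_columns(alignment, rng, keep_fraction=0.5):
    """Keep a random subset of columns WITHOUT replacement, in original order."""
    n_sites = len(next(iter(alignment.values())))
    n_keep = int(n_sites * keep_fraction)
    picks = sorted(rng.sample(range(n_sites), n_keep))
    return {name: "".join(seq[i] for i in picks)
            for name, seq in alignment.items()}


alignment = {
    "Seq_1": "GTACG",
    "Seq_2": "GTCGG",
    "Seq_3": "ACAGG",
    "Seq_4": "ACCGG",
}

rng = random.Random(2008)          # fixed seed so the run is repeatable
pseudoreplicate = bootstrap_columns(alignment, rng)
subreplicate = jackknife_columns(alignment, rng)
```

The same column indices are applied to every sequence, so each resampled site keeps the character states of all OTUs together. A tree is then built from each replicate with the method of choice.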
Original dataset (positions 1-60)
Seq_1   cccctaatatgtacaataatgaatgttgtaaattagtgttataacacatctatgtataat
Seq_2   CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_3   CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_4   CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_5   CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_6   CCCCTAATAGGTACAATAATTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_7   CCCCTAATTTGTACAATAATGAATGTTGTAAATTAATGTTATAACACATCTATGTATAAT
Seq_8   CCCCTAATATGTCCAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_9   CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_10  CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT

Jackknife subreplicate (positions 1-30)
Seq_1   CCTATAGCATTAATTAATTGTTTACATTAA
Seq_2   CCTATAGCATTAATTAATTGTTTACATTAA
Seq_3   CCTATAGCATTAATTAATTGTTTACATTAA
Seq_4   CCTATAGCATTAATTAATTGTTTACATTAA
Seq_5   CCTATAGCATCAATTAATTGTTTACATTAA
Seq_6   CCTATAGCATTAATTAATTGTTTACATTAA
Seq_7   CCTATTGCATTAATTAATTATTTACATTAA
Seq_8   CCTATAGCATTAATTAATTGTTTACATTAA
Seq_9   CCTATAGCATTAATTAATTGTTTACATTAA
Seq_10  CCTATAGCATTAATTAATTGTTTACATTAA

Software
• ADMIX ver. 1.0 Dupanloup & Bertorelle (2001). http://web.unife.it/progetti/genetica/Giorgio/giorgio_soft.html FREE!!
• ADMIX ver. 2.0 http://cmpg.unibe.ch/software/admix/ FREE!!
• ARLEQUIN ver. 3.1 Excoffier et al. (2005). http://cmpg.unibe.ch/software/arlequin3/ FREE!!
• MEGA - Molecular Evolutionary Genetics Analysis ver. 4 Tamura et al. (2007). http://www.megasoftware.net/ FREE!!
• PAUP* - Phylogenetic Analysis Using Parsimony* ver. 4.0β Swofford (1998). http://paup.csit.fsu.edu/
• PHYLIP ver. 3.68 Felsenstein (2002). http://evolution.genetics.washington.edu/phylip.html FREE!!
• PHYML ver. 3.0 Guindon & Gascuel (2003). http://atgc.lirmm.fr/phyml/ FREE!!
• BEAST ver. 1.4.8… Drummond & Rambaut (2007). http://beast.bio.ed.ac.uk/ FREE!!
• …and BEAUTI ver. 1.4 Drummond & Rambaut (2007). http://beast.bio.ed.ac.uk/BEAUti FREE!!
• MrBayes ver. 3.1 Huelsenbeck & Ronquist (2001). http://mrbayes.csit.fsu.edu/ FREE!!
• TRACER ver. 1.4 Rambaut & Drummond (2007). http://tree.bio.ed.ac.uk/software/tracer/ FREE!!
• TREEVIEW ver. 1.6.6 Page (1996). http://taxonomy.zoology.gla.ac.uk/rod/treeview.html FREE!!
A miscellany of phylogeny programs and tools is available here: http://evolution.genetics.washington.edu/phylip/software.html
The BioPortal of the University of Oslo allows several applications to be run through a web server: http://www.bioportal.uio.no//

THANK YOU!!

References:
• Akaike, H. 1974. A new look at the statistical model identification. IEEE Transactions on Automatic Control 19: 716-723.
• Drummond, A.J. and Rambaut, A. 2007. BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evolutionary Biology 7: 214.
• Dupanloup, I. and Bertorelle, G. 2001. Inferring admixture proportions from molecular data: extension to any number of parental populations. Mol. Biol. Evol. 18: 672-675.
• Excoffier, L., Laval, G. and Schneider, S. 2005. Arlequin ver. 3.0: An integrated software package for population genetics data analysis. Evolutionary Bioinformatics Online 1: 47-50.
• Excoffier, L., Smouse, P. and Quattro, J. 1992. Analysis of molecular variance inferred from metric distances among DNA haplotypes: Application to human mitochondrial DNA restriction data. Genetics 131: 479-491.
• Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a Maximum Likelihood approach. J. Mol. Evol. 17: 368-376.
• Felsenstein, J. 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39: 783-791.
• Finlay, E.K., Gaillard, C., Vahidi, S.M.F., Mirhoseini, S.Z., Jianlin, H., Qi, X.B., El-Barody, M.A.A., Baird, J.F., Healy, B.C. and Bradley, D.G. 2007. Bayesian inference of population expansions in domestic bovines. Biology Letters 3: 449-452.
• Galtier, N., Gouy, M. and Gautier, C. 1996. SeaView and Phylo_win, two graphic tools for sequence alignment and molecular phylogeny. Comput. Applic. Biosci. 12: 543-548.
• Guindon, S. and Gascuel, O. 2003. A simple, fast, and accurate algorithm to estimate large phylogenies by Maximum Likelihood. Syst. Biol. 52(5): 696-704.
• Hall, B.G. 2001. Phylogenetic trees made easy. A how-to manual for molecular biologists. Sinauer Associates Inc., Publishers, Sunderland, Massachusetts, USA.
• Hasegawa, M., Kishino, H. and Yano, T. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22: 160-174.
• Higgins, D.G., Bleasby, A.J. and Fuchs, R. 1992. CLUSTAL V: improved software for multiple sequence alignment. CABIOS 8: 189-191.
• Higgins, D.G. and Sharp, P.M. 1989. Fast and sensitive multiple sequence alignments on a microcomputer. CABIOS 5: 151-153.
• Higgins, D.G. and Sharp, P.M. 1988. CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73: 237-244.
• Hudson, R.R. 1990. Gene genealogies and the coalescent process, pp. 1-44 in Oxford Surveys in Evolutionary Biology, edited by D. Futuyma and J. Antonovics. Oxford University Press, New York.
• Jukes, T. and Cantor, C. 1969. Evolution of protein molecules. In: Mammalian Protein Metabolism, edited by Munro, H.N., New York: Academic Press, p. 21-132.
• Kimura, M. 1980. A simple method for estimating evolutionary rate of base substitution through comparative studies of nucleotide sequences. J. Mol. Evol. 16: 111-120.
• Lanave, C., Preparata, G., Saccone, C. and Serio, G. 1984. A new method for calculating evolutionary substitution rates. J. Mol. Evol. 20: 86-93.
• Nei, M. 1987. Molecular Evolutionary Genetics. Columbia University Press, New York, NY, USA.
• Page, R.D.M. 1996. TREEVIEW: An application to display phylogenetic trees on personal computers. Computer Applications in the Biosciences 12: 357-358.
• Pellecchia, M., Negrini, R., Colli, L., Patrini, M., Milanesi, E., Achilli, A., Bertorelle, G., Cavalli-Sforza, L.L., Piazza, A., Torroni, A. and Ajmone-Marsan, P. 2007. The mystery of Etruscan origins: novel clues from Bos taurus mitochondrial DNA. Proc. R. Soc. B 274: 1175-1179.
• Posada, D. 2006. ModelTest Server: a web-based tool for the statistical selection of models of nucleotide substitution online. Nucleic Acids Research 34: W700-W703.
• Posada, D. and Buckley, T.R. 2004. Model selection and model averaging in phylogenetics: advantages of the AIC and Bayesian approaches over likelihood ratio tests. Systematic Biology 53: 793-808.
• Posada, D. and Crandall, K.A. 1998. Modeltest: testing the model of DNA substitution. Bioinformatics 14(9): 817-818.
• Rambaut, A. and Drummond, A.J. 2007. Tracer v1.4. http://tree.bio.ed.ac.uk/software/tracer/
• Rogers, A.R. 2004. Lecture Notes on Gene Genealogies. www.anthro.utah.edu/~rogers/bio5410/Lectures/a_alu.pdf
• Rogers, A.R. and Harpending, H. 1992. Population growth makes waves in the distribution of pairwise genetic differences. Mol. Biol. Evol. 9: 552-569.
• Rousset, F. 2000. Inferences from spatial population genetics, in Handbook of Statistical Genetics, D. Balding, M. Bishop and C. Cannings (eds.). Wiley & Sons, Ltd.
• Saitou, N. and Nei, M. 1987. The neighbor-joining method: a new method for reconstructing the phylogenetic tree. Mol. Biol. Evol. 4: 406-425.
• Schwarz, G. 1978. Estimating the dimension of a model. The Annals of Statistics 6: 461-464.
• Slatkin, M. 1991. Inbreeding coefficients and coalescence times. Genet. Res. Camb. 58: 167-175.
• Swofford, D.L. 1998. PAUP*. Phylogenetic Analysis Using Parsimony (*and other methods). Version 4. Sinauer Associates, Sunderland, Massachusetts.
• Swofford, D.L. and Berlocher, S.H. 1987. Inferring evolutionary trees from gene frequency data under the principle of maximum parsimony. Systematic Zoology 36: 293-325.
• Tajima, F. 1983. Evolutionary relationship of DNA sequences in finite populations. Genetics 105: 437-460.
• Tajima, F. 1993. Measurement of DNA polymorphism. In: Mechanisms of Molecular Evolution. Introduction to Molecular Paleopopulation Biology, edited by Takahata, N. and Clark, A.G., Tokyo, Sunderland, MA: Japan Scientific Societies Press, Sinauer Associates, Inc., p. 37-59.
• Tajima, F. and Nei, M. 1984. Estimation of evolutionary distance between nucleotide sequences. Mol. Biol. Evol. 1: 269-285.
• Tamura, K. 1992. Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G+C content biases. Mol. Biol. Evol. 9: 678-687.
• Tamura, K., Dudley, J., Nei, M. and Kumar, S. 2007. MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Molecular Biology and Evolution 24: 1596-1599.
• Tamura, K. and Nei, M. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 10: 512-526.
• Thompson, J.D., Gibson, T.J., Plewniak, F., Jeanmougin, F. and Higgins, D.G. 1997. The ClustalX windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Research 25: 4876-4882.
• Thompson, J.D., Higgins, D.G. and Gibson, T.J. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Research 22: 4673-4680.
• Troy, C.S., MacHugh, D.E., Bailey, J.F., Magee, D.A., Loftus, R.T., Cunningham, P., Chamberlain, A.T., Sykes, B.C. and Bradley, D.G. 2001. Genetic evidence for Near-Eastern origins of European cattle. Nature 410: 1088-1091.
• Weir, B.S. and Cockerham, C.C. 1984. Estimating F-statistics for the analysis of population structure. Evolution 38: 1358-1370.
• Wright, S. 1951. The genetical structure of populations. Ann. Eugen. 15: 323-354.
• Wright, S. 1965. The interpretation of population structure by F-statistics with special regard to systems of mating. Evolution 19: 395-420.