Estimating pairwise genetic distance and nucleotide substitution models (models of sequence evolution)
The Phylogenetic Handbook: Chapter 4

Distance methods
Sequence alignment (Clustal X) → pairwise genetic distance (distance measures) → tree (UPGMA / NJ)

Alignment:
Gorilla  GGTCCTAGGCC
Human    GGTCACATGTC
Chimp    GGTCATATCTC
Orang    GATACCAGCAC

Pairwise distance matrix:
          Gorilla  Chimp  Human  Orang
Gorilla      0       5      4      5
Chimp                0      2      6
Human                       0      6
Orang                              0

Genetic distance d (p distance)
S1  AACTGCATGGTAACAGGTTC  (20 sites)
S2  AAGTGCATGGTAACAGATTC  (20 sites)
S3  AAGTGGATGGTAATAGATTC  (20 sites)
S4  AAGTGGACGGTTATAGATTC  (20 sites)

Number of substitutions between a pair of sequences (above the diagonal) and p distance (below the diagonal):
      S1     S2     S3     S4
S1    0      2      4      6
S2    0.1    0      2      4
S3    0.2    0.1    0      2
S4    0.3    0.2    0.1    0

p distance with missing data
S1  AACTGCATGGTAACAGGTTC  (total length of the sequence: 20)
S2  ????GCATGGTAACAGATTC  (16)
S3  AAGTGGATGGTAA???????  (13)
S4  AAGTGGA?????ATAGATTC  (15)

Number of substitutions (above the diagonal) and p distance (below the diagonal):
      S1     S2     S3     S4
S1    0      1      2      4
S2    0.06   0      1      2
S3    0.15   0.11   0      0
S4    0.27   0.18   0      0

Only sites with data in both sequences are compared. For example, S2 and S3 share 9 sites and differ at 1 of them, so p = 1/9 = 0.11.

Genetic distance (evolutionary distances)
• They measure the total number of substitutions that occurred on both lineages since divergence from the last common ancestor.
• p distance: d = number of substitutions between a pair of sequences / sequence length
• Expressed in substitutions / site
• Can be used to calculate the average rate of substitution.
• Rate = d / (2t), where t is the time since divergence (each lineage accounts for d/2)

Example (10 sites):
ancestor     ATGTTGCATT
sequence 1   ATGTTCCATT   (1 substitution)
sequence 2   ATCTTACATT   (2 substitutions)
Pairwise genetic difference = 2
Pairwise genetic distance d = 0.2 (also called p distance)

When the p distance is large, it underestimates the actual divergence.

Correction for multiple hits
Ancestral species   ATGTTGCATT
Species 1           ATGTTCCATT   (1 substitution)
Species 2           ATCTTACATT   (2 substitutions)

              Observed   Actual
Difference       2          3
p distance       0.2        0.3

[Figure: panels a) to e), each showing a site that is G in the ancestral species and its state in the two daughter species, with the actual number of substitutions (Act. d) and the observed number of differences (Obs. d) for each case.]

The problem of hidden or multiple changes
• d (true genetic distance) ≥ fraction of observed differences (p)
• d = p + hidden changes
• Through hypotheses about the nature of the base substitution process (models), it becomes possible to estimate d from the observed differences between sequences.

[Figure: p distance plotted against time, showing actual d, observed d and the correction between them; a second panel compares a neutral marker with a protein coding gene.]

Models of sequence evolution / nucleotide substitution
Jukes-Cantor model (1969)
Assumes:
• All substitutions are equally likely.
• No among site rate variation (each site is equally likely to undergo substitution).
• Base frequencies are in equilibrium.
• Equilibrium base frequencies: A = T = C = G = 1/4 = 0.25 (all four bases are present in equal proportion in the sequence).
AAAAAAAA → ?
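The p distance calculation from the slides above can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the function name `p_distance` and the set of missing-data symbols are assumptions made for illustration, and sites with missing data are handled by pairwise deletion, which is what the slide's S2 vs S3 value (1/9 = 0.11) implies.

```python
from itertools import combinations

# Assumed symbols for missing data; the slides themselves only use "?".
MISSING = {"?", "-", "N"}


def p_distance(seq_a, seq_b):
    """Return (differences, sites_compared, p) for two aligned sequences.

    Sites where either sequence has missing data are skipped
    (pairwise deletion).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(a, b) for a, b in zip(seq_a, seq_b)
                if a not in MISSING and b not in MISSING]
    if not compared:
        raise ValueError("no site has data in both sequences")
    differences = sum(a != b for a, b in compared)
    return differences, len(compared), differences / len(compared)


# Alignments copied from the slides.
complete = {
    "S1": "AACTGCATGGTAACAGGTTC",
    "S2": "AAGTGCATGGTAACAGATTC",
    "S3": "AAGTGGATGGTAATAGATTC",
    "S4": "AAGTGGACGGTTATAGATTC",
}
with_missing = {
    "S1": "AACTGCATGGTAACAGGTTC",
    "S2": "????GCATGGTAACAGATTC",
    "S3": "AAGTGGATGGTAA???????",
    "S4": "AAGTGGA?????ATAGATTC",
}

for label, alignment in (("complete", complete), ("missing data", with_missing)):
    print(label)
    for name_a, name_b in combinations(alignment, 2):
        diffs, sites, p = p_distance(alignment[name_a], alignment[name_b])
        print(f"  {name_a}-{name_b}: {diffs} differences / {sites} sites = {p:.2f}")
```

The same function applied to species 1 and species 2 from the multiple hits example counts 2 differences over 10 sites (p = 0.2), although 3 substitutions actually occurred; that gap is what the substitution models are meant to correct.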
Jukes-Cantor model (1969)
AAAAAAAA → TAGCACTG (given enough time, all four bases are present in equal proportion)
• $P_{AG} = P_{AT} = P_{AC}$
• Similarly: $P_{GA} = P_{GT} = P_{GC}$, $P_{TA} = P_{TG} = P_{TC}$, $P_{CA} = P_{CT} = P_{CG}$
• It is also a reversible model: $P_{AG} = P_{GA}$
• $P_{ij}(t)$ = probability of change from state i to state j at time t
• $P_{ii}(t)$ = probability of no change at time t
[Figure: the four bases A, G, C and T, each labelled with $P_{ii}$ and connected to the other three bases by arrows labelled $P_{ij}$.]

The probability of no mutation at a site over increasing time intervals follows a (negative) exponential distribution.
• Probability of no mutation = $e^{-v}$
• Probability of mutation = $1 - e^{-v}$
• where $v = \mu t$, $\mu$ = mutation rate = $10^{-4}$, t = time (in million years or in generations)
[Figure: probability of no mutation, $e^{-v}$, plotted against generations (0 to 35000).]

Jukes and Cantor (1969): probability of no change, $P_{ii}(t)$
Possible bases at a site: A, T, G, C. There are two ways to end with the same base:
• The site does not undergo mutation: probability $e^{-v}$
• The site undergoes mutation and the same base is picked again: $(1 - e^{-v}) \times \frac{1}{4}$
$P_{ii}(t) = e^{-v} + \frac{1}{4}(1 - e^{-v}) = \frac{1}{4} + \frac{3}{4}e^{-v}$

Jukes and Cantor (1969): probability of change, $P_{ij}(t)$
• Probability of mutation at a given site is $1 - e^{-v}$
• Probability of picking a particular base is $\frac{1}{4}$
$P_{ij}(t) = \frac{1}{4}(1 - e^{-v})$

From probabilities to distance
• Two sequences are separated by two branches of length v, so the expected proportion of differing sites is $p = \frac{3}{4}(1 - e^{-2v})$
• Solving for v: $v = -\frac{1}{2}\ln\left(1 - \frac{4}{3}p\right)$
• Number of mutations along a branch = v
• Actual distance d = mutations along two branches × fraction of mutations that change the base: $d = 2v \times \frac{3}{4}$
• $d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$

Transitions and transversions
• Transitions, Ts (α): within purines (A ↔ G) or within pyrimidines (C ↔ T)
• Transversions, Tv (β): between a purine and a pyrimidine
• Nuclear DNA: Ts = 2-4 × Tv
• mtDNA: Ts = 8-10 × Tv
• Base frequencies: A = T = G = C = 0.25

Kimura's two parameter distance model
• Hypotheses of the model: substitutions occur according to two probabilities, one for transitions (α) and one for transversions (β).
• Transitions: G ↔ A or C ↔ T
• Transversions: other changes
• p: proportion of sites that differ by a transition; q: proportion of sites that differ by a transversion
Assumptions:
• Equal base frequencies (all four bases are present in equal proportion in the sequence), f: A = T = G = C
• No among site rate variation
$d = -\frac{1}{2}\ln\left[(1 - 2p - q)\sqrt{1 - 2q}\right]$
Kimura (1980) J. Mol. Evol. 16:111

Equal base frequencies (equilibrium frequency assumption)
All four bases are present in equal proportion in the sequence, f: A = T = G = C

Cytochrome b, 1140 bp (% base composition)
        H. agilis  H. sapiens  M. mulatta  P. pygmaeus  G. gorilla  P. troglodytes  Average  Expected
T(U)      24.3       25.2        25.2        25.2         24.8         24.9          24.9      25
C         33.8       34.3        34.0        34.1         34.6         34.4          34.2      25
A         28.3       28.7        29.1        29.2         28.9         29.0          28.9      25
G         13.6       11.9        11.7        11.5         11.7         11.7          12.0      25

Variation in GC content (25-75%)
If some bases are more common than others, then we might expect some substitutions to be more common than others.

Models and their corrections
Model                      Corrections
Jukes and Cantor 1969      multiple hits, but all substitutions equally likely and equal base frequencies
Kimura 1980                multiple hits, Ts vs Tv, but equal base frequencies
Felsenstein 1981           multiple hits, unequal base frequencies, but all substitutions equally likely
Hasegawa et al. 1985       multiple hits, unequal base frequencies, Ts vs Tv
General reversible         multiple hits, unequal base frequencies, all six substitutions have different rates
Add-on: Gamma correction   among site rate variation

GTR model: substitution probability matrix for V
• Each substitution class has a different rate.
• 6 parameter values have to be estimated.
      A    G    C    T
A     x    a    b    d
G     a    x    c    e
C     b    c    x    f
T     d    e    f    x
$x_A = 1 - (a + b + d)$

Among site rate variation / rate heterogeneity
• The substitution rate can vary substantially among positions in a sequence.
• Protein coding gene: third position > first position > second position
• Introns > exons
• Gamma distribution, shape parameter alpha
• Alpha > 1: bell shaped curve, weak rate heterogeneity
• Alpha < 1: L shaped curve, strong rate heterogeneity

The genetic code
TTT Phe    TCT Ser    TAT Tyr     TGT Cys
TTC Phe    TCC Ser    TAC Tyr     TGC Cys
TTA Leu    TCA Ser    TAA stop    TGA stop
TTG Leu    TCG Ser    TAG stop    TGG Trp
CTT Leu    CCT Pro    CAT His     CGT Arg
CTC Leu    CCC Pro    CAC His     CGC Arg
CTA Leu    CCA Pro    CAA Gln     CGA Arg
CTG Leu    CCG Pro    CAG Gln     CGG Arg
ATT Ile    ACT Thr    AAT Asn     AGT Ser
ATC Ile    ACC Thr    AAC Asn     AGC Ser
ATA Ile    ACA Thr    AAA Lys     AGA Arg
ATG Met    ACG Thr    AAG Lys     AGG Arg
GTT Val    GCT Ala    GAT Asp     GGT Gly
GTC Val    GCC Ala    GAC Asp     GGC Gly
GTA Val    GCA Ala    GAA Glu     GGA Gly
GTG Val    GCG Ala    GAG Glu     GGG Gly

Gamma distribution (shape parameter α)
http://artedi.ebc.uu.se/course/BioInfo-10p-2001/Phylogeny/Phylogeny-Criteria/Phylogeny-Criteria.html

Other assumptions:
1) All nucleotide sites change independently of each other.
2) The substitution rate is constant over time and in different lineages (molecular clock assumption).
3) Base composition across taxa is at equilibrium.

Which model to choose?
• More complex models usually fit the data better.
• But the analysis becomes computationally intensive (more time).
• More parameters need to be estimated from the same amount of data, so each estimate carries more error.
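The Jukes-Cantor and Kimura two parameter corrections described in the slides above can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the sequences are assumed to be aligned and free of missing data, and the Jukes-Cantor probabilities use the slides' parameterisation with v = µt.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def jc69_probabilities(v):
    """Return (P_ii, P_ij) for v = mu * t, following the slides' formulas."""
    decay = math.exp(-v)
    return 0.25 + 0.75 * decay, 0.25 * (1.0 - decay)


def jc69_distance(p):
    """Jukes-Cantor distance d = -3/4 ln(1 - 4/3 p) for an observed p distance."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the correction to exist")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def ts_tv_proportions(seq_a, seq_b):
    """Return (p, q): proportions of sites differing by a transition / transversion.

    Assumes aligned sequences containing only A, C, G, T (no missing data).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for a, b in zip(seq_a, seq_b):
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1      # A <-> G or C <-> T
        else:
            transversions += 1    # purine <-> pyrimidine
    n = len(seq_a)
    return transitions / n, transversions / n


def k80_distance(p, q):
    """Kimura two parameter distance d = -1/2 ln[(1 - 2p - q) sqrt(1 - 2q)]."""
    a, b = 1.0 - 2.0 * p - q, 1.0 - 2.0 * q
    if a <= 0.0 or b <= 0.0:
        raise ValueError("sequences too divergent for the K80 correction")
    return -0.5 * math.log(a * math.sqrt(b))


# S1 and S4 from the p distance example in the slides (p = 6/20 = 0.3).
S1 = "AACTGCATGGTAACAGGTTC"
S4 = "AAGTGGACGGTTATAGATTC"
p, q = ts_tv_proportions(S1, S4)
print("transitions p =", p, "transversions q =", q)
print("p distance     =", p + q)
print("JC69 distance  =", round(jc69_distance(p + q), 3))
print("K80 distance   =", round(k80_distance(p, q), 3))
```

Both corrected distances come out larger than the raw p distance of 0.3, which is the expected direction of a multiple hits correction. When transversional differences are exactly twice the transitional ones (q = 2p), the Kimura formula reduces to the Jukes-Cantor formula.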
For mtDNA genes, HKY85 is often used.
Model selection: likelihood ratio test (Modeltest), AIC (jModelTest)
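The two selection criteria named above rest on simple arithmetic: AIC = 2k - 2 lnL, and the likelihood ratio statistic for two nested models is 2(lnL_complex - lnL_simple). The sketch below only illustrates that arithmetic. The log-likelihood values are invented for the example, the parameter counts cover the substitution model only (an assumption; real programs also count branch lengths), and this is not how Modeltest or jModelTest are implemented.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: lower values indicate a better trade-off."""
    return 2 * n_params - 2 * log_likelihood


def lrt_statistic(lnl_simple, lnl_complex):
    """Likelihood ratio statistic for two nested models.

    It is compared against a chi-square distribution whose degrees of
    freedom equal the difference in the number of free parameters.
    """
    return 2 * (lnl_complex - lnl_simple)


# HYPOTHETICAL log-likelihoods and free-parameter counts, for illustration only.
candidates = {
    "JC69": (-2100.0, 0),
    "K80": (-2050.0, 1),
    "HKY85": (-2010.0, 4),
    "GTR": (-2008.5, 8),
}

scores = {name: aic(lnl, k) for name, (lnl, k) in candidates.items()}
best = min(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name:6s} AIC = {score:.1f}")
print("lowest AIC:", best)

# HKY85 is nested within GTR, so the two can be compared with an LRT.
statistic = lrt_statistic(candidates["HKY85"][0], candidates["GTR"][0])
df = candidates["GTR"][1] - candidates["HKY85"][1]
print(f"LRT HKY85 vs GTR: statistic = {statistic:.1f}, df = {df}")
```

With these made-up numbers the more complex GTR model has a slightly higher likelihood, but the gain does not offset its extra parameters, so HKY85 has the lower AIC. This mirrors the trade-off between fit and estimation error described in the slide.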