Lecture 2.1 Bayesian phylogenetic analysis

Phylogeneticmethods
Lecture2.1
Algorithmbased
Optimality
criterion
Distance-based
methods
Maximum
parsimony
Distance-based
methods
Maximum
likelihood
BayesianPhylogeneticAnalysis
Noexplicit
substitution
model
SimonHo
A
G
C
T
Other
Bayesian
inference
2
Bayesianphylogeneticanalysis
•
Bayesianphylogenetic analysiswasdeveloped inthemid1990s
•
Nowoneofthemostwidely usedmethods
Bayesianphylogeneticanalysis
Maximum likelihood
Given
brownbear
cavebear
blackbear
giantpanda
Probabilityof?
A
G
brownbear
cavebear
+
C
T
blackbear
giantpanda
CGTTAGTACACT
CGATAGTTCACT
CGTTAGTTTACC
CATTGGTTTACT
Bayesianinference
Given
Probabilityof?
brownbear
cavebear
blackbear
giantpanda
3
A
G
brownbear
cavebear
+
C
T
blackbear
giantpanda
CGTTAGTACACT
CGATAGTTCACT
CGTTAGTTTACC
CATTGGTTTACT
4
TheBayesianparadigm
•
Contrast withfrequentist statistics(likelihood)
•
Parameters havedistributions
•
Simpleexample
brownbear
cavebear
Before thedataareobserved, eachparameter hasaprior
distribution
•
Thelikelihood ofthedataiscomputed
•
Theprior distribution iscombined withthelikelihood toyield
theposteriordistribution
blackbear
giantpanda
5
Simpleexample
brownbear
cavebear
blackbear
giantpanda
AGT
AGA
CGA
CAA
6
Simpleexample
AGTAAG
AGAAAG
CGAAAG
CAAAGG
brownbear
cavebear
blackbear
giantpanda
7
AGTAAGTACACA
AGAAAGATCACA
CGAAAGATTACC
CAAAGGATTACA
8
Bayesianinference
Prior
Specifiedbyuser,
independent ofdata
Pr(θ |D)=
Posterior
Thisiswhatwe
wanttoestimate
Priors
Likelihood
Calculatedfromdata
•
Priorsarechosenintheformofprobability distributions
•
Reflect ourpriorexpectations (anduncertainty) aboutvaluesof
parameters (without knowledge ofthedata)
Pr(θ)Pr(D |θ)
Pr(D)
•
Pastobservations
•
Personalbeliefs
•
Useofabiologicalmodel
normalisingconstant
marginallikelihoodofthedata
modellikelihood
9
10
Priors
Treeprior:Amongspecies
1. Useaflatpriorfortreetopology (MrBayes)
•
Alltreeshaveequalprobability
•
Alsoneedapriorforbranchlengthsornodetimes
•
•
2. Useabiologicalmodeltogenerate prior distribution
(BEAST andMrBayes)
•
Amongspecies:speciationmodel
•
Withinspecies:coalescentmodel
Treeshapedescribed bya
stochastic branching process
Yuleprocess
brownbear
cavebear
blackbear
giantpanda
•
Lineagessplitataconstantrate
brownbear
•
Simulatesspeciationprocess
cavebear
blackbear
giantpanda
brownbear
cavebear
blackbear
giantpanda
11
12
Estimatingtheposterior
MarkovChainMonteCarloSampling
•
Impossibletoobtaintheposterior directly
•
Instead,theposterior canbeestimatedusing
Markovchain MonteCarlosimulation
•
Thisisusuallydoneusingthe
Metropolis-Hastingsalgorithm
Nicholas Metropolis
Los Alamos, 1953
14
probability
MCMCsimulation
probability
MCMCsimulation
proposea
change
randomstartingtree
brownbear
cavebear
blackbear
giantpanda
brownbear
cavebear
blackbear
giantpanda
brownbear
cavebear
blackbear
giantpanda
15
brownbear
cavebear
blackbear
giantpanda
brownbear
cavebear
blackbear
giantpanda
brownbear
cavebear
blackbear
giantpanda
16
MCMCsimulation
probability
probability
MCMCsimulation
alwaysaccept
changesthatincrease
probability
brownbear
cavebear
blackbear
giantpanda
brownbear
cavebear
blackbear
giantpanda
usuallyaccept
changesthatslightly
reduceprobability
brownbear
cavebear
blackbear
giantpanda
brownbear
cavebear
blackbear
giantpanda
brownbear
cavebear
blackbear
giantpanda
17
18
MCMCsimulation
probability
probability
MCMCsimulation
usuallyreject
changesthatgreatly
reduceprobability
brownbear
cavebear
blackbear
giantpanda
brownbear
cavebear
blackbear
giantpanda
takemanysteps(millions)
brownbear
cavebear
blackbear
giantpanda
brownbear
cavebear
blackbear
giantpanda
19
brownbear
cavebear
blackbear
giantpanda
brownbear
cavebear
blackbear
giantpanda
brownbear
cavebear
blackbear
giantpanda
20
Metropolis-coupledMCMC
SamplesfromtheMCMC
•
probability
Successivechainsare
incrementallyheated
brownbear
cavebear
blackbear
giantpanda
brownbear
cavebear
blackbear
giantpanda
OutputfromaBayesianphylogenetic analysis:
•
AlistoftheparametervaluesvisitedbytheMarkovchain
(.pfileinMrBayes, .logfileinBEAST)
•
AlistofthetreesvisitedbytheMarkovchain
(.tfileinMrBayes, .treesfileinBEAST)
brownbear
cavebear
blackbear
giantpanda
21
22
SamplesfromtheMCMC
SamplesfromtheMCMC
Stationaryphase
Numberofsamples
Probability
Stationaryphase
Burn-in
phase
Valueofparameter
MCMCsteps(millions)
23
UsingtheMetropolis-Hastings
algorithm, wesampleparameter
valuesandtreesfromtheMCMC
withafrequency proportional to
theirposteriorprobability
24
SamplesfromtheMCMC
SamplesfromtheMCMC
brownbear
cavebear
giantpanda
blackbear
brownbear
cavebear
blackbear
giantpanda
brownbear
cavebear
blackbear
giantpanda
brownbear
cavebear
blackbear
giantpanda
brownbear
cavebear
blackbear
giantpanda
Numberofsamples
Mean
•
Takethemeanofthesampled
values
Meanposteriorestimate
•
Takethecentral95%ofthe
sampledvalues
95%credibilityinterval
Valueofparameter
25
SamplesfromtheMCMC
brownbear
cavebear
giantpanda
blackbear
brownbear
cavebear
blackbear
giantpanda
brownbear
cavebear
blackbear
giantpanda
brownbear
cavebear
blackbear
giantpanda
brownbear
cavebear
blackbear
giantpanda
brownbear
cavebear
blackbear
giantpanda
brownbear
cavebear
giantpanda
blackbear
brownbear
cavebear
blackbear
giantpanda
brownbear
cavebear
blackbear
giantpanda
blackbear
cavebear
giantpanda
brownbear
0.90
0.60
brownbear
cavebear
blackbear
giantpanda
brownbear
cavebear
giantpanda
blackbear
brownbear
cavebear
blackbear
giantpanda
brownbear
cavebear
blackbear
giantpanda
blackbear
cavebear
giantpanda
brownbear
0.90
brownbear
cavebear
26
SamplesfromtheMCMC
brownbear
•
Majority-ruleconsensus tree(MrBayes)
Showsallnodeswithposterior probability >0.50
•
Maximum aposteriori(MAP) tree
Sampled treewithhighestposterior probability
•
Maximum cladecredibility(MCC)tree(BEAST/TreeAnnotator)
Sampled treewithhighestsumorproduct ofposterior node
probabilities
cavebear
blackbear
giantpanda
27
28
Diagnostics
1.
2.
Convergence
Convergence
•
Arewedrawingsamples from
thestationarydistribution?
Runatleast2independent chains
•
Posterior probabilities andlikelihoods shouldbesimilar
Sufficientsampling
•
Modelparameters
Havewedrawnenoughsamples
toallowareliableestimateof
theposteriordistribution?
•
Poor convergence
Good
•
Poor mixing
Checkifestimatesofmodelparametersaresimilarbetween runs
Treetopologies
•
Monitorstandarddeviationofsplitfrequencies(SDSF)
•
Astreesamplesbecomemoresimilarbetween runs,SDSFà 0
30
29
Sufficientsampling
•
Effectivesamplesize(ESS)
Havewedrawnenough independent samplestoproduce a
reliable estimateoftheposterior distribution?
•
ESSispreferably >200foreachparameter
•
ESScanbeincreased by:
•
IncreasingthelengthoftheMCMC
•
Decreasing thefrequency ofsamplingfromtheMCMC
•
Modifying theMCMCproposals
AdvantagesandProblems
Maintainanacceptancerateof15–70%
31
Advantages
Nuisanceparameters
•
Abletoimplementhighlyparameterisedmodels
•
Integrate over‘nuisance’ parameters
•
Estimatingtreeuncertainty isstraightforward
•
Marginal distribution ofaparameter ofinterest
•
Canonlydothisindirectly inlikelihood (viabootstrapping)
•
Posteriorprobabilitieshaveanintuitive interpretation
•
Canincorporate independent information(intheprior)
Tree1
Tree2
Tree3
Branchlengths1
0.10
0.07
0.12
0.29
Branchlengths2
0.05
0.22
0.06
0.33
Branchlengths3
0.05
0.19
0.14
0.38
0.20
0.48
0.32
Jointprobabilities
33
Influenceofpriors
•
Sensitivity oftheposterior totheprior
•
Thisproblem canoccur ifthedataareuninformative, theprior
isstrong,orboth
Marginal
probabilities
Ronquistetal.(2009)ThePhylogeneticHandbook
34
Problems:Inflatedsupportvalues?
35
Chatterjee et al. (2009) BMC Evol Biol
36
BEAST 1
•
BayesianEvolutionary AnalysisbySamplingTrees
•
Analysepopulation- orspecies-level data
•
Simultaneous estimationoftree andnodetimes
•
Rangeofclock models
•
Rangeoftreepriors anddemographic models
•
Re-write ofBEASTtoincrease modularity
•
UserscanextendBEASTbyaddingpackages
•
Lackssomeoftheclock modelsanddemographic modelsfrom
BEAST1
•
Additional treepriors notavailable inBEAST1
•
Capacity toperform simulations
Foracomparison ofBEAST1and2:
http://beast2.org/beast-features/
37
38
MrBayes
RevBayes
•
Primarily designed forspecies-level data
•
UsesitsownR-likelanguage,Rev
•
Simultaneous estimationoftree andnodetimes
•
Interactive construction ofgraphical model
•
Rangeofclock models
•
Flexible andcanbeusedforsimulationandinference
•
Rangeoftreepriors
•
Ongoingdevelopment
•
Multiple chainsandMCMC diagnostics
39
40
Usefulreferences
•
Analysesoflargedatasetsoncomputing clusters
•
Available priorssimilartothoseinolder versionsofMrBayes
•
Limitedoptions, nomolecular dating
•
Likelihood component adapted fromRAxML
41
42