Scaling Statistical Multiple Sequence Alignment

Mike Nute and Tandy Warnow
The University of Illinois at Urbana-Champaign
Phylogenomic Pipelines
•  Finding related genomic regions (homology detection)
•  Multiple sequence alignment
•  Maximum likelihood phylogeny estimation for single genes
•  Species tree estimation (or phylogenetic network estimation) from multiple conflicting genes
•  Answer biological questions
Deletion
Substitution
…ACGGTGCAGTTACCA…
Insertion
…ACCAGTCACCTA…
…ACGGTGCAGTTACC-A…
…AC----CAGTCACCTA…
The true multiple alignment
–  Reflects historical substitution, insertion, and deletion events
–  Defined using transitive closure of pairwise alignments computed on edges of the true tree
Multiple Sequence Alignment (MSA): an important grand challenge¹
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
…
Sn = TCACGACCGACA
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
…
Sn = -------TCAC--GACCGACA
Novel techniques needed for scalability and accuracy
NP-hard problems and large datasets
Current methods do not provide good accuracy
Few methods can analyze even moderately large datasets
Many important applications besides phylogenetic estimation
¹ Frontiers in Massive Data Analysis, National Academies Press, 2013
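The relationship between the unaligned sequences and the alignment on this slide can be checked mechanically: an alignment only inserts gaps, so every row has the same length and removing the gaps recovers the input. The sketch below is illustrative only; the function names are hypothetical, and the rows are written with trailing gaps so that all four have 21 columns.

```python
from typing import Dict


def degap(row: str) -> str:
    """Remove gap characters from one alignment row."""
    return row.replace("-", "")


def is_valid_alignment(unaligned: Dict[str, str], aligned: Dict[str, str]) -> bool:
    """True if both inputs name the same sequences, all rows share one
    length, and each row de-gaps to its unaligned sequence."""
    if set(unaligned) != set(aligned):
        return False
    if len({len(row) for row in aligned.values()}) != 1:
        return False
    return all(degap(aligned[name]) == seq for name, seq in unaligned.items())


unaligned = {
    "S1": "AGGCTATCACCTGACCTCCA",
    "S2": "TAGCTATCACGACCGC",
    "S3": "TAGCTGACCGC",
    "Sn": "TCACGACCGACA",
}
aligned = {
    "S1": "-AGGCTATCACCTGACCTCCA",
    "S2": "TAG-CTATCAC--GACCGC--",
    "S3": "TAG-CT-------GACCGC--",
    "Sn": "-------TCAC--GACCGACA",
}
print(is_valid_alignment(unaligned, aligned))
```

This check says nothing about accuracy: any gap placement that preserves the sequences passes it.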
Simulation Studies
Unaligned Sequences
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
True tree and alignment
[Figure: true tree, leaves drawn in the order S1, S2, S4, S3]
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA
Estimated tree and alignment
[Figure: estimated tree, leaves drawn in the order S1, S4, S2, S3]
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-C--T-----GACCGC--
S4 = T---C-A-CGACCGA----CA
Compare
Two-phase Tree Estimation
Alignment methods
•  Clustal
•  Di-align
•  Infernal
•  MAFFT
•  Muscle
•  Opal
•  PAGAN
•  PASTA
•  POY (and POY*)
•  Prank
•  Probalign
•  Probcons (and Probtree)
•  T-Coffee
•  UPP
•  Etc.
Phylogeny methods
•  Bayesian MCMC
•  Maximum parsimony
•  Maximum likelihood
•  Neighbor joining
•  FastME
•  UPGMA
•  Quartet puzzling
•  Etc.
RAxML: heuristic for large-scale ML optimization
SATé-1 (Science 2009) performance
1000-taxon models, ordered by difficulty; rate of evolution generally increases from left to right
SATé-1 24 hour analysis, on desktop machines
(Similar improvements for biological datasets)
SATé-1 can analyze up to about 8,000 sequences.
SATé: booster of MSA methods
Combines iteration and divide-and-conquer to “boost” a preferred MSA method to large datasets.
We showed results based on MAFFT, on subsets of 200 sequences.
Challenge: Can we boost statistical alignment methods?
Re-aligning on a tree (boosting an MSA method)
[Figure: cycle of four steps on a tree whose leaves are split into subsets A, B, C, D]
1.  Decompose dataset
2.  Align subsets
3.  Merge subset-alignments (ABCD)
4.  Estimate ML tree on merged alignment
SATé and PASTA Algorithms
[Figure: loop alternating between Tree and Alignment]
Obtain initial alignment and estimated ML tree
Use tree to compute new alignment
Estimate ML tree on new alignment
Repeat until termination condition, and return the alignment/tree pair with the best ML score
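The loop on this slide can be sketched as a small driver function. This is a minimal sketch and not the SATé or PASTA code: the names `iterate_alignment_and_tree`, `estimate_tree`, and `realign_on_tree` are hypothetical, the two steps are passed in as callables, and a fixed iteration count stands in for the termination condition, which the slide does not specify.

```python
def iterate_alignment_and_tree(initial_alignment, estimate_tree,
                               realign_on_tree, max_iterations=3):
    """Alternate between tree estimation and re-alignment.

    estimate_tree(alignment) -> (tree, ml_score)   # hypothetical callable
    realign_on_tree(tree)    -> alignment          # hypothetical callable
    Returns the (ml_score, alignment, tree) triple with the best ML score.
    """
    alignment = initial_alignment
    tree, score = estimate_tree(alignment)
    best = (score, alignment, tree)
    for _ in range(max_iterations):  # assumed termination condition
        alignment = realign_on_tree(tree)
        tree, score = estimate_tree(alignment)
        if score > best[0]:
            best = (score, alignment, tree)
    return best


# Toy stand-ins: alignments are integers, trees are labels, scores are fixed.
toy_scores = {0: -100.0, 1: -90.0, 2: -95.0, 3: -92.0}
best = iterate_alignment_and_tree(
    0,
    estimate_tree=lambda a: ("tree%d" % a, toy_scores[a]),
    realign_on_tree=lambda t: int(t[4:]) + 1,
)
print(best)
```

Keeping the best-scoring pair, rather than the last one, matters because the ML score is not guaranteed to improve at every iteration.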
SATé-2 better than SATé-1
1000-taxon models ranked by difficulty
SATé-1 (Liu et al., Science 2009): can analyze up to 8K sequences
SATé-2 (Liu et al., Systematic Biology 2012): can analyze up to ~50K sequences
PASTA: even better than SATé-2
[Figure: Tree Error (FN Rate) on RNASim datasets with 10,000, 50,000, 100,000, and 200,000 sequences; series: Clustal-Omega, Muscle, Mafft, Starting Tree, SATe2, PASTA, Reference Alignment]
PASTA: Mirarab, Nguyen, and Warnow, J Comp. Biol. 2015
–  Simulated RNASim datasets from 10K to 200K taxa
–  Limited to 24 hours using 12 CPUs
–  Not all methods could run (missing bars could not finish)
UPP Algorithmic Approach
1.  Select small random subset of full-length sequences, and build “backbone alignment”
2.  Construct an “Ensemble of Hidden Markov Models” on the backbone alignment
3.  Add all remaining sequences to the backbone alignment using the Ensemble of HMMs
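Step 1 can be illustrated with a short sketch that separates backbone candidates from the sequences to be added later. It is not UPP's implementation: the function name is hypothetical, and treating a sequence as "full-length" when its length is within 25% of the median is an assumption made here for illustration. Steps 2 and 3 (building and applying the ensemble of HMMs) are not shown.

```python
import random
import statistics


def split_backbone(seqs, backbone_size, length_tolerance=0.25, seed=0):
    """Pick a random backbone from the full-length sequences.

    seqs maps names to unaligned sequences. A sequence counts as
    full-length if its length is within length_tolerance of the median
    length (an assumed definition). Returns (backbone_names, query_names).
    """
    median_len = statistics.median(len(s) for s in seqs.values())
    low = (1 - length_tolerance) * median_len
    high = (1 + length_tolerance) * median_len
    full_length = sorted(n for n, s in seqs.items() if low <= len(s) <= high)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    chosen = set(rng.sample(full_length, min(backbone_size, len(full_length))))
    backbone = sorted(chosen)
    query = [n for n in seqs if n not in chosen]
    return backbone, query


toy = {"full%d" % i: "A" * 100 for i in range(6)}
toy.update({"frag0": "A" * 20, "frag1": "A" * 20})
backbone, query = split_backbone(toy, backbone_size=3)
print(backbone, query)
```

Fragments never enter the backbone in this sketch; they are always placed in the query set and would be added in step 3.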
RNASim Million Sequences: alignment error
Notes:
•  We show alignment error using average of SP-FN and SP-FP.
•  UPP variants have better alignment scores than PASTA.
•  (Not shown: Total Column Scores – PASTA more accurate than UPP)
•  No other methods tested could complete on these data
•  PASTA under-aligns: its alignment is 43 times wider than true alignment (~900 Gb of disk space). UPP alignments were closer in length to true alignment (0.93
RNASim Million Sequences: tree error
Using 12 TACC processors:
•  UPP(Fast, NoDecomp) took 2.2 days,
•  UPP(Fast) took 11.9 days, and
•  PASTA took 10.3 days
SATé, PASTA, and UPP: boosters of MSA methods
•  SATé and PASTA
–  Combines iteration and divide-and-conquer to “boost” a preferred MSA method to large datasets; we showed results based on MAFFT
•  UPP
–  Step 1: Constructs a “backbone” tree and an alignment on a small random subset of the sequences
–  Step 2: Aligns all the remaining sequences to the backbone alignment
–  We showed results where default PASTA computed the backbone alignment and tree (which is based on MAFFT).
Challenge: Can we boost statistical alignment methods?
Statistical co-estimation
•  Improved accuracy can be obtained through co-estimation of alignments and trees.
•  BAli-Phy (Redelings and Suchard, 2005), a Bayesian method, is the leading co-estimation method.
•  Liu et al. (Science 2009) showed that the posterior decoding alignment produced by BAli-Phy was better than SATé on some 100-taxon datasets.
BAli-Phy: better than PASTA
[Figure: two bar-chart panels comparing MAFFT, PASTA, and BAli-Phy. Panel 1: Alignment Error (Pairwise False-Negative Rate). Panel 2: Alignment Accuracy (TC score, Total-Column Score). Groups: simulator Indelible (DNA) and RNAsim (RNA), each with 100 and 200 taxa.]
Simulated datasets with 100 or 200 sequences.
*Averages over 10 replicates
But: BAli-Phy is limited to small datasets
•  BAli-Phy is computationally intensive:
–  63 sequence dataset (Gaya et al., 2011) took 3 weeks
–  Largest dataset analyzed had 117 sequences (McKenzie et al., 2014)
•  BAli-Phy is not scalable:
–  Our study shows it breaks somewhere before 500 sequences (numerical issues possibly)
But: BAli-Phy is limited to small datasets
From www.bali-phy.org/README.html, 5.2.1. Too many taxa?
“BAli-Phy is quite CPU intensive, and so we recommend using 50 or fewer taxa in order to limit the time required to accumulate enough MCMC samples. (Despite this recommendation, data sets with more than 100 taxa have occasionally been known to converge.) We recommend initially pruning as many taxa as possible from your data set, then adding some back if the MCMC is not too slow.”
Re-aligning on a tree (boosting an MSA method)
[Figure: cycle of four steps on a tree whose leaves are split into subsets A, B, C, D]
1.  Decompose dataset
2.  Align subsets using MAFFT
3.  Merge subset-alignments (ABCD)
4.  Estimate ML tree on merged alignment
Re-aligning on a tree (boosting an MSA method)
[Figure: cycle of four steps on a tree whose leaves are split into subsets A, B, C, D]
1.  Decompose dataset
2.  Align subsets using BAli-Phy
3.  Merge subset-alignments (ABCD)
4.  Estimate ML tree on merged alignment
This study: Methods
•  PASTA variants:
–  PASTA+BAli-Phy (using BAli-Phy instead of MAFFT as the subset aligner, subset size 100, each BAli-Phy analysis run for 24 hours on 32 Blue Waters processors, and using the posterior decoding alignment)
–  PASTA-default (uses MAFFT as the subset aligner)
•  UPP variants:
–  UPP-default: uses PASTA-default for the backbone alignment
–  UPP+BAli-Phy: uses PASTA+BAli-Phy as the backbone aligner
•  Tree estimation using maximum likelihood (RAxML or FastTree-2)
This study: Simulated Datasets
•  ROSE (Stoye et al., 1998): 1000-taxon datasets with three gap length distributions (short, medium, long) and high rates of evolution
•  Indelible (Fletcher and Yang 2009): up to 10,000 sequences, medium gap length and high rate of evolution
•  RNASim (Guo, Wang, and Kim 2009): up to 10,000 sequences, complex sequence evolution model based on RNA structure, moderate rate of evolution
This study: Metrics
•  TC (total column score)
•  Precision, also known as the alignment modeller score, equivalent to 1-SPFP (sum of pairs false positive score)
•  Recall, also known as the SP-score, equivalent to 1-SPFN (sum of pairs false negative score)
•  Delta-RF (tree error), where RF is the Robinson-Foulds error rate (normalized bipartition distance)
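The three alignment metrics can be computed directly from a true and an estimated alignment of the same sequences. The sketch below is one way to do so and is not the scoring code used in the study; the function names are hypothetical, and counting only true columns with at least two residues toward TC is an assumed convention (tools differ on this).

```python
from itertools import combinations


def columns(alignment):
    """Each column as a frozenset of (row name, residue index) entries."""
    counters = {name: 0 for name in alignment}
    width = len(next(iter(alignment.values())))
    cols = []
    for j in range(width):
        col = set()
        for name, row in alignment.items():
            if row[j] != "-":
                col.add((name, counters[name]))
                counters[name] += 1
        cols.append(frozenset(col))
    return cols


def homologous_pairs(alignment):
    """All pairs of residues that share a column."""
    pairs = set()
    for col in columns(alignment):
        pairs.update(frozenset(p) for p in combinations(col, 2))
    return pairs


def alignment_scores(true_aln, est_aln):
    """TC, precision (1 - SPFP), and recall (1 - SPFN)."""
    true_pairs = homologous_pairs(true_aln)
    est_pairs = homologous_pairs(est_aln)
    shared = len(true_pairs & est_pairs)
    # Assumed convention: only true columns with >= 2 residues count for TC.
    true_cols = {c for c in columns(true_aln) if len(c) >= 2}
    est_cols = set(columns(est_aln))
    return {
        "TC": len(true_cols & est_cols) / len(true_cols),
        "precision": shared / len(est_pairs),
        "recall": shared / len(true_pairs),
    }


true_aln = {"a": "AC", "b": "AC"}
est_aln = {"a": "AC-", "b": "A-C"}  # under-aligned: second residues split apart
scores = alignment_scores(true_aln, est_aln)
print(scores)
```

The toy estimate is under-aligned: it asserts nothing false, so precision stays perfect, but it misses one of the two true homologies, which lowers recall and TC. The sketch does not guard against alignments with no homologous pairs, where the ratios are undefined.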
Comparing default PASTA to PASTA+BAli-Phy on simulated datasets with 1000 sequences
[Figure: three scatter plots of PASTA against PASTA+BAli-Phy, with regions marked "PASTA+BAli-Phy Better" and "PASTA Better". Panels: Total Column Score; Recall (SP-Score); Tree Error: Delta RF (RAxML). Datasets: Indelible M2, RNAsim, Rose L1, Rose M1, Rose S1.]
Decomposition to 100-sequence subsets, one iteration of PASTA+BAli-Phy
Comparing UPP variants where the backbone alignment is computed using either default PASTA or PASTA+BAli-Phy
[Figure: three scatter plots of PASTA against PASTA+BAli-Phy backbones, with regions marked "PASTA+BAli-Phy Better" and "PASTA Better". Panels: Total Column Score; Recall (SP-Score); Tree Error: Delta RF (FastTree-2). Datasets: Indelible M2, RNAsim.]
Results on 10,000-sequence datasets, backbone size 1000
Observations
•  BAli-Phy: computationally intensive and cannot run on even moderately large datasets
•  PASTA+BAli-Phy: more accurate than default PASTA on simulated datasets
•  UPP+BAli-Phy: more accurate than default UPP on simulated datasets
•  PASTA+BAli-Phy and UPP+BAli-Phy are computationally intensive, but embarrassingly parallel.
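The last observation follows from the decomposition: each subset is aligned without reference to the others, so the subset jobs can run concurrently. The sketch below illustrates only that structure. The names are hypothetical, and `pad_to_equal_length` is a placeholder that is not an aligner; a real run would launch an external aligner (or a cluster job) per subset.

```python
from concurrent.futures import ThreadPoolExecutor


def pad_to_equal_length(subset):
    """Placeholder for a subset aligner: pads rows with trailing gaps.

    This is NOT an alignment method; it only stands in for an independent,
    expensive job such as one aligner run on one subset.
    """
    width = max(len(seq) for seq in subset.values())
    return {name: seq.ljust(width, "-") for name, seq in subset.items()}


def align_subsets_in_parallel(subsets, align_subset, max_workers=4):
    """Run one independent alignment job per subset; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(align_subset, subsets))


subsets = [
    {"S1": "AGGCTATCACCTGACCTCCA", "S2": "TAGCTATCACGACCGC"},
    {"S3": "TAGCTGACCGC", "S4": "TCACGACCGACA"},
]
subset_alignments = align_subsets_in_parallel(subsets, pad_to_equal_length)
print(subset_alignments)
```

Threads are enough here on the assumption that the real work happens in external processes; the merge step that follows is not parallelized in this sketch.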
Still to do
•  This study only examines scaling BAli-Phy as a point estimator of the multiple sequence alignment. BAli-Phy produces a distribution on MSAs and trees; can we develop strategies to enable large-scale estimation of these distributions?
•  Consider other computationally intensive statistical alignment methods, such as StatAlign.
Acknowledgments
PASTA and UPP at https://github.com/smirarab
PASTA+BAli-Phy at http://github.com/MGNute/pasta
Funding: NSF ABI-1458652, a Founder Professorship from the Grainger Foundation, and CompGen Fellowship to M.N.
Computational support: Blue Waters (NCSA)
Quantifying Error
FN: false negative (missing edge)
FP: false positive (incorrect edge)
[Figure: true and estimated trees with FN and FP edges marked; 50% error rate]
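The FN and FP rates can be computed by comparing the bipartitions (internal edges) of the two trees. The sketch below is illustrative: trees are written as nested tuples of leaf names, the function names are hypothetical, and the five-leaf trees are made up for this example rather than taken from the figure.

```python
def leaf_set(tree):
    """All leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Non-trivial bipartitions of the unrooted tree, each stored as the
    side that does not contain the alphabetically first leaf."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if anchor in below else below
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
        return below

    visit(tree)
    return splits


def fn_fp_rates(true_tree, est_tree):
    """FN rate: share of true edges missing from the estimate.
    FP rate: share of estimated edges absent from the true tree."""
    true_splits = bipartitions(true_tree)
    est_splits = bipartitions(est_tree)
    fn = len(true_splits - est_splits) / len(true_splits)
    fp = len(est_splits - true_splits) / len(est_splits)
    return fn, fp


true_tree = (("A", "B"), "C", ("D", "E"))
est_tree = (("A", "B"), "D", ("C", "E"))
rates = fn_fp_rates(true_tree, est_tree)
print(rates)
```

Both toy trees have two internal edges and share one of them, so one true edge is missing and one estimated edge is incorrect. When both trees are fully resolved, as here, the FN and FP rates are equal and match the normalized Robinson-Foulds error rate.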
UPP is more robust to fragmentary sequences than PASTA
[Figure S32: Alignment and tree error of PASTA and UPP on the fragmentary 1000M2 datasets. (a) Average alignment error: mean alignment error against % fragmentary (0, 12.5, 25, 50) for PASTA and UPP(Default). (b) Average tree error: Delta FN tree error against % fragmentary (0, 12.5, 25, 50) for PASTA and UPP(Default).]
1000M2 model condition
Under high rates of evolution, PASTA is badly impacted by fragmentary sequences (the same is true for other methods).
Under low rates of evolution, PASTA can still be highly accurate (data not shown).
UPP continues to have good accuracy even on datasets with many fragments under all rates of evolution.