Scaling Statistical Multiple Sequence Alignment
Mike Nute and Tandy Warnow
The University of Illinois at Urbana-Champaign

Phylogenomic Pipelines
• Finding related genomic regions (homology detection)
• Multiple sequence alignment
• Maximum likelihood phylogeny estimation for single genes
• Species tree estimation (or phylogenetic network estimation) from multiple conflicting genes
• Answer biological questions

[Figure: the sequence …ACGGTGCAGTTACCA… changes into …ACCAGTCACCTA… through events labeled Deletion, Substitution, and Insertion]

…ACGGTGCAGTTACC-A…
…AC----CAGTCACCTA…

The true multiple alignment
– Reflects historical substitution, insertion, and deletion events
– Defined using transitive closure of pairwise alignments computed on edges of the true tree

Multiple Sequence Alignment (MSA): an important grand challenge¹

S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
…
Sn = TCACGACCGACA

S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
…
Sn = -------TCAC--GACCGACA

Novel techniques needed for scalability and accuracy
• NP-hard problems and large datasets
• Current methods do not provide good accuracy
• Few methods can analyze even moderately large datasets
• Many important applications besides phylogenetic estimation

¹ Frontiers in Massive Data Analysis, National Academies Press, 2013

Simulation Studies

Unaligned Sequences
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA

True tree and alignment
[Tree figure with leaves S1, S2, S4, S3]
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA

Compare

Estimated tree and alignment
[Tree figure with leaves S1, S4, S2, S3]
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-C--T-----GACCGC--
S4 = T---C-A-CGACCGA----CA

Two-phase Tree Estimation

Alignment methods
• Clustal
• Di-align
• Infernal
• MAFFT
• Muscle
• Opal
• PAGAN
• PASTA
• POY (and POY*)
• Prank
• Probalign
• Probcons (and Probtree)
• T-Coffee
• UPP
• Etc.

Phylogeny methods
• Bayesian MCMC
• Maximum parsimony
• Maximum likelihood
• Neighbor joining
• FastME
• UPGMA
• Quartet puzzling
• Etc.

RAxML: heuristic for large-scale ML optimization

SATé-1 (Science 2009) performance
1000-taxon models, ordered by difficulty; rate of evolution generally increases from left to right
SATé-1 24 hour analysis, on desktop machines
(Similar improvements for biological datasets)
SATé-1 can analyze up to about 8,000 sequences.

SATé: booster of MSA methods
Combines iteration and divide-and-conquer to “boost” a preferred MSA method to large datasets.
We showed results based on MAFFT, on subsets of 200 sequences.
Challenge: Can we boost statistical alignment methods?

Re-aligning on a tree (boosting an MSA method)
[Diagram, a cycle: Decompose dataset into subsets A, B, C, D; Align subsets; Merge subset-alignments into ABCD; Estimate ML tree on merged alignment]

SATé and PASTA Algorithms
[Diagram: Obtain initial alignment and estimated ML tree; then loop between "Use tree to compute new alignment" and "Estimate ML tree on new alignment"]
Repeat until termination condition, and return the alignment/tree pair with the best ML score

SATé-2 better than SATé-1
1000-taxon models ranked by difficulty
SATé-1 (Liu et al., Science 2009): can analyze up to 8K sequences
SATé-2 (Liu et al., Systematic Biology 2012): can analyze up to ~50K sequences

PASTA: even better than SATé-2
[Figure: RNASim tree error (FN rate) versus number of sequences (10,000 to 200,000) for Clustal-Omega, Muscle, Mafft, Starting Tree, SATe2, PASTA, and Reference Alignment]
PASTA: Mirarab, Nguyen, and Warnow, J Comp. Biol. 2015
– Simulated RNASim datasets from 10K to 200K taxa
– Limited to 24 hours using 12 CPUs
– Not all methods could run (missing bars could not finish)

UPP Algorithmic Approach
1. Select small random subset of full-length sequences, and build “backbone alignment”
2. Construct an “Ensemble of Hidden Markov Models” on the backbone alignment
3. Add all remaining sequences to the backbone alignment using the Ensemble of HMMs

RNASim Million Sequences: alignment error
Notes:
• We show alignment error using average of SP-FN and SP-FP.
• UPP variants have better alignment scores than PASTA.
• (Not shown: Total Column Scores, where PASTA is more accurate than UPP)
• No other methods tested could complete on these data
• PASTA under-aligns: its alignment is 43 times wider than true alignment (~900 Gb of disk space). UPP alignments were closer in length to true alignment (0.93

RNASim Million Sequences: tree error
Using 12 TACC processors:
• UPP(Fast,NoDecomp) took 2.2 days,
• UPP(Fast) took 11.9 days, and
• PASTA took 10.3 days

SATé, PASTA, and UPP: boosters of MSA methods
• SATé and PASTA
– Combines iteration and divide-and-conquer to “boost” a preferred MSA method to large datasets; we showed results based on MAFFT
• UPP
– Step 1: Constructs a “backbone” tree and an alignment on a small random subset of the sequences
– Step 2: Aligns all the remaining sequences to the backbone alignment
– We showed results where default PASTA computed the backbone alignment and tree (which is based on MAFFT).
Challenge: Can we boost statistical alignment methods?

Statistical co-estimation
• Improved accuracy can be obtained through co-estimation of alignments and trees.
• BAli-Phy (Redelings and Suchard, 2005), a Bayesian method, is the leading co-estimation method.
• Liu et al. (Science 2009) showed that the posterior decoding alignment produced by BAli-Phy was better than SATé on some 100-taxon datasets.

BAli-Phy: better than PASTA
[Figure: two bar-chart panels, Alignment Error (Pairwise False-Negative Rate) and Alignment Accuracy (TC score, Total-Column Score), for MAFFT, PASTA, and BAli-Phy on Indelible (DNA) and RNAsim (RNA) datasets with 100 and 200 taxa]
Simulated datasets with 100 or 200 sequences.
*Averages over 10 replicates

But: BAli-Phy is limited to small datasets
• BAli-Phy is computationally intensive:
– 63 sequence dataset (Gaya et al., 2011) took 3 weeks
– Largest dataset analyzed had 117 sequences (McKenzie et al., 2014)
• BAli-Phy is not scalable:
– Our study shows it breaks somewhere before 500 sequences (possibly because of numerical issues)

But: BAli-Phy is limited to small datasets
From www.bali-phy.org/README.html, 5.2.1. Too many taxa?
“BAli-Phy is quite CPU intensive, and so we recommend using 50 or fewer taxa in order to limit the time required to accumulate enough MCMC samples. (Despite this recommendation, datasets with more than 100 taxa have occasionally been known to converge.) We recommend initially pruning as many taxa as possible from your dataset, then adding some back if the MCMC is not too slow.”

Re-aligning on a tree (boosting an MSA method)
[Diagram, a cycle: Decompose dataset into subsets A, B, C, D; Align subsets using MAFFT; Merge subset-alignments into ABCD; Estimate ML tree on merged alignment]

Re-aligning on a tree (boosting an MSA method)
[Diagram, a cycle: Decompose dataset into subsets A, B, C, D; Align subsets using BAli-Phy; Merge subset-alignments into ABCD; Estimate ML tree on merged alignment]

This study: Methods
• PASTA variants:
– PASTA+BAli-Phy (using BAli-Phy instead of MAFFT as the subset aligner, subset size 100, each BAli-Phy analysis run for 24 hours on 32 Blue Waters processors, and using the posterior decoding alignment)
– PASTA-default (uses MAFFT as the subset aligner)
• UPP variants:
– UPP-default: uses PASTA-default for the backbone alignment
– UPP+BAli-Phy: uses PASTA+BAli-Phy as the backbone aligner
• Tree estimation using maximum likelihood (RAxML or FastTree-2)

This study: Simulated Datasets
• ROSE (Stoye et al., 1998): 1000-taxon datasets with three gap length distributions (short, medium, long) and high rates of evolution
• Indelible (Fletcher and Yang 2009): up to 10,000 sequences, medium gap length and high rate of evolution
• RNASim (Guo, Wang, and Kim 2009): up to 10,000 sequences, complex sequence evolution model based on RNA structure, moderate rate of evolution

This study: Metrics
• TC (total column score)
• Precision, also known as the alignment modeller score, equivalent to 1-SPFP (sum of pairs false positive score)
• Recall, also known as the SP-score, equivalent to 1-SPFN (sum of pairs false negative score)
• Delta-RF (tree error), where RF is the Robinson-Foulds error rate (normalized bipartition distance)

Comparing default PASTA to PASTA+BAli-Phy on simulated datasets with 1000 sequences
[Figure: three scatter plots comparing PASTA with PASTA+BAli-Phy on the Indelible M2, RNAsim, Rose L1, Rose M1, and Rose S1 datasets. Panels: Total Column Score; Recall (SP-Score); Tree Error: Delta RF (RAxML). The regions on either side of the diagonal are labeled "PASTA+BAli-Phy Better" and "PASTA Better".]
Decomposition to 100-sequence subsets, one iteration of PASTA+BAli-Phy

Comparing UPP variants where the backbone alignment is computed using either default PASTA or PASTA+BAli-Phy
[Figure: three scatter plots comparing the PASTA backbone with the PASTA+BAli-Phy backbone on the Indelible M2 and RNAsim datasets. Panels: Total Column Score; Recall (SP-Score); Tree Error: Delta RF (FastTree-2). The regions on either side of the diagonal are labeled "PASTA+BAli-Phy Better" and "PASTA Better".]
Results on 10,000-sequence datasets, backbone size 1000

Observations
• BAli-Phy: computationally intensive and cannot run on even moderately large datasets
• PASTA+BAli-Phy: more accurate than default PASTA on simulated datasets
• UPP+BAli-Phy: more accurate than default UPP on simulated datasets
• PASTA+BAli-Phy and UPP+BAli-Phy are computationally intensive, but embarrassingly parallel.

Still to do
• This study only examines scaling BAli-Phy as a point estimator of the multiple sequence alignment. BAli-Phy produces a distribution on MSAs and trees; can we develop strategies to enable large-scale estimation of these distributions?
• Consider other computationally intensive statistical alignment methods, such as StatAlign.

Acknowledgments
PASTA and UPP at https://github.com/smirarab
PASTA+BAli-Phy at http://github.com/MGNute/pasta
Funding: NSF ABI-1458652, a Founder Professorship from the Grainger Foundation, and CompGen Fellowship to M.N.
Computational support: Blue Waters (NCSA)

Quantifying Error
FN: false negative (missing edge)
FP: false positive (incorrect edge)
[Tree figure with FN and FP edges marked: 50% error rate]

UPP is more robust to fragmentary sequences than PASTA
[Figure: (a) Average alignment error (mean alignment error) and (b) Average tree error (Delta FN tree error) for PASTA and UPP(Default), plotted against % Fragmentary (0, 12.5, 25, 50)]
Figure S32: Alignment and tree error of PASTA and UPP on the fragmentary 1000M2 datasets.

1000M2 model condition
Under high rates of evolution, PASTA is badly impacted by fragmentary sequences (the same is true for other methods).
Under low rates of evolution, PASTA can still be highly accurate (data not shown).
UPP continues to have good accuracy even on datasets with many fragments under all rates of evolution.
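
The FN and FP rates from the Quantifying Error slide can be illustrated with a minimal Python sketch. It assumes each tree is already reduced to the bipartitions induced by its internal edges, with each bipartition given as the set of taxa on one side of the edge. The function names, the five-taxon example, and this input representation are all hypothetical choices for illustration; they are not taken from the PASTA or UPP code.

```python
from typing import FrozenSet, Iterable, Set, Tuple


def canonical_split(side: Iterable[str], taxa: FrozenSet[str]) -> FrozenSet[str]:
    """Return the side of a bipartition that excludes a fixed reference taxon.

    The two sides of an edge describe the same bipartition, so we always keep
    the side without the reference taxon to make splits comparable.
    """
    side_set = frozenset(side)
    reference = min(taxa)  # any fixed taxon works; min() is just deterministic
    return taxa - side_set if reference in side_set else side_set


def edge_error_rates(
    true_edges: Iterable[Iterable[str]],
    estimated_edges: Iterable[Iterable[str]],
    taxa: Iterable[str],
) -> Tuple[float, float]:
    """Compute (FN rate, FP rate) between a true and an estimated tree.

    FN rate: fraction of true-tree edges missing from the estimated tree.
    FP rate: fraction of estimated-tree edges absent from the true tree.
    """
    all_taxa = frozenset(taxa)
    true_splits: Set[FrozenSet[str]] = {canonical_split(s, all_taxa) for s in true_edges}
    est_splits: Set[FrozenSet[str]] = {canonical_split(s, all_taxa) for s in estimated_edges}
    if not true_splits or not est_splits:
        raise ValueError("both trees need at least one internal edge")
    fn_rate = len(true_splits - est_splits) / len(true_splits)
    fp_rate = len(est_splits - true_splits) / len(est_splits)
    return fn_rate, fp_rate


# Hypothetical five-taxon example (not from the slides).
taxa = ["A", "B", "C", "D", "E"]
true_tree = [{"A", "B"}, {"A", "B", "C"}]       # splits AB|CDE and ABC|DE
estimated_tree = [{"A", "B"}, {"A", "B", "D"}]  # splits AB|CDE and ABD|CE
fn, fp = edge_error_rates(true_tree, estimated_tree, taxa)
print(fn, fp)  # 0.5 0.5
```

In this example one of the two true edges is missing and one of the two estimated edges is incorrect, so both rates are 50%. When both trees are fully resolved they have the same number of internal edges, and the FN and FP rates then coincide.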