PROBABILISTIC MODELS OF EVOLUTION

Currently, the NEXUS file format does not accommodate probabilistic models of character evolution in any public block. Several programs store descriptions of probabilistic models in NEXUS files, but to date these are all in private blocks. For example, PAUP* specifies such models with the LSET and DSET commands of the PAUP block; MrBayes specifies them with the LSET command of the MRBAYES block; Mesquite stores them in the MESQUITECHARMODELS block. With the increasing communication between these programs, it is important to move these private efforts into the public portions of the NEXUS blocks, so that information can be more readily shared. The experience gained from developing these private formats should give us the knowledge to do this.

SUGGESTED PROPERTIES OF THE FORMAT

As a template for the

CURRENT FORMATS

None of the formats currently in use can simply be adopted as is for the public format; those used by PAUP*, MrBayes, and Mesquite are all less than ideal. Some of the ways in which the current formats differ from the ideal are given below.

PAUP*

ASSUMPTIONS BLOCK

Below are some new commands proposed for the ASSUMPTIONS block. They are intended to describe stochastic models of character evolution, for use in maximum likelihood and Bayesian methods, character simulations, etc. The models are generalizations of those used in PAUP*, and are designed so that default values will in general match those of PAUP*.

(what about non-independence of branches / non-Markovian models?)
(continuous data)
(correlations between characters)
(correlations between models - e.g., have the rate matrix be a function of the overall character rate)
(models on cyclic graphs)

Models can be assigned character by character, and node by node. By default, all nodes inherit the models of their immediate ancestor. The model of a node specifies evolution on the edge below that node.
If a node has more than one ancestor (as in cyclic graphs), then the node would need to store one model for every ancestor.

MODELCLASS

MODELCLASS [*] modelkindname = modelclassname, modelclassname, … ;

This command defines a group of models that combined form a complete model of data evolution. A program that supports a particular MODELCLASS will know to look for the listed components (or, if they cannot be found, use their default values). This might not actually be a command that is implemented in any NEXUS files. It might instead be an implicit one that we use to define models in an abstract sense.

e.g.,
MODELCLASS DNA = rootstatesmodel, equilstatesmodel, ratematrixmodel, charratesmodel;

The default MODELCLASS for DNA and RNA is MODELCLASS DNA.

MODELOBJECT

It is not clear if we really want to support this…

MODELOBJECT [*] modelobjectname = description;

This command defines a group of models that combined form a complete model of data evolution. It is effectively an instance of a MODELCLASS.

e.g.,
MODELOBJECT myDNA = class=DNA rootstatesmodel=rootstatesname equilstatesmodel=equilstatesname ratematrixmodel=ratematrixname charratesmodel=charratesname;

ModelObjects could be assigned to individual branches of the tree, or to individual characters.

MODELSET

This is a standard list of assignments of complete models to characters. One can assign to a character either a named ModelObject or other Model, or a named ModelClass. If a named ModelClass is given, then the system will look for the default object of that class, or for the default objects if the ModelClass describes a composite model.

GENERAL NOTES ABOUT MODELS

Format ProbComposite

This allows one to combine multiple existing models, specifying that several of them exist in the collection of characters, with specified (or estimated) probabilities.
For example, for CharRatesModels:

CharRatesModel myCombined (ProbComposite) = model1: 0.2, model2: 0.8;
CharRatesModel myCombined (ProbComposite) = model1: estimate, model2: estimate;
CharRatesModel myCombined (ProbComposite) = model1: 0.2, model2: estimate, model3: estimate;
CharRatesModel myCombined (ProbComposite) = model1: previous, model2: previous;
CharRatesModel myCombined (ProbComposite) = model1, model2;

If no probabilities are specified, then "estimate" is presumed.

The ProbComposite format allows one to specify, for example, an I + Γ (invariant sites plus gamma) model of site-to-site rate variation (see examples, below, in CharRatesModel).

Referring to models

When models are referred to elsewhere, either the name of a defined model can be used, or the model specification can be directly inserted in parentheses. If the latter is done, and a format specification is needed, then the first subcommand must be FORMAT=formatname. For example, one could do

CharRatesModel gamma1 = structure=gamma gammaShape=1;
CharRatesModel constant = structure=equal rate=estimate;
CharRatesModelSet myRates = gamma1: 1-200, constant: 201-456;

or, equivalently

CharRatesModelSet myRates = (structure=gamma gammaShape=1): 1-200, (rate=estimate): 201-456;

ROOTSTATESMODEL

ROOTSTATESMODEL [*] rootstatesmodel-name = description;

This command specifies the model used to generate the states at the root. In PAUP*, the states at the root are specified by the values of the equilibrium state frequencies, and cannot be specified separately. However, some programs may desire separate specification.

SOURCE ASEQUILSTATES | FREQUENCIES | STATE | ANCESTOR

If AsEquilStates is specified, then nothing more need be specified, as the model used here is simply taken from the current EquilStatesModel. As this is the default value of Source, the lack of a RootStatesModel in files intended for PAUP* will cause no difficulty elsewhere.
If one of the other possible values is specified, some of the following subcommands will be relevant:

FREQUENCIES EMPIRICAL | EQUAL | ESTIMATE | PREVIOUS | (<FRQ0> <FRQ1> ...)

Equivalent to the BaseFreq option of PAUP*'s LSET command. Note that if frequencies are exactly specified, then they are given in standard state order, with the number of frequencies being either equal to the number of states or less than the number of states, in which case the remaining states divide up the remainder to 1 equally. Default value: Empirical (?). If Empirical, the frequencies will be calculated over the list of all characters to which this particular model is assigned.

POOL YES | NO

If yes, then all characters assigned this model are to be pooled in calculating empirical frequencies or estimating frequencies. If no, then each character is to be treated independently. Default value: Yes.

ROOTSTATE <STATE SET>

ANCESTOR <ANCSTATES NAME OR TAXON NAME>

If ANCESTOR is specified, then the states at each character in the ANCESTOR are used as the root states.

EQUILSTATESMODEL

EQUILSTATESMODEL [*] equilstatesmodel-name = description;

This command specifies the state frequencies on branches of the tree other than the root. These state frequencies can be constant throughout the tree, or variable. Subcommands for use in the description for the Standard format are:

FREQUENCIES EMPIRICAL | EQUAL | ESTIMATE | PREVIOUS | (<FRQ0> <FRQ1> ...)

(What should be the default? It would be rather messy to have it take the value from the RootStatesModel; in fact, that wouldn't be best.)

POOL YES | NO

If yes, then all characters assigned this model are to be pooled in calculating empirical frequencies or estimating frequencies. If no, then each character is to be treated independently. Default value: Yes.

Other formats can be used to specify a model of variable equilibrium state frequencies.

RATEMATRIXMODEL

RATEMATRIXMODEL [*] rmatrixmodel-name = description;

This command specifies the nature of the rate matrix.
It incorporates the NST, TRatio, RMatrix, RClass, and Variant options of PAUP*'s LSET command.

STRUCTURE EQUAL | TITV | ASYMM | GTR | GENERAL | (CHANGE CLASS LIST, SYMMETRICAL?)

Specifies the source of the rate matrix parameters other than the state frequencies. If Equal is specified, then all parameters in the rate matrix are equal. If TiTv is specified for nucleotide data, then a rate matrix with 2 parameters (equivalent to PAUP*'s NST=2) is set, with TRatio being used to specify the transition/transversion ratio. If GTR or General is specified, then the rates specified in a Matrix subcommand are used.

Structure=TiTv

TRATIO <VALUE> | ESTIMATE | PREVIOUS

Only useful for nucleotide data if Structure=TiTv is specified. Default value is ?.

Structure=Asymm

RATE 1.0 | <REAL-VALUE>
BIAS 1.0 | <REAL-VALUE>

Structure=GTR or General or change class list

MATRIX (VALUE1 VALUE2 VALUE3 …) | ESTIMATE | PREVIOUS

The value list must match in number and order that of the structure specified by the Structure subcommand. For example, if GTR is set, then there must be n(n-1)/2 - 1 entries in the matrix, listing the upper triangular elements of the n x n matrix from left to right, top to bottom, not including the diagonal; if the last element is not listed, it is presumed to be 1.

USEEQUILFREQMODEL YES | NO

If yes, then incorporate the state frequency model into the rate matrix. That is, in the rate matrix, the rate of x->y changes will be multiplied by the frequency of state y, as specified in the RootStatesModel or EquilStatesModel for that character. Default: Yes.

POOL YES | NO

If yes, then all characters assigned this model are to be pooled in estimating. If no, then each character is to be treated independently. Default value: Yes.

What about Variant HKY85 versus F84?

Other formats can be used to specify a model of variability of the rate matrix.
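
As a rough illustration of how a program might interpret the Matrix and UseEquilFreqModel subcommands together with a partially specified Frequencies list, the following Python sketch builds a rate matrix from a GTR value list. The function names, the example frequencies, and the choice to set each diagonal element to minus the sum of its row are assumptions made for this sketch; the proposal itself does not specify them.

```python
from itertools import combinations


def complete_frequencies(freqs, n_states):
    """Pad a partial frequency list: unlisted states share the remainder to 1 equally."""
    missing = n_states - len(freqs)
    if missing < 0:
        raise ValueError("more frequencies than states")
    if missing == 0:
        return list(freqs)
    share = (1.0 - sum(freqs)) / missing
    return list(freqs) + [share] * missing


def build_rate_matrix(upper, freqs, use_equil_freq=True):
    """Build an n x n rate matrix from upper-triangular rate values.

    `upper` lists the off-diagonal upper-triangular elements left to right,
    top to bottom; if the last element is omitted it is presumed to be 1.
    """
    n = len(freqs)
    n_pairs = n * (n - 1) // 2
    rates = list(upper)
    if len(rates) == n_pairs - 1:
        rates.append(1.0)  # omitted last element
    if len(rates) != n_pairs:
        raise ValueError(f"expected {n_pairs - 1} or {n_pairs} rates, got {len(rates)}")
    q = [[0.0] * n for _ in range(n)]
    for (x, y), rate in zip(combinations(range(n), 2), rates):
        # UseEquilFreqModel=yes: the x->y rate is multiplied by the frequency of y.
        q[x][y] = rate * (freqs[y] if use_equil_freq else 1.0)
        q[y][x] = rate * (freqs[x] if use_equil_freq else 1.0)
    for x in range(n):
        # Assumption (not stated in the proposal): diagonal = -(sum of the row).
        q[x][x] = -sum(q[x])
    return q


# Hypothetical frequencies for three of four states; the fourth gets the remainder.
freqs = complete_frequencies([0.1, 0.2, 0.3], 4)
q = build_rate_matrix([1.906, 5.187, 2.589, 0.443, 10.217], freqs)
for row in q:
    print(" ".join(f"{value:8.4f}" for value in row))
```

The five values passed here are the same ones used in the RMatrix examples of this document; with four states they fill the AC, AG, AT, CG, and CT positions, and GT is presumed to be 1.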
PAUP Block:         LSET NST=1;
Assumptions Block:  RateMatrixModel * mymodel = Structure=Equal;

PAUP Block:         LSET NST=2 TRatio=1.4;
Assumptions Block:  RateMatrixModel * mymodel = Structure=TiTv TRatio=1.4;

PAUP Block:         LSET NST=6 RMatrix=(1.906 5.187 2.589 0.443 10.217);
Assumptions Block:  RateMatrixModel * mymodel = Structure=GTR Matrix=(1.906 5.187 2.589 0.443 10.217);

PAUP Block:         LSET NST=6 RMatrix=estimate;
Assumptions Block:  RateMatrixModel * mymodel = Structure=GTR Matrix=estimate;

PAUP Block:         LSET NST=6 RMatrix=estimate RClass=(a b a c d e);
Assumptions Block:  RateMatrixModel * mymodel = Structure=(a b a c d e) Matrix=estimate;

CHARRATESMODEL

CHARRATESMODEL [*] charratesmodel-name = description;

This command specifies the rates of the characters.

STRUCTURE EQUAL | GAMMA

RATE 1.0 | <REAL-VALUE> | ESTIMATE | PREVIOUS

To be used if Structure=Equal is specified. Allows specification of the rate of the characters.

GAMMASHAPE <REAL-VALUE> | ESTIMATE | PREVIOUS
GAMMANCAT <INTEGER-VALUE> | 4
GAMMAREPRATE MEAN | MEDIAN

POOL YES | NO

If a parameter is set to Estimate and Pool is yes, then all characters assigned this model are pooled in the estimation.

Predefined CharRatesModels:

invariant: CharRatesModel invariant = Structure=equal rate=0.0;
standard:  CharRatesModel standard = Structure=equal rate=1.0;

PAUP Block:         LSET rates=equal;
Assumptions Block:  CharRatesModel * mymodel = Structure=equal;

PAUP Block:         LSET rates=gamma shape=0.86;
Assumptions Block:  CharRatesModel * mymodel = Structure=Gamma gammaShape=0.86;

PAUP Block:         LSET pinvar=estimate;
Assumptions Block:  CharRatesModel * myModel (probComposite) = invariant: estimate, standard: estimate;

PAUP Block:         LSET pinvar=estimate rates=gamma shape=estimate;
Assumptions Block:  CharRatesModel * myModel (probComposite) = invariant: estimate, (Structure=Gamma gammaShape=estimate): estimate;
                    OR
                    CharRatesModel * myGamma = Structure=Gamma gammaShape=estimate;
                    CharRatesModel * myModel (probComposite) = invariant: estimate, myGamma: estimate;

PAUP Block:         LSET rates=SiteSpec siteRates=Partition:CodonPositions;
Assumptions Block:  CharRatesModelSet * myModelSet = (rate=estimate): pos1, (rate=estimate): pos2, (rate=estimate): pos3, (rate=estimate): noncoding;
                    OR
                    CharRatesModelSet * myModelSet (partitionPool) =
                    model=(rate=estimate) partition=CodonPositions;

PAUP Block:         RateSet myRates = 0.234:1-200, 1.456:201-456;
                    LSET rates=SiteSpec siteRates=RateSets:myRates;
Assumptions Block:  CharRatesModelSet * myModelSet = (rate=0.234): 1-200, (rate=1.456): 201-456;

PAUP Block:         LSET rates=SiteSpec siteRates=Previous;
Assumptions Block:  CharRatesModelSet * previous;

BROWNIANMODEL

BROWNIANMODEL [*] brownianmodel-name = description;

This command specifies the nature of a Brownian motion model.

RATE 1.0 | <REAL-VALUE> | ESTIMATE

INDELMODEL

A model for evolution of insertion and deletion events. This would also need to specify which properties new sites inherit.

CODONMODEL

A model for evolution of codons.

BRANCHLENGTHSMODEL

BRANCHLENGTHSMODEL [*] branchlengthsmodel-name = description;

This command specifies various aspects of, and constraints on, branch lengths.

CLOCK NO | YES

If yes, branches must be clock-like.

NOTES ABOUT XMODELSETS

xModelSets (RootStatesModelSets, CharRatesModelSets, etc.) have a few elements in common.

Predefined xModelSets:

PREVIOUS: This will cause all estimated parameters to be set to their current values. To use the "previous" set, do:

xModelSet * previous;

Format partitionPool

If the partitionPool format is used, then the format is:

xModelSet * myModelSet (PartitionPool) = model=modelName partition=partitionName or definition;

This will establish "modelName" as the model used throughout the elements of the partition, but with any estimation done by pooling WITHIN each element of the partition.

ROOTSTATESMODELSET

This is a standard list of assignments of RootStatesModels to characters.

EQUILSTATESMODELSET

This is a standard list of assignments of EquilStatesModels to characters.

RATEMATRIXMODELSET

This is a standard list of assignments of RateMatrixModels to characters.

CHARRATESMODELSET

This is a standard list of assignments of CharRatesModels to characters.

BROWNIANMODELSET

This is a standard list of assignments of BrownianModels to characters.
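
To make the idea of an xModelSet as "a standard list of assignments of models to characters" concrete, here is a minimal Python sketch that expands the body of such a command into a per-character lookup. The function names are hypothetical, the body is assumed to be passed without the command name or trailing semicolon, and only numeric character ranges are handled (named character sets such as pos1 would need a lookup that is not shown here).

```python
def split_top_level(text, sep=","):
    """Split on `sep`, ignoring separators nested inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == sep and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        parts.append(tail)
    return parts


def parse_model_set(body):
    """Map each character number to the model (name or inline spec) assigned to it."""
    assignment = {}
    for item in split_top_level(body):
        # The last colon separates the model from its character list.
        model, _, chars = item.rpartition(":")
        model = model.strip()
        for token in chars.split():
            first, _, last = token.partition("-")
            last = last or first  # a single character such as "17"
            for number in range(int(first), int(last) + 1):
                assignment[number] = model
    return assignment


named = parse_model_set("gamma1: 1-200, constant:201-456")
inline = parse_model_set("(structure=gamma gammaShape=1): 1-200, (rate=estimate):201-456")
print(named[200], named[201])
print(inline[200], inline[201])
```

Splitting only on top-level commas keeps an inline specification in parentheses intact, so named and inline model references can be treated the same way downstream.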
CURRENT PAUP LSET COMMANDS

Keyword       Option type                                                         Default setting
------------  ------------------------------------------------------------------  ---------------
NST           1|2|6                                                               2
TRatio        <real-value>|Estimate|Previous                                      2
RMatrix       (<rAC><rAG><rAT><rCG><rCT>)|Estimate|Previous                       (1 1 1 1 1)
RClass        (<cAC><cAG><cAT><cCG><cCT><cGT>)                                    (a b c d e f)
Variant       HKY|F84                                                             HKY
BaseFreq      Empirical|Equal|Estimate|Previous|(<frqA><frqC><frqG>)              Empirical
Rates         Equal|Gamma|SiteSpec                                                Equal
Shape         <real-value>|Estimate|Previous                                      0.5
NCat          <integer-value>                                                     4
RepRate       Mean|Median                                                         Mean
PInvar        <real-value>|Estimate|Previous                                      0
SiteRates     Partition[:<charpartition-name>]|Rateset[:<rateset-name>]|Previous  <none>
Wts           RepeatCnt|Ignore                                                    RepeatCnt
InitBrLen     Rogers|LS|<real-value>                                              Rogers
LCollapse     No|Yes                                                              Yes
MaxPass       <integer-value>                                                     20
Delta         <real-value>                                                        1e-06
UseApprox     No|Yes                                                              Yes
ApproxLim     <real-value>                                                        5
AdjustAppLim  No|Yes                                                              Yes
LogIter       No|Yes                                                              No
ZeroLenTest   No|Full|Crude                                                       No
Recon         Marginal|Joint                                                      Marginal
AllProbs      No|Yes                                                              No
Clock         No|Yes                                                              No
UserBrLens    No|Yes                                                              No
MinMemReq     No|Yes                                                              No
StartVals     ParsApprox|Arbitrary                                                ParsApprox
ParamClock    Standard|Rambaut|BrLens|SplitTimes|MDRambaut|Thorne                 Standard
MLDistforLS   No|Yes                                                              No
ShowQMatrix   No|Yes                                                              No

Items in italics are covered in the current definitions.
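
The correspondence between the NST, TRatio, RMatrix, and RClass options of LSET and the proposed RateMatrixModel command can be sketched as a small translation function. This is only an illustration covering the cases shown in this document's PAUP Block / Assumptions Block examples; the function name and its argument names are hypothetical, and NST=1 is written with the Structure subcommand, since that is where Equal is defined.

```python
def lset_to_rate_matrix_model(nst, tratio=None, rmatrix=None, rclass=None, name="mymodel"):
    """Return a proposed RateMatrixModel command for a few PAUP* LSET settings."""
    if nst == 1:
        parts = ["Structure=Equal"]
    elif nst == 2:
        parts = ["Structure=TiTv"]
        if tratio is not None:
            parts.append(f"TRatio={tratio}")
    elif nst == 6:
        # An RClass list, when given, takes the place of GTR as the structure.
        structure = rclass if rclass is not None else "GTR"
        parts = [f"Structure={structure}"]
        if rmatrix is not None:
            parts.append(f"Matrix={rmatrix}")
    else:
        raise ValueError("NST must be 1, 2, or 6")
    return f"RateMatrixModel * {name} = {' '.join(parts)};"


print(lset_to_rate_matrix_model(2, tratio=1.4))
print(lset_to_rate_matrix_model(6, rmatrix="estimate", rclass="(a b a c d e)"))
```

Options that have no counterpart in the current definitions (for example MaxPass or Delta) are deliberately left out of this sketch.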