ASSUMPTIONS Block

PROBABILISTIC MODELS OF EVOLUTION
Currently, the NEXUS file format does not accommodate probabilistic models of
character evolution in any public block. Several programs store descriptions of probabilistic
models in NEXUS files, but to date these are all in private blocks. For example, PAUP* has such
models specified by the LSET and DSET commands of the PAUP block; MrBayes has them
specified in the LSET command of the MRBAYES block; Mesquite has them in the
MESQUITECHARMODELS block.
With the increasing communication between these programs, it is important to move
these private efforts into the public portions of the NEXUS blocks, so that information can be
more readily shared. With the experience gained from the development of these private formats,
we should have the knowledge to do this.
SUGGESTED PROPERTIES OF THE FORMAT
As a template for the
CURRENT FORMATS
None of the formats currently in use can simply be used as is for the public format; those
used by PAUP*, MrBayes, and Mesquite are all less than ideal. Some of the ways in which the
current formats differ from the ideal are given below.
PAUP*
ASSUMPTIONS Block
Below are some new commands proposed for the Assumptions block. They are intended
to describe stochastic models of character evolution, for use in maximum likelihood and Bayesian
methods, character simulations, etc. The models are generalizations of those used in PAUP*, and
are designed such that default values will in general match those of PAUP*.
(what about non-independence of branches/ non-Markovian models?)
(continuous data)
(correlations between characters)
(correlations between models - e.g., have the rate matrix a function of the overall
character rate)
(models on cyclic graphs)
Models can be assigned character by character, and node by node.
By default, all nodes inherit the models of their immediate ancestor. The model of a
node specifies evolution on the edge below that node. If a node has more than one ancestor (as in
cyclic graphs), then the node would need to store one model for every ancestor.
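The inheritance rule above can be sketched in Python. This is only an illustration under stated assumptions: the tree is acyclic, and the names resolve_model, parent, and assigned are hypothetical and are not part of any proposed NEXUS command.

```python
from typing import Dict, Optional


def resolve_model(node: str,
                  parent: Dict[str, Optional[str]],
                  assigned: Dict[str, str]) -> str:
    """Return the model governing the edge below `node`.

    A node without an explicit assignment inherits the model of its
    immediate ancestor, so we walk toward the root until one is found.
    Assumes an acyclic tree (one ancestor per node).
    """
    current: Optional[str] = node
    while current is not None:
        if current in assigned:
            return assigned[current]
        current = parent[current]
    raise ValueError(f"no model assigned between {node!r} and the root")


# Hypothetical tree: root -> A -> (B, C); only root and C carry a model.
parent = {"root": None, "A": "root", "B": "A", "C": "A"}
assigned = {"root": "gtrModel", "C": "titvModel"}
```

Here B would inherit gtrModel through A, while C keeps its own titvModel.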
MODELCLASS
MODELCLASS [*] modelkindname = modelclassname, modelclassname, … ;
This command defines a group of models that combined form a complete model of data
evolution. A program that supports a particular MODELCLASS will know to look for the listed
components (or, if they cannot be found, use their default values).
This might not actually be a command that is implemented in any NEXUS files. It might
instead be an implicit one that we use to define models in an abstract sense.
e.g., MODELCLASS DNA = rootstatesmodel, equilstatesmodel, ratematrixmodel,
charratesmodel;
Default MODELCLASS for DNA and RNA is MODELCLASS DNA.
MODELOBJECT
It is not clear if we really want to support this…
ModelObject [*] modelobjectname = description;
This command defines a group of models that combined form a complete model of data
evolution. It is effectively an instance of a MODELCLASS.
e.g.,
MODELOBJECT myDNA = class=DNA
rootstatesmodel=rootstatesname
equilstatesmodel = equilstatesname
ratematrixmodel = ratematrixname
charratesmodel = charratesname;
ModelObjects could be assigned to individual branches of the tree, or to individual
characters.
MODELSET
This is a standard list of assignments of complete models to characters. One can assign
either a named ModelObject or other Model to a character, or a named ModelClass. If a named
ModelClass is given, then the system will look for the default object of that class or the default
objects, if the ModelClass describes a composite model.
GENERAL NOTES ABOUT MODELS
Format ProbComposite
This allows one to combine multiple existing models, specifying that several of them
exist in the collection of characters, with specified (or estimated) probabilities.
For example, for CharRatesModels:
CharRatesModel myCombined (ProbComposite) = model1: 0.2, model2: 0.8 ;
CharRatesModel myCombined (ProbComposite) = model1: estimate, model2: estimate;
CharRatesModel myCombined (ProbComposite) = model1: 0.2, model2: estimate,
model3: estimate;
CharRatesModel myCombined (ProbComposite) = model1: previous, model2: previous;
CharRatesModel myCombined (ProbComposite) = model1, model2;
If no probabilities are specified, then "estimate" is presumed.
The ProbComposite format allows one to specify, for example, an I + Γ model of site-to-site rate variation (see examples, below, in CharRatesModel).
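The text does not say how the component probabilities enter a likelihood calculation. One common reading, assumed here, is a mixture: the likelihood of a character is the probability-weighted sum of its likelihoods under each component model. The function name composite_site_likelihood is hypothetical.

```python
from typing import Sequence


def composite_site_likelihood(probs: Sequence[float],
                              likelihoods: Sequence[float]) -> float:
    """Mixture likelihood of one character under a ProbComposite model.

    probs[i] is the probability of component model i (already fixed or
    already estimated); likelihoods[i] is the character's likelihood
    under that component. The mixture interpretation is an assumption.
    """
    if len(probs) != len(likelihoods):
        raise ValueError("one probability is needed per component model")
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("component probabilities must sum to 1")
    return sum(p * lk for p, lk in zip(probs, likelihoods))


# model1: 0.2, model2: 0.8, with illustrative per-model likelihoods.
example = composite_site_likelihood([0.2, 0.8], [0.5, 0.25])
```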
Referring to models
When models are referred to elsewhere, either the name of a defined model can be used,
or the model specification can be directly inserted in parentheses. If the latter is done, and a
format specification is needed, then the first subcommand must be FORMAT=formatname. For
example, one could do
CharRatesModel gamma1 = structure=gamma gammaShape=1;
CharRatesModel constant = structure=equal rate=estimate;
CharRatesModelSet myRates = gamma1: 1-200, constant:201-456 ;
or, equivalently
CharRatesModelSet myRates = (structure=gamma gammaShape=1): 1-200,
(rate=estimate):201-456 ;
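A program reading such a model set needs to turn the character ranges into a per-character assignment. A minimal sketch follows, assuming the ranges have already been tokenized into (model, range-text) pairs; expand_model_set is a hypothetical helper, and rejecting overlapping ranges is a design choice made here, not something the proposal states.

```python
from typing import Dict, List, Tuple


def expand_model_set(assignments: List[Tuple[str, str]]) -> Dict[int, str]:
    """Map each 1-based character number to the model assigned to it.

    Each assignment pairs a model name (or inline specification) with
    whitespace-separated character numbers or ranges such as "1-200".
    """
    result: Dict[int, str] = {}
    for model, ranges in assignments:
        for token in ranges.split():
            first, _, last = token.partition("-")
            start = int(first)
            end = int(last) if last else start
            for char_number in range(start, end + 1):
                if char_number in result:
                    raise ValueError(f"character {char_number} assigned twice")
                result[char_number] = model
    return result


# Mirrors: CharRatesModelSet myRates = gamma1: 1-200, constant:201-456 ;
my_rates = expand_model_set([("gamma1", "1-200"), ("constant", "201-456")])
```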
ROOTSTATESMODEL
ROOTSTATESMODEL [*] rootstatesmodel-name = description.
This command specifies the model used to generate the states at the root. In PAUP*, the
states at the root are specified by the values of the equilibrium state frequencies, and cannot be
specified separately. However, some programs may desire separate specification.
SOURCE
ASEQUILSTATES | FREQUENCIES | STATE | ANCESTOR
If AsEquilStates is specified, then nothing more need be specified, as the model used here
is simply taken from the current EquilStatesModel. As this is the default value of Source, the
lack of a RootStatesModel in files intended for PAUP* will cause no difficulty elsewhere. If one
of the other possible values is specified, some of the following subcommands will be relevant:
FREQUENCIES
EMPIRICAL|EQUAL|ESTIMATE|PREVIOUS| (<FRQ0> <FRQ1>...)
Equivalent to PAUP*'s BaseFreq of the LSET command; note that if frequencies are
exactly specified, then the frequencies are given in standard state order, with the number of
frequencies being equal to the number of states OR less than the number of states - in which case
the remaining states divide up the remainder to 1 equally. Default value: Empirical?. If
Empirical, it will be calculated over the list of all characters to which this particular model is assigned.
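The rule for a short frequency list can be sketched as follows. The function name complete_frequencies is hypothetical; the behavior follows the description above, in which unlisted states share the remainder to 1 equally.

```python
from typing import List, Sequence


def complete_frequencies(given: Sequence[float], n_states: int) -> List[float]:
    """Fill out a frequency list given in standard state order.

    If fewer than n_states frequencies are listed, the remaining states
    divide the remainder to 1 equally.
    """
    if len(given) > n_states:
        raise ValueError("more frequencies than states")
    total = sum(given)
    if total > 1.0 + 1e-9:
        raise ValueError("frequencies sum to more than 1")
    missing = n_states - len(given)
    if missing == 0:
        if abs(total - 1.0) > 1e-9:
            raise ValueError("a complete frequency list must sum to 1")
        return list(given)
    share = (1.0 - total) / missing
    return list(given) + [share] * missing


# Two of four nucleotide frequencies listed; the last two get 0.15 each.
freqs = complete_frequencies([0.4, 0.3], 4)
```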
POOL
YES|NO
(If yes, then all characters assigned this model are to be pooled in calculating empirical
frequencies or estimating frequencies. If no, then each character is to be treated independently.
Default value: Yes.)
ROOTSTATE
<STATE SET>
ANCESTOR
<ANCSTATES NAME OR TAXON NAME>
If ANCESTOR is specified, then the states at each character in the ANCESTOR are used
as the root states, as if they had been given by the ROOTSTATE command.
EQUILSTATESMODEL
EQUILSTATESMODEL [*] equilstatesmodel-name = description.
This command specifies the state frequencies on branches of the tree other than the root.
These state frequencies can be constant throughout the tree, or variable.
Subcommands for use in the description for the Standard format are:
FREQUENCIES
EMPIRICAL|EQUAL|ESTIMATE|PREVIOUS | (<FRQ0> <FRQ1>...)
(what should be the default? it would be rather messy to have it take the value from
rootstatesmodel; in fact, that wouldn't be best.)
POOL
YES|NO
(If yes, then all characters assigned this model are to be pooled in calculating empirical
frequencies or estimating frequencies. If no, then each character is to be treated independently.
Default value: Yes.)
Other formats can be used to specify a model of variable equilibrium state frequencies.
RATEMATRIXMODEL
RATEMATRIXMODEL [*] rmatrixmodel-name = description.
This command specifies the nature of the rate matrix. It incorporates PAUP's NST,
TRatio, RMatrix, RClass, and Variant options of the LSET command.
STRUCTURE
EQUAL | TITV | ASYMM | GTR | GENERAL| (CHANGE CLASS LIST,
SYMMETRICAL?)
Specifies the source of the rate matrix parameters other than the state frequencies. If
Equal is specified, then all parameters in the rate matrix are equal. If TiTv is specified for
nucleotide data, then a rate matrix with 2 parameters (equivalent to PAUP's NST=2) is set, with
the TRatio being used to specify the transition/transversion ratio. If GTR or General is specified,
then it uses the rates specified in a Matrix subcommand.
Structure=TiTv
TRATIO
<VALUE> | ESTIMATE | PREVIOUS
Only useful for nucleotide data if Structure=TiTv is specified. Default value is ?.
Structure=Asymm
RATE
1.0 | <REAL-VALUE>
BIAS
1.0 | <REAL-VALUE>
Structure=GTR or General or change class list
MATRIX
(VALUE1 VALUE2 VALUE3…) | ESTIMATE | PREVIOUS
The value list must match in number and order that of the structure specified by the
Structure command. For example, if GTR is set, then there must be n(n-1)/2 - 1 entries in the
matrix, listing the upper triangular matrix elements from left to right, top to bottom, not
including the diagonal, of the n x n matrix; if the last element is not listed, it is presumed to be 1.
USEEQUILFREQMODEL
YES|NO
If yes, then incorporate the state frequency model into the rate matrix. That is, in the rate
matrix, the rate of x->y changes will be multiplied by the frequency of state y, as specified in the
RootStatesModel or EquilStatesModel for that character. Default: yes.
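The Matrix and UseEquilFreqModel rules can be combined into a short Python sketch for Structure=GTR. Two points are assumptions rather than statements of the proposal: the listed upper-triangle values are mirrored into the lower triangle (as is usual for GTR), and the diagonal is set so that each row sums to zero. The name build_rate_matrix is hypothetical.

```python
from typing import List, Sequence


def build_rate_matrix(upper: Sequence[float],
                      freqs: Sequence[float],
                      use_equil_freqs: bool = True) -> List[List[float]]:
    """Build an n x n rate matrix from upper-triangle values.

    Values are listed left to right, top to bottom, excluding the
    diagonal; if the last one is omitted it is presumed to be 1.
    """
    n = len(freqs)
    n_upper = n * (n - 1) // 2
    values = list(upper)
    if len(values) == n_upper - 1:
        values.append(1.0)  # last element presumed to be 1
    if len(values) != n_upper:
        raise ValueError(f"expected {n_upper - 1} or {n_upper} values")
    q = [[0.0] * n for _ in range(n)]
    k = 0
    for i in range(n):
        for j in range(i + 1, n):
            q[i][j] = q[j][i] = values[k]  # assumption: symmetric fill
            k += 1
    if use_equil_freqs:
        # The rate of x->y changes is multiplied by the frequency of y.
        for i in range(n):
            for j in range(n):
                if i != j:
                    q[i][j] *= freqs[j]
    for i in range(n):
        # Assumption: diagonal makes each row sum to zero.
        q[i][i] = -sum(q[i][j] for j in range(n) if j != i)
    return q


# Five values as in Matrix=(1.906 5.187 2.589 0.443 10.217), equal frequencies.
q = build_rate_matrix([1.906, 5.187, 2.589, 0.443, 10.217], [0.25] * 4)
```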
POOL
YES|NO
(If yes, then all characters assigned this model are to be pooled in estimating. If no, then
each character is to be treated independently. Default value: Yes.)
What about Variant HKY85 versus F84?
Other formats can be used to specify a model of variability of the rate matrix.
PAUP Block:         LSET NST=1;
Assumptions Block:  RateMatrixModel * mymodel = Structure=Equal;

PAUP Block:         LSET NST=2 TRatio=1.4;
Assumptions Block:  RateMatrixModel * mymodel = Structure=TiTv TRatio=1.4;

PAUP Block:         LSET NST=6 RMatrix=(1.906 5.187 2.589 0.443 10.217);
Assumptions Block:  RateMatrixModel * mymodel =
                    Structure=GTR Matrix=(1.906 5.187 2.589 0.443 10.217);

PAUP Block:         LSET NST=6 RMatrix=estimate;
Assumptions Block:  RateMatrixModel * mymodel =
                    Structure=GTR Matrix=estimate;

PAUP Block:         LSET NST=6 RMatrix=estimate RClass=(a b a c d e);
Assumptions Block:  RateMatrixModel * mymodel =
                    Structure=(a b a c d e) Matrix=estimate;
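A change class list such as (a b a c d e) ties together the rates that share a letter. The sketch below expands per-class rates into the full list of rates; the function names and the numeric rates are hypothetical and serve only to illustrate the mapping.

```python
from typing import Dict, List, Sequence


def expand_rate_classes(classes: Sequence[str],
                        class_rates: Dict[str, float]) -> List[float]:
    """Give every matrix entry the rate of its change class."""
    missing = sorted(set(classes) - set(class_rates))
    if missing:
        raise ValueError(f"no rate supplied for classes: {missing}")
    return [class_rates[label] for label in classes]


def distinct_class_count(classes: Sequence[str]) -> int:
    """Number of distinct change classes in the list."""
    return len(set(classes))


classes = "a b a c d e".split()
# Hypothetical per-class rates, for illustration only.
rates = expand_rate_classes(
    classes, {"a": 1.0, "b": 4.0, "c": 0.5, "d": 6.0, "e": 1.0})
```

With this class list the first and third entries always receive the same rate, because both belong to class a.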
CHARRATESMODEL
CHARRATESMODEL [*] charratesmodel-name = description.
This command specifies the rates of the characters.
STRUCTURE
EQUAL | GAMMA
RATE
1.0 | <REAL-VALUE>|ESTIMATE | PREVIOUS
To be used if Structure=Equal is specified. Allows specification of the rate of the
characters.
GAMMASHAPE
<REAL-VALUE>|ESTIMATE|PREVIOUS
GAMMANCAT
<INTEGER-VALUE> | 4
GAMMAREPRATE
MEAN|MEDIAN
POOL
YES | NO
If a value is set to Estimate and Pool is Yes, then the estimate is made by pooling all characters assigned this model.
Predefined CharRatesModels:
invariant:
CharRatesModel invariant = Structure=equal rate=0.0;
standard:
CharRatesModel standard = Structure=equal rate=1.0;
PAUP Block:         LSET rates=equal;
Assumptions Block:  CharRatesModel * mymodel = Structure=equal;

PAUP Block:         LSET rates=gamma shape=0.86;
Assumptions Block:  CharRatesModel * mymodel =
                    Structure=Gamma gammaShape=0.86;

PAUP Block:         LSET pinvar=estimate;
Assumptions Block:  CharRatesModel * myModel (probComposite) =
                    invariant: estimate, standard: estimate;

PAUP Block:         LSET pinvar=estimate rates=gamma shape=estimate;
Assumptions Block:  CharRatesModel * myModel (probComposite) =
                    invariant: estimate,
                    (Structure=Gamma gammaShape=estimate): estimate;
                    OR
                    CharRatesModel * myGamma = Structure=Gamma
                    gammaShape=estimate;
                    CharRatesModel * myModel (probComposite) =
                    invariant: estimate, myGamma: estimate;

PAUP Block:         LSET rates=SiteSpec siteRates=Partition:CodonPositions;
Assumptions Block:  CharRatesModelSet * myModelSet = (rate=estimate): pos1,
                    (rate=estimate): pos2, (rate=estimate): pos3,
                    (rate=estimate): noncoding;
                    OR
                    CharRatesModelSet * myModelSet (partitionPool) =
                    model=(rate=estimate) partition=CodonPositions;

PAUP Block:         RateSet myRates = 0.234:1-200, 1.456:201-456;
                    LSET rates=SiteSpec siteRates=RateSets:myRates;
Assumptions Block:  CharRatesModelSet * myModelSet = (rate=0.234): 1-200,
                    (rate=1.456): 201-456;

PAUP Block:         LSET rates=SiteSpec siteRates=Previous;
Assumptions Block:  CharRatesModelSet * previous;
BROWNIANMODEL
BROWNIANMODEL [*] brownianmodel-name = description.
This command specifies the nature of a Brownian motion model.
RATE
1.0 | <REAL-VALUE> | ESTIMATE
INDELMODEL
A model for evolution of insertion and deletion events. This would also need to specify
which properties new sites inherit.
CODONMODEL
A model for evolution of codons.
BRANCHLENGTHSMODEL
BRANCHLENGTHSMODEL [*] branchlengthsmodel-name = description.
This command specifies various aspects and constraints on branch lengths.
CLOCK
NO | YES
If yes, all branches must be clock-like.
NOTES ABOUT XMODELSETS
xModelSets (RootStatesModelSets, CharRatesModelSets, etc.) have a few elements in
common.
Predefined xModelSets:
PREVIOUS: This will cause all estimated parameters to be set to their current values.
To use the "previous" set do:
xModelSet * previous;
Format partitionPool
If partitionPool format is used, then the format is:
xModelSet * myModelSet (PartitionPool) = model=modelName partition =
partitionName or definition;
This will establish "modelName" as the model used throughout the elements of the
partition, but with any estimation done by pooling WITHIN each element of the partition.
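As an illustration of pooling within each element of a partition, the following sketch computes empirical state frequencies separately for each partition element. The data layout (one string of observed states per character) and the name pooled_empirical_frequencies are assumptions made for the example; states outside the given alphabet, such as gaps, are simply skipped.

```python
from collections import Counter
from typing import Dict, List, Sequence


def pooled_empirical_frequencies(
        columns: Sequence[str],
        partition: Dict[str, List[int]],
        states: str = "ACGT") -> Dict[str, Dict[str, float]]:
    """Empirical state frequencies, pooled within each partition element.

    columns[i] holds the states observed at character i + 1 across all
    taxa; partition maps an element name to 1-based character numbers.
    """
    result: Dict[str, Dict[str, float]] = {}
    for element, chars in partition.items():
        counts: Counter = Counter()
        for char_number in chars:
            counts.update(s for s in columns[char_number - 1] if s in states)
        total = sum(counts.values())
        if total == 0:
            raise ValueError(f"no countable states in element {element!r}")
        result[element] = {s: counts[s] / total for s in states}
    return result


# Hypothetical data: four characters scored in three taxa.
columns = ["AAA", "AAC", "GGG", "GTT"]
partition = {"pos1": [1, 2], "pos2": [3, 4]}
freqs_by_element = pooled_empirical_frequencies(columns, partition)
```

The same model is used in pos1 and pos2, but each element gets its own frequencies, computed only from its own characters.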
ROOTSTATESMODELSET
This is a standard list of assignments of RootStatesModels to characters.
EQUILSTATESMODELSET
This is a standard list of assignments of EquilStatesModels to characters.
RATEMATRIXMODELSET
This is a standard list of assignments of RateMatrixModels to characters.
CHARRATESMODELSET
This is a standard list of assignments of CharRatesModels to characters.
BROWNIANMODELSET
This is a standard list of assignments of BrownianModels to characters.
CURRENT PAUP LSET COMMANDS
Keyword       Option type                          Default setting
-------       -----------                          ---------------
NST           1|2|6                                2
TRatio        <real-value>|Estimate|Previous       2
RMatrix       (<rAC><rAG><rAT><rCG><rCT>)|         (1 1 1 1 1)
              Estimate|Previous
RClass        (<cAC><cAG><cAT><cCG><cCT><cGT>)     (a b c d e f)
Variant       HKY|F84                              HKY
BaseFreq      Empirical|Equal|Estimate|Previous|   Empirical
              (<frqA><frqC><frqG>)
Rates         Equal|Gamma|SiteSpec                 Equal
Shape         <real-value>|Estimate|Previous       0.5
NCat          <integer-value>                      4
RepRate       Mean|Median                          Mean
PInvar        <real-value>|Estimate|Previous       0
SiteRates     Partition[:<charpartition-name>]|    <none>
              Rateset[:<rateset-name>]|Previous
Wts           RepeatCnt|Ignore                     RepeatCnt
InitBrLen     Rogers|LS|<real-value>               Rogers
LCollapse     No|Yes                               Yes
MaxPass       <integer-value>                      20
Delta         <real-value>                         1e-06
UseApprox     No|Yes                               Yes
ApproxLim     <real-value>                         5
AdjustAppLim  No|Yes                               Yes
LogIter       No|Yes                               No
ZeroLenTest   No|Full|Crude                        No
Recon         Marginal|Joint                       Marginal
AllProbs      No|Yes                               No
Clock         No|Yes                               No
UserBrLens    No|Yes                               No
MinMemReq     No|Yes                               No
StartVals     ParsApprox|Arbitrary                 ParsApprox
ParamClock    Standard|Rambaut|BrLens|SplitTimes|  Standard
              MDRambaut|Thorne
MLDistforLS   No|Yes                               No
ShowQMatrix   No|Yes                               No
Items in italics are covered in the current definitions.