Running the ForSim Forward Evolutionary Simulator: A Basic Guide

ForSim Program Version: Final (subject to corrections)
ForSim Manual Version: August 2, 2013

Brian Lambert, Ken Weiss
Penn State University

Joe Terwilliger
Columbia University

Developed with financial support from the National Institutes of Health (grants R01 MH063749 and MH 084995), the Penn State Huck Institutes of the Life Sciences, and the Penn State Evan Pugh Professors' research fund.

© 2008-2013 ForSim the logo, the program itself, and these notes, are copyright by Kenneth M Weiss and Brian Lambert. All rights of use or reproduction are reserved.

DISCLAIMER AND USE CONDITIONS

This program is offered without warranty of its accuracy. Like any software, ForSim may contain bugs, though we have done our best to identify and correct them, and the program seems to run properly after years of use and testing. However, we have not tested all possible 'legitimate' combinations of options, nor can we promise to fix errors, because we were not funded sufficiently to provide technical service. Likewise, we believe this Manual to be accurate, but it will be updated as mistakes are identified or as we see ways to clarify explanations. The file is date-stamped, but the latest version can be obtained from me ([email protected]). A few features come with some restrictions; these are identified by blue highlighting. Because the grant for this project expired in Spring 2013, new features are not being added; for more on this, see Note to Programmers, in the Addendum.

ForSim has been changed in many ways since its inception. New features have been added, explanations improved, and some input file syntax changes have been made. For various unavoidable reasons, the syntax for the input file is not entirely backward compatible. We have not found bugs that would result in computation errors in earlier versions used within their realm of features.
But input files written for earlier versions may need modification before they can be used with the current version.

Publications resulting from using ForSim should acknowledge the program in the following way:

    Lambert, B, Terwilliger, J, Weiss, K. ForSim: A tool for exploring the genetic architecture of complex traits with controlled truth. Bioinformatics, 24(16): 1821-22, 2008 (doi: 10.1093/bioinformatics/btn317).

The distribution includes the source code. If you modify the code, a condition of use is that you agree to describe those changes in any resulting publications, so readers know that you are not using off-the-shelf ForSim and what differs in your application.

ForSim is distributed to registered users who provide an email address. I will do my best to distribute relevant bug reports, usage suggestions, minor updates, etc. Use of ForSim implies agreement to the open-source license provided in the README file distributed with the program.

TABLE OF CONTENTS

    DISCLAIMER AND USE CONDITIONS
    TABLE OF CONTENTS
    CHAPTER 1: ForSim BASICS
    CHAPTER 2: INSTALLATION
    CHAPTER 3: HOW ForSim DOES ITS WORK
    CHAPTER 4: CONTENTS OF THE INPUT FILE
    CHAPTER 5: USERS' OUTPUT FILES
    ADDENDUM: SOME WAYS TO SIMULATE FEATURES NOT EXPLICITLY BUILT INTO ForSim AND ONE WAY TO CHECK REASONS FOR CRASHES
    APPENDIX 1: ForSim LOGICAL FLOW AND TIME CONSUMPTION
    APPENDIX 2: GENERIC CUT & PASTE INPUT.SIM FILE TEMPLATE
    APPENDIX 3: MORE COMPLEX SIMULATION FLOW AND INPUT FILE EXAMPLE

CHAPTER 1: ForSim BASICS

ForSim is a forward evolutionary simulation system designed to be highly flexible, for application to a wide variety of applied health and life science questions as well as issues in theoretical evolutionary biology. It attempts to simulate in the most natural way the evolutionary process that generates the genetic architecture underlying present-day traits, and related phenomena such as mate choice, migration bias, population substructure, and interactions with the environment. These phenomena are related to the way natural selection affects underlying genetic variation, molding the trait's genetic architecture.

Variation over the short evolutionary scale, within species or among closely related species, is generally built upon a phylogenetically stable underlying causal genetic architecture, upon which mutation, selection, and demographic effects are laid to generate subsequent variation within and among populations. In turn, this variation affects the ability of particular study designs or statistical approaches to correctly infer the basic genetic architecture of traits of interest, or specific effects that may be of practical importance (e.g., in public health). Simulating it can therefore help to guide appropriate sample designs, sample sizes, hypotheses, and analytic methods.
ForSim is an evolving work, but is written to be easily and highly modifiable by the user, both within the current specifications and by using the current capabilities to design approximate simulations of additional features, all in an open-ended way. No simulation is entirely free of the developers' or users' assumptions. But ForSim is based as much as is practicable on biology rather than formalism, and its structure lets the user specify those assumptions: that is, ForSim itself makes only minimal structural assumptions. It is a brute-force rather than mathematically sophisticated approach, but it is this that gives it its nimbleness and minimal dependency on theoretical assumptions. Since CPU time and memory are rapidly becoming less of a constraint, ForSim will in principle be able to simulate extensive approximations to genome-wide information in large, geographically and environmentally differentiated species under dynamic selective conditions, on an ordinary computer, and can be a viable part of an evolutionary and biostatistical exploratory as well as formal analytic tool-kit.

ForSim is written in C++, with processing and controlling scripts written in Ruby. The scripts are written in Ruby 1.8, but should be compatible with later versions. The wrapper calls other software as well, if available (see below). All required or optional software is public domain. The program is not a commercial product, but is under continual development and augmentation, so current versions (and manual) should always be used.

ForSim takes its input conditions from an input file. All user-settable conditions (which means most conditions) are specified in this file, described below. The installation directory contains simple demonstration *.sim input files. Any input file name is accepted, so long as it is a plain text file. ForSim can be run most simply from the command line in the following manner:

    ./forsim [inputFileName]

However, in most cases users will find it preferable to call the Ruby wrapper script (runForsim.rb), which automates the running of multiple simulation replicates as well as basic post-simulation analysis. This wrapper script should be run as follows:

    ruby runForsim.rb -i [inputFileName] -r [NumberOfReps] -o [outputFormat] -c [false/true]

If not specified, the default input file is input.sim, the default number of independent replicates run is 1, the default output figure format is png (other format extensions, including pdf, that are recognized by R and for which your machine has the ancillary software may be specified), and gzip compression of output files is false.

The program itself produces numerous text output files as described below, but if run with the wrapper, these are used to generate many useful graphical files summarizing the data, and all of the output is then sequestered in a run-specific output folder whose name identifies the date, time, and replicate number. The wrapper places these folders in a forsim/runData directory, which the program creates if it doesn't already exist.

NOTE: ForSim does this only at the end of a run. If you use the -r switch (and, hence, iterate the same input file), these output folders will be created sequentially for each run. But you cannot run the program more than once simultaneously (e.g., to test separate input file specifications), unless each run is with a separate copy of the program in a separate directory! Otherwise, the output results from different runs may be inextricably mixed before being bundled and stored. The wrapper logic is straightforward, and the graphs are mainly produced with R, so it could be rewritten in another scripting language if desired. Also, aborted runs may leave various files in the forsim/ directory.
For bookkeeping purposes you should delete such files by hand. However, each subsequent run will use its run start time to identify the run-specific files to put in the run's runData folder, ignoring any pre-existing files.

The burden's on you!

ForSim is designed to be flexible, so it is not for canned push-button science. It can do a lot, but you have to think about what you want to do. The burden is on you as the user to conceive your questions carefully, especially to avoid building into the program what you want to get out of it. Since so many things can be changed or used in different ways, you must pre-plan your use to specify them carefully. Running ForSim is not difficult, but it's for science, so your study design (that is, your input file specification) is all-important.

Start simple: use only a few of the possible parameters, and use small samples and short runs. Get a feeling for how it's working, and then build up to what you really want to simulate. And before going farther, read the Manual carefully, or check relevant sections if your run crashes.

An important conceptual point

ForSim simulates a basic evolutionary architecture, but not the specific variation or causal details that will be present at the end of the run. As in real nature, evolutionary processes determine that. 'Evolutionary architecture' refers to the demographic conditions (population size, structure, mating and migration patterns, etc.) and the basic genetic mechanisms (number of genes that may affect a trait, their interactions, the nature of natural selection, mutation and recombination rates, etc.).
The actual genetic architecture at the end (how many alleles and haplotypes, and at which genes, affect variation in the trait in the final population, the frequencies and effect sizes of the SNPs existing at the end, their linkage disequilibrium (LD) patterns, and so on) is neither foreseeable nor prespecifiable in ForSim. As in real life, these aspects of genetic architecture are strictly the result of the individual evolution (simulation). It is often the case that a set of desired conditions (e.g., SNPs with a particular frequency or LD relationship) can be found among those in the simulated data, again as occurs in life.

Playing around... but not a toy

ForSim can't do everything, but it can do many different things, as you will see by browsing this Manual. You have the ability (or burdensome responsibility!) to stipulate many different conditions and values in what you simulate. For many purposes, a 'toy' model will suffice. This is a very simple model, clearly unrealistic, yet satisfactory to investigate a particular point. It is easy and quick to set up the appropriate instructions (input file) to test toy models, run them, and get results. But for many problems a more extensive range of models must be tested and evaluated. Often, if not typically, you'll just be guessing and will have to assume some values you think reasonable, or try a range of them. And while it's true that ForSim can't do everything (and it would be natural for you quickly to spot something you'd like that it doesn't do), there are many ways to get conceptually equivalent results. Examples and discussion of this will be seen in several places in what follows.

CHAPTER 2: INSTALLATION

ForSim has been tested on Linux installations and MacOS X. It should work seamlessly in Unix.
The program is installed from a forsim[###].tar.gz file, where ###, if present, is a date stamp. Uncompress this file with the command:

    tar -xzvf forsim###.tar.gz

which will produce a ForSim directory containing the program contents. Included are the program source code, the Makefile, various sample input ('.sim') files, the Ruby wrapper script runForsim.rb, a README file, a copy of the Manual (this file), and a tidy.rb script that can be used to clean up the forsim directory by removing various types of leftover files (e.g., if previous runs failed or did not complete and left miscellaneous results files).

On Linux systems, simply running 'make' in this ForSim directory will compile and build the program executable. The program will then run from this ForSim directory. For compiling on MacOS systems, please refer to the "README" file in the ForSim directory.

The computer must have an installation of the Ruby scripting language. For graphics, the R statistical software is needed, and access to X11 must be provided. Graphics can be generated faster if the R "GDD" package is installed. (Please see "http://cran.r-project.org/src/contrib/Descriptions/GDD.html" for installation details.) Some of the output content is produced by the wrapper script runForsim.rb rather than by ForSim itself (this is for various pragmatic reasons).

CHAPTER 3: HOW ForSim DOES ITS WORK

ForSim is very flexible, and most aspects of its function can be altered by creative use of the input specifications (examples are suggested in the Addendum). Note also that there are many ways to achieve similar ends, so that things not explicitly in the language can be done by creative approaches, many of which are described in this Manual. Appendix 1 provides a diagram of the main execution flow of the program.
The relevant parameters are specified in an input file that the program parses, as described later. Here is a summary of the major features:

TABLE 1: ForSim MAIN FEATURES (not all features are listed here)

BASIC FEATURES
  - Specifiable duration of simulation (in synchronous generations)
  - Single or multiple replicate simulations
  - Point mutation and recombination that can be sex-specific; gene conversion, hotspots
  - Sex-specific phenotypes
  - Gene- and sex-specific mutation rates
  - Mating by families formed with or without replacement
  - Stochastic family size distribution; logistic population growth maintains the specified size
  - Multiple genes and chromosomes, of arbitrary number and length
  - Multiple univariate or multivariate phenotypes
  - Stochastically determined mutation-specific allelic effects on phenotypes
  - Environmental (family and individual) contributions to individual phenotypes
  - Gene x environment interactions
  - Phenotype-based mate choice and migration (gene flow) between populations

ELABORATIONS USERS CAN SPECIFY OR CHANGE DURING THE RUN
  - Flexible mating, phenotype determination, migration, and selection
  - Multiple populations with hierarchical (cladistic) splitting and user-specified split-times, phenotype-based or random mate-choice and gene flow, environmental effects, population size, and natural selection regimes
  - Pleiotropic genetic effects
  - Complex multilocus phenotype definition including networks and gene interaction
  - Flexible natural selection criteria with stochastic fitness
  - Ability to restart simulation for replicate subsequent runs, or to generate specified types of output data, including some ability to change run parameters

INPUT/OUTPUT FEATURES
  - Population and pedigree data saved
  - Data saved at user-specified generation check-points, with real-time plotting of conditions and specification of the data to be saved at check-points
  - Complete history of every variant SNP can be saved
  - Output suitable for standard human genetic analysis and mapping software
  - Output in rapid and easily parsed XML format
  - Graphical output in browser-readable SVG, as well as other formats

Following is a brief description of how the program works.

Generations

ForSim uses synchronous generations, which are numbered starting with generation 0 (0-based indexing is a C++ characteristic) and run for the user-specified number of generations. Each generation, families are formed, genes are mutated and recombination occurs, phenotypes are determined, natural selection is imposed (if any is specified), and the population or populations are then ready for the next generation. The processing occurring during a given generation is only manifest by the start of the next generation. This means that if you want data on generation n (which, counting from 0, is the (n+1)st generation), you must refer to it in the input instruction file (see below) as generation n+1.

The program begins with only a single population, but as of generation 1 (the second generation), and/or later as specified by the user, other populations may be created by receiving individuals from existing populations. Thus if a new population is created at the thousandth generation (generation number 999), it doesn't really exist as an accessible entity until the next generation. The simulation must run until the generation number is ≥ the generation of the most recent new population founding + the specified pedigree depth.

ForSim is a diploid simulator, and we refer in this Manual to 'males' and 'females', but there are no sex chromosomes. Haploid approximations can be made (see the suggestion on Haploid evolution in the Addendum), but there are no real XX/XY differences.
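The generation bookkeeping described above is easy to get wrong by one, so here is a minimal sketch of the arithmetic in Python. The function names are hypothetical helpers for planning an input file; they are not part of ForSim or its scripts, and they only encode the rules as stated in this section.

```python
def input_file_generation(n):
    """Generation number to write in the input file to get data on generation n.

    Assumes the rule stated above: processing done during generation n is
    only manifest at the start of generation n + 1.
    """
    return n + 1


def first_accessible_generation(founding_generation):
    """A population founded in a given generation is usable from the next one."""
    return founding_generation + 1


def minimum_final_generation(founding_generations, pedigree_depth):
    """Smallest final generation number that satisfies the rule
    'generation >= most recent founding generation + pedigree depth'."""
    return max(founding_generations) + pedigree_depth


# Example: populations founded at generations 1 and 999, pedigree depth 3.
print(input_file_generation(999))              # generation to name in the input file
print(first_accessible_generation(999))        # when the new population is accessible
print(minimum_final_generation([1, 999], 3))   # run at least this long
```

The values used in the example (founding generations 1 and 999, pedigree depth 3) are illustrative only.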
Mutation and recombination: These occur each generation, randomly across the genome, and can differ between males and females. New individuals are randomly assigned male or female status (p=0.5). With user-specified probabilities, a mutation can have no effect, or an additive effect that can be either positive (adds to the trait) or negative (subtracts from it) (for the syntax, see 'Mutational (allelic) effects', below).

You can specify locations at which there are recombination hotspots (a higher rate than elsewhere on the chromosome). To do this, in the input file (see below), add the line or lines as needed:

    hotspot start length rate (in megabases per centimorgan)

e.g.

    hotspot 1000000 1000 1.0

in the appropriate chromosome block, but not within a gene block, even if the hotspot involves a gene. The default, if none is specified, is no hotspots.

Gene conversion is a double recombination within short nucleotide distances. It is specified by giving the probability that, once a recombination occurs, another will occur nearby, at a distance in basepairs downstream from the first that follows a gamma distribution. The syntax, in the global block, is:

    geneConversion Prob gammaShapeParameter gammaScaleParameter

such as

    geneConversion 0.001 10.0 10.0

Parameters must be floating point. The default, if the keyword is not specified, is no gene conversion.

Mating: An individual is chosen randomly for mating. Mates are chosen optionally with or without replacement; without replacement is the default. Mate selection may be specified to be phenotype-dependent so that assortative mating may be simulated.
Offspring sibships are generated from each mating pair, constrained by limits on maximum family size and per-generation growth rates (see below), and new mating pairs are chosen until the target offspring size is achieved to within stochastic accuracy. When there are multiple populations that interact, a potential parent in a given population chooses a suitable mate from the same or other populations, with user-specified probabilities and optionally based on the selectee's phenotype.

Before each generation all 'males', and separately all 'females', are put in randomized order. Males are picked in their order on the mating stack, starting from the top, and they search for females in their respective stack-order until a suitable mate is found. In mating without replacement, the male and female are removed from the eligible mate list. In mating with replacement, they are moved to the bottom of the list of eligible potential mates.

[NOTE: There will always be at least one male and one female in the next generation when it is formed. But if selection is too strong and one or both are culled, the population can become extinct, crashing the run. If mating is specified as without replacement, there may (as in real life) not be enough mates to achieve the desired parental mating pool size, or enough individuals of both sexes, and a population may not grow in the normal way or may tend towards unplanned extinction.]

Family and population size: Mated parents produce offspring with family size distributed as a Poisson with user-specified mean. If this is set to 2.0 the population is stationary unless there is selection. If there is selection, or to reduce the probability of population decline, a value somewhat higher than 2.0 can be used to approximately accommodate selective loss.
But reproduction is stochastic, so if a population is below its user-specified target size, mating will continue (if mates are available), and the population will shrink or grow with the mean family size altered (increased or decreased) to respond to the excess or deficit. Population expansion and contraction are logistic, as described by the standard Verhulst equation, in which population size is determined by current size, a growth rate, and a population carrying capacity. The Verhulst equation models the rate of reproduction as being proportional to both the existing population size and the availability of resources. Thus, when populations are small and resources abundant, population growth is rapid, and as populations become large and resources scarce, growth slows and eventually stops as population size reaches carrying capacity. The per-generation population change is specified as:

$$P_1 = \frac{K \, P_0 \, e^{\mathrm{rate}}}{K + P_0 \left( e^{\mathrm{rate}} - 1.0 \right)} \qquad [1]$$

where P represents population size ($P_0$ in the current generation, $P_1$ in the next) and K represents carrying capacity. The carrying capacity, growth rate, and starting population size can be set independently for each population. This scaling adjustment, however, is constrained by the user-specified maximum family size and maximum per-generation growth rate. The growth rate is proportional to the difference between the current and target population size, roughly a logistic growth model.

[NOTE: For fertility and population growth, if you want to model growth of a fraction r per year with a generation length of t years, then use growthRate r*t. This will generate per-generation growth by a factor of $(1+r)^t$, or approximately $e^{rt}$. Thus, at 25 years per generation and 2.1% growth per year, the value for the growth rate line is 25*0.021, which is 0.525. (Note that this is just an example: 2.1% annual growth is very rapid for human populations, where zero is closer to steady state and 1% is substantial.)]

Phenotype definition: Phenotypes are affected by genes as well as environments, and genes may affect phenotype indirectly by affecting each other (epistasis) or (under one option) by gene-environment interactions. The user specifies additive or more complex contributions of the genotypes at each gene, with algebraic functions (see Defining phenotypes, below).

Natural selection: Natural selection in ForSim is based on phenotypic rather than genotypic criteria. Selection occurs based on user-specified functional criteria, so that those not satisfying the criteria are culled (removed from the population) before mating (see Specifying natural selection, below). Selection takes two possible forms. New individuals can be immediately screened for fitness, a form of mortality selection. Or, by using phenotype-based mate choice, a form of fertility selection can be imposed. Selection and phenogenetic criteria can be changed during the simulation, within as well as between populations, at user-specified generations using event lines in the input file. [NOTE: As in Nature, too-severe selection can lead the population to go extinct!]

Narrow cutoffs represent stringent purifying selection. If a positive-valued phenotype threshold cutoff is used to classify a person as 'affected', then if the floor cutoff is closer to the mean than the ceiling is, this will favor mutations that contribute positive effects with respect to the trait. Phenotypes are generated as quantitative traits, but a threshold-based selection option treats traits as qualitative traits.

The simplest stipulation of selection is in terms of relative rather than absolute phenotypes, with fitness specified by truncation selection in units of standard deviations of the current population phenotype distribution.
The user specifies lower and upper SD limits, within which fitness is 1.0, and zero otherwise. Individuals with phenotypes outside these relative truncation limits are not part of the next generation's parental gene pool.

Selective neutrality can be specified in three ways. Most efficiently, variation in a gene that is defined as not affecting any phenotype evolves neutrally by definition. Secondly, and realistically for genes that do affect phenotype(s), specifying broad relative selection cutoffs (e.g., -5.0,5.0, meaning truncate only individuals with phenotypes more than 5 SD from the current population mean phenotype) will mean that few if any individuals are selectively culled. In essence, if s < 1/Ne, where s is the truncated tail area of a normal distribution, the trait and its affecting genes both evolve essentially neutrally. (NOTE: This s applies to a trait, not to a specific gene or allele, because ForSim is a phenotype-based simulation, just as Nature works on phenotypes rather than genotypes.) Thirdly, a phenotype can be explicitly specified as neutral, in which case genes contribute to the phenotype, but no selective test is imposed regardless of the phenotype value in an individual.

For evolutionary studies, it may be useful to save the entire history of every SNP generated during the run. This can be done with a usingTrackSNPs true/false line in the input file (see Chapter 4). A series of files is saved whenever a SNP buffer is full, and at the end, and these can be jointly analyzed. But be careful what you wish for! A large or long run will generate drillions of SNPs, just as Nature does, and that means huge data storage requirements.
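Two of the quantitative rules described in this chapter, the logistic (Verhulst) population update and relative truncation selection, can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not ForSim's C++ code: the function names are hypothetical, the Verhulst step assumes that equation [1] is the standard closed-form logistic update, and the tail-area calculation assumes a normally distributed phenotype, as the neutrality discussion above does.

```python
import math


def verhulst_next(p0, k, rate):
    """One generation of logistic growth (assumed reading of equation [1])."""
    growth = math.exp(rate)
    return (k * p0 * growth) / (k + p0 * (growth - 1.0))


def per_generation_rate(annual_rate, years_per_generation):
    """growthRate value r*t for annual growth r and t years per generation."""
    return annual_rate * years_per_generation


def survives_truncation(phenotype, mean, sd, lower_sd, upper_sd):
    """Fitness is 1.0 inside the SD limits and zero outside them."""
    z = (phenotype - mean) / sd
    return lower_sd <= z <= upper_sd


def truncated_tail_area(lower_sd, upper_sd):
    """Fraction s of a normal phenotype distribution culled by the limits."""
    lower_tail = 0.5 * math.erfc(-lower_sd / math.sqrt(2.0))
    upper_tail = 0.5 * math.erfc(upper_sd / math.sqrt(2.0))
    return lower_tail + upper_tail


def essentially_neutral(s, effective_size):
    """Rule of thumb from the text: neutral if s < 1/Ne."""
    return s < 1.0 / effective_size


# Worked example from the text: 2.1% annual growth, 25 years per generation.
rate = per_generation_rate(0.021, 25)
print(rate)                                   # growthRate value (about 0.525)
print(verhulst_next(1000.0, 10000.0, rate))   # next-generation size, far below K
print(verhulst_next(10000.0, 10000.0, rate))  # at carrying capacity: no change

# Broad cutoffs (-5.0, 5.0) in a population with Ne = 10000 (illustrative value).
s = truncated_tail_area(-5.0, 5.0)
print(s, essentially_neutral(s, 10000))
```

The starting size, carrying capacity, and Ne used here are made-up illustration values. The sketch ignores the constraints that ForSim also applies (maximum family size and maximum per-generation growth rate) and the stochastic nature of reproduction.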
IMPORTANT NOTES: ForSim begins with every member of the starting population having the same genotype; that is, there is no genetic variance, nor any phenotypic variance due to genetic variance. The specified nucleotide spots in the simulated genetic data are blank. As variants arise by mutation, they are stored as a pair of random, differing nucleotides (A, C, G, T), one assigned to 'ancestral' and the other to 'novel' allele status. In some output files these are recoded to 1,2,3,4, as preferred by some genetic software.

Since everyone is genomically identical at the start, all phenotype variance at that point would be due to environmental effects, if any are specified. Therefore, for many purposes, a 'burn-in' time of your choosing (such as a few hundred or thousand generations, depending, for example, on population size) is needed before effective mutation, selection, and genetic variation have accumulated to something approaching an equilibrium state. This also means that selection, especially truncation selection, can destroy the entire population right away if it is too severe (just as it does in real life). Strong selection imposed at the beginning can work if there is environmental variance, so that some individuals will survive, but this will greatly reduce effective population size until a burn-in time has been achieved.

NOTE also that ForSim is written in C++, and arrays are indexed starting at 0. Thus, some values in output files or on-screen reporting refer to Gene0 or population0 or phenotype0, or generation 0, referring to the first-named or first-occurring item in a series.

A note on speed

ForSim runs are acceptably fast for a wide variety of evolutionary scenarios, but speed depends necessarily on the complexity of the input specifications. The more complex the simulated conditions, the more time will be required to complete the simulation. Specify as simple a run as will be a close enough approximation to what you want to test. One population is fastest, or more than one but without post-division migration. Truncation selection is faster than complex selection, as is simple rather than complex phenotype specification; specifying selection as 'neutral', or mateCutoff internal/external as 'none' (both of which are defaults and need not be specified explicitly in the input file, although it is a good idea to include the default line as a reminder to yourself) will run faster. Similarly for migration and mate choice: random (which is the default) is faster than phenotype-based. For some epidemiological conditions, population splits, admixture, or differential environmental conditions need only be specified at the last few generations of a run.

For some evolutionary purposes, one may wish to track the entire inheritance history of every SNP that is generated during the run, whether or not it is present at the end. Setting usingTrackSNPs true will do this, but with a speed cost. Otherwise, use the default, usingTrackSNPs false. These data may be of no interest, for example, to genetic epidemiological analysis.

ForSim places no formal limits on complexity, and impracticably long runs can easily be devised. Contorted conditions are more likely to entail inadvertent bugs (or reveal real ones). It's your obligation to conceive tractable problems that will cogently answer the question you want to answer, and to check the results to see that they seem to do what you specified. It is not sensible to oversimplify, but many details can be omitted without substantial loss in information, unless they would proliferate during a run. These things can be explored in each case with some small-size test runs.

CHAPTER 4: CONTENTS OF THE INPUT FILE

ForSim takes its running conditions from an input text file (default name input.sim) whose syntax is designed to be straightforward and intuitive to produce. The input file is read line by line.

NOTE: This file must be in plain text. In particular, it cannot contain platform-incompatible line-end characters. These will cause otherwise inexplicable hangups. If need be, use the dos2unix Linux utility or the command

    tr -d '\r' < inputfileX.sim > inputfile.sim

to purge such line-enders. Our experience is that textEditor and other similar programs work suitably for Mac installations.

To make specifying running conditions as easy as possible, even for complex situations, the input file has a keyworded, begin-end block structure that ForSim parses. Various keywords followed by begin specify the start of an input block, which is terminated by the end keyword. Within each block, additional keywords specify the item whose parameters are then given. The actual files are in plain text format, but in the examples in this Manual, color-coded syntax is used for clarity: block keywords are given in red font, while parameter keywords are in blue. Comments are shown in the example in orange font.
# Comment line(s) explaining the file’s objectives
global begin
    Global general running conditions, using global-keyword parameters (output options, mutation & recombination rates, fertility parameters, generations to simulate, changes to occur at specified points during the run)
end # of the global block
chromosome begin
    Chromosome specifications: a separate block for each chromosome, using chromosome-keyword parameters (length)
    gene begin
        Gene specifications: a separate block for each gene, nested in its chromosome block, using gene-keyword parameters (location, size, mutation rate, allelic effects)
    end # of this gene block
end # of this chromosome block
phenotype begin
    Phenotype definition, separate block for each trait, using phenotype-keyword parameters to specify phenogenetic model (genes, environments, interactions)
end # of this phenotype block
population begin
    Population specifications, separate block for each population, using population-keyword parameters (size, environment effects, selection & mating patterns)
end # of the population block

For readability, the input file parser uses a simple syntax in which each input file line is a single command unit. Whitespace (blank lines and indentation) is ignored (except between words or parameter values), but a newline is needed between each command. There are no preset limits for the number or characteristics of chromosomes, genes, or populations (except that the simulation must begin with only a single founding population).

For simple situations, the input parameters can simply be typed into an input text file using the format shown below. For more complex situations, such as many different genes, a file-making script will probably be most efficient.
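As an illustration of the file-making script mentioned above, here is a minimal Python sketch. It is not part of the ForSim distribution: the function names are hypothetical, and the parameter lines inside each block are deliberately left as comment placeholders, to be replaced by real keyword lines as described in this chapter. It only shows one way to generate the nested begin/end structure for many genes.

```python
def block(keyword, body_lines, indent="    "):
    """Wrap body lines in a 'keyword begin' ... 'end' block (hypothetical helper)."""
    lines = [f"{keyword} begin"]
    lines += [indent + line for line in body_lines]
    lines.append("end")
    return lines


def chromosome_with_genes(chromosome_lines, genes):
    """Build one chromosome block with one nested gene block per entry in genes."""
    body = list(chromosome_lines)
    for gene_lines in genes:
        body += block("gene", gene_lines)
    return block("chromosome", body)


# Placeholder parameter lines (assumption): substitute real keyword lines here.
genes = [[f"# parameters for gene {i} go here"] for i in range(3)]
text = "\n".join(
    chromosome_with_genes(["# chromosome parameters go here"], genes)
)
print(text)
```

The output can be pasted into, or written directly to, the input file; every begin is matched by an end, and the indentation is cosmetic since the parser ignores it.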
Comments are permitted, indicated by ‘#’ on a line, so long as it is the first character in the line or is preceded by a space. Comments, including an explanatory line or lines at the beginning, will be helpful for users simulating many different conditions.

Basically, users specify the phenogenetics of one or more traits, that is, their underlying genetic architecture, aspects of the simulated genome’s evolution, functional effects of new mutations, criteria for mate choice and reproduction, and population dynamics over time. By use of the event keyword parsed by ForSim, many of the simulation conditions can change at specified points during the simulation. The user specifies aspects of output, such as the generations during a run at which population data should be saved (if any such saving is desired), and aspects of final data to be generated, including the number and generational depth of pedigrees to be generated.

The following model input files explain the available specifications. A basic version, Example 3a, ‘basic.sim’, is in Appendix 2 at the end of this manual; it can be cut and pasted into a new text file, deleting, duplicating, or modifying the entries. An even simpler version, smple.sim, is Example 3b, a quick-running way to test and debug syntax issues and quickly evaluate the effect of modifying input instructions.

The global parameter block specifies conditions that apply to the entire simulated data, although some of these can be changed locally during the run. After the global block, locally specific keyword blocks specify the nature of chromosomes, genes, phenotype determinations, and populations. Chromosomes are numbered, starting at 0, in the order in which they are specified. Phenotypes, populations, and genes are given names. The number of repetitions of entries (phenotypes, chromosomes, genes) is open-ended.
However, we recommend only simulating a single chromosome for practical reasons; search for Simulating Multiple Chromosomes below.

For some simulations, a discrete affection status (affected/unaffected) is desired. This applies to the first-specified phenotype only, and is simulated as a threshold such that an individual is ‘affected’ if its phenotype exceeds the threshold, specified in terms of number of standard deviations from the current phenotype mean (the cutoff can be positive or negative, but must be explicitly specified in the input file). Output specifies affection status in the preMakeped (LINKAGE analysis format) files, a standard format for genetic epidemiological analysis. Prevalence is specified by the user and is not used explicitly by the program except at the end; if you wish something to be based on affection, such as selection, migration, or mate choice, it could work to simply specify the same cutoff criterion.

At the end of a run, ForSim saves the pedigrees that comprised the entire population that produced the last n generations in the simulation. The default is 3 generations, but pedigrees of 2 generations (nuclear families only), or of more than 3 generations, can also be specified if needed (finalPedigreeDepth #). NOTE: The specified number of generations is the maximum for any given pedigree: if some subset of matings fail to reproduce, the overall pedigree will still be included (as occurs in real life), but childless couples are not saved, so that family size distributions are truncated.

The input file can have any name, passed as a command-line parameter (see above), with ‘input.sim’ provided as a default in the ForSim installation package, used if no filename is specified in the runForsim command line.
If ForSim is unable to parse the input file, it will exit, sometimes with an error message containing the text and number of the line it failed to parse. But be aware that not all syntax errors are detected. A simple misspelling in the input file will give rise to the following example error message:

Error in input file on line 17 of 101 lines.
Line #17 : matingWithReplacement true

We suggest that you write yourself a text paragraph explaining what you intend to simulate, draw a flow diagram if it is at all complicated, and construct your input file. Then, save all of these in a descriptive file with a name you’ll understand later, so you can confirm that you did what you intended, can remember it, and can relate it to the output results. Put this self-informing file in the output file directory. Also, liberally comment the input file so its intent will be clear to you later. A useful but not mandatory practice is to have one or more explanatory comment lines at the beginning of the file saying what you intended to do. The first line might begin with something like:

# WHAT: short test of phenotype based migration, two pops, 1000 gens

Using the event keyword to modify conditions during the run

There are several conditions that can be changed using the event keyword. The mate choice pattern within or between populations, the overall environmental effects distribution, the selection regime, and the population size can be changed during the run in this way. Other changeable conditions may be added in future modifications. The syntax for most commands is event ## where ## is the generation number, plus other parameters:

donateParents SourcePopName ReceivingPopName #males #females
mutation Remaining syntax as in global definition (changes all gene mut. rates)
outputXML true/false
printGeneration (available with some restrictions)
serializeState
setCarryingCapacity PopName MaxPopSize
setEnvironmentNormal PopName PhenotypeName mean variance
setFertility PopName poisson mean (PopName not needed in global block)
setMatingPopMatrix PopName %-mates-chosen-from-each-population
setMaxOffspringNumber PopName # (PopName not needed in global block)
setPhenotypeSelection PopName PhenotypeName [full selection line]*

*This means use a regular selection specification line, with redundancy as follows:

event 100 setPhenotypeSelection Pop1 Phen1 relative -1.0 2.5

See the event examples in the sample input files below.

The keyword system can also be used to specify ‘serialized’ data to be stored at the specified generation, which can be used to restart the simulation under changed conditions, do replicates from that point forward, etc. More than one such instruction, each applying to a different generation, may be used. For the format and use, see below.

NOTES: There must only be one population at the beginning; other populations can be founded in any subsequent generation. Note also the subtlety that if multiple populations are being simulated, the run must not end until the last-defined population has existed for at least as many generations as the specified pedigree depth. Thus for 3-generation pedigrees and a population founded at generation 1000, the input file must specify generations 1003 (or greater).

Not all parameters can be changed by event lines. An example is the gene-specific allelic effects probabilities. However, some such variables can be changed by means given in the Addendum that involve stopping the run, modifying the parameters, adding a load instruction, and restarting.

Input file keywords and their meaning or default values if relevant

Unstarred terms must be explicitly listed in the input file (some, like event, only if they’re being used).
Single-starred terms have default values as given here, and need not be explicitly specified in the input file (though for clarity and reminders, it may be good to include them). Double-starred terms are contextual and don’t take specific values per se, but lines using them may require such values. For explanation, do a Find search in the main text.

Keyword                              Default, etc.
**absolute                           Requires some value to be specified
birth                                0 (generation pop. starts)
carryingCapacity                     Size of population
*death                               when population goes extinct; if not specified, end of run
**definition                         phenotype definition
**donateParents                      must specify source and amount
**environment                        usable in phenotype definition
**familyEnvironment                  usable in phenotype definition
*environmentNormal                   must specify mean, variance
**event
*familyEnvironmentNormal             0 0 (mean, variance)
**female
*finalPedigreeDepth                  3
firstCodonProbability                1.0
functional                           selection type; specify func. equat.
gamma                                must specify the parameter vals
**gene                               gene block header
geneConversion                       prob gammashape gammascale, default none
generations
growthRate
hotspot                              location length rate, default none
ifFemale                             sex-specific phenotype specifier
ifMale                               sex-specific phenotype specifier
initialSize                          for new population
*intron                              false
length                               gene length
location                             gene start position
**male
mateCutoff external
mateCutoff internal
*matingWithReplacement               false
megabases per centiMorgan            specifying recombination
mutation rate                        global or gene-specific
**name                               for genes or populations
neutral                              type of (no) selection
output                               gen. interval for special output
*outputSVG                           false
*outputXML                           true/false (can be set in ‘event’ lines)
**phenotype                          phenotype block header
poisson                              mandatory word in fam. size spec.
**population                         population block header
*prevalence                          relative, 0.05
printGeneration                      Available with some restrictions
probabilityNoEffect
probabilityPositiveEffect
relative                             selection mode, needs cutoff limits
*scaleEnvironmentalNormalVariance    false (else give parameters)
*secondCodonProbability              0.9
**selection                          specify selection regime
*serializeState                      final generation or specified by event
setCarryingCapacity
setFamilyEnvironmentFraction         default 0.0
setEnvironmentNormal
**setFertility                       family size specification
setHeritability                      default 0.4
setMatingPopMatrix
*setMaxOffspringNumber
setPhenotypeSelection
setSimulatedGenomeFraction           default 1.0
*thirdCodonProbability               0.5
*usingSpatial                        false (values or defaults)
*usingTrackSNPs                      false
*usingRecurrentMutations             false
*usingCodons                         false

The following sections describe a simple single-population simulation, and a more complex multiple population simulation, respectively. The input files which generate these simulations are included with the ForSim distribution.

Sample input file for single population simulation

Successful simulation with ForSim depends on careful specification of the run conditions in the input.sim file. Because the program is flexible, there are many options. They need not all be used (there are defaults for everything), but to test specific things you must be careful in designing your input file.

The simplest basic simulation is of evolution in a single population, as shown in this first example input file. In this description, entries like PhenA, ABC1, and numbers are run-specific examples; but the keywords, like megabases, phenotype, gene, etc. must be explicitly included.
There can be as many chromosomes, phenotypes, and genes as the user desires (more means slower, of course!), and genes can affect whatever phenotypes the user specifies (or none, in which case they accumulate mutations but are not affected by natural selection). We urge using only one chromosome, with ‘genes’ spaced far apart where unlinked locations would be desired (search on: Simulating Multiple Chromosomes, below).

Mutational (allelic) effects

Each generation, for each individual, a mutation will occur in a specified ‘gene’ with user-specified mutation rate probabilities. Mutation rates can be sex-specific and gene-specific, at the user’s choice. These are set in the global block if they are to apply to all genes, or they can be specified using the same syntax, separately for each gene not following the global rate. However, NOTE that if an event instruction changes the mutation rate, that is then applied globally. The rate is set in scientific notation: 2.5 E -8, where E refers to powers of 10.

With user-specified probabilities, new mutations have either (a) no effect, or some additive effect that can be either (b) positive (adds to the trait) or (c) negative (subtracts). This is specified by 2 probabilities in the input file: a and b. Partition c, the fraction of negative mutational effects, is the complement of these, so that the three sum to 100% of possibilities.

NOTE: These specifications can be made in the global block and will apply to all genes. They can also be made in each gene block. The last-specified value (global or gene-specific) will be what applies to a given gene. That is, a global rate will be overridden if specified for a specific gene. If you change mutation rates in an event, this will apply globally.
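The three-way partition of mutational effects described above can be sketched in Python as follows. This is only an illustration of the logic, not ForSim’s own code: the function name and the example probabilities (0.5 and 0.3) are hypothetical, and the two arguments stand for the probabilities a and b that the user supplies.

```python
import random


def mutation_effect_class(p_none, p_positive, rng):
    """Return 'none', 'positive' or 'negative' for one new mutation.

    p_none and p_positive play the roles of a and b in the text; the
    negative fraction c is their complement, 1 - a - b.
    """
    p_negative = 1.0 - p_none - p_positive
    if p_negative < -1e-12:
        raise ValueError("a + b must not exceed 1.0")
    u = rng.random()
    if u < p_none:
        return "none"
    if u < p_none + p_positive:
        return "positive"
    return "negative"


rng = random.Random(1)  # fixed seed so the sketch is repeatable
counts = {"none": 0, "positive": 0, "negative": 0}
for _ in range(10000):
    counts[mutation_effect_class(0.5, 0.3, rng)] += 1
print(counts)
```

With a = 0.5 and b = 0.3, roughly one fifth of new mutations fall in the negative class, since c = 1 - 0.5 - 0.3 = 0.2.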
The effect size of a new mutation is then determined by a random draw from a gamma distribution whose two parameters, shape (usually denoted by k or α) and scale (θ or β), in that order, are specified by the user in the input file. Parameters of the gamma can be specified independently for positive and negative effects. As an example, the next two figures show gamma parameters for positive effects that yield only a small fraction of large effects (left, gamma(1.0,0.05)), and a larger amount (right, gamma(1.0,3.0)). NOTE: Specify numerical values in floating point format.

Note that these are additive effects attached to each new mutant allele at the time it arises. The ‘Ancestral’ allele by definition has zero effect; this means that the net effects of AA, AN, and NN genotypes are 0, a, and 2a, where a is the assigned allelic effect.

A comment on Dominance

Gregor Mendel worked with crosses in inbred strains, and in subsequent experimental work many investigators have done the same. This led to the widespread idea of Dominance that can be characterized as ‘physiological,’ in the sense that the dominant allele always masks the presence of a recessive allele at the locus (with co-dominance added as a term for when the effect is only partial). The implication is that this is different from dominance in the statistical sense as used in quantitative genetics, which refers to the mean phenotype of individuals with the AN genotype relative to the midpoint between the mean AA and NN phenotypes. But these are false distinctions: physiological effects are only manifest in their respective genomic background, and Mendel worked with crosses between inbred strains. Thus, there is essentially no absolute effect.
ForSim does not currently explicitly specify classical inherent or ‘physiological’ dominance or recessiveness, though the ancestral allele’s zero assigned effect, or using ‘no effect’ for the novel SNP allele, makes them effectively recessive: they contribute nothing to a phenotype. Large assigned allelic effects could in practice raise the probability of the carrying individual being ‘affected’, a kind of approximate de facto dominance or codominance, but this depends on the overall complexity of the model being simulated, and on the chance aspects of mutational effects.

Dominance can be introduced in an explicit way. ForSim assigns allelic effects (see discussion of mutational effects) that are additive. Dominance is usually parameterized in the statistical sense as a deviation of the heterozygotes’ mean phenotype from the homozygote midpoint, (AA+NN)/2. Note that the ancestral allele A has by definition an effect of zero, and a novel allele N has an assigned effect, say, n (drawn probabilistically, as described above). The midpoint of the homozygotes’ assigned effects is just the Novel allele’s dose, n: [(0+2n)/2=n], which is also the heterozygote’s assigned effect, so there is no dominance deviation. But if your phenotype definition (see below) contains nonlinear functions such as (say) Phen=GeneA^2, the phenotypes will be 0, n^2, and 4n^2, and the homozygote midpoint will be 2n^2, which is different from the heterozygote value n^2 (a dominance deviation of -n^2).

Dominance also arises routinely in the sample or population sense of observed statistical deviation of heterozygotes from the homozygote midpoint. This is because, as in Nature, the genomic and environmental backgrounds of the AA, AN, and NN individuals at a given SNP will vary in finite samples such that the formally additive effects assigned to the N allele at the time of its mutation are not precisely realized.

Dominance can be approximated in other ways as well. See suggestion 4, on Mendelian traits, in the Addendum for some comments on this.
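The arithmetic of the dominance deviation under an additive versus a squared phenotype definition can be checked with a short Python sketch. The function name and the example effect n = 0.5 are hypothetical; the only assumption taken from the text is that AA, AN, and NN individuals carry summed gene values 0, n, and 2n before the phenotype function is applied.

```python
def dominance_deviation(phenotype_function, n):
    """Heterozygote phenotype minus the homozygote midpoint (AA+NN)/2."""
    aa = phenotype_function(0.0)      # two ancestral alleles: summed effect 0
    an = phenotype_function(n)        # one novel allele: summed effect n
    nn = phenotype_function(2.0 * n)  # two novel alleles: summed effect 2n
    midpoint = (aa + nn) / 2.0
    return an - midpoint


n = 0.5  # example assigned allelic effect (arbitrary)
additive = dominance_deviation(lambda g: g, n)        # Phen = GeneA
squared = dominance_deviation(lambda g: g ** 2.0, n)  # Phen = GeneA^2
print(additive, squared)
```

For the additive definition the deviation is zero, while for the squared definition it equals -n^2, so a nonlinear phenotype definition is enough to produce statistical dominance from purely additive allelic effects.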
Defining phenotypes

Phenotypes are defined in terms of simulated genotypes and phenotypes, with many options.

Genotypic effects

Phenotypes are specified in terms of the genes (the individual’s summed haplotype effects for the specified gene) and environments that affect them, and a phenogenetic model. Note that this allows GxG interactions to be specified. The syntax is algebraic:

PhenA = G1 * 2.0 + G2 + G3 * ( G4 + G5 )

In the usual precedence order, a subset of standard mathematical functions can be used: +, -, *, /, ^ (exponentiation), and e (exponentiation). For absolute value use (variableName^2.0)^0.5. Functional expressions are parsed following standard rules of algebraic precedence. NOTE: Specify numerical values in floating point format.

Phenotype definition can have a sex-specific component. Use of keywords ifMale and/or ifFemale will substitute a value 1.0 for that keyword if the individual is of the specified sex, zero otherwise:

PhenA = G1 * ifMale + 2.0 * G1 * G2 * ifFemale

NOTE: It is unrealistic to try to be too fancy here, as even simple nonlinear phenogenetic or fitness relationships are challenging to confirm even in experimental data. Keep it simple. Because ForSim must parse a wide variety of possible functions with as little ambiguity as possible, every number must be floating point, and there must be a space between every item (except negative numbers, which are written -2.3). See example and explanatory material in the Addendum.
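To see what the example phenotype definitions above evaluate to, the following Python sketch reproduces them outside ForSim. It is an illustration only: the gene values are arbitrary, the function names are hypothetical, and in a real run each G value would be the individual’s summed haplotype effect for that gene.

```python
def phen_a(g):
    """PhenA = G1 * 2.0 + G2 + G3 * ( G4 + G5 )"""
    return g["G1"] * 2.0 + g["G2"] + g["G3"] * (g["G4"] + g["G5"])


def phen_a_sex_specific(g, is_male):
    """PhenA = G1 * ifMale + 2.0 * G1 * G2 * ifFemale"""
    if_male = 1.0 if is_male else 0.0  # ifMale: 1.0 for males, zero otherwise
    if_female = 1.0 - if_male          # ifFemale: 1.0 for females, zero otherwise
    return g["G1"] * if_male + 2.0 * g["G1"] * g["G2"] * if_female


def absolute_value(x):
    """The manual's absolute-value idiom: (variableName^2.0)^0.5"""
    return (x ** 2.0) ** 0.5


# Arbitrary example gene values (assumption, for illustration only).
genes = {"G1": 0.5, "G2": 1.0, "G3": 2.0, "G4": 0.25, "G5": 0.75}
print(phen_a(genes))
print(phen_a_sex_specific(genes, is_male=True))
print(phen_a_sex_specific(genes, is_male=False))
print(absolute_value(-2.3))
```

Because only one of ifMale and ifFemale is nonzero for any individual, the sex-specific definition reduces to G1 for males and 2.0 * G1 * G2 for females.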
NOTE also that while this specifies the logic of effects and interactions of basic genetic pathway architecture, the actual quantitative phenogenetic effects in any given run depend on the mutations that arise during the simulation, and their individual and haplotypic effects, just as developmental and homeostatic pathways in natural organisms are phylogenetically conserved but can vary by sequence evolution molded by drift and selection.

Environmental effects. Random individual environmental effects are phenotype contributions imposed independently on each new individual each generation, separately for every phenotype. These effects are normally distributed with user-specified parameters that can differ for each phenotype and population (default is Nor(0,1)), and can be changed during the simulation run. Of course, you have to make some kind of guess at these parameter values. For no environmental effects, use Nor(0,0), but note, as stated below, that because the program begins with no genetic variation, without environmental effects the mode of selection must be set to neutral for enough generations that some genetic variation that affects the selected trait arises by mutation.

Additional family-specific environments may also be specified. A random Nor(Fam_mean, Variance) variate, default Nor(0,1), is drawn once for each sibship. Each individual in the sibship is given a value drawn independently from that distribution, used as specified in the phenotype definition part of the input file. Family-specific environmental effects are added (unless specified otherwise) to the individual random environmental effects for a net total environmental effect for each individual. Default is no family-specific effect (Nor(0,0) distribution).
Environmental effects for each individual are used in determining his/her phenotype as specified in the phenotype definition block, as for example:

PhenA = G1 + G3 * ( environment - familyEnvironment )

If not specified in algebraic terms in the phenotype definition, random and family environments are added to the genotypic effects for each individual.

Individual environments can be set for each phenotype and each population, specified in the population definition blocks. These can be changed during the run by event instructions; however, while the shared family environmental component can differ among phenotypes, its distribution and application are the same for all populations. An example is shown in testExample2.sim, below. The input file syntax is:

environmentNormal PhenotypeA 0.0 1.0
familyEnvironmentNormal PhenotypeA 0.0 1.0

with such lines in each population block. If not specified, defaults are applied automatically.

[NOTES: Numerical parameters must be specified in floating point format.

By specifying the variances in the two environmental components you are also essentially but implicitly specifying the heritability that will result if the simulation approaches equilibrium. You cannot easily know this value in advance (or at all with complex interactions). So you must make a decision about the relative impact of G’s and E’s. You can do some moderately complete test runs to see what the heritability approaches, and then adjust the environmental variances to have approximately the relative variance contribution you wish to simulate.

Family environmental effects are assigned to an individual when the individual is created. These stay with the individual when s/he becomes a parent, meaning that the family component is only applied to sibs.
A fixed environmental variance may not be realistic under natural selection, because selection can move a population to a greater ‘fit’ within its environment, so that it becomes less environmentally affected, rather than continually driving it in a given direction. This kind of effect can be approximated by scaling environmental effects, say, to diminish during a run by scaleEnvironmentNormalVariance # ##, where # is the amount by which environmental variance decreases per generation, and ## is the final variance, after which the environmental variance remains constant. The program changes the environmental variance by the amount #, which if positive decreases the variance each generation, or if negative increases the variance. The initial variance as specified in the global block must be compatible with the specified changes! Thus, if initially V=1.0 and the decrement # is 0.001, then the variance decreases to zero in 1000 generations; if the specified value is negative, the environmental variance will increase by that amount each generation, without limit during the simulation.

The scale instruction applies from the first simulated generation on (it is specified in the global block). If you want to modify that during the run, use an event instruction to alter the phenotypic variance (the scale target specified in the global block will remain). This usage can only be made with random, not family, environments; it applies to all traits and must be specified in the global input file block.]

Polygenic background

ForSim does not explicitly differentiate genes with major effect from ‘polygenes’ that may be numerous but have individually minor effect.
One can implement polygenic effects by defining many scattered genes whose gamma functions would rarely generate a more than trivial effect. Parameter values like 0.1 0.5 will do this, for example. Other ‘major’ genes can have effect distributions with greater probability of major effect. As noted above, one can do empirical not-too-long runs with a given environmental component to see what value the heritability approaches, and adjust the environmental and polygenic components to generate a desired heritability.

Specifying natural selection (fitness)

Fitness is specified in various ways as user options specified in the input file. Fitness is defined in relative-fitness terms, with a maximum in a given population of 1.0. Selection criteria are imposed sequentially and independently for each defined phenotype, so an individual is saved for reproduction if it passes all the screens. Selection can be applied to a compound of phenotypes by defining a new phenotype, such as PhenC = PhenA*PhenB, and applying selection only to PhenC.

The simplest fitness function is a dichotomous 0-1 step function, truncation selection, in which fitness=1.0 for all phenotypes within a specified range relative to the mean and variance of the phenotype distribution in the person’s population, and zero otherwise: the input file specifies truncation limits expressed in standard deviations relative to the current population/phenotype mean. Fitness is 1.0 within the limits, 0.0 outside of them (or 0.0 within and 1.0 outside is also possible, by appropriate sign change). An individual is removed from the eligible mating pool if its fitness=0.0. The Figure shows how phenotype acceptability ranges can be specified for neutral (very broad cutoffs), stabilizing (symmetric cutoffs), and directional selection (asymmetric cutoffs).
Individuals with phenotype beyond the cutoff are selected out, that is, excluded from mating. The syntax is selection PhenotypeA relative # #, where the #’s refer to the lower and upper cutoff, in StD units.

Alternatively, fitness can also be specified as a probability f that determines whether the individual is excluded from reproduction, based on a user-specified mathematical function relating the individual’s phenotype to the current phenotype distribution in its population. Examples are a probability of exclusion from reproduction that increases with the square of the distance of the individual’s phenotype from the mean in SD units, or with the phenotype’s distance from some target optimal absolute phenotype value specified by the user in the input file. This is schematically shown by the curve in the figure. Each individual is assigned an f based on the user-specified function, and that value of f is used in a random draw to determine the individual’s probability of being in the mating pool. Matings are formed from individuals who pass this screen, and there is no separate mating-based or fertility-based function at present. For example, consider the following line:

selection PhenA functional 1.0 - ( 0.05 * ( ( Phenotype - Mean ) / StdDev ) )

The term “PhenA” in these examples defines which phenotype is being used in the functional expression. In that expression, however, it must be referred to as ‘Phenotype’, which will call the relevant phenotype from the individual whose fitness is being evaluated. The terms “Mean” and “StdDev” here refer to the mean and standard deviation of the distribution of the PhenA phenotype in that individual’s population at that generation. Fitness is relative and must be standardized to a maximum of 1.0 (that is, must be in the interval [0,1]). In this example, 0.05 is a user-specified numerical value specifying the rate of decrease in fitness per unit difference from the mean.
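The two selection modes just described can be compared with a small Python sketch. It is a sketch under stated assumptions, not ForSim’s implementation: the function names are hypothetical, the truncation limits are treated as inclusive, the functional fitness is clamped into [0,1] (the raw example expression exceeds 1.0 for phenotypes below the mean), and a population with zero phenotypic variance is simply passed through unselected.

```python
import random
import statistics


def truncation_fitness(phenotype, mean, sd, lower, upper):
    """Step-function fitness: 1.0 inside the SD-unit limits, 0.0 outside."""
    z = (phenotype - mean) / sd
    return 1.0 if lower <= z <= upper else 0.0


def functional_fitness(phenotype, mean, sd, rate=0.05):
    """The example line: 1.0 - ( rate * ( ( Phenotype - Mean ) / StdDev ) )."""
    f = 1.0 - rate * ((phenotype - mean) / sd)
    # Assumption: clamp into [0, 1] so that f can be used as a probability.
    return min(1.0, max(0.0, f))


def mating_pool(phenotypes, fitness, rng):
    """Keep each individual with probability equal to its fitness."""
    mean = statistics.mean(phenotypes)
    sd = statistics.pstdev(phenotypes)
    if sd == 0.0:
        # Assumption: with no phenotypic variance there is nothing to select on.
        return list(phenotypes)
    return [p for p in phenotypes if rng.random() < fitness(p, mean, sd)]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
population = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# Truncation limits -1.0 and 2.5 SD, as in 'relative -1.0 2.5'.
kept_trunc = mating_pool(
    population, lambda p, m, s: truncation_fitness(p, m, s, -1.0, 2.5), rng
)
kept_func = mating_pool(population, functional_fitness, rng)
print(len(population), len(kept_trunc), len(kept_func))
```

Under truncation the outcome for each individual is deterministic given the current mean and standard deviation, whereas under the functional mode two individuals with the same phenotype can meet different fates, which is why small populations are more vulnerable to chance loss under the latter.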
The fitness function is evaluated for each individual in the population and determines the probability that the individual will survive to reproduce. NOTE: Specify numerical values in floating point format.

NOTE: As stated above in regard to phenotype definition, keep fitness functions simple. Because ForSim must parse a wide variety of possible functions with as little ambiguity as possible, every number must be floating point, and there must be a space between every item (except negative numbers, which can be written -2.3). See the discussion and suggestions in the Addendum.

NOTE: As with many situations, a population may not be able to survive some conditions unless it is variable or large enough. For selection, it may be important to do a burn-in number of generations before imposing the more stringent condition. The latter can then be done via an event instruction. If this is not within the repertoire of options, there are two other ways to achieve the same end. One is to found a new population after the burn-in, have everyone in the old population be donated to found the new population, and have the first population die at that time (you may want to save its data for later checking, with an output instruction at that generation). Second, you can stop the simulation and serialize the data. Then, modify the input file by adding a load instruction, change or implement the new selection regime, and run the reloaded population.

A few notes on selection modeling

There is no one way that selection operates in Nature or in ForSim. At any given time, theory models individual fitness relative to others in the same population. Simulating ‘relative’ selection drives a population in the specified direction, but without limit unless selection is changed by an event instruction. This may not be realistic.
It may be that selection drives a population towards some optimum mean phenotype. This can be simulated by ‘functional’ fitness modeling (fitness highest near some optimal value, T). But fitness relative to some threshold value requires that other simulation specifications don’t drive the population so that nobody is fit, leading to extinction (unless extinction conditions are what you’re simulating). For reasons of this sort, ForSim does not allow specifying a rigid fitness threshold, above or below which an individual is culled from the population; but fitness thresholds can be approximated by the functional method, specified in the way just discussed. The functional option makes it possible to model an open-ended range of scenarios. Artificial selection and some other aspects of selection are discussed in the Addendum.

Selection is based strictly on survival, with no specific provision for fertility-based selection (i.e., where expected family size depends on parental phenotypes). However, for functional rather than relative-truncation modes of selection, survival is probabilistic. This means an individual with fitness f has probability f of being fully reproductive, and 1-f of not being available as a parent. This is approximately the same as having a full chance to survive but with expected offspring number scaled by f.

Whether you are specifying natural selection or letting things drift, it may be interesting to see, generation by generation, what happens to each SNP and how its effects are manifest. For example, if a SNP has a given assigned effect in a selectively favored direction, how noticeable is that allelic effect on the SNP’s fitness? What is its net effect each generation? How well do your theoretical expectations of how your genetic architecture evolves fit what actually happens? How long does a SNP last in the population as a function of selection and its effects?
By setting usingTrackSNPs true, you get the data to find out (but it slows things down and saves huge files, so don’t do it unless you mean it).

Multiple populations

At any point after the start of the run, that is, from generation 1 onward, one can use the event property to create new populations by donating a number of new males and females from existing populations to the new population. Then you must set the mating pattern within and between the existing populations. Each population must have a population definition block with the appropriate ‘birth’ generation and population size. See the example of multiple population simulation for details.

Simulating multiple chromosomes

You can simulate multiple chromosomes, but the only thing that achieves is independent assortment. It does this at the expense of logistical complexity, especially in the output file formats. As described in the Note on Multiple Chromosomes (Chapter 5), this generates files that will require post-run melding. For this reason, a single chromosome is greatly to be preferred, with long-distance spacing between gene segments that you want to assort effectively independently.

Specifying mate choice

Mate choice can be a form of fertility selection. Mating is usually random within and between populations. However, nonrandom mating may be specified by using mateCutoff internal or external in the input file. This is done in terms of the chooser’s phenotype in Standard Deviation units of the chooser’s population’s phenotype distribution.
In searching for a spouse in its own (internal) or a specified other population (external), an individual will only consider as potential mates individuals in the specified range (in StdDev units) of the chooser’s population’s phenotype distribution relative to the chooser’s own phenotype. Upper and lower SD limits are specified, and need not be symmetric, so that large-phenotype individuals can prefer large (or larger, if so specified) mates. Choice continues for all specified phenotypes until a potential mate fails to satisfy one of the criteria. If ranges for a given phenotype are not specified in the input file, any value ± 8 SD from the chooser (that is, essentially anyone) will be accepted.

Creative use of population dynamics

Population history and structure are important both in genetic inference today and in understanding how genetic variation and causation have evolved. ForSim can create, grow, shrink, or destroy a population, or can have multiple populations do that under different conditions and with gene flow between them, and these settings can be changed at any point and as many times as desired. A number of important demographic phenomena can therefore be simulated, including rapid expansion, bottlenecks, inbreeding due to small populations, or even the generation and intercrossing of inbred populations such as those of experimental animals (probably not plants, which have very different fertility and reproductive patterns).

NOTE: The program must begin with only a single population. If you want multiple populations, found them at Generation 1 or later. However, if you are reloading a population from a previous run using serialized files, there can be more than one population.

The program saves graphs and individual phenotypes for each person in the population at the final and penultimate generation.
This can be used to assess the phenotypic response to selection at the end of the simulation run. The program will also output some selected data and graphs to help monitor progress, at the interval in generations specified on the output keyword line (see Chapter 5, Files generated at each generation interval specified by output # in the input file, below). The default is not to save several special files at all (which will save time and a great deal of storage space, if you have no need for intra-run data). With the syntax described in Chapter 5, some of these files may be useful at least for the end generation, and/or you can specify only certain variables to be saved at each interval. NOTE: if you want basically complete data, use the serializeState feature, described in section ‘Rerunning a ForSim simulation’ below.

Prevalence refers to the cutoff value for calling someone ‘affected’ for a dichotomous trait, and is calculated based on the first phenotype defined in the input file. Prevalence can be specified in two ways. The first is as a decimal fraction of the population affected (e.g., 0.10 means 10%) [prevalence relative #, in the input file], which ForSim converts to the corresponding number of standard deviations from the mean, assuming that phenotype distributions are roughly Gaussian. The second is by an absolute criterion, such that individuals whose first-defined phenotype exceeds some cutoff are called ‘affected’ [prevalence absolute #], where # is the cutoff value (a floating point number). The absolute form is potentially dangerous, however, since you cannot determine in advance whether anyone, or indeed everyone, might have such a phenotype.

ForSim parses the input file looking for keywords to set its internal parameters, but the global block must be specified first, including the set of desired event changes (if any), followed by the definition of chromosomes and their content, then phenotypes, then populations. This input.sim example uses a few of the user-accessible variables.
There are many such variables, and the list is complete as of version release time; but most will be of little value to you and should not be changed unless you clearly have a reason and know what you are doing (e.g., iid). The following example simulates a single population with 1 chromosome, containing 2 genes that affect 2 phenotypes.

NOTE: Values given are illustrative and do not suggest that they should or must be used; you need to decide that.

Example 1

This is testExample1.sim in the distribution package, and can be edited and run. NOTE: input files must be plain text, with no hidden characters, so don’t just cut-and-paste the examples in this Manual.

# WHAT: single population simulation, with 1 chromosome, 2 genes affecting
# 2 phenotypes, where both genes affect one trait but only one gene
# affects the other.
global begin                           # Begin definitions of global variables
generations 1000                       # Length of simulation, in generations
# Single population sample parameters
1.0 megabases per centiMorgan male     # Recombination rate
1.0 megabases per centiMorgan female
# instead of separate male & female rates, there can be just one
# line labeled 'all' rather than by sex
mutation rate 2.5 E -8.0 female        # Optional: one line labeled 'all'
mutation rate 2.5 E -8.0 male
usingTrackSNPs false                   # Don't track each SNP's history (else set to 'true')
setFertility poisson 2.5               # poisson distributed fam size, mean 2.5
scaleEnvironmentNormalVariance false
setMaxOffspringNumber 8                # Maximum family size permitted
prevalence absolute 5.0                # Cutoff phenotype for 'affected' status
usingSpatial false
outputXML true                         # Needed if you want this major data output file
outputSVG true                         # to get the multifeatured svg figures (see Ch. 5)
# Now specify the events you want to happen in this simulation:
# event   A keyword for special events that you want during the simulation
#         at specified generation times. After the keyword, a generation
#         number when the event occurs must be specified, and then the
#         nature of the event (see multipop file below)
event 500 setPhenotypeSelection PopulationA PhenotypeA relative -1.0 1.0
#make env't effects stronger at generation 700:
event 700 setEnvironmentNormal PopulationA PhenotypeA 0.0 2.5
event 750 printGeneration              # save population data in preMakeped format
output 500                             # Interval in generations between partial data saves
end                                    # End of global variable definitions

chromosome begin                       # Begin chromosome defining block. Chromosomes are
# numbered in order of definition, beginning with index 0
length 2000000                         # Length of the chromosome, in basepairs
gene begin                             # Begin gene defining block
name ABC1
location 200000                        # In basepairs along this chromosome
length 100000                          # Length of gene in basepairs
gamma 1 0.05 1.0 0.5                   # Gamma parameters of phenotype effects
# of each new mutation
probabilityNoEffect 0.2                # prob new mutation has no effect
probabilityPositiveEffect 0.5          # probability new mutation has positive effect
end                                    # End of gene defining block
gene begin                             # Specify another gene defining block
name ABC2
location 400000
length 100000
mutation rate 2.5 E -8 all             # Example of gene-specific mutation rate
gamma 1 0.05
probabilityNoEffect 0.2
probabilityPositiveEffect 0.5
end                                    # End of this gene defining block
end                                    # End of chromosome defining block

# Phenotype blocks require the naming of a phenotype and then a
# definition of that phenotype in the form of an algebraic formula.
# The algebraic formula must be expressed in terms of gene values
# (specified by gene name), numerical values, and addition,
# subtraction, multiplication, and division operations. Two examples
# follow.
# First, a simple phenotype, consisting of the sum of the SNP
# phenotypes contributed by SNPs within genes ABC1 and ABC2.
phenotype begin                        # Begin phenotype defining block
name PhenotypeA
definition ABC1 + ABC2 * ifMale + 0.3 * ifFemale    # Sex-specific phenotype def.
end                                    # End of phenotype defining block
#
# The second phenotype example is more complex, and is defined as half
# the sum of half of the phenotypic value of ABC2 and three times the
# phenotype value of ABC1.
phenotype begin                        # Begin another phenotype defining block
name PhenotypeB
definition (3.0 * ABC1) + ((ABC2 / 2.0) / 2.0)
# * environment + 2.0 * familyEnvironment
# in actual file keep on same line with above
# NOTE: currently printGeneration does not work if environment variables
# are included in the phenotype definition, so it's commented out here
end

population begin                       # Begin population defining block
name PopulationA
birth 0                                # Generation when population is created
initialSize 500                        # Initial Size of the new population
death 40000
carryingCapacity 1000
growthRate 0.525                       # at 2.1% per year for gen=25 yrs
selection PhenotypeA relative -4.8 4.8            # Selective regime
environmentNormal PhenotypeA 0.0 1.0              # Random environmental effects
familyEnvironmentNormal PhenotypeA 0.0 1.0        # family-specific env'ts
mateCutoff internal PhenotypeA -4.8 4.8
mateCutoff external PhenotypeA -4.8 4.8
selection PhenotypeB relative -4.8 4.8
environmentNormal PhenotypeB 0.0 1.0
familyEnvironmentNormal PhenotypeB 0.0 1.0
mateCutoff internal PhenotypeB -4.8 4.8
mateCutoff external PhenotypeB -4.8 4.8
end
------------------------------------------------

Sample input file for simulating multiple populations

ForSim provides a high degree of flexibility via the ability to simulate evolution in multiple, interacting, hierarchical networks of
populations. Populations can arise at any time during the simulation, after the first generation, by the donation of individual males and females from one or more existing populations (contributions from each of one or two source populations to the new population are specified in separate lines in the input file, using more than one line if more than 2 populations will contribute). Existing populations exchange mates via a matrix of user-specified probabilities of random mate selection from the chooser’s own, or any of the other existing populations (these probabilities can be 0.0 if no exchange is desired). Populations can also be subjected to differing evolutionary conditions. These can be changed at user-specified times by use of the ‘event’ situation-change keyword. A population can donate mates to another existing population at any point in the simulation, using the same donateParents syntax.

Most of the input file components that specify the evolution of a single population, described above, are specified separately for each additional population so that it can evolve in its own way, with its own parameters, mode of splitting from some other population, and relationships to other existing populations. Parameters not shown as population-specific in the following example are shared among all populations.

ForSim uses round-robin mating among populations. This allows different levels of endogamy and exogamy (gene flow) to be specified. [NOTE: Populations are referenced in the order they are created as specified in the input file, where the first-defined population is indexed as population 0.] From population 0 to population n, in turn, mates are chosen randomly from populations specified by a proportional mating matrix. Mates are drawn with or without replacement as per the input file. Mates are drawn until each population reaches its target parental population size. The mating matrix specifies the probability that a mate in each population is selected from itself, or from any other population. For no gene flow, simply give 100 for the fraction of mates from own population, 0 from all others. [NOTES: The user must specify a complete matrix of dimension equal to the maximum number of populations to be simulated at any point during the run. In generations before a given population is created, its mate choice probability is set at 0, then changed to the desired probability after the population exists. And for efficiency and safety, if a population is specified for ‘death’ (to become extinct in a given generation), subsequent mating matrices should be changed by event lines to draw 0% mates from that population.]

The price of flexibility is a corresponding increase in the number of parameters that must be specified for multiple populations. However, making the specifications is straightforward, and requires a parameter declaration block for each population. Following are the components of a multipopulation simulation input file.

Within each population, local conditions regarding population size, duration, environmental effects, selection, and the like can be specified. The user can specify the extinction of a population, at which generation its final data are saved. Because each population can be specified with an extinction date, to prevent extinction just omit the ‘death’ line (or, alternatively, set a death time greater than or equal to the ‘generations’ value for the simulation). However, note that, as in real life, stochastic extinction due to drift or selection could also occur, subject to chance and the selection criteria.
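The mating matrix described above can be pictured as one row of percentages per chooser population. The following Python sketch is not ForSim code; it only illustrates a proportional draw of the source population for a mate. The function name is hypothetical, and the percentages are those set at generation 600 in Example 2.

```python
import random

# Population order follows the order of definition in the input file
# (indexed 0, 1, 2).
POPULATION_ORDER = ["PopulationA", "PopulationB", "PopulationC"]

# One row per chooser population; entries are the percentages of mates
# drawn from populations 0, 1 and 2.
MATING_MATRIX = {
    "PopulationA": [90, 5, 5],
    "PopulationB": [5, 90, 5],
    "PopulationC": [1, 1, 98],
}

def choose_source_population(row, rng=random):
    """Pick the population a mate comes from, in proportion to the row entries."""
    return rng.choices(POPULATION_ORDER, weights=row, k=1)[0]

# Tally 1000 draws for a chooser in PopulationA.
rng = random.Random(1)
draws = [choose_source_population(MATING_MATRIX["PopulationA"], rng)
         for _ in range(1000)]
print({name: draws.count(name) for name in POPULATION_ORDER})
```

A row of 100 0 0 never draws from another population, which is the no-gene-flow case; a zero entry likewise guarantees that an unborn or extinct population is never drawn from.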
New populations can be given different environmental conditions by specifying the distribution of random environmental effects imposed separately on each individual, in terms of the parameters of a Normal distribution.

The following example input file simulates two chromosomes, one containing 4 genes and the other a single gene, 2 phenotypes, and 3 populations with different founding dates, sizes, and selection regimes. Appendix 3 provides an even more complex situation, described with a flow diagram, and its input.sim file.

Example 2: This is included as testExample2.sim in the distribution package

# WHAT: Run name or description line here
global begin
setFertility poisson 2.0
setMaxOffspringNumber 8
prevalence relative 0.08               # in S.D. units of curr. phen. dist.
generations 1000
1.0 megabases per centiMorgan male
1.0 megabases per centiMorgan female
mutation rate 2.5 E -8.0 female        # Optional: one line labeled 'all'
mutation rate 2.5 E -8.0 male
outputXML false                        # set to 'true' if you want this major data output file
# Now specify the events you want to happen during this simulation:
# First, found a new population at generation 400
# PopulationA donates individuals to PopulationB; the
# numbers specify how many males and females are donated
event 400 donateParents PopulationA PopulationB 250 250
# Now set probabilities for each population, that a mate comes from
# population numbers 0, 1, or 2
event 400 setMatingPopMatrix PopulationA 95 5 0    # PopnA picks 95% from self,
# 5% from PopnB, and 0 from not-yet-existing popnC
event 400 setMatingPopMatrix PopulationB 5 95 0
event 600 donateParents PopulationB PopulationC 250 250    #make PopnC
event 600 setMatingPopMatrix PopulationA 90 5 5
event 600 setMatingPopMatrix PopulationB 5 90 5
event 600 setMatingPopMatrix PopulationC 1 1 98
end                                    # end of global block

chromosome begin
length 4000000                         # Properties of chromosome 0
gene begin
name ABC1
location 200000
length 100000
gamma 1 0.05                           # defines a symmetrical gamma distribution with a shape
# parameter equaling 1 and a scale parameter equaling 0.05
probabilityNoEffect 0.1
probabilityPositiveEffect 0.9
end
gene begin
name ABC2
location 500000
length 100000
gamma 1 0.05 1 0.01                    # defines a negative effect gamma distb
# with shape parameter equaling 1 and a scale parameter
# equaling 0.05, and a positive effect gamma with shape
# and scale of 1 and 0.01
probabilityNoEffect 0.6
probabilityPositiveEffect 0.3
end
gene begin
name ABC3
location 3300000
length 100000
gamma 1 0.05 1 0.05
probabilityNoEffect 0.1
probabilityPositiveEffect 0.9
end
gene begin
name ABC4
location 3700000
length 100000
gamma 1 0.05 1 0.05
probabilityNoEffect 0.6
probabilityPositiveEffect 0.3
end
end                                    #of chromosome 0 specifications

chromosome begin
length 4000000                         #Properties of chromosome 1
gene begin
name ABC5
location 200000
length 100000
gamma 1 0.05 1 0.05
probabilityNoEffect 0.1
probabilityPositiveEffect 0.9
end
end                                    #of chromosome 1 specifications

phenotype begin
name PhenotypeA
definition ABC1 + ABC3
end

population begin                       #Properties of PopulationA (indexed 0)
name PopulationA
birth 0
initialSize 900
carryingCapacity 11000
growthRate 0.525
selection PhenotypeA relative -1.8 4.8    #selects for upper chunk
# Note no env't specs needed for the phenotype if default values are OK
mateCutoff internal PhenotypeA -4.8 4.8
mateCutoff external PhenotypeA -4.8 4.8
end                                    #End of PopulationA

population begin                       #Properties of PopulationB (indexed 1)
name PopulationB
birth 400
initialSize 500
death 40000
carryingCapacity 1000
growthRate 0.525                       # at 2.1% per year for gen=25 yrs
selection PhenotypeA relative -4.8 4.8
mateCutoff internal PhenotypeA -4.8 4.8
mateCutoff external PhenotypeA -4.8 4.8
end

population begin                       #Properties of PopulationC (indexed 2)
name PopulationC
birth 600
initialSize 500
death 40000
carryingCapacity 1100
growthRate 0.525                       # at 2.1% per year for gen=25 yrs
selection PhenotypeA relative -4.8 4.8
environmentNormal PhenotypeA 0.0 1.0
mateCutoff internal PhenotypeA -4.8 4.8
mateCutoff external PhenotypeA -4.8 4.8
end

Rerunning a ForSim simulation:

At the end of each run, the final generation is saved to a file named “serialized_generationNumber.txt” (where generationNumber is an integer). This file can be reloaded by placing the following instruction in the global block of a simulation input file:

load /path/to/file/serialized_generationNumber.txt

When the simulation input file is executed, it will reload this generation as the founding generation for the new simulation. The load feature can be used to generate replicate data or large data sets. Or, the new run may specify different evolutionary parameters, provided they remain consistent with the reloaded generation with respect to the numbers and lengths of chromosomes and the numbers, lengths, and locations of genes. As far as ForSim is concerned, this is just a brand-new simulation, starting from generation 0.
The input file must specify everything ForSim needs to use the loaded data; for example, with multiple populations, the mating matrix must be specified as of generation 0.

[NOTE: When load is executed, each item is noted on the terminal screen as it is loaded. There are many more than 2 x diploid population size items (e.g., chromosomes), but this is not a bug. It reflects the fact that the end of a run involves pedigrees, whereas serialized data really only reflect the final generation; some items were present in the pedigree but have zero frequency in the final generation, and for programming convenience are saved anyway in the serialized file. This causes no problems.

NOTE: Because of the internal ways the reload data are used, related to pedigree construction and the like from loaded data, ‘load’ runs should be at least 2 + pedigree depth generations long; to be safe you should explicitly use short runs to test how few generations will result in successful runs with your input file conditions. If you wish to use reload for replicating more population data, you can switch mutation, recombination, selection, gene flow, etc. rates to zero, so that only drift is occurring.]

An option is to use the event keyword to save results during the run, specifying the generation (##) as in: event ## serializeState. This writes a file Serialize_## which contains all the information needed from a single generation to "reanimate" that generation, that is, to initiate a forward simulation with that generation as the starting point. This can be used to iterate replicate data sets of a few generations’ depth, for example. The final serialized data are automatically saved whether or not such an event is specified.
Each line in a serialized file describes a single simulated "object," a SNP, a haplotype region, and so on. We begin with the lowest level of the tree of objects, the SNPs.

SNP 2336 5006921 A 1 4 2 0 216 -0.0961303
SNP 2867 5019728 C 2 3 2 0 263 0.0690454
SNP 3267 4024683 A 1 4 1 0 298 0
SNP 3922 3014852 A 1 4 0 0 359 0
. . .
# ENDSNPS 351

Here, the "SNP" at the beginning of the line simply identifies that the line describes a SNP. The next field is the SNP ID, then the SNP location, SNP nucleotide, the numerical code for the nucleotide, the numerical code for the ancestral nucleotide at this location, the number of the gene in which the SNP resides, the chromosome on which it resides, the generation the SNP became polymorphic, and the phenotype contribution of the SNP. Finally, after all the SNPs are printed, we print a line that begins with the pound sign, the string "ENDSNPS" and the number of SNPs that have been serialized (which acts as an internal check; if we have not read 351 SNPs in this case, then something has gone badly wrong).

Next, we print the "HaploGenes." Recall that a HaploGene in our terminology is a specific haplotype at a given gene.

HaploGene 259122 1 1 6189 0.357618 3 184903 34920 561
HaploGene 262905 0 0 6282 -2.09839 5 70718 131960 7191 59569 16986
HaploGene 274091 1 1 6547 -0.850774 4 165806 62336 195582 66375
. . .
# ENDHAPLOGENES 464

Here, we begin with the string "HaploGene," the unique ID, the sequential order of the gene as entered in the input file, the sequential number of the gene on the chromosome on which it resides, the generation this specific haplotype came into existence, the net phenotype effect of the SNPs on the haplogene, then the count of SNPs within this haplotype region followed by the unique IDs of the SNPs in linear order.
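For post-processing, the HaploGene field layout just described can be read with a short script. The following Python sketch is not distributed with ForSim; the class and field names are hypothetical labels for the documented fields, and the sample line is the first one shown above. The SNP count is used as a consistency check, in the same spirit as the ENDSNPS count.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class HaploGene:
    haplogene_id: int        # unique ID
    gene_index: int          # order of the gene as entered in the input file
    gene_on_chromosome: int  # sequential number of the gene on its chromosome
    generation: int          # generation this haplotype came into existence
    effect: float            # net phenotype effect of the SNPs on the haplogene
    snp_ids: List[int]       # unique SNP IDs, in linear order

def parse_haplogene(line: str) -> HaploGene:
    """Parse one 'HaploGene ...' line of a serialized file."""
    fields = line.split()
    if not fields or fields[0] != "HaploGene":
        raise ValueError("not a HaploGene line: " + line)
    count = int(fields[6])
    snp_ids = [int(value) for value in fields[7:]]
    if len(snp_ids) != count:
        raise ValueError("SNP count does not match the number of SNP IDs")
    return HaploGene(int(fields[1]), int(fields[2]), int(fields[3]),
                     int(fields[4]), float(fields[5]), snp_ids)

record = parse_haplogene("HaploGene 259122 1 1 6189 0.357618 3 184903 34920 561")
print(record)
```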
The "ENDHAPLOGENES" line also includes a count of the HaploGenes.

Now we handle the chromosome haplotypes:

Chromosome 6083381 0 3000000 3 3 1 9761 386022 259122 358416
Chromosome 6090432 0 3000000 3 3 1 9772 368724 373336 408573
Chromosome 6109270 0 3000000 3 3 1 9802 380821 332436 354578
. . .
# ENDCHROMOSOMES 18379

Again, we begin with a static string identifying the line type, "Chromosome," followed by the unique ID of the chromosome, the sequential number of the chromosome in the order specified in the input file, the length of the chromosome, the count of genes on the chromosome, the number of genes on all chromosomes, the number of simulated phenotypes, the generation this specific chromosomal haplotype came into existence, and then the unique IDs of the HaploGenes contained on the chromosomal haplotype. After all chromosomes are printed, another end tag provides a count of the unique chromosome sequences.

Populations of individuals are preceded with a line like the following:

Population PopA 10001

This line always begins with the string "Population" followed by the name of the population as specified in the input file and the population size. This line is followed by all individuals from this generation who have survived the selection process and have entered the pool of potential mates. All potentially mating males are listed first, then all potentially mating females. Each of these lines begins with the static string "Individual," then the unique individual ID, a number which defines whether the individual is male (1) or female (0), then all "left" chromosomal haplotypes in sequential order for each chromosome being simulated, and then all "right" chromosomal haplotypes in sequential order. Finally come all individual phenotypes, in the order they are declared in the input file. The end tag line gives the number of simulated populations.

Individual 100004802 1 6217189 6220312 -0.510328 0
Individual 100003775 1 6212114 6219228 -3.15124 0
Individual 100005452 1 6230772 6170063 -2.86811 0
. . .
# ENDPOPDATA 1

So long as you don’t alter aspects that undermine the characteristics of the saved data, you can change the initial run instructions in the input file (such as selection, mate choice, or gene-mutation-effect parameters), and rerun (using a ‘load’ instruction in the input file); the population will henceforth be subject to the altered conditions.

Specialized features

Some extended or more specialized features are possible with ForSim. But be aware that needless complications will use more memory and slow the program.

Gene coding structure (approximated)

It is possible to set up genes with codon triplets, starting with the first position in the gene. Given the gene’s overall probability that a mutation has a phenotypic effect, a second random number is drawn for the codon position that the mutation hits, and if both tests are passed, then the specified types of effect (the gamma distribution statements) go into effect for that new mutation. Thus, if the gene definition specified that 50% of mutations had an effect, and the mutation hits a second codon position, then there is a 90% probability that (if the first 50% test is passed) the mutation will have an effect. Then the specified positive and negative effect probabilities, with their respective gamma distributions, go into effect. An example of the specification format is:

usingCodons true
firstCodonProbability 1.0
secondCodonProbability 0.9
thirdCodonProbability 0.5

The default is false, so none of these lines need be in the input file if the feature is not being used. Note that this is not specific to the actual amino acid code, but just a way to distribute effects in a systematically non-uniform way.
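The two-stage test described for codons can be written out explicitly. The following Python sketch is not ForSim code; the function names are hypothetical, and it assumes that the codon position is the 0-based offset from the start of the gene taken modulo 3. The probabilities are those of the example specification above.

```python
import random

# firstCodonProbability, secondCodonProbability, thirdCodonProbability
CODON_PROBABILITY = (1.0, 0.9, 0.5)

def effect_probability(gene_effect_probability, offset_in_gene):
    """Overall chance that a mutation at this offset has a phenotypic effect."""
    # Assumption: codon position = 0-based offset from the gene start, mod 3.
    codon_position = offset_in_gene % 3
    return gene_effect_probability * CODON_PROBABILITY[codon_position]

def mutation_has_effect(gene_effect_probability, offset_in_gene, rng=random):
    """Two independent draws: the gene-level test, then the codon-position test."""
    passes_gene_test = rng.random() < gene_effect_probability
    passes_codon_test = rng.random() < CODON_PROBABILITY[offset_in_gene % 3]
    return passes_gene_test and passes_codon_test

# A gene in which 50% of mutations have an effect, hit at each codon position.
for offset in (0, 1, 2):
    print(offset, effect_probability(0.5, offset))
```

For the case worked through in the text (50% at the gene level, second codon position), the product is 0.5 x 0.9 = 0.45.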
Introns

Likewise, within a gene definition block you can specify that genes have intronic regions, basically just sections in which mutations have no effect:

intron 5000 2000

This specifies that the intron begins 5kbp from the start of the gene and is 2kbp long. All mutations which arise in this region will contribute no phenotypic effect. The default is not to have introns, so they need not be specified in the input file.

Spatial genotype distributions

It is possible to generate a single population that is distributed over a 2-dimensional rectangular spatial grid that expands from the [0,0] corner. This works as follows: each individual passes the usual phenotype definition and selective screens. When mates are chosen, a male searches the surrounding grid-points randomly for eligible mates (the usual range of mate-choice criteria can be used). If there are eligible mates and the offspring survive the selection screen, they are given a location based on a Normal (0,1) distribution of grid-point displacement from the father’s X and Y locations. If no eligible mates exist, that individual will not mate. This simulates gradual expansion or isolation-by-distance models of population history.

To use this option, specify usingSpatial true Xmax Ymax DispersionMean DispersionVariance in the input file, where Xmax is the number of locations in the X direction, Ymax in the Y direction, and the dispersion is Nor(DispersionMean, DispersionVariance). A ‘standard’ set of values is 1000 1000 0.0 1.0, but there is nothing biologically based about these values, and the dimensional space need not be a square. At the end, a figure of the final population distribution is produced (see figures, below).
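The dispersal step described above can be pictured as a Normal displacement applied separately to the father's X and Y grid coordinates. The following Python sketch is not ForSim code, and several details are assumptions made only for illustration: the function name, rounding the displacement to whole grid points, 0-based coordinates, and clamping at the grid edges.

```python
import random

def offspring_location(father_x, father_y, x_max, y_max,
                       dispersion_mean=0.0, dispersion_variance=1.0,
                       rng=random):
    """Place an offspring near its father on an x_max by y_max grid."""
    sd = dispersion_variance ** 0.5
    # Assumption: displacement is drawn independently for X and Y and
    # rounded to the nearest grid point.
    dx = round(rng.gauss(dispersion_mean, sd))
    dy = round(rng.gauss(dispersion_mean, sd))
    # Assumption: coordinates run from 0 to max - 1 and are clamped at edges.
    x = min(x_max - 1, max(0, father_x + dx))
    y = min(y_max - 1, max(0, father_y + dy))
    return x, y

# The 'standard' values 1000 1000 0.0 1.0, with a father at the [0,0] corner.
rng = random.Random(0)
print([offspring_location(0, 0, 1000, 1000, rng=rng) for _ in range(5)])
```

With a variance of 1.0, most offspring land within a grid point or two of the father, which is why occupancy spreads out from the [0,0] corner only gradually.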
At the end of the simulation, ForSim produces the usual data files, except that the pedigree (pre-Makeped) files have x and y locations pre-pended for each individual. For analysis, individuals may be sampled from anywhere in the grid based on their xy coordinates, or randomly, etc. The results can then be used as input for population-history or structure programs such as Structure, or for other geographically based analysis.

When usingSpatial is set, ForSim will generate a locations_population_generation#.txt text file. If the outputSVG instruction, which enables the preparation of svg files, is set to true in the input file, a locations_PopName_Generation#.txt.SVG graphics file is also generated. In addition, the preMakePed.txt files will have X- and Y-coordinate locations added to the prepended list of variables, and these values will also appear in the xml files if the latter option is chosen. These will be listed in the header line as well, to show their positions.

NOTE: Spatial simulation is not compatible with multiple populations or with usingTrackSNPs, so don’t try to combine them in the same simulation. Only a single population may be simulated. However, multiple populations founded at subsequent generations can be used to simulate serial founder effects, and since they can have whatever mating matrix is specified they can be treated as being spatially arranged. Or, if donors are solely from the most recent population, and gene flow occurs only between adjacent (i.e., in founding order) populations, one can simulate geographic expansion via serial founder effect.

Multi-allelic SNPs

By default, ForSim is a 2-state SNP simulator. That basically means that it uses an infinite sites evolutionary model.
With usingRecurrentMutation false, which is the default, if a mutation hits a site where a SNP is already polymorphic, no new mutation is allowed. If at the chosen site a novel allele has arisen but been lost, a new mutation arises, of any different nucleotide, which could include the one that was lost. This could allow a recurrent mutation under those conditions, but not simultaneously with an existing variant. If a SNP novel allele has been fixed, the site cannot re-mutate.

If usingRecurrentMutation is set to true in the input file, then there can be multi-allelic sites. In this case, if a novel allele at a location has arisen but been lost, it can be produced again; if fixed, a new allele different from the original (ancestral) or the fixed (novel) allele is chosen. But recurrent mutation to an existing polymorphic allele does not occur. Under these conditions there can be up to 4 SNP alleles at a site, but this is not the same as a recurrent mutation (true finite sites) model, although such a thing could be approximated (e.g., by treating adjacent nucleotide positions as being the same). Since multi-allelic SNPs are statistically relatively rare in the real world, and since much analytic software is based on 2-state SNPs, users should be clear why they want to invoke this option.

NOTE: the history of a lost SNP allele can be identified if usingTrackSNPs is set to true (see below, in Chapter 5).

For ways to achieve things not specifically provided, see the Addendum and the sections on ‘serialize’.

Complex “case-control” comparison figures

runForsim can use simulated data to produce esthetic and highly informative multi-feature figures. It currently does so when output true is included in the input file. Even more detailed figures called Hap_GeneName_Gen#.svg will be produced (see figure description below).
These are sorted by first-defined phenotype, from high to low values, so that you can use a case definition as a rough cutoff to compare affected and unaffected SNP presences as you scan down the figure.

Genetic epidemiological uses of the results

The pedigree files generated by ForSim are extensive if the population is large, and will reflect the family size distribution and so on that were specified in the input file. But if a subset of the family data, rather than the whole or a random sample, is desired, scripts can easily be written to select it from the preMakeped.txt files. As an example, ascertainment sampling schemes can be achieved by reading each pedigree and deciding if it qualifies. For epidemiological purposes an output file relativeRisk.txt is also saved, which provides data on risks and relative risks based on the prevalence value (see Chapter 5).

To identify pedigrees by single ascertainment, one could read the first line of a pedigree, save the pedigree ID and generation number, g (part of the prepended data), then read and save lines until reaching the first person in generation g+2 (if 3-generation pedigrees were specified). If that person is affected, continue reading and saving until reaching the next pedigree. If the person is unaffected, discard the saved lines and move on to the next pedigree.

CHAPTER 5: USERS’ OUTPUT FILES

ForSim produces a variety of output files, some of which are used by the program itself, for example, to generate graphics, or for program-monitoring purposes, and may not be otherwise useful to the user.
You may have no direct scientific use for some of these text or graphic files, but they can be very useful for checking that your input file does what you intend. As examples, population size by generation is useful to see if the program is maintaining the specified size; patterns of gene-specific variation, a plot of individuals culled by natural selection, or the population phenotype distribution by generation may reveal that the way you have specified selection is not what you intended. Careful scrutiny of results is the best and fastest first debugging test. Note that files not described below are for our internal working use.

Depending on what you are simulating, ForSim may save very large files. To conserve disk space, make the results easier to move or store on your system, or to ftp them to some other site for storage, use the file-compress option (-c true) in the line that runs ForSim (see the installation chapter, Chapter 2, above).

NOTE ON MULTIPLE CHROMOSOMES: ForSim can run an arbitrary number of chromosomes, and when pedigree and marker files are saved (see descriptions below), there will be a separate set for each chromosome, whose filename includes the chromosome number (indexed from 0). For a single analysis with n chromosomes, one would have to meld all n files to generate a single complete Marker file and a single complete set of pedigree files. This will have to be scripted by you, and can be complicated. For practical and logistic reasons, we strongly suggest simulating only a single chromosome, but putting large spacing between genes intended to be effectively on separate chromosomes (in terms of recombination); doing this will generate the correct linkage patterns but only a single set of Marker and pedigree files to work with.

A warning!

ForSim is designed for maximal flexibility and usefulness rather than rigidity. It generates very large amounts of data. In default mode, most of this is used on the fly and discarded when no longer needed.
In default mode, all of the data at the end of the simulation are automatically saved in an appropriately labeled directory, as described below. This includes graphics that display aspects of the entire run (e.g., plots of population size by generation). For most applications these final data may be the only data of interest (e.g., to use in inferential software, such as for linkage or association analysis). But you can request more information, at intervals of your choice during the simulation, at which points all the then-current data will be dumped for later post-run analysis. You could even do this every generation. However, you should be careful what you ask for because the amount of data can easily amount to many Gigabytes. If you want intermediate data, choose appropriately spaced intervals (e.g., every 1000 generations).

You can use the input file to take advantage of the values of a number of run-time variables (see keyword table above). These variables are real-time values that can also be user-modified with event Generation# set instructions, but it is dangerous to do that, so be careful!

To monitor what goes on during a run, for example to see if it behaves as expected for the input file you have constructed, you could do a test run with small samples, few genes or populations, or not too many generations. Tinker with your input specifications if you need to, then remove or comment out the input-file lines specifying intermediate data saves (or make them happen at fewer checkpoint generations), and then do your full-size run. This will provide useful intermediate data if you need it, plus the final results, without becoming unmanageable.

As a general note, for genetic epidemiological uses of ForSim, the program produces several files.
The MarkerInfo and PreMakeped files are relevant for mapping studies. The input file includes stipulation of a prevalence value based on the phenotype that is listed first in the input file. When specified in relative terms, the prevalence cutoff is in SD units from the population phenotype mean in the final generation, that is, we assume the phenotype is approximately normally distributed. The PreMakeped file contains an entry for affection status determined by this cutoff value. This affection status can be used in mapping studies of case-control design. To do this, the MarkerInfo and PreMakePed files need to be modified by a script, to (1) add a first column in MarkerInfo.txt with chromosome number (since they are indexed from 0, the first simulated chromosome is chromosome 0), (2a) use the preMakePed.txt (or genPed.txt if you used printGeneration) header line to determine the number of prepended columns, then (2b) don’t save the header line (in any of these files) and (2c) read each individual’s line, remove the pre-pended columns and write the remaining line (now in standard preMakePed format) to a modified preMakePed file. This can then be used in mapping software, such as Plink, Haploview, etc., for identifying SNPs passing some chosen significance test. NOTE: remember not to include the word ‘#prepended:’ in identifying the number of these variables. For more detailed or evolutionary analysis, the other files will be useful to track the history of SNPs, phenotype distributions, selection, haplotypes, genetic diversity, multiple populations and the like. These files are necessarily more complex, and users should experiment with simple runs to learn how to use them effectively. A word about pedigree files and analysis Ascertainment of pedigrees for genetic analysis can be a tricky business, so please be deliberate in selecting the data from a simulation run to analyze in your chosen way! ForSim generates two kinds of pedigree files. 
At the end of the run the pedigrees that comprised the population to produce the user-specified number of generations (default=3) are saved in preMakeped files (the number of files depends on population size, with 1000 pedigrees per file; this has to do with space and memory considerations).

To form these pedigrees, ForSim ‘remembers’ who was in each generation before the final generation, back to the appropriate generational depth. Because the whole population is included, individuals may appear more than once, as they can do in real life with multiple marriages. The default state, mating without replacement, avoids this, but then the population size may not be as high as the specified carrying capacity; normally, the population will grow back, by increased mean family size as discussed elsewhere, but this won’t affect the end-of-run population size. Under some conditions of family size, mate choice, or selection, it could be that mating without replacement leads to insufficient numbers of mates and/or even population extinction (as in real life!), so be careful.

The last generation of end-of-run pedigrees has not been subject to the selection screen that normally takes place before mates are chosen. If you want a population that has already been screened, use event ## serializeState and use that result, a post-selection population.

If you want deeper pedigrees or want to do various kinds of ascertainment, then you should specify the total population size you estimate will be needed in order to have the appropriate number of pedigrees, use an event instruction to have the population grow to that size, or to change family sizes, and then use post-run scripting to screen the preMakeped files to select pedigrees according to your ascertainment scheme.
If a very large number is desired, use the serialize and load features to generate multiple sets of modest-size data, and rerun the same conditions for pedigree-depth number of generations (typically 3), with n (the number of reps) specified in the runForsim.rb command line large enough that the resulting runs give you your desired number of pedigrees as many times as desired. You can set mutation and/or recombination to zero in this reload input file, stop any natural selection, etc., so the additional pedigrees are from essentially the same population as your original run. The data saved for reload are only the final generation of the initial run. The preMakeped files from these iterations can then be merged to make the total desired pedigree set. If minimal overlap of individuals between pedigrees is desired, make sure the population is large relative to the needed number of pedigrees, and randomly sample them from the whole preMakeped file(s). The normal end-of-run pedigrees are of the whole population, and so there is no ascertainment bias.

You may want to analyze just a single generation of individuals. A simple way to do this is to read the preMakePed files, check the prepended generation number column, save only those lines from the final generation, and use MarkerInfo##.txt to see the SNPs present.

printGeneration utility: An alternative, with some restrictions (see NOTE, below). If you want to analyze a single-generation population during a run, rather than just in the ending population, you can specify event # printGeneration to invoke the printGeneration utility function. This will save the population at generation # in the usual preMakeped and MarkerInfo format files, including prepended values (but just for the specified generation).
As with other preMakeped.txt output files, the genPed output files will be separated into subsets with 1000 individuals each, and their standard names will be appended with the generation number:

genPed##[email protected]#.txt
generation_@_MarkerInfo##.txt
generation_@_MarkerFreqs##.txt
generation_@_MarkerStats##.txt

where ## is chromosome number and # the file subset.

NOTE: Current restrictions on this utility instruction are that (1) the generation number must be ≥ (generation in which the most recent population was created + pedigree depth), and (2) less than the final pedigree depth at the end of the run; (3) all populations simulated in the entire run must have been created and still exist in the specified generation; and (4) the instruction does not work if ‘environment’ or ‘familyEnvironment’ are explicitly included in the phenotype definition. The reasons for these conditions have to do with the order in which events are completed at any given generation, related to whether variables have been assigned values or are accessible, plus the 0-based indexing of generation numbers. If you need single generation data within these restrictions, there are two ways to get it:

1. Run that number of generations, and stop; this will save preMakePed files for that generation. Then, modify the input file to specify the subsequent conditions you would have run, and use ‘load’ to resume, using the serialized file saved at the end of the initial run. Then, extract individuals from the preMakePed files for the stopped generation(s) (this will be slightly faster if you specify pedigree depth 2).

2. Specify event # outputXML, where # is the generation for which you wish the data (have a separate event line for each such generation you want). Then write a script to read the XML file and extract the data (for example, into Marker and preMakePed file format); a script to do this is in development.
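Extracting one generation from a preMakePed file, as mentioned above, can be sketched in Python as follows. This is a hypothetical helper, not a script distributed with ForSim. It assumes that fields are whitespace-separated, that the first non-blank line is the ‘#Prepended:’ header, and that one of the prepended columns is named Generation, as in the example header shown later in this chapter.

```python
def prepended_names(header_line):
    """Column names from a '#Prepended:' header line.

    The '#Prepended:' token itself is a label, not a column, so it is dropped.
    """
    tokens = header_line.split()
    if not tokens or not tokens[0].lower().startswith("#prepended"):
        raise ValueError("expected a '#Prepended:' header line")
    return tokens[1:]


def select_generation(lines, generation=None):
    """Return the header plus the data lines belonging to one generation.

    If generation is None, the largest generation number in the file
    (the final generation) is used.
    """
    lines = [ln for ln in lines if ln.strip()]  # ignore blank lines
    header, data = lines[0], lines[1:]
    col = prepended_names(header).index("Generation")
    if generation is None:
        generation = max(int(ln.split()[col]) for ln in data)
    return [header] + [ln for ln in data if int(ln.split()[col]) == generation]
```

A typical use would be to read each preMakePed subset file with open(...).readlines(), pass the lines to select_generation, and write the result to a new file.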
File formats and examples

Files are saved within a directory called ‘runData’ and for each run a date-time stamped subfolder is created (the folder name will end with Run#, so results from multiple runs can be identified separately).

In the data output files SnpID’s have the form SNP0XTL1000790ID213, coded as follows for parsing (as by regular-expression scripts); each tag is followed by the value described:

SNP  Chromosome number (0 in this example)
X    Novel nucleotide (A,C,G,T) (T in this example)
L    Chromosome coordinate (1000790 in this example), the location of the SNP
ID   Sequential SNP number (213 in this example)

The program produces modified PreMakeped files of the last 3 generations (or 2, or deeper if so specified in the input file) of the entire data set, plus an accompanying MarkerID file (marker name, marker location). The modification to standard PreMakeped format includes (1) a header line specifying what variables are pre-pended to the standard PreMakeped file, and (2) each line thereafter is pre-pended with those variables specified in the header line. This includes the population name and other variables (see example below).

The modification of the PreMakeped format is important because of the way ForSim handles multiple populations with migration, in which individuals in a pedigree may not all be from the same population. Therefore, before use in software that requires standard pre-Makeped format, the header line and the pre-pended variables on the subsequent lines must be removed to generate LINKAGE-format input. A script can easily be written to do this. Note also that there is variation in pre-Makeped formats. Some software assumes that there is only one phenotype, affection status (e.g., disease, normal), while other software accommodates a user-specified number of quantitative or qualitative phenotypes.
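The removal of the header line and the pre-pended variables described above might be scripted along the following lines. This Python sketch is an illustration only: the function names and file paths are hypothetical, and it assumes whitespace-separated fields with a header line that names every pre-pended column after its leading ‘#Prepended:’ label.

```python
def count_prepended(header_line):
    """Number of pre-pended columns named in the header line.

    The leading '#Prepended:' token is a label, not a column, so it is
    not counted.
    """
    tokens = header_line.split()
    if not tokens or not tokens[0].startswith("#"):
        raise ValueError("expected a header line starting with '#'")
    return len(tokens) - 1


def strip_prepended(lines):
    """Drop the header and remove the pre-pended columns from each data line."""
    lines = [ln for ln in lines if ln.strip()]  # ignore blank lines
    n_prepended = count_prepended(lines[0])
    return [" ".join(ln.split()[n_prepended:]) for ln in lines[1:]]


def convert_file(in_path, out_path):
    """Read one modified preMakePed file and write a standard-format copy."""
    with open(in_path) as src:
        cleaned = strip_prepended(src.readlines())
    with open(out_path, "w") as dst:
        dst.write("\n".join(cleaned) + "\n")
```

Because the number of columns is read from the header rather than hard-coded, the same sketch should keep working if the set of pre-pended variables changes (for example when usingSpatial adds x and y locations).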
ForSim generates a set of files at the end of every run. Optionally, the user may use the 'output n' keyword to specify a dump of data every n generations during the run. If not specified, these files are not generated; if n is specified as the last generation, the files (SnpStatsGen, HaploPhenGen, phenotypes_PopName_Generation) are saved once; if n is less than the last generation, these files will be saved every n generations. If output n is specified, the files have names that identify the generation number, as in examples below.

NOTE: If event instructions are specified that alter the natural selection strategy, separate files will trace the population for each segment of the run under the respective strategies. There will still be internalRunData files for the whole run.

The major output file named ‘generation_#_Popname.xml’ is a complete data dump of the population at the specified generation, and is hierarchically organized, so it is generated in XML format for easier parsing (e.g., by RegEx scanning for category tags). This is a basic, but potentially large file, so the input file needs the statement “outputXML true” to generate that file, or specify ‘false’ if you won’t need it.

NOTE: The xml data files are comprehensive and very useful, but very large. If you specify event gen## outputXML this file will be saved at the specified generation, but you should only do this if you really want all that data. If the option is set, the .xml file will contain the final generation only (pedigree files also contain the ancestors for the user-specified parental generations).
The descriptions below use various terms:

* UID is the unique ID number of the relevant item (gene, HaploGene, individual…)
* Phen means the assigned phenotypic effect of a SNP allele, or the net (summed) phenotypic effect of all the SNPs in a given HaploGene
* HaploGene is the haplotype of a given copy of a gene
* born means the generation in which the item first appeared
* count is the number of copies of the item in the current generation
* location is the starting nucleotide position along the chromosome
* Sequential/Male/FemaleNumber is their sequential order in the output data
* census counts the population size at the relevant time, males + females

Note that numbers for multiple categories, like genes, chromosomes, etc., are generally used in these files, rather than the unit names if they were given. This is done in 0-indexed array order, so a gene defined as Gene1 will be printed out labeled Gene0, a gene named Gene2 will be listed as Gene1, etc., and similarly for phenotypes.

Files automatically generated at the end of a ForSim run:

NOTE that if a population dies during the run, some population-specific files will not be generated for it in the usual end-of-run output.

currentInput.txt
The input file for the run being reported. The first line contains the commented name of the input file (e.g., # thisInput.sim). The rest is the contents of that file.
internalRunData.txt
Some summary data of the run. Each line contains Generation, population count, unique chromosomes, unique HaploGenes, number of SNPs, number of SNPs lost in this generation, novel mutations in this generation, number of recombination events, number of intragenic recombination events, heritability of first phenotype, HaploGene heterozygosity in percentage form, and the runtime, in seconds, of the reporting generation:

Generation KidsCount Chromosomes HaploGenes SnpSites SnpsLost MutationEvents RecombinationEvents intraGeneRecombinationEvents Heritability observedHaplotypeHeterozygosity elapsedSeconds
0 5000 11825 0 0 0 260 2015 105 0 0.209958 0
1 5001 13609 105 260 260 243 1975 105 0.00010376 0.372 0
2 5000 6196 453 243 101 242 1992 91 0.000162504 0.990198 0
3 4999 5935 587 384 143 240 1954 101 0.000218168 1.67867 1
4 4998 5864 724 481 157 258 2003 99 0.00032797 2.29246 1
5 4999 5818 868 582 182 241 1958 107 0.000395046 3.0126 1

internalRunDataFull.txt
Some summary data of the run. Each line contains Generation, population count, survivors of selection that generation, number culled by selection, unique chromosomes (i.e., each different sequence is one), unique HaploGenes, number of SNPs, number of SNPs lost in this generation, novel mutations in this generation, number of recombination events, GenPerEffect (an estimate of heritability equal to genotypic variance/(environmental + genotypic variance), for the first-named phenotype; this will not be accurate if there is GxE correlation, of course), FastHGHetero (for every gene on the first chromosome, for the first population, the HaploGene (gene-specific haplotype) heterozygosity), and the mean, variance, and standard deviation of the first-defined phenotype:

Generation PopSize Survived Culled Chromosomes HaploGenes SnpSites SnpsLost MutationEvents RecombinationEvents GenPerEffect FastHGHetero PhenMean0 PhenVar0 PhenStDev0
1 0 0 0 6128 2 5 5 1 74 0.500907 0.1 0.00460933 1.01534 1.00764
2 0 1 1 2644 4 1 1 4 65 0.500202 0.166611 -0.0133695 1.02679 1.0133
3 0 0 0 2284 8 4 1 1 65 0.499953 0.266667 -0.0331265 1.01107 1.00552

MarkerInfo##.txt
For a given chromosome (##, indexed starting at 00), the SNP IDs and their locations (for use with Linkage input). Includes any SNP site at which a novel allele (i.e., not the initial SNP allele) is present in the pedigrees, even if it is fixed and no longer variable (this is because its phenotypic effects are still present). Location and ID number are also embedded in SNP names, which include the nucleotide (right after ‘SNP’):

SNP0XCL100945ID169912 100945
SNP0XGL100981ID169971 100981
SNP0XCL100996ID154771 100996

MarkerPedSnps.txt
SNPs with a novel allele still present anywhere in the final pedigrees (even if the novel allele has become fixed), when the allele was created by mutation, their basic identity and assigned phenotypic effects.
SNPs on all chromosomes are listed, and with their unique IDs their chromosome location can be tracked (if usingTrackSNPs is set to true, the counts at every generation are also included at the end of each line):

Snp name=SNP0XCL500176ID208; born=45; uid=208; location=500176; nucleotide=C; phen=1.03721; SNP0XCL500176ID208 0 500176 45 1.03721

MarkerStats##.txt
For chromosome ##, and for all SNP sites with a novel allele present in any generation of the final pedigrees (even if it is fixed). Each line contains a SNP record: the SNP ID (which includes the chromosome number), the new nucleotide, the novel and ancestral nucleotides coded as 1..4 for A, C, G, T, the sequential SNP number, the location, the generation when created, the current count in the final simulated generation (bottom of the pedigrees), and the phenotypic effect:

SNPID Nucleotide Novel Ancestral Number Location Born Count Phenotype
SNP0XCL100945ID169912 C 2 3 169912 100945 1999 1 -0.465247
SNP0XGL100981ID169971 G 3 2 169971 100981 1999 1 -1.42006

MarkerFreqs##.txt
Same as MarkerStats but includes frequency in the final generation instead of a count.

SNPID Nucleotide Novel Ancestral Number Location Born Freq Phenotype
SNP0XCL100945ID169912 C 2 3 169912 100945 1999 0.00049975 -0.465247
SNP0XGL100981ID169971 G 3 2 169971 100981 1999 0.00049975 -1.42006

geneEntropies.txt
For each generation, for each gene, the entropy measure for all populations pooled. For each gene, this equals ∑ -frequency*log10(frequency), summed over all unique HaploGenes (haplotypes at that gene). (Values in ForSim output such as 1.58496, which equals log2 of 3, suggest that the logarithm may in practice be taken to base 2.) This is useful to see the effect of selection on specific haplotypes.
Gene0 Gene1 Gene2 Gene3
1 0 0 0 0 0 0 1 1 1.58496 1.58496 1.58496 3.58496 3.70044 3.80735 3.70044 4.08746 4.16992 3.90689 3.80735 4.24793
Gene4 Gene5 Gene6 Gene7 Gene8
0 0 0 0 2 2 2 1.58496 2.80736 2 2.32193 2.58496 2.32193 3 3.16992 3.58496 4.32193 3.90689 3.80735 3.90689 3.45943 3.90689 4.16992 4.08746

finalPedigreeHaplogenes.xml
For each HaploGene (haplotype in a given gene), the details of the HaploGene and its constituent SNPs. NOTE: SNPs reported in this file only include the derived (novel) SNP (which may or may not be fixed at the time); the ancestral alleles are not specifically listed, mainly to save on the enormous file size that could otherwise result, since in most simulations most alleles will be ancestral. uid is the unique ID, and count is the number of copies in the current populations that were simulated. The snpcount is the number of novel SNP alleles on that HaploGene. HaploGenes are listed in their ‘born’ generation order, not their chromosome order; their relative locations on chromosomes can be inferred from the gene names, as they were specified in the input file (onChromosome gives the chromosome number, remembering that this is indexed from 0 in the order specified in the input file, and can also be found in the SNP IDs after ‘SNP’):

<HaploGene name="ABC10" onChromosome="0" born="4741" uid="16574956" snpcount="115" count="21" phen="0.378721">
<Snp name="SNP0XAL109002507ID2050188" count="6240" born="820" uid="2050188" location="109002507" nucleotide="A" phen="0.0388587"/>
<Snp name="SNP0XCL109004218ID6334505" count="6240" born="2532" uid="6334505" location="109004218" nucleotide="C" phen="0.0715263"/>
. .
</HaploGene>
<HaploGene name="ABC11" onChromosome="0" born="4823" uid="16587992" snpcount="106" count="38" phen="0.22000">

NOTE also that the xml file provides population data, but does not include each individual’s ‘sex’. This is because generally (unless using ifMale and ifFemale in the phenotype definition), this doesn’t have any meaning.
However, if you need it you can extract sex from each Individual’s line in the corresponding serialized file.

preMakePed##_subset#.txt
For chromosome ##, and subset #, a modified preMakeped file. Because files can be very large, a separate file is generated every 1000 pedigrees, with subset files numbered sequentially (the set of files contains the full pedigree data for the population). It will differ from standard PreMakeped formats by having a header line and each data line pre-pended with several values. These have been added for post-run analytic convenience, and the header identifies these values (the example below shows the current set). The IndividualID is the UID found in other files, and is included because in generating pedigrees, ForSim creates new pedigree-specific sequential IDs for the individuals (this is sometimes convenient for Linkage users; pre-pending the UID allows setting up correspondence with information in other output files). The standard line includes these space-separated fields:

Pedigree# PersonID fatherId motherId sex phenotype 1 1 2 1 3 3 ….

[the numbers are pairs of the individual’s alleles at the tested markers (paternal, maternal) for each SNP site in chromosome order as specified in the Marker input file]. Since ForSim generates pedigrees from complete data, top-generation individuals in a pedigree will have father’s and mother’s IDs. Here phenotype means affection status, coded as 0, 1, or 2 where 0 is unknown, 1 is unaffected, and 2 is affected. But since ForSim has complete data, 0 (unknown) should never occur.

The pre-pended variables are included as a convenience for identifying individual properties. These include each individual’s net first-defined phenotype, the ‘genetic’ phenotype, the Environment and the familyEnvironment contribution.
At present, because of a possible future polygenic background feature, there is also a PolygenicComponent prepend column, whose value is set to 0 for all individuals in the preMakePed files and can be ignored. The value of the phenotype is worked out for that individual from the phenotype definition and his genotype and environments. In default, the two environments (familyEnvironment itself defaults to zero) are added to the genetic contribution that was specified in the Phenotype definition; that means the net genetic contribution can be determined by subtraction (phenotype - Envt - familyEnvt). The ‘geneticPhenotype’ column is a redundant repetition of the overall phenotype: it is the compound of G and/or E effects specified in the phenotype definition. But if environments are specified in non-additive ways and/or in ways that interact with genotypic effects in the phenotype definition, this subtraction does not apply! So use (or ignore) these file fields advisedly.

NOTES: For program legacy reasons, a polygeneComponent column, with value fixed at 0, is prepended. Before using programs like Linkage that want PreMakeped format, you must delete the header line and then crop the pre-pended values from each data line. This is necessary even in one-population simulations. The subfile number refers to the fact that for file management in large runs, the total data are divided into subfiles that need to be concatenated for a single analysis. The prepended variables currently are as given below (except that they also include x and y coordinate locations for each individual, if the simulation was run specifying usingSpatial), followed by the standard variables listed above. Values are given for each simulated phenotype, labeled in the order in which they were specified in the input file (not the name given).
NOTE also that the prepended variables may change not just with the number of phenotypes but if or when some in-development features, like adding a polygenic background component, are added; but their identities will be listed in the #Prepended line.

#Prepended: Population Phenotype0 GeneticPhenotype0 EnvironmentalPhenotype0 FamilyEnvironmentalPhenotype0 Phenotype1 GeneticPhenotype1 EnvironmentalPhenotype1 FamilyEnvironmentalPhenotype1 Generation
PopA 0.336108 -0.0249734 0.361081 0 0.127397 -0.0249734 0.152371 0 98 840 295611 291699 293037 1 2 1 1 3 3 2 2 3 3 3 3 1 1 4 4 3 3 3 3 4 4 2
PopA 0.345266 0 0.345266 0 0.31541 0 0.31541 0 98 840 295635 293095 290988 2 2 1 1 3 3 2 2 3 3 3 3 1 1 4 4 3 3 3 3 4 4 3 3 4 4 3 3 2 2 1 1

runtime.txt
Miscellaneous run data, including the total run time:

Finished: Thu Jun 21 13:02:52 2007
Run took 73 seconds, output took 11 seconds.

relativeRisk.txt
Description of some phenotype characteristics after the run. Pcutoff is the phenotype level above which one is scored as ‘affected’. The other entries are self explanatory:

766 of 10124 offspring in final generation are affected
50 first siblings have affected second siblings
2734 of 3109 first two siblings in final generation have concordant affectation status
pcutoff was set to: 15.0216
first siblings == 3816
first siblings affected == 292
sibling pairs == 3109
sibling pairs, both affected == 50
overall risk == 0.0756618
sibling risk == 0.171233

snpLifeSpanDataFinal.txt
For each SNP present or fixed at the end of the run, name, chromosomal location, born-on generation, whether fixed or lost, count in current generation, diploid population size, current frequency, assigned phenotypic effect. Gives lost/fixed SNPs since last dump, plus current polymorphic SNPs. [NOTE: This file is like the other snpLifeSpanData files except that they are dumps of fixed or lost SNPs only, while this also includes those still present at frequencies in (0,1).
To tally all SNPs produced in the run, the user must also include the other snpLifeSpanData files.]:

# Name Status Location Born DateFixedOrLost CurrentCount DiploidCount Freq Phenotype
SNPGL939347ID6588 Lost 939347 97 1890 0 6002 0 0.461625
SNPGL418416ID6628 Lost 418416 98 1890 0 6002 0 -0.900036

geneHaploCounts.txt
For each gene, and each generation, the number of unique haplotypes:

Gene0 Gene1 Gene2 Gene3
0 0 0 0
0 0 0 0
11 8 13 10
17 12 17 23

These files are always produced at the end of the run, but their names depend on the input file specifications:

PopName_PopRunData.txt
Generation, current census count before and after selection in that generation, number who left the population before mating, and the number culled by selection:

Generation census postSelectionCensus emigrants culled
1 2003 1993 0 10
2 2001 1990 0 11
3 2002 1987 0 15

generation_###_Population#.xml
The entire data set, in easily parsable XML tagged format. This can be a huge file, so is only saved if so specified by outputXML true in the input file. This is the last generation simulated and thus the final generation in the final pedigree file. When a population is specified for death during the simulation, its final data will be saved in this format. The population can be saved at other points as well by using an event line as described earlier. The first line gives population name and generation, male and female sex counts in the population, sequential ID of each simulated individual, and so on, then the two chromosome haplotypes in terms of its HaploGenes.
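Because the file is tagged XML, per-individual values can be pulled out with a standard XML parser. The following Python sketch is illustrative only: the element and attribute names (Population, Individual, Phenotype, uid, net, genetic, environment) are taken from the manual's sample of this file, the embedded string is a trimmed, closed-off copy of that sample, and the function name `phenotype_rows` is hypothetical. It assumes the real file is well-formed XML.

```python
import xml.etree.ElementTree as ET

# Trimmed sample using the tag and attribute names shown in the manual.
SAMPLE = """<Population name="PopulationA" generationNumber="400" males="1444" females="1555">
  <Individual uid="1195227" paternal_uid="1194235" maternal_uid="1192273">
    <Phenotype number="0" net="-0.95603" genetic="0" environment="0.95603" />
    <Phenotype number="1" net="0.547231" genetic="0.110449" environment="0.436782" />
  </Individual>
</Population>"""


def phenotype_rows(xml_text):
    """Return (individual uid, phenotype number, net, genetic, environment) tuples."""
    root = ET.fromstring(xml_text)
    rows = []
    for individual in root.iter("Individual"):
        for phen in individual.iter("Phenotype"):
            rows.append((
                individual.get("uid"),
                int(phen.get("number")),
                float(phen.get("net")),
                float(phen.get("genetic")),
                float(phen.get("environment")),
            ))
    return rows


for row in phenotype_rows(SAMPLE):
    print(row)
```

For a very large population file, reading the whole document into memory may be impractical; `xml.etree.ElementTree.iterparse` processes the file incrementally and may be a better fit in that case.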
NOTE: SNPs reported in this file only include the derived (novel) SNP (which may or may not be fixed at the time); the ancestral alleles are not specifically listed, mainly to save on the enormous file size that could otherwise result, since in most simulations most alleles will be ancestral. This means that a chromosome listing that does not include one of the extant SNPs has the ancestral allele. Also, the data are phased because chromosomes are listed in parental order:

<Population name="PopulationA" generationNumber="400" males="1444" females="1555">
  <Individual uid="1195227" paternal_uid="1194235" maternal_uid="1192273">
    <Phenotype number="0" net="-0.95603" genetic="0" environment="0.95603" />
    <Phenotype number="1" net="0.547231" genetic="0.110449" environment="0.436782" />
    <Chromosome uid="35816" number="0" count="5" born="392">
      <HaploGene name="ABC1" onChromosome="0" born="0" uid="1" snpcount="0" count="2840" phen="0">
      </HaploGene>
      <HaploGene name="ABC3" onChromosome="0" born="370" uid="15571" snpcount="2" count="60" phen="-0.0247006">
        <Snp name="SNP0XTL306509ID3129" count="232" born="103" uid="3129" location="306509" nucleotide="T" phen="0.0247006"/>
        <Snp name="SNP0XCL331557ID11175" count="60" born="371" uid="11175" location="331557" nucleotide="C" phen="0"/>

PopName_SelStratRelative_PhenotypeName.txt
Describes, for each generation, the lower selection cutoff threshold for the phenotype, the mean phenotype, the upper threshold, the phenotypic standard deviation, and the numbers culled because they were below the low or above the high cutoff. The file name specifies the selection strategy (e.g., ‘relative’ to mean): SelStrat names the selection strategy used for this phenotype. Alternatives to ‘Relative’ will be ‘Neutral’ or ‘functional’ (based on a probabilistic function) as specified in the input file.
Separate file produced for each phenotype and each population [NOTE: The mean and StdDev columns are always relevant, but the ‘low’ and ‘up’ columns have no user-useful meaning in ‘functional’ or ‘neutral’ selection.]:

Gen LowThresh Mean UpThresh StdDev LowCulled UpCulled
1 -2.78966 0.0168957 4.82814 1.00234 4 0
2 -2.90604 0.0035682 4.99146 1.03914 6 0
3 -2.95189 -0.0389495 4.95467 1.04034 8 0

Files generated at each generation interval specified by output # in the input file

NOTE: the large xml whole-population file is not saved unless outputXML # is specified for the desired generation (this is to save space and time).

Hap_GeneName_OutputGeneration_##.svg
(This is a figure, documented below; it requires the global line outputSVG true.)

Phenotypes_PopName_Generation_##.txt
The phenotypes, given as one line for each individual. The file is saved for generation 0 to see the starting conditions (basically, environmental variation), and the file is also saved for the penultimate generation so that response to selection can be assessed:

Phen0 Phen1
9.4427 -13.3644
9.13841 -13.5032
8.49651 -16.5196

HaploPhenGen###.txt
For each unique HaploGene, its phenotypic effect and number of copies, at the specified generation number:

-1.74914 294
6.42087 255
-1.74914 138
-1.14608 668

SnpStatsGen##.txt
For each SNP in generation #, gives the chromosome, the sequential SNP number, the chromosomal location, and the generation when born:

Generation Chromosome ID Location Born
999 0 6588 939347 97
999 0 6628 418416 98
999 0 11339 904214 168
999 0 17500 911487 259

snpSelCoeffs_#.txt
For each SNP that exists in generation #, its SNP ID, chromosome, chromosomal location, generation when born, followed by the selective coefficient of that SNP back to the generation in which it first became polymorphic; these are computed as the number of copies of the SNP divided by the total number of transmissions in that generation (2 x population size).

PopulationCount 0 0 0 0 1004(2) 1003(1558) [...]
SNP0XTL554524ID35703 0 554524 1002 0 1004(0.00160462) 1003(0.000322789)
SNP0XTL587068ID35708 0 587068 1002 0 1004(0.000962773) 1003(0.000322789)
SNP0XAL3370468ID35715 0 3370468 1002 0 1004(0.00513479) 1003(0.00225952)

These files are produced when an internal storage buffer has been filled:

(NOTE: Of these, the snpLifeSpanData, snpOwnerPhens, snpCounts, snpSelCoeffs files are saved during the run if usingTrackSNPs true is included in the input file; these files will be saved as their buffer of 100,000 records is filled, so short runs or small populations may generate only one file. These can be huge files, so only use this option if you really want the complete history of every SNP!)

snpLifeSpanData_#.txt
At file-save time, generation #, the generation by generation history of each SNP that was born since, or was polymorphic in, the previous save generation, and for every generation until lost (if lost) the status of that SNP. At each generation, provides the name, chromosomal location, birth generation, whether fixed, lost, or still polymorphic, allele count (=0 if lost), current diploid count (twice population size), current frequency, assigned phenotypic effect.
At the end, a snpLifeSpanDataFinal.txt file is generated that summarizes what was present at the end of the run. [NOTE: For a full tally of the history of all SNPs in the run, the user must retrieve all of these files, because the final file will not include SNPs that existed but were fixed or lost during the run, if those exceeded the buffer size and were saved as files during the run. The history of those that were polymorphic in previous files will be continued in subsequent files until they are lost (if they are lost before the end)]:

# Name Status Location Born DateFixedOrLost CurrentCount DiploidCount Freq Phenotype
SNP0XGL939347ID6588 Lost 939347 97 699 0 4010 0 0.461625
SNP0XGL418416ID6628 Lost 418416 98 699 0 4010 0 -0.900036
SNP0XGL904214ID11339 Lost 904214 168 699 0 4010 0 -1.10356
SNP0XAL911487ID17500 Lost 911487 259 699 0 4010 0 0.139189

snpOwnerPhens_##.txt
For each SNP existing in generation ##, the first line records each generation number and the population size in that generation in the form generation(popcount); the four 0’s are space-holders only. Each SNP has its own line that includes ID, chromosome number, chromosomal location, born-on date, and assigned phenotypic effect on any phenotype it affects, then for each generation during its life, the generation number and the net phenotypic effect, that is, the average phenotype among all individuals who ‘own’ (have at least one copy of) the SNP in the population. These net effect values reflect the changing linkage and genotype context that the SNP has been in during its life. Values are given for each generation from its birth to generation ##, or since the last ‘output’ generation. This file will only be produced in generations that are a multiple of the generation specified by the “output” directive.
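The generation(value) layout described above is shared by the snpOwnerPhens, snpCounts and snpSelCoeffs files, so one small parser can serve all three. The following Python sketch is illustrative and not part of ForSim. It assumes that each line has one name field, four fixed fields, and then whitespace-separated generation(value) tokens, and that truncation markers such as "..." can be skipped. The function name `parse_history_line` is hypothetical.

```python
import re

# One token looks like 7993(-2.153): a generation number, then a value in parentheses.
TOKEN = re.compile(r"^(\d+)\(([^()]+)\)$")


def parse_history_line(line):
    """Split one line into (name, fixed_fields, {generation: value})."""
    fields = line.split()
    name = fields[0]
    fixed_fields = fields[1:5]  # chromosome, location, born, assigned effect (or four 0's)
    history = {}
    for token in fields[5:]:
        match = TOKEN.match(token)
        if match is None:
            continue  # skip markers such as "..." or "[...]"
        history[int(match.group(1))] = float(match.group(2))
    return name, fixed_fields, history


name, fixed, history = parse_history_line(
    "SNPTL1036347ID7221 0 1036347 96 -0.012469 7993(-2.364) 7992(-3.124)"
)
print(name, fixed, history)
```

Applied to the PopulationCount line of a file, the same function returns the population size for each generation, which can then be paired with the per-SNP values by generation number.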
PopulationCount 0 0 0 0 7993(4999) 7992(4999) ...
SNPTL1041915ID645 0 1041915 10 0 7993(-2.153) 7992(-2.627) ...
SNPTL1036347ID7221 0 1036347 96 -0.012469 7993(-2.364) 7992(-3.124)
SNPCL5011063ID8827 0 5011063 117 -0.0933972 7993(-2.408) 7992(-.199)
SNPCL5064714ID9595 0 5064714 127 0 7993(-2.358) 7992(-2.204) ...

snpCounts_##.txt
For each SNP existing in generation ##, the count of the novel allele in the full population (that is, in all simulated populations), at each generation over the entire history of the SNP since the last ‘output’ generation. Like the previous file, except reporting SNP counts rather than their net effects. The first line records a population count per numbered generation. This file will only be produced in generations that are a multiple of the generation specified by the “output” directive.

PopulationCount 0 0 0 0 7993(4999) 7992(4999) ...
SNP0XTL1041915ID645 0 1041915 10 0 7993(6616) 7992(6626) ...
SNP0XTL1036347ID7221 0 1036347 96 -0.012469 7993(3317) 7992(3318) ...
SNP0XCL5011063ID8827 0 5011063 117 -0.0933972 7993(2551) 7992(2574)
SNP0XCL5064714ID9595 0 5064714 127 0 7993(2363) 7992(2364) ...

Also, see MarkerPedSNPs.txt, above.

Output graphic files
ForSim also produces output graphics of various aspects of the data, at the user-specified generational markpoints (‘output #’ described above, in the input file), and at the end. Many are generated by calls to R, or are produced in browser-plottable SVG format. These have self-explanatory file names and identifying legends and should be self-explanatory. A few examples are given below. The graphics report various aspects of the data that are useful directly as well as serving as indicators of whether the simulation is doing what you think it should (and in that sense a source of bug-detection either of ForSim or of the input file specifications).
A strange pattern could be an interesting result, but could indicate a bug or that you specified something other than what you thought, so don’t just take them at face value! The graphics can be viewed directly if produced in png format, or downloaded to be viewed, if produced in pdf format. The format is specified as a command-line parameter (see above).

Contents of output graphics
Here is a list of the output figure names (command line specifies whether pdf or png format):

Hap_GeneName_OutputGenerationNumber.svg: figure saved for each gene if outputSVG true is set
PopulationName_PhenotypeNameMean.png: value for each generation during the run
PopulationName_PhenotypeNameStdDev.png: value for each generation during the run
PopulationName_PhenotypeNameLowCulled.png: number culled because they were below the low selection threshold, for each generation during the run (there is no such threshold for neutral or probabilistic ‘functional’ selection, so these and the next three graphics are not useful under those conditions, except as checks).
PopulationName_PhenotypeNameUpCulled.png: number culled because they were above the upper selection threshold, for each generation during the run
PopulationName_PhenotypeNameLowThresh.png: lower threshold value for each generation in the run
PopulationName_PhenotypeNameUpThresh.png: upper threshold value for each generation during the run
PopulationName_culled.png: total selected out, for each generation during the run
PopulationName_postSelectionCensus.png: number of selection-survivors in each generation during the run
PopulationName_census.png: count for each generation during the run
SnpStatsGen#_Born.png: X-axis is generation, Y-axis present-day SNPs that were born at that generation
SnpStatsGen#_Location.png: At interval triggered output generations, X-axis is chromosome coordinate, Y-axis location of existing SNPs, in 10-basepair bins (shown only for chromosome 0, in all populations pooled, intended as a diagnostic for various parameters like mutation and selection)
SnpStatsGen#_ID.png: At interval triggered output generations, gives frequency of each extant SNP for frequency distribution diagnostic checking; given for all SNPs on all chromosomes in all populations.
phenotypes_PopulationName_Generation_##_Phen#.png: phenotype #’s distribution, including a red line showing the normal distribution with the same µ and σ as the data. The file is also saved for the penultimate generation so that response to selection can be viewed.
Hap_GeneName_Gen#.svg: For each ‘output’ interval generation, a multifeature plot of HaploGenes in the population at this generation. This shows, for each individual gene (for all populations pooled, on every chromosome, in a separate figure), one line for each unique HaploGene (haplotype for the gene), and for each SNP along the gene whether the HaploGene’s allele is the ancestral (blank) or novel allele (blue if a positive, red if a negative, green if a neutral assigned effect on phenotypes). On the left is the frequency of the HaploGene in bar-form, and its net assigned phenotypic effect (red, blue, the sum of its SNP effects). Sorted by HaploGene frequency. The background is lightly colored by age of that line’s HaploGene, since the last recombination or mutation that generated it (pink=older, blue=younger). To get these, the input file global block must contain outputSVG true. If output ## is also specified, these svg files will be generated every ## generations; to produce these files only for the final generation of the run, set ## equal to the final generation number.
Heritability.png: By generation, the narrow sense heritability (ratio of genotypic to total phenotypic variance) for the first-defined phenotype. This is useful for debugging, to check the behavior of the specified parameters, and to show the varying effects of genes, or changes in their effects after event-specified occurrences, and so on.
locations_PopName_Generation#.SVG: If usingSpatial is set, this figure represents every individual in the population at Generation # as a circle whose color is determined by the individual’s (first-defined) phenotype in a rainbow scale (that changes each generation, so is relative only). The frequency with which such files are generated is specified in the output line in the input file.

The following two types of figures are not routinely generated. They can take up enormous space if you’re simulating many different genes or they’re generating many different gene haplotypes. The generating code has been commented out in the running script runForsim.rb. If you want these, just search on geneEntropies or geneHaploCounts and remove the comments. Then run runForsim.rb as usual.

Gene#_geneHaploCounts.png: For gene #, the count of unique haplotypes for that gene at each generation during the run, one figure for each gene.
GeneName_geneEntropies.png: entropy value (computed as above, in the output text files description) for this gene at each generation in the run

Here are samples of the above files:

[Figure: Histogram of individuals culled by selection, by generation]

[Figure: Distribution of phenotype values in a single generation, with normal distribution having the same mean and variance (red line)]

Illustration 3: Plot of mean value of Phenotype A in Population A over 1000 generations

Illustration 4: Plot of spatial distribution of individuals generated by usingSpatial

If outputSVG true is used, the following integrated figure comparing haplotypes of cases and controls will be generated (see above for description, and how to specify the desired generation(s) for these figures):

Illustration 5: “Case-control” comparison

ADDENDUM: SOME WAYS TO SIMULATE FEATURES NOT EXPLICITLY BUILT INTO ForSim, AND ONE WAY TO CHECK REASONS FOR CRASHES

While ForSim is very flexible, it cannot explicitly do everything. Nonetheless, many conditions that are not automatically provided for can be simulated by creative use of the existing features. Here are some suggestive examples. First, and most flexibly, many things can be done indirectly by using the serialize/load feature. The saved reload data reflect the previous simulation conditions, but the data after the reload run completes will reflect new properties that may be specified in a revised input file.
Many variables and running conditions that are not amenable to direct event-keyword change can be changed in this way, and there is nothing that prevents several stop-starts of this kind, except that they must be done ‘by hand’ (or shell script).

1. Microsatellites
Microsats are not explicitly included in ForSim, which simulates single nucleotide mutation. But to simulate the hierarchically-ordered high mutation rate behavior of microsatellites, a gene of appropriate length and/or mutation rate can be simulated that will accumulate enough mutations to approximate the higher haplotype heterozygosity of microsatellites. Or a very high mutation rate can be specified as the global rate, but the non ‘microsat’ genes can be given a proportionately lower mutation rate. Doing this will not, however, generate recurrent mutation.

2. Complex prevalence.
The ‘prevalence’ variable in ForSim input files is based only on the first-defined phenotype in the input file. Multivariate prevalence, such as defined by two phenotypes, can be handled in at least two ways. At the end of the run, PreMakeped files are generated for all individuals in the population (and the specified pedigrees). These files list all phenotypes for each individual. Post-run analysis could then determine multi-trait affection status, and this could be altered accordingly in the preMakeped ‘affection status’ column for each individual. Alternatively, Phenotype A can be defined (as the first-specified trait) in terms of the other phenotypes, so that affection status or prevalence is defined in terms of Phenotype A, for example PhenA=(PhenB+PhenC)/2.

3. Mendelian traits.
ForSim does not explicitly specify dichotomous causation. But a gene that one wants to be able to have Dominant effects could be assigned a high large-effect probability in the allelic effects (gamma function) parameters.
If an absolute rather than relative (phenotype distribution tail size, in SD units) cutoff for affected status is chosen, then mutations with large effects will by themselves cause affection status. Environmental variance could be reduced to make single mutations alone more likely to generate affected status. The program already effectively deals with recessiveness in that many mutations may cause affected status only when paired with a second large-enough allele. Also, if natural selection is not involved, then after the run one can write a script to parse the preMakeped or population files and check the genotype at any SNP site you would like to have a dominance or recessiveness property. Adjust the phenotype (affection status) column in the individual’s preMakeped file appropriately.

4. Artificial selection as in agricultural situations.
To specify that only individuals whose phenotype is above some cutoff (in SD relative to the mean) reproduce, specify (say) ‘selection PhenA relative +2.0 +10.0’. Here, any individual below +2SD will be truncated (not reproduce), and anyone with a phenotype >+10SD will not reproduce (this essentially cuts off no one because of high phenotype). The inverse specification could work for thresholds on the small end of the phenotype distribution.

Natural selection is likely not so rigid. Fitness may rise rapidly after some threshold, T, approaching 1.0 above the threshold. This can be specified with a logistic function, where f(ø) is the fitness of an individual with phenotype ø (written \phi in the formula):

$$f(\phi) = \frac{a}{1 + b\,c^{-k\phi}}$$

Here a is the value that fitness approaches at the favored end of the phenotype scale, b controls the Y-intercept (the fitness at a zero phenotype is a/(1+b), which will typically be set near zero), and c is the base of the exponentiation, for which e could typically be used. Users can set values that seem to make sense.
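Before committing to parameter values, it can help to tabulate the logistic function outside ForSim. The following Python sketch is illustrative only and is not how ForSim itself evaluates the expression: the function names, default values and example parameters are assumptions chosen for the demonstration.

```python
import math


def logistic_fitness(phenotype, a=1.0, b=1.0, c=math.e, k=1.0):
    """Logistic fitness f = a / (1 + b * c**(-k * phenotype))."""
    return a / (1.0 + b * c ** (-k * phenotype))


def midpoint_phenotype(b, c=math.e, k=1.0):
    """Phenotype at which b * c**(-k * phenotype) equals 1, so that f = a / 2.

    Requires b > 0, c > 0, c != 1 and k != 0.
    """
    return math.log(b) / (k * math.log(c))


# Tabulate fitness over a range of phenotype values for one trial parameter set.
for phen in (-2.0, 0.0, 1.0, 2.0, 4.0):
    print(phen, round(logistic_fitness(phen, b=20.0, k=2.0), 4))
print("fitness is a/2 at phenotype", round(midpoint_phenotype(20.0, k=2.0), 4))
```

With these definitions the fitness at a zero phenotype is a/(1+b), and larger k makes the rise around the midpoint steeper, which is what makes the midpoint behave like a threshold.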
A steep rise in fitness past, say, the inflection point of the logistic function (where fitness is a/2) essentially makes that the threshold T. Note that in ForSim these functions generate probabilistic f values, so that with probability f(ø) the individual has normal reproduction, and with probability 1-f(ø) has none. In addition, note that the input file cannot take subscripts or superscripts, so in the input file this example would be written

selection PhenA functional a / ( 1.0 + b * c ^ ( 0.0 - k * Phenotype ) )

NOTE: because ForSim must parse a wide variety of possible functions with as little ambiguity as possible, every number must be floating point, and there must be a space between every item (except negative numbers, which can be written -2.3). Normally, for increasing fitness with increasing phenotype, one would set c>1. By testing various parameter values you can set the threshold based on, say, the inflection point (fitness a/2; a/(1+b) is the fitness at a zero phenotype, and the two coincide when b=1), and fitness can decline from a negative threshold as the phenotype increases by setting 0<c<1. As noted earlier, the phenotype in question is mentioned first (here, PhenA), but in the functional expression references to that phenotype are made by ‘Phenotype’, regardless of which phenotype is being tested.

Fitness can lead a population to some mean phenotype value, for example if fitness is 1.0 at that value and less than that away from it. There are many ways this could be modeled, but a useful one that has been used in many papers, and thus is given here, is a normal distribution of fitness, with mean equal to a target optimal phenotype, P, and standard deviation S:

$$f(\phi) = \frac{1}{\sqrt{2\pi S^{2}}}\, e^{-\frac{(P-\phi)^{2}}{2S^{2}}}$$

which will generate maximum fitness at the mean value, P. To make this maximal fitness equal to one, just omit the scaling term:

$$f(\phi) = e^{-\frac{(P-\phi)^{2}}{2S^{2}}}$$
which in input file typography is

selection PhenA functional e ^ ( 0.0 - ( P - Phenotype ) ^ 2.0 / ( 2.0 * S ^ 2.0 ) )

or, to do this during the run

event 900 setPhenotypeSelection PopA PhenA selection PhenA functional e ^ ( 0.0 - ( ( P - Phenotype ) ^ 2.0 / ( 2.0 * S ^ 2.0 ) ) )

where you put in your chosen values for the target optimal phenotype, P, and the strength of selection as represented by S. Keep everything on one line (no linebreaks) even if it wraps on your screen (the above typography is for readability). The ‘0.0’ term is needed to ensure proper parsing of the subtraction operator (here, the first ‘-’ in the exponential term), because the parser looks for binary relationships.

Omitting the scaling term from the Normal distribution, so that fitness at the optimal point is 1.0, is not strictly necessary, since fitnesses are always relative. But if it is not done in this way no genotype has a 100% chance of reproducing, and having a max fitness of 1.0 can reduce the number of individuals culled by selection, making less of an impact on population size, growth capability and so on.

Life is complex but simulations should be simple and interpretable approximations. So don’t try any functional relationships that are too fancy. ForSim’s ability to parse equations is limited (hence the 0.0 and extreme spacing in the above example). Polynomials or other similar approximations are most likely to be best as a rule (but make sure they are between 0 and 1).

To test any such functions, first look at the output screen. At the generation when a functional fitness expression is first implemented (at the very beginning or any such ‘event’ generations) the output screen lines describe the way ForSim is interpreting the equation, term by term. Additionally, you can check the function explicitly: Edit MathStack.cpp in forsim/src.
Find the output line section (search on SURVIVED or CULLED), remove the // commenting symbol from these lines, and recompile (just do ‘make’ in the forsim directory). Then run your intended input file. The fitness of every individual in every generation will be printed to the screen, along with the individual’s phenotype and the fitness decision (survive or culled) based on a random [0,1] number compared to fitness. Stop the program quickly (^C), and verify the fitness computation and decision. Then, re-comment out these lines, and recompile.

NOTE: the following section corrects and improves the corresponding discussion in earlier versions of this Manual.

5. Weak selection.
To simulate the kind of weak selection in which the individual has only slightly reduced fitness below some cutoff value (for example, having high phenotype values confers only a 1% advantage), use a logistic function again with T as the inflection point, but (for example) with a small value for k. That is, fitness drops off below T only very slowly. Try out various parameter values for the function in a spreadsheet program like Excel.

6. Haploid evolution.
ForSim is a diploid simulator. But if you are not concerned with selection (that is, for drift-only simulations for population history inference), you can achieve this. Set recombination to zero, do the normal run, then search the SNP output files (MarkerFreqs##.txt) for a fixed SNP (‘frequency’ = 1.0) with origin (‘born’) generation as early as you can find. Then examine the data only on haplogenes carrying that SNP allele. NOTE that this will not work if there is natural selection, since that is based on diploid phenotypes.

7. Changing parameters mid-run that cannot be changed with event line options.
Not all parameters can be changed with event line options. But the same effect is easy to achieve: Have an initial run stop at the generation where the change(s) are to occur.
Then, using the resulting ‘serialized’ data file and a different input file that contains a ‘load’ instruction and the altered parameter values, start again and run for however many more generations are desired. If many such changes are desired, write a shell script file with the series of runForsim lines, each referring to the appropriate input and serialized files (e.g., before each run, move the most-recent ‘serialized’ file to a location specified in the subsequent input files). Here is an example of such a script:

# example to show how to change parameters in the middle of a run
# when the changes aren't among the 'event' instruction options
# do first run:
ruby runForsim.rb -i FirstInputFile.sim
# go to this run's output folder:
cd runData/For*
# next line needed to extract the serialized file if data were compressed:
tar xzvf *.gz serialized*
# move this serialized file to the main runData folder:
mv -f serialized* ../serialized.txt
# go back to the forsim directory:
cd ~/. . . ./forsim
# run ForSim with new input file:
ruby runForsim.rb -i SecondInputFile.sim
# NOTE: the new input file must 'load runData/serialized.txt'
# NOTE: this instruction set can be repeated as many times as desired.
# The last runData folder will contain the final results

A word on ‘crashes’ and why they happened

As with any complex program, results depend upon the conditions specified. One first line of defense to check the reasonableness of what you specified is to browse the output graphic and text files to see if things you expected based on what you (thought you) specified in the input (.sim) file are what you got. Many problems can be spotted this way: indeed, it is often a very good way to realize the complex nature of evolution and genetic architecture in the real world!
ForSim runs can crash or freeze before saving results. As in any computer program, and given ForSim’s flexibility, there are undoubtedly bugs in the C++ code, and while we are not funded to provide a programming service we may be able to fix them if we’re notified about them. Most often, however, crashes occur either for purely stochastic (bad luck) reasons, or because of parameter settings in the input file. For example, a population can die out if there are not enough mates, or selection is too severe, or a major change such as in a selection or mate-choice regime is too sudden. To explore whether this is the problem, try re-running the same input file. If it only crashes some of the time, then it is this kind of issue. Real-world populations crash, too (most eventually go extinct!) so this may be a lesson in life rather than a ‘mistake’ in what you specified! Major changes can be implemented in steps by a series of moderate ‘event’ instructions.

If re-running doesn’t help, try the same input file but adjust some of the parameters (pop size, number of generations run, selection intensity, try mating with rather than without replacement, &c). If this makes a difference, again this is not a program bug.

Crashes can also occur when some invalid memory call is made, and because this can occur in a multitude of ways, it was not practicable to identify those before they happen and report them with an orderly exit. So if you still are unclear, or want to know just where the problem arises, try the following.
ForSim is written in C++, and here is a way to see at least where the crash occurred (NOTE: this is done by running forsim alone, rather than within its runForsim.rb Ruby wrapper, so the post-run output files and so on will not be produced).

To debug a forsim run compiled with g++ [‘rtn’ means hit Enter]:

    In the makefile CXXFLAGS line, insert -g3
    make clean [rtn]
    make [rtn]
  Then
    gdb --args ./forsim args [rtn]    (--args means there will be arguments, here the input file name)
  then
    r [rtn]     to run
  after the run, or if it crashes,
    bt [rtn]    for a backtrace, which lists the most recent function calls and the source lines they had reached
    q [rtn]     to quit

Of course, you may or may not be able to decipher the issue from this, and you may have to explore the source code to see what it is doing. Note that in the CXXFLAGS = -g3 -O3 [...] line in the makefile, the '-g3' portion instructs the compiler to include debugging information in the executable it is building.

Note to Programmers

As noted at the outset of this Manual, the financial support for this project ended in spring of 2013. At that time, a few legacy quirks remained, and they are noted as such in the text. These are things that do not affect program accuracy as far as we know, but that were not spotted before the project ended. An example is the PolygenicComponent prepended column in the preMakePed output files. This has a value of 0 because a polygenic background component was a feature in development that did not work properly when the programmer left the project. The ForSim C++ code and Makefile compiled and ran on MacOS X Lion and Linux/Unix. The Ruby wrapper scripts ran under Ruby 1.8/1.9 as of Spring 2013. If you wish to modify the program or explore the C++ code, you will find some legacy code that is no longer functional.
One example is the set of routines related to the production of ‘extra pedigrees’. These are sections with features that were in development or needed some debugging at the end of the project. Rather than take the chance that removing the code would unhook functioning features, the unused code was not removed.

APPENDIX 1: ForSim LOGICAL FLOW AND TIME CONSUMPTION

These figures show the general flow of the program among its components. First is a more comprehensive view of the calls (not counting final output function calls), and below that is a simplified version with just the major calls included. The numbers give the percent of CPU time consumed by the respective functions. The logic is basically correct, but the percents vary as the program has been modified. The figures were generated with KCachegrind (http://kcachegrind.sourceforge.net/) using profiling data supplied by Valgrind (http://valgrind.org/). Each box is a major component; the small fill-bars in each box show the relative time consumption in a typical run. These values can be viewed by zooming in on the images.

APPENDIX 2: GENERIC INPUT.SIM FILE TEMPLATE

Example 3a: Edit this basic file (“basic.sim”) to suit your own needs, or make your own from scratch

    global begin
        setFertility poisson 2.5
        setMaxOffspringNumber 4
        prevalence relative 0.08
        generations 1000
        1.0 megabases per centiMorgan male
        1.0 megabases per centiMorgan female
        mutation rate 2.5 E -8.0 female
        mutation rate 2.5 E -8.0 male
        # Optional: one line labeled ‘all’
        event 400 donateParents PopulationA PopulationB 100 100
        event 400 setMatingPopMatrix PopulationA 95 5 0
        event 400 setMatingPopMatrix PopulationB 5 95 0
        event 600 donateParents PopulationB PopulationC 100 100
        event 600 setMatingPopMatrix PopulationA 90 5 5
        event 600 setMatingPopMatrix PopulationB 5 90 5
        event 600 setMatingPopMatrix PopulationC 1 1 98
        output 900
        outputXML true
        matingWithReplacement true
        # The following is usually set to false unless you really want to constrict environmental variance
        scaleEnvironmentNormalVariance true 0.001 0.1
        finalPedigreeDepth 3
    end

    chromosome begin
        length 4000000
        gene begin
            name ABC1
            location 200000
            length 100000
            gamma 1 0.05
            probabilityNoEffect 0.1
            probabilityPositiveEffect 0.5
        end
        gene begin
            name ABC2
            location 500000
            length 100000
            gamma 1 0.05 1 0.05
            probabilityNoEffect 0.1
            probabilityPositiveEffect 0.5
        end
        gene begin
            name ABC3
            location 3300000
            length 100000
            gamma 1 0.05 1 0.05
            probabilityNoEffect 0.1
            probabilityPositiveEffect 0.5
        end
        gene begin
            name ABC4
            location 3700000
            length 100000
            gamma 1 0.05 1 0.05
            probabilityNoEffect 0.1
            probabilityPositiveEffect 0.5
        end
    end

    phenotype begin
        name PhenotypeA
        definition ABC1 + ABC3
    end

    phenotype begin
        name PhenotypeB
        definition ABC2 + ABC4
    end

    population begin
        name PopulationA
        birth 0
        initialSize 900
        carryingCapacity 3000
        growthRate 0.525 # at 2.1% per year for gen=25 yrs
        death 2000
        selection PhenotypeA relative -4.8 0.4
        environmentNormal PhenotypeA 0.0 1.0
        familyEnvironmentNormal PhenotypeA 0.0 0.5
        mateCutoff internal PhenotypeA -4.8 4.8
        mateCutoff external PhenotypeA -4.8 4.8
        selection PhenotypeB relative -4.8 4.8
        environmentNormal PhenotypeB 0.0 1.0
        familyEnvironmentNormal PhenotypeB 0.0 0.5
        mateCutoff internal PhenotypeB -4.8 4.8
        mateCutoff external PhenotypeB -4.8 4.8
    end

    population begin
        name PopulationB
        birth 400
        initialSize 200
        carryingCapacity 3000
        growthRate 0.525 # at 2.1% per year for gen=25 yrs
        death 2001
        selection PhenotypeA relative -4.8 4.8
        environmentNormal PhenotypeA 0.0 1.0
        familyEnvironmentNormal PhenotypeA 0.0 0.5
        mateCutoff internal PhenotypeA -4.8 4.8
        mateCutoff external PhenotypeA -4.8 4.8
        selection PhenotypeB relative -4.8 4.8
        environmentNormal PhenotypeB 0.0 1.0
        familyEnvironmentNormal PhenotypeB 0.0 0.5
        mateCutoff internal PhenotypeB -4.8 4.8
        mateCutoff external PhenotypeB -4.8 4.8
    end

    population begin
        name PopulationC
        birth 600
        initialSize 200
        carryingCapacity 3000
        growthRate 0.525 # at 2.1% per year for gen=25 yrs
        death 2001
        selection PhenotypeA relative -4.8 3.6
        familyEnvironmentNormal PhenotypeA 0.0 0.5
        environmentNormal PhenotypeA 0.0 1.0
        mateCutoff internal PhenotypeA -4.8 4.8
        mateCutoff external PhenotypeA -4.8 4.8
        selection PhenotypeB relative -4.8 4.8
        environmentNormal PhenotypeB 0.0 1.0
        familyEnvironmentNormal PhenotypeB 0.0 0.5
        mateCutoff internal PhenotypeB -4.8 4.8
        mateCutoff external PhenotypeB -4.8 4.8
    end

Example 3b: Edit this even simpler file (“simple.sim”) to suit your own needs, to test the effects of changing instructions, or to help debug input-file syntax

    global begin
        setFertility poisson 2.5
        setMaxOffspringNumber 4
        prevalence relative 0.08
        generations 1000
        1.0 megabases per centiMorgan male
        1.0 megabases per centiMorgan female
        mutation rate 2.5 E -8.0 female
        mutation rate 2.5 E -8.0 male
        # Optional: one line labeled ‘all’
        output 500
        outputXML true
        matingWithReplacement true
        # The following is usually set to false unless you really want to constrict environmental variance
        scaleEnvironmentNormalVariance true 0.001 0.1
        finalPedigreeDepth 3
        # usingTrackSNPs true
        # outputXML true
        # outputSVG true
        # usingSpatial true 1000 1000 0.1 1.0
        event 530 outputXML
        event 500 serializeState
        # event 800 setPhenotypeSelection PopulationA PhenotypeA -8.8 3.8
        # event 910 setPhenotypeSelection PopulationA PhenotypeB -3.8 8.8
    end

    chromosome begin
        length 4000000
        gene begin
            name ABC1
            location 3000000
            length 25000
            gamma 1 0.05 1 0.05
            probabilityNoEffect 0.4
            probabilityPositiveEffect 0.3
        end
        gene begin
            name ABC2
            location 4000000
            length 25000
            gamma 1 0.05 1 0.05
            probabilityNoEffect 0.4
            probabilityPositiveEffect 0.3
        end
        gene begin
            name ABC3
            location 5000000
            length 25000
            gamma 1 0.05 1 0.05
            probabilityNoEffect 0.4
            probabilityPositiveEffect 0.3
        end
    end

    phenotype begin
        name PhenotypeA
        definition ABC1 + ABC3
    end

    phenotype begin
        name PhenotypeB
        definition ABC2
    end

    population begin
        name PopulationA
        birth 0
        initialSize 1000
        carryingCapacity 3000
        growthRate 0.525 # at 2.1% per year for gen=25 yrs
        death 20000
        selection PhenotypeA relative -4.8 0.4
        environmentNormal PhenotypeA 0.0 1.0
        familyEnvironmentNormal PhenotypeA 0.0 0.5
        mateCutoff internal PhenotypeA -4.8 4.8
        mateCutoff external PhenotypeA -4.8 4.8
        selection PhenotypeB relative -4.8 4.8
        environmentNormal PhenotypeB 0.0 1.0
        familyEnvironmentNormal PhenotypeB 0.0 0.5
        mateCutoff internal PhenotypeB -4.8 4.8
        mateCutoff external PhenotypeB -4.8 4.8
    end

APPENDIX 3: MORE COMPLEX SIMULATION FLOW AND INPUT FILE EXAMPLE

In this simulation, a single population of 10,000 runs for 7,500 generations, then splits into two populations of 5,000 that grow to 10,000; then, 2,490 generations later, a third ‘admixed’ population is formed with 70% from Population A and 30% from Population B. The simulation then continues for 10 more generations. Pedigrees at the end will reflect ‘pure’ PopulationA, PopulationB, and admixed PopulationC individuals.
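The timing and the admixture proportions in this description can be checked against the numbers in the input file that follows. The sketch below only does the arithmetic; it assumes that the two counts on each donateParents line are added together to give the number of parents donated (the Manual text here does not spell out what the two columns mean).

```shell
#!/bin/sh
# Counts copied from the generation-9990 donateParents lines of complexTest.sim.
# ASSUMPTION: the two counts on each line are summed to get the parents donated.
from_a=$((3500 + 3500))   # PopulationA -> PopulationC
from_b=$((1500 + 1500))   # PopulationB -> PopulationC
total=$((from_a + from_b))

pct_a=$((100 * from_a / total))
pct_b=$((100 * from_b / total))

# Timeline from the description: split, then admixture, then the final stretch.
admix_gen=$((7500 + 2490))
last_gen=$((admix_gen + 10))

echo "PopulationC founders: $total ($pct_a% from A, $pct_b% from B)"
echo "admixture at generation $admix_gen, run ends at generation $last_gen"
```

Under that assumption the founders of PopulationC split 70% to 30%, and the timeline sums to the 10000 generations requested in the global block.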
Example 4: Following is included as “complexTest.sim” in the distribution:

    #WHAT: Simulation of an admixed population
    global begin
        setFertility poisson 2.0
        setMaxOffspringNumber 8
        output 10000
        prevalence relative 0.09
        generations 10000
        outputXML false
        1.0 megabases per centiMorgan all
        mutation rate 2.5 E -8.0 all
        # Optionally, users can include and alter the following four lines to
        # specify probabilities defining the likelihood that a novel mutation
        # in a given codon position will have no effect.
        usingCodons true
        firstCodonProbability 1.0
        secondCodonProbability 0.9
        thirdCodonProbability 0.5
        matingWithReplacement true
        scaleEnvironmentNormalVariance false
        finalPedigreeDepth 3
        event 7500 donateParents PopulationA PopulationB 2500 2500
        event 7500 setMatingPopMatrix PopulationA 100 0 0
        event 7500 setMatingPopMatrix PopulationB 0 100 0
        event 9990 donateParents PopulationA PopulationC 3500 3500
        event 9990 donateParents PopulationB PopulationC 1500 1500
        event 9990 setMatingPopMatrix PopulationA 100 0 0
        event 9990 setMatingPopMatrix PopulationB 0 100 0
        event 9990 setMatingPopMatrix PopulationC 0 0 100
    end

    phenotype begin
        name PhenotypeA
        definition ABC2 + ABC3 + DEF1 + DEF2 + DEF4 + 4.5 * environment
    end

    phenotype begin
        name PhenotypeB
        definition ABC4 + ABC5 + DEF5
    end

    chromosome begin
        length 10000000
        gene begin
            name ABC1
            location 1000
            length 50000
            # Introns can be “inserted” in genes as follows:
            intron 5000 2000
            # Above, the intron begins 5kbp from the start of the gene “ABC1” and
            # is 2kbp long. All mutations which arise in this region will
            # contribute no phenotypic effect.
            gamma 1 0.01 1 0.01
            probabilityNoEffect 0.9
            probabilityPositiveEffect 0.1
        end
        gene begin
            name ABC2
            location 1000000
            length 50000
            gamma 1 0.01 1 0.01
            probabilityNoEffect 0.9
            probabilityPositiveEffect 0.1
        end
        gene begin
            name ABC3
            location 2000000
            length 50000
            gamma 1 0.05 1 0.05
            probabilityNoEffect 0.9
            probabilityPositiveEffect 0.1
        end
        gene begin
            name ABC4
            location 3000000
            length 20000
            gamma 1 0.01 1 0.01
            probabilityNoEffect 0.9
            probabilityPositiveEffect 0.1
        end
        gene begin
            name ABC5
            location 4000000
            length 50000
            gamma 1 0.05 1 0.05
            probabilityNoEffect 0.9
            probabilityPositiveEffect 0.1
        end
        gene begin
            name DEF1
            location 5000000
            length 50000
            gamma 1 0.05 1 0.05
            probabilityNoEffect 0.9
            probabilityPositiveEffect 0.1
        end
        gene begin
            name DEF2
            location 6000000
            length 20000
            gamma 1 0.05 1 0.05
            probabilityNoEffect 0.9
            probabilityPositiveEffect 0.1
        end
        gene begin
            name DEF3
            location 7000000
            length 20000
            gamma 1 0.05 1 0.05
            probabilityNoEffect 0.9
            probabilityPositiveEffect 0.1
        end
        gene begin
            name DEF4
            location 8000000
            length 50000
            gamma 1 0.01 1 0.01
            probabilityNoEffect 0.9
            probabilityPositiveEffect 0.1
        end
        gene begin
            name DEF5
            location 9000000
            length 20000
            gamma 1 0.01 1 0.01
            probabilityNoEffect 0.9
            probabilityPositiveEffect 0.1
        end
    end

    population begin
        name PopulationA
        birth 0
        death 40000
        initialSize 1000
        carryingCapacity 10000
        growthRate 0.525
        selection PhenotypeA relative -4.8 4.2
        environmentNormal PhenotypeA 0.0 1.0
        mateCutoff internal PhenotypeA -4.8 4.8
        mateCutoff external PhenotypeA -4.8 4.8
        selection PhenotypeB relative -4.8 4.8
        environmentNormal PhenotypeB 0.0 1.0
        mateCutoff internal PhenotypeB -4.8 4.8
        mateCutoff external PhenotypeB -4.8 4.8
    end

    population begin
        name PopulationB
        birth 7500
        death 90000
        initialSize 5000
        carryingCapacity 5500
        growthRate 0.525
        selection PhenotypeA relative -3.6 3.6
        environmentNormal PhenotypeA 0.0 1.0
        mateCutoff internal PhenotypeA -4.8 4.8
        mateCutoff external PhenotypeA -4.8 4.8
        selection PhenotypeB relative -4.8 4.8
        environmentNormal PhenotypeB 0.0 1.0
        mateCutoff internal PhenotypeB -4.8 4.8
        mateCutoff external PhenotypeB -4.8 4.8
    end

    population begin
        name PopulationC
        birth 9990
        death 90000
        initialSize 10000
        carryingCapacity 11000
        growthRate 0.525
        selection PhenotypeA relative -3.6 3.6
        environmentNormal PhenotypeA 0.0 1.0
        mateCutoff internal PhenotypeA -4.8 4.8
        mateCutoff external PhenotypeA -4.8 4.8
        selection PhenotypeB relative -4.8 4.8
        environmentNormal PhenotypeB 0.0 1.0
        mateCutoff internal PhenotypeB -4.8 4.8
        mateCutoff external PhenotypeB -4.8 4.8
    end
    # end of input file
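Every block in the example input files opens with a line ending in 'begin' and closes with a line reading 'end', so a quick balance check can catch one common syntax slip before a run. The following is a minimal sketch, not part of the ForSim distribution: the helper name check_sim_blocks is hypothetical, and it assumes that '#' starts a comment anywhere on a line, as in the examples above. It only counts lines and does not validate any other part of the syntax.

```shell
#!/bin/sh
# Hypothetical helper (not part of ForSim):
#   check_sim_blocks FILE
# Prints the number of 'begin' and 'end' lines in a .sim file and returns
# non-zero when the two counts differ.
# ASSUMPTION: '#' starts a comment that runs to the end of the line.
check_sim_blocks() {
    awk '
        { sub(/#.*/, "") }                    # drop comments before counting
        $NF == "begin"          { opens++ }   # e.g. "gene begin", "population begin"
        NF == 1 && $1 == "end"  { closes++ }  # a line holding only "end"
        END {
            printf "begin=%d end=%d\n", opens, closes
            exit (opens == closes ? 0 : 1)
        }
    ' "$1"
}

# Example use:
#   check_sim_blocks complexTest.sim || echo "unbalanced begin/end lines"
```

Equal counts do not prove that the blocks are nested correctly, but unequal counts are a reliable sign that an 'end' line is missing or duplicated.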