PLABQTL
A computer program to map QTL
Version 1.2
2006-06-01

H.F. Utz and A.E. Melchinger
Institut für Pflanzenzüchtung, Saatgutforschung und Populationsgenetik
Universität Hohenheim, 70593 Stuttgart
(Institute of Plant Breeding, Seed Science, and Population Genetics, University of Hohenheim, D-70593 Stuttgart, Germany)
E-Mail: [email protected]
http://www.uni-hohenheim.de/~ipspwww/soft.html

© Copyright 1995, 2003 H.F. Utz, A.E. Melchinger
All rights reserved

Contents
1 General Overview
2 Installation and Use
3 Biometrical Procedures
  3.1 Composite interval mapping
  3.2 Missing Values
  3.3 Preselection of markers
4 Description of the Output
  4.1 Errors or disagreements
  4.2 Status of marker data, linkage map, percentages of homozygosity and genome of parent1
  4.3 Critical values of LOD scores and F-to-enter
  4.4 Overview on observation variables
  4.5 Plot of LOD scores
  4.6 List of detected QTL
  4.7 Final simultaneous fit
  4.8 Further effects
  4.9 Outliers and influential observations
  4.10 Analysis of QTLxEnvironment interactions
5 Literature
6 Description of Controlling File *.qin
7 Description of data file *.qdt
8 Recommendations for Working with PLABQTL

1 General Overview

PLABQTL is a program for detecting loci that affect the variation of quantitative traits. Its main purpose is to localize and characterize QTL (quantitative trait loci). The program employs the interval mapping approach (Lander and Botstein, 1989). Commands were designed following MAPMAKER/QTL (Lincoln et al., 1993b). In contrast to this and other programs, PLABQTL uses a multiple regression approach with flanking markers according to the procedure described by Haley and Knott (1992). The linkage map of markers is assumed to be known and must be calculated with other programs such as MAPMAKER/EXP (Lincoln et al., 1993a), JOINMAP (Stam, 1993) or GMendel (Holloway and Knapp, 1994).

For detection of QTL, other identified QTL can be used as cofactors (composite interval mapping). This should increase the power of QTL detection (for details, see Jansen and Stam, 1994; Utz and Melchinger, 1994; Zeng, 1994). The often large bias in the variance explained by QTL can be reduced by cross-validation (Utz et al., 2000).

Two types of input are possible:
• a vector format similar to MAPMAKER/QTL: for each marker and trait, the values are input across all genotypes;
• a matrix format: a matrix is input, consisting of trait and marker data, with one row per genotype.

For output, the program gives an overview of the input linkage map with segregation ratios, genotype frequencies of marker pairs, and the proportion of the genome of each genotype which is homozygous or contributed by the first parent. Stepwise regression is used to preselect the most important markers to be used as cofactors.
In the main part, the program calculates for each trait and chromosome:
• a preliminary analysis with multiple regression on all markers of a chromosome;
• a print or PostScript plot of the LOD scores from scanning of each chromosome;
• estimates of parameters such as additive effects, dominance effects, R^2, AIC, and BIC values, and ANOVAs at the position of LOD peaks;
• a final simultaneous fit with all detected QTL;
• a selection of important genetic effects if a complex model with dominance and epistatic effects is used;
• a QTL x environment analysis for all detected QTL with an ANOVA table, a summary table of the QTL effects for each environment and the series, as well as the mean squares of QTL x environment interactions;
• the empirical LOD score thresholds from permutations;
• cross-validation estimates of R^2 and QTL effects.

The program can be employed for the analysis of populations of individuals from selfing generations (F2 and later generations) or doubled haploids derived from an F1 cross between two homozygous lines. Models with or without dominance and two-factor epistasis can be fitted. Only additive effects of cofactors are taken into account. Hence, the present form of the program is most suited for the analysis of topcross progenies. At each step, outlying and influential values can be determined and scatter diagrams can be provided to check the fit.

A major objective in the development of the program was to provide as many checks and hints as possible, each starting with three question marks (???), so that the data analyst can more easily detect possible errors in the data input (which can have a tremendous impact on QTL detection). Data can be transferred from the matrix input format into the vector format required by MapQTL or MAPMAKER/QTL.

Missing marker data are replaced by their expected values, provided that marker information is available for two adjacent markers or at least one flanking marker.
In contrast, genotypes with missing phenotypic data are dropped from the analysis, together with their marker data. Special cases, in which all markers per genotype or all genotypes per marker are missing, are annotated.

2 Installation and Use

The program package contains the following files:

  PLABQTL.EXE    the program
  F77L3.EER      required Fortran77 error list
  PLABQTL.PDF    this documentation
  PQREADME.TXT   some comments for use
  PQNEWS.TXT     last amendments
  PQSAMPLE.QIN   first example for simple interval mapping
  PQSAMPLE.QDT   data to the first example from the tutorial of MAPMAKER/QTL
  PQSAMPLM.QIN   first example for composite interval mapping
  PQSIMUL.QIN    second example
  PQSIMUL.QDT    data to second example with matrix input format

The program PLABQTL consists of the single file PLABQTL.EXE. It runs on PCs at the DOS prompt or in the DOS compatibility box of WINDOWS. Copy PLABQTL.EXE to a subdirectory and add that directory to the search path. This should be sufficient to install the program.

The program is started by the command

  plabqtl FILENAME

The file named FILENAME contains the statements which control the analysis. The command plabqtl alone provides hints for the use of the program. By default, the file extension *.qin is used, so the calls plabqtl FILENAME and plabqtl FILENAME.qin are equivalent.

The phenotypic and marker data are stored in a single separate data file with the default extension *.qdt. The output is written to a file with extension *.qpt. Additional output, arising from scanning and designed for production of high density plots, is given in the file *.qst or in a PostScript file *.ps.

The batch-oriented mode of PLABQTL has the advantage that time-consuming jobs can run in the background. Furthermore, by simple editing of the *.qin file, one can easily change the parameters employed for detection of QTL. The *.qpt output file (in ASCII) can be examined, edited, and printed with any browser, editor, or word processing system.
An example of a set of statements in a *.qin file is

  c Test from the MAPMAKER/QTL tutorial: file sample.raw
  c generate logarithms of the observations
  log
  c input of data from file sample.qdt in the same directory
  load sample.qdt
  c rough scanning in 10 cM intervals
  scan 10
  stop

The load and stop statements are mandatory while all others are optional. Comments are preceded by the letter c. The load command points to the file with the phenotypic and marker data. The scan command initiates the analysis itself. Chapter 6 describes the use of the various statements in detail.

Statement names begin in column one of each line and are written in lower case. Statements with more than four letters can be abbreviated to three letters, e.g., seq instead of sequence. Names and numbers are separated by at least one blank (or tab character). An example is:

  cov 2 5 8 12

In front of numbers, it is possible to insert a comment in quotes "..." such as

  cov "QTL1" 2 5 "QTL2" 8 12

The first, convert, dimensions, and log statements must precede the line with the load statement; otherwise, the data input is performed using the default settings.

Examples for analysis are given in the following files:
• for simple interval mapping (SIM; data from the MAPMAKER/QTL tutorial): pqsample.qin with pqsample.qdt
• for composite interval mapping (CIM; same data body): pqsamplm.qin
• for simulated data (with matrix input format): pqsimul.qin with pqsimul.qdt

It is recommended that beginners first run

  plabqtl pqsample

and compare the output in pqsample.qpt with the comments in the MAPMAKER/QTL tutorial. A short introduction to QTL mapping with PLABQTL, which uses the data file of the MAPMAKER/QTL tutorial for analysis, can be found in the PostScript file PQINTRO.PS (or PQINTRO.PDF for ACROBAT reader) on the FTP server.
3 Biometrical Procedures

3.1 Composite interval mapping

PLABQTL performs interval mapping with or without cofactors (Jansen, 1993; Jansen and Stam, 1994; Zeng, 1994) by using multiple regression. The conditional expectations of QTL genotypes, given the observed marker genotypes at the flanking marker loci, are employed as regressors. Haldane's mapping function is assumed. For F2 populations, the conditional expectations are calculated according to the formulae in Table 1 of the paper by Haley and Knott (1992). For F3 and higher selfing generations, formulae were provided by Hospital et al. (1996). For further details, see the paper by Utz and Melchinger (1994).

For large sample sizes, the regression method converges towards the maximum likelihood method (see the comments on "regression mapping" in the paper of Martinez and Curnow, 1994). According to Xu (1995), the explained variance underestimates the true value. This seems irrelevant, however, compared with the large inflation due to model selection.

3.2 Missing Values

If missing marker data for flanking markers of an interval are encountered, the interval is extended until markers with data are found. Missing cofactor values are replaced by their expected values on the basis of the nearest existing flanking markers. Genotypes without marker information on the chromosome which is to be scanned or which contains cofactors are dropped from the analysis. Likewise, genotypes with missing phenotypic data for the trait under study are dropped from the analysis.

According to our experience, this method is stable and robust. Other methods, such as the 'available case procedure' (Little, 1992), often produce ill-conditioned variance-covariance matrices. Maximum likelihood methods neglect that the degrees of freedom for the error are reduced.

3.3 Preselection of markers

By means of stepwise regression, markers are selected according to their relative importance (see Draper and Smith, 1981, p. 307 ff).
For inspecting the selection, three quantities are reported: the partial F-value, which is the criterion for the selection of markers; the simple correlation coefficient r(xi,y) of phenotypic observations y with the selected marker xi; and the percentage of the phenotypic variance additionally explained by the marker (%add.expl.). A second list summarizes the corresponding RRS-values (residual sums of squares), AIC-values (Akaike's information criterion), and BIC-values (Bayesian information criterion or Schwarz' information criterion).

Similar to the F-to-enter or Mallows' Cp criterion, the AIC criterion serves as a stopping rule in selecting subsets of regression variables (see Miller, 1990; Jansen, 1993, 1994). The marker with the smallest AIC value would be the last marker added to the preselected set. The main problem in the preselection is the choice of an appropriate threshold (i.e., the F-to-enter value or the penalty in the AIC), which corresponds to the significance level. Choice of a reasonable type I error depends on both the experimenter and the major objective(s).

BIC is a Bayesian variant of Akaike's information criterion and penalizes the log-likelihood ratio by K*log(n) rather than the 2*K of Akaike. BIC is similar to AIC with a penalty of 3 in the case of, say, 300 individuals, so BIC uses a much stronger penalty for overfitting than AIC.

AIC is calculated here according to the formula

  AIC = n*ln(RRS/n) + 2*K

where n is the sample size and K is the number of estimated parameters, including intercept, error variance, QTL positions, and QTL effects. For cases with n/K < 40, Burnham and Anderson (1998) recommend AICc:

  AICc = AIC + 2*K*(K+1)/(n-K-1)

BIC is calculated according to the formula

  BIC = n*ln(RRS/n) + K*ln(n)

With BIC or AIC, models can be compared, i.e., cofactors can be selected, the number of QTL can be chosen, or gene architecture can be investigated.
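The three criteria above can be sketched in a few lines of Python. This is only an illustration of the formulas as printed in this section, not PLABQTL's own code; the function name and the example numbers (RRS, n, K) are hypothetical.

```python
import math

def information_criteria(rrs, n, k):
    """Return (AIC, AICc, BIC) from the formulas of section 3.3.

    rrs: residual sum of squares of the regression (hypothetical input)
    n:   sample size
    k:   number of estimated parameters (intercept, error variance,
         QTL positions, and QTL effects)
    """
    base = n * math.log(rrs / n)
    aic = base + 2 * k
    aicc = aic + 2 * k * (k + 1) / (n - k - 1)  # small-sample correction
    bic = base + k * math.log(n)
    return aic, aicc, bic

# Hypothetical comparison of two candidate models fitted to the same data.
n = 300
candidates = {"two QTL": (412.0, 6), "three QTL": (398.5, 8)}
for name, (rrs, k) in candidates.items():
    aic, aicc, bic = information_criteria(rrs, n, k)
    print(f"{name}: AIC={aic:.2f} AICc={aicc:.2f} BIC={bic:.2f}")
```

The model with the smaller value of the chosen criterion would be preferred; because K*ln(n) exceeds 2*K once n is larger than about 7, BIC favours the smaller model more strongly than AIC does.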
The model with the smallest BIC or AIC is best, in the sense of exploiting the information in the data, relative to other models (applying the principle of parsimony, see Burnham and Anderson, 1998). Models with AIC differences smaller than 2 have substantial support and should receive consideration in making inferences. Models with AIC differences greater than 10 might be omitted. AIC, BIC, Mallows' Cp, and the F-to-enter criterion can, at least approximately, be mutually transformed (see Miller, 1990, p. 205 ff, or McQuarrie and Tsai, 1998).

When missing marker values occur, they are replaced in the preselection step by their expectations calculated on the basis of the nearest adjacent flanking markers (see above). The stepwise regression is performed with this "corrected" data set; here, differing degrees of freedom for the error, depending on the number of missing values, are not taken into account. This procedure, and possibly other reasons, may cause an ill-conditioned variance-covariance matrix, which can result in termination of the stepwise regression. In such cases, the partial F-values may increase towards the end of the selection procedure. (This most likely occurs with small F-to-enter values, see below.) In such cases, we recommend use of a higher F-to-enter value in the parameter statement or estimation by ridge regression with THETA>0. The default value is THETA=0.02, which reduces collinearity and prevents singularities of matrices (see Whittaker et al., 2000).

After selection of the markers, another multiple regression is performed with the reduced data set, which takes the degrees of freedom for missing values into account. For ill-conditioned cases, this procedure may not be entirely satisfactory; however, it should suffice to identify important markers to use as cofactors.

Preselected markers may subsequently be adopted automatically in the cov statement. The user should select cofactors sparingly, observing that
a. adjacent, tightly linked markers should not be selected (see Hackett, 1994);
b. all important markers should be included because this increases the power of QTL detection (see Jansen, 1995; Utz and Melchinger, 1994);
c. important chromosomes should be represented with at least one or two cofactors.

Several cofactors per chromosome are justified if several putative QTL are suspected on a chromosome. If the respective regression coefficients have different signs, the contributions of different QTL may cancel each other, and, consequently, the QTL may not be detected.

According to our experience, one should always select markers associated with important QTL as cofactors. This corresponds to an F-to-enter value between 5 and 6 or a penalty of about 3 in the AIC. According to Miller (1990), for testing hypotheses in the present context (Spjøtvoll test with 0.05 probability level), one should employ an F-to-enter value between 8 and 15. Based on a great number of analyses of experimental data, we have found that the default value of 3.5 for F-to-enter used in PLABQTL yields reasonable results, because it also selects cofactors adjacent to minor QTL and those with opposite sign of effects. However, in this case important QTL are frequently represented by both adjacent flanking markers, which seems undesirable. Further recommendations are given in chapter 8.

4 Description of the Output

In the following we describe some of the parameters which can be estimated by PLABQTL.

4.1 Errors or disagreements

Errors or disagreements detected by PLABQTL are indicated with three question marks (???) and a comment. In some cases, the program is terminated; in others, checking of the marker data is recommended. Lines starting with !!! give warnings or comments, e.g., if there is no phenotypic variation for the observed variable. Use of the first statement is recommended when a data set (a *.qdt file) is analysed for the first time.
4.2 Status of marker data, linkage map, percentages of homozygosity and genome of parent1

This summary of various parameters can be used to detect non-polymorphic or dominant markers, distorted segregation, or outcrossings, which cause a high degree of heterozygosity. Frequencies of marker alleles and the significance of their deviation from 0.5, as well as a histogram for homozygosity, are given. Some other general parameters, such as the length of the genome in cM, the average interval length in cM, and the percentage of the genome within 20 cM of the nearest marker, are printed. The average interval length is calculated as

  total length / (no. of markers - no. of monomorphic markers - no. of chromosomes)

Probabilities for the chi-square tests are calculated according to the Wilson-Hilferty approximation and should be regarded as a crude test for two reasons: (1) this approximation, like others, is inaccurate for a small number of degrees of freedom (1 or 2, depending on the type of marker and population considered) and (2) simultaneous tests are performed (however, individual tests are not always independent, see Lander and Botstein, 1989, p. 190 ff).

Some warnings regarding markers are produced, which should be inspected critically. Some reactions are undertaken automatically; others can be performed by the user, e.g., by using the statement exmarker or exindivid and comparing the resulting analyses:
• Marker pairs with only 0 or 1 recombinant are listed.
• Highly correlated markers (r > 0.99) are flagged by the message ??? multicollinearity ... (this is mostly harmless but may occasionally result in nonsensical estimates).
• Monomorphic markers are excluded from the QTL analysis.
• Unobserved markers (coded as missing values) are excluded from the QTL analysis (although such markers may have been used beforehand in the cofactor analysis).
• Inconsistencies for doubled haploid (DH) lines, e.g., heterozygous markers, are shown.
• Markers separated by less than 0.11 cM are combined into a 'synthetic' marker with a name starting with m, containing the chromosome number and position in cM, e.g., m01/003.9. In the synthetic marker, C, D, and M are replaced by the more informative scores A, H, or B if possible.
• A mix of dominant and codominant scores in a marker is allowed (A, B, H and C, D scores occurring together). Thus, interspersed RAPD and AFLP markers can be used. The analysis procedure in such cases is:
  1. In cofactor selection, C and D scores are treated as missing.
  2. In scanning, expected values are calculated for flanking markers with C or D scores.
  This may be suboptimal but may be sufficient for ad-hoc analyses. (The optimal procedure, in which expected values are additionally based on the nearest neighbouring codominant markers, is described by Jiang and Zeng, 1997, Genetica 101, 47-58.)

4.3 Critical values of LOD scores and F-to-enter

The CRITICAL VALUE FOR LOD SCORE (Bonferroni chi-square approximation), abbreviated as Crit.LOD, is calculated using the chi-square approximation suggested by Zeng (1994). For an overall test with M marker intervals in the genome, a large sample size, and not too many markers fitted in the model, the chi-square value for alpha/M and n degrees of freedom (n equals 2 or 3 if additive or additive and dominance effects for QTL are fitted, respectively) can be used as an approximation. The chi-square value is valid for likelihood ratio (LR) test statistics and is divided by 4.605 to obtain the LOD, see also below. Alternatively, a permutation test can be performed with the permute statement.

The minimum detectable partial R^2 (Detect.part.R^2) for the given sample size and alpha level is calculated as (see Melchinger et al., 1998):

  Part.R2 = 1 - EXP(-CHI2/N)   with   CHI2 = crit.LOD/0.2171 = crit.LOD * 4.605

The magnitude of errors of the QTL detection depends primarily on the size of the underlying QTL effects, sample size, and heritability.
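The threshold calculation described in section 4.3 can be sketched as follows. This is a hedged illustration rather than PLABQTL's implementation: the function names are hypothetical, and the chi-square quantile is obtained here from the Wilson-Hilferty approximation (mentioned in section 4.2 for the chi-square tests) only to keep the example free of external libraries. As noted in section 4.2, that approximation is crude for few degrees of freedom; a library routine such as scipy.stats.chi2.ppf would give more accurate quantiles.

```python
import math
from statistics import NormalDist

def chi2_quantile_wh(p, df):
    """Approximate chi-square quantile (Wilson-Hilferty); an assumption
    made for this sketch, not necessarily what PLABQTL uses here."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * math.sqrt(c)) ** 3

def critical_lod(alpha, n_intervals, df):
    """Bonferroni bound: chi-square value for alpha/M, divided by 4.605."""
    chi2 = chi2_quantile_wh(1.0 - alpha / n_intervals, df)
    return chi2 / 4.605

def min_detectable_partial_r2(crit_lod, n_obs):
    """Part.R2 = 1 - exp(-CHI2/N) with CHI2 = crit.LOD * 4.605."""
    chi2 = crit_lod * 4.605
    return 1.0 - math.exp(-chi2 / n_obs)

# Hypothetical experiment: 100 marker intervals, 200 genotypes,
# additive model (2 degrees of freedom), genomewise alpha = 0.25.
lod = critical_lod(alpha=0.25, n_intervals=100, df=2)
print(f"Crit.LOD ~ {lod:.2f}")
print(f"Detect.part.R^2 ~ {100 * min_detectable_partial_r2(lod, 200):.1f}%")
```

The sketch shows how the threshold rises with the number of intervals M and how, for a fixed threshold, the minimum detectable partial R^2 shrinks as the sample size N grows.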
Therefore, the estimable magnitude of effects is given under the heading CLASSIFICATION OF ADDITIVE QTL EFFECTS.

Bonferroni bounds are offered for LOD scores and F-to-enter values in stepwise regression for several alpha values. This should make the choice of threshold values easier. (Further details on comparing diverse empirical and analytical threshold values can be found in Doerge and Rebai, 1996.) For exploratory QTL experiments, a genomewise error rate of 0.25 seems acceptable for the LOD threshold and F-to-enter (Beavis, 1998).

The Bonferroni bounds for the simple correlation coefficients between markers and observations, which are produced by the first statement, are also listed under the header PRESELECTION OF MARKER COFACTORS with subheader r(xi,y)thresh..

4.4 Overview on observation variables

For each observation variable, the number of units, mean, variance, skewness, and kurtosis, and a histogram are printed. (Please note that these estimates apply to the values before a log transformation, if one is requested.)

4.5 Plot of LOD scores

In the plots of LOD scores, positive additive effects are indicated by an asterisk * and negative additive effects by an = sign. Thus, the sign of additive QTL effects can be visualized in plots. The unit on the ordinate (y-axis) is given in brackets in the lower left corner of the coordinate system. On the abscissa (x-axis), an M sign indicates the location of a marker and a C indicates that of a cofactor. In the PostScript files *.ps, markers and cofactors are also plotted on the x-axis by ticks or triangles, respectively.

4.6 List of detected QTL

This shows a list of all detected QTL. For the example pqsample.qpt, the output is:

  QTL Chrom.  Pos Left_M. Mark.I+Pos cM_n.M. Supp.IV  LOD   R^2%  add    dom    DF
  --------------------------------------------------------------------------------
   1  chrom1   26 T93     25 7.      5.      10- 32   4.99   7.6 -0.091 -0.045
   2  chrom2   28 T125    8- 12 7.   7.      16- 36   8.73  12.6 -0.139 -0.016
  --------------------------------------------------------------------------------
  SUM:                                                      20.3

Pos
  The position on the chromosome of the QTL, in cM.
Left_M.
  Name of the left flanking marker.
Mark.I+Pos
  The marker interval, consisting of the flanking marker numbers plus the position, in cM, of the QTL relative to the left flanking marker.
cM_n.M.
  Distance in cM to the next flanking marker.
Supp.IV
  Support interval with a LOD fall-off of 1.0 (default), expressed as position on the chromosome, in cM. Note: A support interval is only determined for the global QTL peak in a given region, i.e., by ignoring other adjacent peaks in the case of multiple peaks. When cofactors are used, the given support interval is very likely only a lower boundary for the true support interval.
LOD
  log10 of the likelihood odds ratio (see Lander and Botstein, 1989). The LOD score is calculated from the F-value in the multiple regression as
    LOD = n*ln(1 + p*F/DFres) * 0.2171
  where p is the number of parameters fitted (see Haley and Knott, 1992). The LOD score can be used to calculate the likelihood ratio LR = LOD * 4.6052 (approximately LR = p*F for large samples), if this statistic is preferred.
R^2%
  The coefficient of determination or the percentage of the phenotypic variance which is explained by a putative QTL. In the case of cofactors (composite interval mapping), R^2% is based on the partial correlation of the putative QTL with the observed variable, adjusted for cofactors. The basis for calculating the R^2 values may vary because cofactors present in the interval under scanning are omitted. If dominance is included in the model, the proportion of the variance attributable to the additive and dominance effects is given. This model is also used for calculating LOD scores.
add, dom
  Estimated additive and dominance QTL effects at the location of scanning. The additive effect is half the difference between the genotypic values of the two homozygotes. It is assumed that the second parent carries the favorable alleles for the trait under study (for calculations see also Falconer, 1989, Ch. 7, who uses the first parent as the superior one). If the second parent is the weaker one, the additive effect becomes negative.
DFres
  The number of degrees of freedom for the residual sum of squares in multiple regression; this is provided to show how missing data might influence the regression results.

4.7 Final simultaneous fit

In the FINAL SIMULTANEOUS FIT, the detected QTL and their estimated positions are used for a simultaneous multiple regression to obtain final estimates of the additive and dominance effects. Standard errors (Std.error) of the estimated effects are given. Besides the estimated QTL effects, other measures of the importance of a QTL include (1) the squared partial correlation coefficient (partR^2%, which is the coefficient of determination between the respective QTL and the phenotypic observations, keeping all other QTL effects fixed) and (2) the partial sums of squares (part.SS), which are obtained by dropping a certain effect. The part.SS each have one degree of freedom and, thus, can be tested against the residual sum of squares from the regression ANOVA. Those QTL which have small and non-significant part.SS could be dropped first from the model. A further column gives the regression sums of squares (Regr.SS) for the simple regression. Calculating formulas can be found in Steel and Torrie (1980, p. 323-327) or Snedecor and Cochran (1980, Chapter Multiple Linear Regression). The last column (Std.eff.) gives the standardized QTL effects, i.e., the effects divided by the phenotypic standard deviation of the trait. The importance of the QTL effects, as determined by the part.R^2% or the part.SS, is generally fairly consistent.
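The conversion between the regression F-value, the LOD score, and the likelihood ratio given under LOD in section 4.6 can be sketched as below. The function names and example numbers are hypothetical, and n is taken here to be the number of observations in the regression, which is an assumption about the notation in that formula.

```python
import math

def lod_from_f(f_value, p, df_res, n):
    """LOD = n * ln(1 + p*F/DFres) * 0.2171 (formula of section 4.6).

    f_value: F-value of the multiple regression
    p:       number of parameters fitted
    df_res:  residual degrees of freedom (DFres)
    n:       number of observations (assumed meaning of n)
    """
    return n * math.log(1.0 + p * f_value / df_res) * 0.2171

def lr_from_lod(lod):
    """Likelihood ratio statistic LR = LOD * 4.6052."""
    return lod * 4.6052

# Hypothetical scan position: additive and dominance effect fitted (p = 2).
lod = lod_from_f(f_value=10.0, p=2, df_res=200, n=203)
lr = lr_from_lod(lod)
print(f"LOD = {lod:.2f}, LR = {lr:.2f}, p*F = {2 * 10.0:.2f}")
```

For large samples ln(1 + x) is close to x and DFres is close to n, so LR comes out near p*F, which is why the two statistics can be used interchangeably as a rough guide.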
To summarize the results of the QTL analysis, we give the following parameter estimates:

R^2%
  Coefficient of determination or the percentage of the phenotypic variance which is explained by the detected QTL, with the approximate standard error calculated according to formula 27.88 of Kendall and Stuart (1961).
R
  Multiple correlation coefficient or the square root of R^2.
LOD
  LOD score of the final fit.
AIC
  Akaike's Information Criterion to choose the most 'probable' model out of a group of models with varying numbers of parameters. (The penalty for the number of free parameters is 1.) The model with the smallest AIC value fits the data best (see Jansen, 1993).
BIC
  The Bayesian information criterion or Schwarz' information criterion may serve as an alternative to AIC.
K
  Number of parameters estimated (number of QTL effects, QTL positions, intercept, and error variance in the final model).
Gen.var.expl.% (and its standard error)
  Percentage of the genotypic variance which is explained by the detected QTL. If the heritability h^2 of the trait is given in the *.qdt file, Gen.var.expl.% is calculated as R^2% / h^2. The standard error is calculated under the assumption of known heritability.
adj.R^2%
  With the adjusted R^2, the explained variance may be estimated more adequately than with R^2, see Hospital et al. (1997). Additionally, adj.gen.var.expl. is given if the heritability is noted in the *.qdt file. As with any ratio of estimated variance components, the estimate adj.R^2% can be less than zero or greater than 100.

The correlation matrix of the estimated QTL effects indicates their degree of dependency. These dependencies generally cause no numerical problems.

4.8 Further effects

If testcross progeny are indicated, the average effect of an allele substitution, which corresponds to 2*additive effect, is reported.
Furthermore, seq and pred statements are output, in which the effects are given in descending order according to their R^2 values from the LIST OF DETECTED QTL; e.g.,

  c:seq 5/106 2/148 1/48
  c:pred 9.735 2.179 1.401 -0.995

where c: stands for calibration. With the statement cross-validate, two similar lines starting with g: are given, showing the effects estimated in the validation set. These two lines can be used to conveniently feed the sequence and predict statements in submodel analyses.

4.9 Outliers and influential observations

Outliers and influential observations are calculated after the preselection of markers and after the simultaneous fit. These lines are annotated with the # sign, so that they can be found or printed if desired. In addition, scatter diagrams depict atypical data points. The following statistics are calculated:

AP2
  ANDREWS-PREGIBON statistic, second factor (AP2 in Draper and John, 1981). AP2 is a spatial measure and shows which observations are influential in the sense that they are isolated from the bulk of the data defined by the columns of the X matrix. It involves only the predictor variables. Smaller AP2 values indicate that the point is more remote from the bulk.
infl
  Influential value of an observation (corresponding to the square root of Cook's statistic). The larger this value is, the greater is the effect on the fitted equation and the regression coefficients found by omitting this observation.
stdRes
  Studentized residual (t in Snedecor and Cochran, 1980, p. 350 and 168) to find extreme deviations from the regression. The error variance is estimated without the suspicious data point. The statistic follows a t-distribution, so it can be tested approximately whether the outlier is exceptional or not (see Snedecor and Cochran, 1980; or Draper and Smith, 1981, p. 144).
Obs. X values
  Observation value and marker values: marker values or expected values in the final fit.
Note: During outlier detection in the case of preselection, marker values (0, 1, 2) or expected values with decimal points are given.

An observation is indicated as outlying or influential if stdRes is greater than 3.5, AP2 is less than 0.5, or infl is greater than 0.4.

4.10 Analysis of QTLxEnvironment interactions

A simultaneous fit with the detected QTL is performed for each environment. The results are subsequently presented in a table showing the ANOVA and the estimated effects.

QTL-ANOVA: The ANOVA is carried out approximately in the following way:

    Source            DF              E(MS)
    --------------------------------------------------------------------
    Environm.         E-1
    Genotypes         G-1
      QTL             Q               VC + f1 VCqe + E VCd + f2 VCq
      Resid           G-1-Q           VC + E VCd
    Genot. x Env.     (G-1)(E-1)
      QTL x Env.      Q(E-1)          VC + f1 VCqe
      Res. x Env.     (G-1-Q)(E-1)    VC
    --------------------------------------------------------------------

where
    Q    = number of detected QTL effects (additive, dominance),
    E    = number of environments,
    G    = number of genotypes,
    VCq  = genetic variance explained by the QTL effects,
    VCd  = unexplained residual genetic variance (deviation),
    VCqe = variance component of QTL x Env. interactions,
    VCde = variance component of Res. x Env. interactions,
    VC   = VCe/R + VCde, with R being the number of replications in a single environment and VCe the pooled plot error.

The ANOVA table for the QTL, and in particular the variance components in the column denoted VComp, are calculated in the following manner, where the expectations of the mean squares (MS) were taken analogously to Knapp (1994) and Bliss (1964, p. 426):

    VC(Genotypes) = [MS(Genotypes) - MS(Genot. x Env.)]/E
        with E = number of macro environments (approximate result if Genotypes x Environments are unbalanced)
    VCq = VC(Genotypes) - VCd
        (please note that VCq is an ad-hoc estimator computed as the difference of two variance components, not as the difference of MS as usual)
    VCd = [MS(Residuals) - MS(Res. x Env.)]/E
    VCqe = MS(Genot. x Env.) - MS(Res. x Env.)
        (also an indirect ad-hoc estimate)

If necessary, VC(Genot. x Env.) and VCde can be computed by subtracting VCe/R = MS(pooled error)/R; see Cochran and Cox (1957), p. 556 or 561.

After the ANOVA for the QTL, an appropriate estimate for Gen.var.expl.% is given, which is generally smaller than the corresponding estimate obtained from the simultaneous fit because it is adjusted for QTL x Env. interactions. This helps to avoid an overestimation of the genetic variance explained by the QTL. The statistic Gen.var.expl.% refers to the proportion of the genetic variance explained by the detected putative QTL and is calculated as the ratio Q² = VCq/VC(Genotypes) * 100.

In a further table, additive effects and, if indicated in the model statement, dominance effects are summarized for all detected putative QTL. The effects are given for each environment as well as for the means across environments. The last column shows the MS(QTLxE), for which the SS are calculated as (sum of the values in the column Regr.SS for individual environments) - (corresponding Regr.SS in the series)*E. Because the MS(QTLxE) are calculated from the difference between the fits to the data from individual environments and to the means across environments, negative values for MS(QTLxE) may occur; these are set to zero. Note that with a small number of environments this statistic is of limited inferential value. The values for MS(QTLxE) are tested for significance with a sequentially rejective Bonferroni F test (SRBF) and tested for homogeneity with Bartlett's test. Bartlett's test is rather sensitive to violations of the assumption of normally distributed data, so other tests may be preferred in such cases or if the number of degrees of freedom is smaller than 5.

5 Literature

Beavis, W.D. 1998. QTL Analyses: Power, Precision, and Accuracy. In: A.H. Paterson (ed.), Molecular Dissection of Complex Traits. CRC Press, Boca Raton, 145-162.
Bliss, C.I.
1967. Statistics in Biology. Vol. 1, McGraw-Hill, New York.
Burnham, K.P., and D.R. Anderson. 1998. Model Selection and Inference. A Practical Information-Theoretic Approach. Springer, New York.
Cochran, W.G., and G.M. Cox. 1957. Experimental Designs. 2nd ed., Wiley, New York.
Churchill, G.A., and R.W. Doerge. 1994. Empirical threshold values for quantitative trait mapping. Genetics 138, 963-971.
Doerge, R.W., and A. Rebai. 1996. Significance thresholds for QTL interval mapping tests. Heredity 75, 459-464.
Draper, N.R., and J.A. John. 1981. Influential observations and outliers in regression. Technometrics 23, 21-26.
Draper, N.R., and H. Smith. 1981. Applied Regression Analysis. 2nd ed., Wiley, New York.
Falconer, D.S. 1989. Introduction to Quantitative Genetics. Longman Scientific & Technical, Essex, England.
Hackett, C.A. 1994. Selection of markers linked to quantitative trait loci by regression techniques. In: van Ooijen, J.W., and J. Jansen (eds.), Biometrics in Plant Breeding: Applications of Molecular Markers, Wageningen, 99-106.
Haley, C.S., and S.A. Knott. 1992. A simple regression method for mapping quantitative trait loci in line crosses using flanking markers. Heredity 69, 315-324.
Holloway, J.L., and S.J. Knapp. 1994. GMendel 3.0 Users Guide. Oregon State Univ., Department of Crop and Soil Science, Oregon.
Hospital, F., C. Dillmann, and A.E. Melchinger. 1996. A general algorithm to compute multilocus genotype frequencies under various mating systems. Comput. Appl. Biosci. 12, 455-462.
Hospital, F., L. Moreau, F. Lacoudre, A. Charcosset, and A. Gallais. 1997. More on the efficiency of marker-assisted selection. Theor. Appl. Genet. 95, 1181-1189.
Jansen, R.C. 1993. Interval mapping of multiple quantitative trait loci. Genetics 135, 205-211.
Jansen, R.C. 1994. Controlling the type I and type II errors in mapping quantitative trait loci. Genetics 138, 871-881.
Jansen, R.C., and P. Stam. 1994.
High resolution of quantitative traits into multiple loci via interval mapping. Genetics 136, 1447-1455.
Kendall, M.G., and A. Stuart. 1961. The Advanced Theory of Statistics. Vol. II, 3rd ed., Charles Griffin & Co., London.
Knapp, S.J. 1994. Mapping quantitative trait loci. In: R.L. Phillips and I.K. Vasil (eds.), DNA-based Markers in Plants. Kluwer Academic Publ., Dordrecht.
Lander, E.S., and D. Botstein. 1989. Mapping Mendelian factors underlying quantitative traits by using RFLP linkage maps. Genetics 121, 185-199.
Lincoln, S.E., M.J. Daly, and E.S. Lander. 1993a. Constructing genetic linkage maps with MAPMAKER/EXP version 3.0: A tutorial and reference manual. Whitehead Institute for Biomedical Research, Cambridge, MA.
Lincoln, S.E., M.J. Daly, and E.S. Lander. 1993b. Mapping genes controlling quantitative traits using MAPMAKER/QTL version 1.1: A tutorial and reference manual. Whitehead Institute for Biomedical Research, Cambridge, MA.
Little, R.J.A. 1992. Regression with missing X's: A review. Jour. Americ. Statist. Ass. 87, 1227-1237.
Martinez, O., and R.N. Curnow. 1994. Three marker scanning of chromosomes for QTL in neighbouring intervals. In: van Ooijen, J.W., and J. Jansen (eds.), Biometrics in Plant Breeding: Applications of Molecular Markers, Wageningen, 153-162.
Melchinger, A.E., H.F. Utz, and C.C. Schön. 1998. QTL mapping using different testers and independent population samples in maize reveals low power of QTL detection and large bias in estimates of QTL effects. Genetics 149, 383-403.
McQuarrie, A.D.R., and C.-L. Tsai. 1998. Regression and Time Series Model Selection. World Scientific, Singapore.
Miller, A.J. 1990. Subset Selection in Regression. Chapman and Hall, London.
Schoen, C.C., A.E. Melchinger, J. Boppenmaier, E. Brunklaus-Jung, R.G. Herrmann, and J.F. Seitzer. 1994. RFLP mapping in maize. Quantitative trait loci affecting testcross performance of elite European flint lines. Crop Sci. 34, 379-389.
Snedecor, G.W., and W.G.
Cochran. 1980. Statistical Methods. 6th ed., Iowa State University Press, Ames.
Stam, P. 1993. Constructing integrated genetic linkage maps by means of a new computer package: JOINMAP. The Plant Journal 3, 739-744.
Steel, R.G.D., and J.H. Torrie. 1980. Principles and Procedures of Statistics. 2nd ed., McGraw-Hill.
Utz, H.F., and A.E. Melchinger. 1994. Comparison of different approaches to interval mapping of quantitative trait loci. In: van Ooijen, J.W., and J. Jansen (eds.), Biometrics in Plant Breeding: Applications of Molecular Markers, Wageningen, 195-204.
Utz, H.F., A.E. Melchinger, and C.C. Schön. 2000. Bias and sampling error of the estimated proportion of genotypic variance explained by QTL determined from experimental data in maize using cross validation and validation with independent samples. Genetics 154, 1839-1849.
Van Ooijen, J.W. 1993. Accuracy of mapping quantitative trait loci in autogamous species. Theor. Appl. Genet. 84, 803-811.
Whittaker, J.C., R. Thompson, and M.C. Denham. Marker-assisted selection using ridge regression. Genet. Res. 75, 249-252.
Wricke, G., and W.E. Weber. 1986. Quantitative Genetics and Selection in Plant Breeding. de Gruyter, Berlin.
Xu, S. 1995. A comment on the simple regression method for interval mapping. Genetics 141, 1657-1659.
Zeng, Z.-B. 1994. Precision mapping of quantitative trait loci. Genetics 136, 1457-1468.

6 Description of Controlling File *.qin

The analysis commands are first-analysis, scan, sequence, permute, and cross-validate. The load statement defines the data file. Logarithmic transformation of the data and restriction of the analysis to certain traits and/or chromosomes can be accomplished with the commands log, trait, and chrom. Markers serving as covariates (i.e., cofactors) in interval mapping are defined by the cov statement. Default values can be changed with the parameter statement. The out statement serves to control the output.
The model statement allows dominance and epistasis to be included in the model. The analysis is terminated with the stop statement. Empty lines can be used to increase readability.

If statements are not given, or are written without specification of parameters, the default values are used. For example, the following three versions of the scan statement are identical: scan, scan 5, or scan 5 2.5.

In the following, the various statements are briefly described:

c comment until the end of the line
    Comment lines can be inserted between ordinary statement lines. The first line should always be a comment line, because this comment is used as the job name in table outputs.

log
    Phenotypic observation data are transformed by logarithms. If the log statement is missing, the original untransformed observations are analyzed. The log statement must precede the load command.

load FILENAME
    The file with name FILENAME contains the marker and phenotypic observation data. If needed, give the full path name. This statement is mandatory. The structure of the file FILENAME is described in chap. 7.

out LPLOT LRSDL LSECF PRIN
    If the out statement is missing, out 1 1 0 0 is assumed.
    Options for the output are:

    LPLOT = 0    no output of scanning details
          = 1    scanning results are given in a curve plot of LOD scores on the *.qpt file (default)
          = 2    curve plot of LOD scores given in PostScript on file *.ps
    LRSDL = 0    no test of residuals
          = 1    test of residuals and influential values, output of plots of residuals on fitted values and of observations on fitted values (default)
          = 2    output of observed, fitted, and residual values to the secondary output file *.qst
    LSECF = 0    no scanning output to a secondary file (default)
          = 1    LOD scores, additive and dominance effects are written to secondary output (to plot values with another program, e.g., Excel, PlotIt), see also statement parameter
          = 2    scanning results on secondary output in numbers (like in MAPMAKER/QTL)
    PRIN  = 0    (default)
          = 1    extended output of the regression analyses for QTL (not recommended for the regular case, rather for debugging)
          = 2    output of covariance matrices used in regression, for debugging (not for regular use)
          = 3    further output of regression tables, e.g., t values of partial regression coefficients
          = -1   minimum output, recommended in combination with the permute or cross-validate statement

trait ITRAIT1 ITRAIT2 ...
    ITRAITi = number or name of the traits to be analyzed,
    e.g. trait 5 2 3 or trait yield protein
    If the trait statement is missing, all traits are analyzed; otherwise only the traits specified by their trait identifier.

chrom ICHROM1 ICHROM2 ...
    ICHROMi = number of the chromosomes to be scanned,
    e.g. chrom 2 4 7
    If the chrom statement is missing, all chromosomes are analyzed; otherwise only the chromosomes specified by ICHROMi. At most 10 chromosomes or linkage groups can be chosen with this statement.

model specifications F-to-E
    model D       dominance effects are included in the model
    model D AA    dominance and two-locus additive×additive epistatic effects are included
    model AA      additive×additive epistatic effects are included
    model D DD    dominance and all two-locus epistatic effects are included, i.e., add×add, add×dom, dom×add, dom×dom

    If this statement is missing, only additive genetic effects are considered in the model. In every case, the detection of QTL is conducted without epistatic effects in the model. Only in the FINAL SIMULTANEOUS FIT are all specified digenic epistatic effects estimated for the detected set of QTL. Afterwards, the epistatic effects are chosen in a stepwise regression procedure, in which the F-to-Enter value (and F-to-Drop) is obtained by using the Bonferroni bound at alpha = 0.05, and a second simultaneous fit is calculated. The user can thus compare several models (without epistasis, from a separate run, or with certain epistatic effects) by their AIC or BIC values. The F-to-Enter value can be specified in the model statement if the default threshold is inadequate, e.g.
        model D DD 10.0
    With epistasis, the correlations between the estimated QTL effects may be much stronger (see the correlation matrix after the FINAL SIMULTANEOUS FIT table) and should be taken into account.

scan ISTEP LODLIM
    Interval scanning is performed and a LOD-score curve generated. Scanning for a putative QTL is carried out at regular increments spaced ISTEP cM apart. ISTEP should be an integer.
        ISTEP = 5 (default)
        ISTEP = 2 (recommended if a high-resolution LOD-score curve is desired)
    The default value for the LOD threshold LODLIM is 2.5. If it is to be changed, LODLIM must be set to another value, e.g., 3.0. LOD scores below the limit LODLIM are not included in the search for QTL.

cov
    Cofactors in the form of marker numbers or marker names can be used as covariates,
    e.g. cov "QTLc1:" 4 5 "QTLc2:" 18 19 "QTLc5:" 43 44
    or cov 4 5 18 19 43 44
    or cov T93 T125 .
    If the cov statement is missing, conventional (or simple) interval mapping is performed.
    Alternatively, marker covariates can be included more conveniently by:
        SELECT (may be abbreviated as S or SEL)    all markers chosen in the preselection are used as cofactors
        ALL (may be abbreviated as A)              all markers in the map are used as cofactors
    Usually all markers of the cofactor set are used as cofactors, e.g. cov SELECT
    With cov/+ , the cofactor set is extended by all markers of the chromosome being scanned. This allows multiple QTL on a chromosome to be detected with greater resolution, but generally with lower power.
    With cov/- , the cofactor set is reduced by all markers of the chromosome being scanned.

sequence sequence of QTL positions
    e.g. seq 1/20 2/23
    A fit is calculated for the given sequence only. The number before the slash is the chromosome number, and the number after the slash is the position of the putative QTL on the chromosome in cM. With the seq statement, the genetic effects are calculated as partial regression coefficients for the indicated positions. If a predict statement follows, these values can be used.
    With seq/s , a sequence of QTL positions can be included step by step in the fitted model (with or without predict). Thus the statement
        seq/s 1/238 6/78 6/130
    sequentially performs the fits
        seq 1/238
        seq 1/238 6/78
        seq 1/238 6/78 6/130
    i.e., seq/s is an abbreviated form of several seq statements.

predict intercept and estimated effects for each QTL
    e.g. pred 80.4 -0.51 -0.12
    For models with only additive effects, one effect per QTL, i.e., the additive effect, must be given. With dominance models (using the model statement described above), the additive and dominance effects for each QTL must be given. The predict statement can only follow a seq statement and is valid only for that sequence and trait.
    Predicted values are calculated for all genotypes with observed markers for use in marker-assisted selection. Predicted and observed values are written to the secondary output file with extension *.qst.

classes i i' j j' k k' m m'
    e.g. classes 2 5 8 12
    or classes 2 5 8 12 8 12 6 10
    or classes T93 C66 T125 T71
    With the classes statement, the frequencies, means, and standard errors of means (s.e.) are calculated for genotypes displaying two (linked) segments with flanking markers i and i', j and j', assuming that the phase of i and i' is the same (neglecting double crossovers). A second pair of segments k k' m m' can be analyzed with the same statement if desired. The following segment types are listed:
        Par1 or 0-0    the segment from parent A
        Het. or 1-1    the heterozygous segment
        Par2 or 2-2    the segment from parent B
        Not1 or 3-3    not parent A (heterozygous or parent B)
        Not2 or 4-4    not parent B (heterozygous or parent A)
    Additionally, two-way class tables i-i' by j-j' (and k-k' by m-m') are given. If the user chooses i = i' etc., simple marker class means are calculated. The genetic effect of such segments can be calculated by formulas found in Wricke and Weber (1986, p. 63 or Table 2.5):
        a = [mean(parentB) − mean(parentA)]/2
        d = mean(heterozygote) − [mean(parentA) + mean(parentB)]/2
    and epistatic effects correspondingly.

environments E
    E = number of environments
    With the environments statement, a QTL×environment analysis can be invoked for the situation of a simultaneous fit (see chap. 4.7). The observation data of the individual experiments must be appended to the data file *.qdt, see chap. 7.
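The class means reported by the classes statement and the formulas for a and d quoted above can be illustrated with a small sketch. It assumes marker classes coded 0 (parent A), 1 (heterozygote), and 2 (parent B) and takes d as the deviation of the heterozygote from the midparent value; the function names and the data are invented for illustration and are not PLABQTL output.

```python
from math import sqrt
from statistics import mean, stdev

def class_summary(codes, values):
    """Frequency, mean and standard error of the mean for each marker class."""
    summary = {}
    for code in sorted(set(codes)):
        vals = [v for c, v in zip(codes, values) if c == code]
        # The standard error needs at least two observations in the class
        se = stdev(vals) / sqrt(len(vals)) if len(vals) > 1 else float("nan")
        summary[code] = (len(vals), mean(vals), se)
    return summary

def segment_effects(mean_parent_a, mean_het, mean_parent_b):
    """Additive effect a and dominance effect d from the three class means."""
    a = (mean_parent_b - mean_parent_a) / 2.0
    d = mean_het - (mean_parent_a + mean_parent_b) / 2.0
    return a, d

# Invented data: class code (0 = parent A, 1 = heterozygote, 2 = parent B)
codes = [0, 0, 1, 1, 1, 2, 2]
values = [9.0, 11.0, 12.0, 13.0, 14.0, 13.0, 15.0]
summary = class_summary(codes, values)
a, d = segment_effects(summary[0][1], summary[1][1], summary[2][1])
for code, (freq, m, se) in summary.items():
    print(code, freq, m, round(se, 3))
print("a =", a, "d =", d)
```

With these invented values the class means are 10, 13, and 14, so the sketch yields a = 2 and d = 1.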
first analysis
    This provides a first analysis of the marker and phenotypic data: linkage map, segregation ratios with Chi-square tests, frequencies of marker pairs, percentage of homozygosity and percentage of the genome inherited from the first parent in the individuals assayed for markers, histogram for homozygosity, overview of the distribution parameters and histograms for each observation variable, and stepwise regression to preselect cofactors. For each chromosome, simple means and frequencies of observation values are given for each marker class. Additionally, the input of the data is monitored on file pq___.mtr so that data errors can be found more easily. The first statement must precede the load command. It is advantageous to use the first statement only in the first run of an analysis and to omit it in subsequent runs.

permute NN
    NN = number of random reshuffles of the observations to produce empirical threshold values for LOD scores (see Churchill and Doerge, 1994) or to perform a permutation test. Following Churchill and Doerge, NN = 1000 is recommended. Clearly, NN reshuffles require much more computing time than a single pass. The critical values are found at the end of the *.qdt file under the heading EMPIRICAL CRITICAL VALUES for several alpha values (between 1% and 30%). It is recommended to use out 0 0 0 -1 to minimize the output in the *.qpt file. The perm statement must be given together with a scan statement. The critical values apply to one particular situation, i.e., a genetic model with or without dominance or epistatic effects. For simple interval mapping (SIM), the threshold value is smaller than for composite interval mapping (CIM, with statement cov SEL), as can easily be seen.
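The idea behind the permute statement, reshuffling the observations and recording the largest LOD score of each reshuffle, can be sketched as follows. This is a simplified illustration under stated assumptions: LOD scores come from a simple regression on single markers rather than from the interval scan that PLABQTL performs, and the function names, the simulated data, and the use of NumPy are hypothetical.

```python
import numpy as np

def lod_scores(markers, y):
    """LOD score of a simple regression of y on each marker column."""
    n = len(y)
    yc = y - y.mean()
    rss0 = float(yc @ yc)                    # residual SS of the null model
    mc = markers - markers.mean(axis=0)
    sxy = mc.T @ yc
    sxx = np.sum(mc ** 2, axis=0)
    rss1 = rss0 - sxy ** 2 / sxx             # residual SS with the marker fitted
    return 0.5 * n * np.log10(rss0 / rss1)

def permutation_threshold(markers, y, n_perm=1000, alpha=0.05, seed=1):
    """Empirical genome-wide LOD threshold from n_perm reshuffles of y."""
    rng = np.random.default_rng(seed)
    max_lod = np.empty(n_perm)
    for k in range(n_perm):
        # Shuffling y breaks any marker-trait association
        max_lod[k] = lod_scores(markers, rng.permutation(y)).max()
    return float(np.quantile(max_lod, 1.0 - alpha))

# Simulated example: 100 individuals, 20 unlinked markers, QTL at marker index 3
rng = np.random.default_rng(42)
markers = rng.integers(0, 3, size=(100, 20)).astype(float)
y = 50.0 + 1.0 * markers[:, 3] + rng.normal(0.0, 1.0, size=100)
observed = lod_scores(markers, y)
threshold = permutation_threshold(markers, y, n_perm=500, alpha=0.05)
print("empirical 5% LOD threshold:", round(threshold, 2))
print("markers above threshold:", np.flatnonzero(observed > threshold))
```

Taking the maximum over the whole genome in each reshuffle is what makes the resulting quantile a genome-wide rather than a pointwise threshold.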
cross-validate NREPL
    NREPL = number of replicates in resampling (default 5)
        cross           standard 5-fold cross-validation with 5 splits
        cross 100       20 independent resampled 5-fold cross-validations
        cross/p 1000    200 5-fold cross-validations and QTL frequency on PostScript file *.ps
        cross/e         environmental cross-validation
    The random number seed has the same value for each call.
    With the cross statement, a 5-fold cross-validation is performed in order to estimate the explained variance without the overestimation caused by model selection. This estimation bias is often substantial: frequently about half of the explained variance R² is an overestimate (see, e.g., Utz et al., 2000). It is recommended to judge the predictive value of the QTL analysis by cross-validation. In 5-fold cross-validation, QTL (positions and effects) are estimated with 4/5 of the individuals, and a validation is done with the remaining 1/5 of the genotypes. In particular, the R² values are estimated. This is repeated 5 times so that each fifth of the data is used once for validation. The output of the cross-validation runs is given as usual in the *.qpt file, followed at the end by a summary under the header 5-FOLD CROSS-VALIDATION SUMMARY with the medians and extreme values of the R²-related estimates. Thus insight can be gained into how the QTL estimates depend on genotypic sampling and how strongly the explained variance is overestimated by model selection.
    The more important estimates of the cross-validation runs can also be found in the *.qcv file for further calculations. Lines starting with c:seq contain the positions of putative QTL found during calibration, with chromosome number and position in cM, in descending order of their R² values, e.g.
        c:seq 2/28 1/24
    The estimated QTL effects follow under the heading c:pred: the intercept, and the additive and dominance effect for each QTL.
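The c:seq and c:pred lines described above lend themselves to simple post-processing. The following sketch assumes that each seq line is immediately followed by its pred line and that fields are separated by blanks, as in the examples of this manual; the exact layout of a real *.qcv file may differ, and the function name and the second pair of example lines are invented for illustration.

```python
from statistics import median

def parse_cv_lines(lines):
    """Pair each '<set>:seq' line with the '<set>:pred' line that follows it."""
    records = []
    for line in lines:
        fields = line.split()
        if not fields or ":" not in fields[0]:
            continue                          # not a seq/pred line
        prefix, keyword = fields[0].split(":", 1)
        if keyword == "seq":
            positions = []
            for item in fields[1:]:
                chrom, pos = item.split("/")  # chromosome number / position in cM
                positions.append((int(chrom), float(pos)))
            records.append({"set": prefix, "positions": positions,
                            "intercept": None, "effects": []})
        elif keyword == "pred" and records and records[-1]["set"] == prefix:
            values = [float(v) for v in fields[1:]]
            records[-1]["intercept"] = values[0]
            # With dominance models this flat list holds two effects per QTL
            records[-1]["effects"] = values[1:]
    return records

# The first pair is the example given in this manual; the second pair is invented
example = [
    "c:seq 5/106 2/148 1/48",
    "c:pred 9.735 2.179 1.401 -0.995",
    "c:seq 5/104 1/50",
    "c:pred 9.810 2.305 -1.120",
]
records = parse_cv_lines(example)
largest_effects = [rec["effects"][0] for rec in records]
print(records[0]["positions"])
print("median effect of the first-ranked QTL:", median(largest_effects))
```

Keeping the set prefix in each record makes it possible to summarize lines with different prefixes separately.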
    With these putative QTL, a validation is conducted in the part of the data not used in calibration, and the phenotypic variance explained by the QTL (phen.var.expl.%) is given with and without adjustment. Additionally, with the calibrated QTL positions, the QTL effects are re-estimated in the validation set and given briefly in the lines g:seq and g:pred. The validated estimates of R² and of the QTL effects are on average smaller than the calibrated values. With an editor or a grep utility, the c: and g: lines can be extracted and then averaged or plotted as in Utz et al. (2000). (A tool for calculating medians is available on request by e-mail.) The c: and g: lines can also be found in the *.qcv file.
    To estimate the QTL frequency, i.e., the probability of occurrence of QTL, a high value of NREPL, e.g., 1000, is necessary. With cross/p , the frequency is plotted as a profile over the position on the chromosomes, like the LOD profile.
    The cross/e variant, together with the environments statement, allows validation across environments. Given E environments, E-1 environments are used for the estimation of QTL and the omitted environment for validation. Using each environment once for validation, E such runs are performed.

exindivid IND1 IND2 ...
    INDi = number of the individual to be excluded from the QTL mapping analysis (but included in the LINKAGE MAP overview),
    e.g. exind 3 4 5 237
    If the exindivid statement is missing, all individuals are analyzed.

exmarker MARK1 MARK2 ...
    MARKi = number of the marker to be excluded from the mapping analysis,
    e.g. exmark 3 45 89
    If the exmark statement is missing, all markers are used in the analysis.

parameter DELIM FtoE FtoD THETA
    The parameter statement is optional and is needed only if default settings are to be changed. Default values are:
        DELIM=blank FtoE=FtoD=3.5 THETA=0.02
    DELIM = any character to separate output in the secondary output file (with extension *.qst), e.g. the character for horizontal tab (given as ALT-9 for separation in Excel files)
    FtoE  = F-to-Enter, see Draper and Smith (1981, p. 308ff) or chap. 3.3, usually a value between 2.5 and 15
    FtoD  = F-to-Drop, usually the same value as FtoE
    THETA = constant for ridge regression (a small value, e.g., 0.01); for more details see Draper and Smith (1981, p. 313). Normally it must be used iteratively.

convert
    The data in the *.qdt file are converted to another format to obtain input files for van Ooijen's MapQTL or, after some editing, for MAPMAKER/QTL. Data files PQ___.MAP, PQ___.DAT, PQ___.LOC, and PQ___.QDT are produced. Observation and marker values are given in matrix form on the file PQ___.QDT. At the end of this file, the same matrix is found with missing marker values replaced by their estimated values, rounded to one digit. The user can thus check the replacement of missing data, compare it with other programs, or delete individuals or markers more comfortably in the matrix form. With these four files and a little editing, all other QTL mapping programs should be easy to use for a comparative analysis. The convert statement must precede the load command.

dimensions MAX_EFFECTS
    Defines the maximum number of possible QTL effects. The default is MAX_EFFECTS = 150, which is sufficient in normal cases. The statement should be necessary only if a very large number of putative QTL is expected during detection and epistatic models are used. As a definition statement, it must be given before the load statement.
    MAX_EFFECTS = maximum number of QTL effects to be estimated, i.e., the number of all additive, dominance, add×add, add×dom, dom×dom epistatic effects which can enter the QTL analysis. The number depends on the specifications in the model statement.
    This number can be calculated from the maximum possible number of QTL, MAX_NQTL:
        without dominance and epistasis:         MAX_EFFECTS = MAX_NQTL * (MAX_NQTL + 1) / 2
        in case of dominance without epistasis:  MAX_EFFECTS = 2 * MAX_NQTL * MAX_NQTL

stop
    Closing statement to terminate the program. Subsequent lines in the *.qin file are ignored.

7 Description of data file *.qdt

The *.qdt data file contains the marker data, the linkage map, and the phenotypic observation values. Hence, recombination frequencies among markers are assumed to be known. Lines 1 to 3 are required to characterize the data:

1. Line with an arbitrary comment (job reference)
   Additional comment lines beginning with c are possible in *.qdt files, especially to give further comments at the head of the file.

2. Line with dimensions of the data set:
   NENTR, NMARKM, NVAR, NCHROM, MMIN, NIDENT, RECF, FGEN, RALPH
       NENTR  = number of genotypes (individuals)
       NMARKM = number of markers
       NVAR   = number of observation variables (traits)
       NCHROM = number of chromosomes (linkage groups)
       MMIN   = 0  input in matrix format
              = 1  input in vector format (similar to MAPMAKER/QTL)
       NIDENT = number of identifiers at the beginning of each data row (only of importance for input in matrix format)
       RECF   = 0  distances among marker pairs are given in cM (see item 5, linkage map)
              = 1  distances among marker pairs are given as recombination frequencies
              = 2  distances among markers are given as positions in cM on the linkage map (= cumulative distances from the first marker of the linkage group)
       FGEN   = 1  marker genotypes refer to doubled haploids from an F1
              = 2  marker genotypes refer to individuals from an F2 generation
              = n  with 2 < n < 34, marker genotypes were determined in Fn (a homozygous population of recombinant inbreds can be indicated by using a high value for FGEN)
       RALPH  = 1  additive effects are calculated (default)
              = 2  testcross case: effects of an allele substitution (see e.g., Schoen et al., 1994) are calculated additionally, i.e., additive effects are multiplied by two

3. Line with heritabilities:
   Heritabilities of the phenotypic observations (mostly means across plots and environments) are given as values between 0 and 1, in the same order as the observed traits. Unknown heritabilities are coded as zeros. An example may be:
       0.54 0.73 0.0 0.86

4. Lines with the observed marker and phenotypic data:

4a. Input in vector format (similar to MAPMAKER/QTL): sequential input of marker data and phenotypic observations, separately for each marker and trait; an example can be found in the file pqsample.qdt.
   For each marker, give the name, which may begin with a star * . The name must not start with a digit, as in 150T . Only the first 9 characters of each name are used. Examples are *T150 or *MARKER1 or T149 .
   The marker data follow the marker name on the same line or on subsequent lines. The data may contain blanks or tabs to make the marker data more readable. A, B, H, - (case insensitive) are used to mark genotypes of parent A, parent B, the heterozygote H, or missing marker data, respectively. Further situations are coded by the letters
       C or c = not A, i.e., H or B (for dominant markers)
       D or d = not B, i.e., H or A (for dominant markers)
       M or m = mutation, treated as a missing value (or -)
   Instead of letters, numbers may also be used for coding:
       0  parent A
       1  heterozygote H
       2  parent B
       3  not parent A (i.e., C)
       4  not parent B (i.e., D)
       9  missing value (i.e., M, m, -)
   After the marker data, observation values are input sequentially. As above, the name of the trait may, but need not, begin with a star * , and only the first 9 characters are used. A comment enclosed in double quotes can be given after the name. Examples are:
       *trait01 or *PtHgt "plant height in cm"
   In the following lines, observations are separated by one or several blanks or tabs.
   Missing values are coded by -99.00 (or smaller values), by a star * , or by a minus - . Comments in double quotes can be interspersed between observation data. The sequence of the data for the various individuals (entries) must be identical for each marker and trait. An example may be:

       *mark31
       AAHBA HHBAA bhaab ahbah aaahb hahaa ahhaa haaaa ahhhh hbbab
       hahhb bbaha bhahh hhahb bhaba bbhha ahbha bhahb bbhbb bhhhb
       aahhb aaahh hhhah habbh hhhhb hbhbh hhbhh hbbhh hhhhh hahhh
       *T175
       HAHAHHA-HHHAHHAAHAAHHHAHAAAB-HAHHHAAHHHHHHHAHAHAAA-AHAH--HHA
       AAHHAA-AHHHAAAHAAAAHHAAHAAHAAAHHAHAHAHAH-HHAAAHHAHAAAAHAHAAH
       HHAAH-AAHHHHAAHHHHAAB-HAHAAHA-AAH-AAAHAAAHAHHHAH-AHHAH-HHAHH
       AAAAAAHAAHHHA-------------------....
       *weight "grams per plot"
       4.949 3.58 -99999 "as missing value" 6.212 6.140 5.330 5.761 5.470 7.897
       7.559 -99999 4.990 5.316 3.190 6.160 8.150 5.370 16.330
       3.220 9.540 -99999 6.360 6.040 5.480 4.710 6.310 4.390 ...

4b. Input in matrix format (for each genotype a row of data): An example is found in file pqsimul.qdt.
   First, the names of traits and markers must be given. As in the vector format, a name can begin with a star.
   - One line with NVAR trait names, separated by one or more blanks, e.g.
         yield *dmc *height tgw
   - One or several lines with NMARKM marker names separated by one or more blanks, e.g.
         UMC94 BNL8.05 UMC76 UMC137 UMC58 UMC23 UMC128 UMC37 BNL15 *UMC106 *BNL6.32
   Next follow the observation data rows. Three groups of data fields, each separated by at least one blank, must be distinguished:
   - At the beginning of each data line, NIDENT data fields are skipped.
   - NVAR trait values follow. Missing values are coded by -99.00 (or smaller values), by a star * , or by a minus - .
   - NMARKM marker values follow, coded as described above.
   An example with NIDENT=1, NVAR=2, and NMARKM=15 may be:

       A1 4.949 45.6 101 11111 122 1122
       A2 3.58  45.9 009 90022 222 1001
       D3 *     32.3 011 12111 111 2111

   The first column here gives an identifier for each individual.
   Any text can be written there, e.g., a site and year code in addition; the program skips these fields. The number of identifying fields separated by blanks must be defined with NIDENT (in the example above NIDENT=1). If NIDENT=0, no identifiers are given.

5. The linkage map is attached to the marker and trait data. For each chromosome, the linkage information is coded as follows:

5a. The chromosome name (only the first 7 characters are used) and the number of markers on the chromosome are given in a first line. The chromosome name may, but need not, begin with a star * . Examples:
       *chrom01 12 or chrom10 121

5b. In one or several subsequent lines, the marker numbers or marker names are given with their distances. Marker numbers are listed in their order on the chromosome and numbered across the genome. Between each two markers, their distance is given in cM (if RECF=0) or as recombination frequency (if RECF=1). If RECF=2, the absolute positions of the markers on the chromosome can be used. Example with cM:
       1 4.2 3 15.0 2 11.9 5 12.2 7
   or
       ABD13 4.2 ABD14 15.0 ABD16
   A map for two linkage groups may look as follows:

       *chrom1 5
       1 4.2 3 15.0 2 11.9 5 12.2 7
       *chrom2 7
       4 14.8 11 6.4 8 18.9 12 24.0 9 18.1
       6 28.6 10

   or a linkage group with 3 markers:

       *chrom1 3
       ABD13 12.1 ABD14 22.4 ABD15

6. Observation values from individual experiments:
   These values must follow if the environments statement is used in the *.qin file. The data may also be mean values across replications, adjusted for incomplete blocks, or averaged over several plants. The input is in matrix format: for each environment a matrix with NENTR rows and NVAR columns (one per trait). Entries (individuals) and traits must be given in the same order as specified before. Data are separated by blanks or tabs as usual. This block of data from individual experiments is opened by a line of the following form:
       *env KIDENT arbitrary text
   KIDENT gives the number of data fields to be skipped at the beginning of each data line.
An example with two traits x and y and two identifying data fields may be given for 2 locations and 3 individuals: *env loc1 loc1 loc1 loc2 loc2 loc2 2 ind1 ind2 ind3 ind1 ind2 ind3 x11 x12 x13 x21 x22 x23 y11 y12 y13 y21 y22 y23 Sequence must be ordered, within each environment individuals increasing, see also pqsimul.qdt at the bottom of file. 8 RECOMMENDATIONS FOR WORKING WITH PLABQTL 31 8 Recommendations for Working with PLABQTL Beginners with no experience in QTL analyses are advised to conduct and interpret a few analyses using the tutorial of MAPMAKER/QTL, because it gives a more detailed description of many terms than this manual. We recommend the use of cofactors, i.e., working with the cov statement, especially when several QTL are expected. Once all marker data including the linkage map as well as the phenotypic observations are available, one can set up the *.qdt data file. It will likely not change during the course of the analysis unless genotypes with a relatively small number of marker loci or suspicious outliers for the phenotypic observations are deleted. In conducting an analysis, we recommend the following procedure: 1. An initial run with the first_analysis statement to check the input data file which is monitored in file pq___.mtr. In subsequent runs, this statement should be unnecessary. Several basic lists are produced with this statement like LINKAGE MAP or CLASSIFICATION OF ADDITIVE QTL EFFECTS. 2. A run with the scan statement, but without the cov statement, i.e., simple interval mapping (SIM). These results can be compared with results obtained with MAPMAKER/QTL, if desired. Minor differences may occur because (1) PLABQTL employs a multiple regression procedure, whereas MAPMAKER/QTL use a maximum likelihood approach for estimation and (2) missing values may be treated differently by the two programs. 3. 
The main run is conducted with the scan and cov SEL statements to determine the most important markers, which are subsequently used as cofactors in composite interval mapping (CIM). In addition, the data should be checked for outliers and their plausibility. Individual outliers in conjunction with crossovers may result in spurious QTL or in unimportant markers being selected as cofactors (Hackett, 1994). In further runs, one may skip this check by setting the out command accordingly. It is best to conduct a run with an added permute statement and to choose an empirical LOD threshold for a given error level. If the selected cofactors do not appear to be chosen optimally, the user may either change the boundaries FtoE and FtoD in the parameter statement or correct the selected set of cofactors on the basis of other criteria such as AIC or BIC. In the latter case, the cov statement must list the selected cofactors by their numbers. In the case of ill-conditioned preselection runs, one may also alter the THETA parameter in the parameter statement.

4. We have found a further run with the cov/+ cofactor set useful, in order to include all markers on a chromosome as cofactors (in addition to those already employed as cofactors). This facilitates the detection of putative QTL having different signs and the resolution of "ghost QTL". Since power in cov/+ runs is reduced due to the high number of cofactors, it may be desirable to correct the set of cofactors of the main run according to the identified questionable regions before running the final analysis.

5. Special analyses of particular traits or chromosomes can be performed with the trait and chrom commands. Fits with certain suspected QTL positions can be carried out with the command seq. With the classes statement, the means and frequencies of genotypes with certain marker segments are available.
This often makes clear how weak the QTL estimates are, and that QTL mapping is more an exploratory than a confirmatory statistical exercise.

6. Provided the main run (e.g., the one described under point 3) has run without problems, a new run is performed with the environments statement. This provides an unbiased estimate of the percentage of the genetic variance explained by the QTL and an estimate of QTL×environment interactions, if the number of environments is sufficiently large (> 5).

7. With the cross statement, and with cross/e for several environments, we obtain information about the magnitude of the bias in the explained phenotypic and genotypic variance. With cross/p 1000, QTL occurrence frequencies and less biased QTL effects are produced.

Hints:

1. Markers can be dropped from the analysis by removing the corresponding marker numbers from the linkage map at the end of the file *.qdt. Of course, one has to add the map distances to the right and left of the excluded marker, provided these are given in cM. As an example, take the test data file pqsample.qdt with 5 markers and distances in cM:

  *chrom1 5
  1 4.2 3 15.0 2 11.9 5 12.2 7

Assume that marker 3, the second marker in the map, should be omitted. Add 4.2 and 15.0 and remove the 3. The map then contains only 4 markers:

  *chrom1 4
  1 19.2 2 11.9 5 12.2 7

If recombination frequencies are used and absence of interference is assumed, one must calculate $r = r_a + r_b - 2 r_a r_b$, where $r_a$ and $r_b$ are the recombination frequencies of the two intervals.

2. Mapping populations that consist of doubled haploid lines derived from an F1 cross, or of backcrosses obtained by crossing an F1 with one of the two homozygous parents, can be analyzed as F2 populations because these types of populations have the same recombination frequencies as an F2.
For the coding of marker genotypes, it is important to note that with doubled haploids, heterozygous genotypes (i.e., H or 1) are missing, and additive effects are calculated as half the difference between the two homozygous classes. Likewise, with backcrosses, marker genotypes of the non-recurrent parent are missing. When comparing with other QTL mapping programs, one must pay attention to the definition of the additive effect.

3. The first parent always has the smaller coding value in the markers (A, a or 0), is listed first in all lists, and is assumed to be the weaker parent in the definition of the additive effect:

  additive effect = (parent B - parent A) / 2

In some lists, the first parent is additionally denoted by P1 and the second by P2 for clarity.

4. For a first look at the data, run an analysis with the statements first and convert to get more detailed output and, hopefully, clear error messages. The files pq___.* can be checked to verify that the data were read correctly.

5. PostScript LOD curves can be printed on one page from the *.ps file produced when the out statement parameter LPLOT=2 is set. Such a page allows a fast overview of LOD peaks over the genome or a comparison of different LOD curves from several runs. You may use ghostview to look at the PostScript files or to print them on a non-PostScript printer. With ghostview or pstoedit it is also possible to convert *.ps files into *.bmp, *.wmf, or *.pdf files, which can then be imported and edited in word-processing or graphics programs. To use pstoedit from within GSview, use Edit | Convert to vector format. For further information on PostScript, Ghostscript and Ghostview, see http://www.cs.wisc.edu/~ghost; for pstoedit, see http://www.geocities.com/SiliconValley/Network/1958/pstoedit/index.html

6. Markers are often so tightly linked, with no or only a few recombinants between them, that the scanning process may be disturbed by ill-conditioned equation systems.
Since highly correlated markers, or a very dense marker map with small samples (N<500), are not advantageous for QTL mapping, the information of tightly linked markers is pooled. Markers separated by less than 0.11 cM are combined into a new "synthetic" marker, in which C, D and missing marker scores are simply replaced, if possible, by the more informative score (A, H, B) of the tightly linked marker. The name of a "synthetic" marker starts with the letter m, followed by the chromosome number and the position in cM, e.g., m02/092.0.

7. We recommend that LOD curves be checked visually because, in very rare cases, peaks or QTL are not detected by the program and are therefore missing from the LIST OF DETECTED QTL. The minimum distance between two putative QTL to be listed as separate QTL is 10 cM. Support intervals in the fit with cofactors represent only a lower boundary and should be treated with caution.
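
The arithmetic of hint 1 (dropping a marker from the linkage map) can be sketched in Python. This is only an illustration and not part of PLABQTL: the function names drop_marker and format_map_line are hypothetical, and the input is assumed to be the marker line of one chromosome, already split into alternating marker and interval tokens. Only RECF=0 (cM) and RECF=1 (recombination frequencies, no interference) are handled. The marker count in the chromosome header line must still be reduced by hand.

```python
def drop_marker(tokens, marker, recf=0):
    """Remove `marker` from one chromosome's map line.

    tokens: alternating marker IDs and interval values, e.g.
            ["1", "4.2", "3", "15.0", "2"] (hypothetical pre-split input).
    recf:   0 = intervals in cM, 1 = recombination frequencies.
    Returns (markers, gaps) with the two neighbouring intervals merged.
    """
    if recf not in (0, 1):
        raise ValueError("only RECF=0 (cM) and RECF=1 are handled in this sketch")
    markers = list(tokens[0::2])
    gaps = [float(t) for t in tokens[1::2]]
    i = markers.index(marker)  # raises ValueError if the marker is absent
    if i == 0:
        gaps = gaps[1:]        # first marker: its right interval disappears
    elif i == len(markers) - 1:
        gaps = gaps[:-1]       # last marker: its left interval disappears
    else:
        a, b = gaps[i - 1], gaps[i]
        # cM distances add; recombination frequencies combine as
        # r = ra + rb - 2*ra*rb when interference is absent.
        merged = a + b if recf == 0 else a + b - 2 * a * b
        gaps = gaps[:i - 1] + [merged] + gaps[i + 1:]
    del markers[i]
    return markers, gaps


def format_map_line(markers, gaps):
    """Interleave markers and intervals back into a single map line."""
    if not markers:
        return ""
    parts = [markers[0]]
    for gap, name in zip(gaps, markers[1:]):
        parts += [f"{gap:g}", name]
    return " ".join(parts)


if __name__ == "__main__":
    line = "1 4.2 3 15.0 2 11.9 5 12.2 7".split()
    markers, gaps = drop_marker(line, "3")
    print(f"*chrom1 {len(markers)}")
    print(format_map_line(markers, gaps))
```

Run on the pqsample.qdt line from hint 1, the sketch merges the intervals 4.2 and 15.0 into 19.2 and leaves four markers, matching the reduced map shown there.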