plabqtl - Universität Hohenheim

PLABQTL
A computer program
to map QTL
Version 1.2
2006-06-01
H.F. Utz and A.E. Melchinger
Institut für Pflanzenzüchtung,
Saatgutforschung und Populationsgenetik
Universität Hohenheim, 70593 Stuttgart
(Institute of Plant Breeding, Seed Science, and
Population Genetics, University of Hohenheim,
D-70593 Stuttgart, Germany)
E-Mail: [email protected]
http://www.uni-hohenheim.de/~ipspwww/soft.html
© Copyright
1995, 2003 H.F. Utz, A.E. Melchinger
All rights reserved
Contents
1 General Overview
2 Installation and Use
3 Biometrical Procedures
3.1 Composite interval mapping
3.2 Missing Values
3.3 Preselection of markers
4 Description of the Output
4.1 Errors or disagreements
4.2 Status of marker data, linkage map, percentages of homozygosity and genome of parent1
4.3 Critical values of LOD scores and F-to-enter
4.4 Overview on observation variables
4.5 Plot of LOD scores
4.6 List of detected QTL
4.7 Final simultaneous fit
4.8 Further effects
4.9 Outliers and influential observations
4.10 Analysis of QTLxEnvironment interactions
5 Literature
6 Description of Controlling File *.qin
7 Description of data file *.qdt
8 Recommendations for Working with PLABQTL
1 General Overview
PLABQTL is a program written for the detection of loci which affect the variation of
quantitative traits. Its main purpose is to localize and characterize QTL (Quantitative
Trait Loci). The program employs the interval mapping approach (Lander and Botstein, 1989). Commands were designed following MAPMAKER/QTL (Lincoln et al.,
1993b). In contrast to this and other programs, we used a multiple regression approach
with flanking markers according to the procedure described by Haley and Knott (1992).
The linkage map of markers is assumed to be known and must be calculated with other
programs such as MAPMAKER/EXP (Lincoln et al., 1993a), JOINMAP (Stam, 1993) or
GMendel (Holloway and Knapp, 1994). For detection of QTL, it is possible to utilize
other identified QTL as cofactors (composite interval mapping). This should increase
the power of QTL detection (for details, see Jansen and Stam, 1994; Utz and Melchinger,
1994; Zeng, 1994). The often large bias in the variance explained by QTL can be reduced
by cross-validation (Utz et al., 2000).
Two types of input are possible:
• a vector format similar to MAPMAKER/QTL: for each marker and trait, the values
are input across all genotypes;
• a matrix format: a matrix is input, consisting of trait and marker data, with one row
per genotype.
For output, the program gives an overview of the input linkage map with segregation ratios, genotype frequencies of marker pairs, and the proportion of the genome
of each genotype which is homozygous or contributed by the first parent. Stepwise
regression is used to preselect the most important markers to be used as cofactors.
In the main part, the program calculates for each trait and chromosome:
• a preliminary analysis with multiple regression on all markers of a chromosome;
• a print or PostScript plot of the LOD scores from scanning of each chromosome;
• estimates of parameters such as additive effects, dominance effects, R^2, AIC, and
BIC values, and ANOVAs at the position of LOD peaks;
• a final simultaneous fit with all detected QTL;
• a selection of important genetic effects if a complex model with dominance and
epistatic effects is used;
• a QTL x environment analysis for all detected QTL with an ANOVA table, a summary table of the QTL effects for each environment and the series, as well as the
mean squares of QTL x environment interactions;
• the empirical LOD score thresholds with permutations;
• cross-validation estimates of R^2 and QTL effects.
The program can be employed for the analysis of populations of individuals from
selfing generations (F2 and later generations) or doubled haploids derived from an F1
cross between two homozygous lines. Models with or without dominance and two-factor epistasis can be fitted. Only additive effects of cofactors are taken into account.
Hence, the present form of the program is most suited for the analysis of topcross
progenies.
At each step, outlying and influential values can be determined and scatter diagrams can be provided to check the fit. A major objective in development of the program was to provide as many checks and hints as possible, each starting with three question
marks (???), so that the data analyst can more easily detect possible errors in the
data input (which can have a tremendous impact on QTL detection).
Data can be transferred from the matrix input format into the vector format required by MapQTL or MAPMAKER/QTL.
Missing marker data are replaced by their expected values, provided that marker
information is available for two adjacent markers or at least one flanking marker.
In contrast, missing phenotypic data and the respective marker genotype data are
dropped from the analysis. Special cases, in which all markers per genotype or all
genotypes per marker are missing, are annotated.
2 Installation and Use
The program package contains the following files:
PLABQTL.EXE    the program
F77L3.EER      required Fortran77 error list
PLABQTL.PDF    this documentation
PQREADME.TXT   some comments for use
PQNEWS.TXT     last amendments
PQSAMPLE.QIN   first example for simple interval mapping
PQSAMPLE.QDT   data to the first example from the tutorial of MAPMAKER/QTL
PQSAMPLM.QIN   first example for composite interval mapping
PQSIMUL.QIN    second example
PQSIMUL.QDT    data to second example with matrix input format
The program PLABQTL consists of the single file PLABQTL.EXE. It runs on PCs
at the DOS prompt or in the DOS compatibility box of WINDOWS. Copy PLABQTL.EXE to
a subdirectory and add this subdirectory to the search path. This should be sufficient to install the
program.
The program is started by the command
plabqtl FILENAME
The command plabqtl alone provides hints for the use of the program.
This control file, here named FILENAME, contains the statements which control the
analysis. By default, the file extension *.qin is used, so the calls
plabqtl FILENAME
and
plabqtl FILENAME.qin
are equivalent. The phenotypic and marker data are stored in a single separate data
file designated with the default extension *.qdt. The output is written to a data file
with extension *.qpt. Additional output, arising from scanning and designed for
production of high density plots, is given in file *.qst or in a PostScript file *.ps.
The batch-oriented mode of PLABQTL has the advantage that time-consuming jobs
can run in the background. Furthermore, by simple editing of the *.qin file, one can
easily change the parameters employed for detection of QTL. The *.qpt output file
(in ASCII) can be examined, edited, and printed with any browser, editor, or word
processing system.
An example for a set of statements in a *.qin file is
c Test from the MAPMAKER/QTL tutorial: file sample.raw
c generate logarithms of the observations
log
c input of data from file sample.qdt in the same directory
load sample.qdt
c rough scanning in 10 cM intervals
scan 10
stop
The load and stop statements are mandatory while all others are optional. Comments are preceded by the letter c. The load command points to the file with the
phenotypic and marker data. The scan command initiates the analysis itself. Chapter
6 describes the use of the various statements in detail.
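As a minimal sketch, a control file like the example above could also be generated by a script. The Python helper below is hypothetical (build_qin is not part of PLABQTL) and uses only the statements shown in the example: comment lines starting with c, log, load, scan, and stop.

```python
def build_qin(data_file, scan_step_cm=None, log_transform=False, comments=()):
    """Return the text of a PLABQTL control file (*.qin).

    Hypothetical helper for illustration; it only emits statements that
    appear in the example above.
    """
    # Comment lines are preceded by the letter c.
    lines = [f"c {text}" for text in comments]
    if log_transform:
        # The log statement is placed before the load statement.
        lines.append("log")
    # load is mandatory and points to the *.qdt data file.
    lines.append(f"load {data_file}")
    if scan_step_cm is not None:
        # scan is optional and initiates the analysis itself.
        lines.append(f"scan {scan_step_cm}")
    # stop is mandatory.
    lines.append("stop")
    return "\n".join(lines) + "\n"


def write_qin(path, text):
    """Write the control file text to disk as plain ASCII."""
    with open(path, "w", encoding="ascii") as handle:
        handle.write(text)
```

For instance, build_qin("sample.qdt", scan_step_cm=10, log_transform=True) produces the four statements log, load sample.qdt, scan 10, and stop, one per line.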
Statement names begin in column one of each line and are written in lower case. Statements with more than four letters can be abbreviated to their first three letters, e.g., seq instead
of sequence.
Names and numbers are separated by at least one blank (or tab character). An
example is:
cov 2 5
8 12
In front of numbers, it is possible to insert a comment in quotes "..." such as
cov "QTL1" 2 5
"QTL2" 8 12
The first, convert, dimensions, and log statements must precede the line
with the load statement; otherwise, the data input is performed using the default
settings.
Examples for analysis are given in the following files:
for simple interval mapping (SIM; data from the MAPMAKER/QTL tutorial):
pqsample.qin with pqsample.qdt
for composite interval mapping (CIM; same data body):
pqsamplm.qin
for simulated data (with matrix input format):
pqsimul.qin with pqsimul.qdt
It is recommended that beginners first run
plabqtl pqsample
and compare the output in pqsample.qpt with the comments in the MAPMAKER/QTL tutorial.
A short introduction to QTL mapping with PLABQTL, which uses the data file of the
MAPMAKER/QTL tutorial for analysis, can be found in the PostScript file PQINTRO.PS
(or PQINTRO.PDF for ACROBAT reader) on the FTP server.
3 Biometrical Procedures
3.1 Composite interval mapping
PLABQTL performs interval mapping with or without cofactors (Jansen, 1993; Jansen
and Stam, 1994; Zeng, 1994) by using multiple regression. The conditional expectations
of QTL genotypes, given the observed marker genotypes at the flanking marker loci,
are employed as regressors. Haldane’s mapping function is assumed. For F2 populations, the conditional expectations are calculated according to the formulae in Table 1
of the paper by Haley and Knott (1992). For F3 and higher selfing generations, formulae were provided by Hospital et al. (1996). For further details, see the paper by Utz
and Melchinger (1994).
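To illustrate the idea of using conditional expectations as regressors, the following sketch covers only the simplest case: doubled haploid lines with two fully informative flanking markers and no interference, as implied by Haldane’s mapping function. The F2 formulae of Haley and Knott (1992, Table 1) are not reproduced here. The function names and the +1/-1 genotype coding are assumptions made for this illustration and do not reflect PLABQTL's internal code.

```python
import math


def haldane_r(distance_cm):
    """Recombination frequency for a map distance in cM (Haldane's function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))


def expected_qtl_dh(left, right, d_left_cm, d_right_cm):
    """Expected additive regressor at a putative QTL in a doubled haploid line.

    left, right : flanking marker genotypes, coded +1 (parent 1) or -1 (parent 2)
    d_left_cm   : distance from the left marker to the QTL position, in cM
    d_right_cm  : distance from the QTL position to the right marker, in cM
    Both distances are assumed not to be zero at the same time.
    """
    r1 = haldane_r(d_left_cm)
    r2 = haldane_r(d_right_cm)
    # Recombination frequency between the two markers without interference.
    r = r1 + r2 - 2.0 * r1 * r2
    if left == right:
        # Markers are non-recombinant: QTL matches the left marker unless a
        # double crossover occurred.
        p_same_as_left = (1.0 - r1) * (1.0 - r2) / (1.0 - r)
    else:
        # Markers are recombinant: one crossover, left or right of the QTL.
        p_same_as_left = (1.0 - r1) * r2 / r
    # Expected value of a +1/-1 coded QTL genotype.
    return left * (2.0 * p_same_as_left - 1.0)
```

At the position of the left marker (d_left_cm = 0) the expectation equals the left marker score, and midway between two recombinant markers it is zero, which is the behaviour one would expect from a regressor that interpolates between the flanking markers.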
For large sample sizes, the regression method converges towards the maximum
likelihood method (see the comments on "regression mapping" in the paper of Martinez and Curnow, 1994). According to Xu (1995), the explained variance underestimates
the true value, but this seems irrelevant compared with the great inflation due to
model selection.
3.2 Missing Values
If missing marker data for flanking markers of an interval are encountered, the interval is extended until markers with data are found. Missing cofactor values are replaced
by their expected values on the basis of the nearest existing flanking markers. Genotypes without marker information on the chromosome which is to be scanned or which
contains cofactors are dropped from the analysis. Likewise, genotypes with missing
phenotypic data for the trait under study are dropped from the analysis.
According to our experience, this method is stable and robust. Other methods, such as the ‘available case procedure’ (Little, 1992), often produce ill-conditioned variance-covariance matrices. Maximum likelihood methods neglect that the degrees of freedom
are reduced for the error.
3.3 Preselection of markers
By means of stepwise regression, markers are selected according to their relative importance (see Draper and Smith, 1981, p. 307 ff). For inspecting the selection, the
partial F-value, which is the criterion for the selection of markers, the simple correlation coefficient r(xi,y) of phenotypic observations y with the selected marker xi , as
well as the percentage of the phenotypic variance additionally explained by the marker
(%add.expl.) are reported.
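The quantities reported for each selected marker can be illustrated with the standard formulas for adding one regressor to a linear model. This is a sketch rather than PLABQTL's own code: the function and argument names are hypothetical, and the residual and total sums of squares are assumed to come from ordinary least-squares fits before and after adding the marker.

```python
def partial_f(rss_before, rss_after, n, k_after):
    """Partial F-value for adding one marker to the regression.

    rss_before : residual sum of squares without the candidate marker
    rss_after  : residual sum of squares with the candidate marker
    n          : number of observations
    k_after    : number of regression coefficients (including the intercept)
                 after the marker has been added
    """
    df_res = n - k_after
    # One numerator degree of freedom because a single regressor is added.
    return (rss_before - rss_after) / (rss_after / df_res)


def pct_additionally_explained(rss_before, rss_after, tss):
    """Percentage of the total sum of squares additionally explained."""
    return 100.0 * (rss_before - rss_after) / tss
```

A marker would enter the model in a forward step if its partial F-value exceeds the chosen F-to-enter threshold.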
A second list summarizes the corresponding RRS-values (residual sums of squares),
AIC-values (Akaike’s information criterion), and the BIC-values (Bayesian information criterion or Schwarz’ information criterion). Similar to the F-to-enter or Mallows’
Cp criterion, the AIC criterion serves as a stopping rule in selecting subsets of regression variables (see Miller, 1990; Jansen, 1993, 1994). The marker with the smallest AIC
value would be the last marker added to the preselected set. The main problem in the
preselection is the choice of an appropriate threshold (i.e., the F-to-enter value or the
penalty in the AIC), which corresponds to the significance level. The choice of a reasonable type I error depends on both the experimenter and the major objective(s). BIC is
a Bayesian variant of Akaike’s information criterion and penalizes the log-likelihood
ratio by K*log(n) rather than the 2*K of Akaike. BIC is similar to AIC with a penalty of
3 in the case of, say, 300 individuals, so BIC uses a much stronger penalty for overfitting
than AIC.
AIC is calculated here according to the formula
AIC = n*ln(RRS/n) + 2*K
where n is the sample size and K the number of estimated parameters, including intercept,
error variance, QTL positions, and QTL effects. Burnham and Anderson (1998) recommend AICc for cases with n/K < 40, where
AICc = AIC + 2*K*(K+1)/(n-K-1) .
BIC is calculated according to the formula
BIC = n*ln(RRS/n) + K*ln(n) .
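The three criteria can be written down directly from the formulas above. The following Python sketch is only an illustration; the function names are hypothetical and the numbers in the usage note are invented for the example.

```python
import math


def aic(rrs, n, k):
    """AIC = n*ln(RRS/n) + 2*K, with RRS the residual sum of squares."""
    return n * math.log(rrs / n) + 2 * k


def aicc(rrs, n, k):
    """Small-sample version: AICc = AIC + 2*K*(K+1)/(n-K-1)."""
    return aic(rrs, n, k) + 2 * k * (k + 1) / (n - k - 1)


def bic(rrs, n, k):
    """BIC = n*ln(RRS/n) + K*ln(n)."""
    return n * math.log(rrs / n) + k * math.log(n)


def prefer(model_a, model_b, criterion=aic):
    """Return the (rrs, n, k) tuple with the smaller criterion value."""
    return min(model_a, model_b, key=lambda m: criterion(*m))
```

For two hypothetical fits of the same 150 observations, (310.0, 150, 6) and (295.0, 150, 10), prefer() returns whichever tuple has the smaller AIC; passing criterion=bic applies the stronger penalty on the additional parameters.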
With BIC or AIC, models can be compared, i.e., cofactors can be selected, the number
of QTL can be chosen, or the gene architecture can be investigated. The model with the
smallest BIC or AIC is best, in the sense of exploiting the information in the data, relative to other models (applying the principle of parsimony, see Burnham and Anderson,
1998). Models with AIC differences smaller than 2 have substantial support and should
receive consideration in making inferences. Models with AIC differences greater than 10
might be omitted. AIC, BIC, Mallows’ Cp, and the F-to-enter criterion can, at least approximately, be mutually transformed (see Miller, 1990, p. 205 ff, or McQuarrie and Tsai,
1998).
When missing marker values occur, they are replaced in the preselection step
by their expectations calculated on the basis of the nearest adjacent flanking markers (see above). The stepwise regression is performed with this "corrected" data set;
here, differing degrees of freedom for the error, depending on the number of missing
values, are not taken into account. This procedure, and possibly other causes, may
produce an ill-conditioned variance-covariance matrix, which can cause the stepwise
regression to abort. In such cases, the partial F-values may increase towards
the end of the selection procedure. (This most likely occurs with small F-to-enter values, see below.) We then recommend the use of a higher F-to-enter value in the
parameter statement or estimation by ridge regression with THETA>0. The default
value THETA=0.02 reduces collinearity and prevents singularities of matrices (see
Whittaker et al., 2000).
After selection of the markers, another multiple regression is performed with the
reduced data set, which takes the degrees of freedom for missing values into account.
For ill-conditioned cases, this procedure may not be entirely satisfactory; however,
it should suffice to identify important markers to use as cofactors.
Preselected markers may subsequently be adopted automatically in the cov statement. The user should select cofactors sparingly, observing that
a. adjacent, tightly linked markers should not be selected (see Hackett, 1994);
b. all important markers should be included because this increases the power of QTL
   detection (see Jansen, 1995; Utz and Melchinger, 1994);
c. important chromosomes are represented with at least one or two cofactors.
Several cofactors per chromosome are justified if several putative QTL are suspected on a chromosome. If the respective regression coefficients have different signs,
the contributions of different QTL may cancel each other, and, consequently, these QTL may not
be detected.
According to our experience, one should always select markers associated with important QTL as cofactors. This corresponds to an F-to-enter value between 5 and 6 or
a penalty of about 3 in the AIC. According to Miller (1990), for testing hypotheses in
the present context (Spjotvoll test with 0.05 probability level), one should employ an
F-to-enter value between 8 and 15. Based on a great number of analyses of experimental data, we have found that the default value of 3.5 for F-to-enter used in PLABQTL
yields reasonable results, because it also selects cofactors adjacent to minor QTL and
those with opposite sign of effects. However, in this case important QTL are frequently
represented by both adjacent flanking markers, which seems undesirable. Further recommendations are given in chapter 8.
4 Description of the Output
In the following, we describe some of the parameters that can be estimated by
PLABQTL.
4.1 Errors or disagreements
Errors or disagreements detected by PLABQTL are indicated with three question
marks ??? and a comment. In some cases, the program is terminated; in others,
checking of the marker data is recommended. Lines starting with !!! give warnings or comments, e.g., if there is no phenotypic variation for the observed variable.
Use of the first statement is recommended when a data set (a *.qdt file) is analysed for the
first time.
4.2 Status of marker data, linkage map, percentages of homozygosity
and genome of parent1
This summary of various parameters can be used to detect non-polymorphic or dominant markers, distorted segregation or outcrossings, which cause a high degree of
heterozygosity. Frequencies of marker alleles and the significance of their deviation from
0.5, as well as a histogram for homozygosity, are given. Some other general parameters, such as
the length of the genome in cM, the average interval length in cM, and the percentage of the genome within
20 cM of the nearest marker, are printed. The average interval length is calculated as
total length / (no. of markers - no. of monomorphic markers - no. of chromosomes).
Probabilities for the Chi-square tests are calculated according to the Wilson-Hilferty
approximation and should be regarded as a crude test for two reasons: (1) this approximation, like others, is inaccurate for a small number of degrees of freedom (1 or
2, depending on the type of marker and population considered) and (2) simultaneous
tests are performed (however, individual tests are not always independent, see Lander
and Botstein, 1989, p. 190 ff).
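For readers unfamiliar with the Wilson-Hilferty approximation, the textbook form can be sketched as follows. It transforms a chi-square value by a cube root so that it is approximately normally distributed. This is not taken from the PLABQTL source; the function name is hypothetical, and the exact variant used by the program is not documented here.

```python
from statistics import NormalDist


def chi2_upper_tail_wh(x, df):
    """Approximate P(chi-square with df degrees of freedom > x).

    Uses the Wilson-Hilferty cube-root transformation: (X/df)**(1/3) is
    approximately normal with mean 1 - 2/(9*df) and variance 2/(9*df).
    """
    c = 2.0 / (9.0 * df)
    z = ((x / df) ** (1.0 / 3.0) - (1.0 - c)) / c ** 0.5
    return 1.0 - NormalDist().cdf(z)
```

For one degree of freedom the value returned at the nominal 5% point (3.841) deviates visibly from 0.05, which illustrates why the text calls the test crude for 1 or 2 degrees of freedom; at 10 degrees of freedom the approximation is much closer.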
Some warnings regarding markers are produced, which should be inspected critically.
Some reactions are undertaken automatically; others can be performed by the user,
e.g., by using the statement exmarker or exindivid and comparing the resulting
analyses:
• Marker pairs with only 0 or 1 recombinant are listed.
• Highly correlated markers (r > 0.99) are shown by the message
??? multicollinearity ...
(which is mostly harmless but may occasionally result in nonsensical estimates).
• Monomorphic markers are excluded from the QTL analysis.
• Unobserved markers (coded as missing values) are excluded from the QTL analysis
(but such markers could be used before in cofactor analysis).
• Inconsistencies for doubled haploid (DH) lines are shown, e.g., heterozygous markers.
• Markers with a distance smaller than 0.11 cM are combined into a ‘synthetic’ marker
with a name starting with m, containing the chromosome number and position in
cM, e.g., m01/003.9. In the synthetic marker, C, D and M are replaced by the more
informative scores A, H, or B if possible.
• A mix of dominant and codominant scores in a marker is allowed (A, B, H and C, D
scores occurring together), so interspersed RAPD and AFLP markers can be used.
The analysis procedure in such cases is:
1. In cofactor selection, C and D scores are treated as missing.
2. In scanning, expected values are calculated for flanking markers with C or D
scores. This may be suboptimal but may be sufficient for ad hoc analyses. (The optimal procedure, which calculates expected values based additionally on the nearest neighbouring codominant markers, is described by Jiang and Zeng, 1997, Genetica 101,
47-58.)
4.3 Critical values of LOD scores and F-to-enter
CRITICAL VALUE FOR LOD SCORE (Bonferroni chi-square approximation),
abbreviated as Crit.LOD, is calculated using the chi-square approximation suggested by Zeng (1994). For an overall test with M marker intervals in the genome, a
large sample size, and not too many markers fitted in the model, the chi-square value
for alpha/M and n degrees of freedom (n equals 2 or 3 if additive or additive and
dominance effects for QTL are fitted respectively) can be used as an approximation.
The chi-square value is valid for likelihood ratio (LR) test statistics and is divided by
4.605 to get LOD, see also below. Alternatively, a permutation test can be performed
with the permute statement.
The minimum detectable partial R^2 (Detect.part.R^2) for the given sample size N
and alpha level is calculated as follows, see Melchinger et al. (1998):
Part.R2 = 1 - EXP(-CHI2/N)
with CHI2 = crit.LOD/0.2171 = crit.LOD * 4.605
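The relation between a critical LOD score and the minimum detectable partial R^2 can be sketched in a few lines of Python. The function name is hypothetical, and N is taken to be the sample size as stated in the text.

```python
import math

# Conversion factor from LOD to the likelihood ratio statistic, as in the text.
LOD_TO_LR = 4.605


def detectable_partial_r2(crit_lod, n):
    """Minimum detectable partial R^2 (as a proportion, not in percent)."""
    chi2 = crit_lod * LOD_TO_LR
    return 1.0 - math.exp(-chi2 / n)
```

The function makes the dependence on sample size explicit: for a fixed critical LOD, doubling n roughly halves the smallest partial R^2 that can be detected.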
The magnitude of errors of the QTL detection depends primarily on the size of
the underlying QTL effects, sample size, and heritability. Therefore, the estimable magnitude of effects is given under the heading CLASSIFICATION OF ADDITIVE QTL
EFFECTS.
Bonferroni bounds are offered for LOD scores and F-to-enter values in stepwise
regression for several alpha values, which should make the choice of threshold values easier. (Further details on comparing diverse empirical and analytical threshold
values can be found in Doerge and Rebai (1996).) For exploratory QTL experiments,
a genomewise error rate of 0.25 seems acceptable for the LOD threshold and F-to-enter
(Beavis, 1998).
Also the Bonferroni bounds for the simple correlation coefficients between marker
and observations which are produced by the first statement are listed under header
PRESELECTION OF MARKER COFACTORS with subheader r(xi,y)thresh..
4.4 Overview on observation variables
For each observation variable, the number of units, mean, variance, skewness, and kurtosis as well as a histogram are printed. (Please note that these estimates refer to the data before
any log transformation is applied.)
4.5 Plot of LOD scores
In the plots of LOD scores, positive additive effects are indicated by an asterisk * and
negative additive effects by an = sign. Thus, the sign of additive QTL effects can be
visualized in plots. The unit on the ordinate (y-axis) is given in brackets in the lower
left corner of the coordinate system. On the abscissa (x-axis), an M sign indicates the
location of a marker and a C indicates that of a cofactor. In the PostScript files *.ps,
markers and cofactors are also plotted on the x-axis by ticks or triangles, respectively.
4.6 List of detected QTL
This shows a list of all detected QTL. For the example pqsample.qpt, the output is:
QTL Chrom.  Pos Left_M. Mark.I+Pos cM_n.M.   Supp.IV   LOD  R^2%    add    dom DF
--------------------------------------------------------------------------------
  1 chrom1   26 T93     25 7. 5.             10- 32   4.99   7.6 -0.091 -0.045
  2 chrom2   28 T125    8- 12 7. 7.          16- 36   8.73  12.6 -0.139 -0.016
--------------------------------------------------------------------------------
SUM:                                                        20.3
Pos         The position on the chromosome of the QTL, in cM.
Left_M.     Name of the left flanking marker.
Mark.I+Pos  The marker interval, consisting of the flanking marker numbers plus the
            position, in cM, of the QTL relative to the left flanking marker.
cM_n.M.     Distance in cM to the next flanking marker.
Supp.IV     Support interval with a LOD fall-off of 1.0 (default), expressed as position
            on the chromosome, in cM. Note: A support interval is only determined
            for the global QTL peak in a given region, i.e., by ignoring other adjacent
            peaks in the case of multiple peaks. When cofactors are used, the given
            support interval is very likely only a lower boundary for the true support
            interval.
LOD         log10 of the likelihood odds ratio (see Lander and Botstein, 1989). The
            LOD score is calculated from the F-value in the multiple regression as
            LOD = n * ln(1 + p * F/DFres) * 0.2171 ,
            where n is the sample size and p is the number of parameters fitted
            (see Haley and Knott, 1992).
            The LOD score can be used to calculate the likelihood ratio
            LR = LOD * 4.6052
            (approximately LR = p * F for large samples), if this statistic is preferred.
R^2%        The coefficient of determination or the percentage of the phenotypic variance which is explained by a putative QTL. In the case of cofactors (composite interval mapping), R^2% is based on the partial correlation of the
            putative QTL with the observed variable, adjusted for cofactors. The basis for calculating the R^2 values may vary because cofactors present in
            the interval under scanning are omitted.
            If dominance is included in the model, the proportion of the variance
            attributable to the additive and dominance effects is given. This model is
            also used for calculating LOD scores.
add,dom     Estimated additive and dominance QTL effects at the location of scanning. The additive effect is half the difference between the genotypic
            values of the two homozygotes. It is assumed that the second parent carries
            the favorable alleles for the trait under study (for calculations see also
            Falconer, 1989, Ch. 7, who takes the first parent as the superior one). If the second
            parent is the weaker one, the additive effect becomes negative.
DFres       The number of degrees of freedom for the residual sum of squares in
            multiple regression; this is provided to show how missing data might
            influence the regression results.
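The conversion from the regression F-value to the LOD score given in the list above can be sketched as follows. The function names are hypothetical; n is taken to be the number of observations entering the regression.

```python
import math


def lod_from_f(f_value, p, df_res, n):
    """LOD = n * ln(1 + p*F/DFres) * 0.2171.

    f_value : F-value of the regression on the putative QTL
    p       : number of QTL parameters fitted (e.g. additive, dominance)
    df_res  : residual degrees of freedom (DFres)
    n       : number of observations
    """
    return n * math.log(1.0 + p * f_value / df_res) * 0.2171


def lr_from_lod(lod):
    """Likelihood ratio statistic LR = LOD * 4.6052."""
    return lod * 4.6052
```

Because 0.2171 * 4.6052 is very close to 1, lr_from_lod(lod_from_f(...)) reproduces n * ln(1 + p*F/DFres) up to rounding of the two constants.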
4.7 Final simultaneous fit
In the FINAL SIMULTANEOUS FIT, the detected QTL and their estimated positions
are used for a simultaneous multiple regression to obtain final estimates of the additive
and dominance effects.
Standard errors (Std.error) of the estimated effects are given. Besides the estimated QTL effects, other measures of the importance of a QTL include (1) the squared
partial correlation coefficient (partR^2%, which is the coefficient of determination between the respective QTL and the phenotypic observations, keeping all other QTL
effects fixed) and (2) the partial sums of squares (part.SS), which are obtained by
dropping a certain effect. The part.SS each have one degree of freedom and, thus, can
be tested against the residual sum of squares from the regression ANOVA. Those QTL
which have small and non-significant part.SS could be dropped first from the model.
A further column gives the regression sums of squares (Regr.SS) for the simple regression. Calculating formulas can be found in Steel and Torrie (1980, p. 323-327) or
Snedecor and Cochran (1980, chapter on multiple linear regression). In the last column
(Std.eff.), the standardized QTL effects, i.e., the effects divided by the phenotypic standard deviation of the trait, can be found.
The importance of the QTL effects, as determined by the part.R^2% or the
part.SS, is generally fairly consistent.
To summarize the results of the QTL analysis we give the following parameter estimates:
R^2%    Coefficient of determination or the percentage of the phenotypic variance which is explained by the detected QTL, with the approximate standard error calculated according to formula 27.88 of Kendall and Stuart
        (1961).
R       Multiple correlation coefficient or the square root of R^2.
LOD     LOD score of the final fit.
AIC     Akaike’s Information Criterion to choose the most ‘probable’ model out
        of a group of models with varying numbers of parameters. (The penalty
        for the number of free parameters is 1.) The model with the smallest AIC
        value fits the data best (see Jansen, 1993).
BIC     The Bayesian information criterion or Schwarz’ information criterion
        may serve as an alternative to AIC.
K       Number of parameters estimated (number of QTL effects, QTL positions,
        intercept, and error variance in the final model).
Gen.var.expl.% (and its standard error): Percentage of the genotypic variance
        which is explained by the detected QTL. If the heritability h^2 of the trait is
        given in the *.qdt file, Gen.var.expl.% is calculated as R^2% / h^2. The
        standard error is calculated under the assumption of known heritability.
adj.R^2%
        With the adjusted R^2, the explained variance may be estimated more
        adequately than with R^2, see Hospital et al. (1997). Additionally,
        adj.gen.var.expl. is given if the heritability is noted in the *.qdt file.
        Naturally, adj.R^2% as an estimate can be less than zero or greater than
        100, like any estimated proportion of variance components.
The correlation matrix of the estimated QTL effects indicates their degree of dependency. These dependencies generally cause no numerical problems.
4.8 Further effects
If testcross progeny are indicated, the average effect of an allele substitution, which
corresponds to 2*additive effect, is reported.
Furthermore, seq and pred statements are output, in which the effects are given
in descending order according to their R^2 values from the LIST OF DETECTED QTL;
e.g.,
c:seq 5/106 2/148 1/48
c:pred 9.735 2.179 1.401 -0.995
where c: stands for calibration. With the statement cross-validate, two similar
lines starting with g: are given, showing the effects estimated in the validation set.
These two lines can be used to conveniently feed the sequence and predict statements in submodel analyses.
4.9 Outliers and influential observations
Outliers and influential observations are calculated after the preselection of markers
and after the simultaneous fit. These lines are annotated with the # sign, so that they
can be found or printed if desired. Concurrently, scatter diagrams depict atypical data
points.
The following statistics are calculated:
AP2     ANDREWS-PREGIBON statistic, second factor (AP2 in Draper and John,
        1981). AP2 is a spatial measure and shows which observations are influential in the sense that they are isolated from the bulk of the data defined
        by the columns of the X matrix. It involves only the predictor variables.
        Smaller AP2 values indicate that the point is more remote from the bulk.
infl    Influential value of an observation (congruent to the square root of
        Cook’s statistic). The larger this value is, the greater is the effect on the
        fitted equation and the regression coefficients found by omitting this observation.
stdRes  Studentized residual (t in Snedecor and Cochran, 1980, p. 350 and 168)
        to find extreme deviations from the regression. The error variance is
        estimated without the suspicious data point. The statistic follows a t-distribution, so it can be tested approximately whether the outlier
        is exceptional or not (see Snedecor and Cochran, 1980; or Draper and
        Smith, 1981, p. 144).
Obs., X values
        Observation value and marker values (marker values or expected
        values in the final fit). Note: During outlier detection in the case of preselection, marker values (0,1,2) or expected values with decimal points are
        given.
An observation is indicated as outlying or influential if stdRes is greater than 3.5
or AP2 less than 0.5 or infl greater than 0.4.
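The flagging rule can be expressed as a small predicate. This is an illustrative sketch only: the function name is hypothetical, and the use of the absolute value of stdRes is an assumption, since the text states only "greater than 3.5".

```python
def is_flagged(std_res, ap2, infl):
    """Return True if an observation would be marked as outlying or influential.

    Thresholds are those given in the text. Taking abs(std_res) is an
    assumption so that large negative residuals are flagged as well.
    """
    return abs(std_res) > 3.5 or ap2 < 0.5 or infl > 0.4
```

Applied to a list of (stdRes, AP2, infl) triples, the predicate selects the observations that would be worth inspecting in the scatter diagrams.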
4.10 Analysis of QTLxEnvironment interactions
A simultaneous fit with the detected QTL is performed for each environment. The
results are subsequently presented in the form of a table showing the ANOVA and the
estimated effects.
QTL-ANOVA: The ANOVA is carried out approximately in the following way:
Source           DF             E(MS)
--------------------------------------------------------------------
Environm.        E-1
Genotypes        G-1
  QTL            Q              VC + f1 VCqe + E VCd + f2 VCq
  Resid          G-1-Q          VC + E VCd
Genot. x Env.    (G-1)(E-1)
  QTL x Env.     Q(E-1)         VC + f1 VCqe
  Res. x Env.    (G-1-Q)(E-1)   VC
--------------------------------------------------------------------
where Q    = number of detected QTL effects (additive, dominance),
      E    = number of environments,
      G    = number of genotypes,
      VCq  = genetic variance explained by the QTL effects,
      VCd  = unexplained residual genetic variance (deviation),
      VCqe = variance component QTL x Env. interactions,
      VCde = variance component Res. x Env. interactions,
      VC   = VCe/R + VCde, with R being the number of replications
             in a single environment and VCe the pooled plot error.
The entries of the ANOVA table for QTL, in particular the variance components in the column
denoted VComp, are calculated in the following manner; the expectations of mean
squares (MS) were taken analogously to Knapp (1994) and Bliss (1964, p. 426):
VC(Genotypes) = [MS(Genotypes) - MS(Genot.xEnv.)]/E
with E = number of macro environments (approximative
result if Genotypes x Environments are unbalanced)
VCq = VC(Genotypes) - VCd
(please note that VCq is an ad-hoc estimator computed
by the difference of two variance components, not by the
difference of MS as usual)
VCd = [MS(Residuals) - MS(Res. x Env.)]/E
VCqe = MS(Genot. x Env.) - MS(Res. x Env.),
an indirect ad-hoc estimate also,
VC(Genot.xEnv.) and VCde can be computed by subtracting
VCe/R = MS(pooled error)/R, see Cochran and Cox (1957),
p. 556 or 561, if necessary.
After the ANOVA for the QTL, an appropriate estimate for the Gen.var.expl.%
is given, which is generally smaller than the corresponding estimate obtained from
the simultaneous fit, because it is adjusted for QTL x Env. interactions. This helps
to avoid an overestimation of the genetic variance explained by the QTL. The statistic
Gen.var.expl.% refers to the proportion of the genetic variance explained by the
detected putative QTL and is calculated as the ratio
Q2 = VCq/VC(Genotypes) ∗ 100.
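The estimators above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not PLABQTL's own code; the function name qtl_variance_components and the numerical mean squares are hypothetical.

```python
def qtl_variance_components(ms_genotypes, ms_gxe, ms_residuals, ms_rxe, n_env):
    """Ad-hoc variance component estimates from the QTL-ANOVA mean squares.

    All argument names are illustrative; the formulas follow the text:
    VC(Genotypes), VCd, VCq, VCqe and Gen.var.expl.% (Q2).
    """
    vc_genotypes = (ms_genotypes - ms_gxe) / n_env  # VC(Genotypes)
    vc_d = (ms_residuals - ms_rxe) / n_env          # VCd
    vc_q = vc_genotypes - vc_d                      # VCq, difference of two VCs
    vc_qe = ms_gxe - ms_rxe                         # VCqe, indirect estimate
    q2 = vc_q / vc_genotypes * 100.0                # Gen.var.expl.%
    return {
        "VC(Genotypes)": vc_genotypes,
        "VCd": vc_d,
        "VCq": vc_q,
        "VCqe": vc_qe,
        "Q2": q2,
    }


# Hypothetical mean squares for E = 4 environments (illustration only).
estimates = qtl_variance_components(
    ms_genotypes=50.0, ms_gxe=10.0, ms_residuals=26.0, ms_rxe=6.0, n_env=4
)
```

Because VCq is obtained as a difference of two variance components, a negative value is possible in small data sets and should be interpreted with care.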
In a further table, additive effects and, if indicated in the
model statement, dominance effects are summarized for all detected putative QTL. The effects are given
for each environment as well as for means across environments. The last column shows
the MS(QTLxE), for which the SS are calculated as (sum of the values in the column
Regr.SS for individual environments) - (corresponding Regr.SS in the series)*E.
Because the MS(QTLxE) are calculated from the difference of the fits of the data
from individual environments and the means across environments, negative values
for MS(QTLxE) may occur, which are set to zero. It is noteworthy that with a small
number of environments, this statistic is only of limited inferential value.
The values for MS(QTLxE) are tested for significance with a sequentially rejective
Bonferroni F test (SRBF) and tested for homogeneity with Bartlett's test. Bartlett's
test is rather sensitive to violations of the assumption of normally distributed
data, so other tests may be preferred in these cases or if the number of degrees of
freedom is smaller than 5.
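A sequentially rejective Bonferroni test is commonly implemented as Holm's step-down procedure. The following Python sketch assumes that a p-value of the F test is already available for each QTL; the function name holm_reject and the p-values are hypothetical, and the text does not state whether PLABQTL proceeds exactly in this way.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm's sequentially rejective Bonferroni decisions.

    Returns a list of booleans (True = rejected) in the input order.
    """
    m = len(p_values)
    # Visit the hypotheses from the smallest to the largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The bound is relaxed step by step: alpha/m, alpha/(m-1), ...
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # all remaining (larger) p-values are retained
    return reject


# Hypothetical p-values of the MS(QTLxE) F tests for four QTL.
decisions = holm_reject([0.01, 0.04, 0.03, 0.005])
```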
5 Literature
Beavis, W.D. 1998. QTL Analyses: Power, Precision, and Accuracy. In: A.H. Paterson (ed.),
Molecular Dissection of Complex Traits. CRC Press, Boca Raton, 145-162.
Bliss, C.I. 1967. Statistics in Biology. Vol.1, McGraw-Hill, New York.
Burnham, K.P., and D.R. Anderson. 1998. Model Selection and Inference. A Practical
Information-Theoretic Approach. Springer, New York.
Cochran, W.G., and G.M. Cox. 1957. Experimental Designs. 2nd ed., Wiley, New York.
Churchill, G.A. and Doerge, R.W. 1994. Empirical threshold values for quantitative trait mapping. Genetics 138, 963-971.
Doerge, R.W. and Rebai, A. 1996. Significance thresholds for QTL interval mapping tests.
Heredity 75, 459-464.
Draper, N.R. and J.A. John. 1981. Influential observations and outliers in regression. Technometrics 23, 21-26.
Draper, N.R. and H. Smith. 1981. Applied Regression Analysis. 2nd ed., Wiley, New York.
Falconer, D.S. 1989. Introduction to Quantitative Genetics. Longman Scientific & Technical,
Essex, England.
Hackett, C.A. 1994. Selection of markers linked to quantitative trait loci by regression techniques. In: van Ooijen, J.W. and J. Jansen (eds.), Biometrics in Plant Breeding: Applications of Molecular Markers, Wageningen, 99-106.
Haley, C.S., and S.A. Knott. 1992. A simple regression method for mapping quantitative trait
loci in line crosses using flanking markers. Heredity 69, 315-324.
Holloway, J.L. and S.J. Knapp. 1994. GMendel 3.0 Users Guide. Oregon State Univ., Department of Crop and Soil Science, Oregon.
Hospital, F., C. Dillmann, and A.E. Melchinger. 1996. A general algorithm to compute multilocus genotype frequencies under various mating systems. Comput.Appl.Biosci. 12,
455-462.
Hospital, F., L. Moreau, F. Lacoudre, A. Charcosset, and A. Gallais. 1997. More on the efficiency of marker-assisted selection. Theor. Appl. Genet. 95, 1181-1189.
Jansen, R.C. 1993. Interval mapping of multiple quantitative trait loci. Genetics 135, 205-211.
Jansen, R.C. 1994. Controlling the type I and type II errors in mapping quantitative trait loci.
Genetics 138, 871-881.
Jansen, R.C., and P. Stam. 1994. High resolution of quantitative traits into multiple loci via
interval mapping. Genetics 136, 1447-1455.
Kendall, M.G., and A. Stuart. 1961. The Advanced Theory of Statistics. Charles Griffin & Co.,
London, Vol. II, 3rd ed.
Knapp, S.J. 1994. Mapping quantitative trait loci. In: R.L. Phillips and I.K. Vasil (eds.), DNA-based Markers in Plants. Kluwer Academic Publ., Dordrecht.
Lander, E.S., and D. Botstein. 1989. Mapping Mendelian factors underlying quantitative traits
by using RFLP linkage maps. Genetics 121, 185-199.
Lincoln, S.E., M.J. Daly, and E.S. Lander. 1993a. Constructing genetic linkage maps with
MAPMAKER/EXP version 3.0: A tutorial and reference manual. Whitehead Institute for
Biomedical Research, Cambridge, MA.
Lincoln, S.E., M.J. Daly, and E.S. Lander. 1993b. Mapping genes controlling quantitative traits
using MAPMAKER/QTL version 1.1: A tutorial and reference manual. Whitehead Institute for Biomedical Research, Cambridge, MA.
Little, R.J.A. 1992. Regression with missing X’s: A review. Jour. Americ. Statist. Ass. 87,
1227-1237.
Martinez, O. and R.N. Curnow. 1994. Three marker scanning of chromosomes for QTL in
neighbouring intervals. In: van Ooijen, J.W. and J. Jansen (eds.), Biometrics in Plant
Breeding: Applications of Molecular Markers, Wageningen, 153-162.
Melchinger, A.E., H.F. Utz, and C.C. Schön. 1998. QTL mapping using different testers and
independent population samples in maize reveals low power of QTL detection and large
bias in estimates of QTL effects. Genetics 149, 383-403.
McQuarrie, A.D.R. and C.-L. Tsai. 1998. Regression and Time Series Model Selection. World
Scientific, Singapore.
Miller, A.J. 1990. Subset Selection in Regression. Chapman and Hall, London.
Schoen, C.C., A.E. Melchinger, J. Boppenmaier, E. Brunklaus-Jung, R.G. Herrmann, and J.F.
Seitzer. 1994. RFLP mapping in maize: Quantitative trait loci affecting testcross performance of elite European flint lines. Crop Sci. 34, 379-389.
Snedecor, G.W., and W.G. Cochran. 1980. Statistical Methods. 6th ed. Iowa State University
press, Ames.
Stam, P. 1993. Constructing integrated genetic linkage maps by means of a new computer
package: JOINMAP. The Plant Journal 3, 739-744.
Steel, R.G.D., and J.H. Torrie. 1980. Principles and Procedures of Statistics. 2nd ed. McGraw-Hill.
Utz, H.F. and A.E. Melchinger. 1994. Comparison of different approaches to interval mapping
of quantitative trait loci. In: van Ooijen, J.W. and J. Jansen (eds.), Biometrics in Plant
Breeding: Applications of Molecular Markers, Wageningen, 195-204.
Utz, H.F., A.E. Melchinger, and C.C. Schön. 2000. Bias and sampling error of the estimated
proportion of genotypic variance explained by QTL determined from experimental data
in maize using cross validation and validation with independent samples. Genetics 154,
1839-1849.
Van Ooijen, J.W. 1993. Accuracy of mapping quantitative trait loci in autogamous species.
Theor. Appl. Genet. 84, 803-811.
Whittaker, J.C., R. Thompson, and M.C. Denham. 2000. Marker-assisted selection using ridge regression. Genet. Res. 75, 249-252.
Wricke, G., and W.E. Weber. 1986. Quantitative Genetics and Selection in Plant Breeding.
de Gruyter, Berlin.
Xu, S. 1995. A comment on the simple regression method for interval mapping. Genetics 141,
1657-1659.
Zeng, Z.-B. 1994. Precision mapping of quantitative trait loci. Genetics 136, 1457-1468.
6 Description of Controlling File *.qin
There are analysis commands: first-analysis, scan, sequence, permute,
cross-validate. Statement load defines the data file. Logarithmic transformation
of the data and restriction of the analysis to a certain trait and/or chromosome can
be accomplished by commands log, trait, chrom. Markers serving as covariates
(i.e., cofactors) in interval mapping are defined by the cov statement. Default values
can be changed with the parameter statement. The out statement serves to control
the output. The model statement allows dominance and epistasis to be included in the
model. The analysis is terminated with the stop statement.
Empty lines can be used to increase readability. If statements are not given or written without specification of parameters, the default values are used. For example, the
following three versions of the scan statement are identical:
scan, scan 5, or scan 5 2.5 .
In the following, the various statements are briefly described:
c comment until the end of the line
Comment lines can be inserted between ordinary statement lines. The first line
should be always a comment line, because this comment will be used as job
name in table outputs.
log
Phenotypic observation data are transformed by logarithms. If the log statement is missing, the original untransformed observations are analyzed. The
log statement must precede the load command.
load FILENAME
The file with name FILENAME contains the marker and phenotypic observation data. If needed, give the full path name. This statement is mandatory. The
construction of the file FILENAME is described in chap. 7.
out LPLOT LRSDL LSECF PRIN
If the out statement is missing, out 1 1 0 0 is assumed.
Options for the output are:
LPLOT = 0   no output of scanning details
      = 1   scanning results are given in a curve plot of LOD scores
            on *.qpt file (default)
      = 2   curve plot of LOD scores given in PostScript on file *.ps
LRSDL = 0   no test of residuals
      = 1   test of residuals and influential values,
            output of plots of residuals on fitted values
            and of observations on fitted values (default)
      = 2   output of observed, fitted, and residual values
            to the secondary output file *.qst
LSECF = 0   no scanning output to a secondary file (default)
      = 1   LOD scores, additive and dominance effects are written
            to secondary output (to plot values with another program,
            e.g., Excel, PlotIt), see also statement parameter
      = 2   scanning results on secondary output in numbers
            (like in MAPMAKER/QTL)
PRIN  = 0   (default)
      = 1   extended output of the regression analyses for QTL
            (not recommended for the regular case, rather for
            debugging)
      = 2   output of covariance matrices used in regression,
            for debugging output (not for regular use)
      = 3   further output of regression tables, e.g.
            t values of partial regression coefficients
      = -1  minimum output, recommendable in combination
            with the permute or cross-validate statement
trait ITRAIT1 ITRAIT2 ...
ITRAITi = number or name of the traits to be analyzed
e.g.
trait 5 2 3
or
trait yield protein
If the trait statement is missing, all traits are analyzed, otherwise only the
traits specified by their trait identifier.
chrom ICHROM1 ICHROM2 ...
ICHROMi = number of the chromosomes to be scanned
e.g.
chrom 2 4 7
If the chrom statement is missing, all chromosomes are analyzed, otherwise
only the chromosomes specified by ICHROMi. At most 10 chromosomes
or linkage groups can be chosen with this statement.
model specifications F-to-E
model D      dominance effects are included in the model
model D AA   dominance and two-loci additive×additive epistatic effects
             are included
model AA     additive×additive epistatic effects are included
model D DD   dominance and all two-loci epistatic effects are included,
             i.e. add×add, add×dom, dom×add, dom×dom
If this statement is missing, only additive genetic effects are considered in the
model.
In every case the detection of QTL is conducted without epistatic effects in
the model. Only in the FINAL SIMULTANEOUS FIT are all specified digenic
epistatic effects estimated for the detected set of QTL. Afterwards, the
epistatic effects are chosen in a stepwise regression procedure, whereby the
F-to-Enter value (and F-to-Drop) is obtained by using the Bonferroni bound at
alpha = 0.05, and a second simultaneous fit is calculated. Thus, the user can
compare several models (without epistasis, from a separate run, or with certain
epistatic effects) by their AIC or BIC values.
The F-to-Enter value can be specified in the model statement if the default
threshold is inadequate, e.g.
model D DD 10.0
With epistasis, the correlations between the estimated QTL effects may be
much stronger (see the correlation matrix after the FINAL SIMULTANEOUS
FIT table) and should be taken into account.
scan ISTEP LODLIM
Interval scanning is performed and a LOD-score curve generated. Scanning
for a putative QTL is carried out at regular increments spaced ISTEP cM units
apart.
ISTEP should be an integer.
ISTEP = 5 (default)
ISTEP = 2 (is recommended, if a high resolution LOD-score curve is desired)
The default value for the LOD threshold LODLIM is 2.5 . If it is to be changed,
LODLIM must be set to another value, e.g., 3.0. LOD scores below the limit
LODLIM are not included in the search for QTL.
cov Cofactors
Cofactors in the form of marker numbers or marker names can be used as covariates,
e.g. cov "QTLc1:" 4 5 "QTLc2:" 18 19 "QTLc5:" 43 44
or cov 4 5 18 19 43 44
or cov T93 T125 .
If the cov statement is missing, conventional (or simple) interval mapping is
performed.
Alternatively, marker covariates can be included more conveniently by:
SELECT (may be abbreviated as S or SEL)
all markers which are chosen in the preselection are used as cofactors
ALL (may be abbreviated as A)
all markers in the map are used as cofactors.
Usually all markers of the cofactor set are used as cofactors,
e.g.
cov SELECT
With cov/+ , the cofactor set is extended by all markers of the chromosome
under scanning. This allows multiple QTL on a chromosome to be detected
with greater resolution but generally with lower power.
With cov/- , the cofactor set is reduced by all markers of the chromosome under scanning.
sequence sequence of QTL positions
e.g.
seq 1/20 2/23
A fit should be calculated for the given sequence only.
The number before the slash is the chromosome number and the number after
the slash the position of the putative QTL on the chromosome in cM units.
With the seq statement, the genetic effects as partial regression coefficients are
calculated for the indicated positions. If a predict statement follows, these
values can be used.
With seq/s a sequence of QTL positions can be included step by step in the
fitted model (with or without predict). Thus the statement
seq/s 1/238 6/78 6/130
performs sequentially the fits
seq 1/238
seq 1/238 6/78
seq 1/238 6/78 6/130
i.e., seq/s is an abbreviated form of several seq statements.
predict intercept and estimated effects for each QTL
e.g.
pred 80.4 -0.51 -0.12
For models with only additive effects, one effect per QTL, i.e., the additive
effect, must be given. With dominance models (using the model statement
described above), the additive and dominance effects for each QTL must be
given.
The predict statement can only follow a seq statement and is valid only for
that sequence and trait.
Predicted values are calculated for all genotypes with observed markers for
use in marker-assisted selection. Predicted and observed values are written to
the secondary output file with extension *.qst.
classes i i’ j j’ k k’ m m’
e.g.
classes 2 5 8 12
or
classes 2 5 8 12 8 12 6 10
or
classes T93 C66 T125 T71
With the classes statement, the frequencies, means, and standard errors of
means (s.e.) are calculated for genotypes displaying two (linked) segments with flanking
markers i and i’, j and j’, assuming that the phase of i and i’ is the same (neglecting double crossovers). Another pair of segments k k’ m m’ can be
analyzed with the same statement if desired.
The following segment types are listed:
Par1 or 0-0 the segment from parent A
Het. or 1-1 the heterozygous segment
Par2 or 2-2 the segment from parent B
Not1 or 3-3 not parent A (heterozygous or parent B)
Not2 or 4-4 not parent B (heterozygous or parent A)
Additionally, two-way class tables i-i’ by j-j’ (and k-k’ by m-m’ )
are given.
If the user chooses i=i’ etc., simple marker class means are calculated. The genetic effects of such segments can be calculated by formulas found in Wricke
and Weber (1986, p. 63 or Table 2.5):
a = [mean(parentB) − mean(parentA)]/2
d = mean(heterozygote) − [mean(parentA) + mean(parentB)]/2
and epistatic effects correspondingly.
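As a small worked illustration, the additive effect a is half the difference between the two homozygous class means, and the dominance effect d is the deviation of the heterozygote mean from the midparent value. The Python sketch below uses a hypothetical function name and made-up class means.

```python
def additive_dominance(mean_parent_a, mean_het, mean_parent_b):
    """Additive (a) and dominance (d) effects from three marker class means."""
    a = (mean_parent_b - mean_parent_a) / 2.0
    midparent = (mean_parent_a + mean_parent_b) / 2.0
    d = mean_het - midparent
    return a, d


# Hypothetical class means as printed by the classes statement.
a_effect, d_effect = additive_dominance(
    mean_parent_a=4.0, mean_het=6.5, mean_parent_b=7.0
)
```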
environments E
E = number of environments
With the statement environments, a QTL×environment analysis can be invoked for the situation of a simultaneous fit (see chap. 4.7 and 4.10).
The observation data of the individual experiments must be appended to the
data file *.qdt, see chap. 7.
first analysis
This provides a first analysis of the marker and phenotypic data: linkage map,
segregation ratios with Chi-square tests, frequencies of marker pairs, percentage of homozygosity and percentage of the genome inherited from the first
parent in the individuals assayed for markers, histogram for homozygosity,
overview of the distribution parameters and histograms for each observation
variable, and stepwise regression to preselect cofactors. For each chromosome simple means and frequencies of observation values are given for each
marker class. Additionally, the input of the data is monitored in the file
pq___.mtr to make data errors easier to find.
The first statement must precede the load command. It is advantageous
to use the first statement only in the first run of an analysis and to omit it
in subsequent runs.
permute NN
NN = number of random reshuffles of observations
to produce empirical threshold values for LOD scores (see Churchill and
Doerge, 1994) or to perform a permutation test.
Following Churchill and Doerge (1994), NN = 1000 can be recommended.
Clearly, NN reshuffles require much more computing time than a single pass. The critical values are found at the end of the *.qdt file under the heading
EMPIRICAL CRITICAL VALUES for several alpha values (between 1%
and 30%).
It is recommended to use out 0 0 0 -1 to minimize the output in the
*.qpt file. The perm statement must be given together with a scan statement.
Note that the critical values apply only to the situation analyzed, i.e., the genetic
model with or without dominance or epistatic effects. For simple interval mapping (SIM) the threshold value is smaller than for composite interval mapping
(CIM, with statement cov SEL), as can easily be verified.
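The idea of empirical thresholds from reshuffled observations can be sketched as follows. This is a simplified illustration and not PLABQTL's algorithm: it assumes a single-marker regression with LOD = -(n/2) log10(1 - r^2) in place of interval mapping, and all names and the toy data are hypothetical.

```python
import math
import random


def single_marker_lod(marker, trait):
    """LOD score of a simple regression of the trait on one marker (0/1/2)."""
    n = len(trait)
    mx = sum(marker) / n
    my = sum(trait) / n
    sxx = sum((x - mx) ** 2 for x in marker)
    syy = sum((y - my) ** 2 for y in trait)
    sxy = sum((x - mx) * (y - my) for x, y in zip(marker, trait))
    if sxx == 0 or syy == 0:
        return 0.0  # a constant marker or trait carries no information
    r2 = min(sxy * sxy / (sxx * syy), 1.0 - 1e-12)
    return -(n / 2.0) * math.log10(1.0 - r2)


def empirical_thresholds(markers, trait, n_perm=1000,
                         alphas=(0.01, 0.05, 0.30), seed=1):
    """Genome-wide LOD thresholds from the maxima over reshuffled traits."""
    rng = random.Random(seed)
    shuffled = list(trait)
    max_lods = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks the marker-trait association
        max_lods.append(max(single_marker_lod(m, shuffled) for m in markers))
    max_lods.sort()
    thresholds = {}
    for alpha in alphas:
        # Order statistic corresponding to the (1 - alpha) quantile.
        k = int(math.ceil((1.0 - alpha) * n_perm)) - 1
        thresholds[alpha] = max_lods[min(max(k, 0), n_perm - 1)]
    return thresholds


# Hypothetical toy data: 40 individuals, 6 markers coded 0/1/2.
_rng = random.Random(7)
toy_markers = [[_rng.choice((0, 1, 2)) for _ in range(40)] for _ in range(6)]
toy_trait = [_rng.gauss(10.0, 2.0) for _ in range(40)]
toy_thresholds = empirical_thresholds(toy_markers, toy_trait, n_perm=200)
```

Taking the maximum over all markers in each reshuffle is what makes the resulting threshold genome-wide rather than pointwise.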
cross-validate NREPL
NREPL = number of replicates in resampling (default 5)
cross           standard 5-fold cross-validation with 5 splits
cross 100       20 independent resampled 5-fold cross-validations
cross/p 1000    200 5-fold cross-validations and QTL frequency
                on PostScript file *.ps
cross/e         environmental cross-validation
The random number seed has the same value for each call.
With the statement cross, a 5-fold cross-validation is performed in order to
estimate the explained variance without the overestimation due to model selection.
This estimation bias can be substantial: often about half of the explained variance R2
is due to overestimation, see e.g. Utz et al. (2000). It is recommended to judge
the predictive value of the QTL analysis by cross-validation.
In a 5-fold cross-validation, QTL (positions and effects) are estimated with 4/5 of the individuals, and a validation is done with the remaining 1/5 of the genotypes. In particular, the R2 values are estimated. This is done 5 times,
so that each fifth of the data is used once for validation.
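The splitting scheme described above can be sketched in Python as follows. The function name five_fold_splits and the way individuals are assigned to folds are assumptions for illustration; the text does not describe how PLABQTL draws its random splits.

```python
import random


def five_fold_splits(n_individuals, seed=1, k=5):
    """Yield (calibration, validation) index lists for a k-fold split.

    Each individual is used exactly once for validation.
    """
    rng = random.Random(seed)
    indices = list(range(n_individuals))
    rng.shuffle(indices)
    # Deal the shuffled individuals into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = sorted(folds[i])
        calibration = sorted(
            j for fold in folds[:i] + folds[i + 1:] for j in fold
        )
        yield calibration, validation


# Hypothetical population of 23 individuals.
splits = list(five_fold_splits(23))
```

Repeating the call with different seeds would correspond to independent resampled cross-validations, e.g. 20 repetitions for the 100 runs requested by cross 100.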
The output of the cross-validation runs is given as usual in the *.qpt file,
followed at the end by a summary with the header 5-FOLD CROSS-VALIDATION
SUMMARY giving the medians and extreme values of the R2-related estimates.
Thus insight can be gained into how the QTL estimates depend on genotypic
sampling and how strongly the explained variance is overestimated by model
selection.
The more important estimates of the cross-validation runs can also be found
in the *.qcv file for further calculations.
Lines starting with c:seq contain the position of putative QTL during calibration with chromosome number and position in cM, in descending order
according to their R2 -values:
e.g. c:seq 2/28 1/24
The estimated QTL effects follow with the heading c:pred, the intercept, and
the additive and dominance effect for each QTL.
With these putative QTL, a validation is conducted in the part of the data
that was not used in calibration, and the phenotypic variance explained by the QTL
(phen.var.expl.%) is given with and without adjustment. Additionally, with the calibrated QTL positions the QTL effects are re-estimated in the validation set and
given briefly in the lines g:seq and g:pred. The validated estimates of R2 and
QTL effects are on average smaller than the calibrated values. With an editor
or a grep utility, the c: and g: lines can be extracted and then averaged or plotted as in Utz et
al. (2000). (A tool for calculating medians is available on request by e-mail.)
The c: and g: lines can also be found in the *.qcv file.
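Instead of an editor or grep, the c:seq lines can also be tallied with a short script. The Python sketch below assumes that each such line has the form shown in the example above (c:seq followed by chromosome/position items separated by blanks); the full layout of the *.qcv file is not described here, so the function name and the sample lines are hypothetical.

```python
from collections import Counter


def qtl_frequency(lines, prefix="c:seq"):
    """Fraction of calibration runs in which each QTL position was detected.

    Keys are (chromosome, position in cM) tuples.
    """
    counts = Counter()
    n_runs = 0
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != prefix:
            continue  # skip c:pred, g:seq, g:pred and any other lines
        n_runs += 1
        for item in fields[1:]:
            chrom, pos = item.split("/")
            counts[(int(chrom), float(pos))] += 1
    if n_runs == 0:
        return {}
    return {key: count / n_runs for key, count in counts.items()}


# Hypothetical excerpt in the style of the lines described above.
sample_lines = [
    "c:seq 2/28 1/24",
    "c:pred 80.4 -0.51 -0.12",
    "c:seq 2/28",
    "g:seq 2/28 1/24",
]
frequencies = qtl_frequency(sample_lines)
```

In practice the lines would be read from the *.qcv file, e.g. with open(...) and iteration over the file object.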
To estimate the QTL frequency or probability of occurrence of QTL, a high
number of NREPL, e.g., 1000, is necessary. With cross/p the frequency is
plotted as a profile over the position on the chromosomes, like the LOD profile.
The cross/e variant together with the environments statement allows validation across environments. Given E environments, E-1 environments are used
for estimation of QTL and the omitted environment for validation. Using each
environment for validation, E such runs are performed.
exindivid IND1 IND2 ...
INDi = number of the individual which is to be excluded from the QTL mapping analysis (but included in the LINKAGE MAP overview)
e.g.
exind 3 4 5 237
If the exindivid statement is missing, all individuals are analyzed.
exmarker MARK1 MARK2 ...
MARKi = number of the marker which is to be excluded from mapping analysis
e.g.
exmark 3 45 89
If the exmark statement is missing, all markers are used in the analysis.
parameter DELIM FtoE FtoD THETA
The parameter statement is optional, and is needed only if default settings
are to be changed. Default values are:
DELIM=blank FtoE=FtoD=3.5 THETA=0.02
DELIM = any character to separate output in the secondary output file
(with extension *.qst),
e.g. character for horizontal tab
(given as ALT-9 for separation in Excel files)
FtoE = F-to-Enter, see Draper and Smith (1981, p. 308ff) or chap. 3.3,
usually a value between 2.5 and 15
FtoD = F-to-Drop, usually same value as FtoE
THETA = constant for ridge regression (a small value, e.g., 0.01),
for more details see Draper and Smith (1981, p. 313).
Normally it must be used iteratively.
convert
The data in the *.qdt file are converted to another format to obtain input files
for Van Ooijen’s MapQTL or, after some editing work, for MAPMAKER/QTL.
Data files PQ___.MAP, PQ___.DAT, PQ___.LOC, and PQ___.QDT are produced.
Observation and marker values are given in matrix form in the file
PQ___.QDT. At the end of this file the same matrix is found, with missing
marker values replaced by their estimated values, rounded to one digit. Thus
the user can check the missing data replacement and compare it with other programs, or delete individuals or markers more conveniently in the matrix form.
With these four files and a little editing, other QTL mapping programs
should be usable for a comparative analysis.
The convert statement must precede the load command.
dimensions MAX_EFFECTS
defines the maximum number of possible QTL effects.
Default is
MAX_EFFECTS = 150. This size is sufficient in normal cases. The statement
should be necessary only if very many putative QTL are expected during
detection and epistatic models are used. As a definition statement, it must be given before the load statement.
MAX_EFFECTS = maximum number of QTL effects to be estimated, i.e. the
number of all additive, dominance, add×add, add×dom, dom×dom epistatic
effects which can enter in the QTL analysis. The number depends on the
specifications in the statement model.
This number can be calculated from the maximum possible number of QTL
MAX_NQTL:
without dominance, with additive×additive epistasis:
MAX_EFFECTS = MAX_NQTL * (MAX_NQTL + 1) / 2
with dominance and all digenic epistatic effects:
MAX_EFFECTS = 2 * MAX_NQTL * MAX_NQTL .
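The two expressions above can be checked by counting effects directly. The Python sketch below is our own reading and not taken from the program: it assumes one additive (and optionally one dominance) effect per QTL, plus one add×add effect per QTL pair without dominance, or four digenic effects per pair with dominance. Under this assumption the counts coincide with the two formulas when epistatic effects are included. The function name n_effects is hypothetical.

```python
def n_effects(n_qtl, dominance=False, epistasis=False):
    """Number of genetic effects for n_qtl QTL under a given model."""
    main = 2 * n_qtl if dominance else n_qtl   # additive (+ dominance)
    pairs = n_qtl * (n_qtl - 1) // 2           # number of QTL pairs
    if not epistasis:
        per_pair = 0
    elif dominance:
        per_pair = 4                           # a×a, a×d, d×a, d×d
    else:
        per_pair = 1                           # a×a only
    return main + per_pair * pairs


def max_qtl_within(limit, dominance=False, epistasis=False):
    """Largest number of QTL whose effects still fit into the limit."""
    n = 0
    while n_effects(n + 1, dominance, epistasis) <= limit:
        n += 1
    return n


# With the default limit of 150 effects:
n_additive_epistatic = max_qtl_within(150, dominance=False, epistasis=True)
n_full_epistatic = max_qtl_within(150, dominance=True, epistasis=True)
```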
stop
Closing statement to terminate the program. Subsequent lines in the *.qin
file are ignored.
7 Description of data file *.qdt
The *.qdt data file contains the marker data, linkage map, and the phenotypic observation values. Hence, recombination frequencies among markers are assumed to be
known.
Lines 1 to 3 are required to characterize the data:
1. Line with an arbitrary comment (job reference)
Additional comment lines beginning with c are possible in *.qdt files, especially
to give further comments at the head of the *.qdt file.
2. Line with dimensions of the data set:
NENTR, NMARKM, NVAR, NCHROM,
MMIN, NIDENT, RECF, FGEN, RALPH
NENTR = number of genotypes (individuals)
NMARKM = number of markers
NVAR = number of observation variables (traits)
NCHROM = number of chromosomes (linkage groups)
MMIN = 0 input in matrix format
1 input in vector format (similar to MAPMAKER/QTL)
NIDENT = number of identifiers at the beginning of each data row
(only of importance for input in matrix format)
RECF = 0 distances among marker pairs are given in cM units
(referring to paragraph 5. linkage map)
1 distances among marker pairs are given as recombination
frequencies
2 distances among markers are given as positions in cM
on the linkage map (= cumulative distances from the first marker
of the linkage group)
FGEN = 1 marker genotypes refer to doubled haploids from a F1
= 2 marker genotypes refer to individuals from a F2 generation
= n with 2 < n < 34, marker genotypes were determined in Fn
(a homozygous population of recombinant inbreds can be indicated
by using high value for FGEN)
RALPH = 1 additive effects are calculated (default)
2 testcross case: effects of an allele substitution
(see e.g., Schoen et al., 1994) are calculated additionally,
i.e., additive effects are multiplied by two
3. Line with heritabilities:
Heritabilities of the phenotypic observations (mostly means across plots and environments) are given as values between 0 and 1 and in the same order as the
observed traits. Unknown heritabilities are coded as zeros. An example may be:
0.54 0.73 0.0 0.86
4. Lines with the observed marker and phenotypic data:
4a. Input in vector format (similar to MAPMAKER/QTL):
sequential input of marker data and phenotypic observations, separately for each
marker and trait; an example can be found in the file pqsample.qdt.
For each marker, give its name, which may begin with a star * . The name
must not start with a digit, as in 150T . Only the first 9 characters of each name
are used. Examples are *T150 or *MARKER1 or T149 .
The marker data follows the marker name on the same line or on subsequent lines.
The data may contain blanks or tabs to make marker data more readable.
A,B,H,- (case insensitive) are used to mark genotypes of parent A, parent B, the
heterozygote H, or missing marker data, respectively. Further situations are coded
by letters
C or c = not A, i.e., H or B (for dominant markers)
D or d = not B, i.e., H or A (for dominant markers)
M or m = mutation, is treated as a missing value (or -)
Instead of letters, numbers may also be used for coding:
0 parent A
1 heterozygote H
2 parent B
3 not parent A (i.e., C)
4 not parent B (i.e., D)
9 missing value (i.e., M,m,-)
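The correspondence between the letter and number codes can be summarized in a small Python sketch. It is an illustration of the coding rules listed above, not the program's input routine; the names LETTER_TO_CODE and encode_markers are hypothetical.

```python
# Letter codes and their numeric equivalents as listed above.
LETTER_TO_CODE = {"A": 0, "H": 1, "B": 2, "C": 3, "D": 4, "M": 9, "-": 9}
VALID_DIGITS = {0, 1, 2, 3, 4, 9}


def encode_markers(text):
    """Translate a string of marker genotypes into numeric codes.

    Letters are case insensitive; blanks and tabs are ignored.
    """
    codes = []
    for ch in text:
        if ch.isspace():
            continue  # blanks or tabs only improve readability
        if ch.isdigit():
            code = int(ch)
            if code not in VALID_DIGITS:
                raise ValueError("unknown marker code: %r" % ch)
        else:
            key = ch.upper()
            if key not in LETTER_TO_CODE:
                raise ValueError("unknown marker code: %r" % ch)
            code = LETTER_TO_CODE[key]
        codes.append(code)
    return codes


example_codes = encode_markers("AAHBA hb-m")
```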
After the marker data, observation values are input sequentially:
As above, the name of the trait may, but need not, begin with a star * and only the
first 9 characters are used. A comment enclosed in double quotes can be given
after the name. Examples are:
*trait01 or *PtHgt "plant height in cm"
In the following lines, observations are separated by one or several blanks or tabs.
Missing values are coded by -99.00 (or smaller values), by a star * , or by a minus
- . Comments in double quotes can be interspersed between observation data.
The sequence of the data for the various individuals (entries) must be identical for
each marker and trait.
An example may be:
*mark31
AAHBA HHBAA bhaab ahbah aaahb hahaa ahhaa haaaa ahhhh hbbab
hahhb bbaha bhahh hhahb bhaba bbhha ahbha bhahb bbhbb bhhhb
aahhb aaahh hhhah habbh hhhhb hbhbh hhbhh hbbhh hhhhh hahhh
*T175 HAHAHHA-HHHAHHAAHAAHHHAHAAAB-HAHHHAAHHHHHHHAHAHAAA-AHAH--HHA
AAHHAA-AHHHAAAHAAAAHHAAHAAHAAAHHAHAHAHAH-HHAAAHHAHAAAAHAHAAH
HHAAH-AAHHHHAAHHHHAAB-HAHAAHA-AAH-AAAHAAAHAHHHAH-AHHAH-HHAHH
AAAAAAHAAHHHA-------------------....
*weight "grams per plot"
4.949 3.58 -99999 "as missing value"
6.212 6.140 5.330 5.761 5.470 7.897
7.559 -99999 4.990 5.316 3.190 6.160 8.150 5.370 16.330
3.220 9.540 -99999 6.360 6.040 5.480 4.710 6.310 4.390
...
4b. Input in matrix format (for each genotype a row of data):
An example is found in file pqsimul.qdt.
First, the names of traits and markers must be given. As in the vector format, a name can
begin with a star.
- One line with NVAR trait names, separated by one or more blanks, e.g.
yield *dmc *height tgw
- One or several lines with NMARKM marker names separated by one or more
blanks, e.g.
UMC94 BNL8.05 UMC76 UMC137 UMC58 UMC23 UMC128 UMC37 BNL15
*UMC106 *BNL6.32
Next follow the observation data rows:
Three groups of data fields - each separated by at least one blank - must be distinguished:
- At the beginning of each data line, NIDENT data fields are skipped.
- NVAR trait values follow. Missing values are coded by -99.00 (or smaller values),
by a star * , or by a minus - .
- NMARKM marker values follow, coded as described above.
An example with NIDENT=1, NVAR=2 , and NMARKM=15 may be:
A1 4.949 45.6 101 11111 122 1122
A2 3.58  45.9 009 90022 222 1001
D3 *     32.3 011 12111 111 2111
Here the first column gives an identifier for each individual. Any text can be
written in the identifying fields, e.g. an additional site and year code; the program skips them. The number
of identifying fields separated by blanks must be defined with NIDENT (in the
example above NIDENT=1). If NIDENT=0 no identifiers are given.
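The three groups of data fields in a matrix-format row can be separated as in the following Python sketch. It assumes that every marker value is a single-character code, so that the blanks between marker groups carry no meaning; the function name parse_matrix_row is hypothetical and the routine is not PLABQTL's own reader.

```python
def parse_matrix_row(line, n_ident, n_var, n_markers):
    """Split one data row into identifiers, trait values and marker codes.

    Missing trait values (*, -, or values of -99 and below) become None.
    """
    fields = line.split()
    idents = fields[:n_ident]                       # skipped by the program
    traits = []
    for field in fields[n_ident:n_ident + n_var]:
        if field in ("*", "-"):
            traits.append(None)
        else:
            value = float(field)
            traits.append(None if value <= -99.0 else value)
    # Remaining fields are marker codes; blanks between them are ignored.
    marker_text = "".join(fields[n_ident + n_var:])
    if len(marker_text) != n_markers:
        raise ValueError(
            "expected %d marker values, found %d" % (n_markers, len(marker_text))
        )
    return idents, traits, marker_text


# Third row of the example above (NIDENT=1, NVAR=2, NMARKM=15).
row = parse_matrix_row("D3 * 32.3 011 12111 111 2111", 1, 2, 15)
```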
5. The linkage map is attached to the marker and trait data. For each chromosome
the linkage information is coded as follows:
5a. The chromosome name, of which only the first 7 characters are used, and the number of
markers on the chromosome are given in a first line. The chromosome name may, but
need not, begin with a star *.
Examples: *chrom01 12 or chrom10 121
5b. In one or several subsequent lines the marker numbers or marker names with
distances are given. Marker numbers are listed in their order on the chromosome
and numbered across the genome. Between each two markers their distance in
cM (if RECF=0) or as recombination frequency (if RECF=1) is given. If RECF=2,
the absolute positions of the marker on the chromosome can be used.
Example with cM: 1 4.2 3 15.0 2 11.9 5 12.2 7
or ABD13 4.2 ABD14 15.0 ABD16
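The relation between the RECF=0 coding (distances between adjacent markers in cM) and the RECF=2 coding (positions on the linkage group) can be illustrated with a short Python sketch. The function name cumulative_positions is hypothetical, and the sketch assumes that the first marker is placed at position 0.

```python
def cumulative_positions(map_line):
    """Convert 'marker distance marker distance ... marker' (cM)
    into a list of (marker, cumulative position in cM) pairs."""
    fields = map_line.split()
    markers = fields[0::2]                      # every second field is a marker
    distances = [float(x) for x in fields[1::2]]
    if len(markers) != len(distances) + 1:
        raise ValueError("a map line must start and end with a marker")
    positions = [0.0]
    for distance in distances:
        positions.append(positions[-1] + distance)
    return list(zip(markers, positions))


# The cM example from the text.
chrom_map = cumulative_positions("1 4.2 3 15.0 2 11.9 5 12.2 7")
```

If the map of a chromosome is spread over several lines, the lines would have to be joined before calling the function.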
A map for two linkage groups may look as follows:
*chrom1 5
1 4.2 3 15.0 2 11.9 5 12.2 7
*chrom2 7
4 14.8 11 6.4 8 18.9 12 24.0 9 18.1
6 28.6 10
or a linkage group with 3 markers
*chrom1 3
ABD13 12.1
ABD14 22.4
ABD15
6. Observation values from individual experiments:
These values must follow if the statement environments is used in the *.qin file.
The data may also be mean values across replications, adjusted for incomplete
blocks, or averaged over several plants. The input is in matrix format: for each environment a matrix with NENTR rows and NTRAIT columns. Entries (individuals)
and traits must be given in the same order as specified before. Separation of data
is by blanks or tabs as usual.
This block of data from individual experiments is opened by a line of the following form:
*env KIDENT arbitrary text
KIDENT gives the number of data fields to be skipped at the beginning of each data line.
An example with two traits x and y and two identifying data fields may be given
for 2 locations and 3 individuals:
*env 2
loc1 ind1 x11 y11
loc1 ind2 x12 y12
loc1 ind3 x13 y13
loc2 ind1 x21 y21
loc2 ind2 x22 y22
loc2 ind3 x23 y23
The sequence must be ordered: within each environment the individuals appear in
increasing order; see also the bottom of the file pqsimul.qdt.
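As a rough illustration of this matrix layout, the following Python sketch splits such a block into one matrix per environment. The function name parse_env_block is hypothetical, and the sketch assumes numeric observation values with * for missing values; it is not the reader used by PLABQTL.

```python
def parse_env_block(lines, nentr, ntrait, missing="*"):
    """Split an *env block into one NENTR x NTRAIT matrix per environment.

    Hypothetical helper for illustration only.
    """
    header = lines[0].split()
    if not header or header[0] != "*env":
        raise ValueError("block must start with an *env line")
    kident = int(header[1])            # number of leading fields to skip per data line
    rows = []
    for line in lines[1:]:
        fields = line.split()
        if not fields:
            continue                   # ignore empty lines
        values = fields[kident:kident + ntrait]
        if len(values) != ntrait:
            raise ValueError("expected %d trait values: %r" % (ntrait, line))
        # Assumption: '*' stands for a missing observation value.
        rows.append([None if v == missing else float(v) for v in values])
    if len(rows) % nentr != 0:
        raise ValueError("number of data lines is not a multiple of NENTR")
    # Consecutive groups of NENTR rows form the environments.
    return [rows[k:k + nentr] for k in range(0, len(rows), nentr)]


block = ["*env 2 two locations",
         "loc1 ind1 1.0 2.0", "loc1 ind2 1.5 2.5", "loc1 ind3 * 3.0",
         "loc2 ind1 4.0 5.0", "loc2 ind2 4.5 5.5", "loc2 ind3 5.0 6.0"]
print(parse_env_block(block, nentr=3, ntrait=2))
```

The sketch relies on the ordering rule stated above: since the identifying fields are skipped, only the position of a row tells which individual and environment it belongs to.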
8 Recommendations for Working with PLABQTL
Beginners with no experience in QTL analyses are advised to conduct and interpret a
few analyses using the tutorial of MAPMAKER/QTL, because it gives a more detailed
description of many terms than this manual.
We recommend the use of cofactors, i.e., working with the cov statement, especially
when several QTL are expected.
Once all marker data including the linkage map as well as the phenotypic observations are available, one can set up the *.qdt data file. It will likely not change during
the course of the analysis unless genotypes with a relatively small number of marker
loci or suspicious outliers for the phenotypic observations are deleted.
In conducting an analysis, we recommend the following procedure:
1.
An initial run with the first_analysis statement to check the input data file
which is monitored in file pq___.mtr. In subsequent runs, this statement should
be unnecessary. With this statement, several basic lists are produced, such as
LINKAGE MAP or CLASSIFICATION OF ADDITIVE QTL EFFECTS.
2.
A run with the scan statement, but without the cov statement, i.e., simple interval
mapping (SIM). These results can be compared with results obtained with MAPMAKER/QTL, if desired. Minor differences may occur because (1) PLABQTL employs a multiple regression procedure, whereas MAPMAKER/QTL uses a maximum likelihood approach for estimation, and (2) missing values may be treated
differently by the two programs.
3.
The main run is conducted with scan and cov SEL statements for determining
the most important markers to be subsequently used as cofactors in composite
interval mapping (CIM).
In addition, data should be checked for outliers and their plausibility. Individual
outliers in conjunction with crossovers may result in spurious QTL or important
markers selected as cofactors (Hackett, 1994). In further runs, one may skip this
check by setting the out command accordingly.
It is best to conduct a run with an added permute statement and to choose an empirical LOD threshold for the desired error level.
If one has the impression that the selected cofactors are not chosen in an optimum manner, the user may either change the boundaries for FtoE, FtoD in the
parameter statement or may correct the selected set of cofactors on the basis of
other criteria like AIC or BIC. In the latter case, the cov statement must include
the selected cofactors by their numbers. Best may be to
In the case of ill-conditioned preselection runs, one may also alter the THETA parameter in the parameter statement.
4.
We have found a further run with the cov/+ cofactor set useful, in order to include
all markers on a chromosome as cofactors (in addition to those already employed
as cofactors). This facilitates the detection of putative QTL having different signs or
the resolution of "ghost QTL". Since power in cov/+ runs is reduced due to the high
number of cofactors, it may be desirable to correct the set of cofactors of the main
run according to the identified questionable regions before running the final analysis.
5.
Special analyses of particular traits or chromosomes can be performed with the
trait and chrom commands. Fits with certain suspected QTL positions can be
carried out with the command seq. With the classes statement, the means and
frequencies of genotypes with certain marker segments are available. This often
makes clear how imprecise the QTL estimates are and that QTL mapping
is more exploratory than confirmatory statistical work.
6.
Provided the main run (e.g., the one described under point 3.) has run without
problems, a new run is performed with the environments statement. This provides an unbiased estimation of the percentage of the genetic variance explained
by the QTL and an estimation of QTL×environment interactions, if the number of
environments is sufficiently large (> 5).
7.
With the cross statement, and with cross/e for several environments, we obtain information about the magnitude of bias in the explained phenotypic and genotypic variance.
With cross/p 1000, QTL occurrence frequencies and less biased QTL effects
are produced.
Hints:
1.
Markers can be dropped from the analysis by dropping the corresponding marker
numbers from the linkage map at the end of the file *.qdt. Of course, one has to
add the map distances to the right and left of the marker excluded, provided these
are given in cM. An example may be taken from the test data file pqsample.qdt with 5
markers and distances in cM:
*chrom1 5
1 4.2 3 15.0 2 11.9 5 12.2 7
Suppose marker 3, the second marker in the map, is to be omitted. Add 4.2
and 15.0 and remove the 3. The map then contains only 4 markers:
*chrom1 4
1 19.2 2 11.9 5 12.2 7
If recombination frequencies are used and absence of interference is assumed, one
must calculate
r = r_a + r_b − 2 r_a r_b,
where r_a and r_b refer to the recombination frequencies of the two intervals.
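The bookkeeping for dropping a marker can be sketched in Python as follows. The function name drop_marker and the list representation of the map are assumptions made for illustration; the two merging rules (adding cM distances, or combining recombination frequencies without interference) are the ones stated above.

```python
def drop_marker(markers, distances, marker, recombination=False):
    """Remove one marker from a chromosome map and merge its flanking intervals.

    markers:   marker numbers or names in map order
    distances: distances between adjacent markers (one fewer than markers)
    Hypothetical helper for illustration only.
    """
    idx = markers.index(marker)
    new_markers = markers[:idx] + markers[idx + 1:]
    if idx == 0:
        new_distances = distances[1:]          # first marker: drop the first interval
    elif idx == len(markers) - 1:
        new_distances = distances[:-1]         # last marker: drop the last interval
    else:
        a, b = distances[idx - 1], distances[idx]
        if recombination:
            merged = a + b - 2 * a * b         # r = r_a + r_b - 2 r_a r_b
        else:
            merged = a + b                     # cM distances are additive
        new_distances = distances[:idx - 1] + [merged] + distances[idx + 1:]
    return new_markers, new_distances


markers, distances = drop_marker(["1", "3", "2", "5", "7"],
                                 [4.2, 15.0, 11.9, 12.2], "3")
print("*chrom1", len(markers))
print(markers, distances)
```

The number of markers on the chromosome line must be reduced as well, which is why the sketch prints the new count together with the shortened map.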
2.
Mapping populations consisting of doubled haploid lines derived from an F1
cross, or of backcrosses obtained by crossing an F1 with one of the two
homozygous parents, can be analyzed as F2 populations because these types of
populations have the same recombination frequencies as an F2. For coding of
marker genotypes, it is important to note that with doubled haploids, heterozygous genotypes (i.e., H or 1) are missing and additive effects are calculated as half
the difference between the two homozygous classes. Likewise, with backcrosses,
marker genotypes of the non-recurrent parent are missing. When comparing results
with those of other QTL mapping programs, one must pay attention to the definition of the additive effect.
3.
The first parent always has the smaller coding value in the markers (A, a or 0), is
listed first in all lists, and is assumed to be the weaker parent in the definition
of the additive effect:
additive effect = (parent B - parent A) / 2
In some lists, the first parent is additionally denoted by P1 and the second by P2
for clarity.
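The sign convention can be made concrete with a very small Python sketch. The function name and the class means used here are invented for illustration; only the formula itself is taken from the text above.

```python
def additive_effect(mean_parent_a, mean_parent_b):
    """Additive effect as defined above: half the difference B minus A."""
    return (mean_parent_b - mean_parent_a) / 2


# Invented means of the two homozygous marker classes, for illustration only.
print(additive_effect(40.0, 46.0))   # positive: the second parent carries the increasing allele
print(additive_effect(46.0, 40.0))   # negative: the first parent carries the increasing allele
```

A program that defines the effect as (parent A - parent B) / 2, or that omits the division by 2, would report a different sign or magnitude for the same data.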
4.
For a first look, analyse the data with the statements first and convert to get more
detailed output and, hopefully, clear error messages. The files pq___.* can be
checked to verify that the data were read correctly.
5.
PostScript LOD curves are printable on one page from the *.ps file produced by the
out statement parameter LPLOT=2. Such a page allows a fast overview of LOD
peaks over the genome or a comparison of different LOD curves from several runs.
You may use ghostview to view the PostScript files or to print them on a
non-PostScript printer. With ghostview or pstoedit it is also possible to convert *.ps files into *.bmp, *.wmf, or *.pdf files, which can then be imported and edited in
text processing or graphics programs. To use pstoedit from within GSview, use
Edit | Convert to vector format.
For further information on PostScript, Ghostscript and Ghostview, see
http://www.cs.wisc.edu/~ghost
and for pstoedit see
http://www.geocities.com/SiliconValley/Network/1958/pstoedit/index.html
6.
Often markers are so tightly linked, with no or only a few recombinants between them, that the scanning process may be disturbed by ill-conditioned equation systems. Since highly correlated markers or a very dense marker map are of no
advantage for QTL mapping with small samples (N<500), the information of
tightly linked markers is pooled. Markers with a distance smaller than 0.11 cM are
combined into a "synthetic" new marker, where C, D and missing marker scores are
simply replaced by the more informative score (A, H, B) of the tightly linked marker,
if possible. The name of a "synthetic" marker starts with the letter m, followed by
the chromosome number and the position in cM, e.g., m02/092.0.
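The pooling idea can be illustrated with a small Python sketch. It is a simplified reading of the description above, not the program's actual routine: the function names are hypothetical, any score other than A, H or B is treated as less informative, and when both markers carry an informative score the score of the first marker is kept.

```python
INFORMATIVE = {"A", "H", "B"}


def pool_scores(scores_first, scores_second):
    """Combine the scores of two tightly linked markers, individual by individual.

    Hypothetical helper for illustration only. A score outside A, H, B
    (e.g. C, D or a missing-value code) is replaced by the partner's score
    if that one is informative; otherwise the first marker's score is kept.
    """
    pooled = []
    for first, second in zip(scores_first, scores_second):
        if first not in INFORMATIVE and second in INFORMATIVE:
            pooled.append(second)
        else:
            pooled.append(first)
    return pooled


def synthetic_name(chromosome, position_cm):
    """Build a name of the form m02/092.0 from chromosome number and position."""
    return "m%02d/%05.1f" % (chromosome, position_cm)


print(pool_scores(list("ACD-H"), list("BHAAB")))
print(synthetic_name(2, 92.0))
```

The sketch does not resolve the case in which two informative scores disagree, which would indicate a recombinant or a scoring error between the two markers.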
7.
We recommend that LOD curves be checked visually because, in rare cases,
peaks or QTL will not be detected by the program and will be missing in the
LIST OF DETECTED QTL. The minimum distance between two putative QTL to
be listed as separate QTL is 10 cM.
Support intervals in the fit with cofactors represent only a lower boundary and
should be treated with caution.