Sample Size to Detect a Planned Contrast and a One

Sample Size to Detect a Planned Contrast and a One Degree-of-Freedom Interaction Effect
By: Douglas Wahlsten
Wahlsten, D. Sample size to detect a planned contrast and a one degree-of-freedom interaction effect.
Psychological Bulletin, 1991, 110, 587-595.
Made available courtesy of the American Psychological Association: http://www.apa.org/
*** Note: This article may not exactly replicate the final version published in the APA journal. It is not
the copy of record
Abstract:
A simple method is described for estimating the sample size per group required for specified power to detect a
linear contrast among J group means. This allows comparison of sample sizes to detect main effects with those
needed to detect several realistic kinds of interaction in 2 × 2 and 2 × 2 × 2 designs with a fixed-effects model.
For example, when 2 factors are multiplicative, the sample size required to detect the presence of nonadditivity
is 7 to 9 times as large as that needed to detect main effects with the same degree of power. In certain other
situations, effect sizes for the main effects and interaction may be identical, in which case power and necessary
sample sizes to detect the effects will be the same. The method can also be used to find sample size for a
complex contrast in a nonfactorial design.
Article:
The power of the analysis of variance ( ANOVA) to detect statistical interaction in a factorial experiment is
substantially less than the power to detect main effects in many situations (Neyman, 1935; Rodger, 1974;
Traxler, 1976; Wahlsten, 1990). For example, with a 2 × 2 design, power to detect the multiplicative type of
interaction is 16% when power to detect a main effect is 87% (Wahlsten, 1990); hence the ANOVA will frequently point to additivity of effects when in fact they are multiplicative.
This is not a universal property of the ANOVA method, however. As shown in this article, the degree of the
discrepancy in power depends strongly on the specific kind of interaction that is present in the data. The power
function depends on the size of an effect and the degrees of freedom. If two effects have the same size and
degrees of freedom, power must necessarily be the same.
If the presence or absence of interaction is important for a theory it is imperative that the test of interaction have
a power of 80% or, preferably, 90%. This sometimes requires larger samples than are customarily used to detect
main effects with ANOVA. But how much larger must they be? Relative sample sizes may give a better
impression of the insensitivity of ANOVA to interaction than do relative power values. The power to detect a
particular kind of interaction may be half the power to detect a main effect in that circumstance, yet the sample
size needed to detect the interaction may be far more than double the sample size needed to detect the main
effect with the same degree of power. This larger sample size provides a direct indicator of the additional
research subjects, technician time, and grant funds required for the desired sensitivity of the statistical test.
This article presents a convenient way to determine appropriate sample sizes for factorial designs in which there
are two levels of each factor. The method can be adapted to many other designs, provided the investigator has a
good idea about the specific kind of interaction effect of theoretical interest. The basic idea is that any
completely randomized design with fixed effects can be considered a one-way design with J groups and then
analyzed with a series of orthogonal contrasts. In a 2 × 2 or 2 × 2 × 2 design, for example, the contrasts for main
effects and interactions have identical degrees of freedom in both numerator and denominator; only the sizes of
the effects themselves differ. Planned contrasts offer many advantages over omnibus tests in designs with more
than two levels per factor (e.g., Rosnow & Rosenthal, 1989).
Finding the Sample Size for a Linear Contrast
In the next sections, it is assumed that observations in each population are normally distributed and that samples
of equal size are drawn randomly. Consider first a simple comparison of two groups with true means μ1 and μ2
having standard deviations equal to σ, which yields a population effect size δ = |μ1 – μ2|/σ. If the probability of a
Type I error is set at α and the desired power of the test is 1 — β, then the exact solution for the necessary
sample size (n) per group is
n=
(1)
when one uses a one-tailed z test of the hypothesis that δ = 0 and the standard deviations are known and equal.
Some variation of this formula is presented in almost every introductory text on statistical inference. Generally,
σ is unknown and is estimated from the data. Then a t test having n1 + n2 — 2 degrees of freedom is used, and
Formula 1 is approximate. Because a linear contrast among J group means will also have a t distribution when
the normal assumptions hold, it should be possible to generalize Equation 1.
For J groups with means μ1, μ2 ··· μJ a linear contrast among the means is given by
where Σcj = 0, and effect size δc = Ψc/σ. If each group has variance σ2 and samples of n observations are taken
from each population, the variance of the sample contrast c is (Hays, 1988)
Combining the ideas in Equations 1 and 3, the sample size per group to detect the contrast when σ is known is
exactly
If σ2 is estimated from the sample variance within groups (degrees of freedom Jn — J), this equation is an
approximation. It is essentially the equation presented by Levin (1975) and applied by Lachenbruch (1988 ) to
determine sample size to detect an interaction in a 2 × 2 design.
However, empirical testing (demonstrated below) reveals that adding 2 to Equation 4 is appropriate when σ is
estimated from the data. I propose here that a more satisfactory approximation for a contrast of J means is
It might be helpful also to specify the relation in terms of a proportion of variance attributable to the particular
effect or contrast, ηc2 or ηj2, which is a partial correlation ratio (Maxwell, Camp, & Arvey,1981; see also Glass
& Hakstian, 1969). When Σcj2 =1 for a particular contrast, the partial correlation ratio is
How good is the approximation proposed here? Kraemer and Thiemann (1987) prepared a master table that is
based on a central t approximation to the noncentral t distribution that is very convenient for determining
sample size for a comparison of two groups. The table is entered with the critical effect size A, which is
equivalent to
, and the result is degrees of freedom v for the two groups combined. In light of Equation 7, it
is possible to generalize their approach to a linear contrast Ψc. When Σcj2 = 1, the critical effect size is
Δc= δc / (δc2 + J)1/2,
(9)
where Δc equals ηc. The total sample for the J groups is then N= v +J where v is obtained from Kraemer and
Thiemann's master table, and sample size per group is n = (v + J)/J. One point of caution is necessary here.
Converting the effect size Δc to the equivalent δc for insertion in Equation 6 uses Equation 9, which assumes
Σcj2 = 1. For two groups, this equivalent δc is not the familiar δc = |μ1 – μ2|/σ given in Kraemer and Thiemann
(1987) because that expression yields Σcj2 = 2. The cj values must be ±1 /
when Equations 6 and 9 are used.
Otherwise, Equation 9 should entail J2 rather than J in the denominator if ci values are ±1.
Suppose one wishes to compare four groups with a contrast using α = .05 with a nondirectional test of the null
hypothesis δc = 0, and the desired degree of power is 90%. For effect sizes ranging from 0.10 to 0.60, the
required n from Kraemer and Thiemann (1987) is no more than 0.2 unit higher than the result from Equation 6,
and the values of n are identical when rounded upwards. Table 1 compares results from the master table with
those of Equation 6 for different numbers of groups. Equation 6 yields a slight underestimate of n when there
are J= 2 groups and a slight overestimate for more than 6 groups.
Results from Equation 6 or 8 can also be compared with those from Cohen's (1988) sample size tables. For J >
2, a linear contrast translates into a partial ηc by means of Cohen's Equation 8.2.19. For a contrast effect in a 2 ×
2 design with α = .05 (two-tailed) and power of 90%, when Cohen's Table 8.4.4 and Equation 8.4.4 yield n =
43.0, Equation 8 requires n = 44.0. Only when J = 2 groups can the methods of Cohen (1988), Kraemer and
Thiemann (1987) and Equation 5 be compared directly. Equation 5 was not derived with a two-group t test in
mind, but a contrast between two means amounts to the same thing. Cohen's tables for sample size use Effect
Size f or d, whereas the Kraemer and Thiemann master table uses Δ. A tabled value in one can be compared
with the equivalent index of the other using interpolation, which introduces small errors. However, Equation 5
can be used with any value of δ. Table 2 compares n to achieve 90% power, first for Equation 5 versus the
master table and then for Equation 5 versus Cohen's (1988) Tables 8.4.1, 8.4.4, and 8.4.7. All noninteger results
are rounded up. Equation 5 consistently calls for one or two more observations than do Cohen's tables and one
or two fewer than does the master table when J = 2.
The principal advantages of Equation 5 over the methods of Kraemer and Thiemann (1987) and Cohen (1988)
are that (a) no tables are needed apart from the ubiquitous cumulative normal table whose critical values can be
memorized, ( b ) any magnitude of effect size may be used without interpolation, and (c) power for complex
contrasts that do not fit into the standard factorial ANOVA scheme may be readily evaluated. If n is found for
each of a series of contrasts in the same data set, it would be wise to use the largest value of n required to test
effects of particular interest. Several recent textbooks on statistical analysis in the behavioral sciences ( Hays,
1988; Marascuilo & Serlin, 1988; Maxwell & Delaney 1990) emphasize the testing of linear contrasts in oneway designs. The preceding equations should be helpful to researchers who intend to apply their methods of
analysis. It would be pointless to plan a clever experiment and analyze results with sophisticated instruments yet
adopt a sample size that renders the tests insensitive to all but the crudest effects.
Sample size for experiments with many groups can be found quickly with the method of Cohen (1988) if the
design is factorial, but complex experiments do not always lend themselves to this approach. For example,
reciprocal crossbreeding of animals can be used to demonstrate maternal environment effects on rate of
development (Wainwright, 1980 ), behavior (Bauer & Sokolowski, 1988) or brain size (Wahlsten,1983), and
they may also reveal the importance of the cell cytoplasm for individual differences in brain structure (Wimer &
Wimer, 1989). One logical method of analyzing such data is a series of orthogonal contrasts. The experimental
design entails 16 groups, consisting of 2 parent strains, 2 reciprocal F1 hybrids, 8 reciprocal backcrosses of an
F1 hybrid to one parent, and 4 reciprocal F2 hybrids. Table 3 contains an experimental design and hypothesized
means for the number of granule cells (in thousands) in the dentate gyrus of the hippocampus of male mice,
basing the parent strain means (high and low) and standard deviation within groups (σ = 45.0) on data reported
by Wimer and Wimer (1989). The model asserts that the midparent mean of 425 is augmented to varying
degrees by high-strain autosomes, Y chromosome, cytoplasm, and maternal environment and is decremented by
similar amounts by the low-strain counterparts. As is evident from the table, small samples would be quite
adequate to detect the large parent-strain difference, but much larger samples would be required to detect the
small Y chromosome effect. The prudent decision would be to choose a sample size that would allow one to
detect the smallest effect that is of major importance for the research program. Different sample sizes might be
contemplated for different groups, but this would violate the orthogonality of certain comparisons.
Alternatively, the groups involved in a comparison of particular interest but with small effect size could be
tested with large sample sizes, and then those groups could be analyzed with a separate oneway ANOVA.
Relative Sample Sizes for Specific Cases
For any particular test, Equation 5 or 6 gives the results, once δc = Ψc/σ is specified. This can be done
separately for main effects and interaction, but an expression can be derived for the ratio of sample sizes needed
to detect an interaction and a main effect. If nA is the sample size needed to detect contrast ΨA with power 1 —
β, and nA×B is the sample size required to detect contrast ΨA×B, also with power 1 — β, then from Equation 5 it
follows that
=
=
.
(10)
The ratio of sample sizes is unrelated to the chosen values of α or β and is determined primarily by the specific
kind of interaction present in the data. The principle embodied in Equation l0 applies to any two contrasts in
which the Σcj2 values are the same. When both values of n are reasonably large, the constants can safely be
omitted, and relative sample sizes are inversely proportional to relative squared effect sizes.
Let us examine several situations that commonly occur in psychological research to find out the relative
magnitudes of interaction and main effects. Suppose one conducts a 2 × 2 experiment in which Factor A has
levels A1 and A2 and Factor B has levels B1 and B2. Arranging the four groups in the order A1B1, A1B2, A2B1,
and A2B2 and using coefficients (cj) of l or -1, the coefficients are (-1, -1,1,1), (-1,1, -1,1), and (-1,1, 1, -1) for
the A main effect, B main effect, and A × B interaction, respectively.
Patterns of results are portrayed in Figure 1. Many others might be imagined, of course. The scales for the
measures Y are arbitrary and can be adapted to any comparable circumstance by linear transformation. The
cases may be summarized as follows. The first six cases all have μ21 — μ11 = 1, and the last three have μ21= μ11.
Case 1. A and B effects are equal and additive.
Case 2. A and B are multiplicative, so that the effect of B is twice as large for A2 as for A1.
Case 3. There is no effect of B on A1, but there is a clear effect of B on A2.
Case 4. There are opposite effects of B on A1 and A2.
Case 5. There is no effect of B on Al but a large effect on A2, resulting in reversal of ranks.
Case 6. There are opposite effects of B on Al and A2 but no main effects at all.
Case 7. There is no effect of A at B1, and there is a greater effect of B on A2 than Al.
Case 8. There is no effect of A at B1, and there is no B effect on Al.
Case 9. There is no effect of A at B1, and there are opposite effects of B on Al and A2.
The values of Ψc for each contrast are shown in Table 4. These make it possible to compare sample sizes for the
three effects within a particular case, but the arbitrary scales of measurement in the graphs make comparisons
between cases difficult. Suppose that in each case the four group means and the variance within groups are such
that η2 = 20 or Cohen's f = .50. According to Table 8.3.14 of Cohen (1988), to have a power of 90% when α =
.05 for the overall F test of significance, there should be n = 15 observations in each of the four groups.
Knowing the value of η2 for the entire experiment and the relative values of Ψc for all J — 1 contrasts, sample
size can be ascertained without specifying the actual group means or standard deviation within groups. This
makes it convenient to calculate and comprehend the relative sizes of all three effects and the sample sizes
needed to detect them. A simple way to compute ηc2 or ηj2 for each contrast is needed. It can be shown that
2
=
/ (1 – ηj2).
(11)
The sum of squares between the J groups must equal the total SSc for the J-1 contrasts if they are orthogonal
(Hays, 1988). Provided Σcj2 has the same value for each contrast, the proportion of the SS between groups
attributable to a contrast effect is
Pj =
=
.
(12)
From Equations 11 and 12,
=
(13)
When η2 = .20, ηc2 = Pc/Pc+ 4). Consider Case 1. The sum of squared contrasts is 4 + 4 = 8, and for the A main
effect, Pc = 4/8 = 0.5. Therefore, ηc2 = .11, and from Equation 8, 23 observations are needed to detect either
main effect with power of 90% when α = .05 and a two-tailed test is used. In Case 2 for the multiplicative A × B
interaction, Pc = 1/19, ηc2 = .013, and n should be 199.5, which is nearly nine times as large as the n needed for
the main effects! Table 4 reveals that multiplicative interaction in a 2 × 2 design will be particularly difficult to
detect, as will the patterns in Cases 3 and 7. Even in Case 4, in which the lines diverge, the interaction is
substantially smaller than the A main effect. Case 8 is interesting because all three effects are equal.
Several caveats are necessary:
1. Certain of the cases in Figure 1 involve a complete absence of one or more effects. Because these null
hypotheses are presumed to be true, it would make no sense to do a power or sample size calculation for a
nonexistent contrast. However, Equations 11 and 13 hold even when one or two of the contrast effects are zero.
2. Several interesting cases in which interaction effects exceed main effects are not shown in Figure 1. The
general conditions when this will occur are derived in the next section.
3. It may seem counterintuitive that n = 15 is sufficient for the overall F test, yet n must be considerably
greater to detect many of the one degree-of-freedom contrasts, such as the A or B main effect in Case 1 in
which additivity obtains. Yet, when the effect size that is equivalent to ηc2 = 0.111 for the A main effect is
inserted into Equation 5, the n needed for a t test comparing two group means is approximately twice the n
needed for an equivalent contrast among J= 4 groups. That is the total number of observations required is about
the same in both situations. if a single contrast does account for a large proportion of the total SSbetween, then
indeed the n will be less than the 15 required for the global F test.
This approach can be extended to a 2 × 2 × 2 design or any higher order design with only two levels per factor.
Figure 2 illustrates several possible outcomes for a three-way design, and Table 5 describes the sample sizes to
detect main effects and interactions. Eight cases are considered, although other interesting patterns are certainly
possible. As with the 2 × 2 design, suppose the difference among the eight group means is such that η2 = .20.
According to Tables 8.3.15 and 8.3.16 of Cohen (1988 ), n must be about 10 to yield power of 90% when α =
.05. The value of ηc2 for each contrast is computed according to Equation 13. Table 5 shows that the n to detect
a specific contrast is considerably greater than 10 unless that contrast accounts for a large proportion of the total
variance between groups. Indeed, only the A main effects in Cases 15 and 16 require fewer than 10
observations. Equal sample sizes are sufficient for the A × B × C interaction and A main effect in Cases 14 and
17, but for Cases 11, 12, and 13 the samples needed to detect the interactions are rather large. It is somewhat
disturbing to learn that it will be most unlikely to find a multiplicative second-order interaction (Case 11) with a
reasonable degree of power, given the research budgets of most psychologists.
The 2 × 2 in General
The conditions under which the necessary sample sizes for main effects and interaction in a 2 × 2 design are
generally the same or different can be determined by supposing that membership in Group A1 or A2 determines
the parameters of a linear equation and that Treatment B determines the value of X, as shown in Fig. 3. It is
assumed that X2 always exceeds X1. To simplify presentation of results, let the differences in values be
symbolized as Dx = X2 - X1, Da = a2 — al, and Db= b2 - b1. The contrasts then have the expected values ΨA = 2
Da + (X1 + X2)Db, ΨB = Dx(b1 + b2), and ΨA×B = -DxDb.
Sample size to detect a main effect, as compared with that needed for the interaction, can be evaluated by the
difference in squared effect sizes. A ratio could also be used, but it is cumbersome for this purpose. Comparing
the A main effect to interaction,
= 4 ( Da + DbX1 )(Da + DbX2). ( 14 )
Whether the effect for A and interaction are the same or different depends on the choice of treatments B1and B2.
Several other generalizations are possible. The A and interaction effect sizes can be the same if and only if
either X1 or X2 = — (a2 – a1)/(b2 – b1). When the intercepts are the same (a1 = a2), either X1 or X2 must be zero if
the A and interaction effects are to be the same. That is, if the lines meet at the origin but do not cross, the effect
sizes will be the same, no matter what the slopes of the two lines. Because adding a constant to all X values and
another constant to all Y values merely shifts the axes and will not affect the results of an ANOVA, it follows
that the A main effect will always (a) exceed the interaction effect if the lines do not touch, (b) equal the
interaction effect if the lines meet at either X1 or X2, and (c) be less than the interaction effect if the lines cross
between X1 and X2. Note that these conclusions in no way depend on the range from X1 to X2. The difference X2
— X1 will, of course, determine the degree to which effect sizes differ but not the sign of the difference.
Comparing the B main effect with interaction,
-
=
4b1b2.
( 15 )
Four conclusions follow from this expression: (a) If either slope b1 or b2 is zero, the effect sizes and sample
sizes are always equal for the B main effect and interaction. (b) If the signs of the slopes b1 and b2 are the same,
the B main effect is always larger than the interaction. (c) If the signs of slopes b1 and b2 are opposite, the
interaction effect is always larger than the B main effect. (d) None of these conclusions depend on the location
or range of X values, although the degree of difference in sample sizes will depend on Dx, as shown in Equation
l 5. The conclusion that the interaction effect will always exceed the main effect when the lines cross applies to
the A main effect but not to the B main effect. Likewise, the conclusion that the interaction effect will always
exceed the main effect when slopes have opposite signs applies to the B but not the A main effect.
Limitations
Equation 5 is a normal approximation, but Tables 1 and 2 indicate that it is a reasonably good approximation.
When two observations are added to the usual n from the exact expression for a normal distribution, the result is
very close indeed to a more elaborate calculation that is based on a central t approximation to the noncentral t
distribution (Kraemer & Thiernann, 1987). Several approximations of the noncentral t distribution have been
devised (Tiku, 1966), and a simple normal approximation is respectable in this company. The excellence of the
approximation to several decimal places is not a major issue for sample size and power calculations, because n
is always rounded upwards to an integer and power is of interest only to the nearest 5%. When sample size is
calculated to yield desired power, the investigator typically tests a few more subjects than dictated by the
algebra, just in case there is a subject misbehavior, apparatus failure, or other unexpected loss of data.
Computers can facilitate certain kinds of power and sample size calculations. For example, the program by
Borenstein and Cohen (1988) can determine n for a one-way design with J groups, given either group means
and standard deviations or effect size, but it cannot do this for a contrast among J means, nor can it compute
power or n for interaction in a two-way or three-way design using group means. The expert system by Brent,
Scott, and Spencer (1988) can determine n for four levels of power and three effect sizes with a one-way design
but not a factorial design, and it cannot work with contrasts. The SAS program can determine power, given
degrees of freedom and the noncentrality parameter for the noncentral χ2, F, or t distributions (Hewitt &
Heath,1988), and the noncentrality parameter can be determined from effect size as indicated in Wahlsten
(1990). Of course, it would be simple to implement Equations 2, 5, and 8 in a program.
Throughout this article, I have assumed that the model involves fixed effects rather than random effects. Power
and sample size can be readily determined for a random-effects ANOVA model ( Koele, 1982 ), if this is
appropriate for the data in question, but the normal approximation in Equation 5 would not be suitable.
The method advocated in this article appears to give good results even when sample sizes are relatively small.
Of course, these calculations will be credible only when the distributions of observations within each group are
normal or nearly so. The consequences of departures from normality for power and sample size related to tests
of interaction and complex contrasts remain to be explored. A Monte Carlo approach (e.g., Soper, Cicchetti,
Satz, Light, & Orsini, 1988) might be useful for this purpose. Tests that are based on the normal distribution
generally have slightly greater power than the so-called "nonparametric" tests only when the normal
assumptions hold. Recent investigations demonstrate that relative power of alternative tests depends strongly on
the shape of the distribution ( Blair & Higgins, 1985). Bootstrap methods have been advocated for computing
confidence intervals when the shapes of the underlying distributions are unknown (Efron, 1988; see also
Rasmussen, 1988; Strube, 1988 ). However, the bootstrap, which is applied to the facts observed, provides no
direct aid when sample size must be chosen before data collection.
When many tests of significance are done in the course of a large study it would be wise to use a Bonferroni
adjustment for the criterion for a Type I error. Equation 5 can still be used to estimate sample size, although n
will be larger when α is reduced. The ratio of sample sizes in Equation 10, however, will not be changed.
If the t or F test is good enough for the data collected, then choice of sample size using Equation 5 should be
reasonable. I hope that the simplicity of this approach will encourage more investigators to embark on a quest
for interaction or other interesting contrast effects with sufficient power provided by an adequate sample.
Despite well-established principles of sample size determination and the availability of works making these
comprehensible to the users of statistical analysis, Rosnow and Rosenthal (1989) as well as Cohen (1990)
observed recently that many psychologists continue to ignore the question of power.
References
Bauer, S. J., & Sokolowski, M. B. (1988). Autosomal and maternal effects on pupation behavior in Drosophila
melanogaster. Behavior Genetics, 18, 81-97.
Blair, R. C., & Higgins, J. J. (1985). Comparison of the power of the paired samples t test to that of Wilcoxon's
signed-ranks test under various population shapes. Psychological Bulletin, 97, 119-128.
Borenstein, M., & Cohen, J. (1988). Statistical power analysis: A computer program. Hillsdale, NJ: Erlbaum.
Brent, E. E., Jr., Scott, J. K., & Spencer, J. C. (1988). EX-SAMPLETm. An Expert System to Assist in Designing
Sampling Plans. Columbia, MO: Idea Works.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed). Hillsdale, NJ: Erlbaum.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45,1304-1312.
Efron, B. (1988). Bootstrap confidence intervals: Good or bad? Psychological Bulletin, 104, 293-296.
Glass, G. V, & Hakstian, A. R. (1969). Measures of association in comparative experiments: Their development
and interpretation. American Education Research Journal, 6, 403-414.
Hays, W. L. (1988). Statistics (4th ed). New York: Holt, Rinehart & Winston.
Hewitt, J. K., & Heath, A. C. (1988). A note on computing the chi-square noncentrality parameter for power
analysis. Behavior Genetics, 18, 105-108.
Koele, P. (1982). Calculating power in analysis of variance. Psychological Bulletin, 92, 513-516.
Kraemer, H. C., & Thiemann, S. (1987). How many subjects? Statistical power analysis in research. Newbury
Park, CA: Sage.
Lachenbruch, P. A. (1988). A note on sample size computation for testing interactions. Statistics in Medicine,
7,467-469.
Levin, J. R. (1975). Determining sample size for planned and post hoc analysis of variance comparisons.
Journal of Education Measurement, 12, 99-108.
Marascuilo, L. A, & Serlin, R. C. (1988). Statistical methods for the social and behavioral sciences. New York:
W H. Freeman
Maxwell, S. E., Camp, C. J., & Arvey, R. D. (1981). Measures of strength of association: A comparative
examination. Journal of Applied Psychology, 66,525-534,
Maxwell, S. E., & Delaney, H. D. (1990). Designing experiments and analyzing data. A model comparison
approach. Belmont, CA: Wadsworth.
Neyman, J. (1935). Comments on Mr. Yates' paper. Journal of the Royal Statistical Society, (Suppl. 2), 235241.
Rasmussen, J. L. (1988). "Bootstrap confidence intervals: Good or bad": Comments on Efron (1988) and Strube
(l988) and further evaluation. Psychological Bulletin, 104, 297-299.
Rodger, R. S. (1974). Multiple contrasts, factors, error rate and power. British Journal of Mathematical and
Statistical Psychology, 27, 179 - 198.
Rosnow, R. L., & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in
psychological science. American Psychologist, 49,1276-1284.
Soper, H. V, Cicchetti, D. V, Satz, P., Light, R, & Orsini, D. L. (1988). Null hypothesis disrespect in
neuropsychology: Dangers of alpha and beta errors. Journal of Clinical and Experimental Neuropsychology, 10,
255-270.
Strube, M. J. (1988). Bootstrap Type I error rates for the correlation coefficient: An examination of alternate
procedures. Psychological Bulletin, 104, 290-292.
Tiku, M. L. (1966). A note on approximating to the noncentral F distribution. Biometrika, 53. 606-610.
Traxler, R. H. (1976). A snag in the history of factorial experiments. In D. B. Owen (Ed.), On the history of
statistics and probability (pp. 283-295). New York: Marcel Dekker.
Wahlsten, D, (1983). Maternal effects on mouse brain weight. Developmental Brain Research, 9. 215-221.
Wahlsten, D. (1990). Insensitivity of the analysis of variance to heredity-environment interaction. Behavioral
and Brain Sciences, 13 , 109- 120.
Wainwright, P (1980). Relative effects of maternal and pup heredity on postnatal mouse development.
Developmental Psychobiology, 13, 493-498.
Wimer, C. C., & Wimer, R. E. (1989). On the sources of strain and sex differences in granule cell number in the
dentate area of house mice. Developmental Brain Research, 48, 167-176.