Experimental design and statistical methods for classical and bioequivalence
hypothesis testing with an application to dairy nutrition studies1
R. J. Tempelman2
Department of Animal Science, Michigan State University, East Lansing 48824-1225
ABSTRACT: Genetically modified (GM) corn hybrids
have been recently compared against their isogenic reference counterparts in order to establish proof of safety
as feedstuffs for dairy cattle. Most such studies have
been based on the classical hypothesis test, whereby
the null hypothesis is that of equivalence. Because the
null hypothesis cannot be accepted, bioequivalence-testing procedures in which the alternative hypothesis
is specified to be the equivalence hypothesis are proposed for these trials. Given a Type I error rate of 5%,
this procedure is simply based on determining whether
the 90% confidence interval on the GM vs. reference
hybrid mean difference falls between two limits defining equivalence. Classical and bioequivalence power of
test are determined for 4 × 4 Latin squares and double-
reversal designs, the latter of which are ideally suited
to bioequivalence studies. Although sufficient power
likely exists for classical hypothesis testing in recent
GM vs. reference hybrid studies, the same may not be
true for bioequivalence testing depending on the equivalence limits chosen. The utility of observed or retrospective power to provide indirect evidence of bioequivalence is also criticized. Design and analysis issues
pertaining to Latin square and crossover designs in dairy
nutrition studies are further reviewed. It is recommended that future studies place greater emphasis on the use of confidence intervals relative to P-values to unify inference in both classical and bioequivalence-testing frameworks.
Key Words: Bioequivalence Testing, Dairy Science, Experimental Design,
Genetically Modified Feedstuffs, Sample Size
2004 American Society of Animal Science. All rights reserved.
Introduction
The evaluation of genetically modified (GM) crops
as feedstuffs for livestock, including dairy cattle, is of
increasing interest given some lingering public reservations about their safety. In recent studies, the working
hypothesis has been that of no mean difference between
GM and reference counterparts (Folmer et al., 2002;
Grant et al., 2003) for various measures such as milk
production, growth, and feed intake. Unfortunately,
classical hypothesis-testing strategies that specify the
null hypothesis to be that of no mean treatment difference are not appropriate for these studies. This is explicitly recognized in the instructions to authors (“Style and Form”) for the Journal of Animal Science, which
dictate that statistically nonsignificant effects (i.e., P >
0.05) are not to be presented as evidence of equivalence.
Nevertheless, these statistical tests have provided the
basis for conclusions on feed equivalence between GM
and conventional feedstuffs for livestock (Clark and
Ipharraguerre, 2001; Faust, 2001).
1 This article was presented at the 2003 ADSA-ASAS-AMPA meeting as part of the Contemporary Issues symposium “Designing Animal Experiments for Power.” Comments on a first draft of this manuscript by M. A. Faust (ABS Global, Inc.) are gratefully acknowledged. Financial support from the Michigan Agric. Exp. Stn. (Project MICL 01822) is also acknowledged.
2 Correspondence: 1205 Anthony Hall (phone: 517-355-8445; email: [email protected]).
Received July 3, 2003.
Accepted October 8, 2003.
J. Anim. Sci. 2004. 82(E. Suppl.):E162–E172
The primary intent of this paper is to present hypothesis-testing procedures when the working hypothesis
is mean equivalence. Statistical methods for bioequivalence testing have been well developed and extensively
applied in pharmaceutical research (FDA, 2001). Experimental design and statistical analysis issues pertinent to both classical and bioequivalence testing on
mean differences are reviewed in this paper, indicating
subtle differences in design choices between the two
testing scenarios. Power-of-test computations are also
illustrated for both hypothesis-testing paradigms based
on replicated Latin square and repeated crossover designs. The utility of observed or retrospective power for
conclusions on equivalence is also criticized. Additional
design and analysis considerations that may help refine
classical and bioequivalence power of test in future
dairy nutrition studies are discussed. Finally, some reporting recommendations for future studies are
provided.
Experimental Design
Feedstuff Nutrient Profile Comparisons
The nutrient composition of experimental feedstuffs
is often compared to a standard or a conventional feedstuff before production and feed intake responses in
animals are investigated. These comparisons, as they
pertain to GM vs. their reference hybrids, are typically
based on the principle of substantial equivalence (Hoover et al., 2000), that is, whether or not the GM feedstuff
differs from its reference counterpart in terms of its
inherent antinutrient constituents.
In animal feedstuff cropping designs, test and control
hybrids are typically assigned to plots within the same
field or grown in adjacent fields; nevertheless, in many
cases, only one or perhaps two plots per hybrid are
considered. The instructions to authors for the Journal
of Animal Science and the Journal of Dairy Science in
2003 explicitly define the experimental unit as “the
smallest unit to which an individual treatment is imposed.” Applied then to feedstuff comparisons, the experimental unit definition includes not only plot
but also silo or silage bag if fermentation conditions
differ considerably between silos (Gill, 1981). Furthermore, this definition might also be logically expanded
to include year of crop, as feedstuff comparisons involving GM crops have been shown to be influenced by
weather conditions, such as insect pressure, drought
(Novak and Haslberger, 2000), or ambient temperature
at harvest (Grant et al., 2003). As statistically significant and meaningful nutrient compositional differences
have not yet otherwise been generally detected between
GM and conventional dairy feedstuffs (Donkin et al.,
2003; Folmer et al., 2002; Grant et al., 2003; Ipharraguerre et al., 2003), it may be likely that potential confounding factors due to the utilization of only one or
two plots, silos, or cropping years per hybrid within
each study have not yet become issues with, perhaps,
the exception of Grant et al. (2003). However, enough
ambiguity in experimental unit definition may exist
for some studies such that a meta-analysis approach,
which treats each individual study as an experimental
unit (St-Pierre, 2001), should be advocated to periodically formalize consensus on substantial equivalence.
Naturally, any experimental design issues as they pertain to feedstuff nutrient profile comparisons would also
influence animal performance comparisons. This was
recognized by Grant et al. (2003), who harvested their
GM corn silage at a very high ambient temperature,
resulting in a crop having a high-DM content, thereby
adversely affecting fermentation conditions and subsequent DMI and milk production in cows fed the GM
silage.
Rigorous designs, such as the incomplete block or
lattice designs used in agronomy (Stroup, 2002), warrant greater consideration, although much larger plot
sizes, fewer hybrids for comparison, and fewer replicates (plots) per hybrid would likely be required to facilitate animal feeding experiments. However, even with
these robust designs, analyses based on spatial error
models seem to be important. The use of conventional
ANOVA often fails to adequately account for spatially
correlated sources of variability (due to, e.g., fertility,
moisture, or slope gradients), thereby leading to serious
misrankings of crop varieties for yields (Littell et al.,
1996; Stroup, 2002). Whether this might also be true
for nutrient profile and livestock performance comparisons of GM vs. reference hybrids is unclear.
Feedstuff Comparisons for Animal Performance
The most statistically efficient experimental designs
for dietary treatment comparisons involve blocking on
cows to facilitate powerful intra-subject comparisons
on treatments. The Latin square design entails such a
structure whereby the two blocking factors are cow and
period to facilitate comparisons on a third treatment
factor. Logically, cow represents a random source of
variability, whereas period may or may not, depending
on the variability of DIM within a period as discussed
later. Single Latin square designs do not allow the consideration of interaction among the three factors; however, replicated Latin square designs do facilitate investigation of treatment × period interactions if the same
periods are considered within each square. This issue
is not trivial because, for example, mean milk yield
differences between diets may depend on the stage of
lactation (Longuski et al., 2000). Nevertheless, a large
number of studies, including GM vs. reference hybrid
studies, have not considered this potentially important interaction.
Repeated crossover designs are potentially better
suited for bioequivalence studies compared to replicated Latin square designs. Let us suppose that the
two treatments to be considered are labeled A and B.
A two-period, two-treatment crossover design is synonymous with a replicated two-period Latin square design
in which half of the cows are assigned the treatment
sequence AB over the two periods whereas the remaining half are assigned to the sequence BA. With
this nonrepeated crossover design, however, one is not
able to separately infer upon potential carryover or residual effects due to the diet fed in the preceding period.
Repeated crossover designs, on the other hand, are
characterized by each animal receiving a treatment for
more than one period. In the four-period two-treatment
design, also called the double-reversal design (Gill,
1978), the two treatment sequences, randomly assigned
among cows, might naturally be ABAB and BABA. If
the fourth period is not considered within each of the
two sequences, then a somewhat less powerful switchback design is created, that is, with sequences ABA and
BAB. This design was considered for the assessment of
glyphosate-tolerant corn vs. its isogenic reference by
Donkin et al. (2003). The random effects variance component due to cow × hybrid interaction is estimable in
both switchback and double-reversal designs, but not
in the replicated Latin square design. By analogy, this
interaction term is similar to subject × drug treatment
interaction modeled in human bioequivalence studies
on drug testing, which, if large, indicates highly variable subject-specific differences in drug efficacy. Treatment-specific residual variabilities are also more efficiently estimated in repeated crossover designs,
thereby providing a potential indication of whether or
not there are individual differences in variability of
response between two or more treatments. If two treatments lead to similar mean responses, the more desirable treatment may be the one that leads to less variability in response. For these reasons and others, the
double-reversal and switchback designs have been advocated (FDA, 2001) for bioequivalence testing in medical and pharmaceutical research. Variants of these designs have been popularized in dairy science since Lucas (1956; 1957).
On-farm studies may warrant important design considerations, particularly when production or feeding
groups or pens of animals dictate those groups to be
the experimental units. However, because these environments are not highly controlled, on-farm trials are
not likely to be suitable for first-stage regulatory testing
(St-Pierre and Jones, 1999) as it may pertain to the
evaluation of GM crops.
Statistical Models and Analyses
The Replicated Latin Square Design
The replicated Latin square design has been frequently used in recent GM vs. reference corn hybrid
comparisons in dairy cattle studies (Folmer et al., 2002;
Grant et al., 2003; Ipharraguerre et al., 2003). Thus,
its properties warrant further review. Suppose that
there are $n_t$ treatments to be considered such that the
number of periods in a Latin square study is also $n_t$.
Suppose further that there are $n_s$ such squares and
that the total number of cows in the study is $n_c = n_s n_t$.
Without loss of generality, we assume that square effects do not exist if squares do not represent any spatial,
temporal, or other blocking entity; that is, all squares
are conducted simultaneously and proximally. The corresponding linear mixed effects model can be written as

$$y_{ijk} = \mu + \alpha_i + \beta_j + c_k + \alpha\beta_{ij} + e_{ijk} \qquad [1]$$

where $y_{ijk}$ represents the observation on cow $k$ given
treatment $i$ at period $j$; $\alpha_i$ represents the fixed effect of
the $i$th treatment, $i = 1, 2, \ldots, n_t$; $\beta_j$ represents the
fixed effect of the $j$th period, $j = 1, 2, \ldots, n_p$; and $c_k$
represents the random effect of the $k$th cow, $k = 1, 2, \ldots, n_c$,
with variance component $\sigma_c^2$. The period effect
is considered fixed if it is nearly synonymous with stage
of lactation, provided that variability in DIM is low
within each period. Therefore, the interaction $\alpha\beta_{ij}$ between the $i$th treatment and $j$th period is also considered fixed and inferable, provided the same periods are
considered within each Latin square. If squares are
used as blocks for starting DIM (Grant et al., 2003),
then treatment by period and treatment by square interaction are both inferable, but either term then is
individually weakly associated with treatment by stage
of lactation interaction.
The mixed model in Eq. [1] is typically specified as
if the residual terms $e_{ijk}$ are normally, independently,
and identically distributed with variance $\sigma_e^2$. However,
this Latin square design is just one particular example
of a repeated measures design, whereby cows are observed over time, albeit with different treatment assignments at each period. Therefore, a richer class of
covariance structures could be considered for residuals
across periods within cows, for example, a first-order autoregressive covariance structure in which residuals in adjacent periods are specified to be more
highly correlated than residuals from distal periods.
Details on using SAS PROC MIXED (SAS Institute,
Inc., Cary, NC) for such specifications are provided in
Littell et al. (1998) and in the Appendix of this paper.
Also, carryover effects may be additionally modeled in
Eq. [1] provided that the replicated Latin squares are
orthogonal to each other and/or are Williams designs
(Williams, 1949). Carryover effects can be estimated
separately from treatment × period interaction effects
if squares are balanced for carryover effects but not
necessarily otherwise. Coding strategies for carryover
effects for use in statistical analyses can be found in
Kuehl (2000).
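As a hedged illustration of what "balanced for carryover effects" means, the following Python sketch builds a Williams square for an even number of treatments using the standard zig-zag-and-shift construction and counts how often each treatment immediately follows each other treatment. The function names `williams_square` and `carryover_counts` are hypothetical and are not taken from Williams (1949) or Kuehl (2000); the sketch covers only the even-treatment case.

```python
def williams_square(n):
    """Return an n x n Williams Latin square (n even) as a list of sequences.

    Rows are cows (treatment sequences), columns are periods,
    and entries are treatment labels 0..n-1.
    """
    if n < 2 or n % 2 != 0:
        raise ValueError("this construction needs an even number of treatments")
    # First sequence zig-zags through the labels: 0, 1, n-1, 2, n-2, ...
    first, lo, hi = [0], 1, n - 1
    while len(first) < n:
        first.append(lo)
        lo += 1
        if len(first) < n:
            first.append(hi)
            hi -= 1
    # Remaining sequences are cyclic shifts (add 1 modulo n) of the first.
    return [[(t + shift) % n for t in first] for shift in range(n)]


def carryover_counts(square):
    """Count how often each treatment is immediately preceded by each other one."""
    counts = {}
    for seq in square:
        for prev, cur in zip(seq, seq[1:]):
            counts[(prev, cur)] = counts.get((prev, cur), 0) + 1
    return counts


square = williams_square(4)
for cow, seq in enumerate(square, start=1):
    print(f"cow {cow}: " + " ".join("ABCD"[t] for t in seq))
print(carryover_counts(square))
```

In a square that is balanced in this sense, every ordered pair of distinct treatments occurs equally often in consecutive periods, which is what allows first-order carryover effects to be separated from direct treatment effects.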
The Repeated Crossover Design
The statistical model for the repeated crossover design does not differ much from Eq. [1] for the replicated
Latin square design except that the number of periods,
say, $n_p$, exceeds the number of treatments $n_t$. For $n_t = 2$
treatments, $n_p = 3$ defines the switchback design,
whereas $n_p = 4$ defines the double-reversal design. Additionally, random treatment × cow interaction effects
should be considered:

$$y_{ijk} = \mu + \alpha_i + \beta_j + c_k + \alpha\beta_{ij} + \alpha c_{ik} + e_{ijk} \qquad [2]$$

Here, the terms in Eq. [2] are defined as in
Eq. [1], with the additional term being $\alpha c_{ik}$, the random
treatment × cow interaction effect corresponding to the
$i$th treatment and $k$th cow with variance component
$\sigma_{\alpha c}^2$. As with the Latin square design, alternative repeated measures covariance structures can be modeled
across periods within cows (see Appendix herein).
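To make the idea of an alternative within-cow covariance structure concrete, the sketch below builds a first-order autoregressive, AR(1), covariance matrix across four periods and evaluates the variance of one cow's treatment contrast under the sequence ABAB. The residual variance of 2 (kg/d)$^2$ and the correlation of 0.5 are purely illustrative assumptions, and the function names are hypothetical; this is not the SAS specification referred to in the Appendix.

```python
def ar1_covariance(n_periods, sigma2, rho):
    """AR(1) covariance matrix: entry (i, j) equals sigma2 * rho ** |i - j|."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return [[sigma2 * rho ** abs(i - j) for j in range(n_periods)]
            for i in range(n_periods)]


def contrast_variance(cov, contrast):
    """Variance of a linear contrast of the residuals, i.e. c' V c."""
    n = len(contrast)
    return sum(contrast[i] * contrast[j] * cov[i][j]
               for i in range(n) for j in range(n))


# Within-cow contrast for sequence ABAB: mean of A periods minus mean of B periods.
abab = [0.5, -0.5, 0.5, -0.5]

independent = ar1_covariance(4, sigma2=2.0, rho=0.0)   # i.i.d. residuals
correlated = ar1_covariance(4, sigma2=2.0, rho=0.5)    # assumed rho, for illustration

for row in correlated:
    print(row)
print(contrast_variance(independent, abab), contrast_variance(correlated, abab))
```

Under these assumed values, positive correlation between adjacent periods reduces the residual variance of the within-cow treatment contrast relative to the independent case, which is one reason the choice of covariance structure can matter for standard errors.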
The fact that the treatment × cow effect serves as
the ANOVA experimental error term for testing the
significance of the treatment effect in repeated crossover designs has often been ignored. Consequently, reported P-values in studies that have not recognized this inherent design feature may overstate the evidence
against the classical null hypothesis. As previously indicated, statistically significant and large values of $\sigma_{\alpha c}^2$
would indicate that differences in responses between
GM and reference feedstuffs are not consistent across
cows and may be due to potential individual-specific (i.e.,
genotype-specific) allergenicity, antinutrient, or feed
preference effects.
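The role of the treatment × cow term as the error term can be seen from a short derivation, offered here only as a sketch for the double-reversal design under the assumptions that the $\alpha c_{ik}$ and $e_{ijk}$ in Eq. [2] are mutually independent with constant variances and that cows are split equally between the two sequences. The symbols $d_k$ (the within-cow difference between the two treatment means) and $\pi_k$ (the fixed period and treatment × period contribution, which depends only on the cow's sequence) are introduced for illustration and do not appear in the original text.

```latex
\begin{align*}
d_k &= \bar{y}_{Tk} - \bar{y}_{Rk}
     = \Delta + \pi_k + (\alpha c_{Tk} - \alpha c_{Rk}) + (\bar{e}_{Tk} - \bar{e}_{Rk}), \\
\mathrm{Var}(d_k) &= 2\sigma_{\alpha c}^2 + \frac{\sigma_e^2}{2} + \frac{\sigma_e^2}{2}
     = \sigma_e^2 + 2\sigma_{\alpha c}^2, \\
\mathrm{Var}(\hat{\Delta}) &= \mathrm{Var}\!\left(\frac{1}{n_c}\sum_{k=1}^{n_c} d_k\right)
     = \frac{\sigma_e^2 + 2\sigma_{\alpha c}^2}{n_c}.
\end{align*}
```

The cow effect $c_k$ cancels within each cow, and the $\pi_k$ terms cancel across the two sequences, so a nonzero $\sigma_{\alpha c}^2$ inflates the variance of the estimated treatment difference in a way that an analysis based on $\sigma_e^2$ alone would miss.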
Mixed Model Analyses
The ANOVA has effectively provided the statistical
inference engine for the analyses of efficient designs,
such as replicated Latin squares or crossover designs.
However, there are some important distinctions to be
made between the two most common ANOVA inferential strategies currently used in dairy nutrition studies,
namely the ordinary least squares (OLS) approach using, say, PROC GLM of SAS, and the more recently
developed true mixed model (MM) analysis using, say,
PROC MIXED. For the two broad types of mixed models, [1] and [2], considered in this paper, the use of OLS
may dramatically understate the estimated standard
error of an estimated treatment mean (SEM) because
OLS ignores the between-cow variability in such a determination. Admittedly, however, the estimated standard errors of the estimated mean differences (SED)
between treatments may not differ between OLS and
MM analyses, provided that the design remains balanced; that is, no data are missing. Nevertheless, there
are two potentially important exceptions. First, OLS
treats all effects, whether labeled as fixed or random,
as fixed effects, potentially leading to substantial confounding between various effects. For example, in the
double-reversal design in Eq. [2], cow effects are partially confounded with treatment × period effects if cow
is treated as fixed; furthermore, cow × treatment effects
are partially confounded with period effects if cow ×
treatment is treated as fixed. This confounding may
represent the historical basis as to why treatment ×
period and cow × treatment have not been considered as
sources of variation in earlier work using OLS. Similar
estimability problems with OLS inference using PROC
GLM have also been pointed out by Littell et al. (1996)
and in the context of bioequivalence testing by LeRoux
et al. (1998). These confounding issues do not arise with
a true MM analysis because cow and cow × treatment
effects are appropriately treated as random.
A second advantage of MM over OLS analysis pertains to the recovery of inter-block information, or in the
context of this paper, inter-cow information. Ordinary
least squares facilitates only intra-block comparisons of
treatments, whereas MM analysis utilizes both intra-block and inter-block information (Littell et al., 1996).
Although OLS and MM inference will often differ little,
if at all, for treatment mean comparisons in balanced
complete block designs, this may not necessarily be true
for incomplete block designs, such as Latin squares,
and particularly when some data are missing.
Hypothesis Testing and Power of Test
Classical Hypothesis Testing
Classical two-tailed hypothesis tests are typically
specified in terms of the null vs. alternative hypotheses,
respectively:

$$H_0: \Delta = 0 \quad \text{vs.} \quad H_1: \Delta \neq 0 \qquad [3]$$

Here $\Delta = \mu_T - \mu_R$ represents the difference of inferential
interest between the GM or test crop mean $\mu_T$ and the
mean $\mu_R$ of its isogenic or reference counterpart.
Classical power is defined as the probability of concluding a mean difference when such a difference truly
exists. Stroup (1999) demonstrated how classical power
computations for $\Delta$ could be readily derived for a linear
mixed model of any complexity (e.g., Eq. [1] or [2],
above) using SAS PROC MIXED software. Additional
details are provided, albeit in an agronomic context, by
Stroup (2002). Although power is readily specified as
a function of Type I error rate $\alpha$ (set throughout this
article as 0.05), sample size or number of cows $n_c$, and
$\Delta$, the most difficult inputs to elicit for power determinations are the variance components.
In both Latin square and repeated crossover designs,
it is $\sigma_e^2$ and not $\sigma_c^2$ that vitally determines power of test,
ignoring, for the moment, the existence of nonzero
$\sigma_{\alpha c}^2$. Based on various statistical consulting experiences
pertaining to Latin square studies involving milk production at Michigan State University, estimates of $\sigma_e^2$
have ranged from 2 to 10 (kg/d)$^2$. The lower limit is
in close agreement with much earlier sample size and
design work (Gill, 1969; Gill and Magee, 1976), whereas
the upper limit might be more indicative of currently
higher production environments, as evidence exists
that variability increases with mean milk production.
This relationship is manifested in Jensen (2001), who
illustrates in a figure that $\sigma_e^2$ ranges from 2 (kg/d)$^2$ in
late lactation to 14 (kg/d)$^2$ in early-lactation Canadian
Jerseys, with larger residual variances being further
associated with higher-producing multiparous animals.
As an example, we consider classical power for replicated 4 × 4 Latin squares (i.e., each square with four
cows) in the spirit of recent GM dairy feedstuff studies
(Folmer et al., 2002; Grant et al., 2003; Ipharraguerre
et al., 2003), concentrating on power for $\Delta$ involving
only two of the four hypothetical treatments. We define
sufficient power as being 0.80, or 80%. Figure 1 relates
power to number of 4 × 4 Latin squares where $\sigma_e^2$ is
either 2 (kg/d)$^2$ or 10 (kg/d)$^2$, with $\Delta$ ranging from 1 to
4 kg/d. For $\sigma_e^2 = 2$ (kg/d)$^2$, four 4 × 4 squares (i.e., $n_c = 16$
cows) would be adequate to detect $\Delta \ge 2$ kg/d with
sufficient power, whereas eight squares ($n_c = 32$ cows)
would be required to detect $\Delta = 1$ kg/d. When $\sigma_e^2 = 10$
(kg/d)$^2$, eight squares would be insufficient to detect $\Delta = 2$
kg/d, whereas four squares would be insufficient to
detect $\Delta = 3$ kg/d.
Figure 1. Classical power of test vs. number of 4 × 4
Latin squares for a mean difference ($\Delta$) equal to 1 kg/d
(○), 2 kg/d (●), 3 kg/d (□), or 4 kg/d (■) based on
within-cow variance ($\sigma_e^2$) of either a) 2 (kg/d)$^2$ or b) 10
(kg/d)$^2$.
Figure 2. Classical power of test vs. number of cows
in a double-reversal design for mean difference ($\Delta$) equal
to 1 kg/d (○), 2 kg/d (●), 3 kg/d (□), or 4 kg/d (■)
based on within-cow variance ($\sigma_e^2$) of either a) 2 (kg/d)$^2$ or
b) 10 (kg/d)$^2$.
Classical power is also presented for the double-reversal design in Figure 2 for $\sigma_e^2$ of both 2 and 10 (kg/d)$^2$.
Here, power is plotted against number of cows,
equally randomized between the two sequences ABAB
and BABA. For purposes of comparison against the
replicated Latin square design, we again consider the
case in which the intra-cow variance $\sigma_e^2$ is either 2 or
10 (kg/d)$^2$ with $\sigma_{\alpha c}^2 = 0$, but recognizing that the error
degrees of freedom for testing the main effect of treatment are defined by the treatment × cow interaction term.
Here, $n_c = 16$ cows would be nearly sufficient to detect
$\Delta = 1$ kg/d with $\sigma_e^2 = 2$ (kg/d)$^2$, whereas the same number
of cows would be insufficient to detect $\Delta = 2$ kg/d when
$\sigma_e^2 = 10$ (kg/d)$^2$. Note that, in general, the power of the
double-reversal design is slightly greater than that for
the replicated 4 × 4 Latin square design for the same
number of cows and periods, albeit the number of records per treatment for each cow in the double-reversal design is twice that in the
Latin square design, such that only half as many
treatments can be studied for the same total number
of records. Nevertheless, as previously indicated, this
design feature confers additional advantages for inference on treatment × cow interaction, which may be
pertinent for GM feedstuff studies.
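The classical power calculations discussed above can be approximated with a short, self-contained Python sketch. It is not the SAS PROC MIXED approach of Stroup (1999); instead it evaluates noncentral t probabilities by numerical integration over the distribution of the estimated standard deviation. Several inputs are assumptions made here for illustration rather than statements from the text: the standard error of a difference is taken as $\sqrt{2\sigma_e^2/n_c}$ for the Latin square and $\sqrt{(\sigma_e^2 + 2\sigma_{\alpha c}^2)/n_c}$ for the double reversal, and the error degrees of freedom are taken as $(n_t - 1)n_c - n_t^2 + 1$ and $n_c - 2$, respectively. All function names are hypothetical.

```python
import math


def _phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def _expect_over_s(func, df, steps=2000):
    """Average func(s) over S = sqrt(chi-square(df) / df) by the midpoint rule."""
    half = 0.5 * df
    log_const = math.log(2.0) + half * math.log(half) - math.lgamma(half)
    h = (1.0 + 8.0 / math.sqrt(df)) / steps  # integrate s over (0, 1 + 8/sqrt(df))
    total = 0.0
    for i in range(steps):
        s = (i + 0.5) * h
        density = math.exp(log_const + (df - 1.0) * math.log(s) - half * s * s)
        total += func(s) * density
    return total * h


def two_sided_power(noncentrality, df, crit):
    """P(|T| > crit) for a noncentral t variable T = (Z + noncentrality) / S."""
    return _expect_over_s(
        lambda s: 1.0 - _phi(crit * s - noncentrality)
        + _phi(-crit * s - noncentrality), df)


def t_critical(alpha, df):
    """Two-sided critical value: solves P(|T| > crit) = alpha for a central t."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if two_sided_power(0.0, df, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def classical_power(delta, sed, df, alpha=0.05):
    """Power of the two-sided t-test of H0: Delta = 0 at true difference delta."""
    if df < 1:
        raise ValueError("error degrees of freedom must be at least 1")
    return two_sided_power(delta / sed, df, t_critical(alpha, df))


def latin_square_power(n_squares, delta, sigma2_e, n_trt=4, alpha=0.05):
    n_cows = n_squares * n_trt
    sed = math.sqrt(2.0 * sigma2_e / n_cows)      # assumed SED under Eq. [1]
    df = (n_trt - 1) * n_cows - n_trt ** 2 + 1    # assumed residual df
    return classical_power(delta, sed, df, alpha)


def double_reversal_power(n_cows, delta, sigma2_e, sigma2_ac=0.0, alpha=0.05):
    sed = math.sqrt((sigma2_e + 2.0 * sigma2_ac) / n_cows)  # assumed SED under Eq. [2]
    df = n_cows - 2                                          # assumed error df
    return classical_power(delta, sed, df, alpha)


for n_squares in (4, 8):
    print("Latin squares:", n_squares,
          "power:", round(latin_square_power(n_squares, 1.0, 2.0), 2))
print("Double reversal, 16 cows, power:",
      round(double_reversal_power(16, 1.0, 2.0), 2))
```

Because the degrees of freedom and standard errors are assumptions, the output should be read as a rough check on the patterns shown in Figures 1 and 2 rather than as a reproduction of them.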
Bioequivalence Hypothesis Testing
As previously stated, classical hypothesis testing
does not provide an appropriate framework for establishing equivalence. Schuirmann (1987) recognized this
and demonstrated the correctness of the two one-sided
tests (TOST) procedure, which has since been used
for testing average bioequivalence in pharmaceutical
testing (FDA, 2001). In order to use this procedure,
it is necessary to set appropriate lower and/or upper
equivalence limits on $\Delta = \mu_T - \mu_R$, say, $\theta_L$ and $\theta_U$, respectively. It may be sufficient to specify both limits with
one parameter $\theta = \theta_U = -\theta_L$ such that the equivalence
interval is symmetric about zero. Naturally, these limits should not be set by the statistician but by investigators and/or regulators (Schuirmann, 1987). The null
and alternative hypotheses in the TOST procedure are
written as

$$H_0: \Delta \le \theta_L \ \text{or}\ \Delta \ge \theta_U \quad \text{vs.} \quad H_1: \theta_L < \Delta < \theta_U \qquad [4]$$
Schuirmann (1987) demonstrates that rejection of the
null hypothesis in Expression [4] with a Type I error
rate of $\alpha$ is based on observing the $100(1 - 2\alpha)\%$
confidence interval for $\Delta$ to fall within the bioequivalence limits $[\theta_L, \theta_U]$. If, for example, a Type I error rate
of 5% is adopted for the bioequivalence hypothesis test,
then one would construct a 90% confidence interval (CI)
on $\Delta$. If this CI falls within $[\theta_L, \theta_U]$, then one can reject
$H_0$: $\Delta \le \theta_L$ or $\Delta \ge \theta_U$ in favor of $H_1$: $\theta_L < \Delta < \theta_U$. As with
classical hypothesis testing, failing to reject $H_0$: $\Delta \le \theta_L$
or $\Delta \ge \theta_U$ does not imply that $\Delta \le \theta_L$ or $\Delta \ge \theta_U$ because
a Type II error, that is, failing to correctly conclude
equivalence, may have been committed.
As an example, consider the data from the double-reversal design involving 24 cows, 12 within each sequence, from Problem 8.9 in Gill (1978). Using a mixed
effects analysis based on Eq. [2], the 90% CI on the
mean difference between the two treatments is [−0.20
kg/d, 0.87 kg/d]. Since this CI falls within [−1 kg/d, +1
kg/d], average bioequivalence is established within the
limits of ±1 kg/d between the two treatments with P <
0.05. Representative SAS PROC MIXED code that can
be used to provide bioequivalence inference is provided
in the Appendix herein.
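The confidence-interval form of the TOST decision reduces to a containment check, sketched below in Python. The function name `tost_decision` is hypothetical, and the confidence limits are taken as given (here, the values reported above for the Gill (1978) example) rather than recomputed from data.

```python
def tost_decision(ci_lower, ci_upper, theta_lower, theta_upper):
    """Return True when the 100(1 - 2*alpha)% CI lies within the equivalence limits."""
    if ci_lower > ci_upper:
        raise ValueError("ci_lower must not exceed ci_upper")
    if theta_lower >= theta_upper:
        raise ValueError("theta_lower must be less than theta_upper")
    return theta_lower <= ci_lower and ci_upper <= theta_upper


# 90% CI reported in the text for the double-reversal example, in kg/d.
print(tost_decision(-0.20, 0.87, -1.0, 1.0))   # True: equivalence within +/-1 kg/d
print(tost_decision(-0.20, 0.87, -0.5, 0.5))   # False: limits of +/-0.5 kg/d are too tight
```

The second call illustrates that the same data can support equivalence under one set of limits and fail to support it under a narrower set, which is why the limits need to be fixed by investigators or regulators before the analysis.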
Bioequivalence power, or one minus the probability of making a Type II error for the hypothesis test in
[4], can be readily determined. Extending the presentation of Phillips (1990), the two t-statistics for the bioequivalence test are

$$T_L = \frac{\hat{\Delta} - \theta_L}{\mathrm{SED}(\hat{\Delta})} \quad \text{and} \quad T_U = \frac{\hat{\Delta} - \theta_U}{\mathrm{SED}(\hat{\Delta})}$$

where $\mathrm{SED}(\hat{\Delta})$ represents the estimated standard error
of $\hat{\Delta} = \hat{\mu}_T - \hat{\mu}_R$, with $\hat{\mu}_T$ and $\hat{\mu}_R$ corresponding to the
least squares means (LSM) of $\mu_T$ and $\mu_R$, respectively.
The required point estimates and standard errors are
readily determined using appropriate mixed models
software (e.g., SAS PROC MIXED).
In concordance with the TOST confidence interval,
bioequivalence is concluded if $T_L \ge t_{1-\alpha,\nu}$ and $T_U \le t_{\alpha,\nu}$,
where $t_{1-\alpha,\nu}$ and $t_{\alpha,\nu}$ are the $100(1 - \alpha)$th and $100\alpha$th
percentiles of a t-distribution with $\nu$ degrees of freedom
as determined by the design. The probability of this
occurrence is known as the bioequivalence power (Phillips, 1990), denoted by

$$\mathrm{Prob}\left(T_L \ge t_{1-\alpha,\nu} \ \text{and}\ T_U \le t_{\alpha,\nu} \mid \mathrm{SED}(\hat{\Delta}), \nu, \delta_L, \delta_U\right) \qquad [5]$$

Here $T_L$ and $T_U$ are joint random draws from a bivariate
noncentral t-distribution with $\nu$ degrees of freedom and
a correlation of 1. Note that the bioequivalence power
in Expression [5] depends further upon noncentrality
parameters

$$\delta_L = \frac{\Delta - \theta_L}{\mathrm{SED}(\hat{\Delta})} \quad \text{and} \quad \delta_U = \frac{\Delta - \theta_U}{\mathrm{SED}(\hat{\Delta})}$$

with $\mathrm{SED}(\hat{\Delta})$ representing the true standard error of $\hat{\Delta}$.
As an illustration involving milk production based
on the two designs discussed in this paper, we consider
just one value of $\Delta$, specifically $\Delta = 0$ kg/d. This specification ideally defines absolute mean bioequivalence. However, it is much more likely that bioequivalence will be
established when $\Delta$ is not truly 0 but small enough to
be considered as being bioequivalent, say, for example,
$-0.5$ kg/d $< \Delta < 0.5$ kg/d. Bioequivalence power for any
combination of $\theta_L$, $\theta_U$, and $\Delta$ can be readily computed
provided that $\theta_L < \Delta < \theta_U$. Figure 3 provides bioequivalence power determinations vs. number of 4 × 4 Latin
squares for $\theta = \theta_U = -\theta_L = 0.5$, 1.0, 1.5, and 2.0 kg/d
when a) $\sigma_e^2 = 2$ (kg/d)$^2$ and for $\theta = 0.5, 1.0, 1.5, \ldots, 4.0$ kg/d
when b) $\sigma_e^2 = 10$ (kg/d)$^2$. From Figure 3a, it can be
seen that there is sufficient power to conclude treatment bioequivalence within equivalence limits of ±1.5
kg/d with four 4 × 4 Latin squares when $\sigma_e^2 = 2$ (kg/d)$^2$.
However, nine squares would be required to establish
sufficient power to conclude average bioequivalence
within equivalence limits of ±1.0 kg/d. If $\sigma_e^2 = 10$ (kg/d)$^2$
(Figure 3b), four squares do not provide sufficient
power to establish equivalence within ±3.0 kg/d and
seven squares are needed to establish equivalence
within ±2.5 kg/d.
Figure 4 is similar to Figure 3 except that bioequivalence power is provided for the double-reversal design.
As with classical power, the double-reversal design has
greater bioequivalence power as it pertains to the comparison of two treatments, albeit, again, at the expense
of twice as many records per cow per treatment. When
$\sigma_e^2 = 2$ (kg/d)$^2$, 16 cows are nearly sufficient to establish
equivalence within ±1.0 kg/d with 80% power, whereas
16 cows would be minimally required to establish equivalence within ±2.5 kg/d when $\sigma_e^2 = 10$ (kg/d)$^2$.
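Bioequivalence power as defined in Expression [5] can be sketched numerically by conditioning on the estimated standard deviation, because with a correlation of 1 both t-statistics share the same numerator and denominator. The Python below is a minimal illustration, not the procedure of Phillips (1990) as implemented in any software. As before, the standard error $\sqrt{2\sigma_e^2/n_c}$ and the degrees of freedom $(n_t - 1)n_c - n_t^2 + 1$ for the replicated Latin square are assumptions made here, and all function names are hypothetical.

```python
import math


def _phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def _expect_over_s(func, df, steps=2000):
    """Average func(s) over S = sqrt(chi-square(df) / df) by the midpoint rule."""
    half = 0.5 * df
    log_const = math.log(2.0) + half * math.log(half) - math.lgamma(half)
    h = (1.0 + 8.0 / math.sqrt(df)) / steps
    total = 0.0
    for i in range(steps):
        s = (i + 0.5) * h
        density = math.exp(log_const + (df - 1.0) * math.log(s) - half * s * s)
        total += func(s) * density
    return total * h


def t_upper_quantile(alpha, df):
    """Return c such that P(T > c) = alpha for a central t with df (alpha < 0.5)."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        tail = _expect_over_s(lambda s: 1.0 - _phi(mid * s), df)
        if tail > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def tost_power(theta, sed, df, true_diff=0.0, alpha=0.05):
    """Probability that both one-sided tests reject, for limits of -theta and +theta.

    sed is the true standard error of the estimated difference.
    """
    crit = t_upper_quantile(alpha, df)
    upper = (theta - true_diff) / sed
    lower = (-theta - true_diff) / sed
    # Given S = s, the estimate must fall between the two shrunken limits.
    return _expect_over_s(
        lambda s: max(0.0, _phi(upper - crit * s) - _phi(lower + crit * s)), df)


def latin_square_tost_power(n_squares, theta, sigma2_e, n_trt=4, alpha=0.05):
    n_cows = n_squares * n_trt
    sed = math.sqrt(2.0 * sigma2_e / n_cows)      # assumed SED under Eq. [1]
    df = (n_trt - 1) * n_cows - n_trt ** 2 + 1    # assumed residual df
    return tost_power(theta, sed, df, alpha=alpha)


for n_squares in (4, 8, 9):
    print(n_squares, round(latin_square_tost_power(n_squares, 1.0, 2.0), 2))
```

A useful sanity check on such a sketch is that the rejection probability approaches the Type I error rate when the true difference sits exactly on one equivalence limit and the other limit is far away.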
Figure 3. Bioequivalence power of test vs. number of
4 × 4 Latin squares for true mean difference $\Delta = 0$ kg/d
with limits of equivalence ($\theta$) being ±0.5 kg/d (○), ±1.0
kg/d (●), ±1.5 kg/d (□), ±2.0 kg/d (■), ±2.5 kg/d (◇),
±3.0 kg/d (◆), ±3.5 kg/d (△), or ±4.0 kg/d (▲) based on
a within-cow variance ($\sigma_e^2$) of either a) 2 (kg/d)$^2$ or b) 10
(kg/d)$^2$.
What is Wrong with the Observed Power Approach to Bioequivalence Testing?
Most investigators recognize that P-values exceeding
a predetermined Type I error rate are not indicative of
proof of the classical null hypothesis $H_0$: $\Delta = 0$. Observed
power, also called retrospective or post-hoc power, has
been provided by some statistical software as a supplementary analysis to establish bioequivalence. Observed
power is the power of being able to detect a mean difference as large as that observed from a data analysis. A
related computation is based on the detectable effect
size; that is, based on the data analysis at hand, what
effect size (i.e., mean difference) would be detectable
for a given power of, say, 0.80? If this effect size is
small, and the corresponding P-value for the classical
hypothesis test exceeds $\alpha$, then plausible evidence, presumably, exists in favor of the classical null hypothesis
of equivalence. As Hoenig and Heisey (2001) point out,
this use of observed power has been tenaciously advocated by a large number of peer-reviewed scientific journals and statistical textbooks!
Figure 4. Bioequivalence power of test vs. number of
cows in double-reversal design for true mean difference
$\Delta = 0$ kg/d with limits of equivalence ($\theta$) being ±0.5 kg/d
(○), ±1.0 kg/d (●), ±1.5 kg/d (□), ±2.0 kg/d (■), ±2.5
kg/d (◇), ±3.0 kg/d (◆), ±3.5 kg/d (△), or ±4.0 kg/d
(▲) based on a within-cow variance ($\sigma_e^2$) of either a) 2
(kg/d)$^2$ or b) 10 (kg/d)$^2$.
Unfortunately, drawing conclusions on equivalence
based on observed power is logically inconsistent. As
noted by Hoenig and Heisey (2001), nonsignificant P-values invariably correspond to low observed powers
because observed power is a 1:1 function of the P-value.
Reconsider the four 4 × 4 Latin squares design. Suppose
that a 2 kg/d mean difference is considered to be within
the limit of equivalence, analogous to the choice of $\theta = 2$
kg/d for the bioequivalence-testing approach. Using
the observed power strategy, one might then “reject”
the classical alternative hypothesis $H_1$: $\Delta < 0$ or $\Delta > 0$ if
the estimated mean difference falls within the classical
acceptance region for the test, and the power for detecting an effect of magnitude $\theta = 2$ kg/d is, say, 80%
or greater. Figure 5 is an adaptation of Figures 1 and
2 from Schuirmann (1987) but for the four 4 × 4 Latin
squares design. The illustrated filled triangles represent the “rejection” regions that would lead to a conclusion of bioequivalence as a function of the estimated
residual standard deviation $s_e = \sqrt{\hat{\sigma}_e^2}$ and the estimated
mean difference $\hat{\Delta}$, using either the observed power approach or the bioequivalence-testing approach of Schuirmann (1987).
That is, for any values of $s_e$ and $\hat{\Delta}$ that jointly fall within
the illustrated triangle, one would conclude that $-2$ kg/d
$< \Delta < 2$ kg/d based on either of the respective approaches.
Figure 5. Rejection regions for concluding treatment
bioequivalence (filled triangles) for four 4 × 4 Latin
squares based on a) observed power approach or b) bioequivalence-testing approach. Regions are a function of
the estimated residual standard deviation ($\sqrt{\hat{\sigma}_e^2}$) and estimated mean difference ($\hat{\Delta}$). Observed power approach is
based on ability to detect an effect of $\theta = \pm 2$ kg/d with a
power of 80% or greater, whereas bioequivalence approach is based on the 90% confidence interval for the
true mean difference ($\Delta$) falling within equivalence limits
$\theta = \pm 2$ kg/d.
Consider Figure 5a based on the observed power approach. One notable feature of this rejection region is
that, for se > 1.96 kg/d, there are no values of Δ̂ that
would support the classical null hypothesis, not even Δ̂ = 0 kg/d, because the observed power
for detecting θ = 2 kg/d is always less than 0.80 for se
> 1.96 kg/d. The rejection region for this approach is
also illogical. Consider the following: Experiment 1 is
executed with relatively poor precision, say, se = 1.8 kg/d,
whereas Exp. 2 is characterized by se = 0.5 kg/d.
Suppose the same mean difference (Δ̂ = 1 kg/d) is observed in both experiments. In Exp. 1, we would conclude
that −2 kg/d < Δ < 2 kg/d because se = 1.8 kg/d
and Δ̂ = 1 kg/d fall within the rejection region for
concluding equivalence. However, the more precisely
executed Exp. 2 would fail to draw that same conclusion! This discrepancy is troubling because one should
logically expect that an experiment with better precision would be more likely to conclude true equivalence
than another experiment with the same estimated
mean difference but higher se. However, if the null hypothesis is mean equivalence as in classical tests, more
information (i.e., smaller variance components, larger
sample sizes) can only strengthen a conclusion of a
nonzero mean difference. That is, using the observed
power approach, poorly executed experiments would be
rewarded with a greater chance of concluding equivalence.
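The paradox can be checked numerically. The following is a minimal sketch under stated assumptions and is not the computation behind Figure 5a: it assumes that each treatment mean in the four 4 × 4 Latin squares rests on 16 observations, so that the standard error of a mean difference is se·√(2/16), and it substitutes standard normal quantiles for Student's t, so its cutoff in se differs slightly from the 1.96 kg/d quoted above. The function name and its defaults are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal, used here as a stand-in for Student's t


def observed_power_rule(delta_hat, s_e, theta=2.0, alpha=0.05,
                        target_power=0.80, n_per_trt=16):
    """Return True if the observed power approach would declare equivalence.

    Assumption: the standard error of a treatment mean difference is
    s_e * sqrt(2 / n_per_trt), with n_per_trt = 16 for four 4 x 4 squares.
    """
    se_diff = s_e * sqrt(2.0 / n_per_trt)
    z_crit = Z.inv_cdf(1.0 - alpha / 2.0)
    # Classical two-sided test: is the estimate inside the acceptance region?
    not_significant = abs(delta_hat) / se_diff < z_crit
    # Approximate power to detect a true difference of size theta.
    shift = theta / se_diff
    power = Z.cdf(shift - z_crit) + Z.cdf(-shift - z_crit)
    return not_significant and power >= target_power


# Same estimated difference (1 kg/d), different precision.
exp1 = observed_power_rule(delta_hat=1.0, s_e=1.8)  # imprecise experiment
exp2 = observed_power_rule(delta_hat=1.0, s_e=0.5)  # precise experiment
print(exp1, exp2)
```

Under these assumptions the imprecise Exp. 1 is declared equivalent and the precise Exp. 2 is not, which is the inconsistency described in the text.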
Figure 5b illustrates the rejection region for the Schuirmann bioequivalence-testing approach as a function
of se and Δ̂, with θ = 2 kg/d. This figure is an adaptation
of Figure 2 from Schuirmann (1987) for the four 4 ×
4 Latin squares design. Recall that for α = 0.05, the
bioequivalence null hypothesis H0: Δ < −θ or Δ > θ is
rejected in favor of −θ < Δ < θ if the 90% CI for Δ falls
within ±θ. Note that, in contrast to the observed power
approach illustrated in Figure 5a, the rejection region is
widest for experiments with greater precision (smallest
standard error); that is, experiments with greater precision are more likely to conclude bioequivalence. Consider
again Exp. 1 and 2. Using the bioequivalence-testing approach, the conclusions are directly opposite
to those based on the observed power approach. That
is, Exp. 1 does not lead to a bioequivalence conclusion,
whereas Exp. 2 with superior precision does.
Schuirmann (1987) further demonstrates that the
power of the bioequivalence-testing approach far exceeds that for the observed power approach; this can be
visually deduced by the relative sizes of the two rejection regions in Panels a and b of Figure 5.
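The confidence-interval form of the bioequivalence test can be sketched for the same two experiments. As an illustration only, the code below assumes a standard error of se·√(2/16) for the mean difference and uses a standard normal quantile in place of Student's t; the t quantile would make both intervals slightly wider without changing either conclusion here. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def tost_by_interval(delta_hat, s_e, theta=2.0, alpha=0.05, n_per_trt=16):
    """Return the 100(1 - 2*alpha)% interval and the equivalence decision."""
    # Assumed standard error of a difference between two treatment means.
    se_diff = s_e * sqrt(2.0 / n_per_trt)
    # One-sided quantile at level alpha gives a 90% interval when alpha = 0.05.
    z = NormalDist().inv_cdf(1.0 - alpha)
    lower = delta_hat - z * se_diff
    upper = delta_hat + z * se_diff
    # Bioequivalence is concluded only if the whole interval lies within +/- theta.
    return (lower, upper), (-theta < lower and upper < theta)


ci1, equivalent1 = tost_by_interval(delta_hat=1.0, s_e=1.8)  # Exp. 1
ci2, equivalent2 = tost_by_interval(delta_hat=1.0, s_e=0.5)  # Exp. 2
print(ci1, equivalent1)
print(ci2, equivalent2)
```

With these inputs the interval for Exp. 1 extends just beyond 2 kg/d, whereas the interval for Exp. 2 lies well inside ±2 kg/d, in agreement with the conclusions stated above.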
Additional Pertinent Design and Analysis Issues
Experimental Design
Dairy studies are generally designed with treatment
periods long enough that residual effects due to a treatment in a previous period do not carry over to subsequent periods. However, as a precaution, designs that
are balanced for at least first-order carryover effects—
such as double-reversal, switchback, Williams, or orthogonal Latin square designs—should be employed in
these experiments. Because of this feature, such designs have been heavily advocated by the FDA for bioequivalence studies; in fact, researchers have been encouraged to investigate direct × carryover treatment
effect interactions (FDA, 2001).
Latin square and crossover designs that block on cows
over periods have been recommended in this paper.
However, as elucidated by Morris (1999), there are a
number of situations in which these designs should
not be used in animal science. First, if one wishes to
measure the long-term cumulative effects of treatments
on, for example, longevity, then animals must serve as
the experimental units and not the blocking factors.
Second, if treatment effects persist well beyond one
period (i.e., second- or higher-order carryover effects),
then designs blocking on animals should be discouraged.
Statistical Analysis
Classical and bioequivalence power computations for
both the replicated Latin square and repeated crossover
designs have been demonstrated in this paper. However, as previously indicated, treatment × period interaction is also inferable. Various contrasts as they pertain to this interaction can be studied under classical or
bioequivalence-testing frameworks, such as the mean
difference between GM and reference feedstuffs in Period 1 vs. the mean difference between the two feedstuffs in Period 4 (presuming period as fixed). Certainly,
such contrasts should take precedence in hypothesis
testing compared to overall treatment comparisons, as
primarily addressed in this paper, if treatment × period
interaction is determined to be important.
In some cases, noninferiority testing rather than bioequivalence testing may be more appropriate for the
comparison of GM vs. conventional feedstuffs. That is,
rather than establishing equivalence, it may be of
greater interest to test whether or not the GM feedstuff
is inferior to the reference feedstuff for various aspects.
More details on hypothesis-testing procedures and
power assessments in noninferiority testing can be
found in Hauschke et al. (1999).
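As a sketch of how the hypotheses change, assume that larger responses are better and that θ > 0 denotes a prespecified noninferiority margin on the difference scale (Hauschke et al., 1999, work instead with the ratio of two means). The two-sided equivalence hypotheses are then replaced by a single one-sided pair:

```latex
H_0:\ \Delta \le -\theta
\qquad \text{vs.} \qquad
H_1:\ \Delta > -\theta ,
\qquad
\text{reject } H_0 \text{ at level } \alpha \text{ if }\
\hat{\Delta} - t_{\alpha,\nu}\,\mathrm{SE}(\hat{\Delta}) > -\theta .
```

Here ν denotes the residual degrees of freedom and t_{α,ν} the upper α quantile of Student's t distribution, both introduced for this illustration. With α = 0.05, the rule amounts to requiring only that the lower limit of the 90% CI exceed −θ.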
Historically, periods have been treated as fixed in
both Latin square and repeated crossover designs. This
approach is suitable if the variability in days in milk
(DIM) within each treatment period is substantially
less than the period length such that period and stage
of lactation are nearly synonymous. However, if the
range in DIM far exceeds period length, then period is
less associated with stage of lactation and more reflective of seasonal or temporal management effects such
as the quality of silage as it is unloaded from the silo.
In that case, period may be more appropriately treated
as random with some modeled temporal correlation
structure as is possible with the use of SAS PROC
MIXED. Furthermore, stage-of-lactation effects should
then be modeled separately from period effects using
lactation curve models, such as Wilmink’s curve (Wilmink, 1987) or even high-order polynomials. Random
regression models have been recently popularized in
modern dairy cattle genetic evaluation programs (Jensen, 2001) where Dairy Herd Improvement Association
test-day effects are used to model seasonal effects in a
manner similar to Latin square periods and DIM is
used to model parity, herd, and random cow-specific
lactation curves, allowing for differences in lactation
shape (i.e., persistency) over levels of those various factors. Treatment-specific lactation curves could be similarly modeled in replicated Latin square and crossover
designs if treatment periods and DIM substantially
overlap. Additionally, genetic sources of variability may
be removed by fitting half-sib or full-sib families as
random effects in these models as well. These implementations may substantially improve both classical
and bioequivalence power of test.
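For orientation, the functional form usually attributed to Wilmink (1987) is y(t) = a + b·exp(−k·t) + c·t, where t is DIM and k is commonly fixed near 0.05 per day. The sketch below merely evaluates that curve; the coefficients are hypothetical values chosen to give a plausible shape and are not estimates from any study cited here.

```python
from math import exp


def wilmink(dim, a, b, c, k=0.05):
    """Wilmink-type lactation curve: a + b*exp(-k*dim) + c*dim (kg/d)."""
    return a + b * exp(-k * dim) + c * dim


# Hypothetical coefficients, chosen only to give a plausible curve shape.
A, B, C = 35.0, -10.0, -0.04

for dim in (5, 50, 150, 305):
    print(dim, round(wilmink(dim, A, B, C), 2))
```

In a mixed model, DIM and exp(−0.05·DIM) could be entered as covariates, possibly with treatment-specific coefficients, so that stage-of-lactation effects are separated from period effects.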
This paper has concentrated primarily on classical
and bioequivalence testing of milk production. However, the values for Δ or σe² chosen could be extrapolated
to other testing scenarios involving various intake, digestibility, fitness, or reproductive measures. That is,
the key contributor to classical or bioequivalence power
is not singly Δ or σe = √σe², but their ratio. As an example,
power determinations for Δ = 1, 2, 3, 4 and σe² = 2 in
Figure 1a apply to any ratio value of
$$\frac{\Delta}{\sqrt{\sigma_e^2}} = \frac{1}{\sqrt{2}},\ \frac{2}{\sqrt{2}},\ \frac{3}{\sqrt{2}},\ \frac{4}{\sqrt{2}}.$$
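The dependence on the ratio alone can be illustrated with a normal-approximation power calculation. This is a sketch rather than the computation used for Figure 1a: the number of observations per treatment mean (16) and the use of normal instead of t quantiles are assumptions made for the illustration.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()


def approx_classical_power(delta, var_e, n_per_trt=16, alpha=0.05):
    """Approximate two-sided power for a true mean difference delta."""
    # Power depends on delta and var_e only through delta / sqrt(var_e).
    shift = delta / sqrt(2.0 * var_e / n_per_trt)
    z_crit = Z.inv_cdf(1.0 - alpha / 2.0)
    return Z.cdf(shift - z_crit) + Z.cdf(-shift - z_crit)


# Three scenarios sharing the ratio delta / sigma_e = 1 / sqrt(2).
p_a = approx_classical_power(delta=1.0, var_e=2.0)
p_b = approx_classical_power(delta=2.0, var_e=8.0)
p_c = approx_classical_power(delta=10.0, var_e=200.0)
print(p_a, p_b, p_c)
```

All three calls return the same power because the ratio of Δ to σe is identical, even though the responses could be on very different measurement scales.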
Studies on GM vs. reference corn as dairy feedstuffs
have involved comparisons for a number of production,
growth, milk components, and feed intake responses.
Multiple testing corrections, such as Bonferroni or other
alternatives, might plausibly be considered further in
such cases; however, Berger and Hsu (1996) argue that
these corrections may not be entirely necessary based
on the intersection-union principle. Furthermore, multivariate data reduction alternatives should also be considered when some response variables are highly correlated with each other. Johnson (1998), for example,
demonstrated how canonical variates analysis can be
effectively used as a precursor data reduction strategy;
subsequent classical or bioequivalence tests can then
be applied as appropriate to the resulting canonical
variates.
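Under the intersection-union principle, overall equivalence is claimed only when every response individually passes its own level-α test, which is why no further adjustment of the individual 90% CI is needed. A minimal sketch of that decision rule follows; the response names, intervals, and equivalence limits are all hypothetical.

```python
def intersection_union_equivalence(intervals, limits):
    """True only if every response's CI lies inside its own equivalence limits."""
    return all(
        limits[name][0] < lo and hi < limits[name][1]
        for name, (lo, hi) in intervals.items()
    )


# Hypothetical 90% CI on GM - reference mean differences.
intervals = {
    "milk_kg_d": (-0.8, 1.1),
    "dmi_kg_d": (-0.5, 0.9),
    "fat_pct": (-0.05, 0.12),
}
# Hypothetical equivalence limits for each response.
limits = {
    "milk_kg_d": (-2.0, 2.0),
    "dmi_kg_d": (-1.0, 1.0),
    "fat_pct": (-0.10, 0.10),
}

overall = intersection_union_equivalence(intervals, limits)
print(overall)
```

Here the overall claim fails because one interval (fat percentage) extends beyond its limits, even though the other two responses pass.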
Current discussion on bioequivalence hypothesis
testing has evolved from the exposition on average bioequivalence provided in this paper to measures of population and individual bioequivalence. These aggregate
measures are not only functions of mean treatment
differences but also of subject × treatment interaction
and treatment differences for within-cow variability.
As indicated previously, the double-reversal design is
particularly useful for inference on these additional
measures of dispersion. Nevertheless, there is currently
sufficient controversy regarding the additional utility
of population and individual bioequivalence aggregate
measures beyond that provided by average bioequivalence (Hauschke and Steinijans, 2000; Munk, 2000).
Bioequivalence testing will therefore likely become an
area for future regulatory statistical research that will
be pertinent for the testing of GM crops as feedstuffs
for dairy cattle.
Final Recommendations
Confidence intervals represent the most reasonable
experimental assessment of plausible values for mean
treatment differences (Hoenig and Heisey, 2001). If a
CI does not include a hypothesized mean difference,
then that mean difference is refuted by the data. Conversely, all values contained within the CI cannot be
refuted from the data. If the values cluster tightly about
0, the investigator has reasonable confidence that the
true mean difference is near 0 and should be able to
report such a result. Conversely, if the CI is very wide,
there is vastly insufficient evidence of either a mean
difference or a mean equivalence such that the result
should not be considered publishable. Once a CI is constructed, retrospective or observed power computations
offer no additional insight (Hoenig and Heisey, 2001).
A greater emphasis on reporting CI relative to classical
P-values would be consistent with the bioequivalence-testing approach discussed in this paper. A good example of concise bioequivalence testing and reporting is
provided in Miles et al. (2002).
Power computations illustrated in this paper seem
to indicate that experiment sizes may need to be greater
for bioequivalence testing as opposed to classical hypothesis testing. It should be noted that existing FDA
guidelines (FDA, 2001) for drug testing advocate that
equivalence limits be, respectively, 80% and 125% of
the reference means (i.e., θL = 0.80µR and θU = 1.25µR)
for various pharmacokinetic responses. Appropriate
equivalence limits should also be characterized for key
response variables in studies comparing GM to their
reference hybrids as feedstuffs for dairy cattle.
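To show what limits of this form imply on the mean-difference scale used elsewhere in this paper, the sketch below applies θL = 0.80µR and θU = 1.25µR to a hypothetical reference mean; the value of 35 kg/d is an assumption for illustration only. Note that 1.25 = 1/0.80, so the limits are symmetric on the log scale but not on the difference scale.

```python
def fda_style_limits(mu_ref, lower_frac=0.80, upper_frac=1.25):
    """Equivalence limits expressed as fractions of the reference mean."""
    theta_l = lower_frac * mu_ref
    theta_u = upper_frac * mu_ref
    return theta_l, theta_u


mu_ref = 35.0  # hypothetical reference-hybrid mean milk yield, kg/d
theta_l, theta_u = fda_style_limits(mu_ref)
# The same limits expressed as differences from the reference mean.
diff_limits = (theta_l - mu_ref, theta_u - mu_ref)
print(theta_l, theta_u, diff_limits)
```

Limits this wide would correspond to roughly −7 to +8.75 kg/d around such a mean, which suggests why response-specific limits need to be considered for dairy studies rather than borrowed directly.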
Implications
Animal scientists, including dairy scientists, have
generally used classical data analysis procedures in
studies designed to compare genetically modified vs.
conventional feedstuffs. Because the anticipated result
is equivalence between these two feedstuffs for their
effects on livestock performance, bioequivalence statistical analysis that has been extensively used for pharmaceutical testing on drug safety is demonstrated to
be better suited for such studies. These procedures, as
demonstrated in this paper, are easy to implement and
their use should be encouraged in future studies designed to investigate equivalence between two treatments or diets. Furthermore, the paper points out easily
correctable data analysis deficiencies in recent animal
science research that should facilitate more-reliable
treatment comparisons, including refinements on identifying treatment differences that may depend on, for
example, stage of lactation.
Literature Cited
Berger, R. L., and J. C. Hsu. 1996. Bioequivalence trials, intersection-union tests and equivalence. Stat. Sci. 11:283–302.
Clark, J. H., and I. R. Ipharraguerre. 2001. Livestock performance:
Feeding biotech crops. J. Dairy Sci. 84:E9–E18.
Donkin, S. S., J. C. Velez, A. K. Totten, E. P. Stanisiewski, and
G. F. Hartnell. 2003. Effects of feeding silage and grain from
glyphosate-tolerant or insect-protected corn hybrids on feed intake, ruminal digestion, and milk production in dairy cattle. J.
Dairy Sci. 86:1780–1788.
Faust, M. A. 2001. New feeds from genetically modified plants: The
U.S. approach to safety for animals and the food chain. Livest.
Prod. Sci. 74:239–254.
FDA. 2001. Guidance for industry: Statistical approaches to establishing bioequivalence, Center for Drug Evaluation and Research,
Office of Training and Communications, Division of Communications Management, Drug Information Branch HFD-210, Rockville,
MD. Available: http://www.fda.gov/cder/guidance/3616fnl.htm. Accessed September 23, 2003.
Folmer, J. D., R. J. Grant, C. T. Milton, and J. Beck. 2002. Utilization
of Bt corn residues by grazing beef steers and Bt corn silage and
grain by growing beef cattle and lactating dairy cattle. J. Anim.
Sci. 80:1352–1361.
Gill, J. L. 1969. Sample size for experiments on milk yield. J. Dairy
Sci. 52:984–989.
Gill, J. L. 1978. Design and Analysis of Experiments in the Animal
and Medical Sciences, Vol II. Iowa State Univ. Press, Ames.
Gill, J. L. 1981. Evolution of statistical design and analysis of experiments. J. Dairy Sci. 64:1494–1519.
Gill, J. L., and W. T. Magee. 1976. Balanced two-period changeover
designs for several treatments. J. Anim. Sci. 42:775–777.
Grant, R. J., K. C. Fanning, D. Kleinshmit, E. P. Stanisiewski, and
G. F. Hartnell. 2003. Influence of glyphosate-tolerant (event
NK603) and corn rootworm protected (event MON863) corn silage and grain on feed consumption and milk production in
Holstein cattle. J. Dairy Sci. 86:1707–1715.
Hauschke, D., M. Kieser, and L. A. Hothorn. 1999. Proof of safety in
toxicology based on the ratio of two means for normally distributed data. Biom. J. 41:295–304.
Hauschke, D., and V. W. Steinijans. 2000. The U.S. draft guidance
regarding population and individual bioequivalence approaches:
Comments by a research-based pharmaceutical company. Stat.
Med. 19:2769–2774.
Hoenig, J. M., and D. M. Heisey. 2001. The abuse of power: The
pervasive fallacy of power calculations for data analysis. Am.
Stat. 55:19–24.
Hoover, D., B. M. Chassy, R. Hall, H. J. Klee, J. B. Luchansky, H.
I. Miller, I. Munro, R. Weiss, S. L. Hefle, and C. O. Qualset.
2000. Human food safety evaluation of rDNA biotechnology-derived foods. Food Tech. 54:15–23.
Ipharraguerre, I. R., R. S. Younker, J. H. Clark, E. P. Stanisiewski,
and G. F. Hartnell. 2003. Performance of lactating dairy cows fed
corn as whole plant silage and grain produced from a glyphosate-tolerant hybrid (event NK603). J. Dairy Sci. 86:1734–1741.
Jensen, J. 2001. Genetic evaluation of dairy cattle using test-day
models. J. Dairy Sci. 84:2803–2812.
Johnson, D. E. 1998. Applied Multivariate Methods for Data Analysts.
Duxbury, Pacific Grove, CA.
Kuehl, R. O. 2000. Statistical Principles of Research Design and
Analysis. 2nd ed. Duxbury Press, Pacific Grove, CA.
LeRoux, Y., C. Guimart, and M. Tenenhaus. 1998. Use of the repeated
cross-over design in assessing bioequivalence: (within and between subjects varability—Schuirmann confidence intervals estimation). Eur. J. Drug Metab. Pharmacokinet. 23:339–345.
Littell, R. C., P. R. Henry, and C. B. Ammerman. 1998. Statistical
analysis of repeated measures data using SAS procedures. J.
Anim. Sci. 76:1216–1231.
Littell, R. C., G. A. Milliken, W. W. Stroup, and R. D. Wolfinger. 1996.
SAS System for Mixed Models. SAS Institute, Inc., Cary, NC.
Longuski, R. A., M. S. Allen, and R. J. Tempelman. 2000. Effect of
brown midrib-3 mutation in corn silage on lactational performance of dairy cows. J. Dairy Sci. 83(Suppl. 2):293.
Lucas, H. L. 1956. Switch-back trials for more than two treatments.
J. Dairy Sci. 39:146–154.
Lucas, H. L. 1957. Extra-period latin square change-over designs. J.
Dairy Sci. 40:225–239.
Miles, M. V., P. Horn, L. Miles, P. Tang, P. Steele, and T. DeGrauw.
2002. Bioequivalence of coenzyme q10 from over-the-counter supplements. Nutr. Res. 22:919–929.
Morris, T. R. 1999. Experimental Design and Analysis in Animal
Sciences. CABI, Wallingford, Oxon, U.K.
Munk, A. 2000. Connections between average and individual bioequivalence. Stat. Med. 19:2843–2854.
Novak, W. K., and A. G. Haslberger. 2000. Substantial equivalence
of antinutrients and inherent plant toxins in genetically modified
foods. Food Chem. Toxicol. 38:473–483.
Phillips, K. F. 1990. Power of the two one-sided tests procedure in
bioequivalence. J. Pharmacokinet. Biopharm. 18:137–144.
Schuirmann, D. J. 1987. A comparison of the two one-sided tests
procedure and the power approach for assessing the equivalence
of average bioavailability. J. Pharmacokinet. Biopharm.
15:657–680.
St-Pierre, N. R. 2001. Integrating quantitative findings from multiple
studies using mixed model methodology. J. Dairy Sci. 84:741–
755.
St-Pierre, N. R., and L. R. Jones. 1999. Interpretation and design of
nonregulatory on-farm feeding trials. J. Dairy Sci. 82:177–182.
Stroup, W. W. 1999. Mixed model procedures to assess power, precision, and sample size in the design of experiments. Pages 15–
24 in Proc. Biopharmaceutical Section. Am. Stat. Assoc.
Stroup, W. W. 2002. Power analysis based on spatial effects mixed
models: A tool for comparing design and analysis strategies in
the presence of spatial variability. J. Agric. Biol. Environ. Stat.
7:491–511.
Williams, E. J. 1949. Experimental designs balanced for the estimation of residual effects of treatments. Aust. J. Scient. Res.
2:149–168.
Wilmink, J. B. M. 1987. Adjustment of test-day milk, fat, and protein
yields for age, season, and stage of lactation. Livest. Prod. Sci.
16:335–348.
APPENDIX
Code from SAS’s PROC MIXED software and interpretative summary for bioequivalence testing (α = 0.05)
in a double-reversal or switchback design.
SAS code
    proc mixed;
    class cow period treatment;                   /* classification factors */
    model y = treatment period treatment*period;  /* fixed effects */
    random cow cow*treatment;                     /* random effects */
    lsmeans treatment /diff alpha = .10 cl;       /* 90% CI on mean differences */
    run;
Interpretative summary
The class statement defines the classification factors,
namely cow, period, and treatment. Fixed effects
treatment and period, including their interaction
treatment*period, are included in the model statement, whereas random effects cow and its interaction
with treatment, cow*treatment, are specified in the
random statement. The lsmeans statement specifies
estimated treatment means and their mean differences
based on the diff option. The options alpha = .10 and
cl specify 90% CI on the treatment mean difference as
appropriate for a bioequivalence test with a Type I error
rate of 0.05. Conclusions for bioequivalence are then
based on whether or not the reported lower and upper
confidence limits on the mean differences of interest
fall within prespecified equivalence limits. If significant
treatment × period interaction exists, then substitute
    lsmeans treatment*period /diff alpha = .10 cl;
for the fifth line in the above code to infer upon period-specific treatment differences or alternatively treatment-specific period differences.
The same program can be used for a replicated Latin
square design except that cow*treatment is not inferable and should be deleted; that is, alter the fourth line
in the code above to
    random cow;
The code above further presumes that residuals are
identically, independently, and normally distributed.
However, it may be possible, for example, that temporal
correlations in the residuals exist across periods within
cows. If periods are equally spaced apart, then a candidate specification might be a first-order autoregressive
(AR(1)) correlation structure, in which case the following statement should be added to the code above:
    repeated period /subject = cow type = ar(1);
where the ar(1) specification specifies the residual correlations between periods to be a function of the time
interval between them. See Littell et al. (1996; 1998)
for further details on this and many other potentially
better fitting specifications.
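To make the AR(1) structure concrete: with equally spaced periods, the residual correlation between periods i and j within a cow is ρ raised to the power |i − j|. The Python sketch below builds that matrix for four periods; the value ρ = 0.5 is hypothetical, and in practice ρ is estimated by PROC MIXED rather than supplied.

```python
def ar1_correlation(n_periods, rho):
    """Correlation matrix with entry rho**|i - j| for periods i and j."""
    return [[rho ** abs(i - j) for j in range(n_periods)]
            for i in range(n_periods)]


corr = ar1_correlation(n_periods=4, rho=0.5)  # rho is a hypothetical value
for row in corr:
    print(row)
```

Adjacent periods are the most strongly correlated, and the correlation decays geometrically as periods become further apart.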