Experimental design and statistical methods for classical and bioequivalence hypothesis testing with an application to dairy nutrition studies¹

R. J. Tempelman²

Department of Animal Science, Michigan State University, East Lansing 48824-1225

ABSTRACT: Genetically modified (GM) corn hybrids have recently been compared against their isogenic reference counterparts in order to establish proof of safety as feedstuffs for dairy cattle. Most such studies have been based on the classical hypothesis test, whereby the null hypothesis is that of equivalence. Because the null hypothesis cannot be accepted, bioequivalence-testing procedures in which the alternative hypothesis is specified to be the equivalence hypothesis are proposed for these trials. Given a Type I error rate of 5%, this procedure is simply based on determining whether the 90% confidence interval on the GM vs. reference hybrid mean difference falls between two limits defining equivalence. Classical and bioequivalence power of test are determined for 4 × 4 Latin squares and double-reversal designs, the latter of which are ideally suited to bioequivalence studies. Although sufficient power likely exists for classical hypothesis testing in recent GM vs. reference hybrid studies, the same may not be true for bioequivalence testing, depending on the equivalence limits chosen. The utility of observed or retrospective power to provide indirect evidence of bioequivalence is also criticized. Design and analysis issues pertaining to Latin square and crossover designs in dairy nutrition studies are further reviewed. It is recommended that future studies place greater emphasis on the use of confidence intervals relative to P-values to unify inference in both classical and bioequivalence-testing frameworks.

Key Words: Bioequivalence Testing, Dairy Science, Experimental Design, Genetically Modified Feedstuffs, Sample Size

2004 American Society of Animal Science. All rights reserved.
¹This article was presented at the 2003 ADSA-ASAS-AMPA meeting as part of the Contemporary Issues symposium “Designing Animal Experiments for Power.” Comments on a first draft of this manuscript by M. A. Faust (ABS Global, Inc.) are gratefully acknowledged. Financial support from the Michigan Agric. Exp. Stn. (Project MICL 01822) is also acknowledged.
²Correspondence: 1205 Anthony Hall (phone: 517-355-8445; email: [email protected]).
Received July 3, 2003. Accepted October 8, 2003.
J. Anim. Sci. 2004. 82(E. Suppl.):E162–E172

Introduction

The evaluation of genetically modified (GM) crops as feedstuffs for livestock, including dairy cattle, is of increasing interest given some lingering public reservations about their safety. In recent studies, the working hypothesis has been that of no mean difference between GM and reference counterparts (Folmer et al., 2002; Grant et al., 2003) for various measures such as milk production, growth, and feed intake. Unfortunately, classical hypothesis-testing strategies that specify the null hypothesis to be that of no mean treatment difference are not appropriate for these studies. This is explicitly recognized in the instructions to authors (“Style and Form”) for the Journal of Animal Science, which dictate that statistically nonsignificant effects (i.e., P > 0.05) are not to be presented as evidence of equivalence. Nevertheless, these statistical tests have provided the basis for conclusions on feed equivalence between GM and conventional feedstuffs for livestock (Clark and Ipharraguerre, 2001; Faust, 2001). The primary intent of this paper is to present hypothesis-testing procedures for the case in which the working hypothesis is mean equivalence. Statistical methods for bioequivalence testing have been well developed and extensively applied in pharmaceutical research (FDA, 2001).
Experimental design and statistical analysis issues pertinent to both classical and bioequivalence testing on mean differences are reviewed in this paper, indicating subtle differences in design choices between the two testing scenarios. Power-of-test computations are also illustrated for both hypothesis-testing paradigms based on replicated Latin square and repeated crossover designs. The utility of observed or retrospective power for conclusions on equivalence is also criticized. Additional design and analysis considerations that may help refine classical and bioequivalence power of test in future dairy nutrition studies are discussed. Finally, some reporting recommendations for future studies are provided.

Experimental Design

Feedstuff Nutrient Profile Comparisons

The nutrient composition of experimental feedstuffs is often compared to that of a standard or conventional feedstuff before production and feed intake responses in animals are investigated. These comparisons, as they pertain to GM vs. their reference hybrids, are typically based on the principle of substantial equivalence (Hoover et al., 2000), that is, whether or not the GM feedstuff differs from its reference counterpart in terms of its inherent antinutrient constituents. In animal feedstuff cropping designs, test and control hybrids are typically assigned to plots within the same field or grown in adjacent fields; nevertheless, in many cases, only one or perhaps two plots per hybrid are considered. The instructions to authors for the Journal of Animal Science and the Journal of Dairy Science in 2003 explicitly define the experimental unit as “the smallest unit to which an individual treatment is imposed.” Applied to feedstuff comparisons, the experimental unit definition then includes not only plot but also silo or silage bag if fermentation conditions differ considerably between silos (Gill, 1981).
Furthermore, this definition might also be logically expanded to include year of crop, as feedstuff comparisons involving GM crops have been shown to be influenced by weather conditions, such as insect pressure, drought (Novak and Haslberger, 2000), or ambient temperature at harvest (Grant et al., 2003). As statistically significant and meaningful nutrient compositional differences have not otherwise been generally detected between GM and conventional dairy feedstuffs (Donkin et al., 2003; Folmer et al., 2002; Grant et al., 2003; Ipharraguerre et al., 2003), it is likely that potential confounding factors due to the use of only one or two plots, silos, or cropping years per hybrid within each study have not yet become issues, with the possible exception of Grant et al. (2003). However, enough ambiguity in experimental unit definition may exist for some studies that a meta-analysis approach, which treats each individual study as an experimental unit (St-Pierre, 2001), should be advocated to periodically formalize consensus on substantial equivalence.

Naturally, any experimental design issues pertaining to feedstuff nutrient profile comparisons would also influence animal performance comparisons. This was recognized by Grant et al. (2003), who harvested their GM corn silage at a very high ambient temperature, resulting in a crop with a high DM content, thereby adversely affecting fermentation conditions and subsequent DMI and milk production in cows fed the GM silage. Rigorous designs, such as the incomplete block or lattice designs used in agronomy (Stroup, 2002), warrant greater consideration, although much larger plot sizes, fewer hybrids for comparison, and fewer replicates (plots) per hybrid would likely be required to facilitate animal feeding experiments. However, even with these robust designs, analyses based on spatial error models seem to be important.
The use of conventional ANOVA often fails to adequately account for spatially correlated sources of variability (due to, e.g., fertility, moisture, or slope gradients), thereby leading to serious misrankings of crop varieties for yields (Littell et al., 1996; Stroup, 2002). Whether this might also be true for nutrient profile and livestock performance comparisons of GM vs. reference hybrids is unclear.

Feedstuff Comparisons for Animal Performance

The most statistically efficient experimental designs for dietary treatment comparisons involve blocking on cows to facilitate powerful intra-subject comparisons of treatments. The Latin square design entails such a structure, whereby the two blocking factors are cow and period, to facilitate comparisons on a third factor, treatment. Logically, cow represents a random source of variability, whereas period may or may not, depending on the variability of DIM within a period, as discussed later. Single Latin square designs do not allow the consideration of interaction among the three factors; however, replicated Latin square designs do facilitate investigation of treatment × period interactions if the same periods are considered within each square. This issue is not trivial because, for example, mean milk yield differences between diets may depend on the stage of lactation (Longuski et al., 2000). Nevertheless, a large number of studies, including GM vs. reference hybrid studies, have not considered this potentially important interaction.

Repeated crossover designs are potentially better suited for bioequivalence studies than replicated Latin square designs. Let us suppose that the two treatments to be considered are labeled A and B. A two-period, two-treatment crossover design is synonymous with a replicated two-period Latin square design in which half of the cows are assigned the treatment sequence AB over the two periods, whereas the remaining half are assigned to the sequence BA.
With this nonrepeated crossover design, however, one is not able to separately infer upon potential carryover or residual effects due to the diet fed in the preceding period. Repeated crossover designs, on the other hand, are characterized by each animal receiving a treatment for more than one period. In the four-period, two-treatment design, also called the double-reversal design (Gill, 1978), the two treatment sequences, randomly assigned among cows, might naturally be ABAB and BABA. If the fourth period is not considered within each of the two sequences, then a somewhat less powerful switchback design is created, that is, with sequences ABA and BAB. This design was considered for the assessment of glyphosate-tolerant corn vs. its isogenic reference by Donkin et al. (2003).

The random effects variance component due to cow × hybrid interaction is estimable in both switchback and double-reversal designs, but not in the replicated Latin square design. This interaction term is analogous to the subject × drug treatment interaction modeled in human bioequivalence studies on drug testing, which, if large, indicates highly variable subject-specific differences in drug efficacy. Treatment-specific residual variabilities are also more efficiently estimated in repeated crossover designs, thereby providing a potential indication of whether or not there are individual differences in variability of response between two or more treatments. If two treatments lead to similar mean responses, the more desirable treatment may be the one that leads to less variability in response. For these reasons and others, the double-reversal and switchback designs have been advocated (FDA, 2001) for bioequivalence testing in medical and pharmaceutical research. Variants of these designs have been popular in dairy science since Lucas (1956, 1957).
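The sequence structure of the double-reversal and switchback designs described above can be sketched in a few lines of Python. This is only an illustration and not part of the original study: the function names `carryover_counts` and `randomize_cows`, the cow identifiers, and the seed are hypothetical. The sketch randomizes cows evenly between the two sequences and tallies how often each treatment immediately follows the other, which is the information needed to judge first-order carryover balance.

```python
import random
from collections import Counter


def carryover_counts(sequences):
    """Count ordered (previous, current) treatment pairs over all sequences."""
    counts = Counter()
    for seq in sequences:
        for previous, current in zip(seq, seq[1:]):
            counts[(previous, current)] += 1
    return counts


def randomize_cows(cow_ids, sequences, seed=None):
    """Shuffle cows and deal them out evenly among the treatment sequences."""
    rng = random.Random(seed)
    shuffled = list(cow_ids)
    rng.shuffle(shuffled)
    return {cow: sequences[i % len(sequences)] for i, cow in enumerate(shuffled)}


double_reversal = ["ABAB", "BABA"]  # four periods, two treatments
switchback = ["ABA", "BAB"]         # double-reversal without the fourth period

for name, design in [("double-reversal", double_reversal),
                     ("switchback", switchback)]:
    print(name, dict(carryover_counts(design)))

# Hypothetical allocation of 16 cows (IDs 1..16) to the two sequences.
allocation = randomize_cows(range(1, 17), double_reversal, seed=2004)
print(sorted(allocation.items()))
```

Across one cow per sequence, A follows B as often as B follows A in both designs, and each cow in the double-reversal design contributes two records per treatment.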
On-farm studies may warrant important design considerations, particularly when production or feeding groups or pens of animals dictate that those groups be the experimental units. However, because these environments are not highly controlled, on-farm trials are not likely to be suitable for first-stage regulatory testing (St-Pierre and Jones, 1999) as it may pertain to the evaluation of GM crops.

Statistical Models and Analyses

The Replicated Latin Square Design

The replicated Latin square design has been frequently used in recent GM vs. reference corn hybrid comparisons in dairy cattle studies (Folmer et al., 2002; Grant et al., 2003; Ipharraguerre et al., 2003). Thus, its properties warrant further review. Suppose that there are $n_t$ treatments to be considered such that the number of periods in a Latin square study is also $n_t$. Suppose further that there are $n_s$ such squares and that the total number of cows in the study is $n_c = n_s n_t$. Without loss of generality, we assume that square effects do not exist if squares do not represent any spatial, temporal, or other blocking entity; that is, all squares are conducted simultaneously and proximally. The corresponding linear mixed effects model can be written as

$$y_{ijk} = \mu + \alpha_i + \beta_j + c_k + \alpha\beta_{ij} + e_{ijk} \qquad [1]$$

where $y_{ijk}$ represents the observation on cow $k$ given treatment $i$ at period $j$; $\alpha_i$ represents the fixed effect of the $i$th treatment, $i = 1, 2, \ldots, n_t$; $\beta_j$ represents the fixed effect of the $j$th period, $j = 1, 2, \ldots, n_p$ (with $n_p = n_t$ in a Latin square); and $c_k$ represents the random effect of the $k$th cow, $k = 1, 2, \ldots, n_c$, with variance component $\sigma_c^2$. The period effect is considered fixed if it is nearly synonymous with stage of lactation, provided that variability in DIM is low within each period. Therefore, the interaction $\alpha\beta_{ij}$ between the $i$th treatment and $j$th period is also considered fixed and inferable, provided the same periods are considered within each Latin square.
If squares are used as blocks for starting DIM (Grant et al., 2003), then treatment × period and treatment × square interactions are both inferable, but each term is then individually only weakly associated with the treatment × stage of lactation interaction. The mixed model in Eq. [1] is typically specified as if the residual terms $e_{ijk}$ are normally, independently, and identically distributed with variance $\sigma_e^2$. However, this Latin square design is just one particular example of a repeated measures design, whereby cows are observed over time, albeit with different treatment assignments at each period. Therefore, a richer class of covariance structures could be considered for residuals across periods within cows, such as a first-order autoregressive covariance structure in which residuals in adjacent periods are specified to be more highly correlated than residuals from distal periods. Details on using SAS PROC MIXED (SAS Institute, Inc., Cary, NC) for such specifications are provided in Littell et al. (1998) and in the Appendix of this paper. Also, carryover effects may be additionally modeled in Eq. [1] provided that the replicated Latin squares are orthogonal to each other and/or are Williams designs (Williams, 1949). Carryover effects can be estimated separately from treatment × period interaction effects if squares are balanced for carryover effects, but not necessarily otherwise. Coding strategies for carryover effects for use in statistical analyses can be found in Kuehl (2000).

The Repeated Crossover Design

The statistical model for the repeated crossover design does not differ much from Eq. [1] for the replicated Latin square design, except that the number of periods, say, $n_p$, exceeds the number of treatments $n_t$. For $n_t = 2$ treatments, $n_p = 3$ defines the switchback design, whereas $n_p = 4$ defines the double-reversal design.
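To make the notion of a square that is balanced for first-order carryover effects concrete, the following Python sketch builds a Williams square using one standard cyclic construction for an even number of treatments and checks both the Latin property and the carryover balance. The function names are hypothetical, treatments are coded as integers starting at 0, and rows are taken to be cows (treatment sequences) and columns to be periods; none of this is prescribed by the original text.

```python
from collections import Counter


def williams_square(n_trt):
    """Cyclic Williams square for an even number of treatments (0..n_trt-1).

    Rows are treatment sequences (one per cow); columns are periods.
    """
    if n_trt % 2:
        raise ValueError("this sketch covers an even number of treatments only")
    first_row = [0]
    low, high = 1, n_trt - 1
    while len(first_row) < n_trt:          # builds 0, 1, n-1, 2, n-2, ...
        first_row.append(low)
        low += 1
        if len(first_row) < n_trt:
            first_row.append(high)
            high -= 1
    return [[(t + shift) % n_trt for t in first_row] for shift in range(n_trt)]


def is_latin(square):
    """Every treatment appears once in each row and once in each column."""
    n = len(square)
    full = set(range(n))
    rows_ok = all(set(row) == full for row in square)
    cols_ok = all({row[j] for row in square} == full for j in range(n))
    return rows_ok and cols_ok


def carryover_balanced(square):
    """Every treatment is immediately preceded by every other exactly once."""
    n = len(square)
    pairs = Counter((a, b) for row in square for a, b in zip(row, row[1:]))
    return len(pairs) == n * (n - 1) and set(pairs.values()) == {1}


square = williams_square(4)
for row in square:
    print(" ".join("ABCD"[t] for t in row))
print("Latin:", is_latin(square), "| carryover balanced:", carryover_balanced(square))
```

Replicating such a square across groups of four cows keeps the balance, since every replicate contributes each ordered pair of treatments once.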
Additionally, random treatment × cow interaction effects should be considered as well:

$$y_{ijk} = \mu + \alpha_i + \beta_j + c_k + \alpha\beta_{ij} + \alpha c_{ik} + e_{ijk} \qquad [2]$$

Here, the terms in Eq. [2] are defined as in Eq. [1], with the additional term being $\alpha c_{ik}$, the random treatment × cow interaction effect corresponding to the $i$th treatment and $k$th cow, with variance component $\sigma_{\alpha c}^2$. As with the Latin square design, alternative repeated measures covariance structures can be modeled across periods within cows (see Appendix herein). The fact that the treatment × cow effect serves as the ANOVA experimental error term for testing the significance of the treatment effect in repeated crossover designs has often been ignored. Consequently, the evidence against the classical null hypothesis conveyed by P-values reported in studies that have not recognized this inherent design feature may be overstated. As previously indicated, statistically significant and large values of $\sigma_{\alpha c}^2$ would indicate that differences in responses between GM and reference feedstuffs are not consistent across cows and may be due to potential individual-specific (i.e., genotype-specific) allergenicity, antinutrient, or feed preference effects.

Mixed Model Analyses

The ANOVA has effectively provided the statistical inference engine for the analyses of efficient designs, such as replicated Latin squares or crossover designs. However, there are some important distinctions to be made between the two most common ANOVA inferential strategies currently used in dairy nutrition studies, namely the ordinary least squares (OLS) approach using, say, PROC GLM of SAS, and the more recently developed true mixed model (MM) analysis using, say, PROC MIXED. For the two broad types of mixed models, [1] and [2], considered in this paper, the use of OLS may dramatically understate the estimated standard error of an estimated treatment mean (SEM) because OLS ignores the between-cow variability in such a determination.
Admittedly, however, the estimated standard errors of the estimated mean differences (SED) between treatments may not differ between OLS and MM analyses, provided that the design remains balanced, that is, no data are missing. Nevertheless, there are two potentially important exceptions. First, OLS treats all effects, whether labeled as fixed or random, as fixed effects, potentially leading to substantial confounding between various effects. For example, in the double-reversal design in Eq. [2], cow effects are partially confounded with treatment × period effects if cow is treated as fixed; furthermore, cow × treatment effects are partially confounded with period effects if cow × treatment is treated as fixed. This confounding may represent the historical basis as to why treatment × period and cow × treatment have not been considered as sources of variation in earlier work using OLS. Similar estimability problems with OLS inference using PROC GLM have also been pointed out by Littell et al. (1996) and, in the context of bioequivalence testing, by LeRoux et al. (1998). These confounding issues do not arise with a true MM analysis because cow and cow × treatment effects are appropriately treated as random.

A second advantage of MM over OLS analysis pertains to the recovery of inter-block information or, in the context of this paper, inter-cow information. Ordinary least squares facilitates only intra-block comparisons of treatments, whereas MM analysis utilizes both intra-block and inter-block information (Littell et al., 1996). Although OLS and MM inference will often differ little, if at all, for treatment mean comparisons in balanced complete block designs, this may not necessarily be true for incomplete block designs, such as Latin squares, particularly when some data are missing.

Hypothesis Testing and Power of Test

Classical Hypothesis Testing

Classical two-tailed hypothesis tests are typically specified in terms of the null vs. alternative hypotheses, respectively:

$$H_0: \Delta = 0 \quad \text{vs.} \quad H_1: \Delta \neq 0 \qquad [3]$$

Here $\Delta = \mu_T - \mu_R$ represents the difference of inferential interest between the GM or test crop mean $\mu_T$ and the mean $\mu_R$ of its isogenic or reference counterpart. Classical power is defined as the probability of concluding a mean difference when such a difference truly exists. Stroup (1999) demonstrated how classical power computations for $\Delta$ could be readily derived for a linear mixed model of any complexity (e.g., Eq. [1] or [2], above) using SAS PROC MIXED software. Additional details are provided, albeit in an agronomic context, by Stroup (2002). Although power is readily specified as a function of the Type I error rate $\alpha$ (set throughout this article at 0.05), the sample size or number of cows $n_c$, and $\Delta$, the most difficult inputs to elicit for power determinations are the variance components. In both Latin square and repeated crossover designs, it is $\sigma_e^2$ and not $\sigma_c^2$ that vitally determines power of test, ignoring, for the moment, the existence of nonzero $\sigma_{\alpha c}^2$. Based on various statistical consulting experiences pertaining to Latin square studies involving milk production at Michigan State University, estimates of $\sigma_e^2$ have ranged from 2 to 10 (kg/d)$^2$. The lower limit is in close agreement with much earlier sample size and design work (Gill, 1969; Gill and Magee, 1976), whereas the upper limit might be more indicative of currently higher production environments, as evidence exists that variability increases with mean milk production. This relationship is manifested in Jensen (2001), who illustrates in a figure that $\sigma_e^2$ ranges from 2 (kg/d)$^2$ in late lactation to 14 (kg/d)$^2$ in early lactation in Canadian Jerseys, with larger residual variances being further associated with higher-producing multiparous animals.
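The statement that $\sigma_e^2$ rather than $\sigma_c^2$ drives power can be made explicit with a short worked derivation. The sketch below is not taken from the original text; it assumes a balanced design with no missing records, independent residuals, treatments equally represented in every period, and, for the double-reversal design, two records per cow per treatment. Here $e_{Tk}$ and $e_{Rk}$ are shorthand (introduced only for this illustration) for the residuals of cow $k$ on the test and reference treatments.

```latex
% Illustrative derivation under the stated balance and independence assumptions.
\begin{align*}
\hat\Delta
  &= \bar{y}_{T\cdot\cdot} - \bar{y}_{R\cdot\cdot}
   = \Delta + \frac{1}{n_c}\sum_{k=1}^{n_c}\left(e_{Tk} - e_{Rk}\right)
   && \text{(cow effects } c_k \text{ cancel within cow)} \\
\operatorname{Var}(\hat\Delta)
  &= \frac{2\sigma_e^2}{n_c}
   && \text{(replicated Latin square, Eq. [1])} \\
\operatorname{Var}(\hat\Delta)
  &= \frac{\sigma_e^2 + 2\sigma_{\alpha c}^2}{n_c}
   && \text{(double-reversal design, Eq. [2])}
\end{align*}
```

Because each cow receives both treatments, $\sigma_c^2$ does not appear in either variance, and with $\sigma_{\alpha c}^2 = 0$ the standard error of the difference is $\sqrt{2\sigma_e^2/n_c}$ for the Latin square and $\sqrt{\sigma_e^2/n_c}$ for the double-reversal design.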
As an example, we consider classical power for replicated 4 × 4 Latin squares (i.e., each square with four cows) in the spirit of recent GM dairy feedstuff studies (Folmer et al., 2002; Grant et al., 2003; Ipharraguerre et al., 2003), concentrating on power for $\Delta$ involving only two of the four hypothetical treatments. We define sufficient power as being 0.80, or 80%. Figure 1 relates power to the number of 4 × 4 Latin squares where $\sigma_e^2$ is either 2 (kg/d)$^2$ or 10 (kg/d)$^2$, with $\Delta$ ranging from 1 to 4 kg/d. For $\sigma_e^2$ = 2 (kg/d)$^2$, four 4 × 4 squares (i.e., $n_c$ = 16 cows) would be adequate to detect $\Delta \geq 2$ kg/d with sufficient power, whereas eight squares ($n_c$ = 32 cows) would be required to detect $\Delta$ = 1 kg/d. When $\sigma_e^2$ = 10 (kg/d)$^2$, eight squares would be insufficient to detect $\Delta$ = 2 kg/d, whereas four squares would be insufficient to detect $\Delta$ = 3 kg/d.

Figure 1. Classical power of test vs. number of 4 × 4 Latin squares for a mean difference ($\Delta$) equal to 1 kg/d (○), 2 kg/d (●), 3 kg/d (□), or 4 kg/d (■) based on within-cow variance ($\sigma_e^2$) of either a) 2 (kg/d)$^2$ or b) 10 (kg/d)$^2$.

Figure 2. Classical power of test vs. number of cows in a double-reversal design for a mean difference ($\Delta$) equal to 1 kg/d (○), 2 kg/d (●), 3 kg/d (□), or 4 kg/d (■) based on within-cow variance ($\sigma_e^2$) of either a) 2 (kg/d)$^2$ or b) 10 (kg/d)$^2$.

Classical power is also presented for the double-reversal design in Figure 2. Here, power is plotted against the number of cows, equally randomized between the two sequences ABAB and BABA. For purposes of comparison against the replicated Latin square design, we again consider the case in which the intra-cow variance $\sigma_e^2$ is either 2 or 10 (kg/d)$^2$ with $\sigma_{\alpha c}^2 = 0$, while recognizing that the error degrees of freedom for testing the main effect of treatment are defined by the treatment × cow interaction term.
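The Latin square power statements above can be approximated with a short Python sketch based on the noncentral t-distribution. It is not the SAS PROC MIXED procedure of Stroup (1999) that the text refers to. It rests on assumptions that are not stated in the original: a standard error of the difference of $\sqrt{2\sigma_e^2/n_c}$, residual degrees of freedom of $(n_t - 1)(n_c - n_t - 1)$ (Eq. [1] including the treatment × period term), and a simple numerical integration in place of a library routine. All function names are hypothetical.

```python
import math
from functools import lru_cache
from statistics import NormalDist

_PHI = NormalDist().cdf  # standard normal CDF


def _scale_pdf(s, df):
    """Density of s = sqrt(chi2_df / df), i.e. estimated SD over true SD."""
    if s <= 0.0:
        return 0.0
    log_pdf = (math.log(2.0) + 0.5 * df * math.log(0.5 * df)
               - math.lgamma(0.5 * df)
               + (df - 1.0) * math.log(s) - 0.5 * df * s * s)
    return math.exp(log_pdf)


def _integrate(func, lower, upper, n=600):
    """Composite Simpson rule with an even number n of intervals."""
    h = (upper - lower) / n
    total = func(lower) + func(upper)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * func(lower + i * h)
    return total * h / 3.0


def nct_cdf(t, df, ncp=0.0):
    """P(T <= t) for noncentral t, written as T = (Z + ncp) / s."""
    upper = 1.0 + 8.0 / math.sqrt(2.0 * df)
    return _integrate(lambda s: _PHI(t * s - ncp) * _scale_pdf(s, df), 0.0, upper)


@lru_cache(maxsize=None)
def t_quantile(p, df):
    """Central t quantile found by bisection on the CDF."""
    lo, hi = -50.0, 50.0
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if nct_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def latin_square_power(n_squares, sigma2_e, delta, n_trt=4, alpha=0.05):
    """Two-sided classical power for a two-treatment contrast (assumed SED, df)."""
    n_cows = n_squares * n_trt
    sed = math.sqrt(2.0 * sigma2_e / n_cows)   # assumed true SED
    df = (n_trt - 1) * (n_cows - n_trt - 1)    # assumed residual df
    t_crit = t_quantile(1.0 - alpha / 2.0, df)
    ncp = delta / sed
    return 1.0 - nct_cdf(t_crit, df, ncp) + nct_cdf(-t_crit, df, ncp)


for sigma2_e, n_squares, delta in [(2, 4, 2), (2, 8, 1), (10, 8, 2), (10, 4, 3)]:
    power = latin_square_power(n_squares, sigma2_e, delta)
    print(f"sigma2_e={sigma2_e:>2}, squares={n_squares}, delta={delta}: power={power:.2f}")
```

Under these assumptions the four scenarios should fall on the same side of the 0.80 threshold as described in the text, with the eight-square, $\Delta$ = 1 kg/d case lying close to the threshold itself.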
Here, $n_c$ = 16 cows would be nearly sufficient to detect $\Delta$ = 1 kg/d with $\sigma_e^2$ = 2 (kg/d)$^2$, whereas the same number of cows would be insufficient to detect $\Delta$ = 2 kg/d when $\sigma_e^2$ = 10 (kg/d)$^2$. Note that, in general, the power of the double-reversal design is slightly greater than that of the replicated 4 × 4 Latin square design for the same number of cows and periods. This gain comes at a cost: each cow contributes twice as many records per treatment in the double-reversal design, so that only half as many treatments can be studied for the same total number of records. Nevertheless, as previously indicated, this design feature confers additional advantages for inference on treatment × cow interaction, which may be pertinent for GM feedstuff studies.

Bioequivalence Hypothesis Testing

As previously stated, classical hypothesis testing does not provide an appropriate framework for establishing equivalence. Schuirmann (1987) recognized this and demonstrated the correctness of the two one-sided testing (TOST) procedure, which has since been used for testing average bioequivalence in pharmaceutical testing (FDA, 2001). In order to use this procedure, it is necessary to set appropriate lower and/or upper equivalence limits on $\Delta = \mu_T - \mu_R$, say, $\theta_L$ and $\theta_U$, respectively. It may be sufficient to specify both limits with one parameter $\theta = \theta_U = -\theta_L$ such that the equivalence interval is symmetric about zero. Naturally, these limits should not be set by the statistician but by investigators and/or regulators (Schuirmann, 1987). The null and alternative hypotheses in the TOST procedure are written as

$$H_0: \Delta \leq \theta_L \ \text{or} \ \Delta \geq \theta_U \quad \text{vs.} \quad H_1: \theta_L < \Delta < \theta_U \qquad [4]$$

Schuirmann (1987) demonstrates that rejection of the null hypothesis in Expression [4] with a Type I error rate of $\alpha$ is based on observing the $100(1 - 2\alpha)$% confidence interval for $\Delta$ to fall within the bioequivalence limits $[\theta_L, \theta_U]$.
If, for example, a Type I error rate of 5% is adopted for the bioequivalence hypothesis test, then one would construct a 90% confidence interval (CI) on $\Delta$. If this CI falls within $[\theta_L, \theta_U]$, then one can reject $H_0$: $\Delta \leq \theta_L$ or $\Delta \geq \theta_U$ in favor of $H_1$: $\theta_L < \Delta < \theta_U$. As with classical hypothesis testing, failing to reject $H_0$ does not imply that $\Delta \leq \theta_L$ or $\Delta \geq \theta_U$ because a Type II error, that is, failing to correctly conclude equivalence, may have been committed.

As an example, consider the data from the double-reversal design involving 24 cows, 12 within each sequence, from Problem 8.9 in Gill (1978). Using a mixed effects analysis based on Eq. [2], the 90% CI on the mean difference between the two treatments is [−0.20 kg/d, 0.87 kg/d]. Since this CI falls within [−1 kg/d, +1 kg/d], average bioequivalence is established within the limits of ±1 kg/d between the two treatments with P < 0.05. Representative SAS PROC MIXED code that can be used to provide bioequivalence inference is provided in the Appendix herein.

Bioequivalence power, or one minus the probability of making a Type II error for the hypothesis test in [4], can be readily determined. Extending the presentation of Phillips (1990), the two t-statistics for the bioequivalence test are

$$T_L = \frac{\hat\Delta - \theta_L}{\mathrm{SED}(\hat\Delta)} \quad \text{and} \quad T_U = \frac{\hat\Delta - \theta_U}{\mathrm{SED}(\hat\Delta)}$$

where $\mathrm{SED}(\hat\Delta)$ represents the estimated standard error of $\hat\Delta = \hat\mu_T - \hat\mu_R$, with $\hat\mu_T$ and $\hat\mu_R$ being the least squares means (LSM) that estimate $\mu_T$ and $\mu_R$, respectively. The required point estimates and standard errors are readily determined using appropriate mixed models software (e.g., SAS PROC MIXED). In concordance with the TOST confidence interval, bioequivalence is concluded if $T_L \geq t_{1-\alpha,\nu}$ and $T_U \leq t_{\alpha,\nu}$, where $t_{1-\alpha,\nu}$ and $t_{\alpha,\nu}$ are the $100(1 - \alpha)$th and $100\alpha$th percentiles of a t-distribution with $\nu$ degrees of freedom as determined by the design.
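The TOST decision rule just described can be sketched in Python as follows. The sketch is illustrative only and is not the SAS PROC MIXED code of the Appendix: the function names are hypothetical, the t quantile is obtained by a simple numerical integration rather than a library routine, and the point estimate, SED, and degrees of freedom in the first example are made-up values. The last lines check the 90% CI reported above for the Gill (1978) data directly against limits of ±1 kg/d.

```python
import math
from functools import lru_cache
from statistics import NormalDist

_PHI = NormalDist().cdf  # standard normal CDF


def _scale_pdf(s, df):
    """Density of s = sqrt(chi2_df / df)."""
    if s <= 0.0:
        return 0.0
    log_pdf = (math.log(2.0) + 0.5 * df * math.log(0.5 * df)
               - math.lgamma(0.5 * df)
               + (df - 1.0) * math.log(s) - 0.5 * df * s * s)
    return math.exp(log_pdf)


def t_cdf(t, df, n=600):
    """Central t CDF, P(Z / s <= t), by Simpson integration over s."""
    upper = 1.0 + 8.0 / math.sqrt(2.0 * df)
    h = upper / n
    total = 0.0
    for i in range(n + 1):
        weight = 1.0 if i in (0, n) else (4.0 if i % 2 else 2.0)
        s = i * h
        total += weight * _PHI(t * s) * _scale_pdf(s, df)
    return total * h / 3.0


@lru_cache(maxsize=None)
def t_quantile(p, df):
    """Central t quantile found by bisection on the CDF."""
    lo, hi = -50.0, 50.0
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def tost(estimate, sed, df, theta_l, theta_u, alpha=0.05):
    """Return the 100(1 - 2*alpha)% CI and the TOST equivalence decision."""
    t_c = t_quantile(1.0 - alpha, df)
    ci = (estimate - t_c * sed, estimate + t_c * sed)
    t_l = (estimate - theta_l) / sed
    t_u = (estimate - theta_u) / sed
    return ci, (t_l >= t_c and t_u <= -t_c)


def ci_within_limits(ci, theta_l, theta_u):
    """Equivalence decision taken directly from a reported CI."""
    return theta_l <= ci[0] and ci[1] <= theta_u


# Hypothetical inputs: estimate 0.3 kg/d, SED 0.3 kg/d, 22 degrees of freedom.
print(tost(0.3, 0.3, 22, -1.0, 1.0))
print(tost(0.3, 0.3, 22, -0.5, 0.5))
# 90% CI reported in the text for the Gill (1978) data, limits of +/- 1 kg/d.
print(ci_within_limits((-0.20, 0.87), -1.0, 1.0))
```

The two t-statistics and the confidence interval lead to the same decision, so either form can be reported; the interval has the advantage of also showing how close the estimate comes to the limits.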
The probability of this occurrence is known as the bioequivalence power (Phillips, 1990), denoted by

$$\mathrm{Prob}\left(T_L \geq t_{1-\alpha,\nu} \ \text{and} \ T_U \leq t_{\alpha,\nu} \mid \mathrm{SED}(\hat\Delta), \nu, \delta_L, \delta_U\right) \qquad [5]$$

Here $T_L$ and $T_U$ are joint random draws from a bivariate noncentral t-distribution with $\nu$ degrees of freedom and a correlation of 1. Note that the bioequivalence power in Expression [5] depends further upon the noncentrality parameters

$$\delta_L = \frac{\Delta - \theta_L}{\mathrm{SED}(\hat\Delta)} \quad \text{and} \quad \delta_U = \frac{\Delta - \theta_U}{\mathrm{SED}(\hat\Delta)}$$

with $\mathrm{SED}(\hat\Delta)$ here representing the true standard error of $\hat\Delta$.

As an illustration involving milk production based on the two designs discussed in this paper, we consider just one value of $\Delta$, specifically $\Delta$ = 0 kg/d. This specification ideally defines absolute mean bioequivalence. However, it is much more likely that bioequivalence will be established when $\Delta$ is not truly 0 but small enough to be considered as being bioequivalent, say, for example, −0.5 kg/d < $\Delta$ < 0.5 kg/d. Bioequivalence power for any combination of $\theta_L$, $\theta_U$, and $\Delta$ can be readily computed provided that $\theta_L < \Delta < \theta_U$. Figure 3 provides bioequivalence power determinations vs. number of 4 × 4 Latin squares for $\theta = \theta_U = -\theta_L$ = 0.5, 1.0, 1.5, and 2.0 kg/d when a) $\sigma_e^2$ = 2 (kg/d)$^2$ and for $\theta$ = 0.5, 1.0, 1.5, . . . , 4.0 kg/d when b) $\sigma_e^2$ = 10 (kg/d)$^2$. From Figure 3a, it can be seen that there is sufficient power to conclude treatment bioequivalence within equivalence limits of ±1.5 kg/d with four 4 × 4 Latin squares when $\sigma_e^2$ = 2 (kg/d)$^2$. However, nine squares would be required to establish sufficient power to conclude average bioequivalence within equivalence limits of ±1.0 kg/d. If $\sigma_e^2$ = 10 (kg/d)$^2$ (Figure 3b), four squares do not provide sufficient power to establish equivalence within ±3.0 kg/d, and seven squares are needed to establish equivalence within ±2.5 kg/d. Figure 4 is similar to Figure 3 except that bioequivalence power is provided for the double-reversal design.
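Expression [5] can be evaluated numerically by exploiting the fact that $T_L$ and $T_U$ share the same estimated standard error. The Python sketch below is an illustration under assumptions not given in the original: a true SED of $\sqrt{2\sigma_e^2/n_c}$ and residual degrees of freedom of $(n_t - 1)(n_c - n_t - 1)$ for replicated Latin squares, and a one-dimensional numerical integration over the ratio of estimated to true standard deviation. Function names are hypothetical.

```python
import math
from functools import lru_cache
from statistics import NormalDist

_PHI = NormalDist().cdf  # standard normal CDF


def _scale_pdf(s, df):
    """Density of s = sqrt(chi2_df / df), i.e. estimated SD over true SD."""
    if s <= 0.0:
        return 0.0
    log_pdf = (math.log(2.0) + 0.5 * df * math.log(0.5 * df)
               - math.lgamma(0.5 * df)
               + (df - 1.0) * math.log(s) - 0.5 * df * s * s)
    return math.exp(log_pdf)


def _integrate_over_scale(func, df, n=600):
    """Simpson integration of func(s) * density(s) over the plausible range of s."""
    upper = 1.0 + 8.0 / math.sqrt(2.0 * df)
    h = upper / n
    total = 0.0
    for i in range(n + 1):
        weight = 1.0 if i in (0, n) else (4.0 if i % 2 else 2.0)
        s = i * h
        total += weight * func(s) * _scale_pdf(s, df)
    return total * h / 3.0


@lru_cache(maxsize=None)
def t_quantile(p, df):
    """Central t quantile by bisection on P(Z / s <= t)."""
    lo, hi = -50.0, 50.0
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if _integrate_over_scale(lambda s: _PHI(mid * s), df) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def tost_power(delta, theta_l, theta_u, sed, df, alpha=0.05):
    """P(T_L >= t_{1-alpha} and T_U <= -t_{1-alpha}) given the true SED."""
    t_c = t_quantile(1.0 - alpha, df)
    d_l = (delta - theta_l) / sed   # noncentrality parameter delta_L
    d_u = (delta - theta_u) / sed   # noncentrality parameter delta_U

    def accept_prob(s):
        # Both tests reject when t_c*s - d_l <= Z <= -t_c*s - d_u.
        return max(_PHI(-t_c * s - d_u) - _PHI(t_c * s - d_l), 0.0)

    return _integrate_over_scale(accept_prob, df)


def latin_square_tost_power(n_squares, sigma2_e, theta, delta=0.0, n_trt=4):
    """Bioequivalence power for replicated Latin squares (assumed SED and df)."""
    n_cows = n_squares * n_trt
    sed = math.sqrt(2.0 * sigma2_e / n_cows)
    df = (n_trt - 1) * (n_cows - n_trt - 1)
    return tost_power(delta, -theta, theta, sed, df)


for sigma2_e, n_squares, theta in [(2, 4, 1.5), (2, 8, 1.0), (2, 9, 1.0), (10, 4, 3.0)]:
    power = latin_square_tost_power(n_squares, sigma2_e, theta)
    print(f"sigma2_e={sigma2_e:>2}, squares={n_squares}, theta={theta}: power={power:.2f}")
```

Under these assumptions the four-square case with limits of ±1.5 kg/d should land close to the 0.80 threshold, nine squares should give more power than eight for limits of ±1.0 kg/d, and four squares with $\sigma_e^2$ = 10 (kg/d)$^2$ should fall short for limits of ±3.0 kg/d, in line with the reading of Figure 3 given in the text.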
As with classical power, the double-reversal design has greater bioequivalence power as it pertains to the comparison of two treatments, albeit, again, at the expense of twice as many records per cow per treatment. When $\sigma_e^2$ = 2 (kg/d)$^2$, 16 cows are nearly sufficient to establish equivalence within ±1.0 kg/d with 80% power, whereas 16 cows would be minimally required to establish equivalence within ±2.5 kg/d when $\sigma_e^2$ = 10 (kg/d)$^2$.

Figure 3. Bioequivalence power of test vs. number of 4 × 4 Latin squares for true mean difference $\Delta$ = 0 kg/d with limits of equivalence ($\theta$) being ±0.5 kg/d (○), ±1.0 kg/d (●), ±1.5 kg/d (□), ±2.0 kg/d (■), ±2.5 kg/d (◇), ±3.0 kg/d (◆), ±3.5 kg/d (△), or ±4.0 kg/d (▲) based on a within-cow variance ($\sigma_e^2$) of either a) 2 (kg/d)$^2$ or b) 10 (kg/d)$^2$.

What is Wrong with the Observed Power Approach to Bioequivalence Testing?

Most investigators recognize that P-values exceeding a predetermined Type I error rate are not indicative of proof of the classical null hypothesis $H_0$: $\Delta = 0$. Observed power, also called retrospective or post-hoc power, has been provided by some statistical software as a supplementary analysis to establish bioequivalence. Observed power is the power of being able to detect a mean difference as large as that observed from a data analysis. A related computation is based on the detectable effect size; that is, based on the data analysis at hand, what effect size (i.e., mean difference) would be detectable for a given power of, say, 0.80? If this effect size is small, and the corresponding P-value for the classical hypothesis test exceeds $\alpha$, then plausible evidence, presumably, exists in favor of the classical null hypothesis of equivalence. As Hoenig and Heisey (2001) point out, this use of observed power has been tenaciously advocated by a large number of peer-reviewed scientific journals and statistical textbooks! Unfortunately, drawing conclusions on equivalence based on observed power is logically inconsistent.
As noted by Hoenig and Heisey (2001), nonsignificant P-values invariably correspond to low observed powers because observed power is a 1:1 function of the P-value. Reconsider the four 4 × 4 Latin squares design. Suppose that a 2 kg/d mean difference is considered to be within the limit of equivalence, analogous to the choice of θ = 2 kg/d for the bioequivalence-testing approach. Using the observed power strategy, one might then “reject” the classical alternative hypothesis H1: Δ < 0 or Δ > 0 if the estimated mean difference falls within the classical acceptance region for the test, and the power for detecting an effect of magnitude θ = 2 kg/d is, say, 80% or greater.

Figure 4. Bioequivalence power of test vs. number of cows in double-reversal design for true mean difference Δ = 0 kg/d with limits of equivalence (θ) being ±0.5 kg/d (○), ±1.0 kg/d (●), ±1.5 kg/d (□), ±2.0 kg/d (■), ±2.5 kg/d (◇), ±3.0 kg/d (◆), ±3.5 kg/d (△), or ±4.0 kg/d (▲) based on a within-cow variance (σe²) of either a) 2 (kg/d)² or b) 10 (kg/d)².

Figure 5 is an adaptation of Figures 1 and 2 from Schuirmann (1987) but for the four 4 × 4 Latin squares design. The illustrated filled triangles represent the “rejection” regions that would lead to a conclusion of bioequivalence, as a function of the estimated residual standard deviation se = √σ̂e² and the estimated mean difference Δ̂, using either the observed power approach or the bioequivalence-testing approach of Schuirmann (1987). That is, for any values of se and Δ̂ that jointly fall within the illustrated triangle, one would conclude that −2 kg/d < Δ < 2 kg/d based on either of the respective approaches.

Figure 5. Rejection regions for concluding treatment bioequivalence (filled triangles) for four 4 × 4 Latin squares based on a) observed power approach or b) bioequivalence-testing approach. Regions are a function of the estimated residual standard deviation (√σ̂e²) and estimated mean difference (Δ̂). Observed power approach is based on ability to detect an effect of θ = ±2 kg/d with a power of 80% or greater, whereas bioequivalence approach is based on the 90% confidence interval for the true mean difference (Δ) falling within equivalence limits θ = ±2 kg/d.

Consider Figure 5a based on the observed power approach. One notable feature of this rejection region is that, for se > 1.96 kg/d, there are no values of Δ̂ that would support the classical null hypothesis, not even Δ̂ = 0 kg/d, because the observed power for detecting θ = 2 kg/d is always less than 0.80 for se > 1.96 kg/d. The rejection region for this approach is also illogical. Consider the following: Exp. 1 is executed with relatively poor precision, say, se = 1.8 kg/d, whereas Exp. 2 is characterized by se = 0.5 kg/d. Suppose the same mean difference (Δ̂ = 1 kg/d) is observed in both experiments. In Exp. 1, we would conclude that −2 kg/d < Δ < 2 kg/d because se = 1.8 kg/d and Δ̂ = 1 kg/d fall within the rejection region for concluding equivalence. However, the more precisely executed Exp. 2 would fail to draw that same conclusion! This discrepancy is troubling because one should logically expect that an experiment with better precision should be more likely to conclude true equivalence than another experiment with the same estimated mean difference but higher se. However, if the null hypothesis is mean equivalence as in classical tests, more information (i.e., smaller variance components, larger sample sizes) can only strengthen a conclusion of a nonzero mean difference. That is, using the observed power approach, poorly executed experiments would be rewarded with a greater chance of concluding equivalence.
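The contrast between Exp. 1 and Exp. 2 can be reproduced numerically. The sketch below is a rough illustration under assumptions that are not stated in this section: 16 records per treatment mean in the four 4 × 4 Latin squares, 33 residual degrees of freedom, hard-coded approximate t quantiles for those degrees of freedom, and the common approximation that 80% power to detect θ requires θ ≥ (t0.975 + t0.80) × SE. All function and constant names are hypothetical. The confidence interval rule is the one described in the Figure 5 caption.

```python
import math

# Assumed design constants for four 4 x 4 Latin squares (illustrative only)
N_PER_TREATMENT = 16   # records contributing to each treatment mean
T_975 = 2.035          # approximate t quantile, 0.975, 33 df (assumed df)
T_95 = 1.692           # approximate t quantile, 0.95, 33 df
T_80 = 0.853           # approximate t quantile, 0.80, 33 df


def se_of_difference(s_e):
    """Standard error of a difference between two treatment means."""
    return s_e * math.sqrt(2.0 / N_PER_TREATMENT)


def power_approach_concludes_equivalence(delta_hat, s_e, theta=2.0):
    """Observed power rule: nonsignificant test AND about 80% power for theta."""
    se_diff = se_of_difference(s_e)
    not_significant = abs(delta_hat) <= T_975 * se_diff
    # Approximate power condition; ignores the far rejection tail
    enough_power = (T_975 + T_80) * se_diff <= theta
    return not_significant and enough_power


def ci_approach_concludes_equivalence(delta_hat, s_e, theta=2.0):
    """Bioequivalence rule: 90% CI on the difference lies within +/- theta."""
    se_diff = se_of_difference(s_e)
    lower = delta_hat - T_95 * se_diff
    upper = delta_hat + T_95 * se_diff
    return -theta < lower and upper < theta


if __name__ == "__main__":
    for label, s_e in (("Exp. 1", 1.8), ("Exp. 2", 0.5)):
        print(label,
              "observed power rule:", power_approach_concludes_equivalence(1.0, s_e),
              "| 90% CI rule:", ci_approach_concludes_equivalence(1.0, s_e))
```

With these assumed constants, the observed power rule concludes equivalence for Exp. 1 but not for Exp. 2, whereas the confidence interval rule does the reverse.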
Figure 5b illustrates the rejection region for the Schuirmann bioequivalence-testing approach as a function of se and Δ̂, with θ = 2 kg/d. This figure is an adaptation of Figure 2 from Schuirmann (1987) for the four 4 × 4 Latin squares design. Recall that for α = 0.05, the bioequivalence null hypothesis H0: Δ < −θ or Δ > θ is rejected in favor of −θ < Δ < θ if the 90% CI for Δ falls within ±θ. Note that, in contrast to the observed power approach illustrated in Figure 5a, the rejection region is widest for experiments with greater precision (smallest standard error); that is, experiments with greater precision are more likely to conclude bioequivalence. Consider again Exp. 1 and 2. Using the bioequivalence-testing approach, the conclusions are directly opposite to those based on the observed power approach. That is, Exp. 1 does not lead to a bioequivalence conclusion, whereas Exp. 2, with superior precision, does. Schuirmann (1987) further demonstrates that the power of the bioequivalence-testing approach far exceeds that of the observed power approach; this can be visually deduced from the relative sizes of the two rejection regions in Panels a and b of Figure 5.

Additional Pertinent Design and Analysis Issues

Experimental Design

Dairy studies are generally designed with treatment periods long enough that residual effects due to a treatment in a previous period do not carry over to subsequent periods. However, as a precaution, designs that are balanced for at least first-order carryover effects (such as double-reversal, switchback, Williams, or orthogonal Latin square designs) should be employed in these experiments. Because of this feature, such designs have been heavily advocated by the FDA for bioequivalence studies; in fact, researchers have been encouraged to investigate direct × carryover treatment effect interactions (FDA, 2001). Latin square and crossover designs that block on cows over periods have been recommended in this paper.
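As an illustration of what balance for first-order carryover means, the sketch below builds a Williams square for an even number of treatments using the standard cyclic construction and counts how often each treatment immediately follows each other treatment. The function names and the 0-based treatment labels are assumptions made for this example; it is not the randomization procedure of any particular study.

```python
from collections import Counter


def williams_square(t):
    """Williams Latin square for an even number of treatments t.

    Rows are cows (sequences), columns are periods, entries are treatments
    labeled 0..t-1. The first row is 0, 1, t-1, 2, t-2, ...; each later
    row adds 1 (mod t) to the row above.
    """
    if t < 2 or t % 2 != 0:
        raise ValueError("this construction assumes an even t >= 2")
    first_row = [0]
    low, high = 1, t - 1
    for k in range(1, t):
        if k % 2 == 1:
            first_row.append(low)
            low += 1
        else:
            first_row.append(high)
            high -= 1
    return [[(x + shift) % t for x in first_row] for shift in range(t)]


def carryover_counts(square):
    """Count ordered (previous, current) treatment pairs over consecutive periods."""
    counts = Counter()
    for row in square:
        for previous, current in zip(row, row[1:]):
            counts[(previous, current)] += 1
    return counts


if __name__ == "__main__":
    square = williams_square(4)
    for row in square:
        print(row)
    print(carryover_counts(square))
```

For four treatments, each of the 12 ordered pairs of distinct treatments occurs exactly once in consecutive periods, which is the first-order carryover balance referred to above.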
However, as elucidated by Morris (1999), there are a number of situations in which these designs should not be used in animal science. First, if one wishes to measure the long-term cumulative effects of treatments on, for example, longevity, then animals must serve as the experimental units and not the blocking factors. Second, if treatment effects persist well beyond one period (i.e., second- or higher-order carryover effects), then designs blocking on animals should be discouraged.

Statistical Analysis

Classical and bioequivalence power computations for both the replicated Latin square and repeated crossover designs have been demonstrated in this paper. However, as previously indicated, treatment × period interaction is also inferable. Various contrasts as they pertain to this interaction can be studied under classical or bioequivalence-testing frameworks, such as the mean difference between GM and reference feedstuffs in Period 1 vs. the mean difference between the two feedstuffs in Period 4 (presuming period as fixed). Certainly, such contrasts should take precedence in hypothesis testing compared to overall treatment comparisons, as primarily addressed in this paper, if treatment × period interaction is determined to be important.

In some cases, noninferiority testing rather than bioequivalence testing may be more appropriate for the comparison of GM vs. conventional feedstuffs. That is, rather than establishing equivalence, it may be of greater interest to test whether or not the GM feedstuff is inferior to the reference feedstuff for various aspects. More details on hypothesis-testing procedures and power assessments in noninferiority testing can be found in Hauschke et al. (1999).

Historically, periods have been treated as fixed in both Latin square and repeated crossover designs.
This approach is suitable if the variability in days in milk (DIM) within each treatment period is substantially less than the period length such that period and stage of lactation are nearly synonymous. However, if the range in DIM far exceeds period length, then period is less associated with stage of lactation and more reflective of seasonal or temporal management effects, such as the quality of silage as it is unloaded from the silo. In that case, period may be more appropriately treated as random with some modeled temporal correlation structure, as is possible with the use of SAS PROC MIXED. Furthermore, stage-of-lactation effects should then be modeled separately from period effects using lactation curve models, such as Wilmink’s curve (Wilmink, 1987) or even high-order polynomials. Random regression models have been recently popularized in modern dairy cattle genetic evaluation programs (Jensen, 2001), where Dairy Herd Improvement Association test-day effects are used to model seasonal effects in a manner similar to Latin square periods, and DIM is used to model parity, herd, and random cow-specific lactation curves, allowing for differences in lactation shape (i.e., persistency) over levels of those various factors. Treatment-specific lactation curves could be similarly modeled in replicated Latin square and crossover designs if treatment periods and DIM substantially overlap. Additionally, genetic sources of variability may be removed by fitting half-sib or full-sib families as random effects in these models. These implementations may substantially improve both classical and bioequivalence power of test.

This paper has concentrated primarily on classical and bioequivalence testing of milk production. However, the values for Δ or σe² chosen could be extrapolated to other testing scenarios involving various intake, digestibility, fitness, or reproductive measures.
That is, the key contributor to classical or bioequivalence power is not singly Δ or σe = √σe², but their ratio. As an example, power determinations for Δ = 1, 2, 3, 4 and σe² = 2 in Figure 1a apply to any ratio value of Δ/√σe² = 1/√2, 2/√2, 3/√2, 4/√2.

Studies on GM vs. reference corn as dairy feedstuffs have involved comparisons for a number of production, growth, milk components, and feed intake responses. Multiple testing corrections, such as Bonferroni or other alternatives, might plausibly be considered further in such cases; however, Berger and Hsu (1996) argue that these corrections may not be entirely necessary based on the intersection-union principle. Furthermore, multivariate data reduction alternatives should also be considered when some response variables are highly correlated with each other. Johnson (1998), for example, demonstrated how canonical variates analysis can be effectively used as a precursor data reduction strategy; subsequent classical or bioequivalence tests can then be applied as appropriate to the resulting canonical variates.

Current discussion on bioequivalence hypothesis testing has evolved from the exposition on average bioequivalence provided in this paper to measures of population and individual bioequivalence. These aggregate measures are not only functions of mean treatment differences but also of subject × treatment interaction and treatment differences for within-cow variability. As indicated previously, the double-reversal design is particularly useful for inference on these additional measures of dispersion. Nevertheless, there is currently sufficient controversy regarding the additional utility of population and individual bioequivalence aggregate measures beyond that provided by average bioequivalence (Hauschke and Steinijans, 2000; Munk, 2000).
Bioequivalence testing will therefore likely become an area for future regulatory statistical research that will be pertinent for the testing of GM crops as feedstuffs for dairy cattle.

Final Recommendations

Confidence intervals represent the most reasonable experimental assessment of plausible values for mean treatment differences (Hoenig and Heisey, 2001). If a CI does not include a hypothesized mean difference, then that mean difference is refuted by the data. Conversely, no value contained within the CI can be refuted by the data. If the values cluster tightly about 0, the investigator has reasonable confidence that the true mean difference is near 0 and should be able to report such a result. If, on the other hand, the CI is very wide, there is vastly insufficient evidence of either a mean difference or a mean equivalence, such that the result should not be considered publishable. Once a CI is constructed, retrospective or observed power computations offer no additional insight (Hoenig and Heisey, 2001). A greater emphasis on reporting CI relative to classical P-values would be consistent with the bioequivalence-testing approach discussed in this paper. A good example of concise bioequivalence testing and reporting is provided in Miles et al. (2002).

Power computations illustrated in this paper seem to indicate that experiment sizes may need to be greater for bioequivalence testing than for classical hypothesis testing. It should be noted that existing FDA guidelines (FDA, 2001) for drug testing advocate that equivalence limits be, respectively, 80% and 125% of the reference means (i.e., θL = 0.80µR and θU = 1.25µR) for various pharmacokinetic responses. Appropriate equivalence limits should also be characterized for key response variables in studies comparing GM to their reference hybrids as feedstuffs for dairy cattle.
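The confidence interval reasoning recommended above can be written as a small decision rule. This sketch assumes a 90% CI on the mean difference and symmetric equivalence limits ±θ; the function name interpret_ci and the example numbers are hypothetical. With a 90% CI, the equivalence conclusion corresponds to a bioequivalence test at α = 0.05, whereas exclusion of 0 corresponds to a two-sided classical test at α = 0.10.

```python
def interpret_ci(lower, upper, theta):
    """Read a CI on the mean difference against equivalence limits +/- theta.

    Returns (equivalent, nonzero):
      equivalent -- the whole interval lies strictly inside (-theta, theta)
      nonzero    -- the interval excludes a mean difference of 0
    Both can be True at once: a small but real difference that is still
    within the equivalence limits.
    """
    if theta <= 0:
        raise ValueError("theta must be positive")
    if lower > upper:
        raise ValueError("lower limit must not exceed upper limit")
    equivalent = -theta < lower and upper < theta
    nonzero = lower > 0 or upper < 0
    return equivalent, nonzero


if __name__ == "__main__":
    # Hypothetical 90% CIs (kg/d) judged against theta = 2 kg/d
    examples = {
        "tight, around 0": (-0.8, 1.1),
        "tight, away from 0": (0.7, 1.3),
        "wide": (-2.6, 3.4),
    }
    for label, (lo, hi) in examples.items():
        print(label, interpret_ci(lo, hi, theta=2.0))
```

A wide interval that returns (False, False) is the case described above in which the data support neither a mean difference nor mean equivalence.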
Implications

Animal scientists, including dairy scientists, have generally used classical data analysis procedures in studies designed to compare genetically modified vs. conventional feedstuffs. Because the anticipated result is equivalence between these two feedstuffs for their effects on livestock performance, bioequivalence statistical analysis that has been extensively used for pharmaceutical testing on drug safety is demonstrated to be better suited for such studies. These procedures, as demonstrated in this paper, are easy to implement and their use should be encouraged in future studies designed to investigate equivalence between two treatments or diets. Furthermore, the paper points out easily correctable data analysis deficiencies in recent animal science research that should facilitate more-reliable treatment comparisons, including refinements on identifying treatment differences that may depend on, for example, stage of lactation.

Literature Cited

Berger, R. L., and J. C. Hsu. 1996. Bioequivalence trials, intersection-union tests and equivalence. Stat. Sci. 11:283–302.
Clark, J. H., and I. R. Ipharraguerre. 2001. Livestock performance: Feeding biotech crops. J. Dairy Sci. 84:E9–E18.
Donkin, S. S., J. C. Velez, A. K. Totten, E. P. Stanisiewski, and G. F. Hartnell. 2003. Effects of feeding silage and grain from glyphosate-tolerant or insect-protected corn hybrids on feed intake, ruminal digestion, and milk production in dairy cattle. J. Dairy Sci. 86:1780–1788.
Faust, M. A. 2001. New feeds from genetically modified plants: The U.S. approach to safety for animals and the food chain. Livest. Prod. Sci. 74:239–254.
FDA. 2001. Guidance for industry: Statistical approaches to establishing bioequivalence, Center for Drug Evaluation and Research, Office of Training and Communications, Division of Communications Management, Drug Information Branch HFD-210, Rockville, MD. Available: http://www.fda.gov/cder/guidance/3616fnl.htm. Accessed September 23, 2003.
Folmer, J. D., R. J. Grant, C. T. Milton, and J. Beck. 2002. Utilization of Bt corn residues by grazing beef steers and Bt corn silage and grain by growing beef cattle and lactating dairy cattle. J. Anim. Sci. 80:1352–1361.
Gill, J. L. 1969. Sample size for experiments on milk yield. J. Dairy Sci. 52:984–989.
Gill, J. L. 1978. Design and Analysis of Experiments in the Animal and Medical Sciences, Vol II. Iowa State Univ. Press, Ames.
Gill, J. L. 1981. Evolution of statistical design and analysis of experiments. J. Dairy Sci. 64:1494–1519.
Gill, J. L., and W. T. Magee. 1976. Balanced two-period changeover designs for several treatments. J. Anim. Sci. 42:775–777.
Grant, R. J., K. C. Fanning, D. Kleinshmit, E. P. Stanisiewski, and G. F. Hartnell. 2003. Influence of glyphosate-tolerant (event NK603) and corn rootworm protected (event MON863) corn silage and grain on feed consumption and milk production in Holstein cattle. J. Dairy Sci. 86:1707–1715.
Hauschke, D., M. Kieser, and L. A. Hothorn. 1999. Proof of safety in toxicology based on the ratio of two means for normally distributed data. Biom. J. 41:295–304.
Hauschke, D., and V. W. Steinijans. 2000. The U.S. draft guidance regarding population and individual bioequivalence approaches: Comments by a research-based pharmaceutical company. Stat. Med. 19:2769–2774.
Hoenig, J. M., and D. M. Heisey. 2001. The abuse of power: The pervasive fallacy of power calculations for data analysis. Am. Stat. 55:19–24.
Hoover, D., B. M. Chassy, R. Hall, H. J. Klee, J. B. Luchansky, H. I. Miller, I. Munro, R. Weiss, S. L. Hefle, and C. O. Qualset. 2000. Human food safety evaluation of rDNA biotechnology-derived foods. Food Tech. 54:15–23.
Ipharraguerre, I. R., R. S. Younker, J. H. Clark, E. P. Stanisiewski, and G. F. Hartnell. 2003. Performance of lactating dairy cows fed corn as whole plant silage and grain produced from a glyphosate-tolerant hybrid (event NK603). J. Dairy Sci. 86:1734–1741.
Jensen, J. 2001. Genetic evaluation of dairy cattle using test-day models. J. Dairy Sci. 84:2803–2812.
Johnson, D. E. 1998. Applied Multivariate Methods for Data Analysts. Duxbury, Pacific Grove, CA.
Kuehl, R. O. 2000. Statistical Principles of Research Design and Analysis. 2nd ed. Duxbury Press, Pacific Grove, CA.
LeRoux, Y., C. Guimart, and M. Tenenhaus. 1998. Use of the repeated cross-over design in assessing bioequivalence: (within and between subjects variability—Schuirmann confidence intervals estimation). Eur. J. Drug Metab. Pharmacokinet. 23:339–345.
Littell, R. C., P. R. Henry, and C. B. Ammerman. 1998. Statistical analysis of repeated measures data using SAS procedures. J. Anim. Sci. 76:1216–1231.
Littell, R. C., G. A. Milliken, W. W. Stroup, and R. D. Wolfinger. 1996. SAS System for Mixed Models. SAS Institute, Inc., Cary, NC.
Longuski, R. A., M. S. Allen, and R. J. Tempelman. 2000. Effect of brown midrib-3 mutation in corn silage on lactational performance of dairy cows. J. Dairy Sci. 83(Suppl. 2):293.
Lucas, H. L. 1956. Switch-back trials for more than two treatments. J. Dairy Sci. 39:146–154.
Lucas, H. L. 1957. Extra-period latin square change-over designs. J. Dairy Sci. 40:225–239.
Miles, M. V., P. Horn, L. Miles, P. Tang, P. Steele, and T. DeGrauw. 2002. Bioequivalence of coenzyme q10 from over-the-counter supplements. Nutr. Res. 22:919–929.
Morris, T. R. 1999. Experimental Design and Analysis in Animal Sciences. CABI, Wallingford, Oxon, U.K.
Munk, A. 2000. Connections between average and individual bioequivalence. Stat. Med. 19:2843–2854.
Novak, W. K., and A. G. Haslberger. 2000. Substantial equivalence of antinutrients and inherent plant toxins in genetically modified foods. Food Chem. Toxicol. 38:473–483.
Phillips, K. F. 1990. Power of the two one-sided tests procedure in bioequivalence. J. Pharmacokinet. Biopharm. 18:137–144.
Schuirmann, D. J. 1987. A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. J. Pharmacokinet. Biopharm. 15:657–680.
St-Pierre, N. R. 2001. Integrating quantitative findings from multiple studies using mixed model methodology. J. Dairy Sci. 84:741–755.
St-Pierre, N. R., and L. R. Jones. 1999. Interpretation and design of nonregulatory on-farm feeding trials. J. Dairy Sci. 82:177–182.
Stroup, W. W. 1999. Mixed model procedures to assess power, precision, and sample size in the design of experiments. Pages 15–24 in Proc. Biopharmaceutical Section. Am. Stat. Assoc.
Stroup, W. W. 2002. Power analysis based on spatial effects mixed models: A tool for comparing design and analysis strategies in the presence of spatial variability. J. Agric. Biol. Environ. Stat. 7:491–511.
Williams, E. J. 1949. Experimental designs balanced for the estimation of residual effects of treatments. Aust. J. Scient. Res. 2:149–168.
Wilmink, J. B. M. 1987. Adjustment of test-day milk, fat, and protein yields for age, season, and stage of lactation. Livest. Prod. Sci. 16:335–348.

APPENDIX

Code from SAS’s PROC MIXED software and interpretative summary for bioequivalence testing (α = 0.05) in a double-reversal or switchback design.

SAS code

    proc mixed;
    class cow period treatment;
    model y = treatment period treatment*period;
    random cow cow*treatment;
    lsmeans treatment /diff alpha = .10 cl;
    run;

Interpretative summary

The class statement defines the classification factors, namely cow, period, and treatment. Fixed effects treatment and period, including their interaction treatment*period, are included in the model statement, whereas random effects cow and its interaction with treatment, cow*treatment, are specified in the random statement. The lsmeans statement specifies estimated treatment means and their mean differences based on the diff option. The options alpha = .10 and cl specify a 90% CI on the treatment mean difference, as appropriate for a bioequivalence test with a Type I error rate of 0.05. Conclusions for bioequivalence are then based on whether or not the reported lower and upper confidence limits on the mean differences of interest fall within prespecified equivalence limits.

If significant treatment × period interaction exists, then substitute

    lsmeans treatment*period /diff alpha = .10 cl;

for the fifth line in the above code to infer upon period-specific treatment differences or, alternatively, treatment-specific period differences.

The same program can be used for a replicated Latin square design except that cow*treatment is not inferable and should be deleted; that is, alter the fourth line in the code above to

    random cow;

The code above further presumes that residuals are identically, independently, and normally distributed. However, it may be possible, for example, that temporal correlations in the residuals exist across periods within cows. If periods are equally spaced apart, then a candidate specification might be a first-order autoregressive (AR(1)) correlation structure, in which case the following statement should be added to the code above:

    repeated period /subject = cow type = ar(1);

where the ar(1) specification specifies the residual correlations between periods to be a function of the time interval between them. See Littell et al. (1996; 1998) for further details on this and many other potentially better fitting specifications.
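To illustrate what the ar(1) structure implies, the sketch below writes out the residual correlation matrix among equally spaced periods within a cow, where the correlation between periods i and j is ρ raised to the power |i − j|. The function name ar1_correlation and the value ρ = 0.5 are assumptions chosen only for illustration; in practice ρ is estimated from the data by the mixed model software.

```python
def ar1_correlation(n_periods, rho):
    """First-order autoregressive correlation matrix for equally spaced periods.

    Entry (i, j) equals rho ** |i - j|, so the correlation decays as the
    number of periods separating two records on the same cow increases.
    """
    if n_periods < 1:
        raise ValueError("n_periods must be at least 1")
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return [[rho ** abs(i - j) for j in range(n_periods)]
            for i in range(n_periods)]


if __name__ == "__main__":
    # Four periods, as in a 4 x 4 Latin square; rho = 0.5 is illustrative only
    for row in ar1_correlation(4, 0.5):
        print(["%.3f" % value for value in row])
```

Adjacent periods share the largest correlation, and records from the first and last periods the smallest, which is how the residual correlation becomes a function of the time interval between periods.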