Experimental power considerations—Justifying replication for animal care and use committees

Clarice G. B. Demétrio,* José F. M. Menten,† Roseli A. Leandro,* and Chris Brien‡

*Departamento de Ciências Exatas, and †Departamento de Zootecnia, Universidade de São Paulo–ESALQ, Avenida Pádua Dias 11, CEP 13418-900, Piracicaba, SP, Brazil; and ‡Phenomics and Bioinformatics Research Centre, University of South Australia, GPO Box 2471, Adelaide, SA 5001, Australia

ABSTRACT A common practice in poultry science is to compare new treatments with a control, or treatments with one another, in planned experiments. The overall F-test from an ANOVA of the data allows the researcher to reject, or fail to reject, the null hypothesis. However, a correct conclusion from such an analysis depends on the experiment including sufficient replicates. At the same time, restrictions are imposed to reduce the number of birds used in experiments, both for welfare reasons and to save scarce resources. We review the basic concepts needed to determine the number of replicates before conducting an experiment, use these concepts to assess the results of several real experiments, and show what might be done in future experiments. We also describe how to carry out the computations in the R software.

Key words: analysis of variance, statistical power, replication, type I error, type II error

2013 Poultry Science 92:2490–2497
http://dx.doi.org/10.3382/ps.2012-02731

©2013 Poultry Science Association Inc.
Received August 30, 2012. Accepted September 28, 2012.
Presented as part of the Experimental Design for Poultry Production and Genomics Research Symposium at the Poultry Science Association's annual meeting in Athens, Georgia, July 12, 2012.
Corresponding author: [email protected]

INTRODUCTION

A common practice in poultry science is to compare new treatments with a control, or treatments with one another, in planned experiments. The results of experiments are affected not only by the action of the treatments, but also by extraneous variation (experimental error), which tends to mask the effects of the treatments (Cochran and Cox, 1957). To improve the results of an experiment and obtain a good estimate of the experimental error, the investigator should either increase the size of the experiment by using more replicates, or refine the experimental technique and the handling of the experimental material so that the effects of variability are reduced. In addition, proper randomization (restricted randomization for local control when necessary) should be used to ensure that no treatment is more likely to be favored in any replicate than another.

In terms of designed experiments, determining the sample size amounts to finding the number of replicates of a treatment. A question that is still very common in all areas of knowledge before defining a protocol for an experiment is "how many replicates per treatment are necessary?" This question is directly related to the power of a statistical test and depends on the size of difference between treatments that the experiment is required to detect, compared with the variability per unit that might be expected. Another way to view this is to ask what precision is desired for the estimated means.
In chicken nutrition research, the unit to which treatments are assigned, the experimental unit, may be a small group of birds (e.g., 6 to 10 birds younger than 3 wk, 3 to 4 birds older than 3 wk, or 8 to 10 laying hens) if cages are used, or a large group of birds (e.g., pens with 20, 40, 50, or more birds), as in growing-finishing broiler experiments. It is therefore important, when planning an experiment of known power and sensitivity, to decide how many cages or pens (replicates) to use per treatment and how many birds per cage or pen are necessary.

An example is taken from an experiment conducted in our research unit. It compared the effects of 6 treatments (A being the control) on the live weight of broilers, using a randomized complete block design with 5 replicates. It was concluded that the treatment means were not significantly different (see the summary for experiment 4 in Table 2). However, perhaps there are differences between the treatments; it may simply be that 5 replicates were not enough, given the size of the true differences between the treatment means. How many replicates should the researcher use in a next experiment to detect differences between treatments?

Before this question can be answered, it needs to be qualified. What size of difference is to be detected? The smaller the difference, the more replicates are required. But then one gets to the point where the difference that might be detected is of no practical importance. So, to limit the number of replicates, one needs to specify the size of the smallest difference that is considered to be important. A second qualification is how certain one wants to be of detecting the specified difference. To be almost certain would require a very large number of replicates, and so, again to limit the number of replicates, we nominate how sure we want to be of detecting the specified difference. This is called the power.

According to the Guide for Care and Use of Agricultural Animals in Research and Teaching (FASS, 2010), one of the points the Institutional Animal Care and Use Committee should consider when reviewing experimental protocols is the "justification for the number of animals used." This requirement is in line with the Animal Welfare Act, which incorporates the Three R's. The Three R's seek to reduce, refine, and replace the use of animals in experiments. They have been accepted by researchers who use animals in experiments and by animal welfare advocates who argue in favor of regulating animal use rather than abolishing it. Over the past 20 yr, the Three R's have to a considerable degree become the primary legal and nonlegal mechanism for regulating animal experimentation (Ibrahim, 2006). In experiments with agricultural animals, and especially with poultry, this well-intended reduction in the number of animals undermines the chance of a successful experiment and may waste animal resources.

In experiments conducted to identify whether a new product results in improved performance, for example, a small difference is of importance. Whereas this difference might have been 4 to 5% in the past, today it is in the range of 1 to 2%, because of the higher level of performance obtained and because of the magnitude of production. For a chicken producer or a poultry company, a 1 to 2% increase or decrease in final BW may represent 30 to 60 g/bird, which is much more important today than in the past, and the experimenter must be able to identify and report it.
Considering a 2% increase in BW accompanied by a 2% improvement in feed conversion rate, the outcome is a 4% increase in the productivity index, which determines the revenues obtained by an integrated broiler producer.

This paper first reviews the basic concepts and then describes a method for computing the power and the number of replicates needed to achieve a specified power. The application to several experiments is examined, and the issue of the number of pens versus the number of birds per pen is discussed.

REVIEW OF BASIC CONCEPTS

As stated by Aaron and Hays (2004), "an understanding of some basic statistical concepts and the essence of classical hypothesis testing is a necessary precursor to a discussion of statistical power." Consider that a researcher is interested in comparing k treatments and an experiment is planned with r replicates, either in a completely randomized design (CRD) or in a randomized complete block design (RCBD).

Hypotheses, F-test, and Significance Level

Every scientific hypothesis can be translated into 1 of 2 types of statistical hypotheses, called the null hypothesis (H0) and the alternative hypothesis (Ha). For example, when comparing k treatments, the null hypothesis could be that "there is no treatment effect," which can be written as H0: μ1 = … = μk, where μi represents the true mean of treatment i, i = 1, 2, …, k, and the alternative hypothesis could be Ha: at least one contrast between the true means differs from zero.

After performing an experiment and collecting the data (yij), estimates of the treatment means (ȳi) and of the experimental error (s²) are obtained, and the overall F-test from an ANOVA is used to decide between the two hypotheses. It is important to stress the conditions under which the ANOVA is valid: the observations yij are independent and come from normal distributions with means μi and common variance σ². Once the statistical test to be used and a decision rule are defined, a decision is made to either accept or reject the null hypothesis at a significance level α, in general 0.05 or 0.01. If the null hypothesis is rejected, we conclude that the evidence favors the alternative hypothesis; if the null hypothesis is accepted, we conclude that there is not enough evidence to reject it.

When comparing k treatments, from the ANOVA table we obtain the value of the statistic

F0 = MST / MSE,

where MST is the mean square for treatments, a measure of the size of the differences between the observed means, and MSE is the mean square for error, s², an estimate of σ². Under the null hypothesis H0: μ1 = … = μk, F0 is an observed value of a random variable with a central Snedecor F distribution with (k − 1) and υ degrees of freedom. The number of degrees of freedom of the denominator, υ, is the error degrees of freedom and depends on the design employed; it is equal to k(r − 1) and (k − 1)(r − 1), respectively, for a completely randomized design and a randomized complete block design. Note that for k = 2, F = t², and the F-test from an ANOVA for a completely randomized design is equivalent to a t-test for 2 independent samples, whereas the F-test from an ANOVA for a randomized complete block design is equivalent to a t-test for 2 paired samples. The null hypothesis will be rejected if F0 > Fk−1,υ;α for a given probability α, called the significance level, or, equivalently, if the P-value, P(Fk−1,υ > F0), is less than α.
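These two quantities can be obtained directly in R. The following is a small illustration only, using 13 and 22 degrees of freedom (the values used in Figure 1 below) and a hypothetical observed statistic F0 = 2.5 chosen purely for the example:

# Critical value of the central F distribution at alpha = 0.05,
# with 13 and 22 degrees of freedom (the values used in Figure 1)
qf(0.95, df1 = 13, df2 = 22)    # approximately 2.20

# P-value for a hypothetical observed statistic F0 = 2.5:
# P(F_{13,22} > F0); reject H0 at the 5% level when this is below 0.05
F0 <- 2.5
pf(F0, df1 = 13, df2 = 22, lower.tail = FALSE)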
In Figure 1, any observed value of the test statistic greater than F13,22;0.05 = 2.20 would result in a P-value of less than 0.05, and so the null hypothesis would be rejected at the 5% level.

Figure 1. Central F distribution with 13 and 22 df.

Type I and Type II Errors

When doing hypothesis testing there are 4 possible scenarios, as shown in Table 1, depending on whether the null hypothesis is actually true or false and on whether we decide to accept or reject it. If the null hypothesis is accepted when it is true, or rejected when it is false, a correct decision is made. Otherwise, if a true null hypothesis is rejected, a type I error (error of the first kind) is made with probability α, whereas if a false null hypothesis is accepted, a type II error (error of the second kind) is made with probability β. In general, it is desirable for both probabilities of error, α and β, to be small.

Table 1. Possible scenarios for a hypothesis test¹

H0       Accept H0                               Reject H0
True     Correct decision                        Type I error
         P(Accept H0 | H0 true) = 1 − α          P(Reject H0 | H0 true) = α
False    Type II error                           Correct decision
         P(Accept H0 | H0 false) = β             P(Reject H0 | H0 false) = 1 − β

¹H0 = null hypothesis; α and β = probability of making a type I and a type II error, respectively.

Power and Sample Size

The power of a statistical test is the probability (1 − β) that it will correctly lead to rejection of a false null hypothesis; that is, the probability that a treatment effect will be detected if it exists (Berndtson, 1991). The power depends on the significance level used for the test, the magnitude of the effect of interest (alternative hypothesis) in the population, and the sample size used to detect the effect. The researcher would like the power to detect the alternative hypothesis to be high. Statistical power analysis should be done during the planning of an experiment to estimate the sample size necessary to achieve adequate power; this is called a priori, or prospective, power analysis. A posteriori, or retrospective, power analysis generally arises after the experiment has been performed and no significant effects have been detected; its use is controversial.

The Noncentral F-test

Under the alternative hypothesis, Ha, F0 is an observed value of a random variable with a noncentral F distribution with (k − 1) and υ degrees of freedom and noncentrality parameter (a measure of the size of the differences, relative to the magnitude of the uncontrolled variation, for single-treatment-factor experiments)

λ = r ∑_{i=1}^{k} (μi − μ)² / σ²,   [1]

provided the number of replicates for all treatments is r, and where μ is the average of the μi, i = 1, …, k. The shape of the noncentral F distribution for different values of the noncentrality parameter λ is illustrated in Figure 2, along with the central F distribution, for which λ = 0.

Figure 2. Central and noncentral F distributions with 13 and 22 df and noncentrality parameter λ equal to 0, 10, or 20.

The power is given by P(Fk−1,υ;λ > Fk−1,υ;1−α), the probability from the noncentral F distribution that the random variable Fk−1,υ;λ will exceed the critical value Fk−1,υ;1−α of the F-test under the null hypothesis. However, to use this formula we need the values of the μi, which are unknown. A way to overcome this is to specify Δ, the difference such that, if any 2 population means differ by this amount, the null hypothesis should be rejected (Brien, 2010).
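For illustration only, if one were willing to posit a full set of true means, the power could be computed directly from equation [1] and the noncentral F distribution, as in the following base R sketch; the means, variance, and number of replicates below are hypothetical values chosen for the example, not results from any experiment discussed here.

# Hypothetical true means for k = 6 treatments (illustration only)
mu     <- c(3000, 3010, 3020, 3030, 3040, 3050)
sigma2 <- 3913                      # assumed error variance (g^2)
r      <- 5                         # replicates (blocks) per treatment
k      <- length(mu)
alpha  <- 0.05

# Noncentrality parameter, equation [1]
lambda <- r * sum((mu - mean(mu))^2) / sigma2

# Degrees of freedom for an RCBD: k - 1 and (k - 1)(r - 1)
df1 <- k - 1
df2 <- (k - 1) * (r - 1)

# Power = P(F_{k-1,v;lambda} > F_{k-1,v;1-alpha})
1 - pf(qf(1 - alpha, df1, df2), df1, df2, ncp = lambda)

In practice such a full set of true means is rarely available, which is why the conservative approach based on Δ is used.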
With Δ specified in this way, it can be shown that a conservative value for the power is obtained by using the minimum possible value of λ, namely

λ = r m Δ² / (2σ²) = (r m / 2)(Δ/σ)²,   [2]

where r is the number of pure replicates of each treatment and m is the multiplier of r that gives the number of observations (rm) used in computing one of the means. For experiments with a single treatment factor, m = 1. For a single-treatment-factor randomized complete block design, r = b, the number of blocks. The ratio d = Δ/σ is called the effect size (Cohen, 1969), and lower values of d indicate the need for larger sample sizes.

The power can be calculated for any given α, r, Δ, and σ². Note that the power is a monotonically increasing function of the parameter λ; that is, greater values of λ are associated with greater power, as can be seen from Figure 2 by comparing, for each distribution, the probability of exceeding, say, 2.2. From equations [1] and [2], it is clear that λ increases with i) increased sample size, r; ii) increased differences among the population means (as measured either by the sum of squares of the deviations of the true means from their overall mean, ∑_{i=1}^{k} (μi − μ)², or by the minimum detectable difference, Δ); and iii) decreased variability within populations, σ². It is possible to see, then, that if the experimenter is interested in detecting small effects (Δ), more replicates are required than if the experimenter is interested in detecting large effects. Furthermore, power increases with larger significance levels, α, as shown in Figure 3, but of course at the cost of an increased risk of a type I error. The only way to reduce both types of error simultaneously is to increase r. In practice, we fix a small value for α (the significance level) to guard against type I errors, reducing the risk of false claims (Aaron and Hays, 2004).

Figure 3. Central and noncentral F distributions with 13 and 22 df and noncentrality parameter equal to 20, for significance levels 0.05 and 0.10.

COMPUTING THE POWER

To compute the power, we can use the following functions from the R software (R Development Core Team, 2011): qf, for computing the q (= F) critical value under the central F distribution, with arguments 1 − α, df1 (degrees of freedom of the numerator), and df2 (degrees of freedom of the denominator); and pf, for computing probabilities for the noncentral F distribution, with arguments q (= F), df1, df2, and ncp (= λ).

Brien (2011) created a function, power.exp, available in the dae package for R, for computing the power of an experiment. For example, if we want to determine the power of an RCBD with 5 replicates and 6 treatments in detecting Δ = 50 g with α = 0.05 and σ² = 3,913, the usage and arguments for this function are

> rm <- 5
> power.exp(rm = rm, df.num = 5, df.denom = 5*(rm-1), delta = 50,
            sigma = sqrt(3913), print = TRUE)
  rm df.num df.denom    α delta    sigma  lambda      powr
1  5      5       20 0.05    50 62.55398 1.59724 0.1117699
[1] 0.1117699

giving a power of 0.1118, which is not good, because the chance of correctly rejecting H0 is not high. This is illustrated in Figure 4.

Figure 4. Central and noncentral F distributions with 5 and 20 df and noncentrality parameter equal to 1.60, for experiment 4.

We can produce entire power curves for a whole series of values of λ, as shown in Figure 5.

Figure 5. Power curves, considering 6 treatments: (left) for a fixed Δ and varying values of σ²; (right) for a fixed σ² and varying values of Δ.
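As a cross-check of the calculation above, and as a sketch of how curves like those in Figure 5 might be produced, the same power can be computed with the base R functions qf and pf alone; the grid of Δ values below is arbitrary and only for illustration.

# The same power as in the power.exp example, from base R alone:
# RCBD, k = 6 treatments, r = 5 blocks, delta = 50 g, sigma^2 = 3,913
k <- 6; r <- 5; m <- 1
delta <- 50; sigma2 <- 3913; alpha <- 0.05
df1 <- k - 1
df2 <- (k - 1) * (r - 1)

lambda <- r * m * delta^2 / (2 * sigma2)                  # equation [2], about 1.60
1 - pf(qf(1 - alpha, df1, df2), df1, df2, ncp = lambda)   # about 0.11

# A power curve over a grid of delta values, for fixed sigma^2,
# in the spirit of the right-hand panel of Figure 5
delta.grid  <- seq(10, 150, by = 10)                      # arbitrary illustrative grid
lambda.grid <- r * m * delta.grid^2 / (2 * sigma2)
power.grid  <- 1 - pf(qf(1 - alpha, df1, df2), df1, df2, ncp = lambda.grid)
plot(delta.grid, power.grid, type = "l",
     xlab = "Minimum detectable difference, delta (g)", ylab = "Power")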
Table 2. Summary of the analyses of live weight for 5 experiments, with k treatments (A, control) and r replicates¹

                                               Treatment mean (g)
Exp.  k  r   MSE      F-test    r*    1 − β    A      B      C      D      E      F      Range
1     4  7   1,829    12.09**   18    0.36     2,872  2,869  2,891  2,767  —      —      124
2     5  6   1,367    0.22      15    0.40     2,366  2,376  2,372  2,374  2,359  —      17
3     5  7   13,331   0.65      129   0.08     3,204  3,171  3,117  3,149  3,195  —      87
4     6  5   3,913    0.83      42    0.11     3,040  3,046  3,079  3,012  3,071  3,028  67
5     6  6   1,719    6.12**    19    0.26     2,609  2,587  2,662  2,609  2,625  2,536  126

¹r* is the computed number of replicates needed to detect a 50-g difference with a probability of approximately 0.80.
**The F-test is significant at the 1% level.

COMPUTING THE REQUIRED SAMPLE SIZE FOR THE CRD AND RCBD WITH A SINGLE TREATMENT FACTOR

Given the discussion of the power of a hypothesis test, determining the sample size amounts to determining r, the number of pure replicates of a treatment. This is achieved by computing the power for different values of r until the smallest r that has at least the required power is identified. To do so, you will have to specify the following:

i. the significance level, α;
ii. the power desired, 1 − β;
iii. the number of treatments, k, to be investigated;
iv. the minimum size of the difference to be detected between a pair of treatment means, as measured by Δ;
v. the magnitude of the uncontrolled variation expected, σ².

In obtaining values for these, the results from a previous experiment that you or another researcher has run will often be useful.

A convenient method for determining the number of replicates is provided by the function no.reps from the dae package (Brien, 2010). For example, using the results from experiment 4 in Table 2, the following R code calculates the number of pure replicates required to achieve a power of 0.80, with a significance level of 0.05, in detecting a minimum difference of 50 g:

> no.reps(multiple = 1, df.num = 5, df.denom = expression(df.num*(r-1)),
          delta = 50, sigma = sqrt(3913), power = 0.8, print = FALSE)
        rm df.num df.denom    α delta    sigma   lambda      powr
1 41.26343      5 201.3171 0.05    50 62.55398 13.18152 0.7999115
$nreps
[1] 42
$power
[1] 0.8081996

These results indicate that the required number of replicates is 42 and that the power of the experiment would then be 0.8082.

APPLICATIONS

In the examples presented in Table 2, the effects of k treatments (A was a control) on the live weight of chickens were investigated, using a CRD (experiment 1) or an RCBD (experiments 2, 3, 4, and 5), with r replicates. It was concluded that the differences between the treatment means were not significant for experiments 2, 3, and 4 and were significant for experiments 1 and 5. Based on the information from these experiments, it is possible to calculate: a) the number of replicates (r*) necessary to detect a difference of 50 g between the treatment means with a power of 80%; and b) the power to detect a difference of 50 g between the treatment means in a similar experiment.

We see that experiments 1, 2, and 5 had the most power, although the probability of detecting a difference of 50 g is only between 0.26 and 0.40; to have an 80% chance of detecting a difference of 50 g, 15 or more replicates would be needed. Of these 3 experiments, only experiment 2 was not significant, and this is due to the small differences between the treatment means (a range of 17 g). Experiments 3 and 4 had the least power, owing to greater error variation and, in the case of experiment 4, less replication. As they also had only moderate differences between the treatment means, it is not surprising that the treatments were not significantly different.

In summary, it would appear that about 20 replicates are required to have an 80% chance of detecting a difference of about 50 g. Fewer replicates will mean that there is less chance of detecting a difference of this size and that only larger differences are likely to be detected. An alternative is to look to reduce the error variability, for example, by improving the experimental protocols or by selecting more uniform birds.
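The r* values in Table 2 can also be reproduced without the dae package by the search just described: compute the power for increasing r until the target is reached. The following is a minimal base R sketch, shown for experiment 4 (k = 6, Δ = 50 g, σ² = 3,913, and an RCBD error df of (k − 1)(r − 1)); the other entries follow in the same way with the appropriate error degrees of freedom for their designs.

# Smallest r giving power >= 0.80 (the search performed by no.reps),
# shown for experiment 4: k = 6, delta = 50 g, sigma^2 = 3,913, alpha = 0.05
smallest.r <- function(k, delta, sigma2, alpha = 0.05, target = 0.80,
                       m = 1, max.r = 500) {
  for (r in 2:max.r) {
    df1    <- k - 1
    df2    <- (k - 1) * (r - 1)                    # RCBD error df
    lambda <- r * m * delta^2 / (2 * sigma2)       # equation [2]
    power  <- 1 - pf(qf(1 - alpha, df1, df2), df1, df2, ncp = lambda)
    if (power >= target) return(c(r = r, power = power))
  }
  NA
}

smallest.r(k = 6, delta = 50, sigma2 = 3913)       # about 42 replicates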
REPLICATES VERSUS SUBSAMPLING

A question that also frequently arises is what to use: more pens with fewer birds, or fewer pens with more birds? The answer depends on the relative magnitudes of the experimental and sampling errors, and on costs. Determining the sample size is complicated because it involves 2 sources of uncontrolled variation:

i. between-pen variation, σM², and
ii. between-bird, within-pen variation, σS²,

and it requires a guess of the values of these 2 variances. The sample size, r, is then computed as described above, except that the value of the variance is

σ² = σS² + m σM².

However, this expression involves m, the number of birds per pen. A simple approach here is to nominate 2 or 3 alternative values for m; the value of r can then be determined for a given power and each value of m.

One point here, which bears on minimizing the error variance for pens, is that it is best to assign birds that are different to the same pen: for example, light and heavy birds. This can be done, as described in Brien and Demétrio (1998), by dividing the complete set of birds into m homogeneous sets. A bird is then taken at random from each set to be placed in the same pen. This has the advantage that an interaction between treatments and the different sets does not contribute to the error for treatment differences.
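A sketch of this calculation is given below: for a few candidate values of m, it finds the number of pens r per treatment needed for a power of 0.80. The two variance components are illustrative guesses only (in practice they would come from previous experiments), and an RCBD error df of (k − 1)(r − 1) is assumed.

# Pens (r) per treatment needed for power 0.80 when each pen holds m birds,
# using sigma^2 = sigma_S^2 + m * sigma_M^2 in the noncentrality parameter.
# The two variance components below are illustrative guesses only.
sigma2.M <- 2000     # assumed between-pen variance component
sigma2.S <- 6000     # assumed between-bird, within-pen variance component
k <- 6; delta <- 50; alpha <- 0.05; target <- 0.80

pens.needed <- function(m, max.r = 500) {
  sigma2 <- sigma2.S + m * sigma2.M                # effective variance
  for (r in 2:max.r) {
    df1    <- k - 1
    df2    <- (k - 1) * (r - 1)                    # RCBD, pens as experimental units
    lambda <- r * m * delta^2 / (2 * sigma2)       # rm birds per treatment mean
    power  <- 1 - pf(qf(1 - alpha, df1, df2), df1, df2, ncp = lambda)
    if (power >= target) return(r)
  }
  NA
}

sapply(c(10, 20, 40), pens.needed)                 # r for m = 10, 20, and 40 birds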
ACKNOWLEDGMENTS

The first and second authors are scholars supported by the National Council for Scientific and Technological Development (CNPq), Brazil. We also thank professors Gene Pesti (University of Georgia, Athens) and Lynne Billard (University of Georgia, Athens), who invited us to present this paper in the Experimental Design for Poultry Production and Genomics Research Symposium, held on July 12, 2012, at the University of Georgia as part of the Poultry Science Association 101st Annual Meeting.

REFERENCES

Aaron, D. K., and V. W. Hays. 2004. How many pigs? Statistical power considerations in swine nutrition experiments. J. Anim. Sci. 82:E245–E254.
Berndtson, W. E. 1991. A simple, rapid and reliable method for selecting or assessing the number of replicates for animal experiments. J. Anim. Sci. 69:67–76.
Brien, C. J. 2010. Design and randomization-based analysis of experiments in R. Accessed Aug. 2012. http://chris.brien.name/ee2.
Brien, C. J. 2011. Functions useful in the design and ANOVA of experiments. Version 2.1-7. Accessed Aug. 2012. http://cran.stat.auckland.ac.nz/web/packages/dae/index.html.
Brien, C. J., and C. G. B. Demétrio. 1998. Using the randomisation in specifying the ANOVA model and table for properly and improperly replicated grazing trials. Aust. J. Exp. Agric. 38:325–334.
Cochran, W. G., and G. M. Cox. 1957. Experimental Designs. 2nd ed. John Wiley and Sons, New York, NY.
Cohen, J. 1969. Statistical Power Analysis for the Behavioral Sciences. Academic Press, New York, NY.
FASS (Federation of Animal Science Societies). 2010. Guide for Care and Use of Agricultural Animals in Research and Teaching. Fed. Anim. Sci. Soc., Champaign, IL.
Ibrahim, D. M. 2006. Reduce, refine, replace: The failure of the Three R's in the future of animal experimentation. University of Chicago Legal Forum. Arizona Legal Studies Discussion Paper 06-17. http://ssrn.com/abstract=888206.
R Development Core Team. 2011. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/.