Experimental power considerations—Justifying replication for animal care and use committees

Clarice G. B. Demétrio,* José F. M. Menten,† Roseli A. Leandro,* and Chris Brien‡
*Departamento de Ciências Exatas, and †Departamento de Zootecnia, Universidade de São Paulo–ESALQ,
Avenida Pádua Dias 11, CEP 13418-900, Piracicaba, SP, Brazil; and ‡Phenomics and Bioinformatics
Research Centre, University of South Australia, GPO Box 2471, Adelaide, SA 5001, Australia
ABSTRACT A common practice in poultry science is
to compare new treatments with a control, or to compare among treatments, in planned experiments. The overall
F-test from an ANOVA of the data allows the researcher to reject or not reject the null hypothesis. However,
the correct conclusion from such analysis depends on
sufficient replicates being included in the experiment.
On the other hand, restrictions are imposed to reduce
the number of birds used in experiments for welfare reasons and to save scarce resources. We review the basic
concepts needed to determine the number of replicates
before conducting an experiment. We use these concepts to assess the results of several real experiments
and to show what might be done in future experiments.
We describe how to do the computations in R software.
Key words: analysis of variance, statistical power, replication, type I error, type II error
2013 Poultry Science 92:2490–2497
http://dx.doi.org/10.3382/ps.2012-02731
INTRODUCTION
A common practice in poultry science is to compare
new treatments with a control, or to compare among treatments, in planned experiments. The results of experiments are affected not only by the action of the treatments, but also by extraneous variations (experimental
error), which tend to mask the effects of treatments
(Cochran and Cox, 1957). To improve the results from
an experiment and get a good estimate of the experimental error, the investigator should either increase the
size of the experiment using more replicates or refine
the experimental technique or handle the experimental
material so that the effects of variability are reduced.
Additionally, proper randomization (restricted randomization for local control when necessary) should be
used to ensure that one treatment is no more likely to
be favored in any replicate than another.
In terms of designed experiments, determining the
sample size amounts to finding the number of replicates
of a treatment. A question that is still very common in
all areas of knowledge before defining a protocol for an
experiment is “how many replicates per treatment are
necessary?” This type of question is directly related to
the power of a statistical test and depends on what size
of difference between treatments the experiment is required to detect compared with the variability per unit
which might be expected. Another way to view this is
in terms of the desired precision of the estimated means.
In chicken nutrition research, the unit to which treatments are assigned, the experimental unit, may be a
small group of birds (e.g., 6 to 10 birds younger than 3
wk, 3 to 4 birds older than 3 wk, or 8 to 10 laying hens),
if cages are used in the experiments, or a large group of
birds (e.g., pens with 20, 40, 50, or more birds), as in growing-finishing broiler experiments. So, it is important to
decide how many cages or pens (replications) to use per
treatment and how many birds per cage or pen are necessary when planning an experiment of known power
and sensitivity. An example is taken from an experiment conducted in our research unit. It compared the
effects of 6 treatments (A, control) on the live weight
of broilers using a randomized complete block design
with 5 replicates. It was concluded that the treatment
means were not significantly different (see a summary
for experiment 4, in Table 2). However, perhaps there
are differences between the treatments—just that 5
replicates were not enough, given the size of the true
differences between the treatment means. How many
replicates should the researcher use in the next experiment to detect differences between treatments? Before
this question can be answered, it needs to be qualified.
What size difference is to be detected? The smaller the
difference, the more replicates are required. But, then
one gets to the point where the difference that might
be detected is of no practical importance. So, to limit
the number of replicates, one needs to specify the size
of the smallest difference that is considered to be important. A second qualification is how certain does one
want to be of detecting the specified difference. To be
almost certain would require a very large number of
replicates and so, again to limit the number of replicates, we nominate how sure we want to be of detecting
the specified difference. This is called the power.
According to the Guide for Care and Use of Agricultural Animals in Research and Teaching (FASS, 2010),
one of the points the Institutional Animal Care and Use
Committee should consider when reviewing experimental protocols includes “justification for the number of
animals used.” This requisite is in line with the Animal
Welfare Act, which incorporates the Three R’s. The
Three R’s seek to reduce, refine, and replace the use of
animals in experiments. The Three R’s have been accepted by researchers who use animals in experiments
and by animal welfare advocates who argue in favor of
regulating animal use rather than abolishing it. Over
the past 20 yr, the Three R’s have to a considerable degree become the primary legal and nonlegal mechanism
for regulating animal experimentation (Ibrahim, 2006).
In experiments with agricultural animals, and especially with poultry, this well-intended reduction in the
number of animals undermines the chance of a successful experiment and may waste animal resources.
In experiments conducted to identify if a new product
results in improved performance, for example, a small
difference is of importance. Whereas this difference might have been 4 to 5% in the past, today it is in the range of
1 to 2%, because of the higher level of performance
obtained and because of the magnitude of production.
For a chicken producer or a poultry company, a 1 to 2%
increase or decrease in final BW may represent 30 to
60 g/bird, which is much more important today than
in the past and the experimenter must be able to identify and report it. Considering a 2% increase in BW
accompanied by a 2% improvement in feed conversion
rate, the outcome is a 4% increase in the productivity
index, which will determine the revenues obtained by
an integrated broiler producer.
This paper first reviews the basic concepts and then
describes a method of computing power and the number of replicates to achieve a specified power. The application to several experiments is examined. Also, the
issue of the number of pens versus the number of birds
per pen is discussed.
REVIEW OF BASIC CONCEPTS
As stated by Aaron and Hays (2004), “an understanding of some basic statistical concepts and the essence
of classical hypothesis testing is a necessary precursor
to a discussion of statistical power.” Consider that a
researcher is interested in comparing k treatments and
an experiment is planned with r replicates, either in a
completely randomized design (CRD) or in a randomized complete block design (RCBD).
Hypotheses, F-test, and Significance Level
Every scientific hypothesis can be translated into 1
of 2 types of statistical hypotheses called the null hypothesis (H0) and the alternative hypothesis (Ha). For
example, comparing k treatments, the null hypothesis
could be that “there is no treatment effect,” which can
be written as H0 : μ1 = ... = μk, where μi represents
the true mean of the treatment i, i = 1, 2, …, k, and
the alternative hypothesis could be Ha: at least one contrast between the true means differs from zero.
After performing an experiment and collecting the
data (yij), estimates of the treatment means (ȳi) and of
the experimental error (s2) are obtained and the overall
F-test is used to decide between both hypotheses, using
an ANOVA. It is important to stress the conditions
under which the ANOVA is valid: independence of the
observations (yij), which come from a normal distribution with means µi and common variance σ2.
Once the statistical test to be used and a decision
rule are defined, a decision is made to either accept or
reject the null hypothesis at a significance level α, in
general 0.05 or 0.01. If the null hypothesis is rejected,
then we conclude that evidence favors the alternative
hypothesis; if the null hypothesis is accepted, we conclude that there is not enough evidence to reject it.
When comparing k treatments, from the ANOVA table
we get the value of the statistic
\[ F_0 = \frac{MST}{MSE}, \]
where MST is the mean square of treatments, a measure of the size of the differences between the observed
means, and MSE is the mean square of the error s2, an
estimate of σ2. Under the null hypothesis H0 : µ1 = ...
=µk , F0 is an observed value of a random variable with
central Snedecor’s F distribution with (k − 1) and υ
degrees of freedom. The number of degrees of freedom
of the denominator, υ, is the error degrees of freedom
and depends on the design employed; it will be equal
to k(r − 1) and (k − 1)(r − 1), respectively, for a completely randomized design and a randomized complete
block design. Note that for k = 2, F = t2, and the F
test from an ANOVA for a completely randomized design is equivalent to a t-test for 2 independent samples,
whereas the F test from an ANOVA for a randomized
complete block design is equivalent to a t-test for 2
paired samples.
The null hypothesis will be rejected if F0 > Fk−1,υ;α for a given probability α, called the significance level, or if
the P-value, P(Fk−1,υ > F0), is less than α. In Figure
1, any observed value of the test statistic greater than
F13,22;0.05 = 2.20 would result in a P-value of less than
0.05, and so H0 would be rejected at the 5% level.
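These quantities can be checked directly with the qf and pf functions in R, which are used again later in this paper; a small sketch, in which the observed value of 3.0 is only an illustrative F0, not a value from any experiment reported here:

> qf(0.95, 13, 22)      # 5% critical value F13,22;0.05, approximately 2.20
> 1 - pf(3.0, 13, 22)   # P-value for a hypothetical observed F0 = 3.0 (below 0.05, so H0 rejected)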
Figure 1. Central F distribution with 13 and 22 df.
Type I and Type II Errors
When doing hypothesis testing we can have 4 scenarios, as shown in Table 1, which depend on whether
the null hypothesis is actually true or false and
whether we decide to accept or reject it. If the null hypothesis is accepted when it is true or is rejected when
it is false, a correct decision will be made. Otherwise,
if a true null hypothesis is rejected a type I error (error
of first kind) is made with probability α, whereas if a
false null hypothesis is accepted a type II error (error
of second kind) is made with probability β. In general,
it is desirable to have small probabilities, α and β, of
making errors.
Table 1. Possible scenarios for a hypothesis test1

Decision      H0 true                            H0 false
Accept H0     Correct decision                   Type II error
              P(Accept H0 | H0 true) = 1 − α     P(Accept H0 | H0 false) = β
Reject H0     Type I error                       Correct decision
              P(Reject H0 | H0 true) = α         P(Reject H0 | H0 false) = 1 − β

1H0 = null hypothesis; α and β = probability of making a type I and a type II error, respectively.

Power and Sample Size

The power of a statistical test is the probability (1 − β) that it will correctly lead to a rejection of a false null hypothesis; that is, the probability that a treatment effect will not go undetected if it exists (Berndtson, 1991). The power will depend on the significance level used for the test, the magnitude of the effect of interest (alternative hypothesis) in the population, and the sample size used to detect the effect. The researcher would like the power to detect the alternative hypothesis to be high. Statistical power analysis should be done during the planning of an experiment to estimate the sample size necessary to achieve adequate power. This is called a priori, or prospective, power analysis. A posteriori, or retrospective, power analysis generally arises after the experiment has been performed and no significant effects were detected; its use is controversial.

The Noncentral F-test

Figure 2. Central and noncentral F distributions with 13 and 22 df and noncentrality parameter λ equal to 0, 10, or 20.

Under the alternative hypothesis, Ha, F0 is an observed value of a random variable with a noncentral
F distribution with (k − 1) and υ degrees of freedom
and noncentrality parameter (a measure of the size of
the differences, relative to the magnitude of the uncontrolled variation, for single-treatment-factor experiments)
\[ \lambda = \frac{r \sum_{i=1}^{k} (\mu_i - \mu)^2}{\sigma^2}, \qquad [1] \]
provided the number of replicates for all treatments is
r, and where µ is the average of µi, i = 1, …, k. The
shape of the noncentral F distribution for different values of the noncentrality parameter λ is illustrated in
Figure 2, along with the central F distribution for which
λ = 0. The power is given by P(Fk−1,υ;λ > Fk−1,υ;1−α), the probability, from the noncentral F distribution, that the random variable Fk−1,υ;λ will exceed the critical value Fk−1,υ;1−α (the 1 − α quantile of the central F distribution) of the F test under the null hypothesis.
However, to use this formula we need the values of
the µi that are unknown. A way to overcome this is to
specify Δ, the difference such that, if any 2 population
means differ by this amount, the null hypothesis should
be rejected (Brien, 2010). Then it can be shown that
a conservative value for the power is obtained by using, in the general formula, the minimum value of λ, which is
\[ \lambda = \frac{r m \Delta^2}{2\sigma^2} = \frac{r m}{2}\left(\frac{\Delta}{\sigma}\right)^2, \qquad [2] \]
where r is the number of pure replicates of each treatment and m is the multiplier of r that gives the number of
observations (rm) used in computing one of the means.
For experiments with a single treatment factor, m =
1. For a single-treatment-factor randomized complete
block design, r = b, the number of blocks. The ratio d =
Δ/σ is called effect size (Cohen, 1969) and lower values
of d indicate the need for larger sample sizes.
The power can be calculated for any given α, r, Δ, or
σ2. Note that the power is a monotonically increasing
function of the parameter λ, that is, greater values of λ
are associated with greater power, as can be concluded
from Figure 2 by comparing the probabilities of being
greater than, say, 2.2 for each distribution. From equations [1] and [2], it is clear that λ increases with i) increased sample size r; ii) increased difference among
population means (as measured by either the expected sum of squares of the deviations of the true means from their overall mean, $\sum_{i=1}^{k}(\mu_i - \mu)^2$, or by the minimum detectable difference, Δ); and iii) decreased variability within populations, σ2. It is possible to see, then, that
if the experimenter is interested in detecting small effects (Δ), more replicates are required than if the experimenter is interested in detecting large effects.
Furthermore, power increases with larger significance
levels, α, as shown in Figure 3, but of course, at the
cost of an increased risk of a type I error. The only
way to reduce both types of error simultaneously is to
increase r. In practice, we fix a small value for α (significance level) to guard against type I errors, reducing
the risk of false claims (Aaron and Hays, 2004).
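These two statements (power rises with λ and with α) can be verified numerically in R for the distributions shown in Figures 2 and 3 (13 and 22 df, λ = 10 or 20); a small sketch using only qf and pf:

> df1 <- 13; df2 <- 22
> 1 - pf(qf(0.95, df1, df2), df1, df2, ncp = c(10, 20))   # power at alpha = 0.05 for lambda = 10 and 20
> 1 - pf(qf(0.90, df1, df2), df1, df2, ncp = c(10, 20))   # power at alpha = 0.10: larger for each lambda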
Figure 3. Central and noncentral F distributions with 13 and 22 df and noncentrality parameter equal to 20, for significance levels 0.05 and
0.10.
COMPUTING THE POWER
To compute the power, we can use the following functions from the R software (R Development Core Team,
2011): qf for computing the q (= F) critical value under the central F distribution, for which the arguments are
1 − α, df1 (degrees of freedom of the numerator), df2 (degrees of freedom of the denominator); pf for computing
probabilities for the noncentral F distribution, for which the arguments are q (= F), df1, df2, and ncp (= λ); the power is then 1 minus the probability returned by pf at the critical value. Brien
(2011) created a function power.exp available in the dae library for R for computing the power of an experiment.
For example, if we want to determine the power of an RCBD with 5 replicates and 6 treatments in detecting Δ =
50 g with α = 0.05 and σ2 = 3,913, the usage and arguments for this function are
> rm <- 5
> power.exp(rm = rm, df.num = 5, df.denom = 5*(rm-1), delta = 50,
+   sigma = sqrt(3913), print = TRUE)
  rm df.num df.denom    α delta    sigma  lambda      powr
1  5      5       20 0.05    50 62.55398 1.59724 0.1117699
[1] 0.1117699
giving a power of 0.1118, which is not good, because the chance of correctly rejecting H0 is low. This is illustrated in Figure 4.

Figure 4. Central and noncentral F distributions with 5 and 20 df and noncentrality parameter equal to 1.60, for experiment 4.

We can produce entire power curves for a whole series of values of λ, as shown in Figure 5.
Figure 5. Power curves, considering 6 treatments and (left) for a fixed Δ and varying values of σ2; (right) for a fixed σ2 and varying the
values of Δ.
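The same power can also be computed directly from qf and pf, as described at the beginning of this section, without the dae package; a minimal sketch, assuming the values used above for experiment 4 (k = 6, r = 5, Δ = 50 g, σ2 = 3,913, α = 0.05):

> k <- 6; r <- 5; alpha <- 0.05; delta <- 50; sigma2 <- 3913
> df1 <- k - 1                            # numerator (treatment) df
> df2 <- (k - 1)*(r - 1)                  # error df for an RCBD
> lambda <- r*delta^2/(2*sigma2)          # conservative noncentrality parameter, equation [2] with m = 1
> Fcrit <- qf(1 - alpha, df1, df2)        # critical value under H0
> 1 - pf(Fcrit, df1, df2, ncp = lambda)   # power; should reproduce the 0.1118 given by power.exp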
Table 2. Summary of the analysis of live weights for 5 experiments, with k treatments (A, control) and r replicates, considering the variable live weight1

                                                       Treatment mean (g)
Exp.  k  r  Error MS  F-test    r*  1 − β      A      B      C      D      E      F  Range
1     4  7     1,829  12.09**   18   0.36  2,872  2,869  2,891  2,767                  124
2     5  6     1,367   0.22     15   0.40  2,366  2,376  2,372  2,374  2,359            17
3     5  7    13,331   0.65    129   0.08  3,204  3,171  3,117  3,149  3,195            87
4     6  5     3,913   0.83     42   0.11  3,040  3,046  3,079  3,012  3,071  3,028     67
5     6  6     1,719   6.12**   19   0.26  2,609  2,587  2,662  2,609  2,625  2,536    126

1r* is the computed number of replicates to detect a 50-g difference with a probability of approximately 0.80.
**Indicates that the F-test is significant at the 1% level.
COMPUTING THE REQUIRED SAMPLE SIZE FOR THE CRD AND RCBD
WITH A SINGLE TREATMENT FACTOR
Given the discussion on the power of a hypothesis test, determining sample size amounts to determining r, the
number of pure replicates of a treatment. This is achieved by computing the power for different values of r until the smallest r that has at least the required power is identified. To do this, you will have to specify the following:
i. the significance level, α;
ii. the power desired, 1 − β;
iii. the number of treatments, k, to be investigated;
iv. the minimum size of the difference to be detected between a pair of treatment means, as measured by Δ;
v. the magnitude of the uncontrolled variation expected, σ2.
In obtaining values for these, often the results from a previous experiment you or another researcher has run
will be useful. A convenient method to determine the number of replicates is provided by the function no.reps
from library dae (Brien, 2010). For example, using the results from experiment 4, in Table 2, the following R code
calculates the number of pure replicates required to achieve a power of 0.80 with a significance level of 0.05 in
detecting a minimum difference of 50 g:
> no.reps(multiple = 1, df.num = 5, df.denom = expression(df.num*(r-1)),
+   delta = 50, sigma = sqrt(3913), power = 0.8, print = FALSE)
        rm df.num df.denom    α delta    sigma   lambda      powr
1 41.26343      5 201.3171 0.05    50 62.55398 13.18152 0.7999115
$nreps
[1] 42

$power
[1] 0.8081996
These results indicate that the required number of replicates is 42 and that the power of the experiment would
be 0.8082.
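The search described above, computing the power for increasing r until the target is reached, can also be written out explicitly; a minimal sketch assuming the same inputs (k = 6, Δ = 50 g, σ2 = 3,913, α = 0.05, target power 0.80) and the RCBD error df:

> power.rcbd <- function(r, k, delta, sigma2, alpha = 0.05) {
+   df1 <- k - 1; df2 <- (k - 1)*(r - 1)          # treatment and error df for an RCBD
+   lambda <- r*delta^2/(2*sigma2)                # equation [2] with m = 1
+   1 - pf(qf(1 - alpha, df1, df2), df1, df2, ncp = lambda)
+ }
> r <- 2                                          # start at 2 so the error df are positive
> while (power.rcbd(r, k = 6, delta = 50, sigma2 = 3913) < 0.80) r <- r + 1
> r                                               # smallest r with at least 80% power; should agree with the 42 given by no.reps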
APPLICATIONS
In the examples presented in Table 2, the effects of
k treatments (A was a control) on the live weight of
chickens were investigated, using a CRD (experiment
1) or a RCBD (experiments 2, 3, 4, and 5), with r replicates. It was concluded that the differences between the
treatment means were not significant for experiments
2, 3, and 4 and were significant for experiments 1 and
5. Based on the information of these experiments, it is
possible to calculate: a) the number of replicates (r*)
necessary to detect a difference of 50 g between the
means of the treatments with a power of 80%; and b)
the power to detect a difference of 50 g between the
means of the treatments for a similar experiment. We
see that experiments 1, 2, and 5 had the most power,
although the probability of detecting a difference of 50
g is only between 0.26 and 0.40. To have an 80% chance
of detecting a difference of 50 g, 15 or more replicates
are needed. Of these 3 experiments, only experiment 2
was not significant and this is due to small differences
between the treatment means (a range of 17 g). Experiments 3 and 4 had the least power, due to greater
error variation and, in the case of experiment 4, less
replication. As they also had only moderate differences
between the treatment means, it is not surprising that
the treatments were not significantly different.
In summary, it would appear that about 20 replicates
are required to have an 80% chance of detecting a difference of about 50 g. Fewer replicates will mean that
there is less chance of detecting a difference of this
size and that only larger differences are likely to be
detected. An alternative is to try to reduce the error variability, for example, by improving the experimental protocols or by selecting more uniform birds.
REPLICATES VERSUS SUBSAMPLING
A frequent question is whether to use more pens with fewer birds or fewer pens with more birds. The answer depends on the relative magnitude of experimental and sampling errors and on costs. Determining the sample size is complicated because it involves 2 sources of uncontrolled variation:

i. between-pen variation, $\sigma^2_M$, and
ii. between-bird, within-pen variation, $\sigma^2_S$,

and requires a guess of the values of these 2 variances. Then the sample size, r, is computed as described, except that the value of the variance is
\[ \sigma^2 = \sigma^2_S + m\,\sigma^2_M. \]
However, this expression involves m, the number of
birds per pen. A simple approach here is to nominate 2
or 3 alternative values for m and then the value of r can
be determined for a given power and each value of m.
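As an illustration of this approach, the required number of pens r can be computed for a few candidate values of m with the direct power calculation; in the sketch below the two variance components are purely hypothetical guesses, not estimates from any of the experiments reported here, and the RCBD error df are assumed:

> sigma2.M <- 2000; sigma2.S <- 8000       # assumed between-pen and within-pen variances (illustrative only)
> k <- 6; delta <- 50; alpha <- 0.05; target <- 0.80
> power.pen <- function(r, m) {
+   df1 <- k - 1; df2 <- (k - 1)*(r - 1)
+   sigma2 <- sigma2.S + m*sigma2.M        # combined per-bird variance for pen data
+   lambda <- r*m*delta^2/(2*sigma2)       # equation [2]
+   1 - pf(qf(1 - alpha, df1, df2), df1, df2, ncp = lambda)
+ }
> for (m in c(10, 20, 40)) {
+   r <- 2
+   while (power.pen(r, m) < target) r <- r + 1
+   cat("m =", m, "birds per pen: r =", r, "pens per treatment\n")
+ }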
One point here, which bears on minimizing the error
variance for pens, is that it is best to assign birds that
are different to the same pen: for example, light and
heavy birds. This can be done, as described in Brien
and Demétrio (1998), by dividing the complete set of
birds into m homogeneous sets. Then, a bird is taken
from each set, at random, to be placed in the same pen.
This has the advantage that an interaction between
treatments and different sets does not contribute to the
error for treatment differences.
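A sketch of this allocation in R, using simulated bird weights and hypothetical numbers of pens and birds per pen (all values illustrative):

> set.seed(1)
> n.pens <- 12; m <- 4                                          # hypothetical numbers of pens and birds per pen
> weight <- rnorm(n.pens*m, mean = 45, sd = 4)                  # simulated individual weights (g)
> bird <- seq_along(weight)
> sets <- split(bird[order(weight)], rep(1:m, each = n.pens))   # m homogeneous sets by weight
> pens <- sapply(sets, sample)                                  # shuffle within each set; row i gives pen i
> pens[1, ]                                                     # the m birds (one per set) allocated to pen 1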
ACKNOWLEDGMENTS
The first and second authors are scholars supported by the National Council for Scientific and Technological
Development (CNPq), Brazil. We also thank professors
Gene Pesti (University of Georgia, Athens) and Lynne
Billard (University of Georgia, Athens) who invited us
to present this paper in the Experimental Design for
Poultry Production and Genomics Research Symposium, held on July 12, 2012, at University of Georgia as
part of the Poultry Science Association 101st Annual
Meeting.
REFERENCES
Aaron, D. K., and V. W. Hays. 2004. How many pigs? Statistical
power considerations in swine nutrition experiments. J. Anim.
Sci. 82:E245–E254.
Berndtson, W. E. 1991. A simple, rapid and reliable method for
selecting or assessing the number of replicates for animal experiments. J. Anim. Sci. 69:67–76.
Brien, C. J. 2010. Design and randomization-based analysis of experiments in R. Accessed Aug. 2012. http://chris.brien.name/ee2.
Brien, C. J. 2011. Functions useful in the design and ANOVA of experiments. Version 2.1–7. Accessed Aug. 2012. http://cran.stat.auckland.ac.nz/web/packages/dae/index.html.
Brien, C. J., and C. G. B. Demétrio. 1998. Using the randomisation in specifying the ANOVA model and table for properly
and improperly replicated grazing trials. Aust. J. Exp. Agric.
38:325–334.
Cochran, W. G., and G. M. Cox. 1957. Experimental Designs. 2nd
ed. John Wiley and Sons, New York, NY.
Cohen, J. 1969. Statistical Power Analysis for the Behavioral Sciences. Academic Press, New York, NY.
FASS (Federation of Animal Science Societies). 2010. Guide for Care
and Use of Agricultural Animals in Research and Teaching. Fed.
Anim. Sci. Soc., Champaign, IL.
Ibrahim, D. M. 2006. Reduce, refine, replace: The failure of the
Three R’s in the future of animal experimentation. University of
Chicago Legal Forum. Arizona Legal Studies Discussion Paper
06–17. http://ssrn.com/abstract=888206.
R Development Core Team. 2011. R: A language and environment
for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/.