LECTURE 10
POWER
In hypothesis testing we can make a Type I or a Type II error.
             Accept H0             Reject H0
H0 true      correct (1)           Type I error (α)
H0 false     Type II error (β)     correct (2)
The power of the test is P = 1 − β,
which is P(reject H0 |H0 is false) = P( correct (2) ).
Note that some authors use β to mean Power . . .
So, to avoid confusion, use the definition
Power = 1 − P(Type II error).
Both the Type I and Type II errors (and thus the Power) need to be controlled.
In fact, we are faced with a trade–off between the two: if we lower the Type I error, the Type II error will increase, and if we lower the Type II error, the Type I error will increase.
The Type I error is preset, and so we need to have some idea of the Type II error and thus the Power of the test. The sketch below illustrates the trade–off.
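As a minimal illustration (a sketch added here, assuming a one–sided z–test with known σ = 5, n = 26 and a true mean of 3, the numbers used in the t–test example below): lowering α raises the critical value, and β rises with it.

# alpha/beta trade-off for a one-sided z-test of H0: mu = 0
# (sigma assumed known here, unlike the t-test treated below)
alpha <- c(0.10, 0.05, 0.01)
se    <- 5/sqrt(26)                    # standard error of the mean
cv    <- qnorm(1 - alpha, 0, se)       # critical values under H0
beta  <- pnorm(cv, mean = 3, sd = se)  # Type II error when mu = 3
cbind(alpha, beta, power = 1 - beta)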
The regions for the Type I (α) and Type II (β) errors are shown in the figure.
The area to the right of the critical value (cv) under the solid curve (H0) is the Type I error; the area under the dotted curve (Ha) to the left of the cv is the Type II error.
[Figure: normal densities under H0 (solid) and Ha (dotted) against x, with the critical value cv marked; α is the area to the right of cv under the H0 curve, β the area to the left of cv under the Ha curve.]
Consider a simple one sample t–test of
H0 : µ = 0 vs Ha : µ > 0
where the true mean is in fact 3. If we have a sample of 26 observations with s = 5, what is the Power of the test?
Under H0 the test statistic is

T = X̄ / (s/√n) ∼ t(n−1)

since the mean of X is taken to be zero.
When the alternative hypothesis is true, the mean of X is no longer
zero and hence the test statistic no longer follows an ordinary
t–distribution. The distribution is no longer symmetric, but
becomes a non–central t–distribution.
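To see the asymmetry, a quick sketch (added here) overlays the central and non–central t densities for 25 df, using the non–centrality parameter derived below:

# central t (solid) vs non-central t (dashed), 25 df;
# ncp = 3.0594 as computed in the text
curve(dt(x, 25), from = -4, to = 8, ylab = "density")
curve(dt(x, 25, ncp = 3.0594), add = TRUE, lty = 2)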
The lack of symmetry is described by a non–centrality parameter λ, where

λ = diff / (σ/√n) = 3 / (5/√26) = 3 / (5/5.099) = 3.0594

in this case.
Now the critical value for the test is t25,5% = 1.708, so the Type I
error is 5%.
The plot of the cumulative non–central t–distribution imposed by
Ha is shown, together with the critical value :
[Figure: pt(x, 25, ncp = 3.0594) plotted for x from 0 to 6, with a vertical line at the critical value.]
> # after Dalgaard p141
> curve(pt(x,25,ncp=3.0594),from =0, to=6)
> abline(v=qt(0.95,25))
> qt(0.95,25)
[1] 1.708141
>
> pt(qt(0.95,25),25,ncp=3.0594)
[1] 0.0917374
> 1-pt(qt(0.95,25),25,ncp=3.0594)
[1] 0.9082626
The Type II error is the area under the curve to the left of the
critical value as shown by the vertical line.
Thus the Type II error is 0.092 and the Power is 0.908.
The desired value for Power is usually 0.8 to 0.9.
For the two sample t–test, the non–centrality parameter becomes

λ = diff / (σ √(1/n1 + 1/n2))
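As a check (a sketch added here, anticipating the two–sample example solved further below, with n1 = n2 = 35, diff = 3 and sd = 5), the power can be computed directly from the non–central t:

# two-sample, one-sided power via the non-central t
n <- 35
lambda <- 3/(5*sqrt(1/n + 1/n))
1 - pt(qt(0.95, 2*n - 2), 2*n - 2, ncp = lambda)   # approx 0.8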
The R function power.t.test can estimate the sample size needed to attain a given Power.
To show the equivalence with the previous method, we will estimate the sample size needed to achieve the power found for our one sample problem.
> power.t.test(delta=3, sd=5, sig.level=0.05,
+              power=0.908, type="one.sample",
+              alt="one.sided")

     One-sample t test power calculation

              n = 25.97355
          delta = 3
             sd = 5
      sig.level = 0.05
          power = 0.908
    alternative = one.sided
As expected, we find that a sample of 26 is needed!
We can also perform the reverse calculation, ie, given the sample size, find the power.
> power.t.test(n=26, delta=3, sd=5,
+              sig.level=0.05, type="one.sample",
+              alt="one.sided")

     One-sample t test power calculation

              n = 26
          delta = 3
             sd = 5
      sig.level = 0.05
          power = 0.9082645
    alternative = one.sided
The results are equivalent to the original power calculations, ie,
power is approx 91%.
To return to the two sample case, now consider
H0 : µ1 = µ2 vs Ha : µ1 > µ2
Let us find the sample sizes n1 = n2 = n needed to detect a
difference of 3 when the common sd is 5, with a Power = 0.8.
> power.t.test(delta=3, sd=5, sig.level=0.05,
+              power=0.8, alt="one.sided")

     Two-sample t test power calculation

              n = 35.04404
          delta = 3
             sd = 5
      sig.level = 0.05
          power = 0.8
    alternative = one.sided
NOTE: n is number in *each* group
So we need 35 observations in each group.
We now run a similar but not identical problem.
Let us find the sample sizes n1 = n2 = n needed to detect a difference of 3 when the common sd is 5, with a Power = 0.9, for a two sided alternative. Thus we now have Ha : µ1 ≠ µ2.
> power.t.test(delta=3, sd=5, sig.level=0.05,
+              power=0.9, alt="two.sided")

     Two-sample t test power calculation

              n = 59.35157
          delta = 3
             sd = 5
      sig.level = 0.05
          power = 0.9
    alternative = two.sided
NOTE: n is number in *each* group
We now need 59 observations in each group.
For the AOV where we have more than two groups, we need to use the non–central F distribution. The non–centrality parameter is now

Λ = n Σi (µi − µ̄)² / σ²

Note that this differs from the t–test definition! The formulation uses equal numbers (n) in each group, but this is not necessary.
First up, we will use the non–central F distribution in R to verify our calculations for the last t–test. The non–central F in R is simply the standard F with the optional parameter ncp.
There are some points to note :
1. to compare the two, the t–test has to be two–sided.
2. the two ncps are not the same but are related, since in the two sample case

λ = diff / (σ √(2/n))

and so

λ² = n diff² / (2σ²) = Λ

(with two groups, Σi (µi − µ̄)² = diff²/2, so Λ reduces to λ²).
AOV table (df only), where the number of obs = 59 × 2 = 118 :

Source    df
Groups     1
Error    116
TOTAL    117
> qf(0.95,1,116)
[1] 3.922879
> lambda <- 59 * 9 / (25 * 2)   # n diff^2 / (2 sigma^2) = Lambda
> 1-pf(qf(0.95,1,116),df1=1,df2=116,ncp=lambda)
[1] 0.8982733
Thus the non–central F produces a Power of 90% as per the
non–central t.
We now turn our attention to more than two groups, ie, Analysis of
Variance proper.
It is worth noting that the form of the non–centrality parameter
can be seen in the table of Expected Mean Squares for the fixed
effects AOV. This will be exploited later in post–mortem power
calculations on the AOV of data.
Example
Giesbrecht, F.G. and Gumpertz, M.L. (2004), Planning, Construction and Statistical Analysis of Comparative Experiments, Wiley, Hoboken, pp. 62–63.
We have an experiment with a factor at 6 levels, and 8 reps within each level. This gives an AOV table (df only), where the number of obs = 6 × 8 = 48 :

Source    df
Groups     5
Error     42
TOTAL     47

What is the Power of the test?
We are given that the mean is approx 75 and the coefficient of
variation (CV ) is 20%. Remember that
CV = 100 × σ/µ
Thus we can estimate
σ = 0.2 × µ = 15 → σ 2 = 225
The expected treatment means are (65, 85, 70, 75, 75, 80).
We now have all the information needed to calculate the
non–centrality parameter.
(Note the comment on page 61 about the various forms authors use
for this ncp!)
Λ = n Σi (µi − µ̄)² / σ²
  = 8 [(−10)² + (10)² + (−5)² + (0)² + (0)² + (5)²] / 225
  = 8.89
The R calculations :
> lambda <- 8.89
> qf(0.95,5,42)
[1] 2.437693
> pf(2.44,5,42,ncp=lambda,lower.tail=F)
[1] 0.5524059
Thus the probability of detecting a shift in the means of the order nominated is only 55%! We would therefore need more than 8 observations per treatment level to pick up a change in means of the type suggested.
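R's power.anova.test can do this calculation directly. A sketch (added here, not from Giesbrecht and Gumpertz): the one subtlety is that between.var must be scaled so that groups × n × between.var / within.var equals Λ, so here the between-group sum of squares is divided by the number of groups:

# verify the 55% power with power.anova.test;
# groups * n * between.var / within.var should equal Lambda = 8.89
m <- c(65, 85, 70, 75, 75, 80)
power.anova.test(groups = 6, n = 8,
                 between.var = sum((m - mean(m))^2)/6,
                 within.var = 225)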
intermission!
A randomised block model :

yij = µ + τi + βj + εij,   βj ∼ N(0, σβ²),   εij ∼ N(0, σ²)

The object(s) of interest are the τi.
The nuisance effects βj can be subtracted so that the estimates of τi, i = 1, …, nt, have the random effects removed.
The error term εij is the experimental error against which the τi are judged.
• τi = 0 or τi ≠ 0
• we do not know the true values of τi and have to make inference using the estimates, τ̂i.
The statistical test in the AOV is a test of H0 : τi = 0.
If there is low probability that the observed F could have arisen by chance, due to the random sample chosen, Ha : τi ≠ 0 is preferred to H0 : τi = 0.
If systematic effects do exist, the experiment should detect these differences.
The measure of the reliability of assessing τi ≠ 0 is termed power.
If the truth were that τi = 0 (although we are not sure of this), the ratio Treatment MS / Error MS has an F distribution; the F test is used to assign a probability to the observed ratio.
Alternatively, if τi ≠ 0, the ratio has a non-central F distribution, different from the central F used in the AOV.
Example
A randomised block with 3 reps and 8 varieties:

yij = µ + βi + τj + εij

where βi is the block effect, τj the treatment effect and εij the error, with

βi ∼ N(0, σβ²), εij ∼ N(0, σ²)
Table 4: AOV of lucerne variety yields

Source       df    SS         MS        F      E(MS)
replicate     2    41,091     20,545    2.4    σ² + 8σβ²
variety       7    75,437     10,777    1.26   σ² + 3 Σ(i=1..8) τi² / 7
residuals    14    120,218    8,587            σ²
The table of treatment coefficients :

Variety:       1     2     3    4    5    6    7    8
Coefficient:   0   −69   −46   90   81   71   60   22
• Was the experiment designed to account for natural variation? If the blocking is not effective, the systematic component τj is masked by the random component εij.
• Suppose that genuine differences do exist in the population. What was the probability of detecting them with such a design?
For post mortems of AOVs, the non-centrality parameter is estimated by

λ̂ = df1 × (Trt MS − Error MS) / Error MS

where df1 is the df for treatments. This can be calculated from the table of Expected Mean Squares.
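To see why (a short derivation added here, using the E(MS) column of Table 4, with b = 3 blocks): E(Trt MS) − E(Error MS) = b Σ τj² / df1, so

λ = b Σ τj² / σ² = df1 (E(Trt MS) − E(Error MS)) / σ² ≈ df1 (Trt MS − Error MS) / Error MS

on replacing the expected mean squares by the observed ones and σ² by the Error MS.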
For the example :

λ̂ = 7 × (10777 − 8587) / 8587 = 1.785
q = qf(0.95, 7, 14) = 2.76
df1 = 7, df2 = 14
> qf(0.95,7,14)
[1] 2.764199
>
> lambda1 <- 7 *(10777-8587)/8587
> lambda1
[1] 1.785257
> pf(q=qf(0.95,7,14),df1=7,df2=14,ncp=lambda1,lower.tail=F)
[1] 0.0974159
The power appears to be rather low (9.7%) for any serious
detection.
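How many blocks would be needed for respectable power? A rough sketch (added here; it assumes the estimated treatment effects and error variance above hold, so that λ̂ scales linearly with the number of blocks b):

# power as a function of the number of blocks b, assuming
# lambda-hat = b * (sum tau^2 / sigma^2), with the ratio
# estimated as 1.785/3 from the 3-block AOV above
tau.ratio <- 1.785257/3
b   <- 3:30
df2 <- 7*(b - 1)          # error df in an RCB with 8 varieties
pow <- 1 - pf(qf(0.95, 7, df2), 7, df2, ncp = tau.ratio*b)
cbind(b, round(pow, 2))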
Power Curve
A typical use of power.t.test to find the power, given all the other information :

power.t.test(n=NULL, delta=NULL, sd=1, sig.level=0.05, power=NULL,
             type=c("two.sample", "one.sample", "paired"),
             alternative=c("two.sided", "one.sided"))

The use of this function requires that exactly one parameter be set to NULL and all the others be defined with values. The function will return the value of the parameter set to NULL.
The function can thus find :
(i) the sample size (n), given the difference (delta), sd, sig.level and power,
(ii) the power, given n, delta, sd and sig.level,
(iii) the detectable difference (delta), given the other arguments; see the sketch below.
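Case (iii) is not demonstrated elsewhere in these notes, so here is a quick sketch reusing the one-sample example above; it should return a delta close to 3 (recall delta = 3 gave power 0.908):

# detectable difference for the earlier one-sample setting
power.t.test(n = 26, sd = 5, sig.level = 0.05, power = 0.9,
             type = "one.sample", alt = "one.sided")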
For the problem described here the sd is taken as 30, while the
difference to be detected is 50.
The Power Curve shows the power as a function of the number of
replicate observations for a single sample t–test.
> rep <- 2:10
> Power <- power.t.test(n=rep, delta=50, sd=30, type="one.sample",
+                       alt="two.sided")
> plot(rep, Power$power, type="b", xlab="no.of.reps", ylab="Power")
> cbind(rep, Power$power)
      rep
 [1,]   2 0.1469101
 [2,]   3 0.3671472
 [3,]   4 0.6133652
 [4,]   5 0.7932201
 [5,]   6 0.8987716
 [6,]   7 0.9534395
 [7,]   8 0.9795609
 [8,]   9 0.9913504
 [9,]  10 0.9964471
[Figure: the power curve produced above, Power against no.of.reps.]
So using more than 6 observations is wasteful in terms of the power to detect the given difference.
Simulation
Finally, a simulation technique is shown. This is the only procedure available when the sampling distribution of a test statistic cannot be found mathematically, or when a mathematical solution would require too many restrictive assumptions.
The first problem considered is the one sample problem with 26 observations, a difference of 3 and sd of 5. The procedure generates a large number of samples of size 26 from a normal distribution with mean 3 and sd 5, and calculates the T statistic for each. The number falling below the cv of 1.708 is counted. This count estimates the Type II error, ie, the probability that H0 (µ = 0) is accepted even though it is false, the true mean being 3.
> cv <- 1.708
> bigt <- 0
> for (i in 1:1000) {
+   sam <- rnorm(26,3,5)        # sample of 26 from N(3, 5^2)
+   t <- mean(sam)/sd(sam)
+   t <- t * sqrt(26)           # one-sample t statistic
+   bigt <- rbind(bigt,t)
+ }
> T <- bigt[2:1001]             # drop the initial 0
> ppn <- sum(T<cv)              # count of t statistics below the cv
> print(ppn)
[1] 89
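The same estimate can be obtained more idiomatically with replicate() (a sketch added here; the seed is an arbitrary choice, just for reproducibility):

set.seed(1)                      # arbitrary seed
T <- replicate(1000, {
  sam <- rnorm(26, 3, 5)
  sqrt(26) * mean(sam)/sd(sam)   # one-sample t statistic
})
mean(T < 1.708)                  # estimated Type II error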
The table shows the results of varying the number of simulations :

No. of simulations    Type II error (%)
100                   5
100                   12
100                   9
1,000                 8.9
10,000                9.6
True                  9.2
The second example shows the use of simulation to verify the calculation of Type II error, and hence Power, for the example from Giesbrecht and Gumpertz with a factor at 6 levels and 8 reps.
The R code produces random normal deviates around the given factor means with sd = 15. The procedure is repeated many times and the Type II error is estimated by the proportion of F values that fall below the critical value.
# simulation program to estimate the Power of an AOV
Fcv <- qf(0.95,5,42)           # critical value of F(5, 42)
m <- c(65,85,70,75,75,80)      # the expected treatment means
bigF <- c(0,0,0)               # fstatistic has 3 elements (value, df1, df2)
tmt <- rep(1:6,8)
tmt <- factor(tmt)
# nrun is the number of simulation runs
nrun <- 1000
# main loop
for (i in 1:nrun) {
  y <- 0
  # loop to simulate the data for the AOV
  for (j in 1:6) {
    me <- rnorm(8,m[j],15)     # 8 reps around mean m[j], sd = 15
    y <- rbind(y,me)
  }
  # end of data loop
  Y <- as.vector(y[-1,])       # drop the initial 0 row and coerce to a vector
  lm.f <- lm(Y~tmt)
  fv <- summary.lm(lm.f)$fstatistic
  bigF <- rbind(bigF,fv)
}
# end of the main loop
nrun1 <- nrun + 1
F <- bigF[2:nrun1]             # the F values (first column, rows 2 to nrun+1)
# ppn is the number of Type II errors out of nrun simulations
ppn <- sum(F<Fcv)
print(ppn)
# plot of the last simulated data set
boxplot(Y~tmt)
Note that the calculation of Y in R is an example of coercion.
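In brief (a toy illustration added here): as.vector() flattens a matrix column by column, which is why the treatment labels tmt <- rep(1:6,8) line up with the simulated data.

# a 2 x 3 matrix is flattened column-major by as.vector()
m2 <- rbind(c(1,2,3), c(10,20,30))
as.vector(m2)    # 1 10 2 20 3 30, ie, columns first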
The Table shows some results :

No. of runs      No. F < cv    Type II error (%)
100              52            52
100              41            41
1000             456           45.6
1000             462           46.2
Actual (p156)                  44.7
As expected, the larger numbers of runs give more reliable estimates.