Lecture 8
Power and sample size for single samples,
Two-sample tests
Dr. Wim P. Krijnen
Lecturer Statistics
University of Groningen
Faculty of Mathematics and Natural Sciences
Johann Bernoulli Institute
for Mathematics and Computer Science
October 17, 2010
Lecture overview
- Power
  - Concept of power: probability of correctly rejecting H0
  - Probability of Type I error and power: lower tailed test
  - Probability of Type I error and power: upper tailed test
  - Power and sample size for single samples
- Tests for two samples
  - large samples: means, proportions, counts
  - small samples: means, variances, proportions
  - Non-parametric: Rank sum test
Power and sample size for single samples
H0 : θ = θ0 and HA : θ = θA

                   Action
State of nature    H0 not rejected    H0 rejected
H0 true            No error           Type I error
H0 false           Type II error      No error
α is probability of Type I error
β is probability of Type II error: not rejecting false H0
Power of a test: Probability of correctly rejecting H0
power = 1 − P(Type II error given θ = θA )
= 1−β
Question: How to obtain high power?
Power for lower tailed test
H0 : µ = µ0 ,
versus HA : µ < µ0
Choose α to fix xL
xL = Φ⁻¹(α | µ0 , SE) = qnorm(alpha,mu.0,SE)
Action: reject H0 if X̄ < xL ⇒
α ≡ P(Type I error) = P(reject H0 | H0 is true)
= P(X̄ < xL | µ = µ0 ) = Φ(xL | µ0 , SE)
= pnorm(x.L,mu.0,SE)
Power equals the probability of rejecting a false H0
1 − β ≡ P(reject H0 | H0 is false)
= P(X̄ < xL | µ = µA ) = Φ(xL | µA , SE)
= pnorm(x.L,mu.A,SE)
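The two probabilities above can be checked numerically. This is a minimal sketch: the values of mu.0, mu.A, sigma and n are made up for illustration and do not come from the lecture.

```r
# Hypothetical numbers, chosen only to illustrate the lower-tailed test
mu.0 <- 100; mu.A <- 97        # mean under H0 and under HA (mu.A < mu.0)
sigma <- 10; n <- 50
SE <- sigma / sqrt(n)
alpha <- 0.05

x.L <- qnorm(alpha, mu.0, SE)  # critical value: reject H0 if x.bar < x.L
type.I <- pnorm(x.L, mu.0, SE) # P(reject H0 | mu = mu.0), equals alpha by construction
pw <- pnorm(x.L, mu.A, SE)     # P(reject H0 | mu = mu.A), the power
round(c(x.L = x.L, type.I = type.I, power = pw), 3)
```

The Type I error probability returns alpha exactly, because x.L was defined as the alpha-quantile under H0; the power is the same tail area evaluated under the alternative mean.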
Power for upper tailed test
H0 : µ = µ0 ,
versus HA : µ > µ0
Choose α to fix xH
xH = Φ⁻¹(1 − α | µ0 , SE) = qnorm(1 - alpha,mu.0,SE)
Action: reject H0 if X̄ > xH ⇒
α ≡ P(Type I error) = P(reject H0 | H0 is true)
= P(X̄ > xH | µ = µ0 ) = 1 − P(X̄ ≤ xH | µ = µ0 )
= 1 − Φ(xH | µ0 , SE) = 1 − pnorm(x.H,mu.0,SE)
Power equals the probability of rejecting a false H0
1 − β ≡ P(reject H0 | H0 is false)
= P(X̄ > xH | µ = µA ) = 1 − P(X̄ ≤ xH | µ = µA )
= 1 − Φ(xH | µA , SE) = 1 − pnorm(x.H,mu.A,SE)
Example: Prostate cancer
Prostate cancer can lead to prostatectomy, with erectile
dysfunction as a side effect. Patients are treated with sildenafil citrate;
we test the experimental group's mean score on the IIEF test
H0 : µ = 18,
versus HA : µ > 18
mu.0 <- 18; mu.A <- 18.5
s <- 1.23; n <- 48; SE <- s/sqrt(n)
alpha <- 0.05; x.bar <- 18.52
x.H <- qnorm(1 - alpha,mu.0,SE)
round(c(mu.0=mu.0, x.H=x.H, x.bar=x.bar,
alpha=alpha,p.value=1-pnorm(x.bar,mu.0,SE),
power=1-pnorm(x.H,mu.A,SE)),3)
  mu.0     x.H   x.bar   alpha p.value   power
18.000  18.292  18.520   0.050   0.002   0.879
x̄ > xH ⇒ Conclusion: Reject H0
Probability of correctly rejecting H0 is 0.879
Factors determining power: lower tailed test
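HA : µ < µ0 and choose α to fix xL
\[
\alpha \equiv P(\text{reject } H_0 \mid H_0 \text{ is true}) = P(\bar{X} < x_L \mid \mu = \mu_0)
= P\left(\frac{\bar{X} - \mu_0}{SE} < \frac{x_L - \mu_0}{SE}\right) = P(Z < z_L)
\]
\[
\Rightarrow\; z_L = \frac{x_L - \mu_0}{SE} \;\Rightarrow\; x_L = \mu_0 + z_L \cdot SE
\]
Power equals the probability of rejecting a false H0
\[
1 - \beta \equiv P(\text{reject } H_0 \mid H_0 \text{ is false}) = P(\bar{X} < x_L \mid \mu = \mu_A)
= P\left(\frac{\bar{X} - \mu_A}{SE} < \frac{x_L - \mu_A}{SE}\right)
= P\left(Z < \frac{\mu_0 + z_L \cdot SE - \mu_A}{SE}\right)
= P\left(Z < \frac{\mu_0 - \mu_A}{\sigma/\sqrt{n}} + z_L\right)
\]
Increase power by
- increasing n or µ0 − µA
- decreasing σ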
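The effect of each factor can be seen by evaluating the final power expression directly. A sketch under assumed values: the function name power.lower and all numbers below are hypothetical, chosen only for illustration.

```r
# Power of the lower-tailed z-test in its final form:
# P(Z < (mu.0 - mu.A) / (sigma / sqrt(n)) + z.L), with z.L = qnorm(alpha)
power.lower <- function(n, sigma, mu.0, mu.A, alpha = 0.05) {
  pnorm((mu.0 - mu.A) / (sigma / sqrt(n)) + qnorm(alpha))
}

# Hypothetical values: vary one factor at a time
pw.n     <- power.lower(n = c(10, 25, 50, 100), sigma = 10, mu.0 = 100, mu.A = 97)
pw.sigma <- power.lower(n = 25, sigma = c(20, 10, 5), mu.0 = 100, mu.A = 97)
pw.delta <- power.lower(n = 25, sigma = 10, mu.0 = 100, mu.A = c(99, 97, 95))
round(rbind(pw.n = pw.n, pw.sigma = c(pw.sigma, NA), pw.delta = c(pw.delta, NA)), 3)
```

Each row increases from left to right: larger n, smaller σ, and a larger difference µ0 − µA all raise the power.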
Sample size given significance level and power
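Given α, 1 − β, µ0 , µA , σ, and HA : µ < µ0
To compute n we fix zL
\[
\alpha = P(Z < z_L) = \Phi(z_L) \;\Rightarrow\; z_L = \Phi^{-1}(\alpha) = -\Phi^{-1}(1 - \alpha)
\]
\[
1 - \beta = \Phi\left(\frac{\mu_0 - \mu_A}{\sigma/\sqrt{n}} + z_L\right) = \text{power}
\]
To solve with respect to n, use the inverse distribution function
\[
\Phi^{-1}(1 - \beta) = \frac{\mu_0 - \mu_A}{\sigma/\sqrt{n}} - \Phi^{-1}(1 - \alpha)
\;\Rightarrow\;
n = \frac{\sigma^2}{(\mu_0 - \mu_A)^2}\left(\Phi^{-1}(1 - \alpha) + \Phi^{-1}(1 - \beta)\right)^2
\]
sample size needed to correctly reject H0 with probability 1 − β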
Example: Determining sample size
Cardiovascular disease (Rosner, p. 239): population mean 175
mg/dL cholesterol level
Test children of men who died from heart disease: µA = 190, σ = 50
How many children are needed for 90% power at the 5% significance
level?
mu.0 <- 175; mu.A <- 190; sigma <- 50
alpha <- 0.05; beta <- 0.10
(n <- sigma^2/(mu.0 - mu.A)^2 *
(qnorm(1 - alpha) + qnorm(1 - beta))^2)
[1] 95.15386
Conclusion: round up, 96 children are needed
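Rounding up can be checked by computing the power at n = 95 and n = 96. A sketch, assuming (as the text states) a null mean of 175 and an alternative mean of 190, so the test is upper-tailed; the helper power.upper is a hypothetical name introduced here.

```r
mu.0 <- 175; mu.A <- 190; sigma <- 50; alpha <- 0.05  # values from the example

# Power of the upper-tailed z-test at sample size n
power.upper <- function(n) {
  SE <- sigma / sqrt(n)
  x.H <- qnorm(1 - alpha, mu.0, SE)  # critical value under H0
  1 - pnorm(x.H, mu.A, SE)           # P(reject H0 | mu = mu.A)
}
round(c(n.95 = power.upper(95), n.96 = power.upper(96)), 4)
```

With 95 children the power falls just short of 0.90; with 96 it exceeds it, which is why the computed 95.15 is rounded up.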
Power lower tailed test
H0 : π = π0 ,
versus HA : π < π0
Choose α to fix pL
Action: reject H0 if p = π̂ < pL ⇒
α ≡ P(Type I error) = P(reject H0 | H0 is true)
= P(p < pL | π = π0 ) = Φ(pL | π0 , SE) = Φ(pL | π0 , √(π0 (1 − π0 )/n))
= pnorm(p.L,pi.0,sqrt(pi.0*(1-pi.0)/n))
Power equals the probability of rejecting a false H0
1 − β ≡ P(reject H0 | H0 is false)
= P(p < pL | π = πA ) = Φ(pL | πA , SE)
= pnorm(p.L,pi.A,sqrt(pi.A*(1-pi.A)/n))
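To see how good the normal approximation is, the power can be compared with the exact binomial probability of the same rejection region. A sketch with made-up values of pi.0, pi.A and n; the comparison with pbinom is an addition for illustration, not part of the lecture.

```r
# Hypothetical values
pi.0 <- 0.59; pi.A <- 0.55; n <- 500; alpha <- 0.05

p.L <- qnorm(alpha, pi.0, sqrt(pi.0 * (1 - pi.0) / n))  # critical proportion, SE under H0
pw.normal <- pnorm(p.L, pi.A, sqrt(pi.A * (1 - pi.A) / n))  # power, SE under HA

# Exact power: P(p < p.L) = P(count <= ceiling(n * p.L) - 1) for a binomial count
pw.exact <- pbinom(ceiling(n * p.L) - 1, n, pi.A)
round(c(p.L = p.L, pw.normal = pw.normal, pw.exact = pw.exact), 3)
```

Note that the critical value uses the standard error under H0, while the power uses the standard error under HA, exactly as in the two pnorm calls above.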
Construction power plot: Two-tailed
alpha <- 0.05 ; n <- 7515
pi.0 <- 0.59 ; pi.A <- seq(0.55, 0.64, length = 201)
p.L <- qnorm(alpha/2, pi.0,sqrt(pi.0 * (1 - pi.0)/n))
power.L <- pnorm(p.L,pi.A,sqrt(pi.A * (1 - pi.A)/n))
p.H <- qnorm(1-alpha/2,pi.0,sqrt(pi.0 * (1 - pi.0)/n))
power.H <- 1-pnorm(p.H, pi.A,sqrt(pi.A * (1 - pi.A)/n))
power <- power.L + power.H
plot(pi.A, power, type = 'l',
xlab = expression(italic(pi[A])),
ylab = 'power')
polygon(c(0.58, 0.58, 0.60, 0.60, 0.58),
c(0,1.1,1.1,0,0), col = 'gray90')
lines(pi.A, power) #overplot polygon
abline(h = 0.01) ; abline(h = 1.04)
Figure: power as a function of πA
Power is low where πA lies in the gray band (0.58 to 0.60), close to π0 = 0.59
Hypothesis for means from two samples
Two samples from two populations each
X := [X1 , · · · , Xn1 ],
Y := [Y1 , · · · , Yn2 ]
Basic assumption: (Xi , Yi ) independent of Yj for all i ≠ j
Two distributions P(X ≤ x|θ1 ) and P(Y ≤ y|θ2 )
H0 : θ1 = θ2 ⇔ H0 : θ1 − θ2 = θ0 with θ0 = 0
H0 : θ = θ1 − θ2 = θ0 versus one of HA : θ < θ0 , θ ≠ θ0 , θ > θ0
Density of difference in sample means
Suppose n1 > 30, n2 > 30
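H0 : µ = µ1 − µ2 = µ0 versus one of HA : µ < µ0 , µ ≠ µ0 , µ > µ0
The difference in sample means is an estimator
\[
\hat{\mu} \equiv \hat{\mu}_1 - \hat{\mu}_2 = \bar{X}_1 - \bar{X}_2 \equiv \bar{X}
\quad \text{(i.e. convenient notation)}
\]
\[
E[\bar{X}] = E[\bar{X}_1 - \bar{X}_2] = E[\bar{X}_1] - E[\bar{X}_2] = \mu_1 - \mu_2
\]
\[
\mathrm{Var}[\bar{X}] = \mathrm{Var}[\bar{X}_1 - \bar{X}_2] = \mathrm{Var}[\bar{X}_1] + \mathrm{Var}[\bar{X}_2]
= \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}
\]
\[
\Rightarrow\; \sigma_{\bar{X}} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}
\;\Rightarrow\; SE = \sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}
\]
By the CLT, the density of X̄ converges to the normal density φ(µ, σ_X̄)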
Lower, Upper, Two-tailed tests for difference in means
Reject H0 in favor of µ < µ0 if
p-value = P(X̄ ≤ x̄ | µ0 , SE) = Φ(x̄ | µ0 , SE)
= pnorm(x.bar,mu.0,SE) ≤ α
Reject H0 in favor of µ > µ0 if
p-value = P(X̄ ≥ x̄ | µ0 , SE) = 1 − Φ(x̄ | µ0 , SE)
= 1 − pnorm(x.bar,mu.0,SE) ≤ α
Reject H0 in favor of µ ≠ µ0 if
\[
\mu_0 \notin I_{1-\alpha}(\bar{X} \mid \hat{\mu}, \hat{\sigma}^2)
= \left[\Phi^{-1}(\alpha/2 \mid \hat{\mu}, SE),\; \Phi^{-1}(1 - \alpha/2 \mid \hat{\mu}, SE)\right]
\]
= c(qnorm(alpha/2, mu.hat, SE),
qnorm(1- alpha/2, mu.hat, SE))
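The three decision rules can be collected in one small function that works from summary statistics. A sketch only: the function z.test.2, its argument names, and the summary statistics passed to it are all hypothetical.

```r
# Two-sample z-test from summary statistics (large samples assumed).
# All names here are introduced for this sketch.
z.test.2 <- function(mean.1, mean.2, var.1, var.2, n.1, n.2,
                     mu.0 = 0, alpha = 0.05) {
  x.bar <- mean.1 - mean.2                # observed difference in means
  SE <- sqrt(var.1 / n.1 + var.2 / n.2)   # standard error of the difference
  p.lower <- pnorm(x.bar, mu.0, SE)       # p-value for HA: mu < mu.0
  p.upper <- 1 - p.lower                  # p-value for HA: mu > mu.0
  ci <- c(qnorm(alpha / 2, x.bar, SE), qnorm(1 - alpha / 2, x.bar, SE))
  list(x.bar = x.bar, SE = SE, p.lower = p.lower, p.upper = p.upper,
       ci = ci, reject.two.sided = mu.0 < ci[1] || mu.0 > ci[2])
}

# Made-up summary statistics
res <- z.test.2(mean.1 = 52.1, mean.2 = 49.8, var.1 = 36, var.2 = 49,
                n.1 = 60, n.2 = 70)
str(res)
```

The two-sided rule (µ0 outside the interval) gives the same decision as comparing 2 * min(p.lower, p.upper) with alpha, since both are based on the same normal distribution and standard error.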
Example: Capital punishment
setwd("C:/work/StatisticsForAINF/book/ch9")
load('capital.punishment.rda')
cp <- capital.punishment # age at sentencing
age <- cp[, 25] + cp[, 26] * 12 - cp[, 11] - cp[, 12] *
age <- age / 12
na.data <- is.na(cp[, 9]) | is.na(cp[, 11]) |
is.na(cp[, 12]) | is.na(cp[, 23]) | is.na(cp[, 24])
w <- cp$Race == 'White' & !na.data
b <- cp$Race == 'Black' & !na.data
# age at sentencing, whites and blacks
x.1 <- age[w] ; N.1 <- length(x.1)
x.2 <- age[b] ; N.2 <- length(x.2)
mu.1 <- mean(x.1) ; mu.2 <- mean(x.2)
sigma.1 <- sd(x.1) ; sigma.2 <- sd(x.2)
set.seed(1); n.1 <- 50 ; n.2 <- 50
X.1 <- sample(x.1, n.1) ; X.2 <- sample(x.2, n.2)
X.bar.1 <- mean(X.1) ; X.bar.2 <- mean(X.2)
var.1 <- var(X.1) ; var.2 <- var(X.2)
info <- rbind(mean = c(whites = X.bar.1, blacks = X.bar
'variance' = c(var.1, var.2),
'sample size' = c(n.1, n.2))
> round(info, 1)
            whites blacks
mean          30.9   26.5
variance      88.7   40.0
sample size   50.0   50.0
x.bar <- mean(X.2) - mean(X.1) # bl - whi
SE <- sqrt(var(X.1)/n.1 + var(X.2)/n.2)
round(c(x.L =qnorm(0.025,0,SE),x.bar=x.bar,
p.value=pnorm(x.bar,0,SE)),3)
   x.L  x.bar p.value
-3.145 -4.423   0.003 #H0 rejected
x̄ < xL , p-value < α ⇒ Conclusion: Reject H0 : µ1 = µ2
Testing difference between two proportions
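H0 : π := π2 − π1 = π0 , HA : π ≠ π0
\[
\hat{\pi} = \frac{n_{2S}}{n_2} - \frac{n_{1S}}{n_1}; \qquad
p = \frac{n_{1S} + n_{2S}}{n_1 + n_2}
\]
\[
\sqrt{V[\hat{\pi}]} = SE = \sqrt{p(1 - p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}
\]
Reject H0 in favor of HA : π ≠ π0 if
\[
\pi_0 \notin I_{1-\alpha}(X \mid \hat{\pi})
= \left[\Phi^{-1}(\alpha/2 \mid \hat{\pi}, SE),\; \Phi^{-1}(1 - \alpha/2 \mid \hat{\pi}, SE)\right]
\]
= c(qnorm(alpha/2, pi.hat, SE),
qnorm(1- alpha/2, pi.hat, SE))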
Support for death penalty between 1936 and 2004
n <- c(500,500); p <- c(0.59,0.71); alpha <- 0.05
p.bar <- sum(n*p)/sum(n)
SE <- sqrt(p.bar * (1 - p.bar) * sum(1/n)) # sum(1/n) = 1/500 + 1/500
pi.hat <- p[2] - p[1]
round(c(low = qnorm(alpha/2, pi.hat,SE),
high = qnorm(1-alpha/2, pi.hat, SE)),2)
 low high
0.06 0.18
0 ∉ 95% confidence interval [0.06, 0.18] ⇒ reject H0 ;
the proportion of US adults supporting the death penalty increased
from 1936 to 2004
Use built-in-function prop.test
> n <- c(500,500); p <- c(0.59,0.71)
> prop.test(n*p,n) #input two vectors
2-sample test for equality of proportions with continuity correction
data: n * p out of n
X-squared = 15.3011, df = 1, p-value = 9.166e-05
alternative hypothesis: two.sided
95 percent confidence interval:
-0.18065501 -0.05934499
sample estimates:
prop 1 prop 2
  0.59   0.71
Conclusion: reject H0
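The link between prop.test and the z-based calculation with the pooled standard error can be made explicit by switching off the continuity correction. A sketch using the same data; the comparison itself is an addition, not part of the lecture.

```r
n <- c(500, 500); p <- c(0.59, 0.71)
p.bar <- sum(n * p) / sum(n)                  # pooled proportion
SE <- sqrt(p.bar * (1 - p.bar) * sum(1 / n))  # pooled standard error
z <- (p[2] - p[1]) / SE                       # z statistic for H0: pi.2 - pi.1 = 0

res <- prop.test(n * p, n, correct = FALSE)   # no continuity correction
c(z.squared = z^2, X.squared = unname(res$statistic))
c(p.manual = 2 * pnorm(-abs(z)), p.prop.test = res$p.value)
```

Without the correction, X-squared equals z squared and the two-sided p-values agree. prop.test reports the interval for prop 1 − prop 2, which is why its limits are negative here.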
Snapdragon Colors
Mendelian genetic model implies snapdragon flowers are
colored red, pink, and white in ratio 1:2:1
H0 : Pr{Red} = .25; Pr{Pink} = .50; Pr{White} = .25
Experiment on 234 plants has expected
Red : E1 = (.25)(234) = 58.5
Pink : E2 = (.50)(234) = 117
White : E3 = (.25)(234) = 58.5
Observed are O1 = 54 Red, O2 = 122 Pink, O3 = 58 White
Question: Is the theory correct?
Chi-square goodness of fit test
H0 : O = E; HA : O ≠ E
\[
\chi^2_s = \sum_{i=1}^{p} \frac{(O_i - E_i)^2}{E_i}; \qquad df = p - 1
\]
p-value = P(χ²(p−1) ≥ χ²s ) = 1-pchisq(chisquareds,p-1)
- If p-value < α, then reject H0
- If p-value > α, then do not reject H0
Reasoning: All Oi close to Ei ⇒ small χ²s ⇒ large p-value ⇒ do
not reject H0
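The statistic and p-value are easy to compute by hand and to check against chisq.test. A sketch: the function gof and the die-roll counts are hypothetical, made up for illustration.

```r
# Chi-square goodness-of-fit test computed from the formula
gof <- function(o, prob) {
  e <- sum(o) * prob            # expected counts under H0
  chi.s <- sum((o - e)^2 / e)   # test statistic
  df <- length(o) - 1
  c(chi.s = chi.s, df = df, p.value = 1 - pchisq(chi.s, df))
}

o <- c(18, 22, 16, 25, 19, 20)  # made-up counts from 120 rolls of a die
prob <- rep(1 / 6, 6)           # H0: all faces equally likely
res <- gof(o, prob)
round(res, 4)
chisq.test(o, p = prob)         # built-in function for comparison
```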
Example: Snapdragon Colors
Oi observed frequency; Ei expected frequency
Experiment on 234 plants, df=3-1=2
Observed: O1 = 54 Red, O2 = 122 Pink, O3 = 58 White
Expected: E1 = 58.5 Red, E2 = 117 Pink, E3 = 58.5 White
\[
\chi^2_s = \sum_{i=1}^{p} \frac{(O_i - E_i)^2}{E_i}
= \frac{(54 - 58.5)^2}{58.5} + \frac{(122 - 117)^2}{117} + \frac{(58 - 58.5)^2}{58.5}
= 0.56
\]
p-value = 1 - pchisq(0.5641,2) = 0.754236
> o <- c(54,122,58); p <- c(1,2,1)/4
> chisq.test(o, p=p)
Chi-squared test for given probabilities
X-squared = 0.5641, df = 2, p-value = 0.7542
p-value > α ⇒ Conclusion: Do not reject H0 ; the observed
frequencies are consistent with the expected ones
Testing equality of variances
Let ρ = σ1²/σ2² and ρ0 = 1, then we want to test
H0 : ρ = ρ0 , versus HA : ρ ≠ ρ0
Assume X1 , · · · , Xn1 iid, Y1 , · · · , Yn2 iid both normal
X := S1²/S2² has an F (X | n1 − 1, n2 − 1) distribution under H0
reject H0 in favor of ρ < ρ0 if
pL -value = P(X ≤ x | n1 − 1, n2 − 1) = pf(x,n.1-1,n.2-1) ≤ α
reject H0 in favor of ρ > ρ0 if
pH -value = P(X ≥ x | n1 − 1, n2 − 1) = 1-pf(x,n.1-1,n.2-1) ≤ α
reject H0 in favor of ρ ≠ ρ0 if
p-value = 2 min (pL -value, pH -value) ≤ α
The (1 − α) confidence interval for ρ in the two-tailed case equals
\[
\left[\, x / F^{-1}(1 - \alpha/2 \mid n_1 - 1, n_2 - 1),\;\; x / F^{-1}(\alpha/2 \mid n_1 - 1, n_2 - 1) \,\right]
\]
= c(x/qf(1-alpha/2,n.1-1,n.2-1),
x/qf(alpha/2,n.1-1,n.2-1))
Example: Toluene and the Brain
Abuse of toluene (in glue) may have neurological effects
Norepinephrine concentrations of rats kept in a toluene-laden
atmosphere are compared with controls (Rea, 1984, Toxicology)

      Toluene   Control
      543       535
      523       385
      431       502
      635       412
      564       387
      549
n     6         5
ȳ     540.8     444.2
s     66.1      69.6
Using R: var.test
X.1 <- c(543, 523, 431, 635, 564, 549)
X.2 <- c(535, 385, 502, 412, 387)
n.1 <- length(X.1); n.2 <- length(X.2)
alpha <- 0.05 ; x <- var(X.1) / var(X.2)
p.L <- pf(x, n.1 - 1, n.2 - 1)
p.H <- 1 - pf(x, n.1 - 1, n.2 - 1)
> (p <- 2 * min(p.L, p.H))
[1] 0.888902
> round(c(x / qf(1 - alpha / 2, n.1 - 1, n.2 - 1),
+         x / qf(alpha / 2, n.1 - 1, n.2 - 1)), 3)
[1] 0.096 6.659
> var.test(X.1,X.2)
F test to compare two variances
F = 0.9014, num df = 5, denom df = 4, p-value = 0.8889
95 percent confidence interval:
0.09625407 6.65920727
p-value > α ⇒ Conclusion: Do not reject H0 : σ1 = σ2
Testing for equality of means: small sample
X1 , · · · , Xn1 ∼ N(µ1 , σ²), Y1 , · · · , Yn2 ∼ N(µ2 , σ²), all independent
n1 ≤ 30 or n2 ≤ 30 ⇒ small sample approach to test
H0 : µ = µ1 − µ2 = µ0 versus one of HA : µ < µ0 , µ ≠ µ0 , µ > µ0
\[
SE = \sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}, \qquad
\frac{\bar{X}_1 - \bar{X}_2}{SE} = T \sim t_{df = n_1 + n_2 - 2}
\]
Reject H0 in favor of µ ≠ µ0 if p-value
2P (T ≤ −|t|, df=n1 + n2 − 2) = 2*pt(-abs(t),n.1+n.2-2) ≤ α
- makes statistical inference possible for very small data sets
- paired t-test if two measurements per object
- specify assumption σ1² = σ2² or not
- computer applications are slightly different
Example: Toluene and the Brain
SE <- sqrt(var(X.1)/n.1 + var(X.2)/n.2)
> (t <- (mean(X.1) - mean(X.2))/SE)
[1] 2.344736
> (2*pt(-abs(t),n.1+n.2-2))
[1] 0.04368008 #slightly different from below
> t.test(X.1,X.2,var.equal = TRUE)$p.value
[1] 0.04280699
> t.test(X.1,X.2,var.equal = TRUE) #list of output
Two Sample t-test
data: X.1 and X.2
t = 2.3571, df = 9, p-value = 0.04281
sample estimates:
mean of x mean of y
540.8333 444.2000
p-value ≤ α ⇒ Conclusion: Reject H0
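The small gap between the hand-computed p-value (0.0437) and the one from t.test (0.0428) comes from the standard error: with var.equal = TRUE, t.test pools the two sample variances. A sketch reproducing the t.test result by hand; the variable names s2.pooled, SE.pooled, t.pooled and p.pooled are introduced here.

```r
X.1 <- c(543, 523, 431, 635, 564, 549)  # toluene
X.2 <- c(535, 385, 502, 412, 387)       # control
n.1 <- length(X.1); n.2 <- length(X.2)

# Pooled variance under the assumption sigma.1 = sigma.2
s2.pooled <- ((n.1 - 1) * var(X.1) + (n.2 - 1) * var(X.2)) / (n.1 + n.2 - 2)
SE.pooled <- sqrt(s2.pooled * (1 / n.1 + 1 / n.2))
t.pooled <- (mean(X.1) - mean(X.2)) / SE.pooled
p.pooled <- 2 * pt(-abs(t.pooled), n.1 + n.2 - 2)
round(c(t = t.pooled, p.value = p.pooled), 4)
```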
Example paired t-test
15 blood pressure measurements by expert and machine on
same persons (pairs)
H0 : µ = µ1 − µ2 = 0 versus HA : µ 6= 0
> library(UsingR);attach(blood)#15 blood pressures
> t.test(Expert, Machine, paired = TRUE)
Paired t-test
data: Expert and Machine
t = -0.6816, df = 14, p-value = 0.5066
alternative hypothesis: true difference in means is not
95 percent confidence interval:
-4.146615 2.146615
sample estimates:
mean of the differences
-1
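A paired t-test is a one-sample t-test on the within-pair differences. The sketch below shows this with made-up measurements, so that it runs without the UsingR package; the vectors expert and machine are hypothetical and are not the blood data.

```r
# Made-up paired measurements (not the UsingR blood data)
expert  <- c(120, 132, 118, 141, 127, 135, 110, 125)
machine <- c(123, 130, 121, 145, 126, 139, 112, 124)

d <- expert - machine                            # within-pair differences
t.manual <- mean(d) / (sd(d) / sqrt(length(d)))  # one-sample t statistic on d
p.manual <- 2 * pt(-abs(t.manual), length(d) - 1)

res <- t.test(expert, machine, paired = TRUE)
c(t.manual = t.manual, t.builtin = unname(res$statistic))
c(p.manual = p.manual, p.builtin = res$p.value)
```

The degrees of freedom are the number of pairs minus one, which matches df = 14 for the 15 pairs in the output above.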
Overview of t-tests
manual on t.test
t.test(x, y = NULL,
alternative = c("two.sided", "less", "greater"),
mu = 0, paired = FALSE, var.equal = FALSE,
conf.level = 0.95, ...)
Situation                 Specification
one-sample                y = NULL
HA : µ ≠ µ0               alternative = "two.sided"
HA : µ < µ0               alternative = "less"
HA : µ > µ0               alternative = "greater"
H0 : µ = number           mu = number
assumption σ1² = σ2²      var.equal = TRUE
paired data               paired = TRUE
Wilcoxon Rank sum test
Nonparametric or distribution free test
Suppose model
Xi = ei , i = 1, · · · , m,
Yj = em+j + ∆, j = 1, · · · , n,
where ∆ is a shift parameter
H0 : ∆ = 0, versus HA : ∆ ≠ 0
- compute the ranks of (X1 , · · · , Xm , Y1 , · · · , Yn )
- sum the ranks corresponding to Y1 , · · · , Yn
- if the X values are greater than the Y values, the ranks of the Y's will be small
- p-value = probability of a rank sum at least as extreme as the one obtained
- p-value ≤ α ⇒ reject H0
A large sample approximation
We follow Hollander & Wolfe (1973, p. 69) to illustrate the test
Rj denotes the rank of Yj for j = 1, · · · , n, and W = \sum_{j=1}^{n} R_j their sum
Lehmann (1998, Nonparametrics) proves
\[
E[W] = n(n + m + 1)/2, \qquad V[W] = mn(n + m + 1)/12
\]
and that
\[
W^* = \frac{W - E[W]}{(V[W])^{1/2}} = \frac{W - n(n + m + 1)/2}{\sqrt{mn(n + m + 1)/12}} \to N(0, 1)
\]
Test: Reject H0 if P(Z < w* ) < α or P(Z > w* ) < α
Data on permeability constants (Pd , 10⁻⁴ cm/sec) of the human
chorioamnion (a placental membrane) at term (x) and between
12 to 26 weeks gestational age (y).
Question: Is permeability greater after 12 to 26 weeks?
Computing large sample approximation in R
x <- c(0.80, 0.83, 1.89, 1.04, 1.45, 1.38, 1.91,
1.64, 0.73, 1.46)
y <- c(1.15, 0.88, 0.90, 0.74, 1.21)
m <- length(x); n <- length(y)
factor <- factor(c(rep('x', length(x)),
rep('y', length(y))))
ranknr <- rank(c(x,y))
R <- ranknr[factor=='y']
W <- sum(R)
z <- (W - n*(n+m+1)/2)/sqrt(m*n*(n+m+1)/12)
(p.value <- pnorm(z))
[1] 0.1103357
p-value = 0.11 > α ⇒
Conclusion: Do not reject H0 ; no evidence of a shift between groups
Conveniently using built-in-function wilcox.test
x <- c(0.80, 0.83, 1.89, 1.04, 1.45, 1.38, 1.91,
1.64, 0.73, 1.46)
y <- c(1.15, 0.88, 0.90, 0.74, 1.21)
> wilcox.test(x, y, alternative = "greater",
+             exact = FALSE, correct = FALSE)
# H&W large sample approximation
Wilcoxon rank sum test
data: x and y
W = 35, p-value = 0.1103 #R gives corrected W
alternative hypothesis: true location shift is
greater than 0
p-value = 0.11 > α ⇒
Conclusion: Do not reject H0 : no differences between groups
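The statistic W = 35 reported by wilcox.test differs from the rank sum of the y values used in the hand calculation, yet both lead to the same p-value. A sketch of the relation; the variable names W.y, W.x, W.R and z.R are introduced here.

```r
x <- c(0.80, 0.83, 1.89, 1.04, 1.45, 1.38, 1.91,
       1.64, 0.73, 1.46)
y <- c(1.15, 0.88, 0.90, 0.74, 1.21)
m <- length(x); n <- length(y)

r <- rank(c(x, y))
W.y <- sum(r[(m + 1):(m + n)])  # rank sum of y, as in the hand calculation
W.x <- sum(r[1:m])              # rank sum of x
W.R <- W.x - m * (m + 1) / 2    # statistic reported by wilcox.test(x, y)

# Normal approximation based on W.R: mean m*n/2, variance m*n*(m+n+1)/12
z.R <- (W.R - m * n / 2) / sqrt(m * n * (m + n + 1) / 12)
c(W.y = W.y, W.R = W.R, p.value = 1 - pnorm(z.R))
```

R subtracts the smallest possible rank sum of x, m(m + 1)/2, so its W counts the pairs in which an x value exceeds a y value. The z value based on W.R is the negative of the one based on the rank sum of y, so the upper-tail probability here equals the lower-tail probability computed earlier.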
Nonparametric bootstrap for difference in means
Survival time in days of mice after treatment (Efron &
Tibshirani, 1993, p. 11)
treatment <- c(94, 197, 16, 38, 99, 141, 23)
control <- c(52, 104, 146, 10, 50, 31, 40, 27, 46)
library(simpleboot); set.seed(10)
b <- two.boot(treatment, control, mean,
R = 1500, student = TRUE, M = 50)
b$t0[1] # observed difference in means
bci <- boot.ci(b)
hist(b)
abline(v = b$t0[1], col = 'red', lwd = 2)
abline(v = bci$normal[2], lty = 2)
abline(v = bci$normal[3], lty = 2)
Figure: histogram of the bootstrap differences; vertical lines mark the confidence interval for the difference in means
CI for the difference in means contains zero ⇒
Conclusion: Do not reject H0
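The resampling idea can also be written out in base R, without the simpleboot package. This is a sketch only: it computes a simple percentile interval, which is a different interval type from the normal interval drawn in the plot, and the number of replicates R = 2000 is an arbitrary choice.

```r
treatment <- c(94, 197, 16, 38, 99, 141, 23)
control <- c(52, 104, 146, 10, 50, 31, 40, 27, 46)
set.seed(10)
R <- 2000                                # number of bootstrap replicates (arbitrary)

obs <- mean(treatment) - mean(control)   # observed difference in means

# Resample each group with replacement and recompute the difference
boot.diff <- replicate(R,
  mean(sample(treatment, replace = TRUE)) - mean(sample(control, replace = TRUE)))

ci <- quantile(boot.diff, c(0.025, 0.975))  # 95% percentile interval
obs
ci
```

Each group is resampled separately, so the group sizes of 7 and 9 are preserved in every replicate.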