M&M Ch 6 Introduction to Inference ... OVERVIEW

Introduction to Inference*

Inference is about parameters (populations) or general mechanisms -- or future observations. It is not about data (samples) per se, although it uses data from samples. One might think of inference as statements about a universe, most of which one did not observe.

Two main schools or approaches:

Bayesian [ not even mentioned by M&M ]
• Makes direct statements about parameters and future observations.
• Uses previous impressions plus new data to update impressions about the parameter(s).
e.g. everyday life; medical tests: pre- and post-test impressions.

Frequentist
• Makes statements about observed data (or statistics from data), used indirectly [but often incorrectly] to assess evidence against certain values of the parameter.
• Does not use previous impressions or data outside of the current study (meta-analysis is changing this).
Unlike Bayesian inference, there is no quantified pre-test or pre-data "impression"; the ultimate statements are about data, conditional on an assumed null or other hypothesis. Thus, an explanation of a p-value must start with the conditional "IF the parameter is ..., the probability that the data would ...".
e.g.
• Statistical Quality Control procedures [for Decisions]
• Sample survey organizations: Confidence intervals
• Statistical Tests of Hypotheses

* The book "Statistical Inference" by Michael W. Oakes is an excellent introduction to this topic and to the limitations of frequentist inference.

Bayes Theorem : Haemophilia

Her brother has haemophilia => Probability (WOMAN is a Carrier) = 0.5
New data: her son is normal (NL).
Update: Prob[ Woman is a Carrier, given her son is NL ] = ??

1. PRIOR [ prior to knowing the status of her son ]
   Prob[ Carrier ] = 0.5            Prob[ Not Carrier ] = 0.5

2. LIKELIHOOD [ Prob( son is NL | each possible status ) ]
   Prob[ son NL | Carrier ] = 0.5   Prob[ son NL | Not Carrier ] = 1.0

3. PRODUCTS of PRIOR and LIKELIHOOD
   Carrier: 0.5 x 0.5 = 0.25        Not Carrier: 0.5 x 1.0 = 0.5

POSTERIOR, given that the son is NL (products scaled so the probabilities add to 1):
   Prob[ Carrier | son NL ] = 0.25 / 0.75 = 0.33
   Prob[ Not Carrier | son NL ] = 0.50 / 0.75 = 0.67

page 1
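A minimal sketch, in Python, of the same PRIOR x LIKELIHOOD -> scaled POSTERIOR arithmetic (nothing here beyond the numbers above):

# Bayes update for the haemophilia example: prior, likelihood of "son is NL",
# products, and the posterior obtained by scaling the products to add to 1.
prior = {"Carrier": 0.5, "Not Carrier": 0.5}
likelihood_son_NL = {"Carrier": 0.5, "Not Carrier": 1.0}

products = {s: prior[s] * likelihood_son_NL[s] for s in prior}   # step 3
total = sum(products.values())                                   # scaling factor
posterior = {s: products[s] / total for s in products}

print(products)    # {'Carrier': 0.25, 'Not Carrier': 0.5}
print(posterior)   # {'Carrier': 0.333..., 'Not Carrier': 0.666...}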
M&M Ch 6 Introduction to Inference ... OVERVIEW
Bayesian Inference ... in general

• Interest is in a parameter θ.

• We have prior information concerning θ in the form of a prior distribution with probability density function p(θ). [To distinguish it, we might use a lower-case p for the prior.]

• We obtain new data x (x could be a single piece of information or something more complex). The likelihood of the data for any contemplated value of θ is given by

    L[ x | θ ] = prob[ x | θ ].

• The posterior probability for θ, GIVEN x, is calculated as

    P( θ | x ) =  L[ x | θ ] p(θ)  /  ∫ L[ x | θ ] p(θ) dθ .

[To distinguish it, we might use an UPPER-CASE P for the POSTERIOR.] The denominator is a summation/integration (the ∫ sign) over the range of θ and serves as a scaling factor that makes P(θ | x) sum (integrate) to 1.

• In the Bayesian approach, post-data statements of uncertainty about θ are made directly from the function P(θ | x).

Bayesian Inference for a quantitative parameter

E.g. interpretation of a measurement that is subject to intra-personal (incl. measurement) variation. Say we know the pattern of inter-personal and intra-personal variation. Adapted from Irwig (JAMA 266(12):1678-85, 1991).

1. PRIOR: p(µ)   [figure: prior distribution of MY MEAN CHOLESTEROL µ]

2. DATA: one measurement (•) on ME (we know there is substantial intra-personal & measurement variation).
   LIKELIHOOD: Prob( • | µ ), which uses a known model for the variation of measurements around µ, i.e. f( • | µ ) for various contemplated values of µ [figure: three candidate values shown -- an under-estimate?, 'on target'?, an over-estimate?].

3. POSTERIOR for µ: the products of PRIOR and LIKELIHOOD (scaled), P( µ | • ). The posterior is a composite of the prior and the data (•).   [figure: posterior distribution of MY MEAN CHOLESTEROL µ]
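A minimal numerical sketch of this prior-times-likelihood calculation on a grid of µ values (the prior mean and SD, the measurement x, and the within-person SD below are made-up values, used only to illustrate the mechanics):

import numpy as np
from scipy.stats import norm

mu_grid = np.linspace(3.0, 9.0, 601)                  # contemplated values of mu
prior = norm.pdf(mu_grid, loc=5.5, scale=1.0)         # p(mu): between-person variation (assumed)
x, sigma_w = 6.8, 0.8                                 # one measurement on ME; within-person SD (assumed)
likelihood = norm.pdf(x, loc=mu_grid, scale=sigma_w)  # L[x | mu] for each contemplated mu

posterior = prior * likelihood
posterior /= posterior.sum()                          # scale so the grid probabilities add to 1

print((mu_grid * posterior).sum())                    # posterior mean: between the prior mean and x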
page 2
M&M Ch 6 Introduction to Inference ... OVERVIEW
Re: Previous 2 examples of Bayesian inference

Cholesterol example

θ = my mean cholesterol level.

p[θ] = ? In the absence of any knowledge about me, one would have to take as a prior the distribution of mean cholesterol levels for the population of my age and sex.

x = one cholesterol measurement on me.

Assume that if a person's mean is θ, the variation around θ is Gaussian with standard deviation σw. (The Bayesian argument does not insist on Gaussian-ness.) So...

L[ X = x | my mean is θ ] is obtained by evaluating the height of the Gaussian(θ, σw) curve at X = x.

P[ θ | X = x ]  =  L[ X = x | θ ] p[θ]  /  ∫ L[ X = x | θ ] p[θ] dθ

If the intra-individual variation is Gaussian with SD σw and the prior is Gaussian with mean µprior and SD σb [b for "between individuals"], then the mean of the posterior distribution is a weighted average of µprior and x, with weights inversely proportional to the squares of σw and σb respectively. So, the less the intra-individual and lab variation, the more the posterior is dominated by the measurement x on the individual --- and vice versa.

Haemophilia example

θ = the possible status of the woman: "Carrier" or "Not a carrier".

p(θ = Carrier) = 0.5 ;  p(θ = Not a Carrier) = 0.5.

x = status of the son.

L[ x = Normal | Woman is Carrier ] = 0.5 ;  L[ x = Normal | Woman is Not Carrier ] = 1.

P( θ = Carrier | x = Normal )
   =  L[x=N | θ=C] p[θ=C]  /  { L[x=N | θ=C] p[θ=C] + L[x=N | θ=Not C] p[θ=Not C] }

[this is the equation for the predictive value of a diagnostic test with binary results]
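Written out (a standard result; µprior is simply the label used above for the prior mean):

    posterior mean of µ  =  [ x/σw² + µprior/σb² ]  /  [ 1/σw² + 1/σb² ]

    posterior SD of µ    =  1 / √( 1/σw² + 1/σb² )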
page 3
M&M Ch 6.1 Introduction to Inference ... Estimating with Confidence
(Frequentist) Confidence Interval (CI) or Interval Estimate for a parameter θ

Formal definition: A level 1 - α Confidence Interval for a parameter θ is given by two statistics θLower and θUpper such that, when θ is the true value of the parameter,

    Prob ( θLower ≤ θ ≤ θUpper ) = 1 - α .

• A CI is a statistic: a quantity calculated from a sample.

• We usually use α = 0.01 or 0.05 or 0.10, so that the "level of confidence", 1 - α, is 99% or 95% or 90%. We will also use "α" for tests of significance (there is a direct correspondence between confidence intervals and tests of significance).

• Technically, we should say that we are using a procedure which is guaranteed to cover the true θ in a fraction 1 - α of applications. If we were not fussy about the semantics, we might say that any particular CI has a 1 - α chance of covering θ.

• [For a given amount of sample data] the narrower the interval from L to U, the lower the degree of confidence in the interval, and vice versa.

The meaning of a CI is often "massacred" in the telling... users slip into Bayesian-like statements without realizing it.

STATISTICAL CORRECTNESS: The frequentist CI (a statistic) is the SUBJECT of the sentence (speak of the long-run behaviour of CI's). In Bayesian parlance, the parameter is the SUBJECT of the sentence [speak of where parameter values lie]. The book "Statistical Inference" by Oakes is good here.

Large-sample CI's

Many large-sample CI's are of the form

    θ̂ ± multiple of SE(θ̂)      or      f⁻¹[ f{θ̂} ± multiple of SE( f{θ̂} ) ] ,

where f is some function of θ̂ which has close to a Gaussian distribution, and f⁻¹ is the inverse function (another motivation is variance stabilization; cf. A&B ch 11). Examples of the latter are:

    θ = odds ratio:     f = ln ;       f⁻¹ = exp
    θ = proportion π:   f = arcsine ;  f⁻¹ = its inverse (sine)
                        f = logit ;    f⁻¹ = exp(•)/[1 + exp(•)]

Method of constructing a 100(1 - α)% CI (in general):

[Diagram: a (point) estimate, with the candidate parameter values θLower (at which the observed estimate would be an "over"-estimate?) and θUpper (an "under"-estimate?).]

NB: the shapes of the distributions may differ at the 2 limits and thus yield asymmetric limits: see e.g. the CI for π, based on the binomial. Notice also the use of the concept of tail area (p-value) to construct the CI.
page 4
M&M Ch 6.1 Introduction to Inference ... Estimating with Confidence
SD's* for "Large Sample" CI's for specific parameters

    parameter θ       estimate θ̂       SD*(θ̂)
    _______________________________________________
    µx                x̄                σx / √n
    π                 p                 √( π[1-π] / n )
    µ∆X               d̄                σd / √n
    µ1 - µ2           x̄1 - x̄2         √( σ1²/n1 + σ2²/n2 )
    π1 - π2           p1 - p2           √( π1[1-π1]/n1 + π2[1-π2]/n2 )

The last two are of the form √( SD1² + SD2² ).

* In practice, measures of individual (unit) variation about θ { e.g. σx, √(π[1-π]), ... } are replaced by estimates ( e.g. sx, √(p[1-p]), ... ) calculated from the sample data, and if sample sizes are small, adjustments are made to the "multiple" used in the multiple of SD(θ̂). To denote the "substitution", some statisticians and texts use the term "SE" rather than SD; others (e.g. Colton, Armitage and Berry) use the term SE for the SD of a statistic -- whether one 'plugs in' SD estimates or not. Notice that M&M delay using SE until p504 of Ch 7.

Semantics behind a Confidence Interval (e.g. for µx)

Probability is 1 – α that x̄ falls within zα/2 SD(x̄) of µx.   (see footnote 1)

    Pr {  µx – zα/2 SD(x̄)  ≤  x̄  ≤  µx + zα/2 SD(x̄)  }  =  1 – α
    Pr {  – zα/2 SD(x̄)  ≤  x̄ – µx  ≤  + zα/2 SD(x̄)  }  =  1 – α
    Pr {  + zα/2 SD(x̄)  ≥  µx – x̄  ≥  – zα/2 SD(x̄)  }  =  1 – α
    Pr {  x̄ + zα/2 SD(x̄)  ≥  µx  ≥  x̄ – zα/2 SD(x̄)  }  =  1 – α
    Pr {  x̄ – zα/2 SD(x̄)  ≤  µx  ≤  x̄ + zα/2 SD(x̄)  }  =  1 – α

Probability is 1 – α that µx "falls" within zα/2 SD(x̄) of x̄.   (see footnote 2)

1 This is technically correct, because the subject of the sentence is the statistic x̄. The statement is about the behaviour of x̄.

2 This is technically incorrect, because the subject of the sentence is the parameter µX. In the Bayesian approach, the parameter is the subject of the sentence. In the special case of "prior ignorance" [e.g. if one had just arrived from Mars], the incorrectly stated frequentist CI is close to a Bayesian statement based on the posterior density function p(µX | data). Technically, we are not allowed to "switch" from one to the other [it is not like saying "because St Lambert is 5 km from Montreal, thus Montreal is 5 km from St Lambert"]. Here µX is regarded as a fixed (but unknowable) constant; it doesn't "fall" or "vary around" any particular spot; in contrast, you can think of the statistic x̄ "falling" or "varying around" the fixed µX.
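The "covers the true value in a fraction 1 - α of applications" reading of footnote 1 can be checked by simulation; a minimal sketch (the values of µ, σ and n are arbitrary):

import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, n_applications = 100.0, 15.0, 25, 10_000

covered = 0
for _ in range(n_applications):
    sample = rng.normal(mu, sigma, size=n)
    xbar = sample.mean()
    half_width = 1.96 * sigma / np.sqrt(n)          # z-based 95% interval, sigma known
    if xbar - half_width <= mu <= xbar + half_width:
        covered += 1

print(covered / n_applications)                     # close to 0.95: a statement about the procedure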
page 5
M&M Ch 6.1 Introduction to Inference ... Estimating with Confidence
Constructing a Confidence Interval for µ

Assumptions & steps (simplified for didactic purposes):

(1) Assume (for now) that we know the SD (σ) of the Y values in the population. If we don't know it, suppose we take a "conservative", or larger than necessary, estimate of σ.

(2) Assume also that either
    (a) the Y values are normally distributed, or
    (b) (if not) n is large enough so that the Central Limit Theorem guarantees that the distribution of possible ȳ's is well enough approximated by a Gaussian distribution.

(3) Choose the degree of confidence (say 90%).

(4) From a table of the Gaussian distribution, find the z value such that 90% of the distribution is between -z and +z. Some 5% of the Gaussian distribution is above, or to the right of, z = 1.645, and a corresponding 5% is below, or to the left of, z = -1.645.

(5) Compute the interval ȳ ± 1.645 SD(ȳ), i.e., ȳ ± 1.645 σ/√n.

Warning: Before observing ȳ, we can say that there is a 90% probability that the ȳ we are about to observe will be within ±1.645 SD(ȳ)'s of µ. But, after observing ȳ, we cannot reverse this statement and say that there is a 90% probability that µ is in the calculated interval.

We can say that we are USING A PROCEDURE IN WHICH SIMILARLY CONSTRUCTED CI's "COVER" THE CORRECT VALUE OF THE PARAMETER (µ in our example) 90% OF THE TIME. The term "confidence" is a statistical legalism to indicate this semantic difference.

Polling companies who say "polls of this size are accurate to within so many percentage points 19 times out of 20" are being statistically correct -- they emphasize the procedure rather than what has happened in this specific instance. Polling companies (or reporters) who say "this poll is accurate .. 19 times out of 20" are talking statistical nonsense -- this specific poll is either "right" or "wrong"! On average, 19 polls out of 20 are "correct". But this poll cannot be right on average 19 times out of 20!

Clopper-Pearson 95% CI for π, based on the observed proportion 4/12

[Figure: two binomial distributions over the counts 0 to 12. At the upper limit πupper = 0.65, the lower tail Prob[ X ≤ 4 ] = 0.025; at the lower limit πlower = 0.10, the upper tail Prob[ X ≥ 4 ] = 0.025. See the similar graph in Fig 4.5, p 120, of A&B.]

Notice that Prob[4] is counted twice, once in each tail. The use of CI's based on mid-P values, where Prob[4] is counted only once, is discussed in Miettinen's Theoretical Epidemiology and in §4.7 of Armitage and Berry's text.
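A minimal sketch of the Clopper-Pearson calculation for 4/12, using the standard beta-quantile form of the two binomial tail equations (a textbook identity, not spelled out in these notes):

from scipy.stats import beta, binom

x, n, alpha = 4, 12, 0.05
lower = beta.ppf(alpha / 2, x, n - x + 1)        # pi at which Prob[ X >= 4 ] = 0.025
upper = beta.ppf(1 - alpha / 2, x + 1, n - x)    # pi at which Prob[ X <= 4 ] = 0.025
print(round(lower, 3), round(upper, 3))          # roughly 0.10 and 0.65, as in the two plots

# check the tail areas at the limits:
print(1 - binom.cdf(x - 1, n, lower))            # ~0.025
print(binom.cdf(x, n, upper))                    # ~0.025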
page 6
M&M Ch 6.2 Introduction to Inference ... Tests of Significanc
(Frequentist) Tests of Significance
Example 2
In 1949 a divorce case was heard in which the sole evidence of
adultery was that a baby was born almost 50 weeks after the
husband had gone abroad on military service.
Use: To assess the evidence provided by sample data in favour of
a pre-specified claim or 'hypothesis' concerning some
parameter(s) or data-generating process. As with confidence
intervals, tests of significance make use of the concept of a
sampling distribution.
[Preston-Jones vs. Preston-Jones, English House of Lords]
To quote the court "The appeal judges agreed that the limit of
credibility had to be drawn somewhere, but on medical
evidence 349 (days) while improbable, was scientifically
possible." So the appeal failed.
Example 1 (see R. A Fisher, Design of Experiments Chapter 2)
STATISTICAL TEST OF SIGNIFICANCE
Pregnancy Duration: 17000 cases > 27 weeks
(quoted in Guttmacher's book)
LADY CLAIMS SHE CAN TELL
WHETHER
MILK WAS POURED
MILK WAS POURED
FIRST
SECOND
30
MILK
25
TEA
20
TEA
MILK
% 15
BLIND TEST
10
5
4
LADY
SAYS
if just
guessing,
probability
of this
result
0 4
4 0
16 / 70
1 / 70
1 3
3 1
2 2
2 2
3 1
1 3
46
44
42
40
38
36
Other Examples:
3. Quality Control (it has given us terminology)
4 Taste-tests (see exercises )
5. Adding water to milk.. see M&M2 Example 6.6 p448
6. Water divining.. see M&M2 exercise 6.44 p471
7. Randomness of U.S. Draft Lottery of 1970.. see M&M2
Example 6.6 p105-107, and 4478. Births in New York City after the "Great Blackout"
9 John Arbuthnot's "argument for divine providence"
10 US Presidential elections: Taller vs. Shorter Candidate.
36 / 70
1 / 70
34
In U.S., [Lockwood vs. Lockwood, 19??], a 355-day pregnancy was found to be
'legitimate'.
4
16 / 70
32
Week
0
0
30
28
0
4 0
0 4
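A minimal sketch of the "just guessing" probabilities for Example 1 (the 1/70, 16/70, 36/70 values above, obtained from the 70 equally likely ways of choosing 4 cups out of 8):

from math import comb
from fractions import Fraction

total = comb(8, 4)                                     # 70 ways to pick which 4 cups to call "milk first"
for k in range(4, -1, -1):
    p = Fraction(comb(4, k) * comb(4, 4 - k), total)   # k of the 4 true milk-first cups identified
    print(k, p)                                        # 1/70, 16/70, 36/70, 16/70, 1/70

print(Fraction(1, 70))                                 # p-value for a perfect score, under guessing (~0.014)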
page 7
M&M Ch 6.2 Introduction to Inference ... Tests of Significanc
Elements of a Statistical Test
The ingredients and the methods of procedure in a statistical test are:
Elements of a Statistical Test (Preston-Jones case)
1. A claim about a parameter (or about the shape of a distribution,
or the way a lottery works, etc.). Note that the null and alternative
hypotheses are usually stated using Greek letters, i.e. in terms of
population parameters, and in advance of (and indeed without
any regard for) sample data. [ Some have been known to write
hypotheses of the form H: y– = ... , thereby ignoring the fact that
the whole point of statistical inference is to say something about
the population in general, and not about the sample one
happens to study. It is worth remembering that statistical
inference is about the individuals one DID NOT study, not
about the ones one did. This point is brought out in the
absurdity of a null hypothesis that states that in a triangle taste
test, exactly p=0.333.. of the n = 10 individuals to be studied will
correctly identify the one of the three test items that is different
from the two others.]
1. Parameter (unknown) : DATE OF CONCEPTION
2. A probability model (in its simplest form, a set of assumptions)
which allows one to predict how a relevant statistic from a sample
of data might be expected to behave under H0.
2. A probability model for statistic ?Gaussian ?? Empirical?
3. A probability level or threshold or dividing point below which
(i.e. close to a probability of zero) one considers that an event
with this low probability 'is unlikely' or 'is not supposed to
happen with a single trial' or 'just doesn't happen'. This preestablished limit of extreme-ness is referred to as the "α (alpha)
3. A probability level or threshold
(a priori ) "limit of extreme-ness" relative to H0
- for judge to decide
Claim about parameter
H0
DATE ≤ HUSBAND LEFT (use = as 'best case')
Ha
DATE > HUSBAND LEFT
Note extreme-ness measured as conditional probability,
not in days
level" of the test.
page 8
M&M Ch 6.2 Introduction to Inference ... Tests of Significanc
Elements of a Statistical Test ...
Elements of a Statistical Test (Preston-Jones case)
4. A sample of data, which under H0 is expected to follow the
probability laws in (2).
4. data: date of delivery.
5. The most relevant statistic (e.g. y- if interested in inference about
the parameter µ)
5. The most relevant statistic (date of delivery; same as raw data:
n=1)
6. The probability of observing a value of the statistic as extreme or
more extreme (relative to that hypothesized under H0) than we
observed. This is used to judge whether the value obtained is
either 'close to' i.e. 'compatible with' or 'far away from' i.e.
'incompatible with', H0. The 'distance from what is expected
under H0' is usually measured as a tail area or probability and is
referred to as the "P-value" of the statistic in relation to H0.
6. The probability of observing a value of the statistic as extreme or
more extreme (relative to that hypothesized under H0) than we
observed
7. A comparison of this "extreme-ness" or "unusualness" or
"amount of evidence against H0 " or P-value with the agreed-on
"threshold of extreme-ness". If it is beyond the limit, H0 is said
to be "rejected". If it is not-too-small, H0 is "not rejected".
These two possible decisions about the claim are reported as "the
null hypothesis is rejected at the P= α significance level" or "the
7. A comparison of this "extreme-ness" or "unusualness" or
"amount of evidence against H0 " or P-value with the agreed-on
"threshold of extreme-ness". Judge didn't tell us his threshold,
but it must have been smaller than that calculated in 6.
P-value = Upper tail area : Prob[ 349 or 350 or 351 ...] : quite
small
Note: the p-value does not take into account any other 'facts',
prior beliefs, testimonials, etc.. in the case. But the judge
probably used them in his overall decision (just like the jury did
in the OJ case).
null hypothesis is not rejected at a significance level of 5%".
.
page 9
M&M Ch 6.3 and 6.4 Introduction to Inference ... Use and Misuse of Statistical Tests
"Operating" Characteristics of a Statistical Test
The quantities (1 - β) and (1 - α) are the "sensitivity
(power)" and "specificity" of the statistical test.
Statisticians usually speak instead of the complements of
these probabilities, the false positive fraction (α ) and the
false negative fraction (β) as "Type I" and "Type II" errors
respectively [It is interesting that those involved in
diagnostic tests emphasize the correctness of the test
results, whereas statisticians seem to dwell on the errors of
the tests; they have no term for 1-α ].
As with diagnostic tests, there are two ways a statistical test can be wrong:
1) The null hypothesis was in fact correct but the
sample was genuinely extreme and the null
hypothesis was therefore (wrongly) rejected.
2) The alternative hypothesis was in fact correct but
the sample was not incompatible with the null
hypothesis and so it was not ruled out.
Note that all of the probabilities start with (i.e. are
conditional on knowing) the truth. This is exactly
analogous to the use of sensitivity and specificity of
diagnostic tests to describe the performance of the tests,
conditional on (i.e. given) the truth. As such, they describe
performance in a "what if" or artificial situation, just as
sensitivity and specificity are determined under 'lab'
conditions.
The probabilities of the various test results can be put in the same type of 2x2 table used to show the characteristics of a diagnostic test.

                        Result of Statistical Test
                        "Negative"              "Positive"
                        (do not reject H0)      (reject H0 in favour of Ha)
    TRUTH   H0          1 - α                   α
            Ha          β                       1 - β

So, just as we cannot interpret the result of a Dx test simply on the basis of sensitivity and specificity, likewise we cannot interpret the result of a statistical test in isolation from what one already thinks about the null/alternative hypotheses.
page 10
M&M Ch 6.3 and 6.4 Introduction to Inference ... Use and Misuse of Statistical Tests
Interpretation of a "positive statistical test"

It should be interpreted in the same way as a "positive diagnostic test", i.e. in the light of the characteristics of the subject being examined. The lower the prevalence of disease, the lower is the post-test probability that a positive diagnostic test is a "true positive". Similarly with statistical tests. We are now no longer speaking of sensitivity = Prob( test + | Ha ) and specificity = Prob( test - | H0 ) but rather, the other way round, of Prob( Ha | test + ) and Prob( H0 | test - ), i.e. of positive and negative predictive values, both of which involve the "background" from which the sample came.

The influence of "background" is easily understood if one considers an example such as a testing program for potential chemotherapeutic agents. Assume a certain proportion P are truly active and that statistical testing of them uses Type I and Type II errors of α and β respectively. A certain proportion of all the agents will test positive, but what fraction of these "positives" are truly positive? It obviously depends on α and β, but it also depends in a big way on P, as is shown below for the case of α = 0.05, β = 0.2.

    P                  -->   0.001     0.01     0.1      0.5
    TP = P(1-β)        -->   .00080    .0080    .080     .400
    FP = (1-P)(α)      -->   .04995    .0495    .045     .025
    Ratio TP : FP      -->   ≈ 1:62    ≈ 1:6    ≈ 2:1    ≈ 16:1

Note that the post-test odds TP : FP is

    P(1-β) : (1-P)(α)  =  { P : (1-P) }  ×  [ (1-β) / α ]
         PRIOR                 ×       function of the TEST's characteristics

i.e. it has the form of a "prior odds" P : (1-P), the "background" of the study, multiplied by a "likelihood ratio positive" which depends only on the characteristics of the statistical test.

A Popular Misapprehension: It is not uncommon to see or hear seemingly knowledgeable people state that

    "the P-value (or alpha) is the probability of being wrong if, upon observing a statistically significant difference, we assert that a true difference exists."

But if one follows the analogy with diagnostic tests, this statement is like saying that

    "1-minus-specificity is the probability of being wrong if, upon observing a positive test, we assert that the person is diseased."

We know [from dealing with diagnostic tests] that we cannot turn Prob( test | H ) into Prob( H | test ) without some knowledge about the unconditional or a-priori Prob( H )'s.

Glantz (in his otherwise excellent text) and Brown (Am J Dis Child 137: 586-591, 1983 -- on reserve) are two authors who have made statements like this. For example, Brown, in an otherwise helpful article, says (italics and strike-through by JH):

    "In practical terms, the alpha of .05 means that the researcher, during the course of many such decisions, accepts being wrong one in about every 20 times that he thinks he has found an important difference between two sets of observations." 1

The text by Oakes is helpful here.

1 [Incidentally, there is a second error in this statement: it has to do with equating a "statistically significant" difference with an important one... minute differences in the means of large samples will be statistically significant.]
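A minimal sketch of the TP:FP table above (α = 0.05, β = 0.2), showing the prior-odds x LR+ structure:

alpha, beta = 0.05, 0.2
for P in (0.001, 0.01, 0.1, 0.5):
    tp = P * (1 - beta)                      # truly active AND "positive"
    fp = (1 - P) * alpha                     # truly inactive AND "positive"
    print(P, round(tp, 5), round(fp, 5), round(tp / fp, 2))
# post-test odds = prior odds P:(1-P), multiplied by LR+ = (1-beta)/alpha = 16 here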
page 11
M&M Ch 6.3 and 6.4 Introduction to Inference ... Use and Misuse of Statistical Tests
"SIGNIFICANCE"
The difference between two treatments is 'statistically significant' if it
is sufficiently large that it is unlikely to have risen by chance alone.
The level of significance is the probability of such a large difference
arising in a trial when there really is no difference in the effects of
the treatments. (But the lower the probability, the less likely is it that
the difference is due to chance, and so the more highly significant is
the finding.)
notes prepared by FDK Liddell, ~1970
And then, even if the cure should be performed, how can he be sure
that this was not because the illness had reached its term, or a result
of chance, or the effect of something else he had eaten or drunk or
touched that day, or the merit of his grandmother's prayers?
Moreover, even if this proof had been perfect, how many times was
the experiment repeated? How many times was the long string of
chances and coincidences strung again for a rule to be derived from
it?
Michel de Montaigne 1533-1592
• Statistical significance does not imply clinical importance.
• Even a very unlikely (i.e. highly significant) difference may be
unimportant.
• Non-significance does not mean no real difference exists.
The same arguments which explode the Notion of Luck may, on the
other side, be useful in some Cases to establish a due comparison
between Chance and Design. We may imagine Chance and Design
to be as it were in Competition with each other for the production of
some sorts of Events, and may calculate what Probability there is,
that those Events should be rather owing to one than to the other...
From this last Consideration we may learn in many Cases how to
distinguish the Events which are the effect of Chance, from those
which are produced by Design.
Abraham de Moivre: 'Doctrine of Chances' (1719)
• A significant difference is not necessarily reliable.
• Statistical significance is not proof that a real difference exists.
• There is no 'God-given' level of significance. What level would you
require before being convinced:
If we... agree that an event which would occur by chance only once
in (so many) trials is decidedly 'significant', in the statistical sense,
we thereby admit that no isolated experiment, however significant in
itself, can suffice for the experimental demonstration of any natural
phenomenon; for the 'one chance in a million' will undoubtedly
occur, with no less and no more than its appropriate frequency,
however surprised we may be that it should occur to us.
R A Fisher 'The Design of Experiments'
(First published 1935)
a
to use a drug (without side effects) in the treatment of lung
cancer?
b
that effects on the foetus are excluded in a drug which
depresses nausea in pregnancy?
c
to go on a second stage of a series of experiments with rats?
• Each statistical test (i.e. calculation of level of significance, or
unlikelihood of observed difference) must be strictly independent
of every other such test. Otherwise, the calculated probabilities will
not be valid. This rule is often ignored by those who:
- measure more than on response in each subject
- have more than two treatment groups to compare
- stop the experiment at a favourable point.
page 12
The (many) ways to (in)correctly describe a confidence interval and to talk about p-values (q's from 2nd edition of M&M Chapter 6; answers anonymous)
Below are some previous students' answers to questions from the 2nd Edition of Moore and McCabe, Chapter 6. For each answer, say whether the statement/explanation is correct and why.

Question 6.2
A New York Times poll on women's issues interviewed 1025 women and 472 men randomly selected from the United States excluding Alaska and Hawaii. The poll found that 47% of the women said they do not get enough time for themselves.
(a) The poll announced a margin of error of ±3 percentage points for 95% confidence in conclusions about women. Explain to someone who knows no statistics why we can't just say that 47% of all adult women do not get enough time for themselves.
(b) Then explain clearly what "95% confidence" means.
(c) The margin of error for results concerning men was ±4 percentage points. Why is this larger than the margin of error for women?

1. "True value will be between 43 & 50% in 95% of repeated samples of same size."
   • No. The estimate will be between µ – margin and µ + margin in 95% of samples.

2. "Pollsters say their survey method has 95% chance of producing a range of percentages that includes π."
   • Good. Emphasizes average performance in repeated applications of the method.

3. "If this same poll were repeated many times, then 95 of every 100 such polls would give a range that included 47%."
   • No! See 1.

4. "You're pretty sure that the true percentage π is within 3% of 47%. '95% confidence' means that 95% of the time, a random poll of this size will produce results within 3% of π."
   • Bayesians would object (and rightly so!) to this use of the "true parameter" as the subject of the sentence. They would insist you use the statistic as the subject of the sentence and the parameter as the object. Good to connect the 95% with the long run, not specifically with this one estimate. Always ask yourself: what do I mean by "95% of the time"? If you substitute "applications" for "time", it becomes clearer.

5. "If took 100 different samples, in 95% of cases, the sample proportion will be between 44% and 50%."
   • NO. The sample proportion will be between truth – 3% and truth + 3% in 95% of them.

6. "With this one sample taken, we are sure 95 times out of 100 that 41-53% of the women surveyed do not get enough time for themselves."
   • NO. 95/100 times the estimate will be within 3% of π, i.e. the estimate will be in the interval π – margin to π + margin. The method used gives correct results 95% of the time.

7. "In 95 of 100 comparable polls, expect 44 - 50% of women will give the same answer."
   • NO. Same answer? As what?
   "Given a parameter, we are 95% sure that the mean of this parameter falls in a certain interval."
   • Not given a parameter (ever). If we were, we wouldn't need this course! The "mean of a parameter" makes no sense in frequentist inference.

8. "Using the poll procedure in which the CI or rather the true percentage is within +/- 3, you cover the true percentage 95% of times it is applied."
   • A bit muddled... but "correct in 95% of applications" is accurate.

9. "Confident that a poll (such) as this one would have measured correctly that the true proportion lies between in 95%."
   • ??? [I have trouble parsing this!] In 95% of applications/uses, polls like these come within ±3% of the truth.

10. "95% chance that the info is correct for between 44 and 50% of women."
    • ??? 95% confidence is in the procedure that produced the interval 44-50.

11. "95% confidence -> 95% of the time the proportion given is the good proportion (if we interviewed other groups)."
    • "Correct in 95% of applications."

12. "It means that 47% give or take 3% is an accurate estimate of the population mean 19 times out of 20 such samplings."
    • ??? 95% of applications of the CI give the correct answer. How can the same interval 47% ± 3 be accurate in 19 but not in the other 1?

Q6.4

"This result is trustworthy 19 times out of 20."
• ??? "this" result: cf. the distinction between "my operation is successful 19 times out of 20 …" and "operations like the one to be done on me are successful 19 times out of 20".

"95% of all samples we could select would give intervals between 8669 and 8811."
• Surely not!
page 13
The (many) ways to (in)correctly describe a confidence interval and to talk about p-values (q's from 2nd edition of M&M Chapter 6; answers anonymous)
Question 6.18
The Gallup Poll asked 1571 adults what they considered to be the most serious problem facing the nation's public schools; 30% said drugs. This sample percent is an estimate of the percent of all adults who think that drugs are the schools' most serious problem. The news article reporting the poll result adds, "The poll has a margin of error -- the measure of its statistical accuracy -- of three percentage points in either direction; aside from this imprecision inherent in using a sample to represent the whole, such practical factors as the wording of questions can affect how closely a poll reflects the opinion of the public in general" (The New York Times, August 31, 1987).
The Gallup Poll uses a complex multistage sample design, but the sample percent has approximately a normal distribution. Moreover, it is standard practice to announce the margin of error for a 95% confidence interval unless a different confidence level is stated.
a  The announced poll result was 30% ± 3%. Can we be certain that the true population percent falls in this interval?
b  Explain to someone who knows no statistics what the announced result 30% ± 3% means.
c  This confidence interval has the same form we have met earlier: estimate ± Z*·σestimate. (Actually σ is estimated from the data, but we ignore this for now.) What is the standard deviation σestimate of the estimated percent?
d  Does the announced margin of error include errors due to practical problems such as undercoverage and nonresponse?

1. "This means that the population result will be between 27% and 33% 19/20 times."
   • NO! The population result is wherever it is and it doesn't move. Think of it as if it were the speed of light.

2. "95% of the time the actual truth will be between 30 ± 3% and 5% it will be false."
   • It either is or it isn't … the truth doesn't vary over samplings.

3. "If this poll were repeated very many times, then 95 of 100 intervals would include 30%."
   • NO. 95% of polls give an answer within 3% of the truth, NOT within 3% of the mean in this sample.

4. "Interval of true values ranges b/w 27% + 33%."
   • ??? There is only one true value. AND it isn't 'going' or 'ranging' or 'moving' anywhere!

5. "Confident that in repeated samples estimate would fall in this range 95/100 times."
   • NO. The estimate falls within 3% of π in 95% of applications.

6. "95% of intervals will contain true parameter value and 5% will not. Cannot know whether result of applying a CI to a particular set of data is correct."
   • GOOD. Say "Cannot know whether the CI derived from a particular set of data is correct." We know about the behaviour of the procedure! If not from Mars (i.e. if you use past info), you might be able to bet more intelligently on whether it does or not.

7. "In 1/20 times, the question will yield answers that do not fall into this interval."
   • No. In 5% of applications, the estimate will be more than 3% away from the true answer. See 1, 2, 3 above.

8. "This type of poll will give an estimate of 27 to 33% 19 times out of 20 times."
   • NO. It won't give 27 ± 3 19/20 times. The estimate will be within ±3 of the truth in 19/20 applications.

9. "5% risk that µ is not in this interval."
   • ??? If an after-the-fact statement, somewhat inaccurate.

10. "95 out 100 times when doing the calculations the result 27-33% would appear."
    • No it wouldn't. See 1, 2, 3, 7.

11. "95% prob computed interval will cover parameter."
    • Accurate if viewed as a prediction.

12. "The true popl'n mean will fall within the interval 27-33 in 95% of samples drawn."
    • NO. The true popl'n mean will not "fall" anywhere. It's a fixed, unknowable constant. Estimates may fall around it.
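A minimal sketch of the part (c) calculation, using the sample percent 30% and n = 1571 (the simple binomial SD formula from the earlier table):

from math import sqrt

p_hat, n = 0.30, 1571
sd_estimate = sqrt(p_hat * (1 - p_hat) / n)
print(sd_estimate)          # about 0.0116, i.e. roughly 1.2 percentage points
print(1.96 * sd_estimate)   # about 0.023; the announced +/- 3 points is larger (more conservative)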
page 14
The (many) ways to (in)correctly describe a confidence interval and to talk about p-values (q's from 2nd edition of M&M Chapter 6; answers anonymous)
Question 6.22
In each of the following situations, a significance test for a population mean µ is called for. State the null hypothesis H0 and the alternative hypothesis Ha in each case.
a  Experiments on learning in animals sometimes measure how long it takes a mouse to find its way through a maze. The mean time is 18 seconds for one particular maze. A researcher thinks that a loud noise will cause the mice to complete the maze faster. She measures how long each of 10 mice takes with a noise as stimulus.
b  The examinations in a large accounting class are scaled after grading so that the mean score is 50. A self-confident teaching assistant thinks that his students this semester can be considered a sample from the population of all students he might teach, so he compares their mean score with 50.
c  A university gives credit in French language courses to students who pass a placement test. The language department wants to know if students who get credit in this way differ in their understanding of spoken French from students who actually take the French courses. Some faculty think the students who test out of the courses are better, but others argue that they are weaker in oral comprehension. Experience has shown that the mean score of students in the courses on a standard listening test is 24. The language department gives the same listening test to a sample of 40 students who passed the credit examination to see if their performance is different.

1. "Ho: Is there good evidence against the claim that πmale > πfemale.  Ha: Fail to give evidence against the claim that πmale > πfemale."
   • NO. Hypotheses do not include statements about data or evidence. This student mixed parameters and statistics/data. Put Ho and Ha in terms of the parameters πmale vs πfemale only; H's have nothing to do with new data. Evidence has to do with p-values, data.

   "Ho: a loud noise has no effect on the rapidity of the mouse to find its way through the maze."
   • OK if being generic, but not if it makes a prediction about a specific mouse (it sounds like this student was talking about a specific mouse). H is about the mean of a population, i.e. about mice (plural). It is not about the 10 mice in the study!

2. "x̄ = average time of 10 mice w/ loud noise.  Ho: µ - x̄ = 0, or µ = x̄."
   • NO! Ho must be in terms of parameter(s). IT MUST NOT SPEAK OF DATA.

Question 6.24
A randomized comparative experiment examined whether a calcium supplement in the diet reduces the blood pressure of healthy men. The subjects received either a calcium supplement or a placebo for 12 weeks. The statistical analysis was quite complex, but one conclusion was that "the calcium group had lower seated systolic blood pressure (P=0.008) compared with the placebo group." Explain this conclusion, especially the P-value, as if you were speaking to a doctor who knows no statistics. (From R.M. Lyle et al., "Blood pressure and metabolic effects of calcium supplement in normotensive white and black men," Journal of the American Medical Association, 257 (1987), pp. 1772-1776.)

1. "The P-value is a probability: 'P=0.008' means 0.8%. It is the probability, assuming the null hypothesis is true, that a sample (similar in size and characteristics as in the study) would have an average BP this far (or further) below the placebo group's average BP. In other words, if the null hypothesis is really true, what's the chance 2 groups of subjects would have results this different or more different?"
   • Not bad!

2. "Only a 0.008 chance of finding this difference by chance if, in the population, there really was no difference between treatment and control groups."
   • Good!

3. "The p-value of .008 means that the probability of the observed results, if there is, in fact, no difference between 'calcium' and 'placebo' groups, is 8/1000 or 1/125."
   • Good, but would change to "the observed results or results more extreme."
page 15
The (many) ways to (in)correctly describe a confidence interval and to talk about p-values (q's from 2nd edition of M&M Chapter 6; answers anonymous)
(Question 6.24, continued)

4. "The p-value measures the probability or chance that the calcium supplement had no effect."
   • No. First, Ho and Ha refer not just to the n subjects studied, but to all subjects like them; they should be stated in the present (or even future) tense. Second, the p-value is about data, under the null H. It is not about the credibility of Ho or Ha.

5. "There is strong evidence that Ca supplement in the diet reduces the blood pressure of healthy men. The probability of this being a wrong conclusion, according to the procedure and data collected, is only 0.008 (i.e. 0.8%)."
   • Stick to the "official wording": ".. IF Ca makes no ∆ to average BP, the chance of getting ...". Notice the illegal attempt to make the p-value into a predictive value -- about as illegal as a statistician trying to interpret a medical test that gave a reading in the top percentile of the 'healthy' population -- without even seeing the patient!

6. "Only 0.8% chance that the lower BP in Calcium group is lower than placebo due to chance."
   • If Ca++ does nothing, then the prob. of obtaining a result ≥ this extreme is ... The wording borders on the illegal.

7. "The chance that the supplement made no change or raised the B/P is very slim."
   • NO! The p-value is a conditional statement, predicated on (calculated on the supposition that) Ca makes no difference to µ. H's are often stated in the present tense; the p-value is more 'after the data', in a 'past-conditional' tense. Again, wording bordering on the illegal.

8. "There is 0.8% that this difference is due to chance alone and 99.2% chance that this difference is a true difference."
   • Not really... Just like the previous statements, this type of language is crossing over into Bayes-speak.

9. "There has been a significant reduction in the BP of the treated group... there's only a probability of 0.8% that this is due to chance alone."
   • NO. Cannot talk about the cause... Can say "IF no other cause than chance, then the prob. of getting a difference ≥ this size is ..."

Question 6.32
The level of calcium in the blood in healthy young adults varies with mean about 9.5 milligrams per deciliter and standard deviation about σ = 0.4. A clinic in rural Guatemala measures the blood calcium level of 180 healthy pregnant women at their first visit for prenatal care. The mean is x̄ = 9.57. Is this an indication that the mean calcium level in the population from which these women come differs from 9.5?
a  State H0 and Ha.
b  Carry out the test and give the P-value, assuming that σ = 0.4 in this population. Report your conclusion.
c  The 95% confidence interval for the mean calcium level µ in this population is obtained from the margin of error, namely 1.96 × 0.4/√180 = 0.058, i.e. as 9.57 ± 0.058, or 9.512 to 9.628. We are confident that µ lies quite close to 9.5. This illustrates the fact that a test based on a large sample (n = 180 here) will often declare even a small deviation from H0 to be statistically significant.

1. "95% of the time the mean will be included in the interval of 9.512 to 9.628 and 5% will be missed."
   • No. See 1, 2, 3 in Q6.18 above.

2. "Ho: There is no difference between the sample area and the population area: H0: µ = x̄.  Ha: There is a significant difference between the sample mean and the population area."
   • NO. This is quite muddled. Unless one takes a census, there will always -- because of sampling variability -- be some non-zero difference between x̄ and µ. The question posed is whether "the mean calcium level (µ) in the population from which these women come differs from 9.5".
   ALSO: Must state H's in terms of PARAMETERS. Here there is one population. If there were two populations, identify them by subscripts, e.g. Ho: µarea1 = µarea2. "Significant" is used to interpret data (and can be roughly paraphrased as "evidence that the true parameter is non-zero"). Do not use "significant" when stating hypotheses.

PS: A professor in the dept. of Math and Statistics questioned what we in Epi and Biostat are teaching, after he saw, in a grant proposal submitted by one of our doctoral students (now a faculty member!), a statement of the type
    H0: µ = x̄.
Please do not give our critic any such ammunition! -- JH

page 16
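A minimal sketch of the part (b) calculation for Question 6.32 (the z test and its 2-sided P-value, with σ = 0.4 taken as known):

from math import sqrt
from scipy.stats import norm

xbar, mu0, sigma, n = 9.57, 9.5, 0.4, 180
z = (xbar - mu0) / (sigma / sqrt(n))
p_two_sided = 2 * (1 - norm.cdf(abs(z)))
print(z, p_two_sided)       # z is about 2.35; P is about 0.019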
The (many) ways to (in)correctly describe a confidence interval and to talk about p-values (q's from 2nd edition of M&M Chapter 6; answers anonymous)
3. "95% CI: µ ± 1.96 σ / √180."
   • NO. The CI for µ is x̄ ± 1.96 σ/√180 !!!! If we knew µ, we would say µ ± 0 !! -- and we wouldn't need a statistics course!

4. "Ho: µ = x̄ = 9.57 ;  Ha: µ ≠ x̄ = 9.57."
   • NO. Cannot use sample values in hypotheses. Must use parameters.

5. "µ differs from 9.5 and the probability that this difference is only due to chance is 2%."
   • Correct to say that "we found evidence that µ differs from 9.5". But what does this difference mean? One should not speak about "the probability that this difference is only due to chance": in frequentist inference, we can speak probabilistically only about data (such as x̄). This miss-speak illustrates that we would indeed prefer to speak about µ rather than about the data in a sample. We should indeed start to adopt a formal Bayesian way of speaking, and not 'mix our metaphors' as we commonly do when we try to stay within the confines of the frequentist approach.

6. "Ho: µ = 9.5 ;  Ha: µ ≠ 9.5."
   • Correct. Notice that Ho & Ha say nothing about what you will find in your data.

7. "Q6.44:  Ho: x̄ = 0.5 ;  Ha: x̄ > 0.5."
   • NO. Must write the H's in terms of parameters!

8. "There is a 0.96% probability that this difference is due to chance alone."
   • NO. This shorthand is so short that it misleads. If you want to keep it short, say something like "the difference is larger than expected under sampling variation alone". Don't get into attribution of cause.

Rather than leave this column blank...
http://www.stat.psu.edu/~resources/Cartoons/
http://watson.hgen.pitt.edu/humor/
page 17
M&M Ch 7.1 (FREQUENTIST) Inference for
Inference for
CI(
Student"'s 't distribution (continued)
: A&B Ch 7.1 ; Colton Ch 4;
small n: => "Student" 's t distribution
Distribution (histogram of sampling distribution)
Use when replace σ by s (an estimate of σ) in CI's and tests.
(1) Assume that either
(a) the Y values are either normally distributed or
(b) if not, n is large enough so that the Central Limit Theorem guarantees that
the distribution of possible y- 's is well enough approximated by a Gaussian
distrn.
is symmetric around 0 ( just like Z =
•
has a shape like that of the Z distribution, but with SD slightly larger than
df
unity i.e. slightly flatter & more wide-tailed; Var(t) =
df–2
•
shape becomes indistinguishable from Z distribution as n -> ∞ (in fact as
n goes much beyond 30)
(2) Choose the desired degree of confidence [50%, 80%, 90%, 99... ] as before.
•
(3) Proceed as above, except that use t Distribution rather than Z -- find the t value
such that xx% of the distribution is between –t and + t. The cutpoints for %iles of the t distribution vary with the amount of data used to estimate σ.
"Student"'s 't distribution is (conceptual) distribution one gets if...
• take samples (of given size n) from Normal(µ, σ) distribution
–
x – µ
from each sample
s/√n
•
form the quantity t =
•
compile a histogram of the results
x– – µ
σ/√n )
•
or, in Gossett's own words ...(W.S. Gossett 1908)
"Before I had succeeded in solving my problem analytically, I
had endeavoured to do so empirically [i.e. by simulation]. The
material I used was a ... table containing the height and left
middle finger measurements of 3000 criminals.... The
measurements were written out on 3000 pieces of cardboard,
which were then very thoroughly shuffled and drawn at
random... each consecutive set of 4 was taken as a sample...
[i.e. n=4 above]... and the mean [and] standard deviation of
each sample determined.... This provides us with two sets of...
750 z's on which to test the theoretical results arrived at. The
height and left middle finger... table was chosen because the
distribution of both was approximately normal..."'
Instead of ± 1.96
Multiple
n
± 3.182
4
3
± 2.228
11
10
± 2.086
21
20
± 2.042
31
30
± 1.980
121
120
± 1.96
∞
∞
•
page 1
σ
to enclose µ with 95% confidence, we need
√n
degrees of freedom ('df')
Test of µ = µ0
CI for µ
–
x –µ
Ratio = s 0
√n
s
–
x ± t √n
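A minimal sketch of where the 95% "multiples" in the table come from (quantiles of the t distribution):

from scipy.stats import t, norm

for df in (3, 10, 20, 30, 120):
    print(df, t.ppf(0.975, df))        # 3.182, 2.228, 2.086, 2.042, 1.98
print("infinity", norm.ppf(0.975))     # 1.96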
M&M Ch 7.1 (FREQUENTIST) Inference for
WORKED EXAMPLE : CI and Test of Significance
WORKED EXAMPLE
Response of interest:
Posture, blood flow, and prophylaxis of venous thromboembolism
D:
INCREASE (D) IN HOURS
OF SLEEP with DRUG
Sir--Ashby and colleagues (Feb 18, p 419) report adverse effects of posture on
femoral venous blood flow. They noted a moderate reduction velocity when a
patient was sitting propped up at 35° in a hospital bed posture and a further
pronounced reduction when the patient was sitting with legs dependent.
Patients recovering from operations are often asked to sit in a chair with their
feet elevated on a footrest. The footrests used in most hospitals, while raising
the feet, compress the posterior aspect of the calf. Such compression may be
important in the aetiology of venous thrombo-embolism. We investigated the
effect of a footrest on blood flow in the deep veins of the calf by dynamic
radionuclide venography.
H0: µD = 0 vs Halt : µD ≠ 0
α =0.05 (2-sided);
Test:
Data:
Subject
HOURS of SLEEP†
DRUG
PLACEBO
1
2
3
4
5
6
7
8
9
10
6.1
7.0
8.2
.
.
.
.
.
.
.
Test statistic =
1.78 - [0]
= 3.18
1.77
DIFFERENCE
Drug - Placebo
0.9
-0.9
4.3
2.9
1.2
3.0
2.7
0.6
3.6
-0.5
–
d = 1.78
SD of 10 differences SD[d] = 1.77
5.2
7.9
3.9
.
.
.
.
.
.
.
Calf venous blood flow was measured in fifteen young (18-31 years) healthy
male volunteers. 88 MBq technetium-99m-labelled pertechnetate in 1 mL saline
was injected into the lateral dorsal vein of each foot, with ankle tourniquets
inflated to 40 mm Hg, and the time the bolus took to reach the lower border of
the patella was measured (Sophy DSX Rectangular Gamma Camera). Each
subject had one foot elevated with the calf resting on the footrest and the other
plantegrade on the floor as a control. The mean transit time of the bolus to the
knee was 24.6 s (SE 2.2) for elevated feet and 14.8 s (SE 2.2) for control feet [see
figure overleaf]. The mean delay was 9.9 s (95% CI 7.8–12.0).
Simple leg elevation without hip flexion increases leg venous drainage and
femoral venous blood flow. The footrest used in this study raises the foot by
extension at the knee with no change in the hip position. Ashby and
colleagues' findings suggest that such elevation without calf compression
would produce an increase in blood flow. Direct pressure of the posterior aspect
of the calf therefore seems to be the most likely reason for the reduction in flow
we observed. Sitting cross-legged also reduced calf venous blood flow,
probably by a similar mechanism. If venous stasis is important in the
aetiology of venous thrombosis, the practice of nursing patients with their feet
elevated on footrests may need to be reviewed.
CR:ref|t9|=2.26
10
JH's Analysis of raw data [data abstracted by eye, so my
calculations won't match exactly with those in text]
Since 3.18 > 2.26, "Reject" H0
95% CI for
µD
= 1.78 ± t
9
1.77
C P G Barker The Lancet Vol 345 . April 22, 1995, p 1047.
–
9.8 - [0]
9.8
d(SD) = 9.8(4.1); t =
=
= 9.8> t14,0.05 of 2.145
1.0
4.1 / 15
difference is 'off the t-scale'
= 1.78 ± 1.26 = 0.5 to 3.0 hours
10
N O T E : I deliberately omitted the full data on the drug and placebo
conditions: all we need for the analysis are the 10 differences.
95% CI for µD: 9.8 ± 2.145[1.0] = 7.7 to 11.9 s
What if not sure d's come from a Gaussian Distribution?
–
[ for t: Gaussian data or (via CLT) Gaussian statistic ( d )
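A minimal sketch of the drug-vs-placebo analysis above, using the 10 differences:

from math import sqrt
from statistics import mean, stdev
from scipy.stats import t

d = [0.9, -0.9, 4.3, 2.9, 1.2, 3.0, 2.7, 0.6, 3.6, -0.5]   # Drug - Placebo, hours of sleep
n = len(d)
d_bar, sd = mean(d), stdev(d)          # 1.78 and 1.77 (to 2 decimals)
se = sd / sqrt(n)

t_stat = (d_bar - 0) / se              # about 3.18
t_crit = t.ppf(0.975, n - 1)           # about 2.26 on 9 df
print(t_stat, t_crit)
print(d_bar - t_crit * se, d_bar + t_crit * se)   # about 0.5 to 3.0 hours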
page 2
M&M Ch 7.1 (FREQUENTIST) Inference for
WORKED EXAMPLE: Leg Elevation
Sample Size for CI's and test involving
(continued)
50
45
40
Transit Time (s)
35
30
25
20
15
10
5
0
No FootRest
No FootRest FootRest Delay
38
48
10
26
32
6
21
28
7
18
27
9
16
21
5
15
22
7
14
25
11
12
28
16
12
31
19
12
25
13
11
20
9
8
13
5
7
17
10
7
14
7
5
18
13
n to yield (2-sided) CI with margin of
error m at confidence level 1(see
M&M p 447)
|<-margin
|
|
of error
-->|
|
|
(-------------------•-------------------)
–
• large-sample CI: x ± Z
–
mean
SD
SEM
14.8
8.5
2.2
24.6
8.7
2.2
• SE( x ) =
9.8
4.1
1.0
n =
FootRest
–
–
SE( x ) = x ± m
/ n , so...
2•Z
m2
2
If n small, replace Zα/2 by t α/2
Remarks:
Whereas mean of 15 differences between 2 conditions is arithmetically
equal to the difference of the 2 means of 15, the SE of the mean of these
15 differences is not the same as the SE of the difference of two
independent means. In general...
Typically, won't know σ so use
guesstimate;
Var( –
y1 – –
y 2 ) = Var( –
y 1 ) + Var( –
y 2 ) – 2 Covariance( –
y 1, –
y2 )
Authors continue to report the SE of each of the 2 means, but they are of
little use here, since we are not interested in the means per se, but in the
mean difference.
In planning n for example just discussed, authors might
have had pilot data on inter leg differences in transit time
-- with both legs in the No FootRest position. Sometimes,
one has to 'ask around' as to what the SD of the d's will
be. Always safer to assume a higher SD than might turn
out to be the case.
Calculating
Var( –
y1 – –
y 2 ) = Var( –
y 1 ) + Var( –
y 2)
assumes that we used one set of 15 subjects for the No FootRest
condition, and a different set of 15 for the FootRest condition, a much
noisier contrast. As it is, even this inefficient analysis would have sufficed
here because the 'signal' was so much greater than the 'noise'.
See article On Reserve on display of data from pairs.
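A minimal sketch of the n-for-a-given-margin-of-error formula (the σ guesstimate and the margin m below are made-up values):

from math import ceil, sqrt
from scipy.stats import norm

sigma_guess = 5.0                     # assumed SD of the differences (e.g. from pilot data)
m = 2.0                               # desired margin of error, in the same units
z = norm.ppf(0.975)                   # 1.96 for a 95% CI
n = (z * sigma_guess / m) ** 2        # n = (Z_{alpha/2})^2 sigma^2 / m^2
print(ceil(n))                        # round up; if n is small, re-do the calculation with t in place of Z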
page 3
M&M Ch 7.1 (FREQUENTIST) Inference for
Sample Size for CI's and test involving
Sign Test for median
.. cont'd
n for power 1- if mean is units from µ0 (test value) ; type I
error = (cf Colton p142 or CRC table next)
–
Test:
vs
–
Need Zα/2 SE( x ) + Z βSE( x ) > ∆.
–
Substitute ( x ) = σ/√n and solve for n:
{ Zα/2 – Zβ }2 σ2
so need n =
∆2
β
+
–
+
+
+
+
+
+
+
–
∑
µ
alt
α =0.05 (2-sided);
SIGN
of d
0.9
-0.9
4.3
2.9
1.2
3.0
2.7
0.6
3.6
-0.5
Zb SE(xbar)
Za/2 SE(xbar)
Halt : MedianD ≠ 0 ;
DIFFERENCE
Drug – Placebo
α/2
µ0
H0: MedianD = 0
8+, 2–
Reference: Binomial [ n=10; π(+) = 0.5 ] See Table C (last column of p T9) or
Sign Test Table which I have provided in Chapter on Distribution-free Methods.
Upper Tail: Prob( ≥ 8+ | π = 0.5 ) = 0.0439 + 0.0098 + 0.0010 = 0.0547
∆=µ
alt
− µ0
2 Tails:
If power is > 0.5, then β < 0.5, and Zβ < 0 .
eg. α=0.05 , β=0.2
P = 0.0547 +0.0547 = 0.1094
P > 0.05 (2-sided) . (less Powerful than t-test)
In above example on Blood Flow, the fact that all 15/15 had delays makes any
formal test unnecesary... the "Intra-Ocular Traumatic Test" says it all. [Q:
could it be that always raised the left leg, and blood flow is less in left leg? Doubt
it but ask the question just to point out that just because we find a numerical
difference doesn't necessarily mean that we know what caused the difference
=> Zα/2 = 1.96 Zβ = –0.84
Technically, if n small, use t-test... see table next page
The question of what to use is not a matter of
statistics or samples, or what the last guy found in a
study, but rather the difference that makes a difference"
i.e it is a clinical judgement, and includes the impact,
cost, alternatives, etc...
It is the that IF TRUE would lead to a difference in
management or a substantial risk, or whatever...
Famous scientist, begins by removing one leg from an insect and, in an accent I
cannot reproduce on paper, says "quick march". The insect walks briskly. The
scientist removes another leg, and again on being told "quick march" the insect
walks along... This continues until the last leg has been removed, and the insect
no longer walks. Whereupon the Scientist, again in an accent I cannot convey here,
, pronounces "There! it goes to prove my theory: when you remove the legs from
an insect, it cannot hear you anymore!".
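A minimal sketch of the sign-test calculation above (8 '+' signs out of 10 non-zero differences):

from scipy.stats import binom

n_diffs, n_plus = 10, 8
upper_tail = 1 - binom.cdf(n_plus - 1, n_diffs, 0.5)   # Prob[ >= 8 '+' | pi = 0.5 ]
print(upper_tail)                                      # 0.0547
print(2 * upper_tail)                                  # 0.1094, the 2-sided P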
page 4
M&M Ch 7.1 (FREQUENTIST) Inference for
Number of Observations to Ensure Specified Power (1- ) if use 1-sample or paired t-test of Mean
α = 0.005(1-sided)
α = 0.01 (2-sided)
β = 0.01 0.05
0.1
0.2
0.5
α = 0.025(1-sided)
α = 0.05 (2-sided)
0.01 0.05
0.1
0.2
0.5
α = 0.05(1-sided)
α = 0.1 (2-sided)
0.01 0.05
0.1
0.5 [ POWER = 1 – β ]
0.2
∆
σ
0.2
0.3
0.4
0.5
100
115
75
0.6
0.7
0.8
0.9
1.0
71
53
41
34
28
1.2
1.4
1.6
1.8
2.0
84
54
119
68
44
90
51
34
99
45
26
18
53
40
31
25
21
38
29
22
19
16
32
24
19
16
13
24
19
15
12
10
15
12
10
8
7
12
9
8
7
6
10
8
7
6
5
8
7
6
97
63
134
77
51
78
45
30
117
76
53
40
32
26
22
45
34
27
22
19
36
28
22
18
16
22
17
14
12
10
21
16
13
12
10
16
13
11
10
8
14
12
10
9
8
12
10
8
8
7
8
7
6
6
5
2.5
8
7
6
6
6
3.0
7
6
6
5
5
∆
σ
=
µ – µ0
σ
101
65
122
70
45
97
55
36
71
40
27
70
32
19
13
13
10
9
7
6
46
34
27
21
18
32
24
19
15
13
26
19
15
13
11
19
15
12
10
8
9
8
6
5
5
5
13
10
8
7
6
10
8
6
6
8
7
6
6
5
"Signal"
= "Noise"
Table entries transcribed from Table IV.3 of CRC Tables of Probability and Statistics. Table IV.3 tabulates the n's for the Signal/Noise ratios increments of 0.1, and also
includes entries for alpha=0.01(1sided)/0.02(2-sided)
See also Colton, page 142
(σ∆ )2
2
Sample sizes based on t-tables, and so slightly larger (and more realistic, when n small) than those given by z-based formula: n = (zα + zβ)
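A minimal sketch of that z-based formula (the table's entries, being t-based, are slightly larger when n is small):

from math import ceil
from scipy.stats import norm

def n_one_sample(delta_over_sigma, alpha_two_sided=0.05, power=0.80):
    z_alpha = norm.ppf(1 - alpha_two_sided / 2)    # 1.96 for alpha = 0.05 (2-sided)
    z_beta = norm.ppf(power)                       # 0.84 for power = 0.80
    return ceil((z_alpha + z_beta) ** 2 / delta_over_sigma ** 2)

for ratio in (0.5, 1.0, 2.0):                      # "Signal"/"Noise" = Delta / sigma
    print(ratio, n_one_sample(ratio))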
page 5
M&M Ch 7.1 (FREQUENTIST) Inference for
"Definitive Negative" Studies?
Starch blockers--their effect on calorie absorption from a high-starch meal.
Abstract
It has been known for more than 25 years that certain plant foods, such as kidney
beans and wheat, contain a substance that inhibits the activity of salivary and
pancreatic amylase. More recently, this antiamylase has been purified and marketed
for use in weight control under the generic name "starch blockers." Although this
approach to weight control is highly popular, it has never been shown whether
starch-blocker tablets actually reduce the absorption of calories from starch. Using
a one-day calorie-balance technique and a high-starch (100 g) meal (spaghetti,
tomato sauce, and bread), we measured the excretion of fecal calories after normal
subjects had taken either placebo or starch-blocker tablets. If the starch-blocker
tablets had prevented the digestion of starch, fecal calorie excretion should have
increased by 400 kcal. However, fecal calorie excretion was the same on the two test
days (mean ± S.E.M., 80 ± 4 as compared with 78 ± 2). We conclude that starch-blocker
tablets do not inhibit the digestion and absorption of starch calories in human beings.
Table 1. Standard Test Meal.

Ingredients
  Spaghetti (dry weight)* .............. 100 g
  Tomato sauce ......................... 112 g
  White bread ........................... 50 g
  Margarine ............................. 10 g
  Water ................................ 250 g
  51CrCl3 ................................ 4 µCi

Dietary constituents†
  Protein ............................... 19 g
  Fat ................................... 14 g
  Carbohydrate (starch) ................ 108 g (97 g)

* Boiled for seven minutes in 1 liter of water.
† Determined by adding food-table contents of each item
Table 2. Results in Five Normal Subjects on Days of Placebo and Starch-Blocker Tests.

Bo-Linn GW et al. New England Journal of Medicine 307(23):1413-6, 1982 Dec 2.

[Overview of Methods: The one-day calorie-balance technique begins
with a preparatory washout in which the entire gastrointestinal tract is
cleansed of all food and fecal material by lavage with a special calorie-free,
electrolyte-containing solution. The subject then eats the test meal,
which includes 51CrCl3 as a nonabsorbable marker. After 14 hours,
the intestine is cleansed again by a final washout. The rectal effluent is
combined with any stool (usually none) that has been excreted since
the meal was eaten. The energy content of the ingested meal and of the
rectal effluent is determined by bomb calorimetry. The completeness
of stool collection is evaluated by recovery of the nonabsorbable
marker.]
                 Placebo Test Day                        Starch-Blocker Test Day
Subject     DUPLICATE    RECTAL     MARKER          DUPLICATE    RECTAL     MARKER
            TEST MEAL*   EFFLUENT   RECOVERY        TEST MEAL    EFFLUENT   RECOVERY
               kcal        kcal        %               kcal        kcal        %
1               664          81       97.8              665          76       96.6
2               675          84       95.2              672          84       98.3
3               682          80       97.4              681          73       94.4
4               686          67       95.5              675          75      103.6
5               676          89       96.3              687          83      106.9
Means           677          80       96.4              676          78      100
±S.E.M.          ±4          ±4       ±0.5               ±4          ±2       ±2

*Does not include calories contained in three placebo tablets (each tablet, 1.2±0.1
kcal) or in three Carbo-Lite tablets (each tablet, 2.8±0.1 kcal) that were ingested
with each test meal.
_____________________________________________________
[Figure: the Company's Claim versus the Estimate from the Study (with 95% CI), plotted on an axis of kcal blocked running from -100 to 400 kcal.]
For a good paper on the topic of 'negative' studies, see Powell-Tuck J "A
defence of the small clinical trial: evaluation of three gastroenterological studies."
British Medical Journal Clinical Research Ed. 292(6520):599-602, 1986 Mar 1.
(Resources for Ch 7)
page 6
M&M Ch 7.2 (FREQUENTIST) Inference for
–
Inference for µ 1 – µ 2 : A&B Ch 7.2 ; ColtonCh 4
WORKED EXAMPLE : CI and Test of Significance
Somewhat more complex than simply replacing σ1 and σ2 by s1 and
s2 as estimates of σ's in CI's and tests.
Y= Fall in BP over 8 weeks [mm] with Modest Reduction in Dietary Salt
Intake in Mild Hypertension (Lancet 25 Feb, 1989)
Need to distinguish two theoretical situations (unfortunately
seldom clearly distinguishable in practice) where:

σ1 = σ2 = σ : use a "pooled" estimate s of σ [see M&M page 550]
  [think of s²pooled as a weighted average of s1² and s2²]
  t-table is accurate (if Gaussian data)

σ1 ≠ σ2 : use separate estimates of σ1 and σ2;
  adjust d.f. downwards from (n1 + n2 – 2) to compensate for inaccuracy
  Option 1 (p 549): "software approximation"*
  Option 2 (p 541), for hand use: df = Minimum[ n1 – 1, n2 – 1 ]
  [M&M are the only ones I know who suggest this option 2; I think
  they do so because the undergraduates they teach may not be
  motivated enough to use equation 7.4 page 549 to calculate the
  reduced degrees of freedom... I agree that the only time one should use
  option 2 is the 1st time when learning about the t-test]
  * The SAS manual says that in its TTEST procedure it uses
  Satterthwaite's approximation [p. 549 of M&M] for the reduced
  degrees of freedom unless the user specifies otherwise.
  Adjustments are not a big issue if sample sizes are large or variances similar.

Test: H0:   µY (Normal Sodium Diet) = µY (Low Sodium Diet)
      Halt: µY (Normal Sodium Diet) ≠ µY (Low Sodium Diet)
α = 0.05 (2-sided); β = 0.20;
==> Power = 80% if | µY(Nl) - µY(Low) | ≥ 2 mm DBP

Data given: Mean(SEM) Fall in BP (SBP)
  "Normal" Group (n=53): -0.6 (1.0)
  "Low" Group (n=50): -6.1 (1.1)

Reconstruct s²'s via the relation s² = n SEM² :
Mean(s²) Fall in BP (SBP)
  "Normal" (n=53): -0.6 (53)
  "Low" (n=50): -6.1 (60.5)

s²pooled = ( [52]·53 + [49]·60.5 ) / ( [52] + [49] ) = 56.63 ;   s = √56.63 = 7.52

t = ( -6.1 - [-0.6] ) / ( 7.52 √( 1/53 + 1/50 ) ) = –3.71   vs   t101, .05 = 1.98

page 7
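The same pooled-variance arithmetic, as a quick check. From Python (a minimal sketch using only the summary statistics above; scipy is assumed for the t critical value):

from math import sqrt
from scipy.stats import t as tdist

n1, mean1, sem1 = 53, -0.6, 1.0      # "Normal" sodium group (fall in SBP, mm)
n2, mean2, sem2 = 50, -6.1, 1.1      # "Low" sodium group

s2_1, s2_2 = n1 * sem1**2, n2 * sem2**2                        # s^2 = n * SEM^2
s2_pool = ((n1 - 1) * s2_1 + (n2 - 1) * s2_2) / (n1 + n2 - 2)
t = (mean2 - mean1) / sqrt(s2_pool * (1/n1 + 1/n2))
print(round(s2_pool, 2), round(t, 2))                          # 56.64 and -3.71 (cf. 56.63 above)
print(round(tdist.ppf(0.975, n1 + n2 - 2), 2))                 # critical t(101) = 1.98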
M&M Ch 7.2 (FREQUENTIST) Inference for µ1 – µ2

Calculation of the t-test using separate variances

Had we used the separate s²'s in each sample we would have calculated

    t = ( -6.1 - [-0.6] ) / √( 53/53 + 60.5/50 ) = –3.70

This is equivalent to calculating:

    t = ( -6.1 - [-0.6] ) / √( SE1² + SE2² ) = ( -6.1 - [-0.6] ) / √( 1.1² + 1.0² ) = –3.70

M&M suggest that the appropriate df for t are
    Option 1 (via their eqn. 7.4): 99.5
    Option 2 (smaller df): 49

Either way, the t ratio is far beyond the α=0.05 point of the null distribution. Notice that the reduction in df is minimal here because the two sample variances are quite close.

Incidentally, as per their power calculations, the primary response variable was DBP.
Mean(SEM) Fall in DBP in the 2 samples:
    "Normal" Group (n=53): -0.9 (0.6)        "Low" Group (n=50): -3.7 (0.6)

    t = ( -3.7 - [-0.9] ) / √( 0.6² + 0.6² ) = 3.3

"Eye test": Judging overlap of two independent CI's

    |----------x----------|        Overlapping CI's
              |----------x----------|

How far apart do two independent x̄'s, say x̄1 and x̄2, have to be
for a formal statistical test, using an alpha of 0.05 two-sided, to
be statistically significant?

Need...   | x̄1 – x̄2 | ≥ 1.96 √( {SE[x̄1]}² + {SE[x̄2]}² )      if using a z-test*

If the 2 SEM's are about the same size (as they would be if the 2 n's, and the per-unit
variability, were about the same), then ... [as in exercise X in Chapter 5]

Need...   | x̄1 – x̄2 | ≥ 1.96 √( 2 {SE[each x̄]}² )
i.e.      | x̄1 – x̄2 | ≥ 1.96 √2 SE[each x̄] ,  or...   | x̄1 – x̄2 | ≥ 2.77 SE[each x̄]

* If using t rather than z, the multiple would be somewhat higher than 1.96, so that
when multiplied by √2 it might be higher than 2.77, closer to 3. Thus a
rough answer to the question could be 3 SE[each x̄].

This means that even when two 100(1-α)% CI's overlap slightly, as above, the
difference between the two means could be statistically significant at the α level.
This is why Moses, in his article on graphical displays (see reserve material),
advocates plotting the 2 CI's formed by

    x̄1 ± 1.5 SE[x̄1]   and   x̄2 ± 1.5 SE[x̄2]

Thus, we can be reasonably sure that if the CI's do not overlap (i.e. if x̄1 and x̄2
are more than 3 SE[each x̄] apart) the difference between them is statistically
significant at the alpha=0.05 level.

[ estimate ± 1.5 SE(estimate) corresponds to an 86% CI if using the Z distribution ].
Note: above logic applies for other symmetric CI's too.
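The separate-variances t and the reduced df of M&M's eqn 7.4 (the Satterthwaite approximation used by SAS) can be checked the same way. From Python (a minimal sketch, an illustration only):

from math import sqrt

n1, s2_1 = 53, 53.0     # "Normal": s^2 reconstructed as n * SEM^2
n2, s2_2 = 50, 60.5     # "Low"
v1, v2 = s2_1 / n1, s2_2 / n2

t = (-6.1 - (-0.6)) / sqrt(v1 + v2)                         # separate-variances t
df = (v1 + v2)**2 / (v1**2/(n1 - 1) + v2**2/(n2 - 1))       # eqn 7.4 / Satterthwaite df
print(round(t, 2), round(df, 1))                            # -3.70 and 99.5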
page 8
M&M Ch 7.2 (FREQUENTIST) Inference for
–
Inferences regarding means --- Summary

Situation                          Object              σ known                                       σ unknown (or large n's)
1 Popln.                           CI for µ            x̄ ± z σx/√n                                   x̄ ± t(n-1) sx/√n
(sample of n)                      Test µ = µ0         z = (x̄ - µ0) / (σx/√n)                        t(n-1) = (x̄ - µ0) / (sx/√n)

1 Popln. under 2 condns.           CI for µd (= ∆)     d̄ ± z σd/√n                                   d̄ ± t(n-1) sd/√n
(sample of n within-pair           Test ∆ = ∆0         z = (d̄ - ∆0) / (σd/√n)                        t(n-1) = (d̄ - ∆0) / (sd/√n)
differences {d = x1 - x2})

2 Poplns.                          CI for µ1 - µ2 (= ∆)  x̄1 - x̄2 ± z √( σ1²/n1 + σ2²/n2 )            x̄1 - x̄2 ± t(df) √( s²/n1 + s²/n2 )
(independent samples of n1, n2)    Test ∆ = ∆0           z = (x̄1 - x̄2 - ∆0) / √( σ1²/n1 + σ2²/n2 )   t(df) = (x̄1 - x̄2 - ∆0) / √( s²/n1 + s²/n2 )

Notes:
• Pooled s² = [ (n1-1)s1² + (n2-1)s2² ] / [ (n1-1) + (n2-1) ]   (weighted average of the two s²'s)
• df = (n1-1) + (n2-1) = n1 + n2 - 2
• If it appears that σ1² is very different from σ2², then a "separate variances" t-test is used with df reduced
  to account for the differing σ²'s
page 9
M&M Ch 7.2 (FREQUENTIST) Inference for
Sample Size for CI(µ1 – µ2): n's to produce a CI for µ1 – µ2 with prespecified margin of error

• large-sample CI: x̄1 - x̄2 ± Z SE( x̄1 - x̄2 ) = x̄1 - x̄2 ± margin of error
• SE( x̄1 - x̄2 ) = √( σ²/n1 + σ²/n2 )
• if use equal n's, then ...
    n per group = 2 σ² Z² / [margin of error]²

example:
* 95% CI for difference in mean Length of Stay (LOS);
* desired Margin of Error for difference: 0.5 days,
* anticipate SD of individual LOS's, in each situation, of 5 days.
95% -> α=0.05 -> Zα/2 = 1.96
    n per group = 2 • 5² • 1.96² / [0.5]² ≅ 800

Contrast the formula for the test and the formula for the CI:
CI: no null and alternative values for the comparative parameter; notice also the absence of beta.
See reference to Greenland [below].

Sample Size for test of H0: µ1 = µ2 vs Ha: µ1 ≠ µ2 : n's for power of 100(1 – β)% if µ1 – µ2 = ∆; Prob(type I error) = α
(cf. Colton p 145 or CRC tables)

    n per group = 2 {Zα/2 – Zβ}² σ² / ∆² = 2 (Zα/2 – Zβ)² { σ/∆ }²

Note that if β < 0.5, Zβ < 0 (also, Zβ is always 1-sided).

example: α=0.05 (2-sided) and β=0.2 ...
    Zα/2 = 1.96; Zβ = -0.84,
    2(Zα/2 – Zβ)² = 2{1.96 – (–0.84)}² ≈ 16, i.e.
    n per group ≈ 16 • {noise/signal ratio}²

These formulae are easily programmed in a spreadsheet. There are also specialized
software packages for sample size and statistical power. See web page under
Resources for Chapter 7.

Greenland S. "On sample-size and power calculations for studies using confidence
intervals". American Journal of Epidemiology. 128(1):231-7, 1988 Jul. Abstract: A
recent trend in epidemiologic analysis has been away from significance tests and toward
confidence intervals. In accord with this trend, several authors have proposed the use of
expected confidence intervals in the design of epidemiologic studies. This paper
discusses how expected confidence intervals, if not properly centered, can be
misleading indicators of the discriminatory power of a study. To rectify such problems,
the study must be designed so that the confidence interval has a high probability of not
containing at least one plausible but incorrect parameter value. To achieve this end,
conventional formulas for power and sample size may be used. Expected intervals, if
properly centered, can be used to design uniformly powerful studies but will yield
sample-size requirements far in excess of previously proposed methods.
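As noted, these formulae are easily programmed. From Python (a minimal sketch, scipy assumed, reproducing the two calculations above):

from scipy.stats import norm

z = norm.ppf(0.975)                      # Z_alpha/2 for alpha = 0.05 (2-sided)

# CI-based: n per group = 2 sigma^2 Z^2 / margin^2  (LOS example: sigma = 5, margin = 0.5)
print(2 * 5**2 * z**2 / 0.5**2)          # about 768 per group, the "approximately 800" above

# Test-based: the multiplier 2 (Z_alpha/2 - Z_beta)^2, with beta = 0.2 (Z_beta = -0.84)
print(2 * (z - norm.ppf(0.20))**2)       # about 15.7, the "approximately 16" rule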
page 10
M&M Ch 7.2 (FREQUENTIST) Inference for
–
Number of Observations PER GROUP to Ensure Specified Power (1 – β) if use 2-sample t-test of 2 Means

[Table: rows are values of the "signal/noise" ratio ∆/σ = (µ1 – µ2)/σ, from 0.2 up to 2.0, plus 2.5 and 3.0; columns are α = 0.005(1-sided)/0.01(2-sided), α = 0.025(1-sided)/0.05(2-sided) and α = 0.05(1-sided)/0.1(2-sided), each crossed with β = 0.01, 0.05, 0.1, 0.2, 0.5; entries are the required n per group.]

∆/σ = (µ1 – µ2)/σ = "Signal" / "Noise"

Table entries transcribed from Table IV.4 of CRC Tables of Probability and Statistics. Table IV.4 tabulates the n's for Signal/Noise ratios in increments of 0.1, and also includes entries for alpha = 0.01(1-sided)/0.02(2-sided).
See also Colton, page 145.

Sample sizes based on t-tables, and so slightly larger (and more realistic) than those given by the z-based formula: n/group = 2(zα/2 + zβ)² (σ/∆)²
See later (in Chapter 8) for unequal sample sizes i.e. n1 ≠ n2
page 11
Inference concerning a single
M&M §8.1
updated Dec 26, 2003
Parameter : the proportion e.g. ...
• with undiagnosed hypertension / seeing MD during a 1-year span
• responding to a therapy
• still breast-feeding at 6 months
• of pairs where response on treatment > response on placebo
• of US presidential elections where taller candidate expected to win
• of twin pairs where L handed twin dies first
• able to tell imported from domestic beer in a "triangle taste test"
• who get a headache after drinking red wine
• (of all cases, exposed and unexposed) where case was "exposed"
(FREQUENTIST) Confidence Interval for
1 . "Exact" (not as "awkward to work with' as M&M p586 say they are)
tables [Documenta Geigy, Biometrika , ...] nomograms, software
e.g.
95% CI (Biometrika nomogram) 32% to 77%
[uses c for numerator; enter through the lower x-axis if p ≤ 0.5; in our
case p = 0.55, so enter the nomogram from the top at c/n = 0.55 near the
upper right corner; travel downwards until you hit the bowed line
marked 20 (the 5th line from the top) and exit towards the rightmost
border at π lower ≈ 0.32; go back and travel downward until you hit the
companion bowed line marked 20 (the 5th line from the bottom) and
exit towards the rightmost border at π upper ≈ 0.77].
Statistic: the proportion p = y/n in a sample of size n. ...
Inferences from y/n to
FREQUENTIST
via Confidence Intervals and Tests
Others may use other names for numerator and statistic, or use
symmetry (Binomial[y, n,p] <--> Binomial[n-y, n,1 - p] to save space.
Nomogram on next page shows full range, but uses an approxn..
?
supplies a NUMERICAL answer (range)
• E vidence (P-value) against H 0:
Notice link between 100(1 - α)% CI and two-sided test of
significance with a preset α. If true π were < π lower, there would
be less than a 2.5% probability of obtaining, in a sample of 20, this
many (11) or more respondents; likewise, if true π were > π upper,
there would be less than a 2.5% probability of obtaining, in a sample
of 20, this many (11) or fewer respondents. The 100(1 - α)% CI for
π includes all those parameter values such that if the observed data
were tested against them, the p-value (2-sided) would not be < α.
= 0.xx
• Test of Hypothesis: Is (P-value) < preset
?
supplies a YES / NO answer (uses Pdata | H 0)
BAYESIAN
via posterior probability distribution for, and
probabilistic statements concerning, itself
•
point estimate:
• interval estimate:
what fraction π will return a 4-page questionnaire?
11/20 returns on a pilot test i.e. p= 11/20 =0.55
95% CI (from CI for proportion table Ch 8 Resources ) 32% to 77%
[To save space, table gives CI's only for p≤0.5, so get CI for π of nonreturns: point estimate is 9/20 or 45%, CI is 23% to 68% {1st row,
middle column of the X=9 block} Turn this back to 100-68=32% to
100-23=77% returns]
[function of rate ratio & of relative sizes of exposed & unexposed
denominators; CONDITIONAL analysis (i.e. "fix" # cases), used for CI
for ratio of 2 rates , especially in 'extreme' data configurations...
eg. # seroconversions in RCT of HPV16 Vaccine, NEJM Nov 21, 2002:
0/11084.0 W-Y in vaccinated gp. versus 41/11076.9 W-Y in placebo gp]
• Confidence Interval: where is
from a proportion p = x / n
e.g. Experimental drug gives p =
median, mode, ...
credible intervals, ...
0 successes
=> π = ??
14 patients
95% CI for π (from table) 0% to 23%
CI "rules out" (with 95% confidence ) possibility that π>23%
Software: • "Bayesian Inference for Proportion (Excel)" Resources Ch 8
• First Bayes { http://www.epi.mcgill.ca/Joseph/courses.html }
[might use a 1-sided CI if one is interested in putting just an upper bound on
risk: e.g. what is upper bound on π = probability of getting HIV from HIVinfected dentist? see JAMA article on "zero numerators" by Hanley and
Lippman-Hand (in Resources for Chapter 8) .
cf also A&B §4.7; Colton §4. Note that JH's notes use p for statistic, π for parameter.
page 1
Inference concerning a single
CI for
M&M §8.1
-- using nomogram (many books of statistical tables have fuller versions)
[Nomogram: 95% CI for π as a function of the Observed proportion (p), plotted on the horizontal axis from 0 to 1; pairs of bowed lines are labelled by sample size n = 20, 50, 100, 200, 400, 1000.]
(Asymmetric) CI in the above Nomogram: approximate formula

    π = [ 2np + z² ± z √( 4np – 4np² + z² ) ] / [ 2 (n + z²) ]        (cf. later page)

See Biometrika Tables for Statisticians for the "exact" Clopper-Pearson version,
calculated so that Binomial Prob[ ≥ p | πlower ] = Prob[ ≤ p | πupper ] = 0.025 exactly.
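From Python (a minimal sketch of the approximate (Wilson-type) formula above; an illustration, not from the Biometrika tables):

from math import sqrt

def wilson_ci(x, n, z=1.96):
    # pi = [2np + z^2 +/- z*sqrt(4np - 4np^2 + z^2)] / [2(n + z^2)]
    p = x / n
    centre = (2*n*p + z**2) / (2 * (n + z**2))
    half = z * sqrt(4*n*p - 4*n*p**2 + z**2) / (2 * (n + z**2))
    return centre - half, centre + half

print(wilson_ci(11, 20))   # about (0.34, 0.74); the exact (Clopper-Pearson) CI is 32% to 77%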
page 2
Inference concerning a single
M&M §8.1
(FREQUENTIST) Confidence Interval for
from a proportion p = x / n
(FREQUENTIST) Confidence Interval for
NOTES
1. Exactly, but by trial and error, via SOFTWARE with
Binomial probability function
• Turn spreadsheet of Binomial Probabilities (Table C) 'on its side' to get
CI(π) .. simply find those columns (π values) for which the probability of the
observed proportion is small
• Read horizontally, Nomogram [previous page] shows the variability of
proportions from SRS samples of size n. [very close in style to table of
Binomial Probabilities, except that only the central 95% range of variation
shown , & all n's on same diagram]
Exactly, and directly, using table of (or SOFTWARE
function that gives) the percentiles of the F distribution
Read vertically, it shows:
See spreadsheet "CI for a Proportion (Excel
spreadsheet, based on exact Binomial model) " under
Resources for Chapter 8. In this sheet one can obtain the
direct solution, or get there by trial and error. Inputs in bold
may be changed.
•
CI -> symmetry as p -> 0.5 or n - > ∞ [in fact, as np & n(1-p) - > ∞ ]
•
widest uncertainty at p=0.5 => can use as a 'worst case scenario'
•
cf. the ± 4 % points in the 'blurb' with Gallup polls of size n ≈ 1000.
Polls are usually cluster (rather than SR) samples, and so have
bigger margins of error [wider CI's] than predicted from the Binomial.
95%CI? IC? ... Comment dit on... ?
The general "Clopper-Pearson" method for obtaining a
Binomial-based CI for a proportion is explained in 607 Notes
for Chapter 6.
[La Presse, Montréal, 1993] The Gallup Institute recently asked a
representative sample of the Canadian population to rate the way the
federal government was handling various economic and general problems.
For 59 per cent of respondents, the Liberals are not doing an effective
job in this area, while 30 per cent express the opposite view and
eleven per cent offer no opinion.
Can obtain these limits by trial and error (e.g. in spreadsheet)
or directly using the link between the Binomial and F tail areas
(also implemented in spreadsheet). The basis for the latter is
explained by Liddell (method, and reference, given at bottom
of Table of 95% CI's).
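Besides the spreadsheet and the Binomial–F link, the same "exact" limits can be obtained from percentiles of the Beta distribution. From Python (a minimal sketch; the Beta route and scipy are assumptions of this illustration, not the spreadsheet's own method):

from scipy.stats import beta

def clopper_pearson(x, n, conf=0.95):
    a = (1 - conf) / 2
    lower = beta.ppf(a, x, n - x + 1) if x > 0 else 0.0
    upper = beta.ppf(1 - a, x + 1, n - x) if x < n else 1.0
    return lower, upper

print(clopper_pearson(11, 20))   # about (0.32, 0.77), the 32% to 77% quoted earlier
print(clopper_pearson(0, 14))    # about (0.00, 0.23), the "0 successes in 14 patients" example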
The same question was put by Gallup 16 times between 1973 and
1990, and only once, in 1973, was the proportion of Canadians who
said they were dissatisfied with the way the government was managing
the economy below 50 per cent.

The poll's conclusions are based on 1009 interviews carried out
between May 2 and 9, 1994 with Canadians aged 18 and over.
A sample of this size gives results accurate to within 3.1
percentage points, 19 times out of 20. The margin of error is larger for
the regions, owing to the smaller sample sizes there; for
example, the 272 interviews carried out in Quebec yield a
margin of error of 6 percentage points, 19 times out of 20.
page 3
Inference concerning a single
2. CI for
CI:
M&M §8.1
based on "large-n" behaviour of p, or fn. of p
p ± z SE(p) = p ± z √( p[1-p] / n )

e.g. p = 0.3, n = 1000:

95% CI for π = 0.3 ± 1.96 √( 0.3[0.7] / 1000 )
             = 0.30 ± 1.96 (0.015)
             = 0.30 ± 0.03
             = 30% ± 3%

[Figure: sampling distributions centred at π lower ≈ 0.27 and π upper ≈ 0.33, each with 0.025 in the tail beyond the (observed) proportion p = 0.30; the SD is calculated at 0.30 rather than at the lower and upper limits.]
Note: the ± 3% is pronounced and written as "± 3 percentage
points" to avoid giving the impression that it is 3% of 30%
SE-based (sometimes referrred to in texts and software output
as "Wald" CI's) use the same SE for the upper and lower limits -they calculate one SE at the point estimate, rather than two
separate SE's, calculated at each of the two limits.
"Large-sample n": How large is large?
• A rule of thumb: when the expected no. of positives, np, and
the expected no. of negatives, n(1-p), are both bigger than 5 (or
10 if you read M&M).
From SAS
DATA CI_propn;
INPUT n_pos n ;
LINES;
300 1000
;
PROC genmod data = CI_propn; model n_pos/n = /
dist = binomial link = identity waldci ; RUN;
• JH's rule: when you can't find the CI tabulated anywhere!
• if the distribution is not 'crowded' into one corner (cf. the shapes
of binomial distributions in the Binomial spreadsheet -- in
Resources for Ch 5), i.e., if, with the symmetric Gaussian
approximation, neither of the tails of the distribution spills over a
boundary (0 or 1 if proportions, or 0 or n if on the count scale),
From Stata immediate command: cii 1000 300
clear
* Using datafile
input n_pos n
140 500 * glm doesn't like file with 1 'observation'
160 500 * so...........split across 2 'observations'
end
glm n_pos , family(binomial n) link(identity)
See M&M p383 and A&B §2.7 on Gaussian approximation to
Binomial.
page 4
Inference concerning a single
2. CI for
M&M §8.1
based on "large-n" behaviour of p... continued
Other, more accurate and more theoretically correct,
large-sample (Gaussian-based) constructions
The "usual" approach is to form a symmetric CI as
point estimate ± a multiple of the SE.
This is technically incorrect in the case of a distribution, such as the
binomial, with a variance that changes with the parameter being
measured. In construction of CI's [see diagram on page 1 of material on
Ch 6.1] there are two distributions involved: the binomial at π upper and
the binomial at π lower. They have different shapes and different SD's in
general. Approaches i and ii (below) take this into account.
i   Based on Gaussian approximation to the binomial distribution, but
    with SD's calculated at the limits, SD = √( π[1–π] / n ), rather than
    at the point estimate itself  {the "usual" CI uses SD = √( p[1–p] / n )}

ii  Based on Gaussian distribution of a variance-stabilizing
    transformation of the binomial, again with SD's calculated at
    the limits rather than at the point estimate itself:

    [ sin( sin⁻¹(√p) – z/(2√n) ) ]²  ,  [ sin( sin⁻¹(√p) + z/(2√n) ) ]²

    (as in most calculators, sin⁻¹ and the argument of sin[·] are measured in radians)
If we define the CI for π as (πL, πU), where

    Prob[ sample proportion ≥ p | πL ] = α/2   and   Prob[ sample proportion ≤ p | πU ] = α/2

and if we use Gaussian approximations to Binomial(n, πL) and Binomial(n, πU) and solve

    p = πL + zα/2 √( πL[1–πL] / n )   and   p = πU – zα/2 √( πU[1–πU] / n )

for πL and πU ...

E.g. with α = 0.05, so that z = 1.96, we get:

Method        n=10 p=0.0     n=10 p=0.3     n=20 p=0.15     n=40 p=0.075
1.            [0.00, 0.28]   [0.11, 0.60]   [ 0.05, 0.36]   [ 0.03, 0.20]
2.            [0.09, 0.09]   [0.07, 0.60]   [ 0.03, 0.33]   [ 0.01, 0.18]
"usual"       [0.00, 0.00]   [0.02, 0.58]   [–0.01, 0.31]   [–0.01, 0.16]
Binomial*     [0.00, 0.31]   [0.07, 0.65]   [ 0.03, 0.38]   [ 0.02, 0.20]

* from Mainland
This leads to asymmetric 100(1–α)% limits of the form:

    π = [ 2np + z² ± z √( 4np – 4np² + z² ) ] / [ 2 (n + z²) ]

Rothman (2002, p132) attributes this method i to Wilson 1927.

References: • Fleiss, Statistical Methods for Rates and Proportions
            • Miettinen, Theoretical Epidemiology, Chapter 10.
page 5
Inference concerning a single
M&M §8.1
2. CI for π based on "large-n" behaviour of the logit or log transformation of the proportion

iii  Based on Gaussian distribution of the logit transformation
     of the estimate (p, the observed proportion) of the parameter π

PARAMETER:  LOGIT[ π ] = log[ ODDS ] = log[ π / (1 - π) ]
                       = log[ "Proportion POSITIVE" / "Proportion NEGATIVE" ]
STATISTIC:  logit[ p ] = log[ odds ] = log[ p / (1 - p) ]
(Here, log = 'natural' log, i.e. to base e, which some write as ln)
(UPPER CASE/Greek = parameter; lower case/Roman = statistic)

Reverse transformation (to get back from LOGIT to π)...
    π = ODDS / (1 + ODDS) = exp[LOGIT] / (1 + exp[LOGIT]) ;  and likewise p <-- logit
    anti-logit[ logit ] = exp[ logit ] / (1 + exp[ logit ])   (Greenland calls it the "expit" function)
    π LOWER = exp[ LOWER limit of LOGIT ] / ( 1 + exp[ LOWER limit of LOGIT ] ) ;  π UPPER likewise

SE[ logit ] = Sqrt[ 1/#positive + 1/#negative ]

e.g. p = 3/10  =>  estimated odds = 3/7  =>  logit = log[3/7] = -0.85
SE[ logit ] = Sqrt[ 1/3 + 1/7 ] = 0.69
CI in the LOGIT scale:  -0.85 ± 1.96 × 0.69 = { -2.2, 0.5 }
CI in the π scale:  { exp[-2.2]/(1+exp[-2.2]) , exp[0.5]/(1+exp[0.5]) } = { 0.10, 0.62 }

iv   Based on Gaussian distribution of the estimate of log[ π ]

PARAMETER:  log[ π ]
STATISTIC:  log[ p ]

Reverse transformation (to get back from log[π] to π)...
    π = antilog[ log[π] ] = exp[ log[π] ] ;  and likewise p <-- log[p]
    π LOWER = exp[ LOWER limit of log[π] ] ;  π UPPER likewise

SE[ log[p] ] = Sqrt[ 1/#positive - 1/#total ]

Limits for π from p = 3/10 :   exp[ log[3/10] ± z × Sqrt[ 1/3 - 1/10 ] ]

Exercises:
1  Verify that you get the same answer by calculator and by software.
2  Even with these logit and log transformations, the Gaussian distribution is
   not accurate at such small sample sizes as 3/10. Compare their performance
   (against the exact methods) for various sample sizes and numbers positive.
From SAS (logit link):
DATA CI_propn; INPUT n_pos n ;
LINES;
3 10
;
PROC genmod data = CI_propn;
  model n_pos/n = / dist = binomial link = logit waldci ;
RUN;

From Stata (logit link):
clear
input n_pos n
1 5
2 5
end
glm n_pos, family(binomial n) link(logit)

From SAS (log link):
DATA CI_propn; INPUT n_pos n ;
LINES;
3 10
;
PROC genmod data = CI_propn;
  model n_pos/n = / dist = binomial link = log waldci ;
RUN;

From Stata (log link):
clear
input n_pos n
1 5
2 5
end
glm n_pos, family(binomial n) link(log)

page 6
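From Python (a minimal sketch of the logit-based CI for the 3/10 example; an illustration alongside the SAS and Stata code above):

from math import log, exp, sqrt

def logit_ci(x, n, z=1.96):
    logit = log(x / (n - x))                   # log odds
    se = sqrt(1/x + 1/(n - x))                 # SE[logit]
    expit = lambda t: exp(t) / (1 + exp(t))    # anti-logit ("expit")
    return expit(logit - z*se), expit(logit + z*se)

print(logit_ci(3, 10))    # about (0.10, 0.62), as in the worked example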
Inference concerning a single
M&M §8.1
1200 are hardly representative of 80 million homes /220 million people!!
The "Margin of Error blurb" introduced (legislated) in the mid 1980's
The Nielsen system for TV ratings in U.S.A.
Montreal Gazette August 8, 1 9 8 1
NUMBER OF SMOKERS RISES BY FOUR POINTS: GALLUP POLL
(Excerpt from article on "Pollsters" from an airline magazine)
"...Nielsen uses a device that, at one minute intervals, checks to see if the
TV set is on or off and to which channel it is tuned. That information is periodically
retrieved via a special telephone line and fed into the Nielsen computer center in
Dunedin, Florida.
With these two samplings, Nielsen can provide a statistical estimate of the
number of homes tuned in to a given program. A rating of 20, for instance, means
that 20 percent, or 16 million of the 80 million households, were tuned in.
To answer the criticism that 1,200 or 1,500 are hardly representative of 80 million
homes or 220 million people, Nielsen offers this analogy:
Mix together 70,000 white beans and 30,000 red beans and then scoop out
a sample of 1000. the mathematical odds are that the number of red beans will be
between 270 and 330 or 27 to 33 percent of the sample, which translates to a "rating"
of 30, plus or minus three, with a 20-to-1 assurance of statistical reliability. The
basic statistical law wouldn't change even if the sampling came from 80 million beans
rather than just 100,000." ...
Compared with a year ago, there appears to be an increase in the number of Canadians
who smoked cigarettes in the past week - up from 41% in 1980 to 45% today. The
question asked over the past few years was: "Have you yourself smoked any
cigarettes in the past week?" Here is the national trend:
Smoked cigarettes in the past week
Today...........................45%
1980............................41
1979............................44
1978............................47
1977............................45
1976............................Not asked
1975............................47
1974............................52
Men (50% vs. 40% for women), young people (54% vs. 37% for those > 50 ) and
Canadians of French origin (57% vs. 42% for English) are the most likely smokers.
Today's results are based on 1,054 personal in-home interviews with
adults, 18 years and over, conducted in June.
Why, if the U.S. has a 10 times bigger population than Canada,
do pollsters use the same size samples of approximately 1,000
in both countries?
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
The Gazette, Montreal, Thursday, June 27, 1 9 8 5
39% OF CANADIANS SMOKED IN PAST WEEK: GALLUP POLL
Answer : it depends on WHAT IS IT THAT IS BEING ESTIMATED. With n=1,000,
the SE or uncertainty of an estimated PROPORTION 0.30 is indeed 0.03 or 3
percentage points. However, if interested in the NUMBER of households tuned in to
a given program, the best estimate is 0.3N, where N is the number of units in the
population (N=80 million in the U.S. or N=8 million in Canada). The uncertainty in the
'blown up' estimate of the TOTAL NUMBER tuned in is blown up accordingly, so that
e.g. the estimated NUMBER of households is
Almost two in every five Canadian adults (39 per cent) smoked at least one cigarette in
the past week - down significantly from the 47 percent who reported this 10 years ago,
but at the same level found a year ago. Here is the question asked fairly regularly over the
past decade: "Have you yourself smoked any cigarettes in the past week?"
The national trend shows:
Smoked cigarettes in the past week
1985............................39%
1984............................39
1983............................41
1982*...........................42
(* Smoked regularly or occasionally)
1981............................45
1980............................41
1979............................44
1978............................47
1977............................45
1975............................47
U.S.A. 80,000,000[0.3 ± 0.03] = 24,000,000 ± 2,400,000
Canada. 8,000,000[0.3 ± 0.03] = 2,400,000 ±
240,000
2.4 million is a 10 times bigger absolute uncertainty than 240,000. Our intuition about
needing a bigger sample for a bigger universe probably stems from absolute errors
rather than relative ones (which in our case remain at 0.03 in 0.3, or 240,000 in 2.4
million, or 2.4 million in 24 million, i.e. at 10%) irrespective of the size of the universe. It
may help to think of why we do not take bigger blood samples from bigger persons:
the reason is that we are usually interested in concentrations rather than in absolute
amounts, and concentrations are like proportions.
Those < 50 are more likely to smoke cigarettes (43%) than are those 50 years or over
(33%). Men (43%) are more likely to be smokers than women (36%).
Results are based on 1,047 personal, in-home interviews with adults, 18 years and over,
conducted between May 9 and 11. A sample of this size is accurate within a
4-percentage-point margin, 19 in 20 times.
page 7
Inference concerning a single
M&M §8.1
Test of Hypothesis that π = some test value

1. n small enough -> Binomial Tables/Spreadsheet
i.e. if testing H0: π = π0 vs Ha: π ≠ π0 [or Ha: π > π0], and if observe x/n, then calculate
    Prob( observed x, or an x that is more extreme | π0 ),
using Ha to specify which x's are more extreme, i.e. provide even more evidence for Ha and against H0.

Or use the correspondence between a 100(1-α)% CI and a test which uses an alpha level of α,
i.e. check if the CI obtained from the CI table or nomogram includes the π value being tested.
[There may be slight discrepancies between test and CI: the methods used to construct CI's
don't always correspond exactly to those used for tests.]

e.g. 1  A common question is whether there is evidence against the proposition that a
proportion π = 1/2 [testing preferences and discrimination in psychophysical matters e.g.,
therapeutic touch, McNemar's test for discordant pairs when comparing proportions in
a paired-matched study, the 'non-parametric' Sign Test for assessing intra-pair differences
in measured quantities, ...]. Because of the special place of the Binomial at π = 1/2, the
tail probabilities have been calculated and tabulated. See the table entitled "Sign Test" in
the chapter on Distribution-Free Methods.

M&M (2nd paragraph p 592) say that "we do not often use significance tests for a single
proportion, because it is uncommon to have a situation where there is a precise proportion
that we want to test". But they forget paired studies, and even the sign test for matched
pairs, which they themselves cover in section 7.1, page 521. They give just 1 exercise
(8.18) where they ask you to test π = 0.5 vs π > 0.5.

e.g. 2  Another example, dealing with responses in a setup where the "null" is π = 1/3,
the "Triangle Taste Test", is described on the next page.

2. Large n: Gaussian Approximation

Test π = π0 :    z = ( p – π0 ) / SE0[p] = ( p – π0 ) / √( π0[1–π0] / n )

Note that the test uses a variance based on the (specified) π0; the "usual" CI uses a
variance based on the (observed) p.

(Dis)Continuity Correction †

Because we approximate a discrete distribution [where p takes on the values 0/n,
1/n, 2/n, ... n/n corresponding to the integer values (0,1,2, ..., n) in the numerator
of p] by a continuous Gaussian distribution, authors have suggested a 'continuity
correction' (or, if you are more precise in your language, a 'discontinuity' correction).
This is the same concept as we saw back in §5.1, where we said that a binomial
count of 8 became the interval (7.5, 8.5) on the interval scale. Thus, e.g., if we want
to calculate the probability that the proportion out of 10 is ≥ 8/10, we need the
probability of ≥ 7.5 on the continuous scale.

If we work with the count itself in the numerator, this amounts to reducing the
absolute deviation |y – nπ0| by 0.5. If we work in the proportion scale, the absolute
deviation is reduced by 0.5/n, viz.

    zc = ( |y – nπ0| – 0.5 ) / SE[y] = ( |y – nπ0| – 0.5 ) / √( n π0[1–π0] )
or
    zc = ( |p – π0| – 0.5/n ) / SE[p] = ( |p – π0| – 0.5/n ) / √( π0[1–π0] / n )

† Colton [who has a typo in the formula on p ___] and A&B deal with this; M&M
do not, except to say on p386-7 "because most statistical purposes do not require
extremely accurate probability calculations, we do not emphasize use of the
continuity correction". There are some 'fundamental' problems here that statisticians
disagree on. The "Mid-P" material (below) gives some of the flavour of the debate.
page 8
Inference concerning a single
M&M §8.1
EXAMPLE of Testing
: THE TRIANGLE TASTE TEST
As part of preparation for a double blind RCT of lactase-reduced infant formula on
infant crying behaviour, the experimental formulation was tested for its similarity in
taste to the regular infant formula . n mothers in the waiting room at MCH were
given the TRIANGLE TASTE TEST i.e. they were each given 3 coded formula
samples -- 2 containing the regular formula and 1 the experimental one. Told that "2
of these samples are the same and one sample is different", p = y/n correctly
identify the odd sample. Should the researcher be worried that the experimental
formula does not taste the same? (assume infants are no more or less tastediscriminating than their mothers) [ study by Ron Barr, Montreal Children's Hospital]
Our observed proportion of 5/12 projects to a one-sided 95% CI of "as many as 65%
in the population get it right". In this worst-case scenario, assuming that the
percentage of right answers in the population is a mix of a proportion π can who can
really tell and one third of the remaining (1-π can ) who get it right by guessing, we
equate
0.65 = π can + (1-π can ) / 3
giving us an upper bound π can = (0.65-0.33) / (2/3) = 0.48 or 48%.
The null hypothesis being tested is
    H0: π(correctly identified samples) = 0.33
against Ha: π > 0.33
[here, for once, it is difficult to imagine a 2-sided alternative -- unless mothers were
very taste-discriminating but wished to confuse the investigator]

We consider two situations (the real study with n=12, and a hypothetical larger
sample of n=120 for illustration):
• 5 of n = 12 mothers correctly identified the odd sample, i.e. p = 5/12 = 0.42
• 50 of 120 (p = 0.42) mothers identified the odd sample.

n = 12:
Degree of evidence against H0
= Prob(5 or more correct | π = 0.33)            ... a Σ of 8 probabilities
= 1 - Prob(4 or fewer correct | π = 0.33)       ... a shorter Σ of only 5
= 1 - [ P(0) + P(1) + P(2) + P(3) + P(4) ] = 0.37*

*These calculations can be done easily even on a calculator or spreadsheet, without
any combinatorials:
    P(0) = 0.67¹²                            = 0.008
    P(1) = 12 × 0.33 × P(0) / [1 × 0.67]     = 0.048
    P(2) = 11 × 0.33 × P(1) / [2 × 0.67]     = 0.131
    P(3) = 10 × 0.33 × P(2) / [3 × 0.67]     = 0.215
    P(4) =  9 × 0.33 × P(3) / [4 × 0.67]     = 0.238
                                        Σ    = 0.640
so Prob(5 or more correct | π = 0.33) = 1 - 0.64 ≈ 0.36.
Using n=12 and p=0.30 in Table C gives 0.28; using p=0.35 gives 0.42. Interpolation gives 0.37 approx.
Can also obtain this probability directly via Excel, using the function
    1 - BINOMDIST(4, 12, 0.33333, TRUE)

n = 120:
Test π = 0.33 :
    z = ( 0.42* – 0.33 ) / √( 0.33[1–0.33] / 120 ) = 2.1
So P = Prob[ ≥ 50 | π = 0.33 ] = Prob[ Z ≥ 2.1 ] = 0.018

* We treat the proportion 50/120 as a continuous measurement; in fact it is based on
an integer numerator 50; we should treat 50 as 49.5 to 50.5, so "≥ 50" is really "> 49.5".
The probability of obtaining 49.5/120 or more is the probability of
Z = ( 0.413 – 0.33 ) / √( 0.33[1–0.33] / 120 ) or more. With n=120, the continuity
correction does not make a large difference; however, with smaller n, and its coarser
grain, the continuity correction [which makes differences smaller] is more substantial.
So, by conventional criteria (Prob < 0.05 is considered a cutoff for evidence against
H0 ) there is not a lot of evidence to contradict the H0 of taste similarity of the
regular and experimental formulae.
With a sample size of only n=12, we cannot rule out the possibility that a sizable
fraction of mothers could truly distinguish the two.
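The two calculations on this page, as a quick check. From Python (a minimal sketch with scipy; the rounded values 0.42 and 0.33 are used so the arithmetic matches the text above):

from scipy.stats import binom, norm

# n = 12: exact Prob(5 or more correct | pi = 1/3)
print(1 - binom.cdf(4, 12, 1/3))                     # about 0.37

# n = 120, 50 correct: Gaussian approximation
z = (0.42 - 0.33) / (0.33 * 0.67 / 120) ** 0.5
print(round(z, 1), round(1 - norm.cdf(z), 3))        # 2.1 and 0.018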
page 9
Inference concerning a single
M&M §8.1
Sample Size for CI's and Tests involving π

n to yield a (2-sided) CI with margin of error m at confidence level 1-α (see M&M p 593, Colton p161)

    |--- margin of error --->|
    (-----------------------•-----------------------)

• see CI's as a function of n in tables and nomograms
• (or) large-sample CI: p ± Zα/2 SE(p) = p ± m, with SE(p) = √( p[1-p] / n ), so...

    n = p[1-p] • Zα/2² / m²

If unsure, use the largest SE, i.e. when p = 0.5, i.e.

    n = 0.25 • Zα/2² / m²                                                   [1.c]

n for power 1-β to "detect" a population proportion π1 that is ∆ units from π0 (the test value); type I error = α (Colton p 161):

    n = { Zα/2 √( π0[1–π0] ) – Zβ √( π1[1–π1] ) }² / ∆²
      ≈ { Zα/2 – Zβ }² π̄[1–π̄] / ∆² ,  where π̄ is the average of π0 and π1     [1.t]

Notes: Zβ will be negative; the formula is the same as for testing µ, with σ0/1 = √( π[1–π] ).
( See also homegrown exercise # ___ )

Worked example 1: sample size for Test that π(preferences) = 0.5 vs ≠ 0.5,
or Sign Test that median difference = 0

Test: H0: MedianD = 0 vs Halt: MedianD ≠ 0, or H0: π(+) = 0.5 vs Halt: π(+) > 0.5;
α = 0.05 (2-sided); against Halt: π(+) = 0.65, say.
[ at π = ave of 0.5 & 0.65, √( π[1–π] ) = 0.494 ]

    n ≈ { Zα/2 – Zβ }² { 0.494 / 0.15 }²                                    [using 1.t]

α=0.05 (2-sided) & β=0.2 ... Zα = 1.96; Zβ = -0.84, so
    (Zα/2 – Zβ)² = {1.96 – (–0.84)}² ≈ 8, i.e.
    n ≈ 8 { 0.494 / 0.15 }² = 87

Worked example 2: sample size for Taste Test of π(correct) = 1/3 vs > 1/3

If set α = 0.05 (hardliners might allow a 1-sided test here), then Zα = 1.645;
if want 90% power, then Zβ = -1.28. Then, using eqn [1.t] above, the n's for 90% Power against...

    π(correct) =   0.4    0.5    0.6    0.7    0.8
    n          =   400     69     27     14      8

page 10
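Formula [1.t] is easily programmed too. From Python (a minimal sketch, scipy assumed, using the approximate version with π̄):

from scipy.stats import norm

def n_test_single_proportion(p0, p1, alpha=0.05, power=0.80, sided=2):
    # [1.t] approximation: n = (Z_alpha/2 - Z_beta)^2 * pbar(1-pbar) / Delta^2
    z_a = norm.ppf(1 - alpha / sided)
    z_b = norm.ppf(power)          # positive; the notes' Z_beta = -0.84 for 80% power
    pbar = (p0 + p1) / 2
    return (z_a + z_b) ** 2 * pbar * (1 - pbar) / (p1 - p0) ** 2

print(n_test_single_proportion(0.5, 0.65))   # about 85; the worked example (rounding 7.84 up to 8) gives 87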
Inference concerning 2 π's -- M&M §8.2 ; updated Dec 14, 2003

Parameters: π1 and π2 .... π's are proportions / prevalences / risks

3 Comparative measures / parameters:

COMPARATIVE PARAMETER                                    estimate (New Scale)
(Risk or Prevalence) Difference:  π1 – π2                p1 - p2
(Risk or Prevalence) Ratio:       π1 / π2                log[ p1/p2 ] = log[p1] – log[p2]
Odds Ratio:  [π1/(1–π1)] / [π2/(1–π2)] = odds1/odds2     log[ odds1/odds2 ] = log[odds1] – log[odds2] = logit1 – logit2

Large-sample CI for a COMPARATIVE MEASURE / PARAMETER (if the 2 estimates are uncorrelated)

IN GENERAL (if calculations are in a transformed scale, must back-transform):
    estimate1 - estimate2 ± z SE[ estimate1 - estimate2 ]
  = estimate1 - estimate2 ± z Sqrt[ Var[estimate1] + Var[estimate2] ]
Remember: SE's don't add! Their squares do!

IN PARTICULAR

(Risk or Prevalence) Difference, π1 – π2:                    cf. Rothman2002 p 135 Eqn 7-2
    p1 - p2 ± z SE[ p1 - p2 ]
  = p1 - p2 ± z Sqrt[ SE²[p1] + SE²[p2] ]
  = p1 - p2 ± z Sqrt[ p1[1-p1]/n1 + p2[1-p2]/n2 ]

(Risk or Prevalence) Ratio, π1/π2:                           cf. Rothman2002 p 135 Eqn 7-3
    anti-log[ log[p1/p2] ± z SE[ log[p1] – log[p2] ] ]
  = anti-log[ log[p1/p2] ± z Sqrt[ SE²[ log[p1] ] + SE²[ log[p2] ] ] ]
  From 8.1:  SE²[ log[p1] ] = Var[ log[p1] ] = 1/#positive1 – 1/#total1
             SE²[ log[p2] ] = Var[ log[p2] ] = 1/#positive2 – 1/#total2

Odds Ratio, [π1/(1–π1)] / [π2/(1–π2)] = odds1/odds2:         cf. Rothman2002 p 139 Eqn 7-6
    anti-log[ log[oddsRatio] ± z SE[ logit1 – logit2 ] ]
  = anti-log[ log[oddsRatio] ± z Sqrt[ SE²[logit1] + SE²[logit2] ] ]
  From 8.1:  SE²[ logit1 ] = Var[ logit1 ] = 1/#positive1 + 1/#negative1
             SE²[ logit2 ] = Var[ logit2 ] = 1/#positive2 + 1/#negative2

Var[ log of OR est. ] = 1/a + 1/b + 1/c + 1/d  ==>  "Woolf's Method": CI[ODDS RATIO]
page 1
Inference concerning 2
's M&M §8.2 ; updated Dec 14, 2003
Large-sample Test of π1 = π2, equivalent to a Test of π1 – π2 = 0 (Risk or Prevalence Difference = 0);
same as a Test of π1/π2 = 1 (Risk or Prevalence Ratio = 1);
same as a Test of [π1/{1–π1}] / [π2/{1–π2}] = 1 (ODDS Ratio = 1).

    z = ( p1 - p2 - {∆=0} ) / SE[ p1 - p2 ]
      = ( p1 - p2 ) / √( p̄[1-p̄]/n1 + p̄[1-p̄]/n2 )
      = ( p1 - p2 ) / √( p̄[1-p̄] { 1/n1 + 1/n2 } )          †  ††

    [ use estimate p̄ = ( n1 p1 + n2 p2 ) / ( n1 + n2 ) = total +ve / total in test ]

The generic 2x2 contingency table:
                +ve         -ve        both
sample 1        y1 (%)      n1-y1      n1 (100%)
sample 2        y2 (%)      n2-y2      n2 (100%)
total           y (%)       n-y        n (100%)

Examples:

1 Bromocriptine for unexplained 1º infertility (BMJ 1979)
                  became pregnant    did not    total no. couples
Bromocriptine     7 (29%)            17         24 (100%)
Placebo           5 (22%)            18         23 (100%)
total             12 (26%)           35         47 (100%)

2 Vitamin C and the common cold (CMAJ Sept 1972 p 503)
                  no colds      ≥ 1 cold    total subjects
Vitamin C         105 (26%)     302         407 (100%)
Placebo           76 (18%)      335         411 (100%)
total             181 (22%)     637         818 (100%)

3 Stroke Unit vs Medical Unit for Acute Stroke in elderly?
  Patient status at hospital discharge (BMJ 27 Sept 1980)
                  indept.       dependent   total no. pts
Stroke Unit       67 (66%)      34          101 (100%)
Medical Unit      46 (51%)      45          91 (100%)
total             113 (59%)     79          192 (100%)

† Continuity correction: use |p1 – p2| – [ 1/(2n1) + 1/(2n2) ] in the numerator
†† Variances add if the proportions are uncorrelated

page 2
Inference concerning 2
's M&M §8.2 ; updated Dec 14, 2003
Rx for primary infertility (Bromocriptine vs Placebo)

95% CI on ∆π:  0.29 - 0.22 ± z √( 0.29 • 0.71 / 24 + 0.22 • 0.78 / 23 )
             = 0.07 ± 1.96 × 0.13
             = 0.07 ± 0.25

test ∆π=0:   z = ( 0.29 - 0.22 ) / √( 0.26 [0.74] { 1/24 + 1/23 } ) = 0.07 / 0.13 = 0.55
             P = 0.58 (2-sided)

Stroke Unit vs Medical Unit

95% CI for ∆π:  0.66 - 0.51 ± z √( 0.66 • 0.34 / 101 + 0.51 • 0.49 / 91 )
              = 0.15 ± 1.96 × 0.07
              = 0.15 ± 0.14

test ∆π=0: [carrying several decimal places, for comparison with χ² later]
    z = ( 0.6634 – 0.5054 ) / √( 0.5885 • 0.4115 • { 1/101 + 1/91 } ) = 0.1580 / 0.0711 = 2.22
    [ estimate of the hypothesized common π: total +ve / total = 113 / 192 = 0.5885 ]
    P = Prob[ |Z| ≥ 2.22 ] = 0.026 (2-sided)

Vitamin C and the common cold

95% CI on ∆π:  0.26 - 0.18 ± z √( 0.26 • 0.74 / 407 + 0.18 • 0.81 / 411 )
             = 0.08 ± 1.96 × 0.03
             = 0.08 ± 0.06

test ∆π=0: [carrying several decimal places, for comparison with χ² later]
    z = ( 0.258 - 0.185 ) / √( 0.221 [0.779] { 1/407 + 1/411 } ) = 0.073 / 0.029 = 2.52   (2.517 = √6.337)
    P = 0.006 (1-sided); P = 0.012 (2-sided)

Fall in BP with Reduction in Dietary Salt

Response of interest:  Y = Achieve DBP < 90 ? [ 0 / 1 ],  i.e. the proportion achieving DBP < 90 mm
H0:   π(Y=1 | Normal Sodium Diet) = π(Y=1 | Low Sodium Diet)
Halt: π(Y=1 | Normal Sodium Diet) ≠ π(Y=1 | Low Sodium Diet)
α = 0.05 (2-sided)

    "Normal" Group: 11/50 ( 22 % )        "Low" Group: 17/53 ( 32 % )

    z = ( 0.32 – 0.22 ) / √( 0.27 • 0.73 • [ 1/53 + 1/50 ] )

i.e. |z| = 1.14, which is < Zα = 1.96, and so the observed difference of 10% is "N.S."
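The Stroke Unit calculations, as a quick check. From Python (a minimal sketch; scipy only for the tail probability):

from math import sqrt
from scipy.stats import norm

p1, n1 = 67/101, 101      # Stroke Unit, independent at discharge
p2, n2 = 46/91, 91        # Medical Unit
z = 1.96

se_ci = sqrt(p1*(1 - p1)/n1 + p2*(1 - p2)/n2)       # SE at the observed p's (for the CI)
print(round(p1 - p2, 2), round(z * se_ci, 2))       # about 0.16 and 0.14 (cf. 0.15 +/- 0.14 above)

p_bar = (67 + 46) / (n1 + n2)                       # pooled p (for the test of delta = 0)
z_obs = (p1 - p2) / sqrt(p_bar*(1 - p_bar) * (1/n1 + 1/n2))
print(round(z_obs, 2), round(2 * (1 - norm.cdf(z_obs)), 3))   # 2.22 and 0.026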
page 3
Inference concerning 2
's M&M §8.2 ; updated Dec 14, 2003
CI for Risk Ratio (Rel. RISK) or Prev. Ratio cf. Rothman2002 p135
CI for ODDS RATIO
cf. Rothman2002 p139
Example: Vitamin C and the common cold (CMAJ Sept 1972 p 503) . . . REVISITED

                no colds      ≥ 1 cold      total subjects
Vitamin C       105 (26%)     302 (74%)     407 (100%)
Placebo          76 (18%)     335 (82%)     411 (100%)
total           181 (22%)     637 (78%)     818 (100%)

Risk Ratio:
    RR^ = Prob[ cold | Vitamin C ] / Prob[ cold | Placebo ] = 74% / 82% = 0.91

Odds:
                had cold(s)   avoided colds   # with cold(s) for every 1 who avoided colds (odds of cold(s))
    Vitamin C   302           105             2.88 (:1)
    Placebo     335            76             4.41 (:1)

    odds Ratio = 2.88 / 4.41 = 0.65     >>> OR^ = 0.65

CI[RR]: antilog{ log[0.91] ± z SE[ log[p1] – log[p2] ] }
      = antilog{ log[0.91] ± z Sqrt[ SE²[ log[p1] ] + SE²[ log[p2] ] ] }
From 8.1:  SE²[ log[p1] ] = Var[ log[p1] ] = 1/302 – 1/407 = 0.000854
           SE²[ log[p2] ] = Var[ log[p2] ] = 1/335 – 1/411 = 0.000552
CI[RR]: antilog{ log[0.91] ± z Sqrt[ 0.000854 + 0.000552 ] }
      = antilog{ log[0.91] ± 0.073 } = 0.85 to 0.98

CI[OR] = anti-log [ log[oddsRatio] ± z SE[ logit1 – logit2 ] ]
From 8.1:  SE²[ logit1 ] = 1/#positive1 + 1/#negative1
           SE²[ logit2 ] = 1/#positive2 + 1/#negative2
SE[ logit1 – logit2 ] = Sqrt[ (1/302 + 1/105) + (1/335 + 1/76) ] = 0.17
z SE[ logit1 – logit2 ] = 1.96 × 0.17 = 0.33
anti-log [ log[0.65] ± 0.33 ] = exp[ –0.43 ± 0.33 ] = 0.47 to 0.90
>>>>> OR^ = 0.65 ; CI[OR] = {0.47 to 0.90}

Shortcut: calculate exp{ z × SE[ log RR^ ] } and use it as a multiplier and
divider of RR^. In our e.g., exp{ z × SE[ log RR^ ] } = exp{0.073} = 1.076.
Thus {RRLOWER , RRUPPER} = {0.91 / 1.076 , 0.91 × 1.076} = {0.85 , 0.98}.
You can use this shortcut whenever you are working with log-based CI's
that you convert back to the original scale; there they become "multiply-divide"
symmetric rather than "plus-minus" symmetric.
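The RR and OR intervals above, as a quick check. From Python (a minimal sketch using the same Woolf-type variances):

from math import log, exp, sqrt

a, n1 = 302, 407     # colds / total, Vitamin C
c, n2 = 335, 411     # colds / total, Placebo
b, d = n1 - a, n2 - c
z = 1.96

rr = (a/n1) / (c/n2)
se_log_rr = sqrt(1/a - 1/n1 + 1/c - 1/n2)
print(round(rr, 2), round(exp(log(rr) - z*se_log_rr), 2), round(exp(log(rr) + z*se_log_rr), 2))
# 0.91 with CI 0.85 to 0.98

orat = (a/b) / (c/d)
se_log_or = sqrt(1/a + 1/b + 1/c + 1/d)
print(round(orat, 2), round(exp(log(orat) - z*se_log_or), 2), round(exp(log(orat) + z*se_log_or), 2))
# 0.65 with CI 0.47 to 0.91 (0.90 above, with rounded intermediate values)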
From SAS
PROC FORMAT;
VALUE onefirst 0="z_0" 1="a_1";
DATA CI_RR_OR;
INPUT vitC cold n_people;
LINES;
1
1
302
1
0
105
0
1
335
0
0
76
;
PROC FREQ data=CI_RR_OR
ORDER=FORMATTED;
TABLES vitC*cold / CMH;
WEIGHT n_people;
FORMAT vitC cold onefirst. ;RUN;
From Stata
Immediate: csi 302 335 105 76
From SAS
cs stands for 'cohort study'
<<<< See statements for RR
input
vit_c
1
1
0
0
= 0.65
4.41
^ = 0.65
>>> OR
(output gives both RR and OR)
cold n_people
1
302
0
105
1
335
0
76
end
cs cold vit_c [freq = n_people]
cc stands for 'case control study'
Be CAREFUL as to rows/cols
Index exposure category must be
1 s t r o w ; reference exposure
category must be 2nd
input
If necessary, use FORMAT to
have table come out this way...
(note trick to reverse rows/cols)
end
cc cold vit_c [freq =n_people],woolf
SAS doesn't know if cc or cohort
page 4
From Stata
Immediate: cci 302 335 105 76,woolf
vit_c
1
1
0
0
cold n_people
1
302
0
105
1
335
0
76
Inference concerning 2
's M&M §8.2 ; updated Dec 14, 2003
"Test-based" CI's ... IN GENERAL
"Test-based" CI's for... IN PARTICULAR
Preamble
• Difference of 2 proportions π1 - π2 (Risk or Prevalence Difference)
In 1959, when Mantel and Haenszel developed their summary Odds Ratio
measure over 2 or more strata, they did not supply a CI to accompany this
point estimate. From 1955 onwards, the main competitor was the weighted
average (in the log OR scale) and accompanying CI obtained by Woolf'. But
this latter method has problems with strata where one or more cell
frequencies are zero. In 1976, Miettinen developed the "test-based"
method for epidemiologic situations where the summary point estimate is
easily calculated, the standard error estimate is unknown or hard to
compute, but where a statistical test of the null value of the parameter of
interest (derived by aggregating a "sub-statistic" from each stratum) is
already available. Despite the 1986 development, by Robins, Breslow
and Greenland, of a direct standard error for the log of the Mantel-Haenszel
OR estimator, the "test-based" CI is still used (see A&B, KKM).
Observe: p1 & p2 and (maybe via the p-value) the calculated value of x².
This implies that
    Sqrt[ observed x² value ] = observed x value = observed z value.
But... observed z statistic = ( p1 – p2 ) / SE[ p1 – p2 ].
So...
    SE[ p1 – p2 ] = ( p1 – p2 ) / (observed z statistic)      [use +ve sign]

95% CI for p1 - p2 :
Even though its main usefulness is for summaries over strata, the idea can
be explained using a simpler and familiar (single starum) example, the
comparison of two independent means using a z-test with large df (the
principle does not depend on t vs. z). Suppose all that was reported was
the difference in sample means, and the 2-sided p-value associated with a
test of the null hypothesis that the mean difference was zero. From the
sample means, and the p-value, how could we obtain a 95%CI for the
difference in the "population' means? The trick is to
1 work back (using a table of the normal distribution) from the p-value to
the corresponding value of the z-statistic (the number of standard
errors that the difference in sample means is from zero);
2 divide this observed difference by the observed z value, to get the
standard error of the difference in sample means, and
3 use the observed difference, and the desired multiple (1.645 for 90%
CI, 1.96 for 95% etc.) to create the CI.
    ( p1 – p2 ) ± {z value for 95%} × SE[ p1 – p2 ]
i.e.,
    ( p1 – p2 ) ± {z value for 95%} × ( p1 – p2 ) / (observed z statistic)
i.e., after re-arranging terms,
    ( p1 – p2 ) × { 1 ± (z value for 95%) / (observed z statistic) }
or, in terms of a reported chi-square statistic,
    ( p1 – p2 ) × { 1 ± (z value for 95%) / Sqrt[ observed chi-square statistic ] }
The same procedure is directly applicable for the difference of two
independently estimated proportions. If one tests the (null) difference
using a z-test, one can obtain the SE of the difference by dividing the
observed difference in proportions by the z statistic; if the difference was
tested by a chi-square statistic, one can obtain the z-statistic by taking the
square root of the observed chi-square value (authors call this square root
an observed 'chi' value). Either way, the observed z-value leads directly to
the SE, and from there to the CI.
This is worked out in the next example, where it is assumed that the null
hypothesis is tested via a chi-squared (x2) test
See Section 12.3 of Miettinen's "Theoretical Epidemiology"
Technically, when the variance is a function of the parameter (as is
the case with binary response data), the test-based CI is most
accurate close to the Null. However, as you can verify by comparing
test-based CIs with CI's derived in other ways, the inaccuracies are
not as extreme as textbooks and manuals (e.g. Stata) suggest.
page 5
Inference concerning 2
's M&M §8.2 ; updated Dec 14, 2003
"Test-based" CI's for... IN PARTICULAR
"Test-based" CI's for... IN PARTICULAR
• Ratio of 2 proportions π1 /π2
• Ratio of 2 odds π1 /[1–π1 ] and π2 /[1–π2 ]
( Risk Ratio; Prevalence Ratio; Relative Risk; "RR" )
( Odds Ratio; "OR" )
Observe :(i) rr = p1 /p2 and
Observe :(i) or = p1 /[1–p1 ] / p2 /[1–p2 ] ( " a×d / b×c " ) and
(ii) (maybe via p-value) the value of x2 statistic (H0 : RR=1)
(ii) (maybe via p-value) the value of x2 statistic (H0 : OR=1)
>>> Sqrt[observed x2 value] = observed x value = observed z value
>>> Sqrt[observed x2 value] = observed x value = observed z value
In log scale, in relation to log[RRnull ] = 0, observed z value would be:
In log scale, in relation to log[ORnull ] = 0, observed z value would be:
observed z value = ( log[rr] – 0) / SE[ log[rr] ]
observed z value = ( log[or] – 0) / SE[ log[or] ]
This implies that
This implies that
SE[ log[rr] ] = log[rr] / observed z value
use +ve sign
SE[ log[or] ] = log[or] / observed z value
95% CI for log[RR]:
use +ve sign
95% CI for log[OR]:
log[rr] ± {z value for 95%} × SE[ log[rr] ]
log[or] ± {z value for 95%} × SE[ log[or] ]
i.e.,...
i.e.,...
log[rr]
log[rr] ± {z value for 95%} × observed z value
log[or]
log[or] ± {z value for 95%} × observed z value
i.e., after re-arranging terms..
log[rr] ×
i.e., after re-arranging terms..
z value for 95%
{1 ± observed
z statistic }
log[or] ×
z value for 95%
{1 ± observed
z statistic }
Going back to RR scale, by taking antilogs*...
Going back to OR scale, by taking antilogs*...
95% CI for RR:
95% CI for OR:
rr to power of
z value for 95%
{1 ± observed
z statistic }
or to power of
z value for 95%
{1 ± observed
z statistic }
See Section13.3 of Miettinen's "Theoretical Epidemiology"
See Section13.3 of Miettinen's "Theoretical Epidemiology"
* antilog[ log[a] × b ] = exp[ log[a] × b ]= { exp[log[a]] } to power of b = a to power of b
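From Python (a minimal sketch of the test-based recipe; applying it, for illustration, to the Vitamin C figures from the earlier pages):

from math import sqrt

def test_based_ci(estimate, chi_square, z=1.96):
    # estimate raised to the powers 1 -/+ z/chi, where chi = sqrt(observed chi-square)
    chi = sqrt(chi_square)
    lims = (estimate ** (1 - z/chi), estimate ** (1 + z/chi))
    return min(lims), max(lims)

# Vitamin C: RR = 0.91, chi-square = 6.337 (z = 2.52) from the test of the null
print(test_based_ci(0.91, 6.337))   # about (0.85, 0.98), close to the SE-based CI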
page 6
Inference concerning 2
's M&M §8.2 ; updated Dec 14, 2003
Sample Size considerations... CI(π1 – π2): n's to produce a CI for the difference in π's of prespecified margin of error m at the stated confidence level

• large-sample CI: p1 – p2 ± Z SE( p1 – p2 ) = p1 – p2 ± m
• SE( p1 – p2 ) = √( p1{1–p1}/n1 + p2{1–p2}/n2 )
Simplify by using an average p; if use equal n's, then
    n per group = 2 p{1–p} Zα/2² / [margin of error]²
M&M use the fact that if p = 1/2 then p(1–p) = 1/4, and so 2p(1–p) = 1/2, so the above equation becomes
    [max] n per group = Zα/2² / ( 2 × [margin of error]² )

Sample Size considerations... Test involving πT and πC:
Test H0: πT = πC vs Ha: πT ≠ πC : n's for power 1–β if πT = πC + ∆; prob[type I error] = α

    n per group = { Zα/2 √( 2 πC{1–πC} ) – Zβ √( πC{1–πC} + πT{1–πT} ) }² / ∆²      (See Colton p 168)
                ≈ 2 (Zα/2 – Zβ)² π̄{1–π̄} / ∆²  =  2 {Zα/2 – Zβ}² { σ0/1 / ∆ }²

e.g. α=0.05 (2-sided) & β=0.2 ... Zα = 1.96; Zβ = -0.84,
    2(Zα/2 – Zβ)² = 2{1.96 – (–0.84)}² ≈ 16, i.e.
    n per group ≈ 16 • π̄{1–π̄} / ∆²
So n ≈ 100 for the T group and n ≈ 100 for the C group if πT = 0.6 and πC = 0.4.

See Sample Size Requirements for Comparison of 2 Proportions (from the text by Smith and Morrow) under
Resources for Chapter 8.
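Both versions of the test formula are easy to program. From Python (a minimal sketch, scipy assumed, for the πT = 0.6 vs πC = 0.4 example):

from scipy.stats import norm

def n_per_group(p_c, p_t, alpha=0.05, power=0.80):
    z_a, z_b = norm.ppf(1 - alpha/2), norm.ppf(power)   # z_b positive; the notes' Z_beta is -z_b
    delta = abs(p_t - p_c)
    exact = (z_a * (2*p_c*(1 - p_c))**0.5 + z_b * (p_c*(1 - p_c) + p_t*(1 - p_t))**0.5)**2 / delta**2
    pbar = (p_c + p_t) / 2
    approx = 2 * (z_a + z_b)**2 * pbar * (1 - pbar) / delta**2
    return exact, approx

print(n_per_group(0.4, 0.6))   # about (94, 98); the quick "16 * pbar(1-pbar)/Delta^2" rule gives 100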
page 7
Inference concerning 2
's M&M §8.2 ; updated Dec 14, 2003
Effect of Unequal Sample Sizes ( n1 ≠ n2 ) on precision of estimated differences

If we write the SE of an estimated difference in mean responses as σ √( 1/n1 + 1/n2 ), where σ is the (average) per-unit variability of the response, then we can establish the
following principles:

1  If costs and other factors (including unit variability) are equal, and if both types of units are equally scarce or equally
plentiful, then for a given total sample size of n = n1 + n2, an equal division
of n, i.e. n1 = n2, is preferable, since it yields a smaller SE(estimated difference in
means) than any non-symmetric division. However, the SE is relatively
unaffected until the ratio exceeds 70:30. This is seen in the following table,
which gives the value of √( 1/n1 + 1/n2 ) = SE(estimated difference in means) for
various combinations of n1 and n2 adding to 100 (the 100 itself is arbitrary) and
assuming σ = 1 (also arbitrary).

    n1    n2    SE(estimated difference in means)    % increase in SE over SE(50:50)*
    50    50    0.200                                 ----
    60    40    0.204                                  2.1%
    65    35    0.210                                  4.8%
    70    30    0.218                                  9.1%
    75    25    0.231                                 15.5%
    80    20    0.250                                 25.0%
    85    15    0.280                                 40.0%

* if the sample sizes are in the ratio π:(1–π), the SE is 50/√( π(1-π) ) percent of SE(50:50), i.e. the % increase is this figure minus 100.
2  If one type of unit is much scarcer, and thus the limiting factor,
then it makes sense to choose all (say n1) of the available scarcer units, and
some n2 ≥ n1 of the other type. The greater is n2, the smaller the SE of the
estimated difference. However, there is a 'law of diminishing returns' once n2 is
more than a few multiples of n1. This is seen in the following table, which
gives the value of √( 1/n1 + 1/n2 ) for n1 fixed (arbitrarily) at 50 and n2 ranging
from 1 × n1 to 100 × n1; again, we assume σ = 1.

    n1     n2      Ratio (K)    SE(µ̂1 - µ̂2)    SE(K:1) as % of SE(1:1)    SE(K:1) / SE(∞:1)*
    50       50       1.0         0.2000              –                       1.414
    50       75       1.5         0.1825             91.3%                    1.290
    50      100       2.0         0.1732             86.6%                    1.225
    50      150       3.0         0.1633             81.6%                    1.155
    50      200       4.0         0.1581             79.1%                    1.118
    50      250       5.0         0.1549             77.5%                    1.095
    50      300       6.0         0.1527             76.4%                    1.080
    50      400       8.0         0.1500             75.0%                    1.061
    50      500      10.0         0.1483             74.2%                    1.049
    50     1000      20.0         0.1449             72.4%                    1.025
    50     5000     100.0         0.1421             71.1%                    1.005
    50        ∞         ∞         0.1414             70.7%                    1

* calculated as √( (K + 1) / K ) ;   'efficiency' = K / (K + 1)

Note: these principles apply to both measurement and count data

page 8
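Both principles can be verified with one line of arithmetic per entry. From Python (a small sketch):

from math import sqrt

def se_diff(n1, n2, sigma=1.0):
    return sigma * sqrt(1/n1 + 1/n2)

# Principle 1: splits of a fixed total of 100
print([round(se_diff(n1, 100 - n1), 3) for n1 in (50, 60, 70, 80)])   # 0.2, 0.204, 0.218, 0.25
# Principle 2: n1 fixed at 50, diminishing returns as n2 grows
print([round(se_diff(50, 50*k), 4) for k in (1, 2, 4, 10, 100)])      # 0.2, 0.1732, 0.1581, 0.1483, 0.1421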
Inference concerning 2
's M&M §8.2 ; updated Dec 14, 2003
Sample size calculation when using unequal sample sizes to estimate / test the difference in 2 means or proportions

For power (sensitivity) 1–β, and specificity 1–α (2-sided), the sample sizes n1 and n2 have to be such that

      Zα/2 SE( x̄1 – x̄2 )  –  Zβ SE( x̄1 – x̄2 )  =  ∆

(if β < 0.5, then Zβ will be negative). If we assume equal per-unit variability, σ, of the x's in the 2 populations, we can write the requirement as

      Zα/2 σ √( 1/n1 + 1/n2 )  –  Zβ σ √( 1/n1 + 1/n2 )  =  ∆ .

If we rewrite  √( 1/n1 + 1/n2 )  as  √[ (1/n1) (1 + n1/n2) ]  and rearrange the inequality, we get

      n1 = { 1 + n1/n2 } (Zα/2 – Zβ)² { σ / ∆ }²

or, denoting n2/n1 by K,

      n1 = { 1 + 1/K } (Zα/2 – Zβ)² { σ / ∆ }²  =  { (K+1)/K } (Zα/2 – Zβ)² { σ / ∆ }² .

Notes:

a.  If K = 1, so that n1 = n2, then we get the familiar "2" at the front of the sample size formula.

b.  The same factor applies for proportions: if we use  σ0/1 = √( π̄ [1 – π̄] )  as an "average" standard deviation for the individual 0's and 1's in each population, then we get the approximate formula

      n1 ≈ { (K+1)/K } (Zα/2 – Zβ)²  π̄ [1 – π̄] / ∆² .
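A short Python sketch of the (K+1)/K factor (an added illustration; the function name and example values are my own):

   from math import ceil
   from scipy.stats import norm

   def n1_unequal(delta, sigma, K=1.0, alpha=0.05, power=0.80):
       """n1 when n2 = K * n1 ; the factor (K+1)/K replaces the familiar '2'."""
       z_a = norm.ppf(1 - alpha / 2)
       z_b = norm.ppf(1 - power)          # negative, as in the notes
       return ceil(((K + 1) / K) * (z_a - z_b) ** 2 * (sigma / delta) ** 2)

   # e.g. sigma = 1, delta = 0.5: K = 1 needs n1 = n2 = 63; K = 4 needs only n1 = 40 (with n2 = 160)
   print(n1_unequal(0.5, 1, K=1), n1_unequal(0.5, 1, K=4))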
Add-ins for M&M §8 and §9
statistics for epidemiology
Sample Size considerations...
Test involving OR

Test H0: OR = 1  vs  Ha: OR ≠ 1 ;  n's for power 1–β if OR = ORalt ;  prob[type I error] = α

Here I use ln for natural log (elsewhere I have used log; I use them interchangeably).

Work in the ln(or) scale;   SE[ ln(or) ] = √( 1/a + 1/b + 1/c + 1/d )

Need   Zα/2 SE[ ln(or) | H0: ln[OR] = 0 ]  +  Zβ SE[ ln(or) | ORalt ]  <  "∆" ,   where "∆" = ln(ORalt)
[like we did back in Chapter 8, for the "effect of unequal sample sizes"]

Key points

•  ln[or] is most precise when all 4 cells are of equal size; so increasing the control:case ratio leads to diminishing marginal gains in precision. To see this, examine the function

      1/(# of cases)  +  1/(multiple of this # of controls)

   for various values of "multiple".

•  The more unequal the distribution of the etiologic / preventive factor, the less precise the estimate. Examine the functions

      1/(# of exposed cases) + 1/(# of unexposed cases)     and
      1/(# of exposed controls) + 1/(# of unexposed controls).
Reading the graphs on the next page (note the log scale for the observed or):

Take as an example the study in the middle panel, with 200 cases and an exposure prevalence of 8%. Say that the Type I error rate is set at α = 0.05 (2-sided), so that the upper critical value (the one that cuts off the top 2.5% of the null distribution) is close to or = 2. Draw a vertical line at this critical value, and examine how much of each non-null distribution falls to the right of it. This area to the right of the critical value is the power of the study, i.e., the probability of obtaining a significant or when in fact the indicated non-null value of OR is correct. The two curves at each OR value are for studies with 1 (grey) and 4 (black) controls/case. Note that the OR values 1, 1.5, 2.25 and 3.375 are also on a log scale.

Power is larger if...
  i    the non-null OR is >> 1 (cf 1.5 vs 2.25 vs 3.375)
  ii   the exposure is common (cf 2% vs 8% vs 32%), though not near-universal
  iii  one uses more cases (cf 100 vs 200 vs 400), and more controls/case (1 vs 4)
Substitute the expected a, b, c, d values under the null and alternative into the SE's and solve for the numbers of cases and controls.

References: Schlesselman; Breslow and Day, Volume II, ...
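The sketch below (added to the notes; all function names and the crude normal approximation are my own assumptions, not Schlesselman's or Breslow & Day's formulas) fills in expected cells under an alternative OR and evaluates SE[ln(or)] and an approximate power, ignoring the distinction between the null and alternative SEs:

   from math import sqrt, log
   from scipy.stats import norm

   def se_ln_or(a, b, c, d):
       """SE of ln(odds ratio) from a 2x2 table."""
       return sqrt(1/a + 1/b + 1/c + 1/d)

   def rough_power(n_cases, controls_per_case, exposure_prev, or_alt, alpha=0.05):
       """Crude power: expected cells under OR = or_alt, normal approximation on the ln(or) scale.
          exposure_prev is the exposure prevalence in controls; cells are expected counts."""
       n_ctl = controls_per_case * n_cases
       odds1 = or_alt * exposure_prev / (1 - exposure_prev)
       p1 = odds1 / (1 + odds1)
       a, b = n_cases * p1, n_cases * (1 - p1)          # exposed / unexposed cases
       c, d = n_ctl * exposure_prev, n_ctl * (1 - exposure_prev)   # exposed / unexposed controls
       z_a = norm.ppf(1 - alpha / 2)
       return norm.cdf(log(or_alt) / se_ln_or(a, b, c, d) - z_a)

   # more cases, more controls per case, and commoner exposure all raise power
   print(round(rough_power(200, 1, 0.08, 2.25), 2))
   print(round(rough_power(200, 4, 0.08, 2.25), 2))   # higher with 4 controls/case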
Add-ins for M&M §8 and §9
statistics for epidemiology
Factors affecting variability of estimates from, and statistical power of, case-control studies
[Figure (jh 1995-2003): nine panels showing sampling distributions of the observed or (log scale; axis ticks at 0.25, 0.5, 1, 2, 4, 8). Columns: 100, 200 and 400 cases; rows: exposure prevalence 2%, 8% and 32%. Within each panel, distributions are drawn for OR = 1, 1.5, 2.25 and 3.375 (also log-spaced), with one curve for 1 control/case (grey) and one for 4 controls/case (black).]
Add-ins for M&M §8 and §9
The "Exact" Test for 2 x 2 tables [material taken from A&B §4.9]
Even with the continuity correction there will be some doubt about the adequacy of the χ² approximation when the frequencies are particularly small. An exact test was suggested almost simultaneously in the mid-1930s by R. A. Fisher, J. O. Irwin and F. Yates. It consists in calculating the exact probabilities of the possible tables described in the previous subsection. The probability of a table with frequencies

        a     b   |  r1
        c     d   |  r2
      ---------------------
        c1    c2  |  N

is given by the formula

      P[ a | r1, r2, c1, c2 ]  =  r1! r2! c1! c2! / ( N! a! b! c! d! )        (4.25)

This is, in fact, the probability of the observed cell frequencies  c o n d i t i o n a l  on the observed marginal totals, under the null hypothesis of no association between the row and column classifications. Given any observed table, the probabilities of all tables with the same marginal totals can be calculated, and the P value for the significance test calculated by summation. Example 4.14 illustrates the calculations and some of the difficulties of interpretation which may arise. The data in Table 4.6, due to M. Hellman, are discussed by Yates (1934).

Table 4.6  Data on malocclusion of teeth in infants (Yates, 1934)

                     Normal      Infants with
                     teeth       malocclusion      Total
      Breast-fed        4             16             20
      Bottle-fed        1             21             22
                     ------         ------         ------
      Total             5             37             42

There are six possible tables with the same marginal totals as those observed, since neither a nor c (in the notation given above) can fall below 0 or exceed 5, the smallest marginal total in the table. The cell frequencies in each of these tables are shown in Table 4.7. Below them are shown the probabilities of these tables, calculated under the null hypothesis.

Table 4.7  Cell frequencies in tables with the same marginal totals as those in Table 4.6

      0 20 20     1 19 20     2 18 20     3 17 20     4 16 20     5 15 20
      5 17 22     4 18 22     3 19 22     2 20 22     1 21 22     0 22 22
      5 37 42     5 37 42     5 37 42     5 37 42     5 37 42     5 37 42

      a :     0          1          2          3          4          5
      Pa : 0.0310     0.1720     0.3440     0.3096     0.1253     0.0182

The probabilities of the various tables are calculated in the following way*: the probability that a = 0 is, from (4.25),

      P0 = 20! 22! 5! 37! / ( 42! 0! 20! 5! 17! ) = 0.03096.

Tables of log factorials (Fisher and Yates, 1963, Table XXX) are often useful for this calculation, and many scientific calculators have a factorial key (although it may only function correctly for integers less than 70). Alternatively the expression for P0 can be calculated without factorials by repeated multiplication and division after cancelling common factors:

      P0 = (22 × 21 × 20 × 19 × 18) / (42 × 41 × 40 × 39 × 38) = 0.03096.

The probabilities for a = 1, 2, . . ., 5 can be obtained in succession. Thus,

      P1 = [ (5 × 20) / (1 × 18) ] × P0 ,      P2 = [ (4 × 19) / (2 × 19) ] × P1 ,   etc.

The results are shown above.
[Notes from JH:
1. The 5 possible 2×2 tables in the tea-tasting experiment, with all marginal totals = 4, are another example of this hypergeometric distribution.
* 2. Don't worry about the formula and the factorials; Excel has this function built in. It is called the Hypergeometric probability function. It is like the Binomial, except that instead of specifying p, one specifies the size of the POPULATION and the NUMBER OF POSITIVES IN THE POPULATION. For example, to get P1 above, one would ask for HYPGEOMDIST(a;r1;c1;N). The spreadsheet "Fisher's Exact test" uses this function; to use the spreadsheet, simply type in the 4 cell frequencies, a, b, c, and d. The spreadsheet will calculate the probability for each possible table. Then you can find the tail areas yourself.]
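The same probabilities can be checked with scipy's hypergeometric distribution; this Python sketch (an added illustration, analogous to the Excel HYPGEOMDIST approach above) reproduces Table 4.7 and the one-sided and mid-P values for the observed table (a = 4):

   from scipy.stats import hypergeom

   # a = breast-fed infants with normal teeth; margins r1 = 20, c1 = 5, N = 42
   N, r1, c1 = 42, 20, 5
   probs = {a: hypergeom.pmf(a, N, c1, r1) for a in range(6)}
   for a, p in probs.items():
       print(a, round(p, 4))        # 0.0310 0.1720 0.3440 0.3096 0.1253 0.0182

   a_obs = 4
   p_one_sided = sum(p for a, p in probs.items() if a >= a_obs)                       # 0.1435
   p_mid = 0.5 * probs[a_obs] + sum(p for a, p in probs.items() if a > a_obs)         # 0.0808
   print(round(p_one_sided, 4), round(p_mid, 4))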
Add-ins for M&M §8 and §9
The "Exact" Test for 2 x 2 tables
continued...
This is the complete conditional distribution for the observed marginal totals, and
the probabilities sum to unity as would be expected. Note the importance of
carrying enough significant digits in the first probability to be calculated; the above
calculations were carried out with more decimal places than recorded by retaining
each probability in the calculator for the next stage. The observed table has a
probability of 0.1253. To assess its significance we could measure the extent to
which it falls into the tail of the distribution by calculating the probability of that
table or of one more extreme. For a one-sided test the procedure clearly gives P =
0.1253 + 0.0182 = 0.1435. The result is not significant at even the 10% level.
For a two-sided test the other tail of the distribution must be taken into account, and here some ambiguity arises. Many authors advocate that the one-tailed P value should be doubled. In the present example, the one-tailed test gave P = 0.1435 and the two-tailed test would give P = 0.2870. An alternative approach is to calculate P as the total probability of tables, in either tail, which are at least as extreme as that observed in the sense of having a probability at least as small. In the present example we should have
P = 0.1253 + 0.0182 + 0.0310 = 0.1745.
The first procedure is probably to be preferred on the grounds that a
significant result is interpreted as strong evidence for a difference
in the o b s e r v e d d i r e c t i o n , and there is some merit in controlling
the chance probability of such a result to no more than half the
two-sided significance level. The tables of Finney e t a l . (1963)
enable one-sided tests at various significance levels to be made
without computation provided the frequencies are not too great.
To calculate the mid-P value only half the probability of the observed table is included and we have

      mid-P = 0.5(0.1253) + 0.0182 = 0.0808

as the one-sided value, and the two-sided value may be obtained by doubling this to give 0.1617.
The results of applying the exact test in this example may be compared with those obtained by the χ² test with Yates's correction. We find X² = 2.39 (P = 0.12) without correction and X²C = 1.14 (P = 0.29) with correction. The probability level of 0.29 for X²C agrees well with the two-sided value 0.29 from the exact test, and the probability level of 0.12 for X² is a fair approximation to the exact mid-P value of 0.16.

The exact test, and therefore the χ² test with Yates's correction for continuity, have been criticized over the last 50 years on the grounds that they are conservative in the sense that a result significant at, say, the 5% level will be found in less than 5% of hypothetical repeated random samples from a population in which the null hypothesis is true. This feature was discussed in §4.7 and it was remarked that the problem was a consequence of the discrete nature of the data and causes no difficulty if the precise level of P is stated. Another source of criticism has been that the tests are conditional on the observed margins, which frequently would not all be fixed. For example, in Example 4.14 one could imagine repetitions of sampling in which 20 breast-fed infants were compared with 22 bottle-fed infants but in many of these samples the number of infants with normal teeth would differ from 5. The conditional argument is that, whatever inference can be made about the association between breast-feeding and tooth decay, it has to be made within the context that exactly five children had normal teeth. If this number had been different then the inference would have been made in this different context, but that is irrelevant to inferences that can be made when there are five children with normal teeth. Therefore, we do not accept the various arguments that have been put forward for rejecting the exact test based on consideration of possible samples with different totals in one of the margins. The issues were discussed by Yates (1984) and in the ensuing discussion, and by Barnard (1989) and Upton (1992), and we will not pursue this point further. Nevertheless, the exact test and the corrected χ² test have the undesirable feature that the average value of the significance level, when the null hypothesis is true, exceeds 0.5. The mid-P value avoids this problem, and so is more appropriate when combining results from several studies (see §4.7).

Cochran (1954) recommends the use of the exact test, in preference to the χ² test with continuity correction, (i) if N < 20, or (ii) if 20 < N < 40 and the smallest expected value is less than 5. With modern scientific calculators and statistical software the exact test is much easier to calculate than previously and should be used for any table with an expected value less than 5.
Add-ins for M&M §8 and §9
The "Exact" Test for 2 x 2 tables
As for a single proportion, the mid-P value corresponds to an uncorrected χ 2 test,
whilst the exact P value corresponds to the corrected χ 2 test. The confidence limits
for the difference, ratio or odds ratio of two proportions based on the standard errors
given by (4.14), (4.17) or (4.19) respectively are all approximate and the
approximate values will be suspect if one or more of the frequencies in the 2 x 2
table are small. Various methods have been put forward to give improved limits but
all of these involve iterations and are tedious to carry out on a calculator. The odds
ratio is the easiest case. Apart from exact limits, which involve an excessive
amount of calculation, the most satisfactory limits are those of Cornfield ( 1956);
see Example 16.1 and Breslow and Day (1980, §4.3) or Fleiss ( 1981, §5.6). For
the ratio of two proportions a method was given by Koopman (1984) and Miettinen
and Nurminen (1985) which can be programmed fairly readily. The confidence
interval produced gives a good approximation to the required confidence coefficient,
but the two tail probabilities are unequal due to skewness. Gart and Nam (1988)
gave a correction for skewness but this is tedious to calculate. For the difference of
two proportions a method was given by Mee (1984) and Miettinen and Nurminen
(1985). This involves more calculation than for the ratio limits, and again there
could be a problem due to skewness (Gart and Nam, 1990).
continued...
• Fisher's exact test is usually used just as a test*; if one is interested in the difference ∆ = π1 – π2, the conditional approach does not yield a corresponding confidence interval for ∆. [It does provide one for the comparative odds ratio parameter  ψ = { π1 / (1–π1) } ÷ { π2 / (1–π2) }.]
• Thus, one can find anomalous situations where the (conditional) test
provides P>0.05 making the difference 'not statistically significant',
whereas the large-sample (unconditional) CI for ∆, computed as p1 – p2 ±
zSE(p 1 – p2), does not overlap 0, and so would indicate that the
difference is 'statistically significant'. [* see the Breslow and Day text Vol I ,
§4.2, for CI's for ψ derived from the conditional distribution]
• See letter from Begin & Hanley re 1/20 mortality with pentamidine vs 5/20 with Trimethoprim-Sulfamethoxazole in patients with Pneumocystis carinii Pneumonia. Annals Int Med 106:474, 1987.
• Miettinen's test-based method of forming CI's, while it can have some
drawbacks, keeps the correspondence between test and CI and avoids
such anomalies.
Notes by JH
• The word "exact" means that the p-values are
calculated using a finite discrete reference
distribution -- the hypergeometric distribution (cousin
of the binomial) rather than using large-sample
approximations. It doesn't mean that it is the correct
test. [see comment by A&B in their section dealing
with Mid-P values].
While greater accuracy is always desirable, this particular test uses a 'conditional' approach that not all statisticians agree with. Moreover, compared with some unconditional competitors, the test is somewhat conservative, and thus less powerful, particularly if sample sizes are very small.

• This illustrates one important point about parameters related to binary data -- with means of interval data, we typically deal just with differences*; however, with binary data, we often switch between differences and ratios, either because the design of the study forces us to use odds ratios (case-control studies), or because the most readily available regression software uses a ratio (i.e. logistic regression for odds ratios), or because one is easier to explain than the other, or because one has a more natural interpretation (e.g. in assessing the cost per life saved of a more expensive and more efficacious management modality, it is the difference in, rather than the ratio of, mortality rates that comes into the calculation). [* the sampling variability of the estimated ratios of means of interval data is also more difficult to calculate accurately].
• Two versions of an unconditional test for the H0: π 1 = π 2
are available: Liddell; Suissa and Shuster;
Add-ins for M&M §8 and §9
FISHER'S EXACT TEST IN A DOUBLE-BLIND STUDY OF SYMPTOM PROVOCATION TO DETERMINE FOOD SENSITIVITY (N Engl J Med 1990; 323:429-33.)
Abstract
Table 1: Responses of 18 Patients Forced to Decide Whether
Injections Contained an Active Ingredient or Placebo
Background Some claim that food sensitivities can best be identified by
intradermal injection of extracts of the suspected allergens to reproduce the associated
symptoms. A different dose of an offending allergen is thought to "neutralize" the
reaction.
Methods To assess the validity of symptom provocation, we performed a double-blind study that was carried out in the offices of seven physicians who were
proponents of this technique and experienced in its use. Eighteen patients were
tested in 20 sessions (two patients were tested twice) by the same technician, using
the same extracts (at the same dilutions with the same saline diluent) as those
previously thought to provoke symptoms during unblinded testing. At each session
three injections of extract and nine of diluent were given in random sequence. The
symptoms evaluated included nasal stuffiness, dry mouth, nausea, fatigue, headache,
and feelings of disorientation or depression. No patient had a history of asthma or
anaphylaxis.
Results The responses of the patients to the active and control injections were
indistinguishable, as was the incidence of positive responses: 27 percent of the
active injections (16 of 60) were judged by the patients to be the active substance, as
were 24 percent of the control injections (44 of 180). Neutralizing doses given by
some of the physicians to treat the symptoms after a response were equally
efficacious whether the injection was of the suspected allergen or saline. The rate of
judging injections as active remained relatively constant within the experimental
sessions, with no major change in the response rate due to neutralization or
habituation.
Conclusions When the provocation of symptoms to identify food sensitivities is
evaluated under double-blind conditions, this type of testing, as well as the
treatments based on "neutralizing" such reactions, appears to lack scientific validity.
The frequency of positive responses to the injected extracts appears to be the result
of suggestion and chance
------------------------------------------------------------------------------------------------
† Calculated according to Fisher's exact test, which assumes that the hypothesized direction of effect is the same as the direction of effect in the data. Therefore, when the effect is opposite to the hypothesis, as it is for the data below those of Patient 9, the P value computed is testing the null hypothesis that the results obtained were due to chance as compared with the possibility that the patients were more likely to judge a placebo injection as active than an active injection.
                 Active Injection        Placebo Injection
   Pt. No*      resp     no resp        resp     no resp       P Value†
     3            2         1             1         8            0.13
     1            2         1             2         7            0.24
    14a           2         1             2         7            0.24
    12            1         2             0         9            0.25
    16            2         1             3         6            0.36
    18            2         1             4         5            0.50
    14b           1         2             2         7            0.87
     4            1         2             2         7            0.87
     5            1         2             2         7            0.87
     9            0         3             0         9             --
     2a           0         3             1         8            0.75
    13            0         3             1         8            0.75
    15            1         2             3         6            0.76
     6            0         3             2         7            0.55
     8            0         3             2         7            0.55
    17            1         2             5         4            0.50
     2b           0         3             3         6            0.38
     7            0         3             3         6            0.38
    10            0         3             3         6            0.38
    11            0         3             3         6            0.38

*Patients were numbered in the order they were studied. The order in the table is related to the degree that the results agree with the hypothesis that patients could distinguish active injections from placebo injections. The results listed below those of Patient 9 do not support this hypothesis; placebo injections were identified as active at a higher rate than were true active injections. The letters a and b denote the first and second testing sessions, respectively, in Patients 2 and 14.
ID denotes intradermal, and SC subcutaneous.
The value is the P value associated with the test of whether the common odds ratio (the odds ratio for all patients) is equal to 1.0. The common odds ratio was equal to 1.13 (computed according to the Mantel-Haenszel test).
Notes on P-Values from Fisher's Exact Test in previous article
Patient no. 3
                               Response
                              +       –      Total
   Active Injection           2       1    |    3
   Placebo Injection          1       8    |    9
                            -----------------------
   Total                      3       9        12

All possible tables with a total of 3 +ve responses (cells: Active +, Active – / Placebo +, Placebo –), their null probabilities, and the 1-sided* P-values:

   a = Active +:        0              1              2              3
   table:             0  3           1  2           2  1           3  0
                      3  6           2  7           1  8           0  9
   prob:        (9•8•7)/(12•11•10)   0.382 × (3•3)/(1•7)   0.491 × (2•2)/(2•8)   0.123 × (1•1)/(3•9)
                    = 0.382             = 0.491               = 0.123               = 0.005
   P-Value*:          1.0              0.618               0.128                 0.005
   (pt #):                        (14b, 4, 5)              (3)

   * 1-sided, guided by Halt: π of +ve responses with Active > π of +ve responses with Placebo.
   (pt # 2b, 7, 10, 11 have the a = 0 table.)

Patient no. 1 (and 14a)
                               Response
                              +       –      Total
   Active Injection           2       1    |    3
   Placebo Injection          2       7    |    9
                            -----------------------
   Total                      4       8        12

All possible tables with a total of 4 +ve responses:

   a = Active +:        0              1              2              3
   table:             0  3           1  2           2  1           3  0
                      4  5           3  6           2  7           1  8
   prob:        (8•7•6)/(12•11•10)   0.255 × (3•4)/(1•6)   0.510 × (2•3)/(2•7)   0.218 × (1•2)/(3•8)
                    = 0.255             = 0.510               = 0.218               = 0.018
   P-Value (1-sided, as above):
                      1.0              0.745               0.236                 0.018
   (pt #):                             (15)              (1, 14a)

Patient no. 18
                               Response
                              +       –      Total
   Active Injection           2       1    |    3
   Placebo Injection          4       5    |    9
                            -----------------------
   Total                      6       6        12

All possible tables with a total of 6 +ve responses:

   a = Active +:        0              1              2              3
   table:             0  3           1  2           2  1           3  0
                      6  3           5  4           4  5           3  6
   prob:        (6•5•4)/(12•11•10)   0.091 × (3•6)/(1•4)   0.409 × (2•5)/(2•5)   0.409 × (1•4)/(3•6)
                    = 0.091             = 0.409               = 0.409               = 0.091
   P-Value (1-sided, as above):
                      1.0              0.909               0.500                 0.091
   (pt #):                             (17)                (18)
In Table 1, the P-values for patients below patient 9 are calculated as 1-sided, but guided by the opposite Halt from that used for the patients in the upper half of the table, i.e. by Halt: π of +ve responses with Active < π of +ve responses with Placebo. It appears that the authors decided the "sidedness" of the Halt after observing the data!!! and that they used different Halt for different patients!!!
Fisher's Exact Test
political and ecological correctness ...
The Namibian government expelled the authors from Namibia following the publication of this article; the reason given was that their "data and conclusions were premature". jh
Since 1900 the world's population has increased from about 1.6 to over 5 billion; the U.S. population has kept pace, growing from nearly 75 to 260 million. While the expansion of humans and
environmental alterations go hand in hand, it remains uncertain
whether conservation programs will slow our biotic losses. Current
strategies focus on solutions to problems associated with diminishing
and less continuous habitats, but in the past, when habitat loss was
not the issue, active intervention prevented extirpation. Here we
briefly summarize intervention measures and focus on tactics for
species with economically valuable body parts, particularly on the
merits and pitfalls of biological strategies tried for Africa's most
endangered pachyderms, rhinoceroses.
[ ... ]
Given the inadequacies of protective legislation and enforcement, Namibia, Zimbabwe, and Swaziland are using a controversial preemptive measure, dehorning (Fig. D), with the hope that complete devaluation will buy time for implementing other protective measures (7). In Namibia and Zimbabwe, two species, black and white rhinos (Ceratotherium simum), are dehorned, a
tactic resulting in sociological and biological uncertainty: Is
poaching deterred? Can hornless mothers defend calves from
dangerous predators?
On the basis of our work in Namibia during the last 3 years (8) and
comparative information from Zimbabwe, some data are available.
Horns regenerate rapidly, about 8.7 cm per animal per year, so that
1 year after dehorning the regrown mass exceeds 0.5 kg. Because
poachers apparently do not prefer animals with more massive horns
(8), frequent and costly horn removal may be required (9). In
Zimbabwe, a population of 100 white rhinos, with at least 80
dehorned, was reduced to less than 5 animals in 18 months (10).
These discouraging results suggest that intervention by itself is
unlikely to eliminate the incentive for poaching. Nevertheless, some
benefits accrue when governments, rather than poachers, practice
horn harvesting, since less horn enters the black market. Whether
horn stockpiles may be used to enhance conservation remains
controversial, but mortality risks associated with anesthesia during
dehorning are low (5).
Biologically, there have also been problems. Despite media
attention and a bevy of allegations about the soundness of
dehorning ( 11 ), serious attempts to determine whether dehorning
is harmful have been remiss. A lack of negative effects has been
suggested because (i) horned and dehorned individuals have
interacted without subsequent injury; (ii) dehorned animals have
thwarted the advance of dangerous predators; (iii) feeding is
normal; and (iv) dehorned mothers have given birth (12). However,
most claims are anecdotal and mean little without attendant data on
demographic effects. For instance, while some dehorned females
give birth, it may be that these females were pregnant when first
immobilized. Perhaps others have not conceived or have lost calves
after birth. Without knowing more about the frequency of mortality,
it seems premature to argue that dehorning is effective.
We gathered data on more than 40 known horned and hornless
black rhinos in the presence and absence of dangerous carnivores
in a 7,000 km2 area of the northern Namib Desert and on 60
horned animals in the 22,000 km2 Etosha National Park. On the
basis of over 200 witnessed interactions between horned rhinos and
spotted hyenas (Crocura crocura) and lions (Panthera leo) we saw
no cases of predation, although mothers charged predators in about
45% of the cases. Serious interspecific aggression is not uncommon
elsewhere in Africa, and calves missing ears and tails have been
observed from South Africa, Kenya, Tanzania, and Namibia (13).
To evaluate the vulnerability of dehorned rhinos to potential predators, we developed an experimental design using three regions:

• Area A had horned animals with spotted hyenas and occasional lions;
• Area B had dehorned animals lacking dangerous predators;
• Area C consisted of dehorned animals that were sympatric with hyenas only.

Populations were discrete and inhabited similar xeric landscapes that averaged less than 125 mm of precipitation annually. Area A occurred north of a country-long veterinary cordon fence, whereas animals from areas B and C occurred to the south or east, and no individuals moved between regions.

The differences in calf survivorship were remarkable. All three calves in area C died within 1 year of birth, whereas all calves survived for both dehorned females living without dangerous predators (area B; n = 3) and for horned mothers in area A (n = 4). Despite admittedly restricted samples, the differences are striking [Fisher's (3 x 2) exact test, P = 0.017; area B versus C, P = 0.05; area A versus C, P = 0.029] ††. The data offer a first assessment of an empirically derived relation between horns and recruitment.

Our results imply that hyena predation was responsible for calf deaths, but other explanations are possible. If drought affected one area to a larger extent than the others, then calves might be more susceptible to early mortality. This possibility appears unlikely because all of western Namibia has been experiencing drought and, on average, the desert rhinos in one area were in no poorer bodily condition than those in another. Also, the mothers who lost calves were between 15 and 25 years old, suggesting that they were not first-time, inexperienced mothers (14). What seems more likely is that the drought-induced migration of more than 85% of the large herbivore biomass (kudu, springbok, zebra, gemsbok, giraffe, and ostrich) resulted in hyenas preying on an alternative food, rhino neonates, when mothers with regenerating horns could not protect them.

Clearly, unpredictable events, including drought, may not be anticipated on a short-term basis. Similarly, it may not be possible to predict when governments can no longer fund antipoaching measures, an event that may have led to the collapse of Zimbabwe's dehorned white rhinos. Nevertheless, any effective conservation actions must account for uncertainty. In the case of dehorning, additional precautions must be taken. [ ... ]

††
                    A      B      C
      survived      4      3      0
      died          0      0      3
      total         4      3      3

B vs C: all possible tables with these margins (3, 3; survived 3, died 3), and their null (hypergeometric) probabilities:

                  B  C        B  C        B  C        B  C
      survived    3  0        2  1        1  2        0  3
      died        0  3        1  2        2  1        3  0
      prob        1/20        9/20        9/20        1/20

      The observed table is the most extreme one: 1-sided P = 1/20 = 0.05.

A vs C: all possible tables with these margins (4, 3; survived 4, died 3):

                  A  C        A  C        A  C        A  C
      survived    4  0        3  1        2  2        1  3
      died        0  3        1  2        2  1        3  0
      prob        1/35       12/35       18/35        4/35

      The observed table is the most extreme one: 1-sided P = 1/35 = 0.029.

Do you agree?
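The two pairwise comparisons can be checked with scipy (an added sketch; note that scipy's fisher_exact handles only 2 x 2 tables, so the overall 3 x 2 exact P of 0.017 is not reproduced here):

   from scipy.stats import fisher_exact

   # B vs C: survived/died = 3/0 vs 0/3 ; A vs C: 4/0 vs 0/3
   for label, table in [("B vs C", [[3, 0], [0, 3]]), ("A vs C", [[4, 0], [0, 3]])]:
       odds, p = fisher_exact(table, alternative="greater")   # 1-sided, as in the text
       print(label, round(p, 3))                              # 0.05 and 0.029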
Add-ins for M&M §8 and §9
[ updated Jan 3 2004 ]
stratified data
Combining measures from several strata
[cf. M&M 3§2.6, A&B3 §16.2, Rothman2002, Chapter 8]
Why not just add (sum, Σ) the 'a' frequencies across tables (strata), the 'b' frequencies across tables, ... the 'd' frequencies across tables, to make a single 2 x 2 table with entries

      Σa    Σb
      Σc    Σd

and use these 4 cell counts to perform the analyses?

e.g. 1  Batting Averages of Gehrig and Ruth   (see the book "Innumeracy" by Paulos)

                             Gehrig         Ruth
   1st half of season         .290     <    .300
   2nd half of season         .390     <    .400
   ––––––––––––––––––––––––––––––––––––––––––––––
   Entire season              .357     >    .333  !!!

Explanation:
                            Gehrig                Ruth
                         hits   AT BAT        hits   AT BAT
   1st half of season     29      100          60      200
   2nd half of season     78      200          40      100
   ––––––––––––––––––––––––––––––––––––––––––––––––––––––––
   Entire season         107      300         100      300

Two features, involving time, created this 'paradox':
   -1- batting averages increased from the 1st to the 2nd half of the season
   -2- Ruth had a greater proportion of his AT BATs in the 1st half than Gehrig

e.g. 2  Numbers of Applicants (n), and Admission rates (%), to Berkeley Graduate School
        (see an early chapter in the text "Statistics" by Freedman et al)

                        Men                    Women
   Faculty         n     % admitted        n     % admitted
      A           825        62           108        82
      B           560        63            25        68
      C           325        37           593        34
      D           417        33           375        35
      E           191        28           393        27
      F           373         6           341         7
   Combined      2691        44          1835        30

Paradox: π(admission | male) > π(admission | female) overall, but, by and large, faculty by faculty, it's the other way!!!

Explanation: Women are more likely than men to apply to the faculties that admit lower proportions of applicants.

Remedy: aggregate the within-strata comparisons [like vs. like], rather than make comparisons with aggregated raw data -- see next for classical ways of doing this; MH stands for "Mantel-Haenszel".

For other examples:
1. See Moore and McCabe (3rd Ed) 2.6 (The Perils of Aggregation, including Simpson's paradox). They speak of 'lurking' variables; in epidemiology we speak of 'confounding' variables.
2. See Rothman2002, p1 (death rates Panama vs. Sweden) and p2 (20-year mortality in female smokers and non-smokers in Whickham, England).

Simpson's paradox is an extreme form of confounding. Some textbooks give made-up examples; see the web site for course 626 for several real examples.
Add-ins for M&M §8 and §9
stratified data
Story 4: Does Smoking Improve Survival? in the EESEE Expansion Modules in the website for the text (link from course description) [also in Rothman2002, with finer age-categories]
http://WWW.WHFREEMAN.COM/STATISTICS/IPS/EESEE4/EESEES4.HTM

A survey concerned with thyroid and heart disease was conducted in 1972-74 in a district near Newcastle, United Kingdom, by Tunbridge et al (1977). A follow-up study of the same subjects was conducted twenty years later by Vanderpump et al (1996). Here we explore data from the survey on the smoking habits of 1314 women who were classified as being a current smoker or as never having smoked at the time of the original survey. Of interest is whether or not they survived until the second survey.

Results

The following tables summarize the results of the experiment: [note from JH: we would not call it "an experiment"; mathematical statisticians call any process for generating data "an experiment"]

Table 1: Relationship between smoking habits and 20-year survival in 1314 women (582 Smokers, 732 Non-Smokers)

                              Smoking Status
   Survival Status        Smoker      Non-Smoker
        Dead                139           230
        Alive               443           502

   Compared...
                                 Smoker                  Non-Smoker
   Risk = #dead / #Total      139/582 = 23.9%         230/732 = 31.4%       Diff: –7.5% ;  Ratio: 0.76
   Odds = #dead / #alive      139/443 = 0.314(:1)     230/502 = 0.458(:1)                  Ratio: 0.68*

   * shortcut:  or = (a × d)/(b × c) = (139 × 502)/(230 × 443) = 69778/101890 = 0.68

Table 2: Twenty-year survival status for 1314 women categorized by age and smoking habits at the time of the original survey.

   Age Group        Survival            Smoking Status
   (Years)          Status          Smoker     Non-Smoker
   18-44            Dead              19            13
                    Alive            269           327        (or = 1.78)
   44-64            Dead              78            52
                    Alive            167           147        (or = 1.32)
   >64              Dead              42           165
                    Alive              7            28        (or = 1.02)

The odds ratio is > 1 in each age group!
Why the contradictory results?
A message the tobacco companies would love us to believe!
Add-ins for M&M §8 and §9
Adjustment
stratified data
(compare like with like, i.e. Σ within-category estimates**)

Two broad approaches:

Ratio** Estimators ("M-H")  [implicit precision weighting]

   Risk Ratio  [cohort/prevalence study]:

        Σ ( #cases_index × DENOM_ref / DENOM_total )
        ----------------------------------------------
        Σ ( #cases_ref × DENOM_index / DENOM_total )

   Rate Ratio:  same, except that the denominators are amounts of person-time, not persons.

   Odds Ratio  [cohort/prevalence study]:

        Σ ( #cases_index × #"rest"_ref / total )
        ------------------------------------------
        Σ ( #cases_ref × #"rest"_index / total )

     It is not that common to use this measure, since the odds ratio is more cumbersome to explain, and less 'natural'. One might use it to maintain comparability with the results of a log-odds (logistic) regression. If the #cases are a small fraction of the denominators, the odds ratio approximates the risk ratio.

   Odds Ratio  [case-control study]:

        Σ ( #cases_index × "denom"_ref / "size"_total )
        -------------------------------------------------
        Σ ( #cases_ref × "denom"_index / "size"_total )

     same as the risk and rate ratios above except that the "denominators" are partial (pseudo) ones estimated from a denominator series* ("controls"); "size"_total refers to the size of the (stratum-specific) case series and denominator series combined.  *MODERN way to view case-control studies.

Weighted averages  [explicit weights (w's): precision-based (inverse variance) or investigator-chosen ("Standardized")]

   Mean Difference:    Σ w × ȳ_index – Σ w × ȳ_ref  =  Σ w × ( ȳ_index – ȳ_ref )

   Risk Difference:    Σ w × risk_index – Σ w × risk_ref  =  Σ w × ( risk_index – risk_ref )

   Odds Ratio ("Woolf" method, precision based):

        exp[ Σ w × log odds_index – Σ w × log odds_ref ]  =  exp[ Σ w × ( log [odds ratio] ) ]

        where w = 1 / var[ log [odds ratio] ] = 1 / ( 1/a + 1/b + 1/c + 1/d ) ;  all logs to base e.

Note: Computational formulae are often constructed to minimize the number of steps, and to avoid division, and so may hide the real structure of the estimator.

e.g. 8.1 in Rothman p147, for the risk difference (precision weighting):

   Var[risk diff] is proportional to  1/N0 + 1/N1 = (N0 + N1)/(N0 N1) ,
   so that the denominator contribution, i.e. the weight, is
        w = 1/Var = (N0 N1)/(N0 + N1) = (N0 N1)/T
   and the numerator contribution is
        ( risk_index – risk_ref ) × w = ( a/N1 – b/N0 ) × w = ( a/N1 – b/N0 ) × (N0 N1)/T = ( a N0 – b N1 ) / T        (after some algebra)
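A minimal Python sketch of this precision-weighted (MH-type) risk difference, with a made-up two-stratum example (added to the notes; the function name is mine):

   def pooled_risk_difference(strata):
       """strata: list of (a, N1, b, N0) = (index cases, index denom, ref cases, ref denom)."""
       num = den = 0.0
       for a, N1, b, N0 in strata:
           T = N1 + N0
           w = N1 * N0 / T                    # weight = (N0 * N1) / T, as in the notes
           num += (a / N1 - b / N0) * w       # = (a*N0 - b*N1)/T
           den += w
       return num / den

   # hypothetical data, two strata
   print(round(pooled_risk_difference([(10, 100, 5, 100), (30, 200, 20, 100)]), 3))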
Add-ins for M&M §8 and §9
stratified data
**NOTE ON RATIO ESTIMATORS: Even though one could (if all
denominators were obligingly non-zero) rewrite the ratio
estimator as a weighted average of ratios, this would run
counter to Mantel's express wishes.. to calculate just one ratio
at the end, i.e. a ratio of two sums, rather than a sum of ratios.
The main reason is statistical stability: imagine a (simpler,
non-comparative) situation where one wished to estimate the
overall sex ratio in small day-care facilities: would you average
the ratios from each facility, or take a single ratio of the total
number of males to the total number of females? The caveat
does not apply to absolute differences, where the difference of
two weighted averages (same set of weights for both) is the
same as the weighted average of the differences.
Matched-pairs: the limiting case of finely stratified data

Examples:
   pair-matched case-control studies; Mother -> infant transmission of HIV in twins in relation to order of delivery; & others... [see 607 notes for Ch 9]
   ALSO: Case-crossover studies (self-matched case-control studies), e.g. Redelmeier: auto accidents, while on/off cell phone when driving
   Responses of the same subject in each of 2 conditions (self-paired)
   Responses of a matched pair, one in 1 condition, 1 in the other
   ∆'s in paired responses on an interval scale, reduced to the sign of ∆

Tabular format for displaying matched-pair data

   COHORT STUDY                           Result in Other PAIR Member
                                             +ve         –ve
   Result in One        +ve                   A           B
   PAIR Member          –ve                   C           D                 Total # PAIRS = n

   CASE-CTL STUDY                          Exposure in "Control"
                                             +ve         –ve
   Exposure in          +ve                   A           B
   "Case"               –ve                   C           D                 Total PAIRS = n

* In a matched (self- or other-) case-control study, the "denominator series" is not limited to 1 "probe-for-exposure" per case... one could ask about "usual" exposure (e.g. % of time usually exposed) or sample several "person-moments" ['controls'] per case, i.e. the 2nd row total could be > 2.

The 4 possibilities for the 2 pair-members are (using the generic 2 x 2 table below; the 2nd row might be a 'denominator series' of 1 per case):

                                Category of Determinant
        Outcome              Index            Reference         Total
        Yes                a = 1 or 0        b = 1 or 0
        No                 c = 0 or 1        d = 0 or 1
        Total                  1                  1              T = 2

The contributions to orMH from the 4 possibilities are:

   Pair type:        "A"          "B"          "C"          "D"
        table       1  1         1  0         0  1         0  0
                    0  0         0  1         1  0         1  1
   a•d / T            0           1/2           0            0
   b•c / T            0            0           1/2           0

   Odds Ratio estimator  =  orMH  =  Σ(a•d/T) / Σ(b•c/T)  =  [ (1/2)•B ] / [ (1/2)•C ]  =  B / C
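A tiny Python sketch (mine, not part of the notes) of the fact that the MH odds ratio over pair-strata reduces to the ratio of discordant-pair counts:

   def matched_pairs_or(A, B, C, D):
       """A, B, C, D = pair counts as in the 2x2 pair table above; orMH = B/C."""
       num = 0.5 * B      # each 'B' pair (index +, ref -) contributes a*d/T = 1/2
       den = 0.5 * C      # each 'C' pair (index -, ref +) contributes b*c/T = 1/2
       return num / den

   print(matched_pairs_or(A=10, B=25, C=14, D=51))   # 25/14 ~ 1.79 ; the A and D pairs do not matter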
Add-ins for M&M §8 and §9
stratified data
Standardization of Rates [proportion-type and incidence-type]
[explicit, investigator-selected weights]
• Usual to first calculate standardized rate for index category
(of the determinant) and standardized rate for reference
category (of the determinant) separately, then compare the
standardized rates.
• If one uses the confounder distribution in one of the two
compared determinant categories as the common set of
weights, then the standardized rate in this category remains
unchanged from the crude rate in this category.
See the worked example comparing death rates in Quebec males in 1971 and 1991 in the document "Direct" and "Indirect" Standardization: 2 sides of the same coin? (.pdf) under "Material from previous years" in the c626 web page. This is an interesting local case of natural confounding: relative to 20 years earlier, the crude mortality rate in 1991 was unchanged (crude rate ratio 1.00); yet, in every age category, the rate in 1991 was at least 10% lower, and in many age-groups more than 20% lower, than in 1971 (in the table, the rate ratios in bold are 71/91, so take their reciprocals to see the rate ratios 91/71).
• Read Rothman's comment (p159) about the uniformity of
effect (eg a constant rate ratio across age groups in the Que
example). Why in his last sentence in that paragraph does
he seem to "allow" a weighted average of very different rate
ratios, if they were derived from standardization, but NOT if
they were derived from (precision-weighted) pooling?
• Rothman (p161) emphasizes how "silly" the term "indirect"
standardization used with standardized mortality ratio, is. He
correctly points out that "the calculations for any rate
standardization, "direct" or "indirect", are basically the same".
He leaves it as an exercise (Q4 page 166) to work out what
the weights are in the so-called "indirect" standardization
used to compute an SMR (or SIR).
Hint: write the SMR (with Σ denoting a sum over strata) as

   SMR = Total # cases observed / Total # cases expected
       = Σ # observed / Σ # expected
       = Σ observed # / Σ ( ref. rate × exposed PT )
       = Σ ( observed rate × exposed PT ) / Σ ( ref. rate × exposed PT )        (*)
       = Σ ( observed rate × w ) / Σ ( ref. rate × w ) ,   with w = exposed PT.

If one starts again from (*), one can show that the SMR can also be represented as a weighted average of rate ratios [as was mentioned in a footnote to the Quebec table*]:

   SMR = Σ # observed / Σ # expected
       = Σ ( obs. rate × exposed PT ) / Σ # expected
       = Σ [ (obs. rate / ref. rate) × ref. rate × exposed PT ] / Σ # expected        (divide & multiply by the ref. rate)
       = Σ [ (obs. rate / ref. rate) × # expected ] / Σ # expected
       = weighted ave. of rate ratios.

*cf. Liddell FD. The measurement of occupational mortality. Br J Ind Med. 1960 Jul;17:228-33.
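The identity can be checked numerically; this Python sketch uses entirely hypothetical stratum rates and person-time (not data from any of the examples above):

   # SMR = sum(observed)/sum(expected) equals the expected-count-weighted average of the stratum rate ratios
   obs_rate   = [0.002, 0.010, 0.030]     # observed (exposed) rates, per person-year (made up)
   ref_rate   = [0.001, 0.008, 0.028]     # reference rates (made up)
   exposed_pt = [4000, 2500, 1000]        # exposed person-time per stratum (made up)

   observed = [o * pt for o, pt in zip(obs_rate, exposed_pt)]
   expected = [r * pt for r, pt in zip(ref_rate, exposed_pt)]

   smr = sum(observed) / sum(expected)
   weighted_avg_rr = sum((o / r) * e for o, r, e in zip(obs_rate, ref_rate, expected)) / sum(expected)
   print(round(smr, 3), round(weighted_avg_rr, 3))    # identical, by the algebra above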
Add-ins for M&M §8 and §9
stratified data
Worked-out calculations (see the same calculations on the spreadsheet) for:
   • the Mantel-Haenszel summary odds ratio, orMH ;
   • the Mantel-Haenszel (chi-square) test of OR1 = OR2 = OR3 = 1 ; and
   • Woolf's method: exp[ weighted average of the ln(or)'s ], with Var[ weighted ave ] = 1 / {Sum of Weights}.
(* r1, r2 are row totals; c1, c2 are column totals.)   See Rothman Ch 8.

Data: Table 2, twenty-year survival status for 1314 women categorized by age and smoking habits at the time of the original survey.

                                                Calculations for the           Calculations for the            Calculations for Woolf's method
                                                summary odds ratio             test statistic*
 Age      Surv.    Smoker   Non-                --------------------------     ---------------------------     -------------------------------------------------
 Group    Status            Smoker                n      a•d/n     b•c/n       E[a|H0]       Var[a|H0]          ln or    Var[ln or] =     Weight =       W × ln or
 (Years)                                                                       = r1•c1/n     = r1•r2•c1•c2/      (1)     1/a+1/b+1/c+1/d  1/Var[ln or]    (1)×(2)
                                                                                               {n²(n–1)}                                   (2)
 18-44    Dead       19       13                 628     9.89      5.59          14.7           7.6             0.575     0.1363           7.335           4.218
          Alive     269      327   (or = 1.78)
 44-64    Dead       78       52                 444    25.82     19.56          71.7          22.8             0.278     0.0448          22.30            6.199
          Alive     167      147   (or = 1.32)
 >64      Dead       42      165                 242     4.86      4.77          41.9           4.9             0.018     0.2084           4.798           0.086
          Alive       7       28   (or = 1.02)
 Sum                                            1314    40.57     29.92         128.3          35.2                                       34.433          10.503

MH Odds Ratio:
   orMH = Σ(a•d/n) / Σ(b•c/n) = 40.57 / 29.92 = 1.36

Mantel-Haenszel (chi-square) test:
   X²MH (1 df) = { Σa – ΣE[a|H0] }² / Σ Var[a|H0] = {139 – 128.3}² / 35.2 = 3.24 ;     XMH = 1.80

(Miettinen) test-based 100(1 – α)% CI for OR:
   orMH^(1 ± zα/2 / XMH) = 1.36^(1 ± 1.96/1.80) = 0.97 to 1.89   (95% CI)

Woolf's method:
   weighted ave. of the ln or's = 10.503 / 34.433 = 0.305 ;   exp[0.305] = 1.36
   (Woolf) 100(1 – α)% CI for OR:  exp[ {weighted ave. of ln or's} ± zα/2 √(1/34.433) ] = exp[ 0.305 ± 1.96 × 0.170 ] = 0.97 to 1.89
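The same quantities can be checked with a short Python sketch (an added illustration, separate from the SAS and Stata runs below; the variable names are mine):

   from math import log, exp, sqrt

   # the three age-stratum 2x2 tables: (a, b, c, d) = (smoker dead, non-smoker dead, smoker alive, non-smoker alive)
   strata = [(19, 13, 269, 327), (78, 52, 167, 147), (42, 165, 7, 28)]

   sum_ad = sum(a*d/(a+b+c+d) for a, b, c, d in strata)
   sum_bc = sum(b*c/(a+b+c+d) for a, b, c, d in strata)
   or_mh = sum_ad / sum_bc                                            # 1.36

   w    = [1/(1/a + 1/b + 1/c + 1/d) for a, b, c, d in strata]        # Woolf weights
   lnor = [log(a*d/(b*c)) for a, b, c, d in strata]
   ln_woolf = sum(wi*li for wi, li in zip(w, lnor)) / sum(w)          # 0.305
   se_woolf = sqrt(1/sum(w))

   # Mantel-Haenszel chi-square (1 df)
   E = sum((a+b)*(a+c)/(a+b+c+d) for a, b, c, d in strata)
   V = sum((a+b)*(c+d)*(a+c)*(b+d)/((a+b+c+d)**2*(a+b+c+d-1)) for a, b, c, d in strata)
   x2_mh = (sum(a for a, b, c, d in strata) - E)**2 / V               # 3.24

   print(round(or_mh, 2), round(exp(ln_woolf), 2), round(x2_mh, 2))
   print([round(exp(ln_woolf + s*1.96*se_woolf), 2) for s in (-1, +1)])   # about 0.97 to 1.89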
stratified data
Via SAS
data sasuser.simpson;
 input age $ i_smoke i_dead number;
lines;
18-44  1  1   19
18-44  1  0  269
18-44  0  1   13
18-44  0  0  327
44-64  1  1   78
44-64  1  0  167
44-64  0  1   52
44-64  0  0  147
64-    1  1   42
64-    1  0    7
64-    0  1  165
64-    0  0   28
;
run;

options ls = 75 ps = 50; run;

proc freq data=sasuser.simpson;
 tables age * i_smoke * i_dead /
   nocol norow nopercent cmh expected;
 weight number;
 /* weight indicates multiples */
run;
See the SAS 'trick' to produce tables in an orientation that gives the ratios of interest (use PROC FORMAT to associate other values with each actual value; then use the ORDER=FORMATTED option in PROC FREQ).
TABLE 1 OF I_SMOKE BY I_DEAD
CONTROLLING FOR AGE=18-44
I_SMOKE
I_DEAD
Frequency|
Expected |
0|
1| Total
------+--------+--------+
0 |
327 |
13 | 340
| 322.68 | 17.325 |
------+--------+--------+
1 |
269 |
19 | 288
| 273.32 | 14.675 |
------+--------+--------+
Total
596
32
628
SUMMARY STATISTICS FOR
TABLE 2 OF I_SMOKE BY I_DEAD
CONTROLLING FOR AGE=44-64
I_SMOKE
I_DEAD
Frequency|
Expected |
0|
1| Total
------+--------+--------+
0 |
147 |
52 | 199
| 140.73 | 58.266 |
------+--------+--------+
1 |
167 |
78 | 245
| 173.27 | 71.734 |
------+--------+--------+
Total
314
130
444
Estimates of Common Relative Risk
(Row1/Row2 )
TABLE 3 OF I_SMOKE BY I_DEAD
CONTROLLING FOR AGE=64-
I_SMOKE      I_DEAD
Frequency|
Expected |
0|
1| Total
------+--------+--------+
0 |
28 |
165 | 193
| 27.913 | 165.09 |
------+--------+--------+
1 |
7 |
42 |
49
| 7.0868 | 41.913 |
------+--------+--------+
Total
35
207
242
I_SMOKE BY I_DEAD
CONTROLLING FOR AGE
Cochran-Mantel-Haenszel Statistics
(Based on Table Scores)
Alt. Hypothesis
DF Value Prob
------------------------------------Nonzero Correlation
1 3.239 0.072
Row Mean Scores Differ 1 3.239 0.072
General Association
1 3.239 0.072
Type of Study
Method
95%
Estimate Conf Bounds
------------------------------------Case-Control Mantel-Haenszel 1.357 0.973 1.892
(Odds Ratio)
Logit
1.357 0.971 1.894
Cohort
(Col1 Risk)
Mantel-Haenszel 1.047 0.996 1.101
Logit
1.034 0.998 1.072
Cohort
(Col2 Risk)
Mantel-Haenszel 0.864 0.738 1.013
Logit
0.953 0.849 1.071
Confidence bounds for M-H estimates
are test-based.
Breslow-Day Test for Homogeneity of
the Odds Ratios
Chi-Square = 0.950 DF = 2 Prob = 0.622
Total Sample Size = 1314
stratified data
Via Stata
clear
input str5 age i_smoke i_dead number
18_44  1  1   19
18_44  1  0  269
18_44  0  1   13
18_44  0  0  327
44_64  1  1   78
44_64  1  0  167
44_64  0  1   52
44_64  0  0  147
64_    1  1   42
64_    1  0    7
64_    0  1  165
64_    0  0   28
end
cc i_dead i_smoke [freq=number], by(age)
Output:

          age |   OR     [95% CI]     M-H Weight
 -------------+-----------------------------------
        18_44 |  1.78    .87  3.61       5.57      (Cornfield)
        44_64 |  1.32    .87  1.99      19.56      (Cornfield)
          64_ |  1.02    .42  2.43       4.77      (Cornfield)
 -------------+-----------------------------------
        Crude |   .68    .53   .88                 (Cornfield)
 M-H combined |  1.36    .97  1.90
 -------------+-----------------------------------
 Test of homogeneity (M-H)   chi2(2) = 0.95    Pr>chi2 = 0.6234
 Test that combined OR = 1:
    Mantel-Haenszel chi2(1) = 3.24    Pr>chi2 = 0.0719

Also available:

   cc  i_dead i_smoke [freq=number], by(age) woolf
   cc  i_dead i_smoke [freq=number], by(age) tb        * tb = "test-based"

[Robins-Breslow-Greenland SE for ln orMH not programmed]

Aggregating Odds Ratio (OR)'s ... Woolf's Method

Recall, for data from a single 2x2 table:   or = a•d / (b•c) ;   SE[ ln(or) ] = √( 1/a + 1/b + 1/c + 1/d )

For data from several (K) 2x2 tables (Σ: summation over strata):

   ln(orWoolf) = Σ wk ln(ork) / Σ wk                        (weighted average)
       with wk = 1 / Var[ ln(ork) ]                          (weight ∝ 1/variance; note: Var = SE²)

   SE[ ln(orWoolf) ] = √( 1 / Σwk ) = √( Var* / K )          [see derivation #]
       (Var*: harmonic mean of the K Var's)

   CI[ OR ] = exp{ CI[ ln(OR) ] }

   # Derivation:  Var[ Σ{w × ln(or)} / Σw ] = (1/Σw)² × Σ{ w² × Var[ln(or)] } = (1/Σw)² × Σw = 1/Σw
     [ since w = 1/Var[ln(or)] ]

See the worked example in the Spreadsheet (under Resources, Ch 9).

References: A&B Ch 4.8 and 16, Schlesselman, KKM, Rothman...
Summary Risk Ratio and Summary Rate Ratio
See Rothman pp 147- (Risk Ratio) and pp153- (Rate Ratio)
stratified data
Berkeley Data: M:F comparative parameters -- Odds Ratio (OR), Risk Ratio (RR) and Risk Difference (R∆)

(Using KKM table 17.16 notation; here D = admitted, E = male, so a/n1 and b/n0 are the M and F admission rates)

                  E       Ē
        D         a       b    |  m1
        D̄         c       d    |  m0
                -----------------
                  n1      n0   |  n

 Faculty  Admitted?   Men   Women    All |  a/n1   b/n0     R∆  |  a•d/n   b•c/n    OR  |  a•n0/n  b•n1/n    RR  |  var(R∆)*    w = 1/var    w•R∆
    A        Y        512      89    601 |  0.62   0.82   –0.20 |   10.4    29.9   0.35 |   59.3    78.7    0.75 |  1.63E-3        614       –125
             N        313      19    332 |
            All       825     108    933 |
    B        Y        353      17    370 |  0.63   0.68   –0.05 |    4.8     6.0   0.80 |   15.1    16.3    0.93 |  9.12E-3        110         –5
             N        207       8    215 |
            All       560      25    585 |
    C        Y        120     202    322 |  0.37   0.34   +0.03 |   51.1    45.1   1.13 |   77.5    71.5    1.08 |  1.10E-3        913         26
             N        205     391    596 |
            All       325     593    918 |
    D        Y        138     131    269 |  0.33   0.35   –0.02 |   42.5    46.1   0.92 |   65.3    69.0    0.95 |  1.14E-3        879        –16
             N        279     244    523 |
            All       417     375    792 |
    E        Y         53      94    147 |  0.28   0.24   +0.04 |   27.1    22.2   1.22 |   35.7    30.7    1.16 |  1.51E-3        661         25
             N        138     299    437 |
            All       191     393    584 |
    F        Y         22      24     46 |  0.06   0.07   –0.01 |    9.8    11.8   0.83 |   10.5    12.5    0.84 |  3.41E-4       2935        –33
             N        351     317    668 |
            All       373     341    714 |
   All       Y       1198     557   1755 |  0.44   0.30   +0.14 |  crude OR = 1.84      |  crude RR = 1.47      |
  (crude)    N       1493    1278   2771 |
            All      2691    1835   4526 |
    Σ                                    |                      |  145.8   161.1        |  263.4   278.7        |            Σw = 6113   Σw•R∆ = –129

   ORMH = Σ(a•d/n) / Σ(b•c/n) = 145.8 / 161.1 = 0.91
   RRMH = Σ(a•n0/n) / Σ(b•n1/n) = 263.4 / 278.7 = 0.94
   R∆w  = Σ(w•R∆) / Σw = –129 / 6113 = –0.02

   * var(R∆) = sum of the 2 binomial variances
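The three summary measures can be reproduced with this Python sketch (an added illustration of the column arithmetic, not a replacement for the spreadsheet):

   # per faculty: (a, b, c, d) = (men admitted, women admitted, men rejected, women rejected)
   tables = {'A': (512,  89, 313,  19), 'B': (353,  17, 207,   8), 'C': (120, 202, 205, 391),
             'D': (138, 131, 279, 244), 'E': ( 53,  94, 138, 299), 'F': ( 22,  24, 351, 317)}

   sum_ad = sum_bc = sum_an0 = sum_bn1 = sum_w = sum_wrd = 0.0
   for a, b, c, d in tables.values():
       n1, n0 = a + c, b + d                   # men, women applicants
       n = n1 + n0
       sum_ad  += a * d / n;  sum_bc  += b * c / n          # MH odds-ratio pieces
       sum_an0 += a * n0 / n; sum_bn1 += b * n1 / n         # MH risk-ratio pieces
       p1, p0 = a / n1, b / n0
       var_rd = p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0     # sum of 2 binomial variances
       w = 1 / var_rd
       sum_w += w;  sum_wrd += w * (p1 - p0)

   print(round(sum_ad / sum_bc, 2),       # OR_MH  ~ 0.91
         round(sum_an0 / sum_bn1, 2),     # RR_MH  ~ 0.94
         round(sum_wrd / sum_w, 3))       # weighted risk difference ~ -0.02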
stratified data
Test of equal M:F admission rates; Confidence Intervals for OR MH (Berkeley data, KKM and A&B notation; cf. Rothman'02,Table 8.4, p152)
TEST (Mantel-Haenszel):  per faculty, a = # men admitted; under H0,

   E[a | H0] = n1 • m1 / n     and     Var[a | H0] = n1 • n0 • m1 • m0 / { n² (n – 1) }

 Faculty  Admitted?   Men   Women |   All     E[a|H0]   Var[a|H0]
    A         Y       512     89  |   601      531.4       21.9
              N       313     19  |   332
             All      825    108  |   933
    B         Y       353     17  |   370      354.2        5.6
              N       207      8  |   215
             All      560     25  |   585
    C         Y       120    202  |   322      114.0       47.9
              N       205    391  |   596
             All      325    593  |   918
    D         Y       138    131  |   269      141.6       44.3
              N       279    244  |   523
             All      417    375  |   792
    E         Y        53     94  |   147       48.1       24.3
              N       138    299  |   437
             All      191    393  |   584
    F         Y        22     24  |    46       24.0       10.8
              N       351    317  |   668
             All      373    341  |   714
   All        Y      1198    557  |  1755    Σ: 1213.4    154.7
              N      1493   1278  |  2771
             All     2691   1835  |  4526

   X²MH = { Σa – ΣE[a|H0] }² / Σ Var[a|H0] = {1198 – 1213.4}² / 154.7 = 1.52

   This MH X² of 1.52 is "NS" in the χ² 1-df distribution.

CI for ORMH   (Method of Robins, Breslow & Greenland 1986; notation from A&B p461)

   Per faculty:   P = (a+d)/n ,   Q = (b+c)/n ,   R = a•d/n ,   S = b•c/n

 Faculty     P      Q       R       S      P•R     P•S     Q•R     Q•S
    A      0.57   0.43    10.4    29.9     5.9    17.0     4.5    12.9
    B      0.62   0.38     4.8     6.0     3.0     3.7     1.8     2.3
    C      0.56   0.44    51.1    45.1    28.5    25.1    22.7    20.0
    D      0.48   0.52    42.5    46.1    20.5    22.3    22.0    23.9
    E      0.60   0.40    27.1    22.2    16.4    13.4    10.8     8.8
    F      0.47   0.53     9.8    11.8     4.6     5.6     5.1     6.2
    Σ                    145.8   161.1    78.9    87.1    66.9    74.1

   [#]  Var[ ln ORMH ] = Σ P•R / (2 R+²)  +  Σ[ P•S + Q•R ] / (2 R+ • S+)  +  Σ Q•S / (2 S+²)
        = 78.9 / (2 • 145.8²)  +  [87.1 + 66.9] / (2 • 145.8 • 161.1)  +  74.1 / (2 • 161.1²)  =  0.0066

        [#] see Rothman2002, p162. Rothman2002 (p152) uses different notation: A&B {R, S, P, Q} -> Rothman {G, H, P, Q}.

   ln ORMH = ln 0.91 = –0.10
   Var[ ln ORMH ] = 0.0066 ;   SE[ ln ORMH ] = √Var = 0.08
   CI[ ln ORMH ] = –0.10 ± z • 0.08 = –0.26 to 0.06   (95%)
   CI[ ORMH ] = exp[–0.26] to exp[0.06] = 0.77 to 1.06
   _________________________________________________________________

CI for ORMH, "test-based" (Miettinen 1976)

   Chi-MH = | ln orMH | / SE[ ln orMH ]   ===>   SE[ ln orMH ] = | ln orMH | / Chi-MH     { here: 0.10 / √1.52 = 0.08 }
   CI[ ln ORMH ] = ln orMH ± z SE[ ln orMH ]
   CI[ ORMH ] = exp[ CI for ln orMH ] = orMH^(1 ± z / Chi-MH) = orMH^(1 ± 1.96/√1.52)   in our example
   _________________________________________________________________

CI for R∆   (continued from the previous page)

   SE[ R∆w ] = 1 / √(Σw) = 1 / √6113 = 0.013
   CI[ R∆w ] = –0.02 ± z × 0.013
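For completeness, a Python sketch of the Robins-Breslow-Greenland variance using the P, Q, R, S notation above (an added illustration; the function name and the z = 1.96 default are my own choices):

   from math import log, sqrt, exp

   def rbg_ci(tables, z=1.96):
       """RBG confidence interval for the MH odds ratio; tables = list of (a, b, c, d)."""
       R = S = PR = PSQR = QS = 0.0
       for a, b, c, d in tables:
           n = a + b + c + d
           P, Q, r, s = (a + d)/n, (b + c)/n, a*d/n, b*c/n
           R += r;  S += s
           PR += P*r;  PSQR += P*s + Q*r;  QS += Q*s
       or_mh = R / S
       var_ln = PR/(2*R**2) + PSQR/(2*R*S) + QS/(2*S**2)
       se = sqrt(var_ln)
       return or_mh, exp(log(or_mh) - z*se), exp(log(or_mh) + z*se)

   berkeley = [(512, 89, 313, 19), (353, 17, 207, 8), (120, 202, 205, 391),
               (138, 131, 279, 244), (53, 94, 138, 299), (22, 24, 351, 317)]
   print([round(x, 2) for x in rbg_ci(berkeley)])    # about (0.91, 0.77, 1.06)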
Inference from 2 way Tables M&M §9 REVISED Dec 2003
Analysis of 2 x 2 tables ... chi-square test
Recall some earlier examples...
Montreal Metropolitan Population by knowledge of official language
(data collected by Statistics Canada at the 1996 census;
 numbers rounded, so subtotals do not sum exactly to the total)

                               English?
                           Yes             No            Total
   Français?   Oui      1,634,785       1,309,150      2,943,935
               Non        280,205          63,500        343,705
   Total                1,914,990       1,372,650      3,287,645

Stroke Unit vs. Medical Unit for Acute Stroke in the elderly?
Patient status at hospital discharge (BMJ 27 Sept 1980)

                   independent     dependent     total no. pts
   Stroke Unit         67              34             101
   Medical Unit        46              45              91
   total              113 (58.9%)      79             192

Bone mineral density and body composition in boys with distal forearm fractures (J Pediatr 2001 Oct;139(4):509-15): boys with a forearm fracture and fracture-free controls (100 boys per group), cross-classified by whether or not they were overweight -- another 2 x 2 table.

Pour battre Roy, mieux vaut lancer bas ... [To beat Roy, better to shoot low]
(LA PRESSE, MONTREAL, JEUDI, 21 AVRIL 1994 ... cf. Course 626)

Over the twenty playoff games last year, the Canadiens gave up 51 goals... Of the 51 goals allowed by the best goaltender in the world, the numbers seeing the puck enter each part of the net were:

        Haut (high)       10   (20%)
        Milieu (middle)    5   (10%)
        Bas (low)         36   (70%)
        Total             51  (100%)

More generally...

Cross-classification of a single sample of size n with respect to two characteristics, say A and B ("second" model in M&M p 641): test of independence of the two characteristics.

                      B1        B2       Total
        A1           n11       n12        nA1
        A2           n21       n22        nA2
        Total        nB1       nB2         n

Cohort Study: fixed or variable follow-up; person (P) or P-time denominators.
(A cross-sectional study documents states rather than events.)

                             # events            Denominator D
                             (or states)         (persons, or person-time)
     "exposed" (1)              c1                     D1
     not exposed (0)            c0                     D0
     total                  c = "cases",               D
                            the numerator

Case-Control Study: person- or person-time "quasi-denominators".

                             # events            quasi-denominator d
                             (or states)         (a sample of D: persons, or person-moments)
     "exposed" (1)              c1                     d1
     not exposed (0)            c0                     d0
     total                  c = "cases",           d = a sample of D
                            the numerator
Inference from 2 way Tables M&M §9 REVISED Dec 2003
Analysis of 2 x 2 tables ... chi-square test
e.g. languages in Montreal
Statistics from a 2 x 2 table via SAS PROC FREQ

Create the SAS file via the Program Editor (could also type directly into INSIGHT):

data sasuser.lang_mtl;
 input Francais $ English $ number;
   /* number, instead of one individual per line            */
   /* $ sign after a name indicates a character variable    */
lines;
Oui Yes 1634785
Oui No  1309150
Non Yes  280205
Non No    63500
;
run;

options ls = 75 ps = 50; run;

proc freq data=sasuser.lang_mtl;
 tables Francais * English /
   all cellchi2 expected;      /* turn on all output */
 weight number;                /* use weight to indicate "multiples" */
run;

then.. via SAS INSIGHT: Mosaic plot of FRANCAIS by ENGLISH.

TABLE OF FRANCAIS BY ENGLISH

FRANCAIS            ENGLISH

(Observed) Frequency                         ... "obs" for short
Expected (under H0: independence)            ... "exp" for short
(Cell Chi-Square)
Percent
Row Pct
Col Pct
              |No       |Yes      |   Total
--------------+---------+---------+
Non           |   63500 |  280205 |   343705
              |  143503 |  200202 |    10.45
              |   44602 |   31970 |
              |    1.93 |    8.52 |
              |   18.48 |   81.52 |
              |    4.63 |   14.63 |
--------------+---------+---------+
Oui           | 1309150 | 1634785 |  2943935
              | 1229147 | 1714788 |    89.55
              |  5207.3 |  3732.5 |
              |   39.82 |   49.73 |
              |   44.47 |   55.53 |
              |   95.37 |   85.37 |
--------------+---------+---------+
Total           1372650   1914990   3287640
                  41.75     58.25    100.00

expected freq. = RowTotal × ColTotal / OverallTotal :
   143503 (= 10.45% of 1372650) = 343705 × 1372650 / 3287640

cell Chi-Square = (observed freq. – expected freq.)² / expected freq. :
   44602 = (63500 – 143503)² / 143503

Statistic              DF                       Value          Prob*
---------------------------------------------------------------------
Chi-Square      (2-1) × (2-1) = 1     Σ (obs – exp)²/exp = 85511.881     0.001

DF = "degrees of freedom";  Σ is over all 4 cells.
* Clearly, p-values are not relevant here.
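The same arithmetic in Python (an added cross-check of the SAS output, using scipy's chi-square test of independence):

   import numpy as np
   from scipy.stats import chi2_contingency

   # rows: Francais Non, Oui ; columns: English No, Yes
   table = np.array([[  63500,  280205],
                     [1309150, 1634785]])
   x2, p, df, expected = chi2_contingency(table, correction=False)
   print(round(x2, 1), df)          # ~ 85512 on 1 df, as in the SAS output
   print(np.round(expected, 0))     # 143503 and 200202 in the first row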
Inference from 2 way Tables M&M §9 REVISED Dec 2003
Analysis of 2 x 2 tables ... chi-square test
e.g. Stroke Unit vs. Medical Unit for Acute Stroke in the elderly?
Patient status at hospital discharge (BMJ 27 Sept 1980)

                    independent     dependent     total no. pts
   Stroke Unit       67 (66.3%)         34             101
   Medical Unit      46 (50.5%)         45              91
   total            113 (58.9%)         79             192

"Expected" numbers under H0: % discharged "independent" unaffected by type of unit:

                    independent     dependent     total no. pts
   Stroke Unit      59.4 (58.9%)       41.6            101
   Medical Unit     53.6 (58.9%)       37.4             91
   total           113   (58.9%)       79              192

The expected number in the cell in row i and column j is:

     (total of row i) × (total of column j) / (overall total)

   X² ¶ = {67 – 59.4}²/59.4 + {34 – 41.6}²/41.6 + {46 – 53.6}²/53.6 + {45 – 37.4}²/37.4
        = 4.9268     [to be referred to the χ²(1 df) distribution]

Generic formula:

   X² = Σ { observed – expected }² / expected        [over all cells!]

Continuity-corrected X² †:

   X²c = Σ { | observed – expected | – 0.5 }² / expected

Via SAS:

data sasuser.str_unit;
 input Unit $ Status $ number;
lines;
Stroke  Indep 67
Stroke  Dep   34
Medical Indep 46
Medical Dep   45
;
run;

proc freq data=sasuser.str_unit;  weight number;
 tables Unit * Status / chisq cmh relrisk riskdiff nopercent nocol;
run;

NOTE: FREQ uses the upper row as the INDEX category and the lower row as the REFERENCE category.

UNIT         STATUS
Frequency |
Row Pct   |  Dep     | Indep   |  Total
----------+----------+---------+
Medical   |    45    |   46    |    91
          |  49.45   |  50.55  |
----------+----------+---------+
Stroke    |    34    |   67    |   101
          |  33.66   |  66.34  |
----------+----------+---------+
Total          79        113       192

Statistic                        DF      Value     Prob
---------------------------------------------------------
Chi-Square                        1      4.927     0.026
Continuity Adj. Chi-Square        1      4.296     0.038
Mantel-Haenszel Chi-Square        1      4.901     0.027
Fisher's Exact Test     Left: 0.991    Right: 0.019    2-Tail: 0.029

Column 2 ('Indep') Risk Estimates
                                        95% Conf Bounds      95% Conf Bounds
                  Risk       ASE         (Asymptotic)           (Exact)
   Row 1          0.505     0.052       0.403    0.608        0.399    0.612
   Row 2          0.663     0.047       0.571    0.756        0.562    0.754
   Total          0.589     0.036       0.519    0.658        0.515    0.659
   Row 1 - Row 2 -0.158     0.070      -0.296   -0.020

Estimates of the Relative Risk (Row1/Row2)
   Type of Study          Value                   95% Confidence Bounds
   Cohort (Col2 Risk)     0.762 (50.55/66.34)     0.596    0.975

¶ Use X² to refer to the calculated statistic in a sample, and χ² for the distribution.

† (Yates') continuity correction is used to reflect the fact that the binomial counts are discrete and that their probabilities are being approximated by intervals (count – 0.5, count + 0.5). The uncorrected X² is overly liberal, i.e. it produces a distribution of discrepancies that is larger than the tabulated χ² distribution... hence the reduction of each absolute deviation | observed – expected | by 0.5.

cf. the z-test for the difference of 2 proportions .. in Chapter 8:

   z = (0.6634 – 0.5054) / √[ 0.5885 • 0.4115 • (1/101 + 1/91) ] = 0.1580 / 0.0711 = 2.22
   P = Prob[ |Z| ≥ 2.22 ] = 0.026 (2-sided)
   z² = 2.22² = 4.93   (same as the X² with 1 degree of freedom!)
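A quick Python cross-check of the stroke-unit X², with and without the continuity correction (an added sketch, equivalent to the SAS chisq output above):

   import numpy as np
   from scipy.stats import chi2_contingency

   # rows: Stroke Unit, Medical Unit ; columns: independent, dependent
   stroke = np.array([[67, 34],
                      [46, 45]])
   x2,  p,  _, _ = chi2_contingency(stroke, correction=False)   # 4.93, P = 0.026
   x2c, pc, _, _ = chi2_contingency(stroke, correction=True)    # 4.30, P = 0.038 (Yates)
   print(round(x2, 2), round(p, 3), round(x2c, 2), round(pc, 3))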
Inference from 2 way Tables M&M §9 REVISED Dec 2003
Analysis of 2 x 2 tables ... chi-square test
OTHER EXAMPLES
Note: In the following examples, in order to keep the formulae uncluttered, I have
not shown the continuity correction. In some examples, as in the one above, the
sample sizes are large and so the continuity correction makes only a small change.
In some others, as in the milk immunoglobulin example below, it makes a big
difference. However, don't do as I do; do as I say -- use the continuity correction
routinely. That way, editors and referees won't accuse you of trying to make your p-values more impressive by not using the correction.
Protection by milk immunoglobulin concentrate against oral challenge with
enterotoxigenic E. coli.  NEJM May 12, 1988, p 1240

                                                developed    did      Total
                                                diarrhea*    not*     subjects
received milk immunoglobulin concentrate          0 (4.5)   10 (5.5)     10
received control immunoglobulin concentrate       9 (4.5)    1 (5.5)     10
All                                               9 (45%)   11           20

* the numbers 4.5, 5.5, 4.5 and 5.5 in bold in parentheses in the table are the
"Expected" numbers calculated on the (null) hypothesis H0: rate not changed by
the immunoglobulin concentrate.

x² = {0 – 4.5}²/4.5 + {9 – 4.5}²/4.5 + {10 – 5.5}²/5.5 + {1 – 5.5}²/5.5
   = {4.5}² (1/4.5 + 1/4.5 + 1/5.5 + 1/5.5) = 16.4

Continuity-corrected x² = { |0 – 4.5| – 0.5 }²/4.5 + ... + ...
                        = {4.0}² (1/4.5 + 1/4.5 + 1/5.5 + 1/5.5) = 12.9

Do infant formula samples shorten the duration of breast-feeding?
Bergevin Y, Dougherty C, Kramer MS. Lancet. 1983 May 21;1(8334):1148-51.
% still breastfeeding at 1 month in an RCT which withheld free formula samples
[normally given by baby-food companies to mothers leaving hospital with their
infants] from a random half of those studied.

                 breast          not breast     Total
                 feeding         feeding        mothers
given sample     175 (77%)          52            227
no sample        182 (84%)          35            217
Total            357 (80.4%)        87            444

"Expected" numbers under H0: rate not changed by giving samples

                 breast          not breast     Total
                 feeding         feeding        mothers
given sample     182.5 (80.4%)      44.5          227
no sample        174.5 (80.4%)      42.5          217
Total            357   (80.4%)      87            444

X² = {175 – 182.5}²/182.5 + {52 – 44.5}²/44.5 + {182 – 174.5}²/174.5 + {35 – 42.5}²/42.5
   = 3.22    [ "NS" at the 0.05 level, even with the uncorrected X² ]

X² = 3.22 <--> |Z| = √3.22 = 1.79;  Prob( |Z| > 1.79 ) = 2 × 0.0367 = 0.0734

Attack rates of ophthalmia neonatorum among exposed newborns receiving silver
nitrate, or tetracycline.  NEJM March 11, 1998, 653-7

                              Silver Nitrate      Tetracycline
exposed to N. gonorrhœa
  attack rate                  5 / 71  (7%)        2 / 66  (3%)
exposed to C. trachomatis
  attack rate                 10 / 99  (10%)       8 / 111 (7%)

page 4
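Because the expected numbers in the milk immunoglobulin table are so small, the exact
test is the natural companion analysis. A minimal SAS sketch, along the lines of the
stroke-unit code above (dataset and variable names are mine, hypothetical); for a 2x2
table, the CHISQ option also prints Fisher's exact test:

   data ecoli;                           /* hypothetical name                    */
     input group $ diarrhea $ number;
     lines;
   milk     yes  0
   milk     no  10
   control  yes  9
   control  no   1
   ;
   run;

   proc freq data=ecoli order=data;
     weight number / zeros;              /* ZEROS keeps the empty (milk,yes) cell */
     tables group*diarrhea / chisq;      /* chi-square + Fisher's exact test      */
   run;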
Inference from 2 way Tables M&M §9
Notes on X2 tests and on analysis of binary data in general
• If we use X², it must be based on counts, not on %'s.

• A short-cut method of calculation for the 2x2 table with 'generic' entries a, b, c, d,
  and with row, column and overall totals r1, r2, c1, c2 and N respectively, is (with the
  stroke data as e.g.):

     X² = N {a•d – b•c}² / (r1•r2•c1•c2) = 192 {67•45 – 34•46}² / (101•91•113•79) = 4.93

  For the continuity-corrected version, the shortcut formula is:

     X²c = N { |a•d – b•c| – N/2 }² / (r1•r2•c1•c2)
         = 192 { |67•45 – 34•46| – 96 }² / (101•91•113•79) = 4.30

  where |x| means 'the absolute value of x'.

  The formulae involve the crossproducts a•d and b•c. If their ratio (the empirical odds
  ratio) is 1, their difference is zero. The direction of the difference in proportions is
  given by the sign of ad – bc.

  These formulae avoid fractions -- one doesn't see expectations or deviations, or the
  magnitude of the difference. Presumably for this reason, some books, such as Norman and
  Streiner's Statistics: the Bare Essentials, classify X² as "non-parametric" or
  "distribution-free", and so put it in the non-parametric chapter. After their first
  edition, I pointed out to them that the above use of the chi-square test is as a test of
  the difference in two binomial proportions.. how much more parametric or
  distribution-specific can that be? Look up the index in the latest edition to see if my
  arguments convinced them.
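As a quick check of these shortcut formulae, here is a minimal SAS data-step sketch
(dataset name is mine, hypothetical) that reproduces the uncorrected and
continuity-corrected X² for the stroke-unit data:

   data shortcut;                          /* hypothetical name                  */
     a = 67; b = 34; c = 46; d = 45;       /* cells of the 2x2 table             */
     r1 = a + b;  r2 = c + d;              /* row totals                         */
     c1 = a + c;  c2 = b + d;              /* column totals                      */
     N  = r1 + r2;                         /* overall total                      */
     X2  = N * (a*d - b*c)**2 / (r1*r2*c1*c2);              /* approx. 4.93      */
     X2c = N * (abs(a*d - b*c) - N/2)**2 / (r1*r2*c1*c2);   /* approx. 4.30      */
     put X2= X2c=;
   run;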
• The uncorrected version of the 2-sided z-test for comparing two proportions gives the
  same p-value as the uncorrected version of the X² test. One can check that Z² = X².
  Likewise, the corrected version of the 2-sided z-test for comparing 2 proportions gives
  the same p-value as the corrected version of the X² test.

• The chi-square random variable (r.v.) is the square of the N(0,1) r.v. The very high
  "probability density", and the rapid change in this density, just to the right of
  Z² = 0 (cf. diagram), is a result of the Z -> Z² transformation. For example, the 3.98%
  of the "probability mass" between Z=0 and Z=0.1 is transferred to the small interval
  0² to 0.1², or 0 to 0.01, a width of 0.01 (an identical amount gets transferred from the
  Z interval -0.1 to 0). The 3.94% of the "probability mass" between Z=0.1 and Z=0.2 is
  transferred to the small interval 0.1² to 0.2², or 0.01 to 0.04, a width of 0.03. There
  is an identical transfer from the Z interval -0.2 to -0.1, for an "average" chi-square
  density of approx. (2 x 0.0394)/0.03 = 2.6. You can track the 'block transfers' using
  the different shades.

  [Diagram: Z ~ N(0,1), ticks at -2.5, -1.5, -0.5, 0.5, 1.5, 2.5; and
   Z-squared ~ Chi-Square(df=1), ticks at 0.25, 1, 2.25, 4, 6.25.]

• By construction, the X² is a 2-sided test, unless one uses X (= "chi") and refers it to
  the z-table.

• There are other chi-square distributions (with df > 1). See later.

page 5
Notes on X2 tests and on analysis of binary data in general
Inference from 2 way Tables M&M §9
• If we use x², it must be based on all cells, not just on the numerators -- unless the
  more common type of outcome is so much more common that the contribution from those
  cells is negligible.

• The x² test is a large-sample test, i.e. it is always an approximation. Since x² (1 df)
  is just Z², one is no more exact than the other.

• In t-tests, n = 30 is often considered 'large enough' for large-sample procedures -- it
  depends on how skewed the data are & whether the Central Limit Theorem will
  "Gaussianize" the distribution of the statistic. For 0/1 data, such as those above, the
  'real' or 'effective' sample sizes are not the denominators 71 and 66, but rather the
  numerators 5 and 2 [e.g. changing the 2 to a 1 would make the ratio of attack rates
  appear twice as good]. It doesn't matter if the 5 and 2 came from n's of 710 and 660 or
  7100 and 6600. The 'effective sample size' for binary data is the number of subjects
  having the less common outcome.

• The "guidelines" (such as they are) about when it is appropriate to use x² are based on
  the "Expected" numbers, not on the observed numbers. One quoted rule [often used by
  computer programs to generate warning messages] is that the expected numbers in most of
  the cells should exceed 5 for the χ² to be accurate. Thus, the 2x2 table on the left
  below will generate a 'warning'; that on the right will not.

       5    2            66   64
       1   11            11    1

• The regular uncorrected x² test statistic for a single 2x2 table can be written in a
  seemingly very different format, as

     x² = { a – E[ a | H0 ] }² / Variance[ a | H0 ]
        = { a – E[ a | H0 ] }² / ( r1•r2•c1•c2 / N³ )

  The Variance in the denominator of this statistic can be viewed as arising from a
  statistical model in which the 2 compared proportions are separate independent random
  variables; i.e. the 'unconditional' or '2-independent-binomials' model.

  Just like the formula with 4 O's and 4 E's, this format is not as calculator-friendly as
  the shortcut (integer-only) one. But... cf Mantel-Haenszel.

Mantel-Haenszel Test Statistic for a single 2 x 2 table

Preamble

Some of the paragraphs on the next page more appropriately belong in the notes for
Ch 8.2, where Fisher's exact test was introduced, but the reasons become important when we
come to using the above formulation of the x² test for a single table to combine evidence
over several (possibly sparse) 2x2 tables. Cochran first proposed combining evidence from
2 x 2 tables in 1954. His aim was to combine a small number of 'large' tables, and he did
not anticipate that this technique could also be used to combine a large number of quite
'small' 2 x 2 tables (each one with quite sparse information), with the combination of
data from n matched pairs as the limiting case. Thus, he was a wee bit careless about
variances. It took the now famous Mantel-Haenszel paper of 1959 to make a variance
correction that for 'large' tables was trivial, but for matched-pair tables was critical.

SAS and others rightly acknowledge Cochran's role in the test statistic, calling it the
'Cochran-Mantel-Haenszel' or 'CMH' statistic. (Indeed 'CMH' is the option one uses with
PROC FREQ to obtain the summary measures and the overall test statistic.) This formulation
is also the one most commonly used for the log-rank test used to compare two survival
curves.

Most consider that the biggest legacy of the "M-H" paper is the Mantel-Haenszel summary
measure (point estimate) of the Odds Ratio. We will come back a little later to this issue
of combining data from 2 x 2 tables.

page 6
Inference from 2 way Tables M&M §9
Notes on X2 tests and on analysis of binary data in general
Mantel-Haenszel Test Statistic for a single 2 x 2 table
Preamble: Conditional vs. Unconditional? continued...

In the separate-binomials model, the only marginal totals that are fixed ahead of time are
the two sample sizes. In most instances, this model reflects reality. The only exception I
know of is the design exemplified by the psychophysics study of the lady tasting tea. If
she is told that there are 4 cups where the tea is poured first, and 4 where it is poured
second, then she will arrange her responses so that there are 4 of each. Thus, in this
instance, both the row totals and the column totals are "fixed" ahead of time, and so it
makes sense that the (frequentist) inference be limited to the (only) 5 possible data
tables that have all margins fixed.

This is the statistical model behind Fisher's exact test, and indeed Fisher used the
tea-tasting example to explain it. But this test is now used for data situations where one
cannot -- at least ahead of time -- consider both sets of marginal totals fixed. For
example, in the food sensitivity study in the ch 8.2 notes, from the answers given, it
appears that the subjects were not told that there were three injections of extract and
nine of diluent, but the authors used the conditional test anyway.

Many of the reasons put forward for using the conditional test based on all margins fixed
(i.e. the hypergeometric model, with only one random variable) involve practicality rather
than adherence to a coherent set of inferential principles. They mostly have to do with
one of the following 'supposed' difficulties: (a) using the normal approximation when the
expected numbers are low; (b) the fact that there are two parameters, but one is only
interested in their difference, or ratio, or odds ratio, and so the 'remaining' parameter
is just a 'nuisance'; (c) how to order or rank the tables by their degree of evidence
against H0. For example, in a 2x2 table with n0=23 and n1=24 (as in the bromocryptine and
infertility study), there are theoretically 24 x 25 = 600 possible tables. However, if
one -- after the fact -- restricts the analysis to only those tables where the total
number of "successes" is 12 (12 pregnancies), then there are only 13 possible tables (see
notes and Excel spreadsheet for Fisher's exact test). And, by reducing the problem from a
2-dimensional one to a 1-dimensional one, it also becomes possible to more easily rank the
tables by their degree of evidence against H0, something that is supposedly more difficult
when the tables are simultaneously arrayed along both dimensions. (d) A fourth reason,
which I will illustrate with the Marvin Zelen "Marbles in the Folger's Coffee Can" model,
is that, after the fact, it is much easier to empirically -- and heuristically --
demonstrate a low p-value using the single random variable, conditional (hypergeometric),
model than it is with the '2-separate-binomials' model.

In fact there are many ways to circumvent these objections without having to 'condition'
on all margins, and there is still a considerable debate, much of it philosophical, on
this 100 years after analyses of 2x2 data were first introduced. However, since we often
combine information from data arranged as matched pairs or 'finely stratified' strata, we
do need to consider this one setting where conditioning is the 'right thing to do'. In the
example here, there will only be 1 large table, so the difference will not be important.
But when we come to matched pairs, the implications are large.
page 7
Inference from 2 way Tables M&M §9
Notes on X2 tests and on analysis of binary data in general
Mantel-Haenszel Test Statistic for a single 2 x 2 table
Details

In the conditional model, with both margins fixed, there is only one cell entry that can
vary independently. Without loss of generality, we focus on the frequency in the 'a' cell.
Then, under the null hypothesis,

   a ~ Hypergeometric[ parameters given by the marginal totals ]

i.e. by r1 = Row1Total, r2, c1 = Col1Total, c2, and N = OverallTotal.

Thus

   Expected value[ a | H0 ] = E[ a | H0 ] = r1 × c1 / N

   Variance_condn'l[ a | H0 ] = { r1•r2•c1•c2 } / { N²(N–1) }

Under the null, the expectation is the same with the conditional as with the unconditional
model. Note however the difference in the variance: under the conditional model it is
different, since it uses N²(N–1) rather than N³, reflecting the different pattern of
variation in the frequency in the 'a' (and consequently in the other 3) cell(s) if all
margins are fixed (vs. what would happen if the lady were not told "4 1st; 4 2nd").

The test statistic using this conditional variance can be computed as a Z statistic

   Z = X = "chi" = ( a – E[ a | H0 ] ) / SD_condn'l[ a | H0 ],

which has the same form as the critical ratios used in the z-test for proportions or
means, or as the more traditional square

   X²_MH = { a – E[ a | H0 ] }² / Variance_condn'l[ a | H0 ].

Example

In our stroke vs. medical unit example above, the marginal totals were r1=101, r2=91,
c1=113, and c2=79, so N=192. These yield the "excess in the a cell" of

   67 – (101•113)/192 = 67 – 59.44 = 7.56

and conditional variance

   { 101•91•113•79 } / { 192²(191) } = 11.6528,

giving

   X²_MH = 7.56² / 11.6528 = 4.901,

in agreement with the printout from Proc FREQ in SAS.

Note:

The MH test does not use the continuity correction with the {a – E[a]}. Part of the
justification for this is that when the point estimate of the odds ratio falls at the
null, i.e. a•d = b•c, so that E[a | H0] = a, it would be good if the test statistic also
had a value of zero. A continuity correction would force the test statistic to have a
positive value even when the "observed a" = "expected under the null"!
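A minimal SAS data-step sketch (dataset name is mine, hypothetical) that reproduces this
'by hand' Mantel-Haenszel calculation:

   data mh_check;                            /* hypothetical name                 */
     a = 67;  r1 = 101;  r2 = 91;            /* 'a' cell and marginal totals      */
     c1 = 113;  c2 = 79;  N = r1 + r2;
     Ea    = r1*c1 / N;                      /* E[a | H0] = 59.44                 */
     VarA  = r1*r2*c1*c2 / (N**2 * (N-1));   /* conditional variance = 11.6528    */
     X2_MH = (a - Ea)**2 / VarA;             /* 4.901, matching PROC FREQ's CMH   */
     put Ea= VarA= X2_MH=;
   run;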
page 8
Inference from 2 way Tables M&M §9
2x2, 2x1, 1x2 and 1x1 Tables†

2x2: samples reasonably equal in size, two types of outcome common
e.g. outcomes in trial of stroke vs. medical unit.

               BAD OUTCOME    GOOD OUTCOME    Total persons (or Person-time)
sample 1          bad1            good1            n1
sample 2          bad2            good2            n2
Total             bad             good             n

"Expected" numbers of outcomes under H0: rates not different
(split the events across the 2 samples in the ratio n1 : n2)

               BAD OUTCOME    GOOD OUTCOME    Total persons (or person-time)
sample 1        E[bad1]         E[good1]           n1
sample 2        E[bad2]         E[good2]           n2
Total           bad             good               n

x² = {bad1 – E[bad1]}²/E[bad1] + {good1 – E[good1]}²/E[good1]
   + {bad2 – E[bad2]}²/E[bad2] + {good2 – E[good2]}²/E[good2]

1x2: 1 sample; two types of outcome common
e.g. male and female births with a specific timing of conception

               GOOD OUTCOME    "LESS GOOD" OUTCOME    Total persons (or person-time)
sample            good              less_good              n

"Expected" numbers of outcomes under H0: rate not different from an EXTERNAL rate
(use the EXTERNAL rate, based on a LARGE amount of data (e.g. national rates), to
calculate the expected split of the events). If we use an internal comparison, then we
have a full 2 x 2 table.

x² = {good – E[good]}²/E[good] + {less_good – E[less_good]}²/E[less_good]

2x1: samples large and reasonably equal in size, BAD outcome uncommon
e.g. leukemias and breast cancers

               BAD OUTCOME    GOOD OUTCOME    Total persons (or Person-time)
sample 1          bad1           MOST              n1
sample 2          bad2           MOST              n2
Total             bad            MOST              n

"Expected" numbers of outcomes under H0: rates not different
(only need the ratio n1 : n2 to get the expected split of the BAD events)

x² ≈ {bad1 – E[bad1]}²/E[bad1] + {bad2 – E[bad2]}²/E[bad2]
     + minimal contribution + minimal contribution

[see A&B §4.10; WE WILL REVISIT THIS 2x1 TABLE, AND THE 1x1 TABLE, WHEN COMPUTING
EFFECT MEASURES for INCIDENCE RATES]

1x1: 1 large sample, BAD outcome uncommon
e.g. 78 cancers observed in Alberta study, 83.5 expected

               BAD OUTCOME    GOOD OUTCOME    Total persons (or Person-time)
sample            bad            MOST              n

"Expected" number of outcomes under H0: rate not different from an EXTERNAL rate
(use the EXTERNAL rate to calculate the expected number of BAD events)

x² ≈ {bad – E[bad]}²/E[bad] + minimal contribution

This x² = {observed – expected}²/expected is equivalent to the large-sample approximation
to the Poisson distribution [A&B §4.10], i.e.

   z = (observed – expected)/√expected,  so that  z² = {observed – expected}²/expected = x².

† This terminology is my own: don't try it out on an editor!
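As a small numerical illustration of the 1x1 case, here is a minimal SAS data-step sketch
(dataset name is mine, hypothetical) using the Alberta figures quoted above:

   data one_by_one;                          /* hypothetical name                       */
     observed = 78;  expected = 83.5;        /* cancers observed vs externally expected */
     z  = (observed - expected) / sqrt(expected);   /* approx. -0.60                    */
     x2 = z**2;                                     /* approx. 0.36                     */
     p  = 1 - probchi(x2, 1);                       /* p-value from chi-square (1 df)   */
     put z= x2= p=;
   run;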
page 9
Inference from 2 way Tables M&M §9
2x2, 2x1, 1x2 and 1x1 Tables†

e.g. Development of leukemia during a 6-year period following drug rx for cancer

               leukemia      not        Total persons
drug rx           14         2053           2067
no drug rx         1         1565           1566
Total             15         3618           3633

"Expected" numbers of leukemia under H0: rate not increased by drug

               leukemia      not           Total persons
drug rx          8.53        2058.47          2067
no drug rx       6.47        1559.53          1566
Total           15           3618             3633

x² = {14 – 8.53}²/8.53 + {1 – 6.47}²/6.47
   + {2053 – 2058.47}²/2058.47 + {1565 – 1559.53}²/1559.53 = 8.17

e.g. Breast Cancer in women repeatedly exposed to multiple X-ray fluoroscopies
(Boice and Monson 1977)

               cancers       Women-years (WY)
exposed           41             28,010
not exposed       15             19,017
Total             56             47,027

"Expected" numbers of cancers under H0: rate not affected by X-rays

               cancers       Women-years (WY)
exposed          33.4            28,010
not exposed      22.6            19,017
Total            56              47,027

x² = {41 – 33.4}²/33.4 + {15 – 22.6}²/22.6 + {deviation}²/28K + {deviation}²/19K
   ≈ {41 – 33.4}²/33.4 + {15 – 22.6}²/22.6 = 4.29   [3.74 with continuity correction]

Equivalent to testing whether the a+b events could split in this extreme (or a more
extreme) way.. one would expect under H0 that the split would be (apart from random
variation) in the ratio of WYexposed : WYnon-exposed.

e.g. MI in the first 56 months of US MDs' study of aspirin

               MI          not            Total MDs
aspirin        104         remainder         11K
placebo        189         remainder         11K
Total          293         remainder         22K

"Expected" numbers of MI under H0: rate not affected by aspirin

               MI          not        Total MDs
aspirin        146.5       rest          11K
placebo        146.5       rest          11K
Total          293         rest          22K

e.g. US MDs' study of aspirin ... continued

x² = {104 – 146.5}²/146.5 + {189 – 146.5}²/146.5 + {42.5}²/11K + {–42.5}²/11K = 24.7

In effect, testing whether 293 MI's could distribute this unevenly if one used a coin to
allocate them.

[WE WILL REVISIT THE 2x1 TABLE, AND THE 1x1 TABLE, WHEN COMPUTING EFFECT MEASURES for
INCIDENCE RATES]

M&M Ch 8.1 Inference on π ... page 10
Inference from 2 way Tables M&M §9
e.g.
Comparison of Proportions --- Paired Data

• Response of the same subject in each of 2 conditions (self-paired)
• Responses of a matched pair, one in 1 condition, 1 in the other
• Δ's in paired responses on an interval scale, reduced to the sign of the Δ
• extreme situations (1 or the other / forced choice, e.g. exercise 8.18, or who dies
  first among twin pairs discordant for handedness)

(McNemar) Test of equality of proportions:

-1- discard the concordant pairs (+,+) and (–,–) as being "un-informative"
    (this point is somewhat controversial)
-2- analyze the split of the (b+c) discordant pairs (under H0, expect a 50:50 split)

                                Result in Other PAIR Member
                                Positive      Negative        Total PAIRS
Result in One   Positive           a             b
PAIR Member     Negative           c             d
                                                               n PAIRS

Example: HIV in twins in relation to order of delivery [Lancet Dec 14 '91]
Mother -> infant transmission of HIV-infection: 66 sets of twins

                                Result in 2nd-born Twin
                                HIV +         HIV –           Total Sets
Result in 1st-   HIV +            10            18
born Twin        HIV –             4            34
                                                               66 Sets

To analyze the 'split' of discordant pairs:

(if n small)
• Binomial probabilities with "n" = b+c and π = 0.5
  (Table C or the Table for the Sign Test is helpful here)

(if n larger)
• Z test of the observed proportion p = b/(b+c) vs π = 0.5
• χ² test on the observed 1x2 table [ b | c ] versus the 1x2 table expected if H0 holds
  [ (b+c)/2 | (b+c)/2 ]

Note that the Z² and x² are equivalent.

e.g. forced choice (exactly 1 member of each pair "wins"):

                                Shorter member
                                Won          Lost             Total PAIRS
Taller          Won              -             b
member          Lost             c             -
                                                               n PAIRS

Can also turn this table 'inside-out' and analyze using a case-control approach
(Exposure in "Case" vs Exposure in "Control"):

                                Loser
                                Taller       Shorter          Total PAIRS
Winner          Taller           -             b
                Shorter          c             -
                                                               n PAIRS

page 11
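A minimal SAS sketch for the twin example (dataset and variable names are mine,
hypothetical); the AGREE option of PROC FREQ produces McNemar's test for a 2x2 table:

   data twins;                               /* hypothetical name                  */
     input first $ second $ number;          /* HIV result in 1st- and 2nd-born    */
     lines;
   pos pos 10
   pos neg 18
   neg pos  4
   neg neg 34
   ;
   run;

   proc freq data=twins order=data;
     weight number;
     tables first*second / agree;            /* AGREE gives McNemar's chi-square   */
   run;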
Inference from 2 way Tables M&M §9
Comparison of Proportions --- Paired Data
(McNemar) Test of equality of proportions: worked example

                                Result in 2nd-born Twin
                                HIV +         HIV –           Total Sets
Result in 1st-   HIV +            10            18
born Twin        HIV –             4            34
                                                               66 Sets

Under H0 that order makes no difference to the likelihood of HIV transmission, the split
among the 22 discordant sets should be like that obtained by tossing 22 coins, each with
π (first-born is the one to have the HIV transmitted) = 0.5.

Gaussian (or equivalently, Chi-square) Approximation to the Binomial

   Z = (18 – E[b]) / SD[b] = (18 – 22/2) / √(22 × 0.5 × 0.5) = (18 – 11)/√5.5 = 2.98 ;
   Z² = 7²/5.5 = 8.91

   Prob( |Z| > 2.98 ) = 0.003;  From Table F, Prob[ X² (1 df) > 8.91 ] ≈ 0.003

χ² test on the observed 1x2 table [ 18 | 4 ] versus the 1x2 table expected if H0 holds
[ 11 | 11 ]:

   X² = {18 – 11}²/11 + {4 – 11}²/11 = 7²/5.5 = 8.91

Analysis using the exact binomial

Binomial probabilities with "n" = b+c = 22 sets with discordant outcomes.
The Binomial(n=22, π = 0.5) distribution is not available in Table C, but can be obtained
from Excel. Of interest is the sum of the probabilities for 18/22, 19/22, 20/22, 21/22 and
22/22 (1-sided), then doubled if dealing with a 2-sided alternative, i.e.
2 x ( 0.00174 + 0.00036 + negligible terms ) = 0.004.

Notice that because of the symmetry involved in testing π = 0.5 versus a 2-sided
alternative, the test statistics have a particularly simple form:

   Z² = x² = ( b – c )² / ( b + c ) ;   ( 18 – 4 )² / 22 = 8.91 in our example

With continuity correction

   Z²c = x²c = ( |b – c| – 1 )² / ( b + c ) = 169 / 22 = 7.68   (2-sided P ≈ 0.005)

Q: Why a continuity correction of 1 rather than the usual 0.5?
A: The difference b – c jumps in 2's rather than 1's
   (e.g. if b+c = 18, then b – c = 18, 16, 14, ... , -14, -16, -18).
page 12
Inference from 2 way Tables M&M §9
Inferences regarding proportions --- Summary

Situation: 1 Population: π    (sample of n; p = y/n are "positive")

  Question: CI for π
    Gaussian approximation:  p ± z √( p[1 – p]/n )
    If not appropriate (cf. asymmetric): Binomial distribution; Nomograms/tables/spreadsheet

  Question: Test π = π0
    Gaussian approximation:  z = ( p – π0 ) / √( π0[1 – π0]/n )    { z² = x² }
    Otherwise: exact Binomial (see notes on 8.1)

Situation: 2 Populations (matched samples), or 1 population under 2 conditions:
  OR = [ π1/(1 – π1) ] / [ π2/(1 – π2) ]
  (sample of n pairs; n(++)=a; n(+–)=b; n(–+)=c; n(––)=d; b+c = n'; or (i.e. est. of OR) = b/c)

  Question: CI for OR
    b ~ Bin( n', [OR/(1 + OR)] );  CI for OR/(1+OR) => CI for OR

  Question: Test π1 = π2, i.e. Test OR = OR0
    z = ( b/n' – OR0/[1 + OR0] ) / SE[ b/n' | H0 ]   or x²
    Exact: b ~ Bin( n', [OR0/(1 + OR0)] )

Situation: 2 Populations: π1 and π2    (independent samples of n1 and n2)

  Question: CI for Δ = π1 – π2
    p1 – p2 ± z √( p1[1 – p1]/n1 + p2[1 – p2]/n2 )
    Miettinen and Nurminen   (also via Binomial regression** ... RD)

  Question: CI for RR
    cf Rothman p134; regression (RR)

  Question: CI for OR
    Conditional; Condnl[Approx.]/Woolf/Miettinen   (or via Binomial (logistic) regression**)

  Question: Test π1 = π2 (or, more generally, RR = RR0, OR = OR0, or Δ = Δ0)
    z = ( [p1 – p2] – Δ0 ) / √( p̄[1 – p̄]/n1 + p̄[1 – p̄]/n2 )  (*)   { or x² }
    Fisher's Exact Test (cond'nl); Unconditional methods (Suissa and Shuster);
    Permutational (StatXact software)

Notes:
(*) p̄ in the combined data = ( n1 p1 + n2 p2 ) / ( n1 + n2 ) = Σ numerators / Σ denominators
    (a weighted average of the two p's)
** Binomial Regression: extension [to come] of the 1-parameter binomial regression models
   described in the notes for 8.1.
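To tie the "2 Populations" formulas to numbers, here is a minimal SAS data-step sketch
(names are mine, hypothetical) that reproduces, for the stroke-unit data, the Gaussian CI
for Δ = π1 – π2 and the pooled z-test shown on page 3:

   data two_prop;                          /* hypothetical name; stroke-unit data   */
     p1 = 67/101;  n1 = 101;               /* stroke unit                           */
     p2 = 46/91;   n2 = 91;                /* medical unit                          */
     z  = probit(0.975);                   /* 1.96                                  */
     se_ci = sqrt( p1*(1-p1)/n1 + p2*(1-p2)/n2 );      /* unpooled SE for the CI    */
     lower = (p1 - p2) - z*se_ci;  upper = (p1 - p2) + z*se_ci;
     pbar  = (n1*p1 + n2*p2) / (n1 + n2);              /* pooled p-bar under H0     */
     z_test = (p1 - p2) / sqrt( pbar*(1-pbar)*(1/n1 + 1/n2) );  /* approx. 2.22     */
     put lower= upper= z_test=;
   run;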
page 13
Inference from 2 way Tables M&M §9
Tests of Association --- Tables with r rows and c columns
• Independence of classification on 2 variables
• Similarity of multinomial profiles

(generic) Relationship between one factor (rows) and another (columns) in n observations,
cross-classified into an r (= # of rows) x c (= # of columns) table:

            Col1      Col2     ...    Colc      Total
Row1        n11       n12      ...    n1c       Nrow1
Row2        n21       n22      ...    n2c       Nrow2
......      ...       ...      ...    ...       ...
Rowr        nr1       nr2      ...    nrc       Nrowr
Total       Ncol1     Ncol2    ...    Ncolc     N

   χ²_df = Σ { observed – expected }² / expected

• expected number in a cell = ( Nrow • Ncolumn ) / N
• the summation is over all r x c cells
• degrees of freedom (df) = (r–1)(c–1). In e.g. 1 below, r=3, c=3 => df = 4.

(e.g. 1) Relationship between laterality of hand and laterality of eye (measured by
astigmatism, acuity of vision, etc.) in 413 subjects cross-classified into a 3x3 table.
[data from Woo, Biometrika 2A 79-148]

                  Left-eyed    Ambiocular    Right-eyed    Total
Left handed          34            62            28         124
Ambidextrous         27            28            20          75
Right handed         57           105            52         214
Total               118           195           100         413

(e.g. 2) Quality of sleep before elective operation. [BMJ]
Using a chi-square test for the following 2x3 table ignores the ordered nature of the
responses.

                           Bad    Reasonably good    Good    Total
Patients given Triazolam     2          17             12      31
Patients given Placebo       8          15              8      31
Total                       10          32             20      62

(e.g. 3) Outcome after 2 to 7 days of Rx in 20 patients with chronic oral candidiasis.

                          Outcome category
                  1 (good)     2      3     4 (poor)    Total
Clotrimazole          6        3      1        0          10
Placebo               1        0      0        9          10
Total                 7        3      1        9          20

The χ² statistic measures the deviation from independence of the row and column
classifications (e.g. 1) and the dissimilarity of the distributions (profiles) of
responses (e.g. 2 and 3). However, omnibus chi-square tests (H0: identical response
profiles) with large df are seldom of interest, since the alternative hypothesis (profiles
are not identical) is so broad, and the chi-square tests are invariant to the ordering of
the rows and columns. More often, a specific alternative hypothesis is of interest;
omnibus tests penalize one for looking in all directions, when in fact one's focus is
narrower, aiming to pick up a specific 'signal'. The next 2 examples (>2 ORDERED response
categories in each of 2 groups; binary responses in >2 ORDERED exposure categories) are a
more fruitful step in this direction.

Any dichotomization of outcomes loses information and statistical power. Moses et al.
suggest using the Mann-Whitney U test (also known as the Wilcoxon Rank Sum test) to take
account of the ordered nature of the response categories. See the article by Moses L et
al, NEJM 311: 442-448, 1984 (also published as a chapter in Medical Uses of Statistics by
J Bailar and F Mosteller).
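To run e.g. 1 in SAS, a minimal sketch along the lines of the earlier PROC FREQ calls
(dataset and variable names are mine, hypothetical):

   data eye_hand;                           /* hypothetical name                  */
     input hand $ eye $ number;
     lines;
   Left   LeftEye  34
   Left   Ambi     62
   Left   RightEye 28
   Ambi   LeftEye  27
   Ambi   Ambi     28
   Ambi   RightEye 20
   Right  LeftEye  57
   Right  Ambi    105
   Right  RightEye 52
   ;
   run;

   proc freq data=eye_hand;
     weight number;
     tables hand*eye / chisq;               /* 3x3 table => chi-square with 4 df  */
   run;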
page 14
Inference from 2 way Tables M&M §9
Test for trend in (Response) Proportions   [from A&B §12.2]

Suppose that, in a k x 2 contingency table, the k groups fall into a natural order. They
may correspond to different values, or groups of values, of a quantitative variable like
age; or they may correspond to qualitative categories, such as severity of a disease,
which can be ordered but not readily assigned a numerical value. The usual χ²(k–1) test is
designed to detect differences between the k proportions --- without taking the 'ordering'
of the rows into account. It is an 'omnibus' test and is unchanged even if we interchange
the order of the rows (groups). More specifically, one might ask whether there is a
significant trend in these proportions from group 1 to group k. Let us assign a
quantitative variable, x, to the k groups. If the definition of groups uses such a
variable, this can be chosen to be x. If the definition is qualitative, x can take integer
values from 1 to k. The notation is as follows:

Group      X      Frequency Pos    Neg          Total    proportion positive
1          x1          r1          n1 – r1        n1          p1
2          x2          r2          n2 – r2        n2          p2
.          .           .           .              .           .
i          xi          ri          ni – ri        ni          pi
.          .           .           .              .           .
k          xk          rk          nk – rk        nk          pk
---------------------------------------------------------------------------
All
Groups                 R           N – R          N           P (= R/N)

The χ²(1) statistic for trend, X²(1), which forms part of the overall X², can be computed
as follows:

   X²(1 df) = N { NΣri xi – RΣni xi }²  /  ( R{N – R} [ NΣni xi² – (Σni xi)² ] )

PS: If you look up A&B, you will find another x² [Eqn. 12.2]. This value, calculated as
the difference between the trend and the overall x² statistics, can be used to test
whether there is serious non-linear variation over and above the linear trend.

Example [jh]

Distribution of subjects with polluted-water-exposure-related symptoms among Competitors
and Employees, and Relative Risk (RR), According to Number of Falls in the Water. [Data
from the article "Health Hazards Associated with Windsurfing on Polluted Water", AJPH 76:
690-691, 1986 -- research conducted at the Windsurfer Western Hemisphere Championship held
over 9 days in August 1984. During the championships, the same single-menu meals were
served to both competitors and employees.]

                       No. of subjects    No.
                       with symptoms      without    Total    RD      RR      OR
Employees (ref gp)        8 (20%)            33        41     --      1.0     1.0
Competitors:
  0-10 falls             15 (44%)            19        34     24%     2.3     3.3
  11-20 falls             9 (45%)            11        20     25%     2.3     3.4
  21-30 falls            10 (71%)             4        14     51%     3.7    10.3
  > 30 falls             10 (100%)            0        10     80%     5.1    inf.

Any dichotomization of exposure loses information and statistical power. The authors
correctly used the Chi-square test for trend, yielding χ² (1 df) = 25.3, P = 10^-6. I get
24.58 with the "spacing" 0, 5, 15, 25 and 40. SAS*, using the "Cochran-Armitage Trend
Test" with the same spacing, gives a Z statistic of -4.969 (Z² = 24.69). The entire
variation among the 5 proportions in the table (ignoring ordering) is approximately
X² (4 df) = 27, but it is almost all explained by the exposure gradient. In smaller
datasets, even if the overall X² is not significant, the trend portion can be. In this
e.g. there was such a strong relationship that even the overall test was significant. The
same is true in the example overleaf (dealing with birth date and sporting success), where
again the sample sizes are large and the signal strong.

* SAS syntax. If there is 1 line of data for each of the 119 individuals:

   PROC FREQ DATA= ... ;
    TABLES falls*sick / TREND;

If instead you enter a variable (say you call it "number") to indicate how many persons
have each exposure/response pattern, then the required syntax is

   PROC FREQ DATA= ... ;
    TABLES falls*sick / TREND;
    WEIGHT number;

From Stata:

   input falls ill number
    0 0 33
    0 1  8
    5 0 19
    5 1 15
   15 0 11
   15 1  9
   25 0  4
   25 1 10
   40 0  0
   40 1 10
   end
   tabodds ill falls [freq=number]
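Putting the SAS fragments above together, a complete runnable sketch (the dataset name is
mine, hypothetical; falls carries the scores 0, 5, 15, 25, 40 used in the text):

   data windsurf;                       /* hypothetical name                        */
     input falls sick number;           /* falls score, symptom indicator, count    */
     lines;
    0 0 33
    0 1  8
    5 0 19
    5 1 15
   15 0 11
   15 1  9
   25 0  4
   25 1 10
   40 0  0
   40 1 10
   ;
   run;

   proc freq data=windsurf;
     weight number;
     tables falls*sick / trend;         /* Cochran-Armitage test for trend          */
   run;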
page 15
Inference from 2 way Tables M&M §9
Example    Test for trend in (Response) Proportions   [from A&B §12.2]

Birth date and sporting success
SCIENTIFIC CORRESPONDENCE in NATURE • VOL 368 • 14 APRIL 1994 p592

Sir — I have found a significant relationship between birth date and success in tennis
and soccer. In the Netherlands and England, players born early in the competition year are
more likely to participate in national soccer leagues. The high incidence of elite
athletes born in the first quarter of the competition year can be explained by the effects
of age-group position.

In organized sport, talent is considered predominantly in terms of physical skills, and
the influence of social and psychological factors is often ignored or underestimated1.
Various studies have investigated the psychological characteristics of elite athletes2,
but none has looked for an effect of age. I discovered a strikingly skewed distribution of
the dates of birth of 12- to 16-year-old tennis players in the top rankings of the Dutch
youth league. Half of a sample of 60 tennis players were born in the first 3 months of the
year.

This discovery led me to consider the distribution of the dates of birth of professional
soccer players. In the Netherlands, there are two leagues comprising a total of 36 clubs.
I found a striking difference between participation rates of those born in August and
July. The Dutch soccer competition year starts on the first of August. A chi-square test
indicates that the distribution is not uniform (P<0.001); and a regression analysis
demonstrates a clear linear relationship between month of birth and number of
participants. The dates of birth of 621 players, compiled into quarters, are shown in the
figure. This relationship cannot be attributed to the distribution of births in the
Netherlands, as this is highly uniform.

[Figure: Relationship between birthdate and participation rates in the Dutch soccer
league; No. of players (ordinate, beginning at 75 players) by birthdate quarter:
Aug-Oct, Nov-Jan, Feb-Apr, May-Jul.]

We also inspected the distribution of the dates of birth of English football players in
league clubs in the period 1991-92 (ref. 3). Birth dates for all players were tabulated by
month and compiled into quarters. The results (table) show the significant effect of date
of birth on the participation rate of soccer players within each of the national leagues,
indicating that, as in the Netherlands, significantly more football players are born in
the first quarter of the competition year (which starts in September in England).

PARTICIPATION RATES IN ENGLISH SOCCER LEAGUES

              Players in birthdate quarters                      Statistics
League        Sep-Nov   Dec-Feb   Mar-May   Jun-Aug   Total     Chi-Square   Sig. Level
FA premier      288       190       147       136       761       75.5       P<0.0001
Division 1      264       169       154       147       734       48.47      P<0.0001
Division 2      251       168       123       131       673       61.11      P<0.0001
Division 3      217       169       121       102       609       52.38      P<0.0001
Total         1,020       696       545       516     2,777      230.77      P<0.0001

There is a known relationship between date of birth and educational achievement5,
implying that the younger children in any school year group are at a disadvantage compared
to the older children. Children who participate in sports are also placed in age groups,
and my results imply many athletes in organized sports may never get a fair chance because
of this method of classification. Very little attention has been drawn to this problem.
One of the few studies done in this area analysed the dates of birth of young Canadian
hockey players in the 1983-84 season6. Players possessing a relative age advantage (born
in the months January-June) were more likely to participate in minor hockey and more
likely to play for top teams than players born in July-December.

More than 20 years ago, this journal published an article concerning the relationship
between season of birth and cognitive development7. The authors attributed this
relationship to a fault in the British educational system. A similar relationship was
found5 in the Netherlands. Despite this, no action was undertaken to change the
educational system. One can only hope that this will not be the case for sports.

Ad Dudink, Faculty of Psychology, University of Amsterdam, 1018 WB Amsterdam,
The Netherlands

References: 1. Dudink A. Eur J High Ability 1, 144-150 (1990). 2. Dudink A & Bakker F.
Ned. Tschr. Psychol 48, 55-69 (1993). 3. Rollin J. Rothmans Football Yearbook 1992-93
(Headline, London, 1992). 4. Shearer E. Educ Res 10, 51-56 (1967). 5. Doornbos K. Date of
birth and scholastic performance (Wolters-Noordhoff, Groningen, 1971). 6. Barnsley R. H. &
Thompson A. H. Can. J. Behav. Sci 20, 167-176 (1988). 7. Williams Ph., Davies P., Evans R.
& Ferguson N. Nature 228, 1033-1036 (1970).

-----------------

For an example of an analysis of seasonal variation, see the article by H T Sørensen et
al. Does month of birth affect risk of Crohn's disease in childhood and adolescence?
p 907, BMJ VOLUME 323, 20 OCTOBER 2001, bmj.com (copy of the article, and the associated
dataset, on the course 626 website).
page 16
Correlation M&M §2.2
References: A&B Ch 5,8,9,10; Colton Ch 6, M&M Chapter 2.2
Measures of Correlation

Loose Definition of Correlation:
Degree to which, in observed (x,y) pairs, the y value tends to be larger than average when
x is larger (smaller) than average; the extent to which larger-than-average x's are
associated with larger (smaller) than average y's.

Similarities between Correlation and Regression
• Both involve relationships between a pair of numerical variables.
• Both: "predictability", "reduction in uncertainty", "explanation".
• Both involve straight-line relationships [can get fancier too].

Differences

  Correlation                                  Regression
  Symmetric (doesn't matter which is on        Directional (matters which is on Y,
  the Y and which on the X axis)               which on the X axis)

  Choose n 'objects'; measure (X,Y) on each    (i) Choose n objects on the basis of their
                                               X values; measure their Y; or
                                               (ii) Choose objects (as with correlation);
                                               measure (X,Y)

  Dimensionless (no units) ( –1 to +1 )        ΔY/ΔX units, e.g. Kg/cm

                                               Regard the X value as 'fixed'.
                                               Can be extended to non-straight-line
                                               relationships.
                                               Can relate Y to multiple X variables.

Pearson Product-Moment Correlation Coefficient

  Context                    Symbol     Calculation

  sample of n pairs          rxy        Σ{xi – x̄}{yi – ȳ} / √[ (Σ{xi – x̄}²)(Σ{yi – ȳ}²) ]

  "universe" of all pairs    ρxy        E{(X – µX)(Y – µY)} / √[ E{(X – µX)²} E{(Y – µY)²} ]

Notes:
° ρ: Greek letter r, pronounced 'rho'.
° E: Expected value.
° µ: Greek letter 'mu'; denotes the mean in the universe.
° Think of r as an average product of scaled deviations [M&M p127 use n–1 because the two
  SDs involved in creating Z scores implicitly involve 1/√(n–1); the result is the same as
  above].

Spearman's (Non-parametric) Rank Correlation Coefficient   (see later)

  x -> rank: replace the x's by their ranks (1 = smallest to n = largest)
  y -> rank: replace the y's by their ranks (1 = smallest to n = largest)
  THEN calculate the Pearson correlation for the n pairs of ranks.

page 1
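If software is available, both coefficients can be requested at once; a minimal SAS sketch
(the dataset and variable names mydata, x, y are mine, hypothetical):

   proc corr data=mydata pearson spearman;   /* Pearson r and Spearman rank r       */
     var x y;                                /* the two numerical variables         */
   run;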
Correlation M&M §2.2
ρ² is a measure of how much the variance of Y is reduced by knowing what the value of X is
(or vice versa):

   Var( Y | X ) = Var( Y ) × ( 1 – ρ² )       ρ² is called the
   Var( X | Y ) = Var( X ) × ( 1 – ρ² )       "coefficient of determination"

A large ρ² (i.e. ρ close to –1 or +1) -> a close linear association of the X and Y values;
one is far less uncertain about the value of one variable if told the value of the other.

Correlation

Positive:  larger than ave. X's with larger than ave. Y's;
           smaller than ave. X's with smaller than ave. Y's.
Negative:  larger than ave. X's with smaller than ave. Y's;
           smaller than ave. X's with larger than ave. Y's.
None:      larger than ave. X's 'equally likely' to be coupled with larger as with smaller
           than ave. Y's.

See the article by Chatillon on the "Balloon Rule" for visually estimating r. (cf.
Resources for Session 1, course 678 web page)

[Figure: scatterplots showing how r ranges from -1 (negative correlation) through 0 (zero
correlation) through +1 (positive correlation); r is not tied to the x or y scale. In the
quadrants formed by ave(X) and ave(Y): X-deviation –, Y-deviation + => PRODUCT –;
X-deviation +, Y-deviation + => PRODUCT +; X-deviation –, Y-deviation – => PRODUCT +;
X-deviation +, Y-deviation – => PRODUCT –.]

If the X and Y scores are standardized to have mean = 0 and unit SD = 1, it can be seen
that ρ is like a "rate of exchange", i.e. the value of a standard deviation's worth of X
in terms of PREDICTED standard deviation units of Y.

If we know an observation is ZX SD's from µX, then the least-squares prediction of the
observation's ZY value (i.e. relative to µY) is given by

   predicted ZY = ρ • ZX .

Notice the regression towards the mean: ρ is always less than 1 in absolute value, and so
the predicted ZY is closer to 0 (or, equivalently, the predicted Y is closer to µY) than
the ZX was to 0 (or X was to µX).

page 2
Correlation M&M §2.2
Inferences re ρ   [based on a sample of n (x,y) pairs]

Naturally, the observed r in any particular sample will not exactly match the ρ in the
population (i.e. the coefficient one would get if one included everybody). The quantity r
varies from one possible sample of n to another possible sample of n, i.e. r is subject to
sampling fluctuations about ρ.

1  A question all too often asked of one's data is whether there is evidence of a non-zero
correlation between 2 variables. To test this, one sets up the null hypothesis that ρ is
zero and determines the probability, calculated under this null hypothesis that ρ = 0, of
obtaining an r more extreme than we observed. If the null hypothesis is true, r would just
be "randomly different" from zero, with the amount of the random variation governed by n.

This discrepancy of r from 0 can be measured as

   r √(n – 2) / √(1 – r²)

and should, if the null hypothesis of ρ = 0 is true, follow a t distribution with n – 2 df.

[Colton's table A5 gives the smallest r which would be considered evidence that ρ ≠ 0.
For example, if n=20, so that df = 18, an observed correlation of 0.44 or higher, or
between -0.44 and -1, would be considered statistically significant at the P=0.05 level
(2-sided). NB: this t-test assumes that the pairs are from a Bivariate Normal
distribution. Also, it is valid only for testing ρ = 0, not for testing any other value
of ρ.]

JH has seen many a researcher scan a matrix of correlations, highlighting those with a
small p-value and hoping to make something of them. But very often, that ρ was non-zero
was never in doubt; the more important question is how non-zero the underlying ρ really
was. A small p-value (from maybe a feeble r but a large n!) should not be taken as
evidence of an important ρ! JH has also observed several disappointed researchers who
mistakenly see the small p-values and think they are the correlations! (the p-values
associated with the test of ρ = 0 are often printed under the correlations)

Interesting example where r ≠ 0, and not by chance alone!
1970 U.S. DRAFT LOTTERY during the Vietnam War: See Moore and McCabe pp113-114, along with
the spreadsheet under Resources for Chapter 10, where the lottery is simulated using
random numbers (Monte Carlo method).

2  Other common questions: given that r is based only on a sample, what interval should I
put around r so that it can be used as a (say 95%) confidence interval for the "true"
coefficient ρ?

Or (answerable by the same technique): one observes a certain r1; in another population,
one observes a value r2. Is there evidence that the ρ's in the 2 populations we are
studying are unequal?

From our experience with the binomial statistic, which is limited to {0,n} or {0,1}, it is
no surprise that the r statistic, limited as it is to {minus 1, plus 1}, also has a
pattern of sampling variation that is not symmetric unless ρ is right in the middle, i.e.
unless ρ = 0. The following transformation of r will lead to a statistic which is
approximately normal even if the ρ('s) in the population(s) we are studying is(are) quite
distant from 0:

   (1/2) ln { (1 + r) / (1 – r) }     [where ln is log to the base e, or natural log].

It is known as Fisher's transformation of r; the observed r, transformed to this new
scale, should be compared against a Gaussian distribution with

   mean = (1/2) ln { (1 + ρ) / (1 – ρ) }   and   SD = 1 / √(n – 3).

page 3
Correlation M&M §2.2
Inferences re ρ   [continued...]

e.g. 2a: Testing H0: ρ = 0.5

Observe r = 0.4 in a sample of n = 20. Compute

   [ (1/2) ln{ (1+r)/(1–r) } – (1/2) ln{ (1+ρ)/(1–ρ) } ]  /  [ 1/√(n–3) ]

and compare with Gaussian (0,1) tables. Extreme values of the standardized Z are taken as
evidence against H0. Often, the alternative hypothesis concerning ρ is 1-sided, of the
form ρ > some quantity.

e.g. 2b: Testing H0: ρ1 = ρ2, based on r1 & r2 in independent samples of n1 & n2

Remembering that "variances add; SD's do not", compute the test statistic

   [ (1/2) ln{ (1+r1)/(1–r1) } – (1/2) ln{ (1+r2)/(1–r2) } – 0 ]  /  √[ 1/(n1–3) + 1/(n2–3) ]

and compare with Gaussian (0,1) tables.

e.g. 2c: 100(1–α)% CI for ρ from r

By solving the double inequality

   – zα/2  ≤  [ (1/2) ln{ (1+r)/(1–r) } – (1/2) ln{ (1+ρ)/(1–ρ) } ] / [ 1/√(n–3) ]  ≤  zα/2

so that the middle term is ρ, we can construct a CI for ρ; the two bounds are

   [ 1 + r – {1 – r} e^[± 2 zα/2 / √(n–3)] ]  /  [ 1 + r + {1 – r} e^[± 2 zα/2 / √(n–3)] ] .

Worked e.g.: 95% CI(ρ) based on r = 0.55 in a sample of n = 12.

With α = 0.05 and zα/2 = 1.96, the bounds for ρ are:

   [ 1 + 0.55 – {1 – 0.55} e^[± 2 • 1.96 / √9] ] / [ 1 + 0.55 + {1 – 0.55} e^[± 2 • 1.96 / √9] ]
 = [ 1.55 – 0.45 e^[±1.307] ] / [ 1.55 + 0.45 e^[±1.307] ]
 = [ 1.55 – 0.45 • 3.69 ] / [ 1.55 + 0.45 • 3.69 ]   and   [ 1.55 – 0.45 / 3.69 ] / [ 1.55 + 0.45 / 3.69 ]
 = –0.04 to 0.84

This CI, which overlaps zero, agrees with the test of ρ = 0 described above. For if we
evaluate

   0.55 √(12 – 2) / √(1 – 0.55²),

we get a value of 2.08, which is not as extreme as the tabulated t10, 0.05 (2-sided) value
of 2.23.

Note: There will be some slight discrepancies between the t-test of ρ = 0 and the z-based
CI's. The latter are only approximate. Note also that both assume we have data which have
a bivariate Gaussian distribution.
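The back-transformation can equivalently be written with the hyperbolic tangent (tanh is
the inverse of Fisher's transformation). Here is a minimal SAS data-step sketch (names are
mine, hypothetical) that reproduces the worked CI above under that equivalent formulation:

   data fisher_ci;                        /* hypothetical name                         */
     r = 0.55;  n = 12;  z = probit(0.975);        /* observed r, n, z_(alpha/2)       */
     zr = 0.5*log((1 + r)/(1 - r));                /* Fisher's transform of r          */
     se = 1/sqrt(n - 3);
     lo = tanh(zr - z*se);                         /* back-transform the two bounds    */
     hi = tanh(zr + z*se);
     put lo= hi=;          /* approx. -0.03 and 0.85 (cf. -0.04 to 0.84 above, which
                              used rounded intermediate values)                        */
   run;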
page 4
Correlation M&M §2.2
(Partial) NOMOGRAM for 95% CI's for ρ

   { 1 + r – (1 – r) Exp[±2z/√(n–3)] } / { 1 + r + (1 – r) Exp[±2z/√(n–3)] },
   n = 10, 15, 25, 50, 150

It is based on Fisher's transformation of r. In addition to reading it vertically to get a
CI for ρ (vertical axis) based on an observed r (horizontal axis), one can also use it to
test whether an observed r is compatible with, or significantly different at the α = 0.05
level from, some specific value, ρ0 say, on the vertical axis: simply read across from
ρ = ρ0 and see if the observed r falls within the horizontal range appropriate to the
sample size involved. Note that this test of a nonzero ρ is not possible via the t-test.
Books of statistical tables have fuller nomograms.

Shown: CI if we observe r = 0.5 (o) with n = 25.

One could also use the nomogram to gauge the approx. 95% limits of variation for the
correlation in a draft lottery. The n = 366 is a little more than 2.44 times the n = 150
here. So the (horizontal) variations around ρ = 0 should be only 1/√2.44, or 64%, as wide
as those shown here for n = 150. Thus the 95% range of r would be approx. -0.1 to +0.1.
(Since X and Y are uniform, rather than Gaussian, theory may be a little "off".) The
observed r was -0.23.

[Nomogram figure: curves for n = 10, 15, 25, 50, 150; vertical axis: rho (-1 to +1);
horizontal axis: observed r (-1 to +1).]

page 5
Correlation M&M §2.2
Spearman's (Non-parametric) Rank Correlation Coefficient

How Calculated:
(i) replace the x's and y's by their ranks (1 = smallest to n = largest)
(ii) calculate the Pearson correlation using the pairs of ranks.

Equivalently,

   rSpearman = 1 – 6Σd² / [ n(n² – 1) ]
   { d = Δ between the "X rank" & "Y rank" for each observation }

Advantages
• Easy to do manually (if the ranking is not a chore);
• Less sensitive to outliers (x -> rank ==> variance fixed, for a given n).
  Extreme {xi – x̄} or {yi – ȳ} can exert considerable influence on rPearson.
• Picks up on non-linear patterns; e.g. the rSpearman for a monotone but curved pattern of
  points (sketched in the original figure) is 1, whereas the rPearson is less.

Correlations -- obscured and artifactual

(i) Diluted / attenuated / obscured

Examples:
1  Relationship, in McGill Engineering students, between their first-year university
   grades and their CEGEP grades.
2  Relationship between heights of offspring and heights of their parents:
      X = average height of the 2 parents
      Y = height of offspring (ignoring the sex of the offspring)
   Galton's solution: 'transmute' female heights to male heights
      'transmuted' height = height × 1.08

(ii) Artifact / artificially induced
1. Blood Pressure of unrelated (male, female) 'couples'

page 6
Regression M&M §2.3 and §10
Uses
• Curve fitting
• Summarization ('model')
• Description
• Prediction
• Explanation
• Adjustment for 'confounding' variables

Technical Meaning
• [originally] simply a line of 'best fit' to data points
• [nowadays] the Regression line is the LINE that connects the CENTRES of the
  distributions of the Y's at each X value.
• not necessarily a straight line; it could be curved, as with growth charts
• not necessarily µY|X 's used as CENTRES; could use medians etc.
• strictly speaking, we haven't completed the description unless we characterize the
  variation around the centres of the Y distributions at each X
• inference is not restricted to the distributions of Y's for which we make some
  observations; it applies to the distributions of Y's at all unobserved X values in
  between.

Caveat: No guarantee that a simple straight-line relationship will be adequate. Also, in
some instances the relationship might change with the type of X and Y variables used to
measure the two phenomena being studied; also the relationship may be more artifact than
real -- see later for inference.

Examples (with appropriate caveats)
• Birth weight (Y) in relation to gestational age (X)
• Blood pressure (Y) in relation to age (X)
• Cardiovascular mortality (Y) in relation to water hardness (X) ?
• Cancer incidence (Y) in relation to some exposure (X) ?
• Scholastic performance (Y) vis a vis amount of TV watched (X)

[Table and figure: 10th, 50th and 90th percentiles of birth weight (g), by gestational age
(weeks 25 to 42) and sex, for live singleton births, Canada 1986 (e.g. at 25 weeks: males,
N=100: 651, 810, 950 g; females, N=73: 604, 750, 924 g), together with a plot of the
median (50th percentile) birth weight for males and females against gestational age
(weeks 30-40; weights 1000g-4000g). Source: Arbuckle & Sherman, CMAJ 140: 157-161, 1989.]

page 1
Regression M&M §2.3 and §10
Simple (one X) Linear† Regression (straight line)

Equation:

   µY|X = α + β X    or    ΔµY|X / ΔX = β = "rise" / "run"

In Practice:
One rarely sees an exact straight-line relationship in health science applications:

1 - While physicists are often able to examine the relationship between Y and X in a
laboratory with all other things being equal (i.e. controlled or held constant), medical
investigators largely are not. The universe of (X,Y) pairs is very large and any 'true'
relationship is disturbed by countless uncontrollable (and sometimes un-measurable)
factors. In any particular sample of (X,Y) pairs these distortions will surely be
operating.

2 - The true relationship (even if we could measure it exactly) may not be a simple
straight line.

3 - The measuring instruments may be faulty or inexact (using 'instruments' in the
broadest sense).

One always tries to have the investigation sufficiently controlled that the 'real'
relationship won't be 'swamped' by factors 1 and 3, and that the background "noise" will
be small enough so that alternative models (e.g. curvilinear relationships) can be
distinguished from one another.

Fitting a straight line to data - Least Squares Method

The most common method is that of Least Squares. Note that least squares can be thought of
as just a curve-fitting method and doesn't have to be thought of in a statistical (or
random variation or sampling variation) context. Other, more statistically-oriented
methods include the method of minimum Chi-Square (matching observed and expected counts
according to a measure of discrepancy) and the Method of Maximum Likelihood (finding the
parameters that made the data most likely). Each has a different criterion of "best fit".

Least Squares Approach:

• Consider a candidate slope (b) and intercept (a) and predict that the Y value
  accompanying any X = x is ŷ = a + b•x. The observed y value will deviate from this
  "predicted" or "fitted" value by an amount d = y – ŷ.

• We wish to keep this deviation as small as possible, but we must try to strike a balance
  over all the data points. Again, just like when calculating variances, it is easier to
  work with squared deviations1:  d² = (y – ŷ)².

• We weight all deviations equally (whether they be the ones in the middle or at the
  extremes of the x range), using Σ d² = Σ (y – ŷ)² to measure the overall (or average)
  discrepancy of the points from the line.

------------------------------------------------------------------------------------------------------
† Linear here means linear in the parameters. The equation y = B•C^x can be made linear in
the parameters by taking logs, i.e. log[y] = log[B] + x•log[C]; y = a + b•x + c•x² is
already linear in the parameters a, b and c. The following model cannot be made linear in
the parameters α, β, γ:

   proportion dying = α + (1 – α) / ( 1 + exp{ β – γ log[dose] } )

1 There are also several theoretical advantages to least squares estimates over others
based, for example, on least absolute deviations: they are the most precise of all the
possible estimates one could get by taking linear combinations of the y's.

page 2
Regression M&M §2.3 and §10
• From all the possible candidates for slope (b) and intercept (a), we choose the
  particular values a and b which make this sum of squares (the sum of squared deviations
  of 'fitted' from 'observed' Y's) a minimum, i.e. we search for the a and b that give us
  the least squares fit.

• Fortunately, we don't have to use trial and error to arrive at the 'best' a and b.
  Instead, it can be shown by calculus, or algebraically, that the a and b which minimize
  Σ d² are:

     b = β̂ = Σ{xi – x̄}{yi – ȳ} / Σ{xi – x̄}²  =  rxy • sy / sx

     a = α̂ = ȳ – b x̄

  [Note that a least-squares fit of the regression line of X on Y would give a different
  set of values for the slope and intercept: the slope of the line of x on y is
  rxy • sx / sy. One needs to be careful, when using a calculator or computer program, to
  specify which is the explanatory variable (X) and which is the predicted variable (Y).]

Meaning of the intercept parameter (a):

Unlike the slope parameter (which represents the increase/decrease in µY|X for every unit
increase in x), the intercept does not always have a 'natural' interpretation. It depends
on where the x-values lie in relation to x = 0, and may represent part of what is really
the mean Y. For example, the regression line for fuel economy of cars (Y) in relation to
their weight (x) might be

   µY|weight = 60 mpg – 0.01•(weight in lbs)    [0.01 mpg/lb]

but there are no cars weighing 0 lbs. It would be better to write the equation in relation
to some 'central' value for weight, e.g. 3500 lbs; then the same equation can be cast as

   µY|weight – 25 = – 0.01•(weight – 3500)

It is helpful, for testing whether there is evidence of a non-zero slope, to think of the
simplest of all regression models, namely that which is a horizontal straight line:

   µY|X = α + 0 • X = the constant α.

This is a re-statement of the fact that the sum of squared deviations around a constant
horizontal line at height 'a' is smallest when a = the mean ȳ.

[We don't always use the mean as the best 'centre' of a set of numbers. Imagine waiting
for one of several elevators with doors in a row along one wall; you do not know which one
will arrive next, and so want to stand in the 'best' place no matter which one comes next.
Where to stand depends on the criterion being optimized: if you want to minimize the
maximum distance, stand in the middle between the one on the extreme left and the one on
the extreme right; if you wish to minimize the average distance, where do you stand? If,
for some reason, you want to minimize the average squared distance, where do you stand? If
the elevator doors are not equally spaced from each other, what then?]
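For completeness, a minimal SAS sketch of obtaining the least-squares intercept and slope;
the dataset and variable names (cars, mpg, weight) are hypothetical, chosen only to echo
the fuel-economy example above:

   proc reg data=cars;           /* hypothetical dataset                          */
     model mpg = weight;         /* least-squares estimates of intercept and slope */
   run;
   quit;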
page 3
Regression M&M §2.3 and §10
The anatomy of a slope: some re-expressions

Consider the formula:

   slope = b = Σ{x – xbar}{y – ybar} / Σ{x – xbar}²

Without loss of generality & for simplicity, assume ybar = 0.

If we have 3 x's, 1 unit apart (e.g. x1 = 1; x2 = 2; x3 = 3), then...

   x1 – xbar = –1;  x2 – xbar = 0;  x3 – xbar = +1

so

   slope = b = [ {–1}y1 + {0}y2 + {+1}y3 ] / [ {–1}² + {0}² + {+1}² ]

   i.e. slope = ( y3 – y1 ) / ( x3 – x1 )

Note that y2 contributes to ybar, and thus to an estimate of the average y (i.e. the
level), but not to the slope.

If we have 4 x's, 1 unit apart (e.g. x1 = 1; x2 = 2; x3 = 3; x4 = 4), then...

   x1 – xbar = –1.5;  x2 – xbar = –0.5;  x3 – xbar = +0.5;  x4 – xbar = +1.5

and so

   b = [ {–1.5}y1 + {–0.5}y2 + {+0.5}y3 + {+1.5}y4 ]
       / [ {–1.5}² + {–0.5}² + {+0.5}² + {+1.5}² ]

   i.e. slope = [ 1.5{ y4 – y1 } + 0.5{ y3 – y2 } ] / 5

   i.e. slope = (9/10) • { y4 – y1 }/{ x4 – x1 }  +  (1/10) • { y3 – y2 }/{ x3 – x2 }

i.e. a weighted average of the slope from datapoints 1 and 4 and that from datapoints
2 and 3, with weights proportional to the squares of their distances on the x axis,
{x4 – x1}² and {x3 – x2}².

Another way to think of the slope:

Rewrite   b = Σ{x – xbar}{y – ybar} / Σ{x – xbar}²   as

   b = Σ [ weight • {y – ybar}/{x – xbar} ] / Σ weight,
   with weight {x – xbar}² for the estimate {y – ybar}/{x – xbar} of the slope.

Yet another way to think of the slope:

b is a weighted average of all the pairwise slopes ( yi – yj )/( xi – xj ), with weights
proportional to {xi – xj}².

e.g. If 4 x's are 1 unit apart, denote by b1&2 the slope obtained from {x2, y2} & {x1, y1},
etc. Then

   b = [ 1•b1&2 + 4•b1&3 + 9•b1&4 + 1•b2&3 + 4•b2&4 + 1•b3&4 ] / [ 1 + 4 + 9 + 1 + 4 + 1 = 20 ]

jh 6/94
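As a quick numerical check of these re-expressions, here is a minimal SAS data-step sketch
using 4 made-up points at x = 1, 2, 3, 4 (the y-values are arbitrary, chosen by me for
illustration):

   data slope_check;                         /* hypothetical, made-up data             */
     x1=1; x2=2; x3=3; x4=4;
     y1=1; y2=3; y3=2; y4=6;                 /* any 4 y-values will do                 */
     xbar=(x1+x2+x3+x4)/4;  ybar=(y1+y2+y3+y4)/4;
     /* (1) the usual least-squares formula */
     b_ls  = ( (x1-xbar)*(y1-ybar)+(x2-xbar)*(y2-ybar)
              +(x3-xbar)*(y3-ybar)+(x4-xbar)*(y4-ybar) )
             / ( (x1-xbar)**2+(x2-xbar)**2+(x3-xbar)**2+(x4-xbar)**2 );
     /* (2) weighted average of the '1&4' and '2&3' slopes, weights 9 and 1 */
     b_two = (9/10)*(y4-y1)/(x4-x1) + (1/10)*(y3-y2)/(x3-x2);
     /* (3) weighted average of all 6 pairwise slopes, weights (xi - xj)**2 */
     b_all = ( 1*(y2-y1)/(x2-x1) + 4*(y3-y1)/(x3-x1) + 9*(y4-y1)/(x4-x1)
              +1*(y3-y2)/(x3-x2) + 4*(y4-y2)/(x4-x2) + 1*(y4-y3)/(x4-x3) ) / 20;
     put b_ls= b_two= b_all=;                /* all three equal 1.4 for these data     */
   run;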
page 4
Regression M&M §2.3 and §10
Inferences regarding Simple Linear Regression

How reliable are
(i)   the (estimated) slope
(ii)  the (estimated) intercept
(iii) the predicted mean Y at a given X
(iv)  the predicted y for a (future) individual with a given X

when they are based on data from a sample? i.e. how much would these estimated quantities
change if they were based on a different random sample [with the same x values]?

We can use the concept of sampling variation to (i) describe the 'uncertainty' in our
estimates via CONFIDENCE INTERVALS or (ii) carry out TESTS of significance on the
parameters (slope, intercept, predicted mean).

Notes

1. The regression line refers to the relationship between the average Y at a given X and
   the X, and not to individual Y's vs X. Obviously, of course, if the individual Y's are
   close to the average Y, so much the better!

2. We can describe the degree of reliability of (or, conversely, the degree of uncertainty
   in) an estimated quantity by the standard deviation of the possible estimates produced
   by different random samples of the same size from the same x's. We call this (obviously
   conceptual) S.D. the standard error of the estimated quantity (just like the standard
   error of the mean when estimating µ). It is helpful to think of the slope as an average
   difference in means for 2 groups that are 1 x-unit apart.

3. The size of the standard error will depend on
   1. how 'spread apart' the x's are
   2. how good a fit the regression line really is (i.e. how small the unexplained
      variation about the line is)
   3. how large the sample size, n, is.

Factors affecting reliability (in more detail)

1. The spread of the X's: The best way to get a reliable estimate of the slope is to take
   Y readings at X's that are quite a distance from each other. E.g. in estimating the
   "per year increase in BP over the 30-50 yr. age range", it would be better to take
   X = 30, 35, 40, 45, 50 than to take X = 38, 39, 40, 41, 42. Any individual fluctuations
   will 'throw off' the slope much less if the X's are far apart.

   [Figure: BP vs AGE (30 to 50), panels (a) and (b). Thick line: real (true) relation
   between average BP at age X and X; thin lines: possible apparent relationships because
   of individual variation when we study 1 individual at each of two ages (a) spaced
   closer together (b) spaced further apart.]

   The above argument would suggest studying individuals at the extremes of the X values
   of interest. We do this if we are sure that the relationship is a linear one. If we are
   not sure, it is wiser -- if we have a choice in the matter -- to take a 3-point
   distribution.

   There is a common misapprehension that a Gaussian distribution of X values is desirable
   for estimating a regression slope of Y on X. In fact, the 'inverted U' shape of the
   Gaussian is the least desirable!

page 5
Regression M&M §2.3 and §10
Factors affecting reliability (continued)
2.
variation when we study 1 individual at each of two ages when the
within-age distributions have (a) a narrow spread (b) a wider spread
The (vertical) variation about the regression line: Again, consider
BP and age, and suppose that indeed the average BP of all persons
aged X + 1 is β units higher than the average BP of all persons
aged X, and that this linear relationship
NOTE: For unweighted regression, should have roughly same spread
of Y's at each X.
Factors affecting reliability (continued)
average BP of persons aged x =
α + β • X
(average of Y's at a given x = intercept + slope • X)
3.
holds over the age span 30-50.
Obviously, everybody aged x=32 won't have the exact same BP,
some will be above the average of 32 yr olds, some below.
Likewise for the different ages x=30,...50. In other words, at any x
there will be a distribution of y's about the average for age X.
Obviously, how wide this distribution is about α + β•X will have
an effect on what slopes one could find in different samples
(measure vertical spread around the line by σ)
(a)
(b)
BP
BP
AGE
30
40
50
AGE
30
40
50
t h i c k l i n e : real (true) relation between average BP at age X and X :
thin lines: possible apparent relationships because of individual
page 6
Sample Size (n) Larger n will make it more difficult for the types of
extremes and misleading estimates caused by 1) poor X spread and 2)
large variation in Y about µ Y|X , to occur. Clearly, it may be possible
to spread the x's out so as to maximize their variance (and thus reduce
the n required) but it may not be possible to change the magnitude of
the variation about µ Y|X (unless there are other known factors
influencing BP). Thus the need for reasonably stable estimated y^
[i.e.estimate of µ Y|X ]
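A small simulation shows all three factors at work. This is a sketch with invented parameters (a 'true' line and residual SD chosen only for illustration, not data from any real study): the spread of the x's, the residual SD σ, and n each change how much the estimated slope varies from sample to sample.

import random
import statistics

def sd_of_slopes(xs, sigma, n_sims=2000, alpha=120.0, beta=0.5, seed=1):
    """SD of the least-squares slope over repeated samples taken at the same x's."""
    rng = random.Random(seed)
    xbar = sum(xs) / len(xs)
    sxx = sum((x - xbar) ** 2 for x in xs)
    slopes = []
    for _ in range(n_sims):
        ys = [alpha + beta * x + rng.gauss(0, sigma) for x in xs]   # 'true' line: 120 + 0.5*age
        ybar = sum(ys) / len(ys)
        b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
        slopes.append(b)
    return statistics.stdev(slopes)

print(sd_of_slopes([30, 35, 40, 45, 50], sigma=10))       # well-spread ages
print(sd_of_slopes([38, 39, 40, 41, 42], sigma=10))       # bunched ages: much larger SD of slopes
print(sd_of_slopes([30, 35, 40, 45, 50], sigma=5))        # smaller vertical scatter: smaller SD
print(sd_of_slopes([30, 35, 40, 45, 50] * 4, sigma=10))   # larger n: smaller SD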
Standard Errors

  SE(b) = SE(β̂) = σ / √( ∑{xi – x̄}² )

  SE(a) = SE(α̂) = σ • √( 1/n + x̄² / ∑{xi – x̄}² )

(Note: there is a negative correlation between a and b.)

We don't usually know σ, so we estimate it from the data, using the scatter of
the y's from the fitted line (i.e. the SD of the residuals).

If we examine the structure of SE(b), we see that it reflects the 3 factors discussed
above: (i) a large spread of the x's makes the contribution of each observation to
∑{xi – x̄}² large, and since this is in the denominator, it reduces the SE;
(ii) a small vertical scatter is reflected in a small σ, and since this is in the
numerator, it also reduces the SE of the estimated slope; (iii) a large sample
size means that ∑{xi – x̄}² is larger, and like (i) this reduces the SE.

The formula, as written, tends to hide this last factor; note that
∑{xi – x̄}² is what we use to compute the spread of a set of x's -- we
simply divide it by n–1 to get a variance and then take the square root to get
the sd. To make the point here, simplify n–1 to n and write
∑{xi – x̄}² ≈ n • var(x), so that

  √( ∑{xi – x̄}² ) ≈ √n • sd(x)

and the equation for the SE simplifies to

  SE(b) = σ / ( √n • sd(x) ) = ( SDy|x / SDx ) / √n

with n in its familiar place in the denominator of the SE (even in more
complex SE's, this is where n is usually found!).

The structure of SE(a): In addition to the factors mentioned above, all
of which come in again in the expected way, there is the additional
term involving x̄²; since it is added inside the square root, it increases the SE. This
is natural in that if the data, and thus x̄, are far from x=0, then any
imprecision in the estimate of the slope will project backwards into a
large imprecision in the estimated intercept. Also, if one uses 'centered'
x's, so that x̄ = 0, the formula for the SE reduces to

  SE(a) = σ • √(1/n) = σ / √n

and we recognize this as SE(ȳ) -- not surprisingly, since ȳ is the
'intercept' for centered data.

CI's & Tests of Significance for β̂ and α̂ are based on the
t-distribution (or Gaussian Z's if n is large):

  CI:  β̂ ± tn–2 • SE(β̂)        Test of H0: β = β0:   tn–2 = ( β̂ – β0 ) / SE(β̂)

  CI:  α̂ ± tn–2 • SE(α̂)        Test of H0: α = α0:   tn–2 = ( α̂ – α0 ) / SE(α̂)
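The sketch below puts these formulas to work on made-up data (ages and BP's invented for illustration): it computes SE(b) from the exact formula, checks it against the σ/(√n • sd(x)) approximation, and forms the t-based 95% CI for the slope. Any statistics package should give the same numbers.

import math
import statistics

x = [30, 35, 40, 45, 50, 30, 35, 40, 45, 50]              # made-up ages
y = [118, 125, 121, 130, 135, 122, 119, 128, 131, 138]    # made-up BP's
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
a = ybar - b * xbar

# sigma estimated from the scatter of the y's about the fitted line (n - 2 df)
sigma_hat = math.sqrt(sum((yi - (a + b * xi)) ** 2
                          for xi, yi in zip(x, y)) / (n - 2))

se_b = sigma_hat / math.sqrt(sxx)                                # exact formula
se_b_approx = sigma_hat / (math.sqrt(n) * statistics.stdev(x))   # sigma / (sqrt(n) * sd(x))
t = 2.306   # t(0.975) with n - 2 = 8 df, from tables

print(f"b = {b:.3f};  SE(b) = {se_b:.4f}  (approximation: {se_b_approx:.4f})")
print(f"95% CI for the slope: {b - t * se_b:.3f} to {b + t * se_b:.3f}")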
Standard Error for Estimated µY|X or 'average Y at X'

We estimate 'average Y at X', or µY|X, by α̂ + β̂•X. Since the
estimate is based on two estimated quantities, each of which is subject to
sampling variation, it contains the uncertainty of both:

  SE(estimated average Y at X) = σ • √( 1/n + {X – x̄}² / ∑{xi – x̄}² )

Again, we must use an estimate σ̂ of σ.

First-time users of this formula suspect that it has a missing ∑ or an x
instead of an x̄ or something. There is no typographical error, and indeed
if one examines it closely, it makes sense. X refers to the x-value at which
one is estimating the mean -- it has nothing to do with the actual x's in the
study which generated the estimated coefficients, except that the closer X is
to the center of the data, the smaller the quantity {X – x̄}, and thus the
quantity {X – x̄}², and thus the SE, will be. Indeed, if we estimate the
average Y right at X = x̄, the estimate is simply ȳ (since the fitted
line goes through [ x̄, ȳ ]) and its SE will be

  σ • √( 1/n + {x̄ – x̄}² / ∑{xi – x̄}² )   or   σ • √(1/n) = σ / √n = SE(ȳ).

Confidence Interval for individual Y at X

A certain percentage P% of individuals are within tP • σ of the mean
µY|X = α + β • X, where tP is a multiple, depending on P, from the t
or, if n is large, the Z table. However, we are not quite certain where
exactly the mean α + β • X is -- the best we can do is estimate,
with a certain P% confidence, that it is within tP • SE( α̂ + β̂•X ) of
the point estimate α̂ + β̂•X. The uncertainty concerning the mean and
the natural variation of individuals around the mean -- wherever it is --
combine in the expression for the estimated P% range of individual
variation, which is as follows:

  α̂ + β̂•X  ±  t • σ • √( 1 + 1/n + {X – x̄}² / ∑{xi – x̄}² ).

Both the CI for the estimated mean and the CI for individuals (i.e. the
estimated percentiles of the distribution) are bow-shaped when drawn as
a function of X. They are narrowest at X = x̄, and fan out from there.
One needs to be careful not to confuse the much narrower CI for the
mean with the much wider CI for individuals. If one can see the raw
data, it is usually obvious which is which -- the CI for individuals is
almost as wide as the raw data themselves.

cf. data on sleeping through the night; alcohol levels and eye speed.
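The same ingredients give the two 'bow-shaped' bands. This sketch (same sort of made-up ages and BP's as above) prints, for several values of X, the interval for the mean and the much wider interval for individuals; both are narrowest at X = x̄ and fan out from there.

import math

x = [30, 35, 40, 45, 50, 30, 35, 40, 45, 50]              # made-up ages
y = [118, 125, 121, 130, 135, 122, 119, 128, 131, 138]    # made-up BP's
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
a = ybar - b * xbar
sigma_hat = math.sqrt(sum((yi - (a + b * xi)) ** 2
                          for xi, yi in zip(x, y)) / (n - 2))
t = 2.306   # t(0.975), n - 2 = 8 df, from tables

for X in [30, 40, 50, 60]:
    fit = a + b * X
    se_mean = sigma_hat * math.sqrt(1 / n + (X - xbar) ** 2 / sxx)        # SE of estimated mean Y at X
    se_indiv = sigma_hat * math.sqrt(1 + 1 / n + (X - xbar) ** 2 / sxx)   # for a future individual at X
    print(f"X={X}: mean {fit:.1f} +/- {t * se_mean:.1f};  individual {fit:.1f} +/- {t * se_indiv:.1f}")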
PERIOPERATIVE NORMOTHERMIA TO REDUCE THE INCIDENCE OF SURGICAL-WOUND INFECTION AND SHORTEN HOSPITALIZATION
ANDREA KURZ, M.D., DANIEL I. SESSLER, M.D., AND RAINER LENHARDT, M.D.,
FOR THE STUDY OF WOUND INFECTION AND TEMPERATURE GROUP1
Abstract
Background. Mild perioperative hypothermia, which is common during major
surgery, may promote surgical-wound infection by triggering thermoregulatory
vasoconstriction, which decreases subcutaneous oxygen tension. Reduced
levels of oxygen in tissue impair oxidative killing by neutrophils and decrease
the strength of the healing wound by reducing the deposition of collagen.
Hypothermia also directly impairs immune function. We tested the hypothesis
that hypothermia both increases susceptibility to surgical-wound infection and
lengthens hospitalization.
Methods. Two hundred patients undergoing colorectal surgery were
randomly assigned to routine intraoperative thermal care (the hypothermia
group) or additional warming (the normothermia group). The patients'
anesthetic care was standardized, and they were all given cefamandole and
metronidazole. In a double-blind protocol, their wounds were evaluated daily
until discharge from the hospital and in the clinic after two weeks; wounds
containing culture-positive pus were considered infected. The patients'
surgeons remained unaware of the patients' group assignments.
Results. The mean (±SD) final intraoperative core temperature was
34.7±0.6°C in the hypothermia group and 36.6±0.5°C in the normothermia
group (P<0.001). Surgical-wound infections were found in 18 of 96 patients
assigned to hypothermia (19 percent) but in only 6 of 104 patients assigned to
normothermia (6 percent, P=0.009). The sutures were removed one day later
in the patients assigned to hypothermia than in those assigned to
normothermia (P = 0.002), and the duration of hospitalization was prolonged by
2.6 days (approximately 20 percent) in the hypothermia group (P=0.01).
Conclusions. Hypothermia itself may delay healing and predispose patients
to wound infections. Maintaining normothermia intraoperatively is likely to
decrease the incidence of infectious complications in patients undergoing
colorectal resection and to shorten their hospitalizations.
(N Engl J Med 1996;334:1209-15. May 9)
1 From the Thermoregulation Research Laboratory and the Department of Anesthesia, University of
California, San Francisco (A.K., D.I.S.); and the Departments of Anesthesiology and General Intensive Care,
University of Vienna, Vienna, Austria (A.K., D.I.S., R.L.). Address reprint requests to Dr. Sessler at the
Department of Anesthesia, 374 Parnassus Ave., 3rd Fl., University of California, San Francisco, CA 94143-0648.
Supported in part by grants (GM49670 and GM27345) from the National Institutes of Health, by the
Joseph Drown and Max Kade Foundations, and by Augustine Medical, Inc. The authors do not consult for,
accept honorariums from, or own stock or stock options in any company whose products are related to the
subject of this research.
Presented in part at the International Symposium on the Pharmacology of Thermoregulation, Giessen,
Germany, August 17-22, 1994, and at the Annual Meeting of the American Society of Anesthesiologists,
Atlanta, October 21-25, 1995.
WOUND infections are common and serious complications of anesthesia
and surgery. A wound infection can prolong hospitalization by 5 to 20
days and substantially increase medical costs.[1,2] In patients undergoing
colon surgery, the risk of such an infection ranges from 3 to 22 percent,
depending on such factors as the length of surgery and underlying medical
problems.[3] Mild perioperative hypothermia (approximately 2°C below
the normal core body temperature) is common in colon surgery.[4] It
results from anesthetic-induced impairment of thermoregulation,[5,6]
exposure to cold, and altered distribution of body heat.[7] Although it is
rarely desired, intraoperative hypothermia is usual because few patients are
actively warmed.[8]
Hypothermia may increase patients' susceptibility to perioperative wound
infections by causing vasoconstriction and impaired immunity. The
presence of sufficient intraoperative hypothermia triggers
thermoregulatory vasoconstriction,[9] and postoperative vasoconstriction is
universal in patients with hypothermia.[10] Vasoconstriction decreases the
partial pressure of oxygen in tissues, which lowers resistance to infection
in animals [11,12] and humans (unpublished data). There is decreased
microbial killing, partly because the production of oxygen and nitroso free
radicals is oxygen-dependent in the range of the partial pressures of
oxygen in wounds. [13,14] Mild core hypothermia can also directly impair
immune functions, such as the chemotaxis and phagocytosis of
granulocytes, the motility of macrophages, and the production of
antibody.[15,16] Mild hypothermia, by decreasing the availability of tissue
oxygen, impairs oxidative killing by neutrophils. And mild hypothermia
during anesthesia lowers resistance to inoculations with Escherichia coli
[17] and Staphylococcus aureus [18] in guinea pigs.
Vasoconstriction-induced tissue hypoxia may decrease the strength of the
healing wound independently of its ability to reduce resistance to infection.
The formation of scar requires the hydroxylation of abundant proline and
lysine residues to form the cross-links between strands of collagen that
give healing wounds their tensile strength.[19] The hydroxylases that
catalyze this reaction are dependent on oxygen tension,[20] making
collagen deposition proportional to the partial pressure of arterial oxygen
in animals [21] and to oxygen tension in wound tissue in humans. [22]
Although safe and inexpensive methods of warming are available,[8]
perioperative hypothermia remains common.[23] Accordingly, we tested
the hypothesis that mild core hypothermia increases both the incidence of
surgical-wound infection and the length of hospitalization in patients
undergoing colorectal surgery.
METHODS
With the approval of the institutional review board at each participating
institution and written informed consent from the patients, we studied
patients 18 to 80 years of age who underwent elective colorectal resection
for cancer or inflammatory bowel disease. Patients scheduled for
abdominal-peritoneal pull-through procedures were included, but not those
scheduled for minor colon surgery (e.g., polypectomy or colostomy
performed as the only procedure). The criteria for exclusion from the
study were any use of corticosteroids or other immunosuppressive drugs
(including cancer chemotherapy) during the four weeks before surgery; a
recent history of fever, infection, or both; serious malnutrition (serum
albumin, less than 3.3 g per deciliter, a white-cell count below 2500 cells
per milliliter, or the loss of more than 20 percent of body weight); or
bowel obstruction.
The number of patients required for this trial was estimated on the
basis of a preliminary study in which 80 patients undergoing elective
colon surgery were randomly assigned to hypothermia (mean [±SD]
temperature, 34.4±0.4°C) or normothermia (involving warming with
forced air and fluid to a mean temperature of 37±0.3°C). The number of
wound infections (as defined by the presence of pus and a positive
culture) was evaluated by an observer unaware of the patients'
temperatures and group assignments. Nine infections occurred in the 38
patients assigned to hypothermia, but there were only four in the 42
patients assigned to normothermia (P=0.16). Using the observed
difference in the incidence of infection, we determined that an enrollment
of 400 patients would provide a 90 percent chance of identifying a
difference with an alpha value of 0.01.[2] We therefore planned to study a
maximum of 400 patients, with the results to be evaluated after 200 and
300 patients had been studied. The prospective criterion for ending the
study early was a difference in the incidence of surgical-wound infection
between the two groups with a P value of less than 0.01. To compensate
for the two initial analyses, a P value of 0.03 would be required when the
study of 400 patients was completed. The combined risk of a type I error
was thus less than 5 percent [24]
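[The 400-patient figure can be checked with the usual normal-approximation formula for comparing two proportions -- a sketch, not necessarily the authors' exact method: with the pilot rates 9/38 and 4/42, two-sided alpha = 0.01 and 90 percent power, the formula gives roughly 200 patients per group, i.e. about 400 in all.]

from statistics import NormalDist

def n_per_group(p1, p2, alpha=0.01, power=0.90):
    """Normal-approximation sample size per group for comparing two proportions."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_a + z_b) ** 2 * var / (p1 - p2) ** 2

print(n_per_group(9 / 38, 4 / 42))   # roughly 200 per group, i.e. about 400 patients in total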
Study Protocol
The night before surgery, each patient underwent a standard mechanical
bowel preparation with an electrolyte solution. Intraluminal antibiotics
were not used, but treatment with cefamandole (2 g intravenously every
2 italics by jh.... it does not make a lot of sense to use the difference one
observes in a pilot study [or, for that matter, what is in the literature] as a
guide to what would be important [the clinically important difference]
eight hours) and metronidazole (500 mg intravenously every eight hours)
was started during the induction of anesthesia; this treatment was
maintained for about four days postoperatively. Anesthesia was induced
with thiopental sodium (3 to 5 mg per kilogram of body weight), fentanyl
(1 to 3 µg per kilogram), and vecuronium bromide (0.1 mg per kilogram).
The administration of isoflurane (in 60 percent nitrous oxide) was titrated
to maintain the mean arterial blood pressure within 20 percent of the
preinduction values. Additional fentanyl was administered on the
completion of surgery, to improve analgesia when the patient emerged
from anesthesia.
At the time of the induction of anesthesia, each patient was randomly
assigned to one of the following two temperature-management groups with
computer-generated codes maintained in numbered, sealed, opaque
envelopes: the normothermia group, in which the patients' core
temperatures were maintained near 36.5°C, and the hypothermia group, in
which the core temperature was allowed to decrease to approximately
34.5°C. In both groups, intravenous fluids were administered through a
fluid warmer, but the warmer was activated only in the patients assigned to
extra warming. Similarly, a forced-air cover (Augustine Medical, Eden
Prairie, Minn.) was positioned over the upper body of every patient, but it
was set to deliver air at the ambient temperature in the hypothermia group
and at 40°C in the normothermia group. Cardboard shields and sterile
drapes were positioned in such a way that the surgeons could not discern
the temperature of the gas inflating the cover. Shields were also positioned
over the switches governing the fluid heater and the forced-air warmer so
that their settings were not apparent to the operating-room personnel. The
temperatures were not controlled postoperatively, and the patients were not
informed of their group assignments.

The patients were hydrated aggressively during and after surgery, because
hypovolemia decreases wound perfusion and increases the incidence of
infection.[25,26] We administered 15 ml of crystalloid per kilogram per
hour throughout surgery and replaced the volume of blood lost with either
crystalloid in a 4:1 ratio or colloid in a 2:1 ratio. Fluids were administered
intravenously at rates of 3.5 ml per kilogram per hour for the first 24
postoperative hours and 2 ml per kilogram per hour for the subsequent 24
hours. Leukocyte-depleted blood was administered as the attending
surgeon considered appropriate.

Supplemental oxygen was administered through nasal prongs at a rate of 6
liters per minute during the first three postoperative hours and was then
gradually eliminated while oxygen saturation was maintained at more than
95 percent. To minimize the decrease in wound perfusion due to activation
of the sympathetic nervous system, postoperative pain was treated with
piritramide (an opioid), the administration of which was controlled by the
patient.
The attending surgeons, who were unaware of the patients' group
assignments and core temperatures, determined when to begin feeding
them again after surgery, remove their sutures, and discharge them from
the hospital. The timing of discharge was based on routine surgical
considerations, including the return of bowel function, the control of any
infections, and adequate healing of the incision.
Measurements
The patients' morphometric characteristics and smoking history were
recorded. The preoperative laboratory evaluation included a complete
blood count; determinations of the prothrombin time and
partial-thromboplastin time; measurements of serum albumin, total protein,
and creatinine; and liver-function tests. The risk of infection was scored
with a standardized algorithm taken from the Study on the Efficacy of
Nosocomial Infection Control (SENIC) of the Centers for Disease Control
and Prevention; in this scoring system, one point each is assigned for the
presence of three or more diagnoses, surgery lasting two hours or more,
surgery at an abdominal site, and the presence of a contaminated or
infected wound.[2] The scoring system was modified slightly from its
original form by the use of the diagnoses made at admission, rather than
discharge. The risk of infection was quantified further with the use of the
National Nosocomial Infection Surveillance System (NNISS), a scoring
system in which the patient's risk of infection was predicted on the basis of
the type of surgery, the patient's physical-status rating on a scale developed
by the American Society of Anesthesiologists, and the duration of surgery
[3]
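[Reading aid: the modified SENIC index described here is simply a count of four yes/no items. A hypothetical helper, not part of the study's software:]

def senic_score(three_or_more_diagnoses, surgery_two_hours_or_more,
                abdominal_site, contaminated_or_infected_wound):
    """Modified SENIC risk index: one point per risk factor present (0-4)."""
    return sum([three_or_more_diagnoses, surgery_two_hours_or_more,
                abdominal_site, contaminated_or_infected_wound])

# e.g. a 3-hour abdominal operation, clean wound, two admission diagnoses:
print(senic_score(False, True, True, False))   # -> 2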
Core temperatures were measured at the tympanic membrane
(Mallinckrodt Anesthesiology Products, St. Louis), with values recorded
preoperatively, at 10-minute intervals intraoperatively, and at 20-minute
intervals for 6 hours during recovery. Arteriovenous-shunt flow was
quantified by subtracting the skin temperature of the fingertip from that of
the forearm, with values exceeding 0°C indicating thermoregulatory
vasoconstriction.[27] End-tidal concentrations of isoflurane and carbon
dioxide were recorded at 10-minute intervals during anesthesia.
Measurements of arterial blood pressure and heart rate were recorded
similarly during anesthesia and for six hours thereafter. Oxyhemoglobin
saturation was measured by pulse oximetry.
Thermal comfort was evaluated at 20-minute intervals for 6 hours
postoperatively with a 100-mm visual-analogue scale on which 0 mm
denoted intense cold, 50 mm denoted thermal comfort, and 100 mm
denoted intense warmth. The degree of surgical pain was evaluated
similarly, except that 0 mm denoted no pain and 100 mm the most intense
pain imaginable. Shivering was assessed qualitatively, on a scale on which
0 denoted no shivering; 1, mild or intermittent shivering, 2, moderate
shivering, and 3, continuous, intense shivering. All the qualitative
assessments were made by observers unaware of the patients' group
assignments and core temperatures.
The patients' surgical wounds were evaluated daily during hospitalization
and again two weeks after surgery by a physician who was unaware of the
group assignments. Wounds were suspected of being infected when pus
could be expressed from the surgical incision or aspirated from a loculated
mass inside the wound. Samples of pus were obtained and cultured for
aerobic and anaerobic bacteria, and wounds were considered infected when
the culture was positive for pathogenic bacteria. All the wound infections
diagnosed within 15 days of surgery were included in the data analysis.
Wound healing and infections were also evaluated by the ASEPSIS
system,[28] in which a score is calculated as the weighted sum of points
assigned to the following factors: the duration of antibiotic administration,
the drainage of pus during local anesthesia, the debridement of the wound
during general anesthesia, the presence of a serous discharge, the presence
of erythema, the presence of a purulent exudate, the separation of deep
tissues, the isolation of bacteria from fluid discharged from the wound,
and a duration of hospitalization exceeding 14 days. Scores exceeding 20
on this scale indicate wound infection. As an additional indicator of
infection, preoperative differential white-cell counts were compared with
counts obtained on postoperative days 1, 3, and 6.
Collagen deposition in the wound was evaluated in a subgroup of 30
patients in the normothermia group and 24 patients in the hypothermia
group. A 10-cm expanded polytetrafluoroethylene tube (Impra,
International Polymer Engineering, Tempe, Ariz.) was inserted
subcutaneously several centimeters lateral to the incision at the completion
of surgery. On the seventh postoperative day, the tube was removed and
assayed for hydroxyproline, a measure of collagen deposition.[29] The
ingrowth of collagen in such tubes is proportional to the tensile strength of
the healing wound[29] and the subcutaneous oxygen tension.[22]
Statistical Analysis
Outcomes were evaluated on an intention-to-treat basis. The number of
postoperative wound infections in each study group and the proportion of
smokers among the infected patients were analyzed by Fisher's exact test.
Scores for wound healing, the number of days of hospitalization, the extent
of collagen deposition, postoperative core temperatures, and potential
confounding factors were evaluated by unpaired, two-tailed t-tests. Factors
that potentially contributed to infection were included in a univariate
analysis. Those that correlated significantly with infection were then
included in a multivariate logistic regression with backward elimination; a
P value of less than 0.25 was required for a factor to be retained in the
analysis.
All the results are presented as means ±SD. A P value of less than 0.01
was required to indicate a significant difference in our major outcomes (the
incidence of infection and the duration of hospitalization); a P value of less
than 0.005 was considered to indicate a significant difference in
postoperative temperature (to compensate for multiple comparisons); for
all other data, a P value of less than 0.05 was considered to indicate a
statistically significant difference.
RESULTS
Patients were enrolled in the study from July 1993 through March 1995;
155 were evaluated at the University of Vienna, 30 at the University of
Graz, and 15 at Rudolfstiftung Hospital. According to the investigational
protocol, the study was stopped after 200 patients were enrolled, because
the incidence of surgical-wound infection in the two study groups differed
with an alpha level of less than 0.01. One hundred four patients were
assigned to the normothermia group, and 96 to the hypothermia group. An
audit confirmed that the patients had been properly assigned to the groups
and that the slight disparity in numbers was present in the original
computer-generated randomization codes. All the patients allowed their
wounds to be evaluated daily during hospitalization. Ninety-four percent
returned for the two-week clinic visit after discharge; those who did not
were evenly distributed between the study groups and mostly returned to
visit the private offices of their attending surgeons. The wound status of
these patients was determined by calling the physician. No previously
unidentified wound infections were detected in the clinic for the first time.
Table I shows that the characteristics, diagnoses, types of surgical
procedure, duration of surgery, hemodynamic values, and types of
anesthesia of the patients in the two study groups were similar. Nor did
smoking status, the results of preoperative laboratory tests, or preoperative
laboratory values differ significantly between the groups. The patients
assigned to hypothermia required more transfusions of allogeneic blood
(P=0.01). Intraoperative vasoconstriction was observed in 74 percent of
the patients assigned to hypothermia but in only 6 percent of those
assigned to normothermia (P<0.001). Core temperatures at the end of
surgery were significantly lower in the hypothermia group than in the
normothermia group (34.7±0.6 vs. 36.6±0.5°C, P<0.001), and they
remained significantly different for more than five hours postoperatively
(Fig. 1).
Postoperative vasoconstriction was observed in 78 percent of the patients
in the hypothermia group; the vasoconstriction continued throughout the
six-hour recovery period. In contrast, vasoconstriction, usually short-lived,
was observed in only 22 percent of the patients in the normothermia group
(P<0.001). Shivering was observed in 59 percent of the hypothermia
group, but in only a few patients in the normothermia group. Thermal
comfort was significantly greater in the normothermia group than in the
hypothermia group (score on the visual- analogue scale one hour after
surgery, 73 ± 14 vs. 35 ± 17 mm). The difference in thermal comfort
remained statistically significant for three hours. Pain scores and the
amount of opioid administered were virtually identical in the two groups at
every postoperative measurement; hemodynamic values were also similar.
Table 1. Characteristics of the Patients in the Two Study Groups.*

CHARACTERISTIC                              Normothermia    Hypothermia    P Value
                                            (N = 104)       (N = 96)
Male sex (no. of patients)                  58              50             0.70
Weight (kg)                                 73±14           71±14          0.31
Height (cm)                                 170±9           169±9          0.43
Age (yr)                                    61±15           59±14          0.33
History of smoking (no.)                    33              29             0.94
Diagnosis (no. of patients)                                                0.94
  Inflammatory bowel disease                10              8
  Cancer                                    94              88
Duke's stage                                                               1.0
  A                                         29              30
  B                                         37              34
  C                                         26              21
  D                                         2               3
Operative site                                                             0.61
  Colon                                     59              51
  Rectum                                    35              37
Preoperative variables
  Core temperature (°C)                     36.8±0.4        36.7±0.4       0.08
  Hemoglobin (g/dl)                         12.6±2.3        12.7±2.0       0.74
Intraoperative variables
  Fentanyl administered (mg)                0.7±0.3         0.6±0.5        0.09
  End-tidal isoflurane (%)                  0.6±0.1         0.6±0.2        1.0
  Arterial blood pressure (mm Hg)           91±17           95±18          0.11
  Heart rate (beats/min)                    74±17           76±13          0.35
  Crystalloid (liters)                      3.3±1.5         3.2±0.9        0.57
  Colloid (liters)                          0.2±0.3         0.2±0.3        1.0
  Red-cell transfusion (no. of patients)    23              34             0.054
  Volume of blood transfused (units)        0.4±1.0         0.8±1.2        0.01
  Urine output (liters)                     0.6±0.4         0.7±0.4        0.08
  Duration of surgery (hr)                  3.1±1.0         3.1±0.9        1.0
  Ambient temperature (°C)                  21.9±1.2        22.1±0.9       0.19
  Oxyhemoglobin saturation (%)              97.3±1.5        97.5±1.3       0.32
  Final core temperature (°C)               36.6±0.5        34.7±0.6       <0.001
Postoperative variables
  Hemoglobin (g/dl)                         11.7±1.9        11.6±1.4       0.67
  Prophylactic antibiotics (days)           3.7±1.9         3.6±1.4        0.67
SENIC score (no. of patients)                                              0.98
  1                                         3               3
  2                                         95              88
  3                                         6               5
NNISS score (no. of patients)                                              0.60
  0                                         32              31
  1                                         49              39
  3                                         23              26
Infection rate predicted by NNISS (%)       8.9             8.8            —
Oxyhemoglobin saturation (%)                98±1            98±1           1.0
Piritramide (mg)†                           20±13           22±12          0.26

* Plus-minus values are means ± SD. SENIC denotes Study on the Efficacy of Nosocomial
Infection Control, and NNISS National Nosocomial Infection Surveillance System.
† The administration of this analgesic agent was controlled by the patient.
The overall incidence of surgical-wound infection was 12 percent.
Although the SENIC and NNISS scores for the risk of infection were
similar in the two groups, there were only 6 surgical-wound infections in
the normothermia group, as compared with 18 in the hypothermia group
(P = 0.009) (Table 2). Most positive cultures contained several different
organisms; the major ones were E. coli (11 cultures), S. aureus (7),
pseudomonas (4), enterobacter (3), and candida (3). Culture-negative pus
was expressed from the wounds of two patients assigned to hypothermia
and one patient assigned to normothermia. The ASEPSIS scores were
higher in the hypothermia group than in the normothermia group (13±16
vs. 7±10, P=0.002) (Table 2); these scores exceeded 20 in 32 percent of
the former but only 6 percent of the latter (P<0.001).
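[The headline comparison is easy to reproduce from the reported counts -- a sketch assuming the scipy library is available: Fisher's exact test on 18/96 infections versus 6/104 should give a two-sided P value close to the reported 0.009.]

from scipy.stats import fisher_exact

# infected / not infected in each group, as reported above
table = [[18, 96 - 18],    # hypothermia group
         [6, 104 - 6]]     # normothermia group
result = fisher_exact(table, alternative="two-sided")
print(result)   # odds ratio and two-sided P value; P should be close to the reported 0.009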
In a univariate analysis, tobacco use, group assignment, surgical site,
NNISS score, SENIC score, need for transfusion, and age were all
correlated with the risk of infection. In a multivariate backward-elimination
analysis, tobacco use, group assignment, surgical site, NNISS score, and
age remained risk factors for infection (Table 3).
Four patients in the normothermia group and seven in the hypothermia
group required admission to the intensive care unit (P=0.47), mainly
because of wound dehiscence, colon perforation, and peritonitis. Two
patients in each group died during the month after surgery. The incidence
of infection was similar at each study hospital, and no one surgeon was
associated with a disproportionate number of infections.
Table 2 shows that significantly more collagen was deposited near the
wound in the patients in the normothermia group than in the patients in the
hypothermia group (328±135 vs. 254±114 µg per centimeter). The
patients assigned to hypothermia were first able to tolerate solid food one
day later than those assigned to normothermia (P=0.006); similarly, the
sutures were removed one day later in the patients assigned to hypothermia
(P = 0.002). The duration of hospitalization was 12.1±4.4 days in the
normothermia group and 14.7±6.5 days in the hypothermia group
(P=0.001). This difference was statistically significant even when the
analysis was limited to the uninfected patients. In the normothermia group,
the duration of hospitalization was 11.8±4.1 days in patients without
infection and 17.3±7.3 days in patients with infection (P=0.003). In the
hypothermia group the duration of hospitalization was 13.5±4.5 days in
patients without infection and 20.7±11.6 days in patients with infection
(P<0.001).
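[The hospitalization comparison can likewise be checked from the summary statistics alone. A sketch using the unequal-variance (Welch) t statistic; the paper specifies only "unpaired, two-tailed t-tests", so the exact variant is an assumption.]

from math import sqrt

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic from summary statistics (unequal variances)."""
    se = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean2 - mean1) / se

# duration of hospitalization: 12.1 +/- 4.4 days (n=104) vs 14.7 +/- 6.5 days (n=96)
print(round(welch_t(12.1, 4.4, 104, 14.7, 6.5, 96), 2))   # about 3.3, consistent with P near 0.001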
Fig 1. Core Temperatures during and after Colorectal Surgery in the Study Patients.
[Plot of core temperature (°C) against time, intraoperatively and for 6 postoperative hours.]
The mean (±SD) final intraoperative core temperature was 34.7±0.6°C in the 96
patients assigned to hypothermia, who received routine thermal care, and
36.6±0.5°C in the 104 patients assigned to normothermia, who were given extra
warming. The core temperatures in the two groups differed significantly at each
measurement, except before the induction of anesthesia (first measurement) and
after six hours of recovery.

Table 2. Postoperative Findings in the Two Study Groups.*

VARIABLE                            Normothermia    Hypothermia    P Value
                                    (N = 104)       (N = 96)
All patients
  Infection - no. of patients (%)   6 (6)           18 (19)        0.009
  ASEPSIS score                     7±10            13±16          0.002
  Collagen deposition - µg/cm       328±135         254±114        0.04
  Days to first solid food          5.6±2.5         6.5±2.0        0.006
  Days to suture removal            9.8±2.9         10.9±1.9       0.002
  Days of hospitalization           12.1±4.4        14.7±6.5       0.001
Uninfected patients
  Number of patients                98              78
  Days to first solid food          5.2±1.6         6.1±1.6        <0.001
  Days to suture removal            9.6±2.6         10.6±1.6       0.003
  Days of hospitalization           11.8±4.1        13.5±4.5       0.01

* Plus-minus values are means ± SD.

The postoperative hemoglobin concentrations did not differ significantly
between the two groups (Table 1). On the first postoperative day,
leukocytosis was impaired in the hypothermia group as compared with the
normothermia group (white-cell count, 11,500±3500 vs. 13,400±2500
cells per cubic millimeter; P<0.001). On the third postoperative day,
however, white-cell counts were significantly higher in the hypothermia
group (10,100±3900 vs. 8900±2900 cells per cubic millimeter). The
difference in values on the third day was not statistically significant when
only uninfected patients were included in the analysis. By the sixth
postoperative day, the white-cell counts were similar in the two groups.
Table 3. Multivariate Analysis of Risk Factors for Surgical-Wound Infection.

RISK FACTOR                                          ODDS RATIO (95% CI)
Tobacco use (yes vs. no)                             10.5 (3.2-34.1)
Group assignment (hypothermia vs. normothermia)      4.9 (1.7-14.5)
Surgical site (rectum vs. colon)                     2.7 (0.9-7.6)
NNISS score (per unit increase)*                     2.5 (1.2-5.3)
Age (per decade)                                     1.6 (1.0-2.4)
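[For readers reconstructing Table 3: each odds ratio is exp(coefficient) from the logistic model, and the 95 percent CI is exp(coefficient ± 1.96 × SE). A minimal sketch with hypothetical numbers, not the authors' fitted model:]

import math

def or_and_ci(beta, se):
    """Odds ratio and 95% CI from a logistic-regression coefficient and its SE."""
    return (math.exp(beta),
            math.exp(beta - 1.96 * se),
            math.exp(beta + 1.96 * se))

# e.g. a hypothetical coefficient of 1.59 with SE 0.55:
print(tuple(round(v, 1) for v in or_and_ci(1.59, 0.55)))   # roughly (4.9, 1.7, 14.4)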
Table 4. Postoperative Findings in the Study Patients According to Smoking Status.*

CHARACTERISTIC                     Smokers         Nonsmokers      P Value
                                   (N = 62)        (N = 138)
Infection - no. of patients (%)    14 (23)         10 (7)          0.004
ASEPSIS score                      15±18           8±10            <0.001
Days to suture removal             10.9±3.5        10.1±2.0        0.04
Days of hospitalization            14.9±6.7        12.9±5.0        0.02
SENIC score (no. of patients)                                      0.25
  1                                0               6
  2                                58              125
  3                                4               7
NNISS score (no. of patients)                                      0.08
  0                                23              40
  1                                30              58
  3                                9               40

* Plus-minus values are means ± SD. SENIC denotes Study on the Efficacy of Nosocomial
Infection Control, and NNISS National Nosocomial Infection Surveillance System.
Among smokers, the number of cigarettes smoked per day was similar in
the two groups (22±20 in the hypothermia group vs. 22±14 in the
normothermia group). The morphometric characteristics, anesthetic care,
and SENIC and NNISS scores of smokers and nonsmokers were not
significantly different. Nonetheless, the proportion of patients with wound
infection was significantly higher among smokers (23 percent, or 14 of
62) than among nonsmokers (7 percent, or 10 of 138; P=0.004).
Furthermore, the length of hospitalization was significantly greater among
smokers (14.9±6.7 days, vs. 12.9±5.0 days among nonsmokers; P=0.02)
(Table 4).
DISCUSSION
The initial hours after bacterial contamination are a decisive period for the
establishment of infection.[25] In surgical patients, perioperative factors
can contribute to surgical-wound infections, but the infection itself is
usually not manifest until days later.
In our study, forced-air warming combined with fluid warming maintained
normothermia in the treated patients, whereas the unwarmed patients had
core temperatures approximately 2°C below normal.[8] Perioperative
hypothermia persisted for more than four hours and thus included the
decisive period for establishing an infection.[25,30] The patients with mild
perioperative hypothermia had three times as many culture-positive
surgical-wound infections as the normothermic patients. Moreover, the
ASEPSIS scores showed that in the patients assigned to hypothermia the
reduction in resistance to infection was twice that in the normothermia
group.
The types of bacteria cultured from our patients' surgical wounds were
similar to those reported previously.[2,3] These organisms are susceptible
to oxidative killing, which is consistent with our hypothesis that
hypothermia inhibits the oxidative killing of bacteria.[31] The overall
incidence of infection in our study was approximately 35 percent higher
than in previous reports.[3] One explanation for this relatively high
incidence is that we considered all wounds draining pus that yielded a
positive culture to be infected, although some may have been of minor
clinical importance. The hospitalizations of infected patients were one
week longer than those of patients without surgical-wound infections,
however, indicating that most infections were substantial. Similar
prolongation of hospitalization has been reported previously.[1,2]
It is interesting to note that hospitalization was also prolonged (by about
two days) in the uninfected patients in the hypothermia group (Table 2). A
number of factors influenced the decision to discharge patients, but healing
of the incision (formation of a "healing ridge," for example) was among
the most important. As is consistent with a delay in clinical healing, sutures
were removed significantly later and the deposition of collagen (an index
of scar formation and the strength of the healing wound) was significantly
less in the hypothermia group than in the normothermia group. That the
patients assigned to hypothermia required significantly more time before
they could tolerate solid food is also consistent with impaired healing.
In Austria's medical system, administrative factors and costs of
hospitalization do not influence the length of stay in the hospital. No data
on individual costs are tabulated by the participating hospitals, and they are
therefore not available for our patients. Nonetheless, the cost of a
prolonged hospitalization must exceed the cost of fluid and forced-air
warming (approximately $30 in the United States). In a managed-care
situation, the duration of hospitalization might have differed less, or not at
all. However, our data suggest that patients kept at normal temperatures
during surgery would be better prepared for discharge at a fixed time than
those allowed to become hypothermic.
Among all 200 patients in our study, those who smoked had three times
more surgical-wound infections and significantly longer hospitalizations
than the nonsmokers. Similar data have been reported previously.[32]
Numerous factors contributed to these results; one may have been that
smoking markedly lowers oxygen tension in tissue for nearly an hour after
each cigarette.[33] (Thermoregulatory vasoconstriction produces a similar
reduction.[34]) The distribution of factors known to influence infection was
similar between smokers and nonsmokers, but the smokers may have had
other behavioral or physiologic factors predisposing them to infection.
The prevalence of smoking was similar in the two study groups. Other
factors may have influenced the patients' susceptibility to wound
infections, such as arterial hypoxemia, hypovolemia, the concentration of
the anesthetic used, and vasoconstriction resulting from pain-induced
stress. [25,26,35,36]. However, the administration of oxygen,
oxyhemoglobin saturation, fluid balance, hemodynamic responses,
end-tidal concentrations of anesthetic, pain scores, and quantities of opioid
administered were all similar between the two groups. These factors are
therefore not likely to have confounded our results. It is also unlikely that
exaggerated bacterial growth aggravated the infections in the hypothermia
group, because small reductions in temperature actually decrease growth in
vitro.[37]
Mild hypothermia can increase blood loss and the need for transfusion
during surgery.[38] In vitro studies suggest that perioperative hypothermia
may aggravate surgical bleeding by impairing the function of platelets and
the activity of clotting factors.[39,40] Blood transfusions may increase
susceptibility to surgical-wound infections by impairing immune function.[41]
Our patients assigned to hypothermia required significantly more
allogeneic blood to maintain postoperative hemoglobin concentrations than
did the patients assigned to normothermia. However, we administered only
leukocyte-depleted blood, and multivariate regression analysis indicated
that a requirement for transfusion did not independently contribute to the
incidence of wound infection. It is thus unlikely that the differences in the
incidence of infection in the two groups we studied resulted from
transfusion-mediated immunosuppression.
In summary, this double-blind, randomized study indicates that
intraoperative core temperatures approximately 2°C below normal triple the
incidence of wound infection and prolong hospitalization by about 20
percent. Maintaining intraoperative normothermia is thus likely to decrease
infectious complications and shorten hospitalization in patients undergoing
colorectal surgery.
We are indebted to Heinz Scheuenstahl for the collagen-deposition analysis; to Helene Ortmann, M.D., Andrea Hubacek, M.D.,
Michael Zimpfer, M.D., and Gerhard Pavecic for their generous assistance; and to Mallinckrodt Anesthesiology Products, Inc., for the
donation of thermometers and thermocouples.
APPENDIX
The following investigators also participated in this study: patient safety and data auditing: H.W. Hopf and T.K. Hunt (University of
California, San Francisco); site directors: G. Polak (Hospital Rudolfstiftung, Vienna, Austria) and W. Kroll (University of Graz, Graz,
Austria); patient care: E Lackner and R. Fuegger (University of Vienna); data acquisition: E. Narzt (University of Vienna), C. Wol~b
(University of Vienna), E. Marker (University of Vienna), A. Bekar (Orthopedic Hospital, Speising, Vienna), H. Kaloud (University of
Graz), U. Stratil (Hospital Rudolfstiftung), and R. Csepan (University of Vienna); wound evaluation: V. Goll (University of Vienna),
G.S. Bayer (University of Vienna), and P. Steindorfer (University of Graz); and data management: B. Petschnigg (University of
Vienna).
REFERENCES

1. Bremmelgaard A, ... Computer-aided surveillance of surgical infections and identification of risk factors. J Hosp Infect 1989;13:1-18.
2. Haley RW, ... Identifying patients at high risk of surgical wound infection: a simple multivariate index of patient susceptibility and wound contamination. Am J Epidemiol 1985;121:206-15.
3. Culver DH, ... Surgical wound infection rates by wound class, operative procedure, and patient risk index. Am J Med 1991;91:152S-157S.
4. Frank SM, ... Epidural versus general anesthesia, ambient operating room temperature, and patient age as predictors of inadvertent hypothermia. Anesthesiology 1992;77:252-7.
5. Matsukawa T, ... Propofol linearly reduces the vasoconstriction and shivering thresholds. Anesthesiology 1995;82:1169-80.
6. Annadata RS, ... Desflurane slightly increases the sweating threshold but produces marked, nonlinear decreases in the vasoconstriction and shivering thresholds. Anesthesiology 1995;83:1205-11.
7. Matsukawa T, ... Heat flow and distribution during induction of general anesthesia. Anesthesiology 1995;82:662-73.
8. Kurz A, ... Forced-air warming maintains intraoperative normothermia better than circulating-water mattresses. Anesth Analg 1993;77:89-95.
9. Ozaki M, ... Nitrous oxide decreases the threshold for vasoconstriction less than sevoflurane or isoflurane. Anesth Analg 1995;80:1212-6.
10. Sessler DI, ... Physiologic responses to mild perianesthetic hypothermia in humans. Anesthesiology 1991;75:594-610.
11. Chang N, ... Comparison of the effect of bacterial inoculation in musculocutaneous and random-pattern flaps. Plast Reconstr Surg 1982;70:1-10.
12. Jonsson K, ... Oxygen as an isolated variable influences resistance to infection. Ann Surg 1988;208:783-7.
13. Hohn DC, ... The effect of O2 tension on microbicidal function of leucocytes in wound and in vitro. Surg Forum 1976;27:18-20.
14. Mader JT, ... A mechanism for the amelioration by hyperbaric oxygen of experimental staphylococcal osteomyelitis in rabbits. J Infect Dis 1980;142:915-22.
15. van Oss CJ, ... Effect of temperature on the chemotaxis, phagocytic engulfment, digestion and O2 consumption of human polymorphonuclear leucocytes. J Reticuloendothelial Soc 1980;27:561-5.
16. Leijh PC, ... Kinetics of phagocytosis of Staphylococcus aureus and Escherichia coli by human granulocytes. Immunology 1979;37:453-65.
17. Sheffield CW, ... Mild hypothermia during isoflurane anesthesia decreases resistance to E. coli dermal infection in guinea pigs. Acta Anaesthesiol Scand 1994;38:201-5.
18. Sheffield CW, ... Mild hypothermia during halothane-induced anesthesia decreases resistance to Staphylococcus aureus dermal infection in guinea pigs. Wound Repair Regeneration 1994;2:48-56.
19. Prockop DJ, ... The biosynthesis of collagen and its disorders. N Engl J Med 1979;301:13-23.
20. De Jong L, ... Stoicheiometry and kinetics of the prolyl 4-hydroxylase partial reaction. Biochim Biophys Acta 1984;787:105-11.
21. Hunt TK, ... The effect of varying ambient oxygen tensions on wound metabolism and collagen synthesis. Surg Gynecol Obstet 1972;135:561-7.
22. Jonsson K, ... Tissue oxygenation, anemia and perfusion in relation to wound healing in surgical patients. Ann Surg 1991;214:605-13.
23. Frank SM, ... The catecholamine, cortisol, and hemodynamic responses to mild perioperative hypothermia: a randomized clinical trial. Anesthesiology 1995;82:83-93.
24. Fleming TR, ... Designs for group sequential tests. Control Clin Trials 1984;5:348-61.
25. Miles AA, ... The value and duration of defense reactions of the skin to the primary lodgement of bacteria. Br J Exp Pathol 1957;38:79-96.
26. Jonsson K, ... Assessment of perfusion in postoperative patients using tissue oxygen measurements. Br J Surg 1987;74:263-7.
27. Rubinstein EH, ... Skin-surface temperature gradients correlate with fingertip blood flow in humans. Anesthesiology 1990;73:541-5.
28. Byrne DJ, ... Postoperative wound scoring. Biomed Pharmacother 1989;43:669-73.
29. Rabkin JM, ... Wound healing assessment by radioisotope incubation of tissue samples in PTFE tubes. Surg Forum 1986;37:592-4.
30. Classen DC, ... The timing of prophylactic administration of antibiotics and the risk of surgical wound infection. N Engl J Med 1992;326:281-6.
31. Babior BM, ... Chronic granulomatous diseases. Semin Hematol 1990;27:247-59.
32. Stopinski J, ... L'abus de nicotine et d'alcool ont-ils une influence sur l'apparition des infections bactériennes post opératoires? [Do nicotine and alcohol abuse influence the occurrence of postoperative bacterial infections?] J Chir 1993;130:422-5.
33. Jensen JA, ... Cigarette smoking decreases tissue oxygen. Arch Surg 1991;126:1131-4.
34. Sheffield CW, ... Effect of thermoregulatory responses on subcutaneous oxygen tension. Wound Repair Regeneration (in press).
35. Knighton DR, ... Oxygen as an antibiotic: the effect of inspired oxygen on infection. Arch Surg 1984;119:199-204.
36. Moudgil GC, ... Influence of anaesthesia and surgery on neutrophil chemotaxis. Can Anaesth Soc J 1981;28:232-8.
37. Mackowiak PA. Direct effects of hyperthermia on pathogenic microorganisms: teleologic implications with regard to fever. Rev Infect Dis 1981;3:508-20.
38. Schmied H, ... Mild hypothermia increases blood loss and transfusion requirements during total hip arthroplasty. Lancet 1996;347:289-92.
39. Michelson AD, ... Reversible inhibition of human platelet activation by hypothermia in vivo and in vitro. Thromb Haemost 1994;71:633-40.
40. Rohrer MJ, ... Effect of hypothermia on the coagulation cascade. Crit Care Med 1992;20:1402-5.
41. Jensen LS, ... Postoperative infection and natural killer cell function following blood transfusion in patients undergoing elective colorectal surgery. Br J Surg 1992;79:513-6.
M&M Ch 3.1 & 3.2 Designing Experiments

On Experimental Design
I constructed four miniature houses of worship
a Mohammedan mosque,
a Hindu temple,
a Jewish synagogue,
a Christian cathedral
and placed them in a row.
I then marked 15 ants («fourmis») with red paint and turned
them loose. They made several trips to and fro, glancing in at
the places of worship, but not entering. I then turned loose 15
more painted blue; they acted just as the red ones had done. I
now gilded 15 and turned them loose. No change in the result;
the 45 travelled back and forth in a hurry, persistently and
continuously visiting each fane, but never entering.
This satisfied me that these ants were without religious
prejudices--just what I wished; for under no other conditions
would my next and greater experiment be valuable.
I now placed a small square of white paper within the door of
each fane;
upon the mosque paper I put a pinch of putty, upon the temple
paper a dab of tar,
upon the synagogue paper a trifle of turpentine, upon the
cathedral paper a small cube of sugar.
First I liberated the red ants. They examined and rejected the
putty, the tar and the turpentine, and then took to the sugar
with zeal and apparent sincere conviction.

I next liberated the blue ants, and they did exactly as the red
ones had done.

The gilded ants followed. The preceding results were
precisely repeated.

This seemed to prove that ants destitute of religious prejudice
will always prefer Christianity to any other creed.

However, to make sure, I removed the ants and put putty in the
cathedral and sugar in the mosque. I now liberated the ants in
a body, and they rushed tumultuously to the cathedral.

I was very much touched and gratified, and went back in the
room to write down the event.

But when I came back the ants had all apostatized and had
gone over to the Mohammedan communion.

I saw that I had been too hasty in my conclusions, and
naturally felt rebuked and humbled. With diminished
confidence I went on with the test to the finish. I placed the
sugar first in one house of worship then in another, till I had
tried them all.

With this result: whatever Church I put the sugar in, that was
the one the ants straightway joined.

This was true beyond a shadow of doubt: in religious
matters the ant is the opposite of man, for man cares for but
one thing, to find the only true Church; whereas the ant hunts
for the one with the sugar in it.
from Mark Twain, "On Experimental Design," in Scott W.K. and L.L.
Cummings, Readings in Organizational Behavior and Human
Performance, Irwin: Homewood, Ill., p. 2 (1973).