Experimental Design:
β-error in the Orthopaedic Literature
Dr. M. Ghert
Hypothesis testing
Does one method of treatment have
an improved effect on outcome
compared to another?
Is A better than B?
Hypothesis testing and the
criminal justice system
Criminal defendant is:
“Innocent until proven guilty”
Null hypothesis:
“There is no difference between treatment groups
until we find one”
A=B
Goals
• The courtroom:
To prove guilt beyond the shadow of a
doubt
• Scientific investigation:
To prove that A is better than B beyond the
shadow of a doubt
Why?
• The law:
So we don’t convict someone who is
innocent
• Scientific investigation:
So we don’t accept a treatment method as
being better than another when it is not
What is this type of error?
• Alpha (Type-I) error:
We erroneously reject the null hypothesis
• In other words, we decide that A is better
than B when it really isn’t (or convict an
innocent person)
How do we avoid Alpha error?
• We ask ourselves: “how much of a chance
am I willing to take?”
• “If I find that A is better than B, how sure
do I need to be of these results to change
my orthopaedic practice?”
By convention: I am willing to
accept a 5% chance that A really
isn’t better than B
P<0.05
Therefore I am willing to accept
a P-value of <0.05
A P-value of <0.05 is
considered statistically
significant
The other kind of error
• The courtroom:
A guilty person walks
• Scientific investigation:
We fail to discover that A is better than B
when it really is
→ We lose out on improving our practice
What is this type of error?
• Beta (Type-II) error
• We erroneously accept the null hypothesis
• We decide that A=B when it really isn't (we decide someone is innocent when they really are guilty)
What causes Beta error?
• When two groups really are distinct (A>B), but the P-value is >0.05
• Variance is high within each group
[Figure: two broadly overlapping distributions, A and B]
Avoiding Beta error:
Decrease variance
[Figure: distributions of A and B before and after reducing variance, with less overlap]
Avoiding Beta error:
Increasing sample size
• Reamed vs. unreamed nails for open tibia fractures → infection rate
• 80 patients: reamed 6%, unreamed 4%, P-value 0.24 (NS)
• 500 patients: reamed 6%, unreamed 4%, P-value 0.03*
Beta error
• In the study with only 80 patients, we
committed a beta error
• We failed to achieve a significant P-value
because the sample size was too small
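A quick sketch of this effect in Python (statsmodels is an assumed dependency; the counts are hypothetical round numbers, so the P-values differ from the slide's illustrative figures):

```python
# Same observed rates (6% vs 4%) at growing sample sizes:
# the P-value shrinks as n grows. Counts are hypothetical;
# statsmodels is assumed as the dependency.
from statsmodels.stats.proportion import proportions_ztest

for n_per_arm in (100, 500, 5000):
    infections = [round(0.06 * n_per_arm), round(0.04 * n_per_arm)]
    _, p = proportions_ztest(infections, nobs=[n_per_arm, n_per_arm])
    print(f"{n_per_arm} per arm: 6% vs 4% -> P = {p:.4f}")
```

With identical rates, P falls from roughly 0.5 at 100 per arm to well under 0.05 at 5000 per arm: only the sample size changed.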
In designing an experiment, how
do we avoid beta error?
• We find out how large a sample size is needed to achieve a beta error of no more than 0.20
• That is, we accept at most a 20% chance of committing a beta error
Back to the courtroom
• We accept a 5% chance of convicting an
innocent person
(5% alpha error)
• We accept a 20% chance of letting a guilty
person walk
(20% beta error)
We are more willing to let a guilty person go
than to convict an innocent person
Scientific Investigation
• We are more willing to miss out on an
improved treatment than to accept one that
really isn’t so good.
• First do no harm
Calculating sample size
• Set β = 0.20
• Variance
• Effect size (see the sketch below)
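A standard power calculation turns these ingredients into a number. A minimal pre hoc sketch, assuming statsmodels and a hypothetical effect size of Cohen's d = 0.5 (variance enters through d, the difference in means divided by the SD):

```python
# Minimal pre hoc sample-size sketch; statsmodels is an assumed
# dependency and d = 0.5 is a hypothetical effect size.
from statsmodels.stats.power import TTestIndPower

n_per_arm = TTestIndPower().solve_power(
    effect_size=0.5,  # Cohen's d = (mean_A - mean_B) / SD
    alpha=0.05,       # 5% type-I (alpha) error, by convention
    power=0.80,       # 80% power, i.e., beta = 0.20
)
print(f"~{n_per_arm:.0f} patients per arm")  # about 64 for these inputs
```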
Variance
• Clinician determined
• Expected variance within a population
• Based on retrospective studies
• Example: Harris Hip Score variance within a THA population is usually +/-15%
Effect size
• Based on clinical judgement
• For example, when is a 1% difference
important?
No: 90% vs 91% excellent clinical outcome
Yes: 1% vs 2% fatal pulmonary embolism
Effect size
• Small: effect is difficult to detect
1% increase in excellent results
• Medium: effect is intermediate
10% decreased incidence of infection
• Large: effect is obvious
50% increase in survival
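These categories can be put on a single scale. A sketch using Cohen's h, a standard effect-size measure for proportions; the baseline rates are hypothetical assumptions, since the slide gives only the differences:

```python
# Cohen's h for the three examples above; baseline rates are assumed,
# and statsmodels is an assumed dependency.
from statsmodels.stats.proportion import proportion_effectsize

print(proportion_effectsize(0.91, 0.90))  # ~0.03: tiny, hard to detect
print(proportion_effectsize(0.20, 0.10))  # ~0.28: intermediate
print(proportion_effectsize(0.60, 0.40))  # ~0.40: larger, easier to detect
```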
Standard effect size graphs
[Figure: standard effect size graphs not reproduced]
Rule of thumb
• For:
– moderate effect size
– beta=0.20
• Generally need 120 patients/subjects per
arm
Clinical example
• Number of patients needed to detect:
– 1% difference in fatal pulmonary embolism rates → 10,000 patients
– 50% difference in fatal pulmonary embolism rates → 80 patients
Power
• The POWER of a study is the complement of the beta error (power = 1 − β): the smaller the beta error, the greater the power
• Therefore we strive for a power of 80% (β = 0.20)
Power analysis
• Used when the number of subjects is already determined: how much power did the study actually have?
• Pre hoc (planning) power analysis vs. post hoc (after the fact), as sketched below
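A post hoc power analysis can be sketched the same way: fix the sample size and solve for power. Assuming statsmodels, with illustrative numbers echoing the earlier tibia example (40 per arm, 6% vs 4%):

```python
# Post hoc power sketch: sample size fixed, solve for achieved power.
# statsmodels assumed; 40 per arm and 6% vs 4% are illustrative.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

h = proportion_effectsize(0.06, 0.04)   # effect of interest (Cohen's h)
power = NormalIndPower().solve_power(effect_size=h, nobs1=40, alpha=0.05)
print(f"achieved power = {power:.0%}")  # well below the 80% target
```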
Evidence based research
• The gold standard:
Prospective, randomized, (double-blind) studies with adequate statistical power
+ appropriate, well-defined endpoints (infection, death, fracture union)
+ adequate follow-up
• Double-blinding is generally not possible in surgical studies
How much of the orthopaedic
literature achieves this gold standard?
• Freedman and Bernstein, JBJS 1999:
• 1997 volumes of JBJS and CORR
• 86 clinical studies using hypothesis testing
• Sample size determination in only 6%
• Post hoc power analysis in 2%
Freedman and Bernstein, cont.
• 69% had negative results (P>0.05)
• Of these: 3% had adequate statistical power
• Average sample-size deficiency was 85% of
the required number
• “No significant difference” is not accurate
for these studies
Tornetta et al, AAOS 2000
• Orthopaedic trauma literature, 32 journals
• 117 studies with negative results
• All underwent power analysis by Tornetta
et al
Tornetta et al, cont.
• Primary outcomes:
– Power average 25% (2%-99%)
• Secondary outcomes:
– Power average 19% (2%-99%)
• Rate of β-error >90% for both outcomes
Example
• ORIF vs nonoperative for calcaneal
fractures
• “no difference in outcome”
• Because of operative risks, authors
conclude that nonoperative is superior
• Power of the study was 3%
• Conclusions may be flawed and differences
may exist
Orthopaedics is not alone
• Inadequate power to detect clinically meaningful
differences:
• Emergency medicine
• Cardiovascular research
• Nursing
• Internal medicine
• General practice
• Rehab
• Hand surgery
Summary
• Null hypothesis: A=B
– innocent until proven guilty
• Alpha (Type-I error): erroneously rejecting
null hypothesis
– We find a difference when there really isn’t one
• Courtroom analogy:
– Convict an innocent person
Summary
• Beta (Type-II error): erroneously accepting
the null hypothesis
– We fail to find a difference when there really is
one
• Courtroom analogy:
– A guilty person walks
Summary
• Sample size determined by power and effect
size
• Acceptable by convention:
– 5% type-I error
– 20% type-II error
– 80% power
– (moderate effect size)
Conclusion
• Read the literature carefully
• If there is "no significant difference between groups", was there adequate power to reach this conclusion?
• Experimental design requires careful
planning and sample size determination
Rule of Thumb
• Moderate effect size
• 80% power
• Need 120 subjects per study arm
AAOS Basic Science SAE
In the design of an experiment, the choice of an
appropriate sample size is dependent on several
factors. What is the critical calculation that
characterizes the potential of the study design to
successfully address the research question?
1. Regression analysis
2. Power analysis
3. Determination of correlation coefficient
4. Determination of mean
5. Determination of t-test
AAOS Basic Science SAE
In the design of an experiment, the choice of an appropriate sample size is dependent on several factors. What is the critical calculation that characterizes the potential of the study design to successfully address the research question?
1. Regression analysis
2. Power analysis ← correct answer
3. Determination of correlation coefficient
4. Determination of mean
5. Determination of t-test
AAOS Basic Science SAE
Adequate sample sizes are necessary in clinical
research design. The minimum number of
subjects per group needed in a clinical trial is
that which will
1. Minimize type I error
2. Maximize beta, the risk of type II error
3. Provide a p-value below 0.05
4. Provide a p-value below the given alpha
threshold
5. Ensure that real and clinically significant
differences are statistically significant
AAOS Basic Science SAE
• Adequate sample sizes are necessary in clinical research design. The minimum number of subjects per group needed in a clinical trial is that which will
1. Minimize type I error
2. Maximize beta, the risk of type II error
3. Provide a p-value below 0.05
4. Provide a p-value below the given alpha threshold
5. Ensure that real and clinically significant differences are statistically significant ← correct answer
AAOS Basic Science SAE
• Which of the following terms best describes the probability of making a decision that a treatment has an effect on an outcome in an experiment when in reality there is no effect?
1. Alpha
2. Type II error
3. Power
4. Beta
5. Effect size
AAOS Basic Science SAE
• Which of the following terms best describes the probability of making a decision that a treatment has an effect on an outcome in an experiment when in reality there is no effect?
1. Alpha ← correct answer
2. Type II error
3. Power
4. Beta
5. Effect size
AAOS Basic Science SAE
• When evaluating two treatment protocols using statistical methods such as Student’s t-test or analysis of variance, the ‘p value’ of less than 0.05 is best described as
1. A 5% chance that a difference between the populations
will be falsely accepted
2. A 5% chance that a true difference between the
populations will be missed
3. A 5% difference in the mean between the two populations
4. A 5% difference in the standard deviation of the two
populations
5. A 5% chance of telling the difference between the
populations with the existing number of patients.
AAOS Basic Science SAE
• When evaluating two treatment protocols using statistical methods such as Student’s t-test or analysis of variance, the ‘p value’ of less than 0.05 is best described as
1. A 5% chance that a difference between the populations will be falsely accepted ← correct answer
2. A 5% chance that a true difference between the populations will be missed
3. A 5% difference in the mean between the two populations
4. A 5% difference in the standard deviation of the two populations
5. A 5% chance of telling the difference between the populations with the existing number of patients.
Thank you