Slideshow 2

Interim Analysis in Clinical Trials: A
Bayesian Approach in the Regulatory Setting
Telba Z. Irony, Ph.D. and Gene Pennello, Ph.D.
Division of Biostatistics
Office of Surveillance and Biometrics
Center for Devices and Radiological Health, FDA
No official support or endorsement by the Food and Drug
Administration of this presentation is intended or should
be inferred.
The Frequentist Approach to Interim Analyses
Trial: 200 patients
Several interim analyses planned
If statistical significance is found at any of the looks, the trial
stops and is successful.
In order to obtain an overall significance level of 0.05, the levels at
each possible stopping point must be smaller than 0.05.
Of course, there are infinitely many ways of
“distributing” the level 0.05 among the possible stopping
points.
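A minimal simulation sketch of why the per-look levels must be smaller than 0.05. It assumes five equally spaced looks, a two-sided z-test on normally distributed data, and a true null hypothesis; the function name and the number of simulated trials are illustrative choices, not part of the presentation. The nominal level 0.0158 is the commonly tabulated Pocock value for five looks.

```python
import random
from statistics import NormalDist

def overall_type1(nominal_alpha, looks=5, n_trials=20000, seed=1):
    """Monte Carlo estimate of the chance of crossing a two-sided boundary
    at any of `looks` equally spaced analyses when the null is true."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - nominal_alpha / 2)
    rejections = 0
    for _ in range(n_trials):
        cumulative = 0.0
        for k in range(1, looks + 1):
            cumulative += rng.gauss(0.0, 1.0)  # standardized increment for stage k
            z_k = cumulative / k ** 0.5        # z-statistic on all data so far
            if abs(z_k) > z_crit:
                rejections += 1
                break                          # trial stops at first crossing
    return rejections / n_trials

naive = overall_type1(0.05)     # test at 0.05 at every look
pocock = overall_type1(0.0158)  # same, smaller nominal level at every look
print(f"nominal 0.05 at each look:   overall = {naive:.3f}")
print(f"nominal 0.0158 at each look: overall = {pocock:.3f}")
```

Under these assumptions the first estimate should land well above 0.05 and the second close to it, which is the sense in which the 0.05 has to be "distributed" across the looks.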
Some Boundaries for Frequentist
Group Sequential Trials
[Figure: significance levels to reject Ho (vertical axis, 0 to 0.05) plotted against Looks (horizontal axis, up to 5) for three boundaries: Pocock, O'Brien Fleming, and Fleming Harrington O'Brien.]
Other Ways to Penalize Multiple Looks
• The alpha spending function is a version of stopping
boundary that is a continuous function of the
percentage of the study completed.
• There are lots of different boundaries, techniques, and
software (PEST, EAST) to “control type I error” while
performing interim looks in a clinical trial.
• You could create (and publish) your own boundary and
develop your own software.
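As an illustration of an alpha spending function, here is a sketch of two widely used Lan-DeMets forms. The presentation does not name a specific function, so the choice of these two is an assumption. Each returns the cumulative type I error allowed once a fraction t of the study is complete, and both reach the full 0.05 at t = 1.

```python
from math import e, log
from statistics import NormalDist

ALPHA = 0.05
_norm = NormalDist()

def obf_type_spending(t, alpha=ALPHA):
    """O'Brien-Fleming-type spending: almost no alpha is spent early."""
    z = _norm.inv_cdf(1 - alpha / 2)
    return 2 * (1 - _norm.cdf(z / t ** 0.5))

def pocock_type_spending(t, alpha=ALPHA):
    """Pocock-type spending: alpha is spent more evenly over the study."""
    return alpha * log(1 + (e - 1) * t)

# Cumulative alpha available at several information fractions.
for t in (0.2, 0.4, 0.6, 0.8, 1.0):
    print(f"t={t:.1f}  OBF-type={obf_type_spending(t):.5f}  "
          f"Pocock-type={pocock_type_spending(t):.5f}")
```

Because the functions are continuous in t, the looks do not need to be equally spaced or fixed in number, only the spending function must be chosen in advance.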
That approach violates the Likelihood Principle!
Two companies come with the same data on 200 patients
Both obtain the same p-value at the end (0.045)
• One looked once at the data during the trial with the
intention of stopping but didn’t => does not reach
significance (required p-value = 0.041) => not successful!
• The competitor did not look => reached significance
(required p-value = 0.05) => successful!
• Moreover: reaching significance or not depends on whose
boundary you choose => you have to tell in advance which
one you want and cannot change your mind!
Why do frequentist and Bayesian approaches differ?
Frequentists
inferences are based on p-values
probabilities are on the sample space
Estimation: P(data | parameter)
Hypothesis testing: P(data | H)
Bayesians
inferences are based on posterior distributions
probabilities are on the parameter space
Estimation: P(parameter | data)
Hypothesis testing: P(H | data)
Likelihood Principle prevails
The Bayesian Approach to Interim Analyses
No adjustments are made for interim looks or
modifications of trials in midcourse.
In fact, the decision of continuing the study or not should
be based on potential costs and benefits weighed by the
current posterior distribution of the unknowns.
Example 1: Curtailment of the trial via predictive
distribution
Clinical trial
p: chance of patient success
Interim Look: 190 successes out of 200 observed patients
Remaining: 80 patients.
How many successes among the next 80 patients?
Could we stop the trial and make a decision already?
Predictive Distribution
P( future observation(s) | prior, data)
Make sure that the remaining patients are
exchangeable with the observed patients.
Predictive probability of success for the next 80
patients (based on the posterior distribution for p)
[Figure: predictive probability (vertical axis, 0 to 0.15) of each possible number of Successes (horizontal axis) among the next 80 patients.]
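A sketch of how the predictive distribution for the next 80 patients could be computed. It assumes a uniform Beta(1, 1) prior on p, which the presentation does not specify, so the posterior after 190 successes in 200 patients is Beta(191, 11) and the predictive distribution is beta-binomial. The cutoff of 70 successes is purely hypothetical.

```python
from math import exp, lgamma

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_choose(n, k):
    """Log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def predictive_pmf(m, successes, n, prior_a=1.0, prior_b=1.0):
    """Beta-binomial P(k successes in the next m | data) for k = 0..m."""
    a = prior_a + successes          # posterior Beta parameters
    b = prior_b + (n - successes)
    return [
        exp(log_choose(m, k) + log_beta(a + k, b + m - k) - log_beta(a, b))
        for k in range(m + 1)
    ]

pmf = predictive_pmf(m=80, successes=190, n=200)
predictive_mean = sum(k * p for k, p in enumerate(pmf))
prob_at_least_70 = sum(pmf[70:])     # hypothetical cutoff of interest
print(f"predictive mean = {predictive_mean:.2f}")
print(f"P(at least 70 of the next 80 succeed) = {prob_at_least_70:.4f}")
```

If the predictive probability of ending up on one side of the final decision rule is already very high, the trial is a candidate for curtailment, provided the remaining patients are exchangeable with the observed ones.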
2. Interim Analyses: Multiple Looks
We collect data to learn about an endpoint
When we know enough we should stop the trial
• Stop when the credible interval is small enough
• Stop when there is reasonable assurance that the hypothesis
is true (or false) or the device is safe and effective (or is
not).
Example: A totally Bayesian approach
Planned ahead => no penalty for multiple looks!
New treatment
Interest: q - rate of adverse effect - endocarditis
Prior: P(q) - hierarchical model - used old results
Interest: Posterior: P(q | data)
Want q to be small. How small?
• If there is a good chance that q < target => success.
• If there is a good chance that q > target => failure.
Target: q = 0.1
Pre-defined criterion:
Look after every 100 patient years.
Stop and approve if P(q < target | data) > 0.99.
Stop and don’t approve if P(q > target | data) > 0.80.
Minimum sample size: 300 patient years (hierarchical model)
Maximum sample size: 800 patient years (practical reasons)
The company could in fact go on forever (!!)
Start with 300 patient years (data1).
If P(q < target | data1) > 99% stop and approve.
If P(q > target | data1) > 80% stop and cut losses.
If neither of the above continue sampling.
Sample 100 patient years more (data2).
If P(q < target | data1 + data2) > 99% stop and approve.
If P(q > target | data1 + data2) > 80% stop and cut losses.
If neither of the above continue sampling.
..... Sample 100 more (data i).
If P(q < target | data1 + data2 + ... + data i) > 99% stop and approve.
If P(q > target | data1 + data2 + ... + data i) > 80% stop and cut losses.
Approved!
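The look-by-look rule above can be sketched in a few lines. This is not the hierarchical model used in the actual trial: it assumes Poisson adverse-event counts and a hypothetical Gamma(shape 1, rate 1) prior on q, chosen so that the posterior shape stays an integer and the posterior probability reduces to a Poisson tail sum. The cumulative event counts in the usage loop are invented for illustration.

```python
from math import exp

TARGET = 0.1        # target adverse-event rate per patient year
APPROVE_AT = 0.99   # stop and approve if P(q < target | data) exceeds this
FAIL_AT = 0.80      # stop and cut losses if P(q > target | data) exceeds this

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summed term by term."""
    term = exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total

def prob_below_target(events, patient_years, prior_shape=1, prior_rate=1.0):
    """P(q < TARGET | data) for a Gamma posterior with integer shape.

    Uses the identity P(Gamma(a, b) < t) = P(Poisson(b * t) >= a).
    The Gamma(1, 1) prior is an assumption made for this sketch only.
    """
    shape = prior_shape + events
    rate = prior_rate + patient_years
    return 1.0 - poisson_cdf(shape - 1, rate * TARGET)

def decide(events, patient_years):
    """Apply the pre-defined stopping criterion at one look."""
    p_below = prob_below_target(events, patient_years)
    if p_below > APPROVE_AT:
        return "approve"
    if 1.0 - p_below > FAIL_AT:
        return "stop"
    return "continue"

# Hypothetical cumulative (patient years, events) at successive looks.
for patient_years, events in [(300, 25), (400, 31), (500, 36)]:
    print(patient_years, events, decide(events, patient_years))
```

The same function is applied unchanged at every look, since the posterior given all the data so far does not depend on how many times it was inspected.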
Problems
• Frequentists believe one may sample to a foregone conclusion:
    • one may stop as soon as one gets significance; or
    • by repeatedly testing it is possible to reject Ho with
      probability as close to 1 as desired (probabilities of
      hypotheses are usually martingales - D. Berry, 1987). It
      takes an infinite amount of time, though.
• Controlling the overall type I error is a critical concern in
monitoring clinical trials - Regulators.
• Some Bayesians (perhaps inspired by O’Brien and Fleming)
believe that one needs to be more restrictive in early stages of
the trial, requiring higher posterior probabilities for
termination at the beginning….
More Problems
Normal distribution “paradox” (D. Rubin):
Two companies: Frequentist and Bayesian.
Both perform interim looks.
The Bayesian uses a non-informative prior and stops when
P(Ho|data) > 95%.
The Frequentist uses a nominal significance level of 5%.
In the Normal case with non-informative prior, the
posterior probability is numerically equal to 1-(p-value).
The Frequentist pays a penalty for the looks and the
Bayesian doesn’t.
The Frequentist may be unsuccessful and the Bayesian
may be successful with the same data!
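A small numerical check of the equality behind this "paradox". It assumes a one-sided test of a normal mean with known standard deviation and a flat prior on the mean, and it writes the posterior probability as that of the effect the sponsor wants to show (mean above zero). The sample values are made up.

```python
from statistics import NormalDist

def one_sided_p_value(xbar, sigma, n):
    """P(sample mean >= xbar | mu = 0), testing H0: mu <= 0 with known sigma."""
    z = xbar * n ** 0.5 / sigma
    return 1 - NormalDist().cdf(z)

def posterior_prob_effect(xbar, sigma, n):
    """P(mu > 0 | data) under a flat prior: mu | data ~ N(xbar, sigma^2 / n)."""
    posterior = NormalDist(mu=xbar, sigma=sigma / n ** 0.5)
    return 1 - posterior.cdf(0.0)

# Hypothetical data: mean 0.25 observed on 50 patients, sigma = 1.
p_value = one_sided_p_value(0.25, 1.0, 50)
post = posterior_prob_effect(0.25, 1.0, 50)
print(f"p-value = {p_value:.4f}, posterior probability = {post:.4f}")
```

The two numbers always sum to 1 in this setting, so the only difference between the two companies is the threshold each has to meet after looking.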
A Regulatory Solution
To illustrate what would happen in terms of
type I and II errors in a Bayesian Trial, we request
simulations at the design stage.
If the rate were actually below the target, what would happen?
How often would the trial stop for futility? (type II error)
If the rate were actually above the target, what would happen?
How often would the device be approved? (type I error)
Whenever the type I error rate is too high, we modify the
design!
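A rough sketch of the kind of design-stage simulation being requested. It assumes Poisson event counts, looks at 300, 400, ..., 800 patient years, and a hypothetical Gamma(shape 1, rate 1) prior in place of the hierarchical prior, so it will not reproduce the heart valve results reported in this presentation. All function names are illustrative.

```python
import random
from math import exp

TARGET, APPROVE_AT, FAIL_AT = 0.1, 0.99, 0.80
LOOKS = range(300, 801, 100)  # cumulative patient years at each look

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam)."""
    term = exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total

def poisson_draw(rng, lam):
    """Knuth's multiplication method; adequate for the small means used here."""
    limit, k, prod = exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def run_trial(rng, true_rate):
    """Simulate one trial; return (decision, patient years at stopping)."""
    events, exposure = 0, 0
    for look in LOOKS:
        events += poisson_draw(rng, true_rate * (look - exposure))
        exposure = look
        # Posterior Gamma(1 + events, 1 + exposure) under the assumed prior:
        # P(q < TARGET | data) = 1 - P(Poisson((1 + exposure) * TARGET) <= events)
        p_below = 1.0 - poisson_cdf(events, (1.0 + exposure) * TARGET)
        if p_below > APPROVE_AT:
            return "approve", exposure
        if 1.0 - p_below > FAIL_AT:
            return "stop", exposure
    return "inconclusive", exposure

def approval_rate(true_rate, n_trials=1000, seed=0):
    """Fraction of simulated trials that end in approval."""
    rng = random.Random(seed)
    approved = sum(run_trial(rng, true_rate)[0] == "approve"
                   for _ in range(n_trials))
    return approved / n_trials

above_target = approval_rate(0.15)  # approvals here are type I errors
below_target = approval_rate(0.06)  # non-approvals here are type II errors
print(f"q = 0.15: approved in {above_target:.1%} of simulated trials")
print(f"q = 0.06: approved in {below_target:.1%} of simulated trials")
```

Repeating this over a grid of true rates gives the operating characteristics of the design, and the thresholds or sample sizes can be adjusted if the simulated type I error rate is too high.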
Evaluating the experimental design – Heart Valve
For each rate, simulated 1000 trials
Simulated   %             Expected      Error   Error
Rates (q)   Equivalence   Sample Size   type    Rate
0.2         0.6%          426           I       0.6%
0.15        2.7%          529           I       2.7%
0.11        4.9%          761           I       4.9%
0.1         9.2%          786           ?       ?
0.099       10.7%         778           II      89.3%
0.095       15.7%         612           II      84.3%
0.09        25%           594           II      75%
0.08        35.3%         510           II      64.7%
0.07        45.6%         428           II      54.4%
0.06        60.7%         327           II      39.3%