Development of gatekeeping strategies in
confirmatory clinical trials
Alex Dmitrienko (Eli Lilly and Company)
Brian A. Millen (Eli Lilly and Company)
Thomas Brechenmacher (Dainippon Sumitomo Pharma Co)
Gautier Paux (sanofi pasteur)
Summary
This paper discusses multiplicity issues arising in confirmatory clinical trials with
hierarchically ordered multiple objectives. In order to protect the overall Type I
error rate, multiple objectives are analyzed using multiple testing procedures. When
the objectives are ordered and grouped in multiple families (e.g., families of primary
and secondary endpoints), gatekeeping procedures are employed to account for this
hierarchical structure. We discuss considerations arising in the process of building
gatekeeping procedures, including proper use of relevant trial-specific information and
criteria for selecting gatekeeping procedures. The methods and principles discussed
in this paper are illustrated using a clinical trial in patients with Type II diabetes
mellitus.
For more information, see http://www.multxpert.com/wiki/Gatekeeping Papers
1
Introduction
Clinical trial designs with hierarchically ordered objectives play an increasingly
important role in modern drug development. Such objectives help clinical trial sponsors differentiate their treatments from others through evaluation of effects on multiple endpoints, in multiple populations, and/or other relevant assessments, e.g., dose-response, non-inferiority/superiority, etc. The inclusion of multiple hierarchically
ordered objectives allows more robust inference from each clinical trial and, therefore, increases the efficiency of the drug development process. The novel trial designs
have given rise to more complex multiplicity problems and led to a number of important methodological developments in the area of multiple comparisons. One of these
developments is related to the so-called gatekeeping strategies aimed at addressing
multiplicity in trials with multiple families of objectives, e.g., the primary objectives
defining the overall outcome of the trial and the secondary objectives associated with
important supportive outcome variables. The work on gatekeeping strategies was
initiated about 15 years ago and the general methodological principles were defined
in Maurer et al. (1995), Bauer et al. (1998) and Westfall and Krishen (2001). Subsequent work in this area was aimed at the development of more powerful and adaptable
gatekeeping strategies, including gatekeeping strategies based on parametric multiple
testing procedures (Dmitrienko, Offen, Wang and Xiao, 2006; Liu and Hsu, 2009),
general multistage gatekeeping strategies with parallel restrictions (Guilbaud, 2007;
Dmitrienko, Tamhane and Wiens, 2008) and gatekeeping strategies with flexible decision rules (Bretz et al., 2009; Burman et al., 2009; Dmitrienko and Tamhane, 2011).
For a detailed review of recent developments in this area of research, see Dmitrienko
and Tamhane (2009).
This paper discusses a more applied aspect of this general problem. Specifically,
we focus on practical issues arising in the development of complex gatekeeping strategies in confirmatory clinical trials. Our main goal is to review individual steps in the
development process and formulate key principles and considerations that help clinical trial sponsors maximize the probability of technical success. We emphasize the
importance of utilizing all available clinical and statistical information. From the
clinical information perspective, it is critical to ensure that all gatekeeping strategies
under consideration are aligned with the main objectives of the clinical trial, i.e., all
clinically relevant relationships among the individual objectives (or associated null
hypotheses) are taken into account (Hung and Wang, 2009). Further, available information on the joint distribution of the hypothesis test statistics needs to be utilized
as well. We also discuss the evaluation of operating characteristics of candidate gatekeeping strategies under trial-specific assumptions, which is conducted to select the
final, most powerful gatekeeping strategy.
The outline of the paper is as follows. Section 2 introduces key principles for
development of gatekeeping procedures. Section 3 provides the example which will
be used to illustrate these principles and other important concepts. Section 4 offers a
discussion of gatekeeping development strategies and the resultant gatekeeping procedures (i.e., decision rules for the individual null hypotheses). Section 5 provides a
comparison of the gatekeeping procedures based on the strategies presented in Section 4. Section 6 introduces criteria that can be used for optimal selection of free
parameters of gatekeeping procedures. The article closes with a general discussion in
Section 7.
2
Principles for development of gatekeeping procedures
To develop an optimal gatekeeping procedure, a clinical trial sponsor should adhere
to a few specific principles to ensure the relevance and desired performance of the
procedure. To ensure relevance, one must carefully consider and incorporate all logical
relationships among the null hypotheses specified in a confirmatory clinical trial.
This requires clear delineation of primary and secondary null hypotheses and, more
generally, a well-understood hierarchy among the null hypotheses or a pre-specified
decision tree defining which null hypotheses logically may be tested when others
are rejected (or not rejected). For example, a relevant gatekeeping procedure would
recognize that null hypotheses of superiority should not be tested when the associated
null hypotheses of non-inferiority tests have failed to be rejected. Similarly, a test of a
secondary endpoint would be irrelevant when the null hypothesis associated with the
primary endpoint required for registration fails to be rejected (O’Neill, 1997). From a
mathematical optimization perspective, the focus on relevance allows the sponsor to
reduce the space of potential gatekeeping procedures prior to a comparative evaluation
of performance.
The sponsor may further narrow down the available options prior to evaluating
the performance of the final set of candidate gatekeeping procedures by applying
the following performance-related principle: preference should be given to procedures
which account for the available information on the distribution of the hypothesis test
statistics. Clearly, such procedures will be more powerful than those that do not fully
account for the distributional information. That is, the former will reject all null
hypotheses rejected by the latter, and potentially more. Thus, while nonparametric
(or distribution-free) procedures are useful for initial drafts of gatekeeping procedures,
whenever possible, these should be replaced by similar parametric or semiparametric
procedures which utilize the distributional information. This is demonstrated in the
examples in Section 4.
After addressing the two considerations to reduce the space of candidate gatekeeping procedures, the sponsor may develop application-specific metrics which will
be used to compare the performance of the remaining candidate procedures and guide
optimal selection of procedure parameters. Succinctly, the sponsor must define “success” for the clinical trial, based on the trial’s objectives and logical constraints already considered. Metrics for comparing gatekeeping procedures typically quantify
the likelihood of trial success with each candidate procedure (or over a space of procedure parameters). A general discussion of procedure or parameter selection metrics
is provided in Sections 5 and 6.
Together, these general considerations provide guidance for development of optimal
gatekeeping procedures for confirmatory clinical trials. Sections 4, 5 and 6 provide
examples to illustrate the following three key guiding principles for development of
gatekeeping procedures:
• Take into account all relevant logical relationships among the null hypotheses
of interest.
• Take into account all distributional information (e.g., joint distribution of the
hypothesis test statistics).
• Select the gatekeeping strategy which optimizes pre-specified application-specific
metrics under trial-specific assumptions.
Another important consideration in the development of gatekeeping procedures is
related to the concept of re-testing of null hypotheses. This concept has been discussed in multiple publications, including Bretz et al. (2009), Burman et al. (2009)
and Dmitrienko, Kordzakhia and Tamhane (2011). The idea behind the re-testing approach is that, when it is clinically appropriate, certain null hypotheses can be tested
multiple times in order to improve power of the associated tests. This consideration,
of course, broadens the space of potential gatekeeping procedures and the sponsor
should, at a minimum, consider adding simple re-testing elements when selecting the
final candidate gatekeeping procedures. This concept is illustrated by Procedure 3B
defined in Section 4 and can be considered an additional guiding principle for the
development of gatekeeping procedures.
3
Clinical trial example
The following example will be used throughout the paper to illustrate the key principles used in the development of gatekeeping strategies. Consider a clinical trial
in patients with Type II diabetes mellitus. This trial will be conducted to evaluate
efficacy and safety of two doses of an experimental treatment compared to an active
control. The doses will be labeled Dose 1 (high dose) and Dose 2 (low dose). The
primary efficacy endpoint is the reduction in hemoglobin A1c from baseline to 12
months (a reduction in the hemoglobin A1c level is a sign of improvement). A balanced design with an equal number of patients in the three groups will be used in the
trial.
The primary efficacy analyses in this trial include non-inferiority and superiority
comparisons at each dose level. The comparison of each dose versus the active control
begins with a non-inferiority test and, if the treatment is shown to be non-inferior to
the control, a superiority test is then carried out. Let δi denote the true mean treatment difference between Dose i and the active control, i = 1, 2. The corresponding
null hypotheses are defined as follows:
• H1 and H2 denote the null hypotheses of inferiority for Doses 1 and 2, respectively, i.e., H1 : δ1 ≤ −λ and H2 : δ2 ≤ −λ, where λ is a pre-defined positive
non-inferiority margin.
• H3 and H4 denote the null hypotheses of lack of superiority for Doses 1 and 2,
respectively, i.e., H3 : δ1 ≤ 0 and H4 : δ2 ≤ 0.
The trial’s sponsor is interested in building a gatekeeping strategy which provides
a strong control of the global familywise (i.e., with respect to all four null hypotheses)
error rate (Hochberg and Tamhane, 1987) at a pre-specified level. The gatekeeping
strategy will enable the sponsor to enrich the product label by providing most relevant
information on the efficacy of the new treatment compared to the active control (either
non-inferiority or superiority) at both dose levels.
In what follows we will illustrate the process of developing gatekeeping strategies in
confirmatory clinical trials by defining three candidate strategies for this trial example.
We will also discuss the steps recommended for selecting an optimal strategy which
reflects the trial’s objectives and maximizes the probability of success. It is important
to note that other gatekeeping strategies can be constructed for this particular setting
using methods defined in Bretz et al. (2009), Burman et al. (2009), Millen and
Dmitrienko (2011), Dmitrienko and Tamhane (2011).
4
Gatekeeping strategies
As we indicated in Section 2, the process of constructing an optimal gatekeeping
strategy begins with the identification of relevant logical relationships among the null
hypotheses of interest. In the diabetes trial, the sponsor needs to account for the
fact that the non-inferiority and superiority tests for each dose-control comparison
are logically related to each other. In particular, testing a superiority hypothesis for a
given dose should only be considered once non-inferiority has been established. That
is, H3 is tested only if H1 is rejected and, similarly, H4 is tested only if H2 is rejected.
In addition, logical restrictions that reflect the relationships between Doses 1 and 2
need to be accounted for. There are two general possibilities:
• Evaluation of a particular dose (e.g., Dose 1) is considered priority, while evaluation of the other dose is considered secondary.
• Evaluation of each dose is considered important.
We assume the latter is true for the example in this paper.
The following notation will be used to define gatekeeping strategies in the diabetes
trial. Let pi denote the raw p-value for the null hypothesis Hi , i = 1, 2, 3, 4, produced
by an appropriate statistical test, e.g., the two-sample t test, and let ti denote the
associated test statistic. Further, let α denote the global familywise error rate, e.g.,
one-sided α = 0.025. Three general strategies for development of an appropriate
gatekeeping procedure are defined below. All three strategies incorporate the logical
constraints on the non-inferiority and superiority tests.
4.1
Strategy 1 (Serial testing)
We will begin with the most basic gatekeeping strategy, namely, the serial testing
strategy, as depicted in Figure 1. This strategy assumes a simple link between the
two tests for Dose 1 and the two tests for Dose 2. That is, the non-inferiority test for
Dose 2 is carried out (H2 is tested) only after the treatment is shown to be superior
to the active control at Dose 1 (H3 is rejected). The resulting gatekeeping procedure
is very straightforward—it is simply the fixed-sequence procedure (Dmitrienko et al.,
2009, Section 2.6) with the following decision rules:
Procedure 1
• Step 1. H1 is rejected if p1 ≤ α.
• Step 2. H3 is rejected if p3 ≤ α and H1 is rejected.
• Step 3. H2 is rejected if p2 ≤ α and H3 is rejected.
• Step 4. H4 is rejected if p4 ≤ α and H2 is rejected.
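The four steps of Procedure 1 can be written as a short loop. The following is a minimal sketch, not the authors' implementation: the function name `procedure_1` and the dictionary output are hypothetical, and the default α = 0.025 is the one-sided level used as an example in the text.

```python
def procedure_1(p1, p2, p3, p4, alpha=0.025):
    """Fixed-sequence sketch of Procedure 1: test H1, H3, H2, H4 in that order.

    Each hypothesis is tested at the full alpha level, and testing stops
    at the first non-significant result.
    """
    rejected = {"H1": False, "H2": False, "H3": False, "H4": False}
    # Order follows Steps 1-4: H1 -> H3 -> H2 -> H4.
    for name, p in (("H1", p1), ("H3", p3), ("H2", p2), ("H4", p4)):
        if p > alpha:
            break  # every later hypothesis in the sequence is not tested
        rejected[name] = True
    return rejected


if __name__ == "__main__":
    # Dose 1 fails superiority, so Dose 2 is never evaluated.
    print(procedure_1(p1=0.010, p2=0.001, p3=0.030, p4=0.001))
```

In the usage example, the small p-values for Dose 2 do not help because H3 blocks the sequence.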
[Insert Figure 1 here]
Procedure 1 is simple to implement but the underlying α propagation rules (i.e.,
rules for distributing the familywise error rate among the null hypotheses) are quite
inflexible. The full α level is applied to the null hypothesis H1 and is then transferred
along a single sequence. As a result, each of the first three null hypotheses serves as a
gatekeeper for the next null hypothesis. Thus, the efficacy profile of Dose 2 cannot be
evaluated (i.e., H2 and H4 cannot be tested) if Dose 1 fails the superiority test (i.e.,
H3 is not rejected). This condition appears inappropriate for the current example.
Thus, the use of such a gatekeeping strategy in this example could only be justified by
trial-specific assumptions which make the serial gatekeeping procedure more powerful
than logically relevant procedures. This might occur when the trial is substantially
powered for the tests on Dose 1, while only moderately powered for tests on Dose 2,
i.e., Dose 1 is expected to be substantially more efficacious than Dose 2.
In general, there are only a few situations where Strategy 1 may be considered
relevant. One example is the case when the efficacy dose-response relationship is
known to be positive and the sponsor must demonstrate that both doses are non-inferior to the active control and at least one dose offers superior efficacy to the
active control in order to continue with the development program or application for
market authorization for the new treatment. In this situation the efficacy profile of
Dose 2 is irrelevant if Dose 1 is not superior to the active control, which aligns with
the decision rules of Strategy 1.
4.2
Strategy 2 (Extended serial testing)
The next gatekeeping strategy, shown in the left panel of Figure 2, provides an extension of Strategy 1 by incorporating more flexible α propagation rules. It follows from
Figure 2 that after non-inferiority is established for Dose 1 (H1 is rejected), the α is
carried over to the superiority test for Dose 1 (H3 ) as well as the non-inferiority test for
Dose 2 (H2 ). These two branches give the sponsor an option to study the efficacy
profile of Dose 2 even if Dose 1 is not superior to the active control. Further, as before, if Dose 2 is non-inferior to the active control (H2 is rejected), the corresponding
superiority test is carried out (H4 is tested).
[Insert Figure 2 here]
Since Strategy 2 is more complex than Strategy 1, the process of constructing an
associated gatekeeping procedure is not as straightforward as before. As a starting
point, the sponsor can consider the following gatekeeping procedure based on a simple
α-splitting rule:
Procedure 2A
• Step 1. H1 is rejected if p1 ≤ α.
• Step 2. H2 is rejected if p2 ≤ α/2 and H1 is rejected. Likewise, H3 is rejected
if p3 ≤ α/2 and H1 is rejected.
• Step 3. H4 is rejected if p4 ≤ α/2 and H2 is rejected.
If the initial test is significant (H1 is rejected), the familywise error rate α is split
between the superiority test for Dose 1 and the non-inferiority test for Dose 2, i.e., H2
and H3 are each tested at α/2. Further, if non-inferiority is established for Dose 2
(H2 is rejected), the significance level used in this test is transferred to the superiority
test for the same dose (H4 ) and this test is carried out at α/2.
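The α-splitting rule of Procedure 2A can be sketched as follows. This is an illustrative sketch only: the function name `procedure_2a` and the return format are hypothetical, and the default α = 0.025 is the example level from the text.

```python
def procedure_2a(p1, p2, p3, p4, alpha=0.025):
    """Sketch of Procedure 2A: H1 at alpha, then H2 and H3 at alpha/2 each.

    H4 is tested at alpha/2 only after H2 has been rejected.
    """
    h1 = p1 <= alpha
    h2 = h1 and p2 <= alpha / 2
    h3 = h1 and p3 <= alpha / 2
    h4 = h2 and p4 <= alpha / 2
    return {"H1": h1, "H2": h2, "H3": h3, "H4": h4}


if __name__ == "__main__":
    # Dose 1 is non-inferior but not superior; Dose 2 can still be evaluated.
    print(procedure_2a(p1=0.010, p2=0.010, p3=0.020, p4=0.012))
```

Unlike the fixed-sequence rule, a non-significant H3 does not prevent testing of H2 and H4 here.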
To better understand the α propagation rules used in Procedure 2A, it is instructive to re-arrange the null hypotheses as shown in the middle panel of Figure 2. The
new arrangement emphasizes the fact that the null hypotheses are essentially grouped
into three families:
• Family 1 includes the null hypothesis H1 .
• Family 2 includes the null hypotheses H2 and H3 .
• Family 3 includes the null hypothesis H4 .
Note that Families 1 and 2 serve as gatekeepers. Family 1 is a gatekeeper with a
single null hypothesis which needs to be rejected to proceed to Family 2. Family 2
is a gatekeeper with two null hypotheses but only one of these null hypotheses is
connected to the null hypothesis in Family 3. Recall now that Procedure 2A simply
splits the α between the null hypotheses in Family 2, i.e., Procedure 2A applies
the Bonferroni procedure within this family. The Bonferroni procedure is the most
conservative multiple testing procedure and the performance of Procedure 2A can be
improved if the Bonferroni procedure in Family 2 is replaced with a more powerful
procedure. First of all, the sponsor can consider using the Holm procedure which
is uniformly more powerful than the Bonferroni procedure. Secondly, even more
powerful procedures can be utilized in Family 2 if additional distributional information
is available. This brings us to the discussion of the performance-based principle for
developing gatekeeping procedures: utilization of all relevant information on the joint
distribution of the hypothesis test statistics.
Knowledge of the joint distribution of the test statistics associated with hypotheses
H2 and H3 (i.e., t2 and t3 ) is readily available and is determined by the common set of
patients contributing to the two test statistics. If the null hypotheses are each tested
in the same patient population, e.g., in the intent-to-treat (ITT) population, the
correlation between the test statistics is ρ = 0.5 due to the assumption of a balanced
design. If the null hypotheses are each tested in different patient populations, e.g., H2
in the ITT population and H3 in the per protocol (PP) population, the correlation
coefficient is no longer 0.5 but remains positive. This information can be utilized to
construct a more powerful multiple testing procedure for Family 2. Available choices
for constructing a more powerful multiple testing procedure include (Dmitrienko et
al., 2009, Sections 2.6 and 2.7):
• Semiparametric procedures (p-value-based procedures whose decision rules do
not depend on full knowledge of the joint distribution of the hypothesis test
statistics but additional distributional assumptions need to be made in order
to control the local familywise error rate within a family), e.g., Hochberg or
Hommel procedures.
• Fully parametric procedures (procedures that make explicit assumptions about
the joint distribution of the hypothesis test statistics), e.g., single-step or step-down Dunnett procedures.
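The correlation ρ = 0.5 quoted earlier for the balanced design can be checked with a short calculation. The sketch below adds assumptions that are not stated in the text: independent arms with a common variance σ² and n patients per arm, and sample means denoted by the hypothetical symbols \bar{X}_0 (active control), \bar{X}_1 and \bar{X}_2 (Doses 1 and 2). The non-inferiority margin λ is a constant shift and does not affect the correlation.

```latex
\begin{align*}
\operatorname{Cov}\!\left(\bar{X}_1 - \bar{X}_0,\; \bar{X}_2 - \bar{X}_0\right)
  &= \operatorname{Var}\!\left(\bar{X}_0\right) = \frac{\sigma^2}{n}, \\
\operatorname{Var}\!\left(\bar{X}_i - \bar{X}_0\right)
  &= \frac{2\sigma^2}{n}, \qquad i = 1, 2, \\
\rho
  &= \frac{\sigma^2/n}{2\sigma^2/n} = \frac{1}{2}.
\end{align*}
```

The shared control arm is the only source of dependence between the two comparisons, which is why the correlation is positive and equals 0.5 when the arms are of equal size.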
As an illustration, we will focus here on the use of the Hochberg procedure in
Family 2 (note that it is equivalent to the Hommel procedure since there are two
null hypotheses in this family). The Hochberg procedure is uniformly more powerful
than the Bonferroni and Holm procedures and controls the local familywise error
rate within Family 2 since the joint distribution of the hypothesis test statistics is
multivariate totally positive of order two (MTP2); see Sarkar and Chang (1997) and
Sarkar (1998) for more information. However, as shown in Dmitrienko, Tamhane and
Wiens (2008), the regular Hochberg procedure does not meet a technical condition
known as the separability condition. In other words, the regular Hochberg procedure
is a “greedy” procedure which uses up all available α in Family 2 and enables the
sponsor to proceed to Family 3 only if both null hypotheses are rejected in Family 2.
This is not consistent with the decision rules specified above, namely, Figure 2 shows
that only H2 needs to be rejected in Family 2 to proceed to Family 3. Thus a modified
version of the Hochberg procedure, termed the truncated Hochberg procedure, needs
to be used in Family 2 (note that many other stepwise procedures, including the
Holm procedure, do not satisfy the separability condition and thus they need to be
truncated as well if the sponsor chooses to use them in Family 2). The truncated
Hochberg procedure, introduced in Dmitrienko, Tamhane and Wiens (2008), will
enable the sponsor to transfer a fraction of the error rate to the null hypothesis in
Family 3 even if only H2 is rejected in Family 2.
A gatekeeping procedure which utilizes the truncated Hochberg procedure in Family 2 can be constructed using the general mixture method (Dmitrienko and Tamhane,
2011). This gatekeeping procedure is based on the following decision rules (see the
Appendix for more information):
Procedure 2B
• Step 1. H1 is rejected if p1 ≤ α.
• Step 2. The null hypotheses in Family 2 are tested using the truncated Hochberg
procedure at the full α level if H1 is rejected. Let p(2) < p(3) denote the ordered
p-values in this family and let H(2) and H(3) denote the associated null hypotheses. Further, let 0 ≤ γ < 1 denote a pre-defined truncation parameter. The
truncated Hochberg procedure rejects both H(2) and H(3) if p(3) ≤ α(1 + γ)/2
or it rejects only H(2) if p(2) ≤ α/2 and p(3) > α(1 + γ)/2.
• Step 3. H4 is rejected if
– p4 ≤ α(1 − γ)/2, H2 is rejected and H3 is not rejected.
– p4 ≤ α and both H2 and H3 are rejected.
To illustrate the decision rules in Procedure 2B, suppose that H1 is rejected and
p2 < p3 , which implies that H(2) = H2 and H(3) = H3 . Further, suppose that H2 is
rejected but H3 is not rejected in Family 2. Then H4 is tested at α(1 − γ)/2.
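The decision rules of Procedure 2B, as listed in Steps 1 to 3, can be sketched in code. This is an illustration of the rules as stated here rather than of the general mixture method itself; the function name `procedure_2b` and the default γ = 0.5 are hypothetical choices.

```python
def procedure_2b(p1, p2, p3, p4, alpha=0.025, gamma=0.5):
    """Sketch of Procedure 2B with a truncated Hochberg test in Family 2.

    gamma is the truncation parameter, 0 <= gamma < 1.
    """
    if not 0 <= gamma < 1:
        raise ValueError("gamma must satisfy 0 <= gamma < 1")
    h1 = p1 <= alpha
    h2 = h3 = h4 = False
    if h1:
        # Step 2: truncated Hochberg on {H2, H3} at the full alpha level.
        if max(p2, p3) <= alpha * (1 + gamma) / 2:
            h2 = h3 = True
        elif min(p2, p3) <= alpha / 2:
            # Only the hypothesis with the smaller p-value is rejected.
            if p2 < p3:
                h2 = True
            else:
                h3 = True
    # Step 3: the level for H4 depends on the outcome in Family 2.
    if h2 and h3:
        h4 = p4 <= alpha
    elif h2:
        h4 = p4 <= alpha * (1 - gamma) / 2
    return {"H1": h1, "H2": h2, "H3": h3, "H4": h4}


if __name__ == "__main__":
    # Scenario from the text: H2 rejected, H3 not; H4 tested at alpha*(1-gamma)/2.
    print(procedure_2b(p1=0.010, p2=0.010, p3=0.020, p4=0.005))
```

With α = 0.025 and γ = 0.5, the three thresholds are 0.01875 for the larger p-value in Family 2, 0.0125 for the smaller one, and 0.00625 for H4 when only H2 is rejected.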
An important difference between Procedures 2A and 2B is that the latter accounts
for the relevant distributional information and thus improves power of the tests placed
in Family 2 for 0 < γ < 1. The truncated Hochberg procedure in Step 2 simplifies to
the basic Bonferroni procedure if γ = 0.
An obvious weakness in Procedure 2A, which was the motivation for developing
Procedure 2B, is the restrictive α propagation rule; specifically, no forward propagation of the error rate is allowed following rejection of H3 . Increased power can be
obtained by transferring the fraction of the familywise error rate available after testing H3 to H2, as shown in the right panel of Figure 2. To illustrate the concept of more
flexible α propagation rules, a simple Bonferroni-based gatekeeping procedure consistent with this strategy can be constructed using the method developed in Bretz et
al. (2009) and Burman et al. (2009). The procedure employs the following decision
rules:
Procedure 2C
• Step 1. H1 is rejected if p1 ≤ α.
• Step 2. H3 is rejected if p3 ≤ α/2 and H1 is rejected. Further, H2 is rejected if
– p2 ≤ α/2, H1 is rejected and H3 is not rejected.
– p2 ≤ α and both H1 and H3 are rejected.
• Step 3. H4 is rejected if
– p4 ≤ α/2, H2 is rejected and H3 is not rejected.
– p4 ≤ α, H2 is rejected and H3 is rejected.
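The decision rules of Procedure 2C can be condensed into a few lines, since the level used for H2 and H4 depends only on whether H3 has been rejected. The sketch below follows the steps as listed; the function name `procedure_2c` is hypothetical and the default α = 0.025 is the example level from the text.

```python
def procedure_2c(p1, p2, p3, p4, alpha=0.025):
    """Sketch of Procedure 2C: Bonferroni split with alpha passed from H3 to H2."""
    h1 = p1 <= alpha
    h3 = h1 and p3 <= alpha / 2
    # If H3 is rejected, its alpha/2 is transferred to the Dose 2 branch.
    level = alpha if h3 else alpha / 2
    h2 = h1 and p2 <= level
    h4 = h2 and p4 <= level
    return {"H1": h1, "H2": h2, "H3": h3, "H4": h4}


if __name__ == "__main__":
    # p2 and p4 exceed alpha/2 but are significant once H3 passes on its alpha.
    print(procedure_2c(p1=0.010, p2=0.020, p3=0.010, p4=0.020))
```

In the usage example, Procedure 2A would reject neither H2 nor H4 because both p-values exceed α/2 = 0.0125.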
Millen and Dmitrienko (2011) termed such procedures the Bonferroni-based serial
chain procedures. The strength of these procedures lies in the separate specification
of parameters governing the initial α allocation and intermittent α propagation. Since
Procedure 2C is a fully Bonferroni-based procedure, it can be improved upon by incorporating information regarding the joint distribution of the hypothesis test statistics.
While not immediately transparent in the previous description of Procedure 2B, the
underlying rules for transferring the error rate from Family 2 to Family 3 generally
align with the rules used in Procedure 2C (note that the truncated Hochberg procedure reduces to the basic Bonferroni procedure when γ = 0). However, Procedure 2B
incorporates distributional information not utilized in Procedure 2C and, thus, it will
often provide a power advantage compared to Procedure 2C.
4.3
Strategy 3 (Parallel testing)
The final gatekeeping strategy for the diabetes trial is Strategy 3 (left panel in Figure
3). Unlike Strategies 1 and 2, this gatekeeping strategy includes two separate testing
sequences. Each sequence begins with a non-inferiority test and, if a dose is shown to
be non-inferior to the active control, the corresponding superiority test is carried out.
The logical constraints of this testing strategy align well with the trial’s objectives.
It is easy to build a gatekeeping procedure based on splitting the α between the
two sequences and utilizing the fixed-sequence procedure within each sequence:
Procedure 3A
• Step 1. H1 is rejected if p1 ≤ α/2 and H2 is rejected if p2 ≤ α/2.
• Step 2. H3 is rejected if p3 ≤ α/2 and H1 is rejected. Likewise, H4 is rejected
if p4 ≤ α/2 and H2 is rejected.
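Procedure 3A amounts to two independent fixed sequences, each run at α/2. A minimal sketch is given below; the function name `procedure_3a` is hypothetical and the default α = 0.025 is the example level from the text.

```python
def procedure_3a(p1, p2, p3, p4, alpha=0.025):
    """Sketch of Procedure 3A: parallel sequences H1 -> H3 and H2 -> H4 at alpha/2."""
    half = alpha / 2
    h1 = p1 <= half
    h2 = p2 <= half
    h3 = h1 and p3 <= half  # superiority for Dose 1 requires non-inferiority
    h4 = h2 and p4 <= half  # superiority for Dose 2 requires non-inferiority
    return {"H1": h1, "H2": h2, "H3": h3, "H4": h4}


if __name__ == "__main__":
    # The Dose 1 sequence fails at the first step; the Dose 2 sequence is unaffected.
    print(procedure_3a(p1=0.020, p2=0.005, p3=0.001, p4=0.010))
```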
As before, this basic gatekeeping procedure serves as a starting point in the process
of building a powerful gatekeeping strategy. Two important aspects of this hypothesis
testing problem need to be considered to improve power of Procedure 3A.
First, the power of Procedure 3A can be improved by taking into account available
information on the joint distribution of the four hypothesis test statistics. The diagram in the right panel of Figure 3 indicates that the null hypotheses can be grouped
into two families:
• Family 1 includes the null hypotheses H1 and H2 .
• Family 2 includes the null hypotheses H3 and H4 .
[Insert Figure 3 here]
Procedure 3A applies the Bonferroni procedure within Families 1 and 2 and ignores the fact that the joint distribution of the hypothesis test statistics within each
family is known. Specifically, the joint distribution of ti and ti+1 is a bivariate t distribution with the correlation coefficient ρ = 0.5, i = 1, 3. Thus the sponsor can apply
more powerful multiple testing procedures in Families 1 and 2 that take advantage of
this distributional information. In what follows, we will focus on a gatekeeping procedure which uses the truncated Hochberg procedure with a pre-specified truncation
parameter 0 ≤ γ < 1 in Family 1 and the regular Hochberg procedure in Family 2. As
before, the regular Hochberg procedure cannot be used in Family 1 since it would
leave no α for Family 2 unless both null hypotheses are rejected in the first family.
Second, another performance-based principle, i.e., the principle of re-testing, may
be applied to improve power of the resulting Hochberg-based gatekeeping procedure.
Since the Hochberg-based procedure has the two doses tested in parallel (with no
α propagation between the doses), the sponsor can strengthen the Dose 1 tests by
“borrowing power” from the Dose 2 tests and vice versa as shown in the right panel
of Figure 3. This implies that, when superiority is established for Dose 1, the α level
used in this superiority test is transferred from the end of the first sequence to the
beginning of the second sequence and the non-inferiority test for Dose 2 is carried
out again at a higher significance level. This clearly results in a higher likelihood of
showing non-inferiority for this dose. Similarly, if Dose 2 is superior to the active
control, the sponsor can apply a more powerful test to the non-inferiority comparison
between Dose 1 and the active control.
A Hochberg-based gatekeeping procedure with re-testing can be built using a
simple generalization of the mixture method (Dmitrienko and Tamhane, 2011). The
gatekeeping procedure utilizes the following decision rules (derived in the Appendix):
Procedure 3B
• Step 1. The null hypotheses in Family 1 are tested using the truncated Hochberg
procedure at the full α level. Let p(1) < p(2) denote the ordered p-values in
Family 1 and let H(1) and H(2) denote the associated null hypotheses. The
truncated Hochberg procedure rejects both H(1) and H(2) if p(2) ≤ α(1 + γ)/2
or it rejects only H(1) if p(1) ≤ α/2 and p(2) > α(1 + γ)/2.
• Step 2. The null hypotheses in Family 2 are tested as follows:
– The null hypotheses are tested using the regular Hochberg procedure at
the full α level if both H(1) and H(2) are rejected. Let p(3) < p(4) denote the
ordered p-values in Family 2 and let H(3) and H(4) denote the associated
null hypotheses. The Hochberg procedure rejects both H(3) and H(4) if
p(4) ≤ α or it rejects only H(3) if p(3) ≤ α/2 and p(4) > α.
– The null hypothesis logically related to H(1) is tested at α(1 − γ)/2 if H(1)
is rejected and H(2) is not rejected.
• Step 3. If only one null hypothesis is rejected in Family 1 and only one null
hypothesis is rejected in Family 2, the remaining null hypothesis in Family 1 is
re-tested at the full α level. If this null hypothesis is rejected, the associated
null hypothesis in Family 2 is tested at the full α level.
To illustrate the re-testing step in this algorithm, consider the following scenario.
Suppose that p2 < p1 and thus H(1) = H2 and H(2) = H1 . Further, suppose that only
H2 is rejected in Step 1. Since H3 is logically related to H1 , only one null hypothesis
is tested in Step 2, namely, H4 . If this null hypothesis is rejected, H1 can be re-tested
at the full α level and, if this test is significant, H3 is then tested at the full α level.
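The three steps of Procedure 3B, including the re-testing step, can be sketched as follows. This is one reading of the decision rules as listed above, not the authors' implementation or the generalized mixture method; the function name `procedure_3b`, the helper dictionary `sup` and the default γ = 0.5 are hypothetical.

```python
def procedure_3b(p1, p2, p3, p4, alpha=0.025, gamma=0.5):
    """Sketch of Procedure 3B: truncated Hochberg, Hochberg, then re-testing."""
    if not 0 <= gamma < 1:
        raise ValueError("gamma must satisfy 0 <= gamma < 1")
    p = {"H1": p1, "H2": p2, "H3": p3, "H4": p4}
    sup = {"H1": "H3", "H2": "H4"}  # superiority test linked to each dose
    rej = {name: False for name in p}

    # Step 1: truncated Hochberg in Family 1 at the full alpha level.
    first, second = sorted(("H1", "H2"), key=p.get)  # first has the smaller p
    if p[second] <= alpha * (1 + gamma) / 2:
        rej[first] = rej[second] = True
    elif p[first] <= alpha / 2:
        rej[first] = True

    if rej["H1"] and rej["H2"]:
        # Step 2, first case: regular Hochberg in Family 2 at the full alpha level.
        low, high = sorted(("H3", "H4"), key=p.get)
        if p[high] <= alpha:
            rej[low] = rej[high] = True
        elif p[low] <= alpha / 2:
            rej[low] = True
    elif rej[first]:
        # Step 2, second case: only the logically related hypothesis is tested.
        rej[sup[first]] = p[sup[first]] <= alpha * (1 - gamma) / 2
        if rej[sup[first]]:
            # Step 3: re-test the remaining Family 1 hypothesis at full alpha,
            # then its associated Family 2 hypothesis at full alpha.
            rej[second] = p[second] <= alpha
            if rej[second]:
                rej[sup[second]] = p[sup[second]] <= alpha
    return rej


if __name__ == "__main__":
    # Scenario from the text: p2 < p1 and only H2 is rejected in Step 1.
    print(procedure_3b(p1=0.020, p2=0.005, p3=0.010, p4=0.004))
```

In the usage example, H1 is not rejected in Step 1 (0.020 exceeds 0.01875) but is rejected on re-testing at the full α level after H4 is rejected, which in turn allows H3 to be tested.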
5
Comparison of gatekeeping strategies
The third principle defined in Section 2 states that the candidate gatekeeping procedure which optimizes a pre-specified application-specific criterion under trial-specific
assumptions should be selected. In this section, we give a detailed discussion of
the procedure selection process, define criteria for comparing candidate gatekeeping
procedures and provide illustrations based on the diabetes trial example.
Consider a general hypothesis testing problem in a clinical trial with two families
of null hypotheses (this setting is easily extended to trials with three or more families
of null hypotheses). Let H1, ..., Hk and Hk+1, ..., Hk+l denote the null hypotheses
included in Families 1 and 2, respectively, where k ≥ 1 and l ≥ 1. Further, let
r1 , . . . , rm denote the rejection indicators for the m = k + l null hypotheses, i.e.,
ri = 1 when Hi is rejected and 0 otherwise. To simplify the presentation, we assume
that all null hypotheses are false.
Criteria for evaluating multiple testing procedures were discussed in multiple publications, including Westfall and Krishen (2001), Senn and Bretz (2007) and Millen
and Dmitrienko (2011). These criteria are typically constructed in terms of
• Disjunctive power function (probability of rejecting at least one null hypothesis
given specific alternatives).
• Conjunctive power function (probability of rejecting all null hypotheses given
specific alternatives).
In addition to these basic criteria, the sponsor can consider the following three
criteria. First of all, a criterion can be defined in terms of the probability of rejecting
at least one null hypothesis in each family, i.e.,
\[
P\left( \sum_{i=1}^{k} r_i \ge 1 \ \text{ and } \ \sum_{i=k+1}^{m} r_i \ge 1 \right).
\]
A general version of this criterion was discussed in Millen and Dmitrienko (2011),
who labeled this as “subset disjunctive power”. It was also used in Brechenmacher et
al. (2011, Section 4) for selecting parameters of a complex gatekeeping procedure.
An important property of this criterion is that the criterion function tends to 0 as
either marginal disjunctive probability (for Family 1 or Family 2) approaches 0.
An alternative criterion allows the trial sponsor to account for the relative importance of establishing a significant treatment effect in each family. Let w1 and w2
denote the weights reflecting the importance of Families 1 and 2 (the weights are
positive and add up to 1). The criterion is then based on
\[
w_1 P\left( \sum_{i=1}^{k} r_i \ge 1 \right) + w_2 P\left( \sum_{i=k+1}^{m} r_i \ge 1 \right).
\]
This criterion may be considered when rejecting null hypotheses in multiple families
is not a requirement for trial success, but remains a desirable outcome. Because both
weights are positive, this criterion, unlike the previous one, does not tend to 0 when
only one of the marginal disjunctive probabilities (for Family 1 or Family 2)
approaches 0.
Finally, a criterion can be defined based on a weighted sum of power functions for
the individual hypothesis tests. Let vi denote the weight representing the importance
of rejecting the null hypothesis Hi , i = 1, . . . , m. The associated criterion is based on
\[
\sum_{i=1}^{m} v_i E(r_i),
\]
where the expectations are evaluated under the alternative hypotheses. This criterion
focuses on the average number of significant tests in Families 1 and 2 and can be
termed the expectation criterion. This criterion needs to be contrasted with the
exceedence criteria defined above. The exceedence criteria are formulated in terms
of probability of exceeding a threshold for the number of null hypotheses rejected in
each family.
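Given simulated rejection indicators, the criteria described above can be estimated as simple averages. The following sketch assumes that each simulation run is stored as a tuple (r1, ..., rm) with the first k entries belonging to Family 1; the function name `power_criteria`, the weights and the four runs are hypothetical and serve only to illustrate the definitions.

```python
def power_criteria(rejections, k, family_weights, hypothesis_weights):
    """Estimate power criteria from simulated rejection indicators.

    rejections: list of tuples (r_1, ..., r_m) of 0/1 indicators, one per run.
    k: number of null hypotheses in Family 1 (the remaining ones form Family 2).
    """
    n = len(rejections)
    m = len(rejections[0])
    fam1 = [sum(r[:k]) >= 1 for r in rejections]  # at least one rejection in Family 1
    fam2 = [sum(r[k:]) >= 1 for r in rejections]  # at least one rejection in Family 2
    w1, w2 = family_weights
    return {
        "disjunctive": sum(a or b for a, b in zip(fam1, fam2)) / n,
        "conjunctive": sum(sum(r) == m for r in rejections) / n,
        "subset_disjunctive": sum(a and b for a, b in zip(fam1, fam2)) / n,
        "weighted_family": w1 * sum(fam1) / n + w2 * sum(fam2) / n,
        "expectation": sum(
            v * sum(r[i] for r in rejections) / n
            for i, v in enumerate(hypothesis_weights)
        ),
    }


# Four hypothetical simulation runs with k = 2 hypotheses in each family.
RUNS = [(1, 1, 1, 1), (1, 0, 0, 0), (0, 0, 0, 0), (1, 0, 1, 0)]
CRITERIA = power_criteria(RUNS, k=2, family_weights=(0.6, 0.4),
                          hypothesis_weights=(0.4, 0.3, 0.2, 0.1))
for name, value in CRITERIA.items():
    print(f"{name}: {value:.3f}")
```

In this toy example the second run contributes to the disjunctive criterion but not to the subset disjunctive criterion, since no null hypothesis is rejected in Family 2.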
Given a relevant criterion, the sponsor selects the gatekeeping procedure which
maximizes the criterion function given certain assumptions on the joint distribution of
the hypothesis test statistics, e.g., effect sizes for the individual tests and correlations
among them. A sensitivity analysis is recommended to assess the robustness of this
algorithm, i.e., it is important to understand the impact of small changes in the
assumptions on the final decision.
In order to choose a criterion most suitable for a particular application, it is important to understand how the sponsor defines trial success. For example, returning
to the diabetes trial, this trial is deemed successful if at least one dose of the new
treatment demonstrates non-inferiority to the active control. In other words, superior efficacy is considered a desirable rather than a required outcome as the treatment
offers considerable benefits in mode of administration over standard, available treatments. Given these considerations, the primary analysis is defined in terms of the
two non-inferiority tests (null hypotheses H1 and H2 ) and the basic conjunctive and
disjunctive criteria are defined as follows:
P (Reject H1 and H2 ) and P (Reject H1 or H2 ).
The disjunctive criterion is consistent with the sponsor’s definition of a successful
trial and will be used to select among the six candidate gatekeeping procedures.
To evaluate the performance of the six procedures defined in Section 4 with respect
to the disjunctive criterion, a simulation study was conducted. It utilized the following
set of assumptions:
• A balanced design with n = 480 patients per treatment arm.
• The joint distribution of the test statistics t1, t2, t3 and t4 under the alternative
hypotheses was multivariate normal with means
\[
\mu_1 = \frac{\delta_1 + \lambda}{\sigma\sqrt{2/n}}, \quad
\mu_2 = \frac{\delta_2 + \lambda}{\sigma\sqrt{2/n}}, \quad
\mu_3 = \frac{\delta_1}{\sigma\sqrt{2/n}}, \quad
\mu_4 = \frac{\delta_2}{\sigma\sqrt{2/n}},
\]
where σ = 1.2 was a common standard deviation and λ = 0.15 was a non-inferiority margin.
• The non-inferiority and superiority tests are carried out using the same population (i.e., ITT population). The correlation matrix, therefore, was given
by
ρ12 = ρ34 = 1/2, ρ13 = ρ24 = 1, ρ14 = ρ23 = 1/2,
where ρij = corr(ti, tj).
Note that the sample size of n = 480 patients per treatment arm guarantees
90% marginal power for each non-inferiority test and 25% marginal power for each
superiority test at a one-sided α = 0.025 when δ1 = δ2 = 0.1. With δ1 = δ2 = 0.2,
the power of the non-inferiority and superiority tests increases to 99% and 73%,
respectively. Further, the truncation parameter γ used in Procedures 2B and 3B was
set to γ = 0.5. More discussion on the choice of this procedure parameter is provided
in Section 6.
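The marginal power values quoted above can be checked with a short calculation. The sketch below assumes that each test statistic is normally distributed with unit variance and that the standard error of a mean treatment difference is σ√(2/n); the function name `marginal_power` is hypothetical.

```python
import math
from statistics import NormalDist


def marginal_power(effect, n=480, sigma=1.2, alpha=0.025):
    """One-sided power of a single z-test with n patients per arm.

    effect is delta + lambda for a non-inferiority test and delta for a
    superiority test (assumed standard error: sigma * sqrt(2 / n)).
    """
    se = sigma * math.sqrt(2.0 / n)
    z_crit = NormalDist().inv_cdf(1 - alpha)
    return NormalDist().cdf(effect / se - z_crit)


MARGIN = 0.15  # non-inferiority margin lambda
for delta in (0.1, 0.2):
    ni = marginal_power(delta + MARGIN)
    sup = marginal_power(delta)
    print(f"delta = {delta}: non-inferiority {ni:.2f}, superiority {sup:.2f}")
```

Under these assumptions the computed values agree, after rounding, with the 90% and 25% power at δ = 0.1 and the 99% and 73% power at δ = 0.2.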
Multiple sets of the mean treatment differences δ1 and δ2 were included in the
simulation study. Here we will focus on the simulation results for the following three
scenarios to illustrate key points:
• Scenario 1: δ1 = 0.1, δ2 = 0.
• Scenario 2: δ1 = 0.2, δ2 = 0.1.
• Scenario 3: δ1 = 0.1, δ2 = 0.1.
Table 1 lists the disjunctive power as well as power of each individual hypothesis
test for the three scenarios at a one-sided global familywise error rate α = 0.025 based
on 100,000 simulation runs.
[Insert Table 1 here]
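A simulation of the disjunctive criterion can be sketched as follows for the truncated Hochberg test applied to the two non-inferiority hypotheses in Family 1 (the first step of Procedure 3B). The sketch is not the simulation code used for Table 1: it assumes unit-variance normal test statistics with correlation 1/2 and standard error σ√(2/n), and the function name, seed and structure are hypothetical. Since re-testing cannot change whether at least one of H1 and H2 is rejected, only the first step is needed for this criterion.

```python
import math
import random
from statistics import NormalDist


def simulate_disjunctive_power(delta1, delta2, gamma=0.5, alpha=0.025, n=480,
                               sigma=1.2, margin=0.15, n_sims=100_000, seed=2011):
    """Estimate P(reject H1 or H2) for the truncated Hochberg test in Family 1."""
    se = sigma * math.sqrt(2.0 / n)  # assumed standard error
    inv = NormalDist().inv_cdf
    crit_small = inv(1 - alpha / 2)                # smaller p-value vs alpha / 2
    crit_large = inv(1 - alpha * (1 + gamma) / 2)  # larger p-value vs alpha (1 + gamma) / 2
    mu1, mu2 = (delta1 + margin) / se, (delta2 + margin) / se
    rng = random.Random(seed)
    root_half = math.sqrt(0.5)
    hits = 0
    for _ in range(n_sims):
        shared = rng.gauss(0, 1)  # shared component induces correlation 1/2
        t1 = mu1 + root_half * (shared + rng.gauss(0, 1))
        t2 = mu2 + root_half * (shared + rng.gauss(0, 1))
        # At least one rejection: the larger statistic clears alpha / 2, or
        # both statistics clear the threshold for the larger p-value.
        if max(t1, t2) >= crit_small or min(t1, t2) >= crit_large:
            hits += 1
    return hits / n_sims


POWER_S1 = simulate_disjunctive_power(0.1, 0.0)  # Scenario 1
POWER_S3 = simulate_disjunctive_power(0.1, 0.1)  # Scenario 3
print(f"Scenario 1: {POWER_S1:.3f}, Scenario 3: {POWER_S3:.3f}")
```

The estimates can be compared with the disjunctive power reported for Procedure 3B in Table 1.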
Scenario 1 assumes that Dose 1 is more efficacious than Dose 2 and Table 1 shows
that the four gatekeeping procedures associated with Strategies 1 and 2 maximize the
disjunctive criterion in this case (they provide a 4 percentage point improvement over Procedures 3A
and 3B). This is to be expected since Strategies 1 and 2 were specifically developed for
the case when the effect size for Dose 1 is greater than that for Dose 2. A comparison
of the power values for the individual null hypotheses reveals that, even though Procedure 1 performs well with respect to the disjunctive criterion, the probability of
establishing non-inferiority for Dose 2 is only 19%. This is due to the inflexible serial
testing approach. Among the three gatekeeping procedures developed for Strategy 2,
Procedure 2B provides a uniform improvement over the other two procedures since it
fully utilizes available distributional information (recall that Procedures 2A and 2C
are both based on the conservative Bonferroni procedure).
The general performance of the candidate gatekeeping procedures in Scenario 2 is
similar to that in Scenario 1 since Dose 1 is again assumed to be more efficacious than
Dose 2. The main difference between the two scenarios is that there is no separation
among the six procedures in terms of disjunctive power in Scenario 2 (all procedures
achieve 99% power). Further, Procedures 2B and 3B offer an improvement over the
other procedures with respect to the rejection probabilities for the individual null
hypotheses since these procedures incorporate additional distributional information.
Finally, the effect sizes for the two doses are equal in Scenario 3 and this puts the
first four gatekeeping procedures at a disadvantage. Procedures 3A and 3B treat the
two doses as equally important and, as a consequence, they offer superior performance
compared to the other candidate procedures. Further, Procedure 3B demonstrates
power gains for all four individual tests compared to Procedure 3A. This power advantage arises because Procedure 3B is based on a semiparametric
procedure and also utilizes re-testing.
Based on the three scenarios presented in Table 1, Procedures 2B and 3B are clearly
the preferred choices for meeting the sponsor’s needs in the diabetes clinical trial.
6
Selection of procedure parameters
Once a general gatekeeping strategy is selected, the next important step involves specification of procedure parameters, e.g., hypothesis weights, family weights, truncation
parameters. Thus far, our presentation has assumed somewhat arbitrary parameter
values when comparing the candidate gatekeeping procedures. However, it should
be clear that optimal parameter selection is a critical part of developing gatekeeping
procedures. In this section, we discuss the parameter selection process in relation to
the diabetes trial example.
Both Procedures 2B and 3B rely on a truncation parameter γ. The value of
this parameter was set to 0.5 up to this point. To identify optimal values of γ, we
again rely on criteria introduced in Section 5, noting that the selected criterion must
align with the trial’s objectives. As an illustration, we focus on selection of γ for
Procedure 2B. The parameter selection process with Procedure 3B is analogous.
Using the set of assumptions defined in Section 5, we first compute the probability
of rejecting the null hypotheses H2 , H3 and H4 as a function of γ when the true mean
treatment differences are given by δ1 = 0.2 and δ2 = 0.2 (note that the probability of
rejecting H1 does not depend on γ). Figure 4 displays the marginal power functions
of the three hypothesis tests computed from 100,000 simulation runs at a one-sided
α = 0.025. The relationship between the marginal power functions and γ is generally
fairly complex. Beginning with the null hypotheses in Family 2, the effect size of the
non-inferiority test for Dose 2 (H2 ) is substantially greater than that of the superiority
test for Dose 1 (H3 ) and thus the p-value associated with the former is virtually
guaranteed to be less than the p-value associated with the latter. As a result, the
truncated Hochberg procedure used in Family 2 tests H2 at α/2 most of the time.
Since this significance level is independent of γ, the marginal power function of Test 2
in Figure 4 is constant. On the other hand, the significance level used for testing H3
is very likely to be α(1 + γ)/2, which causes the marginal power function of Test 3 to
increase with γ. Further, considering the superiority test for
Dose 2 (H4 ) in Family 3, recall that the full α level is carried over to H4 only if both
H2 and H3 are rejected, otherwise H4 is tested at α(1 − γ)/2. In this particular case,
H4 is likely to be tested at a fraction of α which decreases with increasing γ. As a
consequence, power of Test 4 in Figure 4 is monotonically decreasing as a function of
γ.
[Insert Figure 4 here]
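The dependence of the marginal power functions on γ can be explored with a simulation along the following lines. This is a sketch under stated assumptions, not the code behind Figure 4: the decision rules are obtained by closed testing from the local p-values listed in Table 2, the standard error is taken to be σ√(2/n), each superiority statistic is perfectly correlated with the non-inferiority statistic for the same dose, and all function names, the seed and the number of runs are hypothetical.

```python
import math
import random
from statistics import NormalDist

ALPHA, N, SIGMA, MARGIN = 0.025, 480, 1.2, 0.15
SE = SIGMA * math.sqrt(2.0 / N)  # assumed standard error of a mean difference
PHI = NormalDist()


def local_pvalues_2b(p2, p3, p4, gamma):
    """Local p-values of Table 2, keyed by index set."""
    lo, hi = min(p2, p3), max(p2, p3)
    # p4 / (1 - gamma) is unbounded at gamma = 1 (no alpha is set aside for H4).
    p4_scaled = p4 / (1 - gamma) if gamma < 1 else math.inf
    hochberg = 2 * min(lo, hi / (1 + gamma))
    return {
        (2, 3, 4): hochberg, (2, 3): hochberg,
        (2, 4): 2 * p2 / (1 + gamma), (2,): 2 * p2 / (1 + gamma),
        (3, 4): 2 * min(p3 / (1 + gamma), p4_scaled),
        (3,): 2 * p3 / (1 + gamma), (4,): p4,
    }


def reject_2b(p1, p2, p3, p4, gamma, alpha=ALPHA):
    """Rejection indicators (r2, r3, r4); H1 acts as a serial gatekeeper."""
    if p1 > alpha:
        return (0, 0, 0)
    local = local_pvalues_2b(p2, p3, p4, gamma)
    return tuple(
        int(all(p <= alpha for idx, p in local.items() if i in idx))
        for i in (2, 3, 4)
    )


def simulate_pvalues(delta1, delta2, n_sims, seed):
    """Draw one-sided p-values (p1, p2, p3, p4) under the alternative."""
    rng = random.Random(seed)
    root_half = math.sqrt(0.5)
    draws = []
    for _ in range(n_sims):
        shared = rng.gauss(0, 1)  # induces correlation 1/2 between doses
        z1 = root_half * (shared + rng.gauss(0, 1))
        z2 = root_half * (shared + rng.gauss(0, 1))
        stats = (z1 + (delta1 + MARGIN) / SE, z2 + (delta2 + MARGIN) / SE,
                 z1 + delta1 / SE, z2 + delta2 / SE)
        draws.append(tuple(1 - PHI.cdf(t) for t in stats))
    return draws


def marginal_power_2b(pvalues, gamma):
    totals = [0, 0, 0]
    for p1, p2, p3, p4 in pvalues:
        for j, r in enumerate(reject_2b(p1, p2, p3, p4, gamma)):
            totals[j] += r
    return [t / len(pvalues) for t in totals]


PVALUES = simulate_pvalues(0.2, 0.2, n_sims=20_000, seed=1)
POWER = {g: marginal_power_2b(PVALUES, g) for g in (0.0, 0.25, 0.5, 0.75, 1.0)}
for g, (pw2, pw3, pw4) in POWER.items():
    print(f"gamma = {g:.2f}: H2 {pw2:.3f}, H3 {pw3:.3f}, H4 {pw4:.3f}")
```

The same simulated p-values are reused for every value of γ so that differences between the power estimates reflect the truncation parameter rather than simulation noise.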
To choose an optimal value of the truncation parameter γ in the diabetes trial, a
criterion needs to be chosen which provides a trade-off between the power functions
of the two tests in Family 2 and the power function of the test in Family 3. In what
follows, we will evaluate the performance of two criteria and choose the criterion
which is consistent with this objective.
We will first examine the basic exceedence criterion (referred to as Criterion 1).
The criterion function ψ(γ) is defined in this case as the probability of rejecting at
least one null hypothesis in Family 2 and the null hypothesis in Family 3, i.e.,
ψ(γ) = P (r2 + r3 ≥ 1 and r4 = 1).
Due to the logical relationships between the non-inferiority and superiority objectives,
we have
{r2 + r3 ≥ 1 and r4 = 1} = {r4 = 1}.
This means that the criterion function (and the choice of γ) is driven primarily by
the superiority test for Dose 2. The associated criterion function is plotted in the
left panel of Figure 5. The function is maximized when γ = 0, which should not
be surprising since the underlying criterion aims at improving the marginal power of
Test 4 without taking into account power of Test 3. This immediately implies that
this exceedence criterion does not provide a desirable trade-off between the power
functions of the hypothesis tests in Families 2 and 3 and should not be used in the
diabetes trial.
Another important limitation of Criterion 1 is that it does not differentiate between the rejection of only one null hypothesis and the rejection of both null hypotheses in
Family 2. Recall that the superiority test for Dose 1 is underpowered and thus the
first outcome is likely to imply that only non-inferiority is established for Dose 2.
From the sponsor’s perspective, this outcome is clearly less desirable than demonstrating non-inferiority for Dose 2 and superiority for Dose 1 and this information
needs to be taken into account in order to define a meaningful parameter criterion.
One way to accomplish this is by partitioning the event {r2 + r3 ≥ 1} into two events
as follows
{r2 + r3 ≥ 1} = {r2 + r3 = 1} + {r2 + r3 = 2}.
Since H4 cannot be rejected unless H2 is rejected, this implies that
{r2 + r3 ≥ 1 and r4 = 1} = {r2 = 1 and r3 = 1 and r4 = 1}
+ {r2 = 1 and r3 = 0 and r4 = 1}.
If the relative importance of the two events can be quantified based, for example,
on the expected market share of the treatment in each event state, the weighted
exceedence criterion (referred to as Criterion 2) can be defined based on the following
function
ψ(γ) = w1 P (r2 = 1 and r3 = 1 and r4 = 1)
+ w2 P (r2 = 1 and r3 = 0 and r4 = 1),
where w1 and w2 are the weights assigned to the two events (here the weights are also
positive and add up to 1).
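The relationship between the two criteria can be made explicit with a short derivation. The labels ψ1 and ψ2 for the Criterion 1 and Criterion 2 functions are introduced here for illustration only. Since H4 cannot be rejected unless H2 is rejected, the event {r4 = 1} is the union of the two disjoint events that enter Criterion 2, and the event {r2 = 1 and r3 = 1 and r4 = 1} coincides with {r3 = 1 and r4 = 1}:

```latex
\begin{align*}
\psi_1(\gamma) &= P(r_4 = 1)
  = P(r_3 = 1,\ r_4 = 1) + P(r_2 = 1,\ r_3 = 0,\ r_4 = 1), \\
\psi_2(\gamma) &= w_1 P(r_3 = 1,\ r_4 = 1) + w_2 P(r_2 = 1,\ r_3 = 0,\ r_4 = 1) \\
  &= w_2\, \psi_1(\gamma) + (w_1 - w_2)\, P(r_3 = 1,\ r_4 = 1).
\end{align*}
```

When w1 > w2, Criterion 2 therefore adds to a rescaled Criterion 1 a term that rewards the simultaneous rejection of H3 and H4, and with w1 = w2 = 1/2 it reduces to one half of Criterion 1.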
To compare the two criteria, we will apply Criterion 2 with w1 = 0.7 and w2 = 0.3.
Note that w1 is much greater than the other weight to emphasize the value of achieving
superiority for both doses (rejecting both H3 and H4 rather than only H4 ). A plot of
ψ(γ) is displayed in the right panel of Figure 5 (it is worth noting that this criterion
function is based on a weighted sum of probabilities and it should not be directly
compared to the other criterion function). The criterion function achieves a maximum
around γ = 0.6. However, the function is generally flat and thus any value of γ from,
say, [0.4, 0.8] can be chosen as an optimal value. These values of the truncation
parameter γ make more intuitive sense than γ = 0 obtained from Criterion 1 and
guarantee high power of the superiority test for Dose 2 without sacrificing power of the
superiority test for Dose 1. In general, optimal values of the truncation parameter (or
any other procedure parameter) depend on the assumptions used in the simulations
and, as indicated in Section 5, it is recommended to perform a sensitivity analysis to
understand the impact of these assumptions on the conclusions.
[Insert Figure 5 here]
Lastly, we note that each of the parallel testing procedures introduced thus far
(Procedures 3A and 3B) was developed and applied under an assumption of equal
weights. In other words, H1 and H2 are equally weighted within Family 1 and,
similarly, H3 and H4 are equally weighted within Family 2. Performance of these procedures (relative to a pre-defined criterion) may change with different assumptions on
the hypothesis weights. For example, Procedures 3A and 3B performed poorly when
the assumed effect sizes for the two doses differed but their performance improved
when the effect sizes were equal. Unequal weighting in the case of unequal effect sizes
can boost the performance of the procedures. Optimal values of hypothesis weights
may be found via simulations, similar to the optimal selection of the truncation parameter or other procedure parameters. For a more complete treatment of this idea
of optimal weight selection, see Millen and Dmitrienko (2011).
7
Discussion
As was stated in the Introduction, multiple classes of gatekeeping procedures have
been developed over the past ten years. These gatekeeping procedures may be tailored
to address the hierarchical objectives encountered in clinical drug development. There
has been little guidance, however, available on the selection of appropriate procedures
and the optimization of those procedures (via selection of procedure parameters). In
this paper we offer practical principles to guide clinical trial sponsors through the
process of developing gatekeeping strategies. Application of these principles ensures
that the resulting gatekeeping strategy is both relevant and powerful. The term
power, as it is used here, is not a generic one-size-fits-all concept. It refers to the
likelihood of attaining application-specific success criteria. No matter how simple or
complex the objectives of a particular clinical trial, a gatekeeping procedure which
satisfies the sponsor’s requirements for relevance and power will support appropriate
clinical trial inferences.
In practice, clinical trial sponsors often put forth gatekeeping strategies, much
as we did in Section 4, without first fully considering and documenting all logical
relationships among the null hypotheses of interest. As a result, mathematically valid
gatekeeping strategies may be proposed that incorporate illogical constraints on the
decision rules. Focusing first on the clinically meaningful logical relationships among
the null hypotheses facilitates the development process. Once the space of candidate
gatekeeping strategies is reasonably reduced, comparisons among the strategies can
be performed and optimization of the associated gatekeeping procedures via selection
of procedure parameters may occur.
Another common occurrence in the practice of defining multiple testing procedures is the focus on nonparametric, often Bonferroni-based, gatekeeping procedures.
This is done, primarily, due to the simplicity offered by such approaches. Yet for
many clinical trial applications, the sponsor has knowledge of the joint distribution of
the hypothesis test statistics. Failure to incorporate this information results in gatekeeping procedures which may provide inferior performance. We recommend using
Bonferroni-based procedures as a starting point in developing gatekeeping strategies
and then extending the strategies via incorporation of parametric or semiparametric
procedures. This allows the sponsor to exploit the simplicity of the Bonferroni-based
approaches in development of general gatekeeping strategies without sacrificing performance.
Lastly, it is important to perform a thorough evaluation and comparison of candidate gatekeeping procedures (including selection of procedure parameters) to arrive
at a procedure which optimally satisfies pre-defined application-specific metrics. This
step, often done via simulation studies, helps ensure an understanding of the performance trade-offs for the various candidate procedures (and procedure parameters).
The ability to quantify and articulate the expected performance of candidate procedures is a vehicle for broad alignment on statistical testing strategies within a
clinical development team. Indeed, the common use of serial gatekeeping procedures
is rooted in the aversion of many clinical development scientists (including statisticians) to procedures based on distributing α among multiple objectives. Yet in many
applications α-splitting approaches turn out to be superior to the basic serial testing
approaches. The need for clear communication of procedure performance properties
is only expanded when considering complex gatekeeping strategies.
The general principles and considerations outlined in this paper are intended to
give practical guidance for development of gatekeeping strategies. As the inputs
affecting procedure performance are multifactorial, application of the principles is
insufficient to guarantee selection of the optimal procedure, if one exists. However,
they ensure clinically relevant gatekeeping procedures with desired performance by
systematically ruling out broad groups of candidate procedures which would offer
inferior inference due to low power or incorrect logical constraints.
References
[1] Bauer, P., Röhmel, J., Maurer, W., Hothorn, L. (1998). Testing strategies in
multi-dose experiments including active control. Statistics in Medicine. 17, 2133–2146.
[2] Brechenmacher, T., Xu, J., Dmitrienko, A., Tamhane, A.C. (2011). A mixture
gatekeeping procedure based on the Hommel test for clinical trial applications.
Journal of Biopharmaceutical Statistics. 21, 748–767.
[3] Bretz, F., Maurer, W., Brannath, W., Posch, M. (2009). A graphical approach to
sequentially rejective multiple test procedures. Statistics in Medicine. 28, 586–604.
[4] Burman, C.-F., Sonesson, C., Guilbaud, O. (2009). A recycling framework for
the construction of Bonferroni-based multiple tests. Statistics in Medicine. 28,
739–761.
[5] Dmitrienko, A., Offen, W.W., Westfall, P.H. (2003). Gatekeeping strategies for
clinical trials that do not require all primary effects to be significant. Statistics
in Medicine. 22, 2387–2400.
[6] Dmitrienko, A., Offen, W., Wang, O., Xiao, D. (2006). Gatekeeping procedures in
dose-response clinical trials based on the Dunnett test. Pharmaceutical Statistics.
5, 19–28.
[7] Dmitrienko, A., Wiens, B.L., Tamhane, A.C., Wang, X. (2007). Tree-structured
gatekeeping tests in clinical trials with hierarchically ordered multiple objectives.
Statistics in Medicine. 26, 2465–2478.
[8] Dmitrienko, A., Tamhane, A.C., Liu, L., Wiens, B.L. (2008). A note on tree
gatekeeping procedures in clinical trials. Statistics in Medicine. 27, 3446–3451.
[9] Dmitrienko, A., Bretz, F., Westfall, P.H., Troendle, J., Wiens, B.L., Tamhane,
A.C., Hsu, J.C. (2009). Multiple testing methodology. Multiple Testing Problems
in Pharmaceutical Statistics. Dmitrienko, A., Tamhane, A.C., Bretz, F. (editors).
Chapman and Hall/CRC Press, New York.
[10] Dmitrienko, A., Tamhane, A.C. (2009). Gatekeeping procedures in clinical trials. Multiple Testing Problems in Pharmaceutical Statistics. Dmitrienko, A.,
Tamhane, A.C., Bretz, F. (editors). Chapman and Hall/CRC Press, New York.
[11] Dmitrienko, A., Tamhane, A.C. (2011). Mixtures of multiple testing procedures
for gatekeeping applications in clinical trials. Statistics in Medicine. 30, 1473–1488.
[12] Dmitrienko, A., Kordzakhia, G., Tamhane, A.C. (2011). Multistage and mixture
gatekeeping procedures in clinical trials. Journal of Biopharmaceutical Statistics.
21, 726–747.
[13] Guilbaud, O. (2007). Bonferroni parallel gatekeeping – Transparent generalizations, adjusted p-values, and short direct proofs. Biometrical Journal. 49, 917–927.
[14] Hochberg, Y., Tamhane, A.C. (1987). Multiple Comparison Procedures. Wiley,
New York.
[15] Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics. 6, 65–70.
[16] Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple significance
testing. Biometrika. 75, 800–802.
[17] Hung, H.M.J., Wang, S.J. (2009). Some controversial multiple testing problems
in regulatory applications. Journal of Biopharmaceutical Statistics. 19, 1–11.
[18] Liu, Y., Hsu, J.C. (2009). Testing for efficacy in primary and secondary endpoints
by partitioning decision paths. Journal of the American Statistical Association.
104, 1661–1670.
[19] Marcus, R., Peritz, E., Gabriel, K.R. (1976). On closed testing procedures with
special reference to ordered analysis of variance. Biometrika. 63, 655–660.
[20] Maurer, W., Hothorn, L. A., Lehmacher, W. (1995). Multiple comparisons in
drug clinical trials and preclinical assays: a priori ordered hypotheses. Biometrie
in der Chemisch-in-Pharmazeutischen Industrie. 6. Vollman, J. (editor). Fischer-Verlag, Stuttgart.
[21] Millen, B., Dmitrienko, A. (2011). Chain procedures: A class of flexible closed
testing procedures with clinical trial applications. Statistics in Biopharmaceutical
Research. 3, 14–30.
[22] O’Neill, R.T. (1997). Secondary endpoints cannot be validly analyzed if the
primary endpoint does not demonstrate clear statistical significance. Controlled
Clinical Trials. 18, 550–556.
[23] Sarkar, S., Chang, C.K. (1997). Simes’ method for multiple hypothesis testing
with positively dependent test statistics. Journal of the American Statistical Association. 92, 1601–1608.
[24] Sarkar, S.K. (1998). Some probability inequalities for censored MTP2 random
variables: A proof of the Simes conjecture. The Annals of Statistics. 26, 494–504.
[25] Senn, S., Bretz, F. (2007). Power and sample size when multiple endpoints are
considered. Pharmaceutical Statistics. 6, 161–170.
[26] Westfall, P. H., Krishen, A. (2001). Optimally weighted, fixed sequence, and
gatekeeping multiple testing procedures. Journal of Statistical Planning and Inference. 99, 25–40.
Appendix
Procedures 2B and 3B are defined using the mixture method introduced in Dmitrienko
and Tamhane (2011). This method relies on the closure principle (Marcus, Peritz
and Gabriel, 1976) and involves specification of decision rules for all intersection
hypotheses associated with the null hypotheses of interest.
A.1.
Procedure 2B
Procedure 2B is defined by specifying a multiple testing procedure which controls the
global familywise error rate and then demonstrating that the underlying decision rules
used in this procedure are identical to those given in Section 4. Since H1 in Family 1
serves as a serial gatekeeper, it is always tested at the full α level and this α level is
carried over to Family 2 if H1 is rejected. Thus it is sufficient to construct a multiple
testing procedure which protects the familywise error rate for the remaining three
null hypotheses at the α level. This procedure is built using the mixture method.
To apply the mixture method, consider the seven intersection hypotheses associated with H2 , H3 and H4 . Each intersection hypothesis is identified using a non-empty
index set I ⊆ {2, 3, 4}, e.g., I = {2, 3} corresponds to the intersection of H2 and H3 .
The decision rule for each intersection hypothesis is defined using a local p-value. The
local p-values for the intersection hypotheses are listed in Table 2. The local p-values
are computed from the p-values associated with the truncated Hochberg procedure in
Family 2 and univariate testing procedure in Family 3 based on the following general
principles:
• Case 1: If an intersection hypothesis includes only the null hypotheses from
Family 2, the local p-value is computed from the truncated Hochberg procedure.
• Case 2: If an intersection hypothesis includes only the null hypothesis from
Family 3, the local p-value equals p4 .
• Case 3: If an intersection hypothesis includes the null hypotheses from Family 2
and Family 3, the local p-value is computed from the truncated Hochberg and
univariate testing procedures using the error rate function of the truncated
Hochberg procedure and Bonferroni mixing function (Dmitrienko and Tamhane,
2011). The logical relationships between H2 and H4 are taken into account as
follows: If an intersection hypothesis includes both H2 and H4 , e.g., I = {2, 3, 4}
and I = {2, 4}, the local p-value should not depend on p4. For example, the
local p-value for testing the intersection of H2 and H4 depends on p2 only (see Table 2). This rule
ensures that H4 cannot be rejected if H2 is not rejected.
[Insert Table 2 here]
An intersection hypothesis is rejected if its local p-value is less than or equal to
α. The mixture procedure rejects a null hypothesis if all intersection hypotheses
including this particular null hypothesis are rejected. The decision rules used in the
resulting mixture procedure are equivalent to the decision rules defined in Section 4
(see Procedure 2B).
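As a worked illustration of the local p-values in Table 2, consider the hypothetical values α = 0.025, γ = 0.5, p2 = 0.004, p3 = 0.03 and p4 = 0.010, and assume that H1 has already been rejected. These numbers are chosen for illustration only and are not taken from the diabetes trial.

```latex
\begin{align*}
p(\{2,3,4\}) = p(\{2,3\}) &= 2\min\left(0.004,\ 0.03/1.5\right) = 0.008 \le 0.025, \\
p(\{2,4\}) = p(\{2\}) &= 2(0.004)/1.5 \approx 0.0053 \le 0.025, \\
p(\{3,4\}) &= 2\min\left(0.03/1.5,\ 0.010/0.5\right) = 0.04 > 0.025, \\
p(\{3\}) &= 2(0.03)/1.5 = 0.04 > 0.025, \\
p(\{4\}) &= 0.010 \le 0.025.
\end{align*}
```

All four intersection hypotheses that include H2 are rejected, so H2 is rejected. H3 is not rejected, and neither is H4, because the intersection of H3 and H4 is retained even though p4 ≤ α. This agrees with the stepwise view in which H4 is tested at α(1 − γ)/2 = 0.00625 when H3 is not rejected.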
A.2.
Procedure 3B
The mixture procedure used to define Procedure 3B is constructed by specifying
decision rules for the fifteen intersection hypotheses associated with H1 , H2 , H3 and
H4 . The local p-values for the intersection hypotheses are listed in Table 3. The local
p-values are computed from the p-values associated with the truncated Hochberg
procedure in Family 1 and regular Hochberg procedure in Family 2 as follows:
• Case 1: If an intersection hypothesis includes only the null hypotheses from
Family 1, the local p-value is computed from the truncated Hochberg procedure
(except for the individual null hypotheses H1 and H2 for which the local p-values
are defined as p1 and p2 , respectively).
• Case 2: If an intersection hypothesis includes only the null hypotheses from
Family 2, the local p-value is computed from the regular Hochberg procedure.
• Case 3: If an intersection hypothesis includes the null hypotheses from Family 1
and Family 2, the local p-value is computed from the truncated and regular
Hochberg procedures using the error rate function of the truncated Hochberg
procedure and Bonferroni mixing function (Dmitrienko and Tamhane, 2011).
In addition, the logical relationships between the null hypotheses in Families 1
and 2 are taken into account. Specifically, if an intersection hypothesis includes
H1 and H3 , e.g., I = {1, 2, 3} and I = {1, 3}, the local p-value should not
depend on p3 . Similarly, if an intersection hypothesis includes H2 and H4 , e.g.,
I = {2, 3, 4} and I = {2, 4}, the local p-value should not depend on p4 .
[Insert Table 3 here]
The mixture procedure rejects a null hypothesis if all intersection hypotheses
including this particular null hypothesis are rejected, i.e., if all relevant local p-values
are less than or equal to α. It is easy to verify that the mixture procedure is in fact
based on the decision rules defined in Section 4 (see Procedure 3B).
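The closed testing construction for Procedure 3B can be sketched directly from the local p-values in Table 3. The code below is an illustrative sketch rather than the authors' implementation; the function names and the p-values in the usage example are hypothetical. The rules inside `local_pvalue_3b` are written so as to reproduce each row of Table 3.

```python
import math
from itertools import combinations


def local_pvalue_3b(index_set, p, gamma):
    """Local p-value for an intersection hypothesis, following Table 3.

    index_set: tuple of hypothesis numbers; p: dict {1: p1, ..., 4: p4}.
    """
    fam1 = [i for i in index_set if i in (1, 2)]
    # Logical restrictions: ignore H3 when H1 is present and H4 when H2 is.
    fam2 = [j for j in index_set if j in (3, 4) and (j - 2) not in fam1]
    if len(fam1) == 2:  # truncated Hochberg test in Family 1
        lo, hi = sorted((p[1], p[2]))
        return 2 * min(lo, hi / (1 + gamma))
    if len(fam1) == 1:
        if not fam2:
            return p[fam1[0]]
        # Bonferroni mixture of one Family 1 and one Family 2 hypothesis.
        scaled = p[fam2[0]] / (1 - gamma) if gamma < 1 else math.inf
        return 2 * min(p[fam1[0]] / (1 + gamma), scaled)
    if len(fam2) == 2:  # regular Hochberg test in Family 2
        lo, hi = sorted((p[3], p[4]))
        return min(2 * lo, hi)
    return p[fam2[0]]


def procedure_3b(p, gamma=0.5, alpha=0.025):
    """Reject H_i if every intersection hypothesis containing it is rejected."""
    subsets = [s for r in range(1, 5) for s in combinations((1, 2, 3, 4), r)]
    rejected = {s: local_pvalue_3b(s, p, gamma) <= alpha for s in subsets}
    return {i: all(ok for s, ok in rejected.items() if i in s)
            for i in (1, 2, 3, 4)}


# Hypothetical p-values: H2 is rejected first, H4 is then rejected at
# alpha (1 - gamma) / 2, which allows H1 and H3 to be tested at the full alpha.
RETEST = procedure_3b({1: 0.020, 2: 0.005, 3: 0.022, 4: 0.004})
# Same setting with a larger p4: H4 fails at alpha (1 - gamma) / 2.
NO_RETEST = procedure_3b({1: 0.020, 2: 0.005, 3: 0.022, 4: 0.010})
print(RETEST)
print(NO_RETEST)
```

The first example mirrors the re-testing scenario described in Section 4, in which all four null hypotheses are eventually rejected; in the second example only H2 is rejected.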
Table 1: Operating characteristics of the six candidate gatekeeping procedures (disjunctive power and power of individual hypothesis tests).

                Disjunctive   Power of individual tests
Procedure       power         H1     H2     H3     H4
Scenario 1
Procedure 1     0.90          0.90   0.19   0.25   0.02
Procedure 2A    0.90          0.90   0.37   0.17   0.01
Procedure 2B    0.90          0.90   0.39   0.20   0.02
Procedure 2C    0.90          0.90   0.39   0.17   0.02
Procedure 3A    0.86          0.84   0.38   0.17   0.01
Procedure 3B    0.86          0.85   0.44   0.16   0.02
Scenario 2
Procedure 1     0.99          0.99   0.69   0.73   0.23
Procedure 2A    0.99          0.99   0.84   0.63   0.17
Procedure 2B    0.99          0.99   0.85   0.69   0.23
Procedure 2C    0.99          0.99   0.86   0.63   0.24
Procedure 3A    0.99          0.99   0.84   0.63   0.17
Procedure 3B    0.99          0.99   0.88   0.64   0.24
Scenario 3
Procedure 1     0.90          0.90   0.25   0.25   0.12
Procedure 2A    0.90          0.90   0.78   0.17   0.17
Procedure 2B    0.90          0.90   0.78   0.21   0.16
Procedure 2C    0.90          0.90   0.78   0.17   0.19
Procedure 3A    0.94          0.84   0.84   0.17   0.17
Procedure 3B    0.94          0.87   0.87   0.20   0.20
Table 2: Local p-values for the intersection hypotheses used in Procedure 2B.

Index set I     Local p-value p(I)
{2, 3, 4}       2 min(p(2), p(3)/(1 + γ))
{2, 3}          2 min(p(2), p(3)/(1 + γ))
{2, 4}          2p2/(1 + γ)
{2}             2p2/(1 + γ)
{3, 4}          2 min(p3/(1 + γ), p4/(1 − γ))
{3}             2p3/(1 + γ)
{4}             p4
Table 3: Local p-values for the intersection hypotheses used in Procedure 3B.

Index set I     Local p-value p(I)
{1, 2, 3, 4}    2 min(p(1), p(2)/(1 + γ))
{1, 2, 3}       2 min(p(1), p(2)/(1 + γ))
{1, 2, 4}       2 min(p(1), p(2)/(1 + γ))
{1, 2}          2 min(p(1), p(2)/(1 + γ))
{1, 3, 4}       2 min(p1/(1 + γ), p4/(1 − γ))
{1, 3}          p1
{1, 4}          2 min(p1/(1 + γ), p4/(1 − γ))
{1}             p1
{2, 3, 4}       2 min(p2/(1 + γ), p3/(1 − γ))
{2, 3}          2 min(p2/(1 + γ), p3/(1 − γ))
{2, 4}          p2
{2}             p2
{3, 4}          min(2p(3), p(4))
{3}             p3
{4}             p4
Figure 1: Strategy 1 (serial testing) in the diabetes trial. Nodes: H1, H2, H3 and H4.
Figure 2: Strategy 2 (extended serial testing) in the diabetes trial. Nodes: H1, H2, H3 and H4.
Figure 3: Strategy 3 (parallel testing) in the diabetes trial. Nodes: H1, H2, H3 and H4.
Figure 4: Marginal power of the three hypothesis tests based on Procedure 2B in the
diabetes trial as a function of the truncation parameter γ. Panels: H2, H3 and H4; horizontal axis: Gamma (0 to 1); vertical axis: Power (0.5 to 1.0).
Figure 5: Criterion 1 and Criterion 2 functions for Procedure 2B in the diabetes trial
as a function of the truncation parameter γ. Panels: Criterion 1 and Criterion 2; horizontal axis: Gamma (0 to 1); vertical axis: Criterion value (0.4 to 0.7).