Berger, R. L. and Hsu, J. C. (1995) "Bioequivalence Trials, Intersection-Union Tests, and Equivalence Confidence Sets"

-.
'.....
..... ~i.;; ..
-
! .
Bioequivalence Trials, Intersection-Union Tests, and
Equivalence Confidence Sets
by
Roger L. Berger and Jason C. Hsu
Institut,of Statistics Mimeo Series Number 2279
October, 1995
NORTH CAROLINA STATE UNIVERSITY
Raleigh, North Carolina
Berger, Roger &
HsU, Jason
.nZ
BIOEQUIVALENCE TRIALS,
79 INTERSECTION-UNION TESTS,
AND EQUIVALENCE CONFlDENCE
»I»EO
SERIES
SETS.
,
.
Bioequivalence Trials, Intersection-Union Tests, and
Equivalence Confidence Sets
Roger L. Berger
Department of Statistics
North Carolina State University
Raleigh, NC 27595-8203
...
Jason C. Hsu
Department of Statistics
The Ohio State University
Columbus, OR 43210-1247
October 10, 1995
Abstract
The problem of bioequivalence is practically important because the approval of every
generic drug in the United States and the European Community (EC) requires the
establishment of bioequivalence between the name brand drug and the proposed generic
version. The problem is theoretically interesting because it has been recognized as one
for which the desired inference, instead of the usual significant difference, is practical
equivalence. The concept of intersection-union tests will be shown to clarify, simplify,
and unify bioequivalence testing. A test more powerful than the one currently specified
by the FDA and EC guidelines will be derived. The claim that the bioequivalence
problem defined in terms of the ratio of parameters is more difficult than the problem
defined in terms of the difference of parameters will be refuted. The misconception that
size-a bioequivalence tests generally correspond to 100(1- 2a)% confidence sets will be
shown to lead to incorrect statistical practices, and should be abandoned. Techniques
for constructing 100(1 - a)% confidence sets that correspond to size-a bioequivalence
tests will be described. Finally, multiparameter practical equivalence problems will be
discussed, including two which have not been accepted as such.
1
Bioequivalence Problem
Two different drugs or formulations of the same drug are called bioequivalent if they are
absorbed into the blood and become available at the drug action site at about the same rate
and concentration. Bioequivalence is usually studied by administering dosages to subjects
and measuring concentration of the drug in the blood just before and at set times after the
administration. These data are then used to determine if the drugs are absorbed at the
same rate.
The determination of bioequivalence is very important in the pharmaceutical industry
because regulatory agencies allow a generic drug to be marketed if its manufacturer can
demonstrate that the generic product is bioequivalent to the brand-name product. The
assumption is that bioequivalent drugs will provide the same therapeutic effect. If the
1
o
...
generic drug manufacturer can demonstrate bioequivalence, it does not need to perform
costly clinical trials to demonstrate the safety and efficacy of the generic product. Yet, this
bioequivalence must be demonstrated in a statistically sound way to protect the consumer
from ineffective or unsafe drugs. Often, even an NDA (New Drug Application) must include a bioequivalence test. The production formulation of the drug must be shown to be
bioequivalent to the prototype formulation used in the clinical trials.
These concentration/time measurements can be connected with a curve and several
variables can be measured. The common measurements are AUC (Area Under Curve),
Cmax (maximum concentration), and T max (time until maximum concentration). The two
drugs are bioequivalent if the population means of these variables are sufficiently close.
For example, let f-lT denote the population mean AUC for the generic (Test) drug and
IlR denote the population mean AUC for the brand-name (Reference) drug. To demonstrate
bioequivalence, the following hypotheses are tested:
Ho : l!:I..
< OL or l!:I..
> OU
f-lR f-lR (1 )
versus
H a : OL < ~ < ou.
The values OL and Ou are standards set by regulatory agencies that define how "close" the
drugs must be to be declared bioequivalent. Currently, both the United States Food and
Drug Administration (1992) and the European Community (EC-GCP, 1993) use Ou = 1.25
and OL = .80, values that are symmetric in the ratio scale.
Often, logarithms are taken and the hypotheses (1) are stated as
Ho : 1JT - 1JR :::; 9L or 1JT - 1JR
(2)
~ 9u
versus
H a : 9L < 1JT - 1JR < 9u.
Here, 1JT = 10g(f-lT), 1}R = 10g(f-lR), Ou = log( ou) and OL = log( OL). With Ou = 1.25 and
OL = .80, Ou = -9L, and the standards are symmetric.
In a hypothesis test of (1) or (2), the Type I error rate is the probability of declaring the
drugs to be bioequivalent, when in fact they are not. By setting up the hypotheses as in
(1) or (2) and controlling the Type I error rate at a specified small value, say, 0' = .05, the
consumer's risk is being controlled. That (1) or (2) is the proper formulation in problems
like these was recognized early on by some authors. For example, Lehmann (1959, p. 88),
not specifically discussing bioequivalence, says, "One then sets up the (null) hypothesis
that [the parameter] does not lie within the required limits so that an error of the first
kind consists in declaring [the parameter] to be satisfactory when in fact it is not." But
not until Schuirmann (1981, 1987), Westlake (1981) and Anderson and Hauck (1983) were
hypotheses correctly formulated as in (1) or (2) in bioequivalence problems.
Despite the fact that bioequivalence testing problems are now correctly formulated as
( 1) or (2), many inappropriate statistical procedures are still used in this area. Tests that
2
..
,
claim to have a specified size a, but are either liberal or conservative, are used. Liberal tests
compromise the consumer's safety, and conservative tests put an undo burden on the generic
drug manufacturer. Tests are often defined in terms of confidence intervals in statistically
unsound ways. These tests, again, do not properly control the consumer's risk.
In this paper, we will describe current bioequivalence tests that have incorrect error
rates. We will offer new tests that correctly control the consumer's risk. In several cases, the
tests we propose are uniformly more powerful than the existing tests while still controlling
the Type I error rate at the specified rate a. We will examine and criticize the current
practice of defining tests in terms of 100(1 - 2a)% confidence sets. We will show that this
only works in special cases and gives poor results in other cases. We will discuss how properly
to construct 100(1- a)% confidence sets that correspond to size-a tests. And we will discuss
how our methods can be applied to complicated, multiparameter bioequivalence problems
that have received only slight attention in the literature. The intersection-union method
of testing will be found to be very useful in understanding and constructing bioequivalence
tests. Section 2 prOVIdes a more detailed outline to our discussions.
In this paper we discuss population bioequivalence. For example, J.LT is the average
AUC for the population. Sometimes bioequivalence is defined in terms of parameters that
more directly measure equivalence of response within an individual. Good introductions to
individual bioequivalence are given by Anderson and Hauck (1990) and Anderson (1993).
Although we do not explicitly consider individual bioequivalence in this paper, many of the
concepts and techniques we describe should be applicable in that area also.
In this paper, our discussion will be entirely in terms of bioequivalence testing. But
our comments and techniques apply to other problems, such as in quality assurance, in
which the aim is to show that two parameters are close or that a parameter is between two
specification limits.
2
"
Tests, Confidence Sets and Curiosities
Various experimental designs are used to gather data for bioequivalence trials. Chow and Liu
(1992) describe parallel designs (two independent samples), and two-period and multiperiod
crossover designs. The issues we discuss apply to all these different designs. So, for the most
part, we will consider only the simple parallel design in which XI, . .. , X m form a random
sample of size m from the test drug and Y1 , ••• , Yn form and independent random sample
of size n from the reference drug.
2.1
Difference hypotheses
Lognormal models are typically used in bioequivalence studies. The variables measured, e.g.,
AUC, T max and Cmax , are positive and usually right-skewed. So this model is reasonable.
3
•
Let xi, ... , X~ denote the lognormal measurements from the test drug in the original
, X m denote the logarithms of these measurements. Similarly, let
scale, and let X},
yt, ... , Y; and YI , , Yn denote the original measurements and logarithms for the reference
drug. Let (1]T, (]"2) denote the lognormal parameters for Xi, ... , X~, and let (1]R, (]"2) denote
the lognormal parameters for Yt, ... , Y; . Then the test and reference drug means are
JlT = e7)T+0'2/2 and !-tR = e7)R+0'2 /2, respectively. Therefore, the condition
6L
<
!-tT
!-tR
=
e7)T-TJR
< 6u
is equivalent to
(3)
.... ,
where (h = log( 6L) and ()u = log( bu) are known constants. Thus, the hypothesis to be
tested in the lognormal model is (2). The test is based on Xl, ... , X m , a random sample
from a normal population with mean 1]T and variance (]"2, and YI , ••• , Yn , an independent
random sample from a normal population with mean 1]R and variance (]"2.
Westlake (1981) and Schuirmann (1981) proposed what has become the standard test
of (2). It is called the "two one-sided tests" (TOST). Let X denote the sample mean of
X}, ... , X m , let -Y denote the sample mean of YI , ... , Yn , and let S 2 denote the pooled
estimate of (]"2, computed from both samples. Let
(4)
and
X-l'-(h
TL
=
SJ1.. +
m
1 .
n
The TOST tests (2) using the ordinary, one-sided, size-a t-test for
(5)
versus
Hal:
1]T -
1]R
> (h
and the ordinary, one-sided, size-a t- test for
(6)
"
H02
: 1]T - 1]R ~ ()u
H a2
: 1]T - 1]R
versus
< ()u
It rejects Ho at level a and declares the two drugs to be bioequivalent if both tests reject,
that is, if
T u < -tex,r
and
TL
> tex,n
where tex,r is the upper lOOa percentile of a Student's t distribution with r = m
degrees of freedom.
Following Lehmann (1959), we define the size of a test as
size
= sup P(reject Ho).
Ho
4
+n -
2
,
•
,.,..
The size of the TOST is exactly equal to a, even though P(reject Ho) < a for every
(TIT, TlR, (72) in the null hypothesis. The supremum value of a is attained in the limit as
TIT - TlR = (h (or Ou) and (72 ~ O. Both the FDA bioequivalence guideline (FDA, 1992)
and the European Community guideline (EC-GCP, 1993) specify that bioequivalence be
established using a 5% TOST.
The TOST is unusual in that two size-a tests are combined to form a size-a test.
Often, when multiple tests are combined, some adjustment must be made to the sizes of
the individual tests to achieve an overall size-a test. Why this is not necessary for the
TOST is best understood through the theory of intersection-union tests (IUTs), which we
describe in Section 3. In Sections 4.2 and 4.2 we will show that the IUT theory is useful for
understanding the TOST. Also, the IUT theory can guide the construction of tests for (2)
that have the same size-a as the TOST but are uniformly more powerful than the TOST.
2.2
Ratio hypotheses
Sometimes, a normal model should be used. In this model, the original measurements are
Xl,"" X m , a random sample from a normal population with mean JLT and variance (72, ~nd
Yl , ... , 1~, an independent random sample from a normal population with mean JLR and
variance (72. This model is different from the lognormal model in that now the hypothesis
to be tested concerns the ratio of the means of these normal observations. That is, we wish
to test (1). This problem has received less attention than (2). Dealing with the ratio JLT / JLR
has been perceived as more difficult than dealing with the difference TIT - TlR.
The FDA (1992) strongly recommends a logarithmic transformation and testing the
hypotheses (2). After giving several arguments why a logarithmic transformation is usually
reasonable, FDA (1992, p. 7) states,
Based on the arguments in the preceding section, the Division of Bioequivalence
recommends that the pharmacokinetic parameters AUC and Cmax be log transformed. Firms are not encouraged to test for normality of data distribution
after log transformation, nor should they employ normality of data distribution
as a justification for carrying out the statistical analysis on the original scale.
•
The emphasis is ours. This seems to recommend very poor statistical practice. Even if the
original data appears normal, this is to be ignored and the data are to be log transformed,
anyway! Then there is to be no check whether the log transformed data appear normal! If
the data in the original scale are normally distributed, then a logarithmic transformation
will produce nonnormal data. In this case, tests such as the TOST, based on the Student's
t distribution, are inappropriate.
One justification given for the FDA's recommendation is, "Standard parametric methods
are ill-suited to making inferences about the ratio of two averages" (FDA, 1992, p. 7). We
5
will refute this claim. In Section 4.3, we show that Sasabuchi (1980,1988) described the size0' likelihood ratio test for (1). It is a simple tests based on the Student's t distribution. We
also show that the tests commonly used are liberal and have size greater than the nominal
value of 0'. Furthermore, we show that the IUT method can be used in this problem, also,
to construct size-a tests that are uniformly more powerful than the likelihood ratio test.
Thus, the FDA's avoidance of (1) because of statistical difficulties is unwarranted.
2.3
100(1 - 20:)% confidence intervals
One would expect the TOST to be identical to some confidence interval procedure: For
some appropriate 100(1 - 0')% confidence interval [D-, D+] for 1JT - 1JR, declare the test
drug to be bioequivalent to the reference drug if and only if [D-, D+J c (fh, (}u).
It has been noted (e.g., Westlake, 1981; Schuirmann, 1981) that the TOST is operationally identical to the procedure of declaring equivalence only if the ordinary 100( 1- 20' )%,
not 100(1 - 0')%, two-sided confidence interval for 1JT - 1JR
(7)
[x -
Y - t a,T SJm- 1
+ n- 1 , X
- Y
+ t a,T SJm- 1 + n- 1 ]
is contained in the interval «(}L, (}u). In fact, both FDA (1992) as well as EC-GCP (1993)
specify that the TOST should be executed in this fashion.
The fact that the TOST seemingly corresponds to a 100(1-20')%, not 100(1-0')%, confidence interval procedure initially caused some concern (Westlake 1976, 1981). Recently,
Brown, Casella and Hwang (1995) called this relationship an "algebraic coincidence." But
many authors (e.g., Chow and Shao, 1990, and Schuirmann, 1989) have defined bioequivalence tests in terms of 100( 1 - 20' )%, confidence sets.
In Section 5, we discuss a 100( 1 - 0')%, confidence interval that corresponds exactly
to the size-a TOST. We also explore the relationship between 100( 1 - 20')% confidence
intervals and size-a tests. We describe situations more general than the TOST in which
size-a tests can be defined in terms of 100( 1 - 20')%,confidence intervals. But we also
give examples from the bioequivalence literature of tests that have been defined in terms of
100( 1 - 20' )%, confidence intervals that are not size-a tests. They can be either liberal or
conservative. So our conclusion is that the practice of defining bioequivalence tests in terms
of 100(1 - 20')%, confidence intervals should be abandoned. If both a confidence interval
and a test are required, a 100( 1 - 0')%, confidence interval that corresponds to the given
size-a test should be used.
2.4
Multiparameter problems
In Section 6, we discuss several multiparameter bioequivalence problems. We discuss two
examples in which the IUT theory can be used to define size-a tests that are uniformly
more powerful than tests that have been previously proposed. We also discuss problems
6
about drug shelf-life and drug toxicity that are really bioequivalence problems, but have
not been accepted as such in the past.
3
.....
,
-"
Intersection-Union Tests
Berger (1982) proposed the use of intersection-union tests in a quality control context closely
related to bioequivalence testing. Tests for many different bioequivalence hypotheses are
easily constructed using the IUT method. The TOST is a simple example of an IUT. Tests
with a specified size are easily constructed using this method, even in complicated problems
involving several parameters. And tests that are uniformly more powerful than standard
tests can often be constructed using this method.
The IUT method is useful for the following type of hypothesis testing problem. Let ()
denote the unknown parameter (() can be vector valued) in the distribution of the data X.
Let 0 denote the parameter space. Let 0 1 , ... , 0k denote subsets of 0. Suppose we wish
to test
k
Ho : () E
(8)
U 0i
n
k
versus
i=l
Ha
: ()
E
0f,
i=l
where A C denotes the complement of the set A. The important feature in this formulation
is the null hypothesis is expressed as a union and the alternative hypothesis is expressed
as an intersection. For i = 1, ... , k, let Ri denote a level-G rejection region for a test of
HOi : () E 0i versus Hai : () E 0f. Then an IUT of (8) is the test that rejects Ho if and
only if X E
Ri. The rationale behind an IUT is simple. The overall null hypothesis,
Ho : () E
0 i can be rejected only if each of the individual null hypotheses, HOi: () E 0i,
can be rejected.
Berger (1982) proved the following two theorems.
n7=1
U7=1
Theorem 1 If R i is a level-G test of HOi, for i = 1, ... , k, then the intersection-union test
Ri is a level-G test of Ho versus Ha in (8).
with rejection region R =
n7=1
•
An important feature in Theorem 1 is that each of the individual tests is performed at
level-G. But the overall test also has the same level G. There is no need for multiplicity
adjustment for performing multiple tests. The reason there is no need for such a correction
is the special way the individual tests are combined. Ho is rejected only if everyone of the
individual hypotheses, HOi, is rejected.
Theorem 1 asserts that the IUT is level-G. That is, its size is at most G. In fact, a test
constructed by the IUT method can be quite conservative. Its size can be much less that
the specified value G. But, Theorem 2 (a generalization of Theorem 2 in Berger (1982))
provides conditions under which the IUT is not conservative; its size is exactly equal to the
specified G.
7
..
Theorem 2 For some i = 1,
, k, suppose Ri is a size-a rejection region for testing HOi
versus Hai. For every j = 1,
, k, j #- i, suppose Rj is a level-a rejection region for testing
HOj versus Haj. Suppose there exists a sequence of parameter points (J/, I = 1,2, ..., in 0 i
such that
lim P(X E Ri) = a.
/-+00
and, for every j = 1, ... , k, j
f:.
i,
lim P(X E Rj) = 1.
/-+00
...
-..
Then the intersection-union test with rejection region R =
versus H a .
nf=l Ri
zs a szze-a test of Ho
,
Note that in Theorem 2, the one test defined by Ri has size exactly a. The other tests
defined by Rj, j = 1, ... , k, j f:. i, are level-a tests. That is, their sizes may be less than
a. The conclusion is the JUT has size a. Thus, if rejection regions R 1 , • •• , Rk with sizes
n
0'1, ... ,ak
are combined in an JUT and Theorem 2 is applicable, then the JUT will have size
equal to maxd ad. We will discuss bioequivalence examples in which tests of different sizes
are combined. The resulting test has size equal to the maximum of the individual sizes.
4
Old and New Tests for Difference and Ratio Hypotheses
Two one-sided tests
4.1
The TOST is naturally thought of as an JUT. The bioequivalence alternative hypothesis
Ha
: (JL
< "IT - "IR <
is conveniently expressed as the intersection of the two sets,
(Ju,
0 1 = {( "IT, "IR, (1'2) : "IT - "IR < (Ju} and O 2 = {( "IT, "IR, (1'2) : (JL < "IT - "IR}. The test that
rejects HOl : "IT - "IR
(JL in (5) if TL ~ ta,r is a size-a test of HO!. The test that rejects
:s
H02
:
"IT - "IR ~ (Ju in (6) if Tu
:s -t a,r is a size-a test of H02 '
So, by Theorem 1, the test
that rejects Ho only if both of these tests reject is a level-a test of (2).
To use Theorem 2 to see that the size of the TOST is exactly a, consider parameter
points with "IT - "IR =
and take the limit as
(Ju
(1'2 -+
O. Such parameters are on the
boundary of H02 . Therefore,
P(X E Rt} = P(Tu
for any
(1'2
:s -t
a •r )
= a,
> O. But,
because the power of a one-sided t test converges to one as
alternative. The value "IT - "IR =
(Ju
(1'2 -+
is in the alternative, Hal.
8
0 for any point in the
...
•
Table 1: Powers of three bioequivalence tests. r = 30 and
= (m -1 + n .1)(T
.12
.16
.20
'TJT - 'TJR = Ou or OL
.050 .031 .003 .000
.050 .050 .050 .050
.050 .047 .049 .050
0'
= .05.
(To
.00
.04
TOST
BHM
new
.050
.050
.050
.050
.050
.050
TOST
BHM
new
1.000
1.000
1.000
1.000
1.000
1.000
.08
'TJT -
....
"",
.720
.721
.720
'TJR
.158
.260
.247
= 0
.007
.131
.128
.000
.093
.092
.30
00
.000
.050
.050
.000
.050
.050
.000
.066
.066
.000
.050
.050
The advantage.of considering bioequivalence problems in an JUT format is not limited
to verifying properties of the TOST. Rather, other bioequivalence hypotheses, such as
(1), state an interval as the alternative hypothesis. This interval can be expressed as the
intersection of two one-sided intervals. So two one-sided, size-a tests can be combined to
obtain a level-a (typically, size-a) test. Furthermore, as we will see in Section 6, even more
complicated forms of bioequivalence can be expressed in the JUT format. This allows the
easy construction of tests with guaranteed size-a for these problems.
4.2
More powerful tests
Despite its simplicity and intuitive appeal, the TOST suffers from a lack of power. The
line labeled TOST in the top part of Table 1 shows the power function, P(reject Ho), for
parameter points with 'TJT - 'TJR = Ou (or OL), points on the boundary between Ho and Ha •
The power function is near 0' for (T2 near 0, but decreases as (T2 grows. An unbiased test
would have power equal to 0' for all such parameter points. The TOST is clearly biased.
The bottom part of Table 1 shows the power function when the two drugs are exactly equal,
1]T = 'TJR. The power is near one for (T2 near zero, but decreases to zero as (T2 increases.
.
Anderson and Hauck (1983) proposed a test with higher power that the TOST. Whereas
the TOST does not reject Ho if S2 is sufficiently large, the Anderson and Hauck test
always rejects Ho if X - Y is near enough to zero, even if S2 is large. This provides an
improvement in power. However the Anderson and Hauck test does not control the Type
J error probability at the specified level 0'. It is liberal and the size is somewhat greater
than 0:. Shortly after Anderson and Hauck proposed their test, Patel and Gupta (1984)
and Rocke (1984) proposed the same test. This scientific coincidence was commented upon
by Anderson and Hauck (1985) and Martin Andres (1990).
Due to the seriousness of a Type J error, declaring two drugs to be equivalent when
9
•
~,
..
"
they are not, the search for a size-a test that was uniformly more powerful than the TOST
continued. Munk (1993) proposed a slightly different test. Munk claims that this test is a
size-a test that is uniformly more powerful than the TOST. But this claim is supported by
numerical calculations, not analytic results. The fact that so many authors have considered
this same problem indicates the importance of finding a size-a test that is more powerful
test than the TOST.
Brown, Hwang and Munk (1995) succeeded in constructing an unbiased, size-a test of
(2) that is uniformly more powerful than the TOST. Their construction is recursive. To
determine if a point (x, y, 8 2 ) is in the rejection region ofthe Brown, Hwang and Munk test,
a good deal of computing can be necessary. This may limit the practical usefulness of the
Brown, Hwang and Munk test. Also, sometimes the Brown, Hwang and Munk rejection
region has a quite irregular shape. An example of this is shown in Figure l.
We will now describe a new test of the hypotheses (2). This test is uniformly more
powerful than the TOST. Unlike the Anderson and Hauck and Munk tests, our test is a
size-a test. Our test is nearly unbiased. It is simpler to compute than the Brown, Hwang
and Munk test. It will not have the irregular boundaries that the Brown, Hwang and Munk
test sometimes possesses. The construction of this new test again illustrates the usefulness
of the IUT method.
To simplify the notation in describing out test, we assume, without loss of generality,
that fh = -()u and call ()u = Ll. Define
D=X-Y
and
5; = r
(~ +~) 52.
It is simpler to define our test in terms of the polar coordinates, centered at (Ll,O),
-.
and
b = cos -1
(( d
- Ll) / v ) .
In the (d, 8*) space, v IS the distance from (Ll, 0) to (d, 8*), and b is the angle between
the d axis and the line segment joining (Ll,O) and (d,8*). To define a size-a test, we need
the distribution of (V, B) when () = Ll, In this case, it is easy to verify that V and Bare
independent. The probability density function of B is
10
II
which does not depend on a 2 . To implement our test, it is useful to note that the cumulative
distribution function of B has a closed form given by
b
1 (r-l)/2
f(k)
F(b) = - - (sin(b»2k-l cos(b)
1 '
1r 2Vi k=l
f(k + 2)
L
if r is odd, and
1
F(b) =
"
-..
1
2-
r/2
2Vi f;(sin(b»2k-2cos(b)
f(k- 1 )
f(k)2,
if r is even. The probability density function of V will be denoted by gl1( v).
We will describe the rejection region of the new test geometrically here. Exact formulas
are in the Appendix. The new test will be an rUT. We will define a size-a, unbiased
rejection region, R 2 , for testing (6). This R 2 will contain the rejection region of the size-a
TOST and will be approximately symmetric about the line d = O. Then we will define
R 1 = {(d,s.): (-d,s.) E R 2 }. R 1 is R 2 reflected across the line d = O. R 1 is a size-a,
unbiased rejection region for testing (5). Then R = R 1 nR 2 is the rejection region ofthe new
test. Because R 2 is approximately symmetric about the line d = 0, R 1 is almost the same
as R 2 , and not much is deleted when we take the intersection. This foresight in choosing
the individual rejection regions so that the intersection is not much smaller is always useful
when using the rUT method.
The set {V = v} is a semicircle in (d, s.) space. For each value of v, R 2 ( v) == {V = v}nR 2
is either one or two intervals of b values, that is, one or two arcs on {V = v}. These arcs
will be chosen so that, for every v > 0,
(9)
Then the rejection probability
P(R 2 )= [00 [
io iR (v)
j(b)dbg l1 (v)dv= [00 ag l1 (v)dv=a,
io
2
for every a > 0 if 'T]T - 'T]R = ~. This will ensure that R2 is a size-a, unbiased rejection
region for testing (6).
We now define the arc(s) that make up R 2 (v). Refer to Figure 2 in this description.
The rejection region of the size-a TOST, call it RT, is the triangle bounded by the lines
s. = 0, d = ~-tOi.rS./Vr (call this line lu), and d = -~+tOi.rS./Vr (call this line h). Let
vo denote the distance from (~, 0) to IL. rn this description, we assume 1/2 > a > a. ==
F( 1r) - F(31r /4). Brown, Hwang and Munk (1995) in their Table 1 show that if r 2: 4, then
a = .OS > a •. The new test for a ::; a. is given in the Appendix. Brown, Hwang and Munk
did not give an unbiased test for a ::; a•. The condition a > a. ensures that the point on
IL closest to (~, 0) is on the boundary of RT, as shown.
11
·-
•
°
Let bo denote the angle between the d axis and Lv. For < v ::; vo, R 2 ( v) = {b : bo <
b < 11"}. The arc Ao in Figure 2 is an example of such an arc. So, for v < vo, R 2 ( v) is
exactly the points in the TOST.
For Vo < v, the semicircle V = v intersects lL at two points. Let bi < b2 denote the
angles corresponding to these two points. If Vo < v < 2~, Let A 2 ( v) = {b : b2 < b < 11"}.
These are the points in RT adjacent to the d axis, and A 2 in Figure 2 is an example of such
an arc. If 2~ ::; v, let A 2 ( v) be the empty set. Let a( v) denote the probability content of
A 2 ( v) under F. That is,
a(v) = { 1- F(b 2 ), Vo < v <
0,
2~ ::; v.
2~,
For Vo < v, R 2 (v) = AI(v) U A 2 (v), where to ensure that (9) is true, AI(v) must satisfy
(10)
r
f(b)db=a-a(v).
JAl(V)
Let (d l , s*t} denote the point where the {V = vol semicircle intersects Lv, and let 'VI
denote the radius corresponding to (-dl,s*I). For Vo < v < VI, let bLl be the angle defined
by
(11)
F(bt} - F(bLl) = a - a(v),
= {b
: bLI < b < btl is the
arc that satisfies (10) whose endpoint is on LL. For Vo < v < VI, R 2 (v) = AI(v) U A 2 (v),
using this Al (v). The arcs labeled Al and A 2 in Figure 2 comprise such an R 2 ( v).
For VI ::; v, define two values bL(V) < bv(v) such that F(bv(v)) - F(bL(V)) = a - a(v),
and the angle between the line joining (0,0) and (V,bL(V)) and the s* axis is the same as
the angle between the line joining (0,0) and (v, bv( v)) and the s* axis. This equal angle
condition is what we meant earlier by the phrase "approximately symmetric about the line
d = 0." If bv(v) ~ bI, then AI(v) = {b: bL(V) < b < bv(v)}. But, if bv(v) < bl , then this
arc does not contain all the points in the TOST. So, if bv(v) < bI, AI(v) = {b: bLI < b <
bd, where hI is defined by (11). For VI ::; v, R 2 ( v) = Al (v) U A 2 ( v). Recall, if 2~ ::; v,
A 2 (v) is empty, and R 2( v) is the single arc Al (v). Also, for v2 ~ max{4~2, ~2 + ~2r /t;,r},
the semicircle {V = v} does not intersect RT, and R 2( v) is the arc defined by bL( v) and
bv( v). The bi condition never applies in this case. In Figure 2, the solid parts of the arcs
A 3 and A 4 are examples of R 2 ( v) for VI ::; v.
The cross-sections R 2(v) have been defined for every v > 0, and this defines R 2. R I is the
reflection of R2 across the s* axis. and the rejection region of the new test is R = R I n R 2.
This construction is illustrated in Figure 3.
In Figure 1, the rejection region R with the same size as the Brown, Hwang and Munk
test is the region between the dotted lines. The boundary of R is smooth compared to the
irregular boundary of the Brown, Hwang and Munk test. This smoothness results from
where bi is as defined in the previous paragraph. Then Al (v)
12
•
the attempt in the construction of R to center arcs around the s* axis. This smoothness
and the simplicity of the computation of R compared with the Brown, Hwang and Munk
computations recommend R as a reasonable alternative to the Brown, Hwang and Munk
test. But R is slightly biased whereas the Brown, Hwang and Munk test is unbiased.
A small power comparison of the TOST, Brown, Hwang and Munk test, and our new
.05 and r 30. In the top block of numbers, TJT - TJR
.6..
test is given in Table 1 for a
For these boundary values, the power is exactly a = .05 for the unbiased Brown, Hwang
and Munk test. The power is also very close to .05 for our test, indicating it has only slight
bias. But the TOST is highly biased with power much less than .05 for moderate and large
cr. In the bottom block of numbers, TJT - TJR = O. The drugs are equivalent. Our test and
the Brown, Hwang and Munk test have very similar power. Their powers are much greater
than the TOST's power for all but very small cr.
=
4.3
=
=
Tests for ratios of parameters
Usually, data from a bioequivalence trial is logarithmically transformed before analysis.
This leads to a test of the hypotheses (2), as described in the previous section. The model
we will consider now was introduced in Section 2.2. Namely, Xl, ... ,Xm form a random
sample from a normal population with mean J.LT and variance cr 2 , and Y1 , .•. , Y n form an
independent random sample from a normal population with mean J.LR and variance cr 2 • The
bioequivalence hypothesis to be tested in this case is (1), namely,
Ho : 1:!:I..
< DL or l!:I..
> Du
J.LR J.LR (12)
versus
Ha : DL < ~ < Du.
In the past, the values of DL
.80 and DU = 1.20 were commonly used (called the ±20
rule). But, the FDA Division of Bioequivalence (1992) now uses DL = .80 and h = 1.25.
These limits are symmetric in the ratio scale since .80 = 1/1.25.
The parameter J.LT is positive because the measured variables, AUe, etc., are positive.
Therefore the hypotheses (12) can be restated as
Ho : J.LT - DLJ.LR S; 0 or J.LT - DUJ.LR 2: 0
(13)
versus
H a : J.LT - DLJ.LR > 0 and J.LT - DUJ.LR
< O.
The testing problem (13) was first considered by Sasabuchi (1980, 1988). He showed
that the size-a likelihood ratio test of (13) rejects Ho if and only if
where
and
13
•
....
The symbols X, Y, 5 and tO/,r have the same meaning as in Section 2. This will be called
the TtlT2 test.
The TtlT2 test is easily understood as an IUT. The usual, normal theory, size-a ttest of HOI : J.lT - bLJ.lR ~ 0 versus Hal : J.lT - bLJ.lR > 0 is the test that rejects HOI if
T l 2': to/,r' Similarly, the usual, normal theory, size-a t-test of H02 : J.lT - bUJ.lR 2': 0 versus
H a2 : J.lT - bUJ.lR < 0 is the test that rejects H 02 if T 2 ~ -to/,r' Because H a is the intersection
of Hal and H a2 , these two t-tests can be combined, using the IUT method, to get a level-a
test of Ho versus Ha . Using an argument like in Section 3, Theorem 2 can be used to show
that the size of this test is a.
Yang (1991) proposed a test closely related to the TdT2 test for the bioequivalence
problem of testing (12) in a crossover experiment. But no reference to Sasabuchi's earlier
work was given. And the value of this simple, size-a test seems to have been completely
overlooked in the bioequivalence literature. Rather, Chow and Liu (1992) report that the
following is the standard analysis. Rewrite the hypotheses (12) or (13) as
Ho : J.lT - J.lR
(14 )
~ (bL - l)J.lR
or J.lT - J.lR 2': (bu - l)J.lR
versus
These hypotheses look like (2), but there is an important difference. In (2), (h and (}u are
known constants. In (14), (bL -1 )J.lR and (bu -1 )J.lR are unknown parameters. Nevertheless,
the standard analysis proceeds to use the TOST with (bL - l)Y replacing (}L in TL and
(bu - 1 )Y replacing (}u in Tu. The standard analysis ignores the fact that a constant has
been replaced by a random variable and compares these two test statistics to standard t
percentiles as in the TOST. This test will be called the T; IT; test.
The statistics that are actually used in this analysis are, in fact,
n+mbI
n+m'
--.
and
T; =
X - Y - (bu - l)Y
X - buY
5-11-+1
Ym n
5/1..+1
ym n
= T2
n + mb'[;
n+m
The statistics T l and T 2 are properly scaled to have Student's t distributions, but T;
and T; do not. The T; IT; test is an JUT in which the two tests have different sizes. The
test that rejects HOI if T l* > tO/,r has size
PI-tT=OL/.LR
al
< a,
14
(T
l
>
n+m
)
n + mbI tO/,r
•
Table 2: Actual size of Ti IT; test for nominal a = .05
m=n
SIze
5
.070
15
.072
10
.071
20
.072
30
00
.073
.073
because
n+m
1'2 > l.
n+muL
On the other hand, the test that rejects H02 if T; < -ta,r has size
-.
-.
PJ.tT=8 U IJ.R
a2
(T2 < -
n+m
n + m6'[; ta,r
)
> a,
because
n+m
< l.
n+muu
--""':"1'2;;""
Theorem 2 can be used to show that, as a test of the hypothesis (12), the Ti IT; test has
size a2 > a. It is a liberal test.
The true size of the TilT; test, for a nominal size of a = .05, is shown in Table 2. In
Table 2 it is assumed that the sample sizes from the test and reference drugs are equal,
m = n. In this case, the size of the
a2 =
Ti IT; test is simply
P(T < -VI :6'[;ta,r) ,
where T has a students t distribution with r = 2n - 2 degrees of freedom. It can be seen
that the size of the TilT:;' test is about .07 for all sample sizes. The liberality worsens
slightly as the sample size increases.
On the other hand, the TdT2 test has size exactly equal to the nominal a. It is just as
simple to implement as the Ti IT; test. Therefore the TdT2 test should replace the Ti IT;
test for testing (12).
In Section 4.2, the IUT method was used to construct a size-a test that is uniformly
more powerful than the TOST. For the known (12 case, Berger (1989) and Liu and Berger
(1995) used the IUT method to construct size-a tests that are uniformly more powerful
than the TdTz test. In Figure 4, the cone shaped region labeled Ro is the rejection region
of the TIITz test for a = .05. The region between the dashed lines is the rejection region of
Liu and Berger's size-a test that is uniformly more powerful. We refer the reader to Berger
(1989) and Liu and Berger (1995) for the details about these tests. We believe that for the
(12 unknown case, size-a tests that are uniformly more powerful than the TdT2 test will be
found.
15
r
5
Confidence Sets and Bioequivalence Tests
5.1
A 100(1 - a)% confidence interval
We will show that the 100(1- 0')% confidence interval [D 1, Dil given by
corresponds to the size-a TOST for (2). Here x-
-,
= min{O, x}
and x+
= max{O, x}.
This
confidence set has been derived byHsu (1984), Bofinger (1985), and Stefansson, Kim, and
Hsu (1988) in the multiple comparisons setting, by Muller-Cohrs (1991), Bofinger (1992),
and Hsu, Hwang, Liu, and Ruberg (1994) in the bioequivalence setting. Our derivation
follows Stefansson, Kim, and Hsu (1988) and Hsu, Hwang, Liu, and Ruberg (1994), which
makes the correspondence to TOST more explicit.
To see this correspondence, we use the standard connection between tests and confidence
sets. Most often innstatistics, this connection is used to construct confidence sets from tests
via a result such as the following.
Theorem 3 (Lehmann, 1986, p. 90) Let the data X have a probability distribution that
depends on a parameter (J. Let e denote the parameter space. For each (Jo E e, let A( (Jo)
denote the acceptance region of a level-a test of Ho : (J = (Jo. That is, for each (Jo E
Po=oo(X E A((Jo)) 2: 1 - 0'. Then, C(:z:) = {(J E
e : :z:
e,
E A((J)} is a level 100(1- 0')%
confidence set for (J.
But in bioequivalence testing in the past, tests have often been constructed from confidence sets. A result related to this practice is the following.
Theorem 4 (Aitchison, 1964) Let the data X have a probability distribution that depends on a parameter (J. Let
e
denote the parameter space. Suppose C( X) is a 100( 1- 0')%
2: 1- a. Then, for each (Jo E e,
level-a test of Ho : (J = (Jo.
confidence set for (J. That is, for each (J E 0, Po ((J E C (X))
A((Jo)
= {:z: : (Jo E C(:z:)}
is the acceptance region of a
Unfortunately, Theorem 4 has not always been carefully applied in the bioequivalence
area. Commonly, 100( 1- 20')% confidence sets are used in an attempt to define level-a tests.
Theorem 4 guarantees only that a level-2a test will result from a 100(1 - 20')% confidence
set. Sometimes, the size of the resulting test is in fact 0', but this is not generally true.
In this subsection we use Theorem 3 to show the correspondence between the 100(1 - 0')%
confidence interval (15) and the size-a TOST. In the next subsection, we criticize the
practice of using 100( 1 - 20')% confidence sets to define bioequivalence tests.
Let () = TIT - TlR. The family of size-a tests with acceptance regions
(16)
16
"
leads to usual equivariant confidence interval, which is of the form (7) but with tOt,r replaced
by t Ot !2,r'
However, no current law or regulation states one must employ confidence sets that are
equivariant over the entire real line. Using Theorem 3 and inverting the family of size-a
tests defined by, for (Jo 2: 0,
(17)
and for (Jo < 0,
(18)
..
,
A( (Jo) =
{(x, y, s) : x - y -
(Jo
:s: tOt,rSVm- 1 + n- 1}
yields the 100(1- a)% confidence interval (15). Technically, when inverting (17) and (18),
the upper confidence limit will be open when X - Y + tOt,rSVm-1 + n- 1 < O. This point is
inconsequential in bioequivalence testing. The only value of the upper bound with positive
probability is 0, and, in bioequivalence testing, the inference TIT f. TlR is not of interest. In
terms of operating characteristics, the confidence interval with the possibly open endpoint
has coverage probability 100(1 - a)% everywhere. The confidence interval (15) also has
coverage probability 100( 1 - a)% except at TIT - TlR = 0 where it has 100% coverage
probability.
Note that the family of tests (17) contains the one-sided size-a t- test for (6), and the
family of tests (18) contains the one-sided size-a t-test for (5), in contrast to the family of
tests (16). The 5% TOST is equivalent to asserting bioequivalence
(19)
(JL
< TIT - TlR < (Ju
if and only if the 95% confidence interval [D1,Dt] C «(JL, (Ju). Therefore, as pointed out
by Hsu, Hwang, Liu, and Ruberg (1994), it is more consistent with standard statistical
theory to say that the 100(1 - a)% confidence interval [D 1 , Dt), instead of the ordinary
100( 1 - 2a)% confidence interval (7), corresponds to the two one-sided tests.
.~
Pratt (1961) showed that for the T = 00 case (Le. S = 0"), when TIT = TlR, that is, when
the test drug is indeed equivalent to the reference drug, [D 1 , Dt] has the smallest expected
length among all 100(1 - a)% confidence intervals for TIT - TlR.
Brown, Casella, and Hwang (1995) generalized Pratt's result and obtained the confidence
region for a vector parameter (J which has the smallest expected volume when (J = O. The
confidence set is constructed through Theorem 3 using the family of size-a Neyman-Pearson
likelihood ratio tests for H o : (J = (Jo versus H a : (J = O. When iJ is multivariate normal with
unknown mean vector (J and known variance-covariance matrix :E, the acceptance regions
are
A«(Jo) =
{iJ : (J~:E-l(iJ - (Jo)hl(J~:E-l(JO > -tOt,co} ,
which leads to the confidence region
(20)
17
,
•
Their paper describes and illustrates interesting geometric properties of C( 8).
5.2
100(1 - 2a)% confidence intervals
Bioequivalence tests are often defined in terms of 100( 1 - 2a)% confidence sets. That is,
if 0 denotes the parameter of interest, 03 denotes the set of parameter values for which
the drugs are bioequivalent, and C(X) is a 100(1 - 2a)% confidence set for 0, then the
..
drugs are declared bioequivalent if and only if C(X) c 03. This practice seems to be based
entirely on the perceived equivalence between the 100(1- 2a)% confidence interval (7) and
the size-a TOST of (2). This practice is encouraged by the fact that both FDA (1992) as
well as EC-GCP (1993) specify that the a = .05 TOST should be executed by constructing
a 90% confidence interval. In the bioequivalence literature, when used in this way, the 90%
is called the assurance of the confidence set.
The intent of the regulating agencies is clearly to use a test with size a = .05. Unfortunately, bioeqtlivalence tests have been proposed using 100( 1 - 2a)% confidence sets
without any verification that the resulting tests have size-a. Theorem 4 guarantees that
the resulting test is a level-2a test, not size-a. In this section we will explore the usage
of 100(1 - 2a)% confidence sets. We shall show that the usual 100(1 - 2a)% confidence
interval (7) results in a size-a TOST of (2) because (7) is "equal-tailed." So the relationship
is deeper than the "algebraic coincidence" mentioned by Brown, Casella and Hwang (1995).
But we shall see in examples that the use of 100(1 - 2a)% confidence sets can result in
both liberal and conservative bioequivalence tests. Because there is no general guarantee
that a 100( 1 - 2a)% confidence set will result in a size-a test, we believe it is unwise to
attempt to define a size-a test in terms of a 100(1 - 2a)% confidence set. Rather, a test
with the specified Type I error probability of a should be used. Theorem 3 might be used
to construct the corresponding 100(1 - a)% confidence set.
."
Let [C-, C+] denote (7), the usual 100(1 - 2a)% confidence interval for 'f/T - 'f/R. Why
does rejecting Ho in (2) if and only if [C-,C+] c (OL,eU) result in a size-a test? The
superficial answer is that, obviously, C+ < eu is equivalent to Tu < -to/,r and C- > 0L is
equivalent to TL > ta,T' Thus, the test based on [C-, C+] is equivalent to the size-a TOST .
But a more thorough understanding of this is suggested by the following result (Exercise
9.1, Casella and Berger, 1990).
Theorem 5 Let the data X have a probability distribution that depends on a real-valued
parameter e. Suppose ( - 00, U(X)] is a 1OO( 1- al) % upper confidence bound for e. Suppose
[L(X), 00) is a 100(1 - a2)% lower confidence bound for O. Then, [L(X), U(X)] is a
100(1 - al - a2)% confidence interval for e.
Now consider the 100(1 - 2a)% confidence interval [C-,C+] for 0 = 'f/T - 'f/R. The
interval (-00, C+] is a 100(1 - a)% upper confidence bound for O. From Theorem 4, the
18
.
test that rejects Hoz in (6) if and only if C+ < Ou is a level-a test of Hoz . Likewise, [C-, 00)
is a 100( 1 - a)% lower confidence bound for 0 and the test that rejects HOI in (5) if and
only if C- > OL is a level-a test of HO!. Forming an IUT from these two level-a tests yields
a level-a test of Ho in (2), by Theorem 1. Thus, we see that it is not so important that
[C-, C+J is a 100(1 - 2a)% confidence interval for O. Rather, it is the fact that (-00, C+J
and [C-, 00) are both 100(1- a)% confidence intervals that yields a level-a test. That is,
it is important that [C-, C+] is an "equal-tailed" test.
It is easy to see that 100( 1 - 2a)% confidence intervals will not always yield size-a tests.
Consider an "unequal-tailed" 100( 1 - 2a)% confidence interval for 0 = 1]T - 1]R, [Cl ,CiL
defined by
-,
(21)
[x -
Y - t 0'2,r SJm- 1 + n- l , X - Y
+t
O'l,T
SJm- 1
+ n- l ]
,
where al + az = 2a. Using (-00, Ci] to define a test of Hoz yields a size-al test. And using
[Cl ' 00) to define ~ test of HOI yields a size-az test. Therefore, by Theorem 1, the IUT that
rejects Ho if and only if [Cl,Ci] C (OL,OU) has level max{all az}. That this test has size
equal to max{ aI, az} can be verified using Theorem 2. This relationship between the size
of the test and the maximum of the one-sided error probabilities is alluded to by equation
(1) in Yee (1986). The size of this test can be made arbitrarily close to 2a by choosing al
close to zero and az close to 2a. In this problem, the only 100(1- 2a)% confidence interval
of the form (21) that defines a size-a test happens to be the usual, equal-tailed confidence
interval, [C-, C+].
The preceding example using an unequal-tailed test simply illustrates that defining a
bioequivalence test in terms of a 100( 1 - 2a)% confidence interval can lead to a liberal
test with size greater than a. But, no one has proposed using the interval (21) to define a
bioequivalence test. So we now discuss two other examples that have been proposed in the
bioequivalence literature. Both examples concern testing (1) about the ratio J.LT / J.LR.
Tests based on 100(1- 2a)% Fieller-type confidence intervals provide examples of tests
that are liberal. Mandallaz and Mau (1981), Locke (1984) and Kinsella (1989) all propose
using a Fieller (1940, 1954) confidence interval to estimate J.LT / J.LR. Neither Locke nor
Kinsella propose constructing a bioequivalence test using this interval. But Mandallaz and
Mau (1981), Yee (1986,1990), Metzeler (1991) and Schuirmann (1989) all propose defining
a test of (1) using these Fieller confidence intervals, and all suggest that a 100( 1 - 2a)%
confidence interval should be used. Metzeler (1991) and Schuirmann (1989) give graphs
of the power function of their proposed tests that show that the tests have size greater
than the specified a. For example, Figures 3 through 9 in Metzeler (1991) are graphs
of 1 - (power function) based on the Mandallaz and Mau (1981) confidence interval. At
bu = 1.2, the rejection probability is about .07 for the a = .05 test, and, the power is about
.15 for the a = .10 test. These figures cover a variety of sample sizes and values of a Z • But
19
..
•
.
,
in all cases the rejection probability exceeds the nominal 0: at bu = 1.2. The same liberality
is illustrated by Figures 3-13 of Schuirmann (1989).
On the other hand, a test defined in terms of a 100(1 - 20:)% confidence set might be
very conservative. An example is the test proposed by Chow and Shao (1990) for testing
(1) about the ratio J.LT / J.LR. Specifically, Chow and Shao considered a two period crossover
design with no carry-over, period or sequence effects. Let X denote the sample mean
vector with mean J.l = (J.LT, J.LR)' and let S denote the sum of cross-products matrix. Let m
patients receive the first sequence, n patients receive the second sequence and n* = n + m.
Then, C = {J.l : T 1 ~ F a,2,n*-2} defines a 100(1 - 0:)% confidence ellipse for J.l, where
T 1 = n*(n* - 2)(X - J.l)'S-l(X - J.t)/2 and F a ,2,n*-2 is the upper 1000: percentile of an
F-distribution with 2 and n* - 2 degrees of freedom. Chow and Shao propose rejecting Ho
in (1) and concluding H a : bL < J.LT / J.LR < bu is true if and only if the 90% confidence ellipse
is contained in the cone defined by Ha • They do not comment on the actual size of this
test, but we assume 90% was chosen to be 100(1 - 20:)% where 0: = .05.
Chow and Shao's test can be described much more simply by recalling the relationship
between the confidence ellipse, C, and simultaneous confidence intervals for all linear functions 1'J.l (Scheffe, 1959). J.l E C if and only if I'X - V2Fa,2,n*-21'SI/(n*(n* - 2))
~
1'J.l
~
+ V2Fa,2,n*-21'SI/(n*(n* -
2)) for every vector I. But, in fact, the only two vectors
needed to define Chow and Shao's test are 1£ = (1, -bL)' and lu = (1, -bu)'. The hypothe-
I'X
ses in (1) can be written as Ho : lLJ.l ~ 0 or luJ.l ~ 0 and H a : luJ.l < 0 < lLJ.l. Furthermore,
the ellipse C is below the line luJ.l = 0 if and only if luX+V2Fa,2,n*-21uSlu/(n*(n* - 2)) <
0, that is, the upper endpoint of the confidence interval for luJ.l is negative. Similarly, the
ellipse C is above the line lLJ.l = 0 if and only if lLX - V2Fa,2,n*-21LS1£/(n*(n* - 2)) >
If we define
TL
=
[LX
and
Tu
=
VILS1£/(n*(n* - 2))
luX
o.
,
. VluSlu/(n*(n* - 2))
then Chow and Shao's test rejects Ho if and only if
(22)
TL
> V2Fa,2,n*-2 and
Tu
<
-V2Fa,2,n*-2.
This simple description of Chow and Shao's test has not appeared before. And, in this form,
it is apparent that this test can be viewed as an IUT. A reasonable test of HOL : lLJ.l ~ 0
versus HaL: lLJ.l > 0 is the test that rejects HOL if TL > J2Fa ,2,n*-2. A reasonable test
of Hou : luJ.l ~ 0 versus Hau : luJ.l < 0 is the test that rejects Hou if Tu < -J2Fa,2,n*-2.
Thus, Chow and Shao's test is the IUT of Ho versus Ha formed by combining these two tests.
Theorems 1 and 2 then tell us that the actual size of this test is 0:' = P(T > J2Fa ,2,n*-2),
where T has a Student's t distribution with n* - 1 degrees of freedom. This is because TL
has this t-distribution if lLJ.l = 0, and Tu has this t-distribution if luJ.l = O. That is, 0:' is
20
the size of each of the two individual tests. We computed a' using a 90% confidence ellipse
as suggested by Chow and Shao. We found that a' = .017 for m = n = 5, 10, and 15, and
a' = .016 for m = n = 20, 30, and 00. Thus, if the intent of using a 100(1 - 20'.)% = 90%
confidence ellipse was to produce a bioequivalence test with type I error probability of
0'. = .05, the result was very conservative.
A test ofH o versus Ha with the desired size of 0'. can be obtained by replacing J2FCi ,2,n -2
with the t percentile, tCi,nO-l in (22). Then each of the individual tests is size-a and the
combined IUT also has size-a. This test is uniformly more powerful than Chow and Shao's
test because the rejection region of Chow and Shao's test is a proper subset of this one. This
test is the analogue of the TOST for this crossover model. In fact, Yang (1991) proposed
this test for this problem as an alternative to Chow and Shao's test. But Yang did not state
that this test was uniformly more powerful nor quantify the conservativeness of Chow and
O
••
Shao's test.
Our conclusions from the results and examples in this subsection are simple. The usage of 100( 1 - 20'. j% confidence sets to define bioequivalence tests should be abandoned.
This practice produces tests with the appropriate size only when special, "equal-tailed"
confidence intervals are used, and offers no intuitive insight. The mixture of 100(1 - 20'.)%
confidence sets and size-a tests is only confusing. Rather, a test with the specified Type I
error probability of 0'. should be used. The IUT method can usually be used to construct
such a test. Then, Theorem 3 might be used to construct the corresponding 100( 1 - a)%
confidence set.
6
.~
Multiparameter Equivalence Problems
Until now, we have discussed bioequivalence testing in terms of only one parameter. In this
section, we discuss four problems in which the desired inference is equivalence in terms of
two or more parameters.
The first two examples have been discussed as multiparameter bioequivalence problems
by several authors. But, in most cases, the tests that have been proposed do not have
the correct size a. The proposed tests do not properly account for the multiple testing
aspect of this problem. These two multiparameter examples vividly illustrate that the IUT
method can provide a simple mechanism for constructing tests with the correct size a, even
in seemingly complicated bioequivalence problems. Size-a tests can be combined to obtain
an overall size-a test. No adjustment for multiple testing is needed if the IUT method is
used.
The last two examples involve FDA practices which formulate what we believe to be
practical equivalence problems as significant difference problems instead.
21
6.1
Simultaneous AVe, T max and Cmax bioequivalence
Sections 4 and 5 discussed bioequivalence testing in terms of only one parameter. That
is, the test and reference drugs are to be compared with respect to one of the following:
average AUCs, average T max or Cmax ' FDA (1992) and EC-GCP (1993) consider two drugs
are bioequivalent only if they are similar in all three parameters.
Westlake (1988) considered the problem mentioned above. Assume the measurements
are lognormal so that, after log transformation, we wish to consider hypotheses like (2). Let
the superscripts A, T, and C refer to the variables AUC, T max , and Cmax , respectively. For
example, 1Jk denotes the mean of 10g(Cmax ) for the reference drug. The test and reference
drugs are to be considered bioequivalent only if
...
fh < 1J~
H:: and OL <
and OL <
(23)
- 1J~ < Ou
1Jt - 1Jk < Ou
1Jf - 1Jk < Ou
Using current FDA guidelines, Ou = log(1.25) = -log( .80) = -OL. If one variable is'
deemed more important than another, the limits could be different for the different variables.
and
For example, if AUC was considered more important than T max , then the limits
for AUC could be chosen to be narrower than the limits
and O~ for T max •
The statement
in (23) should be the alternative hypothesis in this multivariate
bioequivalence test. The null hypothesis, H should be the negation of H~. That is, H
states that one or more of the six inequalities in H~ is false. Westlake proposed testing H
versus
by doing three separate tests, one for each variable. Specifically, he proposed
using the TOST to test (2) for each variable. The drugs will be declared bioequivalent only
if each of the three tests rejects its hypothesis. Furthermore, Westlake said a Bonferroni
Or
H:
o
H:
ot
Oe
o
o
correction should be used, and each TOST should be performed at the a/3 level to account
for the multiple testing.
Westlake's procedure is conservative. The size of Westlake's test is a/3, not a. This
is true because, although he did not use this terminology, he has proposed an JUT. The
alternative H~ is the intersection of three statements, one about each variable. Computing
three separate TOSTs and concluding
H~
is true only if all three TOSTs reject, is an JUT.
By Theorem 1, this test has level a/3 if each TOST is performed at level a/3. In fact,
Theorem 2 can be used to show that this test has size equal to a/3.
Therefore, to test H()' versus H::', Westlake's procedures can be used except that each
of the three TOSTs should be performed at size-a. The resulting test has probability at
most a of declaring the drugs to be bioequivalent, if they are bioinequivalent.
A test that is uniformly more powerful, but still has size-a will be obtained if the test
we propose in Section 4.2 is used to perform the three tests, rather than using the three
TOSTs. Again, each of these tests would be performed at size-a.
22
An alternative way of assessing the simultaneous bioequivalence of AUC, T max and Cmax
is to inspect the Brown, Casella, and Hwang (1995) confidence set (20), generalized to the
~ unknown case. Suppose
(XiA,xT,xP)', (li A,liT, liC)"
i = 1, ... ,n, are log-transformed
i.i.d. observations on AUC, T max and C max under the test and reference drugs, respectively.
Let Zi = (X iA, xT, Xp)' -(li A,liT, liC)', i = 1, ... , n, which are assumed to be multivariate
normal with mean 0 = (TJ~ - TJA, TJt - TJk' TJ¥ - TJ~)' and unknown variance-covariance matrix
-A -T -;;C
~. Let 0 = (Z ,Z ,Z )' and ~ be the sample mean vector and variance-covariance matrix
of the Z i 'so Then 0'0 is multivariate normal with mean 0'0 and variance-covariance matrix
O'~O/n, while (n -1)O'tO/O'~0 is independent of 0'0 and has a X2 distribution with n-1
degrees of freedom. Thus, a size-a test for H o : 0 = 00 is obtained using the acceptance
region
A(Oo) = {(O, t)
: O~(O - Oo)h/O~tOo/n > -ta,n-l} '
which leads to the confidence region
(24)
Brown, Casella, and Hwang (1995) applied (20) to the simultaneous AUC and C max problem
for illustration, assuming ~ is known. In practice, this assumption is perhaps unrealistic
considering the moderate sample size typical in bioequivalence trials.
6.2
.~
Mean and variance bioequivalence
Liu and Chow (1992) discuss another type of multiparameter bioequivalence. They point
out that bioequivalence should not be defined only in terms of the mean responses for the
two drugs. Rather, the variances of the two drugs' responses should also be considered.
If two drugs have bioequivalent means but different variances, the drug with the smaller
variance would be preferred.
Consider a single variable, e.g., AUC. Let TJT and TJR denote the means oflog(AUC). Let
a} and a'k denote the intrasubject variances of the test and reference drugs, respectively.
The test and reference drugs will be considered bioequivalent only if TJT and TJR are similar
and a} and a'k are similar. To demonstrate bioequivalence, we wish to test
Hm •
o . or
TJT - TJR ~ 0L or ''IT - TJR ~ Ou
~ t::.L or a}/ a'k ~ t::.u
a}/ a'k
versus
Hm .
a •
The constants (h, OU,
< TJT - TJR < Ou
t::.L < a}/a'k < t::.u
OL
and
t::.L, and t::.u would be chosen to define clinically important differences.
23
Liu and Chow (1992) propose a size-a test of
Hg : o}/o}~ ~
D.L or
af/o}~ ~
D.u
versus
Their test is an JUT composed of two size-a tests, one for testing each inequality. Hwang
and Wang (1994) describe an unbiased, size-a test that is uniformly more powerful than
the Liu and Chow test.
The hypotheses
versus
..
H~ : fh
< TIT
- TlR
< Ou
can be tested with a TOST. Because H: is the intersection of H~ and H~, the JUT method
can be used to con~truct a test of H versus H:. The test that rejects H only if the size-a
Liu and Chow test rejects Hg and the size-a TOST rejects Hri is a size-a test of H versus
o
o
o
H:,
Liu and Chow, however, propose a more conservative combination of these two tests.
Let 0' denote the desired size of the test of H Let 0'1 denote the size of the TOST and let
0'2 denote the size of the Liu and Chow test. Then they say to choose 0'1 and 0'2 so that
o.
(25)
0'
=1-
(1 - ad(l -
0'2).
Liu and Chow note that the test statistics use for the TOST are independent of the test
statistics used in their test. But they give no further explanation of (25). The probability
that Hri is accepted, given that Hri is true, is bounded below by 1 - 0'1. The probability
that Hg is accepted, given that Hg is true, is bounded below by 1 - 0'2. So the quantity 0'
in (25) is an upper bound for the probability that at least one of the two tests rejects its
null hypothesis, given that both Hri and Hg are true. This is not the error probability of
.,
the proposed test. The error probability is the probability the both tests reject, given that
either Hri or Hg is true. H is the union of Hri and Hg, not the intersection.
Again, it should be noted that a more powerful size-a test of H will be obtained if the
test from Section 4.2, rather than the TOST, is used to test Hri and the Hwang and Wang
(1994) test is used to test Hg.
o
6.3
o
Drug shelf-life determination
Drug shelf-lives are determined based on stability studies. When the degradation of a drug
can be represented as a simple linear model, the Guideline for Submitting Documentation
for Stability Studies of Human Drugs and Biologies (Food and Drug Administration, 1987)
specifies that the shelf-life be calculated as the time point at which the lower 95% confidence
24
·.
limit about the fitted line crosses the lowest acceptable limit for drug content, usually 90%
of the labeled amount of drug.
It is generally to the advantage of the manufacturer to establish a long shelf-life for
a drug. Different batches of the same drug may degrade at different rates. When the
degradation rate varies greatly from one batch to another, the FDA intends that the shelflife be calculated conservatively, by basing it on the batch with the worst degradation rate,
which is reasonable. On the other hand, when the degradation rate varies little from one
batch to another, the FDA intends to let the manufacturer pool its data across the batches,
in effect rewarding a manufacturer making consistent batches with a longer shelf-life, which
is also reasonable.
However, as we shall see, the FDA has so far failed to recognize that, in analogy with
establishing the bioequivalence between two drugs, the inference desired in deciding whether
to allow the manufacturer to pool is not significant difference but practical equivalence
among the degradation rates.
The model mos"t commonly used to analyze degradation of batches of drugs over time is
where
hth response from the ith batch
lih
(}i
ith batch effect
f3i
degradation rate of ith batch
Xih
=
time at which the stability sample lih is taken
and Ell,' .. ,Eknk are i. i.d. random errors, normally distributed with mean 0 and unknown
variance (J'2. Currently the FDA Guideline states that if the null hypothesis
(26)
-,
Ho : f3i = f3j for all i
f:.
j
is accepted at the 25% level, then a reduced model with all f3i = 13 is used for data with all
batches pooled. If (26) is rejected at the 25% level, then the FDA Guideline suggests that
subsets of batches with similar slopes may be considered together and retested as above.
Ultimately, shelf-lives are computed for each subgroup of batches, and the shortest shelf-life
is used for the drug product.
It is fairly clear that the intent of the FDA Guideline is batches which are practically
equivalent to the worst batch can be pooled. Therefore, as pointed out in Ruberg and Hsu
(1992), the current FDA Guideline is unreasonable, because a large p-value associated with
testing (26) or a homogeneity hypothesis of any subset of f3b' .• , 13k does not necessarily
imply that the corresponding f3i are approximately equal. Further, small variations within
25
a batch, frequent sampling times, and replication of samples at a given time make it easier
to reject (26). Thus, a penalty is paid for doing a good study. Inappropriate pooling of too
many batches is a risk to the consumer, since shelf-lives will tend to be too long as data are
.
I'
,
pooled.
Consider the two data sets cited in Ruberg and Hsu (1992), with summary statistics
given in Tables 3 and 4. The mean square errors f12 corresponding to Tables 3 and 4 are
1.07/25 with 25 degrees of freedom and 18.28/23 with 23 degrees of freedom, respectively.
Even though the degradation rates of the six batches given in Table 3 are reasonably close to
each other, pooling of the batches is not allowed under the current FDA Guideline since the
small f12 causes the null hypothesis (26) to be rejected at the 25% level. On the other hand,
even though the degradation rates of the six batches given in Table 4 are quite different,
pooling of the batches is allowed under the current FDA Guideline since the large f12 causes
the null hypothesis (26) to be accepted at the 25% level.
Table 3: Summary· statistics for Experiment 1 (Copyright 1992 American Statistical Association and American Society for Quality Control)
Batch
1
2
3
4
5
6
.
.
f)i
100.49
100.66
100.25
100.45
100.45
99.98
f3i
-1.515
-1.449
-1.682
-1.393
-1.999
-1.701
SXi
14.629
7.830
5.141
1.379
.608
.608
Table 4: Summary statistics for Experiment 2 (Copyright 1992 American Statistical Association and American Society for Quality Control)
.~
Batch
1
2
3
4
5
6
.
.
f)-I
102.48
103.63
101.24
102.21
102.21
100.07
f3i
-.109
-.449
-.778
.194
-2.218
-1.046
SXi
14.629
2.963
5.141
1.260
.608
.608
An alternative to the FDA Guideline, proposed by Ruberg and Hsu (1992), is to allow
batches with
(27 )
f3i - min f3J- < "1,
j:f.i
26
that is, batches with degradation rates practically equivalent to the worst degradation rate,
be pooled. If L is the desired shelf-life, and 6 is the maximum allowable difference in mean
drug content between batches at time L, then TJ can be defined as 6j L. For example, suppose
the desired shelf-life is 3 years and the maximum allowable difference in mean drug content
between batches after 3 years is 4%, then TJ = 1.33%jyear.
Note that the criterion (27) is a multiparameter generalization of the bioequivalence
criterion (3) because, for k = 2,
f3i - minf3Jo
j:f;i
< TI,
./ i = 1,2
if and only if
..
,
Formulating the problem as one of testing
Ho : not Ha
versus
Ha : maxl~i9 f3i - minl~i~k f3i ~ TJ
is too crude, as it does not allow some but not all the batches to be pooled. The problem is
really one of multiple comparisons with the worst slope. Toward this end, Ruberg and Hsu
(1992) adapted the method of constrained multiple comparison with the best (MCB) of Hsu
(1984) to the present setting of drug stability studies to provide simultaneous confidence
intervals for f3i - minj:f;i f3j, i = 1, ... , k. They are better for equivalence inference than
what can be deduced from the generalization of the Tukey-Kramer confidence intervals, in
the same way that the confidence interval (15) is better for equivalence inference than the
ordinary 100( 1 - a)% two-sided t confidence interval. In fact, in the setting of Section 5,
the constrained MCB confidence interval reduces to the bioequivalence confidence interval
(15). This is not surprising since constrained MCB tries to identify treatment(s) practically
equivalent to the true best treatment. In the case of two treatments (k = 2), a treatment
practically equivalent to the other treatment is either the true best itself or practically
equivalent to the true best treatment.
-.
Tables 5 and 6 display MCB simultaneous confidence intervals on the difference between
each degradation rate and the worst degradation rate for the two experiments.
For Experiment 1, all batches would be pooled under the Ruberg and Hsu (1992) proposal since the MCB upper confidence bounds are all less than 1.33%, and the computed
shelf-life turns out to be 6.2 years, which is not that much different from the shelf-life
computed under the current FDA Guideline.
For Experiment 2, none of the batches should be pooled under the Ruberg and Hsu
(1992) proposal, since the MCB upper confidence bounds are all greater than 1.33%. Computing the shelf-life based on Batch 6 that had the worst degradation rate, the shelf-life turns
out to be 1.7 years, which is drastically shorter than the shelf-life of 15.7 years computed
based on the current FDA Guideline that allows all batches to be pooled.
27
Table 5: 95% MCB confidence intervals for Experiment 1
Batch
1
2
3
4
5
6
2.452
2.433
2.411
2.288
2.168
2.168
D·t
-0.09
-0.05
-0.36
-0.12
-0.92
-0.52
1.07
1.15
0.92
1.30
0.52
1.11
Table 6: 95% MCB confidence intervals for Experiment 2
..
Batch
1
2
3
4
5
6
6.4
2.469
2.401
2.437
2.311
2.202
2.202
D·t
-0.46
-1.23
-1.51
-0.81
-4.10
-2.39
D+
t
4.68
4.53
4.10
5.48
2.39
4.73
Toxicity Studies
The use of bovine growth hormone is a controversial issue (as the articles, both titled "Udder
Insanity," in Consumer Reports, Consumers Union, 1992, and in Time magazine, Horowitz
and Thompson, 1993, indicate). Writing for the Federal Food and Drugs Administration
(FDA), Juskevich and Guyer (1990) reported on a number of experiments that did not
indicate bovine growth hormones would be harmful if present in milk consumed by humans.
A subset of the data from one experiment included in that article gave absolute weights of
-.
..
various organs measured from control hypophysectomized rats and hypophysectomized rats
treated orally with the peptide hormone Recombinant Insulin-Like Growth Factor-I (rIGFI). In addition to groups given rIGF-I orally, one group was given a negative "saline control;"
another was given BSA (bovine serum albumin) as a negative "oral protein" control; yet
another group was given rIGF-I via a subcutaneously (s.c.) implanted osmotic minipump
as a positive control. Spleen weights of rats treated for either 17 days by gavage or 15 days
by continuous subcutaneous infusion are given in Table 7. While comparisons with both
negative controls are of interest, for illustration, we concentrate on comparisons with the
negative "saline" control.
When comparing the three levels of orally fed bovine growth hormone with the saline
control, it seems to us the desired inference is practical equivalence: Weight gains in rats
28
Table 7: Spleen weight (in grams) of male rats
Hormone
1 = None
2 = Oral BSA
3 = Oral rIGF-I
4 = Oral rIGF-I
5 = Oral rIGF-I
6 = S.c. Infusion rIGF-I
•
..
Dosage
(mg/kg per day)
o
1.0
0.01
0.1
1.0
1.0
Sample
Size
20
20
20
20
20
10
Mean
Weight
147.6
151.3
147.2
149.6
147.1
239.6
SEM
Weight
8.8
8.6
5.7
5.8
6.6
17.9
given any growth hormone are close to weight gains of rats given the saline control, as suggested by the reporting in Juskevich and Guyer (1990) of non-significant p-values associated
with Hormones 3, 4, 5, adjusted for multiplicity in accordance with Dunnett's method. The
FDA wishes to guard against concluding that the weight gains are close, if they are not
close. But a large p-value associated with a test of equality does not necessarily indicate
practical equivalence, as it may be a reflection of too small a sample size relative to noise in
the data. Otherwise, safety can invariably be concluded by conducting a sloppy experiment
with a small sample size, and contradicted by conducting a well controlled experiment with
a large sample size. (Schoenfeld 1986 made essentially the same point in the context of
environmental toxicology studies.) The positive control was included in the experiment
presumably to verify the measurement process was capable of detecting known differences,
in order to lend support to the interpretation that non-significant differences indicate practical equivalence, as suggested by the reporting in Juskevich and Guyer (1990) of significant
p-values associated with the positive control. But such support is indirect at best. Instead,
we believe the problem should be formulated as one of multiparameter practical equivalence
directly.
Consider the one-way model
Yia=TJi+Eia, i=l, ... ,k, a=l, ... ,ni,
,.
where Yia is the ath observation under the ith treatment, and Ell, ... , Eknk are i. i.d. normal
with mean 0 and variance (72 unknown. We use the following notations,
n.
Y;
1Ji
= 'L Yia/ni,
a=1
k
&2
=
n.
k
MSE='L'L(Yia-y;)2/'L(ni- 1),
i=1a=1
i=1
for the sample means and the pooled sample variance.
29
Practical equivalence can be established if confidence intervals for the weight gain differences between rats given growth hormone and rats given the saline control are tight around
O. Dunnett's (1955) two-sided multiple comparison with a control (MCC) method gives the
following 95% simultaneous confidence intervals
-24.71
-28.81
-26.41
-28.91
57.21
<
<
<
<
<
"12
"13
"14
"15
"16
-
"11
"11
"11
"11
"11
< 32.11
< 28.01
< 30.41
< 27.91
< 126.79
Since the confidence interval for "16 - "11 is to the right of 0, the experiment is sufficiently
sensitive to distinguish between the positive control and the saline control. Whether the
experiment has established that rJGF-J dose 3, 4, and 5, are practically equivalent to the
saline control or not depends on what can be considered practically equivalent. If mean
spleen weights within 35 of the control can be considered equivalent (say), then practical
equivalence has beell established.
As another illustration that the practice of attempting to establish practical equivalence
using a 100(1-2a)% confidence set is unsound, consider rejecting Ho when the 100(1-2a)%
two-sided Dunnett's confidence set for "1i - "11,i = 2, .. . ,k, is contained in (fh,8u) x··· X
(8L, 8u). Then the test has size-a when k = 2, but becomes increasingly conservative as k
increases. Critical value for the TOST-based 5% JUT test (discussed later in this section)
is 1.660 regardless of the number of treatments k, while the critical value associated with
the 90% two-sided Dunnett's confidence set is 2.273 for k = 6 with the sample size pattern
of this example.
To better establish practical equivalence, one can construct a multiparameter analogue
ofthe confidence interval (15) as follows. Let 8 = (7]2-"1t. ... , 7]k-"11) and for I ~ {2, ... , k},
let e[ = {8 : "1i > 7]1 for i E I and 7]j
"11 for j fJ. I}. Consider the family of size-a tests
such that for 8 E e[, the acceptance region is
s:
A( 8)
-,
=
{ (1}1' ... , 1}k,(') : -d[&Vnil
+ nIl s: rrJr 1}i -
1}1 - ("1i - 7]1)
and IIJiIX1}j -1}1 - (7]j - "11) < d[&Vnj 1 + nIl}.
Here d[ is the constant so that
P ( IIJ;f 1}j - 1}1 - ("1j - 7]d < d[&Vnj 1 + nIl
and
rpJp 1}i -
1}1 - ("1i - 7]d 2: -d[&Vnil
= 1- a.
Let
30
+ nIl)
and
+
D i = max ( TJi - TJ1
13t
A
A
+ dWVni-1 + n 1-1 )
+
A
Then we have the following theorem.
Theorem 6 For all TJ1, ... ,TJk and (12,
P{ Di < TJi - TJ1 :S Dt, i
= 2, ... , k} 2:
1-
Q.
Proof. While formally the result follows from Theorem 3, a direct proof is easy. Note that
TJi - TJ1 :S Dt trivially if TJi - TJ1 :S 0, while Di
suppose () Eel. Then
< TJi - TJ1 trivially if TJi - TJ1 > O. Thus,
P(Di < TJi - TJ1 :S Dt,i = 2, ... ,k)
P( Dj < TJj - TJ1 for all j ~ I and TJi - TJ1 :S Dt for all i E 1)
> .. P(f'j - r,1 - dlirJnj 1 + nIl < TJj - TJ1 for all j ~ I and
TJi - TJ1 :S r,i - r,1
1-
+ dlirJn i 1 +n I 1 for all i E 1)
Q.
For the data in Table 7, 95% simultaneous confidence intervals turn out to be
-22.10
-26.20
-23.80
-26.30
0
<
<
<
<
<
TJ2 - TJ1
TJ3 - TJ1
TJ4 - TJ1
TJs - TJ1
TJ6 - TJ1
< 29.50
< 25.40
< 27.80
< 25.80
< 123.60
which, with the exception of the interval for TJ6 - TJ1, are narrower than Dunnett's confidence
intervals. Again we have established that the experiment is sufficiently sensitive to distinguish between the positive control and the saline control. If mean spleen weights within 30
of the control can be considered equivalent, then practical equivalence has been established.
-.
(The reader is reminded that this illustrative analysis is based on a subset of the data from
one of a number of experiments described in Juskevich and Guyer, 1990.)
Our analysis seems reasonable since the defining values fh and (}u such that TJi can be
considered practically equivalent to TJ1 if (}L < TJi - TJ1 < (}u have not been articulated in
this setting, and the presentation in Juskevich and Guyer (1990) was in terms of p-values
associated with Dunnett's method. But if (}L and (}u are given, and we concentrate on the
problem of establishing practical equivalence between the treatments of interest and the
control, then a better method exists. In our discussion below, we assume Treatment 1 is a
control and Treatments 2, ... , k correspond to increasing dose of a substance whose safety
is being assessed. In this situation, it is not unreasonable to suspect that if the substance
has an effect then the effect is monotone. It is thus appealing to have a procedure which is
31
•
monotone, in the sense that it does not declare a higher dose to be practically equivalent
to the control if it fails to declare a lower dose to be practically equivalent. One way
to construct a monotone method is to make it a stepwise method, as follows. Let each
1>i, i = 2, ... , k, be a level-a test for
H~ : not H~
versus
H~ : 8L < 'T/i - 'T/l < 8u
IStep 11
1>2
If
accepts,
then stop.
else assert 8L < 'T/2 - 'T/l < 8u and go to Step 2.
If
accepts,
1>3
then stop.
else assert 8L < 'T/3 - 'T/l < 8u and go to Step 3.
.,
I
If 1>k
then
else
Step k- 1 1
accepts,
stop.
assert 8L < 'T/k - 'T/l < 8u and stop.
To see that the stepwise method is valid, let I = {i : 2 :::; i :::; k, Hb is true}. If I = 0,
then the probability that the stepwise method makes at least one incorrect assertion of
equivalence is O. Otherwise, let i* = min {i : i E I}. Then
-..
P( at least one incorrect assertion of equivalence)
P( assert 8L
=
<
'T/i - 'T/l
<
8u for some i E I)
P(assert fh < 'T/i*'- 'T/l < 8u )
< a
Interestingly, the stepwise method can be viewed as a stepwise execution of the IUT for
Ho : not Ha
versus
Ha
:
fh < 'T/i -
'T/l
32
< 8u for all i = 2, ... , k
..
.
,
which rejects Ho if every <Pi rejects. Each <Pi in the stepwise method can be the TOST
(possibly using a common pooled cr). Comparing the three dose levels of rIGF-I with the
saline control in Table 7, using a = .05 and -(h = (Ju = 25, the stepwise method based on
TOST asserts all three levels to be practically equivalent to the control (whether one pools
all the (1 estimates or not). A more powerful procedure would be obtained by using the new
test from Section 4.2 in place of the TOST. In either case, the method remains valid even
if the suspected monotonicity of substance effect is false.
Remark. Assuming 171 is known and 171 ~ 172'" ~ 17k, Schoenfeld (1986) proposed the
same stepwise method as ours except his <Pi'S are based on isotonic regression estimates.
His method is valid if one is willing to assume monotonicity of substance effect, since his
individual <Pi'S are level-a tests under that assumption.
Williams (1971) proposed using isotonic regression estimates to test the hypotheses
Hl) : 171
= ... = 17m
H~
~
versus
: 171
•.. ~ 17m with at least one inequality
in the following stepdown fashion:
is accepted,
If H~
then stop.
else assert 17k > 171 and go to Step 2.
IStep 21
If Hok - 1
then
else
is accepted,
stop.
assert 17k-I> 171 and go to Step 3.
IStep k-1!
t
..
If H6
then
else
is accepted,
stop.
assert 172 > 171 and stop.
Like Schoenfeld's method, the validity of Williams' stepdown method depends on the assumption 171 ~ ... ~ 17k. Williams seemed to suggest that, in the setting of toxicity studies,
one can conclude safety up to dose m if Hl) is accepted. We disagree with this suggestion because, again, the acceptance of Hl) : 171 = ... = 17m does not imply 172,.··, 17m are
necessarily close to 171.
33
Williams' (1971) stepdown method has also been used to determine the lowest dose
which can be declared to be significantly higher than the control in minimum effective dose
(MDE) studies. We present a stepdown method based on an alternative formulation. Let
each <Pi, i = 2, ... , k, be a level-a test for
versus
H~ : Tli > TIl
IStep 11
accepts,
If <Pk
then stop.
else assert Tlk
..
> TIl and go to Step 2.
I
IStep 21
•
If <Pk-l
then
else
accepts,
stop.
assert Tlk-l
I
If <P2
then
else
> TIl and go to Step 3.
Step k- 1 1
accepts,
stop.
assert Tl2 > TIl and stop.
H the <Pi'S are level-a tests without the monotonicity assumption (e.g., the one-sided size-a
i-test for Hh, possibly using a common pooled a), then this method is valid in the sense
that
P( at least one dose is incorrectly asserted to be better than the control) ~ a
regardless of whether the monotonicity assumption is true or not, in contrast to Williams'
method.
Finally, we mention that multiparameter tests and confidence sets are available when
practical equivalence is defined in units of a, that is, Treatments i and j are considered
practically equivalent when fh < (Tli-Tlj)/a < Bu. The reader is referred to Giani and Finner
(1991), Bofinger, Hayter and Liu (1993), and Hochberg (1995) for this line of research.
7
Acknowledgement
We thank Dr. Hans Frick and Dr. Volker Rahlfs for references on European bioequivalence
guidelines.
34
A
Details of New Test in Section 4.2
A size-D, nearly unbiased test for (2) was described geometrically in Section 4.2. In Section
A.I, formulas and computational suggestions are given for the quantities that define that
test. The construction in Section 4.2 was valid for D > D*. In Section A.2 a similar
construction yields a size-D, nearly unbiased test for D :::; D*. Brown, Hwang and Munk did
not describe an unbiased test for D :::; D*.
A.I
Formulas for Section 4.2
Define functional notation for the transformation from rectangular to polar coordinates by
..
•
J(d-~)2+s:
cos-1((d - ~)/v(d,s*))
for
-00
<d<
00
and s*
~
O. The inverse transformation is
~
d(v,b)
+ v cos(b)
v sin(b),
s*(v,b)
for v ~ 0 and 0 :::; b :::; 1r. The point (d, s*) = (0, ~y'r/tOt,r) is the vertex of the triangular
region RT. Therefore,
bo
b(O, ~y'r /tOt,r),
2~
sin(1r - bo),
(d( vo, bo), s*( vo, bo)),
v( -db s*d.
The line of length Vo in Figure 2 has b = 31r/2 - boo Therefore,
31r /2 - bo - cos- 1(vo/ v),
31r/2 - bo + cos-1(vo/v).
The angle bLl' defined by (11), is easily found by a numeric root finding method such as
bisection.
Finally. for any point (d,s*) on {V = v}, s* = ..}v 2 - (d - ~)2. For any point (du,s*u)
on {l/ = v} with du :::; 0, there is a unique point (d/, s*d on {V = v} with du ~ 0 such that
the line joining (d/, s*d and (0,0) and the s* axis form the same angle as the line joining
(d u, s*u) and (0,0) and the s* axis. This point satisfies
35
which has the solution
dI
(28)
d
( V2 _ ~2)
u
= --::-....:;.;:.--,------'
'--::2
V + 2du~ _ ~2
Using this expression for dl in terms of du , the equation
is a function of the single variable duo The unique solution to this equation, in the interval
~ - v S; du S; 0 is easily found by a numeric root finding method such as bisection. Call
the solution du. Define dL by (28) using du = du. The angles bu( v) and bL( v) are
bu(v)
.
.
A.2
New test-for
0
b (d u , JV 2
-
b (dL' JV 2
-
,
(dL - ~)2) .
(du -
~)2)
S; o.
For small values of a S; a., a size-a, nearly unbiased test of (2), that is uniformly more
powerful than the TOST, can be constructed. The construction is very similar, and somewhat simpler, than the construction in Section 4.2. The notation of Section A.l will be
used, and Figure 5 illustrates the construction.
For a S; a., the point on lL closest to (~, 0) is the vertex of RT, (do, s.o) = (0, ~VrjtOl,r)'
Let Vo = v(do,s.o). For v S; VO, R 2 (v) = {b: bo < b < rr}, exactly the points in the TOST.
The arc A o is such an arc. For Vo < v < 2~, R 2 ( v) consists of two arcs. R 2 ( v) = {b :
bdv) < b < bu(v)} U {b: b2 < b < rr}. bL(V), bu(v) and b2 are defined as before. The two
solid pieces of arc Al are examples of these arcs. The semicircle {V = v} does not intersect
RT near the s. axis so there is no need to check that {b : bL( v) < b < bu(v)} covers all the
TOST. For v ~ 2~, R 2 (v) = {b: bL(v) < b < bu(v)}. The solid piece of arc A 3 is such an
arc. In Figure 5, R 2 is outlined with a solid line, R I is outlined with a dashed line, and the
intersection is the rejection region of the IUT.
References
Aitchison, J. (1964). Confidence-region tests. Journal of the Royal Statistical Society, Series
B, 26:462-476.
Anderson, S. (1993). Individual bioequivalence: a problem of switchability (with discussion).
Biopharmaceutical Report, 2( 2): 1-11.
Anderson, S. and Hauck, W. W. (1983). A new procedure for testing equivalence in comparative bioavailability and other clinical trials. Communications in Statistics - Theory
and Methods, 12:2663-2692.
36
...
Anderson, S. and Hauck, W. W. (1985). Letter to the editor. Biometrics, 41:561-563.
Anderson, S. and Hauck, W. W. (1990). Consideration of individual bioequivalence. Journal
of Pharmacokinetics and Biophamaceutics, 18:259-273.
Berger, R. 1. (1982). Multiparameter hypothesis testing and acceptance sampling. Technometries, 24:295-300.
Berger, R. 1. (1989). Uniformly more powerful tests for hypotheses concerning linear inequalities and normal means. Journal of the American Statistical Association, 84:192199.
Bofinger, E. (1985). Expanded confidence intervals. Communications in Statistics - Theory
and Methods, A14:1849-1864.
Bofinger, E. (1992). Expanded confidence intervals, one-sided tests, and equivalence testing.
Journal of Biopharmaceutical Statistics, 2:181-188.
Bofinger, E., Hayter, A. J., and Liu, W. (1993). The construction of upper confidence
bounds on the range of several location parameters. Journal of the American Statistical
Association, 88 :906-911.
Brown, L. D., Casella, G., and Hwang, J. T. G. (1995a). Optimal confidence sets, bioequivalence, and the limac;on of Pascal. Journal of the American Statistical Association,
90:880-889.
Brown, 1. D., Hwang, J. G., and Munk, A. (1995b). An unbiased test for the bioequivalence
problem. Technical report, Cornell University.
'.
Chow, S.-C. and Liu, J.-P. (1992). Design and Analysis of Bioavailability and Bioequivalence
Studies. Marcel Dekker.
Chow, S.-C. and Shao, J. (1990). An alternative approach for the assessment of bioequivalence between two formulations of a drug. Biometrical Journal, 32:969-976.
Consumers Union (1992). Udder insanity. Consumer Reports, pages 330-332.
Dunnett, C. W. (1955). A multiple comparison procedure for comparing several treatments
with a control. Journal of the American Statistical Association, 50:1096-1121.
EC-GCP (1993). Biostatistical methodology in clinical trials in applications for marketing
authorization for medical products. CPMP Working Party on Efficacy of Medical Products, Commission of the European Communities, Brussels, Draft Guideline edition.
37
FDA (1987). Guideline for Submitting Documentation for Stability Studies of Human
Drugs and Biologics. Center for Drugs and Biologics, Food and Drug Administration,
Rockville, MD.
FDA (1992). Bioavailability and bioequivalence requirements. In U. S. Code of Federal
Regulations, volume 21, chapter 320. U. S. Government Printing Office.
Fieller, E. (1954). Some problems in interval estimation. Journal of the Royal Statistical
Society, Series B, 16:175-185.
Fieller, E. C. (1940). The biological standardisation of insulin. Journal of the Royal Statistical Society, Supplement, 7:1-64.
Giani, G. and Finner, H. (1991). Some general results on least favorable parameter configurations with special reference to equivalence testing and the range statistic. Journal
of Statistical F'lanning and Inference, 28:33-47.
Hochberg, Y. (1995). On assessing multiple equivalences with reference to bioequivalence.
In Nagaraja, H. N., Morrison, D. F., and Sen, P. K., editors, A Festschrift for H. A.
David. Springer Verlag, New York.
Horowitz, J. M. and Thompson, D. (1993). Udder insanity! Time, 141:52-53.
Hsu, J. C. (1984). Constrained two-sided simultaneous confidence intervals for multiple
comparisons with the best. The Annals of Statistics, 12:1136-1144.
Hsu, J. C., Hwang, J. T. G., Liu, H.-K., and Ruberg, S. J. (1994). Confidence intervals
associated with tests for bioequivalence. Biometrika, 81:103-114.
Hwang, J. T. G. and Wang, W. Z. (1994). An unbiased test for bioequivalence in variability.
Technical report, Cornell University.
Juskevich, J. C. and Guyer, C. G. (1990). Bovine growth hormone: Human food safety
evaluation. Science, 249:875-884.
-
Kinsella. A. (1989). Bootstrapping a bioequivalence measure. The Statistician, 38:175-179.
Lehmann, E. L. (1959). Testing Statistical Hypothesis. John Wiley, New York.
Lehmann, E. 1. (1986).
edition.
Testing Statistical Hypothesis. John Wiley, New York, second
Liu, H. and Berger, R. 1. (1995). Uniformly more powerful, one-sided tests for hypotheses
about linear inequalities. Annals of Statistics, 23:55-72.
38
...
t
Liu, J.-P. and Chow, S.-C. (1992). On the assessment of variability in bioavailability /bioequivalence studies. Communications in Statistics - Theory and Methods,
21:2591-2607.
Locke, C. S. (1984). An exact confidence interval from untransformed data for the ratio of
two formulation means. Journal of Pharmacokinetics and Biopharmaceutics, 12:649655.
Mandallaz, D. and Mau, J. (1981). Comparison of different methods for decision-making in
bioequivalence assessment. Biometrics, 20:213-222.
Martin Andres, A. (1990). On testing for bioequivalence. Biometrical Journal, 32:125-126.
Metzler, C. M. (1991). Sample sizes for bioequivalence studies. Statistics in Medicine,
10:961-970.
Muller-Cohrs, J. (1991). An improvement of the Westlake symmetric confidence interval.
Biometrical Journal, 33(3) :357-360.
Munk, A. (1993). An improvement on commonly used tests in bioequivalence assessment.
Biometrics, 49:1225-1230.
Patel, H. 1. and Gupta, G. D. (1984). A problem of equivalence in clinical trials. Biometrical
Journal,26:471-474.
Pratt, J. W. (1961). Length of confidence intervals. Journal of the American Statistical
Association, 56:541-567.
Rocke, D. M. (1984). On testing for bioequivalence. Biometrics, 40:225-230.
Ruberg, S. J. and Hsu, J. C. (1992). Multiple comparison procedures for pooling batches
in stability studies. Technometrics, 34:465-472.
....
Sasabuchi, S. (1980). A test of a multivariate normal mean with composite hypotheses
determined by linear inequalities. Biometrika; 67:429-439.
...
Sasabuchi, S. (1988). A multivariate test with composite hypotheses determined by linear
inequalities when the covariance matrix has an unknown scale factor. Memoirs of the
Faculty of Science, Kyushu University, Series A, Mathematics, 42:9-19.
Scheffe, H. (1959). The Analysis of Variance. Wiley, New York.
Schoenfeld, D. A. (1986). Confidence bounds for normal means under order restrictions,
with application to dose-response curves, toxicology experiments, and low-dose extrapolation. Journal of the American Statistical Association, 81 :186-195.
39
t
•
Schuirmann, D. J. (1987). A comparison of the two one-sided tests procedure and the
power approach for assessing the equivalence of average bioavailability. Journal of
Pharmacokinetics and Biopharmaceutics, 15(6):657-680.
Schuirmann, D. J. (1989). Confidence intervals for the ratio of two means from a crossover
study. In American Statistical Association Proceedings of the Biopharmaceutical Section, pages 121-126. American Statistical Association, Washington, D.C.
Schuirmann, D. L. (1981). On hypothesis testing to determine if the mean of 'a normal
distribution is contained in a known interval. Biometrics, 37:617. Abstract.
-
)
Stefansson, G., Kim, W., and Hsu, J. C. (1988). On confidence sets in multiple comparisons.
In Gupta, S. S. and Berger, J. 0., editors, Statistical Decision Theory and Related
Topics IV, volume 2, pages 89-104. Springer-Verlag, New York.
Westlake, W. J. (19,76). Symmetric confidence intervals for bioequivalence trials. Biometrics,
32:741-744.
Westlake, W. J. (1981). Response to T.B.L. Kirkwood: Bioequivalence testing - a need to
rethink. Biometrics, 37:589-594.
Westlake, W. J. (1988). Bioavailability and bioequivalence of pharmaceutical formulations.
In Peace, K. E., editor, Biopharmaceutical Statistics for Drug Development, pages 329352. Marcel Dekker, New York.
Williams, D. A. (1971). A test for differences between treatment means when several dose
are compared with a zero dose. Biometrics, 27:103-117.
Yang, H.-M. (1991). An extended two one-sided tests procedure. In American Statistical
Association Proceedings of the Biopharmaceutical Section, pages 157-162. American
Statistical Association, Washington, D.C.
Yee, K. F. (1986). The calculation of probabilities in rejecting bioequivalence. Biometrics,
42:961-965.
Yee, K. F. (1990). Correspondence to the editor. The Statistician, 39:465-466.
40
t
s.
s
------+------+----~----d
Figure 1: Irregular boundary of Brown, Hwang and Munk test (solid line) and smoother
boundary of test from Section 4.2 (dashed line). Here r = 3, a = .16 and -8L = 8u = ~.
41
s.
1
.'
A ~'..·' ...'
...
.
..
... ...
. ...
..
.
. ..
·
.·.
.·· ...
·
··· ···
···
·
··· ···
,·
·
·,,··
····
"
...
'.
'.
'. '. '.
-'---+'-~-'-------+-"""'-'-----+---d
o
...
Figure 2: Arcs that define the rejection region R2.
42
...
t
..
I
·
·,
·
····
t
··,,
·
··
,
----+--------+----------t----d
-6
o
Figure 3: Rejection region of new test. Region R 2 (between solid lines) and region R 1
(between dashed lines). Rejection region R = R 1 n R 2 • r = 10 and Q = .05.
43
1
y
t
,
o
Figure 4: Rejection region for TtlT2 test is cone shaped R o • Region between dashed lines
is rejection region of uniformly more powerful Liu and Berger (1995) test. The estimates
X and l' satisfy 6L < X/I" < 6u in the larger cone shaped region.
44
.
•
s.
.
'
...
,
,
,
,
,
,,
,,
...,
.
....
..
···
·
·,···
·
,·
,·
,
,
A1
...
...
.
...
·
····
·
-'---+--""'--------+-----'~---""r---d
.
o
....
Figure 5: Rejection region of new test for a ~ a•. Region R 2 (between solid lines) and
region R 1 (between dashed lines). Rejection region R = R 1 n R 2 • r = 3 and a = .05.
45

Download Report

Berger, R. L. and Hsu, J. C. (1995) "Bioequivalence Trials, Intersection-Union Tests, and Equivalence Confidence Sets"

Paperzz.com

Your Paperzz