STATISTICAL VIEWPOINT

Hypothesis Testing: Examples in Pharmaceutical Process and Analytical Development

David LeBlond
“Statistical Viewpoint” addresses principles of statistics useful to practitioners in
compliance and validation. We intend to present these concepts in a meaningful way
so as to enable their application in daily work situations.
Reader comments, questions, and suggestions are invited. Readers are also invited
to contribute manuscripts for this column. We need your help to make “Statistical
Viewpoint” a useful resource. Please contact coordinating editor Susan Haigney at
[email protected] with comments, suggestions, or manuscripts for publication.
KEY POINTS
The following key points are discussed in this article:
• Tukey’s test can be used to compare the means of multiple groups when
the within-group variation is normal and all groups have the same true
variance. This test is a Neyman-Pearson type test that maintains a desired
family-wise Type I error rate. It is based on the Studentized range distribution.
• Bartlett’s test can be used to compare the variances of multiple groups
when the within-group variation is normal. This test can be used as either
a Neyman-Pearson or Fisherian type test and is based on the Chi-squared
distribution.
• Analysis of covariance (ANCOVA) can be used to compare the intercepts
and slopes of regression lines when the error of measurement is normal,
independent, and identically distributed. This test is based on the F
distribution. When comparing the stability profiles of different batches of
product, ANCOVA is often employed as a Neyman-Pearson type test with
a comparison-wise Type I error rate of 0.25.
• Tables and calculators for the Studentized range distribution can be found
on the Internet.
• The calculations for some complex hypothesis testing procedures, such as
ANCOVA, can be greatly simplified by expressing them in matrix algebra.
The matrix formulas for regression and ANCOVA are available in textbooks.
• Matrix calculations can easily be done in Microsoft Excel using defined
ranges, range labels, and the array functions MMULT, MINVERSE, and
TRANSPOSE.
Journal of GXP Compliance, Summer 2009, Volume 13 Number 3
INTRODUCTION
Hypothesis testing has become a central tool of data-based decision-making. In the last 100 years, a bewildering array of hypothesis tests have been developed, each designed to assist some very specific decision-making situation. The previous issue of "Statistical Viewpoint" (1) describes the Fisherian, Neyman-Pearson, and Bayesian approaches to hypothesis testing. The t-test and analysis of variance (ANOVA) have already been illustrated (1, 2). Based on comments received from readers, three additional hypothesis-testing examples are presented here with an explanation of how they can be performed in Microsoft Excel. The examples are based on realistic and common decision-making situations in pharmaceutical and analytical research and development. In each example, parameters are compared across multiple groups. The parameters compared are either the within-group means, variances, or regression lines (i.e., slopes and intercepts). These hypothesis-testing examples are based on either the Fisherian (P-value as a measure of evidence) or Neyman-Pearson (fixed Type I error) approaches.
These examples are based on fictional data and are for illustration only. While the hypothesis tests described here are available in common software such as Minitab or JMP, these specific tests are not the only ones available and may or may not apply to a specific decision situation. It is always best to seek the aid of a trained statistician when making any critical data-based decision. It is hoped that these examples will provide readers with some concrete benchmarks, illustrate a few novel Excel tools, and reduce some of the mysticism surrounding hypothesis testing.
EXAMPLE 1: TESTING THE EQUALITY OF GROUP MEANS USING TUKEY'S MULTIPLE COMPARISON TEST
A bio-analytical method requires an extraction procedure to obtain the analyte in a form that is free from serum components. It is important that the yield (i.e., analyte recovery) from this extraction be high and well controlled. However, the analytical team has found that the yield is very dependent on the skill and experience of the analyst.

In an effort to better understand the manual steps that are critical to extraction yield, the laboratory supervisor asked each of her seven analysts to repeat the extraction procedure four times. She wanted to test whether the mean yields differ among analysts. If differences were found, she planned to have the analysts cross train, so that the analysts who have lower yield can learn from those who have higher yield. She hoped that this would lead to overall higher yield and consistency across her laboratory.

Each of the seven analysts performed the extraction four times. The results are shown in Table I and also plotted as mean % Extraction +/- 1 SD in Figure 1.

As described previously (1, 2), the ANOVA procedure, based on the F-distribution, tests the null hypothesis of equality for k group true means:

H0: mean1 = mean2 = mean3 = ... = meank

If the null hypothesis is rejected by the F-test, then the differences among the k observed group averages are too large to have occurred often by random chance. Therefore, we infer that some of the true group means are not equal. But the ANOVA gives us no clue in what way the means may differ. Is one group mean unusually high? Can the groups be divided into "low mean" and "high mean" categories? Can the groups be ranked by their means? Are there some groups whose means are not statistically significantly different? How large are the mean differences?

Table I: Extraction results (% Extraction) from seven analysts.

  Analyst     Trials   Average   Standard deviation   Variance   Degrees of freedom
  1  Dick       4        93.5         3.03              9.181           3
  2  Jane       4        92.2         2.95              8.703           3
  3  Kate       4        96.9         2.97              8.821           3
  4  Yihong     4        95.3         3.13              9.797           3
  5  Enrico     4        95.7         3.19             10.176           3
  6  Azita      4        99.8         2.88              8.294           3
  7  Abdel      4        90.1         2.83              8.009           3
Figure 1: Percent extraction for each analyst (mean +/- 1 standard deviation).

One way to answer such questions would be to conduct many separate pair-wise t-tests. For example, in comparing the means of three groups (A, B, and C) we make the three comparisons A-B, A-C, and B-C. A little reflection will show that for k groups, there are k(k-1)/2 pair-wise comparisons. The number of pair-wise comparisons increases dramatically with k. For k=7 (i.e., A to G) groups, 7(7-1)/2 = 21 comparisons would be required. This leads to a multiplicity problem that inflates the Type I error (false positive) rate. If two independent tests are conducted, each at a comparison-wise Type I error rate of 0.05, then the Type I error rate of either or both of the tests failing (called the family-wise error rate) equals 1-(1-0.05)^2 = 0.0975. For 21 independent comparisons the Type I error rate would be 1-(1-0.05)^21 = 0.66. Such a procedure would often suggest differences when in fact none exist. The situation is more complicated still because all pair-wise comparisons are not independent. With our k=7 example, comparisons A-B and C-D are independent, but comparisons A-B and A-C are not because they both use the same data (i.e., from group A). This partial dependence makes it difficult to calculate the true family-wise Type I error rate.

Fortunately, multiple comparison procedures exist that maintain the family-wise Type I error rate at a specified level. Hypothesis tests based on a fixed Type I error rate follow the Neyman-Pearson approach (1). Tukey's honestly significant difference (HSD) procedure does this when making all k(k-1)/2 pair-wise comparisons of the means of k groups that all have the same normally distributed variance (3). When normality can be assumed, it is a preferred test. It is easy to execute and is commonly used in statistical practice. This is illustrated with the following analytical example.

Step 1: Define Key Statistics
The following apply for Example 1:

k = the number of analysts (groups) = 7
a = the desired Type I error rate = 0.05
ni = nj = the number of within-group results (trials) for analysts i and j = 4
Pooled variance = average of the 7 variances = 8.997
v = the total number of degrees of freedom associated with the pooled variance = 7*3 = 21

Step 2: Calculate The Decision Statistic
The maximum allowed difference (+/-D) is the decision statistic used for Tukey's test. Any pair-wise mean difference between groups whose absolute value is larger in magnitude than D is considered statistically significant. D is calculated in an Excel formula as follows:

D = (q(k,v,a)/SQRT(2))*SQRT((1/ni+1/nj)*(Pooled variance))

Here q(k,v,a) is the 1-a quantile of the Studentized range distribution for k groups and v degrees of freedom. Excel has no function for this distribution, but tables and interactive calculators of the Studentized range quantile are available on the Internet (4). We find there that:

q(7,21,0.05) = 4.6

Substituting these values into the above equation for D we obtain:

D = +/- 6.9
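As a cross-check, the Step 1 and Step 2 calculations above can be sketched in Python (the quantile q(7, 21, 0.05) = 4.6 is taken from a Studentized range table, as in the text; this is an illustrative script, not part of the original article):

```python
import math

# Tukey's HSD decision statistic for Example 1, using the Table I values.
variances = [9.181, 8.703, 8.821, 9.797, 10.176, 8.294, 8.009]
means = {"Dick": 93.5, "Jane": 92.2, "Kate": 96.9, "Yihong": 95.3,
         "Enrico": 95.7, "Azita": 99.8, "Abdel": 90.1}
q = 4.6                                  # q(k=7, v=21, a=0.05) from a table
n = 4                                    # trials per analyst (ni = nj)
pooled = sum(variances) / len(variances)             # pooled variance, ~8.997
D = (q / math.sqrt(2)) * math.sqrt((1/n + 1/n) * pooled)  # ~6.9

# Flag every analyst pair whose absolute mean difference exceeds D
pairs = [(a, b) for a in means for b in means
         if a < b and abs(means[a] - means[b]) > D]
```

Running this flags exactly the two pairs identified in Step 3 below (Azita versus Jane and Azita versus Abdel).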
Step 3: Make The Inference
Calculate all k(k-1)/2 pair-wise differences among the k means and compare the absolute value of these differences to D. Figure 2 displays all 7*6/2 = 21 analyst pair-wise differences (row analyst minus column analyst).

Figure 2: Tukey multiple comparison of average percent extraction for seven analysts. Differences marked * are statistically significant.

                 Dick   Jane   Kate   Yihong  Enrico  Azita
  Average        93.5   92.2   96.9   95.3    95.7    99.8
  Jane    92.2   -1.3
  Kate    96.9    3.4    4.7
  Yihong  95.3    1.8    3.1   -1.6
  Enrico  95.7    2.2    3.5   -1.2    0.4
  Azita   99.8    6.3    7.6*   2.9    4.5     4.1
  Abdel   90.1   -3.4   -2.1   -6.8   -5.2    -5.6    -9.7*

Those differences in Figure 2 that have an absolute magnitude greater than D = 6.9 are considered statistically significantly different based on a family-wise Type I error rate of 0.05. By this criterion, Azita's extraction yield is considered greater than those of either Abdel or Jane. These are the only two differences that are statistically significant.

The supervisor called her analysts together to discuss the result. At the meeting it was learned that Azita used a much more vigorous manual mixing technique than either Abdel or Jane. Thus the team decided to further investigate the effect of mixing on extraction.

EXAMPLE 2: TESTING THE EQUALITY OF GROUP STANDARD DEVIATIONS USING BARTLETT'S TEST
A chemistry, manufacturing, and controls (CMC) team was developing a tablet powder blending process. One of the key measures of the quality of the blend is the potency content uniformity in the tablets that are compressed from the blend. The team had compressed tablets from five pilot-scale blend batches. The potency content was determined in 10 randomly selected tablets from each of the five blends. The 50 test results are plotted in Figure 3.

Figure 3: Potencies of tablet samples compressed from five different powder blends.

The within-blend standard deviation or variance is one measure of the content non-uniformity of each blend. Table II provides the observed sample standard deviations and variances for each of the five blends.

Both Figure 3 and Table II suggest that blends one and five have poorer content uniformity. However, the team wanted more objective evidence for differences in content uniformity among the blends. After all, some difference in standard deviation among the blends could be expected just based on sampling and testing variation.

Table II: Standard deviations and variances of tablet potencies compressed from five different powder blends.

  Blend   Standard deviation   Number of tablets tested   Degrees of freedom   Variance
  1             4.90                    10                        9              24.05
  2             2.74                    10                        9               7.51
  3             2.32                    10                        9               5.37
  4             1.25                    10                        9               1.57
  5             2.98                    10                        9               8.86
                                                            Sum: 45       Average: 9.47
Bartlett's test (5) is one of a number of tests of the following null hypothesis:

H0: variance1 = variance2 = variance3 = ... = variancek

Bartlett's test assumes that the results within each group follow normal distributions with the same variance but possibly different true means. When the within-group results are normally distributed, Bartlett's test is a good one. The team was willing to make this assumption. Bartlett's test can be used either in the Fisherian (obtaining a P-value as a measure of evidence) or in the Neyman-Pearson (maintaining a fixed Type I error rate as a decision criterion) sense. The team decided to use a fixed Type I error rate of 0.05.

Step 1: Define Key Statistics
The following statistics can be gleaned from Table II:

Number of blends (groups) = k = 5
Type I error rate = a = 0.05
Number of degrees of freedom within each group = dfi = 10-1 = 9
Total number of degrees of freedom in the pooled variance = dfp = 5*9 = 45
Pooled variance = vp = average of within-blend variances (vi) = 9.47

Step 2: Calculate Bartlett's Statistic
Bartlett's statistic requires the calculation of the following M and C values from the above statistics and other information from Table II. These can be obtained using Excel formula syntax as follows (note that ln() is the natural log function in Excel):

M = (dfp)*ln(vp) - SUM(dfi*ln(vi)) = 15.59
C = 1 + (1/3/(k-1))*(SUM(1/(dfi)) - 1/(dfp)) = 1.04
Bartlett's test statistic = M/C = 14.93

Step 3: Calculate The Acceptance Criterion For Bartlett's Test
Larger values of M/C will be associated with greater variance heterogeneity. Under the assumption of within-group normality, Bartlett's test statistic follows a Chi-squared probability distribution with k-1 degrees of freedom. Therefore, an appropriate upper limit for M/C that corresponds to a family-wise Type I error rate of a = 0.05 is the 0.95 quantile of the Chi-squared distribution with k-1 = 4 degrees of freedom. This can be obtained in Excel (6) as:

Upper limit of Bartlett's statistic = CHIINV(a,k-1) = 9.49

The team observed that Bartlett's statistic, 14.93, is considerably greater than the upper limit of 9.49. The relationship can be seen clearly in Figure 4.

Figure 4: Probability distribution of Bartlett's test statistic. The observed value (14.93) lies well beyond the 0.95 quantile (9.49).

Here the appropriate Chi-squared distribution was plotted in Excel using the following formula for the probability density (6):

Probability density of Bartlett's M/C statistic = GAMMADIST(M/C,(k-1)/2,2,FALSE)
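For readers who prefer a script to a spreadsheet, the M and C calculations can be mirrored in Python from the Table II variances (a sketch; the Chi-squared limit 9.49 is taken from the text, since computing the quantile itself would need a statistics library):

```python
import math

# Bartlett's statistic for Example 2, from the Table II within-blend variances.
variances = [24.05, 7.51, 5.37, 1.57, 8.86]
k = len(variances)                  # 5 blends
df = [9] * k                        # degrees of freedom per blend
dfp = sum(df)                       # 45
vp = sum(variances) / k             # pooled variance ~9.47 (equal df per blend)

M = dfp * math.log(vp) - sum(d * math.log(v) for d, v in zip(df, variances))
C = 1 + (1 / (3 * (k - 1))) * (sum(1 / d for d in df) - 1 / dfp)
statistic = M / C   # ~14.9, well above the 0.95 Chi-squared quantile of 9.49
```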
Step 4: Make The Inference
Clearly the observed value 14.93 is far in the right tail of the Chi-squared distribution, indicating that the differences in scatter seen in Figure 3 and Table II are statistically significant. Had the team wanted to use Bartlett's test in the Fisherian sense, they would merely have calculated the P-value associated with an observed test statistic result of 14.93. The calculation can be done in Excel (6) simply as:

P-value = CHIDIST(M/C,k-1) = CHIDIST(14.93,4) = 0.005
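The CHIDIST value can also be checked without Excel: for an even number of degrees of freedom the Chi-squared upper-tail probability has a closed form, sketched here in Python (the helper name is ours):

```python
import math

def chi2_sf(x, df):
    # Upper-tail (survival) probability of the Chi-squared distribution.
    # Closed-form series, valid for even df (here df = k - 1 = 4).
    assert df % 2 == 0
    return math.exp(-x / 2) * sum((x / 2) ** j / math.factorial(j)
                                  for j in range(df // 2))

p_value = chi2_sf(14.93, 4)   # analogous to CHIDIST(14.93, 4); about 0.005
```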
The low P-value gives strong evidence for variance differences among the blends. Therefore, the team is justified in inferring that the blends did produce tablets that have statistically significant differences in content uniformity.

One of the critical assumptions behind the ANOVA test for equality of group means is homogeneity of variance (also called "homoskedasticity"). This assumption means that the true within-group variance of all groups being compared is equal. One important use of Bartlett's test is to identify heteroskedasticity among groups before performing ANOVA. If Bartlett's test shows strong evidence of heteroskedasticity, the ANOVA F-test will be unreliable.

EXAMPLE 3: TESTING THE EQUALITY OF GROUP INTERCEPTS AND SLOPES USING ANCOVA
The common technical document (CTD) for a new drug application (NDA) submission of a new product requires the justification of the label claim for the proposed product shelf life (7). In the simple case of a product with a single dosage strength and single packaging, this justification is generally based on the stability profiles of three product batches. An example of such profiles is shown in Figure 5.

Figure 5: Stability of three batches (A, B, and C) of drug product, potency (%LC) versus months of storage, with trend lines based on a separate intercept, separate slope (SISS) model.

In Figure 5, it is assumed that the stability profile can be approximated by a straight line, and that the intercept and slope of each batch are different. This stability model is called the separate intercept, separate slope (SISS) model.

A second possible model would be to assume that all batches have a common slope (or rate of degradation), but possibly separate intercepts (initial potencies), as is shown in Figure 6. This type of stability model is called a separate intercept, common slope (SICS) model.

Figure 6: Stability of three batches of drug product with trend lines based on a separate intercept, common slope (SICS) model.
A third possible model would be to assume that the stability profile for all batches can be represented by a common intercept, common slope (CICS) model, as shown in Figure 7.

Figure 7: Stability of three batches of drug product with a single trend line based on a common intercept, common slope (CICS) model.

Analysis of covariance (ANCOVA) is a procedure used, among other things, to identify which of the three models (SISS, SICS, or CICS) is appropriate for a given set of stability data (8). For the present example, ANCOVA tests two null hypotheses:

H0intercept: intercept1 = intercept2 = intercept3
H0slope: slope1 = slope2 = slope3

ANCOVA is relatively difficult to perform in Excel using the standard algebra of multiple regression (8). However, the equations are quite simple when expressed in matrix form (9). Fortunately, Excel has some matrix functions that make it possible to use the matrix formulas for ANCOVA directly. The variable names that appear bold in the equations that follow are matrix variables. These variable names correspond to similar (but not bold) cell range names that can be used in matrix formulas in Excel. Visit the Excel HELP menu for instructions on naming cell ranges and entering array formulas into cell ranges. In Excel, an array formula must be entered into a cell range whose dimension (number of rows and columns) matches that of the matrix result produced by the formula. The Glossary Section gives some insight into the operations of matrix algebra and the dimensions in the final result of each operation. The following sections give the ANCOVA steps involved for this particular example.

Step 1: Set Up The Necessary Matrices As Arrays
The data are first used to create four different matrices needed for ANCOVA. Figure 8 gives the X1, X2, X3, and Y matrices for the data given in Figures 5, 6, and 7.

Figure 8: The X1, X2, X3, and Y matrices used for ANCOVA. X1 (CICS) has columns [1, M]; X2 (SICS) has columns [1, Dummy1, Dummy2, M]; X3 (SISS) has columns [1, Dummy1, Dummy2, M, Dummy1*M, Dummy2*M], where M is months of storage and the dummy variables encoding batch are defined in the text below. The underlying 28 measurements of potency (Y, %LC) are:

Batch A: M = 0, 3, 3, 6, 6, 12, 12, 24, 24;
         Y = 100.8, 102.0, 101.2, 100.8, 100.2, 99.6, 99.2, 97.2, 98.4
Batch B: M = 0, 3, 6, 6, 12, 12, 24, 24;
         Y = 101.0, 100.2, 99.8, 100.3, 99.4, 98.2, 96.7, 97.9
Batch C: M = 0, 1, 2, 3, 3, 6, 6, 12, 12, 24, 24;
         Y = 101.5, 100.2, 101.3, 100.7, 100.4, 99.3, 99.0, 99.0, 97.6, 96.6, 96.1
Figure 8 shows how the matrices appear as named ranges in an Excel spreadsheet. The actual Excel names used for the X1, X2, X3 matrices are X1M, X2M, and X3M, respectively. The "M" suffix is required to avoid confusion with a single cell reference. The Y matrix is simply a single column containing the measured potencies for the three different batches. The batch corresponding to each measurement is given for reference at left. All four matrices have one row for each measured potency. The order of these rows is irrelevant.
The following reviews the construction for the X1,
X2, X3 matrices. The columns in each depend on the
corresponding stability model (CICS, SICS, and SISS,
respectively). All three stability models take the following form, where the subscript i indicates a row in
Figure 8:
Yi = Parameter1*Xcol1i + Parameter2*Xcol2i +
Parameter3*Xcol3i + …
The columns in X1, X2, X3 correspond to the values of
Xcol1, Xcol2, etc. For X1 (CICS) the stability model has
only two parameters: the common intercept and common slope. Therefore, it takes the following familiar form:
Yi = Intercept*1 + Slope*Mi
For each row of X1, Xcol1 is always 1 so the first column in X1 is a column of 1s. The second column gives
the months of storage (Mi) at which the corresponding Yi
value is measured. The intercept and slope are the same
for all batches and have their usual interpretation.
For X2 and X3, additional parameters relate to the
slopes and intercept parameters of each batch. We need
to define corresponding column values. Since matrices
can only contain numeric values, we cannot include
batch identifiers like “A”, “B”, or “C” in them. Instead
we have to create special dummy variables to put in the
columns of these matrices that identify the batch being
tested. We call these Dummy1 and Dummy2 and they
are defined as follows:
Lot A: Dummy1 = 1 and Dummy2 = 0
Lot B: Dummy1 = 0 and Dummy2 = 1
Lot C: Dummy1 = -1 and Dummy2 = -1
We can use these dummy variables to define the SICS
stability model corresponding to X2 as:
Yi = Int*1 + IntA*Dummy1i + IntB*Dummy2i +
Slope*Mi ,
where the true intercept of each batch can be interpreted in terms of the parameters Int, IntA, and IntB as
follows:
Intercept for batch A = Int + IntA
Intercept for batch B = Int + IntB
Intercept for batch C = Int – IntA – IntB.
Thus the columns of X2 are:

1- a column of 1s
2- Dummy1i
3- Dummy2i
4- the storage period (Mi) at which the corresponding Yi value is measured

The slope parameter has its usual interpretation and is the same for all batches.
For X3 (model SISS), we follow a similar logic, but must include additional parameters to account for differences in slope among batches. The stability model is written:

Yi = Int*1 + IntA*Dummy1i + IntB*Dummy2i + Slp*Mi + SlpA*Dummy1i*Mi + SlpB*Dummy2i*Mi

The Int, IntA, and IntB parameters are interpreted as described above for X2. The true slope of each batch can be interpreted in terms of the parameters Slp, SlpA, and SlpB as follows:

Slope for batch A = Slp + SlpA
Slope for batch B = Slp + SlpB
Slope for batch C = Slp - SlpA - SlpB
From the above model we can see that the columns of X3 are:

1- a column of 1s
2- Dummy1i
3- Dummy2i
4- the storage period (Mi) at which the corresponding Yi value is measured
5- Dummy1i*Mi
6- Dummy2i*Mi
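The dummy coding and design-matrix columns described above are easy to automate. A small Python sketch (the helper names x2_row and x3_row are ours, not from the article):

```python
# Dummy coding for the three batches, as defined in the text:
# A -> (1, 0), B -> (0, 1), C -> (-1, -1)
DUMMIES = {"A": (1, 0), "B": (0, 1), "C": (-1, -1)}

def x2_row(batch, month):
    # One row of X2 (SICS model): [1, Dummy1, Dummy2, M]
    d1, d2 = DUMMIES[batch]
    return [1, d1, d2, month]

def x3_row(batch, month):
    # One row of X3 (SISS model) adds the Dummy*M interaction columns:
    # [1, Dummy1, Dummy2, M, Dummy1*M, Dummy2*M]
    d1, d2 = DUMMIES[batch]
    return [1, d1, d2, month, d1 * month, d2 * month]

row_a = x2_row("A", 0)    # first batch A row of X2
row_c = x3_row("C", 24)   # last batch C row of X3
```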
Step 2: Perform The Matrix Calculations For ANCOVA
The calculations can be performed easily in Excel by using three Excel array functions. In these, the matrices A and B are defined Excel ranges (arrays), as follows:

TRANSPOSE(A): interchanges the rows and columns of the array (designated A^T in matrix notation)
MMULT(A,B): performs matrix multiplication on the arrays A and B (designated simply AB in matrix notation)
MINVERSE(A): obtains the matrix inverse of array A (designated A^-1 in matrix notation)

In the following Excel formulas, i is replaced by either 1, 2, or 3 (corresponding to the models CICS, SICS, or SISS, respectively).

The matrix formula for the estimated model parameters, Bi, is:

Bi = (Xi^T Xi)^-1 Xi^T Y

These parameters can be obtained in Excel as a matrix with a single column and one row for each model parameter by using the following array formula with i replaced by the corresponding model:

BiM = MMULT(MMULT(MINVERSE(MMULT(TRANSPOSE(XiM),XiM)),TRANSPOSE(XiM)),Y)

The matrix formula for the predicted values of Y based on the fitted model is:

Ypredi = Xi Bi

This corresponds to the following calculation in Excel, where i is replaced by the appropriate model number:

Ypredi = MMULT(XiM,BiM)

A model-fitting statistic that is important for the ANCOVA calculations here is the sum of squared errors (SSE). The matrix formula for this is as follows (again, replacing i by the appropriate model number):

SSEi = Y^T (I - Hi) Y

where

Hi = Xi (Xi^T Xi)^-1 Xi^T

and I is a special matrix called the "identity matrix." This matrix is a square matrix with the same number of rows and columns as there are Y measurements. The diagonal numbers of the identity matrix are all 1s and all other numbers in the matrix are zero. The identity matrix used for the calculations here is shown in Figure 9. In Excel, this matrix is an array that can be given a name such as "ID."

Figure 9: The ID (identity) matrix needed for ANCOVA (a 28 x 28 matrix with 1s on the diagonal and 0s elsewhere).

The sum of squared errors for each model i can be calculated in Excel by combining the two matrix equations above for SSEi and Hi into the following array formula:

SSEi = MMULT(MMULT(TRANSPOSE(Y),ID-MMULT(MMULT(XiM,MINVERSE(MMULT(TRANSPOSE(XiM),XiM))),TRANSPOSE(XiM))),Y)
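The same matrix formulas can be sketched in Python with NumPy. The data below are illustrative (four time points per batch, not the full 28-point Figure 8 data set), and the helper name fit is ours:

```python
import numpy as np

# Design matrices for the CICS, SICS, and SISS models, built as in Figure 8.
months = np.array([0, 6, 12, 24] * 3, dtype=float)
d1 = np.repeat([1.0, 0.0, -1.0], 4)        # Dummy1 for batches A, B, C
d2 = np.repeat([0.0, 1.0, -1.0], 4)        # Dummy2 for batches A, B, C
y = np.array([100.8, 100.5, 99.6, 97.8,    # batch A (illustrative values)
              101.0, 100.0, 99.4, 97.9,    # batch B
              101.5, 100.9, 99.0, 96.4])   # batch C
ones = np.ones_like(months)

X1 = np.column_stack([ones, months])                  # CICS: [1, M]
X2 = np.column_stack([ones, d1, d2, months])          # SICS: [1, D1, D2, M]
X3 = np.column_stack([ones, d1, d2, months,
                      d1 * months, d2 * months])      # SISS: adds D*M columns

def fit(X, y):
    B = np.linalg.solve(X.T @ X, X.T @ y)   # B = (X^T X)^-1 X^T Y
    resid = y - X @ B                       # Y - Ypred
    return B, float(resid @ resid)          # parameter vector and SSE

(B1, sse1), (B2, sse2), (B3, sse3) = fit(X1, y), fit(X2, y), fit(X3, y)
# Because the models are nested, SSE1 >= SSE2 >= SSE3 must hold. With
# N = 12 and k = 3 here, dfe = N-2 = 10 (CICS), N-k-1 = 8 (SICS), N-2k = 6 (SISS).
```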
The error degrees of freedom (dfe) for each model should also be calculated. The dfe for a model is defined as the number of observations (number of rows in Xi) minus the number of parameters (number of columns in Xi). If k = the number of batches, then the dfe for each model is:

dfe1 = N - 2
dfe2 = N - k - 1
dfe3 = N - 2*k
For this example, N = 28 and k = 3. The sum of squared errors (SSE) for each of the three possible models is shown in Table III.

Table III: Sum of squared errors and degrees of freedom associated with three different stability models.

  Model      SSE     dfe
  1 (CICS)   13.21   26
  2 (SICS)    8.71   24
  3 (SISS)    7.69   22

The SSEi and dfei values in Table III form the basis for the ANCOVA F-tests. The ANCOVA calculations and F-tests are presented in Table IV.

Table IV: ANCOVA table for comparing the intercepts and slopes of the stability profiles of three batches.

  Source  H0    Ha    SS                df             SS/df       F               P
  A       CICS  SISS  SSE1-SSE3=5.52    dfe1-dfe3=4    MSA=1.38    MSA/MSD=3.95    0.015
  B       CICS  SICS  SSE1-SSE2=4.50    dfe1-dfe2=2    MSB=2.25    MSB/MSD=6.44    0.006
  C       SICS  SISS  SSE2-SSE3=1.02    dfe2-dfe3=2    MSC=0.51    MSC/MSD=1.46    0.254
  D                   SSE3=7.69         dfe3=22        MSD=0.35

The four rows of the ANCOVA table are labeled A, B, C, and D. Row A is generally presented as part of the ANCOVA table, but is not needed for our purposes. Row D merely shows the calculation of the mean squared error (MSD) that becomes the denominator for the F ratios. Rows B and C contain the calculations that are of interest in the present example.

As indicated in Table IV, the F-test of row C is a test of the null hypothesis (H0) that the true model is SICS against the alternative hypothesis (Ha) that the true model is SISS. Thus the C F-test is really testing the hypothesis that the slopes of the regression lines are equal.

The F-test of row B is a test of the null hypothesis (H0) that the true model is CICS against the alternative hypothesis (Ha) that the true model is SICS. Thus the B F-test is really testing the hypothesis that the intercepts of the three batches are equal, under the assumption that the slopes are equal.

The F-test P-values for rows B and C are obtained in Excel by substituting the appropriate values into the formula below:

Psource = FDIST(Fsource,dfsource,dfe3)

Step 3: Make The Inference
The logic used by the US Food and Drug Administration in selecting a stability model for shelf-life estimation is shown in Figure 10.

FDA and International Conference on Harmonisation (ICH) stability guidances (7) recommend that a P-value of 0.25 be used as a decision criterion for the B and C ANCOVA F-tests when comparing the stability profiles of different product batches. This corresponds to a Neyman-Pearson Type I error rate of 0.25. This means that the tests will reject the null hypothesis in error 25% of the time. This rather high Type I error rate lowers the consumer's risk in that it raises the power of the tests to detect smaller differences in slope and intercept among the batches. However, it also has the effect of raising the manufacturer's risk by incorrectly identifying batch differences that do not in fact exist. The use of a Type I error rate of 0.25 in order to reduce the consumer risk has been the subject of some debate in the past (10, 11) but has by now become established practice.
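The F ratios in Table IV follow directly from the SSE and dfe values in Table III; a quick Python check of that arithmetic (FDIST in Excel would then convert these ratios to the quoted P-values):

```python
# Table IV F ratios recomputed from the Table III SSE and dfe values.
sse = {"CICS": 13.21, "SICS": 8.71, "SISS": 7.69}
dfe = {"CICS": 26, "SICS": 24, "SISS": 22}

msd = sse["SISS"] / dfe["SISS"]                  # MSD = 7.69/22, ~0.35
f_b = ((sse["CICS"] - sse["SICS"]) / 2) / msd    # row B: intercepts test, ~6.44
f_c = ((sse["SICS"] - sse["SISS"]) / 2) / msd    # row C: slopes test, ~1.46
```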
Figure 10: Using ANCOVA to choose a stability model. Start: if PC < 0.25, select SISS; otherwise, if PB < 0.25, select SICS; otherwise, select CICS.
Figure 10 first examines the row C F-test P-value. If
significance is found, then SISS is selected as the final
stability model. Table IV shows that in our example,
the P-value for the C F-test is above 0.25 (0.254) so
there is no evidence for differences among batches with
respect to their slopes (i.e., rates of potency loss).
Figure 10 next examines the row B F-test P-value. If
significance is found, then the SICS is selected, otherwise CICS becomes the final model. Table IV shows
that in our example, the P-value for the B F-test is very
statistically significant (0.006 << 0.25). Therefore, in
our example we would conclude that a SICS model of
Figure 6 is the appropriate stability model to use for
shelf-life estimation.
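The Figure 10 decision rule is simple enough to sketch in code. This Python fragment (a hypothetical helper, not from the article) applies the rule to the Table IV P-values:

```python
# Sketch (function name is mine): the Figure 10 model-selection rule.

def choose_stability_model(p_c, p_b, alpha=0.25):
    """Select a stability model from the ANCOVA P-values for rows C and B."""
    if p_c < alpha:   # slopes differ among batches
        return "SISS"
    if p_b < alpha:   # common slope, but intercepts differ
        return "SICS"
    return "CICS"     # pool all batches

# With the Table IV values from the example:
print(choose_stability_model(p_c=0.254, p_b=0.006))  # SICS
```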
Usually, the ANCOVA analysis and shelf-life projection are done by a trained statistician. The statistician
may use a statistical package such as SAS (12,13) for
these calculations. Performing the ANCOVA in SAS is
quite straightforward. The following gives the simple
SAS syntax for an ANCOVA analysis:
PROC GLM;
  CLASS BATCH;
  MODEL LEVEL = TIME BATCH TIME*BATCH / SS1;
RUN;
However, the example shows that the ANCOVA
analysis can also be programmed relatively easily into
Excel using its matrix functions and array formulas.
SUMMARY
Three hypothesis testing situations that are common in
pharmaceutical and analytical development have been
presented and the calculations illustrated using Excel.
Tukey’s test can be used to compare the means of
multiple groups when the within-group variation is
normal and all groups have the same true variance.
This test is a Neyman-Pearson type test that maintains
a desired family-wise Type I error rate. It is based on the
Studentized range distribution. Tables and calculators
for the Studentized range distribution can be found on
the Internet.
Bartlett’s test can be used to compare the variances
of multiple groups when the within-group variation
is normal. This test can be used as either a Neyman-Pearson or Fisherian type test and is based on the
Chi-squared distribution.
ANCOVA can be used to compare the intercepts and slopes of regression lines when the error
of measurement is normal, independent, and identically distributed. This test is based on the F distribution. When comparing the stability profiles of different
batches of product, ANCOVA is often employed as a
Neyman-Pearson type test with a comparison-wise
Type I error rate of 0.25.
The calculations for some complex hypothesis
testing procedures, such as ANCOVA, can be greatly
simplified by expressing them in matrix algebra. The
matrix formulas for regression and ANCOVA are available in textbooks. Matrix calculations can easily be
done in Excel using defined ranges, range labels, and
the array functions MMULT, MINVERSE, and TRANSPOSE. The matrix formulas for multiple regression, together with the availability and relative ease of use of Excel's matrix functions, may be among the best-kept secrets in statistics. An Excel spreadsheet illustrating these calculations is available from the author upon request.
ACKNOWLEDGMENTS
The author would like to express his gratitude to Paul
Pluta for his suggestions and encouragement, to Susan
Haigney for her patience and organizational skills, and
to Diane Wolden for her vigilance and perseverance
through all these “sadistics.”
Summer 2009 Volume 13 Number 3
GLOSSARY
ANCOVA (analysis of covariance): A procedure, based
on the F-distribution, that includes hypothesis tests for the
equality of slopes and intercepts of multiple trend lines.
ANOVA (analysis of variance): A hypothesis test, based on the F-statistic described by Fisher, used to detect differences among the true means of data from two or more groups.
Bartlett’s test: A hypothesis test, based on the Chi-square
distribution, of the null hypothesis that multiple independent random samples come from normal populations all
having the same variance or standard deviation.
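As a sketch of how this statistic is computed (the formula is the standard Bartlett form; the function name is mine), the following Python fragment evaluates it from group sizes and sample variances. Under the null hypothesis the statistic is approximately Chi-square with k-1 degrees of freedom:

```python
import math

# Sketch: Bartlett's test statistic from group sizes n and sample variances s2.
# Under H0 (equal true variances) it is approximately Chi-square with k-1 df.

def bartlett_statistic(n, s2):
    """n: list of group sizes; s2: list of sample variances, one per group."""
    k = len(n)
    N = sum(n)
    # pooled variance across groups
    sp2 = sum((ni - 1) * vi for ni, vi in zip(n, s2)) / (N - k)
    num = (N - k) * math.log(sp2) - sum(
        (ni - 1) * math.log(vi) for ni, vi in zip(n, s2)
    )
    # Bartlett's correction factor
    c = 1 + (sum(1 / (ni - 1) for ni in n) - 1 / (N - k)) / (3 * (k - 1))
    return num / c

# Identical sample variances give a statistic of zero:
print(round(bartlett_statistic([10, 10, 10], [2.0, 2.0, 2.0]), 6))  # 0.0
```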
F-statistic: The decision statistic used in Fisher’s analysis
of variance hypothesis test consisting of the ratio of two
independent observed variances calculated from normally
distributed data.
Family-wise Type I error rate: The risk of falsely rejecting
one or more null hypotheses when conducting a family of
many pair-wise statistical tests, each comparing the parameters of two groups.
Homogeneity of variance: Many statistical procedures
that compare the population parameters of multiple groups,
such as ANOVA, ANCOVA, and Tukey’s test, assume that
the within-group variation of the groups being compared follows a normal distribution with the same true within-group variance but possibly different group means. The
equality of variance is called “variance homogeneity” or
“homoskedasticity.”
Hypothesis: A provisional statement about the value of a
model parameter or parameters whose truth can be tested
by experiment.
Identity matrix: A special matrix in which the number of
rows equals the number of columns. The diagonal numbers
of the identity matrix are all equal to 1. All off diagonal
numbers are equal to zero.
Matrix: A table of numbers with a fixed number of
rows ( r ) and columns ( c ) often designated Arc or simply
A. Arrays (rectangular cell ranges) in Excel can be given
names and manipulated as matrices using Excel formulas.
The rules by which matrices can be added, multiplied,
inverted, and transposed were described by the English
mathematician Arthur Cayley in 1857. These four operations are available in Excel as the array functions ‘+’ (the
add operator), MMULT(AM,BM), MINVERSE(AM), and
TRANSPOSE(AM), respectively, where AM and BM are
valid cell range names. These Excel functions can greatly
simplify complex statistical calculations such as regression
and ANCOVA.
Matrix addition: Two matrices, say Anm and Bpq, can be
added or subtracted (symbolized A+B or A-B, respectively
and entered similarly into an Excel array formula) if n=p
and m=q. The result is a matrix having the same dimension
as the starting matrices. An example of a matrix addition is
given below. Matrix addition is commutative (A+B = B+A).
[a b c]   [g h i]   [a+g  b+h  c+i]
[d e f] + [j k l] = [d+j  e+k  f+l]
Matrix inverse: Under certain conditions, a matrix can
be inverted in a manner analogous to the inverse of a scalar
number. Inverses are only defined for square matrices. The
resulting matrix will have the same dimensions as the starting matrix. Matrix inversion is symbolized A^-1 and entered
into an Excel array formula as MINVERSE(AM), where AM
is a valid cell range name.
Matrix multiplication: Two matrices, say Anm and Bpq,
can be multiplied (symbolized AB or MMULT(A,B) in an
Excel array formula) if the number of columns of A (m)
equals the number of rows of B (p). The product is a matrix
with n rows and q columns. An example of a matrix multiplication is given below. Matrix multiplication is not commutative, in other words AB is not, in general, equal to BA.
[a b c] [g h i j]   [ag+bk+co  ah+bl+cp  ai+bm+cq  aj+bn+cr]
[d e f] [k l m n] = [dg+ek+fo  dh+el+fp  di+em+fq  dj+en+fr]
        [o p q r]
Matrix transpose: The interchange of the rows and
columns in an array. This is symbolized A^T and entered into
an Excel array formula as TRANSPOSE(AM), where AM
is a valid cell range name. If the starting matrix has r rows
and c columns, the resulting matrix will have c rows and r
columns.
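As a rough illustration of these four operations outside of Excel (the function names are mine), they can be written in a few lines of Python on plain lists of rows; the inverse is shown only for the 2x2 case:

```python
# Sketch: the four Cayley operations from the glossary on lists-of-lists,
# mirroring Excel's '+', MMULT, MINVERSE, and TRANSPOSE.

def madd(A, B):
    # element-wise addition; A and B must have the same dimensions
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mmult(A, B):
    # columns of A must match rows of B; result has A's rows, B's columns
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    # interchange rows and columns
    return [list(col) for col in zip(*A)]

def minverse2(A):
    # inverse of a 2x2 matrix; defined only when the determinant is nonzero
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

A = [[1, 2], [3, 4]]
print(mmult(A, minverse2(A)))  # [[1.0, 0.0], [0.0, 1.0]] -- the identity matrix
```

A matrix times its own inverse returns the identity matrix, which is the property regression and ANCOVA calculations rely on when solving the normal equations.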
Multiplicity: When multiple hypothesis tests (e.g., control
chart rules) are applied to different aspects (e.g., trending
patterns) of a data set, the overall rate at which at least one test gives a false alarm may be greater than the false alarm rate of any single test applied alone. This statistical phenomenon is referred to as
multiplicity.
Null hypothesis (H0): A plausible hypothesis that is
presumed sufficient to explain a set of data unless statistical evidence in the form of a hypothesis test indicates
otherwise.
P-value: The probability of obtaining a result at least as
extreme as the one that was actually observed, given that
the null hypothesis is true. The fact that p-values are based
on this assumption is crucial to their correct interpretation.
Pair-wise Type I error rate: The risk of falsely rejecting
the null hypothesis when conducting a statistical test comparing the parameters of two groups.
Parameter: In statistics, a parameter is a quantity of
interest whose ‘true’ value is to be estimated. Generally a
parameter is some underlying variable associated with a
physical, chemical, or statistical model.
Sampling distribution: The probability distribution, over repeated sampling, of a summary statistic calculated from data.
Statistic: A summary value (such as the mean or standard deviation) that is calculated from data. A statistic is often used because it provides a good estimate of a parameter
of interest.
Studentized range distribution: Consider a normal population from which two independent samples, of size k and
n, respectively, are drawn. From the first sample, the range
(maximum – minimum) of k values is obtained. From the
second sample the standard deviation with n-1 degrees of
freedom is calculated. Then the ratio of the range to the standard deviation will have a Studentized range probability distribution, indexed by k and n-1.
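This definition can be illustrated by simulation. The Python sketch below (my own illustration, not from the article) draws many such range-to-standard-deviation ratios and estimates an upper percentile, which should land near the tabulated Studentized range value for the same k and degrees of freedom:

```python
import random

# Monte Carlo sketch of the Studentized range definition above: the range
# of k normal draws divided by an independent standard deviation on n-1 df.

def simulate_q(k, n, sims=20000, seed=1):
    rng = random.Random(seed)
    qs = []
    for _ in range(sims):
        first = [rng.gauss(0, 1) for _ in range(k)]    # sample of size k
        second = [rng.gauss(0, 1) for _ in range(n)]   # independent sample of size n
        m = sum(second) / n
        sd = (sum((x - m) ** 2 for x in second) / (n - 1)) ** 0.5
        qs.append((max(first) - min(first)) / sd)
    return qs

qs = sorted(simulate_q(k=3, n=11))       # k = 3, degrees of freedom = 10
q95 = qs[int(0.95 * len(qs))]
# tabulated q(0.95; k=3, df=10) is about 3.88; the estimate should land nearby
```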
Tukey’s test: A hypothesis test, based on the Studentized
range distribution, of the null hypothesis that multiple independent random samples all come from the same normal
population having a single true mean.
Type I error: A decision error that results in falsely rejecting the null hypothesis when in fact it is true. It is sometimes referred to as the alpha-risk or manufacturer’s risk.
REFERENCES
1. LeBlond, D., “Understanding Hypothesis Testing Using Probability Distributions,” Journal of Validation Technology, Vol. 15, No.1, pp
44-61, Winter 2009.
2. Vijayvargiya, A., “One-Way Analysis of Variance,” Journal of
Validation Technology, 15(1), pp 62-63, Winter 2009.
3. Montgomery, D., Design and Analysis of Experiments, 3rd edition,
John Wiley & Sons, N.Y., pp 78 – 79, 1991. See also the online
NIST statistical handbook: http://www.itl.nist.gov/div898/handbook/prc/section4/prc471.htm.
4. Tables of the Studentized range distribution are available online
from many sources by searching key words “Studentized Range
Statistic Table," including: http://faculty.vassar.edu/lowry/tabs.html#q, http://cse.niaes.affrc.go.jp/miwa/probcalc/, and Harter, H. (1960), Annals of Mathematical Statistics, 31(4), 1122-1147
(available free at http://www.projecteuclid.org).
5. Snedecor, George W. and Cochran, William G., Statistical
Methods, Eighth Edition, Iowa State University Press, pp 251
– 252, 1989.
6. LeBlond, D., “Using Probability Distributions to Make Decisions,” Journal of Validation Technology, pp 2 – 14, Spring 2008.
7. International Conference on Harmonisation, ICH Q1E Evaluation of Stability Data, June 2004.
8. Brownlee, K.A., Statistical Theory and Methodology, Robert E. Krieger, Malabar, FL, 1984.
9. Neter, J., Kutner, M., Nachtsheim, C. and Wasserman, W.,
Applied Linear Statistical Models, 4th edition, Irwin, Chicago,
1996.
10. Bancroft, “Analysis and Inference for Incompletely Specified
Models Involving the Use of Preliminary Test(s) of Significance,” Biometrics 20, 427-442, 1964.
11. Ruberg, SJ and Stegman, JW, “Pooling Data for Stability
Studies: Testing the Equality of Batch Degradation Slopes,”
Biometrics 47, 1059-1069, 1991.
12. Ng MJ, STAB Stability System for SAS, Division of Biometrics,
CDER, FDA, 1995. These SAS macros may be downloaded
from the following website: http://www.fda.gov/cder/sas/.
13. The SAS system for statistical analysis is available from the SAS Institute, Inc., Cary, NC, USA.
ARTICLE ACRONYM LISTING
ANCOVA: Analysis of Covariance
ANOVA: Analysis of Variance
CICS: Common Intercept, Common Slope
CMC: Chemistry, Manufacturing, and Controls
CTD: Common Technical Document
ICH: International Conference on Harmonisation
NDA: New Drug Application
SICS: Separate Intercept, Common Slope
SISS: Separate Intercept, Separate Slope
SSE: Sum of Squared Errors
ABOUT THE AUTHOR
David LeBlond, Ph.D., obtained an MS in statistics from Colorado State
University, a Ph.D. in biochemistry from Michigan State University,
and has 29 years of experience in the pharmaceutical and medical diagnostics fields. He serves on the CMC statistical expert team in PhRMA.
David can be reached at [email protected].