STATISTICAL VIEWPOINT

Hypothesis Testing: Examples in Pharmaceutical Process and Analytical Development

David LeBlond

“Statistical Viewpoint” addresses principles of statistics useful to practitioners in compliance and validation. We intend to present these concepts in a meaningful way so as to enable their application in daily work situations. Reader comments, questions, and suggestions are invited. Readers are also invited to contribute manuscripts for this column. We need your help to make “Statistical Viewpoint” a useful resource. Please contact coordinating editor Susan Haigney at [email protected] with comments, suggestions, or manuscripts for publication.

KEY POINTS
The following key points are discussed in this article:
• Tukey’s test can be used to compare the means of multiple groups when the within-group variation is normal and all groups have the same true variance. This test is a Neyman-Pearson type test that maintains a desired family-wise Type I error rate. It is based on the Studentized range distribution.
• Bartlett’s test can be used to compare the variances of multiple groups when the within-group variation is normal. This test can be used as either a Neyman-Pearson or Fisherian type test and is based on the Chi-squared distribution.
• Analysis of covariance (ANCOVA) can be used to compare the intercepts and slopes of regression lines when the error of measurement is normal, independent, and identically distributed. This test is based on the F distribution. When comparing the stability profiles of different batches of product, ANCOVA is often employed as a Neyman-Pearson type test with a comparison-wise Type I error rate of 0.25.
• Tables and calculators for the Studentized range distribution can be found on the Internet.
• The calculations for some complex hypothesis-testing procedures, such as ANCOVA, can be greatly simplified by expressing them in matrix algebra.
The matrix formulas for regression and ANCOVA are available in textbooks.
• Matrix calculations can easily be done in Microsoft Excel using defined ranges, range labels, and the array functions MMULT, MINVERSE, and TRANSPOSE.

Journal of GXP Compliance, Summer 2009, Volume 13, Number 3

INTRODUCTION
Hypothesis testing has become a central tool of data-based decision-making. In the last 100 years, a bewildering array of hypothesis tests has been developed, each designed to assist some very specific decision-making situation. The previous issue of “Statistical Viewpoint” (1) describes the Fisherian, Neyman-Pearson, and Bayesian approaches to hypothesis testing. The t-test and analysis of variance (ANOVA) have already been illustrated (1, 2). Based on comments received from readers, three additional hypothesis-testing examples are presented here with an explanation of how they can be performed in Microsoft Excel. The examples are based on realistic and common decision-making situations in pharmaceutical and analytical research and development. In each example, parameters are compared across multiple groups. The parameters compared are the within-group means, variances, or regression lines (i.e., slopes and intercepts). These hypothesis-testing examples are based on either the Fisherian (P-value as a measure of evidence) or Neyman-Pearson (fixed Type I error) approaches.

The examples are based on fictional data and are for illustration only. While the hypothesis tests described here are available in common software such as Minitab or JMP, these specific tests are not the only ones available and may or may not apply to a specific decision situation. It is always best to seek the aid of a trained statistician when making any critical data-based decision. It is hoped that these examples will provide readers with some concrete benchmarks, illustrate a few novel Excel tools, and reduce some of the mysticism surrounding hypothesis testing.
EXAMPLE 1: TESTING THE EQUALITY OF GROUP MEANS USING TUKEY’S MULTIPLE COMPARISON TEST
A bio-analytical method requires an extraction procedure to obtain the analyte in a form that is free from serum components. It is important that the yield (i.e., analyte recovery) from this extraction be high and well controlled. However, the analytical team has found that the yield is very dependent on the skill and experience of the analyst.

In an effort to better understand the manual steps that are critical to extraction yield, the laboratory supervisor asked each of her seven analysts to repeat the extraction procedure four times. She wanted to test whether the mean yields differ among analysts. If differences were found, she planned to have the analysts cross-train, so that the analysts with lower yields could learn from those with higher yields. She hoped that this would lead to higher overall yield and consistency across her laboratory. The results are shown in Table I and are also plotted as mean % Extraction +/- 1 standard deviation in Figure 1.

As described previously (1, 2), the ANOVA procedure, based on the F-distribution, tests the null hypothesis of equality for k group true means:

H0: mean1 = mean2 = mean3 = … = meank

If the null hypothesis is rejected by the F-test, then the differences among the k observed group averages are too large to have occurred often by random chance. Therefore, we infer that some of the true group means are not equal. But the ANOVA gives us no clue as to how the means may differ. Is one group mean unusually high? Can the groups be divided into “low mean” and “high mean” categories? Can the groups be ranked by their means? Are there some groups whose means are not statistically significantly different? How large are the mean differences?
Table I: Extraction results from seven analysts (% Extraction).

  Analyst   Trials   Average   Standard Deviation   Variance   Degrees of Freedom
1 Dick         4       93.5          3.03             9.181            3
2 Jane         4       92.2          2.95             8.703            3
3 Kate         4       96.9          2.97             8.821            3
4 Yihong       4       95.3          3.13             9.797            3
5 Enrico       4       95.7          3.19            10.176            3
6 Azita        4       99.8          2.88             8.294            3
7 Abdel        4       90.1          2.83             8.009            3

[Figure 1: Percent extraction for each analyst (mean +/- 1 standard deviation).]

One way to answer such questions would be to conduct many separate pair-wise t-tests. For example, in comparing the means of three groups (A, B, and C) we make the three comparisons A-B, A-C, and B-C. A little reflection will show that for k groups, there are k(k-1)/2 pair-wise comparisons. The number of pair-wise comparisons increases dramatically with k. For k=7 (i.e., A to G) groups, 7(7-1)/2 = 21 comparisons would be required. This leads to a multiplicity problem that inflates the Type I error (false positive) rate. If two independent tests are conducted, each at a comparison-wise Type I error rate of 0.05, then the Type I error rate of either or both of the tests failing (called the family-wise error rate) equals 1-(1-0.05)^2 = 0.0975. For 21 independent comparisons the Type I error rate would be 1-(1-0.05)^21 = 0.66. Such a procedure would often suggest differences when in fact none exist. The situation is more complicated still because all pair-wise comparisons are not independent. With our k=7 example, comparisons A-B and C-D are independent, but comparisons A-B and A-C are not because they both use the same data (i.e., from group A). This partial independence makes it difficult to calculate the true family-wise Type I error rate.

Fortunately, multiple comparison procedures exist that maintain the family-wise Type I error rate at a specified level. Hypothesis tests based on a fixed Type I error rate follow the Neyman-Pearson approach (1). Tukey’s honestly significant difference (HSD) procedure does this when making all k(k-1)/2 pair-wise comparisons of the means of k groups that all have the same normally distributed variance (3). When normality can be assumed, it is a preferred test. It is easy to execute and is commonly used in statistical practice. This is illustrated with the following analytical example.

Step 1: Define Key Statistics
The following apply for Example 1:
k = the number of analysts (groups) = 7
a = the desired Type I error rate = 0.05
ni = nj = the number of within-group results (trials) for analysts i and j = 4
Pooled variance = average of the 7 variances = 8.997
v = the total number of degrees of freedom associated with the pooled variance = 7*3 = 21

Step 2: Calculate The Decision Statistic
The maximum allowed difference (+/-D) is the decision statistic used for Tukey’s test. Any pair-wise mean difference between groups whose absolute value is larger in magnitude than D is considered statistically significant. D is calculated in an Excel formula as follows:

D = (q(k,v,a)/SQRT(2))*SQRT((1/ni+1/nj)*(Pooled variance))

Here q(k,v,a) is the 1-a quantile of the Studentized range distribution for k groups and v degrees of freedom. Excel has no function for this distribution, but tables and interactive calculators of the Studentized range quantile are available on the Internet (4). We find there that

q(7,21,0.05) = 4.6.

Substituting these values into the above equation for D, we obtain

D = +/- 6.9

Step 3: Make The Inference
Calculate all k(k-1)/2 pair-wise differences among the k means and compare the absolute value of these differences to D. Figure 2 displays all 7*6/2 = 21 analyst pair-wise differences (row analyst minus column analyst).
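The arithmetic of Steps 1 and 2 is easy to check outside Excel. The following Python sketch (an illustration, not the article's spreadsheet) reproduces the family-wise error-rate calculation, computes Tukey's decision statistic D, and flags the significant pairs; the Studentized range quantile q(7, 21, 0.05) = 4.6 is taken from the table lookup described above, not computed:

```python
import math

# Family-wise Type I error rate for m independent tests, each run at a
# comparison-wise rate alpha: 1 - (1 - alpha)**m.
def familywise_error(alpha, m):
    return 1 - (1 - alpha) ** m

# Tukey's maximum allowed difference D; q is the Studentized range quantile
# q(k, v, a), obtained here from a published table rather than computed.
def tukey_D(q, pooled_var, ni, nj):
    return (q / math.sqrt(2)) * math.sqrt((1 / ni + 1 / nj) * pooled_var)

means = {"Dick": 93.5, "Jane": 92.2, "Kate": 96.9, "Yihong": 95.3,
         "Enrico": 95.7, "Azita": 99.8, "Abdel": 90.1}

m = 7 * (7 - 1) // 2                               # 21 pair-wise comparisons
fwe = familywise_error(0.05, m)                    # ~0.66 if treated as independent
D = tukey_D(q=4.6, pooled_var=8.997, ni=4, nj=4)   # ~6.9

# All pairs whose absolute mean difference exceeds D
names = list(means)
significant = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
               if abs(means[a] - means[b]) > D]
print(significant)
```

Only the Azita-Jane and Azita-Abdel differences exceed D, matching the inference drawn from Figure 2 below.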
Figure 2: Tukey multiple comparison of average percent extraction for seven analysts (row analyst minus column analyst).

           Dick   Jane   Kate  Yihong Enrico  Azita
           93.5   92.2   96.9   95.3   95.7   99.8
Jane 92.2  –1.3
Kate 96.9   3.4    4.7
Yihong 95.3 1.8    3.1   –1.6
Enrico 95.7 2.2    3.5   –1.2    0.4
Azita 99.8  6.3    7.6    2.9    4.5    4.1
Abdel 90.1 –3.4   –2.1   –6.8   –5.2   –5.6   –9.7

Those differences in Figure 2 that have an absolute magnitude greater than D = 6.9 are considered statistically significant based on a family-wise Type I error rate of 0.05. By this criterion, Azita’s extraction yield is considered greater than those of either Abdel (difference of 9.7) or Jane (difference of 7.6). These are the only two differences that are statistically significant.

The supervisor called her analysts together to discuss the result. At the meeting it was learned that Azita used a much more vigorous manual mixing technique than either Abdel or Jane. Thus the team decided to further investigate the effect of mixing on extraction.

EXAMPLE 2: TESTING THE EQUALITY OF GROUP STANDARD DEVIATIONS USING BARTLETT’S TEST
A chemistry, manufacturing, and controls (CMC) team was developing a tablet powder blending process. One of the key measures of the quality of the blend is the potency content uniformity in the tablets that are compressed from the blend. The team had compressed tablets from five pilot-scale blend batches. The potency content was determined in 10 randomly selected tablets from each of the five blends. The 50 test results are plotted in Figure 3.

[Figure 3: Potencies of tablet samples compressed from five different powder blends.]

The within-blend standard deviation or variance is one measure of the content non-uniformity of each blend. Table II provides the observed sample standard deviations and variances for each of the five blends.
Table II: Standard deviations and variances of tablet potencies compressed from five different powder blends.

Blend   Standard deviation   Number of tablets tested   Degrees of freedom   Variance
  1           4.90                     10                       9             24.05
  2           2.74                     10                       9              7.51
  3           2.32                     10                       9              5.37
  4           1.25                     10                       9              1.57
  5           2.98                     10                       9              8.86
                                                        Sum:   45    Average:  9.47

Both Figure 3 and Table II suggest that blends one and five have poorer content uniformity. However, the team wanted more objective evidence for differences in content uniformity among the blends. After all, some difference in standard deviation among the blends could be expected just based on sampling and testing variation.

Bartlett’s test (5) is one of a number of tests of the following null hypothesis:

H0: variance1 = variance2 = variance3 = … = variancek

Bartlett’s test assumes that the results within each group follow normal distributions with the same variance but possibly different true means. When the within-group results are normally distributed, Bartlett’s test is a good one. The team was willing to make this assumption. Bartlett’s test can be used either in the Fisherian (obtaining a P-value as a measure of evidence) or in the Neyman-Pearson (maintaining a fixed Type I error rate as a decision criterion) sense. The team decided to use a fixed Type I error rate of 0.05.

Step 1: Define Key Statistics
The following statistics can be gleaned from Table II:
Number of blends (groups) = k = 5
Number of degrees of freedom within each group = dfi = 10-1 = 9
Total number of degrees of freedom in the pooled variance = dfp = 5*9 = 45
Pooled variance = vp = average of the within-blend variances (vi) = 9.47
Type I error = 0.05

Step 2: Calculate Bartlett’s Statistic
Bartlett’s statistic requires the calculation of the following M and C values from the above statistics and other information from Table II. These can be obtained using Excel formula syntax as follows (note that LN() is the natural log function in Excel):

M = (dfp)*LN(vp) - SUM(dfi*LN(vi)) = 15.59
C = 1 + (1/3/(k-1))*(SUM(1/(dfi)) - 1/(dfp)) = 1.04
Bartlett’s test statistic = M/C = 14.93

Step 3: Calculate The Acceptance Criterion For Bartlett’s Test
Larger values of M/C will be associated with greater variance heterogeneity. Under the assumption of within-group normality, Bartlett’s test statistic follows a Chi-squared probability distribution with k-1 degrees of freedom. Therefore, an appropriate upper limit for M/C that corresponds to a family-wise Type I error rate of a = 0.05 is the 0.95 quantile of the Chi-squared distribution with k-1 = 4 degrees of freedom. This can be obtained in Excel (6) as

Upper limit of Bartlett’s statistic = CHIINV(a, k-1) = 9.49

The team observed that Bartlett’s statistic, 14.93, is considerably greater than the upper limit of 9.49. The relationship can be seen clearly in Figure 4.

[Figure 4: Probability distribution of Bartlett’s test statistic, showing the 0.95 quantile (9.49) and the observed value (14.93) in the right tail.]

Here the appropriate Chi-squared distribution was plotted in Excel using the following formula for the probability density (6):

Probability density of Bartlett’s M/C statistic = GAMMADIST(M/C, (k-1)/2, 2, FALSE)

Step 4: Make The Inference
Clearly the observed value 14.93 is far in the right tail of the Chi-squared distribution, indicating that the differences in scatter seen in Figure 3 and Table II are statistically significant. Had the team wanted to use Bartlett’s test in the Fisherian sense, they would merely have calculated the P-value associated with an observed test statistic result of 14.93. The calculation can be done in Excel (6) simply as:

P-value = CHIDIST(M/C, k-1) = CHIDIST(14.93, 4) = 0.005
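As a cross-check on the spreadsheet arithmetic, the M, C, and M/C values of Step 2 can be reproduced in a few lines of Python. This is a sketch assuming, as in Table II, equal degrees of freedom in every group (so SUM(1/dfi) simplifies to k/df); the function name is illustrative:

```python
import math

# Bartlett's statistic M/C for k groups that share the same within-group df.
# variances: the observed within-group sample variances (here from Table II).
def bartlett_statistic(variances, df):
    k = len(variances)
    dfp = k * df                            # pooled degrees of freedom
    vp = sum(variances) / k                 # pooled variance (equal-df case)
    M = dfp * math.log(vp) - df * sum(math.log(v) for v in variances)
    C = 1 + (1 / (3 * (k - 1))) * (k / df - 1 / dfp)
    return M / C

stat = bartlett_statistic([24.05, 7.51, 5.37, 1.57, 8.86], df=9)
print(round(stat, 2))   # close to the article's 14.93 (rounding differs slightly)
```

The resulting statistic is then compared with the 0.95 quantile of the Chi-squared distribution with k-1 = 4 degrees of freedom (9.49), exactly as in Step 3.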
The low P-value gives strong evidence for variance differences among the blends. Therefore, the team is justified in inferring that the blends did produce tablets that have statistically significant differences in content uniformity.

One of the critical assumptions behind the ANOVA test for equality of group means is homogeneity of variance (also called “homoskedasticity”). This assumption means that the true within-group variance of all groups being compared is equal. One important use of Bartlett’s test is to identify heteroskedasticity among groups before performing ANOVA. If Bartlett’s test shows strong evidence of heteroskedasticity, the ANOVA F-test will be unreliable.

EXAMPLE 3: TESTING THE EQUALITY OF GROUP INTERCEPTS AND SLOPES USING ANCOVA
The common technical document (CTD) for a new drug application (NDA) submission of a new product requires the justification of the label claim for the proposed product shelf life (7). In the simple case of a product with one single-dosage strength and single packaging, this justification is generally based on the stability profiles of three product batches. An example of such profiles is shown in Figure 5.

[Figure 5: Stability of three batches of drug product with trend lines based on a separate intercept, separate slope (SISS) model. Potency (%LC) versus months of storage for batches A, B, and C.]

In Figure 5, it is assumed that the stability profile can be approximated by a straight line, and that the intercept and slope of each batch are different. This stability model is called the separate intercept, separate slope (SISS) model. A second possible model would be to assume that all batches have a common slope (or rate of degradation), but possibly separate intercepts (initial potencies), as is shown in Figure 6. This type of stability model is called a separate intercept, common slope (SICS) model.

[Figure 6: Stability of three batches of drug product with trend lines based on a separate intercept, common slope (SICS) model.]

A third possible model would be to assume that the stability profile for all batches can be represented by a common intercept, common slope (CICS) model, as shown in Figure 7.

[Figure 7: Stability of three batches of drug product with a single trend line based on a common intercept, common slope (CICS) model.]

Analysis of covariance (ANCOVA) is a procedure used, among other things, to identify which of the three models (SISS, SICS, or CICS) is appropriate for a given set of stability data (8). For the present example, ANCOVA tests two null hypotheses:

H0intercept: intercept1 = intercept2 = intercept3
H0slope: slope1 = slope2 = slope3

ANCOVA is relatively difficult to perform in Excel using the standard algebra of multiple regression (8). However, the equations are quite simple when expressed in matrix form (9). Fortunately, Excel has some matrix functions that make it possible to use the matrix formulas for ANCOVA directly. The variable names that appear bold in the equations that follow are matrix variables. These variable names correspond to similar (but not bold) cell range names that can be used in matrix formulas in Excel. Visit the Excel HELP menu for instructions on naming cell ranges and entering array formulas into cell ranges. In Excel, an array formula must be entered into a cell range whose dimension (number of rows and columns) matches that of the matrix result produced by the formula. The Glossary section gives some insight into the operations of matrix algebra and the dimensions of the final result of each operation. The following sections give the ANCOVA steps involved for this particular example.

Step 1: Set Up The Necessary Matrices As Arrays
The data are first used to create four different matrices needed for ANCOVA. Figure 8 gives the X1, X2, X3, and Y matrices for the data given in Figures 5, 6, and 7. The underlying stability data, as (months of storage, potency in %LC) pairs, are:

Batch A: (0, 100.8), (3, 102.0), (3, 101.2), (6, 100.8), (6, 100.2), (12, 99.6), (12, 99.2), (24, 97.2), (24, 98.4)
Batch B: (0, 101.0), (3, 100.2), (6, 99.8), (6, 100.3), (12, 99.4), (12, 98.2), (24, 96.7), (24, 97.9)
Batch C: (0, 101.5), (1, 100.2), (2, 101.3), (3, 100.7), (3, 100.4), (6, 99.3), (6, 99.0), (12, 99.0), (12, 97.6), (24, 96.6), (24, 96.1)

[Figure 8: The X1, X2, X3, and Y matrices used for ANCOVA. X1 (CICS) has columns (1, Mi); X2 (SICS) has columns (1, Dummy1i, Dummy2i, Mi); X3 (SISS) has columns (1, Dummy1i, Dummy2i, Mi, Dummy1i*Mi, Dummy2i*Mi); Y is the single column of measured potencies, one row per measurement.]

Figure 8 shows how the matrices appear as named ranges in an Excel spreadsheet. The actual Excel names used for the X1, X2, and X3 matrices are X1M, X2M, and X3M, respectively. The “M” suffix is required to avoid confusion with a single cell reference. The Y matrix is simply a single column containing the measured potencies for the three different batches.
The batch corresponding to each measurement is given for reference at the left of Figure 8. All four matrices have one row for each measured potency. The order of these rows is irrelevant. The following reviews the construction of the X1, X2, and X3 matrices. The columns in each depend on the corresponding stability model (CICS, SICS, and SISS, respectively). All three stability models take the following form, where the subscript i indicates a row in Figure 8:

Yi = Parameter1*Xcol1i + Parameter2*Xcol2i + Parameter3*Xcol3i + …

The columns in X1, X2, and X3 correspond to the values of Xcol1, Xcol2, etc. For X1 (CICS) the stability model has only two parameters, the common intercept and common slope. Therefore, it takes the following familiar form:

Yi = Intercept*1 + Slope*Mi

For each row of X1, Xcol1 is always 1, so the first column in X1 is a column of 1s. The second column gives the months of storage (Mi) at which the corresponding Yi value is measured. The intercept and slope are the same for all batches and have their usual interpretation.

For X2 and X3, additional parameters relate to the slope and intercept parameters of each batch, and we need to define corresponding column values. Since matrices can only contain numeric values, we cannot include batch identifiers like “A”, “B”, or “C” in them. Instead we have to create special dummy variables to put in the columns of these matrices that identify the batch being tested. We call these Dummy1 and Dummy2, and they are defined as follows:

Lot A: Dummy1 = 1 and Dummy2 = 0
Lot B: Dummy1 = 0 and Dummy2 = 1
Lot C: Dummy1 = -1 and Dummy2 = -1

We can use these dummy variables to define the SICS stability model corresponding to X2 as

Yi = Int*1 + IntA*Dummy1i + IntB*Dummy2i + Slope*Mi ,

where the true intercept of each batch can be interpreted in terms of the parameters Int, IntA, and IntB as follows:

Intercept for batch A = Int + IntA
Intercept for batch B = Int + IntB
Intercept for batch C = Int – IntA – IntB
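The dummy coding and the recovery of per-batch intercepts can be sketched in a few lines of Python (an illustration only; the helper names are not from the article):

```python
# Sum-to-zero dummy coding used in the X2 (SICS) matrix.
# Batch labels A, B, C follow the article; the function names are illustrative.
DUMMIES = {"A": (1, 0), "B": (0, 1), "C": (-1, -1)}

def x2_row(batch, months):
    """One row of X2: [1, Dummy1, Dummy2, Mi]."""
    d1, d2 = DUMMIES[batch]
    return [1, d1, d2, months]

def batch_intercepts(Int, IntA, IntB):
    """Recover the per-batch intercepts from the fitted SICS parameters."""
    return {"A": Int + IntA, "B": Int + IntB, "C": Int - IntA - IntB}

print(x2_row("C", 6))   # [1, -1, -1, 6]
```

Note that because Dummy1 and Dummy2 sum to -1 for lot C, the three intercept effects sum to zero around Int, which is why no third dummy column is needed.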
Thus the columns of X2 are:
1- a column of 1s
2- Dummy1i
3- Dummy2i
4- the storage period (Mi) at which the corresponding Yi value is measured.

The slope parameter has its usual interpretation and is the same for all batches.

For X3 (model SISS), we follow a similar logic, but must include additional parameters to account for differences in slope among batches. The stability model is written:

Yi = Int*1 + IntA*Dummy1i + IntB*Dummy2i + Slp*Mi + SlpA*Dummy1i*Mi + SlpB*Dummy2i*Mi

The Int, IntA, and IntB parameters are interpreted as described above for X2. The true slope of each batch can be interpreted in terms of the parameters Slp, SlpA, and SlpB as follows:

Slope for batch A = Slp + SlpA
Slope for batch B = Slp + SlpB
Slope for batch C = Slp – SlpA – SlpB

From the above model we can see that the columns of X3 are:
1- a column of 1s
2- Dummy1i
3- Dummy2i
4- the storage period (Mi) at which the corresponding Yi value is measured
5- Dummy1i*Mi
6- Dummy2i*Mi

Step 2: Perform The Matrix Calculations For ANCOVA
The calculations can be performed easily in Excel by using three Excel array functions. In these, the matrices A and B are defined Excel ranges (arrays):

TRANSPOSE(A): interchanges the rows and columns of the array (designated AT in matrix notation)
MMULT(A,B): performs matrix multiplication on the arrays A and B (designated simply AB in matrix notation)
MINVERSE(A): obtains the matrix inverse of array A (designated A-1 in matrix notation)

A model-fitting statistic that is important for the ANCOVA calculations here is the sum of squared errors (SSE). The matrix formula for this is as follows (again, replacing i by the appropriate model number):

SSEi = YT(I-Hi)Y, where Hi = Xi(XiTXi)-1XiT

and I is a special matrix called the “identity matrix.” This matrix is a square matrix with the same number of rows and columns as there are Y measurements. The diagonal numbers of the identity matrix are all 1s, and all other numbers in the matrix are zero.
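The SSE formula can be verified on a toy example without Excel. The sketch below (plain Python, no libraries; all names and the tiny dataset are illustrative) builds the hat matrix H = X(XTX)-1XT for a small straight-line fit and evaluates YT(I-H)Y:

```python
def transpose(A):
    return [list(row) for row in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def inverse(A):
    # Gauss-Jordan elimination; assumes A is square and non-singular.
    n = len(A)
    M = [row[:] + identity(n)[i] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [x / p for x in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [row[n:] for row in M]

# Toy straight-line model Y = intercept + slope*month at months 0, 1, 2
X = [[1, 0], [1, 1], [1, 2]]
Y = [[0.0], [1.0], [1.0]]

Xt = transpose(X)
H = matmul(matmul(X, inverse(matmul(Xt, X))), Xt)          # hat matrix
I_minus_H = [[i - h for i, h in zip(ri, rh)]
             for ri, rh in zip(identity(len(Y)), H)]
sse = matmul(matmul(transpose(Y), I_minus_H), Y)[0][0]     # Y'(I-H)Y
```

For this toy fit the least-squares line is Y = 1/6 + 0.5*month, so the residual sum of squares (and hence the SSE from the matrix formula) is 1/6.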
The identity matrix used for the calculations here is shown in Figure 9. In Excel, this matrix is an array that can be given a name such as “ID.”

[Figure 9: The ID (identity) matrix needed for ANCOVA: a 28 x 28 matrix with 1s on the diagonal and 0s elsewhere.]

In the following Excel formulas, i is replaced by either 1, 2, or 3 (corresponding to the models CICS, SICS, or SISS, respectively). The matrix formula for the estimated model parameters, B, is:

Bi = (XiTXi)-1XiTY

These parameters can be obtained in Excel as a matrix with a single column and one row for each model parameter by using the following array formula, with i replaced by the corresponding model number:

BiM = MMULT(MMULT(MINVERSE(MMULT(TRANSPOSE(XiM),XiM)),TRANSPOSE(XiM)),Y)

The matrix formula for the predicted values of Y based on the fitted model is as follows:

Ypredi = XiBi

This corresponds to the following calculation in Excel, where i is replaced by the appropriate model number:

Ypredi = MMULT(XiM,BiM)

The sum of squared errors for each model i can be calculated in Excel by combining the two matrix equations above for SSEi and Hi into the following array formula:

SSEi = MMULT(MMULT(TRANSPOSE(Y),ID-MMULT(MMULT(XiM,MINVERSE(MMULT(TRANSPOSE(XiM),XiM))),TRANSPOSE(XiM))),Y)

The error degrees of freedom (dfe) for each model should also be calculated. The dfe for a model is defined as the number of observations N (the number of rows in Xi) minus the number of parameters (the number of columns in Xi). If k = the number of batches, then the dfe for each model is:

dfe1 = N-2
dfe2 = N-k-1
dfe3 = N-2*k

For this example, N = 28 and k = 3. The sum of squared errors (SSE) for each of the three possible models is shown in Table III. The SSEi and dfei values in Table III form the basis for the ANCOVA F-tests. The ANCOVA calculations and F-tests are presented in Table IV. The four rows of the ANCOVA table are labeled A, B, C, and D. Row A is generally presented as part of the ANCOVA table but is not needed for our purposes. Row D merely shows the calculation of the mean squared error (MSD) that becomes the denominator for the F ratios. Rows B and C contain the calculations that are of interest in the present example. As indicated in Table IV, the F-test of row C is a test of the null hypothesis (H0) that the true model is SICS against the alternative hypothesis (Ha) that the true model is SISS.
Thus the C F-test is really testing the hypothesis that the slopes of the regression lines are equal. The F-test of row B is a test of the null hypothesis (H0) that the true model is CICS against the alternative hypothesis (Ha) that the true model is SICS. Thus the B F-test is really testing the hypothesis that the intercepts of the three batches are equal, under the assumption that the slopes are equal. The F-test P-values for rows B and C are obtained in Excel by substituting the appropriate values into the formula below:

Psource = FDIST(Fsource, dfsource, dfe3)

Table III: Sum of squared errors and degrees of freedom associated with three different stability models.

Model      SSE     dfe
1 (CICS)   13.21   26
2 (SICS)    8.71   24
3 (SISS)    7.69   22

Table IV: ANCOVA table for comparing the intercepts and slopes of the stability profiles of three batches.

Source   H0     Ha     SS                  df             SS/df        F                  P
A        CICS   SISS   SSE1-SSE3 = 5.52    dfe1-dfe3 = 4  MSA = 1.38   MSA/MSD = 3.95     0.015
B        CICS   SICS   SSE1-SSE2 = 4.50    dfe1-dfe2 = 2  MSB = 2.25   MSB/MSD = 6.44     0.006
C        SICS   SISS   SSE2-SSE3 = 1.02    dfe2-dfe3 = 2  MSC = 0.51   MSC/MSD = 1.46     0.254
D                      SSE3 = 7.69         dfe3 = 22      MSD = 0.35

Step 3: Make The Inference
The logic used by the US Food and Drug Administration in selecting a stability model for shelf life estimation is shown in Figure 10. FDA and International Conference on Harmonisation (ICH) stability guidances (7) recommend that a P-value of 0.25 be used as a decision criterion for the B and C ANCOVA F-tests when comparing the stability profiles of different product batches. This corresponds to a Neyman-Pearson Type I error rate of 0.25, meaning that the tests will reject the null hypothesis in error 25% of the time. This rather high Type I error rate lowers the consumer’s risk in that it raises the power of the tests to detect smaller differences in slope and intercept among the batches. However, it also has the effect of raising the manufacturer’s risk by incorrectly identifying batch differences that do not in fact exist. The use of a Type I error rate of 0.25 in order to reduce the consumer risk has been the subject of some debate in the past (10, 11) but has by now become established practice.

[Figure 10: Using ANCOVA to choose a stability model. Start with the row C test: if PC < 0.25, select SISS. Otherwise examine the row B test: if PB < 0.25, select SICS; if not, select CICS.]

Figure 10 first examines the row C F-test P-value. If significance is found, then SISS is selected as the final stability model. Table IV shows that in our example, the P-value for the C F-test is above 0.25 (0.254), so there is no evidence for differences among batches with respect to their slopes (i.e., rates of potency loss). Figure 10 next examines the row B F-test P-value. If significance is found, then SICS is selected; otherwise CICS becomes the final model. Table IV shows that in our example, the P-value for the B F-test is very statistically significant (0.006 << 0.25). Therefore, in our example we would conclude that the SICS model of Figure 6 is the appropriate stability model to use for shelf-life estimation.

Usually, the ANCOVA analysis and shelf-life projection are done by a trained statistician. The statistician may use a statistical package such as SAS (12, 13) for these calculations. Performing the ANCOVA in SAS is quite straightforward. The following gives the simple SAS syntax for an ANCOVA analysis:

PROC GLM;
CLASS BATCH;
MODEL LEVEL = TIME BATCH TIME*BATCH / SS1;
RUN;

However, the example shows that the ANCOVA analysis can also be programmed relatively easily into Excel using its matrix functions and array formulas.
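The F ratios in Table IV follow directly from the SSE and dfe values of Table III. A small Python sketch of that arithmetic (the SSE and dfe values are taken from the article's tables; the variable names are illustrative):

```python
# SSE and error df for the three nested stability models (Table III)
SSE1, dfe1 = 13.21, 26   # CICS: common intercept, common slope
SSE2, dfe2 = 8.71, 24    # SICS: separate intercepts, common slope
SSE3, dfe3 = 7.69, 22    # SISS: separate intercepts, separate slopes

MSD = SSE3 / dfe3        # row D: mean squared error of the full (SISS) model

# Row C: SICS (H0) vs SISS (Ha) -- do the slopes differ among batches?
F_C = ((SSE2 - SSE3) / (dfe2 - dfe3)) / MSD
# Row B: CICS (H0) vs SICS (Ha) -- do the intercepts differ among batches?
F_B = ((SSE1 - SSE2) / (dfe1 - dfe2)) / MSD

print(round(F_C, 2), round(F_B, 2))   # 1.46 6.44, as in Table IV
```

The corresponding P-values (0.254 and 0.006) come from the F distribution with 2 and 22 degrees of freedom, for example via FDIST in Excel as shown above; each is then compared with the 0.25 criterion of Figure 10.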
SUMMARY
Three hypothesis-testing situations that are common in pharmaceutical and analytical development have been presented and the calculations illustrated using Excel.

Tukey's test can be used to compare the means of multiple groups when the within-group variation is normal and all groups have the same true variance. This test is a Neyman-Pearson type test that maintains a desired family-wise Type I error rate. It is based on the Studentized range distribution. Tables and calculators for the Studentized range distribution can be found on the Internet.

Bartlett's test can be used to compare the variances of multiple groups when the within-group variation is normal. This test can be used as either a Neyman-Pearson or Fisherian type test and is based on the Chi-squared distribution.

ANCOVA can be used to compare the intercepts and slopes of regression lines when the error of measurement is normal, independent, and identically distributed. This test is based on the F distribution. When comparing the stability profiles of different batches of product, ANCOVA is often employed as a Neyman-Pearson type test with a comparison-wise Type I error rate of 0.25.

The calculations for some complex hypothesis-testing procedures, such as ANCOVA, can be greatly simplified by expressing them in matrix algebra. The matrix formulas for regression and ANCOVA are available in textbooks. Matrix calculations can easily be done in Excel using defined ranges, range labels, and the array functions MMULT, MINVERSE, and TRANSPOSE. The matrix formulas for multiple regression and the availability and relative ease of use of Excel matrix functions may be among the best kept secrets in statistics. An Excel spreadsheet illustrating these calculations is available from the author upon request.
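As a concrete illustration of the matrix approach described above, the Python sketch below fits a straight line by the same b = (X'X)^-1 X'y calculation that the Excel functions TRANSPOSE, MMULT, and MINVERSE perform. The potency data are hypothetical and exactly linear, chosen so the fitted intercept and slope come out to round numbers; the tiny matrix helpers stand in for the corresponding Excel array functions.

```python
# Least-squares regression via matrix algebra: b = (X'X)^-1 X'y.
# transpose(), matmul(), and inv2() play the roles of the Excel array
# functions TRANSPOSE, MMULT, and MINVERSE, respectively.

def transpose(A):
    return [list(row) for row in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def inv2(A):
    # Inverse of a 2x2 matrix (sufficient for a straight-line fit)
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

time = [0, 3, 6, 9]                  # months (hypothetical)
level = [100.0, 99.0, 98.0, 97.0]    # % of label claim (hypothetical)

X = [[1.0, t] for t in time]         # design matrix: intercept column + time
y = [[v] for v in level]

Xt = transpose(X)
b = matmul(inv2(matmul(Xt, X)), matmul(Xt, y))   # (X'X)^-1 X'y

intercept, slope = b[0][0], b[1][0]
print(f"intercept = {intercept:.1f}, slope = {slope:.3f}")
# intercept = 100.0, slope = -0.333
```

In Excel, the same estimate would come from a single array formula such as MMULT(MINVERSE(MMULT(TRANSPOSE(X),X)),MMULT(TRANSPOSE(X),Y)) over named ranges X and Y.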
ACKNOWLEDGMENTS
The author would like to express his gratitude to Paul Pluta for his suggestions and encouragement, to Susan Haigney for her patience and organizational skills, and to Diane Wolden for her vigilance and perseverance through all these "sadistics."

GLOSSARY
ANCOVA (analysis of covariance): A procedure, based on the F distribution, that includes hypothesis tests for the equality of slopes and intercepts of multiple trend lines.

ANOVA (analysis of variance): A hypothesis test, based on the F-statistic described by Fisher, used to detect differences among the true means of data from two or more groups.

Bartlett's test: A hypothesis test, based on the Chi-square distribution, of the null hypothesis that multiple independent random samples come from normal populations all having the same variance or standard deviation.

F-statistic: The decision statistic used in Fisher's analysis of variance hypothesis test, consisting of the ratio of two independent observed variances calculated from normally distributed data.

Family-wise Type I error rate: The risk of falsely rejecting one or more null hypotheses when conducting a family of many pair-wise statistical tests, each comparing the parameters of two groups.

Homogeneity of variance: Many statistical procedures that compare the population parameters of multiple groups, such as ANOVA, ANCOVA, and Tukey's test, assume that the within-group variation of the groups being compared follows normal distributions with the same true within-group variance but possibly different group means. This equality of variance is called "variance homogeneity" or "homoskedasticity."

Hypothesis: A provisional statement about the value of a model parameter or parameters whose truth can be tested by experiment.

Identity matrix: A special square matrix, in which the number of rows equals the number of columns. The diagonal numbers of the identity matrix are all equal to 1.
All off-diagonal numbers are equal to zero.

Matrix: A table of numbers with a fixed number of rows (r) and columns (c), often designated Arc or simply A. Arrays (rectangular cell ranges) in Excel can be given names and manipulated as matrices using Excel formulas. The rules by which matrices can be added, multiplied, inverted, and transposed were described by the English mathematician Arthur Cayley in 1857. These four operations are available in Excel as the array functions '+' (the add operator), MMULT(AM,BM), MINVERSE(AM), and TRANSPOSE(AM), respectively, where AM and BM are valid cell range names. These Excel functions can greatly simplify complex statistical calculations such as regression and ANCOVA.

Matrix addition: Two matrices, say Anm and Bpq, can be added or subtracted (symbolized A+B or A-B, respectively, and entered similarly into an Excel array formula) if n=p and m=q. The result is a matrix having the same dimensions as the starting matrices. Matrix addition is commutative (A+B = B+A). An example of a matrix addition is given below.

[a b c]   [g h i]   [a+g b+h c+i]
[d e f] + [j k l] = [d+j e+k f+l]

Matrix inverse: Under certain conditions, a matrix can be inverted in a manner analogous to taking the reciprocal of a scalar number. Inverses are only defined for square matrices. The resulting matrix will have the same dimensions as the starting matrix. Matrix inversion is symbolized A-1 and entered into an Excel array formula as MINVERSE(AM), where AM is a valid cell range name.

Matrix multiplication: Two matrices, say Anm and Bpq, can be multiplied (symbolized AB or MMULT(A,B) in an Excel array formula) if the number of columns of A (m) equals the number of rows of B (p). The product is a matrix with n rows and q columns. Matrix multiplication is not commutative; in other words, AB is not, in general, equal to BA. An example of a matrix multiplication is given below.
[a b c] [g h i j]   [ag+bk+co  ah+bl+cp  ai+bm+cq  aj+bn+cr]
[d e f] [k l m n] = [dg+ek+fo  dh+el+fp  di+em+fq  dj+en+fr]
        [o p q r]

Matrix transpose: The interchange of the rows and columns of a matrix. This is symbolized AT and entered into an Excel array formula as TRANSPOSE(AM), where AM is a valid cell range name. If the starting matrix has r rows and c columns, the resulting matrix will have c rows and r columns.

Multiplicity: When multiple hypothesis tests (e.g., control chart rules) are applied to different aspects (e.g., trending patterns) of a data set, the overall false alarm rate of any one test failing may be greater than that of any single test applied alone. This statistical phenomenon is referred to as multiplicity.

Null hypothesis (H0): A plausible hypothesis that is presumed sufficient to explain a set of data unless statistical evidence in the form of a hypothesis test indicates otherwise.

P-value: The probability of obtaining a result at least as extreme as the one that was actually observed, given that the null hypothesis is true. The fact that P-values are based on this assumption is crucial to their correct interpretation.

Pair-wise Type I error rate: The risk of falsely rejecting the null hypothesis when conducting a statistical test comparing the parameters of two groups.

Parameter: In statistics, a parameter is a quantity of interest whose "true" value is to be estimated. Generally a parameter is some underlying variable associated with a physical, chemical, or statistical model.

Sampling distribution: The distribution of data or of some summary statistic calculated from data.

Statistic: A summary value (such as the mean or standard deviation) that is calculated from data. A statistic is often used because it provides a good estimate of a parameter of interest.

Studentized range distribution: Consider a normal population from which two independent samples, of size k and n, respectively, are drawn.
From the first sample, the range (maximum minus minimum) of the k values is obtained. From the second sample, the standard deviation with n-1 degrees of freedom is calculated. The ratio range/(standard deviation) will then have a Studentized range probability distribution, indexed by k and n-1.

Tukey's test: A hypothesis test, based on the Studentized range distribution, of the null hypothesis that multiple independent random samples all come from the same normal population having a single true mean.

Type I error: A decision error that results in falsely rejecting the null hypothesis when in fact it is true. It is sometimes referred to as the alpha-risk or manufacturer's risk.

REFERENCES
1. LeBlond, D., "Understanding Hypothesis Testing Using Probability Distributions," Journal of Validation Technology, Vol. 15, No. 1, pp 44-61, Winter 2009.
2. Vijayvargiya, A., "One-Way Analysis of Variance," Journal of Validation Technology, Vol. 15, No. 1, pp 62-63, Winter 2009.
3. Montgomery, D., Design and Analysis of Experiments, 3rd edition, John Wiley & Sons, New York, pp 78-79, 1991. See also the online NIST statistical handbook: http://www.itl.nist.gov/div898/handbook/prc/section4/prc471.htm.
4. Tables of the Studentized range distribution are available online from many sources by searching the key words "Studentized Range Statistic Table," including: http://faculty.vassar.edu/lowry/tabs.html#q, http://cse.niaes.affrc.go.jp/miwa/probcalc/, and Harter, H., Annals of Mathematical Statistics, Vol. 31, No. 4, pp 1122-1147, 1960 (available free at http://www.projecteuclid.org).
5. Snedecor, G.W. and Cochran, W.G., Statistical Methods, 8th edition, Iowa State University Press, pp 251-252, 1989.
6. LeBlond, D., "Using Probability Distributions to Make Decisions," Journal of Validation Technology, pp 2-14, Spring 2008.
7. International Conference on Harmonisation, ICH Q1E Evaluation of Stability Data, June 2004.
8. Brownlee, K.A., Statistical Theory and Methodology, Robert E. Krieger, Malabar, FL, 1984.
9. Neter, J., Kutner, M., Nachtsheim, C., and Wasserman, W., Applied Linear Statistical Models, 4th edition, Irwin, Chicago, 1996.
10. Bancroft, "Analysis and Inference for Incompletely Specified Models Involving the Use of Preliminary Test(s) of Significance," Biometrics, 20, pp 427-442, 1964.
11. Ruberg, S.J. and Stegman, J.W., "Pooling Data for Stability Studies: Testing the Equality of Batch Degradation Slopes," Biometrics, 47, pp 1059-1069, 1991.
12. Ng, M.J., STAB Stability System for SAS, Division of Biometrics, CDER, FDA, 1995. These SAS macros may be downloaded from the following website: http://www.fda.gov/cder/sas/.
13. The SAS system for statistical analysis is available from the SAS Institute, Inc., Cary, NC, USA.

ARTICLE ACRONYM LISTING
ANCOVA: Analysis of Covariance
ANOVA: Analysis of Variance
CICS: Common Intercept, Common Slope
CMC: Chemistry, Manufacturing, and Controls
CTD: Common Technical Document
ICH: International Conference on Harmonisation
NDA: New Drug Application
SICS: Separate Intercept, Common Slope
SISS: Separate Intercept, Separate Slope
SSE: Sum of Squared Errors

ABOUT THE AUTHOR
David LeBlond, Ph.D., obtained an MS in statistics from Colorado State University and a Ph.D. in biochemistry from Michigan State University, and has 29 years of experience in the pharmaceutical and medical diagnostics fields. He serves on the CMC statistical expert team in PhRMA. David can be reached at [email protected].