The t-Test and Clinical Relevance: Is Your β Error Showing?

CHARLES F. ARKIN, M.D.

Arkin, Charles F.: The t-test and clinical relevance. Is your β error showing? Am J Clin Pathol 76: 416-420, 1981. The t-test, as usually performed in the clinical laboratory for methods comparisons, does not consider the clinical relevance of the difference between the methods. Two procedures are presented which incorporate medical criteria into the t-test, thereby making it a more relevant statistical tool. This is done by defining β error in addition to α error. Since a positive action, i.e., adoption of the "test" method, is taken when there is no difference between the "test" and reference methods, commission of a β error has a potentially damaging effect. Therefore, evaluation of β error is an appropriate part of methods comparisons in the clinical laboratory. (Key words: t-test; Methods comparison; Quality control; Statistics.)

Although the use of the paired t-test has attained widespread acceptance as a statistical tool in performing methods comparisons, i.e., comparing a "test" method with the reference method, it has some distinct disadvantages.1,4,6 Principally, the t-test does not consider whether differences between methods are medically or physiologically important but relates only to statistical significance. When comparing methods having very high precisions, the t-test may indicate real, or statistically significant, differences that may in fact be totally unimportant medically.4,6 Similarly, by greatly increasing the number of comparisons, extremely minor variations between methods can produce a highly significant t-value.6 On the other hand, the t-test may be insensitive to medically important differences when comparing methods of poor precision or small sample sizes.6
In order for the t-test to have value in differentiating methods on the basis of medical criteria, the precisions of the methods, the number of comparisons studied, and the size of the minimal difference that would have a medically important effect must all be taken into consideration. Another related problem with the t-test, as generally used in the clinical laboratory, is that limits are set on α (type I) error as opposed to β (type II) error. When one is comparing a test method with a reference method, one will adopt the test method if it is no different from the reference method.1 If one commits an α error, one would mistakenly conclude that the methods compared are "different" when their results are really the "same." Therefore, one would not put a "good" test into use, but no damage would be done. On the other hand, commission of a β error would result in mistakenly concluding that both methods are the "same" when they are really "different." Such an error could be damaging, since it would result in establishing a new method which is undesirable. Therefore, knowledge of one's β error is a necessity in methods comparisons in the clinical laboratory. Fortunately, all of the drawbacks to the t-test discussed above have a common solution, since setting β error limits requires taking medical importance into consideration.3 Two practical approaches to the evaluation of β error for the t-test will be considered here: one based on proposal of an alternative hypothesis (as opposed to the null hypothesis), the other based on selecting the proper sample size.2,3

From the Division of Clinical Pathology, Department of Pathology, New England Deaconess Hospital, Boston, Massachusetts.

Received December 4, 1980; received revised manuscript and accepted for publication April 13, 1981. Address reprint requests to Dr. Arkin: Division of Clinical Pathology, Department of Pathology, New England Deaconess Hospital, 185 Pilgrim Road, Boston, Massachusetts 02215.
Evaluation of β Error by Selecting an Alternative Hypothesis

β error is mistakenly concluding that the results of two methods are the same when they are really different. In this context, we define "different" as being greater than the maximum difference between methods that we are willing to accept. If, on the other hand, the difference between the test and the reference method is less than this maximum difference, then the two methods can be considered the same. By properly selecting this maximum difference, we are, in effect, establishing acceptance criteria for the adoption of a "test" method on the basis of medical importance. The acceptance criterion, or maximum acceptable difference, is referred to in general terms as the raw effect size (ES),2 since it is the size of the minimal difference which would exert an effect. Once an ES is chosen, the probability limits, or level of certainty, can be stated. For example, one may wish to be 95% certain that the difference between methods is less than the ES. This is the same as setting a β error of 5%, since there would be a 5% chance of calling results the same when they are different. The probability level (95% in this example) is commonly termed the power and is defined as 1 − β.3 Having established the ES and the desired power, one can proceed to perform the paired t-test on the results of several replicate samples run by both methods to obtain the following data:

— the mean of the reference method (A)
— the mean of the test method (B)
— the average difference (A − B, or Δ)
— the standard deviation (SD) of the individual differences (SD_Δi)
— the SD of the average difference (SD_Δ), i.e., the standard error of the mean
— the paired t-value (Δ/SD_Δ).
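As a minimal sketch, the quantities listed above can be computed from raw split-sample results; the data values below are invented for illustration and are not from the article.

```python
import math

# Invented split-sample results: each specimen assayed by the
# reference method (A) and the test method (B), in units/L.
a = [318, 325, 312, 330, 321, 316, 327, 319, 324, 320]
b = [325, 331, 322, 336, 330, 327, 333, 326, 330, 329]

n = len(a)
diffs = [x - y for x, y in zip(a, b)]   # individual differences, A - B

mean_a = sum(a) / n                     # mean of the reference method
mean_b = sum(b) / n                     # mean of the test method
mean_d = sum(diffs) / n
delta = abs(mean_d)                     # average difference (delta)
# SD of the individual differences (SD_di), n - 1 degrees of freedom
sd_di = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (n - 1))
se_delta = sd_di / math.sqrt(n)         # SD of the average difference (standard error)
t_value = delta / se_delta              # paired t-value
```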
The next step is to pose the hypothesis that the true average difference is exactly equal to the ES (the so-called alternative hypothesis).2,3 If the observed average difference (Δ) falls between 0 (zero) and the stated probability limit (1 − β) of the population of all possible Δ's distributed about the hypothetical mean, then one can reject the alternative hypothesis and conclude that the observed Δ is less than the ES. Since only one tail of the distribution of the population of Δ's is of interest, a one-tailed probability limit is utilized.3 The above procedure can be expressed mathematically by a simple equation:

X = ES − Z_1·SD_Δ

Where: X = the value below which the observed Δ must fall to cause rejection of the alternative hypothesis; ES = the raw effect size or acceptance limit; SD_Δ = the standard deviation (standard error) of the average difference; Z_1 = the number of SD from the hypothetical mean to the one-tailed probability limit, determined by means of a one-tailed Z table.

Note: When fewer than 30 replicate comparisons are performed, t should be substituted for Z in the above formula and is determined from a t table using the same one-tailed probability limit for the specific degrees of freedom (one less than the number of comparisons).

A similar approach is used for an unpaired t-test with the following modification of the equation:

X = ES − Z_1·SD_(A−B)

Where: SD_(A−B) = the standard deviation of the difference between the means of data set A and data set B, and

SD_(A−B) = √(SD_A² + SD_B²)

Where: SD_A = the SD of the mean of data set A; SD_B = the SD of the mean of data set B.

Example Problem Using the Alternative Hypothesis

We wish to evaluate a new method for determination of a serum component, substance "H." We would adopt this procedure if we could be 95% sure that its results vary by less than 10 units/L on the average from the reference method in the normal range.
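The decision rule above can be sketched as a pair of small helpers. This is a sketch only; the function names and the trial numbers are ours, not the paper's, and 1.65 is the one-tailed 95% point of the standard normal distribution.

```python
import math

Z_ONE_TAILED_95 = 1.65  # one-tailed 95% point of the standard normal

def rejection_limit(es, se_delta, z=Z_ONE_TAILED_95):
    """X = ES - Z1 * SD_delta: the observed average difference must fall
    below X to reject the alternative hypothesis (true delta = ES)."""
    return es - z * se_delta

def se_unpaired(se_a, se_b):
    """Standard error of the difference of two independent means,
    for the unpaired form of the test."""
    return math.sqrt(se_a ** 2 + se_b ** 2)

# Trial values: ES = 10 units/L with a standard error of 3.5 units/L.
x = rejection_limit(10.0, 3.5)   # 10 - 1.65 * 3.5 = 4.225, i.e., about 4.2
```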
We have performed a paired t-test on 30 split serum samples and have collected the following data:

mean value of the 30 determinations by method A = 320 units/L
mean value of the 30 determinations by method B = 328 units/L
mean or average difference between A and B = Δ = 8 units/L
SD of the individual differences = SD_Δi = 19.1 units/L
SD of the average difference (standard error) = SD_Δ = 3.5 units/L

We can see that our average difference (Δ) of 8 units/L falls below our stated limit (ES) of 10 units/L. Is this Δ acceptable, or might the true Δ actually be 10 units/L or greater and, by chance (because of random error), have fallen at the level of 8 units/L? To answer this question, we assume that the real Δ is in fact 10 units/L (the alternative hypothesis) and calculate the value below which we can exclude 95% of the possible values of Δ. If Δ = 10 units/L (assumed), all the possible observed Δ's would be normally distributed about a mean of 10 units/L with an SD of 3.5 units/L (SD_Δ) (see Fig. 1). The point X in Figure 1 defines the point above which 95% of all possible values of Δ lie. This is the one-tailed 95% probability level and is found in a one-tailed Z table to be 1.65 SD from the mean. Therefore, point X lies 1.65 SD below the mean of 10 units/L, or 5.8 units/L below the mean (1.65 × 3.5 = 5.8), or 4.2 units/L from zero (10 − 5.8 = 4.2). Therefore, in order to be 95% sure that our test method is within 10 units/L of the reference method, the observed Δ must be less than 4.2 units/L. Since our observed Δ is 8 units/L (see Fig. 1), we would not adopt the test method. The X value can be found simply by the equation discussed above:

X = ES − Z_1·SD_Δ = 10 − 1.65 × 3.5 ≈ 4.2

FIG. 1. The curve shows the distribution of all possible values of the average difference (Δ) for sample sizes of 30, dispersed about the hypothetical mean of 10 units/L.
X represents the one-tailed 95% confidence limit for this population and lies 1.65 SD from the hypothetical mean. The broken line represents the location of the observed Δ. To be 95% sure an observed Δ is less than the hypothetical mean, it must have a value less than X.

Evaluation of β Error by Selecting Sample Size

As discussed above, a major drawback to the t-test is that it may be too sensitive when comparing methods of high precision and too insensitive when comparing methods of low precision. Since the sensitivity, i.e., power, of the t-test can be increased by increasing the sample size and decreased by lowering the sample size, one can tailor the t-test to be of just the right sensitivity to meet the needs of the specific situation by selecting the appropriate sample size.2 This, in effect, adjusts the dispersion of the Δ's about the null hypothesis and about the alternative hypothesis so that the preselected limits for each population coincide.3 This is represented graphically in Figure 2, which shows the proper dispersion of the populations of Δ's about the null and alternative hypotheses when one wishes an α error of 0.05 and a β error of 0.05. One can easily determine the correct sample size by referring to power tables or sample-size tables based on power.2 To do this, one must first standardize the ES by dividing it by the SD of the individual values of the population to obtain the ES index, or d value (ES index = d = ES/SD_Δi).2 Establishment of the ES index eliminates units of measurement and permits the use of tables with universal application.2 Table 1 gives sample sizes for t-tests when the two-tailed α error is 0.05. It, and similar tables, are based on the unpaired t-test; if one is using the paired t-test, one must multiply the d value by √2 to correct for this2 (d = √2 × ES/SD_Δi).
Once the d value is determined, the sample size is found by looking down the appropriate d column to the row of the desired power, i.e., the level of certainty one wishes, or 1 − β.

Example of Determining Sample Size for the t-Test

Let us take the previous example of evaluating the "test" method for substance "H." We wish to know the sample size needed so that a t-value of less than 1.96 (or approximately 2) would tell us that there was less than a 10 units/L average difference between method A and method B with a 95% level of certainty. We start by calculating our d value:

d = √2 × ES/SD_Δi = √2 × 10/19.1 = 0.74

In Table 1, the sample size for power = .95 and d = .70 is 54, and for d = .80 it is 42. By interpolation, we would arrive at a sample size of 47 or 48 for a d of 0.74. With this information, it is known that if a paired t-test is performed on 47 or 48 replicate samples, a t-value of less than 1.96 (≈2) would give a 95% assurance that the results of method B are within 10 units/L of method A.

FIG. 2. The two curves represent the distribution of the average differences (Δ) about the hypothetical means of both the null hypothesis (Δ = 0) (curve A) and the alternative hypothesis (Δ = ES) (curve B). The sample size (number of comparisons) has been chosen to adjust the dispersion of the populations so that the two-tailed 95% limit for curve A and the one-tailed 95% limit for curve B both fall at point X.

From Figure 2, one can also see that the sample size can be calculated directly, provided n is greater than 30.3 In this case, the difference between the mean of the null hypothesis (0) and that of the alternative hypothesis (10 units/L) spans exactly 3.61 SD_Δ (1.96 + 1.65), and, therefore:

10 units/L = 3.61 × SD_Δ

where: SD_Δ = SD_Δi/√n (formula for the standard error)3 and SD_Δi = 19.1 (given); therefore:

√n = 3.61 × 19.1/10 = 6.9 and n = 6.9² = 47.6

So we arrive at the same sample size which we obtained from Table 1.
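The arithmetic of this example can be checked directly. A minimal sketch using the article's figures (1.96 and 1.65 are the two-tailed and one-tailed 95% points of the standard normal):

```python
import math

ES = 10.0      # raw effect size: maximum acceptable average difference (units/L)
SD_DI = 19.1   # SD of the individual differences (units/L)

# Standardized effect size; sqrt(2) corrects the paired design for use
# with tables built for the unpaired t-test.
d = math.sqrt(2) * ES / SD_DI       # about 0.74

# Direct large-sample formula (n > 30): the null mean (0) and the
# alternative mean (ES) must sit 1.96 + 1.65 = 3.61 standard errors apart.
root_n = (1.96 + 1.65) * SD_DI / ES # sqrt(n), about 6.9
n = root_n ** 2                     # about 47.6; round up to 48 comparisons
```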
In general, clinical laboratory methodology is such that precision is high relative to medical needs. Therefore, t-tests are usually too sensitive rather than too insensitive. This means that the proper sample sizes in methods comparisons should generally be small, i.e., much less than 30, in order to be meaningful. Here we have the rather pleasant situation of having a procedure that is both easier and better at the same time.

Discussion

The t-test and linear regression analysis are the major parametric statistical tools used in methods comparisons. Both have their advantages and disadvantages, and the choice between the two depends on the specific situation and the nature of the generated data.6,7 Linear regression defines both proportional and systematic error, and the total error can be determined over a large range.6 On the other hand, in order for the results to be reliable, one must have data which are evenly distributed over the entire range.6 Also, the random error of the values should be such that: 1) the SD of the reference values is at least seven times as large as the SE of the estimate; or 2) the correlation coefficient is 0.99 or greater.5 The above criteria are frequently not satisfied. The t-test requires a narrow range of data close to the levels of critical importance, especially when there is significant proportional error.6,7 Collection of such data may not be convenient. Westgard and Hunt6 point out four reasons why the t-value, when used alone as an indicator of method acceptability, is frequently unreliable: 1) the t-value may be small even when the difference between methods is unacceptably large; 2) the t-value may be large even though method differences are inconsequential, when the precision of the methods is very high; 3) the t-values obtained from similar data will be different for different sample sizes; and 4) the t-value may be small with a large constant error when this error is offset by the proportional error. The method of sample-size selection presented above eliminates the first three of these problems. The use of the t-test, however, remains problematic in the face of proportional error.

The importance of relating clinical laboratory quality control systems to medical relevance is generally well appreciated.1,4,7 This may be done as an afterthought or by incorporating medical importance directly into one's statistical method. Westgard and colleagues7 have proposed doing this by simply adding the constant error to the random error to obtain the total error (TE), which must be less than or equal to the medically acceptable error (EA) in order for a method to be accepted. Although they concentrate mostly on linear regression, mention is made of calculating constant error (CE) by adding the bias (Δ) and its defined confidence limits (EA ≥ Δ + t·SD_Δ = CE). This is similar to the alternative hypothesis method described above, except that the former gives the CE to compare with EA, whereas the latter gives the maximum bias (Δ) allowable for the test method to be acceptable. The above approaches work well when the test method is found to be acceptable. When the test method is found to be unacceptable, however, they do not distinguish whether the findings were unacceptable because the real difference between methods is unacceptably large, or whether the observed difference is too large because of random error, the sample size being too small for proper evaluation. By selecting the proper sample size as described herein, a reliable comparison can be obtained which takes the ES and the random error of the specific case into consideration. Additionally, when one uses the sample-size method to compare methods with very small random errors, the t-test may be run with much smaller sample sizes, making it easier to obtain sufficient numbers of samples at the critical decision-making level.

Table 1. Sample Size Table* (two-tailed α = .05)

             d value
Power   .10   .20   .30   .40   .50   .60   .70   .80  1.00  1.20  1.40
.25     332    84    38    22    14    10     8     6     5     4     3
.50     769   193    86    49    32    22    17    13     9     7     5
.60     981   246   110    62    40    28    21    16    11     8     6
.70    1235   310   138    78    50    35    26    20    13    10     7
.75    1389   348   155    88    57    40    29    23    15    11     8
.80    1571   393   175    99    64    45    33    26    17    12     9
.85    1797   450   201   113    73    51    38    29    19    14    10
.90    2102   526   234   132    85    59    44    34    22    16    12
.95    2606   651   290   163   105    73    54    42    27    19    14
.99    3675   920   409   231   148   103    76    58    38    27    20

* Reprinted by permission of the author and publisher: Jacob Cohen, Statistical Power Analysis for the Behavioral Sciences, second edition. New York, Academic Press, Inc., 1977, p. 55.

References

1. Barnett RN: Clinical Laboratory Statistics. Boston, Little, Brown & Co., 1979, pp 133-145
2. Cohen J: Statistical Power Analysis for the Behavioral Sciences, revised edition. New York, Academic Press, 1977, pp 1-74
3. Colton T: Statistics in Medicine. Boston, Little, Brown & Co., 1974, pp 99-150
4. Copeland BE: Quality Control in Clinical Chemistry, revised. American Society of Clinical Pathologists, 1973, pp 7-49
5. Wakkers PJM, Hellendoorn HBA, Op De Weegh GJ, et al: Applications of statistics in clinical chemistry. A critical evaluation of regression lines. Clin Chim Acta 64:173-185, 1975
6. Westgard JO, Hunt MR: Use and interpretation of common statistical tests in method-comparison studies. Clin Chem 19:49-57, 1973
7. Westgard JO, Carey RN, Wold S: Criteria for judging precision and accuracy in method development evaluation. Clin Chem 20:825-833, 1974