The T-test and Clinical Relevance. Is Your p Error Showing?

CHARLES F. ARKIN, M.D.
Arkin, Charles F.: The t-test and clinical relevance. Is your β error showing? Am J Clin Pathol 76: 416-420, 1981. The t-test, as usually performed in the clinical laboratory for methods comparisons, does not consider the clinical relevance of the difference between the methods. Two procedures are presented which incorporate medical criteria into the t-test, thereby making it a more relevant statistical tool. This is done by defining β error in addition to α error. Since a positive action, i.e., adoption of the "test" method, is taken when there is no difference between the "test" and reference methods, commission of a β error has a potentially damaging effect. Therefore, evaluation of β error is an appropriate part of methods comparisons in the clinical laboratory. (Key words: T-test; Methods comparison; Quality control; Statistics.)
Although the use of the paired t-test has attained widespread acceptance as a statistical tool in performing methods comparisons, i.e., comparing a "test" method with the reference method, it has some distinct disadvantages.1,4,6 Principally, the t-test does not consider whether differences between methods are medically or physiologically important but relates only to statistical significance. When comparing methods having very high precision, the t-test may indicate real or statistically significant differences that may, in fact, be totally unimportant medically.4,6 Similarly, by greatly increasing the number of comparisons, extremely minor variations in methods can cause a highly significant t-value.6 On the other hand, the t-test may be insensitive to medically important differences when comparing methods of poor precision or small sample sizes.6 In order for the t-test to have value in differentiating methods on the basis of medical criteria, the precisions of the methods, the number of comparisons studied, and the size of the minimal difference which would have a medically important effect must all be taken into consideration.
Another related problem with the t-test, as generally used in the clinical laboratory, is that limits are set on α (type I) error as opposed to β (type II) error. When one is comparing a test method to a reference method, he will adopt the test method if it is no different than the reference method.1 If one commits an α error, he would mistakenly conclude that the methods compared are "different" when their results are really the "same." Therefore, he would not put a "good" test into use, but no damage would be done. On the other hand, commission of a β error would result in mistakenly concluding that both methods are the "same" when they are really "different." Such an error could be damaging, since it would result in establishing a new method which is undesirable. Therefore, knowledge of one's β error is a necessity in methods comparisons in the clinical laboratory.

Received December 4, 1980; received revised manuscript and accepted for publication April 13, 1981.

Address reprint requests to Dr. Arkin: Division of Clinical Pathology, Department of Pathology, New England Deaconess Hospital, 185 Pilgrim Road, Boston, Massachusetts 02215.

Division of Clinical Pathology, Department of Pathology, New England Deaconess Hospital, Boston, Massachusetts
Fortunately, all of the drawbacks to the t-test discussed above have a common solution, since setting up β error limits requires taking medical importance into consideration.3 Two practical approaches to the evaluation of β error for the t-test will be considered here, one based on proposal of an alternative hypothesis (as opposed to the null hypothesis), the other based on selecting the proper sample size.2,3
Evaluation of β Error by Selecting an Alternative Hypothesis
β error is mistakenly concluding that the results of two methods are the same when they are really different. In this context, we define different as being greater than the maximum difference between methods that we are willing to accept. If, on the other hand, the difference between the test and the reference method is less than this maximum difference, then the two methods can be considered the same. By properly selecting this maximum difference, we are, in effect, establishing acceptance criteria for the adoption of a "test" method on the basis of medical importance. The acceptance criterion, or maximum acceptable difference, is referred to in general terms as the raw effect size (ES),2 since it is the size of the minimal difference which would exert an effect.
Once an ES is chosen, the probability limits or the level of certainty can be stated. For example, one may wish to be 95% certain that the difference between methods is less than the ES. This is the same as setting a β error of 5%, since there would be a 5% chance of calling results the same when they are different. The probability level (95% in this example) is commonly termed the power and is defined as 1 − β.3
Having established the ES and the desired power, one can proceed to perform the paired t-test on the test results of several replicate samples run by both methods to obtain the following data:

— the mean of the reference method (A)
— the mean of the test method (B)
— the average difference (A − B, or Δ)
— the standard deviation (SD) of the individual differences (SD_Δi)
— the SD of the average difference (SD_Δ), i.e., the standard error of the mean
— the paired t-value (Δ/SD_Δ).
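As a sketch, the statistics listed above can be computed from two sets of paired results with the Python standard library; the data values here are hypothetical, chosen only to illustrate the calculation.

```python
# Computing the paired-comparison statistics from two hypothetical data sets.
from statistics import mean, stdev
from math import sqrt

ref = [318.0, 322.0, 305.0, 331.0, 324.0]   # method A (reference), hypothetical
test = [325.0, 330.0, 315.0, 336.0, 333.0]  # method B (test), hypothetical

diffs = [a - b for a, b in zip(ref, test)]
mean_A = mean(ref)                    # mean of the reference method (A)
mean_B = mean(test)                   # mean of the test method (B)
delta = mean(diffs)                   # average difference (A - B, or delta)
sd_di = stdev(diffs)                  # SD of the individual differences (SD_di)
sd_delta = sd_di / sqrt(len(diffs))   # SD of the average difference (standard error)
t_value = delta / sd_delta            # paired t-value
```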
The next step is to pose the hypothesis that the true average difference is exactly equal to the ES (the so-called alternative hypothesis).2,3 If the observed average difference (Δ) falls between 0 (zero) and the stated probability limit (1 − β) of the population of all possible Δ's distributed about the hypothetical mean, then one can reject the alternative hypothesis and conclude that the observed Δ is less than the ES. Since only one tail of the distribution of the population of Δ's is of interest, a one-tailed probability limit is utilized.3
The above procedure can be expressed mathematically by a simple equation: X = ES − Z·SD_Δ.

Where:
X = the value below which the observed Δ must fall to cause rejection of the alternative hypothesis;
ES = the raw effect size or acceptance limit;
SD_Δ = the standard deviation (standard error) of the average difference;
Z = the number of SD from the hypothetical mean to the one-tailed probability limit, which is determined by means of a one-tailed Z table.

Note: When fewer than 30 replicate comparisons are performed, t should be substituted for Z in the above formula and is determined from a t table using the same one-tailed probability limit for the specific degrees of freedom (one less than the number of comparisons).
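The equation above can be sketched in code; NormalDist from the Python standard library supplies the one-tailed Z value, and the function name and arguments are our own illustration, not part of the original procedure.

```python
# Acceptance cutoff X = ES - Z * SD_delta for the alternative-hypothesis
# test: the observed average difference must fall below X to conclude,
# with the stated power, that the true difference is less than ES.
from statistics import NormalDist

def acceptance_cutoff(es, sd_delta, power=0.95):
    """Value below which the observed average difference must fall."""
    z = NormalDist().inv_cdf(power)   # one-tailed Z limit, ~1.645 for 95%
    return es - z * sd_delta

# Hypothetical numbers: ES = 10 units/L, standard error 3.5 units/L
x = acceptance_cutoff(10.0, 3.5)      # about 4.2 units/L
```

With fewer than 30 comparisons, the Z quantile would be replaced by the corresponding one-tailed t quantile, as noted in the text.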
A similar approach is used for an unpaired t-test with the following modification of the equation: X = ES − Z·SD_(A−B).

Where:
SD_(A−B) = the standard deviation of the difference between the means of data set A and data set B, and
SD_(A−B) = √(SD_A² + SD_B²),3

and where:
SD_A = the SD of the mean of data set A;
SD_B = the SD of the mean of data set B.
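The unpaired variant differs only in how the standard error is formed: the standard errors of the two means combine in quadrature. A brief sketch, with hypothetical values:

```python
# Unpaired version of the cutoff: SD_(A-B) = sqrt(SD_A^2 + SD_B^2),
# where SD_A and SD_B are the standard errors of the two method means.
from math import sqrt
from statistics import NormalDist

def unpaired_cutoff(es, se_a, se_b, power=0.95):
    sd_diff = sqrt(se_a**2 + se_b**2)   # SE of (mean A - mean B)
    z = NormalDist().inv_cdf(power)     # one-tailed Z limit
    return es - z * sd_diff

x = unpaired_cutoff(10.0, 3.0, 4.0)     # hypothetical SEs of 3 and 4 units/L
```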
Example Problem Using the Alternative Hypothesis
We wish to evaluate a new method for determination of a serum component, substance "H." We would adopt this procedure if we could be 95% sure that its results vary by less than 10 units/L on the average from the reference method in the normal range. We have performed a paired t-test on 30 split serum samples and have collected the following data:

mean value of the 30 determinations by method A = A = 320 units/L
mean value of the 30 determinations by method B = B = 328 units/L
mean or average difference between A and B = Δ = 8 units/L
SD of the individual differences = SD_Δi = 19.1 units/L
SD of the average difference (standard error) = SD_Δ = 3.5 units/L.
We can see that our average difference (Δ) of 8 units/L falls below our stated limit (ES) of 10 units/L. Is this Δ acceptable, or might the true Δ actually be 10 units/L or greater and, by chance (because of random error), have fallen at the level of 8 units/L? To answer this question, we assume that the real Δ is in fact 10 units/L (alternative hypothesis) and calculate the value below which we can exclude 95% of the possible values of Δ. If Δ = 10 units/L (assumed), all the possible observed Δ's would be normally distributed about a mean of 10 units/L having an SD of 3.5 units/L (SD_Δ) (see Fig. 1). The point X in Figure 1 defines the point above which 95% of all possible values of Δ lie. This is the one-tailed 95% probability level, and is found in a one-tailed Z table to be 1.65 SD from the mean. Therefore, point X lies 1.65 SD below the mean of 10 units/L, or 5.8 units/L below the mean (1.65 × 3.5 = 5.8), or 4.2 units/L from zero (10 − 5.8 = 4.2). Therefore, in order to be 95% sure that our test method is within 10 units/L of the reference method, the observed Δ must be less than 4.2 units/L. Since our observed Δ is 8 units/L (see Fig. 1), we would not adopt the test method. The X value can be found simply by the equation discussed above:

X = ES − Z·SD_Δ = 10 − (1.65 × 3.5) ≈ 4.2.
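The worked example can be reproduced in a few lines of code; the numbers come straight from the text above.

```python
# Re-running the worked example: ES = 10 units/L, standard error of the
# average difference 3.5 units/L, observed average difference 8 units/L.
from statistics import NormalDist

es, sd_delta, observed = 10.0, 3.5, 8.0
z = NormalDist().inv_cdf(0.95)    # one-tailed 95% limit, ~1.645
x = es - z * sd_delta             # cutoff, ~4.2 units/L
adopt = observed < x              # False: 8 > 4.2, so do not adopt
```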
FIG. 1. The curve shows the distribution of all possible values of the average difference (Δ) for sample sizes of 30 dispersed about the hypothetical mean of 10 units/L. X represents the one-tailed 95% confidence limit for this population and lies 1.65 SD from the hypothetical mean. The broken line represents the location of the observed Δ. To be 95% sure an observed Δ is less than the hypothetical mean, it must have a value less than X.
Evaluation of β Error by Selecting Sample Size

As discussed above, a major drawback to the t-test is that it may be too sensitive when comparing methods of high precision, and too insensitive when comparing methods of low precision. Since the sensitivity, i.e., power, of the t-test can be increased by increasing the sample size and decreased by lowering the sample size, one can tailor the t-test to be of just the right sensitivity to meet the needs of the specific situation by selecting the appropriate sample size.2 This, in effect, adjusts the dispersion of the Δ's about the null hypothesis and about the alternative hypothesis so that the preselected limits for each population coincide.3 This is represented graphically in Fig. 2, which shows the proper dispersion of populations of Δ's about the null and alternative hypotheses when one wishes an α₂ error of 0.05 and a β₁ error of 0.05.
One can easily determine the correct sample size by referring to power tables or sample size tables based on power.2 To do this, one must first standardize the ES by dividing it by the SD of the individual values of the population to obtain the ES index, or d value (ES index = d = ES/SD_Δi).2 Establishment of the ES index eliminates units of measurement and permits the use of tables with universal application.2 Table 1 gives sample sizes for t-tests when α₂ error is 0.05. It, and similar tables, are based on the unpaired t-test, and if one is using the paired t-test, he must multiply the d value by √2 to correct for this2 (d = √2 × ES/SD_Δi). Once the d value is determined, the sample size is found by looking down the appropriate d column to the row of the desired power, i.e., the level of certainty one wishes, or 1 − β.
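Where a power table is not at hand, the lookup can be approximated in code with the usual large-sample formula n = 2((z_α + z_power)/d)² per group; this normal approximation is our substitution for the table, not part of the original procedure, and it slightly understates the small-sample (t-based) values.

```python
# Normal-approximation substitute for the power-table lookup. As in the
# text, a paired design enters the (unpaired) table with d multiplied
# by sqrt(2). Values reproduce the worked example (ES = 10, SD = 19.1).
from math import sqrt, ceil
from statistics import NormalDist

def sample_size(es, sd_individual, alpha=0.05, power=0.95, paired=True):
    d = es / sd_individual                     # ES index (Cohen's d)
    if paired:
        d *= sqrt(2)                           # correction for the paired t-test
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed alpha, ~1.96
    z_b = NormalDist().inv_cdf(power)          # one-tailed beta, ~1.645
    return ceil(2 * ((z_a + z_b) / d) ** 2)

n = sample_size(10.0, 19.1)   # ~48, close to the interpolated table value of 47-48
```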
Example of Determining Sample Size for T-test

Let us take the previous example of evaluating the "test" method for substance "H." We wish to know the sample size needed so that a t-value of less than 1.96 (or approximately 2) would tell us that there was less than a 10 units/L average difference between method A and method B with a 95% level of certainty.

We start by calculating our d value:

d = √2 × ES/SD_Δi = (10 × √2)/19.1 = 0.74.

In Table 1, the sample size for power = .95, d = .70 is 54, and for d = .80 it is 42. By interpolation, we would arrive at a sample size of 47 or 48 for a d of 0.74.
With this information, it is known that if a paired t-test is performed on 47 or 48 replicate samples, a t-value of less than 1.96 (approximately 2) would give 95% assurance that the results of method B are within 10 units/L of method A.

FIG. 2. The two curves represent the distribution of the average differences (Δ) about the hypothetical means of both the null hypothesis (Δ = 0) (curve A) and the alternative hypothesis (Δ = ES) (curve B). The sample size (number of comparisons) has been chosen to adjust the dispersion of the populations so that the two-tailed 95% limit for curve A and the one-tailed 95% limit for curve B both fall at point X.
From Fig. 2, one can also see that the sample size can be calculated directly, provided n is greater than 30.3 In this case, the difference between the mean of the null hypothesis (0) and that of the alternative hypothesis (10 units/L) spans exactly 3.61 SD_Δ (1.96 + 1.65), and, therefore:

10 units/L = 3.61 SD_Δ

where:

SD_Δ = SD_Δi/√n (formula for standard error)3

and:

SD_Δi = 19.1 (given)

and, therefore:

10 = (3.61 × 19.1)/√n

and:

n = ((3.61 × 19.1)/10)² = 6.9² = 47.6.

So we arrive at the same sample size which we obtained from Table 1.
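The direct calculation above translates to one line of code: the distance between the null mean (0) and the alternative mean (ES) must span the sum of the two-tailed and one-tailed Z limits, measured in standard errors.

```python
# Direct large-sample calculation: n = ((z_two-tailed + z_one-tailed)
# * SD_individual / ES)^2. Numbers follow the text (SD = 19.1, ES = 10).
from statistics import NormalDist

sd_i, es = 19.1, 10.0
z_a = NormalDist().inv_cdf(0.975)    # ~1.96, two-tailed 95% (null hypothesis)
z_b = NormalDist().inv_cdf(0.95)     # ~1.645, one-tailed 95% (alternative)
n = ((z_a + z_b) * sd_i / es) ** 2   # ~47.4, matching the table-derived 47-48
```

The text's 47.6 uses the rounded values 1.96 and 1.65; unrounded Z quantiles give about 47.4, the same sample size for practical purposes.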
In general, clinical laboratory methodology is such that precision is high relative to medical needs. Therefore, t-tests are usually too sensitive rather than too insensitive. This means that the proper sample sizes in methods comparisons should generally be small, i.e., much less than 30, in order to be meaningful. Here we have the rather pleasant situation of a procedure that is both easier and better at the same time.
Discussion
The t-test and linear regression analysis are the major parametric statistical tools used in methods comparisons. Both have their advantages and disadvantages, and the choice between the two depends on the specific situation and the nature of the generated data.6,7 Linear regression defines both proportional and systematic error, and the total error can be determined over a large range.6 On the other hand, in order for the results to be reliable, one must have data which are evenly distributed over the entire range.6 Also, the random error of the values should be such that: 1) the SD of the reference values is at least seven times as large as the SE of the estimate; or 2) the correlation coefficient is 0.99 or greater.5 The above criteria are frequently not satisfied.
The t-test requires a narrow range of data close to the levels of critical importance, especially when there is significant proportional error.6,7 Collection of such data may not be convenient. Westgard and Hunt6 point out four reasons why the t-value, when used alone as an indicator of method acceptability, is frequently unreliable: 1) the t-value may be small even when the difference between methods is unacceptably large; 2) the t-value may be large even though method differences are inconsequential when the precision of the methods is very high; 3) the t-values obtained from similar data will be different for different sample sizes; and 4) the t-value may be small with a large constant error when this error is offset by the proportional error. The method of sample size selection presented above eliminates the first three of these problems. The use of the t-test, however, remains problematic in the face of proportional error.

Table 1. Sample Size Table*

α₂ = .05

                                d Value
Power    .10   .20   .30   .40   .50   .60   .70   .80  1.00  1.20  1.40
.25      332    84    38    22    14    10     8     6     5     4     3
.50      769   193    86    49    32    22    17    13     9     7     5
.60      981   246   110    62    40    28    21    16    11     8     6
.70     1235   310   138    78    50    35    26    20    13    10     7
.75     1389   348   155    88    57    40    29    23    15    11     8
.80     1571   393   175    99    64    45    33    26    17    12     9
.85     1797   450   201   113    73    51    38    29    19    14    10
.90     2102   526   234   132    85    59    44    34    22    16    12
.95     2606   651   290   163   105    73    54    42    27    19    14
.99     3675   920   409   231   148   103    76    58    38    27    20

* Reprinted by permission of the author and publisher: Jacob Cohen, Statistical power analysis for the behavioral sciences (second edition), New York, Academic Press, Inc., 1977, p. 55.

The importance of relating clinical laboratory quality control systems to medical relevance is generally well appreciated.1,4,7 This may be done as an afterthought or by incorporating medical importance directly into one's statistical method. Westgard and colleagues7 have proposed doing this by simply adding the constant error to the random error to obtain the total error (TE), which must be less than or equal to the medically acceptable error (EA) in order for a method to be accepted. Although they concentrate mostly on linear regression, mention is made of calculating constant error (CE) by adding the bias (Δ) and its defined confidence limits (EA ≥ Δ + t·SD_Δ = CE). This is similar to the alternative hypothesis method described above, except that the former gives the CE to compare with EA, whereas the latter gives the maximum bias (Δ) allowable for the test method to be acceptable. The above approaches work well when the test method is found to be acceptable. When the test method is found to be unacceptable, however, they do not distinguish whether the findings were unacceptable because the real difference between methods is unacceptably large, or whether the observed difference is too large due to random error because the sample size is too small for proper evaluation. By selecting the proper sample size as described herein, a reliable comparison can be obtained which takes the ES and random error for the specific case into consideration. Additionally, when one uses the sample size method to compare methods with very small random errors, the t-test may be run with much smaller sample sizes, making it easier to obtain sufficient numbers of samples at the critical decision-making level.
References

1. Barnett RN: Clinical laboratory statistics. Boston, Little, Brown & Co., 1979, pp 133-145
2. Cohen J: Statistical power analysis for the behavioral sciences, Revised edition. New York, Academic Press, 1977, pp 1-74
3. Colton T: Statistics in medicine. Boston, Little, Brown & Co., 1974, pp 99-150
4. Copeland BE: Quality control in clinical chemistry, Revised. American Society of Clinical Pathologists, 1973, pp 7-49
5. Wakkers PJM, Hellendoorn HBA, Op De Weegh GJ, et al: Applications of statistics in clinical chemistry. A critical evaluation of regression lines. Clin Chim Acta 64:173-185, 1975
6. Westgard JO, Hunt MR: Use and interpretation of common statistical tests in method comparison studies. Clin Chem 19:49-57, 1973
7. Westgard JO, Carey RN, Wold S: Criteria for judging precision and accuracy in method development evaluation. Clin Chem 20:825-833, 1974