PSYCHOMETRIKA -- VOL. 7, NO. 2, JUNE 1942

THE RELIABILITY COEFFICIENT

TRUMAN L. KELLEY
HARVARD UNIVERSITY

The reliability coefficient is unlike other measures of correlation in that it is a quantitative statement of an act of judgment -- usually the test maker's -- that the things correlated are similar measures. Attempts to divorce it from this act of judgment are misdirected, just as would be an attempt to eliminate judgment of sameness of function of items when a test is originally drawn up. A "coefficient of coherence," entirely devoid of judgment, measuring the singleness of test function, is proposed as an essential datum with reference to a test, but not as a substitute for the similar-form reliability coefficient.

The student of statistics and psychological measurement is aware that a reliability coefficient is a correlation coefficient having certain special properties and a certain special meaning. Mathematically the reliability coefficient $r_{11}$ of scores X is such that $0 \le r_{11} \le 1$. The maximum non-chance correlation that the X measures can have with any other conceivable set is $\sqrt{r_{11}}$, not 1. Knowing $r_{11}$ for a group consisting of a narrow range of talent in which the standard deviation is $\sigma$, and knowing that the test is equally excellent throughout a wider range wherein the standard deviation is $\Sigma$, the reliability coefficients for these two ranges are connected by the equation

$$\sigma \sqrt{1 - r_{11}} = \Sigma \sqrt{1 - R_{11}} .$$

These are three important properties possessed by the reliability coefficient but not by the correlation coefficient in general.
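The relation permits stepping a reliability up from the narrow to the wider range. A minimal numeric sketch in Python, with all values illustrative (none drawn from the paper):

```python
import math

# Illustrative values (assumed for the sketch; not from the paper):
sigma = 4.0    # standard deviation in the narrow range of talent
Sigma = 6.0    # standard deviation in the wider range
r11 = 0.80     # reliability observed in the narrow range

# Kelley's relation: sigma * sqrt(1 - r11) = Sigma * sqrt(1 - R11).
# Solving for the wide-range reliability R11:
R11 = 1 - (sigma / Sigma) ** 2 * (1 - r11)
print(f"R11 = {R11:.3f}")  # 0.911

# Check: the quantity sigma * sqrt(1 - r) is the same in both ranges.
assert math.isclose(sigma * math.sqrt(1 - r11), Sigma * math.sqrt(1 - R11))
```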
Let us examine the antecedent logic which has led to these and other important special properties. If we have a score on a single unique item, any correlation, between the limits of -1 and 1, of it with some other measure is conceivable, but the concept of reliability does not attach to it. If we can conceive of a paired item that measures the same function, then the concept of uniqueness does not exist. Thus, unlike the correlation coefficient, which is merely an observed fact, the reliability coefficient has embodied in it a belief or point of view of the investigator.

Consider the score resulting from the item, "Prove the Pythagorean theorem." One teacher asserts that this is a unique demand and that there is no other theorem in geometry that can be paired with it as a similar measure. It cannot be paired with itself if there is any memory, conscious or subconscious, of the first attempt at proof at the time the second attempt is made, for then the mental processes are clearly different in the two cases. The writer suggests that anyone doubting this general principle take, say, a contemporary-affairs test and then retake it a day later. He will undoubtedly note that he works much faster and that the depth and breadth of his thinking is much less -- he simply is not doing the same sort of thing as before. The teacher who considers the proof of the Pythagorean theorem to be a unique activity is entitled to his view. It is a sound view for many purposes, but so is that, for other purposes, of the student who considers it as evidence of a more general ability. To this latter the score possesses a certain reliability and is more or less indicative of the general function that he is interested in.

The writer has long noted that statisticians who approach their subject through pure mathematics give little or no concern to reliability coefficients. Is not the reason simply that they are interested in facts and relationships and not in the attitude that an investigator has toward a certain measure?

We conclude that a belief that two or more measures of a mental function exist is prerequisite to the concept of reliability, and further, not only that they exist but that they are available, before a measure of reliability is possible. We posit the question: what function of the two sets of measures $X_1$ and $X_2$, gotten by twice measuring the same individuals, and conceived of as tapping the same fundamental ability, is the best measure of reliability? Further, either $X_1$ and $X_2$ must be judged a priori to be equally trustworthy measures of this ability, or the one must be judged some number of times as excellent as the other, as, e.g., a 90-item test might be judged to be nine times as excellent as a ten-item test, the other considerations about the items being equal. This act of a priori judgment is inherent and, though it can be avoided so far as the combination of items is concerned by fractionizing the measure, this only changes the size of the element upon which the judgment is made. This element can never be made smaller than the single test item, and it presumably should ordinarily not be made as small as this, for the judgment that item 1 measures the same ability as item 2 would seem to be less within the capacity of the human mind than the judgment that, say, the ability measured by a first set of 20 items, chosen according to certain principles and rules, is the same as that measured by a second set chosen by the same principles and rules.

In connection with the following mathematical development, the $X_1$ and $X_2$ measures are judged to be equally excellent measures. The student can readily modify this treatment to cover the case where the one is judged some number of times as excellent as the other.

Let the $X_1$ measures for the N individuals be $X_a, X_b, \cdots, X_n$ and the paired $X_2$ measures be $X_A, X_B, \cdots, X_N$. If $X_a, X_b, \cdots, X_n$ are entitled to any credibility it is because the differences shown between them, $(X_a - X_b), (X_a - X_c), \cdots, (X_i - X_j), \cdots$, are credible. This seems to the writer the most primitive or fundamental concept of trustworthiness. Let $d_{ab} = X_a - X_b$, etc. Of the $(N^2 - N)$ differences we cannot ask how many are believable and how many are not, because the issue is quantitative, but we can ask what proportion of the variance of these differences, $V_d$, is trustworthy, or predictable from a knowledge of the true differences. Let us call a difference predicted from a true difference $\check{d}$. If $d_\infty$ is the true difference, we have

$$\check{d} = r_{d d_\infty} \frac{\sigma_d}{\sigma_{d_\infty}} d_\infty ,$$

and $V_{\check{d}} / V_d$ would yield this fundamental proportion. We do not have true difference measures available, but we do have equally excellent difference measures D ($D_{AB} = X_A - X_B$, etc.), and can actually compute

$$\hat{d} = r_{dD} \frac{\sigma_d}{\sigma_D} D .$$
This will immediately yield $V_{\hat{d}} / V_d$, and by means of certain very plausible assumptions that further sets of X measures are conceivable, we can obtain an estimate of $V_{\check{d}} / V_d$, as will be illustrated.

Let us compute $r_{dD}$. We first note that

$$d_{ij} = X_i - X_j = (X_i - M_1) - (X_j - M_1) = x_i - x_j ,$$

the X's being raw scores and the x's deviations from the mean scores. Accordingly, any function of $d_{ij}$ and $D_{ij}$ is independent of differences between $M_1$ and $M_2$. There is no requirement in the fundamental measure that we seek that $M_1 = M_2$. We have

$$r_{d_{ij} D_{ij}} = \frac{S\, d_{ij} D_{ij}}{(N^2 - N)\, \sigma_{d_{ij}} \sigma_{D_{ij}}} ,$$

in which S is looked upon as a summation of $N^2 - N$ terms when $i \neq j$, but it may be a summation of $N^2$ terms if we do not impose the restriction, for the inclusion of the N null terms ($d_{ii} = 0$) will not affect the sum. For the variance of the d's we have, the null terms being included,

$$(N^2 - N) V_d = S\, d_{ij}^2 = \sum_{i=1}^{N} \left[ \sum_{j=1}^{N} d_{ij}^2 \right] = \sum_{i=1}^{N} \left[ (x_i^2 + x_a^2 - 2 x_i x_a) + (x_i^2 + x_b^2 - 2 x_i x_b) + \cdots \right] = \sum_{i=1}^{N} \left[ N x_i^2 + N V_1 \right] = 2 N^2 V_1 ,$$

in which $V_1$ is the variance of the $x_1$ measures. By very similar steps we obtain for the covariance

$$(N^2 - N)\,\mathrm{Covariance} = S\, d_{ij} D_{ij} = 2 N^2 \sigma_1 \sigma_2 r_{12} ,$$

so that finally

$$r_{d_{ij} D_{ij}} = r_{12} .$$

We thus see that the usual split-half, or similar-form, reliability coefficient is a precise measure of the extent to which differences in the $X_1$ scores are predictable by a measure of this same degree of excellence, for $X_2$ is, according to judgment, such a measure. The issue of "correlation between errors" has not been involved. Whether there is or is not such a correlation does not alter the fact that the reliability coefficient, $r_{12}$, is the correlation between d and D.

Let us now assume that further measures of the excellence of $X_1$ could be constructed, given, and averaged, so that the $X_1$ could be paired with $X_\infty$, true scores. Then we find $r_{d d_\infty} = \sqrt{r_{12}}$, which the writer has called an "index of reliability,"* and some of the properties of which he has elsewhere noted.† We then obtain $V_{\check{d}} / V_d = r_{12}$, informing us that the ordinary (split-half or similar-form) reliability coefficient is a precise statement of the proportion of the variance of the observed differences in the $X_1$ scores that is real -- that is, attributable to real differences in the ability measured.

We must not forget that an act of judgment (that the $X_1$ and $X_2$ measures are equally excellent measures of the same function) has been demanded. This act is of the same sort as that of the test maker in putting together two or more exercises into a single test, in doing which he asserts that item two is a measure of the same function as item one, etc. We may or may not trust his judgment in this respect, and we may or may not trust his judgment in splitting a test into halves, but surely we have no warrant for trusting him to do the former but not the latter. In fact, it should be a much less severe tax upon judgment to split a test with many items into comparable halves than to draw up the items in the first instance so as to measure the same function.

* A simplified method of using scaled data for purposes of testing, School and Society, July 1 and 8, 1916, 4, nos. 79-80.
† The reliability of test scores, J. Educ. Res., May 1921. Also, Note on the reliability of a test, J. Educ. Psych., Apr. 1924, 15, no. 4.
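The identity $r_{dD} = r_{12}$ just derived is easy to confirm numerically. A minimal sketch, with simulated scores on two equally excellent forms (the data, sample size, and error model are assumptions for illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200  # simulated examinees (illustrative only)

# Two equally excellent forms: a common true score plus independent errors.
true = rng.normal(50, 10, N)
X1 = true + rng.normal(0, 6, N)
X2 = true + rng.normal(0, 6, N)

# All N^2 signed pairwise differences (the null terms d_ii = 0 included;
# as the derivation notes, they do not affect the sums).
d = X1[:, None] - X1[None, :]   # d_ij = X1_i - X1_j
D = X2[:, None] - X2[None, :]   # D_ij = X2_i - X2_j

r12 = np.corrcoef(X1, X2)[0, 1]
r_dD = np.corrcoef(d.ravel(), D.ravel())[0, 1]

print(f"r12  = {r12:.4f}")
print(f"r_dD = {r_dD:.4f}")  # equals r12, as the derivation shows
```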
The split-test method has also been criticized because of the assumptions involved in the Spearman-Brown step-up formula,

$$r_{tt} = \frac{2 r_{12}}{1 + r_{12}} ,$$

which are that the two halves of the test are equally reliable and equally variable measures of the same thing. Small differences in the reliability and variability of the halves would seem to be nicely taken care of by the following formula, due to Dr. John Flanagan, in which the subscripts 1 and 2 refer to the halves of the test:

$$r_{tt} = \text{reliability of entire test} = \frac{4 \sigma_1 \sigma_2 r_{12}}{V_1 + V_2 + 2 \sigma_1 \sigma_2 r_{12}} .$$

However, the difference between this formula and the usual one is trifling under usual conditions.
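The two formulas can be compared directly. A minimal sketch, with assumed half-test statistics (the numbers are illustrative only):

```python
# Spearman-Brown step-up versus Flanagan's formula.
def spearman_brown(r12: float) -> float:
    """Whole-test reliability, assuming equally reliable and equally variable halves."""
    return 2 * r12 / (1 + r12)

def flanagan(sd1: float, sd2: float, r12: float) -> float:
    """Flanagan's formula, allowing the halves to differ in variability."""
    cov = sd1 * sd2 * r12
    return 4 * cov / (sd1**2 + sd2**2 + 2 * cov)

# Halves with slightly unequal standard deviations (assumed values):
sd1, sd2, r12 = 4.0, 4.4, 0.72
print(f"Spearman-Brown: {spearman_brown(r12):.4f}")   # 0.8372
print(f"Flanagan:       {flanagan(sd1, sd2, r12):.4f}")  # 0.8350, a trifle lower
```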
The split-test method of computing a half-test reliability has been called indeterminate because there are many other ways of splitting than the usual way of odds vs. evens. A determinate answer would result if the mean for all possible ways were gotten, but, even neglecting the labor involved, this would seem to be objectionable, for many of these splittings would be such as to contravene the judgment of comparability. In splitting we should not seek a mathematical outcome, but a judgment outcome, and for the same logical reasons as warrant a judgment product in putting together the items of the test in the first instance. The rule for splitting can well be the same as for drawing up comparable forms: so do it that the range and nature of the functions tapped are as nearly the same in the two instances as judgment permits. In this rule the plural "functions" occurs, for it is not assumed that any test maker can write, or believe himself capable of writing, items that measure one function only, though of course his endeavor should be to do so. The writer judges that the more precise Kuder-Richardson procedures, later discussed, do well cover the case where a single function is measured by the items, but this situation seems to him remote from practical situations. These observations argue for the putting of judgment into the splitting into halves, or the building of comparable forms, involved in the computation of reliability, not for procedures so mechanized that judgment is taken out.

The writer believes it altogether desirable that the term "reliability coefficient" be restricted to the correlation between similar measures. This not only is the meaning originally given the term by its deviser, C. Spearman, but is the necessary meaning if it is to be a precise measure of the reality of the differences shown by the measures. Let us compare this measure with the retest correlation and with the Kuder-Richardson measures.

In the case of the retest, if at the time of the second test there is any memory, conscious or subconscious, of the earlier responses, then certainly the mental operations being performed at the second taking are not the same as, or even similar in kind to, those performed at the first taking. Surely, if the time interval between takings is short enough, we can expect the differences between the scores of subjects upon the first test to be exactly predicted by the differences between the retake test scores. The numerical value of the retest coefficient of correlation will decrease as the time between testings is increased. It thus is a function of this time; but whatever this time interval and whatever the value of the retest correlation, there seems no logical reason for taking it as a measure of the fundamentally important ratio $V_{\check{d}} / V_d$.

Kuder and Richardson* give a number of formulas, from complex to simple, for the computation of the reliability coefficient, all consequent to a certain "operational definition of equivalence." They observe that their definition is "more rigid than the one usually stated." It is certainly more restrictive than the one here used, and the writer judges it more restrictive than need be. To judge of $V_{\check{d}} / V_d$ there seems no necessity that the items in the paired forms be matched for difficulty, that the aggregate difficulties be matched, or even that the items separately be matched for excellence, but only that the aggregates be so matched. In their more precise formulas an $r_{ii}$, the item reliability, enters; but this is not an observed datum, being definable and determinable only with the aid of certain assumptions, in particular the questionable one that "the matrix of inter-item correlations has a rank of one."

Their simplest formula [21] is

$$r_{tt} = \frac{n}{n - 1} \left( 1 - \frac{n p q}{\sigma_t^2} \right) ,$$

in which $\sigma_t$ is the standard deviation of the total test scores, n is the number of test items, p is the mean proportion of right responses upon the n items, i.e., $p = M/n$, where M is the mean total test score, and $q = 1 - p$. Of course, adding a number of easy items which everybody answers correctly will change neither the standard deviation nor the reliability of the test, but inspection shows that it does change the $r_{tt}$ given by this formula. This is easy to show algebraically, but a numerical illustration will suffice. Let us first have a 50-item test with mean 25 and $\sigma_t = 5$; then $r_{tt} = .51$. Let us now add fifty easy items which everybody answers correctly; the mean is now 75, $\sigma_t$ is still 5, and $r_{tt} = .25$. Surely this simplest formula is utterly suspect, in spite of the empirical agreement which the authors and others have reported between the values given by it and comparable-form reliability coefficients. There may be conditions under which formula [21] could be trusted -- the empirical findings suggest that this is so -- but as the authors only offer it as a "foot-rule" formula, one cannot expect an experimental establishment of these conditions in the situations in which it is likely to be used.

* G. F. Kuder and M. W. Richardson, The theory of the estimation of test reliability, Psychometrika, 1937, 2, 151-160.
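Kelley's numerical illustration is immediate to verify; the helper function below simply restates formula [21]:

```python
# Check of the numerical illustration of Kuder-Richardson formula [21]:
# adding fifty universally-passed items leaves the test's standard
# deviation untouched but drags the formula's value down.
def kr21(n: int, mean: float, sd: float) -> float:
    p = mean / n  # mean proportion of right responses
    q = 1 - p
    return n / (n - 1) * (1 - n * p * q / sd**2)

print(f"{kr21(50, 25, 5):.2f}")   # 0.51  (original 50-item test)
print(f"{kr21(100, 75, 5):.2f}")  # 0.25  (after adding 50 easy items)
```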
In connection with an analytical investigation of the functions measured by the more precise Kuder-Richardson formulas, we should note the major premise from which they spring. In 1939 Kuder and Richardson express agreement with C. Spearman in stating* that "the reliability coefficient is defined as the coefficient of correlation between one experimental form of a test and a hypothetically equivalent form." However, their derivations seem clearly to be based upon another proposition, which in 1937 they state thus: "It is implicit in all formulations of the reliability problem that reliability is the characteristic of a test possessed by virtue of the positive intercorrelations of the items composing it." That this is non-equivalent to Spearman's definition can be demonstrated in connection with the data of Table I, giving the variances and covariances of the items composing the two forms of a test.

TABLE I
Variances and Covariances of Test Items

                        Form 1 items            Form 2 items
                      a      b      c        A      B      C
Form 1:   a          .25    .00    .00     .1875  .00    .00
          b                 .25    .00     .00    .1875  .00
          c                        .25     .00    .00    .1875
Form 2:   A                                .25    .00    .00
          B                                       .25    .00
          C                                              .25

p = proportion of right responses: .5 for every item
Standard deviation of items: .5 for every item

Score $X_1 = a + b + c$, and the similar-form score $X_2 = A + B + C$. According to the Kuder-Richardson proposition, $X_1$ has no reliability, for positive correlation between the items composing it is lacking. However, according to Spearman's definition, $r_{12} = .75$ and $V_{\check{d}} / V_d = .75$, indicating that three-fourths of the variance of the $X_1$ scores is real, or predictable from true measures of the function in question. Let us collect various measures for the data of Table I:

Similar-form reliability coefficient $r_{12}$ = .75
$V_{\check{d}} / V_d$ = .75
"Coefficient of coherence," defined below, $V_C / SVX_i$ = .33
Kuder-Richardson formula [8] reliability = .58 (given by Kuder and Richardson as their most reliable formula)
Kuder-Richardson formula [14] reliability = .00
Kuder-Richardson formula [20] reliability = .00
Kuder-Richardson formula [21] reliability = .00

Of course $X_1$ is not a promising measure, but its shortcoming is not lack of reliability but lack of unity, and it can be traced to faulty judgment of the test maker.

* The calculation of test reliability coefficients based on the method of rational equivalence, J. Educ. Psych., Dec. 1939, 30.
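The claims of Table I can be checked mechanically. The sketch below reads the table as a six-by-six item covariance matrix (that layout is our reconstruction of the garbled original) and computes the similar-form coefficient and the Kuder-Richardson formula [20] value:

```python
import numpy as np

# Covariance matrix of the six items of Table I (order: a, b, c, A, B, C).
S = np.array([
    [.25,   .00,   .00,   .1875, .00,   .00  ],
    [.00,   .25,   .00,   .00,   .1875, .00  ],
    [.00,   .00,   .25,   .00,   .00,   .1875],
    [.1875, .00,   .00,   .25,   .00,   .00  ],
    [.00,   .1875, .00,   .00,   .25,   .00  ],
    [.00,   .00,   .1875, .00,   .00,   .25  ],
])

w1 = np.array([1, 1, 1, 0, 0, 0])  # X1 = a + b + c
w2 = np.array([0, 0, 0, 1, 1, 1])  # X2 = A + B + C

V1 = w1 @ S @ w1                       # 0.75
cov12 = w1 @ S @ w2                    # 0.5625
r12 = cov12 / np.sqrt(V1 * (w2 @ S @ w2))
print(f"similar-form r12 = {r12:.2f}")  # 0.75

# KR formula [20] for Form 1 is zero, since the a, b, c intercovariances vanish.
n, pq_sum = 3, 3 * (.5 * .5)           # sum of item p*q values
kr20 = n / (n - 1) * (1 - pq_sum / V1)
print(f"KR [20] = {kr20:.2f}")          # 0.00
```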
Though we question the Kuder-Richardson proposition as a formulation of reliability, we should consider the idea in it very important in connection with the concept of the unity or coherence of a test. Let the items of a test be a, b, c, d, ... and the test score $X_i = w_a a + w_b b + w_c c + w_d d + \cdots$. Let all the covariances between items be computed, a matrix formed, and the matrix factorized by the Kelley† method, which preserves the initial metric given by the variables with their attached weights. If the variance of the first component of this matrix is $V_C$, and the sum of the variances of the weighted items is $SVX_i$ -- this being a precise measure of the total variance inherent in all the items -- then $V_C / SVX_i$ is a measure of the unity or coherence of the test. This would seem to be a very important measure and one to date altogether lacking. The writer suggests the name "coefficient of coherence" for the ratio $V_C / SVX_i$. It is a measure of the morale,‡ or singleness of purpose, of the items constituting the test.

† Essential traits of mental life, 1935; and Talents and tasks, Harvard Education Papers, No. 1, 1940.
‡ T. L. Kelley, When Cease Firing Sounds, Christian Science Monitor, Nov. 8, 1941, defined morale as "the individual attitude in a group endeavor," following which the morale of a test item is the congruence of its intent (what it measures) with that of the group of items constituting the test.

Kuder and Richardson assume complete unity of purpose when they assume a rank of 1 for their correlation matrix of test items. It would seem far better to make no assumption, but to measure the proximity to a rank of 1 by computing $V_C / SVX_i$. The computation of $V_C$ for a hundred-item test would involve no less than $100(100 - 1)/2 = 4950$ inter-item correlations, or covariances, and thus might well be impractical. However, if such a test were divided into, say, ten parts of ten items each -- the items within each part being judged to be as homogeneous as possible (equivalent to the judgment that the parts are as heterogeneous as possible, which is the opposite of the judgment made when splitting for reliability purposes) -- only 45 covariances are required, and the determination of the variance of the first component of these ten parts is entirely feasible. This $V_C$ should be a serviceable approximation to that given by the 100-item analysis. Illustrative examples of the closeness of such approximation are, of course, needed. Other approaches to a quick determination of $V_C$ may lie in some utilization of $r_{it}$ measures, the correlations between the items and the total.
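The mechanics of the ten-part shortcut might be sketched as follows. The leading eigenvalue of the part covariance matrix is used here as the first-component variance $V_C$; this is a principal-component stand-in for Kelley's metric-preserving factorization, which may differ in detail, and the data are simulated:

```python
import numpy as np

rng = np.random.default_rng(1)
n_persons = 500

# Simulated scores on ten ten-item parts (illustrative data only):
# a common factor plus unique part variance.
common = rng.normal(size=(n_persons, 1))
parts = 5 * common + rng.normal(scale=3, size=(n_persons, 10))

cov = np.cov(parts, rowvar=False)   # 10 x 10: only 45 distinct covariances
V_C = np.linalg.eigvalsh(cov)[-1]   # variance of the first component
SVX = np.trace(cov)                 # sum of the part variances

print(f"coefficient of coherence V_C/SVX = {V_C / SVX:.3f}")
```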