
PSYCHOMETRIKA--VOL. 7, NO. 2
JUNE, 1942
THE RELIABILITY COEFFICIENT
T R U M A N L. KELLEY
HARVARD UNIVERSITY
The reliability coefficient is unlike other measures of correlation in that it is a quantitative statement of an act of judgment, usually the test maker's, that the things correlated are similar measures. Attempts to divorce it from this act of judgment are misdirected, just as would be an attempt to eliminate judgment of sameness of function of items when a test is originally drawn up. A "coefficient of coherence," entirely devoid of judgment, measuring the singleness of test function is proposed as an essential datum with reference to a test, but not as a substitute for the similar-form reliability coefficient.
The student of statistics and psychological measurement is aware that a reliability coefficient is a correlation coefficient having certain special properties and a certain special meaning. Mathematically the reliability coefficient r_11 of scores X is such that 0 ≤ r_11 ≤ 1. The maximum non-chance correlation that the X measures can have with any other conceivable set is √r_11, not 1. Knowing r_11 for a group consisting of a narrow range of talent in which the standard deviation is σ, and knowing that the test is equally excellent throughout a wider range wherein the standard deviation is Σ, the reliability coefficients for these two ranges are connected by the equation

    σ √(1 − r_11) = Σ √(1 − R_11).

These are three important properties possessed by the reliability coefficient, but not by the correlation coefficient. Let us examine the antecedent logic which has led to these and other important special properties.
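These properties are easy to verify numerically. The sketch below (the function name and the figures are illustrative, not from the text) solves the range relation σ√(1 − r_11) = Σ√(1 − R_11) for the reliability in the wider range, and checks the √r_11 ceiling on non-chance correlation:

```python
import math

def reliability_in_wider_range(r11, sigma, Sigma):
    """Solve sigma * sqrt(1 - r11) = Sigma * sqrt(1 - R11) for R11."""
    return 1 - (sigma / Sigma) ** 2 * (1 - r11)

# A test of reliability .80 in a narrow group (sigma = 10) appears
# more reliable over a wider range of talent (Sigma = 15):
R11 = reliability_in_wider_range(0.80, 10, 15)
print(round(R11, 3))              # 0.911

# The maximum non-chance correlation of the scores with any other
# conceivable set is the square root of the reliability, not 1:
print(round(math.sqrt(0.80), 3))  # 0.894
```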
If we have a score on a single unique item, any correlation, between the limits of −1 and 1, of it with some other measure is conceivable, but the concept of reliability does not attach to it. If we can conceive of a paired item that measures the same function, then the concept of uniqueness does not exist. Thus, unlike the correlation coefficient, which is merely an observed fact, the reliability coefficient has embodied in it a belief or point of view of the investigator. Consider the score resulting from the item, "Prove the Pythagorean theorem." One teacher asserts that this is a unique demand and that there is no other theorem in geometry that can be paired with it as a similar measure. It cannot be paired with itself if there is any memory, conscious or subconscious, of the first attempt at proof at the time
the second attempt is made, for then the mental processes are clearly different in the two cases. The writer suggests that anyone doubting this general principle take, say, a contemporary-affairs test and then retake it a day later. He will undoubtedly note that he works much faster and the depth and breadth of his thinking is much less; he simply is not doing the same sort of thing as before.
The teacher who considers the proof of the Pythagorean theorem to be a unique activity is entitled to his view. It is a sound view for many purposes, but so is that, for other purposes, of the student who considers it as evidence of a more general ability. To this latter the score possesses a certain reliability and is more or less indicative of the general function that he is interested in. The writer has long noted that statisticians who approach their subject through pure mathematics give little or no concern to reliability coefficients. Is not the reason simply that they are interested in facts and relationships and not in the attitude that an investigator has toward a certain measure?
We conclude that a belief that two or more measures of a mental function exist is prerequisite to the concept reliability, and further, not only that they exist but that they are available before a measure of reliability is possible. We pose the question: what function of the two sets of measures X_1 and X_2, gotten by twice measuring the same individuals, and conceived of as tapping the same fundamental ability, is the best measure of reliability? Further, either X_1 and X_2 must be judged a priori to be equally trustworthy measures of this ability or the one be judged some number of times as excellent as the other, as, e.g., a 90-item test might be judged to be nine times as excellent as a ten-item test, the other considerations about the items being equal.
This act of a priori judgment is inherent and, though it can be avoided so far as combination of items is concerned by fractionizing the measure, this only changes the size of the element upon which the judgment is made. This element can never be made smaller than the single test item, and it presumably should ordinarily not be made as small as this, for the judgment that item 1 measures the same ability as item 2 would seem to be less within the capacity of the human mind than that, say, the ability measured by a first set of 20 items, chosen according to certain principles and rules, is the same as that measured by a second set chosen by the same principles and rules. In connection with the following mathematical development, the X_1 and X_2 measures are judged to be equally excellent measures. The student can readily modify this treatment to cover the case where the one is judged some number of times as excellent as the other.
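The familiar Spearman-Brown relation for a test lengthened k-fold offers one way to make the 90-item versus ten-item comparison concrete. Note that reading "k times as excellent" as "k times as long, with comparable items" is an assumption of this sketch, not a formula stated in the text:

```python
def spearman_brown_general(r11, k):
    """Reliability of a test k times as long, the added items being
    judged comparable to the originals (an a priori judgment, as in the text)."""
    return k * r11 / (1 + (k - 1) * r11)

# If a ten-item test has reliability .40, a 90-item test (k = 9) built
# of comparable items would be expected to show:
print(round(spearman_brown_general(0.40, 9), 2))  # 0.86
```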
Let the X_1 measures for the N individuals be X_a, X_b, ..., X_n and the paired X_2 measures be X_A, X_B, ..., X_N. If X_a, X_b, ..., X_n are entitled to any creditability it is because the differences shown between them, (X_a − X_b), (X_a − X_c), ..., (X_i − X_j), ..., are creditable. This seems to the writer the most primitive or fundamental concept of trustworthiness. Let d_ab = X_a − X_b, etc. Of the (N² − N) differences we cannot ask how many are believable and how many are not, because the issue is quantitative, but we can ask what proportion of the variance of these differences, V_d, is trustworthy or predictable from a knowledge of the true differences. Let us call a difference predicted from a true difference d̂. If d_∞ is the true difference, we have

    d̂ = r_{d d_∞} (σ_d / σ_{d_∞}) d_∞,

and V_d̂ / V_d would yield this fundamental proportion. We do not have true difference measures available, but we do have equally excellent difference measures D (D_AB = X_A − X_B, etc.), and can actually compute

    d̃ = r_{dD} (σ_d / σ_D) D.

This will immediately yield V_d̃ / V_d, and by means of certain very plausible assumptions that further sets of X measures are conceivable, we can obtain an estimate of V_d̂ / V_d, as will be illustrated.
Let us compute r_{d_ij D_ij}. We first note that

    d_ij = X_i − X_j = (X_i − M_1) − (X_j − M_1) = x_i − x_j,

the X's being raw scores and the x's deviations from the mean scores. Accordingly, any function of d_ij and D_ij is independent of differences between M_1 and M_2. There is no requirement in the fundamental measure that we seek that M_1 = M_2.

    r_{d_ij D_ij} = S d_ij D_ij / [(N² − N) σ_{d_ij} σ_{D_ij}],

in which S is a summation of N² − N terms when i ≠ j, but it may be looked upon as a summation of N² terms if we do not impose the restriction i ≠ j, for the inclusion of the N null terms (d_ii = 0) will not affect the sum.
For the variance of the d's, we have

    (N² − N) V_d = S d_ij² = Σ_{i=1}^{N} [ Σ_{j=1}^{N} d_ij² ], the null terms being included,

        = Σ_{i=1}^{N} [ (x_i² + x_a² − 2 x_i x_a) + (x_i² + x_b² − 2 x_i x_b) + ... ]

        = Σ_{i=1}^{N} [ N x_i² + N V_1 ], in which V_1 is the variance of the x_i measures,

        = 2 N² V_1.
By very similar steps we obtain for the covariance

    (N² − N) Cov(d, D) = S d_ij D_ij = 2 N² σ_1 σ_2 r_12,

so that finally

    r_{d_ij D_ij} = r_12.
We thus see that the usual split-half, or similar-form, reliability coefficient is a precise measure of the extent to which differences in the X_1 scores are predictable by a measure of this same degree of excellence, for X_2 is, according to judgment, such a measure. The issue of "correlation between errors" has not been involved. Whether there is or is not such a correlation does not alter the fact that the reliability coefficient, r_12, is the correlation between d and D.
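The identity r_{d_ij D_ij} = r_12 can be checked by brute force: form all N² ordered pairwise differences (null terms included, which, as the derivation notes, do not affect the sums) and correlate them. The data below are simulated purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
ability = rng.normal(0, 1, N)
X1 = ability + rng.normal(0, 1, N)   # two equally excellent measures
X2 = ability + rng.normal(0, 1, N)   # of the same underlying ability

# All N^2 ordered differences d_ij = X1_i - X1_j and D_ij = X2_i - X2_j;
# the N null terms d_ii = 0 are included, since they do not affect the sums.
d = (X1[:, None] - X1[None, :]).ravel()
D = (X2[:, None] - X2[None, :]).ravel()

r_dD = np.corrcoef(d, D)[0, 1]
r_12 = np.corrcoef(X1, X2)[0, 1]
print(abs(r_dD - r_12) < 1e-8)   # True: the two correlations coincide
```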
Let us now assume that further measures of the excellence of X_2 could be constructed, given, and averaged, so that the X_1 could be paired with X_∞, true scores. Then we find r_{d d_∞} = √r_12, which the writer has called an "index of reliability,"* and some of the properties of which he has elsewhere noted.†

We then obtain V_d̂ / V_d = r_12, informing us that the ordinary (split-half or similar-form) reliability coefficient is a precise statement of the proportion of the variance of the observed differences in the X_1 scores that is real, that is, attributable to real differences in the ability measured.
We must not forget that an act of judgment (that the X_1 and X_2 measures are equally excellent measures of the same function) has been demanded. This act is of the same sort as that of the test maker in putting together two or more exercises into a single test, in doing which he asserts that item two is a measure of the same function as item one, etc. We may or may not trust his judgment in this respect and we may or may not trust his judgment in splitting a test into halves, but surely we have no warrant for trusting him to do the former but not the latter. In fact, it should be a much less severe tax upon judgment to split a test with many items into comparable halves
* A simplified method of using scaled data for purposes of testing, School and Society, July 1 and 8, 1916, 4, nos. 79-80.
† The reliability of test scores, J. Educ. Res., May 1921. Also, Note on the reliability of a test, J. Educ. Psych., Apr., 1924, 15, no. 4.
than to draw up the items in the first instance so as to measure the same function.
The split-test method has also been criticized because of the assumptions involved in the Spearman-Brown step-up formula, r_a = 2 r_1 / (1 + r_1), which are that the two halves of the test are equally reliable and equally variable measures of the same thing. Small differences in reliability and variability of the halves would seem to be nicely taken care of by the following formula, due to Dr. John Flanagan, in which the subscripts 1 and 2 refer to the halves of the test:

    r_a = reliability of entire test = 4 σ_1 σ_2 r_12 / (V_1 + V_2 + 2 σ_1 σ_2 r_12).

However, the difference between this formula and the usual one, r_a = 2 r_1 / (1 + r_1), is trifling for usual conditions.
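Both formulas are simple to compute, and when the halves have equal standard deviations Flanagan's expression reduces exactly to the Spearman-Brown value. A sketch with illustrative figures (not taken from the text):

```python
def spearman_brown(r_hh):
    """Step-up formula: full-test reliability from the half-test correlation."""
    return 2 * r_hh / (1 + r_hh)

def flanagan(s1, s2, r12):
    """Flanagan's formula, allowing the halves to differ in variability;
    s1 and s2 are the standard deviations of the two halves."""
    return 4 * s1 * s2 * r12 / (s1**2 + s2**2 + 2 * s1 * s2 * r12)

r_half = 0.70
print(round(spearman_brown(r_half), 4))       # 0.8235
print(round(flanagan(5.0, 5.5, r_half), 4))   # 0.8213, a trifling difference
# With equal standard deviations the two formulas agree:
print(abs(flanagan(5.0, 5.0, r_half) - spearman_brown(r_half)) < 1e-12)  # True
```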
The split-test method of computing a half-test reliability has been called indeterminate because there are many other ways of splitting than the usual way of odds vs. evens. A determinate answer would result if the mean for all possible ways were gotten, but, even neglecting the labor involved, this would seem to be objectionable, for many of these splittings would be such as to contravene the judgment of comparability. In splitting we should not seek a mathematical outcome, but a judgment outcome, and for the same logical reasons as warrant a judgment product in putting together the items of the test in the first instance. The rule for splitting can well be the same as for drawing up comparable forms: so do it that the range and nature of the functions tapped are as nearly the same in the two instances as judgment permits. In this rule the plural "functions" occurs, for it is not assumed that any test maker can write, or believe himself capable of writing, items that measure one function only, though of course his endeavor should be to do so. The writer judges that the more precise Kuder-Richardson procedures, later discussed, do well cover the case where a single function is measured by items, but this situation seems to him to be remote from practical situations. These observations argue for the putting of judgment into the splitting into halves or building comparable forms involved in the computation of reliability, not that procedures be so mechanized that judgment is taken out.
The writer believes it altogether desirable that the term "reliability coefficient" be restricted to the correlation between similar measures. This not only is the meaning originally given the term by its deviser, C. S. Spearman, but this is the necessary meaning in order to be a precise measure of the reality of the differences shown by the
measures. Let us compare this measure with the retest correlation and the Kuder-Richardson measures.

In the case of the retest, if at the time of the second test there is any memory, conscious or subconscious, of the earlier responses, then certainly the mental operations being performed at the second taking are not the same or even similar in kind to those performed at the first taking. Surely if the time interval between takings is short enough, we can expect the differences between the scores of subjects upon the first test to be exactly predicted by the differences between the retake test scores. The numerical value of the retest coefficient of correlation will decrease as the time between testings is increased. It thus is a function of this time, but whatever this time interval and whatever the value of the retest correlation, there seems no logical reason for taking it as a measure of the fundamentally important ratio V_d̂ / V_d.
Kuder and Richardson* give a number of formulas, from complex to simple, for the computation of the reliability coefficient, all consequent to a certain "operational definition of equivalence." They observe that their definition is "more rigid than the one usually stated." It is certainly more restrictive than the one here used, and the writer judges more restrictive than need be. To judge of V_d̂ / V_d there seems no necessity that items in the paired forms be matched for difficulty, that the aggregate difficulties be matched, or even that the items separately be matched for excellence, but only that the aggregates be so matched. In their more precise formulas an r_ii, the item reliability, enters, but this is not an observed datum but definitive and determinable only with the aid of certain assumptions, in particular the questionable one that "the matrix of inter-item correlations has a rank of one."
Their simplest formula [21] is

    r_tt = [n / (n − 1)] · (σ_t² − n p̄ q̄) / σ_t²,

in which σ_t is the standard deviation of the total test scores, n the number of test items, p̄ the mean proportion of right responses upon the n items, i.e., p̄ = M/n, where M is the mean total test score, and q̄ = 1 − p̄. Of course, adding a number of easy items which everybody answers correctly will change neither the standard deviation nor the reliability of the test, but inspection shows that it does change the

* G. F. Kuder and M. W. Richardson, The theory of the estimation of test reliability, Psychometrika, 1937, 2, 151-160.
r_tt as given by this formula. This is easy to show algebraically, but a numerical illustration will suffice. Let us first have a 50-item test, mean 25, and σ_t = 5; then r_tt = .51. Let us now add fifty easy items which everybody answers correctly; the mean is now 75, σ_t = 5, and r_tt = .25. Surely this simplest formula is utterly suspect in spite of the empirical agreement which the authors and others have reported between the values given by it and comparable-form reliability coefficients. There may be conditions under which formula [21] could be trusted, and the empirical findings suggest that this is so, but as the authors only offer it as a "foot-rule" formula, one cannot expect an experimental establishment of these conditions in the situations in which it is likely to be used.
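Kelley's numerical illustration is easy to reproduce. The helper below (the function name is mine) implements formula [21] as written:

```python
def kr21(n, mean, sd):
    """Kuder-Richardson formula [21]: n items, mean total score, and
    standard deviation sd of the total scores; p-bar = mean / n."""
    p = mean / n
    q = 1 - p
    return (n / (n - 1)) * (1 - n * p * q / sd**2)

# A 50-item test, mean 25, sigma_t = 5:
print(round(kr21(50, 25, 5), 2))   # 0.51
# Add fifty easy items everybody answers correctly: mean 75, sigma_t still 5.
print(round(kr21(100, 75, 5), 2))  # 0.25
```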
In connection with an analytical investigation of the functions measured by the more precise Kuder-Richardson formulas, we should note the major premise from which they spring. In 1939 Kuder and Richardson express agreement with C. Spearman in stating* that "the reliability coefficient is defined as the coefficient of correlation between one experimental form of a test and a hypothetically equivalent form." However, their derivations seem clearly to be based upon another proposition, which in 1937 they state thus: "It is implicit in all formulations of the reliability problem that reliability is the characteristic of a test possessed by virtue of the positive intercorrelations of the items composing it." That this is non-equivalent to Spearman's definition can be demonstrated in connection with the data of Table I, giving inter-item covariances of the items composing the two forms of a test.
TABLE I
Variances and Covariances of Test Items

                 Form 1: Items           Form 2: Items
                 a       b       c       A       B       C
Form 1:  a      .25     .00     .00     .1875   .00     .00
         b              .25     .00     .00     .1875   .00
         c                      .25     .00     .00     .1875
Form 2:  A                              .25     .00     .00
         B                                      .25     .00
         C                                              .25

p = proportion of right responses: .5 for every item.
Standard deviation of each item: .5.
(Only the upper triangle is shown; the matrix is symmetric.)
Score X_1 = a + b + c and similar-form score X_2 = A + B + C. According to the Kuder-Richardson proposition, X_1 has no reliability, for positive correlation between the items is lacking. However, according to Spearman's definition r_12 = .75 and V_d̂ / V_d = .75, indicating

* The calculation of test reliability coefficients based on the method of rational equivalence, J. Educ. Psych., Dec., 1939, 30.
that three-fourths of the variance of X_1 scores is real, or predictable from true measures of the function in question. Let us collect various measures for the data of Table I.

    Similar-form reliability coefficient            = .75
    V_d̂ / V_d                                       = .75
    "Coefficient of coherence" (mentioned in
        the next paragraph), VC / SVX_i             = .33
    Kuder-Richardson formula [8] reliability        = .58
        (This formula given by Kuder-Richardson as their most reliable.)
    Kuder-Richardson formula [14] reliability       = .00
    Kuder-Richardson formula [20] reliability       = .00
    Kuder-Richardson formula [21] reliability       = .00

Of course X_1 is not a promising measure, but its shortcoming is not lack of reliability, but lack of unity, and can be traced to faulty judgment of the test maker.
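The contrast drawn from Table I can be verified directly. Assembling the full 6 × 6 item covariance matrix (reconstructed here from the table: item variances .25, paired cross-form covariances .1875, all else zero) and applying unit weights:

```python
import numpy as np

c = 0.1875  # covariance between each item and its similar-form partner
S = np.array([
    [.25, 0,   0,   c,   0,   0  ],   # a
    [0,   .25, 0,   0,   c,   0  ],   # b
    [0,   0,   .25, 0,   0,   c  ],   # c
    [c,   0,   0,   .25, 0,   0  ],   # A
    [0,   c,   0,   0,   .25, 0  ],   # B
    [0,   0,   c,   0,   0,   .25],   # C
])

w = np.ones(3)
V1 = w @ S[:3, :3] @ w     # Var(X1) = .75
V2 = w @ S[3:, 3:] @ w     # Var(X2) = .75
C12 = w @ S[:3, 3:] @ w    # Cov(X1, X2) = .5625
print(round(float(C12 / np.sqrt(V1 * V2)), 2))  # 0.75, Spearman's r12

# Kuder-Richardson formula [20] on Form 1, whose items do not intercorrelate
# (p = q = .5 for each item, so the sum of pq over the three items is .75):
n, sum_pq = 3, 0.75
print(round(float((n / (n - 1)) * (V1 - sum_pq) / V1), 2))  # 0.0
```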
Though we question the Kuder-Richardson proposition as a formulation of reliability, we should consider the idea in it very important in connection with the concept of unity or coherence of a test. Let the items of a test be a, b, c, d, ... and the test score

    X_t = w_a a + w_b b + w_c c + w_d d + ... .

Let all the covariances between items be computed and a matrix formed and factorized by the Kelley* method, which preserves the initial metric given by the variables with their attached weights. If the first component of this matrix is C and the sum of the variances of the weighted items SVX_i, this being a precise measure of the total variance inherent in all the items, then VC / SVX_i is a measure of the unity or coherence of the test. This would seem to be a very important measure and one to date altogether lacking. The writer suggests the name "coefficient of coherence" for the ratio VC / SVX_i. It is a measure of the morale,† or singleness of purpose, of the items constituting the test. Kuder and Richardson assume complete unity of purpose when they assume a rank of 1 for their correlation matrix of test items. It would seem far better not to make any assumption but to measure the proximity to a rank of 1 by computing VC / SVX_i.

* Essential traits of mental life, 1935, and Talents and tasks, Harvard Education Papers No. 1, 1940.
† T. L. Kelley, When Cease Firing Sounds, Christian Science Monitor, Nov. 8, 1941, defined morale as "the individual attitude in a group endeavor," following which the morale of a test item is the congruence of its intent (what it measures) with that of the group of items constituting the test.
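A minimal sketch of the proposed ratio, taking "variance of the first component" to be the largest eigenvalue of the item covariance matrix. This is one plausible reading of Kelley's metric-preserving factorization, assumed here rather than quoted:

```python
import numpy as np

def coefficient_of_coherence(cov):
    """VC / SVX_i: variance of the first component of the item covariance
    matrix (its largest eigenvalue) over the sum of item variances (trace)."""
    return np.linalg.eigvalsh(cov)[-1] / np.trace(cov)

# Form 1 of Table I: three uncorrelated items of variance .25 each.
print(round(float(coefficient_of_coherence(0.25 * np.eye(3))), 2))     # 0.33

# A rank-one covariance matrix (complete unity of purpose) scores 1.0.
print(round(float(coefficient_of_coherence(np.full((3, 3), 0.25))), 1))  # 1.0
```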
The computation of VC for a hundred-item test would involve no less than 100(100 − 1)/2 inter-item correlations, or covariances, and thus might well be impractical. However, if such a test were divided into, say, ten parts of ten items each, the items within each part being judged to be as homogeneous as possible (equivalent to the judgment that the parts are as heterogeneous as possible, which is the opposite of the judgment made when splitting for reliability purposes), only 45 covariances are now required, and the determination of the variance of the first component of these ten parts is entirely feasible; this VC should be a serviceable approximation to that given by the 100-item analysis. Illustrative examples of the closeness of such approximation are, of course, needed.

Other approaches to a quick determination of VC may lie in some utilization of r_it measures, the correlations between the items and the total.