THE CONCEPT OF VALIDITY IN THE INTERPRETATION OF TEST SCORES
ANNE ANASTASI
Fordham University
If asked to define "validity," most psychologists would probably agree that validity is the closeness of agreement of a test
with some independently observed criterion of the behavior
under consideration. It is only as a measure of a specifically
defined criterion that a test can be objectively validated at
all. For example, unless we define "intelligence" as that combination of aptitudes required for successful school achievement, or for survival on a certain type of job, or in terms of
some other observable criterion, we can never either prove
or disprove that a particular test is a valid measure of "intelligence." The criterion may be expressed in very broad and
general terms, such as "those behavior characteristics in which
older children in our culture differ from younger children reared
in the same culture," but, however expressed, it defines the
functions measured by the particular test. To claim that a
test measures anything over and above its criterion is pure
speculation of the type that is not amenable to verification
and hence falls outside the realm of experimental science.
To the question, "What does this test measure?", the only
defensible answer can thus be that it measures a sample of
behavior which in turn may be diagnostic of the criterion
or criteria against which the particular test was validated.
Nor is there any circularity implicit in such a definition of
validity, since a psychological test is a device for determining
within a relatively short period of time what could otherwise
be discovered only by means of a prolonged follow-up. For
example, with a psychological test we may be able to predict
within a certain margin of error which applicants will succeed
on a given job or which students will be able to complete
a medical course satisfactorily.
Downloaded from epm.sagepub.com at PENNSYLVANIA STATE UNIV on September 13, 2016
EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT
Logically, the same information could have been obtained, even more precisely, by hiring all job applicants or admitting to medical school all students wishing to enroll, and observing the subsequent performance
of each subject. The latter procedure is obviously so time-consuming and wasteful, however, as to be completely impracticable. Hence the tests make a real contribution in permitting predictions in advance of lengthy observations. Another
advantage of standardized psychological tests is that they make
possible a comparison of the individual’s performance with
that of other persons who have been observed in the same
sample situation represented by each test. In other words,
the tests provide norms for evaluating individual performance.
Prediction and comparison with norms represent valuable
contributions which psychological tests can render to our knowledge of individual behavior, the practical benefits of these
contributions having been widely demonstrated. It is of fundamental importance, however, to bear in mind that psychological
tests do not provide a different kind of information from that
obtained by any other observation of behavior. The use of
such labels as "intelligence," "aptitude," "capacity," and "potentiality" has probably done much to make test users lose
sight of the empirical validation of tests. A number of current
disagreements regarding the interpretation of test results and
the susceptibility of tested abilities to training may be traceable to a failure to take due cognizance of validation procedures.
Many test users apparently give only preliminary and possibly
perfunctory attention to validation data, in order to reassure
themselves at the outset that the test is "satisfactory." Their
interpretation of the scores obtained with such a test, however,
often takes no account of the validation data and is expressed
in terms which bear little or no relation to the criterion.
Perhaps one of the most common examples of such an inconsistent treatment of test validity is provided by what we
may call the argument of "extenuating circumstances." Let
us suppose that a child obtains an IQ of 58 on a verbal intelligence test, and that the examiner subsequently finds evidence
of a fairly severe language handicap in this child owing to
foreign parentage. It is a common practice to conclude in
such a case that the obtained IQ is not "valid," on the grounds
that the verbal content of the test rendered it unsuitable for
testing such an individual. At this point we may inquire,
however, "On the basis of what criterion is this IQ invalid?"
Certainly the obtained IQ may be a valid measure of the
behavior defined by the criterion against which the particular
test was validated. It is very likely that the same language
handicap which interfered with performance on this test will
interfere with the child’s behavior in other linguistic situations
of which this test is an adequate index. The correspondence
with the criterion may thus be just as close for this child as
for children without a language handicap. In school, for example, the language handicap would probably interfere with
the child’s acquisition of important skills and information.
The resulting academic backwardness, together with the original language handicap itself, would, in turn, affect certain
aspects of job performance and other areas of adult activities.
Conversely, any remedial efforts designed to eliminate the
language handicap would produce an improvement, not only
in the tested IQ, but also in the broader area of behavior
of which this test is a predictor.
It should be added parenthetically that language handicap
has been chosen as an example only for purposes of discussion.
A number of other "extenuating circumstances," such as visual
or auditory defects, emotional and motivational factors, inadequate schooling, and the like, could have served equally
well to illustrate the point. Similarly, the discussion has been
limited to intelligence tests, since it is chiefly in connection
with these tests that many confusions regarding validity have
arisen. The entire discussion applies equally well, however,
to all types of psychological tests.
Specifically, how does the case cited in our illustration, as
well as others of its type, differ from those in which no question
is raised regarding the "validity" of the test or its applicability
to the particular individual? First, in the present case the
examiner has direct and certain knowledge regarding at least
one of the factors which determine the subject’s subnormal
performance, viz., language handicap. In other cases, the principal determining factor might be inferior schooling facilities,
parental illiteracy, cerebral birth injuries, a defective thyroid,
or any of a large number of psychological or biological conditions. Yet it is doubtful whether the IQ would be considered
"invalid" in all of these cases simply because it proved possible
to point to a specific condition as the determining factor in
the poor test performance. To be sure, in many cases of low
IQ, the examiner has little or no knowledge about the circumstances or conditions which lead to the intellectual backwardness. But such ignorance is obviously no more conducive
to "valid" testing. Quite apart from the question of validity,
the examiner should, of course, make every effort to understand why the individual performs as he does on a test. The
fullest possible knowledge of the individual’s pre- and postnatal environment, structural deficiencies, and any other relevant conditions in his reactional biography is desirable for the
most effective use of the test data. But to explain why an individual scores poorly on a test does not "explain away" the
score. There are always reasons to account for an individual’s
performance on a test. Language handicap is just as real as
any other reason.
A second distinguishing feature of our example is that such
a language handicap is usually remediable. The individual need
not be permanently backward in intellectual performance, but
with special training he may in large measure compensate
for past losses in intellectual progress. Susceptibility to treatment is, however, a matter of degree. Many of the conditions
determining intellectual performance, whether structural or
functional, are amenable to change under special treatment.
Moreover, conditions for which no effective therapy is now
known may yield to newly developed treatments in the future.
The distinction in terms of remediability is thus rather tenuous.
Nor does such a distinction have any direct bearing upon the
validity of a measuring instrument. A thermometer may be
a valid index of fever, despite the fact that the administration
of medicine will cure the fever.
Thirdly, some may point out that language handicap is
not hereditary and may maintain that for this reason its influence upon test performance ought to be "ruled
out." Such
an objection contains a tacit assumption that psychological
tests are primarily concerned with those individual differences
in behavior which can be attributed to heredity. Since the
number of hereditary conditions which have been clearly related to behavior differences is extremely small, such a policy,
if followed consistently, would mean the virtual cessation of
psychological testing. Moreover, the connection between hereditary mechanisms and behavior is so remote and indirect as
to render the distinction between hereditary and environmental
factors in behavior largely an academic one (cf., e.g., 2). Above
all, it should be noted that no criterion against which any
psychological test has been validated is itself traceable to purely
hereditary factors. Hence no such test has been proved to be
a valid measure of individual differences in hereditary characteristics.
A fourth point to be considered is that of comparability.
It may be objected that the individual who is handicapped
by language difficulties, sensory deficiencies, or similar "extenuating circumstances" is not comparable to the validation
group on which the test norms were established. The requirement of comparability in the application of psychological tests
needs further clarification. If individuals are entirely similar
in all of the conditions (psychological, physiological, etc.) which
influence the behavior measured by a particular test, individual
differences will disappear, all subjects receiving the same score.
Obviously no test is designed to measure behavior independently of the conditions which determine such behavior; that
would be a logical absurdity as well as an empirical impossibility. When the conditions in which the individual differs
from the standardization group affect the test and the criterion
in an approximately equal manner and degree, the validity of
the test for that individual will not be appreciably influenced
by the lack of comparability of the individual to the standardization group.
This question of "comparability" pertains not so much to
the measurement of behavior as to the analysis of the etiology
of behavior differences. It is only when attributing the observed
individual differences in test scores to a particular factor or
class of factors that the investigator must make certain that
other contributing factors have been reasonably constant. For
example, if a few individuals in a group have a language
handicap while the rest do not, we could not ascribe individual
differences in performance within this group to structural differences in the nervous system, or to any other factor whose
contribution to behavior we may be investigating. The same
limitation would apply, however, if educational opportunities,
family traditions, incentives for intellectual activities, or any
other factor were not held constant. The fact that the influence
of language handicap, sensory deficiencies, and a few other
conditions is more readily apparent does not place such conditions in a different category. The question of comparability
applies equally to all conditions other than the one under
investigation.
A fifth consideration pertains to the use of test scores in
prediction. Could an IQ obtained by a child with a language
handicap serve as a basis for predicting the subsequent behavior
of the individual? As long as the language handicap
remains, the test score can provide an accurate prognosis of
the child’s behavior in situations demanding the type of verbal
responses sampled by the test. It is only in this sense that any
psychological test makes predictions possible. Within a certain
margin of error, behavior can be predicted under existing conditions. But if, for example, any detrimental conditions such
as poor schooling, sensory deficiencies, or the like are corrected,
then performance on both test and criterion will show improvement. In discussions of test reliability, various writers during
the past twenty-five years have pointed out that a psychological
test should be expected to reflect changes in behavior at different times and under different conditions.1 For test scores to
remain constant when conditions affecting the subject’s behavior have altered would indicate a crude and relatively insensitive measuring instrument, rather than a highly "reliable"
one. The same logic applies to validity. If the subjects’ test
scores remain unchanged despite the modification of conditions
which affect criterion performance, the test cannot have high
validity.
1 Cf., e.g., 1, 4, 5, 6, 9, 10, 11, 12, 15, 18, 19.
Closely related to the problem of prediction is the scope
or breadth of influence of any given condition upon the individual’s behavior. For example, the presence of a loud, irregular
noise during the testing would probably affect the score on
that test, without influencing the individual’s behavior in other
situations. A toothache or a severe cold on the day of the
testing would be further illustrations of narrowly limited conditions. In the case of these conditions, the prognostic value
of the test for the individual would indeed be reduced, in
much the same manner that holding an ice cube in the mouth
would invalidate an oral thermometer reading of bodily temperature. Conditions such as language handicap, however, affect
the individual’s behavior in a much broader area than that
of the immediate test situation. They may thus influence both
criterion and test score in a similar manner.
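The contrast between narrowly limited and broad conditions can be made concrete with a small numerical sketch. Everything in it is assumed for illustration only: the linear prediction rule, its intercept and slope, the baseline score, and the size of each effect are invented, and `predict_criterion` and `prediction_error` are hypothetical helper names. A broad condition is represented as one that lowers the test score and lowers the criterion in proportion; a narrow condition lowers the test score alone.

```python
# Hypothetical prediction rule fitted on a validation group:
# predicted criterion = INTERCEPT + SLOPE * test score.
INTERCEPT = 10.0
SLOPE = 0.5


def predict_criterion(test_score: float) -> float:
    """Predict the criterion measure from a test score."""
    return INTERCEPT + SLOPE * test_score


def prediction_error(test_effect: float, criterion_effect: float,
                     baseline_test: float = 100.0) -> float:
    """Absolute error of prediction once a condition has shifted
    the test score and the criterion by the given amounts."""
    observed_test = baseline_test + test_effect
    observed_criterion = predict_criterion(baseline_test) + criterion_effect
    return abs(observed_criterion - predict_criterion(observed_test))


# Broad condition (e.g., language handicap): lowers the test score and
# lowers the criterion in proportion.
broad = prediction_error(test_effect=-20.0, criterion_effect=-10.0)

# Narrow condition (e.g., noise during testing): lowers only the test score.
narrow = prediction_error(test_effect=-20.0, criterion_effect=0.0)

print(broad, narrow)  # 0.0 10.0
```

Under these assumed figures the broad condition leaves the prediction exact, whereas the narrow condition produces an error of ten criterion points. How closely any actual condition approaches either case remains a matter for empirical determination.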
The import of the above analysis is that validity should
be consistently interpreted with reference to the specific criteria
against which the given test was validated. It also follows that
validity is not a function of the test but of the use to which
the test is put. A test may have high validity for one criterion
and low or negligible validity for another. The attitude that
a good test has "high validity" and a poor test has "low
validity" is still too prevalent among test users. Tests cannot
be validated in the abstract, nor is the usual concept of validity
itself universally applicable to psychological testing. It is only
when tests are employed for predictive or diagnostic purposes
that the correlation with an external criterion is relevant at
all. In many investigations concerned with fundamental behavior research, tests are employed merely as behavior samples
obtained under standardized (i.e., uniform) conditions, without
reference to the correlations of these samples with other, "everyday-life" behavior samples (i.e., practical criterion measures).
When the maze-learning behavior of white rats is tested, for
example, the maze is not first "validated" against the rats’
success in finding food in a grocery basement, or their ability
to avoid contact with prowling cats, or any other criteria of
achievement in the rats’ extra-laboratory or workaday world.
The investigator may quite reasonably argue that for the study
of the particular principles of behavior which he is investigating, maze-learning is as "good" a sample of behavior as cat-avoiding, and that he has no more reason for validating the
former against the latter than vice versa.
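The statement that a test may have high validity for one criterion and negligible validity for another can be illustrated by computing the correlation of a single set of test scores with two different criteria. The sketch below assumes that validity is expressed as a product-moment correlation; the scores are invented, and `pearson_r` is a hypothetical helper written out in full so that the example is self-contained.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


# Invented scores for five examinees (illustration only).
test_scores = [1, 2, 3, 4, 5]
criterion_a = [1, 3, 2, 5, 4]   # e.g., a school-achievement rating
criterion_b = [2, 5, 1, 4, 3]   # e.g., a job-performance rating

validity_a = pearson_r(test_scores, criterion_a)
validity_b = pearson_r(test_scores, criterion_b)
print(round(validity_a, 2), round(validity_b, 2))  # 0.8 0.1
```

The same scores thus yield a coefficient of .80 against one criterion and .10 against the other. Since the coefficient is unchanged when the two arguments are interchanged, the computation itself gives no indication of which variable is the "test" and which the "criterion."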
Fundamentally, any validation procedure provides a measure
of the relationship between two behavior samples. As Guilford
has recently expressed it, "In a very general sense, a test is
valid for anything with which it correlates" (7, p. 429). The
process can be regarded as irreversible only when one of the
behavior samples has greater importance than the other for
a specific purpose.2 In such a case, the more important behavior
sample is designated the "criterion." No basic difference exists
between "criteria" on the one hand and "tests" on the other.
They are merely different samples of behavior whose interrelationships permit predictions from one to the other. We
could predict intelligence test scores from school achievement,
although the process would be needlessly time-consuming. In
such a case, the intelligence test scores would constitute the
criterion.
The criterion is not intrinsically superior in any sense. It
is well known, for example, that many commonly used criteria,
such as school grades or job advancement, may be influenced
by many factors "extraneous" to the quality of the individual’s
performance. Yet, if it is our object to predict such criteria,
with all their irrelevancies and shortcomings, then the correlation of a given test with such criteria is the validity of the
test in that situation. To be sure, the immediate criterion
against which a test is validated may itself have been chosen
as a convenient index or predictor of a broader and less readily
observable area of behavior. For example, a pilot aptitude
test may be validated against performance in basic flight training, the latter being in turn regarded as an approximate index
of achievement in more advanced training and even possibly
of ultimate combat performance. Such "successive validation"
would be quite consistent with the relativity of predictors and
criteria. It might be noted parenthetically that it is only when
criterion measures are themselves used as predictors of further
behavior that one may legitimately speak of the reliability
and validity of the criterion itself (cf., e.g., 8).
2 To be sure, when the relationship between the two variables is curvilinear, prediction will not be equally accurate in both directions, since $\eta_{xy} \neq \eta_{yx}$. In such cases,
however, there is no a priori reason to expect that the correlation will be any higher
when predicting the "criterion" from the "test" than when predicting the "test" from
the "criterion."
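The asymmetry mentioned in footnote 2 can be shown with a deliberately extreme set of invented data. The sketch assumes the usual definition of the correlation ratio as the square root of the between-group sum of squares divided by the total sum of squares, the groups being formed on the predictor; `correlation_ratio` is a hypothetical helper name.

```python
from collections import defaultdict
from math import sqrt


def correlation_ratio(predictor, outcome):
    """Eta for predicting `outcome` from `predictor`: the square root of
    between-group variation in the outcome (grouped by predictor value)
    divided by its total variation."""
    groups = defaultdict(list)
    for p, o in zip(predictor, outcome):
        groups[p].append(o)
    grand_mean = sum(outcome) / len(outcome)
    ss_total = sum((o - grand_mean) ** 2 for o in outcome)
    ss_between = sum(
        len(values) * (sum(values) / len(values) - grand_mean) ** 2
        for values in groups.values()
    )
    return sqrt(ss_between / ss_total)


# Invented inverted-U data: y rises and then falls as x increases.
x = [1, 1, 2, 2, 3, 3]
y = [1, 1, 3, 3, 1, 1]

eta_y_from_x = correlation_ratio(x, y)  # y is fully determined by x
eta_x_from_y = correlation_ratio(y, x)  # x cannot be recovered from y
print(round(eta_y_from_x, 3), round(eta_x_from_y, 3))  # 1.0 0.0
```

Which direction yields the higher value depends entirely on the data. Interchanging the labels of the two variables reverses the result, so that nothing in the computation favors prediction of the "criterion" over prediction of the "test."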
Validation against a "practical" criterion is essential for
many uses to which tests are put. It should not be assumed,
however, that only tests which have been validated against
some criterion considered important within a
particular cultural setting can be used in behavior research. In order to be
able to generalize from any obtained test score, we need only
to know the relationships between the tested behavior in question and other behavior samples, none of these behavior samples
necessarily occupying the preeminent position of a criterion.
Thus, if the investigator is interested in the possible use of
maze-learning performance as a basis for predicting the rats’
behavior in other learning situations, he will have to correlate
the subjects’ maze-learning scores with their scores in a variety
of other learning tasks. If a common factor is identified through
these different learning scores, the "factorial validity" (7) of
any one of the tests in predicting that which is common to
all of them can be determined. On the other hand, if no single
learning factor is demonstrated, then the area within which
predictions can be made must be accordingly narrowed to
fit the confines of whatever common factor does become evident.
Investigations conducted to date on human subjects, for example, have failed to indicate the presence of a common "learning factor" (20, 21), and animal studies have revealed even
greater specificity (cf., e.g., 14, 16, 17). But such specificity,
if further corroborated, is an empirically observed fact whose
discovery is useful in its own right in advancing our knowledge
of behavior; it should not be construed as a weakness of the
tests.
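A minimal sketch may indicate how a common factor is inferred from intercorrelations alone. It rests on two assumptions not stated in the text: that a single common factor accounts for the correlations among three tasks, so that each correlation is the product of the two loadings concerned, and that the loading of a test on that factor is taken as its factorial validity. The three correlations and the task names are invented, and `loading_from_triad` is a hypothetical helper.

```python
from math import sqrt


def loading_from_triad(r_jk: float, r_jl: float, r_kl: float) -> float:
    """Loading of test j on a single common factor, assuming the three
    correlations arise from that factor alone (r_jk = a_j * a_k)."""
    return sqrt(r_jk * r_jl / r_kl)


# Invented intercorrelations among three learning tasks.
r_maze_discrimination = 0.48
r_maze_problem_box = 0.40
r_discrimination_problem_box = 0.30

maze_loading = loading_from_triad(
    r_maze_discrimination, r_maze_problem_box, r_discrimination_problem_box
)
print(round(maze_loading, 2))  # 0.8
```

The loading is derived wholly from relationships among the tested behavior samples. Where the observed correlations do not fit a single-factor pattern of this kind, no such loading can be meaningfully computed, and the area of prediction must be narrowed accordingly.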
Whether we are dealing with common factors and "factorial
validity" or with "practical validity" in the prediction of everyday-life criteria, the question of validity concerns essentially
the interrelationships of behavior samples. In the latter case,
one sample is represented by the test and another, probably
much more extensive sample, by the criterion. In the former
case, the different tests which are correlated constitute the
behavior samples. Nor should the terminology of factor analysis
mislead us into the belief that anything external to the tested
behavior has been identified. The discovery of a "factor" means
simply that certain relationships exist between tested behavior
samples.
The common misconception that the criterion is in some
mysterious fashion more basic than the test probably results,
in part, from the belief that tests measure hypothetical "underlying capacities" which are distinguishable from observed behavior. Discussions of psychological tests often become hopelessly entangled because of the implicit supposition that tests
can be validated against such underlying capacities as criteria.
Any operational analysis of actual validation procedures reveals the futility and absurdity of such an expectation.
In this connection we may consider a monograph by Thomas
(13), which sounds a note of acute pessimism regarding the
use of mental tests as "instruments of science." Through a
careful and systematic logical analysis, the author demonstrates the fallacies inherent in any attempts to interpret psychological tests as measures of "innate abilities," hypostatized
"fundamental human capacities," and the like. He clearly recognizes that "the methodology of mental testing provides
no way of operationally defining an ability and a performance
as distinct... entities" (13, p. 75). But, in his final conclusions,
the author seems to exhibit the same confusions which he had
previously sought to eliminate.3 For example, in the attempt
to evaluate the scientific usefulness of psychological tests, he
raises such questions as the following: "Do two identical scores
mean that the same kind and amount of psychological processes
were employed? Do they mean similar sociological backgrounds
of experience? Do they mean a qualitatively similar adaptation
to the immediate test environment? Do they mean that comparable amounts of psychic tension were built up or that similar
amounts of nervous energy were expended?" (13, p. 77). By
way of reply he adds: "The achievement of such scientific
meanings as these from the current methodology of mental
testing is probably too much to expect, for test results at
present are notoriously ambiguous in what they signify about
the socio-psychological ingredients of the recorded performances"
(13, p. 77).
3 These confusions in the fundamental
argument do not detract from the value
of certain more specific points discussed in this monograph, such as the limitations of
ordinal scales, and the concepts of difficulty value and homogeneity in test construction.
But these problems have also been analyzed by other writers, in a somewhat more
constructive manner (cf., e.g., 3, 10).
Two weaknesses are apparent in such an argument. First,
the testing of behavior is being confused with an analysis of
the factors which determine behavior. Secondly, despite his
earlier advocacy of an operational definition of "ability," the
author now appears to be chasing the will-o’-the-wisp of "psychological processes" which are distinct from performance. He
seems thus to be demanding that in order to be proper instruments of science, psychological tests should measure functions
which by definition fall outside the domain of scientific inquiry!
In summary, it is urged that test scores be operationally
defined in terms of empirically demonstrated behavior relationships. If a test has been validated against a practical criterion
such as school performance, the scores on such a test should
be consistently defined and treated as predictors of school
performance rather than as measures of hypostatized and unverifiable "abilities." It is further pointed out that conditions
which affect test scores may also affect the criterion, since both
test scores and criteria are essentially behavior samples. The
extent or breadth of such influences is a matter for empirical
determination, rather than for a priori assumption. Moreover,
the validity of a psychological test should not be confused
with an analysis of the factors which determine the behavior
under consideration. Finally, it should be noted that the distinction between test and criterion is itself merely one of practical convenience. The scientific use of tests is not predicated
upon the assumption that criteria are a separate class of phenomena against which all tests must first be validated. Essentially, generalization and prediction in psychology require
knowledge of the interrelationships of behavior, regardless of
the situation in which such behavior was observed.
REFERENCES
1. Anastasi, A. "The Influence of Practice upon Test Reliability." Journal of Educational Psychology, XXV (1934), 321-335.
2. Anastasi, A. and Foley, J. P., Jr. "A Proposed Reorientation in the Heredity-Environment Controversy." Psychological Review, LV (1948), 239-249.
3. Coombs, C. H. "Some Hypotheses for the Analysis of Qualitative Variables." Psychological Review, LV (1948), 167-174.
4. Cronbach, L. J. "Test 'Reliability': Its Meaning and Determination." Psychometrika, XII (1947), 1-16.
5. Dunlap, J. W. "Comparable Tests and Reliability." Journal of Educational Psychology, XXIV (1933), 442-453.
6. Goodenough, F. L. "A Critical Note on the Use of the Term 'Reliability' in Mental Measurement." Journal of Educational Psychology, XXVII (1936), 173-178.
7. Guilford, J. P. "New Standards for Test Evaluation." EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT, VI (1946), 427-438.
8. Jenkins, J. G. "Validity for What?" Journal of Consulting Psychology, X (1946), 93-98.
9. Kuhlmann, F. Tests of Mental Development. Minneapolis: Educational Test Bureau, 1939.
10. Loevinger, J. "A Systematic Approach to the Construction and Evaluation of Tests of Ability." Psychological Monographs, LXI (1947), No. 4.
11. Paulsen, C. B. "A Coefficient of Trait Variability." Psychological Bulletin, XXVIII (1931), 218-219.
12. Skaggs, E. B. "Some Critical Comments on Certain Prevailing Concepts Used in Mental Testing." Journal of Applied Psychology, XI (1927), 503-508.
13. Thomas, L. G. "Mental Tests as Instruments of Science." Psychological Monographs, LIV (1942), No. 3.
14. Thorndike, R. L. "Organization of Behavior in the Albino Rat." Genetic Psychology Monograph, XVII (1935), No. 1.
15. Thouless, R. H. "Test Unreliability and Functional Fluctuation." British Journal of Psychology, XXVI (1935-1936), 325-343.
16. Van Steenberg, N. J. F. "Factors in the Learning Behavior of the Albino Rat." Psychometrika, IV (1939), 179-200.
17. Vaughn, C. L. "Factors in Rat Learning: An Analysis of the Intercorrelations Between 34 Variables." Psychological Monographs, XIV (1937), No. 69.
18. Wherry, R. J. and Gaylord, R. H. "The Concept of Test and Item Reliability in Relation to Factor Pattern." Psychometrika, VIII (1943), 247-264.
19. Woodrow, H. "Quotidian Variability." Psychological Review, XXXIX (1932), 245-256.
20. Woodrow, H. "The Relation Between Abilities and Improvement with Practice." Journal of Educational Psychology, XXIX (1938), 215-230.
21. Woodrow, H. "Factors in Improvement with Practice." Journal of Psychology, VII (1939), 55-70.