THE CONCEPT OF VALIDITY IN THE INTERPRETATION OF TEST SCORES

ANNE ANASTASI
Fordham University

If asked to define "validity," most psychologists would probably agree that validity is the closeness of agreement of a test with some independently observed criterion of the behavior under consideration. It is only as a measure of a specifically defined criterion that a test can be objectively validated at all. For example, unless we define "intelligence" as that combination of aptitudes required for successful school achievement, or for survival on a certain type of job, or in terms of some other observable criterion, we can never either prove or disprove that a particular test is a valid measure of "intelligence." The criterion may be expressed in very broad and general terms, such as "those behavior characteristics in which older children in our culture differ from younger children reared in the same culture," but, however expressed, it defines the functions measured by the particular test. To claim that a test measures anything over and above its criterion is pure speculation of the type that is not amenable to verification and hence falls outside the realm of experimental science. To the question, "What does this test measure?", the only defensible answer can thus be that it measures a sample of behavior which in turn may be diagnostic of the criterion or criteria against which the particular test was validated.

Nor is there any circularity implicit in such a definition of validity, since a psychological test is a device for determining within a relatively short period of time what could otherwise be discovered only by means of a prolonged follow-up. For example, with a psychological test we may be able to predict within a certain margin of error which applicants will succeed on a given job or which students will be able to complete a medical course satisfactorily.
Logically, the same information could have been obtained, or even more precisely, by hiring all job applicants and admitting to medical school all students wishing to enroll, and observing the subsequent performance of each subject. The latter procedure is obviously so time-consuming and wasteful, however, as to be completely impracticable. Hence the tests make a real contribution in permitting predictions in advance of lengthy observations.

EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT. Downloaded from epm.sagepub.com at PENNSYLVANIA STATE UNIV on September 13, 2016.

Another advantage of standardized psychological tests is that they make possible a comparison of the individual’s performance with that of other persons who have been observed in the same sample situation represented by each test. In other words, the tests provide norms for evaluating individual performance.

Prediction and comparison with norms represent valuable contributions which psychological tests can render to our knowledge of individual behavior, the practical benefits of these contributions having been widely demonstrated. It is of fundamental importance, however, to bear in mind that psychological tests do not provide a different kind of information from that obtained by any other observation of behavior. The use of such labels as "intelligence," "aptitude," "capacity," and "potentiality" has probably done much to make test users lose sight of the empirical validation of tests. A number of current disagreements regarding the interpretation of test results and the susceptibility of tested abilities to training may be traceable to a failure to take due cognizance of validation procedures.

Many test users apparently give only preliminary and possibly perfunctory attention to validation data, in order to reassure themselves at the outset that the test is "satisfactory." Their interpretation of the scores obtained with such a test, however, often takes no account of the validation data and is expressed in terms which bear little or no relation to the criterion.

Perhaps one of the most common examples of such an inconsistent treatment of test validity is provided by what we may call the argument of "extenuating circumstances." Let us suppose that a child obtains an IQ of 58 on a verbal intelligence test, and that the examiner subsequently finds evidence of a fairly severe language handicap in this child owing to foreign parentage. It is a common practice to conclude in such a case that the obtained IQ is not "valid," on the grounds that the verbal content of the test rendered it unsuitable for testing such an individual. At this point we may inquire, however, "On the basis of what criterion is this IQ invalid?" Certainly the obtained IQ may be a valid measure of the behavior defined by the criterion against which the particular test was validated. It is very likely that the same language handicap which interfered with performance on this test will interfere with the child’s behavior in other linguistic situations of which this test is an adequate index. The correspondence with the criterion may thus be just as close for this child as for children without a language handicap. In school, for example, the language handicap would probably interfere with the child’s acquisition of important skills and information. The resulting academic backwardness, together with the original language handicap itself, would, in turn, affect certain aspects of job performance and other areas of adult activities.
Conversely, any remedial efforts designed to eliminate the language handicap would produce an improvement, not only in the tested IQ, but also in the broader area of behavior of which this test is a predictor.

It should be added parenthetically that language handicap has been chosen as an example only for purposes of discussion. A number of other "extenuating circumstances," such as visual or auditory defects, emotional and motivational factors, inadequate schooling, and the like, could have served equally well to illustrate the point. Similarly, the discussion has been limited to intelligence tests, since it is chiefly in connection with these tests that many confusions regarding validity have arisen. The entire discussion applies equally well, however, to all types of psychological tests.

Specifically, how does the case cited in our illustration, as well as others of its type, differ from those in which no question is raised regarding the "validity" of the test or its applicability to the particular individual? First, in the present case the examiner has direct and certain knowledge regarding at least one of the factors which determine the subject’s subnormal performance, viz., language handicap. In other cases, the principal determining factor might be inferior schooling facilities, parental illiteracy, cerebral birth injuries, a defective thyroid, or any of a large number of psychological or biological conditions. Yet it is doubtful whether the IQ would be considered "invalid" in all of these cases simply because it proved possible to point to a specific condition as the determining factor in the poor test performance. To be sure, in many cases of low IQ, the examiner has little or no knowledge about the circumstances or conditions which lead to the intellectual backwardness.
But such ignorance is obviously no more conducive to "valid" testing. Quite apart from the question of validity, the examiner should, of course, make every effort to understand why the individual performs as he does on a test. The fullest possible knowledge of the individual’s pre- and postnatal environment, structural deficiencies, and any other relevant conditions in his reactional biography is desirable for the most effective use of the test data. But to explain why an individual scores poorly on a test does not "explain away" the score. There are always reasons to account for an individual’s performance on a test. Language handicap is just as real as any other reason.

A second distinguishing feature of our example is that such a language handicap is usually remediable. The individual need not be permanently backward in intellectual performance, but with special training he may in large measure compensate for past losses in intellectual progress. Susceptibility to treatment is, however, a matter of degree. Many of the conditions determining intellectual performance, whether structural or functional, are amenable to change under special treatment. Moreover, conditions for which no effective therapy is now known may yield to newly developed treatments in the future. The distinction in terms of remediability is thus rather tenuous. Nor does such a distinction have any direct bearing upon the validity of a measuring instrument. A thermometer may be a valid index of fever, despite the fact that the administration of medicine will cure the fever.
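The reasoning of the preceding paragraphs can be illustrated with a small numerical sketch in Python. Nothing in it comes from an actual validation study: the eight skill levels, the `OTHER` influences on the criterion, the 0.5 weighting, and the 30-point handicap are all hypothetical values chosen for illustration. The sketch assumes that a language handicap lowers the verbal behavior on which both test and criterion draw, and checks whether the test then corresponds to the criterion about as closely as before.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical verbal skill of eight children, and hypothetical influences on
# the criterion (e.g. school performance) that the test does not sample.
SKILL = [85, 90, 95, 100, 105, 110, 115, 120]
OTHER = [3, -4, 2, -1, 4, -3, 1, -2]
WEIGHT = 0.5  # assumed weight of verbal skill in the criterion

def observe(skill):
    """Return (test scores, criterion scores) for the given skill levels."""
    test = list(skill)
    criterion = [WEIGHT * s + o for s, o in zip(skill, OTHER)]
    return test, criterion

# No handicap anywhere in the group.
test, criterion = observe(SKILL)
r_without = pearson_r(test, criterion)

# Children 0 and 4 acquire a language handicap that lowers the verbal
# behavior sampled by BOTH the test and the criterion (assumed 30 points).
handicapped = list(SKILL)
for child in (0, 4):
    handicapped[child] -= 30
test_h, criterion_h = observe(handicapped)
r_with = pearson_r(test_h, criterion_h)

# Error in predicting child 0's criterion from the test, before and after.
error_before = criterion[0] - WEIGHT * test[0]
error_after = criterion_h[0] - WEIGHT * test_h[0]

print(f"validity without handicap: {r_without:.2f}")
print(f"validity with handicap:    {r_with:.2f}")
print(f"prediction error for child 0: {error_before} -> {error_after}")
```

Under these assumptions the low score is not "explained away": it forecasts a correspondingly low criterion score, and removing the handicap would raise test and criterion together.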
Thirdly, some may point out that language handicap is not hereditary and may maintain that for this reason its influence upon test performance ought to be "ruled out." Such an objection contains a tacit assumption that psychological tests are primarily concerned with those individual differences in behavior which can be attributed to heredity. Since the number of hereditary conditions which have been clearly related to behavior differences is extremely small, such a policy, if followed consistently, would mean the virtual cessation of psychological testing. Moreover, the connection between hereditary mechanisms and behavior is so remote and indirect as to render the distinction between hereditary and environmental factors in behavior largely an academic one (cf., e.g., 2). Above all, it should be noted that no criterion against which any psychological test has been validated is itself traceable to purely hereditary factors. Hence no such test has been proved to be a valid measure of individual differences in hereditary characteristics.

A fourth point to be considered is that of comparability. It may be objected that the individual who is handicapped by language difficulties, sensory deficiencies, or similar "extenuating circumstances" is not comparable to the validation group on which the test norms were established. The requirement of comparability in the application of psychological tests needs further clarification. If individuals are entirely similar in all of the conditions (psychological, physiological, etc.) which influence the behavior measured by a particular test, individual differences will disappear, all subjects receiving the same score. Obviously no test is designed to measure behavior independently of the conditions which determine such behavior; that would be a logical absurdity as well as an empirical impossibility.
When the conditions in which the individual differs from the standardization group affect the test and the criterion in an approximately equal manner and degree, the validity of the test for that individual will not be appreciably influenced by the lack of comparability of the individual to the standardization group. This question of "comparability" pertains not so much to the measurement of behavior as to the analysis of the etiology of behavior differences. It is only when attributing the observed individual differences in test scores to a particular factor or class of factors that the investigator must make certain that other contributing factors have been reasonably constant. For example, if a few individuals in a group have a language handicap while the rest do not, we could not ascribe individual differences in performance within this group to structural differences in the nervous system, or to any other factor whose contribution to behavior we may be investigating. The same limitation would apply, however, if educational opportunities, family traditions, incentives for intellectual activities, or any other factor were not held constant. The fact that the influence of language handicap, sensory deficiencies, and a few other conditions is more readily apparent does not place such conditions in a different category. The question of comparability applies equally to all conditions other than the one under investigation.

A fifth consideration pertains to the use of test scores in prediction. Could an IQ obtained by a child with a language handicap serve as a basis for predicting the subsequent behavior of the individual? As long as the language handicap remains, the test score can provide an accurate prognosis of the child’s behavior in situations demanding the type of verbal responses sampled by the test. It is only in this sense that any psychological test makes predictions possible. Within a certain margin of error, behavior can be predicted under existing conditions. But if, for example, any detrimental conditions such as poor schooling, sensory deficiencies, or the like are corrected, then performance on both test and criterion will show improvement. In discussions of test reliability, various writers during the past twenty-five years have pointed out that a psychological test should be expected to reflect changes in behavior at different times and under different conditions.¹ For test scores to remain constant when conditions affecting the subject’s behavior have altered would indicate a crude and relatively insensitive measuring instrument, rather than a highly "reliable" one. The same logic applies to validity. If the subjects’ test scores remain unchanged despite the modification of conditions which affect criterion performance, the test cannot have high validity.

¹ Cf., e.g., 1, 4, 5, 6, 9, 10, 11, 12, 15, 18, 19.

Closely related to the problem of prediction is the scope or breadth of influence of any given condition upon the individual’s behavior. For example, the presence of a loud, irregular noise during the testing would probably affect the score on that test, without influencing the individual’s behavior in other situations. A toothache or a severe cold on the day of the testing would be further illustrations of narrowly limited conditions. In the case of these conditions, the prognostic value of the test for the individual would indeed be reduced, in much the same manner that holding an ice cube in the mouth would invalidate an oral thermometer reading of bodily temperature.
Conditions such as language handicap, however, affect the individual’s behavior in a much broader area than that of the immediate test situation. They may thus influence both criterion and test score in a similar manner.

The import of the above analysis is that validity should be consistently interpreted with reference to the specific criteria against which the given test was validated. It also follows that validity is not a function of the test but of the use to which the test is put. A test may have high validity for one criterion and low or negligible validity for another. The attitude that a good test has "high validity" and a poor test has "low validity" is still too prevalent among test users.

Tests cannot be validated in the abstract, nor is the usual concept of validity itself universally applicable to psychological testing. It is only when tests are employed for predictive or diagnostic purposes that the correlation with an external criterion is relevant at all. In many investigations concerned with fundamental behavior research, tests are employed merely as behavior samples obtained under standardized (i.e., uniform) conditions, without reference to the correlations of these samples with other, "everyday-life" behavior samples (i.e., practical criterion measures). When the maze-learning behavior of white rats is tested, for example, the maze is not first "validated" against the rats’ success in finding food in a grocery basement, or their ability to avoid contact with prowling cats, or any other criteria of achievement in the rats’ extra-laboratory or workaday world. The investigator may quite reasonably argue that for the study of the particular principles of behavior which he is investigating, maze-learning is as "good" a sample of behavior as cat-avoiding, and that he has no more reason for validating the former against the latter than vice versa.
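The statement that a test may have high validity for one criterion and negligible validity for another can be shown with a few invented numbers. In this Python sketch the test scores and the two criteria, labeled `school_grades` and `sales_volume`, are hypothetical and were constructed for illustration, not taken from any study. The validity coefficient is computed here as the ordinary product-moment correlation between test and criterion, which is one common way of expressing it.

```python
import math

def validity_coefficient(test, criterion):
    """Product-moment correlation between test scores and a criterion."""
    n = len(test)
    mean_t, mean_c = sum(test) / n, sum(criterion) / n
    cov = sum((t - mean_t) * (c - mean_c) for t, c in zip(test, criterion))
    var_t = sum((t - mean_t) ** 2 for t in test)
    var_c = sum((c - mean_c) ** 2 for c in criterion)
    return cov / math.sqrt(var_t * var_c)

# One set of test scores for eight hypothetical persons.
test_scores = [10, 12, 14, 16, 18, 20, 22, 24]

# Two hypothetical criteria observed for the same eight persons.
school_grades = [52, 55, 61, 60, 68, 70, 75, 79]  # rises with the test
sales_volume = [40, 34, 38, 36, 36, 38, 34, 40]   # unrelated to the test

validity_for_grades = validity_coefficient(test_scores, school_grades)
validity_for_sales = validity_coefficient(test_scores, sales_volume)

print(f"validity against school grades: {validity_for_grades:.2f}")
print(f"validity against sales volume:  {validity_for_sales:.2f}")
```

The same set of test scores thus yields two quite different validity coefficients, and neither can be called "the" validity of the test apart from the criterion named.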
Fundamentally, any validation procedure provides a measure of the relationship between two behavior samples. As Guilford has recently expressed it, "In a very general sense, a test is valid for anything with which it correlates" (7, p. 429). The process can be regarded as irreversible only when one of the behavior samples has greater importance than the other for a specific purpose.² In such a case, the more important behavior sample is designated the "criterion." No basic difference exists between "criteria" on the one hand and "tests" on the other. They are merely different samples of behavior whose interrelationships permit predictions from one to the other. We could predict intelligence test scores from school achievement, although the process would be needlessly time-consuming. In such a case, the intelligence test scores would constitute the criterion.

The criterion is not intrinsically superior in any sense. It is well known, for example, that many commonly used criteria, such as school grades or job advancement, may be influenced by many factors "extraneous" to the quality of the individual’s performance. Yet, if it is our object to predict such criteria, with all their irrelevancies and shortcomings, then the correlation of a given test with such criteria is the validity of the test in that situation.

To be sure, the immediate criterion against which a test is validated may itself have been chosen as a convenient index or predictor of a broader and less readily observable area of behavior. For example, a pilot aptitude test may be validated against performance in basic flight training, the latter being in turn regarded as an approximate index of achievement in more advanced training and even possibly of ultimate combat performance.
Such "successive validation" would be quite consistent with the relativity of predictors and criteria. It might be noted parenthetically that it is only when criterion measures are themselves used as predictors of further behavior that one may legitimately speak of the reliability and validity of the criterion itself (cf., e.g., 8).

² To be sure, when the relationship between the two variables is curvilinear, prediction will not be equally accurate in both directions, since $\eta_{xy} \neq \eta_{yx}$. In such cases, however, there is no a priori reason to expect that the correlation will be any higher when predicting the "criterion" from the "test" than when predicting the "test" from the "criterion."

Validation against a "practical" criterion is essential for many uses to which tests are put. It should not be assumed, however, that only tests which have been validated against some criterion considered important within a particular cultural setting can be used in behavior research. In order to be able to generalize from any obtained test score, we need only to know the relationships between the tested behavior in question and other behavior samples, none of these behavior samples necessarily occupying the preeminent position of a criterion. Thus, if the investigator is interested in the possible use of maze-learning performance as a basis for predicting the rats’ behavior in other learning situations, he will have to correlate the subjects’ maze-learning scores with their scores in a variety of other learning tasks. If a common factor is identified through these different learning scores, the "factorial validity" (7) of any one of the tests in predicting that which is common to all of them can be determined.
On the other hand, if no single learning factor is demonstrated, then the area within which predictions can be made must be accordingly narrowed to fit the confines of whatever common factor does become evident. Investigations conducted to date on human subjects, for example, have failed to indicate the presence of a common "learning factor" (20, 21), and animal studies have revealed even greater specificity (cf., e.g., 14, 16, 17). But such specificity, if further corroborated, is an empirically observed fact whose discovery is useful in its own right in advancing our knowledge of behavior; it should not be construed as a weakness of the tests.

Whether we are dealing with common factors and "factorial validity" or with "practical validity" in the prediction of everyday-life criteria, the question of validity concerns essentially the interrelationships of behavior samples. In the latter case, one sample is represented by the test and another, probably much more extensive sample, by the criterion. In the former case, the different tests which are correlated constitute the behavior samples. Nor should the terminology of factor analysis mislead us into the belief that anything external to the tested behavior has been identified. The discovery of a "factor" means simply that certain relationships exist between tested behavior samples.

The misconception that the criterion is in some mysterious fashion more basic than the test probably results, in part, from the belief that tests measure hypothetical "underlying capacities" which are distinguishable from observed behavior. Discussions of psychological tests often become hopelessly entangled because of the implicit supposition that tests can be validated against such underlying capacities as criteria.
Any operational analysis of actual validation procedures reveals the futility and absurdity of such an expectation.

In this connection we may consider a monograph by Thomas (13), which sounds a note of acute pessimism regarding the use of mental tests as "instruments of science." Through a careful and systematic logical analysis, the author demonstrates the fallacies inherent in any attempts to interpret psychological tests as measures of "innate abilities," hypostatized "fundamental human capacities," and the like. He clearly recognizes that "the methodology of mental testing provides no way of operationally defining an ability and a performance as distinct... entities" (13, p. 75). But, in his final conclusions, the author seems to exhibit the same common confusions which he had previously sought to eliminate.³ For example, in the attempt to evaluate the scientific usefulness of psychological tests, he raises such questions as the following: "Do two identical scores mean that the same kind and amount of psychological processes were employed? Do they mean similar sociological backgrounds of experience? Do they mean a qualitatively similar adaptation to the immediate test environment? Do they mean that comparable amounts of psychic tension were built up or that similar amounts of nervous energy were expended?" (13, p. 77). By way of reply he adds: "The achievement of such scientific meanings as these from the current methodology of mental testing is probably too much to expect, for test results at present are notoriously ambiguous in what they signify about the socio-psychological ingredients of the recorded performances" (13, p. 77).

³ These confusions in the fundamental argument do not detract from the value of certain more specific points discussed in this monograph, such as the limitations of ordinal scales, and the concepts of difficulty value and homogeneity in test construction.
But these problems have also been analyzed by other writers, in a somewhat more constructive manner (cf., e.g., 3, 10).

Two weaknesses are apparent in such an argument. First, the testing of behavior is being confused with an analysis of the factors which determine behavior. Secondly, despite his earlier advocacy of an operational definition of "ability," the author now appears to be chasing the will-o’-the-wisp of "psychological processes" which are distinct from performance. He seems thus to be demanding that in order to be proper instruments of science, psychological tests should measure functions which by definition fall outside the domain of scientific inquiry!

In summary, it is urged that test scores be operationally defined in terms of empirically demonstrated behavior relationships. If a test has been validated against a practical criterion such as school performance, the scores on such a test should be consistently defined and treated as predictors of school performance rather than as measures of hypostatized and unverifiable "abilities." It is further pointed out that conditions which affect test scores may also affect the criterion, since both test scores and criteria are essentially behavior samples. The extent or breadth of such influences is a matter for empirical determination, rather than for a priori assumption. Moreover, the validity of a psychological test should not be confused with an analysis of the factors which determine the behavior under consideration. Finally, it should be noted that the distinction between test and criterion is itself merely one of practical convenience. The scientific use of tests is not predicated upon the assumption that criteria are a separate class of phenomena against which all tests must first be validated.
Essentially, generalization and prediction in psychology require knowledge of the interrelationships of behavior, regardless of the situation in which such behavior was observed.

REFERENCES

1. Anastasi, A. "The Influence of Practice upon Test Reliability." Journal of Educational Psychology, XXV (1934), 321-335.
2. Anastasi, A. and Foley, J. P., Jr. "A Proposed Reorientation in the Heredity-Environment Controversy." Psychological Review, LV (1948), 239-249.
3. Coombs, C. H. "Some Hypotheses for the Analysis of Qualitative Variables." Psychological Review, LV (1948), 167-174.
4. Cronbach, L. J. "Test 'Reliability': Its Meaning and Determination." Psychometrika, XII (1947), 1-16.
5. Dunlap, J. W. "Comparable Tests and Reliability." Journal of Educational Psychology, XXIV (1933), 442-453.
6. Goodenough, F. L. "A Critical Note on the Use of the Term 'Reliability' in Mental Measurement." Journal of Educational Psychology, XXVII (1936), 173-178.
7. Guilford, J. P. "New Standards for Test Evaluation." EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT, VI (1946), 427-438.
8. Jenkins, J. G. "Validity for What?" Journal of Consulting Psychology, X (1946), 93-98.
9. Kuhlmann, F. Tests of Mental Development. Minneapolis: Educational Test Bureau, 1939.
10. Loevinger, J. "A Systematic Approach to the Construction and Evaluation of Tests of Ability." Psychological Monographs, LXI (1947), No. 4.
11. Paulsen, C. B. "A Coefficient of Trait Variability." Psychological Bulletin, XXVIII (1931), 218-219.
12. Skaggs, E. B. "Some Critical Comments on Certain Prevailing Concepts Used in Mental Testing." Journal of Applied Psychology, XI (1927), 503-508.
13. Thomas, L. G. "Mental Tests as Instruments of Science." Psychological Monographs, LIV (1942), No. 3.
14. Thorndike, R. L. "Organization of Behavior in the Albino Rat." Genetic Psychology Monograph, XVII (1935), No. 1.
15. Thouless, R. H. "Test Unreliability and Functional Fluctuation." British Journal of Psychology, XXVI (1935-1936), 325-343.
16. Van Steenberg, N. J. F. "Factors in the Learning Behavior of the Albino Rat." Psychometrika, IV (1939), 179-200.
17. Vaughn, C. L. "Factors in Rat Learning: An Analysis of the Intercorrelations Between 34 Variables." Psychological Monographs, XIV (1937), No. 69.
18. Wherry, R. J. and Gaylord, R. H. "The Concept of Test and Item Reliability in Relation to Factor Pattern." Psychometrika, VIII (1943), 247-264.
19. Woodrow, H. "Quotidian Variability." Psychological Review, XXXIX (1932), 245-256.
20. Woodrow, H. "The Relation Between Abilities and Improvement with Practice." Journal of Educational Psychology, XXIX (1938), 215-230.
21. Woodrow, H. "Factors in Improvement with Practice." Journal of Psychology, VII (1939), 55-70.