
A Comparison of Free-Response and Multiple-Choice Forms of Verbal Aptitude Tests
William C. Ward
Educational Testing Service
Three verbal item types employed in standardized aptitude tests were administered in four formats: a conventional multiple-choice format and three formats requiring the examinee to produce rather than simply to recognize correct answers. For two item types, Sentence Completion and Antonyms, the response format made no difference in the pattern of correlations among the tests. Only for a multiple-answer open-ended Analogies test were any systematic differences found; even the interpretation of these is uncertain, since they may result from the speededness of the test rather than from its response requirements. In contrast to several kinds of problem-solving tasks that have been studied, discrete verbal item types appear to measure essentially the same abilities regardless of the format in which the test is administered.
Tests in which an examinee must generate answers may require different abilities than do tests in which it is necessary only to choose among alternatives that are provided. A free-response test of behavioral science problem solving, for example, was found to have a very low correlation with a test employing similar problems presented in a machine-scorable (modified multiple-choice) format; it differed from the latter in its relations to a set of reference tests for cognitive factors (Ward, Frederiksen, & Carlson, 1980). Comparable differences were obtained between free-response and machine-scorable tests employing nontechnical problems, which were designed to simulate tasks required in making medical diagnoses (Frederiksen, Ward, Case, Carlson, & Samph, 1981).

There is also suggestive evidence that the use of free-response items could make a contribution in standardized admissions testing. The open-ended behavioral science problems were found to have some potential as predictors of the professional activities and accomplishments of first-year graduate students in psychology; the Graduate Record Examination Aptitude and Advanced Psychology tests are not good predictors of such achievements (Frederiksen & Ward, 1978).
Problem-solving tasks like these, however, provide very inefficient measurement. They require a large investment of examinee time to produce scores with acceptable reliability, and they yield complex responses, the evaluation of which is demanding and time consuming. It was the purpose of the present investigation to explore the effects of an open-ended format with item types like those used in conventional examinations. The content area chosen was verbal knowledge and verbal reasoning, as represented by three item types: Antonyms, Sentence Completion, and Analogies.

The selection of these item types has several bases. First, their relevance for aptitude assessment needs
no special justification, given that they make up one-half of verbal ability tests such as the Graduate Record Examination and the Scholastic Aptitude Test (SAT). Thus, if it can be shown that recasting these item types into an open-ended format makes a substantial difference in the abilities they measure, a strong case will be made for the importance of the response format in determining the mix of items that enter into such tests.
Second, such items produce reliable scores with relatively short tests. Finally, open-ended forms of these item types require only single-word or, in the case of Analogies, two-word answers. They should thus be easy to score, in comparison with free-response problems whose responses may be several sentences in length and may embody two or three ideas. Although not solving the difficulties inherent in the use of open-ended items in large-scale testing, therefore, they would to some extent serve to reduce their magnitude.
Surprisingly, no published comparisons of free-response and multiple-choice forms of these item types are available. Several investigators have, however, examined the effects of response format on Synonyms items, items in which the examinee must choose or generate a word with essentially the same meaning as a given word (Heim & Watts, 1967; Traub & Fisher, 1977; Vernon, 1962). All found high correlations across formats, but only Traub and Fisher attempted to answer the question of whether the abilities measured in the two formats were identical or only related. They concluded that the test format does affect the attribute measured by a test and that there was some evidence of a factor specific to open-ended verbal items. Unfortunately, they did not have scores on a sufficient variety of tests to provide an unambiguous test for the existence of such a factor.
The present study was designed to allow a factor-analytic examination of the influence of response format. Each of three item types was presented in each of four formats, which varied in the degree to which they require the production of answers. It was thus possible to examine the fit of the data to each of two "ideal" patterns of factor structure: one in which only item-type factors would be found, indicating that tests of a given type measure essentially the same thing regardless of the format; and one involving only format factors, indicating that the response requirements of the task are of greater importance than are differences in the kind of knowledge tested.
Method
Design of the Tests
Three item types were employed. Antonyms items (when given in the standard multiple-choice format) required the examinee to select the one of five words that was most nearly opposite in meaning to a given word. Sentence Completions required the identification of the one word which, when inserted into a blank space in a sentence, best fit the meaning of the sentence as a whole. Analogies, finally, asked for the selection of the pair of words expressing a relationship most similar to that expressed in a given pair.

Three formats in addition to the multiple-choice one were used. For Antonyms, for example, the "single-answer" format required the examinee to think of an opposite and to write that word in an answer space. The "multiple-answer" format was still more demanding: the examinee was to think of and write up to three different opposites for each word given. Finally, the "keylist" format required the examinee to think of an opposite, to locate this word in a 90-item alphabetized list, and to record its number on the answer sheet. This latter format was included as a machine-scorable surrogate for a truly free-response test.

With two exceptions, all open-ended items were ones requiring single-word answers. The exceptions were the single-answer and multiple-answer Analogies tests. Here the examinee
was
to produce pairs of words having the
same relationship to one another as that shown
by the two words in the stem of the question.
Instructions for each test closely paraphrased those employed in the GRE Aptitude Test, except as dictated by the specific requirements of each format. With each set of instructions was given one sample question and a brief rationale for the answer or answers suggested. For the open-ended tests, two or three fully acceptable answers were suggested for each sample question.

The tests were presented in a randomized order, subject to the restriction that no two successive tests should employ either the same item type or the same response format. Four systematic variations of this order were employed to permit an examination of, and adjustment for, possible practice or fatigue effects. Each of the first four groups tested, including 51 to 60 subjects, received the tests in one of these sequences; the remainder of the sample, tested in groups of 30 to 40, were all given the tests in the first of the four orders.

The tests varied somewhat in number of items and in time limits. Each multiple-choice test consisted of 20 items to be completed in 12 minutes. Slightly longer times (15 minutes) were allowed for forms including 20 single-answer or 20 keylist items. The multiple-answer tests allowed still more time per item: 15 minutes for 15 Antonyms or Analogies or for 18 Sentence Completion items. On the basis of extensive pretesting, it was expected that these time limits would be sufficient to avoid problems of test speededness and that the numbers of items would be sufficient to produce scores with reliabilities on the order of .7.
Test Administration
Subjects were 315 paid volunteers from a state university. Slightly more than two-thirds were juniors and seniors. The small number (13%) for whom GRE Aptitude Test scores were obtained were a somewhat select group, with Verbal and Analytical means of 547 and 616, respectively. It appears that the sample is a somewhat more able one than college students in general but probably less select than the graduate school applicant pool.

Each student participated in one 4-hour testing session. Included in the session were 12 tests representing all combinations of the three item types with the four response formats, and a brief questionnaire relating to the student's academic background, accomplishments, and interests.
Scoring

For each of the open-ended tests, scoring keys were developed that distinguished two degrees of appropriateness of an answer. Answers in one set were judged fully acceptable, while those in the second were of marginal appropriateness. An example of the latter would be an Antonyms response that identified the evaluation implied by a word but failed to capture an important nuance or the force of the evaluation. It was found through a trial scoring that partial credits were unnecessary for two of the keylist tests, Antonyms and Analogies. Responses to the remaining tests were coded to permit computer generation of several different scores, depending on the credit to be given marginally acceptable answers.

Preliminary scoring keys were checked for completeness by an examination of about 20% of the answer sheets. Most of the tests were then scored by a highly experienced clerk and checked by her supervisor. Two tests, however, presented more complex scoring problems. For both single-answer and multiple-answer Analogies, the scoring keys consisted of rationales and examples rather than a list of possible answers. Many scoring decisions therefore involved a substantial exercise of judgment. A research assistant scored each of these tests, and the author independently scored 25 answer sheets of each. Total scores derived from the two
scorings correlated .95 for one test and .97 for the other.
Results
Adequacy of data. No instances were found in which subjects appeared not to take their task seriously. Three answer sheets were missing or spoiled; sample mean scores were substituted for these. On 32 occasions a subject failed to attempt at least half the items on a test, but no individual subject was responsible for more than two of these. It appeared that data from all subjects were of acceptable quality.
Score derivation. The three multiple-choice tests were scored using a standard correction for guessing: for a five-choice item, the score was the number correct minus one-fourth the number incorrect. Two of the keylist tests were simply scored for number correct. It would have been possible to treat those tests as 90-alternative multiple-choice tests and to apply the guessing correction, but the effect on the scores would have been of negligible magnitude.
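For concreteness, the correction just described is the familiar formula-scoring rule, R minus W/(k - 1), with k = 5 choices. The short sketch below is purely illustrative; it is not drawn from the study's scoring programs, and the function name and values are invented.

```python
def formula_score(correct: int, incorrect: int, n_choices: int = 5) -> float:
    """Standard correction for guessing: rights minus wrongs / (k - 1).

    Omitted items are neither rewarded nor penalized, so only the counts
    of correct and incorrect responses enter the score.
    """
    return correct - incorrect / (n_choices - 1)

# Example: 14 right, 4 wrong, 2 omitted on a 20-item five-choice test.
print(formula_score(correct=14, incorrect=4))  # 13.0
```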
For the remaining tests, scores were generated
in several ways. In one, scoring credit was given
only for answers deemed fully acceptable; in a
second, the same credit was given to both fully
and marginally acceptable answers; and in a
third, marginal answers received half the credit
given to fully acceptable ones. This third approach was found to yield slightly more reliable
scores than either of the others and was therefore employed for all further analyses.
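The three scoring rules amount to weighting marginally acceptable answers by 0, 1, or 0.5 before summing. A minimal sketch follows, with a hypothetical coding of responses (2 = fully acceptable, 1 = marginal, 0 = unacceptable); the half-credit weight corresponds to the rule retained in the study.

```python
import numpy as np

# Hypothetical coded responses for one examinee, one code per item.
responses = np.array([2, 1, 0, 2, 2, 1, 0, 1])

def score(coded, marginal_weight):
    """Sum item credits, weighting marginal answers by `marginal_weight`."""
    full = (coded == 2).sum()
    marginal = (coded == 1).sum()
    return full + marginal_weight * marginal

strict      = score(responses, 0.0)   # credit only fully acceptable answers
lenient     = score(responses, 1.0)   # full credit for marginal answers too
half_credit = score(responses, 0.5)   # the rule adopted in the study
print(strict, lenient, half_credit)   # 3.0 6.0 4.5
```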
Test order. Possible differences among
groups receiving the tests in different orders
were examined in two ways. One analysis was
concerned with the level of performance; another considered the standard error of measurement, a statistic that combines information
about both the standard deviation and the reliability of a test score and that indicates the precision of measurement. In neither case were there
systematic differences associated with the order
in which the tests were administered. Order was therefore ignored in all further analyses.
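As a reminder, the standard error of measurement combines the two quantities mentioned above as the standard deviation times the square root of (1 - reliability). A small sketch with invented values follows.

```python
import math

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - reliability); smaller values mean more precise scores."""
    return sd * math.sqrt(1.0 - reliability)

# Illustrative values only: a score with SD = 4.0 and coefficient alpha = .70.
print(round(standard_error_of_measurement(4.0, 0.70), 2))  # 2.19
```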
Test difficulty. Test means and standard deviations are shown in Table 1. Most of the tests were of middle difficulty for this sample; two of the keylist tests were easy, whereas multiple-choice Antonyms was very difficult. Means for the multiple-answer tests were low in relation to the maximum possible score but represent one to one-and-a-half fully acceptable answers per item.
Test speededness. Tests such as the GRE Aptitude Test are considered unspeeded if at least 75% of the examinees attempt all items and if virtually everyone attempts at least three-fourths of the items. By these criteria only one of the tests, multiple-answer Analogies, had any problems with speededness: About 75% of the sample reached the last item, but 14% failed to attempt the 12th item, which represents the three-fourths point. For all the remaining tests, 95% or more of the subjects reached at least all but the final two items. Table 1 shows the percent of the sample completing each test.
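The two speededness criteria quoted above can be stated as simple proportions of examinees reaching given items. The sketch below uses made-up completion data, and the 95% cutoff standing in for "virtually everyone" is an assumption of this illustration, not a figure from the study.

```python
import numpy as np

def is_unspeeded(last_item_reached, n_items, min_finish=0.75, min_three_quarters=0.95):
    """Apply the two rule-of-thumb criteria for an unspeeded test.

    `last_item_reached` holds, for each examinee, the number of the last
    item attempted. The 0.95 cutoff is an assumed operationalization of
    "virtually everyone" for this example.
    """
    last = np.asarray(last_item_reached)
    frac_finishing = np.mean(last >= n_items)
    frac_three_quarters = np.mean(last >= int(np.ceil(0.75 * n_items)))
    return frac_finishing >= min_finish and frac_three_quarters >= min_three_quarters

# Invented data: most of 10 examinees finish a 16-item test.
print(is_unspeeded([16, 16, 16, 15, 16, 16, 12, 16, 16, 16], n_items=16))  # True
```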
Test reliability. Reliabilities (coefficient alpha) are also shown in Table 1. They ranged from .45 to .80, with a median of .69. There were no differences in reliabilities associated with the response format of the test; the medians ranged from .68 for multiple-choice tests to .75 for multiple-answer forms. There were differences associated with item type; medians were .75 for Antonyms, .71 for Sentence Completions, and .58 for Analogies. The least reliable of all the tests was the multiple-choice Analogies. The differences apparently represent somewhat less success in creating good Analogies items rather than any differences inherent in the open-ended formats.
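Coefficient alpha, the reliability estimate used throughout, can be computed directly from an examinee-by-item score matrix. The sketch below uses simulated data rather than the study's; it is meant only to show the computation.

```python
import numpy as np

def coefficient_alpha(item_scores: np.ndarray) -> float:
    """Cronbach's alpha for a matrix with one row per examinee, one column per item.

    alpha = k/(k-1) * (1 - sum of item variances / variance of the total score).
    """
    k = item_scores.shape[1]
    item_variances = item_scores.var(axis=0, ddof=1)
    total_variance = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_variances.sum() / total_variance)

# Simulated example: one latent ability measured by 20 noisy items.
rng = np.random.default_rng(0)
trait = rng.normal(size=(315, 1))
items = trait + rng.normal(scale=1.5, size=(315, 20))
print(round(coefficient_alpha(items), 2))  # alpha of the simulated composite (about .9)
```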
Correlations among the tests. Zero-order correlations among the 12 tests are shown in the upper part of Table 2. The correlations range from .29 to .69, with a median of .53. The seven lowest coefficients in the table, the only ones below .40, are correlations involving the multiple-answer Analogies test.
Table 1
Descriptive Statistics for Tests

Table 2
Zero-Order and Attenuated Correlations Among Tests
Note: Zero-order correlations are presented above the main diagonal, while correlations corrected for attenuation are presented below. Decimal points omitted.
Correlations corrected for attenuation are shown in the lower part of the table; the correction is based on coefficient alpha reliabilities. The correlations range from
.45 to .97 and have a median of .80.
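The correction for attenuation applied in Table 2 is the standard one: the observed correlation divided by the square root of the product of the two reliabilities. A brief sketch with illustrative numbers follows; because the reliabilities are themselves estimates, corrected values can exceed 1.0, as happens for one comparison reported later.

```python
import math

def disattenuate(r_xy: float, alpha_x: float, alpha_y: float) -> float:
    """Correct an observed correlation for unreliability in both measures."""
    return r_xy / math.sqrt(alpha_x * alpha_y)

# Illustrative values only: an observed r of .53 between two tests
# with coefficient alpha reliabilities of .69 and .75.
print(round(disattenuate(0.53, 0.69, 0.75), 2))  # 0.74
```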
These coefficients indicate that the various
tests share a substantial part of their true variance, but they do not permit a conclusion as to
whether there are systematic differences among
the tests. Three analyses that address this question
are presented below.
Factor analyses. A preliminary principal components analysis produced the set of eigenvalues displayed in Table 3. The first component was very large, accounting for 57% of the total variance, while the next largest accounted for only 7% of the variance.

Table 3
Principal Components of the Correlation Matrix

By one rule of thumb for the number of factors, that of the number of eigenvalues greater than one, there is only a single factor represented in these results. By another, that of differences in the magnitude of successive eigenvalues, there is some evidence for a second factor but none at all for more than two.

It was originally intended to use a confirmatory factor analytic approach to the analysis in order to contrast two idealized models of test relations: one involving three item-type factors and one involving four response-format factors (Jöreskog, 1970). In view of the results of the principal components analysis, however, either of these would clearly be a distortion of the data. It was decided, therefore, to use an exploratory factor analysis, which could be followed by confirmatory analyses comparing simpler models if such a comparison seemed warranted from the results. The analysis was a principal axes factor analysis with iterated communalities.
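A compact sketch of the two steps just described (eigenvalues of the correlation matrix for judging the number of factors, then principal axes factoring with iterated communalities) is given below. It is a generic illustration, not the analysis code used in the study, and the simulated test scores are invented.

```python
import numpy as np

def principal_axes(R, n_factors, n_iter=50):
    """Principal axes factoring with iterated communalities.

    Starts from squared multiple correlations as communality estimates,
    substitutes them on the diagonal of R, and repeats the eigen-decomposition
    until the communalities stabilize.
    """
    R = np.asarray(R, dtype=float)
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(R))   # initial communalities (SMCs)
    for _ in range(n_iter):
        R_reduced = R.copy()
        np.fill_diagonal(R_reduced, h2)
        eigvals, eigvecs = np.linalg.eigh(R_reduced)
        order = np.argsort(eigvals)[::-1][:n_factors]
        loadings = eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))
        h2 = (loadings ** 2).sum(axis=1)
    return loadings

# Simulated scores for 12 tests sharing one general factor (illustration only).
rng = np.random.default_rng(1)
g = rng.normal(size=(315, 1))
scores = g + rng.normal(scale=1.0, size=(315, 12))
R = np.corrcoef(scores, rowvar=False)

eigenvalues = np.linalg.eigvalsh(R)[::-1]
print(np.round(eigenvalues[:3], 2))         # size of the leading components
print(np.round(principal_axes(R, 2), 2))    # unrotated two-factor loadings
```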
A varimax (orthogonal) rotation of the two-factor solution produced unsatisfactory results: 10 of the 12 scores had appreciable loadings on both factors. The results of the oblimin (oblique) rotation for two factors are presented in Table 4. The two factors were correlated (r = .67). Ten of the 12 scores had their highest loading on Factor I, one (single-answer Analogies) divided about equally between the two, and only one (multiple-answer Analogies) had its highest loading on the second factor.

For two item types, Sentence Completion and Antonyms, these results leave no ambiguity as to the effects of response format. The use of an open-ended format makes no difference in the attribute measured by the test. The interpretation for the Analogies tests is less clear. The second factor is small (just under 5% of the common factor variance), and it is poorly defined, with only one test having its primary loading on that factor. However, the one test that did load heavily on Factor II was also the only test in the battery that was at all speeded. There is a reasonable interpretation of Factor II as a speed factor (Donlon, 1980); the rank-order correlation between Factor II loadings and the number of subjects failing to attempt the last item of a test was .80 (p < .01).
Factor analyses were also performed taking into account the academic level of the student. The sample included two groups large enough to be considered for separate analyses: seniors (N = 75) and juniors (N = 141). For each group a one-factor solution was indicated. A combined analysis was also carried out after adjusting for mean and variance differences in the data for the two groups. The eigenvalues suggested either
Table 4
Factor Pattern for Two-Factor Solution
a one- or a two-factor solution; in the two-factor solution, however, all tests had their highest loading on the first factor, with only multiple-answer Analogies showing an appreciable division of its variance between the two factors. Thus, there was no strong evidence for the existence of a format factor in the data. There were weak indications that the multiple-answer Analogies and, to a much lesser extent, the single-answer Analogies provided somewhat distinct measurement from the remainder of the tests in the battery. The evidence is clear that Sentence Completion and Antonyms item types measure the same attribute regardless of the format in which the item is administered.
Multitrait-multimethod analysis. The data may also be considered within the framework provided by multitrait-multimethod analysis (Campbell & Fiske, 1959). Each of the three item types constitutes a "trait," while each of the four response formats constitutes a "method." The data were analyzed following a scheme suggested by Goldberg and Werts (1966). All the correlations relevant for each comparison were corrected for attenuation and then averaged, using Fisher's r-to-z transformation. Results are summarized in Table 5.
Each row in the upper part of the table provides the average of all those correlations that represent relations for a given item type as measured in different formats and of all those correlations that represent relations between that item type and other item types when the two tests employ different response formats. Thus, for the Sentence Completion item type, the entry in the first column is an average of all six correlations among Sentence Completion scores from the four formats. The entry in the second column is an average of 24 correlations: for each of the four Sentence Completion scores, the six correlations representing relations to each item type other than Sentence Completion in each of three formats. The lower part of the table is organized similarly; it provides for each response format a comparison of average correlations within that format with those between formats for all test pairs involving different item types.
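The averaging scheme just outlined (correct each relevant correlation for attenuation, transform to Fisher's z, average, and transform back) can be sketched as follows. The correlation and reliability values shown are placeholders, not entries from Table 2.

```python
import numpy as np

def average_correlations(rs, reliabilities=None):
    """Average correlations via Fisher's r-to-z transformation.

    If pairs of reliabilities are supplied, each correlation is first
    corrected for attenuation, as in the multitrait-multimethod summary.
    """
    rs = np.asarray(rs, dtype=float)
    if reliabilities is not None:
        rel = np.asarray(reliabilities, dtype=float)
        rs = rs / np.sqrt(rel[:, 0] * rel[:, 1])
    rs = np.clip(rs, -0.999, 0.999)          # keep arctanh finite
    return np.tanh(np.mean(np.arctanh(rs)))

# Placeholder values: six same-trait correlations and their reliability pairs.
same_trait = [0.62, 0.58, 0.55, 0.60, 0.66, 0.57]
rel_pairs = [(0.71, 0.75), (0.71, 0.68), (0.71, 0.69),
             (0.75, 0.68), (0.75, 0.69), (0.68, 0.69)]
print(round(average_correlations(same_trait, rel_pairs), 2))  # about .85 here
```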
Results in the upper part of the table show that there was some trait variance associated with both the Sentence Completion and
Table 5
Multitrait-Multimethod Summary of Average Correlations
*By Mann-Whitney U test, the two entries in a row are significantly different at the 5% level of confidence.
Antonyms item types (by Mann-Whitney U test, p < .05). Analogies tests did not, however, relate to one another any more strongly than they related to tests of other item types.
The lower part of the table shows differences attributable to response format. There is an apparent tendency toward a difference in favor of stronger relations among multiple-choice tests than those tests have with tests in other formats, but this tendency did not approach significance. For the truly open-ended response formats there were no differences whatsoever. Like the factor analyses, this approach to correlational comparisons showed no tendency for open-ended tests to cluster according to the response format; to the slight degree that any differences were found, they represented clustering on the basis of the item type rather than the response format employed in a test.
Correlations corrected for "alternate forms" reliabilities. The multitrait-multimethod correlational comparison made use of internal consistency reliability coefficients to correct correlations for their unreliability. Several interesting comparisons can also be made using a surrogate for alternate forms reliability coefficients. The battery, of course, contained only one instance of each item-type by response-format combination, so that no true alternate form examinations could be made. It may be reasonable, however, to consider the two truly open-ended forms of a test (multiple-answer and single-answer) as two forms of the same test given under "open" conditions, and the two remaining forms (multiple-choice and keylist) as two forms of the same test given under "closed" conditions. On this assumption, relations across open and closed formats for a given item type can be estimated by the average of the four relevant correlations and corrected for reliabilities represented by the correlations within open and within closed formats.
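Under this assumption the computation reduces to averaging the four cross-format correlations for an item type and dividing by the square root of the product of the within-open and within-closed correlations. The sketch below uses invented numbers, not the study's.

```python
import math

def open_closed_corrected(cross_rs, r_within_open, r_within_closed):
    """Estimate the cross-format true-score correlation for one item type.

    `cross_rs` holds the four open-versus-closed correlations; the two
    within-format correlations serve as surrogate alternate-forms
    reliabilities. Values above 1.0 can occur through sampling error.
    """
    mean_cross = sum(cross_rs) / len(cross_rs)
    return mean_cross / math.sqrt(r_within_open * r_within_closed)

# Invented example values, not those of the study.
print(round(open_closed_corrected([0.55, 0.58, 0.52, 0.56], 0.60, 0.57), 2))  # 0.94
```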
The corrected correlations were .97 for Sentence Completion, .88 for Analogies, and 1.05
for Antonyms. It appears that relations across
the two kinds of formats did not differ from 1.0,
except for error in the data, for two item types.
Analogies tests may fail to share some of their
reliable variance across open and closed formats
but still appear to share most of it.
Relations with Other Variables
Students completed a questionnaire dealing
with their academic background, accomplishments, and interests. Included were questions
concerning (1) plans for graduate school attendance and advanced degrees, (2) undergraduate
grade-point average overall and in the major
field of study, (3) preferred career activities, (4)
self-assessed skills and competencies within the
major field, and (5) independent activities and
accomplishments within the current academic
year. Correlations were obtained between questionnaire variables and scores on the 12 verbal
tests.
Most of the correlations were very low. Only
four of the questions produced a correlation
with any test as high as .20; these were level of degree planned, self-reported grade-point average (both overall and for the major field of
study), and the choice of writing as the individual’s single most preferred professional activity.
No systematic differences in correlations associated with item type or response format were evident.
Information was also available on the student's gender and year in school. No significant correlations with gender were obtained. Advanced students tended to obtain higher test scores, with no evidence of differences among the tests in the magnitude of the relations.

GRE Aptitude Test scores were available for a small number of students (N = 41). Correlations with the GRE Verbal score were substantial in magnitude, ranging from .50 to .74 with a median of .59. Correlations with the GRE Quantitative and Analytical scores were lower but still appreciable, having medians of .36 and .47, respectively. Here also there were no systematic differences associated with item types or test formats.
These results, like the analyses of correlations
among the experimental tests, suggest that response format has little effect on the nature of
the attributes measured by the item types under
examination.
Discussion
This study has shown that it is possible to develop open-ended forms of several verbal aptitude item types that are approximately as good,
in terms of score reliability, as multiple-choice
items and that require only slightly greater time
limits than do the conventional items. These
open-ended items, however, provide little
new
information. There was no evidence whatsoever
for a general factor associated with the use of a
free-response format. There was strong evidence
against any difference in the abilities measured
by Antonyms or Sentence Completion items as a
function of the response format of the task. Only
Analogies presented some ambiguity in interpretation, and there is some reason to suspect that
that difference should be attributed to the slight
speededness of the multiple-answer Analogies
test employed.
It is clear that an open-ended response format
was not in itself sufficient to determine what
these tests measured. Neither the requirement to
generate a single response, nor the more difficult
task of producing and writing several different answers to an item, could alone change the abilities that were important for successful performance.
What, then, are the characteristics of an
item that will measure different attributes depending on the response format employed? A
comparison of the present tests with those employed in the earlier problem-solving research of
Ward et al. (1980) and Frederiksen et al. (1981)
suggests a number of possibilities. In the problem-solving work, subjects had to read and to
comprehend passages containing a number of
items of information relevant to a problem. They
were required to determine the relevance of such
information for themselves and often to apply
reasoning and inference to draw conclusions
from several items of information. Moreover,
they needed to draw on information not presented: specialized knowledge concerning the design and interpretation of research studies, for the behavioral science problems, and more general knowledge obtained from everyday life experiences, for the nontechnical problems. Finally, subjects composed responses that often entailed relating several complex ideas to one an-
other.
The verbal aptitude items, in contrast, are
much more self-contained. The examinee has
only to deal with the meaning of one word, of a pair of words, or at most of the elements of a
short sentence. In a sense, the statement of the
problem includes a specification of what information is relevant for a solution and of what
kind of solution is appropriate. Thus, the verbal
tests might be described as "well-structured" and the problem-solving tests as "ill-structured" problems (Simon, 1973). The verbal tests also, of
course, require less complex responses: a single
word or, at most, a pair of words.
Determining which of these features are critical in distinguishing tests in which an open-ended format makes a difference will require
comparing a number of different item types in
multiple-choice and free-response formats. It
will be of particular interest to develop item
types that eliminate the confounding of complexity in the information search required by a
problem with complexity in the response that is
to be produced.
For those concerned with standardized aptitude testing, the present results indicate that one
important component of existing tests amounts
to sampling from a broader range of possible
test questions than had previously been demonstrated. The discrete verbal item types presently
employed by the GRE and other testing programs appear to suffer no lack of generality because of exclusive use of a multiple-choice format; for these item types at least, use of open-ended questions would not lead to measurement
of a noticeably different ability cutting across
the three item types examined here. It remains
to be seen whether a similar statement can be
made about other kinds of questions employed
in the standardized tests and whether there are
ways in which items that will tap "creative" or "divergent thinking" abilities can be presented so as to be feasible for inclusion in large-scale testing.

References

Campbell, D. T., & Fiske, D. W. Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 1959, 56, 81-105.
Donlon, T. F. An exploratory study of the implications of test speededness (GRE Board Professional Report GREB No. 76-9P). Princeton NJ: Educational Testing Service, 1980.
Frederiksen, N., & Ward, W. C. Measures for the study of creativity in scientific problem-solving. Applied Psychological Measurement, 1978, 2, 1-24.
Frederiksen, N., Ward, W. C., Case, S. M., Carlson, S. B., & Samph, T. Development of methods for selection and evaluation in undergraduate medical education (Final Report to the Robert Wood Johnson Foundation). Princeton NJ: Educational Testing Service, 1981.
Goldberg, L. P., & Werts, C. W. The reliability of clinicians' judgments: A multitrait-multimethod approach. Journal of Counseling Psychology, 1966, 30, 199-206.
Heim, A. W., & Watts, K. P. An experiment on multiple-choice versus open-ended answering in a vocabulary test. British Journal of Educational Psychology, 1967, 37, 339-346.
Jöreskog, K. G. A general method for analysis of covariance structures. Biometrika, 1970, 57, 239-251.
Simon, H. A. The structure of ill-structured problems. Artificial Intelligence, 1973, 4, 181-201.
Steel, R. G. D., & Torrie, J. H. Principles and procedures of statistics. New York: McGraw-Hill, 1960.
Traub, R. E., & Fisher, C. W. On the equivalence of constructed-response and multiple-choice tests. Applied Psychological Measurement, 1977, 1, 355-369.
Vernon, P. E. The determinants of reading comprehension. Educational and Psychological Measurement, 1962, 22, 269-286.
Ward, W. C., Frederiksen, N., & Carlson, S. B. Construct validity of free-response and machine-scorable forms of a test. Journal of Educational Measurement, 1980, 17, 11-29.

Acknowledgments

Appreciation is due to Carol erg, Fred Godshalk, and Leslie Peirce for their assistance in developing and reviewing items; to Sybil Carlson and David Dupree for arranging and conducting test administrations; to Henrietta Gallagher and Hazel Klein for carrying out most of the test scoring; and to Kirsten Yocum for assistance in data analysis. Ledyard Tucker provided extensive advice on the analysis and interpretation of results. This research was supported by a grant from the Graduate Record Examination Board.

Author's Address

Send requests for reprints or further information to William C. Ward, Senior Research Psychologist, Educational Testing Service, Princeton NJ 08541, U.S.A.