Journal of Educational and Behavioral Statistics
Summer 2004, Vol. 29, No. 2, pp. 241-244

Can There Be Reliability without "Reliability?"

Robert J. Mislevy
Educational Testing Service
An Educational Researcher article by Pamela Moss (1994) asks the title question, "Can there be validity without reliability?" Yes, she answers, if by reliability one means "consistency among independent observations intended as interchangeable" (Moss, 1994, p. 7), quantified by internal consistency indices such as KR-20 coefficients and inter-rater correlations. Identifying this definition as the sine qua non of "a psychometric approach to drawing and warranting interpretations of human products or performances," Professor Moss describes a contrasting hermeneutic approach: "a holistic and integrative approach to interpretation of human phenomena that seeks to understand the whole in light of its parts, repeatedly testing interpretations against the available evidence until each of the parts can be accounted for in a coherent interpretation of the whole" (Moss, 1994, p. 7). One can thus draw valid inferences, the argument continues, from bodies of evidence that are not reliable. Are we witnessing the demise of a fundamental measurement principle, that without reliability, there is no validity?
No, replies Heng Li (2003). Two paradoxes, beckoning the reader toward enlightenment or confusion, emerge from Moss's article. The first is but an apparent paradox, resolved by clarifying ambiguous terminology. Guilford (1954) defines reliability as the extent to which a measuring instrument is free from error. Conflating this broader concept with indices for the special case of interchangeable measures, Moss overlooks measures properly deemed reliable, but for which familiar consistency-based indices either yield low values or fail to apply at all. Such an instrument could indeed be valid without being reliable in the sense of internal consistency, but a measure wholly unreliable in the more fundamental sense would consist only of error and could not support valid inferences.
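The distinction can be stated in classical test-theory terms (a standard formulation, supplied here for concreteness rather than quoted from Moss or Li). With observed score X = T + E, where T is a true score and E is error, reliability in Guilford's broad sense is the proportion of observed-score variance that is free of error:

    \rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}.

Internal-consistency indices such as KR-20 and coefficient alpha estimate \rho_{XX'} only under the further assumption that an instrument's parts are interchangeable. When the parts deliberately tap different aspects of the target, alpha can be low even though the error variance \sigma_E^2 in the total score is small.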
This realization sets the stage for a true paradox grounded solidly within the psychometric framework. Li constructs a model in which increasing numbers of non-interchangeable observations increase the validity for inference about an unobserved target measure (measured by the correlation of their sum and the target) but decrease values of not only internal consistency reliability but "true" reliability as well. Working through the details of such an example, rendered precise by the language of mathematics, is edifying because it demands attention to exactly what one intends to make inferences about, the nature of what will be observed on which to base inferences, and the nature of the relationship between them.
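A minimal simulation conveys the flavor of such a construction (an illustrative sketch in the spirit of, but not identical to, Li's model, which additionally drives down "true" reliability). Each observation taps a different facet of a composite target; as facets are added, the sum of the observations correlates increasingly with the target, yet the items barely correlate with one another and coefficient alpha hovers near zero.

    import numpy as np

    rng = np.random.default_rng(0)

    def cronbach_alpha(items):
        # alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total)
        k = items.shape[1]
        item_var = items.var(axis=0, ddof=1).sum()
        total_var = items.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_var / total_var)

    n_persons, n_facets = 5000, 12
    facets = rng.normal(size=(n_persons, n_facets))   # independent facets of proficiency
    target = facets.sum(axis=1)                       # unobserved target of inference

    # Observe k non-interchangeable facets, each contaminated by measurement error.
    for k in (2, 4, 8, 12):
        items = facets[:, :k] + rng.normal(size=(n_persons, k))
        total = items.sum(axis=1)
        validity = np.corrcoef(total, target)[0, 1]
        print(f"k={k:2d}  validity={validity:.2f}  alpha={cronbach_alpha(items):.2f}")

In this sketch, validity grows roughly as the square root of the fraction of facets observed, while alpha stays near zero because the items share no common factor; non-interchangeable observations are precisely what internal-consistency indices penalize.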
This article was accepted under the editorship of Larry Hedges.
In general, the task of drawing conclusions from assemblages of evidence is very difficult. As Kadane and Schum (1996) point out, one must establish the credentials of evidence: its relevance, its credibility, its inferential value. "The task of establishing them rests, in part," they continue, "on arguments or chains of reasoning we construct from the evidence to hypotheses or probable conclusions being considered. When there is a large mass of evidence to be evaluated, these arguments or chains of reasoning can become very complex" (Kadane & Schum, 1996, p. xi). Few workers in any applied field have the resources, the inclination, or the expertise to sort out the complexities of evidentiary reasoning from first principles.
Nor should they have to. Procedures and methods for prototypically useful situations evolve in every field, tailored to the kinds of evidence and target inferences typical to that field. Such is the case with assessments constructed from multiple, similar observations. An idealization of observations that are conditionally independent given the target of inference (formalized below) provides off-the-shelf inferential machinery to guide data collection, identify problems, summarize data, and quantify certain aspects of weight of evidence for typical inferences (reliability indices, for example). But in assessment, as in other fields, difficulties arise when novel problems appear and the usual heuristics fail. We now envisage assessments that target inferences more subtle than proficiency in a specified domain of tasks, or which gather data with interrelationships that cannot be approximated with internal consistency models. We must return to first principles to establish the credentials of this evidence.
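Formally, the idealization is the conditional-independence factorization (stated in generic notation for reference):

    p(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta).

Given the target \theta, each observation contributes its own increment of evidence; likelihood-based estimation, precision that accumulates with n, and internal-consistency indices all flow from this structure.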
This is Moss's intention. The hermeneutic tradition does offer insights into drawing inferences from disparate masses of evidence, and we can indeed learn much from dialectic between psychometrics and hermeneutics. A first step, though, is acquiring a deeper understanding of psychometric methods, an understanding of principles behind methods that will not be found in common wisdom, familiar testing practices, or standard textbook presentations. In this spirit, Li's contribution clarifies some confusions that emerge as unintended consequences of Moss's work. A false syllogism captures a most resolute misunderstanding that a reader (inclined to do so) could erroneously take from her discussion: (a) an assessment can be valid without internal-consistency reliability, (b) my assessment does not have internal consistency, and (c) therefore, my assessment is valid.
A more common, and more invidious, error would be to accept the view that probability-based reasoning is useful for cases of exchangeable measures only, and has nothing to offer beyond the realm of standard practice. Making sense of more ambitious assessment practice, therefore, requires "something else."
Contemporary work on the foundations of evidentiary reasoning (see especially Schum, 1987, 1994) offers a perspective sufficiently nuanced to encompass the seemingly antithetical psychometric and hermeneutic approaches in assessment (Mislevy, 1994). Kadane and Schum's (1996) analysis of arguments and inferences from 395 items of evidence in the famous Sacco-Vanzetti trial of the 1920s exemplifies a rapprochement of probability-based reasoning and complex arguments from disparate kinds of evidence to unreplicable events. Conflicting and sometimes incompatible arguments from the original defense and prosecuting attorneys, enriched by decades of further analysis by legal scholars, scientists, and historians, are embraced in a 28-page Wigmore diagram: a graphic representation of a seventy-year effort to understand the whole of the Sacco-Vanzetti case in light of its parts. Accompanying probability-based inference networks show the sensitivity of intermediate and final conclusions to various arguments, assumptions, and items of evidence, respecting the subtleties of each (including a probability-based framework for expressing the differential expertise of historians and statisticians for various strands of the argument). This approach accommodates both evidence that can be approximated as exchangeable items (of which psychometric methods for exchangeable observations are a special case), and arguments involving more complex interrelationships, multiple perspectives, and unique observations (the manner of reasoning that characterizes hermeneutic analysis). At the level of foundations, there is no inherent conflict.
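A toy calculation suggests what it means for such a network to expose the sensitivity of a conclusion to the credentials of evidence. The two-link chain and its numbers below are hypothetical, not drawn from the Sacco-Vanzetti analysis: a witness's report bears on the hypothesis only through an intermediate event, and the posterior degrades toward the prior as the assumed credibility of the witness degrades.

    # A two-link chain of reasoning: hypothesis H -> event E -> witness's report R.
    # P(R | H) marginalizes over whether the event actually occurred.
    def chain_likelihood(p_event, hit, false_alarm):
        return hit * p_event + false_alarm * (1 - p_event)

    prior = 0.5
    for hit, false_alarm in [(0.9, 0.1), (0.7, 0.3), (0.55, 0.45)]:
        p_r_h = chain_likelihood(0.8, hit, false_alarm)      # E likely if H true
        p_r_not_h = chain_likelihood(0.2, hit, false_alarm)  # E unlikely otherwise
        post = prior * p_r_h / (prior * p_r_h + (1 - prior) * p_r_not_h)
        print(f"hit={hit}, false_alarm={false_alarm} -> P(H | report) = {post:.2f}")

With a highly credible witness the report moves the posterior for H from .50 to .74; with a nearly uninformative witness it moves it only to .53. An inference network makes exactly this kind of dependence explicit, simultaneously, for hundreds of items of evidence.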
At the level of applications, though, there are serious practical issues concerning the efficacy and the suitability of various forms of evidence and arguments from evidence. They involve, among other considerations, the intended purpose of the assessment, the availability of resources for collecting and interpreting evidence, and the needs of various stakeholders for the credibility and trustworthiness of evidence upon which inferences will be based. Few educational assessments can afford the seventy-year time line of scholarly Sacco-Vanzetti analyses! Collecting predetermined items of evidence, to be evaluated along predetermined lines, is a strategy for obtaining at relatively low cost information that previous work suggests is likely to be useful in the job at hand. Incorporating multiple, similar items is a strategy for simultaneously obtaining some evidence about the quality of that evidence: a sophisticated notion, made possible by the use of a time-tested and generally applicable evidentiary structure and accompanying argument framework.
Sometimes the evidence thus obtained, and the internal-consistency reliability indices to aid in evaluating the quality of the evidence, will suit our needs in an assessment situation. Sometimes they won't. For this very reason, extending probability-based reasoning to more complex assessment situations is an active area of research in test theory. Frederic Lord, Georg Rasch, and others created item response theory (IRT) half a century ago expressly to deal with the fact that test items are not interchangeable (the Rasch model, shown below, is the simplest case). More recent work exploiting probability-based reasoning includes, as examples, modeling non-interchangeable raters (e.g., Patz, Junker, Johnson, & Mariano, 2002); studying the interrelationships of cognitive models of understanding for performance in disparate tasks (e.g., Martin & VanLehn, 1995); and exploring the interplay between psychometric models for ratings with the hermeneutic processes judges use to arrive at their evaluations (Myford & Mislevy, 1995). The challenge to specialists in test theory is to continually broaden their methodology, to extend the toolkit of data-gathering and interpretation methods available to deal with increasingly rich sources of evidence, and more complex arguments for making sense of that evidence.
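In the Rasch model, person i's probability of success on item j depends on a proficiency \theta_i and an item-specific difficulty b_j:

    P(X_{ij} = 1 \mid \theta_i, b_j) = \frac{\exp(\theta_i - b_j)}{1 + \exp(\theta_i - b_j)}.

Items need not be parallel or interchangeable; conditional independence of responses given \theta_i is what licenses pooling evidence across systematically different items.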
The challenge to assessment specialists more broadly is first to recognize, then harness, the power of probability-based reasoning for more complex assessments.
To sell it short based on surface familiarity with only the techniques of standard
practice is to miss the compiled wisdom underlying those techniques. It is to forgo
capitalizing on the generative principles to attack more difficult evidentiary problems to which they can naturally and gainfully be marshaled. Both the machinery
of probability-based inference and the structural analysis of evidence about learning are required in concert to ground inference in the next generation of educational
assessments.
References

Guilford, J. P. (1954). Psychometric methods. New York: McGraw-Hill.
Kadane, J. B., & Schum, D. A. (1996). A probabilistic analysis of the Sacco and Vanzetti evidence. New York: Wiley.
Li, H. (2003). The resolution of some paradoxes related to reliability and validity. Journal of Educational and Behavioral Statistics, 28(2), 89-95.
Martin, J. D., & VanLehn, K. (1995). A Bayesian approach to cognitive assessment. In P. Nichols, S. Chipman, & R. Brennan (Eds.), Cognitively diagnostic assessment (pp. 141-165). Hillsdale, NJ: Erlbaum.
Mislevy, R. J. (1994). Evidence and inference in educational assessment. Psychometrika, 59, 439-483.
Moss, P. (1994). Can there be validity without reliability? Educational Researcher, 23(2), 5-12.
Myford, C. M., & Mislevy, R. J. (1995). Monitoring and improving a portfolio assessment system (Center for Performance Assessment Research Report). Princeton, NJ: Center for Performance Assessment, Educational Testing Service.
Patz, R. J., Junker, B. W., Johnson, M. S., & Mariano, L. T. (2002). The hierarchical rater model for rated test items and its application to large-scale educational assessment data. Journal of Educational and Behavioral Statistics, 27(4), 341-384.
Schum, D. A. (1987). Evidence and inference for the intelligence analyst. Lanham, MD: University Press of America.
Schum, D. A. (1994). The evidential foundations of probabilistic reasoning. New York: Wiley.
Author

ROBERT J. MISLEVY is Professor, Department of Measurement, Statistics, and Evaluation, University of Maryland, College Park, MD 20742; [email protected]. His areas of specialization are assessment design, measurement models, and Bayesian inference.