Journal of Educational and Behavioral Statistics
Summer 2004, Vol. 29, No. 2, pp. 241-244

Can There Be Reliability without "Reliability?"

Robert J. Mislevy
Educational Testing Service

An Educational Researcher article by Pamela Moss (1994) asks the title question, "Can there be validity without reliability?" Yes, she answers, if by reliability one means "consistency among independent observations intended as interchangeable" (Moss, 1994, p. 7), quantified by internal consistency indices such as KR-20 coefficients and inter-rater correlations. Identifying this definition as the sine qua non of "a psychometric approach to drawing and warranting interpretations of human products or performances," Professor Moss describes a contrasting hermeneutic approach: "a holistic and integrative approach to interpretation of human phenomena that seeks to understand the whole in light of its parts, repeatedly testing interpretations against the available evidence until each of the parts can be accounted for in a coherent interpretation of the whole" (Moss, 1994, p. 7). One can thus draw valid inferences, the argument continues, from bodies of evidence that are not reliable. Are we witnessing the demise of a fundamental measurement principle, that without reliability there is no validity?

No, replies Heng Li (2003). Two paradoxes, beckoning the reader toward enlightenment or confusion, emerge from Moss's article. The first is but an apparent paradox, resolved by clarifying ambiguous terminology. Guilford (1954) defines reliability as the extent to which a measuring instrument is free from error. Conflating this broader concept with indices for the special case of interchangeable measures, Moss overlooks measures properly deemed reliable, but for which familiar consistency-based indices either yield low values or fail to apply at all.
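To make the distinction concrete, consider a minimal simulation sketch (the battery, loadings, and error level here are invented for illustration; they come from neither Moss nor Li). Three nearly error-free measures tap three different attributes. Each measure, and their composite, is highly reliable in Guilford's sense, as a test-retest correlation shows, yet Cronbach's alpha for the battery is near zero because the measures are not interchangeable.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Three nearly error-free measures of three *different* attributes
# (hypothetical facets; the error sd of 0.1 is an assumption).
facets = rng.normal(size=(n, 3))
err_sd = 0.1
occasion1 = facets + err_sd * rng.normal(size=(n, 3))
occasion2 = facets + err_sd * rng.normal(size=(n, 3))  # independent retest errors

def cronbach_alpha(items):
    """Standard alpha: k/(k-1) * (1 - sum of item variances / variance of total)."""
    k = items.shape[1]
    return k / (k - 1) * (1 - items.var(axis=0).sum() / items.sum(axis=1).var())

composite1 = occasion1.sum(axis=1)
composite2 = occasion2.sum(axis=1)

print("test-retest r of composite:", np.corrcoef(composite1, composite2)[0, 1])  # ~0.99
print("Cronbach's alpha of battery:", cronbach_alpha(occasion1))                 # ~0.0
```

The composite is almost free of measurement error, so the near-zero alpha signals heterogeneity of content, not unreliability in Guilford's sense.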
Such an instrument could indeed be valid without being reliable in the sense of internal consistency, but a measure wholly unreliable in the more fundamental sense would consist only of error and could not support valid inferences. This realization sets the stage for a true paradox grounded solidly within the psychometric framework. Li constructs a model in which increasing numbers of non-interchangeable observations increase the validity for inference about an unobserved target measure (measured by the correlation of their sum and the target) but decrease values of not only internal consistency reliability but "true" reliability as well. Working through the details of such an example, rendered precise by the language of mathematics, is edifying because it demands attention to exactly what one intends to make inferences about, the nature of what will be observed on which to base inferences, and the nature of the relationship between them.

In general, the task of drawing conclusions from assemblages of evidence is very difficult. As Kadane and Schum (1996) point out, one must establish the credentials of evidence: its relevance, its credibility, its inferential value. "The task of establishing them rests, in part," they continue, "on arguments or chains of reasoning we construct from the evidence to hypotheses or probable conclusions being considered. When there is a large mass of evidence to be evaluated, these arguments or chains of reasoning can become very complex" (Kadane & Schum, 1996, p. xi). Few workers in any applied field have the resources, the inclination, or the expertise to sort out the complexities of evidentiary reasoning from first principles. Nor should they have to. Procedures and methods for prototypically useful situations evolve in every field, tailored to the kinds of evidence and target inferences typical to that field. Such is the case with assessments constructed from multiple, similar observations. An idealization of observations that are conditionally independent given the target of inference provides off-the-shelf inferential machinery to guide data collection, identify problems, summarize data, and quantify certain aspects of weight of evidence for typical inferences: reliability indices, for example. But in assessment, as in other fields, difficulties arise when novel problems appear and the usual heuristics fail. We now envisage assessments that target inferences more subtle than proficiency in a specified domain of tasks, or which gather data with interrelationships that cannot be approximated with internal consistency models. We must return to first principles to establish the credentials of this evidence.

This is Moss's intention. The hermeneutic tradition does offer insights into drawing inferences from disparate masses of evidence, and we can indeed learn much from dialectic between psychometrics and hermeneutics. A first step, though, is acquiring a deeper understanding of psychometric methods, an understanding of principles behind methods that will not be found in common wisdom, familiar testing practices, or standard textbook presentations. In this spirit, Li's contribution clarifies some confusions that emerge as unintended consequences of Moss's work. A false syllogism captures a most resolute misunderstanding that a reader (inclined to do so) could erroneously take from her discussion: (a) an assessment can be valid without internal-consistency reliability, (b) my assessment does not have internal consistency, and (c) therefore, my assessment is valid.
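Before turning to a second misreading, it is worth making Li's true paradox concrete with a small simulation. The following toy model is in the spirit of his construction but is not the model from his article; the two-facet target, loadings, and error level are assumptions. Two interchangeable measures tap facet A of the target; adding a third, non-interchangeable measure of facet B raises the validity of the sum while cutting Cronbach's alpha nearly in half.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# The target of inference combines two independent facets A and B
# (an assumed model for illustration, not Li's actual construction).
A, B = rng.normal(size=(2, n))
T = (A + B) / np.sqrt(2)

e = 0.3 * rng.normal(size=(3, n))
x1, x2 = A + e[0], A + e[1]   # two interchangeable measures of facet A
x3 = B + e[2]                 # a non-interchangeable measure of facet B

def cronbach_alpha(items):
    k = items.shape[1]
    return k / (k - 1) * (1 - items.var(axis=0).sum() / items.sum(axis=1).var())

for label, items in [("x1+x2   ", np.column_stack([x1, x2])),
                     ("x1+x2+x3", np.column_stack([x1, x2, x3]))]:
    s = items.sum(axis=1)
    print(label,
          "validity r(sum, T) =", round(np.corrcoef(s, T)[0, 1], 3),
          "  alpha =", round(cronbach_alpha(items), 3))
# Adding x3 raises validity (~0.69 -> ~0.92) while alpha falls (~0.96 -> ~0.57).
```

Nothing pathological is happening: the third observation carries new, relevant information precisely because it does not duplicate the first two, and duplication is what consistency indices reward. This sketch captures only the internal-consistency half of the paradox; Li's own construction goes further, driving down reliability in the strict true-score sense as well.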
A more common, and more invidious, error would be to accept the view that probability-based reasoning is useful for cases of exchangeable measures only, and has nothing to offer beyond the realm of standard practice. Making sense of more ambitious assessment practice, therefore, requires "something else." Contemporary work on the foundations of evidentiary reasoning (see especially Schum, 1987, 1994) offers a perspective sufficiently nuanced to encompass the seemingly antithetical psychometric and hermeneutic approaches in assessment (Mislevy, 1994). Kadane and Schum's (1996) analysis of arguments and inferences from 395 items of evidence in the famous Sacco-Vanzetti trial of the 1920s exemplifies a rapprochement of probability-based reasoning and complex arguments from disparate kinds of evidence to unreplicable events. Conflicting and sometimes incompatible arguments from the original defense and prosecuting attorneys, enriched by decades of further analysis by legal scholars, scientists, and historians, are embraced in a 28-page Wigmore diagram: a graphic representation of a seventy-year effort to understand the whole of the Sacco-Vanzetti case in light of its parts. Accompanying probability-based inference networks show the sensitivity of intermediate and final conclusions to various arguments, assumptions, and items of evidence, respecting the subtleties of each (including a probability-based framework for expressing the differential expertise of historians and statisticians for various strands of the argument); a toy sketch of such a network follows this passage. This approach accommodates both evidence that can be approximated as exchangeable items (of which psychometric methods for exchangeable observations are a special case), and arguments involving more complex interrelationships, multiple perspectives, and unique observations (the manner of reasoning that characterizes hermeneutic analysis). At the level of foundations, there is no inherent conflict.

At the level of applications, though, there are serious practical issues concerning the efficacy and the suitability of various forms of evidence and arguments from evidence. They involve, among other considerations, the intended purpose of the assessment, the availability of resources for collecting and interpreting evidence, and the needs of various stakeholders for the credibility and trustworthiness of evidence upon which inferences will be based. Few educational assessments can afford the seventy-year time line of scholarly Sacco-Vanzetti analyses! Collecting predetermined items of evidence, to be evaluated along predetermined lines, is a strategy for obtaining at relatively low cost information that previous work suggests is likely to be useful in the job at hand. Incorporating multiple, similar items is a strategy for simultaneously obtaining some evidence about the quality of that evidence: a sophisticated notion, made possible by the use of a time-tested and generally applicable evidentiary structure and accompanying argument framework. Sometimes the evidence thus obtained, and the internal-consistency reliability indices to aid in evaluating the quality of the evidence, will suit our needs in an assessment situation. Sometimes they won't.
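To give a feel for the probability-based side of that rapprochement, here is a toy inference network, vastly simpler than the Sacco-Vanzetti networks and with structure and numbers entirely invented for illustration: a single hypothesis H, one item of testimonial evidence E, and a credibility node for the source. Varying the assumed credibility shows how sharply the final conclusion depends on the credentials of the evidence.

```python
# A minimal inference-network sketch (all probabilities invented).
# A credible source tracks the truth; a non-credible one reports E
# almost regardless of whether H holds.
p_H = 0.5                                   # prior on the hypothesis
p_E = {("credible", True): 0.90, ("credible", False): 0.10,
       ("not", True): 0.60,     ("not", False): 0.55}

def posterior_H(p_credible):
    """P(H | E reported), marginalizing over source credibility."""
    pc = {"credible": p_credible, "not": 1 - p_credible}
    like_H     = sum(pc[c] * p_E[(c, True)]  for c in pc)   # P(E | H)
    like_not_H = sum(pc[c] * p_E[(c, False)] for c in pc)   # P(E | not H)
    joint_H = p_H * like_H
    return joint_H / (joint_H + (1 - p_H) * like_not_H)

# Sensitivity analysis: how the conclusion shifts as an assumption
# about the evidence's credibility is varied.
for pc in (0.2, 0.5, 0.8, 0.95):
    print(f"P(source credible) = {pc:.2f} -> P(H | E) = {posterior_H(pc):.3f}")
```

When evidence fits the exchangeable mold, the familiar indices summarize its weight economically; when it does not, a network of this kind extends the same probability calculus to the messier argument.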
For this very reason, extending probability-based reasoning to more complex assessment situations is an active area of research in test theory. Frederic Lord, Georg Rasch, and others created item response theory (IRT) half a century ago expressly to deal with the fact that test items are not interchangeable. (In the Rasch model, for example, the probability that examinee i answers item j correctly is exp(theta_i - b_j) / [1 + exp(theta_i - b_j)]; each item enters the likelihood with its own difficulty parameter b_j rather than as an interchangeable replicate.) More recent work exploiting probability-based reasoning includes, as examples, modeling non-interchangeable raters (e.g., Patz, Junker, Johnson, & Mariano, 2002); studying the interrelationships of cognitive models of understanding for performance in disparate tasks (e.g., Martin & VanLehn, 1995); and exploring the interplay between psychometric models for ratings and the hermeneutic processes judges use to arrive at their evaluations (Myford & Mislevy, 1995). The challenge to specialists in test theory is to continually broaden their methodology, to extend the toolkit of data-gathering and interpretation methods available to deal with increasingly rich sources of evidence and more complex arguments for making sense of that evidence. The challenge to assessment specialists more broadly is to first recognize, then harness, the power of probability-based reasoning for more complex assessments.

To sell it short based on surface familiarity with only the techniques of standard practice is to miss the compiled wisdom underlying those techniques. It is to forgo capitalizing on the generative principles to attack more difficult evidentiary problems to which they can naturally and gainfully be marshaled. Both the machinery of probability-based inference and the structural analysis of evidence about learning are required in concert to ground inference in the next generation of educational assessments.

Note: This article was accepted under the editorship of Larry Hedges.

References

Guilford, J. P. (1954). Psychometric methods. New York: McGraw-Hill.
Kadane, J. B., & Schum, D. A. (1996). A probabilistic analysis of the Sacco and Vanzetti evidence. New York: Wiley.
Li, H. (2003). The resolution of some paradoxes related to reliability and validity. Journal of Educational and Behavioral Statistics, 28(2), 89-95.
Martin, J. D., & VanLehn, K. (1995). A Bayesian approach to cognitive assessment. In P. Nichols, S. Chipman, & R. Brennan (Eds.), Cognitively diagnostic assessment (pp. 141-165). Hillsdale, NJ: Erlbaum.
Mislevy, R. J. (1994). Evidence and inference in educational assessment. Psychometrika, 59, 439-483.
Moss, P. (1994). Can there be validity without reliability? Educational Researcher, 23(2), 5-12.
Myford, C. M., & Mislevy, R. J. (1995). Monitoring and improving a portfolio assessment system (Center for Performance Assessment Research Report). Princeton, NJ: Educational Testing Service.
Patz, R. J., Junker, B. W., Johnson, M. S., & Mariano, L. T. (2002). The hierarchical rater model for rated test items and its application to large-scale educational assessment data. Journal of Educational and Behavioral Statistics, 27(4), 341-384.
Schum, D. A. (1987). Evidence and inference for the intelligence analyst. Lanham, MD: University Press of America.
Schum, D. A. (1994). The evidential foundations of probabilistic reasoning. New York: Wiley.

Author

ROBERT J. MISLEVY is Professor, Department of Measurement, Statistics, and Evaluation, University of Maryland, College Park, MD 20742; [email protected]. His areas of specialization are assessment design, measurement models, and Bayesian inference.