Items by Design: The Impact of Systematic Feature Variation on Item Statistical Characteristics

Mary K. Enright
Mary Morley
Kathleen M. Sheehan

GRE Board Report No. 95-15R

September 1999

This report presents the findings of a research project funded by and carried out under the auspices of the Graduate Record Examinations Board.

Educational Testing Service, Princeton, NJ 08541

********************

Researchers are encouraged to express freely their professional judgment. Therefore, points of view or opinions stated in Graduate Record Examinations Board Reports do not necessarily represent official Graduate Record Examinations Board position or policy.

********************

The Graduate Record Examinations Board and Educational Testing Service are dedicated to the principle of equal opportunity, and their programs, services, and employment policies are guided by that principle.

EDUCATIONAL TESTING SERVICE, ETS, the ETS logo, GRADUATE RECORD EXAMINATIONS, and GRE are registered trademarks of Educational Testing Service. The modernized ETS logo is a trademark of Educational Testing Service.

Educational Testing Service
Princeton, New Jersey 08541

Copyright © 1999 by Educational Testing Service. All rights reserved.

Acknowledgments

We wish to recognize the contribution of the many test development staff members whose advice and cooperation was essential to this project. Special thanks to Jackie Tchomi, Judy Smith, and Jutta Levin. We also appreciate Bob Mislevy's advice about how to estimate the usefulness of collateral information. Finally, we are grateful to the Graduate Record Examinations Board for supporting this research.

Abstract

This study investigated the impact of systematic item feature variation on item statistical characteristics and the degree to which such information could be used as collateral information to supplement examinee performance data and reduce pretest sample size. Two families of word problem variants for the quantitative section of the Graduate Record Examinations (GRE®) General Test were generated by systematically manipulating item features. For rate problems, the item design features affected item difficulty (Adj. R2 = .90), item discrimination (Adj. R2 = .50), and guessing (Adj. R2 = .41). For probability problems, the item design features affected difficulty (Adj. R2 = .61), but not discrimination or guessing. The results demonstrate the enormous potential of systematically creating item variants. However, questions of how best to manage variants in item pools and to implement statistical procedures that use collateral information must still be resolved.

KEY WORDS

Quantitative Reasoning
Graduate Record Examinations
Faceted Item Development
Algebra Word Problems
Item Statistical Characteristics
Assessment of Quantitative Skills

Table of Contents

Introduction
    Research on Word Problems
Method
    Design of Word Problems
    Item Pretesting
    Item Analysis
    Data Analysis
Results
    Summary of Item Statistics
    Impact of Item Design Features on Item Operating Characteristics
    Implications for Reductions in Pretest Sample Sizes
Discussion
    Summary
    Understanding Item Difficulty and Construct Representation
    Implications for Creating Item Variants
    Implications for Reducing Pretest Sample Size
    Concluding Comments
References

List of Tables

TABLE 1. Examples of Rate Items
TABLE 2. Examples of Probability Items
TABLE 3. Mean Item Statistics for Experimental and Nonexperimental Problem Solving Items
TABLE 4. IRT Item Parameters for Rate Problems with Differing Item Design Features
TABLE 5. IRT Item Parameters for Probability Problems with Differing Item Design Features
TABLE 6. Regression of Item Features on IRT Item Parameters for Two Families of Item Variants
TABLE 7. The Precision of Difficulty Estimates Generated With and Without Collateral Information

List of Figures

FIGURE 1. Estimated regression tree for the difficulty parameter for rate problems
FIGURE 2. Estimated regression tree for the discrimination parameter for rate problems
FIGURE 3. Estimated regression tree for the guessing parameter for rate problems
FIGURE 4. Estimated regression tree for the difficulty parameter for probability problems
FIGURE 5. The effect of increasing sample sizes with and without collateral information

Introduction

Because of the continuous nature of computer adaptive testing, the danger of item exposure will increase unless item pools are large enough or can be changed frequently enough to reduce the probability of examinees revealing items that a large number of subsequent examinees may receive. Thus continuous computer adaptive testing has created a demand for more items and greater efficiencies in item development.
Improvement in the efficiency of item development will result if methods for generating items systematically are developed or if pretest sample size requirements can be reduced. A particularly critical bottleneck in the item development process at present is the need for item pretesting. The number of items that can be pretested is constrained by the number of examinees on whom each item must be tested in order to obtain reliable estimates of item operating characteristics. Recently, however, methods have been developed that permit the use of collateral information about item features to supplement examinee performance data, so that smaller pretest samples can be used to obtain reliable estimates of item operating characteristics (Mislevy, Sheehan, & Wingersky, 1993). The purpose of this study was to determine if designing systematic variants of quantitative word problems would result in more efficient item development, thus permitting item operating characteristics to be reliably estimated using smaller pretest samples.

The idea of creating variants of existing items as a way of developing more items is not novel and probably has been done informally by item writers as long as standardized tests have been in existence. While the unsystematic creation of variants contributes to the efficiency of the item development process, there are some dangers associated with this practice, such as overlap among items or inadvertently narrowing the construct being measured.

The ideal alternative would be to create item variants systematically by using a framework that distinguishes construct-relevant and construct-irrelevant sources of item statistical characteristics, as well as incidental item features that are neutral with respect to item statistical characteristics and the underlying construct. Thus item variants with different statistical parameters could be created by manipulating construct-relevant features, and item variants with similar statistical parameters could be created by manipulating incidental features. With this method, overlap among items could be better controlled. Unfortunately, the constructs tapped by most existing tests are not articulated in enough detail to allow the development of construct-driven item design frameworks.

A third approach to generating item variants is to use item design frameworks as a hypothesis-testing tool to assess the impact of different item features on item statistical characteristics. This is the systematic approach that was taken in the present study. Frameworks for creating item variants were developed based on prior correlational analyses of item features that affect problem difficulty and on the hypotheses of experienced item writers. Thus the item development and research processes were integrated so that the degree to which different item features impact item statistical characteristics could be determined and the constructs underlying the creation of item variants could be more clearly articulated.

Research on Word Problems

A body of research about problem features that affect problem difficulty already exists for arithmetic and algebra word problems. This research can serve as a basis for creating systematic item variants and estimating problem difficulty. The relevant research was stimulated by Mayer (1981), who analyzed algebra word problems from secondary school algebra texts. Mayer found that these problems could be classified into eight families based on the problems' "story line" and source formulas (such as "distance = rate x time" or "dividend = interest rate x principal"). However, similar story lines may reflect very different quantitative structures (Mayer, 1982).
In order to capture this relevant quantitative structure separately from the specific problem content, a number of network notations have been developed (Hall, Kibler, Wenger, & Truxaw, 1989; Reed, 1987; Reed, Dempster, & Ettinger, 1985; Shalin & Bee, 1985). For example, Shalin and Bee (1985) analyzed the quantitative structure of word problems in terms of elements, relations, and structures. Many word problems consist of one or more triads of elements combined in additive or multiplicative relationships. One of the relationships Shalin and Bee described--a multiplicative relationship among a rate and two quantities--is typical of many arithmetic and algebra word problems, such as those involving travel, interest, cost, and work. For complex problems that involve more than one triad, problem structure describes the way that these triads are linked. Shalin and Bee found that many two-step arithmetic word problems could be classified as exemplars of one of a number of structures (such as hierarchy, shared-whole, and shared-part), and that these problem structures had an effect on problem difficulty. This idea can be extended to other word problems, and the kind of superordinate constraint that allows the subparts of a problem to be composed can be used as one feature in classifying problems (Hall et al., 1989; Sebrechts, Enright, Bennett, & Martin, 1996). For example, round trip problems (the distance on one part of the trip equals the distance on the second part of the trip) exemplify a class of problems in which the superordinate constraint can be described as Distance 1 = Distance 2. Another type of problem involving parts of a trip in the same direction but at different rates might have a superordinate constraint such that Distance 1 + Distance 2 = Total Distance.

Problem features such as those described above can be related theoretically to individual differences in cognition. For example, because of limitations on working memory capacity, the more elements and relationships there are, the more difficult a problem is likely to be. However, knowledge about basic, complementary mathematical relationships among elements (such as "distance, rate, and time" or "dividends, interest, and principal") should help individuals to group or chunk subparts of a problem. Integrating these chunks into a larger structure requires recognition of the superordinate constraints that are operating in the problem situation. Thus we assume, as pointed out by Embretson (1983), that the "stimulus characteristics of the test items determine the components that are involved in its solution" (p. 181).

In a study of 20 word problems that had appeared on the quantitative section of the Graduate Record Examinations (GRE®) General Test, Sebrechts et al. (1996) found that three problem features--the need to apply algebraic concepts (manipulate variables), problem complexity, and content--accounted for 37% to 62% of the variance in two independent estimates of problem difficulty. In addition to this correlational study of a small set of problems, other studies also demonstrate that similar item features are useful in designing word problems (Lane, 1991), in providing substantive understanding of changes in student performance with training (Embretson, 1995), and in accounting for problem difficulty. To date, researchers have focused on identifying sources of item difficulty because this information is useful for explicating the constructs represented on a test and for developing proficiency descriptors (Embretson, 1983, 1995; Sheehan, 1997).
However, information about the problem features that affect item discrimination and guessing parameters, as well as item difficulty parameters, is also valuable at present because recent advances in measurement theory support the use of collateral information about item features to estimate item operating characteristics using smaller examinee samples (Mislevy et al., 1993; Mislevy, Wingersky, & Sheehan, 1994). Such estimation procedures can reduce the cost of item development.

For word problems on many standardized tests, the kinds of item features described above are varied unsystematically and on an ad hoc basis, and so it is difficult to estimate precisely how much any particular feature contributes to item statistical characteristics. In this study, we developed and pretested items that varied systematically on some of these features so that we could better estimate the degree to which different manipulations affected item statistical characteristics. The questions we wished to answer were as follows:

1. Were the systematically designed items of acceptable quality?
2. What impact did the item design features have on item statistical characteristics?
3. How useful would the item design information be for reducing pretest sample sizes?

Method

Design of Word Problems

For the purposes of this study, two families of 48 related word problems were created. For each family, a design matrix specified three item features that were crossed with each other to create eight classes of variants. Six problem variants were written for each class. All items were presented in a five-option multiple-choice format.

Family 1: Rate Problems, Equal Outputs. For the first family of problems, three item features--complexity, context, and using a variable--were selected for manipulation based on the findings of Sebrechts et al. (1996). Some examples of problems typical of this family are provided in Table 1. The basic structure of these problems can be described in terms of three constraints, which can be combined into a simple linear system, as follows:

Rate 1 x Unit A1 = Unit B1
Rate 2 x Unit A2 = Unit B2
Unit B1 = Unit B2

To increase problem complexity, an additional constraint, or step, was added to half of the problems:

Unit A1 + Unit A2 = Total Unit A

Thus the less complex problems were composed of three constraints, and the more complex consisted of four constraints. The goal of the less complex problems was to find Unit A2 given Unit A1, Rate 1, and Rate 2; the goal of the more complex problems was to find Unit A2 and Rate 2 given Unit A1, Total Unit A, and Rate 1. The narrative context of these problems involved either cost or distance. Finally, to manipulate the algebraic content, one of the elements of the problem was changed from a quantity to a variable: "John bought 6 cans of soda" became "John bought x cans of soda." This latter manipulation led to a solution that was an algebraic expression rather than a derived quantity.

TABLE 1. Examples of Rate Items

Cost context, Complexity Level 1, no variable:
Soda that usually costs $6.00 per case is on sale for $4.00 per case. How many cases can Jack buy on sale for the price he usually pays for 6 cases?

Cost context, Complexity Level 1, with variable:
Soda that usually costs $6.00 per case is on sale for $4.00 per case. How many cases can Jack buy on sale for the price he usually pays for x cases?

DRT context, Complexity Level 1, no variable:
Under normal circumstances, a train travels from City X to City Y in 6 hours at an average speed of 60 miles per hour. When the tracks were being repaired, this train traveled on the same tracks at an average speed of 40 miles per hour. How long did the trip take when the tracks were being repaired?

DRT context, Complexity Level 1, with variable:
Under normal circumstances, a train travels from City X to City Y in t hours at an average speed of 60 miles per hour.
When the tracks were being repaired, this train traveled on the same tracks at an average speed of 40 miles per hour. How long did the trip take when the tracks were being repaired?

Cost context, Complexity Level 2, no variable:
As a promotion, a store sold 90 cases of soda of the 150 cases they had in stock at $4.00 per case. To make a profit, the store needs to bring in the same total amount of money when they sell the remaining cases of soda. At what price must the store sell the remaining cases?

Cost context, Complexity Level 2, with variable:
As a promotion, a store sold 90 cases of soda of the x cases they had in stock at $4.00 per case. To make a profit, the store needs to bring in the same total amount of money when they sell the remaining cases of soda. At what price must the store sell the remaining cases?

DRT context, Complexity Level 2, no variable:
A round trip by train from City X to City Y took 15 hours. The first half of the trip took 9 hours and the train traveled at an average speed of 40 miles per hour. What was the train's average speed on the return trip?

DRT context, Complexity Level 2, with variable:
A round trip by train from City X to City Y took 15 hours. The first half of the trip took t hours and the train traveled at an average speed of 40 miles per hour. What was the train's average speed on the return trip?

Note. These example items were not used in this study.

Family 2: Probability Problems. The second family of items was made up of variants of probability problems. Examples of problems typical of this family are provided in Table 2. These problems had three components--determining the number of elements in a set, determining the number of elements in a subset, and calculating the proportion of the whole set that was included in the subset. Given a lack of prior research on these types of problems, hypotheses about item features that might affect item difficulty were more speculative and were based on the expert knowledge of item writers.

First, we varied the complexity of counting the elements in the subset. The set always consisted of the number of integers within a given range. The difficulty of the subset counting tasks was varied as follows:

Complexity Level 1:
- Numbers in a smaller range
- Numbers ending with a certain digit
- Numbers with 3 digits the same

Complexity Level 2:
- Numbers beginning with certain digits and ending with certain digits
- Numbers beginning with certain digits and ending with odd digits
- Numbers with 2 or 3 digits equal to 1

Second, we speculated that items cast as probability problems would be more difficult than those cast as percent problems. And third, we varied the cover story so that some problems involved a real-life context (phone extensions, room numbers) and others simply referred to sets of integers. Although this latter feature (real versus pure) is a specification that is used to assemble test forms, we did not have a clear sense of how it might affect difficulty for these kinds of problems.

TABLE 2. Examples of Probability Items

Percent, Real context, Complexity Level 1:
Parking stickers for employees' cars at a certain company are numbered consecutively from 100 to 999. Stickers from 200 to 399 are assigned to the sales department. What percent of the parking stickers are assigned to the sales department?

Percent, Real context, Complexity Level 2:
Parking stickers for employees' cars at a certain company are numbered consecutively from 100 to 999. Stickers that begin with the digits 2 or 3 are assigned to the sales department. Stickers that end with the digits 8 or 9 belong to managers. What percent of the parking stickers are assigned to managers in the sales department?

Percent, Pure context, Complexity Level 1:
What percent of the integers between 100 and 999, inclusive, are between 200 and 399, inclusive?
Percent, Pure context, Complexity Level 2:
What percent of the integers between 100 and 999, inclusive, begin with the digits 2 or 3 and end with the digits 8 or 9?

Probability, Real context, Complexity Level 1:
Parking stickers for employees' cars at a certain company are numbered consecutively from 100 to 999. Stickers from 200 to 399 are assigned to the sales department. If a parking sticker is chosen at random, what is the probability that it will belong to the sales department?

Probability, Real context, Complexity Level 2:
Parking stickers for employees' cars at a certain company are numbered consecutively from 100 to 999. Stickers that begin with the digits 2 or 3 are assigned to the sales department. Stickers that end with the digits 8 or 9 belong to managers. If a parking sticker is chosen at random, what is the probability that it will belong to a manager in the sales department?

Probability, Pure context, Complexity Level 1:
If an integer is chosen at random from the integers between 100 and 999, inclusive, what is the probability that the chosen integer will be between 200 and 399, inclusive?

Probability, Pure context, Complexity Level 2:
If an integer is chosen at random from between 100 and 999, inclusive, what is the probability that the chosen integer will begin with the digits 2 or 3 and end with the digits 8 or 9?

Note. These example items were not used in this study.

Item Pretesting

Items from the variant families were included in 24 quantitative pretest sections of the GRE General Test. Paper-and-pencil test forms were administered to random samples of 1,000 or more examinees in October and December 1996. Four experimental items with minimal overlap of item features were included in each pretest section, so that each pretest section included one each of the following types of problems: cost, DRT (distance = rate x time), percent, and probability. Within a pretest section, items were positioned in accord with test assembly conventions, which included placing problem-solving items in positions 16 through 30 and roughly ordering them according to expected difficulty. Finally, an experienced test developer was asked to estimate the difficulty of the experimental items on a scale of 1 to 5.

Item Analysis

Item statistics that were generated as a part of the pretest process and entered into a database include the following:

1. Equated delta (E-Delta)--an inverse translation of proportion correct into a scale with a mean of 13 and a standard deviation of 4 (based on the curve for a normal distribution and equated over tests and samples).

2. R-biserial (Rbis)--the correlation between examinees' scores on an individual item and their total scores on the operational quantitative measure.

3. DDIF-m/f--a measure of differential difficulty of items for different groups of examinees (in this case, males and females) after controlling for overall performance on a measure (based on the 1988 adaptation of the Mantel and Haenszel statistic by Holland & Thayer, 1988).

In addition, item response theory (IRT) parameters were estimated for each item using BILOG (Mislevy & Bock, 1982). In the specific IRT model assumed to be underlying performance on GRE items, the probability that an examinee with ability θi will respond correctly to an item with parameters (aj, bj, cj) is modeled as follows:

P(xij = 1 | θi, aj, bj, cj) = cj + (1 - cj) / (1 + e^(-1.7 aj (θi - bj)))

In this particular model, the item parameters are interpreted as characterizations of the item's discrimination (aj), difficulty (bj), and susceptibility to correct response through guessing (cj). Because parameter estimates for some of the experimental items were not included in the test development database, item parameter estimates were also obtained from a second IRT calibration which included the 96 experimental items and a sample of 120 nonexperimental items.
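As a rough illustration of the two statistics most central to the analyses, the sketch below computes an unequated delta from proportion correct and the three-parameter logistic probability defined above. It is only an illustrative rendering: the function names are invented here, the SciPy dependency is an assumption, and the operational E-Delta additionally involves equating across tests and samples, which is not shown.

```python
import math
from scipy.stats import norm  # assumed dependency for the inverse-normal transform

def observed_delta(p_correct):
    """Unequated delta: inverse-normal transform of proportion correct,
    rescaled to a mean of 13 and a standard deviation of 4.
    The operational E-Delta also equates this value over tests and samples."""
    return 13.0 + 4.0 * norm.ppf(1.0 - p_correct)

def p_correct_3pl(theta, a, b, c):
    """Three-parameter logistic model assumed for GRE items:
    P(x = 1 | theta, a, b, c) = c + (1 - c) / (1 + exp(-1.7 * a * (theta - b)))."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

# Example: a proportion correct of .60 gives a delta of about 12.0, and an
# average-ability examinee (theta = 0) on an item with a = 1.0, b = 0.5,
# c = 0.2 has roughly a .44 probability of a correct response.
print(observed_delta(0.60))
print(p_correct_3pl(0.0, 1.0, 0.5, 0.2))
```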
Data Analysis

To determine whether the items that were systematically designed for this study were of acceptable quality, we compared the item statistics and the attrition rate for the experimental and nonexperimental items, and assessed the impact of the item design features on gender-related differential item difficulty. To assess the impact, if any, that the item design features had on item operating characteristics, the relationship between the item design features and resulting item operating characteristics was analyzed using a combination of tree-based regression and classical least squares regression. Finally, the usefulness of the collateral information about the item features for reducing pretest sample size was examined.

Tree-based regression. The impact of different item feature manipulations on resulting item parameter estimates was investigated using a tree-based regression technique. Like classical regression models, tree-based regression models provide a rule for estimating the value of a response variable (y) from a set of classification or predictor variables (x). In the particular application described here, y is an (n x 1) vector of item parameter estimates, and x is an (n x k) matrix of item feature classifications. As in the classical regression setting, tree-based prediction rules provide the expected value of the response for clusters of observations having similar values of the predictor variables. Clusters are formed by successively splitting the data into increasingly homogeneous subsets, called nodes, on the basis of the feature classification variables. A locally optimal sequence of splits is selected by using a recursive partitioning algorithm to evaluate all possible splits of all possible predictor variables at each stage of the analysis (Breiman, Friedman, Olshen, & Stone, 1984).

Potential splits are evaluated in terms of deviance, a statistical measure of the dissimilarity in the response variable among the observations belonging to a single node. At each stage of splitting, the original subset of observations is referred to as the parent node and the two outcome subsets are referred to as the left and right child nodes. The best split is the one that produces the largest decrease between the deviance of the parent node and the sum of the deviances in the two child nodes. The deviance of the parent node is calculated as the sum of the squared deviations of its members from the node mean,

D(y, ȳ) = Σi (yi - ȳ)²,

where ȳ is the mean value of the response calculated from all of the observations in the node. The deviance of a potential split is calculated as

D(split) = Σ(i in L) (yi - ȳL)² + Σ(i in R) (yi - ȳR)²,

where ȳL is the mean value of the response in the left child node and ȳR is the mean value of the response in the right child node. The split that maximizes the change in deviance is the split chosen at any given node. After each split is defined, the mean value of the response within each child node is taken as the predicted value of the response for each of the items in each of the nodes. The more homogeneous the node, the more accurate the prediction.
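The split-selection rule just described can be illustrated with a brief sketch. This is a simplified, single-split search written for exposition only; the study itself relied on the recursive partitioning algorithm of Breiman et al. (1984), applied repeatedly and followed by the corroboration procedure described below, and the data structure and names used here are assumptions made for the example.

```python
def deviance(y):
    """Within-node deviance: sum of squared deviations from the node mean."""
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y)

def best_split(items, feature_names):
    """Evaluate all binary splits on categorical item features and return the
    one giving the largest drop from the parent deviance to the summed child
    deviances. `items` is a list of (feature_dict, parameter_estimate) pairs."""
    parent_dev = deviance([y for _, y in items])
    best = None
    for feat in feature_names:
        for level in {f[feat] for f, _ in items}:
            left = [y for f, y in items if f[feat] == level]
            right = [y for f, y in items if f[feat] != level]
            if not left or not right:
                continue  # split must produce two nonempty child nodes
            drop = parent_dev - (deviance(left) + deviance(right))
            if best is None or drop > best[0]:
                best = (drop, feat, level)
    return best  # (deviance reduction, feature, level defining the left child)

# Hypothetical use with the 48 rate variants and their difficulty estimates:
# best_split(rate_items, ["UseVar", "Context", "Complexity"])
```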
The node definitions developed for the current study characterize the impact of specific item feature manipulations on resulting item parameter estimates. This characterization was corroborated by implementing the following two-step procedure: First, the estimated tree model was reexpressed as a linear combination of binary-coded dummy variables; second, the dummy variable model was subjected to a classical least squares regression analysis. The significance probabilities resulting from this procedure indicate whether, in a classical least squares regression analysis, any of the effects included in the estimated tree model would have been deemed "not significant" and any of the effects omitted from the estimated tree model would have been deemed "significant." When the results obtained in the classical least squares regression analysis replicate those obtained in the tree-based analysis, confidence regarding the validity of resulting conclusions is enhanced.

Estimating the usefulness of collateral information. From a Bayesian statistical perspective, the precision of a given item parameter estimate (say, item difficulty) is determined from the amount of information available from two different sources: examinee response vectors and collateral information about item features. The parameter estimates considered in the current study characterize the precision levels achievable under two different scenarios: one in which all of the available information about item operating characteristics is derived from an analysis of approximately 1,000 examinee response vectors, and another in which all of the available information about item operating characteristics is derived from the estimated item feature model. The former scenario is represented by the item parameter estimates obtained from the BILOG calibration, while the latter is represented by the item parameter estimates obtained from the estimated regression models.

The usefulness of the item feature information, as captured in the estimated regression models, can be determined by comparing the precision of the difficulty estimates obtained from the BILOG calibration to the precision of the corresponding estimates obtained from the estimated regression model. Precision is defined as the inverse of the variance of the distribution representing knowledge about an estimated parameter value. For the BILOG difficulty estimates considered in this study, precision is calculated as the inverse of the squared standard error obtained from a calibration with noninformative prior distributions. For the regression estimates considered in this study, precision is calculated as the inverse of the variance estimated for sets of items predicted to have the same level of item difficulty. Because precision is additive in both pretest examinee sample size and in collateral information, the BILOG precision estimates can be divided by the sample size to yield an estimate of the contribution per examinee. The value of the collateral information can then be expressed in terms of equivalent numbers of pretest examinees (m), as follows:

m = PR / (PB / n),

where PR is the precision yielded by the estimated regression model, and PB / n is the precision per examinee yielded by the BILOG calibration.

Results

Summary of Item Statistics

On the 24 pretests, there were 360 problem solving items, 96 of which were the items written for this study. After pretesting, items are subjected to a final review before being entered into the pool of items suitable for use in future operational tests. About 9% of the experimental items and 24% of the other problem-solving items were dropped from further consideration during final review.
Items can be eliminated for a variety of reasons, and no record of why particular items are deemed unusable is kept. However, all the rate items that were eliminated were from one cell of the design and were extremely easy. On the other hand, four of the six probability items that were dropped had a common, difficult counting task--three-digit numbers within a range with two or three digits equal to 1; these may have confused examinees of all ability levels. In our subsequent analysis, we found that the IRT parameters for these items could not be calibrated. There was no obvious reason why the remaining two probability items were eliminated.

The mean item statistics for the experimental and nonexperimental problem solving items that survived the pretest process are presented in Table 3. The experimental rate problems were easier than the nonexperimental items overall, as measured by E-Delta, t(243) = -3.41, p < .001, and by IRT b, t(243) = -1.99, p < .05, but their variability was similar. Thus, this set of rate problems covered as wide a range of difficulty levels as did a heterogeneous mix of other problem solving items. The IRT c parameter was higher for these rate problems than for all nonexperimental items--t(243) = 4.92, p < .001--suggesting that examinees were more successful at guessing the correct answer for the rate problems than they were for other problems. However, the guessing parameter for rate problems did not differ from what might be expected by chance (.20).

The mean difficulty of the experimental probability problems was equal to the mean difficulty of nonexperimental items overall, but the probability problems were less variable in difficulty, as measured by E-Delta (Levene's Test), F(1, 241) = 9.27, p < .003. Probability problems also were more discriminating than nonexperimental items--t(241) = 2.36, p < .02--and were differentially easier for males--t(87.54) = -1.96, p < .05 (t-test for unequal variances). In addition, they were less variable in differential difficulty--Levene's Test, F(1, 241) = 7.95, p < .005. Finally, the correlation of an experienced test developer's estimates of difficulty with the items' IRT b parameters was .75 (n = 92, p < .001) for all of the experimental items--.89 (n = 48, p < .001) for the rate problems, and .54 (n = 44, p < .001) for probability problems.

To assess whether the item design features had any impact on differential difficulty for males and females, separate 2 x 2 x 2 ANOVAs were carried out on the DDIF-m/f data for the two experimental item families. For the rate problems, only the main effect for context was significant--F(1, 43) = 23.31, p < .001. The mean DDIF-m/f was .46 (favoring females) for cost items and -.30 (favoring males) for DRT problems. For probability problems, the item design features had no significant impact on DDIF-m/f, although as noted above, this item set as a whole was slightly easier for males than for females.
TABLE 3. Mean Item Statistics for Experimental and Nonexperimental Problem Solving Items

Item Set                       Statistic   E-Delta   Rbis   DDIF-m/f   IRT a   IRT b   IRT c
Rate (n = 44)                  M            12.07    0.41     0.05      0.98    0.02    0.24
                               SD            2.07    0.14     0.65      0.37    1.20    0.12
Probability (n = 42)           M            13.69    0.40    -0.20      1.00    0.51    0.18
                               SD            1.41    0.12     0.43      0.26    0.98    0.09
Nonexperimental (n = 201)      M            13.27    0.42    -0.04      0.88    0.40    0.15
                               SD            2.14    0.15     0.66      0.32    1.13    0.11

Impact of Item Design Features on Item Operating Characteristics

Separate regression analyses were conducted for each of the three item parameters (difficulty, discrimination, and guessing) and for each of the two variant families (rate problems and probability problems). In each analysis, the dependent variable was one of the item parameters of interest (difficulty, discrimination, or guessing), and the independent variables were the item features.

The item parameter values considered in the analyses are summarized in Tables 4 and 5. Table 4 lists means and standard deviations calculated for the rate problems. Table 5 lists means and standard deviations calculated for the probability problems. The least squares regression results for predicting difficulty, discrimination, and guessing for both the rate problems and the probability problems are summarized in Table 6. The table provides raw (unstandardized) regression coefficients for all main effects and interaction effects that were found to be significant at the .05 significance level. Effects that were significant at the .01 or .001 significance levels are also indicated.

TABLE 4. IRT Item Parameters for Rate Problems with Differing Item Design Features

Use Variable   Complexity   Context      a (M, SD)     b (M, SD)      c (M, SD)
Yes            Level 2      DRT          .98, .28       1.49, .30     .24, .03
Yes            Level 2      Cost        1.03, .25       1.15, .38     .30, .06
Yes            Level 1      DRT          .77, .22        .53, .42     .27, .04
Yes            Level 1      Cost         .67, .19        .30, .56     .27, .03
No             Level 2      DRT          .83, .19        .13, .25     .24, .04
No             Level 2      Cost         .48, .14      -1.84, .63     .22, .01
No             Level 1      DRT          .76, .15      -1.16, .72     .21, .02
No             Level 1      Cost         .46, .13      -3.09, .57     .22, .01

TABLE 5. IRT Item Parameters for Probability Problems with Differing Item Design Features

Complexity   Context 1     Context 2    a (M, SD)     b (M, SD)      c (M, SD)
Level 2      Probability   Real (a)     .89, .13      1.70, .53      .21, .06
Level 2      Probability   Pure (a)    1.02, .20      1.60, .29      .23, .06
Level 2      Percent       Real (a)     .89, .35      1.62, .86      .22, .05
Level 2      Percent       Pure (a)     .88, .15      1.14, .54      .23, .07
Level 1      Probability   Real         .96, .13       .37, .55      .20, .06
Level 1      Probability   Pure         .95, .16       .48, .53      .20, .05
Level 1      Percent       Real         .84, .12      -.05, .53      .20, .04
Level 1      Percent       Pure         .91, .13       .09, .53      .18, .04

(a) n = 5; otherwise n = 6.

TABLE 6. Regression of Item Features on IRT Item Parameters for Two Families of Item Variants

Effect                                    Difficulty    Discrimination    Guessing
Rate Problems (n = 48)
  Intercept                                -3.01***        .47***          .22***
  Use Var = Yes                             3.34***        .25**           .03**
  Context = Cost                               --            --              --
  Complexity = L2                           1.09***          --              --
  Use Var = No and Context = DRT            1.95***        .32***            --
  Use Var = Yes and Complexity = L2            --           .29***           --
  Use Var = Yes and Context = Cost             --            --             .03*
  RMSE                                       .50            .19             .03
  R2                                         .91            .52             .42
  Adj. R2                                    .90            .50             .41
Probability Problems (n = 44)
  Intercept                                  .05             --              --
  Complexity = L2                           1.29***          --              --
  Percent/Probability                        .34*            --              --
  Real/Pure                                    --            --              --
  RMSE                                       .54             --              --
  R2                                         .62             --              --
  Adj. R2                                    .61             --              --
*** p < .001, ** p < .01, * p < .05

Rate Problems. The tree-based analyses of the IRT parameters for rate problems--difficulty, discrimination, and guessing--are summarized in Figures 1, 2, and 3, respectively.
In these illustrations, each node is plotted at a horizontal location based on its estimated parameter value; its vertical location is determined by its estimated deviance value, the residual sum of squares for items in the node. The item features selected to define each split are listed on the edges connecting parents to offspring. The number of items assigned to each node is plotted as the node label. The resulting displays illustrate how variation in item feature classifications leads to subsequent variation in IRT parameter estimates.

Figure 1 demonstrates that, among the 48 rate variants, the manipulation that had the greatest impact on item difficulty required students to perform operations on variables as opposed to numbers. As shown in the upper section of Figure 1, the 24 items that did not require students to perform operations on variables (Use Var = No) had an average difficulty of -1.49 (SD = 1.30), and the 24 items that did require students to perform operations on variables (Use Var = Yes) had an average difficulty of .87 (SD = .63). Thus, items that required examinees to use variables were more difficult--by more than 1.5 standard deviation units--than those that did not. The significance of this result can be seen both in the tree and in the table. As shown in Figure 1, this split (Use Var: Yes) produced the largest decrease in deviance. As shown in Table 6, this effect produced the largest coefficient in the regression for difficulty.

Figure 1 also illustrates that, among the subset of rate problems that did not require operations on variables (Use Var = No), the 12 items with a cost context were significantly easier (M = -2.47, SD = .87) than the 12 items with a DRT context (M = -.51, SD = .85). However, among the subset of rate problems that did require operations on variables, the cost and DRT contexts were equally difficult. This interaction is clearly illustrated in the tree and is also evident in Table 6. That is, as indicated in Table 6, the cost/DRT effect was not significant as a main effect but it was significant when crossed with Use Var = No. Thus, the context results obtained in the least squares regression analysis exactly replicated those obtained in the tree-based analysis. In particular, both analyses indicated that context can be a strong determiner of item difficulty when items do not require proficiency at using variables, but context is not a strong determiner of item difficulty when items do require proficiency at using variables. These results suggest that context effects may have a greater impact on performance among lower performing examinees than among higher performing examinees.

Figure 1 also summarizes the effect of problem complexity on item difficulty. Overall, the 24 items at the higher complexity level were significantly more difficult (M = .24, SD = 1.38) than the 24 items at the lower complexity level (M = -.86, SD = 1.57). In addition, this effect was of similar magnitude for problems involving a cost or a DRT context, and for problems that either included or did not include a variable. That the magnitude of the complexity effect was similar for different types of problems can also be seen in Table 6, which indicates that the main effect for complexity was highly significant (p < .001). Because all of the items at the higher complexity level involved four constraints, and all of the items at the lower complexity level involved only three constraints, this result suggests that the presence of a fourth constraint contributes to additional difficulty at all levels of proficiency.

The tree-based analysis of item discrimination is summarized in Figure 2.
The similarity of the difficulty and discrimination trees suggests that the factors used to generate the rate variants affected difficulty and discrimination similarly. Problems that included a variable had better discrimination (M = .86, SD = .27) than those that did not (M = .63, SD = .22). Among items that did not include a variable, DRT problems were more discriminating (M = .79, SD = .17) than cost problems (M = .47, SD = .13). And finally, among problems that did include a variable, more complex items were more discriminating (M = 1.01, SD = .26) than less complex problems (M = .72, SD = .20).

The tree-based analysis of item guessing is summarized in Figure 3. Rate variants that included variables tended to have higher guessing parameters (M = .27, SD = .04) than rate variants that did not (M = .22, SD = .02). In addition, among items that included variables, items with a cost context tended to have slightly higher guessing parameters (M = .29, SD = .05) than items with a DRT context (M = .25, SD = .04).

FIGURE 1. Estimated regression tree for the difficulty parameter for rate problems (R-squared = 0.91, Adj. R-squared = 0.90).

FIGURE 2. Estimated regression tree for the discrimination parameter for rate problems (R-squared = 0.52, Adj. R-squared = 0.50).

FIGURE 3. Estimated regression tree for the guessing parameter for rate problems (R-squared = 0.42, Adj. R-squared = 0.41).

Probability problems. The tree-based analysis of difficulty for the probability problems is summarized in Figure 4, and the related regression statistics are presented in Table 6. Both the tree-based analysis and the classical least squares regression analysis indicate that, among the 44 probability variants, the manipulation that had the greatest impact on item difficulty involved the complexity of the counting subtask. In particular, the 24 items that required a less complex counting subtask were easier (M = .22, SD = .55) than the 20 items that required a more complex counting subtask (M = 1.51, SD = .59). For probability problems at both complexity levels, the 22 items that were cast as probability problems were slightly more difficult (M = .98, SD = .78) than the 22 that were cast as percent problems (M = .64, SD = .92). Note that this effect is reflected both in the tree and in the regression coefficients shown in Table 6. For probability problems at both complexity levels, the difficulty of items set in real-life contexts did not differ substantially from similarly configured items that simply referred to sets of integers. This result is indicated by the absence of a real vs. pure split in the estimated regression tree, and by the fact that the real vs. pure effect was not significant in the least squares regression analysis.

As indicated in Table 6, none of the features used to generate the probability variants were useful for explaining variation in item discrimination parameters or in item guessing parameters. A similar result was obtained in the tree-based analysis. That is, the estimated trees yielded no useful splits.

FIGURE 4. Estimated regression tree for the difficulty parameter for probability problems (R-squared = 0.62, Adj. R-squared = 0.61).

Implications for Reductions in Pretest Sample Sizes

The improvements in posterior precision achievable with the collateral models estimated in this study are summarized in Table 7.
Because precision varies with difficulty level, separate estimates are provided for groups of items located at varying points on the underlying difficulty scale. Item groupings correspond to the feature categories identified in the regression trees (Figures 1 through 4). Under the estimated difficulty model, all of the items in each group are predicted to have the same value of item difficulty. This value is listed in the column labeled "Predicted Difficulty." Table 7 also lists two precision estimates for each group. The estimates listed in the column labeled "BILOG Precision" incorporate information from approximately 1,000 examinee response vectors, but no information from the estimated item feature model. These estimates were calculated as the inverse square of the average within-group standard error obtained from the BILOG calibration. The estimates listed in the column labeled "Collateral Precision" incorporate information from the estimated item feature model, but no information from examinee response vectors. These estimates were calculated as the inverse of the within-group variance obtained from the estimated regression model.

The right-most column of Table 7 provides an estimate of the value of the collateral information expressed in terms of equivalent numbers of pretest examinees (m). As can be seen, the collateral model for rate variants yielded an equivalent sample size of approximately 215 examinees, and the collateral model for probability variants yielded an equivalent sample size of approximately 128 examinees. In interpreting these results it is important to note that, while precision is additive, the effect of increasing sample sizes is not. Specifically, the posterior standard deviation of item parameters shows diminishing returns as calibration sample size is increased, so that the first 200 examinees reduce posterior standard deviations the most, the next 200 reduce posterior standard deviations by less, and by the time that there are 1,000 pretest examinees, another 200 examinees reduces posterior standard deviations only slightly. The relevance of using collateral information that is worth, say, 200 examinees, is that the impact of the collateral information is tantamount to that of the first 200 examinees, not the last 200.

Figure 5 illustrates this phenomenon for the rate variants. The solid curve depicts the effect of increasing sample sizes when collateral information is not included in the calibration. The dashed curve shows the effect of increasing sample sizes when collateral information is included in the calibration. The line from A to B represents the decrease in uncertainty that would be attained if, in addition to collateral information, 10 examinee response vectors were also available. The line from C to D represents the decrease in uncertainty that would be attained if, in addition to collateral information, 250 examinee response vectors were also available. The line at E shows that a calibration that included both collateral information and 250 pretest examinees would yield an effective sample size of about 420 examinees. (These estimates do not reflect the additional improvements achievable through the use of expected response curves, discussed below.)

How valuable is 200 examinees-worth of information about item parameters from item features? The answer depends on how this information will be used. The current calibration system uses information from pretest examinees only, and treats the resulting estimates as if they were true item parameter values (that is, any remaining uncertainty is ignored).
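As a numerical illustration of the equivalent-sample-size formula m = PR / (PB / n) introduced in the Method section, the sketch below applies it to one group of rate variants reported in Table 7. The precision values and the calibration sample of 1,190 examinees come from that table; the function itself is only an illustrative rendering of the arithmetic, not the operational procedure.

```python
def equivalent_sample_size(collateral_precision, bilog_precision, n_examinees):
    """m = PR / (PB / n): express collateral precision as the number of pretest
    examinees that would supply the same amount of information."""
    precision_per_examinee = bilog_precision / n_examinees
    return collateral_precision / precision_per_examinee

# "Var, L2" rate variants from Table 7: BILOG precision 24.01 based on 1,190
# examinee response vectors, collateral precision 7.17 from the regression model.
print(round(equivalent_sample_size(7.17, 24.01, 1190)))  # about 355 examinees
```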
Experience has shown that 1,000 examinees will suffice for this approach. Collateral information worth 200 examinees would be disappointing indeed if all it meant was reducing the pretest sample to 800 with the rest of the current system intact. This would be a reduction of pretest sample size of just 20%.

The preferred alternative addresses not only the source of information about item parameters, but also the way the information is used. The approach, described in Mislevy, Sheehan, and Wingersky (1993), uses expected response curves (ERCs) that incorporate information from both sources (collateral information and pretest examinees); it models uncertainty about these sources as well. The first of these properties means that it is possible to use collateral information about the item features that influence item operating characteristics. The second property means that it is not necessary for the total amount of information about item parameters to be so great as to treat them as known. The ERCs reduce biases that arise when estimates are treated as true values in the current system--the phenomenon that kept people from using small calibration samples in that system. Mislevy, Sheehan, and Wingersky found that ERCs based on collateral information, plus responses from 250 pretest examinees, provided measurement of examinees that was as effective as item parameter estimates based on 1,000 pretest examinees. This is a reduction of 750 pretest examinees, or 75%.

TABLE 7. The Precision of Difficulty Estimates Generated With and Without Collateral Information

Item Group (a)           n    Predicted Difficulty    BILOG Precision (b)    Collateral Precision (c)    Equivalent Sample Size
Rate Problems
  No Var, Cost, L1       6          -3.09                   16.56                    3.07                        220
  No Var, Cost, L2       6          -1.84                   15.29                    2.54                        197
  No Var, DRT, L1        6          -1.16                   19.75                    1.93                        117
  No Var, DRT, L2        6           0.13                   88.23                   16.13                        218
  Var, L1               12           0.41                   39.17                    4.23                        129
  Var, L2               12           1.32                   24.01                    7.17                        355
  Weighted Average (d)                                                                                           215
Probability Problems
  L1, Pct               12           0.02                   43.32                    3.84                        105
  L1, Prob              12           0.43                   73.73                    3.72                         60
  L2, Pct               10           1.38                   23.23                    1.93                         99
  L2, Prob              10           1.65                   27.45                    6.12                        265
  Weighted Average (d)                                                                                           128

(a) Item groups reflect the combinations of features found to be significant in the regression analysis.
(b) BILOG precision = 1 / (Average Standard Error)^2 from a calibration of 1,190 examinee response vectors.
(c) Collateral precision = 1 / (Residual Standard Deviation)^2 from the estimated regression equation.
(d) Weights are proportional to the numbers of items available in each group.

FIGURE 5. The effect of increasing sample sizes with and without collateral information (solid curve: without collateral information; dashed curve: with collateral information; x-axis: calibration sample size).

Discussion

The attempt to systematically manipulate difficulty was extremely successful for rate problems and moderately successful for probability problems. For rate problems, all the manipulated features affected difficulty, accounting for 90% of the variance in difficulty in the set of problems. This family of items covered a wide difficulty range. One manipulation in particular--using a variable to transform a multistep arithmetic word problem into a multistep algebra word problem--had a very powerful effect on difficulty. In addition, there was an interesting interaction between context and the use of a variable: For easier items that did not involve a variable, cost problems were easier than DRT problems, but this particular context did not affect difficulty for problems that did involve a variable. This suggests that some aspects of context may facilitate or impede problem solution among lower-performing examinees, but not among higher-performing examinees. The item features also had similar effects on item discrimination and guessing.
In contrast with the rate problems, the probability problems were more difficult and covered a narrower difficulty range. Increasing the complexity of the counting task had the greatest impact on difficulty. One aspect of context (whether the problem was cast as a percent or probability problem) did affect difficulty, but another (whether or not the problem narrative involved a real-life context) did not. However, the context interaction for the rate problems serves as a reminder not to dismiss the possibility that such a contrast (real-life versus pure context) may be an important feature for less difficult items. Finally, item design features did not impact the discrimination or guessing parameters for probability problems.

One issue that the results for the probability problems raise is why these probability problems were so difficult. The items with the simple counting task were not very demanding in terms of the arithmetic involved, and presenting the problem in terms of percent rather than probability facilitated performance. Taken together, these factors suggest that a significant portion of the examinees taking the GRE General Test in 1996 were unfamiliar with basic statistical concepts and procedures.

In the following sections, the implications this study may present for articulating the constructs assessed by the GRE quantitative measure, for increasing the efficiency of test development, and for reducing pretest sample size are discussed.

Understanding Item Difficulty and Construct Representation

Among item statistical characteristics, difficulty has received the most attention because of its role in construct validation (Embretson, 1983) and proficiency scaling (Sheehan, 1997). Embretson distinguished between the two aspects of test validity--nomothetic span, which refers to the relationship of test scores to other variables, and construct representation, which "is concerned with identifying the theoretical mechanisms that underlie item responses such as information processes, strategies, and knowledge stores" (p. 179). With respect to construct representation, items can be described from either the task perspective (what are the features of the task?) or the examinee perspective (what processes, skills, strategies, and knowledge do people use to solve problems?). Of course, items can be described in many different ways. Difficulty modeling introduces a criterion--that of the relationship of item features to difficulty--which permits a distinction between critical and incidental features. A basic assumption of this approach is that the features of the task and examinee processes are interdependent.

Although many studies such as the current one focus primarily on one of these perspectives, a complete theory of the task requires both. Some evidence about the relationship between the item features of rate problems that were manipulated in this study and problem solution processes is reported in Sebrechts et al. (1996). Sebrechts et al. categorized the strategies used by college students in solving 20 GRE word problems, and examined the relationships between item features, solution strategies, and errors. The four classes of strategies identified included:

1. following step-by-step mathematical solutions (equation based)
2. setting up and solving ratios
3. modeling the situation by deriving solutions for a set of potential values for a variable and converging on an answer (simulations)
4. using other, unidentified strategies
Most of the successful problem solutions involved equation-based strategies. Nevertheless, when the use of an equation-based strategy would have required actually manipulating variables rather than an arithmetic step-by-step solution, students were less likely to use this strategy even though it was highly appropriate. They were more likely to use other, unidentifiable strategies or simulation strategies. It seems that many of these students either lacked appropriate strategies or failed to apply the strategies they possessed to word problems that required the manipulation of variables. Problem complexity, on the other hand, did not have an impact on strategy but was associated with errors of misusing the givens in the problem statement.

In sum, determining which item features impact item difficulty and how these features affect examinee problem solving provides a better explication of the constructs being assessed. This more detailed understanding of constructs is necessary for principled item generation, and can serve as a basis for the development of diagnostic assessment and reporting.

Implications for Creating Item Variants

The results of the current study demonstrate the enormous potential of systematically creating item variants. The systematic generation of item variants can result in a set of items with predictable item characteristics that differ from each other in specified ways and degrees. Efforts to automate some aspects of systematic item generation are currently underway (Singley & Bennett, 1998). In addition to creating items for operational tests, variants can be created for use in diagnostic and practice tests without compromising operational pools. However, there are many issues that need to be addressed before the potential of this approach to item development can be fully realized in the context of large-scale assessment. These issues include the diversity of problems that exist in the GRE quantitative measure, the wide variety of item features that can be manipulated to create variants, how items should be classified, and how similarity among problems should be defined.

The pool of GRE quantitative problems is quite diverse. Rate and probability word problems represent only a small proportion of the item types included in the measure. In a sample of about 340 arithmetic and algebra items in two GRE computer adaptive test pools, only 4% were classified as probability problems and 2% as rate problems. Furthermore, even for these small sets of problems, many features can be manipulated to create variants. Two criteria that might be used to determine which item features to manipulate include the impact of the features on item performance, and whether or not the features are deemed construct relevant. While information about the former criterion can be gleaned from examination of existing items and experimental studies, establishing construct relevance requires other kinds of evidence, such as studies of similarities in the processes used to solve assessment items and those used to solve problems in the academic domains of interest.

Finally, if large numbers of item variants were created, methods to manage their distribution among the pools of items used for computerized adaptive testing would need to be developed. This might require the revision of the current item classification system. A better understanding of the item features that contribute to perceived item similarity by examinees, and to transfer among items, would be helpful here.
Implications for Reducing Pretest Sample Size

Knowledge of the degree to which different features impact item statistics could allow us to create item variants along with estimates of item operating characteristics. Statistical procedures for using collateral information such as this to reduce pretest sample size have been developed (Mislevy et al., 1993). Nevertheless, two barriers block the application of these methods at present, although neither barrier is insurmountable. One of these barriers concerns operational constraints that must be taken into consideration. Currently, sample size is controlled at the section rather than the item level. This means that one would want to have collateral information for all of the diverse items in a section before the sample size could be reduced for that section. A study in which four pretest sections consist of item variants based on the same set of parent items with known item operating characteristics is currently in process. The second barrier is the lack of a knowledge base that would permit prediction of item operating characteristics for the wide variety of items that exist on the GRE quantitative measure. Over time, this knowledge base could be developed through the examination of existing items and through experimental studies such as this one. In the meantime, the difficulty estimates of experienced item writers are reliable and predictive of actual difficulty and could be used to reduce pretest sample size (Sheehan & Mislevy, 1994).

Concluding Comments

Construct-driven item generation requires a description of items that can be related to the processes, skills, and strategies used in item solving. The benefits of such an approach are that, if item variants can be created systematically through an understanding of critical problem features, tests can be designed to cover important aspects of a domain, overlap can be controlled, and pretesting requirements can be reduced. A closer integration of research and test development would contribute to the development of the knowledge base needed to support construct-driven item generation. Ideally, every time items are pretested, knowledge about how item features impact item performance could be gained if items were designed to vary systematically on selected features. This kind of knowledge would not only help to improve item development efficiency, but could also provide a basis for the development of new products and services.

References

Breiman, L., Friedman, J. H., Olshen, R., & Stone, C. J. (1984). Classification and regression trees. Belmont, CA: Wadsworth International Group.

Embretson, S. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93(1), 179-197.

Embretson, S. E. (1995). A measurement model for linking individual learning to processes and knowledge: Application to mathematical reasoning. Journal of Educational Measurement, 32(3), 277-294.

Hall, R., Kibler, D., Wenger, E., & Truxaw, C. (1989). Exploring the episodic structure of algebra story problem solving. Cognition and Instruction, 6(3), 223-283.

Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 129-145). Hillsdale, NJ: Lawrence Erlbaum Associates.

Lane, S. (1991). Use of restricted item response models for examining item difficulty ordering and slope uniformity. Journal of Educational Measurement, 28(4), 295-309.

Mayer, R. E. (1981). Frequency norms and structural analysis of algebra story problems into families, categories, and templates. Instructional Science, 10, 135-175.
Mayer, R. E. (1982). Memory for algebra story problems. Journal of Educational Psychology, 74(2), 199-216.

Mislevy, R. J., & Bock, R. D. (1982). BILOG: Maximum likelihood item analysis and test scoring with logistic models for binary items. Chicago: International Educational Services.

Mislevy, R. J., Sheehan, K. M., & Wingersky, M. (1993). How to equate tests with little or no data. Journal of Educational Measurement, 30(1), 55-78.

Mislevy, R. J., Wingersky, M. S., & Sheehan, K. M. (1994). Dealing with uncertainty about item parameters: Expected response functions (ETS Research Report RR-94-28-ONR). Princeton, NJ: Educational Testing Service.

Reed, S. K. (1987). A structure-mapping model for word problems. Journal of Experimental Psychology, 13(1), 124-139.

Reed, S. K., Dempster, A., & Ettinger, M. (1985). Usefulness of analogous solutions for solving algebra word problems. Journal of Experimental Psychology: Learning, Memory, and Cognition, 11(1), 106-125.

Sebrechts, M. M., Enright, M., Bennett, R. E., & Martin, K. (1996). Using algebra word problems to assess quantitative ability: Attributes, strategies, and errors. Cognition and Instruction, 14(3), 285-343.

Shalin, V. L., & Bee, N. V. (1985). Structural differences between two-step word problems (Technical Report No. ED-259-949). Pittsburgh, PA: University of Pittsburgh, Learning Research and Development Center.

Sheehan, K. M. (1997). A tree-based approach to proficiency scaling and diagnostic assessment. Journal of Educational Measurement, 34(4), 333-352.

Sheehan, K., & Mislevy, R. J. (1994). A tree-based analysis of items from an assessment of basic mathematics skills (ETS Research Report 94-14). Princeton, NJ: Educational Testing Service.

Singley, M. K., & Bennett, R. E. (1998). Validation and extension of the mathematical expression response type: Applications of schema theory to automatic scoring and item generation in mathematics (ETS Research Report RR-97-19, GRE Report 93-24P). Princeton, NJ: Educational Testing Service.