
Items by Design: The Impact of Systematic Feature Variation
on Item Statistical Characteristics

Mary K. Enright
Mary Morley
Kathleen M. Sheehan

GRE Board Report No. 95-15R

September 1999

This report presents the findings of a research project funded by and
carried out under the auspices of the Graduate Record Examinations Board.

Educational Testing Service, Princeton, NJ 08541
********************
Researchers are encouraged to express freely their professional
judgment. Therefore, points of view or opinions stated in Graduate
Record Examinations Board Reports do not necessarily represent official
Graduate Record Examinations Board position or policy.
********************
The Graduate Record Examinations Board and Educational Testing Service are
dedicated to the principle of equal opportunity, and their programs,
services, and employment policies are guided by that principle.
EDUCATIONAL TESTING SERVICE, ETS, the ETS logo, GRADUATE RECORD EXAMINATIONS,
and GRE are registered trademarks of Educational Testing Service.
The modernized ETS logo is a trademark of Educational Testing Service.
Educational Testing Service
Princeton, New Jersey 08541
Copyright © 1999 by Educational Testing Service. All rights reserved.
Acknowledgments
We wish to recognize the contribution of the many test development staff members whose advice and
cooperation was essential to this project. Special thanks to Jackie Tchomi, Judy Smith, and Jutta Levin. We
also appreciate Bob Mislevy's advice about how to estimate the usefulness of collateral information. Finally,
we are grateful to the Graduate Record Examinations Board for supporting this research.
Abstract
This study investigated the impact of systematic item feature variation on item statistical
characteristics and the degree to which such information could be used as collateral information to
supplement examinee performance data and reduce pretest sample size. Two families of word problem
variants for the quantitative section of the Graduate Record Examinations (GRE®) General Test were
generated by systematically manipulating item features. For rate problems, the item design features affected
item difficulty (Adj. R2 = .90), item discrimination (Adj. R2 = .50), and guessing (Adj. R2 = .41). For
probability problems, the item design features affected difficulty (Adj. R2 = .61), but not discrimination or
guessing. The results demonstrate the enormous potential of systematically creating item variants. However,
questions of how best to manage variants in item pools and to implement statistical procedures that use
collateral information must still be resolved.
KEY WORDS
Quantitative Reasoning
Graduate Record Examinations
Faceted Item Development
Algebra Word Problems
Item Statistical Characteristics
Assessment of Quantitative Skills
Table of Contents
Introduction
    Research on Word Problems
Method
    Design of Word Problems
    Item Pretesting
    Item Analysis
    Data Analysis
Results
    Summary of Item Statistics
    Impact of Item Design Features on Item Operating Characteristics
    Implications for Reductions in Pretest Sample Sizes
Discussion
    Summary
    Understanding Item Difficulty and Construct Representation
    Implications for Creating Item Variants
    Implications for Reducing Pretest Sample Size
    Concluding Comments
References
List of Tables
TABLE 1. Examples of Rate Items
TABLE 2. Examples of Probability Items
TABLE 3. Mean Item Statistics for Experimental and Nonexperimental Problem Solving Items
TABLE 4. IRT Item Parameters for Rate Problems with Differing Item Design Features
TABLE 5. IRT Item Parameters for Probability Problems with Differing Item Design Features
TABLE 6. Regression of Item Features on IRT Item Parameters for Two Families of Item Variants
TABLE 7. The Precision of Difficulty Estimates Generated With and Without Collateral Information
List of Figures
FIGURE 1. Estimated regression tree for the difficulty parameter for rate problems
FIGURE 2. Estimated regression tree for the discrimination parameter for rate problems
FIGURE 3. Estimated regression tree for the guessing parameter for rate problems
FIGURE 4. Estimated regression tree for the difficulty parameter for probability problems
FIGURE 5. The effect of increasing sample sizes with and without collateral information
Introduction
Because of the continuous nature of computer adaptive testing, the danger of item exposure will
increase unless item pools are large enough or can be changed frequently enough to reduce the probability of
examinees revealing items that a large number of subsequent examinees may receive. Thus continuous
computer adaptive testing has created a demand for more items and greater efficiencies in item development.
Improvement in the efficiency of item development will result if methods for generating items systematically
are developed or if pretest sample size requirements can be reduced. A particularly critical bottleneck in the
item development process at present is the need for item pretesting. The number of items that can be
pretested is constrained by the number of examinees on whom the item must be tested in order to obtain
reliable estimates of item operating characteristics. Recently, however, methods have been developed that
permit the use of collateral information about item features to supplement examinee performance data, so that
smaller pretest samples can be used to obtain reliable estimates of item operating characteristics (Mislevy,
Sheehan, & Wingersky, 1993). The purpose of this study was to determine if designing systematic variants of
quantitative word problems would result in more efficient item development, thus permitting item operating
characteristics to be reliably estimated using smaller pretest samples.

The idea of creating variants of existing items as a way of developing more items is not novel
and probably has been done informally by item writers as long as standardized tests have been in existence.
While the unsystematic creation of variants contributes to the efficiency of the item development process,
there are some dangers associated with this practice, such as overlap among items or inadvertently narrowing
the construct being measured.

The ideal alternative would be to create item variants systematically by using a framework that
distinguishes construct relevant and irrelevant sources of item statistical characteristics, as well as incidental
item features that are neutral with respect to item statistical characteristics and the underlying construct. Thus
item variants with different statistical parameters could be created by manipulating construct relevant
features, and item variants with similar statistical parameters could be created by manipulating incidental
features. With this method, overlap among items could be better controlled. Unfortunately, the constructs
tapped by most existing tests are not articulated in enough detail to allow the development of construct-driven
item design frameworks.

A third approach to generating item variants is to use item design frameworks as a hypothesis-testing
tool to assess the impact of different item features on item statistical characteristics. This is the systematic
approach that was taken in the present study. Frameworks for creating item variants were developed based on
prior correlational analyses of item features that affect problem difficulty and on the hypotheses of
experienced item writers. Thus the item development and research processes were integrated so that the
degree to which different item features impact item statistical characteristics could be determined and the
constructs underlying the creation of item variants could be more clearly articulated.
Research on Word Problems
A body of research about problem features that affect problem difficulty already exists for arithmetic
and algebra word problems. This research can serve as a basis for creating systematic item variants and
estimating problem difficulty. The relevant research was stimulated by Mayer (1981), who analyzed algebra
word problems from secondary school algebra texts. Mayer found that these problems could be classified into
eight families based on the problems' "story line" and source formulas (such as "distance = rate x time" or
"dividend = interest rate x principal"). However, similar story lines may reflect very different quantitative
structures (Mayer, 1982). In order to capture this relevant quantitative structure separately from the specific
problem content, a number of network notations have been developed (Hall, Kibler, Wenger, & Truxaw,
1989; Reed, 1987; Reed, Dempster, & Ettinger, 1985; Shalin & Bee, 1985).

For example, Shalin and Bee (1985) analyzed the quantitative structure of word problems in terms of
elements, relations, and structures. Many word problems consist of one or more triads of elements combined
in additive or multiplicative relationships. One of the relationships Shalin and Bee described--a multiplicative
relationship among a rate and two quantities--is typical of many arithmetic and algebra word problems, such
as those involving travel, interest, cost, and work. For complex problems that involve more than one triad,
problem structure describes the way that these triads are linked. Shalin and Bee found that many two-step
arithmetic word problems could be classified as exemplars of one of a number of structures (such as
hierarchy, shared-whole, and shared-part), and that these problem structures had an effect on problem
difficulty. This idea can be extended to other word problems, and the kind of superordinate constraint that
allows the subparts of a problem to be composed can be used as one feature in classifying problems (Hall et
al., 1989; Sebrechts, Enright, Bennett, & Martin, 1996). For example, round trip problems (the distance on
one part of the trip equals the distance on the second part of the trip) exemplify a class of problems in which
the superordinate constraint can be described as Distance 1 = Distance 2. Another type of problem involving
parts of a trip in the same direction but at different rates might have a superordinate constraint such that
Distance 1 + Distance 2 = Total Distance.
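To make the round-trip constraint concrete, here is a minimal sketch (in Python) that works through an item
of the kind shown later in Table 1; the specific times and speeds are taken from that example:

    # Round-trip constraint: Distance 1 = Distance 2.
    # The first half of a 15-hour round trip took 9 hours at 40 miles per hour;
    # find the average speed on the return trip.
    distance_1 = 9 * 40             # Rate 1 x Time 1 = 360 miles
    rate_2 = distance_1 / (15 - 9)  # Distance 2 = Distance 1; Time 2 = 6 hours
    print(rate_2)                   # -> 60.0 miles per hour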
Problem features such as those described above can be related theoretically to individual differences
in cognition. For example, because of limitations on working memory capacity, the more elements and
relationships there are, the more difficult a problem is likely to be. However, knowledge about basic,
complementary mathematical relationships among elements (such as "distance, rate, and time" or "dividends,
interest, and principal") should help individuals to group or chunk subparts of a problem. Integrating these
chunks into a larger structure requires recognition of the superordinate constraints that are operating in the
problem situation. Thus we assume, as pointed out by Embretson (1983), that the "stimulus characteristics of
the test items determine the components that are involved in its solution" (p. 181).

In a study of 20 word problems that had appeared on the quantitative section of the Graduate Record
Examinations (GRE®) General Test, Sebrechts et al. (1996) found that three problem features--the need to
apply algebraic concepts (manipulate variables), problem complexity, and content--accounted for 37% to
62% of the variance in two independent estimates of problem difficulty. In addition to this correlational study
of a small set of problems, other studies also demonstrate that similar item features are useful in designing
word problems (Lane, 1991), in providing substantive understanding of changes in student performance with
training (Embretson, 1995), and in accounting for problem difficulty.

To date, researchers have focused on identifying sources of item difficulty because this information is
useful for explicating the constructs represented on a test and for developing proficiency descriptors
(Embretson, 1983, 1995; Sheehan, 1997). However, information about the problem features that affect item
discrimination and guessing parameters as well as item difficulty parameters is also valuable at present
because recent advances in measurement theory support the use of collateral information about item features
to estimate item operating characteristics using smaller examinee samples (Mislevy et al., 1993; Mislevy,
Wingersky, & Sheehan, 1994). Such estimation procedures can reduce the cost of item development.

For word problems on many standardized tests, the kinds of item features described above are varied
unsystematically and on an ad hoc basis, and so it is difficult to estimate precisely how much any particular
feature contributes to item statistical characteristics. In this study, we developed and pretested items that
varied systematically on some of these features so that we could better estimate the degree to which different
manipulations affected item statistical characteristics. The questions we wished to answer were as follows:

1. Were the systematically designed items of an acceptable quality?

2. What impact did the item design features have on item statistical characteristics?

3. How useful would the item design information be for reducing pretest sample sizes?
Method
Design of Word Problems

For the purposes of this study, two families of 48 related word problems were created. For each
family, a design matrix specified three item features that were crossed with each other to create eight classes
of variants. Six problem variants were written for each class. All items were presented in a five-option
multiple-choice format.

Family 1: Rate Problems, Equal Outputs. For the first family of problems, three item features--
complexity, context, and using a variable--were selected for manipulation based on the findings of Sebrechts
et al. (1996). Some examples of problems typical of this family are provided in Table 1. The basic structure
of these problems can be described in terms of three constraints, which can be combined into a simple linear
system, as follows:

Rate 1 x Unit A1 = Unit B1
Rate 2 x Unit A2 = Unit B2
Unit B1 = Unit B2

To increase problem complexity, an additional constraint, or step, was added to half of the problems:

Unit A1 + Unit A2 = Total Unit A.

Thus the less complex problems were composed of three constraints, and the more complex consisted of four
constraints. The goal of the less complex problems was to find Unit A2 given Unit A1, Rate 1, and Rate 2; the
goal of the more complex problems was to find Unit A2 and Rate 2 given Unit A1, Total Unit A, and Rate 1.
The narrative context of these problems involved either cost or distance. Finally, to manipulate the algebraic
content, one of the elements of the problem was changed from a quantity to a variable: "John bought 6 cans
of soda" became "John bought x cans of soda." This latter manipulation led to a solution that was an algebraic
expression rather than a derived quantity.
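To make the three-constraint system concrete, the following minimal sketch (in Python) solves the Level 1
cost item shown in Table 1; the dollar amounts and case counts come from that example:

    # Less complex (three-constraint) rate problem:
    # soda usually costs $6.00 per case and is on sale for $4.00 per case;
    # how many sale cases cost the same as 6 cases at the usual price?
    rate_1, unit_a1 = 6.00, 6    # usual price per case; usual number of cases
    rate_2 = 4.00                # sale price per case
    unit_b1 = rate_1 * unit_a1   # Rate 1 x Unit A1 = Unit B1 = $36.00
    unit_a2 = unit_b1 / rate_2   # Unit B1 = Unit B2 = Rate 2 x Unit A2
    print(unit_a2)               # -> 9.0 cases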
TABLE 1. Examples of Rate Items

Cost, Level 1, No Variable:
Soda that usually costs $6.00 per case is on sale for $4.00 per case. How
many cases can Jack buy on sale for the price he usually pays for 6 cases?

Cost, Level 1, Use Variable:
Soda that usually costs $6.00 per case is on sale for $4.00 per case. How
many cases can Jack buy on sale for the price he usually pays for x cases?

DRT, Level 1, No Variable:
Under normal circumstances, a train travels from City X to City Y in 6
hours at an average speed of 60 miles per hour. When the tracks were being
repaired, this train traveled on the same tracks at an average speed of 40
miles per hour. How long did the trip take when the tracks were being
repaired?

DRT, Level 1, Use Variable:
Under normal circumstances, a train travels from City X to City Y in t
hours at an average speed of 60 miles per hour. When the tracks were being
repaired, this train traveled on the same tracks at an average speed of 40
miles per hour. How long did the trip take when the tracks were being
repaired?

Cost, Level 2, No Variable:
As a promotion, a store sold 90 cases of soda of the 150 cases they had in
stock at $4.00 per case. To make a profit, the store needs to bring in the
same total amount of money when they sell the remaining cases of soda. At
what price must the store sell the remaining cases?

Cost, Level 2, Use Variable:
As a promotion, a store sold 90 cases of soda of the x cases they had in
stock at $4.00 per case. To make a profit, the store needs to bring in the
same total amount of money when they sell the remaining cases of soda. At
what price must the store sell the remaining cases?

DRT, Level 2, No Variable:
A round trip by train from City X to City Y took 15 hours. The first half of
the trip took 9 hours and the train traveled at an average speed of 40
miles per hour. What was the train's average speed on the return trip?

DRT, Level 2, Use Variable:
A round trip by train from City X to City Y took 15 hours. The first half
of the trip took t hours and the train traveled at an average speed of 40
miles per hour. What was the train's average speed on the return trip?

Note. These example items were not used in this study.
Family 2: Probability Problems. The second family of items was made up of variants of probability
problems. Examples of problems typical of this family are provided in Table 2. These problems had three
components--determining the number of elements in a set, determining the number of elements in a subset,
and calculating the proportion of the whole set that was included in the subset. Given a lack of prior research
on these types of problems, hypotheses about item features that might affect item difficulty were more
speculative and were based on the expert knowledge of item writers.

First, we varied the complexity of counting the elements in the subset. The set always consisted of
the number of integers within a given range. The difficulty of the subset counting tasks was varied as follows:

Complexity Level 1                        Complexity Level 2
Numbers in a smaller range                Numbers beginning with certain digits and
                                          ending with certain digits
Numbers ending with a certain digit       Numbers beginning with certain digits and
                                          ending with odd digits
Numbers with 3 digits the same            Numbers with 2 or 3 digits equal to 1

Second, we speculated that items cast as probability problems would be more difficult than those cast
as percent problems. And third, we varied the cover story so that some problems involved a real-life context
(phone extensions, room numbers) and others simply referred to sets of integers. Although this latter feature
(real versus pure) is a specification that is used to assemble test forms, we did not have a clear sense of how it
might affect difficulty for these kinds of problems.
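As an illustration of the Level 2 counting tasks, the following minimal sketch counts the integers from 100 to
999 that begin with the digits 2 or 3 and end with the digits 8 or 9, paralleling the Table 2 items:

    # Brute-force count of a Level 2 subset: integers 100-999 that begin
    # with 2 or 3 and end with 8 or 9.
    whole_set = range(100, 1000)    # 900 integers
    subset = [n for n in whole_set
              if str(n)[0] in "23" and str(n)[-1] in "89"]
    print(len(subset))                   # -> 40 (2 x 10 x 2 choices)
    print(len(subset) / len(whole_set))  # -> ~0.044, about 4.4 percent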
TABLE 2. Examples of Probability Items

Percent, Real, Level 1:
Parking stickers for employees' cars at a certain company are numbered
consecutively from 100 to 999. Stickers from 200 to 399 are assigned to the
sales department. What percent of the parking stickers are assigned to the
sales department?

Percent, Real, Level 2:
Parking stickers for employees' cars at a certain company are numbered
consecutively from 100 to 999. Stickers that begin with the digits 2 or 3
are assigned to the sales department. Stickers that end with the digits 8
or 9 belong to managers. What percent of the parking stickers are assigned
to managers in the sales department?

Percent, Pure, Level 1:
What percent of the integers between 100 and 999, inclusive, are between
200 and 399, inclusive?

Percent, Pure, Level 2:
What percent of the integers between 100 and 999, inclusive, begin with the
digits 2 or 3 and end with the digits 8 or 9?

Probability, Real, Level 1:
Parking stickers for employees' cars at a certain company are numbered
consecutively from 100 to 999. Stickers from 200 to 399 are assigned to the
sales department. If a parking sticker is chosen at random, what is the
probability that it will belong to the sales department?

Probability, Real, Level 2:
Parking stickers for employees' cars at a certain company are numbered
consecutively from 100 to 999. Stickers that begin with the digits 2 or 3
are assigned to the sales department. Stickers that end with the digits 8
or 9 belong to managers. If a parking sticker is chosen at random, what is
the probability that it will belong to a manager in the sales department?

Probability, Pure, Level 1:
If an integer is chosen at random from the integers between 100 and 999,
inclusive, what is the probability that the chosen integer will be between
200 and 399, inclusive?

Probability, Pure, Level 2:
If an integer is chosen at random from between 100 and 999, inclusive, what
is the probability that the chosen integer will begin with the digits 2 or
3 and end with the digits 8 or 9?

Note. These example items were not used in this study.
Item Pretesting

Items from the variant families were included in 24 quantitative pretest sections of the GRE General
Test. Paper-and-pencil test forms were administered to random samples of 1,000 or more examinees in
October and December 1996. Four experimental items with minimal overlap of item features were included in
each pretest section, so that each pretest section included one each of the following types of problems: cost,
DRT (distance = rate x time), percent, and probability. Within a pretest section, items were positioned in
accord with test assembly conventions, which included placing problem-solving items in positions 16 through
30 and roughly ordering them according to expected difficulty. Finally, an experienced test developer was
asked to estimate the difficulty of the experimental items on a scale of 1 to 5.

Item Analysis

Item statistics that were generated as a part of the pretest process and entered into a database include
the following:

1. Equated delta (E-Delta)--an inverse translation of proportion correct into a scale with a mean
of 13 and a standard deviation of 4 (based on the curve for a normal distribution and equated
over tests and samples); a minimal sketch of this transformation follows the list.

2. R-biserial (Rbis)--the correlation between examinees' scores on an individual item and their
total scores on the operational quantitative measure.

3. DDIF-m/f--a measure of differential difficulty of items for different groups of examinees (in
this case, males and females) after controlling for overall performance on a measure (based
on the 1988 adaptation of the Mantel and Haenszel statistic by Holland & Thayer, 1988).
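The report does not spell out the equated-delta transformation; the sketch below assumes the standard
delta-scale form (an inverse-normal transform of proportion correct, with higher delta indicating a harder
item), so treat the exact formula as an assumption rather than the operational equating procedure:

    from statistics import NormalDist

    def delta(p_correct):
        # Inverse-normal transform of proportion correct onto a scale with
        # mean 13 and SD 4; higher delta indicates a harder item.
        return 13 + 4 * NormalDist().inv_cdf(1 - p_correct)

    print(round(delta(0.50), 2))  # -> 13.0 (item of average difficulty)
    print(round(delta(0.84), 2))  # -> ~9.0 (easy item; ~84% answer correctly)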
In addition, item response theory (IRT) parameters were estimated for each item using BILOG
(Mislevy & Bock, 1982). In the specific IRT model assumed to be underlying performance on GRE items, the
probability that an examinee with ability $\theta_i$ will respond correctly to an item with parameters
$(a_j, b_j, c_j)$ is modeled as follows:

$$P(x_{ij} = 1 \mid \theta_i, a_j, b_j, c_j) = c_j + \frac{1 - c_j}{1 + e^{-1.7 a_j (\theta_i - b_j)}}$$

In this particular model, the item parameters are interpreted as characterizations of the item's
discrimination ($a_j$), difficulty ($b_j$), and susceptibility to correct response through guessing ($c_j$).
Because parameter estimates for some of the experimental items were not included in the test development
database, item parameter estimates were also obtained from a second IRT calibration, which included the 96
experimental items and a sample of 120 nonexperimental items.
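For reference, the three-parameter logistic model above translates directly into code; this sketch simply
evaluates the formula, and the parameter values in the example are illustrative:

    import math

    def p_correct(theta, a, b, c):
        # Three-parameter logistic (3PL) model: probability that an examinee
        # with ability theta answers an item with parameters (a, b, c) correctly.
        return c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))

    # An average-ability examinee (theta = 0) on an item of average difficulty:
    print(round(p_correct(theta=0.0, a=1.0, b=0.0, c=0.20), 3))  # -> 0.6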
Data Analysis

To determine whether the items that were systematically designed for this study were of acceptable
quality, we compared the item statistics and the attrition rate for the experimental and nonexperimental items,
and assessed the impact of the item design features on gender-related differential item difficulty. To assess
the impact, if any, that the item design features had on item operating characteristics, the relationship between
the item design features and resulting item operating characteristics was analyzed using a combination of
tree-based regression and classical least squares regression. Finally, the usefulness of the collateral
information about the item features for reducing pretest sample size was examined.
Tree-based Regression. The impact of different item feature manipulations on resulting item
parameter estimates was investigated using a tree-based regression technique. Like classical regression
models, tree-based regression models provide a rule for estimating the value of a response variable (y) from a
set of classification or predictor variables (x). In the particular application described here, y is an (n x 1)
vector of item parameter estimates, and x is an (n x k) matrix of item feature classifications. As in the
classical regression setting, tree-based prediction rules provide the expected value of the response for clusters
of observations having similar values of the predictor variables. Clusters are formed by successively splitting
the data into increasingly homogeneous subsets, called nodes, on the basis of the feature classification
variables. A locally optimal sequence of splits is selected by using a recursive partitioning algorithm to
evaluate all possible splits of all possible predictor variables at each stage of the analysis (Breiman,
Friedman, Olshen, & Stone, 1984). Potential splits are evaluated in terms of deviance, a statistical measure of
the dissimilarity in the response variable among the observations belonging to a single node. At each stage of
splitting, the original subset of observations is referred to as the parent node and the two outcome subsets are
referred to as the left and right child nodes. The best split is the one that produces the largest decrease
between the deviance of the parent node and the sum of the deviances in the two child nodes. The deviance of
the parent node is calculated as the sum of the deviances of all of its members,

$$D(y, \bar{y}) = \sum_i (y_i - \bar{y})^2,$$

where $\bar{y}$ is the mean value of the response calculated from all of the observations in the node. The
deviance of a potential split is calculated as

$$D_{\text{split}} = \sum_{i \in L} (y_i - \bar{y}_L)^2 + \sum_{i \in R} (y_i - \bar{y}_R)^2,$$

where $\bar{y}_L$ is the mean value of the response in the left child node and $\bar{y}_R$ is the mean value of
the response in the right child node. The split that maximizes the change in deviance,

$$\Delta D = D(y, \bar{y}) - D_{\text{split}},$$

is the split chosen at any given node. After each split is defined, the mean value of the response within each
child node is taken as the predicted value of the response for each of the items in each of the nodes. The more
homogeneous the node, the more accurate the prediction.
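A minimal sketch of the splitting step just described, assuming binary (0/1) item feature classifications; this
illustrates the deviance criterion rather than reproducing the CART implementation actually used:

    def deviance(values):
        # Sum of squared deviations from the node mean.
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    def best_split(features, y):
        # Pick the binary feature whose split most reduces parent deviance.
        # `features` maps feature name -> list of 0/1 classifications;
        # `y` holds the item parameter estimates in the current node.
        parent = deviance(y)
        best_feature, best_gain = None, 0.0
        for name, flags in features.items():
            left = [yi for yi, f in zip(y, flags) if f == 1]
            right = [yi for yi, f in zip(y, flags) if f == 0]
            if not left or not right:
                continue  # split would leave an empty child; skip it
            gain = parent - (deviance(left) + deviance(right))
            if gain > best_gain:
                best_feature, best_gain = name, gain
        return best_feature, best_gain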
The node definitions developed for the current study characterize the impact of specific item feature
manipulations on resulting item parameter estimates. This characterization was corroborated by
implementing the following two-step procedure: First, the estimated tree model was reexpressed as a linear
combination of binary-coded dummy variables; second, the dummy variable model was subjected to a
classical least squares regression analysis. The significance probabilities resulting from this procedure
indicate whether, in a classical least squares regression analysis, any of the effects included in the estimated
tree model would have been deemed "not significant" and any of the effects omitted from the estimated tree
model would have been deemed "significant." When the results obtained in the classical least squares
regression analysis replicate those obtained in the tree-based analysis, confidence regarding the validity of
resulting conclusions is enhanced.
Estimating the usefulness of collateral information. From a Bayesian statistical perspective, the
precision of a given item parameter estimate (say, item difficulty) is determined from the amount of
information available from two different sources: examinee response vectors and collateral information about
item features. The parameter estimates considered in the current study characterize the precision levels
achievable under two different scenarios: one in which all of the available information about item operating
characteristics is derived from an analysis of approximately 1,000 examinee response vectors, and another in
which all of the available information about item operating characteristics is derived from the estimated item
feature model. The former scenario is represented by the item parameter estimates obtained from the BILOG
calibration, while the latter is represented by the item parameter estimates obtained from the estimated
regression models.

The usefulness of the item feature information, as captured in the estimated regression models, can
be determined by comparing the precision of the difficulty estimates obtained from the BILOG calibration to
the precision of the corresponding estimates obtained from the estimated regression model. Precision is
defined as the inverse of the variance of the distribution representing knowledge about an estimated
parameter value. For the BILOG difficulty estimates considered in this study, precision is calculated as the
inverse of the squared standard error obtained from a calibration with noninformative prior distributions. For
the regression estimates considered in this study, precision is calculated as the inverse of the variance
estimated for sets of items predicted to have the same level of item difficulty.

Because precision is additive in both pretest examinee sample size and in collateral information, the
BILOG precision estimates can be divided by the sample size to yield an estimate of the contribution per
examinee. The value of the collateral information can then be expressed in terms of equivalent numbers of
pretest examinees (m), as follows:

$$m = P_R / (P_B / n),$$

where $P_R$ is the precision yielded by the estimated regression model, and $P_B / n$ is the precision per
examinee yielded by the BILOG calibration based on n examinees.
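A minimal sketch of this calculation, using the calibration sample size of 1,190 reported with Table 7 and one
of that table's rate-problem groups as example values:

    def equivalent_sample_size(p_regression, p_bilog, n_calibration=1190):
        # m = P_R / (P_B / n): express collateral (item feature) information
        # as an equivalent number of pretest examinees.
        return p_regression / (p_bilog / n_calibration)

    # "No Var, Cost, L1" rate group from Table 7: collateral precision 3.07,
    # BILOG precision 16.56 -> roughly 220 equivalent examinees.
    print(equivalent_sample_size(3.07, 16.56))  # -> ~220.6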
Results
Summary of Item Statistics

On the 24 pretests, there were 360 problem solving items, 96 of which were the items written for this
study. After pretesting, items are subjected to a final review before being entered into the pool of items
suitable for use in future operational tests. About 9% of the experimental items and 24% of the other
problem-solving items were dropped from further consideration during final review. Items can be eliminated
for a variety of reasons, and no record of why particular items are deemed unusable is kept. However, all the
rate items that were eliminated were from one cell of the design and were extremely easy. On the other hand,
four of the six probability items that were dropped had a common, difficult counting task--three-digit
numbers within a range with two or three digits equal to 1; these may have confused examinees of all ability
levels. In our subsequent analysis, we found that the IRT parameters for these items could not be calibrated.
There was no obvious reason why the remaining two probability items were eliminated.
The mean item statistics for the experimental and nonexperimental problem solving items that
survived the pretest process are presented in Table 3. The experimental rate problems were easier than the
nonexperimental items overall, as measured by E-Delta, t(243) = -3.41, p < .001, and by IRT b, t(243) =
-1.99, p < .05, but their variability was similar. Thus, this set of rate problems covered as wide a range of
difficulty levels as did a heterogeneous mix of other problem solving items. The IRT c parameter was higher
for these rate problems than for all nonexperimental items--t(243) = 4.92, p < .001--suggesting that
examinees were more successful at guessing the correct answer for the rate problems than they were for other
problems. However, the guessing parameter for rate problems did not differ from what might be expected by
chance (.20).

The mean difficulty of the experimental probability problems was equal to the mean difficulty of
nonexperimental items overall, but the probability problems were less variable in difficulty, as measured by
E-Delta (Levene's Test), F(1,241) = 9.27, p < .003. Probability problems also were more discriminating
than nonexperimental items--t(241) = 2.36, p < .02--and were differentially easier for males--t(87.54) = -1.96,
p < .05 (t-test for unequal variances). In addition, they were less variable in differential difficulty--Levene's
Test, F(1,241) = 7.95, p < .005. Finally, the correlation of an experienced test developer's estimates of
difficulty with the items' IRT b parameters was .75 (n = 92, p < .001) for all of the experimental items--.89
(n = 48, p < .001) for the rate problems, and .54 (n = 44, p < .001) for probability problems.

To assess whether the item design features had any impact on differential difficulty for males and
females, separate 2 x 2 x 2 ANOVAs were carried out on the DDIF-m/f data for the two experimental item
families. For the rate problems, only the main effect for context was significant--F(1,43) = 23.31, p < .001.
The mean DDIF-m/f was .46 (favoring females) for cost items and -.30 (favoring males) for DRT problems.
For probability problems, the item design features had no significant impact on DDIF-m/f, although as noted
above, this item set as a whole was slightly easier for males than for females.
TABLE 3. Mean Item Statistics for Experimental and Nonexperimental Problem Solving Items

Item Set                     E-Delta   Rbis   DDIF-m/f   IRT a   IRT b   IRT c
Rate (n = 44)
  M                           12.07    0.41     0.05      0.98    0.02    0.24
  SD                           2.07    0.14     0.65      0.37    1.20    0.12
Probability (n = 42)
  M                           13.69    0.40    -0.20      1.00    0.51    0.18
  SD                           1.41    0.12     0.43      0.26    0.98    0.09
Nonexperimental (n = 201)
  M                           13.27    0.42    -0.04      0.88    0.40    0.15
  SD                           2.14    0.15     0.66      0.32    1.13    0.11
Impact of Item Design Features on Item Operating Characteristics

Separate regression analyses were conducted for each of the three item parameters (difficulty,
discrimination, and guessing) and for each of the two variant families (rate problems and probability
problems). In each analysis, the dependent variable was one of the item parameters of interest (difficulty,
discrimination, or guessing), and the independent variables were the item features.

The item parameter values considered in the analyses are summarized in Tables 4 and 5. Table 4 lists
means and standard deviations calculated for the rate problems. Table 5 lists means and standard deviations
calculated for the probability problems. The least squares regression results for predicting difficulty,
discrimination, and guessing for both the rate problems and the probability problems are summarized in
Table 6. The table provides raw (unstandardized) regression coefficients for all main effects and interaction
effects that were found to be significant at the .05 significance level. Effects that were significant at the .01 or
.001 significance levels are also indicated.
TABLE 4. IRT Item Parameters for Rate Problems with Differing Item Design Features

                                          IRT a          IRT b          IRT c
Use Variable  Complexity  Context       M     SD       M     SD       M     SD
Yes           Level 2     DRT          .98    .28     1.49   .30     .24    .03
Yes           Level 2     Cost        1.03    .25     1.15   .38     .30    .06
Yes           Level 1     DRT          .77    .22      .53   .42     .27    .04
Yes           Level 1     Cost         .67    .19      .30   .56     .27    .03
No            Level 2     DRT          .83    .19      .13   .25     .24    .04
No            Level 2     Cost         .48    .14    -1.84   .63     .22    .01
No            Level 1     DRT          .76    .15    -1.16   .72     .21    .02
No            Level 1     Cost         .46    .13    -3.09   .57     .22    .01
TABLE 5. IRT Item Parameters for Probability Problems with Differing Item Design Features

                                            IRT a          IRT b          IRT c
Complexity   Context 1     Context 2      M     SD       M     SD       M     SD
Level 2^a    Probability   Real          .89    .13     1.70   .53     .21    .06
Level 2^a    Probability   Pure         1.02    .20     1.60   .29     .23    .06
Level 2^a    Percent       Real          .89    .35     1.62   .86     .22    .05
Level 2^a    Percent       Pure          .88    .15     1.14   .54     .23    .07
Level 1      Probability   Real          .96    .13      .37   .55     .20    .06
Level 1      Probability   Pure          .95    .16      .48   .53     .20    .05
Level 1      Percent       Real          .84    .12     -.05   .53     .20    .04
Level 1      Percent       Pure          .91    .13      .09   .53     .18    .04

^a n = 5; otherwise n = 6.
TABLE 6. Regression of Item Features on IRT Item Parameters for Two Families of Item Variants

                                     Regression Statistics and Significant Coefficients
Effect                               Difficulty     Discrimination     Guessing

Rate Problems (n = 48)
Intercept                             -3.01***         .47***           .22***
UseVar = Yes                           3.34***         .25**            .03**
Context = Cost                           --              --               --
Complexity = L2                        1.09***           --               --
UseVar = No and Context = DRT          1.95***         .32***             --
UseVar = Yes and Complexity = L2         --            .29***             --
UseVar = Yes and Context = Cost          --              --              .03*
RMSE                                    .50             .19              .03
R2                                      .91             .52              .42
Adj. R2                                 .90             .50              .41

Probability Problems (n = 44)
Intercept                               .05
Complexity = L2                        1.29**
Percent/Probability                     .34*
Real/Pure                                --
RMSE                                    .54
R2                                      .62
Adj. R2                                 .61

*** p < .001, ** p < .01, * p < .05
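To show how the Table 6 coefficients combine, the following sketch applies the fitted rate-problem difficulty
model; predictions differ slightly from the observed cell means in Table 4 because the regression is an
approximation:

    def predicted_rate_difficulty(use_var, context_cost, level2):
        # Combine the significant rate-problem difficulty coefficients from
        # Table 6: intercept, main effects, and the UseVar/context interaction.
        b = -3.01                             # intercept
        if use_var:
            b += 3.34                         # UseVar = Yes
        if level2:
            b += 1.09                         # Complexity = L2
        if not use_var and not context_cost:
            b += 1.95                         # UseVar = No and Context = DRT
        return b

    print(predicted_rate_difficulty(False, True, False))  # -> -3.01 (observed -3.09)
    print(predicted_rate_difficulty(True, False, True))   # -> 1.42 (observed 1.49)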
Rate Problems. The tree-based analyses of the IRT parameters for rate problems--difficulty,
discrimination, and guessing--are summarized in Figures 1, 2, and 3, respectively. In these illustrations, each
node is plotted at a horizontal location based on its estimated parameter value; its vertical location is
determined by its estimated deviance value, the residual sum of squares for items in the node. The item
features selected to define each split are listed on the edges, connecting parents to offspring. The number of
items assigned to each node is plotted as the node label. The resulting displays illustrate how variation in item
feature classifications leads to subsequent variation in IRT parameter estimates.

Figure 1 demonstrates that, among the 48 rate variants, the manipulation that had the greatest impact
on item difficulty required students to perform operations on variables as opposed to numbers. As shown in
the upper section of Figure 1, the 24 items that did not require students to perform operations on variables
(UseVar = No) had an average difficulty of -1.49 (SD = 1.30), and the 24 items that did require students to
perform operations on variables (UseVar = Yes) had an average difficulty of .87 (SD = .63). Thus, items that
required examinees to use variables were more difficult--by more than 1.5 standard deviation units--than
those that did not. The significance of this result can be seen both in the tree and in the table. As shown in
Figure 1, this split (UseVar: Y) produced the largest decrease in deviance. As shown in Table 6, this effect
produced the largest coefficient in the regression for difficulty.

Figure 1 also illustrates that, among the subset of rate problems that did not require operations on
variables (UseVar = No), the 12 items with a cost context were significantly easier (M = -2.47, SD = .87)
than the 12 items with a DRT context (M = -.51, SD = .85). However, among the subset of rate problems that
did require operations on variables, the cost and DRT contexts were equally difficult. This interaction is
clearly illustrated in the tree and is also evident in Table 6. That is, as indicated in Table 6, the cost/DRT
effect was not significant as a main effect but it was significant when crossed with UseVar = No. Thus, the
context results obtained in the least squares regression analysis exactly replicated those obtained in the
tree-based analysis. In particular, both analyses indicated that context can be a strong determiner of item
difficulty when items do not require proficiency at using variables, but context is not a strong determiner of
item difficulty when items do require proficiency at using variables. These results suggest that context effects
may have a greater impact on performance among lower performing examinees than among higher performing
examinees.
Figure 1 also summarizes the effect of problem complexity on item difficulty. Overall, the 24 items
at the higher complexity level were significantly more difficult (M = .24, SD = 1.38) than the 24 items at the
lower complexity level (M = -.86, SD = 1.57). In addition, this effect was of similar magnitude for problems
involving a cost or a DRT context, and for problems that either included or did not include a variable. That
the magnitude of the complexity effect was similar for different types of problems can also be seen in Table
6, which indicates that the main effect for complexity was highly significant (p < .001). Because all of the
items at the higher complexity level involved four constraints, and all of the items at the lower complexity
level involved only three constraints, this result suggests that the presence of a fourth constraint contributes
to additional difficulty at all levels of proficiency.
The tree-based analysis of item discrimination is summarized in Figure 2. The similarity of the
difficulty and discrimination trees suggests that the factors used to generate the rate variants affected
difficulty and discrimination similarly. Problems that included a variable had better discrimination (M = .86,
SD = .27) than those that did not (M = .63, SD = .22). Among items that did not include a variable, DRT
problems were more discriminating (M = .79, SD = .17) than cost problems (M = .47, SD = .13). And finally,
among problems that did include a variable, more complex items were more discriminating (M = 1.01, SD =
.26) than less complex problems (M = .72, SD = .20).

The tree-based analysis of item guessing is summarized in Figure 3. Rate variants that included
variables tended to have higher guessing parameters (M = .27, SD = .04) than rate variants that did not (M =
.22, SD = .02). In addition, among items that included variables, items with a cost context tended to have
slightly higher guessing parameters (M = .29, SD = .05) than items with a DRT context (M = .25, SD = .04).
FIGURE 1. Estimated regression tree for the difficulty parameter for rate problems.
(Horizontal axis: IRT item difficulty; R-squared = 0.91, Adj. R-sqr = 0.90.)
FIGURE 2. Estimated regression tree for the discrimination parameter for rate problems.
(Horizontal axis: IRT item discrimination; R-squared = 0.52, Adj. R-sqr = 0.50.)
FIGURE 3. Estimated regression tree for the guessing parameter for rate problems.
(Horizontal axis: IRT guessing parameter; R-squared = 0.42, Adj. R-sqr = 0.41.)
Probability Problems. The tree-based analysis of difficulty for the probability problems is
summarized in Figure 4, and the related regression statistics are presented in Table 6. Both the tree-based
analysis and the classical least squares regression analysis indicate that, among the 44 probability variants,
the manipulation that had the greatest impact on item difficulty involved the complexity of the counting
subtask. In particular, the 24 items that required a less complex counting subtask were easier (M = .22, SD =
.55) than the 20 items that required a more complex counting subtask (M = 1.51, SD = .59).

For probability problems at both complexity levels, the 22 items that were cast as probability
problems were slightly more difficult (M = .98, SD = .78) than the 22 that were cast as percent problems (M
= .64, SD = .92). Note that this effect is reflected both in the tree and in the regression coefficients shown in
Table 6.

For probability problems at both complexity levels, the difficulty of items set in real-life contexts did
not differ substantially from similarly configured items that simply referred to sets of integers. This result is
indicated by the absence of a real vs. pure split in the estimated regression tree, and by the fact that the real
vs. pure effect was not significant in the least squares regression analysis.

As indicated in Table 6, none of the features used to generate the probability variants were useful for
explaining variation in item discrimination parameters or in item guessing parameters. A similar result was
obtained in the tree-based analysis. That is, the estimated trees yielded no useful splits.
FIGURE 4. Estimated regression tree for the difficulty parameter for probability problems.
(Horizontal axis: IRT item difficulty; R-squared = 0.62, Adj. R-sqr = 0.61.)
Implications for Reductions in Pretest Sample Sizes

The improvements in posterior precision achievable with the collateral models estimated in this study
are summarized in Table 7. Because precision varies with difficulty level, separate estimates are provided for
groups of items located at varying points on the underlying difficulty scale. Item groupings correspond to the
feature categories identified in the regression trees (Figures 1 through 4). Under the estimated difficulty
model, all of the items in each group are predicted to have the same value of item difficulty. This value is
listed in the column labeled "Predicted Difficulty."

Table 7 also lists two precision estimates for each group. The estimates listed in the column labeled
"BILOG Precision" incorporate information from approximately 1,000 examinee response vectors, but no
information from the estimated item feature model. These estimates were calculated as the inverse square of
the average within-group standard error obtained from the BILOG calibration. The estimates listed in the
column labeled "Collateral Precision" incorporate information from the estimated item feature model, but no
information from examinee response vectors. These estimates were calculated as the inverse of the
within-group variance obtained from the estimated regression model.

The right-most column of Table 7 provides an estimate of the value of the collateral information
expressed in terms of equivalent numbers of pretest examinees (m). As can be seen, the collateral model for
rate variants yielded an equivalent sample size of approximately 215 examinees, and the collateral model for
probability variants yielded an equivalent sample size of approximately 128 examinees. In interpreting these
results it is important to note that, while precision is additive, the effect of increasing sample sizes is not.
Specifically, the posterior standard deviation of item parameters shows diminishing returns as calibration
sample size is increased, so that the first 200 examinees reduce posterior standard deviations the most, the
next 200 reduce posterior standard deviations by less, and by the time that there are 1,000 pretest examinees,
another 200 examinees reduce posterior standard deviations only slightly. The relevance of using collateral
information that is worth, say, 200 examinees, is that the impact of the collateral information is tantamount to
that of the first 200 examinees, not the last 200.
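A minimal sketch of this diminishing-returns behavior, assuming additive precision; the per-examinee
precision below is an assumed illustrative value, not one estimated in the study:

    def posterior_sd(n_examinees, per_examinee_precision, collateral_precision=0.0):
        # Posterior SD of an item parameter when precision is additive in
        # examinee sample size and in collateral information.
        total = collateral_precision + n_examinees * per_examinee_precision
        return total ** -0.5

    per_ex = 0.025   # assumed precision contributed by each examinee
    m = 215          # collateral information worth ~215 examinees (Table 7)
    for n in (10, 250, 1000):
        print(n,
              round(posterior_sd(n, per_ex), 3),                                # without
              round(posterior_sd(n, per_ex, collateral_precision=m * per_ex), 3))  # with
    # The gap shrinks as n grows: the collateral information acts like the
    # first ~215 examinees, not the last.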
Figure 5 illustrates this phenomenon for the rate variants. The solid curve depicts the effect of
increasing sample sizes when collateral information is not included in the calibration. The dashed curve
shows the effect of increasing sample sizes when collateral information is included in the calibration. The line
from A to B represents the decrease in uncertainty that would be attained if, in addition to collateral
information, 10 examinee response vectors were also available. The line from C to D represents the decrease
in uncertainty that would be attained if, in addition to collateral information, 250 examinee response vectors
were also available. The line at E shows that a calibration that included both collateral information and 250
pretest examinees would yield an effective sample size of about 420 examinees. (These estimates do not
reflect the additional improvements achievable through the use of expected response curves, discussed
below.)

How valuable is 200 examinees' worth of information about item parameters from item features? The
answer depends on how this information will be used. The current calibration system uses information from
pretest examinees only, and treats the resulting estimates as if they were true item parameter values (that is,
any remaining uncertainty is ignored). Experience has shown that 1,000 examinees will suffice for this
approach. Collateral information worth 200 examinees would be disappointing indeed if all it meant was
reducing the pretest sample to 800 with the rest of the current system intact. This would be a reduction of
pretest sample size of just 20%.
The preferred alternative addresses not only the source of information about item parameters, but
also the way the information is used. The approach, described in Mislevy, Sheehan, and Wingersky (1993),
uses expected response curves (ERCs) that incorporate information from both sources (collateral information
and pretest examinees); it models uncertainty about these sources as well. The first of these properties means
that it is possible to use collateral information about the item features that influence item operating
characteristics. The second property means that it is not necessary to have the total amount of information
about item parameters so great as to treat them as known. The ERCs reduce biases that arise when estimates
are treated as true values in the current system--the phenomenon that kept people from using small calibration
samples in that system. Mislevy, Sheehan, and Wingersky found that ERCs based on collateral information,
plus responses from 250 pretest examinees, provided measurement of examinees that was as effective as item
parameter estimates based on 1,000 pretest examinees. This is a reduction of 750 pretest examinees, or 75%.
TABLE 7. The Precision of Difficulty Estimates Generated With and Without Collateral Information

Item Group^a                n    Predicted     BILOG^b      Collateral^c   Equivalent
                                 Difficulty    Precision    Precision      Sample Size
Rate Problems
  No Var, Cost, L1          6      -3.09         16.56         3.07           220
  No Var, Cost, L2          6      -1.84         15.29         2.54           197
  No Var, DRT, L1           6      -1.16         19.75         1.93           117
  No Var, DRT, L2           6       0.13         88.23        16.13           218
  Var, L1                  12       0.41         39.17         4.23           129
  Var, L2                  12       1.32         24.01         7.17           355
  Weighted Average^d                                                          215
Probability Problems
  L1, Pct                  12       0.02         43.32         3.84           105
  L1, Prob                 12       0.43         73.73         3.72            60
  L2, Pct                  10       1.38         23.23         1.93            99
  L2, Prob                 10       1.65         27.45         6.12           265
  Weighted Average^d                                                          128

^a Item groups reflect the combinations of features found to be significant in the regression analysis.
^b BILOG precision = 1 / (Average Standard Error)^2 from a calibration of 1,190 examinee response vectors.
^c Collateral precision = 1 / (Residual Standard Deviation)^2 from the estimated regression equation.
^d Weights are proportional to the numbers of items available in each group.
FIGURE 5. The effect of increasing sample sizes with and without collateral information.
(Horizontal axis: calibration sample size; solid curve: without collateral information;
dotted curve: with collateral information.)
Discussion
Summary

The attempt to systematically manipulate difficulty was extremely successful for rate problems and
moderately successful for probability problems. For rate problems, all the manipulated features affected
difficulty, accounting for 90% of the variance in difficulty in the set of problems. This family of items
covered a wide difficulty range. One manipulation in particular--using a variable to transform a multistep
arithmetic word problem into a multistep algebra word problem--had a very powerful effect on difficulty. In
addition, there was an interesting interaction between context and the use of a variable: For easier items that
did not involve a variable, cost problems were easier than DRT problems, but this particular context did not
affect difficulty for problems that did involve a variable. This suggests that some aspects of context may
facilitate or impede problem solution among lower-performing examinees, but not among higher-performing
examinees. The item features also had similar effects on item discrimination and guessing.

In contrast with the rate problems, the probability problems were more difficult and covered a
narrower difficulty range. Increasing the complexity of the counting task had the greatest impact on difficulty.
One aspect of context (whether the problem was cast as a percent or probability problem) did affect
difficulty, but another (whether or not the problem narrative involved a real-life context) did not. However,
the context interaction for the rate problems serves as a reminder not to dismiss the possibility that such a
contrast (real-life versus pure context) may be an important feature for less difficult items. Finally, item
design features did not impact the discrimination or guessing parameters for probability problems.

One issue that the results for the probability problems raise is why these probability problems were
so difficult. The items with the simple counting task were not very demanding in terms of the arithmetic
involved, and presenting the problem in terms of percent rather than probability facilitated performance.
Taken together, these factors suggest that a significant portion of the examinees taking the GRE General Test
in 1996 were unfamiliar with basic statistical concepts and procedures.

In the following sections, the implications this study may present for articulating the constructs
assessed by the GRE quantitative measure, for increasing the efficiency of test development, and for reducing
pretest sample size are discussed.
Understanding Item Difficulty and Construct Representation

Among item statistical characteristics, difficulty has received the most attention because of its role in
construct validation (Embretson, 1983) and proficiency scaling (Sheehan, 1997). Embretson distinguished
between the two aspects of test validity--nomothetic span, which refers to the relationship of test scores to
other variables, and construct representation, which "is concerned with identifying the theoretical mechanisms
that underlie item responses such as information processes, strategies, and knowledge stores" (p. 179). With
respect to construct representation, items can be described from either the task perspective (what are the
features of the task?) or the examinee perspective (what processes, skills, strategies, and knowledge do
people use to solve problems?). Of course, items can be described in many different ways. Difficulty
modeling introduces a criterion--that of the relationship of item features to difficulty--which permits a
distinction between critical and incidental features. A basic assumption of this approach is that the features of
the task and examinee processes are interdependent.
Although many studies such as the current one focus primarily on one of these perspectives, a
complete theory of the task requires both. Some evidence about the relationship between the item features of
rate problems that were manipulated in this study and problem solution processes is reported in Sebrechts et
al. (1996). Sebrechts et al. categorized the strategies used by college students in solving 20 GRE word
problems, and examined the relationships between item features, solution strategies, and errors. The four
classes of strategies identified included:

1. following step-by-step mathematical solutions (equation based)

2. setting up and solving ratios

3. modeling the situation by deriving solutions for a set of potential values for a variable and
converging on an answer (simulations)

4. using other, unidentified strategies

Most of the successful problem solutions involved equation-based strategies. Nevertheless, when the use of
an equation-based strategy would have required actually manipulating variables rather than an arithmetic
step-by-step solution, students were less likely to use this strategy even though it was highly appropriate.
They were more likely to use other, unidentifiable strategies or simulation strategies. It seems that many of
these students either lacked appropriate strategies or failed to apply the strategies they possessed to word
problems that required the manipulation of variables. Problem complexity, on the other hand, did not have an
impact on strategy but was associated with errors of misusing the givens in the problem statement.

In sum, determining which item features impact item difficulty and how these features affect
examinee problem solving provides a better explication of the constructs being assessed. This more detailed
understanding of constructs is necessary for principled item generation, and can serve as a basis for the
development of diagnostic assessment and reporting.
Implications for Creating Item Variants

The results of the current study demonstrate the enormous potential of systematically creating item
variants. The systematic generation of item variants can result in a set of items with predictable item
characteristics that differ from each other in specified ways and degrees. Efforts to automate some aspects of
systematic item generation are currently underway (Singley & Bennett, 1998). In addition to creating items
for operational tests, variants can be created for use in diagnostic and practice tests without compromising
operational pools. However, there are many issues that need to be addressed before the potential of this
approach to item development can be fully realized in the context of large scale assessment. These issues
include the diversity of problems that exist in the GRE quantitative measure, the wide variety of item features
that can be manipulated to create variants, how items should be classified, and how similarity among
problems should be defined.

The pool of GRE quantitative problems is quite diverse. Rate and probability word problems
represent only a small proportion of the item types included in the measure. In a sample of about 340
arithmetic and algebra items in two GRE computer adaptive test pools, only 4% were classified as probability
problems and 2% as rate problems. Furthermore, even for these small sets of problems, many features can be
manipulated to create variants. Two criteria that might be used to determine which item features to
manipulate include the impact of the features on item performance, and whether or not the features are
deemed construct relevant. While information about the former criterion can be gleaned from examination of
existing items and experimental studies, establishing construct relevance requires other kinds of evidence,
such as studies of similarities in the processes used to solve assessment items and those used to solve
problems in the academic domains of interest.

Finally, if large numbers of item variants were created, methods to manage their distribution among
the pools of items used for computerized adaptive testing would need to be developed. This might require the
revision of the current item classification system. A better understanding of the item features that contribute
to perceived item similarity by examinees, and to transfer among items, would be helpful here.
Implications for Reducing Pretest Sample Size

Knowledge of the degree to which different features impact item statistics could allow us to create
item variants along with estimates of item operating characteristics. Statistical procedures for using collateral
information such as this to reduce pretest sample size have been developed (Mislevy et al., 1993).
Nevertheless, two barriers block the application of these methods at present, although neither barrier is
insurmountable. One of these barriers concerns operational constraints that must be taken into consideration.
Currently, sample size is controlled at the section rather than the item level. This means that one would want
to have collateral information for all of the diverse items in a section before the sample size could be reduced
for that section. A study in which four pretest sections consist of item variants based on the same set of
parent items with known item operating characteristics is currently in process. The second barrier is the lack
of a knowledge base that would permit prediction of item operating characteristics for the wide variety of
items that exist on the GRE quantitative measure. Over time, this knowledge base could be developed
through the examination of existing items and through experimental studies such as this one. In the meantime,
the difficulty estimates of experienced item writers are reliable and predictive of actual difficulty and could be
used to reduce pretest sample size (Sheehan & Mislevy, 1994).
Concluding Comments

Construct-driven item generation requires a description of items that can be related to the processes,
skills, and strategies used in item solving. The benefits of such an approach are that, if item variants can be
created systematically through an understanding of critical problem features, tests can be designed to cover
important aspects of a domain, overlap can be controlled, and pretesting requirements can be reduced. A
closer integration of research and test development would contribute to the development of the knowledge
base needed to support construct-driven item generation. Ideally, every time items are pretested, knowledge
about how item features impact item performance could be gained if items were designed to vary
systematically on selected features. This kind of knowledge would not only help to improve item development
efficiency, but could also provide a basis for the development of new products and services.
References
Breiman, L., Friedman, J. H., Olshen, R., & Stone, C. J. (1984). Classification and regression trees.
Belmont, CA: Wadsworth International Group.

Embretson, S. (1983). Construct validity: Construct representation versus nomothetic span.
Psychological Bulletin, 93(1), 179-197.

Embretson, S. E. (1995). A measurement model for linking individual learning to processes and
knowledge: Application to mathematical reasoning. Journal of Educational Measurement, 32(3), 277-294.

Hall, R., Kibler, D., Wenger, E., & Truxaw, C. (1989). Exploring the episodic structure of algebra
story problem solving. Cognition and Instruction, 6(3), 223-283.

Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel
procedure. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 129-145). Hillsdale, NJ: Lawrence Erlbaum
Associates.

Lane, S. (1991). Use of restricted item response models for examining item difficulty ordering and
slope uniformity. Journal of Educational Measurement, 28(4), 295-309.

Mayer, R. E. (1981). Frequency norms and structural analysis of algebra story problems into
families, categories, and templates. Instructional Science, 10, 135-175.

Mayer, R. E. (1982). Memory for algebra story problems. Journal of Educational Psychology, 74(2),
199-216.

Mislevy, R. J., & Bock, R. D. (1982). BILOG: Maximum likelihood item analysis and test scoring
with logistic models for binary items. Chicago: International Educational Services.

Mislevy, R. J., Sheehan, K. M., & Wingersky, M. (1993). How to equate tests with little or no data.
Journal of Educational Measurement, 30(1), 55-78.

Mislevy, R. J., Wingersky, M. S., & Sheehan, K. M. (1994). Dealing with uncertainty about item
parameters: Expected response functions (ETS Research Report RR-94-28-ONR). Princeton, NJ:
Educational Testing Service.

Reed, S. K. (1987). A structure-mapping model for word problems. Journal of Experimental
Psychology, 13(1), 124-139.

Reed, S. K., Dempster, A., & Ettinger, M. (1985). Usefulness of analogous solutions for solving
algebra word problems. Journal of Experimental Psychology: Learning, Memory, and Cognition, 11(1),
106-125.

Sebrechts, M. M., Enright, M., Bennett, R. E., & Martin, K. (1996). Using algebra word problems to
assess quantitative ability: Attributes, strategies, and errors. Cognition and Instruction, 14(3), 285-343.

Shalin, V. L., & Bee, N. V. (1985). Structural differences between two-step word problems
(Technical Report No. ED-259-949). Pittsburgh, PA: University of Pittsburgh, Learning Research and
Development Center.

Sheehan, K. M. (1997). A tree-based approach to proficiency scaling and diagnostic assessment.
Journal of Educational Measurement, 34(4), 333-352.

Sheehan, K., & Mislevy, R. J. (1994). A tree-based analysis of items from an assessment of basic
mathematics skills (ETS Research Report 94-14). Princeton, NJ: Educational Testing Service.

Singley, M. K., & Bennett, R. E. (1998). Validation and extension of the mathematical expression
response type: Applications of schema theory to automatic scoring and item generation in mathematics (ETS
Research Report RR-97-19, GRE Report 93-24P). Princeton, NJ: Educational Testing Service.