Developing a New Test from Conceptualization to Maintenance: An ATP Workshop
Wayne Camara, Jerry Melican, and Andrew Wiley
The College Board
OUTLINE OF WORKSHOP (2 pm – 4:30 pm)
1. History/Developments in standardized testing
2. Importance of Validation
3. Overview of the Test Development Process
4. Defining the purpose of your test
5. Developing test specifications
6. Writing items/Assembling test forms
7. Trying out items and test forms
8. Administering your test
9. Scoring your test/Developing a scale
10. Reporting scores
11. Program maintenance
12. Program validation/documentation
Definitions
A page of definitions has been handed out for reference during the presentation.
1. History of Test Development
Beginning in the second century B.C.E., entry into the bureaucracy that governed China was limited to those who passed a series of examinations. The exams were based on a comprehensive knowledge of the Chinese classics. Examinations covered six arts: music, archery, horsemanship, writing, arithmetic, and rituals and ceremonies. In 669 C.E., Empress Wu Tse-t’ien had the examinations standardized and systematically administered to control access to jobs. She became the only woman to rule China (DuBois, 1970).
1. History of Test Development (cont.)
The exams endured for over 20 centuries: they were efficient and selected the best candidates. During periods of political unity the exams operated well, but during strife or dynasty changes they tended to break down. The tests were later revised, and in 1300 over 12,000 applicants sat for a single exam. Grading took years, not just because all 12,000 essays had to be read, but because a complete rank order of all candidates was mandated (Hakel, 1998).
Since 2200 B.C.E., what are the greatest innovations in standardized testing?
1. Objective scoring and scanning of items
2. Computer adaptive testing (Internet-based testing)
3. General ability measurement (IQ testing)
4. Simulations (authentic tasks, work samples, media)
5. IRT/psychometric advances
6. Computer scoring of free-response tasks
2. It Does Begin with Validation (Testing Standards)
• Validation logically begins with an explicit statement of the proposed interpretation of test scores.
• Validation can be viewed as developing a scientifically sound validation argument to support the intended interpretation of test scores and their relevance to the proposed use.
• The conceptual framework points to the kinds of evidence needed.
• Validation involves developing a set of propositions that support the proposed interpretation for the particular purpose of the test.
2. Content Validation and Test Design (Testing Standards)
• The content specifications describe the content, including a classification of areas of content and types of items. Usually includes expert judgment about the relationship between parts of the test and the construct.
• A licensure test should capture the major facets of the occupation.
  – The competencies (or KSAs) required for the occupation should be specified.
  – Content specifications should be representative of the occupational domain.
  – Items should be developed that measure these KSAs. Item format and scoring should be considered at this time.
2. Content Validation and Test Design (Testing Standards)
• Expert judgments
  – Does the item measure the relevant construct?
  – Ensure the competencies are appropriately specified
  – Evaluate the relationship between the content domains or skills
  – Determine the importance, criticality, or frequency of occurrence for the KSAs
3. Overview of Test Development (Downing & Haladyna, 2006)
A. Overall plan – desired inferences, test format, psychometric model
B. Content definition – sampling plan for content domain, definitions
C. Test specifications – frameworks, content/psychometric specs
D. Item development – evidence-based principles, item editing, reviews
E. Test design & assembly – design forms, pretesting conditions
F. Test production – printing or CBT packages, security, QC, release/reuse
G. Test administration – standardization, modifications, timing, proctoring
H. Scoring – key validation, scoring consistency, QC, item analysis
I. Cut scores – relative vs. absolute, scaling, comparability
J. Reporting – accuracy, convey score precision (SEM, confidence intervals), retakes
K. Item banking – security, reuse, exposure, flexibility
L. Documentation – validity evidence, specs, samples, inferences…
4. Overall Plan/Content Definition
• Identify the purpose(s) of the test
  – Individual or aggregate – student proficiency vs. school accountability
  – High vs. low stakes – for whom? All tests are likely to be high stakes for someone or some institution (e.g., program evaluation).
  – Helping instrument (e.g., career interest)
  – Instructional information > diagnostic information > classification purposes
  – Primary score interpretations
  – General design – CRT/NRT, etc.
4. Overall Plan/Content Definition
• Define the constructs to be measured (and the scope of constructs to be measured)
  – Examples: 10th grade mathematics, self-efficacy
  – Does mathematics include geometry? Does a test of verbal ability include vocabulary, comprehension, and inferences about the author’s intentions? Does self-efficacy include attitudes and behaviors?
  – Methods include job analysis, task analysis, theoretical framework and literature, analysis of the content domain (judgments, observations, interviews, surveys, document review), etc.
5. Construct Test Specifications
• Once you have the intended purpose and construct defined…
  – Identify any firm requirements or constraints
  – Develop a psychometric framework
  – Develop a content framework
5. Construct Test Specifications
• Identify any firm requirements or constraints
  – Types of items
  – Amount of testing time
  – Administrative modes
  – Scoring
  – Number of forms
  – Cost implications
  – Operational constraints
5. Construct Test Specifications
• Develop a psychometric framework
  – Requirements for comparability
  – Psychometric requirements (e.g., score precision, equating)
  – Discrimination or desired information function (discrimination at specific points, such as a licensing test, vs. information across the score distribution, such as the GRE)
  – Speededness
  – Score(s) required
  – Difficulty
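To make the distinction between information concentrated at a cut point and information spread across the score distribution concrete, here is a minimal Python sketch that computes 2PL item information curves; the item parameters and the cut point of 0.5 are purely hypothetical.

```python
import numpy as np

def item_information_2pl(theta, a, b):
    """Fisher information for a 2PL item: I(theta) = a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

theta = np.linspace(-3, 3, 121)   # ability scale

# Hypothetical licensing-style items peaked near a cut point of 0.5
licensing_items = [(1.6, 0.4), (1.8, 0.5), (1.7, 0.6)]
# Hypothetical items spread across the scale, as on a broad-range admissions test
broad_items = [(1.2, -1.5), (1.2, 0.0), (1.2, 1.5)]

licensing_info = sum(item_information_2pl(theta, a, b) for a, b in licensing_items)
broad_info = sum(item_information_2pl(theta, a, b) for a, b in broad_items)

idx = theta.searchsorted(0.5)     # index closest to the hypothetical cut point
print("Information at theta = 0.5:",
      round(float(licensing_info[idx]), 2), "(peaked pool) vs",
      round(float(broad_info[idx]), 2), "(broad pool)")
```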
5. Construct Test Specifications
• Develop a content framework
  – Types of items
  – Tasks and response formats (e.g., scenarios, simulations, cut and paste, grid-in, constructed response, multiple choice)
  – Content domains
  – Strands
  – Competencies
  – Constructs (unidimensional vs. multidimensional)
  – A framework that specifies the number and types of items for each domain, construct, etc.
5. Construct Test Specifications
• Hold for a conversation on what the group would define as essential for a test assessing individuals in ??
  – US History
5. Construct Test Specifications
SAT Mathematics Content Specifications

  Content                                        # of items     % of items
  Numbers and Operations                         11–13 items    20–25%
  Algebra and Functions                          19–21 items    35–40%
  Geometry and Measurement                       14–16 items    25–30%
  Data Analysis, Statistics, and Probability     6–7 items      10–15%
5. Construct Test Specifications (cont.)
• An assessment blueprint specifies the domain that can be assessed
• The blueprint should clearly illustrate
  – Level (grade, position, specialty)
  – Number of test forms (or item pools, testlets)
  – Type of forms (operational, alternate, emergency, accommodations, paper vs. CBT, overseas)
  – Reuse patterns (reuse, disclosed, partially disclosed)
  – Number of common items, embedded pretest items, and embedded linking items
  – Number of points per item, per section, and for each item type
  – Psychometric specifications for the form
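As an illustration of how a blueprint can drive quality control, the sketch below represents content ranges as a small data structure and checks a draft form against them; the counts reuse the SAT mathematics ranges from the previous slide, and the draft form itself is hypothetical.

```python
# Minimal sketch: check a draft form's item counts against blueprint ranges.
BLUEPRINT = {
    "Numbers and Operations": (11, 13),
    "Algebra and Functions": (19, 21),
    "Geometry and Measurement": (14, 16),
    "Data Analysis, Statistics, and Probability": (6, 7),
}

def check_form(item_domains):
    """item_domains: list of domain labels, one per item on the draft form."""
    problems = []
    for domain, (lo, hi) in BLUEPRINT.items():
        n = sum(1 for d in item_domains if d == domain)
        if not lo <= n <= hi:
            problems.append(f"{domain}: {n} items (blueprint calls for {lo}-{hi})")
    return problems

# Hypothetical draft form with too few geometry items
draft = (["Numbers and Operations"] * 12 + ["Algebra and Functions"] * 20 +
         ["Geometry and Measurement"] * 10 +
         ["Data Analysis, Statistics, and Probability"] * 6)
for issue in check_form(draft):
    print(issue)
```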
5. Construct Test Specifications (cont.)
• Automated Test Assembly
  – Test assembly is a labor- and time-intensive process
  – Linear programming techniques have been developed that can evaluate an item pool and create X parallel forms of a test
  – Good content and statistical specifications are essential for this process to work
  – Automated test assembly does not replace human review; instead, it provides a good draft of a test form, which still requires SME reviews
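A minimal sketch of the linear-programming idea behind automated test assembly, written with the open-source PuLP solver; the item pool, content targets, and objective (maximize information at a hypothetical cut point) are illustrative assumptions, not a production specification.

```python
# Sketch: test assembly as a 0/1 integer program (assumes `pip install pulp`).
import pulp

# Hypothetical item pool: (item_id, content_domain, information_at_cut)
pool = [
    ("A1", "algebra", 0.42), ("A2", "algebra", 0.35), ("A3", "algebra", 0.50),
    ("G1", "geometry", 0.38), ("G2", "geometry", 0.44), ("G3", "geometry", 0.30),
]
targets = {"algebra": 2, "geometry": 2}   # items required per domain
form_length = 4

prob = pulp.LpProblem("test_assembly", pulp.LpMaximize)
x = {i: pulp.LpVariable(f"use_{i}", cat="Binary") for i, _, _ in pool}

# Objective: maximize total information at the hypothetical cut score
prob += pulp.lpSum(info * x[i] for i, _, info in pool)
# Content constraints: exact number of items per domain
for domain, n in targets.items():
    prob += pulp.lpSum(x[i] for i, d, _ in pool if d == domain) == n
# Total form length
prob += pulp.lpSum(x.values()) == form_length

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print("Selected items:", [i for i, _, _ in pool if x[i].value() == 1])
```

The selected draft would then go to SME review, exactly as the slide notes.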
6. Item Development – Item Writing
• Writers are familiar with the population of test takers and the content domain or constructs.
• Item writers often are content experts or subject matter experts (SMEs).
• Writers are trained in item writing, qualifying procedures, multiple reviews, etc.
6. Item Development – Additional Requirements
• Scoring guidelines
• General guidelines on item writing
• Item style guidelines
• Bias, fairness, and sensitivity procedures
• Universal design principles (or acceptable accommodations)
• Sample items and rubrics
6. Item Development (Cont.)
• A Potential Model for Item and Form Review
  [Flow diagram; stages shown: item authored, content review 1, content review 2, editorial review, sensitivity/fairness review, pretest, statistical analysis, revision, item pool, form assembly, final form review]
6. Item Development (Cont.)
• Scenarios, simulations, and extended constructed-response items require a disproportionate share of the testing time. Issues to consider:
  – Tasks that are realistic and have fidelity to the actual situation
  – Tasks that assess multiple competencies or are multidisciplinary
  – Motivate the test taker to stay with a more complex task and not give up
  – Are not overly text dependent, unless comprehension is to be assessed
  – Appropriate stimulus materials and tools are available
  – Allow adequate time to engage the task and try out solutions
  – Appropriate scoring guide and guidelines
6. Item Development (Cont.)
• Scenarios, simulations, and extended constructed-response items require a disproportionate share of the testing time. Issues to consider:
  – Item tryouts and cognitive labs can be useful in early design to ensure the task assesses what is intended, solution strategies are consistent with the intended purpose, and alternative solutions (e.g., shortcuts) are not easily available
  – Appropriate scoring guide and guidelines
  – Determine what response factors are relevant vs. irrelevant
  – Weigh cost (time, money, scoring, operational issues), scorers, etc.
6. Item Development (Cont.)
• The American Institute of Certified Public Accountants launched a computer-based set of examinations in 2004.
  – The multiple-choice section is adaptive at the testlet level using IRT
    • Every test taker receives three testlets of 25 items
    • Four routes: Medium-Easy-Easy, Medium-Harder-Harder, M-E-H, or M-H-E
    • Most test takers go through the first two routes; a candidate can pass or fail on any route
  – Simulations with authentic scenarios and interesting/innovative item types
    • Fill in cells of a spreadsheet
    • Arrange steps in a procedure in proper order
    • Write a memo to a supervisor or a letter to a client based on a scenario
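A toy sketch of the testlet-routing idea described above, not the actual AICPA algorithm; the routing rule and the 60% threshold are hypothetical.

```python
# Toy sketch: start with a Medium testlet, then route Easier/Harder after each
# testlet based on the proportion correct so far. In an operational CAT the
# decision is made live after each testlet; here we apply the rule to already
# observed testlet scores for illustration.
def route_examinee(testlet_scores, threshold=0.60):
    """testlet_scores: proportion correct on each 25-item testlet, in order taken."""
    route = ["Medium"]
    for score in testlet_scores[:-1]:   # the last testlet triggers no further routing
        route.append("Harder" if score >= threshold else "Easier")
    return "-".join(route)

print(route_examinee([0.72, 0.55, 0.64]))   # -> Medium-Harder-Easier
```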
6. Item Development (Cont.)
• The American Institute of Certified Public Accountants launched a computer-based set of examinations in 2004.
• The process involved
  – Performing practice analyses
  – Identifying, generating, and trying out item types – not for innovation but for appropriate measurement
  – Developing software
  – Implementing modern psychometric theory
  – Field testing
  – Review and improvement of scoring rubrics after administration
6. Item Development (Continued)
Accommodations, Special Populations, and Universal Design
• Understand what accommodations or modifications are acceptable (Braille, script, dictionary, calculator, translation, extended time, etc.).
• Avoid content and item types where the construct measured can readily change based on accommodations (e.g., excessive text in math items).
• Attempt to follow universal design principles (Thompson, Johnstone & Thurlow, 2002):
  – Items measure what they are intended to measure (avoid construct-irrelevant aspects) – changes to format do not change meaning
  – Items recognize the diversity of test takers (age, cognitive levels, culture)
  – Avoid unnecessary use of words and jargon (concise and readable text)
  – Pictures, graphics, and figures are clear and reproduce easily for different modes
  – Clean and clear format of items (space)
6. Evaluating Items
• Content review of items – each item:
  – has a single best answer
  – is at the appropriate level of difficulty
  – is a good fit with the specifications
  – is clear and unambiguous
  – meets other formal criteria for a good multiple-choice question
• Editorial review:
  – Item is clear, grammatically correct, and follows established style guidelines
• Sensitivity and fairness review:
  – Item does not contain language that might offend or disadvantage any subgroups of test takers
• Copy editing and accuracy:
  – Layout quality
  – Graphics
6. Assembling Test Form
• Select items for a CAT item pool or final form that meet the test specifications (content x statistical specs).
• Item selection may include other factors (graphics, theme, format).
• Conduct a final form review of items to ensure balance, representativeness, and diversity.
• Pay attention to statistical specifications:
  – Fit of the model to the data if using IRT
  – Indices of item difficulty, discrimination, item information, inter-item correlations, etc.
6. Assembling Test Form
• For CAT pools (testlets), consider constraints and exposure rates.
• Changes at the final-form stage pose some risk.
• Consider how item location/placement can impact performance (especially important for embedded linking and calibration items).
7. Trying out items and test forms
• Non-operational venues
– Very small sample “proof of product”, could be “talk aloud”
– Small sample pilot tests
– Large sample Field Tests
• Operational venues
– Embedded pretest items
– Separate pretest sections
7. Trying out items and test forms
• Purpose is to collect data on item performance – indices of difficulty, discrimination, DIF, item-test correlations.
• Identification of flawed items:
  – Double keys
  – Miskeys
  – No answer
  – DIF
• Things to consider – sample size (IRT requirements), timing of data collection, motivation (embedded vs. independent), representativeness of sample.
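A minimal classical item-analysis sketch of the kind of data collected at tryout: proportion correct as difficulty, corrected point-biserial as discrimination, and a simple flag for items whose key should be checked. The response matrix is hypothetical.

```python
import numpy as np

def item_analysis(scores):
    """scores: 0/1 matrix, rows = examinees, columns = items."""
    scores = np.asarray(scores, dtype=float)
    n_items = scores.shape[1]
    p_values = scores.mean(axis=0)                      # difficulty (proportion correct)
    results = []
    for j in range(n_items):
        rest = scores.sum(axis=1) - scores[:, j]        # total score excluding item j
        r_pbis = np.corrcoef(scores[:, j], rest)[0, 1]  # corrected point-biserial
        flag = "REVIEW (low/negative discrimination; check key)" if r_pbis < 0.10 else ""
        results.append((j + 1, round(float(p_values[j]), 2), round(float(r_pbis), 2), flag))
    return results

# Hypothetical tryout data: 6 examinees x 4 items
data = [[1, 1, 0, 1],
        [1, 0, 0, 1],
        [0, 1, 1, 0],
        [1, 1, 0, 1],
        [0, 0, 1, 0],
        [1, 1, 0, 1]]
for item, p, r, flag in item_analysis(data):
    print(f"Item {item}: p = {p}, r_pbis = {r} {flag}")
```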
7. Evaluation of test form after field trials
• Examine linkages to previous editions (or forms), if applicable
• Speededness
• Reliability
• Subgroup differences
• Establish new norms, percentiles, new cut scores, set the scale, etc.
• Link to other external measures
34
7. Evaluating Items and Test Forms (cont.)
•
Document, document, document –
– Sample selection
– Procedures used to conduct item tryouts and field tests
– Psychometric model used (e.g., when IRT used - item parameters,
estimation procedures, model fit)
•
•
•
When test development relies primarily on psychometric
data employ cross validation methods (60/40 samples)
Document directions for administering test, acceptable
accommodations or modifications to the test. Provide
sample materials, items and scoring rules.
Test Development is the beginning of a continuing process
to provide documentation and evidence to support
inferences related to validity.
35
Exceptions and Caveats
• A rigid sequential approach to test development is not always possible.
• Many psychological tests are dependent on constructs, and a more empirical approach will define the framework.
  – For example, a selection test may be more dependent on the correlation of item scores with job performance data.
  – A personality test used for psychological diagnosis may be more dependent on items that discriminate between groups (patients vs. non-patients).
Exceptions and Caveats
• The extent of the framework may be less important for tests with a single form that is reused (e.g., psychological inventories).
• Item discrimination may be more essential when major decisions hinge on a couple of score points (e.g., cut scores on certification tests).
8. Administering the test
• Standardization in paper-based tests
  – Shipment
  – Instructions for test takers and proctors
  – Training of proctors
• Standardization in computer-based tests
  – Item pools
  – Item selection algorithms for CAT, linear-on-the-fly, …
  – Instructions for test takers and proctors
  – Training of proctors
8. Administering the test
• Shipment
  – Plan and prepare packets
  – Arrange for receipt and safe storage
  – Secure vendor / monitor vendor performance
  – Arrange for preparation of the return shipment
  – Account for all materials
9. Scoring the test
• Scan objective answer sheets
• Rate responses to student-produced questions
• Perform item analyses/DIF analyses
• Eliminate problem items – when, why, and how
• Apply final keys for total and any part scores
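A tiny sketch of applying a final key to objective responses to produce total and part scores; the key, section map, and responses are hypothetical.

```python
# Sketch: apply an answer key to multiple-choice responses and compute part scores.
KEY      = ["B", "D", "A", "C", "B", "A"]
SECTIONS = ["algebra", "algebra", "algebra", "geometry", "geometry", "geometry"]

def score_examinee(responses):
    """responses: list of selected options (None = no answer)."""
    item_scores = [int(r == k) for r, k in zip(responses, KEY)]
    part_scores = {}
    for sec, s in zip(SECTIONS, item_scores):
        part_scores[sec] = part_scores.get(sec, 0) + s
    return sum(item_scores), part_scores

total, parts = score_examinee(["B", "D", "C", "C", None, "A"])
print(total, parts)   # -> 4 {'algebra': 2, 'geometry': 2}
```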
9. Scaling the test
• Identify and publish the proposed score scale before administration
  – For example, the scale ranges from 200 to 800 with a mean of 500 and a standard deviation of 100
  – Or, the cut score equals 70
  – Avoid established scales, number correct, and percent correct
  – Allow room for changes in ability and differences in difficulty of versions
  – An example is provided in the handouts, and one of us can walk through it later
9. Scaling the test
• For example, the scale ranges from 200 to 800 with a mean of 500 and a standard deviation of 100
  – Identify the population that defines the scale. “The 1990 Reference Group” of one million test takers was used for the SAT.
  – Collect the scores and compute means and standard deviations.
  – Using a linear transformation, set the observed mean to 500 and the observed standard deviation to 100.
• In reality, there was a bit more mathematical management of the scores to ensure that the distribution was approximately normal, but for most programs the process just described is sufficient.
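A minimal sketch of the linear transformation just described; the reference-group statistics and raw scores are hypothetical, and rounding to 10-point increments is an illustrative choice, not a requirement.

```python
import numpy as np

def linear_scale(raw, ref_mean, ref_sd, target_mean=500, target_sd=100,
                 floor=200, ceiling=800):
    """Linearly transform raw scores so the reference group has the target mean/SD,
    then round to 10-point increments and truncate to the reporting range."""
    scaled = target_mean + target_sd * (np.asarray(raw, dtype=float) - ref_mean) / ref_sd
    scaled = np.clip(np.round(scaled / 10) * 10, floor, ceiling)
    return scaled.astype(int)

# Hypothetical reference-group statistics and a few raw scores
ref_mean, ref_sd = 30.2, 8.1
print(linear_scale([12, 30, 45, 58], ref_mean, ref_sd))
```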
9. Scaling the test
• Set the cut score equal to 70 (or an appropriate score)
  – Choose the reporting scale. Note: some programs report numeric scores to all test takers; others report numeric scores to failing candidates but only “Pass” to passing candidates.
  – Perform a cut-score study and obtain the operational cut score from decision-makers in a transparent process.
  – Set the cut score to 70 (or whatever the choice is).
  – Identify the highest raw score that will be set to the highest reported score.
  – Identify the lowest raw score that will be set to the lowest reported score.
  – Solve the simultaneous equations establishing the line between the low score and the cut score and the line between the cut score and the high score.
  – Develop the table that shows the scaled score corresponding to each raw score.
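The two-segment idea can be sketched as follows: one line maps the lowest raw score to the lowest reported score and the raw cut to 70; a second line maps the raw cut to 70 and the highest raw score to the highest reported score. The raw-score anchors and the 40–100 reporting range below are hypothetical.

```python
def build_conversion_table(raw_low, raw_cut, raw_high,
                           scale_low=40, scale_cut=70, scale_high=100):
    """Two-segment (piecewise linear) raw-to-scale conversion around the cut score."""
    def convert(raw):
        if raw <= raw_cut:   # segment 1: lowest score -> cut score
            slope = (scale_cut - scale_low) / (raw_cut - raw_low)
            return scale_low + slope * (raw - raw_low)
        slope = (scale_high - scale_cut) / (raw_high - raw_cut)   # segment 2
        return scale_cut + slope * (raw - raw_cut)
    return {raw: round(convert(raw)) for raw in range(raw_low, raw_high + 1)}

# Hypothetical form: raw scores 0-100, raw cut of 62 maps to a reported 70
table = build_conversion_table(raw_low=0, raw_cut=62, raw_high=100)
print(table[0], table[62], table[100])   # -> 40 70 100
```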
9. Scaling the test
• Aside: for tests with cut scores, perform a standard setting. Standard setting is a topic worthy of its own presentation; it is the application of one of many disciplined processes to define the level of competency below which a candidate is not likely to be competent.
9. Scaling the test – part 1
• Generate the scale as planned
• Transform raw scores to scaled scores
• Round and truncate as defined in the scale plan (e.g., all scores below 200 are set to 200)
• Calculate frequency distributions for raw and scaled scores and check them for reasonability
• Adjust, if necessary, staying within the planned scale
• DOCUMENT!
10. Reporting scores
• Placeholder for a slide discussing what should be included in a score report
10. Reporting scores – individual reports
• Prepare practice score reports before administration and review them for accuracy and usefulness
• Prepare an initial set of score reports for quality control purposes
• Generate all score reports
• Perform quality control on random samples
10. Reporting scores – institutional
• Determine the nature of reports for other constituencies
• Prepare practice reports before administration and review them for accuracy and usefulness
• Prepare an initial set of reports for quality control purposes
• Generate all score reports
• Perform quality control on random samples
10. Reporting scores – summary reports
• Prepare a report concerning test versions and administration results
  – Candidate-level information: frequency distributions and summary statistics
  – Item-level information: difficulty, item discrimination, DIF results
  – Form-level information: average difficulty, discrimination, reliability estimates, relevant correlations
11. Maintenance – after each administration
• Update item bank information and identify item writing requirements and assignments
• Update candidate file information
• Generate trend reports and identify issues for review and/or action
11. Maintenance – Next versions and administrations
• Generate new versions that allow the score scale to be perpetuated (a.k.a. equated)
  – Same content specifications
  – Same difficulty level
  – Overlapping items with one or more previous versions already on the scale
11. Maintenance – Next versions and administrations
• After administration of the new versions
  – Perform the same post-administration steps as before, adding:
  – Analyze the common items for adequacy in equating scores on the new versions to the old scale
  – Equate and perform quality control
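One simple common-item approach is chained linear equating: link the new form to the anchor items in the new group, then link the anchor to the old form in the old group. The sketch below is illustrative only, with hypothetical score data; operational programs typically use more refined designs.

```python
import statistics as st

# Hypothetical total scores (one value per examinee)
new_form_totals   = [52, 47, 60, 55, 49, 58]   # new form, new group
new_anchor_totals = [14, 12, 17, 15, 13, 16]   # common items, new group
old_form_totals   = [50, 44, 57, 53, 46, 55]   # old form, old group
old_anchor_totals = [13, 11, 16, 14, 12, 15]   # common items, old group

def linear_map(from_scores, to_scores):
    """Linear equating within a group: match means and standard deviations."""
    m_f, s_f = st.mean(from_scores), st.pstdev(from_scores)
    m_t, s_t = st.mean(to_scores), st.pstdev(to_scores)
    return lambda v: m_t + (s_t / s_f) * (v - m_f)

# Chain: new form -> anchor (new group), then anchor -> old form (old group)
x_to_anchor = linear_map(new_form_totals, new_anchor_totals)
anchor_to_old = linear_map(old_anchor_totals, old_form_totals)

# A new-form raw score of 55 expressed on the old form's raw-score scale
print(round(anchor_to_old(x_to_anchor(55)), 1))
```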
Questions, Ideas, Concerns
• Questions
• Ideas
• Concerns
• Comments
Resources
• Please take the handout with the short list of references and resources.
• Researchers are encouraged to freely express their professional judgment. Therefore, points of view or opinions stated in College Board presentations do not necessarily represent official College Board position or policy.
• Questions can be sent to Andrew Wiley at: [email protected]
Thanks to ATP
and
Thanks to you