Developing a New Test from Conceptualization to Maintenance: An ATP Workshop
Wayne Camara, Jerry Melican, and Andrew Wiley
The College Board

OUTLINE OF WORKSHOP (2 pm – 4:30 pm)
1. History/Developments in standardized testing
2. Importance of validation
3. Overview of the test development process
4. Defining the purpose of your test
5. Developing test specifications
6. Writing items/Assembling test forms
7. Trying out items and test forms
8. Administering your test
9. Scoring your test/Developing a scale
10. Reporting scores
11. Program maintenance
12. Program validation/documentation

Definitions
A page of definitions has been handed out for support during the presentation.

1. History of Test Development
Beginning in the second century B.C.E., entry into the bureaucracy that governed China was limited to those who passed a series of examinations. The exams were based on a comprehensive knowledge of the Chinese classics. Examinations were in six arts: music, archery, horsemanship, writing, arithmetic, and rituals and ceremonies. In AD 669 Empress Wu Tse-t'ien had the examinations standardized and systematically administered to control access to jobs. She became the only woman to rule China (DuBois, 1970).

1. History of Test Development (cont.)
The exams endured for over 20 centuries: they were efficient and resulted in selection of the best candidates. During periods of political unity the exams operated well, but during strife or dynasty changes they tended to break down. The tests were later revised, and in 1300 over 12,000 applicants sat for a single exam. Grading took years, not just because all 12,000 essays had to be read, but because a complete rank order of all candidates was mandated (Hakel, 1998).

Since 2200 B.C.E., what are the greatest innovations in standardized testing?
1. Objective scoring and scanning of items
2. Computer adaptive testing (Internet-based testing)
3. General ability measurement (IQ testing)
4. Simulations (authentic tasks, work samples, media)
5. IRT/Psychometric advances
6. Computer scoring of free-response tasks

2. It Does Begin with Validation (Testing Standards)
• Validation logically begins with an explicit statement of the proposed interpretation of test scores.
• Validation can be viewed as developing a scientifically sound validation argument to support the intended interpretation of test scores and their relevance to the proposed use.
• The conceptual framework points to the kinds of evidence needed.
• Develop a set of propositions that support the proposed interpretation for the particular purpose of the test.

2. Content Validation and Test Design (Testing Standards)
• The content specifications describe the content, including a classification of areas of content and types of items. They usually include expert judgment about the relationship between parts of the test and the construct.
• A licensure test should capture the major facets of the occupation.
  – The competencies (or KSAs) required for the occupation should be specified.
  – Content specifications should be representative of the occupational domain.
  – Items should be developed that measure these KSAs. Item format and scoring should be considered at this time.

2. Content Validation and Test Design (Testing Standards)
• Expert judgments
  – Does an item measure the relevant construct?
  – Ensure the competencies are appropriately specified
  – Evaluate the relationship between the content domains or skills
  – Determine the importance, criticality, or frequency of occurrence for the KSAs
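The slides leave the form of these expert judgments open. One widely used way to summarize SME ratings of whether a KSA is essential is Lawshe's content validity ratio (CVR), shown in the minimal sketch below; this index is not named in the workshop materials, and the panel data are hypothetical.

```python
# A minimal sketch (not from the workshop materials): summarizing SME
# "essential / useful / not necessary" ratings of a KSA with Lawshe's
# content validity ratio, CVR = (n_e - N/2) / (N/2), where n_e is the
# number of panelists rating the KSA "essential" and N is the panel size.

def content_validity_ratio(ratings):
    """ratings: one string per SME, e.g. ['essential', 'useful', ...]."""
    n = len(ratings)
    n_essential = sum(1 for r in ratings if r == "essential")
    return (n_essential - n / 2) / (n / 2)

# Hypothetical panel of 8 SMEs rating a single KSA
panel = ["essential"] * 6 + ["useful", "not necessary"]
print(round(content_validity_ratio(panel), 2))  # 0.5
```

KSAs with low or negative ratios would be candidates for removal from the content specifications or for further panel discussion.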
3. Overview of Test Development (Downing & Haladyna, 2006)
A. Overall plan – desired inferences, test format, psychometric model
B. Content definition – sampling plan for the content domain, definitions
C. Test specifications – frameworks, content/psychometric specs
D. Item development – evidence-based principles, item editing, reviews
E. Test design and assembly – design forms, pretesting conditions
F. Test production – printing or CBT packages, security, QC, release/reuse
G. Test administration – standardization, modifications, timing, proctoring
H. Scoring – key validation, scoring consistency, QC, item analysis
I. Cut scores – relative vs. absolute, scaling, comparability
J. Reporting – accuracy, conveying score precision (SEM, confidence intervals), retakes
K. Item banking – security, reuse, exposure, flexibility
L. Documentation – validity evidence, specs, samples, inferences…

4. Overall Plan/Content Definition
• Identify the purpose(s) of the test
  – Individual or aggregate – student proficiency vs. school accountability
  – High vs. low stakes – for whom – all tests are likely to be high stakes for someone or some institution (e.g., program evaluation)
  – Helping instrument (e.g., career interest)
  – Instructional information > diagnostic information > classification purposes
  – Primary score interpretations
  – General design – CRT/NRT, etc.

4. Overall Plan/Content Definition
• Define the constructs to be measured (and the scope of constructs to be measured)
  – Examples (10th grade mathematics, self-efficacy)
  – Does mathematics include geometry? Does a test of verbal ability include vocabulary, comprehension, and inferences about the author's intentions? Does self-efficacy include attitudes and behaviors?
  – Methods include job analysis, task analysis, theoretical framework and literature, and analysis of the content domain (judgments, observations, interviews, surveys, document review), etc.

5. Construct Test Specifications
• Once you have the intended purpose and construct defined…
  – Identify any firm requirements or constraints
  – Develop a psychometric framework
  – Develop a content framework

5. Construct Test Specifications
• Identify any firm requirements or constraints
  – Types of items
  – Amount of testing time
  – Administrative modes
  – Scoring
  – Number of forms
  – Cost implications
  – Operational constraints

5. Construct Test Specifications
• Develop a psychometric framework
  – Requirements for comparability
  – Psychometric requirements (e.g., score precision, equating)
  – Discrimination or desired information function (discrimination at specific points, such as a licensing test, vs. information across the score distribution, such as the GRE)
  – Speededness
  – Score(s) required
  – Difficulty

5. Construct Test Specifications
• Develop a content framework
  – Types of items
  – Tasks and response formats (e.g., scenarios, simulations, cut and paste, grid-in, constructed response, multiple choice)
  – Content domains
  – Strands
  – Competencies
  – Constructs (unidimensional vs. multidimensional)
  – A framework that specifies the number and types of items for each domain, construct, etc. (see the sketch below)

5. Construct Test Specifications
• Hold for conversation on what the group would define as essential for a test assessing individuals in ??
  – US History
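As a concrete illustration of the content framework described above, a test blueprint can be represented as structured data and checked automatically. The sketch below is a minimal illustration, not a prescribed format; it borrows the SAT Mathematics item counts shown on the next slide, and the draft-form counts are hypothetical.

```python
# A minimal sketch: a content blueprint as data, with a check that a
# draft form's item counts fall inside the specified ranges. The ranges
# come from the SAT Mathematics specifications on the following slide;
# the draft_form counts are made up for illustration.

blueprint = {
    # domain: (minimum items, maximum items)
    "Numbers and Operations": (11, 13),
    "Algebra and Functions": (19, 21),
    "Geometry and Measurement": (14, 16),
    "Data Analysis, Statistics, and Probability": (6, 7),
}

draft_form = {
    "Numbers and Operations": 12,
    "Algebra and Functions": 20,
    "Geometry and Measurement": 15,
    "Data Analysis, Statistics, and Probability": 7,
}

for domain, (lo, hi) in blueprint.items():
    n = draft_form.get(domain, 0)
    status = "OK" if lo <= n <= hi else "OUT OF SPEC"
    print(f"{domain}: {n} items ({status})")
```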
5. Construct Test Specifications
SAT Mathematics Content Specifications
  Content                                       # of items     % of items
  Numbers and Operations                        11–13 items    20–25%
  Algebra and Functions                         19–21 items    35–40%
  Geometry and Measurement                      14–16 items    25–30%
  Data Analysis, Statistics, and Probability    6–7 items      10–15%

5. Construct Test Specifications (cont.)
• An assessment blueprint specifies the domain that can be assessed
• The blueprint should clearly illustrate
  – Level (grade, position, specialty)
  – Number of test forms (or item pools, testlets)
  – Type of forms (operational, alternate, emergency, accommodations, paper vs. CBT, overseas)
  – Reuse patterns (reuse, disclosed, partially disclosed)
  – Number of common items, embedded pretest items, and embedded linking items
  – Number of points per item, per section, and for each item type
  – Psychometric specifications for the form

5. Construct Test Specifications (cont.)
• Automated Test Assembly
  – Test assembly is a labor- and time-intensive process
  – Linear programming techniques have been developed that can evaluate an item pool and create X number of parallel forms of a test (see the sketch below)
  – Good content and statistical specifications are essential for this process to work
  – Automated test assembly does not replace human review. Instead, it provides a good draft of a test form, which still requires SME reviews
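A minimal sketch of the linear-programming idea behind automated test assembly. The workshop does not name a particular solver; this example assumes the open-source PuLP package, and the item pool, statistics, constraints, and domain names are all made up. The selected set is only a draft form and would still go to SME review, as the slide notes.

```python
# A minimal sketch of automated test assembly as a 0/1 linear program:
# select items that maximize total discrimination subject to content
# counts, form length, and an average-difficulty window. Pool data and
# constraint values are illustrative only.
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary, LpStatus

pool = [
    # (item id, content domain, difficulty p-value, discrimination)
    ("A1", "Algebra", 0.62, 0.45),
    ("A2", "Algebra", 0.48, 0.38),
    ("A3", "Algebra", 0.71, 0.30),
    ("G1", "Geometry", 0.55, 0.42),
    ("G2", "Geometry", 0.40, 0.36),
    ("G3", "Geometry", 0.66, 0.28),
]
required = {"Algebra": 2, "Geometry": 2}   # items per domain
form_length = 4

prob = LpProblem("assemble_form", LpMaximize)
x = {item_id: LpVariable(f"use_{item_id}", cat=LpBinary)
     for item_id, _, _, _ in pool}

# Objective: maximize total item discrimination on the selected form
prob += lpSum(x[i] * r for i, _, _, r in pool)

# Content constraints: exact number of items per domain
for domain, k in required.items():
    prob += lpSum(x[i] for i, d, _, _ in pool if d == domain) == k

# Form length constraint
prob += lpSum(x.values()) == form_length

# Statistical constraint: keep average difficulty between 0.50 and 0.65
prob += lpSum(x[i] * p for i, _, p, _ in pool) >= 0.50 * form_length
prob += lpSum(x[i] * p for i, _, p, _ in pool) <= 0.65 * form_length

prob.solve()
print(LpStatus[prob.status], [i for i in x if x[i].value() == 1])
```

Operational assembly adds many more constraints (enemy items, exposure, item-information targets, multiple parallel forms), but the structure is the same: binary selection variables, one constraint per specification, and an objective built from the statistical specs.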
6. Item Development – Item Writing
• Writers are familiar with the population of test takers and the content domain or constructs.
• Item writers often are content experts or subject matter experts (SMEs).
• Writers are trained in item writing, qualifying procedures, multiple reviews, etc.

6. Item Development – Additional Requirements
• Scoring guidelines
• General guidelines on item writing
• Item style guidelines
• Bias, fairness, and sensitivity procedures
• Universal design principles (or acceptable accommodations)
• Sample items and rubrics

6. Item Development (cont.)
• A Potential Model for Item and Form Review
  [Flow diagram of review stages: item authored; content review 1; editorial review; sensitivity/fairness review; pretest; statistical analysis; revision; content review 2; item pool; form assembly; final form review.]

6. Item Development (cont.)
• Scenarios, simulations, and extended constructed-response items require a disproportionate share of the testing time. Issues to consider:
  – Tasks that are realistic and have fidelity to the actual situation
  – Tasks that assess multiple competencies or are multidisciplinary
  – Motivate the test taker to stay with a more complex task and not give up
  – Are not overly text dependent, unless comprehension is to be assessed
  – Appropriate stimulus materials and tools are available
  – Allow adequate time to engage the task and try out solutions
  – Appropriate scoring guide and guidelines

6. Item Development (cont.)
• Scenarios, simulations, and extended constructed-response items require a disproportionate share of the testing time. Issues to consider:
  – Item tryouts and cognitive labs can be useful in early design to ensure the task is assessing what is intended, solution strategies are consistent with the intended purpose, and alternative solutions are not easily available (e.g., short cuts)
  – Appropriate scoring guide and guidelines
  – Determine what response factors are relevant vs. irrelevant
  – Weigh cost (time, money, scoring, operational issues), scorers, etc.

6. Item Development (cont.)
• The American Institute of Certified Public Accountants launched a computer-based set of examinations in 2004.
  – Multiple-choice section adaptive at the testlet level using IRT
    • Every test taker receives three testlets of 25 items
    • Four routes: Medium-Easy-Easy, Medium-Harder-Harder, M-E-H, or M-H-E
    • Most test takers go through the first two routes; candidates can pass or fail on any route
  – Simulations with authentic scenarios and interesting/innovative item types
    • Fill in cells of a spreadsheet
    • Arrange steps in a procedure in proper order
    • Write a memo to a supervisor or a letter to a client based on the scenario

6. Item Development (cont.)
• The American Institute of Certified Public Accountants launched a computer-based set of examinations in 2004.
• The process involved
  – Performing practice analyses
  – Identifying, generating, and trying out item types – not for innovation but for appropriate measurement
  – Developing software
  – Implementing modern psychometric theory
  – Field testing
  – Review and improvements in scoring rubrics after administration

6. Item Development (cont.)
Accommodations, Special Populations, and Universal Design
• Understand what accommodations or modifications are acceptable (Braille, script, dictionary, calculator, translation, extended time, etc.).
• Avoid content and item types where the construct measured can readily change based on accommodations (e.g., excessive text in math items).
• Attempt to follow universal design principles (Thompson, Johnstone & Thurlow, 2002):
  – Items measure what they are intended to measure (avoid construct-irrelevant aspects) – changes to format do not change meaning
  – Items recognize the diversity of test takers (age, cognitive levels, culture)
  – Avoid unnecessary use of words and jargon (concise and readable text)
  – Pictures, graphics, and figures are clear and reproduce easily in different modes
  – Clean and clear format of items (space)

6. Evaluating Items
• Content review of items: the item
  – has a single best answer
  – is at the appropriate level of difficulty
  – is a good fit with specifications
  – is clear and unambiguous
  – meets other formal criteria for a good multiple-choice question
• Editorial review:
  – Item is clear, grammatically correct, and follows established style guidelines
• Sensitivity and fairness review:
  – Item does not contain language that might offend or disadvantage any subgroups of test takers
• Copy editing and accuracy:
  – Layout quality
  – Graphics

6. Assembling the Test Form
• Select items for a CAT item pool or final form that meet the test specifications (content x statistical specs).
• Item selection may include other factors (graphics, theme, format).
• Conduct a final form review of items to ensure balance, representativeness, and diversity.
• Attention to statistical specifications
  – Fit of the model to the data if using IRT
  – Indices of item difficulty, discrimination, item information, inter-item correlations, etc.

6. Assembling the Test Form
• For CAT pools (testlets), consider constraints and exposure rates.
• Changes at the final form stage pose some risk.
• Consider how item location/placement can impact performance (especially important for embedded linking and calibration items).

7. Trying out items and test forms
• Non-operational venues
  – Very small sample "proof of product"; could be "talk aloud"
  – Small-sample pilot tests
  – Large-sample field tests
• Operational venues
  – Embedded pretest items
  – Separate pretest sections

7. Trying out items and test forms
• Purpose is to collect data on item performance – indices of difficulty, discrimination, DIF, item-test correlations (see the sketch below).
• Identification of flawed items
  – Double keys
  – Miskeys
  – No answer
  – DIF
• Things to consider – sample size (IRT requirements), timing of data collection, motivation (embedded vs. independent), representativeness of the sample.
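A minimal sketch of the classical item statistics mentioned above: proportion correct (difficulty) and a corrected item-total point-biserial correlation (discrimination), with illustrative flagging thresholds. The response matrix is made up, and operational programs would add DIF and, where applicable, IRT analyses.

```python
# A minimal sketch of classical item analysis on a 0/1 scored matrix:
# rows = test takers, columns = items. The flagging thresholds are
# illustrative, not the workshop's operational criteria.
import numpy as np

scores = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
])

total = scores.sum(axis=1)
for j in range(scores.shape[1]):
    item = scores[:, j]
    p = item.mean()                           # difficulty (proportion correct)
    rest = total - item                       # corrected total (item removed)
    r_pb = np.corrcoef(item, rest)[0, 1]      # point-biserial discrimination
    flag = "REVIEW" if p < 0.2 or p > 0.9 or r_pb < 0.15 else "ok"
    print(f"item {j + 1}: p = {p:.2f}, r_pb = {r_pb:.2f}  {flag}")
```

A strongly negative discrimination, as the third column produces here, is the classic signature of a miskeyed item and would send it back for key validation.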
7. Evaluation of the test form after field trials
• Examine linkages to previous editions (or forms), if applicable
• Speededness
• Reliability
• Subgroup differences
• Establish new norms, percentiles, new cut scores, set the scale, etc.
• Link to other external measures

7. Evaluating Items and Test Forms (cont.)
• Document, document, document
  – Sample selection
  – Procedures used to conduct item tryouts and field tests
  – Psychometric model used (e.g., when IRT is used: item parameters, estimation procedures, model fit)
• When test development relies primarily on psychometric data, employ cross-validation methods (60/40 samples)
• Document directions for administering the test and acceptable accommodations or modifications to the test. Provide sample materials, items, and scoring rules.
• Test development is the beginning of a continuing process to provide documentation and evidence to support inferences related to validity.

Exceptions and Caveats
• A rigid sequential approach to test development is not always permitted.
• Many psychological tests are dependent on constructs, and a more empirical approach will define the framework.
  – For example, a selection test may be more dependent on the correlation of item scores with job performance data.
  – A personality test used for psychological diagnosis may be more dependent on items that discriminate between groups (patients vs. non-patients).

Exceptions and Caveats
• The extent of the framework may be less important for tests with a single form that is reused (e.g., psychological inventories).
• Item discrimination may be more essential when major decisions surround a couple of score points (cut scores on certification tests).

8. Administering the test
• Standardization in paper-based tests
  – Shipment
  – Instructions for test takers and proctors
  – Training of proctors
• Standardization in computer-based tests
  – Item pools
  – Item selection algorithms for CAT, linear-on-the-fly, …
  – Instructions for test takers and proctors
  – Training of proctors

8. Administering the test
• Shipment
  – Plan and prepare packets
  – Arrange for receipt and safe storage
  – Secure a vendor / monitor vendor performance
  – Arrange for preparation of the return shipment
  – Account for all materials

9. Scoring the test
• Scan objective answer sheets
• Rate responses to student-produced questions
• Perform item analyses/DIF analyses
• Eliminate problem items – when, why, and how
• Apply final keys for total and any part scores

9. Scaling the test
• Identify and publish the proposed score scale before administration
  – For example, the scale ranges from 200-800 with mean equal to 500 and standard deviation equal to 100
  – Or, the cut score equals 70
  – Avoid established scales, number correct, percent correct
  – Allow room for changes in ability and for differences in difficulty of versions
  – An example is provided in the handouts, and one of us can walk through the example later.

9. Scaling the test
• For example, the scale ranges from 200-800 with mean equal to 500 and standard deviation equal to 100
  – Identify the population that defines the scale. "The 1990 Reference Group" of one million test takers was used for the SAT.
  – Collect the scores and compute means and standard deviations.
  – Using a linear transformation, set the observed mean to 500 and the observed standard deviation to 100.
• In reality, there was a bit more mathematical management of the scores to ensure that the distribution reflected a normal distribution, but for most programs the process just described is sufficient.
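A minimal sketch of the linear transformation just described: anchor the scale on a reference group, map its observed mean to 500 and standard deviation to 100, then round and clip to the published 200-800 range. The reference-group raw scores and the 10-point rounding increment are illustrative, not taken from any operational program.

```python
# A minimal sketch of the linear scaling step described above: map raw
# scores so a reference group has scaled mean 500 and SD 100, then round
# and clip to the published 200-800 range. Raw scores here are made up,
# and rounding to 10-point increments is illustrative.
import statistics

reference_raw = [31, 45, 52, 38, 60, 47, 55, 41, 49, 57]   # hypothetical reference group
mu = statistics.mean(reference_raw)
sigma = statistics.stdev(reference_raw)

def scale(raw, target_mean=500, target_sd=100, lo=200, hi=800):
    scaled = target_mean + target_sd * (raw - mu) / sigma   # linear transformation
    scaled = round(scaled / 10) * 10                        # report in 10-point steps
    return max(lo, min(hi, scaled))                         # clip to the published range

for raw in (25, 44, 66):
    print(raw, "->", scale(raw))
```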
9. Scaling the test
• Set the cut score equal to 70 (or an appropriate score)
  – Choose the reporting scale. Note: some programs report numeric scores to all test takers; some report numeric scores to failing candidates but only "Pass" to passing candidates.
  – Perform a cut-score study and obtain the operational cut score from decision-makers in a transparent process.
  – Set the cut score to 70 (or whatever the choice is).
  – Identify the highest raw score that will be set to the highest reported score.
  – Identify the lowest raw score that will be set to the lowest reported score.
  – Solve the simultaneous equations establishing the line between the low score and the cut score and the line between the cut score and the high score (see the sketch after the next slide).
  – Develop the table that shows what scaled score corresponds to each raw score.

9. Scaling the test
Aside: For tests with cut scores, perform a standard setting. Standard setting is a topic worthy of its own presentation. It is the application of one of many disciplined processes to define the level of performance below which a candidate is not likely to be competent.
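A minimal sketch of the two-segment scaling described on the cut-score slide: the lowest raw score, the raw cut score, and the highest raw score are anchored to the lowest reported score, 70, and the highest reported score, and each segment is the straight line between its two anchor points. All raw-score anchors here are hypothetical.

```python
# A minimal sketch of two-segment linear scaling around a cut score:
# the raw cut score maps to a reported score of 70, the lowest raw score
# to the lowest reported score, and the highest raw score to the highest
# reported score. Anchor values are illustrative only.

raw_low, raw_cut, raw_high = 0, 62, 100      # hypothetical raw-score anchors
rep_low, rep_cut, rep_high = 0, 70, 100      # reported-score anchors

def raw_to_scaled(raw):
    if raw <= raw_cut:
        slope = (rep_cut - rep_low) / (raw_cut - raw_low)
        return rep_low + slope * (raw - raw_low)
    slope = (rep_high - rep_cut) / (raw_high - raw_cut)
    return rep_cut + slope * (raw - raw_cut)

# Raw-to-scale conversion table (every 10th raw score shown)
for raw in range(raw_low, raw_high + 1, 10):
    print(f"raw {raw:3d} -> scaled {raw_to_scaled(raw):5.1f}")
print(f"raw {raw_cut:3d} -> scaled {raw_to_scaled(raw_cut):5.1f}  (cut score)")
```

In practice the full raw-to-scale table produced this way is the published conversion for that form.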
9. Scaling the test – part 1
• Generate the scale as planned
• Transform raw scores to scaled scores
• Round and truncate as defined in step 1 (e.g., all scores below 200 are set to 200)
• Calculate frequency distributions for raw and scaled scores to check reasonability
• Adjust if necessary – and DOCUMENT!

10. Reporting scores
• Placeholder for slide on discussion of what should be included in a score report

10. Reporting scores – individual reports
• Prepare practice score reports before administration and review them for accuracy and usefulness
• Prepare an initial set of score reports for quality control purposes
• Generate all score reports
• Perform quality control on random samples

10. Reporting scores – institutional reports
• Determine the nature of reports for other constituencies
• Prepare practice reports before administration and review them for accuracy and usefulness
• Prepare an initial set of reports for quality control purposes
• Generate all score reports
• Perform quality control on random samples

10. Reporting scores – summary reports
• Prepare a report concerning test versions and administration results
  – Candidate-level information: frequency distributions and summary statistics
  – Item-level information: difficulty, item discrimination, DIF results
  – Form-level information: average difficulty, discrimination, reliability estimates, relevant correlations

11. Maintenance – after each administration
• Update item bank information and identify item writing requirements and assignments
• Update candidate file information
• Generate trend reports and identify issues for review and/or action

11. Maintenance – next versions and administrations
• Generate new versions that allow the score scale to be perpetuated (a.k.a. equated)
  – Same content specifications
  – Same difficulty level
  – Overlapping items with one or more previous versions already on the scale

11. Maintenance – next versions and administrations
• After administration of new versions
  – Perform the same post-administration steps as before, adding:
  – Analyze the common items for adequacy in equating scores on new versions to the old scale
  – Equate and perform quality control

Questions, Ideas, Concerns
• Questions
• Ideas
• Concerns
• Comments
• Please take the handout with a short list of references and resources

Resources
• Researchers are encouraged to freely express their professional judgment. Therefore, points of view or opinions stated in College Board presentations do not necessarily represent official College Board position or policy.
• Questions can be sent to Andrew Wiley at: [email protected]

Thanks to ATP and thanks to you