The Maryland School Performance Assessment Program: Performance Assessment with Psychometric Quality Suitable for High Stakes Usage

Wendy M. Yen, CTB/McGraw-Hill
Steven Ferrara, Maryland State Department of Education

The Maryland School Performance Assessment Program (MSPAP) is part of a larger school reform effort, dubbed "Schools for Success," initiated by the Maryland State Department of Education and the State Board of Education. The MSPAP is an annual performance-based testing program that was first administered to approximately 150,000 students in grades 3, 5, and 8 in May 1991. Performance on the MSPAP is used to evaluate schools and to provide information to guide school improvement efforts. The primary high stakes focus in designing, developing, and reporting MSPAP is school performance rather than individual student performance. Schools are expected to meet standards for satisfactory and excellent school performance on the MSPAP in reading, writing, language usage, mathematics, science, and social studies by the 1995-96 school year. Schools that do not meet these standards will be required to develop and implement school improvement plans to meet them. Consistently low performing schools may be selected as Challenge Schools, which receive funding and outside expert guidance on improving school performance on the MSPAP standards and on standards for other areas (e.g., attendance, dropout rates). They may also be designated as Reconstitution Schools, which are reorganized and managed by an outside organization.

The 1991 MSPAP included assessments of learning outcomes in reading, writing, language usage, and mathematics. Assessment of science and social studies outcomes was integrated into the 1992 and subsequent MSPAP editions. Each assessment task in these later editions assesses 1 to 4 content areas. MSPAP assessment tasks are designed to elicit students' knowledge, thoughtful application of skills, and thinking processes. Students write, diagram, and sketch responses to tasks that focus on their ability to construct and extend meaning from what they read, construct and extend meaning through writing, solve multistep mathematics problems, conduct hands-on science investigations, understand social studies concepts, and analyze social studies issues.

It was essential to the purposes of the MSPAP to develop innovative performance-based assessments. It was also essential that these assessments have sufficient psychometric quality that the results could be used for high stakes, yearly evaluations of school performance and for tracking school improvement. At the time this program was initially designed, in 1989 and 1990, performance assessments were much more a dream than a reality, and virtually no information was available about their psychometric properties. The first year's tests were more innovative than those in any other large-scale testing program of the time. The very encouraging results of that testing permitted greater innovations in later years.

This paper describes the program design and highlights its psychometric results. We do not describe details of the psychometric procedures and findings of the MSPAP, but merely abstract and summarize some typical results from the first year of the program and summarize some changes made in later years.
Complete detailed descriptions of these and other procedures and findings are contained in the MSPAP 1991 Final Technical Report (CTB Macmillan/McGraw-Hill, 1992) and the 1992 MSPAP Technical Report (Maryland State Department of Education, 1993), both of which are available through MSDE.

PROGRAM REQUIREMENTS

It is important to note that the MSPAP program requirements have dictated the psychometric properties needed for the assessments. Psychometric properties have not been built into the tests for vague theoretical reasons. The psychometrics of the program were designed to interfere as little as possible with the innovative aspects of the assessments (e.g., modeling good instruction) while providing the characteristics that were necessary to the program, such as accurate, comparable scores. The essential requirements of MSPAP were the following:

1. In conformity with the Maryland learning outcomes, develop and implement performance-based assessments in four content areas: Language Arts (Reading [RD], Writing [WR], and Language Usage [LU]) and Mathematics [MT].
2. Beginning with the second year of the program, implement assessments measuring state outcomes in two additional content areas: Science [SC] and Social Studies [SS].
3. Use assessments that model good instruction.
4. Limit total testing time to 9 hours.
5. In grades 3, 5, and 8, assess every student in each content area in May.
6. Provide student scores in each content area.
7. Develop proficiency level cut-points and descriptions of typical student performance seen at those levels in each content area, and develop standards to be used in evaluating school performance in each content area as Unsatisfactory, Satisfactory, or Excellent.
8. Produce measures of school performance in the six content areas that are sufficiently accurate that school performance can be
   • evaluated relative to the proficiency level cut points and school performance standards
   • compared over years
9. Provide schools with useful information about school learning outcome performance to guide school improvement plans; produce outcome scores that can be compared over test forms and over years.
10. Without data-based tryouts, produce operational scores in the first year of testing.
11. Produce a pool of scaled tasks with known statistical characteristics that can be used for subsequent test form assembly; provide psychometric descriptions of items (with respect to such characteristics as difficulty, fit, and bias) to inform and improve future task development.

FIRST YEAR OF THE PROGRAM

The bulk of this paper will describe typical findings from the first year of the program. Some changes made in later years will then be presented.

Note on Terminology

The questions or directions given to students in a performance assessment are very different from those in multiple-choice tests. Performance assessment scores also are dependent on the rule or rubric used to grade each response. For the sake of simplicity, each prompt, question, or set of directions to which a student responds, along with its associated scoring rule, will be called collectively an "item."

Content Areas

Reading

The reading domain is defined by reader purpose and the orientations or stances that readers take toward text as they read. In the Maryland reading model, as portrayed in the learning outcomes which form the basis for the MSPAP reading assessments, there are three purposes for reading: reading for literary experience, for information, and to perform a task. (Reading to perform a task does not involve actually performing the task.) The model includes four reading stances: global understanding, developing interpretation, personal response, and critical stance. Purposes and stances are fully crossed in the Maryland model. (The Maryland model also includes metacognition and attitudes toward reading.)

The Maryland reading model is unlike earlier reading component skill models in which the reading process is viewed as a set of discrete skills that function fairly autonomously and that can be developed and assessed separately. The Maryland model is derived from reader response theory, in which reading -- as well as writing, listening, and speaking -- is viewed as a process of constructing, examining, and extending meaning through complex interactions with text (see Langer, 1990). Reading purposes and stances are the same across the three grades assessed in MSPAP.

MSPAP reading assessment tasks are comprised of several open-ended assessment activities that require students to construct, examine, and extend meaning through the stances on the literary, informational, and instructional passages they read during the assessment. Responses to reading activities were scored using 2-, 3-, and 4-point keys and a 4-point rubric for one extended response scored for both reading and writing. Coverage of each reading purpose was proportionally balanced in each 1991 MSPAP test form, with literary experience most often assessed, reading to perform a task least often, and reading for information activities increasing at grade 8. Coverage of the stances within each purpose was proportionally balanced in each test form, with global understanding, personal response, and critical stance questions and prompts less frequent than developing interpretation questions and prompts. Because MSPAP tasks vary in length, the numbers of activities and coverage of outcomes vary somewhat across test forms.

Writing and Language Usage

The writing domain is defined by three purposes for writing -- to inform, to persuade, and to express personal ideas -- and by steps in the writing process: prewriting/planning, drafting, revising, and proofreading. The writing outcomes include long-recognized modes of discourse (narration, exposition, description, and persuasion). The Maryland model also includes attitudes toward writing. Writing is viewed as a process of constructing, examining, and extending meaning for a variety of audiences. In this view, writers use rhetorical devices (e.g., argumentation for persuasive writing) and other elements of the craft of writing (e.g., word choice, sentence variety) to accomplish the three writing purposes.
Direct assessment of writing performance, with examinees guided through the composing process, dates back to the inception in 1983 of one of Maryland's minimum competency graduation tests, the Maryland Writing Test. The single language usage outcome incorporates correctness and completeness features in the appropriate use of English conventions (e.g., punctuation, grammar, spelling) across a variety of writing purposes and styles. Writing and language usage outcomes are the same across the three grades assessed in MSPAP. The levels of sophistication expected in written responses increase with grade level.

MSPAP writing prompts elicit extended written responses (i.e., essays, stories, poems, plays). Topics for these prompts in the 1991 MSPAP were linked either directly or by theme to reading passages and assessment activities. (Topics for writing prompts in the 1992 MSPAP and beyond are linked to assessment activities in reading, science, and social studies.) Each student wrote two responses, which were scored using a four-point holistic rubric in which the lowest score (0) indicated that the writer failed to minimally meet the criteria for the writing purpose and the highest score (3) represented excellent performance. The two extended writing responses required each student to write for two of the three purposes.

MSPAP does not include separate and independent language usage assessment activities. Appropriate use of language (i.e., conventions and style) is assessed through students' written responses to reading, mathematics, science, and social studies assessment activities and to writing prompts. Reading and writing responses were scored for language usage using a 3-point rule for brief responses and a 4-point rubric for extended writing responses.

Mathematics

The mathematics domain is defined by nine content outcomes and four process outcomes. MSPAP also assesses attitude toward mathematics and use of technology, but these outcomes are not part of high stakes test usage. These 13 Maryland mathematics learning outcomes and their sub-outcomes (i.e., more specific indicators of the outcomes) form the basis for MSPAP mathematics assessments. The Maryland outcomes are a close adaptation of the widely known NCTM Curriculum and Evaluation Standards for School Mathematics (National Council of Teachers of Mathematics, 1989). The nine content outcomes are similar to traditional mathematics objectives (e.g., number relationships, algebra, probability), unlike the four process outcomes (problem solving, reasoning, communication, and connections inside and outside of mathematics). All outcomes were covered in all 1991 test forms.

The 1991 MSPAP open-ended mathematics tasks required students to solve multi-step problems, make decisions and recommendations, communicate their ideas, and explain the reasoning, understanding, and processes they used to solve problems. Responses to mathematics activities were scored using 2- and 3-point scoring keys and some 4-point keys. Coverage of content and process outcomes in the 1991 MSPAP mathematics assessment was proportionally balanced.
Approximately 40% of the activities in each form in grades 3 and 5 assessed process outcomes, while approximately 60% assessed content outcomes; about 20% of the activities assessed both content and process outcomes. These percentages shifted in grade 8 to 20-30% for process and 70-80% for content outcomes. Because MSPAP tasks vary in length, the numbers of activities and coverage of outcomes varies somewhat across test forms.

Task Structure

Typically, MSPAP tasks begin with an opening activity (e.g., a brief discussion), which is not scored, often but not always followed by one or more readings, followed by assessment activities (referred to here as "items"). The purposes of opening activities include orienting students to the theme, content area, and expectations of the task; encouraging students to activate prior knowledge related to the task; providing knowledge related to the task; and providing students with a purpose to undertake the task.

MSPAP readings in all content areas usually are unabridged, published, "authentic" reading selections. They are usually selected from trade publications such as magazines, picture books, and short story collections rather than from textbooks or basal reading series.

MSPAP assessment items include questions, prompts to students, and other stimuli that elicit a written, sketched, or diagrammed response from students. They are always linked sequentially, related to a common theme, or otherwise organized in a coherent fashion. Reading items focus on one or more reading passages. Writing prompts always are linked to reading passages and accompanying items or to observations, investigations, or other assessment activities in other content areas. Mathematics items usually do not focus on a reading passage; however, they often comprise problems, investigations, or issues that must be addressed over multiple steps and that usually lead to a culminating solution, decision, recommendation, and/or explanation. This structure of an opening activity followed by items may repeat itself in a single task. Figure 1 contains an example of this structure from a 1991 MSPAP grade 5 reading/writing/language usage task.

Insert Figure 1 about here

Form Structure and Administration

In order to cover the required breadth of content in limited testing time, multiple test forms were developed. In the first year of the program, there were 5 or 6 forms (depending on grade) in language arts (i.e., reading, writing, and language usage) and five forms in mathematics.

For the Language Arts performance assessments, testing occurred in five sessions over five days, with one or one-and-a-half hours of testing per session. In each session, students read one or more short stories or articles and wrote several short answers or one extended response, depending on the session. Extended Writing responses were obtained in two of the sessions, Reading responses were obtained in four of the five sessions, and Language Usage responses were obtained in all five sessions. The Mathematics assessment was administered in three one-hour sessions on three school days preceding (Grade 8) or following (Grades 3 and 5) the Language Arts assessment.
In each Mathematics testing session, students responded to three tasks related to different themes. Within a session the tasks were not separately timed, and students managed their own time. The typical number of items per task was 5 or 6, but the range was from 3 to 10. Each Mathematics task typically related to a scenario; the description of the scenario typically took a half page, and additional information related to that scenario was often provided for later items. It was common for an item to ask the student to explain the process or reasoning behind the responses to preceding questions.

The assessments were administered to students in groups of normal class size. Students were randomly assigned to testing groups rather than assessed in intact classes, and test forms were randomly assigned to testing groups. Teachers acted as test examiners, but students were not necessarily tested by their own regular teachers, and the assessment materials were kept secure before testing. There were several motivations for this randomization. First, because the focus of the program is on school performance rather than on the performance of individual students, there was concern about the possibility of teacher-by-task or teacher-by-form effects; each form could include only a limited number of tasks. For example, a particular teacher might, by chance, have used teaching materials related to a reading passage or mathematics scenario that subsequently appeared on the MSPAP assessment, and randomization helped to minimize the chance that such undue influences would affect school performance data. Random assignment of students to forms also minimized the chance (particularly in smaller schools) that a particular teacher-by-form effect could have a major impact on a school's results, and it assured that heterogeneous groups of students would take each test form. Finally, the randomization mimicked "authentic" working situations in which people cooperate with a variety of colleagues and supervisors depending on the nature of the work.

In the first year of the program, there was not time to try out the new assessments except in some small pilots. There was a concern that some of the tasks might not "work." Also, as described in a later section, steps were taken to equate scores over forms; however, in the first year of the program it was not known how well these new equating procedures would work with performance assessments. Therefore, to maximize the chances of good quality data being produced for every school and to assure comparability of results over schools, the two "best" forms were identified in each grade and content area. These forms were selected by MSDE to be those containing tasks that were judged to be the most authentic and engaging and that appeared to be of the highest quality; these forms were administered in every school. For schools with more than two testing groups (classes), additional forms were randomly assigned.
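The assignment logic described above is simple to simulate. The sketch below is a hypothetical illustration in Python: the group size, form labels, seeding, and function name are invented for the example and are not the operational MSPAP procedure; it only shows how students can be shuffled into testing groups of roughly normal class size, with the two designated forms guaranteed in every school and any remaining groups receiving other forms at random.

    import random

    def assign_forms(students, forms, guaranteed_forms, group_size=25, seed=0):
        """Randomly split a school's roster into testing groups and assign
        one test form per group.  `guaranteed_forms` (e.g., the two "best"
        forms) are assigned first so every school administers them."""
        rng = random.Random(seed)
        roster = students[:]                 # leave the caller's list intact
        rng.shuffle(roster)

        # Chunk the shuffled roster into testing groups of ~class size.
        groups = [roster[i:i + group_size]
                  for i in range(0, len(roster), group_size)]

        other_forms = [f for f in forms if f not in guaranteed_forms]
        assignments = {}
        for i, group in enumerate(groups):
            if i < len(guaranteed_forms):
                form = guaranteed_forms[i]       # every school gets these
            else:
                form = rng.choice(other_forms)   # extra groups: random form
            for student in group:
                assignments[student] = form
        return assignments

    # Example: a school with 83 students and five mathematics forms.
    roster = [f"student_{i:03d}" for i in range(83)]
    print(assign_forms(roster, forms=list("ABCDE"), guaranteed_forms=["A", "B"]))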
Scoring Process

Examinee responses to MSPAP assessment activities were scored by trained Maryland teachers using activity-specific keys, usually used for brief responses, generic rules for longer responses, and rubrics for essays and other extended responses. All MSPAP scoring tool types allowed full and partial credit to be awarded to responses. In all content areas and grades, however, student responses that were less than minimally acceptable received a score of 0; no partial credit was awarded for attempted responses that were less than minimally adequate. In addition to the training and qualifying of table leaders and other readers, read-behinds and check sets were employed. Except for a small number of student papers involved in a special study, all student papers were read and graded once.

Item Scores

The lowest score for each item was 0, and the maximum possible score varied from 1 to 3. Omitted responses were given a score of 0. Table 1 displays the typical percent of items with each maximum possible score; the range of the number of items per form also appears in that table. More detailed information about the characteristics of the item scores is available from Fitzpatrick, Ercikan, and Ferrara (1992) and Goldberg and Kapinus (in press).

Insert Table 1 about here

Scaling

Several MSPAP requirements led to the use of an item-based scaling procedure. These requirements included:

• Proficiency level cut points and descriptions of typical student performance at those levels that could be compared across test forms and across future forms and years
• Outcome scores that could be compared across forms
• A pool of calibrated items from which future forms might be assembled
• Student scores that are as accurate as possible
• Standard errors of measurement that reflect the accuracy of scores for students at different achievement levels

Item-based item response theory (IRT) scaling offered a means of satisfying all these requirements. Traditional IRT models were developed for multiple-choice tests. Because the MSPAP responses were not just "right" or "wrong," a new scaling model was needed that could handle items with varying numbers of score points. It was particularly important that the scaling model be a "servant" to the content: it had to be flexible and not require restrictions that would unnecessarily limit test design. Masters' (1982) Partial Credit model can be used for scaling items with different numbers of score levels. However, Masters' model has an important limitation that made it unsuitable for the MSPAP: the model is analogous to the Rasch model for multiple-choice tests and forces all items, regardless of their number of score points, to have the same discrimination. Items with discriminations that differ from the others must be removed from the test, or the model will be inaccurate. Analyses of pilot data indicated that performance items have substantially different discriminations. (It may be of interest to note that in subsequent research, analyses of NAEP performance items indicated the same.) Given these concerns, a generalization of Masters' model that allowed items to have different discriminations was developed for MSPAP. This generalization was dubbed the "two-parameter partial credit" (2PPC) model (Muraki, 1992; Yen, 1993), and computer programs were developed to estimate the model's item parameters and students' trait values (Burket, 1991).
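The paper does not reproduce the 2PPC equations. The sketch below gives the category response function of a two-parameter (generalized) partial credit model of the kind described above, that is, Masters' model with an item-specific discrimination; the parameter names (a for discrimination, b for step difficulties) are generic illustrations, not MSPAP's operational notation or item values.

    import numpy as np

    def gpc_probabilities(theta, a, b):
        """Category probabilities for one polytomously scored item under a
        two-parameter (generalized) partial credit model.

        theta : examinee ability (scalar or array)
        a     : item discrimination (item-specific, unlike the Rasch/Masters
                model, where all items share one discrimination)
        b     : step difficulties b_1..b_m for an item scored 0..m

        P(X = k | theta) is proportional to exp(sum_{v<=k} a*(theta - b_v)),
        with the empty sum for k = 0 taken as 0.
        """
        theta = np.asarray(theta, dtype=float)
        b = np.asarray(b, dtype=float)
        steps = a * (theta[..., None] - b)                  # shape (..., m)
        z = np.concatenate([np.zeros(theta.shape + (1,)),
                            np.cumsum(steps, axis=-1)], axis=-1)
        z -= z.max(axis=-1, keepdims=True)                  # numerical stability
        num = np.exp(z)
        return num / num.sum(axis=-1, keepdims=True)

    # A 0-2 point item: moderate discrimination, two step difficulties.
    print(gpc_probabilities(np.array([-1.0, 0.0, 1.0]), a=1.2, b=[-0.5, 0.8]))

Because a enters every step, items scored with more objective, more discriminating rules get steeper category curves and, as noted below, more weight in the scale scores.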
The 2PPC model was found to describe MSPAP performance quite well; out of many hundreds of items analyzed, only a handful needed to be deleted from the scales because of poor fit: 3 items in Reading, 8 items in Math Content, and 4 items in Math Process.

Traditional multiple-choice tests are usually carefully designed to avoid local item dependence, in which responses to some items are related to responses to other items beyond what the trait being measured can explain. In contrast, performance tests such as the MSPAP deliberately include items that are designed to be related to one another; an example is the math process item that asks the student to explain the reasoning used to obtain a previous answer. In order to gain the benefits of performance assessment without placing unnecessary restrictions on test design, a variety of additional psychometric strategies for managing local item dependence was developed. These strategies varied from the use of testlets to the development of scaling procedures that are invisible to the test developers and users. Detailed information about these strategies is available from Yen (1993; see also Ferrara, Huynh, & Baghi, 1993).

Content Area Scale Scores

Scale scores were developed for each content area. In essence, to obtain a student's scale scores, item scores were weighted by the item discriminations; more weight was received by items that discriminated better among students performing at different levels. Conversion tables translated these weighted raw scores into scale scores, which are an approximate transformation of the IRT ability estimates. Scale scores ranged from a low of around 350 up to a high of about 700, depending on the scale and grade. They were set to have a mean of 500 and a standard deviation of 50.

Equating the Test Forms

In the equating, the goal was to have the forms within a grade and content area produce scale score distributions that were as similar as possible. An equivalent groups equating design was used. Scale scores were obtained for randomly equivalent samples of 1,500 students who took each form. Using the cumulative scale score distributions for these equivalent samples, the linear transformation was determined that most closely aligned the distributions. This procedure is a linear equating that most closely approximates equipercentile equating, a non-linear procedure. Examples of the similarity of the distributions of scores from the different forms are contained in Figures 2 to 4. These are cumulative distributions and show the percentage of students at or below each scale score. In general, the scales based on more item responses produced smoother and more closely aligned distributions.

Insert Figures 2 to 4 about here
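One way to realize a linear equating that approximates equipercentile equating is to match quantiles of the two cumulative distributions by least squares, as in the sketch below. This is an assumption-level illustration: the exact alignment criterion used operationally for MSPAP is not spelled out in this paper, and the data here are simulated.

    import numpy as np

    def linear_equate(scores_new_form, scores_ref_form,
                      percentiles=np.arange(1, 100)):
        """Find a linear transformation y = A*x + B that brings the cumulative
        score distribution of a new form close to that of the reference form,
        using randomly equivalent groups of examinees."""
        qx = np.percentile(scores_new_form, percentiles)   # new-form quantiles
        qy = np.percentile(scores_ref_form, percentiles)   # reference quantiles
        A, B = np.polyfit(qx, qy, deg=1)                   # least-squares line
        return A, B

    # Randomly equivalent samples of ~1,500 examinees per form (simulated).
    rng = np.random.default_rng(1)
    form_a = rng.normal(500, 50, 1500)     # reference form
    form_b = rng.normal(495, 55, 1500)     # slightly harder, more spread form
    A, B = linear_equate(form_b, form_a)
    print(f"equated score = {A:.3f} * raw + {B:.1f}")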
The Writing tests, which had only two items, produced the least exact equatings.

Outcome Scores

Outcome scores were intended to be comparable across the forms within a grade. These scores were based on subsets of items that assessed each outcome, as determined by MSDE content experts. A student's outcome score was the percent of maximum score that the student would be expected to obtain if that student had taken all the items assessing that outcome in all the forms at that grade. The definition of this outcome score is most easily understood with a hypothetical graphical example, as seen in Figure 5. Consider one particular outcome at a particular grade. Using the calibrations of the items contributing to that outcome, the Expected Percent of Maximum (EPM) score as a function of scale score is found for each form. One such EPM curve appears in the figure, labelled "(a)"; there are actually many such curves, one for each form, but only one is shown in this example. There is also a curve for the items in all the forms; it is labelled "(b)," and there is only one such curve for an outcome.

Insert Figure 5 about here

The dotted lines in the figure show how the curves are used to transform a student's observed outcome score on a particular form to the EPM outcome score. The student's observed outcome score [70% of the maximum possible on the items that happened to be in his or her test form] is converted to a scale score [500] by using curve (a), which is based on the items in the form the student took (Form A). Using curve (b), this scale score is then translated into an EPM outcome score [40] that is referenced to the items in all forms. This procedure is a means of adjusting for the difficulty of the items that happened to be in a student's form, and it produces outcome scores that are comparable across forms. Its goals are to produce outcome scores that are referenced to a domain of behavior (that is, the outcome defined by the items in all forms) and that are comparable over forms. The degree of success in meeting these goals was evaluated with the study of standard errors of school means described below.
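The two-step conversion in Figure 5 amounts to an inverse lookup on curve (a) followed by a forward lookup on curve (b). The sketch below shows that arithmetic; the logistic EPM curves used here are purely illustrative stand-ins, not the operational MSPAP curves.

    import numpy as np

    def epm_outcome_score(observed_pct_max, scale_grid, form_curve, all_forms_curve):
        """Two-step conversion sketched in Figure 5.

        1. Invert the form-specific EPM curve (a): find the scale score at
           which the expected percent of maximum on the items in the student's
           form equals the student's observed percent of maximum.
        2. Read the all-forms EPM curve (b) at that scale score to obtain an
           outcome score referenced to the items in all forms.

        `form_curve` and `all_forms_curve` are EPM values evaluated on
        `scale_grid`; both must increase with scale score.
        """
        scale_score = np.interp(observed_pct_max, form_curve, scale_grid)  # step 1
        return np.interp(scale_score, scale_grid, all_forms_curve), scale_score

    # Illustrative (not operational) EPM curves on a 350-700 score grid.
    grid = np.linspace(350, 700, 141)
    curve_a = 100 / (1 + np.exp(-(grid - 480) / 40))   # items in the student's form
    curve_b = 100 / (1 + np.exp(-(grid - 530) / 45))   # items in all forms

    epm, scale = epm_outcome_score(70.0, grid, curve_a, curve_b)
    print(f"scale score = {scale:.0f}, EPM outcome score = {epm:.0f}")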
Psychometric Results

Proficiency Levels

To assist in communicating the meaning of scale scores, behavioral descriptions were developed for performance in five scale score ranges. In developing these descriptions, scaling results were used. Each item was located on the scale at the score at which the item provided the most measurement information or, in other words, where the item measured best. In an analogous fashion, the location of every item score was also placed on the scale. For example, for a particular item, a score of 0 might be located at 440, a score of 1 might be at 480, a score of 2 might be at 520, and a score of 3 might be at 550, while the overall item location might be at 500. Four scale score values were identified to establish the proficiency levels, and these values were used as cut scores defining five levels: 490, 530, 580, and 620. Committees of content experts examined the items and item scores located near the proficiency level cut scores and developed descriptions of the knowledge, skills, and processes that students typically displayed in the assessment at each performance level.

School Standards

In order to understand the stakes related to the scale scores, it is necessary to describe later developments with respect to the proficiency levels and school performance standards. In the second year of MSPAP, the proficiency levels were reevaluated, involving broader-based committees and further analyses of item content. Most proficiency level cut points were essentially the same as the 1991 values (490, 530, 580, and 620), although some changes were made (for example, in grade 5 Mathematics). These committees refined and enhanced the behavioral descriptions, which are available from MSDE. Furthermore, important high stakes school performance standards were established. For a school to be evaluated as "Satisfactory" in a particular grade and content area, at least 70% of the students in that grade and content area would need to have scores above 530. For a school to be evaluated as "Excellent," it would need to have reached the "Satisfactory" level and have at least 25% of its students above 580. The school performance evaluations became public information beginning with the 1992 assessments. In 1996 school sanctions and rewards will be tied to these evaluations.
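The classification rule just stated is easy to express directly. The sketch below is a simplified rendering of the published rule only; operational details such as rounding or minimum group sizes are not described in this paper, and the function name and sample scores are invented.

    def school_standard(scale_scores, satisfactory_cut=530, excellent_cut=580,
                        satisfactory_pct=70.0, excellent_pct=25.0):
        """Classify one school in one grade and content area: "Satisfactory"
        requires at least 70% of students above 530; "Excellent" additionally
        requires at least 25% of students above 580."""
        n = len(scale_scores)
        pct_above_530 = 100.0 * sum(s > satisfactory_cut for s in scale_scores) / n
        pct_above_580 = 100.0 * sum(s > excellent_cut for s in scale_scores) / n

        if pct_above_530 >= satisfactory_pct and pct_above_580 >= excellent_pct:
            return "Excellent"
        if pct_above_530 >= satisfactory_pct:
            return "Satisfactory"
        return "Unsatisfactory"

    # 8 of 10 students above 530 (80%) but only 2 of 10 above 580 (20%).
    print(school_standard([512, 545, 533, 590, 601, 488, 560, 574, 552, 538]))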
Test Difficulty

Table 2 describes the average percent of maximum scores for the items in the 1991 MSPAP. For comparison purposes, average difficulties for a traditional multiple-choice test are presented, based on Maryland student performance (Comprehensive Tests of Basic Skills, Fourth Edition, 1989; CTBS/4). For multiple-choice tests like CTBS/4, students can get items correct through lucky guessing, and given the number of answer choices for the items, it is possible to estimate how difficult the tests would have been if guessing the correct answer were not possible. Those estimates are also presented in Table 2. The MSPAP items were more difficult than CTBS/4, even if the effects of guessing are removed from CTBS/4; the exception to this general finding is Reading. For the MSPAP Writing scores, a comparison is made with the Maryland Writing Test, a minimum-competency graduation test. That test is part of Maryland's minimum competency testing program and focuses on lower performance expectations; in contrast, the MSPAP focuses on higher level expectations and is more difficult, with more items and more stringent scoring rubrics. MSPAP assessments were expected to be more difficult than is typical for educational achievement tests because they were designed to represent standards for 1996 and beyond.

Insert Table 2 about here

Measurement Accuracy

Five measures of score accuracy were produced, and in most cases they were developed for both content area scale scores and outcome scores:

1) Correlations of scores produced by different raters (content area scores only)
2) Coefficient alphas
3) Standard errors of measurement for students' scale scores
4) Standard errors of school means
5) Dependability coefficients for school means

Correlations Between Raters. A special study was conducted that involved the scoring of a small group of student response books by two groups of raters. Two test forms in each grade were involved in the study, and for each form from 208 to 246 student books were scored twice. The first scoring was conducted by Maryland teachers as part of the operational scoring process. The second scoring was conducted by professional scorers in California. Table 3 contains the correlations between the students' scores produced by the two sets of raters. The more objective scoring rules involving substantial numbers of items (for example, those in mathematics) produced very high correlations. Scores based on smaller numbers of items and the least objective scoring rubrics (that is, Writing) produced the least consistent scores.

Insert Table 3 about here

Coefficient Alphas. Coefficient alpha is a reliability measure related to the KR-20 but suitable when items have a variety of score levels (Allen and Yen, 1979). Table 4 contains these values for the MSPAP content area scores. For comparison purposes, KR-20 reliabilities are presented for CTBS/4 based on Maryland student performance. While slightly lower than the CTBS/4 values, the MSPAP reliabilities are quite high. For the MSPAP Writing scores, a comparison is made with the Maryland Writing Test.

Insert Table 4 about here

Outcome score reliabilities were highly related to the number of items in the outcome. For outcomes that had at least four items, coefficient alpha ranged from .62 to .93 for Reading outcomes and from .84 to .89 for Mathematics outcomes, and it was .90 for the Language Usage outcome. Low reliabilities (values as low as .33) tended to occur for outcomes measured by very few items.

Standard Errors of Measurement for Student Scale Scores. The standard error of measurement (SEM) on the scale is influenced by the amount of information provided by each item, a characteristic related to coefficient alpha, and also by the number of items contributing to the test; therefore, the higher SEM for the two-item Writing test is not surprising. For each test form and scale score value, the SEM produced by the scaling model was obtained. These values varied somewhat over forms within a grade. As a summary measure, the SEM at selected scale scores was averaged over forms for each grade. Sample values for grade 8 appear in Table 5.
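For a partial credit item with discrimination a, item information at a given ability equals a squared times the conditional variance of the item score, and the IRT standard error of measurement is the reciprocal square root of the summed test information. Whether the operational MSPAP SEMs were computed in exactly this form is not stated in the paper; the sketch below uses that standard result with made-up item parameters, and it works in the underlying ability metric rather than the 350-700 reporting scale.

    import numpy as np

    def gpc_probs(theta, a, b):
        """Category probabilities P(X = k | theta), k = 0..m, for one item
        under a two-parameter partial credit model (see the earlier sketch)."""
        b = np.asarray(b, dtype=float)
        z = np.concatenate(([0.0], np.cumsum(a * (theta - b))))
        z -= z.max()
        p = np.exp(z)
        return p / p.sum()

    def sem(theta, items):
        """Standard error of measurement at ability theta for a form made of
        polytomous items, each given as (discrimination, step difficulties)."""
        info = 0.0
        for a, b in items:
            p = gpc_probs(theta, a, b)
            k = np.arange(len(p))
            info += a**2 * (np.sum(p * k**2) - np.sum(p * k)**2)  # a^2 * Var(score)
        return 1.0 / np.sqrt(info)

    # A tiny hypothetical form: a mix of 1-, 2-, and 3-point items.
    form = [(1.0, [0.0]), (1.2, [-0.8, 0.5]), (0.9, [-1.0, 0.0, 1.2]), (1.1, [0.3])]
    for t in (-1.0, 0.0, 1.0):
        print(f"theta = {t:+.1f}  SEM = {sem(t, form):.2f}")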
Insert Table 5 about here

Approximately two-thirds of the students had scale scores in the 450 to 550 range. The lowest SEMs tended to be in the neighborhood of 500; the mathematics tests tended to have their minimum SEMs in the neighborhood of 550. In general, score accuracy was good in the neighborhood of the 530 cut-point for "Satisfactory" performance. Empirical SEM values for individual students would be higher than the values in Table 5 because of variability due to rater effects; additional variance would also be introduced into student scores by form effects. However, the primary focus of the MSPAP is on school performance. As described in the next two sections, evaluations were made of empirical standard errors of school means that included all sources of error.

Standard Errors of School Means. Empirical standard errors of school means were obtained. First, for every school, the mean performance on every form was calculated, and then the pooled within-school variance of the form means was determined. This variance, divided by the number of forms administered in the school, was taken as the squared standard error of the over-all school mean. It can be noted that this standard error includes all sources of variation that affect scores within a school, including rater effects, form effects, and measurement error.

Table 6 describes typical results in each grade. Recall that the scale scores typically ranged from a low of 350-400 up to the 650-700 range, that the proficiency levels were about 40 to 50 points apart, and that student outcome scores ranged from 0 to 100 with standard deviations typically in the low 20s. Standard errors of school means for scale scores were typically in the range of 2 to 7 points, certainly accurate enough to be useful to schools in understanding their performance; typical values for the outcome scores also appear in Table 6.

Insert Table 6 about here

Dependability Coefficients. A generalizability study was conducted to examine the accuracy with which school means could be measured (Candell and Ercikan, in press). Table 7 presents the dependability coefficients for school means as a function of the number of forms administered in the school. These are results for grade 5; results for the other grades are similar. The dependability indices for the scale scores were in the high .80s and .90s. The dependability coefficients for the outcome scores were somewhat lower but, with the exception of "Reading to Perform a Task," were quite good, indicating substantial accuracy for school means.

Insert Table 7 about here
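The empirical school-mean standard error described above (form means, their variance divided by the number of forms administered) is a one-line computation per school. The sketch below shows a single-school version with simulated scores; how the within-school variances were pooled across schools, and the exact degrees-of-freedom convention, are not specified in the paper, so those are choices made for the illustration.

    import numpy as np

    def school_mean_and_se(scores_by_form):
        """Empirical standard error of a school mean.

        `scores_by_form` maps each form administered in the school to the
        scale scores of the students who took it.  The squared standard error
        is the variance of the form means divided by the number of forms, so
        it folds rater, form, and measurement error together."""
        form_means = np.array([np.mean(v) for v in scores_by_form.values()])
        n_forms = len(form_means)
        all_scores = np.concatenate([np.asarray(v, float)
                                     for v in scores_by_form.values()])
        se = np.sqrt(np.var(form_means, ddof=1) / n_forms)   # ddof=1 is a choice
        return np.mean(all_scores), se

    rng = np.random.default_rng(7)
    school = {f: rng.normal(505 + 4 * i, 45, 28) for i, f in enumerate("ABCD")}
    mean, se = school_mean_and_se(school)
    print(f"school mean = {mean:.1f}, standard error = {se:.1f}")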
Score Validity

MSPAP validity evidence was collected with the goal of supporting and validating intended interpretations and uses of scores from the assessment. The validity evidence described below is organized around this goal.

Content Validity

The Maryland learning outcomes, which form the basis for learning, instruction, and MSPAP assessment activities, are based on recently developed national curriculum standards and learning theory. For example, the reading outcomes are based on NAEP reading assessment objectives and reader response theory. Similarly, the writing outcomes are based on long-recognized modes of discourse, and the mathematics outcomes are based on NCTM standards for curriculum and evaluation. (See the previous section, "Content Areas," for details.)

Outcomes coverage. Coverage of the outcomes by assessment activities is proportionally balanced according to the relative importance of the outcomes at different grade levels. A high degree of match between assessment activities and the outcomes they assess is ensured through multiple reviews during task development and during development of scoring tools and guides.

Instructional validity. The notion of instructional validity became important in the Debra P. Florida court case on minimum-competency testing (see Madaus, 1983). The idea is that for test scores to be valid, what has been assessed on a test must have been taught. For MSPAP, however, this concept is weak: if the state developed learning outcomes assessed in MSPAP were already widely taught and learned, then the intended purposes of MSPAP -- to guide and goad educational reform and to model for teachers and other educators what should be taught and how it should be taught -- would not have been necessary. MSPAP assessment activities are intended to model some of the best classroom instruction in the content areas.

Construct Validity

Construct validity (see Cronbach & Meehl, 1955) has recently come to be considered a unifying concept for all views of score validity and types of validity evidence (see, for example, Messick, 1989, p. 13).

Internal structure. Internal structure refers to the degree to which items that assess the same content area behave similarly to one another and somewhat differently from items in other content areas (see Cronbach, 1971, p. 469). Internal consistency reliability estimates and factor analyses were conducted for each of the content areas (CTB Macmillan/McGraw-Hill, 1992), and both sets of results provide evidence of internal structure. Similarly, the inter-rater scoring correlations reported in Table 3 and the internal consistency estimates in Table 4 indicate that items within each content area scale behave consistently. This evidence should not be surprising given the high degree of match between assessment activities and outcomes discussed above.

Concurrent validity. Here, concurrent validity refers to correlational relationships among MSPAP scores and external measures. The correlations in Table 8 provide some evidence of the convergence and discriminance (see, for example, Cronbach, 1971, p. 466 ff.) of MSPAP content area scale scores. For example, MSPAP reading scores tend to be more highly correlated with teacher ratings of student proficiency in reading and writing than in mathematics. (However, MSPAP Reading is also highly correlated with CTBS/4 mathematics scores.)

Insert Table 8 about here

Differential item functioning (DIF). Items that are "biased" against groups of students who take MSPAP -- that is, items that function differently for different student groups -- diminish construct validity.
A measure of DIF generalized from the Linn-Harnisch (1981) procedure was used to flag differentially functioning items. Analyses of the numbers of MSPAP items flagged for DIF, as compared with CTBS/4, are contained in Green, Fitzpatrick, Candell, and Miller (1992). MSDE has studied items flagged for DIF to inform subsequent assessment task development.

Consequential Validity

Since the primary focus of MSPAP is school performance, the most salient negative consequences of using MSPAP scores (e.g., required school improvement plans, management by an outside party) will occur for low-performing schools, as described earlier. However, these negative consequences are expected to be short-term (i.e., until such schools function successfully), and they can be viewed as positive consequences for schools that need help to improve. The long-term consequence of using MSPAP scores is expected to be positive: schools improve instruction, and student learning improves.

Consequences of using MSPAP scores are also expected for students. These consequences could be positive -- that is, as schools improve instruction, student learning improves -- or negative -- for example, if MSPAP score information does not provide useful information, schools do not improve, and instruction and student learning do not improve.

Other consequences of using MSPAP information are also evident. For example, the low performance reported for the 1991 MSPAP resulted in reports of low teacher morale and complaints that the test was being used for school and teacher "bashing." (These occurred even though the media and public were instructed on the forward-looking nature of MSPAP standards and expectations for 1996 and beyond.)

Conclusion

The evidence and arguments for content and construct validity and other technical information about MSPAP provide reasonable assurance that MSPAP scores can be validly interpreted for evaluating school performance and for guiding and goading school improvement. Similarly, anticipated positive and negative consequences of using MSPAP scores for these purposes support the reasonableness of using the scores for these purposes. Validation of MSPAP score interpretation and use remains an on-going process.

CHANGES IN SUBSEQUENT YEARS

One conclusion drawn from the first-year generalizability study was that, given the level of school-by-form interaction seen in those results, two or three forms, or clusters, would be sufficient for measuring school performance. Based on these results and MSDE content experts' analyses, it was decided to use three clusters in subsequent years. These clusters are administered in every school. All three clusters measure somewhat different content, with some outcomes being measured in one cluster but not another. The use of such clusters permits the breadth of content coverage to be maintained for the schools. It is possible that the use of these clusters might cause scale scores for individual students who took different clusters to become less comparable; this possibility was examined. However, because all clusters are administered to all schools, school results necessarily maintained their comparability.
In 1992, the second year of testing, during which science and social studies were added to the assessments and integrated with the other content areas, the number of days on which students were tested was reduced to five. The numbers of items in the other content areas were decreased, with the following ranges of numbers of items per cluster: Reading (18 to 41), Math Content (13 to 19), Math Process (8 to 18), Language Usage (4 to 6), and Writing (1). The content areas for which this appeared to be too severe a decrease were Language Usage and Writing; in 1993 and subsequent years, there was a return to two extended writing responses and eight Language Usage responses. For some outcomes with few items, the standard errors of the school mean increased noticeably, and schools were cautioned about their use.

Student choice was introduced in 1992. In one testing session for one cluster in each grade, students selected from among Reading passages. In addition to the choice passages, there were passages and items responded to by all students who took that cluster. These non-choice items provided an anchor for calibrating the choice items. Further information about the characteristics of the choice items is presented in Fitzpatrick and Yen (1993).

In the 1992 assessments there were more administration problems and tasks that did not "work," particularly in Science. Problematic items were eliminated from the scoring and scaling process. As a result of these problems, more time was scheduled for piloting, review, and revision in the development of the 1993 assessments. These changes were successful in substantially reducing such problems.

It was desired to have school outcome scores that could be compared over years so that schools could track their progress. The outcome scores described earlier (see Figure 5) are not comparable over years because they were tied to the difficulty of the items administered each year. In 1992 an additional outcome score was reported. This "outcome scale score" is the scale score obtained from curve (a); that is, it is the scale score whose expected percent of maximum on the outcome equals the student's obtained percent of maximum on the outcome. Because of limitations on the numbers of items related to each outcome and the content heterogeneity of the outcomes, the outcome scale scores are not as strictly comparable as are the content area scale scores. However, the stakes associated with the outcome scores are not as high as those associated with the content area scale scores, lessening the need for strict comparability.

As described in an earlier section, the proficiency level descriptions were revised and school performance standards were established in 1992. Proficiency levels were also set for Science and Social Studies. All content areas were reviewed except Writing and Language Usage; those areas will be reviewed in 1993 and 1994, and proficiency level descriptions and performance standards will be set for them.
SUMMARY

The MSPAP requirements dictated the psychometric qualities needed for the program. Equated forms were needed to compare results over schools and years. Accurate school content scores were needed in order to evaluate school performance relative to state standards. The need to track school outcome performance over years led to the development of comparable outcome scores. The desire to establish proficiency level descriptions led to tying item performance to the scale scores. These requirements, along with the desire to develop a pool of calibrated tasks with known statistical characteristics, led to the use of an item-based scaling procedure for MSPAP. Empirical results summarized in this paper provide evidence that MSPAP, an innovative performance-based testing program, does have the psychometric quality needed for high stakes usage.

References

Allen, M., & Yen, W. M. (1979). Introduction to measurement theory. Monterey, CA: Brooks/Cole.

Burket, G. R. (1991). PARDUX, version 1.4. Monterey, CA: CTB Macmillan/McGraw-Hill.

Candell, G. L., & Ercikan, K. (in press). Assessing the reliability of the Maryland School Performance Assessment Program. International Journal of Educational Research.

Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). New York: American Council on Education.

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281-302.

CTB Macmillan/McGraw-Hill. (1989). Comprehensive Tests of Basic Skills, 4th edition. Monterey, CA: Author.

CTB Macmillan/McGraw-Hill. (1992). Final technical report: Maryland School Performance Assessment Program, 1991. (Available from the Maryland State Department of Education, Baltimore, MD.)

Ferrara, S., Huynh, H., & Baghi, H. (1993). Assessing local dependency in educational performance assessments with clustered free-response items. Manuscript submitted for publication.

Fitzpatrick, A. R., Ercikan, K., & Ferrara, S. (1992, April). An analysis of the technical characteristics of scoring rules for constructed-response items. Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco.

Fitzpatrick, A. R., & Yen, W. M. (1993, April). The psychometric characteristics of choice items. Paper presented at the annual meeting of the National Council on Measurement in Education, Atlanta.

Goldberg, G. L., & Kapinus, B. (in press). Problematic responses to reading performance assessment tasks: Sources and implications. Applied Measurement in Education.

Green, D. R., Fitzpatrick, A. R., Candell, G., & Miller, E. (1992, April). Bias in performance assessment. Paper presented at the annual meeting of the National Council on Measurement in Education, Atlanta.

Langer, J. A. (1990). The process of understanding: Reading for literary and informative purposes. Research in the Teaching of English, 24, 229-257.

Linn, R. L., & Harnisch, D. (1981). Interactions between item content and group membership in achievement test items. Journal of Educational Measurement, 18, 109-118.

Madaus, G. F. (Ed.). (1983). The courts, validity, and minimum competency testing. Boston: Kluwer-Nijhoff.

Maryland State Department of Education. (1989). Maryland Writing Test II: Technical report. Baltimore: Author.

Maryland State Department of Education. (1990). Maryland Writing Test II: Technical report. Baltimore: Author.
Maryland State Department of Education. (1991). Maryland Writing Test II: Technical report. Baltimore: Author.

Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed.). New York: American Council on Education/Macmillan.

Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159-176.

National Council of Teachers of Mathematics. (1989). Curriculum and evaluation standards for school mathematics. Reston, VA: Author.

Yen, W. M. (1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement, 30, 187-213.

Table 1
Numbers of Items Per Form and Their Maximum Possible Scores

                    Range of Number of       Approximate Percent of Items with
                    Items Per Form           Each Maximum Possible Score
                    Low        High             1        2        3
Reading              26         33             20       61       20
Writing                2          2             --       --      100
Language Usage         8          8             --       75       25
Math Content          37         52             60       33        6
Math Process          12         29             61       33        6

Table 2
Average Percent of Maximum Scores

                               Grade 3    Grade 5    Grade 8
CTBS/4 Reading                   .66        .63        .64
CTBS/4 Reading w/o Guess*        .58        .54        .55
MSPAP Reading                    .53        .56        .65
MSPAP Writing                    .38        .36        .56
MFTP Writing (grade 9)          1989: .84   1990: .84   1991: .79
CTBS/4 Language                  .69        .63        .62
CTBS/4 Language w/o Guess*       .60        .53        .51
MSPAP Language                   .40        .42        .55
CTBS/4 Math                      .68        .62        .61
CTBS/4 Math w/o Guess*           .61        .54        .53
MSPAP Math Content               .41        .32        .35
MSPAP Math Process               .24        .29        .28

Note. Based on CTBS/4 Complete Battery A, MSPAP average values taken over forms, and MFTP.
*Estimated percent of maximum if guessing were removed.

Table 3
Range of Correlations Between Student Scores Produced by Two Sets of Raters

                      Low       High
Math Content          .97       .99
Math Process          .90       .95
Reading               .87       .95
Language Usage        .75       .87
Writing               .63       .73

Note. Range taken over grades and forms.

Table 4
Coefficient Alpha Reliability Coefficients for Student Content Area Scores

                               Grade 3    Grade 5    Grade 8
CTBS/4 Reading Total             .94        .95        .95
MSPAP Reading                    .93        .91        .91
MSPAP Writing                    .64        .61        .62
MFTP Writing (grade 9)          1989: .67   1990: .75   1991: .66
CTBS/4 Language Total            .93        .93        .87
MSPAP Language                   .93        .92        .88
CTBS/4 Math Total                .93        .94        .95
MSPAP Math                       .89        .92        .94

Note. Based on CTBS/4 Complete Battery A, MSPAP median values taken over forms, and MFTP values.
Table 5
Standard Errors of Measurement of Scale Scores Averaged Over Forms: Grade 8

Scale Score    Reading    Writing    Language Usage    Mathematics
   350            23         56            53               --
   375            19         41            47               --
   400            16         30            28               --
   450            13         19            17               --
   500            15         19            13               33
   550            19         19            16               30
   600            28         26            28               32
   650            45         50            57               52
   700            72         --            --               41

Table 6
Typical Standard Errors of School Means

                       Scale          Outcomes
                       Scores       Low      High
Grade 3
  Reading                 7          3         6
  Writing                 7          3         3
  Language                7          3         3
  Mathematics             6          2         7
Grade 5
  Reading                 7          3         7
  Writing                 7          3         4
  Language                6          3         3
  Mathematics             6          2         5
Grade 8
  Reading                 5          2         3
  Writing                 5          3         3
  Language                5          2         2
  Mathematics             5          2         4

Table 7
Dependability Indices for School Means: Grade 5

                                           Number of Forms Administered
Score                                        2       3       4       5
Scale Scores
  Reading                                   .94     .95     .95     .95
  Writing                                   .86     .88     .90     .91
  Language Usage                            .88     .90     .91     .91
  Mathematics                               .95     .96     .97     .97
Outcome Scores
  Reading
    Literary Experience                     .93     .94     .94     .94
    Reading to be Informed                  .94     .94     .94     .94
    Reading to Perform a Task               .53     .61     .66     .70
  Writing
    Writing to Persuade/Personal Ideas      .84     .86     .87     .88
  Language Usage
    Language Usage                          .75     .80     .83     .85
  Mathematics Content
    Arithmetic Operations                   .94     .95     .96     .96
    Number Relationships                    .89     .92     .93     .94
    Geometry                                .77     .83     .86     .88
    Measurement with Estimation/
      Verification                          .87     .90     .92     .93
    Statistics                              .90     .92     .93     .94
    Probability                             .85     .89     .91     .92
    Patterns and Relationships              .88     .91     .92     .93
    Algebra                                 .89     .92     .93     .94
  Mathematics Process
    Communication                           .88     .91     .92     .93
    Connections                             .96     .97     .97     .97

Table 8
Grade 5 Correlations

                  MSPAP                CTBS/4              Teacher Rating
                WR   LU   MT        RT   LT   MT          RD   WR   MT
MSPAP
  Read          64   73   70        69   55   62          75   54   62
  Write              75   70        56   73   73          78   56   69
  Lang                    77        60   44   63          52   58   43
  Math                              62   49   51          43   58   46
CTBS/4
  Read                                   82   83          82   62   56
  Lang                                        50          64   64   57
  Math                                                    57   58   54
Teacher
  Read                                                         75   60
  Write                                                             65

Note. Decimal points have been omitted.

Tasks 1 and 2, Days 1 through 3
Theme: Powerful Forces of Nature
Content Areas: Reading, Writing, and Language Usage

Introductory Activity
"A natural disaster is a terrible event caused by nature rather than by humans. A hurricane is one type of natural disaster. Think of other natural disasters. With your partner, see how many natural disasters you can list....Now let's share some of the things on our lists....A tsunami is a very large ocean wave....Today you are going to read an article that helps you understand why tidal waves, or tsunamis, occur. Think for a moment about what you know about tsunamis....make a list of words that reflect what you know....Let's share...."

Reading Material and Assessment Activities: Reading for Information
• Article entitled "Waves" by Herbert R. Zim. Copyright © 1967 by Herbert R. Zim.
• 8 assessment activities focused on the article.

Bridging Activity
"Think about what you learned about tidal waves in the article Waves. Tell your partner one thing you learned....Now think about what people might do and feel if they are told a tidal wave is coming. Tell your partner some of your ideas. Now let's list some of the feelings...."

Reading Material and Assessment Activities: Reading for Literary Experience and for Information
• Story entitled "The Day of the Great Wave" by Marcella Fisher Anderson. Copyright © 1989 by Highlights for Children, Inc., Columbus, OH.
• 8 assessment activities focused on "The Day of the Great Wave."
• 2 assessment activities focused on the article and the story.
Reading Material and Assessment Activities: Reading to Perform a Task
• Article and diagram on artificial respiration excerpted from "New Essential First Aid" by A. Ward Gardner and Peter J. Roylance, illustrated by Robert Demarest. Text copyright © 1967, 1968 by the authors. Illustrations copyright © 1971 by Little, Brown and Company.
• 4 assessment activities focused on the article and diagram.

Introductory Activity
"Weather conditions sometimes cause events which are problems for people....pair with a partner and talk about weather events that cause problems for you or someone you know about....This list we made has examples of powerful forces of nature. Now you will have a chance to express your ideas in writing about powerful forces of nature...."

Writing Assessment Prompt: Writing to Inform
• "During the spring, summer, and even the fall, powerful thunderstorms may occur. People must be prepared ahead of time to know what to do. Your principal wants all students in your school to be prepared for thunderstorms. Write an article for your school newspaper informing students about how to cope with thunderstorms."

FIGURE 1. Illustration of typical MSPAP task structure using a 1991 MSPAP task. Some responses to the reading activities and the writing prompt are scored for language usage.

Figure 2. Percentile as a function of scale score: Grade 5 Reading.

Figure 3. Percentile as a function of scale score: Grade 5 Math.

Figure 4. Percentile as a function of scale score: Grade 8 Writing.

Figure 5. Example of the definition of an outcome score (Expected Percent of Maximum as a function of scale score, showing curve (a) for one form and curve (b) for all forms).