Kelly D. Bradley1 and Shannon O. Sampson2

1 For additional information, contact Kelly D. Bradley, Ph.D. at 131 Taylor Education Building, Lexington, KY, 40506; [email protected]; (859)257-4923
2 Both authors contributed equally and are listed in alphabetical order.

Introduction

The No Child Left Behind Act of 2001 (NCLB) is an education reform designed to improve student achievement and close achievement gaps. With the passage of No Child Left Behind, Congress reauthorized the Elementary and Secondary Education Act of 1965 (ESEA), the principal federal law affecting education from kindergarten through high school. The legislation is built on four pillars: accountability for results, an emphasis on doing what works based on scientific research, expanded parental options, and expanded local control and flexibility.

Our nation relies on data-driven decisions in everything from sports to medicine. Similarly, standards call for the collection and analysis of data to assess the effectiveness of instruction. It is imperative that we have data about what students are learning and that teachers be able to responsibly analyze and make decisions based on those data.

This notebook provides a framework and implementation plan formulated around a “Work Sampling System” (Meisels et al. 1995), to support you in constructing assessments, and then in interpreting, utilizing and reporting the corresponding data. This system involves a continuous assessment approach providing various visions of what students should know and be able to do. After all, the central purpose of classroom assessment is to provide information about what students know and are able to do (and what they are not yet able to do) in order to make decisions about instruction. For example, results from an assessment may lead you to spend more time on a topic based on the lack of understanding demonstrated by your students, to increase the pace of instruction, or to divide the class into groups for more individualized tasks. On a broader scale, you will learn how to use your results to compare your students with similar content classes within the school, and across county, state and even national contexts.

The notebook is organized around the following concepts:

Contextual Factors – how to use information about the learning-teaching context and student individual differences to set learning goals and plan instruction and assessment.
Learning Goals – how to set significant, varied and appropriate learning goals.
Assessment Plan – how to use multiple assessment modes and approaches aligned with learning goals to assess student learning before, during, and after instruction.
Design for Instruction – how to design instruction for specific learning goals, student characteristics and needs, and learning contexts.
Instructional Decision-Making – how to conduct ongoing analysis of student learning to make instructional decisions.
Analysis of Student Learning – how to use assessment data to profile student learning and communicate information about student progress and achievement.
Evaluation and Reflection – how to reflect on your instruction and student learning in order to improve teaching practice.

The basic figurative model of the notebook is presented below; it links content knowledge, pedagogy and dispositions, learning goals, and student needs to the assessment package. Figure 1.
Conceptual Model of Assessment Training Package Contextual Factors Contextual Factors – having teachers use information about the learning-teaching context and student individual differences to set learning goals and plan instruction and assessment. Chapter 1 of The Learning Record (see http://www.fairtest.org/LearningRecord/LR%20Math%20%20Recording%20Form.pdf) notes contextual factors that are important to consider in planning mathematics instruction: *** Working Draft *** Do not reproduce or cite without permission of the authors. 2 • Confidence and Independence. How willing are students to risk error? Are they able to volunteer information and possible solutions to problems? Will they initiate topics for discussion and study? To what extent will they persevere in the face of complexity? • Experience. How well do students use their prior knowledge to make sense of current tasks? What background do they have in mathematics? How well do they apply their textbook knowledge to authentic purposes? • Skills and Strategies. Do students use the skills and strategies of the subject to solve problems? Do they demonstrate they can use mathematics to solve a variety of problems across different mathematics strands? • Knowledge and Understanding. How well can students demonstrate what they know and understand? What evidence suggests they are adding to their personal knowledge and understanding? To what extent do they make connections among mathematical ideas and across other content areas? • Ability to Reflect. Can students provide criteria for assessing their own work? How well can they judge the quality of their own work? The Center for Language in Learning recommends collecting this information through brief interviews with each student and his/her parents/caregivers in the first quarter or term. Information from these interviews contributes “baseline data” from which to measure accomplishment as students progress during the year. Students and their parents can begin to set criteria for measuring success in their own terms and can compare year-to-year achievements. Learning Goals Learning Goals – how to set significant, varied and appropriate learning goals. The place to begin this process is by looking at the Kentucky Academic Expectations, Program of Studies, and Core Content for Assessment, as well as how Kentucky describes different levels of performance. You can locate the Academic Expectations, Program of Studies, and Core Content for Assessment for your subject area through a search for the “Combined Curriculum Document” at the Kentucky Department of Education (www.kentuckyschools.net). A search for “Performance Standards” lists standards for various subject areas. The performance standards chart may be useful in the initial conference with students, to have them talk about where they feel they stand and to begin to set goals. You may want to list the standards which students will be expected to reach to be considered proficient in your course. Throughout the semester the students would work in conjunction with you to collect evidence that they are progressing toward, and eventually reaching, proficiency in each area. To do this, you might use a checklist as recommended by Meisels (1997), in which you make page for each student with a list of the given performance indicators. Next to each indicator, you would have a fall, winter and spring column and three check-boxes within each column. 
One box is for “not yet proficient,” another for “in process” and a third for “proficient.”

Kentucky describes performance levels for high school (Grade 11) mathematics along five dimensions: Skills, Concepts and Relationships; Mathematical Strategies; Understanding; Terminology and Representations; and Reasoning. The descriptors for each performance level are summarized below (source: http://www.education.ky.gov/NR/rdonlyres/en6buqjvte3hgf5ddhi6cggzpskobj3mxribbcgwayd4gtn3lztb2clkepwzekerbhopyxc4hwiez4epnbnn63njunh/SPLDMathematics.pdf).

HIGH SCHOOL (GRADE 11) MATHEMATICS

Skills, Concepts and Relationships
Distinguished: Student demonstrates an extensive understanding of concepts, skills, and relationships in number/computation, geometry/measurement, probability/statistics, and algebraic ideas as defined by Kentucky’s Core Content for high school students.
Proficient: Student demonstrates an understanding of concepts, skills, and relationships in number/computation, geometry/measurement, probability/statistics, and algebraic ideas as defined by Kentucky’s Core Content for high school students most of the time.
Apprentice: Student demonstrates understanding of concepts, skills, and relationships in number/computation, geometry/measurement, probability/statistics, and algebraic ideas as defined by Kentucky’s Core Content for high school students some of the time.
Novice: Student rarely demonstrates understanding of concepts, skills, and relationships in number/computation, geometry/measurement, probability/statistics, and algebraic ideas as defined by Kentucky’s Core Content for high school students.

Mathematical Strategies
Distinguished: Student demonstrates consistent, effective application of the problem-solving process. Student consistently shows evidence of a well-developed plan for solving problems, using appropriate procedures, sequence of steps, and relationships between the steps.
Proficient: Student demonstrates effective application of the problem-solving process by showing evidence of a well-developed plan for solving problems, using appropriate procedures, sequence of steps, and relationships between the steps most of the time.
Apprentice: Student demonstrates correct application of the problem-solving process by implementing appropriate strategies for solving problems some of the time.
Novice: Student rarely demonstrates appropriate problem-solving skills and/or rarely applies the problem-solving process correctly.

Understanding
Distinguished: Student demonstrates an extensive understanding of problems and procedures by arriving at complete and correct solutions. (Student rarely has minor computational errors that do not interfere with conceptual understanding.)
Proficient: Student demonstrates a general understanding of problems and procedures by arriving at correct and complete solutions most of the time. (Student may have some minor computational errors that do not interfere with conceptual understanding.)
Apprentice: Student demonstrates some understanding of problems and procedures by arriving at correct and complete solutions some of the time.
Novice: Student rarely demonstrates understanding of problems and procedures by arriving at solutions that may be incorrect or incomplete.

Terminology and Representations
Distinguished: Student consistently and effectively uses appropriate and accurate mathematical representations/models (symbols, graphs, tables, diagrams, models) and correct mathematical terminology to communicate in a clear and concise manner.
Proficient: Student uses appropriate and accurate mathematical representations/models (symbols, graphs, tables, diagrams, models) and correct terminology to effectively communicate a sequential development of the solution most of the time.
Apprentice: Student uses appropriate and accurate mathematical representations/models (symbols, graphs, tables, diagrams, models) and correct mathematical terminology appropriate for high school students some of the time.
Novice: Student rarely uses appropriate mathematical representations/models (symbols, graphs, tables, diagrams, models) and correct mathematical terminology appropriate for high school students.

Reasoning
Distinguished: Student consistently and effectively demonstrates appropriate use of mathematical reasoning to solve problems (e.g., make and investigate mathematical conjectures, make generalizations, make predictions, and/or defend solutions).
Proficient: Student demonstrates appropriate use of mathematical reasoning to solve problems (e.g., make and investigate mathematical conjectures, make generalizations, make predictions and/or defend solutions) most of the time.
Apprentice: Student demonstrates appropriate use of mathematical reasoning (e.g., make and investigate mathematical conjectures, make generalizations, make predictions, and/or defend solutions) some of the time.
Novice: Student rarely demonstrates appropriate use of mathematical reasoning.

Assessment Plan
Assessment Plan – how to use multiple assessment modes and approaches aligned with learning goals to assess student learning before, during, and after instruction.

Meisels (1996) writes, “For too long assessment and instruction have been adversaries. Teachers say that they cannot teach as they wish because they spend time preparing their students and modifying their curriculums to conform to items that will appear on mandated achievement tests. Policymakers say that they need objective information to show what students are learning and what teachers are teaching, even if the indicators provided to teachers are inconsistent with educational practice and are seriously flawed. With authentic performance assessments…, these conflicts can be resolved. In this approach, educators design instructional objectives for teaching and learning, as well as for evaluation.”

What counts as assessment? An assessment can be anything that provides evidence of a student’s level of understanding of a concept. In fact, a useful starting point in developing an assessment plan is to ask yourself what kinds of instructional opportunities could serve as evidence of progression toward the learning goals. Assessments can include many tasks: selected response items, open response items, performance events, and even teacher observations.

Gronlund (2006, p. 22) offers eight guidelines for effective student assessment:
1. It specifies the learning outcomes and assesses even the complex levels of understanding
2. A variety of assessment procedures are used
3. It is instructionally relevant, supporting the outcomes of instruction and improving student learning
4. It creates an adequate sample of student performance
5. It is fair to all students, eliminating irrelevant sources of difficulty
6. It specifies the criteria for judging the performance, such as with a rubric
7.
It provides meaningful feedback to the students so that they can adjust learning strategies 8. It must be supported by a comprehensive grading and reporting system Gronlund recommends using assessments to inform their decisions before beginning instruction, during instruction and at the end of instruction. Before Instruction Assessments administered prior to instruction can be designed to provide information about whether students have mastered prerequisite skills necessary for moving to the planned instruction, to indicate the level of understanding students have on a subject prior to teaching it, and can serve as a baseline for measuring student growth. Students lacking particular prerequisite skills can be provided supplemental *** Working Draft *** Do not reproduce or cite without permission of the authors. 5 instruction. Depending on students’ level of understanding, you might choose to spend more or less time on given concepts, and differentiate instruction for students who have varied levels of understanding. During Instruction Assessments should be used throughout instruction to monitor student progress toward achieving the intended learning outcomes. These assessments, often called formative assessments, should be used to improve student learning rather than to assign grades. From the results of these assessments, teachers can diagnose where students have difficulties in understanding and can address these for individuals and/or the class. After Instruction Assessment after instruction is often called summative assessment and the results are typically used for grading. This may be an end-of-the-chapter test or performance assessment. In keeping with the Learning Record, this would be the opportunity for students and teachers to select work which demonstrates they have reached proficiency on the subject at hand. Types of Assessment Classroom achievement tests A traditional source of evidence is a paper and pencil test administered at the end of a unit, oftentimes a multiple choice, short answer or essay format. This sort of test can provide good evidence; however, its capacity is often limited by the manner in which teachers score it. If you use tests such this as evidence, see Using Tests to Create Measures in the Analysis of Student Learning section. Gronlund (2006) describes specific guidelines for writing items for these tests. For multiple choice items, he offers the following recommendations: It should be appropriate for measuring the learning outcome Item tasks should match the learning tasks to be measured The item stem should: o present a single problem o be written with clear and simple language o be worded so there is no repetition of material in the answer choices o be stated in positive form wherever possible o emphasize any negative wording with bold, underline or caps o be grammatically consistent with answer choices The answer choices should: o be in parallel form o free from verbal clues to the answer o have distracters that are plausible and attractive to the uninformed avoid use of “all of the above” and “none of the above” The position of the correct answer should be varied from item to item For true-false items, he suggests: Using only one idea for each statement Keeping the items short and grammatically simple Avoiding qualifiers (i.e., may, possible) and vague terms (i.e., seldom) *** Working Draft *** Do not reproduce or cite without permission of the authors. 
6 Using negative statements sparingly and avoid double negatives Attributing statement of opinion to a source For matching items Gronlund recommends using a matching format only when the same alternatives in multiple choice items are repeated often. If using this, Use homogeneous material so that all responses serve as plausible alternatives Keep the list of items to less than 10 and place answer alternatives to the right Use a different number of responses than items and permit responses to be used more than once (indicate this in the directions) Place the responses in alphabetical order For short answer items State the item so that only a brief answer is required Place the blanks at the end of the statement Incorporate only one blank per item Use uniform length for blanks on all items Finally, essay items should be used for measuring complex learning outcomes and not knowledge recall. To aid in clarifying to the students what outcomes will be measured, you might include the criteria to be used in grading the answers. As described by Gronlund, “Your answer will be evaluated in terms of its comprehensiveness, the relevance of its arguments, the appropriateness of its examples, and the skill with which it is organized.” Also, Avoid starting the question with words that request recall of information (i.e., who, what, where, when, name, list) Use words such as “why”, “describe”, “explain”, “criticize” Write a model answer Avoid permitting a choice of questions In assigning scores to essay answers: o Evaluate answers in terms of the learning outcomes being measured o Score using a point method, in which various points are assigned to various facets of the learning outcome OR o Score using a rubric, using defined criteria as a guide. Evaluate all students’ answers to one question before moving to the next Evaluate answers without knowing the identity of the writer Performance or product assessment Mueller (see http://jonathan.mueller.faculty.noctrl.edu/toolbox/howdoyoudoit.htm), outlines the steps to developing assessments outside of the traditional paper and pencil test. For a particular standard or set of standards, he recommends developing a task your students could perform that would indicate (or provide evidence that) they have met these standards. Identify the characteristics of good performance on that task, the criteria that, if present in your students’ work, will indicate that they have met the standards. For each criterion, identify two or more levels of performance along which students can perform which will sufficiently discriminate among student performance for that criterion. The combination of the criteria and the levels of performance for each criterion will be your rubric for that task (assessment). The rubric will indicate how well most students should perform. The minimum level *** Working Draft *** Do not reproduce or cite without permission of the authors. 7 at which you would want students to perform is your cut score or benchmark; it should indicate proficiency in the content area. Providing students with the rubric early on will give students feedback on what they need to improve upon and allow you to adjust instruction to ensure that each student reaches proficiency. With careful planning, many instructional activities can serve as assessments. 
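As a rough illustration of Mueller's steps, the sketch below represents a rubric as a set of criteria, each with ordered performance levels, and checks a student's ratings against a benchmark level. The criterion names, level labels, and benchmark here are invented for illustration only; they are not drawn from Mueller or from Kentucky's rubrics.

```python
# Hypothetical sketch of a rubric: criteria, ordered performance levels for each
# criterion, and a benchmark (cut) level. All names and labels are invented.

RUBRIC = {
    "mathematical reasoning": ["not evident", "emerging", "proficient", "distinguished"],
    "use of representations": ["not evident", "emerging", "proficient", "distinguished"],
    "communication of solution": ["not evident", "emerging", "proficient", "distinguished"],
}
BENCHMARK_LEVEL = "proficient"  # minimum acceptable level on every criterion

def meets_benchmark(ratings: dict) -> bool:
    """Return True if the student's rating on each criterion is at or above the benchmark."""
    for criterion, levels in RUBRIC.items():
        if levels.index(ratings[criterion]) < levels.index(BENCHMARK_LEVEL):
            return False
    return True

# Example: one student's ratings on a performance task
student_ratings = {
    "mathematical reasoning": "proficient",
    "use of representations": "distinguished",
    "communication of solution": "emerging",
}
print(meets_benchmark(student_ratings))  # False: communication falls below the benchmark
```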
For examples of authentic assessments, see: http://jonathan.mueller.faculty.noctrl.edu/toolbox/examples/authentictaskexamples.htm For information about authentic assessment and rubrics, see: http://www.education.ky.gov/KDE/Instructional+Resources/Elementary+School/Primary+Program/Instr uctional+Resources/Instuctional+Strategy+Links.htm Teacher Observations The Learning Record suggests teachers use classroom observations as a way to collect evidence of student learning. To facilitate this and make the observations more intentional, you might print labels with student names and prepare a blank form for each student in each class. You can record observations about anything significant about a student’s learning. As noted in the Learning Record, having a class list helps the teacher record observations about all students in class. It is not necessary to note observations for every student prior to starting a new page of labels, but using this method helps alert the teacher to which students he/she needs to watch to record more data. These observations can be used as evidence in describing students’ level of understanding. Design for Instruction Design for Instruction – how to design instruction for specific learning goals, students’ characteristics and needs, and learning contexts. Gronlund (2006) recommends that plans for assessment be made during plans for instruction. He explains the relation between instruction and assessment, including that effective instruction provides remediation for students not achieving the intended learning and that assessment reveal specific learning gaps; instructional decisions are based on information that is “meaningful, dependable, and relevant” and effective assessments provide information that is meaningful, dependable and relevant; and the methods and materials of both instruction and assessments are congruent with the outcomes to be achieved (p. 4). Search for “Standards-Based Units of Study” at the KDE website (www.kentuckyschools.net). *** Working Draft *** Do not reproduce or cite without permission of the authors. 8 Instructional Decision-Making Instructional Decision-Making – how to use on-going analysis of student learning to make instructional decisions. Once you have identified the standards and know what proficiency looks like, our task is to gather evidence that students are approaching proficiency, and to be able identify what support they need to become proficient. If you are using tests to create measures (as explained in the Analysis of Student Learning section and taught in the appendix,) Winsteps software produces the following helpful table. This table lists all of the items from an assessment along the right; MC and SA indicate if the items were multiple choice or short answer, and the numbers after each SA indicate the number of points assigned to each item. The most difficult items are listed at the top, and the least difficult are listed at the bottom. The numbers across the top horizontal line indicate a difficulty/ability score. The scores are similar to z-scores in that the zero point is the mean difficulty measure and 1, 2 and 3 are respective standard deviations away from the mean. The vertical line down the center of the chart indicates the proficiency line (see How to determine the proficiency cut point on page 26 for an explanation on why a line was placed at that point.) 
The vertical line of numbers to the left of the line is placed at the ability point for the student and indicates the expected score on each item for a student at that ability. Numbers to the left and right of that line indicate the student’s actual score on each item; they are placed such at the difficulty point of the item. If the student were to consistently answer as he answered an item off the line, he/she would be placed at the higher or lower overall measure. Actual scores listed to the left or right of the score line with a period on each side is within an acceptable deviation range. Those with parentheses are considered to be outside of that range. To diagnose where a student would need to focus to reach proficiency, the teacher and student would look at items corresponding to numbers to the left of the vertical line that indicates the proficiency point. -3 -2 -1 0 1 2 3 4 |-------+-------+-------+-------+-------+-------+-------| 0 (2) 1 .2. .4. .5. 6 .5. .3. 1 .2. 9 .10. 18 .20. .0. 1 7 .8. (2) 7 .2. .3. .2. .2. .2. .2. |-------+-------+-------+-------+-------+-------+-------| -3 -2 -1 0 1 2 3 4 NUM 3 9 19 12 18 17 8 15 21 10 11 13 14 20 4 6 7 16 NUM ITEM 3null & alternative hypothesesMC 9margin of error MC 19test stat & p-value calc SA7 12cond for significance test SA8 18conditions for inference SA8 17state hypotheses SA4 8p-value calc MC 15construct, interpret CI SA10 21construct and interp90% CI SA20 10margin of error MC 11population and parameter ID SA8 13test stat and p-value calc SA8 14conclusion SA3 20conclusion SA3 4conditions for inferenceMC 6standard error calc MC 7conditions for inferenceMC 16proprtions calc SA2 ITEM *** Working Draft *** Do not reproduce or cite without permission of the authors. 9 This student has a measure slightly below proficiency, but many of the items fall in the proficient range (3, 9, 8, 15, 21, and 11.) While the student would need to focus on most items, it would be especially important to review items 12, 10, and 13 because these were the items that fell below the student’s own ability measure on the test. Analysis of Student Learning Analysis of Student Learning – How to use assessment data to profile student learning and communicate information about student progress and achievement. The Learning Record recommends that you and the students collect evidence throughout the year to evaluate what the student is learning. It is recommended that this take place at least three times throughout the year. Presenting Student Data Let’s say you have finished teaching a lesson on probability and have given a 10-question quiz over the material. Each question is worth one point—it’s either right or it’s wrong. Adding up the number of correct answers gives us the raw score, which is typically the first calculation used in determining student performance. We might record that number in our grade book, or go a step further and divide by 10 to come up with a percentage. A student who gets 9 out of 10 would receive a 90% on the quiz, and both the student and the teacher would probably feel like the material was well taught and learned. Looking at the raw score alone provides very little information. There are a few ways to use item results to communicate more information, using concepts from both statistics and measurement. One way to get a more informative picture of the data is to graph the individual items and compare student performance on each. 
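One simple way to build such a graph, assuming each student's score on each item is available in a table or list, is sketched below in Python. This is only one possible approach; the Graphing Multiple Choice Items appendix may describe a different tool, and the scores shown here are invented.

```python
# Minimal sketch: sum each multiple choice item's scores across students
# and plot the totals as a bar chart. The score matrix below is invented.
import matplotlib.pyplot as plt

# rows = students, columns = 10 multiple choice items scored 0 or 2 points
scores = [
    [2, 2, 0, 2, 2, 2, 0, 2, 2, 0],
    [2, 2, 0, 2, 2, 0, 2, 0, 2, 2],
    [2, 2, 2, 2, 2, 2, 2, 2, 0, 0],
    [2, 2, 0, 2, 2, 2, 2, 2, 2, 2],
]

item_totals = [sum(student[i] for student in scores) for i in range(10)]

plt.bar(range(1, 11), item_totals)
plt.xticks(range(1, 11))
plt.xlabel("Multiple Choice Item")
plt.ylabel("Sum of student scores")
plt.show()
```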
The histogram below displays the multiple choice items from a sample statistics test that a teacher administered. The items were worth 2-points each, and student scores for each item were added and the total displayed as a vertical bar (See Graphing Multiple Choice Items for an explanation of how to create this graph.) *** Working Draft *** Do not reproduce or cite without permission of the authors. 10 Sum of student scores 50 40 30 20 10 1 10 2 3 4 5 6 7 8 9 Multiple Choice Item This visually displays that item 10, 8, 9 and 10 were difficult for the overall group, while item 3 was extremely difficult. This teacher would probably want to review the concepts behind items 8, 9 and 10. She would also want to look at item 3 to make sure it was written clearly and keyed correctly, and then either modify the item or re-teach the concept. This assessment also contained short answer items, worth 2 to 21 points each. In order to facilitate comparisons of item difficulty and performance, the student scores on the individual items were converted to an equal scale by calculating percentages for each answer, and graphed as a series of boxplots (See Creating Boxplots below for an explanation of how to create these.) A boxplot consists of a box, vertical lines extending from the top and bottom of the box (often called “whiskers”), and asterisks often appearing beyond the whiskers. Horizontal lines inside the boxes represent the median score for the item, or the score in that falls at the midpoint of the distribution of scores from high to low. The median scores are connected by a line in the graph below to facilitate comparison of the median point across items. The box encompasses the scores of the middle 50% of the scores, and the length of the box (called the interquartile range) is determined by the distribution of the scores. A short box indicates that the scores are similar to each other and a long box indicates that there is more variation in the scores. The whiskers encompass scores that fall in the region extending 1.5 times the interquartile range (above and below the box.) Asterisks above the box represent individual scores greater than 1.5 times the interquartile range above the end of the box, and asterisks below the box represent individual scores less than 1.5 times the interquartile range below the box. These asterisks are considered outliers. You will want to pay special attention to the outliers below the box; these are scores which are well below the performance of the rest of the class. *** Working Draft *** Do not reproduce or cite without permission of the authors. 11 percent 100 50 0 11 12 13 14 15 16 17 18 19 20 21 item The chart above displays the distribution of scores for the short answer items on a statistics test. Looking at item 11, you can see that most students did very well. Because the box is relatively short, we know that most students (50%) received a similar score on that item, and because the box is at the high end of the y-axis, we can conclude that most students did well on that item. One outlier rests well below the box and whisker. Using the “brush” function, we see that this actually represents two students (entries 22 and 35.) It would be important to address this concept with these students because they performed below the majority of the other students. Glancing at the graphs reveals that concepts that should be re-addressed for many students include 12, 17, 18 and 19. 
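If the item scores are available in a list or a spreadsheet export, a comparable set of boxplots can be produced with a short script such as the sketch below. This is one possible approach, not the procedure in the Creating Boxplots appendix, and the scores and point values here are invented.

```python
# Minimal sketch: convert each short answer item's raw scores to percentages
# and display them as side-by-side boxplots. All scores below are invented.
import matplotlib.pyplot as plt

# raw scores for three short answer items, keyed by item number, with maximum point values
raw_scores = {11: [8, 8, 7, 6, 8, 2], 12: [5, 4, 6, 3, 7, 5], 13: [8, 6, 7, 8, 5, 2]}
max_points = {11: 8, 12: 8, 13: 8}

items = sorted(raw_scores)
percents = [[100 * score / max_points[item] for score in raw_scores[item]] for item in items]

plt.boxplot(percents)
plt.xticks(range(1, len(items) + 1), [str(item) for item in items])
plt.xlabel("item")
plt.ylabel("percent")
plt.show()
```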
Many students need more work with item 14, but the position of the box reveals that many students already have a solid grasp of this concept. You would also identify the students represented by asterisks on items 11, 13, 16, 17, 19, and 21, to determine why these students scored especially low on these items. Interpreting (and helping parents and students interpret) large scale assessment data Consider the following questions: Q: If a student receives a standard score (SS) of 85, would that be a good score? Q: Is a score of 50% on an achievement assessment considered failing? In order to answer these questions, it is important to understand the concept behind the normal curve. Throughout life and nature, events tend to follow a similar pattern of distribution. Let’s say you go to a Kentucky basketball game and collect each person’s age, height, weight, cholesterol level, distance he/she drove to Rupp Arena, and number of UK games he/she has attended. If you were to graph the frequency of each of these variables, each would approximate what is called a normal distribution. In other words, the most frequently encountered observations would appear around the middle (at the mean) and the less frequently encountered observations would appear on either side of the mean. The distribution would be bell-shaped, as in the figure below. *** Working Draft *** Do not reproduce or cite without permission of the authors. 12 Achievement test scores typically have the same distribution, and interpretation of these scores is dependent on an understanding of what these curves are. If you draw a line down the center of the normal curve (at the mean), you will have a mirror image on either side of that line. Half of the observations will fall below the mean, and half will fall above the mean. The normal curve is divided by standard deviations. In all tests, the mean is at 0 (zero) standard deviations from the mean. The next marker on the bell curve is +1 and -1 standard deviations from the mean, followed by 2 standard deviations from the mean. To interpret standardized test scores, you will need to know the test instrument's mean score and standard deviation score. Standardized test scores are typically reported as standard scores, percentiles, stanines, z-scores and T-scores. These scores are explained below. A Standard Score (SS) compares the student's performance with that of other children at the same age or grade level. In standard scores, the average score or mean is 100, and the standard deviation of 15. Thus, a 100 is considered the average score, and it is, by definition, at the 50 percent level. A 115 is 1 standard deviation above the mean, at the 84 percent level; a 130 is 2 standard deviations above the mean, at the 98 percent level. An 85 is 1 standard deviation below the mean, at the 16 percent level; a 70 is 2 standard deviations below the mean, at the 2 percent level. (This explanation is based on information from http://www.nldontheweb.org/wright.htm) *** Working Draft *** Do not reproduce or cite without permission of the authors. 13 Returning to the questions posed at the beginning of this section, Q: If a student receives a standard score (SS) of 85, would that be a good score? Actually, a standard score of 85 is in the 16% and that score is below average. Average is between SS 90-110 (sometimes 85-115 is considered average), so any SS below 90 would be reason for concern. Q: Is a score of 50% on an achievement assessment considered failing? No. 
A score of 50% puts the child right in the middle of the average group of students. The Percentile (%) Score indicates the student's performance on given test relative to the other children the same age on whom the test was normed. A score of 50% or higher is considered above the average. The Stanine Score like the Standard score, reflects the student's performance compared with that of students in the age range on which the given test was normed. For reference, a stanine of 7 is above average, a stanine of 5 is average and a stanine of 3 is below average. Z scores are simply standard deviation scores of one with a mean of zero (Mean = 0, SD = 1, instead of a mean of 100 and SD of 15 as we found with standard scores). If a student earned a z-score of 2, you would know the student’s score is two standard deviations above the mean, with a percentile rank of 98. The standard score equivalent would be 130 (mean of 100, standard deviation of 15.) Another test format uses T Scores, which have a mean of 50 and a standard deviation of 10. A T score of 40 would be one standard deviation below the mean. It would be equivalent to a Z score of -1, a percentile rank of 16, and a standard score of 84. There is a chart available from http://concordspedpac.org/Scores-Chart.html, which presents the same information in another format, which you may find helpful. Using Tests to Create Measures Wright and Stone (1979), explain why this technique for scoring is problematic. The 9 out of 10, or 90%, actually provides little meaningful information. This is illustrated through the comparison of three students according to their raw scores. Below, a 1 indicates a correct answer, a 0 indicates an incorrect answer, and an m indicates a missing value: Student A: 1 1 1 m m m m 1 1 0 = 5 Student B: 0 0 0 0 0 1 1 1 1 1 = 5 Student C: 1 1 1 1 0 0 0 0 0 1 = 5 *** Working Draft *** Do not reproduce or cite without permission of the authors. 14 Looking at the raw score alone would make it appear these three students have an equivalent understanding of the material. A closer look at the items reveals a very different situation. If the items are ordered from easiest to hardest, a look at item performance indicates Student A did not answer every question; perhaps she missed a page of the test or ran out of time. Student B could have been careless on the easiest items and diligent about the harder ones, he could have had a special skill set that addressed the specific content of the hardest items, or he could have missed the presentation of the material of the easiest items. Student C’s response is what would be expected: the easy items were answered correctly, and the difficult ones were answered incorrectly, with the exception of the most difficult item. Looking at the most difficult item, Student C’s correct answer might be the result of a lucky guess. In the case of Student A, the incorrect answer could be viewed as a careless mistake, or perhaps the student missed it because it was slightly more difficult than his/her ability level. Without more difficult items, it is impossible to draw a conclusion. A different set of students highlights another problem with the use of raw scores: Student D: m 1 m m m 1 m m 1 1 = 4 Student E: 0 1 0 1 0 1 0 1 1 0 = 5 Student F: 1 1 1 1 1 1 0 0 0 0 = 6 Based on raw scores alone, it would appear Student D is less knowledgeable than Students E and F, and Student F appears to be the most knowledgeable. 
Furthermore, these students are within a point or two of each other, but a point earned on the easy end should probably be less “valuable” than a point earned on the more difficult end. And what if a student answered 0 of 10 questions correctly; would that indicate he has no ability at all related to the unit? Or, if a student were to get 10 out of 10, would that indicate she has completely mastered the subject and would be able to correctly answer all other questions about it? Although they are frequently interpreted as having direct meaning, raw scores are really just ordinal data with unequal units. They are very limited in the information they convey.

The Rasch model addresses these concerns by converting raw scores into measures. Each person has a certain probability of answering each item correctly, and each item has a certain probability of being answered correctly. That probability, Pnix, is defined such that ln[Pnix / (1 − Pnix)] = Bn − Di, where Pnix is the probability of a successful response Xni being produced by person n to item i, Bn is the ability of person n, and Di is the difficulty of item i. This equation produces equal units, measures, which are, in turn, additive. It is logical that a student with a good understanding of a concept should have a higher probability of getting any item correct than a student with a poor understanding of that concept, regardless of the item attempted. Furthermore, a more difficult item should always have a lower probability of success than a less difficult item, regardless of the person attempting the item. Students B and E above illustrate that students do not always meet this expectation.

When you use raw scores to come up with a student’s grade, through a percentage, for example, you are using classical test theory (CTT). As illustrated above, this method of analysis is limited: raw scores alone are blind to unpredictable responses, they provide no information about the ability of persons who have maximum and minimum scores, and they fall along an irregular interval. To address these limitations, Rasch analysis was developed by Georg Rasch (1960). The Rasch model produces a difficulty measure for each item on an assessment and an ability measure for each person taking the assessment. It generates a type of “ruler” which measures item difficulty and person ability on the same scale. The items should span a wide range of the difficulty continuum and be spaced fairly evenly across it. A wider spread of items allows for the measurement of a larger range of person abilities, and closer spacing between item difficulties allows for a more precise measurement of person abilities. When Rasch analysis places items and persons along a “ruler,” one can see where the persons fall (based on their ability) in comparison to where the items fall (based on their difficulty). Just as a 12-inch measuring tape would be of little use in measuring the height of a 64-inch person, when a number of persons have a higher ability than the most difficult item on an assessment, the instrument would be considered too narrow to measure their ability.
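To make the equation above concrete, the short sketch below (a generic illustration, not Winsteps code or output) computes the Rasch model's predicted probability of success from a person ability Bn and an item difficulty Di, both expressed in logits; the ability and difficulty values used are invented.

```python
# Generic illustration of the dichotomous Rasch model described above:
# ln[P / (1 - P)] = B - D, so P = exp(B - D) / (1 + exp(B - D)).
import math

def rasch_probability(ability_b: float, difficulty_d: float) -> float:
    """Probability that a person with ability B answers an item of difficulty D correctly."""
    return math.exp(ability_b - difficulty_d) / (1 + math.exp(ability_b - difficulty_d))

# A person whose ability equals the item's difficulty has a 50% chance of success...
print(round(rasch_probability(0.5, 0.5), 2))   # 0.5
# ...a higher chance on an easier item, and a lower chance on a harder item.
print(round(rasch_probability(0.5, -1.0), 2))  # about 0.82
print(round(rasch_probability(0.5, 2.0), 2))   # about 0.18
```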
Similarly, just as a ruler that is marked with meters would be of little use in measuring the length of a small insect, when a number of persons have an ability measure that does not correspond to any item’s difficulty measure, the instrument would be considered limited in its ability to accurately measure those persons. A well designed assessment has a distribution of items that is equivalent to the distribution of persons (Bradley and Sampson, 2006). One feature of the Rasch model is that a more difficult assessment does not mean students receive lower scores. Similarly, a test on the same information but with many easy items does not mean students will receive higher scores. As Gronlund (2006) notes, assessment is a matter of sampling. Instruction is typically organized around concepts or domains, and a test cannot possibly include the endless number of questions that could be written about the topic. It is most fair that the estimation of student ability be independent of the items included on the assessment of that ability. Rasch analysis produces ability estimations which are independent of the difficulty level of the items selected for the test. You can conduct a Rasch analysis with various programs. Here we will work with Ministep, the student version of Winsteps software. Ministep uses Rasch analysis to produce a multitude of tables. You will only need to look at a few to get a feel for student level of proficiency and to access the helpful diagnostic information. To use Winsteps to create a Rasch analysis, you will need to download a free copy of Ministep (the student version of Winsteps) at www.winsteps.com/ministep.htm. When you open Ministep, it will ask you to enter the control file name. Before beginning you will need to create a control file for the assessment you have administered to your students. Specific examples of control files are listed in Appendix A. Simply copy a control file which is similar to the test you have administered, and replace the sample data with your own data. In creating the control file, it is very important that you test only one concept at a time, or at least include only the items which deal with one concept as you input them into the control file. Winsteps (and Ministep) will create a sort of ruler to measure student ability in a domain, and just as you cannot height and weight with the same instruments, you should only one concept in each analysis. *** Working Draft *** Do not reproduce or cite without permission of the authors. 16 Winsteps example:4 A typical chapter test has various sections with various item types and score values. For example, Becky has created a test for chapter 10 of her text, on probability. The first ten questions are multiple choice questions which students either answer correctly or incorrectly. She awards two points for each correct answer and zero points for each incorrect answer. The last eleven questions are short answer items of varying worth, from two to twenty points. Students receive partial credit for their work, so very few students receive a zero on any item in this section. Students lose points for a number of errors, such as an incorrect wrong answer, an incorrect application of a formula used to solving the problem, an incomplete conclusion, a weak justification for their answer, and careless mistakes. 
On a test like this, a teacher typically sums the number of missed points, subtracts that from the total number of points possible, and divides the resulting amount by the total number of points possible to come up with a percentage to record as the test score. What does this percentage communicate? A 100% is the easiest to interpret, because it indicates that the student answered each item correctly, with accurate calculations, complete and insightful conclusions and justifications and free of careless errors. But would this student receive a 100% on any other teacher’s test on this same material? Does a 100% indicate complete mastery of the construct of probability, such that if this student approached any question involving probability, this student would successfully answer it? Scores below 100% communicate even less. John and Henry can both receive a 90%, even though John’s missed points were due to a lack of understanding on a couple of items, whereas Henry had a very good understanding of the material, but made careless mistakes throughout the test. Below are specific tables which present the Rasch output for this same test. Incidentally, student scores are reported as logits, which are log odds units (the log of the odds that a student will answer items correctly and that items will be answered correctly.) A score of zero is the mean score, and scores range from -3 to 3. Instead of thinking in terms of percentages for determining grades, this analysis allows you to determine a proficiency point and to think in terms of distance from that point. Table 1.0 (Output Tables Variable Maps) This table displays the “ruler”, placing students along an ability continuum and items along a difficulty continuum. One of the features of Rasch analysis is that item difficulty is not dependent upon the students who take the test, and (more usefully in this case), student ability is not dependent upon the items included in the test. You will notice that there is really no such thing as a 100%; as students are simply placed higher or lower on the continuum of ability for the larger construct. You determine the point that is sufficient proficiency. The items appear to the right of the vertical line; the more difficult are at the top and the easier items are at the bottom. The students appear to the left; those who demonstrated more understanding of the concept fall at the top, and those who demonstrated less understanding are at the bottom. 4 For more information on running analysis in Winsteps, contact authors for the code used in examples in this handout and see www.winsteps.com for user’s manual. *** Working Draft *** Do not reproduce or cite without permission of the authors. 17 TABLE 1.0 Multiple Response Formats, Different Re ZOU160WS.TXT ---------------------------------------------------------------------------- 2 1 0 0Xiang -1 -2 PERSONS -MAP- ITEMS <more>|<rare> + | 2Liam | T| 2Quentin | | 2Kelly | | 2Jackie | 3 MC 2Patricia | S|T 2Uma | 2Isabella + 2Faith | 2Geraldine | 1Taariq | | 1David 1Oliver M|S 19 SA | 12 SA 1Adam 1Steven | 1Elizabeth 1Matthew | 1Bryan 1Hugh | 18 SA | 17 SA 0Robert | 8 MC 0Clay S+M 10 MC 0Yoshi 1Victor | 11 SA 0Nathan 0Whitney | 13 SA | | | | T|S | 20 SA | | 4 MC | + | |T | | | | | | | | | + 1 MC <less>|<frequ> 9 MC 15 SA 21 SA 14 SA 6 MC 7 MC 2 MC 5 MC *** Working Draft *** Do not reproduce or cite without permission of the authors. 18 Here, item 21 are the most difficult, whereas items 1, 2, and 5 are the easiest. 
Incidentally, items 1, 2, and 5 appear at the base of the “ruler.” This indicates that all students correctly answered them, so we don’t really know how easy they are. They might be extremely easy, or might just be easy enough that this sample answered them correctly, but that another group of students with slightly less understanding might not answer them correctly. In the case of this sample, they do not provide useful information in determining student ability; thus, they are essentially excluded from this analysis. Liam, Quentin and Kelly are at the top of the ability continuum, and Victor, Whitney, and Yoshi are at the bottom. Actually, Liam and Quentin have a higher ability on this topic than the items were able to gauge, but this is not of too much concern, since the goal of the test is to gain evidence of proficiency. We are most interested in the “proficiency” cut-point. How to determine the proficiency cut point For a test such as this one, with multiple choice and partial credit items, but without a rubric, it is a bit difficult to set a cut-point to indicate proficiency. One procedure developed by Julian and Wright (1993, as cited in Stone, 1996), first identifies items and students around the criterion region, then decides whether the items are required by a “passing examinee to be considered competent” (Stone, 1996). In the case of a classroom, you could begin by rating each student based on your perception about his/her understanding of the concept. The ratings would be 2 (Definitely proficient in this area), 1 (Not sure if proficient or not), 0 (Definitely not proficient in this area). You would place this number next to the students’ names. Below they are placed to the left of student names; numbers to the right are scores students received on the test items. 1 1 0 1 1 2 Adam Bryan Clay David Elizabeth Faith 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 0 1 0 1 1 0 1 1 8 7 8 8 7 8 5 5 4 5 6 7 2 5 8 5 7 8 2 10 2 3 5 4 3 2 10 2 2 4 5 2 2 8 0 2 2 2 2 2 10 2 3 6 5 3 2 8 2 2 6 4 2 2 8 2 3 6 5 3 20 19 14 18 18 19 Once the assessment has been analyzed through Winsteps, you can draw a line to mark where the students do not meet proficiency (marked with zeros), then draw a box around students in the “don’t know” region (see example below.) Then look at the items in the box to determine whether the items at this level are required by a “passing examinee to be considered competent” (Stone, 1996). You might even set aside the names at this point and judge the items based on a rubric such as Kentucky’s General Scoring Rubric for 11th grade. *** Working Draft *** Do not reproduce or cite without permission of the authors. 19 5 5 This rubric was retrieved from http://www.education.ky.gov/NR/rdonlyres/et233oxqbnhvl56spt6lbjdoltmkv5t2ok5j52nivmgoyf76y7we3upsgs5tcfquzm53so fx5h44jnhjc2ep66hyr4f/Phase22004ReleaseGrade11.pdf *** Working Draft *** Do not reproduce or cite without permission of the authors. 
20 TABLE 1.0 Multiple Response Formats, Different Re ZOU160WS.TXT INPUT: 25 PERSONS, 21 ITEMS MEASURED: 25 PERSONS, 20 ITEMS, 63 CATS ----------------------------------------------------------------------------- 2 1 0 0Xiang -1 -2 PERSONS -MAP- ITEMS <more>|<rare> + | 2Liam | T| 2Quentin | | 2Kelly | | 2Jackie | 3 MC 2Patricia | S|T 2Uma | 2Isabella + 2Faith | 2Geraldine | 1Taariq | | 1David 1Oliver M|S 19 SA | 12 SA 1Adam 1Steven | 1Elizabeth 1Matthew | 1Bryan 1Hugh | 18 SA | 17 SA 0Robert | 8 MC 0Clay S+M 10 MC 0Yoshi 1Victor | 11 SA 0Nathan 0Whitney | 13 SA | | | | T|S | 20 SA | | 4 MC | + | |T | | | | | | | | | + 1 MC <less>|<frequ> 9 MC 15 SA 21 SA 14 SA 6 MC 7 MC 2 MC 5 MC *** Working Draft *** Do not reproduce or cite without permission of the authors. 21 Let’s say I make the break between items 12 and 18, as indicated by the arrow above. Table 13.1 displays the measures. I am going to look at the measure column, and go with .53 as the cut-point. TABLE 13.1 Multiple Response Formats, Different R ZOU304WS.TXT May 22 15:58 ---------------------------------------------------------------------------PERSON: REAL SEP.: 1.77 REL.: .76 ... ITEM: REAL SEP.: 1.80 REL.: .76 TABLE 13.1 Multiple Response Formats, Different R ZOU160WS.TXT May 19 15:48 2006 INPUT: 25 PERSONS, 21 ITEMS MEASURED: 25 PERSONS, 20 ITEMS, 63 CATS 3.59.1 -------------------------------------------------------------------------------PERSON: REAL SEP.: 1.74 REL.: .75 ... ITEM: REAL SEP.: 1.75 REL.: .75 ITEM STATISTICS: MEASURE ORDER +--------------------------------------------------------------------------------------+ |ENTRY RAW MODEL| INFIT | OUTFIT |PTMEA|EXACT MATCH| | |NUMBER SCORE COUNT MEASURE S.E. |MNSQ ZSTD|MNSQ ZSTD|CORR.| OBS% EXP%| ITEM G | |------------------------------------+----------+----------+-----+-----------+---------| | 3 12 25 1.33 .27|1.05 .3| .86 -.1| .45| 56.0 57.0| 3 MC 1 | | 9 24 25 .62 .23|1.81 3.3|1.96 2.9| -.12| 12.0 36.3| 9 MC 1 | | 19 114 25 .58 .16| .76 -.9| .77 -.9| .71| 32.0 26.8| 19 SA 0 | | 12 155 25 .54 .19|1.20 .9|1.19 .8| .40| 16.0 35.8| 12 SA 0 | | 18 141 25 .24 .15|1.28 1.1|1.36 1.3| .44| 16.0 28.3| 18 SA 0 | | 17 65 24 .21 .30| .77 -.8| .77 -.8| .64| 54.2 52.3| 17 SA 0 | | 8 34 25 .09 .24|1.13 .7|1.11 .4| .32| 32.0 35.6| 8 MC 1 | | 15 234 25 .02 .26|1.17 .9|1.09 .4| .26| 40.0 43.1| 15 SA 0 | | 21 430 25 .01 .08| .48 -1.4| .41 -1.8| .70| 32.0 25.3| 21 SA 0 | | 10 36 25 -.02 .24| .93 -.3| .73 -.5| .47| 44.0 42.6| 10 MC 1 | | 11 176 25 -.10 .14| .99 .1| .87 -.1| .43| 40.0 44.6| 11 SA 0 | | 13 169 25 -.14 .15|1.25 .8|1.49 1.1| .37| 24.0 38.5| 13 SA 0 | | 14 58 25 -.15 .31| .72 -1.2| .69 -1.3| .69| 64.0 53.5| 14 SA 0 | | 20 69 25 -.65 .48| .94 -.2| .80 -.6| .38| 76.0 76.0| 20 SA 0 | | 4 46 25 -.86 .38| .90 .0| .45 -.2| .34| 92.0 92.0| 4 MC 1 | | 6 46 25 -.86 .38| .89 .0| .44 -.2| .35| 92.0 92.0| 6 MC 1 | | 7 46 25 -.86 .38| .89 .0| .44 -.2| .35| 92.0 92.0| 7 MC 1 | | 1 50 25 -2.23 1.30| MINIMUM ESTIMATED MEASURE | | 1 MC 1 | | 2 50 25 -2.23 1.30| MINIMUM ESTIMATED MEASURE | | 2 MC 1 | | 5 50 25 -2.23 1.30| MINIMUM ESTIMATED MEASURE | | 5 MC 1 | |------------------------------------+----------+----------+-----+-----------+---------| | MEAN 100.3 25.0 -.33 .41|1.01 .2| .91 .0| | 47.9 51.3| | | S.D. 96.0 .2 .95 .38| .28 1.1| .41 1.1| | 26.3 22.3| | +--------------------------------------------------------------------------------------+ Once you have determined the cut-point, you can use Table 18.3 as a diagnostic tool. 
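Once person measures are in hand (for example, copied from the person measure table into a spreadsheet or a list), flagging students against the chosen cut-point is straightforward. The sketch below is a generic illustration: Adam's .47 and Isabella's 1.08 come from the tables that follow, Whitney's value is invented, and .53 is the cut-point selected above.

```python
# Generic illustration: flag students whose Rasch ability measure (in logits)
# falls below the proficiency cut-point chosen above.
CUT_POINT = 0.53

person_measures = {
    "Adam": 0.47,
    "Isabella": 1.08,
    "Whitney": -0.60,  # invented value for illustration
}

for name, measure in sorted(person_measures.items(), key=lambda pair: pair[1]):
    status = "proficient" if measure >= CUT_POINT else "not yet proficient"
    print(f"{name:10s} {measure:5.2f}  {status}")
```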
Table 18.3 (Output Tables PERSON Keyforms: entry) These charts list individual students, with the score we would expect each one to receive on each item, given their overall score on the test. The items listed to the right of the chart are from most difficult to least difficult. The vertical line of scores is placed at their ability level on the given topic. The scores in this line are what we would expect the student to receive given his/her ability level. Scores to the left of that line are actual scores which are lower than we would have expected given the student ability level. Scores to the right are actual scores which are higher than we would have expected. The actual scores are placed at the point on the horizontal ability continuum where the student would have fallen had he or she consistently answered as he/she answered the given item. Scores to either side of the continuum which are between periods are considered acceptably close to the expectation. However, *** Working Draft *** Do not reproduce or cite without permission of the authors. 22 scores in between parentheses are those which differ significantly from the expectation. These items should be reviewed; especially those which fall below the ability line. Earlier we set .53 as the proficiency cut point. To aid in determining what instruction would be helpful in bringing individual students to proficiency, you can draw a vertical line at the approximate proficiency point on the horizontal line, and highlight items which fall below that line. These are the items on which the student should focus. The first table below displays the expectations and actual responses by Adam: TABLE 18.3 Multiple Response Formats, Different R ZOU304WS.TXT May 22 15:58 2006 INPUT: 25 PERSONS, 21 ITEMS MEASURED: 25 PERSONS, 21 ITEMS, 67 CATS 3.59.1 -------------------------------------------------------------------------------KEY: .1.=OBSERVED, 1=EXPECTED, (1)=OBSERVED, BUT VERY UNEXPECTED. NUMBER - NAME ------------------ MEASURE - INFIT (MNSQ) OUTFIT - S.E. 1 1Adam .47 2.2 1.6 .24 -3 -2 -1 0 1 2 3 4 |-------+-------+-------+-------+-------+-------+-------| 0 (2) 1 .2. .4. .5. 6 .5. .3. 1 .2. 9 .10. 18 .20. .0. 1 7 .8. (2) 7 .2. .3. .2. .2. .2. .2. |-------+-------+-------+-------+-------+-------+-------| -3 -2 -1 0 1 2 3 4 NUM 3 9 19 12 18 17 8 15 21 10 11 13 14 20 4 6 7 16 NUM ITEM 3null & alternative hypothesesMC 9margin of error MC 19test stat & p-value calc SA7 12cond for significance test SA8 18conditions for inference SA8 17state hypotheses SA4 8p-value calc MC 15construct, interpret CI SA10 21construct and interp90% CI SA20 10margin of error MC 11population and parameter ID SA8 13test stat and p-value calc SA8 14conclusion SA3 20conclusion SA3 4conditions for inferenceMC 6standard error calc MC 7conditions for inferenceMC 16proprtions calc SA2 ITEM Adam measured slightly below the determined cutpoint, as indicated by the vertical line of scores to the left of the straight line placed at the approximate proficiency point. The scores in the vertical line indicate the expectation for Adam’s scores on each item given the overall score on the test. Numbers to the left and right of that line are the actual scores Adam received on each item, placed at the difficulty point of that level of response. Many of his responses fall well into the proficiency zone, including items 3, 9, 8, 15, 21, and 11. Any other item would need more attention to make it to proficiency, especially items 12, 10 and 13, which fall well below the proficiency line. 
The concepts found in these items would be most important in formulating the instructional plan for Adam. *** Working Draft *** Do not reproduce or cite without permission of the authors. 23 The chart below displays Isabella’s scores. She was in the proficient range, however, she answered below the expectation on items 18, 8, 15 and 14. TABLE 18.11 Multiple Response Formats, Different ZOU304WS.TXT May 22 15:58 2006 INPUT: 25 PERSONS, 21 ITEMS MEASURED: 25 PERSONS, 21 ITEMS, 67 CATS 3.59.1 -------------------------------------------------------------------------------NUMBER - NAME ------------------ MEASURE - INFIT (MNSQ) OUTFIT - S.E. 9 2Isabella 1.08 .9 .8 .31 -3 -2 -1 0 1 2 3 4 |-------+-------+-------+-------+-------+-------+-------| 1 .2. 1 .2. 5 .6. .7. .5. 6 .3. (0) 2 .9. 10 .19. .2. .8. .8. .2. 3 .3. .2. .2. .2. .2. |-------+-------+-------+-------+-------+-------+-------| -3 -2 -1 0 1 2 3 4 NUM 3 9 19 12 18 17 8 15 21 10 11 13 14 20 4 6 7 16 NUM ITEM 3null and alt hypotheses MC 9margin of error MC 19test stat & p-value calc SA7 12conditions for signif test SA8 18conditions for inference SA8 17state hypotheses SA4 8p-value calc MC 15construct, interpret CI SA10 21construct and interp CI SA20 10margin of error MC 11population and parameter ID SA8 13test stat and p-value calc SA8 14conclusion SA3 20conclusion SA3 4conditions for inf MC 6standard error calc MC 7conditions for inference MC 16proprtions calc SA2 ITEM You would work with Isabella on these concepts to help her understand these items which were especially difficult for her. Evaluation and Reflection Evaluation and Reflection – having teachers reflect on their instruction and student learning in order to improve teaching practice. The Learning Record recommends that teachers and/or students set aside at least three samples of student work for inclusion in the Learning Record. The selected work should demonstrate understandings the student has gained, and could be assessments from class, an investigation a student conducted, a presentation, or an assignment that is relevant in demonstrating student understanding. Students should include a written comment about why he/she selected the work. *** Working Draft *** Do not reproduce or cite without permission of the authors. 24