Do cognitive models consistently show good model-data-fit for students at different ability levels?

Andrea Gotzmann
Mary Roduta Roberts
Centre for Research in Applied Measurement and Evaluation
University of Alberta

Poster Presented at the Session “Diagnostics: Classification and Feedback Using Cognitive Models, Profile Analysis and Subscores”
Annual Meeting of the American Educational Research Association
Denver, CO
April 2010

Abstract

Differences in total test score for gender and ethnic subgroups are widely studied. The Attribute Hierarchy Method (AHM; Leighton & Gierl, 2007), a diagnostic testing procedure, is used in the current study to evaluate differences for overall ability, ability-by-gender, and ability-by-ethnicity. A model-data-fit statistic, the Hierarchy Consistency Index (HCI; Cui & Leighton, 2009), is applied to ability, ability-by-gender, and ability-by-ethnicity comparisons for several cognitive models in Mathematics and Critical Reading. HCI values increased as a function of ability for almost all of the cognitive models regardless of categorization. These results indicate that the evaluation of group performance can produce more precise information that can be used to assist with improving cognitive models.

Educational testing has increased dramatically due to the No Child Left Behind (NCLB, 2001) legislation, which requires each state to test students in grades three through eight in English/Language Arts (E/LA) and Mathematics. The NCLB mandate also requires states to show 100% proficiency in E/LA and Mathematics by 2014 and to report growth for various subgroups (e.g., ethnic, gender, special education; Linn, Baker & Betebenner, 2002). In light of these requirements, diagnostic assessments are being used to assist with meeting these goals.
Diagnostic assessments provide the enhanced information required to improve student learning, along with feedback to students and teachers about strengths and weaknesses on specific learning objectives. One method to create diagnostic assessments is to use the Attribute Hierarchy Method (AHM). However, methods to ensure validity for various subgroups have yet to be determined for cognitive diagnostic assessment.

Gender and ethnic test/item score differences are typically assessed using Differential Item Functioning (DIF) statistical procedures (e.g., Dorans, Schmitt, & Bleistein, 1992; Parshall & Miller, 1995; Schmitt, 1988; Shepard, Camilli, & Williams, 1985; Zwick & Ercikan, 1989). DIF occurs when subgroups differ in the probability of answering an item correctly after controlling for overall ability. DIF analyses are typically conducted for large-scale assessments. Unfortunately, there is no consensus on which DIF procedure works well for all student populations, and DIF procedures usually require large sample sizes (e.g., a minimum of 250 in each subgroup). In addition, many studies that have attempted to confirm, through content reviews, which items would indicate DIF and which group would be favored have shown little success (e.g., Gierl, Khaliq & Boughton, 1999; Angoff, 1993; Camilli & Shepard, 1994; Engelhard, Hansche & Rutledge, 1990; Gierl & McEwen, 1998; O’Neill & McPeek, 1993).

There are several limitations to using DIF analysis in a diagnostic framework to identify test score differences: (1) information is gained mainly at the item level, (2) explanations about why the differences occur have been limited, (3) linking test performance to cognitive models has been lacking, and (4) most DIF analyses focus on only two groups. A similar method of confirming test fairness is needed in the context of diagnostic assessment.
To address these limitations, we present a method for examining group differences using a cognitive diagnostic assessment (CDA) framework, as evaluated using the attribute hierarchy method (AHM). Only one study has used the AHM to evaluate differences in performance by gender, ethnicity, and gender-by-ethnicity. Gotzmann, Roberts, Alves and Gierl (2009) used the AHM to compare gender and ethnicity differences using the Hierarchy Consistency Index (HCI) as a measure of model-data-fit. They found little to no difference in average HCI values for the White subgroup for most of the cognitive models. However, differences between gender and ethnic subgroups, and for gender-by-ethnicity, were found across all cognitive models for other subgroups, such as the American Indian and African-American subgroups. This study indicated that even with high overall average HCI values, the cognitive model may not fit all examinees. Further investigation of gender and ethnicity subgroups can provide more information on the basis of performance differences. As a follow-up to this study, we evaluated whether overall ability may also contribute to the differences in average HCI values as a measure of model-data-fit.

Purpose of the Study

The purpose of this study is to evaluate differences in average HCI for Low, Medium and High ability examinees. Specifically, are the model-data-fit indices similar across ability levels, and is the pattern of fit consistent across gender and ethnic subgroups? In this study, average HCI values are presented for ability, ability-by-gender, and ability-by-ethnicity.

Attribute Hierarchy Method

The AHM is a cognitively-based psychometric procedure used to classify examinees’ test item responses into a set of attribute patterns associated with a cognitive model of task performance. Cognitive attributes in the AHM are described as the procedural or declarative knowledge needed to perform a task in a specific domain (Leighton, Gierl, & Hunka, 2004).
The AHM is a two-stage procedure: the first stage involves cognitive model specification, and the second stage involves a psychometric analysis of student responses to yield model-based diagnostic information about student mastery of cognitive skills.

Stage 1: Specification of the Cognitive Model

An AHM analysis begins with the specification of a cognitive model of task performance. A cognitive model in educational measurement refers to a “simplified description of human problem solving on standardized educational tasks, which helps to characterize the knowledge and skills students at different levels of learning have acquired and to facilitate the explanation and prediction of students’ performance” (Leighton & Gierl, 2007, p. 6). These cognitive skills, conceptualized as attributes in the AHM framework, are specified at a small grain size in order to generate specific diagnostic inferences.

Theories of task performance can be used to develop cognitive models in a subject domain. However, the availability of these theories in education is limited. Therefore, other means are used to generate cognitive models. One method is to use the results of a task analysis of test items that represent a content domain. A task analysis can be used to create a cognitive model in which the knowledge and procedures used to solve the test items are specified. Another method involves having examinees think aloud as they solve test items to identify the actual knowledge, processes, and strategies elicited by the task (Ericsson & Simon, 1993; Leighton & Gierl, 2007). A cognitive model derived from a task analysis can be validated and, if required, modified using examinee verbal reports collected from think-aloud studies.

A key assumption underlying the specification of the cognitive model in the AHM is the hierarchical or linear ordering of the attributes.
This assumption reflects the characteristics of human information processing because cognitive processes usually do not work in isolation but function within a network of interrelated competencies and skills (Kuhn, 2001). For example (see Figure 1 for a graphical representation), five attributes are linearly ordered, with attribute 1 conceptualized as the simplest and attribute 5 as the most complex. If an examinee possesses attribute 3, then it is expected that this examinee also possesses the pre-requisite attributes, in this case attributes 1 and 2. The cognitive model has direct implications for item development, as the items that measure each attribute must maintain the linear ordering in the model while also measuring increasingly complex cognitive processes.

Any method used to create cognitive models requires a review of the cognitive skills needed to solve test items. The first step is to establish the breadth and depth of all cognitive skills that are desirable for a diagnostic assessment. Once all required areas are specified for the necessary cognitive skills, these skills are categorized into meaningful sub-content areas that provide diagnostic feedback. Within each of the sub-content areas, separate cognitive models can be created that are linearly related and narrow in scope to identify a student’s strengths and weaknesses in their cognitive development. The next step in the creation of a cognitive diagnostic assessment is to evaluate how well the students’ actual response data fit the expected structure from the cognitive models.

Stage 2: Psychometric Evaluation of the Cognitive Model

The AHM provides a model-data fit index to evaluate the accuracy of the fit between the cognitive model and the examinees’ observed response data. For the AHM, the model-data fit index is called the Hierarchy Consistency Index (HCI; Cui & Leighton, 2009).
The HCI can be used to evaluate a cognitive model for the entire student sample, for several subgroups, and for sub-categorized subgroups. For these types of analyses (as compared to DIF analyses), the unit of analysis shifts from comparing subgroups by item to calculating model-data fit for individual students on a set of items that align to the cognitive model. This approach permits comparisons not previously possible in the context of large-scale assessment. For example, a student who is female and Hispanic can only be classified in one group for most DIF analyses. With the HCI statistic, however, examinees can be placed in several categories at the same time. For instance, students who are Hispanic and female can be compared to students who are Hispanic and male.

Hierarchy Consistency Index

The HCI statistic can provide meaningful information on the fit of each cognitive model, relative to examinees’ observed responses, overall and for each type of subgroup. The HCI is an index of the model-data-fit between each student’s observed response data and the cognitive model. The HCI for examinee $i$ is calculated as follows:

$$HCI_i = 1 - \frac{2 \sum_{j \in S_{correct_i}} \sum_{g \in S_j} X_{i_j} \left(1 - X_{i_g}\right)}{N_{c_i}}$$

where $S_{correct_i}$ includes items that are correctly answered by student $i$, $X_{i_j}$ is student $i$’s score (1 or 0) to item $j$, where item $j$ belongs to $S_{correct_i}$, $S_j$ includes items that require the subset of attributes measured by item $j$, $X_{i_g}$ is student $i$’s score (1 or 0) to item $g$, where item $g$ belongs to $S_j$, and $N_{c_i}$ is the total number of comparisons for all the items that are correctly answered by student $i$ (Cui & Leighton, 2009).

The HCI values are calculated for each student, and the average is taken across students for each cognitive model. HCI values range from -1.00 to +1.00, where an HCI of 0.70 or higher indicates good model-data fit (Cui & Leighton, 2009). This index is not statistically affected by overall ability (i.e., perfect and non-perfect scores both result in an HCI of 1.00).
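The HCI formula above can be sketched in a few lines of Python. This is a minimal illustration, not the SAS macros used in the study. The function names `hci` and `linear_prerequisites` are hypothetical, the helper assumes a linear hierarchy with exactly one item per attribute, and returning `None` when a student has no comparisons (for example, when no item is answered correctly) is an assumption of this sketch rather than a rule taken from Cui and Leighton (2009).

```python
from typing import Dict, List, Optional


def linear_prerequisites(n_items: int) -> Dict[int, List[int]]:
    """Build S_j for a linear hierarchy with one item per attribute.

    Assumption of this sketch: items are indexed from simplest (0) to most
    complex (n_items - 1), so S_j holds every item that precedes item j.
    """
    return {j: list(range(j)) for j in range(n_items)}


def hci(responses: List[int],
        prerequisites: Dict[int, List[int]]) -> Optional[float]:
    """Hierarchy Consistency Index for one student.

    responses[j] is the 0/1 score on item j; prerequisites[j] is S_j.
    Returns None when there are no comparisons (N_c = 0), which is a
    convention chosen here because the index is then undefined.
    """
    misfits = 0      # numerator sum of X_ij * (1 - X_ig)
    comparisons = 0  # N_c: comparisons over all correctly answered items
    for j, x_j in enumerate(responses):
        if x_j != 1:
            continue  # only correctly answered items enter S_correct
        for g in prerequisites[j]:
            comparisons += 1
            misfits += x_j * (1 - responses[g])
    if comparisons == 0:
        return None
    return 1 - 2 * misfits / comparisons


if __name__ == "__main__":
    s = linear_prerequisites(3)
    for pattern in ([1, 1, 1], [1, 1, 0], [1, 0, 1], [0, 0, 1]):
        print(pattern, hci(pattern, s))
```

For a three-item linear model, the fully consistent patterns [1, 1, 1] and [1, 1, 0] both give 1.0, the pattern [1, 0, 1] (one misfit in two comparisons) gives 0.0, and [0, 0, 1] (the hardest item correct with both pre-requisite items wrong) gives -1.0.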
Slips occur when an examinee responds correctly to an item in the model but does not respond correctly to other pre-requisite items linked to the cognitive skills (e.g., the item measuring attribute three was answered correctly but the items measuring attributes one and two were not; see Figure 1). The number of slips, relative to the number of combinations of possible attribute response patterns in the cognitive model, indicates how well a student’s responses fit the cognitive model. Therefore, the HCI provides a summary of the model-data fit with the cognitive model. The HCI values can be used to provide supporting evidence for the accuracy of the cognitive model for multiple subgroups. Because the HCI is calculated for each student, students who do not fit the model (i.e., HCI < 0.70) can be identified. In addition, cognitive models that have good model-data fit overall can be evaluated for several subgroups to ensure validity for all examinees.

Methods

Source of information

Data from the SAT Reasoning Test and the Preliminary SAT®/National Merit Scholarship Qualifying Test (NMSQT/PSAT®) were used. The NMSQT/PSAT® is a program co-sponsored by the College Board and the National Merit Scholarship Corporation. It is a standardized test that provides students with practice for the SAT Reasoning Test. It also allows students to enter National Merit Scholarship Corporation scholarship programs. The NMSQT/PSAT® measures critical reading skills, math problem-solving skills, and writing skills. The purpose of this research was to investigate enhanced diagnostic scoring and reporting procedures so that students would receive more specific information about their strengths and weaknesses on college readiness skills. This enhanced feedback was intended to help students focus their preparation on areas where they wanted to improve their test performance.
A random sample of 5000 examinees from The College Board NMSQT/PSAT® 2006 administration was used for this study. Individual HCI values were calculated for the entire sample. In addition, average HCI values for groups subdivided by ability, ability-by-gender, and ability-by-ethnicity were computed. Three ability levels were constructed from the overall scale score for each content area, which ranged from 20 to 80: Low ability students had scale scores of 20 to 39, Medium ability students had scale scores of 40 to 59, and High ability students had scale scores of 60 to 80. All 5000 examinees were used in the calculations of the average HCI values for overall ability and for ability-by-gender. However, for the ability-by-ethnicity calculations, the American Indian and Puerto Rican sub-categories were not presented due to low case counts (N < 5). Cognitive models were created for two content areas, Mathematics and Critical Reading. Items that measured the skills in each cognitive model, as determined by content experts, were included in the analyses.

Mathematics. The sample was sub-categorized for overall ability in Mathematics as follows: Low ability (N = 1443), Medium ability (N = 2949) and High ability (N = 608). The sample was also sub-categorized for ability-by-gender as follows: Low ability Females (N = 819), Medium ability Females (N = 1597), High ability Females (N = 252), Low ability Males (N = 624), Medium ability Males (N = 1352), and High ability Males (N = 356).
The sample was sub-categorized for ability-by-ethnicity as follows: Low ability Asians (N = 56), Medium ability Asians (N = 182), High ability Asians (N = 84), Low ability African-Americans (N = 464), Medium ability African-Americans (N = 300), High ability African-Americans (N = 20), Low ability Mexican-Americans (N = 140), Medium ability Mexican-Americans (N = 170), High ability Mexican-Americans (N = 11), Low ability Other Hispanics (N = 181), Medium ability Other Hispanics (N = 212), High ability Other Hispanics (N = 19), Low ability Whites (N = 484), Medium ability Whites (N = 1920), High ability Whites (N = 450), Low ability Others (N = 58), Medium ability Others (N = 99), and High ability Others (N = 20).

Critical Reading. The sample was sub-categorized for overall ability in Critical Reading as follows: Low ability (N = 1610), Medium ability (N = 2781) and High ability (N = 609). The sample was also sub-categorized for ability-by-gender as follows: Low ability Females (N = 834), Medium ability Females (N = 1517), High ability Females (N = 317), Low ability Males (N = 776), Medium ability Males (N = 1264), and High ability Males (N = 292). The sample was sub-categorized for ability-by-ethnicity as follows: Low ability Asians (N = 108), Medium ability Asians (N = 166), High ability Asians (N = 48), Low ability African-Americans (N = 451), Medium ability African-Americans (N = 311), High ability African-Americans (N = 22), Low ability Mexican-Americans (N = 158), Medium ability Mexican-Americans (N = 151), High ability Mexican-Americans (N = 12), Low ability Other Hispanics (N = 207), Medium ability Other Hispanics (N = 181), High ability Other Hispanics (N = 24), Low ability Whites (N = 563), Medium ability Whites (N = 1822), High ability Whites (N = 469), Low ability Others (N = 70), Medium ability Others (N = 78), and High ability Others (N = 29).

Procedures

This study was conducted in three stages.
First, cognitive models were created for each sub-content area in Mathematics and Critical Reading. For example, Mathematics had four sub-categories: Algebra and Functions, Data and Probability, Geometry and Measurement, and Number and Operations. Critical Reading had four sub-categories: Author’s Craft, Comprehending Ideas, Determining Meaning, and Reasoning and Inferencing. A total of 54 cognitive models were created across Mathematics and Critical Reading; however, only six models are presented in this paper. Second, content experts mapped existing items from the test to the skills in each cognitive model. Third, individual student HCIs were calculated for each cognitive model. The six models were selected based on whether the overall average HCI values were good and all skills were represented by items. All of the cognitive models for Mathematics had high average HCI values (greater than 0.70). However, the overall HCI values were slightly lower for the Critical Reading cognitive models (i.e., 0.65, 0.48, and 0.58 respectively). The Critical Reading cognitive models were included so that comparisons of different content areas and lower average model-data-fit values were possible.

Individual HCI values were calculated for the entire sample. Then, HCI results for each cognitive model were aggregated by ability (Low, Medium and High), ability (Low, Medium and High) by gender (Female and Male), and ability (Low, Medium and High) by ethnicity (Asian, African-American, Mexican-American, Other Hispanic, White, and Other (e.g., mixed race)). However, results for the American Indian and Puerto Rican ethnic groups are not presented due to low sample sizes (N < 5). Ability levels were created based on an examinee’s scale score for each content area: 20-39 Low ability, 40-59 Medium ability, and 60-80 High ability.

Stage 1: Developing the Cognitive Models

For NMSQT/PSAT® Mathematics and Critical Reading, Stage 1 was completed in two steps.
In the first step, Gierl, Roberts, Alves, and Gotzmann (2009) developed preliminary cognitive models. This development work was undertaken so the content specialists would have a starting point for creating their cognitive models. Three College Board research papers provided the starting points for creating the preliminary models: Developing Skill Categories for the SAT Math Section by O’Callaghan, Morley & Schwartz (2004), Toward a Construct of Critical Reading for the New SAT by VanderVeen (2004), and the Performance Category Descriptions for the Critical Reading, Mathematics, and Writing Sections of the SAT (2007), also known as the SAT Scale Anchoring Study. O’Callaghan et al. (2004) and VanderVeen (2004) described several cognitive skill categories identified by content specialists after reviewing large numbers of previously administered SAT Mathematics and Critical Reading items. Their cognitive skill categories ranged from simple to complex. The authors of this study assisted in creating the preliminary cognitive models, with a linear ordering of cognitive skills, to assist content experts in the second step.

In the second step of Stage 1, five content specialists nominated by The College Board (three in Mathematics and two in Critical Reading) reviewed the preliminary cognitive models with the intention of making appropriate modifications, with particular emphasis on identifying the appropriate skills and on the ordering of these skills. They were also asked to evaluate the skills in each cognitive model for their measurability and instructional relevance. That is, the content specialists were instructed to modify the initial models in light of the characteristics required of cognitive models for CDA (e.g., measurability, grain size, and instructional relevance). All five content specialists had extensive mathematics and reading backgrounds as well as teaching and test development experience.
The content specialists scrutinized the wording of each skill descriptor to ensure it would be clear and meaningful to teachers. Any measurable and instructionally relevant process skills were also added to the cognitive models. In total, 54 cognitive models were created in NMSQT/PSAT® Mathematics and Critical Reading.

The cognitive models discussed in this paper are shown in Figures 2A, 2B, 3A, 3B, 4A, and 4B. For each of these cognitive models, only one item was mapped to each cognitive skill, as indicated below. Figure 2A shows the cognitive model for Mathematics under the sub-category of Algebra and Functions, which is not currently labeled. Three items were mapped to the three cognitive skills 2.4.1, 2.4.2, and 2.4.3. Figure 2B shows the cognitive model for Mathematics under the sub-category of Geometry and Measurement, which is not currently labeled. Four items were mapped to the four cognitive skills 3.7.1, 3.7.2, 3.7.3, and 3.7.4. Figure 3A shows the cognitive model for Mathematics under the sub-category of Number and Operations, which is not currently labeled. Two items were mapped to the two cognitive skills 1.8.1 and 1.8.2. Figure 3B shows the cognitive model for Critical Reading under the sub-category of Determining Meaning, labeled “Context”. Four items were mapped to the four cognitive skills 1.c.1, 1.c.2, 1.c.3, and 1.c.4. Figure 4A shows the cognitive model for Critical Reading under the sub-category of Author’s Craft, labeled “Rhetorical and Stylistic Devices”. Six items were mapped to the six cognitive skills 3.c.1, 3.c.2, 3.c.3, 3.c.4, 3.c.5, and 3.c.6. Figure 4B shows the cognitive model for Critical Reading under the sub-category of Reasoning and Inferencing, labeled “Generalizing”. Four items were mapped to the four cognitive skills 4.c.1, 4.c.2, 4.c.3, and 4.c.4.
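As an illustration of what the linear ordering implies for these six models, the sketch below stores each model as an ordered list of the skill labels given in the text and derives the response patterns that would be expected if every examinee followed the hierarchy exactly. The names `MODELS`, `expected_patterns`, and `max_comparisons` are hypothetical, and the sketch assumes that the skills are ordered from simplest to most complex in the order they are listed (the figures themselves are not reproduced here).

```python
from typing import Dict, List, Tuple

# Skill labels as listed in the text, one item per skill. The ordering from
# simplest to most complex is ASSUMED to follow the listed numbering.
MODELS: Dict[str, List[str]] = {
    "Figure 2A": ["2.4.1", "2.4.2", "2.4.3"],
    "Figure 2B": ["3.7.1", "3.7.2", "3.7.3", "3.7.4"],
    "Figure 3A": ["1.8.1", "1.8.2"],
    "Figure 3B": ["1.c.1", "1.c.2", "1.c.3", "1.c.4"],
    "Figure 4A": ["3.c.1", "3.c.2", "3.c.3", "3.c.4", "3.c.5", "3.c.6"],
    "Figure 4B": ["4.c.1", "4.c.2", "4.c.3", "4.c.4"],
}


def expected_patterns(skills: List[str]) -> List[Tuple[int, ...]]:
    """Response patterns consistent with a linear hierarchy.

    With one item per skill, a consistent examinee answers the first
    `mastered` items correctly and the remaining items incorrectly.
    """
    n = len(skills)
    return [tuple(1 if k < mastered else 0 for k in range(n))
            for mastered in range(n + 1)]


def max_comparisons(skills: List[str]) -> int:
    """Number of item-to-prerequisite comparisons when every item is correct."""
    n = len(skills)
    return n * (n - 1) // 2


if __name__ == "__main__":
    for figure, skills in MODELS.items():
        print(figure, len(expected_patterns(skills)), max_comparisons(skills))
```

Under these assumptions the two-skill model in Figure 3A allows only one comparison per examinee, while the six-skill model in Figure 4A allows up to fifteen, so individual HCI values for the smaller model can take far fewer distinct values.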
Stage 2: Mapping items to each Cognitive model

In the second stage, existing items from the 2006 NMSQT/PSAT® administration were mapped to each of the linear cognitive models. A set of items was provided to the content experts in Mathematics and Critical Reading, and they aligned the items to the skills in each cognitive model created in Stage 1. Because this task used existing items and mapped them to the cognitive models after the fact, some cognitive skills in some of the 54 cognitive models are not represented by items. However, for this study we used complete cognitive models to evaluate model-data-fit for students at different ability levels.

Stage 3: HCI Calculations and Model Evaluations

In the third stage, individual HCI values were calculated for the sample of 5000 students. Several macros were created in SAS to calculate the HCI values for each examinee for each cognitive model. Average overall HCI values by ability, ability-by-gender, and ability-by-ethnicity were also calculated. Six models were selected, three for Mathematics and three for Critical Reading. For the three Mathematics cognitive models, the overall average HCI values were considered good (i.e., HCI greater than 0.70), and for the three Critical Reading cognitive models they were lower (i.e., 0.65, 0.48, and 0.58 respectively). The Critical Reading models were selected because the models were fully represented by items and the overall average HCI was not too far from the 0.70 criterion. These cognitive models were selected to evaluate and compare ability categories with good-fitting and moderate-fitting models across different content areas.
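The grouping and averaging steps can be sketched as follows. This is an illustrative Python version, not the SAS macros used in the study. The function names `ability_level` and `average_hci`, the record layout, and the sample records are all hypothetical; the records are made-up values for demonstration and are not data from the study. Only the score bands (20-39, 40-59, 60-80) are taken from the text.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Sequence, Tuple


def ability_level(scale_score: int) -> str:
    """Map a content-area scale score (20-80) to the ability bands in the text."""
    if 20 <= scale_score <= 39:
        return "Low"
    if 40 <= scale_score <= 59:
        return "Medium"
    if 60 <= scale_score <= 80:
        return "High"
    raise ValueError(f"scale score out of range: {scale_score}")


def average_hci(records: Iterable[dict],
                keys: Sequence[str]) -> Dict[Tuple[str, ...], float]:
    """Average individual HCI values within each combination of `keys`.

    Each record is a dict with an "hci" entry plus grouping fields.
    Records whose HCI is None (undefined) are skipped; that choice is an
    assumption of this sketch.
    """
    groups: Dict[Tuple[str, ...], List[float]] = defaultdict(list)
    for record in records:
        if record["hci"] is None:
            continue
        groups[tuple(record[k] for k in keys)].append(record["hci"])
    return {label: mean(values) for label, values in groups.items()}


if __name__ == "__main__":
    # Made-up records for illustration only.
    raw = [
        {"score": 35, "gender": "Female", "hci": 0.5},
        {"score": 38, "gender": "Male", "hci": 1.0},
        {"score": 52, "gender": "Female", "hci": 1.0},
        {"score": 47, "gender": "Female", "hci": 0.0},
        {"score": 71, "gender": "Male", "hci": 1.0},
    ]
    records = [dict(r, ability=ability_level(r["score"])) for r in raw]
    print(average_hci(records, ["ability"]))
    print(average_hci(records, ["ability", "gender"]))
```

Passing a different list of keys, such as `["ability", "ethnicity"]`, would give the ability-by-ethnicity breakdown in the same way, provided each record carries an `ethnicity` field.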
For each cognitive model, overall average HCI values and standard deviations were calculated for the sample of 5000 students, as well as average HCI values for ability (Low, Medium and High), ability-by-gender (Low, Medium, and High ability Females and Males), and ability-by-ethnicity (Low, Medium, and High ability within each of the Asian, African-American, Mexican-American, Other Hispanic, White, and Other subgroups). Two of the ethnicity subgroups are not presented due to small sample sizes (i.e., American Indian and Puerto Rican).

Results

The average HCI values for ability, ability-by-gender, and ability-by-ethnicity are presented for each cognitive model in Mathematics and Critical Reading. The results are presented in six tables (one for each cognitive model in Mathematics and Critical Reading) and graphically in six figures (Figures 5 through 10). As a reminder, average HCI values above 0.70 indicate good model-data-fit, as cited by Cui & Leighton (2009). The results for Mathematics are presented first, followed by Critical Reading.

Mathematics

Table 1 shows the overall HCI values aggregated by ability, ability-by-gender, and ability-by-ethnicity for the Mathematics cognitive model represented in Figure 2A. The average HCI values increased with ability from 0.67 to 0.94 overall, from 0.67 to 0.96 for Females, and from 0.66 to 0.93 for Males. A similar trend was seen for the ability-by-ethnicity sub-categorizations.
These results are more apparent graphically in Figure 5, where average HCI values increased as ability increased for all categorizations. In all instances for the Medium and High ability groups, average HCI values were above the 0.70 criterion. The average HCI values were below 0.70 for almost all subgroups in the Low ability category but were relatively close to the 0.70 criterion. The exceptions were the Other and Asian ethnic subgroups, which were higher than the 0.70 criterion.

Table 2 shows the overall HCI values aggregated by ability, ability-by-gender, and ability-by-ethnicity for the Mathematics cognitive model represented in Figure 2B. The average HCI values increased with ability from 0.76 to 0.95 overall, from 0.77 to 0.95 for Females, and from 0.75 to 0.95 for Males. A similar trend was seen for the ability-by-ethnicity categorizations. These results are represented graphically in Figure 6, where average HCI values increased as ability increased for all categorizations. In almost all instances for the various ability categorizations, average HCI values were above the 0.70 criterion, with the exception of the Asian Low ability subgroup, which was below 0.70. However, this value was still relatively close to the 0.70 criterion (i.e., 0.66).

Table 3 shows the overall HCI values aggregated by ability, ability-by-gender, and ability-by-ethnicity for the Mathematics cognitive model represented in Figure 3A. The average HCI values did not necessarily increase from the lowest to the highest ability level. The average HCI values ranged from 0.81 to 0.83, indicating relatively small differences, and all of the average HCI values were relatively high. The pattern seen in the other Mathematics cognitive models also does not occur for the ability-by-gender sub-categorization.
For Females, average HCI values increased from 0.83 to 0.88, whereas Males showed an inconsistent pattern, ranging from 0.78 to 0.82 with a slightly smaller value of 0.77 for Medium ability. Similarly, no consistent pattern was seen for the ability-by-ethnicity sub-categorizations. Depending on the subgroup, the highest average HCI values were found for Low ability (Asian and African-American), Medium ability (Mexican-American and Other), or High ability (Other Hispanic and White). However, even for those subgroups where High ability had the highest average HCI values, the Low ability groups were the next highest with 0.82. These results are presented graphically in Figure 7, where each subgroup showed a different pattern and no consistent trend emerged. In almost all instances the average HCI values were above the 0.70 criterion, with the exception of the Mexican-American High ability subgroup, with a value of 0.64. The rest of the values were all well above the 0.70 criterion.

Critical Reading

Table 4 shows the overall HCI values aggregated by ability, ability-by-gender, and ability-by-ethnicity for the Critical Reading cognitive model represented in Figure 3B. The average HCI values increased with ability from 0.50 to 0.93 overall, from 0.53 to 0.92 for Females, and from 0.46 to 0.94 for Males. A similar trend was seen for the ability-by-ethnicity categorizations. These results are presented graphically in Figure 8, where average HCI values increased as ability increased for all categorizations. For the Medium ability category, the average HCI values for the Mexican-American and African-American groups were below the 0.70 criterion. Also, the average HCI values were below 0.70 for all subgroups in the Low ability category.
Table 5 shows the overall HCI values aggregated by ability, ability-by-gender, and ability-by-ethnicity for the Critical Reading cognitive model represented in Figure 4A. The average HCI values increased with ability from 0.38 to 0.74 overall, from 0.39 to 0.72 for Females, and from 0.38 to 0.76 for Males. A similar trend was seen for the ability-by-ethnicity categorizations. These results are presented graphically in Figure 9, where average HCI values increased as ability increased for all categorizations. Since the overall average HCI was lower for this cognitive model (i.e., 0.48), most of the sub-categorizations were below the 0.70 criterion. The only exceptions were the Asian, Mexican-American, White and Other High ability subgroups. However, the overall pattern followed the first two cognitive models presented for Mathematics and the first cognitive model in Critical Reading, where the Low ability groups had the lowest overall HCI values regardless of gender or ethnicity sub-categorizations.

Table 6 shows the overall HCI values aggregated by ability, ability-by-gender, and ability-by-ethnicity for the Critical Reading cognitive model represented in Figure 4B. The average HCI values increased with ability from 0.46 to 0.84 overall, from 0.48 to 0.83 for Females, and from 0.43 to 0.84 for Males. A similar trend was seen for the ability-by-ethnicity categorizations. These results are presented graphically in Figure 10, where average HCI values increased as ability increased for all categorizations. Since the overall average HCI is lower for this cognitive model, similar to the previously presented cognitive models in Critical Reading, most of the sub-categorizations were below the 0.70 criterion. The exception was the High ability category: overall, for Females and Males, and for all ethnic subgroups.
However, the overall pattern follows the first two cognitive models presented for Mathematics and the first two cognitive models in Critical Reading, where the Low ability groups had the lowest average HCI values regardless of gender or ethnicity sub-categorizations.

Discussion

These results show how model-data fit statistics, like the HCI, can be used to determine how well student responses fit a cognitive model. They also indicate that ability, as a function of gender and ethnicity, can be evaluated at a fine level of detail. Because the HCI is calculated for each student, student-level information can be used individually or collectively to help find clues to a student’s cognitive understanding.

With the exception of one cognitive model, all of the results show that the Low ability groups had lower average HCI values and that average HCI values increased for the Medium and High ability groups when evaluated overall, for ability-by-gender, and for ability-by-ethnicity. These results indicate that as overall ability increases, the model-data fit indices also increase. This could be a function of ill-specified cognitive models for lower ability examinees or of random guessing influencing the model-data fit indices. The only instance in which this pattern was not seen was the third Mathematics cognitive model. However, this model had the fewest attributes, with only two items measuring two skills. Additional cognitive models that are represented by more than one item per attribute should be evaluated to see whether a similar pattern occurs with smaller cognitive models such as the one in Figure 3A, as the small number of attributes could have influenced the results.

The results of this study suggest that further research is needed in developing cognitive models for lower ability examinees. A given cognitive model may fit a sample of examinees across ability levels well overall, but the cognitive model may not fit subgroups of examinees equally well.
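Because the HCI is a per-student statistic, it may help to see how a single value arises from one response vector. The sketch below follows the definition in Cui and Leighton (2009) as we read it: for every item a student answers correctly, each other item requiring a subset of that item's attributes should also be correct, and HCI = 1 - 2 × (misfits / comparisons). The three-item linear hierarchy, the function name `hci`, and the choice to return `None` when no comparisons are possible are assumptions made for illustration, not part of the study's analysis.

```python
def hci(responses, item_attributes):
    """Hierarchy Consistency Index for one student (illustrative sketch).

    responses: list of 0/1 item scores.
    item_attributes: list of sets; item_attributes[j] holds the attributes item j requires.
    Returns a value in [-1, 1], or None when no comparison is possible.
    """
    misfits = 0
    comparisons = 0
    for j, score_j in enumerate(responses):
        if score_j != 1:
            continue  # only correctly answered items generate comparisons
        for g, score_g in enumerate(responses):
            # Compare with every other item whose attributes are a subset of item j's.
            if g != j and item_attributes[g] <= item_attributes[j]:
                comparisons += 1
                misfits += 1 - score_g  # a wrong answer on the easier item is a misfit
    if comparisons == 0:
        return None  # assumption: treat the index as undefined in this case
    return 1 - 2 * misfits / comparisons


# Hypothetical linear hierarchy A1 -> A2 -> A3 with one item per attribute.
ITEMS = [{1}, {1, 2}, {1, 2, 3}]

if __name__ == "__main__":
    print(hci([1, 1, 1], ITEMS))  # fully consistent with the hierarchy
    print(hci([0, 0, 1], ITEMS))  # hardest item correct, prerequisites wrong
    print(hci([1, 0, 1], ITEMS))  # one of two comparisons is a misfit
```

A response pattern such as the second one is the kind a lucky guess on a hard item could produce, which is one way guessing could pull down the average HCI of lower ability groups.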
Developing cognitive models for lower ability examinees is critically important, given that this is the intended target population of diagnostic assessment. Modifying the estimation of model-data fit to account for guessing in student responses could also help evaluate cognitive models more precisely. The results of this study can assist in the process of evaluating cognitive models for use in diagnostic assessments, providing new information to test developers, teachers, and administrators to assist in student learning and teaching.

References

Angoff, W. H. (1993). Perspectives on differential item functioning methodology. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 2-24). Hillsdale, NJ: Lawrence Erlbaum.
Camilli, G., & Shepard, L. A. (1994). Methods for identifying biased test items. Thousand Oaks, CA: Sage.
Cui, Y., & Leighton, J. P. (2009). The hierarchy consistency index: Evaluating person fit for cognitive diagnostic assessment. Journal of Educational Measurement, 46, 429-449. doi: 10.1111/j.1745-3984.2009.00091.x
Dorans, N. J., Schmitt, A. P., & Bleistein, C. A. (1992). The standardization approach to assessing comprehensive differential item functioning. Journal of Educational Measurement, 29, 309-319. doi: 10.1111/j.1745-3984.1992.tb00379.x
Engelhard, G., Jr., Hansche, L., & Rutledge, K. E. (1990). Accuracy of bias review judges in identifying differential item functioning on teacher certification tests. Applied Measurement in Education, 3, 347-360. doi: 10.1207/s15324818ame0304_4
Ericsson, K. A., & Simon, H. A. (1993). Protocol analysis: Verbal reports as data. Cambridge, MA: MIT Press.
Gierl, M. J., Khaliq, S., & Boughton, K. (1999). Gender differential item functioning in mathematics and science: Prevalence and policy implications. Paper presented at the annual meeting of the Canadian Society for the Study of Education, Sherbrooke, Quebec.
Gierl, M. J., & McEwen, N. (1998).
Differential item functioning on the Alberta Education Social Studies 30 diploma exams. Paper presented at the annual meeting of the Canadian Society for the Study of Education, Ottawa, Ontario, Canada.
Gierl, M. J., Roberts, M., Alves, C., & Gotzmann, A. (2009). Using judgments from content specialists to develop cognitive models for diagnostic assessments. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.
Gotzmann, A., Roberts, M., Alves, C., & Gierl, M. J. (2009). Using cognitive models to evaluate ethnicity and gender differences. Paper presented at the annual meeting of the American Educational Research Association, San Diego, CA.
Kuhn, D. (2001). Why development does (and does not) occur: Evidence from the domain of inductive reasoning. In J. L. McClelland & R. Siegler (Eds.), Mechanisms of cognitive development: Behavioral and neural perspectives (pp. 221-249). Hillsdale, NJ: Erlbaum.
Leighton, J. P., & Gierl, M. J. (2007). Defining and evaluating models of cognition used in educational measurement to make inferences about examinees’ thinking processes. Educational Measurement: Issues and Practice, 26, 3-16. doi: 10.1111/j.1745-3992.2007.00090.x
Leighton, J. P., Gierl, M. J., & Hunka, S. M. (2004). The attribute hierarchy model for cognitive assessment: A variation on Tatsuoka's rule-space approach. Journal of Educational Measurement, 41, 205-237. doi: 10.1111/j.1745-3984.2004.tb01163.x
Linn, R. L., Baker, E. L., & Betebenner, D. W. (2002). Accountability systems: Implications of requirements of the No Child Left Behind Act of 2001. Educational Researcher, 31(6), 3-16. doi: 10.3102/0013189X031006003
No Child Left Behind Act of 2001, Pub. L. No. 107-110, 115 Stat. 1425 (2002).
O’Callaghan, R. K., Morley, M. E., & Schwartz, A. (2004). Developing skill categories for the SAT Math section. Paper presented at the meeting of the National Council on Measurement in Education, San Diego, CA.
O’Neill, K.
A., & McPeek, W. M. (1993). Item and test characteristics that are associated with differential item functioning. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 255-276). Hillsdale, NJ: Lawrence Erlbaum.
Parshall, C. G., & Miller, T. R. (1995). Exact versus asymptotic Mantel-Haenszel DIF statistics: A comparison of performance under small-sample conditions. Journal of Educational Measurement, 32, 302-316. doi: 10.1111/j.1745-3984.1995.tb00469.x
Schmitt, A. P. (1988). Language and cultural characteristics that explain differential item functioning for Hispanic examinees on the Scholastic Aptitude Test. Journal of Educational Measurement, 25, 1-13. doi: 10.1111/j.1745-3984.1988.tb00287.x
Shepard, L. A., Camilli, G., & Williams, D. M. (1985). Validity of approximation techniques for detecting item bias. Journal of Educational Measurement, 22, 77-105. doi: 10.1111/j.1745-3984.1985.tb01050.x
VanderVeen, A. (2004). Toward a construct of Critical Reading for the new SAT. Paper presented at the meeting of the National Council on Measurement in Education, San Diego, CA.
Zwick, R., & Ercikan, K. (1989). Analysis of differential item functioning in the NAEP history assessment. Journal of Educational Measurement, 26, 55-66. doi: 10.1111/j.1745-3984.1989.tb00318.x

Table 1.
HCI values for Mathematics Algebra and Functions Hierarchy by ability, ability-by-gender, and ability-by-ethnicity

Ability
Group               Ability   Mean    SD      N
Overall             Low       0.669   0.707   1443
Overall             Medium    0.839   0.489   2949
Overall             High      0.942   0.251   608

Ability-by-Gender
Group               Ability   Mean    SD      N
Female              Low       0.673   0.704   819
Female              Medium    0.839   0.484   1597
Female              High      0.956   0.217   252
Male                Low       0.664   0.710   624
Male                Medium    0.838   0.494   1352
Male                High      0.932   0.273   356

Ability-by-Ethnicity
Group               Ability   Mean    SD      N
Asian               Low       0.714   0.680   56
Asian               Medium    0.852   0.440   182
Asian               High      0.861   0.363   84
African-American    Low       0.659   0.716   464
African-American    Medium    0.787   0.542   300
African-American    High      0.950   0.224   20
Mexican-American    Low       0.633   0.724   140
Mexican-American    Medium    0.810   0.547   170
Mexican-American    High      0.909   0.302   11
Other Hispanic      Low       0.630   0.739   181
Other Hispanic      Medium    0.860   0.455   212
Other Hispanic      High      1.000   0.000   19
White               Low       0.682   0.701   484
White               Medium    0.851   0.476   1920
White               High      0.954   0.231   450
Other               Low       0.793   0.585   58
Other               Medium    0.828   0.475   99
Other               High      0.950   0.224   20

Table 2. HCI values for Mathematics Geometry and Measurement Hierarchy by ability, ability-by-gender, and ability-by-ethnicity

Ability
Group               Ability   Mean    SD      N
Overall             Low       0.760   0.623   1443
Overall             Medium    0.874   0.464   2949
Overall             High      0.953   0.246   608

Ability-by-Gender
Group               Ability   Mean    SD      N
Female              Low       0.769   0.619   819
Female              Medium    0.862   0.490   1597
Female              High      0.952   0.245   252
Male                Low       0.747   0.628   624
Male                Medium    0.887   0.430   1352
Male                High      0.954   0.249   356

Ability-by-Ethnicity
Group               Ability   Mean    SD      N
Asian               Low       0.661   0.721   56
Asian               Medium    0.885   0.424   182
Asian               High      0.984   0.145   84
African-American    Low       0.743   0.648   464
African-American    Medium    0.862   0.484   300
African-American    High      0.933   0.298   20
Mexican-American    Low       0.771   0.600   140
Mexican-American    Medium    0.908   0.401   170
Mexican-American    High      1.000   0.000   11
Other Hispanic      Low       0.744   0.636   181
Other Hispanic      Medium    0.822   0.550   212
Other Hispanic      High      1.000   0.000   19
White               Low       0.785   0.586   484
White               Medium    0.881   0.451   1920
White               High      0.952   0.238   450
Other               Low       0.828   0.566   58
Other               Medium    0.869   0.466   99
Other               High      0.900   0.447   20

Table 3.
HCI values for Mathematics Number Operations Hierarchy by ability, ability-by-gender, and ability-by-ethnicity

Ability
Group               Ability   Mean    SD      N
Overall             Low       0.827   0.563   1443
Overall             Medium    0.813   0.582   2949
Overall             High      0.822   0.569   608

Ability-by-Gender
Group               Ability   Mean    SD      N
Female              Low       0.832   0.556   819
Female              Medium    0.848   0.529   1597
Female              High      0.881   0.474   252
Male                Low       0.821   0.572   624
Male                Medium    0.772   0.636   1352
Male                High      0.781   0.626   356

Ability-by-Ethnicity
Group               Ability   Mean    SD      N
Asian               Low       0.857   0.520   56
Asian               Medium    0.703   0.713   182
Asian               High      0.810   0.591   84
African-American    Low       0.845   0.536   464
African-American    Medium    0.833   0.554   300
African-American    High      0.800   0.616   20
Mexican-American    Low       0.829   0.562   140
Mexican-American    Medium    0.859   0.514   170
Mexican-American    High      0.636   0.809   11
Other Hispanic      Low       0.790   0.615   181
Other Hispanic      Medium    0.868   0.498   212
Other Hispanic      High      1.000   0.000   19
White               Low       0.818   0.576   484
White               Medium    0.804   0.595   1920
White               High      0.831   0.557   450
Other               Low       0.759   0.657   58
Other               Medium    0.919   0.396   99
Other               High      0.700   0.733   20

Table 4. HCI values for Critical Reading Determining Meaning-“Context” Hierarchy by ability, ability-by-gender, and ability-by-ethnicity

Ability
Group               Ability   Mean    SD      N
Overall             Low       0.495   0.795   1610
Overall             Medium    0.741   0.554   2781
Overall             High      0.927   0.184   609

Ability-by-Gender
Group               Ability   Mean    SD      N
Female              Low       0.531   0.774   834
Female              Medium    0.757   0.527   1517
Female              High      0.920   0.193   317
Male                Low       0.457   0.815   776
Male                Medium    0.722   0.585   1264
Male                High      0.935   0.172   292

Ability-by-Ethnicity
Group               Ability   Mean    SD      N
Asian               Low       0.465   0.777   108
Asian               Medium    0.704   0.566   166
Asian               High      0.932   0.199   48
African-American    Low       0.488   0.807   451
African-American    Medium    0.667   0.638   311
African-American    High      0.903   0.300   22
Mexican-American    Low       0.574   0.751   158
Mexican-American    Medium    0.682   0.589   151
Mexican-American    High      1.000   0.000   12
Other Hispanic      Low       0.523   0.767   207
Other Hispanic      Medium    0.710   0.635   181
Other Hispanic      High      0.967   0.113   24
White               Low       0.506   0.789   563
White               Medium    0.763   0.528   1822
White               High      0.923   0.180   469
Other               Low       0.411   0.840   70
Other               Medium    0.751   0.497   78
Other               High      0.938   0.208   29

Table 5.
HCI values for Critical Reading Author’s Craft-“Rhetorical and Stylistic Devices” Hierarchy by ability, ability-by-gender, and ability-by-ethnicity

Ability
Group               Ability   Mean    SD      N
Overall             Low       0.383   0.828   1610
Overall             Medium    0.481   0.642   2781
Overall             High      0.738   0.381   609

Ability-by-Gender
Group               Ability   Mean    SD      N
Female              Low       0.390   0.823   834
Female              Medium    0.486   0.643   1517
Female              High      0.720   0.392   317
Male                Low       0.376   0.833   776
Male                Medium    0.475   0.642   1264
Male                High      0.758   0.368   292

Ability-by-Ethnicity
Group               Ability   Mean    SD      N
Asian               Low       0.253   0.887   108
Asian               Medium    0.489   0.670   166
Asian               High      0.780   0.354   48
African-American    Low       0.321   0.850   451
African-American    Medium    0.403   0.716   311
African-American    High      0.643   0.575   22
Mexican-American    Low       0.382   0.837   158
Mexican-American    Medium    0.420   0.686   151
Mexican-American    High      0.781   0.436   12
Other Hispanic      Low       0.483   0.795   207
Other Hispanic      Medium    0.462   0.643   181
Other Hispanic      High      0.696   0.369   24
White               Low       0.422   0.809   563
White               Medium    0.497   0.627   1822
White               High      0.729   0.377   469
Other               Low       0.450   0.766   70
Other               Medium    0.495   0.599   78
Other               High      0.891   0.253   29

Table 6. HCI values for Critical Reading Reasoning and Inferencing-“Generalizing” Hierarchy by ability, ability-by-gender, and ability-by-ethnicity

Ability
Group               Ability   Mean    SD      N
Overall             Low       0.458   0.781   1610
Overall             Medium    0.598   0.565   2781
Overall             High      0.837   0.345   609

Ability-by-Gender
Group               Ability   Mean    SD      N
Female              Low       0.480   0.752   834
Female              Medium    0.589   0.569   1517
Female              High      0.830   0.349   317
Male                Low       0.434   0.811   776
Male                Medium    0.610   0.561   1264
Male                High      0.844   0.340   292

Ability-by-Ethnicity
Group               Ability   Mean    SD      N
Asian               Low       0.445   0.810   108
Asian               Medium    0.495   0.644   166
Asian               High      0.726   0.435   48
African-American    Low       0.410   0.833   451
African-American    Medium    0.562   0.597   311
African-American    High      0.895   0.280   22
Mexican-American    Low       0.542   0.720   158
Mexican-American    Medium    0.557   0.559   151
Mexican-American    High      0.889   0.385   12
Other Hispanic      Low       0.470   0.750   207
Other Hispanic      Medium    0.516   0.642   181
Other Hispanic      High      0.815   0.360   24
White               Low       0.476   0.758   563
White               Medium    0.620   0.546   1822
White               High      0.852   0.327   469
Other               Low       0.363   0.837   70
Other               Medium    0.692   0.503   78
Other               High      0.761   0.418   29

Figure 1. Example 5-attribute cognitive model.
[Attribute hierarchy diagram; attributes are defined below.]
2.4.3 = Solve for one variable or expression in terms of another [Solve literal equations]
2.4.2 = Use variables in multi-step abstract settings (e.g., applying the distributive property across several variables)
2.4.1 = Use a letter as a placeholder for an unknown value
Figure 2A. Mathematics Algebra and Functions cognitive model.

[Attribute hierarchy diagram; attributes are defined below.]
3.7.4 = Determine the effect of changes in the linear dimension (e.g. length of a segment or perimeter) of a figure on other
3.7.3 = Solve multi-step problems involving volume
3.7.2 = Solve multi-step problems involving areas of figures composed of two or more simple figures
3.7.1 = Determine distance on a number line
Figure 2B. Mathematics Geometry and Measurement cognitive model.

[Attribute hierarchy diagram; attributes are defined below.]
1.8.2 = Determine the values or properties of numbers in a sequence when given a description of a sequence
1.8.1 = Identify a rule that describes a numerical pattern in a sequence
Figure 3A. Mathematics Number Operations cognitive model.

[Attribute hierarchy diagram; attributes are defined below.]
1.c.4 = Use context to determine how an uncommon word fits in a sentence
1.c.3 = Use the context of larger sections of text to determine the meaning of uncommon words or to differentiate among multiple possible meanings of words
1.c.2 = Use the context of the sentence to differentiate among multiple possible meanings of words
1.c.1 = Use the context of the sentence to determine how a common word fits in a sentence
Figure 3B. Critical Reading Determining Meaning-“Context” cognitive model.
[Attribute hierarchy diagram; attributes are defined below.]
3.c.6 = Determine the purposes of authors’ rhetorical and stylistic choices across texts
3.c.5 = Identify how an author’s rhetorical or stylistic choices support a particular perspective or position
3.c.4 = Identify the purpose and effect of sophisticated literary devices such as irony and sarcasm
3.c.3 = Identify connotations of language based on context
3.c.2 = Identify the purpose and effect of literary devices such as metaphor or understatement
3.c.1 = Recognize examples of literary devices such as metaphor or understatement
Figure 4A. Critical Reading Author’s Craft-“Rhetorical and Stylistic Devices” cognitive model.

[Attribute hierarchy diagram; attributes are defined below.]
4.c.4 = Identify and compare generalizations or themes across texts
4.c.3 = Identify abstract ideas and principles based on cumulative inferences
4.c.2 = Recognize how broad generalized ideas can be used to lead to abstract ideas or positions
4.c.1 = Identify generalizations or themes supported by the text
Figure 4B. Critical Reading Reasoning and Inferencing-“Generalizing” cognitive model.

[Plot of average HCI (0.0 to 1.0) across the Low, Medium, and High ability scale score categories for the Overall, Female, Male, Asian, African-American, Mexican-American, White, Other Hispanic, and Other groups.]
Figure 5. Average HCI values for ability, ability-by-gender, and ability-by-ethnicity for Mathematics Algebra and Functions Cognitive Model

[Plot of average HCI (0.0 to 1.0) across the Low, Medium, and High ability scale score categories for the Overall, Female, Male, Asian, African-American, Mexican-American, White, Other Hispanic, and Other groups.]
Figure 6.
Average HCI values for ability, ability-by-gender, and ability-by-ethnicity for Mathematics Geometry and Measurement Cognitive Model

[Plot of average HCI (0.0 to 1.0) across the Low, Medium, and High ability scale score categories for the Overall, Female, Male, Asian, African-American, Mexican-American, White, Other Hispanic, and Other groups.]
Figure 7. Average HCI values for ability, ability-by-gender, and ability-by-ethnicity for Mathematics Number Operations Cognitive Model

[Plot of average HCI (0.0 to 1.0) across the Low, Medium, and High ability scale score categories for the Overall, Female, Male, Asian, African-American, Mexican-American, White, Other Hispanic, and Other groups.]
Figure 8. Average HCI values for ability, ability-by-gender, and ability-by-ethnicity for Critical Reading Determining Meaning-“Context” Cognitive Model

[Plot of average HCI (0.0 to 1.0) across the Low, Medium, and High ability scale score categories for the Overall, Female, Male, Asian, African-American, Mexican-American, White, Other Hispanic, and Other groups.]
Figure 9. Average HCI values for ability, ability-by-gender, and ability-by-ethnicity for Critical Reading Author’s Craft-“Rhetorical and Stylistic Devices” Cognitive Model

[Plot of average HCI (0.0 to 1.0) across the Low, Medium, and High ability scale score categories for the Overall, Female, Male, Asian, African-American, Mexican-American, White, Other Hispanic, and Other groups.]
Figure 10. Average HCI values for ability, ability-by-gender, and ability-by-ethnicity for Critical Reading Reasoning and Inferencing-“Generalizing” Cognitive Model