Do cognitive models consistently show good model-data-fit for
students at different ability levels?
Andrea Gotzmann
Mary Roduta Roberts
Centre for Research in Applied Measurement and Evaluation
University of Alberta
Poster Presented at the Session
“Diagnostics: Classification and Feedback Using Cognitive
Models, Profile Analysis and Subscores”
Annual Meeting of the American Educational Research
Association
Denver, CO
April 2010
Abstract
Differences in total test score for gender and ethnic subgroups are widely
studied. The Attribute Hierarchy Method (AHM; Leighton & Gierl, 2007), a diagnostic
testing procedure, is used to evaluate differences for overall ability, ability-by-gender
and ability-by-ethnicity in the current study. A model-data-fit statistic, the Hierarchy
Consistency Index (HCI; Cui & Leighton, 2009), is applied to ability, ability-by-gender,
and ability-by-ethnicity comparisons for several cognitive models in Mathematics and
Critical Reading. HCI values increased as a function of ability for almost all of the
cognitive models regardless of categorization. These results indicate that evaluating
group performance by ability can produce more precise information that can be
used to improve cognitive models.
Do cognitive models consistently show good model-data-fit for students at
different ability levels?
Educational testing has increased dramatically due to the No Child Left Behind
(NCLB, 2001) legislation, which requires each state to test students in grades three
through eight in English/Language Arts (E/LA) and Mathematics. The NCLB mandate
also requires states to show 100% proficiency in E/LA and Mathematics by 2014 and to
report growth for various subgroups (e.g., ethnic, gender, special education; Linn, Baker
& Betebenner, 2002). In light of these requirements, diagnostic assessments are being
used to assist with meeting these goals. Diagnostic assessments provide enhanced
information required to improve student learning, and feedback to students and
teachers about strengths and weaknesses of specific learning objectives. One method
to create diagnostic assessments is to use the Attribute Hierarchy Method (AHM).
However, methods to ensure validity for various subgroups have yet to be determined
for cognitive diagnostic assessment.
Gender and ethnic test/item score differences are typically assessed using
Differential Item Functioning (DIF) statistical procedures (e.g., Dorans, Schmitt, &
Bleistein, 1992; Parshall & Miller, 1995; Schmitt, 1988; Shepard, Camilli, & Williams,
1985; Zwick & Ercikan, 1989). DIF occurs when examinees from different subgroups differ in
their probability of answering an item correctly after controlling for overall ability. DIF
analyses are typically conducted for large-scale assessments. Unfortunately, there is no
consensus on which DIF procedure works well for all student populations, and DIF analysis
usually requires large sample sizes (e.g., a minimum of 250 in each subgroup). In addition,
many studies that have attempted to
confirm, through content reviews, which items would indicate DIF and which group
would be favored, have shown little success (e.g., Gierl, Khaliq & Boughton, 1999;
Angoff, 1993; Camilli & Shepard, 1994; Engelhard, Hansche & Rutledge, 1990; Gierl &
McEwen, 1998; O’Neill & McPeek, 1993).
There are several limitations to using DIF analysis in a diagnostic framework to
identify test score differences: (1) information is gained mainly at the item level, (2)
explanations about why the differences occur have been limited, (3) linking test
performance to cognitive models has been lacking, and (4) most DIF analyses only
focus on two groups. A similar method of confirming test fairness is needed in the
context of diagnostic assessment. To address these limitations, we present a method
for examining group differences using a cognitive diagnostic assessment (CDA)
framework, as evaluated using the attribute hierarchy method (AHM).
Only one study has used the AHM to evaluate differences in
performance by gender, ethnicity, and gender-by-ethnicity. Gotzmann, Roberts, Alves,
and Gierl (2009) used the AHM to compare gender and ethnicity differences,
with the Hierarchy Consistency Index (HCI) as a measure of model-data-fit. They
found little to no difference in average HCI values for the White subgroup for most of
the cognitive models. However, differences by gender, ethnicity, and
gender-by-ethnicity were found across all cognitive models for other subgroups, such as
the American Indian and African-American subgroups. This study indicated that even with
overall high average HCI values, the cognitive model may not fit all examinees. Further
investigation of gender and ethnicity subgroups can provide more information on the
basis of performance differences. As a follow up to this study, we evaluated whether
overall ability may also contribute to the differences in average HCI values as a
measure of model-data-fit.
Purpose of the Study
The purpose of this study is to evaluate differences in average HCI for Low,
Medium and High ability examinees. Specifically, are the model-data-fit indices similar
across ability levels, and is the pattern of fit consistent across gender and ethnic
subgroups? In this study, average HCI values are presented for ability, ability-by-gender, and ability-by-ethnicity.
Attribute Hierarchy Method
The AHM is a cognitively-based psychometric procedure used to classify
examinees’ test item responses into a set of attribute patterns associated with a
cognitive model of task performance. Cognitive attributes in the AHM are described as
the procedural or declarative knowledge needed to perform a task in a specific domain
(Leighton, Gierl, & Hunka, 2004). The AHM is a two-stage procedure where the first
stage involves cognitive model specification and the second stage involves a
psychometric analysis of student responses to yield model-based diagnostic information
about student mastery of cognitive skills.
Stage 1: Specification of the Cognitive Model
An AHM analysis begins with the specification of a cognitive model of task
performance. A cognitive model in educational measurement refers to a “simplified
description of human problem solving on standardized educational tasks, which helps to
characterize the knowledge and skills students at different levels of learning have
acquired and to facilitate the explanation and prediction of students’ performance”
(Leighton & Gierl, 2007, p. 6). These cognitive skills, conceptualized as attributes in the
AHM framework, are specified at a small grain size in order to generate specific
diagnostic inferences. Theories of task performance can be used to develop cognitive
models in a subject domain. However, the availability of these theories in education is
limited. Therefore, other means are used to generate cognitive models. One method is
to use results of a task analysis of test items that represent a content domain. A task
analysis can be used to create a cognitive model, where the knowledge and procedures
used to solve the test item are specified. Another method involves having examinees
think aloud as they solve test items to identify the actual knowledge, processes, and
strategies elicited by the task (Ericsson & Simon, 1993; Leighton & Gierl, 2007). A
cognitive model derived from a task analysis can be validated and, if required, modified
using examinee verbal reports collected from think aloud studies.
A key assumption underlying the specification of the cognitive model in the AHM
is the hierarchical or linear ordering of the attributes. This assumption reflects the
characteristics of human information processing because cognitive processes usually
do not work in isolation but function within a network of interrelated competencies and
skills (Kuhn, 2001). For example (see Figure 1 for graphical representation), five
attributes are linearly ordered with attribute 1 conceptualized as the simplest and
attribute 5 as the most complex. If an examinee possesses attribute 3, then it is
expected that this examinee also possesses the pre-requisite attributes, in this case
attributes 1 and 2. The cognitive model has direct implications for item development as
the items that measure each attribute must maintain the linear ordering in the model
while also measuring increasingly complex cognitive processes.
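As an illustration of the linear ordering assumption, the short Python sketch below enumerates the attribute patterns that a linear hierarchy permits and checks whether a given pattern respects the ordering. The function names are hypothetical and are not part of any AHM software; the sketch only restates the five-attribute example above.

```python
def expected_patterns(n_attributes):
    """List the attribute patterns consistent with a linear hierarchy.

    Attribute k can be mastered only if attributes 1..k-1 are mastered,
    so a linear model with n attributes permits exactly n + 1 patterns.
    """
    return [[1] * k + [0] * (n_attributes - k) for k in range(n_attributes + 1)]


def is_consistent(pattern):
    """Return True if no attribute is mastered while a prerequisite is missing."""
    return list(pattern) in expected_patterns(len(pattern))
```

Under this sketch, the pattern [1, 1, 1, 0, 0] (attributes 1 to 3 mastered) is consistent with the five-attribute model, whereas [0, 0, 1, 0, 0] (attribute 3 without attributes 1 and 2) is not.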
Any method used to create cognitive models requires a review of the cognitive
skills needed to solve test items. The first step would be to ensure the breadth and
depth of all cognitive skills that are desirable for a diagnostic assessment. Once all
required areas are specified for the necessary cognitive skills, these skills would be
categorized into meaningful sub-content areas that provide diagnostic feedback. Within
each of the sub-content areas, separate cognitive models can be created that are
linearly related and narrow in scope to identify a student’s strengths and weaknesses in
their cognitive development. The next step in the creation of cognitive diagnostic
assessment is to evaluate how well the students’ actual response data fit the expected
structure from the cognitive models.
Stage 2: Psychometric Evaluation of the Cognitive Model
The AHM provides a model-data fit index to evaluate the accuracy of the fit
between the cognitive model and the examinees’ observed response data. For the
AHM, the model-data fit index is called the Hierarchy Consistency Index (HCI; Cui &
Leighton, 2009). The HCI can be used to evaluate a cognitive model for the entire
student sample, but also for several subgroups as well as sub-categorized subgroups.
For these different types of analyses (as compared to DIF analyses), the unit of analysis
shifts from comparing subgroups by item to calculating model-data fit for individual
students for a set of items that align to the cognitive model. This approach permits
different types of comparisons not previously possible in the context of large-scale
assessment. For example, a student who is female and Hispanic can only be classified
in one group for most DIF analyses. With the HCI, however, examinees can be
placed in several categories at the same time. For instance, students who are
Hispanic and female can be compared to students who are Hispanic and male.
Hierarchy Consistency Index
The HCI statistic can provide meaningful information on the fit of each cognitive
model, relative to examinees’ observed responses overall, and for each type of
subgroup. The HCI is an index that indicates the model-data-fit of each student’s
observed response data relative to the cognitive model. The HCI for examinee i is
calculated as follows:
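$$HCI_i = 1 - \frac{2 \sum_{j \in S_{correct_i}} \sum_{g \in S_j} X_{ij} (1 - X_{ig})}{N_{c_i}}$$
where,
$S_{correct_i}$ includes items that are correctly answered by student $i$,
$X_{ij}$ is student $i$’s score (1 or 0) to item $j$, where item $j$ belongs to $S_{correct_i}$,
$S_j$ includes items that require the subset of attributes measured by item $j$,
$X_{ig}$ is student $i$’s score (1 or 0) to item $g$, where item $g$ belongs to $S_j$, and
$N_{c_i}$ is the total number of comparisons for all the items that are correctly answered by
student $i$ (Cui & Leighton, 2009).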
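The calculation can be sketched in Python as follows. This is a minimal illustration rather than the SAS macros used in the study, and it rests on two assumptions that are not stated in the text: the prerequisite sets are supplied as a dictionary (the helper linear_prereqs builds them for a linear model with one item per attribute, ordered from simplest to most complex), and a student with no comparisons at all is assigned an HCI of 1.0 because no misfit can be observed.

```python
def linear_prereqs(n_items):
    """Prerequisite item sets for a linear hierarchy with one item per attribute.

    Item j (0-based) requires every earlier item 0..j-1.
    """
    return {j: list(range(j)) for j in range(n_items)}


def hci(responses, prereq_items):
    """Hierarchy Consistency Index for one student.

    responses: list of 0/1 item scores.
    prereq_items: dict mapping each item index j to the indices of the
        items that measure a subset of the attributes required by item j.
    """
    misfits = 0
    comparisons = 0
    for j, x_j in enumerate(responses):
        if x_j != 1:
            continue  # only correctly answered items generate comparisons
        for g in prereq_items[j]:
            comparisons += 1
            misfits += 1 - responses[g]  # a missed prerequisite item is a misfit
    if comparisons == 0:
        return 1.0  # assumption: no comparisons means no observable misfit
    return 1.0 - 2.0 * misfits / comparisons
```

With a five-item linear model, the response pattern [1, 1, 1, 0, 0] produces three comparisons and no misfits, so the HCI is 1.0. The pattern [0, 0, 1, 0, 0] produces two comparisons and two misfits, so the HCI is 1 - 2(2)/2 = -1.0.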
The HCI values are calculated for each student and the average taken across
students for each cognitive model. HCI values range from -1.00 to +1.00 where an HCI
of 0.70 or higher indicates good model-data fit (Cui & Leighton, 2009). This index is not
statistically affected by overall ability (i.e., perfect and non-perfect scores can both result in
an HCI of 1.00, provided the response pattern is consistent with the hierarchy). Slips occur
when an examinee responds to an item in the model
correctly but does not respond correctly to the pre-requisite items linked to its cognitive
skills (e.g., the item for attribute 3 was answered correctly but not the items for attributes 1 and 2; see
Figure 1). The number of slips related to the number of combinations of possible
attribute response patterns in the cognitive model indicates how well a student’s
response fits the cognitive model. Therefore, the HCI provides a summary of the
model-data fit with the cognitive model.
The HCI values can be used to provide supporting evidence for the accuracy of
the cognitive model for multiple subgroups. Because the HCI is calculated for each
student, students who do not fit the model (i.e., HCI < 0.70) can be identified. In
addition, cognitive models that have good model-data fit overall can be evaluated for
several subgroups to ensure validity for all examinees.
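A small Python sketch of this use of the index is given below: it averages individual HCI values and lists the students who fall below the 0.70 criterion. The function name and the dictionary layout (student ID mapped to HCI) are hypothetical choices made for illustration only.

```python
from statistics import mean

GOOD_FIT = 0.70  # criterion cited in the text (Cui & Leighton, 2009)


def summarize_fit(hci_by_student, criterion=GOOD_FIT):
    """Return the average HCI and the IDs of students below the criterion.

    hci_by_student: dict mapping a student ID to that student's HCI value.
    """
    flagged = sorted(sid for sid, value in hci_by_student.items() if value < criterion)
    return mean(hci_by_student.values()), flagged
```

The same function can be applied to any subgroup by passing only the HCI values of the students in that subgroup.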
Methods
Source of information
Data from the SAT Reasoning Test and the Preliminary SAT®/National Merit
Scholarship Qualifying Test were used. The Preliminary SAT®/National Merit Scholarship
Qualifying Test is a program co-sponsored by the College Board and National Merit
Scholarship Corporation. It is a
standardized test that provides students with practice for the SAT Reasoning Test. It
also allows students to enter National Merit Scholarship Corporation scholarship
programs. The test measures critical
reading skills, math problem-solving skills, and writing skills. The purpose of this
research was to investigate enhanced diagnostic scoring and reporting procedures so
that students would receive more specific information about their strengths and
weaknesses on college readiness skills. This enhanced feedback was intended to help
students focus their preparation on areas where they wanted to improve their test
performance.
A random sample of 5000 examinees from The College Board NMSQT/PSAT®
2006 administration was used for this study. Individual HCI values were calculated for
the entire sample. In addition, average HCI values for groups subdivided by ability,
ability-by-gender, and ability-by-ethnicity were computed. Three ability
levels were constructed from the overall scale score for each content area, which ranges from 20 to 80.
The ability groups were defined as follows: Low
ability students had scale scores of 20 to 39, Medium ability students had scale scores of 40 to 59,
and High ability students had scale scores of 60 to 80. All of the 5000 examinees were used
in the calculations of the average HCI values for the overall ability and ability-by-gender
calculations. However, for the ability-by-ethnicity calculations, the American Indian and
Puerto Rican sub-categories were not presented due to low case counts (N<5).
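The ability grouping described above can be expressed as a small Python function. The cut points come directly from the text; the function name is hypothetical, and the sketch assumes integer scale scores.

```python
def ability_level(scale_score):
    """Map a content-area scale score (20 to 80) to an ability group.

    Cut points follow the text: 20-39 Low, 40-59 Medium, 60-80 High.
    Assumes integer scale scores.
    """
    if not 20 <= scale_score <= 80:
        raise ValueError(f"scale score out of range: {scale_score}")
    if scale_score <= 39:
        return "Low"
    if scale_score <= 59:
        return "Medium"
    return "High"
```

Because the grouping is made separately for each content area, the same examinee can fall in different ability groups for Mathematics and Critical Reading.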
Cognitive models for two content areas were created in Mathematics and Critical
Reading. Items that measured the skills in each cognitive model, as determined by
content experts, were included in the analyses.
Mathematics.
The sample was sub-categorized for overall ability in Mathematics as follows:
Low ability (N = 1443), Medium ability (N = 2949) and High Ability (N = 608). The
sample was also sub-categorized for ability-by-gender as follows: Low ability Females
(N = 819), Medium ability Females (N = 1597), High ability Females (N = 252), Low
ability Males (N = 624), Medium ability Males (N = 1352), and High ability Males (N =
356). The sample was sub-categorized for ability-by-ethnicity as follows: Low ability
Asians (N = 56), Medium ability Asians (N = 182), High ability Asians (N = 84), Low
ability African-Americans (N = 464), Medium ability African-Americans (N = 300), High
ability African-Americans (N = 20), Low ability Mexican-Americans (N = 140), Medium
ability Mexican-Americans (N = 170), High ability Mexican-Americans (N = 11), Low
ability Other Hispanics (N = 181), Medium ability Other Hispanics (N = 212), High ability
Other Hispanics (N = 19), Low ability Whites (N = 484), Medium ability Whites (N =
1920), High ability Whites (N = 450), Low ability Others (N = 58), Medium ability Others (N
= 99), and High ability Others (N = 20).
Critical Reading.
The sample was sub-categorized for overall ability in Critical Reading as follows:
Low ability (N = 1610), Medium ability (N = 2781) and High Ability (N = 609). The
sample was also sub-categorized for ability-by-gender as follows: Low ability Females
(N = 834), Medium ability Females (N = 1517), High ability Females (N = 317), Low
ability Males (N = 776), Medium ability Males (N = 1264), and High ability Males (N =
292). The sample was sub-categorized for ability-by-ethnicity as follows: Low ability
Asians (N = 108), Medium ability Asians (N = 166), High ability Asians (N = 48), Low
ability African-Americans (N = 451), Medium ability African-Americans (N = 311), High
ability African-Americans (N = 22), Low ability Mexican-Americans (N = 158), Medium
ability Mexican-Americans (N = 151), High ability Mexican-Americans (N = 12), Low
ability Other Hispanics (N = 207), Medium ability Other Hispanics (N = 181), High ability
Other Hispanics (N = 24), Low ability Whites (N = 563), Medium ability Whites (N =
1822), High ability Whites (N = 469), Low ability Others (N = 70), Medium ability Others (N
= 78), and High ability Others (N = 29).
Procedures
This study was conducted in three stages. First, cognitive models were created
for each sub-content area in Mathematics and Critical Reading. For example,
Mathematics had four sub-categories: Algebra and Functions, Data and Probability,
Geometry and Measurement, and Number and Operations. Critical Reading had four
sub-categories: Author’s Craft, Comprehending Ideas, Determining Meaning, and
Reasoning and Inferencing. There were a total of 54 cognitive models created across
Mathematics and Critical Reading; however, only six models are presented in this paper.
Second, content experts mapped existing items from the test to the skills in each
cognitive model. Third, individual student HCIs were calculated for each cognitive
model. The six models were selected based on whether the overall average HCI values
were good and all skills were represented by items.
All of the cognitive models for Mathematics had high average HCI values (greater
than 0.70). However, the overall average HCI values were lower for the Critical Reading
cognitive models (i.e., 0.65, 0.48, and 0.58). The Critical Reading cognitive
models were included so that comparisons of different content areas and lower average
model-data-fit values were possible. Individual HCI values were calculated for the
entire sample. Then, HCI results for each cognitive model were aggregated by ability
(Low, Medium and High), ability-by-gender (Female and Male),
and ability-by-ethnicity (Asian, African-American, Mexican-American, Other Hispanic, White, and Other (e.g., mixed race)). However,
results for the American Indian and Puerto Rican ethnic groups are not presented due to
low sample sizes (N<5). Ability levels were created based on the scale score for an
examinee for each content area: 20-39 Low ability, 40-59 Medium ability, and 60-80
High ability.
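The aggregation step can be sketched in Python as shown below. This is an illustrative stand-in for the SAS macros used in the study, not a reproduction of them. The record layout (dictionaries with "hci", "ability", "gender", and "ethnicity" fields) and the function name are assumptions; the minimum group size of 5 mirrors the N<5 reporting rule stated above.

```python
from collections import defaultdict
from statistics import mean

MIN_N = 5  # groups with N < 5 are not reported, as in the text


def average_hci(records, keys, min_n=MIN_N):
    """Average HCI for every combination of the given grouping fields.

    records: iterable of dicts holding an "hci" value plus grouping fields.
    keys: tuple of field names, e.g. ("ability",) or ("ability", "gender").
    Returns {group: (n, mean_hci)}, omitting groups smaller than min_n.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[k] for k in keys)].append(rec["hci"])
    return {g: (len(v), mean(v)) for g, v in groups.items() if len(v) >= min_n}
```

Calling the function with ("ability",), ("ability", "gender"), and ("ability", "ethnicity") yields the three sets of averages described in this section, and each student contributes to all three.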
Stage 1: Developing the Cognitive Models
For NMSQT/PSAT® Mathematics and Critical Reading, Stage 1 was completed
in two steps. In the first step, Gierl, Roberts, Alves, and Gotzmann (2009) developed
preliminary cognitive models. This development work was undertaken so the content
specialists would have a starting point for creating their cognitive models. Three
College Board research papers provided the starting points for the preliminary
models: Developing Skill Categories for the SAT Math Section by O’Callaghan, Morley & Schwartz (2004),
Toward a Construct of Critical Reading for the New SAT by VanderVeen (2004), and
the Performance Category Descriptions for the Critical Reading, Mathematics, and
Writing Sections of the SAT (2007), also known as the SAT Scale Anchoring Study. O’Callaghan et al.
(2004) and VanderVeen (2004) described several cognitive skill categories identified by
content specialists, after reviewing large numbers of previously administered SAT
Mathematics and Critical Reading items. Their cognitive skill categories ranged from
simple to complex. The authors of this study assisted in creating the preliminary
cognitive models with linear ordering of cognitive skills to assist content experts in the
second step.
In the second step of Stage 1, five content specialists nominated by The
College Board (three Mathematics and two Critical Reading) reviewed the preliminary
cognitive models with the intention of making appropriate modifications, given a
particular emphasis on the identification of the appropriate skills and on the ordering of
these skills. They were also asked to evaluate the skills in each cognitive model for their
measurability and instructional relevance. That is, the content specialists were
instructed to modify the initial models in light of the characteristics required of cognitive
models for CDA (e.g., measurability, grain size, and instructional relevance). All five
content specialists had extensive mathematics and reading backgrounds as well as
teaching and test development experience.
The content specialists scrutinized the wording of each skill descriptor to ensure
it would be clear and meaningful to teachers. Any measurable and
instructionally relevant process skills were also added to the cognitive models. In total,
54 cognitive models were created in NMSQT/PSAT® Mathematics and Critical
Reading. The cognitive models discussed in this paper are shown in
Figures 2A, 2B, 3A, 3B, 4A, and 4B. For each of these cognitive models, only one item
was mapped to each cognitive skill, as indicated below.
Figure 2A shows the cognitive model for Mathematics under the sub-category of
Algebra and Functions which is not currently labeled. There were three items mapped
to three of the cognitive skills 2.4.1, 2.4.2, and 2.4.3. Figure 2B shows the cognitive
model for Mathematics under the sub-category of Geometry and Measurement which is
not currently labeled. There were four items mapped to four of the cognitive skills 3.7.1,
3.7.2, 3.7.3, and 3.7.4. Figure 3A shows the cognitive model for Mathematics under the
sub-category of Number and Operations which is not currently labeled. There were
two items mapped to two cognitive skills 1.8.1 and 1.8.2. Figure 3B shows the
cognitive model for Critical Reading under the sub-category of Determining Meaning
labeled “Context”. There were four items mapped to four of the cognitive skills 1.c.1,
1.c.2, 1.c.3, and 1.c.4. Figure 4A shows the cognitive model for Critical Reading under
the sub-category of Author’s Craft labeled “Rhetorical and Stylistic Devices”. There
were six items mapped to six of the cognitive skills 3.c.1, 3.c.2, 3.c.3, 3.c.4, 3.c.5, and
3.c.6. Figure 4B shows the cognitive model for Critical Reading under the sub-category
of Reasoning and Inferencing which is labeled “Generalizing”. There were four items
mapped to four of the cognitive skills 4.c.1, 4.c.2, 4.c.3, and 4.c.4.
Stage 2: Mapping items to each Cognitive model
In the second stage, existing items were mapped from the 2006 NMSQT/PSAT®
administration to each of the linear cognitive models. A set of items was provided to the
content experts in Mathematics and Critical Reading and they aligned the items to the
skills in each cognitive model created in Stage 1. Unfortunately, because this task
mapped existing items to the cognitive models, some cognitive skills
in some of the 54 cognitive models are not represented by items. For this
study, however, we used only complete cognitive models to evaluate model-data-fit for students at
different ability levels.
Stage 3: HCI Calculations and Model Evaluations
In the third stage, individual HCI values were calculated for the sample of 5000
students. Several macros were created in SAS to calculate the HCI values for each
examinee for each cognitive model. Average HCI values by ability, ability-by-gender, and ability-by-ethnicity were also calculated. Six models were selected,
three for Mathematics and three for Critical Reading. For the three Mathematics cognitive models,
the overall average HCI values were considered good (i.e., HCI greater than 0.70),
and the values for the three Critical Reading cognitive models were lower (i.e., 0.65,
0.48, and 0.58). The Critical Reading models were selected because the
models were fully represented by items and the overall average HCI was not too far
from the 0.70 criterion. These cognitive models were selected to evaluate and compare
ability categories with good fitting and moderate fitting models across different content
areas.
For each cognitive model, overall average HCI values and standard deviations
were calculated for the sample of 5000 students, as well as average HCI values for ability
(Low, Medium and High), ability-by-gender (Low ability Females, Medium ability
Females, High ability Females, Low ability Males, Medium ability Males, High ability
Males), and ability-by-ethnicity (Low ability Asians, Medium ability Asians, High ability
Asians, Low ability African-Americans, Medium ability African-Americans, High ability
African-Americans, Low ability Mexican-Americans, Medium ability Mexican-Americans,
High ability Mexican-Americans, Low ability Other Hispanics, Medium ability Other
Hispanics, High ability Other Hispanics, Low ability Whites, Medium ability Whites, High
ability Whites, Low ability Others, Medium ability Others, and High ability Others). Two
of the subgroups for ethnicity are not presented due to small sample sizes (i.e.,
American Indian and Puerto Rican).
Results
The average HCI values for ability, ability-by-gender, and ability-by-ethnicity are
presented for each cognitive model in Mathematics and Critical Reading. The results
are presented in six tables (one for each cognitive model in Mathematics and Critical
Reading) and graphically in six figures (Figures 5 through 10). As a reminder, average
HCI values above 0.70 indicate good model-data-fit as cited by Cui & Leighton (2009).
The results presented in tables and graphs for Mathematics will be presented first, and
then Critical Reading.
Mathematics
Table 1 shows the overall HCI values aggregated by ability, ability-by-gender,
and ability-by-ethnicity for the Mathematics cognitive model represented in Figure 2A.
The average HCI values increased with ability from
0.67 to 0.94 overall, from 0.67 to 0.96 for Females, and from 0.66 to 0.93
for Males. A similar trend was seen for the ability-by-ethnicity sub-categorizations.
These results are more apparent graphically in Figure 5, where average
HCI values increased as ability increased for all categorizations. In all
instances for the Medium and High ability groups, average HCI values were above the
0.70 criterion. The average HCI values were below 0.70 for almost all subgroups in the
Low ability category but were relatively close to the 0.70 criterion. The exceptions were
the Other and Asian ethnic subgroups, which were higher than the 0.70 criterion.
Table 2 shows the overall HCI values aggregated by ability, ability-by-gender,
and ability-by-ethnicity for the Mathematics cognitive model represented in Figure 2B.
The average HCI values increased with ability from 0.76 to 0.95
overall, from 0.77 to 0.95 for Females, and from 0.75 to
0.95 for Males. A similar trend was seen for the ability-by-ethnicity categorizations.
These results are represented graphically in Figure 6, where average HCI
values increased as ability increased for all categorizations. In almost all
instances for the various ability categorizations, average HCI values were above the 0.70
criterion, with the exception of the Asian Low ability subgroup which was below 0.70.
However, this value was still relatively close to the 0.70 criterion (i.e., 0.66).
Table 3 shows the overall HCI values aggregated by ability, ability-by-gender,
and ability-by-ethnicity for the Mathematics cognitive model represented in Figure 3A.
The average HCI values across ability did not consistently increase from lowest
ability to highest ability. The average HCI values ranged from 0.81 to 0.83, indicating
relatively small differences, and all of the average HCI values were relatively high.
The pattern seen in the other Mathematics cognitive models also did not occur
for the ability-by-gender sub-categorization. Females’ average HCI values increased from
0.83 to 0.88, whereas Males showed an inconsistent pattern ranging from 0.78 to 0.82, with the
value for Medium ability slightly smaller at 0.77. Similarly, no consistent pattern
was seen for the ability-by-ethnicity sub-categorizations. The highest average HCI
value occurred in the Low ability group for some subgroups (i.e., Asian and African-American),
in the Medium ability group for others (i.e., Mexican-American and Other), and in the
High ability group for the rest (i.e.,
Other Hispanic and White). However, even for those subgroups where High ability had
the highest average HCI values, the Low ability groups were the next highest with 0.82.
These results are presented graphically in Figure 7, where no consistent trend was
seen and each subgroup showed a different pattern. In almost all instances the average
HCI values were above the 0.70 criterion, with the exception of the Mexican-American
High ability subgroup with a value of 0.64. The remaining values were all well
above the 0.70 criterion.
Critical Reading
Table 4 shows the overall HCI values aggregated by ability, ability-by-gender,
and ability-by-ethnicity for the Critical Reading cognitive model represented in Figure
3B. The average HCI values increased with ability from 0.50 to
0.93 overall, from 0.53 to 0.92 for Females, and from
0.46 to 0.94 for Males. A similar trend was seen for the ability-by-ethnicity
categorizations. These results are presented graphically in Figure 8, where
average HCI values increased as ability increased for all categorizations.
For the Medium ability category, the average HCI values for the
Mexican-American and African-American groups were below the 0.70 criterion. Also,
the average HCI values were below 0.70 for all subgroups in the Low ability category.
Table 5 shows the overall HCI values aggregated by ability, ability-by-gender,
and ability-by-ethnicity for the Critical Reading cognitive model represented in Figure
4A. The average HCI values increased with ability from 0.38 to
0.74 overall, from 0.39 to 0.72 for Females, and from
0.38 to 0.76 for Males. A similar trend was seen for the ability-by-ethnicity
categorizations. These results are presented graphically in Figure 9, where
average HCI values increased as ability increased for all categorizations.
Because the overall average HCI was lower for this cognitive model (i.e., 0.48), most of the
sub-categorizations were below the 0.70 criterion. The exceptions were the High ability
category overall, for Females and Males, and for the Asian,
Mexican-American, White and Other High ability subgroups. However, the overall
pattern followed the first two cognitive models presented for Mathematics and the first
cognitive model in Critical Reading, where the Low ability groups had the lowest
average HCI values regardless of gender or ethnicity sub-categorization.
Table 6 shows the overall HCI values aggregated by ability, ability-by-gender,
and ability-by-ethnicity for the Critical Reading cognitive model represented in Figure
4B. The average HCI values increased from 0.46 to 0.84 as ability increased overall,
from 0.48 to 0.83 for Females, and from 0.43 to 0.84 for Males. A similar trend was
seen for the ability-by-ethnicity categorizations. These results are presented
graphically in Figure 10, which shows average HCI values increasing with ability for all
categorizations. Since the overall average HCI was lower for this cognitive model, similar
to the previously presented cognitive models in Critical Reading, most of the
sub-categorizations were below the 0.70 criterion. The exceptions were the High ability
category overall, for Females and Males, and for all ethnic subgroups. However, the
overall pattern followed the first two cognitive models presented for Mathematics and the
first two cognitive models in Critical Reading, where the Low ability groups had the
lowest overall HCI values regardless of gender or ethnicity sub-categorizations.
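The comparison against the 0.70 criterion can be checked directly from the reported means. The following is a minimal sketch in Python that uses the mean HCI values from Table 6; the dictionary layout and the function name below_criterion are illustrative assumptions and are not part of the analysis reported here.

```python
# Mean HCI values copied from Table 6, ordered as (Low, Medium, High) ability.
# The data structure and names are illustrative assumptions.
TABLE6_MEANS = {
    "Overall": (0.458, 0.598, 0.837),
    "Female": (0.480, 0.589, 0.830),
    "Male": (0.434, 0.610, 0.844),
    "Asian": (0.445, 0.495, 0.726),
    "African-American": (0.410, 0.562, 0.895),
    "Mexican-American": (0.542, 0.557, 0.889),
    "Other Hispanic": (0.470, 0.516, 0.815),
    "White": (0.476, 0.620, 0.852),
    "Other": (0.363, 0.692, 0.761),
}
LEVELS = ("Low", "Medium", "High")
CRITERION = 0.70  # criterion for acceptable model-data fit used in the text


def below_criterion(means, criterion=CRITERION):
    """Return (group, ability level) pairs whose mean HCI is below the criterion."""
    return [
        (group, level)
        for group, values in means.items()
        for level, value in zip(LEVELS, values)
        if value < criterion
    ]


flagged = below_criterion(TABLE6_MEANS)
levels_flagged = {level for _, level in flagged}
print(sorted(levels_flagged))
```

For Table 6, every flagged cell falls in the Low or Medium ability category, which matches the description above.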
Discussion
These results show how model-data fit statistics, such as the HCI, can be used to
determine how well student responses fit a cognitive model. They also indicate that fit
can be evaluated at a fine level of detail, across ability levels within gender and
ethnic subgroups. Because the HCI is calculated for each student, HCI values can be
examined individually or aggregated across students to help find clues to students’
cognitive understanding.
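To make the per-student calculation concrete, the sketch below follows the general form of the index described by Cui and Leighton (2009): for each item a student answers correctly, every other item that requires a subset of its attributes should also be answered correctly, and the HCI equals 1 minus twice the proportion of such comparisons that misfit. The three-item Q-matrix, the function name hci, the exclusion of an item's comparison with itself, and the choice to return None when no comparisons exist are assumptions made for illustration only.

```python
from typing import Optional, Sequence


def hci(responses: Sequence[int], q_matrix: Sequence[Sequence[int]]) -> Optional[float]:
    """Sketch of the Hierarchy Consistency Index for one student.

    responses: 0/1 scored item responses.
    q_matrix:  one row per item, 1 where the item requires the attribute.
    Returns None when no comparisons are possible (an assumption of this sketch).
    """
    attrs = [frozenset(k for k, a in enumerate(row) if a) for row in q_matrix]
    comparisons = 0
    misfits = 0
    for j, x_j in enumerate(responses):
        if x_j != 1:
            continue  # only correctly answered items generate comparisons
        for g, x_g in enumerate(responses):
            # compare with other items whose attributes are a subset of item j's
            if g == j or not attrs[g] <= attrs[j]:
                continue
            comparisons += 1
            misfits += 1 - x_g  # an incorrect answer on a prerequisite item is a misfit
    if comparisons == 0:
        return None
    return 1 - 2 * misfits / comparisons


# Hypothetical linear hierarchy: item 1 needs A1; item 2 needs A1, A2; item 3 needs A1-A3.
Q = [[1, 0, 0], [1, 1, 0], [1, 1, 1]]
consistent = hci([1, 1, 1], Q)    # all prerequisite items answered correctly
inconsistent = hci([0, 0, 1], Q)  # hardest item correct, prerequisites incorrect
mixed = hci([1, 0, 1], Q)         # one of two prerequisite comparisons misfits
```

In this sketch a fully consistent pattern gives 1.0, a fully inconsistent pattern gives -1.0, and the mixed pattern gives 0.0.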
With the exception of one cognitive model, all of the results show that the Low
ability groups had lower average HCI values and that average HCI values increased for
the Medium and High ability groups when evaluated overall, for ability-by-gender, and
for ability-by-ethnicity. These results indicate that as overall ability increases, the
model-data fit indices also increase. This could be a function of ill-specified cognitive
models for lower ability examinees or of random guessing influencing the model-data fit
indices. The only instance in which this pattern was not seen was the third Mathematics
cognitive model. However, this model had the fewest attributes, with only two items
measuring two skills. Additional small cognitive models, such as the one in Figure 3A,
that are represented by more than one item per attribute should be evaluated to see
whether a similar pattern occurs, as the small number of attributes could have
influenced the results.
The results of this study suggest that further research is needed in developing
cognitive models for lower ability examinees. A given cognitive model may fit a sample
of examinees across ability levels well overall, but the cognitive model may not fit
subgroups of examinees equally well. Developing cognitive models for lower ability
examinees is critically important, given that this is the intended target population of
diagnostic assessment. Modifications to estimating model-data fit by accounting for
guessing in student responses could also assist with evaluating cognitive models more
precisely. The results of this study can assist in the process of evaluating cognitive
models for use in diagnostic assessments, providing new information to test developers,
teachers, and administrators to assist in student learning and teaching.
References
Angoff, W. H. (1993). Perspectives on differential item functioning methodology. In P.
W. Holland, & H. Wainer (Eds.). Differential item functioning (pp. 2-24). Hillsdale,
NJ: Lawrence Erlbaum.
Camilli, G., & Shepard, L. A. (1994). Methods for identifying biased test items.
Thousand Oaks, CA: Sage.
Engelhard Jr., G., Hansche, L., & Rutledge, K. E. (1990). Accuracy of bias review
judges in identifying differential item functioning on teacher certification tests.
Applied Measurement in Education, 3, 347-360. doi:
10.1207/s15324818ame0304_4
Cui, Y., & Leighton, J. P. (2009). The hierarchy consistency index: Evaluating person fit
for cognitive diagnostic assessment. Journal of Educational Measurement, 46,
429-449. doi: 10.1111/j.1745-3984.2009.00091.x
Dorans, N. J., Schmitt, A. P., & Bleistein, C. A. (1992). The standardization approach to
assessing comprehensive differential item functioning. Journal of Educational
Measurement, 29, 309-319. doi: 10.1111/j.1745-3984.1992.tb00379.x
Ericsson, K. A. & Simon, H. A. (1993). Protocol analysis: Verbal reports as data.
Cambridge, MA: MIT Press.
Gierl, M. J., Khaliq, S., & Boughton, K. (1999). Gender differential item functioning in
mathematics and science: Prevalence and policy implications. Paper presented
at the annual meeting of the Canadian Society for the Study of Education,
Sherbrooke, Quebec.
Gierl, M. J., & McEwen, N. (1998). Differential item functioning on the Alberta Education
Social Studies 30 diploma exams. Paper presented at the annual meeting of the
Canadian Society for Studies in Education, Ottawa, Ontario, Canada.
Gierl, M. J., Roberts, M., Alves, C., & Gotzmann, A. (2009). Using judgments from
content specialists to develop cognitive models for diagnostic assessments.
Paper presented at the annual meeting of the National Council on Measurement
in Education, San Diego, CA.
Gotzmann, A., Roberts, M., Alves, C., & Gierl, M. J. (2009). Using cognitive models to
evaluate ethnicity and gender differences. Paper presented at the annual
meeting of the American Educational Research Association, San Diego, CA.
Kuhn, D. (2001). Why development does (and does not) occur: Evidence from the
domain of inductive reasoning. In J. L. McClelland & R. Siegler (Eds.),
Mechanisms of cognitive development: Behavioral and neural perspectives (pp.
221-249). Hillsdale, NJ: Erlbaum.
Leighton, J.P., & Gierl, M.J. (2007). Defining and evaluating models of cognition used in
educational measurement to make inferences about examinees’ thinking
processes. Educational Measurement: Issues and Practice, 26, 3–16. doi:
10.1111/j.1745-3992.2007.00090.x
Leighton, J. P., Gierl, M. J., & Hunka, S. M. (2004). The attribute hierarchy model for
cognitive assessment: A variation on Tatsuoka's rule-space approach. Journal of
Educational Measurement, 41, 205-237. doi: 10.1111/j.1745-3984.2004.tb01163.x
Linn, R. L., Baker, E. L., & Betebenner, D. W. (2002). Accountability systems:
Implications of requirements of the No Child Left Behind Act of 2001. Educational
Researcher, 31(6), 3-16. doi: 10.3102/0013189X031006003
No Child Left Behind Act of 2001, Pub. L. No. 107-110, 115 Stat. 1425 (2002).
O’Callaghan, R.K., Morley, M.E., & Schwartz, A. (2004). Developing skill categories for
the SAT Math section. Paper presented at the meeting of the National Council on
Measurement in Education, San Diego, CA.
O’Neill, K. A., & McPeek, W. M. (1993). Item and test characteristics that are associated
with differential item functioning. In P. W. Holland & H. Wainer (Eds.), Differential
item functioning (pp. 255-276). Hillsdale, NJ: Lawrence Erlbaum.
Parshall, C. G., & Miller, T. R. (1995). Exact versus asymptotic Mantel-Haenszel DIF
statistics: A comparison of performance under small-sample conditions. Journal
of Educational Measurement, 32, 302-316. doi: 10.1111/j.1745-3984.1995.tb00469.x
Schmitt, A. P. (1988). Language and cultural characteristics that explain differential item
functioning for Hispanic examinees on the Scholastic Aptitude Test. Journal of
Educational Measurement, 25, 1-13. doi: 10.1111/j.1745-3984.1988.tb00287.x
Shepard, L. A., Camilli, G., & Williams, D. M. (1985). Validity of approximation
techniques for detecting item bias. Journal of Educational Measurement, 22,
77-105. doi: 10.1111/j.1745-3984.1985.tb01050.x
VanderVeen, A. (2004). Toward a construct of Critical Reading for the new SAT. Paper
presented at the meeting of the National Council on Measurement in Education,
San Diego, CA.
Zwick, R., & Ercikan, K. (1989). Analysis of differential item functioning in the NAEP
history assessment. Journal of Educational Measurement, 26, 55-66. doi:
10.1111/j.1745-3984.1989.tb00318.x
Table 1.
HCI values for Mathematics Algebra and Functions Hierarchy by ability,
ability-by-gender, and ability-by-ethnicity

Category / Group      Ability   Mean    SD      N
Ability
                      Low       0.669   0.707   1443
                      Medium    0.839   0.489   2949
                      High      0.942   0.251    608
Ability-by-Gender
  Female              Low       0.673   0.704    819
                      Medium    0.839   0.484   1597
                      High      0.956   0.217    252
  Male                Low       0.664   0.710    624
                      Medium    0.838   0.494   1352
                      High      0.932   0.273    356
Ability-by-Ethnicity
  Asian               Low       0.714   0.680     56
                      Medium    0.852   0.440    182
                      High      0.861   0.363     84
  African-American    Low       0.659   0.716    464
                      Medium    0.787   0.542    300
                      High      0.950   0.224     20
  Mexican-American    Low       0.633   0.724    140
                      Medium    0.810   0.547    170
                      High      0.909   0.302     11
  Other Hispanic      Low       0.630   0.739    181
                      Medium    0.860   0.455    212
                      High      1.000   0.000     19
  White               Low       0.682   0.701    484
                      Medium    0.851   0.476   1920
                      High      0.954   0.231    450
  Other               Low       0.793   0.585     58
                      Medium    0.828   0.475     99
                      High      0.950   0.224     20
Table 2.
HCI values for Mathematics Geometry and Measurement Hierarchy by ability,
ability-by-gender, and ability-by-ethnicity

Category / Group      Ability   Mean    SD      N
Ability
                      Low       0.760   0.623   1443
                      Medium    0.874   0.464   2949
                      High      0.953   0.246    608
Ability-by-Gender
  Female              Low       0.769   0.619    819
                      Medium    0.862   0.490   1597
                      High      0.952   0.245    252
  Male                Low       0.747   0.628    624
                      Medium    0.887   0.430   1352
                      High      0.954   0.249    356
Ability-by-Ethnicity
  Asian               Low       0.661   0.721     56
                      Medium    0.885   0.424    182
                      High      0.984   0.145     84
  African-American    Low       0.743   0.648    464
                      Medium    0.862   0.484    300
                      High      0.933   0.298     20
  Mexican-American    Low       0.771   0.600    140
                      Medium    0.908   0.401    170
                      High      1.000   0.000     11
  Other Hispanic      Low       0.744   0.636    181
                      Medium    0.822   0.550    212
                      High      1.000   0.000     19
  White               Low       0.785   0.586    484
                      Medium    0.881   0.451   1920
                      High      0.952   0.238    450
  Other               Low       0.828   0.566     58
                      Medium    0.869   0.466     99
                      High      0.900   0.447     20
Table 3.
HCI values for Mathematics Number Operations Hierarchy by ability, ability-by-gender,
and ability-by-ethnicity

Category / Group      Ability   Mean    SD      N
Ability
                      Low       0.827   0.563   1443
                      Medium    0.813   0.582   2949
                      High      0.822   0.569    608
Ability-by-Gender
  Female              Low       0.832   0.556    819
                      Medium    0.848   0.529   1597
                      High      0.881   0.474    252
  Male                Low       0.821   0.572    624
                      Medium    0.772   0.636   1352
                      High      0.781   0.626    356
Ability-by-Ethnicity
  Asian               Low       0.857   0.520     56
                      Medium    0.703   0.713    182
                      High      0.810   0.591     84
  African-American    Low       0.845   0.536    464
                      Medium    0.833   0.554    300
                      High      0.800   0.616     20
  Mexican-American    Low       0.829   0.562    140
                      Medium    0.859   0.514    170
                      High      0.636   0.809     11
  Other Hispanic      Low       0.790   0.615    181
                      Medium    0.868   0.498    212
                      High      1.000   0.000     19
  White               Low       0.818   0.576    484
                      Medium    0.804   0.595   1920
                      High      0.831   0.557    450
  Other               Low       0.759   0.657     58
                      Medium    0.919   0.396     99
                      High      0.700   0.733     20
Table 4.
HCI values for Critical Reading Determining Meaning-“Context” Hierarchy by ability,
ability-by-gender, and ability-by-ethnicity

Category / Group      Ability   Mean    SD      N
Ability
                      Low       0.495   0.795   1610
                      Medium    0.741   0.554   2781
                      High      0.927   0.184    609
Ability-by-Gender
  Female              Low       0.531   0.774    834
                      Medium    0.757   0.527   1517
                      High      0.920   0.193    317
  Male                Low       0.457   0.815    776
                      Medium    0.722   0.585   1264
                      High      0.935   0.172    292
Ability-by-Ethnicity
  Asian               Low       0.465   0.777    108
                      Medium    0.704   0.566    166
                      High      0.932   0.199     48
  African-American    Low       0.488   0.807    451
                      Medium    0.667   0.638    311
                      High      0.903   0.300     22
  Mexican-American    Low       0.574   0.751    158
                      Medium    0.682   0.589    151
                      High      1.000   0.000     12
  Other Hispanic      Low       0.523   0.767    207
                      Medium    0.710   0.635    181
                      High      0.967   0.113     24
  White               Low       0.506   0.789    563
                      Medium    0.763   0.528   1822
                      High      0.923   0.180    469
  Other               Low       0.411   0.840     70
                      Medium    0.751   0.497     78
                      High      0.938   0.208     29
Table 5.
HCI values for Critical Reading Author’s Craft-“Rhetorical and Stylistic Devices”
Hierarchy by ability, ability-by-gender, and ability-by-ethnicity

Category / Group      Ability   Mean    SD      N
Ability
                      Low       0.383   0.828   1610
                      Medium    0.481   0.642   2781
                      High      0.738   0.381    609
Ability-by-Gender
  Female              Low       0.390   0.823    834
                      Medium    0.486   0.643   1517
                      High      0.720   0.392    317
  Male                Low       0.376   0.833    776
                      Medium    0.475   0.642   1264
                      High      0.758   0.368    292
Ability-by-Ethnicity
  Asian               Low       0.253   0.887    108
                      Medium    0.489   0.670    166
                      High      0.780   0.354     48
  African-American    Low       0.321   0.850    451
                      Medium    0.403   0.716    311
                      High      0.643   0.575     22
  Mexican-American    Low       0.382   0.837    158
                      Medium    0.420   0.686    151
                      High      0.781   0.436     12
  Other Hispanic      Low       0.483   0.795    207
                      Medium    0.462   0.643    181
                      High      0.696   0.369     24
  White               Low       0.422   0.809    563
                      Medium    0.497   0.627   1822
                      High      0.729   0.377    469
  Other               Low       0.450   0.766     70
                      Medium    0.495   0.599     78
                      High      0.891   0.253     29
Table 6.
HCI values for Critical Reading Reasoning and Inferencing-“Generalizing” Hierarchy by
ability, ability-by-gender, and ability-by-ethnicity

Category / Group      Ability   Mean    SD      N
Ability
                      Low       0.458   0.781   1610
                      Medium    0.598   0.565   2781
                      High      0.837   0.345    609
Ability-by-Gender
  Female              Low       0.480   0.752    834
                      Medium    0.589   0.569   1517
                      High      0.830   0.349    317
  Male                Low       0.434   0.811    776
                      Medium    0.610   0.561   1264
                      High      0.844   0.340    292
Ability-by-Ethnicity
  Asian               Low       0.445   0.810    108
                      Medium    0.495   0.644    166
                      High      0.726   0.435     48
  African-American    Low       0.410   0.833    451
                      Medium    0.562   0.597    311
                      High      0.895   0.280     22
  Mexican-American    Low       0.542   0.720    158
                      Medium    0.557   0.559    151
                      High      0.889   0.385     12
  Other Hispanic      Low       0.470   0.750    207
                      Medium    0.516   0.642    181
                      High      0.815   0.360     24
  White               Low       0.476   0.758    563
                      Medium    0.620   0.546   1822
                      High      0.852   0.327    469
  Other               Low       0.363   0.837     70
                      Medium    0.692   0.503     78
                      High      0.761   0.418     29
Figure 1. Example 5-attribute cognitive model.
[Hierarchy diagram not reproduced; attribute labels shown: 2.4.1, 2.4.2, 2.4.3]
2.4.3 = Solve for one variable or expression in terms of another
[Solve literal equations]
2.4.2 = Use variables in multi-step abstract settings (e.g.,
applying the distributive property across several variables)
2.4.1 = Use a letter as a placeholder for an unknown value
Figure 2A. Mathematics Algebra and Functions cognitive model.
[Hierarchy diagram not reproduced; attribute labels shown: 3.7.1, 3.7.2, 3.7.3, 3.7.4]
3.7.4 = Determine the effect of changes in the linear dimension
(e.g. length of a segment or perimeter) of a figure on other
3.7.3 = Solve multi-step problems involving volume
3.7.2 = Solve multi-step problems involving areas of figures
composed of two or more simple figures
3.7.1 = Determine distance on a number line
Figure 2B. Mathematics Geometry and Measurement cognitive model.
[Hierarchy diagram not reproduced; attribute labels shown: 1.8.1, 1.8.2]
1.8.2 = Determine the values or properties of numbers in a
sequence when given a description of a sequence
1.8.1 = Identify a rule that describes a numerical pattern in a
sequence
Figure 3A. Mathematics Number Operations cognitive model.
[Hierarchy diagram not reproduced; attribute labels shown: 1.c.1, 1.c.2, 1.c.3, 1.c.4]
1.c.4 = Use context to determine how an uncommon word fits in a sentence
1.c.3 = Use the context of larger sections of text to determine the meaning of uncommon words or to differentiate among multiple possible meanings of words
1.c.2 = Use the context of the sentence to differentiate among multiple possible meanings of words
1.c.1 = Use the context of the sentence to determine how a common word fits in a sentence
Figure 3B. Critical Reading Determining Meaning-“Context” cognitive model.
[Hierarchy diagram not reproduced; attribute labels shown: 3.c.1, 3.c.2, 3.c.3, 3.c.4, 3.c.5, 3.c.6]
3.c.6 = Determine the purposes of authors’ rhetorical and stylistic choices across texts
3.c.5 = Identify how an author’s rhetorical or stylistic choices support a particular perspective or position
3.c.4 = Identify the purpose and effect of sophisticated literary devices such as irony and sarcasm
3.c.3 = Identify connotations of language based on context
3.c.2 = Identify the purpose and effect of literary devices such as metaphor or understatement
3.c.1 = Recognize examples of literary devices such as metaphor or understatement
Figure 4A. Critical Reading Author's Craft-“Rhetorical and Stylistic Devices” cognitive
model.
[Hierarchy diagram not reproduced; attribute labels shown: 4.c.1, 4.c.2, 4.c.3, 4.c.4]
4.c.4 = Identify and compare generalizations or themes across texts
4.c.3 = Identify abstract ideas and principles based on cumulative inferences
4.c.2 = Recognize how broad generalized ideas can be used to lead to abstract ideas or positions
4.c.1 = Identify generalizations or themes supported by the text
Figure 4B. Critical Reading Reasoning and Inferencing-“Generalizing” cognitive model.
[Plot not reproduced. Y-axis: HCI average (0.0 to 1.0). X-axis: Scale Score categories (Low ability, Medium ability, High ability). Series: Overall, Female, Male, Asian, African-American, Mexican-American, White, Other Hispanic, Other.]
Figure 5. Average HCI values for ability, ability-by-gender, and ability-by-ethnicity for Mathematics Algebra and Functions
Cognitive Model
[Plot not reproduced. Y-axis: HCI average (0.0 to 1.0). X-axis: Scale Score categories (Low ability, Medium ability, High ability). Series: Overall, Female, Male, Asian, African-American, Mexican-American, White, Other Hispanic, Other.]
Figure 6. Average HCI values for ability, ability-by-gender, and ability-by-ethnicity for Mathematics Geometry and
Measurement Cognitive Model
[Plot not reproduced. Y-axis: HCI average (0.0 to 1.0). X-axis: Scale Score categories (Low ability, Medium ability, High ability). Series: Overall, Female, Male, Asian, African-American, Mexican-American, White, Other Hispanic, Other.]
Figure 7. Average HCI values for ability, ability-by-gender, and ability-by-ethnicity for Mathematics Number Operations
Cognitive Model
[Plot not reproduced. Y-axis: HCI average (0.0 to 1.0). X-axis: Scale Score categories (Low ability, Medium ability, High ability). Series: Overall, Female, Male, Asian, African-American, Mexican-American, White, Other Hispanic, Other.]
Figure 8. Average HCI values for ability, ability-by-gender, and ability-by-ethnicity for Critical Reading Determining
Meaning-“Context” Cognitive Model
[Plot not reproduced. Y-axis: HCI average (0.0 to 1.0). X-axis: Scale Score categories (Low ability, Medium ability, High ability). Series: Overall, Female, Male, Asian, African-American, Mexican-American, White, Other Hispanic, Other.]
Figure 9. Average HCI values for ability, ability-by-gender, and ability-by-ethnicity for Critical Reading Author’s Craft-“Rhetorical and Stylistic Devices” Cognitive Model
[Plot not reproduced. Y-axis: HCI average (0.0 to 1.0). X-axis: Scale Score categories (Low ability, Medium ability, High ability). Series: Overall, Female, Male, Asian, African-American, Mexican-American, White, Other Hispanic, Other.]
Figure 10. Average HCI values for ability, ability-by-gender, and ability-by-ethnicity for Critical Reading Reasoning and
Inferencing-“Generalizing” Cognitive Model