Standard Setting via the Fat Anchor
Professor Jim Ridgway
School of Education, University of Durham, UK
Abstract: The paper begins by considering some criteria for the design of
assessment systems, such as validity, reliability and practicality.
Components of assessment systems in educational contexts are
considered, emphasising the design of tasks, conceptual frameworks and
calibration systems. If assessment is to be ‘authentic’ on most definitions
of education, new tasks must be given on each testing occasion. This
gives rise to problems comparing the performances of students on different
testing occasions. A new solution to this problem is offered in the form of
the Fat Anchor. This is a super test, components of which are
administered to samples of students on each testing occasion.
Performances on adjacent testing occasions are compared by reference to
performance on the Fat Anchor. Initial work in the USA to establish the Fat
Anchor in mathematics both for year to year comparability at the same age
level, and also across age levels (‘vertical scaling’) is described.
Note: The ideas in the paper apply to all aspects of educational testing.
Throughout this document, testing mathematical attainment will be used to exemplify
principles.
1. CRITERIA FOR ASSESSMENT DESIGN
As with most areas of design, the quality of the design can be judged in terms of
'fitness for purpose'. Tests are designed for a variety of purposes, and so the criteria
for judging a particular test will shift with its intended use; the same
test may be well suited to one purpose and ill-suited to another. A loose set of
criteria can be set out under the heading of 'educational validity'. Educational validity
encompasses a number of aspects which are set out below.
Construct validity: there is a need for a clear statement of what is being
assessed, which aligns with informed ideas in the field, and for supporting evidence
on the conceptual match between the domain definition and the tests used to assess
competence in that domain.
Reliability: tests must measure more than noise.
Practicality: few designers work in arenas where cost is irrelevant. In educational
settings, a major restriction on design is the total cost of the assessment system.
The key principle here is that test administration and scoring must be manageable
within existing financial resources, and should make a cost-effective contribution to the education of students.
Equity: equity issues must be addressed - inequitable tests are (by definition)
unfair and undesirable.
2. COMPONENTS OF TESTS IN EDUCATIONAL SETTINGS
Test systems have a number of components, which include:
tasks (the building blocks from which tests are made);
conceptual frameworks (descriptions of the domain, grounded in evidence);
tests (which are assemblies of tasks with known validity and reliability in the
critical score range);
test administration systems;
calibration systems (which are ways to look at standards);
cultural support systems (which are ways in which information about tests is
communicated to the educational community), materials which allow students to
become familiar with test content, materials which support teachers who are
preparing students to take the test, and so on.
Here, we focus mainly on tasks, conceptual frameworks and calibration systems.
3. THE DESIGN AND DEVELOPMENT OF TASKS
There is a need to create tasks, assemble tasks into tests which 'work', check out
what they really measure, then use them. This process involves a number of
different phases. For tests appropriate for educational uses, the nature of the tasks
needs to be considered carefully. The emphasis on a myriad of short items which
characterizes many psychometric tests is quite inappropriate. Such items are unsuitable for a number of reasons. The most obvious ones derive from considerations of
educational validity. A core activity in mathematics is to engage with problems which
take more than a minute or so to complete, and which require the solver to engage in
extended chains of reasoning. A second conceptual demand is that students make
choices about which aspects of their mathematical knowledge to use, and are
required to show that they can integrate knowledge from different aspects of
mathematics, for example, across algebra and geometry. Short items which atomize
the domain fail to address these demands. While there is a clear case to be made
that short items which assess technical knowledge have a place in mathematics
tests, there are considerable dangers in an over-dependence on tasks of this type.
Tests which are made up exclusively of short items not only violate current
conceptions of the domain they set out to assess, but also run into the problem of
combinatorial explosion - if the domain of mathematics is defined as a collection of
technical skills, a very large number of tasks is required to sample the different
aspects of mathematical technique.
4. THE DESIGN AND DEVELOPMENT OF TESTS
4.1 Responding to different design briefs
Tests serve a variety of functions which are more or less easy to satisfy. Putting
students into a rank order is usually easy, but setting cut points reliably (for pass/fail
decisions; to identify high flyers reliably) is more difficult. Monitoring standards over
time raises considerable conceptual problems (discussed below); despite these
conceptual difficulties, there are often political pressures for the educational
community to determine whether standards are rising or falling over time.
4.2 The Design and Development of Conceptual Frameworks
Test development should be associated with the development of a description of
the conceptual domain, and an account of the ways that the choice of tasks in any
test reflects the domain being sampled. In psychometrics, the plan used to support
the choice of test items is commonly referred to as the test 'blueprint'. Test
developers differ a good deal in the explicitness of their blueprint, in its openness to public scrutiny, in the explicitness of the match between specific test items and the blueprint, and in the extent to which that match is tested psychometrically.
The MARS group (http://www.nottingham.ac.uk/education/MARS/) has developed
a 'Framework for Balance' for mathematics tests. This is a conceptual analysis of the
domain of mathematics, which supports the design of tests, given a collection of
tasks. Such frameworks also support conversations and negotiations with client
groups for whom assessment systems are designed. No short test can hope to
cover every aspect of a substantial domain; tests almost always sample from the
domain. The key task for test constructors is to ensure reasonable sampling across
core aspects of the domain.
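As an illustration of how a framework of this kind can drive test assembly, the sketch below selects tasks from a pool so that each dimension of a hypothetical blueprint receives roughly its target share of the total score. The dimension names, task pool and weights are invented for illustration; they are not the MARS Framework for Balance itself.

```python
import random

# Hypothetical blueprint: target share of total score for each core dimension.
# These dimensions and weights are illustrative, not the MARS Framework for Balance.
BLUEPRINT = {"number": 0.3, "algebra": 0.3, "geometry": 0.2, "data_handling": 0.2}

# A small invented task pool: each task carries its dimension and its score points.
TASK_POOL = [
    {"id": f"T{i}", "dimension": d, "points": p}
    for i, (d, p) in enumerate([
        ("number", 4), ("number", 6), ("algebra", 8), ("algebra", 5),
        ("geometry", 6), ("geometry", 4), ("data_handling", 7), ("data_handling", 5),
        ("number", 5), ("algebra", 6), ("geometry", 5), ("data_handling", 4),
    ])
]

def assemble_balanced_test(pool, blueprint, total_points=30, tolerance=0.05, attempts=2000):
    """Randomly search for a task set whose score allocation matches the blueprint."""
    best, best_error = None, float("inf")
    for _ in range(attempts):
        candidate, points = [], 0
        for task in random.sample(pool, len(pool)):
            if points + task["points"] <= total_points:
                candidate.append(task)
                points += task["points"]
        if points == 0:
            continue
        # Total deviation between achieved and intended share for each dimension.
        error = 0.0
        for dim, target in blueprint.items():
            achieved = sum(t["points"] for t in candidate if t["dimension"] == dim) / points
            error += abs(achieved - target)
        if error < best_error:
            best, best_error = candidate, error
        if best_error <= tolerance * len(blueprint):
            break
    return best, best_error

test, error = assemble_balanced_test(TASK_POOL, BLUEPRINT)
print([t["id"] for t in test], round(error, 3))
```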
With any conceptual framework, there is a need to see if it is grounded in any
psychological reality. If one decides on theoretical grounds that a group of test items
measure essentially the same aspect of mathematics, one would be dismayed to
learn that student performance on one of these items was quite unrelated to
performance on other items. It follows that the theoretical accounts should be
validated against hard data. There are a number of ways to do this, most obviously
using structural equation modelling (SEM) or via confirmatory factor analysis (a
relevant special case of SEM). The essential task is to force the evidence from
student performance on different tasks into the conceptual structure, then to see if
the fit is acceptable.
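A minimal sketch of such a confirmatory check is given below, assuming the open-source Python package semopy for structural equation modelling; the factor names, item labels and data file are invented for illustration, and any CFA or SEM tool could be used in the same way.

```python
import pandas as pd
import semopy  # assumed: open-source SEM package (pip install semopy)

# Hypothetical item-level scores: columns alg1..alg3 and geo1..geo3 are
# invented labels for algebra and geometry items, one row per student.
data = pd.read_csv("item_scores.csv")

# Confirmatory model: the conceptual framework says the algebra items load on
# one factor and the geometry items on another, with the factors allowed to correlate.
model_description = """
Algebra  =~ alg1 + alg2 + alg3
Geometry =~ geo1 + geo2 + geo3
Algebra ~~ Geometry
"""

model = semopy.Model(model_description)
model.fit(data)

# Fit statistics (e.g. CFI, RMSEA) indicate whether forcing the data into the
# theoretical structure yields an acceptable fit.
print(semopy.calc_stats(model))
```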
4.3 Test Validity and Test Calibration
Validity is a complex topic which encompasses a number of dimensions (Messick,
1995; Ridgway and Passey, 1993). The focus here is on construct validity. A core
problem arises from the fact that any test samples student performance from a broad
spectrum of possible performances. The construct validity derives directly from the
tasks used. In areas where the domain definition is rich, this is particularly
problematic, because sampling is likely to be sparse.
One strategy is to design a test with a narrow construct validity - for example by
focussing on mathematical technique - and then using this narrowly defined test as a
proxy for the whole domain, with an appeal to 'general mathematical ability'. One
might even produce evidence that technical skills are strongly correlated with
conceptual skills, in education systems unaffected by narrow testing regimes.
Introducing proxy or partial measures is likely to lead students and teachers to focus
on developing a narrow range of skills, rather than developing skills across the whole
domain. This is likely to lead to changes in the predictive validity of the test, because
some students will have developed good conceptual skills, and other students rather
poor ones, depending on the extent to which teachers have chosen to teach to the
test.
Another way to approach the problem is to design the tests given in successive
years in such a way as to ensure a great similarity in test content. The extreme
version of this is to use exactly the same tests in successive years. A related method
is to define a narrow range of task types, and to create equivalent tasks by changing
either the surface features of the task ('in a pole vault competition'… becomes 'in a
high jump competition'…) or to change the numbers (3 builders use 12 bricks to…
becomes 4 builders use 12 bricks to…). This might be appropriate if the intended
construct validity is narrowly defined. An example might be a test of mathematical
technique in number work designed to assess student performance after a module
designed to promote good technique. The approach of parallel tests or equivalent
items is deeply problematic if the intended construct validity is rich, and involves the
assessment of some process skills, such as problem solving, as is commonly the
case in educational assessment.
The notion of task (or item) equivalence or parallel test forms might be useful in
settings where students take just one form exactly once. When students (and their
teachers) are re-exposed to a task designed to assess problem solving, the task is
likely to fail as a measure of problem solving simply by virtue of the fact of prior
exposure. Repeated exposure changes a novel task into an exercise. Once the task
becomes known in the educational community, it will be practised in class (which may well have a positive educational impact in that it might develop some aspects of problem solving behaviour) and so loses its ability to assess the fluid deployment of
problem solving skills when used as part of a test. Paradoxically, the more
interesting and memorable the item, the less suited it will be for repeated use in
assessment (and the more useful it will be when used as a curriculum item). It
follows that repeated use of the same test, and the use of parallel forms of a test, is
likely to change the construct validity of a test. Therefore, if one wishes to create
tests whose construct validity matches some general educational goals (such as
assessing mathematical attainment), then tests need to be constructed from items
sampled across the domain in a principled way; each year, new tests must be
designed. If this can be done, then users can only prepare for the test by deepening
their knowledge across the whole domain, not just a selective part of it.
Using new tasks each year has some benefits in terms of educational validity,
because each year, tests can be exposed to public scrutiny. This openness to the
educational community means that teachers and students can understand better the
demands of the system, and can respond to relative success or failure in adaptive
ways, notably by working in class in ways likely to improve performance on the range
of tasks presented. The converse approach of having 'secure' tests encourages a
conception of 'ability' which is not grounded in classroom activities, and suggests no
obvious course of action in the face of poor performance.
The use of new tests each year helps solve the problem of justifying the construct
validity of tests used to sample a broad domain, but poses two major problems: first
that the construct validity needs to be reassessed each year; and second that it
becomes difficult to make judgements about the relative difficulties of tests given in
successive years, and therefore to judge the effectiveness of the educational system
in terms of improving student performance year by year.
It follows that the calibration of tests must be taken seriously.
5. ANCHORING
Anchoring is a process designed to allow a comparison of scores from different
tests to each other. There are two dominant approaches to anchoring: one relies on
expert judgement (Angoff procedure; Jaeger’s Method); the other on statistical
moderation (e.g. Rasch scaling). More sophisticated approaches use a combination
of the two (e.g. Bookmarking; Ebel’s Method). Here, we focus mainly on the ways
that statistical moderation might help the awarding process.
The notion of an anchor test (or sometimes even a single task) is straightforward.
If one wishes to calibrate two different tests given on different occasions, or to
different groups of students on the same occasion, one might use an 'anchor' test
which is taken by all students. The relative difficulties of the two tests can be
compared by judging their difficulty relative to the anchor. Anchors can take a variety
of forms: a single item might be included in both tests; a small anchor test might be
administered alongside both the target tests, or a test of some general cognitive
ability might be given on both occasions.
Consider the situation where a novel mathematics test is set each year. A problem
arises because individual tasks will differ in difficulty level if taken by the same set of
students. It would be unfair to set grades using raw scores alone, because of these
differences. Suppose that tests are created anew each year, based on the same test
blueprint. Suppose that the raw scores of all students are obtained together with a
measure of IQ for two adjacent cohorts. One would expect that (other things being
equal) students with the same IQ would receive identical scores on the two test
administrations, and so any observed differences can be accounted for in terms of
the differences in task difficulty, not student attainment. It follows that the scores on
the second test can be adjusted in the light of the scores on the first test, and the
light of the relative IQs of students in both samples. This anchoring method is
directly analogous to norm referencing, using a covariate (IQ) to adjust the norms.
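A minimal sketch of this covariate-based adjustment is given below. It is illustrative only: the relationship between IQ and score in the first cohort is used to predict what the second cohort 'should' have scored, and any systematic gap is treated as a shift in test difficulty. All names and the simulated data are invented.

```python
import numpy as np

def difficulty_shift(iq_year1, score_year1, iq_year2, score_year2):
    """Estimate how much easier (positive) or harder (negative) the second
    test was, after allowing for the IQ of the two cohorts."""
    # Fit score = a + b * IQ on the first cohort.
    b, a = np.polyfit(iq_year1, score_year1, 1)
    # Predicted year-2 scores if the second test had the same difficulty as the first.
    expected_year2 = a + b * np.asarray(iq_year2)
    # Any systematic gap is attributed to a change in task difficulty.
    return float(np.mean(score_year2) - np.mean(expected_year2))

# Simulated data for illustration only (not real cohorts): the second test is
# about three raw-score points harder than the first.
rng = np.random.default_rng(0)
iq1 = rng.normal(100, 15, 500)
score1 = 0.4 * iq1 + rng.normal(0, 5, 500)
iq2 = rng.normal(100, 15, 500)
score2 = 0.4 * iq2 - 3 + rng.normal(0, 5, 500)

shift = difficulty_shift(iq1, score1, iq2, score2)
adjusted_year2 = score2 - shift  # year-2 raw scores adjusted onto the year-1 scale
print(round(shift, 2))           # close to -3
```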
The method depends critically on three assumptions:
first, that the two tests have the same construct validity (this might be achieved,
for example, by using the same test blueprint, and comparing the resulting
psychometric structure of each test);
second that the anchor test is plausibly related to this construct validity (if IQ were
used, there would be an appeal to some generalised cognitive functioning);
third that no improvements have occurred in the education system which lead to
genuine improvements in student performance.
This use of anchoring is plausible when applied to the situation described above,
and is increasingly implausible as the usage drifts further and further from the model
described. Consider two examples which violate the assumptions above. First,
suppose we wish to compare the relative prowess of athletes who compete in
different events. In particular, we want to compare performance in the 400 metres
sprint and in shot putting (as might be the case in the heptathlon). We use
performance in the 100 metres sprint as an anchor. Let us assume (plausibly) that
people who excel in sprinting 400 metres also excel in sprinting 100 metres, and that
there is little relation between performance on the shot put and in sprinting. If we use
the 100 metres as an anchor, we will find that the median performer in the 400 metres is
given far more credit than the median performer in the shot put, because the scores
on the anchor test are far higher for this group. In this example, the procedure is
invalid because the two performances being compared are not in the same domain,
and because the construct validity of the anchor test is aligned with just one of the
behaviours being tested. This situation will apply in educational settings where the
construct validity of the tests being anchored is different; it will also apply to situations
where the anchor test is not aligned with one or both of the tests being anchored.
For the second example, we return to the example of using IQ to anchor
mathematics tests. IQ measures are designed to be relatively stable over time.
Suppose that the tests in adjacent years are 'parallel' in the sense of having identical
construct validity and identical score distributions. Suppose a new curriculum is
introduced which produces large gains in mathematical performance which is
reflected in much higher scores for the second cohort of students than the first. The
anchoring model will assign the same grade distribution to both sets of students, and
will not reflect or reward the genuine gains which have been made. This second
example shows that the simple anchoring model is insensitive to real improvements
in teaching and learning; it attributes all such gains to changes in task difficulty.
An alternative approach where the anchor itself is used to make judgements about
learning gains can be problematic for two sorts of reasons. First, answers to the key question about rising or falling standards are resolved by a surrogate measure, and, if one were ill-advised enough to use IQ, by a measure that has been designed to be resistant to educational effects. Second, in the case of domain-relevant, short
anchors, changes are judged relative to a measure which is likely to be the least
reliable measure to hand, because of its length.
The Fat Anchor offers a resolution to all these problems.
6. THE FAT ANCHOR
The Fat Anchor is an anchor which is much bigger than the tests it is designed to anchor. It comprises as many tasks as are needed to exemplify the entire domain of interest. It can be thought of as a ‘fantasy test’ – imagine a test that samples every
aspect of attainment in some domain you care about. The Fat Anchor is a very large
collection of tasks based on a principled description of the domain of performance, a
balanced assembly of tasks, and psychometric justification of the domain description.
For any domain, the Fat Anchor would comprise several hours of testing. In
mathematics, at least 10 hours of testing will be required. The Fat Anchor essentially
defines the domain being assessed.
The Fat Anchor is used to assess the performance of a population of students, not
the performance of any individual; no student takes the whole of the Fat Anchor.
Rather, each student in the target population takes a few items from the Fat Anchor.
Samples of students and tasks are selected carefully to ensure that performance on
different components of the Fat Anchor is assessed across a broad spectrum of
student attainment. The results can be aggregated and used immediately for two
distinct purposes, notably assessing standards, and determining the construct validity
of tests.
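In practice this is a form of matrix sampling: each student answers only a few anchor items, but the pooled responses yield population-level statistics for every item in the Fat Anchor. The sketch below, with invented identifiers, shows how such sparse responses can be aggregated into item facility values.

```python
from collections import defaultdict

def item_statistics(responses):
    """responses: list of (student_id, item_id, score, max_score) tuples,
    where each student has answered only a small subset of the anchor items."""
    totals = defaultdict(lambda: [0.0, 0.0])  # item_id -> [points earned, points possible]
    for _, item_id, score, max_score in responses:
        totals[item_id][0] += score
        totals[item_id][1] += max_score
    # Facility value (proportion of available credit earned) for each item.
    return {item: earned / possible for item, (earned, possible) in totals.items()}

# Illustrative sparse data: three students, each attempting two anchor items.
responses = [
    ("s1", "anchor_07", 3, 4), ("s1", "anchor_21", 1, 6),
    ("s2", "anchor_07", 2, 4), ("s2", "anchor_33", 5, 6),
    ("s3", "anchor_21", 4, 6), ("s3", "anchor_33", 6, 6),
]
print(item_statistics(responses))
```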
6.1 Monitoring standards and determining grades via the Fat Anchor
The difficult questions about monitoring overall standards (and setting grades
which are fair to students in adjacent years) using different tests in each year, are
questions about the characteristics of populations of students who have taken tests,
and about samples of items. Questions about individual students can be answered
when these grander issues have been resolved. When data are collected on the
performance of students across the attainment range on the anchor test and on tests
in adjacent years, statistical moderation becomes straightforward. For any cohort of
students, the Fat Anchor can be used to describe their standing relative to other
cohorts who have also taken the Fat Anchor, allowing one to solve the problem of
judging whether standards have gone up or down.
Similarly, assignment of grades to students can be based on judgements of the
relative difficulties of tests in adjacent years.
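The paper does not prescribe a particular moderation technique, but one simple possibility is sketched below: because samples from each cohort have taken parts of the Fat Anchor, a cohort's standing on the anchor can be used to place its live-test scores on a common reference scale. This is one possible approach under stated assumptions, not necessarily the procedure used in the project.

```python
import numpy as np

def link_to_anchor(live_scores, anchor_scores, ref_anchor_mean, ref_anchor_sd):
    """Express a cohort's live-test scores on a reference scale defined by a
    previous cohort's anchor performance (a simple anchor-based moderation)."""
    live = np.asarray(live_scores, dtype=float)
    anchor = np.asarray(anchor_scores, dtype=float)
    # Standing of this cohort on the anchor, in reference-cohort SD units.
    cohort_shift = (anchor.mean() - ref_anchor_mean) / ref_anchor_sd
    # Within-cohort z-scores on the live test, shifted by the cohort's anchor standing.
    z = (live - live.mean()) / live.std()
    return z + cohort_shift

# Simulated usage for illustration only: last year's anchor mean and SD (invented
# values) define the reference scale.
rng = np.random.default_rng(1)
live = rng.normal(55, 12, 400)     # simulated live-test raw scores
anchor = rng.normal(44, 9, 400)    # simulated anchor scores for the same cohort
linked = link_to_anchor(live, anchor, ref_anchor_mean=42.0, ref_anchor_sd=9.5)
print(round(float(linked.mean()), 2))  # positive: cohort stands above the reference cohort
```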
6.2 Determining the construct validity and reporting to relevant user communities
A major problem which educationally valid testing faces relates to the sparseness
of the sampling from the domain. In any year of testing, tasks are sampled from the
domain according to some principled choices. In different years, different aspects of
the domain are sampled. This gives rise to the problem of saying what was
measured, in some meaningful way, and communicating the conclusions to relevant
user groups such as teachers and policy makers.
The description of what any test measures can be based on two important
components. First is a set of judgements about the core elements of performance of
each task, and the associated collection of core components. Second is the
psychometric properties of the test, when judged against the structure in the Fat
Anchor. The use of the Fat Anchor when students are tested will provide enough
data to address psychometric questions concerning both the internal structure of the
test, and its relationship to the data structures in the Fat Anchor.
6.3 Development of the Fat Anchor
There are both conceptual and practical issues which need to be addressed
before the Fat Anchor can be used. The key conceptual issue concerns the
definition of the domain, and its exemplification. It is clear that the domain of
mathematics can be described in a number of different ways. Examples for
mathematics derive from comparisons of national curriculum documents produced by
different countries. Achieving consensus might be difficult.
A further issue is the validation of the conceptual structure and the exemplification
via statistical analysis of student work. A confirmatory approach is appropriate here.
Essentially, there is a need to define the domain of mathematics on the basis of
conceptual understanding, then to test the implicit structures via a confirmatory
analysis. The key is that any inconsistencies between the psychometric structures
evident in the task set and the domain definition are to be reconciled. The core idea
is that the psychometrics is used to inform the conceptual analysis.
6.4 Practical implications of the Fat Anchor
Test development will be more expensive in its initial stages than it has traditionally been. Domain definition and exemplification need considerable amounts of
work. Checking the construct validity of tests will also require some effort.
In return, the Fat Anchor promises to support the integration of different
knowledge communities, and different participants in education. A clear and well
articulated domain definition, supported by examples of tasks and student
performances, is an essential component of all educational reform efforts.
Assessment systems which are open, and which describe in some detail the
strengths and weaknesses of students, provide teachers with the opportunities to
reflect on the nature of their domain, the nature of knowledge acquisition in that
domain, and on their current teaching practices.
Some progress has been made towards the development of the Fat Anchor in
mathematics. The MARS group (http://www.nottingham.ac.uk/education/MARS/) has
created both a large set of tasks across the domain of mathematics, and a
conceptual framework – referred to as the Framework for Balance. Each task is
associated with a description of its 'core elements of performance' which is based on
both a conceptual analysis and on student work. The Framework for Balance sets
out a definition of the domain of mathematics. Tasks can be selected from the task
database, using the key words in the Framework for Balance, in order to support the
creation of 'balanced tests'. So far, MARS has not validated the Framework for
Balance in the ways described above; current experiences with a Fat Anchor are
described below.
7. THE FAT ANCHOR EXEMPLIFIED
The MARS group is working with a major publisher in the USA to develop
‘Balanced Assessment’ – tests of mathematics which represent current educational
goals in mathematics as exemplified by the National Council of Teachers of
Mathematics (1989). Tests are administered at grades 3 through 10, each year
using a new set of tasks. Each test is a different sample of the large and varied
domain of substantial tasks that the standards require. The underlying Framework
for Balance is closely aligned with the NCTM Standards. These standards have
widespread acceptability, and many States have developed their own frameworks
which align to the NCTM Standards. Since the tasks inevitably vary somewhat in
difficulty, the distribution of student scores changes from year to year and grade to
grade. This gives rise to an important problem – how can we ensure that awards
made in one year are comparable with awards made in other years?
This question is NOT important if the tests are to be used simply to provoke
reflections on the nature of mathematics, or on student understandings, in order to
promote better teaching. It IS important if the tests are to be used for high stakes
assessment such as grade-to-grade promotion, high-school graduation, program
evaluation, or assessing the ‘value added’ by individual teachers, schools or districts.
Recent legislation on No Child Left Behind brings this kind of issue into sharp
relief, because of the federal requirement to assess children at every grade from
Grade 3 through Grade 8, and to make judgements about the success or otherwise
of schools on the basis of the results which are obtained.
Our current approach to standard setting blends several ‘expert judgement’
models, and can be viewed as a dialect of the Angoff procedure. The judgements for
each boundary cut-score are informed by four kinds of information:
a standards-based estimate of the minimum score on each task that a student should get in order to be awarded a particular grade (a simple aggregation of these estimates is sketched below);
assessors' descriptions of the pattern of student performance in terms of
strengths and weaknesses;
the score distributions task by task and overall;
and ultimately by an independent holistic review of borderline student
papers.
Here, anchoring depends heavily on the consistency of expert judgement over
time.
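A minimal sketch of how the standards-based minimum-score estimates might be aggregated into a provisional cut score, in the spirit of the Angoff procedure, is shown below. The judge ratings and task structure are invented, and the provisional boundary would in practice be revisited in the light of the other three sources of information listed above.

```python
def angoff_cut_score(minimum_scores_by_judge):
    """Aggregate Angoff-style judgements into a provisional cut score.

    minimum_scores_by_judge: for each judge, a list of the minimum score a
    borderline student 'should' get on each task for the grade in question.
    The provisional cut score is the mean (across judges) of each judge's
    total over all tasks. Names and structure here are illustrative only.
    """
    totals = [sum(task_minimums) for task_minimums in minimum_scores_by_judge]
    return sum(totals) / len(totals)

# Three hypothetical judges, each rating five tasks for the same grade boundary.
judgements = [
    [3, 2, 4, 1, 3],
    [2, 2, 5, 1, 3],
    [3, 3, 4, 2, 2],
]
print(angoff_cut_score(judgements))  # provisional boundary before holistic review
```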
7.1 Developing the Fat Anchor
This Fat Anchor is a very large math test that samples everything of interest in the
appropriate domain at each grade level, with appropriate overlap between grade
levels. For Balanced Assessment, the domain definition follows Principles and
Standards for School Mathematics (see NCTM at http://www.NCTM.org).
The Fat Anchor is designed as an exemplification of these standards. Samples
of students (1000 to 1500 in each sample) take a live Balanced Assessment test and
two parts of the anchor. Sampling is done in such a way that we have complete
coverage of the Fat Anchor, have overlap between tasks in the Fat Anchor, and
overlap between adjacent grades.
This will allow us to link performance in adjacent years of testing because each
annual test can be related to the same ‘ideal’ test.
This development is particularly interesting, because we intend to create a vertical
scale in grades 3-10, and have started with grades 3, 4, and 5 in 2002. A vertical
scale permits not only reliable anchoring from year to year, but also a facility to judge
the extent of the improvement made over the course of a school year, and the
relative improvement of Grade 4 students compared with Grade 5 students, and so
on.
Critical issues in the construction of a Fat Anchor are: the number of items
necessary to cover the theoretical domain; the number of forms necessary to
accommodate these items; and the required case counts necessary for scaling and
calibration.
We have assumed that 70 items will satisfy the domain requirements across
grades 3, 4, 5, and have decided to present 5 items on each test form. Clearly, to
create a vertical scale we need to administer some test booklets to adjacent grades.
To allow a complete linkage of items on the Fat Anchor, we need to overlap test
forms. The following tables illustrate part of the design of the Fat Anchor.
Table 1
Layout of Tasks Across Grades

Grade 3 only: 20 tasks
Shared between Grades 3 and 4: 10 tasks
Grade 4 only: 10 tasks
Shared between Grades 4 and 5: 10 tasks
Grade 5 only: 20 tasks
(70 tasks in total across Grades 3 to 5)
Table 2
Allocation of Test Forms (Forms 1 to 8) to Student Groups, Grades 3 and 4
[Columns: Forms 1 to 8, grouped into Grade 3 forms, forms shared across Grades 3 and 4, and Grade 4 forms. Rows: Student Groups of approximately 500 students each (about 2,500 students in Grade 3 and 3,000 in Grade 4); the cell entries mark the two forms taken by each group.]
Table 2 shows the Fat Anchor design for Grades 3 and 4. Across Grades 3 to 5,
16 heterogeneous groups of 500 students take one of 2 components of the live test,
and two of the anchor tests. Each anchor test comprises 5 test items. For example,
Table 2 shows that Group 1 take Form 1 and Form 2, and Group 2 take Form 1 and
Form 3 (not shown is the fact that Group 1 takes the first paper of the live test, and
Group 2 takes the second paper of the live test). The anchor tests each take
approximately 15 to 25 minutes to complete. In all, about 2 hours of testing is
required of each student, which is conducted in 3 sessions over a period of 2 or 3
days.
Forms are spiralled throughout classrooms, with some overlap between adjacent
grades. There are 14 different test forms, and each one is taken by approximately
1000 students. For example, Table 2 shows that Form 1 is taken by 500 students
from Group 1, and 500 students from Group 2; Form 5 – which is used across grades
to allow the construction of a vertical scale – is taken by 500 students from each of
Groups 4, 6, and 7.
Sixteen different test booklets are required to provide full cover of tests and anchor
papers. Books 1-5 are given in Grade 3; books 6-11 are given in Grade 4; and books
12-16 are given in Grade 5. These books are spiralled throughout classrooms.
Administration directions require that all test administrators must individually
distribute a test booklet to each student from the TOP of the packet of test booklets,
to ensure a random allocation of tests and anchor papers across the whole sample
taking part.
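The spiralling procedure can be thought of as dealing booklets in a fixed repeating cycle, so that every form is spread roughly evenly across each classroom. The sketch below illustrates the idea with invented booklet and student labels; the actual packing of the published booklets may differ.

```python
from itertools import cycle
from collections import Counter

def spiral_booklets(student_ids, booklet_labels):
    """Deal booklets in a fixed repeating order ('spiralling'), so that forms
    are spread evenly across every classroom rather than clustered."""
    dealer = cycle(booklet_labels)
    return {student: next(dealer) for student in student_ids}

# Illustrative: one Grade 3 classroom of 28 students and booklets 1-5.
classroom = [f"student_{i:02d}" for i in range(1, 29)]
allocation = spiral_booklets(classroom, [f"book_{b}" for b in range(1, 6)])
print(Counter(allocation.values()))  # each booklet used 5 or 6 times
```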
8. RESULTS
Our first attempts at producing a Fat Anchor at grades 3-5 have been successful –
the task set ‘scaled’ using a 3 Parameter Logistic function - i.e. the data satisfied the statistical assumptions being made, and an adequate fit was obtained for all the data.
We have successfully produced a vertical scale across Grades 3 to 5. The results of
the scaling will be shown at IAEA 2003.
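For reference, the three-parameter logistic (3PL) model referred to here gives the probability that a student of ability θ succeeds on an item with discrimination a, difficulty b and guessing parameter c. A minimal sketch of the item response function is shown below, with invented parameter values; the operational scaling would of course be carried out with specialised IRT calibration software rather than a sketch of this kind.

```python
import numpy as np

def three_pl(theta, a, b, c):
    """3PL item response function: probability of a correct response for a
    student of ability theta on an item with discrimination a, difficulty b,
    and lower asymptote (guessing parameter) c."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

# Illustrative parameter values for a single anchor item.
theta = np.linspace(-3, 3, 7)  # ability scale in logits
print(np.round(three_pl(theta, a=1.2, b=0.5, c=0.2), 3))
```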
9. CONCLUSIONS
High stakes assessment needs to achieve good standards of reliability and
validity. Designers need to define the domain clearly, and to validate that domain.
For most meaningful assessment in education, students will face a variety of new
tasks on each testing occasion – usually demonstrating their ability to engage with
tasks of different length. The demands of construct validity pose a challenge to
consistency of grading – different tasks have different difficulties. Consistency of
grading is an essential component of fairness in assessment.
It is impractical to fully examine student performance on all aspects of complex
domains. Student knowledge will always be sampled partially, and different tests are
likely to sample different aspects of student knowledge. Again, this can pose
problems for the comparability of test scores.
The Fat Anchor provides a way to tackle these seemingly intractable problems.
Development costs are higher than those usually associated with testing, but the
rewards are likely to be equally high. The difficult problem of maintaining standards
is solved; the construct validity of particular tests can be described; it is possible to
validate the test designers’ theoretical framework.
In the example provided, we have evidence that vertical scaling can be achieved;
that is to say, the performance of students can be set on a common scale over
adjacent school years; not only can standards be maintained from year to year, but
also comparisons can be made about the relative gains in different years. Such data are likely to provide valuable feedback to teachers and schools about their educational successes.
10. REFERENCES
MARS: http://www.nottingham.ac.uk/education/MARS/
Messick, S. (1995). Validity of Psychological Assessment. American Psychologist, 50(9), 741-749.
National Council of Teachers of Mathematics. (1989). Curriculum and evaluation standards for school mathematics. Reston, VA: NCTM.
NCTM: http://www.NCTM.org
Ridgway, J. & Passey, D. (1993). An International View of Mathematics Assessment - Through a Class, Darkly. In Niss, M. (ed.), Investigations into Assessment in Mathematics Education. Kluwer Academic Publishers, pp. 57-72.
Ridgway, J. (2001). The Fat Anchor plus Powerful Feedback - New Approaches to Monitoring and Raising Standards. Invited paper presented at the Ninth International Conference in Mathematical Education, Tokyo, Japan, pp. 1-21.