
The contribution of large-scale test results to policy
Esther Care, Patrick Griffin, and Zhonghua Zhang
The University of Melbourne, Australia
Abstract
This summary, drawn from Care, Griffin, Zhang, and Hutchinson (2013), provides an example of how tests intended for teacher use to inform teaching can also provide information that signals the need for policy change.
Introduction
There are many views on the functions of assessment in education. This summary describes how assessment data gathered initially to inform differentiated teaching can also inform policy. The use of assessment data to inform instruction is grounded firmly in
the work of Vygotsky (1986) and his identification of the phenomenon of a zone in which an
individual is able to achieve more with assistance than he or she could manage alone. The
Zone of Proximal Development (ZPD) is typically used to refer to an area or level of skills in
which a student is ranging between correct and incorrect responses as he or she engages with
the level of difficulty. As discussed by Griffin (2007), this perspective links well with the
work of Glaser (1963), who proposed the concept of criterion-referenced interpretation of
assessments. When students are assessed in such a way that their current skills are identified,
the information can be used by teachers to guide interventions, ensuring that information is
presented to the student at the level at which he or she can engage with the learning goals.
Notwithstanding a growing interest in this approach to assessment, attested to by increased
understanding and encouragement of formative assessment approaches (Black & Wiliam,
1998), international and national large-scale assessments have by and large ignored the
capacity of their data to empower and inform teachers. Our argument is that a change in
approach to assessment is required—one that is based on changing notions of teaching and
learning—as we move toward a more differentiated, more responsive model that meets
individual student needs.
It is necessary (a) to determine what is needed in the classroom for formative assessment and
what is helpful for teachers, (b) to determine what is needed in terms of large-scale accountability, and (c) to achieve consistency between these without imposing the limitations
of one upon the other. We should not pursue large-scale assessment to inform policy without
also taking a classroom perspective on the usefulness of the information. The teacher’s need
is for specific information about students’ current understandings and skills in the context of
the program of learning outlined by the curriculum and interpreted for teaching purposes
within the school. This means that logistically the large-scale assessment must be capable of
providing both foreground information for use by teachers and background information to
harvest for summative, system-level analysis.
In Victoria, Australia, the Assessment Research Centre at The University of Melbourne is
implementing just such a model through the State and Catholic education jurisdictions in
literacy, numeracy and problem-solving tests—tests that provide information about students’
location along developmental learning progressions. The work has informed teacher decision-making about interventions, school leadership teams’ decision-making about staff
organization and regional decision-making about professional development needs. Integral to
this "Assessment and Learning Partnerships" (ALP) program is the Assessment Research
Centre Online Testing System (ARCOTS), an online platform that supports a comprehensive
assessment system. Through this system student competency is mapped to underlying,
empirically based developmental progressions, and reports are generated for teacher use.
ARCOTS tests are available for reading comprehension, numeracy and problem solving, and
are targeted for students across Grades 3–10.
Schools participating in ALP test their students twice a year on the developmental
progression of interest. As part of the program, teachers learn how to interpret ARCOTS
results using a developmental assessment approach, and to identify the point of readiness to learn, the ZPD, for each student assessed. The first test establishes a starting point. Testing at
second, third and subsequent points in time provides teachers with evidence of progress and
an opportunity to review the student’s ZPD. Teachers analyze these results together with
other examples of student achievement drawn from the classroom, in order to plan for
teaching interventions.
The data presented here indicate how these test results, used in the first instance by teachers to identify the ZPD for teaching purposes, can equally be used for large-scale analysis, interpretation, and policy development.
Method
Given the primary goal of ALP—that each student should be taught at the point at which she
or he is ready to learn—all students should be able to progress in their learning. Analysis of
the distributions of student progress can provide some evidence concerning whether this is
achieved for all students, or for particular groups of students. As a test of equal progress in
learning, distributions of assessment data were analyzed to identify whether students at the
top end of each grade level were progressing at a similar rate to students at the lower end of
the grade.
Participants
In March 2011, 21,000 students enrolled in Department of Education and Early Childhood Development primary and secondary schools participated in ARCOTS testing. For this example, the analyses include the numeracy results of a sub-sample of students in Grades 3–6, matched across time and test difficulty level.
Tests
The numeracy areas cover number, geometry, measurement, chance and data. Each test has
questions drawing on a range of content, with varying levels of complexity. There is overlap
in both content and complexity between one test and the next. Each test is comprised of 40
items and is designed to be completed in approximately 50 minutes. The items are presented
in multiple-choice format on an online platform. Items on the tests are mapped onto a uniform latent-variable scale for each domain, using a single-parameter (Rasch) model. The test taken by each student is targeted so that the student is likely to achieve approximately 50 per cent correct.
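The targeting logic can be illustrated with a minimal sketch of the single-parameter (Rasch) model in Python. The function name, ability value and item difficulties below are illustrative assumptions, not values drawn from the ARCOTS system.

import numpy as np

def rasch_probability(theta, b):
    # Rasch (one-parameter logistic) model: the probability of a correct
    # response depends only on the gap between student ability (theta)
    # and item difficulty (b), both expressed on the same logit scale.
    return 1.0 / (1.0 + np.exp(-(theta - b)))

# Targeting: when item difficulties are centred on the student's ability,
# the expected raw score is close to 50 per cent, as described above.
theta = 0.4                                # assumed student ability (logits)
difficulties = np.linspace(-0.6, 1.4, 40)  # assumed difficulties of 40 items
expected_score = rasch_probability(theta, difficulties).sum()
print(f"Expected raw score: {expected_score:.1f} out of 40")  # roughly 20 of 40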
Results
The sub-sample of primary school students who took ARCOTS numeracy tests in both March and October 2011 is shown, by grade, in Table 1 (with outliers removed).
Grade      Total    Lower-achievement group    Higher-achievement group
Grade 3    1551     414                        397
Grade 4     988     259                        265
Grade 5     565     161                        147
Grade 6     589     166                        168

Table 1. Sample distributions of the students taking ARCOTS numeracy tests in March and October 2011
The distributions of students’ achievement scores on the numeracy tests for Grades 3–6 are
displayed in Figure 1. The growth from March to October can be seen in the distributions.
The results of Shapiro-Wilk tests indicated that the students’ achievement scores were not normally distributed. Hence, the non-parametric Wilcoxon signed-rank test was used to assess whether the students’ achievement scores on the March 2011 and October 2011 numeracy tests differed. The results indicated statistically significant differences from the March to the October numeracy test for all grades. Consistent across the grades, the mean differences and the corresponding effect sizes show medium positive growth across the 2011 school year.
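As a rough sketch of this analytic sequence in Python, using scipy: the data below are simulated for illustration only and do not reproduce the ARCOTS results, which would be analysed separately by grade.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2011)

# Simulated matched scores for one grade, in logits; the real analysis
# would use each student's ARCOTS scores in March and October.
march = rng.normal(0.0, 1.0, 500)
october = march + rng.gamma(2.0, 0.2, 500) - 0.1  # skewed, mostly positive gains

# Shapiro-Wilk test of normality on the gains; a small p-value supports
# the choice of a non-parametric test for the paired comparison.
w_stat, w_p = stats.shapiro(october - march)

# Wilcoxon signed-rank test on the matched March/October scores.
res = stats.wilcoxon(october, march)

# A common effect size for the signed-rank test is r = Z / sqrt(N);
# Z is recovered here from the two-sided p-value.
z = stats.norm.isf(res.pvalue / 2)
r = z / np.sqrt(len(march))
print(f"Shapiro-Wilk p = {w_p:.3g}; Wilcoxon p = {res.pvalue:.3g}; r = {r:.2f}")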
Figure 1. Distributions of students’ achievement scores on ARCOTS numeracy tests across
grades
Based on the students’ achievement scores, two groups are used for analysis. Students whose achievement scores are at or below the 25th percentile of scores on the ARCOTS March 2011 numeracy tests are taken as lower-achievement students, and those with scores at or above the 75th percentile are regarded as higher-achievement students.
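A minimal sketch of this grouping rule follows, assuming simulated scores in place of the actual matched ARCOTS records.

import numpy as np

# Illustrative March scores for one grade; 1551 mirrors the Grade 3
# count in Table 1 but the values themselves are simulated.
march_scores = np.random.default_rng(0).normal(0.0, 1.0, 1551)

q25, q75 = np.percentile(march_scores, [25, 75])
lower_group = march_scores <= q25    # lower-achievement students
higher_group = march_scores >= q75   # higher-achievement students
print(lower_group.sum(), higher_group.sum())  # roughly a quarter of students each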
Again, due to distributional characteristics, the Wilcoxon signed-rank test was used to assess
whether the students’ achievement scores from March to October significantly differed across
achievement and grade groups. For the lower-achievement students, there are statistically
significant differences between achievement from March to October for all grades; the mean
differences and the corresponding effect sizes show substantive growth. However, for the
higher-achievement students, similar results were not obtained. No significant differences
were found for Grades 3 and 5. For Grades 4 and 6, no statistically significant differences at p
< .001 were found in achievement scores between March and October. The effect sizes also
imply little growth for these high-achieving student groups. These findings suggest that different growth trajectories exist for lower-achievement and higher-achievement students. It is apparent that scores of the lower-achievement students grow faster than those of higher-achievement students, as summarised in Figure 2.
Figure 2. Distributions of gain scores for lower-achievement and higher-achievement groups
of students on ARCOTS numeracy tests from March 2011 to October 2011, across different
grades
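Putting the steps together, the following self-contained sketch simulates matched scores, forms the two percentile groups, and runs the signed-rank test on each group's gains; with real data the same procedure would be repeated for each grade. The simulated data are illustrative only and will not reproduce the differential growth reported above.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
march = rng.normal(0.0, 1.0, 1551)                 # simulated March scores (logits)
october = march + rng.gamma(2.0, 0.2, 1551) - 0.1  # simulated October scores

q25, q75 = np.percentile(march, [25, 75])
groups = {"lower": march <= q25, "higher": march >= q75}

for name, mask in groups.items():
    gains = october[mask] - march[mask]
    res = stats.wilcoxon(gains)             # signed-rank test on the gains
    z = stats.norm.isf(res.pvalue / 2)      # Z recovered from two-sided p
    r = z / np.sqrt(mask.sum())             # effect size r = Z / sqrt(N)
    print(f"{name}-achievement group: median gain = {np.median(gains):.2f}, "
          f"p = {res.pvalue:.3g}, r = {r:.2f}")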
Discussion
What we see in the results is both reassuring and alarming. For students whose results
indicate that they are operating within the lower 25 per cent of the grade distribution, there is
a consistent pattern of growth regardless of the grade.
However, what is of concern is that the students within the top 25 per cent of distributions
across all grades tested achieved little growth at cohort level. The skill levels of students in
each grade appear to be converging. Thus, it seems that teachers are indeed ‘closing the gap’,
a term coined to characterize the Australian policy of reducing inequities between Indigenous and non-Indigenous Australians (MCEECDYA, 2009), but extended to education more
generally (Gonski, 2011). Given the national and state emphasis on raising the skill levels of
those at the lower end of the distribution, it is not surprising that these students are
prioritized.
Through use of large-scale testing data such as that presented in this summary from Care et al. (2013), we can see the direct effects of policy. Where policy may have brought about
unanticipated outcomes, we are provided with an evidence base upon which to promote more
appropriate planning. As can be seen in this instance, the positive policy of promoting equity
has in fact brought about its opposite—for a group other than that originally targeted for
positive outcome. Counter-intuitively, it is the students in the top of the distribution who are
at greatest risk of reversing the trend in growth rates.
Implications for action
The progress of those students in the lower 25 per cent of the distributions may be seen as a
direct outcome of good teaching practice. In this first year of participation in the ALP
program, teachers learn about developmental approaches to teaching and learning, and how to
use assessment data to inform their decisions about interventions with students—what
resources and strategies to bring to bear and what level of content and complexity of subject
matter to include. The implications of the outcomes of this analysis extend to teacher, school,
jurisdiction and system levels. For the teacher, the data highlight the need to differentiate
teaching for all students rather than only those who appear to have the greatest need. This
differentiation may require a change in attitude, as well as a need for identification of
different strategies and resources to cover the full range of need in the classroom. At the
school level, the required changes herald the need to implement professional learning activity
in order to address both the attitudinal and skills needs of teachers. At the jurisdiction or
regional level, this implies that a need exists for appropriate resourcing to schools for
professional learning, and for regional level promotional support for change in practice. At
the system level, the implications for policy are clear: it must promote equal opportunity for
all, ahead of the prioritization of sub-groups.
The large-scale national testing in the Philippines generates huge information resources that have the potential for use both in the classroom and at system level to inform policy. How the data are captured, recorded, reported upon, and interpreted determines their use. The example provided by these data from the Australian system can be used as a model for re-thinking the use of large-scale data in the Philippine system of education.
Information relevant to the individual student can be used at class level by the teacher as an aid to differentiated instruction; at school level by groups of teachers in order to inform their
professional development, responsive to student profiles; and at jurisdiction and policy levels
in order to identify patterns in student learning that are attributable to educational policies of
current governments. In so doing, the potential for student, class and school-level data to be
used at policy level is clear. In order for use of assessment data to be effective, it is essential
that teachers have access to quick turnaround of results such that they are responding to
students’ current levels of functioning and performance. It is also essential that teachers
acquire skills relevant to understanding aggregate as well as individual data, so that they can
bring their professional judgment to bear in terms of interpretation. Both these criteria—for
data capture and professional understandings—require systems resources. As the Philippines
rolls out its K-12 reforms and re-thinks its national assessment system, this approach to both formative and system-level use of test results is imperative.
References
Black, P., & Wiliam, D. (1998). Inside the black box: Raising standards through classroom assessment. Phi Delta Kappan, 80, 139–148.
Care, E., Griffin, P., Zhang, Z., & Hutchinson, D. (2013). Large scale testing and its
contribution to learning. In C. Wyatt-Smith, V. Klenowski, & P. Colbert (Eds.), The enabling power of assessment. The Netherlands: Springer International.
Glaser, R. (1963). Instructional technology and the measurement of learning outcomes: Some questions. American Psychologist, 18, 519–521.
Gonski, D. (2011). Review of Funding for Schooling—Final Report. Canberra: Department of Education, Employment and Workplace Relations.
Griffin, P. (2007). The comfort of competence and the uncertainty of assessment. Studies in Educational Evaluation, 33, 87–99.
MCEECDYA (2009). Aboriginal and Torres Strait Islander Education Action Plan 2010–2014. Canberra: Ministerial Council for Education, Early Childhood Development and Youth Affairs.
Vygotsky, L. S. (1986). Thought and language. Cambridge, MA: MIT Press.
Wiliam, D., & Thompson, M. (2007). Integrating assessment with learning: What will it take to make it work? In C. A. Dwyer (Ed.), The future of assessment: Shaping teaching and learning. Mahwah, NJ: Erlbaum.
Note
Note that this summary presentation for the 12th National Convention on Statistics, Manila, 1–2 October 2013, is adapted from the article by Care, Griffin, Zhang, and Hutchinson (2013) published in C. Wyatt-Smith, V. Klenowski, & P. Colbert (Eds.), The enabling power of assessment. The Netherlands: Springer International.