Efforts to Teach in a Way that Tests Can Detect: Pointless or Profitable?
Megan Welsh
Neag School of Education
NERA Conference
10/21/11
The Title: Pointless or Profitable?
In 1989, Mehrens and Kaminski published a paper entitled “Methods for Improving Standardized Test Scores: Fruitful, Fruitless or Fraudulent?”
It addressed the “old, but increasingly relevant issue of teaching to the test” (p. 21).
They concluded that at least some test preparation efforts are both fruitless and fraudulent.
This talk
- Explores the assumptions underlying many current uses of test scores
- Provides some preliminary evidence about the relationship between test-focused instruction and student performance
- Discusses implications for next generation assessments
Then (1989)
- Norm-referenced tests are widely used
- Expectation that teachers do not know the content of the test; the test samples from a content-area domain, and content-area knowledge will generalize to test performance
- Accountability = parent/community perceptions of schools
- Test scores are considered to gauge minimum competency in a subject, but are not typically used to inform curriculum or to reflect on specific lessons
Now
- Standards-based assessments are criterion-referenced
- Both the test and teaching are expected to align closely with state standards/Common Core
- High-stakes accountability based on test scores assumes test scores reflect instructional quality
- Test scores are used to reflect on instruction and curriculum for specific topics
New uses of large-scale tests:
1. Support accountability
New uses of large-scale tests:
2. Reflect on instruction
Question #: 15
Question Type: Multiple Choice
Topic: Number Sense
Shutesbury (correct): 6%
Massachusetts (correct): 49%
Correct Answer: C
61% selected A, 22% selected B, while
only 6% selected the correct answer C.
New uses of large-scale tests:
2. Reflect on instruction
Question #: 22
Question Type: Multiple Choice
Topic: Number Sense
Shutesbury (correct): 56%
Massachusetts (correct): 72%
Correct Answer: C
22% selected A, 22% selected B.
Should standards-based assessments be used in these ways?
Is teaching to the test now appropriate?
Does teaching to the test improve scores?
Are tests sensitive to instructional efforts?
Test scores might reflect
- Instruction focused on standards
- Teaching skill
- Attainment of standards due to experiences outside of school
- Test-wiseness
- Situational anomalies (illness, distractions, mood, etc.)
- Aptitude
If tests are insensitive to instruction
Question #: 15
Question Type: Multiple Choice
Topic: Number Sense
Shutesbury (correct): 6%
Massachusetts (correct): 49%
Correct Answer: C
61% selected A, 22% selected B, while only 6% selected the correct answer C.
Waste of time
If tests are insensitive to instruction, why teach to the test?
Exploring instructional sensitivity
A series of studies conducted in one suburban school district located in the Southwest.
Participants
- 16 third- and 20 fifth-grade mathematics classes in 13 schools
- 784 students
- Relatively white, high-performing district with moderate SES
- Teachers were relatively experienced (M = 13.9 years, SD = 9.9)
- District used standards-based report cards
- Districtwide mathematics curriculum uniformly implemented
Data Collection
- Teachers were interviewed for approximately two hours about:
  - instruction and assessment of two performance objectives
  - grading practices
  - likelihood that students would correctly answer state test items relating to the objectives
- Student mathematics scores on the state test
- End-of-year grades
- Demographics
Research questions
Is teaching to the test now appropriate?
Does teaching to the test improve scores?
Are tests sensitive to instructional efforts?
Is teaching to the test appropriate?
My thoughts…
1. General instruction on tested objectives
2. Teaching test-taking skills
3. Instruction on tested objectives using examples similar to the test
4. Decontextualized practice
5. Practice on the operational (real) test
Is teaching to the test effective?
First, we need a way to gauge teaching to the test:
1. Asked teachers about their test preparation practices.
2. Teachers participated in a blind review of mathematics tests containing items from their own and other states' tests. They identified items their students could answer and commented on sources of difficulty.
Participants: This analysis
- 31 teachers (12 third-grade, 19 fifth-grade)
- 711 students
- Students were relatively low-performing compared to the district as a whole
Frequency of test preparation practices
1. General instruction on tested objectives: 12
2. Teaching test-taking skills: 6
3. Instruction on tested objectives using examples like the test format: 6
4. Decontextualized practice that mirrors the state test: 12
5. Practice on the operational test: 0
Item review → State test awareness
Analysis
- Conducted a multilevel analysis; students nested within classrooms
- Predicted mathematics achievement on the state standards-based assessment, standardized relative to statewide test performance and pooled across grades
- Controlled for student-level minority status, ELL status, and special education status
- Teacher-level main effects:
  - teaching-to-the-test categories, compared to general instruction on tested objectives
  - state test awareness categories, compared to test-averse teachers
(A sketch of this kind of model appears below.)
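To make the model structure concrete, here is a minimal sketch of this kind of two-level model in Python using statsmodels' MixedLM. It is an illustration only: the file name and all column names (math_z, ell, minority, sped, test_prep, awareness, classroom) are assumptions, not the study's actual variables.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("students.csv")  # hypothetical: one row per student

# Student-level covariates plus teacher-level categorical predictors;
# Treatment(...) sets the comparison groups named on the slide
# (general instruction on tested objectives; test-averse teachers).
model = smf.mixedlm(
    "math_z ~ ell + minority + sped"
    " + C(test_prep, Treatment(reference='general_instruction'))"
    " + C(awareness, Treatment(reference='test_averse'))",
    data=df,
    groups=df["classroom"],  # random intercepts: students within classrooms
)
print(model.fit().summary())
```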
Results
After controlling for student demographics:
- Teaching to the test did not predict achievement
- Being test-secure did predict achievement; students of test-secure teachers performed half a standard deviation better on the state test than students of test-averse teachers
- There was no difference in performance between students whose teachers were test averse and those whose teachers were state-test focused or out-of-state-test focused
Final Model Predicting Mathematics Achievement

Fixed Effect                                  Coefficient     SE     df    t ratio
Model for mean classroom math achievement, β0
  Intercept, γ00                                 -0.242    0.058    27     -0.414
  Test Secure, γ01                                0.525    0.230    27      2.277*
  Out-of-State Focus, γ02                         0.248    0.161    27      1.537
  In-State Focus, γ03                             0.274    0.150    27      1.825
Model for ELL, β1
  ELL, γ10                                       -0.736    0.186   699     -3.961*
Model for Minority, β2
  Minority, γ20                                  -0.380    0.053   699     -7.170*
Model for SPED, β3
  SPED, γ30                                      -0.985    0.152   699     -6.471*

* indicates a statistically significant relationship at p < 0.05.
Possible interpretations
- Teaching to the test does not work
- The teachers are teaching state standards in a relatively uniform way
- The test does not detect instructional efforts
Instructional sensitivity
The degree to which a test can detect differences in
the instruction students receive.
[Figure: score distributions with teachers who do not teach state standards vs. with teachers who teach state standards well]
Big question…
- How do we know what instruction has occurred? (opportunity to learn)
- Instructional sensitivity: the degree of correspondence between opportunity to learn and test performance
Measuring opportunity to learn
- Teaching to the test is one (gross) approach
- Alignment: How consistent were test items and instructional efforts in terms of content and cognitive demand?
- Emphasis: Were the most heavily tested concepts fully addressed?
- The interaction of alignment and emphasis is perhaps the best estimate and should also correlate with achievement
Alignment as Opportunity to Learn
[Figure: diagrams showing varying overlap between the test and instruction, including teaching the skill in ways unlike the test]
My instructional sensitivity study
Based on interviews with teachers about their teaching and assessment of the two objectives most heavily emphasized on the state test.

Grade 3
- Performance Objective 1 (2 items): Make a diagram to represent the number of combinations available when 1 item is selected from each of 3 sets of 2 items (e.g., 2 different shirts, 2 different hats, 2 different belts).
- Performance Objective 2 (3 items): Discriminate necessary information from unnecessary information in a given grade-level appropriate word problem.

Grade 5
- Performance Objective 1 (4 items): Interpret graphical representations and data displays, including bar graphs, circle graphs, frequency tables, three-set Venn diagrams, and line graphs that display continuous data, AND answer questions based on graphical representations and data displays.
- Performance Objective 2 (4 items): Describe the rule used in a simple grade-level appropriate function (e.g., T-chart, input-output model).
Measuring alignment (rubric for the Grade 3 combinations objective)
- Perfect Alignment: Interprets a table; interprets visual and written information; interprets a 3-set tree diagram
- Close Alignment: Combinations involve 3 sets of items AND multiple visual displays, OR students create a tree diagram
- Some Alignment: Introduces the concept of combination (select 1 item from each set); represents combinations in some way (list or diagram); uses relevant vocabulary (combination, diagram, different)
- Not Aligned: Does not teach the skill
For example
The teacher who drew these examples was coded as having “close alignment” to AIMS because she required students to solve problems involving three sets of items using a tree diagram. She did not, however, present students with tree diagrams that they had to interpret (required for “perfect” alignment).
Distribution of alignment scores by grade level
[Figure: distribution of alignment scores (perfect, close, and some alignment) by grade level]
Distribution of emphasis scores by grade level
[Figure: distribution of emphasis scores by grade level; frequency-of-instruction categories ranged from daily, weekly, every other week, and monthly down to 2 weeks per year, 1 week per year, 1-2 lessons, and not taught]
Analysis
- Conducted a multilevel analysis; students nested within classrooms
- Predicted mathematics achievement on the state standards-based assessment, standardized relative to statewide test performance, run separately by grade level
- Controlled for student-level minority status, ELL status, and special education status; teacher experience and education; school-level free lunch eligibility; and prior achievement on a norm-referenced test
- Teacher-level main effects:
  - alignment
  - emphasis
  - alignment × emphasis interaction
(A sketch of this model appears below.)
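As with the first analysis, here is a hedged sketch of this grade-level model, again using statsmodels' MixedLM. The interaction term follows directly from the slide; the file and column names (sat9_z, free_lunch, etc.) are illustrative assumptions.

```python
import pandas as pd
import statsmodels.formula.api as smf

students = pd.read_csv("students.csv")       # hypothetical file
grade3 = students[students["grade"] == 3]    # models were run separately by grade

model = smf.mixedlm(
    "math_z ~ ell + minority + sped + sat9_z"  # student-level controls
    " + experience + education + free_lunch"   # teacher/school-level controls
    " + alignment * emphasis",                 # main effects plus interaction
    data=grade3,
    groups=grade3["classroom"],                # students nested within classrooms
)
print(model.fit().summary())
```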
Results
- None of the main effects predicted achievement after controlling for prior achievement and demographics at fifth grade
- Alignment predicted achievement at third grade after accounting for prior achievement and free lunch eligibility; students whose teachers were a standard deviation above the mean in alignment scored a tenth of a standard deviation above the sample mean on the state test
Final Model Predicting Mathematics Achievement, Third Grade

Fixed Effect                                  Coefficient     SE     df    t ratio
Model for mean classroom math achievement, β0
  Intercept, γ00                                  0.45      0.04    13     10.12*
  Free Lunch, γ02                                -0.07      0.04    13     -1.68
  Alignment, γ03                                  0.10      0.04    13      2.40*
Model for SAT9, β1
  Intercept, γ10                                  0.68      0.04   315     17.01*

* indicates a statistically significant relationship at p < 0.05.
Possible interpretations
- The test is instructionally sensitive to a limited degree at one grade level, but not at the other
- The objectives selected affected the results; the third-grade objectives comprised less of the curriculum (to teach them, you had to be very aware of their presence on the test), while the fifth-grade objectives reflected commonly taught skills
Implications
- Need to evaluate instructional sensitivity if we want to use large-scale assessments for accountability or to guide instruction:
  - sensitivity of total test scores
  - review item sensitivity during test development
Exploring item sensitivity
Two approaches recommended by Popham and Kaase (2009):
1. Judgmental review of test items
2. Differential item functioning (DIF) based on content teachers report teaching well and teaching poorly
So far, only approach 2 has been studied; it found no relationship between the content teachers said they taught badly (or didn't teach) and item functioning.
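For concreteness, one standard way to operationalize approach 2 (not necessarily the method used in the study cited above) is logistic-regression DIF: after conditioning on total score, check whether group membership (content taught well vs. taught poorly) still predicts the item response. The file and column names below are assumptions.

```python
import pandas as pd
import statsmodels.formula.api as smf

responses = pd.read_csv("item_responses.csv")  # hypothetical: one row per student
# taught_well: 1 if the student's teacher reported teaching this content well
dif = smf.logit("item_correct ~ total_score + taught_well", data=responses).fit()
print(dif.summary())  # a significant taught_well coefficient suggests DIF
```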
Another approach
Combines both approaches…
- Teachers review a test and identify items they consider problematic
- Compare classroom-level and statewide item difficulties across the entire test
- Determine if teacher-identified items perform differently (a plotting sketch follows below)
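An illustrative sketch of that comparison: plot each item's classroom proportion correct against the statewide proportion and highlight the teacher-flagged items. The numbers below are made up for demonstration and are not the study's data.

```python
import matplotlib.pyplot as plt

state_p = [0.49, 0.72, 0.60, 0.55, 0.80, 0.66, 0.71, 0.45]  # statewide p-values
class_p = [0.06, 0.56, 0.62, 0.50, 0.78, 0.70, 0.68, 0.40]  # one classroom's p-values
flagged = {1, 4}  # items the teacher identified as problematic (1-indexed)

fig, ax = plt.subplots()
for i, (s, c) in enumerate(zip(state_p, class_p), start=1):
    ax.scatter(s, c, color="red" if i in flagged else "gray")
    ax.annotate(str(i), (s, c))  # label each point with its item number
ax.plot([0, 1], [0, 1], linestyle="--")  # items near this line behave alike
ax.set_xlabel("Statewide proportion correct")
ax.set_ylabel("Classroom proportion correct")
ax.set_title("Classroom vs. statewide item difficulty")
plt.show()
```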
Visual Analysis: An Example of DIF
Participants: This analysis
- 10 third-grade and 12 fifth-grade teachers from the same data collection
- Number of student test scores per classroom ranged from 19 to 30
[Figures: example item-difficulty plots for a teacher who reported instructional alignment, a teacher concerned with a few items, and a teacher concerned with test emphasis]
Patterns across classrooms
- Teachers did not do a good job of predicting which items might function differently at either grade level
- Teachers differed in the specific items they identified as problematic, but were more consistent in terms of over- and under-emphasized topics
- Item functioning varied randomly across line plots
Possible Interpretations
- Teachers don't do a good job of predicting which items will be difficult for students
- Items on this test do not appear to be instructionally sensitive
- Negative result: method failure or test failure?
Limitations (all analyses)
- Small sample
- One district
- One content area
- Two grade levels
- Two objectives used to generalize to the entire test for the analysis of test score sensitivity
Teaching to the test: Pointless or Profitable?
- In this example, teachers seem to have difficulty linking items to what happens in classrooms
- The test may tap general mathematics aptitude more than attainment of specific standards
- Therefore, broadly teaching content may have a greater (or at least equal) impact on achievement
- CAVEAT: When a test is composed of anomalous items, teaching to the test may help
Implications for Next Generation Assessments
- New item formats may improve instructional sensitivity; this requires investigation
- The computer-adaptive nature of the SMARTER Balanced assessment makes teaching to specific items pointless
- Test validation should examine instructional sensitivity, especially if scores will be used for school and teacher accountability
Questions?
Megan Welsh
Educational Psychology Department
Neag School of Education
[email protected]