The Maryland School Performance Assessment Program:
Performance Assessment with Psychometric Quality
Suitable for High Stakes Usage

Wendy M. Yen
CTB/McGraw-Hill

Steven Ferrara
Maryland State Department of Education
The Maryland School Performance Assessment Program (MSPAP) is part of a larger school reform effort, dubbed "Schools for Success," initiated by the Maryland State Department of Education and the State Board of Education. The MSPAP is an annual performance-based testing program that was first administered to approximately 150,000 students in grades 3, 5, and 8 in May 1991. Performance on the MSPAP is used to evaluate schools and to provide information to guide school improvement efforts. The primary high stakes focus in designing, developing, and reporting MSPAP is school performance rather than individual student performance. Schools are expected to meet standards for satisfactory and excellent school performance on the MSPAP in reading, writing, language usage, mathematics, science, and social studies by the 1995-96 school year. Schools that do not meet these standards will be required to develop and implement school improvement plans to meet these standards. Consistently low performing schools may be selected as Challenge Schools, which receive funding and outside expert guidance on improving school performance on the MSPAP standards and standards for other areas (e.g., attendance, drop out rates). They may also be designated as Reconstitution Schools, which are reorganized and managed by an outside organization.
The 1991 MSPAP included assessments of learning outcomes in reading, writing, language usage, and mathematics. Assessment of science and social studies outcomes was integrated into the 1992 and subsequent MSPAP editions. Each assessment task in these later editions assesses 1 to 4 content areas. MSPAP assessment tasks are designed to elicit students' thoughtful application of knowledge, skills, and thinking processes. Students write, diagram, and sketch responses to tasks that focus on their ability to construct and extend meaning from what they read, construct and extend meaning through writing, solve multistep mathematics problems, conduct hands-on science investigations, understand social studies concepts, and analyze social studies issues.
It was essential to the purposes of the MSPAP to develop
innovative performance-based assessments.
It was also essential
that these assessments have sufficient psychometric quality that
the results could be used for high stakes, yearly evaluations of
school performance and for tracking school improvement.
At the
time this program was initially designed in 1989 and 1990,
performance assessments were much more a dream than a reality,
and virtually no information was available about their
psychometric properties.
The first year's tests were more innovative than those in any other large-scale testing program of the time. The very encouraging results of that testing permitted greater innovations in later years.
This paper describes the program design and highlights its
psychometric results.
We do not describe details of the
psychometric procedures and findings of the MSPAP,
but merely
abstract and summarize some typical results from the first year
of the program and summarize some changes made in later years.
Complete, detailed descriptions are contained in the MSPAP 1991 Final Technical Report (CTB Macmillan/McGraw-Hill, 1992) and the 1992 MSPAP Technical Report (Maryland State Department of Education, 1993), both of which are available through MSDE.
PROGRAM REQUIREMENTS
It is important to note that the MSPAP program requirements
have dictated the psychometric properties needed for the
assessments.
Psychometric properties have not been built into the tests for vague theoretical reasons. The psychometrics of the program were designed to interfere as little as possible with the innovative aspects of the assessments (e.g., modeling good instruction) while providing the characteristics that were necessary to the program, such as accurate, comparable scores.
The essential requirements of MSPAP were the following:
1. In conformity with the Maryland learning outcomes, develop and implement performance-based assessments in four content areas:
   Language Arts: Reading [RD], Writing [WR], Language Usage [LU]
   Mathematics [MT]
2. Beginning with the second year of the program, implement assessments measuring state outcomes in two additional content areas: Science [SC] and Social Studies [SS]
3. Use assessments that model good instruction
4. Limit total testing time to 9 hours
5. In grades 3, 5, and 8 assess every student in each content area in May and provide student scores
6. Develop proficiency level cut-points and descriptions of typical student performance seen at those levels in each content area
7. Develop standards to be used in evaluating school performance as Unsatisfactory, Satisfactory, or Excellent in each content area
8. Produce measures of school performance in the six content areas that are sufficiently accurate that school performance can be
   • evaluated relative to the proficiency level cut points and school performance standards
   • compared over years
9. Provide schools with useful information about learning outcome scores that can be compared over test forms and over years; produce outcome scores useful in guiding school improvement plans
10. Without data-based tryouts, produce operational scores in the first year of testing
11. Produce a pool of scaled tasks with known statistical characteristics that can be used for subsequent test form assembly; provide psychometric descriptions of items (with respect to such characteristics as difficulty, fit, and bias) to inform and improve future task development
FIRST YEAR OF THE PROGRAM

The bulk of this paper will describe typical findings from the first year of the program. Some changes made in later years will then be presented.
Note on Terminology

The questions or directions given to students in a performance assessment are very different from those in multiple-choice tests. Performance assessment scores also are dependent on the rule or rubric used to grade each response. For the sake of simplicity, each prompt, question, or set of directions to which a student responds, along with its associated scoring rule, will be called collectively an "item."
Content Areas

Reading

The reading domain is defined by reader purpose and the orientations or stances that readers take toward text as they read. In the Maryland reading model, as portrayed in the Maryland learning outcomes which form the basis for the MSPAP reading assessments, there are three purposes for reading: reading for literary experience, for information, and to perform a task. (Reading to perform a task does not involve actually performing the task.) The model includes four reading stances: global understanding, developing interpretation, personal response, and critical stance. Purposes and stances are fully crossed in the Maryland model. (The Maryland model also includes metacognition and attitudes toward reading.) The Maryland reading model is unlike earlier reading component skill models in which the reading process is viewed as a set of discrete skills that function fairly autonomously and that can be developed and assessed separately. The Maryland model is derived from reader response theory, in which reading -- as well as writing, listening, and speaking -- is viewed as a process of constructing, examining, and extending meaning through complex interactions with text (see Langer, 1990). Reading purposes and stances are the same across the three grades assessed in MSPAP.

MSPAP reading assessment tasks are comprised of several open-ended assessment activities that require students to construct, examine, and extend meaning through the stances on the literary, informational, and instructional passages they read during the assessment. Responses to reading activities were scored using 2-, 3-, and 4-point keys and a 4-point rubric for one extended response scored for both reading and writing. Coverage of each reading purpose was proportionally balanced in each 1991 MSPAP test form, with literary experience most often assessed, reading to perform a task least often, and reading for information activities increasing at grade 8. Coverage of the stances within each purpose was proportionally balanced in each test form, with global understanding, personal response, and critical stance questions and prompts less frequent than developing interpretation questions and prompts. Because MSPAP tasks vary in length, the numbers of activities and coverage of outcomes vary somewhat across test forms.
Writing and Language Usage

The writing domain is defined by three purposes for writing -- to inform, persuade, and express personal ideas -- and steps in the writing process -- prewriting/planning, drafting, revising, and proofreading. The writing outcomes include long-recognized modes of discourse (narration, exposition, description, and persuasion). The Maryland model also includes attitudes toward writing. Writing is viewed as a process of constructing, examining, and extending meaning for a variety of audiences. In this view, writers use rhetorical devices (e.g., argumentation for persuasive writing) and other elements of the craft of writing (e.g., word choice, sentence variety) to accomplish the three writing purposes. Direct assessment of writing performance, with examinees guided through the composing process, dates back to the inception in 1983 of one of Maryland's minimum competency graduation tests, the Maryland Writing Test.

The single language usage outcome incorporates correctness and completeness features in the appropriate use of English conventions (e.g., punctuation, grammar, spelling) across a variety of writing purposes and styles. Writing and language usage outcomes are the same across the three grades assessed in MSPAP. The levels of sophistication expected in written responses increase with grade level.

MSPAP writing prompts elicit extended written responses (i.e., essays, stories, poems, plays). Topics for these prompts in the 1991 MSPAP were linked either directly or by theme to reading passages and assessment activities. (Topics for writing prompts in the 1992 MSPAP and beyond are linked to assessment activities in reading, science, and social studies.) Each student wrote two responses, which were scored using a four-point holistic rubric in which the lowest score (0) indicated that the writer failed to minimally meet the criteria for the writing purpose and the highest score (3) represented excellent performance. The two extended writing responses required each student to write for two of the three purposes. MSPAP does not include separate and independent language usage assessment activities. Appropriate use of language (i.e., conventions and style) is assessed through students' written responses to reading, mathematics, science, and social studies assessment activities and to writing prompts. Reading and writing responses were scored for language usage using a 3-point rule for brief responses and a 4-point rubric for extended writing responses.
Mathematics

The mathematics domain is defined by nine content outcomes and four process outcomes. MSPAP also assesses attitude toward mathematics and use of technology, but these outcomes are not part of high stakes test usage. These 13 Maryland mathematics learning outcomes and their sub-outcomes (i.e., more specific indicators of the outcomes) form the basis for MSPAP mathematics assessments. The Maryland outcomes are a close adaptation of the widely known NCTM Curriculum and Evaluation Standards for School Mathematics (National Council of Teachers of Mathematics, 1989). The nine content outcomes are similar to traditional mathematics objectives (e.g., number relationships, algebra, probability), unlike the four process outcomes (problem solving, communication, reasoning, and connections inside and outside of mathematics). All mathematics outcomes were covered in all 1991 test forms.

The 1991 MSPAP open-ended mathematics tasks required students to solve multi-step problems, make decisions and recommendations, communicate their ideas, understanding, and reasoning in mathematics, and explain processes they used to solve problems. Responses to mathematics activities were scored using 2- and 3-point scoring keys and some 4-point keys. Coverage of content and process outcomes in the 1991 MSPAP mathematics assessment was proportionally balanced. Approximately 40% of the activities in each form in grades 3 and 5 assessed process outcomes while approximately 60% assessed content outcomes; about 20% of the activities assessed both content and process outcomes. These percentages shifted in grade 8 to 20-30% for process and 70-80% for content outcomes. Because MSPAP tasks vary in length, the numbers of activities and coverage of outcomes vary somewhat across test forms.
Task Structure

Typically, MSPAP tasks begin with an opening activity (e.g., a brief discussion) which is not scored, followed by one or more readings, often but not always followed by assessment activities (referred to here as "items"). The purposes of opening activities include orienting students to the theme, content area, and expectations of the task; encouraging students to activate prior knowledge related to the task; providing knowledge related to the task; and providing students with a purpose to undertake the task.

MSPAP readings in all content areas usually are unabridged, published, and "authentic" reading selections. They are usually selected from trade publications such as magazines, picture books, and short story collections rather than from textbooks or basal reading series.

MSPAP assessment items include questions, prompts to students, and other stimuli that elicit a written, sketched, or diagrammed response from students. They are always linked sequentially, related to a common theme, or otherwise organized in a coherent fashion. Reading items focus on one or more reading passages. Writing prompts always are linked to reading passages and accompanying items or to observations, investigations, or other assessment activities in other content areas. Mathematics items usually do not focus on a reading passage; however, they often comprise problems, investigations, or issues that must be addressed over multiple steps and that usually lead to a culminating solution, recommendation, decision, and/or explanation. This structure of an opening activity followed by items may repeat itself in a single task. Figure 1 contains an example of this structure from a 1991 MSPAP grade 5 reading/writing/language usage task.

Insert Figure 1 about here
Form Structure and Administration

In order to cover the required breadth of content in limited testing time, multiple test forms were developed. In the first year of the program, there were 5 or 6 forms (depending on grade) in language arts (i.e., reading, writing, and language usage) and five forms in mathematics.

For the Language Arts performance assessments, testing occurred in five sessions over five days, with one or one-and-a-half hours of testing per session. In each session, students read one or more short stories or articles and wrote several short answers or one extended response, depending on the session. Extended Writing responses were obtained in two of the sessions, Reading responses were obtained in four of the five sessions, and Language Usage responses were obtained in all five sessions. The Mathematics assessment was administered in three one-hour sessions on three school days preceding (Grade 8) or following (Grades 3 and 5) the Language Arts assessment.
In each Mathematics testing session, students responded to two or three tasks related to different themes. The typical number of items per task was 5 or 6, but there was a range from 3 to 10. The description of the scenario for a task typically took a half page; additional information was often provided within the assessment, and students responded to several questions related to that scenario. It was common for a process item to follow a particular content item and to ask the student to explain the reasoning behind the answer to that preceding item. Within a session, the tasks were not separately timed, and students managed the time on their own.

The assessments were administered in testing groups of normal class size. Students were randomly assigned to testing groups rather than assessed in intact classes by their regular teachers; the teachers who acted as test examiners were assigned by chance, and the testing materials and data were kept secure before the assessment occurred.

There were several motivations for this randomization. First, it helped minimize undue influences on school performance caused by teacher-by-task effects. Because each form could include only a limited number of tasks, there was concern about the possibility that a particular teacher might have used teaching materials (for example, a particular reading passage or a mathematics scenario) that subsequently appeared on the MSPAP assessment. Random assignment of students to forms also minimized the chance (particularly in smaller schools) that a particular teacher-by-form effect could have a major impact on a school's results, and assured that heterogeneous groups of students would take each test form. Finally, the randomization mimicked "authentic" working situations in which people cooperate with a variety of colleagues and supervisors depending on the nature of the work.
In the first year of the program, there was not time to try out the new assessments except in some small pilots. There was a concern that some of the tasks might not "work." Also, as described in a later section, steps were taken to equate scores over forms. However, in the first year of the program, it was not known how well these new equating procedures would work with performance assessments. Therefore, to maximize the chances of good quality data being produced for every school and to assure comparability of results over schools, the two "best" forms were identified in each grade and content area. These forms were selected by MSDE to be those containing tasks that were judged to be the most authentic and engaging, and that appeared to be of the highest quality; these forms were administered in every school. For schools with more than two testing groups (classes), additional forms were randomly assigned.
Scoring Process

Examinee responses to MSPAP assessment activities were scored by trained Maryland teachers using activity-specific keys (usually used for brief responses), generic rules for longer responses, and rubrics for essays and other extended responses. All MSPAP scoring tool types allowed full and partial credit to be awarded to responses. In all content areas and grades, however, student responses that were less than minimally acceptable received a score of 0; no partial credit was awarded for attempted responses that were less than minimally adequate. In addition to the training and qualifying of table leaders and other readers, read-behinds and check sets were employed. Except for a small number of student papers involved in a special study, all student papers were read and graded once.
Item Scores

The lowest score for each item was 0, and the maximum possible score varied from 1 to 3. Omitted responses were given a score of 0 for sessions where a student was present. Table 1 displays the typical percent of items with each maximum possible score. The range of the number of items per form also appears in that table. More detailed information about the characteristics of the item scores is available from Fitzpatrick, Ercikan, and Ferrara (1992) and from Goldberg and Kapinus (in press).

Insert Table 1 about here
and
Scaling

Several MSPAP requirements led to the use of an item-based scaling procedure. These requirements included:

• Proficiency level cut points and descriptions of typical student performance at those levels that could be compared across test forms and years
• Outcome scores that could be compared across test forms
• A pool of calibrated items from which future forms might be assembled
• Student level scores that are as accurate as possible
• Standard errors of measurement describing the accuracy of students' scores
Item-based scaling offered a means of satisfying all these requirements. Traditional item response theory (IRT) models have provided these characteristics for multiple-choice tests. Because the MSPAP scores were not just "right" or "wrong," a scaling model was needed that could reflect responses at different score levels. It was particularly important that the scaling model be a "servant" to the content and not require restrictions that would unnecessarily limit test design. Masters' Partial Credit model (Masters, 1982) can be used to scale items with varying numbers of score points. However, Masters' model has an important limitation that made it unsuitable for MSPAP: that model is analogous to the Rasch model for multiple-choice tests and forces all items, regardless of their number of score points, to have the same discrimination. Items with different discriminations must be removed from the test or the model will be inaccurate. Analyses of pilot data indicated that performance items with different scoring rules and rubrics can vary substantially in their discriminations. Given these concerns, a generalization of Masters' model that allowed items to have different discriminations, dubbed the "two parameter partial credit" (2PPC) model, was developed (Yen, 1993), and computer programs were developed to estimate the item parameters and students' trait values (Burket, 1991). (It may be of interest to note that, in subsequent independent research, Muraki, 1992, developed the same model.) The 2PPC model was found to describe the MSPAP performance items quite well; out of the many hundreds of item responses analyzed, only a handful needed to be deleted from the scales because of poor fit: 8 items in Math Content, 3 items in Reading, and 4 items in Math Process.

Traditional multiple-choice tests are usually carefully designed to avoid local item dependence. In contrast, performance tests such as the MSPAP include items that are designed to be related to other items. An example is math process items that ask the student to explain the reasoning used to obtain a previous answer. In order to gain the benefits of item response scaling without unnecessary restrictions on test design, a variety of psychometric strategies for managing local item dependence, including the use of testlets, were employed. These strategies were invisible to the test developers and users. Detailed information about these strategies is available from Yen (1993; see also Ferrara, Huynh, & Baghi, 1993).
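For readers unfamiliar with this family of models, one common way to write a two-parameter partial credit (generalized partial credit) model is shown below. The parameterization is illustrative; it is the standard form of the model and may differ in notational detail from the one used operationally for MSPAP.

    P(X_j = k \mid \theta) \;=\;
    \frac{\exp\!\Big[\sum_{v=1}^{k} \alpha_j (\theta - \delta_{jv})\Big]}
         {\sum_{c=0}^{m_j} \exp\!\Big[\sum_{v=1}^{c} \alpha_j (\theta - \delta_{jv})\Big]},
    \qquad k = 0, 1, \ldots, m_j,

where the empty sum (c = 0 or k = 0) is taken to be zero, theta is the student trait, alpha_j is the discrimination of item j, delta_jv are step (location) parameters, and m_j is the item's maximum score. Constraining all alpha_j to be equal recovers Masters' (1982) Partial Credit model.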
Content Area Scale Scores

Scale scores were developed for each content area and grade. To obtain the most accurate scale scores possible, a student's item scores were weighted by the item discriminations; items received more weight if they discriminated better among students at different performance levels. Conversion tables translated these weighted raw scores to scale scores; in essence, these scoring tables produce an approximate transformation of the IRT ability scale. Scale scores ranged from a low around 350 up to a high of about 700, depending on the scale and grade. They were set to have a mean of 500 and a standard deviation of 50.

Equating Test Forms

In the equating, the goal was to have the forms within a grade and content area produce scale score distributions that were as similar as possible. An equivalent groups equating design was used. Using equivalent samples of 1,500 students who took each form, the cumulative scale score distributions were determined for these samples, and the linear transformation was found that most closely aligned the distributions of scores from the different forms. This procedure is the linear procedure that most closely approximates a non-linear equipercentile equating. Examples of the similarity of the distributions of scores from the different forms are contained in Figures 2 to 4; these are cumulative distributions and show the percentage of students at or below each scale score. In general, the smoother the distributions, the more closely aligned were the scale score distributions produced by the equating. The Writing tests, which had only two writing responses (that is, 2 items), produced the least exact equatings.

Insert Figures 2-4 about here
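As a rough illustration of an equivalent-groups equating, the sketch below matches the first two moments of the two forms' score distributions. This is a simplification: the operational MSPAP procedure aligned the full cumulative distributions (a linear approximation to equipercentile equating), and the function, data, and sample sizes here are hypothetical.

    import numpy as np

    def linear_equate(new_form_scores, reference_form_scores):
        """Linearly map scores on a new form onto the reference form's scale.

        Both samples are assumed to come from randomly equivalent groups, so
        differences between the distributions are attributed to form
        difficulty rather than to the examinees.
        """
        new = np.asarray(new_form_scores, dtype=float)
        ref = np.asarray(reference_form_scores, dtype=float)
        slope = ref.std(ddof=1) / new.std(ddof=1)    # match standard deviations
        intercept = ref.mean() - slope * new.mean()  # match means
        return slope * new + intercept

    # Hypothetical equivalent groups of 1,500 students each.
    rng = np.random.default_rng(0)
    form_a = rng.normal(500, 50, 1500)   # reference form scale scores
    form_b = rng.normal(490, 55, 1500)   # a slightly harder, more variable form
    equated_b = linear_equate(form_b, form_a)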
Outcome Scores

Outcome scores were based on subsets of items that assessed each outcome, as determined by MSDE content experts. A student's outcome score was the percent of maximum score the student would be expected to obtain if that student had taken all the items measuring that outcome in all the forms at that grade. These outcome scores were intended to be comparable across the forms within a grade.

The definition of this outcome score is most easily understood with a hypothetical graphical example, as seen in Figure 5. Consider one particular outcome at a particular grade. Using the calibrations of the items contributing to that outcome, the Expected Percent of Maximum (EPM) as a function of scale score is found for each form. In the figure, one such EPM curve is labelled "(a)"; there are actually many such curves, one for each form, but only one appears in this example. There is also a curve based on the outcome items in all the forms; it is labelled "(b)," and there is only one such curve for an outcome.

Insert Figure 5 about here

The dotted lines in the figure show how the curves are used to transform a student's observed score on a particular form to the EPM outcome score. A student's observed outcome score [70% of the maximum possible on a test form] is converted to a scale score [500] by using the curve for the form the student took (Form A). Using curve (b), this scale score is then translated into an EPM outcome score [40] that is based on the behavior of the outcome items in all forms rather than on the items that happened to be in the student's form. This procedure is a means of adjusting a student's observed outcome score for the difficulty of the items in his/her form, so that outcome scores are referenced to a domain of items. Its goals are to produce outcome scores that are comparable and stable over forms; the degree to which these goals were met was evaluated as part of the study of standard errors of school means, which include form effects, described in a later section.
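A minimal sketch of the two-step transformation illustrated in Figure 5, using hypothetical, monotone EPM curves represented as lookup tables (the operational curves come from the IRT calibrations, and the curve shapes below are assumptions for illustration only):

    import numpy as np

    # Hypothetical EPM curves on a grid of scale scores: curve (a) is for the
    # form the student took, curve (b) is based on the items in all forms.
    scale_grid = np.arange(350, 701, 5)
    epm_form_a = 100.0 / (1.0 + np.exp(-(scale_grid - 520) / 40.0))     # assumed
    epm_all_forms = 100.0 / (1.0 + np.exp(-(scale_grid - 540) / 45.0))  # assumed

    def epm_outcome_score(observed_percent_of_max):
        """Observed percent of maximum on the form taken -> EPM outcome score."""
        # Step 1: invert curve (a) to find the scale score implied by the
        # student's observed percent of maximum on his or her own form.
        scale_score = np.interp(observed_percent_of_max, epm_form_a, scale_grid)
        # Step 2: read curve (b) at that scale score to obtain the outcome
        # score referenced to the items in all forms.
        return np.interp(scale_score, scale_grid, epm_all_forms)

    print(epm_outcome_score(70.0))  # e.g., a 70% observed score on this form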
Psychometric Results

Proficiency Levels

To assist in communication of the meaning of scale scores, behavioral descriptions were developed for performance levels defined by scale score ranges. In developing these descriptions, scaling results were used. Each item was located on the scale; the location was defined to be the scale score at which the item measured best or, in other words, provided the maximum measurement information. In an analogous fashion, the location of every item score was also placed on the scale. For example, for a particular item, a score of 0 might be located at a scale score of 440, a score of 1 might be at 480, a score of 2 might be at 520, and a score of 3 might be at 550; the overall item location might be at 500. Four scale score values were identified that typically had substantial numbers of item scores located near them: 490, 530, 580, and 620. These values were used as cut scores for five proficiency levels. Committees of content experts examined the items and item scores located in these proficiency levels and developed descriptions of the knowledge, skills, and processes of the assessment that students typically displayed at each performance level.
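The item "location" described above can be found numerically from calibrated item parameters. Below is a sketch under the 2PPC parameterization shown earlier; the parameter values are hypothetical, and the result is on the theta metric (it would then be mapped onto the 350-700 scale-score metric).

    import numpy as np

    def category_probs(theta, alpha, deltas):
        """2PPC/GPCM category probabilities for scores 0..m at ability theta."""
        steps = np.concatenate(([0.0], np.cumsum(alpha * (theta - np.asarray(deltas)))))
        expnum = np.exp(steps - steps.max())          # stabilized exponentials
        return expnum / expnum.sum()

    def item_location(alpha, deltas, grid):
        """Point on the grid where the item provides maximum information.

        For a common-slope partial credit item, the information at theta equals
        alpha**2 times the conditional variance of the item score.
        """
        info = []
        for theta in grid:
            p = category_probs(theta, alpha, deltas)
            k = np.arange(len(p))
            info.append(alpha**2 * (np.sum(k**2 * p) - np.sum(k * p)**2))
        return grid[int(np.argmax(info))]

    grid = np.linspace(-3, 3, 601)   # hypothetical theta grid
    print(item_location(alpha=1.2, deltas=[-0.5, 0.3, 1.0], grid=grid))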
School Performance Standards

In order to understand the stakes related to the scale scores, it is necessary to describe later developments with respect to the proficiency levels and school performance standards. In the second year of MSPAP, the proficiency levels were reevaluated, involving broader-based committees and further analyses of item content. Most proficiency level cut points were essentially the same as the 1991 values (490, 530, 580, and 620), although some changes were made (for example, grade 5 Mathematics). These committees refined and enhanced the behavioral descriptions, which are available from MSDE.

Furthermore, important high stakes school performance standards were established. For a school to be evaluated as "Satisfactory" in a particular grade and content area, at least 70% of the students in that grade and content area would need to have scores above 530. For a school to be evaluated as "Excellent," it would need to have reached the "Satisfactory" level and have at least 25% of its students above 580. The school performance evaluations became public information beginning with the 1992 assessments. In 1996 school sanctions and rewards will be tied to these evaluations.
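Stated algorithmically, the standards just described amount to two percentage checks. The helper below is hypothetical (it is not MSDE's reporting software); the 530/580 cut points and the 70%/25% rules are those given above.

    def classify_school(scale_scores, satisfactory_cut=530, excellent_cut=580):
        """Classify a school in one grade and content area: Satisfactory if at
        least 70% of students score above 530; Excellent if, in addition, at
        least 25% score above 580; otherwise Unsatisfactory."""
        n = len(scale_scores)
        pct_above_sat = 100.0 * sum(s > satisfactory_cut for s in scale_scores) / n
        pct_above_exc = 100.0 * sum(s > excellent_cut for s in scale_scores) / n
        if pct_above_sat >= 70.0 and pct_above_exc >= 25.0:
            return "Excellent"
        if pct_above_sat >= 70.0:
            return "Satisfactory"
        return "Unsatisfactory"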
Test Difficulty

Table 2 describes the average percent of maximum scores for the items in the 1991 MSPAP. For comparison purposes, average difficulties for a traditional multiple-choice test are presented, based on Maryland student performance on CTBS/4 (Comprehensive Tests of Basic Skills, Fourth Edition, 1989). For multiple-choice tests like CTBS/4, students can get items correct through lucky guessing, and given the number of answer choices for the items, it is possible to estimate how difficult the tests would have been if guessing the correct answer were not possible. Those estimates are also presented in Table 2. The MSPAP items were more difficult than CTBS/4, even if the effects of guessing are removed from CTBS/4. The exception to this general finding is Reading. For the MSPAP Writing scores, a comparison is made with the Maryland Writing Test, a minimum-competency graduation test. That test is part of Maryland's minimum competency testing program and focuses on lower performance expectations. In contrast, the MSPAP focuses on higher level expectations, with more difficult items and more stringent scoring rubrics. MSPAP assessments were expected to be more difficult than is typical for educational achievement tests because they were designed to represent standards for 1996 and beyond.

Insert Table 2 about here
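The paper does not give the exact guessing adjustment used for Table 2. A standard correction, assuming an examinee either knows the answer or guesses blindly among the A answer choices, is p_adj = (p - 1/A) / (1 - 1/A). A minimal sketch of that assumed formula:

    def percent_correct_without_guessing(p, num_choices):
        """Estimate the proportion correct if blind guessing among num_choices
        options were not possible (knowledge-or-random-guessing assumption)."""
        g = 1.0 / num_choices
        return max(0.0, (p - g) / (1.0 - g))

    print(percent_correct_without_guessing(0.60, 4))  # about .47 for a 4-choice item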
Measurement Accuracy

Five measures of score accuracy were produced, and in most cases they were developed for both content area scale scores and outcome scores:

1) Correlations of scores produced by different raters (content area scores only)
2) Coefficient alphas
3) Standard errors of measurement for students' scale scores
4) Standard errors of school means
5) Dependability coefficients for school means

Correlations Between Raters. A special study was conducted that involved the scoring of a small group of student response books by two groups of raters. Two test forms in each grade were involved in the study, and for each form from 208 to 246 student books were scored twice. The first scoring was conducted by Maryland teachers as part of the operational scoring process. The second scoring was conducted by professional scorers in California. Table 3 contains the correlations between the students' scores produced by the two sets of raters. The more objective scoring rules involving substantial numbers of items (for example, those in mathematics) produced very high correlations. Scores based on the smaller numbers of items and the least objective scoring rubrics (that is, Writing) produced the least consistent scores.

Insert Table 3 about here
Coefficient Alphas. Coefficient alpha is a reliability measure related to the KR-20 but suitable when items have a variety of score levels (Allen and Yen, 1979). Table 4 contains these values for the MSPAP content area scores. For comparison purposes, KR-20 reliabilities for CTBS/4, based on Maryland student performance, are presented. While slightly lower than the CTBS/4 values, the MSPAP reliabilities are quite high. For the MSPAP Writing scores, a comparison is made with the Maryland Writing Test.
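For reference, coefficient alpha can be computed directly from an examinee-by-item score matrix; a minimal sketch follows (the values in Table 4 were, of course, computed from the operational MSPAP data, not from this code).

    import numpy as np

    def coefficient_alpha(scores):
        """Coefficient (Cronbach's) alpha for an (examinees x items) matrix of
        item scores, suitable for items with any number of score levels."""
        scores = np.asarray(scores, dtype=float)
        n_items = scores.shape[1]
        sum_item_vars = scores.var(axis=0, ddof=1).sum()   # sum of item variances
        total_var = scores.sum(axis=1).var(ddof=1)         # variance of total scores
        return (n_items / (n_items - 1.0)) * (1.0 - sum_item_vars / total_var)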
Insert Table 4 about here

Outcome score reliabilities were highly related to the number of items in the outcome; low reliabilities, down to about .33, tended to occur for outcomes with very few items, and the values were the lowest for Writing. For outcomes that had at least four items, the coefficient alpha values ranged from .62 to .93 for Reading outcomes and from .84 to .89 for Mathematics outcomes, and the value was .90 for the Language Usage outcome.

Standard Errors of Measurement for Student Scale Scores. For each test form and scale, the standard error of measurement (SEM) produced by the scaling model was computed for each scale score value. These values varied somewhat over forms within a grade. As a summary measure, the SEM at selected scale scores was averaged over forms; sample values for grade 8 appear in Table 5. The SEM is influenced by the amount of information provided by each item and, similar to coefficient alpha, is also influenced by the number of items contributing to the scale. Therefore, the higher SEM for the two-item Writing tests is not surprising.
Insert Table 5 about here

Approximately two-thirds of the students had scale scores in the 450 to 550 range. The lowest SEMs tended to be in the neighborhood of 500; the mathematics test SEMs tended to have their minimum in the neighborhood of 550. In general, score accuracy was good in the neighborhood of the 530 cut-point for "Satisfactory" performance. Empirical SEM values for individual students would be higher than the values in Table 5 because of variability due to rater effects; additional variance would also be introduced into student scores by form effects. However, the primary focus of the MSPAP is on school performance. As described in the next two sections, evaluations were made of empirical standard errors of school means that included all sources of error.
s t a n d a r d Errors o~ School Means.
school m e a n s w e r e obtained.
performance
for every
within-school
variance,
school,
variance
Empirical
First,
s t a n d a r d errors of
for every school,
form was calculated,
the m e a n
and then the p o o l e d
of form means was determined.
d i v i d e d by the n u m b e r of forms a d m i n i s t e r e d
This
in the
was t a k e n as the s q u a r e d standard error of the o v e r - a l l
school mean.
all s o u r c e s
It can be noted that this s t a n d a r d error includes
of v a r i a t i o n that affect scores w i t h i n a school,
i n c l u d i n g rater effects,
systematic
error.
5
form effects,
and m e a s u r e m e n t
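A minimal per-school sketch of the computation just described; the data structure is hypothetical (each entry holds the scale scores of the students who took one form in that school), and the operational procedure may have pooled the within-school variance across schools.

    import numpy as np

    def squared_se_of_school_mean(scores_by_form):
        """Squared standard error of a school's overall mean: the variance of
        the per-form means divided by the number of forms administered."""
        form_means = np.array([np.mean(scores) for scores in scores_by_form])
        n_forms = len(form_means)
        return form_means.var(ddof=1) / n_forms

    # Hypothetical school with three forms administered to different classes.
    school = [[512, 480, 530, 505], [498, 520, 470, 510], [525, 515, 490, 500]]
    print(squared_se_of_school_mean(school) ** 0.5)   # standard error of the mean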
Table 6 describes the typical results in each grade. Recall that student scale scores ranged from a low of 350-400 up to the 650-700 range, that the proficiency levels were about 40 to 50 points apart on such a scale, and that student outcome scores, ranging from 0 to 100, typically had standard deviations in the low 20s. School means typically had standard errors of 2 to 7 points, certainly accurate enough to be useful to schools in understanding their performance.

Insert Table 6 about here

Dependability Coefficients for School Means. A generalizability study was conducted to examine the dependability of school means (Candell and Ercikan, in press). Table 7 presents results of this study based on grade 5, which are similar to those for the other grades. These are dependability coefficients for school means as a function of the number of forms administered in the school. For the scale scores, the dependability indices were in the high .80s and .90s, indicating substantial accuracy. The dependabilities for the outcome scores were somewhat lower, but with the exception of "Reading to Perform a Task" they were quite good.
Insert Table 7 about here
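One common form of a dependability index for school means treats forms as the source of error. The sketch below is a generic generalizability-style index under that assumption; it is not necessarily the exact estimator used in the Candell and Ercikan study, and the variance components are hypothetical.

    def dependability_of_school_mean(between_school_var, form_variance_within_school, n_forms):
        """Generic dependability index: the proportion of observed-score variance
        in school means attributable to true between-school differences when each
        school's mean is based on n_forms forms."""
        error_var = form_variance_within_school / n_forms
        return between_school_var / (between_school_var + error_var)

    # Hypothetical variance components (scale-score metric).
    print(dependability_of_school_mean(between_school_var=120.0,
                                       form_variance_within_school=40.0,
                                       n_forms=3))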
Score Validity
MSPAP validity evidence was collected with the goal of
supporting and validating intended interpretations and uses of
scores from the assessment.
The validity evidence described
below is organized around this goal.
Content Validity
The Maryland learning outcomes, which form the basis for
learning,
instruction,
and MSPAP assessment activities,
are based
on recently developed national curriculum standards and learning
theory.
For example,
the reading outcomes are based on NAEP
reading assessment objectives and reader response theory.
Similarly, the writing outcomes are based on long-recognized
modes of discourse and the mathematics outcomes are based on NCTM
standards for curriculum and evaluation.
(See the previous
section "Content Areas" for details.)
Outcomes coverage. Coverage of the outcomes by assessment activities is proportionally balanced according to the relative importance of the outcomes at different grade levels. A high degree of match between assessment activities and the outcomes they assess is ensured through multiple reviews during task development and the development of scoring tools and guides.
Instructional validity. The notion of instructional validity became important in the Debra P. Florida court case on minimum-competency testing (see Madaus, 1983). Here, the concept is that for test scores to be valid, what has been assessed must have been taught. However, the applicability of this criterion to MSPAP is weak. The state-developed learning outcomes assessed in MSPAP were widely disseminated, and one of the intended purposes of MSPAP is to guide and goad educational reform -- to model and communicate for teachers and other educators what should be taught and learned and how it should be taught -- with MSPAP assessment activities modeling some of the best classroom instruction. If the outcomes assessed in MSPAP were already being taught and learned in the schools, then the reform the program is intended to goad would not have been necessary.

Construct Validity

Construct validity has recently come to be considered a unifying concept for all views and types of test score validity evidence (see Cronbach & Meehl, 1955; Cronbach, 1971, p. 469; Messick, 1989, p. 13).

Internal structure. Here, internal structure refers to the degree to which items that assess the same content area behave similarly to one another and somewhat differently from items in other content areas. Factor analyses of the content area scale scores were conducted, and internal consistency estimates were computed for each of the content areas (CTB Macmillan/McGraw-Hill, 1992). Both sets of results provide evidence of the internal structure of MSPAP scores. This evidence should not be surprising, given the high reliability and inter-rater correlations reported in Tables 3 and 4 and the high degree of match between assessment activities and outcomes discussed above.
Concurrent validity.
Here,
concurrent validity refers to
correlational relationships
among MSPAP scores and external
measures. The correlations in Table 8 provide some evidence of
the convergence and discriminance
(see, for example,
Cronbach,
1971, p. 466 ff.) of MSPAP content area scale scores.
For
example, MSPAP reading scores tend to be more highly correlated with teacher ratings of student proficiency in reading and writing than in mathematics. (However, MSPAP Reading is also highly correlated with CTBS/4 mathematics scores.)
Insert Table 8 about here
Differential item functioning (DIF).
Items that are
"biased" against groups of students who take MSPAP -- that is,
that function differently for different student groups -- diminish
construct validity.
A measure of DIF generalized from the Linn-Harnisch (1981) procedure was used to flag differentially functioning items.
Analyses of the numbers of MSPAP items flagged for DIF, as compared with CTBS/4, are contained in Green, Fitzpatrick, Candell, & Miller (1992).
MSDE has studied items
flagged for DIF to inform subsequent assessment task development.
Consequential Validity

Since the primary focus of MSPAP is school performance, the most salient negative consequences of using MSPAP scores (e.g., required school improvement plans, management by an outside party) will occur for low-performing schools, as described earlier. However, these negative consequences are expected to be short-term (i.e., until such schools function successfully) and can be viewed as positive consequences for schools that need help to improve. The long-term consequence of using MSPAP scores is expected to be positive: school improvement.

Consequences of using MSPAP scores are also expected for students. These consequences could be positive -- that is, as schools improve instruction, student learning improves -- or negative -- for example, if MSPAP score information does not provide useful information, schools do not improve, and instruction and student learning do not improve. Other consequences of using MSPAP information are also evident. For example, the low performance reported for the 1991 MSPAP resulted in reports of low teacher morale and complaints that the test was being used for school and teacher "bashing." (These consequences occurred even though the media and public were instructed on the forward-looking nature of MSPAP standards and expectations for 1996 and beyond.)
Conclusion

The evidence and arguments for content and construct validity and other technical information about MSPAP provide reasonable assurance that MSPAP scores can be validly interpreted for evaluating school performance and guiding and goading school improvement. Similarly, the anticipated positive and negative consequences of using MSPAP scores provide support for the reasonableness of using the scores for these purposes. Validation of MSPAP score interpretation and use remains an on-going process.
CHANGES IN SUBSEQUENT YEARS

One conclusion that was drawn from the first-year generalizability study was that, given the level of school-by-form interaction seen in these results, two or three forms would be sufficient for measuring school performance. Based on these results and MSDE content experts' analyses, it was decided to use three forms, or clusters, in subsequent years. All three clusters are administered in every school. These clusters measure somewhat different content, with some outcomes being measured in one cluster but not another; the use of such clusters permits the breadth of content coverage to be maintained for the schools.
It is possible that the use of these clusters might cause scale scores for individual students who took different forms to become less comparable; this possibility is being examined. However, because all clusters are administered to all schools, school results necessarily maintained their comparability.

In 1992, the second year of testing, the number of days during which the students were tested was reduced to five. Also, science and social studies were added to the assessments and integrated with the other content areas. The numbers of items in the other content areas were decreased, with the following ranges of numbers of items per cluster: Reading (18 to 41), Math Process (8 to 18), Math Content (13 to 19), Language Usage (4 to 6), and Writing (1). The content areas for which this appeared to be too severe a decrease were Language Usage and Writing; in 1993 and subsequent years, there was a return to two extended writing responses and eight Language Usage responses. For some outcomes with few items, the standard errors of the school mean increased noticeably, and schools were cautioned about their use.
Student choice was introduced in 1992. In one testing session for one cluster in each grade, students selected from among Reading passages. In addition to the choice passages, there were passages and items responded to by all students who took that cluster. These non-choice items provided an anchor for calibrating the choice items. Further information about the characteristics of the choice items is presented in Fitzpatrick and Yen (1993).

In the 1992 assessments there were more administration problems and tasks that did not "work," particularly in Science. Problematic items were eliminated from the scoring and scaling process. As a result of these problems, more time was scheduled for piloting, review, and revision in the development of the 1993 assessments. These changes were successful in substantially reducing such problems.

It was desired to have school outcome scores that could be compared over years so that schools could track their progress. The outcome scores described in Figure 5 are not comparable over years, because they were tied to the difficulty of the items administered each year. In 1992 an additional outcome score was reported. This "outcome scale score" is the scale score produced from curve (a); that is, it is the scale score whose expected percent of maximum on the outcome equals the student's obtained percent of maximum on the outcome. Because of limitations on the numbers of items related to each scale score and the heterogeneity of the outcomes, the outcome scale scores are not as strictly comparable as are the content area scale scores. However, the stakes associated with the outcome scale scores are not as high as those associated with the content area scale scores, lessening the need for strict comparability.
As described in an earlier section, the proficiency level descriptions and school performance standards were revised in 1992. Proficiency level descriptions were also set for Science and Social Studies. All content areas were reviewed except Writing and Language Usage; those areas will be reviewed in 1993 and 1994, and performance standards will be set for them.
SUMMARY

The MSPAP requirements dictated the psychometric qualities needed for the program. Equated forms were needed to compare results over schools and years. Accurate school content area scores were needed in order to evaluate school performance relative to state standards. The need to track school outcome performance over years led to the development of comparable outcome scores. The desire to establish proficiency level descriptions led to requirements tying-in item performance and scale scores. These requirements, along with the desire to develop a pool of calibrated tasks with known statistical characteristics, led to the use of an item-based scaling procedure for MSPAP. Empirical results summarized in this paper provide evidence that MSPAP, an innovative performance-based testing program, does have the psychometric quality needed for high stakes usage.
References

Allen, M., & Yen, W. M. (1979). Introduction to measurement theory. Monterey, CA: Brooks/Cole.

Burket, G. R. (1991). PARDUX [Computer program]. Monterey, CA: CTB Macmillan/McGraw-Hill.

Candell, G. L., & Ercikan, K. (in press). Assessing the reliability of the Maryland School Performance Assessment Program. International Journal of Educational Research.

Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). New York: American Council on Education.

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281-302.

CTB Macmillan/McGraw-Hill. (1989). Comprehensive Tests of Basic Skills, 4th edition. Monterey, CA: Author.

CTB Macmillan/McGraw-Hill. (1992). Final technical report: Maryland School Performance Assessment Program, 1991. (Available from the Maryland State Department of Education, Baltimore, MD.)

Ferrara, S., Huynh, H., & Baghi, H. (1993). Assessing local dependency in educational performance assessments with clustered free-response items. Manuscript submitted for publication.

Fitzpatrick, A. R., Ercikan, K., & Ferrara, S. (1992, April). An analysis of the technical characteristics of scoring rules for constructed-response items. Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco.

Fitzpatrick, A. R., & Yen, W. M. (1993, April). The psychometric characteristics of choice items. Paper presented at the annual meeting of the National Council on Measurement in Education, Atlanta.

Goldberg, G. L., & Kapinus, B. (in press). Problematic responses to reading performance assessment tasks: Sources and implications. Applied Measurement in Education.

Green, D. R., Fitzpatrick, A. R., Candell, G., & Miller, E. (1992, April). Bias in performance assessment. Paper presented at the annual meeting of the National Council on Measurement in Education, Atlanta.

Langer, J. A. (1990). The process of understanding: Reading for literary and informative purposes. Research in the Teaching of English, 24, 229-257.

Linn, R. L., & Harnisch, D. (1981). Interactions between item content and group membership in achievement test items. Journal of Educational Measurement, 18, 109-118.

Madaus, G. F. (Ed.). (1983). The courts, validity, and minimum competency testing. Boston: Kluwer-Nijhoff.

Maryland State Department of Education. (1989). Maryland writing test II: Technical report. Baltimore: Author.

Maryland State Department of Education. (1990). Maryland writing test II: Technical report. Baltimore: Author.

Maryland State Department of Education. (1991). Maryland writing test II: Technical report. Baltimore: Author.

Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed.). New York: American Council on Education/Macmillan.

Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159-176.

National Council of Teachers of Mathematics. (1989). Curriculum and evaluation standards for school mathematics. Reston, VA: Author.

Yen, W. M. (1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement, 30, 187-213.
Table 1
Numbers of Items Per Form and Their Maximum Possible Scores

                     Range of Number of    Approximate Percent of Items with
                     Items Per Form        Each Maximum Possible Score
                     Low      High             1        2        3
Reading              26       33              20       61       20
Writing               2        2              --       --      100
Language Usage        8        8              --       75       25
Math Content         37       52              60       33        6
Math Process         12       29              61       33        6
",
Table 2
Average
Percent of M a x i m u m Scores
Grade
3
5
8
CTBS/4 R e a d i n g
CTBS/4 R e a d i n g w/o Guess*
MSPAP Reading
.66
.58
.53
.63
.54
.56
.64
.55
.65
MSPAP Writing
MFTP Writing
1989
1990
1991
.38
.36
.56
CTBS/4 L a n g u a g e
CTBS/4 L a n g u a g e w / o Guess*
MSPAP Language
.69
.60
.40
CTBSf4 M a t h
CTBS/4 M a t h w / o Guess*
MSPAP M a t h C o n t e n t
MSPAP M a t h P r o c e s s
.68
.61
.41
.24
9
.84
.84
.79
.
.63
.53
.42
.62
.51
.55
.62
.54
.32
.29
.61
.53
.35
.28
Note. Based on CTBS/4 C o m p l e t e B a t t e r y A, M S P A P
average v a l u e s taken over forms, & MFTP
*Estimated p e r c e n t
of m a x i m u m if g u e s s i n g were removed.
8
Table 3
Range of Correlations Between Student Scores Produced by Two Sets of Raters

                     Low      High
Math Content         .97      .99
Math Process         .90      .95
Reading              .87      .95
Language Usage       .75      .87
Writing              .63      .73

Note. Range taken over grades and forms.
Table 4
Coefficient Alpha Reliability Coefficients for Student Content Area Scores

                                    Grade 3    Grade 5    Grade 8
CTBS/4 Reading Total                  .94        .95        .95
MSPAP Reading                         .93        .91        .91
MSPAP Writing                         .64        .61        .62
MFTP Writing (1989, 1990, 1991)       .67        .75        .66
CTBS/4 Language Total                 .93        .93        .87
MSPAP Language                        .93        .92        .88
CTBS/4 Math Total                     .93        .94        .95
MSPAP Math                            .89        .92        .94

Note. Based on CTBS/4 Complete Battery A, MSPAP median values taken over forms, and MFTP values.
Table 5
Standard Errors of Measurement of Scale Scores Averaged Over Forms: Grade 8

Scale Score    Reading    Writing    Language Usage    Mathematics
   350            23         --            --               --
   375            19         56            53               --
   400            16         41            47               --
   450            13         30            28               --
   500            15         19            17               33
   550            19         19            13               30
   600            28         19            16               32
   650            45         26            28               52
   700            72         50            57               --
Table 6
Typical Standard Errors of School Means

                     Scale Scores        Outcomes
                                       Low      High
Grade 3
  Reading                 7             3         6
  Writing                 7             3         3
  Language                7             3         3
  Mathematics             6             2         7

Grade 5
  Reading                 7             3         7
  Writing                 7             3         4
  Language                6             3         3
  Mathematics             6             2         5

Grade 8
  Reading                 5             2         3
  Writing                 5             3         3
  Language                5             2         2
  Mathematics             5             2         4
Table 7
Dependability Indices for School Means: Grade 5

                                           Number of Forms Administered
Score                                        2       3       4       5
Scale Scores
  Reading                                   .94     .95     .95     .95
  Writing                                   .86     .88     .90     .91
  Language Usage                            .88     .90     .91     .91
  Mathematics                               .95     .96     .97     .97

Outcome Scores
  Reading
    Literary Experience                     .93     .94     .94     .94
    Reading to be Informed                  .94     .94     .94     .94
    Reading to Perform a Task               .53     .61     .66     .70
  Writing
    Writing to Persuade/Personal Ideas      .84     .86     .87     .88
  Language Usage
    Language Usage                          .75     .80     .83     .85
  Mathematics Content
    Arithmetic Operations                   .94     .95     .96     .96
    Number Relationships                    .89     .92     .93     .94
    Geometry                                .77     .83     .86     .88
    Measurement with Estimation/
      Verifications                         .87     .88     .92     .92
    Statistics                              .90     .89     .89     .92
    Probability                             .85     .90     .91     .93
    Patterns and Relationships              .91     .93     .94     .93
    Algebra                                 .92     .93     .92     .94
  Mathematics Process
    Communication                           .88     .91     .92     .93
    Connections                             .96     .97     .97     .97
Table 8
Grade 5 Correlations

                       MSPAP                   CTBS/4           Teacher Rating
                   RD    WR    LU    MT     RT    LT    MT      RD    WR    MT
MSPAP Read         --
MSPAP Write        64    --
MSPAP Lang         73    75    --
MSPAP Math         70    70    77    --
CTBS/4 Read        69    56    60    62     --
CTBS/4 Lang        55    73    44    49     82    --
CTBS/4 Math        62    73    63    51     83    50    --
Teachers Read      75    78    52    43     82    64    57      --
Teachers Write     54    56    58    58     62    64    58      75    --
Teachers Math      62    69    43    46     56    57    54      60    65    --

Note. Teacher ratings are teachers' ratings of student proficiency. Decimal points have been omitted.
Tasks 1 and 2, Days 1 through 3
Theme: Powerful Forces of Nature
Content Areas: Reading, Writing, and Language Usage

Introductory Activity
"A natural disaster is a terrible event caused by nature rather than by humans. A hurricane is one type of natural disaster. Think of other natural disasters. With your partner, see how many natural disasters you can list....Now let's share some of the things on our lists....A tsunami is a very large ocean wave....Today you are going to read an article that helps you understand why tidal waves, or tsunamis, occur. Think for a moment about what you know about tsunamis....make a list of words that reflect what you know....Let's share...."

Reading Material and Assessment Activities: Reading for Information
• Article entitled "Waves" by Herbert R. Zim. Copyright © 1967 by Herbert R. Zim.
• 8 assessment activities focused on the article.

Bridging Activity
"Think about what you learned about tidal waves in the article 'Waves.' Tell your partner one thing you learned....Now think about what people might do and feel if they are told a tidal wave is coming. Tell your partner some of your ideas. Now let's list some of the feelings...."

Reading Material and Assessment Activities: Reading for Literary Experience and for Information
• Story entitled "The Day of the Great Wave" by Marcella Fisher Anderson. Copyright © 1989 by Highlights for Children, Inc., Columbus, OH.
• 8 assessment activities focused on "The Day of the Great Wave."
• 2 assessment activities focused on the article and the story.

Reading Material and Assessment Activities: Reading to Perform a Task
• Article and diagram on artificial respiration excerpted from "New Essential First Aid" by A. Ward Gardner and Peter J. Roylance, illustrated by Robert Demarest. Text copyright © 1967, 1968 by the authors. Illustrations copyright © 1971 by Little, Brown and Company.
• 4 assessment activities focused on the article and diagram.

Introductory Activity
"Weather conditions sometimes cause events which are problems for people....pair with a partner and talk about weather events that cause problems for you or someone you know about....This list we made has examples of powerful forces of nature. Now you will have a chance to express your ideas in writing about powerful forces of nature...."

Writing Assessment Prompt: Writing to Inform
"During the spring, summer, and even the fall, powerful thunderstorms may occur. People must be prepared ahead of time to know what to do. Your principal wants all students in your school to be prepared for thunderstorms. Write an article for your school newspaper informing students about how to cope with thunderstorms."

FIGURE 1. Illustration of typical MSPAP task structure using a 1991 MSPAP task. Some responses to the reading activities and the writing prompt are scored for language usage.
Figure 2. Percentile as a function of scale score: Grade 5 Reading.

Figure 3. Percentile as a function of scale score: Grade 5 Math.

Figure 4. Percentile as a function of scale score: Grade 8 Writing.

Figure 5. Example of the definition of an outcome score.