
Evaluation and Control
of Rater Reliability:
Holistic vs. Analytic Scoring
EALTA, Athens
May 9-11, 2008
Claudia Harsch, IQB
Guido Martin, IEA DPC
Overview
1. Background
- Standards-based assessment in Germany
here: Writing in EFL
- Writing tasks and rating approach
2. Feasibility Studies
- Feasibility Study I, May 2007
trial scales and approach
- Feasibility Study II, June 2007
trial holistic vs. analytic approach
3. Pilot Study, July/August 2007
- Training
- Comparison FS II vs. Pilot Study Training
Background: Assessing Educational Standards in Germany
 Evaluation of the Educational Standards for grades 9 and 10 by the IQB, Berlin
 In foreign languages, the standards are linked to the CEF, targeting
- A2 for the lower track of secondary school
- B1 for the middle track of secondary school
 Assessment of the "4 skills": reading, listening, writing and speaking (under development)
 Tasks based on CEF levels A1 to C1; uni-level approach
Sample task: Keeper, targeting B1
Assessment of Writing Tasks
Criteria of assessment, each defined by descriptors based on the CEF, the Manual and Into Europe:
 task fulfilment
 organisation
 grammar
 vocabulary
 overall impression
Rating approach
 A uni-level approach: each task is graded in line with its specific target level
 Performances are graded on a below / pass / pass plus basis
 "Holistic approach": ratings are the result of a weighted assessment of several descriptors per criterion
Feasibility Study I, May 2007
Aims
 Trial training / rating approach with student teachers
 Gain insight into scales and criteria
 Get feedback on accessibility of handbooks, benchmarks,
coding software
Procedure
 2 tasks: A2 “Lost dog” / B1 “Keeper for a day”
 6 raters: student teachers of English, proficient in writing English
 First training session (1 day): introduction to the CEF, scales and tasks
 Practice 1: 30 scripts per task (over 1 week)
 Second training session (1 day): evaluation and discussion of practice results
 Practice 2: 28 scripts per task (over 1 week)
 Evaluation of results in terms of rating reliability
Feasibility Study I, May 2007
Evaluation: Assessing Rater Reliability
 Index used: Percent Agreement with Mode
 Measures the percentage of agreement with the value most
often awarded on the level of individual ratings
 Can be aggregated on item (variable) and rater level
 Easily interpreted
 No assumptions about scale level
 No assumptions about value distributions
 No estimation errors
 Can be interpreted as a proxy for validity
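The index described above can be sketched in a few lines. This is a minimal illustration, not the DPC's actual implementation; the function name and the tie-breaking rule for the mode (lowest value wins on ties, which follows from how `np.unique` sorts) are my assumptions:

```python
import numpy as np

def percent_agreement_with_mode(ratings):
    """Percent agreement with the modal rating.

    ratings: 2-D array-like, rows = scripts, columns = raters.
    Returns (overall agreement, per-rater agreement).
    """
    ratings = np.asarray(ratings)
    agree = np.zeros_like(ratings, dtype=bool)
    for i, row in enumerate(ratings):
        vals, counts = np.unique(row, return_counts=True)
        # Mode = value awarded most often for this script
        # (ties broken towards the lower value, an assumption).
        mode = vals[np.argmax(counts)]
        agree[i] = (row == mode)
    overall = agree.mean()          # aggregated over all individual ratings
    per_rater = agree.mean(axis=0)  # one value per rater (column)
    return overall, per_rater
```

For example, two scripts rated by three raters as `[[1, 1, 2], [2, 2, 2]]` yield an overall agreement of 5/6 and per-rater values of 1.0, 1.0 and 0.5, matching the "aggregated on item and rater level" reading above.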
Outcome Feasibility Study I, May 2007
Reliability per Item

ITEM                        REL
TaskFulfilment [Keeper]     0.759
Organisation [Keeper]       0.852
Grammar [Keeper]            0.846
Vocabulary [Keeper]         0.870
Overall [Keeper]            0.858
TaskFulfilment [Lost dog]   0.839
Organisation [Lost dog]     0.863
Grammar [Lost dog]          0.845
Vocabulary [Lost dog]       0.869
Overall [Lost dog]          0.833
Outcome Feasibility Study I, May 2007
Reliability per Rater & Item

RATER   Overall [Keeper]   Overall [Lost dog]   REL Average
R01     0.852              0.857                0.847
R02     1.000              0.929                0.931
R03     0.741              0.786                0.770
R04     0.889              0.857                0.826
R05     0.704              0.571                0.757
R06     0.963              1.000                0.931
Outcome Feasibility Study I, May 2007
 Approach appears feasible
 Scales seem to be usable and applicable
 BUT: we do not know what raters do at the sub-criterion level
 Need to further explore rating behaviour at the descriptor level
=> Feasibility Study II
Feasibility Study II, June 2007
Comparison:
 Holistic scores for the five criteria (FS I)
 Scoring each descriptor on its own and in addition scoring the
criteria “holistically” (FS II)
Rationale:
 With "below" – "pass" – "pass plus" in a uni-level approach targeting a specific population, there is a tendency towards the "pass" value
 Similar criterion-level outcomes can therefore be achieved by purely random value distributions at the descriptor level
 Data on the scoring of each descriptor show whether raters interpret the descriptors uniformly before compiling them into the weighted overall criterion rating
 Reliable use of the descriptors is a precondition for valid ratings at the criterion level
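The claim that random descriptor-level ratings can still produce a plausible-looking criterion score can be illustrated with a small simulation. This is a sketch under stated assumptions, not the study's data: the sample sizes, the pass-heavy probabilities and the "rounded mean" stand-in for the weighted compilation are all mine:

```python
import numpy as np

rng = np.random.default_rng(0)
n_scripts, n_raters, n_desc = 50, 6, 9

# Random descriptor ratings: 0 = below, 1 = pass, 2 = pass plus,
# with an assumed tendency towards the "pass" value.
desc = rng.choice([0, 1, 2], p=[0.25, 0.5, 0.25],
                  size=(n_scripts, n_raters, n_desc))

# Compile each rater's criterion rating as the rounded mean of the
# descriptor ratings (a simple stand-in for the weighted compilation).
crit = np.rint(desc.mean(axis=2)).astype(int)

def agreement_with_mode(x):
    """Mean share of ratings equal to the per-script mode (x: scripts x raters)."""
    out = []
    for row in x:
        vals, counts = np.unique(row, return_counts=True)
        out.append(np.mean(row == vals[np.argmax(counts)]))
    return float(np.mean(out))

desc_rel = np.mean([agreement_with_mode(desc[:, :, d]) for d in range(n_desc)])
crit_rel = agreement_with_mode(crit)
print(f"descriptor-level agreement: {desc_rel:.2f}")
print(f"criterion-level agreement:  {crit_rel:.2f}")
```

Even though every descriptor rating here is pure noise, the compiled criterion ratings show markedly higher agreement, because the deviations cancel out in the averaging, which is exactly the validity problem the slide raises.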
Outcome Feasibility Study II, June 2007
CRITERION                 REL
TaskFulfilment [Keeper]   0.81
Organisation [Keeper]     0.83
Grammar [Keeper]          0.85
Vocabulary [Keeper]       0.84
Overall [Keeper]          0.87
Outcome Feasibility Study II, June 2007
Descriptors / Criterion: Organisation

DESCRIPTOR                           REL
Organisation_01 [Keeper for a day]   0.75
Organisation_02 [Keeper for a day]   0.56
Organisation_03 [Keeper for a day]   0.73
Organisation_04 [Keeper for a day]   0.82
Organisation_05 [Keeper for a day]   0.54
Organisation_06 [Keeper for a day]   0.83
Organisation_07 [Keeper for a day]   0.84
Organisation_08 [Keeper for a day]   0.66
Organisation_09 [Keeper for a day]   0.63
Organisation [Keeper for a day]      0.83
Outcome Feasibility Study II, June 2007
 The fairly high agreement on criterion-level ratings is NOT the result of a uniform interpretation of the descriptors …
 BUT rather results from deviations at the descriptor level cancelling out during the compilation of the criterion ratings
 Rating holistic criteria by evaluating several predefined descriptors can only be valid if the descriptors are understood uniformly by all raters
 Descriptors need to be revised
 Training and assessment in the pilot study have to be conducted at the descriptor level in order to control rating behaviour
Background Pilot Study
 Sample size: N = 2932
 Number of items:
- Listening: 349
- Reading: 391
- Writing: 19 tasks, n = 300–370 per item (M = 330)
 All Länder
 All school types
 8th, 9th and 10th graders
Summer Training
 13 raters, selected on the basis of English language proficiency, study background and a DPC coding test
 Challenge of piloting tasks, rating approach and
scales simultaneously
 First one-week seminar:
- Introduction of CEF, scales and tasks
- Introduction of rating procedures
- Introduction of benchmarks
Summer Training
 6 one-day sessions:
- Weekly practice
- Discussion & Evaluation of practice results
- Introduction of further tasks / levels
- Revision of scale descriptors
 Five levels, 19 tasks:
Simultaneous introduction of several levels and tasks necessary
in order to control level and task interdependencies
 Three rounds of practice per task proved ideal:
1. Introduction – practice
2. Feedback – practice
3. Feedback – practice
4. Evaluation of reliabilities
 …
Training Progress "Sports Accident", B1
Criterion/descriptors: Task Fulfilment

DESCRIPTOR                 REL Practice 4   REL Practice 5   REL Practice 6
TF 1 [Sports Accident]     0.65             0.76             0.88
TF 2 [Sports Accident]     0.66             0.77             0.79
TF 3 [Sports Accident]     0.87             0.85             0.92
TF 4 [Sports Accident]     0.80             0.72             0.77
TF 5 [Sports Accident]     0.70             0.78             0.83
TF gen [Sports Accident]   0.71             0.80             0.80
Training Progress "Sports Accident", B1
Criterion/descriptors: Organisation

DESCRIPTOR                 REL Practice 4   REL Practice 5   REL Practice 6
O 1 [Sports Accident]      0.73             –                –
O 2 [Sports Accident]      0.81             0.77             0.85
O 3 [Sports Accident]      0.72             0.71             0.80
O 4 [Sports Accident]      0.77             0.79             0.82
O 5 [Sports Accident]      0.96             0.76             0.81
O gen [Sports Accident]    0.71             –                –
Summer Training
 Second one-week seminar:
- Feedback on the last round of practice
- Addition of benchmarks for borderline cases
- Addition of detailed justifications for benchmarks
- Finalisation of scale descriptors
- Revision of rating handbooks
Comparison FS II - Training
CRITERION                   REL FS II   REL Practice 4
[Keeper – TaskFulfilment]   0.81        0.71
[Keeper – Organisation]     0.83        0.74
[Keeper – Grammar]          0.85        0.76
[Keeper – Vocabulary]       0.84        0.74
[Keeper – Overall]          0.87        0.77
Comparison FS II - Training
FS II ITEM       REL    PRACTICE 4 ITEM   REL
O_01 [Keeper]    0.75   O 1 [Keeper]      0.75
O_02 [Keeper]    0.56
O_03 [Keeper]    0.73   O 2 [Keeper]      0.73
O_04 [Keeper]    0.82   skipped
O_05 [Keeper]    0.54   skipped
O_06 [Keeper]    0.83   O 3 [Keeper]      0.72
O_07 [Keeper]    0.84   O 4 [Keeper]      0.74
O_08 [Keeper]    0.66
O_09 [Keeper]    0.63   O 5 [Keeper]      0.95
O_gen [Keeper]   0.83   O gen [Keeper]    0.74
Conclusion
Training concept for the future
 Materials prepared – weekly seminars no longer necessary
 Training and rating at the descriptor level
 Multiple one-day sessions, one per week, to give time for practice:
- Introduction
- Practice: 3 rounds per task ideal
- Feedback
Thank you for your attention!

Claudia Harsch, IQB
Phone: +49 (0)30 2093-5508
Fax: +49 (0)30 2093-5336
E-mail: [email protected]
Website: www.IQB.hu-berlin.de
Mail address: Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, GERMANY

Guido Martin, IEA DPC
Phone: +49 (0)40 48 500 612
E-mail: [email protected]
Website: www.iea-dpc.de
Mail address: IEA DPC, Mexikoring 37, D-22297 Hamburg, GERMANY