Task-dependence of Features

Task-Independent Features for
Automated Essay Grading
Torsten Zesch, Michael Wojatzki
Language Technology Lab
University of Duisburg-Essen
Essay Writing
Widely used to test (high-level) language proficiency
- correct word usage
- adequate style
- structuring capabilities
…
Problems
- High costs of manual grading
- How to individualize?
Department of Computer Science and Applied Cognitive Science | Language Technology Lab | 1st INDUS Meeting
Manual Grading
[Diagram: Task A essays are graded manually, yielding graded essays (e.g. B+)]
Automated Grading – Training
[Diagram: graded Task A essays (e.g. B+) are passed to machine learning, which produces a grading model]
Automated Grading – Application
[Diagram: the grading model is applied to Task A essays, yielding graded essays (e.g. A-)]
Limitations of current approaches
• High costs of building training set (manual grading)
Automated Grading – Transfer
[Diagram: the grading model built for Task A grades Task A essays (e.g. A-) and is also applied to Task B essays (e.g. C)]
Limitations of current approaches
• Models are usually not transferable between tasks
Task-dependence of Features
Strongly task-dependent
- related to a certain task
- For example
  - The essay contains the words ‘George Washington’
  - The essay quotes specific passages from a source
Weakly task-dependent
- general properties of good essays
- For example
  - The essay contains connectives like ‘therefore’ or ‘accordingly’
  - The essay is free of spelling errors
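As a rough illustration of the distinction, the sketch below computes one weakly and one strongly task-dependent feature. The function names, the tokenization, and the two-word connective list are assumptions made for this example; they are not the feature extractors used in the study.

```python
import re

# Assumption: a tiny connective list for illustration; the slide names
# only 'therefore' and 'accordingly' as examples.
CONNECTIVES = {"therefore", "accordingly"}


def weakly_dependent_features(essay):
    """Features describing general essay quality, independent of the prompt."""
    tokens = re.findall(r"[a-z']+", essay.lower())
    connective_count = sum(1 for token in tokens if token in CONNECTIVES)
    ratio = connective_count / len(tokens) if tokens else 0.0
    return {"connective_ratio": ratio}


def strongly_dependent_features(essay, task_terms):
    """Features tied to one prompt, e.g. whether expected key terms occur."""
    text = essay.lower()
    return {"contains:" + term: term.lower() in text for term in task_terms}


essay = "George Washington was a general. Therefore, he led the army."
weak = weakly_dependent_features(essay)
strong = strongly_dependent_features(essay, ["George Washington"])
```

The connective ratio can be computed for any essay, whereas the key-term feature only makes sense for the task that supplied the terms.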
Task-dependence of Features
weakly task-dependent
- coherence
- cohesion
- errors
- readability
- specificity
- style
- syntactic variation
- word/sentence length
strongly task-dependent
- ngrams / topics
- length
- comparison to the training set
- similarity to a given source
- formal referencing
Assumption: Using only weakly task-dependent features improves model transfer
Feature Sets
weakly dependent + strongly dependent = full feature set
weakly dependent = reduced feature set
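The two feature sets can be sketched as a simple selection over named features. All feature names below are hypothetical placeholders chosen for this example; only the composition rule (full = weakly + strongly dependent, reduced = weakly dependent only) comes from the slide.

```python
# Hypothetical feature names, used only to illustrate the composition rule.
WEAKLY_DEPENDENT = ("connective_ratio", "spelling_error_rate")
STRONGLY_DEPENDENT = ("mentions_george_washington", "quotes_source")

FEATURE_SETS = {
    "full": WEAKLY_DEPENDENT + STRONGLY_DEPENDENT,
    "reduced": WEAKLY_DEPENDENT,
}


def select_features(features, feature_set):
    """Keep only the features that belong to the requested feature set."""
    return {name: features[name] for name in FEATURE_SETS[feature_set]}


all_features = {
    "connective_ratio": 0.1,
    "spelling_error_rate": 0.02,
    "mentions_george_washington": True,
    "quotes_source": False,
}
reduced = select_features(all_features, "reduced")
full = select_features(all_features, "full")
```

A model trained on the output of `select_features(..., "reduced")` never sees the prompt-specific features, which is the property the transfer experiments rely on.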
Automated Grading – Transfer
[Diagram: a full model and a reduced model, built for Task A, grade Task A essays (e.g. A-) and are also applied to Task B essays (e.g. C)]
Evaluation
                 English                          German
tasks            8                                2
set size         ~1800 essays                     ~200 essays
characteristic   4 opinion, 4 source-based        both source-based
participants     7th, 8th, 10th grade students    first-year university students
Evaluation Metric: Quadratic Weighted Kappa
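For reference, quadratic weighted kappa compares two sets of grades and penalizes each disagreement by the squared distance between the grades. The sketch below follows the standard textbook definition using only the standard library; it is not taken from the authors' evaluation code, and the function and parameter names are chosen for this example.

```python
from collections import Counter


def quadratic_weighted_kappa(grades_a, grades_b, min_grade, max_grade):
    """Agreement between two integer grade sequences (1.0 = perfect)."""
    if len(grades_a) != len(grades_b) or not grades_a:
        raise ValueError("grade sequences must be non-empty and equally long")
    n_categories = max_grade - min_grade + 1
    if n_categories < 2:
        raise ValueError("at least two grade categories are required")

    n = len(grades_a)
    observed = Counter(zip(grades_a, grades_b))  # confusion matrix counts
    hist_a = Counter(grades_a)
    hist_b = Counter(grades_b)

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_grade, max_grade + 1):
        for j in range(min_grade, max_grade + 1):
            weight = (i - j) ** 2 / (n_categories - 1) ** 2
            weighted_observed += weight * observed[(i, j)]
            # Expected count if both graders assigned grades independently.
            weighted_expected += weight * hist_a[i] * hist_b[j] / n

    if weighted_expected == 0.0:
        return 1.0  # both graders always gave the same single grade
    return 1.0 - weighted_observed / weighted_expected


perfect = quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4], 1, 4)
reversed_order = quadratic_weighted_kappa([1, 2, 3, 4], [4, 3, 2, 1], 1, 4)
```

Identical gradings give a kappa of 1, chance-level agreement gives a value near 0, and systematically opposed gradings give a negative value.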
Task Types
[Diagram: full and reduced models transferred from Task A essays to Task B essays; each task is either source-based or opinion]
Process Overview
[Diagram: full and reduced models transferred from Task A essays to Task B essays; each task is either source-based or opinion]
Baseline Results – English
Baseline Results – German
Transfer Loss – Full feature set
= small losses (> -0.3)
Transfer loss values shown: -0.46, -0.39, -0.53, -0.29
High losses, except in source/source case
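The slides do not spell out how transfer loss is computed. One plausible reading, assumed here, is the kappa of the transferred model on the target task minus the kappa of a baseline model trained on the target task itself, so that negative values mean worse performance after transfer. The kappa values in the sketch are made up for illustration and are not results from the study.

```python
# Threshold taken from the slide legend: small losses are those > -0.3.
SMALL_LOSS_THRESHOLD = -0.3


def transfer_loss(kappa_transfer, kappa_baseline):
    """Assumed definition: transferred-model kappa minus in-task baseline kappa."""
    return kappa_transfer - kappa_baseline


def is_small_loss(loss, threshold=SMALL_LOSS_THRESHOLD):
    """True if the loss is smaller in magnitude than the legend's threshold."""
    return loss > threshold


# Hypothetical kappas, not taken from the reported experiments.
example_loss = transfer_loss(kappa_transfer=0.60, kappa_baseline=0.75)
example_is_small = is_small_loss(example_loss)
```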
Transfer Loss – Reduced feature set
Transfer loss values shown: -0.22, -0.46, -0.47, -0.23
Much smaller losses within task types
Transfer Loss – German Dataset
Same picture on German tasks
Conclusions & Future Work
Model transfer is difficult
- Loss is quite dramatic
Using only weakly task-dependent features improves transfer (but the loss is still significant)
Future Work
- Explore faceted models
  - Should transfer better than holistic grading
  - Provides better feedback