Task-Independent Features for Automated Essay Grading
Torsten Zesch, Michael Wojatzki
Language Technology Lab, University of Duisburg-Essen
1st INDUS Meeting

Essay Writing
Widely used to test (high-level) language proficiency:
- correct word usage
- adequate style
- structuring capabilities
- ...
Problems:
- High costs of manual grading
- How to individualize?

Manual Grading
Essays written for Task A are graded by hand (e.g., B+).

Automated Grading – Training
Manually graded essays for Task A are used to train a grading model via machine learning.

Automated Grading – Application
The trained grading model assigns grades (e.g., A-) to new essays for Task A.
Limitations of current approaches:
- High costs of building the training set (manual grading)

Automated Grading – Transfer
A grading model trained on Task A is applied to essays written for a different Task B.
Limitations of current approaches:
- Models are usually not transferable between tasks
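To make the training and application steps above concrete, here is a minimal sketch of such a pipeline. It assumes a scikit-learn-style setup; the feature functions, the CONNECTIVES list, and the tiny task_a dataset are hypothetical stand-ins for illustration, not the feature set or data used in this work.

```python
# Sketch: turn graded Task A essays into feature vectors, fit a grading model,
# then apply it to a new essay. Features are illustrative stand-ins only.
import re
from sklearn.linear_model import LinearRegression

CONNECTIVES = {"therefore", "accordingly", "however", "moreover"}

def extract_features(essay: str) -> list:
    tokens = re.findall(r"[a-z']+", essay.lower())
    n_tokens = max(len(tokens), 1)
    return [
        float(len(tokens)),                                 # essay length
        sum(t in CONNECTIVES for t in tokens) / n_tokens,   # connective ratio
        sum(len(t) for t in tokens) / n_tokens,             # mean word length
    ]

# Hypothetical training data: (essay text, numeric grade) pairs for Task A.
task_a = [
    ("Therefore, the argument is sound. Moreover, the evidence supports it.", 4.0),
    ("i think its good", 1.0),
    ("The essay is structured well, and accordingly the style is adequate.", 3.0),
]

X = [extract_features(text) for text, _ in task_a]
y = [grade for _, grade in task_a]

grading_model = LinearRegression().fit(X, y)

# Application: the trained model predicts a grade for an unseen Task A essay.
new_essay = "Accordingly, word usage and structure matter."
print(grading_model.predict([extract_features(new_essay)]))
```

The transfer question discussed next is simply what happens when `grading_model` is evaluated on essays from a different task than the one it was trained on.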
Task-dependence of Features
Strongly task-dependent: related to a certain task.
For example:
- The essay contains the words 'George Washington'
- The essay quotes specific passages from a source
Weakly task-dependent: general properties of good essays.
For example:
- The essay contains connectives like 'therefore' or 'accordingly'
- The essay is free of spelling errors

Task-dependence of Features (overview)
Weakly task-dependent: coherence, cohesion, errors, readability, specificity, style, syntactic variation, word/sentence length
Strongly task-dependent: n-grams / topics, length, comparison to the training set, similarity to a given source, formal referencing
Assumption: Using only weakly task-dependent features improves model transfer.

Feature Sets
- weakly dependent features = reduced feature set
- weakly dependent + strongly dependent features = full feature set

Automated Grading – Transfer
Both a full model and a reduced model are trained on Task A essays and then applied to Task B essays.

Evaluation
                  English                          German
tasks             8                                2
set size          ~1800 essays                     ~200 essays
characteristic    4 opinion, 4 source-based        both source-based
participants      7th, 8th, 10th grade students    first-year university students
Evaluation metric: Quadratic Weighted Kappa

Task Types / Process Overview
Transfer is evaluated across task types: the full and the reduced model are each trained on one task type (source-based or opinion) and evaluated on the same or the other type.

Baseline Results – English
[Figure: baseline results on the English tasks]

Baseline Results – German
[Figure: baseline results on the German tasks]

Transfer Loss – Full feature set
[Figure: transfer-loss matrix across task types; losses greater than -0.3 count as small]
Losses: -0.46, -0.39, -0.53, -0.29
High losses, except in the source-based/source-based case.

Transfer Loss – Reduced feature set
[Figure: transfer-loss matrix across task types]
Losses: -0.22, -0.46, -0.47, -0.23
Much smaller losses within task types.

Transfer Loss – German Dataset
Same picture on the German tasks.
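For reference, the evaluation can be sketched as follows. This is a minimal illustration using scikit-learn's cohen_kappa_score with made-up grade vectors rather than the actual data, and it assumes transfer loss is the difference between cross-task and within-task Quadratic Weighted Kappa.

```python
# Sketch: Quadratic Weighted Kappa (QWK) between human and predicted grades,
# and transfer loss as the QWK drop when a Task A model is applied to Task B.
# All grade vectors below are invented for illustration.
from sklearn.metrics import cohen_kappa_score

human_a     = [1, 2, 3, 4, 2, 3]
predicted_a = [1, 2, 3, 3, 2, 3]   # within-task predictions on Task A
human_b     = [2, 3, 1, 4, 3, 2]
predicted_b = [3, 3, 2, 2, 3, 1]   # cross-task predictions on Task B

qwk_within   = cohen_kappa_score(human_a, predicted_a, weights="quadratic")
qwk_transfer = cohen_kappa_score(human_b, predicted_b, weights="quadratic")

# Assumed definition: transfer loss = cross-task QWK minus within-task QWK.
print(f"within-task QWK: {qwk_within:.2f}")
print(f"cross-task QWK:  {qwk_transfer:.2f}")
print(f"transfer loss:   {qwk_transfer - qwk_within:.2f}")
```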
Conclusions & Future Work
- Model transfer is difficult: the loss is quite dramatic.
- Using only weakly task-dependent features improves transfer, but a significant loss remains.
Future Work
- Explore faceted models:
  - should transfer better than holistic grading
  - provide better feedback