Feature Selection for Automated Speech Scoring
Anastassia Loukina, Klaus Zechner, Lei Chen, Michael Heilman*
Educational Testing Service
*CIVIS Analytics
Copyright © 2015 by Educational Testing Service.
Overview
• Motivation
• Data
• Scoring models
• Results
• Conclusion
Context and motivation
• Automated scoring of constructed responses, here spoken ones
• Features are computed with NLP and speech technology from
speech recognition and signal processing outputs
• Scores are predicted using supervised machine learning
• Educational measurement requires managing a trade-off:
- Maximize empirical performance
- Maximize model interpretability
Ideal Properties of Scoring Models
• High empirical performance
• Contains features that evaluate all relevant
aspects of the test construct
• The relative contribution of each feature should be
obvious
• Inter-correlations between features should not be too high
• The polarity of each feature weight corresponds to the
feature's meaning
• Smaller and simpler is better (interpretability)
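Two of these properties, weight polarity and limited inter-correlation, lend themselves to a mechanical check. The following is a minimal sketch, not the procedure used in the study: the feature names, the toy values, and the 0.9 correlation threshold are all hypothetical and chosen only for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def check_model(columns, weights, expected_sign, max_r=0.9):
    """Return a list of constraint violations for a fitted linear model.

    max_r is an assumed cut-off for "too high"; the slides give no value.
    """
    problems = []
    # Polarity: the sign of each weight must match the feature's meaning.
    for name, w in weights.items():
        if w * expected_sign[name] < 0:
            problems.append(f"polarity: {name}")
    # Inter-correlation: flag pairs of features that are nearly redundant.
    names = sorted(weights)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(columns[a], columns[b])) > max_r:
                problems.append(f"correlated: {a}, {b}")
    return problems

# Hypothetical toy data: three features observed on five responses.
columns = {
    "speaking_rate": [1.0, 2.0, 3.0, 4.0, 5.0],
    "words_per_sec": [1.1, 2.1, 2.9, 4.2, 5.0],
    "pause_count": [5.0, 3.0, 4.0, 1.0, 2.0],
}
weights = {"speaking_rate": 0.4, "words_per_sec": 0.2, "pause_count": 0.1}
expected_sign = {"speaking_rate": 1, "words_per_sec": 1, "pause_count": -1}

problems = check_model(columns, weights, expected_sign)
```

In this toy case the check reports a positive weight on a feature that should lower the score, and one pair of near-duplicate fluency features.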
Linear Regression Scoring Models Built by
Human Experts
• Straightforward and well known across
disciplines
• Can address most requirements of ideal
scoring models
• Disadvantage: development is cumbersome because
features are selected manually and every
constraint has to be checked
Proposed Model
• Explore alternative regression models, e.g.,
shrinkage methods
• These can select features automatically while
still addressing the ideal model constraints
Data
• Spoken English proficiency test
• Spontaneous speech, ~1 minute per response
• Score scale: 1 – 4
Data Set   Speakers   Responses   H-H Correlation
Train      9,312      9,956       0.63
Eval       8,101      47,642      0.62
Features
• 75 features extracted for each response via SpeechRater
• Construct dimensions:
– fluency
– pronunciation accuracy
– prosody
– grammar
– vocabulary
• Dimensions not covered: content, discourse
Scoring Models
1. Baseline: human expert (12 features)
2. All features using OLS regression
3. Hybrid stepwise regression
4. Non-negative least-squares regression
5. Non-negative LASSO regression (LASSO*; lambda
optimized to obtain a feature set size of about 25)
LASSO
• Shrinkage model: dimensionality reduction
• Penalizes larger coefficients
• Sets a subset of coefficients to zero
• Lambda parameter: zero yields the OLS model;
infinity yields a model with no features
• Optimal lambda determined empirically (target:
the number of features at which performance flattens
out)
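The interaction of the penalty, the non-negativity constraint, and lambda can be sketched with a small coordinate-descent solver. This is an illustrative sketch only: the slides do not say which solver or software was used, the toy data are hypothetical, and `pick_lambda` reduces the empirical search to "largest lambda that keeps at least a target number of features".

```python
def nonneg_lasso(X, y, lam, n_iter=200):
    """Minimise (1/2n)*||y - Xw||^2 + lam*sum(w) subject to w >= 0.

    Plain coordinate descent; assumes y and the columns of X are centred.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    resid = list(y)  # residual y - Xw for the starting point w = 0
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of feature j with the residual once w_j is removed.
            rho = sum(X[i][j] * (resid[i] + w[j] * X[i][j]) for i in range(n)) / n
            # Shrink by lam, then clip at zero to keep the weight non-negative.
            new_w = max(0.0, rho - lam) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                w[j] = new_w
    return w

def pick_lambda(X, y, lambdas, target):
    """Largest lambda whose model keeps at least `target` features."""
    for lam in sorted(lambdas, reverse=True):
        w = nonneg_lasso(X, y, lam)
        if sum(1 for v in w if v > 0) >= target:
            return lam, w
    return None, None

# Hypothetical toy data: three centred, mutually orthogonal features.
# y = 2*x0 + 0.5*x1 - 1*x2, so x2 has the "wrong" sign.
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
y = [1.5, 2.5, -0.5, -3.5]

w_ols_like = nonneg_lasso(X, y, 0.0)   # [2.0, 0.5, 0.0]
w_shrunk = nonneg_lasso(X, y, 1.0)     # [1.0, 0.0, 0.0]
best_lam, best_w = pick_lambda(X, y, [0.0, 0.25, 1.0, 3.0], target=2)
```

With lambda at zero the solver reduces to non-negative least squares, so the negatively related feature is dropped while the others keep their full weights. Raising lambda shrinks every weight and removes the weakest feature first, and a large enough lambda removes all of them.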
Crossvalidation Results
Model             Features   Negative Coeffs   Correlation
Expert baseline   12         No                0.606
All OLS           75         Yes               0.667
Hybrid stepwise   ~40        Yes               0.667
Non-neg LS        ~35        No                0.655
LASSO*            ~25        No                0.649
Results on Evaluation Set
Model             Features   Item Corr   Speaker Corr
Expert baseline   12         0.61        0.78
All OLS           75         0.67        0.86
LASSO*            25         0.65        0.84
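The gap between the two correlation columns can be illustrated with a small sketch. It assumes that the item-level figure correlates human and machine scores per response, and that the speaker-level figure correlates per-speaker totals; the slides do not spell out the aggregation, and the scores below are hypothetical.

```python
from collections import defaultdict
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def speaker_level(speakers, human, machine):
    """Sum each speaker's response scores, then correlate the totals."""
    h_tot, m_tot = defaultdict(float), defaultdict(float)
    for spk, h, m in zip(speakers, human, machine):
        h_tot[spk] += h
        m_tot[spk] += m
    ids = sorted(h_tot)
    return pearson([h_tot[s] for s in ids], [m_tot[s] for s in ids])

# Hypothetical scores: three speakers with two responses each.
speakers = ["A", "A", "B", "B", "C", "C"]
human = [1, 2, 2, 3, 3, 4]
machine = [2, 1, 3, 2, 4, 3]

item_r = pearson(human, machine)                      # per-response agreement
speaker_r = speaker_level(speakers, human, machine)   # per-speaker agreement
```

In this toy case the per-response errors cancel within each speaker, so the speaker-level correlation is higher than the item-level one, which matches the direction of the difference in the table.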
Construct Coverage Comparison
• Adding relative standardized beta-weights
Construct                Expert   LASSO*
Fluency                  0.580    0.527
Pronunciation accuracy   0.098    0.151
Prosody                  0.080    0.035
Total for Delivery       0.759    0.712
Grammar                  0.155    0.103
Vocabulary               0.086    0.183
Total for Language Use   0.241    0.286
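One way to arrive at figures of this kind is to divide each feature's standardized weight by the sum over all features and then add the shares within each construct dimension. The sketch below assumes that reading. The feature names and beta values are hypothetical, and only the grouping of dimensions into Delivery and Language Use follows the table.

```python
from collections import defaultdict

# Grouping of construct dimensions into the two totals used in the table.
SECTION = {
    "fluency": "Delivery",
    "pronunciation accuracy": "Delivery",
    "prosody": "Delivery",
    "grammar": "Language Use",
    "vocabulary": "Language Use",
}

def construct_coverage(betas, construct_of):
    """Share of the total standardized weight carried by each construct.

    Absolute values are used so the shares also make sense for models that
    allow negative coefficients (an assumption of this sketch).
    """
    total = sum(abs(b) for b in betas.values())
    shares = defaultdict(float)
    for feature, beta in betas.items():
        shares[construct_of[feature]] += abs(beta) / total
    return dict(shares)

def section_totals(shares):
    """Roll construct shares up into Delivery and Language Use."""
    totals = defaultdict(float)
    for construct, share in shares.items():
        totals[SECTION[construct]] += share
    return dict(totals)

# Hypothetical standardized betas for a five-feature model.
betas = {"f_rate": 0.30, "f_pause": 0.20, "p_vowel": 0.10,
         "g_score": 0.25, "v_freq": 0.15}
construct_of = {"f_rate": "fluency", "f_pause": "fluency",
                "p_vowel": "pronunciation accuracy",
                "g_score": "grammar", "v_freq": "vocabulary"}

shares = construct_coverage(betas, construct_of)
totals = section_totals(shares)
```

Because the shares are relative, they sum to one for any model, which is what makes the Expert and LASSO* columns directly comparable despite the different feature counts.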
Summary
• Building scoring models for constructed responses in
line with best practices in educational measurement is
a complex task of constraint satisfaction
• Therefore, this task has typically been performed by
human experts
• Our study demonstrates the viability of
automated feature selection methods that can
satisfy multiple requirements of ideal scoring models
• The LASSO* model is more accurate than the expert
baseline, has very similar construct coverage, and is
highly interpretable