
The Automatic Speech Recognition in Reverberant Environments Challenge
Mary Harper
Incisive Analysis Office
IARPA
December 15, 2015
ASpIRE to Incentivize Important Speech Research
• Goal: Focus signal processing and machine learning experts on methods to increase the channel robustness of automatic speech recognition
• How:
 Stimulate breakthroughs through a prize competition to spur innovation and solve tough problems
 Use realistic data as a resource (for experimentation and evaluation)
 Provide an open evaluation with fair scoring
 Assess the capability of current technology
 Provide a forum for presenting results
INTELLIGENCE ADVANCED RESEARCH PROJECTS ACTIVITY (IARPA)
Important Challenges for Automatic Speech Recognition (ASR)
• There is a need for advances to provide robust ASR in any language and in any recording environment.
– Effective migration to new languages (Babel)
 Limited training resources
 Effective regardless of the language
– Effective with any kind of recording device in any room or environment (ASpIRE)
 Far-field microphone speech (reverberation)
 Ability to adapt to new conditions without labeled data for each and every environment
ASpIRE Conditions
• The ASpIRE challenge asked participants to develop innovative ASR solutions that:
 work in a variety of acoustic environments and recording scenarios
 without having access to matched training and development data
• Two evaluation conditions:
1. The Single Microphone Condition tested the accuracy of speech recognition on sessions recorded in several different rooms on a single distant microphone, selected randomly from a set of microphones placed differently in each room.
2. The Multiple Microphone Condition tested the accuracy of speech recognition on the same sessions as the single microphone condition, recorded with a set of 6 single distant microphones (including the single microphone condition's microphone).
ASpIRE Data Setup
• Training Data: the Fisher corpus of English telephone conversations (2,000 hours); only these data and/or algorithmic transformations of these data were allowed for training systems.
• Development Data: recordings drawn from Mixer 6 conversational English speech, recorded with 12 different microphones placed comparably in 2 small rooms:
1. Tuning Data (Dev-tune): a 5-hour development-tuning set (audio with transcription) to be used for optimization, training selection, and unsupervised adaptation
2. Open-Book Test Data (Dev-test): a 10-hour development-test set (audio only) to be used only for checking progress using the leaderboard
• Closed-Book Test Data (Evaluation): 10 hours of Mixer 8 conversational English speech recorded with 8 different microphones placed differently in 7 rooms of various sizes, with a variety of speakers.
The Microphone Data
• Mixer 6 Development Set: speech recorded for speaker ID research. In Mixer 6, speech was recorded with 14 different microphones placed in the same configuration in 2 small rooms at LDC (i.e., same distance, mounting, and orientation in both rooms), with a variety of subjects and with microphone levels checked and calibrated.
• Mixer 8 Pilot Evaluation Set: IARPA worked with LDC to design a different and harder evaluation set. In Mixer 8, speech was simultaneously recorded over eight microphone channels, with one additional channel captured via a telephone collection platform.
– 7 different rooms (of different sizes and shapes) at the University of Pennsylvania
– 8 different microphones positioned differently in the 7 rooms, with talkers positioned in 2-3 locations (3 for the larger rooms)
– microphone height, orientation, and distance between the microphones and the talker vary and were set up to be challenging
Rooms and Microphones in Mixer 8

Room  Description      Volume (ft^3)  # Pos.
117   Recording Room     1,013        2
477   Small Office       1,278        2
481   Conference Room    1,759        2
126   Recording Room     1,776        3
478   Conference Room    3,496        3
460   Seminar Room       3,547        3
470   Conference Room   13,205        3

No.  Model                  Features
1    Earthworks M23         Flat frequency response, omnidirectional, measurement applications
2    DPA 4090               High sensitivity, flat frequency response, omnidirectional, condenser, high-quality studio applications
3    Samson SAC02           High sensitivity, directional, pencil microphone, low-cost home studio applications
4    RODE NT6               Directional, miniature microphone, various applications
5    Shure MX185            Diaphragm condenser microphone, directional, used as a lavalier
6    Sony ECMAW3            Bluetooth microphone, omnidirectional, miniature electret condenser microphone element, home video applications
7    Canon WM-V1            Bluetooth microphone, omnidirectional, prosumer camcorder/outdoor applications
8    Audio Technica AT8035  Shotgun microphone, directional, high off-axis rejection, outdoor recording applications
Rooms with Mic Placement and Position
ASpIRE in Context
• The Mixer 8 evaluation data set is challenging and realistic, especially when there is no matched training data.
• A baseline system demonstrated the difficulty of these data.
• New methods were available that held promise but remained to be tested on ASpIRE-type data.
New Aspects of This Challenge
• The ASpIRE challenge addressed far-field microphone recordings and introduced the following conditions:
1. Conversational speech was used rather than read speech, increasing difficulty.
2. The vocabulary of the data sets was not controlled or limited; hence, the vocabulary was large, and the development and evaluation data contained words that were not seen in training.
3. Evaluation data were explicitly designed to differ substantially from the training data, as well as from the development data, to measure system robustness.
4. No information was provided for the audio files that might enable systems to make use of microphone type, room configuration, speaker position, or speaker identity.
Timeline
• 1/2013-4/2014: Data Collection and Analysis
 Develop protocol for data collection
 Collect evaluation data
 Document difficulty
• 4/15/2014-11/16/2014: Setup
 Data annotation
 Evaluation preparation
 Challenge infrastructure setup
• 11/17/2014-2/11/2015: System Preparation
 Development data release
 Release of scoring tools
 Evaluation server available
 Participants prepare systems/algorithms
• 2/11/2015-2/18/2015: Single-Microphone Evaluation Period
• 2/19/2015-2/26/2015: Multiple-Microphone Evaluation Period
• 3/1/2015-6/15/2015: Winning Solution Validation and Prize Award
• 12/15/2015: Special Session at ASRU
ASpIRE Site
Some ASpIRE Statistics
• 167 participants from 32 countries around the world signed challenge agreements:
 Australia, Austria, Belarus, Brazil, Bulgaria, Canada, China, Colombia, Czech Republic, Egypt, Finland, France, Germany, Greece, Hungary, India, Indonesia, Israel, Italy, Japan, Madagascar, Mexico, Pakistan, Russia, Singapore, Slovenia, Spain, Sweden, Turkey, United Kingdom, United States, and Yemen
• Data were downloaded by 48 sites from countries around the world, including:
 Australia, Belarus, Brazil, Bulgaria, Canada, China, Egypt, Finland, France, Germany, Greece, Indonesia, Israel, Madagascar, Mexico, Pakistan, Russia, Singapore, Slovenia, Spain, Turkey, United Kingdom, and United States
Development-Test Scores Over Time
(chart: development-test accuracy on the leaderboard over the course of the challenge)
Results

Single Microphone

System ID  Team  Primary?  Dev-Test WER  Evaluation WER
13         A     Primary   27.1          44.3
14         B     Primary   27.5          44.3
4          C     Primary   29.9          44.8
11         D     Primary   39.8          52.7
15         E     Primary   27.6          53.4
8          A     Contrast  29.0          43.9
9          A     Contrast  27.4          44.0
6          D     Contrast  40.0          52.8
16         E     Contrast  27.9          54.1
7          D     Contrast  40.0          54.4
12         D     Contrast  39.9          54.7
17         D     Contrast  39.4          50.7

Multiple Microphone

System ID  Team  Primary?  Dev-Test WER  Evaluation WER
18         C     Primary   28.2          38.5
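The scores above are word error rates (WER), the standard ASR metric: the word-level edit distance (substitutions + deletions + insertions) between hypothesis and reference, divided by the number of reference words. As a reminder of how it is computed, here is a minimal dynamic-programming sketch; it is illustrative only, not the challenge's actual scoring tool.

```python
def wer(reference, hypothesis):
    """Word error rate in percent: (subs + dels + ins) / #reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # delete all remaining ref words
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insert all remaining hyp words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match / substitution
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("the cat sat", "the cat sat on")` counts one insertion against three reference words, i.e. about 33.3%. Note that WER can exceed 100% when the hypothesis contains many insertions.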
Winning Teams
• Single Microphone:
A. Raytheon BBN Technologies (Jeff Ma, Roger Hsiao, William Hartmann, Rich Schwartz, Stavros Tsakalidis), Brno University of Technology (Martin Karafiat, Lukas Burget, Igor Szoke, Frantisek Grezl), and Johns Hopkins University (Sri Harish Mallidi, Hynek Hermansky);
B. Center for Language and Speech Processing, Johns Hopkins University (Vijayaditya Peddinti, Guoguo Chen, Daniel Povey, Sanjeev Khudanpur);
C. The Institute for Infocomm Research, A*STAR, Singapore (Jonathan William Dennis and Tran Huy Dat).
• Multiple Microphone:
C. The Institute for Infocomm Research, A*STAR, Singapore (Jonathan William Dennis and Tran Huy Dat).
Winning Solution Attributes
• Top-performing single microphone systems:
 used multi-condition training: augmented the clean training data with additional degraded data (noise and room impulses)
 performed some form of enhancement and adaptation
 used speech activity detection appropriate for the task (see the MIT Lincoln Laboratory poster)
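Multi-condition training of this kind is typically implemented by convolving clean speech with a room impulse response (RIR) and adding noise scaled to a target signal-to-noise ratio. The NumPy sketch below shows the core operation; the synthetic RIR, noise, and 15 dB SNR are illustrative assumptions, not the recipe any particular team used.

```python
import numpy as np

def augment(clean, rir, noise, snr_db=15.0):
    """Simulate a far-field recording: convolve clean speech with a room
    impulse response, then add noise scaled to the requested SNR (dB)."""
    reverbed = np.convolve(clean, rir)[: len(clean)]   # reverberant speech
    noise = noise[: len(reverbed)]
    # Scale the noise so that 10*log10(P_speech / P_noise) == snr_db.
    p_speech = np.mean(reverbed ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return reverbed + scale * noise

# Illustrative synthetic inputs; a real recipe would use measured or
# simulated RIRs and recorded noise.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)                     # 1 s of "speech" at 16 kHz
rir = np.exp(-np.arange(800) / 160.0) * rng.standard_normal(800)
noise = rng.standard_normal(16000)
degraded = augment(clean, rir, noise, snr_db=15.0)
```

Each clean Fisher-style utterance can be passed through many such room/noise combinations, multiplying the effective amount of training data matched to reverberant conditions.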
Posters
 BBN, JHU, and Brno University of Technology, "Robust Speech Recognition in Unknown Reverberant and Noisy Conditions": combines multi-condition training, auto-encoder audio enhancement, and DNN adaptation using a linear least-squares method
 Institute for Infocomm Research, "Single and Multi-channel Approaches for Distant Speech Recognition under Noisy Reverberant Conditions": combines robust front-end processing and speech enhancement methods, multi-condition training, and semi-supervised DNN model adaptation
 Johns Hopkins University (JHU), CLSP, "JHU ASpIRE System: Robust LVCSR with TDNNs, i-Vector Adaptation and RNN-LMs": combines TDNN modeling with multi-condition training, i-vector adaptation, silence modeling, and RNN LMs
Posters
 MIT Lincoln Laboratory, "Analysis of ASpIRE Systems": examines what factors affect WER across systems and what differs among systems
 SRI, "Improving Robustness against Reverberation for Automatic Speech Recognition": examines the use of GMMs and various types of DNNs, robust features, multi-condition training, and system combination