The Automatic Speech Recognition in Reverberant Environments (ASpIRE) Challenge
Mary Harper, Incisive Analysis Office, IARPA
December 15, 2015

ASpIRE to Incentivize Important Speech Research
• Goal: Focus signal processing and machine learning experts on methods to increase the channel robustness of automatic speech recognition
• How:
  – Stimulate breakthroughs through a prize competition to spur innovation and solve tough problems
  – Use realistic data as a resource (for experimentation and evaluation)
  – Provide an open evaluation with fair scoring
  – Assess the capability of current technology
  – Provide a forum for presenting results

INTELLIGENCE ADVANCED RESEARCH PROJECTS ACTIVITY (IARPA)

Important Challenges for Automatic Speech Recognition (ASR)
• There is a need for advances to provide robust ASR in any language and in any recording environment.
  – Effective migration to new languages (Babel)
    • Limited training resources
    • Effective regardless of the language
  – Effective with any kind of recording device in any room or environment (ASpIRE)
    • Far-field microphone speech (reverberation)
    • Ability to adapt to new conditions without labeled data for each and every environment

ASpIRE Conditions
• The ASpIRE challenge asked participants to develop innovative ASR solutions that work in a variety of acoustic environments and recording scenarios without having access to matched training and development data.
• Two evaluation conditions:
  1. The Single Microphone Condition tested the accuracy of speech recognition on sessions recorded in several different rooms on a single distant microphone, selected randomly from a set of microphones placed differently in each room.
  2. The Multiple Microphone Condition tested the accuracy of speech recognition on the same sessions as the single microphone condition, recorded with a set of six single distant microphones (including the single microphone condition's microphone).
ASpIRE Data Setup
• Training Data: Fisher corpus of English telephone conversations (2,000 hours); only these data and/or algorithmic transformations of these data were allowed for training systems.
• Development Data: recordings drawn from Mixer 6 conversational English speech, recorded with 12 different microphones placed comparably in 2 small rooms:
  1. Tuning Data (Dev-tune): a 5-hour development-tuning set (audio with transcription) to be used for optimization, training selection, and unsupervised adaptation purposes
  2. Open-Book Test Data (Dev-test): a 10-hour development-test set (audio only) to be used only for checking progress using the leaderboard
• Closed-Book Test Data (Evaluation): 10 hours of Mixer 8 conversational English speech, recorded with 8 different microphones placed differently in 7 various-sized rooms with a variety of speakers.

The Microphone Data
• Mixer 6 Development Set: speech recorded for speaker ID research. In Mixer 6, speech was recorded with 14 different microphones placed in the same configuration in 2 small rooms at LDC (i.e., same distance, mounting, and orientation in both rooms), with a variety of subjects and with microphone levels checked and calibrated.
• Mixer 8 Pilot Evaluation Set: IARPA worked with LDC to design a different and harder evaluation set. In Mixer 8, speech was simultaneously recorded over eight microphone channels, with one additional channel captured via a telephone collection platform.
  – 7 different rooms (of different sizes and shapes) at the University of Pennsylvania
  – 8 different microphones positioned differently in the 7 rooms, with talkers positioned in 2-3 locations (3 for the larger rooms)
  – microphone height, orientation, and distance between microphones and the talker vary and were set up to be challenging
Rooms and Microphones in Mixer 8

Room  Description      Volume (ft^3)  # Pos.
117   Recording Room    1,013         2
477   Small Office      1,278         2
481   Conference Room   1,759         2
126   Recording Room    1,776         3
478   Conference Room   3,496         3
460   Seminar Room      3,547         3
470   Conference Room  13,205         3

Microphone  Model                  Features
1           Earthworks M23         Flat frequency response, omnidirectional, measurement applications
2           DPA 4090               High sensitivity, flat frequency response, omnidirectional, condenser, high-quality studio applications
3           Samson SAC02           High sensitivity, directional, pencil microphone, low-cost home studio applications
4           RODE NT6               Directional, miniature microphone, various applications
5           Shure MX185            Diaphragm condenser microphone, directional, used as a lavalier
6           Sony ECMAW3            Bluetooth microphone, omnidirectional, miniature electret condenser microphone element, home video applications
7           Canon WM-V1            Bluetooth microphone, omnidirectional, prosumer camcorder/outdoor applications
8           Audio Technica AT8035  Shotgun microphone, directional, high off-axis rejection, outdoor recording applications

Rooms with Mic Placement and Position
[slide shows diagrams of the rooms with microphone placements and talker positions]

ASpIRE in Context
• The Mixer 8 evaluation data set is challenging and realistic, especially when there is no matched training data.
• The baseline system showed the challenge of this data.
• New methods were available that held promise but remained to be tested on ASpIRE-type data.

New Aspects of This Challenge
• The ASpIRE challenge addressed far-field microphone recordings and introduced the following conditions:
  1. Conversational speech was used rather than read speech, increasing difficulty.
  2.
The vocabulary of the data sets was not controlled or limited; hence, the vocabulary was large, and the development and evaluation data contained words that were not seen in training.
  3. Evaluation data were explicitly designed to differ substantially from the training data, as well as from the development data, to measure system robustness.
  4. No information was provided for the audio files that might enable systems to make use of microphone type, room configuration, speaker position, or speaker identity.

Timeline
• 1/2013-4/2014: Data Collection and Analysis
  – Develop protocol for data collection
  – Collect evaluation data
  – Document difficulty
• 4/15/2014-11/16/2014: Setup
  – Data annotation
  – Evaluation preparation
  – Challenge infrastructure setup
• 11/17/2014-2/11/2015: System Preparation
  – Development data release
  – Release of scoring tools
  – Evaluation server available
  – Participants prepare systems/algorithms
• 2/11/2015-2/18/2015: Single-Microphone Evaluation Period
• 2/19/2015-2/26/2015: Multiple-Microphone Evaluation Period
• 3/1/2015-6/15/2015: Winning Solution Validation and Prize Award
• 12/15/2015: Special Session at ASRU

ASpIRE Site
[slide shows a screenshot of the ASpIRE challenge website]

Some ASpIRE Statistics
• 167 participants from 32 countries around the world signed challenge agreements:
  – Australia, Austria, Belarus, Brazil, Bulgaria, Canada, China, Colombia, Czech Republic, Egypt, Finland, France, Germany, Greece, Hungary, India, Indonesia, Israel, Italy, Japan, Madagascar, Mexico, Pakistan, Russia, Singapore, Slovenia, Spain, Sweden, Turkey, United Kingdom, United States, and Yemen
• Data were downloaded by 48 sites from countries around the world, including:
  – Australia, Belarus, Brazil, Bulgaria, Canada, China, Egypt, Finland, France, Germany, Greece, Indonesia, Israel, Madagascar, Mexico, Pakistan, Russia, Singapore, Slovenia, Spain,
Turkey, United Kingdom, and United States

Accuracy: Development-Test Scores Over Time
[slide shows a chart of leaderboard Dev-Test scores over the course of the evaluation]

Results

Single Microphone
System ID  Team  Primary?  Dev-Test WER  Evaluation WER
13         A     Primary   27.1          44.3
14         B     Primary   27.5          44.3
4          C     Primary   29.9          44.8
11         D     Primary   39.8          52.7
15         E     Primary   27.6          53.4
8          A     Contrast  29.0          43.9
9          A     Contrast  27.4          44.0
6          D     Contrast  40.0          52.8
16         E     Contrast  27.9          54.1
7          D     Contrast  40.0          54.4
12         D     Contrast  39.9          54.7
17         D     Contrast  39.4          50.7

Multiple Microphone
System ID  Team  Primary?  Dev-Test WER  Evaluation WER
18         C     Primary   28.2          38.5

Winning Teams
• Single Microphone:
  A. Raytheon BBN Technologies (Jeff Ma, Roger Hsiao, William Hartmann, Rich Schwartz, Stavros Tsakalidis), Brno University of Technology (Martin Karafiat, Lukas Burget, Igor Szoke, Frantisek Grezl), and Johns Hopkins University (Sri Harish Mallidi, Hynek Hermansky)
  B. Center for Language and Speech Processing, Johns Hopkins University (Vijayaditya Peddinti, Guoguo Chen, Daniel Povey, Sanjeev Khudanpur)
  C. The Institute for Infocomm Research, A*STAR, Singapore (Jonathan William Dennis and Tran Huy Dat)
• Multiple Microphone:
  C. The Institute for Infocomm Research, A*STAR, Singapore (Jonathan William Dennis and Tran Huy Dat)

Winning Solution Attributes
• Top-performing single microphone systems:
  – used multi-condition training: augmented the clean training data with additional degraded data (noise and room impulse responses)
  – performed some form of enhancement and adaptation
  – used speech activity detection appropriate for the task (see MIT Lincoln Lab poster)
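Multi-condition training of the kind described above is commonly implemented by convolving clean utterances with measured or simulated room impulse responses and mixing in noise at a target SNR. The following is a minimal sketch of that augmentation step; the function names, the synthetic signals, and the energy-normalization choice are illustrative assumptions, not taken from any ASpIRE system:

```python
import numpy as np

def reverberate(clean: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve a clean waveform with a room impulse response (RIR)."""
    wet = np.convolve(clean, rir)[: len(clean)]
    # Rescale so the degraded copy keeps the clean signal's energy.
    return wet * (np.linalg.norm(clean) / (np.linalg.norm(wet) + 1e-12))

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech at a target signal-to-noise ratio (dB)."""
    noise = np.resize(noise, len(speech))  # loop/trim noise to match length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Toy example: a synthetic "clean" tone, a decaying-echo RIR, white noise.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 8000))
rir = np.exp(-np.linspace(0, 8, 800)) * rng.standard_normal(800)
degraded = add_noise(reverberate(clean, rir), rng.standard_normal(8000), snr_db=15)
```

In practice the augmented copies are produced for many RIR/noise/SNR combinations and pooled with the clean data before acoustic model training.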
Posters
• BBN, JHU, and Brno University of Technology, "Robust Speech Recognition in Unknown Reverberant and Noisy Conditions": combines multi-condition training, auto-encoder audio enhancement, and DNN adaptation using a linear least-squares method
• Institute for Infocomm Research, "Single and Multi-channel Approaches for Distant Speech Recognition under Noisy Reverberant Conditions": combines robust front-end processing and speech enhancement methods, multi-condition training, and semi-supervised DNN model adaptation
• Johns Hopkins University (JHU), CLSP, "JHU ASpIRE System: Robust LVCSR with TDNNs, i-Vector Adaptation and RNN-LMs": combines TDNN modeling with multi-condition training, i-vector adaptation, silence modeling, and RNN LMs
• MIT Lincoln Laboratory, "Analysis of ASpIRE Systems": examines which factors affect WER across systems and how the systems differ
• SRI, "Improving Robustness against Reverberation for Automatic Speech Recognition": examines the use of GMMs and various types of DNNs, robust features, multi-condition training, and system combination
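The WER figures reported in the results tables are the standard word error rate: the minimum number of word substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length. A minimal sketch of that computation (not the official ASpIRE scoring tool, which additionally handles segmentation and normalization):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate (%) via Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    return 100.0 * dist[len(ref)][len(hyp)] / len(ref)

# One deletion ("the") and one substitution ("off" -> "of") on a 5-word reference:
print(wer("please turn the lights off", "please turn lights of"))  # -> 40.0
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is one reason mismatched far-field conditions like Mixer 8 are scored far worse than the development data.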