Speech and Language Technologies for Disaster Management and Emergency Response
Ian Lane
Research Assistant Professor
Carnegie Mellon University (CMU-SV, LTI)

Talk Overview
•  Speech-based Human Machine Interaction
•  Speech and Language Technologies for Disaster Management and Emergency Response
   –  Applications & Challenges
•  Research Focus:
   –  Intelligent Crowd Sourcing
   –  Field Maintainable Speech Translation Systems

Speech as a Communication Medium
§  Speech is the most natural and powerful form of communication between humans
   §  Natural: No additional training required
   §  Flexible: Adapts to dialogue partners and environment
   §  Efficient: Communicates large amounts of information
   §  Carries information about the speaker, their cognitive state, and the environment
§  Particularly powerful for:
   §  Cooperative problem solving
   §  Participants involved in other activities

Speech for Human Computer Interaction
§  Usability: Novice users can complete complex tasks with little additional training; the same interface lets expert users complete tasks quickly
§  Ubiquity: Only a cellular phone is required to access information
§  Suitable for hands / eyes busy environments
   §  Information retrieval & device operation (In-Car)
   §  Voice-based manual reference during maintenance (NASA)
   §  Speech translation for medical, military tasks (Checkpoints)
§  Can be effectively combined with other modalities of interaction

Common Speech Applications
State-of-the-art Commercial Applications
–  Personal dictation (Dragon NaturallySpeaking)
–  Interactive Voice Response (Spoken Dialog Systems)
–  Speech input on mobile devices (Search, transcription)
–  Multimodal Human-Computer Interfaces
–  Limited-domain speech-to-speech translation
Future Applications
–  Intuitive interfaces for mobile devices
–  Pervasive supportive agents (CHIL, Lifelogger)
–  Real-time subtitles for broadcast media, teleconferencing, video
–  Conversational humanoid robots
–  Large-scale multilingual data mining

Applications for Disaster Management
•  Analytics
   –  Text (Twitter, SMS), Voice (Messages, Amateur Radio)
•  Human-Computer Interaction
   –  Command and Control
   –  Information Access
•  Human-Human Interaction
   –  Supportive Agents
   –  Speech Translation

Challenges in Disaster Scenarios
•  Hands and eyes busy environments
   –  Responders in the field
•  Lack of adaptability and fieldability
   –  Each event is unique
   –  How can we adapt a system while an event unfolds?
•  Rapid system development
   –  New domains
   –  New languages

State-of-the-art in Speech Recognition (typical word accuracy)
–  Machine-directed speech
   •  Small vocabulary (command and control): > 95%
   •  Large vocabulary: 90%
–  Spontaneous speech
   •  Two-party human conversations: 70-80%
   •  Multi-party meetings: 60-70%
–  Far-field speech recognition: 30%
–  Radio transmissions?

Related Research
•  Intelligent Crowd Sourcing
   –  Building analytic systems on the fly (DMI)
•  Speech Translation Systems
   –  Field Maintainable Speech Translation Systems
•  Human-Computer Interaction
   –  Command and control for telerobotics (Honda)
   –  Multi-modal interfaces for mobile devices

Intelligent Crowdsourcing
•  Building analytic systems on the fly
   –  Each event is unique: languages, key words
   –  Human experts → provide examples, correct output
   –  Machine learning → scale to large data
•  Real-Time Analysis of Twitter
   –  Identify critical messages and the salient information within them
•  Rapid Collation of Situational Knowledge
   –  Collate situational knowledge from citizen reporters

Intelligent Crowdsourcing (Unstructured Data → Situational Knowledge)
•  Gather Information (Twitter, SMS, External Apps, Monitoring Stations)
•  Automatic Analysis (Machine Learning): online learning from labeled examples
•  Human Analysis (Crowd-Sourcing; group of experts): define the analysis required; verify and correct automatic output
•  Situational Knowledge (Common Operating Pictures)
(a minimal code sketch of this loop appears after the application examples below)

Intelligent Crowdsourcing (Application Example: Parsing Twitter Feeds)
•  Twitter (Japan Earthquake)
•  Identify critical and actionable messages
•  Identify information (number of people, location)
http://ec2-50-17-200-127.compute-1.amazonaws.com/

Intelligent Crowdsourcing (Application Example: Crowd-Sourced Situational Awareness)
•  Rapid Collation of Situational Knowledge
   –  Location and Structural Integrity of Buildings
   –  Geo-located Data: Text, Images and Speech

Intelligent Crowdsourcing (Application Example: Extracting Situational Knowledge)
•  Rapid Multimodal Information
   –  text, images and speech
   –  location and status of buildings
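The label-learn-verify loop in the pipeline above can be sketched in a few lines. The snippet below is a minimal illustration, not the system from the talk: it assumes a recent scikit-learn (HashingVectorizer plus SGDClassifier, which supports incremental partial_fit), and the message texts, label scheme, and confidence threshold are all hypothetical.

```python
# Minimal sketch of the intelligent-crowdsourcing loop described above:
# experts label a few messages, a model learns online, and low-confidence
# predictions are routed back to humans for verification and correction.
# Assumes scikit-learn; classes and thresholds are illustrative only.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**18)  # no vocabulary to maintain
model = SGDClassifier(loss="log_loss")            # supports partial_fit
CLASSES = [0, 1]                                  # 1 = critical/actionable

def learn(messages, labels):
    """Online update from expert-labeled examples."""
    model.partial_fit(vectorizer.transform(messages), labels, classes=CLASSES)

def triage(messages, threshold=0.8):
    """Split incoming messages into auto-labeled and needs-human-review."""
    probs = model.predict_proba(vectorizer.transform(messages))[:, 1]
    auto, review = [], []
    for msg, p in zip(messages, probs):
        (auto if p > threshold or p < 1 - threshold else review).append((msg, p))
    return auto, review

# Usage: seed with a handful of expert labels, then triage the stream.
learn(["building collapsed, 3 people trapped", "nice weather today"], [1, 0])
auto, review = triage(["people trapped near the station"])
```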
Field Maintainable Speech Translation Systems
Speech as a Communication Medium
•  Speech is the most prevalent communication medium
   –  Natural: Requires no special training
   –  Flexible: Communication style adapts to the situation
   –  Efficient: High information throughput
•  Spoken discourse is particularly powerful
   –  For collaborative problem-solving tasks
   –  When a participant is involved in another activity
•  Speech is the most prevalent information medium
   –  The amount of speech transmitted or broadcast (radio, television, telephony) is significantly larger than that of all other mediums

Speech as a Communication Barrier
•  The majority of the world's population is fluent in only one or two languages
   –  Lack of fluency in a major world language is a significant social barrier
   –  Even people who can understand a written language often have difficulty with speech

Automated Translation of Spoken Language
•  Why?
   –  Enable information access and knowledge sharing across languages
   –  Enable direct communication between participants
      •  For many language-pairs in the EU no interpreters exist
      •  Interpreters are not suitable or available for all situations
•  How?
   –  Combine ASR and MT technologies to translate speech in one language into text (or speech) in another

Speech input (L1) → Automatic Speech Recognition (ASR: L1) → Machine Translation (MT: L1 → L2) → Translation output (L2)
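The ASR → MT cascade above is straightforward to express in code. The sketch below is illustrative only: `AsrEngine`-style `recognize` and `translate` calls stand in for whatever engines are available, and these interfaces are assumptions, not the systems described in this talk.

```python
# Illustrative ASR -> MT cascade for speech translation.
# The asr and mt objects are hypothetical stand-ins for real engines
# (e.g. an HMM-based recognizer and a Moses-style phrase-based decoder).
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str     # recognized or translated text
    score: float  # log-probability assigned by the engine

class SpeechTranslator:
    def __init__(self, asr, mt):
        self.asr = asr  # recognizer for source language L1
        self.mt = mt    # translator for L1 -> L2

    def translate_speech(self, audio: bytes) -> Hypothesis:
        # 1. Recognize: audio in L1 -> text in L1
        source = self.asr.recognize(audio)
        # 2. Translate: text in L1 -> text in L2
        target = self.mt.translate(source.text)
        # Recognition errors propagate into MT, which is why the talk
        # reports results on both manual transcripts and speech input.
        return Hypothesis(target.text, source.score + target.score)
```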
Commercial Speech Translation Systems
•  iOS, Android, Windows PC
   –  Server-based and offline versions
•  Can run bilingual translation on a single mobile phone
   –  Eight language-pairs currently supported
      •  Spanish, Japanese, Chinese, German, French …
   –  Deployed after the earthquake and tsunami in Japan

Field Maintainable Speech-to-Speech Translation Systems
•  Vocabulary encountered in the field depends on:
   –  Environment: the location where the device is being used
   –  User: the user and the user's social group
   –  Task: the task the device is currently being used for
•  Impossible to provide coverage of all vocabulary (esp. NEs)
→  Pre-selecting or weighting named entities while the system is in the field significantly improves task success
   –  User selection and addition of new NEs
   –  Automatic selection / weighting of likelihoods based on location, task, etc.

Adding Words to S2S Translation
•  To add a new word to the system, all components (10 models) must be updated
•  Expertise is required to retrain the system; this cannot take place in the field
→  A user with little expertise and limited knowledge of the target language should be able to add words to the system in the field

Class-based S2S Translation Framework [Lane08]
•  Use class-based language models for ASR and class-based SMT
•  Semantic classes are consistent across both languages for all components
•  Component models are updated using entries in the system dictionary (a sketch of this update follows below)

Manual Addition of New Words
•  English word entered by user
•  Pronunciation of the input word generated automatically
•  Translation of the input word generated automatically
•  Pronunciation of the target word generated automatically
•  User selects the semantic class
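A minimal sketch of the word-addition flow just described: a new entry goes into the system dictionary once, and class-based components pick it up at run time rather than being retrained. The data structures and the `g2p_en`, `translate`, and `g2p_target` helpers are hypothetical; the talk's actual system updates ten models.

```python
# Sketch of field word-addition in a class-based S2S framework.
# A new named entity is stored once in the system dictionary; ASR and MT
# expand class tokens (e.g. @CITY) from it at decode time, so components
# need not be retrained. All helper functions here are hypothetical.
system_dictionary = {"@CITY": [], "@PERSON": []}   # semantic classes

def add_word(english, semantic_class, g2p_en, translate, g2p_target):
    """Add one user-entered word; auto-generate what the user cannot."""
    entry = {
        "src": english,
        "src_pron": g2p_en(english),   # pronunciation for source-side ASR
        "tgt": translate(english),     # translation for MT
    }
    entry["tgt_pron"] = g2p_target(entry["tgt"])  # target-side pronunciation
    system_dictionary[semantic_class].append(entry)
    return entry

def expand_class_token(token, hypothesis_word):
    """At decode time, match a class token against dictionary entries."""
    return [e for e in system_dictionary.get(token, [])
            if e["src"] == hypothesis_word]
```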
Field Maintainable S2S Translation for Low Resource Languages
•  Expensive to port to new language-pairs
   –  Manual tagging of corpora
   –  Decomposition tools required for morphologically complex languages
   –  Optimality not ensured
→  Automatically bootstrap using bilingual corpora and existing tools in a resource-rich language
•  Investigate two approaches:
   –  Projection of named-entity tags using multi-feature costs
   –  Data-driven morphological decomposition

Cross-lingual NE-Tag Projection
•  Consider multiple feature costs during projection
   –  Annotation Cost: Ctag(E)
      •  The confidence of an annotation in the source language
   –  Translation Cost: Ctrans(E,F)
      •  The translation likelihood of aligned named-entity pairs, derived from word co-occurrences over the training corpus
   –  Transliteration Cost: Ctranslit(E,F)
      •  Transliteration equivalence of aligned named entities within a sentence pair
•  Search for the target annotation that minimizes the weighted sum of the three cost functions
•  Component weights are optimized on a development set

Feature Costs for NE-Tag Projection (example sentence pair)
E: Tomorrow your men will patrol in @CITY{Nahir Jasim}
I: gdAF jmAEtk rH yqwmwA bEml dwryp fy nhr jAsm
•  Annotation Cost Ctag(E): confidence of the English @CITY tag
•  Translation Cost Ctrans(E,F): from word alignments
•  Transliteration Cost Ctranslit(E,F): "Nahir Jasim" ↔ "nhr jAsm"
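The projection search reduces to choosing, for each English NE, the target-side span that minimizes the weighted cost combination. The sketch below is a simplified rendering under stated assumptions: the three cost functions are placeholders for the models in the talk, the search naively enumerates contiguous spans, and the weights would be tuned on the development set as the slide describes.

```python
# Simplified sketch of cross-lingual NE-tag projection.
# For one sentence pair, choose the target span that minimizes the
# weighted sum of annotation, translation, and transliteration costs.
import itertools

def project_ne_tag(src_ne, tgt_words, c_tag, c_trans, c_translit,
                   weights=(1.0, 1.0, 1.0)):
    """Return (start, end) of the best target span for one source NE."""
    w_tag, w_trans, w_translit = weights  # tuned on a development set
    best_span, best_cost = None, float("inf")
    # Enumerate all contiguous candidate spans in the target sentence.
    for i, j in itertools.combinations(range(len(tgt_words) + 1), 2):
        span = tgt_words[i:j]
        cost = (w_tag * c_tag(src_ne)            # source-tag confidence
                + w_trans * c_trans(src_ne, span)        # alignment likelihood
                + w_translit * c_translit(src_ne, span)) # string similarity
        if cost < best_cost:
            best_span, best_cost = (i, j), cost
    return best_span, best_cost
```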
Morphological Decomposition
•  In morphologically complex languages, decomposition must be applied before (or during) named-entity tagging
   E: … and in @CITY{Al-Hilla}
   I: … wbAlhlp  →  … w_ b_ @CITY{Alhlp}
•  Decomposition also improves word alignment for MT
•  How do we perform morphological decomposition for new languages or dialects?
→  Generate decomposition rules that maximize consistency across the bilingual corpus

Data-Driven Morphological Decomposition
1. Project NE-tags across the bilingual corpus [described earlier]
2. For each named entity in the source language (ENE):
   •  Estimate the target stem by selecting the character sequence with minimal transliteration cost over all target phrases

   Words aligned to source "Al-Hilla"   Count(ENE)
   Alhlp                                155
   bAlhlp                                44
   wAlhlp                                32
   wbAlhlp                               14
   →  STEM = Alhlp

3. Collect counts of affixes over each NE class

   Class-specific affix   Count
   @CITY _y               870
   b_ @CITY               500
   wb_ @CITY              250
   …

4. Generate decomposition rules for unseen words, e.g.:
   wbAlhlpy  →  wb_ @CITY{Alhlp} _y
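Steps 2-4 lend themselves to a short sketch. The code below is a toy illustration under strong assumptions: the stem is already known from the projected-NE table, affixes are whatever precedes or follows the stem inside an aligned word, and the transliteration-based stem search itself is abstracted away. It reproduces the slide's wbAlhlpy example.

```python
# Toy sketch of steps 2-4 of data-driven morphological decomposition:
# given target words aligned to a known stem, count class-specific
# affixes, then decompose unseen words by stripping known affixes.
from collections import Counter

def collect_affixes(aligned_words, stem, ne_class, affix_counts):
    """Step 3: count prefixes/suffixes around the estimated stem."""
    for word, count in aligned_words.items():
        if stem in word:
            prefix, suffix = word.split(stem, 1)
            if prefix:
                affix_counts[(ne_class, "prefix", prefix)] += count
            if suffix:
                affix_counts[(ne_class, "suffix", suffix)] += count

def decompose(word, stems, affix_counts, ne_class, min_count=10):
    """Step 4: decompose an unseen word using learned affix rules."""
    for stem in stems:
        if stem in word:
            prefix, suffix = word.split(stem, 1)
            ok = all(affix_counts[(ne_class, kind, a)] >= min_count
                     for kind, a in (("prefix", prefix), ("suffix", suffix)) if a)
            if ok:
                parts = ([f"{prefix}_"] if prefix else []) \
                        + [f"@{ne_class}{{{stem}}}"] \
                        + ([f"_{suffix}"] if suffix else [])
                return " ".join(parts)
    return word  # no rule applies; leave the word intact

# Counts from the slide's "Al-Hilla" table, plus the corpus-wide _y count.
counts = Counter()
collect_affixes({"Alhlp": 155, "bAlhlp": 44, "wAlhlp": 32, "wbAlhlp": 14},
                stem="Alhlp", ne_class="CITY", affix_counts=counts)
counts[("CITY", "suffix", "y")] = 870  # from the class-specific affix table
print(decompose("wbAlhlpy", ["Alhlp"], counts, "CITY"))
# -> wb_ @CITY{Alhlp} _y
```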
Experimental Evaluation
•  Evaluate the effectiveness of the proposed approaches for Iraqi→English speech translation
   –  Iraqi has dialect-specific morphology that differs from MSA
   –  Low-resource language: data exists, tools do not
   –  Named-entity-rich interview task from the TransTAC 2008 evaluations (~90% of Iraqi utterances contain a NE)
•  Evaluation criteria
   –  Tagging accuracy
   –  Speech recognition performance
   –  Machine translation quality

Training Corpora and Models
•  Automatic Speech Recognition (Iraqi) [Roger Hsiao]
   –  Training Corpora: 350 hours of audio data
   –  Acoustic Model: 6000 codebooks (max. 64 Gaussian mixtures); semi-tied covariance; boosted MMI discriminative training
   –  Language Model: trigram; 3 million words; Kneser-Ney smoothing
•  Machine Translation (Iraqi → English, English → Iraqi)
   –  Training Corpora: 650K sentence pairs
   –  TMs: trained using the Moses toolkit; LMs: trigram, ~3 million words
•  English NE Tagger
   –  Training Corpora: 15k sentences (12 NE classes)
   –  NE Tagger: Conditional Random Field-based tagger
•  Dev and Eval Sets
   –  Development: TransTAC-June 2008 (Names task), 530 utterances
   –  Evaluation: TransTAC-Nov 2008 (Names task), 650 utterances
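The class-based LMs evaluated in the next slides are trained on corpora in which tagged named entities are collapsed to class tokens. A minimal sketch of that corpus preparation, assuming the @CLASS{…} tag format used on these slides (the regex itself is an assumption):

```python
# Sketch of preparing a corpus for class-based LM training: tagged
# named entities (e.g. @CITY{Alhlp}) are collapsed to class tokens so
# the n-gram LM generalizes over all members of the class.
import re

TAG = re.compile(r"@([A-Z]+)\{[^}]*\}")

def to_class_tokens(tagged_sentence: str) -> str:
    """Replace each tagged NE with its bare class token for LM training."""
    return TAG.sub(lambda m: f"@{m.group(1)}", tagged_sentence)

print(to_class_tokens("gdAF jmAEtk rH yqwmwA bEml dwryp fy @CITY{nhr jAsm}"))
# -> gdAF jmAEtk rH yqwmwA bEml dwryp fy @CITY
```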
NE-Tagger Accuracy
[Figure: Precision, Recall and F-score (60-90% axis) for the word-alignment baseline vs. multi-feature cost + DDMD]
•  Significant improvement from incorporating the proposed approaches
•  The Iraqi tagger obtained an 85% F-score, compared to 90% for English

Speech Recognition Performance
•  Iraqi NE-tagger applied to the bilingual and monolingual corpora
•  Class-based LM trained on the resulting corpora

   System                        Word Error Rate
   No Named-Entity Classes       34.5%
   Multi-Feature Cost + DDMD     32.4%

•  WER reduced by 2.1% absolute
   –  Improved recognition of named entities
   –  WER still relatively high due to the spontaneous nature of the evaluation set

Translation Accuracy (I→E)
[Figure: BLEU (0.40-0.55 axis) for "No NE classes" vs. "Class-based Model", on manual transcripts and speech input]
•  Significant improvement in translation quality from incorporating classes
•  +3.5 BLEU (Manual Transcriptions); +1.1 BLEU (Speech Input)

Conclusions and Future Work
•  Investigated two approaches for bootstrapping linguistic tools for new languages
   –  Projection of named-entity tags using multi-feature costs
   –  Data-driven morphological decomposition
•  NE projection is effective when multiple feature costs are used
   –  F-score increased 75% → 85% (by incorporating transliteration / tagging scores)
•  Class models trained using the resulting tagger improve both ASR and MT performance
   –  WER reduced: 34.5% → 32.4%
   –  MT quality improved: 0.5130 → 0.5480 (BLEU)
•  Future work: Extend the DDMD approach to perform generalized decomposition (POS tags / word units in English)

Robotic Control
•  Augment teleoperation of robots with voice commands
   –  "go to the kitchen"
   –  "go to the front door"
   –  "go to the end of the corridor"
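Commands like these are typically handled with a small closed grammar rather than open-vocabulary recognition, which is where the >95% command-and-control accuracy cited earlier comes from. A hypothetical sketch of such a command parser; the destination set and Command type are illustrative, not the Honda system's interface:

```python
# Hypothetical parser for "go to <destination>" voice commands, matching
# recognizer output against a small closed set of known destinations.
from typing import NamedTuple, Optional

class Command(NamedTuple):
    action: str
    destination: str

DESTINATIONS = {"the kitchen", "the front door", "the end of the corridor"}

def parse_command(transcript: str) -> Optional[Command]:
    """Map a recognized utterance to a navigation command, or None."""
    words = transcript.lower().strip()
    if words.startswith("go to "):
        dest = words[len("go to "):]
        if dest in DESTINATIONS:
            return Command("navigate", dest)
    return None  # out-of-grammar utterance: ask the user to rephrase

print(parse_command("go to the kitchen"))
# -> Command(action='navigate', destination='the kitchen')
```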