Down with Sound: The Story of Silent Speech

Professor Bruce Denby
Université Pierre et Marie Curie
Sigma Lab, ESPCI-ParisTech
Paris, France

Workshop on Speech Production in Automatic Speech Recognition
August 30, 2013, Lyon, France

Speech Communication
•  Speech has always been the most natural and spontaneous modality for communication between human beings (and machines…?)

But speech has certain problems
•  Noise sensitivity
   –  ASR performance degrades very rapidly in the presence of background noise
•  Airborne nature of speech signal
   –  Interference with other communications
   –  Security problems
•  Inaccessibility for certain populations
   –  Laryngectomy
   –  Paralysis, pulmonary insufficiency, etc.
•  Language dependence…

The Silent Speech Interface (SSI) Idea
•  Use non-acoustic sensors to augment or completely replace the acoustic speech signal
[Diagram labels: Speech; Non-acoustic sensors; Non-acoustic ASR; Optional automatic translation; Speech synthesis]

SSI Application Areas
•  In medicine
   –  Give back original voice to voice-handicapped persons: laryngectomy or other pathologies
•  In telecommunications
   –  Telephone securely without disturbing others
   –  Silent Man-Machine Interface (data entry, etc.)
   –  Communicate verbally even in very noisy environments
   –  Automatic translation

SSI – “a hot new area”

Variety of Approaches
[Figure labels: Ultrasound + Optical; EMA; Low Freq. Ultrasound In Air; EMG; NAM; Vibration Sensors; Radar; BCI EEG; BCI cortical]

Ultrasound data acquisition
•  a) Classic cart-based multi-probe machine
•  b) Siemens Acuson P10 ultra-miniaturised
•  c) Terason solution – support PC/Windows
•  Price: 20 000 € (small, 30 Hz) to 200 000 € (standard, high speed and/or 3D)

SIGMA Lab SSI: Ouisper Overview
[Figure labels: Lightweight helmet; Video camera; Ultrasound probe; (no glottal activity); Smartphone or tablet; “Visual speech” recognition engine; Text-To-Speech system trained on user’s original voice; Digital or audible speech]

[Figure labels: Terason T3000 miniature ultrasound machine; Helmet with ultrasound probe, camera, microphone; USB sound card; Acquisition PC for ultrasound, video, and sound streams; Ultraspeech application developed in thesis of Th. Hueber]

Example: laryngectomy voice recovery with phrasebook (French)

Example of captured data (English)

Visual Feature Extraction: selected method for Ouisper SSI
•  Tongue ROI: resize to 64 x 64, then 30 DCT coeffs + 30 d/dt + 30 d2/dt2
•  Lips ROI: resize to 64 x 64, then 30 DCT coeffs + 30 d/dt + 30 d2/dt2
•  180 element joint visual feature vector per frame (60 Hz)
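
The feature pipeline above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Ouisper code: the ROIs are assumed to be already cropped and resized to 64 x 64, the 30 coefficients are taken in low-to-high frequency (diagonal) order, and the derivatives are simple finite differences in per-frame units. The slides specify none of these choices, and all function names are hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis as an (n, n) matrix."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)
    return m

def low_freq_indices(n, count):
    """First `count` (row, col) positions of an n x n grid, lowest frequencies first.
    Assumption: the slides do not say how the 30 coefficients are selected."""
    order = sorted(((r, c) for r in range(n) for c in range(n)),
                   key=lambda rc: (rc[0] + rc[1], rc[0]))
    return order[:count]

def static_features(roi, n_coeffs=30):
    """roi: square 2-D array (assumed already resized to 64 x 64)."""
    d = dct_matrix(roi.shape[0])
    coeffs = d @ roi @ d.T                      # separable 2-D DCT
    rows, cols = zip(*low_freq_indices(roi.shape[0], n_coeffs))
    return coeffs[list(rows), list(cols)]

def add_deltas(static):
    """static: (T, 30). Append d/dt and d2/dt2 (finite differences) -> (T, 90)."""
    d1 = np.gradient(static, axis=0)
    d2 = np.gradient(d1, axis=0)
    return np.hstack([static, d1, d2])

def joint_features(tongue_frames, lip_frames):
    """Stack tongue and lip streams into one 180-element vector per frame."""
    tongue = add_deltas(np.array([static_features(f) for f in tongue_frames]))
    lips = add_deltas(np.array([static_features(f) for f in lip_frames]))
    return np.hstack([tongue, lips])
```

Each stream contributes 30 static + 30 first-derivative + 30 second-derivative values, so the two streams together give the 180 elements quoted on the slide.
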
Continuous Non-Acoustic Speech Recognition: Overall Decoding Strategy
[Diagram labels: Observations O; Gaussian phone likelihood models; HMM word likelihoods P(O|W); Language Model P(W); Viterbi algorithm; Estimated sequence Ŵ]
$\hat{W} = \arg\max_{W \in \text{dictionary}} P(O \mid W)\, P(W)$
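
As a toy illustration of this decoding rule, the sketch below scores each dictionary word with a Viterbi pass through a small left-to-right HMM and adds the log prior from the language model. It is deliberately simplified and is not the system described in the talk: it handles isolated words only, uses 1-D Gaussian states, and fixes the transition probabilities. All names and parameters are hypothetical.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def viterbi_score(obs, states, log_stay, log_next):
    """Best-path log-likelihood of obs through a left-to-right HMM.

    states: list of (mean, var) pairs; the path starts in the first state
    and must end in the last one. This approximates log P(O|W).
    """
    neg_inf = float("-inf")
    n = len(states)
    best = [neg_inf] * n
    best[0] = log_gauss(obs[0], *states[0])
    for x in obs[1:]:
        new = [neg_inf] * n
        for j in range(n):
            stay = best[j] + log_stay
            move = best[j - 1] + log_next if j > 0 else neg_inf
            new[j] = max(stay, move) + log_gauss(x, *states[j])
        best = new
    return best[-1]

def decode(obs, word_models, log_prior,
           log_stay=math.log(0.5), log_next=math.log(0.5)):
    """Return the word maximizing log P(O|W) + log P(W), plus all scores."""
    scores = {w: viterbi_score(obs, m, log_stay, log_next) + log_prior[w]
              for w, m in word_models.items()}
    return max(scores, key=scores.get), scores
```

With two hypothetical word models whose state means are (0, 5) and (5, 0), an observation sequence that rises from about 0 to about 5 is assigned to the first word. When the acoustic scores tie, the prior P(W) decides.
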
Recognition Results for English
•  DARPA TIMIT corpus used for training triphone HMMs
•  3000 sentences, 2 months of recording
•  Accuracy = (N – D – S – I) / N, where N is the number of reference tokens and D, S, and I are the numbers of deletions, substitutions, and insertions
Presented at Interspeech, Sept. 2011, Florence, Italy
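
The accuracy measure above can be computed from a minimum-edit alignment between the reference and recognized token sequences. The sketch below assumes uniform costs for the three error types and a non-empty reference. Scoring tools may weight error types differently, so the split between D, S, and I can differ even when the total matches. Function names are hypothetical.

```python
def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) for a minimum-edit
    alignment of hyp against ref, with uniform costs (an assumption)."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    # cost[i][j] = (total_errors, S, D, I) for ref[:i] versus hyp[:j]
    cost = [[None] * cols for _ in range(rows)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, rows):
        cost[i][0] = (i, 0, i, 0)          # only deletions
    for j in range(1, cols):
        cost[0][j] = (j, 0, 0, j)          # only insertions
    for i in range(1, rows):
        for j in range(1, cols):
            e, s, d, ins = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (e, s, d, ins)      # match, no error
            else:
                diag = (e + 1, s + 1, d, ins)
            e, s, d, ins = cost[i - 1][j]
            dele = (e + 1, s, d + 1, ins)
            e, s, d, ins = cost[i][j - 1]
            inse = (e + 1, s, d, ins + 1)
            cost[i][j] = min(diag, dele, inse)
    _, s, d, ins = cost[-1][-1]
    return s, d, ins

def accuracy(ref, hyp):
    """Accuracy = (N - D - S - I) / N, with N = len(ref) > 0."""
    s, d, ins = align_counts(ref, hyp)
    n = len(ref)
    return (n - d - s - ins) / n
```

Because insertions are penalized, this measure can fall below zero when the recognizer outputs many spurious tokens.
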

Real Time Recognition with Julius
•  HMMs implemented with HTK toolkit
•  This is slow
•  Julius HMM code is available
•  Developed in Japan
•  Uses 2-pass search
   –  Initial search with reduced number of Gaussians in GMM
   –  Full search on reduced set of paths
•  Allows faster than real time recognition even on visual speech

Synthesis Step: Chosen Method for Ouisper SSI
•  Train Text-To-Speech (TTS) on speaker’s voice
•  Use OpenMary open source TTS tools
•  Record 1500 triphone-rich sentences in sound cabin (microphone only)
•  TTS resides on a server accessible via Internet
•  Laryngectomy case:
   –  Ideally pre-op voice if good quality recordings exist
   –  If not, we have found a voice imitator is a good solution

Demo at ISSP, June 2011, Montréal

Move to French Language
•  Individually thermoformed plastic helmets
•  French is hard!
   –  Corpora, LM, tools less available
   –  Liaisons, variants, nasality, etc.
   –  New synthesis module!
•  Recording 3 speakers, one of whom is post-laryngectomy

Recognition Results for French
Triphone enriched Polyvar training set and Polyvar Language Model (train set excluded)
Presented at I2MTC, Minneapolis, USA, May 2013

Real Time Tests in French (text output only, no synthesis)

ANNOUNCEMENT

Silent Speech Challenge
•  English multimodal silent speech data are available online for others to try!!
•  Includes raw single speaker ultrasound and lip images plus text of sentences pronounced (no sound: silent speech!)
•  Hundreds of Gbytes of data!
•  Includes TIMIT training/development corpus plus WSJ0 test corpus (100 sentences) and language model.
•  Benchmark: 93% phone recognition, 84% word recognition on WSJ0 set
•  Available by anonymous ftp at ftp.espci.fr (login anonymous, password your email address)
•  Go to pub, sigma, then TIMIT_Training or WSJ05K_Test, then to Ultraspeech Capture

More Recently
•  Specially manufactured miniature US probe

Perspective

SPINOFF

i-Treasures Project
•  FP7 IP project
•  12 Partners
   –  CERTH, UOM, AUTH (Greece)
   –  UPMC, ARMINES/ENSMP, LPP/CNRS (France)
   –  UCL (UK)
   –  CNR (Italy)
   –  UMONS, Acapela (Belgium)
   –  Turkish Telecom
   –  University of Maryland (observer)
•  Propose novel multisensory technologies to safeguard and transmit ICH (Intangible Cultural Heritage)

i-Treasures
•  Four use cases in different ICH domains:
   –  Rare traditional songs
   –  Rare dance interactions
   –  Traditional craftsmanship
   –  Contemporary music composition
•  Capture and analyze Intangible Cultural Heritage:
   –  Facial expression analysis
   –  Body and gesture recognition
   –  Vocal tract modeling
   –  Sound processing
   –  Electroencephalography

Rare traditional song use case
•  Capture and analyze vocal tract of expert singers and use to create an independent ‘avatar’ of the singer
•  Student displays expert avatar and tries to match his own to it in real time, in order to learn the technique
•  Targeted rare singing techniques:
   –  Corsican cantu in paghjella
   –  Sardinian canto a tenore
   –  Byzantine singing
   –  Human beat box

Possible Strategy
[Diagram labels: Model inputs; Vocal tract model; Compare output vocal tract to ultrasound image; Map ultrasound image to model inputs]
Machine learning determines parameters of mappings from US -> inputs and from vocal tract model -> US

First recording of singing data

The End
Thank you!