Speech recognition For graduate and post-graduate students Home page: (Note code change: S89.5150 => ELEC-E5510) https://mycourses.aalto.fi/course/view.php?id=5180 Registration at Oodi Mikko Kurimo and Peter Smit + project work support from Aalto University's speech recognition research group Goals Become familiar with current speech recognition methods and applications Learn the structure of a typical speech recognition system Learn to construct one in practice Discussion: What is your goal – why are you here? 2015 Mikko Kurimo Speech recognition course 2 / 65 Content today 1.General organization of the course 2.What is automatic speech recognition? 3.Speech as an acoustic signal 4.Home exercise 1: − Build a system to classify speech features into phonemes 2015 Mikko Kurimo Speech recognition course 3 / 65 Content Week 1 Features Week 2 Phonemes Week 5 Speech recognition Week 3 Words Week 3 Language applications Week 4 ASR system, decoder 2015 Mikko Kurimo Speech recognition course 4 / 65 Meetings 14 meetings during 7 weeks in Nov - Dec: − Wed 10.15-12 at I346 − Fri 14.15-16 at Maari-C (Until Nov 21) − Fri 14.15-16 at I346 (After Nov 22) Contents: − 7 general lectures (main ideas, no details) − 4 meetings for practical work with computers − 3 seminar meetings for talks on special topics 2015 Mikko Kurimo Speech recognition course 5 / 65 Course Format "Hour pie": 130h meetings exercises project "Grade pie": 1..5 meetings exercises project Meetings on Project work + every active Wed and Fri (in groups) participation + group grade + studying the + personal grade course material 2015 Mikko Kurimo + for every (normalized 4 Home correct answer peer review) exercises + sum is graded Speech recognition course 6 / 65 Timeline in the course Meetings Wednesdays Home exercises Project work status Fridays Week1 Speech signal Classification Feature classifier Selection of topic Week2 Phoneme modeling Recognition Word recognizer Literature study Week3 Lexicon and language Language model Text predictor Analysis Week4 Continuous speech LVCSR Experimentation Week5 Adaptation Applications Preparing reports Week6 Projects1 Projects2 Presentations Week7 Projects3 Conclusion Presentations 2015 Mikko Kurimo Speech recognizer Speech recognition course 7 / 65 Timeline in the course Meetings Wednesdays Home exercises Project work status Fridays Week1 Speech signal Classification Feature classifier Selection of topic Week2 Phoneme modeling Recognition Word recognizer Literature study Week3 Lexicon and language Language model Text predictor Analysis Week4 Continuous speech LVCSR Experimentation Week5 Adaptation Applications Preparing reports Week6 Projects1 Projects2 Presentations Week7 Projects3 Conclusion Presentations 2015 Mikko Kurimo Speech recognizer Speech recognition course 8 / 65 Timeline in the course Meetings Wednesdays Home exercises Project work status Fridays Week1 Speech signal Classification Feature classifier Selection of topic Week2 Phoneme modeling Recognition Word recognizer Literature study Week3 Lexicon and language Language model Text predictor Analysis Week4 Continuous speech LVCSR Experimentation Week5 Adaptation Applications Preparing reports Week6 Projects1 Projects2 Presentations Week7 Projects3 Conclusion Presentations 2015 Mikko Kurimo Speech recognizer Speech recognition course 9 / 65 Useful skills linear algebra (basic matrix operations) probability and statistics signal processing and natural language processing programming − perl/python for text processing − shell scripts, C/C++ for running/modifying programs − matlab familiarity with Linux 2015 Mikko Kurimo Speech recognition course 10 / 65 Test of your skill level You will now see examples when certain skills are required Raise your hand, if you are familiar with these 2015 Mikko Kurimo Speech recognition course 11 / 65 Useful skills - 1 linear algebra (basic matrix operations) − multiplication, determinant, transpose, inverse 2015 Mikko Kurimo Speech recognition course 12 / 65 Useful skills - 2 probability and statistics 2015 Mikko Kurimo Speech recognition course 13 / 65 Useful skills – 3 signal processing and natural language processing Examples: − Spectrum and spectrogram of a signal − count the frequency of all word pairs in text 2015 Mikko Kurimo Speech recognition course 14 / 65 Useful skills - 4 programming − Matlab − Linux − shell scripts, C/C++, perl/python example tasks: − Use Matlab toolboxes to compute a spectrum − Run programs in Linux and store their output in a file − Make a script to run commands many times in loop using increasing parameter values − Make a simple program to compute the error rate between the speech recognition result (a string) and the reference text 2015 Mikko Kurimo Speech recognition course 15 / 65 Useful skills linear algebra (basic matrix operations) probability and statistics signal processing and natural language processing programming − perl/python for text processing − shell scripts, C/C++ for running/modifying programs − matlab familiarity with Linux 2015 Mikko Kurimo Speech recognition course 16 / 65 Text book 2015 Mikko Kurimo You may survive without one, but this is recommended Huang, Acero: Spoken Language Processing Prentice Hall, 2001 ISBN: 0-13022616-5 Speech recognition course 17 / 65 Other useful text books 2015 Mikko Kurimo Speech recognition course 18 / 65 Some useful online resources Gales, Young: HMMs applications in ASR (book): http://dx.doi.org/10.1561/2000000004 Cambridge: HTK Book (detailed manual): http://htk.eng.cam.ac.uk/docs/docs.shtml Matlab primer: https://www.mathworks.com/help/pdf_doc/matlab/getstart.pdf Slides from MIT open course: 6.345 ASR (2003) http://ocw.mit.edu/courses/electrical-engineering-and-computerscience/6-345-automatic-speech-recognition-spring-2003/ 2015 Mikko Kurimo Speech recognition course 19 / 65 Useful software Software used in this course: − Matlab, GMMBayes toolbox − Cambridge HMM toolkit (HTK) − SRI language modeling toolkit (SRILM) Other useful software for ASR: − Kaldi, Aachen, KTH Snack, OGI speech, Nagoya's Julius − University of Colorado SONIC ASR, CMU Sphinx-II ASR − AaltoASR tools, Aalto Morfessor tools − NIST ASR scoring utilities − CMU / Cambridge language model toolkit 2015 Mikko Kurimo Speech recognition course 20 / 65 Project work receipt 1.Form a group (3 persons) Done already? 2.Get a topic 3.Get reading material from mycourses or your group tutor 4.Schedule a meeting with your tutor (before next Wed) 5.1st meeting: Specify the topic and write a work plan (DL 2nd Wed) 6.Make a brief literature study (DL 3rd Wed) 7.Perform analysis, experiments, and write a report 8.Book your presentation time in the Week 6 or 7 (DL 5th Wed) 9.Prepare and keep your 30 min presentation 10.Return the report (DL 7th Fri) 2015 Mikko Kurimo Speech recognition course This wee 21 / 65 Project topics & tutors • • • • • 4.Language model adaptation (Andre) 8.Confidence measure for ASR (Seppo) 9.ASR for dialect speech (Peter) 13.Speech recognition in noise (Sami) 14.Voice activity detection (Seppo) 2015 Mikko Kurimo Speech recognition course • • • • • 16.HMMs for speech synthesis (Reima) 19.Comparing sub-word language models (Matti) 21.Animation of HMMs (Mikko) 22.DNNs for acoustic modeling (Peter) 23.Topic modeling (Akmal) 22 / 65 Content today 1.General organization of the course 2.What is automatic speech recognition? 3.Speech as an acoustic signal 4.Home exercise 1: − Build a system to classify speech features into phonemes 2015 Mikko Kurimo Speech recognition course 23 / 65 All these are needed to build ASR systems! Prob. Comp. Data 2015 Mikko Kurimo Speech recognition course 24 / 65 Milestones for ASR systems 1952 Bell Labs Digit Recognizer 1976 CMU Harpy 1000-word connected recognizer with constrained grammar 1980 TKK: 1000-word LSM recognizer (separate words w/o grammar) 1988 TKK: phonetic typewriter 1993 Read texts (WSJ news) 1998 Broadcast news, telephone conversations 1998 Speech retrieval from broadcast news 2015 Mikko Kurimo Speech recognition course 25 / 65 Milestones for ASR systems (2) 2002 Rich transcription of meetings 2004 TKK: Finnish online dictation, almost unlimited vocabulary based on morphemes 2006 Machine translation of broadcast speech 2006 Voice interface in Windows Vista video 2008 Google voice search 2009 Aalto: Speaker adaptation for a synthesized voice by speech recognition in another language video 2011 Siri voice assistant 2015 Mikko Kurimo Speech recognition course 26 / 65 ASR tasks and solutions Speaking environment and microphone − − − Office, headset or close-talking Telephone speech, mobile Noise, outside, microphone far away Style of speaking Speaker modeling 2015 Mikko Kurimo Speech recognition course 27 / 65 ASR tasks and solutions Speaking environment and microphone Style of speaking − − − − − Isolated words Connected words, small vocabulary Word spotting in fluent speech Continuous speech, large vocabulary Spontaneous speech, ungrammatical Speaker modeling 2015 Mikko Kurimo Speech recognition course 28 / 65 ASR tasks and solutions Speaking environment and microphone Style of speaking Speaker modeling − − − Speaker-dependent models Speaker-independent, average speaker models Speaker adaptation 2015 Mikko Kurimo Speech recognition course 29 / 65 Content today 1.General organization of the course 2.What is automatic speech recognition? 3.Speech as an acoustic signal 4.Home exercise 1: − Build a system to classify speech features into phonemes 2015 Mikko Kurimo Speech recognition course 30 / 65 What is speech recognition? Find the most likely word or word sequence given the acoustic signal and our models! Language model defines words and how likely they occur together Lexicon defines how words are formed from sound units Acoustic model defines the sound units independent of speaker and recording conditions 2015 Mikko Kurimo Speech recognition course 31 / 65 What is speech recognition? Find the most likely word or word sequence given the acoustic signal and our models! Language model defines words and how likely they occur together Lexicon defines how words are formed from sound units Acoustic model defines the sound units independent of speaker and recording conditions 2015 Mikko Kurimo Speech recognition course 32 / 65 Content today 1.General organization of the course 2.What is automatic speech recognition? 3.Speech as an acoustic signal 4.Home exercise 1: − Build a system to classify speech features into phonemes 2015 Mikko Kurimo Speech recognition course 33 / 65 Speech recording 2015 Mikko Kurimo Speech recognition course Picture by B.Pellom 34 / 65 A sample of speech waveform 2015 Mikko Kurimo Speech recognition course 35 / 65 Other words, other speakers Very difficult to recognize speech from the picture... 2015 Mikko Kurimo Speech recognition course 36 / 65 Modelling the speech signal Discussion: What separates speech from all the other sounds that the microphone has recorded? − computer noise, car noise, human movements, other sounds from the mouth, ... − so, what is special in speech and common in all speaking situations 2015 Mikko Kurimo Speech recognition course 37 / 65 How to recognize speech? A simple procedure: − Measure some characteristic features of the signal and train statistical models for them 2015 Mikko Kurimo Speech recognition course 38 / 65 How to recognize speech? A simple procedure: − Measure some characteristic features of the signal and train statistical models for them Good features should be: 1.Compact 2.Discriminative for speech sounds 3.Fast to compute 4.Robust for noise 2015 Mikko Kurimo Speech recognition course 39 / 65 Frequency analysis Calculate the spectrum in short time intervals 2015 Mikko Kurimo Speech recognition course 40 / 65 Frequency analysis Calculate the spectrum in short time intervals 2015 Mikko Kurimo Speech recognition course 41 / 65 Frequency analysis Calculate the spectrum in short time intervals 2015 Mikko Kurimo Speech recognition course 42 / 65 Computation of MFCC 2015 Mikko Kurimo Speech recognition course 43 / 65 Picture by B.Pellom Mel scale Approximation of human perception of speech “Divide the frequency scale into perceptually equal intervals”: Linear below 1 kHz, logarithmic above 1 kHz 2015 Mikko Kurimo Speech recognition course 44 / 65 Cepstrum Cepstrum is essentially “a spectrum of a spectrum”: − Analysis in frequency scale (vertical direction) MFCC = Mel-Frequency Cepstral Coefficients ... 2015 Mikko Kurimo Speech recognition course 45 / 65 Speech sample 1.1.Frames: Frames: short short10ms 10ms windows windows 2.2.FFT: FFT: power powerspectrum spectrum spectrogram spectrogram 3.3.Filtering: Filtering: mel melfilter filter motivated motivatedby by human humanear ear “essential “essentialdata” data 4.4.Features: Features: DCT DCTtransform transform mel cepstrum mel cepstrum MFCC MFCC -less -lessfeatures features -less correlation -less correlatio 2015 Mikko Kurimo Speech recognition course 46 / 65 The same 5 samples again Very difficult to recognize speech from this picture... 2015 Mikko Kurimo Speech recognition course 47 / 65 Power spectrogram 2015 Mikko Kurimo Speech recognition course Speech recognition possible Lot of data Lot of redundancy Lot of noise 48 / 65 Mel spectrogram 2015 Mikko Kurimo Speech recognition course Speech recognition maybe easier? 10 x less data Less redundancy Less noise 49 / 65 Mel-frequency cepstral coefficients (MFCC) Even more compact Less correlation Less noise? 2015 Mikko Kurimo Speech recognition course 50 / 65 Background noise? 2015 Mikko Kurimo Speech recognition course 51 / 65 To classify sounds by features? Training 1. Extract MFCC from samples of each sound (e.g. phoneme) 2. Train a statistical model (mean and variance) Testing 1. Record new samples and extract MFCC 2. Choose the best-matching model to be the class 2015 Mikko Kurimo Speech recognition course 52 / 65 Classification by features Gaussian mixture model GMM train a statistical model (mean and variance) for samples of each sound source choose the best-matching statistical model to be the class of an unknown sample 2015 Mikko Kurimo Speech recognition course 53 / 65 Normal (Gaussian) distribution 2015 Mikko Kurimo Speech recognition course Picture by B.Pellom 54 / 65 1dim. Gaussian distribution 2015 Mikko Kurimo Speech recognition course Picture by B.Pellom 55 / 65 Multidim. Gaussian distribution 2015 Mikko Kurimo Speech recognition course Picture by B.Pellom 56 / 65 Diagonal Gaussian 2015 Mikko Kurimo Speech recognition course Picture by B.Pellom 57 / 65 Inverse of the covariance matrix 2015 Mikko Kurimo Speech recognition course Picture by B.Pellom 58 / 65 Multidim. diag. Gaussian 2015 Mikko Kurimo Speech recognition course Picture by B.Pellom 59 / 65 1dim. Gaussian mixture model 2015 Mikko Kurimo Speech recognition course Picture by B.Pellom 60 / 65 Gaussian mixture model GMM 2015 Mikko Kurimo Speech recognition course Picture by B.Pellom 61 / 65 Home exercise 1 Build a GMM classifier to classify speech features into phonemes ! Details, instructions and help given in the Friday meeting this week To be returned before the Friday meeting next week 2015 Mikko Kurimo Speech recognition course 62 / 65 Feedback Write down questions from today's lecture that troubled your mind Comments and suggestions are welcome, too. What was missing today, and what too much? Idea's in last year's feedback: Tutors to join the first project meetings Explain Cepstrum and MFCCs Show videos on ASR applications Publish slides before the meetings 2015 Mikko Kurimo Speech recognition course 63 / 65 Next meeting Fri 14.15 – 16 at Maari-C (Maarintalo building) Computers, speech data and support provided for practical experiments Organized by Peter Smit Get your AALTO account ready! Matlab is used, introductions exist at (www) Support for Home exercise 1 provided (!) All meetings required to pass the course − missed meeting compensated by an extra exercise 2015 Mikko Kurimo Speech recognition course 64 / 65 Summary of today Course organization, project works History What is ASR? Speech signal Acoustic features Gaussian mixture model on Friday: Classification Next week: Phonemes and hidden Markov models 2015 Mikko Kurimo Speech recognition course 65 / 65
© Copyright 2025 Paperzz