Speech recognition

Speech recognition

For graduate and post-graduate students

Home page: (Note code change: S89.5150 => ELEC-E5510)
https://mycourses.aalto.fi/course/view.php?id=5180

Registration at Oodi

Mikko Kurimo and Peter Smit

+ project work support from Aalto University's speech
recognition research group
Goals

Become familiar with current speech recognition methods and
applications

Learn the structure of a typical speech recognition system

Learn to construct one in practice
Discussion: What is your goal – why are you here?
2015 Mikko Kurimo
Speech recognition course
2 / 65
Content today
1.General organization of the course
2.What is automatic speech recognition?
3.Speech as an acoustic signal
4.Home exercise 1:
−
Build a system to classify speech features into phonemes
2015 Mikko Kurimo
Speech recognition course
3 / 65
Content
Week 1
Features
Week 2
Phonemes
Week 5
Speech
recognition
Week 3
Words
Week 3
Language
applications
Week 4
ASR system, decoder
2015 Mikko Kurimo
Speech recognition course
4 / 65
Meetings


14 meetings during 7 weeks in Nov - Dec:
−
Wed 10.15-12 at I346
−
Fri 14.15-16 at Maari-C (Until Nov 21)
−
Fri 14.15-16 at I346 (After Nov 22)
Contents:
−
7 general lectures (main ideas, no details)
−
4 meetings for practical work with computers
−
3 seminar meetings for talks on special topics
2015 Mikko Kurimo
Speech recognition course
5 / 65
Course Format
"Hour pie": 130h
meetings
exercises
project
"Grade pie": 1..5
meetings
exercises
project
Meetings on
Project work
+ every active
Wed and Fri
(in groups)
participation + group grade
+ studying the
+ personal grade
course material
2015 Mikko Kurimo
+ for every
(normalized
4 Home
correct answer peer review)
exercises
+ sum is graded
Speech recognition course
6 / 65
Timeline in the course
Meetings
Wednesdays
Home exercises
Project work
status
Fridays
Week1 Speech signal
Classification
Feature classifier
Selection of topic
Week2 Phoneme modeling
Recognition
Word recognizer
Literature study
Week3 Lexicon and language
Language model Text predictor
Analysis
Week4 Continuous speech
LVCSR
Experimentation
Week5 Adaptation
Applications
Preparing reports
Week6 Projects1
Projects2
Presentations
Week7 Projects3
Conclusion
Presentations
2015 Mikko Kurimo
Speech recognizer
Speech recognition course
7 / 65
Timeline in the course
Meetings
Wednesdays
Home exercises
Project work
status
Fridays
Week1 Speech signal
Classification
Feature classifier
Selection of topic
Week2 Phoneme modeling
Recognition
Word recognizer
Literature study
Week3 Lexicon and language
Language model Text predictor
Analysis
Week4 Continuous speech
LVCSR
Experimentation
Week5 Adaptation
Applications
Preparing reports
Week6 Projects1
Projects2
Presentations
Week7 Projects3
Conclusion
Presentations
2015 Mikko Kurimo
Speech recognizer
Speech recognition course
8 / 65
Timeline in the course
Meetings
Wednesdays
Home exercises
Project work
status
Fridays
Week1 Speech signal
Classification
Feature classifier
Selection of topic
Week2 Phoneme modeling
Recognition
Word recognizer
Literature study
Week3 Lexicon and language
Language model Text predictor
Analysis
Week4 Continuous speech
LVCSR
Experimentation
Week5 Adaptation
Applications
Preparing reports
Week6 Projects1
Projects2
Presentations
Week7 Projects3
Conclusion
Presentations
2015 Mikko Kurimo
Speech recognizer
Speech recognition course
9 / 65
Useful skills

linear algebra (basic matrix operations)

probability and statistics

signal processing and natural language processing


programming
− perl/python for text processing
− shell scripts, C/C++ for running/modifying programs
− matlab
familiarity with Linux
2015 Mikko Kurimo
Speech recognition course
10 / 65
Test of your skill level

You will now see examples when certain skills are required

Raise your hand, if you are familiar with these
2015 Mikko Kurimo
Speech recognition course
11 / 65
Useful skills - 1

linear algebra (basic matrix operations)
−
multiplication, determinant, transpose, inverse
2015 Mikko Kurimo
Speech recognition course
12 / 65
Useful skills - 2

probability and statistics
2015 Mikko Kurimo
Speech recognition course
13 / 65
Useful skills – 3

signal processing and natural language processing

Examples:
−
Spectrum and spectrogram of a signal
−
count the frequency of all word pairs in text
2015 Mikko Kurimo
Speech recognition course
14 / 65
Useful skills - 4


programming
−
Matlab
−
Linux
−
shell scripts, C/C++, perl/python
example tasks:
−
Use Matlab toolboxes to compute a spectrum
−
Run programs in Linux and store their output in a file
−
Make a script to run commands many times in loop using
increasing parameter values
−
Make a simple program to compute the error rate between the
speech recognition result (a string) and the reference text
2015 Mikko Kurimo
Speech recognition course
15 / 65
Useful skills

linear algebra (basic matrix operations)

probability and statistics

signal processing and natural language processing


programming
− perl/python for text processing
− shell scripts, C/C++ for running/modifying programs
− matlab
familiarity with Linux
2015 Mikko Kurimo
Speech recognition course
16 / 65
Text book



2015 Mikko Kurimo
You may survive without one, but
this is recommended
Huang, Acero: Spoken Language
Processing
Prentice Hall, 2001 ISBN: 0-13022616-5
Speech recognition course
17 / 65
Other useful text books
2015 Mikko Kurimo
Speech recognition course
18 / 65
Some useful online resources




Gales, Young: HMMs applications in ASR (book):
http://dx.doi.org/10.1561/2000000004
Cambridge: HTK Book (detailed manual):
http://htk.eng.cam.ac.uk/docs/docs.shtml
Matlab primer:
https://www.mathworks.com/help/pdf_doc/matlab/getstart.pdf
Slides from MIT open course: 6.345 ASR (2003)
http://ocw.mit.edu/courses/electrical-engineering-and-computerscience/6-345-automatic-speech-recognition-spring-2003/
2015 Mikko Kurimo
Speech recognition course
19 / 65
Useful software


Software used in this course:
−
Matlab, GMMBayes toolbox
−
Cambridge HMM toolkit (HTK)
−
SRI language modeling toolkit (SRILM)
Other useful software for ASR:
−
Kaldi, Aachen, KTH Snack, OGI speech, Nagoya's Julius
−
University of Colorado SONIC ASR, CMU Sphinx-II ASR
−
AaltoASR tools, Aalto Morfessor tools
−
NIST ASR scoring utilities
−
CMU / Cambridge language model toolkit
2015 Mikko Kurimo
Speech recognition course
20 / 65
Project work receipt
1.Form a group (3 persons)
Done
already?
2.Get a topic
3.Get reading material from mycourses or your group tutor
4.Schedule a meeting with your tutor (before next Wed)
5.1st meeting: Specify the topic and write a work plan (DL 2nd Wed)
6.Make a brief literature study (DL 3rd Wed)
7.Perform analysis, experiments, and write a report
8.Book your presentation time in the Week 6 or 7 (DL 5th Wed)
9.Prepare and keep your 30 min presentation
10.Return the report (DL 7th Fri)
2015 Mikko Kurimo
Speech recognition course
This
wee
21 / 65
Project topics & tutors
•
•
•
•
•
4.Language model adaptation
(Andre)
8.Confidence measure for ASR
(Seppo)
9.ASR for dialect speech (Peter)
13.Speech recognition in noise
(Sami)
14.Voice activity detection (Seppo)
2015 Mikko Kurimo
Speech recognition course
•
•
•
•
•
16.HMMs for speech synthesis
(Reima)
19.Comparing sub-word
language models (Matti)
21.Animation of HMMs (Mikko)
22.DNNs for acoustic modeling
(Peter)
23.Topic modeling (Akmal)
22 / 65
Content today
1.General organization of the course
2.What is automatic speech recognition?
3.Speech as an acoustic signal
4.Home exercise 1:
−
Build a system to classify speech features into phonemes
2015 Mikko Kurimo
Speech recognition course
23 / 65
All these are needed to build ASR systems!
Prob.
Comp.
Data
2015 Mikko Kurimo
Speech recognition course
24 / 65
Milestones for ASR systems



1952 Bell Labs Digit Recognizer
1976 CMU Harpy 1000-word connected recognizer with
constrained grammar
1980 TKK: 1000-word LSM recognizer (separate words w/o
grammar)

1988 TKK: phonetic typewriter

1993 Read texts (WSJ news)

1998 Broadcast news, telephone conversations

1998 Speech retrieval from broadcast news
2015 Mikko Kurimo
Speech recognition course
25 / 65
Milestones for ASR systems (2)


2002 Rich transcription of meetings
2004 TKK: Finnish online dictation, almost unlimited vocabulary
based on morphemes

2006 Machine translation of broadcast speech

2006 Voice interface in Windows Vista video

2008 Google voice search


2009 Aalto: Speaker adaptation for a synthesized voice by
speech recognition in another language video
2011 Siri voice assistant
2015 Mikko Kurimo
Speech recognition course
26 / 65
ASR tasks and solutions

Speaking environment and microphone
−
−
−
Office, headset or close-talking
Telephone speech, mobile
Noise, outside, microphone far away

Style of speaking

Speaker modeling
2015 Mikko Kurimo
Speech recognition course
27 / 65
ASR tasks and solutions

Speaking environment and microphone

Style of speaking
−
−
−
−
−

Isolated words
Connected words, small vocabulary
Word spotting in fluent speech
Continuous speech, large vocabulary
Spontaneous speech, ungrammatical
Speaker modeling
2015 Mikko Kurimo
Speech recognition course
28 / 65
ASR tasks and solutions

Speaking environment and microphone

Style of speaking

Speaker modeling
−
−
−
Speaker-dependent models
Speaker-independent, average speaker models
Speaker adaptation
2015 Mikko Kurimo
Speech recognition course
29 / 65
Content today
1.General organization of the course
2.What is automatic speech recognition?
3.Speech as an acoustic signal
4.Home exercise 1:
−
Build a system to classify speech features into phonemes
2015 Mikko Kurimo
Speech recognition course
30 / 65
What is speech recognition?




Find the most likely word or word sequence given the
acoustic signal and our models!
Language model defines words and how likely they occur
together
Lexicon defines how words are formed from sound units
Acoustic model defines the sound units independent of
speaker and recording conditions
2015 Mikko Kurimo
Speech recognition course
31 / 65
What is speech recognition?




Find the most likely word or word sequence given the
acoustic signal and our models!
Language model defines words and how likely they occur
together
Lexicon defines how words are formed from sound units
Acoustic model defines the sound units independent of
speaker and recording conditions
2015 Mikko Kurimo
Speech recognition course
32 / 65
Content today
1.General organization of the course
2.What is automatic speech recognition?
3.Speech as an acoustic signal
4.Home exercise 1:
−
Build a system to classify speech features into phonemes
2015 Mikko Kurimo
Speech recognition course
33 / 65
Speech recording
2015 Mikko Kurimo
Speech recognition course
Picture by B.Pellom
34 / 65
A sample of speech waveform
2015 Mikko Kurimo
Speech recognition course
35 / 65
Other words, other speakers
Very difficult to
recognize speech
from the picture...
2015 Mikko Kurimo
Speech recognition course
36 / 65
Modelling the speech signal
Discussion: What separates speech from all the other sounds
that the microphone has recorded?
−
computer noise, car noise, human movements, other sounds
from the mouth, ...
−
so, what is special in speech and common in all speaking
situations
2015 Mikko Kurimo
Speech recognition course
37 / 65
How to recognize speech?
A simple procedure:
−
Measure some characteristic features of the signal and
train statistical models for them
2015 Mikko Kurimo
Speech recognition course
38 / 65
How to recognize speech?
A simple procedure:
−
Measure some characteristic features of the signal and
train statistical models for them
Good features should be:
1.Compact
2.Discriminative for speech sounds
3.Fast to compute
4.Robust for noise
2015 Mikko Kurimo
Speech recognition course
39 / 65
Frequency analysis
Calculate the spectrum in short time intervals
2015 Mikko Kurimo
Speech recognition course
40 / 65
Frequency analysis
Calculate the spectrum in short time intervals
2015 Mikko Kurimo
Speech recognition course
41 / 65
Frequency analysis
Calculate the spectrum in short time intervals
2015 Mikko Kurimo
Speech recognition course
42 / 65
Computation of MFCC
2015 Mikko Kurimo
Speech recognition course
43 / 65
Picture by B.Pellom
Mel scale
Approximation of human
perception of speech
“Divide the frequency scale into
perceptually equal intervals”:
Linear below 1 kHz, logarithmic
above 1 kHz
2015 Mikko Kurimo
Speech recognition course
44 / 65
Cepstrum
Cepstrum is essentially “a spectrum of a spectrum”:
−
Analysis in frequency scale (vertical direction)
MFCC = Mel-Frequency Cepstral Coefficients
...
2015 Mikko Kurimo
Speech recognition course
45 / 65
Speech sample
1.1.Frames:
Frames:
short
short10ms
10ms
windows
windows
2.2.FFT:
FFT:
power
powerspectrum
spectrum
spectrogram
spectrogram
3.3.Filtering:
Filtering:
mel
melfilter
filter
motivated
motivatedby
by
human
humanear
ear
“essential
“essentialdata”
data
4.4.Features:
Features:
DCT
DCTtransform
transform
mel
cepstrum
mel cepstrum
MFCC
MFCC
-less
-lessfeatures
features
-less
correlation
-less correlatio
2015 Mikko Kurimo
Speech recognition course
46 / 65
The same 5 samples again
Very difficult to
recognize speech
from this picture...
2015 Mikko Kurimo
Speech recognition course
47 / 65
Power spectrogram

2015 Mikko Kurimo
Speech recognition course
Speech recognition
possible

Lot of data

Lot of redundancy

Lot of noise
48 / 65
Mel spectrogram

2015 Mikko Kurimo
Speech recognition course
Speech recognition
maybe easier?

10 x less data

Less redundancy

Less noise
49 / 65
Mel-frequency cepstral coefficients (MFCC)

Even more compact

Less correlation

Less noise?
2015 Mikko Kurimo
Speech recognition course
50 / 65
Background noise?
2015 Mikko Kurimo
Speech recognition course
51 / 65
To classify sounds by features?
Training
1. Extract MFCC from samples of each sound (e.g. phoneme)
2. Train a statistical model (mean and variance)
Testing
1. Record new samples and extract MFCC
2. Choose the best-matching model to be the class
2015 Mikko Kurimo
Speech recognition course
52 / 65
Classification by features



Gaussian mixture model GMM
train a statistical model (mean and variance) for samples of
each sound source
choose the best-matching statistical model to be the class of an
unknown sample
2015 Mikko Kurimo
Speech recognition course
53 / 65
Normal (Gaussian) distribution
2015 Mikko Kurimo
Speech recognition course
Picture by B.Pellom
54 / 65
1dim. Gaussian distribution
2015 Mikko Kurimo
Speech recognition course
Picture by B.Pellom
55 / 65
Multidim. Gaussian distribution
2015 Mikko Kurimo
Speech recognition course
Picture by B.Pellom
56 / 65
Diagonal Gaussian
2015 Mikko Kurimo
Speech recognition course
Picture by B.Pellom
57 / 65
Inverse of the covariance matrix
2015 Mikko Kurimo
Speech recognition course
Picture by B.Pellom
58 / 65
Multidim. diag. Gaussian
2015 Mikko Kurimo
Speech recognition course
Picture by B.Pellom
59 / 65
1dim. Gaussian mixture model
2015 Mikko Kurimo
Speech recognition course
Picture by B.Pellom
60 / 65
Gaussian mixture model GMM
2015 Mikko Kurimo
Speech recognition course
Picture by B.Pellom
61 / 65
Home exercise 1



Build a GMM classifier to classify speech features into
phonemes !
Details, instructions and help given in the Friday meeting this
week
To be returned before the Friday meeting next week
2015 Mikko Kurimo
Speech recognition course
62 / 65
Feedback

Write down questions from today's lecture that troubled your
mind

Comments and suggestions are welcome, too.

What was missing today, and what too much?
Idea's in last year's feedback:

Tutors to join the first project meetings

Explain Cepstrum and MFCCs

Show videos on ASR applications

Publish slides before the meetings
2015 Mikko Kurimo
Speech recognition course
63 / 65
Next meeting







Fri 14.15 – 16 at Maari-C (Maarintalo building)
Computers, speech data and support provided for practical
experiments
Organized by Peter Smit
Get your AALTO account ready!
Matlab is used, introductions exist at (www)
Support for Home exercise 1 provided (!)
All meetings required to pass the course
−
missed meeting compensated by an extra exercise
2015 Mikko Kurimo
Speech recognition course
64 / 65
Summary of today








Course organization, project works
History
What is ASR?
Speech signal
Acoustic features
Gaussian mixture model
on Friday: Classification
Next week: Phonemes and hidden Markov models
2015 Mikko Kurimo
Speech recognition course
65 / 65