
Statistical Models for Automatic Speech Recognition
Lukáš Burget
Feature extraction
• Preprocessing the speech signal to satisfy the needs of the following recognition process (dimensionality reduction, preserving only the “important” information, decorrelation).
• Popular features are MFCCs: modifications based on psycho-acoustic findings applied to short-time spectra.
• For convenience, we will use one-dimensional features in most of our examples (e.g. short-time energy).
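For instance, the log short-time energy used as a toy one-dimensional feature in later examples could be computed as in the following sketch (the 25 ms window / 10 ms shift at 16 kHz is an assumed, typical choice, not taken from the slides):

```python
import numpy as np

def short_time_energy(signal, frame_len=400, frame_shift=160):
    """Log short-time energy, one value per frame (assumed 25 ms / 10 ms framing at 16 kHz)."""
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    energy = np.empty(n_frames)
    for i in range(n_frames):
        frame = np.asarray(signal[i * frame_shift : i * frame_shift + frame_len], dtype=float)
        energy[i] = np.log(np.sum(frame ** 2) + 1e-10)  # small floor avoids log(0) on silent frames
    return energy
```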
Classifying single speech frame
[Figure: feature distributions for unvoiced and voiced frames]
Classifying single speech frame
[Figure: class-conditional feature distributions for unvoiced and voiced frames]
Mathematically, we ask the following question:

    $P(\text{voiced}|x) > P(\text{unvoiced}|x)$

But the value we read from the probability distribution is p(x|class). According to Bayes rule, the above can be rewritten as:

    $\frac{p(x|\text{voiced})P(\text{voiced})}{p(x)} > \frac{p(x|\text{unvoiced})P(\text{unvoiced})}{p(x)}$
Multi-class classification
The class with the highest probability of being correct is given by:

    $\arg\max_{\omega} P(\omega|x) = \arg\max_{\omega} p(x|\omega)P(\omega)$

[Figure: class-conditional distributions for silence, unvoiced, and voiced]
But we do not know the true distribution, …
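Still, the decision rule itself is simple; a minimal sketch, assuming for illustration that each class is modeled by a single known 1-D Gaussian with hand-picked (purely illustrative) parameters and priors:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical class-conditional Gaussians p(x|class) and priors P(class);
# the numbers are illustrative only, not estimated from real data.
CLASSES = {
    "silence":  {"mu": -2.0, "sigma": 0.7, "prior": 0.3},
    "unvoiced": {"mu":  0.0, "sigma": 1.0, "prior": 0.3},
    "voiced":   {"mu":  3.0, "sigma": 1.2, "prior": 0.4},
}

def classify_frame(x):
    """arg max_w p(x|w) P(w); p(x) is the same for every class, so it can be dropped."""
    return max(CLASSES,
               key=lambda w: norm.pdf(x, CLASSES[w]["mu"], CLASSES[w]["sigma"]) * CLASSES[w]["prior"])

print(classify_frame(2.4))   # most probable class for a frame with feature value 2.4
```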
Estimation of parameters
… we only see some training examples.
[Figure: training examples for the silence, unvoiced, and voiced classes]
Estimation of parameters
… we only see some training examples.
Let’s decide on some parametric model (e.g. a Gaussian distribution) and estimate its parameters from the data.
[Figure: Gaussians fitted to the training examples of silence, unvoiced, and voiced]
Maximum Likelihood Estimation
• In the next part, we will use ML estimation of model parameters:

    $\hat{\Theta}^{\text{class}}_{ML} = \arg\max_{\Theta} \prod_{\forall x_i \in \text{class}} p(x_i|\Theta)$

• This allows us to estimate the parameters Θ of each class individually, given the data for that class. Therefore, for convenience, we can omit the class identities in the following equations.
• The models we are going to examine are:
  – Single Gaussian
  – Gaussian Mixture Model (GMM)
  – Hidden Markov Model (HMM)
• We want to solve three fundamental problems:
  – Evaluation of the model (computing the likelihood of features given the model)
  – Training the model (finding ML estimates of parameters)
  – Finding the most likely values of hidden variables
Gaussian distribution (1 dimension)
Evaluation:

    $\mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

ML estimates of parameters (Training):

    $\mu = \frac{1}{T}\sum_{t=1}^{T} x(t) \qquad \sigma^2 = \frac{1}{T}\sum_{t=1}^{T}\left(x(t)-\mu\right)^2$

No hidden variables.
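The two operations above translate directly into numpy; a small sketch:

```python
import numpy as np

def gauss_eval(x, mu, var):
    """Evaluation: N(x; mu, sigma^2)."""
    return np.exp(-((x - mu) ** 2) / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def gauss_ml_train(x):
    """Training: ML estimates of mu and sigma^2 from frames x(1)..x(T)."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    var = np.mean((x - mu) ** 2)   # the ML estimate divides by T, not T-1
    return mu, var
```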
Gaussian distribution (2 dimensions)

    $\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{\sqrt{(2\pi)^P |\boldsymbol{\Sigma}|}}\, e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})}$
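The multi-dimensional evaluation, sketched with plain numpy (real systems usually work in the log domain for numerical stability):

```python
import numpy as np

def mvn_eval(x, mu, Sigma):
    """N(x; mu, Sigma) for a P-dimensional feature vector x."""
    x, mu, Sigma = np.asarray(x, float), np.asarray(mu, float), np.asarray(Sigma, float)
    P = mu.size
    diff = x - mu
    norm_const = np.sqrt((2.0 * np.pi) ** P * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm_const
```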
Gaussian Mixture Model
Evaluation:

    $p(x|\Theta) = \sum_c P_c\, \mathcal{N}(x; \mu_c, \sigma_c^2), \quad \text{where} \quad \sum_c P_c = 1$
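Evaluating this density is just a weighted sum of Gaussian evaluations; a sketch for the 1-D case:

```python
import numpy as np
from scipy.stats import norm

def gmm_eval(x, weights, means, variances):
    """p(x|Theta) = sum_c P_c N(x; mu_c, sigma_c^2), assuming sum_c P_c = 1."""
    return sum(P_c * norm.pdf(x, mu_c, np.sqrt(var_c))
               for P_c, mu_c, var_c in zip(weights, means, variances))
```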
Gaussian Mixture Model
Evaluation:

    $p(x|\Theta) = \sum_c P_c\, \mathcal{N}(x; \mu_c, \sigma_c^2)$

• We can see the sum above just as a function defining the shape of the probability density function,
• or we can see it as a more complicated generative probabilistic model, from which features are generated as follows:
  – One of the Gaussian components is first randomly selected according to the prior probabilities Pc.
  – A feature vector is then generated from the selected Gaussian distribution (a sampling sketch follows this slide).
• For the evaluation, however, we do not know which component generated the input vector (the identity of the component is a hidden variable). Therefore, we marginalize – sum over all the components, respecting their prior probabilities.
• Why we want to complicate our lives with this concept:
  – It allows us to apply the EM algorithm for GMM training.
  – We will need this concept for HMMs.
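A sketch of this two-step generative process for a 1-D GMM (component weights, means and variances are assumed to be given):

```python
import numpy as np

def gmm_sample(weights, means, variances, n_samples=1, rng=None):
    """Generate features from a 1-D GMM as described above:
    pick a component by its prior P_c (the hidden variable), then sample from it."""
    rng = np.random.default_rng() if rng is None else rng
    components = rng.choice(len(weights), size=n_samples, p=weights)
    return rng.normal(np.asarray(means)[components],
                      np.sqrt(np.asarray(variances))[components])
```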
Training GMM – Viterbi training
• An intuitive and approximate iterative algorithm for training GMM parameters (see the sketch after this slide).
• Using the current model parameters, let the Gaussians classify the data as if the Gaussians were different classes (even though both the data and all the components correspond to the one class modeled by the GMM).
• Re-estimate the parameters of each Gaussian using the data associated with it in the previous step.
• Repeat the previous two steps until the algorithm converges.
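A sketch of this procedure for a 1-D GMM; a fixed number of iterations stands in for a proper convergence check, and the component priors are used as class priors in the classification step:

```python
import numpy as np
from scipy.stats import norm

def gmm_viterbi_train(x, weights, means, variances, n_iter=10):
    """Hard-assignment ("Viterbi") training of a 1-D GMM, as outlined above."""
    x = np.asarray(x, dtype=float)
    weights, means, variances = (np.asarray(v, dtype=float).copy()
                                 for v in (weights, means, variances))
    for _ in range(n_iter):
        # 1. Let the Gaussians classify the frames as if they were separate classes.
        scores = np.stack([P_c * norm.pdf(x, mu_c, np.sqrt(var_c))
                           for P_c, mu_c, var_c in zip(weights, means, variances)])
        labels = scores.argmax(axis=0)
        # 2. Re-estimate each Gaussian from the frames assigned to it.
        for c in range(len(weights)):
            xc = x[labels == c]
            if xc.size == 0:
                continue                     # keep old parameters for empty components
            weights[c] = xc.size / x.size
            means[c] = xc.mean()
            variances[c] = np.mean((xc - means[c]) ** 2)
    return weights, means, variances
```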
Training GMM – EM algorithm
• Expectation Maximization is a very general tool applicable in many cases where we deal with unobserved (hidden) data.
• Here, we only see the result of its application to the problem of re-estimating the parameters of a GMM.
• It is guaranteed to increase the likelihood of the training data in every iteration; however, it does not guarantee finding the global optimum.
• The algorithm is very similar to the Viterbi training presented above. Only instead of hard decisions, it uses “soft” posterior probabilities of the Gaussians (given the old model) as weights, and weighted averages are used to compute the new mean and variance estimates:

    $\hat{\mu}_c^{(new)} = \frac{\sum_{t=1}^{T}\gamma_c(t)\,x(t)}{\sum_{t=1}^{T}\gamma_c(t)} \qquad \hat{\sigma}_c^{2\,(new)} = \frac{\sum_{t=1}^{T}\gamma_c(t)\left(x(t)-\hat{\mu}_c^{(new)}\right)^2}{\sum_{t=1}^{T}\gamma_c(t)}$

    $\text{where} \quad \gamma_c(t) = \frac{P_c\,\mathcal{N}(x(t);\, \hat{\mu}_c^{(old)},\, \hat{\sigma}_c^{2\,(old)})}{\sum_c P_c\,\mathcal{N}(x(t);\, \hat{\mu}_c^{(old)},\, \hat{\sigma}_c^{2\,(old)})}$
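The same updates written out in numpy; the component-weight re-estimation is the standard EM step, even though the slide only shows the mean and variance formulas:

```python
import numpy as np
from scipy.stats import norm

def gmm_em_step(x, weights, means, variances):
    """One EM iteration for a 1-D GMM using the update formulas above."""
    x = np.asarray(x, dtype=float)
    # E-step: "soft" component occupancies gamma_c(t) given the old model
    joint = np.stack([P_c * norm.pdf(x, mu_c, np.sqrt(var_c))
                      for P_c, mu_c, var_c in zip(weights, means, variances)])   # shape (C, T)
    gamma = joint / joint.sum(axis=0, keepdims=True)
    # M-step: weighted averages
    counts = gamma.sum(axis=1)                                   # sum_t gamma_c(t)
    new_means = gamma @ x / counts
    new_vars = (gamma * (x[None, :] - new_means[:, None]) ** 2).sum(axis=1) / counts
    new_weights = counts / x.size        # standard EM weight update (not written on the slide)
    return new_weights, new_means, new_vars
```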
Classifying stationary sequence
[Figure: sequence of feature frames classified among silence, unvoiced, and voiced]
Frame independence assumption:

    $P(X|\text{class}) = \prod_{\forall x_i \in X} p(x_i|\text{class})$
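Under this assumption, the sequence likelihood is a product over frames, which in practice is accumulated in the log domain; a short sketch:

```python
import numpy as np

def sequence_log_likelihood(X, frame_density):
    """log P(X|class) = sum_i log p(x_i|class), under the frame-independence assumption.
    frame_density(x) can be any per-frame model, e.g. the gauss_eval or gmm_eval sketches above."""
    return float(np.sum([np.log(frame_density(x)) for x in X]))
```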
Modeling more general sequences:
Hidden Markov Models
[Figure: three-state left-to-right HMM with transition probabilities a11, a12, a22, a23, a33, a34 and output distributions b1(x), b2(x), b3(x)]
• Generative model: for each frame, the model moves from one state to another according to a transition probability a_ij and generates a feature vector from the probability distribution b_j(·) associated with the state that was entered.
• To evaluate such a model, we do not see which path through the states was taken.
• Let’s start with evaluating the HMM for a particular state sequence.
Evaluating HMM for a particular state sequence
[Figure: frames x1 … x5 aligned to one path through the three-state HMM]

    $P(X, S|\Theta) = b_1(x_1)\,a_{11}\,b_1(x_2)\,a_{12}\,b_2(x_3)\,a_{23}\,b_3(x_4)\,a_{33}\,b_3(x_5)\,a_{34}$
Evaluating HMM for a particular state sequence
The joint likelihood of the observed sequence X and the state sequence S

    $P(X, S|\Theta) = b_1(x_1)\,a_{11}\,b_1(x_2)\,a_{12}\,b_2(x_3)\,a_{23}\,b_3(x_4)\,a_{33}\,b_3(x_5)\,a_{34}$

can be decomposed as follows:

    $P(X, S|\Theta) = P(X|S, \Theta)\,P(S|\Theta)$

where

    $P(S|\Theta) = a_{11}\,a_{12}\,a_{23}\,a_{33}\,a_{34}$

is the prior probability of the hidden variable – the state sequence S. For the GMM, the corresponding term was $P_c$.

    $P(X|S, \Theta) = b_1(x_1)\,b_1(x_2)\,b_2(x_3)\,b_3(x_4)\,b_3(x_5)$

is the likelihood of the observed sequence X given the state sequence S. For the GMM, the corresponding term was $\mathcal{N}(x; \mu_c, \sigma_c^2)$.
Evaluating HMM (for any state sequence)
Since we do not know the underlying state sequence, we must marginalize – compute and sum likelihoods over all the possible paths:

    $P(X|\Theta) = \sum_S P(X, S|\Theta)$
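The sum over all paths is computed efficiently by the forward algorithm; a sketch, assuming an initial state distribution pi, a transition matrix A and per-state output densities b (in practice the recursion is scaled or kept in the log domain to avoid underflow):

```python
import numpy as np

def hmm_forward(X, pi, A, b):
    """Forward algorithm: computes P(X|Theta) = sum_S P(X,S|Theta).
    pi[j]  -- probability of starting in state j (the slides use a fixed entry state)
    A[i,j] -- transition probability a_ij
    b[j]   -- callable returning the output density b_j(x)"""
    pi, A = np.asarray(pi, float), np.asarray(A, float)
    alpha = pi * np.array([bj(X[0]) for bj in b])            # alpha_j(1) = pi_j b_j(x_1)
    for x in X[1:]:
        alpha = (alpha @ A) * np.array([bj(x) for bj in b])  # alpha_j(t) = (sum_i alpha_i(t-1) a_ij) b_j(x_t)
    return alpha.sum()                                       # P(X|Theta)
```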
Finding the best (Viterbi) path

    $P(X|\Theta) = \max_S P(X, S|\Theta)$
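The best path is found by the Viterbi algorithm: the same recursion as the forward pass, with the sum replaced by a max and back-pointers kept for backtracking. A sketch under the same assumptions as hmm_forward above:

```python
import numpy as np

def hmm_viterbi(X, pi, A, b):
    """Returns (max_S P(X,S|Theta), best state sequence); same interface as hmm_forward."""
    pi, A = np.asarray(pi, float), np.asarray(A, float)
    T, n_states = len(X), len(pi)
    delta = pi * np.array([bj(X[0]) for bj in b])
    back = np.zeros((T, n_states), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] * A                      # scores[i, j] = delta_i(t-1) a_ij
        back[t] = scores.argmax(axis=0)
        delta = scores[back[t], np.arange(n_states)] * np.array([bj(X[t]) for bj in b])
    best = int(delta.argmax())
    path = [best]
    for t in range(T - 1, 0, -1):                        # backtrack along stored predecessors
        path.append(int(back[t, path[-1]]))
    return float(delta[best]), path[::-1]
```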
Training HMMs – Viterbi training
• Similar to the approximate training we have already seen for GMMs:
  1. For each training utterance, find the Viterbi path through the HMM, which associates feature frames with states.
  2. Re-estimate the state distributions using the associated feature frames.
  3. Repeat steps 1 and 2 until the algorithm converges.
Training HMMs using EM

    $\hat{\mu}_s^{(new)} = \frac{\sum_{t=1}^{T}\gamma_s(t)\,x(t)}{\sum_{t=1}^{T}\gamma_s(t)} \qquad \hat{\sigma}_s^{2\,(new)} = \frac{\sum_{t=1}^{T}\gamma_s(t)\left(x(t)-\hat{\mu}_s^{(new)}\right)^2}{\sum_{t=1}^{T}\gamma_s(t)}$

    $\text{where} \quad \gamma_s(t) = \frac{\alpha_s(t)\,\beta_s(t)}{P(X|\Theta^{(old)})}$

[Figure: trellis of states s versus frames t, illustrating the forward probability α_s(t) and the backward probability β_s(t)]
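A sketch of how the γ_s(t) above can be obtained with a forward and a backward pass (no explicit exit probabilities and no scaling, for brevity); the new state means and variances are then the γ-weighted averages from the formulas above:

```python
import numpy as np

def hmm_posteriors(X, pi, A, b):
    """State occupation posteriors gamma_s(t) = alpha_s(t) beta_s(t) / P(X|Theta),
    computed by a forward and a backward pass (same interface as hmm_forward above)."""
    pi, A = np.asarray(pi, float), np.asarray(A, float)
    T, n_states = len(X), len(pi)
    B = np.array([[bj(x) for bj in b] for x in X])       # B[t, j] = b_j(x_t)
    alpha = np.zeros((T, n_states))
    beta = np.zeros((T, n_states))
    alpha[0] = pi * B[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[t + 1] * beta[t + 1])
    return alpha * beta / alpha[-1].sum()                # gamma[t, s]; alpha[-1].sum() = P(X|Theta)
```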
Isolated word recognition
[Figure: two HMMs, one for the word YES and one for the word NO]

    $p(X|\text{YES})\,P(\text{YES}) > p(X|\text{NO})\,P(\text{NO})$
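Putting the pieces together, this decision could be sketched as follows (the word HMMs and priors are assumed to be available; the likelihoods can come from the hmm_forward sketch):

```python
def recognize_isolated_word(X, word_likelihoods, word_priors):
    """arg max_W p(X|W) P(W).
    word_likelihoods[W](X) returns p(X|W), e.g. computed by the forward algorithm;
    word_priors[W] is the prior P(W)."""
    return max(word_likelihoods,
               key=lambda W: word_likelihoods[W](X) * word_priors[W])
```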
Connected word recognition
[Figure: looped network of the word models YES and NO with silence (sil) models at the start and end]
Phoneme based models
[Figure: the word YES built by concatenating phoneme models y, eh, s]
Using Language model – unigram
[Figure: decoding network looping through a silence (sil) model and the phoneme models for the words one (w ah n), two (t uw), and three (th r iy), with unigram probabilities P(one), P(two), P(three) on the word entries]
Using Language model – bigram
[Figure: decoding network where the transitions between the word models one, two, three, and sil carry bigram probabilities P(W2|W1)]
Other basic ASR topics not covered by this presentation
• Context dependent models
• Training phoneme based models
• Feature extraction
– Delta parameters
– De-correlation of features
• Full-covariance vs. diagonal cov. modeling
• Adaptation to speaker or acoustic condition
• Language Modeling
– LM smoothing (back-off)
• Discriminative training (MMI or MPE)
• and so on