
Deep Learning Based on Manhattan Update Rule
Yasser Hifny
University of Helwan, Egypt
Abstract
Acoustic models based on Deep Neural Networks (DNNs) lead to significant improvements in recognition accuracy. In these methods, Hidden Markov Model (HMM) state scores are computed using flexible discriminant DNNs. Training DNNs is computationally expensive, and efficient training of DNNs is an active area of research. Similar to HMMs, Deep Conditional Random Fields (DCRFs) use DNNs to compute state scores. In this paper, we present a method to estimate DCRFs using the Manhattan (MH) update rule. The Manhattan update rule does not involve the gradient magnitude. The Manhattan update method is general and can be used to estimate any model for which the gradient can be computed.
1. Introduction
DNNs with many hidden layers have been shown to outperform Gaussian mixture models in several tasks (Mohamed et al., 2012; Seide et al., 2011; Dahl et al., 2012; Hinton et al., 2012). Training DNNs is computationally expensive, and efficient training of DNNs is an active area of research (Kingsbury et al., 2012; Vinyals & Povey, 2012). Training CRFs on top of a hidden layer constructed from scoring a large number of sigmoid functions was introduced in (Prabhavalkar & Fosler-Lussier, 2010). DNNs can be used to compute the state scores within the CRF framework. Hence, this improvement leads to a deep version of CRFs (DCRFs) (Hifny, 2013). Other deep alternatives were developed in (Yu & Deng, 2010; Mohamed et al., 2010; Do & Artieres, 2010; Fujii et al., 2012).
In this paper, we present a method to estimate DCRFs
using the Manhattan (MH) update rule. The Manhattan update rule does not involve the gradient magnitude. It was shown in (Hifny, 2006) that this algorithm
outperforms other gradient based algorithms for CRF
parameter estimation.
In Section 2, a mathematical formulation of DCRFs is
described. The optimization problem of DCRFs is addressed in Section 3. Section 4 gives experimental results on a phone recognition task. Finally, a summary
of the presented work is given in the conclusions.
2. Deep Conditional Random Fields
Linear chain CRFs can be thought of as the undirected graphical twins of HMMs, regardless of their training (generative or discriminative) (Lafferty et al., 2001). DCRF acoustic models are a particular implementation of linear chain CRFs where the state scores are computed using a DNN that has many hidden layers. The feed-forward phase updates the output value of each neuron. Starting from the first hidden layer, each neuron output is computed as a weighted sum of its inputs followed by the sigmoid function:¹
$$o_{tj}^{h} = \mathrm{sigm}\Bigl(\sum_{i=1}^{n} \lambda_{ij}\, o_{ti}^{h-1}\Bigr) \quad (1)$$
where o^h_{tj} is the output of a hidden-layer neuron, n is the number of inputs, h is the index of a hidden layer, and the sigmoid function is computed as follows:

$$\mathrm{sigm}(x) = \frac{1}{1 + e^{-x}} \quad (2)$$
The output of a hidden layer is passed to the next layer until the output layer is computed as follows:

$$o_{tj}^{N} = \sum_{i=1}^{n} \lambda_{ij}\, o_{ti}^{N-1} \quad (3)$$
where N is the index of the output layer. Hence, the activation of the hidden layers is nonlinear, based on a sigmoid function, and the output layer activation is linear.

¹The bias term is not implemented in the neuron activations.
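As a concrete illustration of Equations (1)-(3), the following is a minimal NumPy sketch of the feed-forward phase (sigmoid hidden layers followed by a linear output layer, without bias terms). The layer sizes, the random initialization, and the choice of 144 output states (48 phones x 3 states each) are illustrative assumptions, not necessarily the exact setup used in this paper.

```python
import numpy as np

def sigm(x):
    # Equation (2): logistic sigmoid
    return 1.0 / (1.0 + np.exp(-x))

def forward(o0, weights):
    """Feed-forward phase without bias terms.

    o0      : input frame vector, shape (n,)
    weights : list of weight matrices; all but the last one feed
              sigmoid hidden layers, the last layer is linear.
    Returns the linear output activations o^N of Equation (3).
    """
    o = o0
    for lam in weights[:-1]:
        # Equation (1): o^h_tj = sigm(sum_i lambda_ij * o^{h-1}_ti)
        o = sigm(o @ lam)
    return o @ weights[-1]  # linear output layer, Equation (3)

# Illustrative shapes: a 351-dim input frame, two hidden layers of 512
# units, and one output score per DCRF state (e.g. 48 phones x 3 states).
rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.01, size=shape)
           for shape in [(351, 512), (512, 512), (512, 144)]]
state_scores = forward(rng.normal(size=351), weights)
```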
Figure 1. DCRF model for phone representation, where the state scores are computed from a DNN.

A graphical representation of the DCRF acoustic model is shown in Figure 1. The conditional distribution defining DCRFs is given by

$$P_\Lambda(S \mid O) = \frac{1}{Z_\Lambda(O)} \prod_{t=1}^{T} \exp\bigl(\lambda_{s_t s_{t-1}}\, a(s_t, s_{t-1}) + b_{s_t}(o_t)\bigr) \quad (4)$$

where

• P_Λ(S|O) obeys the Markovian property: P_Λ(s_t | s_{t−1}, s_{t−2}, ..., s_1, O) = P_Λ(s_t | s_{t−1}, O).

• λ_{s_t s_{t−1}} are associated with the characterizing function a(s_t, s_{t−1}). a(s_t, s_{t−1}) is a binary function and can be used to define the DCRF topology.

• b_{s_t}(o_t) = o^N_{t s_t} is computed from Equation (3). Hence, b_{s_t}(o_t) connects the DNN output to the CRF input.

• Z_Λ(O) (Zustandssumme) is a normalization coefficient referred to as the partition function.

HMMs and DCRFs (in general, linear chain CRFs) share the first-order Markov assumption, which simplifies the training and decoding algorithms. However, DCRFs do not assume observation independence and causality, as the joint event in this case is factorized as a simple product of exponential functions. Therefore, the observations and the characterizing functions can be statistically dependent or correlated and can depend on the past and future acoustic context. The partition function, Z_Λ(O), is given by

$$Z_\Lambda(O) = \sum_{S} \prod_{t=1}^{T} \exp\bigl(\lambda_{s_t s_{t-1}}\, a(s_t, s_{t-1}) + b_{s_t}(o_t)\bigr), \quad (5)$$

and it is similar to the total probability p(O|M) in HMMs, which can be calculated using the forward algorithm (Lafferty et al., 2001).

3. DCRF Optimization

For R training observations {O_1, O_2, ..., O_r, ..., O_R} with corresponding transcriptions {W_r}, DCRFs are trained using the conditional maximum likelihood (CML) criterion to maximize the posterior probability of the correct word sequence given the acoustic observations:

$$F_{\mathrm{CML}}(\Lambda) = \sum_{r=1}^{R} \log P_\Lambda(M_{W_r} \mid O_r)
= \sum_{r=1}^{R} \log \frac{P(W_r) \sum_{S|W_r} \exp \sum_{t}^{T} \Psi(O, S, c, \Lambda)}{\sum_{\hat{W}} P(\hat{W}) \sum_{S|\hat{W}} \exp \sum_{t}^{T} \Psi(O, S, c, \Lambda)}
\approx \sum_{r=1}^{R} \log Z_\Lambda(O_r \mid M_{num}) - \log Z_\Lambda(O_r \mid M_{den}), \quad (6)$$

where

$$\Psi(O, S, c, \Lambda) = \lambda_{s_t s_{t-1}}\, a(s_t, s_{t-1}) + b_{s_t}(o_t). \quad (7)$$

The optimal parameters, Λ*, are estimated by maximizing the CML criterion, which implies minimizing the cross entropy between the correct transcription model and the hypothesized recognition model. In other words, the process maximizes the partition function of the correct models² (the numerator term), Z_Λ(O_r|M_num), and simultaneously minimizes the partition function of the recognition model (the denominator term), Z_Λ(O_r|M_den). The optimal parameters are obtained when the gradient of the CML criterion is zero.
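To make the role of the two partition functions concrete, the following sketch computes log Z_Λ(O|M) of Equation (5) with a forward recursion in the log domain, analogous to the HMM forward algorithm. The inputs (per-frame DNN state scores and log-domain transition weights with disallowed transitions set to −∞) and the handling of the first frame are assumptions of this sketch, not necessarily the implementation used in this work.

```python
import numpy as np
from scipy.special import logsumexp

def log_partition(b, lam_trans):
    """Forward recursion for log Z_Lambda(O) of a linear-chain model.

    b         : (T, J) array, b[t, j] = b_{s_t=j}(o_t), the DNN state scores.
    lam_trans : (J, J) array, lam_trans[i, j] = transition weight for
                state i -> state j (use -inf where a(s_t, s_{t-1}) = 0).
    """
    T, J = b.shape
    log_alpha = b[0].copy()  # first frame: state scores only (assumption)
    for t in range(1, T):
        # alpha_j(t) = sum_i alpha_i(t-1) * exp(lambda_ij) * exp(b_j(o_t))
        log_alpha = logsumexp(log_alpha[:, None] + lam_trans, axis=0) + b[t]
    return logsumexp(log_alpha)  # log Z_Lambda(O|M)

# Restricting the summation to state sequences consistent with the correct
# transcription gives log Z_Lambda(O|M_num); summing over all sequences of
# the recognition model gives log Z_Lambda(O|M_den).
```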
3.1. Numerical Optimization for DCRFs
Newton's method can be used to estimate DCRFs based on a local quadratic approximation of the CML objective function. These methods expand the nonlinear objective function F_CML(Λ + δ) in a Taylor series around the current model point Λ in parameter space:

$$F_{\mathrm{CML}}(\Lambda + \delta) \approx F_{\mathrm{CML}}(\Lambda) + \delta^{T} g(\Lambda) + \frac{1}{2}\, \delta^{T} H(\Lambda)\, \delta + \ldots \quad (8)$$
²Since a summation over potential functions is commonly called the partition function in undirected graphical modeling, we coin the notation Z_Λ(O_r|M_num) for the summation over all possible state sequences of the correct models.
where g(Λ) is the local gradient vector, defined by

$$g(\Lambda) = \left.\frac{\partial F_{\mathrm{CML}}(\Lambda)}{\partial \lambda_i}\right|_{\Lambda} \quad (9)$$

and H(Λ) is the local Hessian matrix, defined by

$$H_{ij}(\Lambda) \equiv \left.\frac{\partial^2 F_{\mathrm{CML}}(\Lambda)}{\partial \lambda_i\, \partial \lambda_j}\right|_{\Lambda} \quad (10)$$

Newton's method update rule is given by

$$\lambda^{(\tau)} = \lambda^{(\tau-1)} - \eta^{(\tau)} H^{-1}(\Lambda)\, g(\Lambda) \quad (11)$$
Since CML is not a quadratic function, taking the full Newton step H^{-1}(Λ)g(Λ) may lead to an overshoot of the maximum. Hence, η^{(τ)} ≠ 1 leads to a damped Newton step. A line search algorithm is used to calculate η^{(τ)}. A line search works by evaluating the objective function, starting from the current model, along the search direction and choosing an η^{(τ)} that leads to an increase of the CML objective function.
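A minimal sketch of such a damped Newton step with a backtracking line search is shown below, assuming callables for the objective, gradient, and Hessian; the halving schedule and stopping threshold are arbitrary illustrative choices.

```python
import numpy as np

def damped_newton_step(lam, f_fn, grad_fn, hess_fn, eta0=1.0, shrink=0.5):
    """One damped Newton step for maximizing the CML objective.

    Halves the step size eta until the objective increases
    (a very simple backtracking line search)."""
    g, H = grad_fn(lam), hess_fn(lam)
    step = np.linalg.solve(H, g)      # full Newton step H^{-1} g
    f0, eta = f_fn(lam), eta0
    while eta > 1e-8 and f_fn(lam - eta * step) <= f0:
        eta *= shrink                 # damp the step (eta != 1)
    return lam - eta * step
```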
The calculation, inversion, and storage of the Hessian matrix make Newton's method useful only for small-scale problems. Quasi-Newton, or variable metric, methods can be used when it is impractical to evaluate the Hessian matrix. Instead of obtaining an estimate of the Hessian matrix at a single point, these methods gradually build up an approximate Hessian matrix using gradient information from some or all of the previous iterates visited by the algorithm. Limited-memory quasi-Newton methods such as L-BFGS are particular realizations of quasi-Newton methods that cut down the storage for large problems (Nocedal & Wright, 1999).
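For illustration, a generic way to apply L-BFGS in this setting is through scipy.optimize.minimize on the negative CML objective; the objective below is only a runnable placeholder standing in for the accumulation described in Section 3.3, and is not the setup used in this paper.

```python
import numpy as np
from scipy.optimize import minimize

def neg_fcml_and_grad(lam_flat):
    # Placeholder objective so the sketch runs; in practice this would
    # return -F_CML(Lambda) and its gradient over the training data.
    return 0.5 * float(lam_flat @ lam_flat), lam_flat

lam0 = np.ones(10)
res = minimize(neg_fcml_and_grad, lam0, jac=True, method="L-BFGS-B")
print(res.x)  # estimated parameters
```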
The truncated Newton method, known as the Hessian-free approach (Nocedal & Wright, 1999; Martens, 2010; Kingsbury et al., 2012), is a second-order method for large-scale problems. It finds the search direction using an iterative solver; the solver is typically based on conjugate gradients, but other alternatives are possible. In this method, Hessian-vector products are computed without explicitly forming the Hessian. Hessian-free methods approximately invert the Hessian, while quasi-Newton methods invert an approximate Hessian.
By ignoring the second-order derivative, a first-order approximation of the CML objective leads to gradient ascent methods, whose update is given by

$$\lambda^{(\tau)} = \lambda^{(\tau-1)} + \eta\, g(\Lambda) \quad (12)$$

The step size η must be small enough to ensure a stable increase of the CML objective function. It can be shown that the algorithm is convergent provided that η satisfies the condition 0 < η < 2/λ_max, where λ_max is the largest eigenvalue of the Hessian matrix H(Λ*) evaluated at the global maximum of the CML objective function (Haykin, 1998). In practice, second-order statistics are not accumulated, so λ_max is not known and η is chosen in an ad-hoc fashion by trial and error.
The training speed of gradient descent (batch mode) is usually slow. The training process can be accelerated using an online variant known as stochastic gradient descent (SGD).³ This algorithm can update the learning system on the basis of the objective function measured for a single utterance or batch.

³Since the CML objective function is maximized in this work, stochastic gradient ascent is used to train the DCRF models.
3.2. Manhattan Update
Many authors have described variants of the gradient ascent algorithm in which increased rates of convergence are obtained through learning rate adaptation (Jacobs, 1988; Sutton, 1992; Murata et al., 1997). A comparison between different algorithms was addressed in (Schiffmann et al., 1994). The Resilient Propagation (RProp) algorithm (Riedmiller & Braun, 1993) uses a Manhattan update rule to provide faster convergence. A Manhattan update rule does not involve the gradient magnitude. Our Manhattan (MH) update rule implementation is similar to RProp, where the updates are given by
$$\eta_i^{(\tau)} = \begin{cases} \min\bigl(\eta_i^{(\tau-1)}\,\phi,\; \eta_{\max}\bigr) & g_i^{(\tau)}(\Lambda)\, g_i^{(\tau-1)}(\Lambda) > 0 \\ \max\bigl(\eta_i^{(\tau-1)}\,\kappa,\; \eta_{\min}\bigr) & g_i^{(\tau)}(\Lambda)\, g_i^{(\tau-1)}(\Lambda) < 0 \end{cases} \quad (13)$$

where

$$\lambda_i^{(\tau)} = \begin{cases} \lambda_i^{(\tau-1)} - \eta_i^{(\tau)} & g_i^{(\tau)}(\Lambda) < 0 \\ \lambda_i^{(\tau-1)} + \eta_i^{(\tau)} & g_i^{(\tau)}(\Lambda) > 0 \end{cases} \quad (14)$$

The learning parameters were set as follows: κ = 0.5, φ = 1.2, and η_i^{(0)} = η.
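The following NumPy sketch implements the per-parameter update of Equations (13)-(14). The values of κ and φ follow the paper; the numerical bounds η_min and η_max and the idea of carrying (g, η) as state between calls are assumptions of this sketch.

```python
import numpy as np

def manhattan_step(lam, g, prev_g, eta, phi=1.2, kappa=0.5,
                   eta_min=1e-6, eta_max=1.0):
    """One Manhattan (MH) update for maximizing the CML objective.

    lam, g, prev_g, eta : arrays of the same shape holding the parameters,
    current gradient, previous gradient, and per-parameter step sizes.
    Returns the updated (lam, eta)."""
    same_sign = g * prev_g > 0
    flipped = g * prev_g < 0
    # Equation (13): grow the step while the gradient sign is stable,
    # shrink it when the sign flips.
    eta = np.where(same_sign, np.minimum(eta * phi, eta_max), eta)
    eta = np.where(flipped, np.maximum(eta * kappa, eta_min), eta)
    # Equation (14): move by the step size in the direction of the
    # gradient sign only; the gradient magnitude is ignored.
    return lam + np.sign(g) * eta, eta
```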
3.3. DCRF Gradient Computation
For an exponential family activation function based on
first-order sufficient statistics, the gradient of the CML
objective function for the output layer parameters is
given by
$$\nabla F_{\mathrm{CML}}(O) = C_{ji}^{num}(O) - C_{ji}^{den}(O) \quad (15)$$

where the accumulators of the sufficient statistics, C_{ji}(O), for the j-th state and i-th constraint are calculated as follows:
$$C_{ji}^{num}(O) = \sum_{r=1}^{R} \sum_{t=1}^{T_r} \gamma_j^{r}(t \mid M_{num})\, o_{rti}^{N} \quad (16)$$

$$C_{ji}^{den}(O) = \sum_{r=1}^{R} \sum_{t=1}^{T_r} \gamma_j^{r}(t \mid M_{den})\, o_{rti}^{N} \quad (17)$$
where r is the utterance index, and the frame-state alignment probability γ_j, denoting the probability of being in state j at time t, can be written in terms of the forward score α_j(t) and the backward score β_j(t), as in HMMs:

$$\gamma_j(t \mid M) = P(s_t = j \mid O; M) = \frac{\alpha_j(t \mid M)\, \beta_j(t \mid M)}{Z_\Lambda(O \mid M)} \quad (18)$$
and, to avoid the necessity of building lattices, γ_j(t|M_den) is approximated with state estimates as follows (Hifny et al., 2005):

$$\gamma_j(t \mid M_{den}) = \frac{\exp\bigl(o_{tj}^{N}\bigr)}{\sum_{s} \exp\bigl(o_{ts}^{N}\bigr)} \quad (19)$$
The delta of output layer neuron j is given by

$$\delta_{tj}^{N} = \gamma_j(t \mid M_{num}) - \gamma_j(t \mid M_{den}) \quad (20)$$

and the delta of the hidden layers by

$$\delta_{tj}^{h} = o_{tj}^{h}\bigl(1 - o_{tj}^{h}\bigr) \sum_{k \in \mathrm{outputs}} \lambda_{kj}^{h+1}\, \delta_{kt}^{h+1} \quad (21)$$
and the gradient for the hidden layer parameters is given by

$$\frac{\partial F_{\mathrm{CML}}(\Lambda)}{\partial \lambda_{ij}^{h}} = \sum_{r=1}^{R} \sum_{t=1}^{T_r} \delta_{rtj}^{h}\, o_{rti}^{h-1} \quad (22)$$
Based on Equation (22) and Equation (15), a gradient based optimization can be used to estimate the parameters (Nocedal & Wright, 1999). The transition parameters are given by

$$\lambda_{s_t s_{t-1}} = \log a_{s_t s_{t-1}}, \quad (23)$$

where a_{s_t s_{t−1}} is the transition probability in HMM modeling and is estimated using the maximum likelihood (MLE) criterion.
4. Experiments
We have carried out phone recognition experiments on the TIMIT corpus (Garofolo et al., 1990). We used the 462-speaker training set and tested on the 24-speaker core test set (the SA1 and SA2 utterances were not used). The speech was analyzed using a 25 ms Hamming window with a 10 ms fixed frame rate. In all the experiments, we represented the speech using 12th-order mel frequency cepstral coefficients (MFCCs) and energy, along with their first and second temporal derivatives, resulting in a 39-element feature vector. The training data and test data features were pre-processed to have zero mean and unit variance. Acoustic context information is integrated using a window of 9 frames (4 left + current frame + 4 right) to construct the final frame vector with 351 dimensions.
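A small sketch of this context stacking, assuming a (T, 39) array of per-frame features; zero-padding at the utterance boundaries is an assumption, since the paper does not state how edges are handled.

```python
import numpy as np

def stack_context(feats, left=4, right=4):
    """Stack each 39-dim frame with its 4 left and 4 right neighbours,
    yielding (4 + 1 + 4) * 39 = 351-dimensional frame vectors."""
    T, _ = feats.shape
    padded = np.pad(feats, ((left, right), (0, 0)))   # zero-pad boundaries
    return np.hstack([padded[i:i + T] for i in range(left + right + 1)])
```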
Following Lee and Hon (Lee & Hon, 1989), the original 61 phone classes in TIMIT were mapped to a set of
48 labels, which were used for training. This set of 48
phone classes was mapped down to a set of 39 classes
(Lee & Hon, 1989), after decoding, and phone recognition results are reported on these classes, in terms
of the phone error rate (PER), which is analogous to
word error rate.
The baseline HMMs have three emitting states and the
emission probabilities were modeled with mixtures of
Gaussian densities with diagonal covariance matrices.
The generative HMMs were trained by the maximum
likelihood criterion using the conventional EM algorithm (Young et al., 2001).
A DNN with two hidden layers was chosen, with 512 neurons in each layer. Each phone was represented using a three-state left-to-right DCRF. All DNN parameters were initialized to random values, and the transition parameters were initialized from trained HMM models, forcing left-to-right DCRFs (the transition parameters are held fixed after initialization). The training procedure accumulated the M_num sufficient statistics via a Viterbi pass (forced alignment) of the reference transcription, using HMMs trained with the maximum likelihood criterion. Several iterations were used to train the DCRFs, and the language model scaling factor was set to 6.0 during the decoding process. All our experiments used a bigram language model over phones, estimated from the training set.
The training process is divided into two phases: online and batch. The online training phase computes an initial model that is used to start the batch training phase. The reason for splitting the training into two phases is to use a computer grid for the batch training. The grid has 28 cores.
In Table 1, the DCRF recognition performance is reported in terms of PER on the TIMIT task (core test set) for the online training phase. Since the MH update is sensitive to gradient sign changes, we train the models using the same batch ten times before moving to another batch. This may reduce the effect of random gradients. Hence, a complete iteration of the online Manhattan update implies ten loops over the training data. For SGD, we train the models for ten iterations and update after each batch. The training data are divided into 739 batches. Consequently, the models were updated 7390 times in the online phase. The results show that online learning is sensitive to the initial learning rate. The phone error rate is reduced to 42.2% for online MH training.

Table 1. DCRF decoding results on the TIMIT recognition task in terms of PER (online phase).

Algorithm    Learning rate   #Updates   PER
SGD          0.0001          7390       51.3%
SGD          0.001           7390       43.4%
SGD          0.01            7390       58.1%
Manhattan    0.0001          7390       45.6%
Manhattan    0.001           7390       42.2%
Manhattan    0.01            7390       65.8%

The final model of the online training phase is then used to initialize the batch training phase. A batch Manhattan update is used to update the models, with the initial learning rate set to 0.01. As shown in Table 2, the recognition accuracy improves as the number of iterations grows. The batch training phase was executed over a computer grid of 28 cores.

Table 2. DCRF decoding results on the TIMIT recognition task in terms of PER (batch phase).

Training method     #Itr   PER
Online Manhattan    1      42.2%
Batch Manhattan     50     36.9%
Batch Manhattan     100    35.5%
Batch Manhattan     150    34.7%
Batch Manhattan     200    34.4%
Batch Manhattan     250    34.2%
Batch Manhattan     300    33.7%
Batch Manhattan     350    33.6%
Batch Manhattan     400    33.4%
Batch Manhattan     450    33.2%
Batch Manhattan     500    33.2%
Batch Manhattan     550    33.1%
Batch Manhattan     600    33.0%

The decoding results of DCRFs are sensitive to the value of the language model scaling factor. The PER for different scaling factors is shown in Table 3. The DCRF model from iteration 600 was selected to run the decoder.

Table 3. DCRF decoding results based on different language model scaling factors.

Scaling factor   PER
1.0              30.8%
2.0              30.2%
3.0              30.6%
4.0              31.4%
5.0              32.2%
6.0              33.0%
5. Conclusions
In this paper, we presented a method to estimate deep conditional random fields (DCRFs) using the Manhattan (MH) update rule. In this approach, the DCRF state scores are computed based on a DNN that has many hidden layers. In the estimation process, a backpropagation algorithm is used to compute the gradient for each parameter in the hidden layers. Preliminary results on the TIMIT phone recognition task show that the MH update can lead to good convergence properties.
Future work will focus on running online methods over
a grid computer and improving the training speed of
the batch MH using different sets of initial learning
rates.
References
Dahl, George, Yu, Dong, Deng, Li, and Acero, Alex. Context-dependent pre-trained deep neural networks for large vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, Special Issue on Deep Learning for Speech and Language Processing, 2012.
Do, Trinh-Minh-Tri and Artieres, Thierry. Neural conditional random fields. In Proc. of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.
Fujii, Yasuhisa, Yamamoto, Kazumasa, and Nakagawa, Seiichi. Deep-hidden conditional neural fields
for continuous phoneme speech recognition. In Proc.
IWSML, 2012.
Garofolo, John S., Lamel, Lori F., Fisher, William M., Fiscus, Jonathan G., Pallett, David S., Dahlgren, Nancy L., and Zue, Victor. TIMIT acoustic-phonetic continuous speech corpus, 1990. URL http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1.
Haykin, Simon. Neural Networks: A Comprehensive Foundation. Prentice Hall, 2nd edition, 1998.
Nocedal, Jorge and Wright, Stephen J. Numerical Optimization. Springer, 1999.
Hifny, Yasser. Conditional Random Fields for Continuous Speech Recognition. PhD thesis, University of Sheffield, 2006.
Prabhavalkar, R. and Fosler-Lussier, E. Backpropagation training for multilayer conditional random field
based phone recognition. In Proc. IEEE ICASSP,
volume 1, pp. 5534 – 5537, France, March 2010.
Hifny, Yasser. Acoustic modeling based on deep conditional random fields. Deep Learning for Audio,
Speech and Language Processing, ICML, 2013.
Hifny, Yasser, Renals, Steve, and Lawrence, Neil. A
hybrid MaxEnt/HMM based ASR system. In Proc.
INTERSPEECH, pp. 3017–3020, Lisbon, Portugal,
2005.
Hinton, Geoffrey, Deng, Li, Yu, Dong, Dahl, George, Mohamed, Abdel-rahman, Jaitly, Navdeep, Senior, Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara, and Kingsbury, Brian. Deep Neural Networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 2012.
Riedmiller, M. and Braun, H. A direct method for
faster backpropagation learning: The RPROP algorithm. In Proc. IEEE International Conference on
Neural Networks, pp. 586–591, 1993.
Schiffmann, W., Joost, M., and Werner, R. Optimization of the backpropagation algorithm for training multilayer perceptrons. Technical report, University of Koblenz, 1994.
Seide, F., Li, G., and Yu, D. Conversational speech transcription using context-dependent Deep Neural Networks. In Interspeech, 2011.
Jacobs, R. A. Increased rates of convergence through
learning rate adaptation. Neural Networks, 1:295–
307, 1988.
Sutton, Richard S. Adapting bias by gradient descent: An incremental version of delta-bar-delta. In Proc. AAAI, pp. 171–176, 1992. URL citeseer.ist.psu.edu/158284.html.
Kingsbury, Brian, Sainath, Tara N., and Soltau, Hagen. Scalable minimum Bayes risk training of Deep Neural Network acoustic models using distributed Hessian-free optimization. In Interspeech, 2012.
Vinyals, Oriol and Povey, D. Krylov subspace descent
for deep learning. In AISTATS, 2012.
Lafferty, John, McCallum, Andrew, and Pereira, Fernando. Conditional random fields: Probabilistic
models for segmenting and labeling sequence data.
In Proc. ICML, pp. 282–289, 2001.
Lee, K.-F. and Hon, H.-W. Speaker-independent phone recognition using hidden Markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(11):1641–1648, Nov 1989.
Martens, J. Deep learning via hessian-free optimization. In Proc. ICML, 2010.
Mohamed, Abdel-rahman, Yu, Dong, and Deng, Li.
Investigation of full-sequence training of Deep Belief Networks for speech recognition. In Interspeech,
2010.
Mohamed, Abdel-rahman, Dahl, George, and Hinton,
Geoffrey. Acoustic modeling using Deep Belief Networks. IEEE Transactions on Audio, Speech and
Language Processing, 20:14–22, 2012.
Murata, Noboru, Müller, Klaus-Robert, Ziehe, Andreas, and Amari, Shun-ichi. Adaptive on-line learning in changing environments. In Proc. NIPS, volume 9, pp. 599, 1997. URL citeseer.ist.psu.edu/murata97adaptive.html.
Young, Steve, Kershaw, Dan, Odell, Julian, Ollason,
Dave, Valtchev, Valtcho, and Woodland, Phil. The
HTK Book, Version 3.1. 2001.
Yu, Dong and Deng, Li. Deep-structured hidden conditional random fields for phonetic recognition. In
Proc. INTERSPEECH, 2010.