
ROUND-ROBIN DUEL DISCRIMINATIVE LANGUAGE MODELS
IN ONE-PASS DECODING WITH ON-THE-FLY ERROR CORRECTION
Takanobu Oba† , Takaaki Hori† , Akinori Ito‡ and Atsushi Nakamura†
† NTT Communication Science Laboratories, NTT Corporation, Japan
‡ Graduate School of Engineering, Tohoku University, Japan
{oba, hori}@cslab.kecl.ntt.co.jp, [email protected], [email protected]
ABSTRACT
This paper focuses on discriminative n-gram language
models for large vocabulary speech recognition. We have
proposed a novel training method called the round-robin duel
discrimination (R2D2) method. Our previous report showed
that R2D2 outperforms conventional methods on word n-gram based discriminative language models (DLMs). In this
paper, we achieve additional error reduction and one-pass
decoding at the same time. The keys to achieving this are the
use of morphological features and the on-the-fly composition
of weighted finite-state transducers (WFSTs) that represent
both word and morphological discriminative features. Our
experimental results show that R2D2 reduces recognition errors more effectively than conventional methods in the reranking of n-best hypotheses, and that one-pass decoding can be accomplished with equivalent accuracy.
Index Terms— R2D2, Discriminative language model,
Error correction, WFST, On-the-fly algorithm
1. INTRODUCTION
This paper focuses on discriminative language models (DLMs) based on a log-linear model [1, 2]. In these approaches, we learn, in advance, the characteristic differences between the reference and the multiple erroneous sentences generated by recognizing training data. These tendencies are then used to rerank the lists of hypotheses obtained by recognizing unseen speech, so that less erroneous hypotheses are placed in higher ranks, and vice versa. Hence, DLM-based reranking is often referred to as an error correction approach.
Our previous studies have focused on error rate based model training methods, including minimum error rate training (MERT) [3] and the weighted global conditional log-linear model (WGCLM) [4, 5]. Through such work, we proposed a novel model training method called the round-robin duel discrimination (R2D2) method [6]. Since R2D2 is an expansion of WGCLM, R2D2 always provides a model better than, or at least equivalent to, a model provided by WGCLM. Furthermore, since the loss function of R2D2
is concave, as is that of WGCLM, the optimal parameters of the model are guaranteed to be obtained.
In this paper, we improve the accuracy of the DLM and
achieve one-pass decoding with the accurate model. Of
course, an improvement in accuracy is important for speech
recognition, because recognition errors lead to problems in
most speech applications and result in the limited use of
speech recognition. One-pass decoding is also important in terms of expanding the applicability of speech recognition. In online applications such as spoken dialog systems, speech recognizers need to work in real time and with low latency so as not to impede smooth human-machine interaction. However, multi-pass decoding, including lattice rescoring or n-best reranking, is not suitable for such applications because the rescoring or reranking step introduces additional latency.
To improve the accuracy, we use morphological n-gram
features together with word n-gram features. They are known
to be effective for constructing an accurate model [7, 8, 9].
Although we could choose several types of morphological
features, we use part-of-speech (POS) features in this paper.
In addition, for one-pass decoding, we express a DLM as
a set of weighted finite-state transducers (WFSTs). WFST
representation for a DLM consisting only of word n-gram features has already been proposed and used for lattice rescoring
[1], and discriminative training has been done on WFST
search graphs containing word information [10, 11]. One-pass decoding can be performed by composing the DLM WFST with the baseline WFST(s), but the composed WFST requires much more memory. Furthermore, when we use a DLM that contains morphological features, certain difficulties remain.
We have to solve the following problems: the composition of a word n-gram WFST and a morphological (POS) n-gram WFST generates an extremely large WFST, and the generated WFST outputs a morphological symbol sequence, whereas we require a word sequence. To solve these problems, we use an on-the-fly composition algorithm to combine the WFSTs.
the WFSTs. A word sequence as a recognition result can be
obtained by memorizing the output sequence of the baseline
WFST through the on-the-fly composition.
In our experiments, we compared R2D2 with conventional methods in terms of WER in the reranking of n-best hypotheses. R2D2 outperformed the other methods even when the models were constructed with word and morphological n-gram features. We also used DLMs trained by R2D2 for one-pass decoding, and obtained an equivalent WER with a real time factor (RTF) smaller than 1.
2. RERANKING HYPOTHESES
We represent the n-best hypotheses generated by a speech recognition system as $L = \{h_j \mid j = 1, 2, \cdots, N\}$, and the feature vector of a hypothesis $h$ as $f(h)$. Specifically, the speech recognition score of $h$ is denoted as $f_0(h)$.
Given a parameter vector $a$, the goal of the reranking problem in speech recognition is to find the hypothesis with the highest score. This is formulated as follows:
$$h^* = \arg\max_{h \in L} \left\{ a_0 f_0(h) + a^\top f(h) \right\} \qquad (1)$$
where $a_0$ is a given scaling constant and $\top$ denotes the transpose of a vector.
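As an illustration, the decision rule in equation (1) amounts to a linear scoring of each hypothesis. The following Python sketch is not the authors' implementation; the feature vectors and constants are hypothetical toy values.

import numpy as np

def rerank(hypotheses, a, a0):
    # Pick the hypothesis maximizing a0 * f0(h) + a^T f(h), as in equation (1).
    # hypotheses: list of (f0, f) pairs, where f0 is the recognizer score
    # and f is the DLM feature vector of the hypothesis.
    scores = [a0 * f0 + a @ f for f0, f in hypotheses]
    return int(np.argmax(scores))

# Hypothetical 3-best list with 2-dimensional DLM feature vectors.
nbest = [(-120.3, np.array([1.0, 0.0])),
         (-121.0, np.array([0.0, 1.0])),
         (-119.8, np.array([1.0, 1.0]))]
print(rerank(nbest, a=np.array([0.5, -0.2]), a0=0.1))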
For training, we prepare a data set that comprises:
• N-best lists $\{L_i \mid i = 1, 2, \cdots, I\}$
They are generated by a speech recognizer from training data consisting of $I$ utterances. Each hypothesis is converted to a feature vector, denoted $f_{i,j}$. That is, $L_i = \{f_{i,j} \mid j = 1, 2, \cdots, N_i\}$.
• References
We represent the feature vector of a reference as $f_{i,r}$. We use the oracle, i.e. the hypothesis with the minimum WER, instead of the true reference.
• WER as sample weight
The WER $e_{i,j}$ of each hypothesis is used as a sample weight for training.
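For concreteness, the training material for one utterance can be held in a simple container such as the one sketched below; the field names are ours, not the authors'.

from dataclasses import dataclass
import numpy as np

@dataclass
class TrainingNBest:
    # Hypothetical layout for the material of one training utterance i.
    features: list           # f_{i,j} for j = 1, ..., N_i (NumPy vectors)
    ref_feature: np.ndarray  # f_{i,r}, taken from the oracle hypothesis
    wers: list               # e_{i,j}, the WER of each hypothesis (sample weight)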
3. ROUND-ROBIN DUEL DISCRIMINATION
The loss function of R2D2 is defined as follows:
$$L_{\mathrm{R2D2}} = \sum_{i=1}^{I} \log \sum_{j=1}^{N_i} \sum_{j'=1}^{N_i} \frac{\exp(\sigma_1 e_{i,j}) \exp(a^\top f_{i,j})}{\exp(\sigma_2 e_{i,j'}) \exp(a^\top f_{i,j'})} \qquad (2)$$
σ1 and σ2 are hyperparameters that are decided using a development set. The purpose of training is to find a model
parameter a that minimizes the loss function.
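A minimal NumPy sketch of the loss in equation (2) is given below, assuming each training utterance is represented by a feature matrix F (rows $f_{i,j}$) and a WER vector e. It is only illustrative; in practice the loss and its gradient would be passed to an optimizer such as L-BFGS (Section 5).

import numpy as np
from scipy.special import logsumexp

def r2d2_loss(nbest_lists, a, sigma1, sigma2):
    # nbest_lists: iterable of (F, e) pairs, where F is an (N_i x D) matrix of
    # feature vectors f_{i,j} and e is the length-N_i vector of WERs e_{i,j}.
    total = 0.0
    for F, e in nbest_lists:
        num = sigma1 * e + F @ a   # log numerator terms, indexed by j
        den = sigma2 * e + F @ a   # log denominator terms, indexed by j'
        # log sum_{j,j'} exp(num_j - den_{j'}) = logsumexp(num) + logsumexp(-den)
        total += logsumexp(num) + logsumexp(-den)
    return total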
To clarify the relationship between R2D2 and other methods, it is useful to consider a basic exponential error loss, $\exp(a^\top f_{i,j} - a^\top f_{i,r})$. To construct a loss function, we normally accumulate this basic loss over all the samples. As a result, we obtain the following form of loss function:
$$\sum_{i=1}^{I} g\left( \sum_{j=1}^{N_i} \exp(a^\top f_{i,j} - a^\top f_{i,r}) \right) \qquad (3)$$
This is the loss function of reranking boosting if $g(x) = x$, and of GCLM (WGCLM where all the weights are set to 1) if $g(x) = \log(x)$.
Next, we consider the weighted exponential loss,
$$w_{i,j} \exp(a^\top f_{i,j} - a^\top f_{i,r}). \qquad (4)$$
The weight $w_{i,j}$ scales the loss. If $w_{i,j}$ is large, the discrimination of the corresponding sample from the reference is emphasized during training. Conversely, samples with small $w_{i,j}$ are largely ignored during training. We investigate the efficient use of these samples.
If we rewrite equation (4) as
$$\exp\left( (a^\top f_{i,j} + e_{i,j}) - a^\top f_{i,r} \right) \qquad (5)$$
where $e_{i,j} = \log w_{i,j}$, we can see that the weight $e_{i,j}$ works as a bias. We expand the weighted loss by giving a bias to the third term in the exponential function as follows:
$$\exp\left( (a^\top f_{i,j} + e_{i,j}) - (a^\top f_{i,j'} + e_{i,j'}) \right) \qquad (6)$$
Here, $f_{i,j'}$ works as a pseudo reference if $e_{i,j'}$ is small. That is, samples with small weights are used efficiently during training. In R2D2, this expanded weighted exponential loss (equation (6)) is accumulated in a similar manner to GCLM.
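The step from (4) to (5) is just the identity $w = \exp(\log w)$; a short numerical check with arbitrary toy values makes this explicit.

import math

a_f_j, a_f_r, w = 1.3, 2.0, 0.25   # toy values for a^T f_{i,j}, a^T f_{i,r}, w_{i,j}
e = math.log(w)                    # e_{i,j} = log w_{i,j}
assert math.isclose(w * math.exp(a_f_j - a_f_r),     # weighted loss (4)
                    math.exp((a_f_j + e) - a_f_r))   # bias form (5)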
4. ONE-PASS DECODING WITH A DLM
In this section, we describe how a trained DLM can be implemented into a WFST-based one-pass decoder. Figure 1 shows
our way of achieving this. We make three WFSTs for a DLM
and combine them with a baseline WFST decoder through an
on-the-fly composition algorithm. Although we utilize POS
as a morphological feature below, our method can be applied
for different morphological features.
The on-the-fly composition approach has certain advantages. First, we can avoid the extremely large WFST that would result from the full composition of the word and POS n-gram feature WFSTs. Second, we can obtain a word sequence by memorizing the output symbols of the first (or an intermediate) WFST, even if the last WFST in the cascade outputs POS symbols. The decoder finds the best path taking all of the WFSTs into account, and obtains as the recognition result the output symbol sequence of the first WFST memorized along the best path.
Our baseline system handles two WFSTs with the on-the-fly composition, which are an HMM-state-to-word transducer
(i.e. WFST1 in Fig. 1) and a word n-gram transducer (i.e.
WFST2 in Fig. 1). Our decoder, SOLON, which was developed at NTT CS Labs., can combine multiple WFSTs by
using a fast on-the-fly composition algorithm [12]. We utilize
SOLON to realize a one-pass search using a DLM.
Below, we detail how to make the three WFSTs for a DLM, where we assume that the DLM does not contain any combined word-POS features. We can then separate the parameters into two parts, one for word n-grams and one for POS n-grams.
Fig. 1. Cascade of multiple WFSTs. [Figure: the base recognizer consists of WFST1 (HMM states to word) and WFST2 (word n-gram model); the error correction model consists of WFST3 (word n-gram based reranker), WFST4 (word to POS) and WFST5 (POS n-gram based reranker).]

Fig. 2. Example of an arc on the WFST that expresses a DLM.

Table 1. Experimental data. Dev. and uttr. denote development and utterances, respectively.

  CSJ
  corpus         lecture     # of uttr.   # of words
  training set   academic    25,130       420,918
  dev. set 1     academic    1,293        26,329
  test set 1     academic    1,156        26,798
  dev. set 2     simulated   1,479        20,990
  test set 2     simulated   717          17,242

  MITLC
  corpus         # of uttr.   # of words
  training set   18,112       433,308
  dev. set       813          11,028
  test set       6,988        75,049
We then convert these two parameter sets into n-gram WFSTs, i.e. WFST3 and WFST5 in Fig. 1. These WFSTs are constructed in the manner described in [1]. Figure 2 shows how to configure the weight value of an arc: if an arc corresponds to the trigram 'abc,' its weight must be the sum of the parameters corresponding to 'abc,' 'bc' and 'c.'
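As a sketch of this weight computation, using a hypothetical dictionary that maps n-gram tuples to trained parameter values (missing entries contribute 0):

def trigram_arc_weight(params, w1, w2, w3):
    # Weight of the DLM WFST arc for trigram (w1, w2, w3): the sum of the
    # parameters for the trigram, its bigram suffix and its unigram.
    return (params.get((w1, w2, w3), 0.0)
            + params.get((w2, w3), 0.0)
            + params.get((w3,), 0.0))

# e.g. the arc for trigram ('a', 'b', 'c') gets a_abc + a_bc + a_c
weight = trigram_arc_weight({('a', 'b', 'c'): 0.4, ('b', 'c'): 0.1, ('c',): -0.2},
                            'a', 'b', 'c')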
The remaining WFST is a word-to-POS converter, i.e. WFST4 in Fig. 1. It consists of a single node with self-loop arcs, where the input symbol of each arc is a word and the output symbol is the POS of that word. The weights of all the arcs are set to 0.
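Schematically, WFST4 can be written down as a set of self-loop arcs over a single state; the sketch below uses plain tuples rather than any particular WFST toolkit, and the lexicon is a hypothetical word-to-POS mapping.

def build_word_to_pos_arcs(lexicon):
    # One state (0) with a self-loop per word: input = word, output = POS, weight = 0.
    state = 0
    return [(state, state, word, pos, 0.0) for word, pos in lexicon.items()]

arcs = build_word_to_pos_arcs({"dog": "NOUN", "runs": "VERB"})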
5. EXPERIMENTS
For our experiments, we used two corpora, the Corpus of
Spontaneous Japanese (CSJ) [13] and the MIT Lecture Corpus (MITLC) [14]. Table 1 shows the amount of data in the
corpora. CSJ is a Japanese lecture corpus. We prepared two
test sets that simulated in-domain and out-of-domain conditions. MITLC is an English lecture corpus. This task is
more difficult than CSJ, both acoustically and linguistically.
It should be noted that absolute WER differences of 0.2, 0.3 and 0.3 are statistically significant ($p < 0.02$) for CSJ test sets 1 and 2 and the MITLC test set, respectively.
For both training and evaluation, we generated 5000-best
hypotheses from each utterance by using the speech recognition system SOLON. We trained a triphone HMM acoustic
model and a Kneser-Ney smoothed language model for each
corpus. The training data for the Kneser-Ney language models do not include the training data shown in Table 1.
We trained DLMs using the 5000-best lists of the training set of each corpus. To estimate the model parameters, we used the L-BFGS algorithm [15] with an L2-norm regularizer. We used the development sets to configure the hyperparameters, the regularization constant and the scaling constant $a_0$.
Table 2. WERs before/after applying reranking models where word n-gram features are used.

                      CSJ                       MITLC
            dev.1   test1   dev.2   test2    dev.    test
  Baseline  20.5    18.0    36.3    34.5     46.1    37.0
  WGCLM     20.0    17.4    35.5    33.3     44.8    36.1
  MERT      20.1    17.7    35.1    33.1     44.3    35.4
  R2D2      19.7    17.2    35.0    32.9     44.4    35.7
Table 3. WERs before/after applying reranking models where word and POS n-gram features are used.

                      CSJ                       MITLC
            dev.1   test1   dev.2   test2    dev.    test
  Baseline  20.5    18.0    36.3    34.5     46.1    37.0
  WGCLM     19.9    17.1    34.9    32.6     45.0    35.6
  MERT      20.0    17.6    34.9    32.9     44.8    35.3
  R2D2      19.6    16.9    34.4    32.4     44.5    35.2
The feature vectors consisted of unigram, bigram and trigram booleans. A boolean feature has a value of 0 if the corresponding n-gram entry does not occur in a sentence, and
1 otherwise. Strictly, we should use count features for the
WFST-based on-the-fly error correction instead of boolean
features, because n-gram WFSTs assume count features.
However, in our preliminary experiments, we obtained the
same WER when comparing count features and boolean features. In addition, we can reduce the amount of memory for
holding the training data by using boolean features.
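As a sketch, the boolean n-gram features of a hypothesis can be extracted as the set of unigram, bigram and trigram entries that occur at least once; entries outside the set implicitly take the value 0.

def boolean_ngram_features(words, max_order=3):
    # Set of n-gram tuples (n = 1..max_order) occurring in the word sequence.
    feats = set()
    for n in range(1, max_order + 1):
        for k in range(len(words) - n + 1):
            feats.add(tuple(words[k:k + n]))
    return feats

print(boolean_ngram_features(["this", "is", "a", "test"]))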
5.1. Results on reranking of n-best hypotheses
Table 2 shows the WERs before and after applying the DLMs, where the DLMs consist of word n-gram features. Baseline corresponds to the WER before error correction. WGCLM seems to be suited to conditions where a large amount of training data similar to the evaluation data is available. In contrast, MERT tends to perform well under more difficult conditions. R2D2, which is an expansion of WGCLM, provided a significant improvement over WGCLM under all conditions.
Table 3 shows the results where the DLMs consist of word and POS n-gram features. The use of POS features improved the data robustness of WGCLM, because the POS features form a denser feature space. R2D2, which is an expansion of WGCLM, achieved additional error reduction. As a result, R2D2 provided the most accurate models on all the evaluation sets, whereas which of the other methods provided comparable models depended on the task.
5.2. One-pass search results
We constructed WFSTs for the models trained by R2D2 on the MITLC corpus, corresponding to the models shown in Tables 2 and 3. Using the fast on-the-fly composition algorithm, we combined the WFSTs of the DLM with the base WFSTs in SOLON and performed a one-pass search. We obtained WERs of 35.7% and 35.4% on the evaluation set when the features consisted of only word n-grams and of both word and POS n-grams, respectively. These are almost the same as those obtained with the reranking approach.
We depict real time factor (RTF) curves in Fig. 3. To measure RTF, we used a machine equipped with an Intel Xeon X5570 2.93 GHz processor. By combining a DLM through the on-the-fly composition of five WFSTs, we achieved a one-pass search and obtained accurate recognition results with an RTF smaller than 1.

Fig. 3. Relationship between RTF and WER after on-the-fly error correction. [Figure: WER (%) versus RTF curves for the baseline, DLM (R2D2, word) and DLM (R2D2, word+POS) systems.]
6. CONCLUSION
We evaluated R2D2 on different corpora. By using POS features, R2D2 generated more accurate models than with word n-gram features alone. In the comparison on reranking n-best hypotheses, R2D2 outperformed WGCLM and MERT under most conditions.
In addition, we described how to utilize error correction models for one-pass decoding. Our method can be applied even if the features include morphological n-grams. We achieved one-pass decoding with on-the-fly error correction and were able to obtain accurate results in real-time processing.

Acknowledgment

We thank the MIT Spoken Language Systems Group for helping us to perform speech recognition experiments based on MITLC. The research described in this paper is partially supported by Grant-in-Aid for Scientific Research No. 22300064, Japan Society for the Promotion of Science.

7. REFERENCES
[1] Brian Roark, Murat Saraclar, Michael Collins, and Mark Johnson,
“Discriminative language modeling with conditional random fields and
the perceptron algorithm,” in Proceedings of ACL, 2004, pp. 47–54.
[2] Zhengyu Zhou, Jianfeng Gao, Frank K. Soong, and Helen Meng, “A
comparative study of discriminative methods for reranking LVCSR n-best hypotheses in domain adaptation and generalization,” in Proceedings of ICASSP, 2006, pp. 141–144.
[3] Franz Josef Och, “Minimum error rate training in statistical machine
translation,” in Proceedings of ACL, 2003, pp. 160–167.
[4] Takanobu Oba, Takaaki Hori, and Atsushi Nakamura, “A comparative study on methods of weighted language model training for reranking LVCSR n-best hypotheses,” in Proceedings of ICASSP, 2010, pp.
5126–5129.
[5] Brian Roark, Murat Saraclar, and Michael Collins, “Discriminative n-gram language modeling,” Computer Speech and Language, vol. 21,
no. 2, pp. 373–392, 2007.
[6] Takanobu Oba, Takaaki Hori, and Atsushi Nakamura, “Round-robin
discrimination model for reranking ASR hypotheses,” in Proceedings
of Interspeech, 2010, pp. 2446–2449.
[7] Michael Collins, Brian Roark, and Murat Saraclar, “Discriminative
syntactic language modeling for speech recognition,” in Proceedings
of ACL, 2005, pp. 507–514.
[8] Izhak Shafran and Keith Hall, “Corrective models for speech recognition of inflected languages,” in Proceedings of EMNLP, July 2006, pp.
390–398.
[9] Ebru Arisoy, Murat Saraclar, Brian Roark, and Izhak Shafran, “Syntactic and sub-lexical features for Turkish discriminative language models,”
in Proceedings of ICASSP.
[10] Hong-Kwang Jeff Kuo, Brian Kingsbury, and Geoffrey Zweig, “Discriminative training of decoding graphs for large vocabulary continuous
speech recognition,” in Proceedings of ICASSP.
[11] Shinji Watanabe, Takaaki Hori, and Atsushi Nakamura, “Large vocabulary continuous speech recognition using WFST-based linear classifier
for structured data,” in Proceedings of Interspeech, 2010, pp. 346–349.
[12] Takaaki Hori and Atsushi Nakamura, “Generalized fast on-the-fly composition algorithm for WFST-based speech recognition,” in Proceedings of Interspeech, 2005, pp. 284–289.
[13] Kikuo Maekawa, Hanae Koiso, Sadaoki Furui, and Hitoshi Isahara,
“Spontaneous speech corpus of Japanese,” in Proceedings of ICLRE,
2000, pp. 947–952.
[14] James Glass, Timothy J. Hazen, Scott Cyphers, Igor Malioutov, David
Huynh, and Regina Barzilay, “Recent progress in the MIT spoken
lecture processing project,” in Proceedings of Interspeech, 2007, pp.
2553–2556.
[15] Dong C. Liu and Jorge Nocedal, “On the limited memory BFGS
method for large scale optimization,” Mathematical Programming, vol.
45, pp. 503–528, 1989.