
Improving and Scaling Trans-dimensional Random Field Language Models
Bin Wang¹, Zhijian Ou¹, Yong He², Akinori Kawamura²
¹Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
²Toshiba Corporation
contact email: [email protected]
Abstract
The dominant language models (LMs) such as n-gram and neural network (NN) models represent sentence probabilities in
terms of conditionals. In contrast, a new trans-dimensional random field (TRF) LM has recently been introduced, in which the whole sentence is modeled as a random field, and has been shown to give superior performance. In this paper, we further develop the TRF LMs with two technical improvements: a new method
of exploiting Hessian information in parameter optimization to
further enhance the convergence of the training algorithm and
an enabling method for training the TRF LMs on large corpora which may contain rare but very long sentences. Experiments show
that the TRF LMs can scale to using training data of up to 32
million words, consistently achieve 10% relative perplexity reductions over 5-gram LMs, and perform as well as NN LMs
but with much faster speed in calculating sentence probabilities.
Moreover, we examine how the TRF models can be interpolated
with the NN models, and obtain 12.1% and 17.9% relative error rate reductions over 6-gram LMs for English and Chinese
speech recognition respectively through log-linear combination.
Index Terms: Language modeling, random fields, stochastic
approximation
1. Introduction
Language modeling (LM) involves determining the joint probability of words in a sentence. The conditional approach is
dominant, representing the joint probability in terms of conditionals. Examples include n-gram LMs [1] and neural network
(NN) LMs [2, 3]. We have recently introduced a new trans-dimensional random field (TRF¹) LM [4], where the whole sentence is modeled as a random field. As the random field approach avoids the local normalization required in the conditional approach, it is computationally more efficient in computing sentence probabilities and has the potential advantage of being able to flexibly integrate a richer set of features. We developed an effective training algorithm using joint stochastic approximation (SA) and trans-dimensional mixture sampling. We
found that the TRF models significantly outperformed the modified Kneser-Ney (KN) smoothed 4-gram LM with 9.1% relative reduction in speech recognition word error rates (WERs)
and performed slightly better than the recurrent neural network
LMs but with 200x faster speed in re-scoring n-best lists of hypothesized sentences. To our knowledge, this result represents
the first strong empirical evidence supporting the power of using the whole-sentence random field approach for LMs [5].
This work is supported by NSFC grant 61473168, Tsinghua Initiative 20121088069 and Toshiba Corporation.
1 Instead of the abbreviation “TDRF” in [4], we use “TRF” for simplicity in this paper.
In this paper, we further develop the TRF LMs with two technical improvements. Moreover, we perform more experiments to investigate whether the TRF models can scale to larger corpora and how the TRF models can be interpolated with
NN models to further improve the performance.
Improvements: First, in [4], the diagonal elements of the Hessian matrices are estimated online during the SA iterations to rescale the gradients, which is shown to benefit the convergence of the training algorithm. In this paper, inspired by [6, 7], we propose a simpler but more effective method which directly uses the empirical variances to rescale the gradients. As the empirical variances are calculated offline, this also reduces the computational and memory cost during the SA iterations. The experimental results show that this new manner of exploiting second-order information in parameter optimization further enhances the convergence of the training algorithm. Second, to enable the training of TRF LMs on large corpora which may contain rare but very long sentences, we introduce a special sub-model to model the sequences longer than a fixed length.
Experiments: First, we examine the scalability of the TRF
LMs, by incrementally increasing the size of the training set.
Based on training data of up to 32 million words from Google
1-billion word corpus [8], we build TRF LMs with up to 36
million features. The TRF LMs consistently achieve 10% relative perplexity reductions over the KN 5-gram LMs. Then
speech recognition experiments are conducted for both English
and Chinese. Three kinds of interpolation schemes, namely linear interpolation (word-level or sentence-level) and log-linear
interpolation, are evaluated for model combinations between
KN n-gram LMs, NN LMs and our TRF LMs. Through log-linearly combining TRF LMs with NN LMs, we obtain 12.1%
and 17.9% relative error rate reductions over the KN 6-gram
LMs for English and Chinese speech recognition respectively.
2. Improvements
2.1. Background of TRF Model Definition and Training
Throughout, we denote by $x^l = (x_1, \ldots, x_l)$ a sentence (i.e., word sequence) of length $l$, ranging from 1 to $m$. Each element of $x^l$ corresponds to a single word. $D$ denotes the whole training corpus and $D_l$ denotes the collection of sentences of length $l$ in the training corpus. $n_l$ denotes the size of $D_l$ and $n = \sum_{l=1}^{m} n_l$.

As defined in [4], a trans-dimensional random field model represents the joint probability of the pair $(l, x^l)$ as

$$p(l, x^l; \lambda) = \frac{n_l/n}{Z_l(\lambda)} e^{\lambda^T f(x^l)}, \quad (1)$$

where $n_l/n$ is the empirical probability of length $l$. $f(x^l) = (f_1(x^l), \ldots, f_d(x^l))^T$ is the feature vector, which is usually defined to be position-independent and length-independent, e.g. the n-grams. $d$ is the dimension of the feature vector $f(x^l)$ and $\lambda$ is the corresponding parameter vector. $Z_l(\lambda) = \sum_{x^l} e^{\lambda^T f(x^l)}$ is the normalization constant of length $l$. By making explicit the role of length in the model definition, it is clear that the model in (1) is a mixture of random fields on sentences of different lengths (namely on subspaces of different dimensions), and hence will be called a trans-dimensional random field (TRF).
In the joint SA training algorithm [4], we define another form of mixture distribution as follows:

$$p(l, x^l; \lambda, \zeta) = \frac{n_l/n}{Z_1(\lambda) e^{\zeta_l}} e^{\lambda^T f(x^l)}, \quad (2)$$

where $\zeta = \{\zeta_1, \ldots, \zeta_m\}$ with $\zeta_1 = 0$, and $\zeta_l$ is the hypothesized value of the log ratio of $Z_l(\lambda)$ with respect to $Z_1(\lambda)$, namely $\log \frac{Z_l(\lambda)}{Z_1(\lambda)}$. $Z_1(\lambda)$ is chosen as the reference value and can be calculated exactly. An important observation is that if and only if $\zeta$ were equal to the true log ratios, the marginal probability of length $l$ under distribution (2) would equal $n_l/n$. We then use this property to construct the joint SA algorithm, which jointly estimates the model parameters and normalization constants.
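To make the role of $\zeta$ concrete, the following sketch (our own illustration, not code from [4]; the function name, the dict-based feature representation and the argument layout are assumptions) computes the log score of a sentence under distribution (2). The shared constant $\log Z_1(\lambda)$ is dropped, since it cancels when re-ranking n-best lists.

```python
import math

def trf_log_score(sentence, feature_fn, lam, zeta, length_probs):
    """Log score of (l, x^l) under Eq. (2), up to the constant -log Z_1(lambda).

    sentence     : list of words, x^l
    feature_fn   : maps a sentence to {feature_id: count}, i.e. f(x^l)
    lam          : {feature_id: weight}, the parameter vector lambda
    zeta         : {length: zeta_l}, hypothesized log(Z_l / Z_1), with zeta[1] == 0
    length_probs : {length: n_l / n}, empirical length probabilities
    """
    l = len(sentence)
    dot = sum(lam.get(i, 0.0) * v for i, v in feature_fn(sentence).items())  # lambda^T f(x^l)
    return math.log(length_probs[l]) + dot - zeta[l]
```

Because this score requires no sum over histories or over the vocabulary, re-scoring an n-best list costs only one feature extraction and one sparse dot product per hypothesis, which is the source of the speed advantage discussed in Section 4.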
2.2. Improved Stochastic Approximation
In order to make use of Hessian information in parameter optimization, the online estimated Hessian diagonal elements are used in [4] to rescale the gradients. In this paper, inspired by [6, 7], we propose a simpler but more effective method which directly uses the empirical variances to rescale the gradients. The improved SA algorithm is described as follows. At each iteration $t$ (from 1 to $t_{max}$), we perform two steps.

Step I: MCMC sampling: Generate a sample set $B^{(t)}$ with $p(l, x^l; \lambda^{(t-1)}, \zeta^{(t-1)})$ as the stationary distribution, using the trans-dimensional mixture sampling method (see Section 3.3 in [4]).
Step II: SA updating: Compute

$$\lambda^{(t)} = \lambda^{(t-1)} + \gamma_{\lambda,t}\, \sigma^{-1} \left( \frac{1}{n} \sum_{(l,x^l)\in D} f(x^l) - \frac{1}{K} \sum_{(l,x^l)\in B^{(t)}} f(x^l) \right) \quad (3)$$

and

$$\zeta^{(t-\frac{1}{2})} = \zeta^{(t-1)} + \gamma_{\zeta,t} \left( \frac{\delta_1(B^{(t)})}{n_1/n}, \ldots, \frac{\delta_m(B^{(t)})}{n_m/n} \right) \quad (4)$$

$$\zeta^{(t)} = \zeta^{(t-\frac{1}{2})} - \zeta_1^{(t-\frac{1}{2})} \quad (5)$$

where $\gamma_{\lambda,t}$, $\gamma_{\zeta,t}$ are the learning rates of $\lambda$ and $\zeta$, and $\delta_l(B^{(t)}) = K^{-1} \sum_{(j,x^j)\in B^{(t)}} 1(j=l)$ is the relative frequency of length $l$ appearing in $B^{(t)}$.

In Eq. (3), $\sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_d)$ is a diagonal matrix and the element $\sigma_i$ ($1 \le i \le d$) is the empirical variance of feature $f_i$:

$$\sigma_i = \sum_{l=1}^{m} \frac{n_l}{n} \left( \tilde{p}_l[f_i^2] - (\tilde{p}_l[f_i])^2 \right) \quad (6)$$

where $\tilde{p}_l[f_i^2] = n_l^{-1} \sum_{x^l \in D_l} f_i^2(x^l)$ and $\tilde{p}_l[f_i] = n_l^{-1} \sum_{x^l \in D_l} f_i(x^l)$.
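As an illustration of Step II, here is a minimal NumPy sketch (our own, not the authors' toolkit): feature vectors are assumed to be stacked into dense matrices, and $\zeta$ and the length probabilities $n_l/n$ are stored in arrays indexed by length with index 0 unused. A practical implementation would use sparse features, the K samples drawn in Step I, and the parallelization described in [4].

```python
import numpy as np

def empirical_variances(train_feats, train_lens, m):
    """Eq. (6): per-feature empirical variances, computed once offline.

    train_feats : array (num_sentences, d), rows are f(x^l)
    train_lens  : integer array (num_sentences,), sentence lengths
    """
    n, d = train_feats.shape
    sigma = np.zeros(d)
    for l in range(1, m + 1):
        rows = train_feats[train_lens == l]
        if rows.size == 0:
            continue
        p1 = rows.mean(axis=0)            # \tilde{p}_l[f_i]
        p2 = (rows ** 2).mean(axis=0)     # \tilde{p}_l[f_i^2]
        sigma += (len(rows) / n) * (p2 - p1 ** 2)
    return sigma

def sa_update(lam, zeta, train_feats, sample_feats, sample_lens,
              length_probs, sigma, gamma_lam, gamma_zeta, eps=1e-8):
    """One SA iteration, Eqs. (3)-(5), given a sample set B^(t) already drawn.

    zeta, length_probs : arrays of size m+1, index 0 unused; sampled lengths <= m.
    """
    # Eq. (3): variance-rescaled gradient step on lambda (sigma is diagonal)
    grad = train_feats.mean(axis=0) - sample_feats.mean(axis=0)
    lam = lam + gamma_lam * grad / (sigma + eps)

    # Eq. (4): move each zeta_l according to how over/under-represented length l is
    K = len(sample_lens)
    delta = np.bincount(sample_lens, minlength=len(zeta))[: len(zeta)] / K
    zeta = zeta.copy()
    zeta[1:] = zeta[1:] + gamma_zeta * delta[1:] / np.maximum(length_probs[1:], eps)

    # Eq. (5): re-anchor so that zeta_1 stays 0
    return lam, zeta - zeta[1]
```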
It is shown in [6] that the convergence speed of log-linear model training is improved when the means and variances of the input features are normalized. Our update rule in Eq. (3) performs the normalization on the model side instead of explicitly normalizing the features, which is similar to [7]. As the empirical variances are calculated offline, this also reduces the computational and memory cost during the SA iterations. Our empirical results show that, compared to using the online estimated Hessian elements, which are noisy, using the exactly calculated empirical variances improves the convergence significantly, especially on large datasets with millions of features. Fig. 1 shows an example of the convergence curves of the SA training algorithm in [4] and the new improved SA.

Figure 1: Negative log-likelihoods on the PTB training set (y-axis) along the training iterations t (x-axis), comparing the SA training algorithm in [4] (blue dashed line) and the improved SA (red solid line).
2.3. Modeling Rare Very Long Sentences
Training TRF models on large corpora needs to address the challenge of handling rare but very long sentences. For example, in our experiments the maximum length in the Google 1-billion word corpus [8] is more than 1000. To reduce the number of sub-models and the size of the model space, we set the maximum length m of TRFs to a medium length (such as 100) and introduce a special sub-model $p(>m, x^{>m}; \lambda, \zeta)$ to represent the sentences longer than m, as follows:

$$p(>m, x^{>m}; \lambda, \zeta) = \frac{n_{>m}/n}{Z_1(\lambda) e^{\zeta_{>m}}} e^{\lambda^T f(x^{>m})} \quad (7)$$

where $\zeta = \{\zeta_1, \ldots, \zeta_m, \zeta_{>m}\}$ and $n_{>m}$ is the number of sentences longer than m in the training set. As the features $f(x^l)$ used in our study are usually both length-independent and position-independent (such as the counts of the n-grams observed in a sentence), calculating $f(x^{>m})$ is straightforward and few modifications to the training algorithm are needed. To estimate $\zeta_{>m}$, the maximum length in the trans-dimensional mixture sampling is set to be slightly larger than m (we set m + 2 in our experiments), and the relative frequency $\delta_{>m}(B^{(t)}) = K^{-1} \sum_{(j,x^j)\in B^{(t)}} 1(j > m)$ of sampled sentences longer than m is calculated to update $\zeta_{>m}$ based on Eq. (4) and Eq. (5).
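A small sketch of the bookkeeping this implies (the helper names are ours and only illustrative): all sentences longer than m share the single extra sub-model, and $\delta_{>m}$ is simply the fraction of sampled sentences exceeding m, plugged into the same updates as the other lengths.

```python
def length_bucket(l, m):
    """Map a sentence length to its sub-model: 1..m, or m + 1 for the '>m' sub-model."""
    return l if l <= m else m + 1

def delta_gt_m(sample_lens, m):
    """delta_{>m}(B^(t)) = K^{-1} * sum 1(j > m): relative frequency of sampled
    sentences longer than m, used to update zeta_{>m} via Eqs. (4)-(5)."""
    K = len(sample_lens)
    return sum(1 for j in sample_lens if j > m) / K
```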
3. Experiments
3.1. Configuration of TRFs
In the following experiments, we consider a variety of features
for our TRF LMs as shown in Tab.1, mainly based on word and
class information. Each word in the vocabulary is deterministically assigned to a single class, by running the word clustering
algorithm proposed in [9] on the training data.
In Tab.1, $w_i$, $c_i$, $i = 0, -1, \ldots, -9$ denote the word and its class at position offset $i$, e.g. $w_0$, $c_0$ denote the current word and its class. We first introduce the classic word/class n-gram features (denoted by "w"/"c") and the word/class skipping n-gram features (denoted by "ws"/"cs") [10]. Second, to demonstrate that long-span features can be naturally integrated in TRFs, we introduce higher-order features "wsh"/"csh", by considering two words/classes separated by a longer distance. Third, as an example of supporting heterogeneous features that combine different information, the crossing features "cpw" (meaning class-predict-word) are introduced. Finally, we introduce the tied long-skip-bigram features "tied" [11], in which the skip-bigrams with skipping distances from 6 to 9 share the same parameter. In this way we can leverage long-distance context without increasing the model size. Note that for all the feature types in Tab.1, only the features observed in the training data are used.

Type | Features
w    | (w−3 w−2 w−1 w0) (w−2 w−1 w0) (w−1 w0) (w0)
c    | (c−3 c−2 c−1 c0) (c−2 c−1 c0) (c−1 c0) (c0)
ws   | (w−3 w0) (w−3 w−2 w0) (w−3 w−1 w0) (w−2 w0)
cs   | (c−3 c0) (c−3 c−2 c0) (c−3 c−1 c0) (c−2 c0)
wsh  | (w−4 w0) (w−5 w0)
csh  | (c−4 c0) (c−5 c0)
cpw  | (c−3 c−2 c−1 w0) (c−2 c−1 w0) (c−1 w0)
tied | (c−9:−6 , c0) (w−9:−6 , w0)

Table 1: Feature definition in TRF LMs.
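To make the feature definitions concrete, here is an illustrative extraction sketch for two of the feature types (our own toy code; real systems would intern features into integer IDs and, as noted above, keep only those observed in training):

```python
def ngram_features(words, classes, max_order=4):
    """'w' and 'c' features of Tab. 1: counts of word and class n-grams
    (orders 1..4) observed in one sentence."""
    feats = {}
    for seq, tag in ((words, "w"), (classes, "c")):
        for order in range(1, max_order + 1):
            for i in range(len(seq) - order + 1):
                key = (tag,) + tuple(seq[i:i + order])
                feats[key] = feats.get(key, 0) + 1
    return feats

def tied_skip_bigram_features(words, distances=range(6, 10)):
    """'tied' features: word skip-bigrams with skip distances 6-9 that share
    one parameter per word pair (the distance is dropped from the key)."""
    feats = {}
    for k in distances:
        for i in range(len(words) - k):
            key = ("tied-w", words[i], words[i + k])
            feats[key] = feats.get(key, 0) + 1
    return feats
```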
The improved SA algorithm (in Section 2.2) is used to train the TRF LMs, in conjunction with the trans-dimensional mixture sampling proposed in Section 3.3 of [4]. The learning rates of $\lambda$ and $\zeta$ are set as suggested in [4]:

$$\gamma_{\lambda,t} = \begin{cases} (t_c + t^{\beta_\lambda})^{-1}, & t \le t_0 \\ (t_c + t - t_0 + t_0^{\beta_\lambda})^{-1}, & t > t_0 \end{cases} \qquad \gamma_{\zeta,t} = \begin{cases} t^{-\beta_\zeta}, & t \le t_0 \\ (t - t_0 + t_0^{\beta_\zeta})^{-1}, & t > t_0 \end{cases} \quad (8)$$

where $t_c$, $t_0$ are constants and $0.5 < \beta_\lambda, \beta_\zeta < 1$. The class information is also used to accelerate the sampling, and multiple CPU cores are used to parallelize the algorithm, as described in [4].
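For concreteness, the schedule in Eq. (8) can be written as a small helper (a direct transcription; the constants, e.g. $\beta_\lambda = 0.8$, $\beta_\zeta = 0.6$, are passed in as in Sections 3.2-3.4):

```python
def learning_rates(t, t0, tc, beta_lam, beta_zeta):
    """Learning rates of Eq. (8) for lambda and zeta at iteration t >= 1."""
    if t <= t0:
        gamma_lam = 1.0 / (tc + t ** beta_lam)
        gamma_zeta = t ** (-beta_zeta)
    else:
        gamma_lam = 1.0 / (tc + t - t0 + t0 ** beta_lam)
        gamma_zeta = 1.0 / (t - t0 + t0 ** beta_zeta)
    return gamma_lam, gamma_zeta
```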
Also, we examine how the TRF models can be interpolated with NN models to further improve the performance. Linear combination can be done at either the word level (W) or the sentence level (S). In contrast, log-linear combination has no such differentiation. The three schemes are detailed in Tab.2. As TRF models only output sentence probabilities, the "W" scheme is not applicable when combining TRF models with other models.

Scheme | Combined score
W   | $s(x^l) = \prod_{i=1}^{l} [\alpha\, p_1(x_i|h_i) + (1-\alpha)\, p_2(x_i|h_i)]$
S   | $s(x^l) = \alpha\, p_1(x^l) + (1-\alpha)\, p_2(x^l)$
Log | $s(x^l) = \exp(\alpha \log p_1(x^l) + (1-\alpha) \log p_2(x^l))$

Table 2: Model interpolation schemes. $s(x^l)$ is the combined score of sentence $x^l$. $p_1(x_i|h_i)$ and $p_2(x_i|h_i)$ are the conditional probabilities of $x_i$ given history $h_i$ estimated by two LMs; $p_1(x^l)$ and $p_2(x^l)$ are the joint probabilities of the whole sentence $x^l$ estimated by the two LMs. $0 < \alpha < 1$ is the interpolation weight, which is tuned on the development set.
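The three schemes reduce to a few lines of code; the sketch below (function names are ours) operates on per-word conditional probabilities for the "W" scheme and on whole-sentence (log) probabilities for the "S" and "Log" schemes:

```python
import math

def combine_word_level(cond_probs1, cond_probs2, alpha):
    """'W': interpolate the two models' p(x_i | h_i) word by word, then multiply."""
    s = 1.0
    for p1, p2 in zip(cond_probs1, cond_probs2):
        s *= alpha * p1 + (1 - alpha) * p2
    return s

def combine_sentence_level(p1, p2, alpha):
    """'S': linear interpolation of the two whole-sentence probabilities."""
    return alpha * p1 + (1 - alpha) * p2

def combine_log_linear(log_p1, log_p2, alpha):
    """'Log': s(x^l) = exp(alpha*log p1 + (1-alpha)*log p2).
    Returning the log score is enough for n-best re-ranking."""
    return alpha * log_p1 + (1 - alpha) * log_p2
```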
3.2. Using training data of up to 32 million words
In this section, we show the scalability of TRF LMs as well as
the effectiveness of the improved SA training algorithm to handle tens of millions of features. The experiments are performed
on part of the Google 1-billion word corpus². The whole training corpus contains 99 files and each file contains about 8M words. The whole held-out corpus contains 50 files and each file contains about 160K words. In our experiments, we choose the first and second files (i.e. "news.en.heldout-00000-of-00050" and "news.en.heldout-00001-of-00050") in the held-out corpus as the development set and test set respectively, and incrementally increase the training set used in our experiments.

² https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark
First, we use one training file (about 8 million words) as the training set. We extract the 20K most frequent words from the training set to construct the lexicon. Then, for the training, development and test sets, all the words outside the lexicon are mapped to an auxiliary token <UNK>. The word clustering algorithm in [9] is performed on the training set to cluster the words into 200 classes. We train a TRF model with feature type "w+c+ws+cs+wsh+csh" (Tab.1). Then we increase the training size to around 16 million words (2 training files). We again extract the 20K most frequent words from the current training set to construct a new lexicon. After re-clustering the words into 200 classes, a new TRF model with the same feature type "w+c+ws+cs+wsh+csh" is trained over the new training set. We repeat the above process and train the third TRF model on 32 million words (4 training files).
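A minimal sketch of this lexicon-building and OOV-mapping step (our own helper code, not part of the released scripts):

```python
from collections import Counter

def build_lexicon(train_sentences, size=20000, unk="<UNK>"):
    """Take the `size` most frequent training words as the lexicon."""
    counts = Counter(w for sent in train_sentences for w in sent)
    return {w for w, _ in counts.most_common(size)} | {unk}

def map_oov(sentence, lexicon, unk="<UNK>"):
    """Replace every out-of-lexicon word by the auxiliary token <UNK>."""
    return [w if w in lexicon else unk for w in sentence]
```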
In the experiments, the maximum length of TRFs is m =
100. At each iteration, we generate K = 300 samples with
length ranging from 1 to 102. The learning rates γλ,t and γζ,t are configured as in Eq. (8) with βλ = 0.8, βζ = 0.6, tc = 1000, t0 = 2000 and tmax = 50,000. L2 regularization with constant
10−5 is used to avoid over-fitting. 12 CPU cores are used to
parallelize the training algorithm.
The perplexities on the test set are shown in Tab.3 and Fig.2. These results show that the TRF LMs can scale to using training data of up to 32 million words. They consistently achieve 10% relative perplexity reductions over the modified Kneser-Ney (KN) smoothed 5-gram LMs [1], and notably the model sizes in the comparisons are close.

models | 8M: PPL / #feat | 16M: PPL / #feat | 32M: PPL / #feat
KN4    | 135 / 6.0       | 118 / 10.5       | 105 / 18.1
KN5    | 134 / 11.9      | 117 / 21.5       | 102 / 38.6
TRF    | ∼122 / 14.3     | ∼109 / 24.2      | ∼92 / 36.3

Table 3: The perplexities (PPL) of various LMs with different sizes of training data (8M, 16M, 32M) from the Google 1-billion word corpus. The cutoff settings of the n-gram LMs are 0002 (KN4) and 00002 (KN5). The feature type of the TRF is "w+c+ws+cs+wsh+csh". "#feat" is the feature size (million).

Figure 2: The perplexities (PPL, y-axis) of the KN4, KN5 and TRF LMs versus training size in millions of words (x-axis), trained on part of the Google 1-billion word corpus.
Model | WER | PPL | #feat (M)
KN5 (cutoff=00000)  | 8.61 | 293.94 | 2.3
KN6 (cutoff=000000) | 8.61 | 293.67 | 3.0
RNN [3]             | 7.96 | 256.15 | 5.1
TRFs (200 classes) in [4]:
  w+c+ws+cs+cpw     | 7.92 | ∼265 | 6.4
  w+c+ws+cs+wsh+csh | 7.94 | ∼266 | 5.9
TRFs (200 classes) trained by the improved SA in Section 2.2:
  w+c+ws+cs+cpw            | 7.92 | ∼257 | 6.4
  w+c+ws+cs+wsh+csh        | 7.94 | ∼259 | 5.9
  w+c+ws+cs+wsh+csh+tied*  | 7.85 | ∼261 | 7.7
Model Combination:
  RNN + KN5 (W)    | 7.85
  RNN + KN5 (S)    | 7.96
  RNN + KN5 (Log)  | 7.96
  TRF* + RNN (S)   | 7.51
  TRF* + RNN (Log) | 7.57

Table 4: The WERs and PPLs on the WSJ'92 test data. "#feat" denotes the feature size (million). "(W), (S), (Log)" denote the different model combination schemes defined in Tab. 2. "TRF*" denotes the TRF with features "w+c+ws+cs+wsh+csh+tied".
3.3. English speech recognition on WSJ0 dataset
In this section, speech recognition and 1000-best list rescoring
experiments are conducted as configured in [4]. The maximum
length of TRFs is m = 82, which is equal to the maximum
length of the training sentences. The other configurations are:
K = 300, βλ = 0.8, βζ = 0.6, tc = 3000, t0 = 2000,
tmax = 20,000. L2 regularization with constant 4 × 10−5 is
used to avoid over-fitting. 6 CPU cores are used to parallelize
the algorithm. The word error rates (WERs) and perplexities
(PPLs) on WSJ’92 test set are shown in Tab.4. The TRF LMs
are compared with the classic KN n-gram LMs, the RNN LM
[3] and the results reported in [4].
Compared to the results in [4], the improved SA proposed in Section 2.2 gives the same WERs but lower PPLs when using the same feature types. Introducing the tied skip-bigram features further reduces the WER. Combining TRF and KN5 provides no WER reduction. Different schemes give close WERs when combining TRF and RNN. The log-linear interpolation performs more stably when considering both the English and Chinese experiments (as shown later). For English, the obtained WER of 7.57%
indicates 12.1% and 3.6% relative reductions, when compared
to the result of using KN6 (8.61%) and the best result of combining RNN and KN5 (7.85%) respectively.
3.4. Chinese speech recognition on Toshiba dataset
In this section we report the results from using TRF LMs in a
large-vocabulary Mandarin speech recognition experiment. Different LMs are evaluated by rescoring the 30000-best lists from a Toshiba internal test set (2975 utterances), which are generated with a DNN-based acoustic model. The oracle character error rate (CER) of the 30000-best lists is 1.61%. The LM corpus is from Toshiba and contains about 20M words. We randomly select 1% of the corpus as the development set and the rest as the training set. The vocabulary contains 82K words, with one special token <UNK>. The NN LM used here is the feed-forward neural network (FNN) LM [2, 12] trained by the CSLM
toolkit³. The number of hidden units is 512 and the size of the projection layer is 3×128. The TRF models are trained using the feature set "w+c+ws+cs+cpw" with different numbers of classes (200, 400, 600). The configurations are: m = 100, βλ = 0.8, βζ = 0.6, tc = 1000, t0 = tmax = 20000. The sample number K is increased from 300 until no improvement on the development set is observed, and is finally set to 8000. 20 CPU cores are used to parallelize the algorithm. The CERs and PPLs on the test set are shown in Tab.5.

³ http://www-lium.univ-lemans.fr/cslm/

Model | CER | PPL | #feat (M)
KN5 (cutoff=00000)  | 4.88 | 227.92 | 49
KN6 (cutoff=000000) | 4.87 | 227.76 | 64
FNN [12]            | 4.30 | 206.87 | 53
TRFs ("w+c+ws+cs+cpw"):
  200 classes  | 4.52 | ∼207 | 47
  400 classes* | 4.32 | ∼191 | 58
  600 classes  | 4.35 | ∼186 | 61
Model Combination:
  FNN + KN6 (W)    | 4.06
  FNN + KN6 (S)    | 4.16
  FNN + KN6 (Log)  | 4.11
  TRF* + FNN (S)   | 4.09
  TRF* + FNN (Log) | 4.0

Table 5: The CERs and PPLs on the test set in Chinese speech recognition. "TRF*" denotes the TRF trained with 400 classes.
The results again demonstrate that the TRF LMs significantly outperform the n-gram LMs and are able to match the
NN LM. Log-linear combination of TRF and FNN further reduces the CER. The obtained CER 4.0% indicates 17.9% and
1.5% relative reductions, when compared to the result of using
KN6 (4.87%) and the best result of combining FNN and KN6
(4.06%) respectively.
4. Related Work and Discussion
Currently, most attention in LM research is devoted to neural networks, which have been shown to surpass the classic n-gram LMs. Two basic classes of NN LMs are based on FNNs [2] and RNNs [3]. Recent extensions involve the use of sum-product networks [13], deep recurrent neural networks [14] and
feedforward sequential memory networks [15]; only perplexity
results are reported in these studies. Crucially, no matter what
form the networks take, various NN LMs follow the conditional
approach and thus suffer from expensive softmax computations due to the requirement of local normalization. Many studies aim to alleviate this deficiency. Initial efforts include using a hierarchical output layer structure with word clustering [3] and converting NNs to n-gram LMs [16]. Recently, a number of studies [17, 18, 19, 20] make use of noise contrastive estimation (NCE) [21] to build unnormalized variants of NN LMs, avoiding local normalization in training and heuristically fixing the normalizing term in testing.
In contrast, TRF LMs eliminate local normalization at the root and thus are much more efficient in testing, with theoretical guarantees. Empirically, in our experiments reported in Section 3.3, the average time costs for re-ranking the 1000-best list of a sentence are 0.16 sec vs. 40 sec for TRF and RNN respectively (no GPU used). Equally importantly, the evaluations in this paper and in [4] have shown that TRF LMs are able to perform as well as NN LMs (either RNN or FNN) on a variety of tasks. Encouragingly, the random field approach may open a new door to language modeling in addition to the dominant conditional approach, as once envisioned in [5]. Integrating richer features and introducing hidden variables are worthwhile directions for future work.
5. References
[1] S. F. Chen and J. Goodman, “An empirical study of smoothing
techniques for language modeling,” Computer Speech & Language, vol. 13, pp. 359–394, 1999.
[2] H. Schwenk, “Continuous space language models,” Computer
Speech & Language, vol. 21, pp. 492–518, 2007.
[3] T. Mikolov, S. Kombrink, L. Burget, J. H. Cernocky, and S. Khudanpur, “Extensions of recurrent neural network language model,”
in Proc. ICASSP, 2011.
[4] B. Wang, Z. Ou, and Z. Tan, “Trans-dimensional random fields for
language modeling,” in Proceedings of the Association for Computational Linguistics (ACL), 2015.
[5] R. Rosenfeld, S. F. Chen, and X. Zhu, “Whole-sentence exponential language models: a vehicle for linguistic-statistical integration,” Computer Speech & Language, vol. 15, pp. 55–73, 2001.
[6] S. Wiesler and H. Ney, “A convergence analysis of log-linear
training,” in Advances in Neural Information Processing Systems
(NIPS), 2011.
[7] S. Wiesler, A. Richard, R. Schluter, and H. Ney, “Meannormalized stochastic gradient for large-scale deep learning,” in
Proc. ICASSP, 2014, pp. 180–184.
[8] C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn,
and T. Robinson, “One billion word benchmark for measuring progress in statistical language modeling,” arXiv preprint
arXiv:1312.3005, 2013.
[9] S. Martin, J. Liermann, and H. Ney, “Algorithms for bigram and
trigram word clustering,” Speech Communication, vol. 24, pp. 19–
37, 1998.
[10] J. Goodman, “A bit of progress in language modeling,” Computer
Speech & Language, vol. 15, pp. 403–434, 2001.
[11] N. Shazeer, J. Pelemans, and C. Chelba, “Sparse non-negative matrix language modeling for skip-grams,” in Proc. INTERSPEECH,
2015.
[12] H. Schwenk, "CSLM - a modular open-source continuous space
language modeling toolkit.” in Proc. INTERSPEECH, 2013, pp.
1198–1202.
[13] W.-C. Cheng, S. Kok, H. V. Pham, H. L. Chieu, and K. M. A.
Chai, “Language modeling with sum-product networks.” in INTERSPEECH, 2014.
[14] W. Zaremba, I. Sutskever, and O. Vinyals, “Recurrent neural network regularization,” arXiv preprint arXiv:1409.2329, 2014.
[15] S. Zhang, C. Liu, H. Jiang, S. Wei, L. Dai, and Y. Hu, “Feedforward sequential memory networks: A new structure to learn
long-term dependency,” arXiv preprint arXiv:1512.08301, 2015.
[16] E. Arisoy, S. F. Chen, B. Ramabhadran, and A. Sethy, “Converting neural network language models into back-off language
models for efficient decoding in automatic speech recognition,”
IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 1, pp. 184–192, 2014.
[17] A. Mnih and Y. W. Teh, “A fast and simple algorithm for
training neural probabilistic language models,” arXiv preprint
arXiv:1206.6426, 2012.
[18] A. Vaswani, Y. Zhao, V. Fossum, and D. Chiang, “Decoding
with large-scale neural language models improves translation.” in
EMNLP, 2013, pp. 1387–1392.
[19] A. Sethy, S. Chen, E. Arisoy, and B. Ramabhadran, “Unnormalized exponential and neural network language models,” in Proc.
ICASSP, 2015.
[20] X. Chen, X. Liu, M. Gales, and P. Woodland, “Recurrent neural
network language model training with noise contrastive estimation for speech recognition,” in Proc. ICASSP, 2015, pp. 5411–
5415.
[21] M. U. Gutmann and A. Hyvärinen, "Noise-contrastive estimation of unnormalized statistical models, with application to natural image statistics," Journal of Machine Learning Research,
vol. 13, no. 1, pp. 307–361, 2012.