Improving and Scaling Trans-dimensional Random Field Language Models

Bin Wang (1), Zhijian Ou (1), Yong He (2), Akinori Kawamura (2)
(1) Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
(2) Toshiba Corporation
contact email: [email protected]

arXiv:1603.09170v1 [cs.CL] 30 Mar 2016

This work is supported by NSFC grant 61473168, Tsinghua Initiative 20121088069 and Toshiba Corporation.

Abstract

The dominant language models (LMs), such as n-gram and neural network (NN) models, represent sentence probabilities in terms of conditionals. In contrast, a new trans-dimensional random field (TRF) LM, in which the whole sentence is modeled as a random field, has recently been introduced and shown superior performance. In this paper, we further develop the TRF LMs with two technical improvements: a new method of exploiting second-order information in parameter optimization, which further enhances the convergence of the training algorithm, and a method that enables training TRF LMs on large corpora that may contain rare very long sentences. Experiments show that the TRF LMs can scale to training data of up to 32 million words, consistently achieve 10% relative perplexity reductions over 5-gram LMs, and perform as well as NN LMs but with much faster speed in calculating sentence probabilities. Moreover, we examine how the TRF models can be interpolated with NN models, and obtain 12.1% and 17.9% relative error rate reductions over 6-gram LMs for English and Chinese speech recognition respectively through log-linear combination.

Index Terms: Language modeling, random fields, stochastic approximation

1. Introduction

Language modeling (LM) involves determining the joint probability of words in a sentence. The conditional approach is dominant, representing the joint probability in terms of conditionals; examples include n-gram LMs [1] and neural network (NN) LMs [2, 3]. We have recently introduced a new trans-dimensional random field (TRF) LM [4], where the whole sentence is modeled as a random field. (Instead of the abbreviation "TDRF" used in [4], we use "TRF" for simplicity in this paper.) As the random field approach avoids the local normalization required by the conditional approach, it is computationally more efficient in computing sentence probabilities and has the potential advantage of flexibly integrating a richer set of features. We developed an effective training algorithm using joint stochastic approximation (SA) and trans-dimensional mixture sampling. We found that the TRF models significantly outperformed the modified Kneser-Ney (KN) smoothed 4-gram LM, with a 9.1% relative reduction in speech recognition word error rate (WER), and performed slightly better than recurrent neural network LMs but with 200x faster speed in rescoring n-best lists of hypothesized sentences. To our knowledge, this result represents the first strong empirical evidence supporting the power of the whole-sentence random field approach for LMs [5].

In this paper, we further develop the TRF LMs with two technical improvements. Moreover, we perform more experiments to investigate whether the TRF models can scale to larger corpora and how the TRF models can be interpolated with NN models to further improve performance.

Improvements: First, in [4], the diagonal elements of the Hessian matrices are estimated online during the SA iterations to rescale the gradients, which is shown to benefit the convergence of the training algorithm.
In this paper, inspired by [6, 7], we propose a simpler but more effective method which directly uses the empirical variances to rescale the gradients. As the empirical variances are calculated offline, this also reduces the computational and memory cost during the SA iterations. The experimental results show that this new manner of exploiting second-order information in parameter optimization further enhances the convergence of the training algorithm. Second, to enable the training of TRF LMs on large corpora which may contain rare very long sentences, we introduce a special sub-model for the sequences longer than a fixed length.

Experiments: First, we examine the scalability of the TRF LMs by incrementally increasing the size of the training set. Based on training data of up to 32 million words from the Google 1-billion word corpus [8], we build TRF LMs with up to 36 million features. The TRF LMs consistently achieve 10% relative perplexity reductions over the KN 5-gram LMs. Then speech recognition experiments are conducted for both English and Chinese. Three kinds of interpolation schemes, namely linear interpolation (word-level or sentence-level) and log-linear interpolation, are evaluated for model combination between KN n-gram LMs, NN LMs and our TRF LMs. Through log-linearly combining TRF LMs with NN LMs, we obtain 12.1% and 17.9% relative error rate reductions over the KN 6-gram LMs for English and Chinese speech recognition respectively.

2. Improvements

2.1. Background of TRF Model Definition and Training

Throughout, we denote by x^l = (x_1, ..., x_l) a sentence (i.e., word sequence) of length l, with l ranging from 1 to m. Each element of x^l corresponds to a single word. D denotes the whole training corpus and D_l denotes the collection of sentences of length l in the training corpus. n_l denotes the size of D_l and n = \sum_{l=1}^{m} n_l. As defined in [4], a trans-dimensional random field model represents the joint probability of the pair (l, x^l) as

    p(l, x^l; \lambda) = \frac{n_l}{n} \cdot \frac{e^{\lambda^T f(x^l)}}{Z_l(\lambda)},    (1)

where n_l/n is the empirical probability of length l. f(x^l) = (f_1(x^l), ..., f_d(x^l))^T is the feature vector, which is usually defined to be position-independent and length-independent, e.g. the n-grams; d is the dimension of the feature vector and \lambda is the corresponding parameter vector. Z_l(\lambda) = \sum_{x^l} e^{\lambda^T f(x^l)} is the normalization constant of length l. By making explicit the role of length in the model definition, it is clear that the model in (1) is a mixture of random fields on sentences of different lengths (namely on subspaces of different dimensions), and hence is called a trans-dimensional random field (TRF).

In the joint SA training algorithm [4], we define another form of mixture distribution as follows:

    p(l, x^l; \lambda, \zeta) = \frac{n_l}{n} \cdot \frac{e^{\lambda^T f(x^l)}}{Z_1(\lambda) e^{\zeta_l}},    (2)

where \zeta = \{\zeta_1, ..., \zeta_m\} with \zeta_1 = 0, and \zeta_l is the hypothesized value of the log ratio of Z_l(\lambda) with respect to Z_1(\lambda), namely \log(Z_l(\lambda)/Z_1(\lambda)). Z_1(\lambda) is chosen as the reference value and can be calculated exactly. An important observation is that the marginal probability of length l under distribution (2) equals n_l/n if and only if \zeta equals the true log ratios. We use this property to construct the joint SA algorithm, which jointly estimates the model parameters and the normalization constants.
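To make the model form concrete, the following minimal sketch (our own illustration, not the authors' implementation) scores a sentence under Eq. (2). Feature extraction is shown with plain word n-gram counts only, whereas the full feature set used later (Tab. 1) also includes class and skipping features; the names ngram_count_features, trf_log_score, lam, zeta and length_prob are illustrative stand-ins for the learned quantities.

    import math
    from collections import Counter

    def ngram_count_features(sentence, order=3):
        # Count the n-grams (orders 1..order) observed in the word sequence.
        feats = Counter()
        for k in range(1, order + 1):
            for i in range(len(sentence) - k + 1):
                feats[tuple(sentence[i:i + k])] += 1
        return feats

    def trf_log_score(sentence, lam, zeta, length_prob):
        # log p(l, x^l; lambda, zeta) of Eq. (2), up to the common constant
        # -log Z_1(lambda):  log(n_l/n) + lambda^T f(x^l) - zeta_l.
        l = len(sentence)
        feats = ngram_count_features(sentence)
        potential = sum(lam.get(f, 0.0) * c for f, c in feats.items())
        return math.log(length_prob[l]) + potential - zeta[l]

For n-best rescoring, only relative scores matter, so dropping the common constant log Z_1(\lambda) is harmless.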
2.2. Improved Stochastic Approximation

In [4], in order to make use of Hessian information in parameter optimization, the online estimated diagonal Hessian elements are used to rescale the gradients. In this paper, inspired by [6, 7], we propose a simpler but more effective method which directly uses the empirical variances to rescale the gradients. The improved SA algorithm is described as follows. At each iteration t (from 1 to t_max), we perform two steps.

Step I (MCMC sampling): Generate a sample set B^{(t)} with p(l, x^l; \lambda^{(t-1)}, \zeta^{(t-1)}) as the stationary distribution, using the trans-dimensional mixture sampling method (see Section 3.3 in [4]).

Step II (SA updating): Compute

    \lambda^{(t)} = \lambda^{(t-1)} + \gamma_{\lambda,t} \sigma^{-1} \left( \frac{1}{n} \sum_{(l,x^l) \in D} f(x^l) - \frac{1}{K} \sum_{(l,x^l) \in B^{(t)}} f(x^l) \right),    (3)

    \zeta^{(t-1/2)} = \zeta^{(t-1)} + \gamma_{\zeta,t} \left( \frac{\delta_1(B^{(t)})}{n_1/n}, ..., \frac{\delta_m(B^{(t)})}{n_m/n} \right),    (4)

    \zeta^{(t)} = \zeta^{(t-1/2)} - \zeta_1^{(t-1/2)},    (5)

where \gamma_{\lambda,t}, \gamma_{\zeta,t} are the learning rates of \lambda and \zeta, K is the size of the sample set B^{(t)}, and \delta_l(B^{(t)}) = K^{-1} \sum_{(j,x^j) \in B^{(t)}} 1(j = l) is the relative frequency of length l appearing in B^{(t)}. In Eq. (3), \sigma = diag(\sigma_1, ..., \sigma_d) is a diagonal matrix whose element \sigma_i (1 \le i \le d) is the empirical variance of feature f_i:

    \sigma_i = \sum_{l=1}^{m} \frac{n_l}{n} \left( \tilde{p}_l[f_i^2] - (\tilde{p}_l[f_i])^2 \right),    (6)

where \tilde{p}_l[f_i^2] = n_l^{-1} \sum_{x^l \in D_l} f_i^2(x^l) and \tilde{p}_l[f_i] = n_l^{-1} \sum_{x^l \in D_l} f_i(x^l).

It is shown in [6] that the convergence speed of log-linear model training is improved when the means and variances of the input features are normalized. Our update rule in Eq. (3) performs the normalization on the model side instead of explicitly normalizing the features, which is similar to [7]. As the empirical variances are calculated offline, this also reduces the computational and memory cost during the SA iterations. Our empirical results show that, compared to using the online estimated Hessian elements, which are noisy, using the exactly calculated empirical variances improves the convergence significantly, especially on large datasets with millions of features. Fig. 1 shows an example of the convergence curves of the SA training algorithm in [4] and the new improved SA.

Figure 1: To compare the convergence of the SA training algorithm in [4] (blue dashed line) and the improved SA (red solid line), we plot an example of the negative log-likelihoods on the PTB training set along the iterations.
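As a concrete illustration of Step II, the sketch below renders the updates of Eqs. (3)-(5) with NumPy. It assumes features are indexed 0..d-1, that the training-data feature expectation (the first sum in Eq. (3)) and the variances of Eq. (6) have been precomputed offline, and that arrays over lengths carry an unused slot 0; the function name and argument layout are ours, not the authors' code.

    import numpy as np

    def sa_update(lam, zeta, empirical_mean, sigma, sample_features,
                  sample_lengths, length_prob, gamma_lam, gamma_zeta):
        # lam:             (d,) parameter vector lambda^(t-1)
        # zeta:            (m+1,) log-ratio estimates; index l = 1..m, slot 0 unused
        # empirical_mean:  (d,) training-data feature expectation, precomputed offline
        # sigma:           (d,) empirical feature variances of Eq. (6)
        # sample_features: (K, d) stacked f(x^l) for the K sampled sentences in B^(t)
        # sample_lengths:  (K,) integer lengths of the sampled sentences
        # length_prob:     (m+1,) empirical length probabilities n_l/n, slot 0 unused
        K = sample_features.shape[0]
        # Eq. (3): stochastic gradient rescaled by the inverse empirical variances.
        lam = lam + gamma_lam * (empirical_mean - sample_features.mean(axis=0)) / sigma
        # Eq. (4): delta_l(B^(t)) is the relative frequency of length l in the samples.
        delta = np.bincount(sample_lengths, minlength=zeta.size) / K
        zeta = zeta.copy()
        zeta[1:] = zeta[1:] + gamma_zeta * delta[1:] / length_prob[1:]
        # Eq. (5): shift so that zeta_1 stays at 0 (the reference length).
        zeta = zeta - zeta[1]
        return lam, zeta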
2.3. Modeling Rare Very Long Sentences

Training TRF models on a large corpus needs to address the challenge of handling rare very long sentences. For example, in our experiments the maximum sentence length in the Google 1-billion word corpus [8] is more than 1000. To reduce the number of sub-models and the size of the model space, we set the maximum length m of the TRF to a medium length (such as 100) and introduce a special sub-model p(>m, x^{>m}; \lambda, \zeta) to represent the sentences longer than m, as follows:

    p(>m, x^{>m}; \lambda, \zeta) = \frac{n_{>m}}{n} \cdot \frac{e^{\lambda^T f(x^{>m})}}{Z_1(\lambda) e^{\zeta_{>m}}},    (7)

where \zeta = \{\zeta_1, ..., \zeta_m, \zeta_{>m}\} and n_{>m} is the number of sentences longer than m in the training set. As the features f(x^l) used in our study are usually both length-independent and position-independent (such as the counts of the n-grams observed in a sentence), calculating f(x^{>m}) is straightforward and few modifications to the training algorithm are needed. To estimate \zeta_{>m}, the maximum length in the trans-dimensional mixture sampling is set to be slightly larger than m (we use m + 2 in our experiments), and the relative frequency of lengths greater than m in the sample set, \delta_{>m}(B^{(t)}) = K^{-1} \sum_{(j,x^j) \in B^{(t)}} 1(j > m), is calculated to update \zeta_{>m} based on Eq. (4) and Eq. (5).

3. Experiments

3.1. Configuration of TRFs

In the following experiments, we consider a variety of features for our TRF LMs, as shown in Tab. 1, mainly based on word and class information. Each word in the vocabulary is deterministically assigned to a single class by running the word clustering algorithm proposed in [9] on the training data. In Tab. 1, w_i, c_i, i = 0, -1, ..., -9 denote the word and its class at position offset i, e.g. w_0, c_0 denote the current word and its class. We first introduce the classic word/class n-gram features (denoted by "w"/"c") and the word/class skipping n-gram features (denoted by "ws"/"cs") [10]. Second, to demonstrate that long-span features can be naturally integrated in TRFs, we introduce higher-order features "wsh"/"csh", which consider two words/classes separated by a longer distance. Third, as an example of supporting heterogeneous features that combine different information, the crossing features "cpw" (meaning class-predict-word) are introduced. Finally, we introduce the tied long-skip-bigram features "tied" [11], in which the skip-bigrams with skipping distances from 6 to 9 share the same parameter; in this way we can leverage long-distance context without increasing the model size. Note that for all the feature types in Tab. 1, only the features observed in the training data are used.

Table 1: Feature definition in TRF LMs.
  Type   Features
  w      (w-3 w-2 w-1 w0)(w-2 w-1 w0)(w-1 w0)(w0)
  c      (c-3 c-2 c-1 c0)(c-2 c-1 c0)(c-1 c0)(c0)
  ws     (w-3 w0)(w-3 w-2 w0)(w-3 w-1 w0)(w-2 w0)
  cs     (c-3 c0)(c-3 c-2 c0)(c-3 c-1 c0)(c-2 c0)
  wsh    (w-4 w0)(w-5 w0)
  csh    (c-4 c0)(c-5 c0)
  cpw    (c-3 c-2 c-1 w0)(c-2 c-1 w0)(c-1 w0)
  tied   (c-9:-6, c0)(w-9:-6, w0)

The improved SA algorithm (Section 2.2) is used to train the TRF LMs, in conjunction with the trans-dimensional mixture sampling proposed in Section 3.3 of [4]. The learning rates of \lambda and \zeta are set as suggested in [4]:

    for t <= t_0:  \gamma_{\lambda,t} = (t_c + t^{\beta_\lambda})^{-1},               \gamma_{\zeta,t} = t^{-\beta_\zeta};
    for t > t_0:   \gamma_{\lambda,t} = (t_c + t - t_0 + t_0^{\beta_\lambda})^{-1},   \gamma_{\zeta,t} = (t - t_0 + t_0^{\beta_\zeta})^{-1},    (8)

where t_c, t_0 are constants and 0.5 < \beta_\lambda, \beta_\zeta < 1. The class information is also used to accelerate the sampling, and multiple CPU cores are used to parallelize the algorithm, as described in [4].

Also, we examine how the TRF models can be interpolated with NN models to further improve the performance. Linear combination can be done at either the word level (W) or the sentence level (S); in contrast, log-linear combination has no such differentiation. The three schemes are detailed in Tab. 2. As TRF models only output sentence probabilities, the "W" scheme is not applicable for combining TRF models with other models; a sketch of the three schemes is given after Tab. 2.

Table 2: Model interpolation schemes. s(x^l) is the combined score of sentence x^l. p_1(x_i|h_i) and p_2(x_i|h_i) are the conditional probabilities of x_i given history h_i estimated by two LMs; p_1(x^l) and p_2(x^l) are the joint probabilities of the whole sentence x^l estimated by the two LMs. 0 < \alpha < 1 is the interpolation weight, which is tuned on the development set.
  W     s(x^l) = \prod_{i=1}^{l} [\alpha p_1(x_i|h_i) + (1-\alpha) p_2(x_i|h_i)]
  S     s(x^l) = \alpha p_1(x^l) + (1-\alpha) p_2(x^l)
  Log   s(x^l) = \exp(\alpha \log p_1(x^l) + (1-\alpha) \log p_2(x^l))
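The following minimal sketch spells out the three schemes of Tab. 2; the two LMs are abstracted as precomputed probabilities, and the function names are ours for illustration rather than code from the paper or any toolkit.

    import math

    def combine_word_level(cond_probs1, cond_probs2, alpha):
        # (W): interpolate the per-word conditionals p(x_i | h_i), then multiply.
        return math.prod(alpha * p1 + (1 - alpha) * p2
                         for p1, p2 in zip(cond_probs1, cond_probs2))

    def combine_sentence_level(p1_sent, p2_sent, alpha):
        # (S): interpolate the whole-sentence probabilities.
        return alpha * p1_sent + (1 - alpha) * p2_sent

    def combine_log_linear(logp1_sent, logp2_sent, alpha):
        # (Log): log-linear combination of whole-sentence log scores; the scores
        # need not be normalized, so this applies directly to TRF scores.
        return alpha * logp1_sent + (1 - alpha) * logp2_sent

Because the log-linear score is only used to rank hypotheses, unnormalized TRF sentence scores of the form in Eq. (2) can be plugged in directly.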
3.2. Using training data of up to 32 million words

In this section, we show the scalability of the TRF LMs as well as the effectiveness of the improved SA training algorithm in handling tens of millions of features. The experiments are performed on part of the Google 1-billion word corpus (https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark). The whole training corpus contains 99 files and each file contains about 8M words. The whole held-out corpus contains 50 files and each file contains about 160K words. In our experiments, we choose the first and second files (i.e. "news.en.heldout-00000-of-00050" and "news.en.heldout-00001-of-00050") in the held-out corpus as the development set and test set respectively, and incrementally increase the training set.

First, we use one training file (about 8 million words) as the training set. We extract the 20K most frequent words from the training set to construct the lexicon. Then, for the training, development and test sets, all the words outside the lexicon are mapped to an auxiliary token <UNK>. The word clustering algorithm in [9] is performed on the training set to cluster the words into 200 classes. We train a TRF model with feature type "w+c+ws+cs+wsh+csh" (Tab. 1). Then we increase the training size to around 16 million words (2 training files). We again extract the 20K most frequent words from the current training set to construct a new lexicon. After re-clustering the words into 200 classes, a new TRF model with the same feature type "w+c+ws+cs+wsh+csh" is trained over the new training set. We repeat the above process and train a third TRF model on 32 million words (4 training files).

In these experiments, the maximum length of the TRFs is m = 100. At each iteration, we generate K = 300 samples with lengths ranging from 1 to 102. The learning rates \gamma_{\lambda,t} and \gamma_{\zeta,t} are configured as in Eq. (8) with \beta_\lambda = 0.8, \beta_\zeta = 0.6, t_c = 1000, t_0 = 2000 and t_max = 50,000. L2 regularization with constant 10^{-5} is used to avoid over-fitting. 12 CPU cores are used to parallelize the training algorithm.
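As a small worked example of this configuration, the sketch below instantiates the schedule of Eq. (8) with the constants listed above; the function name and its defaults are ours, for illustration only.

    def learning_rates(t, t_c=1000, t_0=2000, beta_lam=0.8, beta_zeta=0.6):
        # Learning-rate schedule of Eq. (8): a slow polynomial decay up to t_0,
        # then a roughly 1/t decay afterwards, separately for lambda and zeta.
        if t <= t_0:
            gamma_lam = 1.0 / (t_c + t ** beta_lam)
            gamma_zeta = t ** (-beta_zeta)
        else:
            gamma_lam = 1.0 / (t_c + t - t_0 + t_0 ** beta_lam)
            gamma_zeta = 1.0 / (t - t_0 + t_0 ** beta_zeta)
        return gamma_lam, gamma_zeta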
The perplexities on the test set are shown in Tab. 3 and Fig. 2. These results show that the TRF LMs can scale to training data of up to 32 million words. They consistently achieve 10% relative perplexity reductions over the modified Kneser-Ney (KN) smoothed 5-gram LMs [1], and notably the model sizes in the comparisons are close.

Table 3: The perplexities (PPL) of various LMs with different sizes of training data (8M, 16M, 32M) from the Google 1-billion word corpus. The cutoff settings of the n-gram LMs are 0002 (KN4) and 00002 (KN5). The feature type of the TRF is "w+c+ws+cs+wsh+csh". "#feat" is the feature size (million).
           8M             16M            32M
           PPL    #feat   PPL    #feat   PPL    #feat
  KN4      135    6.0     118    10.5    105    18.1
  KN5      134    11.9    117    21.5    102    38.6
  TRF      ~122   14.3    ~109   24.2    ~92    36.3

Figure 2: The perplexities (PPL) of the various LMs (KN4, KN5, TRF) trained on part of the Google 1-billion word corpus, plotted against training size (million words).

3.3. English speech recognition on WSJ0 dataset

In this section, speech recognition experiments with 1000-best list rescoring are conducted as configured in [4]. The maximum length of the TRFs is m = 82, which is equal to the maximum length of the training sentences. The other configurations are: K = 300, \beta_\lambda = 0.8, \beta_\zeta = 0.6, t_c = 3000, t_0 = 2000, t_max = 20,000. L2 regularization with constant 4×10^{-5} is used to avoid over-fitting. 6 CPU cores are used to parallelize the algorithm.

The word error rates (WERs) and perplexities (PPLs) on the WSJ'92 test set are shown in Tab. 4. The TRF LMs are compared with the classic KN n-gram LMs, the RNN LM [3] and the results reported in [4]. Compared to the results in [4], the improved SA proposed in Section 2.2 gives the same WERs but lower PPLs when using the same feature types. Introducing the tied skip-bigram features further reduces the WER. Combining TRF and KN5 provides no WER reduction. Different schemes give close WERs for combining TRF and RNN; the log-linear interpolation performs more stably when considering both the English and Chinese experiments (as shown later). For English, the obtained WER of 7.57% represents 12.1% and 3.6% relative reductions compared to the result of KN6 (8.61%) and the best result of combining RNN and KN5 (7.85%) respectively.

Table 4: The WERs and PPLs on the WSJ'92 test data. "#feat" denotes the feature size (million). "(W), (S), (Log)" denote the model combination schemes defined in Tab. 2. "TRF*" denotes the TRF with features "w+c+ws+cs+wsh+csh+tied".
  Model                                                     WER    PPL      #feat (M)
  KN5 (cutoff=00000)                                        8.61   293.94   2.3
  KN6 (cutoff=000000)                                       8.61   293.67   3.0
  RNN [3]                                                   7.96   256.15   5.1
  TRFs (200 classes) in [4]
    w+c+ws+cs+cpw                                           7.92   ~265     6.4
    w+c+ws+cs+wsh+csh                                       7.94   ~266     5.9
  TRFs (200 classes) trained by the improved SA in Section 2.2
    w+c+ws+cs+cpw                                           7.92   ~257     6.4
    w+c+ws+cs+wsh+csh                                       7.94   ~259     5.9
    w+c+ws+cs+wsh+csh+tied*                                 7.85   ~261     7.7
  Model Combination
    RNN + KN5 (W)                                           7.85
    RNN + KN5 (S)                                           7.96
    RNN + KN5 (Log)                                         7.96
    TRF* + RNN (S)                                          7.51
    TRF* + RNN (Log)                                        7.57

3.4. Chinese speech recognition on Toshiba dataset

In this section we report results from using TRF LMs in a large-vocabulary Mandarin speech recognition experiment. Different LMs are evaluated by rescoring the 30000-best lists from Toshiba's internal test set (2975 utterances). The 30000-best lists are generated with a DNN-based acoustic model, and their oracle character error rate (CER) is 1.61%. The LM corpus is from Toshiba and contains about 20M words. We randomly select 1% of the corpus as the development set and use the rest as the training set. The vocabulary contains 82K words, including one special token <UNK>. The NN LM used here is a feedforward neural network (FNN) LM [2, 12] trained with the CSLM toolkit (http://www-lium.univ-lemans.fr/cslm/). The number of hidden units is 512 and the projection layer has 3×128 units. The TRF models are trained using the feature set "w+c+ws+cs+cpw" with different numbers of classes (200, 400, 600). The configurations are: m = 100, \beta_\lambda = 0.8, \beta_\zeta = 0.6, t_c = 1000, t_0 = t_max = 20,000. The sample number K is increased from 300 until no improvement on the development set is observed, and is finally set to 8000. 20 CPU cores are used to parallelize the algorithm.

The CERs and PPLs on the test set are shown in Tab. 5. The results again demonstrate that the TRF LMs significantly outperform the n-gram LMs and are able to match the NN LM. Log-linear combination of TRF and FNN further reduces the CER. The obtained CER of 4.0% represents 17.9% and 1.5% relative reductions compared to the result of KN6 (4.87%) and the best result of combining FNN and KN6 (4.06%) respectively.

Table 5: The CERs and PPLs on the test set in Chinese speech recognition. "TRF*" denotes the TRF trained with 400 classes.
  Model                      CER    PPL      #feat (M)
  KN5 (cutoff=00000)         4.88   227.92   49
  KN6 (cutoff=000000)        4.87   227.76   64
  FNN [12]                   4.30   206.87   53
  TRFs ("w+c+ws+cs+cpw")
    200 classes              4.52   ~207     47
    400 classes*             4.32   ~191     58
    600 classes              4.35   ~186     61
  Model Combination
    FNN + KN6 (W)            4.06
    FNN + KN6 (S)            4.16
    FNN + KN6 (Log)          4.11
    TRF* + FNN (S)           4.09
    TRF* + FNN (Log)         4.0
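For completeness, the following sketch shows one way the n-best rescoring used in Sections 3.3 and 3.4 can be organized; the hypothesis format, the acoustic/LM scaling and the function names are our own illustrative assumptions rather than the actual recipe used in these experiments.

    def rescore_nbest(hypotheses, trf_logprob, nn_logprob, alpha, lm_scale):
        # hypotheses: list of (words, acoustic_logprob) pairs from the recognizer.
        # The LM score is the log-linear (Log) combination of the TRF and NN
        # sentence log scores; the best-scoring word sequence is returned.
        def total_score(hyp):
            words, acoustic_logprob = hyp
            lm = alpha * trf_logprob(words) + (1 - alpha) * nn_logprob(words)
            return acoustic_logprob + lm_scale * lm
        return max(hypotheses, key=total_score)[0]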
4. Related Work and Discussion

Currently, most attention in LM research is devoted to neural networks, which have been shown to surpass the classic n-gram LMs. Two basic classes of NN LMs are based on FNNs [2] and RNNs [3]. Recent extensions involve the use of sum-product networks [13], deep recurrent neural networks [14] and feedforward sequential memory networks [15]; only perplexity results are reported in these studies. Crucially, no matter what form the networks take, NN LMs follow the conditional approach and thus suffer from expensive softmax computations due to the requirement of local normalization. Many studies aim to alleviate this deficiency. Initial efforts include using a hierarchical output layer with word clustering [3] and converting NNs to n-gram LMs [16]. Recently, a number of studies [17, 18, 19, 20] make use of noise contrastive estimation (NCE) [21] to build unnormalized variants of NN LMs by avoiding local normalization in training and heuristically fixing the normalizing term in testing. In contrast, TRF LMs eliminate local normalization at the root and thus are much more efficient in testing, with a theoretical guarantee. Empirically, in our experiments reported in Section 3.3, the average time costs for re-ranking the 1000-best list of a sentence are 0.16 sec vs. 40 sec for the TRF and the RNN respectively (no GPU used). Equally importantly, the evaluations in this paper and in [4] have shown that TRF LMs are able to perform as well as NN LMs (either RNN or FNN) on a variety of tasks. Encouragingly, the random field approach may open a new door to language modeling in addition to the dominant conditional approach, as once envisioned in [5]. Integrating richer features and introducing hidden variables are worthwhile future work.

5. References

[1] S. F. Chen and J. Goodman, "An empirical study of smoothing techniques for language modeling," Computer Speech & Language, vol. 13, pp. 359–394, 1999.
[2] H. Schwenk, "Continuous space language models," Computer Speech & Language, vol. 21, pp. 492–518, 2007.
[3] T. Mikolov, S. Kombrink, L. Burget, J. H. Cernocky, and S. Khudanpur, "Extensions of recurrent neural network language model," in Proc. ICASSP, 2011.
[4] B. Wang, Z. Ou, and Z. Tan, "Trans-dimensional random fields for language modeling," in Proc. ACL, 2015.
[5] R. Rosenfeld, S. F. Chen, and X. Zhu, "Whole-sentence exponential language models: a vehicle for linguistic-statistical integration," Computer Speech & Language, vol. 15, pp. 55–73, 2001.
[6] S. Wiesler and H. Ney, "A convergence analysis of log-linear training," in Advances in Neural Information Processing Systems (NIPS), 2011.
[7] S. Wiesler, A. Richard, R. Schluter, and H. Ney, "Mean-normalized stochastic gradient for large-scale deep learning," in Proc. ICASSP, 2014, pp. 180–184.
[8] C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, and T. Robinson, "One billion word benchmark for measuring progress in statistical language modeling," arXiv preprint arXiv:1312.3005, 2013.
[9] S. Martin, J. Liermann, and H. Ney, "Algorithms for bigram and trigram word clustering," Speech Communication, vol. 24, pp. 19–37, 1998.
[10] J. Goodman, "A bit of progress in language modeling," Computer Speech & Language, vol. 15, pp. 403–434, 2001.
[11] N. Shazeer, J. Pelemans, and C. Chelba, "Sparse non-negative matrix language modeling for skip-grams," in Proc. INTERSPEECH, 2015.
[12] H. Schwenk, "CSLM - a modular open-source continuous space language modeling toolkit," in Proc. INTERSPEECH, 2013, pp. 1198–1202.
[13] W.-C. Cheng, S. Kok, H. V. Pham, H. L. Chieu, and K. M. A. Chai, "Language modeling with sum-product networks," in Proc. INTERSPEECH, 2014.
[14] W. Zaremba, I. Sutskever, and O. Vinyals, "Recurrent neural network regularization," arXiv preprint arXiv:1409.2329, 2014.
[15] S. Zhang, C. Liu, H. Jiang, S. Wei, L. Dai, and Y. Hu, "Feedforward sequential memory networks: A new structure to learn long-term dependency," arXiv preprint arXiv:1512.08301, 2015.
[16] E. Arisoy, S. F. Chen, B. Ramabhadran, and A. Sethy, "Converting neural network language models into back-off language models for efficient decoding in automatic speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 1, pp. 184–192, 2014.
[17] A. Mnih and Y. W. Teh, "A fast and simple algorithm for training neural probabilistic language models," arXiv preprint arXiv:1206.6426, 2012.
[18] A. Vaswani, Y. Zhao, V. Fossum, and D. Chiang, "Decoding with large-scale neural language models improves translation," in Proc. EMNLP, 2013, pp. 1387–1392.
[19] A. Sethy, S. Chen, E. Arisoy, and B. Ramabhadran, "Unnormalized exponential and neural network language models," in Proc. ICASSP, 2015.
[20] X. Chen, X. Liu, M. Gales, and P. Woodland, "Recurrent neural network language model training with noise contrastive estimation for speech recognition," in Proc. ICASSP, 2015, pp. 5411–5415.
[21] M. U. Gutmann and A. Hyvärinen, "Noise-contrastive estimation of unnormalized statistical models, with application to natural image statistics," Journal of Machine Learning Research, vol. 13, no. 1, pp. 307–361, 2012.