Li XM, Ouyang JH. Tuning the learning rate for stochastic variational inference. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 31(2): 428–436 Mar. 2016. DOI 10.1007/s11390-016-1636-4

Tuning the Learning Rate for Stochastic Variational Inference

Xi-Ming Li and Ji-Hong Ouyang∗, Member, CCF

College of Computer Science and Technology, Jilin University, Changchun 130000, China
Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130000, China

E-mail: [email protected]; [email protected]

Received November 10, 2014; revised July 27, 2015.

Abstract: Stochastic variational inference (SVI) can learn topic models from very big corpora. It optimizes the variational objective by stochastic natural gradient ascent with a decreasing learning rate. This rate is crucial for SVI; however, it is often tuned by hand in real applications. To address this, we develop a novel algorithm that tunes the learning rate of each iteration adaptively. The proposed algorithm uses the Kullback-Leibler (KL) divergence to measure the similarity between the variational distribution with noisy update and that with batch update, and then optimizes the learning rates by minimizing this KL divergence. We apply our algorithm to two representative topic models: latent Dirichlet allocation and hierarchical Dirichlet process. Experimental results indicate that our algorithm performs better and converges faster than commonly used learning rates.

Keywords: stochastic variational inference, online learning, adaptive learning rate, topic model

1 Introduction

Topic modeling algorithms are popular tools for analyzing text document collections and for solving real problems in many practical applications[1-2]. However, traditional batch inference algorithms for topic models are inefficient and impractical for big corpora containing hundreds of thousands of documents.
To this end, some publications develop parallel inference algorithms[3-5], and some propose online inference algorithms[6-14]. Among these online algorithms, stochastic variational inference (SVI)[13] is representative. It has been successfully applied to latent Dirichlet allocation (LDA)[8,15], hierarchical Dirichlet process (HDP)[16-17], time-series models[18] and matrix factorization[19]. As an extension of the batch variational inference (batch VI) algorithm, SVI maximizes the variational objective in the framework of stochastic optimization. At each iteration, it randomly samples a small subset (i.e., a mini-batch) from the entire corpus, and then forms a noisy natural gradient to update the parameters of interest. Using a learning rate that satisfies the conditions of Robbins and Monro[20], SVI converges to a local optimum of the variational objective.

Motivated by the problem of tuning the learning rate of SVI, the authors of [21] developed an adaptive learning rate (ALR) algorithm, which tunes the learning rates by minimizing the Euclidean distance between the noisy update and the batch update. However, since the variational parameters of topic models are the sufficient statistics of variational distributions, the Euclidean distance is not a good similarity measure in this case, and it might lead to low-quality learning rates.

To address this problem, we investigate the rate optimization process of ALR and propose a novel adaptive learning rate algorithm for SVI. Our algorithm uses the Kullback-Leibler (KL) divergence to measure the similarity between the variational distribution with noisy update and that with batch update. In contrast to ALR, our algorithm has two advantages. First, KL divergence is preferable to the Euclidean distance when measuring similarity between distributions. Second, our algorithm provides a specific learning rate for each latent variable, e.g., each topic-word distribution in LDA, instead of using a single learning rate for all variables.

In this paper, we apply our algorithm to LDA and HDP on three big corpora: Nature, Wikipedia and PubMed. Compared with ALR and other commonly used learning rates for SVI, our algorithm performs better.

The rest of this paper is organized as follows. In Section 2, we introduce SVI and ALR. In Section 3, we present the proposed algorithm and its applications to LDA and HDP. Section 4 shows the experimental results on the three big corpora. Finally, conclusions and some potential future work are discussed in Section 5.

(Regular Paper. This work was supported by the National Natural Science Foundation of China under Grant Nos. 61170092, 61133011 and 61103091. ∗Corresponding Author. ©2016 Springer Science + Business Media, LLC & Science Press, China)

2 Background

In this section, we introduce stochastic variational inference (SVI) and the adaptive learning rate (ALR) algorithm for SVI.

2.1 SVI

Suppose that there is a probabilistic model

p(w, z, φ|β) = p(φ|β) ∏_{d=1}^{D} p(w_d, z_d|φ),

which includes observations w (w := w_{1:D}), local latent variables z (z := z_{1:D}) and a global latent variable φ with prior β. We want to estimate the posterior distribution of the latent variables, p(z, φ|w, β). The variational Bayes family is usually used to compute an approximation to this posterior. It posits a variational distribution

q(z, φ|β̃, θ̃) = q(φ|β̃) ∏_{d=1}^{D} q(z_d|θ̃_d),

parameterized by the variational parameters β̃ (global) and θ̃ (local). It then finds the variational distribution q closest to the true posterior as measured by KL divergence. This task is equal to finding a local maximum of the following evidence lower bound (ELBO) with respect to β̃ and θ̃:

log p(w) ≥ E_q[log p(w, z, φ)] − E_q[log q(z, φ)] =: L(β̃, θ̃),

where E_q[·] is the expectation under the variational distribution q.

Standard batch VI optimizes this ELBO using fixed-point iterations. At each iteration t, it optimizes the local parameters θ̃^(t) under the current estimates of the global parameters β̃^(t−1), and then updates β̃^(t) given θ̃^(t). Successive optima of this ELBO often have closed form[13], and thus β̃ is updated as:

β̃∗^(t) = β + ∑_{d=1}^{D} θ̃_d^(t).    (1)

Unfortunately, if D is large, the batch update β̃∗^(t) in (1) is computationally expensive. SVI addresses this problem using stochastic natural gradient ascent. It defines the ELBO with respect to β̃ as L̃(β̃) := max_θ̃ L(β̃, θ̃). At each iteration t, SVI samples a single datapoint d (or a mini-batch of M datapoints) to form the stochastic natural gradient of L̃(β̃)[13,22]:

∇L̃(β̃^(t−1)) = −β̃^(t−1) + β + D θ̃_d^(t).    (2)

Using a learning rate ρ^(t) that satisfies the conditions of Robbins and Monro[20], the following update rule is guaranteed to converge to a local optimum of L̃(β̃):

β̃^(t) = β̃^(t−1) + ρ^(t) ∇L̃(β̃^(t−1)).    (3)

2.2 ALR

Because the stochastic natural gradients obtained by (2) always suffer from large noise, SVI is sensitive to the learning rate. More specifically, with noisy natural gradients it is difficult to decide whether to increase the learning rate to speed up the update process or to decrease it to slow the process down. To address this problem, the authors of [21] proposed the ALR algorithm, which adaptively tunes the learning rate at each iteration. Its key idea is to optimize the learning rate ρ^(t) by minimizing the expected Euclidean error J(ρ^(t)) between the noisy update β̃^(t) from (3) and the batch update β̃∗^(t) from (1):

min_{ρ^(t)} E[J(ρ^(t))] := min_{ρ^(t)} E[(β̃^(t) − β̃∗^(t))^T (β̃^(t) − β̃∗^(t))].

3 Proposed Algorithm

The ALR algorithm has been successfully applied to topic models such as LDA. However, since the variational parameters of topic models are the sufficient statistics of variational distributions, the Euclidean distance used in ALR is not a good similarity measure in this case.
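To see why distance in parameter space can be misleading, consider a toy comparison (a self-contained sketch with made-up Dirichlet parameters, not the paper's data): two pairs of Dirichlet variational parameters can be equally far apart in Euclidean distance while the distributions they represent differ by vastly different amounts.

```python
import numpy as np

def sym_kl(p, q):
    """Symmetrized KL divergence between two discrete distributions."""
    return float(np.sum((p - q) * (np.log(p) - np.log(q))))

# Two pairs of Dirichlet parameter vectors with the SAME Euclidean gap (1.0),
# but at very different concentration scales (illustrative values).
a1, b1 = np.array([1.0, 1.0, 1.0]), np.array([2.0, 1.0, 1.0])
a2, b2 = np.array([100.0, 100.0, 100.0]), np.array([101.0, 100.0, 100.0])

for a, b in [(a1, b1), (a2, b2)]:
    euc = np.linalg.norm(a - b)            # distance in parameter space
    kl = sym_kl(a / a.sum(), b / b.sum())  # divergence between the mean distributions
    print(f"Euclidean = {euc:.2f}, symmetrized KL = {kl:.6f}")
```

Both pairs have Euclidean distance 1.0, yet the divergence between the first pair's mean distributions is orders of magnitude larger than the second's, which is why a parameter-space metric can rank candidate learning rates poorly.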
To address the problem mentioned above, we use KL divergence in place of the Euclidean distance. KL divergence is an acknowledged natural measure of similarity between probability distributions[13], and is preferable for SVI. Based on this analysis, in this work we tune the learning rate of SVI by minimizing the symmetrized KL divergence (i.e., F(ρ^(t)) := 2 D_KL^sym) between the expectation of the variational distribution with noisy update, q′ := q(φ|β̃^(t)), and the expectation of the variational distribution with batch update, q∗ := q(φ|β̃∗^(t)):

min_{ρ^(t)} F(ρ^(t)) := min_{ρ^(t)} 2 D_KL^sym(E_{q′}[φ], E_{q∗}[φ]).

3.1 Applications to Topic Models

We use the proposed algorithm to tune the learning rate for the SVI applications of LDA and HDP.

Latent Dirichlet Allocation (LDA). LDA[15] is the simplest topic model for discrete data, e.g., text document collections. It introduces the concept of latent topics. Let D, V and K be the numbers of documents, unique words and topics respectively. LDA assumes that each topic k ∈ {1, …, K} is a multinomial distribution over words, drawn from the Dirichlet prior β: φ_k ∼ Dir(β). Given the topics, a document d ∈ {1, …, D} is generated as follows. First, it draws a topic proportion from the Dirichlet prior α: θ_d ∼ Dir(α). Second, it repeats the following process N_d times for the N_d word tokens: draw a topic assignment z_dn from θ_d and draw the word token w_dn from the selected topic φ_{z_dn}.

To learn LDA, SVI posits a fully factorized variational distribution q(φ)q(θ)q(z), where each topic φ_k corresponds to a global factor q(φ_k|β̃_k) := Dir(φ_k|β̃_k). Following SVI, the global variational parameter β̃ is updated in the spirit of (3). We tune the learning rate for each β̃_k separately, since the topics are independent from each other.

Hierarchical Dirichlet Process (HDP). HDP[16] is a nonparametric topic model, which can determine the number of topics from the observations. A standard two-level HDP is defined as:

G_0 ∼ DP(γH),
G_d ∼ DP(α_0 G_0), d ∈ {1, …, D}.

In the corpus-level Dirichlet process (DP), the base distribution H is a symmetric Dirichlet over the vocabulary simplex, and the atoms of G_0 are topics drawn from H. In the document-level draws, all DPs share the same base distribution G_0. For each document d, HDP first generates G_d as the topic proportion and then generates word tokens just as LDA does.

The authors of [17] described a stick-breaking construction for HDP that allows for closed-form variational Bayes. Its representation of the corpus-level DP is:

λ′_k ∼ Beta(1, γ),
λ_k = λ′_k ∏_{l=1}^{k−1} (1 − λ′_l),
φ_k ∼ H,
G_0 = ∑_{k=1}^{∞} λ_k δ_{φ_k},

where δ_φ is a probability measure concentrated at φ, each λ′_k is Beta-distributed with parameter γ, and each φ_k is an atom of G_0 with weight λ_k. For the document-level DP, the representation of each G_d is:

ϕ_dt ∼ G_0,
π′_dt ∼ Beta(1, α_0),
π_dt = π′_dt ∏_{l=1}^{t−1} (1 − π′_dl),
G_d = ∑_{t=1}^{∞} π_dt δ_{ϕ_dt},

where each π′_dt is Beta-distributed with parameter α_0, and each atom ϕ_dt with weight π_dt maps to a corpus-level atom φ_k.

To learn HDP, SVI posits a fully factorized variational distribution that contains two global factors: q(φ_k|β̃_k) := Dir(φ_k|β̃_k) and q(λ′_k|u_k, v_k) := Beta(λ′_k|u_k, v_k). Following SVI, β̃ and (u, v) are updated in the spirit of (3). In this work, we tune the learning rates for each β̃_k and for each Beta parameter pair (u_k, v_k).

3.2 Computing Learning Rates

For brevity, we define some notations: Σ̃_k^(t) = ∑_{v=1}^{V} β̃_kv^(t), ∇_kv^(t−1) = ∇L̃(β̃_kv^(t−1)) and ∇_k^(t−1) = ∑_{v=1}^{V} ∇_kv^(t−1).

Learning Rates for β̃^(t). Both LDA and HDP optimize the learning rate for each β̃_k^(t) using:

min_{ρ_k^(t)} F(ρ_k^(t)) := min_{ρ_k^(t)} 2 D_KL^sym(E_{q′}[φ_k], E_{q∗}[φ_k]),    (4)

where E_{q′}[φ_kv] = β̃_kv^(t) / ∑_{i=1}^{V} β̃_ki^(t) and E_{q∗}[φ_kv] = β̃∗_kv^(t) / ∑_{i=1}^{V} β̃∗_ki^(t).

The batch update β̃∗_kv^(t) is in fact unknown, so the first task is to estimate an approximation to it.
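Given some estimate of the batch update, the objective (4) can be evaluated for any candidate rate and minimized numerically. The following sketch uses hypothetical parameter values (not the paper's data) and a simple grid search in place of the gradient-descent inner loop, just to make the shape of the optimization concrete.

```python
import numpy as np

def sym_kl(p, q):
    """Symmetrized KL divergence between two discrete distributions."""
    return float(np.sum((p - q) * (np.log(p) - np.log(q))))

def objective(rho, beta_prev, grad, beta_star):
    """F(rho): divergence between the Dirichlet mean after a noisy update
    with rate rho and the mean under the (estimated) batch update."""
    beta_noisy = beta_prev + rho * grad
    p = beta_noisy / beta_noisy.sum()  # E_{q'}[phi_k]
    q = beta_star / beta_star.sum()    # E_{q*}[phi_k]
    return sym_kl(p, q)

# Hypothetical values for one topic over a 4-word vocabulary.
beta_prev = np.array([5.0, 3.0, 2.0, 1.0])   # current variational parameter
grad      = np.array([4.0, -1.0, 1.0, 0.5])  # noisy natural gradient
beta_star = np.array([7.0, 2.8, 2.4, 1.2])   # estimate of the batch update

# Grid search stands in for the paper's gradient-descent inner loop.
rhos = np.linspace(0.0, 1.0, 101)
best = min(rhos, key=lambda r: objective(r, beta_prev, grad, beta_star))
print(f"best rho on grid: {best:.2f}")
```

The per-topic rate is whatever value makes the noisy update's expected topic distribution land closest to the batch update's, which is exactly what (4) asks for.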
In this work, we approximate β̃∗_kv^(t) using exponential moving averages across iterations, as in [21, 23]. After initializing β̃∗_kv^(0) by a Monte Carlo estimate from multiple samples of the corpus, β̃∗_kv^(t) at iteration t is approximated by:

β̃∗_kv^(t) ≈ (1 − 1/τ_k^(t)) β̃∗_kv^(t−1) + (1/τ_k^(t)) ∇_kv^(t−1),    (5)

where τ_k^(t) is the window size of the exponential moving average for topic k.

Given the approximation to β̃∗_kv^(t), we can use gradient descent to optimize problem (4). We derive the derivative of the objective F(ρ_k^(t)) as:

∇F(ρ_k^(t)) = ∑_{v=1}^{V} [ −(E_{q∗}[φ_kv] ∇_kv^(t−1)) / β̃_kv^(t)
  + ((Σ̃_k^(t) ∇_kv^(t−1) − β̃_kv^(t) ∇_k^(t−1)) / (Σ̃_k^(t))²) · log( β̃_kv^(t) / (E_{q∗}[φ_kv] Σ̃_k^(t)) )
  + (Σ̃_k^(t) ∇_kv^(t−1) − β̃_kv^(t) ∇_k^(t−1) + E_{q∗}[φ_kv] Σ̃_k^(t) ∇_k^(t−1)) / (Σ̃_k^(t))² ].    (6)

Given a step size η, we can iteratively update ρ_k^(t) until convergence:

ρ_k^(t) ← ρ_k^(t) − η ∇F(ρ_k^(t)).    (7)

We outline the learning rate optimization for β̃_k^(t) in Fig.1. First, the window size τ^(0) is initialized by the number of samples used to form the Monte Carlo estimate of β̃∗^(0). Second, for reliable approximations, we update the window size τ^(t) as in line 6 of Fig.1. Third, we set the step size η = 1/i, where i is the inner-loop iteration index, and let I be the maximum iteration limit.

1. Approximate the batch update β̃∗_k^(t) using (5)
2. Repeat (i.e., i = 1, 2, …, I)
3.   Compute the derivative of F(ρ_k^(t)) using (6)
4.   Update the learning rate ρ_k^(t) using (7)
5. Until convergence
6. Update the window size by τ^(t) = (1 − ρ_k^(t)) τ^(t−1) + 1

Fig.1. Tuning the learning rate for β̃_k^(t).

Learning Rates for (u^(t), v^(t)). Additionally, HDP optimizes the learning rate for each Beta pair (u_k^(t), v_k^(t)) using:

min_{ρ̂_k^(t)} F(ρ̂_k^(t)) := min_{ρ̂_k^(t)} 2 D_KL^sym(E_{q′}[λ′_k], E_{q∗}[λ′_k]).    (8)

Note that a Beta distribution with parameters (u_k, v_k) is a special case of a Dirichlet distribution with parameter β̃_k (i.e., V = 2, β̃_k1 = u_k and β̃_k2 = v_k). That is to say, objective (8) is a special case of objective (4), and the optimization process outlined in Fig.1 can be applied directly to the learning rates for the Beta pairs (u_k^(t), v_k^(t)).

Fig.2 summarizes the SVI applications of topic models with our learning rates.

1. Initialize parameters
2. For t = 1, 2, …, ∞ do
3.   Sample a mini-batch
4.   Compute local parameters and noisy gradients for global parameters
5.   For k = 1, 2, …, K do
6.     Compute the learning rate for β̃_k^(t) as in Fig.1
7.     For HDP, compute the learning rate for (u_k^(t), v_k^(t)) as a special case of β̃_k^(t)
8.     Update global parameters for topic k using (3)
9.   End for
10. End for

Fig.2. SVI applications of topic models with our learning rates.

3.3 Time Complexity

In this subsection, we compare the computational complexity of ALR and our algorithm. In the context of SVI, both algorithms perform an additional learning rate optimization process, and thus we only compare these processes. For each iteration of SVI, the cost of the ALR learning rate optimization process is O(K × V). Our algorithm estimates per-topic learning rates using an inner loop, so in the worst case its time cost is O(I × K × V). In this sense, our algorithm might be somewhat more time-consuming than ALR. Subsection 4.4 gives empirical evaluations.

4 Experiment

In this section, we compare our algorithm with the Robbins-Monro (RM) rate, the constant rate[24-25] and ALR, which are popular for SVI. The RM rate is ρ^(t) = (t + τ)^(−κ), where τ is the delay and κ is the forgetting rate. We fit topic models on subsets of the corpora (for each corpus, 10 randomly generated subsets of 100K documents) to find the best settings for the RM rate and the constant rate, and report the best results. For the RM rate, we set τ = 1 024 and search κ over the set {0.5, 0.6, 0.7, 0.8, 0.9, 1.0}.
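As a quick sanity check on this baseline, the RM schedule can be sketched as follows (τ = 1 024 as above; κ = 0.7 is one illustrative value from the searched set):

```python
# Sketch of the Robbins-Monro (RM) baseline rate: rho(t) = (t + tau) ** (-kappa).
def rm_rate(t, tau=1024, kappa=0.7):
    return (t + tau) ** (-kappa)

# The schedule decays monotonically, so later iterations take smaller steps;
# kappa in (0.5, 1] keeps sum(rho) divergent while sum(rho**2) stays finite.
rates = [rm_rate(t) for t in range(1, 2001, 500)]
assert all(a > b for a, b in zip(rates, rates[1:]))
print([round(r, 6) for r in rates])
```

The drawback the paper targets is visible here: the decay is fixed in advance and ignores how noisy the current gradient actually is.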
For the constant rate, we search over {0.0001, 0.001, 0.01, 0.1}.

The settings of the topic models are as follows. We set the Dirichlet priors α = 0.1 and β = 0.01. For HDP in particular, we set γ = α_0 = 1, the top-level truncation K = 200 and the second-level truncation to 15.

Three big corpora are used: Nature, Wikipedia and PubMed. The Nature corpus contains 350K documents from the journal Nature (from 1869 to 2008). After removing stop words and rare words, it uses a vocabulary of 5 000 unique words. For the Wikipedia② corpus, we randomly downloaded 2.5M documents from the English version of Wikipedia and processed them using a vocabulary of 7 700 unique words. For the PubMed corpus, we randomly sampled 3M documents from the original PubMed③ collection. After removing stop words and rare words, it uses a big vocabulary of 100K unique words. We randomly sampled 10K documents from each corpus as the test data.

The perplexity[11,15] on the test dataset is used as the performance metric. Given a test dataset D_test of S documents, the perplexity is computed by:

Perplexity(D_test) = exp( −( ∑_{d=1}^{S} log p(w_d) ) / ( ∑_{d=1}^{S} |w_d| ) ).

4.1 Evaluation on Number of Topics

We fixed the mini-batch size M to 100, and fit the LDA model with different numbers of topics K over the set {100, 200, 300}. The results of one sweep are shown in this paper, because early experiments showed that our algorithm is insensitive to different initializations. We computed the test perplexity every 10 iterations at first, and then every 100/500/500 iterations after sweeping 50K/200K/300K documents for Nature/Wikipedia/PubMed respectively. If the change in test perplexity is less than 1 × 10^−4, we consider that the algorithm has converged. Overall, the experimental results shown in Fig.3 indicate that our algorithm outperforms the other learning rates in all settings.

RM Rate and Constant Rate vs Our Algorithm.
On the Nature corpus, the constant rate performs better than the RM rate in two of the three settings; however, the perplexity gaps are small, e.g., about 15 with 200 topics and 20 with 300 topics. That is because documents from the Nature corpus are longer and its vocabulary is small (i.e., 5 000 unique words), resulting in relatively stable performance. Our algorithm is better than the two traditional rates on the Nature corpus, e.g., by about 90 perplexity with 100 topics and about 70 with 200 topics. The Wikipedia corpus contains enough documents to ensure convergence, so the perplexity values are consistently lower than those obtained on the Nature corpus. On this corpus, our algorithm again performs better than the two traditional rates, e.g., by about 110 with 100 topics. On the PubMed corpus, our algorithm also outperforms the two traditional rates, and the performance advantages are even more significant, e.g., about 110 improvement over the constant rate with 200 topics and about 170 improvement over the RM rate with 300 topics. Note that the perplexity of the constant rate fluctuates at times, e.g., after 15 000 iterations with 100 topics. That is because the constant rate becomes too large after so many iterations, especially for the PubMed corpus with its very large vocabulary (i.e., 100K unique words). More importantly, we find that our algorithm converges significantly faster than the two traditional rates in all settings.

ALR vs Our Algorithm. First, our algorithm achieves better perplexity than ALR across all three corpora, e.g., by about 60 with 100 topics on the Nature corpus, about 60 with 100 topics on the Wikipedia corpus and about 100 with 300 topics on the PubMed corpus. Second, our algorithm converges faster. On the PubMed corpus with 200/300 topics, our algorithm converges after about 25K/5K iterations, while ALR converges after about 27K/10K iterations.
More importantly, the perplexity curves of our algorithm are smooth, while the curves of ALR fluctuate in some settings, e.g., ALR with 300 topics on the Wikipedia corpus, and ALR with 200 and 300 topics on the PubMed corpus.

Fig.3. Experimental results on LDA with different numbers of topics (perplexity vs iteration for the best constant rate, the best RM rate, ALR and our algorithm). (a) Nature with K = 100. (b) Nature with K = 200. (c) Nature with K = 300. (d) Wikipedia with K = 100. (e) Wikipedia with K = 200. (f) Wikipedia with K = 300. (g) PubMed with K = 100. (h) PubMed with K = 200. (i) PubMed with K = 300.

4.2 Evaluation on Mini-Batch Size

For LDA, we fixed the number of topics K to 100, and evaluated our algorithm with different mini-batch sizes M over the set {100, 500, 1 000}. For HDP, the experiments were conducted with mini-batch sizes M over {100, 500}. We computed the test perplexity every 10 iterations at first, and then every 100/500/500 iterations after sweeping 50K/200K/300K documents for Nature/Wikipedia/PubMed respectively. Fig.4 shows the experimental results on LDA.

② http://www.cs.princeton.edu/∼mdhoffma/, July 2015.
③ http://archive.ics.uci.edu/ml/datasets/Bag+of+Words, July 2015.
The constant rate performs better than the RM rate, and our algorithm is always the best. For Nature and Wikipedia, with their small vocabularies, the perplexity gaps are relatively small. For PubMed, with its large vocabulary, our algorithm shows significant improvements over the two traditional rates, e.g., 130∼230 with the mini-batch of 500 documents. This is very meaningful for truly online data, which might involve an unbounded vocabulary. We observe that larger mini-batches generally perform better, as expected. However, with large mini-batches, the perplexity curves of the two traditional rates fluctuate somewhat, e.g., the RM rate with the mini-batch of 1 000 documents on the Nature corpus and the constant rate with the mini-batch of 1 000 documents on the PubMed corpus. In contrast, the perplexity curves of our algorithm are very smooth in most settings.

The proposed algorithm outperforms ALR across all three corpora, e.g., by about 70 perplexity with the mini-batch of 1 000 documents on the Nature corpus. On the PubMed corpus, the perplexity gap between ALR and our algorithm with the 500-document mini-batch is only about 40; with the larger mini-batch of 1 000 documents, this gap grows to about 60. In addition, we find that our algorithm converges faster and more smoothly than ALR.

Fig.4. Experimental results on LDA with different mini-batch sizes (perplexity vs iteration). (a) Nature with M = 100. (b) Nature with M = 500. (c) Nature with M = 1 000. (d) Wikipedia with M = 100. (e) Wikipedia with M = 500. (f) Wikipedia with M = 1 000. (g) PubMed with M = 100. (h) PubMed with M = 500. (i) PubMed with M = 1 000.

Fig.5 shows the experimental results on HDP. Overall, our algorithm outperforms the other popular learning rates in all settings. Compared with the RM rate and the constant rate, the improvement is about 100 across the Nature and Wikipedia corpora, and above 200 across the PubMed corpus. Compared with ALR, the performance improvement is about 50∼150. Furthermore, our algorithm converges faster in most settings.

4.3 Evaluation on Adaptive Learning Rate

We designed an experiment to evaluate the "quality" of the learning rates as follows. We randomly sampled a subset of the Wikipedia corpus (called mini-Wiki) that contains 20K documents and all 7 700 unique words. We ran three inference algorithms on this mini-Wiki corpus at the same time: batch VI, SVI using ALR (SVI_ALR) and SVI using our algorithm (SVI_ours). At each iteration, we used batch VI as the baseline and computed the KL divergence between the topic distributions from SVI_ALR/SVI_ours and those from batch VI (a lower KL divergence implies better performance).
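The per-topic comparison against the batch VI baseline can be sketched as follows, using hypothetical variational parameters for one topic (in the experiment these come from SVI_ALR, SVI_ours and batch VI):

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) between two topic (word) distributions."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def topic_mean(beta_k):
    """E_q[phi_k] under Dir(beta_k)."""
    return beta_k / beta_k.sum()

# Hypothetical Dirichlet parameters for one topic from two SVI runs
# and from the batch VI baseline (illustrative values only).
beta_svi_a = np.array([4.0, 2.0, 1.0, 1.0])
beta_svi_b = np.array([5.5, 2.8, 1.2, 0.9])
beta_batch = np.array([6.0, 3.0, 1.0, 1.0])

for name, b in [("run A", beta_svi_a), ("run B", beta_svi_b)]:
    d = kl(topic_mean(b), topic_mean(beta_batch))
    print(f"{name}: KL to batch VI = {d:.4f}")  # lower means closer to the baseline
```

Repeating this per topic and per iteration yields exactly the curves reported below: whichever SVI variant tracks the batch VI topics more closely has produced better learning rates.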
Because our algorithm tunes the learning rate for each topic, we randomly selected three topic distributions for comparison. All three inference algorithms use the same parameter initializations, and both SVI methods use the same mini-batch of documents at each iteration (with mini-batch size M = 100). We fit a 100-topic LDA model, and both SVI methods sweep the mini-Wiki corpus five times.

Fig.6 shows the resulting KL divergence values. SVI_ours always performs better than SVI_ALR. For SVI_ours, the KL divergence curves of the three topic distributions look very similar, but for SVI_ALR, the three curves show completely different trends. That is because ALR optimizes the learning rate by considering all global variables together; this is unsuitable for LDA, whose global variables (e.g., topics) are independent from each other.

Fig.5. Experimental results on HDP with different mini-batch sizes (perplexity vs iteration). (a) Nature with M = 100. (b) Wikipedia with M = 100. (c) PubMed with M = 100. (d) Nature with M = 500. (e) Wikipedia with M = 500. (f) PubMed with M = 500.

Fig.6. Experimental results with respect to the "quality" of learning rates (KL divergence to batch VI over iterations, for ALR and our algorithm). (a) Topic 1. (b) Topic 2. (c) Topic 3.

4.4 Evaluation on Running Time

The training time is important for adaptive learning rate algorithms. We ran the Python implementations of ALR and our algorithm, and compared their per-iteration training times. We fit a 100-topic LDA model with a mini-batch of 100 documents on the Wikipedia and PubMed corpora. On the small-vocabulary Wikipedia corpus, the average per-iteration time of ALR is about 2.4 seconds, while our algorithm takes about 2.6 seconds. On the large-vocabulary PubMed corpus, the average per-iteration time of ALR is about 2.5 seconds, while our algorithm still needs only about 2.6 seconds. Although our algorithm is theoretically more time-consuming than ALR (as discussed in Subsection 3.3), it is only slightly more expensive in practice (about 5%). This is because: 1) in practice, our rate-tuning process converges to a local optimum in a few iterations (i.e., small I); and 2) the Python implementation relies on efficient matrix manipulations. Therefore, we argue that our rate-tuning algorithm is sufficiently efficient in practice.

5 Conclusions

In this paper, we developed a novel algorithm that adaptively tunes the learning rate of SVI, and applied it to topic models including LDA and HDP. Our algorithm computes adaptive learning rates by minimizing the KL divergence between the expectation of the variational distribution with noisy update and that with batch update. In summary, our algorithm has two advantages. First, KL divergence is a better similarity measure for variational distributions than the Euclidean distance. Second, our algorithm individually optimizes the learning rate for each topic. This is preferable to using a single learning rate, because LDA and HDP both assume independence among topic distributions.
The experimental results on the three big corpora indicate that the learning rates obtained by our algorithm: 1) are significantly helpful for SVI; and 2) converge to a stable point, though we have not proved that our algorithm satisfies the Robbins-Monro criterion.

References

[1] Jin F, Huang M L, Zhu X Y. Guided structure-aware review summarization. Journal of Computer Science and Technology, 2011, 26(4): 676-684.
[2] Li P, Wang B, Jin W. Improving Web document clustering through employing user-related tag expansion techniques. Journal of Computer Science and Technology, 2012, 27(3): 554-566.
[3] Newman D, Asuncion A, Smyth P, Welling M. Distributed algorithms for topic models. Journal of Machine Learning Research, 2009, 10: 1801-1828.
[4] Yan F, Xu N, Qi Y. Parallel inference for latent Dirichlet allocation on graphics processing units. In Proc. the 23rd NIPS, Dec. 2009, pp.2134-2142.
[5] Liu Z, Zhang Y, Chang E, Sun M. PLDA+: Parallel latent Dirichlet allocation with data placement and pipeline processing. ACM Transactions on Intelligent Systems and Technology, 2011, 2(3): Article No. 26.
[6] AlSumait L, Barbara D, Domeniconi C. On-line LDA: Adaptive topic models for mining text streams with applications to topic detection and tracking. In Proc. the 8th ICDM, Dec. 2008, pp.3-12.
[7] Yao L, Mimno D, McCallum A. Efficient methods for topic model inference on streaming document collections. In Proc. the 15th SIGKDD, June 28-July 1, 2009, pp.937-945.
[8] Hoffman M D, Blei D M. Online learning for latent Dirichlet allocation. In Proc. the 24th NIPS, Dec. 2010.
[9] Mimno D, Hoffman M D, Blei D M. Sparse stochastic inference for latent Dirichlet allocation. In Proc. the 29th ICML, June 27-July 3, 2012, pp.1599-1606.
[10] Wang C, Chen X, Smola A, Xing E P. Variance reduction for stochastic gradient optimization. In Proc. the 27th NIPS, Dec. 2013.
[11] Patterson S, Teh Y W. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In Proc. the 27th NIPS, Dec. 2013.
[12] Zeng J, Liu Z Q, Cao X Q. Online belief propagation for topic modeling. arXiv:1210.2179, June 2013. http://arxiv.org/pdf/1210.2179.pdf, July 2015.
[13] Hoffman M D, Blei D M, Wang C, Paisley J. Stochastic variational inference. Journal of Machine Learning Research, 2013, 14(1): 1303-1347.
[14] Ouyang J, Lu Y, Li X. Momentum online LDA for large-scale datasets. In Proc. the 21st ECAI, August 2014, pp.1075-1076.
[15] Blei D M, Ng A Y, Jordan M I. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003, 3: 993-1022.
[16] Teh Y W, Jordan M I, Beal M J, Blei D M. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 2006, 101(476): 1566-1581.
[17] Wang C, Paisley J, Blei D M. Online variational inference for the hierarchical Dirichlet process. In Proc. the 14th AISTATS, April 2011, pp.752-760.
[18] Johnson M J, Willsky A S. Stochastic variational inference for Bayesian time series models. In Proc. the 31st ICML, June 2014, pp.3872-3880.
[19] Hernandez-Lobato J M, Houlsby N, Ghahramani Z. Stochastic inference for scalable probabilistic modeling of binary matrices. In Proc. the 31st ICML, June 2014, pp.1693-1710.
[20] Robbins H, Monro S. A stochastic approximation method. The Annals of Mathematical Statistics, 1951, 22(3): 400-407.
[21] Ranganath R, Wang C, Blei D M, Xing E P. An adaptive learning rate for stochastic variational inference. In Proc. the 30th ICML, June 2013, pp.298-306.
[22] Amari S. Natural gradient works efficiently in learning. Neural Computation, 1998, 10(2): 251-276.
[23] Schaul T, Zhang S, LeCun Y. No more pesky learning rates. In Proc. the 30th ICML, June 2013, pp.343-351.
[24] Nemirovski A, Juditsky A, Lan G, Shapiro A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 2009, 19(4): 1574-1609.
[25] Collobert R, Weston J, Bottou L, Karlen M, Kavukcuoglu K, Kuksa P. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 2011, 12: 2493-2537.

Xi-Ming Li received his M.S. degree in computer science from Jilin University, Changchun, in 2011. Currently he is a Ph.D. candidate in the College of Computer Science and Technology at Jilin University. His main research interests include topic modeling and multi-label learning.

Ji-Hong Ouyang is a professor at the College of Computer Science and Technology, Jilin University, Changchun. She received her Ph.D. degree from Jilin University in 2005. Her main research interests include artificial intelligence and machine learning, specifically in spatial reasoning, multi-label learning, topic modeling, and online learning.