Li XM, Ouyang JH. Tuning the learning rate for stochastic variational inference. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 31(2): 428–436 Mar. 2016. DOI 10.1007/s11390-016-1636-4

Tuning the Learning Rate for Stochastic Variational Inference
Xi-Ming Li and Ji-Hong Ouyang ∗ , Member, CCF
College of Computer Science and Technology, Jilin University, Changchun 130000, China
Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University
Changchun 130000, China
E-mail: [email protected]; [email protected]
Received November 10, 2014; revised July 27, 2015.

Regular Paper

This work was supported by the National Natural Science Foundation of China under Grant Nos. 61170092, 61133011 and 61103091.

∗ Corresponding Author

©2016 Springer Science + Business Media, LLC & Science Press, China
Abstract Stochastic variational inference (SVI) can learn topic models with very big corpora. It optimizes the variational
objective by using the stochastic natural gradient algorithm with a decreasing learning rate. This rate is crucial for SVI;
however, it is often tuned by hand in real applications. To address this, we develop a novel algorithm, which tunes the
learning rate of each iteration adaptively. The proposed algorithm uses the Kullback-Leibler (KL) divergence to measure
the similarity between the variational distribution with noisy update and that with batch update, and then optimizes the
learning rates by minimizing the KL divergence. We apply our algorithm to two representative topic models: latent Dirichlet
allocation and hierarchical Dirichlet process. Experimental results indicate that our algorithm performs better and converges faster than SVI with commonly used learning rates.
Keywords: stochastic variational inference, online learning, adaptive learning rate, topic model

1 Introduction
Topic modeling algorithms are popular tools for analyzing text document collections and for solving real problems in many practical applications[1-2]. However, traditional batch inference algorithms for topic models are inefficient and impractical for big corpora, which contain hundreds of thousands of documents. To address this, some publications develop parallel inference algorithms[3-5], and others propose online inference algorithms[6-14]. Among these online algorithms,
stochastic variational inference (SVI)[13] is representative. It has been successfully applied to latent Dirichlet allocation (LDA)[8,15] , hierarchical Dirichlet process (HDP)[16-17] , time-series models[18] and matrix
factorization[19].
As an extension of the batch variational inference
(batch VI) algorithm, SVI maximizes the variational
objective in the framework of stochastic optimization.
At each iteration, it randomly samples a small subset (i.e., mini-batch) from the entire corpus, and then
forms a noisy natural gradient to update the parameters of interest. Using a learning rate that satisfies the
conditions of Robbins and Monro[20] , SVI converges to
a local optimum of the variational objective. Motivated by tuning the learning rate of SVI, the authors
of [21] developed an adaptive learning rate (ALR) algorithm, which tunes the learning rates by minimizing
the Euclidean distance between the noisy update and
the batch update. However, since the variational parameters for topic models are the sufficient statistics
of variational distributions, the Euclidean distance is
not a good similarity measure in this case, and it might lead to low-quality learning rates.
To address this problem, we investigate the rate optimization process of ALR and propose a novel adaptive
learning rate algorithm for SVI. Our algorithm uses the
Kullback-Leibler (KL) divergence to measure the similarity between the variational distribution with noisy
update and that with batch update. In contrast to
ALR, our algorithm has two advantages. First, KL divergence is preferable to the Euclidean distance when
measuring similarity between distributions. Second,
our algorithm provides a specific learning rate for each
latent variable, e.g., topic-word distributions in LDA,
instead of using a single learning rate for all variables.
In this paper, we apply our algorithm to LDA and HDP
on three big corpora: Nature, Wikipedia and PubMed.
Compared with ALR and other commonly used learning rates for SVI, our algorithm performs better.
The rest of this paper is organized as follows. In
Section 2, we introduce SVI and ALR. In Section 3, we
present the proposed algorithm and its applications to LDA and HDP. Section 4 shows the experimental results on the three big corpora. Finally, conclusions and
some potential future work are discussed in Section 5.
2 Background

In this section, we introduce stochastic variational inference (SVI) and the adaptive learning rate (ALR) algorithm for SVI.

2.1 SVI

Suppose that there is a probabilistic model

    p(w, z, φ|β) = p(φ|β) ∏_{d=1}^{D} p(w_d, z_d|φ),

which includes the observations w (w ≜ w_{1:D}), the local latent variables z (z ≜ z_{1:D}) and the global latent variable φ with prior β. We want to estimate the posterior distribution of the latent variables, p(z, φ|w, β).

The variational Bayes family is usually used to compute an approximation to this posterior. It posits a variational distribution

    q(z, φ|β̃, θ̃) = q(φ|β̃) ∏_{d=1}^{D} q(z_d|θ̃_d),

parameterized by the variational parameters β̃ (global) and θ̃ (local). It then finds the variational distribution q closest to the true posterior as measured by KL divergence. This task is equivalent to finding a local maximum of the following evidence lower bound (ELBO) with respect to the variational parameters β̃ and θ̃:

    log p(w) ≥ E_q[log p(w, z, φ)] − E_q[log q(z, φ)] ≜ L(β̃, θ̃),

where E_q[·] is the expectation under the variational distribution q.

Standard batch VI optimizes this ELBO using fixed-point iterations. At each iteration t, it optimizes the local parameters θ̃^(t) under the current estimates of the global parameters β̃^(t−1), and then it updates β̃^(t) given θ̃^(t). Successive optima of this ELBO often have closed form[13], and thus β̃ is updated as:

    β̃∗^(t) = β + Σ_{d=1}^{D} θ̃_d^(t).    (1)

Unfortunately, if D is too large, the batch update β̃∗^(t) in (1) is computationally expensive. SVI addresses this problem using stochastic natural gradient ascent. It defines the ELBO with respect to β̃ as L̃(β̃) ≜ max_θ̃ L(β̃, θ̃). At each iteration t, SVI samples a single datapoint d (or a mini-batch of M datapoints) to form the stochastic natural gradient of L̃(β̃)[13,22] as follows:

    ∇L̃(β̃^(t−1)) = −β̃^(t−1) + β + D θ̃_d^(t).    (2)

Using a learning rate ρ^(t) that satisfies the conditions of Robbins and Monro[20], the following update rule is guaranteed to converge to a local optimum of L̃(β̃):

    β̃^(t) = β̃^(t−1) + ρ^(t) ∇L̃(β̃^(t−1)).    (3)

2.2 ALR

Because the stochastic natural gradients obtained by (2) always suffer from large noise, SVI is sensitive to the learning rate. More specifically, with noisy natural gradients it is difficult to decide whether to increase the learning rate to speed up the update process or to decrease it to slow the update process down. To address this problem, the authors of [21] proposed an ALR algorithm that can adaptively tune the learning rate of each iteration. Its key idea is to optimize the learning rate ρ^(t) by minimizing the expectation of the Euclidean error J(ρ^(t)) between the noisy update β̃^(t) from (3) and the batch update β̃∗^(t) from (1):

    min_{ρ^(t)} E[J(ρ^(t))] ≜ min_{ρ^(t)} E[(β̃^(t) − β̃∗^(t))^T (β̃^(t) − β̃∗^(t))].
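As a concrete reference point for the update whose learning rate is being tuned, the following minimal sketch (our own illustration with assumed variable names, not code from the paper) performs one noisy SVI step (2)-(3) for a single Dirichlet global parameter:

import numpy as np

def svi_update(beta_tilde_prev, beta_prior, theta_tilde_d, D, rho_t):
    """One noisy SVI step for a Dirichlet global parameter, following (2) and (3)."""
    nat_grad = -beta_tilde_prev + beta_prior + D * theta_tilde_d  # stochastic natural gradient (2)
    return beta_tilde_prev + rho_t * nat_grad                     # update rule (3)

# Toy usage with made-up numbers (vocabulary size V = 4).
beta_tilde = np.full(4, 1.0)
theta_d = np.array([2.0, 0.0, 1.0, 3.0])   # sufficient statistics of the sampled document
beta_tilde = svi_update(beta_tilde, 0.01, theta_d, D=100_000, rho_t=0.01)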
3 Proposed Algorithm

The ALR algorithm has been successfully applied to topic models such as LDA. However, since the variational parameters of topic models are the sufficient statistics of variational distributions, the Euclidean distance used in ALR is not a good similarity measure in this case.

To address the problem mentioned above, we use KL divergence in place of the Euclidean distance. KL divergence is an acknowledged natural measure of similarity between probability distributions[13], and is preferable for SVI. Based on this analysis, in this work we tune the learning rate of SVI by minimizing the symmetrized KL divergence (i.e., F(ρ^(t)) ≜ D_KL^sym) between the expectation of the variational distribution with noisy update, q′ := q(φ|β̃^(t)), and the expectation of the variational distribution with batch update, q∗ := q(φ|β̃∗^(t)), as follows:

    min_{ρ^(t)} F(ρ^(t)) ≜ min_{ρ^(t)} 2 D_KL^sym(E_{q′}[φ], E_{q∗}[φ]).
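To make the two criteria concrete, the following minimal sketch (our own illustration with made-up parameter values) contrasts the Euclidean error used by ALR, computed on the variational parameters, with the symmetrized KL divergence computed on the corresponding Dirichlet expectations:

import numpy as np

def dirichlet_mean(beta_tilde):
    """Expectation E_q[phi] of a Dirichlet(beta_tilde) variational factor."""
    return beta_tilde / beta_tilde.sum()

def sym_kl(p, q, eps=1e-12):
    """Symmetrized KL divergence between probability vectors p and q."""
    p, q = p + eps, q + eps
    return np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))

# Hypothetical noisy-update and batch-update parameters for one topic.
beta_noisy = np.array([0.2, 0.3, 5.0, 0.5])
beta_batch = np.array([0.3, 0.2, 4.0, 0.4])

print(np.sum((beta_noisy - beta_batch) ** 2))                                 # ALR's Euclidean criterion
print(2.0 * sym_kl(dirichlet_mean(beta_noisy), dirichlet_mean(beta_batch)))   # our F(rho)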
3.1 Applications for Topic Models

We use the proposed algorithm to tune the learning rate for SVI applications of LDA and HDP.

Latent Dirichlet Allocation (LDA). LDA[15] is the simplest topic model for discrete data, e.g., text document collections. It introduces the concept of latent topics. Let D, V and K be the numbers of documents, unique words and topics, respectively. LDA assumes that each topic k ∈ {1, ..., K} is a multinomial distribution over words, drawn from the Dirichlet prior β: φ_k ∼ Dir(β). Given the topics, a document d ∈ {1, ..., D} is generated as follows. First, it draws a topic proportion from the Dirichlet prior α: θ_d ∼ Dir(α). Second, it repeats the following process N_d times for the N_d word tokens: drawing a topic assignment z_dn from θ_d and drawing the word token w_dn from the selected topic φ_{z_dn}.
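As a toy illustration of this generative story (our own sketch, not code from the paper; the hyperparameter values simply mirror the α and β used later in Section 4), the following samples a tiny corpus from the LDA model:

import numpy as np

def generate_lda_corpus(rng, D, V, K, N_d, alpha, beta):
    """Sample a toy corpus from the LDA generative process described above."""
    topics = rng.dirichlet(np.full(V, beta), size=K)        # phi_k ~ Dir(beta)
    corpus = []
    for _ in range(D):
        theta_d = rng.dirichlet(np.full(K, alpha))          # theta_d ~ Dir(alpha)
        z = rng.choice(K, size=N_d, p=theta_d)              # topic assignments z_dn
        words = [rng.choice(V, p=topics[k]) for k in z]     # word tokens w_dn ~ phi_{z_dn}
        corpus.append(words)
    return topics, corpus

rng = np.random.default_rng(0)
topics, corpus = generate_lda_corpus(rng, D=5, V=50, K=3, N_d=20, alpha=0.1, beta=0.01)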
To learn LDA, SVI posits a fully factorized variational distribution q(φ)q(θ)q(z), where each topic φ_k corresponds to a global factor q(φ_k|β̃_k) := Dir(φ_k|β̃_k). Following SVI, the global variational parameter β̃ is updated in the spirit of (3). We tune a learning rate for each β̃_k since the topics are independent of each other.
Hierarchical Dirichlet Process (HDP). HDP[16] is a nonparametric topic model, which can determine the number of topics from the observations. A standard two-level HDP is defined as:

    G_0 ∼ DP(γH),
    G_d ∼ DP(α_0 G_0), d ∈ {1, ..., D}.

In the corpus-level Dirichlet process (DP), the base distribution H is a symmetric Dirichlet over the vocabulary simplex, and the atoms of G_0 are topics, which are drawn from H. In the document-level draws, all DPs share the same base distribution G_0. For each document d, HDP first generates G_d as the topic proportion and then generates word tokens just as LDA does.

The authors of [17] described a stick-breaking construction for HDP that allows for closed-form variational Bayes. Its representation of the corpus-level DP is:

    λ′_k ∼ Beta(1, γ),
    λ_k = λ′_k ∏_{l=1}^{k−1} (1 − λ′_l),
    φ_k ∼ H,
    G_0 = Σ_{k=1}^{∞} λ_k δ_{φ_k},

where δ_φ is a probability measure concentrated at φ, each λ′_k is drawn from a Beta distribution with parameter γ, and each φ_k is an atom of G_0 with weight λ_k. For the document-level DP, the representation of each G_d is:

    ϕ_dt ∼ G_0,
    π′_dt ∼ Beta(1, α_0),
    π_dt = π′_dt ∏_{l=1}^{t−1} (1 − π′_dl),
    G_d = Σ_{t=1}^{∞} π_dt δ_{ϕ_dt},

where each π′_dt is drawn from a Beta distribution with parameter α_0, and each atom ϕ_dt with weight π_dt maps to a corpus-level atom φ_k.

To learn HDP, SVI posits a fully factorized variational distribution that contains two global factors:

    q(φ_k|β̃_k) := Dir(φ_k|β̃_k),
    q(λ′_k|u_k, v_k) := Beta(λ′_k|u_k, v_k).

Following SVI, β̃ and (u, v) are updated in the spirit of (3). In this work, we tune learning rates for each β̃_k and for each (u_k, v_k), the parameters of the Beta factors.
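The truncated corpus-level stick-breaking construction is easy to simulate. The sketch below (our own illustration; the truncation level simply mirrors the top-level truncation K = 200 used in the experiments of Section 4) draws the weights λ_k:

import numpy as np

def stick_breaking_weights(rng, gamma, truncation):
    """Truncated stick-breaking draw of the corpus-level weights lambda_k."""
    lam_prime = rng.beta(1.0, gamma, size=truncation)                    # lambda'_k ~ Beta(1, gamma)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - lam_prime[:-1])))
    return lam_prime * remaining                                         # lambda_k = lambda'_k * prod_{l<k}(1 - lambda'_l)

rng = np.random.default_rng(0)
weights = stick_breaking_weights(rng, gamma=1.0, truncation=200)
print(weights.sum())  # slightly below 1 because of the truncation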
3.2 Computing Learning Rates

For brevity, we redefine some notations as: Σ̃_k^(t) = Σ_{v=1}^{V} β̃_kv^(t), ∇_kv^(t−1) = ∇L̃(β̃_kv^(t−1)) and ∇_k^(t−1) = Σ_{v=1}^{V} ∇_kv^(t−1).

Learning Rates for β̃^(t). Both LDA and HDP optimize learning rates for each β̃_k^(t) using:

    min_{ρ_k^(t)} F(ρ_k^(t)) ≜ min_{ρ_k^(t)} 2 D_KL^sym(E_{q′}[φ_k], E_{q∗}[φ_k]),    (4)
where E_{q′}[φ_kv] = β̃_kv^(t) / Σ_{i=1}^{V} β̃_ki^(t) and E_{q∗}[φ_kv] = β̃∗_kv^(t) / Σ_{i=1}^{V} β̃∗_ki^(t).

The batch update β̃∗_kv^(t) is in fact unknown, so we first need to estimate an approximation to it. In this work, we approximate β̃∗_kv^(t) using exponential moving averages across iterations, as mentioned in [21, 23].
After initializing β̃∗_kv^(0) by the Monte Carlo estimate of multiple samples from the corpus, β̃∗_kv^(t) at iteration t is approximated by:

    β̃∗_kv^(t) ≈ (1 − 1/τ_k^(t)) β̃∗_kv^(t−1) + τ_k^(t) ∇_kv^(t−1),    (5)

where τ_k^(t) is the window size of the exponential moving average for topic k.
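One simple way to maintain such a moving-average estimate is sketched below. This is our own simplification rather than the exact recursion (5): it averages the one-step estimate β̃^(t−1) + ∇^(t−1), which by (1) and (2) equals β + D θ̃_d^(t), i.e., a single-mini-batch estimate of the batch update:

import numpy as np

def update_batch_estimate(beta_batch_est, beta_prev, nat_grad, window):
    """Exponential moving average of the one-step batch estimate; 'window' plays the role of tau_k^(t)."""
    one_step = beta_prev + nat_grad          # single-mini-batch estimate of the batch update (1)
    w = 1.0 / window
    return (1.0 - w) * beta_batch_est + w * one_step

est = update_batch_estimate(np.ones(4), np.ones(4), np.array([0.5, -0.2, 0.1, 0.0]), window=10.0)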
Given approximations to β̃∗_kv^(t), we can use the gradient descent algorithm to optimize problem (4). We derive the derivative of the objective F(ρ_k^(t)) as:

    ∇F(ρ_k^(t)) = − Σ_{v=1}^{V} E_{q∗}[φ_kv] ∇_kv^(t−1) / β̃_kv^(t)
                  + Σ_{v=1}^{V} [(Σ̃_k^(t) ∇_kv^(t−1) − β̃_kv^(t) ∇_k^(t−1)) / (Σ̃_k^(t))^2]
                    × [log(β̃_kv^(t) / (E_{q∗}[φ_kv] Σ̃_k^(t))) + (Σ̃_k^(t) − β̃_kv^(t) + E_{q∗}[φ_kv] Σ̃_k^(t) ∇_k^(t−1)) / (Σ̃_k^(t))^2].    (6)
Given a step size η, we can iteratively update ρ_k^(t) until convergence as:

    ρ_k^(t) ← ρ_k^(t) − η ∇F(ρ_k^(t)).    (7)

We outline the learning rate optimization for β̃_k^(t) in Fig.1. First, the window size τ^(0) is initialized by the number of samples used to form the Monte Carlo estimate of β̃∗^(0). Second, for reliable approximations, we update the window size τ^(t) as in line 6 of Fig.1. Third, we set the step size η = 1/i, and let I be the maximum iteration limit.

1. Approximate the batch update β̃∗_k^(t) using (5)
2. Repeat (i.e., for i = 1, 2, ..., I)
3.   Compute the derivative of F(ρ_k^(t)) using (6)
4.   Update the learning rate ρ_k^(t) using (7)
5. Until convergence
6. Update the window size by τ^(t) = (1 − ρ_k^(t)) τ^(t−1) + 1

Fig.1. Tuning the learning rate for β̃_k^(t).

Learning Rates for (u^(t), v^(t)). Additionally, HDP optimizes learning rates for each Beta pair (u_k^(t), v_k^(t)) using:

    min_{ρ̂_k^(t)} F(ρ̂_k^(t)) ≜ min_{ρ̂_k^(t)} 2 D_KL^sym(E_{q′}[λ′_k], E_{q∗}[λ′_k]).    (8)

Note that a Beta distribution with parameters (u_k, v_k) is a special case of a Dirichlet distribution with β̃_k (i.e., where V = 2, β̃_k1 = u_k and β̃_k2 = v_k). That is to say, objective (8) is a special case of objective (4). The optimization process outlined in Fig.1 can be directly applied to the learning rates for the Beta pair (u_k^(t), v_k^(t)).

Fig.2 summarizes the SVI applications of topic models with our learning rates.

1. Initialize parameters
2. For t = 1, 2, ..., ∞ do
3.   Sample a mini-batch
4.   Compute local parameters and noisy gradients for global parameters
5.   For k = 1, 2, ..., K do
6.     Compute the learning rate for β̃_k^(t) as in Fig.1
7.     For HDP, compute the learning rate for (u_k^(t), v_k^(t)) as a special case of β̃_k^(t)
8.     Update global parameters for this topic k using (3)
9.   End for
10. End for

Fig.2. SVI applications of topic models with our learning rates.

3.3 Time Complexity

In this subsection, we compare the computational complexity of ALR and our algorithm. In the context of SVI, both algorithms perform an additional learning rate optimization process, and thus we only compare these processes.

For each iteration of SVI, the cost of the ALR learning rate optimization process is O(K × V). Our algorithm estimates per-topic learning rates using an inner loop. Thus, in the worst case, the time cost is O(I × K × V). In this sense, our algorithm might be a bit more time-consuming than ALR. Subsection 4.4 gives empirical evaluations.
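To make the per-topic inner loop of Fig.1 and its O(I × K × V) worst-case cost concrete, the following sketch (our own illustration; it assumes the batch-update estimate from (5) is already available and substitutes a central finite difference for the closed-form derivative (6)) tunes ρ_k^(t) for a single topic:

import numpy as np

def tune_rate(beta_prev, nat_grad, beta_batch_est, rho_init=0.01, max_iter=20, tol=1e-6):
    """Gradient descent on F(rho) for one topic, in the spirit of Fig.1."""
    def objective(rho):
        noisy = np.maximum(beta_prev + rho * nat_grad, 1e-12)   # noisy update (3), kept positive
        p = noisy / noisy.sum()                                 # E_{q'}[phi_k]
        q = beta_batch_est / beta_batch_est.sum() + 1e-12       # E_{q*}[phi_k]
        return 2.0 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

    rho = rho_init
    for i in range(1, max_iter + 1):                            # at most I inner iterations per topic
        eta = 1.0 / i                                           # step size eta = 1/i
        grad = (objective(rho + 1e-6) - objective(rho - 1e-6)) / 2e-6  # numerical stand-in for (6)
        rho_new = rho - eta * grad                              # update rule (7)
        if abs(rho_new - rho) < tol:
            return rho_new
        rho = rho_new
    return rho

Running this routine for each of the K topics, on parameter vectors of length V, yields the per-iteration worst case O(I × K × V) discussed above.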
4 Experiment

In this section, we compare our algorithm with the Robbins-Monro (RM) rate, the constant rate[24-25] and ALR, which are popular choices for SVI. The RM rate is

    ρ^(t) = (t + τ)^(−κ),

where τ is the delay and κ is the forgetting rate.

We fit topic models on subsets① of corpora to find
the best settings for the RM rate and the constant rate, and we report the best results. For the RM rate, we set τ = 1 024 and search κ over the set {0.5, 0.6, 0.7, 0.8, 0.9, 1.0}. For the constant rate, we search over {0.0001, 0.001, 0.01, 0.1}.

① For each corpus, we randomly generate 10 subsets of 100K documents.
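For reference, the two baseline schedules have the following simple forms (a sketch of the grids above; κ = 0.9 is just one value from the searched set):

def rm_rate(t, tau=1024, kappa=0.9):
    """Robbins-Monro rate rho^(t) = (t + tau)^(-kappa)."""
    return (t + tau) ** (-kappa)

constant_rate = 0.01  # one of the searched constant rates

print(rm_rate(1), rm_rate(10_000), constant_rate)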
The settings of topic models are as follows. We set
the Dirichlet priors α = 0.1 and β = 0.01. Particularly for HDP, we set γ = α_0 = 1, the top-level truncation K = 200 and the second-level truncation to 15.
Three big corpora are used: Nature, Wikipedia and PubMed. The Nature corpus contains 350K documents from the journal Nature (from 1869 to 2008). After removing the stop words and the rare words, it uses a vocabulary of 5 000 unique words. For the Wikipedia② corpus, we randomly downloaded 2.5M documents from the English version of Wikipedia and processed these documents using a vocabulary of 7 700 unique words. For the PubMed corpus, we randomly sampled 3M documents from the original PubMed③ collection. After removing the stop words and the rare words, it uses a big vocabulary of 100K unique words.

② http://www.cs.princeton.edu/∼mdhoffma/, July 2015.
③ http://archive.ics.uci.edu/ml/datasets/Bag+of+Words, July 2015.
We randomly sampled 10K documents from each
corpus as the test data. The perplexity[11,15] on the
test dataset is used as the performance metric. Given
a test dataset D_test of S documents, the perplexity can be computed by:

    Perplexity(D_test) = exp(− Σ_{d=1}^{S} log p(w_d) / Σ_{d=1}^{S} |w_d|).
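A minimal sketch of this metric (our own; in practice log p(w_d) is approximated, e.g., by a variational bound as in [11,15]) is:

import numpy as np

def perplexity(log_probs, doc_lengths):
    """Perplexity(D_test) = exp(-sum_d log p(w_d) / sum_d |w_d|)."""
    return np.exp(-np.sum(log_probs) / np.sum(doc_lengths))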
4.1 Evaluation on Number of Topics
We fixed the mini-batch size M to be 100, and fit
the LDA model with different numbers of topics K over
the set {100, 200, 300}. The results of one sweep are
shown in this paper (early experiments showed that our algorithm is insensitive to different initializations). We computed the test perplexity every 10
iterations, and then computed it every 100/500/500 iterations after sweeping 50K/200K/300K documents for
Nature/Wikipedia/PubMed respectively. If the change
of the test perplexity is less than 1 × 10^−4, we consider that the algorithm has converged.
Overall, the experimental results shown in Fig.3 indicate that our algorithm outperforms the other learning
rates in all settings.
RM Rate and Constant Rate vs Our Algorithm. For the Nature corpus, the constant rate performs better than the RM rate in two of the three settings; however, their perplexity gaps are
small, e.g., about 15 with 200 topics and 20 with 300
topics. That is because documents from Nature corpus
are longer and its vocabulary is small (i.e., 5 000 unique
words), resulting in relatively stable performance. Our
algorithm is better than the two traditional rates across
Nature corpus, e.g., about 90 with 100 topics and about
70 with 200 topics. The Wikipedia corpus contains
enough documents to ensure convergence, so the perplexity values are consistently lower than those obtained on the Nature corpus. For this corpus,
we observe that our algorithm performs better than the
two traditional rates, e.g., about 110 with 100 topics.
For PubMed corpus, our algorithm also outperforms the
two traditional rates, and the performance advantages
are even more significant, e.g., about 110 improvement over the constant rate with 200 topics and about 170 improvement over the RM rate with 300 topics. Note that the perplexity of the constant rate fluctuates somewhat,
e.g., after 15 000 iterations with 100 topics. That is
because the constant rate becomes too large after so
many iterations, especially for the PubMed corpus using a very large vocabulary (i.e., 100K unique words).
More importantly, we find that our algorithm converges
significantly faster than the two traditional rates in all
settings.
ALR vs Our Algorithm. First, we observe that
our algorithm achieves better perplexity than ALR
across all three corpora, e.g., about 60 with 100 topics across Nature corpus, about 60 with 100 topics
across Wikipedia corpus and about 100 with 300 topics across PubMed corpus. Second, our algorithm converges faster. For PubMed corpus, we found that with
200/300 topics, our algorithm converges after about
25K/5K iterations, but ALR converges after about
27K/10K iterations. More importantly, the perplexity
curves of our algorithm are smooth, but the curves of ALR fluctuate in some settings, e.g., ALR with 300
topics across Wikipedia corpus, ALR with 200 topics
and 300 topics across PubMed corpus.
4.2 Evaluation on Mini-Batch Size
For LDA, we fixed the number of topics K to be 100, and evaluated our algorithm with different mini-batch sizes M over the set {100, 500, 1 000}. For HDP, the experiments were conducted with different mini-batch sizes M over {100, 500}. We computed the test perplexity every 10 iterations, and then computed it every 100/500/500 iterations after sweeping 50K/200K/300K documents for Nature/Wikipedia/PubMed respectively.
[Fig.3 here: test perplexity versus iteration curves on LDA, comparing the best constant rate, the best RM rate, ALR and our algorithm.]

Fig.3. Experimental results on LDA with different numbers of topics. (a) Nature with K = 100. (b) Nature with K = 200. (c) Nature with K = 300. (d) Wikipedia with K = 100. (e) Wikipedia with K = 200. (f) Wikipedia with K = 300. (g) PubMed with K = 100. (h) PubMed with K = 200. (i) PubMed with K = 300.
Fig.4 shows the experimental results on LDA. The
constant rate performs better than the RM rate, and
our algorithm is always the best. For Nature and
Wikipedia with small vocabulary, the perplexity gaps
are relatively small. For PubMed with its large vocabulary, our algorithm shows significant improvements over the two traditional rates, e.g., 130∼230 with the mini-batch of 500 documents. This is very meaningful for truly online data, which might involve an unbounded vocabulary.
We observe that larger mini-batches generally perform
better, as expected. However, with large mini-batches,
the perplexity curves of the two traditional rates are
somewhat fluctuant, e.g., the RM rate with the mini-batch of 1 000 documents across the Nature corpus and the constant rate with the mini-batch of 1 000 documents across the PubMed corpus. In contrast, the perplexity
curves of our algorithm are very smooth in most settings. The proposed algorithm outperforms ALR across
all three corpora, e.g., about 70 with the mini-batch of
1 000 documents across Nature corpus. For PubMed
corpus, the perplexity gap between ALR and our algorithm with the mini-batch of 500 documents is only about 40.
However, for the large mini-batch of 1 000 documents,
this perplexity gap is larger, i.e., about 60. In addition,
we also find that our algorithm converges faster and is
smoother than ALR.
Fig.5 shows the experimental results on HDP. Overall, our algorithm outperforms the other popular learning rates in all settings. For the RM rate and the constant rate, the improvement is about 100 across the Nature and Wikipedia corpora and above 200 across the PubMed corpus. For ALR, the performance improvement is about 50∼150. Furthermore, it can be seen that our algorithm converges faster in most settings.
[Fig.4 here: test perplexity versus iteration curves on LDA, comparing the best RM rate, the best constant rate, ALR and our algorithm.]

Fig.4. Experimental results on LDA with different mini-batch sizes. (a) Nature with M = 100. (b) Nature with M = 500. (c) Nature with M = 1 000. (d) Wikipedia with M = 100. (e) Wikipedia with M = 500. (f) Wikipedia with M = 1 000. (g) PubMed with M = 100. (h) PubMed with M = 500. (i) PubMed with M = 1 000.
4.3 Evaluation on Adaptive Learning Rate
An experiment to evaluate the “quality” of learning rates is designed as follows. We randomly sampled a subset of the Wikipedia corpus (called mini-Wiki) that contains 20K documents and all 7 700 unique words. We ran three kinds of inference algorithms on this mini-Wiki corpus at the same time: batch VI, SVI using ALR (SVI_ALR) and SVI using our algorithm (SVI_ours). At each iteration, we used batch VI as the baseline and then computed the KL divergence between the topic distributions from SVI_ALR/SVI_ours and those from batch VI (a lower KL divergence implies better performance). Because this work tunes learning rates for each topic, we randomly selected three topic distributions for comparison. All three inference algorithms use the same initialization of parameters. Both SVI methods use the same mini-batch of documents at each iteration (with mini-batch size M = 100).
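The per-topic comparison can be computed as sketched below (our own illustration; the divergence direction and the use of the Dirichlet means as topic distributions are our assumptions, since the text does not pin them down):

import numpy as np

def topic_kl(beta_tilde_batch, beta_tilde_svi, eps=1e-12):
    """KL divergence between the expected word distributions of one topic (batch VI vs. an SVI variant)."""
    p = beta_tilde_batch / beta_tilde_batch.sum() + eps   # reference topic from batch VI
    q = beta_tilde_svi / beta_tilde_svi.sum() + eps       # topic from SVI_ALR or SVI_ours
    return np.sum(p * np.log(p / q))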
We fit a 100-topic LDA model. Both SVI methods sweep the mini-Wiki corpus five times. Fig.6 shows the KL divergence results. It can be seen that SVI_ours always performs better than SVI_ALR. For SVI_ours, the KL divergence curves of the three topic distributions look very similar, but for SVI_ALR, the three curves show completely different trends. That is because ALR optimizes the learning rate by considering all global variables together; however, this is unsuitable for LDA, whose global variables (e.g., topics) are independent of each other.
[Fig.5 here: test perplexity versus iteration curves on HDP, comparing the best RM rate, the best constant rate, ALR and our algorithm.]

Fig.5. Experimental results on HDP with different mini-batch sizes. (a) Nature with M = 100. (b) Wikipedia with M = 100. (c) PubMed with M = 100. (d) Nature with M = 500. (e) Wikipedia with M = 500. (f) PubMed with M = 500.

[Fig.6 here: KL divergence versus iteration curves for three topics, comparing ALR and our algorithm against batch VI.]

Fig.6. Experimental results with respect to the “quality” of learning rates. (a) Topic 1. (b) Topic 2. (c) Topic 3.
4.4 Evaluation on Running Time
The training time is important for adaptive learning
rate algorithms. We ran the Python implementations
of ALR and our algorithm, and then compared their per-iteration training time.
We fit a 100-topic LDA model with the mini-batch
of 100 documents across the Wikipedia and PubMed corpora. For the small-vocabulary Wikipedia corpus, the
average per-iteration time of ALR is about 2.4 seconds, and our algorithm takes about 2.6 seconds. For
the large-vocabulary PubMed corpus, the average per-iteration time of ALR is about 2.5 seconds, and our
algorithm still needs about 2.6 seconds. Although our
algorithm is theoretically more time-consuming than
ALR (as discussed in Subsection 3.3), it is only a little more expensive (about 5%). This is because: 1) in practice, our rate-tuning process can converge to a local optimum in a few iterations (i.e., small I); 2) our Python implementation relies on efficient matrix manipulations. Therefore, we argue that our rate-tuning algorithm is sufficiently efficient in practice.
5 Conclusions
In this paper, we developed a novel algorithm that
can adaptively tune the learning rate of SVI. We then
applied it to topic models including LDA and HDP. Our
algorithm computes adaptive learning rates by minimizing the KL divergence between the expectation of
the variational distribution with noisy update and that
with batch update. In summary, our algorithm has
two advantages. First, KL divergence is a better similarity measure for variational distributions than the
Euclidean distance. Second, our algorithm individually optimizes the learning rate for each topic. This
is preferable to using a single learning rate, because
LDA and HDP both assume independence among topic distributions. The experimental results on the three big corpora indicate that the learning rates
obtained by our algorithm: 1) are significantly helpful
for SVI; 2) can converge to a stable point, though we
have not proved that our algorithm satisfies the RM
criterion.
References
[1] Jin F, Huang M L, Zhu X Y. Guided structure-aware review summarization. Journal of Computer Science and Technology, 2011, 26(4): 676-684.
[2] Li P, Wang B, Jin W. Improving Web document clustering
through employing user-related tag expansion techniques.
Journal of Computer Science and Technology, 2012, 27(3):
554-566.
[3] Newman D, Asuncion A, Smyth P, Welling M. Distributed
algorithms for topic models. Journal of Machine Learning
Research, 2009, 10: 1801-1828.
[4] Yan F, Xu N, Qi Y. Parallel inference for latent Dirichlet
allocation on graphics processing units. In Proc. the 23rd
NIPS, Dec. 2009, pp.2134-2142.
[5] Liu Z, Zhang Y, Chang E, Sun M. PLDA+: Parallel latent Dirichlet allocation with data placement and pipeline
processing. ACM Transactions on Intelligent Systems and
Technology, 2011, 2(3): Article No. 26.
[6] AlSumait L, Barbara D, Domeniconi C. On-line LDA:
Adaptive topic models for mining text streams with applications to topic detection and tracking. In Proc. the 8th
ICDM, Dec. 2008, pp.3-12.
[7] Yao L, Mimno D, McCallum A. Efficient methods for topic
model inference on streaming document collections. In Proc.
the 15th SIGKDD, June 28-July 1, 2009, pp.937-945.
[8] Hoffman M D, Blei D M. Online learning for latent Dirichlet
allocation. In Proc. the 24th NIPS, Dec. 2010.
[9] Mimno D, Hoffman M D, Blei D M. Sparse stochastic inference for latent Dirichlet allocation. In Proc. the 29th ICML,
June 27-July 3, 2012, pp.1599-1606.
[10] Wang C, Chen X, Smola A, Xing E P. Variance reduction for
stochastic gradient optimization. In Proc. the 27th NIPS,
Dec. 2013.
[11] Patterson S, Teh Y W. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In Proc. the 27th
NIPS, Dec. 2013.
[12] Zeng J, Liu Z Q, Cao X Q. Online belief propagation for topic modeling. arXiv:1210.2179, June 2013.
http://arxiv.org/pdf/1210.2179.pdf, July 2015.
[13] Hoffman M D, Blei D M, Wang C, Paisley J. Stochastic variational inference. Journal of Machine Learning Research,
2013, 14(1): 1303-1347.
[14] Ouyang J, Lu Y, Li X. Momentum online LDA for large-scale datasets. In Proc. the 21st ECAI, August 2014,
pp.1075-1076.
[15] Blei D M, Ng A Y, Jordan M I. Latent Dirichlet allocation.
Journal of Machine Learning Research, 2003, 3: 993-1022.
[16] Teh Y W, Jordan M I, Beal M J, Blei D M. Hierarchical
Dirichlet processes. Journal of the American Statistical Association, 2006, 101(476): 1566-1581.
[17] Wang C, Paisley J, Blei D M. Online variational inference
for the hierarchical Dirichlet process. In Proc. the 14th AISTATS, April 2011, pp.752-760.
[18] Johnson M J, Willsky A S. Stochastic variational inference
for Bayesian time series models. In Proc. the 31st ICML,
June 2014, pp.3872-3880.
[19] Hernandez-Lobato J M, Houlsby N, Ghahramani Z.
Stochastic inference for scalable probabilistic modeling of
binary matrices. In Proc. the 31st ICML, June 2014,
pp.1693-1710.
[20] Robbins H, Monro S. A stochastic approximation method.
The Annals of Mathematical Statistics, 1951, 22(3): 400-407.
[21] Ranganath R, Wang C, Blei D M, Xing E P. An adaptive
learning rate for stochastic variational inference. In Proc.
the 30th ICML, June 2013, pp.298-306.
[22] Amari S. Natural gradient works efficiently in learning.
Neural Computation, 1998, 10(2): 251-276.
[23] Schaul T, Zhang S, LeCun Y. No more pesky learning rates.
In Proc. the 30th ICML, June 2013, pp.343-351.
[24] Nemirovski A, Juditsky A, Lan G, Shapiro A. Robust
stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 2009, 19(4): 1574-1609.
[25] Collobert R, Weston J, Bottou L, Karlen M, Kavukcuoglu
K, Kuksa P. Natural language processing (almost) from
scratch. Journal of Machine Learning Research, 2011, 12:
2493-2537.
Xi-Ming Li received his M.S. degree
in computer science from Jilin University, Changchun, in 2011. Currently
he is a Ph.D. candidate in the College
of Computer Science and Technology
at Jilin University. His main research
interests include topic modeling and
multi-label learning.
Ji-Hong Ouyang is a professor
at the College of Computer Science
and Technology,
Jilin University,
Changchun. She received her Ph.D.
degree from Jilin University in 2005.
Her main research interests include
artificial intelligence and machine learning, specifically in spatial reasoning,
multi-label learning, topic modeling, and online learning.