Deep Generative Models for Sequence Learning
Zhe Gan
Advisor: Dr. Lawrence Carin
Ph.D. Qualifying Exam
December 2nd, 2015
Outline
1. Literature Review
   State Space Models
   Recurrent Neural Networks
   Temporal Restricted Boltzmann Machines
2. Deep Temporal Sigmoid Belief Networks
   Model Formulation
   Scalable Learning and Inference
   Experiments
3. Conclusion and Future work
State Space Models
We are interested in developing probabilistic models for sequential data.
Hidden Markov Models (HMMs) [12]: a mixture model with multinomial latent variables.
Linear Dynamical Systems (LDS) [8]: continuous latent variables.
Figure: Graphical model for HMM and LDS.
Recurrent Neural Networks
Recurrent Neural Networks (RNNs) take a sequence as input, recursively processing each input v_t while maintaining an internal hidden state h_t:
h_t = f(v_{t−1}, h_{t−1})   (1)
where f is a deterministic non-linear transition function.
Figure: Illustration for RNNs.
Recurrent Neural Networks
An RNN defines the likelihood as
p(x_1, …, x_T) = ∏_{t=1}^T p(x_t | x_{<t}),   p(x_t | x_{<t}) = g(h_t)   (2)
How to define f?
1. Logistic (sigmoid) transition:
   h_t = σ(W x_{t−1} + U h_{t−1} + b)   (3)
2. Long Short-Term Memory (LSTM) [7]
3. Gated Recurrent Units (GRU) [3]
All the randomness in the model lies in the conditional probability p(x_t | x_{<t}).
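As a minimal illustration of the sigmoid transition in Eq. (3) with a softmax read-out for p(x_t | x_{<t}), here is a sketch in NumPy; the dimensions, the softmax form of g, and the random parameter initialization are assumptions made for this example, not part of the original slides.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    a = a - a.max()
    e = np.exp(a)
    return e / e.sum()

rng = np.random.default_rng(0)
V, H = 10, 8                              # vocabulary size and hidden-state size (assumed)
W = 0.1 * rng.standard_normal((H, V))     # input-to-hidden weights
U = 0.1 * rng.standard_normal((H, H))     # hidden-to-hidden weights
b = np.zeros(H)
W_out = 0.1 * rng.standard_normal((V, H)) # read-out g(h_t) (assumed form)
c = np.zeros(V)

def rnn_step(x_prev, h_prev):
    """One step of Eq. (3): h_t = sigma(W x_{t-1} + U h_{t-1} + b)."""
    h_t = sigmoid(W @ x_prev + U @ h_prev + b)
    p_t = softmax(W_out @ h_t + c)        # p(x_t | x_{<t}) = g(h_t)
    return h_t, p_t

h = np.zeros(H)
x = np.eye(V)[3]                          # one-hot encoding of the previous symbol
h, p = rnn_step(x, h)
print(p.sum())                            # probabilities over the next symbol sum to 1
```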
Restricted Boltzmann Machines
The RBM [6] is an undirected graphical model that represents a joint distribution over binary visible units v ∈ {0, 1}^M and binary hidden units h ∈ {0, 1}^J as
P(v, h) = exp{−E(v, h)}/Z   (4)
E(v, h) = −(h^T W v + c^T v + b^T h)   (5)
Restricted Boltzmann Machines
The conditional posteriors over h and v factorize:
P(h_j = 1 | v) = σ(∑_m w_{jm} v_m + b_j)   (6)
P(v_m = 1 | h) = σ(∑_j w_{jm} h_j + c_m)   (7)
Inference: block Gibbs sampling.
Learning: Contrastive Divergence (CD); a minimal sketch of one CD-1 update follows below.
Applications:
1. Building blocks for Deep Belief Networks (DBNs) and Deep Boltzmann Machines (DBMs).
2. Learning sequential dependencies in time-series data [9, 13, 14, 15, 16].
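The following is a small NumPy sketch of block Gibbs sampling via Eqs. (6)-(7) and a single CD-1 parameter update; the learning rate, layer sizes, and initialization are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

M, J, lr = 20, 10, 0.05                 # visible/hidden sizes and step size (assumed)
W = 0.01 * rng.standard_normal((J, M))  # weights w_{jm}
b, c = np.zeros(J), np.zeros(M)         # hidden and visible biases

def sample_h_given_v(v):                # Eq. (6)
    p = sigmoid(W @ v + b)
    return (rng.random(J) < p).astype(float), p

def sample_v_given_h(h):                # Eq. (7)
    p = sigmoid(W.T @ h + c)
    return (rng.random(M) < p).astype(float), p

def cd1_update(v0):
    """One Contrastive Divergence (CD-1) step on a single visible vector."""
    global W, b, c
    h0, ph0 = sample_h_given_v(v0)      # positive phase
    v1, _   = sample_v_given_h(h0)      # one block-Gibbs reconstruction
    _, ph1  = sample_h_given_v(v1)      # negative phase
    W += lr * (np.outer(ph0, v0) - np.outer(ph1, v1))
    b += lr * (ph0 - ph1)
    c += lr * (v0 - v1)

v = (rng.random(M) < 0.5).astype(float)
cd1_update(v)
```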
Temporal Restricted Boltzmann Machines
The TRBM [13] is defined as a sequence of RBMs:
p(V, H) = p_0(v_1, h_1) ∏_{t=2}^T p(v_t, h_t | h_{t−1})   (8)
Each p(v_t, h_t | h_{t−1}) is defined as a conditional RBM:
p(v_t, h_t | h_{t−1}) ∝ exp(v_t^T W h_t + v_t^T c + h_t^T (b + U h_{t−1}))
Figure: Graphical model for TRBM.
Model Reviews
Properties of the reviewed models (✓ = yes, ✗ = no, − = not applicable):

Model   Directed   Latent   Nonlinear   Distributed¹
HMM     ✓          ✓        ✗           ✗
LDS     ✓          ✓        ✗           −
RNN     ✓          ✗        ✓           −
TRBM    ✗          ✓        ✓           ✓

Can we find a model that has all of the properties listed above? A deep directed latent variable model.

¹ Each hidden state in an HMM is a one-hot vector. To represent 2^N distinct states, an HMM requires a length-2^N one-hot vector, while the TRBM only needs a length-N binary vector.
Overview
Problem of interest: Developing deep directed latent
variable models for sequential data.
Main idea:
1. Constructing a hierarchy of Temporal Sigmoid Belief Networks (TSBNs).
2. The TSBN is defined as a sequential stack of Sigmoid Belief Networks (SBNs).
Challenge: designing scalable learning and inference algorithms.
Solution:
1. Stochastic Variational Inference (SVI).
2. Design a recognition model for fast inference.
Sigmoid Belief Networks
An SBN [11] models a visible vector v ∈ {0, 1}^M, in terms of hidden variables h ∈ {0, 1}^J and weights W ∈ R^{M×J}, with
p(v_m = 1 | h) = σ(w_m^T h + c_m),   p(h_j = 1) = σ(b_j)   (9)
where σ(x) ≜ 1/(1 + e^{−x}).
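A minimal sketch of ancestral sampling from Eq. (9) in NumPy; the layer sizes and the random parameters are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

M, J = 20, 10                          # visible and hidden dimensions (assumed)
W = 0.5 * rng.standard_normal((M, J))  # the weights w_m are the rows of W
b = np.zeros(J)                        # prior biases for h
c = np.zeros(M)                        # biases for v

def sbn_sample():
    """Ancestral sampling: draw h from its prior, then v given h (Eq. 9)."""
    h = (rng.random(J) < sigmoid(b)).astype(float)       # p(h_j = 1) = sigma(b_j)
    v = (rng.random(M) < sigmoid(W @ h + c)).astype(float)
    return v, h

v, h = sbn_sample()
```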
SBN vs. RBM
Graphical model: a directed bipartite graph over h and v for the SBN, versus an undirected one for the RBM.
Energy function:
−E_SBN(v, h) = v^T c + v^T W h + h^T b − ∑_m log(1 + e^{w_m^T h + c_m})
−E_RBM(v, h) = v^T c + v^T W h + h^T b
How to generate data:
1. SBN: ancestral sampling
2. RBM: iterative Gibbs sampling
SBN vs. RBM
Inference methods:
1. SBN: Gibbs sampling, mean-field VB, recognition model
2. RBM: Gibbs sampling
Learning methods:
1. SBN: SGD, Gibbs sampling, mean-field VB, MCEM (Polya-Gamma data augmentation)
2. RBM: Contrastive Divergence
The SBN has shown potential as a building block for deep models [4, 5]:
1. Binary image modeling
2. Topic modeling
Temporal Sigmoid Belief Networks
The TSBN is defined as a sequence of SBNs:
p_θ(V, H) = p(h_1) p(v_1 | h_1) · ∏_{t=2}^T p(h_t | h_{t−1}, v_{t−1}) · p(v_t | h_t, v_{t−1})
For t = 1, …, T, each conditional distribution is expressed as
p(h_{jt} = 1 | h_{t−1}, v_{t−1}) = σ(w_{1j}^T h_{t−1} + w_{3j}^T v_{t−1} + b_j)   (10)
p(v_{mt} = 1 | h_t, v_{t−1}) = σ(w_{2m}^T h_t + w_{4m}^T v_{t−1} + c_m)   (11)
Figure: Graphical model for TSBN.
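A sketch of ancestral sampling from the TSBN generative model defined by Eqs. (10)-(11), written in NumPy; the dimensions, parameter initialization, the grouping of the weights into matrices W1-W4, and the convention h_0 = v_0 = 0 are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
bernoulli = lambda p: (rng.random(p.shape) < p).astype(float)

M, J, T = 20, 10, 5                    # visible dim, hidden dim, sequence length (assumed)
W1 = 0.3 * rng.standard_normal((J, J)) # rows are w_{1j}: h_{t-1} -> h_t
W2 = 0.3 * rng.standard_normal((M, J)) # rows are w_{2m}: h_t     -> v_t
W3 = 0.3 * rng.standard_normal((J, M)) # rows are w_{3j}: v_{t-1} -> h_t
W4 = 0.3 * rng.standard_normal((M, M)) # rows are w_{4m}: v_{t-1} -> v_t
b, c = np.zeros(J), np.zeros(M)

def tsbn_sample():
    """Ancestral sampling of (v_1, ..., v_T) and (h_1, ..., h_T)."""
    h_prev, v_prev = np.zeros(J), np.zeros(M)   # h_0 = v_0 = 0 (assumed convention)
    H, V = [], []
    for t in range(T):
        h = bernoulli(sigmoid(W1 @ h_prev + W3 @ v_prev + b))   # Eq. (10)
        v = bernoulli(sigmoid(W2 @ h + W4 @ v_prev + c))        # Eq. (11)
        H.append(h); V.append(v)
        h_prev, v_prev = h, v
    return np.array(V), np.array(H)

V, H = tsbn_sample()
```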
TSBN Properties
Our TSBN model can be viewed as:
1. an HMM with distributed hidden-state representations²;
2. an LDS with non-linear dynamics;
3. a probabilistic construction of an RNN;
4. a directed-graphical-model counterpart of the TRBM.
² Each hidden state in the HMM is represented as a one-hot length-J vector, while in the TSBN the hidden state can be any length-J binary vector.
TSBN Variants: Modeling Different Data Types
Real-valued data: p(v_t | h_t, v_{t−1}) = N(μ_t, diag(σ_t²)), where
μ_{mt} = w_{2m}^T h_t + w_{4m}^T v_{t−1} + c_m   (12)
log σ_{mt}² = (w′_{2m})^T h_t + (w′_{4m})^T v_{t−1} + c′_m   (13)
Count data: p(v_t | h_t, v_{t−1}) = ∏_{m=1}^M y_{mt}^{v_{mt}}, where
y_{mt} = exp(w_{2m}^T h_t + w_{4m}^T v_{t−1} + c_m) / ∑_{m′=1}^M exp(w_{2m′}^T h_t + w_{4m′}^T v_{t−1} + c_{m′})   (14)
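A small NumPy sketch of these two emission models, Eqs. (12)-(14): a diagonal Gaussian for real-valued data and a softmax (multinomial) for count data. All weights, dimensions, and the particular inputs are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M, J = 6, 4                                   # visible and hidden dimensions (assumed)
W2, W4   = rng.standard_normal((M, J)), rng.standard_normal((M, M))
W2p, W4p = rng.standard_normal((M, J)), rng.standard_normal((M, M))  # primed weights for the log-variance
c, cp = np.zeros(M), np.zeros(M)

h_t    = rng.integers(0, 2, J).astype(float)  # a binary hidden state
v_prev = rng.standard_normal(M)               # the previous observation

# Real-valued emission, Eqs. (12)-(13): N(mu_t, diag(sigma_t^2))
mu_t      = W2 @ h_t + W4 @ v_prev + c
log_var_t = W2p @ h_t + W4p @ v_prev + cp
v_t_real  = mu_t + np.exp(0.5 * log_var_t) * rng.standard_normal(M)

# Count emission, Eq. (14): softmax probabilities y_t, then a multinomial draw
a = W2 @ h_t + W4 @ v_prev + c
y_t = np.exp(a - a.max()); y_t /= y_t.sum()
v_t_count = rng.multinomial(50, y_t)          # 50 total counts, assumed for the example
```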
TSBN Variants: Boosting Performance
High order: h_t and v_t depend on h_{t−n:t−1} and v_{t−n:t−1}.
Going deep:
1. adding stochastic hidden layers;
2. adding deterministic hidden layers.
Units that were conditionally independent "become" conditionally dependent.
Figure: (Top) Shallow TSBN. (Right) A 2-layer TSBN.
Scalable Learning and Inference
Using Polya-Gamma data augmentation, Gibbs sampling or mean-field VB can be implemented [5, 11], but both are inefficient.
To allow for
1. tractable and scalable inference and parameter learning,
2. without losing the flexibility of the variational posterior,
we apply recent advances in stochastic variational Bayes for non-conjugate inference [10].
Recognition model
Variational lower bound:
L(V, θ, φ) = E_{q_φ(H|V)}[log p_θ(V, H) − log q_φ(H|V)]   (15)
How to design q_φ(H|V)? No mean-field VB!
Recognition model: introduce a fixed-form distribution q_φ(H|V) to approximate the true posterior p(H|V).
1. Fast inference
2. Utilization of a global set of parameters
3. Potentially a better fit of the data
Recognition model
The recognition model is also defined as a TSBN:
q_φ(H|V) = q(h_1 | v_1) · ∏_{t=2}^T q(h_t | h_{t−1}, v_t, v_{t−1})   (16)
Figure: Generative model (weights W1–W4) and recognition model (weights U1–U3), shown unrolled over time.
Parameter Learning
Gradients:
∇_θ L(V) = E_{q_φ}[∇_θ log p_θ(V, H)]
∇_φ L(V) = E_{q_φ}[(log p_θ(V, H) − log q_φ(H|V)) × ∇_φ log q_φ(H|V)]
Variance reduction:
1. centering the learning signal
2. variance normalization
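The two gradients above can be estimated with Monte Carlo samples drawn from the recognition model (a score-function / REINFORCE-style estimator). Below is a minimal sketch for a toy SBN-like model with one factorized Bernoulli latent vector; the model, sizes, and parameters are assumptions for illustration, not the TSBN itself.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def log_bern(x, p):
    return np.sum(x * np.log(p + 1e-8) + (1 - x) * np.log(1 - p + 1e-8))

M, J = 12, 8                                   # sizes assumed for the example
W  = 0.3 * rng.standard_normal((M, J))         # generative weights for p(v|h), held fixed here
b  = np.zeros(J)                               # generative prior logits (gradient taken w.r.t. b)
c  = np.zeros(M)
Wq = 0.3 * rng.standard_normal((J, M))         # recognition weights
bq = np.zeros(J)                               # recognition biases (gradient taken w.r.t. bq)
v  = (rng.random(M) < 0.5).astype(float)       # one observed binary vector (assumed)

def gradient_estimates(n_samples=100):
    g_b, g_bq = np.zeros(J), np.zeros(J)
    q = sigmoid(Wq @ v + bq)                   # q_phi(h|v)
    for _ in range(n_samples):
        h = (rng.random(J) < q).astype(float)  # h ~ q_phi(h|v)
        log_p = log_bern(h, sigmoid(b)) + log_bern(v, sigmoid(W @ h + c))
        signal = log_p - log_bern(h, q)        # log p_theta(v,h) - log q_phi(h|v)
        g_b  += h - sigmoid(b)                 # grad_b  log p_theta(v, h)
        g_bq += signal * (h - q)               # signal * grad_bq log q_phi(h|v)
    return g_b / n_samples, g_bq / n_samples

grad_b, grad_bq = gradient_estimates()
```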
Experiments
Datasets:
Bouncing balls (binary)
Motion capture (real-valued)
Polyphonic music (binary)
State of the Union (count)
Notation:
1. HMSBN: TSBN model with W3 = 0 and W4 = 0;
2. DTSBN-S: deep TSBN with a stochastic hidden layer;
3. DTSBN-D: deep TSBN with a deterministic hidden layer.
Experimental Setup:
Random initialization
RMSprop optimization
Monte Carlo sampling using only a single sample
Weight decay regularization
Experiments
Prediction of v_t given v_{1:t−1}:
1. first obtain a sample from q_φ(h_{1:t−1} | v_{1:t−1});
2. calculate the conditional posterior p_θ(h_t | h_{1:t−1}, v_{1:t−1}) of the current hidden state;
3. make a prediction for v_t using p_θ(v_t | h_{1:t}, v_{1:t−1}).
Generation: ancestral sampling.
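A sketch of this three-step prediction for a binary TSBN, using the same parameter shapes as the earlier sampling sketch; the recognition parameters U1-U3, the random initialization, and the dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
bernoulli = lambda p: (rng.random(p.shape) < p).astype(float)

M, J = 20, 10                                   # assumed dimensions
# generative parameters (Eqs. 10-11) and recognition parameters (Eq. 16), random for illustration
W1, W2 = 0.3 * rng.standard_normal((J, J)), 0.3 * rng.standard_normal((M, J))
W3, W4 = 0.3 * rng.standard_normal((J, M)), 0.3 * rng.standard_normal((M, M))
b, c   = np.zeros(J), np.zeros(M)
U1 = 0.3 * rng.standard_normal((J, J))          # h_{t-1} -> h_t in the recognition model
U2 = 0.3 * rng.standard_normal((J, M))          # v_t     -> h_t
U3 = 0.3 * rng.standard_normal((J, M))          # v_{t-1} -> h_t
bq = np.zeros(J)

def predict_next(v_hist):
    """Predict v_t from v_{1:t-1} following the three steps above."""
    # 1. sample h_{1:t-1} from the recognition model q_phi(h_{1:t-1} | v_{1:t-1})
    h_prev, v_prev = np.zeros(J), np.zeros(M)
    for v in v_hist:
        h_prev = bernoulli(sigmoid(U1 @ h_prev + U2 @ v + U3 @ v_prev + bq))
        v_prev = v
    # 2. conditional prior of the current hidden state, Eq. (10)
    h_t = bernoulli(sigmoid(W1 @ h_prev + W3 @ v_prev + b))
    # 3. predictive probabilities for v_t, Eq. (11)
    return sigmoid(W2 @ h_t + W4 @ v_prev + c)

v_hist = [bernoulli(0.5 * np.ones(M)) for _ in range(4)]
p_next = predict_next(v_hist)
```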
Bouncing balls dataset
Each video is of length 100 and of resolution 30 × 30.
Bouncing balls dataset: Dictionaries
Our learned dictionaries are spatially localized.
Bouncing balls dataset: Prediction
1. Our approach achieves state-of-the-art performance.
2. A high-order TSBN reduces the prediction error significantly.
3. Using multiple layers improves performance.
Model     Dim       Order   Pred. Err.    Neg. Log. Like.
DTSBN-s   100-100   2       2.79 ± 0.39   69.29 ± 1.52
DTSBN-d   100-100   2       2.99 ± 0.42   70.47 ± 1.52
TSBN      100       4       3.07 ± 0.40   70.41 ± 1.55
TSBN      100       1       9.48 ± 0.38   77.71 ± 0.83
RTRBM     3750      1       3.88 ± 0.33   −
SRTRBM    3750      1       3.31 ± 0.33   −
Motion capture dataset: Prediction
We used 33 running and walking sequences of subject 35 in the
CMU motion capture dataset.
TSBN-based models improve significantly over the RBM-based models.
Model       Walking        Running
DTSBN-s     4.40 ± 0.28    2.56 ± 0.40
DTSBN-d     4.62 ± 0.01    2.84 ± 0.01
TSBN        5.12 ± 0.50    4.85 ± 1.26
HMSBN       10.77 ± 1.15   7.39 ± 0.47
ss-SRTRBM   8.13 ± 0.06    5.88 ± 0.05
g-RTRBM     14.41 ± 0.38   10.91 ± 0.27
Motion capture dataset: Generation
1. These generated data are readily produced from the model and demonstrate realistic behavior.
2. The smooth trajectories are walking movements, while the vibrating ones are running.
Figure: Generated motion trajectories. (Left) Walking. (Middle)
Running-running-walking. (Right) Running-walking.
Polyphonic music dataset: Prediction
Four polyphonic music sequences of piano [2]: Piano-midi.de, Nottingham, MuseData, and JSB chorales.
Table: Test log-likelihood for the music datasets (baseline results taken from [2]).
Model      Piano.   Nott.    Muse.    JSB.
TSBN       −7.98    −3.67    −6.81    −7.48
RNN-NADE   −7.05    −2.31    −5.60    −5.56
RTRBM      −7.36    −2.62    −6.35    −6.35
RNN        −8.37    −4.46    −8.13    −8.71
Polyphonic music dataset: Generation
We can generate different styles of music based on different training datasets.
Figure: Generated samples. (Top) Generated from Piano-midi.de. (Bottom) Generated from Nottingham.
State of the Union dataset: Prediction
Prediction is concerned with estimating the held-out words.
MP: mean precision over all the years that appear.
PP: predictive precision for the final year.
Model      Dim     MP              PP
HMSBN      25      0.327 ± 0.002   0.353 ± 0.070
DHMSBN-s   25-25   0.299 ± 0.001   0.378 ± 0.006
GP-DPFA    100     0.223 ± 0.001   0.189 ± 0.003
DRFM       25      0.217 ± 0.003   0.177 ± 0.010
State of the Union dataset: Dynamic Topic Modeling
The learned trajectory exhibits different temporal patterns across
the topics.
Figure: Topic-usage trajectories from 1800 to 2000. Topic 29: peak annotated "Nicaragua v. U.S.". Topic 30: peaks annotated "War of 1812", "World War II", and "Iraq War". Topic 130: annotated "The age of American revolution".
State of the Union dataset: Dynamic Topic Modeling
Table: Top 10 most probable words associated with the STU topics.
Topic #29: family, budget, Nicaragua, free, future, freedom, excellence, drugs, families, God
Topic #30: officer, civilized, warfare, enemy, whilst, gained, lake, safety, American, militia
Topic #130: government, country, public, law, present, citizens, united, house, foreign, gentlemen
Topic #64: generations, generation, recognize, brave, crime, race, balanced, streets, college, school
Topic #70: Iraqi, Qaida, Iraq, Iraqis, AI, Saddam, ballistic, terrorists, hussein, failures
Topic #74: Philippines, islands, axis, Nazis, Japanese, Germans, mines, sailors, Nazi, hats
Conclusion and Future work
Conclusion
1. Proposed Temporal Sigmoid Belief Networks
2. Developed an efficient variational optimization algorithm
3. Extensive applications to diverse data types
Future work
1. Controlled style transitioning
2. Language modeling
3. Multi-modality learning
References I
A. Acharya, J. Ghosh, and M. Zhou.
Nonparametric Bayesian factor analysis for dynamic count matrices.
In AISTATS, 2015.
N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent.
Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation
and transcription.
In ICML, 2012.
Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger
Schwenk, and Yoshua Bengio.
Learning phrase representations using RNN encoder-decoder for statistical machine translation.
arXiv:1406.1078, 2014.
Z. Gan, C. Chen, R. Henao, D. Carlson, and L. Carin.
Scalable deep Poisson factor analysis for topic modeling.
In ICML, 2015.
Z. Gan, R. Henao, D. Carlson, and L. Carin.
Learning deep sigmoid belief networks with data augmentation.
In AISTATS, 2015.
G. Hinton.
Training products of experts by minimizing contrastive divergence.
Neural Computation, 2002.
References II
Sepp Hochreiter and Jürgen Schmidhuber.
Long short-term memory.
Neural Computation, 1997.
R. Kalman.
Mathematical description of linear dynamical systems.
Journal of the Society for Industrial and Applied Mathematics, Series A: Control, 1963.
R. Mittelman, B. Kuipers, S. Savarese, and H. Lee.
Structured recurrent temporal restricted Boltzmann machines.
In ICML, 2014.
A. Mnih and K. Gregor.
Neural variational inference and learning in belief networks.
In ICML, 2014.
R. Neal.
Connectionist learning of belief networks.
Artificial Intelligence, 1992.
L. Rabiner and B. Juang.
An introduction to hidden Markov models.
IEEE ASSP Magazine, 1986.
I. Sutskever and G. Hinton.
Learning multilevel distributed representations for high-dimensional sequences.
In AISTATS, 2007.
References III
I. Sutskever, G. Hinton, and G. Taylor.
The recurrent temporal restricted Boltzmann machine.
In NIPS, 2009.
G. Taylor and G. Hinton.
Factored conditional restricted Boltzmann machines for modeling motion style.
In ICML, 2009.
G. Taylor, G. Hinton, and S. Roweis.
Modeling human motion using binary latent variables.
In NIPS, 2006.
Backup I
Contrastive Divergence:
∂ log P(v)/∂θ = E_{p(v,h)}[∂E(v, h)/∂θ] − E_{p(h|v)}[∂E(v, h)/∂θ]   (17)
Backup II
The contribution of v to the log-likelihood can be lower-bounded as follows:
log p_θ(v) = log ∑_h p_θ(v, h)   (18)
           ≥ ∑_h q_φ(h|v) log [p_θ(v, h) / q_φ(h|v)]   (19)
           = E_{q_φ(h|v)}[log p_θ(v, h) − log q_φ(h|v)]   (20)
           = L(v, θ, φ)   (21)
We can also rewrite the bound as
L(v, θ, φ) = log p_θ(v) − KL(q_φ(h|v) || p_θ(h|v))   (22)
Backup III
Differentiating the variational lower bound w.r.t. the recognition model parameters gives
∇_φ L(v) = ∇_φ E_q[log p_θ(v, h) − log q_φ(h|v)]
         = ∇_φ ∑_h q_φ(h|v) log p_θ(v, h) − ∇_φ ∑_h q_φ(h|v) log q_φ(h|v)   (23)
         = ∑_h log p_θ(v, h) ∇_φ q_φ(h|v) − ∑_h (log q_φ(h|v) + 1) ∇_φ q_φ(h|v)
         = ∑_h (log p_θ(v, h) − log q_φ(h|v)) ∇_φ q_φ(h|v)   (24)
where we have used the fact that ∑_h ∇_φ q_φ(h|v) = ∇_φ ∑_h q_φ(h|v) = ∇_φ 1 = 0. Using the identity ∇_φ q_φ(h|v) = q_φ(h|v) ∇_φ log q_φ(h|v) then gives
∇_φ L(v) = E_q[(log p_θ(v, h) − log q_φ(h|v)) × ∇_φ log q_φ(h|v)]   (25)
Backup IV
"
Eq [∇φ log qφ (h|v)] = Eq
=
X
∇φ qφ (h|v)
qφ (h|v)
#
(26)
∇φ qφ (h|v)
(27)
X
(28)
h
= ∇φ
qφ (h|v)
h
= ∇φ 1
(29)
=0
(30)
Backup V
Algorithm 1: Compute gradient estimates.
∆θ ← 0, ∆φ ← 0, ∆λ ← 0, L ← 0
for t ← 1 to T do
    h_t ∼ q_φ(h_t | v_t)
    l_t ← log p_θ(v_t, h_t) − log q_φ(h_t | v_t)
    L ← L + l_t
    l_t ← l_t − C_λ(v_t)
end for
c_b ← mean(l_1, …, l_T)
v_b ← variance(l_1, …, l_T)
c ← α c + (1 − α) c_b
v ← α v + (1 − α) v_b
for t ← 1 to T do
    l_t ← (l_t − c) / max(1, √v)
    ∆θ ← ∆θ + ∇_θ log p_θ(v_t, h_t)
    ∆φ ← ∆φ + l_t ∇_φ log q_φ(h_t | v_t)
    ∆λ ← ∆λ + l_t ∇_λ C_λ(v_t)
end for
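Below is a runnable NumPy transcription of Algorithm 1 for the same toy SBN-style model used earlier; the data-dependent baseline C_λ is taken to be linear in v, and the model, sizes, and α are assumptions made only so the sketch is self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
log_bern = lambda x, p: np.sum(x * np.log(p + 1e-8) + (1 - x) * np.log(1 - p + 1e-8))

M, J, T, alpha = 12, 8, 16, 0.9
W, b, c = 0.3 * rng.standard_normal((M, J)), np.zeros(J), np.zeros(M)   # theta
Wq, bq  = 0.3 * rng.standard_normal((J, M)), np.zeros(J)                # phi
A, a0   = np.zeros(M), 0.0                    # lambda: baseline C_lambda(v) = A.v + a0 (assumed form)
c_run, v_run = 0.0, 1.0                       # running mean / variance of the learning signal
V = (rng.random((T, M)) < 0.5).astype(float)  # a mini-batch of T binary observations (assumed)

d_theta = {"W": np.zeros_like(W), "b": np.zeros_like(b), "c": np.zeros_like(c)}
d_phi   = {"Wq": np.zeros_like(Wq), "bq": np.zeros_like(bq)}
d_lam   = {"A": np.zeros_like(A), "a0": 0.0}
L, ls, hs = 0.0, [], []

for v in V:                                    # first loop: sample and compute learning signals
    q = sigmoid(Wq @ v + bq)
    h = (rng.random(J) < q).astype(float)      # h_t ~ q_phi(h_t | v_t)
    l = log_bern(h, sigmoid(b)) + log_bern(v, sigmoid(W @ h + c)) - log_bern(h, q)
    L += l
    ls.append(l - (A @ v + a0))                # subtract the baseline C_lambda(v_t)
    hs.append((h, q))

cb, vb = np.mean(ls), np.var(ls)
c_run = alpha * c_run + (1 - alpha) * cb       # centering statistic
v_run = alpha * v_run + (1 - alpha) * vb       # variance-normalization statistic

for (v, l, (h, q)) in zip(V, ls, hs):          # second loop: accumulate gradient estimates
    l = (l - c_run) / max(1.0, np.sqrt(v_run))
    d_theta["W"] += np.outer(v - sigmoid(W @ h + c), h)   # grad_theta log p_theta(v_t, h_t)
    d_theta["b"] += h - sigmoid(b)
    d_theta["c"] += v - sigmoid(W @ h + c)
    d_phi["Wq"]  += l * np.outer(h - q, v)                # l_t * grad_phi log q_phi(h_t | v_t)
    d_phi["bq"]  += l * (h - q)
    d_lam["A"]   += l * v                                 # l_t * grad_lambda C_lambda(v_t)
    d_lam["a0"]  += l
```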