Generative Models for Sentences
Amjad Almahairi
PhD student
August 16th 2014
Outline
1. Motivation
• Language modelling
• Full Sentence Embeddings
2. Approach
• Bayesian Networks
• Variational Autoencoders (VAE)
• VAE variants for modelling sentences
3. Preliminary results
Motivation 1: Language Modelling
• Traditional approaches for language modelling are mainly based on an
approximation of the chain rule:
$P(w_0, w_1, \ldots, w_n) \approx \prod_{i=0}^{n} P(w_i \mid w_{i-1}, \ldots, w_{i-C})$
• We end up learning a model of a word given its previous context
• Do we take into account the global coherence of the sentence?
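To make the factorization above concrete, here is a minimal sketch (my own, not from the slides); `cond_prob` is a hypothetical stand-in for any model of a word given its previous C words:

```python
import math

def sentence_log_prob(words, cond_prob, C=3):
    """Approximate log P(w_0, ..., w_n) as the sum of log P(w_i | previous C words).

    `cond_prob(word, context)` is a hypothetical conditional model, e.g. an
    n-gram table or a neural LM restricted to a context window of size C.
    """
    total = 0.0
    for i, w in enumerate(words):
        context = tuple(words[max(0, i - C):i])  # at most C previous words
        total += math.log(cond_prob(w, context))
    return total

# Toy usage: a uniform model over a 10-word vocabulary ignores context entirely,
# so every 4-word sentence gets the same probability (10^-4).
uniform = lambda w, ctx: 0.1
print(sentence_log_prob("i said again .".split(), uniform))  # ≈ -9.21
```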
Motivation 1: Language Modelling
• Intuitively, people map an internal semantic form
into a syntactic form, which is then linearized into
words
[Diagram: an internal idea is mapped to a syntactic form and linearized into a sequence of words]
Motivation 2: Sentence Embeddings
• Word embeddings have been very successful in many NLP tasks
• Train a model on a very general task such that it finds a good representation for words
• Use them in another task, and possibly fine-tune them
• We would like to do the same for sentences
• Learn a fixed representation that encodes syntax and
semantics
• This can be very useful for tasks that condition on the
full sentence (e.g. machine translation)
[Diagram: a sequence of words encoded into a single sentence embedding]
Goals
• Learn a joint probabilistic model $P(X, Z)$ of sentences and representations
• Query in both directions
• Given a representation $Z$, what is $X \sim P(X \mid Z)$?
  • Generate new sentences
• Given a sentence $X$, what is $Z \sim P(Z \mid X)$?
  • Use $Z$ for another task
  • Or use $Z$ to generate "similar" sentences from $P(X \mid Z)$
• Find whether a given $X$ is probable under $P(X)$
  • Or compute an estimate of $P(X)$
[Diagram: a latent embedding $Z$ generating the observed words $X$]
Bayesian Networks with latent variables
• Directed probabilistic graphical models
• Causal model: models the flow from cause to effect
• $P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{Pa}(x_i))$
• Easy to generate unbiased samples
• Ancestral sampling
• But very hard to infer the state of latent variables
or to sample from the posterior
• Consequently, learning is very hard too
[Diagram: latent representation nodes generating the observed words]
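As a toy numerical illustration (mine, not from the slides) of why the two directions differ: ancestral sampling just follows the arrows, while even a two-node model already needs Bayes' rule for the posterior over the latent variable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy directed model: latent topic Z in {0, 1}, observed word X in {0, 1, 2}.
p_z = np.array([0.6, 0.4])                # P(Z)
p_x_given_z = np.array([[0.7, 0.2, 0.1],  # P(X | Z=0)
                        [0.1, 0.3, 0.6]]) # P(X | Z=1)

def ancestral_sample():
    """Sample parents before children: first Z ~ P(Z), then X ~ P(X | Z)."""
    z = rng.choice(2, p=p_z)
    x = rng.choice(3, p=p_x_given_z[z])
    return z, x

print([ancestral_sample() for _ in range(5)])

# The hard direction is the posterior P(Z | X): it needs Bayes' rule and a sum
# over Z, which becomes intractable when Z is high-dimensional or continuous.
x_obs = 2
posterior = p_z * p_x_given_z[:, x_obs]
posterior /= posterior.sum()
print(posterior)  # P(Z | X=2)
```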
Variational Autoencoders (VAE)
Kingma and Welling 2014
• Defined for a very general setting
• $X$: observed variables (continuous/discrete)
• $Z$: latent variables (continuous)
• $p_\theta(Z \mid X)$: intractable posterior
[Diagram: graphical model in which latent $Z$ generates observed $X$; recognition parameters $\phi$, generative parameters $\theta$]
• Deals with the inference problem by learning an approximate (but tractable) posterior $q_\phi(Z \mid X)$
• Using $q_\phi(Z \mid X)$ we can define a lower bound on $\log p_\theta(X)$:
$\mathcal{L}(X) = -D_{KL}\big(q_\phi(Z \mid X) \,\|\, p_\theta(Z)\big) + \mathbb{E}_{q_\phi(Z \mid X)}\!\left[\log p_\theta(X \mid Z)\right]$
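For the common case where $p_\theta(Z) = \mathcal{N}(0, I)$ and $q_\phi(Z \mid X)$ is a diagonal Gaussian (the setting used in the paper), the KL term of this bound has a closed form; a small sketch of it (my own):

```python
import numpy as np

def kl_q_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ): the first term of
    the lower bound when the prior is a standard normal and q is a diagonal Gaussian."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

# Sanity check: the KL is zero when q equals the prior.
print(kl_q_to_standard_normal(np.zeros(3), np.zeros(3)))  # 0.0
```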
Variational Autoencoders (VAE)
Kingma and Welling 2014
• The new idea here is the “reparameterization trick”:
• For $Z \sim q_\phi(Z \mid X)$, assume $Z = g_\phi(X, \epsilon)$, where $\epsilon \sim p(\epsilon)$ is independent noise
• Now we can write:
$\mathbb{E}_{q_\phi(Z \mid X)}\!\left[\log p_\theta(X \mid Z)\right] = \mathbb{E}_{p(\epsilon)}\!\left[\log p_\theta(X \mid g_\phi(X, \epsilon))\right] \approx \frac{1}{L} \sum_{l=1}^{L} \log p_\theta\!\left(X \mid g_\phi(X, \epsilon^{(l)})\right)$
• So we can back-propagate through the model
• Optimize the lower bound w.r.t. $\theta$ and $\phi$
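A minimal numpy sketch (my own) of the trick and the resulting Monte Carlo estimate; `decode_log_lik` is a hypothetical stand-in for $\log p_\theta(X \mid Z)$, and the encoder is assumed to output a Gaussian mean and log-variance:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterized_sample(mu, log_var):
    """z = g_phi(x, eps) with eps ~ N(0, I): the randomness is moved into eps,
    so z is a deterministic (and differentiable) function of the encoder
    outputs (mu, log_var). Gradients of log p_theta(x | z) can then flow back
    into phi, which plain sampling z ~ q_phi(z | x) would not allow."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def mc_reconstruction_term(x, mu, log_var, decode_log_lik, L=1):
    """(1/L) * sum_l log p_theta(x | g_phi(x, eps_l)): the Monte Carlo estimate
    of E_q[log p_theta(x | z)]; `decode_log_lik(x, z)` is a hypothetical stand-in."""
    return np.mean([decode_log_lik(x, reparameterized_sample(mu, log_var))
                    for _ in range(L)])
```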
Variational Autoencoders (VAE)
Kingma and Welling 2014
• In VAE, a neural network is used to parameterize $q_\phi(Z \mid X)$ and $p_\theta(X \mid Z)$
[Diagram: an encoder network computes $q_\phi(Z \mid X)$ from $X$; a decoder network computes $p_\theta(X \mid Z)$ from $Z$]
• In our case $X$ is a variable-length sentence, and standard feed-forward NNs cannot deal with that
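To make the parameterization concrete, a toy sketch (my own) of an MLP encoder that maps a fixed-size input to the mean and log-variance of $q_\phi(Z \mid X)$; the fixed input dimensionality is exactly what a variable-length sentence does not provide:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_encoder(x, params):
    """A toy one-hidden-layer encoder: x -> (mu, log_var), i.e. the parameters
    of q_phi(Z | X). `params` = (W1, b1, W_mu, b_mu, W_lv, b_lv) plays the role of phi."""
    W1, b1, W_mu, b_mu, W_lv, b_lv = params
    h = np.tanh(x @ W1 + b1)
    return h @ W_mu + b_mu, h @ W_lv + b_lv

# This only works because x has a fixed dimensionality (here 8), unlike a
# variable-length sentence.
d_in, d_h, d_z = 8, 16, 4
params = (rng.standard_normal((d_in, d_h)), np.zeros(d_h),
          rng.standard_normal((d_h, d_z)), np.zeros(d_z),
          rng.standard_normal((d_h, d_z)), np.zeros(d_z))
mu, log_var = mlp_encoder(rng.standard_normal(d_in), params)
```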
Solution 1: Tree-structured VAE
• Use a recursive NN
• Combine nodes in a tree structure
according to the sentence parse tree
• Requires a pre-specified tree
structure
• For inference and generation!
• A tree can be very deep: O(#words)
• Depth of the full model is depth($q_\phi(Z \mid X)$) + depth($p_\theta(X \mid Z)$)
[Diagram: a recursive encoder tree for $q_\phi(Z \mid X)$ and a decoder tree for $p_\theta(X \mid Z)$, both following the parse tree]
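A minimal sketch (my own toy parameterization, not the exact model) of the recursive composition on the encoder side, assuming the binary parse tree is given as nested tuples:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.standard_normal((2 * d, d)) * 0.1  # composition weights shared at every node
word_vec = {w: rng.standard_normal(d) for w in "the cat sat on the mat".split()}

def compose(tree):
    """Recursive NN: a leaf is a word vector; an internal node combines the
    representations of its two children with a shared layer, following the
    pre-specified parse tree, e.g. ("the", ("cat", "sat"))."""
    if isinstance(tree, str):
        return word_vec[tree]
    left, right = tree
    return np.tanh(np.concatenate([compose(left), compose(right)]) @ W)

sentence_vec = compose((("the", "cat"), ("sat", (("on", "the"), "mat"))))
print(sentence_vec.shape)  # (8,) -- a fixed-size encoding of the sentence
```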
Solution 2: Pyramid-structured VAE
• Use the Gated Recursive Convolutional
Network
• Recursively runs binary convolution
• Activation of a node is a weighted
sum of: new activation, left, and
right child
• Very deep (always #words-1)
• Gating can help by shortcutting paths
• Gating can be seen as a way to learn a (soft) tree structure
[Diagram: a pyramid of gated recursive convolutions for $q_\phi(Z \mid X)$ and $p_\theta(X \mid Z)$]
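A toy sketch (my own simplified parameterization) of the gated recursive convolution on the encoder side:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.standard_normal((2 * d, d)) * 0.1   # candidate activation weights
G = rng.standard_normal((2 * d, 3)) * 0.1   # gate weights (new, left, right)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def grconv_encode(word_vectors):
    """Gated recursive convolution: each layer combines adjacent pairs, shrinking
    the sequence by one, so the depth is #words - 1. Each parent is a gated
    mixture of a new candidate activation and its left/right children; a gate
    that copies a child unchanged shortcuts the path, acting like a learned
    (soft) tree structure."""
    h = list(word_vectors)
    while len(h) > 1:
        nxt = []
        for left, right in zip(h[:-1], h[1:]):
            pair = np.concatenate([left, right])
            candidate = np.tanh(pair @ W)
            w_new, w_left, w_right = softmax(pair @ G)
            nxt.append(w_new * candidate + w_left * left + w_right * right)
        h = nxt
    return h[0]  # fixed-size sentence representation

vecs = [rng.standard_normal(d) for _ in range(5)]
print(grconv_encode(vecs).shape)  # (8,)
```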
MSR Sentence completion task
• Given a sentence missing a word, select the correct replacement from
five alternatives
• ``I have seen it on him , and could _____ to it.''
1. write
2. migrate
3. climb
4. swear
5. contribute
• The test set is a public dataset of 1,040 sentences derived from 19th-century novels
• The training set is also a collection of 19th-century novels, with 2.2M sentences and 46M words
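One way an estimate of $P(X)$ could be used for this task (a sketch under my own assumptions; `sentence_log_prob` is a hypothetical scorer such as the variational lower bound): fill the blank with each candidate and keep the highest-scoring sentence.

```python
def complete(sentence_with_blank, candidates, sentence_log_prob):
    """Fill the blank with each candidate, score the full sentence under the
    model, and return the highest-scoring candidate. `sentence_log_prob` is a
    hypothetical estimate of log P(X), e.g. the variational lower bound."""
    scored = [(sentence_log_prob(sentence_with_blank.replace("_____", c)), c)
              for c in candidates]
    return max(scored)[1]

# Toy usage with a stand-in scorer (a real scorer would come from the model):
toy_scorer = lambda s: -len(s)
print(complete("I have seen it on him , and could _____ to it .",
               ["write", "migrate", "climb", "swear", "contribute"],
               toy_scorer))
```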
Experiments on short phrases
• Dataset: 70K four-word phrases from the Holmes dataset (19th-century novels)
• Samples of training data:
  • `` gracious goodness !
  • `` of course !
  • asked mr. swift .
  • will you ? ''
  • asked Tom again .
  • questioned the boy .
Experiments on short phrases
• Trained model #2
• No tricks
  • eat marcia umbrella ''
  • it oh ? !
  • exclaimed out ! .
  • asked nonsense arnold !
  • she you sat !
Experiments on short phrases
• Trained model #2
• Pretrain with an autoencoder:
  • Learn $p_\theta(X \mid Z)$ parameters
  • Fix $q_\phi(Z \mid X)$ for the first 20 epochs
• Samples:
  • `` yes ! ''
  • `` what ! ''
  • said the man .
  • do you ? ''
  • he went on .
  • said the voice .
  • he he spoke .
• Sample pairs:
  • i whispered eagerly .  →  i said again .
  • the lady nodded .  →  the lady asked .
  • i cried indignantly .  →  he said carelessly .
  • `` no . ''  →  `` yes . ''
  • `` ah ! ''  →  `` oh ! ''
  • our eyes met .  →  five minutes passed .
Solution 3: Tree-Convolutional VAE
• Combination of the previous two approaches
• Iterate:
- convolutional layer
- pooling layer
• This makes the depth O(log #words)
[Diagram: alternating convolution and pooling layers for the encoder $q_\phi(Z \mid X)$ and decoder $p_\theta(X \mid Z)$]
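A toy sketch (my own simplified parameterization) of the convolution-plus-pooling iteration on the encoder side, showing why the depth is logarithmic in the number of words:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.standard_normal((2 * d, d)) * 0.1  # weights of the binary convolution

def conv_pool_encode(word_vectors):
    """Alternate a binary convolution over adjacent pairs with max-pooling of
    stride 2, so each iteration roughly halves the sequence length and the
    total depth is O(log #words)."""
    h = [np.asarray(v) for v in word_vectors]
    while len(h) > 1:
        # convolution over adjacent pairs
        conv = [np.tanh(np.concatenate([l, r]) @ W) for l, r in zip(h[:-1], h[1:])]
        if len(conv) == 1:
            h = conv
        else:
            # max-pooling with stride 2 (element-wise max of each pair)
            h = [np.maximum(conv[i], conv[min(i + 1, len(conv) - 1)])
                 for i in range(0, len(conv), 2)]
    return h[0]

vecs = [rng.standard_normal(d) for _ in range(9)]
print(conv_pool_encode(vecs).shape)  # (8,)
```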
Thank you!