Generative Models for Sentences
Amjad Almahairi, PhD student
August 16th, 2014

Outline
1. Motivation
   • Language modelling
   • Full sentence embeddings
2. Approach
   • Bayesian networks
   • Variational Autoencoders (VAE)
   • VAE variants for modelling sentences
3. Preliminary results

Motivation 1: Language Modelling
• Traditional approaches to language modelling are mainly based on an approximation of the chain rule:
  $P(w_0, w_1, \ldots, w_n) \approx \prod_{i=0}^{n} P(w_i \mid w_{i-1}, \ldots, w_{i-C})$
• We end up learning a model of a word given its previous context.
• Do we take into account the global coherence of the sentence?

Some idea
• Intuitively, people map an internal semantic form into a syntactic form, which is then linearized into words.

Motivation 2: Sentence Embeddings
• Word embeddings have been very successful in many NLP tasks:
  • Train a model on a very general task so that it finds a good representation for words.
  • Use the embeddings in another task, and possibly fine-tune them.
• We would like to do the same for sentences:
  • Learn a fixed representation that encodes syntax and semantics.
  • This can be very useful for tasks that condition on the full sentence (e.g. machine translation).

Goals
• Learn a joint probabilistic model $P(X, Z)$ of sentences $X$ and embeddings $Z$.
• Query in both directions:
  • Given a representation $Z$, what is $X \sim P(X \mid Z)$?
    • Generate new sentences.
  • Given a sentence $X$, what is $Z \sim P(Z \mid X)$?
    • Use $Z$ for another task.
    • Or use $Z$ to generate "similar" sentences from $P(X \mid Z)$.
  • Find whether a given $X$ is probable under $P(X)$, or obtain an estimate of $P(X)$.

Bayesian Networks with latent variables
• Directed probabilistic graphical models: a latent representation generates the observed words.
• Causal model: models flow from cause to effect.
• $P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid Pa(x_i))$
• Easy to generate unbiased samples (ancestral sampling).
• But it is very hard to infer the state of the latent variables or to sample from the posterior.
• Consequently, learning is very hard too.

Variational Autoencoders (VAE) — Kingma and Welling, 2014
• Defined for a very general setting:
  • $X$: observed variables (continuous/discrete)
  • $Z$: latent variables (continuous)
  • $p_\theta(Z \mid X)$: intractable posterior
• Deals with the inference problem by learning an approximate (but tractable) posterior $q_\phi(Z \mid X)$.
• Using $q_\phi(Z \mid X)$ we can define a lower bound on $\log p_\theta(X)$:
  $\mathcal{L}(X) = -D_{KL}\big(q_\phi(Z \mid X) \,\|\, p_\theta(Z)\big) + \mathbb{E}_{q_\phi(Z \mid X)}\big[\log p_\theta(X \mid Z)\big]$

Variational Autoencoders (VAE) — Kingma and Welling, 2014
• The new idea here is the "reparameterization trick":
  • For $Z \sim q_\phi(Z \mid X)$, assume $Z = g_\phi(X, \epsilon)$, where $\epsilon \sim p(\epsilon)$ is independent noise.
• Now we can write:
  $\mathbb{E}_{q_\phi(Z \mid X)}\big[\log p_\theta(X \mid Z)\big] = \mathbb{E}_{p(\epsilon)}\big[\log p_\theta(X \mid g_\phi(X, \epsilon))\big] \approx \frac{1}{L} \sum_{l=1}^{L} \log p_\theta\big(X \mid g_\phi(X, \epsilon^{(l)})\big)$
• So we can back-propagate through the model and optimize the lower bound w.r.t. $\theta$ and $\phi$.
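To make the two terms of the bound concrete, here is a minimal sketch (not from the slides) that computes a single-sample estimate of $\mathcal{L}(X)$ and back-propagates through it via the reparameterization trick. The Gaussian prior, diagonal-Gaussian $q_\phi(Z \mid X)$, Bernoulli decoder, layer sizes, and the use of PyTorch are all illustrative assumptions, not the configuration used in the talk.

```python
# Minimal sketch of the VAE lower bound with the reparameterization trick.
# Assumptions (not from the talk): fixed-size input x, Gaussian prior p(Z) = N(0, I),
# diagonal-Gaussian q_phi(Z|X), Bernoulli decoder p_theta(X|Z), L = 1 noise sample.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=200, z_dim=20):
        super().__init__()
        # q_phi(Z|X): encoder producing mean and log-variance of a diagonal Gaussian
        self.enc = nn.Linear(x_dim, h_dim)
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        # p_theta(X|Z): decoder producing Bernoulli parameters over x
        self.dec = nn.Linear(z_dim, h_dim)
        self.dec_out = nn.Linear(h_dim, x_dim)

    def lower_bound(self, x):
        h = torch.tanh(self.enc(x))
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        # Reparameterization trick: Z = g_phi(X, eps) = mu + sigma * eps, eps ~ N(0, I)
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps
        # E_q[log p_theta(X|Z)], estimated with a single sample (L = 1)
        logits = self.dec_out(torch.tanh(self.dec(z)))
        log_px_given_z = -F.binary_cross_entropy_with_logits(
            logits, x, reduction='none').sum(dim=1)
        # KL(q_phi(Z|X) || p(Z)) has a closed form for two Gaussians
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)
        return (log_px_given_z - kl).mean()

# Usage: maximize the bound w.r.t. theta and phi by minimizing its negative
model = ToyVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 784).round()   # stand-in binary data
loss = -model.lower_bound(x)
loss.backward()
opt.step()
```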
Variational Autoencoders (VAE) — Kingma and Welling, 2014
• In a VAE, a neural network is used to parameterize $q_\phi(Z \mid X)$ and $p_\theta(X \mid Z)$.
• In our case $X$ is a variable-length sentence, and straightforward neural networks cannot deal with that.

Solution 1: Tree-structured VAE
• Use a recursive NN.
• Combine nodes in a tree structure according to the sentence parse tree.
• Requires a pre-specified tree structure, for both inference and generation!
• A tree can be very deep: O(#words).
• The depth of the full model is depth($q_\phi(Z \mid X)$) + depth($p_\theta(X \mid Z)$).

Solution 2: Pyramid-structured VAE
• Use the Gated Recursive Convolutional Network.
• Recursively runs a binary convolution.
• The activation of a node is a weighted sum of: its new activation, its left child, and its right child.
• Very deep (always #words − 1 levels).
• Gating can help by short-cutting paths.
• Gating can be seen as a way to learn a (soft) tree structure.

MSR Sentence Completion Task
• Given a sentence missing a word, select the correct replacement from five alternatives:
  "I have seen it on him, and could _____ to it."
  1. write
  2. migrate
  3. climb
  4. swear
  5. contribute
• The test set is a public dataset of 1,040 sentences derived from 19th-century novels.
• The training set is also a collection of 19th-century novels, with 2.2M sentences and 46M words.

Experiments on short phrases
• Dataset: 70K four-word phrases from the Holmes dataset (19th-century novels).
• Samples of training data:
  • `` gracious goodness !
  • `` of course !
  • asked mr. swift .
  • will you ? ''
  • asked Tom again .
  • questioned the boy .
• Trained model #2, no tricks — samples:
  • eat marcia umbrella ''
  • it oh ? !
  • exclaimed out ! .
  • asked nonsense arnold !
  • she you sat !
• Pretrain with an autoencoder (learn the $p_\theta(X \mid Z)$ parameters, fix $q_\phi(Z \mid X)$ for the first 20 epochs) — samples:
  • `` yes ! ''
  • `` what ! ''
  • said the man .
  • do you ? ''
  • he went on .
  • said the voice .
  • he he spoke .
  • i whispered eagerly .
  • i said again .
  • the lady nodded .
  • the lady asked .
  • i cried indignantly .
  • he said carelessly .
  • `` no . ''
  • `` yes . ''
  • `` ah ! ''
  • `` oh ! ''
  • our eyes met .
  • five minutes passed .

Solution 3: Tree-Convolutional VAE
• Combination of the previous two approaches.
• Iterate: a convolutional layer followed by a pooling layer.
• This makes the depth O(log #words).
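To make the encoder structures of Solutions 2 and 3 concrete, the sketch below (not from the slides) implements one gated binary convolution step, in which each new node is a soft mixture of a fresh activation and its left and right children, and iterates it with max-pooling so the depth is O(log #words); dropping the pooling step reproduces the always-(#words − 1)-deep pyramid. The layer sizes, tanh nonlinearity, softmax gate, and pooling scheme are illustrative assumptions rather than the exact parameterization used in the talk.

```python
# Sketch of a gated binary convolution over a sequence of word vectors,
# iterated with pooling so the sequence shrinks to a single sentence vector.
# Dimensions, nonlinearities and the softmax gate are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedBinaryConv(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.new = nn.Linear(2 * d, d)    # candidate activation from (left, right)
        self.gate = nn.Linear(2 * d, 3)   # mixture weights for (new, left, right)

    def forward(self, h):                 # h: (seq_len, d)
        left, right = h[:-1], h[1:]       # all adjacent pairs -> seq_len - 1 nodes
        pair = torch.cat([left, right], dim=-1)
        h_new = torch.tanh(self.new(pair))
        w = F.softmax(self.gate(pair), dim=-1)   # soft choice among the three inputs
        return w[:, 0:1] * h_new + w[:, 1:2] * left + w[:, 2:3] * right

def encode(words, conv, pool=True):
    """Reduce (seq_len, d) word vectors to a single (d,) sentence vector.

    With pool=True (max-pool adjacent pairs after each convolution) the depth
    is O(log #words), as in the tree-convolutional variant; with pool=False,
    repeating the convolution alone takes #words - 1 steps.
    """
    h = words
    while h.size(0) > 1:
        h = conv(h)                                   # seq_len -> seq_len - 1
        if pool and h.size(0) > 1:
            n = h.size(0) - h.size(0) % 2
            pooled = torch.max(h[:n:2], h[1:n:2])     # roughly halve the length
            h = torch.cat([pooled, h[n:]], dim=0)
    return h.squeeze(0)

# Usage on a toy 6-word "sentence" of 50-dimensional word vectors
conv = GatedBinaryConv(50)
sentence = torch.randn(6, 50)
z = encode(sentence, conv)   # fixed-size representation, e.g. input to q_phi(Z|X)
print(z.shape)               # torch.Size([50])
```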
Thank you!