RL and GAN for Sentence Generation and Chat-bot
Hung-yi Lee

Outline
• RL for chat-bot: the human provides the feedback
• GAN for chat-bot: the machine (discriminator) provides the feedback
  • Rewarding every generation step
  • Monte Carlo search
  • Rewarding partially decoded sequences

Review: Chat-bot
• Sequence-to-sequence learning.
• Training data: dialogues such as A: OOO, B: XXX, A: ∆∆∆, ……
• Input: the history information and the current input sentence (A: OOO, B: XXX); output: the response sentence (A: ∆∆∆). The input is read by an encoder, and the response is produced by a generator.

Review: Encoder
• A hierarchical encoder reads the dialogue history sentence by sentence (e.g. "你好嗎" – "How are you?" and "我很好" – "I'm fine") and passes the summarized representation to the generator.

Review: Generator
• An RNN decoder generates the response word by word (e.g. words A and B), starting from <BOS> and conditioned on the encoder output at every step.
• The condition can be different at each step when an attention mechanism is used.

Review: Training the Generator
• $h$: input sentence and history/context; $x$: correct response (a word sequence); training data: pairs $(h, x)$.
• $x_t$ is the $t$-th word of $x$, and $x_{1:t}$ denotes the first $t$ words of $x$.
• Minimize the cross-entropy of each output distribution against the reference: $C = \sum_t C_t$, where $C_t = -\log P_\theta(x_t \mid x_{1:t-1}, h)$ is the cross-entropy of the generator output at step $t$.
• Therefore
  $C = -\sum_t \log P_\theta(x_t \mid x_{1:t-1}, h) = -\log\big[ P_\theta(x_1 \mid h)\, P_\theta(x_2 \mid x_1, h) \cdots P_\theta(x_T \mid x_{1:T-1}, h) \big] = -\log P_\theta(x \mid h)$
• Minimizing $C$ is maximizing the likelihood of generating $x$ given $h$.

RL for Sentence Generation
Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, Dan Jurafsky, "Deep Reinforcement Learning for Dialogue Generation", EMNLP 2016

Introduction
• The machine obtains feedback from the user: e.g. responding "Bye bye" to "How are you?" gets reward −10, while responding "Hi" to "Hello" gets reward 3.
• The chat-bot learns to maximize the expected reward.

Maximizing Expected Reward
• The encoder + generator with parameters $\theta$ produces a response $x$ for input $h$; a human returns a reward $R(h, x)$, which is used to update $\theta$.
• Objective: $\theta^* = \arg\max_\theta \bar{R}_\theta$, where
  $\bar{R}_\theta = \sum_h P(h) \sum_x R(h, x)\, P_\theta(x \mid h)$
• $P(h)$ is the probability that the input/history is $h$; $P_\theta(x \mid h)$ is the randomness in the generator.
• Equivalently,
  $\bar{R}_\theta = E_{h \sim P(h)}\big[ E_{x \sim P_\theta(x|h)}[ R(h, x) ] \big] = E_{h \sim P(h),\, x \sim P_\theta(x|h)}[ R(h, x) ] \approx \frac{1}{N} \sum_{i=1}^N R(h^i, x^i)$
  using samples $(h^1, x^1), \dots, (h^N, x^N)$.
• But where is $\theta$? After sampling, the estimate no longer depends on $\theta$ explicitly, so it cannot be differentiated directly; this is what the policy gradient below resolves.
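Before turning to the policy gradient, here is a minimal sketch of the maximum-likelihood training reviewed above, i.e. minimizing $C = -\log P_\theta(x \mid h)$ as a sum of per-step cross-entropies. This is an illustrative PyTorch sketch, not code from the lecture; the vocabulary size, the model sizes, the `Seq2Seq` class name, and the random toy batch are all assumptions.

```python
# Minimal sketch (not from the lecture): maximum-likelihood training of a
# seq2seq generator, i.e. minimizing C = sum_t -log P_theta(x_t | x_{1:t-1}, h).
# VOCAB_SIZE, HIDDEN, and the toy batch below are illustrative assumptions.
import torch
import torch.nn as nn

VOCAB_SIZE, EMB, HIDDEN = 1000, 64, 128
BOS = 1  # assumed id of the <BOS> token

class Seq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB_SIZE, EMB)
        self.encoder = nn.GRU(EMB, HIDDEN, batch_first=True)
        self.decoder = nn.GRU(EMB, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, VOCAB_SIZE)

    def forward(self, h_words, x_words):
        # Encode the history h, then decode conditioned on it.
        _, state = self.encoder(self.emb(h_words))
        # Feed <BOS> + x_{1:T-1} and predict x_{1:T}.
        bos = torch.full((x_words.size(0), 1), BOS, dtype=torch.long)
        dec_in = torch.cat([bos, x_words[:, :-1]], dim=1)
        dec_out, _ = self.decoder(self.emb(dec_in), state)
        return self.out(dec_out)   # logits for P_theta(x_t | x_{1:t-1}, h)

model = Seq2Seq()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy batch of (h, x) word-id pairs; real data would come from dialogues.
h_batch = torch.randint(2, VOCAB_SIZE, (8, 6))
x_batch = torch.randint(2, VOCAB_SIZE, (8, 5))

logits = model(h_batch, x_batch)
# Cross-entropy summed over time steps corresponds to -log P_theta(x | h).
C = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB_SIZE), x_batch.reshape(-1), reduction="mean")
C.backward()
opt.step()
```

The policy-gradient objective derived next keeps exactly this log-likelihood term and only changes how each sampled pair is weighted.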
Policy Gradient
• Differentiate the expected reward with respect to $\theta$:
  $\nabla \bar{R}_\theta = \sum_h P(h) \sum_x R(h,x)\, \nabla P_\theta(x|h) = \sum_h P(h) \sum_x R(h,x)\, P_\theta(x|h)\, \frac{\nabla P_\theta(x|h)}{P_\theta(x|h)}$
• Using $\frac{d \log f(x)}{dx} = \frac{1}{f(x)} \frac{d f(x)}{dx}$:
  $\nabla \bar{R}_\theta = \sum_h P(h) \sum_x R(h,x)\, P_\theta(x|h)\, \nabla \log P_\theta(x|h) = E_{h \sim P(h),\, x \sim P_\theta(x|h)}\big[ R(h,x)\, \nabla \log P_\theta(x|h) \big] \approx \frac{1}{N} \sum_{i=1}^N R(h^i, x^i)\, \nabla \log P_\theta(x^i | h^i)$
  where the last step approximates the expectation by sampling.

Policy Gradient – Gradient Ascent
• $\theta^{new} \leftarrow \theta^{old} + \eta\, \nabla \bar{R}_{\theta^{old}}$, with $\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{i=1}^N R(h^i, x^i)\, \nabla \log P_\theta(x^i | h^i)$.
• If $R(h^i, x^i)$ is positive, the update increases $P_\theta(x^i | h^i)$.
• If $R(h^i, x^i)$ is negative, the update decreases $P_\theta(x^i | h^i)$.

Implementation: Maximum Likelihood vs. Reinforcement Learning
• Maximum likelihood
  • Training data: $(h^1, x^1), \dots, (h^N, x^N)$ from the database (equivalently, every pair has $R(h^i, x^i) = 1$).
  • Objective function: $\frac{1}{N} \sum_{i=1}^N \log P_\theta(x^i | h^i)$
  • Gradient: $\frac{1}{N} \sum_{i=1}^N \nabla \log P_\theta(x^i | h^i)$
• Reinforcement learning
  • Training data: sampled pairs $(h^i, x^i)$, where $x^i$ is generated by the current model from $h^i$ and scored by the human as $R(h^i, x^i)$.
  • Objective function: $\frac{1}{N} \sum_{i=1}^N R(h^i, x^i)\, \log P_\theta(x^i | h^i)$
  • Gradient: $\frac{1}{N} \sum_{i=1}^N R(h^i, x^i)\, \nabla \log P_\theta(x^i | h^i)$
• RL is therefore maximum likelihood on the sampled data, with each example weighted by its reward $R(h^i, x^i)$.
• $\theta^0$ can be well pre-trained by maximum likelihood on $(h^1, x^1), \dots, (h^N, x^N)$.

Implementation
• In each iteration $t$, sample $(h^1, x^1), \dots, (h^N, x^N)$ with the current parameters $\theta^t$ and obtain the rewards $R(h^1, x^1), \dots, R(h^N, x^N)$.
• New objective: $\frac{1}{N} \sum_{i=1}^N R(h^i, x^i)\, \log P_\theta(x^i | h^i)$
• Update: $\theta^{t+1} \leftarrow \theta^t + \eta\, \nabla \bar{R}_{\theta^t}$, where $\nabla \bar{R}_{\theta^t} \approx \frac{1}{N} \sum_{i=1}^N R(h^i, x^i)\, \nabla \log P_{\theta^t}(x^i | h^i)$.

Add a Baseline
• If $R(h^i, x^i)$ is always positive, $\frac{1}{N} \sum_i R(h^i, x^i)\, \nabla \log P_\theta(x^i | h^i)$ pushes up the probability of every sampled response.
• Ideally this is harmless because the output is a probability distribution: responses with larger rewards rise more, the rest fall. With finite sampling, however, responses that happen not to be sampled (e.g. one of $(h, x^1), (h, x^2), (h, x^3)$) have their probability pushed down even if they are good.
• Add a baseline $b$ so that the weight can be negative:
  $\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{i=1}^N \big( R(h^i, x^i) - b \big)\, \nabla \log P_\theta(x^i | h^i)$
• There are several ways to obtain the baseline $b$.

AlphaGo-style training!
• Let two agents talk to each other:
  • "How old are you?" / "See you." / "See you." / "See you." (a bad dialogue)
  • "How old are you?" / "I am 16." / "I thought you were 12." / "What made you think so?" (a good dialogue)
• Use a pre-defined evaluation function to compute R(h, x).

Example Reward
• The final reward R(h, x) is the weighted sum of three terms:
  $R(h, x) = \lambda_1 r_1(h, x) + \lambda_2 r_2(h, x) + \lambda_3 r_3(h, x)$
  • $r_1$: ease of answering (don't be a conversation killer)
  • $r_2$: information flow (say something new)
  • $r_3$: semantic coherence (don't contradict what was said before)

Example Results
• (Results table from the paper, omitted here.)

Reinforcement learning?
• The usual RL picture, e.g. playing a video game: start with observation $s_1$, take action $a_1$ ("right"), obtain reward $r_1 = 0$ and observation $s_2$; take action $a_2$ ("fire", killing an alien), obtain reward $r_2 = 5$ and observation $s_3$; and so on.
• Sentence generation can be cast the same way (Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, Wojciech Zaremba, "Sequence Level Training with Recurrent Neural Networks", ICLR 2016):
  • Observation: <BOS> plus the words generated so far; action: choosing the next word from the action set (e.g. A or B).
  • The reward is $r = 0$ at intermediate steps; only at the end is the reward R("BAA", reference) obtained.
  • The action taken influences the observation at the next step.
• One can use any advanced RL technique here, for example actor-critic:
  • Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, Yoshua Bengio, "An Actor-Critic Algorithm for Sequence Prediction", ICLR 2017.
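Before moving on to GAN-based training, here is a minimal, self-contained sketch of the policy-gradient update with a baseline, $\theta \leftarrow \theta + \eta\, \frac{1}{N} \sum_i (R(h^i, x^i) - b)\, \nabla \log P_\theta(x^i | h^i)$, implemented as a reward-weighted log-likelihood loss. The tiny conditional generator, the sampling loop, and the random `reward()` stub are assumptions for illustration; in the dialogue setting the reward would come from a human or from the hand-crafted $r_1, r_2, r_3$ above.

```python
# Illustrative sketch, not the lecture's code: one REINFORCE-style update
# for a conditional sentence generator.
import torch
import torch.nn as nn

VOCAB, HID, BOS, MAX_LEN = 1000, 128, 1, 10

class CondGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, HID)
        self.rnn = nn.GRUCell(HID, HID)
        self.out = nn.Linear(HID, VOCAB)

    def init_state(self, h_words):
        # Crude conditioning on the history: mean of its word embeddings.
        return self.emb(h_words).mean(dim=1)

    def sample(self, h_words):
        """Sample a response; return (word ids, sum_t log P_theta(x_t | x_{1:t-1}, h))."""
        state = self.init_state(h_words)
        word = torch.full((h_words.size(0),), BOS, dtype=torch.long)
        logp, words = 0.0, []
        for _ in range(MAX_LEN):
            state = self.rnn(self.emb(word), state)
            dist = torch.distributions.Categorical(logits=self.out(state))
            word = dist.sample()
            logp = logp + dist.log_prob(word)
            words.append(word)
        return torch.stack(words, dim=1), logp

def reward(h_words, x_words):
    # Stand-in for human feedback R(h, x); here just a random score in [-1, 1].
    return torch.rand(h_words.size(0)) * 2 - 1

gen = CondGenerator()
opt = torch.optim.Adam(gen.parameters(), lr=1e-4)

h_batch = torch.randint(3, VOCAB, (16, 6))   # sampled histories h^i
x_batch, logp = gen.sample(h_batch)          # sampled responses x^i ~ P_theta(x|h^i)
R = reward(h_batch, x_batch)
b = R.mean()                                  # one simple choice of baseline
loss = -((R - b).detach() * logp).mean()      # gradient ascent on the expected reward
opt.zero_grad()
loss.backward()
opt.step()
```

Note that the weight $(R - b)$ is treated as a constant (detached): only $\log P_\theta(x^i | h^i)$ carries the gradient, matching the policy-gradient formula above.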
SeqGAN
• Lantao Yu, Weinan Zhang, Jun Wang, Yong Yu, "SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient", AAAI 2017
• Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, Dan Jurafsky, "Adversarial Learning for Neural Dialogue Generation", arXiv preprint, 2017

Basic Idea – Sentence Generation
• As in the original GAN, a code $z$ sampled from a prior distribution is fed to the generator, which produces a sentence $x$; sampling from the RNN at each time step also provides randomness.
• The discriminator takes a sentence (real or generated) and outputs a scalar: real or fake.

Algorithm – Sentence Generation
• Initialize the generator Gen and the discriminator Dis.
• In each iteration:
  • Sample real sentences $x$ from the database.
  • Generate sentences $\tilde{x}$ by Gen.
  • Update Dis to increase $Dis(x)$ and decrease $Dis(\tilde{x})$.
  • Update Gen such that the discriminator's scalar output on the generated sentences increases.

Basic Idea – Chat-bot (Conditional GAN)
• The chat-bot (encoder + decoder) takes the input sentence/history $h$ and produces a response $x$.
• The discriminator takes the pair $(h, x)$, is trained on human dialogues, and outputs real or fake.

Algorithm – Chat-bot
• Training data: dialogues such as A: OOO, B: XXX, ……, A: ∆ ∆ ∆ (history $h$ and response $x$).
• Initialize the generator Gen (the chat-bot) and the discriminator Dis.
• In each iteration:
  • Sample a real history $h$ and response $x$ from the database.
  • Sample a real history $h'$ from the database, and generate a response $\tilde{x} = Gen(h')$.
  • Update Dis to increase $Dis(h, x)$ and decrease $Dis(h', \tilde{x})$.
  • Update Gen such that the discriminator's scalar output on $(h', \tilde{x})$ increases.

Can we do backpropagation?
• To update the generator through the discriminator, the gradient would have to flow through the sampled words; but tuning the generator a little bit does not change the discrete output, so backpropagation through the sampling step is not possible.
• Alternative: improved WGAN applied to the output word distributions, ignoring the sampling process.

Reinforcement Learning?
• Consider the output of the discriminator as the reward: updating the generator to increase the discriminator output is the same as updating it to get maximum reward,
  $\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{i=1}^N \big( D(h^i, x^i) - b \big)\, \nabla \log P_\theta(x^i | h^i)$
• Different from typical RL: the reward function itself, the discriminator, is also updated during training.
• g-step (update the generator): sample $(h^1, x^1), \dots, (h^N, x^N)$ and their discriminator scores $D(h^1, x^1), \dots, D(h^N, x^N)$; new objective $\frac{1}{N} \sum_{i=1}^N D(h^i, x^i)\, \log P_\theta(x^i | h^i)$; update $\theta^{t+1} \leftarrow \theta^t + \eta\, \nabla \bar{R}_{\theta^t}$ with $\nabla \bar{R}_{\theta^t} \approx \frac{1}{N} \sum_{i=1}^N D(h^i, x^i)\, \nabla \log P_{\theta^t}(x^i | h^i)$.
• d-step (update the discriminator): feed real pairs $(h, x)$ and generated pairs to the discriminator and train it to label them real and fake respectively.

Reward for Every Generation Step
• $\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{i=1}^N \big( D(h^i, x^i) - b \big)\, \nabla \log P_\theta(x^i | h^i)$
• Example: $h^i$ = "What is your name?", $x^i$ = "I don't know". $D(h^i, x^i) - b$ is negative, so $\theta$ is updated to decrease
  $\log P_\theta(x^i | h^i) = \log P(\text{"I"} | h^i) + \log P(\text{"don't"} | h^i, \text{"I"}) + \log P(\text{"know"} | h^i, \text{"I don't"})$,
  which also decreases $P(\text{"I"} | h^i)$, even though generating "I" is not the problem.
• Example: $h^i$ = "What is your name?", $x^i$ = "I am John". $D(h^i, x^i) - b$ is positive, so $\theta$ is updated to increase $\log P_\theta(x^i | h^i)$, which also increases $P(\text{"I"} | h^i)$.
• Because the same sentence-level weight is applied to every word, good words in bad sentences are penalized and vice versa; with limited sampling this does not average out, so it is better to assign a reward to every generation step:
  $\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^{T} \big( Q(h^i, x^i_{1:t}) - b \big)\, \nabla \log P_\theta(x^i_t | h^i, x^i_{1:t-1})$
• Method 1: Monte Carlo (MC) search. Method 2: a discriminator for partially decoded sequences.

Monte Carlo Search
• How to estimate $Q(h^i, x^i_{1:t})$? For example, $Q(\text{"What is your name?"}, \text{"I"})$:
  • Starting from the partial response "I", use a roll-out generator to sample complete responses and score them with the discriminator, e.g.
    $x^A$ = "I am John", $D(h^i, x^A) = 1.0$; $x^B$ = "I am happy", $D(h^i, x^B) = 0.1$; $x^C$ = "I don't know", $D(h^i, x^C) = 0.1$; $x^D$ = "I am superman", $D(h^i, x^D) = 0.8$.
  • Average the discriminator scores: $Q(h^i, \text{"I"}) = 0.5$.
• A roll-out generator for sampling is needed. (A minimal sketch of this estimate follows below.)
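Here is a minimal sketch of the Monte Carlo search estimate just described: complete the partial response with a roll-out generator several times, score each completion with the discriminator, and average. The `rollout()` and `discriminator()` stubs below are placeholders for illustration, not the implementations used in SeqGAN.

```python
# Minimal sketch of Method 1 (Monte Carlo search) under assumed interfaces:
# rollout(h, prefix) completes a partial response with a roll-out generator,
# and discriminator(h, x) returns D(h, x) in [0, 1].
import random

def rollout(h, prefix, vocab=("I", "am", "John", "happy", "don't", "know"), max_len=6):
    # Placeholder roll-out generator: randomly extends the prefix.
    x = list(prefix)
    while len(x) < max_len:
        x.append(random.choice(vocab))
    return x

def discriminator(h, x):
    # Placeholder discriminator D(h, x); a real one would be a trained network.
    return random.random()

def estimate_Q(h, prefix, num_rollouts=16):
    """Q(h, x_{1:t}) ~ average of D(h, x) over sampled completions of the prefix."""
    scores = [discriminator(h, rollout(h, prefix)) for _ in range(num_rollouts)]
    return sum(scores) / len(scores)

h = "What is your name?"
print(estimate_Q(h, ["I"]))         # reward credited to generating "I" at step 1
print(estimate_Q(h, ["I", "am"]))   # reward credited to the prefix "I am"
```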
", "𝐼" 𝑥1𝑖 ℎ𝑖 𝑥 𝐴 = I am John 𝐷 ℎ𝑖 , 𝑥 𝐴 = 1.0 𝑥 𝐵 = I am happy 𝐷 ℎ𝑖 , 𝑥 𝐵 = 0.1 𝑥 𝐶 = I don’t know 𝐷 ℎ𝑖 , 𝑥 𝐶 = 0.1 A roll-out generator for sampling is needed 𝑄 ℎ𝑖 , "𝐼" = 0.5 𝑥 𝐷 = I am superman 𝐷 ℎ𝑖 , 𝑥 𝐷 = 0.8 avg Rewarding Partially Decoded Sequences • Training a discriminator that is able to assign rewards to both fully and partially decoded sequences • Break generated sequences into partial sequences h=“What is your name?”, x=“I am john” h h=“What is your name?”, x=“I am” x h=“What is your name?”, x=“I” 𝑄 ℎ, 𝑥1:𝑡 h=“What is your name?”, x=“I don’t know” h=“What is your name?”, x=“I don’t” h=“What is your name?”, x=“I” Dis scalar h 𝑥1:𝑡 Dis Teacher Forcing • The training of generative model is unstable • This reward is used to promote or discourage the generator’s own generated sequences. • Usually It knows that the generated results are bad, but does not know what results are good. • Teacher Forcing Training Data for SeqGAN: Adding more Data: ℎ1 , 𝑥 1 , … , ℎ𝑁 , 𝑥 𝑁 Obtained by sampling weighted by 𝐷 ℎ𝑖 , 𝑥 𝑖 ℎ1 , 𝑥 1 , … , ℎ𝑁 , 𝑥 𝑁 Real data Consider 𝐷 ℎ𝑖 , 𝑥 𝑖 = 1 Experiments in paper • Sentence generation: Synthetic data • Given an LSTM • Using the LSTM to generate a lot of sequences as “real data” • Generator learns from the “real data” by different approaches • Generator generates some sequences • Using the LSTM to compute the negative log likelihood (NLL) of the sequences • Smaller is better Experiments in paper - Synthetic data Experiments in paper - Real data Results - Chat-bot To Learn More … Algorithm – MaliGAN Maximum-likelihood Augmented Discrete GAN • Initialize generator Gen and discriminator Dis • In each iteration: • Sample real sentences 𝑥 from database • Generate sentences 𝑥 by Gen • Update Dis to maximize 𝑙𝑜𝑔𝐷 𝑥 + 𝑥 𝑙𝑜𝑔 1 − 𝐷 𝑥 𝑥 • Update Gen by gradient 1 𝑁 𝑁 𝑖=1 𝑟𝐷 𝑥 𝑖 𝑖 − 𝑏 𝛻𝑙𝑜𝑔𝑃 𝑥 𝜃 𝑁 𝑖 𝑟 𝑥 𝐷 𝑖=1 𝐷 ℎ𝑖 , 𝑥 𝑖 𝑟𝐷 𝑥 𝑖 𝐷 𝑥𝑖 = 1 − 𝐷 𝑥𝑖 To learn more …… • Professor forcing • Alex Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron Courville, Yoshua Bengio, “Professor Forcing: A New Algorithm for Training Recurrent Networks”, NIPS, 2016 • Handling discrete output by methods other than policy gradient • MaliGAN, Boundary-seeking GAN • Yizhe Zhang, Zhe Gan, Lawrence Carin, “Generating Text via Adversarial Training”, Workshop on Adversarial Training, NIPS, 2016 • Matt J. Kusner, José Miguel Hernández-Lobato, “GANS for Sequences of Discrete Elements with the Gumbelsoftmax Distribution”, arXiv preprint, 2016