
RL and GAN for Sentence Generation and Chat-bot
Hung-yi Lee
Outline
RL for chat-bot
• Human provides feedback
GAN for chat-bot
• Machine (discriminator) provides feedback
• Rewarding every generation step
  • Monte Carlo (MC) search
  • Rewarding partially decoded sequences
Review: Chat-bot
• Sequence-to-sequence learning
[Figure: the encoder reads the input sentence together with the dialogue history (e.g., A: OOO, B: XXX); the generator outputs the response sentence (A: ∆∆∆). The training data consists of such dialogue exchanges.]
Review: Encoder
[Figure: a hierarchical encoder summarizes the dialogue history, e.g., "你好嗎" ("How are you?") and "我很好" ("I am fine"); its output is passed to the generator.]
Review: Generator
[Figure: an RNN decoder generates the response one token at a time, starting from <BOS>; each step is conditioned on the condition vector from the encoder and on the previously generated token. The condition can be different at each step with an attention mechanism.]
Review: Training Generator
• Minimize the cross-entropy of each output component against the reference:
  $C = \sum_t C_t$
[Figure: the decoder is conditioned on the encoder output; at each step $t$ the cross-entropy $C_t$ between the decoder's output distribution and the reference token is computed ($C_1, C_2, C_3, \dots$).]
Review: Training Generator
ℎ: input sentence and history/context
𝑥: correct response (word sequence)
Training data: ℎ, 𝑥
𝐶=
𝐶𝑡
𝑥𝑡 : t-th word, 𝑥1:𝑡 : first t words of 𝑥
𝑡
𝑥𝑡
𝑥𝑡−1
𝑥𝑡+1
……
……
𝐶𝑡−1
𝐶𝑡 = −𝑙𝑜𝑔𝑃𝜃 𝑥𝑡 |𝑥1:𝑡−1 , ℎ
𝐶𝑡
……
generator output
𝑙𝑜𝑔𝑃 𝑥𝑡 |𝑥1:𝑡−1 , ℎ
𝑡
= −𝑙𝑜𝑔𝑃 𝑥1 |ℎ 𝑃 𝑥𝑡 |𝑥1:𝑡−1 , ℎ
𝐶𝑡+1
……
𝐶=−
⋯ 𝑃 𝑥𝑇 |𝑥1:𝑇−1 , ℎ
= −𝑙𝑜𝑔𝑃 𝑥|ℎ
Maximizing the likelihood of
generating 𝑥 given h
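To make the maximum-likelihood objective concrete, here is a minimal sketch of teacher-forcing training for a seq2seq chat-bot in PyTorch. The model sizes, vocabulary, <BOS> id, and random data are toy assumptions, not the setup used in the lecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Seq2SeqChatbot(nn.Module):
    # Toy encoder-decoder; all sizes are arbitrary assumptions.
    def __init__(self, vocab_size=1000, emb=64, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, h_tokens, x_tokens):
        # Encode the input sentence / history h.
        _, state = self.encoder(self.emb(h_tokens))
        # Teacher forcing: feed the reference x shifted right (<BOS> assumed to be id 0).
        bos = torch.zeros(x_tokens.size(0), 1, dtype=torch.long)
        dec_in = torch.cat([bos, x_tokens[:, :-1]], dim=1)
        dec_out, _ = self.decoder(self.emb(dec_in), state)
        return self.out(dec_out)                 # logits for every step t

def mle_loss(model, h_tokens, x_tokens):
    logits = model(h_tokens, x_tokens)           # (batch, T, vocab)
    # C = -sum_t log P_theta(x_t | x_{1:t-1}, h), averaged over all tokens.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           x_tokens.reshape(-1))

# Usage with random toy data:
model = Seq2SeqChatbot()
h = torch.randint(0, 1000, (8, 12))              # 8 histories, 12 tokens each
x = torch.randint(0, 1000, (8, 10))              # 8 reference responses
loss = mle_loss(model, h, x)
loss.backward()
```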
RL for Sentence Generation
Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, Dan Jurafsky, "Deep Reinforcement Learning for Dialogue Generation", EMNLP 2016
Introduction
• Machine obtains feedback from the user, e.g.:
  • "How are you?" → "Bye bye :(" receives reward −10
  • "Hello" → "Hi :)" receives reward +3
• Chat-bot learns to maximize the expected reward
Maximizing Expected Reward
[Figure: $h$ → Encoder → Generator → $x$ → Human gives $R(h, x)$, which is used to update $\theta$.]
• Maximizing the expected reward:
  $\theta^* = \arg\max_\theta \bar{R}_\theta$, where $\bar{R}_\theta = \sum_h P(h) \sum_x R(h,x)\, P_\theta(x \mid h)$
  • $P(h)$: probability that the input/history is $h$
  • $P_\theta(x \mid h)$: randomness in the generator
Maximizing Expected Reward
• Maximizing the expected reward:
  $\bar{R}_\theta = \sum_h P(h) \sum_x R(h,x)\, P_\theta(x \mid h) = E_{h \sim P(h)}\big[E_{x \sim P_\theta(x|h)}[R(h,x)]\big] = E_{h \sim P(h),\, x \sim P_\theta(x|h)}[R(h,x)]$
  $\;\;\;\;\approx \frac{1}{N}\sum_{i=1}^{N} R(h^i, x^i)$
• Sample: $(h^1, x^1), (h^2, x^2), \cdots, (h^N, x^N)$
• Where is $\theta$? It no longer appears in the sampled approximation, so we cannot differentiate it directly.
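A minimal numeric sketch of this sampling approximation, with a made-up generator and reward function standing in for the chat-bot and human feedback:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_response(h):
    # Stand-in for the generator's randomness P_theta(x | h): pick one of
    # three canned responses with fixed probabilities (toy assumption).
    return rng.choice(["Hi :)", "I don't know", "Bye bye :("], p=[0.5, 0.3, 0.2])

def reward(h, x):
    # Stand-in for the human feedback R(h, x).
    return {"Hi :)": 3.0, "I don't know": -1.0, "Bye bye :(": -10.0}[x]

histories = ["Hello"] * 1000                       # samples h^i ~ P(h)
samples = [(h, sample_response(h)) for h in histories]
R_bar = np.mean([reward(h, x) for h, x in samples])  # (1/N) sum_i R(h^i, x^i)
print(R_bar)                                       # Monte Carlo estimate of the expected reward
```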
Policy Gradient
$\nabla \bar{R}_\theta = \sum_h P(h) \sum_x R(h,x)\, \nabla P_\theta(x \mid h)$
$\;\;\;\;= \sum_h P(h) \sum_x R(h,x)\, P_\theta(x \mid h)\, \frac{\nabla P_\theta(x \mid h)}{P_\theta(x \mid h)}$   (using $\frac{d \log f(x)}{dx} = \frac{1}{f(x)}\frac{d f(x)}{dx}$)
$\;\;\;\;= \sum_h P(h) \sum_x R(h,x)\, P_\theta(x \mid h)\, \nabla \log P_\theta(x \mid h)$
$\;\;\;\;= E_{h \sim P(h),\, x \sim P_\theta(x|h)}\big[R(h,x)\, \nabla \log P_\theta(x \mid h)\big]$
$\;\;\;\;\approx \frac{1}{N}\sum_{i=1}^{N} R(h^i, x^i)\, \nabla \log P_\theta(x^i \mid h^i)$   (sampling)
Policy Gradient
• Gradient ascent:
  $\theta^{new} \leftarrow \theta^{old} + \eta\, \nabla \bar{R}_{\theta^{old}}$, where $\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{i=1}^{N} R(h^i, x^i)\, \nabla \log P_\theta(x^i \mid h^i)$
• If $R(h^i, x^i)$ is positive: after updating $\theta$, $P_\theta(x^i \mid h^i)$ will increase.
• If $R(h^i, x^i)$ is negative: after updating $\theta$, $P_\theta(x^i \mid h^i)$ will decrease.
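A minimal sketch of one policy-gradient (REINFORCE) update, reusing the toy Seq2SeqChatbot from the earlier sketch; the sampling helper and the reward function are placeholder assumptions (the reward could come from human feedback or, later, from a discriminator).

```python
import torch
from torch.distributions import Categorical

def sample_and_logprob(model, h_tokens, max_len=10, bos_id=0):
    """Sample a response x ~ P_theta(x|h) and return it with sum_t log P_theta(x_t | ...)."""
    _, state = model.encoder(model.emb(h_tokens))
    tok = torch.full((h_tokens.size(0), 1), bos_id, dtype=torch.long)
    log_prob, tokens = 0.0, []
    for _ in range(max_len):
        out, state = model.decoder(model.emb(tok), state)
        dist = Categorical(logits=model.out(out[:, -1]))
        tok = dist.sample().unsqueeze(1)
        log_prob = log_prob + dist.log_prob(tok.squeeze(1))
        tokens.append(tok)
    return torch.cat(tokens, dim=1), log_prob          # x^i and log P_theta(x^i | h^i)

def reinforce_step(model, optimizer, h_batch, reward_fn):
    x_batch, log_prob = sample_and_logprob(model, h_batch)
    R = reward_fn(h_batch, x_batch)                     # R(h^i, x^i), shape (batch,)
    # Gradient ascent on (1/N) sum_i R(h^i, x^i) log P_theta(x^i | h^i)
    loss = -(R * log_prob).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```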
Implementation
[Figure: for each pair, $h^i$ → Encoder/Generator → $x^i$ → Human gives $R(h^i, x^i)$.]
• Maximum likelihood
  • Objective function: $\frac{1}{N}\sum_{i=1}^{N} \log P_\theta(x^i \mid h^i)$
  • Gradient: $\frac{1}{N}\sum_{i=1}^{N} \nabla \log P_\theta(x^i \mid h^i)$
  • Training data: $\{(h^1, x^1), \dots, (h^N, x^N)\}$ from humans; equivalent to setting $R(h^i, x^i) = 1$ for every pair
• Reinforcement learning
  • Objective function: $\frac{1}{N}\sum_{i=1}^{N} R(h^i, x^i)\, \log P_\theta(x^i \mid h^i)$
  • Gradient: $\frac{1}{N}\sum_{i=1}^{N} R(h^i, x^i)\, \nabla \log P_\theta(x^i \mid h^i)$
  • Training data: $\{(h^1, x^1), \dots, (h^N, x^N)\}$ obtained by sampling, weighted by $R(h^i, x^i)$
• $\theta^0$ can be well pre-trained by maximum likelihood on $\{(h^1, x^1), \dots, (h^N, x^N)\}$.
Implementation
• New objective (the data are sampled from the generator with parameters $\theta^t$):
  $\frac{1}{N}\sum_{i=1}^{N} R(h^i, x^i)\, \log P_\theta(x^i \mid h^i)$
  • Samples and their rewards: $(h^1, x^1)$ with $R(h^1, x^1)$, $(h^2, x^2)$ with $R(h^2, x^2)$, …, $(h^N, x^N)$ with $R(h^N, x^N)$
• Update:
  $\theta^{t+1} \leftarrow \theta^{t} + \eta\, \nabla \bar{R}_{\theta^t}$, where $\nabla \bar{R}_{\theta^t} \approx \frac{1}{N}\sum_{i=1}^{N} R(h^i, x^i)\, \nabla \log P_{\theta^t}(x^i \mid h^i)$
Add a Baseline
• If $R(h^i, x^i)$ is always positive, every sampled $x^i$ has its probability pushed up in
  $\frac{1}{N}\sum_{i=1}^{N} R(h^i, x^i)\, \nabla \log P_\theta(x^i \mid h^i)$
• Because $P_\theta(x \mid h)$ is a probability, the components must sum to one:
  • Ideal case: responses with larger reward gain more probability than the others.
  • Due to sampling: a response that happens not to be sampled receives no push at all, so its probability can drop simply because the sampled ones were pushed up.
[Figure: bar charts of $P_\theta(x \mid h)$ over candidates $(h, x^1), (h, x^2), (h, x^3)$, comparing the ideal case with the case where one candidate is not sampled.]
Add a Baseline
• If $R(h^i, x^i)$ is always positive, subtract a baseline so that not every sampled response is pushed up:
  $\frac{1}{N}\sum_{i=1}^{N} R(h^i, x^i)\, \nabla \log P_\theta(x^i \mid h^i) \;\;\rightarrow\;\; \frac{1}{N}\sum_{i=1}^{N} \big(R(h^i, x^i) - b\big)\, \nabla \log P_\theta(x^i \mid h^i)$
• With the baseline, responses whose reward is below $b$ have their probability decreased even though the raw reward is positive.
[Figure: after adding the baseline, the probabilities of un-sampled responses are no longer unfairly suppressed.]
• There are several ways to obtain the baseline b.
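One common and simple choice, sketched below, is to use the batch average reward as the baseline; this is an illustrative assumption, not the specific baseline used in the paper.

```python
import torch

def advantage(rewards, baseline=None):
    """Subtract a baseline from the rewards before weighting the log-probs.

    rewards: tensor of R(h^i, x^i) for the sampled batch.
    If no baseline is given, use the batch mean (a simple, common choice).
    """
    if baseline is None:
        baseline = rewards.mean()
    return rewards - baseline

# In the REINFORCE step above, replace R with the advantage:
#   loss = -(advantage(R) * log_prob).mean()
```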
AlphaGo-style training!
• Let two agents talk to each other
  • Degenerate dialogue: "How old are you?" / "See you." / "See you." / "See you." …
  • Good dialogue: "How old are you?" / "I am 16." / "I thought you were 12." / "What makes you think so?"
• Use a pre-defined evaluation function to compute R(h,x)
Example Reward
• The final reward R(h,x) is the weighted sum of three terms r1(h,x), r2(h,x) and r3(h,x):
  $R(h, x) = \lambda_1 r_1(h, x) + \lambda_2 r_2(h, x) + \lambda_3 r_3(h, x)$
  • Ease of answering ("don't be a conversation killer")
  • Information flow ("say something new")
  • Semantic coherence ("don't contradict what was said before")
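A sketch of combining several reward terms into R(h, x). The three term functions and the weights below are placeholder heuristics for illustration; the actual definitions in Li et al. (2016) are computed from model probabilities.

```python
def ease_of_answering(h, x):
    dull = {"i don't know", "see you."}
    return 0.0 if x.lower() in dull else 1.0     # penalize conversation killers

def information_flow(h, x):
    return 0.0 if x in h else 1.0                # penalize repeating the history

def semantic_coherence(h, x):
    return 1.0                                   # placeholder for a relevance score

def total_reward(h, x, lambdas=(0.25, 0.25, 0.5)):   # illustrative weights
    l1, l2, l3 = lambdas
    return (l1 * ease_of_answering(h, x)
            + l2 * information_flow(h, x)
            + l3 * semantic_coherence(h, x))

print(total_reward(["How old are you?"], "See you."))   # lower reward
print(total_reward(["How old are you?"], "I am 16."))   # higher reward
```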
Example Results
Reinforcement learning?
[Figure: a video-game example. Start with observation $s_1$; take action $a_1$ = "right" and obtain reward $r_1 = 0$ with observation $s_2$; take action $a_2$ = "fire" (kill an alien) and obtain reward $r_2 = 5$ with observation $s_3$.]
Reinforcement learning?
(Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, Wojciech Zaremba, "Sequence Level Training with Recurrent Neural Networks", ICLR, 2016)
• Sentence generation can be viewed as RL: the observation is what has been generated so far (starting from <BOS>), the action set is the vocabulary, and the action we take influences the observation at the next step.
• The intermediate rewards are zero ($r = 0$); the reward is obtained only after the whole sequence is generated, e.g. R("BAA", reference).
Reinforcement learning?
• One can use any advanced RL techniques here.
• For example, actor-critic
• Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh
Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, Yoshua
Bengio. "An Actor-Critic Algorithm for Sequence
Prediction." ICLR, 2017.
SeqGAN
Lantao Yu, Weinan Zhang, Jun Wang, Yong Yu, “SeqGAN:
Sequence Generative Adversarial Nets with Policy Gradient”, AAAI,
2017
Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, Dan
Jurafsky, “Adversarial Learning for Neural Dialogue Generation”, arXiv
preprint, 2017
Basic Idea – Sentence Generation
• Original GAN: a code z sampled from a prior distribution is fed to the Generator, which produces a sentence x; the Discriminator judges whether a sentence x is real (from the data) or fake (generated).
• Sampling from the RNN at each time step also provides randomness.
Algorithm – Sentence Generation
• Initialize generator Gen and discriminator Dis
• In each iteration:
  • Sample real sentences $x$ from the database
  • Generate sentences $\tilde{x}$ by Gen
  • Update Dis to increase $Dis(x)$ and decrease $Dis(\tilde{x})$
  • Update Gen such that $Dis(\tilde{x})$, the discriminator's scalar output on the generated sentences, increases
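A minimal PyTorch sketch of the discriminator update in this loop; the sentence discriminator architecture is a toy assumption.

```python
import torch
import torch.nn as nn

class SentenceDiscriminator(nn.Module):
    # Toy discriminator: embed tokens, run a GRU, map the final state to a scalar.
    def __init__(self, vocab_size=1000, emb=64, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb)
        self.rnn = nn.GRU(emb, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)

    def forward(self, tokens):
        _, state = self.rnn(self.emb(tokens))
        return torch.sigmoid(self.score(state[-1])).squeeze(-1)   # Dis(x) in (0, 1)

def discriminator_step(dis, optimizer, real_x, fake_x):
    # Increase Dis(x) on real sentences and decrease Dis(x~) on generated ones.
    loss = -(torch.log(dis(real_x) + 1e-8).mean()
             + torch.log(1 - dis(fake_x) + 1e-8).mean())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```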
Basic Idea – Chat-bot
• Conditional GAN: the Chatbot (encoder-decoder, En/De) takes the input sentence/history h and produces the response sentence x.
• The Discriminator takes both h and x, is trained on human dialogues, and outputs whether the pair looks real or fake.
Algorithm – Chat-bot
• Training data: dialogue exchanges, e.g. history h = (A: OOO, B: XXX, …) with response x = (A: ∆∆∆)
• Initialize generator Gen (the En/De chat-bot) and discriminator Dis
• In each iteration:
  • Sample real history $h$ and sentence $x$ from the database
  • Sample real history $h'$ from the database, and generate sentences $\tilde{x}$ by Gen($h'$)
  • Update Dis to increase $Dis(h, x)$ and decrease $Dis(h', \tilde{x})$
  • Update Gen such that $Dis(h', \tilde{x})$, the discriminator's scalar output on the generated responses, increases
Can we do backpropagation?
• The decoder samples a discrete word at each step, so tuning the generator a little bit will not change the output; the gradient of the discriminator's scalar output cannot flow back through this sampling.
• Alternative: improved WGAN (ignoring the sampling process).
Reinforcement Learning?
• Consider the output of the discriminator as the reward
  • Update the generator to increase the discriminator output, i.e., to get maximum reward
  $\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{i=1}^{N} \big(D(h^i, x^i) - b\big)\, \nabla \log P_\theta(x^i \mid h^i)$
  (the discriminator score $D(h^i, x^i)$ plays the role of the reward)
• Different from typical RL: the discriminator itself is also updated.
• g-step (update the generator):
  • New objective: $\frac{1}{N}\sum_{i=1}^{N} D(h^i, x^i)\, \log P_\theta(x^i \mid h^i)$, with samples $(h^1, x^1), \dots, (h^N, x^N)$ drawn from the generator with parameters $\theta^t$ and scored by the discriminator as $D(h^1, x^1), \dots, D(h^N, x^N)$
  • Update: $\theta^{t+1} \leftarrow \theta^{t} + \eta\, \nabla \bar{R}_{\theta^t}$, where $\nabla \bar{R}_{\theta^t} \approx \frac{1}{N}\sum_{i=1}^{N} D(h^i, x^i)\, \nabla \log P_{\theta^t}(x^i \mid h^i)$
• d-step (update the discriminator): feed real pairs $(h, x)$ as "real" and generated pairs $(h, \tilde{x})$ as "fake".
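A minimal sketch of alternating the two steps, reusing the toy sample_and_logprob() and discriminator_step() sketches above. A conditional discriminator D(h, x) is assumed and is approximated here by scoring the concatenation of history and response tokens; this is an illustrative simplification, not the architecture from the papers.

```python
import torch

def g_step(gen, dis, gen_opt, h_batch):
    # Sample responses from the generator and use D(h, x~) as the reward.
    x_tilde, log_prob = sample_and_logprob(gen, h_batch)
    with torch.no_grad():
        reward = dis(torch.cat([h_batch, x_tilde], dim=1))   # one scalar per sample
    baseline = reward.mean()
    loss = -((reward - baseline) * log_prob).mean()
    gen_opt.zero_grad()
    loss.backward()
    gen_opt.step()

def d_step(gen, dis, dis_opt, h_batch, x_real):
    with torch.no_grad():
        x_tilde, _ = sample_and_logprob(gen, h_batch)
    real_pairs = torch.cat([h_batch, x_real], dim=1)
    fake_pairs = torch.cat([h_batch, x_tilde], dim=1)
    discriminator_step(dis, dis_opt, real_pairs, fake_pairs)
```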
Reward for Every Generation Step
$\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{i=1}^{N} \big(D(h^i, x^i) - b\big)\, \nabla \log P_\theta(x^i \mid h^i)$
• $h^i$ = "What is your name?", $x^i$ = "I don't know"
  • $D(h^i, x^i) - b$ is negative, so the update decreases $\log P_\theta(x^i \mid h^i)$, where
    $\log P_\theta(x^i \mid h^i) = \log P(x_1^i \mid h^i) + \log P(x_2^i \mid h^i, x_1^i) + \log P(x_3^i \mid h^i, x_{1:2}^i)$
    and the first term is $P(\text{"I"} \mid h^i)$.
• $h^i$ = "What is your name?", $x^i$ = "I am John"
  • $D(h^i, x^i) - b$ is positive, so the update increases $\log P_\theta(x^i \mid h^i)$, where the first term is again $P(\text{"I"} \mid h^i)$.
Reward for Every Generation Step
• $h^i$ = "What is your name?", $x^i$ = "I don't know"
  $\log P_\theta(x^i \mid h^i) = \log P(\text{"I"} \mid h^i) + \log P(\text{"don't"} \mid h^i, \text{"I"}) + \log P(\text{"know"} \mid h^i, \text{"I don't"})$
  With a sentence-level reward, all three terms are decreased together, even though generating "I" as the first word is not necessarily bad (it also starts the good response "I am John").
• Replace the sentence-level weight with a reward for every generation step:
  $\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{i=1}^{N} \big(D(h^i, x^i) - b\big)\, \nabla \log P_\theta(x^i \mid h^i)$
  $\;\;\Rightarrow\;\; \nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} \big(Q(h^i, x_{1:t}^i) - b\big)\, \nabla \log P_\theta(x_t^i \mid h^i, x_{1:t-1}^i)$
• Method 1. Monte Carlo (MC) search
• Method 2. Discriminator for partially decoded sequences
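A minimal sketch of applying step-level weights to the per-step log-probabilities; the $Q$ values are assumed to come from one of the two methods described below.

```python
import torch

def step_weighted_loss(step_log_probs, step_q, baseline=0.0):
    """Policy-gradient loss with a reward for every generation step.

    step_log_probs: (batch, T) tensor of log P_theta(x_t | h, x_{1:t-1})
    step_q:         (batch, T) tensor of Q(h, x_{1:t}), e.g. from MC search
                    or from a discriminator trained on partial sequences.
    """
    advantage = step_q - baseline
    # (1/N) sum_i sum_t (Q(h^i, x^i_{1:t}) - b) * log P_theta(x^i_t | h^i, x^i_{1:t-1})
    return -(advantage.detach() * step_log_probs).sum(dim=1).mean()
```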
Monte Carlo Search
• How to estimate $Q(h^i, x_{1:t}^i)$? Example: $Q(\text{"What is your name?"}, \text{"I"})$
  • Fix the prefix $x_1^i$ = "I", roll out the rest of the sentence several times, and average the discriminator scores:
    • $x^A$ = "I am John", $D(h^i, x^A) = 1.0$
    • $x^B$ = "I am happy", $D(h^i, x^B) = 0.1$
    • $x^C$ = "I don't know", $D(h^i, x^C) = 0.1$
    • $x^D$ = "I am superman", $D(h^i, x^D) = 0.8$
    • Average: $Q(h^i, \text{"I"}) = 0.5$
• A roll-out generator for sampling is needed.
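A sketch of the roll-out estimate; continue_from() is a hypothetical helper (like sample_and_logprob() above, but starting from a fixed prefix instead of <BOS>), and the roll-out policy here is simply the generator itself.

```python
import torch

def mc_rollout_q(gen, dis, h_tokens, prefix_tokens, n_rollouts=4, max_len=10):
    """Estimate Q(h, x_{1:t}) by completing the prefix several times and
    averaging the discriminator scores."""
    scores = []
    for _ in range(n_rollouts):
        with torch.no_grad():
            # continue_from() is a hypothetical helper that samples the rest of
            # the sentence given the fixed prefix.
            full_x = continue_from(gen, h_tokens, prefix_tokens, max_len)
            scores.append(dis(torch.cat([h_tokens, full_x], dim=1)))
    return torch.stack(scores).mean(dim=0)       # Q(h, x_{1:t}) per example
```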
Rewarding Partially Decoded Sequences
• Train a discriminator that can assign rewards to both fully and partially decoded sequences: it takes $h$ and $x_{1:t}$ and outputs a scalar $Q(h, x_{1:t})$.
• Break the generated sequences into partial sequences, e.g.:
  • h = "What is your name?", x = "I am John" → also train on x = "I am" and x = "I"
  • h = "What is your name?", x = "I don't know" → also train on x = "I don't" and x = "I"
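A small sketch of expanding one generated pair into the partial-sequence training examples for this discriminator:

```python
def partial_sequences(h, x_tokens):
    """Expand one (h, x) pair into examples for the partial-sequence
    discriminator: (h, x_{1:1}), (h, x_{1:2}), ..., (h, x_{1:T})."""
    return [(h, x_tokens[:t]) for t in range(1, len(x_tokens) + 1)]

# Example:
print(partial_sequences("What is your name?", ["I", "am", "John"]))
# [('What is your name?', ['I']),
#  ('What is your name?', ['I', 'am']),
#  ('What is your name?', ['I', 'am', 'John'])]
```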
Teacher Forcing
• The training of the generative model is unstable:
  • The reward only promotes or discourages the generator's own generated sequences.
  • Usually it knows that the generated results are bad, but does not know what results are good.
• Teacher forcing: add real data.
  • Training data for SeqGAN: $\{(h^1, \tilde{x}^1), \dots, (h^N, \tilde{x}^N)\}$ obtained by sampling, weighted by $D(h^i, \tilde{x}^i)$
  • Adding more data: real pairs $\{(h^1, x^1), \dots, (h^N, x^N)\}$, treated as if $D(h^i, x^i) = 1$
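A sketch of how the two data sources could be mixed for the generator update, under the weighting described above; the data-structure layout is an assumption for illustration.

```python
# Generator updates are driven by sampled pairs weighted by the discriminator
# score and by real pairs with weight 1 (plain maximum likelihood on real data).
def build_weighted_batch(sampled_pairs, sampled_scores, real_pairs):
    batch = []
    for (h, x_tilde), d in zip(sampled_pairs, sampled_scores):
        batch.append((h, x_tilde, d))          # weight = D(h, x~)
    for (h, x) in real_pairs:
        batch.append((h, x, 1.0))              # real data, weight = 1
    return batch                               # each item: (h, x, weight)
```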
Experiments in paper
• Sentence generation: synthetic data
  • Given an LSTM, use it to generate a large number of sequences as the "real data".
  • The generator learns from this "real data" with the different approaches.
  • Evaluation: the generator generates some sequences, and the given LSTM computes their negative log-likelihood (NLL); smaller is better.
Experiments in paper – synthetic data
Experiments in paper – real data
Results - Chat-bot
To Learn More …
Algorithm – MaliGAN
Maximum-Likelihood Augmented Discrete GAN
• Initialize generator Gen and discriminator Dis
• In each iteration:
  • Sample real sentences $x$ from the database
  • Generate sentences $\tilde{x}$ by Gen
  • Update Dis to maximize
    $\sum_{x} \log D(x) + \sum_{\tilde{x}} \log\big(1 - D(\tilde{x})\big)$
  • Update Gen by the gradient
    $\frac{1}{N}\sum_{i=1}^{N} \left( \frac{r_D(\tilde{x}^i)}{\sum_{j=1}^{N} r_D(\tilde{x}^j)} - b \right) \nabla \log P_\theta(\tilde{x}^i)$, where $r_D(\tilde{x}^i) = \frac{D(\tilde{x}^i)}{1 - D(\tilde{x}^i)}$
    (the normalized weight takes the place of the raw discriminator score $D(h^i, x^i)$ used earlier)
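A small sketch of turning discriminator scores into these normalized MaliGAN weights; the baseline handling is an illustrative assumption.

```python
import torch

def maligan_weights(d_scores, baseline=0.0, eps=1e-8):
    """Turn discriminator scores D(x~) into the normalized weights
    r_D(x~_i) / sum_j r_D(x~_j) - b used to weight grad log P_theta(x~_i)."""
    r = d_scores / (1.0 - d_scores + eps)      # r_D(x) = D(x) / (1 - D(x))
    return r / (r.sum() + eps) - baseline

# Example: the generated sentence the discriminator likes most gets the
# largest weight.
print(maligan_weights(torch.tensor([0.9, 0.5, 0.1])))
```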
To learn more ……
• Professor forcing
• Alex Lamb, Anirudh Goyal, Ying Zhang, Saizheng
Zhang, Aaron Courville, Yoshua Bengio, “Professor
Forcing: A New Algorithm for Training Recurrent
Networks”, NIPS, 2016
• Handling discrete output by methods other than
policy gradient
• MaliGAN, Boundary-seeking GAN
• Yizhe Zhang, Zhe Gan, Lawrence Carin, “Generating Text
via Adversarial Training”, Workshop on Adversarial
Training, NIPS, 2016
• Matt J. Kusner, José Miguel Hernández-Lobato, "GANs for Sequences of Discrete Elements with the Gumbel-Softmax Distribution", arXiv preprint, 2016