Adversarial Learning for Neural Dialogue Generation

2017.2.17
Zhang Yan
Li, Jiwei, et al. "Adversarial Learning for Neural Dialogue Generation." arXiv preprint arXiv:1701.06547 (2017).
Goal
“trained to produce sequences that are indistinguishable from human-generated dialogue utterances.”
Main Contribution
• Propose to use an adversarial training approach for
response generation and cast the model in the framework
of reinforcement learning.
Adversarial Reinforcement Model
Adversarial Training: a minimax game between the Generator and the Discriminator
Model Breakdown
The model has two main parts, G and D:
Generative Model (G)
- Generates a response y given a dialogue history x consisting of a sequence of dialogue utterances
- Standard Seq2Seq model with an attention mechanism
Discriminative Model (D)
- Binary classifier that takes as input a sequence of dialogue utterances {x, y} and outputs a label indicating whether the input is human- or machine-generated
- Hierarchical encoder + 2-class softmax function -> returns the probability that the input dialogue episode is machine- or human-generated
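A rough sketch of such a discriminator in PyTorch-style Python (layer types and sizes are illustrative assumptions, not the authors' exact implementation):

# Sketch of D: a hierarchical encoder (utterance-level GRU -> dialogue-level GRU)
# followed by a 2-class softmax over {machine, human}. Sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalDiscriminator(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.utt_rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)  # encodes each utterance
        self.ctx_rnn = nn.GRU(hid_dim, hid_dim, batch_first=True)  # encodes the utterance sequence
        self.out = nn.Linear(hid_dim, 2)                           # 2-class softmax: machine vs. human

    def forward(self, dialogue):
        # dialogue: LongTensor (num_utterances, max_len) holding one {x, y} episode
        emb = self.embed(dialogue)                 # (U, T, emb_dim)
        _, utt_h = self.utt_rnn(emb)               # final state per utterance: (1, U, hid_dim)
        _, ctx_h = self.ctx_rnn(utt_h)             # read the U utterance vectors as one sequence (batch=1)
        return F.softmax(self.out(ctx_h.squeeze(0)), dim=-1)  # probabilities of machine / human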
Seq2Seq Models for Response Generation
(Sutskever et al., 2014; Jean et al., 2014)
Source: Input Messages
Target: Responses
Seq2Seq Models with Attention Mechanism
[Luong et al., 2015]
The attention mechanism predicts the output y using a weighted-average context vector c, not just the last encoder state.
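An illustrative sketch of this weighted-average context vector (dot-product scores, PyTorch-style; the shapes and scoring function are assumptions rather than the paper's exact formulation):

# Sketch of attention: the context vector is a weighted average of encoder states,
# with weights given by a softmax over alignment scores.
import torch
import torch.nn.functional as F

def attention_context(decoder_state, encoder_states):
    # decoder_state:  (batch, hid)           current decoder hidden state
    # encoder_states: (batch, src_len, hid)  all encoder hidden states
    scores = torch.bmm(encoder_states, decoder_state.unsqueeze(2)).squeeze(2)  # (batch, src_len)
    weights = F.softmax(scores, dim=-1)                                        # attention weights
    context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)       # (batch, hid)
    return context, weights  # the context c is combined with the decoder state to predict y_t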
Training Methods
Policy Gradient Methods:
- The discriminator's score for the current utterances being human-generated is used as a reward for the generator, which is trained to maximize the expected reward of generated utterances using the REINFORCE algorithm.
The expected reward and its gradient (approximated by the likelihood ratio):

J(θ) = E_{y ~ p(y|x)} [ Q+({x, y} | D) ]
∇J(θ) ≈ [ Q+({x, y} | D) - b({x, y}) ] ∇ Σ_t log p(y_t | x, y_1:t-1)

- Q+({x, y} | D): classification score assigned by the discriminator (the scalar reward)
- b({x, y}): baseline value to reduce the variance of the estimate while keeping it unbiased
- p(y_t | x, y_1:t-1): the policy (the generator)
- ∇ Σ_t log p(y_t | x, y_1:t-1): gradient in parameter space
- The policy updates in the direction of the reward in the parameter space.
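A minimal sketch of this update in PyTorch-style Python, assuming a generator that can sample a response with per-token log-probabilities and a discriminator that returns the probability of a dialogue being human-generated (the helpers generator.sample, discriminator.prob_human, and the baseline handling are hypothetical):

# Hypothetical REINFORCE update for the generator:
# reward = discriminator's "human" probability for the sampled response,
# minus a baseline; scale the sequence log-likelihood by this advantage.
import torch

def reinforce_step(generator, discriminator, x, baseline, optimizer):
    y, log_probs = generator.sample(x)           # sampled response and per-token log p(y_t | x, y_<t)
    with torch.no_grad():
        reward = discriminator.prob_human(x, y)  # Q+({x, y} | D), a scalar in [0, 1]
    advantage = reward - baseline                # b({x, y}) reduces variance, keeps estimate unbiased
    loss = -advantage * log_probs.sum()          # gradient of -J(theta) via the likelihood ratio
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward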
Training Methods (Cont’d)
Problem with REINFORCE:
- It has the disadvantage that the expectation of the reward is approximated by only one sample, and the reward associated with that sample is used for all actions (tokens).
Input: What’s your name?
human: I am John
machine: I don’t know
- Vanilla REINFORCE assigns the same negative reward to all tokens of the machine response [I, don’t, know]. Proper credit assignment in training would give separate rewards: most likely a neutral reward for the token I, and negative rewards for don’t and know.
- The authors call this Reward for Every Generation Step (REGS).
Reward for Every Generation Step (REGS)
We need rewards for intermediate steps.
Two Strategies Introduced:
1. Monte Carlo (MC) Search
2. Training Discriminator For Rewarding Partially Decoded Sequences
Monte Carlo Search
1. Given a partially decoded sequence s, the model keeps sampling tokens from the distribution until decoding finishes.
2. This is repeated N times (the N generated sequences share the common prefix s).
3. These N sequences are fed to the discriminator, and the average score is used as the reward for s (see the sketch below).
Drawback: time-consuming!
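A minimal sketch of this rollout procedure; generator.rollout and discriminator.prob_human are hypothetical helpers:

# Hypothetical Monte Carlo search: roll out N full sequences from the
# partial prefix s, score each with the discriminator, average the scores.
def mc_reward(generator, discriminator, x, prefix, n_rollouts=5):
    scores = []
    for _ in range(n_rollouts):
        y_full = generator.rollout(x, prefix)            # keep sampling tokens until decoding finishes
        scores.append(discriminator.prob_human(x, y_full))
    return sum(scores) / n_rollouts                      # average score used as the reward for the prefix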
Rewarding Partially Decoded Sequences
Directly train a discriminator that is able to assign rewards to both fully and partially decoded sequences.
- Break generated sequences into partial sequences.
Problem:
- Earlier actions in a sequence are shared among multiple training examples for the discriminator.
- This results in overfitting.
The authors propose a strategy similar to the one used in AlphaGo to mitigate the problem.
Rewarding Partially Decoded Sequences
For each collection of subsequences of Y, randomly sample only one example from the positive examples and one from the negative examples, which are used to update the discriminator (see the sketch below).
- Time-effective, but less accurate than the MC model.
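A minimal sketch of this subsampling, assuming x is the dialogue history and y_human / y_machine are token lists (names and the random prefix choice are hypothetical):

# Hypothetical subsampling for training D on partial sequences:
# for each dialogue, keep only one random positive prefix and one random
# negative prefix, so shared early tokens are not reused across many examples.
import random

def sample_partial_pair(x, y_human, y_machine):
    t_pos = random.randint(1, len(y_human))      # random prefix length for the positive example
    t_neg = random.randint(1, len(y_machine))    # random prefix length for the negative example
    positive = (x, y_human[:t_pos], 1)           # label 1: human-generated
    negative = (x, y_machine[:t_neg], 0)         # label 0: machine-generated
    return positive, negative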
Rewarding Partially Decoded Sequences
With per-step rewards, the gradient becomes:

∇J(θ) ≈ Σ_t [ Q+(x, Y_1:t) - b(x, Y_1:t) ] ∇ log p(y_t | x, Y_1:t-1)

- Q+(x, Y_1:t): classification score for the partially decoded sequence
- b(x, Y_1:t): baseline value
- p(y_t | x, Y_1:t-1): policy
- ∇ log p(y_t | x, Y_1:t-1): gradient in parameter space
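A minimal sketch of the corresponding per-step generator update in PyTorch-style Python, assuming per-token log-probabilities and per-step rewards (from MC search or the partial-sequence discriminator) are already available:

# Hypothetical REGS-style update: each token y_t gets its own
# (reward - baseline) weight instead of one sequence-level reward.
def regs_step(log_probs, step_rewards, step_baselines, optimizer):
    # log_probs:      (T,) tensor, log p(y_t | x, Y_1:t-1), requires grad
    # step_rewards:   (T,) tensor, Q+(x, Y_1:t) for each partially decoded sequence
    # step_baselines: (T,) tensor, b(x, Y_1:t)
    advantages = (step_rewards - step_baselines).detach()
    loss = -(advantages * log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()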
Teacher Forcing
The generative model is still unstable, because:
- The generative model can only be indirectly exposed to the gold-standard target sequences, through the reward passed back from the discriminator.
- This reward is used only to promote or discourage the generator's own generated sequences.
This is fragile, because:
- Once the generator accidentally deteriorates in some training batches, and the discriminator consequently does an extremely good job at recognizing generated sequences, the generator immediately gets lost.
- It knows that its generated results are bad, but does not know what results would be good.
Teacher Forcing
The authors propose feeding human-generated responses to the generator for model updates.
- The discriminator automatically assigns a reward of 1 to the human responses, and the generator uses this reward to update itself on these examples.
- Analogous to having a teacher intervene and force the generator to produce the true responses.
Pseudocode for the Algorithm
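The pseudocode figure from the paper is not reproduced in these slides; the following is a rough Python-style sketch of the overall loop (discriminator updates, policy-gradient updates for the generator, and teacher forcing), with all helper names hypothetical:

# Rough sketch of the adversarial training loop (not the authors' exact pseudocode).
# D and G are updated alternately; G also gets teacher-forcing updates on human responses.
def adversarial_training(generator, discriminator, data, baseline,
                         num_iterations, d_steps, g_steps):
    for _ in range(num_iterations):
        # --- Update the discriminator on human vs. machine dialogues ---
        for _ in range(d_steps):
            x, y_human = data.sample()                        # a (history, human response) pair
            y_machine = generator.sample(x)
            discriminator.train_step((x, y_human, 1), (x, y_machine, 0))

        # --- Update the generator with the REINFORCE / REGS reward ---
        for _ in range(g_steps):
            x, y_human = data.sample()
            y_machine, log_probs = generator.sample_with_log_probs(x)
            reward = discriminator.prob_human(x, y_machine)   # or per-step rewards (REGS)
            generator.reinforce_update(log_probs, reward, baseline)

            # Teacher forcing: treat the human response as having reward 1
            # and update the generator on it directly (an MLE-style update).
            generator.mle_update(x, y_human)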
Result
The adversarially trained system generates higher-quality responses than previous baselines!
Notes
It did not show great performance on the abstractive summarization task.
Perhaps because the adversarial training strategy is more beneficial to:
- Tasks in which there is a big discrepancy between the distributions of the generated sequences and the reference target sequences
- Tasks in which the input sequences do not bear all the information needed to generate the target
- In other words, tasks where there is no single correct target sequence in the semantic space.