arXiv:1705.07830v1 [cs.CL] 22 May 2017

arXiv:1705.07830v1 [cs.CL] 22 May 2017
Ask the Right Questions:
Active Question Reformulation with Reinforcement Learning
Christian Buck
[email protected]
Jannis Bulian
[email protected]
Massimiliano Ciaramita
[email protected]
Andrea Gesmundo
[email protected]
Neil Houlsby
[email protected]
Wojciech Gajewski
[email protected]
Wei Wang
[email protected]
Google
We propose an active question answering agent that learns to reformulate
questions and combine evidence to improve question answering. The agent
sits between the user and a black box question-answering system and learns to
optimally probe the system with natural language reformulations of the initial
question and to aggregate the evidence to return the best possible answer.
The system is trained end-to-end to maximize answer quality using policy
gradient. We evaluate on SearchQA, a dataset of complex questions extracted
from Jeopardy!. Our agent improves F1 by 11% over a state-of-the-art base
model that uses the original question/answer pairs.
1 Introduction
Web and social media have become the primary sources for information needs. Users’
expectations and information seeking activities co-evolve with the increasing sophistication
of these resources. Beyond navigation, document retrieval, and simple factual question
answering, users seek direct answers to complex and compositional questions. Such search
sessions may require multiple iterations, critical assessment and synthesis [Marchionini,
2006].
The productivity of natural language yields a myriad of ways to formulate a question.
In the face of complex information needs, humans overcome uncertainty by reformulating
1
questions, issuing multiple searches, and aggregating responses. Inspired by the human
ability to produce and ask the right questions, we present an agent that learns to carry
out this process for the user. The agent sits between the user and a backend QA system
that we refer to as the ‘environment’. The agent aims to maximize the chance of getting
the correct answer by reformulating and reissuing a user’s question to the environment.
By asking many question variants, and aggregating the returned evidences, the agent
comes up with the single best answer. The internals of the environment are not available
to the agent, so it must learn to probe a black-box optimally using only natural language.
This approach resembles active learning [Settles, 2010]. In active learning, the algorithm
chooses which data to send to an environment for labeling, aiming to collect the most
valuable labels. Similarly, our agent chooses how to query an environment, aiming to
maximize the chance of revealing the correct answer. Since the system actively chooses
which questions to ask we call our approach Active Question Answering (AQA). AQA
departs from the usual regime of active learning in the following ways.
1. AQA must choose the optimal input in the space of natural language questions, as
opposed to the usual active learning scenario of selecting real-valued feature vectors.
The former, being discrete and highly structured, is challenging to optimize.
2. AQA optimizes the input to send to the environment with respect to expected endto-end performance. Active learning usually optimized a pre-specified acquisition
criterion 1 .
3. AQA performs active reformulation conditioned on a original input. Active learning
seeks to acquire generally useful labels at training time. Here, we must ensure that
the questions sent to, and answers received from, the environment are useful with
respect to the original query.
The key component to our solution is a text based sequence-to-sequence reinforcement
learning to optimize the reformulator’s performance with respect to the QA environment. The second component to AQA combines the evidences from interacting with the
environment using a convolutional neural network.
We evaluate on a dataset of complex questions taken from Jeopardy!, the SearchQA
dataset [Dunn et al., 2017]. These questions are hard to answer by design because
they use obfuscated and convoluted language, e.g., “Travel doesn’t seem to be an issue
for this sorcerer & onetime surgeon; astral projection & teleportation are no prob”
(answer: Doctor Strange). Thus SearchQA tests the ability of AQA to reformulate
questions such that the QA system has the best chance of returning the correct answer.
AQA outperforms a deep network built for QA, BiDAF [Seo et al., 2017a], which has
produced state-of-the-art results on multiple tasks, by 11% absolute F1, a 32% relative
F1 improvement. We conclude by proposing AQA as a general framework for stateful,
iterative, and interactive information seeking tasks.
1
Bayes optimal active learning also tries to optimize end-to-end performance, however, these techniques
are often computationally prohibitive in large-scale settings [Roy and McCallum, 2001]
2
AQA
q
a
Reformulate
Aggregate
[Seq2Seq]
[CNN]
{q'}
Q&A
{a'}
Figure 1: AQA’s agent-environment. In the downward pass the agent re-interprets the
question and hands it on to the QA system. In the upward pass, responses are
evaluated to produce the final answer.
2 Active Question Answering
In this section we detail the components of Active Question Answering (cf. Figure 1).
2.1 The Agent-environment framework
Rather than send a user’s question to a QA system passively, an AQA system actively
reformulates the question multiple times, and sends these variants to an environment,
in our case the QA system. The environment then returns one or more responses, from
which the final answer is selected. The agent must learn to communicate optimally with
the environment to maximize the probability of receiving the correct answer.
The environment can, in principle, include multiple information sources, possibly
providing feedback in different modes (images, structured data from knowledge bases,
free text, etc.). Here, the information source consists of a pre-trained question answering
system, that accepts natural language question strings as input, and returns strings as
answers.
2.2 Active Question Answering Agent
The reformulator is a sequence-to-sequence model, popularized in the task of Neural
Machine Translation (NMT) [Sutskever et al., 2014, Bahdanau et al., 2014] 2 . This model’s
architecture follows a design using a multi-layer bidirectional LSTM, with attentionbased decoding. The major departure from the standard machine translation setting
2
We build on top of the public implementation of Britz et al. [2017]. (https://github.com/google/
seq2seq/)
3
is that our model reformulates utterances in the same language. Unlike in MT, there
is little high quality training data available for monolingual paraphrasing. Training
highly-parametrized neural networks effectively relies on an abundance of data, thus, our
setting presents an additional challenge. We address this first, by exploiting end-to-end
signals produced through the interaction with the QA environment, and secondly, by
pre-training the model. This is a common strategy in deep learning, for which we develop
appropriate strategies.
In the downward pass the reformulator, the box on the left in Figure 1, maps the
original question into one or many alternative questions that will probe the environment
for candidate answers. The reformulator is trained end-to-end, using an answer quality
objective, for which we use the standard F1 metric. Due to the non-differentiable
nature of this sequence-level loss, the model is trained using reinforcement learning (cf.
Section 3). The aggregator, the box on the right of Figure 1, is a Convolutional Neural
Network (CNN). This model’s task is to evaluate the candidate answers returned by
the environment and select the one to return. Here, we assume that there is a single
best answer, as is the case in our evaluation setting; returning multiple answers is a
straightforward extension of the model. The CNN is trained with supervised learning.
Finally, we require an environment to interact with. For this we use a competitive
neural question answering model, BiDirectional Attention Flow (BiDAF) [Seo et al.,
2017a]3 . BiDAF is an extractive QA system. It takes as input a question and a
document and returns as answer a continuous span from the passage. The model
contains a bidirectional attention mechanism to score document snippets with respect
to the question, implemented with multi-layer LSTMs and other components. The
environment is opaque to our agent, which has no access to the internals: parameters,
activations, gradients, etc. AQA may only send questions to it, and receive answers.
This scenario enables us to design a general framework that in permits the use of any
backend environment. However, it means that feedback on the quality of the question
reformulations is noisy and indirect, thus presenting a challenge for training.
3 Training
To train AQA we use a combination of reinforcement and supervised learning. We also
present a strategy to overcome the data paucity issue in monolingual paraphrasing.
3.1 QA system
We use BiDAF as the black box QA system [Seo et al., 2017a]. We updated BiDAF
to TensorFlow 1.1 and trained a model on the SearchQA data. We use Adam with a
learning rate of 0.001 as optimizer, and stop after 4500 steps. In the SearchQA task
several snippets returned by Google form the overall context (50 on average). We join
snippets to form the context where BiDAF can search for the answer candidate spans.
Training on all snippets was slow, so we limited the context to the top 10 ranked Google
3
https://allenai.github.io/bi-att-flow/
4
snippets to form the context. This corresponds to finding the answer on the first page
of Google search results. The results are only mildly affected by this: for 10% of the
questions there is no answer in this shorter context. These datapoints are all counted as
losses.
3.2 Training the Initial Rewrite Model
We pre-train the question reformulation model by building a paraphrasing NMT model,
i.e. a model that can translate English to English. Unfortunately, very little parallel
but monolingual data (segment, paraphrase) is available to train such a model directly.
Instead, we first produce a multi-language translation system, that can translate between
several languages [Johnson et al., 2016]. This allows us to use bilingual corpora as are
commonly used for machine translation models. Multi-language training simply requires
adding two special tokens to the data indicating the source and target languages 4 and use
of a common vocabulary. The architecture of the translation model remains unchanged.
As Johnson et al. [2016] show, this model can be used for zero-shot translation, i.e.
to translate between language pairs for which it has seen no training examples, thus we
can use this model for zero-shot paraphrasing. In the same work the authors note that
zero-shot performance is generally worse than bridging, an approach that uses the model
twice to first translate into a pivot language and then to the target language, but that
this performance gap can be eliminated by running a few training steps for the desired
language pair. Thus, we first train on multilingual data, then on a small corpus of noisy
monolingual data.
As multilingual training data we use the United Nations Parallel Corpus v1.0 [Ziemski
et al., 2016]. This dataset contains 11.4M sentences which are fully aligned across the
six UN languages : Arabic, English, Spanish, French, Russian, and Chinese. Using
all bilingual pairings we produce a multilingual training corpus of 30 language pairs,
yielding 340M training examples in total which we use to train a zero-shot neural MT
system [Johnson et al., 2016]. We tokenize our data using 16k sentence pieces5 . Following
standard settings [Britz et al., 2017] we use a bidirectional LSTM as encoder and an
4-layer LSTM with attention [Bahdanau et al., 2016] as decoder. The model converges
after seeing about 400M instances6 .
The monolingual model trained as described above has poor quality as a source of
systematically-related question reformulations. For example, for a question like “What
month, day and year did Super Bowl 50 take place?” the top translation is “What month
and year goes back to the morning and year?”. To improve quality, after having trained
the multilingual model, we resume training on a monolingual dataset which is two orders
of magnitude smaller than the U.N. corpus. The monolingual data is extracted from
the Paralex database of question paraphrases Fader et al. [2013]7 . Unfortunately, this
data contains many noisy pairs. As a simple heuristic to filter these out we remove all
4
For example, we prefix <from_es> <to_en> to the source side of a Spanish-English training instance.
https://github.com/google/sentencepiece
6
We run three million steps on batches of size 128 using the Adam optimizer with learning rate 0.001.
7
http://knowitall.cs.washington.edu/paralex/.
5
5
pairs where the Jaccard coefficient between the sets of source and target terms ≤ 0.5.
Additionally, since the number of paraphrases for each question can vary significantly,
for each question we keep at most 4 different question paraphrases. The total number of
monolingual training pairs is ≈ 1.5M.
This model has visibly better quality than the zero-shot one; e.g., for the example
question above it would generate “What year did superbowl take place?”. We also tried
training solely on the monolingual data. The quality of this model seems somewhere in
between the two approaches above. E.g., it produces “What month did super blooming
50 take place?” for the same question. This pattern of results is consistent with the
evidence from Johnson et al. [2016].
3.3 Policy Gradient Training
Our goal is to return the best possible answer a, maximizing a reward R(a). Typically, R is
the token level F1 loss on the answer string, although any choice can be made. The answer
is an unknown function of the query reformulation sent to the environment a = f (q). The
reformulator produces a question by sequentially generating tokens w. The reformulation
process defined by our model is the policy πθ (q|q0 ) (θ is the set of model parameters). A
Q
query is generated according to q ∼ πθ (q|q0 ) = Tt=1 p(wt |w1 , . . . , wt−1 , q0 ), where q0 is
the original input question.
Under the policy the goal is to maximize the reward of the answer, Eπθ [R(f (q(π)))].
We train using policy gradient [Sutton and Barto, 1998], optimizing the reward directly
with respect to the policy. The reward cannot be computed in closed form, but Monte
Carlo sampling from the policy yields an unbiased estimator,
Eπθ [R(f (q))] ≈
N
1 X
R(f (qi )),
N i=1
qi ∼ πθ (qi |q0 )
(1)
To compute gradients for training with use the REINFORCE method [Williams and
Peng, 1991],
∇Eπθ [R(f (q))] = Eπθ [R(f (q))]∇θ log(πθ )
≈
N
1 X
R(f (qi ))∇θ log(π(qi |q0 )),
N i=1
qi ∼ πθ
(2)
The REINFORCE estimator is often found to have high variance, leading to unstable
training. We address this using a baseline [Williams, 1992] B(q0 ) = Eq0 [R(f (q))]. This
expectation is computed by sampling from the policy given q0 .
We often observed collapse onto a sub-optimal deterministic policy. To address this we
include entropy regularization on the policy. This final reward is:
Eπθ [R(f (q)) − B(q0 )] + wreg H[π(q|q0 )]
where wreg is the regularization weight.
6
(3)
3.4 RL experiments
In our experiments we switch the optimizer from Adam to SGD with a low learning rate
for the RL experiments. During training we produce greedy, i.e. without beam search,
translations on a dev set for every checkpoint and monitor their average reward in the
QA environment. We found that a high learning rate can quickly deteriorate model
performance. To avoid degeneration we also lower the weight of the entropy regularization
weight wreg to 0.001. Another option to improve training stability is to switch between
supervised and reinforcement learning. Since supervised data is not available for the
question reformulation task we cannot interpolate the gradients as Wu et al. [2016] did.
We train the model for an additional 100k steps using the RL updates. In contrast to
our initial training which we ran on GPUs, this training phase is dominated by latency
of the QA system and we run inference and updates on CPU.
3.5 Supervised Aggregator
Using beam search or sampling our reformulation system can produce many rewrites
of a single question. We issue each rewrite to the QA system, yielding a set of (query,
rewrite, answer) tuples from which we need to pick the best instance. We train another
neural network to pick the best answer from the candidates. While this is a ranking
problem we phrase it as a binary classification, distinguishing between above and below
average performance: For every instance, we compute the F1 score. An instance counts
as positive if the rewrite produces and answer with an above average F1 score, compare
to all other rewrites of the same question. We ignore questions where all rewrites yield
equally good/bad answers, which is a sizeable part of our training data: around 50% of
the questions yield the same answer score no matter the rewrite. As classifier we did
a quick evaluation of feed-forward networks, LSTMs and CNNs. As inputs are triples
of variable length sequences the latter two allow us to incorporate the tokens directly
without need for feature engineering. As performance is similar for both CNN and
LSTM based models we chose CNNs for computational efficiency. In particular, we use
pre-trained 100-dimensional embeddings [Pennington et al., 2014] for the tokens of query,
rewrite, and answer. For each, we add a 1-d CNN of width 3 and output dimension 100,
followed by max-pooling. The three vectors are then concatenated and passed through a
feed-forward network which produces the binary output.
4 Experiments
4.1 QA Data
We evaluate on SearchQA [Dunn et al., 2017], a recently-released dataset built starting
from a set of Jeopardy! clues. Clues are obfuscated queries such as “This ’Father of Our
Country’ didn’t really chop down a cherry tree”. SearchQA associates each clue with the
correct answer (e.g., “George Washington”), and a list of snippets from Google’s top
search results. The dataset contains more than 140k question answer pairs, and 6.9M
snippets. The data comes with pre-defined train, validation and test partitions. For
7
SearchQA
Validation
Test
System
Exact Match
F1
Exact Match
F1
ASR
BiDAF
–
31.7
24.2
37.9
–
28.6
22.8
34.6
AQA
AQA
AQA
AQA
AQA
32.0
33.6
34.0
35.5
40.5
38.2
40.5
40.1
42.0
47.4
30.6
33.3
32.7
33.8
38.7
36.8
39.3
38.9
40.2
45.6
–
48.3
–
56.0
43.9
46.6
–
54.3
TH
WV
HAC
TC
CNN
Human Performance
Oracle ranking
Table 1: Results table for the experiments on SearchQA.
both tasks we train our system on the train data, perform model selection and tuning
on the development data and report results on both development and test partitions.
The training, validation and test sets consist of 99,820, 13,393 and 27,248 examples,
respectively.
4.2 Results
Table 1 summarizes the experimental results of the AQA systems, including baselines
and benchmarks. The metrics reported are exact match (EM) and F1, between the
predicted answer and the gold answer, respectively. The numbers we report are on the full
validation and test datasets (referred to as n-gram in [Dunn et al., 2017]). This includes
answers that have a unigram answer; e.g., including the question “His marvel.com bio says
he’s "highly proficient in hand-to-hand combat, swordsmanship and hammer throwing"”
(answer: Thor), and also questions with longer answers such as “Uber debuted this kind
of car in September 2016, & passengers didn’t have to make small talk or argue about
the music” (answer: self-driving cars).
SearchQA provides a baseline that ranks unigrams from the search snippets using their
TF-IDF score. It is not given in the table, because it is not provided on all questions but
only for those where the answer consists of a single word. Dunn et al. [2017] report also
the results of a deep learning system trained specifically for the SearchQA task. This is
a modified pointer network, called attention sum reader (ASR). Overall, it seems that
SearchQA is a harder task then other recent ones such as SQuAD [Rajpurkar et al., 2016]
and CNN/DailyMail [Hermann et al., 2015], both for systems and humans. BiDAF’s
performance degrades by 40 absolute F1 points on SearchQA compared to SQuAD
and CNN/DailyMail. On the other hand, BiDAF is still a competitive QA system on
SeachQA, as it improves over the ASR network by 13.7 absolute F1 points.
8
We evaluate several variants of AQA systems. For each query q in the evaluation
set we sample reformulations qi , i = 1..N using the sequence model trained with policy
gradient; N = 20 in these experiments.
AQA TH First, we just use q1 , the top hypothesis (TH) generated by the sequence model.
This already gives an improvement over using the original question for BiDAF of
0.3 F1 points on validation and 2.2 F1 points on test.
AQA WV We implement a simple heuristic weighted voting scheme that outputs the
answer with the maximum vote. Let ai be the answer returned by BiDAF for query
P
qi , with associated score F (ai ), then FW V (a) = i:ai =a F (ai ). This model (AQA
WV) scores 40.5 F1 on validation and 39.3 on test, improving over choosing the
top hypothesis.
AQA HAC Instead of using the sum as in the voting scheme, we also tried using the
average score returned by the BiDAF model for an answer. With 34.0 F1 on
validation and 38.9 F1 on test this improves over using the weighting scheme.
AQA TC The most successful heuristic is the use of the answer with the highest confidence
score returned by the BiDAF model. This gives a F1 score of 42.0 on validation
and 40.2 on test.
One can improve on top of these heuristics by training a supervised scoring function.
We considered several neural networks variants for this purpose, and evaluated using the
CNN design described in Section 3.5. With F1 47.4 on validation, and 45.6 this function
improves over the heuristics by a wide margin, and with an Exact Match score of 38.7 on
test it halfway closes the gap to human performance on the task. We also report scores
for an Oracle system that always picks the answer maximizing the respective score.
The table also reports human performance, based on a sample from the test set. The
human performance is significantly better when restricted to unigram answers (64.85%),
and the main reason for this is that the task is already demanding for humans on unigrams
and significantly harder on n-gram answers.
5 Related work
Bilingual corpora and machine translation have long been used to generate paraphrases[Madnani and Dorr, 2010]. These approaches generally involve pivoting through
a second language. While early work[Bannard and Callison-Burch, 2005] focused on
extracting words and phrases e, e0 from a phrasetable such that e is translated to a foreign
phrase f and f translates back to e0 , this approach has been extended to extract full
sentences [Ganitkevitch et al., 2011, 2013] using a parsing based MT system [Li et al.,
2009]. Recent work Mallinson et al. uses neural translation models and multiple pivots.
In contrast, our approach does not use pivoting and is, to our knowledge, the first direct
neural paraphrasing system.
9
Reinforcement learning is gaining traction in natural language understanding across a
broad spectrum of problems. For example, Narasimhan et al. [2015] use reinforcement
learning to learn control policies for multi-user dungeon (MUD) games where the state of
the game is summarized by a textual description, while Li et al. [2017] use reinforcement
learning for dialogue generation. Policy gradient methods [Williams, 1992, Sutton et al.,
1999] of different kinds have been investigated recently for machine translation and other
sequence-to-sequence problems. They are beneficial in alleviating the limitations inherent
in the word-level optimization of the cross-entropy loss, allowing sequence-level reward
metrics, like BLEU. They also help with the so-called exposure bias: the fact that the
model never sees its own mistakes at training time [Ranzato et al., 2015, Shen et al.,
2016]. Policy gradient has also been used to define sequence level reward functions based
on language models and reconstruction errors to bootstrap MT with less resources [Xia
et al., 2016]. Bahdanau et al. [2016] use an actor-critic approach for machine translation.
Our proposal is closely related to this body of work as we use NMT models in the agent,
which we train with policy gradient methods, using the end-to-end question answering
quality metrics as sequence-level rewards.
In question answering, Liang et al. [2017] use policy gradient for learning a semantic
parser to query a knowledge base, while Seo et al. [2017b] propose a variant of RNNs,
QRN (query reduction networks), that transform an original query to answer questions
that involve multi-hop common sense reasoning. Nogueira and Cho [2016] investigate
an approach to question answering that is related to ours. Their goal is to identify
a document containing an answer to a query by following links on a document graph.
Evaluating on a set of questions from the game “Jeopardy!”, they start from a question
and learn to walk the Wikipedia graph with an RNN until they reach the predicted
article/answer.8 In a recent follow-up Nogueira and Cho [2017] use an approach inspired
by relevance feedback in combination with reinforcement learning. They propose to
reformulate a query using terms from the documents retrieved for a questions by a search
system. With respect to this work, ours differs in that: we generate full reformulations
of the original queries via monolingual sequence to sequence modeling and we use a full
state-of-the-art question answering system as the environment, rather than a traditional
document retrieval system.
One popular technique to boost a QA system is to use ensemble methods. This is
achieved simply by randomizing the predictive algorithm using several replicas of the
base system, trained independently.AQA is conceptually different. An ensemble keeps
the input fixed, and samples predictions by perturbing the model’s parameters. AQA,
instead, aims at perturbing the input, while keeping the model fixed.We see this line of
research as naturally related also to the problem of fact checking. E.g. Wu et al. [2017]
propose to perturb structured database queries for detecting fake news. An important
difference with respect to this work is that, in the AQA framework, what is perturbed
is not a structured query expressed in a formal language but surface natural language
forms, directly. As an example, while searching for an answer to a question like “Did
Michelangelo paint the Sistine chapel?”, one could also ask “When did Michelangelo paint
8
Most Jeopardy! questions coincide with entities with an associated Wikipedia article.
10
A
C
st
QR
ut
rt
qt+1
rt+1
QA
st+1
q0 A
Figure 2: The generalized agent-environment framework for active question answering.
the Sistine chapel?”, knowing that a response containing a date is likely to be evidence
for an affirmative answer. Similarly, other correlated information that can be collected
by asking semantically, or pragmatically, related questions might provides evidence that
can help modeling support for candidate answers.
6 Conclusion
We proposed a novel framework for question answering, inspired by interactive, iterative
information seeking processes. We call it active question answering (AQA), as it aims at
improving answering by systematically perturbing input questions. We investigated a
first system of this kind that has three components: a question reformulator, a black box
QA system, and a candidate answer scorer. The reformulator is a sequence-to-sequence
model, while the scorer is a convolutional neural network. Both components are trained
using both reinforcement learning and supervised learning. Experimental results prove
that the approach is highly effective. We improve a sophisticated Deep QA system by
11% absolute F1, 32% relative F1, on a difficult dataset of long, semantically complex,
questions.
We plan to further investigate the problem as an end-to-end reinforcement learning task.
Figure 2 depicts the generalized AQA agent-environment framework. One can see AQA
as a game. The game starts with a question q0 . Then, the agent (A) starts interacting
with the environment (QA) over discrete time steps, in episodes of arbitrary length. The
environment makes available a number of information sources; e.g., a question answering
system, a search engine etc. The environment is noisy and the output quality is not
always reliable or interpretable. The agent decision-making component (C)9 assesses
the answer at time t, contained in the state st , or any of the previously collected ones.
The agent can either decide that it is satisfied with one (A), and stop, or ask another
9
E.g., a neural network similar to the one used here for scoring.
11
question by using the next rewrite qt+1 suggested by the reformulator. Notice that the
reformulator and the control component can interact. The process repeats for a number
of steps. Each answer returned is assigned a score. The objective of the game is to
maximize the total score over a set of questions.
With respect to this work there are a few important differences. The decision making
process, including answer scoring and action type prediction (answer, reformulate), is
jointly trained with the reformulator. This can take the form of a critic or Q-function.
Additionally, the value function can pass internal states to the reformulator, to make
reformulations stateful. Also, in this formulation multi-step episodes can be performed.
Finally, the generalized framework will allow one to experiment with all recent advances
in deep reinforcement learning.
12
References
D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align
and translate. arXiv preprint arXiv:1409.0473, 2014.
D. Bahdanau, P. Brakel, K. Xu, A. Goyal, R. Lowe, J. Pineau, A. Courville, and Y. Bengio. An
actor-critic algorithm for sequence prediction. https://arxiv.org/abs/1607.07086, 2016.
C. Bannard and C. Callison-Burch. Paraphrasing with bilingual parallel corpora. In Proceedings
of the 43rd Annual Meeting on Association for Computational Linguistics, pages 597–604.
Association for Computational Linguistics, 2005.
D. Britz, A. Goldie, M.-T. Luong, and Q. Le. Massive exploration of neural machine translation
architectures. https://arxiv.org/pdf/1703.03906.pdf, 2017.
M. Dunn, L. Sagun, M. Higgins, U. Guney, V. Cirik, and K. Cho. SearchQA: A New Q&A
Dataset Augmented with Context from a Search Engine. https://arxiv.org/abs/1704.05179,
2017.
A. Fader, L. Zettlemoyer, and O. Etzioni. Paraphrase-Driven Learning for Open Question
Answering. In Proceedings of the 51st Annual Meeting of the Association for Computational
Linguistics, 2013.
J. Ganitkevitch, C. Callison-Burch, C. Napoles, and B. Van Durme. Learning sentential paraphrases from bilingual parallel corpora for text-to-text generation. In Proceedings of the
Conference on Empirical Methods in Natural Language Processing, pages 1168–1179. Association for Computational Linguistics, 2011.
J. Ganitkevitch, B. Van Durme, and C. Callison-Burch. Ppdb: The paraphrase database. In
Proceedings of the 2013 Conference of the North American Chapter of the Association for
Comutational Linguistics: Human Language Technologies, pages 758–764, 2013.
K. M. Hermann, T. Kočiský, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom.
Teaching machines to read and comprehend. In Proceedings of the 28th International Conference
on Neural Information Processing Systems, pages 1693–1701, 2015.
M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg, G. Corrado, M. Hughes, and J. Dean. Google’s Multilingual Neural Machine Translation
System: Enabling Zero-Shot Translation. https://arxiv.org/abs/1611.04558, 2016.
J. Li, W. Monroe, A. Ritter, M. Galley, J. Gao, and D. Jurafsky. Deep reinforcement learning for
dialogue generation. https://arxiv.org/pdf/1606.01541.pdf, 2017.
Z. Li, C. Callison-Burch, C. Dyer, J. Ganitkevitch, S. Khudanpur, L. Schwartz, W. N. Thornton,
J. Weese, and O. F. Zaidan. Joshua: An open source toolkit for parsing-based machine
translation. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages
135–139. Association for Computational Linguistics, 2009.
C. Liang, J. Berant, Q. Le, K. D. Forbus, and N. Lao. Neural symbolic machines: Learning
semantic parsers on freebase with weak supervision. In Proceedings of the 55th Annual Meeting
of the Association for Computational Linguistics, 2017.
N. Madnani and B. J. Dorr. Generating phrasal and sentential paraphrases: A survey of
data-driven methods. Computational Linguistics, 36(3):341–387, 2010.
13
J. Mallinson, R. Sennrich, and M. Lapata. Paraphrasing revisited with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for
Computational Linguistics.
G. Marchionini. Exploratory search: From finding to understanding. Commun. ACM, 49(4):
41–46, 2006.
K. Narasimhan, T. Kulkarni, and R. Barzilay. Language understanding for text-based games
using deep reinforcement learning. https://arxiv.org/abs/1506.08941, 2015.
R. Nogueira and K. Cho. End-to-end goal-driven web navigation. In Proceedings of the 30th
International Conference on Neural Information Processing Systems, 2016.
R. Nogueira and K. Cho. Task-oriented query reformulation with reinforcement learning.
https://arxiv.org/abs/1704.04572v1, 2017.
J. Pennington, R. Socher, and C. Manning. Glove: Global vectors for word representation. In
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing
(EMNLP), pages 1532–1543, 2014.
P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. SQuAD: 100,000+ Questions for Machine
Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in
Natural Language Processing, pages 2383–2392, 2016.
M. Ranzato, S. Chopra, M. Auli, and W. Zaremba. Sequence level training with recurrent neural
networks. https://arxiv.org/abs/1511.06732, 2015.
N. Roy and A. McCallum. Toward optimal active learning through monte carlo estimation of
error reduction. ICML, Williamstown, pages 441–448, 2001.
M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi. Bidirectional Attention Flow for Machine
Comprehension. In Proceedings of ICLR, 2017a.
M. Seo, S. Min, A. Farhadi, and H. Hajishirzi. Query-reduction networks for question answering.
In Proceedings of ICLR 2017, ICLR 2017, 2017b.
B. Settles. Active learning literature survey. University of Wisconsin, Madison, 52(55-66):11,
2010.
S. Shen, Y. Cheng, Z. He, W. He, H. Wu, M. Sun, and Y. Liu. Minimum risk training for
neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for
Computational Linguistics, pages 1683–1692, 2016.
I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In
Proceedings of the 27th International Conference on Neural Information Processing Systems,
pages 3104–3112, 2014.
R. S. Sutton and A. G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge,
MA, USA, 1st edition, 1998. ISBN 0262193981.
R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement
learning with function approximation. In Proceedings of the 12th International Conference on
Neural Information Processing Systems, NIPS’99, pages 1057–1063, 1999.
14
R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement
learning. Mach. Learn., 8(3-4):229–256, 1992.
R. J. Williams and J. Peng. Function optimization using connectionist reinforcement learning
algorithms. Connection Science, 3(3):241–268, 1991.
Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao,
K. Macherey, et al. Google’s neural machine translation system: Bridging the gap between
human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
Y. Wu, P. K. Agarwal, C. Li, J. Yang, and C. Yu. Computational fact checking through query
perturbations. ACM Trans. Database Syst., 42(1):4:1–4:41, 2017.
Y. Xia, D. He, T. Qin, L. Wang, N. Yu, T.-Y. Liu, and W.-Y. Ma. Dual learning for machine
translation. In Proceedings of the 30th International Conference on Neural Information
Processing Systems, 2016.
M. Ziemski, M. Junczys-Dowmunt, and B. Poliquen. The united nations parallel corpus v1.0. In
Proceedings of Language Resources and Evaluation (LREC), 2016.
15